
Newton Informed Neural Operator for Computing Multiple Solutions of Nonlinear Partial Differential Equations

Wenrui Hao
Department of Mathematics
The Pennsylvania State University, University Park,
State College, PA, USA
[email protected]
Xinliang Liu
Computer, Electrical and Mathematical Science and Engineering Division,
King Abdullah University of Science and Technology,
Thuwal, Saudi Arabia
[email protected]

Yahong Yang
Department of Mathematics
The Pennsylvania State University, University Park,
State College, PA, USA
[email protected]
Corresponding Authors
Abstract

Solving nonlinear partial differential equations (PDEs) with multiple solutions is essential in various fields, including physics, biology, and engineering. However, classical neural network methods for solving nonlinear PDEs, such as Physics-Informed Neural Networks (PINNs), the Deep Ritz method, and DeepONet, often face ill-posedness when the underlying problem admits multiple solutions. In this paper, we propose the Newton Informed Neural Operator, a novel approach that builds on existing neural network techniques to handle such nonlinearities. Our method combines neural operator learning with the classical Newton method, which is locally well-posed, and it learns multiple solutions in a single training process while requiring fewer supervised data points than existing neural network methods.

1 Introduction

Neural networks have been extensively applied to solve partial differential equations (PDEs) in various fields, including biology, physics, and materials science [17, 9]. While much of the existing work focuses on PDEs with a single solution, nonlinear PDEs with multiple solutions pose a significant challenge and are widely encountered in applications [1, 3, 27]. In this paper, we aim to solve the following nonlinear PDEs, which may admit multiple solutions:

\begin{cases}\mathcal{L}u(\bm{x})=f(u),&\bm{x}\in\Omega\\ u(\bm{x})=0,&\bm{x}\in\partial\Omega\end{cases} (1)

Here, $\Omega$ is the domain of the equation, $f(u)$ is a nonlinear function on $\mathbb{R}$, $u:\mathbb{R}^{d}\to\mathbb{R}$, and $\mathcal{L}$ is a second-order elliptic operator given by

\mathcal{L}u=-\sum_{i,j=1}^{d}a^{ij}(\bm{x})u_{x_{i}x_{j}}+\sum_{i=1}^{d}b^{i}(\bm{x})u_{x_{i}}+c(\bm{x})u,

for given coefficient functions $a^{ij},b^{i},c$ $(i,j=1,\ldots,d)$ with $\sum_{i,j=1}^{d}a^{ij}(\bm{x})\xi_{i}\xi_{j}\geq\lambda|\bm{\xi}|^{2}$ for a constant $\lambda>0$.

Various neural network methods have been developed to tackle partial differential equations (PDEs), including PINN [29], the Deep Ritz method [37], DeepONet [24], FNO [20], MgNO [11], and OL-DeepONet [22]. These methods can be broadly categorized into two types: function learning and operator learning approaches. In function learning, the goal is to directly learn the solution. However, these methods often encounter the limitation of only being able to learn one solution in each learning process. Furthermore, the problem becomes ill-posed when there are multiple solutions. On the other hand, operator learning aims to approximate the map between parameter functions in PDEs and the unique solution. This approach cannot address the issue of multiple solutions or find them in a single training session. We will discuss this in more detail in the next section.

In this paper, we present a novel neural network approach for solving nonlinear PDEs with multiple solutions. Our method is grounded in operator learning, allowing us to capture multiple solutions within a single training process and thus overcoming this limitation of function learning methods. Moreover, we enhance our network architecture by incorporating the traditional Newton method [31, 1], as discussed in the next section. This integration makes the operator learning problem well-defined, since the Newton method is locally well-posed and therefore yields a well-defined operator to learn. This approach addresses the challenges associated with directly applying operator networks to such problems. Additionally, we leverage Newton information during training, enabling our method to perform effectively even with limited supervised data points. We call our network the Newton Informed Neural Operator.

As mentioned earlier, our approach builds on the classical Newton method, which translates a nonlinear PDE into an iterative process that solves a linear equation at each iteration. One key advantage of our method is that, once the operator is effectively learned, there is no need to solve a linear equation in every iteration. This significantly reduces the time cost, especially in complex systems encountered in fields such as materials science, biology, and chemistry. Above all, our network, termed the Newton Informed Neural Operator, can efficiently handle nonlinear PDEs with multiple solutions. It tackles a well-posed problem, learns multiple solutions in a single learning process, requires fewer supervised data points than existing neural network methods, and saves time by eliminating the need to repeatedly assemble and solve the linearized systems as in traditional Newton methods.

The rest of the paper is organized as follows. In Section 2, we discuss nonlinear PDEs with multiple solutions and related work on solving PDEs with neural networks. In Section 3, we review the classical Newton method for solving PDEs and introduce the Newton Informed Neural Operator, which combines neural operators with the Newton method to address nonlinear PDEs with multiple solutions; we also analyze its approximation and generalization errors. Finally, in Section 4, we present numerical results of our neural networks for solving nonlinear PDEs.

2 Background and Related Works

2.1 Nonlinear PDEs with multiple solutions

Significant mathematical models depicting natural phenomena in biology, physics, and materials science are rooted in nonlinear partial differential equations (PDEs) [5]. These models, characterized by their inherent nonlinearity, present complex multi-solution challenges. Illustrative examples include string theory in physics, reaction-diffusion systems in chemistry, and pattern formation in biology [16, 10]. However, experimental techniques like synchrotronic and laser methods can only observe a subset of these multiple solutions. Thus, there is an urgent need to develop computational methods to unravel these nonlinear models, offering deeper insights into the underlying physics and biology [12]. Consequently, efficient numerical techniques for identifying these solutions are pivotal in understanding these intricate systems. Despite recent advancements in numerical methods for solving nonlinear PDEs, significant computational challenges persist for large-scale systems. Specifically, the computational costs of employing Newton and Newton-like approaches are often prohibitive for the large-scale systems encountered in real-world applications. In response to these challenges [15], we propose an operator learning approach based on Newton’s method to efficiently compute multiple solutions of nonlinear PDEs.

2.2 Related works

Indeed, there are numerous approaches to solving partial differential equations (PDEs) using neural networks. Broadly speaking, these methods can be categorized into two main types: function learning and operator learning.

In function learning, neural networks are used to directly approximate the solutions to PDEs. Function learning approaches aim to directly learn the solution function itself. On the other hand, in operator learning, the focus is on learning the operator that maps input parameters to the solution of the PDE. Instead of directly approximating the solution function, the neural network learns the underlying operator that governs the behavior of the system.

Function learning methods

In function learning, commonly employed methods for this problem include Physics-Informed Neural Network (PINN)-based approaches, introduced by Raissi et al. [29], and the Deep Ritz method [37]. However, for these methods the task becomes particularly challenging due to the ill-posed nature of the problem arising from multiple solutions. Despite employing various initial data and training methods, attaining high accuracy remains a complex endeavor. Even when a highly accurate solution is achieved, each learning process typically discovers only one solution. The specific solution learned by the neural network is heavily influenced by the initial conditions and training methods employed, and discerning the relationship between these factors and the learned solution remains a daunting task. In [14], the authors introduce a method called HomPINNs for learning PDEs with multiple solutions. In their approach, the number of solutions that can be learned depends on the availability of supervised data; for solutions with insufficient supervised data, whether HomPINNs can successfully learn them remains unclear.

Operator learning methods

Various approaches have been developed for operator learning to solve PDEs, including DeepONet [24], which integrates physical information [7, 22], as well as techniques like FNO [20] inspired by spectral methods, and MgNO [11], HANO [23], and WNO [21] based on multilevel methods, and transformer-based neural operators [4, 8]. These methods focus on approximating the operator between the parameters and the solutions. Firstly, they require the solutions of PDEs to be unique; otherwise, the operator is not well-defined. Secondly, they focus on the relationship between the parameter functions and the solution, rather than the initial data and multiple solutions.

3 Newton Informed Neural Operator

3.1 Review of Newton Methods to Solve Nonlinear Partial Differential Equations

To tackle problem (1), we employ Newton's method by linearizing the equation. Given an initial guess $u_{0}(\bm{x})$, for $i=1,2,\ldots,M$, the $i$-th iteration of Newton's method computes $\tilde{u}(\bm{x})=u+\delta u(\bm{x})$ by solving

\begin{cases}(\mathcal{L}-f^{\prime}(u))\delta u(\bm{x})=\mathcal{L}u-f(u),&\bm{x}\in\Omega\\ \delta u(\bm{x})=0,&\bm{x}\in\partial\Omega.\end{cases} (2)

Solving Eq. (2) and repeating this $M$ times yields one of the solutions of the nonlinear equation (1). Different initial data converge to different solutions of Eq. (1), and each linearized problem (2) is well-posed. However, applying the solver to Eq. (1) multiple times can be time-consuming, especially for high-dimensional structures or when the number of discrete points in the solver is large. In this paper, we employ neural networks to address these challenges.
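For concreteness, the following is a minimal Python sketch of this classical Newton loop for the convex problem of Section 4.2 ($\mathcal{L}=-\Delta$, $f(u)=-u^{2}+\sin 5\pi(x+y)$), discretized with a standard 5-point finite difference stencil; the grid size, tolerance, and use of SciPy's sparse solver are illustrative assumptions rather than the exact solver used in the experiments. The residual convention $F(u)=-\Delta u+u^{2}-g$ follows Appendix A.1.1.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

N = 63                                    # interior grid points per direction (assumption)
h = 1.0 / (N + 1)
x = np.linspace(h, 1 - h, N)
X, Y = np.meshgrid(x, x, indexing="ij")

# Discrete -Laplacian with homogeneous Dirichlet boundary conditions (5-point stencil)
I = sp.identity(N)
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(N, N))
A = (sp.kron(I, T) + sp.kron(T, I)) / h**2

g = np.sin(5 * np.pi * (X + Y)).ravel()   # source term from f(u) = -u^2 + sin 5*pi*(x+y)

def F(u):
    # Residual of the nonlinear system: -Laplace(u) + u^2 - g = 0
    return A @ u + u**2 - g

def newton(u0, max_iter=50, tol=1e-10):
    u = u0.copy()
    for _ in range(max_iter):
        J = A + sp.diags(2.0 * u)                 # Jacobian: -Laplace + 2 diag(u)
        delta = spla.spsolve(J.tocsc(), -F(u))    # one linearized solve per iteration (cf. Eq. (2))
        u = u + delta
        if np.linalg.norm(delta) < tol:
            break
    return u

# Different initial guesses may converge to different solutions of Eq. (1).
u_star = newton(np.zeros(N * N))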

3.2 Neural Operator Structures

In this section, we introduce the structure of the DeepONet [24, 18] used to approximate the local operator of the Newton method in Eq. (2), i.e., $\mathcal{G}(u):=u_{*}$, where $u_{*}$ is the solution of Eq. (2), which depends on $u$. If we can learn the operator $\mathcal{G}(u)$ well using the neural operator $\mathcal{O}(u;\bm{\theta})$, then for an initial function $u_{0}$, assuming the $n$-th iteration approximates one solution, i.e., $(\mathcal{G}+\text{Id})^{(n)}(u_{0})\approx u^{*}$, we have

(\mathcal{O}+\text{Id})^{(n)}(u_{0})\approx(\mathcal{G}+\text{Id})^{(n)}(u_{0})\approx u^{*}.

For new initial data, we can then evaluate our neural operator and find the corresponding solution directly.
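As a minimal sketch (not the released implementation), the iteration $(\mathcal{O}+\text{Id})^{(n)}(u_{0})$ can be realized by repeatedly adding the network's predicted Newton update to the current iterate; here neural_op is a hypothetical trained model mapping $u$ to $\delta u$ on a grid, and the stopping tolerance is an assumption.

import torch

def newton_operator_iterate(neural_op, u0, n_steps=10, tol=1e-6):
    # Apply (O + Id) repeatedly: u_{k+1} = u_k + O(u_k; theta)
    u = u0
    for _ in range(n_steps):
        with torch.no_grad():
            delta = neural_op(u)              # predicted Newton update delta_u
        u = u + delta
        if delta.norm() < tol * max(u.norm().item(), 1.0):
            break
    return u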

We now discuss how to train such an operator. To begin, we define the following shallow neural operator with $p$ neurons for operators from $\mathcal{X}$ to $\mathcal{Y}$:

\mathcal{O}(a;\bm{\theta})=\sum_{i=1}^{p}\mathcal{A}_{i}\sigma\left(\mathcal{W}_{i}a+\mathcal{B}_{i}\right)\quad\forall a\in\mathcal{X} (3)

where $\mathcal{W}_{i}\in\mathcal{L}(\mathcal{X},\mathcal{Y})$, $\mathcal{B}_{i}\in\mathcal{Y}$, $\mathcal{A}_{i}\in\mathcal{L}(\mathcal{Y},\mathcal{Y})$, and $\bm{\theta}$ denotes all the parameters in $\{\mathcal{W}_{i},\mathcal{A}_{i},\mathcal{B}_{i}\}_{i=1}^{p}$. Here, $\mathcal{L}(\mathcal{X},\mathcal{Y})$ denotes the set of all bounded (continuous) linear operators between $\mathcal{X}$ and $\mathcal{Y}$, and $\sigma:\mathbb{R}\to\mathbb{R}$ is the nonlinear point-wise activation.

In this paper, we use a shallow DeepONet to approximate the Newton operator. More precisely, in the shallow neural network, $\mathcal{W}_{i}$ represents an interpolation of operators. Under proper and reasonable assumptions, we can present the following theorem ensuring that DeepONet can effectively approximate the Newton operator; the proof is provided in the appendix. Furthermore, in MgNO, $\mathcal{W}_{i}$ is replaced by a multigrid operator [33], and in FNO it is a kernel operator; our analysis can be generalized to such cases.

Before stating the theorem, we establish some assumptions regarding the input space $\mathcal{X}\subset H^{2}(\Omega)$ of the operator and $f(u)$ in Eq. (1).

Assumption 1.

(i): For any $u\in\mathcal{X}$, we have that

f^{\prime}(u)\cap\{\text{eigenvalues of }\mathcal{L}\}=\emptyset.

(ii): There exists a constant $F$ such that $\|f\|_{W^{2,\infty}(\mathbb{R})}\leq F$.

(iii): All coefficients in $\mathcal{L}$ are $C^{1}$ and $\partial\Omega\in C^{2}$.

(iv): There exists a linear bounded operator $\mathcal{P}$ on $\mathcal{X}$, a small constant $\epsilon>0$, and a constant $n$ such that

\|u-\mathcal{P}u\|_{H^{2}(\Omega)}\leq\epsilon,\quad\text{for all }u\in\mathcal{X}.

Furthermore, $\mathcal{P}u$ is an $n$-dimensional term, i.e., it can be denoted by the $n$-dimensional vector $\bar{\mathcal{P}}u\in\mathbb{R}^{n}$.

Remark 1.

We want to emphasize the reasonableness of our assumptions. For condition (i), we are essentially restricting our approximation efforts to local regions. This limitation is necessary because attempting to approximate the neural operator across the entire domain could lead to issues, particularly in cases where multiple solutions exist. Consider a scenario where the input function $u$ lies between two distinct solutions. Even a small perturbation of $u$ could result in the system converging to a completely different solution. Condition (i) ensures that Equation (2) has a unique solution, allowing us to focus our approximation efforts within localized domains.

Conditions (ii) and (iii) serve to regularize the problem and ensure its tractability. These conditions are indeed straightforward to fulfill, contributing to the feasibility of the overall approach.

For the embedding operator $\mathcal{P}$ in (iv), there are many choices in DeepONet, such as finite element methods like the Argyris element [2] or the embedding methods in [19, 13]; we discuss this further in the appendix. The approximation order would be $h^{-(l-2)}$, and different methods may achieve different orders; we omit the detailed discussion in this paper. Furthermore, for different neural networks this embedding may differ; for example, we can use a Fourier expansion [35] or multigrid methods [11] to achieve this task.

Theorem 1.

Suppose $\mathcal{X}=\mathcal{Y}\subset H^{2}(\Omega)$ and Assumption 1 holds. Then, there exists a neural network $\mathcal{O}(u;\bm{\theta})\in\Xi_{p}$ defined as

\Xi_{p}:=\left\{\sum_{i=1}^{p}\bm{A}_{i}\sigma\left(\mathcal{W}_{i}u+\bm{b}_{i}\right)\sigma\left(\bm{w}_{i}\cdot\bm{x}+\zeta_{i}\right)\,\middle|\,\mathcal{W}_{i}\in\mathcal{L}(\mathcal{X},\mathbb{R}^{m}),\bm{b}_{i}\in\mathbb{R}^{m},\bm{A}_{i}\in\mathcal{L}(\mathbb{R}^{m},\mathbb{R})\right\} (4)

such that

\left\|\sum_{i=1}^{p}\bm{A}_{i}\sigma\left(\mathcal{W}_{i}u+\bm{b}_{i}\right)\sigma\left(\bm{w}_{i}\cdot\bm{x}+\zeta_{i}\right)-\mathcal{G}(u)\right\|_{L^{2}(\Omega)}\leq C_{1}m^{-\frac{1}{n}}+C_{2}(\epsilon+p^{-\frac{2}{d}}), (5)

where $\sigma$ is a smooth non-polynomial activation function, $C_{1}$ is a constant independent of $m$, $\epsilon$, and $p$, $C_{2}$ is a constant depending on $p$, and $n$ is the dimension associated with $\mathcal{P}$ in Assumption 1; $\epsilon$ depends on $n$, and both are defined in Assumption 1.

3.3 Loss Functions of Newton Informed Neural Operator

3.3.1 Mean Square Error

The Mean Square Error loss function is defined as:

\mathcal{E}_{S}(\bm{\theta}):=\frac{1}{M_{u}\cdot M_{x}}\sum_{j=1}^{M_{u}}\sum_{k=1}^{M_{x}}\left|\mathcal{G}\left(u_{j}\right)\left(\bm{x}_{k}\right)-\mathcal{O}\left(u_{j};\bm{\theta}\right)\left(\bm{x}_{k}\right)\right|^{2} (6)

where $u_{1},u_{2},\ldots,u_{M_{u}}\sim\mu$ are independently and identically distributed (i.i.d.) samples in $\mathcal{X}$, and $\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{M_{x}}$ are uniformly i.i.d. samples in $\Omega$.
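A minimal sketch of evaluating this supervised loss when the sampled functions and the target Newton updates are given on a fixed grid (so the points $\bm{x}_{k}$ are simply the grid nodes); the tensor shapes and the model interface are illustrative assumptions.

import torch

def mse_operator_loss(model, u_batch, target_batch):
    # u_batch, target_batch: tensors of shape (M_u, N, N) on a common grid
    pred = model(u_batch)                       # O(u_j; theta) evaluated at the grid points x_k
    return ((pred - target_batch) ** 2).mean()  # average over j = 1..M_u and k = 1..M_x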

However, using only the Mean Squared Error loss function is not sufficient for learning the Newton method, especially since in most cases we do not have enough data $\{u_{j},\mathcal{G}(u_{j})\}_{j=1}^{M_{u}}$. Furthermore, there are situations where we do not know how many solutions the nonlinear equation (1) has. If the data is sparse around one of the solutions, it becomes impossible to effectively learn the Newton method around that solution.

Note that $\mathcal{E}_{S}(\bm{\theta})$ can be viewed as the finite-data version of $\mathcal{E}_{Sc}(\bm{\theta})$, where

\mathcal{E}_{Sc}(\bm{\theta})=\lim_{M_{u},M_{x}\to\infty}\mathcal{E}_{S}(\bm{\theta}).

The smallness of $\mathcal{E}_{Sc}$ can be inferred from Theorem 1. To understand the gap between $\mathcal{E}_{Sc}(\bm{\theta})$ and $\mathcal{E}_{S}(\bm{\theta})$, we can rely on the following theorem. Before stating it, we need some assumptions about the data in $\mathcal{E}_{S}(\bm{\theta})$:

Assumption 2.

(i) Boundedness: For any neural network with bounded parameters, characterized by a bound $B$ and dimension $d_{\bm{\theta}}$, there exists a function $\Psi:L^{2}(\Omega)\rightarrow[0,\infty)$ such that

|\mathcal{G}(u)(\bm{x})|\leqslant\Psi(u),\quad\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|\mathcal{N}(u;\bm{\theta})(\bm{x})\right|\leqslant\Psi(u),

for all $u\in\mathcal{X},\bm{x}\in\Omega$, and there exist constants $C,\kappa>0$ such that

\Psi(u)\leqslant C\left(1+\|u\|_{H^{2}}\right)^{\kappa}. (7)

(ii) Lipschitz continuity: There exists a function $\Phi:L^{2}(\Omega)\rightarrow[0,\infty)$ such that

\left|\mathcal{N}(u;\bm{\theta})(\bm{x})-\mathcal{N}(u;\bm{\theta}^{\prime})(\bm{x})\right|\leqslant\Phi(u)\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{\ell^{\infty}} (8)

for all $u\in\mathcal{X},\bm{x}\in\Omega$, and

\Phi(u)\leqslant C\left(1+\|u\|_{H^{2}(\Omega)}\right)^{\kappa},

for the same constants $C,\kappa>0$ as in Eq. (7).

(iii) Finite measure: There exists $\alpha>0$ such that

\int_{H^{2}(\Omega)}e^{\alpha\|u\|_{H^{2}(\Omega)}^{2}}\mathrm{d}\mu(u)<\infty.
Theorem 2.

If Assumption 2 holds, then the generalization error is bounded by

|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leqslant C\left[\frac{1}{\sqrt{M_{u}}}\left(1+Cd_{\bm{\theta}}\log(CB\sqrt{M_{u}})^{2\kappa+1/2}\right)+\frac{d_{\bm{\theta}}\log M_{x}}{M_{x}}\right],

where $C$ is a constant independent of $B$, $d_{\bm{\theta}}$, $M_{x}$, and $M_{u}$. The parameter $\kappa$ is specified in (7). Here, $B$ represents the bound of the parameters and $d_{\bm{\theta}}$ is the number of parameters.

The proof of Theorem 2 is presented in Appendix B.3.

Remark 2.

Assumption 2 is easy to satisfy if we consider $\mathcal{X}$ to be a local set of functions around the solution, which is typically the case in Newton methods. This aligns with our approach and working region in the approximation part (see Remark 1).

3.3.2 Newton Loss

As we have mentioned, relying solely on the MSE loss function can require a significant amount of data to achieve the task. However, obtaining enough data can be challenging, especially when the equation is complex and the dimension of the input space is large. Hence, we consider another loss function to aid learning, namely a physics-informed loss function [29, 7, 14, 21], referred to here as the Newton loss function.

The Newton loss function is defined as:

\mathcal{E}_{N}(\bm{\theta}):=\frac{1}{N_{u}\cdot N_{x}}\sum_{j=1}^{N_{u}}\sum_{k=1}^{N_{x}}\left|(\mathcal{L}-f^{\prime}(u_{j}\left(\bm{x}_{k}\right)))\mathcal{O}\left(u_{j};\bm{\theta}\right)\left(\bm{x}_{k}\right)-\mathcal{L}u_{j}\left(\bm{x}_{k}\right)-f(u_{j}\left(\bm{x}_{k}\right))\right|^{2} (9)

where $u_{1},u_{2},\ldots,u_{N_{u}}\sim\nu$ are independently and identically distributed (i.i.d.) samples in $\mathcal{X}$, and $\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{N_{x}}$ are uniformly i.i.d. samples in $\Omega$.
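A minimal sketch of this loss for the convex problem of Section 4.2, assuming the residual convention $F(u)=-\Delta u+u^{2}-g$, $J(u)=-\Delta+2\,\mathrm{diag}(u)$ of Appendix A.1.1, with the Laplacian applied via the convolutional stencil of Appendix A.2 and zero (Dirichlet) padding; the exact sign convention and discretization in the released code may differ.

import torch
import torch.nn.functional as F

STENCIL = torch.tensor([[0., -1., 0.],
                        [-1., 4., -1.],
                        [0., -1., 0.]]).view(1, 1, 3, 3)

def neg_laplacian(v, h):
    # Discrete -Laplacian of v with zero-Dirichlet boundary conditions (zero padding)
    return F.conv2d(v.unsqueeze(1), STENCIL, padding=1).squeeze(1) / h**2

def newton_loss(model, u_batch, g, h):
    # u_batch: (N_u, N, N) unsupervised inputs; g: source term on the same grid
    delta = model(u_batch)                                       # O(u_j; theta), approximate Newton update
    J_delta = neg_laplacian(delta, h) + 2.0 * u_batch * delta    # J(u) delta = (-Laplace + 2u) delta
    F_u = neg_laplacian(u_batch, h) + u_batch**2 - g             # F(u) = -Laplace(u) + u^2 - g
    return ((J_delta + F_u) ** 2).mean()                         # residual of J(u) delta = -F(u)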

Note that $\mathcal{E}_{N}(\bm{\theta})$ can be viewed as the finite-data version of $\mathcal{E}_{Nc}(\bm{\theta})$, where

\mathcal{E}_{Nc}(\bm{\theta})=\lim_{N_{u},N_{x}\to\infty}\mathcal{E}_{N}(\bm{\theta}).

To understand the gap between $\mathcal{E}_{Nc}(\bm{\theta})$ and $\mathcal{E}_{N}(\bm{\theta})$, we can rely on the following theorem:

Corollary 1.

If Assumption 2 holds, then the generalization error is bounded by

|\mathbf{E}(\mathcal{E}_{N}(\bm{\theta})-\mathcal{E}_{Nc}(\bm{\theta}))|\leqslant C\left(\frac{d_{\bm{\theta}}\log N_{u}}{\sqrt{N_{u}}}+\frac{d_{\bm{\theta}}\log N_{x}}{N_{x}}\right),

where $C$ is a constant independent of $B$, $d_{\bm{\theta}}$, $N_{x}$, and $N_{u}$. The parameter $\kappa$ is specified in (7). Here, $B$ represents the bound of the parameters and $d_{\bm{\theta}}$ is the number of parameters.

The proof of Corollary 1 is similar to that of Theorem 2; therefore, it will be omitted from the paper.

Remark 3.

If we only utilize $\mathcal{E}_{S}(\bm{\theta})$ as our loss function, then as demonstrated in Theorem 2, we require both $M_{u}$ and $M_{x}$ to be large, which poses a significant challenge when dealing with complex nonlinear equations, since obtaining sufficient data becomes a critical issue. In this paper, we integrate Newton information into the loss function, defining it as follows:

\mathcal{E}(\bm{\theta}):=\lambda\mathcal{E}_{S}(\bm{\theta})+\mathcal{E}_{N}(\bm{\theta}), (10)

where $\mathcal{E}_{N}(\bm{\theta})$ is the loss on the data used for unsupervised learning. If we lack sufficient data for $\mathcal{E}_{S}(\bm{\theta})$, we can adjust the parameters by selecting a small $\lambda$ and increasing $N_{x}$ and $N_{u}$. This strategy enables effective learning even when data for $\mathcal{E}_{S}(\bm{\theta})$ is limited.

In the following experiments, we use the neural operator established in Eq. (3) and the loss function in Eq. (10) to learn one step of the Newton method locally, i.e., the map between the input $u$ and the solution $\delta u$ in Eq. (2). If we have a large dataset, we choose a large $\lambda$ in $\mathcal{E}(\bm{\theta})$ (10); if we have a small dataset, we use a small $\lambda$ to control the generalization error of the operator. After learning one step of the Newton method with the operator network, we can easily and quickly obtain a solution from an initial guess for the nonlinear PDE (1) and find new solutions not present in the datasets.
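A minimal sketch of how the two terms can be combined in training, reusing the mse_operator_loss and newton_loss sketches above; the weighting and batch handling are illustrative assumptions.

def total_loss(model, sup_u, sup_target, unsup_u, g, h, lam=0.01):
    # lambda * E_S(theta) on the small supervised set + E_N(theta) on the larger unsupervised set, as in Eq. (10)
    return lam * mse_operator_loss(model, sup_u, sup_target) + newton_loss(model, unsup_u, g, h)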

4 Experiments

4.1 Experimental Settings

We introduce two distinct training methodologies. The first approach employs exclusively supervised data, using the Mean Squared Error loss (6) as the optimization criterion. The second method combines supervised and unsupervised learning, using the hybrid loss function (10) that integrates the Mean Squared Error loss (6) for the small proportion of data with ground truth (the supervised training dataset) and the Newton loss (9) for the large proportion of data without ground truth (the unsupervised training dataset). We refer to these two methods as Method 1 and Method 2. The approaches are designed to simulate a practical scenario with limited data availability, facilitating a comparison between these training strategies to evaluate their efficacy in small supervised data regimes. We choose the same configuration of the neural operator (DeepONet) in both cases, which aligns with our theoretical analysis. Detailed experimental settings and the datasets for each example below can be found in Appendix A.

4.2 Case 1: Convex problem

We consider the 2D convex problem $\mathcal{L}(u)-f(u)=0$, where $\mathcal{L}(u):=-\Delta u$, $f(u):=-u^{2}+\sin 5\pi(x+y)$, and $u=0$ on $\partial\Omega$. We investigate the training dynamics and testing performance of a neural operator (DeepONet) trained with different loss functions and dataset sizes, focusing on the Mean Squared Error (MSE) and Newton's loss functions.

(a) The training and testing errors using Method 1
(b) Comparison of models trained using Method 1 and Method 2, the latter employing the regularized loss function defined in Eq. (10) with $\lambda=0.01$.
Figure 1: Training and testing performance of DeepONet under various conditions. For the detailed experimental settings, please refer to Appendix A.

MSE Loss Training (Fig. 1(a)): In Method 1, effective training is observed, but generalization is poor, suggesting that the MSE loss alone is not enough. Performance Comparison (Fig. 1(b)): The model trained with Newton's loss (Method 2) exhibits superior performance in both the $L_{2}$ and $H_{1}$ error metrics, highlighting its enhanced generalization accuracy. This study shows the advantage of using Newton's loss for training DeepONet models, which requires fewer supervised data points than Method 1.

4.3 Case 2: Non-convex problem with multiple solutions

We consider a 2D non-convex problem,

\begin{cases}-\Delta u(x,y)-u^{2}(x,y)=-s\sin(\pi x)\sin(\pi y)\quad\text{in}\quad\Omega,\\ u(x,y)=0,\quad\text{on}\quad\partial\Omega\end{cases} (11)

where $\Omega=(0,1)\times(0,1)$ [3]. In this case, $\mathcal{L}(u):=-\Delta u-u^{2}$ is non-convex, and the problem has multiple solutions (see Figure 2).

(a) Solutions of the 2D non-convex problem (11)
(b) Method 1 vs. Method 2
Figure 2: Solutions of the 2D non-convex problem (11)

In the experiment, one of the multiple ground-truth solutions is rarely represented in the supervised training dataset, so the neural operator trained via Method 1 saturates in test error because it relies on the ground truth to recover all the patterns in the multiple-solution case (as shown by the curves in Figure 2). On the other hand, the model trained via Method 2 is less affected by the limited supervised data thanks to the use of the Newton loss. One can refer to Appendix A for the detailed experimental settings.

Efficiency

We utilize this problem as a case study to demonstrate the superior efficiency of our neural operator-based method as a surrogate for Newton's method. By benchmarking both approaches, we highlight the significant performance advantages of our neural operator model. Performance was assessed in terms of total execution time, which includes the setup of matrices and vectors, computation on the GPU, and synchronization of CUDA streams. Both methods leverage the parallel processing capabilities of the GPU. Specifically, the Newton solver explicitly uses 10 streams and CuPy with CUDA to parallelize the computation and fully utilize the GPU, aiming to optimize execution efficiency. In contrast, the neural operator method is inherently parallelized, taking full advantage of the GPU architecture without the explicit use of multiple streams, as indicated in the table. The computational times of both methods were evaluated under a common hardware configuration. One can find a detailed description of the experiments in Appendix A.5.

Parameter                                           Newton's Method   Neural Operator
Number of Streams                                   10                -
Data Type                                           float32           float32
Execution Time for 500 linear Newton systems (s)    31.52             1.1E-4
Execution Time for 5000 linear Newton systems (s)   321.15            1.4E-4
Table 1: Benchmarking the efficiency of the neural operator-based method

The data presented in the table illustrate that the neural operator is significantly more efficient than the classical Newton solver on the GPU. This efficiency gain is likely due to the neural operator's ability to leverage the parallel processing capabilities of GPUs more effectively than the Newton solver, even though the Newton solver is also parallelized. This enhancement is crucial for efficiently computing the vast number of unknown patterns in complex nonlinear systems such as the Gray-Scott model.

4.4 Gray-Scott Model

The Gray-Scott model [27] describes the reaction and diffusion of two chemical species, $A$ and $S$, governed by the following equations:

\frac{\partial A}{\partial t}=D_{A}\Delta A-SA^{2}+(\mu+\rho)A,
\frac{\partial S}{\partial t}=D_{S}\Delta S+SA^{2}-\rho(1-S),

where $D_{A}$ and $D_{S}$ are the diffusion coefficients, and $\mu$ and $\rho$ are rate constants.

Newton's Method for Steady-State Solutions

Newton's method is employed to find steady-state solutions ($\frac{\partial A}{\partial t}=0$ and $\frac{\partial S}{\partial t}=0$) by solving the nonlinear system:

0=D_{A}\Delta A-SA^{2}+(\mu+\rho)A, (12)
0=D_{S}\Delta S+SA^{2}-\rho(1-S).

The Gray-Scott model is highly sensitive to initial conditions: even minor perturbations can lead to vastly different emergent patterns. Please refer to Figure 4 for some examples of the patterns. This sensitivity reflects the model's complex, non-linear dynamics, which can evolve into a multitude of possible steady states depending on the initial setup. Consequently, training a neural operator to map initial conditions directly to their respective steady states presents significant challenges. Such a model must learn from a vast function space, capturing the underlying dynamics that dictate the transition from any given initial state to its final pattern. This complexity and diversity of potential outcomes is the inherent difficulty in training neural operators effectively for systems as complex as the Gray-Scott model. One can refer to Appendix A.1.2 for a detailed discussion of the Gray-Scott model. We employ a neural operator as a substitute for the Newton solver in the Gray-Scott model, recurrently mapping the initial state to the steady state.

(a) An example demonstrating how the neural operator maps the initial state to the steady state in a step-by-step manner
(b) Average convergence rate of $\|u-u_{i}\|$ across the test dataset
(c) Training via Method 2
Figure 3: The convergence behavior of the neural operator-based solver.

In subfigure (a), we use a ring-like pattern as the initial state to test our learned neural operator. This particular pattern does not appear in the supervised training dataset and lacks corresponding ground-truth data. Despite this, our neural operator, trained using the Newton loss, is able to map the initial state to its correct steady state effectively. In subfigure (b), we further test our neural operator, utilizing it as a surrogate for Newton's method to address nonlinear problems with initial states drawn from the test dataset. The curve shows the average convergence rate of $\|u-u_{i}\|$ across the test dataset, where $u_{i}$ represents the prediction at the $i$-th step of the neural operator. In subfigure (c), both the training loss curves and the test error curves are displayed, illustrating their progression throughout training.

5 Conclusion

In this paper, we consider using neural operators to solve nonlinear PDEs (Eq. (1)) with multiple solutions. To make the problem well-posed and to learn the multiple solutions in one learning process, we combine neural operator learning with the classical Newton method, resulting in the Newton Informed Neural Operator. We provide a theoretical analysis of our neural operator, demonstrating that it can effectively learn the Newton operator, reduce the amount of required supervised data, and learn solutions not present in the supervised data thanks to the addition of the Newton loss (9) to the loss function. Our experiments are consistent with our theoretical analysis, showcasing the aforementioned advantages of our network: it requires fewer supervised data points, learns solutions not present in the supervised data, and costs less time than classical Newton methods to solve the problem.

Reproducibility Statement

Code Availability: The code used in our experiments can be accessed via https://github.com/xll2024/newton/tree/main and also in the supplementary material. Datasets can be downloaded via the URLs in the repository. This encompasses all scripts, functions, and auxiliary files necessary to reproduce our results. Configuration Transparency: All configurations, including hyperparameters, model architectures, and optimization settings, are explicitly provided in the Appendix.

Acknowledgments and Disclosure of Funding

Y.Y. and W.H. were supported by the National Institute of General Medical Sciences through grant 1R35GM146894. The work of X.L. was partially supported by the KAUST Baseline Research Fund.

References

  • [1] H. Amann and P. Hess. A multiplicity result for a class of elliptic boundary value problems. Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 84(1-2):145–151, 1979.
  • [2] S. Brenner. The mathematical theory of finite element methods. Springer, 2008.
  • [3] B. Breuer, P. McKenna, and M. Plum. Multiple solutions for a semilinear boundary value problem: a computational multiplicity proof. Journal of Differential Equations, 195(1):243–269, 2003.
  • [4] Shuhao Cao. Choose a transformer: Fourier or galerkin. Advances in Neural Information Processing Systems, 34, 2021.
  • [5] Mark C Cross and Pierre C Hohenberg. Pattern formation outside of equilibrium. Reviews of modern physics, 65(3):851, 1993.
  • [6] L. Evans. Partial differential equations, volume 19. American Mathematical Society, 2022.
  • [7] Somdatta Goswami, Aniruddha Bora, Yue Yu, and George Em Karniadakis. Physics-informed neural operators. arXiv preprint arXiv:2207.05748, 2022.
  • [8] Ruchi Guo, Shuhao Cao, and Long Chen. Transformer meets boundary value inverse problems. In The Eleventh International Conference on Learning Representations, 2022.
  • [9] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • [10] W. Hao and C. Xue. Spatial pattern formation in reaction–diffusion models: a computational approach. Journal of mathematical biology, 80:521–543, 2020.
  • [11] J. He, X. Liu, and J. Xu. Mgno: Efficient parameterization of linear operators via multigrid. In The Twelfth International Conference on Learning Representations, 2023.
  • [12] Rebecca B Hoyle. Pattern formation: an introduction to methods. Cambridge University Press, 2006.
  • [13] J. Hu and S. Zhang. The minimal conforming Hk{H}^{k} finite element spaces on Rn{R}^{n} rectangular grids. Mathematics of Computation, 84(292):563–579, 2015.
  • [14] Y. Huang, W. Hao, and G. Lin. Hompinns: Homotopy physics-informed neural networks for learning multiple solutions of nonlinear elliptic differential equations. Computers & Mathematics with Applications, 121:62–73, 2022.
  • [15] Carl T Kelley. Solving nonlinear equations with Newton’s method. SIAM, 2003.
  • [16] Shigeru Kondo and Takashi Miura. Reaction-diffusion model as a framework for understanding biological pattern formation. science, 329(5999):1616–1620, 2010.
  • [17] I. Lagaris, A. Likas, and D. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
  • [18] Samuel Lanthaler, Siddhartha Mishra, and George E Karniadakis. Error estimates for deeponets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1):tnac001, 2022.
  • [19] Z. Li and N. Yan. New error estimates of bi-cubic hermite finite element methods for biharmonic equations. Journal of computational and applied mathematics, 142(2):251–285, 2002.
  • [20] Zongyi Li, Nikola Borislavov Kovachki, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier Neural Operator for Parametric Partial Differential Equations. In International Conference on Learning Representations, 2020.
  • [21] Zongyi Li, Hongkai Zheng, Nikola B Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kamyar Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning partial differential equations. CoRR, abs/2111.03794, 2021.
  • [22] B. Lin, Z. Mao, Z. Wang, and G. Karniadakis. Operator learning enhanced physics-informed neural networks for solving partial differential equations characterized by sharp solutions. arXiv preprint arXiv:2310.19590, 2023.
  • [23] Xinliang Liu, Bo Xu, Shuhao Cao, and Lei Zhang. Mitigating spectral bias for the multiscale operator learning. Journal of Computational Physics, 506:112944, 2024.
  • [24] Lu Lu, Pengzhan Jin, and George Em Karniadakis. Deeponet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2019.
  • [25] Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with practical extensions) based on fair data. Computer Methods in Applied Mechanics and Engineering, 393:114778, 2022.
  • [26] H. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.
  • [27] J. Pearson. Complex patterns in a simple system. Science, 261(5118):189–192, 1993.
  • [28] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.
  • [29] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
  • [30] A. Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. Annals of statistics, 48(4):1875–1897, 2020.
  • [31] M. Ulbrich. Semismooth newton methods for operator equations in function spaces. SIAM Journal on Optimization, 13(3):805–841, 2002.
  • [32] T. Welti. High-dimensional stochastic approximation: algorithms and convergence rates. PhD thesis, ETH Zurich, 2020.
  • [33] J. Xu. Two-grid discretization techniques for linear and nonlinear pdes. SIAM journal on numerical analysis, 33(5):1759–1777, 1996.
  • [34] Y. Yang and J. He. Deeper or wider: A perspective from optimal generalization error with sobolev loss. International Conference on Machine Learning, 2024.
  • [35] Y. Yang and Y. Xiang. Approximation of functionals by neural network without curse of dimensionality. arXiv preprint arXiv:2205.14421, 2022.
  • [36] Y. Yang, H. Yang, and Y. Xiang. Nearly optimal VC-dimension and pseudo-dimension bounds for deep neural network derivatives. NeurIPS 2023, 2023.
  • [37] B. Yu and W. E. The deep ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.

Appendix A Experimental settings

A.1 Background on the PDEs and generation of datasets

A.1.1 Case 1: convex problem

Function and Jacobian

The function $F(u)$ might typically be defined as:

F(u)=-\Delta u+u^{2}

The Jacobian $J(u)$ for the given function $F(u)$ involves the derivative of $F$ with respect to $u$, which includes the Laplace operator and the derivative of the nonlinear term:

J(u)=-\Delta+2\cdot u.

The dataset is generated by sampling the initial state $u_{0}\sim\mathcal{N}(0,\Delta^{-3})$ and then computing the convergent sequence $\{u_{0},u_{1},\ldots,u_{n}\}$ by Newton's method. Each convergent sequence $\{u_{0},u_{1},\ldots,u_{n}\}$ is one data entry in the dataset.
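A minimal sketch of drawing such initial states on the unit square, interpreting $\mathcal{N}(0,\Delta^{-3})$ as a Gaussian random field with covariance $(-\Delta)^{-3}$ under Dirichlet boundary conditions, sampled in the sine (DST) eigenbasis of the Laplacian; the normalization and grid size are illustrative assumptions.

import numpy as np
from scipy.fft import idstn

def sample_initial_state(N=63, power=3, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, N + 1)
    lam = (np.pi * k) ** 2                       # 1D Dirichlet Laplacian eigenvalues
    lam2d = lam[:, None] + lam[None, :]          # eigenvalues of -Laplacian on the unit square
    # spectral coefficients with variance lam^{-power}, i.e. covariance (-Laplacian)^{-power}
    coeffs = rng.standard_normal((N, N)) * lam2d ** (-power / 2.0)
    return idstn(coeffs, type=1, norm="ortho")   # back to grid values via the sine transform

u0 = sample_initial_state()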

The function and Jacobian for the non-convex problem (Case 2) are analogous to the convex problem, except that its Jacobian is $J(u)=-\Delta-2u$, so Newton's system is not positive definite.

A.1.2 Gray Scott model

Jacobian Matrix

The Jacobian matrix $J$ of the system is crucial for applying Newton's method:

J=\begin{bmatrix}J_{AA}&J_{AS}\\ J_{SA}&J_{SS}\end{bmatrix}

with components:

J_{AA}=-D_{A}\Delta+\text{diag}(-2SA+\mu+\rho),
J_{AS}=\text{diag}(-A^{2}),
J_{SA}=\text{diag}(2SA),
J_{SS}=-D_{S}\Delta+\text{diag}(A^{2}+\rho).

The numerical simulation of the Gray-Scott model was configured with the following parameters:

  • Grid Size: The simulation grid is square with $N=63$ points on each side, leading to a total of $N^{2}$ grid points. This resolution was chosen to balance computational efficiency with spatial resolution sufficient to capture detailed patterns. The spacing between grid points, $h$, is computed as $h=\frac{1.0}{N-1}$. This ensures that the domain is normalized to a unit square, which simplifies the analysis and scaling of diffusion rates.

  • Diffusion Coefficients: The diffusion coefficients for species $A$ and $S$ are set to $D_{A}=2.5\times 10^{-4}$ and $D_{S}=5.0\times 10^{-4}$, respectively. These values determine the rate at which each species diffuses through the spatial domain.

  • Reaction Rates: The reaction rate $\mu$ and feed rate $\rho$ are crucial parameters that govern the dynamics of the system. For this simulation, $\mu$ is set to 0.065 and $\rho$ to 0.04, influencing the production and removal rates of the chemical species.

Simulations

The simulation utilizes a finite difference method for spatial discretization and Newton's method to solve for the steady state given the initial state. The algorithm is detailed in Algorithm 1.

Figure 4: Examples of steady states of the Gray-Scott model
Figure 5: Three examples depicting the evolution from the initial state to the steady state via Newton's method.

A.2 Implementations of loss functions

Discrete Newton’s Loss

In solving partial differential equations (PDEs) numerically on a regular grid, the Laplace operator and other differential terms can be efficiently computed using convolution. Here, we detail the method for calculating $J(u)\delta u-F(u)$, where $J(u)$ is the Jacobian matrix, $\delta u$ is the Newton step, and $F(u)$ is the function defining the PDE.

Discretization

Consider a discrete representation of a function $u$ on an $N\times N$ grid. The function $u$ and its perturbation $\delta u$ are represented as matrices:

u,\delta u\in\mathbb{R}^{N\times N}

The function $F(u)$, which involves both linear and nonlinear terms, is similarly represented as $F(u)\in\mathbb{R}^{N\times N}$.

Laplace Operator

To represent $J(u)$ acting on the $N\times N$ grid function $u$, the discretized Laplace operator obtained from a finite difference method can be expressed as a convolution:

-\Delta u=\begin{bmatrix}0&-1&0\\ -1&4&-1\\ 0&-1&0\end{bmatrix}\ast u

This convolution computes the result of the Laplace operator applied to the grid function $u$. Boundary conditions can be incorporated into the convolution through the padding mode: a Dirichlet boundary condition corresponds to zero padding, while a Neumann boundary condition corresponds to replicate padding.
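A minimal PyTorch sketch of this convolutional Laplacian; the interface is an illustrative assumption, but the stencil and padding modes follow the description above (zero padding for Dirichlet, replicate padding for Neumann).

import torch
import torch.nn.functional as F

KERNEL = torch.tensor([[0., -1., 0.],
                       [-1., 4., -1.],
                       [0., -1., 0.]]).view(1, 1, 3, 3)

def neg_laplace(u, h, bc="dirichlet"):
    # u: tensor of shape (batch, N, N); returns the discrete -Laplacian of u scaled by 1/h^2
    v = u.unsqueeze(1)
    if bc == "dirichlet":
        v = F.pad(v, (1, 1, 1, 1), mode="constant", value=0.0)   # zero padding
    else:
        v = F.pad(v, (1, 1, 1, 1), mode="replicate")             # Neumann: replicate padding
    return F.conv2d(v, KERNEL).squeeze(1) / h**2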

A.3 Architecture of DeepONet

A variant of DeepONet is used in our Newton informed neural operator. In the DeepONet, we introduce a hybrid architecture that combines convolutional layers with a static trunk basis, optimized for grid-based data inputs common in computational applications like computational biology and materials science.

Branch Network

The branch network is designed to effectively downsample and process the spatial features through a series of convolutional layers:

  • A Conv2D layer with 128 filters (7x7, stride 2) initiates the feature extraction, reducing the input dimensionality while capturing coarse spatial features.

  • This is followed by additional Conv2D layers (128 filters, 5x5 kernel, stride 2 and subsequently 3x3 with padding, 1x1) which further refine and compact the feature representation.

  • The convolutional output is flattened and processed through two fully connected layers (256 units then down to branch features), using GELU activation.

Trunk Network

The trunk utilizes a static basis represented by the tensor V, incorporated as a non-trainable component: The tensor V is precomputed, using Proper Orthogonal Decomposition (POD) as in [25], and is dimensionally compatible with the output of the branch network.

Forward Pass

During the forward computation, the branch network outputs are projected onto the trunk’s static basis via matrix multiplication, resulting in a feature matrix that is reshaped into the grid dimensionality for output.
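A minimal sketch of this forward pass: a small convolutional branch network produces coefficients that are multiplied with a fixed POD trunk basis V and reshaped back to the grid. The layer sizes here are illustrative and do not match the exact hyperparameters in Table 2.

import torch
import torch.nn as nn

class PODDeepONet(nn.Module):
    def __init__(self, V):
        # V: precomputed, non-trainable POD trunk basis of shape (n_basis, N*N)
        super().__init__()
        self.register_buffer("V", V)
        self.branch = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.GELU(),
            nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.GELU(),
            nn.Linear(256, V.shape[0]),
        )

    def forward(self, u):
        # u: (batch, N, N) -> branch coefficients (batch, n_basis)
        coeffs = self.branch(u.unsqueeze(1))
        out = coeffs @ self.V                   # project onto the static trunk basis
        n = int(self.V.shape[1] ** 0.5)
        return out.view(-1, n, n)               # reshape to grid dimensionality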

Hyperparameters

Table 2 below summarizes the key hyperparameters used in the DeepONet architecture:

Parameter                       Value
Number of Conv2D layers         5
Filters in Conv2D layers        128, 128, 128, 128, 256
Kernel sizes in Conv2D layers   7x7, 5x5, 3x3, 1x1, 5x5
Strides in Conv2D layers        2, 2, 1, 1, 2
Fully Connected Layer Sizes     256, branch features
Activation Function             GELU
Table 2: Hyperparameters of the DeepONet architecture

A.4 Training settings

Below we summarize the key configurations and parameters employed in the training for three cases:

Dataset
  • Case 1: For Method 1, we use 500 supervised data samples (with ground truth), while for Method 2, we use 5000 unsupervised data samples (only with the initial state) and 500 supervised data samples.

  • Case 2: For Method 1, we use 5000 supervised data samples, while for Method 2, we use 5000 unsupervised data samples and 5000 supervised data samples.

  • Case 3 (Gray-Scott model): We only perform Method 2, with 10000 supervised data samples and 50000 unsupervised data samples.

Optimization Technique
  • Optimizer: Adam, with a learning rate of 1×1041\times 10^{-4} and a weight decay of 1×1061\times 10^{-6}.

  • Training Epochs: The model was trained for 1000 epochs to ensure convergence, with a batch size of 50.

These settings underscore our commitment to precision and detailed examination of neural operator efficiency in computational tasks. Our architecture and optimization choices are particularly tailored to explore and exploit the capabilities of neural networks in processing complex systems simulations.

A.5 Benchmarking Newton’s Method and neural operator based method

Experimental Setup

The benchmark study was conducted to evaluate the performance of a GPU-accelerated implementation of Newton's method, designed to solve systems of linear equations derived from discretizing partial differential equations. The implementation utilized CuPy with CUDA to leverage the parallel processing capabilities of the NVIDIA A100 GPU. The hardware comprises an Intel Cascade Lake 2.5 GHz CPU and an NVIDIA A100 GPU.

Software Environment:

Ubuntu 20.04 LTS. Python Version: 3.8. CUDA Version: 11.4.

Newton's method was implemented to solve the Laplacian equation over a discretized domain. Multiple system solutions were computed in parallel using CUDA streams. Key parameters of the experiment are as follows. Data Type (dtype): single-precision floating point (float32). Number of Streams: 10 CUDA streams to process data in parallel. Number of Repeated Calculations: the Newton method was executed multiple times for 500/5000 Newton linear systems, respectively, distributed evenly across the streams. Function to Solve Systems: CuPy's spsolve was used for solving the sparse matrix systems. Algorithm 1 below summarizes the procedure used to benchmark the time for solving Newton's systems for 5000 different initial states.

Algorithm 1 Solve Newton's Systems on GPU
1: procedure SolveSystemsGPU(A, u, rhs_f, N)   ▷ Precompute RHS and diagonal for all systems
2:     rhs_list ← rhs_f + u² − A × u
3:     diag_list ← −2 × u   ▷ Initialize solution storage
4:     delta_u ← zero matrix with the shape of uᵀ   ▷ Solve each system
5:     for i = 0 to num_sys − 1 do
6:         rhs ← rhs_list[i]
7:         diag ← diag_list[i]
8:         J ← A + diagonal matrix with diag on the main diagonal   ▷ Solve the linear system
9:         delta_u[i] ← spsolve(J, rhs)
10:     end for
11:     return transpose(delta_u)
12: end procedure
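For reference, the following is a minimal CPU sketch of Algorithm 1 using SciPy's sparse solver in place of the CuPy/CUDA-stream implementation; A is the assembled sparse matrix, u holds one initial state per column, and rhs_f is the source term. This is an illustrative reimplementation under those assumptions, not the benchmarked GPU code.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_newton_systems(A, u, rhs_f):
    # A: (M, M) sparse matrix; u: (M, num_sys) initial states; rhs_f: (M,) source term
    num_sys = u.shape[1]
    rhs_list = rhs_f[:, None] + u**2 - A @ u          # precompute right-hand sides for all systems
    diag_list = -2.0 * u                              # precompute diagonal corrections
    delta_u = np.zeros_like(u)
    for i in range(num_sys):                          # solve each linearized Newton system
        J = A + sp.diags(diag_list[:, i])
        delta_u[:, i] = spla.spsolve(J.tocsc(), rhs_list[:, i])
    return delta_u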

Appendix B Supplemental material for proof

B.1 Preliminaries

Definition 1 (Sobolev Spaces [6]).

Let $\Omega$ be $[0,1]^{d}$, let $D$ be the weak derivative operator for a single-variable function, and let $D^{\bm{\alpha}}=D^{\alpha_{1}}_{1}D^{\alpha_{2}}_{2}\ldots D^{\alpha_{d}}_{d}$ be the partial derivative, where $\bm{\alpha}=[\alpha_{1},\alpha_{2},\ldots,\alpha_{d}]^{T}$ and $D_{i}$ is the derivative in the $i$-th variable. Let $n\in\mathbb{N}$ and $1\leq p\leq\infty$. Then we define the Sobolev spaces

W^{n,p}(\Omega):=\left\{f\in L^{p}(\Omega):D^{\bm{\alpha}}f\in L^{p}(\Omega)\text{ for all }\bm{\alpha}\in\mathbb{N}^{d}\text{ with }|\bm{\alpha}|\leq n\right\}

with a norm

\|f\|_{W^{n,p}(\Omega)}:=\left(\sum_{0\leq|\alpha|\leq n}\left\|D^{\alpha}f\right\|_{L^{p}(\Omega)}^{p}\right)^{1/p}

if $p<\infty$, and $\|f\|_{W^{n,\infty}(\Omega)}:=\max_{0\leq|\alpha|\leq n}\left\|D^{\alpha}f\right\|_{L^{\infty}(\Omega)}$.

Furthermore, for $\bm{f}=(f_{1},\ldots,f_{d})$, $\bm{f}\in W^{1,\infty}(\Omega,\mathbb{R}^{d})$ if and only if $f_{i}\in W^{1,\infty}(\Omega)$ for each $i=1,2,\ldots,d$, and

\|\bm{f}\|_{W^{1,\infty}(\Omega,\mathbb{R}^{d})}:=\max_{i=1,\ldots,d}\{\|f_{i}\|_{W^{1,\infty}(\Omega)}\}.

When $p=2$, denote $W^{n,2}(\Omega)$ as $H^{n}(\Omega)$ for $n\in\mathbb{N}_{+}$.

Proposition 1 ([26]).

Suppose $\sigma$ is a continuous non-polynomial function and $K$ is a compact set in $\mathbb{R}^{d}$. Then there exist a positive integer $p$, constants $w_{k},\zeta_{k}$ for $k=1,\ldots,p$, and bounded linear functionals $c_{k}:H^{r}(K)\to\mathbb{R}$ such that for any $v\in H^{r}(K)$,

\left\|v-\sum_{k=1}^{p}c_{k}(v)\sigma\left(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}\right)\right\|_{L^{2}(K)}\leq cp^{-r/d}\|v\|_{H^{r}(K)}. (13)
Proposition 2 ([28, 36]).

Suppose $\sigma$ is a continuous non-polynomial function and $\Omega$ is a compact subset of $\mathbb{R}^{d}$. For any Lipschitz-continuous function $f$, there exists a shallow neural network such that

\left\|f-\sum_{j=1}^{m}a_{j}\sigma\left(\bm{\omega}_{j}\cdot\bm{x}+b_{j}\right)\right\|_{\infty}\leq Cm^{-1/d}, (14)

where $C$ depends on the Lipschitz constant but is independent of $m$.

Lemma 1 ([18]).

The $\epsilon$-covering number of $[-B,B]^{d}$, $K(\epsilon)$, satisfies

K(\epsilon)\leqslant\left(\frac{CB}{\epsilon}\right)^{d},

for some constant $C>0$ independent of $\epsilon$, $B$, and $d$.


B.2 Proof of Theorem 1

In this subsection, we present the proof of Theorem 1, which describes the approximation ability of DeepONet. The sketch of the proof is illustrated in Fig. 6.

Figure 6: The sketch of the proof of Theorem 1. The details of each step are provided in the following.
Proof of Theorem 1.

Step 1: First, we verify that the target operator $\mathcal{G}(u)$ is well-defined.

Due to Assumption 1 (i) and [6, Theorem 6 in Section 6.2], we know that for $u\in\mathcal{X}\subset H^{2}(\Omega)$, Equation (2) has a unique solution. This means that $\mathcal{G}(u)$ is a well-defined operator on the input space $\mathcal{X}$.

Step 2: Second, we verify that $\mathcal{G}(u)$ is a Lipschitz-continuous operator in $H^{2}(\Omega)$ for $u\in\mathcal{X}$.

Consider the following:

f^{\prime}(u+v)=f^{\prime}(u)+vf^{\prime\prime}(\xi_{1})
f(u+v)=f(u)+vf^{\prime}(\xi_{2})
\delta u_{v}(\bm{x})=\delta u(\bm{x})+\epsilon(\bm{x}) (15)

where $\delta u(\bm{x})$ is the solution of Eq. (2) for the input $u$, and $\delta_{v}u(\bm{x})$ is the solution of Eq. (2) for the input $u+v$. Denote

\delta_{v}u(\bm{x})-\delta u(\bm{x})=:\epsilon(\bm{x}).

Therefore, we have:

\begin{cases}(\mathcal{L}-f^{\prime}(u+v))\epsilon(\bm{x})=\Delta v-v(f^{\prime}(\xi_{2})+f^{\prime\prime}(\xi_{1})\delta u),&\bm{x}\in\Omega\\ \epsilon(\bm{x})=0,&\bm{x}\in\partial\Omega.\end{cases} (16)

Since $u$ and $v$ are in $H^{2}$ and $\partial\Omega$ is $C^{2}$ (Assumption 1 (iii)), according to [6, Theorem 4 in Section 6.3], there exist constants $C$ (in this paper, we consistently employ the symbol $C$ for a constant that may vary from line to line) and $\bar{C}$ such that:

\|\epsilon(\bm{x})\|_{H^{2}(\Omega)}\leq C\|\mathcal{L}v-v(f^{\prime}(\xi_{2})+f^{\prime\prime}(\xi_{1})\delta u)\|_{L^{2}(\Omega)}\leq\bar{C}\|v(\bm{x})\|_{H^{2}(\Omega)}. (17)

The last inequality is due to the boundedness of $f^{\prime}(\xi_{2})+f^{\prime\prime}(\xi_{1})\delta u$ (Assumption 1 (ii)).

Step 3: In the approximation, we first reduce the operator learning to functional learning.

When the input function u(𝒙)u(\bm{x}) belongs to 𝒳H2\mathcal{X}\subset H^{2}, the output function δu\delta u also belongs to H2H^{2}, provided that Ω\partial\Omega is of class C2C^{2}. The function 𝒢(u)=δu\mathcal{G}(u)=\delta u can be approximated by a two-layer network architected by the activation function σ(x)\sigma(x), which is not a polynomial, in the following form by Proposition 1 [26] (given in Subsection 16):

𝒢(u)(𝒙)k=1pck[𝒢(u)]σ(𝒘k𝒙+ζk)L2(Ω)C1p2d𝒢(u)H2(Ω),\left\|\mathcal{G}(u)(\bm{x})-\sum_{k=1}^{p}c_{k}[\mathcal{G}(u)]\sigma\left(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}\right)\right\|_{L^{2}(\Omega)}\leq C_{1}p^{-\frac{2}{d}}\|\mathcal{G}(u)\|_{H^{2}(\Omega)}, (18)

where 𝒘kd\bm{w}_{k}\in\mathbb{R}^{d}, ζk\zeta_{k}\in\mathbb{R} for k=1,,pk=1,\ldots,p, ckc_{k} is a continuous functional, and C1C_{1} is a constant independent of the parameters.

Denote ϕk(u)=ck[𝒢(u)]\phi_{k}(u)=c_{k}[\mathcal{G}(u)], which is a Lipschitz-continuous functional from H2(Ω)H^{2}(\Omega) to \mathbb{R}, since 𝒢\mathcal{G} is a Lipschitz-continuous operator and ckc_{k} is a continuous linear functional. The remaining task in the approximation is to approximate these functionals by neural networks.

Step 4: In this step, we reduce the functional learning to function learning by applying the operator 𝒫\mathcal{P} as in Assumption 1 (iv).

Based on ϕk(u)\phi_{k}(u) being a Lipschitz-continuous functional in H2(Ω)H^{2}(\Omega), we have

|ϕk(u)ϕk(𝒫u)|Lku𝒫uH2(Ω)Lkϵ,|\phi_{k}(u)-\phi_{k}(\mathcal{P}u)|\leq L_{k}\|u-\mathcal{P}u\|_{H^{2}(\Omega)}\leq L_{k}\epsilon,

where LkL_{k} is the Lipschitz constant of ϕk(u)\phi_{k}(u) for u𝒳u\in\mathcal{X}.

Furthermore, since 𝒫u\mathcal{P}u lies in an nn-dimensional space, i.e., it can be represented by the nn-dimensional vector 𝒫¯un\bar{\mathcal{P}}u\in\mathbb{R}^{n}, we can rewrite ϕk(𝒫u)\phi_{k}(\mathcal{P}u) as ψk(𝒫¯u)\psi_{k}(\bar{\mathcal{P}}u), where ψk:n\psi_{k}:\mathbb{R}^{n}\to\mathbb{R} for k=1,,pk=1,\ldots,p. Moreover, ψk\psi_{k} is a Lipschitz-continuous function since ϕk\phi_{k} is Lipschitz-continuous and 𝒫\mathcal{P} is a continuous linear operator.

Step 5: In this step, we will approximate ψk\psi_{k} using shallow neural networks.

Due to Proposition 2, we have that there is a shallow neural network such that

\left\|\psi_{k}(\bar{\mathcal{P}}u)-\bm{A}_{k}\sigma\left(\bm{M}_{k}\cdot\bar{\mathcal{P}}u+\bm{b}_{k}\right)\right\|_{\infty}\leq Cm^{-1/n}, (19)

where 𝑨k1×m\bm{A}_{k}\in\mathbb{R}^{1\times m}, 𝑴km×n\bm{M}_{k}\in\mathbb{R}^{m\times n}, and 𝒃km\bm{b}_{k}\in\mathbb{R}^{m}. For notational simplicity, we replace 𝑴k𝒫¯\bm{M}_{k}\cdot\bar{\mathcal{P}} by an operator 𝒲k\mathcal{W}_{k}.
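More precisely, since \psi_{k}(\bar{\mathcal{P}}u)=\phi_{k}(\mathcal{P}u) and \phi_{k}(u)=c_{k}[\mathcal{G}(u)], the error in approximating each coefficient functional splits into the two contributions from Steps 4 and 5:

\left|\bm{A}_{k}\sigma\left(\mathcal{W}_{k}u+\bm{b}_{k}\right)-c_{k}[\mathcal{G}(u)]\right|\leq\left|\bm{A}_{k}\sigma\left(\mathcal{W}_{k}u+\bm{b}_{k}\right)-\psi_{k}(\bar{\mathcal{P}}u)\right|+\left|\phi_{k}(\mathcal{P}u)-\phi_{k}(u)\right|\leq Cm^{-1/n}+L_{k}\epsilon.

Inserting this into (18), and using that \sigma(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}) is bounded on the bounded domain \Omega, controls the total error.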

Combining the estimates above, there is a neural network in Ξp\Xi_{p} such that

k=1p𝑨kσ(𝒲ku+𝒃k)σ(𝒘k𝒙+ζk)𝒢(u)L2(Ω)C1m1n+C2(ϵ+p2d),\left\|\sum_{k=1}^{p}\bm{A}_{k}\sigma\left(\mathcal{W}_{k}u+\bm{b}_{k}\right)\sigma\left(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}\right)-\mathcal{G}(u)\right\|_{L^{2}(\Omega)}\leq C_{1}m^{-\frac{1}{n}}+C_{2}(\epsilon+p^{-\frac{2}{d}}), (20)

where C1C_{1} is a constant independent of mm, ϵ\epsilon, and pp, and C2C_{2} is a constant depending on pp.
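To make the structure of the class \Xi_{p} concrete, the following minimal NumPy sketch (illustrative only; the parameters are random placeholders and the sizes are hypothetical) evaluates a network of the form appearing in (20), with a branch factor \bm{A}_{k}\sigma(\bm{M}_{k}\bar{\mathcal{P}}u+\bm{b}_{k}) and a trunk factor \sigma(\bm{w}_{k}\cdot\bm{x}+\zeta_{k}) for each k:

import numpy as np

# Minimal sketch of a network in the class Xi_p from (20); parameters are random placeholders.
rng = np.random.default_rng(0)
p, m, n, d = 4, 8, 16, 2             # hypothetical sizes: p terms, width m, n sensors, dimension d
sigma = np.tanh                       # any non-polynomial activation

A = rng.normal(size=(p, m))           # A_k (row vectors), k = 1, ..., p
M = rng.normal(size=(p, m, n))        # M_k, acting on the discretized input P-bar u
b = rng.normal(size=(p, m))           # b_k
w = rng.normal(size=(p, d))           # w_k
zeta = rng.normal(size=(p,))          # zeta_k

def operator_net(Pu, x):
    """Evaluate sum_k [A_k sigma(M_k Pu + b_k)] * sigma(w_k . x + zeta_k)."""
    branch = np.array([A[k] @ sigma(M[k] @ Pu + b[k]) for k in range(p)])  # shape (p,)
    trunk = sigma(w @ x + zeta)                                            # shape (p,)
    return float(branch @ trunk)

print(operator_net(rng.normal(size=n), np.array([0.3, 0.7])))

This is precisely the branch-trunk (DeepONet-type) form used in the approximation above.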

Here, we discuss the embedding operator 𝒫\mathcal{P} in more detail. One approach is to use the Argyris element [2]. This method involves considering the 2121 degrees of freedom shown in Fig. 7. In this figure, each \bullet denotes evaluation at a point, the inner circle represents evaluation of the gradient at the center, and the outer circle denotes evaluation of the three second derivatives at the center. The arrows indicate the evaluation of the normal derivatives at the three midpoints.

Figure 7: Argyris method
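For concreteness, the 21 degrees of freedom of the Argyris triangle can be tallied as follows (a standard count):

3\times\underbrace{(1+2+3)}_{\text{value, gradient, second derivatives at each vertex}}+\underbrace{3}_{\text{normal derivatives at the edge midpoints}}=21.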

An alternative approach to discretizing the input space is to use the bi-cubic Hermite finite element method [19, 13]; a simplified one-dimensional illustration is sketched below.
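The sketch below (illustrative only; the actual constructions are the two-dimensional Argyris and bi-cubic Hermite elements above, and the function names here are hypothetical) shows how a Hermite-type discretization turns an input function into the finite vector \bar{\mathcal{P}}u fed to the functionals \psi_{k}:

import numpy as np

# 1D illustration: a cubic Hermite interpolant on a grid is determined by the nodal values
# u(x_i) and nodal derivatives u'(x_i), so P-bar u stacks these numbers into one vector.
def hermite_dofs(u, du, nodes):
    """Return the coefficient vector P-bar u = (u(x_0), ..., u(x_N), u'(x_0), ..., u'(x_N))."""
    return np.concatenate([u(nodes), du(nodes)])

nodes = np.linspace(0.0, 1.0, 5)                  # 5 nodes -> n = 10 degrees of freedom
u = lambda x: np.sin(np.pi * x)                   # a sample input function
du = lambda x: np.pi * np.cos(np.pi * x)          # its derivative
Pu = hermite_dofs(u, du, nodes)                   # the n-dimensional input to the branch part
print(Pu.shape)                                   # (10,)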

B.3 Proof of Theorem 2

The proof of Theorem 2 is inspired by that in [18].

Proof of Theorem 2.

Step 1: To begin with, we introduce an intermediate quantity Sm(𝜽)\mathcal{E}_{Sm}(\bm{\theta}), defined as follows:

Sm(𝜽):=1Muj=1MuΩ|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2d𝒙.\mathcal{E}_{Sm}(\bm{\theta}):=\frac{1}{M_{u}}\sum_{j=1}^{M_{u}}\int_{\Omega}\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})\right|^{2}\mathrm{d}\bm{x}.

This term represents the limiting case of Sc(𝜽)\mathcal{E}_{Sc}(\bm{\theta}) as the number of sample points in the domain of the output space tends to infinity (MxM_{x}\to\infty).
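As a toy illustration of the quantities being compared (a one-dimensional domain and simple stand-ins for \mathcal{G} and \mathcal{O}; not the paper's setup, and all names and values below are placeholders), the fully discrete loss averages the squared error over M_u inputs and M_x points, whereas \mathcal{E}_{Sm} replaces the M_x-point average by the integral over \Omega:

import numpy as np

rng = np.random.default_rng(0)
M_u, M_x = 20, 50

def G(a, x):   # hypothetical target operator, with a scalar parameter a standing in for u_j
    return np.sin(np.pi * a * x)

def O(a, x):   # hypothetical operator-network output (stand-in)
    return np.sin(np.pi * a * x) + 0.01 * x

a_samples = rng.uniform(0.5, 1.5, size=M_u)

# Fully discrete loss: average over M_u inputs and M_x random points in Omega = (0, 1).
x_pts = rng.uniform(0.0, 1.0, size=(M_u, M_x))
loss_discrete = np.mean([(G(a, x) - O(a, x)) ** 2 for a, x in zip(a_samples, x_pts)])

# Intermediate loss E_Sm: the M_x -> infinity limit, realized here by a fine quadrature grid.
x_quad = np.linspace(0.0, 1.0, 2001)
loss_Sm = np.mean([np.trapz((G(a, x_quad) - O(a, x_quad)) ** 2, x_quad) for a in a_samples])

print(loss_discrete, loss_Sm)   # their gap is the M_x-sampling error bounded in Step 2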

Then the error can be divided into two parts:

|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leq|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta}))|+|\mathbf{E}(\mathcal{E}_{Sm}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|. (21)

Step 2: For |\mathbf{E}(\mathcal{E}_{Sm}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|, this is a classical generalization-error analysis, and the result can be obtained from [30, 34]. We omit the details of this part; the bound can be expressed as

|𝐄(Sm(𝜽)Sc(𝜽))|Cd𝜽logMxMx,|\mathbf{E}(\mathcal{E}_{Sm}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leq\frac{Cd_{\bm{\theta}}\log M_{x}}{M_{x}}, (22)

where CC is independent of the number of parameters d𝜽d_{\bm{\theta}} and the sample size MxM_{x}. In the following steps, we are going to estimate |𝐄(S(𝜽)Sm(𝜽))||\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta}))|, which is the error that comes from the sampling of the input space of the operator.

Step 3: Denote

S𝜽M:=1Mj=1MΩ|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2d𝒙.S_{\bm{\theta}}^{M}:=\frac{1}{M}\sum_{j=1}^{M}\int_{\Omega}\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})\right|^{2}\mathrm{d}\bm{x}.

We first estimate the gap between S𝜽MS_{\bm{\theta}}^{M} and S𝜽MS_{\bm{\theta}^{\prime}}^{M} for any bounded parameters 𝜽,𝜽\bm{\theta},\bm{\theta}^{\prime}. Due to Assumption 2 (i) and (ii), we have that

|S𝜽MS𝜽M|\displaystyle|S_{\bm{\theta}}^{M}-S_{\bm{\theta}^{\prime}}^{M}|
\displaystyle\leq 1Mj=1M|Ω|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2|𝒢(uj)(𝒙)𝒪(uj;𝜽)(𝒙)|2d𝒙|\displaystyle\frac{1}{M}\sum_{j=1}^{M}\left|\int_{\Omega}\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})\right|^{2}-\left|\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})\right|^{2}\mathrm{d}\bm{x}\right|
\displaystyle\leq\frac{1}{M}\sum_{j=1}^{M}\left|\int_{\Omega}\left|2\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})\right|\cdot\left|\mathcal{O}(u_{j};\bm{\theta})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})\right|\mathrm{d}\bm{x}\right|
\displaystyle\leq 4Mj=1MΨ(uj)Φ(uj)𝜽𝜽.\displaystyle\frac{4}{M}\sum_{j=1}^{M}\Psi(u_{j})\Phi(u_{j})\cdot\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{\ell^{\infty}}. (23)
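For clarity, the middle step in (23) uses the pointwise factorization of a difference of squares with a=\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x}) and b=\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x}):

a^{2}-b^{2}=(a+b)(a-b),\qquad a+b=2\mathcal{G}(u_{j})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x}),\qquad a-b=\mathcal{O}(u_{j};\bm{\theta}^{\prime})(\bm{x})-\mathcal{O}(u_{j};\bm{\theta})(\bm{x}),

while the last step is where Assumption 2 (i) and (ii) enter, via a pointwise bound \Psi(u_{j}) on the magnitudes of \mathcal{G}(u_{j}) and \mathcal{O}(u_{j};\cdot) and the Lipschitz-in-\bm{\theta} factor \Phi(u_{j}).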

Step 4: Based on Step 3, we are going to estimate

𝐄[sup𝜽[B,B]d𝜽|S𝜽M𝐄S𝜽M|p]1p\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M}-\mathbf{E}S_{\bm{\theta}}^{M}\right|^{p}\right]^{\frac{1}{p}}

by a covering argument over the parameter space.

Let {𝜽1,,𝜽K}\{\bm{\theta}_{1},\ldots,\bm{\theta}_{K}\} be an ϵ\epsilon-covering of [B,B]d𝜽[-B,B]^{d_{\bm{\theta}}}, i.e., for any 𝜽\bm{\theta} in [B,B]d𝜽[-B,B]^{d_{\bm{\theta}}}, there exists an index jj with 𝜽𝜽jϵ\left\|\bm{\theta}-\bm{\theta}_{j}\right\|_{\ell_{\infty}}\leqslant\epsilon; we write j(𝜽)j(\bm{\theta}) for such an index. Then we have

\displaystyle\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M}-\mathbf{E}\left[S_{\bm{\theta}}^{M}\right]\right|^{p}\right]^{1/p}
\displaystyle\leq\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left(\left|S_{\bm{\theta}}^{M}-S_{\bm{\theta}_{j(\bm{\theta})}}^{M}\right|+\left|S_{\bm{\theta}_{j(\bm{\theta})}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j(\bm{\theta})}}^{M}\right]\right|+\left|\mathbf{E}\left[S_{\bm{\theta}_{j(\bm{\theta})}}^{M}\right]-\mathbf{E}\left[S_{\bm{\theta}}^{M}\right]\right|\right)^{p}\right]^{1/p}
\displaystyle\leq\mathbf{E}\left[\left(\max_{j=1,\ldots,K}\left|S_{\bm{\theta}_{j}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j}}^{M}\right]\right|+\frac{8\epsilon}{M}\sum_{i=1}^{M}\Psi(u_{i})\Phi(u_{i})\right)^{p}\right]^{1/p}
\displaystyle\leq 8\epsilon\,\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}+\mathbf{E}\left[\max_{j=1,\ldots,K}\left|S_{\bm{\theta}_{j}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j}}^{M}\right]\right|^{p}\right]^{1/p}. (24)
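The last step in (24) uses the triangle (Minkowski) inequality in L^{p} together with the fact that the samples u_{i} are identically distributed:

\mathbf{E}\left[\left(\frac{1}{M}\sum_{i=1}^{M}\Psi(u_{i})\Phi(u_{i})\right)^{p}\right]^{1/p}\leq\frac{1}{M}\sum_{i=1}^{M}\mathbf{E}\left[\left(\Psi(u_{i})\Phi(u_{i})\right)^{p}\right]^{1/p}=\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}.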

For the term 8ϵ𝐄[|ΨΦ|p]1/p8\epsilon\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}, the Cauchy-Schwarz inequality gives

8ϵ𝐄[|ΨΦ|p]1/p8ϵ𝐄[|Ψ|2p]1/2p𝐄[|Φ|2p]1/2p=8ϵΨL2pΦL2p.8\epsilon\mathbf{E}\left[|\Psi\Phi|^{p}\right]^{1/p}\leqslant 8\epsilon\mathbf{E}\left[|\Psi|^{2p}\right]^{1/2p}\mathbf{E}\left[|\Phi|^{2p}\right]^{1/2p}=8\epsilon\|\Psi\|_{L^{2p}}\|\Phi\|_{L^{2p}}.

For 𝐄[maxj=1,,K|S𝜽jM𝐄[S𝜽jM]|p]1/p\mathbf{E}\left[\max_{j=1,\ldots,K}\left|S_{\bm{\theta}_{j}}^{M}-\mathbf{E}\left[S_{\bm{\theta}_{j}}^{M}\right]\right|^{p}\right]^{1/p}, by applying the result in [32, 18], we have

𝐄[maxj=1,,K|S𝜽jM𝐄[S𝜽jM]|p]1/p16K1/ppΨL2p2M.\mathbf{E}\left[\max_{j=1,\ldots,K}\left|S_{{\bm{\theta}}_{j}}^{M}-\mathbf{E}\left[S_{{\bm{\theta}}_{j}}^{M}\right]\right|^{p}\right]^{1/p}\leq\frac{16K^{1/p}\sqrt{p}\|\Psi\|_{L^{2p}}^{2}}{\sqrt{M}}.

Step 5: Now we estimate |𝐄(S(𝜽)Sm(𝜽)||\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta})|.

Due to Assumption 2 and a direct calculation, we have that

ΨL2p,ΦL2pC(1+γκp)κ,\|\Psi\|_{L^{2p}},\|\Phi\|_{L^{2p}}\leqslant C(1+\gamma\kappa p)^{\kappa},

for constants C,γ>0C,\gamma>0, depending only on the measure μ\mu and the constant appearing in the upper bound (7).

Based on Lemma 1, we have that

𝐄[sup𝜽[B,B]d𝜽|S𝜽Mu𝐄[S𝜽Mu]|p]1/p16C2(1+γκp)2κ(ϵ+(CBϵ)d𝜽/ppMu),\mathbf{E}\left[\sup_{{\bm{\theta}}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M_{u}}-\mathbf{E}\left[S_{\bm{\theta}}^{M_{u}}\right]\right|^{p}\right]^{1/p}\leqslant 16C^{2}(1+\gamma\kappa p)^{2\kappa}\left(\epsilon+\left(\frac{CB}{\epsilon}\right)^{d_{\bm{\theta}}/p}\frac{\sqrt{p}}{\sqrt{M_{u}}}\right),

for some constants C,γ>0C,\gamma>0, independent of κ,μ,B,d𝜽,N,ϵ>0\kappa,\mu,B,d_{\bm{\theta}},N,\epsilon>0 and p2p\geqslant 2. We now choose ϵ=1Mu\epsilon=\frac{1}{\sqrt{M_{u}}}, so that

ϵ+(CBϵ)d𝜽/ppMu=1Mu(1+(CBMu)d𝜽/pp).\epsilon+\left(\frac{CB}{\epsilon}\right)^{d_{\bm{\theta}}/p}\frac{\sqrt{p}}{\sqrt{M_{u}}}=\frac{1}{\sqrt{M_{u}}}\left(1+(CB\sqrt{M_{u}})^{d_{\bm{\theta}}/p}\sqrt{p}\right).

Next, let p=d𝜽log(CBMu)p=d_{\bm{\theta}}\log(CB\sqrt{M_{u}}). Then,

(CBMu)d𝜽/pp=exp(log(CBMu)d𝜽p)p=ed𝜽log(CBMu),(CB\sqrt{M_{u}})^{d_{\bm{\theta}}/p}\sqrt{p}=\exp\left(\frac{\log(CB\sqrt{M_{u}})d_{\bm{\theta}}}{p}\right)\sqrt{p}=e\sqrt{d_{\bm{\theta}}\log(CB\sqrt{M_{u}})},

and thus we conclude that

\epsilon+\left(\frac{CB}{\epsilon}\right)^{d_{\bm{\theta}}/p}\frac{\sqrt{p}}{\sqrt{M_{u}}}\leqslant\frac{1}{\sqrt{M_{u}}}\left(1+e\sqrt{d_{\bm{\theta}}\log(CB\sqrt{M_{u}})}\right).

On the other hand, we have

(1+γκp)2κ=(1+γκd𝜽log(CBMu))2κ.(1+\gamma\kappa p)^{2\kappa}=\left(1+\gamma\kappa d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa}.

Increasing the constant C>0C>0, if necessary, we can further estimate

\left(1+\gamma\kappa d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa}\left(1+e\sqrt{d_{\bm{\theta}}\log(CB\sqrt{M_{u}})}\right)\leqslant C\left(1+d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa+1/2},

where C>0C>0 depends on κ,γ,μ\kappa,\gamma,\mu and the constant appearing in (7), but is independent of d𝜽,Bd_{\bm{\theta}},B and NN. We can express this dependence in the form C=C(μ,Ψ,Φ)>0C=C(\mu,\Psi,\Phi)>0, as the constants κ\kappa and γ\gamma depend on the Gaussian tail of μ\mu and the upper bound on Ψ,Φ\Psi,\Phi.

Therefore,

\displaystyle|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sm}(\bm{\theta}))| \leq\mathbf{E}\left[\sup_{\bm{\theta}\in[-B,B]^{d_{\bm{\theta}}}}\left|S_{\bm{\theta}}^{M_{u}}-\mathbf{E}\left[S_{\bm{\theta}}^{M_{u}}\right]\right|\right]
\displaystyle\leq\frac{C}{\sqrt{M_{u}}}\left(1+d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa+1/2}. (25)
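For completeness, combining (21) with (22) and (25) yields, under the above assumptions and up to the values of the constants, a bound of the form

|\mathbf{E}(\mathcal{E}_{S}(\bm{\theta})-\mathcal{E}_{Sc}(\bm{\theta}))|\leq\frac{Cd_{\bm{\theta}}\log M_{x}}{M_{x}}+\frac{C}{\sqrt{M_{u}}}\left(1+d_{\bm{\theta}}\log(CB\sqrt{M_{u}})\right)^{2\kappa+1/2},

which is the type of estimate asserted in Theorem 2.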