
The parametric complexity of operator learning

Samuel Lanthaler and Andrew M. Stuart
Abstract.

Neural operator architectures employ neural networks to approximate operators mapping between Banach spaces of functions; they may be used to accelerate model evaluations via emulation, or to discover models from data. Consequently, the methodology has received increasing attention over recent years, giving rise to the rapidly growing field of operator learning. The first contribution of this paper is to prove that for general classes of operators which are characterized only by their $C^{r}$- or Lipschitz-regularity, operator learning suffers from a "curse of parametric complexity", which is an infinite-dimensional analogue of the well-known curse of dimensionality encountered in high-dimensional approximation problems. The result is applicable to a wide variety of existing neural operators, including PCA-Net, DeepONet and the FNO. The second contribution of the paper is to prove that this general curse can be overcome for solution operators defined by the Hamilton-Jacobi equation; this is achieved by leveraging additional structure in the underlying solution operator, going beyond regularity. To this end, a novel neural operator architecture is introduced, termed HJ-Net, which explicitly takes into account characteristic information of the underlying Hamiltonian system. Error and complexity estimates are derived for HJ-Net which show that this architecture can provably beat the curse of parametric complexity related to the infinite-dimensional input and output function spaces.

1. Introduction

The paper is devoted to a study of the computational complexity of the approximation of maps between Banach spaces by means of neural operators. The paper has two main foci: establishing a complexity barrier for general classes of $C^{r}$- or Lipschitz-regular maps; and then showing that this barrier can be beaten for Hamilton-Jacobi (HJ) equations. In Subsection 1.1 we give a detailed literature review; we set in context the definition of "the curse of parametric complexity" that we introduce and use in this paper; and we highlight our main contributions. Then, in Subsection 1.2, we overview the organization of the remainder of the paper.

1.1. Context and Literature Review

The use of neural networks to learn operators, typically mapping between Banach spaces of functions defined over subsets of finite dimensional Euclidean space and referred to as neural operators, is receiving growing interest in the computational science and engineering community [5, 58, 3, 28, 39, 2, 43, 36]. The methodology has the potential for accelerating numerical methods for solving partial differential equations (PDEs) when a model relating inputs and outputs is known; and it has the potential for discovering input-output maps from data when no model is available.

The computational complexity of learning and evaluating such neural operators is crucial to understanding when the methods will be effective. Numerical experiments addressing this issue may be found in [13] and the analysis of linear problems from this perspective may be found in [4, 14]. Universal approximation theorems, applicable beyond the linear setting, may be found in [5, 39, 34, 30, 31, 33, 2, 56] but such theorems do not address the cost of achieving a given small error.

Early work on operator approximation [42] presents first quantitative bounds; most notably, this work identifies the continuous nonlinear nn-widths of a space of continuous functionals defined on L2L^{2}-spaces, showing that these nn-widths decay at most (poly-)logarithmically in nn. Both upper and lower bounds are derived in this specific setting. More recently, upper bounds on the computational complexity of recent approaches to operator learning based on deep neural networks, including the DeepONet [39] and the Fourier Neural Operator (FNO) [36], have been studied in more detail. Specific operator learning tasks arising in PDEs have been considered in the papers [50, 24, 40, 34, 30, 48, 15]. Related complexity analysis for the PCA-Net architecture [2] has recently been established in [32]. These papers studying computational complexity focus on the issue of beating a form of the “curse of dimensionality” in these operator approximation tasks.

In these operator learning problems the input and output spaces are infinite dimensional, and hence the meaning of the curse of dimensionality could be ambiguous. In this infinite-dimensional context, “beating the curse” is interpreted as identifying problems, and operator approximations applied to those problems, for which a measure of evaluation cost (referred to as their complexity) grows only algebraically with the inverse of the desired error. As shown rigorously in [32], this is a non-trivial issue: for the PCA-Net architecture, it has been established that such algebraic complexity and error bounds cannot be achieved for general Lipschitz (and even more regular) operators.

As will be explained in detail in the present work, this fact is not specific to PCA-Net, but extends to many other neural operator architectures. In fact, it can be interpreted as a scaling limit of the conventional curse of dimensionality; this conventional curse affects finite-dimensional approximation problems when the underlying dimension dd is very large. It can be shown that (ReLU) neural networks cannot overcome this curse, in general. As a consequence, neural operators, which build on neural networks, suffer from the scaling limit of this curse in infinite-dimensions. To distinguish this infinite-dimensional phenomenon from the conventional curse of dimensionality encountered in high-but-finite-dimensional approximation problems, we will refer to the scaling limit identified in this work as “the curse of parametric complexity”.

The first contribution of the present paper is to prove that for general classes of operators which are characterized only by their CrC^{r}- or Lipschitz-regularity, operator learning suffers from such a curse of parametric complexity: Theorem 2.11 (and a variant thereof, Theorem 2.27) shows that, in this setting, there exist operators (and indeed even real-valued functionals) which are ϵ\epsilon-approximable only with parametric model complexity that grows exponentially in ϵ1\epsilon^{-1}.

To overcome the general curse of parametric complexity implied by Theorem 2.11 (and Theorem 2.27), efficient operator learning frameworks therefore have to leverage additional structure present in the operators of interest, going beyond CrC^{r}- or Lipschitz-regularity. Previous work on overcoming this curse for operator learning has mostly focused on operator holomorphy [23, 50, 34] and the emulation of numerical methods [30, 34, 32, 40] as two basic mechanisms for overcoming the curse of parametric complexity for specific operators of interest. A notable exception are the complexity estimates for DeepONets in [15] which are based on explicit representation of the solution; most prominently, this is achieved via the Cole-Hopf transformation for the viscous Burgers equation.

An abstract characterization of the entire class of operators that allow for efficient approximation by neural operators would be very desirable. Unfortunately, this appears to be out of reach, at the current state of analysis. Indeed, as far as the authors are aware, there does not even exist such a characterization for any class of standard numerical methods, such as finite difference, finite element or spectral, viewed as operator approximators. Therefore, in order to identify settings in which operator learning can be effective (without suffering from the general curse of parametric complexity), we restrict attention to specific classes of operators of interest.

The HJ equations present an application area that has the potential to be significantly impacted by the use of ideas from neural networks, especially regarding the solution of problems for functions defined over subsets of high dimensional (dd) Euclidean space [7, 8, 12, 11]; in particular beating the curse of dimensionality with respect to this dimension dd has been of focus. However, this body of work has not studied operator learning, as it concerns settings in which only fixed instances of the PDE are solved. The purpose of the second part of the present paper is to study the design and analysis of neural operators to approximate the solution operator for HJ equations; this operator maps the initial condition (a function) to the solution at a later time (another function).

The second contribution of the paper is to prove in Theorem 5.1 that the general curse of parametric complexity can be overcome for maps defined by the solution operator of the Hamilton-Jacobi (HJ) equation; this is achieved by exposing additional structure in the underlying solution operator, different from holomorphy and emulation and going beyond regularity, that can be leveraged by neural operators; for the HJ equations, the identified structure relies on representation of solutions of the HJ equations in terms of characteristics. In this paper the dimension dd of the underlying spatial domain will be fixed and we do not study the curse of dimensionality with respect to dd. Instead, we demonstrate that it is possible to beat the curse of parametric complexity with respect to the infinite dimensional nature of the input function for fixed (and moderate) dd.

1.2. Organization

In Section 2 we present the first contribution: Theorem 2.11, together with the closely related Theorem 2.27 which extends the general but not exhaustive setting of Theorem 2.11 to include the FNO, establishes that the curse of parametric complexity is to be expected in operator learning. The following sections then focus on the second contribution and hence on solution operators associated with the HJ equation; in Theorem 5.1 we prove that additional structure in the solution operator for this equation can be leveraged to beat the curse of parametric complexity. In Section 3 we describe the precise setting for the HJ equation employed throughout this paper; we recall the method of characteristics for solution of the equations; and we describe a short-time existence theorem. Section 4 introduces the proposed neural operator, the HJ-Net, based on learning the flow underlying the method of characteristics and combining it with scattered data approximation. In Section 5 we state our approximation theorem for the proposed HJ-Net, resulting in complexity estimates which demonstrate that the curse of parametric complexity is avoided in relation to the infinite dimensional nature of the input space (of initial conditions). Section 6 contains concluding remarks. Whilst the high-level structure of the proofs is contained in the main body of the paper, many technical details are collected in the appendix, to promote readability.

2. The Curse of Parametric Complexity

Operator learning seeks to employ neural networks to efficiently approximate operators mapping between infinite-dimensional Banach spaces of functions. To enable implementation of these methods in practice, maps between the formally infinite-dimensional spaces have to be approximated using only a finite number of degrees of freedom.

[Figure 1 diagram: the operator $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ is approximated by the composition of an encoder $\mathcal{E}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$, a neural network $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$, and a reconstruction $\mathcal{R}:\mathbb{R}^{D_{\mathcal{Y}}}\to\mathcal{Y}$.]
Figure 1. Diagrammatic illustration of operator learning based on an encoding $\mathcal{E}$, a neural network $\Psi$, and a reconstruction $\mathcal{R}$.

Commonly, operator learning frameworks can therefore be written in terms of an encoding, a neural network and a reconstruction step as shown in Figure 1. The first step \mathcal{E} encodes the infinite-dimensional input using a finite number of degrees of freedom. The second approximation step Ψ\Psi maps the encoded input to an encoded, finite-dimensional output. The final reconstruction step \mathcal{R} reconstructs an output function given the finite-dimensional output of the approximation mapping. The composition of these encoding, approximation and reconstruction mappings thus takes an input function and maps it to another output function, and hence defines an operator. Existing operator learning frameworks differ in their particular choice of the encoder, neural network architecture and reconstruction mappings.

We start by giving background discussion on the curse of dimensionality (CoD) in finite dimensions, in subsection 2.1. We then describe the subject in detail for neural network-based operator learning, resulting in our notion of the curse of parametric complexity, in subsection 2.2. In subsection 2.3 we state our main theorem concerning the curse of parametric complexity for neural operators. Subsection 2.4 demonstrates that the main theorem applies to PCA-Net, DeepONet and the NOMAD neural network architectures. Subsection 2.5 extends the main theorem to the FNO since it sits outside the framework introduced in subsection 2.2.

2.1. Curse of Dimensionality for Neural Networks

Since the neural network mapping Ψ:D𝒳D𝒴\Psi:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}^{{D_{\mathcal{Y}}}} in the decomposition shown in Figure 1 typically maps between high-dimensional (encoded) spaces, with D𝒳{D_{\mathcal{X}}}, D𝒴1{D_{\mathcal{Y}}}\gg 1, most approaches to operator learning employ neural networks to learn this mapping. The motivation for this is that, empirically, neural networks have been found to be exceptionally well suited for the approximation of such high-dimensional functions in diverse applications [20]. Detailed investigation of the approximation theory of neural networks, including quantitative upper and lower approximation error bounds, has thus attracted a lot of attention in recent years [54, 55, 29, 38, 16]. Since we build on this analysis we summarize the relevant part of it here, restricting attention to ReLU neural networks in this work, as defined next; generalization to the use of other (piecewise polynomial) activation functions is possible.

2.1.1. ReLU Neural Networks

Fix an integer $L$ and integers $\{d_{\ell}\}_{\ell=0}^{L+1}$. Let $A_{\ell}\in\mathbb{R}^{d_{\ell+1}\times d_{\ell}}$ and $b_{\ell}\in\mathbb{R}^{d_{\ell+1}}$ for $\ell=0,\dots,L$. A ReLU neural network $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$, $x\mapsto\Psi(x)$, is a mapping of the form

(2.1) $\left\{\begin{aligned} x_{0}&=x,\\ x_{\ell+1}&=\sigma(A_{\ell}x_{\ell}+b_{\ell}),\quad\ell=0,\dots,L-1,\\ \Psi(x)&=A_{L}x_{L}+b_{L},\end{aligned}\right.$

where $d_{0}={D_{\mathcal{X}}}$ and $d_{L+1}={D_{\mathcal{Y}}}$. Here the activation function $\sigma:\mathbb{R}\to\mathbb{R}$ is extended pointwise to act on any Euclidean space; and in what follows we employ the ReLU activation function $\sigma(x)=\max\{0,x\}$. We let $\theta:=\{A_{\ell},b_{\ell}\}_{\ell=0}^{L}$ and note that we have defined a parametric mapping $\Psi({\,\cdot\,})=\Psi({\,\cdot\,};\theta)$. We define the depth of $\Psi$ as the number of layers, and the size of $\Psi$ as the number of non-zero weights and biases, i.e.

$\mathrm{depth}(\Psi)=L,\quad\mathrm{size}(\Psi)=\sum_{\ell=0}^{L}\left\{\|A_{\ell}\|_{0}+\|b_{\ell}\|_{0}\right\},$

where $\|{\,\cdot\,}\|_{0}$ counts the number of non-zero entries of a matrix or vector.
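As a concrete illustration of (2.1) and of the associated notions of depth and size, the following minimal Python/NumPy sketch (all names and the random weights are illustrative and not part of the paper) evaluates a ReLU network and counts its non-zero parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)            # ReLU activation sigma(x) = max{0, x}

def forward(x, weights, biases):
    """Evaluate (2.1): x_{l+1} = sigma(A_l x_l + b_l), output A_L x_L + b_L."""
    for A, b in zip(weights[:-1], biases[:-1]):
        x = relu(A @ x + b)
    return weights[-1] @ x + biases[-1]

def depth(weights):
    return len(weights) - 1               # L, the number of activated layers

def size(weights, biases):
    return sum(np.count_nonzero(A) + np.count_nonzero(b)
               for A, b in zip(weights, biases))

# Example: D_X = 3 inputs, one hidden layer of width 5, D_Y = 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 3)), rng.standard_normal((2, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(2)]
x = rng.standard_normal(3)
print(forward(x, weights, biases), depth(weights), size(weights, biases))
```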

2.1.2. Two Simple Facts from ReLU Neural Network Calculus

The following facts will be used without further comment (see e.g. [44, Section 2.2.3] for a discussion of more general results): If $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$ is a ReLU neural network, and $A\in\mathbb{R}^{{D_{\mathcal{X}}}\times d}$ is a matrix, then there exists a ReLU neural network $\widetilde{\Psi}:\mathbb{R}^{d}\to\mathbb{R}^{D_{\mathcal{Y}}}$, such that

(2.2) $\left\{\begin{aligned} \widetilde{\Psi}(x)&=\Psi(Ax),\qquad\text{for all }x\in\mathbb{R}^{d},\\ \mathrm{depth}(\widetilde{\Psi})&=\mathrm{depth}(\Psi)+1,\\ \mathrm{size}(\widetilde{\Psi})&\leq 2\|A\|_{0}+2\,\mathrm{size}(\Psi).\end{aligned}\right.$

Similarly, if $V\in\mathbb{R}^{d\times{D_{\mathcal{Y}}}}$ is a matrix, then there exists a ReLU neural network $\widehat{\Psi}:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{d}$, such that

(2.3) $\left\{\begin{aligned} \widehat{\Psi}(x)&=V\Psi(x),\qquad\text{for all }x\in\mathbb{R}^{D_{\mathcal{X}}},\\ \mathrm{depth}(\widehat{\Psi})&=\mathrm{depth}(\Psi)+1,\\ \mathrm{size}(\widehat{\Psi})&\leq 2\|V\|_{0}+2\,\mathrm{size}(\Psi).\end{aligned}\right.$

The main non-trivial issue in (2.2) and (2.3) is to preserve the potentially sparse structure of the underlying neural networks; this is based on a concept of “sparse concatenation” from [46].
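To illustrate the mechanism behind (2.2), the following sketch builds, from a given ReLU network $\Psi$ and matrix $A$, a network computing $x\mapsto\Psi(Ax)$ with one extra layer, using the ReLU identity $y=\sigma(y)-\sigma(-y)$. This is one possible construction in the spirit of the sparse concatenation of [46]; all concrete names are illustrative.

```python
import numpy as np

def relu(x): return np.maximum(x, 0.0)

def forward(x, Ws, bs):
    for A, b in zip(Ws[:-1], bs[:-1]):
        x = relu(A @ x + b)
    return Ws[-1] @ x + bs[-1]

def compose_with_input_matrix(Ws, bs, A):
    """Sketch for (2.2): build a network computing x -> Psi(A x).

    A new first layer produces (relu(Ax), relu(-Ax)); the original first layer
    is applied to relu(Ax) - relu(-Ax) = Ax, so its weights appear twice
    (hence the factor 2 in the size bound) and the depth grows by one."""
    W0 = np.vstack([A, -A])
    b0 = np.zeros(2 * A.shape[0])
    W1 = np.hstack([Ws[0], -Ws[0]])
    return [W0, W1] + Ws[1:], [b0] + bs

# Numerical check that the composed network equals Psi(Ax).
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
bs = [rng.standard_normal(4), rng.standard_normal(1)]
A = rng.standard_normal((3, 2))
x = rng.standard_normal(2)
Wt, bt = compose_with_input_matrix(Ws, bs, A)
assert np.allclose(forward(x, Wt, bt), forward(A @ x, Ws, bs))
```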

2.1.3. Approximation Theory and CoD for ReLU Networks

One overall finding of research into the approximation power of ReLU neural networks is that, for function approximation in spaces characterized by smoothness, neural networks cannot entirely overcome the curse of dimensionality [54, 55, 29, 38]. This is illustrated by the following result, which builds on [54, Thm. 5] derived by D. Yarotsky:

Proposition 2.1 (Neural Network CoD).

Let $r\in\mathbb{N}$ be given. For any dimension $D\in\mathbb{N}$, there exist $f_{D,r}\in C^{r}([0,1]^{D};\mathbb{R})$ and constants $\overline{\epsilon},\gamma>0$, such that any ReLU neural network $\Psi:\mathbb{R}^{D}\to\mathbb{R}$ achieving accuracy

$\sup_{x\in[0,1]^{D}}|f_{D,r}(x)-\Psi(x)|\leq\epsilon,$

with $\epsilon\leq\overline{\epsilon}$, has size at least $\mathrm{size}(\Psi)\geq\epsilon^{-\gamma D/r}$. The constant $\overline{\epsilon}=\overline{\epsilon}(r)>0$ depends only on $r$, and $\gamma>0$ is universal.

The proof of Proposition 2.1 is included in Appendix A.1. Proposition 2.1 shows that neural network approximation of a function between high-dimensional Euclidean spaces suffers from a curse of dimensionality, characterised by an algebraic complexity with a potentially large exponent proportional to the dimension $D\gg 1$. This lower bound is similar to the approximation rates (upper bounds) achieved by traditional methods such as polynomial approximation (ignoring the potentially beneficial factor $\gamma$). This fact suggests that the empirically observed efficiency of neural networks may well rely on additional structure of functions $f$ of practical interest, beyond their smoothness; for relevant results in this direction see, for example, [41, 21].
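For a rough sense of scale, the snippet below tabulates the lower bound $\epsilon^{-\gamma D/r}$ of Proposition 2.1 for a few dimensions; since the universal constant $\gamma$ is not given explicitly, the value $\gamma=1$ used here is purely an illustrative assumption.

```python
# Illustrative only: gamma = 1 is an assumed placeholder for the universal
# constant of Proposition 2.1; its true value is not specified here.
gamma, r, eps = 1.0, 2, 1e-2
for D in (1, 2, 4, 8, 16):
    print(D, f"{eps ** (-gamma * D / r):.3e}")   # lower bound on size(Psi)
```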

2.2. Curse of Parametric Complexity in Operator Learning

In the present work, we consider the approximation of an underlying operator $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ acting between Banach spaces; specifically, we assume that the dimensions of the spaces $\mathcal{X},\mathcal{Y}$ are infinite. Given the curse of dimensionality in the finite-dimensional setting, Proposition 2.1, and letting $D\to\infty$, one would generally expect a super-algebraic, potentially even exponential, lower bound on the "complexity" of neural operators $\mathcal{S}$ approximating such $\mathcal{S}^{\dagger}$, as a function of the accuracy $\epsilon$. In this subsection, we make this statement precise for a general class of neural operators, in Theorem 2.11. This is preceded by a discussion of relevant structure of compact sets in infinite-dimensional function spaces and a discussion of a suitable class of "neural operators".

2.2.1. Infinite-dimensional hypercubes

Proposition 2.1 was stated for the unit cube [0,1]D[0,1]^{D} as the underlying domain. In the finite-dimensional setting of Proposition 2.1, the approximation rate turns out to be independent of the underlying compact domain, provided that the domain has non-empty interior and assuming a Lipschitz continuous boundary. This is in contrast to the infinite dimensional case, where compact domains necessarily have empty interior and where the convergence rate depends on the specific structure of the domain. To state our complexity bound for operator approximation, we will therefore need to discuss the prototypical structure of compact subsets K𝒳K\subset\mathcal{X}.

To fix intuition, we temporarily consider 𝒳\mathcal{X} a function space (for example a Hölder, Lebesgue or Sobolev space). In this case, the most common way to define a compact subset K𝒳K\subset\mathcal{X} is via a smoothness constraint, as illustrated by the following concrete example:

Example 2.2.

Assume that $\mathcal{X}=C^{s}(\Omega)$ is the space of $s$-times continuously differentiable functions on a bounded domain $\Omega\subset\mathbb{R}^{d}$. Then for $\rho>s$ and upper bound $M>0$, the subset $K\subset\mathcal{X}$ defined by

(2.4) $K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq M\right\}},$

is a compact subset of $\mathcal{X}$. Here, we define the $C^{\rho}$-norm as

(2.5) $\|u\|_{C^{\rho}}=\max_{|\nu|\leq\rho}\sup_{x\in\Omega}|D^{\nu}u(x)|.$

To better understand the approximation theory of operators 𝒮:K𝒳𝒴\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}, we would like to understand the structure of such KK. Our point of view is inspired by Fourier analysis, according to which smoothness of uKu\in K roughly corresponds to a decay rate of the Fourier coefficients of uu. In particular, uu is guaranteed to belong to the set (2.4), if uu is of the form,

(2.6) $u=A\sum_{j=1}^{\infty}j^{-\alpha}y_{j}e_{j},\qquad y_{j}\in[0,1]\text{ for all }j\in\mathbb{N},$

for a sufficiently large decay rate $\alpha>0$, small constant $A>0$, and where $e_{j}:\mathbb{R}^{d}\to\mathbb{R}$ denotes the periodic Fourier (sine/cosine) basis, restricted to $\Omega$. We include a proof of this fact at the end of this subsection (see Lemma 2.7), where we also identify the relevant decay rate $\alpha$. In this sense, the set $K$ in (2.4) could be said to "contain" an infinite-dimensional hypercube $\prod_{j=1}^{\infty}[0,Aj^{-\alpha}]$, with decay rate $\alpha$. Such hypercubes will replace the finite-dimensional unit cube $[0,1]^{D}$ in our analysis of operator approximation in infinite dimensions.
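As a concrete sketch of (2.6) in one dimension ($d=1$), the following code assembles a truncated element $u$ of such a hypercube from coefficients $y_{j}\in[0,1]$, using an $L^{2}$-normalized sine basis on $\Omega=(0,1)$; the truncation level, the constants and the choice of normalization are illustrative rather than those used in the proofs.

```python
import numpy as np

def cube_element(y, A=0.1, alpha=2.5):
    """Truncated version of (2.6): u(x) = A * sum_j j^{-alpha} y_j e_j(x),
    with e_j(x) = sqrt(2) sin(pi j x) on Omega = (0, 1)."""
    j = np.arange(1, len(y) + 1)
    def u(x):
        basis = np.sqrt(2.0) * np.sin(np.pi * np.outer(np.asarray(x, float), j))
        return A * basis @ (j ** (-alpha) * y)
    return u

rng = np.random.default_rng(2)
y = rng.uniform(0.0, 1.0, size=64)        # coefficients y_j in [0, 1]
u = cube_element(y)
print(u(np.linspace(0.0, 1.0, 5)))
```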

We would like to point out that a similar observation holds for many other sets KK defined by a smoothness constraint, such as sets in Sobolev spaces defined by a smoothness constraint, {uHρM}Hs(Ω)\{\|u\|_{H^{\rho}}\leq M\}\subset H^{s}(\Omega), or more generally {uWρ,pM}Ws,p(Ω)\{\|u\|_{W^{\rho,p}}\leq M\}\subset W^{s,p}(\Omega), for any 1p1\leq p\leq\infty, but also Besov spaces, spaces of functions of bounded variation and others share a similar structure. Bounded balls in all of these spaces contain infinite-dimensional hypercubes, consisting of elements of the form (2.6). We note in passing that, in general, it may be more natural to replace the trigonometric basis in (2.6) by some other choice of basis, such as polynomials, splines, wavelets, or a more general (frame) expansion. We refer to e.g. [9, 22] for general background and [24] for an example of such a setting in the context of holomorphic operators.

The above considerations lead us to the following definition of an abstract hypercube:

Definition 2.3.

Let $e_{1},e_{2},\dots\in\mathcal{X}$ be a sequence of linearly independent and normed elements, $\|e_{j}\|_{\mathcal{X}}=1$. Given constants $A>0$, $\alpha>1$, we say that $K\subset\mathcal{X}$ contains an infinite-dimensional cube $Q_{\alpha}=Q_{\alpha}(A;e_{1},e_{2},\dots)$, if:

  (1) $K$ contains the set $Q_{\alpha}$ consisting of all $u$ of the form (2.6);

  (2) the set $\{e_{j}\}_{j\in\mathbb{N}}$ possesses a bounded bi-orthogonal sequence of functionals, labelled $e_{1}^{\ast},e^{\ast}_{2},\dots\in\mathcal{X}^{\ast}$, in the continuous dual of $\mathcal{X}$ (if $\mathcal{X}$ is a Hilbert space, such a bi-orthogonal sequence always exists for independent $e_{1},e_{2},\dots\in\mathcal{X}$); i.e. we assume that $e^{\ast}_{k}(e_{j})=\delta_{jk}$ for all $j,k\in\mathbb{N}$, and that there exists $M>0$, such that $\|e^{\ast}_{k}\|_{\mathcal{X}^{\ast}}\leq M$ for all $k\in\mathbb{N}$.

Remark 2.4.

Property (2) in Definition 2.3, i.e. the assumed existence of a bi-orthogonal sequence $e^{\ast}_{j}$, ensures that there exist "coordinate projections": if $u$ is of the form (2.6), then the $j$-th coefficient $y_{j}$ can be retrieved from $u$ as $y_{j}=A^{-1}j^{\alpha}e^{\ast}_{j}(u)$. This allows us to uniquely identify $u\in Q_{\alpha}$ with a set of coefficients $(y_{1},y_{2},\dots)\in[0,1]^{\infty}$.
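Continuing in the sine-basis setting sketched above (where $L^{2}$ inner products play the role of the bi-orthogonal functionals $e^{\ast}_{j}$; again purely illustrative), the coordinate projection of Remark 2.4 can be checked numerically via $y_{j}=A^{-1}j^{\alpha}e^{\ast}_{j}(u)$.

```python
import numpy as np

A, alpha, J, N = 0.1, 2.5, 64, 1000
x = (np.arange(N) + 0.5) / N                        # midpoint grid on Omega = (0, 1)
j = np.arange(1, J + 1)
E = np.sqrt(2.0) * np.sin(np.pi * np.outer(j, x))   # e_j sampled on the grid

rng = np.random.default_rng(2)
y = rng.uniform(0.0, 1.0, size=J)
u_vals = A * (j ** (-alpha) * y) @ E                # u = A sum_j j^{-alpha} y_j e_j, cf. (2.6)

# e_j^*(u) = <e_j, u>_{L^2(0,1)}, approximated by the midpoint rule;
# then y_j = A^{-1} j^alpha e_j^*(u), as in Remark 2.4.
inner = (E * u_vals).sum(axis=1) / N
y_rec = (j ** alpha) * inner / A
print(np.max(np.abs(y_rec - y)))                    # tiny: exact up to floating point here
```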

Remark 2.5.

The decay rate α\alpha of the infinite-dimensional cube QαKQ_{\alpha}\subset K provides a measure of its “asymptotic size” or “complexity”. In terms of our complexity bounds, this decay rate will play a special role. Hence, we will usually retain this dependence explicitly by writing QαQ_{\alpha}, but suppress the additional dependence on AA and e1,e2,e_{1},e_{2},\dots in the following.

The notion of infinite-dimensional cubes introduced in Definition 2.3 is only a minor generalization of an established notion of cube embeddings, introduced by Donoho [18] in a Hilbert space setting. We refer to [10, Chap. 5] for a pedagogical exposition of such cube embeddings in the Hilbert space setting, and their relation to the Kolmogorov entropy of KK.

Remark 2.6.

The complexity bounds established in this work will be based on infinite-dimensional hypercubes. An interesting question, left for future work, is whether our main result on the curse of parametric complexity, Theorem 2.11 below, could be stated directly in terms of the Kolmogorov complexity of KK, or other notions such as the Kolmogorov nn-width [17].

Our definition of an infinite-dimensional hypercube is natural in view of the following lemma, the discussion following Example 2.2, and other similar results.

Lemma 2.7.

Assume that 𝒳=Cs(Ω)\mathcal{X}=C^{s}(\Omega) is the space of ss-times continuously differentiable functions on a bounded domain Ωd\Omega\subset\mathbb{R}^{d}. Choose ρ>s\rho>s and define KK, compact in 𝒳\mathcal{X}, by

$K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq M\right\}},$

with constant $M>0$. Then $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$, for any $\alpha>1+\frac{\rho-s}{d}$.

The proof of Lemma 2.7 is included in Appendix A. While we have focused on spaces of continuously differentiable functions, similar considerations also apply to other smoothness spaces, such as the Sobolev spaces Ws,pW^{s,p}, and more general Besov spaces.

2.2.2. Curse of Parametric Complexity

The main question to be addressed in the present section is the following: given $K\subset\mathcal{X}$ compact, $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ an $r$-times Fréchet differentiable operator to be approximated by a neural operator $\mathcal{S}:\mathcal{X}\to\mathcal{Y}$, and given a desired approximation accuracy $\epsilon>0$, how many tunable parameters (in the architecture of $\mathcal{S}$) are required to achieve

$\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}(u)\|_{\mathcal{Y}}\leq\epsilon?$

The answer to this question clearly depends on our assumptions on $K\subset\mathcal{X}$, $\mathcal{Y}$ and the class of neural operators $\mathcal{S}$.

Assume $K\subset\mathcal{X}$ contains a hypercube $Q_{\alpha}$.

Consistent with our discussion in the last subsection, we will assume that $K\subset\mathcal{X}$ contains an infinite-dimensional hypercube $Q_{\alpha}$, as introduced in Definition 2.3, with algebraic decay rate $\alpha>0$.

Assume $\mathcal{Y}=\mathbb{R}$, i.e. $\mathcal{S}^{\dagger}$ is a functional.

The approximation of an operator $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ with potentially infinite-dimensional output space $\mathcal{Y}$ is generally harder to achieve than the approximation of a functional $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R}$ with one-dimensional output; indeed, if $\dim(\mathcal{Y})\geq 1$, then $\mathbb{R}$ can be embedded in $\mathcal{Y}$ and any functional $\mathcal{X}\to\mathbb{R}$ gives rise to an operator $\mathcal{X}\to\mathcal{Y}$ under this embedding. To simplify our discussion, we will therefore initially restrict attention to the approximation of functionals, with the aim of showing that even the approximation of $r$-times Fréchet differentiable functionals is generally intractable.

Assume $\mathcal{S}$ is of neural network-type.

Assuming that $\mathcal{Y}=\mathbb{R}$, we must finally introduce a rigorous notion of the relevant class of approximating functionals $\mathcal{S}:\mathcal{X}\to\mathbb{R}$, i.e. define a class of "neural operators/functionals" approximating the functional $\mathcal{S}^{\dagger}$.

Definition 2.8 (Functional of neural network-type).

We will say that a (neural) functional $\mathcal{S}:\mathcal{X}\to\mathbb{R}$ is a "functional of neural network-type", if it can be written in the form

(2.7) $\mathcal{S}(u)=\Phi(\mathcal{L}u),\quad\forall\,u\in K,$

where for some $\ell\in\mathbb{N}$, $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{\ell}$ is a linear map, and $\Phi:\mathbb{R}^{\ell}\to\mathbb{R}$ is a ReLU neural network (potentially sparse).

If $\mathcal{S}$ is a functional of neural network-type, we define the complexity of $\mathcal{S}$, denoted $\mathrm{cmplx}(\mathcal{S})$, as the smallest size of a neural network $\Phi$ for which there exists a linear map $\mathcal{L}$ such that a representation of the form (2.7) holds, i.e.

(2.8) $\mathrm{cmplx}(\mathcal{S}):=\min_{\Phi}\mathrm{size}(\Phi),$

where the minimum is taken over all possible $\Phi$ in (2.7).
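A minimal sketch of Definition 2.8, with every concrete choice (the grid, the three linear functionals and the random network weights) purely illustrative: $\mathcal{L}$ is realized by a few discretized integrals of $u$, and $\Phi$ is a small ReLU network, so that $\mathrm{cmplx}(\mathcal{S})\leq\mathrm{size}(\Phi)$.

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)

# Linear map L : X -> R^3, here realized as integrals of u against three fixed
# weight functions on Omega = (0, 1), approximated by a midpoint rule (illustrative).
xs = (np.arange(1000) + 0.5) / 1000
weight_fns = np.stack([np.ones_like(xs), xs, np.cos(np.pi * xs)])

def L(u):
    return (weight_fns * u(xs)).mean(axis=1)

# ReLU network Phi : R^3 -> R with randomly initialized (untrained) weights.
rng = np.random.default_rng(3)
A0, b0 = rng.standard_normal((8, 3)), rng.standard_normal(8)
A1, b1 = rng.standard_normal((1, 8)), rng.standard_normal(1)

def Phi(z):
    return (A1 @ relu(A0 @ z + b0) + b1).item()

def S(u):
    """Functional of neural network-type, cf. (2.7): S(u) = Phi(L u)."""
    return Phi(L(u))

print(S(lambda x: np.sin(2 * np.pi * x)))
```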

Remark 2.9.

Without loss of generality, we may assume that size(Φ)\ell\leq\mathrm{size}(\Phi) in (2.7) and (2.8). Indeed, if this is not the case, then size(Φ)<\mathrm{size}(\Phi)<\ell and we can show that it is possible to construct another representation pair (Φ~,~)(\widetilde{\Phi},\widetilde{\mathcal{L}}) in (2.7), consisting of a neural network Φ~:~\widetilde{\Phi}:\mathbb{R}^{\widetilde{\ell}}\to\mathbb{R}, linear map ~:𝒳~\widetilde{\mathcal{L}}:\mathcal{X}\to\mathbb{R}^{\widetilde{\ell}} and such that ~size(Φ~)\widetilde{\ell}\leq\mathrm{size}(\widetilde{\Phi}): To see why, let us assume that ~:=size(Φ)<\widetilde{\ell}:=\mathrm{size}(\Phi)<\ell. Let AA be the weight matrix in the first input layer of Φ\Phi. Since

$\|A\|_{0}\leq\widetilde{\ell}<\ell,$

at most ~\widetilde{\ell} columns of AA can be non-vanishing. Write the matrix A=[a1,a2,,a]A=[a_{1},a_{2},\dots,a_{\ell}] in terms of its column vectors. Up to permutation, we may assume that a~+1==a=0a_{\widetilde{\ell}+1}=\dots=a_{\ell}=0. We now drop the corresponding columns in the input layer of Φ\Phi and remove these unused components from the output of the linear map \mathcal{L} in (2.7). This leads to a new map ~:𝒳~\widetilde{\mathcal{L}}:\mathcal{X}\to\mathbb{R}^{\widetilde{\ell}}, with output components (~u)j=(u)j(\widetilde{\mathcal{L}}u)_{j}=(\mathcal{L}u)_{j} for j=1,,~j=1,\dots,\widetilde{\ell}, and we define Φ~:~\widetilde{\Phi}:\mathbb{R}^{\widetilde{\ell}}\to\mathbb{R}, as the neural network that is obtained from Φ\Phi by replacing the input matrix A=[a1,,a]A=[a_{1},\dots,a_{\ell}] by A~=[a1,,a~]\widetilde{A}=[a_{1},\dots,a_{\widetilde{\ell}}]. Then Φ~~=Φ\widetilde{\Phi}\circ\widetilde{\mathcal{L}}=\Phi\circ\mathcal{L}, so that ~\widetilde{\mathcal{L}} and Φ~\widetilde{\Phi} satisfy a representation of the form (2.7), but the dimension ~\widetilde{\ell} satisfies ~=size(Φ)=size(Φ~)\widetilde{\ell}=\mathrm{size}(\Phi)=\mathrm{size}(\widetilde{\Phi}); the first equality is by definition of ~\widetilde{\ell}, and the last equality holds because we only removed zero weights from Φ\Phi. In particular, this ensures that ~size(Φ~)\widetilde{\ell}\leq\mathrm{size}(\widetilde{\Phi}) for this new representation, without affecting the size of the underlying neural network, i.e. size(Φ)=size(Φ~)\mathrm{size}(\Phi)=\mathrm{size}(\widetilde{\Phi}).
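The column-dropping step in Remark 2.9 is mechanical; the following small sketch (NumPy, with an illustrative matrix representation of $\mathcal{L}$ acting on grid values of $u$) removes the input components that $\Phi$ never reads.

```python
import numpy as np

def drop_unused_inputs(A0, L_matrix):
    """Remark 2.9: if some columns of the first-layer weight matrix A0 vanish,
    the corresponding outputs of the linear map L are never used, so both can
    be removed without changing Phi(L u) or size(Phi)."""
    used = np.any(A0 != 0, axis=0)        # input components actually read by Phi
    return A0[:, used], L_matrix[used, :]

# Example: the middle column of A0 is zero, so the second output of L is unused.
A0 = np.array([[1.0, 0.0, 2.0],
               [0.0, 0.0, -1.0]])
L_matrix = np.random.default_rng(4).standard_normal((3, 10))
A0_new, L_new = drop_unused_inputs(A0, L_matrix)
print(A0_new.shape, L_new.shape)          # (2, 2), (2, 10)
```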

More generally, we can consider 𝒴=𝒴(Ω;p)\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p}) a function space, consisting of functions v:Ωpv:\Omega\to\mathbb{R}^{p} with domain Ωd\Omega\subset\mathbb{R}^{d}. Given yΩy\in\Omega, we introduce the point-evaluation map,

$\mathrm{ev}_{y}:\mathcal{Y}\to\mathbb{R}^{p},\quad\mathrm{ev}_{y}(v):=v(y).$

Provided that point-evaluation evy{\mathrm{ev}}_{y} is well-defined for all v𝒴v\in\mathcal{Y}, we can readily extend the above notion to operators, as follows:

Definition 2.10 (Operator of neural network-type).

Let $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ be a function space on which point-evaluation is well-defined. We will say that a (neural) operator $\mathcal{S}:\mathcal{X}\to\mathcal{Y}$ is an "operator of neural network-type", if for any evaluation point $y\in\Omega$, the composition $\mathrm{ev}_{y}\circ\mathcal{S}:\mathcal{X}\to\mathbb{R}^{p}$, $\mathrm{ev}_{y}(\mathcal{S}(u)):=\mathcal{S}(u)(y)$, can be written in the form

(2.9) $\mathrm{ev}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),\quad\forall\,u\in K,$

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{\ell}$ is a linear operator, and $\Phi_{y}:\mathbb{R}^{\ell}\to\mathbb{R}^{p}$ is a ReLU neural network which may depend on the evaluation point $y\in\Omega$. In this case, we define

$\mathrm{cmplx}(\mathcal{S}):=\sup_{y\in\Omega}\min_{\Phi_{y}}\mathrm{size}(\Phi_{y}).$

We next state our main result demonstrating a “curse of parametric complexity” for functionals (and operators) of neural network-type. This is followed by a detailed discussion of the implications of this abstract result for four representative examples of operator learning frameworks: PCA-Net, DeepONet, NOMAD and the Fourier neural operator.

2.3. Main Theorem on Curse of Parametric Complexity

The following result formalizes an analogue of the curse of dimensionality in infinite-dimensions:

Theorem 2.11 (Curse of Parametric Complexity).

Let $K\subset\mathcal{X}$ be a compact set in an infinite-dimensional Banach space $\mathcal{X}$. Assume that $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ for some $\alpha>1$. Then for any $r\in\mathbb{N}$ and $\delta>0$, there exists $\overline{\epsilon}>0$ and an $r$-times Fréchet differentiable functional $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathbb{R}$, such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a functional $\mathcal{S}_{\epsilon}$ of neural network-type,

(2.10) $\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)|\leq\epsilon,$

implies the complexity bound $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$; here $c$, $\overline{\epsilon}>0$ are constants depending only on $\alpha$, $\delta$ and $r$.

Before providing a sketch of the proof of Theorem 2.11, we note the following simple corollary, whose proof is given in Appendix A.4.

Corollary 2.12.

Let $K\subset\mathcal{X}$ be a compact set in an infinite-dimensional Banach space $\mathcal{X}$. Assume that $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ for some $\alpha>1$. Let $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ be a function space with continuous embedding in $C(\Omega;\mathbb{R}^{p})$. Then for any $r\in\mathbb{N}$ and $\delta>0$, there exists $\overline{\epsilon}>0$ and an $r$-times Fréchet differentiable operator $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$, such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by an operator $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$ of neural network-type,

(2.11) $\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,$

implies the complexity bound $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$; here $c$, $\overline{\epsilon}>0$ are constants depending only on $\alpha$, $\delta$ and $r$.

Proof of Theorem 2.11 (Sketch).

Let 𝒮ϵ:𝒳\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathbb{R} be any functional of neural network-type, achieving approximation accuracy (2.10). In view of our definition of cmplx(𝒮ϵ)\mathrm{cmplx}(\mathcal{S}_{\epsilon}) in (2.8), to prove the claim, it suffices to show that if :𝒳\mathcal{L}:\mathcal{X}\to\mathbb{R}^{\ell} is a linear map and Φ:\Phi:\mathbb{R}^{\ell}\to\mathbb{R} is a ReLU neural network, such that

$\mathcal{S}_{\epsilon}(u)=\Phi(\mathcal{L}u),\quad\forall\,u\in\mathcal{X},$

then size(Φ)exp(cϵ1/(α+1+δ)r)\mathrm{size}(\Phi)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}).

The idea behind the proof of this fact is that if K𝒳K\subset\mathcal{X} contains a hypercube QαQ_{\alpha}, then for any DD\in\mathbb{N}, a suitable rescaling of the finite-dimensional cube [0,1]D[0,1]^{D} can be embedded in KK. More precisely, for any DD\in\mathbb{N} there exists an injective linear map ιD:[0,1]DK\iota_{D}:[0,1]^{D}\to K.

If we now consider the composition 𝒮ϵιD:D\mathcal{S}_{\epsilon}\circ\iota_{D}:\mathbb{R}^{D}\to\mathbb{R}, then we observe that we have a decomposition 𝒮ϵιD(x)=Φ(ιD(x))\mathcal{S}_{\epsilon}\circ\iota_{D}(x)=\Phi(\mathcal{L}\circ\iota_{D}(x)), where

$\mathcal{L}\circ\iota_{D}:\mathbb{R}^{D}\to\mathbb{R}^{\ell},$

is linear, and

$\Phi:\mathbb{R}^{\ell}\to\mathbb{R},$

is a ReLU neural network. In particular, there exists a matrix A×DA\in\mathbb{R}^{\ell\times D}, such that ιD(x)=Ax\mathcal{L}\circ\iota_{D}(x)=Ax for all xDx\in\mathbb{R}^{D}, and the mapping ΦD(x):=𝒮ϵιD(x)\Phi_{D}(x):=\mathcal{S}_{\epsilon}\circ\iota_{D}(x) defines a ReLU neural network ΦD(x)=Φ(Ax)\Phi_{D}(x)=\Phi(Ax), whose size can be bounded by

$\begin{aligned} \mathrm{size}(\Phi_{D})&\leq 2\,\mathrm{size}(\Phi)+2\|A\|_{0} &&\text{(Equation 2.2)}\\ &\leq 2\,\mathrm{size}(\Phi)+2\ell D &&\\ &\leq 2\,\mathrm{size}(\Phi)+2\,\mathrm{size}(\Phi)D &&\text{(Remark 2.9)}\\ &\leq 4\,\mathrm{size}(\Phi)D.&&\end{aligned}$

Using Proposition 2.1, for any DD\in\mathbb{N}, we are then able to construct a functional D:K𝒳\mathcal{F}_{D}:K\subset\mathcal{X}\to\mathbb{R}, mimicking the function fD:[0,1]DDf_{D}:[0,1]^{D}\subset\mathbb{R}^{D}\to\mathbb{R} constructed in Proposition 2.1, and such that uniform approximation of D\mathcal{F}_{D} by 𝒮ϵ\mathcal{S}_{\epsilon} (implying similar approximation of fDf_{D} by ΦD\Phi_{D}) incurs a lower complexity bound

$\mathrm{size}(\Phi)\geq(4D)^{-1}\,\mathrm{size}(\Phi_{D})\geq C_{D}\epsilon^{-\gamma D/r},$

where CD>0C_{D}>0 is a constant depending on DD. For this particular functional D\mathcal{F}_{D}, and given the uniform lower bound on size(Φ)\mathrm{size}(\Phi) above, it then follows that

$\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq C_{D}\epsilon^{-\gamma D/r}.$

The first challenge is to make this strategy precise, and to determine the DD-dependency of the constant CDC_{D}. As we will see, this argument leads to a lower bound of roughly the form cmplx(𝒮ϵ)(Dr(1+α)ϵ)γD/r\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r}.

At this point, the argument is still for fixed DD\in\mathbb{N}, and would only lead to an algebraic complexity in ϵ1\epsilon^{-1}. To extend this to an exponential lower bound in ϵ1\epsilon^{-1}, we next observe that if the estimate cmplx(𝒮ϵ)(Dr(1+α)ϵ)γD/r\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r} could in fact be established for all DD\in\mathbb{N} simultaneously, i.e. if we could construct a single functional 𝒮\mathcal{S}^{\dagger}, for which the lower complexity bound cmplx(𝒮ϵ)supD(Dr(1+α)ϵ)γD/r\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim\sup_{D\in\mathbb{N}}(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r} were to hold, then setting D(eϵ)1/(1+α)rD\approx(e\epsilon)^{-1/(1+\alpha)r} on the right would imply that

$\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim\sup_{D\in\mathbb{N}}(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r}\gtrsim\exp\left(c\epsilon^{-1/(1+\alpha)r}\right),$

with suitable $c>0$, leading to an exponential lower complexity bound for such $\mathcal{S}^{\dagger}$. The second main challenge is thus to construct a single $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathbb{R}$ which effectively "embeds" an infinite family of functionals $\mathcal{F}_{D}:K\subset\mathcal{X}\to\mathbb{R}$ with complexity bounds as above. This will be achieved by defining $\mathcal{S}^{\dagger}$ as a weighted sum of suitable functionals $\mathcal{F}_{D}$. The detailed proof is provided in Appendix A.3. ∎
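For completeness, the substitution indicated in the last display of the sketch can be carried out explicitly: choosing $D\approx(e\epsilon)^{-1/(1+\alpha)r}$ gives, up to constants,

$D^{r(1+\alpha)}\epsilon\approx e^{-1}\quad\Longrightarrow\quad\bigl(D^{r(1+\alpha)}\epsilon\bigr)^{-\gamma D/r}\approx e^{\gamma D/r}=\exp\Bigl(\tfrac{\gamma}{r}(e\epsilon)^{-1/(1+\alpha)r}\Bigr)=\exp\bigl(c\,\epsilon^{-1/(1+\alpha)r}\bigr),\qquad c=\tfrac{\gamma}{r}e^{-1/(1+\alpha)r}.$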

Several remarks are in order:

Remark 2.13.

Theorem 2.11 shows rigorously that in general, operator learning suffers from a curse of parametric complexity, in the sense that it is not possible to achieve better than exponential complexity bounds for general classes of operators which are merely determined by their (CrC^{r}- or Lipschitz-) regularity. As explained above, this is a natural infinite-dimensional analogue of the curse of dimensionality in finite-dimensions (cp. Proposition 2.1), and motivates our terminology. We note that the lower bound of Theorem 2.11 qualitatively matches general upper bounds for DeepONets derived in [37]. It would be of interest to determine sharp rates.

Remark 2.14.

Theorem 2.11 is derived for ReLU neural networks. With some effort, the argument could be extended to more general, piecewise polynomial activation functions. While we believe that the curse of parametric complexity has a fundamental character, we would like to point out that, for non-standard neural network architectures, algebraic approximation rates have been obtained [49]; these results build on either “superexpressive” activation functions or other non-standard architectures. Since these networks are not ordinary feedforward ReLU neural networks, the algebraic approximation rates of [49] are not in contradiction with Theorem 2.11. While the parametric complexity of the non-standard neural operators in [49] is exponentially smaller than the lower bound of Theorem 2.11, it is conceivable that storing the neural network weights in practice would require exponential accuracy (number of bits), to account for the highly unstable character of super-expressive constructions.

Remark 2.15.

Theorem 2.11 differs from previous results on the limitations of operator learning frameworks, as e.g. addressed in [35, 51, 34]. Earlier work focuses on the limitations imposed by a linear choice of the reconstruction mapping \mathcal{R}. In contrast, the results of the present work exhibit CkC^{k}-smooth operators and functionals which are fundamentally hard to approximate by neural network-based methods (with ReLU activation), irrespective of the choice of reconstruction.

Remark 2.16.

We finally link our main theorem to a related result for PCA-Net [32, Thm. 3.3], there derived in a complementary Hilbert space setting for 𝒳\mathcal{X} and 𝒴\mathcal{Y}; the result of [32] shows that, for PCA-Net, no fixed algebraic convergence rate can be achieved in the operator learning of general CrC^{r}-operators between Hilbert spaces; this can be viewed as a milder version of the full curse of parametric complexity identified in the present work, expressed by an exponential lower complexity bound in Theorem 2.11.

To further illustrate an implication of Theorem 2.11, we provide the following example:

Example 2.17 (Operator Learning CoD).

Let $\Omega\subset\mathbb{R}^{d}$ be a domain. Let $s,\rho\in\mathbb{Z}_{\geq 0}$ be given, with $s<\rho$, and consider the compact set

$K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq 1\right\}}\subset C^{s}(\Omega).$

Fix $r\in\mathbb{N}$. By Lemma 2.7, $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ for any $\alpha>1+\frac{\rho-s}{d}$. Fix such $\alpha$. Applying Theorem 2.11, it follows that there exist an $r$-times Fréchet differentiable functional $\mathcal{S}^{\dagger}:C^{s}(\Omega)\to\mathbb{R}$ and constants $c,\overline{\epsilon}>0$, such that any family $\mathcal{S}_{\epsilon}:C^{s}(\Omega)\to\mathbb{R}$ of functionals of neural network-type, achieving accuracy

$\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)|\leq\epsilon,\quad\forall\,\epsilon\leq\overline{\epsilon},$

has complexity at least $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(1+\alpha)r})$ for $\epsilon\leq\overline{\epsilon}$. Furthermore, the constants $c,\overline{\epsilon}>0$ depend only on the parameters $r,s,\rho,\alpha$.
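To give a feel for the exponent appearing in Example 2.17, the following snippet evaluates the smallest admissible $\alpha$ and the resulting rate $1/(1+\alpha)r$ for a few parameter choices; the specific values of $d,s,\rho,r$ are purely illustrative.

```python
# Illustrative evaluation of the exponent in Example 2.17: alpha must exceed
# 1 + (rho - s)/d, and the lower bound is exp(c * eps^{-1/(1+alpha)r}).
for d, s, rho, r in [(1, 0, 2, 1), (2, 1, 4, 2), (3, 0, 3, 1)]:
    alpha = 1 + (rho - s) / d + 1e-9        # just above the threshold of Lemma 2.7
    exponent = 1.0 / ((1 + alpha) * r)
    print(f"d={d}, s={s}, rho={rho}, r={r}: exponent 1/((1+alpha)r) ~ {exponent:.3f}")
```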

In the next subsection, we aim to show the relevance of the above abstract result for concrete neural operator architectures. Specifically, we show that three operator learning architectures from the literature are of neural network-type (PCA-Net, DeepONet, NOMAD), and relate our notion of complexity to the required number of tunable parameters for each. Finally, we show that even frameworks which are not necessarily of neural network-type could suffer from a similar curse of parametric complexity. We make this explicit for the Fourier neural operator in subsection 2.5.

2.4. Examples of Operators of Neural Network-Type

We describe three representative neural operator architectures and show that they can be cast in the above framework.

PCA-Net.

We start with the PCA-Net architecture from [2], anticipated in the work [25]. If 𝒳\mathcal{X} and 𝒴\mathcal{Y} are Hilbert spaces, then a neural network can be combined with principal component analysis (PCA) for the encoding and reconstruction on the underlying spaces, to define a neural operator architecture termed PCA-Net; the ingredients of this architecture are orthonormal PCA bases ϕ1𝒳,,ϕD𝒳𝒳𝒳\phi^{\mathcal{X}}_{1},\dots,\phi^{\mathcal{X}}_{{D_{\mathcal{X}}}}\in\mathcal{X} and ϕ1𝒴,,ϕD𝒴𝒴𝒴\phi^{\mathcal{Y}}_{1},\dots,\phi^{\mathcal{Y}}_{{D_{\mathcal{Y}}}}\in\mathcal{Y}, and a neural network mapping Ψ:D𝒳D𝒴\Psi:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}^{{D_{\mathcal{Y}}}}. The encoder \mathcal{E} is obtained by projection onto the {ϕj𝒳}j=1D𝒳\{\phi^{\mathcal{X}}_{j}\}_{j=1}^{{D_{\mathcal{X}}}}, whereas the reconstruction \mathcal{R} is defined by a linear expansion in {ϕj𝒴}j=1D𝒴\{\phi^{\mathcal{Y}}_{j}\}_{j=1}^{{D_{\mathcal{Y}}}}. The resulting PCA-Net neural operator is defined as

(2.12) $\mathcal{S}(u)(y)=\sum_{k=1}^{{D_{\mathcal{Y}}}}\Psi_{k}(\alpha_{1},\dots,\alpha_{D_{\mathcal{X}}})\phi^{\mathcal{Y}}_{k}(y),\qquad\text{with }\alpha_{j}:=\langle u,\phi^{\mathcal{X}}_{j}\rangle,\;j=1,\dots,{D_{\mathcal{X}}}.$

Here the neural network $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$, with components $\Psi_{k}({\,\cdot\,})=\Psi_{k}({\,\cdot\,};\theta)$, depends on parameters which are optimized during training of the PCA-Net. The PCA basis functions $\phi^{\mathcal{Y}}_{1},\dots,\phi^{\mathcal{Y}}_{D_{\mathcal{Y}}}:\Omega\to\mathbb{R}^{p}$, defining the reconstruction, are precomputed from the data using PCA.
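The following sketch shows the structure of a PCA-Net evaluation (2.12) in code; the trigonometric "PCA" bases, the discretized inner product and the untrained random weights are illustrative stand-ins for the data-driven quantities used in practice.

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)

def pca_net(u, y, phi_X, phi_Y, net, inner):
    """Sketch of (2.12): S(u)(y) = sum_k Psi_k(alpha) phi_Y_k(y), alpha_j = <u, phi_X_j>."""
    alpha = np.array([inner(u, phi) for phi in phi_X])     # encoder E (projections)
    beta = net(alpha)                                       # neural network Psi
    return sum(b * phi(y) for b, phi in zip(beta, phi_Y))   # reconstruction R

# Illustrative setup on Omega = (0, 1) with a discretized L^2 inner product.
xs = (np.arange(512) + 0.5) / 512
inner = lambda u, v: np.mean(u(xs) * v(xs))
phi_X = [lambda x, k=k: np.sqrt(2) * np.sin(np.pi * k * x) for k in range(1, 5)]  # D_X = 4
phi_Y = [lambda x, k=k: np.sqrt(2) * np.cos(np.pi * k * x) for k in range(1, 4)]  # D_Y = 3

rng = np.random.default_rng(5)
A0, b0 = rng.standard_normal((16, 4)), rng.standard_normal(16)
A1, b1 = rng.standard_normal((3, 16)), rng.standard_normal(3)
net = lambda a: A1 @ relu(A0 @ a + b0) + b1

print(pca_net(lambda x: x * (1 - x), 0.3, phi_X, phi_Y, net, inner))
```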

Given an evaluation point $y\in\Omega$, the composition $\mathrm{ev}_{y}\circ\mathcal{S}$, between $\mathcal{S}$ and the point-evaluation mapping $\mathrm{ev}_{y}(v)=v(y)$, can now be written in the form

$\mathrm{ev}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),$

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$, $\mathcal{L}u:=(\langle u,\phi_{1}^{\mathcal{X}}\rangle,\dots,\langle u,\phi^{\mathcal{X}}_{{D_{\mathcal{X}}}}\rangle)$ is a linear mapping, and $\Phi_{y}(\alpha):=\sum_{k=1}^{D_{\mathcal{Y}}}\Psi_{k}(\alpha)\phi_{k}^{\mathcal{Y}}(y)$, for fixed $y$, is the composition of a neural network $\Psi$ with a linear read-out; thus, $\Phi_{y}$ is itself a neural network for fixed $y$. This shows that PCA-Net is of neural network-type.

The following lemma shows that the complexity of PCA-Net gives a lower bound on the number of free parameters for the underlying neural network architecture.

Lemma 2.18 (Complexity of PCA-Net).

Assume that $\mathcal{X}$ and $\mathcal{Y}$ are Hilbert spaces, so that PCA-Net is well-defined. Any PCA-Net $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$ is of neural network-type, and

$\mathrm{size}(\Psi)\geq(2p+2)^{-1}\mathrm{cmplx}(\mathcal{S}).$

Note that the dimension $p\in\mathbb{N}$ is fixed by the underlying function space $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$. Thus, Lemma 2.18 implies a lower complexity bound $\mathrm{size}(\Psi)\gtrsim\mathrm{cmplx}(\mathcal{S})$. The detailed proof is given in Appendix A.5.1. It thus follows from Corollary 2.12 that operator learning with PCA-Net suffers from a curse of parametric complexity:

Proposition 2.19 (Curse of parametric complexity for PCA-Net).

Assume the setting of Corollary 2.12, with $\mathcal{X}$ and $\mathcal{Y}$ Hilbert spaces. Then for any $r\in\mathbb{N}$ and $\delta>0$, there exists $\overline{\epsilon}>0$ and an $r$-times Fréchet differentiable operator $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$, such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a PCA-Net $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$

(2.13) $\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,$

with encoder $\mathcal{E}$, neural network $\Psi$ and reconstruction $\mathcal{R}$, implies the complexity bound $\mathrm{size}(\Psi)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$; here $c$, $\overline{\epsilon}>0$ are constants depending only on $\alpha$, $\delta$, $r$ and $p$.

DeepONet.

A conceptually similar approach is followed by the DeepONet of [39] which differs by learning the form of the representation in 𝒴\mathcal{Y} concurrently with the coefficients, and by allowing for quite general input linear functionals {αj}j=1D𝒳.\{\alpha_{j}\}_{j=1}^{{D_{\mathcal{X}}}}.

The DeepONet architecture defines the encoding :𝒳D𝒳\mathcal{E}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}} by a fixed choice of general linear functionals 1,,D𝒳\ell_{1},\dots,\ell_{D_{\mathcal{X}}}; these may be obtained, for example, by point evaluation at distinct “sensor points” or by projection onto PCA modes as in PCA-Net. The reconstruction :D𝒴𝒴\mathcal{R}:\mathbb{R}^{D_{\mathcal{Y}}}\to\mathcal{Y} is defined by expansion with respect to a set of functions ϕ1,,ϕD𝒴𝒴\phi_{1},\dots,\phi_{D_{\mathcal{Y}}}\in\mathcal{Y} which are themselves finite dimensional neural networks to be learned. The resulting DeepONet can be written as

(2.14) $\mathcal{S}(u)(y)=\sum_{k=1}^{{D_{\mathcal{Y}}}}\Psi_{k}(\alpha_{1},\dots,\alpha_{D_{\mathcal{X}}})\phi_{k}(y),\qquad\text{with }\alpha_{j}:=\ell_{j}(u),\;j=1,\dots,{D_{\mathcal{X}}}.$

Here, both the neural networks $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$, with components $\Psi_{k}=\Psi_{k}({\,\cdot\,};\theta)$, and $\phi:\Omega\to\mathbb{R}^{p\times{D_{\mathcal{Y}}}}$, with components $\phi_{k}=\phi_{k}({\,\cdot\,};\theta)$, depend on parameters which are optimized during training of the DeepONet.
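A corresponding sketch of a DeepONet evaluation (2.14), with point evaluations at illustrative sensor locations as the encoding functionals and untrained, randomly initialized branch and trunk networks:

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)

rng = np.random.default_rng(6)
sensors = np.linspace(0.0, 1.0, 8)               # D_X = 8 point-evaluation functionals

# branch net Psi : R^8 -> R^4 and trunk net phi : Omega -> R^4 (untrained, illustrative)
Ab0, bb0 = rng.standard_normal((32, 8)), rng.standard_normal(32)
Ab1, bb1 = rng.standard_normal((4, 32)), rng.standard_normal(4)
branch = lambda a: Ab1 @ relu(Ab0 @ a + bb0) + bb1

At0, bt0 = rng.standard_normal((32, 1)), rng.standard_normal(32)
At1, bt1 = rng.standard_normal((4, 32)), rng.standard_normal(4)
trunk = lambda y: At1 @ relu(At0 @ np.atleast_1d(y) + bt0) + bt1

def deeponet(u, y):
    """Sketch of (2.14): S(u)(y) = sum_k Psi_k(u(x_1), ..., u(x_{D_X})) phi_k(y)."""
    alpha = u(sensors)                           # encoder: l_j(u) = u(x_j)
    return branch(alpha) @ trunk(y)              # linear expansion in the learned basis

print(deeponet(lambda x: np.sin(2 * np.pi * x), 0.3))
```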

Given an evaluation point $y\in\Omega$, the composition $\mathrm{ev}_{y}\circ\mathcal{S}$, with the point-evaluation mapping $\mathrm{ev}_{y}(v)=v(y)$, can again be written in the form

$\mathrm{ev}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),$

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$, $\mathcal{L}u:=(\ell_{1}(u),\dots,\ell_{{D_{\mathcal{X}}}}(u))$ is linear, where

$\Phi_{y}(\alpha):=\sum_{k=1}^{D_{\mathcal{Y}}}\Psi_{k}(\alpha)\phi_{k}(y),$

and for fixed $y$ the values $\phi_{k}(y)\in\mathbb{R}^{p}$ are just (constant) vectors. Thus, $\Phi_{y}$ is the composition of a neural network $\Psi$ with a linear read-out, and hence is itself a neural network. This shows that DeepONet is of neural network-type.

The next lemma shows that the size can be related to the complexity of DeepONet: also in this case, $\mathrm{cmplx}(\mathcal{S})$ provides a lower bound on the total number of non-zero degrees of freedom of a DeepONet. The detailed proof is provided in Appendix A.5.2.

Lemma 2.20 (Complexity of DeepONet).

Any DeepONet $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$, defined by a branch-net $\Psi$ and trunk-net $\phi$, is of neural network-type, and

$2(\mathrm{size}(\Psi)+\mathrm{size}(\phi))\geq\mathrm{cmplx}(\mathcal{S}).$

The following result is now an immediate consequence of Corollary 2.12 and the above lemma.

Proposition 2.21 (Curse of parametric complexity for DeepONet).

Assume the setting of Corollary 2.12. Then for any $r\in\mathbb{N}$ and $\delta>0$, there exists $\overline{\epsilon}>0$ and an $r$-times Fréchet differentiable operator $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$, such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a DeepONet $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$

(2.15) $\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,$

with branch net $\Psi$ and trunk net $\phi$, implies the complexity bound $\mathrm{size}(\Psi)+\mathrm{size}(\phi)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$; here $c$, $\overline{\epsilon}>0$ are constants depending only on $\alpha$, $\delta$ and $r$.

NOMAD

The linearity in the reconstruction in 𝒴\mathcal{Y} for both PCA-Net and DeepONet imposes a fundamental limitation on their approximation capability [34, 35, 51]. To overcome this limitation, nonlinear extensions of DeepONet have recently been proposed. The following NOMAD architecture [51] provides an example:

NOMAD (NOnlinear MAnifold Decoder) employs encoding by point evaluation at a fixed set of sensor points, or more general linear functionals 1,,D𝒳:𝒳\ell_{1},\dots,\ell_{D_{\mathcal{X}}}:\mathcal{X}\to\mathbb{R}. The reconstruction :D𝒴𝒴\mathcal{R}:\mathbb{R}^{D_{\mathcal{Y}}}\to\mathcal{Y} in the output space 𝒴=𝒴(Ω;p)\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p}) is defined via a neural network Q:D𝒴×ΩpQ:\mathbb{R}^{{D_{\mathcal{Y}}}}\times\Omega\to\mathbb{R}^{p}, which depends jointly on encoded output coefficients in D𝒴\mathbb{R}^{{D_{\mathcal{Y}}}} and the evaluation point yΩy\in\Omega; as in the two previous examples, Ψ:D𝒳D𝒴\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}} is again a neural network. The resulting NOMAD mapping, for u𝒳u\in\mathcal{X} and yΩy\in\Omega, is given by

(2.16) $\mathcal{S}(u)(y)=Q(\Psi(\alpha_{1},\dots,\alpha_{D_{\mathcal{X}}}),y),\qquad\text{with }\alpha_{j}:=\ell_{j}(u),$

for $j=1,\dots,{D_{\mathcal{X}}}$. We note that the main difference between DeepONet and NOMAD is that the linear expansion in (2.14) has been replaced by a nonlinear composition with the neural network $Q$. Both neural networks $\Psi$ and $Q$ are optimized during training.
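A corresponding sketch of a NOMAD evaluation (2.16); as before, the sensor locations and the untrained networks $\Psi$ and $Q$ are purely illustrative.

```python
import numpy as np

def relu(z): return np.maximum(z, 0.0)

rng = np.random.default_rng(7)
sensors = np.linspace(0.0, 1.0, 8)               # D_X = 8 linear functionals (point values)

# Psi : R^8 -> R^4 and nonlinear decoder Q : R^4 x Omega -> R (untrained, illustrative)
Ap0, bp0 = rng.standard_normal((32, 8)), rng.standard_normal(32)
Ap1, bp1 = rng.standard_normal((4, 32)), rng.standard_normal(4)
Psi = lambda a: Ap1 @ relu(Ap0 @ a + bp0) + bp1

Aq0, bq0 = rng.standard_normal((32, 5)), rng.standard_normal(32)  # input (beta, y) in R^{4+1}
Aq1, bq1 = rng.standard_normal((1, 32)), rng.standard_normal(1)
Q = lambda beta, y: (Aq1 @ relu(Aq0 @ np.append(beta, y) + bq0) + bq1)[0]

def nomad(u, y):
    """Sketch of (2.16): S(u)(y) = Q(Psi(l_1(u), ..., l_{D_X}(u)), y)."""
    return Q(Psi(u(sensors)), y)

print(nomad(lambda x: np.cos(2 * np.pi * x), 0.3))
```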

Given an evaluation point $y\in\Omega$, the composition $\mathrm{ev}_{y}\circ\mathcal{S}$ with the point-evaluation can be written in the form

$\mathrm{ev}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),$

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$, $\mathcal{L}u:=(\ell_{1}(u),\dots,\ell_{D_{\mathcal{X}}}(u))$ is linear, and $\alpha\mapsto\Phi_{y}(\alpha):=Q(\Psi(\alpha),y)$ defines a neural network for fixed $y$. This shows that NOMAD is of neural network-type. Finally, the following lemma provides an estimate on the complexity of NOMAD:

Lemma 2.22 (Complexity of NOMAD).

Any NOMAD operator $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$ defined by a branch-net $\Psi$ and non-linear reconstruction $Q$ is of neural network-type, and

(2.17) $2(\mathrm{size}(\Psi)+\mathrm{size}(Q))\geq\mathrm{cmplx}(\mathcal{S}).$

The expression on the left-hand side of (2.17) is, up to the factor 2, the total number of non-zero degrees of freedom in the NOMAD architecture and, as for PCA-Net and DeepONet, it is lower bounded by our notion of complexity. For the proof, we refer to Appendix A.5.3.

The following result is now an immediate consequence of Corollary 2.12 and the above lemma.

Proposition 2.23 (Curse of parametric complexity for NOMAD).

Assume the setting of Corollary 2.12. Then for any rr\in\mathbb{N} and δ>0\delta>0, there exists ϵ¯>0\overline{\epsilon}>0 and an rr-times Fréchet differentiable functional 𝒮:K𝒳𝒴\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}, such that approximation to accuracy ϵϵ¯\epsilon\leq\overline{\epsilon} by a NOMAD neural operator 𝒮ϵ:𝒳𝒴\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}

(2.18) supuK𝒮(u)𝒮ϵ(u)ϵ,\displaystyle\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,

with neural networks Ψ\Psi and QQ, implies complexity bound size(Ψ)+size(Q)exp(cϵ1/(α+1+δ)r)\mathrm{size}(\Psi)+\mathrm{size}(Q)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}); here cc, ϵ¯>0\overline{\epsilon}>0 are constants depending only on α\alpha, δ\delta and rr.

Discussion.

For all examples above, a general lower bound on the complexity \mathrm{cmplx}(\mathcal{S}) of operators of neural network-type implies a lower bound on the total number of degrees of freedom of the particular architecture. In particular, a lower bound on \mathrm{cmplx}(\mathcal{S}) gives a lower bound on the smallest possible number of non-zero parameters that are needed to implement \mathcal{S} in practice. This observation motivates our nomenclature for the complexity.

We emphasize that our notion of complexity only relates to the size of the neural network at the core of these architectures; by design, it does not take into account other factors, such as the additional complexity associated with the practical evaluation of inner products in the PCA-Net architecture, the evaluation of linear encoding functionals of DeepONets, or the numerical representation of an (optimal) output PCA basis for PCA-Net or neural network basis for DeepONet. The important point is that the aforementioned factors can only increase the overall complexity of any concrete implementation; correspondingly, our proposed notion of \mathrm{cmplx}(\mathcal{S}), which neglects some of these additional contributions, can be used to derive rigorous lower bounds on the overall complexity of any implementation.

Remark 2.24.

In passing, we point out similar approaches [26, 35, 45, 57] to the PCA-Net, DeepONet and NOMAD architectures, which share a closely related underlying structure with the examples given above. We fully expect that the curse of parametric complexity applies to all of these architectures.

In Subsection 2.5 we will show that even for operator architectures which are not of neural network-type according to the above definition, we may nevertheless be able to link them with an associated operator of neural network-type. Specifically, we will show this for the FNO in Theorem 2.27. There, we will see that the size (number of tunable parameters) of the FNO can be linked to the complexity of an associated operator of neural network-type. Hence lower bounds on the complexity of operators of neural network-type imply corresponding lower bounds on the FNO.

2.5. The Curse of (Parametric) Complexity for Fourier Neural Operators.

The definition of operators of neural network-type introduced in the previous subsection does not include the FNO, a widely adopted neural operator architecture. However, in this subsection we show that a result similar to Theorem 2.11, stated as Theorem 2.27 below, can be obtained for the FNO.

Due to intrinsic constraints on the domain on which FNO can be (readily) applied, we will assume that the spatial domain Ω=j=1d[aj,bj]d\Omega=\prod_{j=1}^{d}[a_{j},b_{j}]\subset\mathbb{R}^{d} is rectangular. We recall that an FNO,

𝒮:𝒳(Ω;k)𝒴(Ω;p),\mathcal{S}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathcal{Y}(\Omega;\mathbb{R}^{p}),

can be written as a composition, 𝒮=𝒬L1𝒫\mathcal{S}=\mathcal{Q}\circ\mathcal{L}_{L}\circ\dots\circ\mathcal{L}_{1}\circ\mathcal{P}, of a lifting layer 𝒫\mathcal{P}, hidden layers 1,,L\mathcal{L}_{1},\dots,\mathcal{L}_{L} and an output layer 𝒬\mathcal{Q}. In the following, let us denote by 𝒱(Ω;dv)\mathcal{V}(\Omega;\mathbb{R}^{d_{v}}) a generic space of functions from Ω\Omega to dv\mathbb{R}^{d_{v}}.

The nonlinear lifting layer

𝒫:𝒳(Ω;k)𝒱(Ω;dv),u(x)χ(x,u(x)),\mathcal{P}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathcal{V}(\Omega;\mathbb{R}^{d_{v}}),\quad u(x)\mapsto\chi(x,u(x)),

is defined by a neural network χ:Ω×kdv\chi:\Omega\times\mathbb{R}^{k}\to\mathbb{R}^{d_{v}}, depending jointly on the evaluation point xΩx\in\Omega and the components of the input function uu evaluated at xx, namely u(x)ku(x)\in\mathbb{R}^{k}. The dimension dvd_{v} is a free hyperparameter, and determines the number of “channels” (or the “lifting dimension”).

Each hidden layer \mathcal{L}_{\ell}, =1,,L\ell=1,\dots,L, of an FNO is of the form

:𝒱(Ω;dv)𝒱(Ω;dv),vσ(Wv+Kv+b),\mathcal{L}_{\ell}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{V}(\Omega;\mathbb{R}^{d_{v}}),\quad v\mapsto\sigma\left(W_{\ell}v+K_{\ell}v+b_{\ell}\right),

where σ\sigma is a nonlinear activation function applied componentwise, Wdv×dvW_{\ell}\in\mathbb{R}^{d_{v}\times d_{v}} is a matrix, the bias b𝒱(Ω;dv)b_{\ell}\in\mathcal{V}(\Omega;\mathbb{R}^{d_{v}}) is a function and K:𝒱(Ω;dv)𝒱(Ω;dv)K_{\ell}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{V}(\Omega;\mathbb{R}^{d_{v}}) is a non-local operator defined in terms of Fourier multipliers,

Kv(x)=1(T^k[v]k)(x),Kv(x)=\mathcal{F}^{-1}(\widehat{T}_{k}[\mathcal{F}v]_{k})(x),

where [\mathcal{F}v]_{k} denotes the k-th Fourier coefficient of v for k\in\mathbb{Z}^{d}, \widehat{T}_{k}\in\mathbb{C}^{d_{v}\times d_{v}} is a Fourier multiplier matrix indexed by k, and \mathcal{F}^{-1} denotes the inverse Fourier transform. In practice, a Fourier cut-off k_{\mathrm{max}}\in\mathbb{Z} is introduced, and only a finite number of Fourier modes \|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}} is retained (throughout this paper, \|\cdot\|_{\ell^{\infty}} denotes the maximum norm on finite-dimensional Euclidean space). In particular, the number of non-zero components of \widehat{T}=\{\widehat{T}_{k}\}_{k\in\mathbb{Z}^{d}} is bounded by \|\widehat{T}\|_{0}\leq d_{v}^{2}(2k_{\mathrm{max}}+1)^{d}. In the following, we will also assume that the bias functions b_{\ell} are determined by their Fourier components [\widehat{b}_{\ell}]_{k}\in\mathbb{C}^{d_{v}}, \|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}.
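For illustration, the following NumPy sketch implements one such hidden layer on a one-dimensional periodic grid, with the Fourier transform realized by the FFT (anticipating the discretization discussed in connection with Theorem 2.27 below); the grid size, channel width and cut-off are illustrative choices:

import numpy as np

def fno_hidden_layer(v, W, T_hat, b_hat, kmax):
    # One discretized FNO layer v -> ReLU(W v + K v + b) on the 1d periodic grid
    # x_j = 2*pi*j/N.  v has shape (N, d_v); W is (d_v, d_v); T_hat has shape
    # (2*kmax+1, d_v, d_v) and b_hat shape (2*kmax+1, d_v), indexed by k = -kmax..kmax.
    N, d_v = v.shape
    v_hat = np.fft.fft(v, axis=0) / N                   # Fourier coefficients of v
    k = np.fft.fftfreq(N, d=1.0 / N).astype(int)        # integer wave numbers
    out_hat = np.zeros_like(v_hat)
    for i, ki in enumerate(k):
        if abs(ki) <= kmax:                             # retain only modes with |k| <= kmax
            out_hat[i] = T_hat[ki + kmax] @ v_hat[i] + b_hat[ki + kmax]
    # back to grid values; taking the real part is exact when T_hat, b_hat obey
    # conjugate symmetry, which this sketch does not enforce
    Kv_plus_b = np.real(np.fft.ifft(out_hat * N, axis=0))
    return np.maximum(v @ W.T + Kv_plus_b, 0.0)         # componentwise ReLU

# illustrative usage: N = 64 grid points, d_v = 4 channels, cut-off kmax = 8
rng = np.random.default_rng(0)
N, d_v, kmax = 64, 4, 8
v = rng.standard_normal((N, d_v))
W = rng.standard_normal((d_v, d_v))
T_hat = rng.standard_normal((2 * kmax + 1, d_v, d_v)) + 1j * rng.standard_normal((2 * kmax + 1, d_v, d_v))
b_hat = rng.standard_normal((2 * kmax + 1, d_v)) + 1j * rng.standard_normal((2 * kmax + 1, d_v))
print(fno_hidden_layer(v, W, T_hat, b_hat, kmax).shape)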

Finally, the output layer

𝒬:𝒱(Ω;dv)𝒴(Ω;p),vq(x,v(x)),\mathcal{Q}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{Y}(\Omega;\mathbb{R}^{p}),\quad v\mapsto q(x,v(x)),

is defined in terms of a neural network q:Ω×dvpq:\Omega\times\mathbb{R}^{d_{v}}\to\mathbb{R}^{p}, a joint function of the evaluation point xΩx\in\Omega and the components of the output vv of the previous layer evaluated at xx, namely v(x)dvv(x)\in\mathbb{R}^{d_{v}}.

To define the size of an FNO, we note that its tunable parameters are given by: (i) the weights and biases of the neural network \chi defining the lifting layer \mathcal{P}, (ii) the components of the matrices W_{\ell}\in\mathbb{R}^{d_{v}\times d_{v}}, (iii) the components of the Fourier multipliers \widehat{T}_{k}\in\mathbb{C}^{d_{v}\times d_{v}} for \|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}, (iv) the Fourier coefficients [\widehat{b}_{\ell}]_{k}\in\mathbb{C}^{d_{v}}, for \|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}, and (v) the weights and biases of the neural network q defining the output layer \mathcal{Q}. Given an FNO \mathcal{S}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathcal{Y}(\Omega;\mathbb{R}^{p}), we define its size, \mathrm{size}(\mathcal{S}), as the total number of non-zero parameters in this construction. We also follow the convention that for a matrix (or vector) A with complex entries, the number of parameters is defined as \|A\|_{0}=\|\mathrm{Re}(A)\|_{0}+\|\mathrm{Im}(A)\|_{0}.
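As an aside, this parameter count is simple to tabulate. The following helper computes an upper bound on \mathrm{size}(\mathcal{S}) under the assumption that every listed parameter is non-zero (the definition above counts only non-zero entries), with hypothetical values for the lifting and output networks:

def fno_size_upper_bound(size_chi, size_q, L, d_v, k_max, d):
    # Upper bound on size(S) for an FNO with L hidden layers, channel width d_v,
    # Fourier cut-off k_max and spatial dimension d, assuming every parameter is
    # non-zero (the text counts only the non-zero entries).
    modes = (2 * k_max + 1) ** d
    per_layer = d_v ** 2                 # W_l
    per_layer += 2 * d_v ** 2 * modes    # complex multipliers T_hat_k (real + imaginary parts)
    per_layer += 2 * d_v * modes         # complex bias coefficients [b_hat_l]_k
    return size_chi + L * per_layer + size_q

# e.g. L = 4 layers, d_v = 32 channels, k_max = 12 modes in d = 2 dimensions,
# with (hypothetical) lifting and output networks of 3,000 parameters each:
print(fno_size_upper_bound(3000, 3000, L=4, d_v=32, k_max=12, d=2))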

Remark 2.25 (FNO approximation of functionals).

If 𝒮:𝒳(Ω;k)\mathcal{S}^{\dagger}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R} is a (scalar-valued) functional, then we will again identify the output-space 𝒴(Ω;p)\mathcal{Y}(\Omega;\mathbb{R}^{p}) with a space of constant functions. In this case, it is natural to add an averaging operation after the last output layer 𝒬:𝒱(Ω;dv)𝒴(Ω;)\mathcal{Q}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{Y}(\Omega;\mathbb{R}), i.e. we replace 𝒬\mathcal{Q} by 𝒬~:𝒱(Ω;dv)\widetilde{\mathcal{Q}}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathbb{R}, given by

(2.19) 𝒬~(v):=1|Ω|Ωq(x,v(x))𝑑x.\displaystyle\widetilde{\mathcal{Q}}(v):=\frac{1}{|\Omega|}\int_{\Omega}q(x,v(x))\,dx.

This does not introduce any additional degrees of freedom, and ensures that the output is constant. We also note that (2.19) is a special case of a Fourier multiplier KK, involving only the k=0k=0 Fourier mode.

In the following, we will restrict attention to the approximation of functionals, taking into account Remark 2.25. We first mention the following result, proved in Appendix A.6, which shows that FNOs are not of neural network-type, in general:

Lemma 2.26.

Let σ\sigma be the ReLU activation function. Let 𝕋[0,2π]\mathbb{T}\simeq[0,2\pi] denote the 2π2\pi-periodic torus. The FNO,

𝒮:L2(𝕋;),𝒮(u):=Ωσ(u(x))𝑑x,\mathcal{S}:L^{2}(\mathbb{T};\mathbb{R})\to\mathbb{R},\quad\mathcal{S}(u):=\int_{\Omega}\sigma(u(x))\,dx,

is not of neural network-type.

The fact that FNO is not of neural network-type is closely related to the fact that the Fourier transforms at the core of the FNO mapping 𝒮:𝒳(Ω;k)\mathcal{S}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R} cannot be computed exactly. In practice, the FNO therefore needs to be discretized.

A simple discretization 𝒮N\mathcal{S}^{N} of 𝒮\mathcal{S} is readily obtained and commonly used in applications of FNOs. To this end, fix NN\in\mathbb{N}, and let xj1,,jdΩx_{j_{1},\dots,j_{d}}\in\Omega, j1,,jd{1,,N}j_{1},\dots,j_{d}\in\{1,\dots,N\} be a grid consisting of NN equidistant points in each coordinate direction. A numerical approximation 𝒮N:𝒳(Ω;k)\mathcal{S}^{N}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R} of 𝒮\mathcal{S} is obtained by replacing the Fourier transform \mathcal{F} and its inverse 1\mathcal{F}^{-1} in each hidden layer by the discrete Fourier transform N\mathcal{F}_{N} and its inverse N1\mathcal{F}_{N}^{-1}, computed on the equidistant grid. Similarly, the exact average (2.19) is replaced by an average over the grid values. This “discretized” FNO 𝒮N\mathcal{S}^{N} thus defines a mapping

𝒮N:𝒳(Ω;k),u𝒮N(u),\mathcal{S}^{N}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R},\quad u\mapsto\mathcal{S}^{N}(u),

which depends only on the grid values u(xj1,,jd)ku(x_{j_{1},\dots,j_{d}})\in\mathbb{R}^{k} of the input function uu, for multi-indices (j1,,jd){1,,N}d(j_{1},\dots,j_{d})\in\{1,\dots,N\}^{d}. In contrast to 𝒮\mathcal{S}, the discretization 𝒮N\mathcal{S}^{N} is readily implemented in practice. We expect that 𝒮N(u)𝒮(u)\mathcal{S}^{N}({u})\approx\mathcal{S}(u) for sufficiently large NN. Note also that size(𝒮N)=size(𝒮)\mathrm{size}(\mathcal{S}^{N})=\mathrm{size}(\mathcal{S}) by construction.
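A minimal sketch of such a discretized FNO functional (one hidden layer, one-dimensional grid, scalar input channel) might look as follows; the network shapes are illustrative assumptions, and the exact average (2.19) is replaced by the grid average, so that the output indeed depends only on the grid values of u:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def discretized_fno_functional(u_grid, chi_params, W, T_hat, b_hat, q_params, kmax):
    # Toy single-layer discretized FNO functional S^N(u) on the periodic grid
    # x_j = 2*pi*j/N, j = 0..N-1 (1d, scalar input).  The Fourier transform is
    # replaced by the FFT on the grid, and the output layer is averaged over the
    # grid points, cf. (2.19).  All parameter shapes are illustrative assumptions.
    N = u_grid.shape[0]
    x = 2 * np.pi * np.arange(N) / N
    # lifting layer P: v_j = chi(x_j, u(x_j)), applied at every grid point
    A1, c1, A2, c2 = chi_params
    v = relu(np.stack([x, u_grid], axis=1) @ A1.T + c1) @ A2.T + c2     # shape (N, d_v)
    # hidden layer: ReLU(W v + K v + b) with Fourier multipliers on |k| <= kmax
    v_hat = np.fft.fft(v, axis=0) / N
    k = np.fft.fftfreq(N, d=1.0 / N).astype(int)
    out_hat = np.zeros_like(v_hat)
    for i, ki in enumerate(k):
        if abs(ki) <= kmax:
            out_hat[i] = T_hat[ki + kmax] @ v_hat[i] + b_hat[ki + kmax]
    v = relu(v @ W.T + np.real(np.fft.ifft(out_hat * N, axis=0)))
    # averaged output layer (2.19): mean over grid points of q(x_j, v_j)
    B1, e1, B2, e2 = q_params
    q_vals = relu(np.concatenate([x[:, None], v], axis=1) @ B1.T + e1) @ B2.T + e2
    return q_vals.mean()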

Given these preparatory remarks, we can now state our result on the curse of parametric complexity for FNOs, with proof in Appendix A.7.

Theorem 2.27.

Let Ωd\Omega\subset\mathbb{R}^{d} be a cube. Let K𝒳K\subset\mathcal{X} be a compact subset of a Banach space 𝒳=𝒳(Ω;k)\mathcal{X}=\mathcal{X}(\Omega;\mathbb{R}^{k}). Assume that KK contains an infinite-dimensional hypercube QαQ_{\alpha} for some α>1\alpha>1. Then for any rr\in\mathbb{N} and δ>0\delta>0, there exists ϵ¯>0\overline{\epsilon}>0 and an rr-times Fréchet differentiable functional 𝒮:K𝒳\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathbb{R}, such that approximation to accuracy ϵϵ¯\epsilon\leq\overline{\epsilon} by a discretized FNO 𝒮ϵNϵ\mathcal{S}_{\epsilon}^{N_{\epsilon}},

supuK|𝒮(u)𝒮ϵNϵ(u)|ϵ,\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}^{N_{\epsilon}}(u)|\leq\epsilon,

requires either (i) complexity bound size(𝒮ϵNϵ)exp(cϵ1/(α+1+δ)r)\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}), or (ii) discretization parameter Nϵexp(cϵ1/(α+1+δ)r)N_{\epsilon}\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}); here cc, ϵ¯>0\overline{\epsilon}>0 are constants depending only on α\alpha, δ\delta and rr.

Proof.

(Sketch) The proof of Theorem 2.27 relies on the curse of parametric complexity for operators of neural network-type in Theorem 2.11. The first step is to show that discrete FNOs are of neural network-type. As a consequence, Theorem 2.11 implies a lower bound of the form cmplx(𝒮ϵNϵ)exp(cϵ1/(α+1+δ)r)\mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}). The main additional difficulty is that the discrete FNO is not a standard ReLU neural network according to our definition, since it employs a very non-standard architecture involving convolution and Fourier transforms. Hence more work is needed, in order to relate cmplx(𝒮ϵNϵ)\mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}}) to size(𝒮ϵNϵ)\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}}); while the complexity is the minimal number of parameters required to represent 𝒮ϵNϵ\mathcal{S}_{\epsilon}^{N_{\epsilon}} by an ordinary ReLU network architecture, we recall that the size(𝒮ϵNϵ)\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}}) is defined as the number of parameters defining 𝒮ϵNϵ\mathcal{S}_{\epsilon}^{N_{\epsilon}} via the non-standard FNO architecture. Our analysis leads to an upper bound of the form

(2.20) cmplx(𝒮ϵNϵ)N2dsize(𝒮ϵNϵ).\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\lesssim N^{2d}\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}}).

As a consequence of the exponential lower bound, \mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}), inequality (2.20) implies that either N_{\epsilon} or \mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}}) has to be exponentially large in \epsilon^{-1}, proving the claim.

We would like to point out that the additional factor depending on N is natural in view of the fact that even a linear discretized FNO layer v\mapsto Wv, with matrix W\in\mathbb{R}^{d_{v}\times d_{v}}, actually corresponds to a mapping \mathbb{R}^{N^{d}\times d_{v}}\to\mathbb{R}^{N^{d}\times d_{v}}, (v(x_{j_{1},\dots,j_{d}}))\mapsto(Wv(x_{j_{1},\dots,j_{d}})); i.e. the matrix multiplication should be viewed as being carried out in parallel at all N^{d} grid points. Representing this mapping by an ordinary matrix multiplication \mathbb{R}^{N^{d}\times d_{v}}\to\mathbb{R}^{N^{d}\times d_{v}} requires N^{d}\|W\|_{0} degrees of freedom, and thus depends on N in addition to the number of FNO parameters \|W\|_{0}.

For the detailed proof, we refer to Appendix A.7. ∎

Remark 2.28.

Theorem 2.27 shows that the FNO suffers from a similar curse of complexity as do the variants on DeepONet and PCA-Net covered by Theorem 2.11: approximation to accuracy \epsilon of general (C^{r}- or Lipschitz-) operators requires either an exponential number of non-zero degrees of freedom, or exponential computational resources to evaluate even a single forward pass. We note the difference in how the curse is expressed in Theorem 2.27 compared to Theorem 2.11; this is due to the fact that the FNO is not of neural network-type (see Lemma 2.26). Instead, as outlined in the proof sketch above, only upon discretization does the FNO define an operator/functional of neural network-type. The computational complexity of this discretized FNO is determined by both the FNO coefficients and the discretization parameter N.

2.5.1. Discussion

To overcome the general curse of parametric complexity implied by Theorem 2.11 (and Theorem 2.27), efficient operator learning frameworks therefore have to leverage additional structure present in the operators of interest, going beyond C^{r}- or Lipschitz-regularity. Previous work on overcoming this curse has mostly focused on operator holomorphy [23, 50, 34] and the emulation of numerical methods [15, 30, 34] as two basic mechanisms applicable to specific operators of interest. An abstract characterization of the entire class of operators that allow for efficient approximation by neural operators would be very desirable. Unfortunately, this appears to be out of reach at the current state of analysis. Indeed, as far as the authors are aware, no such characterization exists even for standard classes of numerical methods, such as finite difference, finite element or spectral methods, viewed as operator approximators.

The second contribution of the present work is to expose additional structure, different from holomorphy and emulation, that can be leveraged by neural operators. In particular, in the remainder of this paper we identify such structure in the Hamilton-Jacobi equations, and propose a neural operator framework which can build on this structure to provably overcome the curse of parametric complexity in learning the associated solution operator.

3. The Hamilton-Jacobi Equation

In the previous section we demonstrated that, generically, a scaling limit of the curse of dimensionality, the curse of parametric complexity, is to be expected in operator approximation if only C^{r}-regularity of the map is assumed. However, we also outlined in the introduction the many cases where specific structure can be exploited to overcome this curse. In the remainder of the paper we show how the curse can be removed for Hamilton-Jacobi equations. To this end we start, in this section, by setting up the theoretical framework for operator learning in this context.

We are interested in deriving error and complexity estimates for the approximation of the solution operator 𝒮t:Cperr(Ω)Cperr(Ω)\mathcal{S}_{t}^{\dagger}:C^{r}_{\mathrm{per}}(\Omega)\to C^{r}_{\mathrm{per}}(\Omega) associated with the following Hamilton-Jacobi PDE:

(HJ) \displaystyle\left\{\begin{aligned} \partial_{t}u+H(q,\nabla_{q}u)&=0,\quad(q,t)\in\Omega\times(0,T],\\ u(t=0)&=u_{0},\quad(q,t)\in\Omega\times\{0\},\end{aligned}\right.

Here Ωd\Omega\subset\mathbb{R}^{d} is a dd-dimensional domain. To simplify our analysis we consider only the case of a domain Ω=[0,2π]d\Omega=[0,2\pi]^{d} with periodic boundary conditions (so that we may identify Ω\Omega with 𝕋d\mathbb{T}^{d}.) We denote by Cperr(Ω)C^{r}_{\mathrm{per}}(\Omega) the space of rr-times continuously differentiable real-valued functions with 2π2\pi-periodic derivatives, with norm

uCr(Ω):=sup|α|rsupxΩ|Dαu(x)|.\|u\|_{C^{r}(\Omega)}:=\sup_{|\alpha|\leq r}\sup_{x\in\Omega}|D^{\alpha}u(x)|.

By slight abuse of notation, we will similarly denote by u(q,t)Cperr(Ω×[0,T])u(q,t)\in C^{r}_{\mathrm{per}}(\Omega\times[0,T]) and H(q,p)Cperr(Ω×d)H(q,p)\in C^{r}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d}) functions that have CrC^{r} regularity in all variables, and that are periodic in the first variable qΩq\in\Omega, so that in particular,

qu(q,t),andqH(q,p),q\mapsto u(q,t),\quad\text{and}\quad q\mapsto H(q,p),

belong to Cperr(Ω)C^{r}_{\mathrm{per}}(\Omega), for fixed pdp\in\mathbb{R}^{d} or t[0,T].t\in[0,T].

In equation (HJ), u:\Omega\times[0,T]\to\mathbb{R}, (q,t)\mapsto u(q,t) is a function depending on the spatial variable q\in\Omega and on time t\geq 0. We will restrict attention to problems for which a classical solution u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T]), r\geq 2, exists. In this setting the initial value problem (HJ) can be solved by the method of characteristics, and a unique solution may be proved to exist for some T=T(u_{0})>0. We may then define the solution operator \mathcal{S}_{t}^{\dagger} with the property that u(\cdot,t)=\mathcal{S}_{t}^{\dagger}(u_{0}); the local time of existence will, in general, depend on the input u_{0} to \mathcal{S}_{t}^{\dagger}. The next two subsections describe, respectively, the method of characteristics and the existence of the solution operator on a time-interval t\in[0,T], for all initial data from a compact set \mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega), for r\geq 2. Thus we define \{\mathcal{S}_{t}^{\dagger}:\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)\to C^{r}_{\mathrm{per}}(\Omega)\} for all t\in[0,T], with T=T(\mathcal{F}) sufficiently small.

We recall that throughout this paper, use of superscript \dagger denotes an object defined through construction of an exact solution of (HJ), or objects used in the construction of the solution; identical notation without a superscript \dagger denotes an approximation of that object.

3.1. Method of Characteristics for (HJ)

We briefly summarize this methodology; for more details see [19, Section 3.3]. Consider the following Hamiltonian system for t(q(t),p(t))Ω×dt\mapsto(q(t),p(t))\in\Omega\times\mathbb{R}^{d} defined by

(3.1a) {q˙=pH(q,p),q(0)=q0,p˙=qH(q,p),p(0)=p0,\displaystyle\left\{\begin{aligned} \dot{q}&=\nabla_{p}H(q,p),\quad q(0)=q_{0},\\ \dot{p}&=-\nabla_{q}H(q,p),\quad p(0)=p_{0},\end{aligned}\right.
(3.1b) p0=qu0(q0).\displaystyle\quad p_{0}=\nabla_{q}u_{0}(q_{0}).

If the solution u(q,t)u(q,t) of (HJ) is twice continuously differentiable, then uu can be evaluated by solving the ODE (3.1a) with the specific parameterized initial conditions (3.1b). Given these trajectories and t0t\geq 0, the values u(q(t),t)u(q(t),t) can be computed in terms of the “action” along this trajectory:

(3.2) \displaystyle u(q(t),t)=u_{0}(q_{0})+\int_{0}^{t}\mathcal{L}(q(\tau),p(\tau))\,d\tau,

where :Ω×d\mathcal{L}:\Omega\times\mathbb{R}^{d}\to\mathbb{R} is the Lagrangian associated with HH, i.e.

(3.3) (q,p):=ppH(q,p)H(q,p).\displaystyle\mathcal{L}(q,p):=p\cdot\nabla_{p}H(q,p)-H(q,p).

Equivalently, the solution z(t)=u(q(t),t)z(t)=u(q(t),t) can be expressed in terms of the solution of the following system of ODEs, t(q(t),p(t),z(t))t\mapsto(q(t),p(t),z(t)):

(3.4a) {q˙=pH(q,p),q(0)=q0,p˙=qH(q,p),p(0)=p0,z˙=(q,p),z(0)=z0.\displaystyle\left\{\begin{aligned} \dot{q}&=\nabla_{p}H(q,p),\quad q(0)=q_{0},\\ \dot{p}&=-\nabla_{q}H(q,p),\quad p(0)=p_{0},\\ \dot{z}&=\mathcal{L}(q,p),\quad z(0)=z_{0}.\end{aligned}\right.
(3.4b) p0=qu0(q0),z0=u0(q0).\displaystyle\quad p_{0}=\nabla_{q}u_{0}(q_{0}),\quad z_{0}=u_{0}(q_{0}).

The system of ODEs (3.4) is defined on Ω×d×\Omega\times\mathbb{R}^{d}\times\mathbb{R}, with a 2π2\pi-periodic spatial domain Ω=[0,2π]d\Omega=[0,2\pi]^{d}. It can be shown that

(3.5) p(t)qu(q(t),t),for t0,\displaystyle p(t)\equiv\nabla_{q}u(q(t),t),\quad\text{for $t\geq 0$,}

tracks the evolution of the gradient of uu along this trajectory. To ensure the existence of solutions to (3.1), i.e. to avoid trajectories escaping to infinity, we make the following assumption on HH, in which |||{\,\cdot\,}| denotes the Euclidean distance on d\mathbb{R}^{d}:

Assumption 3.1 (Growth at Infinity).

There exists LH>0L_{H}>0, such that

(3.6) supqΩ{pqH(q,p)}LH(1+|p|2),\displaystyle\sup_{q\in\Omega}\left\{-p\cdot\nabla_{q}H(q,p)\right\}\leq L_{H}(1+|p|^{2}),

for all pdp\in\mathbb{R}^{d}.
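As a simple illustration (not taken from the text, but a standard case), a separable Hamiltonian of mechanical form, H(q,p)=\frac{1}{2}|p|^{2}+V(q) with V\in C^{1}_{\mathrm{per}}(\Omega), satisfies (3.6): here \nabla_{q}H(q,p)=\nabla V(q), so that -p\cdot\nabla_{q}H(q,p)\leq|\nabla V(q)||p|\leq\frac{1}{2}\sup_{q\in\Omega}|\nabla V(q)|^{2}+\frac{1}{2}|p|^{2}, and (3.6) holds with L_{H}=\frac{1}{2}\max\{1,\sup_{q\in\Omega}|\nabla V(q)|^{2}\}.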

In the following, we will denote by

(3.7) {Ψt:Ω×d×Ω×d×,(q0,p0,z0)(q(t),p(t),z(t)),\displaystyle\left\{\begin{aligned} \Psi_{t}^{\dagger}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}&\to\Omega\times\mathbb{R}^{d}\times\mathbb{R},\\ (q_{0},p_{0},z_{0})&\mapsto(q(t),p(t),z(t)),\end{aligned}\right.

the semigroup (flow map) generated by the system of ODEs (3.4a); existence of this semigroup, and hence of the solution operator 𝒮t\mathcal{S}_{t}^{\dagger}, is the topic of the next subsection.
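Although the analysis below treats \Psi_{t}^{\dagger} exactly, in computations the flow would be approximated by a standard ODE solver. The following Python sketch integrates the characteristic system (3.4) with classical RK4, using as an example the advection Hamiltonian H(q,p)=v(q)\cdot p of Remark 3.5 below; the velocity field, step count and final time are illustrative assumptions:

import numpy as np

def characteristic_flow(q0, p0, z0, H, grad_q_H, grad_p_H, t, n_steps=100):
    # Approximate Psi_t^dagger(q0, p0, z0) of (3.7) by integrating the characteristic
    # ODEs (3.4) with n_steps steps of classical RK4.  H, grad_q_H, grad_p_H are
    # callables of (q, p); the Lagrangian is taken from (3.3).
    d = len(q0)

    def rhs(state):
        q, p = state[:d], state[d:2 * d]
        dq = grad_p_H(q, p)
        dp = -grad_q_H(q, p)
        dz = p @ grad_p_H(q, p) - H(q, p)        # Lagrangian L(q, p), cf. (3.3)
        return np.concatenate([dq, dp, [dz]])

    state = np.concatenate([q0, p0, [z0]])
    h = t / n_steps
    for _ in range(n_steps):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * h * k1)
        k3 = rhs(state + 0.5 * h * k2)
        k4 = rhs(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    q_t = np.mod(state[:d], 2 * np.pi)           # identify with the periodic domain
    return q_t, state[d:2 * d], state[2 * d]

# example: advection Hamiltonian H(q, p) = v(q) * p in d = 1, with the illustrative
# velocity field v(q) = 1 + 0.5*sin(q)
H = lambda q, p: (1 + 0.5 * np.sin(q[0])) * p[0]
grad_q_H = lambda q, p: np.array([0.5 * np.cos(q[0]) * p[0]])
grad_p_H = lambda q, p: np.array([1 + 0.5 * np.sin(q[0])])
print(characteristic_flow(np.array([1.0]), np.array([0.3]), 0.0, H, grad_q_H, grad_p_H, t=0.5))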

3.2. Short-time Existence of CrC^{r}-solutions

The goal of the present section is to show that for any r\geq 2, and for any compact subset \mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega), there exists a maximal T^{\ast}=T^{\ast}(\mathcal{F})>0, such that for any u_{0}\in\mathcal{F}, there exists a solution u:\Omega\times[0,T^{\ast})\to\mathbb{R} of (HJ), with the property that q\mapsto u(q,t) belongs to C^{r}_{\mathrm{per}}(\Omega) for all t\in[0,T^{\ast}). Thus we may take any T<T^{\ast} in (HJ), for data in \mathcal{F}. Our proof of this fact relies on the Banach space version of the implicit function theorem; see Appendix B, Theorem B.1, for a precise statement.

In the following, given t0t\geq 0 and given initial values (q0,p0)Ω×d(q_{0},p_{0})\in\Omega\times\mathbb{R}^{d}, we define qt(q0,p0)q_{t}(q_{0},p_{0}), pt(q0,p0)p_{t}(q_{0},p_{0}) as the spatial- and momenta-components of the solution of the Hamiltonian ODE (3.1a), respectively; i.e. the solution of (3.1a) is given by t(qt(q0,p0),pt(q0,p0))t\mapsto(q_{t}(q_{0},p_{0}),p_{t}(q_{0},p_{0})) for any initial data (q0,p0)Ω×d(q_{0},p_{0})\in\Omega\times\mathbb{R}^{d}.

Proposition 3.2 (Short-time Existence of Classical Solutions).

Let Assumption 3.1 hold, let r2r\geq 2 and assume that the Hamiltonian HCperr+1(Ω×d)H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d}). Then for any compact subset Cperr(Ω)\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega), there exists T=T()>0T^{\ast}=T^{\ast}(\mathcal{F})>0 such that for any u0u_{0}\in\mathcal{F}, there exists a classical solution uCperr(Ω×[0,T))u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T^{\ast})) of the Hamilton-Jacobi equation (HJ). Furthermore, for any T<TT<T^{\ast}, there exists a constant C=C(T,r,,H)>0C=C(T,r,\mathcal{F},H)>0 such that supt[0,T]u(,t)CrC\sup_{t\in[0,T]}\|u({\,\cdot\,},t)\|_{C^{r}}\leq C.

The proof, which may be found in Appendix B, uses the following two lemmas, also proved in Appendix B. The first result shows that, under Assumption 3.1, the semigroup Ψt\Psi_{t}^{\dagger} in (3.7) is well-defined, globally in time.

Lemma 3.3.

Let HCperr(Ω×d)H\in C_{\mathrm{per}}^{r}(\Omega\times\mathbb{R}^{d}) for r1r\geq 1. Let Assumption 3.1 hold. Then the mapping Ψt\Psi_{t}^{\dagger} given by (3.7) exists for any t0t\geq 0 and tΨtt\mapsto\Psi_{t}^{\dagger} defines a semigroup. In particular, for any (q0,p0)Ω×d(q_{0},p_{0})\in\Omega\times\mathbb{R}^{d}, there exists a solution t(q(t),p(t))t\mapsto(q(t),p(t)) of the ODE system (3.1a) with initial data q(0)=q0q(0)=q_{0}, p(0)=p0p(0)=p_{0}, for all t0t\geq 0.

The following result will be used to show that the method of characteristics can be used to construct solutions, at least over a sufficiently short time-interval.

Lemma 3.4.

Let Assumption 3.1 hold, let r2r\geq 2 and assume that the Hamiltonian HCperr+1(Ω×d)H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d}). Let Cperr(Ω)\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega) be compact. Then there exists a (maximal) time T=T()>0T^{\ast}=T^{\ast}(\mathcal{F})>0, such that for all u0u_{0}\in\mathcal{F}, the spatial characteristic mapping

Φt(;u0):ΩΩ,q0qt(q0,qu0(q0)),\Phi_{t}^{\dagger}({\,\cdot\,};u_{0}):\Omega\to\Omega,\quad q_{0}\mapsto q_{t}\bigl{(}q_{0},\nabla_{q}u_{0}(q_{0})\bigr{)},

defined by (3.1) on the periodic domain Ω=[0,2π]d\Omega=[0,2\pi]^{d}, is a Cr1C^{r-1}-diffeomorphism for any t[0,T)t\in[0,T^{\ast}).

Note that the map Φt(;u0)\Phi_{t}^{\dagger}({\,\cdot\,};u_{0}) is defined by the semigroup Ψt\Psi_{t}^{\dagger} for (q,p,z)(q,p,z); however it only requires solution of the Hamiltonian system (3.1) for (q,p)(q,p). In contrast to the flowmap Ψt\Psi^{\dagger}_{t}, the spatial characteristic map Φt(;u0)\Phi_{t}^{\dagger}({\,\cdot\,};u_{0}) is in general only invertible over a sufficiently short time-interval, leading to a corresponding short-time existence result for the solution operator 𝒮t\mathcal{S}^{\dagger}_{t} associated with (HJ) in Proposition 3.2.

We close this section on the short-time existence with the following remark.

Remark 3.5.

Classical solutions of the Hamilton-Jacobi equation (HJ) can develop singularities in finite time due to the potential crossing of the spatial characteristics q(t) emanating from different points. Indeed, if two spatial characteristics emanating from q_{0} and q_{0}^{\prime} cross in finite time, then (3.2) does not lead to a well-defined function u(q,t), and the method of characteristics breaks down. Thus, our existence theorem is in general restricted to a finite time interval [0,T^{\ast}). In important special cases, the crossing of characteristics is ruled out, and classical solutions exist for all time, i.e. with T^{\ast}=\infty. One such example is the advection equation with velocity field v(q), and corresponding Hamiltonian H(q,\nabla_{q}u)=v(q)\cdot\nabla_{q}u, for which the complexity of operator approximation is studied computationally in [13].

4. Hamilton-Jacobi Neural Operator: HJ-Net

In this section, we will describe an operator learning framework to approximate the solution operator 𝒮t\mathcal{S}^{\dagger}_{t} defined by (HJ) for initial data u0Cperr(Ω)u_{0}\in C^{r}_{\mathrm{per}}(\Omega) with r2r\geq 2. The main motivation for our choice of framework is the observation that the flow map Ψt\Psi_{t}^{\dagger} associated with the system of ODEs (3.4a) can be computed independently of the underlying solution uu. Hence, an operator learning framework for (HJ) can be constructed based on a suitable approximation ΨtΨt\Psi_{t}\approx\Psi_{t}^{\dagger}, where Ψt:Ω×d×Ω×d×\Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R} is a neural network approximation of the flow Ψt\Psi_{t}^{\dagger}. Given such Ψt\Psi_{t}, and upon fixing evaluation points {q0j}j=1NΩ\{q_{0}^{j}\}_{j=1}^{N}\subset\Omega, we propose to approximate the forward operator 𝒮t\mathcal{S}_{t}^{\dagger} of the Hamilton-Jacobi equation (HJ) by 𝒮t\mathcal{S}_{t}, using the following three steps:

Step a)

    encode the initial data u0Cperr(Ω)u_{0}\in C^{r}_{\mathrm{per}}(\Omega) by evaluating it at the points q0jq_{0}^{{j}}:

    :Cperr(Ω)\displaystyle\mathcal{E}:C^{r}_{\mathrm{per}}(\Omega) [Ω×d×]N,\displaystyle\to[\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N},
    u0\displaystyle u_{0} {(q0j,p0j,z0j)}j=1N,\displaystyle\mapsto\{(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}})\}_{j=1}^{N},

    with (q0j,p0j,z0j):=(q0j,qu0(q0j),u0(q0j))(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}}):=(q_{0}^{{j}},\nabla_{q}u_{0}(q_{0}^{{j}}),u_{0}(q_{0}^{{j}}));

Step b)

for each j=1,\dots,N, apply the approximate flow \Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R} to the encoded data, resulting in a map

    ΨtN:[Ω×d×]N\displaystyle\Psi^{N}_{t}:[\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N} [Ω×d×]N,\displaystyle\to[\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N},
    {(q0j,p0j,z0j)}j=1N\displaystyle\{(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}})\}_{j=1}^{N} {(qtj,ptj,ztj)}j=1N,\displaystyle\mapsto\{(q_{t}^{{j}},p_{t}^{{j}},z_{t}^{{j}})\}_{j=1}^{N},

    where (qtj,ptj,ztj):=Ψt(q0j,p0j,z0j)(q_{t}^{{j}},p_{t}^{{j}},z_{t}^{{j}}):=\Psi_{t}(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}}), for j=1,,Nj=1,\dots,N;

Step c)

    approximate the underlying solution at a fixed time t[0,T]t\in[0,T], for TT sufficiently small, by interpolating the data (input/output pairs) {(qtj,ztj)}j=1N\{(q_{t}^{{j}},z_{t}^{{j}})\}_{j=1}^{N}, leading to a reconstruction map:

    :[Ω×]N\displaystyle\mathcal{R}:[\Omega\times\mathbb{R}]^{N} Cr(Ω),\displaystyle\to C^{r}(\Omega),
\{(q_{t}^{{j}},z_{t}^{{j}})\}\displaystyle\mapsto f_{z,Q}.

If we let \mathcal{P} denote the projection map which takes [\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N} to [\Omega\times\mathbb{R}]^{N} then, for fixed T sufficiently small, we obtain an approximation of \mathcal{S}_{t}^{\dagger}:C^{r}_{\mathrm{per}}(\Omega)\to C^{r}_{\mathrm{per}}(\Omega), denoted \mathcal{S}_{t}:C^{r}_{\mathrm{per}}(\Omega)\to C^{r}(\Omega) and given by

(4.1) 𝒮t=𝒫ΨtN.\mathcal{S}_{t}=\mathcal{R}\circ\mathcal{P}\circ\Psi^{N}_{t}\circ\mathcal{E}.

It is a consequence of Proposition 3.2 that our approximation \mathcal{S}_{t} is well-defined for all inputs u_{0} from a compact subset \mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega), r\geq 2, in some interval t\in[0,T], for T sufficiently small. However, the resulting approximation does not obey the semigroup property with respect to t and should be interpreted as holding for a fixed t\in[0,T], T sufficiently small. Furthermore, iterating the map obtained for any such fixed t is not in general possible unless \mathcal{S}_{t} maps \mathcal{F} into itself. That \mathcal{S}_{t}^{\dagger} maps \mathcal{F} into itself would also be required to prove the existence of a semigroup for (HJ); for our operator approximator \mathcal{S}_{t} we would additionally need to ensure periodicity of the reconstruction step, something we do not address in this paper.

If the underlying solution u(q,t)u(q,t) of (HJ) exists up to time tt and if it is C2C^{2}, then the method of characteristics can be applied, and the above procedure would reproduce the underlying solution, in the absence of approximation errors in steps b) and c); i.e. in the absence of approximation errors of the Hamiltonian flow ΨtΨt\Psi_{t}\approx\Psi_{t}^{\dagger}, and in the absence of reconstruction errors. We will study the effect of approximating step b) by use of a ReLU neural network approximation of the flow Ψt\Psi_{t}^{\dagger} and by use of a moving least squares interpolation for step c). In the following two subsections we define these two approximation steps, noting that doing so leads to a complete specification of 𝒮t.\mathcal{S}_{t}. This complete specification is summarized in the final Subsection 4.3.

4.1. Step b) ReLU Network

We seek an approximation Ψt\Psi_{t} to Ψt\Psi_{t}^{\dagger} in the form of a ReLU neural network (2.1), as summarized in section 2.1.1, with input and output dimensions D𝒳=D𝒴=2d+1{D_{\mathcal{X}}}={D_{\mathcal{Y}}}=2d+1, and taking the concatenated input x:=(q0,p0,z0)Ω×d×x:=(q_{0},p_{0},z_{0})\in\Omega\times\mathbb{R}^{d}\times\mathbb{R} to its image in d×d×\mathbb{R}^{d}\times\mathbb{R}^{d}\times\mathbb{R}. We use 2π2\pi-periodicity to identify the output in the first dd components (the spatial variable qq) with a unique element in Ω=[0,2π]d\Omega=[0,2\pi]^{d}. With slight abuse of notation, this results in a well-defined mapping

Ψt:Ω×d×Ω×d×,(q0,p0,z0)Ψt(q0,p0,z0).\Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R},\quad(q_{0},p_{0},z_{0})\to\Psi_{t}(q_{0},p_{0},z_{0}).

Let μ\mu denote a probability measure on Ω×d×\Omega\times\mathbb{R}^{d}\times\mathbb{R}. We would like to choose θ\theta to minimize the loss function

L(θ)=𝔼(q,p,z)μΨt(q,p,z;θ)Ψt(q,p,z).L(\theta)=\mathbb{E}^{(q,p,z)\sim\mu}\|\Psi_{t}(q,p,z;\theta)-\Psi_{t}^{\dagger}(q,p,z)\|.

In practice we achieve this by an empirical approximation of μ\mu, based on NN i.i.d. samples, and numerical simulation to approximate the evaluation of Ψt()\Psi_{t}^{\dagger}({\,\cdot\,}) on these samples. The resulting approximate empirical loss can be approximately minimized by stochastic gradient descent [20, 47], for example.
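A minimal training sketch (in PyTorch, with illustrative width, depth, sample size and optimizer settings) might look as follows; here sample_mu and flow_t_dagger are assumed callables supplying samples from \mu and numerical evaluations of \Psi_{t}^{\dagger} (e.g. by an ODE solver), respectively, and Adam is used as a variant of stochastic gradient descent:

import torch
import torch.nn as nn

d = 2                                   # illustrative spatial dimension
in_dim = out_dim = 2 * d + 1            # (q, p, z) in Omega x R^d x R

# ReLU network Psi_t(.; theta); width and depth are illustrative choices
Psi_t = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, out_dim))

def train(Psi_t, flow_t_dagger, sample_mu, n_samples=10_000, n_epochs=50):
    # Minimize the empirical version of L(theta): flow_t_dagger is a (numerical)
    # approximation of Psi_t^dagger, sample_mu draws i.i.d. samples (q, p, z) ~ mu;
    # both are assumptions supplied by the user.
    x = sample_mu(n_samples)                      # (n_samples, 2d+1) tensor
    with torch.no_grad():
        y = flow_t_dagger(x)                      # targets Psi_t^dagger(q, p, z)
    opt = torch.optim.Adam(Psi_t.parameters(), lr=1e-3)
    for _ in range(n_epochs):
        for i in range(0, n_samples, 256):        # mini-batch stochastic gradient steps
            xb, yb = x[i:i + 256], y[i:i + 256]
            loss = (Psi_t(xb) - yb).norm(dim=1).mean()   # empirical L(theta)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return Psi_t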

4.2. Step c) Moving Least Squares

In this section, we define the interpolation mapping :{qtj,ztj}j=1Nu(,t)\mathcal{R}:\{q^{j}_{t},z^{j}_{t}\}_{j=1}^{N}\mapsto u({\,\cdot\,},t), employing reconstruction by moving least squares [53]. In general, given a function f:Ωf^{\dagger}:\Omega\to\mathbb{R}, here assumed to be defined on the domain Ω=[0,2π]d\Omega=[0,2\pi]^{d}, and given a set of (scattered) data {𝔮j,zj}j=1N\{\mathfrak{q}^{j},z^{j}\}_{j=1}^{N} where zj=f(𝔮j)z^{j}=f^{\dagger}(\mathfrak{q}^{j}) for 𝔔={𝔮1,,𝔮N}Ω\mathfrak{Q}=\{\mathfrak{q}^{1},\dots,\mathfrak{q}^{N}\}\subset\Omega, the method of moving least squares produces an approximation fz,𝔔ff_{z,\mathfrak{Q}}\approx f^{\dagger}, which is given by the following minimization [53, Def. 4.1]:

(4.2) \displaystyle f_{z,\mathfrak{Q}}(q)=P^{\ast}(q),\quad\text{where}\quad P^{\ast}\in\operatorname*{arg\,min}\left\{\sum_{j=1}^{N}\left[z^{j}-P(\mathfrak{q}^{j})\right]^{2}\phi_{\delta}(q-\mathfrak{q}^{j})\,\middle|\,P\in\pi_{m}\right\}.

Here, πm\pi_{m} denotes the space of polynomials of degree mm, and ϕδ(q)=ϕ(q/δ)\phi_{\delta}(q)=\phi(q/\delta) is a compactly supported, non-negative weight function. We will assume ϕ()\phi({\,\cdot\,}) to be smooth, supported in the unit ball B1(0)B_{1}(0), and positive on the ball B1/2(0)B_{1/2}(0) with radius 1/21/2.
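For concreteness, the following NumPy sketch evaluates the moving least squares approximant (4.2) at a single point in one dimension: it solves the weighted least squares problem over polynomials of degree m (centered at the evaluation point) and returns the value of the minimizing polynomial. The particular bump weight, the degree, and the fact that periodicity is ignored are simplifications made for this sketch:

import numpy as np

def bump(r):
    # smooth, compactly supported weight: positive for |r| < 1, zero outside
    w = np.zeros_like(r)
    inside = np.abs(r) < 1.0
    w[inside] = np.exp(-1.0 / (1.0 - r[inside] ** 2))
    return w

def mls_eval(q, points, values, delta, m=2):
    # Evaluate the moving least squares approximant (4.2) at q (1d sketch).
    # points, values : scattered data {(q^j, z^j)};  delta : scale parameter;
    # m : polynomial degree.  Returns P*(q), with P* the minimizing polynomial.
    w = bump((q - points) / delta)
    # local polynomial basis centered at q: P(x) = sum_k c_k (x - q)^k
    V = np.vander(points - q, m + 1, increasing=True)        # Vandermonde matrix
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(V * sw[:, None], values * sw, rcond=None)
    return coef[0]                                            # P*(q) = c_0

# illustrative usage: reconstruct f(x) = sin(x) from 40 noisy scattered samples
rng = np.random.default_rng(1)
pts = rng.uniform(0, 2 * np.pi, 40)
vals = np.sin(pts) + 1e-3 * rng.standard_normal(40)
print(mls_eval(1.0, pts, vals, delta=0.8), np.sin(1.0))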

The approximation error incurred by moving least squares can be estimated in the LL^{\infty}-norm, in terms of suitable measures of the “density” of the scattered data points 𝔮j\mathfrak{q}^{j}, and the smoothness of the underlying function ff^{\dagger} [53]. The relevant notions are defined next, in which |||{\,\cdot\,}| denotes the Euclidean distance on Ωd\Omega\subset\mathbb{R}^{d}.

Definition 4.1.

The fill distance of a set of points 𝔔={𝔮1,,𝔮N}Ω\mathfrak{Q}=\{\mathfrak{q}^{1},\dots,\mathfrak{q}^{N}\}\subset\Omega for a bounded domain Ωd\Omega\subset\mathbb{R}^{d} is defined to be

h𝔔,Ω:=supqΩminj=1,,N|q𝔮j|.h_{\mathfrak{Q},\Omega}:=\sup_{q\in\Omega}\min_{j=1,\dots,N}|q-\mathfrak{q}^{j}|.

The separation distance of 𝔔\mathfrak{Q} is defined by

ρ𝔔:=12minkj|𝔮k𝔮j|.\rho_{\mathfrak{Q}}:=\frac{1}{2}\min_{k\neq j}|\mathfrak{q}^{k}-\mathfrak{q}^{j}|.

A set 𝔔\mathfrak{Q} of points in Ω\Omega is said to be quasi-uniform with respect to κ1\kappa\geq 1, if

ρ𝔔h𝔔,Ωκρ𝔔.\rho_{\mathfrak{Q}}\leq h_{\mathfrak{Q},\Omega}\leq\kappa\rho_{\mathfrak{Q}}.
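These quantities are easy to estimate numerically; in the following NumPy sketch the supremum over \Omega in the fill distance is approximated by a maximum over a fine reference grid (an approximation, not the exact supremum), and both routines use brute-force pairwise distances:

import numpy as np

def separation_distance(Q):
    # rho_Q = (1/2) min_{k != j} |q^k - q^j| for points Q of shape (N, d)
    diffs = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1)
    np.fill_diagonal(diffs, np.inf)
    return 0.5 * diffs.min()

def fill_distance(Q, reference_grid):
    # approximate h_{Q,Omega} = sup_{q in Omega} min_j |q - q^j| by replacing the
    # supremum over Omega with a maximum over a fine reference grid
    dists = np.linalg.norm(reference_grid[:, None, :] - Q[None, :, :], axis=-1)
    return dists.min(axis=1).max()

# e.g. 50 random points in Omega = [0, 2*pi]^2, checked against quasi-uniformity
rng = np.random.default_rng(0)
Q = rng.uniform(0, 2 * np.pi, size=(50, 2))
g = np.linspace(0, 2 * np.pi, 100)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
h, rho = fill_distance(Q, grid), separation_distance(Q)
print(h, rho, h / rho)    # Q is quasi-uniform w.r.t. kappa if rho <= h <= kappa * rho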

Combining the approximation error of moving least squares with a stability result for moving least squares, with respect to the input data, leads to the error estimate that we require to estimate the error in our proposed HJ-Net. Proof of the following may be found in Appendix C. The statement involves both the input data set QQ and a set QQ^{\dagger} which QQ is supposed to approximate, together with their respective fill-distances and separation distances.

Proposition 4.2 (Error Stability).

Let Ω=[0,2π]dd\Omega=[0,2\pi]^{d}\subset\mathbb{R}^{d} and consider function fCr(Ω)f^{\dagger}\in C^{r}(\Omega), for some fixed regularity parameter r2r\geq 2. Assume that Q={qj,}j=1NQ^{\dagger}=\{q^{j,\dagger}\}_{j=1}^{N} is quasi-uniform with respect to κ1\kappa\geq 1. Let {qj,zj}j=1N\{q^{j},z^{j}\}_{j=1}^{N} be approximate interpolation data, and define Q:={qj}j=1NQ:=\{q^{j}\}_{j=1}^{N} where, for some ρ(0,12ρQ)\rho\in(0,\frac{1}{2}\rho_{Q^{\dagger}}) and ϵ>0\epsilon>0, we have

|q^{j}-q^{j,\dagger}|<\rho,\quad|z^{j}-f^{\dagger}(q^{j,\dagger})|<\epsilon.

Using this approximate interpolation data, let fz,Qf_{z,Q} be obtained by moving least squares (4.2). Then there exist constants h0,γ,C>0h_{0},\gamma,C>0, depending only on dd, rr and κ1\kappa\geq 1, such that, for hQ,Ωh0h_{Q,\Omega}\leq h_{0} and moving least squares scale parameter δ:=γhQ,Ω\delta:=\gamma h_{Q,\Omega}, we have

(4.3) ffz,QLC(fCrhQ,Ωr+ϵ+fC1ρ).\displaystyle\|f^{\dagger}-f_{z,Q}\|_{L^{\infty}}\leq C\left(\|f^{\dagger}\|_{C^{r}}h^{r}_{Q^{\dagger},\Omega}+\epsilon+\|f^{\dagger}\|_{C^{1}}\rho\right).
Remark 4.3.

The constants CC and γ\gamma in the previous proposition can be computed explicitly [53, see Thm. 3.14 and Cor. 4.8].

Proposition 4.2 reveals two sources of error in the reconstruction by moving least squares: The first term on the right-hand side of (4.3) is the error due to the discretization by a finite number of evaluation points qjq^{j}. This error persists even in a perfect data setting, i.e. when qj=qj,{q}^{j}=q^{j,\dagger} and zj=f(qj,){z}^{j}=f^{\dagger}(q^{j,\dagger}). The last two terms in (4.3) reflect the fact that in our intended application to HJ-Net, the evaluation points qjq^{j} and the function values zjz^{j} are only given approximately, via the approximate flow map ΨtΨt\Psi_{t}\approx\Psi_{t}^{\dagger}, introducing additional error sources.

The proof of Proposition 4.2 relies crucially on the fact that the set of evaluation points Q={qj}j=1NQ=\{q^{j}\}_{j=1}^{N} is quasi-uniform. In our application to HJ-Net, these points are obtained as the image of (q0j,p0j,z0j):=(q0j,qu0(q0j),u0(q0j))(q_{0}^{j},p_{0}^{j},z_{0}^{j}):=(q_{0}^{{j}},\nabla_{q}u_{0}(q_{0}^{{j}}),u_{0}(q_{0}^{{j}})) under the approximate flow Ψt\Psi_{t}. In particular, they depend on u0u_{0} and we cannot ensure any a priori control on the separation distance ρQ\rho_{Q}. Our proposed reconstruction \mathcal{R} therefore involves the pruning step, stated as Algorithm 1. Lemma C.3 in Appendix C shows that Algorithm 1 produces a quasi-uniform set QQQ^{\prime}\subset Q with the desired properties asserted above.

Algorithm 1 Pruning
Input: Interpolation points Q=\{q^{j}\}_{j=1}^{N}.
Output: Pruned interpolation points Q^{\prime}=\{q^{j_{k}}\}_{k=1}^{m} with fill distance h_{Q^{\prime},\Omega}\leq 3h_{Q,\Omega}, such that Q^{\prime} is quasi-uniform with respect to \kappa=3, i.e.
\rho_{Q^{\prime}}\leq h_{Q^{\prime},\Omega}\leq 3\rho_{Q^{\prime}}.
procedure 
     Set m1m\leftarrow 1, j11j_{1}\leftarrow 1 and Q{q1}Q^{\prime}\leftarrow\{q^{1}\}.  
     while m<Nm<N do
         Given Q={qj1,,qjm}QQ^{\prime}=\{q^{j_{1}},\dots,q^{j_{m}}\}\subset Q, does there exist qkQq^{k}\in Q, such that
Bh(qk)=1mBh(qj)=?B_{h}(q^{k})\cap\bigcup_{\ell=1}^{m}B_{h}(q^{j_{\ell}})=\emptyset?
         if Yes then
              Set jm+1kj_{m+1}\leftarrow k,  
              Set QQ{qk}Q^{\prime}\leftarrow Q^{\prime}\cup\{q^{k}\},  
              Set mm+1m\leftarrow m+1.  
         else
              Terminate the algorithm. 
         end if
     end while
end procedure
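A direct Python transcription of Algorithm 1 is given below; the points are scanned in index order (any order satisfying the selection criterion in the algorithm would do), disjointness of two balls of radius h is tested as a pairwise distance greater than 2h, and the choice h=h_{Q,\Omega} is an assumption consistent with the guarantees stated above (the precise choice is made in Lemma C.3, which is not reproduced here):

import numpy as np

def prune(Q, h):
    # Greedy pruning (Algorithm 1): keep a maximal subset of points whose balls
    # B_h are pairwise disjoint, i.e. whose pairwise distances exceed 2h.
    # Q : array of shape (N, d); returns the pruned subset Q'.
    kept = [0]                                        # start with q^1
    for k in range(1, len(Q)):
        if np.all(np.linalg.norm(Q[k] - Q[kept], axis=1) > 2.0 * h):
            kept.append(k)
    return Q[kept]

# illustrative usage on 200 random points in Omega = [0, 2*pi]^2
rng = np.random.default_rng(0)
Q = rng.uniform(0.0, 2.0 * np.pi, size=(200, 2))
print(len(Q), len(prune(Q, h=0.4)))                   # h would be (an estimate of) h_{Q, Omega}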

Given the interpolation by moving least squares and the pruning algorithm, we finally define the reconstruction map :{(qtj,ztj)}j=1Nfz,Q\mathcal{R}:\{(q_{t}^{j},z_{t}^{j})\}_{j=1}^{N}\mapsto{f}_{z,Q} as follows:

Algorithm 2 Reconstruction \mathcal{R}
(1)

    Given approximate interpolation data {qtj,ztj}j=1N\{q^{j}_{t},z^{j}_{t}\}_{j=1}^{N} at data points Q={qt1,,qtN}Q=\{q^{1}_{t},\dots,q^{N}_{t}\}, determine a quasi-uniform subset Q={qtjk}k=1mQQ^{\prime}=\{q_{t}^{j_{k}}\}_{k=1}^{m}\subset Q by Algorithm 1.

(2)

    Define fz,Qf_{z,Q} as the moving least squares interpolant (4.2) based on the (pruned) interpolation data {(qtjk,ztjk)}k=1m\{(q^{j_{k}}_{t},z^{j_{k}}_{t})\}_{k=1}^{m} with δ=γhQ,Ω\delta=\gamma h_{Q^{\prime},\Omega}, and γ\gamma the constant in Proposition 4.2.

4.3. Summary of HJ-Net

Thus in the final definition of HJ-Net given in equation (4.1) we recall that \mathcal{E} denotes the encoder mapping,

u_{0}\mapsto\mathcal{E}(u_{0}):=\{(q^{j}_{0},\nabla u_{0}(q^{j}_{0}),u_{0}(q^{j}_{0}))\}_{j=1}^{N}.

The mapping

(q0j,p0j,z0j)(qtj,ptj,ztj):=Ψt(q0j,p0j,z0j;θ),(q^{j}_{0},p^{j}_{0},z^{j}_{0})\mapsto(q^{j}_{t},p^{j}_{t},z^{j}_{t}):={\Psi}_{t}(q^{j}_{0},p^{j}_{0},z^{j}_{0};\theta),

approximates the flow \Psi_{t}^{\dagger} for each of the triples (q^{j}_{0},p^{j}_{0},z^{j}_{0}), j=1,\dots,N, and \theta is trained from data to minimize an approximate empirical loss. Finally, the reconstruction \mathcal{R} is obtained by applying the pruned moving least squares Algorithm 2 to the data \{(q^{j}_{t},z^{j}_{t})\}_{j=1}^{N}, with scattered data points Q_{t}=\{q^{1}_{t},\dots,q^{N}_{t}\} and corresponding values z_{t}=(z^{1}_{t},\dots,z^{N}_{t}), to approximate the output u(q,t)\approx f_{z_{t},Q_{t}}(q).
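Putting the pieces together, the HJ-Net approximation (4.1) can be assembled as in the following sketch; u0, grad_u0, Psi_t and mls_eval are assumed callables (for instance the learned flow of Subsection 4.1 and a d-dimensional analogue of the moving least squares sketch of Subsection 4.2), the pruning is done inline as in Algorithm 1, and the choices of the radius h and of delta = gamma*h are simplifications of the prescriptions in the text:

import numpy as np

def hj_net(u0, grad_u0, q0_points, Psi_t, mls_eval, h, gamma):
    # HJ-Net forward map (4.1): encode -> learned flow -> prune -> moving least squares.
    # u0, grad_u0 : input function and its gradient (callables);
    # Psi_t       : learned approximation of the flow (3.7), (q, p, z) -> (q_t, p_t, z_t);
    # mls_eval    : moving least squares evaluation, mls_eval(q, points, values, delta);
    # h, gamma    : pruning radius and the constant gamma of Proposition 4.2.
    # Step a) encoding at the fixed points q_0^j
    p0 = np.array([grad_u0(q) for q in q0_points])
    z0 = np.array([u0(q) for q in q0_points])
    # Step b) push every triple through the learned flow, keeping (q_t^j, z_t^j)
    out = [Psi_t(q, p, z) for q, p, z in zip(q0_points, p0, z0)]
    qt = np.array([o[0] for o in out])
    zt = np.array([o[2] for o in out])
    # Step c) greedy pruning (Algorithm 1) to a quasi-uniform subset ...
    kept = [0]
    for k in range(1, len(qt)):
        if np.all(np.linalg.norm(qt[k] - qt[kept], axis=1) > 2.0 * h):
            kept.append(k)
    qt, zt = qt[kept], zt[kept]
    # ... followed by moving least squares reconstruction, here with delta = gamma * h
    # (the text prescribes delta = gamma * h_{Q', Omega})
    return lambda q: mls_eval(q, qt, zt, gamma * h)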

5. Error Estimates and Complexity

Subsection 5.1 contains the statement of the main theorem concerning the computational complexity of HJ-Net, together with the high-level structure of its proof. The theorem demonstrates that the curse of parametric complexity is overcome in this problem, in the sense that the cost to achieve a given error, as measured by the size of the parameterization, grows only algebraically with the inverse of the error. In the following subsections 5.2 and 5.3, we provide a detailed discussion of the approximation of the Hamiltonian flow and of the moving least squares algorithm which, together, lead to the proof of Theorem 5.1. Proofs of the results stated in those two subsections are given in an appendix.

5.1. HJ-Net Beats The Curse of Parametric Complexity

Theorem 5.1 (HJ-Net Approximation Estimate).

Consider equation (HJ) on periodic domain Ω=[0,2π]d\Omega=[0,2\pi]^{d}, with CperrC^{r}_{\mathrm{per}} initial data and Hamiltonian HCperr+1H\in C^{r+1}_{\mathrm{per}}, where r2r\geq 2. Assume that HH satisfies the no-blowup Assumption 3.1. Let Cperr\mathcal{F}\subset C^{r}_{\mathrm{per}} be a compact set of initial data, and let T<T()T<T^{*}(\mathcal{F}) where T()T^{*}(\mathcal{F}) is given by Proposition 3.2. Then there is constant C=C(T,d,r,H,)>0C=C(T,d,r,H,\mathcal{F})>0, such that for any ϵ>0\epsilon>0 and t[0,T]t\in[0,T], there exists a set of points Qϵ={q1,,qN}Q_{\epsilon}=\{q^{1},\dots,q^{N}\}, optimal parameter θϵ\theta_{\epsilon} and neural network Ψt()=Ψt(;θϵ)\Psi_{t}({\,\cdot\,})=\Psi_{t}({\,\cdot\,};\theta_{\epsilon}) such that the corresponding HJ-Net approximation given by (4.1), with Steps b) and c) defined in Subsections 4.1 and 4.2 respectively, satisfies

supu0𝒮t(u0)𝒮t(u0)Lϵ.\sup_{u_{0}\in\mathcal{F}}\|\mathcal{S}_{t}(u_{0})-\mathcal{S}_{t}^{\dagger}(u_{0})\|_{L^{\infty}}\leq\epsilon.

Furthermore, we have the following complexity bounds: (i) the number NN of encoding points Qϵ={qj}j=1NQ_{\epsilon}=\{q^{j}\}_{j=1}^{N} from Step a) can be bounded by

(5.1) NCϵd/r;\displaystyle N\leq C\epsilon^{-d/r};

and (ii) the neural network Ψt()=Ψt(;θϵ)\Psi_{t}({\,\cdot\,})=\Psi_{t}({\,\cdot\,};\theta_{\epsilon}) from Step b), Subsection 4.1, satisfies

(5.2) \displaystyle\mathrm{depth}(\Psi_{t})\leq C\log(\epsilon^{-1}),\quad\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log(\epsilon^{-1}).
Proof.

We first note that, for any u0u_{0}\in\mathcal{F} and T<T()T<T^{*}(\mathcal{F}), Proposition 3.2 shows that the solution uu of (HJ) can be computed by the method of characteristics up to time TT. Thus 𝒮t\mathcal{S}^{\dagger}_{t} is well-defined for any t[0,T]t\in[0,T]. We break the proof into three steps, relying on propositions established in the following subsections, and then conclude in a final fourth step.

Step 1: (Neural Network Approximation) Let MM be given by

M:=1supu0supqΩu0(q)supu0supqΩ|u0(q)|,M:=1\vee\sup_{u_{0}\in\mathcal{F}}\sup_{q\in\Omega}\|\nabla u_{0}(q)\|_{\ell^{\infty}}\vee\sup_{u_{0}\in\mathcal{F}}\sup_{q\in\Omega}|u_{0}(q)|,

where aba\vee b denotes the maximum. By this choice of MM, we have u0(q)[M,M]d\nabla u_{0}(q)\in[-M,M]^{d}, u0(q)[M,M]u_{0}(q)\in[-M,M] for all u0u_{0}\in\mathcal{F}, qΩq\in\Omega. By Proposition 5.3, there exists a constant β=β(d,LH,t)1\beta=\beta(d,L_{H},t)\geq 1, and a constant C>0C>0, depending only on MM, dd, tt, and the norm HCr+1(𝕋d×[βM,βM]d)\|H\|_{C^{r+1}(\mathbb{T}^{d}\times[-\beta M,\beta M]^{d})}, such that the Hamiltonian flow map Ψt\Psi_{t}^{\dagger} (3.7) can be approximated by a neural network Ψt\Psi_{t} with

\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log(\epsilon^{-1}),\quad\mathrm{depth}(\Psi_{t})\leq C\log(\epsilon^{-1}),

and

(5.3) sup(q0,p0,z0)Ω×[M,M]d×[M,M]|Ψt(q0,p0,z0)Ψt(q0,p0,z0)|ϵ.\displaystyle\sup_{(q_{0},p_{0},z_{0})\in\Omega\times[-M,M]^{d}\times[-M,M]}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi^{\dagger}_{t}(q_{0},p_{0},z_{0})\right|\leq\epsilon.

Step 2: (Choice of Encoding Points) Fix ρ>0\rho>0, to be determined below. Let Q:=ρd[0,2π]dQ:=\rho\mathbb{Z}^{d}\cap[0,2\pi]^{d} denote an equidistant grid on [0,2π]d[0,2\pi]^{d} with grid spacing ρ\rho. Enumerating the elements of QQ, we write Q={q01,,q0N}Q=\{q_{0}^{1},\dots,q_{0}^{N}\}, where we note that there exists a constant C1=C1(d)>0C_{1}=C_{1}(d)>0 depending only on dd, such that NC1ρdN\leq C_{1}\rho^{-d}; equivalently,

(5.4) ρdC11N.\displaystyle\frac{\rho^{d}}{C_{1}}\leq\frac{1}{N}.

For any u0u_{0}\in\mathcal{F}, let

(5.5) Qu0:={qtj,|qtj,=qt(q0j,p0j),q0jQϵ,p0j=qu0(q0j)},\displaystyle Q_{u_{0}}^{\dagger}:={\left\{q_{t}^{{j},\dagger}\,\middle|\,q_{t}^{{j},\dagger}=q_{t}(q_{0}^{{j}},p_{0}^{{j}}),\,q_{0}^{{j}}\in Q_{\epsilon},\,p_{0}^{{j}}=\nabla_{q}u_{0}(q_{0}^{{j}})\right\}},

be the set of image points under the characteristic mapping defined by u0u_{0}. Since Qu0={qtj,}j=1NQ_{u_{0}}^{\dagger}=\{q^{j,\dagger}_{t}\}_{j=1}^{N} is a set of NN points, it follows from the definition of the fill distance that NN balls of radius hQu0,Ωh_{Q^{\dagger}_{u_{0}},\Omega} cover Ω=[0,2π]d\Omega=[0,2\pi]^{d}. A simple volume counting argument then implies that there exists a constant C0=C0(d)>0C_{0}=C_{0}(d)>0, such that 1/NC0hQu0,Ωd1/N\leq C_{0}\,h_{Q^{\dagger}_{u_{0}},\Omega}^{d}; equivalently,

(5.6) 1C0NhQu0,Ωd,u0.\displaystyle\frac{1}{C_{0}N}\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{d},\quad\forall\,u_{0}\in\mathcal{F}.

Given ϵ\epsilon from Step 1, we now choose ρ:=(C0C1)1/dϵ1/r\rho:=(C_{0}C_{1})^{1/d}\epsilon^{1/r}, so that by (5.4) and (5.6),

ϵd/r=1C0ρdC11C0NhQu0,Ωd,u0.\epsilon^{d/r}=\frac{1}{C_{0}}\frac{\rho^{d}}{C_{1}}\leq\frac{1}{C_{0}N}\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{d},\quad\forall\,u_{0}\in\mathcal{F}.

We emphasize that C0,C1C_{0},C_{1} depend only on dd, and are independent of u0u_{0}\in\mathcal{F} and ϵ\epsilon. We have thus shown that if Qϵ:={q1,,qN}Q_{\epsilon}:=\{q^{1},\dots,q^{N}\} is an enumeration of ρd[0,2π]d\rho\mathbb{Z}^{d}\cap[0,2\pi]^{d} with ρ:=(C0C1)1/dϵ1/r\rho:=(C_{0}C_{1})^{1/d}\epsilon^{1/r}, then the fill distance of the image points Qu0Q^{\dagger}_{u_{0}} under the characteristic mapping satisfies

(5.7) ϵhQu0,Ωr,u0.\displaystyle\epsilon\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},\quad\forall\,u_{0}\in\mathcal{F}.

In particular, this step defines our encoding points QϵQ_{\epsilon}.

Step 3: (Interpolation Error Estimate) Let QϵQ_{\epsilon} be the set of evaluation points determined in Step 2. Since QϵQ_{\epsilon} is an equidistant grid with grid spacing proportional to ϵ1/r\epsilon^{1/r}, the fill distance of QϵQ_{\epsilon} is bounded by hQϵ,ΩCϵ1/rh_{Q_{\epsilon},\Omega}\leq C\epsilon^{1/r}, where the constant C=C(d)1C=C(d)\geq 1 depends only on dd. By Proposition 5.6, there exists a (possibly larger) constant C=C(d,t,H,)1C=C(d,t,H,\mathcal{F})\geq 1, such that for all u0u_{0}\in\mathcal{F}, the set of image points Qu0Q_{u_{0}}^{\dagger} under the characteristic mapping (5.5), has uniformly bounded fill distance

(5.8) hQu0,ΩCϵ1/r,u0.\displaystyle h_{Q^{\dagger}_{u_{0}},\Omega}\leq C\epsilon^{1/r},\quad\forall\,u_{0}\in\mathcal{F}.

Furthermore, taking into account (5.7), the upper bound (5.3) implies that

sup(q0,p0,z0)𝕋d×[M,M]d×[M,M]|Ψt(q0,p0,z0)Ψt(q0,p0,z0)|hQu0,Ωr,\sup_{(q_{0},p_{0},z_{0})\in\mathbb{T}^{d}\times[-M,M]^{d}\times[-M,M]}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi^{\dagger}_{t}(q_{0},p_{0},z_{0})\right|\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},

for any u0u_{0}\in\mathcal{F}. In turn, this shows that the approximate interpolation data (qtj,ztj)=𝒫Ψt(q0j,p0j,z0j)({q}_{t}^{{j}},{z}_{t}^{{j}})=\mathcal{P}\circ\Psi_{t}(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}}), j=1,,Nj=1,\dots,N, obtained from the neural network approximation ΨtΨt\Psi_{t}\approx\Psi_{t}^{\dagger} by (orthogonal) projection 𝒫\mathcal{P} onto the first and last components, satisfies

|qtjqtj,|hQu0,Ωr,|ztju(qtj,,t)|hQu0,Ωr,|{q}_{t}^{{j}}-q_{t}^{j,\dagger}|\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},\quad|{z}_{t}^{{j}}-u(q_{t}^{j,\dagger},t)|\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},

where u(q,t)u(q,t) denotes the exact solution of the Hamilton-Jacobi PDE (HJ) with initial data u0u_{0}\in\mathcal{F}. By Proposition 5.5, there exists a constant C>0C>0, depending only on dd and rr, such that the pruned moving least squares approximant fz,Qu0f_{{z},Q_{u_{0}}} obtained by Algorithm 2 with (approximate) data points Qu0={qt1,,qtN}Q_{u_{0}}=\{q^{1}_{t},\dots,q^{N}_{t}\} and corresponding data values z={zt1,,ztN}z=\{z^{1}_{t},\dots,z^{N}_{t}\}, satisfies

(5.9) u(,t)fz,Qu0L(Ω)C(1+u(,t)Cr(Ω))hQu0,Ωr.\displaystyle\|u({\,\cdot\,},t)-f_{{z},Q_{u_{0}}}\|_{L^{\infty}(\Omega)}\leq C\left(1+\|u({\,\cdot\,},t)\|_{C^{r}(\Omega)}\right)h_{Q^{\dagger}_{u_{0}},\Omega}^{r}.

Step 4: (Conclusion) By the short-time existence result of Proposition 3.2, there exists C=C(H,,t)>0C=C(H,\mathcal{F},t)>0, such that u(,t)Cr(Ω)C\|u({\,\cdot\,},t)\|_{C^{r}(\Omega)}\leq C for any solution uu of the Hamilton-Jacobi equation (HJ) with initial data u(,t=0)=u0u({\,\cdot\,},t=0)=u_{0}\in\mathcal{F}. By definition of the HJ-Net approximation, we have 𝒮t(u0)fz,Qu0\mathcal{S}_{t}(u_{0})\equiv f_{{z},Q_{u_{0}}} and by definition of the solution operator, 𝒮t(u0)u(,t)\mathcal{S}_{t}^{\dagger}(u_{0})\equiv u({\,\cdot\,},t). We thus conclude that

𝒮t(u0)𝒮t(u0)L(5.9)ChQu0,Ωr(5.8)Cϵ,\|\mathcal{S}_{t}(u_{0})-\mathcal{S}^{\dagger}_{t}(u_{0})\|_{L^{\infty}}\overset{\mathclap{\underset{\downarrow}{\eqref{eq:1uh}}}}{\leq}Ch_{Q_{u_{0}}^{\dagger},\Omega}^{r}\overset{\mathclap{\underset{\downarrow}{\eqref{eq:Qdag}}}}{\leq}C\epsilon,

for a constant C=C(T,d,r,H,)>0C=C(T,d,r,H,\mathcal{F})>0, independent of ϵ\epsilon. Since ϵ\epsilon is arbitrary, replacing ϵ\epsilon by ϵ/C\epsilon/C throughout the above argument implies the claimed error and complexity estimate of Theorem 5.1. ∎

Remark 5.2 (Overall Computational Complexity of HJ-Net).

The error and complexity estimate of Theorem 5.1 implies that for moderate dimensions dd, and for sufficiently smooth input functions u0Cperru_{0}\in\mathcal{F}\subset C^{r}_{\mathrm{per}}, with r>3d+1r>3d+1, the overall complexity of this approach scales at most linearly in ϵ1\epsilon^{-1}: Indeed, the mapping of the data points (q0j,p0j,z0j)(qtj,ptj,ztj)(q^{j}_{0},p^{j}_{0},z^{j}_{0})\mapsto(q^{j}_{t},p^{j}_{t},z^{j}_{t}) for j=1,,Nj=1,\dots,N requires NN forward-passes of the neural network Ψt\Psi_{t}, which has O(ϵ(2d+1)/rlog(ϵ1))O(\epsilon^{-(2d+1)/r}\log(\epsilon^{-1})) internal degrees of freedom. Since N=O(ϵd/r)N=O(\epsilon^{-d/r}) the computational complexity of this mapping is thus bounded by O(ϵ(3d+1)/rlog(ϵ1))=O(ϵ1)O(\epsilon^{-(3d+1)/r}\log(\epsilon^{-1}))=O(\epsilon^{-1}). A naive implementation of the pruning algorithm requires at most O(N2)=O(ϵ2d/r)=O(ϵ1)O(N^{2})=O(\epsilon^{-2d/r})=O(\epsilon^{-1}) comparisons. The computational complexity of the reconstruction by the moving least squares method with mNm\leq N (pruned) interpolation points and with Mϵd/rM\sim\epsilon^{-d/r} evaluation points can be bounded by O(m+M)=O(ϵd/r+M)=O(ϵ1)O(m+M)=O(\epsilon^{-d/r}+M)=O(\epsilon^{-1}) [53, last paragraph of Chapter 4.2]. In particular, the overall complexity to obtain an ϵ\epsilon-approximation of the output function 𝒮t(u0)𝒮t(u0)\mathcal{S}_{t}(u_{0})\approx\mathcal{S}_{t}^{\dagger}(u_{0}) at e.g. the encoding points QϵQ_{\epsilon} (with M=|Qϵ|ϵd/rM=|Q_{\epsilon}|\sim\epsilon^{-d/r}) is at most linear in ϵ1\epsilon^{-1}.
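As an illustrative instantiation of these bounds (not a statement from the text): taking d=2 and r=8, so that r>3d+1=7, Theorem 5.1 gives N=O(\epsilon^{-1/4}) encoding points and \mathrm{size}(\Psi_{t})=O(\epsilon^{-5/8}\log(\epsilon^{-1})); the N forward passes then cost O(\epsilon^{-7/8}\log(\epsilon^{-1})), the naive pruning costs O(N^{2})=O(\epsilon^{-1/2}), and the moving least squares reconstruction at M\sim\epsilon^{-1/4} points costs O(\epsilon^{-1/4}), so that each contribution, and hence the total, is O(\epsilon^{-1}).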

Theorem 5.1 shows that for fixed dd and rr, the proposed HJ-Net architecture can overcome the general curse of parametric complexity in the operator approximation 𝒮𝒮\mathcal{S}\approx\mathcal{S}^{\dagger} implied by Theorem 2.11 even though the underlying operator does not have higher than CrC^{r}-regularity. This is possible because HJ-Net leverages additional structure inherent to the Hamilton-Jacobi PDE HJ (reflected in the method of characteristics), and therefore does not rely solely on the CrC^{r}-smoothness of the underlying operator 𝒮\mathcal{S}^{\dagger}.

5.2. Approximation of Hamiltonian Flow and Quadrature Map

In this subsection we quantify the complexity of ϵ\epsilon-approximation by a ReLU network as defined in Subsection 4.1.

Recall that \Psi_{t}^{\dagger}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R} comprises solution of the Hamiltonian equations (3.1) and quadrature (3.2), leading to (3.4). An approximation \Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R} of this Hamiltonian flow map is provided by the following proposition, proved in Appendix D.

Proposition 5.3.

Let Ω=[0,2π]d\Omega=[0,2\pi]^{d}. Let r2r\geq 2, and t>0t>0 be given, and assume that HCperr+1(Ω×d)H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d}) satisfies the no-blowup Assumption 3.1 with constant LH>0.L_{H}>0. Then, for any M1M\geq 1, there exist constants β:=(1+d)exp(LHt)\beta:=(1+\sqrt{d})\exp(L_{H}t) and C=C(HCr+1(Ω×[βM,βM]d),M,r,d,t)>0C=C\left(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d})},M,r,d,t\right)>0, such that for all ϵ(0,12]\epsilon\in(0,\frac{1}{2}], there is a ReLU neural network Ψt()=Ψt(;θϵ)\Psi_{t}({\,\cdot\,})=\Psi_{t}({\,\cdot\,};\theta_{\epsilon}) satisfying

(5.10) sup(q0,p0,z0)Ω×[M,M]d+1|Ψt(q0,p0,z0)Ψt(q0,p0,z0)|ϵ,\displaystyle\sup_{(q_{0},p_{0},z_{0})\in\Omega\times[-M,M]^{d+1}}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi_{t}^{\dagger}(q_{0},p_{0},z_{0})\right|\leq\epsilon,

and satisfying the size and depth bounds

\displaystyle\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log\left(\epsilon^{-1}\right),\quad\mathrm{depth}(\Psi_{t})\leq C\log\left(\epsilon^{-1}\right).
Remark 5.4.

Using the results of [38, 29, 55], one could in fact improve the size bound of Proposition 5.3 somewhat: neglecting logarithmic terms in ϵ1\epsilon^{-1}, it can be shown that a neural network with size(Ψt)ϵ(d+1/2)/r\mathrm{size}(\Psi_{t})\lesssim\epsilon^{-(d+1/2)/r} suffices. However, this comes at the expense of substantially increasing the depth from a logarithmic scaling depth(Ψt)log(ϵ1)\mathrm{depth}(\Psi_{t})\lesssim\log(\epsilon^{-1}), to an algebraic scaling depth(Ψt)ϵ(d+1/2)/r\mathrm{depth}(\Psi_{t})\lesssim\epsilon^{-(d+1/2)/r}.

5.3. Moving Least Squares Reconstruction Error

In this subsection we discuss error estimates for the reconstruction by moving least squares, based on imperfect input-output pairs, as defined in Subsection 4.2.

We recall that the reconstruction \mathcal{R} in the HJ-Net approximation is obtained by applying the pruned moving least squares Algorithm 2 to the data {qtj,ztj}j=1N\{q^{j}_{t},z^{j}_{t}\}_{j=1}^{N}, where (qtj,ptj,ztj)(q^{j}_{t},p^{j}_{t},z^{j}_{t}) are obtained as (qtj,ptj,ztj)=Ψt(q0j,p0j,z0j)(q^{j}_{t},p^{j}_{t},z^{j}_{t})=\Psi_{t}(q^{j}_{0},p^{j}_{0},z^{j}_{0}) with fixed evaluation points {q0j}j=1NΩ\{q^{j}_{0}\}_{j=1}^{N}\subset\Omega, and where p0j:=qu0(q0j)p^{j}_{0}:=\nabla_{q}u_{0}(q^{j}_{0}), z0j:=u0(q0j)z^{j}_{0}:=u_{0}(q^{j}_{0}) are determined in terms of a given input function u0u_{0}, so that ztju(qtj,t)z^{j}_{t}\approx u(q^{j}_{t},t) is an approximation of the corresponding solution u(,t)u({\,\cdot\,},t) at time tt.

In the following, we first derive an error estimate in terms of the fill distance of Qt={qtj}j=1NQ_{t}=\{q^{j}_{t}\}_{j=1}^{N}, in Proposition 5.5. Subsequently, in Proposition 5.6, we provide a bound on the fill distance hQt,Ωh_{Q_{t},\Omega} at time tt in terms of the fill distance hQ,Ωh_{Q,\Omega} at time 0. Proofs of both propositions can be found in Appendix D.
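Both propositions below are phrased in terms of the fill distance hQ,Ω=supxΩminj|xqj|h_{Q,\Omega}=\sup_{x\in\Omega}\min_{j}|x-q^{j}|. The following small utility (our illustration, not from the paper; it uses the Euclidean distance and a regular evaluation grid) estimates this quantity and may help make the statements concrete.

```python
import numpy as np

def fill_distance(Q, d, grid_per_dim=50):
    """Estimate the fill distance
        h_{Q, Omega} = sup_{x in Omega} min_j |x - q_j|
    over Omega = [0, 2*pi]^d by taking the sup over a regular grid; the estimate
    approaches the true value from below as the grid is refined."""
    axes = [np.linspace(0.0, 2 * np.pi, grid_per_dim) for _ in range(d)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    # distance from every grid point to its nearest point of Q
    dists = np.linalg.norm(grid[:, None, :] - Q[None, :, :], axis=-1).min(axis=1)
    return dists.max()

# Example: 200 random points in [0, 2*pi]^2
Q = 2 * np.pi * np.random.rand(200, 2)
print(fill_distance(Q, d=2))
```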

Proposition 5.5.

Let Ω=[0,2π]dd\Omega=[0,2\pi]^{d}\subset\mathbb{R}^{d} and fix a regularity parameter r2r\geq 2. There exist constants h0,C>0h_{0},C>0 such that the following holds: Assume that Q={q1,,,qN,}ΩQ^{\dagger}=\{q^{1,\dagger},\dots,q^{N,\dagger}\}\subset\Omega is a set of NN evaluation points with fill distance hQ,Ωh0h_{Q^{\dagger},\Omega}\leq h_{0}. Then for any 2π2\pi-periodic function fCperr(Ω)f^{\dagger}\in C^{r}_{\mathrm{per}}(\Omega), and approximate input-output data {(qj,zj)}j=1N\{({q}^{j},{z}^{j})\}_{j=1}^{N}, such that

|qjqj,|hQ,Ωr,|zjf(qj,)|hQ,Ωr,|{q}^{j}-q^{j,\dagger}|\leq h_{Q^{\dagger},\Omega}^{r},\quad|{z}^{j}-f^{\dagger}(q^{j,\dagger})|\leq h_{Q^{\dagger},\Omega}^{r},

the pruned moving least squares approximant fz,Qf_{{z},{Q}} of Algorithm 2 with interpolation data points Q={q1,,qN}Q=\{q^{1},\dots,q^{N}\} and data values z={z1,,zN}z=\{z^{1},\dots,z^{N}\}, satisfies

ffz,QL(Ω)C(1+fCr(Ω))hQ,Ωr.\|f^{\dagger}-f_{{z},{Q}}\|_{L^{\infty}(\Omega)}\leq C\left(1+\|f^{\dagger}\|_{C^{r}(\Omega)}\right)h_{Q^{\dagger},\Omega}^{r}.

In contrast to Proposition 4.2, Proposition 5.5 includes the pruning step in the reconstruction, and does not assume quasi-uniformity of either the underlying exact point distribution QQ^{\dagger} or the approximate point distribution QQ. To obtain a bound on the reconstruction error, we can combine the preservation of CrC^{r}-regularity implied by the short-time existence Proposition 3.2, with the following:

Proposition 5.6.

Let Ω=[0,2π]d\Omega=[0,2\pi]^{d}, let r2r\geq 2, and assume that the Hamiltonian HCperr+1(Ω×d)H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d}) is periodic in qq and satisfies Assumption 3.1 with constant LH>0L_{H}>0. Let Cperr(Ω)\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega) be a compact subset and fix t<Tt<T^{\ast}. Then there exists a constant C=C(H,,LH,t)>0C=C(H,\mathcal{F},L_{H},t)>0, such that for any set of evaluation points Q={q0j}j=1NΩQ=\{q^{j}_{0}\}_{j=1}^{N}\subset\Omega, and for any u0u_{0}\in\mathcal{F}, the image points

Qu0:={qtj|qtj=Φt,u0(q0j),j=1,,N}Ω,Q_{u_{0}}^{\dagger}:={\left\{q^{j}_{t}\,\middle|\,q^{j}_{t}=\Phi^{\dagger}_{t,u_{0}}(q^{j}_{0}),\;j=1,\dots,N\right\}}\subset\Omega,

under the spatial characteristic mapping Φt,u0:q0qt(q0,qu0(q0))\Phi^{\dagger}_{t,u_{0}}:q_{0}\mapsto q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})) satisfy the following uniform bound on the fill distance:

hQu0,ΩChQ,Ω.h_{Q^{\dagger}_{u_{0}},\Omega}\leq Ch_{Q,\Omega}.
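Informally (and suppressing constants), combining Proposition 5.6 with Proposition 5.5 applied to f=u(,t)=𝒮t(u0)f^{\dagger}=u({\,\cdot\,},t)=\mathcal{S}_{t}^{\dagger}(u_{0}), whose CrC^{r}-norm is controlled by the short-time existence Proposition 3.2, and tuning the network accuracy (5.10) so that the perturbed data satisfy the hrh^{r}-closeness assumptions of Proposition 5.5, one arrives at a reconstruction error of the form

\|\mathcal{S}_{t}^{\dagger}(u_{0})-\mathcal{S}_{t}(u_{0})\|_{L^{\infty}(\Omega)}\;\lesssim\;\left(1+\|\mathcal{S}_{t}^{\dagger}(u_{0})\|_{C^{r}(\Omega)}\right)h_{Q,\Omega}^{r},

uniformly over u0u_{0}\in\mathcal{F}. This is only a sketch of how the pieces combine; the precise statement, including the resulting complexity bounds, is Theorem 5.1.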

6. Conclusions

The first contribution of this work is to study the curse of dimensionality in the context of operator learning, here interpreted in terms of the infinite dimensional nature of the input space. We state a theorem which, for the first time, establishes a general form of the curse of parametric complexity, a natural scaling limit of the curse of dimensionality in high-dimensional approximation, characterized by lower bounds which are exponential in a power of the inverse error. The theorem demonstrates that, in general, it is not possible to obtain complexity estimates for the size of the approximating neural operator that grow algebraically with the inverse error, unless specific structure in the underlying operator is leveraged; in particular, we prove that this additional structure has to go beyond CrC^{r}- or Lipschitz-regularity. This considerably generalizes and strengthens earlier work on the curse of parametric complexity in [32], where a mild form of this curse had been identified for PCA-Net. As shown in the present work, our result applies to many proposed operator learning architectures, including PCA-Net, the FNO, DeepONet, and recent nonlinear extensions thereof. The lower complexity bound in this work is obtained for neural operator architectures based on standard feedforward ReLU neural networks, and could likely be extended to feedforward architectures employing piecewise polynomial activations. It is of note that algebraic complexity bounds, which seemingly overcome the curse of parametric complexity of the present work, have recently been derived for the approximation of Lipschitz operators [49]; these results build on non-standard neural network architectures with either superexpressive activation functions or non-standard connectivity, and therefore do not contradict our results.

The second contribution of this paper is to present an operator learning framework for Hamilton-Jacobi equations, and to provide a complexity analysis demonstrating that the methodology is able to tame the curse of parametric complexity for these PDEs. We present the ideas in a simple setting, and there are numerous avenues for future investigation. For example, as pointed out in Remark 3.5, one main limitation of the proposed approach based on characteristics is that we can only consider finite time intervals on which classical solutions of the HJ equations exist. It would be of interest to extend the methodology to weak solutions, after the potential formation of singularities. It would also be of interest to combine our work with the work on the curse of dimensionality with respect to the dimension of Euclidean space, cited in Section 1. Furthermore, in practice we recommend learning the Hamiltonian flow, which underlies the method of characteristics for the HJ equation, using symplectic neural networks [27]. However, the analysis of these neural networks is not yet developed to the extent needed for our complexity analysis in this paper, which builds on the work in [54]. Extending the analysis of symplectic neural networks, and using this extension to analyze generalizations of HJ-Net as defined here, are interesting directions for future study. Finally, we note that it is of interest to iterate the learned operator. In order to do this, we would need to generalize the error estimates to the C1C^{1} topology. This could be achieved either by interpolation between higher-order CsC^{s} bounds of the proposed methodology for s>1s>1 combined with the existing error analysis, or by using the gradient variable pp in the interpolation.

Acknowledgments The work of SL is supported by Postdoc.Mobility grant P500PT-206737 from the Swiss National Science Foundation. The work of AMS is supported by a Department of Defense Vannevar Bush Faculty Fellowship.

References

  • [1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
  • [2] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.
  • [3] N. Boullé, C. J. Earls, and A. Townsend. Data-driven discovery of Green’s functions with human-understandable deep learning. Scientific Reports, 12(1):1–9, 2022.
  • [4] N. Boullé and A. Townsend. Learning elliptic partial differential equations with randomized linear algebra. Foundations of Computational Mathematics, pages 1–31, 2022.
  • [5] T. Chen and H. Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
  • [6] S.-N. Chow and J. K. Hale. Methods of Bifurcation Theory, volume 251. Springer Science & Business Media, 2012.
  • [7] Y. T. Chow, J. Darbon, S. Osher, and W. Yin. Algorithm for overcoming the curse of dimensionality for time-dependent non-convex Hamilton–Jacobi equations arising from optimal control and differential games problems. Journal of Scientific Computing, 73(2):617–643, 2017.
  • [8] Y. T. Chow, J. Darbon, S. Osher, and W. Yin. Algorithm for overcoming the curse of dimensionality for state-dependent Hamilton-Jacobi equations. Journal of Computational Physics, 387:376–409, 2019.
  • [9] O. Christensen et al. An introduction to frames and Riesz bases, volume 7. Springer, 2003.
  • [10] S. Dahlke, F. De Mari, P. Grohs, and D. Labate. Harmonic and applied analysis. Appl. Numer. Harmon. Anal, 2015.
  • [11] J. Darbon, G. P. Langlois, and T. Meng. Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures. Research in the Mathematical Sciences, 7(3):1–50, 2020.
  • [12] J. Darbon and S. Osher. Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1):1–26, 2016.
  • [13] M. De Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in operator learning with neural networks. Journal of Machine Learning, 1:3:299–341, 2022.
  • [14] M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA Journal on Uncertainty Quantification, 11(2):480–513, 2023.
  • [15] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis. Approximation rates of DeepONets for learning operators arising from advection–diffusion equations. Neural Networks, 153:411–426, 2022.
  • [16] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021.
  • [17] R. A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.
  • [18] D. L. Donoho. Sparse components of images and optimal atomic decompositions. Constructive Approximation, 17:353–382, 2001.
  • [19] L. C. Evans. Partial Differential Equations. American Mathematical Society, Providence, R.I., 2010.
  • [20] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT press, 2016.
  • [21] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. Constructive approximation, 55(1):259–367, 2022.
  • [22] C. Heil. A basis theory primer: expanded edition. Springer Science & Business Media, 2010.
  • [23] L. Herrmann, C. Schwab, and J. Zech. Deep ReLU neural network expression rates for data-to-qoi maps in Bayesian PDE inversion. SAM Research Report, 2020, 2020.
  • [24] L. Herrmann, C. Schwab, and J. Zech. Neural and GPC operator surrogates: Construction and expression rate bounds. arXiv preprint arXiv:2207.04950, 2022.
  • [25] J. S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear problems using neural networks. Journal of Computational Physics, 363:55–78, 2018.
  • [26] N. Hua and W. Lu. Basis operator network: A neural network-based model for learning nonlinear operators via neural basis. Neural Networks, 164:21–37, 2023.
  • [27] P. Jin, Z. Zhang, A. Zhu, Y. Tang, and G. E. Karniadakis. Sympnets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks, 132:166–179, 2020.
  • [28] Y. Khoo, J. Lu, and L. Ying. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
  • [29] M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
  • [30] N. B. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22(1), 2021.
  • [31] N. B. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89), 2023.
  • [32] S. Lanthaler. Operator learning with PCA-Net: Upper and lower complexity bounds. Journal of Machine Learning Research, 24(318), 2023.
  • [33] S. Lanthaler, Z. Li, and A. M. Stuart. The nonlocal neural operator: universal approximation. arXiv preprint arXiv:2304.13221, 2023.
  • [34] S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1), 2022.
  • [35] S. Lanthaler, R. Molinaro, P. Hadorn, and S. Mishra. Nonlinear reconstruction for operator learning of PDEs with discontinuities. In Eleventh International Conference on Learning Representations, 2023.
  • [36] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In Ninth International Conference on Learning Representations, 2021.
  • [37] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–67, 2024.
  • [38] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
  • [39] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
  • [40] C. Marcati and C. Schwab. Exponential convergence of deep operator networks for elliptic partial differential equations. SIAM Journal on Numerical Analysis, 61(3):1513–1545, 2023.
  • [41] H. Mhaskar, Q. Liao, and T. Poggio. Learning real and Boolean functions: When is deep better than shallow. Technical report, Center for Brains, Minds and Machines (CBMM), arXiv, 2016.
  • [42] H. N. Mhaskar and N. Hahm. Neural networks for functional approximation and system identification. Neural Computation, 9(1):143–159, 1997.
  • [43] N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between Banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
  • [44] J. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive Approximation, 55(1):537–582, 2022.
  • [45] D. Patel, D. Ray, M. R. Abdelmalik, T. J. Hughes, and A. A. Oberai. Variationally mimetic operator networks. Computer Methods in Applied Mechanics and Engineering, 419:116536, 2024.
  • [46] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
  • [47] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [48] T. D. Ryck and S. Mishra. Generic bounds on the approximation error for physics-informed (and) operator learning. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • [49] C. Schwab, A. Stein, and J. Zech. Deep operator network approximation rates for Lipschitz operators. arXiv preprint arXiv:2307.09835, 2023.
  • [50] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55, 2019.
  • [51] J. Seidman, G. Kissas, P. Perdikaris, and G. J. Pappas. NOMAD: Nonlinear manifold decoders for operator learning. Advances in Neural Information Processing Systems, 35:5601–5613, 2022.
  • [52] E. M. Stein and R. Shakarchi. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2009.
  • [53] H. Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.
  • [54] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • [55] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. Advances in neural information processing systems, 33:13005–13015, 2020.
  • [56] H. You, Q. Zhang, C. J. Ross, C.-H. Lee, and Y. Yu. Learning deep implicit Fourier neural operators (IFNOs) with applications to heterogeneous material modeling. Computer Methods in Applied Mechanics and Engineering, 398:115296, 2022.
  • [57] Z. Zhang, L. Tat, and H. Schaeffer. BelNet: Basis enhanced learning, a mesh-free neural operator. Proceedings of the Royal Society A, 479, 2023.
  • [58] Y. Zhu and N. Zabaras. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics, 366:415–447, 2018.

Appendix A Proof of Curse of Parametric Complexity

A.1. Preliminaries in Finite Dimensions

Our goal in this section is to prove Proposition 2.1, which we repeat here:

See 2.1

To this end, we recall and extend several results from [54] on the ReLU neural network approximation of functions f:Df:\mathbb{R}^{D}\to\mathbb{R}. Subsequently, these results will be used as building blocks to construct functionals in the infinite-dimensional context, leading to a curse of parametric complexity in that setting, made precise in Theorem 2.11. Here, we consider the space Cr(D)C^{r}(\mathbb{R}^{D}) consisting of rr-times continuously differentiable functions f:Df:\mathbb{R}^{D}\to\mathbb{R}. We denote by

FD,r:={fCr(D)|sup|α|rsupxD|Dαf(x)|1},F_{D,r}:={\left\{f\in C^{r}(\mathbb{R}^{D})\,\middle|\,\sup_{|\alpha|\leq r}\sup_{x\in\mathbb{R}^{D}}\left|D^{\alpha}f(x)\right|\leq 1\right\}},

the unit ball in Cr(D)C^{r}(\mathbb{R}^{D}). For technical reasons, it will often be more convenient to consider the subset F̊D,rFD,r\mathring{F}_{D,r}\subset F_{D,r} consisting of functions fFD,rf\in F_{D,r} vanishing at the origin, f(0)=0f(0)=0.

Remark A.1.

We note that previous work [54] considers the Sobolev space Wr,([0,1]D)W^{r,\infty}([0,1]^{D}) and the unit ball in Wr,([0,1]D)W^{r,\infty}([0,1]^{D}), rather than Cr(D)C^{r}(\mathbb{R}^{D}) and our definition of FD,rF_{D,r}. It turns out that the complexity bounds of [54] remain true also in our setting (with essentially identical proofs). We include the necessary arguments below.

Let f:Df:\mathbb{R}^{D}\to\mathbb{R} be a function. Following [54, Sect. 4.3], we will denote by 𝒩(f,ϵ)\mathcal{N}(f,\epsilon) the minimal number of hidden computation units that is required to approximate ff to accuracy ϵ\epsilon by an arbitrary ReLU feedforward network Ψ\Psi, uniformly over the unit cube [0,1]D[0,1]^{D}; i.e. 𝒩(f,ϵ)\mathcal{N}(f,\epsilon) is the minimal integer MM such that there exists a ReLU neural network Ψ\Psi with at most MM computation units (the number of computation units equals the number of scalar evaluations of the activation σ\sigma in a forward-pass, cf. [1]) and such that

supx[0,1]D|f(x)Ψ(x)|ϵ.\sup_{x\in[0,1]^{D}}|f(x)-\Psi(x)|\leq\epsilon.
Remark A.2.

We emphasize that, even though the domain of a function fFD,rf\in F_{D,r} is by definition the whole space D\mathbb{R}^{D}, the above quantity 𝒩(f,ϵ)\mathcal{N}(f,\epsilon) only relates to the approximation of ff over the unit cube [0,1]D[0,1]^{D}. The reason for this seemingly awkward choice is that it will greatly simplify our arguments later on, when we construct functionals on infinite-dimensional Banach spaces with a view towards proving an infinite-dimensional analogue of the curse of dimensionality.

Note that the number of (non-trivial) hidden computation units MM of a neural network Ψ:D\Psi:\mathbb{R}^{D}\to\mathbb{R} obeys the bounds Msize(Ψ)5DM4M\leq\mathrm{size}(\Psi)\leq 5DM^{4}: The lower bound follows from the fact that each (non-trivial) computation unit has at least one non-zero connection or a bias feeding into it. To derive the upper bound, we note that any network with at most MM computation units can be embedded in a fully connected enveloping network (allowing skip-connections) [54, cf. Fig. 6(a)] with depth MM, width MM, where each hidden node is connected to all other M21M^{2}-1 hidden nodes plus the output, and where each node in the input layer is connected to all M2M^{2} hidden units plus the output. In addition, we take into account that each computation unit and the output layer have an additional bias. The existence of this enveloping network thus implies the size bound

size(Ψ)\displaystyle\mathrm{size}(\Psi) M2(M21)+M2 hidden connections+D(M2+1)input conn.+M+1biases\displaystyle\leq\underbrace{M^{2}(M^{2}-1)+M^{2}}_{\text{ hidden connections}}+\underbrace{D(M^{2}+1)}_{\text{input conn.}}+\underbrace{M+1}_{\text{biases}}
M4+2DM4+M+15DM4,\displaystyle\leq M^{4}+2DM^{4}+M+1\leq 5DM^{4},

valid for any neural network Ψ:D\Psi:\mathbb{R}^{D}\to\mathbb{R} with at most MM computation units.
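The counting in the display above is elementary to verify; the following short check (our addition, not part of the paper) evaluates the three groups of parameters exactly as grouped in the display and confirms the stated bound 5DM45DM^{4} for a range of MM and DD.

```python
def enveloping_size(M, D):
    """Parameter count of the enveloping network, grouped as in the display above:
    hidden connections (incl. to output), input connections, and biases."""
    hidden = M**2 * (M**2 - 1) + M**2   # = M^4
    inputs = D * (M**2 + 1)
    biases = M + 1
    return hidden + inputs + biases

# the bound size(Psi) <= 5*D*M^4 then holds for all M, D >= 1:
assert all(
    enveloping_size(M, D) <= 5 * D * M**4
    for M in range(1, 50) for D in range(1, 50)
)
```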

In view of the lower size bound, size(Ψ)M\mathrm{size}(\Psi)\geq M, Proposition 2.1 above is now clearly implied by the following:

Proposition A.3.

Fix rr\in\mathbb{N}. There is a universal constant γ>0\gamma>0 and a constant ϵ¯=ϵ¯(r)>0\overline{\epsilon}=\overline{\epsilon}(r)>0, depending only on rr, such that for any DD\in\mathbb{N}, there exists a function fDF̊D,rf_{D}\in\mathring{F}_{D,r} for which

𝒩(fD,ϵ)ϵγD/r,ϵϵ¯.\mathcal{N}(f_{D},\epsilon)\geq\epsilon^{-\gamma D/r},\quad\forall\,\epsilon\leq\overline{\epsilon}.

As an immediate corollary of Proposition A.3, we conclude that fDf_{D} cannot be approximated to accuracy ϵ\epsilon by a family of ReLU neural networks Ψϵ\Psi_{\epsilon} with

size(Ψϵ)=o(ϵγD/r).\mathrm{size}(\Psi_{\epsilon})=o(\epsilon^{-\gamma D/r}).

The latter conclusion was established by Yarotsky [54, Thm. 5] (with γ=1/9\gamma=1/9). Proposition 2.1 improves on Yarotsky’s result in two important ways: firstly, it implies that the lower bound size(Ψϵ)ϵγD/r\mathrm{size}(\Psi_{\epsilon})\geq\epsilon^{-\gamma D/r} holds for all ϵϵ¯\epsilon\leq\overline{\epsilon}, and not only along an (unspecified) sequence ϵk0\epsilon_{k}\to 0; secondly, it shows that the constant ϵ¯\overline{\epsilon} can be chosen independently of DD. This generalization of Yarotsky’s result will be crucial for the extension to the infinite-dimensional case, which is the goal of this work.

To prove Proposition A.3, we will need two intermediate results, which build on [54]. We start with the following lemma providing a lower bound on the required size of a fixed neural network architecture, which is able to approximate arbitrary fF̊D,rf\in\mathring{F}_{D,r} to accuracy ϵ0\epsilon\geq 0. This result is based on [54, Thm. 4(a)], but explicitly quantifies the dependence on the dimension DD. This dependence was left unspecified in earlier work.

Lemma A.4.

Fix rr\in\mathbb{N}. Let Ψ=Ψ(;θ)\Psi=\Psi({\,\cdot\,};\theta) be a ReLU neural network architecture depending on parameters θW\theta\in\mathbb{R}^{W}. There exists a constant c0=c0(r)>0c_{0}=c_{0}(r)>0, such that if

supfF̊D,rinfθWfΨ(;θ)L([0,1]D)ϵ,\sup_{f\in\mathring{F}_{D,r}}\inf_{\theta\in\mathbb{R}^{W}}\|f-\Psi({\,\cdot\,};\theta)\|_{L^{\infty}([0,1]^{D})}\leq\epsilon,

for some ϵ>0\epsilon>0, then W(c0ϵ)D/2rW\geq(c_{0}\epsilon)^{-D/2r}.

Proof.

The proof in [54] is based on the Vapnik-Chervonenkis (VC) dimension. We will not repeat the entire argument here, but instead discuss only the required changes in the proof of Yarotsky, which are needed to prove our extension of his result. In fact, there are only two differences between the proof of our Lemma A.4 and [54, Thm. 4(a)], which we now point out: The first difference is that we consider F̊D,r\mathring{F}_{D,r}, consisting of Cr(D)C^{r}(\mathbb{R}^{D}) functions vanishing at the origin, whereas [54] considers the unit ball in Wr,([0,1]D)W^{r,\infty}([0,1]^{D}). Nevertheless, the proof of [54] applies to our setting in a straightforward way. To see this, we recall that the construction in [54, Proof of Thm. 4, eq. (35)] considers fFD,rf\in F_{D,r} of the form

(A.1) f(x)=m=1NDymϕ(N(xxm)),\displaystyle f(x)=\sum_{m=1}^{N^{D}}y_{m}\phi(N(x-x_{m})),

with coefficients ymy_{m} to be determined later, and where ϕ:D\phi:\mathbb{R}^{D}\to\mathbb{R} is a CC^{\infty} bump function, which is required to satisfy (we note that choosing the \ell^{\infty}-distance to define the set where ϕ(x)=0\phi(x)=0, rather than the 2\ell^{2} Euclidean distance as in [54], is immaterial; the only requirement is that the supports of the functions on the right-hand side in the definition (A.1) do not overlap)

(A.2) ϕ(0)=1,andϕ(x)=0, if x>1/2.\displaystyle\phi(0)=1,\quad\text{and}\quad\phi(x)=0,\text{ if }\|x\|_{\ell^{\infty}}>1/2.

In (A.1), the points x1,,xND[0,1]Dx_{1},\dots,x_{N^{D}}\in[0,1]^{D} are chosen such that the \ell^{\infty}-distance between any two of them is at least 1/N1/N. We can easily ensure that f(0)=0f(0)=0, by choosing the points xmx_{m} to be of the form (j1,,jD)/N(j_{1},\dots,j_{D})/N, where j1,,jD{1,,N}j_{1},\dots,j_{D}\in\{1,\dots,N\}. Then, since for any multi-index α\alpha of order |α|r|\alpha|\leq r, the mixed derivative

maxx|Dαf(x)|Nrmaxm|ym|maxx|Dαϕ(x)|,\max_{x}|D^{\alpha}f(x)|\leq N^{r}\max_{m}|y_{m}|\max_{x}|D^{\alpha}\phi(x)|,

any function ff of the form (A.1) belongs to F̊D,r\mathring{F}_{D,r}, provided that

maxm|ym|c~1Nr,\max_{m}|y_{m}|\leq\widetilde{c}_{1}N^{-r},

where

(A.3) c~1=(sup|α|rsupx[0,1]D|Dαϕ(x)|)1.\displaystyle\widetilde{c}_{1}=\left(\sup_{|\alpha|\leq r}\sup_{x\in[0,1]^{D}}|D^{\alpha}\phi(x)|\right)^{-1}.

As shown in [54], the above construction allows one to encode arbitrary Boolean values z1,,zND{0,1}z_{1},\dots,z_{N^{D}}\in\{0,1\} by constructing suitable fF̊D,rf\in\mathring{F}_{D,r}; this in turn can be used to estimate a VC-dimension related to Ψ(;θ)\Psi({\,\cdot\,};\theta) from below, following verbatim the arguments in [54]. As no changes are required in this part of the proof, we will not repeat the details here; instead, following the argument leading up to [54, eq. (38)], and under the assumptions of Lemma A.4, Yarotsky’s argument immediately implies the following lower bound

Wc~0(3ϵc~1)D/2r,W\geq\widetilde{c}_{0}\left(\frac{3\epsilon}{\widetilde{c}_{1}}\right)^{-D/2r},

where c~0\widetilde{c}_{0} is an absolute constant.

To finish our proof of Lemma A.4, it remains to determine the dependence of the constant c~1\widetilde{c}_{1} in (A.3), on the parameters DD and rr. To this end, we construct a specific bump function ϕ:D\phi:\mathbb{R}^{D}\to\mathbb{R}. Recall that our only requirement on ϕ\phi in the above argument is that (A.2) must hold. To construct suitable ϕ\phi, let ϕ0:\phi_{0}:\mathbb{R}\to\mathbb{R}, ξϕ0(ξ)\xi\mapsto\phi_{0}(\xi) be a smooth bump function such that ϕ0(0)=1\phi_{0}(0)=1, ϕ0L1\|\phi_{0}\|_{L^{\infty}}\leq 1 and ϕ0(ξ)=0\phi_{0}(\xi)=0 for |ξ|>1/2|\xi|>1/2. We now make the particular choice

ϕ(x1,,xD):=j=1Dϕ0(xj).\phi(x_{1},\dots,x_{D}):=\prod_{j=1}^{D}\phi_{0}(x_{j}).

Let α=(α1,,αD)\alpha=(\alpha_{1},\dots,\alpha_{D}) be a multi-index with |α|=j=1Dαjr|\alpha|=\sum_{j=1}^{D}\alpha_{j}\leq r. Let κ:=|{αj0}|\kappa:=|\{\alpha_{j}\neq 0\}| denote the number of non-zero components of α\alpha. We note that κr\kappa\leq r. We thus have

|Dαϕ(x)|=j=1D|Dαjϕ0(xj)|αj0|Dαjϕ0(xj)|ϕ0Cr()κϕ0Cr()r.|D^{\alpha}\phi(x)|=\prod_{j=1}^{D}|D^{\alpha_{j}}\phi_{0}(x_{j})|\leq\prod_{\alpha_{j}\neq 0}|D^{\alpha_{j}}\phi_{0}(x_{j})|\leq\|\phi_{0}\|_{C^{r}(\mathbb{R})}^{\kappa}\leq\|\phi_{0}\|_{C^{r}(\mathbb{R})}^{r}.

In particular, we conclude that c~1=[sup|α|rsupx[0,1]D|Dαϕ(x)|]1\widetilde{c}_{1}=[\sup_{|\alpha|\leq r}\sup_{x\in[0,1]^{D}}|D^{\alpha}\phi(x)|]^{-1} can be bounded from below by c~1c~2:=ϕ0Cr()r\widetilde{c}_{1}\geq\widetilde{c}_{2}:=\|\phi_{0}\|_{C^{r}(\mathbb{R})}^{-r}, where c~2=c~2(r)\widetilde{c}_{2}=\widetilde{c}_{2}(r) clearly only depends on rr, and is independent of the ambient dimension DD. The claimed inequality of Lemma A.4 now follows from

Wc~0(3ϵc~1)D/2r(3ϵ(c~01)2r/Dc~2)D/2r(3ϵ(c~01)2rc~2)D/2r,W\geq\widetilde{c}_{0}\left(\frac{3\epsilon}{\widetilde{c}_{1}}\right)^{-D/2r}\geq\left(\frac{3\epsilon}{(\widetilde{c}_{0}\wedge 1)^{2r/D}\widetilde{c}_{2}}\right)^{-D/2r}\geq\left(\frac{3\epsilon}{(\widetilde{c}_{0}\wedge 1)^{2r}\widetilde{c}_{2}}\right)^{-D/2r},

i.e. we have W(c0ϵ)D/2rW\geq(c_{0}\epsilon)^{-D/2r} with constant

c0=3(c~01)2rc~2(r),c_{0}=\frac{3}{(\widetilde{c}_{0}\wedge 1)^{2r}\widetilde{c}_{2}(r)},

depending only on rr. ∎
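For concreteness, one admissible choice of the one-dimensional bump ϕ0\phi_{0} used in the preceding proof, and of its tensor product ϕ\phi, is sketched below (our illustration; any smooth ϕ0\phi_{0} with ϕ0(0)=1\phi_{0}(0)=1, ϕ0L1\|\phi_{0}\|_{L^{\infty}}\leq 1 and ϕ0(ξ)=0\phi_{0}(\xi)=0 for |ξ|>1/2|\xi|>1/2 serves equally well).

```python
import numpy as np

def phi0(xi):
    """Smooth 1-D bump: phi0(0) = 1, |phi0| <= 1, phi0(xi) = 0 for |xi| >= 1/2."""
    xi = np.asarray(xi, dtype=float)
    out = np.zeros_like(xi)
    inside = np.abs(xi) < 0.5
    s = 2.0 * xi[inside]                       # rescale so the support is |s| < 1
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - s**2))
    return out

def phi(x):
    """Tensor-product bump phi(x_1, ..., x_D) = prod_j phi0(x_j), as in the proof."""
    x = np.atleast_2d(x)                       # shape (n_points, D)
    return np.prod(phi0(x), axis=-1)

# phi(0) = 1, and phi vanishes outside the l^infty-ball of radius 1/2:
print(phi(np.zeros(3)), phi(0.6 * np.ones(3)))
```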

While Lemma A.4 applies to a fixed architecture capable of approximating all fF̊D,rf\in\mathring{F}_{D,r}, our goal is to instead construct a single fF̊D,rf\in\mathring{F}_{D,r} for which a similar lower complexity bound holds for arbitrary architectures. The construction of such ff will rely on the following lemma, based on [54, Lem. 3]:

Lemma A.5.

Fix rr\in\mathbb{N}. For any (fixed) ϵ>0\epsilon>0, there exists fϵF̊D,rf_{\epsilon}\in\mathring{F}_{D,r}, such that

𝒩(fϵ,ϵ)D1/4(c1ϵ)D/8r,\mathcal{N}(f_{\epsilon},\epsilon)\geq D^{-1/4}(c_{1}\epsilon)^{-D/8r},

where c1=c1(r)>0c_{1}=c_{1}(r)>0 depends only on rr.

Proof.

As explained above, any neural network (with potentially sparse architecture), Ψ:D\Psi:\mathbb{R}^{D}\to\mathbb{R}, with MM computation units can be embedded in the fully connected architecture Ψ~(;θ)\widetilde{\Psi}({\,\cdot\,};\theta), θW\theta\in\mathbb{R}^{W}, with size bound W5DM4W\leq 5DM^{4}. By Lemma A.4, it follows that if W5DM4<(c0ϵ)D/2rW\leq 5DM^{4}<(c_{0}\epsilon)^{-D/2r}, then there exists fϵF̊D,rf_{\epsilon}\in\mathring{F}_{D,r}, such that

infθWfϵΨ~(;θ)L>ϵ.\inf_{\theta\in\mathbb{R}^{W}}\|f_{\epsilon}-\widetilde{\Psi}({\,\cdot\,};\theta)\|_{L^{\infty}}>\epsilon.

Expressing the above size bound in terms of MM, it follows that for any network Ψ\Psi with M<(5D)1/4(c0ϵ)D/8rM<(5D)^{-1/4}(c_{0}\epsilon)^{-D/8r} computation units, we must have fϵΨL>ϵ\|f_{\epsilon}-\Psi\|_{L^{\infty}}>\epsilon. Thus, approximation of fϵf_{\epsilon} to within accuracy ϵ\epsilon requires at least M(5D)1/4(c0ϵ)D/8rM\geq(5D)^{-1/4}(c_{0}\epsilon)^{-D/8r} computation units. From the definition of 𝒩(fϵ,ϵ)\mathcal{N}(f_{\epsilon},\epsilon), we conclude that

𝒩(fϵ,ϵ)D1/4(c1ϵ)D/8r,\mathcal{N}(f_{\epsilon},\epsilon)\geq D^{-1/4}(c_{1}\epsilon)^{-D/8r},

for this choice of fϵF̊D,rf_{\epsilon}\in\mathring{F}_{D,r}, with constant c1=52rc0(r)c_{1}=5^{2r}c_{0}(r) depending only on rr. ∎

Lemma A.5 will be our main technical tool for the proof of Proposition A.3. We also recall that 𝒩(f,ϵ)\mathcal{N}(f,\epsilon) satisfies the following properties, [54, eq. (42)–(44)]:

(A.4a) 𝒩(af,|a|ϵ)=𝒩(f,ϵ),\displaystyle\mathcal{N}(af,|a|\epsilon)=\mathcal{N}(f,\epsilon),
(A.4b) 𝒩(f±g,ϵ+gL)𝒩(f,ϵ),\displaystyle\mathcal{N}(f\pm g,\epsilon+\|g\|_{L^{\infty}})\leq\mathcal{N}(f,\epsilon),
(A.4c) 𝒩(f1±f2,ϵ1+ϵ2)𝒩(f1,ϵ1)+𝒩(f2,ϵ2).\displaystyle\mathcal{N}(f_{1}\pm f_{2},\epsilon_{1}+\epsilon_{2})\leq\mathcal{N}(f_{1},\epsilon_{1})+\mathcal{N}(f_{2},\epsilon_{2}).
Proof.

(Proposition A.3) We define a rapidly decaying sequence ϵk0\epsilon_{k}\to 0, by ϵ1=1/4\epsilon_{1}=1/4 and ϵk+1=ϵk2\epsilon_{k+1}=\epsilon^{2}_{k}, so that by recursion ϵk=22k\epsilon_{k}=2^{-2^{k}}. We also define ak:=12ϵk1/2=12ϵk1a_{k}:=\frac{1}{2}\epsilon_{k}^{1/2}=\frac{1}{2}\epsilon_{k-1}. For later reference, we note that since ϵk1/2\epsilon_{k}\leq 1/2 for all kk, we have

(A.5) s=k+1as=12(ϵk+ϵk2+ϵk22+)12ϵks=02s=ϵk.\displaystyle\sum_{s=k+1}^{\infty}a_{s}=\frac{1}{2}\left(\epsilon_{k}+\epsilon_{k}^{2}+\epsilon_{k}^{2^{2}}+\dots\right)\leq\frac{1}{2}\epsilon_{k}\sum_{s=0}^{\infty}2^{-s}=\epsilon_{k}.
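Because ϵk=22k\epsilon_{k}=2^{-2^{k}} decays doubly exponentially, the tail bound (A.5) is easy to confirm numerically; the following check (ours, not part of the paper) uses exact rational arithmetic to avoid floating-point underflow.

```python
from fractions import Fraction

def eps(k):
    # eps_k = 2^{-2^k}
    return Fraction(1, 2 ** (2 ** k))

def a(k):
    # a_k = (1/2) * eps_k^{1/2} = eps_{k-1} / 2
    return eps(k - 1) / 2

# check sum_{s = k+1}^{k+9} a_s <= eps_k for the first few k; the omitted terms of
# the infinite tail are positive but vanishingly small, consistent with (A.5)
for k in range(1, 7):
    tail = sum(a(s) for s in range(k + 1, k + 10))
    assert tail <= eps(k)
print("truncated tail bound consistent with (A.5) for k = 1, ..., 6")
```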

Our goal is to construct fF̊D,rf\in\mathring{F}_{D,r} of the form

f=k=1akfk,with fkF̊D,rk.f=\sum_{k=1}^{\infty}a_{k}f_{k},\quad\text{with }f_{k}\in\mathring{F}_{D,r}\;\forall\,k\in\mathbb{N}.

We note that ak2ka_{k}\leq 2^{-k}, hence if ff is of the above form then,

fCr\displaystyle\|f\|_{C^{r}} s=1asfsCr1,f(0)=s=1asfs(0)=0=0,\displaystyle\leq\sum_{s=1}^{\infty}a_{s}\|f_{s}\|_{C^{r}}\leq 1,\quad f(0)=\sum_{s=1}^{\infty}a_{s}\underbrace{f_{s}(0)}_{=0}=0,

implies that fF̊D,rf\in\mathring{F}_{D,r} irrespective of the specific choice of fkF̊D,rf_{k}\in\mathring{F}_{D,r}. In the following construction, we choose γ:=1/32\gamma:=1/32 throughout. We determine fkf_{k} recursively, formally starting from the empty sum, i.e. f=0f=0. In the recursive step, given f1,,fk1f_{1},\dots,f_{k-1}, we distinguish two cases:

Case 1: Assume that

𝒩(s=1k1asfs,2ϵk)ϵkγD/r.\mathcal{N}\left(\sum_{s=1}^{k-1}a_{s}f_{s},2\epsilon_{k}\right)\geq\epsilon_{k}^{-\gamma D/r}.

In this case, we set fk:=0f_{k}:=0, and trivially obtain

(A.6) 𝒩(s=1kasfs,2ϵk)ϵkγD/r.\displaystyle\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right)\geq\epsilon_{k}^{-\gamma D/r}.

Case 2: In the other case, we have

𝒩(s=1k1asfs,2ϵk)<ϵkγD/r.\mathcal{N}\left(\sum_{s=1}^{k-1}a_{s}f_{s},2\epsilon_{k}\right)<\epsilon_{k}^{-\gamma D/r}.

Our first goal is to again choose fkf_{k} such that (A.6) holds, at least for sufficiently large kk. We note that, by the general properties of 𝒩\mathcal{N}, (A.4c), and the assumption of Case 2:

𝒩(s=1kasfs,2ϵk)\displaystyle\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right) 𝒩(akfk,4ϵk)𝒩(s=1k1asfs,2ϵk)\displaystyle\geq\mathcal{N}(a_{k}f_{k},4\epsilon_{k})-\mathcal{N}\left(\sum_{s=1}^{k-1}a_{s}f_{s},2\epsilon_{k}\right)
(A.7) 𝒩(akfk,4ϵk)ϵkγD/r.\displaystyle\geq\mathcal{N}(a_{k}f_{k},4\epsilon_{k})-\epsilon_{k}^{-\gamma D/r}.

By Lemma A.5, we can find fkF̊D,rf_{k}\in\mathring{F}_{D,r}, such that

(A.8) \displaystyle\mathcal{N}(a_{k}f_{k},4\epsilon_{k})\overset{(\mathrm{A.4a})}{=}\mathcal{N}(f_{k},4\epsilon_{k}/a_{k})=\mathcal{N}(f_{k},8\epsilon_{k}^{1/2})\geq D^{-1/4}(8c_{1}\epsilon_{k}^{1/2})^{-D/8r}.

This defines our recursive choice of fkf_{k} in Case 2. By (A.7) and (A.8), to obtain (A.6) it now suffices to show that

D1/4(8c1ϵk1/2)D/8r2ϵkγD/r.D^{-1/4}(8c_{1}\epsilon_{k}^{1/2})^{-D/8r}\geq 2\epsilon_{k}^{-\gamma D/r}.

It turns out that there exists ϵ¯>0\overline{\epsilon}>0 depending only on rr, such that

(82c12ϵk)D/16r2D1/4ϵkγD/r=(2r/γDDr/(4γD)ϵk)γD/r,(8^{2}c_{1}^{2}\epsilon_{k})^{-D/16r}\geq 2D^{1/4}\epsilon_{k}^{-\gamma D/r}=(2^{-r/\gamma D}D^{-r/(4\gamma D)}\epsilon_{k})^{-\gamma D/r},

for all ϵkϵ¯\epsilon_{k}\leq\overline{\epsilon}, and where γ=1/32\gamma=1/32. In fact, upon raising both sides to the exponent 32r/D-32r/D, it is straightforward to see that this inequality holds independently of DD\in\mathbb{N}, provided that

ϵkinfDDr/(4γD)[28r8c1(r)]4,\epsilon_{k}\leq\frac{\inf_{D}D^{-r/(4\gamma D)}}{[2^{8r}8c_{1}(r)]^{4}},

where we note that infDDr/(4γD)>0\inf_{D}D^{-r/(4\gamma D)}>0 on account of the fact that limDD1/D=1\lim_{D\to\infty}D^{-1/D}=1. Define ϵ¯\overline{\epsilon} by

\overline{\epsilon}(r):=\epsilon_{\overline{k}},\qquad\overline{k}:=\min{\left\{k\in\mathbb{N}\,\middle|\,\epsilon_{k}\leq\frac{\inf_{D}D^{-r/(4\gamma D)}}{[2^{8r}8c_{1}(r)]^{4}}\right\}}.

With this choice of ϵ¯,γ>0\overline{\epsilon},\gamma>0, and by construction of fkf_{k}, we then have

(A.9) 𝒩(s=1kasfs,2ϵk)ϵkγD/r,\displaystyle\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right)\geq\epsilon_{k}^{-\gamma D/r},

for all ϵkϵ¯\epsilon_{k}\leq\overline{\epsilon}. This is inequality (A.6), and concludes our discussion of Case 2.

Continuing the above construction by recursion, we obtain a sequence f1,f2,F̊D,rf_{1},f_{2},\dots\in\mathring{F}_{D,r}, such that (A.9) holds for any kk\in\mathbb{N} such that ϵkϵ¯\epsilon_{k}\leq\overline{\epsilon}. Define f=k=1akfkf=\sum_{k=1}^{\infty}a_{k}f_{k}. We claim that for any ϵϵ¯\epsilon\leq\overline{\epsilon} we have

𝒩(f,ϵ)ϵγD/2r.\mathcal{N}(f,\epsilon)\geq\epsilon^{-\gamma D/2r}.

To see this, we fix ϵϵ¯\epsilon\leq\overline{\epsilon}. Choose kk\in\mathbb{N}, such that ϵkϵ¯\epsilon_{k}\leq\overline{\epsilon} and ϵk2ϵϵk\epsilon_{k}^{2}\leq\epsilon\leq\epsilon_{k}. Then,

𝒩(f,ϵ)\displaystyle\mathcal{N}(f,\epsilon) 𝒩(f,ϵk)=𝒩(s=1asfs,ϵk)\displaystyle\geq\mathcal{N}(f,\epsilon_{k})=\mathcal{N}\left(\sum_{s=1}^{\infty}a_{s}f_{s},\epsilon_{k}\right)
(A.4b)𝒩(s=1kasfs,ϵk+s=k+1asfsL)\displaystyle\overset{\mathclap{\underset{\downarrow}{(\ref{eq:N}b)}}}{\geq}\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},\epsilon_{k}+\left\|\sum_{s=k+1}^{\infty}a_{s}f_{s}\right\|_{L^{\infty}}\right)
𝒩(s=1kasfs,ϵk+s=k+1as)\displaystyle\geq\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},\epsilon_{k}+\sum_{s=k+1}^{\infty}a_{s}\right)
(A.5)𝒩(s=1kasfs,2ϵk)\displaystyle\overset{\mathclap{\underset{\downarrow}{(\ref{eq:atail})}}}{\geq}\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right)
(A.9)ϵkγD/rϵγD/2r,\displaystyle\overset{\mathclap{\underset{\downarrow}{\eqref{eq:fkc1}}}}{\geq}\epsilon_{k}^{-\gamma D/r}\geq\epsilon^{-\gamma D/2r},

where the last inequality follows from ϵkϵ1/2\epsilon_{k}\leq\epsilon^{1/2}. The claim of Proposition A.3 thus follows for all ϵϵ¯\epsilon\leq\overline{\epsilon}, upon redefining the universal constant as γ=(1/32)/2=1/64\gamma=(1/32)/2=1/64. ∎

A.2. Proof of Lemma 2.7

Proof.

(Lemma 2.7) Since the interior of Ω\Omega is non-empty, upon a rescaling and shift of the domain we may wlog assume that [0,2π]dΩ[0,2\pi]^{d}\subset\Omega. Let us define eκsin(κx)e_{\kappa}\propto\sin(\kappa\cdot x) as a suitable re-normalization of the standard Fourier sine-basis, indexed by κd\kappa\in\mathbb{N}^{d}, and normalized such that eκCs=1\|e_{\kappa}\|_{C^{s}}=1. We note that for each eκe_{\kappa} we can define a bi-orthogonal functional eκe^{\ast}_{\kappa} by

eκ(u):=2(2π)d[0,2π]du(x)sin(κx)𝑑x.e^{\ast}_{\kappa}(u):=\frac{2}{(2\pi)^{d}}\int_{[0,2\pi]^{d}}u(x)\sin(\kappa\cdot x)\,dx.

The norm of eκe^{\ast}_{\kappa} is easily seen to be bounded by 22. Hence, for any enumeration jκ(j)j\mapsto\kappa(j), the sequence eκ(j)e_{\kappa(j)} satisfies the assumptions in the definition of an infinite-dimensional hypercube.

Furthermore, if jκ(j)j\mapsto\kappa(j) is an enumeration of κd\kappa\in\mathbb{N}^{d}, such that jκ(j)j\mapsto\|\kappa(j)\|_{\ell^{\infty}} is monotonically increasing (non-decreasing), we note that any series of the form

u=Aj=1jαyjeκ(j),yj[0,1],u=A\sum_{j=1}^{\infty}j^{-\alpha}y_{j}e_{\kappa(j)},\quad y_{j}\in[0,1],

is absolutely convergent in Cρ(Ω)C^{\rho}(\Omega), provided that

j=1jαeκ(j)Cρ(Ω)j=1jακ(j)ρs<.\sum_{j=1}^{\infty}j^{-\alpha}\|e_{\kappa(j)}\|_{C^{\rho}(\Omega)}\sim\sum_{j=1}^{\infty}j^{-\alpha}\|{\kappa(j)}\|_{\ell^{\infty}}^{\rho-s}<\infty.

Inverting the enumeration j=j(κ)j=j(\kappa) for κd\kappa\in\mathbb{N}^{d}, and setting K:=κK:=\|\kappa\|_{\ell^{\infty}}, we find that

#{κd|κ<K}j(κ)#{κd|κK},\displaystyle\#{\left\{\kappa^{\prime}\in\mathbb{N}^{d}\,\middle|\,\|\kappa^{\prime}\|_{\ell^{\infty}}<K\right\}}\leq j(\kappa)\leq\#{\left\{\kappa^{\prime}\in\mathbb{N}^{d}\,\middle|\,\|\kappa^{\prime}\|_{\ell^{\infty}}\leq K\right\}},

where the number of elements in the lower and upper bounding sets are Kd\sim K^{d}. We thus conclude that j(κ)Kdj(\kappa)\sim K^{d}. We also note that each shell {κd|κ=K}{\left\{\kappa\in\mathbb{N}^{d}\,\middle|\,\|\kappa\|_{\ell^{\infty}}=K\right\}}, with KK\in\mathbb{N}, contains Kd1\sim K^{d-1} elements. Thus, we have

j=1jακ(j)ρs\displaystyle\sum_{j=1}^{\infty}j^{-\alpha}\|{\kappa(j)}\|_{\ell^{\infty}}^{\rho-s} K=1κ=Kj(κ)ακρs\displaystyle\sim\sum_{K=1}^{\infty}\sum_{\|\kappa\|_{\ell^{\infty}}=K}j(\kappa)^{-\alpha}\|{\kappa}\|_{\ell^{\infty}}^{\rho-s}
K=1Kαdκ=Kκρs\displaystyle\sim\sum_{K=1}^{\infty}K^{-\alpha d}\sum_{\|\kappa\|_{\ell^{\infty}}=K}\|{\kappa}\|_{\ell^{\infty}}^{\rho-s}
K=1KαdKd1Kρs\displaystyle\sim\sum_{K=1}^{\infty}K^{-\alpha d}K^{d-1}K^{\rho-s}
=K=1K1d(α1ρsd).\displaystyle=\sum_{K=1}^{\infty}K^{-1-d\left(\alpha-1-\frac{\rho-s}{d}\right)}.

The last sum converges if α>1ρsd\alpha>1-\frac{\rho-s}{d}. Thus, choosing A>0A>0 sufficiently small ensures that Qα=Qα(A;e1,e2,)Q_{\alpha}=Q_{\alpha}(A;e_{1},e_{2},\dots) is a subset of K={uCρ(Ω)|uCρM}K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq M\right\}}. ∎
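The orthogonality relation underlying this bi-orthogonal construction, namely 2(2π)d[0,2π]dsin(κx)sin(κx)𝑑x=δκκ\frac{2}{(2\pi)^{d}}\int_{[0,2\pi]^{d}}\sin(\kappa\cdot x)\sin(\kappa^{\prime}\cdot x)\,dx=\delta_{\kappa\kappa^{\prime}} for κ,κd\kappa,\kappa^{\prime}\in\mathbb{N}^{d}, can be verified by direct quadrature. The snippet below (our illustration, in d=2d=2, ignoring the CsC^{s}-normalization constants of eκe_{\kappa}, which only rescale the pairing) performs this check.

```python
import numpy as np
from itertools import product

d = 2
kappas = list(product(range(1, 4), repeat=d))    # a few multi-indices in N^d
n = 200                                          # quadrature points per dimension
x = (np.arange(n) + 0.5) * 2 * np.pi / n         # midpoint rule on [0, 2*pi]
X = np.stack(np.meshgrid(*([x] * d), indexing="ij"), axis=-1)   # shape (n, n, d)
w = (2 * np.pi / n) ** d                         # quadrature weight

def pairing(kappa, kappa_prime):
    """Approximates (2/(2*pi)^d) * int sin(kappa . x) sin(kappa' . x) dx."""
    integrand = np.sin(X @ np.array(kappa)) * np.sin(X @ np.array(kappa_prime))
    return 2.0 / (2 * np.pi) ** d * w * integrand.sum()

G = np.array([[pairing(k1, k2) for k2 in kappas] for k1 in kappas])
print(np.max(np.abs(G - np.eye(len(kappas)))))   # close to zero: bi-orthogonality
```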

A.3. Proof of Theorem 2.11

We now provide the detailed proof of Theorem 2.11, which builds on Proposition A.3 above.

Proof.

Fix δ>0\delta>0. By assumption on K𝒳K\subset\mathcal{X}, there exists A>0A>0 and a linearly independent sequence e1,e2,𝒳e_{1},e_{2},\dots\in\mathcal{X} of normed elements, such that the set QαQ_{\alpha} consisting of all u𝒳u\in\mathcal{X} of the form

u=Aj=1jαyjej,yj[0,1],u=A\sum_{j=1}^{\infty}j^{-\alpha}y_{j}e_{j},\quad y_{j}\in[0,1],

defines a subset QαKQ_{\alpha}\subset K.

Step 1: (Finding embedded cubes [0,1]D\simeq[0,1]^{D}) We note that for any kk\in\mathbb{N} and for any choice of yj[0,1]y_{j}\in[0,1], with indices j=2k,,2k+11j=2^{k},\dots,2^{k+1}-1, we have

(A.10) u=A2(k+1)αj=2k2k+11yjejQα.\displaystyle u=A2^{-(k+1)\alpha}\sum_{j=2^{k}}^{2^{k+1}-1}y_{j}e_{j}\in Q_{\alpha}.

Set D=2kD=2^{k}, and let us denote the set of all such uu by Q¯D\overline{Q}_{D}, in the following. Note that up to the constant rescaling by A2(k+1)αA2^{-(k+1)\alpha}, Q¯D\overline{Q}_{D} can be identified with the DD-dimensional unit cube [0,1]D[0,1]^{D}. In particular, since Q¯DQαK\overline{Q}_{D}\subset Q_{\alpha}\subset K, it follows that KK contains a rescaled copy of [0,1]D[0,1]^{D} for any such DD. Furthermore, it will be important in our construction that all of the embedded cubes, defined by (A.10) for kk\in\mathbb{N}, only intersect at the origin.

By Proposition A.3, there exist constants γ,ϵ¯>0\gamma,\overline{\epsilon}>0, independent of DD, such that given any D=2kD=2^{k}, we can find fD:Df_{D}:\mathbb{R}^{D}\to\mathbb{R}, fDF̊D,rf_{D}\in\mathring{F}_{D,r}, for which the following lower complexity bound holds:

(A.11) 𝒩(fD,ϵ)ϵγD/r,ϵϵ¯.\displaystyle\mathcal{N}(f_{D},\epsilon)\geq\epsilon^{-\gamma D/r},\quad\forall\epsilon\leq\overline{\epsilon}.

Our aim is to construct 𝒮:𝒳\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R}, such that the restriction 𝒮|Q¯D\mathcal{S}^{\dagger}|_{\overline{Q}_{D}} to the cube Q¯D[0,1]D\overline{Q}_{D}\simeq[0,1]^{D} “reproduces” this fDf_{D}. If this can be achieved, then our intuition is that 𝒮\mathcal{S}^{\dagger} embeds all fDf_{D} for D=2kD=2^{k}, kk\in\mathbb{N}, at once, and hence a rescaled version of the lower complexity bound DϵγD/r\gtrsim_{D}\epsilon^{-\gamma D/r} should hold for any DD. Our next aim is to make this precise, and determine the implicit constant that arises due to the fact that Q¯D\overline{Q}_{D} is only a rescaled version of [0,1]D[0,1]^{D}.

Step 2: (Construction of 𝒮\mathcal{S}^{\dagger}) To construct suitable 𝒮\mathcal{S}^{\dagger}, we first recall that we assume the existence of “bi-orthogonal” elements e1,e2,e^{\ast}_{1},e^{\ast}_{2},\dots in the continuous dual space 𝒳\mathcal{X}^{\ast}, such that

ei(ej)=δij,i,j,e^{\ast}_{i}(e_{j})=\delta_{ij},\;\forall\,i,j\in\mathbb{N},

and furthermore, there exists a constant M>0M>0, such that ej𝒳M\|e^{\ast}_{j}\|_{\mathcal{X}^{\ast}}\leq M for all jj\in\mathbb{N}. Given the functions fD=f2kf_{D}=f_{2^{k}} from Step 1, we now make the following ansatz for 𝒮\mathcal{S}^{\dagger}:

(A.12) 𝒮(u)=k=12αrkf2k(A12(k+1)α[e2k(u),,e2k+11(u)]),\displaystyle\mathcal{S}^{\dagger}(u)=\sum_{k=1}^{\infty}2^{-\alpha^{\ast}rk}f_{2^{k}}\left(A^{-1}2^{(k+1)\alpha}\left[e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u)\right]\right),

Here f2k=fDf_{2^{k}}=f_{D} (for D=2kD=2^{k}) satisfies (A.11) and α=1+α+δ/2\alpha^{\ast}=1+\alpha+\delta/2. We note in passing that 𝒮\mathcal{S}^{\dagger} defines an rr-times Fréchet differentiable functional. This will be rigorously shown in Lemma A.7 below (with c=A12αc=A^{-1}2^{\alpha}).

Step 3: (Relating 𝒮\mathcal{S}^{\dagger} with fDf_{D}) We next show in which way “the restriction 𝒮|Q¯D\mathcal{S}^{\dagger}|_{\overline{Q}_{D}} to the cube Q¯D[0,1]D\overline{Q}_{D}\simeq[0,1]^{D} reproduces fDf_{D}”. Note that if uQ¯Du\in\overline{Q}_{D} is of the form (A.10) with D=2kD=2^{k} and with coefficients

y:=[y2k,,y2k+11][0,1]2k=[0,1]D,y:=[y_{2^{k}},\dots,y_{2^{k+1}-1}]\in[0,1]^{2^{k}}=[0,1]^{D},

then, if kkk^{\prime}\neq k,

[e2k(u),,e2k+11(u)]=[0,,0],[e^{\ast}_{2^{k^{\prime}}}(u),\dots,e^{\ast}_{2^{k^{\prime}+1}-1}(u)]=[0,\dots,0],

while for k=kk^{\prime}=k we find

[e2k(u),,e2k+11(u)]=[y2k,,y2k+11]=y.[e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u)]=[y_{2^{k}},\dots,y_{2^{k+1}-1}]=y.

From the fact that f2k(0)=0f_{2^{k^{\prime}}}(0)=0 for all kk^{\prime} by construction (recall that f2kF̊2k,rf_{2^{k^{\prime}}}\in\mathring{F}_{2^{k^{\prime}},r}), we conclude that

(A.13) 𝒮(u)=2αrkf2k(y)=DαrfD(y),\displaystyle\mathcal{S}^{\dagger}(u)=2^{-\alpha^{\ast}rk}f_{2^{k}}(y)=D^{-\alpha^{\ast}r}f_{D}(y),

for any uQ¯Du\in\overline{Q}_{D}. In this sense, “𝒮|Q¯D\mathcal{S}^{\dagger}|_{\overline{Q}_{D}} reproduces fDf_{D}”.

Step 4: (Lower complexity bound, uniform in DD) Let 𝒮ϵ:𝒳\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathbb{R} be a family of operators of neural network-type, such that

supuK|𝒮(u)𝒮ϵ(u)|ϵ,ϵ>0.\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)|\leq\epsilon,\quad\forall\,\epsilon>0.

By assumption on 𝒮ϵ\mathcal{S}_{\epsilon} being of neural network-type, there exists =ϵ\ell=\ell_{\epsilon}\in\mathbb{N}, a linear mapping ϵ:𝒳\mathcal{L}_{\epsilon}:\mathcal{X}\to\mathbb{R}^{\ell}, and Φϵ:\Phi_{\epsilon}:\mathbb{R}^{\ell}\to\mathbb{R} a neural network representing 𝒮ϵ\mathcal{S}_{\epsilon}:

𝒮ϵ(u)=Φϵ(ϵu),u𝒳.\mathcal{S}_{\epsilon}(u)=\Phi_{\epsilon}(\mathcal{L}_{\epsilon}u),\quad\forall\,u\in\mathcal{X}.

For D=2kD=2^{k}, kk\in\mathbb{N}, define ιD:D𝒳\iota_{D}:\mathbb{R}^{D}\to\mathcal{X} by

ιD(y)=A2αkj=12kyje2k+j1.\iota_{D}(y)=A2^{-\alpha k}\sum_{j=1}^{2^{k}}y_{j}e_{2^{k}+j-1}.

Then, since ϵιD:D\mathcal{L}_{\epsilon}\circ\iota_{D}:\mathbb{R}^{D}\to\mathbb{R}^{\ell} is a linear mapping, there exists a matrix WD×DW_{D}\in\mathbb{R}^{\ell\times D}, such that ϵιD(y)=WDy\mathcal{L}_{\epsilon}\circ\iota_{D}(y)=W_{D}y for all yDy\in\mathbb{R}^{D}. In particular, it follows that

𝒮ϵ(ιD(y))=Φϵ(WDy).\mathcal{S}_{\epsilon}(\iota_{D}(y))=\Phi_{\epsilon}(W_{D}y).

By (A.10), we clearly have ιD([0,1]D)=Q¯D\iota_{D}([0,1]^{D})=\overline{Q}_{D}. Let Φ^ϵ(y):=DαrΦϵ(WDy)\widehat{\Phi}_{\epsilon}(y):=D^{\alpha^{\ast}r}\Phi_{\epsilon}(W_{D}y). It now follows that

supy[0,1]D|fD(y)Φ^ϵ(y)|\displaystyle\sup_{y\in[0,1]^{D}}\left|f_{D}(y)-\widehat{\Phi}_{\epsilon}(y)\right| =Dαrsupy[0,1]D|DαrfD(y)Φϵ(WDy)|\displaystyle=D^{\alpha^{\ast}r}\sup_{y\in[0,1]^{D}}\left|D^{-\alpha^{\ast}r}f_{D}(y)-\Phi_{\epsilon}(W_{D}y)\right|
=DαrsupuQ¯D|𝒮(u)𝒮ϵ(u)|\displaystyle=D^{\alpha^{\ast}r}\sup_{u\in\overline{Q}_{D}}\left|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\right|
DαrsupuK|𝒮(u)𝒮ϵ(u)|\displaystyle\leq D^{\alpha^{\ast}r}\sup_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\right|
Dαrϵ.\displaystyle\leq D^{\alpha^{\ast}r}\epsilon.

Let MM denote the number of hidden computation units of Φ^ϵ\widehat{\Phi}_{\epsilon}. By construction of fDf_{D} (cp. Proposition A.3), we have

size(Φ^ϵ)M𝒩(fD,Dαrϵ)(Dαrϵ)γD/r,\mathrm{size}(\widehat{\Phi}_{\epsilon})\geq M\geq\mathcal{N}(f_{D},D^{\alpha^{\ast}r}\epsilon)\geq(D^{\alpha^{\ast}r}\epsilon)^{-\gamma D/r},

whenever Dαrϵϵ¯D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}. On the other hand, we can also estimate (cp. equation (2.2) in Section 2.1.2),

size(Φ^ϵ)2WD0+2size(Φϵ)2Dsize(Φϵ)+2size(Φϵ)4Dsize(Φϵ).\mathrm{size}(\widehat{\Phi}_{\epsilon})\leq 2\|W_{D}\|_{0}+2\,\mathrm{size}(\Phi_{\epsilon})\leq 2D\mathrm{size}(\Phi_{\epsilon})+2\,\mathrm{size}(\Phi_{\epsilon})\leq 4D\mathrm{size}(\Phi_{\epsilon}).

Combining these bounds, we conclude that

size(Φϵ)12D(Dαrϵ)γD/r=12(Dαr+r/γDϵ)γD/r,\mathrm{size}(\Phi_{\epsilon})\geq\frac{1}{2D}(D^{\alpha^{\ast}r}\epsilon)^{-\gamma D/r}=\frac{1}{2}(D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r},

holds for any neural network representation of 𝒮ϵ\mathcal{S}_{\epsilon}, whenever Dαrϵϵ¯D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}. And hence

(A.14) cmplx(𝒮ϵ)12(Dαr+r/γDϵ)γD/r,\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\frac{1}{2}(D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r},

whenever Dαrϵϵ¯D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}. By Lemma A.6 below, taking the supremum on the right over all admissible DD implies the lower bound

cmplx(𝒮ϵ)exp(cϵ1/(α+δ/2)r),ϵϵ~,\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha^{\ast}+\delta/2)r}),\quad\forall\epsilon\leq\widetilde{\epsilon},

where ϵ~,c>0\widetilde{\epsilon},c>0 depend only on α,r,ϵ¯(r),δ\alpha^{\ast},r,\overline{\epsilon}(r),\delta and γ\gamma. Recalling our choice of α=1+α+δ/2\alpha^{\ast}=1+\alpha+\delta/2, and the fact that the constant ϵ¯=ϵ¯(r)\overline{\epsilon}=\overline{\epsilon}(r) depends only on rr, while γ\gamma is universal, it follows that

cmplx(𝒮ϵ)exp(cϵ1/(α+1+δ)r),ϵϵ~.\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}),\quad\forall\epsilon\leq\widetilde{\epsilon}.

with ϵ~,c>0\widetilde{\epsilon},c>0 depending only on α,r,δ\alpha,r,\delta. Up to a change in notation, this is the claimed complexity bound. ∎

The following lemma addresses the optimization of the lower bound in (A.14):

Lemma A.6.

Let rr\in\mathbb{N}, and α,ϵ¯,γ>0\alpha^{\ast},\overline{\epsilon},\gamma>0 be given. Assume that

cmplx(𝒮ϵ)12(Dαr+r/γDϵ)γD/r,\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\frac{1}{2}\left(D^{\alpha^{\ast}r+r/\gamma D}\epsilon\right)^{-\gamma D/r},

for any DD of the form D=2kD=2^{k}, kk\in\mathbb{N}, and whenever Dαrϵϵ¯D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}. Fix a small parameter δ>0\delta>0. There exist ϵ~,c>0\widetilde{\epsilon},c>0, depending only on r,α,γ,ϵ¯,δr,\alpha^{\ast},\gamma,\overline{\epsilon},\delta, such that

cmplx(𝒮ϵ)exp(cϵ1/(α+δ)r),ϵϵ~,\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp\left(c\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right),\quad\forall\epsilon\leq\widetilde{\epsilon},
Proof.

Write ϵ¯=eβ\overline{\epsilon}=e^{-\beta} for β\beta\in\mathbb{R}. Fix a small parameter δ>0\delta>0. Since we restrict attention to ϵϵ~\epsilon\leq\widetilde{\epsilon}, we have

\left(e^{\beta}\epsilon\right)^{-1/(\alpha^{\ast}+\delta)r}\geq\left(e^{\beta}\widetilde{\epsilon}\right)^{-1/(\alpha^{\ast}+\delta)r}\geq 1,

provided that ϵ~ϵ¯\widetilde{\epsilon}\leq\overline{\epsilon}. Given ϵϵ~\epsilon\leq\widetilde{\epsilon}, choose kk\in\mathbb{N}, such that

2k1<(eβϵ)1/(α+δ)r2k.2^{k-1}<\left(e^{\beta}\epsilon\right)^{-1/(\alpha^{\ast}+\delta)r}\leq 2^{k}.

Let D=2kD=2^{k}. Note that this defines a function D=D(ϵ)D=D(\epsilon). For any ϵ\epsilon, we can write

(A.15) D(ϵ)=ξ(eβϵ)1/(α+δ)r,\displaystyle D(\epsilon)=\xi\,\left(e^{\beta}\epsilon\right)^{-1/(\alpha^{\ast}+\delta)r},

for some ξ(1/2,1]\xi\in(1/2,1]. We now note that for ϵϵ~\epsilon\leq\widetilde{\epsilon},

1γD=(eβϵ)1/(α+δ)rγξ2(eβϵ~)1/(α+δ)rγ.\frac{1}{\gamma D}=\frac{\left(e^{\beta}\epsilon\right)^{1/(\alpha^{\ast}+\delta)r}}{\gamma\xi}\leq\frac{2\left(e^{\beta}\widetilde{\epsilon}\right)^{1/(\alpha^{\ast}+\delta)r}}{\gamma}.

Decreasing the size of ϵ~=ϵ~(r,γ,α,ϵ¯,δ)\widetilde{\epsilon}=\widetilde{\epsilon}(r,\gamma,\alpha^{\ast},\overline{\epsilon},\delta) further, we can ensure that for ϵϵ~\epsilon\leq\widetilde{\epsilon},

1γD(ϵ)2(eβϵ~)1/(α+δ)rγδ2.\frac{1}{\gamma D(\epsilon)}\leq\frac{2\left(e^{\beta}\widetilde{\epsilon}\right)^{1/(\alpha^{\ast}+\delta)r}}{\gamma}\leq\frac{\delta}{2}.

Note also that 2r/γD2δr/2Dδr/22^{r/\gamma D}\leq 2^{\delta r/2}\leq D^{\delta r/2} for any DD of the form D=2kD=2^{k}, kk\in\mathbb{N}. It thus follows that for any given ϵϵ~\epsilon\leq\widetilde{\epsilon}, and with our particular choice of D=D(ϵ)D=D(\epsilon) satisfying (A.15), we have

2r/γDDαr+r/γDϵD(α+δ)rϵ=eβξ(α+δ)reβ.2^{r/\gamma D}D^{\alpha^{\ast}r+r/\gamma D}\epsilon\leq D^{(\alpha^{\ast}+\delta)r}\epsilon=e^{-\beta}\xi^{(\alpha^{\ast}+\delta)r}\leq e^{-\beta}.

Note that this in particular implies that Dαrϵeβ=ϵ¯D^{\alpha^{\ast}r}\epsilon\leq e^{-\beta}=\overline{\epsilon}. We conclude that

cmplx(𝒮ϵ)\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon}) 12(Dαr+r/γDϵ)γD/r\displaystyle\geq\frac{1}{2}(D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r}
=(2r/γDDαr+r/γDϵ)γD/r\displaystyle=(2^{r/\gamma D}D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r}
(D(α+δ)rϵ)γD/r\displaystyle\geq(D^{(\alpha^{\ast}+\delta)r}\epsilon)^{-\gamma D/r}
exp(βγDr)\displaystyle\geq\exp\left(\frac{\beta\gamma D}{r}\right)
\displaystyle=\exp\left(\frac{\beta\gamma\xi e^{-\beta/(\alpha^{\ast}+\delta)r}}{r}\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right)
\displaystyle\geq\exp\left(\frac{\beta\gamma e^{-\beta/(\alpha^{\ast}+\delta)r}}{2r}\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right).

Upon defining c=c(r,γ,α,ϵ¯,δ)c=c(r,\gamma,\alpha^{\ast},\overline{\epsilon},\delta) as

c:=\frac{\beta\gamma e^{-\beta/(\alpha^{\ast}+\delta)r}}{2r},

we obtain the lower bound

\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp\left(c\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right),\quad\forall\,\epsilon\leq\widetilde{\epsilon}.

This concludes the proof of Lemma A.6. ∎

In the lemma below, we provide a simple result on Fréchet differentiability which was used in our proof of Theorem 2.11:

Lemma A.7.

(Fréchet differentiability of a series) Assume that we are given a bounded family of functions fDCr(D)f_{D}\in C^{r}(\mathbb{R}^{D}) indexed by integers D=2kD=2^{k}, kk\in\mathbb{N}, such that fDCr(D)1\|f_{D}\|_{C^{r}(\mathbb{R}^{D})}\leq 1 for all DD. Let e1,e2,:𝒳e^{\ast}_{1},e^{\ast}_{2},\dots:\mathcal{X}\to\mathbb{R} be a sequence of linear functionals, such that ej𝒳M\|e^{\ast}_{j}\|_{\mathcal{X}^{\ast}}\leq M for all jj\in\mathbb{N}. Let c,α>0c,\alpha>0 be given, and assume that α>1+α\alpha^{\ast}>1+\alpha. Then, the functional 𝒮:𝒳\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R} defined by the series

𝒮(u):=k=12αrkf2k(c2αk(e2k(u),,e2k+11(u))),\mathcal{S}^{\dagger}(u):=\sum_{k=1}^{\infty}2^{-\alpha^{\ast}rk}f_{2^{k}}\left(c2^{\alpha k}(e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u))\right),

is rr-times Fréchet differentiable.

Proof.

By assumption on f2kf_{2^{k}} and the linearity of the functionals eje^{\ast}_{j}, each nonlinear functional k(u):=f2k(c2αk(e2k(u),,e2k+11(u)))\mathcal{F}_{k}(u):=f_{2^{k}}\left(c2^{\alpha k}(e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u))\right) in the series defining 𝒮\mathcal{S}^{\dagger} is rr-times continuously differentiable. Fixing u𝒳u\in\mathcal{X}, let us denote x=c2αk(e2k(u),,e2k+11(u))x=c2^{\alpha k}(e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u)). The \ell-th total derivative dkd^{\ell}\mathcal{F}_{k} of k\mathcal{F}_{k} (r\ell\leq r) is given by

dk(u)[v1,,v]=c2αkj1,,j=12kf2k(x)xj1xjs=1e2k+js1(vs).d^{\ell}\mathcal{F}_{k}(u)[v_{1},\dots,v_{\ell}]=c^{\ell}2^{\alpha\ell k}\sum_{j_{1},\dots,j_{\ell}=1}^{2^{k}}\frac{\partial^{\ell}f_{2^{k}}(x)}{\partial x_{j_{1}}\dots\partial x_{j_{\ell}}}\prod_{s=1}^{\ell}e^{\ast}_{2^{k}+j_{s}-1}(v_{s}).

By assumption, we have

|f2k(x)xj1xj|1.\left|\frac{\partial^{\ell}f_{2^{k}}(x)}{\partial x_{j_{1}}\dots\partial x_{j_{\ell}}}\right|\leq 1.

Since the sum over j1,,jj_{1},\dots,j_{\ell} has 2k2^{k\ell} terms, and since the functionals are bounded ejM\|e^{\ast}_{j}\|\leq M by assumption, we can now readily estimate the operator norm dk(u)\|d^{\ell}\mathcal{F}_{k}(u)\| for r\ell\leq r by

dk(u)c2αk2kM(cM)r2(α+1)rk.\|d^{\ell}\mathcal{F}_{k}(u)\|\leq c^{\ell}2^{\alpha\ell k}2^{k\ell}M^{\ell}\leq(cM)^{r}2^{(\alpha+1)rk}.

In particular, for any r\ell\leq r, the series

k=12αrkdk(u)k=12[α(α+1)]rk(cM)r<,\sum_{k=1}^{\infty}2^{-\alpha^{\ast}rk}\|d^{\ell}\mathcal{F}_{k}(u)\|\leq\sum_{k=1}^{\infty}2^{-[\alpha^{\ast}-(\alpha+1)]rk}(cM)^{r}<\infty,

is uniformly convergent. Thus, 𝒮\mathcal{S}^{\dagger} is a uniform limit of rr-times continuously differentiable functionals, all of whose derivatives of order r\ell\leq r are also uniformly convergent. From this, it follows that 𝒮\mathcal{S}^{\dagger} is itself rr-times continuously Fréchet differentiable. ∎

A.4. Proof of Corollary 2.12

Let 𝒴=𝒴(Ω;p)\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p}) be a function space with continuous embedding in C(Ω;p)C(\Omega;\mathbb{R}^{p}). We will only consider the case p=1p=1; the case p>1p>1 follows by an almost identical argument. Let ϕ0𝒴\phi\neq 0\in\mathcal{Y} be a non-trivial function. Since 𝒴=𝒴(Ω)\mathcal{Y}=\mathcal{Y}(\Omega) is continuously embedded in C(Ω)C(\Omega), it follows that point evaluation evy(ϕ)=ϕ(y){\mathrm{ev}}_{y}(\phi)=\phi(y) is continuous. Given that ϕ\phi is non-trivial, there exists yΩy\in\Omega, such that evy(ϕ)0{\mathrm{ev}}_{y}(\phi)\neq 0. We may wlog assume that evy(ϕ)=ϕ(y)=1{\mathrm{ev}}_{y}(\phi)=\phi(y)=1. Let :𝒳\mathcal{F}^{\dagger}:\mathcal{X}\to\mathbb{R} be a functional exposing the curse of parametric complexity, as constructed in Theorem 2.11. We define an rr-times Fréchet differentiable operator 𝒮:𝒳𝒴\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y} by 𝒮(u):=(u)ϕ\mathcal{S}^{\dagger}(u):=\mathcal{F}^{\dagger}(u)\phi.

The claim now follows immediately by observing that

evy𝒮(u)=(u),u𝒳,{\mathrm{ev}}_{y}\circ\mathcal{S}^{\dagger}(u)=\mathcal{F}^{\dagger}(u),\quad\forall\,u\in\mathcal{X},

and by noting that if 𝒮ϵ:𝒳𝒴\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y} is an operator of neural network-type, then evy𝒮ϵ:𝒳{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathbb{R} is a functional of neural network-type, and by assumption, with C:=evy𝒴C:=\|{\mathrm{ev}}_{y}\|_{\mathcal{Y}\to\mathbb{R}},

supuK|(u)evy𝒮ϵ(u)|\displaystyle\sup_{u\in K}|\mathcal{F}^{\dagger}(u)-{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}(u)| =supuK|evy𝒮(u)evy𝒮ϵ(u)|\displaystyle=\sup_{u\in K}|{\mathrm{ev}}_{y}\circ\mathcal{S}^{\dagger}(u)-{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}(u)|
supuKC𝒮(u)𝒮ϵ(u)𝒴\displaystyle\leq\sup_{u\in K}C\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|_{\mathcal{Y}}
Cϵ.\displaystyle\leq C\epsilon.

By our choice of :𝒳\mathcal{F}^{\dagger}:\mathcal{X}\to\mathbb{R}, this implies that the complexity of evy𝒮ϵ{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon} is bounded from below by exp(cϵ1/(α+1+δ)r)\exp(c\epsilon^{-1/(\alpha+1+\delta)r}) for some constant c=c(α,δ,r)c=c(\alpha,\delta,r). This in turn implies that

cmplx(𝒮ϵ)=supyΩcmplx(evy𝒮ϵ)exp(cϵ1/(α+1+δ)r).\mathrm{cmplx}(\mathcal{S}_{\epsilon})=\sup_{y\in\Omega}\mathrm{cmplx}({\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}).

This lower bound implies the exponential lower bound of Corollary 2.12.

A.5. Proofs of Lemmas 2.18, 2.20 and 2.22

A.5.1. Proof of Lemma 2.18

Proof.

(Lemma 2.18) We want to show that a PCA-Net neural operator 𝒮=Ψ\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E} is of neural network-type, and aim to bound size(Ψ)\mathrm{size}(\Psi) from below in terms of cmplx(𝒮)\mathrm{cmplx}(\mathcal{S}). To this end, we assume that 𝒴=𝒴(Ω;p)\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p}) is a Hilbert space of functions. Since \mathcal{R} is by definition linear, given an evaluation point yΩy\in\Omega, the mapping β(β)(y)evy(β)\beta\mapsto\mathcal{R}(\beta)(y)\equiv{\mathrm{ev}}_{y}\circ\mathcal{R}(\beta) defines a linear map evy:D𝒴p{\mathrm{ev}}_{y}\circ\mathcal{R}:\mathbb{R}^{{D_{\mathcal{Y}}}}\to\mathbb{R}^{p}. We can represent evy{\mathrm{ev}}_{y}\circ\mathcal{R} by matrix multiplication: evy(β)=Vyβ{\mathrm{ev}}_{y}\circ\mathcal{R}(\beta)=V_{y}\beta, with Vyp×D𝒴V_{y}\in\mathbb{R}^{p\times{D_{\mathcal{Y}}}}. The encoder :𝒳D𝒳\mathcal{E}:\mathcal{X}\to\mathbb{R}^{{D_{\mathcal{X}}}} is linear by definition; thus we can take :=\mathcal{L}:=\mathcal{E} for the linear map in the definition of “operator of neural network-type”. Define a neural network Φy:D𝒳p\Phi_{y}:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}^{p} by Φy(α)=VyΨ(α)\Phi_{y}(\alpha)=V_{y}\Psi(\alpha). Then, we have the identity

evy𝒮(u)=(evy)Ψ(u)=Φy(u),\displaystyle{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=({\mathrm{ev}}_{y}\circ\mathcal{R})\circ\Psi(\mathcal{E}u)=\Phi_{y}(\mathcal{L}u),

for all u𝒳u\in\mathcal{X}. This shows that 𝒮\mathcal{S} is of neural network-type. We now aim to estimate cmplx(𝒮)\mathrm{cmplx}(\mathcal{S}) in terms of size(Ψ)\mathrm{size}(\Psi). To this end, write Ψ(α)=[Ψ1(α),,ΨD𝒴(α)]\Psi(\alpha)=[\Psi_{1}(\alpha),\dots,\Psi_{{D_{\mathcal{Y}}}}(\alpha)] with component mappings Ψj:D𝒳\Psi_{j}:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}. Let 𝒥={j|j{1,,D𝒴},Ψj0}\mathcal{J}={\left\{j\,\middle|\,j\in\{1,\dots,{D_{\mathcal{Y}}}\},\;\Psi_{j}\neq 0\right\}} be the subset of indices for which Ψj:D𝒳\Psi_{j}:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R} is not the zero function. Define a (sparsified) matrix V^yp×D𝒴\widehat{V}_{y}\in\mathbb{R}^{p\times{D_{\mathcal{Y}}}} with jj-th column [V^y]:,j[\widehat{V}_{y}]_{:,j} defined by

[V^y]:,j:={[Vy]:,j,j𝒥,0j𝒥.\displaystyle[\widehat{V}_{y}]_{:,j}:=\begin{cases}[{V}_{y}]_{:,j},&j\in\mathcal{J},\\ 0&j\notin\mathcal{J}.\end{cases}

Then, we have V^y0p|𝒥|psize(Ψ)\|\widehat{V}_{y}\|_{0}\leq p|\mathcal{J}|\leq p\,\mathrm{size}(\Psi), and the identity Φy(α)=V^yΨ(α)\Phi_{y}(\alpha)=\widehat{V}_{y}\Psi(\alpha) for all αD𝒳\alpha\in\mathbb{R}^{{D_{\mathcal{X}}}}. Thus, using the concept of sparse concatenation (2.3), we can upper bound the complexity, cmplx(𝒮)\mathrm{cmplx}(\mathcal{S}), in terms of the size, size(Ψ)\mathrm{size}(\Psi), of the neural network Ψ\Psi as follows:

cmplx(𝒮)supysize(Φy)supy2V^y0+2size(Ψ)2(p+1)size(Ψ).\mathrm{cmplx}(\mathcal{S})\leq\sup_{y}\mathrm{size}(\Phi_{y})\leq\sup_{y}2\|\widehat{V}_{y}\|_{0}+2\mathrm{size}(\Psi)\leq 2(p+1)\mathrm{size}(\Psi).

Rearranging, this yields the claimed lower bound on size(Ψ)\mathrm{size}(\Psi) in terms of cmplx(𝒮)\mathrm{cmplx}(\mathcal{S}). ∎
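
For illustration, the following Python sketch instantiates the construction in the preceding proof on a toy example: a PCA-Net-style composition with a randomly chosen (purely hypothetical) encoder basis, network Ψ and output basis, for which the identity evy∘𝒮(u)=Φy(ℒu) and the parameter count 2‖V̂y‖0+2size(Ψ) can be checked numerically. It is a minimal sketch, not an implementation of PCA-Net as used in the cited references; here no component of Ψ is identically zero, so V̂y=Vy.

```python
# Minimal NumPy sketch (hypothetical toy instantiation, not the paper's code) of the
# identity ev_y ∘ S(u) = Phi_y(L u) for a PCA-Net S = R ∘ Psi ∘ E, and of the count
# size(Phi_y) <= 2*nnz(V_y) + 2*size(Psi) used in the proof of Lemma 2.18.
import numpy as np

rng = np.random.default_rng(0)
n_grid, D_X, D_Y = 256, 8, 6          # grid size, encoder/decoder dimensions (toy choices)
x = np.linspace(0, 2 * np.pi, n_grid, endpoint=False)

# Linear encoder E: inner products with D_X (orthonormal) input basis vectors.
E_basis, _ = np.linalg.qr(rng.standard_normal((n_grid, D_X)))
encode = lambda u: E_basis.T @ u * (2 * np.pi / n_grid)

# Output basis for the linear decoder R: beta -> sum_j beta_j * psi_j(.)
psi = np.stack([np.cos(j * x) for j in range(D_Y)])   # shape (D_Y, n_grid)

# A small ReLU network Psi: R^{D_X} -> R^{D_Y} with random weights (toy choice).
W1, b1 = rng.standard_normal((16, D_X)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((D_Y, 16)), rng.standard_normal(D_Y)
Psi = lambda a: W2 @ np.maximum(W1 @ a + b1, 0.0) + b2
size_Psi = sum(np.count_nonzero(M) for M in (W1, b1, W2, b2))

u = np.sin(3 * x)                      # an input function, sampled on the grid
beta = Psi(encode(u))                  # latent output coefficients
S_u = psi.T @ beta                     # output function R(beta), sampled on the grid

# Evaluation functional at a grid point y = x[iy]: ev_y ∘ R is the row V_y = psi[:, iy].
iy = 37
V_y = psi[:, iy]                       # here p = 1, so V_y is a 1 x D_Y row
Phi_y = lambda a: V_y @ Psi(a)         # Phi_y = (ev_y ∘ R) ∘ Psi, an ordinary network

assert np.isclose(S_u[iy], Phi_y(encode(u)))   # ev_y ∘ S(u) == Phi_y(L u)
print("size(Phi_y) bound:", 2 * np.count_nonzero(V_y) + 2 * size_Psi)
```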

A.5.2. Proof of Lemma 2.20

Proof.

(Lemma 2.20) We observe that with D:=D𝒳D:={D_{\mathcal{X}}}, for any yΩy\in\Omega the encoder :=:𝒳D\mathcal{L}:=\mathcal{E}:\mathcal{X}\to\mathbb{R}^{D} is linear, and

evy𝒮(u)=Φy(u),u𝒳,{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),\quad\forall\,u\in\mathcal{X},

where Φy(α)=j=1D𝒴Ψj(α)ϕj(y)\Phi_{y}(\alpha)=\sum_{j=1}^{{D_{\mathcal{Y}}}}\Psi_{j}(\alpha)\phi_{j}(y) defines a neural network, Φy:Dp\Phi_{y}:\mathbb{R}^{D}\to\mathbb{R}^{p}. Thus, the DeepONet 𝒮=Ψ\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E} is of neural network-type. To estimate the complexity, cmplx(𝒮)\mathrm{cmplx}(\mathcal{S}), we let 𝒥2\mathcal{J}^{2} be the set of indices (i,j){1,,p}×{1,,D𝒴}(i,j)\in\{1,\dots,p\}\times\{1,\dots,{D_{\mathcal{Y}}}\}, such that the ii-th component, [ϕj(y)]i[\phi_{j}(y)]_{i}, of the vector ϕj(y)p\phi_{j}(y)\in\mathbb{R}^{p} is non-zero. Let V^yp×D𝒴\widehat{V}_{y}\in\mathbb{R}^{p\times{D_{\mathcal{Y}}}} be the matrix with entries [V^y]i,j=[ϕj(y)]i[\widehat{V}_{y}]_{i,j}=[\phi_{j}(y)]_{i}, so that Φy(α)=V^yΨ(α)\Phi_{y}(\alpha)=\widehat{V}_{y}\Psi(\alpha) for all α\alpha. Note that V^y\widehat{V}_{y} has precisely |𝒥2||\mathcal{J}^{2}| non-zero entries, and that |𝒥2|size(ϕ)|\mathcal{J}^{2}|\leq\mathrm{size}(\phi), since for each (i,j)𝒥2(i,j)\in\mathcal{J}^{2} the output component [ϕj(y)]i[\phi_{j}(y)]_{i} of ϕ\phi is not identically zero, and hence requires at least one non-zero parameter of ϕ\phi. Thus, it follows that

cmplx(𝒮)supysize(Φy)2|𝒥2|+2size(Ψ)2(size(ϕ)+size(Ψ)).\mathrm{cmplx}(\mathcal{S})\leq\sup_{y}\mathrm{size}(\Phi_{y})\leq 2|\mathcal{J}^{2}|+2\,\mathrm{size}(\Psi)\leq 2(\mathrm{size}(\phi)+\mathrm{size}(\Psi)). ∎

A.5.3. Proof of Lemma 2.22

Proof.

(Lemma 2.22) To see the complexity bound, we recall that for any yΩy\in\Omega, we can choose Φy():=Q(Ψ(),y)\Phi_{y}({\,\cdot\,}):=Q(\Psi({\,\cdot\,}),y) to obtain the representation

evy𝒮(u)=Φy(u),{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),

where u(u)\mathcal{L}u\equiv\mathcal{E}(u) is given by the linear encoder. The composition of two neural networks Q(,y)Q({\,\cdot\,},y) and Ψ()\Psi({\,\cdot\,}) can be represented by a neural network of size at most 2size(Ψ)+2size(Q(,y))2size(Ψ)+2size(Q)2\mathrm{size}(\Psi)+2\mathrm{size}(Q({\,\cdot\,},y))\leq 2\mathrm{size}(\Psi)+2\mathrm{size}(Q). We thus obtain the bound,

cmplx(𝒮)supysize(Φy)2size(Q)+2size(Ψ).\mathrm{cmplx}(\mathcal{S})\leq\sup_{y}\mathrm{size}(\Phi_{y})\leq 2\mathrm{size}(Q)+2\mathrm{size}(\Psi).

This shows the claim. ∎

A.6. Proof of Lemma 2.26

Proof.

Our aim is to show that 𝒮:L2(𝕋;)\mathcal{S}:L^{2}(\mathbb{T};\mathbb{R})\to\mathbb{R},

𝒮(u):=𝕋σ(u(x))𝑑x,\mathcal{S}(u):=\int_{\mathbb{T}}\sigma(u(x))\,dx,

is not of neural network-type. We argue by contradiction. Suppose that 𝒮\mathcal{S} were of neural network-type. By definition, there exists a linear mapping :L2(𝕋;)\mathcal{L}:L^{2}(\mathbb{T};\mathbb{R})\to\mathbb{R}^{\ell}, and a neural network Φ:\Phi:\mathbb{R}^{\ell}\to\mathbb{R}, for some \ell\in\mathbb{N} such that

(A.16) 𝒮(u)=Φ(u).\displaystyle\mathcal{S}(u)=\Phi(\mathcal{L}u).

In the following we will consider φj(x):=sin(jx)\varphi_{j}(x):=\sin(jx) for jj\in\mathbb{N}. Since σ(t)0\sigma(t)\geq 0 for all tt\in\mathbb{R}, and σ(t)>0\sigma(t)>0 for t>0t>0, we have

(A.17) 𝒮(u)=𝕋σ(u(x))𝑑x=0u(x)0,x[0,2π].\displaystyle\mathcal{S}(u)=\int_{\mathbb{T}}\sigma(u(x))\,dx=0\quad\iff\quad u(x)\leq 0,\quad\forall x\in[0,2\pi].

Now, fix any D>D>\ell, and consider ι:DL2(𝕋;)\iota:\mathbb{R}^{D}\to L^{2}(\mathbb{T};\mathbb{R}), ιβ:=j=1Dβjsin(jx)\iota\beta:=\sum_{j=1}^{D}\beta_{j}\sin(jx). Since ι\iota and \mathcal{L} are linear mappings, it follows that

ι:D,βιβ,\mathcal{L}\circ\iota:\mathbb{R}^{D}\to\mathbb{R}^{\ell},\quad\beta\mapsto\mathcal{L}\iota\beta,

is a linear mapping. Represent this linear mapping by a matrix W×DW\in\mathbb{R}^{\ell\times D}. In particular, by (A.16), we have

(A.18) 𝒮(ιβ)=Φ(Wβ),βD.\displaystyle\mathcal{S}(\iota\beta)=\Phi(W\beta),\quad\forall\beta\in\mathbb{R}^{D}.

Since D>D>\ell, it follows that ker(W){0}\ker(W)\neq\{0\} is nontrivial. Let β0\beta\neq 0 be an element in the kernel, and consider uβ(x):=ιβ(x)=j=1Dβjsin(jx)u_{\beta}(x):=\iota\beta(x)=\sum_{j=1}^{D}\beta_{j}\sin(jx). Since uβ(x)u_{\beta}(x) is not identically equal to 0, either uβ(x)u_{\beta}(x) or uβ(x)=uβ(x)-u_{\beta}(x)=u_{-\beta}(x) must be positive somewhere in the domain 𝕋\mathbb{T}. Upon replacing ββ\beta\to-\beta if necessary, we may wlog assume that uβ(x)>0u_{\beta}(x)>0 for some x𝕋x\in\mathbb{T}. From (A.17) and (A.18), it now follows that

0𝒮(uβ)=𝒮(ιβ)=Φ(Wβ)=Φ(0)=𝒮(0)=0.0\neq\mathcal{S}(u_{\beta})=\mathcal{S}(\iota\beta)=\Phi(W\beta)=\Phi(0)=\mathcal{S}(0)=0.

This is a contradiction. Thus, 𝒮\mathcal{S} cannot be of neural network-type. ∎
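
The kernel argument above can also be illustrated numerically. In the following Python sketch, the linear map ℒ is replaced by an arbitrary (hypothetical) choice of ℓ inner-product functionals, and σ is taken to be the ReLU function, which satisfies σ≥0 and σ(t)>0 for t>0; the script finds a non-zero β in the kernel of the induced matrix W and confirms that 𝒮(uβ)>0 while Wβ≈0, which is precisely the incompatibility exploited in the proof.

```python
# NumPy sketch of the kernel argument in the proof of Lemma 2.26. The linear map L
# (here: inner products with ell arbitrary functions, a purely hypothetical choice)
# cannot distinguish some nonzero u_beta in span{sin(jx)} from 0, while S(u_beta) > 0.
import numpy as np

rng = np.random.default_rng(1)
n_grid, ell = 2048, 5
D = ell + 1                                        # D > ell, as in the proof
x = np.linspace(0, 2 * np.pi, n_grid, endpoint=False)
dx = 2 * np.pi / n_grid

# A stand-in for the linear map L : L^2(T) -> R^ell (inner products with test functions).
test_fns = rng.standard_normal((ell, n_grid))
L = lambda u: test_fns @ u * dx

# iota: beta -> sum_j beta_j sin(j x); W represents L∘iota as an ell x D matrix.
sines = np.stack([np.sin((j + 1) * x) for j in range(D)])   # shape (D, n_grid)
W = np.stack([L(s) for s in sines]).T                        # columns: L(sin(j .))

# A nonzero kernel element of W (exists since D > ell).
_, _, Vt = np.linalg.svd(W)
beta = Vt[-1]                                # W @ beta ≈ 0
u_beta = beta @ sines
if u_beta.max() <= 0:                        # wlog u_beta > 0 somewhere (replace beta -> -beta)
    beta, u_beta = -beta, -u_beta

sigma = lambda t: np.maximum(t, 0.0)         # ReLU: sigma >= 0 and sigma(t) > 0 for t > 0
S = lambda u: np.sum(sigma(u)) * dx          # quadrature approximation of S(u)

print("|W beta|  =", np.linalg.norm(W @ beta))    # ~ 0: L cannot see u_beta
print("S(u_beta) =", S(u_beta), " S(0) =", S(np.zeros(n_grid)))  # > 0 vs. 0
```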

A.7. Proof of curse of parametric complexity for FNO, Theorem 2.27

Building on the curse of parametric complexity for operators of neural network-type, we next show that FNOs also suffer from a similar curse, as stated in Theorem 2.27.

Proof.

(Theorem 2.27) Let 𝒮:𝒳\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R} be an rr-times Fréchet differentiable functional satisfying the conclusion of Theorem 2.11 (CoD for functionals of neural network-type). In the following, we show that 𝒮\mathcal{S}^{\dagger} also satisfies the conclusions of Theorem 2.27. Our proof argues by contradiction: we assume that 𝒮\mathcal{S}^{\dagger} can be approximated by a family of discrete FNOs satisfying the error and complexity bounds of Theorem 2.27, i.e.

  1. (1)

    Complexity bound: There exists a constant c>0c>0, such that the discretization parameter NϵN_{\epsilon}\in\mathbb{N}, and the total number of non-zero parameters size(𝒮ϵNϵ)\mathrm{size}(\mathcal{S}^{N_{\epsilon}}_{\epsilon}) are bounded by Nϵ,size(𝒮ϵNϵ)exp(cϵ1/(1+α+δ)r)N_{\epsilon},\mathrm{size}(\mathcal{S}^{N_{\epsilon}}_{\epsilon})\leq\exp(c\epsilon^{-1/(1+\alpha+\delta)r}), for all ϵϵ¯\epsilon\leq\overline{\epsilon}.

  2. (2)

    Accuracy: We have

    supuK|𝒮(u)𝒮ϵNϵ(u)|ϵ,ϵ>0.\sup_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\mathcal{S}^{N_{\epsilon}}_{\epsilon}(u)\right|\leq\epsilon,\quad\forall\,\epsilon>0.

Then we show that this implies the existence of a family of operators of neural network-type 𝒮~ϵ\widetilde{\mathcal{S}}_{\epsilon}, which satisfy for some δ>0\delta^{\prime}>0, and for all sufficiently small ϵ>0\epsilon>0,

  • complexity bound cmplx(𝒮~ϵ)exp(cϵ1/(1+α+δ)r)\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon})\leq\exp(c\epsilon^{-1/(1+\alpha+\delta^{\prime})r}),

  • and error estimate maxuK|𝒮(u)𝒮~ϵ(u)|ϵ,\max_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\widetilde{\mathcal{S}}_{\epsilon}(u)\right|\leq\epsilon,

with c>0c>0 a potentially different constant. By choice of 𝒮\mathcal{S}^{\dagger}, the existence of 𝒮~ϵ\widetilde{\mathcal{S}}_{\epsilon} is ruled out by Theorem 2.11, providing the desired contradiction.

In the following, we discuss the construction of 𝒮~ϵ\widetilde{\mathcal{S}}_{\epsilon}: Let 𝒮ϵNϵ\mathcal{S}_{\epsilon}^{N_{\epsilon}} be a family of FNOs satisfying (1) and (2) above. Fix ϵ>0\epsilon>0 for the moment. To simplify notation, we will write N=NϵN={N_{\epsilon}}, in the following. We recall that, by definition, the discretized FNO 𝒮ϵN\mathcal{S}_{\epsilon}^{N} can be written as a composition 𝒮ϵN=𝒬~L1\mathcal{S}^{N}_{\epsilon}=\widetilde{\mathcal{Q}}\circ\mathcal{L}_{L}\circ\dots\circ\mathcal{L}_{1}\circ\mathcal{R}, where

:u(x)χ(xj1,,jd,u(xj1,,jd)),\mathcal{R}:u(x)\mapsto\chi(x_{j_{1},\dots,j_{d}},u(x_{j_{1},\dots,j_{d}})),

and

𝒬~:(vj1,,jd)1Ndj1,,jdq(yj1,,jd,vj1,,jd),\widetilde{\mathcal{Q}}:\left(v_{j_{1},\dots,j_{d}}\right)\mapsto\frac{1}{N^{d}}\sum_{j_{1},\dots,j_{d}}q(y_{j_{1},\dots,j_{d}},v_{j_{1},\dots,j_{d}}),

are defined by neural networks χ\chi and qq, respectively, \mathcal{L}_{\ell} is of the form

:vσ(Wv+N1(T^Nv)+b),v=(vj1,,jd)j1,,jd=1N,\mathcal{L}_{\ell}:v\mapsto\sigma\left(W_{\ell}v+\mathcal{F}_{N}^{-1}\left(\widehat{T}_{\ell}\mathcal{F}_{N}v\right)+b_{\ell}\right),\quad v=\left(v_{j_{1},\dots,j_{d}}\right)_{j_{1},\dots,j_{d}=1}^{N},

with Wdv×dvW_{\ell}\in\mathbb{R}^{d_{v}\times d_{v}}, a Fourier multiplier T^={[T^]k}kkmax\widehat{T}_{\ell}=\{[\widehat{T}_{\ell}]_{k}\}_{\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}}, where [T^]kdv×dv[\widehat{T}_{\ell}]_{k}\in\mathbb{C}^{d_{v}\times d_{v}}, and N\mathcal{F}_{N} (N1\mathcal{F}_{N}^{-1}) denote the discrete (inverse) Fourier transform, and where the bias bb_{\ell} is determined by its Fourier coefficients [b^]k[\widehat{b}_{\ell}]_{k}, kkmax\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}.

size(𝒮ϵN)=size(χ)+size(q)+=1L{W0+T^0+kkmax[b^]k0}.\mathrm{size}(\mathcal{S}^{N}_{\epsilon})=\mathrm{size}(\chi)+\mathrm{size}(q)+\sum_{\ell=1}^{L}\left\{\|W_{\ell}\|_{0}+\|\widehat{T}_{\ell}\|_{0}+\sum_{\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}}\|[\widehat{b}_{\ell}]_{k}\|_{0}\right\}.

We now observe that, after flattening the tensor

(vj1,,jd)N××N×dvdvNd,\left(v_{j_{1},\dots,j_{d}}\right)\in\mathbb{R}^{N\times\dots\times N\times d_{v}}\simeq\mathbb{R}^{d_{v}N^{d}},

the (linear) mapping vWvv\mapsto W_{\ell}v can be represented by multiplication against a sparse matrix with at most W0Nd\|W_{\ell}\|_{0}N^{d} non-zero entries. For the non-local operator N1T^N\mathcal{F}^{-1}_{N}\widehat{T}_{\ell}\mathcal{F}_{N}, we note that for vdvNdv\in\mathbb{R}^{d_{v}N^{d}}, the number κ\kappa of components (channels) that need to be considered is bounded by κmin(dv,T^0)T^0\kappa\leq\min(d_{v},\|\widehat{T}_{\ell}\|_{0})\leq\|\widehat{T}_{\ell}\|_{0}. Discarding any channels that are zeroed out by T^\widehat{T}_{\ell}, a naive implementation of N1T^N\mathcal{F}^{-1}_{N}\widehat{T}_{\ell}\mathcal{F}_{N} thus amounts to a matrix representation of a linear mapping κNdκNd\mathbb{R}^{\kappa N^{d}}\to\mathbb{R}^{\kappa N^{d}}, requiring at most κ2N2dT^02N2d\kappa^{2}N^{2d}\leq\|\widehat{T}_{\ell}\|_{0}^{2}N^{2d} non-zero components. Thus, each discretized hidden layer \mathcal{L}_{\ell} can be represented exactly by an ordinary neural network layer L=σ(Av+c)L_{\ell}=\sigma(A_{\ell}v+c_{\ell}) with matrix AdvNd×dvNdA_{\ell}\in\mathbb{R}^{d_{v}N^{d}\times d_{v}N^{d}} and bias cdvNdc_{\ell}\in\mathbb{R}^{d_{v}N^{d}}, satisfying the following bounds on the number of non-zero components:

A0W0Nd+T^02N2d,c0|k|kmax[b^]k0Nd.\|A_{\ell}\|_{0}\leq\|W_{\ell}\|_{0}N^{d}+\|\widehat{T}_{\ell}\|_{0}^{2}N^{2d},\quad\|c_{\ell}\|_{0}\leq\sum_{|k|\leq k_{\mathrm{max}}}\|[\widehat{b}_{\ell}]_{k}\|_{0}N^{d}.
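
The flattening of a single FNO layer used above can be checked directly. The following Python sketch (restricted to d=1 and a single channel dv=1, with random toy weights, and with the bias taken directly as grid values for simplicity — all hypothetical choices) assembles the dense matrix A representing v↦Wℓv+F_N^{-1}(T̂ℓ F_N v) on an N-point grid and verifies that σ(Av+b) agrees with the FFT-based evaluation; the count of non-zero entries of A illustrates the N^{2d}-type term in the bound on ‖Aℓ‖0.

```python
# NumPy sketch (d = 1, single channel) of the flattening used in the proof of Theorem 2.27:
# one discretized FNO layer  v -> sigma(W v + F_N^{-1}(T_hat · F_N v) + b)  is rewritten as an
# ordinary dense layer  v -> sigma(A v + b). All weights below are random toy choices.
import numpy as np

rng = np.random.default_rng(2)
N, k_max = 32, 4
W_scalar = 0.7                                   # the local (1x1) weight W_ell
v = rng.standard_normal(N)                       # grid values of the (single) input channel
b = rng.standard_normal(N)                       # grid values of the bias
sigma = lambda t: np.maximum(t, 0.0)

# Hermitian-symmetric Fourier multiplier supported on |k| <= k_max (so the output is real).
T_hat = np.zeros(N, dtype=complex)
T_hat[0] = rng.standard_normal()
for k in range(1, k_max + 1):
    c_k = rng.standard_normal() + 1j * rng.standard_normal()
    T_hat[k], T_hat[N - k] = c_k, np.conj(c_k)

# FFT-based evaluation of the layer.
layer_fft = sigma(W_scalar * v + np.fft.ifft(T_hat * np.fft.fft(v)).real + b)

# Dense representation: with the DFT matrix F, the non-local part F_N^{-1} diag(T_hat) F_N
# becomes an ordinary N x N matrix M, and the whole linear part becomes A = W*I + M.
idx = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(idx, idx) / N)  # fft(v) = F @ v
M = (np.conj(F) @ np.diag(T_hat) @ F / N).real    # ifft(w) = conj(F) @ w / N
A = W_scalar * np.eye(N) + M
layer_dense = sigma(A @ v + b)

print("max deviation:", np.max(np.abs(layer_fft - layer_dense)))        # ~ 1e-13
print("nnz(A) =", np.count_nonzero(np.abs(A) > 1e-12), " vs N^2 =", N * N)
```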

Similarly, the input and output layers \mathcal{R} and 𝒬~\widetilde{\mathcal{Q}} can be represented exactly by ordinary neural networks R:kNddvNdR:\mathbb{R}^{kN^{d}}\to\mathbb{R}^{d_{v}N^{d}} and Q:dvNdQ:\mathbb{R}^{d_{v}N^{d}}\to\mathbb{R}, with

size(R)Ndsize(χ),size(Q)Ndsize(q),\mathrm{size}(R)\leq N^{d}\mathrm{size}(\chi),\quad\mathrm{size}(Q)\leq N^{d}\mathrm{size}(q),

obtained by parallelization of χ\chi, resp. qq, at each grid point. Given the above observations, we conclude that, with the canonical identification N××N×kkNd\mathbb{R}^{N\times\dots\times N\times k}\simeq\mathbb{R}^{kN^{d}}, the discretized FNO 𝒮ϵN\mathcal{S}^{N}_{\epsilon} can be represented by an ordinary neural network Φ:kNd\Phi:\mathbb{R}^{kN^{d}}\to\mathbb{R}, Φ=QLLL1R\Phi=Q\circ L_{L}\circ\dots\circ L_{1}\circ R, with

size(Φ)\displaystyle\mathrm{size}(\Phi) =1L{W0Nd+T^02N2d+|k|kmax[b^]k0Nd}\displaystyle\leq\sum_{\ell=1}^{L}\left\{\|W_{\ell}\|_{0}N^{d}+\|\widehat{T}_{\ell}\|_{0}^{2}N^{2d}+\sum_{|k|\leq k_{\mathrm{max}}}\|[\widehat{b}_{\ell}]_{k}\|_{0}N^{d}\right\}
+Ndsize(χ)+Ndsize(q)\displaystyle\qquad+N^{d}\mathrm{size}(\chi)+N^{d}\mathrm{size}(q)
N2dsize(𝒮ϵN)2.\displaystyle\leq N^{2d}\mathrm{size}(\mathcal{S}_{\epsilon}^{N})^{2}.

By assumption on 𝒮ϵN\mathcal{S}_{\epsilon}^{N} (for which we aim to show that it leads to a contradiction), we have size(𝒮ϵN)exp(cϵ1/(1+α+δ)r)\mathrm{size}(\mathcal{S}_{\epsilon}^{N})\leq\exp(c\epsilon^{-1/(1+\alpha+\delta)r}), and Nϵexp(cϵ1/(1+α+δ)r)N_{\epsilon}\leq\exp(c\epsilon^{-1/(1+\alpha+\delta)r}). It follows that

size(Φ)N2dsize(𝒮ϵN)2exp((2d+2)cϵ1/(1+α+δ)r).\mathrm{size}(\Phi)\leq N^{2d}\mathrm{size}(\mathcal{S}_{\epsilon}^{N})^{2}\leq\exp((2d+2)c\epsilon^{-1/(1+\alpha+\delta)r}).

In addition, Φ\Phi trivially defines an operator of neural network-type, 𝒮~ϵ:Cs(Ω;k)\widetilde{\mathcal{S}}_{\epsilon}:C^{s}(\Omega;\mathbb{R}^{k})\to\mathbb{R}, by

𝒮~ϵ(u):=Φ((u(xj1,,jd))j1,,jd=1N).\widetilde{\mathcal{S}}_{\epsilon}(u):=\Phi\left((u(x_{j_{1},\dots,j_{d}}))_{j_{1},\dots,j_{d}=1}^{N}\right).

To see this, we simply note that the point-evaluation mapping :uu(xj1,,jd)j1,,jd=1N\mathcal{L}:u\mapsto u(x_{j_{1},\dots,j_{d}})_{j_{1},\dots,j_{d}=1}^{N} is linear, and hence we have the representation

𝒮~ϵ(u)=Φ(u).\widetilde{\mathcal{S}}_{\epsilon}(u)=\Phi(\mathcal{L}u).

By the above construction, we have 𝒮~ϵ(u)𝒮ϵNϵ(u)\widetilde{\mathcal{S}}_{\epsilon}(u)\equiv\mathcal{S}_{\epsilon}^{N_{\epsilon}}(u) for all input functions uu.

To summarize, assuming that a family of FNOs 𝒮ϵNϵ\mathcal{S}_{\epsilon}^{N_{\epsilon}} exists, satisfying (1) and (2) above, we have constructed a family 𝒮~ϵ\widetilde{\mathcal{S}}_{\epsilon} of operators of neural network type, with

supuK|𝒮(u)𝒮~ϵ(u)|=supuK|𝒮(u)𝒮ϵNϵ(u)|ϵ,\sup_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\widetilde{\mathcal{S}}_{\epsilon}(u)\right|=\sup_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}^{N_{\epsilon}}(u)\right|\leq\epsilon,

and

cmplx(𝒮~ϵ)size(Φ)exp(c′′ϵ1/(1+α+δ)r),\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon})\leq\mathrm{size}(\Phi)\leq\exp(c^{\prime\prime}\epsilon^{-1/(1+\alpha+\delta)r}),

where c′′=(2d+2)c>0c^{\prime\prime}=(2d+2)c>0 is a constant independent of ϵ\epsilon.

By Theorem 2.11, and fixing any 0<δ<δ0<\delta^{\prime}<\delta, we also have the following lower bound for all sufficiently small ϵ\epsilon:

exp(cϵ1/(1+α+δ)r)cmplx(𝒮~ϵ),\exp(c^{\prime}\epsilon^{-1/(1+\alpha+\delta^{\prime})r})\leq\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon}),

where c>0c^{\prime}>0 is independent of ϵ\epsilon. From the above two-sided bounds, it thus follows that

exp(cϵ1/(1+α+δ)r)cmplx(𝒮~ϵ)exp(c′′ϵ1/(1+α+δ)r),\exp(c^{\prime}\epsilon^{-1/(1+\alpha+\delta^{\prime})r})\leq\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon})\leq\exp(c^{\prime\prime}\epsilon^{-1/(1+\alpha+\delta)r}),

for all ϵ\epsilon sufficiently small, and where by our choice of δ,δ\delta^{\prime},\delta: 0<1+α+δ<1+α+δ0<1+\alpha+\delta^{\prime}<1+\alpha+\delta and c,c′′>0c^{\prime},c^{\prime\prime}>0 are constants. Since ϵ1/(1+α+δ)r\epsilon^{-1/(1+\alpha+\delta^{\prime})r} grows faster than ϵ1/(1+α+δ)r\epsilon^{-1/(1+\alpha+\delta)r} as ϵ0\epsilon\to 0, this leads to the desired contradiction. We thus conclude that a family 𝒮ϵNϵ\mathcal{S}_{\epsilon}^{N_{\epsilon}} of discretized FNOs as assumed above cannot exist. This concludes the proof. ∎

Appendix B Short-time existence of CrC^{r}-solutions

The proof of short-time existence of solutions of the Hamilton-Jacobi equation (HJ) is based on the following Banach space implicit function theorem:

Theorem B.1 (Implicit Function Theorem, see e.g. [6, Section 2.2]).

Let UXU\subset X, VYV\subset Y be open subsets of Banach spaces XX and YY. Let

F:U×VZ,(u,v)F(u,v),F:U\times V\to Z,\quad(u,v)\mapsto F(u,v),

be a CpC^{p}-mapping (pp-times continuously Fréchet differentiable). Assume that there exist (u0,v0)U×V(u_{0},v_{0})\in U\times V such that F(u0,v0)=0F(u_{0},v_{0})=0, and such that the Jacobian with respect to vv, evaluated at the point (u0,v0)(u_{0},v_{0}),

DvF(u0,v0):YZ,D_{v}F(u_{0},v_{0}):Y\to Z,

is a linear isomorphism. Then there exists a neighbourhood U0UU_{0}\subset U of u0u_{0}, and a CpC^{p}-mapping ψ:U0V\psi:U_{0}\to V, such that

F(u,ψ(u))=0,uU0.F(u,\psi(u))=0,\quad\forall\,u\in U_{0}.

Furthermore, ψ\psi is unique, in the sense that for any uU0u\in U_{0}, F(u,v)=0F(u,v)=0 implies v=ψ(u)v=\psi(u).

As a first step toward proving short-time existence for (HJ), we prove that under the no-blowup Assumption 3.1, the semigroup Ψt\Psi_{t}^{\dagger} (3.7) exists for any t0t\geq 0.

Proof.

(Lemma 3.3) By classical ODE theory, it is immediate that for any initial data (q0,p0,z0)Ω×d×(q_{0},p_{0},z_{0})\in\Omega\times\mathbb{R}^{d}\times\mathbb{R} a maximal solution of the ODE system (3.4) exists, and is unique, over a short time-interval. It thus remains to prove that this solution in fact exists globally, i.e. that the solution does not escape to infinity in finite time. Since zz solves z˙=(q,p)\dot{z}=\mathcal{L}(q,p), with a right-hand side that only depends on qq and pp, it will suffice to prove that the Hamiltonian trajectory q˙=pH(q,p)\dot{q}=\nabla_{p}H(q,p), p˙=qH(q,p)\dot{p}=-\nabla_{q}H(q,p) with initial data (q0,p0)(q_{0},p_{0}) exists globally in time. To this end, we note that, by Assumption 3.1, we have

ddt(1+|p|2)=2pp˙=2pqH(q,p)2LH(1+|p|2).\frac{d}{dt}(1+|p|^{2})=2p\cdot\dot{p}=-2p\cdot\nabla_{q}H(q,p)\leq 2L_{H}(1+|p|^{2}).

Gronwall’s lemma thus implies that |p|1+|p0|2exp(LHt)|p|\leq\sqrt{1+|p_{0}|^{2}}\exp(L_{H}t) remains bounded for all t0t\geq 0. This shows that blow-up cannot occur in the pp-variable in a finite time. On the other hand, the right-hand side of the ODE system is periodic in qq, and hence (by the bound on pp) blow-up is also ruled out for qq. In particular, Assumption 3.1 ensures that the Hamiltonian trajectory t(q(t),p(t))t\mapsto(q(t),p(t)) exists globally in time. As argued above, this in turn implies the global existence of the flow map tΨtt\mapsto\Psi^{\dagger}_{t}. ∎

We now apply the Implicit Function Theorem B.1 to prove Lemma 3.4.

Proof.

(Lemma 3.4) Let X=Cperr(Ω)×d×X=C^{r}_{\mathrm{per}}(\Omega)\times\mathbb{R}^{d}\times\mathbb{R} with r2r\geq 2. Define

F:X×dd,(u0,q,t;q0)qqt(q0,qu0(q0)),F:X\times\mathbb{R}^{d}\to\mathbb{R}^{d},\quad(u_{0},q,t;q_{0})\mapsto q-q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

i.e. we set

F(u0,q,t;q0):=qqt(q0,qu0(q0)),F(u_{0},q,t;q_{0}):=q-q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

with qt:Ω×dΩq_{t}:\Omega\times\mathbb{R}^{d}\to\Omega, (q0,p0)qt(q0,p0)(q_{0},p_{0})\mapsto q_{t}(q_{0},p_{0}) the spatial characteristic mapping of the Hamiltonian system (3.4) at time t0t\geq 0, where, recall, we have assumed that u0.u_{0}\in\mathcal{F}. Under Assumption 3.1, the spatial characteristic mapping, and hence FF, is well-defined for any t0t\geq 0 (see Lemma 3.3). Since HCr+1(Ω×d)H\in C^{r+1}(\Omega\times\mathbb{R}^{d}), the mapping Ω×d×Ω\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega, (q0,p0,t)qt(q0,p0)(q_{0},p_{0},t)\mapsto q_{t}(q_{0},p_{0}) is CrC^{r}. The mapping F1:Ω×Cperr(Ω)dF_{1}:\Omega\times C^{r}_{\mathrm{per}}(\Omega)\to\mathbb{R}^{d}, (q0,u0)F1(q0,u0):=qu0(q0)(q_{0},u_{0})\mapsto F_{1}(q_{0},u_{0}):=\nabla_{q}u_{0}(q_{0}) is Cr1C^{r-1} in the first argument, and is a continuous linear mapping in the second argument (and hence infinitely Fréchet differentiable in the second argument). Hence, the composition

(q0,u0)(q0,qu0(q0))qt(q0,qu0(q0)),(q_{0},u_{0})\mapsto(q_{0},\nabla_{q}u_{0}(q_{0}))\mapsto q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

is a Cr1C^{r-1} mapping. As a consequence, the mapping F:X×ddF:X\times\mathbb{R}^{d}\to\mathbb{R}^{d} is a Cr1C^{r-1} mapping.

Since

F(u0,q,t;q0)=qqt(q0,qu0(q0)),F(u_{0},q,t;q_{0})=q-q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

we clearly have

F(u0,q0,0;q0)=0,F(u_{0},q_{0},0;q_{0})=0,

for any q0Ωq_{0}\in\Omega and u0Cperr(Ω)u_{0}\in C^{r}_{\mathrm{per}}(\Omega), and the derivative with respect to the last argument

Dq0F(u0,q0,0;q0):dd,D_{q_{0}}F(u_{0},q_{0},0;q_{0}):\mathbb{R}^{d}\to\mathbb{R}^{d},

is given by

Dq0F(u0,q0,0;q0)=Dq0[qq0]=𝟏d×d,D_{q_{0}}F(u_{0},q_{0},0;q_{0})=D_{q_{0}}\left[q-q_{0}\right]=-\bm{1}_{d\times d},

which defines an isomorphism dd\mathbb{R}^{d}\to\mathbb{R}^{d}.

By the implicit function theorem, for any q¯0Ω\overline{q}_{0}\in\Omega and u¯0Cperr(Ω)\overline{u}_{0}\in C^{r}_{\mathrm{per}}(\Omega), there exist ϵ=ϵ(q¯0,u¯0),r=r(q¯0,u¯0),t=t(q¯0,u¯0)>0\epsilon=\epsilon(\overline{q}_{0},\overline{u}_{0}),r=r(\overline{q}_{0},\overline{u}_{0}),t^{\ast}=t^{\ast}(\overline{q}_{0},\overline{u}_{0})>0, and a mapping

ψq¯0,u¯0:Bϵ(q¯0)×Br(u¯0)×[0,t)d,\psi_{\overline{q}_{0},\overline{u}_{0}}:B_{\epsilon}(\overline{q}_{0})\times B_{r}(\overline{u}_{0})\times[0,t^{\ast})\to\mathbb{R}^{d},

such that F(q,u0,t;q0)=0F(q,u_{0},t;q_{0})=0 for

(q,u0,t)Bϵ(q¯0)×Br(u¯0)×[0,t),(q,u_{0},t)\in B_{\epsilon}(\overline{q}_{0})\times B_{r}(\overline{u}_{0})\times[0,t^{\ast}),

if, and only if,

q0=ψq¯0,u¯0(q,u0,t).q_{0}=\psi_{\overline{q}_{0},\overline{u}_{0}}(q,u_{0},t).

Fix u¯0Cperr(Ω)\overline{u}_{0}\in C^{r}_{\mathrm{per}}(\Omega) for the moment. Since Ω\Omega is compact, we can choose a finite number of points q¯0(1),,q¯0(m)\overline{q}_{0}^{(1)},\dots,\overline{q}_{0}^{(m)}, such that

Ωj=1mBϵ(q¯0j,u¯0)(q¯0j).\Omega\subset\bigcup_{j=1}^{m}B_{\epsilon(\overline{q}_{0}^{{j}},\overline{u}_{0})}(\overline{q}_{0}^{{j}}).

Let

t(u¯0)\displaystyle t^{\ast}(\overline{u}_{0}) :=minj=1,,mt(q¯0j,u¯0),\displaystyle:=\min_{j=1,\dots,m}t^{\ast}(\overline{q}^{{j}}_{0},\overline{u}_{0}),
r(u¯0)\displaystyle r(\overline{u}_{0}) :=minj=1,,mr(q¯0j,u¯0).\displaystyle:=\min_{j=1,\dots,m}r(\overline{q}_{0}^{{j}},\overline{u}_{0}).

Due to the uniqueness property of each ψq¯0j,u¯0\psi_{\overline{q}^{{j}}_{0},\overline{u}_{0}}, j=1,,mj=1,\dots,m, all of these mappings have the same values on overlapping domains. Hence, we can combine them into a single map

ψu¯0:Ω×Br(u¯0)(u¯0)×[0,t(u¯0))d.\psi_{\overline{u}_{0}}:\Omega\times B_{r(\overline{u}_{0})}(\overline{u}_{0})\times[0,t^{\ast}(\overline{u}_{0}))\to\mathbb{R}^{d}.

Furthermore, since ψq¯0j,u¯0Cr1\psi_{\overline{q}^{{j}}_{0},\overline{u}_{0}}\in C^{r-1}, we also have ψu¯0Cr1\psi_{\overline{u}_{0}}\in C^{r-1}. Similarly, we can cover the compact set Cperr(Ω)\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega) by a finite number of open balls Br(u¯0(k))(u¯0(k))B_{r(\overline{u}_{0}^{(k)})}(\overline{u}_{0}^{(k)}), k=1,,Kk=1,\dots,K,

k=1KBr(u¯0(k))(u¯0(k)).\mathcal{F}\subset\bigcup_{k=1}^{K}B_{r(\overline{u}_{0}^{(k)})}(\overline{u}_{0}^{(k)}).

Setting T:=mink=1,,Kt(u¯0(k))>0T^{\ast}:=\min_{k=1,\dots,K}t^{\ast}(\overline{u}_{0}^{(k)})>0, the uniqueness property of ψu¯0(k)\psi_{\overline{u}_{0}^{(k)}} again implies that these mappings agree on overlapping domains. Hence, we can combine them into a global map, and obtain a map

ψ:Ω××[0,T)d,\psi:\Omega\times\mathcal{F}\times[0,T^{\ast})\to\mathbb{R}^{d},

which satisfies F(q,u0,t;ψ(q,u0,t))=0F(q,u_{0},t;\psi(q,u_{0},t))=0 for all qΩq\in\Omega, u0u_{0}\in\mathcal{F} and t<Tt<T^{\ast}. Furthermore, this ψ\psi is still a Cr1C^{r-1} map and it is unique, in the sense that

F(q,u0,t;q0)=0q0=ψ(q,u0,t),qΩ,u0,t<T,F(q,u_{0},t;q_{0})=0\quad\Leftrightarrow\quad q_{0}=\psi(q,u_{0},t),\quad\forall\,q\in\Omega,\;u_{0}\in\mathcal{F},\;t<T^{\ast},

i.e. for any u0u_{0}\in\mathcal{F} and t[0,T)t\in[0,T^{\ast}), we have q=qt(q0,qu0(q0))q=q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})) if and only if q0=ψ(q,u0,t)q_{0}=\psi(q,u_{0},t). In particular, this shows that for any u0u_{0}\in\mathcal{F}, t[0,T)t\in[0,T^{\ast}), the Cr1C^{r-1}-mapping

Φt(;u0):ΩΩ,q0Φt(q0;u0):=qt(q0,qu0(q0)),\Phi^{\dagger}_{t}({\,\cdot\,};u_{0}):\Omega\to\Omega,\quad q_{0}\mapsto\Phi^{\dagger}_{t}(q_{0};u_{0}):=q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

is invertible, with inverse qψ(q,u0,t)Cr1q\mapsto\psi(q,u_{0},t)\in C^{r-1}. This implies the claim. ∎

Next, we apply Lemma 3.4 to prove short-time existence of solutions for (HJ).

Proof.

(Proposition 3.2) Let TT^{\ast} be the maximal time such that the spatial characteristic mapping Φt,u0:ΩΩ\Phi^{\dagger}_{t,u_{0}}:\Omega\to\Omega, q0qt(q0,qu0(q0))q_{0}\mapsto q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})) is invertible, for any t[0,T)t\in[0,T^{\ast}) and for all u0u_{0}\in\mathcal{F}. We have T>0T^{\ast}>0, by Lemma 3.4. We denote by Φt,u0:ΩΩ\Phi^{\dagger}_{-t,u_{0}}:\Omega\to\Omega the inverse Φt,u0:=[Φt,u0]1\Phi^{\dagger}_{-t,u_{0}}:=[\Phi^{\dagger}_{t,u_{0}}]^{-1}. By the method of characteristics, a solution u(q,t)u(q,t) of the Hamilton-Jacobi equations must satisfy

(B.1) u(q,t)=u0(q0)+0t(qτ,pτ)𝑑τ,(q,p):=ppH(q,p)H(q,p),\displaystyle u(q,t)=u_{0}(q_{0})+\int_{0}^{t}\mathcal{L}(q_{\tau},p_{\tau})\,d\tau,\quad\mathcal{L}(q,p):=p\cdot\nabla_{p}H(q,p)-H(q,p),

where q0=Φt,u0(q)q_{0}=\Phi^{\dagger}_{-t,u_{0}}(q), qτ,pτq_{\tau},p_{\tau} are the trajectories of the Hamiltonian ODE (3.1) with initial data q0,qu0(q0)q_{0},\nabla_{q}u_{0}(q_{0}). Given a fixed u0u_{0}\in\mathcal{F}, we use the above expression to define u:Ω×[0,T)u:\Omega\times[0,T^{\ast})\to\mathbb{R}.

We first observe that uCr1(Ω×[0,T))u\in C^{r-1}(\Omega\times[0,T^{\ast})), as it is the composition of Cr1C^{r-1} mappings,

u(q,t)=u0(q0)+0t(qτ(q0,p0),pτ(q0,p0))𝑑τ,u(q,t)=u_{0}(q_{0})+\int_{0}^{t}\mathcal{L}(q_{\tau}(q_{0},p_{0}),p_{\tau}(q_{0},p_{0}))\,d\tau,

where q0=Φt,u0(q)q_{0}=\Phi_{-t,u_{0}}(q), p0=qu0(Φt,u0(q))p_{0}=\nabla_{q}u_{0}(\Phi_{-t,u_{0}}(q)) are Cr1C^{r-1}-functions of qq. In particular, since r2r\geq 2, this implies that uu is at least C1C^{1}. Evaluating du(qt,t)/dtdu(q_{t},t)/dt along a fixed trajectory, we find that

(B.2) tu(qt,t)+H(qt,pt)=(ptqu(qt,t))DpH(qt,pt).\displaystyle\partial_{t}u(q_{t},t)+H(q_{t},p_{t})=\left(p_{t}-\nabla_{q}u(q_{t},t)\right)\cdot D_{p}H(q_{t},p_{t}).

Thus, to show that uu is a classical solution of (HJ), it remains to show that pt=qu(qt,t)p_{t}=\nabla_{q}u(q_{t},t) for all t[0,T)t\in[0,T^{\ast}). Assume that uCper2u\in C^{2}_{\mathrm{per}} for the moment. We first note that for the jj-th component of ptqu(qt,t)p_{t}-\nabla_{q}u(q_{t},t), we have (with implicit summation over repeated indices, following the “Einstein summation rule”)

ddt[ptjqju(qt,t)]\displaystyle\frac{d}{dt}\left[p^{j}_{t}-\partial_{q^{j}}u(q_{t},t)\right] =qjH(qt,pt)qj,qk2u(qt,t)pkH(qt,pt)\displaystyle=-\partial_{q^{j}}H(q_{t},p_{t})-\partial^{2}_{q^{j},q^{k}}u(q_{t},t)\partial_{p^{k}}H(q_{t},p_{t})
qjtu(qt,t).\displaystyle\qquad-\partial_{q^{j}}\partial_{t}u(q_{t},t).

Next, we note that by the invertibility of the spatial characteristic mapping q0qtq_{0}\mapsto q_{t}, we can write (B.2) in the form

tu(q,t)=H(q,P(q,t))+(P(q,t)qu(q,t))DpH(q,P(q,t)),\partial_{t}u(q,t)=-H(q,P(q,t))+\left(P(q,t)-\nabla_{q}u(q,t)\right)\cdot D_{p}H(q,P(q,t)),

where P(q,t):=pt(Φt,u0(q),qu0(Φt,u0(q)))P(q,t):=p_{t}\left(\Phi^{\dagger}_{-t,u_{0}}(q),\nabla_{q}u_{0}(\Phi^{\dagger}_{-t,u_{0}}(q))\right). This implies that

qjtu(q,t)\displaystyle\partial_{q^{j}}\partial_{t}u(q,t) =qjH(q,P(q,t))pkH(q,P(q,t))qjPk(q,t)\displaystyle=-\partial_{q^{j}}H(q,P(q,t))-\partial_{p^{k}}H(q,P(q,t))\partial_{q^{j}}P^{k}(q,t)
+(qjPk(q,t)qj,qk2u(q,t))pkH(q,P(q,t))\displaystyle\qquad+\left(\partial_{q^{j}}P^{k}(q,t)-\partial^{2}_{q^{j},q^{k}}u(q,t)\right)\cdot\partial_{p^{k}}H(q,P(q,t))
+(Pk(q,t)qku(q,t))qj[pkH(q,P(q,t))]\displaystyle\qquad+\left(P^{k}(q,t)-\partial_{q^{k}}u(q,t)\right)\cdot\partial_{q^{j}}\left[\partial_{p^{k}}H(q,P(q,t))\right]
=qjH(q,P(q,t))qj,qk2u(q,t)pkH(q,P(q,t))\displaystyle=-\partial_{q^{j}}H(q,P(q,t))-\partial^{2}_{q^{j},q^{k}}u(q,t)\cdot\partial_{p^{k}}H(q,P(q,t))
+(Pk(q,t)qku(q,t))Aj,k(q,t).\displaystyle\qquad+\left(P^{k}(q,t)-\partial_{q^{k}}u(q,t)\right)\cdot A_{j,k}(q,t).

We point out that on the last line, we have introduced Aj,k(q,t):=qj[pkH(q,P(q,t))]A_{j,k}(q,t):=\partial_{q^{j}}\left[\partial_{p^{k}}H(q,P(q,t))\right] which is a continuous function of qq and tt. Choosing now q=qtq=q_{t}, so that P(q,t)=ptP(q,t)=p_{t}, we thus obtain

qjtu(qt,t)\displaystyle\partial_{q^{j}}\partial_{t}u(q_{t},t) =qjH(qt,pt)qj,qk2u(qt,t)pkH(qt,pt)\displaystyle=-\partial_{q^{j}}H(q_{t},p_{t})-\partial^{2}_{q^{j},q^{k}}u(q_{t},t)\cdot\partial_{p^{k}}H(q_{t},p_{t})
+[ptkqku(qt,t)]Aj,k(qt,t).\displaystyle\qquad+\left[p^{k}_{t}-\partial_{q^{k}}u(q_{t},t)\right]\cdot A_{j,k}(q_{t},t).

Substitution in the ODE for ptqu(qt,t)p_{t}-\nabla_{q}u(q_{t},t) yields

ddt[ptqu(qt,t)]=A(qt,t)[ptqu(qt,t)],\frac{d}{dt}\left[p_{t}-\nabla_{q}u(q_{t},t)\right]=-A(q_{t},t)\cdot\left[p_{t}-\nabla_{q}u(q_{t},t)\right],

where A(qt,t)A(q_{t},t) is the matrix with components (Aj,k(qt,t))(A_{j,k}(q_{t},t)). Since [ptqu(qt,t)]|t=0=0[p_{t}-\nabla_{q}u(q_{t},t)]|_{t=0}=0, this implies that

(B.3) pt=qu(qt,t),t[0,T),\displaystyle p_{t}=\nabla_{q}u(q_{t},t),\quad\forall\,t\in[0,T^{\ast}),

along the trajectory. At this point, the conclusion (B.3) has been obtained under the assumption that uCper2u\in C^{2}_{\mathrm{per}}, which is only ensured a priori if r3r\geq 3.

To prove the result also for the case r=2r=2, we can apply the above argument to smooth HϵH^{\epsilon}, u0ϵu_{0}^{\epsilon}, which approximate the given HH and u0u_{0}, and for uϵu^{\epsilon} defined by the method of characteristics (B.1) with u0u_{0}, HH replaced by the smooth approximations u0ϵu_{0}^{\epsilon}, HϵH^{\epsilon}. Then, by the above argument, for any ϵ\epsilon-regularized trajectory (qtϵ,ptϵ)(q_{t}^{\epsilon},p^{\epsilon}_{t}), we have

ptϵ=quϵ(qtϵ,t).p_{t}^{\epsilon}=\nabla_{q}u^{\epsilon}(q_{t}^{\epsilon},t).

Choosing a sequence such that HϵCr+1HH^{\epsilon}\overset{C^{r+1}}{\to}H, u0ϵCru0u_{0}^{\epsilon}\overset{C^{r}}{\to}u_{0} as ϵ0\epsilon\to 0, the corresponding characteristic solution uϵu^{\epsilon} defined by (B.1) (with HϵH^{\epsilon} and u0ϵu^{\epsilon}_{0} in place of HH and u0u_{0}) converges in Cr1C^{r-1}. Since r1=1r-1=1, this implies that pt=limϵ0ptϵ=limϵ0quϵ(qtϵ,t)=qu(qt,t)p_{t}=\lim_{\epsilon\to 0}p^{\epsilon}_{t}=\lim_{\epsilon\to 0}\nabla_{q}u^{\epsilon}(q^{\epsilon}_{t},t)=\nabla_{q}u(q_{t},t).

Thus, we conclude that uCr1(Ω×[0,T))u\in C^{r-1}(\Omega\times[0,T^{\ast})) defined by (B.1) is a classical solution of the Hamilton-Jacobi equations (HJ). We finally have to show that, in fact, uCperr(Ω×[0,T))u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T^{\ast})). CrC^{r}-differentiability in space follows readily from the fact that, by (B.3), we have qu(qt,t)=pt(q0,qu0(q0))\nabla_{q}u(q_{t},t)=p_{t}(q_{0},\nabla_{q}u_{0}(q_{0})). By the invertibility of the spatial characteristic map, this can be equivalently written in the form

(B.4) qu(q,t)=pt(Φt,u0(q),qu0(Φt,u0(q))),\displaystyle\nabla_{q}u(q,t)=p_{t}(\Phi^{\dagger}_{-t,u_{0}}(q),\nabla_{q}u_{0}(\Phi^{\dagger}_{-t,u_{0}}(q))),

where on the right hand side, (q0,p0)pt(q0,p0)(q_{0},p_{0})\mapsto p_{t}(q_{0},p_{0}), q0qu0(q0)q_{0}\mapsto\nabla_{q}u_{0}(q_{0}) and (q,t)Φt,u0(q,t)(q,t)\mapsto\Phi^{\dagger}_{-t,u_{0}}(q,t) are all Cr1C^{r-1} mappings. Thus, qu\nabla_{q}u is a Cr1C^{r-1}-function. Furthermore, by (HJ), this implies that tu=H(q,qu)\partial_{t}u=-H(q,\nabla_{q}u) is also a Cr1C^{r-1} function. This allows us to conclude that uCperr(Ω×[0,T))u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T^{\ast})). The additional bound on u(,t)Cr\|u({\,\cdot\,},t)\|_{C^{r}} follows from the trivial estimate

u(,t)CrC(uL+qu(,t)Cr1),\|u({\,\cdot\,},t)\|_{C^{r}}\leq C\left(\|u\|_{L^{\infty}}+\|\nabla_{q}u({\,\cdot\,},t)\|_{C^{r-1}}\right),

combined with (B.1) and (B.4), and the fact that qΦt,u0(q)q\mapsto\Phi^{\dagger}_{-t,u_{0}}(q) is a Cr1C^{r-1} mapping with continuous dependence on u0u_{0}\in\mathcal{F} and t[0,T]t\in[0,T], and (q0,p0)pt(q0,p0)(q_{0},p_{0})\mapsto p_{t}(q_{0},p_{0}) is a CrC^{r}-mapping, so that

qu(,t)Cr1supt[0,T]supu0pt(Φt,u0(),qu0(Φt,u0()))Cr1<,\|\nabla_{q}u({\,\cdot\,},t)\|_{C^{r-1}}\leq\sup_{t\in[0,T]}\sup_{u_{0}\in\mathcal{F}}\|p_{t}(\Phi^{\dagger}_{-t,u_{0}}({\,\cdot\,}),\nabla_{q}u_{0}(\Phi^{\dagger}_{-t,u_{0}}({\,\cdot\,})))\|_{C^{r-1}}<\infty,

for any initial data u0u_{0}\in\mathcal{F}. ∎

Appendix C Reconstruction from scattered data

The purpose of this appendix is to prove Proposition 4.2. This is achieved through three lemmas, followed by the proof of the proposition itself, employing these lemmas.

The first lemma is the following special case of the Vitali covering lemma, a well-known geometric result from measure theory (see e.g. [52, Lemma 1.2]):

Lemma C.1 (Vitali covering).

If h>0h>0 and Q={q1,,qN}Q=\{q^{1},\dots,q^{N}\} is a set for which the domain

Ωj=1NBh(qj),\Omega\subset\bigcup_{j=1}^{N}B_{h}(q^{j}),

is contained in the union of balls of radius hh around the qjq^{j}, then there exists a subset Q={qi1,,qim}Q^{\prime}=\{q^{i_{1}},\dots,q^{i_{m}}\} such that Bh(qik)Bh(qi)=B_{h}(q^{i_{k}})\cap B_{h}(q^{i_{\ell}})=\emptyset for all kk\neq\ell, and

Ωk=1mB3h(qik).\Omega\subset\bigcup_{k=1}^{m}B_{3h}(q^{i_{k}}).
Remark C.2.

Given Q={q1,,qN}Q=\{q^{1},\dots,q^{N}\}, the proof of Lemma C.1 (see e.g. [52, Lemma 1.2]) shows that the subset QQQ^{\prime}\subset Q of Lemma C.1 can be found by the following greedy algorithm, which proceeds by iteratively adding elements to QQ^{\prime} (the following is in fact the basis for Algorithm 1 in the main text; a Python sketch of this greedy selection is given after the list):

  1. (1)

    Start with j1=1j_{1}=1 and Q1={q1}Q^{\prime}_{1}=\{q^{1}\}.

  2. (2)

    Iteration step: given Qm={qj1,,qjm}QQ^{\prime}_{m}=\{q^{j_{1}},\dots,q^{j_{m}}\}\subset Q, check whether there exists qkQq^{k}\in Q, such that

    Bh(qk)=1mBh(qj)=.B_{h}(q^{k})\cap\bigcup_{\ell=1}^{m}B_{h}(q^{j_{\ell}})=\emptyset.
    • If yes: define jm+1:=kj_{m+1}:=k and Qm+1:={qj1,,qjm+1}Q^{\prime}_{m+1}:=\{q^{j_{1}},\dots,q^{j_{m+1}}\}.

    • If not: terminate the algorithm and set Q:=Qm={qj1,,qjm}Q^{\prime}:=Q^{\prime}_{m}=\{q^{j_{1}},\dots,q^{j_{m}}\}.
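
A direct transcription of this greedy selection into Python might look as follows; the helper functions also estimate the fill and separation distances appearing in Lemma C.3 on a finite test grid. This is a sketch of the selection rule described above, not a reproduction of Algorithm 1 from the main text, and all numerical choices are illustrative.

```python
# Python sketch of the greedy Vitali-type selection described above (the basis of Algorithm 1).
# fill_distance is estimated on a finite test grid, so it is only an approximation of the
# supremum over Omega; all numerical choices here are illustrative.
import numpy as np

def greedy_prune(Q, h):
    """Return indices of a subset Q' of Q whose h-balls are pairwise disjoint (greedy rule)."""
    selected = [0]                                    # start with q^1
    for k in range(1, len(Q)):
        # accept q^k iff B_h(q^k) is disjoint from all previously selected balls,
        # i.e. iff |q^k - q^{j_l}| >= 2h for every selected j_l
        if all(np.linalg.norm(Q[k] - Q[j]) >= 2 * h for j in selected):
            selected.append(k)
    return selected

def separation_distance(P):
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    return 0.5 * np.min(d[np.triu_indices(len(P), k=1)])

def fill_distance(P, omega_grid):
    d = np.linalg.norm(omega_grid[:, None, :] - P[None, :, :], axis=-1)
    return np.max(np.min(d, axis=1))

rng = np.random.default_rng(3)
Omega_side = 2 * np.pi
Q = rng.uniform(0, Omega_side, size=(400, 2))         # scattered points in Omega = [0, 2pi]^2
grid_1d = np.linspace(0, Omega_side, 60)
omega_grid = np.stack(np.meshgrid(grid_1d, grid_1d), axis=-1).reshape(-1, 2)

h = fill_distance(Q, omega_grid)                      # h_{Q, Omega} (approximate)
Qp = Q[greedy_prune(Q, h)]
print("h_Q    =", h)
print("h_Q'   =", fill_distance(Qp, omega_grid), "(should be <= 3 h_Q, cf. Lemma C.3)")
print("rho_Q' =", separation_distance(Qp), "(should be >= h_Q)")
```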

Based on Lemma C.1, we can now state the following basic result:

Lemma C.3.

Given a set Q={q1,,qN}ΩQ=\{q^{1},\dots,q^{N}\}\subset\Omega with fill distance hQ,Ωh_{Q,\Omega}, the subset QQQ^{\prime}\subset Q determined by the pruning Algorithm 1 has fill distance hQ,Ω3hQ,Ωh_{Q^{\prime},\Omega}\leq 3h_{Q,\Omega} and separation distance ρQhQ,Ω\rho_{Q^{\prime}}\geq h_{Q,\Omega}, and QQ^{\prime} is quasi-uniform with distortion constant κ=3\kappa=3, i.e.

ρQhQ,Ω3ρQ.\rho_{Q^{\prime}}\leq h_{Q^{\prime},\Omega}\leq 3\rho_{Q^{\prime}}.
Proof.

By definition of hQ,Ωh_{Q,\Omega} (cf. Definition 4.1), we have

Ωj=1NBhQ,Ω(qj).\Omega\subset\bigcup_{j=1}^{N}B_{h_{Q,\Omega}}(q^{j}).

Let Q={qj1,,qjm}QQ^{\prime}=\{q^{j_{1}},\dots,q^{j_{m}}\}\subset Q be the subset determined by Algorithm 1 (reproduced in Remark C.2). QQ^{\prime} satisfies the conclusion of the Vitali covering lemma C.1 with h=hQ,Ωh=h_{Q,\Omega}; thus,

(C.1) Ωk=1mB3h(qjk),\displaystyle\Omega\subset\bigcup_{k=1}^{m}B_{3h}(q^{j_{k}}),

and Bh(qjk)Bh(qj)=B_{h}(q^{j_{k}})\cap B_{h}(q^{j_{\ell}})=\emptyset for all kk\neq\ell. The inclusion (C.1) implies that

hQ,Ω=supqΩmink=1,,m|qqjk|3h=3hQ,Ω.h_{Q^{\prime},\Omega}=\sup_{q\in\Omega}\min_{k=1,\dots,m}\left|q-q^{j_{k}}\right|\leq 3h=3h_{Q,\Omega}.

On the other hand, the fact that Bh(qjk)Bh(qj)=B_{h}(q^{j_{k}})\cap B_{h}(q^{j_{\ell}})=\emptyset implies that 12|qjqjk|h=hQ,Ω\frac{1}{2}\left|q^{j_{\ell}}-q^{j_{k}}\right|\geq h=h_{Q,\Omega} for all kk\neq\ell, and hence ρQhQ,Ω\rho_{Q^{\prime}}\geq h_{Q,\Omega}. Thus, we have

hQ,Ω3hQ,Ω3ρQ.h_{Q^{\prime},\Omega}\leq 3h_{Q,\Omega}\leq 3\rho_{Q^{\prime}}.

The bound ρQhQ,Ω\rho_{Q^{\prime}}\leq h_{Q^{\prime},\Omega} always holds (for convex sets); to see this, choose qjk,qjq^{j_{k}},q^{j_{\ell}} such that ρQ=12|qjkqj|\rho_{Q^{\prime}}=\frac{1}{2}\left|q^{j_{k}}-q^{j_{\ell}}\right| realizes the minimal separation distance. Let q¯\overline{q} be the mid-point 12(qjk+qj)\frac{1}{2}\left(q^{j_{k}}+q^{j_{\ell}}\right), so that |q¯qjk|=|q¯qj|=12|qjkqj|=ρQ\left|\overline{q}-q^{j_{k}}\right|=\left|\overline{q}-q^{j_{\ell}}\right|=\frac{1}{2}\left|q^{j_{k}}-q^{j_{\ell}}\right|=\rho_{Q^{\prime}}. We note that |q¯qjr|ρQ\left|\overline{q}-q^{j_{r}}\right|\geq\rho_{Q^{\prime}} for any qjrQq^{j_{r}}\in Q^{\prime}, since

2ρQ|qjrqjk||qjrq¯|+|q¯qjk|=|q¯qjr|+ρQ,2\rho_{Q^{\prime}}\leq\left|q^{j_{r}}-q^{j_{k}}\right|\leq\left|q^{j_{r}}-\overline{q}\right|+\left|\overline{q}-q^{j_{k}}\right|=\left|\overline{q}-q^{j_{r}}\right|+\rho_{Q^{\prime}},

i.e. |q¯qjr|ρQ\left|\overline{q}-q^{j_{r}}\right|\geq\rho_{Q^{\prime}} for r=1,,mr=1,\dots,m. We conclude that

ρQminr=1,,m|q¯qjr|supqΩminr=1,,m|qqjr|=hQ,Ω.\rho_{Q^{\prime}}\leq\min_{r=1,\dots,m}\left|\overline{q}-q^{j_{r}}\right|\leq\sup_{q\in\Omega}\min_{r=1,\dots,m}\left|q-q^{j_{r}}\right|=h_{Q^{\prime},\Omega}. ∎

Lemma C.4.

Let Ω=[0,2π]dd\Omega=[0,2\pi]^{d}\subset\mathbb{R}^{d}, let r2r\geq 2 and let κ1\kappa\geq 1. There exist C=C(d,r,κ)>0C=C(d,r,\kappa)>0 and γ=γ(d,r,κ)>0\gamma=\gamma(d,r,\kappa)>0, such that for all fCr(Ω)f^{\dagger}\in C^{r}(\Omega) and all QΩQ\subset\Omega, quasi-uniform with respect to κ\kappa, and with fill distance hQ,Ωh_{Q,\Omega}, the approximation error of the moving least squares interpolation (4.2) with exact data zj=f(qj)z^{j}=f^{\dagger}(q^{j}), and parameter δ:=γhQ,Ω\delta:=\gamma h_{Q,\Omega}, is bounded as follows:

ffz,QLChQ,ΩrfCr.\|f^{\dagger}-f_{z,Q}\|_{L^{\infty}}\leq Ch_{Q,\Omega}^{r}\|f^{\dagger}\|_{C^{r}}.
Proof.

This is immediate from [53, Corollary 4.8]. ∎

Based on Lemma C.4, we can also derive a reconstruction estimate when the interpolation data {qj,,f(qj,)}j=1N\{q^{j,\dagger},f^{\dagger}(q^{j,\dagger})\}_{j=1}^{N} is only known up to small errors in both the positions qjqj,q^{j}\approx q^{j,\dagger} and the values zjf(qj,)z^{j}\approx f^{\dagger}(q^{j,\dagger}); this is Proposition 4.2, stated in the main body of the text, which we now prove.
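
Before giving the proof, we illustrate the setting with a Python sketch. Since the precise moving least squares scheme (4.2) is defined in the main text and not restated here, the sketch uses one standard moving least squares variant — a locally weighted degree-one least squares fit with compactly supported weights of radius δ=γh — applied to perturbed points and perturbed values; it is meant only to make the roles of h, ε and ρ concrete, and is not the authoritative formulation of (4.2).

```python
# Python sketch of reconstruction from noisy scattered data, in the spirit of Proposition 4.2.
# This uses one standard moving least squares variant (locally weighted degree-1 fit with a
# compactly supported weight of radius delta = gamma * h); it is NOT necessarily the exact
# scheme (4.2) of the main text, and all parameter choices below are illustrative.
import numpy as np

rng = np.random.default_rng(4)
f_true = lambda x: np.sin(2 * x) + 0.3 * np.cos(5 * x)       # the target f^dagger (toy choice)

N = 80
q_exact = np.sort(rng.uniform(0, 2 * np.pi, N))               # exact points q^{j,dagger}
h = np.max(np.diff(np.concatenate(([0], q_exact, [2 * np.pi]))))  # rough fill distance
rho, eps = 0.2 * h, 0.2 * h                                   # position / value perturbations
q = q_exact + rng.uniform(-rho, rho, N)                       # perturbed points q^j
z = f_true(q_exact) + rng.uniform(-eps, eps, N)               # perturbed values z^j

def mls_eval(x, q, z, delta):
    """Weighted local linear least squares fit around x, evaluated at x."""
    w = np.maximum(1.0 - np.abs(q - x) / delta, 0.0) ** 4     # compactly supported weights
    mask = w > 0
    A = np.stack([np.ones(mask.sum()), q[mask] - x], axis=1)  # local basis {1, (. - x)}
    sw = np.sqrt(w[mask])
    coeff = np.linalg.lstsq(A * sw[:, None], z[mask] * sw, rcond=None)[0]
    return coeff[0]                                           # value of the local fit at x

gamma = 3.0
x_test = np.linspace(0, 2 * np.pi, 400)
f_rec = np.array([mls_eval(x, q, z, gamma * h) for x in x_test])
print("L_inf reconstruction error:", np.max(np.abs(f_rec - f_true(x_test))))
print("scales of h^2, eps, rho    :", h**2, eps, rho)
```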

Proof.

(Proposition 4.2) Let ψ:d[0,1]\psi:\mathbb{R}^{d}\to[0,1] be a compactly supported CC^{\infty} function, such that ψ(0)=1\psi(0)=1, ψL1\|\psi\|_{L^{\infty}}\leq 1 and ψ(q)=0\psi(q)=0 for |q|1/2|q|\geq 1/2. Define

f~(q):=f(q)+j=1N(zjf(qj))ψ(qqjρQ).\widetilde{f}(q):=f^{\dagger}(q)+\sum_{j=1}^{N}\left(z^{j}-f^{\dagger}({q}^{j})\right)\psi\left(\frac{q-{q}^{j}}{\rho_{Q}}\right).

Note that, by assumption, we have

|qjqk||qj,qk,|2ρ2(ρQρ)ρQ,|{q}^{j}-{q}^{k}|\geq|q^{j,\dagger}-q^{k,\dagger}|-2\rho\geq 2(\rho_{Q}-\rho)\geq\rho_{Q},

for all jkj\neq k. In particular this implies that the functions

qψ(qqjρQ),q\mapsto\psi\left(\frac{q-{q}^{j}}{\rho_{Q}}\right),

have disjoint support for different values of jj. Thus, for any j=1,,Nj=1,\dots,N, we have f~(qj)=zj\widetilde{f}({q}^{j})=z^{j}, and fz,Q{f}_{z,Q} is the (exact) moving least squares approximation of f~\widetilde{f}. From Lemma C.4 it follows that

f~fz,QLCf~CrhQ,Ωr.\|\widetilde{f}-{f}_{z,Q}\|_{L^{\infty}}\leq C\|\widetilde{f}\|_{C^{r}}h^{r}_{Q,\Omega}.

We now note that

|zjf(qj)|\displaystyle|z^{j}-f^{\dagger}({q}^{j})| |zjf(qj,)|+|f(qj,)f(qj)|ϵ+fC1ρ.\displaystyle\leq|z^{j}-f^{\dagger}({q}^{j,\dagger})|+|f^{\dagger}(q^{j,\dagger})-f^{\dagger}({q}^{j})|\leq\epsilon+\|f^{\dagger}\|_{C^{1}}\rho.

On the one hand (due to the disjoint supports of the bump functions), we can now make the estimate

f~fLmaxj=1,,N|zjf(qj)|ϵ+fC1ρ.\|\widetilde{f}-f^{\dagger}\|_{L^{\infty}}\leq\max_{j=1,\dots,N}|z^{j}-f^{\dagger}({q}^{j})|\leq\epsilon+\|f^{\dagger}\|_{C^{1}}\rho.

On the other hand (again due to the disjoint supports of the bump functions), we also have

f~Cr\displaystyle\|\widetilde{f}\|_{C^{r}} fCr+maxj=1,,N|zjf(qj)|ρQrψCr\displaystyle\leq\|f^{\dagger}\|_{C^{r}}+\max_{j=1,\dots,N}|z^{j}-f^{\dagger}({q}^{j})|\rho_{Q}^{-r}\|\psi\|_{C^{r}}
fCr+ϵ+fC1ρρQrψCr\displaystyle\leq\|f^{\dagger}\|_{C^{r}}+\frac{\epsilon+\|f^{\dagger}\|_{C^{1}}\rho}{\rho_{Q}^{r}}\|\psi\|_{C^{r}}

These two estimates imply that

ffz,QL\displaystyle\|f^{\dagger}-{f}_{z,Q}\|_{L^{\infty}} ff~L+f~fz,QL\displaystyle\leq\|f^{\dagger}-\widetilde{f}\|_{L^{\infty}}+\|\widetilde{f}-{f}_{z,Q}\|_{L^{\infty}}
(ϵ+fC1ρ)+C(fCr+(ϵ+fC1ρ)ψCrρQr)hQ,Ωr\displaystyle\leq\left(\epsilon+\|f^{\dagger}\|_{C^{1}}\rho\right)+C\left(\|f^{\dagger}\|_{C^{r}}+\left(\epsilon+\|f^{\dagger}\|_{C^{1}}\rho\right)\frac{\|\psi\|_{C^{r}}}{\rho^{r}_{Q}}\right)h_{Q,\Omega}^{r}

Taking into account that hQ,Ω/ρQκh_{Q,\Omega}/\rho_{Q}\leq\kappa is bounded, that ψCr\|\psi\|_{C^{r}} is independent of ff^{\dagger}, and introducing a new constant C=C(d,r,κ,ψCr)>0C=C(d,r,\kappa,\|\psi\|_{C^{r}})>0, we obtain

ffz,QL\displaystyle\|f^{\dagger}-{f}_{z,Q}\|_{L^{\infty}} C(fCrhQ,Ωr+ϵ+fC1ρ),\displaystyle\leq C\left(\|f^{\dagger}\|_{C^{r}}h^{r}_{Q,\Omega}+\epsilon+\|f^{\dagger}\|_{C^{1}}\rho\right),

as claimed. ∎

Appendix D Complexity estimates for HJ-Net approximation

In Appendix B, we have shown that given a set of initial data Cperr(Ω)\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega), with r2r\geq 2 and Ω=[0,2π]d\Omega=[0,2\pi]^{d}, the method of characteristics is applicable for a positive time T>0T^{\ast}>0. In the present section, we will combine this result with the proposed HJ-Net framework to derive quantitative error and complexity estimates for the approximation of the solution operator of (HJ). This will require estimates for the approximation of the Hamiltonian flow ΨtΨt\Psi_{t}\approx\Psi^{\dagger}_{t} by neural networks, as well as an estimate for the reconstruction error resulting from the pruned moving least squares operator \mathcal{R}.

D.1. Approximation of Hamiltonian flow

Proof.

(Proposition 5.3) It follows from [54, Theorem 1] (and a simple scaling argument), that there exists a constant C=C(r,d)>0C=C(r,d)>0, such that for any ϵ>0\epsilon>0, there exists a neural network ΨtΨt\Psi_{t}\approx\Psi_{t}^{\dagger} satisfying the bound,

(D.1) sup(q0,p0,z0)Ω×[M,M]d×[M,M]|Ψt(q0,p0,z0)Ψt(q0,p0,z0)|ϵ,\displaystyle\sup_{(q_{0},p_{0},z_{0})\in\Omega\times[-M,M]^{d}\times[-M,M]}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi_{t}^{\dagger}(q_{0},p_{0},z_{0})\right|\leq\epsilon,

and such that

depth(Ψt)\displaystyle\mathrm{depth}(\Psi_{t}) Clog(MrΨtCrϵ),\displaystyle\leq C\log\left(\frac{M^{r}\|\Psi_{t}^{\dagger}\|_{C^{r}}}{\epsilon}\right),
size(Ψt)\displaystyle\mathrm{size}(\Psi_{t}) C(MrΨtCrϵ)(2d+1)/rlog(MrΨtCrϵ),\displaystyle\leq C\left(\frac{M^{r}\|\Psi_{t}^{\dagger}\|_{C^{r}}}{\epsilon}\right)^{(2d+1)/r}\log\left(\frac{M^{r}\|\Psi_{t}^{\dagger}\|_{C^{r}}}{\epsilon}\right),

where ΨtCr=ΨtCr(Ω×[M,M]d×[M,M])\|\Psi_{t}^{\dagger}\|_{C^{r}}=\|\Psi_{t}^{\dagger}\|_{C^{r}(\Omega\times[-M,M]^{d}\times[-M,M])} denotes the CrC^{r} norm on the relevant domain. To prove the claim, we note that for any trajectory (qt,pt)(q_{t},p_{t}) satisfying the Hamiltonian ODE system

(D.2) q˙t=pH(qt,pt),p˙t=qH(qt,pt),\displaystyle\dot{q}_{t}=\nabla_{p}H(q_{t},p_{t}),\quad\dot{p}_{t}=-\nabla_{q}H(q_{t},p_{t}),

with initial data (q0,p0)Ω×[M,M]d(q_{0},p_{0})\in\Omega\times[-M,M]^{d}, we have by assumption 3.1:

ddt(1+|pt|2)=2ptp˙t=2ptqH(qt,pt)2LH(1+|pt|2).\frac{d}{dt}(1+|p_{t}|^{2})=2p_{t}\cdot\dot{p}_{t}=-2p_{t}\cdot\nabla_{q}H(q_{t},p_{t})\leq 2L_{H}(1+|p_{t}|^{2}).

Integration of this inequality implies |pt|2(1+|p0|2)exp(2LHt)(1+dM2)exp(2LHt)|p_{t}|^{2}\leq(1+|p_{0}|^{2})\exp(2L_{H}t)\leq(1+dM^{2})\exp(2L_{H}t). Taking also into account that M1M\geq 1, this implies that pt[βM,βM]dp_{t}\in[-\beta M,\beta M]^{d}, where β=(1+d)exp(LHt)\beta=(1+\sqrt{d})\exp(L_{H}t) depends only on dd, LHL_{H} and tt. Since ptp_{t} remains uniformly bounded and since qH(q,p)q\mapsto H(q,p) is 2π2\pi-periodic, it follows that for any (q0,p0)Ω×[M,M]d(q_{0},p_{0})\in\Omega\times[-M,M]^{d} the Hamiltonian trajectory (qt,pt)(q_{t},p_{t}) starting at (q0,p0)(q_{0},p_{0}) stays in Ω×[βM,βM]\Omega\times[-\beta M,\beta M].

Recall that Ψt(q0,p0,z0)=(qt,pt,zt)\Psi_{t}^{\dagger}(q_{0},p_{0},z_{0})=(q_{t},p_{t},z_{t}) is the flow map of the Hamiltonian ODE (D.2) combined with the action integral

(D.3) zt=z0+0t[ptpH(qt,pt)H(qt,pt)]𝑑τ.\displaystyle z_{t}=z_{0}+\int_{0}^{t}\left[p_{t}\cdot\nabla_{p}H(q_{t},p_{t})-H(q_{t},p_{t})\right]\,d\tau.

Since the Hamiltonian trajectories starting at (q0,p0)Ω×[M,M]d(q_{0},p_{0})\in\Omega\times[-M,M]^{d} are confined to Ω×[βM,βM]d\Omega\times[-\beta M,\beta M]^{d} and since the right-hand sides of (D.2) and (D.3) involve only first-order derivatives of HH, it follows from basic ODE theory that there exists C=C(HCr+1(Ω×[βM,βM]d),M,t,r)>0C=C(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d})},M,t,r)>0, such that the CrC^{r}-norm of the flow can be bounded by

ΨtCr(Ω×[M,M]d×[M,M])C(HCr+1(Ω×[βM,βM]d),M,t,r).\|\Psi_{t}^{\dagger}\|_{C^{r}(\Omega\times[-M,M]^{d}\times[-M,M])}\leq C(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d})},M,t,r).

In particular, we finally conclude that there exist constants β=β(LH,d,t)\beta=\beta(L_{H},d,t) and C=C(HCr+1(Ω×[βM,βM]d),M,t,r)>0C=C(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d})},M,t,r)>0, such that for any ϵ>0\epsilon>0, there exists a neural network ΨtΨt\Psi_{t}\approx\Psi_{t}^{\dagger} satisfying the bound (D.1), and such that

depth(Ψt)Clog(ϵ1),size(Ψt)Cϵ(2d+1)/rlog(ϵ1).\displaystyle\mathrm{depth}(\Psi_{t})\leq C\log\left(\epsilon^{-1}\right),\quad\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log\left(\epsilon^{-1}\right). ∎

D.2. Reconstruction error

Proof.

(Proposition 5.5) Let fCperr(Ω)f^{\dagger}\in C^{r}_{\mathrm{per}}(\Omega) be given with r2r\geq 2, and let {qj,zj}j=1N\{{q}^{j},{z}^{j}\}_{j=1}^{N} be approximate interpolation data, with

|qjqj,|hQ,Ωr,|zjf(qj,)|hQ,Ωr.|{q}^{j}-q^{j,\dagger}|\leq h^{r}_{Q,\Omega},\quad|{z}^{j}-f^{\dagger}(q^{j,\dagger})|\leq h^{r}_{Q,\Omega}.

The assertion of this proposition is restricted to Q={qj,}j=1NQ^{\dagger}=\{q^{j,\dagger}\}_{j=1}^{N} satisfying hQ,Ωh0h_{Q^{\dagger},\Omega}\leq h_{0} for a constant h0h_{0} (to be determined below). We may wlog assume that h01/16h_{0}\leq 1/16 in the following. Denote Q:={qj}j=1N{Q}:=\{{q}^{j}\}_{j=1}^{N}. We recall that the first step in the reconstruction Algorithm 2 consists in an application of the pruning Algorithm 1 to determine pruned interpolation points Q={qj1,,qjm}Q{Q}^{\prime}=\{{q}^{j_{1}},\dots,{q}^{j_{m}}\}\subset{Q}, such that (cf. Lemma C.3)

ρQhQ,Ω3ρQ,hQ,Ω3hQ,Ω,hQ,ΩρQ.\rho_{{Q}^{\prime}}\leq h_{{Q}^{\prime},\Omega}\leq 3\rho_{{Q}^{\prime}},\quad h_{{Q}^{\prime},\Omega}\leq 3h_{{Q},\Omega},\quad h_{{Q},\Omega}\leq\rho_{{Q}^{\prime}}.

Step 1: Write Q,={qj1,,,qjm,}Q^{\prime,\dagger}=\{q^{j_{1},\dagger},\dots,q^{j_{m},\dagger}\}. Our first goal is to show that QQ^{\prime\dagger} is quasi-uniform: To this end, we note that by definition of the separation distance and the upper bound on the distance of qj,q^{j,\dagger} and qjq^{j}:

ρQ,ρQ2maxk|qjk,qjk|ρQ2hQ,Ωr.\rho_{Q^{\prime,\dagger}}\geq\rho_{{Q}^{\prime}}-2\max_{k}|q^{j_{k},\dagger}-{q}^{j_{k}}|\geq\rho_{{Q}^{\prime}}-2h_{Q^{\dagger},\Omega}^{r}.

By the definition of the fill distance, and the assumption that hQ,Ωh01/2h_{Q^{\dagger},\Omega}\leq h_{0}\leq 1/2 and r2r\geq 2, we also have

hQ,ΩhQ,Ω+supj|qjqj,|hQ,Ω+hQ,ΩrhQ,Ω+12hQ,Ω,h_{Q^{\dagger},\Omega}\leq h_{{Q},\Omega}+\sup_{j}|{q}^{j}-q^{j,\dagger}|\leq h_{{Q},\Omega}+h_{Q^{\dagger},\Omega}^{r}\leq h_{{Q},\Omega}+\frac{1}{2}h_{Q^{\dagger},\Omega},

implying the upper bound hQ,Ω2hQ,Ω2ρQh_{Q^{\dagger},\Omega}\leq 2h_{{Q},\Omega}\leq 2\rho_{{Q}^{\prime}}. Similarly we can show that hQ,,Ω2hQ,Ωh_{Q^{\prime,\dagger},\Omega}\leq 2h_{{Q}^{\prime},\Omega}. Substitution in the lower bound on ρQ,\rho_{Q^{\prime,\dagger}} above and using that h01/16h_{0}\leq 1/16, yields

ρQ,ρQ2ρQ(2hQ,Ω)r1ρQ(12(2h0)r1)34ρQ.\rho_{Q^{\prime,\dagger}}\geq\rho_{{Q}^{\prime}}-2\rho_{{Q}^{\prime}}(2h_{{Q},\Omega})^{r-1}\geq\rho_{{Q}^{\prime}}(1-2(2h_{0})^{r-1})\geq\frac{3}{4}\rho_{{Q}^{\prime}}.

Thus, we conclude that

hQ,,Ω2hQ,Ω6ρQ12ρQ,.h_{Q^{\prime,\dagger},\Omega}\leq 2h_{{Q}^{\prime},\Omega}\leq 6\rho_{{Q}^{\prime}}\leq 12\rho_{Q^{\prime,\dagger}}.

Since we always have ρQ,hQ,,Ω\rho_{Q^{\prime,\dagger}}\leq h_{Q^{\prime,\dagger},\Omega}, we conclude that Q,Q^{\prime,\dagger} is quasi-uniform with κ=12\kappa=12.

Step 2: Next, we intend to apply Proposition 4.2 with Q,Q^{\prime,\dagger} in place of QQ^{\dagger} and with ρ=ϵ=hQ,Ωr\rho=\epsilon=h_{Q^{\dagger},\Omega}^{r}: To this end, it remains to show that ρ12ρQ,\rho\leq\frac{1}{2}\rho_{Q^{\prime,\dagger}} is bounded by half the separation distance. To see this, we note that by the above bounds (recall also that h01/16h_{0}\leq 1/16),

ρhQ,Ωrh0r1hQ,Ω116hQ,Ω18hQ,Ω18ρQ16ρQ,<12ρQ,,\rho\equiv h_{Q^{\dagger},\Omega}^{r}\leq h_{0}^{r-1}\,h_{Q^{\dagger},\Omega}\leq\frac{1}{16}h_{Q^{\dagger},\Omega}\leq\frac{1}{8}h_{{Q},\Omega}\leq\frac{1}{8}\rho_{{Q}^{\prime}}\leq\frac{1}{6}\rho_{Q^{\prime,\dagger}}<\frac{1}{2}\rho_{Q^{\prime,\dagger}},

showing that ρ<12ρQ,\rho<\frac{1}{2}\rho_{Q^{\prime,\dagger}}. We can thus apply Proposition 4.2 to conclude that there exist constants C,h0>0C,h_{0}>0 (with h01/16h_{0}\leq 1/16), such that if hQ,Ωh0/9h_{Q^{\dagger},\Omega}\leq h_{0}/9, we have

hQ,,Ω2hQ,Ω6hQ,Ω9hQ,Ωh0h_{Q^{\prime,\dagger},\Omega}\leq 2h_{{Q}^{\prime},\Omega}\leq 6h_{{Q},\Omega}\leq 9h_{Q^{\dagger},\Omega}\leq h_{0}

and hence

ffz,QL\displaystyle\|f^{\dagger}-f_{{z},{Q}}\|_{L^{\infty}} C(fCr(Ω)hQ,,Ωr+ϵ+fC1(Ω)ρ)\displaystyle\leq C\left(\|f^{\dagger}\|_{C^{r}(\Omega)}h_{Q^{\prime,\dagger},\Omega}^{r}+\epsilon+\|f^{\dagger}\|_{C^{1}(\Omega)}\rho\right)
2C(1+fCr(Ω))hQ,,Ωr29rC(1+fCr(Ω))hQ,Ωr.\displaystyle\leq 2C\left(1+\|f^{\dagger}\|_{C^{r}(\Omega)}\right)h_{Q^{\prime,\dagger},\Omega}^{r}\leq 2\cdot 9^{r}C\left(1+\|f^{\dagger}\|_{C^{r}(\Omega)}\right)h_{Q^{\dagger},\Omega}^{r}.

Replacing h0h0/9h_{0}\to h_{0}/9 and enlarging the constant CC accordingly now yields the claimed result of Proposition 5.5. ∎

Proof.

(Proposition 5.6) Let q0,q~0Ωq_{0},\widetilde{q}_{0}\in\Omega and u0u_{0}\in\mathcal{F} be given, and denote by (qτ,pτ)(q_{\tau},p_{\tau}) the solution of the Hamiltonian ODE, with initial value (q0,p0)=(q0,qu0(q0))(q_{0},p_{0})=(q_{0},\nabla_{q}u_{0}(q_{0})). Define (q~τ,p~τ)(\widetilde{q}_{\tau},\widetilde{p}_{\tau}) similarly.

By compactness of Cperr(Ω)Cper2(Ω)\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)\subset C^{2}_{\mathrm{per}}(\Omega), there exists a constant M>0M>0, such that |p0|,|p~0|u0C2(Ω)M|p_{0}|,|\widetilde{p}_{0}|\leq\|u_{0}\|_{C^{2}(\Omega)}\leq M. By continuity of the flow map, there exists M¯>0\overline{M}>0, such that

|pτ|,|p~τ|M¯,τ[0,t].|p_{\tau}|,|\widetilde{p}_{\tau}|\leq\overline{M},\quad\forall\,\tau\in[0,t].

Then, we have

ddτ|(qτ,pτ)(q~τ,p~τ)|\displaystyle\frac{d}{d\tau}|(q_{\tau},p_{\tau})-(\widetilde{q}_{\tau},\widetilde{p}_{\tau})| |H(qτ,pτ)H(q~τ,p~τ)|\displaystyle\leq\left|\nabla H(q_{\tau},p_{\tau})-\nabla H(\widetilde{q}_{\tau},\widetilde{p}_{\tau})\right|
(supqΩ,|p|M¯D2H(q,p))|(qτ,pτ)(q~τ,p~τ)|,\displaystyle\leq\left(\sup_{q\in\Omega,|p|\leq\overline{M}}\|D^{2}H(q,p)\|\right)|(q_{\tau},p_{\tau})-(\widetilde{q}_{\tau},\widetilde{p}_{\tau})|,

where D2HD^{2}H denotes the Hessian of HH and D2H\|D^{2}H\| is the matrix norm induced by the Euclidean vector norm. Further denoting

D2H:=supqΩ,|p|M¯D2H(q,p),\|D^{2}H\|_{\infty}:=\sup_{q\in\Omega,|p|\leq\overline{M}}\|D^{2}H(q,p)\|,

then by Gronwall’s inequality, it follows that

|(qt,pt)(q~t,p~t)|eD2Ht|(q0,p0)(q~0,p~0)|.|(q_{t},p_{t})-(\widetilde{q}_{t},\widetilde{p}_{t})|\leq e^{\|D^{2}H\|_{\infty}t}|(q_{0},p_{0})-(\widetilde{q}_{0},\widetilde{p}_{0})|.

Furthermore, since p0=qu0(q0)p_{0}=\nabla_{q}u_{0}(q_{0}), and p~0=qu0(q~0)\widetilde{p}_{0}=\nabla_{q}u_{0}(\widetilde{q}_{0}), we have |p0p~0|u0C2|q0q~0|M|q0q~0||p_{0}-\widetilde{p}_{0}|\leq\|u_{0}\|_{C^{2}}|q_{0}-\widetilde{q}_{0}|\leq M|q_{0}-\widetilde{q}_{0}|, which implies that

|(qt,pt)(q~t,p~t)|\displaystyle|(q_{t},p_{t})-(\widetilde{q}_{t},\widetilde{p}_{t})| eD2Ht|(q0,p0)(q~0,p~0)|\displaystyle\leq e^{\|D^{2}H\|_{\infty}t}|(q_{0},p_{0})-(\widetilde{q}_{0},\widetilde{p}_{0})|
(1+M)eD2Ht|q0q~0|.\displaystyle\leq(1+M)e^{\|D^{2}H\|_{\infty}t}|q_{0}-\widetilde{q}_{0}|.

Therefore, Φt,u0(q0)=qt\Phi^{\dagger}_{t,u_{0}}(q_{0})=q_{t}, Φt,u0(q~0)=q~t\Phi^{\dagger}_{t,u_{0}}(\widetilde{q}_{0})=\widetilde{q}_{t} satisfy the estimate

|Φt,u0(q0)Φt,u0(q~0)|C|q0q~0|,\left|\Phi^{\dagger}_{t,u_{0}}({q}_{0})-\Phi^{\dagger}_{t,u_{0}}(\widetilde{q}_{0})\right|\leq C|q_{0}-\widetilde{q}_{0}|,

with constant

C=(1+M)exp(tsupqΩ,|p|M¯D2H(q,p)),C=(1+M)\exp\left(t\sup_{q\in\Omega,|p|\leq\overline{M}}\|D^{2}H(q,p)\|\right),

which depends only on tt, HH and on \mathcal{F} (via MM and M¯\overline{M}), but is independent of the particular choice of u0u_{0}. Thus, if

Ωj=1NBh(qj),\Omega\subset\bigcup_{j=1}^{N}B_{h}(q^{j}),

then

Ωj=1NBCh(Φt,u0(qj)),\Omega\subset\bigcup_{j=1}^{N}B_{Ch}(\Phi^{\dagger}_{t,u_{0}}(q^{j})),

for any u0u_{0}\in\mathcal{F}. This implies the claim. ∎