The parametric complexity of operator learning

Samuel Lanthaler and Andrew M. Stuart

Abstract.

Neural operator architectures employ neural networks to approximate operators mapping between Banach spaces of functions; they may be used to accelerate model evaluations via emulation, or to discover models from data. Consequently, the methodology has received increasing attention over recent years, giving rise to the rapidly growing field of operator learning. The first contribution of this paper is to prove that for general classes of operators which are characterized only by their $C^{r}$ - or Lipschitz-regularity, operator learning suffers from a “curse of parametric complexity”, which is an infinite-dimensional analogue of the well-known curse of dimensionality encountered in high-dimensional approximation problems. The result is applicable to a wide variety of existing neural operators, including PCA-Net, DeepONet and the FNO. The second contribution of the paper is to prove that this general curse can be overcome for solution operators defined by the Hamilton-Jacobi equation; this is achieved by leveraging additional structure in the underlying solution operator, going beyond regularity. To this end, a novel neural operator architecture is introduced, termed HJ-Net, which explicitly takes into account characteristic information of the underlying Hamiltonian system. Error and complexity estimates are derived for HJ-Net which show that this architecture can provably beat the curse of parametric complexity related to the infinite-dimensional input and output function spaces.

1. Introduction

The paper is devoted to a study of the computational complexity of the approximation of maps between Banach spaces by means of neural operators. The paper has two main focii: establishing a complexity barrier for general classes of $C^{r}-$ or Lipschitz regular maps; and then showing that this barrier can be beaten for Hamilton-Jacobi (HJ) equations. In Subsection 1.1 we give a detailed literature review; we set in context the definition of “the curse of parametric complexity” that we introduce and use in this paper; and we highlight our main contributions. Then, in Subsection 1.2, we overview the organization of the remainder of the paper.

1.1. Context and Literature Review

The use of neural networks to learn operators, typically mapping between Banach spaces of functions defined over subsets of finite dimensional Euclidean space and referred to as neural operators, is receiving growing interest in the computational science and engineering community [5, 58, 3, 28, 39, 2, 43, 36]. The methodology has the potential for accelerating numerical methods for solving partial differential equations (PDEs) when a model relating inputs and outputs is known; and it has the potential for discovering input-output maps from data when no model is available.

The computational complexity of learning and evaluating such neural operators is crucial to understanding when the methods will be effective. Numerical experiments addressing this issue may be found in [13] and the analysis of linear problems from this perspective may be found in [4, 14]. Universal approximation theorems, applicable beyond the linear setting, may be found in [5, 39, 34, 30, 31, 33, 2, 56] but such theorems do not address the cost of achieving a given small error.

Early work on operator approximation [42] presents first quantitative bounds; most notably, this work identifies the continuous nonlinear $n$ -widths of a space of continuous functionals defined on $L^{2}$ -spaces, showing that these $n$ -widths decay at most (poly-)logarithmically in $n$ . Both upper and lower bounds are derived in this specific setting. More recently, upper bounds on the computational complexity of recent approaches to operator learning based on deep neural networks, including the DeepONet [39] and the Fourier Neural Operator (FNO) [36], have been studied in more detail. Specific operator learning tasks arising in PDEs have been considered in the papers [50, 24, 40, 34, 30, 48, 15]. Related complexity analysis for the PCA-Net architecture [2] has recently been established in [32]. These papers studying computational complexity focus on the issue of beating a form of the “curse of dimensionality” in these operator approximation tasks.

In these operator learning problems the input and output spaces are infinite dimensional, and hence the meaning of the curse of dimensionality could be ambiguous. In this infinite-dimensional context, “beating the curse” is interpreted as identifying problems, and operator approximations applied to those problems, for which a measure of evaluation cost (referred to as their complexity) grows only algebraically with the inverse of the desired error. As shown rigorously in [32], this is a non-trivial issue: for the PCA-Net architecture, it has been established that such algebraic complexity and error bounds cannot be achieved for general Lipschitz (and even more regular) operators.

As will be explained in detail in the present work, this fact is not specific to PCA-Net, but extends to many other neural operator architectures. In fact, it can be interpreted as a scaling limit of the conventional curse of dimensionality; this conventional curse affects finite-dimensional approximation problems when the underlying dimension $d$ is very large. It can be shown that (ReLU) neural networks cannot overcome this curse, in general. As a consequence, neural operators, which build on neural networks, suffer from the scaling limit of this curse in infinite-dimensions. To distinguish this infinite-dimensional phenomenon from the conventional curse of dimensionality encountered in high-but-finite-dimensional approximation problems, we will refer to the scaling limit identified in this work as “the curse of parametric complexity”.

The first contribution of the present paper is to prove that for general classes of operators which are characterized only by their $C^{r}$ - or Lipschitz-regularity, operator learning suffers from such a curse of parametric complexity: Theorem 2.11 (and a variant thereof, Theorem 2.27) shows that, in this setting, there exist operators (and indeed even real-valued functionals) which are $\epsilon-$ approximable only with parametric model complexity that grows exponentially in $\epsilon^{-1}$ .

To overcome the general curse of parametric complexity implied by Theorem 2.11 (and Theorem 2.27), efficient operator learning frameworks therefore have to leverage additional structure present in the operators of interest, going beyond $C^{r}$ - or Lipschitz-regularity. Previous work on overcoming this curse for operator learning has mostly focused on operator holomorphy [23, 50, 34] and the emulation of numerical methods [30, 34, 32, 40] as two basic mechanisms for overcoming the curse of parametric complexity for specific operators of interest. A notable exception are the complexity estimates for DeepONets in [15] which are based on explicit representation of the solution; most prominently, this is achieved via the Cole-Hopf transformation for the viscous Burgers equation.

An abstract characterization of the entire class of operators that allow for efficient approximation by neural operators would be very desirable. Unfortunately, this appears to be out of reach, at the current state of analysis. Indeed, as far as the authors are aware, there does not even exist such a characterization for any class of standard numerical methods, such as finite difference, finite element or spectral, viewed as operator approximators. Therefore, in order to identify settings in which operator learning can be effective (without suffering from the general curse of parametric complexity), we restrict attention to specific classes of operators of interest.

The HJ equations present an application area that has the potential to be significantly impacted by the use of ideas from neural networks, especially regarding the solution of problems for functions defined over subsets of high dimensional ( $d$ ) Euclidean space [7, 8, 12, 11]; in particular beating the curse of dimensionality with respect to this dimension $d$ has been of focus. However, this body of work has not studied operator learning, as it concerns settings in which only fixed instances of the PDE are solved. The purpose of the second part of the present paper is to study the design and analysis of neural operators to approximate the solution operator for HJ equations; this operator maps the initial condition (a function) to the solution at a later time (another function).

The second contribution of the paper is to prove in Theorem 5.1 that the general curse of parametric complexity can be overcome for maps defined by the solution operator of the Hamilton-Jacobi (HJ) equation; this is achieved by exposing additional structure in the underlying solution operator, different from holomorphy and emulation and going beyond regularity, that can be leveraged by neural operators; for the HJ equations, the identified structure relies on representation of solutions of the HJ equations in terms of characteristics. In this paper the dimension $d$ of the underlying spatial domain will be fixed and we do not study the curse of dimensionality with respect to $d$ . Instead, we demonstrate that it is possible to beat the curse of parametric complexity with respect to the infinite dimensional nature of the input function for fixed (and moderate) $d$ .

1.2. Organization

In section 2 we present the first contribution: Theorem 2.11, together with the closely related Theorem 2.27 which extends the general but not exhaustive setting Theorem 2.11 to include the FNO, establish that the curse of parametric complexity is to be expected in operator learning. The following sections then focus on the second contribution and hence on solution operators associated with the HJ equation; in Theorem 5.1 we prove that additional structure in the solution operator for this equation can be leveraged to beat the curse of parametric complexity. In Section 3 we describe the precise setting for the HJ equation employed throughout this paper; we recall the method of characteristics for solution of the equations; and we describe a short-time existence theorem. Section 4 introduces the proposed neural operator, the HJ-Net, based on learning the flow underlying the method of characteristics and combining it with scattered data approximation. In Section 5 we state our approximation theorem for the proposed HJ-Net, resulting in complexity estimates which demonstrate that the curse of parametric complexity is avoided in relation to the infinite dimensional nature of the input space (of initial conditions). Section 6 contains concluding remarks. Whilst the high-level structure of the proofs is contained in the main body of the paper, many technical details are collected in the appendix, to promote readability.

2. The Curse of Parametric Complexity

Operator learning seeks to employ neural networks to efficiently approximate operators mapping between infinite-dimensional Banach spaces of functions. To enable implementation of these methods in practice, maps between the formally infinite-dimensional spaces have to be approximated using only a finite number of degrees of freedom.

Figure 1. Diagrammatic illustration of operator learning based on an encoding

\mathcal{E}

, a neural network

\Psi

, and a reconstruction

\mathcal{R}

Commonly, operator learning frameworks can therefore be written in terms of an encoding, a neural network and a reconstruction step as shown in Figure 1. The first step $\mathcal{E}$ encodes the infinite-dimensional input using a finite number of degrees of freedom. The second approximation step $\Psi$ maps the encoded input to an encoded, finite-dimensional output. The final reconstruction step $\mathcal{R}$ reconstructs an output function given the finite-dimensional output of the approximation mapping. The composition of these encoding, approximation and reconstruction mappings thus takes an input function and maps it to another output function, and hence defines an operator. Existing operator learning frameworks differ in their particular choice of the encoder, neural network architecture and reconstruction mappings.

We start by giving background discussion on the curse of dimensionality (CoD) in finite dimensions, in subsection 2.1. We then describe the subject in detail for neural network-based operator learning, resulting in our notion of the curse of parametric complexity, in subsection 2.2. In subsection 2.3 we state our main theorem concerning the curse of parametric complexity for neural operators. Subsection 2.4 demonstrates that the main theorem applies to PCA-Net, DeepONet and the NOMAD neural network architectures. Subsection 2.5 extends the main theorem to the FNO since it sits outside the framework introduced in subsection 2.2.

2.1. Curse of Dimensionality for Neural Networks

Since the neural network mapping $\Psi:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}^{{D_{\mathcal{Y}}}}$ in the decomposition shown in Figure 1 typically maps between high-dimensional (encoded) spaces, with ${D_{\mathcal{X}}}$ , ${D_{\mathcal{Y}}}\gg 1$ , most approaches to operator learning employ neural networks to learn this mapping. The motivation for this is that, empirically, neural networks have been found to be exceptionally well suited for the approximation of such high-dimensional functions in diverse applications [20]. Detailed investigation of the approximation theory of neural networks, including quantitative upper and lower approximation error bounds, has thus attracted a lot of attention in recent years [54, 55, 29, 38, 16]. Since we build on this analysis we summarize the relevant part of it here, restricting attention to ReLU neural networks in this work, as defined next; generalization to the use of other (piecewise polynomial) activation functions is possible.

2.1.1. ReLU Neural Networks

Fix integer $L$ and integers $\{d_{\ell}\}_{\ell=0}^{L+1}.$ Let $A_{\ell}\in\mathbb{R}^{d_{\ell+1}\times d_{\ell}}$ and $b_{\ell}\in\mathbb{R}^{d_{\ell+1}}$ for $\ell=0,\dots,L$ . A ReLU neural network $\Psi:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}^{{D_{\mathcal{Y}}}}$ , $x\mapsto\Psi(x)$ is a mapping of the form

(2.1)

\displaystyle\left\{\begin{aligned} x_{0}&=x,\\ x_{\ell+1}&=\sigma(A_{\ell}x_{\ell}+b_{\ell}),\quad\ell=0,\cdots,L-1,\\ \Psi(x)&=A_{L}x_{L}+b_{L},\end{aligned}\right.

where $d_{0}={D_{\mathcal{X}}}$ and $d_{L+1}={D_{\mathcal{Y}}}$ . Here the activation function $\sigma:\mathbb{R}\to\mathbb{R}$ is extended pointwise to act on any Euclidean space; and in what follows we employ the ReLU activation function $\sigma(x)=\min\{0,x\}.$ We let $\theta:=\{A_{\ell},b_{\ell}\}_{\ell=0}^{L}$ and note that we have defined parametric mapping $\Psi({\,\cdot\,})=\Psi({\,\cdot\,};\theta)$ . We define the depth of $\Psi$ as the number of layers, and the size of $\Psi$ as the number of non-zero weights and biases, i.e.

0pt(\Psi)=L,\quad\mathrm{size}(\Psi)=\sum_{\ell=0}^{L}\left\{\|A_{\ell}\|_{0}+\|b_{\ell}\|_{0}\right\},

where $\|{\,\cdot\,}\|_{0}$ counts the number of non-zero entries of a matrix or vector.

2.1.2. Two Simple Facts from ReLU Neural Network Calculus

The following facts will be used without further comment (see e.g. [44, Section 2.2.3] for a discussion of more general results): If $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$ is a ReLU neural network, and $A\in\mathbb{R}^{{D_{\mathcal{X}}}\times d}$ is a matrix, then there exists a ReLU neural network $\widetilde{\Psi}:\mathbb{R}^{d}\to\mathbb{R}^{D_{\mathcal{Y}}}$ , such that

(2.2)

\displaystyle\left\{\begin{aligned} \widetilde{\Psi}(x)&=\Psi(Ax),\qquad\text{for all $x\in\mathbb{R}^{d}$,}\\ 0pt(\widetilde{\Psi})&=0pt(\Psi)+1,\\ \mathrm{size}(\widetilde{\Psi})&\leq 2\|A\|_{0}+2\,\mathrm{size}(\Psi).\end{aligned}\right.

Similarly, if $V\in\mathbb{R}^{d\times{D_{\mathcal{Y}}}}$ is a matrix, then there exists a ReLU neural network $\widehat{\Psi}:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{d}$ , such that

(2.3)

\displaystyle\left\{\begin{aligned} \widehat{\Psi}(x)&=V\Psi(x),\qquad\text{for all $x\in\mathbb{R}^{d}$,}\\ 0pt(\widehat{\Psi})&=0pt(\Psi)+1,\\ \mathrm{size}(\widehat{\Psi})&\leq 2\|V\|_{0}+2\,\mathrm{size}(\Psi).\end{aligned}\right.

The main non-trivial issue in (2.2) and (2.3) is to preserve the potentially sparse structure of the underlying neural networks; this is based on a concept of “sparse concatenation” from [46].

2.1.3. Approximation Theory and CoD for ReLU Networks

One overall finding of research into the approximation power of ReLU neural networks is that, for function approximation in spaces characterized by smoothness, neural networks cannot entirely overcome the curse of dimensionality [54, 55, 29, 38]. This is illustrated by the following result, which builds on [54, Thm. 5] derived by D. Yarotsky:

Proposition 2.1 (Neural Network CoD).

Let $r\in\mathbb{N}$ be given. For any dimension $D\in\mathbb{N}$ , there exists $f_{D,r}\in C^{r}([0,1]^{D};\mathbb{R})$ and constant $\overline{\epsilon},\gamma>0$ , such that any ReLU neural network $\Psi:\mathbb{R}^{D}\to\mathbb{R}$ achieving accuracy

\sup_{x\in[0,1]^{D}}|f_{D,r}(x)-\Psi(x)|\leq\epsilon,

with $\epsilon\leq\overline{\epsilon}$ , has size at least $\mathrm{size}(\Psi)\geq\epsilon^{-\gamma D/r}$ . The constant $\overline{\epsilon}=\overline{\epsilon}(r)>0$ depends only on $r$ , and $\gamma>0$ is universal.

The proof of Proposition 2.1 is included in Appendix A.1. Proposition 2.1 shows that neural network approximation of a function between high-dimensional Euclidean spaces suffers from a curse of dimensionality, characterised by an algebraic complexity with a potentially large exponent proportional to the dimension $D\gg 1$ . This lower bound is similar to the approximation rates (upper bounds) achieved by traditional methods,¹¹1Ignoring the potentially beneficial factor $\gamma$ such as polynomial approximation. This fact suggests that the empirically observed efficiency of neural networks may well rely on additional structure of functions $f$ of practical interest, beyond their smoothness; for relevant results in this direction see, for example, [41, 21].

2.2. Curse of Parametric Complexity in Operator Learning

In the present work, we consider the approximation of an underlying operator $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ acting between Banach spaces; specifically, we assume that the dimensions of the spaces $\mathcal{X},\mathcal{Y}$ are infinite. Given the curse of dimensionality in the finite-dimensional setting, Proposition 2.1, and letting $D\to\infty$ , one would generally expect a super-algebraic, potentially even exponential, lower bound on the “complexity” of neural operators $\mathcal{S}$ approximating such $\mathcal{S}^{\dagger}$ , as a function of the accuracy $\epsilon$ . In this subsection, we make this statement precise for a general class of neural operators, in Theorem 2.11. This is preceded by a discussion of relevant structure of compact sets in infinite-dimensional function spaces and a discussion of a suitable class of “neural operators”.

2.2.1. Infinite-dimensional hypercubes

Proposition 2.1 was stated for the unit cube $[0,1]^{D}$ as the underlying domain. In the finite-dimensional setting of Proposition 2.1, the approximation rate turns out to be independent of the underlying compact domain, provided that the domain has non-empty interior and assuming a Lipschitz continuous boundary. This is in contrast to the infinite dimensional case, where compact domains necessarily have empty interior and where the convergence rate depends on the specific structure of the domain. To state our complexity bound for operator approximation, we will therefore need to discuss the prototypical structure of compact subsets $K\subset\mathcal{X}$ .

To fix intuition, we temporarily consider $\mathcal{X}$ a function space (for example a Hölder, Lebesgue or Sobolev space). In this case, the most common way to define a compact subset $K\subset\mathcal{X}$ is via a smoothness constraint, as illustrated by the following concrete example:

Example 2.2.

Assume that $\mathcal{X}=C^{s}(\Omega)$ is the space of $s$ -times continuously differentiable functions on a bounded domain $\Omega\subset\mathbb{R}^{d}$ . Then for $\rho>s$ and upper bound $M>0$ , the subset $K\subset\mathcal{X}$ defined by

(2.4)

\displaystyle K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq M\right\}},

is a compact subset of $\mathcal{X}$ . Here, we define the $C^{\rho}$ -norm as,

(2.5)

\displaystyle\|u\|_{C^{\rho}}=\max_{|\nu|\leq\rho}\sup_{x\in\Omega}|D^{\nu}u(x)|.

To better understand the approximation theory of operators $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$ , we would like to understand the structure of such $K$ . Our point of view is inspired by Fourier analysis, according to which smoothness of $u\in K$ roughly corresponds to a decay rate of the Fourier coefficients of $u$ . In particular, $u$ is guaranteed to belong to the set (2.4), if $u$ is of the form,

(2.6)

\displaystyle u=A\sum_{j=1}^{\infty}j^{-\alpha}y_{j}e_{j},\qquad y_{j}\in[0,1]\text{ for all }j\in\mathbb{N},

for a sufficiently large decay rate $\alpha>0$ , small constant $A>0$ , and where $e_{j}:\mathbb{R}^{d}\to\mathbb{R}$ denotes the periodic Fourier (sine/cosine) basis, restricted to $\Omega$ . We include a proof of this fact at the end of this subsection (see Lemma 2.7), where we also identify relevant decay rate $\alpha$ . In this sense, the set $K$ in (2.4) could be said to “contain” an infinite-dimensional hypercube $\prod_{j=1}^{\infty}[0,Aj^{-\alpha}]$ , with decay rate $\alpha$ . Such hypercubes will replace the finite-dimensional unit cube $[0,1]^{D}$ in our analysis of operator approximation in infinite-dimensions.

We would like to point out that a similar observation holds for many other sets $K$ defined by a smoothness constraint, such as sets in Sobolev spaces defined by a smoothness constraint, $\{\|u\|_{H^{\rho}}\leq M\}\subset H^{s}(\Omega)$ , or more generally $\{\|u\|_{W^{\rho,p}}\leq M\}\subset W^{s,p}(\Omega)$ , for any $1\leq p\leq\infty$ , but also Besov spaces, spaces of functions of bounded variation and others share a similar structure. Bounded balls in all of these spaces contain infinite-dimensional hypercubes, consisting of elements of the form (2.6). We note in passing that, in general, it may be more natural to replace the trigonometric basis in (2.6) by some other choice of basis, such as polynomials, splines, wavelets, or a more general (frame) expansion. We refer to e.g. [9, 22] for general background and [24] for an example of such a setting in the context of holomorphic operators.

The above considerations lead us to the following definition of an abstract hypercube:

Definition 2.3.

Let $e_{1},e_{2},\dots\in\mathcal{X}$ be a sequence of linearly independent and normed elements, $\|e_{j}\|_{\mathcal{X}}=1$ . Given constants $A>0$ , $\alpha>1$ , we say that $K\subset\mathcal{X}$ contains an infinite-dimensional cube $Q_{\alpha}=Q_{\alpha}(A;e_{1},e_{2},\dots)$ , if:

(1)

$K$ contains the set $Q_{\alpha}$ consisting of all $u$ of the form (2.6);
(2)

set $\{e_{j}\}_{j\in\mathbb{N}}$ possesses a bounded bi-orthogonal sequence of functionals, labelled $e_{1}^{\ast},e^{\ast}_{2},\dots\in\mathcal{X}^{\ast}$ , in the continuous dual of $\mathcal{X}$ ²²2If $\mathcal{X}$ is a Hilbert space, such a bi-orthogonal sequence always exists for independent $e_{1},e_{2},\dots\in\mathcal{X}$ .; i.e. we assume that $e^{\ast}_{k}(e_{j})=\delta_{jk}$ for all $j,k\in\mathbb{N}$ , and that there exists $M>0$ , such that $\|e^{\ast}_{k}\|_{\mathcal{X}^{\ast}}\leq M$ for all $k\in\mathbb{N}$ .

Remark 2.4.

Property (2) in Definition 2.3, i.e. the assumed existence of a bi-orthogonal sequence $e^{\ast}_{j}$ , ensures that there exist “coordinate projections”: if $u$ is of the form (2.6), then the $j$ -th coefficient $y_{j}$ can be retrieved from $u$ as $y_{j}=A^{-1}j^{\alpha}e^{\ast}_{j}(u)$ . This allows us to uniquely identify $u\in Q_{\alpha}$ with a set of coefficients $(y_{1},y_{2},\dots)\in[0,1]^{\infty}$ .

Remark 2.5.

The decay rate $\alpha$ of the infinite-dimensional cube $Q_{\alpha}\subset K$ provides a measure of its “asymptotic size” or “complexity”. In terms of our complexity bounds, this decay rate will play a special role. Hence, we will usually retain this dependence explicitly by writing $Q_{\alpha}$ , but suppress the additional dependence on $A$ and $e_{1},e_{2},\dots$ in the following.

The notion of infinite-dimensional cubes introduced in Definition 2.3 is only a minor generalization of an established notion of cube embeddings, introduced by Donoho [18] in a Hilbert space setting. We refer to [10, Chap. 5] for a pedagogical exposition of such cube embeddings in the Hilbert space setting, and their relation to the Kolmogorov entropy of $K$ .

Remark 2.6.

The complexity bounds established in this work will be based on infinite-dimensional hypercubes. An interesting question, left for future work, is whether our main result on the curse of parametric complexity, Theorem 2.11 below, could be stated directly in terms of the Kolmogorov complexity of $K$ , or other notions such as the Kolmogorov $n$ -width [17].

Our definition of an infinite-dimensional hypercube is natural in view of the following lemma, the discussion following Example (2.4), and other similar results.

Lemma 2.7.

Assume that $\mathcal{X}=C^{s}(\Omega)$ is the space of $s-$ times continuously differentiable functions on a bounded domain $\Omega\subset\mathbb{R}^{d}$ . Choose $\rho>s$ and define $K$ , compact in $\mathcal{X}$ , by

K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq M\right\}},

with constant $M>0$ . Then $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ , for any $\alpha>1+\frac{\rho-s}{d}$ .

The proof of Lemma 2.7 is included in Appendix A. While we have focused on spaces of continuously differentiable functions, similar considerations also apply to other smoothness spaces, such as the Sobolev spaces $W^{s,p}$ , and more general Besov spaces.

2.2.2. Curse of Parametric Complexity

The main question to be addressed in the present section is the following: given $K\subset\mathcal{X}$ compact, $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ an $r$ -times Fréchet differentiable operator to be approximated by a neural operator $\mathcal{S}:\mathcal{X}\to\mathcal{Y}$ , and given a desired approximation accuracy $\epsilon>0$ , how many tunable parameters (in the architecture of $\mathcal{S}$ ) are required to achieve,

\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}(u)\|_{\mathcal{Y}}\leq\epsilon?

The answer to this question clearly depends on our assumptions on $K\subset\mathcal{X}$ , $\mathcal{Y}$ and the class of neural operators $\mathcal{S}$ .

Assume $K\subset\mathcal{X}$ contains a hypercube $Q_{\alpha}$ .

Consistent with our discussion in the last subsection, we will assume that $K\subset\mathcal{X}$ contains an infinite-dimensional hypercube $Q_{\alpha}$ , as introduced in Definition 2.3, with algebraic decay rate $\alpha>0$ .

Assume $\mathcal{Y}=\mathbb{R}$ , i.e. $\mathcal{S}^{\dagger}$ is a functional.

The approximation of an operator $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ with potentially infinite-dimensional output space $\mathcal{Y}$ is generally harder to achieve than the approximation of a functional $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R}$ with one-dimensional output; indeed, if $\dim(\mathcal{Y})\geq 1$ , then $\mathbb{R}$ can be embedded in $\mathcal{Y}$ and any functional $\mathcal{X}\to\mathbb{R}$ gives rise to an operator $\mathcal{X}\to\mathcal{Y}$ under this embedding. To simplify our discussion, we will therefore initially restrict attention to the approximation of functionals, with the aim of showing that even the approximation of $r$ -times Fréchet differentiable functionals is generally intractable.

Assume $\mathcal{S}$ is of neural network-type.

Assuming that $\mathcal{Y}=\mathbb{R}$ , we must finally introduce a rigorous notion of the relevant class of approximating functionals $\mathcal{S}:\mathcal{X}\to\mathbb{R}$ , i.e. define a class of “neural operators/functionals” approximating the functional $\mathcal{S}^{\dagger}$ .

Definition 2.8 (Functional of neural network-type).

We will say that a (neural) functional $\mathcal{S}:\mathcal{X}\to\mathbb{R}$ is a “functional of neural network-type”, if it can be written in the form

(2.7)

\displaystyle\mathcal{S}(u)=\Phi(\mathcal{L}u),\quad\forall\,u\in K,

where for some $\ell\in\mathbb{N}$ , $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{\ell}$ is a linear map, and $\Phi:\mathbb{R}^{\ell}\to\mathbb{R}$ is a ReLU neural network (potentially sparse).

If $\mathcal{S}$ is a functional of neural network-type, we define the complexity of $\mathcal{S}$ , denoted $\mathrm{cmplx}(\mathcal{S})$ , as the smallest size of a neural network $\Phi$ for which there exists linear map $\mathcal{L}$ such that a representation of the form (2.7) holds, i.e.

(2.8)

\displaystyle\mathrm{cmplx}(\mathcal{S}):=\min_{\Phi}\mathrm{size}(\Phi),

where the minimum is taken over all possible $\Phi$ in (2.7).

Remark 2.9.

Without loss of generality, we may assume that $\ell\leq\mathrm{size}(\Phi)$ in (2.7) and (2.8). Indeed, if this is not the case, then $\mathrm{size}(\Phi)<\ell$ and we can show that it is possible to construct another representation pair $(\widetilde{\Phi},\widetilde{\mathcal{L}})$ in (2.7), consisting of a neural network $\widetilde{\Phi}:\mathbb{R}^{\widetilde{\ell}}\to\mathbb{R}$ , linear map $\widetilde{\mathcal{L}}:\mathcal{X}\to\mathbb{R}^{\widetilde{\ell}}$ and such that $\widetilde{\ell}\leq\mathrm{size}(\widetilde{\Phi})$ : To see why, let us assume that $\widetilde{\ell}:=\mathrm{size}(\Phi)<\ell$ . Let $A$ be the weight matrix in the first input layer of $\Phi$ . Since

\|A\|_{0}\leq\widetilde{\ell}<\ell,

at most $\widetilde{\ell}$ columns of $A$ can be non-vanishing. Write the matrix $A=[a_{1},a_{2},\dots,a_{\ell}]$ in terms of its column vectors. Up to permutation, we may assume that $a_{\widetilde{\ell}+1}=\dots=a_{\ell}=0$ . We now drop the corresponding columns in the input layer of $\Phi$ and remove these unused components from the output of the linear map $\mathcal{L}$ in (2.7). This leads to a new map $\widetilde{\mathcal{L}}:\mathcal{X}\to\mathbb{R}^{\widetilde{\ell}}$ , with output components $(\widetilde{\mathcal{L}}u)_{j}=(\mathcal{L}u)_{j}$ for $j=1,\dots,\widetilde{\ell}$ , and we define $\widetilde{\Phi}:\mathbb{R}^{\widetilde{\ell}}\to\mathbb{R}$ , as the neural network that is obtained from $\Phi$ by replacing the input matrix $A=[a_{1},\dots,a_{\ell}]$ by $\widetilde{A}=[a_{1},\dots,a_{\widetilde{\ell}}]$ . Then $\widetilde{\Phi}\circ\widetilde{\mathcal{L}}=\Phi\circ\mathcal{L}$ , so that $\widetilde{\mathcal{L}}$ and $\widetilde{\Phi}$ satisfy a representation of the form (2.7), but the dimension $\widetilde{\ell}$ satisfies $\widetilde{\ell}=\mathrm{size}(\Phi)=\mathrm{size}(\widetilde{\Phi})$ ; the first equality is by definition of $\widetilde{\ell}$ , and the last equality holds because we only removed zero weights from $\Phi$ . In particular, this ensures that $\widetilde{\ell}\leq\mathrm{size}(\widetilde{\Phi})$ for this new representation, without affecting the size of the underlying neural network, i.e. $\mathrm{size}(\Phi)=\mathrm{size}(\widetilde{\Phi})$ .

More generally, we can consider $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ a function space, consisting of functions $v:\Omega\to\mathbb{R}^{p}$ with domain $\Omega\subset\mathbb{R}^{d}$ . Given $y\in\Omega$ , we introduce the point-evaluation map,

\displaystyle{\mathrm{ev}}_{y}:\mathcal{Y}\to\mathbb{R}^{p},\quad{\mathrm{ev}}_{y}(v):=v(y).

Provided that point-evaluation ${\mathrm{ev}}_{y}$ is well-defined for all $v\in\mathcal{Y}$ , we can readily extend the above notion to operators, as follows:

Definition 2.10 (Operator of neural network-type).

Let $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ be a function space on which point-evaluation is well-defined. We will say that a (neural) operator $\mathcal{S}:\mathcal{X}\to\mathcal{Y}$ is an “operator of neural network-type”, if for any evaluation point $y\in\Omega$ , the composition ${\mathrm{ev}}_{y}\circ\mathcal{S}:\mathcal{X}\to\mathbb{R}^{p}$ , ${\mathrm{ev}}_{y}(\mathcal{S}(u)):=\mathcal{S}(u)(y)$ , can be written in the form

(2.9)

\displaystyle{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),\quad\forall\,u\in K,

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{\ell}$ is a linear operator, and $\Phi_{y}:\mathbb{R}^{\ell}\to\mathbb{R}^{p}$ is a ReLU neural network which may depend on the evaluation point $y\in\Omega$ . In this case, we define

\mathrm{cmplx}(\mathcal{S}):=\sup_{y\in\Omega}\min_{\Phi_{y}}\mathrm{size}(\Phi_{y}).

We next state our main result demonstrating a “curse of parametric complexity” for functionals (and operators) of neural network-type. This is followed by a detailed discussion of the implications of this abstract result for four representative examples of operator learning frameworks: PCA-Net, DeepONet, NOMAD and the Fourier neural operator.

2.3. Main Theorem on Curse of Parametric Complexity

The following result formalizes an analogue of the curse of dimensionality in infinite-dimensions:

Theorem 2.11 (Curse of Parametric Complexity).

Let $K\subset\mathcal{X}$ be a compact set in an infinite-dimensional Banach space $\mathcal{X}$ . Assume that $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ for some $\alpha>1$ . Then for any $r\in\mathbb{N}$ and $\delta>0$ , there exists $\overline{\epsilon}>0$ and an $r$ -times Fréchet differentiable functional $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathbb{R}$ , such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a functional $\mathcal{S}_{\epsilon}$ of neural network-type,

(2.10)

\displaystyle\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)|\leq\epsilon,

implies complexity bound $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ ; here $c$ , $\overline{\epsilon}>0$ are constants depending only on $\alpha$ , $\delta$ and $r$ .

Before providing a sketch of the proof of Theorem 2.11, we note the following simple corollary, whose proof is given in Appendix A.4.

Corollary 2.12.

Let $K\subset\mathcal{X}$ be a compact set in an infinite-dimensional Banach space $\mathcal{X}$ . Assume that $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ for some $\alpha>1$ . Let $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ be a function space with continuous embedding in $C(\Omega;\mathbb{R}^{p})$ . Then for any $r\in\mathbb{N}$ and $\delta>0$ , there exists $\overline{\epsilon}>0$ and an $r$ -times Fréchet differentiable functional $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$ , such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by an operator $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$ of neural network-type,

(2.11)

\displaystyle\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,

Proof of Theorem 2.11 (Sketch).

Let $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathbb{R}$ be any functional of neural network-type, achieving approximation accuracy (2.10). In view of our definition of $\mathrm{cmplx}(\mathcal{S}_{\epsilon})$ in (2.8), to prove the claim, it suffices to show that if $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{\ell}$ is a linear map and $\Phi:\mathbb{R}^{\ell}\to\mathbb{R}$ is a ReLU neural network, such that

\mathcal{S}_{\epsilon}(u)=\Phi(\mathcal{L}u),\quad\forall\,u\in\mathcal{X},

then $\mathrm{size}(\Phi)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ .

The idea behind the proof of this fact is that if $K\subset\mathcal{X}$ contains a hypercube $Q_{\alpha}$ , then for any $D\in\mathbb{N}$ , a suitable rescaling of the finite-dimensional cube $[0,1]^{D}$ can be embedded in $K$ . More precisely, for any $D\in\mathbb{N}$ there exists an injective linear map $\iota_{D}:[0,1]^{D}\to K$ .

If we now consider the composition $\mathcal{S}_{\epsilon}\circ\iota_{D}:\mathbb{R}^{D}\to\mathbb{R}$ , then we observe that we have a decomposition $\mathcal{S}_{\epsilon}\circ\iota_{D}(x)=\Phi(\mathcal{L}\circ\iota_{D}(x))$ , where

\mathcal{L}\circ\iota_{D}:\mathbb{R}^{D}\to\mathbb{R}^{\ell},

is linear, and

\Phi:\mathbb{R}^{\ell}\to\mathbb{R},

is a ReLU neural network. In particular, there exists a matrix $A\in\mathbb{R}^{\ell\times D}$ , such that $\mathcal{L}\circ\iota_{D}(x)=Ax$ for all $x\in\mathbb{R}^{D}$ , and the mapping $\Phi_{D}(x):=\mathcal{S}_{\epsilon}\circ\iota_{D}(x)$ defines a ReLU neural network $\Phi_{D}(x)=\Phi(Ax)$ , whose size can be bounded by

\displaystyle\begin{aligned} \mathrm{size}(\Phi_{D})&\leq 2\,\mathrm{size}(\Phi)+2\|A\|_{0}&\text{(Equation \ref{eq:calculus1})}\\ &\leq 2\,\mathrm{size}(\Phi)+2\ell D&\\ &\leq 2\,\mathrm{size}(\Phi)+2\,\mathrm{size}(\Phi)D\hskip 28.45274pt&\text{(Remark \ref{rem:ellsize})}\\ &\leq 4\,\mathrm{size}(\Phi)D.&\end{aligned}

Using Proposition 2.1, for any $D\in\mathbb{N}$ , we are then able to construct a functional $\mathcal{F}_{D}:K\subset\mathcal{X}\to\mathbb{R}$ , mimicking the function $f_{D}:[0,1]^{D}\subset\mathbb{R}^{D}\to\mathbb{R}$ constructed in Proposition 2.1, and such that uniform approximation of $\mathcal{F}_{D}$ by $\mathcal{S}_{\epsilon}$ (implying similar approximation of $f_{D}$ by $\Phi_{D}$ ) incurs a lower complexity bound

\mathrm{size}(\Phi)\geq(4D)^{-1}\,\mathrm{size}(\Phi_{D})\geq C_{D}\epsilon^{-\gamma D/r},

where $C_{D}>0$ is a constant depending on $D$ . For this particular functional $\mathcal{F}_{D}$ , and given the uniform lower bound on $\mathrm{size}(\Phi)$ above, it then follows that

\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq C_{D}\epsilon^{-\gamma D/r}.

The first challenge is to make this strategy precise, and to determine the $D$ -dependency of the constant $C_{D}$ . As we will see, this argument leads to a lower bound of roughly the form $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r}$ .

At this point, the argument is still for fixed $D\in\mathbb{N}$ , and would only lead to an algebraic complexity in $\epsilon^{-1}$ . To extend this to an exponential lower bound in $\epsilon^{-1}$ , we next observe that if the estimate $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r}$ could in fact be established for all $D\in\mathbb{N}$ simultaneously, i.e. if we could construct a single functional $\mathcal{S}^{\dagger}$ , for which the lower complexity bound $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim\sup_{D\in\mathbb{N}}(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r}$ were to hold, then setting $D\approx(e\epsilon)^{-1/(1+\alpha)r}$ on the right would imply that

\mathrm{cmplx}(\mathcal{S}_{\epsilon})\gtrsim\sup_{D\in\mathbb{N}}(D^{r(1+\alpha)}\epsilon)^{-\gamma D/r}\gtrsim\exp\left(c\epsilon^{-1/(1+\alpha)r}\right),

with suitable $c>0$ . Leading to an exponential lower complexity bound for such $\mathcal{S}^{\dagger}$ . The second main challenge is thus to construct a single $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathbb{R}$ which effectively “embeds” an infinite family of functionals $\mathcal{F}_{D}:K\subset\mathcal{X}\to\mathbb{R}$ with complexity bounds as above. This will be achieved by defining $\mathcal{S}^{\dagger}$ as a weighted sum of suitable functionals $\mathcal{F}_{D}$ . The detailed proof is provided in Appendix A.3. ∎

Several remarks are in order:

Remark 2.13.

Theorem 2.11 shows rigorously that in general, operator learning suffers from a curse of parametric complexity, in the sense that it is not possible to achieve better than exponential complexity bounds for general classes of operators which are merely determined by their ( $C^{r}$ - or Lipschitz-) regularity. As explained above, this is a natural infinite-dimensional analogue of the curse of dimensionality in finite-dimensions (cp. Proposition 2.1), and motivates our terminology. We note that the lower bound of Theorem 2.11 qualitatively matches general upper bounds for DeepONets derived in [37]. It would be of interest to determine sharp rates.

Remark 2.14.

Theorem 2.11 is derived for ReLU neural networks. With some effort, the argument could be extended to more general, piecewise polynomial activation functions. While we believe that the curse of parametric complexity has a fundamental character, we would like to point out that, for non-standard neural network architectures, algebraic approximation rates have been obtained [49]; these results build on either “superexpressive” activation functions or other non-standard architectures. Since these networks are not ordinary feedforward ReLU neural networks, the algebraic approximation rates of [49] are not in contradiction with Theorem 2.11. While the parametric complexity of the non-standard neural operators in [49] is exponentially smaller than the lower bound of Theorem 2.11, it is conceivable that storing the neural network weights in practice would require exponential accuracy (number of bits), to account for the highly unstable character of super-expressive constructions.

Remark 2.15.

Theorem 2.11 differs from previous results on the limitations of operator learning frameworks, as e.g. addressed in [35, 51, 34]. Earlier work focuses on the limitations imposed by a linear choice of the reconstruction mapping $\mathcal{R}$ . In contrast, the results of the present work exhibit $C^{k}$ -smooth operators and functionals which are fundamentally hard to approximate by neural network-based methods (with ReLU activation), irrespective of the choice of reconstruction.

Remark 2.16.

We finally link our main theorem to a related result for PCA-Net [32, Thm. 3.3], there derived in a complementary Hilbert space setting for $\mathcal{X}$ and $\mathcal{Y}$ ; the result of [32] shows that, for PCA-Net, no fixed algebraic convergence rate can be achieved in the operator learning of general $C^{r}$ -operators between Hilbert spaces; this can be viewed as a milder version of the full curse of parametric complexity identified in the present work, expressed by an exponential lower complexity bound in Theorem 2.11.

To further illustrate an implication of Theorem 2.11, we provide the following example:

Example 2.17 (Operator Learning CoD).

Let $\Omega\subset\mathbb{R}^{d}$ be a domain. Let $s,\rho\in\mathbb{Z}_{\geq 0}$ be given, with $s<\rho$ , and consider the compact set

K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq 1\right\}}\subset C^{s}(\Omega).

Fix $r\in\mathbb{N}$ . By Lemma 2.7, $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ for any $\alpha>1+\frac{\rho-s}{d}$ . Fix such $\alpha$ . Applying Theorem 2.11, it follows that there exists a $r$ -times Fréchet differentiable functional $\mathcal{S}^{\dagger}:C^{s}(\Omega)\to\mathbb{R}$ and constant $c,\overline{\epsilon}>0$ , such that any family $\mathcal{S}_{\epsilon}:C^{s}(\Omega)\to\mathbb{R}$ of functionals of neural network-type, achieving accuracy

\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)|\leq\epsilon,\quad\forall\,\epsilon\leq\overline{\epsilon},

has complexity at least $\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(1+\alpha)r})$ for $\epsilon\leq\overline{\epsilon}$ . Furthermore, the constants $c,\,\overline{\epsilon}>0$ depend only on the parameters $r,s,\rho,\alpha$ .

In the next subsection, we aim to show the relevance of the above abstract result for concrete neural operator architectures. Specifically, we show that three operator learning architectures from the literature are of neural network-type (PCA-Net, DeepONet, NOMAD), and relate our notion of complexity to the required number of tunable parameters for each. Finally, we show that even frameworks which are not necessarily of neural network-type could suffer from a similar curse of parametric complexity. We make this explicit for the Fourier neural operator in subsection 2.5.

2.4. Examples of Oerators of Neural Network-Type

We describe three representative neural operator architectures and show that they can be cast in the above framework.

PCA-Net.

We start with the PCA-Net architecture from [2], anticipated in the work [25]. If $\mathcal{X}$ and $\mathcal{Y}$ are Hilbert spaces, then a neural network can be combined with principal component analysis (PCA) for the encoding and reconstruction on the underlying spaces, to define a neural operator architecture termed PCA-Net; the ingredients of this architecture are orthonormal PCA bases $\phi^{\mathcal{X}}_{1},\dots,\phi^{\mathcal{X}}_{{D_{\mathcal{X}}}}\in\mathcal{X}$ and $\phi^{\mathcal{Y}}_{1},\dots,\phi^{\mathcal{Y}}_{{D_{\mathcal{Y}}}}\in\mathcal{Y}$ , and a neural network mapping $\Psi:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}^{{D_{\mathcal{Y}}}}$ . The encoder $\mathcal{E}$ is obtained by projection onto the $\{\phi^{\mathcal{X}}_{j}\}_{j=1}^{{D_{\mathcal{X}}}}$ , whereas the reconstruction $\mathcal{R}$ is defined by a linear expansion in $\{\phi^{\mathcal{Y}}_{j}\}_{j=1}^{{D_{\mathcal{Y}}}}$ . The resulting PCA-Net neural operator is defined as

(2.12)

\displaystyle\mathcal{S}(u)(y)=\sum_{k=1}^{{D_{\mathcal{Y}}}}\Psi_{k}(\alpha_{1},\dots,\alpha_{D_{\mathcal{X}}})\phi^{\mathcal{Y}}_{k}(y),\qquad\text{with }\alpha_{j}:=\langle u,\phi^{\mathcal{X}}_{j}\rangle,\;j=1,\dots,{D_{\mathcal{X}}}.

Here the neural network $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$ , with components $\Psi_{k}({\,\cdot\,})=\Psi_{k}({\,\cdot\,};\theta)$ , depends on parameters which are optimized during training of the PCA-Net. The PCA basis functions $\phi^{\mathcal{Y}}_{1},\dots\phi^{\mathcal{Y}}_{D_{\mathcal{Y}}}:\Omega\to\mathbb{R}^{p}$ , defining the reconstruction, are precomputed from the data using PCA analysis.

Given an evaluation point $y\in\Omega$ , the composition ${\mathrm{ev}}_{y}\circ\mathcal{S}$ , between $\mathcal{S}$ and the point-evaluation mapping ${\mathrm{ev}}_{y}(v)=v(y)$ , can now be written in the form,

{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$ , $\mathcal{L}u:=(\langle u,\phi_{1}^{\mathcal{X}}\rangle,\dots,\langle u,\phi^{\mathcal{X}}_{{D_{\mathcal{X}}}}\rangle$ is a linear mapping, and $\Phi_{y}(\alpha):=\sum_{k=1}^{D_{\mathcal{Y}}}\Psi_{k}(\alpha)\phi_{k}^{\mathcal{Y}}(y)$ , for fixed $y$ , is the composition of a neural network $\Psi$ with a linear read-out; thus, $\Phi_{y}$ is itself a neural network for fixed $y$ . This shows that PCA-Net is of neural network-type.

The following lemma shows that the complexity of PCA-Net gives a lower bound on the number of free parameters for the underlying neural network architecture.

Lemma 2.18 (Complexity of PCA-Net).

Assume that $\mathcal{X}$ and $\mathcal{Y}$ are Hilbert spaces, so that PCA-Net is well-defined. Any PCA-Net $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$ is of neural network-type, and

\mathrm{size}(\Psi)\geq(2p+2)^{-1}\mathrm{cmplx}(\mathcal{S}).

Note that the dimension $p\in\mathbb{N}$ is fixed by the underlying function space $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ . Thus, Lemma 2.18 implies a lower complexity bound $\mathrm{size}(\Psi)\gtrsim\mathrm{cmplx}(\mathcal{S})$ . The detailed proof is given in Appendix A.5.1. It thus follows from Corollary 2.12 that operator learning with PCA-Net suffers from a curse of parametric complexity:

Proposition 2.19 (Curse of parametric complexity for PCA-Net).

Assume the setting of Corollary 2.12, with $\mathcal{X}$ , $\mathcal{Y}$ are Hilbert spaces. Then for any $r\in\mathbb{N}$ and $\delta>0$ , there exists $\overline{\epsilon}>0$ and an $r$ -times Fréchet differentiable functional $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$ , such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a PCA-Net $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$

(2.13)

\displaystyle\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,

with encoder $\mathcal{E}$ , neural network $\Psi$ and reconstruction $\mathcal{R}$ , implies complexity bound $\mathrm{size}(\Psi)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ ; here $c$ , $\overline{\epsilon}>0$ are constants depending only on $\alpha$ , $\delta$ , $r$ and $p$ .

DeepONet.

A conceptually similar approach is followed by the DeepONet of [39] which differs by learning the form of the representation in $\mathcal{Y}$ concurrently with the coefficients, and by allowing for quite general input linear functionals $\{\alpha_{j}\}_{j=1}^{{D_{\mathcal{X}}}}.$

The DeepONet architecture defines the encoding $\mathcal{E}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$ by a fixed choice of general linear functionals $\ell_{1},\dots,\ell_{D_{\mathcal{X}}}$ ; these may be obtained, for example, by point evaluation at distinct “sensor points” or by projection onto PCA modes as in PCA-Net. The reconstruction $\mathcal{R}:\mathbb{R}^{D_{\mathcal{Y}}}\to\mathcal{Y}$ is defined by expansion with respect to a set of functions $\phi_{1},\dots,\phi_{D_{\mathcal{Y}}}\in\mathcal{Y}$ which are themselves finite dimensional neural networks to be learned. The resulting DeepONet can be written as

(2.14)

\displaystyle\mathcal{S}(u)(y)=\sum_{k=1}^{{D_{\mathcal{Y}}}}\Psi_{k}(\alpha_{1},\dots,\alpha_{D_{\mathcal{X}}})\phi_{k}(y),\qquad\text{with }\alpha_{j}:=\ell_{j}(u),\;j=1,\dots,{D_{\mathcal{X}}}.

Here, both the neural networks $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$ , with components $\Psi_{k}=\Psi_{k}({\,\cdot\,};\theta)$ , and $\phi:\Omega\to\mathbb{R}^{p\times{D_{\mathcal{Y}}}}$ with components $\phi_{k}=\phi_{k}({\,\cdot\,};\theta)$ , depend on parameters which are optimized during training of the DeepONet.

Given a evaluation point $y\in\Omega$ , the composition ${\mathrm{ev}}_{y}\circ\mathcal{S}$ , with the point-evaluation mapping ${\mathrm{ev}}_{y}(v)=v(y)$ can again be written in the form,

{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$ , $\mathcal{L}u:=(\ell_{1}(u),\dots,\ell_{{D_{\mathcal{X}}}}(u))$ is linear, where

\Phi_{y}(\alpha):=\sum_{k=1}^{D_{\mathcal{Y}}}\Psi_{k}(\alpha)\phi_{k}(y),

and for fixed $y$ the values $\phi_{k}(y)\in\mathbb{R}^{p}$ are just (constant) vectors. Thus, $\Phi_{y}$ is the composition of a neural network $\Psi$ with a linear read-out, and hence is itself a neural network. This shows that DeepONet is of neural network-type.

The next lemma shows that the size can be related to the complexity of DeepONet: Also in this case, $\mathrm{cmplx}(\mathcal{S})$ provides a lower bound on the total number of non-zero degrees of freedom of a DeepONet. The detailed proof is provided in Appendix A.5.2.

Lemma 2.20 (Complexity of DeepONet).

Any DeepONet $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$ , defined by a branch-net $\Psi$ and trunk-net $\phi$ , is of neural network-type, and

2(\mathrm{size}(\Psi)+\mathrm{size}(\phi))\geq\mathrm{cmplx}(\mathcal{S}).

The following result is now an immediate consequence of Corollary 2.12 and the above lemma.

Proposition 2.21 (Curse of parametric complexity for DeepONet).

Assume the setting of Corollary 2.12. Then for any $r\in\mathbb{N}$ and $\delta>0$ , there exists $\overline{\epsilon}>0$ and an $r$ -times Fréchet differentiable functional $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$ , such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a DeepONet $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$

(2.15)

\displaystyle\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,

with branch net $\Psi$ and trunk net $\phi$ , implies complexity bound $\mathrm{size}(\Psi)+\mathrm{size}(\phi)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ ; here $c$ , $\overline{\epsilon}>0$ are constants depending only on $\alpha$ , $\delta$ and $r$ .

NOMAD

The linearity in the reconstruction in $\mathcal{Y}$ for both PCA-Net and DeepONet imposes a fundamental limitation on their approximation capability [34, 35, 51]. To overcome this limitation, nonlinear extensions of DeepONet have recently been proposed. The following NOMAD architecture [51] provides an example:

NOMAD (NOnlinear MAnifold Decoder) employs encoding by point evaluation at a fixed set of sensor points, or more general linear functionals $\ell_{1},\dots,\ell_{D_{\mathcal{X}}}:\mathcal{X}\to\mathbb{R}$ . The reconstruction $\mathcal{R}:\mathbb{R}^{D_{\mathcal{Y}}}\to\mathcal{Y}$ in the output space $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ is defined via a neural network $Q:\mathbb{R}^{{D_{\mathcal{Y}}}}\times\Omega\to\mathbb{R}^{p}$ , which depends jointly on encoded output coefficients in $\mathbb{R}^{{D_{\mathcal{Y}}}}$ and the evaluation point $y\in\Omega$ ; as in the two previous examples, $\Psi:\mathbb{R}^{D_{\mathcal{X}}}\to\mathbb{R}^{D_{\mathcal{Y}}}$ is again a neural network. The resulting NOMAD mapping, for $u\in\mathcal{X}$ and $y\in\Omega$ , is given by

(2.16)

\displaystyle\mathcal{S}(u)(y)=Q(\Psi(\alpha_{1},\dots,\alpha_{D_{\mathcal{X}}}),y),\qquad\text{with }\alpha_{j}:=\ell_{j}(u),

for $j=1,\dots,{D_{\mathcal{X}}}$ . We note that the main difference between DeepONet and NOMAD is that the linear expansion in (2.14) has been replaced by a nonlinear composition with the neural network $Q$ . Both neural networks $\Psi$ and $Q$ are optimized during training.

Given evaluation point $y\in\Omega$ , the composition ${\mathrm{ev}}_{y}\circ\mathcal{S}$ with the point-evaluation can be written in the form,

{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),

where $\mathcal{L}:\mathcal{X}\to\mathbb{R}^{D_{\mathcal{X}}}$ , $\mathcal{L}u:=(\ell_{1}(u),\dots,\ell_{D_{\mathcal{X}}}(u))$ is linear, and $\alpha\mapsto\Phi_{y}(\alpha):=Q(\Psi_{k}(\alpha),y)$ defines a neural network for fixed $y$ . This shows that NOMAD is of neural network-type. Finally, the following lemma provides an estimate on the complexity of NOMAD:

Lemma 2.22 (Complexity of NOMAD).

Any NOMAD operator $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$ defined by a branch-net $\Psi$ and non-linear reconstruction $Q$ is of neural network-type, and

(2.17)

\displaystyle 2(\mathrm{size}(\Psi)+\mathrm{size}(Q))\geq\mathrm{cmplx}(\mathcal{S}).

The expression on the left hand-side of (2.17) represents the total number of non-zero degrees of freedom in the NOMAD architecture and, as for PCA-Net and DeepONet, it is lower bounded by our notion of complexity. For the proof, we refer to Appendix A.5.3.

The following result is now an immediate consequence of Corollary 2.12 and the above lemma.

Proposition 2.23 (Curse of parametric complexity for NOMAD).

Assume the setting of Corollary 2.12. Then for any $r\in\mathbb{N}$ and $\delta>0$ , there exists $\overline{\epsilon}>0$ and an $r$ -times Fréchet differentiable functional $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathcal{Y}$ , such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a NOMAD neural operator $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$

(2.18)

\displaystyle\sup_{u\in K}\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\|\leq\epsilon,

with neural networks $\Psi$ and $Q$ , implies complexity bound $\mathrm{size}(\Psi)+\mathrm{size}(Q)\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ ; here $c$ , $\overline{\epsilon}>0$ are constants depending only on $\alpha$ , $\delta$ and $r$ .

Discussion.

For all examples above, a general lower bound on the complexity $\mathrm{cmplx}(\mathcal{S})$ of operators of neural network-type implies a lower bound on the total number of degrees of freedom of the particular architecture. In particular, a lower bound on $\mathrm{cmplx}(\mathcal{S})$ gave a lower bound on the smallest possible number of non-zero parameters that are needed to implement $\mathcal{S}$ in practice. This observation motivates our nomenclature for the complexity.

We emphasize that our notion of complexity only relates to the size of the neural network at the core of these architectures; by design, it does not take into account other factors, such as the additional complexity associated with the practical evaluation of inner products in the PCA-Net architecture, evaluation of linear encoding functionals of DeepONets, or the numerical representation of an (optimal) output PCA basis for PCA-Net or neural network basis for DeepONet. The important point is that the aforementioned factors can only increase the overall complexity of any concrete implementation; correspondingly, our proposed notion of $\mathrm{cmplx}(\mathcal{S})$ , which neglects some of these additional contributions, is can be used to derive rigorous lower bounds on the overall complexity of any implementation.

Remark 2.24.

In passing, we point out similar approaches [26, 35, 45, 57] to the PCA-Net, DeepONet and NOMAD architectures, which share a closely related underlying structure to the examples given above. We fully expect that the curse of parametric complexity applies to all of these architectures.

In the next subsection 2.5 we will show that even for operator architectures which are not of neural network-type according to the above definition, we may nevertheless be able to link them with an associated operator of neural network-type. Specifically, we will show this for the FNO in Theorem 2.27. There, we will see that the size (number of tunable parameters) of the FNO can be linked to the complexity of an associated operator of neural network-type. And hence lower bounds on the complexity of operators of neural network-type imply corresponding lower bounds on the FNO.

2.5. The Curse of (Parametric) Complexity for Fourier Neural Operators.

The definition of operators of neural network-type introduced in the previous subsection does not include the FNO, a widely adopted neural operator architecture. However, in this subsection we show that a result similar to Theorem 2.11, stated as Theorem 2.27 below, can be obtained for the FNO.

Due to intrinsic constraints on the domain on which FNO can be (readily) applied, we will assume that the spatial domain $\Omega=\prod_{j=1}^{d}[a_{j},b_{j}]\subset\mathbb{R}^{d}$ is rectangular. We recall that an FNO,

\mathcal{S}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathcal{Y}(\Omega;\mathbb{R}^{p}),

can be written as a composition, $\mathcal{S}=\mathcal{Q}\circ\mathcal{L}_{L}\circ\dots\circ\mathcal{L}_{1}\circ\mathcal{P}$ , of a lifting layer $\mathcal{P}$ , hidden layers $\mathcal{L}_{1},\dots,\mathcal{L}_{L}$ and an output layer $\mathcal{Q}$ . In the following, let us denote by $\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})$ a generic space of functions from $\Omega$ to $\mathbb{R}^{d_{v}}$ .

The nonlinear lifting layer

\mathcal{P}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathcal{V}(\Omega;\mathbb{R}^{d_{v}}),\quad u(x)\mapsto\chi(x,u(x)),

is defined by a neural network $\chi:\Omega\times\mathbb{R}^{k}\to\mathbb{R}^{d_{v}}$ , depending jointly on the evaluation point $x\in\Omega$ and the components of the input function $u$ evaluated at $x$ , namely $u(x)\in\mathbb{R}^{k}$ . The dimension $d_{v}$ is a free hyperparameter, and determines the number of “channels” (or the “lifting dimension”).

Each hidden layer $\mathcal{L}_{\ell}$ , $\ell=1,\dots,L$ , of an FNO is of the form

\mathcal{L}_{\ell}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{V}(\Omega;\mathbb{R}^{d_{v}}),\quad v\mapsto\sigma\left(W_{\ell}v+K_{\ell}v+b_{\ell}\right),

where $\sigma$ is a nonlinear activation function applied componentwise, $W_{\ell}\in\mathbb{R}^{d_{v}\times d_{v}}$ is a matrix, the bias $b_{\ell}\in\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})$ is a function and $K_{\ell}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})$ is a non-local operator defined in terms of Fourier multipliers,

Kv(x)=\mathcal{F}^{-1}(\widehat{T}_{k}[\mathcal{F}v]_{k})(x),

where $[\mathcal{F}v]_{k}$ denotes the $k$ -th Fourier coefficient of $v$ for $k\in\mathbb{Z}^{d}$ , $\widehat{T}_{k}\in\mathbb{C}^{d_{v}\times d_{v}}$ is a Fourier multiplier matrix indexed by $k$ , and $\mathcal{F}^{-1}$ denotes the inverse Fourier transform. In practice, a Fourier cut-off $k_{\mathrm{max}}\in\mathbb{Z}$ is introduced, and only a finite number of Fourier modes³³3Throughout this paper $\|\cdot\|_{\ell^{\infty}}$ denotes the maximum norm on finite dimensional Euclidean space. $\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}$ is retained. In particular, the number of non-zero components of $\widehat{T}=\{\widehat{T}_{k}\}_{k\in\mathbb{Z}^{d}}$ is bounded by $\|\widehat{T}\|_{0}\leq d_{v}^{2}(2k_{\mathrm{max}}+1)^{d}$ . In the following, we will also assume that the bias functions $b_{\ell}$ are determined by their Fourier components $[\widehat{b}_{\ell}]_{k}\in\mathbb{C}^{d_{v}}$ , $\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}$ .

Finally, the output layer

\mathcal{Q}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{Y}(\Omega;\mathbb{R}^{p}),\quad v\mapsto q(x,v(x)),

is defined in terms of a neural network $q:\Omega\times\mathbb{R}^{d_{v}}\to\mathbb{R}^{p}$ , a joint function of the evaluation point $x\in\Omega$ and the components of the output $v$ of the previous layer evaluated at $x$ , namely $v(x)\in\mathbb{R}^{d_{v}}$ .

To define the size of an FNO, we note that its tunable parameters are given by: (i) the weights and biases of the neural network $\chi$ defining the lifting layer $\mathcal{R}$ , (ii) the components of the matrices $W_{\ell}\in\mathbb{R}^{d_{v}\times d_{v}}$ , (iii) the components of the Fourier multipliers $\widehat{T}_{k}\in\mathbb{C}^{d_{v}\times d_{v}}$ for $\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}$ , (iv) the Fourier coefficients $[\widehat{b}_{\ell}]_{k}\in\mathbb{C}^{d_{v}}$ , for $\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}$ , and (v) the number of weights and biases of the neural network $q$ defining the output layer $\mathcal{Q}$ . Given an FNO $\mathcal{S}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathcal{Y}(\Omega;\mathbb{R}^{p})$ , we define $\mathrm{size}(\mathcal{S})$ of an FNO as the total number of non-zero parameters in this construction. We also follow the convention that for a matrix (or vector) $A$ with complex entries, the number of parameters is defined as $\|A\|_{0}=\|\mathrm{Re}(A)\|_{0}+\|\mathrm{Im}(A)\|_{0}$ .

Remark 2.25 (FNO approximation of functionals).

If $\mathcal{S}^{\dagger}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R}$ is a (scalar-valued) functional, then we will again identify the output-space $\mathcal{Y}(\Omega;\mathbb{R}^{p})$ with a space of constant functions. In this case, it is natural to add an averaging operation after the last output layer $\mathcal{Q}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathcal{Y}(\Omega;\mathbb{R})$ , i.e. we replace $\mathcal{Q}$ by $\widetilde{\mathcal{Q}}:\mathcal{V}(\Omega;\mathbb{R}^{d_{v}})\to\mathbb{R}$ , given by

(2.19)

\displaystyle\widetilde{\mathcal{Q}}(v):=\frac{1}{|\Omega|}\int_{\Omega}q(x,v(x))\,dx.

This does not introduce any additional degrees of freedom, and ensures that the output is constant. We also note that (2.19) is a special case of a Fourier multiplier $K$ , involving only the $k=0$ Fourier mode.

In the following, we will restrict attention to the approximation of functionals, taking into account Remark 2.25. We first mention the following result, proved in Appendix A.6, which shows that FNOs are not of neural network-type, in general:

Lemma 2.26.

Let $\sigma$ be the ReLU activation function. Let $\mathbb{T}\simeq[0,2\pi]$ denote the $2\pi$ -periodic torus. The FNO,

\mathcal{S}:L^{2}(\mathbb{T};\mathbb{R})\to\mathbb{R},\quad\mathcal{S}(u):=\int_{\Omega}\sigma(u(x))\,dx,

is not of neural network-type.

The fact that FNO is not of neural network-type is closely related to the fact that the Fourier transforms at the core of the FNO mapping $\mathcal{S}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R}$ cannot be computed exactly. In practice, the FNO therefore needs to be discretized.

A simple discretization $\mathcal{S}^{N}$ of $\mathcal{S}$ is readily obtained and commonly used in applications of FNOs. To this end, fix $N\in\mathbb{N}$ , and let $x_{j_{1},\dots,j_{d}}\in\Omega$ , $j_{1},\dots,j_{d}\in\{1,\dots,N\}$ be a grid consisting of $N$ equidistant points in each coordinate direction. A numerical approximation $\mathcal{S}^{N}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R}$ of $\mathcal{S}$ is obtained by replacing the Fourier transform $\mathcal{F}$ and its inverse $\mathcal{F}^{-1}$ in each hidden layer by the discrete Fourier transform $\mathcal{F}_{N}$ and its inverse $\mathcal{F}_{N}^{-1}$ , computed on the equidistant grid. Similarly, the exact average (2.19) is replaced by an average over the grid values. This “discretized” FNO $\mathcal{S}^{N}$ thus defines a mapping

\mathcal{S}^{N}:\mathcal{X}(\Omega;\mathbb{R}^{k})\to\mathbb{R},\quad u\mapsto\mathcal{S}^{N}(u),

which depends only on the grid values $u(x_{j_{1},\dots,j_{d}})\in\mathbb{R}^{k}$ of the input function $u$ , for multi-indices $(j_{1},\dots,j_{d})\in\{1,\dots,N\}^{d}$ . In contrast to $\mathcal{S}$ , the discretization $\mathcal{S}^{N}$ is readily implemented in practice. We expect that $\mathcal{S}^{N}({u})\approx\mathcal{S}(u)$ for sufficiently large $N$ . Note also that $\mathrm{size}(\mathcal{S}^{N})=\mathrm{size}(\mathcal{S})$ by construction.

Given these preparatory remarks, we can now state our result on the curse of parametric complexity for FNOs, with proof in Appendix A.7.

Theorem 2.27.

Let $\Omega\subset\mathbb{R}^{d}$ be a cube. Let $K\subset\mathcal{X}$ be a compact subset of a Banach space $\mathcal{X}=\mathcal{X}(\Omega;\mathbb{R}^{k})$ . Assume that $K$ contains an infinite-dimensional hypercube $Q_{\alpha}$ for some $\alpha>1$ . Then for any $r\in\mathbb{N}$ and $\delta>0$ , there exists $\overline{\epsilon}>0$ and an $r$ -times Fréchet differentiable functional $\mathcal{S}^{\dagger}:K\subset\mathcal{X}\to\mathbb{R}$ , such that approximation to accuracy $\epsilon\leq\overline{\epsilon}$ by a discretized FNO $\mathcal{S}_{\epsilon}^{N_{\epsilon}}$ ,

\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}^{N_{\epsilon}}(u)|\leq\epsilon,

requires either (i) complexity bound $\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ , or (ii) discretization parameter $N_{\epsilon}\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ ; here $c$ , $\overline{\epsilon}>0$ are constants depending only on $\alpha$ , $\delta$ and $r$ .

Proof.

(Sketch) The proof of Theorem 2.27 relies on the curse of parametric complexity for operators of neural network-type in Theorem 2.11. The first step is to show that discrete FNOs are of neural network-type. As a consequence, Theorem 2.11 implies a lower bound of the form $\mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ . The main additional difficulty is that the discrete FNO is not a standard ReLU neural network according to our definition, since it employs a very non-standard architecture involving convolution and Fourier transforms. Hence more work is needed, in order to relate $\mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})$ to $\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})$ ; while the complexity is the minimal number of parameters required to represent $\mathcal{S}_{\epsilon}^{N_{\epsilon}}$ by an ordinary ReLU network architecture, we recall that the $\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})$ is defined as the number of parameters defining $\mathcal{S}_{\epsilon}^{N_{\epsilon}}$ via the non-standard FNO architecture. Our analysis leads to an upper bound of the form

(2.20)

\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\lesssim N^{2d}\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}}).

As a consequence of the exponential lower bound, $\mathrm{cmplx}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ , inequality (2.20) implies that either $N_{\epsilon}$ or $\mathrm{size}(\mathcal{S}_{\epsilon}^{N_{\epsilon}})$ have to be exponentially large in $\epsilon^{-1}$ , proving the claim.

We would like to point out that the additional factor depending on $N$ is natural in view of the fact that even a linear discretized FNO layer $v\mapsto Wv$ , with matrix $W\in\mathbb{R}^{d_{v}\times d_{v}}$ , actually corresponds to a mapping $\mathbb{R}^{N^{d}\times d_{v}}\to\mathbb{R}^{N^{d}\times d_{v}}$ , $(v(x_{j_{1},\dots,j_{d}}))\mapsto(Wv(x)_{j_{1},\dots,j_{d}})$ ; i.e. the matrix multiplication should be viewed as being carried out in parallel at all $N^{d}$ grid points. Representing this mapping by an ordinary matrix multiplication $\mathbb{R}^{N^{d}\times d_{v}}\to\mathbb{R}^{N^{d}\times d_{v}}$ requires $N^{d}\|W\|_{0}$ degrees of freedom, and thus depends on $N$ in addition to the number of FNO parameters $\|W\|_{0}$ .

For the detailed proof, we refer to Appendix A.7. ∎

Remark 2.28.

Theorem 2.27 shows that FNO suffers from a similar curse of complexity as do the variants on DeepONet and PCA-Net covered by Theorem 2.11: approximation to accuracy $\epsilon$ of general ( $C^{r}$ - or Lipschitz-) operators requires either an exponential number of non-zero degree of freedom, or requires exponential computational resources to evaluate even a single forward pass. We note the difference in how the curse is expressed in Theorem 2.27 compared to Theorem 2.11; this is due to the fact that FNO is not of neural network-type (see Lemma 2.26). Instead, as outlined in the proof sketch above, only upon discretization does the FNO define an operator/functional of neural network type. The computational complexity of this discretized FNO is determined by both the FNO coefficients and the discretization parameter $N$ .

2.5.1. Discussion

To overcome the general curse of parametric complexity implied by Theorem 2.11 (and Theorem 2.27), efficient operator learning frameworks therefore have to leverage additional structure present in the operators of interest, going beyond $C^{r}$ - or Lipschitz-regularity. Previous work on overcoming the curse of parametric complexity for operator learning has mostly focused on operator holomorphy [23, 50, 34] and the emulation of numerical methods [15, 30, 34] as two basic mechanisms for overcoming the curse of parametric complexity for specific operators of interest. An abstract characterization of the entire class of operators that allow for efficient approximation by neural operators would be very desirable. Unfortunately, this appears to be out of reach, at the current state of analysis. Indeed, as far as the authors are aware, there does not even exist such a characterization for any class of standard numerical methods, such as finite difference, finite element or spectral, viewed as operator approximators.

The second contribution of the present work is to expose additional structure, different from holomorphy and emulation, that can be leveraged by neural operators. In particular, in the remainder of this paper we identify such structure in the Hamilton-Jacobi equations, and propose a neural operator framework which can build on this structure to provably overcome the curse of parametric complexity in learning the associated solution operator.

3. The Hamilton-Jacobi Equation

In the previous section we demonstrate that, generically, a scaling-limit of the curse of dimensionality is to be expected in operator approximation, if only $C^{r}-$ regularity of the map is assumed. However we also outlined in the introduction the many cases where specific structure can be exploited to overcome this curse. In the remainder of the paper we show how the curse can be removed for Hamilton-Jacobi equations. To this end we start, in this section, by setting up the theoretical framework for operator learning in this context.

We are interested in deriving error and complexity estimates for the approximation of the solution operator $\mathcal{S}_{t}^{\dagger}:C^{r}_{\mathrm{per}}(\Omega)\to C^{r}_{\mathrm{per}}(\Omega)$ associated with the following Hamilton-Jacobi PDE:

(HJ)

\displaystyle\left\{\begin{aligned} \partial_{t}u+H(q,\nabla_{q}u)&=0,(x,t)\in\Omega\times(0,T],\\ u(t=0)&=u_{0},(x,t)\in\Omega\times\{0\}\end{aligned}\right.

Here $\Omega\subset\mathbb{R}^{d}$ is a $d$ -dimensional domain. To simplify our analysis we consider only the case of a domain $\Omega=[0,2\pi]^{d}$ with periodic boundary conditions (so that we may identify $\Omega$ with $\mathbb{T}^{d}$ .) We denote by $C^{r}_{\mathrm{per}}(\Omega)$ the space of $r$ -times continuously differentiable real-valued functions with $2\pi$ -periodic derivatives, with norm

\|u\|_{C^{r}(\Omega)}:=\sup_{|\alpha|\leq r}\sup_{x\in\Omega}|D^{\alpha}u(x)|.

By slight abuse of notation, we will similarly denote by $u(q,t)\in C^{r}_{\mathrm{per}}(\Omega\times[0,T])$ and $H(q,p)\in C^{r}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d})$ functions that have $C^{r}$ regularity in all variables, and that are periodic in the first variable $q\in\Omega$ , so that in particular,

q\mapsto u(q,t),\quad\text{and}\quad q\mapsto H(q,p),

belong to $C^{r}_{\mathrm{per}}(\Omega)$ , for fixed $p\in\mathbb{R}^{d}$ or $t\in[0,T].$

In equation (HJ), $u:\Omega\times[0,T]\to\mathbb{R}$ , $(q,t)\mapsto u(q,t)$ is a function, depending on the spatial variable $q\in\Omega$ and on time $t\geq 0$ . We will restrict attention to problems for which a classical solution $u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T])$ , $r\geq 2$ , exists. In this setting the initial value problem (HJ) can be solved by the method of characteristics and a unique solution may be proved to exist for some $T=T(u_{0})>0.$ We may then define solution operator $\mathcal{S}_{t}^{\dagger}$ with property $u(\cdot,t)=\mathcal{S}_{t}^{\dagger}(u_{0});$ the local time of existence in time will, in general, depend on the input $u_{0}$ to $\mathcal{S}_{t}.$ The next two subsections describe, respectively, the method of characteristics and the existence of the solution operator on a time-interval $t\in[0,T]$ , for all initial data from compact $\mathcal{F}$ in $C^{r}_{\mathrm{per}}(\Omega),$ for $r\geq 2.$ Thus we define $\{\mathcal{S}_{t}^{\dagger}:\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)\to C^{r}_{\mathrm{per}}(\Omega)\}$ for all $t\in[0,T]$ , $T=T(\mathcal{F})$ sufficiently small.

We recall that throughout this paper, use of superscript $\dagger$ denotes an object defined through construction of an exact solution of (HJ), or objects used in the construction of the solution; identical notation without a superscript $\dagger$ denotes an approximation of that object.

3.1. Method of Characteristics for (HJ)

We briefly summarize this methodology; for more details see [19, Section 3.3]. Consider the following Hamiltonian system for $t\mapsto(q(t),p(t))\in\Omega\times\mathbb{R}^{d}$ defined by


(3.1a)		$\displaystyle\left\{\begin{aligned} \dot{q}&=\nabla_{p}H(q,p),\quad q(0)=q_{0},\\ \dot{p}&=-\nabla_{q}H(q,p),\quad p(0)=p_{0},\end{aligned}\right.$
(3.1b)		$\displaystyle\quad p_{0}=\nabla_{q}u_{0}(q_{0}).$

If the solution $u(q,t)$ of (HJ) is twice continuously differentiable, then $u$ can be evaluated by solving the ODE (3.1a) with the specific parameterized initial conditions (3.1b). Given these trajectories and $t\geq 0$ , the values $u(q(t),t)$ can be computed in terms of the “action” along this trajectory:

(3.2)

\displaystyle u(q(t),t)=u_{0}(q_{0})+\int_{0}^{t}\mathcal{L}(q(t),p(t))\,d\tau,

where $\mathcal{L}:\Omega\times\mathbb{R}^{d}\to\mathbb{R}$ is the Lagrangian associated with $H$ , i.e.

(3.3)

\displaystyle\mathcal{L}(q,p):=p\cdot\nabla_{p}H(q,p)-H(q,p).

Equivalently, the solution $z(t)=u(q(t),t)$ can be expressed in terms of the solution of the following system of ODEs, $t\mapsto(q(t),p(t),z(t))$ :


(3.4a)		$\displaystyle\left\{\begin{aligned} \dot{q}&=\nabla_{p}H(q,p),\quad q(0)=q_{0},\\ \dot{p}&=-\nabla_{q}H(q,p),\quad p(0)=p_{0},\\ \dot{z}&=\mathcal{L}(q,p),\quad z(0)=z_{0}.\end{aligned}\right.$
(3.4b)		$\displaystyle\quad p_{0}=\nabla_{q}u_{0}(q_{0}),\quad z_{0}=u_{0}(q_{0}).$

The system of ODEs (3.4) is defined on $\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ , with a $2\pi$ -periodic spatial domain $\Omega=[0,2\pi]^{d}$ . It can be shown that

(3.5)

\displaystyle p(t)\equiv\nabla_{q}u(q(t),t),\quad\text{for $t\geq 0$,}

tracks the evolution of the gradient of $u$ along this trajectory. To ensure the existence of solutions to (3.1), i.e. to avoid trajectories escaping to infinity, we make the following assumption on $H$ , in which $|{\,\cdot\,}|$ denotes the Euclidean distance on $\mathbb{R}^{d}$ :

Assumption 3.1 (Growth at Infinity).

There exists $L_{H}>0$ , such that

(3.6)

\displaystyle\sup_{q\in\Omega}\left\{-p\cdot\nabla_{q}H(q,p)\right\}\leq L_{H}(1+|p|^{2}),

for all $p\in\mathbb{R}^{d}$ .

In the following, we will denote by

(3.7)

\displaystyle\left\{\begin{aligned} \Psi_{t}^{\dagger}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}&\to\Omega\times\mathbb{R}^{d}\times\mathbb{R},\\ (q_{0},p_{0},z_{0})&\mapsto(q(t),p(t),z(t)),\end{aligned}\right.

the semigroup (flow map) generated by the system of ODEs (3.4a); existence of this semigroup, and hence of the solution operator $\mathcal{S}_{t}^{\dagger}$ , is the topic of the next subsection.

3.2. Short-time Existence of $C^{r}$ -solutions

The goal of the present section is to show that for any $r\geq 2$ , and for any compact subset $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)$ , there exists a maximal $T^{\ast}=T^{\ast}(\mathcal{F})>0$ , such that for any $u_{0}\in\mathcal{F}$ , there exists a solution $u:\Omega\times[0,T^{\ast})\to\mathbb{R}$ of (HJ), with property $q\mapsto u(q,t)$ belongs to $C^{r}_{\mathrm{per}}(\Omega)$ for all $t\in[0,T^{\ast})$ . Thus we may take any $T<T^{\ast}$ in (HJ), for data in $\mathcal{F}$ . Our proof of this fact relies on the Banach space version of the implicit function theorem; see Appendix B, Theorem B.1, for a precise statement.

In the following, given $t\geq 0$ and given initial values $(q_{0},p_{0})\in\Omega\times\mathbb{R}^{d}$ , we define $q_{t}(q_{0},p_{0})$ , $p_{t}(q_{0},p_{0})$ as the spatial- and momenta-components of the solution of the Hamiltonian ODE (3.1a), respectively; i.e. the solution of (3.1a) is given by $t\mapsto(q_{t}(q_{0},p_{0}),p_{t}(q_{0},p_{0}))$ for any initial data $(q_{0},p_{0})\in\Omega\times\mathbb{R}^{d}$ .

Proposition 3.2 (Short-time Existence of Classical Solutions).

Let Assumption 3.1 hold, let $r\geq 2$ and assume that the Hamiltonian $H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d})$ . Then for any compact subset $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)$ , there exists $T^{\ast}=T^{\ast}(\mathcal{F})>0$ such that for any $u_{0}\in\mathcal{F}$ , there exists a classical solution $u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T^{\ast}))$ of the Hamilton-Jacobi equation (HJ). Furthermore, for any $T<T^{\ast}$ , there exists a constant $C=C(T,r,\mathcal{F},H)>0$ such that $\sup_{t\in[0,T]}\|u({\,\cdot\,},t)\|_{C^{r}}\leq C$ .

The proof, which may be found in Appendix B, uses the following two lemmas, also proved in Appendix B. The first result shows that, under Assumption 3.1, the semigroup $\Psi_{t}^{\dagger}$ in (3.7) is well-defined, globally in time.

Lemma 3.3.

Let $H\in C_{\mathrm{per}}^{r}(\Omega\times\mathbb{R}^{d})$ for $r\geq 1$ . Let Assumption 3.1 hold. Then the mapping $\Psi_{t}^{\dagger}$ given by (3.7) exists for any $t\geq 0$ and $t\mapsto\Psi_{t}^{\dagger}$ defines a semigroup. In particular, for any $(q_{0},p_{0})\in\Omega\times\mathbb{R}^{d}$ , there exists a solution $t\mapsto(q(t),p(t))$ of the ODE system (3.1a) with initial data $q(0)=q_{0}$ , $p(0)=p_{0}$ , for all $t\geq 0$ .

The following result will be used to show that the method of characteristics can be used to construct solutions, at least over a sufficiently short time-interval.

Lemma 3.4.

Let Assumption 3.1 hold, let $r\geq 2$ and assume that the Hamiltonian $H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d})$ . Let $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)$ be compact. Then there exists a (maximal) time $T^{\ast}=T^{\ast}(\mathcal{F})>0$ , such that for all $u_{0}\in\mathcal{F}$ , the spatial characteristic mapping

\Phi_{t}^{\dagger}({\,\cdot\,};u_{0}):\Omega\to\Omega,\quad q_{0}\mapsto q_{t}\bigl{(}q_{0},\nabla_{q}u_{0}(q_{0})\bigr{)},

defined by (3.1) on the periodic domain $\Omega=[0,2\pi]^{d}$ , is a $C^{r-1}$ -diffeomorphism for any $t\in[0,T^{\ast})$ .

Note that the map $\Phi_{t}^{\dagger}({\,\cdot\,};u_{0})$ is defined by the semigroup $\Psi_{t}^{\dagger}$ for $(q,p,z)$ ; however it only requires solution of the Hamiltonian system (3.1) for $(q,p)$ . In contrast to the flowmap $\Psi^{\dagger}_{t}$ , the spatial characteristic map $\Phi_{t}^{\dagger}({\,\cdot\,};u_{0})$ is in general only invertible over a sufficiently short time-interval, leading to a corresponding short-time existence result for the solution operator $\mathcal{S}^{\dagger}_{t}$ associated with (HJ) in Proposition 3.2.

We close this section on the short-time existence with the following remark.

Remark 3.5.

Classical solutions of the Hamilton-Jacobi equations can develop singularities in finite time due to the potential crossing of the spatial characteristics $q(t)$ emanating from different points; Indeed, if two spatial characteristics emanating from $q_{0}$ and $q_{0}^{\prime}$ cross in finite time, then (3.2) does not lead to a well-defined function $u(q,t)$ , and the method of characteristics breaks down. Thus, our existence theorem is generally restricted to a finite time interval $[0,T^{\ast}]$ . In important special cases, the crossing of characteristics is ruled out, and classical solutions exist for all time, i.e. with $T^{\ast}=\infty$ . One such example is the advection equation with velocity field $v(q)$ , and corresponding Hamiltonian $H(q,\nabla_{q}u)=v(q)\cdot\nabla_{q}u$ , for which the complexity of operator approximation is studied computationally in [13].

4. Hamilton-Jacobi Neural Operator: HJ-Net

In this section, we will describe an operator learning framework to approximate the solution operator $\mathcal{S}^{\dagger}_{t}$ defined by (HJ) for initial data $u_{0}\in C^{r}_{\mathrm{per}}(\Omega)$ with $r\geq 2$ . The main motivation for our choice of framework is the observation that the flow map $\Psi_{t}^{\dagger}$ associated with the system of ODEs (3.4a) can be computed independently of the underlying solution $u$ . Hence, an operator learning framework for (HJ) can be constructed based on a suitable approximation $\Psi_{t}\approx\Psi_{t}^{\dagger}$ , where $\Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ is a neural network approximation of the flow $\Psi_{t}^{\dagger}$ . Given such $\Psi_{t}$ , and upon fixing evaluation points $\{q_{0}^{j}\}_{j=1}^{N}\subset\Omega$ , we propose to approximate the forward operator $\mathcal{S}_{t}^{\dagger}$ of the Hamilton-Jacobi equation (HJ) by $\mathcal{S}_{t}$ , using the following three steps:

Step a)

encode the initial data $u_{0}\in C^{r}_{\mathrm{per}}(\Omega)$ by evaluating it at the points $q_{0}^{{j}}$ :

	$\displaystyle\mathcal{E}:C^{r}_{\mathrm{per}}(\Omega)$	$\displaystyle\to[\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N},$
	$\displaystyle u_{0}$	$\displaystyle\mapsto\{(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}})\}_{j=1}^{N},$

with $(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}}):=(q_{0}^{{j}},\nabla_{q}u_{0}(q_{0}^{{j}}),u_{0}(q_{0}^{{j}}))$ ;

Step b)

for each $j=1,\dots,N$ , apply the approximate flow $\Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ to the encoded data, resulting in a map

	$\displaystyle\Psi^{N}_{t}:[\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N}$	$\displaystyle\to[\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N},$
	$\displaystyle\{(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}})\}_{j=1}^{N}$	$\displaystyle\mapsto\{(q_{t}^{{j}},p_{t}^{{j}},z_{t}^{{j}})\}_{j=1}^{N},$

where $(q_{t}^{{j}},p_{t}^{{j}},z_{t}^{{j}}):=\Psi_{t}(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}})$ , for $j=1,\dots,N$ ;

Step c)

approximate the underlying solution at a fixed time $t\in[0,T]$ , for $T$ sufficiently small, by interpolating the data (input/output pairs) $\{(q_{t}^{{j}},z_{t}^{{j}})\}_{j=1}^{N}$ , leading to a reconstruction map:

	$\displaystyle\mathcal{R}:[\Omega\times\mathbb{R}]^{N}$	$\displaystyle\to C^{r}(\Omega),$
	$\displaystyle\{(q_{t}^{{j}},z_{t}^{{j}})\}$	$\displaystyle\mapsto s_{z,Q}.$

If we let $\mathcal{P}$ denote the projection map which takes $[\Omega\times\mathbb{R}^{d}\times\mathbb{R}]^{N}$ to $[\Omega\times\mathbb{R}]^{N}$ then, for fixed $T$ sufficiently small, we have defined an approximation of $\mathcal{S}_{t}^{\dagger}:C^{r}_{\mathrm{per}}(\Omega)\to C^{r}_{\mathrm{per}}(\Omega)$ , denoted $\mathcal{S}_{t}:C^{r}_{\mathrm{per}}(\Omega)\to C^{r}(\Omega)$ , and defined by

(4.1)

\mathcal{S}_{t}=\mathcal{R}\circ\mathcal{P}\circ\Psi^{N}_{t}\circ\mathcal{E}.

It is a consequence of Proposition 3.2 that our approximation $\mathcal{S}_{t}$ is well-defined for all inputs $u_{0}$ from compact subset $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)$ , $r\geq 2$ , in some interval $t\in[0,T]$ , for $T$ sufficiently small. However the resulting approximation does not obey the semigroup property with respect to $t$ and should be interpreted as holding for a fixed $t\in[0,T]$ , $T$ sufficiently small. Furthermore, iterating the map obtained for any such fixed $t$ is not in general possible unless $\mathcal{S}_{t}$ maps $\mathcal{F}$ into itself. The requirement that $\mathcal{S}_{t}^{\dagger}$ maps $\mathcal{F}$ into itself would also be required to prove the existence of a semigroup for (HJ); for our operator approximator $\mathcal{S}_{t}$ we would additionally need to ensure periodicity of the reconstruction step, something we do not address in this paper.

If the underlying solution $u(q,t)$ of (HJ) exists up to time $t$ and if it is $C^{2}$ , then the method of characteristics can be applied, and the above procedure would reproduce the underlying solution, in the absence of approximation errors in steps b) and c); i.e. in the absence of approximation errors of the Hamiltonian flow $\Psi_{t}\approx\Psi_{t}^{\dagger}$ , and in the absence of reconstruction errors. We will study the effect of approximating step b) by use of a ReLU neural network approximation of the flow $\Psi_{t}^{\dagger}$ and by use of a moving least squares interpolation for step c). In the following two subsections we define these two approximation steps, noting that doing so leads to a complete specification of $\mathcal{S}_{t}.$ This complete specification is summarized in the final Subsection 4.3.

4.1. Step b) ReLU Network

We seek an approximation $\Psi_{t}$ to $\Psi_{t}^{\dagger}$ in the form of a ReLU neural network (2.1), as summarized in section 2.1.1, with input and output dimensions ${D_{\mathcal{X}}}={D_{\mathcal{Y}}}=2d+1$ , and taking the concatenated input $x:=(q_{0},p_{0},z_{0})\in\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ to its image in $\mathbb{R}^{d}\times\mathbb{R}^{d}\times\mathbb{R}$ . We use $2\pi$ -periodicity to identify the output in the first $d$ components (the spatial variable $q$ ) with a unique element in $\Omega=[0,2\pi]^{d}$ . With slight abuse of notation, this results in a well-defined mapping

\Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R},\quad(q_{0},p_{0},z_{0})\to\Psi_{t}(q_{0},p_{0},z_{0}).

Let $\mu$ denote a probability measure on $\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ . We would like to choose $\theta$ to minimize the loss function

L(\theta)=\mathbb{E}^{(q,p,z)\sim\mu}\|\Psi_{t}(q,p,z;\theta)-\Psi_{t}^{\dagger}(q,p,z)\|.

In practice we achieve this by an empirical approximation of $\mu$ , based on $N$ i.i.d. samples, and numerical simulation to approximate the evaluation of $\Psi_{t}^{\dagger}({\,\cdot\,})$ on these samples. The resulting approximate empirical loss can be approximately minimized by stochastic gradient descent [20, 47], for example.

4.2. Step c) Moving Least Squares

In this section, we define the interpolation mapping $\mathcal{R}:\{q^{j}_{t},z^{j}_{t}\}_{j=1}^{N}\mapsto u({\,\cdot\,},t)$ , employing reconstruction by moving least squares [53]. In general, given a function $f^{\dagger}:\Omega\to\mathbb{R}$ , here assumed to be defined on the domain $\Omega=[0,2\pi]^{d}$ , and given a set of (scattered) data $\{\mathfrak{q}^{j},z^{j}\}_{j=1}^{N}$ where $z^{j}=f^{\dagger}(\mathfrak{q}^{j})$ for $\mathfrak{Q}=\{\mathfrak{q}^{1},\dots,\mathfrak{q}^{N}\}\subset\Omega$ , the method of moving least squares produces an approximation $f_{z,\mathfrak{Q}}\approx f^{\dagger}$ , which is given by the following minimization [53, Def. 4.1]:

(4.2)

\displaystyle f_{z,\mathfrak{Q}}(q)=\mathrm{min}{\left\{\sum_{j=1}^{N}\left[z^{j}-P(\mathfrak{q}^{j})\right]^{2}\phi_{\delta}(q-\mathfrak{q}^{j})\,\middle|\,P\in\pi_{m}\right\}}.

Here, $\pi_{m}$ denotes the space of polynomials of degree $m$ , and $\phi_{\delta}(q)=\phi(q/\delta)$ is a compactly supported, non-negative weight function. We will assume $\phi({\,\cdot\,})$ to be smooth, supported in the unit ball $B_{1}(0)$ , and positive on the ball $B_{1/2}(0)$ with radius $1/2$ .

The approximation error incurred by moving least squares can be estimated in the $L^{\infty}$ -norm, in terms of suitable measures of the “density” of the scattered data points $\mathfrak{q}^{j}$ , and the smoothness of the underlying function $f^{\dagger}$ [53]. The relevant notions are defined next, in which $|{\,\cdot\,}|$ denotes the Euclidean distance on $\Omega\subset\mathbb{R}^{d}$ .

Definition 4.1.

The fill distance of a set of points $\mathfrak{Q}=\{\mathfrak{q}^{1},\dots,\mathfrak{q}^{N}\}\subset\Omega$ for a bounded domain $\Omega\subset\mathbb{R}^{d}$ is defined to be

h_{\mathfrak{Q},\Omega}:=\sup_{q\in\Omega}\min_{j=1,\dots,N}|q-\mathfrak{q}^{j}|.

The separation distance of $\mathfrak{Q}$ is defined by

\rho_{\mathfrak{Q}}:=\frac{1}{2}\min_{k\neq j}|\mathfrak{q}^{k}-\mathfrak{q}^{j}|.

A set $\mathfrak{Q}$ of points in $\Omega$ is said to be quasi-uniform with respect to $\kappa\geq 1$ , if

\rho_{\mathfrak{Q}}\leq h_{\mathfrak{Q},\Omega}\leq\kappa\rho_{\mathfrak{Q}}.

Combining the approximation error of moving least squares with a stability result for moving least squares, with respect to the input data, leads to the error estimate that we require to estimate the error in our proposed HJ-Net. Proof of the following may be found in Appendix C. The statement involves both the input data set $Q$ and a set $Q^{\dagger}$ which $Q$ is supposed to approximate, together with their respective fill-distances and separation distances.

Proposition 4.2 (Error Stability).

Let $\Omega=[0,2\pi]^{d}\subset\mathbb{R}^{d}$ and consider function $f^{\dagger}\in C^{r}(\Omega)$ , for some fixed regularity parameter $r\geq 2$ . Assume that $Q^{\dagger}=\{q^{j,\dagger}\}_{j=1}^{N}$ is quasi-uniform with respect to $\kappa\geq 1$ . Let $\{q^{j},z^{j}\}_{j=1}^{N}$ be approximate interpolation data, and define $Q:=\{q^{j}\}_{j=1}^{N}$ where, for some $\rho\in(0,\frac{1}{2}\rho_{Q^{\dagger}})$ and $\epsilon>0$ , we have

|q^{j}-q^{j,\dagger}|<\rho,\quad|z^{j}-f(q^{j,\dagger})|<\epsilon.

Using this approximate interpolation data, let $f_{z,Q}$ be obtained by moving least squares (4.2). Then there exist constants $h_{0},\gamma,C>0$ , depending only on $d$ , $r$ and $\kappa\geq 1$ , such that, for $h_{Q,\Omega}\leq h_{0}$ and moving least squares scale parameter $\delta:=\gamma h_{Q,\Omega}$ , we have

(4.3)

\displaystyle\|f^{\dagger}-f_{z,Q}\|_{L^{\infty}}\leq C\left(\|f^{\dagger}\|_{C^{r}}h^{r}_{Q^{\dagger},\Omega}+\epsilon+\|f^{\dagger}\|_{C^{1}}\rho\right).

Remark 4.3.

The constants $C$ and $\gamma$ in the previous proposition can be computed explicitly [53, see Thm. 3.14 and Cor. 4.8].

Proposition 4.2 reveals two sources of error in the reconstruction by moving least squares: The first term on the right-hand side of (4.3) is the error due to the discretization by a finite number of evaluation points $q^{j}$ . This error persists even in a perfect data setting, i.e. when ${q}^{j}=q^{j,\dagger}$ and ${z}^{j}=f^{\dagger}(q^{j,\dagger})$ . The last two terms in (4.3) reflect the fact that in our intended application to HJ-Net, the evaluation points $q^{j}$ and the function values $z^{j}$ are only given approximately, via the approximate flow map $\Psi_{t}\approx\Psi_{t}^{\dagger}$ , introducing additional error sources.

The proof of Proposition 4.2 relies crucially on the fact that the set of evaluation points $Q=\{q^{j}\}_{j=1}^{N}$ is quasi-uniform. In our application to HJ-Net, these points are obtained as the image of $(q_{0}^{j},p_{0}^{j},z_{0}^{j}):=(q_{0}^{{j}},\nabla_{q}u_{0}(q_{0}^{{j}}),u_{0}(q_{0}^{{j}}))$ under the approximate flow $\Psi_{t}$ . In particular, they depend on $u_{0}$ and we cannot ensure any a priori control on the separation distance $\rho_{Q}$ . Our proposed reconstruction $\mathcal{R}$ therefore involves the pruning step, stated as Algorithm 1. Lemma C.3 in Appendix C shows that Algorithm 1 produces a quasi-uniform set $Q^{\prime}\subset Q$ with the desired properties asserted above.

Algorithm 1 Pruning

Interpolation points

Q=\{q^{j}\}_{j=1}^{N}

Pruned interpolation points

Q^{\prime}=\{q^{j_{k}}\}_{k=1}^{m}

with fill distance

h_{Q^{\prime},\Omega}\leq 3h_{Q,\Omega}

, and

Q^{\prime}

is quasi-uniform for

\kappa=3

, i.e.

\rho_{Q^{\prime}}\leq h_{Q^{\prime},\Omega}\leq 3\rho_{Q^{\prime}}.

procedure

Set

m\leftarrow 1

j_{1}\leftarrow 1

and

Q^{\prime}\leftarrow\{q^{1}\}

while

m<N

Given

Q^{\prime}=\{q^{j_{1}},\dots,q^{j_{m}}\}\subset Q

, does there exist

q^{k}\in Q

, such that

B_{h}(q^{k})\cap\bigcup_{\ell=1}^{m}B_{h}(q^{j_{\ell}})=\emptyset?

if Yes then

Set

j_{m+1}\leftarrow k

Set

Q^{\prime}\leftarrow Q^{\prime}\cup\{q^{k}\}

Set

m\leftarrow m+1

else

Terminate the algorithm.

end if

end while

end procedure

Given the interpolation by moving least squares and the pruning algorithm, we finally define the reconstruction map $\mathcal{R}:\{(q_{t}^{j},z_{t}^{j})\}_{j=1}^{N}\mapsto{f}_{z,Q}$ as follows:

Algorithm 2 Reconstruction

\mathcal{R}

(1)

Given approximate interpolation data $\{q^{j}_{t},z^{j}_{t}\}_{j=1}^{N}$ at data points $Q=\{q^{1}_{t},\dots,q^{N}_{t}\}$ , determine a quasi-uniform subset $Q^{\prime}=\{q_{t}^{j_{k}}\}_{k=1}^{m}\subset Q$ by Algorithm 1.
(2)

Define $f_{z,Q}$ as the moving least squares interpolant (4.2) based on the (pruned) interpolation data $\{(q^{j_{k}}_{t},z^{j_{k}}_{t})\}_{k=1}^{m}$ with $\delta=\gamma h_{Q^{\prime},\Omega}$ , and $\gamma$ the constant in Proposition 4.2.

4.3. Summary of HJ-Net

Thus in the final definition of HJ-Net given in equation (4.1) we recall that $\mathcal{E}$ denotes the encoder mapping,

u_{0}\mapsto\mathcal{E}(u_{0}):=(q^{j}_{0},\nabla u_{0}(q^{j}_{0}),u_{0}(q^{j}_{0})).

The mapping

(q^{j}_{0},p^{j}_{0},z^{j}_{0})\mapsto(q^{j}_{t},p^{j}_{t},z^{j}_{t}):={\Psi}_{t}(q^{j}_{0},p^{j}_{0},z^{j}_{0};\theta),

approximates the flow $\Psi_{t}$ for each of the triples, $(q^{j}_{0},p^{j}_{0},z^{j}_{0})$ , $j=1,\dots,N$ , and $\theta$ is trained from data to minimize an approximate empirical loss. And, finally, the reconstruction $\mathcal{R}$ is obtained by applying the pruned moving least squares Algorithm 2 to the data $\{q^{j}_{t},p^{j}_{t},z^{j}_{t}\}_{j=1}^{N}$ at scattered data points $Q_{t}=\{q^{1}_{t},\dots,q^{N}_{t}\}$ and with corresponding values $z_{t}=(z^{1}_{t},\dots,z^{N}_{t})$ , to approximate the output $u(q,t)\approx f_{z_{t},Q_{t}}(q)$ .

5. Error Estimates and Complexity

Subsection 5.1 contains statement of the main theorem concerning the computational complexity of HJ-Net, and the high level structure of the proof. The theorem demonstrates that the curse of parametric complexity is overcome in this problem in the sense that the cost to achieve a given error, as measured by the size of the parameterization, grows only algebraically with the inverse of the error. In the following subsections 5.2 and 5.3, we provide a detailed discussion of the approximation of the Hamiltonian flow and the moving last squares algorithm which, together, lead to proof of Theorem 5.1. Proofs of the results stated in those two subsections are given in an appendix.

5.1. HJ-Net Beats The Curse of Parametric Complexity

Theorem 5.1 (HJ-Net Approximation Estimate).

Consider equation (HJ) on periodic domain $\Omega=[0,2\pi]^{d}$ , with $C^{r}_{\mathrm{per}}$ initial data and Hamiltonian $H\in C^{r+1}_{\mathrm{per}}$ , where $r\geq 2$ . Assume that $H$ satisfies the no-blowup Assumption 3.1. Let $\mathcal{F}\subset C^{r}_{\mathrm{per}}$ be a compact set of initial data, and let $T<T^{*}(\mathcal{F})$ where $T^{*}(\mathcal{F})$ is given by Proposition 3.2. Then there is constant $C=C(T,d,r,H,\mathcal{F})>0$ , such that for any $\epsilon>0$ and $t\in[0,T]$ , there exists a set of points $Q_{\epsilon}=\{q^{1},\dots,q^{N}\}$ , optimal parameter $\theta_{\epsilon}$ and neural network $\Psi_{t}({\,\cdot\,})=\Psi_{t}({\,\cdot\,};\theta_{\epsilon})$ such that the corresponding HJ-Net approximation given by (4.1), with Steps b) and c) defined in Subsections 4.1 and 4.2 respectively, satisfies

\sup_{u_{0}\in\mathcal{F}}\|\mathcal{S}_{t}(u_{0})-\mathcal{S}_{t}^{\dagger}(u_{0})\|_{L^{\infty}}\leq\epsilon.

Furthermore, we have the following complexity bounds: (i) the number $N$ of encoding points $Q_{\epsilon}=\{q^{j}\}_{j=1}^{N}$ from Step a) can be bounded by

(5.1)

\displaystyle N\leq C\epsilon^{-d/r};

and (ii) the neural network $\Psi_{t}({\,\cdot\,})=\Psi_{t}({\,\cdot\,};\theta_{\epsilon})$ from Step b), Subsection 4.1, satisfies

(5.2)

\displaystyle 0pt(\Psi_{t})\leq C\log(\epsilon^{-1}),\quad\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log(\epsilon^{-1}).

Proof.

We first note that, for any $u_{0}\in\mathcal{F}$ and $T<T^{*}(\mathcal{F})$ , Proposition 3.2 shows that the solution $u$ of (HJ) can be computed by the method of characteristics up to time $T$ . Thus $\mathcal{S}^{\dagger}_{t}$ is well-defined for any $t\in[0,T]$ . We break the proof into three steps, relying on propositions established in the following subsections, and then conclude in a final fourth step.

Step 1: (Neural Network Approximation) Let $M$ be given by

M:=1\vee\sup_{u_{0}\in\mathcal{F}}\sup_{q\in\Omega}\|\nabla u_{0}(q)\|_{\ell^{\infty}}\vee\sup_{u_{0}\in\mathcal{F}}\sup_{q\in\Omega}|u_{0}(q)|,

where $a\vee b$ denotes the maximum. By this choice of $M$ , we have $\nabla u_{0}(q)\in[-M,M]^{d}$ , $u_{0}(q)\in[-M,M]$ for all $u_{0}\in\mathcal{F}$ , $q\in\Omega$ . By Proposition 5.3, there exists a constant $\beta=\beta(d,L_{H},t)\geq 1$ , and a constant $C>0$ , depending only on $M$ , $d$ , $t$ , and the norm $\|H\|_{C^{r+1}(\mathbb{T}^{d}\times[-\beta M,\beta M]^{d})}$ , such that the Hamiltonian flow map $\Psi_{t}^{\dagger}$ (3.7) can be approximated by a neural network $\Psi_{t}$ with

\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log(\epsilon^{-1}),\quad 0pt(\Psi_{t})\leq C\log(\epsilon^{-1}),

and

(5.3)

\displaystyle\sup_{(q_{0},p_{0},z_{0})\in\Omega\times[-M,M]^{d}\times[-M,M]}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi^{\dagger}_{t}(q_{0},p_{0},z_{0})\right|\leq\epsilon.

Step 2: (Choice of Encoding Points) Fix $\rho>0$ , to be determined below. Let $Q:=\rho\mathbb{Z}^{d}\cap[0,2\pi]^{d}$ denote an equidistant grid on $[0,2\pi]^{d}$ with grid spacing $\rho$ . Enumerating the elements of $Q$ , we write $Q=\{q_{0}^{1},\dots,q_{0}^{N}\}$ , where we note that there exists a constant $C_{1}=C_{1}(d)>0$ depending only on $d$ , such that $N\leq C_{1}\rho^{-d}$ ; equivalently,

(5.4)

\displaystyle\frac{\rho^{d}}{C_{1}}\leq\frac{1}{N}.

For any $u_{0}\in\mathcal{F}$ , let

(5.5)

\displaystyle Q_{u_{0}}^{\dagger}:={\left\{q_{t}^{{j},\dagger}\,\middle|\,q_{t}^{{j},\dagger}=q_{t}(q_{0}^{{j}},p_{0}^{{j}}),\,q_{0}^{{j}}\in Q_{\epsilon},\,p_{0}^{{j}}=\nabla_{q}u_{0}(q_{0}^{{j}})\right\}},

be the set of image points under the characteristic mapping defined by $u_{0}$ . Since $Q_{u_{0}}^{\dagger}=\{q^{j,\dagger}_{t}\}_{j=1}^{N}$ is a set of $N$ points, it follows from the definition of the fill distance that $N$ balls of radius $h_{Q^{\dagger}_{u_{0}},\Omega}$ cover $\Omega=[0,2\pi]^{d}$ . A simple volume counting argument then implies that there exists a constant $C_{0}=C_{0}(d)>0$ , such that $1/N\leq C_{0}\,h_{Q^{\dagger}_{u_{0}},\Omega}^{d}$ ; equivalently,

(5.6)

\displaystyle\frac{1}{C_{0}N}\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{d},\quad\forall\,u_{0}\in\mathcal{F}.

Given $\epsilon$ from Step 1, we now choose $\rho:=(C_{0}C_{1})^{1/d}\epsilon^{1/r}$ , so that by (5.4) and (5.6),

\epsilon^{d/r}=\frac{1}{C_{0}}\frac{\rho^{d}}{C_{1}}\leq\frac{1}{C_{0}N}\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{d},\quad\forall\,u_{0}\in\mathcal{F}.

We emphasize that $C_{0},C_{1}$ depend only on $d$ , and are independent of $u_{0}\in\mathcal{F}$ and $\epsilon$ . We have thus shown that if $Q_{\epsilon}:=\{q^{1},\dots,q^{N}\}$ is an enumeration of $\rho\mathbb{Z}^{d}\cap[0,2\pi]^{d}$ with $\rho:=(C_{0}C_{1})^{1/d}\epsilon^{1/r}$ , then the fill distance of the image points $Q^{\dagger}_{u_{0}}$ under the characteristic mapping satisfies

(5.7)

\displaystyle\epsilon\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},\quad\forall\,u_{0}\in\mathcal{F}.

In particular, this step defines our encoding points $Q_{\epsilon}$ .

Step 3: (Interpolation Error Estimate) Let $Q_{\epsilon}$ be the set of evaluation points determined in Step 2. Since $Q_{\epsilon}$ is an equidistant grid with grid spacing proportional to $\epsilon^{1/r}$ , the fill distance of $Q_{\epsilon}$ is bounded by $h_{Q_{\epsilon},\Omega}\leq C\epsilon^{1/r}$ , where the constant $C=C(d)\geq 1$ depends only on $d$ . By Proposition 5.6, there exists a (possibly larger) constant $C=C(d,t,H,\mathcal{F})\geq 1$ , such that for all $u_{0}\in\mathcal{F}$ , the set of image points $Q_{u_{0}}^{\dagger}$ under the characteristic mapping (5.5), has uniformly bounded fill distance

(5.8)

\displaystyle h_{Q^{\dagger}_{u_{0}},\Omega}\leq C\epsilon^{1/r},\quad\forall\,u_{0}\in\mathcal{F}.

Furthermore, taking into account (5.7), the upper bound (5.3) implies that

\sup_{(q_{0},p_{0},z_{0})\in\mathbb{T}^{d}\times[-M,M]^{d}\times[-M,M]}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi^{\dagger}_{t}(q_{0},p_{0},z_{0})\right|\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},

for any $u_{0}\in\mathcal{F}$ . In turn, this shows that the approximate interpolation data $({q}_{t}^{{j}},{z}_{t}^{{j}})=\mathcal{P}\circ\Psi_{t}(q_{0}^{{j}},p_{0}^{{j}},z_{0}^{{j}})$ , $j=1,\dots,N$ , obtained from the neural network approximation $\Psi_{t}\approx\Psi_{t}^{\dagger}$ by (orthogonal) projection $\mathcal{P}$ onto the first and last components, satisfies

|{q}_{t}^{{j}}-q_{t}^{j,\dagger}|\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},\quad|{z}_{t}^{{j}}-u(q_{t}^{j,\dagger},t)|\leq h_{Q^{\dagger}_{u_{0}},\Omega}^{r},

where $u(q,t)$ denotes the exact solution of the Hamilton-Jacobi PDE (HJ) with initial data $u_{0}\in\mathcal{F}$ . By Proposition 5.5, there exists a constant $C>0$ , depending only on $d$ and $r$ , such that the pruned moving least squares approximant $f_{{z},Q_{u_{0}}}$ obtained by Algorithm 2 with (approximate) data points $Q_{u_{0}}=\{q^{1}_{t},\dots,q^{N}_{t}\}$ and corresponding data values $z=\{z^{1}_{t},\dots,z^{N}_{t}\}$ , satisfies

(5.9)

\displaystyle\|u({\,\cdot\,},t)-f_{{z},Q_{u_{0}}}\|_{L^{\infty}(\Omega)}\leq C\left(1+\|u({\,\cdot\,},t)\|_{C^{r}(\Omega)}\right)h_{Q^{\dagger}_{u_{0}},\Omega}^{r}.

Step 4: (Conclusion) By the short-time existence result of Proposition 3.2, there exists $C=C(H,\mathcal{F},t)>0$ , such that $\|u({\,\cdot\,},t)\|_{C^{r}(\Omega)}\leq C$ for any solution $u$ of the Hamilton-Jacobi equation (HJ) with initial data $u({\,\cdot\,},t=0)=u_{0}\in\mathcal{F}$ . By definition of the HJ-Net approximation, we have $\mathcal{S}_{t}(u_{0})\equiv f_{{z},Q_{u_{0}}}$ and by definition of the solution operator, $\mathcal{S}_{t}^{\dagger}(u_{0})\equiv u({\,\cdot\,},t)$ . We thus conclude that

\|\mathcal{S}_{t}(u_{0})-\mathcal{S}^{\dagger}_{t}(u_{0})\|_{L^{\infty}}\overset{\mathclap{\underset{\downarrow}{\eqref{eq:1uh}}}}{\leq}Ch_{Q_{u_{0}}^{\dagger},\Omega}^{r}\overset{\mathclap{\underset{\downarrow}{\eqref{eq:Qdag}}}}{\leq}C\epsilon,

for a constant $C=C(T,d,r,H,\mathcal{F})>0$ , independent of $\epsilon$ . Since $\epsilon$ is arbitrary, replacing $\epsilon$ by $\epsilon/C$ throughout the above argument implies the claimed error and complexity estimate of Theorem 5.1. ∎

Remark 5.2 (Overall Computational Complexity of HJ-Net).

The error and complexity estimate of Theorem 5.1 implies that for moderate dimensions $d$ , and for sufficiently smooth input functions $u_{0}\in\mathcal{F}\subset C^{r}_{\mathrm{per}}$ , with $r>3d+1$ , the overall complexity of this approach scales at most linearly in $\epsilon^{-1}$ : Indeed, the mapping of the data points $(q^{j}_{0},p^{j}_{0},z^{j}_{0})\mapsto(q^{j}_{t},p^{j}_{t},z^{j}_{t})$ for $j=1,\dots,N$ requires $N$ forward-passes of the neural network $\Psi_{t}$ , which has $O(\epsilon^{-(2d+1)/r}\log(\epsilon^{-1}))$ internal degrees of freedom. Since $N=O(\epsilon^{-d/r})$ the computational complexity of this mapping is thus bounded by $O(\epsilon^{-(3d+1)/r}\log(\epsilon^{-1}))=O(\epsilon^{-1})$ . A naive implementation of the pruning algorithm requires at most $O(N^{2})=O(\epsilon^{-2d/r})=O(\epsilon^{-1})$ comparisons. The computational complexity of the reconstruction by the moving least squares method with $m\leq N$ (pruned) interpolation points and with $M\sim\epsilon^{-d/r}$ evaluation points can be bounded by $O(m+M)=O(\epsilon^{-d/r}+M)=O(\epsilon^{-1})$ [53, last paragraph of Chapter 4.2]. In particular, the overall complexity to obtain an $\epsilon$ -approximation of the output function $\mathcal{S}_{t}(u_{0})\approx\mathcal{S}_{t}^{\dagger}(u_{0})$ at e.g. the encoding points $Q_{\epsilon}$ (with $M=|Q_{\epsilon}|\sim\epsilon^{-d/r}$ ) is at most linear in $\epsilon^{-1}$ .

Theorem 5.1 shows that for fixed $d$ and $r$ , the proposed HJ-Net architecture can overcome the general curse of parametric complexity in the operator approximation $\mathcal{S}\approx\mathcal{S}^{\dagger}$ implied by Theorem 2.11 even though the underlying operator does not have higher than $C^{r}$ -regularity. This is possible because HJ-Net leverages additional structure inherent to the Hamilton-Jacobi PDE HJ (reflected in the method of characteristics), and therefore does not rely solely on the $C^{r}$ -smoothness of the underlying operator $\mathcal{S}^{\dagger}$ .

5.2. Approximation of Hamiltonian Flow and Quadrature Map

In this subsection we quantify the complexity of $\epsilon$ -approximation by a ReLU network as defined in Subsection 4.1.

Recall that $\Psi_{t}^{\dagger}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ comprises solution of the Hamiltonian equations (3.1) and quadrature (3.2), leading to (3.4). An approximation $\Psi_{t}:\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ of this Hamiltonian flow map is provided the by the following proposition, proved in Appendix D.

Proposition 5.3.

Let $\Omega=[0,2\pi]^{d}$ . Let $r\geq 2$ , and $t>0$ be given, and assume that $H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d})$ satisfies the no-blowup Assumption 3.1 with constant $L_{H}>0.$ Then, for any $M\geq 1$ , there exist constants $\beta:=(1+\sqrt{d})\exp(L_{H}t)$ and $C=C\left(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d})},M,r,d,t\right)>0$ , such that for all $\epsilon\in(0,\frac{1}{2}]$ , there is a ReLU neural network $\Psi_{t}({\,\cdot\,})=\Psi_{t}({\,\cdot\,};\theta_{\epsilon})$ satisfying

(5.10)

\displaystyle\sup_{(q_{0},p_{0},z_{0})\in\Omega\times[-M,M]^{d+1}}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi_{t}^{\dagger}(q_{0},p_{0},z_{0})\right|\leq\epsilon,

satisfying

\displaystyle\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log\left(\epsilon^{-1}\right),\quad 0pt(\Psi_{t})\leq C\log\left(\epsilon^{-1}\right).

Remark 5.4.

Using the results of [38, 29, 55], one could in fact improve the size bound of Proposition 5.3 somewhat: neglecting logarithmic terms in $\epsilon^{-1}$ , it can be shown that a neural network with $\mathrm{size}(\Psi_{t})\lesssim\epsilon^{-(d+1/2)/r}$ suffices. However, this comes at the expense of substantially increasing the depth from a logarithmic scaling $0pt(\Psi_{t})\lesssim\log(\epsilon^{-1})$ , to an algebraic scaling $0pt(\Psi_{t})\lesssim\epsilon^{-(d+1/2)/r}$ .

5.3. Moving Least Squares Reconstruction Error

In this subsection we discuss error estimates for the reconstruction by moving least squares, based on imperfect input-output pairs, as defined in Subsection 4.2.

We recall that the reconstruction $\mathcal{R}$ in the HJ-Net approximation is obtained by applying the pruned moving least squares Algorithm 2 to the data $\{q^{j}_{t},z^{j}_{t}\}_{j=1}^{N}$ , where $(q^{j}_{t},p^{j}_{t},z^{j}_{t})$ are obtained as $(q^{j}_{t},p^{j}_{t},z^{j}_{t})=\Psi_{t}(q^{j}_{0},p^{j}_{0},z^{j}_{0})$ with fixed evaluation points $\{q^{j}_{0}\}_{j=1}^{N}\subset\Omega$ , and where $p^{j}_{0}:=\nabla_{q}u_{0}(q^{j}_{0})$ , $z^{j}_{0}:=u_{0}(q^{j}_{0})$ are determined in terms of a given input function $u_{0}$ , so that $z^{j}_{t}\approx u(q^{j}_{t},t)$ is an approximation of the corresponding solution $u({\,\cdot\,},t)$ at time $t$ .

In the following, we first derive an error estimate in terms of the fill distance of $Q_{t}=\{q^{j}_{t}\}_{j=1}^{N}$ , in Proposition 5.5. Subsequently, in Proposition 5.6, we provide a bound on the fill distance $h_{Q_{t},\Omega}$ at time $t$ in terms of the fill distance $h_{Q,\Omega}$ at time $0$ . Proof of both propositions can be found in Appendix D.

Proposition 5.5.

Let $\Omega=[0,2\pi]^{d}\subset\mathbb{R}^{d}$ and fix a regularity parameter $r\geq 2$ . There exist constants $h_{0},C>0$ such that the following holds: Assume that $Q^{\dagger}=\{q^{1,\dagger},\dots,q^{N,\dagger}\}\subset\Omega$ is a set of $N$ evaluation points with fill distance $h_{Q^{\dagger},\Omega}\leq h_{0}$ . Then for any $2\pi$ -periodic function $f^{\dagger}\in C^{r}_{\mathrm{per}}(\Omega)$ , and approximate input-output data $\{({q}^{j},{z}^{j})\}_{j=1}^{N}$ , such that

|{q}^{j}-q^{j,\dagger}|\leq h_{Q^{\dagger},\Omega}^{r},\quad|{z}^{j}-f^{\dagger}(q^{j,\dagger})|\leq h_{Q^{\dagger},\Omega}^{r},

the pruned moving least squares approximant $f_{{z},{Q}}$ of Algorithm 2 with interpolation data points $Q=\{q^{1},\dots,q^{N}\}$ and data values $z=\{z^{1},\dots,z^{N}\}$ , satisfies

\|f^{\dagger}-f_{{z},{Q}}\|_{L^{\infty}(\Omega)}\leq C\left(1+\|f^{\dagger}\|_{C^{r}(\Omega)}\right)h_{Q^{\dagger},\Omega}^{r}.

In contrast to Proposition 4.2, Proposition 5.5 includes the pruning step in the reconstruction, and does not assume quasi-uniformity of either the underlying exact point distribution $Q^{\dagger}$ , nor the approximate point distribution $Q$ . To obtain a bound on the reconstruction error, we can combine the preservation of $C^{r}$ -regularity implied by the short-time existence Proposition 3.2, with the following:

Proposition 5.6.

Let $\Omega=[0,2\pi]^{d}$ , let $r\geq 2$ , and assume that the Hamiltonian $H\in C^{r+1}_{\mathrm{per}}(\Omega\times\mathbb{R}^{d})$ is periodic in $q$ and satisfies Assumption 3.1 with constant $L_{H}>0$ . Let $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)$ be a compact subset and fix $t<T^{\ast}$ . Then there exists a constant $C=C(H,\mathcal{F},L_{H},t)>0$ , such that for any set of evaluation points $Q=\{q^{j}_{0}\}_{j=1}^{N}\subset\Omega$ , and for any $u_{0}\in\mathcal{F}$ , the image points

Q_{u_{0}}^{\dagger}:={\left\{q^{j}_{t}\,\middle|\,q^{j}_{t}=\Phi^{\dagger}_{t,u_{0}}(q^{j}_{0}),\;j=1,\dots,N\right\}}\subset\Omega,

under the spatial characteristic mapping $\Phi^{\dagger}_{t,u_{0}}:q_{0}\mapsto q_{t}(q_{0},\nabla_{q}u_{0}(q_{0}))$ satisfy the following uniform bound on the fill distance:

h_{Q^{\dagger}_{u_{0}},\Omega}\leq Ch_{Q,\Omega}.

6. Conclusions

The first contribution of this work is to study the curse of dimensionality in the context of operator learning, here interpreted in terms of the infinite dimensional nature of the input space. We state a theorem which, for the first time, establishes a general form of the curse of parametric complexity, a natural scaling limit of the curse of dimensionality in high-dimensional approximation, characterized by lower bounds which are exponential in the desired error. The theorem demonstrates that in general it is not possible to obtain complexity estimates, for the size of the approximating neural operator, that grow algebraically with inverse error unless specific structure in the underlying operator is leveraged; in particular, we prove that this additional structure has to go beyond $C^{r}$ - or Lipschitz-regularity. This considerably generalizes and strengthens earlier work on the curse of parametric complexity in [32], where a mild form of this curse had been identified for PCA-Net. As shown in the present work, our result applies to many proposed operator learning architectures including PCA-Net, the FNO, and DeepONet and recent nonlinear extensions thereof. The lower complexity bound in this work is obtained for neural operator architectures based on standard feedforward ReLU neural networks, and could likely be extended to feedforward architectures employing piecewise polynomial activations. It is of note that algebraic complexity bounds, which seemingly overcome the curse of parametric complexity of the present work, have recently been derived for the approximation of Lipschitz operators [49]; these results build on non-standard neural network architectures with either superexpressive activation functions, or non-standard connectivity, and therefore do not contradict our results.

The second contribution of this paper is to present an operator learning framework for Hamilton-Jacobi equations, and to provide a complexity analysis demonstrating that the methodology is able to tame the curse of parametric complexity for these PDE. We present the ideas in a simple setting, and there are numerous avenues for future investigation. For example, as pointed out in Remark 3.5, one main limitation of the proposed approach based on characteristics is that we can only consider finite time intervals on which classical solutions of the HJ equations exist. It would be of interest to extend the methodology to weak solutions, after the potential formation of singularities. It would also be of interest to combine our work with the work on curse of dimensionality with respect to dimension of Euclidean space, cited in Section 1. Furthermore, in practice we recommend learning the Hamiltonian flow, which underlies the method of characteristics for the HJ equation, using symplectic neural networks [27]. However, the analysis of these neural networks is not yet developed to the extent needed for our complexity analysis in this paper, which builds on the work in [54]. Extending the analysis of symplectic neural networks, and using this extension to analyze generalizations of HJ-Net as defined here, are interesting directions for future study. Finally, we note that it is of interest to iterate the learned operator. In order to do this, we would need to generalize the error estimates to the $C^{1}$ topology. This could be achieved either by interpolation between higher-order $C^{s}$ bounds of the proposed methodology for $s>1$ combined with the existing error analysis, or by using the gradient variable $p$ in the interpolation.

Acknowledgments The work of SL is supported by Postdoc.Mobility grant P500PT-206737 from the Swiss National Science Foundation. The work of AMS is supported by a Department of Defense Vannevar Bush Faculty Fellowship.

References

[1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.
[3] N. Boullé, C. J. Earls, and A. Townsend. Data-driven discovery of green’s functions with human-understandable deep learning. Scientific Reports, 12(1):1–9, 2022.
[4] N. Boullé and A. Townsend. Learning elliptic partial differential equations with randomized linear algebra. Foundations of Computational Mathematics, pages 1–31, 2022.
[5] T. Chen and H. Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
[6] S.-N. Chow and J. K. Hale. Methods of Bifurcation Theory, volume 251. Springer Science & Business Media, 2012.
[7] Y. T. Chow, J. Darbon, S. Osher, and W. Yin. Algorithm for overcoming the curse of dimensionality for time-dependent non-convex Hamilton–Jacobi equations arising from optimal control and differential games problems. Journal of Scientific Computing, 73(2):617–643, 2017.
[8] Y. T. Chow, J. Darbon, S. Osher, and W. Yin. Algorithm for overcoming the curse of dimensionality for state-dependent Hamilton-Jacobi equations. Journal of Computational Physics, 387:376–409, 2019.
[9] O. Christensen et al. An introduction to frames and Riesz bases, volume 7. Springer, 2003.
[10] S. Dahlke, F. De Mari, P. Grohs, and D. Labate. Harmonic and applied analysis. Appl. Numer. Harmon. Anal, 2015.
[11] J. Darbon, G. P. Langlois, and T. Meng. Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures. Research in the Mathematical Sciences, 7(3):1–50, 2020.
[12] J. Darbon and S. Osher. Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1):1–26, 2016.
[13] M. De Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in operator learning with neural networks. Journal of Machine Learning, 1:3:299–341, 2022.
[14] M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA Journal on Uncertainty Quantification, 11(2):480–513, 2023.
[15] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis. Approximation rates of deeponets for learning operators arising from advection–diffusion equations. Neural Networks, 153:411–426, 2022.
[16] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021.
[17] R. A. DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998.
[18] D. L. Donoho. Sparse components of images and optimal atomic decompositions. Constructive Approximation, 17:353–382, 2001.
[19] L. C. Evans. Partial Differential Equations. American Mathematical Society, Providence, R.I., 2010.
[20] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT press, 2016.
[21] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. Constructive approximation, 55(1):259–367, 2022.
[22] C. Heil. A basis theory primer: expanded edition. Springer Science & Business Media, 2010.
[23] L. Herrmann, C. Schwab, and J. Zech. Deep ReLU neural network expression rates for data-to-qoi maps in Bayesian PDE inversion. SAM Research Report, 2020, 2020.
[24] L. Herrmann, C. Schwab, and J. Zech. Neural and GPC operator surrogates: Construction and expression rate bounds. arXiv preprint arXiv:2207.04950, 2022.
[25] J. S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear problems using neural networks. Journal of Computational Physics, 363:55–78, 2018.
[26] N. Hua and W. Lu. Basis operator network: A neural network-based model for learning nonlinear operators via neural basis. Neural Networks, 164:21–37, 2023.
[27] P. Jin, Z. Zhang, A. Zhu, Y. Tang, and G. E. Karniadakis. Sympnets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks, 132:166–179, 2020.
[28] Y. Khoo, J. Lu, and L. Ying. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
[29] M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
[30] N. B. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22(1), 2021.
[31] N. B. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89), 2023.
[32] S. Lanthaler. Operator learning with PCA-Net: Upper and lower complexity bounds. Journal of Machine Learning Research, 24(318), 2023.
[33] S. Lanthaler, Z. Li, and A. M. Stuart. The nonlocal neural operator: universal approximation. arXiv preprint arXiv:2304.13221, 2023.
[34] S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1), 2022.
[35] S. Lanthaler, R. Molinaro, P. Hadorn, and S. Mishra. Nonlinear reconstruction for operator learning of PDEs with discontinuities. In Eleventh International Conference on Learning Representations, 2023.
[36] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In Ninth International Conference on Learning Representations, 2021.
[37] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–67, 2024.
[38] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
[39] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
[40] C. Marcati and C. Schwab. Exponential convergence of deep operator networks for elliptic partial differential equations. SIAM Journal on Numerical Analysis, 61(3):1513–1545, 2023.
[41] H. Mhaskar, Q. Liao, and T. Poggio. Learning real and boolean functions: When is deep better than shallow. Technical report, Center for Brains, Minds and Machines (CBMM), arXiv, 2016.
[42] H. N. Mhaskar and N. Hahm. Neural networks for functional approximation and system identification. Neural Computation, 9(1):143–159, 1997.
[43] N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
[44] J. A. Opschoor, C. Schwab, and J. Zech. Exponential relu dnn expression of holomorphic maps in high dimension. Constructive Approximation, 55(1):537–582, 2022.
[45] D. Patel, D. Ray, M. R. Abdelmalik, T. J. Hughes, and A. A. Oberai. Variationally mimetic operator networks. Computer Methods in Applied Mechanics and Engineering, 419:116536, 2024.
[46] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks. Neural Networks, 108:296–330, 2018.
[47] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[48] T. D. Ryck and S. Mishra. Generic bounds on the approximation error for physics-informed (and) operator learning. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022.
[49] C. Schwab, A. Stein, and J. Zech. Deep operator network approximation rates for lipschitz operators. arXiv preprint arXiv:2307.09835, 2023.
[50] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55, 2019.
[51] J. Seidman, G. Kissas, P. Perdikaris, and G. J. Pappas. NOMAD: Nonlinear manifold decoders for operator learning. Advances in Neural Information Processing Systems, 35:5601–5613, 2022.
[52] E. M. Stein and R. Shakarchi. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2009.
[53] H. Wendland. Scattered Data Approximation, volume 17. Cambridge university press, 2004.
[54] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017. Publisher: Elsevier.
[55] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. Advances in neural information processing systems, 33:13005–13015, 2020.
[56] H. You, Q. Zhang, C. J. Ross, C.-H. Lee, and Y. Yu. Learning deep implicit fourier neural operators (ifnos) with applications to heterogeneous material modeling. Computer Methods in Applied Mechanics and Engineering, 398:115296, 2022.
[57] Z. Zhang, L. Tat, and H. Schaeffer. BelNet: Basis enhanced learning, a mesh-free neural operator. Proceedings of the Royal Society A, 479, 2023.
[58] Y. Zhu and N. Zabaras. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics, 366:415–447, 2018.

Appendix A Proof of Curse of Parametric Complexity

A.1. Preliminaries in Finite Dimensions

Our goal in this section is to prove Proposition 2.1, which we repeat here:

See 2.1

To this end, we recall and extend several results from [54] on the ReLU neural network approximation of functions $f:\mathbb{R}^{D}\to\mathbb{R}$ . Subsequently, these results will be used as building blocks to construct functionals in the infinite-dimensional context, leading to a curse of parametric complexity in that setting, made precise in Theorem 2.11. Here, we consider the space $C^{r}(\mathbb{R}^{D})$ consisting of $r$ -times continuously differentiable functions $f:\mathbb{R}^{D}\to\mathbb{R}$ . We denote by

F_{D,r}:={\left\{f\in C^{r}(\mathbb{R}^{D})\,\middle|\,\sup_{|\alpha|\leq r}\sup_{x\in\mathbb{R}^{D}}\left|D^{\alpha}f(x)\right|\leq 1\right\}},

the unit ball in $C^{r}(\mathbb{R}^{D})$ . For technical reasons, it will often be more convenient to consider the subset $\mathring{F}_{D,r}\subset F_{D,r}$ consisting of functions $f\in F_{D,r}$ vanishing at the origin, $f(0)=0$ .

Remark A.1.

We note that previous work [54] considers the Sobolev space $W^{r,\infty}([0,1]^{D})$ and the unit ball in $W^{r,\infty}([0,1]^{D})$ , rather than $C^{r}(\mathbb{R}^{D})$ and our definition of $F_{D,r}$ . It turns out that the complexity bounds of [54] remain true also in our setting (with essentially identical proofs). We include the necessary arguments below.

Let $f:\mathbb{R}^{D}\to\mathbb{R}$ be a function. Following [54, Sect. 4.3], we will denote by $\mathcal{N}(f,\epsilon)$ the minimal number of hidden computation units that is required to approximate $f$ to accuracy $\epsilon$ by an arbitrary ReLU feedforward network $\Psi$ , uniformly over the unit cube $[0,1]^{D}$ ; i.e. $\mathcal{N}(f,\epsilon)$ is the minimal integer $M$ such that there exists a ReLU neural network $\Psi$ with at most $M$ computation units⁴⁴4The number of computation units equals the number of scalar evaluations of the activation $\sigma$ in a forward-pass, cf. [1]. and such that

\sup_{x\in[0,1]^{D}}|f(x)-\Psi(x)|\leq\epsilon.

Remark A.2.

We emphasize that, even though the domain of a function $f\in F_{D,r}$ is by definition the whole space $\mathbb{R}^{D}$ , the above quantity $\mathcal{N}(f,\epsilon)$ only relates to the approximation of $f$ over the unit cube $[0,1]^{D}$ . The reason for this seemingly awkward choice is that it will greatly simplify our arguments later on, when we construct functionals on infinite-dimensional Banach spaces with a view towards proving an infinite-dimensional analogue of the curse of dimensionality.

Note that the number of (non-trivial) hidden computation units $M$ of a neural network $\Psi:\mathbb{R}^{D}\to\mathbb{R}$ obeys the bounds $M\leq\mathrm{size}(\Psi)\leq 5DM^{4}$ : The lower bound follows from the fact that each (non-trivial) computation unit has at least one non-zero connection or a bias feeding into it. To derive the upper bound, we note that any network with at most $M$ computation units can be embedded in a fully connected enveloping network (allowing skip-connections) [54, cf. Fig. 6(a)] with depth $M$ , width $M$ , where each hidden node is connected to all other $M^{2}-1$ hidden nodes plus the output, and where each node in the input layer is connected to all $M^{2}$ hidden units plus the output. In addition, we take into account that each computation unit and the output layer have an additional bias. The existence of this enveloping network thus implies the size bound

	$\displaystyle\mathrm{size}(\Psi)$	$\displaystyle\leq\underbrace{M^{2}(M^{2}-1)+M^{2}}_{\text{ hidden connections}}+\underbrace{D(M^{2}+1)}_{\text{input conn.}}+\underbrace{M+1}_{\text{biases}}$
		$\displaystyle\leq M^{4}+2DM^{4}+M+1\leq 5DM^{4},$

valid for any neural network $\Psi:\mathbb{R}^{D}\to\mathbb{R}$ with at most $M$ computation units.

In view of the lower size bound, $\mathrm{size}(\Psi)\geq M$ , Proposition 2.1 above is now clearly implied by the following:

Proposition A.3.

Fix $r\in\mathbb{N}$ . There is a universal constant $\gamma>0$ and a constant $\overline{\epsilon}=\overline{\epsilon}(r)>0$ , depending only on $r$ , such that for any $D\in\mathbb{N}$ , there exists a function $f_{D}\in\mathring{F}_{D,r}$ for which

\mathcal{N}(f_{D},\epsilon)\geq\epsilon^{-\gamma D/r},\quad\forall\,\epsilon\leq\overline{\epsilon}.

As an immediate corollary of Proposition A.3, we conclude that $f_{D}$ cannot be approximated to accuracy $\epsilon$ by a family of ReLU neural networks $\Psi_{\epsilon}$ with

\mathrm{size}(\Psi_{\epsilon})=o(\epsilon^{-\gamma D/r}).

The latter conclusion was established by Yarotsky [54, Thm. 5] (with $\gamma=1/9$ ). Proposition 2.1 improves on Yarotsky’s result in two important ways: firstly, it implies that the lower bound $\mathrm{size}(\Psi_{\epsilon})\geq\epsilon^{-\gamma D/r}$ holds for all $\epsilon\leq\overline{\epsilon}$ , and not only along a (unspecified) sequence $\epsilon_{k}\to 0$ ; secondly, it shows that the constant $\overline{\epsilon}$ can be chosen independently of $D$ . This generalization of Yarotsky’s result will be crucial for the extension to the infinite-dimensional case, which is the goal of this work.

To prove Proposition A.3, we will need two intermediate results, which build on [54]. We start with the following lemma providing a lower bound on the required size of a fixed neural network architecture, which is able to approximate arbitrary $f\in\mathring{F}_{D,r}$ to accuracy $\epsilon\geq 0$ . This result is based on [54, Thm. 4(a)], but explicitly quantifies the dependence on the dimension $D$ . This dependence was left unspecified in earlier work.

Lemma A.4.

Fix $r\in\mathbb{N}$ . Let $\Psi=\Psi({\,\cdot\,};\theta)$ be a ReLU neural network architecture depending on parameters $\theta\in\mathbb{R}^{W}$ . There exists a constant $c_{0}=c_{0}(r)>0$ , such that if

\sup_{f\in\mathring{F}_{D,r}}\inf_{\theta\in\mathbb{R}^{W}}\|f-\Psi({\,\cdot\,};\theta)\|_{L^{\infty}([0,1]^{D})}\leq\epsilon,

for some $\epsilon>0$ , then $W\geq(c_{0}\epsilon)^{-D/2r}$ .

Proof.

The proof in [54] is based on the Vapnik-Chervonenkis (VC) dimension. We will not repeat the entire argument here, but instead discuss only the required changes in the proof of Yarotsky, which are needed to prove our extension of his result. In fact, there are only two differences between the proof our Lemma A.4, and [54, Thm. 4(a)], which we now point out: The first difference is that we consider $\mathring{F}_{D,r}$ , consisting of $C^{r}(\mathbb{R}^{D})$ functions vanishing at the origin, whereas [54] considers the unit ball in $W^{r,\infty}([0,1]^{D})$ . Nevertheless the proof of [54] applies to our setting in a straightforward way. To see this, we recall that the construction in [54, Proof of Thm. 4, eq. (35)], considers $f\in F_{D,r}$ of the form

(A.1)

\displaystyle f(x)=\sum_{m=1}^{N^{D}}y_{m}\phi(N(x-x_{m})),

with coefficients $y_{m}$ to be determined later, and where $\phi:\mathbb{R}^{D}\to\mathbb{R}$ is a $C^{\infty}$ bump function, which is required to satisfy⁵⁵5We note that choosing the $\ell^{\infty}$ to define the set where $\phi(x)=0$ , rather than the $\ell^{2}$ Euclidean distance as in [54], is immaterial. The only requirement is that the support of the functions on the right-hand side in the definition (A.1) do not overlap.

(A.2)

\displaystyle\phi(0)=1,\quad\text{and}\quad\phi(x)=0,\text{ if }\|x\|_{\ell^{\infty}}>1/2.

In (A.1), the points $x_{1},\dots,x_{N^{D}}\in[0,1]^{D}$ are chosen such that the $\ell^{\infty}$ -distance between any two of them is at least $1/N$ . We can easily ensure that $f(0)=0$ , by choosing the points $x_{m}$ to be of the form $(j_{1},\dots,j_{D})/N$ , where $j_{1},\dots,j_{D}\in\{1,\dots,N\}$ . Then, since for any multi-index $\alpha$ of order $|\alpha|\leq r$ , the mixed derivative

\max_{x}|D^{\alpha}f(x)|\leq N^{r}\max_{m}|y_{m}|\max_{x}|D^{\alpha}\phi(x)|,

any function $f$ of the form (A.1) belongs to $\mathring{F}_{D,r}$ , provided that

\max_{m}|y_{m}|\leq\widetilde{c}_{1}N^{-r},

where

(A.3)

\displaystyle\widetilde{c}_{1}=\left(\sup_{|\alpha|\leq r}\sup_{x\in[0,1]^{D}}|D^{\alpha}\phi(x)|\right)^{-1}.

As shown in [54], the above construction allows one to encode arbitrary Boolean values $z_{1},\dots,z_{N^{D}}\in\{0,1\}$ by construction suitable $f\in\mathring{F}_{D,r}$ ; this in turn can be used to estimate a VC-dimension related to $\Psi({\,\cdot\,};\theta)$ from below, following verbatim the arguments in [54]. As no changes are required in this part of the proof, we will not repeat the details here; Instead, following the argument leading up to [54, eq. (38)], and under the assumptions of Lemma A.4, Yarotsky’s argument immediately implies the following lower bound

W\geq\widetilde{c}_{0}\left(\frac{3\epsilon}{\widetilde{c}_{1}}\right)^{-D/2r},

where $\widetilde{c}_{0}$ is an absolute constant.

To finish our proof of Lemma A.4, it remains to determine the dependence of the constant $\widetilde{c}_{1}$ in (A.3), on the parameters $D$ and $r$ . To this end, we construct a specific bump function $\phi:\mathbb{R}^{D}\to\mathbb{R}$ . Recall that our only requirement on $\phi$ in the above argument is that (A.2) must hold. To construct suitable $\phi$ , let $\phi_{0}:\mathbb{R}\to\mathbb{R}$ , $\xi\mapsto\phi_{0}(\xi)$ be a smooth bump function such that $\phi_{0}(0)=1$ , $\|\phi_{0}\|_{L^{\infty}}\leq 1$ and $\phi_{0}(\xi)=0$ for $|\xi|>1/2$ . We now make the particular choice

\phi(x_{1},\dots,x_{D}):=\prod_{j=1}^{D}\phi_{0}(x_{j}).

Let $\alpha=(\alpha_{1},\dots,\alpha_{D})$ be a multi-index with $|\alpha|=\sum_{j=1}^{D}\alpha_{j}\leq r$ . Let $\kappa:=|\{\alpha_{j}\neq 0\}$ denote the number of non-zero components of $\alpha$ . We note that $\kappa\leq r$ . We thus have

|D^{\alpha}\phi(x)|=\prod_{j=1}^{D}|D^{\alpha_{j}}\phi_{0}(x_{j})|\leq\prod_{\alpha_{j}\neq 0}|D^{\alpha_{j}}\phi_{0}(x_{j})|\leq\|\phi_{0}\|_{C^{r}(\mathbb{R})}^{\kappa}\leq\|\phi_{0}\|_{C^{r}(\mathbb{R})}^{r}.

In particular, we conclude that $\widetilde{c}_{1}=[\sup_{|\alpha|\leq r}\sup_{x\in[0,1]^{D}}|D^{\alpha}\phi(x)|]^{-1}$ can be bounded from below by $\widetilde{c}_{1}\geq\widetilde{c}_{2}:=\|\phi_{0}\|_{C^{r}(\mathbb{R})}^{-r}$ , where $\widetilde{c}_{2}=\widetilde{c}_{2}(r)$ clearly only depends on $r$ , and is independent of the ambient dimension $D$ . The claimed inequality of Lemma A.4 now follows from

W\geq\widetilde{c}_{0}\left(\frac{3\epsilon}{\widetilde{c}_{1}}\right)^{-D/2r}\geq\left(\frac{3\epsilon}{(\widetilde{c}_{0}\wedge 1)^{2r/D}\widetilde{c}_{2}}\right)^{-D/2r}\geq\left(\frac{3\epsilon}{(\widetilde{c}_{0}\wedge 1)^{2r}\widetilde{c}_{2}}\right)^{-D/2r},

i.e. we have $W\geq(c_{0}\epsilon)^{-D/2r}$ with constant

c_{0}=\frac{3}{(\widetilde{c}_{0}\wedge 1)^{2r}\widetilde{c}_{2}(r)},

depending only on $r$ . ∎

While Lemma A.4 applies to a fixed architecture capable of approximating all $f\in\mathring{F}_{D,r}$ , our goal is to instead construct a single $f\in\mathring{F}_{D,r}$ for which a similar lower complexity bound holds for arbitrary architectures. The construction of such $f$ will rely on the following lemma, based on [54, Lem. 3]:

Lemma A.5.

Fix $r\in\mathbb{N}$ . For any (fixed) $\epsilon>0$ , there exists $f_{\epsilon}\in\mathring{F}_{D,r}$ , such that

\mathcal{N}(f_{\epsilon},\epsilon)\geq D^{-1/4}(c_{1}\epsilon)^{-D/8r},

where $c_{1}=c_{1}(r)>0$ depends only on $r$ .

Proof.

As explained above, any neural network (with potentially sparse architecture), $\Psi:\mathbb{R}^{D}\to\mathbb{R}$ , with $M$ computation units can be embedded in the fully connected architecture $\widetilde{\Psi}({\,\cdot\,};\theta)$ , $\theta\in\mathbb{R}^{W}$ , with size bound $W\leq 5DM^{4}$ . By Lemma A.4, it follows that if $W\leq 5DM^{4}<(c_{0}\epsilon)^{-D/2r}$ , then there exists $f_{\epsilon}\in\mathring{F}_{D,r}$ , such that

\inf_{\theta\in\mathbb{R}^{W}}\|f_{\epsilon}-\widetilde{\Psi}({\,\cdot\,};\theta)\|_{L^{\infty}}>\epsilon.

Expressing the above size bound in terms of $M$ , it follows that for any network $\Psi$ with $M<(5D)^{-1/4}(c_{0}\epsilon)^{-D/8r}$ computation units, we must have $\|f_{\epsilon}-\Psi\|_{L^{\infty}}>\epsilon$ . Thus, approximation of $f_{\epsilon}$ to within accuracy $\epsilon$ requires at least $M\geq(5D)^{-1/4}(c_{0}\epsilon)^{-D/8r}$ computation units. From the definition of $\mathcal{N}(f_{\epsilon},\epsilon)$ , we conclude that

\mathcal{N}(f_{\epsilon},\epsilon)\geq D^{-1/4}(c_{1}\epsilon)^{-D/8r},

for this choice of $f_{\epsilon}\in\mathring{F}_{D,r}$ , with constant $c_{1}=5^{2r}c_{0}(r)$ depending only on $r$ . ∎

Lemma A.5 will be our main technical tool for the proof of Proposition A.3. We also recall that $\mathcal{N}(f,\epsilon)$ satisfies the following properties, [54, eq. (42)–(44)]:


(A.4a)		$\displaystyle\mathcal{N}(af,\|a\|\epsilon)=\mathcal{N}(f,\epsilon),$
(A.4b)		$\displaystyle\mathcal{N}(f\pm g,\epsilon+\\|g\\|_{L^{\infty}})\leq\mathcal{N}(f,\epsilon),$
(A.4c)		$\displaystyle\mathcal{N}(f_{1}\pm f_{2},\epsilon_{1}+\epsilon_{2})\leq\mathcal{N}(f_{1},\epsilon_{1})+\mathcal{N}(f_{2},\epsilon_{2}).$

Proof.

(Proposition A.3) We define a rapidly decaying sequence $\epsilon_{k}\to 0$ , by $\epsilon_{1}=1/4$ and $\epsilon_{k+1}=\epsilon^{2}_{k}$ , so that by recursion $\epsilon_{k}=2^{-2^{k}}$ . We also define $a_{k}:=\frac{1}{2}\epsilon_{k}^{1/2}=\frac{1}{2}\epsilon_{k-1}$ . For later reference, we note that since $\epsilon_{k}\leq 1/2$ for all $k$ , we have

(A.5)

\displaystyle\sum_{s=k+1}^{\infty}a_{s}=\frac{1}{2}\left(\epsilon_{k}+\epsilon_{k}^{2}+\epsilon_{k}^{2^{2}}+\dots\right)\leq\frac{1}{2}\epsilon_{k}\sum_{s=0}^{\infty}2^{-s}=\epsilon_{k}.

Our goal is to construct $f\in\mathring{F}_{D,r}$ of the form

f=\sum_{k=1}^{\infty}a_{k}f_{k},\quad\text{with }f_{k}\in\mathring{F}_{D,r}\;\forall\,k\in\mathbb{N}.

We note that $a_{k}\leq 2^{-k}$ , hence if $f$ is of the above form then,

\displaystyle\|f\|_{C^{r}}

\displaystyle\leq\sum_{s=1}^{\infty}a_{s}\|f_{s}\|_{C^{r}}\leq 1,\quad f(0)=\sum_{s=1}^{\infty}a_{s}\underbrace{f_{s}(0)}_{=0}=0,

implies that $f\in\mathring{F}_{D,r}$ irrespective of the specific choice of $f_{k}\in\mathring{F}_{D,r}$ . In the following construction, we choose $\gamma:=1/32$ throughout. We determine $f_{k}$ recursively, formally starting from the empty sum, i.e. $f=0$ . In the recursive step, given $f_{1},\dots,f_{k-1}$ , we distinguish two cases:

Case 1: Assume that

\mathcal{N}\left(\sum_{s=1}^{k-1}a_{s}f_{s},2\epsilon_{k}\right)\geq\epsilon_{k}^{-\gamma D/r}.

In this case, we set $f_{k}:=0$ , and trivially obtain

(A.6)

\displaystyle\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right)\geq\epsilon_{k}^{-\gamma D/r}.

Case 2: In the other case, we have

\mathcal{N}\left(\sum_{s=1}^{k-1}a_{s}f_{s},2\epsilon_{k}\right)<\epsilon_{k}^{-\gamma D/r}.

Our first goal is to again choose $f_{k}$ such that (A.6) holds, at least for sufficiently large $k$ . We note that, by the general properties of $\mathcal{N}$ , (A.4c), and the assumption of Case 2:

	$\displaystyle\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right)$	$\displaystyle\geq\mathcal{N}(a_{k}f_{k},4\epsilon_{k})-\mathcal{N}\left(\sum_{s=1}^{k-1}a_{s}f_{s},2\epsilon_{k}\right)$
(A.7)			$\displaystyle\geq\mathcal{N}(a_{k}f_{k},4\epsilon_{k})-\epsilon_{k}^{-\gamma D/r}.$

By Lemma A.5, we can find $f_{k}\in\mathring{F}_{D,r}$ , such that

(A.8)

\displaystyle\mathcal{N}(a_{k}f_{k},4\epsilon_{k})\overset{\mathclap{\underset{\downarrow}{(\ref{eq:N}a)}}}{=}\mathcal{N}(f_{k},4\epsilon_{k}/a_{k})=\mathcal{N}(f_{k},4\epsilon_{k}^{1/2})\geq D^{-1/4}(8c_{1}\epsilon_{k}^{1/2})^{-D/8r}.

This defines our recursive choice of $f_{k}$ in Case 2. By (A.7) and (A.8), to obtain (A.6) it now suffices to show that

D^{-1/4}(8c_{1}\epsilon_{k}^{1/2})^{-D/8r}\geq 2\epsilon_{k}^{-\gamma D/r}.

It turns out that there exists $\overline{\epsilon}>0$ depending only on $r$ , such that

(8^{2}c_{1}^{2}\epsilon_{k})^{-D/16r}\geq 2D^{1/4}\epsilon_{k}^{-\gamma D/r}=(2^{-r/\gamma D}D^{-r/(4\gamma D)}\epsilon_{k})^{-\gamma D/r},

for all $\epsilon_{k}\leq\overline{\epsilon}$ , and where $\gamma=1/32$ . In fact, upon raising both sides to the exponent $-32r/D$ , it is straightforward to see that this inequality holds independently of $D\in\mathbb{N}$ , provided that

\epsilon_{k}\leq\frac{\inf_{D}D^{-r/(4\gamma D)}}{[2^{8r}8c_{1}(r)]^{4}},

where we note that $\inf_{D}D^{-r/(4\gamma D)}>0$ on account of the fact that $\lim_{D\to\infty}D^{-1/D}=1$ . Define $\overline{\epsilon}$ by

\overline{\epsilon}(r):=\min{\left\{k\in\mathbb{N}\,\middle|\,\epsilon_{k}\leq\frac{\inf_{D}D^{-r/(4\gamma D)}}{[2^{8r}8c_{1}(r)]^{4}}\right\}}.

With this choice of $\overline{\epsilon},\gamma>0$ , and by construction of $f_{k}$ , we then have

(A.9)

\displaystyle\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right)\geq\epsilon_{k}^{-\gamma D/r},

for all $\epsilon_{k}\leq\overline{\epsilon}$ . This is inequality (A.6), and concludes our discussion of Case 2.

Continuing the above construction by recursion, we obtain a sequence $f_{1},f_{2},\dots\in\mathring{F}_{D,r}$ , such that (A.9) holds for any $k\in\mathbb{N}$ such that $\epsilon_{k}\leq\overline{\epsilon}$ . Define $f=\sum_{k=1}^{\infty}a_{k}f_{k}$ . We claim that for any $\epsilon\leq\overline{\epsilon}$ we have

\mathcal{N}(f,\epsilon)\geq\epsilon^{-\gamma D/2r}.

To see this, we fix $\epsilon\leq\overline{\epsilon}$ . Choose $k\in\mathbb{N}$ , such that $\epsilon_{k}\leq\overline{\epsilon}$ and $\epsilon_{k}^{2}\leq\epsilon\leq\epsilon_{k}$ . Then,

	$\displaystyle\mathcal{N}(f,\epsilon)$	$\displaystyle\geq\mathcal{N}(f,\epsilon_{k})=\mathcal{N}\left(\sum_{s=1}^{\infty}a_{s}f_{s},\epsilon_{k}\right)$
		$\displaystyle\overset{\mathclap{\underset{\downarrow}{(\ref{eq:N}b)}}}{\geq}\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},\epsilon_{k}+\left\\|\sum_{s=k+1}^{\infty}a_{s}f_{s}\right\\|_{L^{\infty}}\right)$
		$\displaystyle\geq\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},\epsilon_{k}+\sum_{s=k+1}^{\infty}a_{s}\right)$
		$\displaystyle\overset{\mathclap{\underset{\downarrow}{(\ref{eq:atail})}}}{\geq}\mathcal{N}\left(\sum_{s=1}^{k}a_{s}f_{s},2\epsilon_{k}\right)$
		$\displaystyle\overset{\mathclap{\underset{\downarrow}{\eqref{eq:fkc1}}}}{\geq}\epsilon_{k}^{-\gamma D/r}\geq\epsilon^{-\gamma D/2r},$

where the last inequality follows from $\epsilon_{k}\leq\epsilon^{1/2}$ . The claim of Proposition A.3 thus follows for all $\epsilon\leq\overline{\epsilon}$ , upon redefining the universal constant as $\gamma=(1/32)/2=1/64$ . ∎

A.2. Proof of Lemma 2.7

Proof.

(Lemma 2.7) Since the interior of $\Omega$ is non-empty, then upon a rescaling and shift of the domain, we may wlog assume that $[0,2\pi]^{d}\subset\Omega$ . Let us define $e_{\kappa}\propto\sin(\kappa\cdot x)$ as a suitable re-normalization of the standard Fourier sine-basis, indexed by $\kappa\in\mathbb{N}^{d}$ , and normalized such that $\|e_{\kappa}\|_{C^{s}}=1$ . We note that for each $e_{\kappa}$ we can define a bi-orthogonal functional $e^{\ast}_{\kappa}$ by

e^{\ast}_{\kappa}(u):=\frac{2}{(2\pi)^{d}}\int_{[0,2\pi]^{d}}u(x)\sin(\kappa\cdot x)\,dx.

The norm of $e^{\ast}_{\kappa}$ is easily seen to be bounded by $2$ . Hence, for any enumeration $j\mapsto\kappa(j)$ , the sequence $e_{\kappa(j)}$ satisfies the assumptions in the definition of a infinite-dimensional hypercube.

Furthermore, if $j\mapsto\kappa(j)$ is an enumeration of $\kappa\in\mathbb{N}^{d}$ , such that $j\mapsto\|\kappa(j)\|_{\ell^{\infty}}$ is monotonically increasing (non-decreasing), we note that any series of the form

u=A\sum_{j=1}^{\infty}j^{-\alpha}y_{j}e_{\kappa(j)},\quad y_{j}\in[0,1],

is absolutely convergent in $C^{\rho}(\Omega)$ , provided that

\sum_{j=1}^{\infty}j^{-\alpha}\|e_{\kappa(j)}\|_{C^{\rho}(\Omega)}\sim\sum_{j=1}^{\infty}j^{-\alpha}\|{\kappa(j)}\|_{\ell^{\infty}}^{\rho-s}<\infty.

Inverting the enumeration $j=j(\kappa)$ for $\kappa\in\mathbb{N}^{d}$ , and setting $K:=\|\kappa\|_{\ell^{\infty}}$ , we find that

\displaystyle\#{\left\{\kappa^{\prime}\in\mathbb{N}^{d}\,\middle|\,\|\kappa^{\prime}\|_{\ell^{\infty}}<K\right\}}\leq j(\kappa)\leq\#{\left\{\kappa^{\prime}\in\mathbb{N}^{d}\,\middle|\,\|\kappa^{\prime}\|_{\ell^{\infty}}\leq K\right\}},

where the number of elements in the lower and upper bounding sets are $\sim K^{d}$ . We thus conclude that $j(\kappa)\sim K^{d}$ . We also note that each shell ${\left\{\kappa\in\mathbb{N}^{d}\,\middle|\,\|\kappa\|_{\ell^{\infty}}=K\right\}}$ , with $K\in\mathbb{N}$ , contains $\sim K^{d-1}$ elements. Thus, we have

	$\displaystyle\sum_{j=1}^{\infty}j^{-\alpha}\\|{\kappa(j)}\\|_{\ell^{\infty}}^{\rho-s}$	$\displaystyle\sim\sum_{K=1}^{\infty}\sum_{\\|\kappa\\|_{\ell^{\infty}}=K}j(\kappa)^{-\alpha}\\|{\kappa}\\|_{\ell^{\infty}}^{\rho-s}$
		$\displaystyle\sim\sum_{K=1}^{\infty}K^{-\alpha d}\sum_{\\|\kappa\\|_{\ell^{\infty}}=K}\\|{\kappa}\\|_{\ell^{\infty}}^{\rho-s}$
		$\displaystyle\sim\sum_{K=1}^{\infty}K^{-\alpha d}K^{d-1}K^{\rho-s}$
		$\displaystyle=\sum_{K=1}^{\infty}K^{-1-d\left(\alpha-1-\frac{\rho-s}{d}\right)}.$

The last sum converges if $\alpha>1-\frac{\rho-s}{d}$ . Thus, choosing $A>0$ sufficiently small then ensures that $Q_{\alpha}=Q_{\alpha}(A;e_{1},e_{2},\dots)$ is a subset of $K={\left\{u\in C^{\rho}(\Omega)\,\middle|\,\|u\|_{C^{\rho}}\leq M\right\}}$ . ∎

A.3. Proof of Theorem 2.11

We now provide the detailed proof of Theorem 2.11, which builds on Proposition A.3 above.

Proof.

Fix $\delta>0$ . By assumption on $K\subset\mathcal{X}$ , there exists $A>0$ and a linearly independent sequence $e_{1},e_{2},\dots\in\mathcal{X}$ of normed elements, such that the set $Q_{\alpha}$ consisting of all $u\in\mathcal{X}$ of the form

u=A\sum_{j=1}^{\infty}j^{-\alpha}y_{j}e_{j},\quad y_{j}\in[0,1],

defines a subset $Q_{\alpha}\subset K$ .

Step 1: (Finding embedded cubes $\simeq[0,1]^{D}$ ) We note that for any $k\in\mathbb{N}$ and for any choice of $y_{j}\in[0,1]$ , with indices $j=2^{k},\dots,2^{k+1}-1$ , we have

(A.10)

\displaystyle u=A2^{-(k+1)\alpha}\sum_{j=2^{k}}^{2^{k+1}-1}y_{j}e_{j}\in Q_{\alpha}.

Set $D=2^{k}$ , and let us denote the set of all such $u$ by $\overline{Q}_{D}$ , in the following. Note that up to the constant rescaling by $A2^{-(k+1)\alpha}$ , $\overline{Q}_{D}$ can be identified with the $D$ -dimensional unit cube $[0,1]^{D}$ . In particular, since $\overline{Q}_{D}\subset Q_{\alpha}\subset K$ , it follows that $K$ contains a rescaled copy of $[0,1]^{D}$ for any such $D$ . Furthermore, it will be important in our construction that all of the embedded cubes, defined by (A.10) for $k\in\mathbb{N}$ , only intersect at the origin.

By Proposition A.3, there exist constants $\gamma,\overline{\epsilon}>0$ , independent of $D$ , such that given any $D=2^{k}$ , we can find $f_{D}:\mathbb{R}^{D}\to\mathbb{R}$ , $f_{D}\in\mathring{F}_{D,r}$ , for which the following lower complexity bound holds:

(A.11)

\displaystyle\mathcal{N}(f_{D},\epsilon)\geq\epsilon^{-\gamma D/r},\quad\forall\epsilon\leq\overline{\epsilon}.

Our aim is to construct $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R}$ , such that the restriction $\mathcal{S}^{\dagger}|_{\overline{Q}_{D}}$ to the cube $\overline{Q}_{D}\simeq[0,1]^{D}$ “reproduces” this $f_{D}$ . If this can be achieved, then our intuition is that $\mathcal{S}^{\dagger}$ embeds all $f_{D}$ for $D=2^{k}$ , $k\in\mathbb{N}$ , at once, and hence a rescaled version of the lower complexity bound $\gtrsim_{D}\epsilon^{-\gamma D/r}$ should hold for any $D$ . Our next aim is to make this precise, and determine the implicit constant that arises due to the fact that $\overline{Q}_{D}$ is only a rescaled version of $[0,1]^{D}$ .

Step 2: (Construction of $\mathcal{S}^{\dagger}$ ) To construct suitable $\mathcal{S}^{\dagger}$ , we first recall that we assume the existence of “bi-orthogonal” elements $e^{\ast}_{1},e^{\ast}_{2},\dots$ in the continuous dual space $\mathcal{X}^{\ast}$ , such that

e^{\ast}_{i}(e_{j})=\delta_{ij},\;\forall\,i,j\in\mathbb{N},

and furthermore, there exists a constant $M>0$ , such that $\|e^{\ast}_{j}\|_{\mathcal{X}^{\ast}}\leq M$ for all $j\in\mathbb{N}$ . Given the functions $f_{D}=f_{2^{k}}$ from Step 1, we now make the following ansatz for $\mathcal{S}^{\dagger}$ :

(A.12)

\displaystyle\mathcal{S}^{\dagger}(u)=\sum_{k=1}^{\infty}2^{-\alpha^{\ast}rk}f_{2^{k}}\left(A^{-1}2^{(k+1)\alpha}\left[e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u)\right]\right),

Here $f_{2^{k}}=f_{D}$ (for $D=2^{k}$ ) satisfies (A.11) and $\alpha^{\ast}=1+\alpha+\delta/2$ . We note in passing that $\mathcal{S}^{\dagger}$ defines a $r$ -times Fréchet differentiable functional. This will be rigorously shown in Lemma A.7 below (with $c=A^{-1}2^{\alpha}$ ).

Step 3: (Relating $\mathcal{S}^{\dagger}$ with $f_{D}$ ) We next show in which way “the restriction $\mathcal{S}^{\dagger}|_{\overline{Q}_{D}}$ to the cube $\overline{Q}_{D}\simeq[0,1]^{D}$ reproduces $f_{D}$ ”. Note that if $u\in\overline{Q}_{D}$ is of the form (A.10) with $D=2^{k}$ and with coefficients

y:=[y_{2^{k}},\dots,y_{2^{k+1}-1}]\in[0,1]^{2^{k}}=[0,1]^{D},

then, if $k^{\prime}\neq k$ ,

[e^{\ast}_{2^{k^{\prime}}}(u),\dots,e^{\ast}_{2^{k^{\prime}+1}-1}(u)]=[0,\dots,0],

while for $k^{\prime}=k$ we find

[e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u)]=[y_{2^{k}},\dots,y_{2^{k+1}-1}]=y.

From the fact that $f_{2^{k^{\prime}}}(0)=0$ for all $k^{\prime}$ by construction (recall that $f_{2^{k^{\prime}}}\in\mathring{F}_{2^{k^{\prime}},r}$ ), we conclude that

(A.13)

\displaystyle\mathcal{S}^{\dagger}(u)=2^{-\alpha^{\ast}rk}f_{2^{k}}(y)=D^{-\alpha^{\ast}r}f_{D}(y),

for any $u\in\overline{Q}_{D}$ . In this sense, “ $\mathcal{S}^{\dagger}|_{\overline{Q}_{D}}$ reproduces $f_{D}$ ”.

Step 4: (Lower complexity bound, uniform in $D$ ) Let $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathbb{R}$ be a family of operators of neural network-type, such that

\sup_{u\in K}|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)|\leq\epsilon,\quad\forall\,\epsilon>0.

By assumption on $\mathcal{S}_{\epsilon}$ being of neural network-type, there exists $\ell=\ell_{\epsilon}\in\mathbb{N}$ , a linear mapping $\mathcal{L}_{\epsilon}:\mathcal{X}\to\mathbb{R}^{\ell}$ , and $\Phi_{\epsilon}:\mathbb{R}^{\ell}\to\mathbb{R}$ a neural network representing $\mathcal{S}_{\epsilon}$ :

\mathcal{S}_{\epsilon}(u)=\Phi_{\epsilon}(\mathcal{L}_{\epsilon}u),\quad\forall\,u\in\mathcal{X}.

For $D=2^{k}$ , $k\in\mathbb{N}$ , define $\iota_{D}:\mathbb{R}^{D}\to\mathcal{X}$ by

\iota_{D}(y)=A2^{-\alpha k}\sum_{j=1}^{2^{k}}y_{j}e_{2^{k}+j-1}.

Then, since $\mathcal{L}_{\epsilon}\circ\iota_{D}:\mathbb{R}^{D}\to\mathbb{R}^{\ell}$ is a linear mapping, there exists a matrix $W_{D}\in\mathbb{R}^{\ell\times D}$ , such that $\mathcal{L}_{\epsilon}\circ\iota_{D}(y)=W_{D}y$ for all $y\in\mathbb{R}^{D}$ . In particular, it follows that

\mathcal{S}_{\epsilon}(\iota_{D}(y))=\Phi_{\epsilon}(W_{d}y).

By (A.10), we clearly have $\iota_{D}([0,1]^{D})=\overline{Q}_{D}$ . Let $\widehat{\Phi}_{\epsilon}(y):=D^{\alpha^{\ast}r}\Phi_{\epsilon}(W_{D}y)$ . It now follows that

	$\displaystyle\sup_{y\in[0,1]^{D}}\left\|f_{D}(y)-\widehat{\Phi}_{\epsilon}(y)\right\|$	$\displaystyle=D^{\alpha^{\ast}r}\sup_{y\in[0,1]^{D}}\left\|D^{-\alpha^{\ast}r}f_{D}(y)-\Phi_{\epsilon}(W_{D}y)\right\|$
		$\displaystyle=D^{\alpha^{\ast}r}\sup_{u\in\overline{Q}_{D}}\left\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\right\|$
		$\displaystyle\leq D^{\alpha^{\ast}r}\sup_{u\in K}\left\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\right\|$
		$\displaystyle\leq D^{\alpha^{\ast}r}\epsilon.$

Let $M$ denote the number of hidden computation units of $\widehat{\Phi}_{\epsilon}$ . By construction of $f_{D}$ (cp. Proposition A.3), we have

\mathrm{size}(\widehat{\Phi}_{\epsilon})\geq M\geq\mathcal{N}(f_{D},D^{\alpha^{\ast}r}\epsilon)\geq(D^{\alpha^{\ast}r}\epsilon)^{-\gamma D/r},

whenever $D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}$ . On the other hand, we can also estimate (cp. equation (2.2) in Section 2.1.2),

\mathrm{size}(\widehat{\Phi}_{\epsilon})\leq 2\|W_{D}\|_{0}+2\,\mathrm{size}(\Phi_{\epsilon})\leq 2D\mathrm{size}(\Phi_{\epsilon})+2\,\mathrm{size}(\Phi_{\epsilon})\leq 4D\mathrm{size}(\Phi_{\epsilon}).

Combining these bounds, we conclude that

\mathrm{size}(\Phi_{\epsilon})\geq\frac{1}{2D}(D^{\alpha^{\ast}r}\epsilon)^{-\gamma D/r}=\frac{1}{2}(D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r},

holds for any neural network representation of $\mathcal{S}_{\epsilon}$ , whenever $D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}$ . And hence

(A.14)

\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\frac{1}{2}(D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r},

whenever $D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}$ . By Lemma A.6 below, taking the supremum on the right over all admissible $D$ implies the lower bound

\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha^{\ast}+\delta/2)r}),\quad\forall\epsilon\leq\widetilde{\epsilon},

where $\widetilde{\epsilon},c>0$ depend only on $\alpha^{\ast},r,\overline{\epsilon}(r),\delta$ and $\gamma$ . Recalling our choice of $\alpha^{\ast}=1+\alpha+\delta/2$ , and the fact that the constant $\overline{\epsilon}=\overline{\epsilon}(r)$ depends only on $r$ , while $\gamma$ is universal, it follows that

\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}),\quad\forall\epsilon\leq\widetilde{\epsilon}.

with $\widetilde{\epsilon},c>0$ depending only on $\alpha,r,\delta$ . Up to a change in notation, this is the claimed complexity bound. ∎

The following lemma addresses the optimization of the lower bound in (A.14):

Lemma A.6.

Let $r\in\mathbb{N}$ , and $\alpha^{\ast},\overline{\epsilon},\gamma>0$ be given. Assume that

\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\frac{1}{2}\left(D^{\alpha^{\ast}r+r/\gamma D}\epsilon\right)^{-\gamma D/r},

for any $D$ of the form $D=2^{k}$ , $k\in\mathbb{N}$ , and whenever $D^{\alpha^{\ast}r}\epsilon\leq\overline{\epsilon}$ . Fix a small parameter $\delta>0$ . There exist $\widetilde{\epsilon},c>0$ , depending only on $r,\alpha^{\ast},\gamma,\overline{\epsilon},\delta$ , such that

\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp\left(c\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right),\quad\forall\epsilon\leq\widetilde{\epsilon},

Proof.

Write $\overline{\epsilon}=e^{-\beta}$ for $\beta\in\mathbb{R}$ . Fix a small parameter $\delta>0$ . Since we restrict attention to $\epsilon\leq\widetilde{\epsilon}$ , we have

\left(e^{\beta}\epsilon\right)^{-1/(\alpha^{\ast}+\delta)r}\geq\left(e^{\beta}\widetilde{\epsilon}\right)^{-1/(\alpha^{\ast}+\delta)r}=1,

provided that $\widetilde{\epsilon}\leq\overline{\epsilon}$ . Given $\epsilon\leq\widetilde{\epsilon}$ , choose $k\in\mathbb{N}$ , such that

2^{k-1}<\left(e^{\beta}\epsilon\right)^{-1/(\alpha^{\ast}+\delta)r}\leq 2^{k}.

Let $D=2^{k}$ . Note that this defines a function $D=D(\epsilon)$ . For any $\epsilon$ , we can write

(A.15)

\displaystyle D(\epsilon)=\xi\,\left(e^{\beta}\epsilon\right)^{-1/(\alpha^{\ast}+\delta)r},

for some $\xi\in(1/2,1]$ . We now note that for $\epsilon\leq\widetilde{\epsilon}$ ,

\frac{1}{\gamma D}=\frac{\left(e^{\beta}\epsilon\right)^{1/(\alpha^{\ast}+\delta)r}}{\gamma\xi}\leq\frac{2\left(e^{\beta}\widetilde{\epsilon}\right)^{1/(\alpha^{\ast}+\delta)r}}{\gamma}.

Decreasing the size of $\widetilde{\epsilon}=\widetilde{\epsilon}(r,\gamma,\alpha^{\ast},\overline{\epsilon},\delta)$ further, we can ensure that for $\epsilon\leq\widetilde{\epsilon}$ ,

\frac{1}{\gamma D(\epsilon)}\leq\frac{2\left(e^{\beta}\widetilde{\epsilon}\right)^{1/(\alpha^{\ast}+\delta)r}}{\gamma}\leq\frac{\delta}{2}.

Note also that $2^{r/\gamma D}\leq 2^{\delta r/2}\leq D^{\delta r/2}$ for any $D$ of the form $D=2^{k}$ , $k\in\mathbb{N}$ . It thus follows that for any given $\epsilon\leq\widetilde{\epsilon}$ , and with our particular choice of $D=D(\epsilon)$ satisfying (A.15), we have

2^{r/\gamma D}D^{\alpha^{\ast}r+r/\gamma D}\epsilon\leq D^{(\alpha^{\ast}+\delta)r}\epsilon=e^{-\beta}\xi^{(\alpha^{\ast}+\delta)r}\leq e^{-\beta}.

Note that this in particular implies that $D^{\alpha^{\ast}r}\epsilon\leq e^{-\beta}=\overline{\epsilon}$ . We conclude that

	$\displaystyle\mathrm{cmplx}(\mathcal{S}_{\epsilon})$	$\displaystyle\geq\frac{1}{2}(D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r}$
		$\displaystyle=(2^{r/\gamma D}D^{\alpha^{\ast}r+r/\gamma D}\epsilon)^{-\gamma D/r}$
		$\displaystyle\geq(D^{(\alpha^{\ast}+\delta)r}\epsilon)^{-\gamma D/r}$
		$\displaystyle\geq\exp\left(\frac{\beta\gamma D}{r}\right)$
		$\displaystyle=\exp\left(\frac{\beta\gamma\xi e^{-1/(\alpha^{\ast}+\delta)r}}{r}\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right)$
		$\displaystyle\geq\exp\left(\frac{\beta\gamma e^{-1/(\alpha^{\ast}+\delta)r}}{2r}\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right).$

Upon defining $c=c(r,\gamma,\alpha^{\ast},\overline{\epsilon},\delta)$ as

c:=\frac{\beta\gamma e^{-1/(\alpha^{\ast}+\delta)r}}{2r},

we obtain the lower bound

\mathrm{cmplx}(\mathcal{S}_{\epsilon})\geq\exp\left(c\epsilon^{-1/(\alpha^{\ast}+\delta)r}\right),\quad\forall\,\epsilon\leq\overline{\epsilon}.

This concludes the proof of Lemma A.6. ∎

In the lemma below, we provide a simple result on Fréchet differentiability which was used in our proof of Proposition 2.11:

Lemma A.7.

(Fréchet differentiability of a series) Assume that we are given a bounded family of functions $f_{D}\in C^{r}(\mathbb{R}^{D})$ indexed by integers $D=2^{k}$ , $k\in\mathbb{N}$ , such that $\|f_{D}\|_{C^{r}(\mathbb{R}^{D})}\leq 1$ for all $D$ . Let $e^{\ast}_{1},e^{\ast}_{2},\dots:\mathcal{X}\to\mathbb{R}$ be a sequence of linear functionals, such that $\|e^{\ast}_{j}\|_{\mathcal{X}^{\ast}}\leq M$ for all $j\in\mathbb{N}$ . Let $c,\alpha>0$ be given, and assume that $\alpha^{\ast}>1+\alpha$ . Then, the functional $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R}$ defined by the series

\mathcal{S}^{\dagger}(u):=\sum_{k=1}^{\infty}2^{-\alpha^{\ast}rk}f_{2^{k}}\left(c2^{\alpha k}(e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u))\right),

is $r$ -times Fréchet differentiable.

Proof.

By assumption on $f_{2^{k}}$ and the linearity of the functionals $e^{\ast}_{j}$ , each nonlinear functional $\mathcal{F}_{k}(u):=f_{2^{k}}\left(c2^{\alpha k}(e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u))\right)$ in the series defining $\mathcal{S}^{\dagger}$ is $r$ -times continuously differentiable. Fixing $u\in\mathcal{X}$ , let us denote $x=c2^{\alpha k}(e^{\ast}_{2^{k}}(u),\dots,e^{\ast}_{2^{k+1}-1}(u))$ . The $\ell$ -th total derivative $d^{\ell}\mathcal{F}_{k}$ of $\mathcal{F}_{k}$ ( $\ell\leq r$ ) is given by

d^{\ell}\mathcal{F}_{k}(u)[v_{1},\dots,v_{\ell}]=c^{\ell}2^{\alpha\ell k}\sum_{j_{1},\dots,j_{\ell}=1}^{2^{k}}\frac{\partial^{\ell}f_{2^{k}}(x)}{\partial x_{j_{1}}\dots\partial x_{j_{\ell}}}\prod_{s=1}^{\ell}e^{\ast}_{2^{k}+j_{s}-1}(v_{s}).

By assumption, we have

\left|\frac{\partial^{\ell}f_{2^{k}}(x)}{\partial x_{j_{1}}\dots\partial x_{j_{\ell}}}\right|\leq 1.

Since the sum over $j_{1},\dots,j_{\ell}$ has $2^{k\ell}$ terms, and since the functionals are bounded $\|e^{\ast}_{j}\|\leq M$ by assumption, we can now readily estimate the operator norm $\|d^{\ell}\mathcal{F}_{k}(u)\|$ for $\ell\leq r$ by

\|d^{\ell}\mathcal{F}_{k}(u)\|\leq c^{\ell}2^{\alpha\ell k}2^{k\ell}M^{\ell}\leq(cM)^{r}2^{(\alpha+1)rk}.

In particular, for any $\ell\leq r$ , the series

\sum_{k=1}^{\infty}2^{-\alpha^{\ast}rk}\|d^{\ell}\mathcal{F}_{k}(u)\|\leq\sum_{k=1}^{\infty}2^{-[\alpha^{\ast}-(\alpha+1)]rk}(cM)^{r}<\infty,

is uniformly convergent. Thus, $\mathcal{S}^{\dagger}$ is a uniform limit of $r$ -times continuously differentiable functionals, all of whose derivatives of order $\ell\leq r$ are also uniformly convergent. From this, it follows that $\mathcal{S}^{\dagger}$ is itself $r$ -times continuously Fréchet differentiable. ∎

A.4. Proof of Corollary 2.12

Let $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ be a function space with continuous embedding in $C(\Omega;\mathbb{R}^{p})$ . We will only consider the case $p=1$ , the case $p>1$ following by an almost identical argument; Let $\phi\neq 0\in\mathcal{Y}$ be a non-trivial function. Since $\mathcal{Y}=\mathcal{Y}(\Omega)$ is continuously embedded in $C(\Omega)$ , it follows that point evaluation ${\mathrm{ev}}_{y}(\phi)=\phi(y)$ is continuous. Given that $\phi$ is non-trivial, there exists $y\in\Omega$ , such that ${\mathrm{ev}}_{y}(\phi)\neq 0$ . We may wlog assume that ${\mathrm{ev}}_{y}(\phi)=\phi(y)=1$ . Let $\mathcal{F}^{\dagger}:\mathcal{X}\to\mathbb{R}$ be a functional exposing the curse of parametric complexity, as constructed in Theorem 2.11. We define an $r$ -times Fréchet differentiable operator $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathcal{Y}$ by $\mathcal{S}^{\dagger}(u):=\mathcal{F}^{\dagger}(u)\phi$ .

The claim now follows immediately by observing that

{\mathrm{ev}}_{y}\circ\mathcal{S}^{\dagger}(u)=\mathcal{F}^{\dagger}(u),\quad\forall\,u\in\mathcal{X},

and by noting that if $\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathcal{Y}$ is an operator of neural network-type, then ${\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}:\mathcal{X}\to\mathbb{R}$ is a functional of neural network-type, and by assumption, with $C:=\|{\mathrm{ev}}_{y}\|_{\mathcal{Y}\to\mathbb{R}}$ ,

	$\displaystyle\sup_{u\in K}\|\mathcal{F}^{\dagger}(u)-{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}(u)\|$	$\displaystyle=\sup_{u\in K}\|{\mathrm{ev}}_{y}\circ\mathcal{S}^{\dagger}(u)-{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}(u)\|$
		$\displaystyle\leq\sup_{u\in K}C\\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\\|_{\mathcal{Y}}$
		$\displaystyle\leq C\epsilon.$

By our choice of $\mathcal{F}^{\dagger}:\mathcal{X}\to\mathbb{R}$ , this implies that the complexity of ${\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}$ is lower bounded by an exponential bound $\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r})$ for some constant $c=c(\alpha,\delta,r)$ . This in turn implies that

\mathrm{cmplx}(\mathcal{S}_{\epsilon})=\sup_{y\in\Omega}\mathrm{cmplx}({\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon})\geq\exp(c\epsilon^{-1/(\alpha+1+\delta)r}).

This lower bound implies the exponential lower bound of Corollary 2.12.

A.5. Proofs of Lemmas 2.18 – 2.7

A.5.1. Proof of Lemma 2.18

Proof.

(Lemma 2.18) We want to show that a PCA-Net neural operator $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$ is of neural network-type, and aim to estimate $\mathrm{size}(\Psi)$ in terms of $\mathrm{cmplx}(\mathcal{S})$ . To this end, we assume that $\mathcal{Y}=\mathcal{Y}(\Omega;\mathbb{R}^{p})$ is a Hilbert space of functions. Since $\mathcal{R}$ is by definition linear, then given an evaluation point $y\in\Omega$ , the mapping $\beta\mapsto\mathcal{R}(\beta)(y)\equiv{\mathrm{ev}}_{y}\circ\mathcal{R}(\beta)$ defines a linear map ${\mathrm{ev}}_{y}\circ\mathcal{R}:\mathbb{R}^{{D_{\mathcal{Y}}}}\to\mathbb{R}^{p}$ . We can represent ${\mathrm{ev}}_{y}\circ\mathcal{R}$ by matrix multiplication: ${\mathrm{ev}}_{y}\circ\mathcal{R}(\beta)=V_{y}\beta$ , with $V_{y}\in\mathbb{R}^{p\times{D_{\mathcal{Y}}}}$ . The encoder $\mathcal{E}:\mathcal{X}\to\mathbb{R}^{{D_{\mathcal{X}}}}$ is linear by definition, thus we can take $\mathcal{L}:=\mathcal{E}$ for the linear map in the definition of “operator of neural network-type”. Define a neural network $\Phi_{y}:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}^{p}$ by $\Phi_{y}(\alpha)=V_{y}\Psi(\alpha)$ . Then, we have the identity

\displaystyle{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=({\mathrm{ev}}_{y}\circ\mathcal{R})\circ\Psi(\mathcal{E}u)=\Phi_{y}(\mathcal{L}u),

for all $u\in\mathcal{X}$ . This shows that $\mathcal{S}$ is of neural network-type. We now aim to estimate $\mathrm{cmplx}(\mathcal{S})$ in terms of $\mathrm{size}(\Psi)$ . To this end, write $\Psi(\alpha)=[\Psi_{1}(\alpha),\dots,\Psi_{{D_{\mathcal{Y}}}}(\alpha)]$ with component mappings $\Psi_{j}:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}$ . Let $\mathcal{J}={\left\{j\,\middle|\,j\in\{1,\dots,{D_{\mathcal{Y}}}\},\;\Psi_{j}\neq 0\right\}}$ be the subset of indices for which $\Psi_{j}:\mathbb{R}^{{D_{\mathcal{X}}}}\to\mathbb{R}$ is not the zero function. Define a (sparsified) matrix $\widehat{V}_{y}\in\mathbb{R}^{p\times{D_{\mathcal{Y}}}}$ with $j$ -th column $[\widehat{V}_{y}]_{:,j}$ defined by

\displaystyle[\widehat{V}_{y}]_{:,j}:=\begin{cases}[{V}_{y}]_{:,j},&j\in\mathcal{J},\\ 0&j\notin\mathcal{J}.\end{cases}

Then, we have $\|\widehat{V}_{y}\|_{0}\leq p|\mathcal{J}|\leq p\,\mathrm{size}(\Psi)$ , and identity $\Phi_{y}(\alpha)=\widehat{V}_{y}\Psi(\alpha)$ for all $\alpha\in\mathbb{R}^{{D_{\mathcal{X}}}}$ . Thus, using the concept of sparse concatenation (2.3), we can upper bound the complexity, $\mathrm{cmplx}(\mathcal{S})$ , in terms of the $\mathrm{size}(\Psi)$ of the neural network $\Psi$ as follows:

\mathrm{cmplx}(\mathcal{S})\leq\sup_{y}\mathrm{size}(\Phi_{y})\leq\sup_{y}2\|V_{y}\|_{0}+2\mathrm{size}(\Psi)\leq 2(p+1)\mathrm{size}(\Psi).

This is the claimed lower bound on $\mathrm{size}(\Psi)$ . ∎

A.5.2. Proof of Lemma 2.20

Proof.

(Lemma 2.20) We observe that with $D:={D_{\mathcal{X}}}$ , for any $y\in\Omega$ the encoder $\mathcal{L}:=\mathcal{E}:\mathcal{X}\to\mathbb{R}^{D}$ is linear, and

{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),\quad\forall\,u\in\mathcal{X},

where $\Phi_{y}(\alpha)=\sum_{j=1}^{{D_{\mathcal{Y}}}}\Psi_{j}(\alpha)\phi_{j}(y)$ defines a neural network, $\Phi_{y}:\mathbb{R}^{D}\to\mathbb{R}^{p}$ . Thus, DeepONet $\mathcal{S}=\mathcal{R}\circ\Psi\circ\mathcal{E}$ is of neural network-type. To estimate the complexity, $\mathrm{cmplx}(\mathcal{S})$ , we let $\mathcal{J}^{2}$ be the set of indices $(i,j)\in\{1,\dots,{D_{\mathcal{Y}}}\}^{2}$ , such that the $i$ -th component, $[\phi_{j}(y)]_{i}$ , of the vector $\phi_{j}(y)\in\mathbb{R}^{p}$ is non-zero. Let $\widehat{V}_{y}\in\mathbb{R}^{p\times{D_{\mathcal{Y}}}}$ be the matrix with entries $[\widehat{V}_{y}]_{i,j}=[\phi_{j}(y)]_{i}$ , so that $\Phi_{y}(\alpha)=\widehat{V}_{y}\Psi(\alpha)$ for all $\alpha$ . Note that $\widehat{V}_{y}$ has precisely $|\mathcal{J}^{2}|$ non-zero entries, and that $|\mathcal{J}^{2}|\leq\mathrm{size}(\phi)$ , since $[\phi_{j}(y)]_{i}\neq 0$ is non-zero for all $(i,j)\in\mathcal{J}^{2}$ . Thus, it follows that

\mathrm{cmplx}(\mathcal{S})\leq\sup_{y}\mathrm{size}(\Phi_{y})\leq 2|\mathcal{J}^{2}|+2\,\mathrm{size}(\Psi)\leq 2(\mathrm{size}(\phi)+\mathrm{size}(\Psi)).

∎

A.5.3. Proof of Lemma 2.22

Proof.

(Lemma 2.22) To see the complexity bound, we recall that for any $y\in\Omega$ , we can choose $\Phi_{y}({\,\cdot\,}):=Q(\Psi({\,\cdot\,}),y)$ to obtain the representation

{\mathrm{ev}}_{y}\circ\mathcal{S}(u)=\Phi_{y}(\mathcal{L}u),

where $\mathcal{L}u\equiv\mathcal{E}(u)$ is given by the linear encoder. The composition of two neural networks $Q({\,\cdot\,},y)$ and $\Psi({\,\cdot\,})$ can be represented by a neural network of size at most $2\mathrm{size}(\Psi)+2\mathrm{size}(Q({\,\cdot\,},y))\leq 2\mathrm{size}(\Psi)+2\mathrm{size}(Q)$ . We thus have the lower bound,

\mathrm{cmplx}(\mathcal{S})\leq\sup_{y}\mathrm{size}(\Phi_{y})\leq 2\mathrm{size}(Q)+2\mathrm{size}(\Psi).

This shows the claim. ∎

A.6. Proof of Lemma 2.26

Proof.

Our aim is to show that $\mathcal{S}:L^{2}(\mathbb{T};\mathbb{R})\to\mathbb{R}$ ,

\mathcal{S}(u):=\int_{\mathbb{T}}\sigma(u(x))\,dx,

is not of neural network-type. We argue by contradiction. Suppose that $\mathcal{S}$ was of neural network type. By definition, there exists a linear mapping $\mathcal{L}:\mathcal{L}^{2}(\mathbb{T};\mathbb{R})\to\mathbb{R}^{\ell}$ , and a neural network $\Phi:\mathbb{R}^{\ell}\to\mathbb{R}$ , for some $\ell\in\mathbb{N}$ such that

(A.16)

\displaystyle\mathcal{S}(u)=\Phi(\mathcal{L}u).

In the following we will consider $\varphi_{j}(x):=\sin(jx)$ for $j\in\mathbb{N}$ . Since $\sigma(t)\geq 0$ for all $t\in\mathbb{R}$ , and $\sigma(t)>0$ for $t>0$ , we have

(A.17)

\displaystyle\mathcal{S}(u)=\int_{\Omega}\sigma(u(x))\,dx=0\quad\iff\quad u(x)\leq 0,\quad\forall x\in[0,2\pi].

Now, fix any $D>\ell$ , and consider $\iota:\mathbb{R}^{D}\to L^{2}(\mathbb{T};\mathbb{R})$ , $\iota\beta:=\sum_{j=1}^{D}\beta_{j}\sin(jx)$ . Since $\iota$ and $\mathcal{L}$ are linear mappings, it follows that

\mathcal{L}\circ\iota:\mathbb{R}^{D}\to\mathbb{R}^{\ell},\quad\beta\mapsto\mathcal{L}\iota\beta,

is a linear mapping. Represent this linear mapping by a matrix $W\in\mathbb{R}^{\ell\times D}$ . In particular, by (A.16), we have

(A.18)

\displaystyle\mathcal{S}(\iota\beta)=\Phi(W\beta),\quad\forall\beta\in\mathbb{R}^{D}.

Since $D>\ell$ , it follows that $\ker(W)\neq\{0\}$ is nontrivial. Let $\beta\neq 0$ be an element in the kernel, and consider $u_{\beta}(x):=\iota\beta(x)=\sum_{j=1}^{D}\beta_{j}\sin(jx)$ . Since $u_{\beta}(x)$ is not identically equal to $0$ , either $u_{\beta}(x)$ or $-u_{\beta}(x)=u_{-\beta}(x)$ must be positive somewhere in the domain $\mathbb{T}$ . Upon replacing $\beta\to-\beta$ if necessary, we may wlog assume that $u_{\beta}(x)>0$ for some $x\in\mathbb{T}$ . From (A.17) and (A.18), it now follows that

0\neq\mathcal{S}(u_{\beta})=\mathcal{S}(\iota\beta)=\Phi(W\beta)=\Phi(0)=\mathcal{S}(0)=0.

A contradiction. Thus, $\mathcal{S}$ cannot be of neural network-type. ∎

A.7. Proof of curse of parametric complexity for FNO, Theorem 2.27

Building on the curse of parametric complexity for operators of neural network-type, we next show that FNOs also suffer from a similar curse, as stated in Theorem 2.27.

Proof.

(Theorem 2.27) Let $\mathcal{S}^{\dagger}:\mathcal{X}\to\mathbb{R}$ be an $r$ -times Fréchet differentiable functional satisfying the conclusion of Theorem 2.11 (CoD for functionals of neural network-type). In the following, we show that $\mathcal{S}^{\dagger}$ also satisfies the conclusions of Theorem 2.27. Our proof argues by contradiction: we assume that $\mathcal{S}^{\dagger}$ can be approximated by a family of discrete FNOs satisfying the error and complexity bounds of Theorem 2.27, i.e.

(1)

Complexity bound: There exist constant $c>0$ , such that the discretization parameter $N_{\epsilon}\in\mathbb{N}$ , and the total number of non-zero parameters $\mathrm{size}(\mathcal{S}^{N_{\epsilon}}_{\epsilon})$ are bounded by $N_{\epsilon},\mathrm{size}(\mathcal{S}^{N_{\epsilon}}_{\epsilon})\leq\exp(c\epsilon^{-1/(1+\alpha+\delta)r})$ , for all $\epsilon\leq\overline{\epsilon}$ .
(2)

Accuracy: We have

$\sup_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\mathcal{S}^{N_{\epsilon}}_{\epsilon}(u)\right|\leq\epsilon,\quad\forall\,\epsilon>0.$

Then we show that this implies the existence of a family of operators of neural network-type $\widetilde{\mathcal{S}}_{\epsilon}$ , which satisfy for some $\delta^{\prime}>0$ , and for all sufficiently small $\epsilon>0$ ,

•

complexity bound $\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon})\leq\exp(c\epsilon^{-1/(1+\alpha+\delta^{\prime})r})$ ,
•

and error estimate $\max_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\widetilde{\mathcal{S}}_{\epsilon}(u)\right|\leq\epsilon,$

with $c>0$ a potentially different constant. By choice of $\mathcal{S}^{\dagger}$ , the existence of $\widetilde{\mathcal{S}}_{\epsilon}$ is ruled out by Theorem 2.11, providing the desired contradiction.

In the following, we discuss the construction of $\widetilde{\mathcal{S}}_{\epsilon}$ : Let $\mathcal{S}_{\epsilon}^{N_{\epsilon}}$ be a family of FNOs satisfying (1) and (2) above. Fix $\epsilon>0$ for the moment. To simplify notation, we will write $N={N_{\epsilon}}$ , in the following. We recall that, by definition, the discretized FNO $\mathcal{S}_{\epsilon}^{N}$ can be written as a composition $\mathcal{S}^{N}_{\epsilon}=\widetilde{\mathcal{Q}}\circ\mathcal{L}_{L}\circ\dots\circ\mathcal{L}_{1}\circ\mathcal{R}$ , where

\mathcal{R}:u(x)\mapsto\chi(x_{j_{1},\dots,j_{d}},u(x_{j_{1},\dots,j_{d}})),

and

\widetilde{\mathcal{Q}}:\left(v_{j_{1},\dots,j_{d}}\right)\mapsto\frac{1}{N^{d}}\sum_{j_{1},\dots,j_{d}}q(y_{j_{1},\dots,j_{d}},v_{j_{1},\dots,j_{d}}),

are defined by neural networks $\chi$ and $q$ , respectively, $\mathcal{L}_{\ell}$ is of the form

\mathcal{L}_{\ell}:v\mapsto\sigma\left(W_{\ell}v+\mathcal{F}_{N}^{-1}\left(\widehat{T}_{\ell}\mathcal{F}_{N}v\right)+b_{\ell}\right),\quad v=\left(v_{j_{1},\dots,j_{d}}\right)_{j_{1},\dots,j_{d}=1}^{N},

with $W_{\ell}\in\mathbb{R}^{d_{v}\times d_{v}}$ , Fourier multiplier $\widehat{T}_{\ell}=\{[\widehat{T}_{\ell}]_{k}\}_{\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}}$ , where $[\widehat{T}_{\ell}]_{k}\in\mathbb{C}^{d_{v}\times d_{v}}$ , and $\mathcal{F}_{N}$ ( $\mathcal{F}_{N}^{-1}$ ) denote discrete (inverse) Fourier transform, and where the bias $b$ is determined by its Fourier coefficients $\widehat{b}_{k}$ , $\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}$ . We also recall that the size of $\mathcal{S}^{N}_{\epsilon}$ is given by the total number of non-zero components, i.e.

\mathrm{size}(\mathcal{S}^{N}_{\epsilon})=\mathrm{size}(\chi)+\mathrm{size}(q)+\sum_{\ell=1}^{L}\left\{\|W_{\ell}\|_{0}+\|\widehat{T}_{\ell}\|_{0}+\sum_{\|k\|_{\ell^{\infty}}\leq k_{\mathrm{max}}}\|\widehat{b}_{k}\|_{0}\right\}.

We now observe that, after flattening the tensor

\left(v_{j_{1},\dots,j_{d}}\right)\in\mathbb{R}^{N\times\dots\times N\times d_{v}}\simeq\mathbb{R}^{d_{v}N^{d}},

the (linear) mapping $v\mapsto W_{\ell}v$ can be represented by multiplication against a sparse matrix with at most $\|W_{\ell}\|_{0}N^{d}$ non-zero entries. For the non-local operator $\mathcal{F}^{-1}_{N}\widehat{T}_{\ell}\mathcal{F}_{N}$ , we note that for $v\in\mathbb{R}^{d_{v}N^{d}}$ , the number $\kappa$ of components (channels) that need to be considered is bounded by $\kappa\leq\min(d_{v},\|\widehat{T}_{\ell}\|_{0})\leq\|\widehat{T}_{\ell}\|_{0}$ . Discarding any channels that are zeroed out by $\widehat{T}_{\ell}$ , a naive implementation of $\mathcal{F}^{-1}_{N}\widehat{T}_{\ell}\mathcal{F}_{N}$ thus amounts to a matrix representation of a linear mapping $\mathbb{R}^{\kappa N^{d}}\to\mathbb{R}^{\kappa N^{d}}$ , requiring at most $\kappa^{2}N^{2d}\leq\|\widehat{T}_{\ell}\|_{0}^{2}N^{2d}$ non-zero components. Thus, each discretized hidden layer $\mathcal{L}_{\ell}$ can be represented exactly by an ordinary neural network layer $L_{\ell}=\sigma(A_{\ell}v+c_{\ell})$ with matrix $A_{\ell}\in\mathbb{R}^{d_{v}N^{d}\times d_{v}N^{d}}$ and bias $c_{\ell}\in\mathbb{R}^{d_{v}N^{d}}$ , satisfying the following bounds on the number of non-zero components:

\|A_{\ell}\|_{0}\leq\|W_{\ell}\|_{0}N^{d}+\|\widehat{T}_{\ell}\|_{0}^{2}N^{2d},\quad\|c_{\ell}\|_{0}\leq\sum_{|k|\leq k_{\mathrm{max}}}\|[\widehat{b}_{\ell}]_{k}\|_{0}N^{d}.

Similarly, the input and output layers $\mathcal{R}$ and $\widetilde{\mathcal{Q}}$ can be represented exactly by ordinary neural networks $R:\mathbb{R}^{kN^{d}}\to\mathbb{R}^{d_{v}N^{d}}$ and $Q:\mathbb{R}^{d_{v}N^{d}}\to\mathbb{R}$ , with

\mathrm{size}(R)\leq N^{d}\mathrm{size}(\chi),\quad\mathrm{size}(Q)\leq N^{d}\mathrm{size}(q),

obtained by parallelization of $\chi$ , resp. $q$ , at each grid point. Given the above observations, we conclude that, with canonical identification $\mathbb{R}^{N\times\dots\times N\times k}\simeq\mathbb{R}^{kN^{d}}$ and $\mathbb{R}^{N\times\dots\times N\times p}\simeq\mathbb{R}^{pN^{d}}$ , the discretized FNO $\mathcal{S}^{N}_{\epsilon}$ can be represented by an ordinary neural network $\Phi:\mathbb{R}^{kN^{d}}\to\mathbb{R}^{pN^{d}}$ , $\Phi=Q\circ L_{L}\circ\dots\circ L_{1}\circ R$ , with

	$\displaystyle\mathrm{size}(\Phi)$	$\displaystyle\leq\sum_{\ell=1}^{L}\left\{\\|W_{\ell}\\|_{0}N^{d}+\\|\widehat{T}_{\ell}\\|_{0}^{2}N^{2d}+\sum_{\|k\|\leq k_{\mathrm{max}}}\\|\widehat{b}_{k}\\|_{0}N^{d}\right\}$
		$\displaystyle\qquad+N^{d}\mathrm{size}(\chi)+N^{d}\mathrm{size}(q)$
		$\displaystyle\leq N^{2d}\mathrm{size}(\mathcal{S}_{\epsilon}^{N})^{2}.$

By assumption on $\mathcal{S}_{\epsilon}^{N}$ (for which we aim to show that it leads to a contradiction), we have $\mathrm{size}(\mathcal{S}_{\epsilon}^{N})\leq\exp(c\epsilon^{-1/(1+\alpha+\delta)r})$ , and $N_{\epsilon}\leq\exp(c\epsilon^{-1/(1+\alpha+\delta)r})$ . It follows that

\mathrm{size}(\Phi)\leq N^{2d}\mathrm{size}(\mathcal{S}_{\epsilon}^{N})^{2}\leq\exp((2d+2)c\epsilon^{-1/(1+\alpha+\delta)r}).

In addition, $\Phi$ trivially defines an operator of neural network-type, $\widetilde{\mathcal{S}}_{\epsilon}:C^{s}(\Omega;\mathbb{R}^{k})\to\mathbb{R}$ , by

\widetilde{\mathcal{S}}_{\epsilon}(u):=\Phi\left((u(x_{j_{1},\dots,j_{d}}))_{j_{1},\dots,j_{d}=1}^{N}\right),

To see this, we simply note that the point-evaluation mapping $\mathcal{L}:u\mapsto u(x_{j_{1},\dots,j_{d}})_{j_{1},\dots,j_{d}=1}^{N}$ is linear, and hence we have the representation

\widetilde{\mathcal{S}}_{\epsilon}(u)=\Phi(\mathcal{L}u).

By the above construction, we have $\widetilde{\mathcal{S}}_{\epsilon}(u)\equiv\mathcal{S}_{\epsilon}^{N_{\epsilon}}(u)$ for all input functions $u$ .

To summarize, assuming that a family of FNOs $\mathcal{S}_{\epsilon}^{N_{\epsilon}}$ exists, satisfying (1) and (2) above, we have constructed a family $\widetilde{\mathcal{S}}_{\epsilon}$ of operators of neural network type, with

\sup_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\widetilde{\mathcal{S}}_{\epsilon}(u)\right|=\sup_{u\in K}\left|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}^{N_{\epsilon}}(u)\right|\leq\epsilon,

and

\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon})\leq\mathrm{size}(\Phi)\leq\exp(c^{\prime\prime}\epsilon^{-1/(1+\alpha+\delta)r}),

where $c^{\prime\prime}=(2d+2)c>0$ is a constant independent of $\epsilon$ .

By Theorem 2.11, and fixing any $0<\delta^{\prime}<\delta$ , we also have the following lower bound for all sufficiently small $\epsilon$ :

\exp(c^{\prime}\epsilon^{-1/(1+\alpha+\delta^{\prime})r})\leq\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon}),

where $c^{\prime}>0$ is independent of $\epsilon$ . From the above two-sided bounds, it thus follows that

\exp(c^{\prime}\epsilon^{-1/(1+\alpha+\delta^{\prime})r})\leq\mathrm{cmplx}(\widetilde{\mathcal{S}}_{\epsilon})\leq\exp(c^{\prime\prime}\epsilon^{-1/(1+\alpha+\delta)r}),

for all $\epsilon$ sufficiently small, and where by our choice of $\delta^{\prime},\delta$ : $0<1+\alpha+\delta^{\prime}<1+\alpha+\delta$ and $c^{\prime},c^{\prime\prime}>0$ are constants. Since $\epsilon^{-1/(1+\alpha+\delta^{\prime})r}$ grows faster than $\epsilon^{-1/(1+\alpha+\delta)r}$ as $\epsilon\to 0$ , this leads to the desired contradiction. We thus conclude that a family $\mathcal{S}_{\epsilon}^{N_{\epsilon}}$ of discretized FNOs as assumed above cannot exist. This concludes the proof. ∎

Appendix B Short-time existence of $C^{r}$ -solutions

The proof of short-time existence of solutions of the Hamilton-Jacobi equation (HJ) is based on the following Banach space implicit function theorem:

Theorem B.1 (Implicit Function Theorem, see e.g. [6, Section 2.2]).

Let $U\subset X$ , $V\subset Y$ be open subsets of Banach spaces $X$ and $Y$ . Let

F:U\times V\to Z,\quad(u,v)\mapsto F(u,v),

be a $C^{p}$ -mapping ( $p$ -times continuously Fréchet differentiable). Assume that there exist $(u_{0},v_{0})\in U\times V$ such that $F(u_{0},v_{0})=0$ , and such that the Jacobian with respect to $v$ , evaluated at the point $(u_{0},v_{0})$ ,

D_{v}F(u_{0},v_{0}):X\to Z,

is a linear isomorphism. Then there exists a neighbourhood $U_{0}\subset U$ of $u_{0}$ , and a $C^{p}$ -mapping $\psi:U_{0}\to V$ , such that

F(u,\psi(u))=0,\quad\forall\,u\in U_{0}.

Furthermore, $\psi$ is unique, in the sense that for any $u\in U_{0}$ , $F(u,v)=0$ implies $v=\psi(u)$ .

As a first step toward proving short-time existence for (HJ), we prove that under the no-blowup Assumption 3.1, the semigroup $\Psi_{t}^{\dagger}$ (3.7) exists for any $t\geq 0$ .

Proof.

(Lemma 3.3) By classical ODE theory, it is immediate that for any initial data $(q_{0},p_{0},z_{0})\in\Omega\times\mathbb{R}^{d}\times\mathbb{R}$ a maximal solution of the ODE system (3.4) exists, and is unique, over a short time-interval. It thus remains to prove that this solution in fact exists globally, i.e. that the solution does not escape to infinity in finite time. Since $z$ solves $\dot{z}=\mathcal{L}(q,p)$ , with a right-hand side that only depends on $q$ and $p$ , it will suffice to prove that the Hamiltonian trajectory $\dot{q}=\nabla_{p}H(q,p)$ , $\dot{p}=-\nabla_{q}H(q,p)$ with initial data $(q_{0},p_{0})$ exists globally in time. To this end, we note that, by Assumption 3.1, we have

\frac{d}{dt}(1+|p|^{2})=2p\cdot\dot{p}=-2p\cdot\nabla_{q}H(q,p)\leq 2L_{H}(1+|p|^{2}).

Gronwall’s lemma thus implies that $|p|\leq\sqrt{1+|p_{0}|^{2}}\exp(L_{H}t)$ remains bounded for all $t\geq 0$ . This shows that blow-up cannot occur in the $p$ -variable in a finite time. On the other hand, the right-hand side of the ODE system is periodic in $q$ , and hence (by the bound on $p$ ) blow-up is also ruled out for $q$ . In particular, Assumption 3.1 ensures that the Hamiltonian trajectory $t\mapsto(q(t),p(t))$ exists globally in time. As argued above, this in turn implies the global existence of the flow map $t\mapsto\Psi^{\dagger}_{t}$ . ∎

We now apply the Implicit Function Theorem B.1 to prove Lemma 3.4.

Proof.

(Lemma 3.4) Let $X=C^{r}_{\mathrm{per}}(\Omega)\times\mathbb{R}^{d}\times\mathbb{R}$ with $r\geq 2$ . Define

F:X\times\mathbb{R}^{d}\to\mathbb{R}^{d},\quad(u_{0},q,t;q_{0})\mapsto q-q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

i.e. we set

F(u_{0},q,t;q_{0}):=q-q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

with $q_{t}:\Omega\times\mathbb{R}^{d}\to\Omega$ , $(q_{0},p_{0})\mapsto q_{t}(q_{0},p_{0})$ the spatial characteristic mapping of the Hamiltonian system (3.4) at time $t\geq 0$ , where, recall, we have assumed that $u_{0}\in\mathcal{F}.$ Under Assumption 3.1, the spatial characteristic mapping, and hence $F$ , is well-defined for any $t\geq 0$ (see Lemma 3.3). Since $H\in C^{r+1}(\Omega\times\mathbb{R}^{d})$ , the mapping $\Omega\times\mathbb{R}^{d}\times\mathbb{R}\to\Omega$ , $(q_{0},p_{0},t)\mapsto q_{t}(q_{0},p_{0})$ is $C^{r}$ . The mapping $F_{1}:\Omega\times C^{r}_{\mathrm{per}}(\Omega)\to\mathbb{R}^{d}$ , $(q_{0},u_{0})\mapsto F_{1}(q_{0},u_{0}):=\nabla_{q}u_{0}(q_{0})$ is $C^{r-1}$ in the first argument, and is a continuous linear mapping in the second argument (and hence infinitely Frechet differentiable in the second argument). Hence, the composition

(q_{0},u_{0})\mapsto(q_{0},\nabla_{q}u_{0}(q_{0}))\mapsto q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

is a $C^{r-1}$ mapping. And, as a consequence, the mapping $F:X\times\mathbb{R}^{d}\to\mathbb{R}^{d}$ is a $C^{r-1}$ mapping.

Since

F(u_{0},q,t;q_{0})=q-q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

we clearly have

F(u_{0},q_{0},0;q_{0})=0,

for any $q_{0}\in\Omega$ and $u_{0}\in C^{r}_{\mathrm{per}}(\Omega)$ , and the derivative with respect to the last argument

D_{q_{0}}F(u_{0},q_{0},0;q_{0}):\mathbb{R}^{d}\to\mathbb{R}^{d},

is given by

D_{q_{0}}F(u_{0},q_{0},0;q_{0})=D_{q_{0}}\left[q-q_{0}\right]=-\bm{1}_{d\times d},

which defines an isomorphism $\mathbb{R}^{d}\to\mathbb{R}^{d}$ .

By the implicit function theorem, for any $\overline{q}_{0}\in\Omega$ and $\overline{u}_{0}\in C^{r}_{\mathrm{per}}(\Omega)$ , there exist $\epsilon=\epsilon(\overline{q}_{0},\overline{u}_{0}),r=r(\overline{q}_{0},\overline{u}_{0}),t^{\ast}=t^{\ast}(\overline{q}_{0},\overline{u}_{0})>0$ , and a mapping

\psi_{\overline{q}_{0},\overline{u}_{0}}:B_{\epsilon}(\overline{q}_{0})\times B_{r}(\overline{u}_{0})\times[0,t^{\ast})\to\mathbb{R}^{d},

such that $F(q,u_{0},t;q_{0})=0$ for

(q,u_{0},t)\in B_{\epsilon}(\overline{q}_{0})\times B_{r}(\overline{u}_{0})\times[0,t^{\ast}),

if, and only if,

q_{0}=\psi_{\overline{q}_{0},\overline{u}_{0}}(q,u_{0},t).

Fix $\overline{u}_{0}\in C^{r}_{\mathrm{per}}(\Omega)$ for the moment. Since $\Omega$ is compact, we can choose a finite number of points $\overline{q}_{0}^{(1)},\dots,\overline{q}_{0}^{(m)}$ , such that

\Omega\subset\bigcup_{j=1}^{m}B_{\epsilon(\overline{q}_{0}^{{j}},\overline{u}_{0})}(\overline{q}_{0}^{{j}}).

Let

	$\displaystyle t^{\ast}(\overline{u}_{0})$	$\displaystyle:=\min_{j=1,\dots,m}t^{\ast}(\overline{q}^{{j}}_{0},\overline{u}_{0}),$
	$\displaystyle r(\overline{u}_{0})$	$\displaystyle:=\min_{j=1,\dots,m}r(\overline{q}_{0}^{{j}},\overline{u}_{0}).$

Due to the uniqueness property of each $\psi_{\overline{q}^{{j}}_{0},\overline{u}_{0}}$ , $j=1,\dots,m$ , all of these mappings have the same values on overlapping domains. Hence, we can combine them into a single map

\psi_{\overline{u}_{0}}:\Omega\times B_{r(\overline{u}_{0})}(\overline{u}_{0})\times[0,t^{\ast}(\overline{u}_{0}))\to\mathbb{R}^{d}.

Furthermore, since $\psi_{\overline{q}^{{j}}_{0},\overline{u}_{0}}\in C^{r-1}$ , we also have $\psi_{\overline{u}_{0}}\in C^{r-1}$ . Similarly, we can cover the compact set $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)$ by a finite number of open balls $B_{r(\overline{u}_{0}^{(k)})}(\overline{u}_{0}^{(k)})$ , $k=1,\dots,K$ ,

\mathcal{F}\subset\bigcup_{k=1}^{K}B_{r(\overline{u}_{0}^{(k)})}(\overline{u}_{0}^{(k)}).

Setting $T^{\ast}:=\min_{k=1,\dots,K}t^{\ast}(\overline{u}_{0}^{(k)})>0$ , the uniqueness property of $\psi_{\overline{u}_{0}^{(k)}}$ again implies that these mappings agree on overlapping domains. Hence, we can combine them into a global map, and obtain a map

\psi:\Omega\times\mathcal{F}\times[0,T^{\ast})\to\mathbb{R}^{d},

which satisfies $F(q,u_{0},t;\psi(q,u_{0},t))=0$ for all $q\in\Omega$ , $u_{0}\in\mathcal{F}$ and $t<T^{\ast}$ . Furthermore, this $\psi$ is still a $C^{r-1}$ map and it is unique, in the sense that

F(q,u_{0},t;q_{0})=0\quad\Leftrightarrow\quad q_{0}=\psi(q,u_{0},t),\quad\forall\,q\in\Omega,\;u_{0}\in C^{r}_{\mathrm{per}}(\Omega),\;t<T^{\ast},

i.e. for any $u_{0}\in\mathcal{F}$ and $t\in[0,T^{\ast})$ , we have $q=q_{t}(q_{0},\nabla_{q}u_{0}(q_{0}))$ if and only if $q_{0}=\psi(q,u_{0},t)$ . In particular, this shows that for any $u_{0}\in\mathcal{F}$ , $t\in[0,T^{\ast})$ , the $C^{r-1}$ -mapping

\Phi^{\dagger}_{t}({\,\cdot\,};u_{0}):\Omega\to\Omega,\quad q_{0}\mapsto\Phi^{\dagger}_{t}(q_{0};u_{0}):=q_{t}(q_{0},\nabla_{q}u_{0}(q_{0})),

is invertible, with inverse $q\mapsto\psi(q,u_{0},t)\in C^{r-1}$ . This implies the claim. ∎

Next, we apply Lemma 3.4 to prove short-time existence of solutions for (HJ).

Proof.

(Proposition 3.2) Let $T^{\ast}$ be the maximal time such that the spatial characteristic mapping $\Phi^{\dagger}_{t,u_{0}}:\Omega\to\Omega$ , $q_{0}\mapsto q_{t}(q_{0},\nabla_{q}u_{0}(q_{0}))$ is invertible, for any $t\in[0,T^{\ast})$ and for all $u_{0}\in\mathcal{F}$ . We have $T^{\ast}>0$ , by Lemma 3.4. We denote by $\Phi^{\dagger}_{-t,u_{0}}:\Omega\to\Omega$ the inverse $\Phi^{\dagger}_{-t,u_{0}}:=[\Phi^{\dagger}_{t,u_{0}}]^{-1}$ . By the method of characteristics, a solution $u(q,t)$ of the Hamilton-Jacobi equations must satisfy

(B.1)

\displaystyle u(q,t)=u_{0}(q_{0})+\int_{0}^{t}\mathcal{L}(q_{\tau},p_{\tau})\,d\tau,\quad\mathcal{L}(q,p):=p\cdot\nabla_{p}H(q,p)-H(q,p),

where $q_{0}=\Phi^{\dagger}_{-t,u_{0}}(q)$ , $q_{\tau},p_{\tau}$ are the trajectories of the Hamiltonian ODE (3.1) with initial data $q_{0},\nabla_{q}u_{0}(q_{0})$ . Given a fixed $u_{0}\in\mathcal{F}$ , we use the above expression to define $u:\Omega\times[0,T^{\ast})\to\mathbb{R}$ .

We first observe that $u\in C^{r-1}(\Omega\times[0,T^{\ast}))$ , as it is the composition of $C^{r-1}$ mappings,

u(q,t)=u_{0}(q_{0})+\int_{0}^{t}\mathcal{L}(q_{\tau}(q_{0},p_{0}),p_{\tau}(q_{0},p_{0}))\,d\tau,

where $q_{0}=\Phi_{-t,u_{0}}(q)$ , $p_{0}=\nabla_{q}u_{0}(\Phi_{-t,u_{0}}(q))$ are $C^{r-1}$ -functions of $q$ . In particular, since $r\geq 2$ , this implies that $u$ is at least $C^{1}$ . Evaluating $du(q_{t},t)/dt$ along a fixed trajectory, we find that

(B.2)

\displaystyle\partial_{t}u(q_{t},t)+H(q_{t},p_{t})=\left(p_{t}-\nabla_{q}u(q_{t},t)\right)\cdot D_{p}H(q_{t},p_{t}).

Thus, to show that $u$ is a classical solution of (HJ), it remains to show that $p_{t}=\nabla_{q}u(q_{t},t)$ for all $t\in[0,T^{\ast})$ . Assume that $u\in C^{2}_{\mathrm{per}}$ for the moment. We first note that for the $j$ -th component of $p_{t}-\nabla_{q}u(q_{t},t)$ , we have (with implicit summation over repeated indices, following the “Einstein summation rule”)

	$\displaystyle\frac{d}{dt}\left[p^{j}_{t}-\partial_{q^{j}}u(q_{t},t)\right]$	$\displaystyle=-\partial_{q^{j}}H(q_{t},p_{t})-\partial^{2}_{q^{j},q^{k}}u(q_{t},t)\partial_{p^{k}}H(q_{t},p_{t})$
		$\displaystyle\qquad-\partial_{q^{j}}\partial_{t}u(q_{t},t).$

Next, we note that by the invertibility of the spatial characteristic mapping $q_{0}\mapsto q_{t}$ , we can write (B.2) in the form

\partial_{t}u(q,t)=-H(q,P(q,t))+\left(P(q,t)-\nabla_{q}u(q,t)\right)\cdot D_{p}H(q,P(q,t)),

where $P(q,t):=p_{t}\left(\Phi^{\dagger}_{-t,u_{0}}(q),\nabla_{q}u_{0}(\Phi^{\dagger}_{-t,u_{0}}(q))\right)$ . This implies that

	$\displaystyle\partial_{q^{j}}\partial_{t}u(q,t)$	$\displaystyle=-\partial_{q^{j}}H(q,P(q,t))-\partial_{p^{k}}H(q,P(q,t))\partial_{q^{j}}P^{k}(q,t)$
		$\displaystyle\qquad+\left(\partial_{q^{j}}P^{k}(q,t)-\partial^{2}_{q^{j},q^{k}}u(q,t)\right)\cdot\partial_{p^{k}}H(q,P(q,t))$
		$\displaystyle\qquad+\left(P^{k}(q,t)-\partial_{q^{k}}u(q,t)\right)\cdot\partial_{q^{j}}\left[\partial_{p^{k}}H(q,P(q,t))\right]$
		$\displaystyle=-\partial_{q^{j}}H(q,P(q,t))-\partial^{2}_{q^{j},q^{k}}u(q,t)\cdot\partial_{p^{k}}H(q,P(q,t))$
		$\displaystyle\qquad+\left(P^{k}(q,t)-\partial_{q^{k}}u(q,t)\right)\cdot A_{j,k}(q,P(q,t)).$

We point out that on the last line, we have introduced $A_{j,k}(q,t):=\partial_{q^{j}}\left[\partial_{p^{k}}H(q,P(q,t))\right]$ which is a continuous function of $q$ and $t$ . Choosing now $q=q_{t}$ , so that $P(q,t)=p_{t}$ , we thus obtain

	$\displaystyle\partial_{q^{j}}\partial_{t}u(q_{t},t)$	$\displaystyle=-\partial_{q^{j}}H(q_{t},p_{t})-\partial^{2}_{q^{j},q^{k}}u(q_{t},t)\cdot\partial_{p^{k}}H(q_{t},p_{t})$
		$\displaystyle\qquad+\left[p^{k}_{t}-\partial_{q^{k}}u(q_{t},t)\right]\cdot A_{j,k}(q,t).$

Substitution in the ODE for $p_{t}-\nabla_{q}u(q_{t},t)$ yields

\frac{d}{dt}\left[p_{t}-\nabla_{q}u(q_{t},t)\right]=-A(q_{t},t)\cdot\left[p_{t}-\nabla_{q}u(q_{t},t)\right].

where $A(q_{t},t)$ is the matrix with components $(A_{j,k}(q_{t},t))$ . Since $[p_{t}-\nabla_{q}u(q_{t},t)]|_{t=0}=0$ , this implies that

(B.3)

\displaystyle p_{t}=\nabla_{q}u(q_{t},t),\quad\forall\,t\in[0,T^{\ast}),

along the trajectory. At this point, the conclusion (B.3) has been obtained under the assumption that $u\in C^{2}_{\mathrm{per}}$ , which is only ensured a priori if $r\geq 3$ .

To prove the result also for the case $r=2$ , we can apply the above argument to smooth $H^{\epsilon}$ , $u_{0}^{\epsilon}$ , which approximate the given $H$ and $u_{0}$ , and for $u^{\epsilon}$ defined by the method of characteristics (B.1) with $u_{0}$ , $H$ replaced by the smooth approximations $u_{0}^{\epsilon}$ , $H^{\epsilon}$ . Then, by the above argument, for any $\epsilon$ -regularized trajectory $(q_{t}^{\epsilon},p^{\epsilon}_{t})$ , we have

p_{t}^{\epsilon}=\nabla_{q}u^{\epsilon}(q_{t}^{\epsilon},t).

Choosing a sequence such that $H^{\epsilon}\overset{C^{r+1}}{\to}H$ , $u_{0}^{\epsilon}\overset{C^{r}}{\to}u_{0}$ as $\epsilon\to 0$ , the corresponding characteristic solution $u^{\epsilon}$ defined by (B.1) (with $H^{\epsilon}$ and $u^{\epsilon}_{0}$ in place of $H$ and $u_{0}$ ) converges in $C^{r-1}$ . Since $r-1=1$ , this implies that $p_{t}=\lim_{\epsilon\to 0}p^{\epsilon}_{t}=\lim_{\epsilon\to 0}\nabla_{q}u^{\epsilon}(q^{\epsilon}_{t},t)=\nabla_{q}u(q_{t},t)$ .

Thus, we conclude that $u\in C^{r-1}(\Omega\times[0,T^{\ast}))$ defined by (B.1) is a classical solution of the Hamilton-Jacobi equations (HJ). We finally have to show that, in fact, $u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T^{\ast}))$ . $C^{r}$ -differentiability in space follows readily from the fact that, by (B.3), we have $\nabla_{q}u(q_{t},t)=p_{t}(q_{0},\nabla_{q}u_{0}(q_{0}))$ . By the invertibility of the spatial characteristic map, this can be equivalently written in the form

(B.4)

\displaystyle\nabla_{q}u(q,t)=p_{t}(\Phi^{\dagger}_{-t,u_{0}}(q),\nabla_{q}u_{0}(\Phi^{\dagger}_{-t,u_{0}}(q))),

where on the right hand side, $(q_{0},p_{0})\mapsto p_{t}(q_{0},p_{0})$ , $q_{0}\mapsto\nabla_{q}u_{0}(q_{0})$ and $(q,t)\mapsto\Phi^{\dagger}_{-t,u_{0}}(q,t)$ are all $C^{r-1}$ mappings. Thus, $\nabla_{q}u$ is a $C^{r-1}$ -function. Furthermore, by (HJ), this implies that $\partial_{t}u=-H(q,\nabla_{q}u)$ is also a $C^{r-1}$ function. This allows us to conclude that $u\in C^{r}_{\mathrm{per}}(\Omega\times[0,T^{\ast}))$ . The additional bound on $\|u({\,\cdot\,},t)\|_{C^{r}}$ follows from the trivial estimate

\|u({\,\cdot\,},t)\|_{C^{r}}\leq C\left(\|u\|_{L^{\infty}}+\|\nabla_{q}u({\,\cdot\,},t)\|_{C^{r-1}}\right),

combined with (B.1) and (B.4), and the fact that $q\mapsto\Phi^{\dagger}_{-t,u_{0}}(q)$ is a $C^{r-1}$ mapping with continuous dependence on $u_{0}\in\mathcal{F}$ and $t\in[0,T]$ , and $(q_{0},p_{0})\mapsto p_{t}(q_{0},p_{0})$ is a $C^{r}$ -mapping, so that

\|\nabla_{q}u({\,\cdot\,},t)\|_{C^{r-1}}\leq\sup_{t\in[0,T]}\sup_{u_{0}\in\mathcal{F}}\|p_{t}(\Phi^{\dagger}_{-t,u_{0}}({\,\cdot\,}),\nabla_{q}u_{0}(\Phi^{\dagger}_{-t,u_{0}}({\,\cdot\,})))\|_{C^{r-1}}<\infty,

for any initial data $u_{0}\in\mathcal{F}$ . ∎

Appendix C Reconstruction from scattered data

The purpose of this appendix is to prove Proposition 4.2. This is achieved through three lemmas, followed by the proof of the proposition itself, employing these lemmas.

The first lemma is the following special case of the Vitali covering lemma, a well-known geometric result from measure theory (see e.g. [52, Lemma 1.2]):

Lemma C.1 (Vitali covering).

If $h>0$ and $Q=\{q^{1},\dots,q^{N}\}$ is a set for which the domain

\Omega\subset\bigcup_{j=1}^{N}B_{h}(q^{j}),

is contained in the union of balls of radius $h$ around the $q^{j}$ , then there exists a subset $Q^{\prime}=\{q^{i_{1}},\dots,q^{i_{m}}\}$ such that $B_{h}(q^{i_{k}})\cap B_{h}(q^{i_{\ell}})=\emptyset$ for all $k,\ell$ , and

\Omega\subset\bigcup_{k=1}^{m}B_{3h}(q^{i_{k}}),

Remark C.2.

Given $Q=\{q^{1},\dots,q^{N}\}$ , the proof of Lemma C.1 (see e.g. [52, Lemma 1.2]) shows that the subset $Q^{\prime}\subset Q$ of Lemma C.1 can be found by the following greedy algorithm, which proceeds by iteratively adding elements to $Q^{\prime}$ (the following is in fact the basis for Algorithm 1 in the main text):

(1)

Start with $j_{1}=1$ and $Q^{\prime}_{1}=\{q^{1}\}$ .
(2)
Iteration step: given $Q^{\prime}_{m}=\{q^{j_{1}},\dots,q^{j_{m}}\}\subset Q$ , check whether there exists $q^{k}\in Q$ , such that

$B_{h}(q^{k})\cap\bigcup_{\ell=1}^{m}B_{h}(q^{j_{\ell}})=\emptyset.$
- •
  
  If yes: define $j_{n+1}:=k$ and $Q^{\prime}_{m+1}:=\{q^{j_{1}},\dots,q^{j_{m+1}}\}$ .
- •
  
  If not: terminate the algorithm and set $Q^{\prime}:=Q^{\prime}_{m}=\{q^{j_{1}},\dots,q^{j_{m}}\}$ .

Based on Lemma C.1, we can now state the following basic result:

Lemma C.3.

Given a set $Q=\{q^{1},\dots,q^{N}\}\subset\Omega$ with fill distance $h_{Q,\Omega}$ , the subset $Q^{\prime}\subset Q$ determined by the pruning Algorithm 1 has fill distance $h_{Q^{\prime},\Omega}\leq 3h_{Q,\Omega}$ and separation distance $\rho_{Q^{\prime}}\geq h_{Q,\Omega}$ , and $Q^{\prime}$ is quasi-uniform with distortion constant $\kappa=3$ , i.e.

\rho_{Q^{\prime}}\leq h_{Q^{\prime},\Omega}\leq 3\rho_{Q^{\prime}}.

Proof.

By definition of $h_{Q,\Omega}$ (cf. Definition 4.1), we have

\Omega\subset\bigcup_{j=1}^{N}B_{h_{Q,\Omega}}(q^{j}).

Let $Q^{\prime}=\{q^{j_{1}},\dots,q^{j_{m}}\}\subset Q$ be the subset determined by Algorithm 1 (reproduced in Remark C.2). $Q^{\prime}$ satisfies the conclusion of the Vitali covering lemma C.1 with $h=h_{Q,\Omega}$ ; thus,

(C.1)

\displaystyle\Omega\subset\bigcup_{k=1}^{m}B_{3h}(q^{j_{k}}),

and $B_{h}(q^{j_{k}})\cap B_{h}(q^{j_{\ell}})=\emptyset$ for all $k\neq\ell$ . The inclusion (C.1) implies that

h_{Q^{\prime},\Omega}=\sup_{q\in\Omega}\min_{k=1,\dots,m}\left|q-q^{j_{k}}\right|\leq 3h=3h_{Q,\Omega}.

On the other hand, the fact that $B_{h}(q^{j_{k}})\cap B_{h}(q^{j_{\ell}})=\emptyset$ implies that $\frac{1}{2}\left|q^{j_{\ell}}-q^{j_{k}}\right|\geq h=h_{Q,\Omega}$ for all $k\neq\ell$ , and hence $\rho_{Q^{\prime}}\geq h_{Q,\Omega}$ . Thus, we have

h_{Q^{\prime},\Omega}\leq 3h_{Q,\Omega}\leq 3\rho_{Q^{\prime}}.

The bound $\rho_{Q^{\prime}}\leq h_{Q^{\prime},\Omega}$ always holds (for convex sets); to see this, choose $q^{j_{k}},q^{j_{\ell}}$ such that $\rho_{Q^{\prime}}=\frac{1}{2}\left|q^{j_{k}}-q^{j_{\ell}}\right|$ realizes the minimal separation distance. Let $\overline{q}$ be the mid-point $\frac{1}{2}\left(q^{j_{k}}+q^{j_{\ell}}\right)$ , so that $\left|\overline{q}-q^{j_{k}}\right|=\left|\overline{q}-q^{j_{\ell}}\right|=\frac{1}{2}\left|q^{j_{k}}-q^{j_{\ell}}\right|=\rho_{Q^{\prime}}$ . We note that $\left|\overline{q}-q^{j_{r}}\right|\geq\rho_{Q^{\prime}}$ for any $q^{j_{r}}\in Q^{\prime}$ , since

2\rho_{Q^{\prime}}\leq\left|q^{j_{r}}-q^{j_{k}}\right|\leq\left|q^{j_{r}}-\overline{q}\right|+\left|\overline{q}-q^{j_{k}}\right|=\left|\overline{q}-q^{j_{r}}\right|+\rho_{Q^{\prime}},

i.e. $\left|\overline{q}-q^{j_{r}}\right|\geq\rho_{Q^{\prime}}$ for $r=1,\dots,m$ . We conclude that

\rho_{Q^{\prime}}\leq\min_{r=1,\dots,m}\left|\overline{q}-q^{j_{r}}\right|\leq\sup_{q\in Q}\min_{r=1,\dots,m}\left|q-q^{j_{r}}\right|=h_{Q^{\prime},\Omega}.

∎

Lemma C.4.

Let $\Omega=[0,2\pi]^{d}\subset\mathbb{R}^{d}$ , let $r\geq 2$ and let $\kappa\geq 1$ . There exists $C=C(d,r,\kappa)>0$ and $\gamma=\gamma(d,r,\kappa)>0$ , such that for all $f^{\dagger}\in C^{r}(\Omega)$ and all $Q\subset\Omega$ , quasi-uniform with respect to $\kappa$ , and with fill distance $h_{Q,\Omega}$ , the approximation error of the moving least squares interpolation (4.2) with exact data $z^{j}=f^{\dagger}(q^{j})$ , and parameter $\delta:=\gamma h_{z,Q}$ , is bounded as follows:

\|f^{\dagger}-f_{z,Q}\|_{L^{\infty}}\leq Ch_{Q,\Omega}^{r}\|f^{\dagger}\|_{C^{r}}.

Proof.

This is immediate from [53, Corollary 4.8]. ∎

Based on Lemma C.4, we can also derive a reconstruction estimate, when the interpolation data $\{q^{j,\dagger},f^{\dagger}(q^{j,\dagger})\}_{j=1}^{N}$ is only known up to small errors in both the position $q^{j}\approx q^{j,\dagger}$ and the values $z^{j}\approx f^{\dagger}(q^{j,\dagger})$ , and this is Proposition 4.2 stated in the main body of the text. We now prove this proposition.

Proof.

(Proposition 4.2) Let $\psi:\mathbb{R}^{d}\to[0,1]$ be a compactly supported $C^{\infty}$ function, such that $\psi(0)=1$ , $\|\psi\|_{L^{\infty}}\leq 1$ and $\psi(q)=0$ for $|q|\geq 1/2$ . Define

\widetilde{f}(q):=f^{\dagger}(q)+\sum_{j=1}^{N}\left(z^{j}-f^{\dagger}({q}^{j})\right)\psi\left(\frac{q-{q}^{j}}{\rho_{Q}}\right).

Note that, by assumption, we have

|{q}^{j}-{q}^{k}|\geq|q^{j,\dagger}-q^{k,\dagger}|-2\delta\geq 2(\rho_{Q}-\rho)\geq\rho_{Q},

for all $j\neq k$ . In particular this implies that the functions

q\mapsto\psi\left(\frac{q-{q}^{j}}{\rho_{Q}}\right),

have disjoint support for different values of $j$ . Thus, for any $j=1,\dots,N$ , we have $\widetilde{f}({q}^{j})=z^{j}$ , and ${f}_{z,Q}$ is the (exact) moving least squares approximation of $\widetilde{f}$ . From Proposition C.4 it follows that

\|\widetilde{f}-{f}_{z,Q}\|_{L^{\infty}}\leq C\|\widetilde{f}\|_{C^{r}}h^{r}_{Q,\Omega}.

We now note that

\displaystyle|z^{j}-f^{\dagger}({q}^{j})|

\displaystyle\leq|z^{j}-f^{\dagger}({q}^{j,\dagger})|+|f^{\dagger}(q^{j,\dagger})-f^{\dagger}({q}^{j})|\leq\epsilon+\|f^{\dagger}\|_{C^{r}}\rho.

On the one hand (due to the disjoint supports of the bump functions), we can now make the estimate

\|\widetilde{f}-f^{\dagger}\|_{L^{\infty}}\leq\max_{j=1,\dots,N}|z^{j}-f^{\dagger}({q}^{j})|\leq\epsilon+\|f\|_{C^{r}}\rho.

On the other hand (again due to the disjoint supports of the bump functions), we also have

	$\displaystyle\\|\widetilde{f}\\|_{C^{r}}$	$\displaystyle\leq\\|f^{\dagger}\\|_{C^{r}}+\max_{j=1,\dots,N}\|z^{j}-f^{\dagger}({q}^{j})\|\rho_{Q}^{-r}\\|\psi\\|_{C^{r}}$
		$\displaystyle\leq\\|f^{\dagger}\\|_{C^{r}}+\frac{\epsilon+\\|f^{\dagger}\\|_{C^{r}}\rho}{\rho_{Q}^{r}}\\|\psi\\|_{C^{r}}$

These two estimates imply that

	$\displaystyle\\|f^{\dagger}-{f}_{z,Q}\\|_{L^{\infty}}$	$\displaystyle\leq\\|f^{\dagger}-\widetilde{f}\\|_{L^{\infty}}+\\|\widetilde{f}-{f}_{z,Q}\\|_{L^{\infty}}$
		$\displaystyle\leq\left(\epsilon+\\|f^{\dagger}\\|_{C^{1}}\rho\right)+C\left(\\|f^{\dagger}\\|_{C^{r}}+\left(\epsilon+\\|f^{\dagger}\\|_{C^{1}}\rho\right)\frac{\\|\psi\\|_{C^{r}}}{\rho^{r}_{Q}}\right)h_{Q,\Omega}^{r}$

Taking into account that $h_{Q,\Omega}/\rho_{Q}\leq\kappa$ is bounded, that $\|\psi\|_{C^{r}}$ is independent of $f^{\dagger}$ , and introducing a new constant $C=C(d,r,\kappa,\|\psi\|_{C^{r}})>0$ , we obtain

\displaystyle\|f^{\dagger}-{f}_{z,Q}\|_{L^{\infty}}

\displaystyle\leq C\left(\|f^{\dagger}\|_{C^{r}}h^{r}_{Q,\Omega}+\epsilon+\|f^{\dagger}\|_{C^{1}}\rho\right),

as claimed. ∎

Appendix D Complexity estimates for HJ-Net approximation

In the last section, we have shown that given a set of initial data $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)$ , with $r\geq 2$ and $\Omega=[0,2\pi]^{d}$ , the method of characteristics is applicable for a positive time $T^{\ast}>0$ . In the present section, we will combine this result with the proposed HJ-Net framework to derive quantitative error and complexity estimates for the approximation of the solution operator of (HJ). This will require estimates for the approximation of the Hamiltonian flow $\Psi_{t}\approx\Psi^{\dagger}_{t}$ by neural networks, as well as an estimate for the reconstruction error resulting from the pruned moving least squares operator $\mathcal{R}$ .

D.1. Approximation of Hamiltonian flow

Proof.

(Proposition 5.3) It follows from [54, Theorem 1] (and a simple scaling argument), that there exists a constant $C=C(r,d)>0$ , such that for any $\epsilon>0$ , there exists a neural network $\Psi_{t}\approx\Psi_{t}^{\dagger}$ satisfying the bound,

(D.1)

\displaystyle\sup_{(q_{0},p_{0},z_{0})\in\Omega\times[-M,M]^{d}\times[-M,M]}\left|\Psi_{t}(q_{0},p_{0},z_{0})-\Psi_{t}^{\dagger}(q_{0},p_{0},z_{0})\right|\leq\epsilon,

and such that

	$\displaystyle 0pt(\Psi_{t})$	$\displaystyle\leq C\log\left(\frac{M^{r}\\|\Psi_{t}^{\dagger}\\|_{C^{r}}}{\epsilon}\right),$
	$\displaystyle\mathrm{size}(\Psi_{t})$	$\displaystyle\leq C\left(\frac{M^{r}\\|\Psi_{t}^{\dagger}\\|_{C^{r}}}{\epsilon}\right)^{(2d+1)/r}\log\left(\frac{M^{r}\\|\Psi_{t}^{\dagger}\\|_{C^{r}}}{\epsilon}\right),$

where $\|\Psi_{t}^{\dagger}\|_{C^{r}}=\|\Psi_{t}^{\dagger}\|_{C^{r}(\Omega\times[-M,M]^{d}\times[-M,M])}$ denotes the $C^{r}$ norm on the relevant domain. To prove the claim, we note that for any trajectory $(q_{t},p_{t})$ satisfying the Hamiltonian ODE system

(D.2)

\displaystyle\dot{q}_{t}=\nabla_{p}H(q_{t},p_{t}),\quad\dot{p}_{t}=-\nabla_{q}H(q_{t},p_{t}),

with initial data $(q_{0},p_{0})\in\Omega\times[-M,M]^{d}$ , we have by assumption 3.1:

\frac{d}{dt}(1+|p_{t}|^{2})=p_{t}\cdot\dot{p}_{t}=-2p_{t}\cdot\nabla_{q}H(q_{t},p_{t})\leq 2L_{H}(1+|p_{t}|^{2}).

Integration of this inequality implies $|p_{t}|^{2}\leq(1+|p_{0}|^{2})\exp(2L_{H}t)\leq(1+dM^{2})\exp(2L_{H}t)$ . Taking also into account that $M\geq 1$ , this implies that $p_{t}\in[-\beta M,\beta M]^{d}$ , where $\beta=(1+\sqrt{d})\exp(L_{H}t)$ depends only on $d$ , $L_{H}$ and $t$ . Since $p_{t}$ remains uniformly bounded and since $q\mapsto H(q,p)$ is $2\pi$ -periodic, it follows that for any $(q_{0},p_{0})\in\Omega\times[-M,M]^{d}$ the Hamiltonian trajectory $(q_{t},p_{t})$ starting at $(q_{0},p_{0})$ stays in $\Omega\times[-\beta M,\beta M]$ .

Recall that $\Psi_{t}^{\dagger}(q_{0},p_{0},z_{0})=(q_{t},p_{t},z_{t})$ is the flow map of the Hamiltonian ODE (D.2) combined with the action integral

(D.3)

\displaystyle z_{t}=z_{0}+\int_{0}^{t}\left[p_{t}\cdot\nabla_{p}H(q_{t},p_{t})-H(q_{t},p_{t})\right]\,d\tau.

Since the Hamiltonian trajectories starting at $(q_{0},p_{0})\in\Omega\times[-M,M]^{d}$ are confined to $\Omega\times[-\beta M,\beta M]^{d}$ and since the right-hand side of (D.2) and (D.3) involve only first-order derivatives of $H$ , it follows from basic ODE theory that there exists $C=C(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d}},M,t,r)>0$ , such that the $C^{r}$ -norm of the flow can be bounded by

\|\Psi_{t}\|_{C^{r}(\Omega\times[-M,M]^{d}\times[-M,M])}\leq C(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d}},M,t,r).

In particular, we finally conclude that there exist constants $\beta=\beta(L_{H},d,t)$ and $C=C(\|H\|_{C^{r+1}(\Omega\times[-\beta M,\beta M]^{d}},M,t,r)>0$ , such that for any $\epsilon>0$ , there exists a neural network $\Psi_{t}\approx\Psi_{t}^{\dagger}$ satisfying the bound (D.1), and such that

\displaystyle 0pt(\Psi_{t})\leq C\log\left(\epsilon^{-1}\right),\quad\mathrm{size}(\Psi_{t})\leq C\epsilon^{-(2d+1)/r}\log\left(\epsilon^{-1}\right).

∎

D.2. Reconstruction error

Proof.

(Proposition 5.5) Let $f^{\dagger}\in C^{r}_{\mathrm{per}}(\Omega)$ be given with $r\geq 2$ , and let $\{{q}^{j},{z}^{j}\}_{j=1}^{N}$ be approximate interpolation data, with

|{q}^{j}-q^{j,\dagger}|\leq h^{r}_{Q,\Omega},\quad|{z}^{j}-f^{\dagger}(q^{j,\dagger})|\leq h^{r}_{Q,\Omega}.

The assertion of this proposition is restricted to $Q^{\dagger}=\{q^{j,\dagger}\}_{j=1}^{N}$ satisfying $h_{Q^{\dagger},\Omega}\leq h_{0}$ for a constant $h_{0}$ (to be determined below). We may wlog assume that $h_{0}\leq 1/16$ in the following. Denote ${Q}:=\{{q}^{j}\}_{j=1}^{N}$ . We recall that the first step in the reconstruction Algorithm 2 consists in an application of the pruning Algorithm 1 to determine pruned interpolation points ${Q}^{\prime}=\{{q}^{j_{1}},\dots,{q}^{j_{m}}\}\subset{Q}$ , such that (cf. Lemma C.3)

\rho_{{Q}^{\prime}}\leq h_{{Q}^{\prime},\Omega}\leq 3\rho_{{Q}^{\prime}},\quad h_{{Q}^{\prime},\Omega}\leq 3h_{{Q},\Omega},\quad h_{{Q},\Omega}\leq\rho_{{Q}^{\prime}}.

Step 1: Write $Q^{\prime,\dagger}=\{q^{j_{1},\dagger},\dots,q^{j_{m},\dagger}\}$ . Our first goal is to show that $Q^{\prime\dagger}$ is quasi-uniform: To this end, we note that by definition of the separation distance and the upper bound on the distance of $q^{j,\dagger}$ and $q^{j}$ :

\rho_{Q^{\prime,\dagger}}\geq\rho_{{Q}^{\prime}}-2\max_{k}|q^{j_{k},\dagger}-{q}^{j_{k}}|\geq\rho_{{Q}^{\prime}}-2h_{Q^{\dagger},\Omega}^{r}.

By the definition of the fill distance, and the assumption that $h_{Q^{\dagger},\Omega}\leq h_{0}\leq 1/2$ and $r\geq 2$ , we also have

h_{Q^{\dagger},\Omega}\leq h_{{Q},\Omega}+\sup_{j}|{q}^{j}-q^{j,\dagger}|\leq h_{{Q},\Omega}+h_{Q^{\dagger},\Omega}^{r}\leq h_{{Q},\Omega}+\frac{1}{2}h_{Q^{\dagger},\Omega},

implying the upper bound $h_{Q^{\dagger},\Omega}\leq 2h_{{Q},\Omega}\leq 2\rho_{{Q}^{\prime}}$ . Similarly we can show that $h_{Q^{\prime,\dagger},\Omega}\leq 2h_{{Q}^{\prime},\Omega}$ . Substitution in the lower bound on $\rho_{Q^{\prime,\dagger}}$ above and using that $h_{0}\leq 1/16$ , yields

\rho_{Q^{\prime,\dagger}}\geq\rho_{{Q}^{\prime}}-2\rho_{\widetilde{Q}^{\prime}}(2h_{{Q},\Omega})^{r-1}\geq\rho_{{Q}^{\prime}}(1-2(2h_{0})^{r-1})\geq\frac{3}{4}\rho_{{Q}^{\prime}}.

Thus, we conclude that

h_{Q^{\prime,\dagger},\Omega}\leq 2h_{{Q}^{\prime},\Omega}\leq 6\rho_{{Q}^{\prime}}\leq 12\rho_{Q^{\prime,\dagger}}.

Since we always have $\rho_{Q^{\prime,\dagger}}\leq h_{Q^{\prime,\dagger},\Omega}$ , we conclude that $Q^{\prime,\dagger}$ is quasi-uniform with $\kappa=12$ .

Step 2: Next, we intend to apply Proposition 4.2 with $Q^{\prime,\dagger}$ in place of $Q^{\dagger}$ and with $\rho=\epsilon=h_{Q^{\dagger},\Omega}^{r}$ : To this end, it remains to show that $\rho\leq\frac{1}{2}\rho_{Q^{\prime,\dagger}}$ is bounded by half the separation distance. To see this, we note that by the above bounds (recall also that $h_{0}\leq 1/16$ ),

\rho\equiv h_{Q^{\dagger},\Omega}^{r}\leq h_{0}^{r-1}\,h_{Q^{\dagger},\Omega}\leq\frac{1}{16}h_{Q^{\dagger},\Omega}\leq\frac{1}{16}\rho_{{Q}^{\prime}}\leq\frac{1}{8}\rho_{Q^{\prime,\dagger}}<\frac{1}{2}\rho_{Q^{\prime,\dagger}},

showing that $\rho<\frac{1}{2}\rho_{Q^{\prime,\dagger}}$ . We can thus apply Proposition 4.2 to conclude that there exist constants $C,h_{0}>0$ (with $h_{0}\leq 1/16$ ), such that if $h_{Q^{\dagger},\Omega}\leq h_{0}/9$ , we have

h_{Q^{\prime,\dagger},\Omega}\leq 2h_{{Q}^{\prime},\Omega}\leq 6h_{{Q},\Omega}\leq 9h_{Q^{\dagger},\Omega}\leq h_{0}

and hence

	$\displaystyle\\|f^{\dagger}-f_{{z},{Q}}\\|_{L^{\infty}}$	$\displaystyle\leq C\left(\\|f^{\dagger}\\|_{C^{r}(\Omega)}+\epsilon+\\|f^{\dagger}\\|_{C^{1}(\Omega)}\rho\right)$
		$\displaystyle\leq 2C\left(1+\\|f^{\dagger}\\|_{C^{r}(\Omega)}\right)h_{Q^{\dagger},\Omega}^{r}.$

Replacing $h_{0}\to h_{0}/9$ and $C\to 2C$ now yields the claimed result of Proposition 5.5. ∎

Proof.

(Proposition 5.6) Let $q_{0},\widetilde{q}_{0}\in\Omega$ and $u_{0}\in\mathcal{F}$ be given, and denote by $(q_{\tau},p_{\tau})$ the solution of the Hamiltonian ODE, with initial value $(q_{0},p_{0})=(q_{0},\nabla_{q}u_{0}(q_{0}))$ . Define $(\widetilde{q}_{\tau},\widetilde{p}_{\tau})$ similarly.

By compactness of $\mathcal{F}\subset C^{r}_{\mathrm{per}}(\Omega)\subset C^{2}_{\mathrm{per}}(\Omega)$ , there exists a constant $M>0$ , such that $|p_{0}|,|\widetilde{p}_{0}|\leq\|u_{0}\|_{C^{2}(\Omega)}\leq M$ . By continuity of the flow map, there exists $\overline{M}>0$ , such that

|p_{\tau}|,|\widetilde{p}_{\tau}|\leq\overline{M},\quad\forall\,\tau\in[0,t].

Then, we have

	$\displaystyle\frac{d}{d\tau}\|(q_{\tau},p_{\tau})-(\widetilde{q}_{\tau},\widetilde{p}_{\tau})\|$	$\displaystyle\leq\left\|\nabla H(q_{\tau},p_{\tau})-\nabla H(\widetilde{q}_{\tau},\widetilde{p}_{\tau})\right\|$
		$\displaystyle\leq\left(\sup_{q\in\Omega,\|p\|\leq\overline{M}}\\|D^{2}H(q,p)\\|\right)\|(q_{\tau},p_{\tau})-(\widetilde{q}_{\tau},\widetilde{p}_{\tau})\|,$

where $D^{2}H$ denotes the Hessian of $H$ and $\|D^{2}H\|$ is the matrix norm induced by the Euclidean vector norm. Further denoting

\|D^{2}H\|_{\infty}:=\sup_{q\in\Omega,|p|\leq\overline{M}}\|D^{2}H(q,p)\|,

then by Gronwall’s inequality, it follows that

|(q_{t},p_{t})-(\widetilde{q}_{t},\widetilde{p}_{t})|\leq e^{\|D^{2}H\|_{\infty}t}|(q_{0},p_{0})-(\widetilde{q}_{0},\widetilde{p}_{0})|.

Furthermore, since $p_{0}=\nabla_{q}u_{0}(q_{0})$ , and $\widetilde{p}_{0}=\nabla_{q}u_{0}(\widetilde{q}_{0})$ , we have $|p_{0}-\widetilde{p}_{0}|\leq\|u_{0}\|_{C^{2}}|q_{0}-\widetilde{q}_{0}|\leq M|q_{0}-\widetilde{q}_{0}|$ , which implies that

	$\displaystyle\|(q_{t},p_{t})-(\widetilde{q}_{t},\widetilde{p}_{t})\|$	$\displaystyle\leq e^{\\|D^{2}H\\|_{\infty}t}\|(q_{0},p_{0})-(\widetilde{q}_{0},\widetilde{p}_{0})\|$
		$\displaystyle\leq(1+M)e^{\\|D^{2}H\\|_{\infty}t}\|q_{0}-\widetilde{q}_{0}\|.$

Therefore, $\Phi^{\dagger}_{t,u_{0}}(q_{0})=q_{t}$ , $\Phi^{\dagger}_{t,u_{0}}(\widetilde{q}_{0})=\widetilde{q}_{t}$ satisfy the estimate

\left|\Phi^{\dagger}_{t,u_{0}}({q}_{0})-\Phi^{\dagger}_{t,u_{0}}(\widetilde{q}_{0})\right|\leq C|q_{0}-\widetilde{q}_{0}|,

with constant

C=(1+M)\exp\left(t\sup_{q\in\Omega,|p|\leq\overline{M}}\|D^{2}H(q,p)\|\right),

which depends only on $t$ , $H$ and on $\mathcal{F}$ (via $M$ and $\overline{M}$ ), but is independent of the particular choice of $u_{0}$ . Thus, if

\Omega\subset\bigcup_{j=1}^{N}B_{h}(q^{j}),

then

\Omega\subset\bigcup_{j=1}^{N}B_{Ch}(\Phi^{\dagger}_{t,u_{0}}(q^{j})),

for any $u_{0}\in\mathcal{F}$ . This implies the claim. ∎

	$\displaystyle\sum_{j=1}^{\infty}j^{-\alpha}\\|{\kappa(j)}\\|_{\ell^{\infty}}^{\rho-s}$	$\displaystyle\sim\sum_{K=1}^{\infty}\sum_{\\|\kappa\\|_{\ell^{\infty}}=K}j(\kappa)^{-\alpha}\\|{\kappa}\\|_{\ell^{\infty}}^{\rho-s}$
		$\displaystyle\sim\sum_{K=1}^{\infty}K^{-\alpha d}\sum_{\\|\kappa\\|_{\ell^{\infty}}=K}\\|{\kappa}\\|_{\ell^{\infty}}^{\rho-s}$
		$\displaystyle\sim\sum_{K=1}^{\infty}K^{-\alpha d}K^{d-1}K^{\rho-s}$
		$\displaystyle=\sum_{K=1}^{\infty}K^{-1-d\left(\alpha-1-\frac{\rho-s}{d}\right)}.$

	$\displaystyle\sup_{y\in[0,1]^{D}}\left\|f_{D}(y)-\widehat{\Phi}_{\epsilon}(y)\right\|$	$\displaystyle=D^{\alpha^{\ast}r}\sup_{y\in[0,1]^{D}}\left\|D^{-\alpha^{\ast}r}f_{D}(y)-\Phi_{\epsilon}(W_{D}y)\right\|$
		$\displaystyle=D^{\alpha^{\ast}r}\sup_{u\in\overline{Q}_{D}}\left\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\right\|$
		$\displaystyle\leq D^{\alpha^{\ast}r}\sup_{u\in K}\left\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\right\|$
		$\displaystyle\leq D^{\alpha^{\ast}r}\epsilon.$

	$\displaystyle\sup_{u\in K}\|\mathcal{F}^{\dagger}(u)-{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}(u)\|$	$\displaystyle=\sup_{u\in K}\|{\mathrm{ev}}_{y}\circ\mathcal{S}^{\dagger}(u)-{\mathrm{ev}}_{y}\circ\mathcal{S}_{\epsilon}(u)\|$
		$\displaystyle\leq\sup_{u\in K}C\\|\mathcal{S}^{\dagger}(u)-\mathcal{S}_{\epsilon}(u)\\|_{\mathcal{Y}}$
		$\displaystyle\leq C\epsilon.$

The parametric complexity of operator learning

Abstract.

1. Introduction

1.1. Context and Literature Review

1.2. Organization

2. The Curse of Parametric Complexity

2.1. Curse of Dimensionality for Neural Networks

2.1.1. ReLU Neural Networks

2.1.2. Two Simple Facts from ReLU Neural Network Calculus

2.1.3. Approximation Theory and CoD for ReLU Networks

Proposition 2.1 (Neural Network CoD).

2.2. Curse of Parametric Complexity in Operator Learning

2.2.1. Infinite-dimensional hypercubes

Example 2.2.

Definition 2.3.

Remark 2.4.

Remark 2.5.

Remark 2.6.

Lemma 2.7.

2.2.2. Curse of Parametric Complexity

Assume K⊂𝒳K\subset\mathcal{X} contains a hypercube QαQ_{\alpha}.

Assume 𝒴=ℝ\mathcal{Y}=\mathbb{R}, i.e. 𝒮†\mathcal{S}^{\dagger} is a functional.

Assume 𝒮\mathcal{S} is of neural network-type.

Definition 2.8 (Functional of neural network-type).

Remark 2.9.

Definition 2.10 (Operator of neural network-type).

2.3. Main Theorem on Curse of Parametric Complexity

Theorem 2.11 (Curse of Parametric Complexity).

Corollary 2.12.

Proof of Theorem 2.11 (Sketch).

Remark 2.13.

Remark 2.14.

Remark 2.15.

Remark 2.16.

Example 2.17 (Operator Learning CoD).

2.4. Examples of Oerators of Neural Network-Type

PCA-Net.

Lemma 2.18 (Complexity of PCA-Net).

Proposition 2.19 (Curse of parametric complexity for PCA-Net).

DeepONet.

Lemma 2.20 (Complexity of DeepONet).

Proposition 2.21 (Curse of parametric complexity for DeepONet).

NOMAD

Lemma 2.22 (Complexity of NOMAD).

Proposition 2.23 (Curse of parametric complexity for NOMAD).

Discussion.

Remark 2.24.

2.5. The Curse of (Parametric) Complexity for Fourier Neural Operators.

Remark 2.25 (FNO approximation of functionals).

Lemma 2.26.

Theorem 2.27.

Proof.

Remark 2.28.

2.5.1. Discussion

3. The Hamilton-Jacobi Equation

3.1. Method of Characteristics for (HJ)

Assumption 3.1 (Growth at Infinity).

3.2. Short-time Existence of CrC^{r}-solutions

Proposition 3.2 (Short-time Existence of Classical Solutions).

Lemma 3.3.

Lemma 3.4.

Remark 3.5.

4. Hamilton-Jacobi Neural Operator: HJ-Net

4.1. Step b) ReLU Network

4.2. Step c) Moving Least Squares

Definition 4.1.

Proposition 4.2 (Error Stability).

Remark 4.3.

4.3. Summary of HJ-Net

5. Error Estimates and Complexity

5.1. HJ-Net Beats The Curse of Parametric Complexity

Theorem 5.1 (HJ-Net Approximation Estimate).

Proof.

Remark 5.2 (Overall Computational Complexity of HJ-Net).

5.2. Approximation of Hamiltonian Flow and Quadrature Map

Proposition 5.3.

Remark 5.4.

5.3. Moving Least Squares Reconstruction Error

Proposition 5.5.

Proposition 5.6.

Assume $K\subset\mathcal{X}$ contains a hypercube $Q_{\alpha}$ .

Assume $\mathcal{Y}=\mathbb{R}$ , i.e. $\mathcal{S}^{\dagger}$ is a functional.

Assume $\mathcal{S}$ is of neural network-type.

3.2. Short-time Existence of $C^{r}$ -solutions

Appendix B Short-time existence of $C^{r}$ -solutions