The parametric complexity of operator learning
Abstract.
Neural operator architectures employ neural networks to approximate operators mapping between Banach spaces of functions; they may be used to accelerate model evaluations via emulation, or to discover models from data. Consequently, the methodology has received increasing attention over recent years, giving rise to the rapidly growing field of operator learning. The first contribution of this paper is to prove that for general classes of operators which are characterized only by their $C^r$- or Lipschitz-regularity, operator learning suffers from a “curse of parametric complexity”, which is an infinite-dimensional analogue of the well-known curse of dimensionality encountered in high-dimensional approximation problems. The result is applicable to a wide variety of existing neural operators, including PCA-Net, DeepONet and the FNO. The second contribution of the paper is to prove that this general curse can be overcome for solution operators defined by the Hamilton-Jacobi equation; this is achieved by leveraging additional structure in the underlying solution operator, going beyond $C^r$-regularity. To this end, a novel neural operator architecture is introduced, termed HJ-Net, which explicitly takes into account characteristic information of the underlying Hamiltonian system. Error and complexity estimates are derived for HJ-Net which show that this architecture can provably beat the curse of parametric complexity related to the infinite-dimensional input and output function spaces.
1. Introduction
The paper is devoted to a study of the computational complexity of the approximation of maps between Banach spaces by means of neural operators. The paper has two main foci: establishing a complexity barrier for general classes of $C^r$- or Lipschitz-regular maps; and then showing that this barrier can be beaten for Hamilton-Jacobi (HJ) equations. In Subsection 1.1 we give a detailed literature review; we set in context the definition of “the curse of parametric complexity” that we introduce and use in this paper; and we highlight our main contributions. Then, in Subsection 1.2, we overview the organization of the remainder of the paper.
1.1. Context and Literature Review
The use of neural networks to learn operators, typically mapping between Banach spaces of functions defined over subsets of finite dimensional Euclidean space and referred to as neural operators, is receiving growing interest in the computational science and engineering community [5, 58, 3, 28, 39, 2, 43, 36]. The methodology has the potential for accelerating numerical methods for solving partial differential equations (PDEs) when a model relating inputs and outputs is known; and it has the potential for discovering input-output maps from data when no model is available.
The computational complexity of learning and evaluating such neural operators is crucial to understanding when the methods will be effective. Numerical experiments addressing this issue may be found in [13] and the analysis of linear problems from this perspective may be found in [4, 14]. Universal approximation theorems, applicable beyond the linear setting, may be found in [5, 39, 34, 30, 31, 33, 2, 56] but such theorems do not address the cost of achieving a given small error.
Early work on operator approximation [42] presents first quantitative bounds; most notably, this work identifies the continuous nonlinear $n$-widths of a space of continuous functionals defined on $L^2$-spaces, showing that these $n$-widths decay at most (poly-)logarithmically in $n$. Both upper and lower bounds are derived in this specific setting. More recently, upper bounds on the computational complexity of recent approaches to operator learning based on deep neural networks, including the DeepONet [39] and the Fourier Neural Operator (FNO) [36], have been studied in more detail. Specific operator learning tasks arising in PDEs have been considered in the papers [50, 24, 40, 34, 30, 48, 15]. Related complexity analysis for the PCA-Net architecture [2] has recently been established in [32]. These papers studying computational complexity focus on the issue of beating a form of the “curse of dimensionality” in these operator approximation tasks.
In these operator learning problems the input and output spaces are infinite dimensional, and hence the meaning of the curse of dimensionality could be ambiguous. In this infinite-dimensional context, “beating the curse” is interpreted as identifying problems, and operator approximations applied to those problems, for which a measure of evaluation cost (referred to as their complexity) grows only algebraically with the inverse of the desired error. As shown rigorously in [32], this is a non-trivial issue: for the PCA-Net architecture, it has been established that such algebraic complexity and error bounds cannot be achieved for general Lipschitz (and even more regular) operators.
As will be explained in detail in the present work, this fact is not specific to PCA-Net, but extends to many other neural operator architectures. In fact, it can be interpreted as a scaling limit of the conventional curse of dimensionality; this conventional curse affects finite-dimensional approximation problems when the underlying dimension is very large. It can be shown that (ReLU) neural networks cannot overcome this curse, in general. As a consequence, neural operators, which build on neural networks, suffer from the scaling limit of this curse in infinite-dimensions. To distinguish this infinite-dimensional phenomenon from the conventional curse of dimensionality encountered in high-but-finite-dimensional approximation problems, we will refer to the scaling limit identified in this work as “the curse of parametric complexity”.
The first contribution of the present paper is to prove that for general classes of operators which are characterized only by their $C^r$- or Lipschitz-regularity, operator learning suffers from such a curse of parametric complexity: Theorem 2.11 (and a variant thereof, Theorem 2.27) shows that, in this setting, there exist operators (and indeed even real-valued functionals) which are approximable only with parametric model complexity that grows exponentially in $\epsilon^{-1}$, where $\epsilon$ is the desired accuracy.
To overcome the general curse of parametric complexity implied by Theorem 2.11 (and Theorem 2.27), efficient operator learning frameworks therefore have to leverage additional structure present in the operators of interest, going beyond $C^r$- or Lipschitz-regularity. Previous work on overcoming this curse for operator learning has mostly focused on operator holomorphy [23, 50, 34] and the emulation of numerical methods [30, 34, 32, 40] as two basic mechanisms for overcoming the curse of parametric complexity for specific operators of interest. A notable exception are the complexity estimates for DeepONets in [15] which are based on explicit representation of the solution; most prominently, this is achieved via the Cole-Hopf transformation for the viscous Burgers equation.
An abstract characterization of the entire class of operators that allow for efficient approximation by neural operators would be very desirable. Unfortunately, this appears to be out of reach, at the current state of analysis. Indeed, as far as the authors are aware, there does not even exist such a characterization for any class of standard numerical methods, such as finite difference, finite element or spectral, viewed as operator approximators. Therefore, in order to identify settings in which operator learning can be effective (without suffering from the general curse of parametric complexity), we restrict attention to specific classes of operators of interest.
The HJ equations present an application area that has the potential to be significantly impacted by the use of ideas from neural networks, especially regarding the solution of problems for functions defined over subsets of high-dimensional ($d \gg 1$) Euclidean space [7, 8, 12, 11]; in particular, beating the curse of dimensionality with respect to this dimension has been a focus. However, this body of work has not studied operator learning, as it concerns settings in which only fixed instances of the PDE are solved. The purpose of the second part of the present paper is to study the design and analysis of neural operators to approximate the solution operator for HJ equations; this operator maps the initial condition (a function) to the solution at a later time (another function).
The second contribution of the paper is to prove in Theorem 5.1 that the general curse of parametric complexity can be overcome for maps defined by the solution operator of the Hamilton-Jacobi (HJ) equation; this is achieved by exposing additional structure in the underlying solution operator, different from holomorphy and emulation and going beyond $C^r$-regularity, that can be leveraged by neural operators; for the HJ equations, the identified structure relies on representation of solutions of the HJ equations in terms of characteristics. In this paper the dimension $d$ of the underlying spatial domain will be fixed and we do not study the curse of dimensionality with respect to $d$. Instead, we demonstrate that it is possible to beat the curse of parametric complexity with respect to the infinite-dimensional nature of the input function for fixed (and moderate) $d$.
1.2. Organization
In Section 2 we present the first contribution: Theorem 2.11, together with the closely related Theorem 2.27 which extends the general but not exhaustive setting of Theorem 2.11 to include the FNO, establishes that the curse of parametric complexity is to be expected in operator learning. The following sections then focus on the second contribution and hence on solution operators associated with the HJ equation; in Theorem 5.1 we prove that additional structure in the solution operator for this equation can be leveraged to beat the curse of parametric complexity. In Section 3 we describe the precise setting for the HJ equation employed throughout this paper; we recall the method of characteristics for solution of the equations; and we describe a short-time existence theorem. Section 4 introduces the proposed neural operator, the HJ-Net, based on learning the flow underlying the method of characteristics and combining it with scattered data approximation. In Section 5 we state our approximation theorem for the proposed HJ-Net, resulting in complexity estimates which demonstrate that the curse of parametric complexity is avoided in relation to the infinite-dimensional nature of the input space (of initial conditions). Section 6 contains concluding remarks. Whilst the high-level structure of the proofs is contained in the main body of the paper, many technical details are collected in the appendix, to promote readability.
2. The Curse of Parametric Complexity
Operator learning seeks to employ neural networks to efficiently approximate operators mapping between infinite-dimensional Banach spaces of functions. To enable implementation of these methods in practice, maps between the formally infinite-dimensional spaces have to be approximated using only a finite number of degrees of freedom.
Commonly, operator learning frameworks can therefore be written in terms of an encoding, a neural network and a reconstruction step as shown in Figure 1. The first step encodes the infinite-dimensional input using a finite number of degrees of freedom. The second approximation step maps the encoded input to an encoded, finite-dimensional output. The final reconstruction step reconstructs an output function given the finite-dimensional output of the approximation mapping. The composition of these encoding, approximation and reconstruction mappings thus takes an input function and maps it to another output function, and hence defines an operator. Existing operator learning frameworks differ in their particular choice of the encoder, neural network architecture and reconstruction mappings.
We start by giving background discussion on the curse of dimensionality (CoD) in finite dimensions, in subsection 2.1. We then describe the subject in detail for neural network-based operator learning, resulting in our notion of the curse of parametric complexity, in subsection 2.2. In subsection 2.3 we state our main theorem concerning the curse of parametric complexity for neural operators. Subsection 2.4 demonstrates that the main theorem applies to PCA-Net, DeepONet and the NOMAD neural network architectures. Subsection 2.5 extends the main theorem to the FNO since it sits outside the framework introduced in subsection 2.2.
2.1. Curse of Dimensionality for Neural Networks
Since the neural network mapping in the decomposition shown in Figure 1 typically maps between high-dimensional (encoded) Euclidean spaces, most approaches to operator learning employ neural networks to learn this mapping. The motivation for this is that, empirically, neural networks have been found to be exceptionally well suited for the approximation of such high-dimensional functions in diverse applications [20]. Detailed investigation of the approximation theory of neural networks, including quantitative upper and lower approximation error bounds, has thus attracted a lot of attention in recent years [54, 55, 29, 38, 16]. Since we build on this analysis we summarize the relevant part of it here, restricting attention to ReLU neural networks in this work, as defined next; generalization to the use of other (piecewise polynomial) activation functions is possible.
2.1.1. ReLU Neural Networks
Fix an integer $L \in \mathbb{N}$ and integers $d_0, d_1, \dots, d_{L+1} \in \mathbb{N}$. Let $A_\ell$ and $b_\ell$ denote weight matrices and bias vectors for $\ell = 0, \dots, L$. A ReLU neural network $\Psi: \mathbb{R}^{d_0} \to \mathbb{R}^{d_{L+1}}$, $x \mapsto \Psi(x)$, is a mapping of the form

(2.1)  $\Psi(x) = A_L\,\sigma\Big(A_{L-1}\cdots\sigma\big(A_1\,\sigma(A_0 x + b_0) + b_1\big)\cdots + b_{L-1}\Big) + b_L,$

where $A_\ell \in \mathbb{R}^{d_{\ell+1}\times d_\ell}$ and $b_\ell \in \mathbb{R}^{d_{\ell+1}}$. Here the activation function $\sigma$ is extended pointwise to act on any Euclidean space; and in what follows we employ the ReLU activation function $\sigma(x) = \max(x,0)$. We let $\theta := \{A_\ell, b_\ell\}_{\ell=0}^{L}$ and note that we have defined a parametric mapping $x \mapsto \Psi(x;\theta)$. We define the depth of $\Psi$ as the number of layers, and the size of $\Psi$ as the number of non-zero weights and biases, i.e.

$\mathrm{size}(\Psi) := \sum_{\ell=0}^{L} \big( \|A_\ell\|_0 + \|b_\ell\|_0 \big),$

where $\|\cdot\|_0$ counts the number of non-zero entries of a matrix or vector.
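To make these definitions concrete, the following minimal sketch (an illustration of ours in Python/NumPy, not code from the paper) evaluates a network of the form (2.1) and counts its size as the number of non-zero weights and biases.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ReLUNetwork:
    """Minimal ReLU network of the form (2.1): affine maps interleaved with ReLU."""

    def __init__(self, weights, biases):
        # weights[l]: array of shape (d_{l+1}, d_l); biases[l]: array of shape (d_{l+1},)
        self.weights = weights
        self.biases = biases

    def __call__(self, x):
        for A, b in zip(self.weights[:-1], self.biases[:-1]):
            x = relu(A @ x + b)
        return self.weights[-1] @ x + self.biases[-1]   # no activation on the output layer

    @property
    def depth(self):
        return len(self.weights)

    @property
    def size(self):
        # number of non-zero weights and biases, matching the definition of size above
        return sum(int(np.count_nonzero(A)) + int(np.count_nonzero(b))
                   for A, b in zip(self.weights, self.biases))

# toy instance: input dimension 3, one hidden layer of width 5, scalar output
rng = np.random.default_rng(0)
net = ReLUNetwork(
    weights=[rng.standard_normal((5, 3)), rng.standard_normal((1, 5))],
    biases=[rng.standard_normal(5), np.zeros(1)],
)
print(net(np.ones(3)), net.depth, net.size)
```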
2.1.2. Two Simple Facts from ReLU Neural Network Calculus
The following facts will be used without further comment (see e.g. [44, Section 2.2.3] for a discussion of more general results): If $\Psi: \mathbb{R}^{d} \to \mathbb{R}^{m}$ is a ReLU neural network, and $B \in \mathbb{R}^{d \times k}$ is a matrix, then there exists a ReLU neural network $\Psi_B: \mathbb{R}^{k} \to \mathbb{R}^{m}$, such that

(2.2)  $\Psi_B(x) = \Psi(Bx) \quad \text{for all } x \in \mathbb{R}^{k},$

with $\mathrm{size}(\Psi_B)$ controlled in terms of $\mathrm{size}(\Psi)$ and the number of non-zero entries of $B$. Similarly, if $C \in \mathbb{R}^{n \times m}$ is a matrix, then there exists a ReLU neural network $\Psi^C: \mathbb{R}^{d} \to \mathbb{R}^{n}$, such that

(2.3)  $\Psi^C(x) = C\,\Psi(x) \quad \text{for all } x \in \mathbb{R}^{d},$

with $\mathrm{size}(\Psi^C)$ controlled in terms of $\mathrm{size}(\Psi)$ and the number of non-zero entries of $C$.
The main non-trivial issue in (2.2) and (2.3) is to preserve the potentially sparse structure of the underlying neural networks; this is based on a concept of “sparse concatenation” from [46].
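The two facts above can be illustrated by folding the matrices into the first and last affine layers; the sketch below (our own, in NumPy) checks the identities (2.2)-(2.3) on a toy network. Note that this naive folding may densify the first or last layer, which is precisely why the sparse concatenation of [46] is needed to control the size.

```python
import numpy as np

def forward(weights, biases, x):
    """Evaluate a ReLU network given its per-layer weights/biases (cf. (2.1))."""
    for A, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(A @ x + b, 0.0)
    return weights[-1] @ x + biases[-1]

def precompose(weights, biases, B):
    """Parameters of x -> Psi(Bx): fold the matrix B into the first affine layer."""
    return [weights[0] @ B] + weights[1:], list(biases)

def postcompose(weights, biases, C):
    """Parameters of x -> C Psi(x): fold the matrix C into the last affine layer."""
    return weights[:-1] + [C @ weights[-1]], biases[:-1] + [C @ biases[-1]]

# check the identities (2.2)-(2.3) on a toy network
rng = np.random.default_rng(1)
W = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
b = [rng.standard_normal(4), rng.standard_normal(2)]
B, C, x = rng.standard_normal((3, 5)), rng.standard_normal((3, 2)), rng.standard_normal(5)
Wb, bb = precompose(W, b, B)
Wc, bc = postcompose(W, b, C)
assert np.allclose(forward(Wb, bb, x), forward(W, b, B @ x))
assert np.allclose(forward(Wc, bc, B @ x), C @ forward(W, b, B @ x))
```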
2.1.3. Approximation Theory and CoD for ReLU Networks
One overall finding of research into the approximation power of ReLU neural networks is that, for function approximation in spaces characterized by smoothness, neural networks cannot entirely overcome the curse of dimensionality [54, 55, 29, 38]. This is illustrated by the following result, which builds on [54, Thm. 5] derived by D. Yarotsky:
Proposition 2.1 (Neural Network CoD).
Let $r \in \mathbb{N}$ be given. For any dimension $d \in \mathbb{N}$, there exists an $r$-times continuously differentiable function $f: [0,1]^d \to \mathbb{R}$ with $\|f\|_{C^r} \le 1$ and a constant $\bar\epsilon > 0$, such that any ReLU neural network $\Psi$ achieving accuracy
$\sup_{x \in [0,1]^d} \big| f(x) - \Psi(x) \big| \le \epsilon,$
with $\epsilon \le \bar\epsilon$, has size at least $\mathrm{size}(\Psi) \ge c\,\epsilon^{-d/(2r)}$. The constants $c, \bar\epsilon > 0$ depend only on $r$; in particular, they are independent of the dimension $d$.
The proof of Proposition 2.1 is included in Appendix A.1. Proposition 2.1 shows that neural network approximation of a function between high-dimensional Euclidean spaces suffers from a curse of dimensionality, characterised by an algebraic complexity with a potentially large exponent proportional to the dimension $d$. This lower bound is similar to the approximation rates (upper bounds) achieved by traditional methods, such as polynomial approximation (ignoring the potentially beneficial factor of $2$ in the exponent). This fact suggests that the empirically observed efficiency of neural networks may well rely on additional structure of functions of practical interest, beyond their smoothness; for relevant results in this direction see, for example, [41, 21].
2.2. Curse of Parametric Complexity in Operator Learning
In the present work, we consider the approximation of an underlying operator acting between Banach spaces; specifically, we assume that the dimensions of these spaces are infinite. Given the curse of dimensionality in the finite-dimensional setting, Proposition 2.1, and letting $d \to \infty$, one would generally expect a super-algebraic, potentially even exponential, lower bound on the “complexity” of neural operators approximating such an operator, as a function of the accuracy $\epsilon$. In this subsection, we make this statement precise for a general class of neural operators, in Theorem 2.11. This is preceded by a discussion of relevant structure of compact sets in infinite-dimensional function spaces and a discussion of a suitable class of “neural operators”.
2.2.1. Infinite-dimensional hypercubes
Proposition 2.1 was stated for the unit cube $[0,1]^d$ as the underlying domain. In the finite-dimensional setting of Proposition 2.1, the approximation rate turns out to be independent of the underlying compact domain, provided that the domain has non-empty interior and assuming a Lipschitz continuous boundary. This is in contrast to the infinite-dimensional case, where compact domains necessarily have empty interior and where the convergence rate depends on the specific structure of the domain. To state our complexity bound for operator approximation, we will therefore need to discuss the prototypical structure of compact subsets of infinite-dimensional function spaces.
To fix intuition, we temporarily consider a function space (for example a Hölder, Lebesgue or Sobolev space). In this case, the most common way to define a compact subset is via a smoothness constraint, as illustrated by the following concrete example:
Example 2.2.
Assume that is the space of -times continuously differentiable functions on a bounded domain . Then for and upper bound , the subset defined by
(2.4) |
is a compact subset of . Here, we define the -norm as,
(2.5) |
To better understand the approximation theory of operators defined on such sets, we would like to understand the structure of such compact sets. Our point of view is inspired by Fourier analysis, according to which smoothness of $u$ roughly corresponds to a decay rate of the Fourier coefficients of $u$. In particular, $u$ is guaranteed to belong to the set (2.4), if $u$ is of the form,

(2.6)  $u = \kappa \sum_{j=1}^{\infty} j^{-\alpha}\, y_j\, e_j, \qquad y_j \in [-1,1],$

for a sufficiently large decay rate $\alpha$, small constant $\kappa > 0$, and where $\{e_j\}_{j \in \mathbb{N}}$ denotes the periodic Fourier (sine/cosine) basis, restricted to the domain. We include a proof of this fact at the end of this subsection (see Lemma 2.7), where we also identify a relevant decay rate $\alpha$. In this sense, the set in (2.4) could be said to “contain” an infinite-dimensional hypercube $Q_\alpha$, with decay rate $\alpha$. Such hypercubes will replace the finite-dimensional unit cube $[0,1]^d$ in our analysis of operator approximation in infinite dimensions.
We would like to point out that a similar observation holds for many other sets defined by a smoothness constraint: bounded balls in Sobolev spaces, and similarly in Besov spaces, spaces of functions of bounded variation and other such spaces, share a similar structure. Bounded balls in all of these spaces contain infinite-dimensional hypercubes, consisting of elements of the form (2.6). We note in passing that, in general, it may be more natural to replace the trigonometric basis in (2.6) by some other choice of basis, such as polynomials, splines, wavelets, or a more general (frame) expansion. We refer to e.g. [9, 22] for general background and [24] for an example of such a setting in the context of holomorphic operators.
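For intuition, the following sketch (ours; the sine basis, the coefficient range and the normalization are illustrative choices rather than the paper's exact convention) samples a truncated element of the form (2.6) with decay rate $\alpha$.

```python
import numpy as np

def sample_cube_element(alpha, kappa=0.1, n_modes=64, n_grid=512, rng=None):
    """Sample a truncated element u = kappa * sum_j j^{-alpha} y_j e_j of a hypercube Q_alpha,
    using a sine basis e_j on [0, 2*pi] and random coefficients y_j in [-1, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    y = rng.uniform(-1.0, 1.0, size=n_modes)                  # cube coordinates y_j
    j = np.arange(1, n_modes + 1)
    u = kappa * np.sum((j ** (-alpha))[:, None] * y[:, None]
                       * np.sin(j[:, None] * x[None, :]), axis=0)
    return x, u

x, u = sample_cube_element(alpha=3.0)
print(u.max(), u.min())   # larger alpha yields smoother, smaller-amplitude samples
```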
The above considerations lead us to the following definition of an abstract hypercube:
Definition 2.3.
Let $\{e_j\}_{j \in \mathbb{N}}$ be a sequence of linearly independent and normed elements, $\|e_j\| = 1$. Given constants $\alpha > 0$ and $\kappa > 0$, we say that $K$ contains an infinite-dimensional cube $Q_\alpha$, if:

(1) $K$ contains the set $Q_\alpha$ consisting of all $u$ of the form (2.6);

(2) the set $\{e_j\}_{j \in \mathbb{N}}$ possesses a bounded bi-orthogonal sequence of functionals, labelled $\{e_j^*\}_{j \in \mathbb{N}}$, in the continuous dual space (if the underlying space is a Hilbert space, such a bi-orthogonal sequence always exists for independent $\{e_j\}$); i.e. we assume that $e_j^*(e_k) = \delta_{jk}$ for all $j, k \in \mathbb{N}$, and that there exists $M > 0$, such that $\|e_j^*\| \le M$ for all $j \in \mathbb{N}$.
Remark 2.4.
Remark 2.5.
The decay rate $\alpha$ of the infinite-dimensional cube $Q_\alpha$ provides a measure of its “asymptotic size” or “complexity”. In terms of our complexity bounds, this decay rate will play a special role. Hence, we will usually retain this dependence explicitly by writing $Q_\alpha$, but suppress the additional dependence on $\kappa$ and on the elements $\{e_j\}$ in the following.
The notion of infinite-dimensional cubes introduced in Definition 2.3 is only a minor generalization of an established notion of cube embeddings, introduced by Donoho [18] in a Hilbert space setting. We refer to [10, Chap. 5] for a pedagogical exposition of such cube embeddings in the Hilbert space setting, and their relation to the Kolmogorov entropy of .
Remark 2.6.
The complexity bounds established in this work will be based on infinite-dimensional hypercubes. An interesting question, left for future work, is whether our main result on the curse of parametric complexity, Theorem 2.11 below, could be stated directly in terms of the Kolmogorov complexity of the underlying compact set, or other notions such as the Kolmogorov $n$-width [17].
Our definition of an infinite-dimensional hypercube is natural in view of the following lemma, the discussion following Example 2.2, and other similar results.
Lemma 2.7.
Assume that is the space of times continuously differentiable functions on a bounded domain . Choose and define , compact in , by
with constant . Then contains an infinite-dimensional hypercube , for any .
2.2.2. Curse of Parametric Complexity
The main question to be addressed in the present section is the following: given a compact set of input functions, an $r$-times Fréchet differentiable operator to be approximated by a neural operator, and given a desired approximation accuracy $\epsilon > 0$, how many tunable parameters (in the architecture of the neural operator) are required to achieve a uniform approximation error of at most $\epsilon$ over the input set?
The answer to this question clearly depends on our assumptions on the underlying operator, on the input set, and on the class of approximating neural operators.
Assume the set of inputs contains a hypercube $Q_\alpha$.
Consistent with our discussion in the last subsection, we will assume that the compact set of inputs contains an infinite-dimensional hypercube $Q_\alpha$, as introduced in Definition 2.3, with algebraic decay rate $\alpha$.
Assume the output space is one-dimensional, i.e. the map to be approximated is a functional.
The approximation of an operator with potentially infinite-dimensional output space is generally harder to achieve than the approximation of a functional with one-dimensional output; indeed, $\mathbb{R}$ can be embedded in any non-trivial output space, and any functional gives rise to an operator under such an embedding. To simplify our discussion, we will therefore initially restrict attention to the approximation of functionals, with the aim of showing that even the approximation of $r$-times Fréchet differentiable functionals is generally intractable.
Assume the approximation is of neural network-type.
Restricting attention to real-valued functionals, we must finally introduce a rigorous notion of the relevant class of approximating functionals, i.e. define a class of “neural operators/functionals” approximating the underlying functional.
Definition 2.8 (Functional of neural network-type).
We will say that a (neural) functional $\mathcal{F}: \mathcal{X} \to \mathbb{R}$ is a “functional of neural network-type”, if it can be written in the form

(2.7)  $\mathcal{F}(u) = \Psi\big(\mathcal{L}(u)\big),$

where for some $\ell \in \mathbb{N}$, $\mathcal{L}: \mathcal{X} \to \mathbb{R}^{\ell}$ is a linear map, and $\Psi: \mathbb{R}^{\ell} \to \mathbb{R}$ is a ReLU neural network (potentially sparse).
If $\mathcal{F}$ is a functional of neural network-type, we define the complexity of $\mathcal{F}$, denoted $\mathrm{cmplx}(\mathcal{F})$, as the smallest size of a neural network $\Psi$ for which there exists a linear map $\mathcal{L}$ such that a representation of the form (2.7) holds, i.e.

(2.8)  $\mathrm{cmplx}(\mathcal{F}) := \min\big\{ \mathrm{size}(\Psi) \,:\, \mathcal{F} = \Psi \circ \mathcal{L} \text{ with } \Psi, \mathcal{L} \text{ as in (2.7)} \big\},$

where the minimum is taken over all possible representations (2.7).
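A minimal illustration of a functional of neural network-type, under the assumption (ours, for illustration only) that the linear map is realized by quadrature approximations of a few inner products, is the following sketch:

```python
import numpy as np

def relu_net(x, weights, biases):
    """Plain ReLU network used as the finite-dimensional core Psi in (2.7)."""
    for A, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(A @ x + b, 0.0)
    return weights[-1] @ x + biases[-1]

def make_functional(profiles, grid, weights, biases):
    """Return F(u) = Psi(L(u)), where L(u) collects quadrature approximations of
    the inner products of u with a few fixed profile functions (a linear map)."""
    dx = grid[1] - grid[0]
    def L(u_values):
        return np.array([np.sum(p * u_values) * dx for p in profiles])
    def F(u_values):
        return relu_net(L(u_values), weights, biases)
    return F

# toy example: encode against two Fourier profiles, then a tiny ReLU network
grid = np.linspace(0.0, 2 * np.pi, 256, endpoint=False)
profiles = [np.sin(grid), np.cos(2 * grid)]
rng = np.random.default_rng(2)
W = [rng.standard_normal((8, 2)), rng.standard_normal((1, 8))]
b = [np.zeros(8), np.zeros(1)]
F = make_functional(profiles, grid, W, b)
print(F(np.sin(grid)))   # evaluate the functional on u(x) = sin(x)
```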
Remark 2.9.
Without loss of generality, we may assume that $\ell \le \mathrm{size}(\Psi)$ in (2.7) and (2.8). Indeed, if this is not the case, then $\ell > \mathrm{size}(\Psi)$ and we can show that it is possible to construct another representation pair in (2.7), consisting of a neural network $\tilde\Psi$, linear map $\tilde{\mathcal{L}}: \mathcal{X} \to \mathbb{R}^{\tilde\ell}$ and $\tilde\ell \in \mathbb{N}$ such that $\tilde\ell \le \mathrm{size}(\tilde\Psi)$: To see why, let us assume that $\ell > \mathrm{size}(\Psi)$. Let $A_0$ be the weight matrix in the first input layer of $\Psi$. Since
$\|A_0\|_0 \le \mathrm{size}(\Psi) < \ell,$
at most $\tilde\ell := \|A_0\|_0$ columns of $A_0$ can be non-vanishing. Write the matrix $A_0 = [a_1, \dots, a_\ell]$ in terms of its column vectors. Up to permutation, we may assume that $a_j = 0$ for $j > \tilde\ell$. We now drop the corresponding columns in the input layer of $\Psi$ and remove these unused components from the output of the linear map $\mathcal{L}$ in (2.7). This leads to a new map $\tilde{\mathcal{L}}: \mathcal{X} \to \mathbb{R}^{\tilde\ell}$, with output components $\tilde{\mathcal{L}}(u)_j = \mathcal{L}(u)_j$ for $j \le \tilde\ell$, and we define $\tilde\Psi$ as the neural network that is obtained from $\Psi$ by replacing the input matrix $A_0$ by $\tilde A_0 := [a_1, \dots, a_{\tilde\ell}]$. Then $\tilde\Psi \circ \tilde{\mathcal{L}} = \Psi \circ \mathcal{L}$, so that $\tilde\Psi$ and $\tilde{\mathcal{L}}$ satisfy a representation of the form (2.7), but the dimension satisfies $\tilde\ell = \|A_0\|_0 \le \mathrm{size}(\Psi) = \mathrm{size}(\tilde\Psi)$; the first equality is by definition of $\tilde\ell$, and the last equality holds because we only removed zero weights from $\Psi$. In particular, this ensures that $\tilde\ell \le \mathrm{size}(\tilde\Psi)$ for this new representation, without affecting the size of the underlying neural network, i.e. $\mathrm{size}(\tilde\Psi) = \mathrm{size}(\Psi)$.
More generally, we can consider an output space which is a function space, consisting of functions with domain $\Omega'$. Given $y \in \Omega'$, we introduce the point-evaluation map,
$\mathrm{ev}_y: v \mapsto v(y).$
Provided that point-evaluation is well-defined for all elements of the space, we can readily extend the above notion to operators, as follows:
Definition 2.10 (Operator of neural network-type).
Let $\mathcal{Y}$ be a function space on which point-evaluation is well-defined. We will say that a (neural) operator $\mathcal{S}: \mathcal{X} \to \mathcal{Y}$ is an “operator of neural network-type”, if for any evaluation point $y \in \Omega'$, the composition $\mathrm{ev}_y \circ \mathcal{S}: \mathcal{X} \to \mathbb{R}$, $u \mapsto \mathcal{S}(u)(y)$, can be written in the form

(2.9)  $\mathrm{ev}_y \circ \mathcal{S}(u) = \Psi_y\big(\mathcal{L}(u)\big),$

where $\mathcal{L}: \mathcal{X} \to \mathbb{R}^{\ell}$ is a linear operator, and $\Psi_y: \mathbb{R}^{\ell} \to \mathbb{R}$ is a ReLU neural network which may depend on the evaluation point $y$. In this case, we define
$\mathrm{cmplx}(\mathcal{S}) := \sup_{y \in \Omega'} \mathrm{cmplx}\big(\mathrm{ev}_y \circ \mathcal{S}\big).$
We next state our main result demonstrating a “curse of parametric complexity” for functionals (and operators) of neural network-type. This is followed by a detailed discussion of the implications of this abstract result for four representative examples of operator learning frameworks: PCA-Net, DeepONet, NOMAD and the Fourier neural operator.
2.3. Main Theorem on Curse of Parametric Complexity
The following result formalizes an analogue of the curse of dimensionality in infinite-dimensions:
Theorem 2.11 (Curse of Parametric Complexity).
Let $K$ be a compact set in an infinite-dimensional Banach space $\mathcal{X}$. Assume that $K$ contains an infinite-dimensional hypercube $Q_\alpha$ for some $\alpha > 0$. Then for any $r \in \mathbb{N}$ and $\delta > 0$, there exists $\bar\epsilon > 0$ and an $r$-times Fréchet differentiable functional $\mathcal{F}^\dagger: \mathcal{X} \to \mathbb{R}$, such that approximation to accuracy $\epsilon \le \bar\epsilon$ by a functional $\mathcal{F}$ of neural network-type,

(2.10)  $\sup_{u \in K} \big| \mathcal{F}^\dagger(u) - \mathcal{F}(u) \big| \le \epsilon,$

implies complexity bound $\mathrm{cmplx}(\mathcal{F}) \ge c\,\exp\!\big(c\,\epsilon^{-\gamma}\big)$; here $c, \gamma > 0$ are constants depending only on $\alpha$, $r$ and $\delta$.
Before providing a sketch of the proof of Theorem 2.11, we note the following simple corollary, whose proof is given in Appendix A.4.
Corollary 2.12.
Let be a compact set in an infinite-dimensional Banach space . Assume that contains an infinite-dimensional hypercube for some . Let be a function space with continuous embedding in . Then for any and , there exists and an -times Fréchet differentiable functional , such that approximation to accuracy by an operator of neural network-type,
(2.11) |
implies complexity bound ; here , are constants depending only on , and .
Proof of Theorem 2.11 (Sketch).
Let be any functional of neural network-type, achieving approximation accuracy (2.10). In view of our definition of in (2.8), to prove the claim, it suffices to show that if is a linear map and is a ReLU neural network, such that
then .
The idea behind the proof of this fact is that if contains a hypercube , then for any , a suitable rescaling of the finite-dimensional cube can be embedded in . More precisely, for any there exists an injective linear map .
If we now consider the composition , then we observe that we have a decomposition , where
is linear, and
is a ReLU neural network. In particular, there exists a matrix , such that for all , and the mapping defines a ReLU neural network , whose size can be bounded by
Using Proposition 2.1, for any , we are then able to construct a functional , mimicking the function constructed in Proposition 2.1, and such that uniform approximation of by (implying similar approximation of by ) incurs a lower complexity bound
where is a constant depending on . For this particular functional , and given the uniform lower bound on above, it then follows that
The first challenge is to make this strategy precise, and to determine the -dependency of the constant . As we will see, this argument leads to a lower bound of roughly the form .
At this point, the argument is still for fixed , and would only lead to an algebraic complexity in . To extend this to an exponential lower bound in , we next observe that if the estimate could in fact be established for all simultaneously, i.e. if we could construct a single functional , for which the lower complexity bound were to hold, then setting on the right would imply that
with a suitable constant, leading to an exponential lower complexity bound for such a functional. The second main challenge is thus to construct a single functional which effectively “embeds” an infinite family of functionals with complexity bounds as above. This will be achieved by defining the functional as a weighted sum of suitable building-block functionals. The detailed proof is provided in Appendix A.3. ∎
Several remarks are in order:
Remark 2.13.
Theorem 2.11 shows rigorously that in general, operator learning suffers from a curse of parametric complexity, in the sense that it is not possible to achieve better than exponential complexity bounds for general classes of operators which are merely determined by their ($C^r$- or Lipschitz-) regularity. As explained above, this is a natural infinite-dimensional analogue of the curse of dimensionality in finite dimensions (cp. Proposition 2.1), and motivates our terminology. We note that the lower bound of Theorem 2.11 qualitatively matches general upper bounds for DeepONets derived in [37]. It would be of interest to determine sharp rates.
Remark 2.14.
Theorem 2.11 is derived for ReLU neural networks. With some effort, the argument could be extended to more general, piecewise polynomial activation functions. While we believe that the curse of parametric complexity has a fundamental character, we would like to point out that, for non-standard neural network architectures, algebraic approximation rates have been obtained [49]; these results build on either “superexpressive” activation functions or other non-standard architectures. Since these networks are not ordinary feedforward ReLU neural networks, the algebraic approximation rates of [49] are not in contradiction with Theorem 2.11. While the parametric complexity of the non-standard neural operators in [49] is exponentially smaller than the lower bound of Theorem 2.11, it is conceivable that storing the neural network weights in practice would require exponential accuracy (number of bits), to account for the highly unstable character of super-expressive constructions.
Remark 2.15.
Theorem 2.11 differs from previous results on the limitations of operator learning frameworks, as e.g. addressed in [35, 51, 34]. Earlier work focuses on the limitations imposed by a linear choice of the reconstruction mapping . In contrast, the results of the present work exhibit -smooth operators and functionals which are fundamentally hard to approximate by neural network-based methods (with ReLU activation), irrespective of the choice of reconstruction.
Remark 2.16.
We finally link our main theorem to a related result for PCA-Net [32, Thm. 3.3], there derived in a complementary setting in which both the input and output spaces are Hilbert spaces; the result of [32] shows that, for PCA-Net, no fixed algebraic convergence rate can be achieved in the operator learning of general smooth operators between Hilbert spaces; this can be viewed as a milder version of the full curse of parametric complexity identified in the present work, expressed by an exponential lower complexity bound in Theorem 2.11.
To further illustrate an implication of Theorem 2.11, we provide the following example:
Example 2.17 (Operator Learning CoD).
Let be a domain. Let be given, with , and consider the compact set
Fix . By Lemma 2.7, contains an infinite-dimensional hypercube for any . Fix such . Applying Theorem 2.11, it follows that there exists a -times Fréchet differentiable functional and constant , such that any family of functionals of neural network-type, achieving accuracy
has complexity at least for . Furthermore, the constants depend only on the parameters .
In the next subsection, we aim to show the relevance of the above abstract result for concrete neural operator architectures. Specifically, we show that three operator learning architectures from the literature are of neural network-type (PCA-Net, DeepONet, NOMAD), and relate our notion of complexity to the required number of tunable parameters for each. Finally, we show that even frameworks which are not necessarily of neural network-type could suffer from a similar curse of parametric complexity. We make this explicit for the Fourier neural operator in subsection 2.5.
2.4. Examples of Operators of Neural Network-Type
We describe three representative neural operator architectures and show that they can be cast in the above framework.
PCA-Net.
We start with the PCA-Net architecture from [2], anticipated in the work [25]. If $\mathcal{X}$ and $\mathcal{Y}$ are Hilbert spaces, then a neural network can be combined with principal component analysis (PCA) for the encoding and reconstruction on the underlying spaces, to define a neural operator architecture termed PCA-Net; the ingredients of this architecture are orthonormal PCA bases $\{\phi_j\}_{j=1}^{D} \subset \mathcal{X}$ and $\{\eta_k\}_{k=1}^{p} \subset \mathcal{Y}$, and a neural network mapping $\Psi: \mathbb{R}^{D} \to \mathbb{R}^{p}$. The encoder is obtained by projection onto the $\phi_j$, whereas the reconstruction is defined by a linear expansion in the $\eta_k$. The resulting PCA-Net neural operator is defined as

(2.12)  $\mathcal{S}(u)(y) = \sum_{k=1}^{p} \Psi_k\big(\langle u, \phi_1 \rangle, \dots, \langle u, \phi_D \rangle\big)\,\eta_k(y).$

Here the neural network $\Psi$, with components $\Psi_1, \dots, \Psi_p$, depends on parameters which are optimized during training of the PCA-Net. The PCA basis functions $\eta_k$, defining the reconstruction, are precomputed from the data using PCA analysis.
Given an evaluation point $y$, the composition $\mathrm{ev}_y \circ \mathcal{S}$, between $\mathcal{S}$ and the point-evaluation mapping $\mathrm{ev}_y$, can now be written in the form,
$\mathrm{ev}_y \circ \mathcal{S}(u) = \Psi_y\big(\mathcal{L}(u)\big),$
where $\mathcal{L}(u) = \big(\langle u, \phi_1 \rangle, \dots, \langle u, \phi_D \rangle\big)$ is a linear mapping, and $\Psi_y(\,\cdot\,) = \sum_{k=1}^{p} \eta_k(y)\,\Psi_k(\,\cdot\,)$, for fixed $y$, is the composition of a neural network with a linear read-out; thus, $\Psi_y$ is itself a neural network for fixed $y$. This shows that PCA-Net is of neural network-type.
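The following sketch shows a PCA-Net-style forward pass in PyTorch; it is an illustration of ours of the structure (2.12), with PCA bases obtained from snapshot matrices by an SVD, and it omits details such as mean-centering of the data. All names and hyperparameters are our own choices.

```python
import torch

class PCANet(torch.nn.Module):
    """Sketch of a PCA-Net style operator (2.12): PCA encode -> MLP -> PCA reconstruct."""

    def __init__(self, input_snapshots, output_snapshots, d_in=20, d_out=20, width=128):
        super().__init__()
        # PCA bases from snapshot matrices of shape (n_samples, n_gridpoints)
        _, _, Vin = torch.linalg.svd(input_snapshots, full_matrices=False)
        _, _, Vout = torch.linalg.svd(output_snapshots, full_matrices=False)
        self.register_buffer("phi", Vin[:d_in])     # input PCA basis (d_in, n_grid_in)
        self.register_buffer("eta", Vout[:d_out])   # output PCA basis (d_out, n_grid_out)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_in, width), torch.nn.ReLU(),
            torch.nn.Linear(width, width), torch.nn.ReLU(),
            torch.nn.Linear(width, d_out),
        )

    def forward(self, u):                    # u: (batch, n_grid_in) grid values of the input
        coeffs = u @ self.phi.T              # encode: project onto input PCA modes
        out_coeffs = self.net(coeffs)        # finite-dimensional neural network map
        return out_coeffs @ self.eta         # reconstruct: linear expansion in output modes

# toy usage with random snapshot data
U = torch.randn(200, 256)
V = torch.randn(200, 128)
model = PCANet(U, V)
print(model(torch.randn(4, 256)).shape)      # -> torch.Size([4, 128])
```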
The following lemma shows that the complexity of PCA-Net gives a lower bound on the number of free parameters for the underlying neural network architecture.
Lemma 2.18 (Complexity of PCA-Net).
Assume that and are Hilbert spaces, so that PCA-Net is well-defined. Any PCA-Net is of neural network-type, and
Note that the dimension is fixed by the underlying function space . Thus, Lemma 2.18 implies a lower complexity bound . The detailed proof is given in Appendix A.5.1. It thus follows from Corollary 2.12 that operator learning with PCA-Net suffers from a curse of parametric complexity:
Proposition 2.19 (Curse of parametric complexity for PCA-Net).
Assume the setting of Corollary 2.12, with , are Hilbert spaces. Then for any and , there exists and an -times Fréchet differentiable functional , such that approximation to accuracy by a PCA-Net
(2.13) |
with encoder , neural network and reconstruction , implies complexity bound ; here , are constants depending only on , , and .
DeepONet.
A conceptually similar approach is followed by the DeepONet of [39], which differs by learning the form of the representation in the output space concurrently with the coefficients, and by allowing for quite general input linear functionals.
The DeepONet architecture defines the encoding by a fixed choice of general linear functionals $\ell_1, \dots, \ell_m: \mathcal{X} \to \mathbb{R}$; these may be obtained, for example, by point evaluation at distinct “sensor points” or by projection onto PCA modes as in PCA-Net. The reconstruction is defined by expansion with respect to a set of functions which are themselves finite dimensional neural networks to be learned. The resulting DeepONet can be written as

(2.14)  $\mathcal{S}(u)(y) = \sum_{k=1}^{p} \beta_k\big(\ell_1(u), \dots, \ell_m(u)\big)\,\tau_k(y).$

Here, both the neural networks $\beta$ (the “branch net”), with components $\beta_1, \dots, \beta_p$, and $\tau$ (the “trunk net”), with components $\tau_1, \dots, \tau_p$, depend on parameters which are optimized during training of the DeepONet.
Given an evaluation point $y$, the composition $\mathrm{ev}_y \circ \mathcal{S}$, with $\mathrm{ev}_y$ the point-evaluation mapping, can again be written in the form,
$\mathrm{ev}_y \circ \mathcal{S}(u) = \Psi_y\big(\mathcal{L}(u)\big),$
where $\mathcal{L}(u) = \big(\ell_1(u), \dots, \ell_m(u)\big)$ is linear, where
$\Psi_y(\,\cdot\,) = \sum_{k=1}^{p} \tau_k(y)\,\beta_k(\,\cdot\,),$
and for fixed $y$ the values $\tau_1(y), \dots, \tau_p(y)$ are just (constant) vectors. Thus, $\Psi_y$ is the composition of a neural network with a linear read-out, and hence is itself a neural network. This shows that DeepONet is of neural network-type.
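A corresponding sketch of the DeepONet structure (2.14), with a branch net acting on sensor values and a trunk net acting on evaluation points, is given below; this is illustrative PyTorch code of ours, not the reference implementation of [39].

```python
import torch

class DeepONet(torch.nn.Module):
    """Sketch of a DeepONet (2.14): branch net on sensor values, trunk net on query points."""

    def __init__(self, n_sensors, coord_dim=1, p=64, width=128):
        super().__init__()
        self.branch = torch.nn.Sequential(
            torch.nn.Linear(n_sensors, width), torch.nn.ReLU(),
            torch.nn.Linear(width, p),
        )
        self.trunk = torch.nn.Sequential(
            torch.nn.Linear(coord_dim, width), torch.nn.ReLU(),
            torch.nn.Linear(width, p),
        )

    def forward(self, u_sensors, y):
        # u_sensors: (batch, n_sensors) input function values at fixed sensor points
        # y:         (n_query, coord_dim) evaluation points of the output function
        b = self.branch(u_sensors)              # (batch, p)   coefficients beta_k(u)
        t = self.trunk(y)                       # (n_query, p) learned basis functions tau_k(y)
        return b @ t.T                          # (batch, n_query) = sum_k beta_k(u) tau_k(y)

model = DeepONet(n_sensors=100)
out = model(torch.randn(8, 100), torch.linspace(0, 1, 50).unsqueeze(-1))
print(out.shape)                                # -> torch.Size([8, 50])
```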
The next lemma shows that the sizes of these networks can be related to the complexity of DeepONet: also in this case, the complexity provides a lower bound on the total number of non-zero degrees of freedom of a DeepONet. The detailed proof is provided in Appendix A.5.2.
Lemma 2.20 (Complexity of DeepONet).
Any DeepONet , defined by a branch-net and trunk-net , is of neural network-type, and
The following result is now an immediate consequence of Corollary 2.12 and the above lemma.
Proposition 2.21 (Curse of parametric complexity for DeepONet).
Assume the setting of Corollary 2.12. Then for any and , there exists and an -times Fréchet differentiable functional , such that approximation to accuracy by a DeepONet
(2.15) |
with branch net and trunk net , implies complexity bound ; here , are constants depending only on , and .
NOMAD
The linearity of the reconstruction step in the output space, for both PCA-Net and DeepONet, imposes a fundamental limitation on their approximation capability [34, 35, 51]. To overcome this limitation, nonlinear extensions of DeepONet have recently been proposed. The following NOMAD architecture [51] provides an example:
NOMAD (NOnlinear MAnifold Decoder) employs encoding by point evaluation at a fixed set of sensor points, or by more general linear functionals $\ell_1, \dots, \ell_m$. The reconstruction in the output space is defined via a neural network $Q$, which depends jointly on encoded output coefficients in $\mathbb{R}^p$ and on the evaluation point $y$; as in the two previous examples, the approximation mapping $\beta: \mathbb{R}^m \to \mathbb{R}^p$ is again a neural network. The resulting NOMAD mapping, for input $u$ and evaluation point $y$, is given by

(2.16)  $\mathcal{S}(u)(y) = Q\big(\beta(\ell_1(u), \dots, \ell_m(u)),\, y\big),$

for $y$ in the output domain. We note that the main difference between DeepONet and NOMAD is that the linear expansion in (2.14) has been replaced by a nonlinear composition with the neural network $Q$. Both neural networks $\beta$ and $Q$ are optimized during training.
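The nonlinear reconstruction can be sketched as follows (our illustration of the structure (2.16); architecture choices are arbitrary): the latent coefficients $\beta(u)$ are concatenated with each evaluation point $y$ and passed through a decoder network.

```python
import torch

class NOMAD(torch.nn.Module):
    """Sketch of a NOMAD operator (2.16): nonlinear reconstruction Q(beta(u), y)."""

    def __init__(self, n_sensors, coord_dim=1, p=32, width=128):
        super().__init__()
        self.encoder = torch.nn.Sequential(       # beta: sensor values -> latent coefficients
            torch.nn.Linear(n_sensors, width), torch.nn.ReLU(),
            torch.nn.Linear(width, p),
        )
        self.decoder = torch.nn.Sequential(       # Q: (latent coefficients, y) -> S(u)(y)
            torch.nn.Linear(p + coord_dim, width), torch.nn.ReLU(),
            torch.nn.Linear(width, width), torch.nn.ReLU(),
            torch.nn.Linear(width, 1),
        )

    def forward(self, u_sensors, y):
        # u_sensors: (batch, n_sensors);  y: (n_query, coord_dim)
        beta = self.encoder(u_sensors)                                  # (batch, p)
        B, Q = beta.shape[0], y.shape[0]
        pairs = torch.cat(
            [beta.unsqueeze(1).expand(B, Q, -1), y.unsqueeze(0).expand(B, Q, -1)], dim=-1
        )                                                               # (batch, n_query, p + coord_dim)
        return self.decoder(pairs).squeeze(-1)                          # (batch, n_query)

model = NOMAD(n_sensors=100)
print(model(torch.randn(4, 100), torch.rand(50, 1)).shape)              # -> torch.Size([4, 50])
```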
Given an evaluation point $y$, the composition $\mathrm{ev}_y \circ \mathcal{S}$ with the point-evaluation can be written in the form,
$\mathrm{ev}_y \circ \mathcal{S}(u) = \Psi_y\big(\mathcal{L}(u)\big),$
where $\mathcal{L}(u) = \big(\ell_1(u), \dots, \ell_m(u)\big)$ is linear, and $\Psi_y(\,\cdot\,) := Q\big(\beta(\,\cdot\,), y\big)$ defines a neural network for fixed $y$. This shows that NOMAD is of neural network-type. Finally, the following lemma provides an estimate on the complexity of NOMAD:
Lemma 2.22 (Complexity of NOMAD).
Any NOMAD operator $\mathcal{S}$, defined by a branch-net $\beta$ and non-linear reconstruction $Q$, is of neural network-type, and

(2.17)  $\mathrm{size}(\beta) + \mathrm{size}(Q) \ge \mathrm{cmplx}(\mathcal{S}).$
The expression on the left hand-side of (2.17) represents the total number of non-zero degrees of freedom in the NOMAD architecture and, as for PCA-Net and DeepONet, it is lower bounded by our notion of complexity. For the proof, we refer to Appendix A.5.3.
The following result is now an immediate consequence of Corollary 2.12 and the above lemma.
Proposition 2.23 (Curse of parametric complexity for NOMAD).
Assume the setting of Corollary 2.12. Then for any and , there exists and an -times Fréchet differentiable functional , such that approximation to accuracy by a NOMAD neural operator
(2.18) |
with neural networks and , implies complexity bound ; here , are constants depending only on , and .
Discussion.
For all examples above, a general lower bound on the complexity of operators of neural network-type implies a lower bound on the total number of degrees of freedom of the particular architecture. In particular, a lower bound on the complexity gave a lower bound on the smallest possible number of non-zero parameters that are needed to implement the architecture in practice. This observation motivates our nomenclature for the complexity.
We emphasize that our notion of complexity only relates to the size of the neural network at the core of these architectures; by design, it does not take into account other factors, such as the additional complexity associated with the practical evaluation of inner products in the PCA-Net architecture, evaluation of linear encoding functionals of DeepONets, or the numerical representation of an (optimal) output PCA basis for PCA-Net or neural network basis for DeepONet. The important point is that the aforementioned factors can only increase the overall complexity of any concrete implementation; correspondingly, our proposed notion of complexity, which neglects some of these additional contributions, can be used to derive rigorous lower bounds on the overall complexity of any implementation.
Remark 2.24.
In the next subsection, 2.5, we will show that even for operator architectures which are not of neural network-type according to the above definition, we may nevertheless be able to link them with an associated operator of neural network-type. Specifically, we will show this for the FNO in Theorem 2.27. There, we will see that the size (number of tunable parameters) of the FNO can be linked to the complexity of an associated operator of neural network-type, and hence lower bounds on the complexity of operators of neural network-type imply corresponding lower bounds on the FNO.
2.5. The Curse of (Parametric) Complexity for Fourier Neural Operators.
The definition of operators of neural network-type introduced in the previous subsection does not include the FNO, a widely adopted neural operator architecture. However, in this subsection we show that a result similar to Theorem 2.11, stated as Theorem 2.27 below, can be obtained for the FNO.
Due to intrinsic constraints on the domain on which FNO can be (readily) applied, we will assume that the spatial domain is rectangular. We recall that an FNO,
can be written as a composition, , of a lifting layer , hidden layers and an output layer . In the following, let us denote by a generic space of functions from to .
The nonlinear lifting layer
is defined by a neural network , depending jointly on the evaluation point and the components of the input function evaluated at , namely . The dimension is a free hyperparameter, and determines the number of “channels” (or the “lifting dimension”).
Each hidden layer , , of an FNO is of the form
where is a nonlinear activation function applied componentwise, is a matrix, the bias is a function and is a non-local operator defined in terms of Fourier multipliers,
where denotes the -th Fourier coefficient of for , is a Fourier multiplier matrix indexed by , and denotes the inverse Fourier transform. In practice, a Fourier cut-off is introduced, and only a finite number of Fourier modes333Throughout this paper denotes the maximum norm on finite dimensional Euclidean space. is retained. In particular, the number of non-zero components of is bounded by . In the following, we will also assume that the bias functions are determined by their Fourier components , .
Finally, the output layer
is defined in terms of a neural network , a joint function of the evaluation point and the components of the output of the previous layer evaluated at , namely .
To define the size of an FNO, we note that its tunable parameters are given by: (i) the weights and biases of the neural network defining the lifting layer, (ii) the components of the matrices in the hidden layers, (iii) the components of the Fourier multipliers, (iv) the Fourier coefficients of the bias functions, and (v) the weights and biases of the neural network defining the output layer. Given an FNO, we define the size of the FNO as the total number of non-zero parameters in this construction. We also follow the convention that for a matrix (or vector) with complex entries, each non-zero entry counts as two parameters, accounting for its real and imaginary parts.
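For illustration, the following sketch implements a single hidden FNO layer in one spatial dimension, following the structure described above: a pointwise matrix plus a Fourier multiplier acting on a finite number of modes. The lifting and output layers, the function-valued bias and the multi-dimensional case are omitted, and all names, channel counts and mode cut-offs are our own choices.

```python
import torch

class SpectralConv1d(torch.nn.Module):
    """Fourier multiplier block: FFT, multiply the lowest modes by learned complex weights, inverse FFT."""

    def __init__(self, channels, n_modes):
        super().__init__()
        self.n_modes = n_modes
        scale = 1.0 / channels
        self.weight = torch.nn.Parameter(
            scale * torch.randn(channels, channels, n_modes, dtype=torch.cfloat)
        )

    def forward(self, v):                        # v: (batch, channels, n_grid)
        v_hat = torch.fft.rfft(v, dim=-1)
        out_hat = torch.zeros_like(v_hat)
        k = min(self.n_modes, v_hat.shape[-1])
        out_hat[..., :k] = torch.einsum("bci,oci->boi", v_hat[..., :k], self.weight[..., :k])
        return torch.fft.irfft(out_hat, n=v.shape[-1], dim=-1)

class FNOLayer(torch.nn.Module):
    """Hidden FNO layer: activation of (pointwise matrix W + constant bias + Fourier multiplier)."""

    def __init__(self, channels, n_modes):
        super().__init__()
        self.spectral = SpectralConv1d(channels, n_modes)
        self.w = torch.nn.Conv1d(channels, channels, kernel_size=1)   # pointwise matrix W (with a constant bias)

    def forward(self, v):
        return torch.relu(self.spectral(v) + self.w(v))

layer = FNOLayer(channels=32, n_modes=12)
print(layer(torch.randn(4, 32, 128)).shape)      # -> torch.Size([4, 32, 128])
```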
Remark 2.25 (FNO approximation of functionals).
If is a (scalar-valued) functional, then we will again identify the output-space with a space of constant functions. In this case, it is natural to add an averaging operation after the last output layer , i.e. we replace by , given by
(2.19) |
This does not introduce any additional degrees of freedom, and ensures that the output is constant. We also note that (2.19) is a special case of a Fourier multiplier, involving only the $k = 0$ Fourier mode.
In the following, we will restrict attention to the approximation of functionals, taking into account Remark 2.25. We first mention the following result, proved in Appendix A.6, which shows that FNOs are not of neural network-type, in general:
Lemma 2.26.
Let be the ReLU activation function. Let denote the -periodic torus. The FNO,
is not of neural network-type.
The fact that FNO is not of neural network-type is closely related to the fact that the Fourier transforms at the core of the FNO mapping cannot be computed exactly. In practice, the FNO therefore needs to be discretized.
A simple discretization of the FNO is readily obtained and commonly used in applications of FNOs. To this end, fix a discretization parameter $N \in \mathbb{N}$, and let the grid consist of $N$ equidistant points in each coordinate direction. A numerical approximation of the FNO is obtained by replacing the Fourier transform and its inverse in each hidden layer by the discrete Fourier transform and its inverse, computed on the equidistant grid. Similarly, the exact average (2.19) is replaced by an average over the grid values. This “discretized” FNO thus defines a mapping
which depends only on the grid values of the input function. In contrast to the underlying (continuous) FNO, the discretization is readily implemented in practice. We expect the discretized FNO to converge to the underlying FNO for sufficiently large $N$. Note also that the discretized FNO has the same number of tunable parameters as the underlying FNO, by construction.
Given these preparatory remarks, we can now state our result on the curse of parametric complexity for FNOs, with proof in Appendix A.7.
Theorem 2.27.
Let be a cube. Let be a compact subset of a Banach space . Assume that contains an infinite-dimensional hypercube for some . Then for any and , there exists and an -times Fréchet differentiable functional , such that approximation to accuracy by a discretized FNO ,
requires either (i) complexity bound , or (ii) discretization parameter ; here , are constants depending only on , and .
Proof.
(Sketch) The proof of Theorem 2.27 relies on the curse of parametric complexity for operators of neural network-type in Theorem 2.11. The first step is to show that discrete FNOs are of neural network-type. As a consequence, Theorem 2.11 implies an exponential lower bound on the complexity of the discretized FNO. The main additional difficulty is that the discrete FNO is not a standard ReLU neural network according to our definition, since it employs a very non-standard architecture involving convolution and Fourier transforms. Hence more work is needed, in order to relate the complexity to the FNO size; while the complexity is the minimal number of parameters required to represent the discretized FNO by an ordinary ReLU network architecture, we recall that the size is defined as the number of parameters defining it via the non-standard FNO architecture. Our analysis leads to an upper bound of the form
(2.20) |
As a consequence of the exponential lower bound on the complexity, inequality (2.20) implies that either the FNO size or the discretization parameter has to be exponentially large as $\epsilon \to 0$, proving the claim.
We would like to point out that the additional factor depending on the number of grid points is natural in view of the fact that even a linear discretized FNO layer, acting by a fixed matrix on the channel components, actually corresponds to a mapping carried out in parallel at all grid points; i.e. the matrix multiplication should be viewed as being applied simultaneously at every grid point. Representing this mapping by an ordinary matrix multiplication requires a number of degrees of freedom proportional to the number of grid points, and thus depends on the discretization parameter in addition to the number of FNO parameters.
For the detailed proof, we refer to Appendix A.7. ∎
Remark 2.28.
Theorem 2.27 shows that the FNO suffers from a similar curse of complexity as do the variants on DeepONet and PCA-Net covered by Theorem 2.11: approximation to accuracy $\epsilon$ of general ($C^r$- or Lipschitz-) operators requires either an exponential number of non-zero degrees of freedom, or requires exponential computational resources to evaluate even a single forward pass. We note the difference in how the curse is expressed in Theorem 2.27 compared to Theorem 2.11; this is due to the fact that the FNO is not of neural network-type (see Lemma 2.26). Instead, as outlined in the proof sketch above, only upon discretization does the FNO define an operator/functional of neural network-type. The computational complexity of this discretized FNO is determined by both the FNO coefficients and the discretization parameter.
2.5.1. Discussion
To overcome the general curse of parametric complexity implied by Theorem 2.11 (and Theorem 2.27), efficient operator learning frameworks therefore have to leverage additional structure present in the operators of interest, going beyond $C^r$- or Lipschitz-regularity. Previous work on overcoming the curse of parametric complexity for operator learning has mostly focused on operator holomorphy [23, 50, 34] and the emulation of numerical methods [15, 30, 34] as two basic mechanisms for overcoming the curse of parametric complexity for specific operators of interest. An abstract characterization of the entire class of operators that allow for efficient approximation by neural operators would be very desirable. Unfortunately, this appears to be out of reach, at the current state of analysis. Indeed, as far as the authors are aware, there does not even exist such a characterization for any class of standard numerical methods, such as finite difference, finite element or spectral, viewed as operator approximators.
The second contribution of the present work is to expose additional structure, different from holomorphy and emulation, that can be leveraged by neural operators. In particular, in the remainder of this paper we identify such structure in the Hamilton-Jacobi equations, and propose a neural operator framework which can build on this structure to provably overcome the curse of parametric complexity in learning the associated solution operator.
3. The Hamilton-Jacobi Equation
In the previous section we demonstrated that, generically, a scaling limit of the curse of dimensionality is to be expected in operator approximation, if only ($C^r$- or Lipschitz-) regularity of the map is assumed. However, we also outlined in the introduction the many cases where specific structure can be exploited to overcome this curse. In the remainder of the paper we show how the curse can be removed for Hamilton-Jacobi equations. To this end we start, in this section, by setting up the theoretical framework for operator learning in this context.
We are interested in deriving error and complexity estimates for the approximation of the solution operator associated with the following Hamilton-Jacobi PDE:
(HJ)  $\partial_t u + H(x, \nabla u) = 0 \quad \text{in } \Omega \times (0, T], \qquad u(\cdot, 0) = u_0 \quad \text{in } \Omega.$

Here $\Omega \subset \mathbb{R}^d$ is a $d$-dimensional domain. To simplify our analysis we consider only the case of a domain with periodic boundary conditions (so that we may identify $\Omega$ with the torus $\mathbb{T}^d$). We denote by $C^r_{\mathrm{per}}$ the space of $r$-times continuously differentiable real-valued functions with periodic derivatives, equipped with its natural norm.
By slight abuse of notation, we will similarly denote by and functions that have regularity in all variables, and that are periodic in the first variable , so that in particular,
belong to , for fixed or
In equation (HJ), $u = u(x, t)$ is a function, depending on the spatial variable $x \in \Omega$ and on time $t$. We will restrict attention to problems for which a classical solution exists. In this setting the initial value problem (HJ) can be solved by the method of characteristics, and a unique solution may be proved to exist for some positive time. We may then define the solution operator mapping the initial condition $u_0$ to the solution at time $t$; the local time of existence will, in general, depend on the input $u_0$. The next two subsections describe, respectively, the method of characteristics and the existence of the solution operator on a time-interval $[0, T]$, for all initial data $u_0$ from a compact set of initial conditions. Thus the solution operator is defined for all $t \in [0, T]$, with $T$ sufficiently small.
We recall that throughout this paper, use of superscript denotes an object defined through construction of an exact solution of (HJ), or objects used in the construction of the solution; identical notation without a superscript denotes an approximation of that object.
3.1. Method of Characteristics for (HJ)
We briefly summarize this methodology; for more details see [19, Section 3.3]. Consider the following Hamiltonian system for $(q(t), p(t))$ defined by

(3.1a)  $\dot q(t) = \nabla_p H\big(q(t), p(t)\big), \qquad \dot p(t) = -\nabla_q H\big(q(t), p(t)\big),$

(3.1b)  $q(0) = x, \qquad p(0) = \nabla u_0(x).$

If the solution $u$ of (HJ) is twice continuously differentiable, then $u$ can be evaluated by solving the ODE (3.1a) with the specific parameterized initial conditions (3.1b). Given these trajectories $q(t)$ and $p(t)$, the values $u(q(t), t)$ can be computed in terms of the “action” along this trajectory:

(3.2)  $u\big(q(t), t\big) = u_0(x) + \int_0^t L\big(q(s), \dot q(s)\big)\,ds,$

where $L$ is the Lagrangian associated with $H$, i.e.

(3.3)  $L(q, v) = \sup_{p \in \mathbb{R}^d} \big( \langle p, v \rangle - H(q, p) \big).$

Equivalently, the solution can be expressed in terms of the solution of the following system of ODEs, for $(q(t), p(t), z(t))$:

(3.4a)  $\dot q = \nabla_p H(q, p), \qquad \dot p = -\nabla_q H(q, p), \qquad \dot z = \langle p, \nabla_p H(q, p) \rangle - H(q, p),$

(3.4b)  $q(0) = x, \qquad p(0) = \nabla u_0(x), \qquad z(0) = u_0(x).$

The system of ODEs (3.4) is defined on $\Omega \times \mathbb{R}^d \times \mathbb{R}$, with a periodic spatial domain $\Omega$. It can be shown that

(3.5)  $p(t) = \nabla u\big(q(t), t\big)$

tracks the evolution of the gradient of $u$ along this trajectory.
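The characteristic system (3.4) can be integrated numerically for any given Hamiltonian; the sketch below (ours) uses a fourth-order Runge-Kutta scheme and an illustrative Hamiltonian $H(q, p) = \tfrac12 |p|^2 + \cos(q_1)$, which is not a choice made in the paper.

```python
import numpy as np

def characteristic_rhs(state, grad_q_H, grad_p_H, H):
    """Right-hand side of the characteristic system (3.4): qdot, pdot, zdot."""
    d = (len(state) - 1) // 2
    q, p = state[:d], state[d:2 * d]
    dq = grad_p_H(q, p)
    dp = -grad_q_H(q, p)
    dz = p @ grad_p_H(q, p) - H(q, p)
    return np.concatenate([dq, dp, [dz]])

def rk4_flow(state0, grad_q_H, grad_p_H, H, t, n_steps=100):
    """Approximate the flow (q(0), p(0), z(0)) -> (q(t), p(t), z(t)) with RK4 steps."""
    dt, s = t / n_steps, np.array(state0, dtype=float)
    for _ in range(n_steps):
        k1 = characteristic_rhs(s, grad_q_H, grad_p_H, H)
        k2 = characteristic_rhs(s + 0.5 * dt * k1, grad_q_H, grad_p_H, H)
        k3 = characteristic_rhs(s + 0.5 * dt * k2, grad_q_H, grad_p_H, H)
        k4 = characteristic_rhs(s + dt * k3, grad_q_H, grad_p_H, H)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return s

# illustrative Hamiltonian H(q, p) = |p|^2 / 2 + cos(q_1) in dimension d = 1
H = lambda q, p: 0.5 * p @ p + np.cos(q[0])
grad_q_H = lambda q, p: np.array([-np.sin(q[0])])
grad_p_H = lambda q, p: p
x, u0, grad_u0 = np.array([0.3]), 0.7, np.array([0.2])     # encoded initial data (x, grad u0(x), u0(x))
print(rk4_flow(np.concatenate([x, grad_u0, [u0]]), grad_q_H, grad_p_H, H, t=0.5))
```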
tracks the evolution of the gradient of along this trajectory. To ensure the existence of solutions to (3.1), i.e. to avoid trajectories escaping to infinity, we make the following assumption on , in which denotes the Euclidean distance on :
Assumption 3.1 (Growth at Infinity).
There exists , such that
(3.6) |
for all .
In the following, we will denote by
(3.7) |
the semigroup (flow map) generated by the system of ODEs (3.4a); existence of this semigroup, and hence of the solution operator , is the topic of the next subsection.
3.2. Short-time Existence of -solutions
The goal of the present section is to show that for any , and for any compact subset , there exists a maximal , such that for any , there exists a solution of (HJ), with property belongs to for all . Thus we may take any in (HJ), for data in . Our proof of this fact relies on the Banach space version of the implicit function theorem; see Appendix B, Theorem B.1, for a precise statement.
In the following, given and given initial values , we define , as the spatial- and momenta-components of the solution of the Hamiltonian ODE (3.1a), respectively; i.e. the solution of (3.1a) is given by for any initial data .
Proposition 3.2 (Short-time Existence of Classical Solutions).
The proof, which may be found in Appendix B, uses the following two lemmas, also proved in Appendix B. The first result shows that, under Assumption 3.1, the semigroup in (3.7) is well-defined, globally in time.
Lemma 3.3.
The following result will be used to show that the method of characteristics can be used to construct solutions, at least over a sufficiently short time-interval.
Lemma 3.4.
Note that the map is defined by the semigroup for ; however it only requires solution of the Hamiltonian system (3.1) for . In contrast to the flowmap , the spatial characteristic map is in general only invertible over a sufficiently short time-interval, leading to a corresponding short-time existence result for the solution operator associated with (HJ) in Proposition 3.2.
We close this section on the short-time existence with the following remark.
Remark 3.5.
Classical solutions of the Hamilton-Jacobi equations can develop singularities in finite time due to the potential crossing of the spatial characteristics emanating from different points. Indeed, if two spatial characteristics emanating from distinct points cross in finite time, then (3.2) does not lead to a well-defined function, and the method of characteristics breaks down. Thus, our existence theorem is generally restricted to a finite time interval $[0, T]$. In important special cases, the crossing of characteristics is ruled out, and classical solutions exist for all time. One such example is the advection equation with velocity field $v$, and corresponding Hamiltonian $H(x, p) = \langle v(x), p \rangle$, for which the complexity of operator approximation is studied computationally in [13].
4. Hamilton-Jacobi Neural Operator: HJ-Net
In this section, we will describe an operator learning framework to approximate the solution operator defined by (HJ) for initial data $u_0$ of sufficient smoothness. The main motivation for our choice of framework is the observation that the flow map associated with the system of ODEs (3.4a) can be computed independently of the underlying solution $u$. Hence, an operator learning framework for (HJ) can be constructed based on a suitable neural network approximation of this flow. Given such an approximation, and upon fixing evaluation points $x_1, \dots, x_N \in \Omega$, we propose to approximate the forward operator of the Hamilton-Jacobi equation (HJ) using the following three steps:
- Step a) encode the initial data $u_0$ by evaluating it, and its gradient, at the points $x_1, \dots, x_N$:
  $u_0 \mapsto \big(x_j, \nabla u_0(x_j), u_0(x_j)\big), \qquad j = 1, \dots, N,$
  with $\nabla u_0(x_j) \in \mathbb{R}^d$;
- Step b) for each $j = 1, \dots, N$, apply the approximate flow to the encoded data, resulting in a map
  $\big(x_j, \nabla u_0(x_j), u_0(x_j)\big) \mapsto \big(q_j, p_j, z_j\big),$
  where $(q_j, p_j, z_j)$ approximates the solution of (3.4) at time $t$ with initial data $\big(x_j, \nabla u_0(x_j), u_0(x_j)\big)$, for $j = 1, \dots, N$;
- Step c) approximate the underlying solution at a fixed time $t$, for $t$ sufficiently small, by interpolating the data (input/output pairs) $\big\{(q_j, z_j)\big\}_{j=1}^N$, leading to a reconstruction map $\mathcal{R}$.
If we let $\mathcal{E}$ denote the projection (encoding) map which takes $u_0$ to the point values $\big(x_j, \nabla u_0(x_j), u_0(x_j)\big)_{j=1}^N$, then, for fixed $t$ sufficiently small, we have defined an approximation of the solution operator, denoted HJ-Net, and defined by

(4.1)  $\mathrm{HJ\text{-}Net}(u_0) := \big(\mathcal{R} \circ \Phi_t \circ \mathcal{E}\big)(u_0),$

where $\Phi_t$ denotes the approximate flow of Step b), applied to each encoded data point, and $\mathcal{R}$ the reconstruction (interpolation) map of Step c).
It is a consequence of Proposition 3.2 that our approximation is well-defined for all inputs from a compact set of initial data, and for all times in some interval $(0, T]$, with $T$ sufficiently small. However, the resulting approximation does not obey the semigroup property with respect to $t$ and should be interpreted as holding for a fixed $t \in (0, T]$, sufficiently small. Furthermore, iterating the map obtained for any such fixed $t$ is not in general possible unless the map takes the input set into itself. The requirement that the solution operator maps the input set into itself would also be required to prove the existence of a semigroup for (HJ); for our operator approximator we would additionally need to ensure periodicity of the reconstruction step, something we do not address in this paper.
If the underlying solution of (HJ) exists up to time and if it is , then the method of characteristics can be applied, and the above procedure would reproduce the underlying solution, in the absence of approximation errors in steps b) and c); i.e. in the absence of approximation errors of the Hamiltonian flow , and in the absence of reconstruction errors. We will study the effect of approximating step b) by use of a ReLU neural network approximation of the flow and by use of a moving least squares interpolation for step c). In the following two subsections we define these two approximation steps, noting that doing so leads to a complete specification of the HJ-Net approximation. This complete specification is summarized in the final Subsection 4.3.
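For orientation, Steps a)-c) can be summarized in the following schematic sketch. The names `flow_net` and `reconstruct` are placeholders for the trained flow surrogate of Subsection 4.1 and the pruned moving least squares map of Subsection 4.2, and we assume for illustration that the encoder supplies both the value and the gradient of the initial data at each evaluation point, as needed to initialize the characteristic variables.

```python
import numpy as np

def hjnet_forward(u0, grad_u0, points, flow_net, reconstruct):
    """Schematic HJ-Net evaluation, following (4.1).

    u0, grad_u0 : callables giving the initial data and its gradient
    points      : array/list of encoding points in the spatial domain
    flow_net    : learned surrogate of the Hamiltonian flow (Step b)
    reconstruct : pruned moving least squares map (Step c)
    """
    d = len(points[0])

    # Step a) encode: evaluate the initial data at the encoding points
    encoded = [np.concatenate([x, grad_u0(x), [u0(x)]]) for x in points]

    # Step b) push each triple (x, grad u0(x), u0(x)) through the approximate flow
    propagated = [flow_net(v) for v in encoded]

    # Step c) keep the new locations and values, then reconstruct u(., t)
    scattered_points = np.array([out[:d] for out in propagated])
    scattered_values = np.array([out[-1] for out in propagated])
    return reconstruct(scattered_points, scattered_values)
```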
4.1. Step b) ReLU Network
We seek an approximation to in the form of a ReLU neural network (2.1), as summarized in section 2.1.1, with input and output dimensions , and taking the concatenated input to its image in . We use -periodicity to identify the output in the first components (the spatial variable ) with a unique element in . With slight abuse of notation, this results in a well-defined mapping
Let denote a probability measure on . We would like to choose to minimize the loss function
In practice we achieve this by an empirical approximation of , based on i.i.d. samples, and numerical simulation to approximate the evaluation of on these samples. The resulting approximate empirical loss can be approximately minimized by stochastic gradient descent [20, 47], for example.
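A minimal training sketch is given below, under several illustrative assumptions: PyTorch is used for the optimization, the layer widths and sampling measure are arbitrary, and the stand-in routine `simulate_flow` (here the exactly known advection flow) replaces the numerical ODE solver for (3.1)-(3.2) that would be used in practice.

```python
import torch
import torch.nn as nn

d = 2
in_dim = out_dim = 2 * d + 1          # concatenated (x, p, z)

def simulate_flow(xpz, t=0.1, c=torch.tensor([1.0, 0.5])):
    # Stand-in for a numerical solver of (3.1)-(3.2): the exactly known
    # flow of the advection Hamiltonian H(x, p) = c . p, under which x is
    # translated while p and z are constant along characteristics.
    out = xpz.clone()
    out[:, :d] = out[:, :d] + t * c
    return out

# ReLU network playing the role of the flow surrogate of Step b)
model = nn.Sequential(
    nn.Linear(in_dim, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, out_dim),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

inputs = torch.rand(4096, in_dim)     # i.i.d. samples from the chosen measure
targets = simulate_flow(inputs)       # (approximate) evaluations of the exact flow

for epoch in range(100):
    for idx in torch.randperm(inputs.shape[0]).split(64):   # mini-batch SGD
        optimizer.zero_grad()
        loss = loss_fn(model(inputs[idx]), targets[idx])
        loss.backward()
        optimizer.step()
```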
4.2. Step c) Moving Least Squares
In this section, we define the interpolation mapping , employing reconstruction by moving least squares [53]. In general, given a function , here assumed to be defined on the domain , and given a set of (scattered) data where for , the method of moving least squares produces an approximation , which is given by the following minimization [53, Def. 4.1]:
(4.2)
Here, denotes the space of polynomials of degree , and is a compactly supported, non-negative weight function. We will assume to be smooth, supported in the unit ball , and positive on the ball with radius .
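To fix ideas, the following sketch evaluates a moving least squares approximant of the form (4.2) at a single point, using affine polynomials and one standard choice of smooth, compactly supported bump as weight function; these concrete choices, and the helper names, are ours and serve only to illustrate the general definition.

```python
import numpy as np

def bump_weight(r):
    # A smooth, compactly supported, non-negative weight supported in the
    # unit ball and positive on a smaller ball (one common choice).
    return np.where(r < 1.0, np.exp(-1.0 / np.maximum(1.0 - r**2, 1e-12)), 0.0)

def mls_eval(x, points, values, delta):
    """Moving least squares value at x, cf. (4.2): fit a polynomial
    (here: affine) to the scattered data by weighted least squares,
    localized at scale delta, and return the fit evaluated at x."""
    w = bump_weight(np.linalg.norm(points - x, axis=1) / delta)
    mask = w > 0
    P = np.hstack([np.ones((mask.sum(), 1)), points[mask] - x])  # basis {1, y - x}
    sw = np.sqrt(w[mask])[:, None]
    coef, *_ = np.linalg.lstsq(sw * P, sw[:, 0] * values[mask], rcond=None)
    return coef[0]   # value of the local polynomial at y = x

# Example: reconstruct a smooth function from scattered samples
rng = np.random.default_rng(0)
pts = rng.random((400, 2))
vals = np.sin(2 * np.pi * pts[:, 0]) * np.cos(2 * np.pi * pts[:, 1])
print(mls_eval(np.array([0.5, 0.25]), pts, vals, delta=0.15))
```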
The approximation error incurred by moving least squares can be estimated in the -norm, in terms of suitable measures of the “density” of the scattered data points , and the smoothness of the underlying function [53]. The relevant notions are defined next, in which denotes the Euclidean distance on .
Definition 4.1.
The fill distance of a set of points for a bounded domain is defined to be
The separation distance of is defined by
A set of points in is said to be quasi-uniform with respect to , if
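The two quantities of Definition 4.1 can be approximated numerically as in the following sketch, in which the domain is represented by a dense sample of probe points; the helper names are ours, and the factor 1/2 in the separation distance follows the convention of [53].

```python
import numpy as np

def fill_distance(points, domain_probes):
    """Approximate fill distance: the largest distance from a probe point
    of the domain to its nearest data point."""
    d2 = ((domain_probes[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1).max())

def separation_distance(points):
    """Separation distance: half the smallest distance between two
    distinct data points (the convention of [53])."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    return 0.5 * np.sqrt(d2.min())

def is_quasi_uniform(points, domain_probes, kappa):
    # Quasi-uniformity with respect to a given distortion constant kappa.
    return fill_distance(points, domain_probes) <= kappa * separation_distance(points)
```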
Combining the approximation error of moving least squares with a stability result for moving least squares, with respect to the input data, leads to the error estimate that we require to estimate the error in our proposed HJ-Net. Proof of the following may be found in Appendix C. The statement involves both the input data set and a set which is supposed to approximate, together with their respective fill-distances and separation distances.
Proposition 4.2 (Error Stability).
Let and consider function , for some fixed regularity parameter . Assume that is quasi-uniform with respect to . Let be approximate interpolation data, and define where, for some and , we have
Using this approximate interpolation data, let be obtained by moving least squares (4.2). Then there exist constants , depending only on , and , such that, for and moving least squares scale parameter , we have
(4.3)
Remark 4.3.
The constants and in the previous proposition can be computed explicitly [53, see Thm. 3.14 and Cor. 4.8].
Proposition 4.2 reveals two sources of error in the reconstruction by moving least squares: The first term on the right-hand side of (4.3) is the error due to the discretization by a finite number of evaluation points . This error persists even in a perfect data setting, i.e. when and . The last two terms in (4.3) reflect the fact that in our intended application to HJ-Net, the evaluation points and the function values are only given approximately, via the approximate flow map , introducing additional error sources.
The proof of Proposition 4.2 relies crucially on the fact that the set of evaluation points is quasi-uniform. In our application to HJ-Net, these points are obtained as the image of under the approximate flow . In particular, they depend on and we cannot ensure any a priori control on the separation distance . Our proposed reconstruction therefore involves the pruning step, stated as Algorithm 1. Lemma C.3 in Appendix C shows that Algorithm 1 produces a quasi-uniform set with the desired properties asserted above.
Given the interpolation by moving least squares and the pruning algorithm, we finally define the reconstruction map as follows:
- (1) Given approximate interpolation data at data points , determine a quasi-uniform subset by Algorithm 1.
- (2) Apply the moving least squares approximation (4.2) to the interpolation data restricted to the pruned, quasi-uniform subset obtained in step (1); this pruned moving least squares procedure is referred to as Algorithm 2.
4.3. Summary of HJ-Net
Thus in the final definition of HJ-Net given in equation (4.1) we recall that denotes the encoder mapping,
The mapping
approximates the flow for each of the triples, , , and is trained from data to minimize an approximate empirical loss. And, finally, the reconstruction is obtained by applying the pruned moving least squares Algorithm 2 to the data at scattered data points and with corresponding values , to approximate the output .
5. Error Estimates and Complexity
Subsection 5.1 contains the statement of the main theorem concerning the computational complexity of HJ-Net, and the high-level structure of the proof. The theorem demonstrates that the curse of parametric complexity is overcome in this problem in the sense that the cost to achieve a given error, as measured by the size of the parameterization, grows only algebraically with the inverse of the error. In the following subsections 5.2 and 5.3, we provide a detailed discussion of the approximation of the Hamiltonian flow and the moving least squares algorithm which, together, lead to the proof of Theorem 5.1. Proofs of the results stated in those two subsections are given in an appendix.
5.1. HJ-Net Beats The Curse of Parametric Complexity
Theorem 5.1 (HJ-Net Approximation Estimate).
Consider equation (HJ) on periodic domain , with initial data and Hamiltonian , where . Assume that satisfies the no-blowup Assumption 3.1. Let be a compact set of initial data, and let where is given by Proposition 3.2. Then there is a constant , such that for any and , there exists a set of points , optimal parameter and neural network such that the corresponding HJ-Net approximation given by (4.1), with Steps b) and c) defined in Subsections 4.1 and 4.2 respectively, satisfies
Furthermore, we have the following complexity bounds: (i) the number of encoding points from Step a) can be bounded by
(5.1)
and (ii) the neural network from Step b), Subsection 4.1, satisfies
(5.2)
Proof.
We first note that, for any and , Proposition 3.2 shows that the solution of (HJ) can be computed by the method of characteristics up to time . Thus is well-defined for any . We break the proof into three steps, relying on propositions established in the following subsections, and then conclude in a final fourth step.
Step 1: (Neural Network Approximation) Let be given by
where denotes the maximum. By this choice of , we have , for all , . By Proposition 5.3, there exists a constant , and a constant , depending only on , , , and the norm , such that the Hamiltonian flow map (3.7) can be approximated by a neural network with
and
(5.3)
Step 2: (Choice of Encoding Points) Fix , to be determined below. Let denote an equidistant grid on with grid spacing . Enumerating the elements of , we write , where we note that there exists a constant depending only on , such that ; equivalently,
(5.4)
For any , let
(5.5)
be the set of image points under the characteristic mapping defined by . Since is a set of points, it follows from the definition of the fill distance that balls of radius cover . A simple volume counting argument then implies that there exists a constant , such that ; equivalently,
(5.6)
Given from Step 1, we now choose , so that by (5.4) and (5.6),
We emphasize that depend only on , and are independent of and . We have thus shown that if is an enumeration of with , then the fill distance of the image points under the characteristic mapping satisfies
(5.7)
In particular, this step defines our encoding points .
Step 3: (Interpolation Error Estimate) Let be the set of evaluation points determined in Step 2. Since is an equidistant grid with grid spacing proportional to , the fill distance of is bounded by , where the constant depends only on . By Proposition 5.6, there exists a (possibly larger) constant , such that for all , the set of image points under the characteristic mapping (5.5), has uniformly bounded fill distance
(5.8)
Furthermore, taking into account (5.7), the upper bound (5.3) implies that
for any . In turn, this shows that the approximate interpolation data , , obtained from the neural network approximation by (orthogonal) projection onto the first and last components, satisfies
where denotes the exact solution of the Hamilton-Jacobi PDE (HJ) with initial data . By Proposition 5.5, there exists a constant , depending only on and , such that the pruned moving least squares approximant obtained by Algorithm 2 with (approximate) data points and corresponding data values , satisfies
(5.9)
Step 4: (Conclusion) By the short-time existence result of Proposition 3.2, there exists , such that for any solution of the Hamilton-Jacobi equation (HJ) with initial data . By definition of the HJ-Net approximation, we have and by definition of the solution operator, . We thus conclude that
for a constant , independent of . Since is arbitrary, replacing by throughout the above argument implies the claimed error and complexity estimate of Theorem 5.1. ∎
Remark 5.2 (Overall Computational Complexity of HJ-Net).
The error and complexity estimate of Theorem 5.1 implies that for moderate dimensions , and for sufficiently smooth input functions , with , the overall complexity of this approach scales at most linearly in : Indeed, the mapping of the data points for requires forward-passes of the neural network , which has internal degrees of freedom. Since the computational complexity of this mapping is thus bounded by . A naive implementation of the pruning algorithm requires at most comparisons. The computational complexity of the reconstruction by the moving least squares method with (pruned) interpolation points and with evaluation points can be bounded by [53, last paragraph of Chapter 4.2]. In particular, the overall complexity to obtain an -approximation of the output function at e.g. the encoding points (with ) is at most linear in .
Theorem 5.1 shows that for fixed and , the proposed HJ-Net architecture can overcome the general curse of parametric complexity in the operator approximation implied by Theorem 2.11 even though the underlying operator does not have higher than -regularity. This is possible because HJ-Net leverages additional structure inherent to the Hamilton-Jacobi PDE (HJ) (reflected in the method of characteristics), and therefore does not rely solely on the -smoothness of the underlying operator .
5.2. Approximation of Hamiltonian Flow and Quadrature Map
In this subsection we quantify the complexity of -approximation by a ReLU network as defined in Subsection 4.1.
Recall that comprises the solution of the Hamiltonian equations (3.1) and the quadrature (3.2), leading to (3.4). An approximation of this Hamiltonian flow map is provided by the following proposition, proved in Appendix D.
Proposition 5.3.
Let . Let , and be given, and assume that satisfies the no-blowup Assumption 3.1 with constant . Then, for any , there exist constants and , such that for all , there is a ReLU neural network satisfying
(5.10)
and satisfying
Remark 5.4.
Using the results of [38, 29, 55], one could in fact improve the size bound of Proposition 5.3 somewhat: neglecting logarithmic terms in , it can be shown that a neural network with suffices. However, this comes at the expense of substantially increasing the depth from a logarithmic scaling , to an algebraic scaling .
5.3. Moving Least Squares Reconstruction Error
In this subsection we discuss error estimates for the reconstruction by moving least squares, based on imperfect input-output pairs, as defined in Subsection 4.2.
We recall that the reconstruction in the HJ-Net approximation is obtained by applying the pruned moving least squares Algorithm 2 to the data , where are obtained as with fixed evaluation points , and where , are determined in terms of a given input function , so that is an approximation of the corresponding solution at time .
In the following, we first derive an error estimate in terms of the fill distance of , in Proposition 5.5. Subsequently, in Proposition 5.6, we provide a bound on the fill distance at time in terms of the fill distance at time . Proof of both propositions can be found in Appendix D.
Proposition 5.5.
Let and fix a regularity parameter . There exist constants such that the following holds: Assume that is a set of evaluation points with fill distance . Then for any -periodic function , and approximate input-output data , such that
the pruned moving least squares approximant of Algorithm 2 with interpolation data points and data values , satisfies
In contrast to Proposition 4.2, Proposition 5.5 includes the pruning step in the reconstruction, and does not assume quasi-uniformity of either the underlying exact point distribution , or of the approximate point distribution . To obtain a bound on the reconstruction error, we can combine the preservation of -regularity implied by the short-time existence Proposition 3.2, with the following:
Proposition 5.6.
Let , let , and assume that the Hamiltonian is periodic in and satisfies Assumption 3.1 with constant . Let be a compact subset and fix . Then there exists a constant , such that for any set of evaluation points , and for any , the image points
under the spatial characteristic mapping satisfy the following uniform bound on the fill distance:
6. Conclusions
The first contribution of this work is to study the curse of dimensionality in the context of operator learning, here interpreted in terms of the infinite dimensional nature of the input space. We state a theorem which, for the first time, establishes a general form of the curse of parametric complexity, a natural scaling limit of the curse of dimensionality in high-dimensional approximation, characterized by lower bounds which are exponential in the desired error. The theorem demonstrates that in general it is not possible to obtain complexity estimates, for the size of the approximating neural operator, that grow algebraically with inverse error unless specific structure in the underlying operator is leveraged; in particular, we prove that this additional structure has to go beyond - or Lipschitz-regularity. This considerably generalizes and strengthens earlier work on the curse of parametric complexity in [32], where a mild form of this curse had been identified for PCA-Net. As shown in the present work, our result applies to many proposed operator learning architectures including PCA-Net, the FNO, and DeepONet and recent nonlinear extensions thereof. The lower complexity bound in this work is obtained for neural operator architectures based on standard feedforward ReLU neural networks, and could likely be extended to feedforward architectures employing piecewise polynomial activations. It is of note that algebraic complexity bounds, which seemingly overcome the curse of parametric complexity of the present work, have recently been derived for the approximation of Lipschitz operators [49]; these results build on non-standard neural network architectures with either superexpressive activation functions, or non-standard connectivity, and therefore do not contradict our results.
The second contribution of this paper is to present an operator learning framework for Hamilton-Jacobi equations, and to provide a complexity analysis demonstrating that the methodology is able to tame the curse of parametric complexity for these PDE. We present the ideas in a simple setting, and there are numerous avenues for future investigation. For example, as pointed out in Remark 3.5, one main limitation of the proposed approach based on characteristics is that we can only consider finite time intervals on which classical solutions of the HJ equations exist. It would be of interest to extend the methodology to weak solutions, after the potential formation of singularities. It would also be of interest to combine our work with the work on curse of dimensionality with respect to dimension of Euclidean space, cited in Section 1. Furthermore, in practice we recommend learning the Hamiltonian flow, which underlies the method of characteristics for the HJ equation, using symplectic neural networks [27]. However, the analysis of these neural networks is not yet developed to the extent needed for our complexity analysis in this paper, which builds on the work in [54]. Extending the analysis of symplectic neural networks, and using this extension to analyze generalizations of HJ-Net as defined here, are interesting directions for future study. Finally, we note that it is of interest to iterate the learned operator. In order to do this, we would need to generalize the error estimates to the topology. This could be achieved either by interpolation between higher-order bounds of the proposed methodology for combined with the existing error analysis, or by using the gradient variable in the interpolation.
Acknowledgments The work of SL is supported by Postdoc.Mobility grant P500PT-206737 from the Swiss National Science Foundation. The work of AMS is supported by a Department of Defense Vannevar Bush Faculty Fellowship.
References
- [1] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
- [2] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.
- [3] N. Boullé, C. J. Earls, and A. Townsend. Data-driven discovery of Green’s functions with human-understandable deep learning. Scientific Reports, 12(1):1–9, 2022.
- [4] N. Boullé and A. Townsend. Learning elliptic partial differential equations with randomized linear algebra. Foundations of Computational Mathematics, pages 1–31, 2022.
- [5] T. Chen and H. Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
- [6] S.-N. Chow and J. K. Hale. Methods of Bifurcation Theory, volume 251. Springer Science & Business Media, 2012.
- [7] Y. T. Chow, J. Darbon, S. Osher, and W. Yin. Algorithm for overcoming the curse of dimensionality for time-dependent non-convex Hamilton–Jacobi equations arising from optimal control and differential games problems. Journal of Scientific Computing, 73(2):617–643, 2017.
- [8] Y. T. Chow, J. Darbon, S. Osher, and W. Yin. Algorithm for overcoming the curse of dimensionality for state-dependent Hamilton-Jacobi equations. Journal of Computational Physics, 387:376–409, 2019.
- [9] O. Christensen et al. An introduction to frames and Riesz bases, volume 7. Springer, 2003.
- [10] S. Dahlke, F. De Mari, P. Grohs, and D. Labate. Harmonic and applied analysis. Appl. Numer. Harmon. Anal, 2015.
- [11] J. Darbon, G. P. Langlois, and T. Meng. Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures. Research in the Mathematical Sciences, 7(3):1–50, 2020.
- [12] J. Darbon and S. Osher. Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1):1–26, 2016.
- [13] M. De Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in operator learning with neural networks. Journal of Machine Learning, 1:3:299–341, 2022.
- [14] M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA Journal on Uncertainty Quantification, 11(2):480–513, 2023.
- [15] B. Deng, Y. Shin, L. Lu, Z. Zhang, and G. E. Karniadakis. Approximation rates of DeepONets for learning operators arising from advection–diffusion equations. Neural Networks, 153:411–426, 2022.
- [16] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021.
- [17] R. A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.
- [18] D. L. Donoho. Sparse components of images and optimal atomic decompositions. Constructive Approximation, 17:353–382, 2001.
- [19] L. C. Evans. Partial Differential Equations. American Mathematical Society, Providence, R.I., 2010.
- [20] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT press, 2016.
- [21] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. Constructive approximation, 55(1):259–367, 2022.
- [22] C. Heil. A basis theory primer: expanded edition. Springer Science & Business Media, 2010.
- [23] L. Herrmann, C. Schwab, and J. Zech. Deep ReLU neural network expression rates for data-to-qoi maps in Bayesian PDE inversion. SAM Research Report, 2020, 2020.
- [24] L. Herrmann, C. Schwab, and J. Zech. Neural and GPC operator surrogates: Construction and expression rate bounds. arXiv preprint arXiv:2207.04950, 2022.
- [25] J. S. Hesthaven and S. Ubbiali. Non-intrusive reduced order modeling of nonlinear problems using neural networks. Journal of Computational Physics, 363:55–78, 2018.
- [26] N. Hua and W. Lu. Basis operator network: A neural network-based model for learning nonlinear operators via neural basis. Neural Networks, 164:21–37, 2023.
- [27] P. Jin, Z. Zhang, A. Zhu, Y. Tang, and G. E. Karniadakis. Sympnets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks, 132:166–179, 2020.
- [28] Y. Khoo, J. Lu, and L. Ying. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3):421–435, 2021.
- [29] M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
- [30] N. B. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for Fourier neural operators. Journal of Machine Learning Research, 22(1), 2021.
- [31] N. B. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89), 2023.
- [32] S. Lanthaler. Operator learning with PCA-Net: Upper and lower complexity bounds. Journal of Machine Learning Research, 24(318), 2023.
- [33] S. Lanthaler, Z. Li, and A. M. Stuart. The nonlocal neural operator: universal approximation. arXiv preprint arXiv:2304.13221, 2023.
- [34] S. Lanthaler, S. Mishra, and G. E. Karniadakis. Error estimates for DeepONets: A deep learning framework in infinite dimensions. Transactions of Mathematics and Its Applications, 6(1), 2022.
- [35] S. Lanthaler, R. Molinaro, P. Hadorn, and S. Mishra. Nonlinear reconstruction for operator learning of PDEs with discontinuities. In Eleventh International Conference on Learning Representations, 2023.
- [36] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. In Ninth International Conference on Learning Representations, 2021.
- [37] H. Liu, H. Yang, M. Chen, T. Zhao, and W. Liao. Deep nonparametric estimation of operators between infinite dimensional spaces. Journal of Machine Learning Research, 25(24):1–67, 2024.
- [38] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
- [39] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
- [40] C. Marcati and C. Schwab. Exponential convergence of deep operator networks for elliptic partial differential equations. SIAM Journal on Numerical Analysis, 61(3):1513–1545, 2023.
- [41] H. Mhaskar, Q. Liao, and T. Poggio. Learning real and boolean functions: When is deep better than shallow. Technical report, Center for Brains, Minds and Machines (CBMM), arXiv, 2016.
- [42] H. N. Mhaskar and N. Hahm. Neural networks for functional approximation and system identification. Neural Computation, 9(1):143–159, 1997.
- [43] N. H. Nelsen and A. M. Stuart. The random feature model for input-output maps between Banach spaces. SIAM Journal on Scientific Computing, 43(5):A3212–A3243, 2021.
- [44] J. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive Approximation, 55(1):537–582, 2022.
- [45] D. Patel, D. Ray, M. R. Abdelmalik, T. J. Hughes, and A. A. Oberai. Variationally mimetic operator networks. Computer Methods in Applied Mechanics and Engineering, 419:116536, 2024.
- [46] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
- [47] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
- [48] T. D. Ryck and S. Mishra. Generic bounds on the approximation error for physics-informed (and) operator learning. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022.
- [49] C. Schwab, A. Stein, and J. Zech. Deep operator network approximation rates for lipschitz operators. arXiv preprint arXiv:2307.09835, 2023.
- [50] C. Schwab and J. Zech. Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ. Analysis and Applications, 17(01):19–55, 2019.
- [51] J. Seidman, G. Kissas, P. Perdikaris, and G. J. Pappas. NOMAD: Nonlinear manifold decoders for operator learning. Advances in Neural Information Processing Systems, 35:5601–5613, 2022.
- [52] E. M. Stein and R. Shakarchi. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2009.
- [53] H. Wendland. Scattered Data Approximation, volume 17. Cambridge university press, 2004.
- [54] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
- [55] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. Advances in Neural Information Processing Systems, 33:13005–13015, 2020.
- [56] H. You, Q. Zhang, C. J. Ross, C.-H. Lee, and Y. Yu. Learning deep implicit Fourier neural operators (IFNOs) with applications to heterogeneous material modeling. Computer Methods in Applied Mechanics and Engineering, 398:115296, 2022.
- [57] Z. Zhang, L. Tat, and H. Schaeffer. BelNet: Basis enhanced learning, a mesh-free neural operator. Proceedings of the Royal Society A, 479, 2023.
- [58] Y. Zhu and N. Zabaras. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics, 366:415–447, 2018.
Appendix A Proof of Curse of Parametric Complexity
A.1. Preliminaries in Finite Dimensions
Our goal in this section is to prove Proposition 2.1, which we repeat here:
See 2.1
To this end, we recall and extend several results from [54] on the ReLU neural network approximation of functions . Subsequently, these results will be used as building blocks to construct functionals in the infinite-dimensional context, leading to a curse of parametric complexity in that setting, made precise in Theorem 2.11. Here, we consider the space consisting of -times continuously differentiable functions . We denote by
the unit ball in . For technical reasons, it will often be more convenient to consider the subset consisting of functions vanishing at the origin, .
Remark A.1.
Let be a function. Following [54, Sect. 4.3], we will denote by the minimal number of hidden computation units that is required to approximate to accuracy by an arbitrary ReLU feedforward network , uniformly over the unit cube ; i.e. is the minimal integer such that there exists a ReLU neural network with at most computation units (the number of computation units equals the number of scalar evaluations of the activation in a forward-pass, cf. [1]), and such that
Remark A.2.
We emphasize that, even though the domain of a function is by definition the whole space , the above quantity only relates to the approximation of over the unit cube . The reason for this seemingly awkward choice is that it will greatly simplify our arguments later on, when we construct functionals on infinite-dimensional Banach spaces with a view towards proving an infinite-dimensional analogue of the curse of dimensionality.
Note that the number of (non-trivial) hidden computation units of a neural network obeys the bounds : The lower bound follows from the fact that each (non-trivial) computation unit has at least one non-zero connection or a bias feeding into it. To derive the upper bound, we note that any network with at most computation units can be embedded in a fully connected enveloping network (allowing skip-connections) [54, cf. Fig. 6(a)] with depth , width , where each hidden node is connected to all other hidden nodes plus the output, and where each node in the input layer is connected to all hidden units plus the output. In addition, we take into account that each computation unit and the output layer have an additional bias. The existence of this enveloping network thus implies the size bound
valid for any neural network with at most computation units.
In view of the lower size bound, , Proposition 2.1 above is now clearly implied by the following:
Proposition A.3.
Fix . There is a universal constant and a constant , depending only on , such that for any , there exists a function for which
As an immediate corollary of Proposition A.3, we conclude that cannot be approximated to accuracy by a family of ReLU neural networks with
The latter conclusion was established by Yarotsky [54, Thm. 5] (with ). Proposition 2.1 improves on Yarotsky’s result in two important ways: firstly, it implies that the lower bound holds for all , and not only along an (unspecified) sequence ; secondly, it shows that the constant can be chosen independently of . This generalization of Yarotsky’s result will be crucial for the extension to the infinite-dimensional case, which is the goal of this work.
To prove Proposition A.3, we will need two intermediate results, which build on [54]. We start with the following lemma providing a lower bound on the required size of a fixed neural network architecture, which is able to approximate arbitrary to accuracy . This result is based on [54, Thm. 4(a)], but explicitly quantifies the dependence on the dimension . This dependence was left unspecified in earlier work.
Lemma A.4.
Fix . Let be a ReLU neural network architecture depending on parameters . There exists a constant , such that if
for some , then .
Proof.
The proof in [54] is based on the Vapnik-Chervonenkis (VC) dimension. We will not repeat the entire argument here, but instead discuss only the required changes in the proof of Yarotsky, which are needed to prove our extension of his result. In fact, there are only two differences between the proof of our Lemma A.4 and [54, Thm. 4(a)], which we now point out: The first difference is that we consider , consisting of functions vanishing at the origin, whereas [54] considers the unit ball in . Nevertheless, the proof of [54] applies to our setting in a straightforward way. To see this, we recall that the construction in [54, Proof of Thm. 4, eq. (35)], considers of the form
(A.1)
with coefficients to be determined later, and where is a bump function, which is required to satisfy the following condition (we note that choosing the to define the set where , rather than the Euclidean distance as in [54], is immaterial; the only requirement is that the supports of the functions on the right-hand side in the definition (A.1) do not overlap):
(A.2)
In (A.1), the points are chosen such that the -distance between any two of them is at least . We can easily ensure that , by choosing the points to be of the form , where . Then, since for any multi-index of order , the mixed derivative
any function of the form (A.1) belongs to , provided that
where
(A.3)
As shown in [54], the above construction allows one to encode arbitrary Boolean values by constructing suitable ; this in turn can be used to estimate a VC-dimension related to from below, following verbatim the arguments in [54]. As no changes are required in this part of the proof, we will not repeat the details here; instead, following the argument leading up to [54, eq. (38)], and under the assumptions of Lemma A.4, Yarotsky’s argument immediately implies the following lower bound
where is an absolute constant.
To finish our proof of Lemma A.4, it remains to determine the dependence of the constant in (A.3), on the parameters and . To this end, we construct a specific bump function . Recall that our only requirement on in the above argument is that (A.2) must hold. To construct suitable , let , be a smooth bump function such that , and for . We now make the particular choice
Let be a multi-index with . Let denote the number of non-zero components of . We note that . We thus have
In particular, we conclude that can be bounded from below by , where clearly only depends on , and is independent of the ambient dimension . The claimed inequality of Lemma A.4 now follows from
i.e. we have with constant
depending only on . ∎
While Lemma A.4 applies to a fixed architecture capable of approximating all , our goal is to instead construct a single for which a similar lower complexity bound holds for arbitrary architectures. The construction of such will rely on the following lemma, based on [54, Lem. 3]:
Lemma A.5.
Fix . For any (fixed) , there exists , such that
where depends only on .
Proof.
As explained above, any neural network (with potentially sparse architecture), , with computation units can be embedded in the fully connected architecture , , with size bound . By Lemma A.4, it follows that if , then there exists , such that
Expressing the above size bound in terms of , it follows that for any network with computation units, we must have . Thus, approximation of to within accuracy requires at least computation units. From the definition of , we conclude that
for this choice of , with constant depending only on . ∎
Lemma A.5 will be our main technical tool for the proof of Proposition A.3. We also recall that satisfies the following properties, [54, eq. (42)–(44)]:
(A.4a)
(A.4b)
(A.4c)
Proof.
(Proposition A.3) We define a rapidly decaying sequence , by and , so that by recursion . We also define . For later reference, we note that since for all , we have
(A.5)
Our goal is to construct of the form
We note that , hence if is of the above form then,
implies that irrespective of the specific choice of . In the following construction, we choose throughout. We determine recursively, formally starting from the empty sum, i.e. . In the recursive step, given , we distinguish two cases:
Case 1: Assume that
In this case, we set , and trivially obtain
(A.6)
Case 2: In the other case, we have
Our first goal is to again choose such that (A.6) holds, at least for sufficiently large . We note that, by the general properties of , (A.4c), and the assumption of Case 2:
(A.7)
By Lemma A.5, we can find , such that
(A.8)
This defines our recursive choice of in Case 2. By (A.7) and (A.8), to obtain (A.6) it now suffices to show that
It turns out that there exists depending only on , such that
for all , and where . In fact, upon raising both sides to the exponent , it is straightforward to see that this inequality holds independently of , provided that
where we note that on account of the fact that . Define by
With this choice of , and by construction of , we then have
(A.9)
for all . This is inequality (A.6), and concludes our discussion of Case 2.
Continuing the above construction by recursion, we obtain a sequence , such that (A.9) holds for any such that . Define . We claim that for any we have
To see this, we fix . Choose , such that and . Then,
where the last inequality follows from . The claim of Proposition A.3 thus follows for all , upon redefining the universal constant as . ∎
A.2. Proof of Lemma 2.7
Proof.
(Lemma 2.7) Since the interior of is non-empty, upon a rescaling and shift of the domain we may wlog assume that . Let us define as a suitable re-normalization of the standard Fourier sine-basis, indexed by , and normalized such that . We note that for each we can define a bi-orthogonal functional by
The norm of is easily seen to be bounded by . Hence, for any enumeration , the sequence satisfies the assumptions in the definition of an infinite-dimensional hypercube.
Furthermore, if is an enumeration of , such that is monotonically increasing (non-decreasing), we note that any series of the form
is absolutely convergent in , provided that
Inverting the enumeration for , and setting , we find that
where the number of elements in the lower and upper bounding sets are . We thus conclude that . We also note that each shell , with , contains elements. Thus, we have
The last sum converges if . Thus, choosing sufficiently small then ensures that is a subset of . ∎
A.3. Proof of Theorem 2.11
Proof.
Fix . By assumption on , there exists and a linearly independent sequence of normed elements, such that the set consisting of all of the form
defines a subset .
Step 1: (Finding embedded cubes ) We note that for any and for any choice of , with indices , we have
(A.10)
Set , and let us denote the set of all such by , in the following. Note that up to the constant rescaling by , can be identified with the -dimensional unit cube . In particular, since , it follows that contains a rescaled copy of for any such . Furthermore, it will be important in our construction that all of the embedded cubes, defined by (A.10) for , only intersect at the origin.
By Proposition A.3, there exist constants , independent of , such that given any , we can find , , for which the following lower complexity bound holds:
(A.11)
Our aim is to construct , such that the restriction to the cube “reproduces” this . If this can be achieved, then our intuition is that embeds all for , , at once, and hence a rescaled version of the lower complexity bound should hold for any . Our next aim is to make this precise, and determine the implicit constant that arises due to the fact that is only a rescaled version of .
Step 2: (Construction of ) To construct suitable , we first recall that we assume the existence of “bi-orthogonal” elements in the continuous dual space , such that
and furthermore, there exists a constant , such that for all . Given the functions from Step 1, we now make the following ansatz for :
(A.12)
Here (for ) satisfies (A.11) and . We note in passing that defines a -times Fréchet differentiable functional. This will be rigorously shown in Lemma A.7 below (with ).
Step 3: (Relating with ) We next show in which way “the restriction to the cube reproduces ”. Note that if is of the form (A.10) with and with coefficients
then, if ,
while for we find
From the fact that for all by construction (recall that ), we conclude that
(A.13)
for any . In this sense, “ reproduces ”.
Step 4: (Lower complexity bound, uniform in ) Let be a family of operators of neural network-type, such that
By assumption on being of neural network-type, there exists , a linear mapping , and a neural network representing :
For , , define by
Then, since is a linear mapping, there exists a matrix , such that for all . In particular, it follows that
By (A.10), we clearly have . Let . It now follows that
Let denote the number of hidden computation units of . By construction of (cp. Proposition A.3), we have
whenever . On the other hand, we can also estimate (cp. equation (2.2) in Section 2.1.2),
Combining these bounds, we conclude that
holds for any neural network representation of , whenever . And hence
(A.14)
whenever . By Lemma A.6 below, taking the supremum on the right over all admissible implies the lower bound
where depend only on and . Recalling our choice of , and the fact that the constant depends only on , while is universal, it follows that
with depending only on . Up to a change in notation, this is the claimed complexity bound. ∎
The following lemma addresses the optimization of the lower bound in (A.14):
Lemma A.6.
Let , and be given. Assume that
for any of the form , , and whenever . Fix a small parameter . There exist , depending only on , such that
Proof.
Write for . Fix a small parameter . Since we restrict attention to , we have
provided that . Given , choose , such that
Let . Note that this defines a function . For any , we can write
(A.15)
for some . We now note that for ,
Decreasing the size of further, we can ensure that for ,
Note also that for any of the form , . It thus follows that for any given , and with our particular choice of satisfying (A.15), we have
Note that this in particular implies that . We conclude that
Upon defining as
we obtain the lower bound
This concludes the proof of Lemma A.6. ∎
In the lemma below, we provide a simple result on Fréchet differentiability which was used in our proof of Theorem 2.11:
Lemma A.7.
(Fréchet differentiability of a series) Assume that we are given a bounded family of functions indexed by integers , , such that for all . Let be a sequence of linear functionals, such that for all . Let be given, and assume that . Then, the functional defined by the series
is -times Fréchet differentiable.
Proof.
By assumption on and the linearity of the functionals , each nonlinear functional in the series defining is -times continuously differentiable. Fixing , let us denote . The -th total derivative of () is given by
By assumption, we have
Since the sum over has terms, and since the functionals are bounded by assumption, we can now readily estimate the operator norm for by
In particular, for any , the series
is uniformly convergent. Thus, is a uniform limit of -times continuously differentiable functionals, all of whose derivatives of order are also uniformly convergent. From this, it follows that is itself -times continuously Fréchet differentiable. ∎
A.4. Proof of Corollary 2.12
Let be a function space with continuous embedding in . We will only consider the case , the case following by an almost identical argument. Let be a non-trivial function. Since is continuously embedded in , it follows that point evaluation is continuous. Given that is non-trivial, there exists , such that . We may wlog assume that . Let be a functional exposing the curse of parametric complexity, as constructed in Theorem 2.11. We define an -times Fréchet differentiable operator by .
The claim now follows immediately by observing that
and by noting that if is an operator of neural network-type, then is a functional of neural network-type, and by assumption, with ,
By our choice of , this implies that the complexity of is lower bounded by an exponential bound for some constant . This in turn implies that
This lower bound implies the exponential lower bound of Corollary 2.12.
A.5. Proofs of Lemmas 2.18 – 2.22
A.5.1. Proof of Lemma 2.18
Proof.
(Lemma 2.18) We want to show that a PCA-Net neural operator is of neural network-type, and aim to estimate in terms of . To this end, we assume that is a Hilbert space of functions. Since is by definition linear, given an evaluation point , the mapping defines a linear map . We can represent by matrix multiplication: , with . The encoder is linear by definition; thus we can take for the linear map in the definition of “operator of neural network-type”. Define a neural network by . Then, we have the identity
for all . This shows that is of neural network-type. We now aim to estimate in terms of . To this end, write with component mappings . Let be the subset of indices for which is not the zero function. Define a (sparsified) matrix with -th column defined by
Then, we have , and identity for all . Thus, using the concept of sparse concatenation (2.3), we can upper bound the complexity, , in terms of the of the neural network as follows:
This is the claimed lower bound on . ∎
A.5.2. Proof of Lemma 2.20
Proof.
(Lemma 2.20) We observe that with , for any the encoder is linear, and
where defines a neural network, . Thus, DeepONet is of neural network-type. To estimate the complexity, , we let be the set of indices , such that the -th component, , of the vector is non-zero. Let be the matrix with entries , so that for all . Note that has precisely non-zero entries, and that , since is non-zero for all . Thus, it follows that
∎
A.5.3. Proof of Lemma 2.22
Proof.
(Lemma 2.22) To see the complexity bound, we recall that for any , we can choose to obtain the representation
where is given by the linear encoder. The composition of two neural networks and can be represented by a neural network of size at most . We thus have the lower bound,
This shows the claim. ∎
A.6. Proof of Lemma 2.26
Proof.
Our aim is to show that ,
is not of neural network-type. We argue by contradiction. Suppose that were of neural network-type. By definition, there exists a linear mapping , and a neural network , for some such that
(A.16)
In the following we will consider for . Since for all , and for , we have
(A.17)
Now, fix any , and consider , . Since and are linear mappings, it follows that
is a linear mapping. Represent this linear mapping by a matrix . In particular, by (A.16), we have
(A.18)
Since , it follows that is nontrivial. Let be an element in the kernel, and consider . Since is not identically equal to , either or must be positive somewhere in the domain . Upon replacing if necessary, we may wlog assume that for some . From (A.17) and (A.18), it now follows that
A contradiction. Thus, cannot be of neural network-type. ∎
A.7. Proof of curse of parametric complexity for FNO, Theorem 2.27
Building on the curse of parametric complexity for operators of neural network-type, we next show that FNOs also suffer from a similar curse, as stated in Theorem 2.27.
Proof.
(Theorem 2.27) Let be an -times Fréchet differentiable functional satisfying the conclusion of Theorem 2.11 (CoD for functionals of neural network-type). In the following, we show that also satisfies the conclusions of Theorem 2.27. Our proof argues by contradiction: we assume that can be approximated by a family of discrete FNOs satisfying the error and complexity bounds of Theorem 2.27, i.e.
- (1) Complexity bound: There exists a constant , such that the discretization parameter , and the total number of non-zero parameters are bounded by , for all .
- (2) Accuracy: We have
Then we show that this implies the existence of a family of operators of neural network-type , which satisfy for some , and for all sufficiently small ,
- complexity bound ,
- and error estimate
with a potentially different constant. By choice of , the existence of is ruled out by Theorem 2.11, providing the desired contradiction.
In the following, we discuss the construction of : Let be a family of FNOs satisfying (1) and (2) above. Fix for the moment. To simplify notation, we will write , in the following. We recall that, by definition, the discretized FNO can be written as a composition , where
and
are defined by neural networks and , respectively, is of the form
with , Fourier multiplier , where , and () denote discrete (inverse) Fourier transform, and where the bias is determined by its Fourier coefficients , . We also recall that the size of is given by the total number of non-zero components, i.e.
We now observe that, after flattening the tensor
the (linear) mapping can be represented by multiplication against a sparse matrix with at most non-zero entries. For the non-local operator , we note that for , the number of components (channels) that need to be considered is bounded by . Discarding any channels that are zeroed out by , a naive implementation of thus amounts to a matrix representation of a linear mapping , requiring at most non-zero components. Thus, each discretized hidden layer can be represented exactly by an ordinary neural network layer with matrix and bias , satisfying the following bounds on the number of non-zero components:
Similarly, the input and output layers and can be represented exactly by ordinary neural networks and , with
obtained by parallelization of , resp. , at each grid point. Given the above observations, we conclude that, with canonical identification and , the discretized FNO can be represented by an ordinary neural network , , with
By assumption on (for which we aim to show that it leads to a contradiction), we have , and . It follows that
In addition, trivially defines an operator of neural network-type, , by
To see this, we simply note that the point-evaluation mapping is linear, and hence we have the representation
By the above construction, we have for all input functions .
To summarize, assuming that a family of FNOs exists, satisfying (1) and (2) above, we have constructed a family of operators of neural network type, with
and
where is a constant independent of .
By Theorem 2.11, and fixing any , we also have the following lower bound for all sufficiently small :
where is independent of . From the above two-sided bounds, it thus follows that
for all sufficiently small, and where by our choice of : and are constants. Since grows faster than as , this leads to the desired contradiction. We thus conclude that a family of discretized FNOs as assumed above cannot exist. This concludes the proof. ∎
Appendix B Short-time existence of -solutions
The proof of short-time existence of solutions of the Hamilton-Jacobi equation (HJ) is based on the following Banach space implicit function theorem:
Theorem B.1 (Implicit Function Theorem, see e.g. [6, Section 2.2]).
Let , be open subsets of Banach spaces and . Let
be a -mapping (-times continuously Fréchet differentiable). Assume that there exist such that , and such that the Jacobian with respect to , evaluated at the point ,
is a linear isomorphism. Then there exists a neighbourhood of , and a -mapping , such that
Furthermore, is unique, in the sense that for any , implies .
As a first step toward proving short-time existence for (HJ), we prove that under the no-blowup Assumption 3.1, the semigroup (3.7) exists for any .
Proof.
(Lemma 3.3) By classical ODE theory, it is immediate that for any initial data a maximal solution of the ODE system (3.4) exists, and is unique, over a short time-interval. It thus remains to prove that this solution in fact exists globally, i.e. that the solution does not escape to infinity in finite time. Since solves , with a right-hand side that only depends on and , it will suffice to prove that the Hamiltonian trajectory , with initial data exists globally in time. To this end, we note that, by Assumption 3.1, we have
Gronwall’s lemma thus implies that remains bounded for all . This shows that blow-up cannot occur in the -variable in a finite time. On the other hand, the right-hand side of the ODE system is periodic in , and hence (by the bound on ) blow-up is also ruled out for . In particular, Assumption 3.1 ensures that the Hamiltonian trajectory exists globally in time. As argued above, this in turn implies the global existence of the flow map . ∎
Proof.
(Lemma 3.4) Let with . Define
i.e. we set
with , the spatial characteristic mapping of the Hamiltonian system (3.4) at time , where, recall, we have assumed that . Under Assumption 3.1, the spatial characteristic mapping, and hence , is well-defined for any (see Lemma 3.3). Since , the mapping , is . The mapping , is in the first argument, and is a continuous linear mapping in the second argument (and hence infinitely Fréchet differentiable in the second argument). Hence, the composition
is a mapping. And, as a consequence, the mapping is a mapping.
Since
we clearly have
for any and , and the derivative with respect to the last argument
is given by
which defines an isomorphism .
By the implicit function theorem, for any and , there exist , and a mapping
such that for
if, and only if,
Fix for the moment. Since is compact, we can choose a finite number of points , such that
Let
Due to the uniqueness property of each , , all of these mappings have the same values on overlapping domains. Hence, we can combine them into a single map
Furthermore, since , we also have . Similarly, we can cover the compact set by a finite number of open balls , ,
Setting , the uniqueness property of again implies that these mappings agree on overlapping domains. Hence, we can combine them into a global map, and obtain a map
which satisfies for all , and . Furthermore, this is still a map and it is unique, in the sense that
i.e. for any and , we have if and only if . In particular, this shows that for any , , the -mapping
is invertible, with inverse . This implies the claim. ∎
Proof.
(Proposition 3.2) Let be the maximal time such that the spatial characteristic mapping , is invertible, for any and for all . We have , by Lemma 3.4. We denote by the inverse . By the method of characteristics, a solution of the Hamilton-Jacobi equations must satisfy
(B.1)
where , are the trajectories of the Hamiltonian ODE (3.1) with initial data . Given a fixed , we use the above expression to define .
We first observe that , as it is the composition of mappings,
where , are -functions of . In particular, since , this implies that is at least . Evaluating along a fixed trajectory, we find that
(B.2)
Thus, to show that is a classical solution of (HJ), it remains to show that for all . Assume that for the moment. We first note that for the -th component of , we have (with implicit summation over repeated indices, following the Einstein summation convention)
Next, we note that by the invertibility of the spatial characteristic mapping , we can write (B.2) in the form
where . This implies that
We point out that on the last line, we have introduced which is a continuous function of and . Choosing now , so that , we thus obtain
Substitution in the ODE for yields
where is the matrix with components . Since , this implies that
(B.3)
along the trajectory. At this point, the conclusion (B.3) has been obtained under the assumption that , which is only ensured a priori if .
To prove the result also for the case , we can apply the above argument to smooth , , which approximate the given and , and for defined by the method of characteristics (B.1) with , replaced by the smooth approximations , . Then, by the above argument, for any -regularized trajectory , we have
Choosing a sequence such that , as , the corresponding characteristic solution defined by (B.1) (with and in place of and ) converges in . Since , this implies that .
Thus, we conclude that defined by (B.1) is a classical solution of the Hamilton-Jacobi equations (HJ). We finally have to show that, in fact, . -differentiability in space follows readily from the fact that, by (B.3), we have . By the invertibility of the spatial characteristic map, this can be equivalently written in the form
(B.4)
where on the right hand side, , and are all mappings. Thus, is a -function. Furthermore, by (HJ), this implies that is also a function. This allows us to conclude that . The additional bound on follows from the trivial estimate
combined with (B.1) and (B.4), and the fact that is a mapping with continuous dependence on and , and is a -mapping, so that
for any initial data . ∎
Appendix C Reconstruction from scattered data
The purpose of this appendix is to prove Proposition 4.2. This is achieved through three lemmas, followed by the proof of the proposition itself, employing these lemmas.
The first lemma is the following special case of the Vitali covering lemma, a well-known geometric result from measure theory (see e.g. [52, Lemma 1.2]):
Lemma C.1 (Vitali covering).
If and is a set for which the domain
is contained in the union of balls of radius around the , then there exists a subset such that for all , and
Remark C.2.
Given , the proof of Lemma C.1 (see e.g. [52, Lemma 1.2]) shows that the subset of Lemma C.1 can be found by the following greedy algorithm, which proceeds by iteratively adding elements to (the following is in fact the basis for Algorithm 1 in the main text):
(1) Start with and .
(2) Iteration step: given , check whether there exists , such that
  • If yes: define and .
  • If not: terminate the algorithm and set .
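The greedy selection just described translates directly into code. The following is a minimal sketch in which the disjointness test is implemented by requiring each newly kept point to lie at distance greater than 2r from all previously kept points; the function name, the threshold 2r and the brute-force search are illustrative assumptions rather than a verbatim transcription of Algorithm 1.

import numpy as np

def greedy_prune(points, r):
    # Greedy Vitali-type selection (cf. Remark C.2): keep a candidate point only
    # if the ball of radius r around it is disjoint from the balls around all
    # previously kept points, i.e. its distance to each kept point exceeds 2*r.
    kept = []
    for x in points:
        if all(np.linalg.norm(x - y) > 2.0 * r for y in kept):
            kept.append(x)
    return np.array(kept)

# Example: prune a random point cloud in the unit square.
rng = np.random.default_rng(0)
pruned = greedy_prune(rng.random((500, 2)), r=0.05)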
Based on Lemma C.1, we can now state the following basic result:
Lemma C.3.
Given a set with fill distance , the subset determined by the pruning Algorithm 1 has fill distance and separation distance , and is quasi-uniform with distortion constant , i.e.
Proof.
By definition of (cf. Definition 4.1), we have
Let be the subset determined by Algorithm 1 (reproduced in Remark C.2). This subset satisfies the conclusion of the Vitali covering Lemma C.1 with ; thus,
(C.1)
and for all . The inclusion (C.1) implies that
On the other hand, the fact that implies that for all , and hence . Thus, we have
The bound always holds (for convex sets); to see this, choose such that realizes the minimal separation distance. Let be the mid-point , so that . We note that for any , since
i.e. for . We conclude that
∎
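The quantities appearing in Lemma C.3 can be checked numerically for a given point set. The sketch below estimates the fill distance by brute force over a dense sample of the domain, and normalises the separation distance as half the minimal pairwise distance (a common convention in the scattered-data literature, e.g. in [53]); Definition 4.1 may use a different normalisation, and all names here are illustrative.

import numpy as np

def separation_distance(X):
    # Half the minimal pairwise distance of the point set (Wendland-style
    # normalisation; the paper's Definition 4.1 may differ by a constant).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return 0.5 * d.min()

def fill_distance(X, domain_samples):
    # Largest distance from a (densely sampled) domain point to the point set.
    d = np.linalg.norm(domain_samples[:, None, :] - X[None, :, :], axis=-1)
    return d.min(axis=1).max()

# Distortion (quasi-uniformity) constant of a random cloud in the unit square.
rng = np.random.default_rng(1)
X = rng.random((200, 2))
gx, gy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = np.stack([gx.ravel(), gy.ravel()], axis=-1)
print(fill_distance(X, grid) / separation_distance(X))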
Lemma C.4.
Let , let and let . There exists and , such that for all and all , quasi-uniform with respect to , and with fill distance , the approximation error of the moving least squares interpolation (4.2) with exact data , and parameter , is bounded as follows:
Proof.
This is immediate from [53, Corollary 4.8]. ∎
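For readers unfamiliar with moving least squares, the following one-dimensional sketch indicates the kind of approximation referred to in (4.2): at each evaluation point, a low-degree polynomial is fitted by weighted least squares, with a compactly supported weight scaled by a parameter delta proportional to the fill distance. The Wendland-type weight, the polynomial degree and all names are illustrative choices and need not coincide with the construction analysed in [53].

import numpy as np

def wendland_weight(r):
    # Compactly supported weight (1 - r)_+^4 (4 r + 1); an illustrative choice.
    return np.clip(1.0 - r, 0.0, None) ** 4 * (4.0 * r + 1.0)

def mls_eval(x, points, values, delta, degree=1):
    # Moving least squares at a single point x: weighted polynomial fit to the
    # data within distance delta of x, evaluated at x. The basis is centred at
    # x, so the fitted value at x is just the constant coefficient.
    w = wendland_weight(np.abs(points - x) / delta)
    mask = w > 0
    sw = np.sqrt(w[mask])
    V = np.vander(points[mask] - x, degree + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(sw[:, None] * V, sw * values[mask], rcond=None)
    return coeffs[0]

# Example: reconstruct a smooth function from scattered samples.
rng = np.random.default_rng(2)
pts = np.sort(rng.random(200))
vals = np.sin(2 * np.pi * pts)
approx = [mls_eval(x, pts, vals, delta=0.1) for x in np.linspace(0.1, 0.9, 9)]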
Based on Lemma C.4, we can also derive a reconstruction estimate when the interpolation data are known only up to small errors in both the positions and the values ; this is Proposition 4.2, stated in the main body of the text. We now prove this proposition.
Proof.
(Proposition 4.2) Let be a compactly supported function, such that , and for . Define
Note that, by assumption, we have
for all . In particular, this implies that the functions
have disjoint supports for different values of . Thus, for any , we have , and is the (exact) moving least squares approximation of . From Lemma C.4 it follows that
We now note that
On the one hand (due to the disjoint supports of the bump functions), we can now make the estimate
On the other hand (again due to the disjoint supports of the bump functions), we also have
These two estimates imply that
Taking into account that is bounded, that is independent of , and introducing a new constant , we obtain
as claimed. ∎
Appendix D Complexity estimates for HJ-Net approximation
In the last section, we have shown that given a set of initial data , with and , the method of characteristics is applicable for a positive time . In the present section, we will combine this result with the proposed HJ-Net framework to derive quantitative error and complexity estimates for the approximation of the solution operator of (HJ). This will require estimates for the approximation of the Hamiltonian flow by neural networks, as well as an estimate for the reconstruction error resulting from the pruned moving least squares operator .
D.1. Approximation of Hamiltonian flow
Proof.
(Proposition 5.3) It follows from [54, Theorem 1] (and a simple scaling argument) that there exists a constant , such that for any , there exists a neural network satisfying the bound,
(D.1)
and such that
where denotes the norm on the relevant domain. To prove the claim, we note that for any trajectory satisfying the Hamiltonian ODE system
(D.2)
with initial data , we have by Assumption 3.1:
Integration of this inequality yields . Taking also into account that , we conclude that , where depends only on , and . Since remains uniformly bounded and since is -periodic, it follows that for any the Hamiltonian trajectory starting at stays in .
Recall that is the flow map of the Hamiltonian ODE (D.2) combined with the action integral
(D.3)
Since the Hamiltonian trajectories starting at are confined to , and since the right-hand sides of (D.2) and (D.3) involve only first-order derivatives of , it follows from basic ODE theory that there exists , such that the -norm of the flow can be bounded by
In particular, we finally conclude that there exist constants and , such that for any , there exists a neural network satisfying the bound (D.1), and such that
∎
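The flow-map emulation underlying Proposition 5.3 can be mimicked in a few lines: generate input-output pairs of the time-t Hamiltonian flow, augmented by the action integral (D.3), by numerical integration, and fit a small fully connected network to them. The pendulum Hamiltonian H(x, p) = p^2/2 + 1 - cos(x), the ReLU architecture and the optimiser settings below are illustrative assumptions; they are unrelated to the constructive networks of [54].

import numpy as np
import torch

def flow(x0, p0, t=0.5, n_steps=200):
    # RK4 integration of the Hamiltonian system x' = H_p, p' = -H_x together with
    # the running action a' = p * H_p - H (cf. (D.2)-(D.3)), for the pendulum
    # Hamiltonian H(x, p) = p^2/2 + 1 - cos(x).
    def rhs(s):
        x, p, a = s
        return np.array([p, -np.sin(x), p * p - (0.5 * p * p + 1.0 - np.cos(x))])
    state, dt = np.array([x0, p0, 0.0]), t / n_steps
    for _ in range(n_steps):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * dt * k1)
        k3 = rhs(state + 0.5 * dt * k2)
        k4 = rhs(state + dt * k3)
        state = state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

# Training pairs (x0, p0) -> (x(t), p(t), action(t)) over a bounded set of initial data.
rng = np.random.default_rng(3)
inputs = rng.uniform([-np.pi, -1.0], [np.pi, 1.0], size=(2000, 2))
targets = np.array([flow(x0, p0) for x0, p0 in inputs])

X = torch.tensor(inputs, dtype=torch.float32)
Y = torch.tensor(targets, dtype=torch.float32)
model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), Y)
    loss.backward()
    optimizer.step()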
D.2. Reconstruction error
Proof.
(Proposition 5.5) Let be given with , and let be approximate interpolation data, with
The assertion of this proposition is restricted to satisfying for a constant (to be determined below). In the following, we may assume without loss of generality that . Denote . We recall that the first step in the reconstruction Algorithm 2 consists in an application of the pruning Algorithm 1 to determine pruned interpolation points , such that (cf. Lemma C.3)
Step 1: Write . Our first goal is to show that is quasi-uniform. To this end, we note that, by the definition of the separation distance and the upper bound on the distance between and :
By the definition of the fill distance, and the assumption that and , we also have
implying the upper bound . Similarly, we can show that . Substituting this into the lower bound on above, and using that , we obtain
Thus, we conclude that
Since we always have , we conclude that is quasi-uniform with .
Step 2: Next, we intend to apply Proposition 4.2 with in place of and with . To this end, it remains to show that is bounded by half the separation distance. To see this, we note that, by the above bounds (recall also that ),
showing that . We can thus apply Proposition 4.2 to conclude that there exist constants (with ), such that if , we have
and hence
Replacing and now yields the claimed result of Proposition 5.5. ∎
Proof.
(Proposition 5.6) Let and be given, and denote by the solution of the Hamiltonian ODE, with initial value . Define similarly.
By compactness of , there exists a constant , such that . By continuity of the flow map, there exists , such that
Then, we have
where denotes the Hessian of and is the matrix norm induced by the Euclidean vector norm. Further denoting
Gronwall’s inequality then implies that
Furthermore, since , and , we have , which implies that
Therefore, , satisfy the estimate
with constant
which depends only on , and on (via and ), but is independent of the particular choice of . Thus, if
then
for any . This implies the claim. ∎
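For concreteness, the Gronwall step used in the preceding proof can be recorded schematically as follows, where X, P and \tilde X, \tilde P denote the Hamiltonian trajectories started from initial data (x, p) and (\tilde x, \tilde p), and L is a generic constant bounding the Hessian of the Hamiltonian on the relevant compact set; the symbols and the precise constant are illustrative and may be organised differently in the statement of Proposition 5.6.

\[
  \frac{d}{dt}\Bigl( |X - \tilde X| + |P - \tilde P| \Bigr)
  \;\le\; 2L \Bigl( |X - \tilde X| + |P - \tilde P| \Bigr)
  \qquad\Longrightarrow\qquad
  |X(t) - \tilde X(t)| + |P(t) - \tilde P(t)|
  \;\le\; e^{2Lt} \Bigl( |x - \tilde x| + |p - \tilde p| \Bigr).
\]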