Parametric Complexity Bounds for Approximating PDEs
with Neural Networks
Abstract
Recent experiments have shown that deep networks can approximate solutions to high-dimensional PDEs, seemingly escaping the curse of dimensionality. However, questions regarding the theoretical basis for such approximations, including the required network size, remain open. In this paper, we investigate the representational power of neural networks for approximating solutions to linear elliptic PDEs with Dirichlet boundary conditions. We prove that when a PDE’s coefficients are representable by small neural networks, the parameters required to approximate its solution scale polynomially with the input dimension and proportionally to the parameter counts of the coefficient networks. To this end, we develop a proof technique that simulates gradient descent (in an appropriate Hilbert space) by growing a neural network architecture whose iterates each participate as sub-networks in their (slightly larger) successors, and converge to the solution of the PDE. We bound the size of the resulting neural network, showing a polynomial dependence on the input dimension and no dependence on the volume of the domain.
1 Introduction
A partial differential equation (PDE) relates a multivariate function defined over some domain to its partial derivatives. Typically, one’s goal is to solve for the (unknown) function, often subject to additional constraints, such as the function’s value on the boundary of the domain. PDEs are ubiquitous in both the natural and social sciences, where they model such diverse processes as heat diffusion [Crank and Nicolson, 1947; Özişik et al., 2017], fluid dynamics [Anderson and Wendt, 1995; Temam, 2001], and financial markets [Black and Scholes, 1973; Ehrhardt and Mickens, 2008]. Because most PDEs of interest lack closed-form solutions, computational approximation methods remain a vital and active field of research [Ames, 2014]. For low-dimensional functions, dominant approaches include the finite difference and finite element methods [LeVeque, 2007], which discretize the domain. After partitioning the domain into a mesh, these methods solve for the function value at its vertices. However, these techniques scale exponentially with the input dimension, rendering them unsuitable for high-dimensional problems.
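To make this concrete, here is a small illustration (ours, not from the paper): a one-dimensional finite-difference solve of a Poisson problem, followed by a count of how quickly a uniform mesh grows with the input dimension.

```python
import numpy as np

# Illustration only: a 1D finite-difference solve of -u'' = f on (0, 1) with u(0) = u(1) = 0,
# and a count of how many vertices a uniform mesh needs as the dimension grows.

def solve_poisson_1d(f, n=100):
    """Solve -u'' = f with zero Dirichlet boundary values on a uniform interior grid."""
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)                      # interior grid points
    # Tridiagonal second-difference matrix, scaled by 1/h^2.
    A = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    return x, np.linalg.solve(A, f(x))

x, u = solve_poisson_1d(lambda x: np.pi**2 * np.sin(np.pi * x))
print(np.max(np.abs(u - np.sin(np.pi * x))))          # small discretization error

# A mesh with 100 points per axis needs 100**d vertices: hopeless for large d.
for d in (1, 2, 3, 10):
    print(d, 100**d)
```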
Following breakthroughs in deep learning for approximating high-dimensional functions in such diverse domains as computer vision [Krizhevsky et al., 2012; Radford et al., 2015] and natural language processing [Bahdanau et al., 2014; Devlin et al., 2018; Vaswani et al., 2017], a burgeoning line of research leverages neural networks to approximate solutions to PDEs. This line of work has produced promising empirical results for common PDEs such as the Hamilton-Jacobi-Bellman and Black-Scholes equations [Han et al., 2018; Grohs et al., 2018; Sirignano and Spiliopoulos, 2018]. Because they do not explicitly discretize the domain, and given their empirical success on high-dimensional problems, these methods appear not to suffer the curse of dimensionality. However, these methods are not well understood theoretically, leaving open questions about when they are applicable, what their performance depends on, and just how many parameters are required to approximate the solution to a given PDE.
Over the past three years, several theoretical works have investigated questions of representational power under various assumptions. Exploring a variety of settings, Kutyniok et al. [2019], Grohs et al. [2018], and Jentzen et al. [2018] proved that the number of parameters required to approximate a solution to a PDE exhibits a less-than-exponential dependence on the input dimension for some special parabolic PDEs that admit straightforward analysis. Grohs and Herrmann [2020] consider elliptic PDEs with Dirichlet boundary conditions. However, their rate depends on the volume of the domain, and thus can have an implicit exponential dependence on the dimension (e.g., consider a hypercube with side length greater than one).
In this paper, we focus on linear elliptic PDEs with Dirichlet boundary conditions, which are prevalent in science and engineering (e.g., the Laplace and Poisson equations). Notably, linear elliptic PDEs define the steady state of processes like heat diffusion and fluid dynamics. Our work asks:
Question.
How many parameters suffice to approximate the solution to a linear elliptic PDE up to a specified level of precision using a neural network?
We show that when the coefficients of the PDE are expressible as small neural networks (note that PDE coefficients are functions), the number of parameters required to approximate the PDE’s solution is proportional to the number of parameters required to express the coefficients. Furthermore, we show that the number of parameters depends polynomially on the dimension and does not depend upon the volume of the domain.
2 Overview of Results
To begin, we formally define linear elliptic PDEs.
Definition 1 (Linear Elliptic PDE [Evans, 1998]).
Linear elliptic PDEs with Dirichlet boundary condition can be expressed in the following form:
$$-\operatorname{div}\!\left(A(x)\nabla u(x)\right) + c(x)\,u(x) = f(x), \quad \forall x \in \Omega, \qquad\qquad u(x) = 0, \quad \forall x \in \partial\Omega,$$
where $\Omega \subset \mathbb{R}^d$ is a bounded open set with boundary $\partial\Omega$. Further, for all $x \in \Omega$, $A(x) \in \mathbb{R}^{d \times d}$ is a matrix-valued function, s.t. $A(x) \succ 0$, and $c : \Omega \to \mathbb{R}$, s.t. $c(x) \geq 0$. Here, div denotes the divergence operator: given a vector field $F : \Omega \to \mathbb{R}^d$, $\operatorname{div}(F)(x) = \sum_{i=1}^{d} \frac{\partial F_i(x)}{\partial x_i}$.
We refer to $A$ and $c$ as the coefficients of the PDE. The divergence form in Definition 1 is one of two canonical ways to define a linear elliptic PDE [Evans, 1998] and is convenient for several technical reasons (see Section 4). The Dirichlet boundary condition states that the solution takes a constant value (here $0$) on the boundary $\partial\Omega$.
Our goal is to express the number of parameters required to approximate the solution of a PDE in terms of those required to approximate its coefficients and . Our key result shows:
Theorem (Informal).
If the coefficients and the function are approximable by neural networks with at most parameters, the solution to the PDE in Definition 1 is approximable by a neural network with parameters.
This result, formally expressed in Section 5, may help to explain the practical efficacy of neural networks in approximating solutions to high-dimensional PDEs with boundary conditions [Sirignano and Spiliopoulos, 2018; Li et al., 2020a]. To establish this result, we develop a constructive proof technique that simulates gradient descent (in an appropriate Hilbert space) through the very architecture of a neural network. Each iterate, given by a neural network, is subsumed into the (slightly larger) network representing the subsequent iterate. The key to our analysis is to bound both (i) the growth in network size across consecutive iterates; and (ii) the total number of iterates required.
Organization of the paper
3 Prior Work
Among the first papers to leverage neural networks to approximate solutions to PDEs with boundary conditions are Lagaris et al. [1998], Lagaris et al. [2000], and Malek and Beidokhti [2006]. However, these methods discretize the input space and thus are not suitable for high-dimensional input spaces. More recently, mesh-free neural network approaches have been proposed for high-dimensional PDEs [Han et al., 2018; Raissi et al., 2017, 2019], achieving impressive empirical results in various applications. Sirignano and Spiliopoulos [2018] design a loss function that penalizes failure to satisfy the PDE, training their network on minibatches sampled uniformly from the input domain. They also provide a universal approximation result, showing that for a sufficiently regular PDE, there exists a multilayer network that approximates its solution. However, they do not comment on the complexity of the neural network or how it scales with the input dimension. Khoo et al. [2017] also prove universal approximation power, albeit with networks of size exponential in the input dimension. Recently, Grohs et al. [2018] and Jentzen et al. [2018] provided a better-than-exponential dependence on the input dimension for some special parabolic PDEs, for which simulating a PDE solver by a neural network is straightforward.
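For concreteness, the following is a minimal sketch (ours, with arbitrary architecture and sampling choices) of the kind of mesh-free residual loss such methods use, here for a Poisson equation with a Dirichlet penalty; it is not the construction analyzed in this paper.

```python
import torch

# Sketch of a mesh-free residual loss in the spirit of the methods above: penalize the PDE
# residual, e.g. -Δu = f, at points sampled uniformly from the domain, plus a penalty for
# violating the Dirichlet condition. Architecture, f, and the sampling scheme are placeholders.

d = 5
u_net = torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
f = lambda x: torch.ones(x.shape[0])                    # hypothetical right-hand side

def laplacian(u_net, x):
    x = x.requires_grad_(True)
    u = u_net(x).sum()
    (grad,) = torch.autograd.grad(u, x, create_graph=True)
    lap = 0.0
    for i in range(x.shape[1]):
        lap = lap + torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
    return lap

x_in = torch.rand(128, d)                               # interior samples from [0, 1]^d
x_bd = torch.rand(128, d); x_bd[:, 0] = 0.0             # crude boundary samples
loss = ((-laplacian(u_net, x_in) - f(x_in)) ** 2).mean() \
       + (u_net(x_bd).squeeze(-1) ** 2).mean()
loss.backward()                                         # a training step would follow
```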
Several recent works [Bhattacharya et al., 2020; Kutyniok et al., 2019; Li et al., 2020b, a] show (experimentally) that a single neural network can solve for an entire family of PDEs. They approximate the map from a PDE’s parameters to its solution, potentially avoiding the trouble of retraining for every set of coefficients. Among these, only Kutyniok et al. [2019] provides theoretical grounding. However, they assume the existence of a finite low-dimensional space with basis functions that can approximate this parametric map, and it is unclear when this assumption holds. Our work proves the existence of such maps, under the assumption that the family of PDEs has coefficients described by neural networks with a fixed architecture (Section 7).
In the work most closely related to ours, Grohs and Herrmann [2020] provide approximation rates polynomial in the input dimension for the Poisson equation (a special kind of linear elliptic PDE) with Dirichlet boundary conditions. They introduce a walk-on-spheres algorithm, which simulates a stochastic differential equation that can be used to solve a Poisson equation with Dirichlet boundary conditions (see, e.g., Oksendal [2013]’s Theorem 9.13). The rates provided in Grohs and Herrmann [2020] depend on the volume of the domain, and thus depend implicitly exponentially on the input dimension. Our result handles the boundary condition for the PDE and is independent of the volume of the domain. Further, we note that our results hold for a more general linear elliptic PDE, of which the Poisson equation is a special case.
4 Notation and Definitions
We now introduce several key concepts from PDEs and some notation. For any open set $\Omega \subset \mathbb{R}^d$, we denote its boundary by $\partial\Omega$ and its closure by $\bar{\Omega}$. By $C(\Omega)$, we denote the space of real-valued continuous functions defined over the domain $\Omega$. Furthermore, for $k \in \mathbb{N}$, a function $u$ belongs to $C^k(\Omega)$ if all partial derivatives $\partial^\alpha u$ exist and are continuous for any multi-index $\alpha$ such that $|\alpha| \leq k$. Finally, a function $u \in C^\infty(\Omega)$ if $u \in C^k(\Omega)$ for all $k \in \mathbb{N}$. Next, we define several relevant function spaces:
Definition 2.
For any , .
Definition 3.
For a domain , the function space consists of all functions , s.t. where . This function space is equipped with the inner product
Definition 4.
For a domain and a function , the function space is defined analogously, where .
Definition 5.
For a domain and , we define the Hilbert space as
Furthermore, is equipped with the inner product, and the corresponding norm
Definition 6.
The closure of in is denoted by .
Informally, is the set of functions belonging to that can be approximated by a sequence of functions . This also implies that if a function , then for all . This space (particularly with ) is often useful when analyzing elliptic PDEs with Dirichlet boundary conditions.
Definition 7 (Weak Solution).
Given the PDE in Definition 1, if , then a function solves the PDE in a weak sense if and for all , we have
(1)
The left hand side of (1) is also equal to for all (see Lemma A.1), whereas, following the definition of the norm, the right side is simply . Having introduced these preliminaries, we now introduce some important facts about linear PDEs that feature prominently in our analysis.
Proposition 1.
This proposition is standard (we include a proof in the Appendix, Section A.1 for completeness) and states that there exists a unique solution to the PDE (referred to as ), which is also the solution we get from the variational formulation in (2). In this work, we introduce a sequence of functions that minimizes the loss in the variational formulation.
Definition 8 (Eigenvalues and Eigenfunctions, Evans [1998]).
Given an operator , the tuples , where and are (eigenvalue, eigenfunction) pairs that satisfy , for all . Since , we know that . The eigenvalue can be written as
(3)
where and . Furthermore, we define by the span of the first eigenfunctions of , i.e., .
We note that since the operator is self-adjoint and elliptic (in particular, is compact), the eigenvalues are real and countable. Moreover, the eigenfunctions form an orthonormal basis of (see Evans [1998], Section 6.5).
5 Main Result
Before stating our results, we provide the formal assumptions on the PDEs of interest:
Assumptions:
-
(i)
Smoothness: We assume that . We also assume that the coefficient is a symmetric matrix-valued function, i.e., and for all and the function and for all . Furthermore, we assume that . We define a constant
Further, the function is also in and the projection of onto , which we denote , satisfies for any multi-index : . (Since and the functions and are all in , it follows from Nirenberg [1955] (Theorem, Section 5) that the eigenfunctions of are also . Hence, the function is in as well.)
-
(ii)
Ellipticity: There exist constants such that, for all and ,
-
(iii)
Neural network approximability: There exist neural networks and with parameters, respectively, that approximate the functions and , i.e., and , for small . We assume that for all the operator defined as,
(4)
is elliptic with (eigenvalue, eigenfunction) pairs. We also assume that there exists a neural network with parameters such that for any multi-index , . By , we denote the set of all (infinitely differentiable) activation functions used by the networks , , and . By , we denote the set that contains all the -th order derivatives of the activation functions in .
Intuitively, ellipticity of the operator in a linear PDE is analogous to positive definiteness of the matrix $A$ in a linear system of equations $Ax = b$.
In (iii), we assume that the coefficients and , and the function , can be approximated by neural networks. While this is true for any smooth function given sufficiently many parameters, our results are most interesting when these parameter counts are small (e.g., subexponential in the input dimension ). For many PDEs used in practice, approximating the coefficients with small neural networks is straightforward. For example, in heat diffusion (whose equilibrium is described by a linear elliptic PDE), the coefficient defines the conductivity of the material at the point . If the conductivity is constant, then the coefficients can be written as neural networks with parameters.
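As a small illustration of this remark (our own sketch, not part of the paper), a constant conductivity can be encoded exactly by a single linear layer whose weights are zero and whose bias stores the matrix entries.

```python
import torch

# Sketch: a constant conductivity A(x) = a0 * I is exactly a linear "network" whose weights
# are zero and whose bias stores the d*d matrix entries, so it costs O(d^2) parameters.
# The names below (A_net, a0) are ours.

d, a0 = 10, 2.0
A_net = torch.nn.Linear(d, d * d)                 # output is the flattened d x d matrix
with torch.no_grad():
    A_net.weight.zero_()                          # no dependence on x
    A_net.bias.copy_((a0 * torch.eye(d)).reshape(-1))

x = torch.randn(3, d)
A_of_x = A_net(x).reshape(-1, d, d)               # constant a0 * I at every input point
print(torch.allclose(A_of_x[0], a0 * torch.eye(d)))
```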
The part of assumption (i) that stipulates that is close to can be thought of as a smoothness condition on . For instance, if (the Laplacian operator), the Dirichlet form satisfies , so eigenfunctions corresponding to higher eigenvalues tend to exhibit a higher degree of spikiness. The reader can also think of the eigenfunctions corresponding to larger as Fourier basis functions corresponding to higher frequencies.
Finally, while the requirement in (i) and (iii) that the function pairs (, ) and (, ) be close not only in their values but also in their derivatives is a matter of analytical convenience, our key results do not necessarily depend on this precise assumption. Alternatively, we could replace it with similar (but incomparable) conditions: e.g., we could assume closeness of the values together with a rapid decay of the norms of the derivatives. We require control over the derivatives because our method’s gradient descent iterations involve repeatedly applying the operator to , which results in progressively higher-order derivatives.
We can now formally state our main result:
Theorem 1 (Main Theorem).
Consider a linear elliptic PDE satisfying Assumptions (i)-(iii), and let denote its unique solution. If there exists a neural network with parameters, such that , for some , then for every such that , there exists a neural network with size
such that where
and , . Furthermore, the activation functions used in belong to the set where for all is the square activation function.
This theorem shows that given an initial neural network containing parameters, we can recover a neural network that is close to the unique solution . The number of parameters in depends on how close the initial estimate is to the solution , and . This results in a trade-off: a more accurate approximation may require more parameters than a coarser one.
Note that as , while is a “bias” error term that does not go to as . The first three terms in the expression for result from bounding the difference between the solutions to the equations and , whereas the last term is due to the difference between and and the fact that our proof involves simulating the gradient descent updates with neural networks. Further, if the constant is equal to , then the error term will also be , in which case the term will equal .
The dependence on comes from the fact that we are simulating steps of a gradient-descent-like procedure on a strongly convex loss. The parameters and can be thought of as the effective Lipschitz and strong-convexity constants of the loss. Finally, to give a sense of what looks like, we show in Lemma 1 that if is initialized to be identically zero, then .
Lemma 1.
If , then .
Proof.
We make few remarks about the theorem statement:
Remark 1.
While we state our convergence results in the norm, our proof works for the norm as well. This is because in the space defined by the top-k eigenfunctions of the operator , the and norms are equivalent (shown in Proposition A.1). Further, note that even though we have assumed only that is the unique solution of (1), from the boundary regularity condition we have that (see Evans [1998], Chapter 6, Section 6.3). This ensures that the solution is twice differentiable as well.
Remark 2.
To get a sense of the scale of and , when (the Laplacian operator), the eigenvalue , where is the Poincaré constant (see Theorem A.1 in the Appendix). For geometrically well-behaved sets (e.g., convex sets with a strongly convex boundary, like a sphere), is even dimension-independent. Further, from Weyl’s law (Evans [1998], Section 6.5), we have
where is the volume of a unit ball in dimensions. So, if , grows as , which is a constant so long as .
6 Proof of Main Result
First, we provide some intuition behind the proof, via an analogy between a uniformly elliptic operator and a positive definite matrix in linear algebra. We can think of finding the solution to the equation for an elliptic operator as analogous to finding the solution to the linear system of equations $Ax = b$, where $A$ is a positive definite matrix and $x$ and $b$ are $d$-dimensional vectors. One way to solve such a linear system is by minimizing the strongly convex quadratic $g(x) = \frac{1}{2}x^\top A x - b^\top x$ using gradient descent. Since the objective is strongly convex, after $O(\kappa \log(1/\epsilon))$ gradient steps, where $\kappa$ is the condition number of $A$, we reach an $\epsilon$-optimal point in the $\ell_2$ sense.
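The matrix analogy can be made concrete with a short sketch (ours): gradient descent on the quadratic with the usual step size contracts the error geometrically, so roughly a condition-number-times-log(1/ε) number of steps suffices.

```python
import numpy as np

# Sketch of the matrix analogy: solve Ax = b for positive definite A by gradient descent on
# g(x) = x^T A x / 2 - b^T x. With step size 2 / (lambda_min + lambda_max), the error contracts
# geometrically, so roughly O(kappa * log(1/eps)) steps suffice (kappa = condition number of A).

rng = np.random.default_rng(0)
d = 20
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)                      # positive definite
b = rng.standard_normal(d)
x_star = np.linalg.solve(A, b)

lam = np.linalg.eigvalsh(A)
eta = 2.0 / (lam[0] + lam[-1])               # step size from the extreme eigenvalues

x = np.zeros(d)
for t in range(2000):
    x = x - eta * (A @ x - b)                # gradient step on the quadratic
    if np.linalg.norm(x - x_star) < 1e-8:
        break
print(t + 1, np.linalg.norm(x - x_star))     # number of steps used, final error
```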
Our proof uses a similar strategy. First, we show that for the operator , we can define a sequence of functions that converge to an -optimal function approximation (in this case in the norm) after steps—similar to the rate of convergence for strongly convex functions. Next, we inductively show that each iterate in the sequence can be approximated by a small neural network. More precisely, we show that given a bound on the size of the -th iterate , we can, in turn, upper bound the size of the -th iterate because the update transforming to can be simulated by a small neural network (Lemma 7). These iterations look roughly like , and we use a “backpropagation” lemma (Lemma 8) which bounds the size of the derivative of a neural network.
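The following sketch (ours, with hypothetical names such as GradientStep and the simplified operator -Δu + c·u in place of the paper's general divergence form) illustrates the construction: the module computing the next iterate literally contains the previous iterate and the right-hand-side network as submodules, so the size grows additively from one iterate to the next.

```python
import torch

# Sketch: the iterate u_{t+1} = u_t - eta * (L u_t - f), with L u = -Δu + c*u for concreteness,
# is itself a network containing u_t and the network for f as submodules. GradientStep, c, and
# eta are our placeholder names, not the paper's notation.

class GradientStep(torch.nn.Module):
    def __init__(self, u_prev, f_net, c=1.0, eta=0.1):
        super().__init__()
        self.u_prev, self.f_net, self.c, self.eta = u_prev, f_net, c, eta

    def forward(self, x):
        x = x.requires_grad_(True)
        u = self.u_prev(x)
        (grad,) = torch.autograd.grad(u.sum(), x, create_graph=True)
        lap = sum(torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
                  for i in range(x.shape[1]))
        Lu = -lap + self.c * u.squeeze(-1)
        return (u.squeeze(-1) - self.eta * (Lu - self.f_net(x).squeeze(-1))).unsqueeze(-1)

d = 3
u0 = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
f_net = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
u1 = GradientStep(u0, f_net)          # u1 contains u0 and f_net as subnetworks
u2 = GradientStep(u1, f_net)          # u2 contains u1, so sizes grow additively
print(u2(torch.rand(4, d)).shape)     # torch.Size([4, 1])
```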
6.1 Defining a Convergent Sequence
The rough idea is to perform gradient descent in [Neuberger, 2009; Faragó and Karátson, 2001, 2002] to define a convergent sequence whose iterates converge to in norm (and following Remark 1, in as well). However, there are two obstacles to defining the iterates as simply : (1) is unbounded—so the standard way of choosing a step size for gradient descent (roughly the ratio of the minimum and maximum eigenvalues of ) would imply choosing a step size , and (2) does not necessarily preserve the boundary conditions, so if we start with , it may be that does not even lie in .
We resolve both issues by restricting the updates to the span of the first eigenfunctions of . More concretely, as shown in Lemma 2, if a function lies in , then the function will also lie in . We also show that within the span of the first eigenfunctions, is bounded (with maximum eigenvalue ), and can therefore be viewed as an operator from to . Further, we use instead of in our updates, which now have the form . Since belongs to , for a in , the next iterate will remain in . Continuing the matrix analogy, we can choose the usual step size of . Precisely, we show:
Lemma 2.
Let be an elliptic operator. Then, for all it holds:
-
1.
.
-
2.
-
3.
Proof.
Writing as where , we have . Therefore and lies in , proving (1.).
In order to prove (2.) let us first denote . Note if is an eigenfunction of with corresponding eigenvalue , it is also an eigenfunction of with corresponding eigenvalue .
Hence, writing as , where , we have
(5)
By the orthogonality of , we have
Since , we have and , so . This implies . Plugging this back in (5), we get the claim we wanted. ∎
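In the matrix analogy, the restriction to the span of the first k eigenfunctions looks as follows (a sketch of ours on a discretized 1D Laplacian, not the paper's general operator): projecting the updates onto the leading eigenvectors keeps the iterates in that span, where the operator is bounded and the usual step-size choice applies.

```python
import numpy as np

# Sketch: restrict gradient descent to the span of the first k eigenvectors (smallest
# eigenvalues) of a discretized 1D Laplacian. Within that span the operator is bounded by
# lambda_k, so the step size 2 / (lambda_1 + lambda_k) applies, and the error contracts by
# roughly (kappa - 1)/(kappa + 1) per step with kappa = lambda_k / lambda_1.

n, k = 200, 10
h = 1.0 / (n + 1)
L = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2   # discrete -d^2/dx^2
lam, V = np.linalg.eigh(L)
Vk = V[:, :k]                                                   # first k eigenvectors

f = np.ones(n)
f_k = Vk @ (Vk.T @ f)                                           # project f onto the span
eta = 2.0 / (lam[0] + lam[k - 1])

u = np.zeros(n)
for _ in range(1000):
    u = u - eta * Vk @ (Vk.T @ (L @ u - f_k))                   # projected gradient step

u_star_k = Vk @ ((Vk.T @ f) / lam[:k])                          # exact solution in the span
print(np.linalg.norm(u - u_star_k))                             # tiny after ~1000 steps
```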
In fact, we will use a slight variant of these updates and instead set as the iterates of the convergent sequence, where is the projection of onto . This sequence satisfies two important properties: (1) the convergence point of the sequence and , the solution to the original PDE, are not too far from each other; (2) the sequence of functions converges exponentially fast. In Section 6.2, we will see that updates defined in this way are more convenient to simulate via a neural network.
The first property is formalized as follows:
Lemma 3.
Assume that is the solution to the PDE , where is the projection of onto . Given Assumptions (i)-(iii), we have , such that , where and .
The proof for Lemma 3 is provided in the Appendix (Section B.1). Each of the three terms in the final error captures different sources of perturbation: the first term comes from approximating by ; the second term comes from applying Davis-Kahan [Davis and Kahan, 1970] to bound the “misalignment” between the eigenspaces and (hence, the appearance of the eigengap between the and -st eigenvalue of ); the third term is a type of “relative” error bounding the difference between the solutions to the PDEs and .
The “misalignment” term can be characterized through the following lemma:
Lemma 4 (Bounding distance between and ).
Given Assumptions (i)-(iii) and denoting the projection of onto by , we have:
(6)
where .
Proof.
Let us write where . Further, we can define a function such that .
The main technical tool for bounding the difference between the operators and can be formalized through the lemma below. Note that the “relative” nature of the perturbation arises because and are not bounded operators.
Lemma 5 (Relative operator perturbation bound).
Proof.
Hence this implies that
where proving part (1.).
Further, for part (2.) we have for all ,
(10)
where . Therefore using (10) the following holds for all ,
(11)
where we use the fact that the operators and are self-adjoint to get (1) and then bring the appropriate terms to the LHS in (2). ∎
The second property of the sequence of functions is that they converge exponentially fast. Namely, we show:
Lemma 6 (Convergence of gradient descent in ).
Let denote the unique solution to the PDE , where , and the operator satisfies the conditions in Lemma 2. For any such that , we define the sequence
(12)
where for all , . Then for any , we have
The proof is essentially the same as the analysis of the convergence time of gradient descent for strongly convex losses. Namely, we have:
Proof.
Given that and the function and as well (from Lemma 2).
As , all the iterates in the sequence will also belong to and will lie in the .
6.2 Approximating iterates by neural networks
In Lemma 6, we show that there exists a sequence of functions (12) which converge fast to a function close to . The next step in the proof is to approximate the iterates by neural networks.
The main idea is as follows. Suppose first that the iterates are such that is exactly representable as a neural network. Then, the iterate can be written in terms of three operations performed on , , and : taking derivatives, multiplication, and addition. Moreover, if is representable as a neural network with parameters, the coordinates of the vector can be represented by a neural network with parameters. This is a classic result (Lemma 8), essentially following from the backpropagation algorithm. Finally, the addition or multiplication of two functions representable as neural networks with sizes can be represented as a neural network with size (see Lemma 9).
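A rough sketch (ours) of this counting argument in PyTorch: composing two networks by adding or multiplying their outputs yields a module whose parameter count is the sum of the two, and backpropagation provides derivatives. The names Combine and n_params are ours.

```python
import torch

# Sketch: adding or multiplying the outputs of two networks gives a network whose parameter
# count is the sum of the two, and a gradient obtained by backpropagation costs only a
# constant factor more.

def n_params(m):
    return sum(p.numel() for p in m.parameters())

class Combine(torch.nn.Module):
    def __init__(self, g, h, op):
        super().__init__()
        self.g, self.h, self.op = g, h, op
    def forward(self, x):
        return self.op(self.g(x), self.h(x))

d = 4
g = torch.nn.Sequential(torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
h = torch.nn.Sequential(torch.nn.Linear(d, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
add = Combine(g, h, lambda a, b: a + b)
mul = Combine(g, h, lambda a, b: a * b)
print(n_params(g), n_params(h), n_params(add), n_params(mul))   # counts add up exactly here

x = torch.rand(2, d, requires_grad=True)
(dg,) = torch.autograd.grad(g(x).sum(), x)                      # gradient via backpropagation
print(dg.shape)                                                 # torch.Size([2, 4])
```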
Using these facts, we can write down a recurrence upper bounding the size of the neural network approximation of , denoted by , in terms of the number of parameters in (the neural network approximation of ). Formally, we have:
Lemma 7 (Recursion Lemma).
Given the Assumptions (i)-(iii), consider the update equation
(13)
If at step , is a neural network with parameters, then the function is a neural network with parameters.
Proof.
Expand the update as follows:
Using Lemma 8, , and can be represented by neural networks with , and parameters, respectively. Further, and can be represented by a neural network with parameters, and can be represented by a network with parameters, by Lemma 9. Hence can be represented with parameters. Note that, throughout the proofs, hides independent constants. ∎
Combining the results of Lemma 6 and Lemma 7, we can get a recurrence for the number of parameters required to represent the neural network :
Unfolding this recurrence, we get .
The formal lemmas for the different operations on neural networks that we simulate using new neural networks are as follows:
Lemma 8 (Backpropagation, Rumelhart et al. [1986]).
Consider a neural network with depth , parameters, and differentiable activation functions in the set . There exists a neural network of size , with activation functions in the set , that calculates the gradient for all .
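As a concrete instance of this lemma (our sketch, for a two-layer network f(x) = w2 · tanh(W1 x + b1)): the gradient reuses the same weights with the activation replaced by its derivative, so it is again a network of comparable size. The closed form is checked against autograd below.

```python
import torch

# Sketch: for f(x) = w2 . tanh(W1 x + b1), the gradient is
# grad f(x) = W1^T (tanh'(W1 x + b1) * w2), itself a small network whose activation is tanh'.

torch.manual_seed(0)
d, m = 5, 7
W1, b1 = torch.randn(m, d), torch.randn(m)
w2 = torch.randn(m)

def grad_net(x):                                   # explicit "gradient network"
    pre = x @ W1.T + b1                            # shape (n, m)
    return (w2 * (1 - torch.tanh(pre) ** 2)) @ W1  # tanh' = 1 - tanh^2

x = torch.randn(3, d, requires_grad=True)
f = (torch.tanh(x @ W1.T + b1) @ w2).sum()
(autograd_grad,) = torch.autograd.grad(f, x)
print(torch.allclose(grad_net(x), autograd_grad, atol=1e-6))    # True
```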
Lemma 9 (Addition and Multiplication).
Given neural networks and , with and parameters respectively, the operations and can be represented by neural networks of size , using square activation functions.
Proof.
For addition, there exists a network containing both networks and as subnetworks, plus an extra layer that computes the sum of their outputs. Hence, the total number of parameters in such a network will be .
For multiplication, consider the operation . Then, following the same argument as for addition, we can construct a network containing both networks together with a square activation function. ∎
While the representation result in Lemma 9 is shown using square activation, we refer to Yarotsky [2017] for approximation results with ReLU activation. The scaling with respect to the number of parameters in the network remains the same.
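To see why the square activation suffices for exact multiplication, here is a small sketch (ours) of the standard polarization-identity gadget behind Lemma 9.

```python
import torch

# Sketch: with the square activation, the polarization identity
# x*y = ((x + y)^2 - x^2 - y^2) / 2 computes an exact product of two scalars with a single
# hidden layer of three "square" units and a fixed linear read-out.

def product_via_squares(a, b):
    h = torch.stack([a + b, a, b]) ** 2            # hidden layer of square units
    return torch.tensor([0.5, -0.5, -0.5]) @ h     # linear read-out (polarization identity)

a, b = torch.tensor(1.7), torch.tensor(-2.3)
print(product_via_squares(a, b), a * b)            # identical values
```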
Finally, we have to deal with the fact that is not exactly a neural network, but only approximately so. The error due to this discrepancy can be characterized through the following lemma:
Lemma 10 (Error using ).
Consider the update equation in (13), where is a neural network with . Then the neural network approximates the function such that where is
where , , and is a multi-index.
The proof for the lemma is deferred to Section B.2 of the Appendix. The main strategy to prove this lemma involves tracking the “residual” non-neural-network part of the iterates. Precisely, for every , we will write , s.t. is a neural network and bound . is defined such that
Correspondingly, as , we have:
Unfolding the recurrence, we have , which reduces the proof to bounding . (The reason we require that is close to not only in the sense but also in terms of their higher-order derivatives is that involves -order derivatives of , which need to be bounded at each step.)
7 Applications to Learning Operators
A number of recent works attempt to simultaneously approximate the solutions for an entire family of PDEs by learning a parametric map that takes as inputs (some representation of) the coefficients of a PDE and returns its solution [Bhattacharya et al., 2020; Li et al., 2020b, a]. For example, given a set of observations , where each denotes a coefficient of a PDE with corresponding solution , they learn a neural network such that for all , . Our parametric results provide useful insight into why simultaneously solving an entire family of PDEs with a single neural network is possible in the case of linear elliptic PDEs.
Consider the case where the coefficients in the family of PDEs are given by neural networks with a fixed architecture, but where each instance of a PDE is characterized by a different setting of the weights in the models representing the coefficients. Lemma 7 shows that each iteration of our sequence (12) constructs a new network containing both the current solution and the coefficient networks as subnetworks. We can view our approximation as not merely approximating the solution to a single PDE but to every PDE in the family, by treating the coefficient networks as placeholder architectures whose weights are provided as inputs. Thus, our construction provides a parametric map between the coefficients of an elliptic PDE in this family and its solution.
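The following sketch (ours, with arbitrary architecture choices) shows only the interface such a parametric map exposes: a single model receives a query point together with the flattened weights of a fixed-architecture coefficient network and predicts the solution value. It is not the construction in the paper.

```python
import torch

# Sketch of the interface of a parametric solution map: one model takes a query point x and
# the flattened weights theta of a fixed-architecture coefficient network, and predicts the
# solution value u_theta(x). The sizes and architecture below are placeholders.

d, n_theta = 5, 200                                  # n_theta: size of the coefficient network
solver = torch.nn.Sequential(
    torch.nn.Linear(d + n_theta, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 256), torch.nn.Tanh(),
    torch.nn.Linear(256, 1),
)

x = torch.rand(32, d)                                # query points
theta = torch.randn(32, n_theta)                     # one coefficient setting per PDE instance
u_hat = solver(torch.cat([x, theta], dim=-1))        # predicted solution values, shape (32, 1)
print(u_hat.shape)
```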
8 Conclusion and Future Work
We derive parametric complexity bounds for neural network approximations for solving linear elliptic PDEs with Dirichlet boundary conditions, whenever the coefficients can be approximated by neural networks with finite parameter counts. By simulating gradient descent in function spaces using neural networks, we construct a neural network that approximates the solution of the PDE. We show that the number of parameters in this neural network depends on the number of parameters required to represent the coefficients and has a polynomial dependence on the dimension of the input space, thereby avoiding the curse of dimensionality.
An immediate open question relates to tightening our results: our current error bound is sensitive to how close the neural network approximation lies to , which could be alleviated by relaxing (e.g., by adding some kind of “regularity” assumption) the dependence of our analysis on the first eigenfunctions. Further, the dependencies in the exponent of on and in the parametric bound may also be improvable. Finally, the idea of simulating an iterative algorithm by a neural network to derive a representation-theoretic result is broadly applicable, and may be fertile ground for further work, both theoretical and empirical, as it suggests a particular kind of weight tying.
References
- Ames [2014] William F Ames. Numerical methods for partial differential equations. Academic press, 2014.
- Anderson and Wendt [1995] John David Anderson and J Wendt. Computational fluid dynamics, volume 206. Springer, 1995.
- Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bhattacharya et al. [2020] Kaushik Bhattacharya, Bamdad Hosseini, Nikola B Kovachki, and Andrew M Stuart. Model reduction and neural networks for parametric PDEs. arXiv preprint arXiv:2005.03180, 2020.
- Black and Scholes [1973] Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. Journal of political economy, 81(3):637–654, 1973.
- Crank and Nicolson [1947] John Crank and Phyllis Nicolson. A practical method for numerical evaluation of solutions of partial differential equations of the heat-conduction type. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 43, pages 50–67. Cambridge University Press, 1947.
- Davis and Kahan [1970] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
- Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Ehrhardt and Mickens [2008] Matthias Ehrhardt and Ronald E Mickens. A fast, stable and accurate numerical method for the Black–Scholes equation of American options. International Journal of Theoretical and Applied Finance, 11(05):471–501, 2008.
- Evans [1998] Lawrence C Evans. Partial Differential Equations. Graduate Studies in Mathematics. American Mathematical Society, 1998. ISBN 9780821807729.
- Faragó and Karátson [2001] I Faragó and J Karátson. The gradient-finite element method for elliptic problems. Computers & Mathematics with Applications, 42(8-9):1043–1053, 2001.
- Faragó and Karátson [2002] István Faragó and János Karátson. Numerical solution of nonlinear elliptic problems via preconditioning operators: Theory and applications, volume 11. Nova Publishers, 2002.
- Gilbarg and Trudinger [2001] David Gilbarg and Neil S Trudinger. Elliptic Partial Differential Equations of Second Order. Springer, 2001.
- Grohs and Herrmann [2020] Philipp Grohs and Lukas Herrmann. Deep neural network approximation for high-dimensional elliptic pdes with boundary conditions. arXiv preprint arXiv:2007.05384, 2020.
- Grohs et al. [2018] Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe Von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.
- Han et al. [2018] Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
- Jentzen et al. [2018] Arnulf Jentzen, Diyora Salimova, and Timo Welti. A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients. arXiv preprint arXiv:1809.07321, 2018.
- Khoo et al. [2017] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving parametric PDE problems with artificial neural networks. arXiv preprint arXiv:1707.03351, 2017.
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25, pages 1097–1105. Curran Associates, Inc., 2012.
- Kutyniok et al. [2019] Gitta Kutyniok, Philipp Petersen, Mones Raslan, and Reinhold Schneider. A theoretical analysis of deep neural networks and parametric PDEs. arXiv preprint arXiv:1904.00377, 2019.
- Lagaris et al. [1998] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000, 1998.
- Lagaris et al. [2000] Isaac E Lagaris, Aristidis C Likas, and Dimitris G Papageorgiou. Neural-network methods for boundary value problems with irregular boundaries. IEEE Transactions on Neural Networks, 11(5):1041–1049, 2000.
- Lax and Milgram [1954] Peter D Lax and Arthur N Milgram. Parabolic equations. Volume 33 of Annals of Mathematics Studies, 1954.
- LeVeque [2007] Randall J LeVeque. Finite difference methods for ordinary and partial differential equations: steady-state and time-dependent problems. SIAM, 2007.
- Li et al. [2020a] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020a.
- Li et al. [2020b] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020b.
- Malek and Beidokhti [2006] Alaeddin Malek and R Shekari Beidokhti. Numerical solution for high order differential equations using a hybrid neural network—optimization method. Applied Mathematics and Computation, 183(1):260–271, 2006.
- Neuberger [2009] John Neuberger. Sobolev gradients and differential equations. Springer Science & Business Media, 2009.
- Nirenberg [1955] Louis Nirenberg. Remarks on strongly elliptic partial differential equations. Communications on pure and applied mathematics, 8(4):648–674, 1955.
- Oksendal [2013] Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
- Özişik et al. [2017] M Necati Özişik, Helcio RB Orlande, Marcelo J Colaço, and Renato M Cotta. Finite difference methods in heat transfer. CRC press, 2017.
- Radford et al. [2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Raissi et al. [2017] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561, 2017.
- Raissi et al. [2019] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
- Rumelhart et al. [1986] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
- Sirignano and Spiliopoulos [2018] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
- Temam [2001] Roger Temam. Navier-Stokes equations: theory and numerical analysis, volume 343. American Mathematical Soc., 2001.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Yarotsky [2017] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
Appendix A Brief Overview of Partial Differential Equations
In this section, we introduce a few key definitions and results from the PDE literature. We note that the results in this section are standard and are included for completeness. We refer the reader to classical texts on PDEs [Evans, 1998; Gilbarg and Trudinger, 2001] for more details.
We will use the following Poincaré inequality throughout our proofs.
Theorem A.1 (Poincaré inequality).
Given a bounded open subset $\Omega \subset \mathbb{R}^d$, there exists a constant $C_{\Omega}$ (depending only on $\Omega$ and $d$) such that for all $u \in H_0^1(\Omega)$,
$$\|u\|_{L^2(\Omega)} \leq C_{\Omega}\,\|\nabla u\|_{L^2(\Omega)}.$$
Corollary A.1.
For the bounded open subset , for all , we define the norm in the Hilbert space as
(14) |
Further, the norm in is equivalent to the norm .
Proof.
Note that for we have,
where we have used the definition of the norm in the space.
Further, using the result in Theorem A.1 we have
(15) |
Therefore, combining the two inequalities we have
(16) |
where . Hence we have that the norm in and spaces are equivalent. ∎
Proposition A.1 (Equivalence between and norms).
If then we have that is equivalent to .
Proof.
We have from the Poincaré inequality in Theorem A.1 that for all , the norm in is upper bounded by the norm in , i.e.,
Further, using results from (18) and (17) (where ), we know that for all we have
This implies that is equivalent to the inner product , i.e., for all ,
Further, since , we have from Lemma 2 that
Hence we have that for all is equivalent to and by Corollary A.1 is also equivalent to . ∎
We now introduce a form for that is more amenable to the existence and uniqueness results.
Lemma A.1.
For all , we have the following,
-
1.
The inner product equals,
-
2.
The operator is self-adjoint.
Proof.
-
1.
We will be using the following integration by parts formula,
where is the outward normal at the boundary and is an infinitesimal element of the boundary.
Hence we have for all ,
-
2.
To show that the operator is self-adjoint, we show that for all we have .
From Proposition A.1, for functions we have
∎
A.1 Proof of Proposition 1
We first show that if is the unique solution then it minimizes the variational norm.
Let denote the weak solution, further for all let . Using the fact that is self-adjoint (as shown in Lemma A.1) we have
where we use the fact that and that is a weak solution hence (1) holds for all .
To show the other side, assume that minimizes , i.e., for all and we have, ,
Taking , we get
and also taking as , we have
Together, this implies that if is the solution to (2), then is also the weak solution, i.e, for all we have
Proof for Existence and Uniqueness of the Solution
In order to prove the uniqueness of the solution, we first state the Lax-Milgram theorem.
Theorem A.2 (Lax-Milgram, Lax and Milgram [1954]).
Let be a Hilbert space with inner product , and let and be the bilinear form and linear form, respectively. Assume that there exist constants such that for all we have,
Then there exists a unique such that,
Having stated the Lax-Milgram Theorem, we make the following proposition,
Proposition A.2.
Given Assumptions (i)-(iii), the solution to the variational formulation in (1) exists and is unique.
Proof.
Using the variational formulation defined in (1), we introduce the bilinear form where . Hence, we prove the theorem by showing that the bilinear form satisfies the conditions in Theorem A.2.
We first show that for all the following holds,
(17) |
Now we show that the bilinear form is lower bounded.
(18) |
Finally, for
Appendix B Perturbation Analysis
B.1 Proof of Lemma 3
Proof.
Applying the triangle inequality to the error between and , we have,
(19) |
where is the solution to the PDE .
First, we introduce an intermediate PDE and denote its solution by . Therefore, by applying the triangle inequality again, Term (II) can be expanded as follows:
(21) |
We will tackle the second term in (21) first.
Using and ,
(22) |
Therefore, using the inequality in Lemma 5, part (2.), we can upper bound (22) to get,
(23) |
where .
Proceeding to the first term in (21), using Lemma 4, and the inequality in (2), the term can be upper bounded by,
(24) |
Therefore Term (II), i.e., can be upper bounded by
(25) |
B.2 Proof of Lemma 10
Proof.
We define , therefore from Lemma C.2 we have that for any multi-index ,
For every , we will write , s.t. is a neural network and we (iteratively) bound . Precisely, we define a sequence of neural networks , s.t.
Since , we can define a corresponding recurrence for :
Unfolding the recurrence, we get
(26) |
Using the binomial expansion we can write:
Here the inequality follows by using the bound derived in Lemma C.2. Further, we use that all we have in and the inequality follows from the fact that .
Hence we have the final upper bound:
∎
Appendix C Technical Lemmas: Perturbation Bounds
In this section we introduce some useful lemmas about perturbation bounds used in the preceding parts of the appendix.
First, we show a lemma that is essentially an application of Davis-Kahan to the (bounded) operators and .
Lemma C.1 (Subspace alignment).
Consider linear elliptic operators and with eigenvalues and respectively. Assume that . For any function , we define and as the projection of onto and , respectively. Then we have:
(27) |
where .
Proof.
We begin the proof by first showing that the inverses of the operators and are close. Using the result from Lemma 5 with , we have:
Now, the operator norm can be written as,
(28) |
Further note that, and are the eigenvalues of the operators and , respectively. Therefore from Weyl’s Inequality and (28) we have:
(29) |
Therefore, for all , we have that , i.e., all the eigenvalues of are within of the corresponding eigenvalues of , which therefore implies that the difference between eigenvalues is,
Since the operators , are bounded, the Davis-Kahan theorem [Davis and Kahan, 1970] can be used to conclude that:
(30) |
where is understood to be the operator norm, and . Therefore for any function we have
By (30), we then get , which finishes the proof. ∎
Finally, we show that repeated applications of to also have bounded norms:
Lemma C.2 (Bounding norms of applications of ).
The functions and satisfy:
-
1.
-
2.
where .
Proof.
For Part 1, by Lemma D.4 we have that
(31) |
From Assumptions (i)-(iii), for any multi-index we have:
(32) |
For Part 2 we have,
(33)
(34)
Note that from Lemma 5 part (2.) we have that (where denotes the operator norm). This implies that there exists an operator , such that and we can express as:
We will show that there exists a , s.t. and . Towards that, we will denote and show that
(35) |
We have:
where (1) follows from triangle inequality, (2) follows from Lemma D.5, (3) follows from , and the last part follows from and Taylor expanding . Next, since and are elliptic operators, we have . From this, it immediately follows that there exists a , s.t. with .
Plugging this into the first term of (34), we have
(36) |
The first term in (36) can be expanded as follows:
(37) |
We’ll consider the two terms in turn.
For the first term, the same proof as that of (35) shows that there exists an operator , s.t. and . Hence, we have:
(38) |
For the second term in (36) we have:
(39) |
To bound the first factor we have:
where we use the fact that and . Hence, we can bound
(40) |
From (40) and Lemma 4 we have:
(41) |
Therefore from (38) and (41), we can upper bound using (36) as follows:
Here in we use the result from Lemma D.1 and write . In , we use and the fact that
Therefore, finally we have:
Combining this with the result for Part 1, we have the following:
∎
Appendix D Technical Lemmas: Manipulating Operators
Before we state the lemmas we introduce some common notation used throughout this section. We denote . Further we use to denote the operator with for all and as coefficients, that is:
Similarly the operator is defined as:
Lemma D.1.
Given and for all are top eigenvalues of operators and respectively, such that is bounded. Then for all we have that
where and and .
Proof.
From (28) and Weyl’s inequality we have for all
From this, we can conclude that:
Writing (where ), we have
where (1) follows from the factorization , (2) follows from , and (3) follows from and Taylor expanding . Hence, there exists a , s.t. and (i.e., ). Using the fact that for all completes the proof. ∎
Lemma D.2 (Operator Chain Rule).
Given an elliptic operator , for all we have the following
(42) |
(43) |
where we assume that .
Proof.
We show the proof using induction on . To handle the base case, for , we have
(44) |
Similarly and ,
(45) |
Lemma D.3.
For all and for all , the following upper bounds hold:
(48) |
(49) |
and
(50) |
where
Proof.
We first show the upper bound on :
(51) |
where (1) follows by Hölder.
Lemma D.4.
For all and , and for all , we have the following upper bounds:
(56) |
(57) |
(58) |
where .
Proof.
We prove the Lemma by induction on . The base case follows from Lemma D.3, along with the fact that .
To show the inductive case, assume that the claim holds for all . By Lemma D.3, we have
Thus, we have
as we need.
Similarly, for , we have:
(59) |
Finally, for we have
Thus, the claim follows. ∎
Lemma D.5.
Let , be defined as a composition of applications of and applications of (in any order), s.t. . Then, we have:
(61) |
Proof.
We prove the above claim by induction on .
For we have two cases. If , we have:
If we have:
Towards the inductive hypothesis, assume that for and it holds that,
For , we will have two cases. First, if , by submultiplicativity of the operator norm, as well as the fact that similar operators have identical spectra (hence equal operator norm) we have:
so the inductive claim is proved. In the second case, and we have, by using the fact that the similar operators have identical spectra:
where the last inequality follows by the inductive hypothesis. ∎