Error bounds for PDE-regularized learning
Abstract.
In this work we consider the regularization of a supervised learning problem by partial differential equations (PDEs) and derive error bounds for the obtained approximation in terms of a PDE error term and a data error term. Assuming that the target function satisfies an unknown PDE, the PDE error term quantifies how well this PDE is approximated by the auxiliary PDE used for regularization. It is shown that this error term decreases if more data is provided. The data error term quantifies the accuracy of the given data. Furthermore, the PDE-regularized learning problem is discretized by generalized Galerkin discretizations, which solve the associated minimization problem in subsets of the infinite-dimensional function space that are not necessarily subspaces. For such discretizations an error bound in terms of the PDE error, the data error, and a best approximation error is derived.
1. Introduction
The problem of learning an unknown function is one of the central problems in the field of machine learning. A classical ansatz is to minimize the risk functional
in some ansatz set for a given loss functional. In the case of a quadratic loss functional this leads to an $L^2$-best-approximation problem for the target function in the ansatz set.
Solving this problem in general requires complete knowledge of . In practice the available data on is often incomplete. A classical example of incomplete data is a finite set of known point data for a continuous function . In this case, instead of minimizing the risk, one often considers the problem of empirical risk minimization given by
(1) |
Here, the weighted sum can be viewed as a Monte-Carlo approximation of the $L^2$-norm using the sample set .
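For illustration, the following short Python sketch (with a hypothetical target and approximation on the unit interval, chosen only for this example) compares such a sample average of squared point errors with a quadrature value of the squared $L^2$-distance; the two agree up to the Monte-Carlo sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)

u = lambda x: np.sin(2 * np.pi * x)            # hypothetical target function
v = lambda x: np.sin(2 * np.pi * x) + 0.1 * x  # hypothetical approximation

# Empirical risk with quadratic loss: mean of squared errors over
# uniformly sampled points x_1, ..., x_N in (0, 1).
N = 10_000
x = rng.uniform(0.0, 1.0, N)
empirical_risk = np.mean((v(x) - u(x)) ** 2)

# Reference: squared L2(0,1) distance via a fine midpoint rule.
M = 100_000
xm = (np.arange(M) + 0.5) / M
l2_distance_sq = np.sum((v(xm) - u(xm)) ** 2) / M

print(empirical_risk, l2_distance_sq)  # agree up to the Monte-Carlo sampling error
```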
Recently a lot of attention has been paid to learning with shallow or deep neural networks. In this case is given as the range of a nonlinear map over a finite dimensional parameter domain , such that the empirical risk minimization problem in can be written as
(2) |
with . The minimizer is often approximated using stochastic gradient-type algorithms. While this approach has been used with reasonable success in practice, its theoretical understanding is still in its infancy. The central question is whether the total error, i.e., the difference between the computed approximation and the target function, can be controlled. When analyzing this error, several aspects have to be taken into account.
Learning: Here the question is whether we can control the error made by the algebraic solution method, e.g. stochastic gradient descent. If the algebraic error of a computed approximate parameter set can be controlled, local Lipschitz continuity of the parametrization allows us to control the induced error of the associated functions. A major obstacle for an error analysis is that the learning problem (2) is in general nonconvex and may have multiple global or local minimizers, saddle points, or plateaus. One direction to address this problem is to characterize network architectures where local optimality implies global optimality or at least plateaus (see, e.g., [11], [27] and the references therein). Another direction is to interpret a (stochastic) gradient flow for shallow neural networks as the evolution of an interacting particle system, which allows one to relate long time limits (i.e. stationary points) and the many particle limit [25]. In the present paper we do not consider this aspect and concentrate on the analysis of global minimizers.
Expressivity: This centers around the question of how well the target function can be approximated in the ansatz set. Starting from early results on universal approximation properties of neural networks (see, e.g., [5, 22]), there has been significant recent progress on deriving best approximation error and expressivity bounds for deep neural networks [1, 2, 7, 9, 10, 20, 21, 26, 28]. An additional question is to which extent it is possible to realize the theoretically derived best approximation error by solving the learning problem. Despite its importance this question is largely unexplored, which is rooted in the fact that it can hardly be answered in the case of incomplete data due to ill-posedness. In the present paper we will address this question by regularizing the problem and proving discretization error bounds for the minimizer in terms of the best approximation error for in .
Generalization: This addresses the question of whether the trained function generalizes from the training data to other input data. In mathematical terms this leads to the question of how well the computed function approximates the target function on the whole domain, given only incomplete data at some points. This question is often discussed in a statistical setting, considering the input data as randomly drawn samples. In the present paper we will take a deterministic perspective and are interested in error bounds in terms of the amount of provided data. Again such bounds can in general not be derived due to ill-posedness, which we will address by a regularization of the problem.
The fact that it is hard to derive error bounds—even for global minimizers of the learning problem—is deeply related to the fact that problem (1), which only incorporates incomplete data, is in general ill-posed, leading to severe artifacts. For example, if the ansatz set is ’large’ in comparison to the available amount of data, one observes so-called overfitting, where the computed function matches the data nicely in the given points but fails to provide a reasonable approximation in other parts of the domain. This is in fact a consequence of ill-posedness. (This can be easily explained in the case of polynomial approximation: if the number of data points equals the number of polynomial coefficients, the minimizer is exactly the interpolation polynomial, which in general exhibits severe over- and undershoots; if there are fewer data points, there is no unique solution and one can add an oscillatory polynomial of arbitrary amplitude to the interpolation polynomial, leading to uncontrollably large errors.) A common technique to reduce such artifacts is to introduce regularization terms. However, it is unclear how the influence of such regularization on the error with respect to the $L^2$-norm could be analyzed. Another obstacle in deriving error bounds is that it is unclear how to understand (1) as a discretization of a continuous problem and thus how such a discretization and the influence of incomplete data could be analyzed. In the present paper we target these questions by introducing regularizations of the learning problem by partial differential equations (PDEs). This is based on the assumption that additional knowledge on the process generating the data, or at least on the smoothness of the target function, is available. Using such regularizations we derive a framework which allows us to prove error bounds that quantify the effect of incomplete data and relate the discretization error in nonlinear ansatz spaces (like neural networks) to their approximation properties.
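The polynomial example above can be reproduced in a few lines. The following sketch (target function, number of samples, and degrees chosen purely for illustration) fits polynomials of increasing degree to a handful of exact point values; once the degree reaches or exceeds the number of data points, the fit matches the data, but the $L^2$-type error away from the samples typically deteriorates, and in the underdetermined case `np.polyfit` merely returns one of infinitely many minimizers.

```python
import numpy as np

rng = np.random.default_rng(1)

u = lambda x: 1.0 / (1.0 + 25.0 * x**2)   # hypothetical smooth target (Runge function)

N = 8                                      # number of data points
x_data = rng.uniform(-1.0, 1.0, N)
w_data = u(x_data)                         # exact point data

# Midpoint rule for the L2(-1,1) error of the fitted polynomial.
M = 10_000
xm = -1.0 + (np.arange(M) + 0.5) * (2.0 / M)

for degree in (3, N - 1, 2 * N):
    # np.polyfit solves the (possibly underdetermined) least squares problem;
    # for degree >= N - 1 the data is matched (almost) exactly.
    coeffs = np.polyfit(x_data, w_data, degree)
    err = np.sqrt(np.sum((np.polyval(coeffs, xm) - u(xm)) ** 2) * (2.0 / M))
    print(degree, err)
```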
Due to their approximation power, neural networks have already been proposed for the solution of partial differential equations (PDEs). In [4] the authors introduce the deep Ritz method, which approximates the solution of a variational formulation of a PDE using a discretization by deep neural networks. In contrast to (2), this approach minimizes the Dirichlet energy associated to the PDE in the nonlinear neural network space, using a simple penalty approach for essential boundary data. As a variant, it was proposed to minimize the consistent penalty formulation from Nitsche’s method using a heuristic penalty parameter [17]. While error bounds for both methods are so far unknown, a convergence result based on $\Gamma$-convergence was recently derived [18]. A different approach was taken in [23], where the authors introduce so-called physics informed neural networks (PINNs), which are trained using a collocation least squares functional. This ansatz is extended in [24] where, in addition to the PDE, point data of the target function is incorporated. Replacing collocation by a least squares Petrov–Galerkin ansatz leads to the variational PINN approach considered in [14]. It has also been highlighted that a neural network ansatz is especially promising for high-dimensional PDEs [4, 12, 9, 17]. Other uses of neural networks in the context of PDEs include, e.g., reduced basis methods for parametrized problems [16].
Our contribution: In the present paper we will in general assume that solves an elliptic PDE which is not known exactly. Since we cannot use the unknown exact PDE, an inexact auxiliary PDE is used to regularize the problem. Furthermore, to make the problem well-posed in a Lebesgue- and Sobolev-space setting, we first replace the point data by local averages on sets with leading to the regularized learning problem
Here, is the Dirichlet energy associated to the elliptic auxiliary PDE and is a regularization parameter that balances the data and PDE term. The main result of the paper is an error bound of the form
Here is a constant that can be decreased by adding more data and quantifies the error induced by using an inexact auxiliary PDE. Hence the accuracy of the approximation can be improved either by providing more data or by improving the exactness of the auxiliary PDE. In a second step we treat the case of given point values by considering them as an inexact variant of the local averages. Assuming Hölder continuity of the target function, we can control the additional error by an error bound
where can again be decreased by adding more data.
Finally, using a nonlinear Céa-Lemma, we derive a bound
for the case of a generalized Galerkin discretization where the discrete solution is computed by minimizing the functional in a subset of . A variant for inexact minimizers is also presented. Since the result allows for non-subspace subsets , it is in principle applicable to nonlinear approximation schemes like neural networks. Inserting known approximation error bounds (e.g. from [10]) on the right-hand side then leads to a discretization error bound for neural network discretizations under the assumption that a global minimizer can be computed, or that the algebraic energy error can be controlled. While the latter can in general not be guaranteed, the derived results open a new perspective for an error analysis of neural networks. For example, the same arguments can be used to provide a priori error bounds for the deep Ritz method [4] for Neumann problems with exact (or controlled inexact) global minimizers, which allows one to quantify recent convergence results [18].
The paper is organized as follows: First the PDE-regularized learning problem is introduced and its well-posedness is discussed in Section 2. Then an error bound is derived in Section 3 in the infinite dimensional case. This section also discusses the quasi-optimality of the derived error bound for a pure data fitting problem without a priori knowledge on the PDE. The generalized Galerkin discretization in subsets is introduced and analyzed in Section 4. Finally, the theoretical findings are illustrated by numerical experiments with finite element and neural network discretizations in Section 5.
2. Problem setting
2.1. Exact and inexact auxiliary PDEs
We are interested in approximating a function on a bounded domain with Lipschitz boundary. Throughout the paper we make the assumption that solves an elliptic partial differential equation (PDE) given in terms of a variational equation
(3) |
for a closed subspace with , a symmetric bilinear form
with uniformly bounded coefficient functions and ,
and given by
for some . The subspace is chosen such that is coercive on . Then (3) has a unique solution by the Lax–Milgram theorem.
The basic assumption we will make is that the PDE is not known exactly, that is, the exact functions are unknown. Instead we will consider an inexact auxiliary PDE induced by a guess for , , and . The auxiliary PDE is given in terms of a bilinear form
with uniformly bounded coefficient functions and ,
and a functional given by
for some .
2.2. PDE-regularized learning problem
To compute we will not just solve but combine this PDE with the given (possibly inexact) local data on . To this end we assume that the data is given in terms of local average values of on open, nonempty subsets for . We will also assume that the sets can be extended to form a covering of in the sense that for each there is a convex set with Lipschitz boundary and such that
The maximal overlap and the maximal diameter of the families and given by
will be used to quantify errors later on.
In the following we will use the notation for the average of over a bounded set . Using a regularization parameter we define the augmented auxiliary forms
with the bilinear form and the possibly inexact data term given by
(4) |
This data term can be viewed as an approximation of which becomes exact for . Using this notation we introduce the PDE-regularized learning problem
(5) |
for the functional
(6) |
Using standard arguments we see that this quadratic minimization problem is equivalent to
(7) |
Analogously to (7) we can define the augmented exact problem
(8) |
for the exact augmented forms
It is straightforward to show that (8) is equivalent to (3).
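As a small illustration of the local averages entering the data term (4), the following sketch (with a hypothetical target on the unit square and a single data point, both chosen only for this example) estimates the average of a function over cubes of shrinking edge length centered at the data point by Monte-Carlo sampling. The discrepancy to the point value shrinks with the cube, which is the mechanism behind treating point values as inexact local averages later on.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical target on (0,1)^2
u = lambda x: np.cos(np.pi * x[..., 0]) * np.cos(np.pi * x[..., 1])

x_i = np.array([0.3, 0.7])        # a single data point for illustration

for edge in (0.2, 0.1, 0.05, 0.025):
    # Monte-Carlo estimate of the average of u over the cube with
    # edge length `edge` centered at x_i.
    samples = x_i + rng.uniform(-0.5 * edge, 0.5 * edge, size=(100_000, 2))
    local_average = u(samples).mean()
    print(edge, abs(local_average - u(x_i)))   # discrepancy shrinks with the cube
```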
2.3. Well-posedness by PDE-based regularization
We now discuss well-posedness of the PDE-regularized learning problem. For convenience we will denote the semi-norm induced by a symmetric, positive semi-definite bilinear form by . Using this notation it is clear that
The following lemma shows that the weighting of the terms in is natural in the sense that it guarantees that scales like .
Lemma 1.
The bilinear form is -continuous with
Proof.
Let . We first note that the Cauchy–Schwarz inequality gives
(9) |
Using this estimate and the Cauchy–Schwarz inequality in we get
∎
Despite this upper bound, a pure data fitting problem
is in general ill-posed, since is not coercive, or, equivalently, cannot be bounded from below by . Due to the finite rank of , the same is true whenever minimization is considered in a space with dimension larger than . Thanks to the PDE-regularization, the situation is different for the PDE-regularized problem (5).
Proposition 1.
Proof.
By the lower bound on we obtain
Furthermore is coercive on the space of constant functions. Hence coercivity of on follows from the Poincaré inequality (see [6, Proposition 2]). Furthermore the upper bounds on and and Lemma 1 imply continuity and thus ellipticity of on . Finally, we get continuity of and thus, similar to the proof of Lemma 1, the Lax–Milgram theorem guarantees existence of a unique solution of (7). Lipschitz-continuous dependence on the data follows from standard arguments for linear elliptic problems. ∎
3. Error bounds for PDE-regularized learning
3.1. Localized Poincaré inequality and improved -coercivity
A central ingredient in the error estimates shown later is the -coercivity of . In the proof of Proposition 1 we have seen that inherits coercivity with respect to the - and -norm from . However, the coercivity constants are independent of the data term. In this section we will show an improved -coercivity by applying localized Poincaré estimates. First we recall the classical Poincaré inequality on convex domains.
Lemma 2.
Let be a convex, open, non-empty, bounded domain and . Then
with Poincaré constant .
In fact the constant given here is the best possible one depending only on the domain diameter. A proof can be found in [19]. As a direct consequence we get the following estimate.
Lemma 3.
Let be a convex, open, non-empty, bounded domain and . Then
with Poincaré constant .
Proof.
It is also possible to get bounds involving just averages over subsets, at the price of an additional constant. An abstract proof has been given in [6]. Since we are interested in the resulting constants, we give an explicit proof here.
Lemma 4.
Let be a convex, open, non-empty, bounded domain, with , , and . Then
with Poincaré constant and especially (using )
Proof.
Note that, using the optimal value for , the constant can be slightly improved to with the golden ratio .
Next we apply Lemma 3 and Lemma 4 locally to show a data-dependent global Poincaré-type estimate. To unify both estimates we encode the maximal mismatch of and in terms of the constant
Lemma 5.
Using the constants defined above it holds for
Proof.
The right-hand side in the estimate of Lemma 5 almost coincides with . The following lemma balances the constants to finally derive an -coercivity of .
Lemma 6.
For any it holds that
Proof.
In the following the relation of and the constant from Lemma 5 will play a crucial role. In the simplest case we would have
(10) |
such that the constant in Lemma 6 reduces to . Since this constant may not be known exactly we quantify the relation by assuming that
(11) |
for . Note that such a always exists by . Using this assumption we obtain the following bounds on
(12) |
3.2. Error analysis
Now we show an estimate for the error in the -norm.
Theorem 1.
Proof.
It should be noted that the residual is zero if the PDE is exact. Hence quantifies the error induced by the inexact PDE. If the data points provide a more fine-grained covering of the domain, then is decreased. In this sense, the PDE error can be reduced by adding more data. On the other hand, simply adding more data cannot cure the data error made by using inexact data.
The estimate also indicates that the PDE and data term should be balanced appropriately: In order to minimize the constant in front of the PDE error term, should be sufficiently small, while it should be sufficiently large in order to minimize the constant in front of the data error term. Since the estimate (14) bounds both constants in terms of , the optimal choice of is (10), which leads to . This motivates the following definition, which allows us to characterize the involved constants. It has to be understood in the sense that the data term and the selected parameter are part of a sequence of problems.
Definition 1.
The following corollary summarizes the result for a non-degenerate and well-balanced problem.
Corollary 1.
Next we will discuss the PDE error term.
Remark 1.
Under assumptions on the smoothness of , , and it can be shown that , and furthermore that this term can be bounded with respect to . To show this we first note that for all we have
To estimate the last term we note that for , problem (3) corresponds to a mixed boundary value problem with
for a decomposition . For sufficiently smooth we then have
(15) |
Now, assuming that and we get
Hence we have shown the PDE error bound
Remark 2.
It can also be shown that such a bound on does in general not exist if is not sufficiently smooth, regardless of the smoothness of . To see this let , , and such that . Now let for . Then
while for .
Finally, we will provide an error bound for the data error term in the special case of point-wise approximations
(16) |
for points . To make sense of this expression, we need at least . However, to derive an error bound in terms of we will also need to relate to for points , independently of . Hence we need to make additional assumptions on the regularity of .
Proposition 2.
Proof.
Now we summarize the results including the estimates for the PDE and data error terms.
Corollary 2.
3.3. Pure data fitting
As a special case of the setting outlined above, one can consider the pure data fitting problem, where nothing is known about the partial differential equation satisfied by . In this case, one could consider using only the data in terms of a least squares ansatz
(17) |
or, equivalently,
Due to the finite rank of , this problem is in general underdetermined and thus ill-posed. As a remedy, it can be considered in a finite dimensional subspace or submanifold of only, which may lead to a well-posed problem. However, this comes at the price that the behavior of the fitted solution is largely determined by .
As a simple example consider and . Then, for small sets , the problem essentially reduces to an interpolation problem in , which may lead to uncontrollably large errors.
To overcome the ill-posedness in a controllable way, we introduce the regularized problem
(18) |
Noting that this takes the form of (7) with and and exact data we can utilize Theorem 1 to get an error estimate.
Theorem 2.
Assume that , and let be the solution of (7) with and . Then
(19) |
Proof.
In the case of exact data in (18) the estimate reduces to
It can also be shown that the result is quasi-optimal in a certain sense. To illustrate this, we investigate the question of how well we can approximate only from the given data and the known regularity . Unfortunately, there is an infinite-dimensional affine subspace
All functions in this space provide a perfect fit to the data and cannot be distinguished in terms of the available information. Hence the best we can do in view of the regularity is to compute a norm-minimizing approximation in , i.e.
Using the orthogonality
and the fact that we obtain the identity
with as in Theorem 2. Now assuming the additional regularity and utilizing the coercivity from Lemma 6 we obtain
Thus—if we choose the optimal weighting parameter such that —the error estimate for coincides with the one for , which is the energy-minimizing one among all functions that fit the data. It should be noted that the only reason for the inequality ‘’ in the estimate for is the application of the coercivity estimate from Lemma 6 and of the Cauchy–Schwarz inequality in . Hence we cannot expect to be able to derive a better estimate unless we sharpen the coercivity bound.
4. Generalized Galerkin discretization
4.1. Abstract a priori error estimate
Now we investigate the discretization of the PDE-regularized problem (7). To this end we consider an abstract generalized Galerkin discretization
(20) |
of the minimization formulation (5) of (7) in a subset . We call this problem a generalized Galerkin discretization since we do not require that is a closed subspace of as in a classical Galerkin discretization. As a consequence, existence and uniqueness of are not guaranteed. However, if is a closed subspace, then (20) is equivalent to the variational equation
(21) |
and existence and uniqueness are guaranteed by the Lax–Milgram theorem.
For several reasons we cannot directly apply the classical Céa-Lemma or Galerkin orthogonality to derive an error bound: The set is in general not a subspace, we want to bound the error in the -norm, and the bilinear form incorporates data-dependent weighting factors. Furthermore we are interested in directly bounding and not just for the auxiliary continuous solution . As a remedy we will first use a nonlinear Céa-Lemma in terms of the weighted norm and then go over to the norm using Lemma 6. The idea of using generalized versions of Céa’s lemma for discretizations in subsets goes back to [8, 13], where this was developed in a metric space setting. Since our setting is more special, we give a direct proof here.
Proof.
Theorem 3.
Proof.
While the first error term is the same as the one in the continuous case, we still need to take care of the best approximation error because it involves the weighted data-dependent norm . The main ingredient is the following bound on this norm.
Lemma 8.
The weighted norm can be bounded according to
To interpret this estimate we again consider a non-degenerate, well-balanced problem. In this case the estimate in Lemma 8 takes the form
where is bounded and even decreasing if the data points cover the domain better and better. We summarize the result in the following corollary.
Corollary 3.
Inserting the bounds on the PDE and data error we finally get:
Corollary 4.
It is important to note that we neither require that is a subspace, nor that the solution to (20) is unique. Hence the error estimate is in principle applicable to nonlinear approximation schemes. However, it will in general be hard to compute a global minimizer in as required in (20), since practical optimization schemes often guarantee at most local optimality.
In case of inexact or local minimization we can at least bound the error in terms of the algebraic energy error. Let be an approximation of . Then, using a straightforward modification of the proof of Lemma 7, we get
Thus, if we want to bound instead of we have to add the algebraic error term
to the right-hand sides of the estimates in Theorem 3, Corollary 3, and Corollary 4.
4.2. Finite element discretization
As an example for a linear Galerkin discretization we apply the abstract error bound to a finite element ansatz. To this end let be a conforming Lagrange finite element space of order (piecewise polynomials in for simplex elements or piecewise tensor-polynomials in for cubic elements) on a triangulation with mesh size . Then we can apply the classical finite element interpolation error estimate which, in the present special case, provides:
Proposition 3.
Let be the corresponding interpolation operator and . Then
with a constant depending only on the shape regularity of the triangulation. Here denotes the -semi-norm containing only derivatives of order .
For a proof we refer to [3, Theorem 3.2.1 and Remark 3.2.2]. Inserting the interpolation error estimate into the abstract error bound, we get the following total error bound.
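To make the resulting linear Galerkin discretization concrete, the following sketch assembles and solves a one-dimensional instance of the regularized problem (7) with P1 finite elements. The concrete auxiliary PDE ($-v''+v=f$ with natural boundary conditions), the use of point values instead of exact local averages, the $1/N$ weighting of the data term, and the ad hoc choice of the regularization parameter are assumptions of this sketch, chosen only for brevity.

```python
import numpy as np

# --- hypothetical setup: auxiliary PDE -v'' + v = f on (0,1), natural BC ---
f = lambda x: (1.0 + np.pi**2) * np.cos(np.pi * x)   # so that u(x) = cos(pi x) solves the PDE
u_exact = lambda x: np.cos(np.pi * x)

n_el = 200                       # number of P1 elements
nodes = np.linspace(0.0, 1.0, n_el + 1)
h = nodes[1] - nodes[0]

# Element stiffness/mass matrices for P1 elements of length h.
Ke = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
Me = (h / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])

A = np.zeros((n_el + 1, n_el + 1))   # a(v, w) = int v'w' + v w dx
F = np.zeros(n_el + 1)               # l(v)    = int f v dx (trapezoidal rule per element)
for e in range(n_el):
    idx = [e, e + 1]
    A[np.ix_(idx, idx)] += Ke + Me
    F[idx] += 0.5 * h * f(nodes[idx])

# --- data term: point values at N random points (as in (16)),
#     weighted by 1/N; this weighting is an assumption of this sketch ---
rng = np.random.default_rng(3)
N = 50
x_data = rng.uniform(0.0, 1.0, N)
w_data = u_exact(x_data)

D = np.zeros_like(A)
b = np.zeros_like(F)
elem = np.minimum((x_data / h).astype(int), n_el - 1)
for xi, wi, e in zip(x_data, w_data, elem):
    # values of the two local hat functions at xi
    phi = np.array([(nodes[e + 1] - xi) / h, (xi - nodes[e]) / h])
    idx = [e, e + 1]
    D[np.ix_(idx, idx)] += np.outer(phi, phi) / N
    b[idx] += wi * phi / N

# --- assemble and solve the regularized problem: (eps*A + D) u = eps*F + b ---
eps = 1e-3                            # regularization parameter (chosen ad hoc here)
u_h = np.linalg.solve(eps * A + D, eps * F + b)

err = np.sqrt(np.sum((u_h - u_exact(nodes)) ** 2) * h)   # crude discrete L2 error
print(err)
```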
4.3. A heuristic strategy for parameter selection
To guarantee a non-degenerate and well-balanced problem in the sense of Definition 1, the size of the averaging sets and the parameter should be carefully selected such that the quotient is bounded and such that scales like . In applications where data points are given, we can in general not assume that one is able to compute a decomposition into sets and their maximal diameter exactly. While this is in principle possible using a Voronoi decomposition, computing the latter is in general much too costly. Hence we propose a heuristic strategy to determine the sets and estimates for and under the assumption of uniformly distributed data points .
To this end we make the assumption that fits into a cube . Then we can expect that the points are on average uniformly spaced. For an ideal uniformly spaced distribution of points in we can place non-overlapping cubes with edge length on a uniform lattice into . The diameter of these boxes is
which we will use as an estimate for . In order to fix for a parameter , we will use cubes centered at with edge length
Since we cannot control explicitly, we assume and select
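In code, the heuristic reads as follows; the constants `c_B` and `c_eps` as well as the quadratic coupling of the regularization parameter to the estimated diameter are assumptions of this sketch, since the corresponding quantities above are only available as estimates.

```python
import math

def heuristic_parameters(N, d, H=1.0, c_B=1.0, c_eps=1.0):
    """Heuristic choice of averaging sets and regularization parameter for
    N (roughly uniformly distributed) data points in a cube of edge length H
    in R^d.  The constants c_B, c_eps and the coupling eps ~ delta**2 are
    assumptions of this sketch."""
    # For N points on a uniform lattice in the cube, each point owns a small
    # cube of edge length H * N**(-1/d); its diameter estimates the covering
    # fineness delta.
    delta = math.sqrt(d) * H * N ** (-1.0 / d)
    # Cubes centered at the data points with a (possibly smaller) edge length.
    edge = c_B * H * N ** (-1.0 / d)
    # Couple the regularization parameter to the covering fineness.
    eps = c_eps * delta ** 2
    return delta, edge, eps

print(heuristic_parameters(N=1000, d=2))
```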
5. Numerical experiments
5.1. A smooth test problem
Finally we illustrate the theoretical findings using numerical experiments for test problems. To this end we consider an example problem from [4] given by
with . For the eigenfunction of the Laplacian and this equation is satisfied with . We do not want to hide that this example was selected because, in contrast to Dirichlet boundary conditions, natural boundary conditions do not require any approximation for neural network discretizations, which would not be covered by the error analysis.
As the inexact auxiliary PDE we will use the same differential operator but a scaled right-hand side . Then we can explicitly compute the PDE error term
In all experiments we generate data by creating uniformly distributed random points and use cubes with edge length centered at the points . The points are sampled in such that is guaranteed. The edge length and the penalty parameter are selected according to the heuristic strategy proposed in Subsection 4.3 for different values of . Notice that the heuristic strategy does not guarantee a non-degenerate and well-balanced problem in the sense of Definition 1. Nevertheless, the method is still covered by the theory presented above, but the constants may blow up if the strategy fails, leading to different error rates.
5.2. Finite element discretization
In this section we discretize the problem with conforming finite elements for . Since we are not interested in illustrating well known finite element error bounds, we use a fixed finite element grid with uniformly spaced interval/rectangular/hexahedral elements and tensorial Lagrange finite elements of order throughout the experiments. For we used elements and order , for we used elements and order , and for we used elements and order . In any case the discretization error can be neglected compared to the PDE and data error terms reported in the following. All reported -errors are computed using a quadrature rule of order on the grid elements.
For the first experiment we fix and compute the solution for an increasing number of data points . The computations are done using exact data, i.e., , and inexact data in the form of point values . Figure 1 depicts the error over for dimensions and parameters in the heuristic strategy. The left and right pictures show exact and inexact data, respectively. We observe that the error decays like , as expected from the theoretical error bounds for exact data and uniformly distributed data points. We also find that the error increases by a constant factor if we increase , which is again in accordance with the error estimate, where this enters in the constant factor . For inexact data the situation is very similar. We observe the order , but for the very small errors obtained for , the case eventually becomes better compared to . This can be explained by the fact that, while increasing increases the constant in the PDE error term, it also decreases and thus the data error term. Hence, if the data error comes into the range of the PDE error, we expect that larger reduces the total error.
In the second experiment we fix data points and compute the solution for inexact right-hand sides with varying , using exact data as well as inexact data . Figure 2 depicts the error over the PDE error for dimensions and parameters in the heuristic strategy. The left and right pictures show exact and inexact data, respectively. Here we observe that the error scales like for a wide range of , as expected from the theoretical error bound. Again we observe that for small the total error is better compared to , where the error finally saturates for . The latter indicates that the data error starts to dominate in this regime, such that we no longer benefit from improving the PDE error.
5.3. Neural network discretization
Finally we consider the discretization of the test problem with neural networks. To this end we minimize the loss functional defined in (6) over a set of neural networks with a fixed architecture. In this case integrals are no longer evaluated exactly but approximated using stochastic integration, as proposed in [4]. More precisely, we approximate the local averages appearing in the functional by averaging over uniformly distributed sampling points in . Since all sets are the same up to translation, we also translate the sampling points accordingly. The integration over is approximated using uniformly distributed sampling points in that are newly sampled in each step of the iterative algebraic solution method. Since we use natural boundary conditions, we only need to approximate the space without any hard boundary constraints, so no approximation of boundary conditions was needed.
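A possible realization of such a stochastically integrated loss is sketched below in PyTorch. The concrete integrand (chosen consistently with a hypothetical exact solution of $-\Delta u + u = f$ with natural boundary conditions on the unit cube), the $1/N$ weighting of the data term, and the sample sizes are assumptions of this sketch; the translated sampling points for the averaging sets follow the description above.

```python
import math
import torch

def mc_loss(v, x_data, w_data, edge, eps, n_omega=4096, n_box=64, d=2):
    """Monte-Carlo approximation of a regularized loss of the form
    eps * int_Omega (|grad v|^2 + v^2 - 2 f v) dx + data term
    on Omega = (0,1)^d.  The concrete integrand, the 1/N weighting of the
    data term, and the sample sizes are assumptions of this sketch."""
    # Right-hand side chosen so that prod_k cos(pi x_k) solves -Lap(u) + u = f.
    f = lambda x: (1.0 + d * math.pi**2) * torch.cos(math.pi * x).prod(dim=-1)

    # PDE part: uniform samples in Omega, gradients via automatic differentiation.
    x = torch.rand(n_omega, d, requires_grad=True)
    vx = v(x).squeeze(-1)
    grad = torch.autograd.grad(vx.sum(), x, create_graph=True)[0]
    pde_part = (grad.pow(2).sum(dim=1) + vx**2 - 2.0 * f(x) * vx).mean()

    # Data part: one fixed set of sampling points in a reference cube,
    # translated to each averaging set (all sets coincide up to translation).
    ref = (torch.rand(n_box, d) - 0.5) * edge
    averages = v(x_data.unsqueeze(1) + ref).squeeze(-1).mean(dim=1)
    data_part = ((averages - w_data) ** 2).mean()

    return eps * pde_part + data_part
```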
The network architecture is as follows: A first layer inflates the input size to . This is followed by blocks, each consisting of two densely connected layers with a residual connection. A final affine layer reduces the size from to . In total the network takes the form
where is a linear map padding its input by zeros, is an affine map, and each block has the form
for dense weight matrices and bias vectors . In all layers we used the activation function . All parameters involved in are determined in the training procedure.
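A possible PyTorch realization of this architecture is sketched below; the network width, the number of blocks, the precise form of the residual blocks, and the choice of `tanh` as activation are assumptions of this sketch, since these details are not fully reproduced above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two densely connected layers with a residual connection (assumed form)."""
    def __init__(self, width):
        super().__init__()
        self.lin1 = nn.Linear(width, width)
        self.lin2 = nn.Linear(width, width)

    def forward(self, x):
        return x + self.lin2(torch.tanh(self.lin1(torch.tanh(x))))

class RitzNet(nn.Module):
    def __init__(self, d, width=16, n_blocks=4):
        super().__init__()
        assert d <= width, "the input is padded up to the network width"
        self.width = width
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 1)   # final affine layer reducing to a scalar

    def forward(self, x):
        # First "layer": inflate the input from dimension d to the network width
        # by padding with zeros (a linear map without trainable parameters).
        x = F.pad(x, (0, self.width - x.shape[-1]))
        return self.out(self.blocks(x))

# e.g. net = RitzNet(d=2) for a two-dimensional domain
```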
The networks are trained using the Adam method [15], which is a variant of stochastic gradient descent. Training was stopped once the error no longer decreased. While such a criterion is in general impractical due to the unknown solution , we use it here to avoid effects of more heuristic stopping criteria.
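A corresponding training loop, reusing the hypothetical `mc_loss` and `RitzNet` from the sketches above, could look as follows; the learning rate, the monitoring interval, and the number of monitoring points are assumptions of this sketch.

```python
import torch

def train(net, x_data, w_data, u_exact, edge, eps, max_steps=20_000, lr=1e-3, d=2):
    """Adam training with the (impractical, but unambiguous) stopping criterion
    from the text: stop once the monitored L2 error no longer decreases."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    x_mon = torch.rand(50_000, d)                      # fixed monitoring points
    best_err = float("inf")
    for step in range(max_steps):
        opt.zero_grad()
        loss = mc_loss(net, x_data, w_data, edge, eps, d=d)  # see the sketch above
        loss.backward()
        opt.step()
        if step % 500 == 0:
            with torch.no_grad():
                # Monte-Carlo estimate of the L2 error against the known solution.
                err = torch.sqrt(((net(x_mon).squeeze(-1) - u_exact(x_mon)) ** 2).mean())
            if err >= best_err:       # error stopped decreasing
                break
            best_err = err
    return net
```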
Again we fix and compute the solution for an increasing number of data points . The computations are done using exact data, i.e., , and inexact data in the form of point values . Figure 3 depicts the error over for dimensions and parameters in the heuristic strategy. The left and right pictures show exact and inexact data, respectively. For and we again observe that the error decays like . For the situation is less clear. While we roughly observe the order again, there are some exceptions; most importantly, in one case the error increases after adding more data.
References
- [1] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci., 1(1):8–45, 2019. arxiv:1705.01714.
- [2] M. Burger and A. Neubauer. Error bounds for approximation with neural networks. J. Approx. Theory, 112(2):235–250, 2001.
- [3] P. G. Ciarlet. The finite element method for elliptic problems, volume 4 of Studies in Mathematics and its Applications. North-Holland, Amsterdam New York Oxford, 1978.
- [4] W. E and B. Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Comm. Math. Stats., 6(1):1–12, 2018.
- [5] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989.
- [6] C. Gräser. A note on Poincaré- and Friedrichs-type inequalities, 2015. arxiv:1512.02842.
- [7] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks, 2019. arxiv:1905.01208.
- [8] P. Grohs, H. Hardering, and O. Sander. Optimal a priori discretization error bounds for geodesic finite elements. Found. Comput. Math., 15:1357–1411, 2015.
- [9] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations, 2018. arxiv:1809.02362.
- [10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms, 2019.
- [11] B. D. Haeffele and R. Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.
- [12] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci., 115(34):8505–8510, 2018. arXiv:1707.02568.
- [13] H. Hardering. Intrinsic Discretization Error Bounds for Geodesic Finite Elements. PhD thesis, Freie Universität Berlin, 2015.
- [14] E. Kharazmi, Z. Zhang, and G. E. Karniadakis. Variational physics-informed neural networks for solving partial differential equations, 2019. arxiv:1912.00873.
- [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014. arxiv:1412.6980.
- [16] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural networks and parametric PDEs, 2019.
- [17] Y. Liao and P. Ming. Deep Nitsche method: Deep Ritz method with essential boundary conditions, 2019.
- [18] J. Müller and M. Zeinhofer. Deep Ritz revisited, 2019. arxiv:1912.03937.
- [19] L. E. Payne and H. F. Weinberger. An optimal Poincaré inequality for convex domains. Arch. Rational Mech. Anal., 5(1):286–292, 1960.
- [20] D. Perekrestenko, P. Grohs, D. Elbrächter, and H. Bölcskei. The universal approximation power of finite-width deep ReLU networks, 2018.
- [21] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
- [22] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta numerica, 8:143–195, 1999.
- [23] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations, 2017. arxiv:1711.10561.
- [24] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics informed deep learning (part II): Data-driven discovery of nonlinear partial differential equations, 2017. arxiv:1711.10566.
- [25] G. M. Rotskoff and E. Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, 2018. arxiv:1805.00915.
- [26] Z. Shen, H. Yang, and S. Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, Nov 2019.
- [27] R. Vidal, J. Bruna, R. Giryes, and S. Soatto. Mathematics of deep learning, 2017. arxiv:1712.04741.
- [28] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.