
Error bounds for PDE-regularized learning

Carsten Gräser, Institut für Mathematik, Freie Universität Berlin, 14195 Berlin, Germany ([email protected]), and Prem Anand Alathur Srinivasan, Institut für Mathematik, Freie Universität Berlin, 14195 Berlin, Germany ([email protected])
Abstract.

In this work we consider the regularization of a supervised learning problem by partial differential equations (PDEs) and derive error bounds for the obtained approximation in terms of a PDE error term and a data error term. Assuming that the target function satisfies an unknown PDE, the PDE error term quantifies how well this PDE is approximated by the auxiliary PDE used for regularization. It is shown that this error term decreases if more data is provided. The data error term quantifies the accuracy of the given data. Furthermore, the PDE-regularized learning problem is discretized by generalized Galerkin discretizations solving the associated minimization problem in subsets of the infinite dimensional function space, which are not necessarily subspaces. For such discretizations an error bound in terms of the PDE error, the data error, and a best approximation error is derived.

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID: 390685689). Project EF3-4 "Physics-regularized learning".

1. Introduction

The problem of learning an unknown function $u:\Omega\to\mathbb{R}$ is one of the central problems in the field of machine learning. A classical ansatz is to minimize the risk functional

\[\tilde{u}\mapsto|\Omega|^{-1}\int_{\Omega}L(\tilde{u}(x),u(x))\,dx\]

in some ansatz set $V_h$ for a loss functional $L:\mathbb{R}^2\to\mathbb{R}$. In case of a quadratic loss functional this leads to an $L^2(\Omega)$-best-approximation problem for $u$ in $V_h\subset L^2(\Omega)$.

Solving this problem in general requires complete knowledge of $u\in L^2(\Omega)$. In practice the available data on $u$ is often incomplete. A classical example of incomplete data is a finite set of known point data $(p_1,u(p_1)),\dots,(p_m,u(p_m))$ for a continuous function $u$. In this case, instead of minimizing the risk, one often considers the problem of empirical risk minimization given by

(1) \[\tilde{u}_{h}=\operatorname{argmin}_{v\in V_{h}}\frac{1}{m}\sum_{i=1}^{m}|v(p_{i})-u(p_{i})|^{2}.\]

Here, the weighted sum can be viewed as a Monte-Carlo approximation of the $L^2(\Omega)$-norm using the sample set $\{p_1,\dots,p_m\}$.
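
To illustrate this interpretation, the following minimal Python sketch compares the empirical risk with the scaled squared $L^2$ error it approximates. The target $u$, the candidate $\tilde{u}$, and the sample size are hypothetical choices for illustration only.

```python
import numpy as np

# Hypothetical example on Omega = [0, 1]: target u and a candidate approximation v.
u = lambda x: np.sin(2 * np.pi * x)
v = lambda x: np.sin(2 * np.pi * x) + 0.1 * x

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=10_000)            # sample points p_1, ..., p_m

empirical_risk = np.mean((v(p) - u(p)) ** 2)      # (1/m) * sum |v(p_i) - u(p_i)|^2

# Reference value |Omega|^{-1} int_Omega |v - u|^2 dx, approximated on a fine grid.
x = np.linspace(0.0, 1.0, 200_001)
scaled_l2_error = np.mean((v(x) - u(x)) ** 2)

print(empirical_risk, scaled_l2_error)            # agree up to the Monte-Carlo error
```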

Recently a lot of attention has been paid to learning with shallow or deep neural networks. In this case $V_h\subset L^2(\Omega)$ is given as the range $V_h=\Phi(\mathbb{R}^N)$ of a nonlinear map $\Phi:\mathbb{R}^N\to L^2(\Omega)$ over a finite dimensional parameter domain $\mathbb{R}^N$, such that the empirical risk minimization problem in $V_h$ can be written as

(2) \[W^{*}=\operatorname{argmin}_{W\in\mathbb{R}^{N}}\frac{1}{m}\sum_{i=1}^{m}|\Phi(W)(p_{i})-u(p_{i})|^{2}\]

with $\tilde{u}_h=\Phi(W^*)$. The minimizer $W^*$ is often approximated using stochastic gradient-type algorithms. While this approach has been used with reasonable success in practice, its theoretical understanding is still in its infancy. The central question is whether the total error, i.e. the difference $\tilde{u}_h-u$ between the computed approximation and the target function, can be controlled. When analyzing this error, several aspects have to be taken into account. (A minimal numerical sketch of solving (2) with a gradient-based method is given after the discussion of these aspects.)

Learning: Here the question is whether we can control the error made by the algebraic solution method, e.g. stochastic gradient descent. If the algebraic error $W-W^*$ of a computed approximate parameter set $W$ can be controlled, local Lipschitz continuity of the parametrization $\Phi$ allows one to control the induced error $\Phi(W)-\Phi(W^*)$ of the associated functions. A major obstacle for an error analysis is that the learning problem (2) is in general nonconvex and may have multiple global or local minimizers, saddle points, or plateaus. One direction to address this problem is to characterize network architectures where local optimality implies global optimality or at least plateaus (see, e.g., [11, 27] and the references therein). Another direction is to interpret a (stochastic) gradient flow for shallow neural networks as the evolution of an interacting particle system, which allows one to relate long time limits (i.e. stationary points) and the many particle limit [25]. In the present paper we do not consider this aspect and concentrate on the analysis of global minimizers.

Expressivity: This centers around the question of how well $u$ can be approximated in $V_h$. Starting from early results on universal approximation properties of neural networks (see, e.g., [5, 22]) there has been significant recent progress on deriving best approximation error and expressivity bounds for deep neural networks [1, 2, 7, 9, 10, 20, 21, 26, 28]. An additional question is to what extent it is possible to realize the theoretically derived best approximation error by solving the learning problem. Despite its importance this question is largely unexplored, since it can hardly be answered due to ill-posedness in the case of incomplete data. In the present paper we will address this question by regularizing the problem and proving discretization error bounds for the minimizer $\tilde{u}_h$ in terms of the best approximation error for $u$ in $V_h$.

Generalization: This addresses the question whether the trained function generalizes from the training data to other input data. In mathematical terms this leads to the question of how well the computed function $\tilde{u}_h$ can approximate $u$ on $\Omega$ given incomplete data at some points $\{p_1,\dots,p_m\}\subsetneqq\Omega$. This question is often discussed in a statistical setting, considering the input data as randomly drawn samples. In the present paper we take a deterministic perspective and are interested in error bounds for $\tilde{u}_h-u$ in terms of the amount of provided data. Again such bounds can in general not be derived due to ill-posedness, which we will address by a regularization of the problem.
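
As a concrete, non-authoritative illustration of problem (2), the following Python sketch fits a small shallow tanh network to point data with a gradient-based optimizer. The target function, the network width, and the choice of optimizer are hypothetical and are not the setup analyzed in this paper.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical setup: learn u(x) = sin(2*pi*x) on [0, 1] from m point samples with a
# shallow network Phi(W)(x) = sum_j c_j * tanh(a_j * x + b_j), W = (a, b, c).
rng = np.random.default_rng(1)
m, width = 50, 10
p = rng.uniform(0.0, 1.0, size=m)
y = np.sin(2 * np.pi * p)                        # exact point data u(p_i)

def phi(W, x):
    a, b, c = np.split(W, 3)                     # three parameter blocks of equal size
    return np.tanh(np.outer(x, a) + b) @ c

def empirical_risk(W):                           # objective of problem (2)
    return np.mean((phi(W, p) - y) ** 2)

W0 = rng.normal(size=3 * width)
res = minimize(empirical_risk, W0, method="BFGS")    # gradient-type solver
print("empirical risk at computed parameters:", res.fun)
```

Note that such a solver only returns a stationary point in general, which is exactly the gap between computable approximations and the global minimizers analyzed below.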

The fact that it is hard to derive error bounds, even for global minimizers of the learning problem, is deeply related to the fact that problem (1), which only incorporates incomplete data, is in general ill-posed, leading to severe artifacts. For example, if the ansatz set $V_h$ is 'large' in comparison to the available amount of data, one observes so-called overfitting, where $\tilde{u}$ matches $u$ nicely in the points $p_i$ but fails to provide a reasonable approximation in other parts of $\Omega$. This is in fact a consequence of ill-posedness.¹ A common technique to reduce such artifacts is to introduce regularization terms for $W$. However, it is unclear how the influence of such regularization on the error $\Phi(W^*)-u=\tilde{u}_h-u$ with respect to the $L^2(\Omega)$-norm could be analyzed. Another obstacle in deriving error bounds for $\tilde{u}_h-u$ is that it is unclear how to understand (1) as a discretization of a continuous problem and thus how such a discretization and the influence of incomplete data could be analyzed. In the present paper we target these questions by introducing regularizations of the learning problem by partial differential equations (PDEs). This is based on the assumption that additional knowledge on the process generating the data, or at least on the smoothness of $u$, is available. Using such regularizations we derive a framework which allows us to prove error bounds for $\tilde{u}_h-u$ that quantify the effect of incomplete data and relate the discretization error in nonlinear ansatz spaces (like neural networks) to their approximation properties.

¹This can be easily explained in the case of approximation by polynomials $V_h=\mathcal{P}_n$ of degree $n$. If $n=m-1$, then the resulting $\tilde{u}$ is exactly the interpolation polynomial, which in general exhibits severe over- and undershoots. If $n\geq m$ then there is no unique solution and one can add an oscillatory polynomial of degree $n$ with arbitrary amplitude to the interpolation polynomial, leading to uncontrollably large errors.

Due to their approximation power, neural networks have already been proposed for the solution of partial differential equations (PDEs). In [4] the authors introduce the deep Ritz method, which approximates the solution of a variational formulation of a PDE using a discretization by deep neural networks. In contrast to (2) this approach minimizes the Dirichlet energy associated to the PDE in the nonlinear neural network space, using a simple penalty approach for essential boundary data. As a variant, it was proposed to minimize the consistent penalty formulation from Nitsche's method using a heuristic penalty parameter [17]. While error bounds for both methods are unknown so far, a convergence result based on $\Gamma$-convergence was recently derived [18]. A different approach was taken in [23], where the authors introduce so-called physics informed neural networks (PINNs), which are trained using a collocation least squares functional. This ansatz is extended in [24] where, in addition to the PDE, point data of the target function is incorporated. Replacing collocation by a least squares Petrov–Galerkin ansatz leads to the variational PINN approach considered in [14]. It has also been highlighted that a neural network ansatz is especially promising for high-dimensional PDEs [4, 12, 9, 17]. Other uses of neural networks in the context of PDEs include, e.g., reduced basis methods for parametrized problems [16].

Our contribution: In the present paper we will in general assume that $u$ solves an elliptic PDE which is not known exactly. Since we cannot use the unknown exact PDE, an inexact auxiliary PDE is used to regularize the problem. Furthermore, to make the problem well-posed in a Lebesgue- and Sobolev-space setting, we first replace the point data $u(p_i)$ by local averages $\fint_{B_i}u\approx u(p_i)$ on sets $B_i\subset\Omega$ with $p_i\in B_i$, leading to the regularized learning problem

\[\tilde{u}=\operatorname{argmin}_{v\in V}\frac{1}{2}\sum_{i=1}^{m}|B_{i}|\left|\fint_{B_{i}}v-\fint_{B_{i}}u\right|^{2}+\delta\left(\frac{1}{2}\tilde{a}(v,v)-\tilde{\ell}(v)\right).\]

Here, $\frac{1}{2}\tilde{a}(v,v)-\tilde{\ell}(v)$ is the Dirichlet energy associated to the elliptic auxiliary PDE and $\delta>0$ is a regularization parameter that balances the data and PDE terms. The main result of the paper is an error bound of the form

\[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq CR^{2}\mathbb{E}_{\operatorname{pde}}.\]

Here $R$ is a constant that can be decreased by adding more data and $\mathbb{E}_{\operatorname{pde}}$ quantifies the error induced by using an inexact auxiliary PDE. Hence the accuracy of $\tilde{u}$ can be improved either by providing more data or by improving the exactness of the auxiliary PDE. In a second step we treat the case of given point values $u(p_i)$ by considering $u(p_i)$ as an inexact variant of $\fint_{B_i}u$. Assuming $\lambda$-Hölder-continuity of $u$ we can control the additional error by an error bound

\[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq CR^{2}\mathbb{E}_{\operatorname{pde}}+Cr^{\lambda},\]

where $r$ can again be decreased by adding more data.

Finally, using a nonlinear Céa-Lemma, we derive a bound

\[\|\tilde{u}_{h}-u\|_{L^{2}(\Omega)}\leq CR^{2}\mathbb{E}_{\operatorname{pde}}+Cr^{\lambda}+C\inf_{v\in V_{h}}\Bigl(R\|\nabla(v-u)\|_{L^{2}(\Omega)}+\|v-u\|_{L^{2}(\Omega)}\Bigr)\]

for the case of a generalized Galerkin discretization where $\tilde{u}_h$ is computed by minimizing in a subset $V_h$ of $V$. A variant for inexact minimizers is also presented. Since the result allows for non-subspace subsets $V_h\subset V$, it is in principle applicable to nonlinear approximation schemes like neural networks. Inserting known approximation error bounds (e.g. from [10]) on the right hand side, this leads to a discretization error bound for neural network discretizations under the assumption that a global minimizer can be computed, or that the algebraic energy error can be controlled. While the latter can in general not be guaranteed, the derived results open a new perspective for an error analysis of neural networks. For example, the same arguments can be used to provide a-priori error bounds for the deep Ritz method [4] for Neumann problems with exact (or controlled inexact) global minimizers, which allows one to quantify recent convergence results [18].

The paper is organized as follows: First the PDE-regularized learning problem is introduced and its well-posedness is discussed in Section 2. Then an $L^2(\Omega)$ error bound for $\tilde{u}-u$ is derived in Section 3 in the infinite dimensional case. This section also discusses the quasi-optimality of the derived error bound for a pure data fitting problem without a-priori knowledge on the PDE. The generalized Galerkin discretization in subsets $V_h$ is introduced and analyzed in Section 4. Finally, the theoretical findings are illustrated by numerical experiments with finite element and neural network discretizations in Section 5.

2. Problem setting

2.1. Exact and inexact auxiliary PDEs

We are interested in approximating a function $u:\Omega\to\mathbb{R}$ on a bounded domain $\Omega\subset\mathbb{R}^d$ with Lipschitz boundary. Throughout the paper we make the assumption that $u$ solves an elliptic partial differential equation (PDE) given in terms of a variational equation

(3) \[u\in V:\qquad a(u,v)=\ell(v)\qquad\forall v\in V\]

for a closed subspace $V\subset H^1(\Omega)$ with $H^1_0(\Omega)\subset V$, a symmetric bilinear form

\[a(w,v):=\int_{\Omega}\alpha(x)\nabla w(x)\cdot\nabla v(x)+\sigma(x)w(x)v(x)\,dx\]

with uniformly bounded coefficient functions $\alpha:\Omega\to\mathbb{R}$ and $\sigma:\Omega\to\mathbb{R}$,

\[0<\alpha_{\min}\leq\alpha(x)\leq\alpha_{\max}<\infty,\qquad 0\leq\sigma(x)\leq\|\sigma\|_{L^{\infty}(\Omega)}<\infty,\]

and $\ell\in V^{*}$ given by

\[\ell(v):=\int_{\Omega}f(x)v(x)\,dx\]

for some $f\in L^2(\Omega)$. The subspace $V$ is chosen such that $a(\cdot,\cdot)$ is coercive on $V$. Then (3) has a unique solution $u\in V$ by the Lax–Milgram theorem.

The basic assumption we will make is that the PDE is not known exactly, that is, the exact functions $\alpha,\sigma,\ell$ are unknown. Instead we will consider an inexact auxiliary PDE induced by a guess for $\alpha$, $\sigma$, and $f$. The auxiliary PDE is given in terms of a bilinear form

\[\tilde{a}(w,v):=\int_{\Omega}\tilde{\alpha}(x)\nabla w(x)\cdot\nabla v(x)+\tilde{\sigma}(x)w(x)v(x)\,dx\]

with uniformly bounded coefficient functions $\tilde{\alpha}:\Omega\to\mathbb{R}$ and $\tilde{\sigma}:\Omega\to\mathbb{R}$,

\[0<\tilde{\alpha}_{\min}\leq\tilde{\alpha}(x)\leq\tilde{\alpha}_{\max}<\infty,\qquad 0\leq\tilde{\sigma}(x)\leq\|\tilde{\sigma}\|_{L^{\infty}(\Omega)}<\infty,\]

and a functional $\tilde{\ell}\in V^{*}$ given by

\[\tilde{\ell}(v):=\int_{\Omega}\tilde{f}(x)v(x)\,dx\]

for some $\tilde{f}\in L^2(\Omega)$.

2.2. PDE-regularized learning problem

To compute $\tilde{u}$ we will not just solve $\tilde{a}(\tilde{u},\cdot)-\tilde{\ell}=0$ but combine this PDE with the given (possibly inexact) local data on $u$. To this end we assume that the data is given in terms of local average values of $u$ on open, nonempty subsets $B_i\subset\Omega$ for $i=1,\dots,m$. We will also assume that the sets $B_i$ can be extended to form a covering of $\Omega$ in the sense that for each $B_i$ there is a convex set $K_i$ with Lipschitz boundary and $B_i\subset K_i$ such that

\[\Omega\subset\bigcup_{i=1}^{m}\overline{K_{i}}.\]

The maximal overlap and the maximal diameters of the families $(B_i)$ and $(K_i)$, given by

\[M:=\max_{x\in\Omega}|\{i\mathrel{|}x\in K_{i}\}|,\qquad r:=\max_{i=1,\dots,m}\operatorname{diam}(B_{i}),\qquad R:=\max_{i=1,\dots,m}\operatorname{diam}(K_{i}),\]

will be used to quantify errors later on.

In the following we will use the notation $\fint_{U}v=|U|^{-1}\int_{U}v(x)\,dx$ for the average of $v$ over a bounded set $U$. Using a regularization parameter $\delta>0$ we define the augmented auxiliary forms

\[\tilde{c}(w,v):=\tilde{a}(w,v)+\delta^{-1}b(w,v),\qquad\tilde{r}(v):=\tilde{\ell}(v)+\delta^{-1}\widetilde{b(u,v)},\]

with the bilinear form $b(\cdot,\cdot)$ and the possibly inexact data term $\widetilde{b(u,\cdot)}\approx b(u,\cdot)$ given by

(4) \[b(w,v):=\sum_{i=1}^{m}|B_{i}|\fint_{B_{i}}w\fint_{B_{i}}v,\qquad\widetilde{b(u,v)}:=\sum_{i=1}^{m}b_{i}|B_{i}|\fint_{B_{i}}v.\]

This data term can be viewed as an approximation of $b(u,v)$ which becomes exact for $b_i=\fint_{B_i}u$. Using this notation we introduce the PDE-regularized learning problem

(5) \[\tilde{u}=\operatorname{argmin}_{v\in V}\tilde{J}(v)\]

for the functional

(6) \[\tilde{J}(v):=\frac{1}{2}\tilde{c}(v,v)-\tilde{r}(v)=\frac{1}{2}\sum_{i=1}^{m}|B_{i}|\left|\fint_{B_{i}}v-b_{i}\right|^{2}+\delta\Bigl(\frac{1}{2}\tilde{a}(v,v)-\tilde{\ell}(v)\Bigr)+\operatorname{const}.\]
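
For orientation, the following Python sketch evaluates the two contributions to $\tilde{J}$ (up to the constant) for a given candidate function on $\Omega=(0,1)$ by simple numerical quadrature. The target function, the boxes $B_i$, the auxiliary coefficients, and the parameter values are hypothetical placeholders, not the setup used in the experiments below.

```python
import numpy as np

# Hypothetical 1D setting on Omega = (0, 1) with auxiliary coefficients
# alpha_tilde = 1, sigma_tilde = 0, f_tilde = 0 and exact averaged data b_i.
u = lambda x: np.cos(np.pi * x)                                   # target (only used for the data)
v = lambda x: np.cos(np.pi * x) + 0.05 * np.sin(3 * np.pi * x)    # candidate function
dv = lambda x: -np.pi * np.sin(np.pi * x) + 0.15 * np.pi * np.cos(3 * np.pi * x)

x = np.linspace(0.0, 1.0, 100_001)     # fine grid used as simple quadrature
centers = np.linspace(0.1, 0.9, 9)     # data sites p_i
half = 0.05                            # boxes B_i = (p_i - half, p_i + half), |B_i| = 2 * half
delta = 1e-2                           # regularization parameter

data_term = 0.0
for p in centers:
    box = x[(x > p - half) & (x < p + half)]
    avg_v = np.mean(v(box))                                # fint_{B_i} v
    b_i = np.mean(u(box))                                  # exact data fint_{B_i} u
    data_term += 0.5 * (2 * half) * (avg_v - b_i) ** 2     # 0.5 * |B_i| * |fint v - b_i|^2

dirichlet_energy = 0.5 * np.mean(dv(x) ** 2)               # 0.5 * a_tilde(v, v) since |Omega| = 1
print(data_term + delta * dirichlet_energy)                # J_tilde(v) up to the constant
```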

Using standard arguments we see that this quadratic minimization problem is equivalent to

(7) \[\tilde{u}\in V:\qquad\tilde{c}(\tilde{u},v)=\tilde{r}(v)\qquad\forall v\in V.\]

Analogously to (7) we can define the augmented exact problem

(8) \[u\in V:\qquad c(u,v)=r(v)\qquad\forall v\in V\]

for the exact augmented forms

\[c(w,v):=a(w,v)+\delta^{-1}b(w,v),\qquad r(v):=\ell(v)+\delta^{-1}b(u,v).\]

It is straightforward to show that (8) is equivalent to (3).

2.3. Well-posedness by PDE-based regularization

We now discuss well-posedness of the PDE-regularized learning problem. For convenience we will denote the semi-norm $v\mapsto d(v,v)^{\frac{1}{2}}$ induced by a symmetric, positive semi-definite bilinear form $d(\cdot,\cdot)$ by $\|\cdot\|_d=d(\cdot,\cdot)^{\frac{1}{2}}$. Using this notation it is clear that

\[\|\cdot\|^{2}_{\tilde{c}}=\|\cdot\|^{2}_{\tilde{a}}+\delta^{-1}\|\cdot\|^{2}_{b},\qquad\|\cdot\|^{2}_{c}=\|\cdot\|^{2}_{a}+\delta^{-1}\|\cdot\|^{2}_{b}.\]

The following lemma shows that the weighting of the terms in $b(\cdot,\cdot)$ is natural in the sense that it guarantees that $\|\cdot\|_b$ scales like $\|\cdot\|_{L^2(\Omega)}$.

Lemma 1.

The bilinear form $b(\cdot,\cdot)$ is $L^2(\Omega)$-continuous with

\[b(w,v)\leq M\|w\|_{L^{2}(\Omega)}\|v\|_{L^{2}(\Omega)},\qquad\|v\|_{b}^{2}\leq M\|v\|^{2}_{L^{2}(\Omega)}\qquad\forall w,v\in L^{2}(\Omega).\]
Proof.

Let $v,w\in L^2(\Omega)$. We first note that the Cauchy–Schwarz inequality gives

(9) \[\left|\fint_{B_{i}}v\right|\leq|B_{i}|^{-1}\|v\|_{L^{2}(B_{i})}\|1\|_{L^{2}(B_{i})}=|B_{i}|^{-1/2}\|v\|_{L^{2}(B_{i})}.\]

Using this estimate and the Cauchy–Schwarz inequality in $\mathbb{R}^m$ we get

\[b(w,v)\leq\sum_{i=1}^{m}\|w\|_{L^{2}(B_{i})}\|v\|_{L^{2}(B_{i})}\leq\left(\sum_{i=1}^{m}\|w\|_{L^{2}(B_{i})}^{2}\right)^{1/2}\left(\sum_{i=1}^{m}\|v\|_{L^{2}(B_{i})}^{2}\right)^{1/2}\leq M\|w\|_{L^{2}(\Omega)}\|v\|_{L^{2}(\Omega)}.\]

Despite this upper bound, a pure data fitting problem

\[\min_{v\in L^{2}(\Omega)}\frac{1}{2}\sum_{i=1}^{m}|B_{i}|\left|\fint_{B_{i}}v-b_{i}\right|^{2}=\min_{v\in L^{2}(\Omega)}\frac{1}{2}\|v\|_{b}^{2}-\widetilde{b(u,v)}+\operatorname{const}\]

is in general ill-posed, since $b(\cdot,\cdot)$ is not coercive, or, equivalently, $\|\cdot\|_b$ cannot be bounded from below by $\|\cdot\|_{L^2(\Omega)}$. Due to the finite rank $m$ of $b(\cdot,\cdot)$, the same is true whenever minimization is considered in a space with dimension larger than $m$. Thanks to the PDE-regularization, the situation is different for the PDE-regularized problem (5).

Proposition 1.

The PDE-regularized problem (5) or, equivalently, (7) has a unique solution $\tilde{u}\in V$ which depends Lipschitz-continuously on the provided data $b_1,\dots,b_m$.

Proof.

By the lower bound on $\tilde{\alpha}$ we obtain

\[\|\nabla v\|_{L^{2}(\Omega)}^{2}\leq\tilde{\alpha}_{\min}^{-1}\|v\|^{2}_{\tilde{c}}\qquad\forall v\in V.\]

Furthermore $b(\cdot,\cdot)$ is coercive on the space of constant functions. Hence coercivity of $\tilde{c}$ on $H^1(\Omega)$ follows from the Poincaré inequality (see [6, Proposition 2]). Furthermore the upper bounds on $\tilde{\alpha}$ and $\tilde{\sigma}$ and Lemma 1 imply continuity and thus ellipticity of $\tilde{c}(\cdot,\cdot)$ on $V$. Finally, we get continuity of $\widetilde{b(u,\cdot)}$ and thus of $\tilde{r}(\cdot)$ similarly to the proof of Lemma 1, such that the Lax–Milgram theorem guarantees existence of a unique solution $\tilde{u}\in V$ of (7). Lipschitz continuous dependence on $b_1,\dots,b_m$ follows from standard arguments for linear elliptic problems. ∎

3. Error bounds for PDE-regularized learning

3.1. Localized Poincaré inequality and improved $L^2(\Omega)$-coercivity

A central ingredient in the error estimates shown later is the $L^2(\Omega)$-coercivity of $\tilde{c}(\cdot,\cdot)$. In the proof of Proposition 1 we have seen that $\tilde{c}(\cdot,\cdot)$ inherits coercivity with respect to the $H^1(\Omega)$- and $L^2(\Omega)$-norm from $\tilde{a}(\cdot,\cdot)$. However, the coercivity constants are independent of the data term. In this section we will show an improved $L^2(\Omega)$-coercivity by applying localized Poincaré estimates. First we recall the classical Poincaré inequality on convex domains.

Lemma 2.

Let $U\subset\mathbb{R}^d$ be a convex, open, non-empty, bounded domain and $v\in H^1(U)$. Then

\[\left\|v-\fint_{U}v\right\|_{L^{2}(U)}\leq C_{U}\|\nabla v\|_{L^{2}(U)}\]

with Poincaré constant $C_U=\frac{\operatorname{diam}(U)}{\pi}$.

In fact the constant $C_U$ given here is the best possible one depending only on the domain diameter. A proof can be found in [19]. As a direct consequence we get the following estimate.

Lemma 3.

Let $U\subset\mathbb{R}^d$ be a convex, open, non-empty, bounded domain and $v\in H^1(U)$. Then

\[\|v\|^{2}_{L^{2}(U)}\leq C_{U}^{2}\|\nabla v\|^{2}_{L^{2}(U)}+\left\|\fint_{U}v\right\|_{L^{2}(U)}^{2}=C_{U}^{2}\|\nabla v\|^{2}_{L^{2}(U)}+|U|\left(\fint_{U}v\right)^{2}\]

with Poincaré constant $C_U=\frac{\operatorname{diam}(U)}{\pi}$.

Proof.

Using the $L^2(U)$-orthogonality of $v-\fint_{U}v$ and $\fint_{U}v$ and Lemma 2 we get

\[\|v\|_{L^{2}(U)}^{2}=\left\|v-\fint_{U}v\right\|_{L^{2}(U)}^{2}+\left\|\fint_{U}v\right\|_{L^{2}(U)}^{2}\leq C_{U}^{2}\|\nabla v\|^{2}_{L^{2}(U)}+|U|\left(\fint_{U}v\right)^{2}.\]

It is also possible to get bounds involving just averages over subsets, at the price of an additional constant. An abstract proof has been given in [6]. Since we are interested in the resulting constants we give an explicit proof here.

Lemma 4.

Let $U\subset\mathbb{R}^d$ be a convex, open, non-empty, bounded domain, $W\subset U$ with $|W|>0$, $t>0$, and $v\in H^1(U)$. Then

\[\|v\|^{2}_{L^{2}(U)}\leq C_{U}^{2}\left(1+(1+t)\frac{|U|}{|W|}\right)\|\nabla v\|^{2}_{L^{2}(U)}+(1+t^{-1})|U|\left(\fint_{W}v\right)^{2}\]

with Poincaré constant $C_U=\frac{\operatorname{diam}(U)}{\pi}$ and especially (using $t=1$)

\[\|v\|^{2}_{L^{2}(U)}\leq 3\frac{|U|}{|W|}\Bigl(C_{U}^{2}\|\nabla v\|^{2}_{L^{2}(U)}+|W|\left(\fint_{W}v\right)^{2}\Bigr).\]
Proof.

Applying Lemma 3 on $U$ we get

\[\|v\|_{L^{2}(U)}^{2}\leq C_{U}^{2}\|\nabla v\|^{2}_{L^{2}(U)}+\|v_{U}\|_{L^{2}(U)}^{2},\]

where we used the notation $v_U=\fint_{U}v$. Utilizing the inequality

\[\langle a,a\rangle=\langle a-b,a-b\rangle+2\langle b,a-b\rangle+\langle b,b\rangle\leq(1+t^{-1})\langle b,b\rangle+(1+t)\langle b-a,b-a\rangle\]

for the symmetric positive semi-definite bilinear form $\langle\cdot,\cdot\rangle=\fint_{W}(\cdot)\fint_{W}(\cdot)$ we get

\[\begin{aligned}\|v_{U}\|_{L^{2}(U)}^{2}=|U|\Bigl(\fint_{W}v_{U}\Bigr)^{2}&\leq|U|\Bigl((1+t^{-1})\Bigl(\fint_{W}v\Bigr)^{2}+(1+t)\Bigl(\fint_{W}(v-v_{U})\Bigr)^{2}\Bigr)\\&\leq|U|\Bigl((1+t^{-1})\Bigl(\fint_{W}v\Bigr)^{2}+(1+t)|W|^{-1}\|v-v_{U}\|_{L^{2}(W)}^{2}\Bigr)\\&\leq|U|(1+t^{-1})\Bigl(\fint_{W}v\Bigr)^{2}+(1+t)\frac{|U|}{|W|}\|v-v_{U}\|_{L^{2}(U)}^{2}.\end{aligned}\]

Finally we get the assertion by using Lemma 2. ∎

Note that using the optimal value for $t$ the constant $3$ can be slightly improved to $1+\varphi<3$ with the golden ratio $\varphi=\frac{1}{2}(1+\sqrt{5})$.

Next we apply Lemma 3 and Lemma 4 locally to show a data-dependent global Poincaré type estimate. To unify both estimates we encode the maximal mismatch of $B_i$ and $K_i$ in terms of the constant

\[\eta:=\begin{cases}1&\text{if }B_{i}=K_{i}\quad\forall i,\\[2pt] 3\max_{i=1,\dots,m}\frac{|K_{i}|}{|B_{i}|}&\text{else}.\end{cases}\]
Lemma 5.

Using the constants $M,R,\eta$ defined above, it holds for $v\in H^1(\Omega)$ that

\[\|v\|_{L^{2}(\Omega)}^{2}\leq\eta\left(\frac{R^{2}M}{\pi^{2}}\|\nabla v\|_{L^{2}(\Omega)}^{2}+b(v,v)\right).\]
Proof.

Let $v\in H^1(\Omega)$. Then we also have $v\in H^1(K_i)$ for $i=1,\dots,m$. Applying either Lemma 3 (if $\eta=1$) or Lemma 4 with $W=B_i$ (if $\eta>1$) on each $K_i$ we get

\[\|v\|_{L^{2}(\Omega)}^{2}\leq\sum_{i=1}^{m}\|v\|_{L^{2}(K_{i})}^{2}\leq\eta\sum_{i=1}^{m}\left(\frac{\operatorname{diam}(K_{i})^{2}}{\pi^{2}}\|\nabla v\|^{2}_{L^{2}(K_{i})}+|B_{i}|\left(\fint_{B_{i}}v\right)^{2}\right)\leq\eta\left(\sum_{i=1}^{m}\frac{R^{2}}{\pi^{2}}\|\nabla v\|^{2}_{L^{2}(K_{i})}+b(v,v)\right).\]

Using the fact that each part of $\Omega$ is covered at most $M$ times by the sets $K_i$ we get the assertion. ∎

The right hand side in the estimate of Lemma 5 almost coincides with $\tilde{c}(v,v)$. The following lemma balances the constants to finally derive an $L^2(\Omega)$-coercivity of $\tilde{c}(\cdot,\cdot)$.

Lemma 6.

For any $v\in H^1(\Omega)$ it holds that

\[\|v\|^{2}_{L^{2}(\Omega)}\leq\Gamma\|v\|_{\tilde{c}}^{2},\qquad\Gamma:=\eta\max\left\{\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}},\delta\right\}.\]
Proof.

Using Lemma 5, the nonnegativity of $\|\sqrt{\tilde{\sigma}}\,\cdot\|_{L^{2}(\Omega)}^{2}$, and $\Gamma\delta^{-1}\geq\eta$ we get

\[\|v\|_{L^{2}(\Omega)}^{2}\leq\eta\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}}\|v\|_{\tilde{a}}^{2}+\eta\|v\|_{b}^{2}\leq\eta\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}}\|v\|_{\tilde{a}}^{2}+\Gamma\delta^{-1}\|v\|_{b}^{2}\leq\Gamma\|v\|_{\tilde{c}}^{2}.\]

In the following the relation of $\delta$ and the constant from Lemma 5 will play a crucial role. In the simplest case we would have

(10) \[\delta=\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}},\]

such that the constant in Lemma 6 reduces to $\Gamma=\eta\delta$. Since this constant may not be known exactly we quantify the relation by assuming that

(11) \[\delta\in\left[\theta\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}},\theta^{-1}\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}}\right]\]

for $\theta\in(0,1]$. Note that, since $\delta\in(0,\infty)$, such a $\theta\in(0,1]$ always exists. Using this assumption we obtain the following bounds on $\Gamma$:

(12) \[\Gamma\leq\eta\theta^{-1}\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}},\qquad\Gamma\leq\eta\theta^{-1}\delta.\]

3.2. Error analysis

Now we show an estimate for the error $u-\tilde{u}$ in the $L^2(\Omega)$-norm.

Theorem 1.

Let $u$ and $\tilde{u}$ be the solutions of (3) and (7), respectively. Furthermore assume that $\tilde{\ell}-\tilde{a}(u,\cdot)\in L^2(\Omega)$. Then we have, with $\Gamma$ from Lemma 6,

(13) \[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq\sqrt{\Gamma}\|\tilde{u}-u\|_{\tilde{c}}\leq\Gamma\Bigl(\mathbb{E}_{\operatorname{pde}}+\delta^{-1}\mathbb{E}_{\operatorname{data}}\Bigr)\]

and, using $\theta$ from (11),

(14) \[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq\sqrt{\Gamma}\|\tilde{u}-u\|_{\tilde{c}}\leq\eta\theta^{-1}\left(\frac{R^{2}M}{\pi^{2}\tilde{\alpha}_{\min}}\mathbb{E}_{\operatorname{pde}}+\mathbb{E}_{\operatorname{data}}\right)\]

with the PDE and data error terms

\[\mathbb{E}_{\operatorname{pde}}:=\|\tilde{\ell}-\tilde{a}(u,\cdot)\|_{L^{2}(\Omega)},\qquad\mathbb{E}_{\operatorname{data}}:=\|\widetilde{b(u,\cdot)}-b(u,\cdot)\|_{L^{2}(\Omega)}.\]
Proof.

Testing (7) with $\tilde{u}-u$ and subtracting $\tilde{c}(u,\tilde{u}-u)$ yields

\[\begin{aligned}\|\tilde{u}-u\|_{\tilde{c}}^{2}&=\tilde{r}(\tilde{u}-u)-\tilde{c}(u,\tilde{u}-u)\\&=(\tilde{\ell}-\tilde{a}(u,\cdot))(\tilde{u}-u)+\delta^{-1}(\widetilde{b(u,\cdot)}-b(u,\cdot))(\tilde{u}-u)\\&\leq\Bigl(\|\tilde{\ell}-\tilde{a}(u,\cdot)\|_{L^{2}(\Omega)}+\delta^{-1}\|\widetilde{b(u,\cdot)}-b(u,\cdot)\|_{L^{2}(\Omega)}\Bigr)\|\tilde{u}-u\|_{L^{2}(\Omega)}\\&\leq\Bigl(\|\tilde{\ell}-\tilde{a}(u,\cdot)\|_{L^{2}(\Omega)}+\delta^{-1}\|\widetilde{b(u,\cdot)}-b(u,\cdot)\|_{L^{2}(\Omega)}\Bigr)\sqrt{\Gamma}\|\tilde{u}-u\|_{\tilde{c}},\end{aligned}\]

where we used the $L^2(\Omega)$-coercivity from Lemma 6 for the last estimate. Dividing by $\|\tilde{u}-u\|_{\tilde{c}}$ and using Lemma 6 again to bound the left hand side, we obtain (13). Estimate (14) is obtained by using the bounds on $\Gamma$ from (12). ∎

It should be noted that the residual $\tilde{\ell}-\tilde{a}(u,\cdot)$ is zero if the PDE is exact. Hence $\mathbb{E}_{\operatorname{pde}}$ quantifies the error induced by the inexact PDE. If the data points provide a finer covering of $\Omega$, then $R$ is decreased. In this sense, the PDE error can be reduced by adding more data. On the other hand, simply adding more data cannot cure the data error $\mathbb{E}_{\operatorname{data}}$ made by using inexact data.

The estimate also indicates that the PDE and data terms should be balanced appropriately: In order to minimize the constant $\Gamma$ in front of the PDE error term, $\delta$ should be sufficiently small, while it should be sufficiently large in order to minimize the constant $\Gamma\delta^{-1}$ in front of the data error term. Since the estimate (14) bounds both constants in terms of $\theta^{-1}$, the optimal choice of $\delta$ is (10), which leads to $\theta=1$. This motivates the following definition, which allows us to characterize the involved constants. It has to be understood in the sense that the data term and selected parameter are part of a sequence of problems.

Definition 1.

Problem (7) is called non-degenerate if the decomposition used in the data term is non-degenerate in the sense that $\eta,M\in O(1)$. It is called well-balanced if $\delta$ is selected such that there is a $\theta\in(0,1]$ according to (11) with $\theta^{-1}\in O(1)$.

The following corollary summarizes the result for a non-degenerate and well-balanced problem.

Corollary 1.

Let $u$ and $\tilde{u}$ be the solutions of (3) and (7), respectively. Furthermore assume that $\tilde{\ell}-\tilde{a}(u,\cdot)\in L^2(\Omega)$ and that the problem is non-degenerate and well-balanced. Then

\[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq CR^{2}\mathbb{E}_{\operatorname{pde}}+C\mathbb{E}_{\operatorname{data}}.\]

Next we will discuss the PDE error term.

Remark 1.

Under assumptions on the smoothness of $\partial\Omega$, $u$, and $\tilde{\alpha}-\alpha$ it can be shown that $\tilde{\ell}-\tilde{a}(u,\cdot)\in L^2(\Omega)$ and, furthermore, that this term can be bounded in terms of $\tilde{\alpha}-\alpha$. To show this we first note that for all $v\in V$ we have

\[\begin{aligned}|\tilde{\ell}(v)-\tilde{a}(u,v)|&=|\tilde{\ell}(v)-(\ell(v)-a(u,v))-\tilde{a}(u,v)|\\&\leq\Bigl(\|\tilde{\ell}-\ell\|_{L^{2}(\Omega)}+\|\tilde{\sigma}-\sigma\|_{L^{\infty}(\Omega)}\|u\|_{L^{2}(\Omega)}\Bigr)\|v\|_{L^{2}(\Omega)}+\left|\int_{\Omega}(\tilde{\alpha}-\alpha)\nabla u\cdot\nabla v\,dx\right|.\end{aligned}\]

To estimate the last term we note that for $H^1_0(\Omega)\subset V\subset H^1(\Omega)$, problem (3) corresponds to a mixed boundary value problem with

\[u=0\text{ on }(\partial\Omega)_{D},\qquad\frac{\partial u}{\partial\nu}=0\text{ on }(\partial\Omega)_{N}\]

for a decomposition $\partial\Omega=(\partial\Omega)_{D}\cup(\partial\Omega)_{N}$. For sufficiently smooth $\partial\Omega$ we then have

(15) \[\Bigl(\frac{\partial u}{\partial\nu}v\Bigr)\Big|_{\partial\Omega}=0\qquad\forall v\in V.\]

Now, assuming that $\tilde{\alpha}-\alpha\in W^{1,\infty}(\Omega)$ and $u\in H^2(\Omega)$, we get

\[\begin{aligned}\left|\int_{\Omega}(\tilde{\alpha}-\alpha)\nabla u\cdot\nabla v\,dx\right|&=\left|-\int_{\Omega}\operatorname{div}((\tilde{\alpha}-\alpha)\nabla u)v\,dx+\int_{\partial\Omega}(\tilde{\alpha}-\alpha)\Bigl(\frac{\partial u}{\partial\nu}v\Bigr)\,ds\right|\\&=\left|\int_{\Omega}\Bigl((\tilde{\alpha}-\alpha)\Delta u+\nabla(\tilde{\alpha}-\alpha)\cdot\nabla u\Bigr)v\,dx\right|\\&\leq\Bigl(\|\tilde{\alpha}-\alpha\|_{L^{\infty}(\Omega)}\|\Delta u\|_{L^{2}(\Omega)}+\|\nabla\tilde{\alpha}-\nabla\alpha\|_{L^{\infty}(\Omega)}\|\nabla u\|_{L^{2}(\Omega)}\Bigr)\|v\|_{L^{2}(\Omega)}\\&\leq\|\tilde{\alpha}-\alpha\|_{W^{1,\infty}(\Omega)}\|u\|_{H^{2}(\Omega)}\|v\|_{L^{2}(\Omega)}\qquad\forall v\in V.\end{aligned}\]

Hence we have shown the PDE error bound

\[\mathbb{E}_{\operatorname{pde}}=\|\tilde{\ell}-\tilde{a}(u,\cdot)\|_{L^{2}(\Omega)}\leq\|\tilde{\ell}-\ell\|_{L^{2}(\Omega)}+\|\tilde{\alpha}-\alpha\|_{W^{1,\infty}(\Omega)}\|u\|_{H^{2}(\Omega)}+\|\tilde{\sigma}-\sigma\|_{L^{\infty}(\Omega)}\|u\|_{L^{2}(\Omega)}.\]
Remark 2.

It can also be shown that such a bound on $\|a(u,\cdot)-\tilde{a}(u,\cdot)\|_{L^{2}(\Omega)}$ does in general not exist if $\tilde{\alpha}-\alpha$ is not sufficiently smooth, regardless of the smoothness of $u$. To see this let $\Omega=(-1,1)$, $(\tilde{\alpha}-\alpha)(x)=\beta(x)=-\operatorname{sgn}(x)\in L^{\infty}(\Omega)$, and $u\in C^{\infty}([-1,1])$ such that $u|_{[-0.5,0.5]}(x)=x$. Now let $v_{\varepsilon}(x)=\max\{0,1-\varepsilon^{-1}|x|\}$ for $\varepsilon<0.5$. Then

\[\int_{\Omega}\beta\nabla u\cdot\nabla v_{\varepsilon}\,dx=\int_{-\varepsilon}^{\varepsilon}(-\operatorname{sgn}(x))(-\operatorname{sgn}(x)\varepsilon^{-1})\,dx=2,\]

while $\|v_{\varepsilon}\|_{L^{2}(\Omega)}\to 0$ for $\varepsilon\to 0$.

Finally, we will provide an error bound for the data error term $\mathbb{E}_{\operatorname{data}}$ in the special case of point-wise approximations

(16) \[b_{i}=u(p_{i})\approx\fint_{B_{i}}u\]

for points $p_i\in B_i$. To make sense of this expression, we need at least $u\in C(\Omega)$. However, to derive an error bound in terms of $\operatorname{diam}(B_i)$ we will also need to relate $u(p_i)-u(x)$ to $p_i-x$ for points $x\in B_i$, independently of $p_i$. Hence we need to make additional assumptions on the regularity of $u$.

Proposition 2.

Assume $u\in C^{0,\lambda}(\overline{\Omega})$ for $\lambda\in(0,1]$, i.e. $u$ is $\lambda$-Hölder continuous or Lipschitz continuous (for $\lambda=1$), and that the data $\widetilde{b(u,\cdot)}$ is given by (4) with (16). Then we can bound the data error term $\mathbb{E}_{\operatorname{data}}$ according to

\[\mathbb{E}_{\operatorname{data}}=\|\widetilde{b(u,\cdot)}-b(u,\cdot)\|_{L^{2}(\Omega)}\leq\|u\|_{C^{0,\lambda}(\overline{\Omega})}M|\Omega|^{1/2}r^{\lambda}.\]
Proof.

Let $v\in L^2(\Omega)$ and $i=1,\dots,m$. Then

\[\left|b_{i}-\fint_{B_{i}}u\right|=|B_{i}|^{-1}\left|\int_{B_{i}}u(p_{i})-u(x)\,dx\right|\leq\|u\|_{C^{0,\lambda}(B_{i})}\operatorname{diam}(B_{i})^{\lambda}\leq\|u\|_{C^{0,\lambda}(\overline{\Omega})}r^{\lambda}.\]

Using this estimate and proceeding as in the proof of Lemma 1 we get

\[\begin{aligned}|\widetilde{b(u,v)}-b(u,v)|&=\left|\sum_{i=1}^{m}\left(b_{i}-\fint_{B_{i}}u\right)|B_{i}|\fint_{B_{i}}v\right|\\&\leq\|u\|_{C^{0,\lambda}(\overline{\Omega})}r^{\lambda}\sum_{i=1}^{m}|B_{i}|\left|\fint_{B_{i}}v\right|\leq\|u\|_{C^{0,\lambda}(\overline{\Omega})}r^{\lambda}M|\Omega|^{1/2}\|v\|_{L^{2}(\Omega)}.\end{aligned}\]

Now we summarize the results including the estimates for the PDE and data error terms.

Corollary 2.

Let $u$ and $\tilde{u}$ be the solutions of (3) and (7), respectively. Furthermore assume that

  • the problem is non-degenerate and well-balanced,

  • the data $\widetilde{b(u,\cdot)}$ is given by (4) and (16),

  • the exact and inexact coefficients satisfy $\tilde{\alpha}-\alpha\in W^{1,\infty}(\Omega)$,

  • $\partial\Omega$ is smooth enough such that (15) holds true,

  • the solution satisfies $u\in H^2(\Omega)\cap C^{0,\lambda}(\overline{\Omega})$ for $\lambda\in(0,1]$.

Then

\[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq CR^{2}\bigl(\|\tilde{\ell}-\ell\|_{L^{2}(\Omega)}+\|\tilde{\sigma}-\sigma\|_{L^{\infty}(\Omega)}+\|\tilde{\alpha}-\alpha\|_{W^{1,\infty}(\Omega)}\bigr)+Cr^{\lambda}.\]

3.3. Pure data fitting

As a special case of the setting outlined above, one can consider the pure data fitting problem, where nothing is known about the partial differential equation satisfied by $u$. In this case, one could consider using only the data in terms of a least squares ansatz

(17) \[\tilde{u}\in V:\qquad\|\tilde{u}-u\|_{b}^{2}\leq\|v-u\|_{b}^{2}\qquad\forall v\in V,\]

or, equivalently,

\[\tilde{u}\in V:\qquad b(\tilde{u},v)=b(u,v)\qquad\forall v\in V.\]

Due to the finite rank of $b(\cdot,\cdot)$ this problem is in general underdetermined and thus ill-posed. As a remedy, it can be considered in a finite dimensional subspace or submanifold $V_h$ of $V$ only, which may lead to a well-posed problem. However, this comes at the price that the behavior of the fitted solution is largely determined by $V_h$.

As a simple example consider $V=H^1([0,1])$ and $V_h=\mathcal{P}_{m-1}$. Then, for small sets $B_i$, the problem essentially reduces to an interpolation problem in $\mathcal{P}_{m-1}$, which may lead to uncontrollably large errors.

To overcome the ill-posedness in a controllable way, we introduce the regularized problem

(18) \[\tilde{u}\in V:\qquad\|\nabla\tilde{u}\|_{L^{2}(\Omega)}^{2}+\delta^{-1}\|\tilde{u}-u\|_{b}^{2}\leq\|\nabla v\|_{L^{2}(\Omega)}^{2}+\delta^{-1}\|v-u\|_{b}^{2}\qquad\forall v\in V.\]

Noting that this takes the form of (7) with $\tilde{\alpha}=1$ and $\tilde{f}=0$ and exact data $\widetilde{b(u,\cdot)}=b(u,\cdot)$, we can utilize Theorem 1 to get an error estimate.

Theorem 2.

Assume that $V\subset H^1_0(\Omega)$, $u\in V\cap H^2(\Omega)$ and let $\tilde{u}$ be the solution of (7) with $\tilde{\alpha}=1$ and $\tilde{f}=0$. Then

(19) \[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq\eta\theta^{-1}\frac{R^{2}M}{\pi^{2}}\|\Delta u\|_{L^{2}(\Omega)}+\eta\theta^{-1}\|\widetilde{b(u,\cdot)}-b(u,\cdot)\|_{L^{2}(\Omega)}.\]
Proof.

We only need to note that $u\in H^2(\Omega)$ solves the PDE (3) with $f=-\Delta u\in L^2(\Omega)$, $\alpha=1$, and $\sigma=0$. Then Theorem 1 provides the error estimate with $\tilde{\ell}-\tilde{a}(u,\cdot)=0-a(u,\cdot)=-f=\Delta u$. ∎

In the case of exact data $\widetilde{b(u,\cdot)}=b(u,\cdot)$ in (18) the estimate reduces to

\[\|\tilde{u}-u\|_{L^{2}(\Omega)}\leq\eta\theta^{-1}\frac{R^{2}M}{\pi^{2}}\|\Delta u\|_{L^{2}(\Omega)}.\]

It can also be shown that the result is quasi-optimal in a certain sense. To illustrate this, we investigate the question of how well we can approximate $u$ only from the given data $\fint_{B_i}u$, $i=1,\dots,m$, and the known regularity $u\in H^1_0(\Omega)$. Unfortunately, there is an infinite dimensional affine subspace

\[V_{b}=\Bigl\{v\in H_{0}^{1}(\Omega)\;\Big|\;\fint_{B_{i}}v=\fint_{B_{i}}u,\ i=1,\dots,m\Bigr\}.\]

All functions in this space provide a perfect fit to the data and cannot be distinguished in terms of the available information. Hence the best we can do in view of the regularity $u\in H^1_0(\Omega)$ is to compute a norm minimizing approximation in $V_b$, i.e.

\[u_{\text{opt}}=\operatorname{argmin}_{v\in V_{b}}\|\nabla v\|_{L^{2}(\Omega)}^{2}.\]

Using the orthogonality

\[(\nabla u_{\text{opt}},\nabla(u_{\text{opt}}-v))=0\qquad\forall v\in V_{b}\]

and the fact that $\|u_{\text{opt}}-u\|_{b}=0$ we obtain the identity

\[\tilde{c}(u_{\text{opt}}-u,u_{\text{opt}}-u)=(-\nabla u,\nabla(u_{\text{opt}}-u))\]

with $\tilde{c}(\cdot,\cdot)$ as in Theorem 2. Now assuming the additional regularity $u\in H^2(\Omega)$ and utilizing the coercivity from Lemma 6 we obtain

\[\|u_{\text{opt}}-u\|^{2}_{L^{2}(\Omega)}\leq\eta\frac{R^{2}M}{\pi^{2}}(\Delta u,u_{\text{opt}}-u)\leq\eta\frac{R^{2}M}{\pi^{2}}\|\Delta u\|_{L^{2}(\Omega)}\|u_{\text{opt}}-u\|_{L^{2}(\Omega)}.\]

Thus, if we choose the optimal weighting parameter $\delta^{-1}$ such that $\theta=1$, the error estimate for $\tilde{u}$ coincides with the one for $u_{\text{opt}}$, which is the energy minimizing function among all functions that fit the data. It should be noted that the only reason for the inequality '$\leq$' in the estimate for $u_{\text{opt}}$ is the application of the coercivity estimate from Lemma 6 and the application of the Cauchy–Schwarz inequality in $L^2(\Omega)$. Hence we cannot expect to be able to derive a better estimate unless we sharpen the coercivity bound.

4. Generalized Galerkin discretization

4.1. Abstract a priori error estimate

Now we investigate the discretization of the PDE-regularized problem (7). To this end we consider an abstract generalized Galerkin discretization

(20) \[\tilde{u}_{h}=\operatorname{argmin}_{v\in V_{h}}\tilde{J}(v)\]

of the minimization formulation (5) of (7) in a subset $V_h\subset V$. We call this problem a generalized Galerkin discretization since we do not require that $V_h$ is a closed subspace of $V$ as in a classical Galerkin discretization. As a consequence, existence and uniqueness of $\tilde{u}_h$ is not guaranteed. However, if $V_h$ is a closed subspace, then (20) is equivalent to the variational equation

(21) \[\tilde{u}_{h}\in V_{h}:\qquad\tilde{c}(\tilde{u}_{h},v)=\tilde{r}(v)\qquad\forall v\in V_{h},\]

and existence and uniqueness are guaranteed by the Lax–Milgram theorem.

For several reasons we cannot directly apply the classical Céa-Lemma or Galerkin orthogonality to derive an error bound: The set $V_h$ is in general not a subspace, we want to bound the error in the $L^2(\Omega)$-norm, and the bilinear form incorporates data dependent weighting factors. Furthermore we are interested in directly bounding $\|\tilde{u}_h-u\|_{L^2(\Omega)}$ and not just $\|\tilde{u}_h-\tilde{u}\|_{L^2(\Omega)}$ for the auxiliary continuous solution $\tilde{u}$. As a remedy we will first use a nonlinear Céa-Lemma in terms of the weighted norm $\sqrt{\Gamma}\|\cdot\|_{\tilde{c}}$ and then pass to the $\|\cdot\|_{L^2(\Omega)}$ norm using Lemma 6. The idea of using generalized versions of Céa's lemma for discretization in subsets goes back to [8, 13], where this was developed in a metric space setting. Since our setting is more special, we give a direct proof here.

Lemma 7.

Let $\tilde{u}_h\in V_h$ be a solution of (20) and $\tilde{u}\in V$ the solution of (5). Then we have

\[\|\tilde{u}_{h}-\tilde{u}\|_{\tilde{c}}\leq\inf_{v\in V_{h}}\|v-\tilde{u}\|_{\tilde{c}}.\]
Proof.

First we note that, in view of (7), we have for arbitrary $v\in V$

\[\|v-\tilde{u}\|_{\tilde{c}}^{2}=\|v\|_{\tilde{c}}^{2}-2\tilde{c}(\tilde{u},v)+\|\tilde{u}\|_{\tilde{c}}^{2}=\|v\|_{\tilde{c}}^{2}-2\tilde{r}(v)+\|\tilde{u}\|_{\tilde{c}}^{2}=2\tilde{J}(v)+\|\tilde{u}\|_{\tilde{c}}^{2}.\]

Thus we get for the minimizer $\tilde{u}_h$ of $\tilde{J}$ in $V_h$ and any $v\in V_h$

\[\|\tilde{u}_{h}-\tilde{u}\|_{\tilde{c}}^{2}=2\tilde{J}(\tilde{u}_{h})+\|\tilde{u}\|_{\tilde{c}}^{2}\leq 2\tilde{J}(v)+\|\tilde{u}\|_{\tilde{c}}^{2}=\|v-\tilde{u}\|_{\tilde{c}}^{2}.\]

Theorem 3.

Let $\tilde{u}_h\in V_h$ be a solution of (20) and $u\in V$ the solution of (3). Furthermore assume that $\tilde{\ell}-\tilde{a}(u,\cdot)\in L^2(\Omega)$. Then $\tilde{u}_h$ satisfies the error bound

\[\begin{aligned}\|\tilde{u}_{h}-u\|_{L^{2}(\Omega)}\leq\sqrt{\Gamma}\|\tilde{u}_{h}-u\|_{\tilde{c}}&\leq 2\sqrt{\Gamma}\|\tilde{u}-u\|_{\tilde{c}}+\inf_{v\in V_{h}}\sqrt{\Gamma}\|v-u\|_{\tilde{c}}\\&\leq 2\Gamma\Bigl(\mathbb{E}_{\operatorname{pde}}+\delta^{-1}\mathbb{E}_{\operatorname{data}}\Bigr)+\inf_{v\in V_{h}}\sqrt{\Gamma}\|v-u\|_{\tilde{c}}\end{aligned}\]

with the PDE and data error terms $\mathbb{E}_{\operatorname{pde}}$ and $\mathbb{E}_{\operatorname{data}}$ as defined in Theorem 1.

Proof.

Using Lemma 6, the triangle inequality, and the generalized Céa-Lemma (Lemma 7) we get, for all $v\in V_h$,

\[\begin{aligned}\|\tilde{u}_{h}-u\|_{L^{2}(\Omega)}&\leq\sqrt{\Gamma}\|\tilde{u}_{h}-u\|_{\tilde{c}}\leq\sqrt{\Gamma}\|\tilde{u}_{h}-\tilde{u}\|_{\tilde{c}}+\sqrt{\Gamma}\|\tilde{u}-u\|_{\tilde{c}}\\&\leq\sqrt{\Gamma}\|v-\tilde{u}\|_{\tilde{c}}+\sqrt{\Gamma}\|\tilde{u}-u\|_{\tilde{c}}\leq\sqrt{\Gamma}\|v-u\|_{\tilde{c}}+2\sqrt{\Gamma}\|\tilde{u}-u\|_{\tilde{c}}.\end{aligned}\]

Now using Theorem 1 we get the assertion. ∎

While the first error term is the same as the one in the continuous case, we still need to take care of the best approximation error, because it involves the weighted data dependent norm $\sqrt{\Gamma}\|\cdot\|_{\tilde{c}}$. The main ingredient is the following bound on this norm.

Lemma 8.

The weighted norm $\sqrt{\Gamma}\|\cdot\|_{\tilde{c}}$ can be bounded according to

\[\sqrt{\Gamma}\|v\|_{\tilde{c}}\leq\sqrt{\eta\theta^{-1}M}\left(\frac{R}{\pi}\frac{\sqrt{\tilde{\alpha}_{\max}}}{\sqrt{\tilde{\alpha}_{\min}}}\|\nabla v\|_{L^{2}(\Omega)}+\|v\|_{L^{2}(\Omega)}\right).\]
Proof.

As a direct consequence of the bounds in (12) and Lemma 1 we get

\[\Gamma\|v\|_{\tilde{c}}^{2}\leq\eta\theta^{-1}M\left(\frac{R^{2}\tilde{\alpha}_{\max}}{\pi^{2}\tilde{\alpha}_{\min}}\|\nabla v\|_{L^{2}(\Omega)}^{2}+\|v\|^{2}_{L^{2}(\Omega)}\right).\]

To interpret this estimate we again consider a non-degenerate, well-balanced problem. In this case the estimate in Lemma 8 takes the form

\[\sqrt{\Gamma}\|v\|_{\tilde{c}}\leq C\Bigl(R\|\nabla v\|_{L^{2}(\Omega)}+\|v\|_{L^{2}(\Omega)}\Bigr),\]

where $R$ is bounded and even decreasing if the data points cover the domain better and better. We summarize the result in the following corollary.

Corollary 3.

Let $\tilde{u}_h\in V_h$ be a solution of (20) and $u\in V$ the solution of (3). Furthermore assume that $\tilde{\ell}-\tilde{a}(u,\cdot)\in L^2(\Omega)$ and that the problem is non-degenerate and well-balanced. Then

\[\|\tilde{u}_{h}-u\|_{L^{2}(\Omega)}\leq C\mathbb{E}_{\operatorname{approx}}+CR^{2}\mathbb{E}_{\operatorname{pde}}+C\mathbb{E}_{\operatorname{data}}\]

with the PDE and data error terms as in Theorem 1 and the approximation error term

\[\mathbb{E}_{\operatorname{approx}}:=\inf_{v\in V_{h}}\Bigl(R\|\nabla(v-u)\|_{L^{2}(\Omega)}+\|v-u\|_{L^{2}(\Omega)}\Bigr).\]

Inserting the bounds on the PDE and data error we finally get:

Corollary 4.

Let $\tilde{u}_h\in V_h$ be a solution of (20) and $u\in V$ the solution of (3). Furthermore assume that the assumptions of Corollary 2 hold true. Then

\[\|\tilde{u}_{h}-u\|_{L^{2}(\Omega)}\leq CR^{2}\bigl(\|\tilde{\ell}-\ell\|_{L^{2}(\Omega)}+\|\tilde{\sigma}-\sigma\|_{L^{\infty}(\Omega)}+\|\tilde{\alpha}-\alpha\|_{W^{1,\infty}(\Omega)}\bigr)+Cr^{\lambda}+C\inf_{v\in V_{h}}\Bigl(R\|\nabla(v-u)\|_{L^{2}(\Omega)}+\|v-u\|_{L^{2}(\Omega)}\Bigr).\]

It is important to note that we neither require that $V_h$ is a subspace, nor that the solution to (20) is unique. Hence the error estimate is in principle applicable to nonlinear approximation schemes. However, it will in general be hard to compute a global minimizer in $V_h$ as required in (20), since practical optimization schemes often at most guarantee local optimality.

In the case of inexact or local minimization we can at least bound the error in terms of the algebraic energy error. Let $\tilde{\tilde{u}}_h$ be an approximation of $\tilde{u}_h$. Then, using a straightforward modification of the proof of Lemma 7, we get

\[\|\tilde{\tilde{u}}_{h}-\tilde{u}\|_{\tilde{c}}\leq\inf_{v\in V_{h}}\|v-\tilde{u}\|_{\tilde{c}}+\sqrt{2(\tilde{J}(\tilde{\tilde{u}}_{h})-\tilde{J}(\tilde{u}_{h}))}.\]

Thus, if we want to bound $\|\tilde{\tilde{u}}_h-u\|_{L^2(\Omega)}$ instead of $\|\tilde{u}_h-u\|_{L^2(\Omega)}$, we have to add the algebraic error term

\[\sqrt{2\Gamma(\tilde{J}(\tilde{\tilde{u}}_{h})-\tilde{J}(\tilde{u}_{h}))}\leq CR\sqrt{\tilde{J}(\tilde{\tilde{u}}_{h})-\tilde{J}(\tilde{u}_{h})}\]

to the right hand sides of the estimates in Theorem 3, Corollary 3, and Corollary 4.

4.2. Finite element discretization

As an example of a linear Galerkin discretization we apply the abstract error bound to a finite element ansatz. To this end let $V_h\subset V$ be a conforming Lagrange finite element space of order $k$ (piecewise polynomials in $\mathcal{P}_k$ for simplex elements or piecewise tensor-polynomials in $\mathcal{Q}_k$ for cubic elements) on a triangulation with mesh size $h$. Then we can apply the classical finite element interpolation error estimate which, in the present special case, provides:

Proposition 3.

Let $\Pi_h:C^0(\overline{\Omega})\to V_h$ be the interpolation operator and $u\in H^{k+1}(\Omega)\cap C^0(\overline{\Omega})$. Then

\[\|u-\Pi_{h}u\|_{L^{2}(\Omega)}\leq Ch^{k+1}|u|_{H^{k+1}(\Omega)},\qquad\|u-\Pi_{h}u\|_{H^{1}(\Omega)}\leq Ch^{k}|u|_{H^{k+1}(\Omega)}\]

with a constant $C$ depending only on the shape regularity of the triangulation. Here $|\cdot|_{H^{k+1}(\Omega)}$ denotes the $H^{k+1}(\Omega)$-semi-norm containing only derivatives of order $k+1$.

For a proof we refer to [3, Theorem 3.2.1 and Remark 3.2.2]. Inserting the interpolation error estimate in the abstract error bound we get the following total error bound.

Corollary 5.

Let $\tilde{u}_h\in V_h$ be a solution of (20) and $u\in V$ the solution of (3). Furthermore assume that the assumptions of Corollary 2 and $u\in H^{k+1}(\Omega)$ hold true. Then

\[\|\tilde{u}_{h}-u\|_{L^{2}(\Omega)}\leq CR^{2}\bigl(\|\tilde{\ell}-\ell\|_{L^{2}(\Omega)}+\|\tilde{\sigma}-\sigma\|_{L^{\infty}(\Omega)}+\|\tilde{\alpha}-\alpha\|_{W^{1,\infty}(\Omega)}\bigr)+Cr^{\lambda}+CRh^{k}+Ch^{k+1}.\]
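
As a concrete illustration of a classical (subspace) Galerkin discretization covered by this bound, the following Python sketch assembles and solves the linear system corresponding to (21) for P1 finite elements in one dimension with natural boundary conditions. The target function, the perturbed right hand side, the data sites, and the parameter values are hypothetical choices made for illustration only, not the setup of the experiments in Section 5.

```python
import numpy as np

# Minimal sketch of the discrete problem (21) with P1 finite elements on Omega = (0, 1):
# auxiliary PDE -u'' = f_tilde with natural boundary conditions (alpha_tilde = 1,
# sigma_tilde = 0) and point data b_i = u(p_i).
n = 64                                    # number of elements
nodes = np.linspace(0.0, 1.0, n + 1)
h = 1.0 / n

u_exact = lambda x: np.cos(np.pi * x)     # target; solves -u'' = pi^2 u with u'(0) = u'(1) = 0
f_tilde = lambda x: 0.9 * np.pi ** 2 * np.cos(np.pi * x)   # perturbed right hand side

# P1 stiffness matrix of a_tilde and load vector of ell_tilde (uniform mesh).
A = (2.0 * np.eye(n + 1) - np.eye(n + 1, k=1) - np.eye(n + 1, k=-1)) / h
A[0, 0] = A[-1, -1] = 1.0 / h
xq = np.linspace(0.0, 1.0, 20 * n + 1)    # fine grid used as simple quadrature
hat = np.maximum(0.0, 1.0 - np.abs(xq[:, None] - nodes[None, :]) / h)   # hat function values
F = np.mean(f_tilde(xq)[:, None] * hat, axis=0)            # approx. of int f_tilde * phi_j dx

# Data term: boxes B_i around m random points, averages of hat functions over each box.
m, half, delta = 20, 0.02, 1e-3
p = np.random.default_rng(0).uniform(half, 1.0 - half, size=m)
B, G = np.zeros_like(A), np.zeros(n + 1)
for pi in p:
    mask = (xq >= pi - half) & (xq <= pi + half)
    avg = hat[mask].mean(axis=0)                           # fint_{B_i} phi_j
    B += 2 * half * np.outer(avg, avg)                     # |B_i| * fint phi_j * fint phi_k
    G += u_exact(pi) * 2 * half * avg                      # b_i * |B_i| * fint phi_k

U = np.linalg.solve(A + B / delta, F + G / delta)          # coefficients of u_h in (21)
err = np.sqrt(np.mean((hat @ U - u_exact(xq)) ** 2))       # approximate L2 error
print("L2 error:", err)
```

Note that the system matrix $A+\delta^{-1}B$ is positive definite even though the pure Neumann stiffness matrix $A$ alone is singular, reflecting the coercivity of $\tilde{c}(\cdot,\cdot)$ established in Proposition 1.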

4.3. A heuristic strategy for parameter selection

To guarantee a non-degenerate and well-balanced problem in the sense of Definition 1, the size of the averaging sets $B_i$ and the parameter $\delta$ should be carefully selected such that the quotient $|K_i|/|B_i|$ is bounded and such that $\delta$ scales like $R^2M$. In applications where data points $p_1,\dots,p_m\in\Omega$ are given, we can in general not assume that one is able to compute a decomposition into sets $K_i$ and their maximal diameter $R$ exactly. While this is in principle possible using a Voronoi decomposition, computing the latter is in general much too costly. Hence we propose a heuristic strategy to determine the sets $B_i$ and estimates for $R$ and $\delta$ under the assumption of uniformly distributed data points $p_i$.

To this end we make the assumption that $\Omega$ fits into a cube $[0,D]^{d}$. Then we can expect that the points are on average uniformly spaced. For an ideally uniformly spaced distribution of $m$ points in $[0,D]^{d}$ we can place $m$ non-overlapping cubes $\hat{K}_{i}$ with edge length $\hat{L}=Dm^{-1/d}$ on a uniform lattice in $\Omega=[0,D]^{d}$. The diameter of these cubes is

\[
\hat{R}\colonequals\hat{L}\sqrt{d}=Dm^{-1/d}\sqrt{d},
\]

which we will use as an estimate for $R$. In order to fix $|\hat{K}_{i}|/|B_{i}|=Q$ for a parameter $Q>1$, we use cubes $B_{i}$ centered at $p_{i}$ with edge length $\hat{l}$:

\[
B_{i}\colonequals p_{i}+\hat{l}\,[-0.5,0.5]^{d},\qquad\hat{l}\colonequals\hat{L}Q^{-1/d}=D(mQ)^{-1/d}.
\]

Since we cannot control $M$ explicitly, we assume $M\in O(1)$ and select

\[
\delta\colonequals\frac{\hat{R}^{2}}{\pi^{2}\tilde{\alpha}_{\min}}=\frac{D^{2}m^{-2/d}d}{\pi^{2}\tilde{\alpha}_{\min}}.
\]
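For concreteness, the following Python sketch implements this heuristic parameter choice. It is only an illustration of the formulas above; the function name and its arguments ($D$, $m$, $d$, $Q$, and $\tilde{\alpha}_{\min}$ as alpha_min) are ours.

```python
import math

def heuristic_parameters(D, m, d, Q, alpha_min):
    """Heuristic choice of R-hat, the box edge length l-hat, and delta for
    m roughly uniformly distributed data points in a cube [0, D]^d.

    Assumes M = O(1); Q = |K_i|/|B_i| > 1 controls the size of the boxes B_i.
    """
    L_hat = D * m ** (-1.0 / d)                      # lattice spacing of an ideal point distribution
    R_hat = L_hat * math.sqrt(d)                     # estimate for the maximal diameter R
    l_hat = L_hat * Q ** (-1.0 / d)                  # edge length of the averaging cubes B_i
    delta = R_hat ** 2 / (math.pi ** 2 * alpha_min)  # penalty parameter delta
    return R_hat, l_hat, delta

# Example: m = 512 points in the unit cube, d = 2, Q = 4, alpha_min = 1
print(heuristic_parameters(D=1.0, m=512, d=2, Q=4, alpha_min=1.0))
```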

5. Numerical experiments

5.1. A smooth test problem

Finally we illustrate the theoretical findings using numerical experiments for test problems. To this end we consider an example problem from [4] given by

\[
-\Delta u+\sigma u=f\quad\text{in }\Omega,\qquad\frac{\partial u}{\partial n}=0\quad\text{on }\partial\Omega
\]

with $\Omega=[0,1]^{d}$. For the eigenfunction $u(x)=\sum_{i=1}^{d}\cos(\pi x_{i})$ of the Laplacian and $\sigma=\pi^{2}$ this equation is satisfied with $f(x)=(\pi^{2}+\sigma)u(x)=2\pi^{2}u(x)$. We do not want to hide that this example was selected because, in contrast to Dirichlet boundary conditions, the natural boundary condition $\frac{\partial u}{\partial n}=0$ does not require any approximation for neural network discretizations; such an approximation is not covered by the error analysis.

As an inexact auxiliary PDE we will use the same differential operator $-\Delta+\sigma$ but a scaled right-hand side $\tilde{f}=(1-\varepsilon)f$. Then we can explicitly compute the PDE error term

\[
\mathbb{E}_{\operatorname{pde}}=\|f-\tilde{f}\|_{L^{2}(\Omega)}=\varepsilon\|f\|_{L^{2}(\Omega)}.
\]
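As a quick sanity check, independent of the experiments below, the following sympy snippet verifies symbolically that the chosen $u$ satisfies $-\Delta u+\sigma u=2\pi^{2}u$ for $\sigma=\pi^{2}$, here for $d=2$; all names in the snippet are ours.

```python
# Symbolic check of the test problem for d = 2 (a sketch; names are ours).
import sympy as sp

d = 2
x = sp.symbols(f"x0:{d}")
u = sum(sp.cos(sp.pi * xi) for xi in x)   # eigenfunction of the Laplacian
sigma = sp.pi ** 2

lhs = -sum(sp.diff(u, xi, 2) for xi in x) + sigma * u   # -Delta u + sigma*u
f = 2 * sp.pi ** 2 * u
print(sp.simplify(lhs - f))   # expected output: 0
```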

In all experiments we generate data by creating $m$ uniformly distributed random points $p_{1},\dots,p_{m}$ and use cubes $B_{i}$ with edge length $\hat{l}$ centered at the points $p_{i}$. The points are sampled in $[\hat{l}/2,1-\hat{l}/2]^{d}$ such that $B_{i}\subset\Omega$ is guaranteed. The edge length $\hat{l}$ and the penalty parameter $\delta$ are selected according to the heuristic strategy proposed in Subsection 4.3 for different values of $Q$. Notice that the heuristic strategy does not guarantee a non-degenerate and well-balanced problem in the sense of Definition 1. Nevertheless, the method is still covered by the theory presented above, but the constants may blow up if the strategy fails, leading to different error rates.
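A minimal sketch of this data generation for $\Omega=[0,1]^{d}$, combining the sampling of the points $p_{i}$ with the heuristic parameter choice from Subsection 4.3; the helper names are ours.

```python
# Sketch of the data generation used in the experiments (names are ours).
# Points p_i are sampled uniformly in [l_hat/2, 1 - l_hat/2]^d so that the
# cubes B_i = p_i + l_hat*[-0.5, 0.5]^d stay inside Omega = [0, 1]^d.
import numpy as np

def generate_data(m, d, Q, alpha_min=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    L_hat = m ** (-1.0 / d)                      # D = 1 for the unit cube
    l_hat = L_hat * Q ** (-1.0 / d)              # edge length of the cubes B_i
    delta = d * m ** (-2.0 / d) / (np.pi ** 2 * alpha_min)
    points = rng.uniform(l_hat / 2, 1 - l_hat / 2, size=(m, d))
    boxes = np.stack([points - l_hat / 2, points + l_hat / 2], axis=-1)  # (m, d, 2)
    return points, boxes, delta

points, boxes, delta = generate_data(m=512, d=2, Q=4)
```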

5.2. Finite element discretization

In this section we discretize the problem with conforming finite elements for $d=1,2,3$. Since we are not interested in illustrating well-known finite element error bounds, we use a fixed finite element grid with uniformly spaced interval/rectangular/hexahedral elements and tensorial $\mathcal{Q}_{k}$ Lagrange finite elements of order $k$ throughout the experiments. For $d=1$ we used $64$ elements and order $k=4$, for $d=2$ we used $4096=64\times 64$ elements and order $k=4$, and for $d=3$ we used $4096=16\times 16\times 16$ elements and order $k=2$. In any case the discretization error can be neglected compared to the PDE and data error terms reported in the following. All reported $L^{2}(\Omega)$-errors are computed using a quadrature rule of order $k^{2}$ on the grid elements.

For the first experiment we fix $\tilde{f}=0.5f$ and compute the solution for an increasing number of data points $m$. The computations are done using exact data, i.e., $b_{i}=\fint_{B_{i}}u$, and inexact data in the form of point values $b_{i}=u(p_{i})$. Figure 1 depicts the error over $m$ for dimensions $d=1,2,3$ and parameters $Q=4,2$ in the heuristic strategy. The left and right pictures show exact and inexact data, respectively. We observe that the error decays like $O(m^{-2/d})=O(\hat{R}^{2})$, as expected from the theoretical error bounds for exact data and uniformly distributed data points. We also find that the error increases by a constant factor if we increase $Q=|\hat{K}_{i}|/|B_{i}|$, which is again in accordance with the error estimate, where $Q$ enters the constant factor $\eta$. For inexact data the situation is very similar. We again observe the order $O(m^{-2/d})=O(\hat{R}^{2})$, but for the very small errors obtained for $d=1$ the case $Q=4$ eventually becomes better than $Q=2$. This can be explained by the fact that, while increasing $Q$ increases the constant in the PDE error term, it also decreases $r$ and thus the data error term. Hence, if the data error comes into the range of the PDE error, we expect that a larger $Q$ reduces the total error.

In the second experiment we fix $m=512$ data points and compute the solution for inexact right-hand sides $\tilde{f}=(1-\varepsilon)f$ with varying $\varepsilon$ and exact data $b_{i}=\fint_{B_{i}}u$ as well as inexact data $b_{i}=u(p_{i})$. Figure 2 depicts the error over the PDE error $\varepsilon=\mathbb{E}_{\operatorname{pde}}/\|f\|_{L^{2}(\Omega)}$ for dimensions $d=1,2,3$ and parameters $Q=4,2$ in the heuristic strategy. The left and right pictures show exact and inexact data, respectively. Here we observe that the error scales like $O(\varepsilon)=O(\mathbb{E}_{\operatorname{pde}})$ for a wide range of $\varepsilon$. This is expected from the theoretical error bound. Again we observe that for small total errors $Q=4$ becomes better than $Q=2$, and that the error eventually saturates for $d=2,3$. The latter indicates that the data error starts to dominate in this regime, such that we no longer benefit from improving the PDE error.

[Two log-log plots: $L^{2}(\Omega)$ error over the number of data points $m$ for $d=1,2,3$, with dotted reference slopes $O(m^{-2/3})$, $O(m^{-2/2})$, $O(m^{-2/1})$.]
Figure 1. Finite element discretization. Total error over the number of data points. Solid lines: $Q=4$. Dashed lines: $Q=2$. Dotted lines: Reference slope. Left: Exact local average data. Right: Point values as inexact data.
[Two log-log plots: $L^{2}(\Omega)$ error over the PDE error $\varepsilon$ for $d=1,2,3$, with dotted reference slope $O(\varepsilon)$.]
Figure 2. Finite element discretization. Total error over the PDE error. Solid lines: $Q=4$. Dashed lines: $Q=2$. Dotted lines: Reference slope. Left: Exact local average data. Right: Point values as inexact data.

5.3. Neural network discretization

Finally we consider the discretization of the test problem with neural networks. To this end we minimize the loss functional $\tilde{J}$ defined in (6) over a set $V_{h}$ of neural networks with a fixed architecture. In this case integrals are no longer evaluated exactly but approximated using stochastic integration, as proposed in [4]. More precisely, we approximate the local averages $\fint_{B_{i}}v$ appearing in the functional by averaging $v$ over $10d$ uniformly distributed sampling points in $B_{i}$. Since all sets $B_{i}$ coincide up to translation, we use the same sampling points for all $B_{i}$, translated accordingly. The integration over $\Omega$ is approximated using $100d$ uniformly distributed sampling points in $\Omega$ that are newly sampled in each step of the iterative algebraic solution method. Since we use natural boundary conditions, we only need to approximate the space $H^{1}(\Omega)$ without any hard boundary constraints, so no approximation of boundary conditions was needed.
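The following sketch illustrates this stochastic approximation of the local averages under the stated assumptions; the function and variable names are ours, and the function v is assumed to accept an array of points of shape (N, d).

```python
# Monte-Carlo approximation of the local averages over the boxes B_i (a sketch).
import numpy as np

def local_averages(v, points, l_hat, rng=None):
    """Approximate the averages (1/|B_i|) * integral of v over B_i by averaging
    v over 10*d sampling points shared by all B_i up to translation."""
    rng = np.random.default_rng() if rng is None else rng
    m, d = points.shape
    # one common set of offsets in l_hat*[-0.5, 0.5]^d, translated to each p_i
    offsets = l_hat * (rng.uniform(size=(10 * d, d)) - 0.5)
    samples = points[:, None, :] + offsets[None, :, :]     # shape (m, 10d, d)
    values = v(samples.reshape(-1, d)).reshape(m, 10 * d)
    return values.mean(axis=1)
```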

The network architecture is as follows: A first layer inflates the input size to $16$. This is followed by $3$ blocks, each consisting of two densely connected layers with a residual connection. A final affine layer reduces the size from $16$ to $1$. In total the network takes the form

\[
F=F_{\text{out}}\circ F_{3}\circ F_{2}\circ F_{1}\circ F_{\text{in}}
\]

where $F_{\text{in}}:\mathbb{R}^{d}\to\mathbb{R}^{16}$ is a linear map padding its input by $16-d$ zeros, $F_{\text{out}}:\mathbb{R}^{16}\to\mathbb{R}$ is an affine map, and each block $F_{i}:\mathbb{R}^{16}\to\mathbb{R}^{16}$ has the form

\[
F_{i}(x)=\psi(W_{i,2}\psi(W_{i,1}x-\theta_{i,1})-\theta_{i,2})+x
\]

for dense weight matrices $W_{i,j}\in\mathbb{R}^{16\times 16}$ and bias vectors $\theta_{i,j}\in\mathbb{R}^{16}$. In all layers we used the activation function $\psi(x)=\max\{x^{3},0\}$. All parameters involved in $F_{1},F_{2},F_{3},F_{\text{out}}$ are determined in the training procedure.
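A possible PyTorch realization of this architecture could look as follows. This is only a sketch based on the description above, not the authors' implementation; the class and function names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def psi(x):
    # activation psi(x) = max{x^3, 0}, which equals (max{x, 0})^3
    return torch.clamp(x, min=0.0) ** 3

class ResidualBlock(nn.Module):
    """One block F_i(x) = psi(W_{i,2} psi(W_{i,1} x - theta_{i,1}) - theta_{i,2}) + x."""
    def __init__(self, width=16):
        super().__init__()
        self.lin1 = nn.Linear(width, width)   # W_{i,1}, theta_{i,1}
        self.lin2 = nn.Linear(width, width)   # W_{i,2}, theta_{i,2}

    def forward(self, x):
        return psi(self.lin2(psi(self.lin1(x)))) + x

class RitzNet(nn.Module):
    """F = F_out o F_3 o F_2 o F_1 o F_in, with F_in padding the input by zeros."""
    def __init__(self, d, width=16, n_blocks=3):
        super().__init__()
        self.d, self.width = d, width
        self.blocks = nn.Sequential(*[ResidualBlock(width) for _ in range(n_blocks)])
        self.out = nn.Linear(width, 1)        # affine output layer F_out

    def forward(self, x):
        x = F.pad(x, (0, self.width - self.d))   # F_in: pad with 16 - d zeros
        return self.out(self.blocks(x))
```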

The networks are trained using the Adam method [15], which is a variant of stochastic gradient descent. Training was stopped once the $L^{2}(\Omega)$ error $\|\tilde{u}_{h}-u\|_{L^{2}(\Omega)}$ no longer decreased. While this is impractical in general because the solution $u$ is unknown, we used it here to avoid the effects of more heuristic stopping criteria.
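A minimal sketch of such a training loop is given below. Here loss_fn stands for a stochastic evaluation of $\tilde{J}$ from (6) with freshly sampled integration points, l2_error for the monitored $L^{2}(\Omega)$ error, and the patience parameter is our simplification of the stopping criterion; all of these names are assumptions for illustration.

```python
import torch

def train(model, loss_fn, l2_error, lr=1e-3, max_steps=100_000, patience=500):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, since_best = float("inf"), 0
    for step in range(max_steps):
        opt.zero_grad()
        loss = loss_fn(model)       # stochastic approximation of J-tilde
        loss.backward()
        opt.step()
        err = l2_error(model)       # monitored L2 error against the known solution u
        if err < best:
            best, since_best = err, 0
        else:
            since_best += 1
        if since_best >= patience:  # stop once the L2 error no longer decreases
            break
    return model, best
```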

Again we fix $\tilde{f}=0.5f$ and compute the solution for an increasing number of data points $m$. The computations are done using exact data, i.e., $b_{i}=\fint_{B_{i}}u$, and inexact data in the form of point values $b_{i}=u(p_{i})$. Figure 3 depicts the error over $m$ for dimensions $d=1,2,3$ and parameters $Q=4,2$ in the heuristic strategy. The left and right pictures show exact and inexact data, respectively. For $d=2$ and $d=3$ we again observe that the error decays like $O(m^{-2/d})=O(\hat{R}^{2})$. For $d=1$ the situation is less clear. While we roughly observe the order $O(m^{-2/d})=O(\hat{R}^{2})$ again, there are some exceptions; most notably, in one case the error increases after adding more data.

[Two log-log plots: $L^{2}(\Omega)$ error over the number of data points $m$ for $d=1,2,3$, with dotted reference slopes $O(m^{-2/3})$, $O(m^{-2/2})$, $O(m^{-2/1})$.]
Figure 3. Neural network discretization. Total error over the number of data points. Solid lines: $Q=4$. Dashed lines: $Q=2$. Dotted lines: Reference slope. Left: Exact local average data. Right: Point values as inexact data.

References

  • [1] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci., 1(1):8–45, 2019. arxiv:1705.01714.
  • [2] M. Burger and A. Neubauer. Error bounds for approximation with neural networks. J. Approx. Theory, 112(2):235–250, 2001.
  • [3] P. G. Ciarlet. The finite element method for elliptic problems, volume 4 of Studies in Mathematics and its Applications. North-Holland, Amsterdam New York Oxford, 1978.
  • [4] W. E and B. Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Comm. Math. Stats., 6(1):1–12, 2018.
  • [5] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
  • [6] C. Gräser. A note on Poincaré- and Friedrichs-type inequalities, 2015. arxiv:1512.02842.
  • [7] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks, 2019. arxiv:1905.01208.
  • [8] P. Grohs, H. Hardering, and O. Sander. Optimal a priori discretization error bounds for geodesic finite elements. Found. Comput. Math., 15:1357–1411, 2015.
  • [9] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations, 2018. arxiv:1809.02362.
  • [10] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms, 2019.
  • [11] B. D. Haeffele and R. Vidal. Global optimality in neural network training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7331–7339, 2017.
  • [12] J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci., 115(34):8505–8510, 2018. arxiv:1707.02568.
  • [13] H. Hardering. Intrinsic Discretization Error Bounds for Geodesic Finite Elements. PhD thesis, Freie Universität Berlin, 2015.
  • [14] E. Kharazmi, Z. Zhang, and G. E. Karniadakis. Variational physics-informed neural networks for solving partial differential equations, 2019. arxiv:1912.00873.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014. arxiv:1412.6980.
  • [16] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural networks and parametric PDEs, 2019.
  • [17] Y. Liao and P. Ming. Deep Nitsche method: Deep Ritz method with essential boundary conditions, 2019.
  • [18] J. Müller and M. Zeinhofer. Deep Ritz revisited, 2019. arxiv:1912.03937.
  • [19] L. E. Payne and H. F. Weinberger. An optimal Poincaré inequality for convex domains. Arch. Rational Mech. Anal., 5(1):286–292, 1960.
  • [20] D. Perekrestenko, P. Grohs, D. Elbrächter, and H. Bölcskei. The universal approximation power of finite-width deep ReLU networks, 2018.
  • [21] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
  • [22] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta numerica, 8:143–195, 1999.
  • [23] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations, 2017. arxiv:1711.10561.
  • [24] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics informed deep learning (part II): Data-driven discovery of nonlinear partial differential equations, 2017. arxiv:1711.10566.
  • [25] G. M. Rotskoff and E. Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, 2018. arxiv:1805.00915.
  • [26] Z. Shen, H. Yang, and S. Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, Nov 2019.
  • [27] R. Vidal, J. Bruna, R. Giryes, and S. Soatto. Mathematics of deep learning, 2017. arxiv:1712.04741.
  • [28] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.