

Solving and Learning Nonlinear PDEs with Gaussian Processes

Yifan Chen, Bamdad Hosseini, Houman Owhadi (corresponding author), and Andrew M. Stuart
Computing and Mathematical Sciences, Caltech, Pasadena, CA
Abstract

We introduce a simple, rigorous, and unified framework for solving nonlinear partial differential equations (PDEs), and for solving inverse problems (IPs) involving the identification of parameters in PDEs, using the framework of Gaussian processes. The proposed approach: (1) provides a natural generalization of collocation kernel methods to nonlinear PDEs and IPs; (2) has guaranteed convergence for a very general class of PDEs, and comes equipped with a path to compute error bounds for specific PDE approximations; (3) inherits the state-of-the-art computational complexity of linear solvers for dense kernel matrices. The main idea of our method is to approximate the solution of a given PDE as the maximum a posteriori (MAP) estimator of a Gaussian process conditioned on solving the PDE at a finite number of collocation points. Although this optimization problem is infinite-dimensional, it can be reduced to a finite-dimensional one by introducing additional variables corresponding to the values of the derivatives of the solution at collocation points; this generalizes the representer theorem arising in Gaussian process regression. The reduced optimization problem has the form of a quadratic objective function subject to nonlinear constraints; it is solved with a variant of the Gauss–Newton method. The resulting algorithm (a) can be interpreted as solving successive linearizations of the nonlinear PDE, and (b) in practice is found to converge in a small number of iterations (2 to 10), for a wide range of PDEs. Most traditional approaches to IPs interleave parameter updates with numerical solution of the PDE; our algorithm solves for both parameter and PDE solution simultaneously. Experiments on nonlinear elliptic PDEs, Burgers’ equation, a regularized Eikonal equation, and an IP for permeability identification in Darcy flow illustrate the efficacy and scope of our framework.

keywords:
Kernel Methods, Gaussian Processes, Nonlinear Partial Differential Equations, Inverse Problems, Optimal Recovery.
{AMS}

60G15, 65M75, 65N75, 65N35, 47B34, 41A15, 35R30, 34B15.

1 Introduction

Two hundred years ago, modeling a physical problem and solving the underlying differential equations would have required the expertise of Cauchy or Laplace, and it would have been done by hand through a laborious process of discovery. Sixty years ago, the resolution of the underlying differential equation would have been addressed using computers, but modeling and design of the solver would have still been done by hand. Nowadays, there is increasing interest in automating these steps by casting them as machine learning problems. The resulting methods can be divided into two main categories: (1) methods based on variants of artificial neural networks (ANNs) [22]; and (2) methods based on kernels and Gaussian Processes (GPs) [89, 68]. In the context of (1) there has been recent activity toward solving nonlinear PDEs, whilst the systematic development of methods of type (2) for nonlinear PDEs has remained largely open. However, methods of type (2) hold potential for considerable advantages over methods of type (1), both in terms of theoretical analysis and numerical implementation. In this paper, our goal is to develop a simple kernel/GP framework for solving nonlinear PDEs and related inverse problems (IPs); in particular the methodology we introduce has the following properties:

  • the proposed collocation setting for solving nonlinear PDEs and IPs is a direct generalization of optimal recovery kernel methods for linear PDEs [43, 44, 47], and a natural generalization of radial basis function collocation methods [91, 88], and of meshless kernel methods [64];

  • theoretically, the proposed method is provably convergent and amenable to rigorous numerical analysis, suggesting new research directions to generalize the analysis of linear regression methods [88] to the proposed collocation setting for solving nonlinear PDEs;

  • computationally, it inherits the complexity of state-of-the-art solvers for dense kernel matrices, suggesting new research to generalize the work of [65], which developed optimal approximate methods for linear regression, to the proposed collocation setting for solving nonlinear PDEs and IPs;

  • for IPs the methodology is closely aligned with methodologies prevalent in the PDE-constrained optimization literature [26] and suggests the need for new computational and theoretical analyses generalizing the standard optimization and Bayesian approaches found in [17, 20, 28].

Since ANN methods can also be interpreted as kernel methods with kernels learned from data [27, 45, 90], our framework could also be used for theoretical analysis of such methods.

In Subsection 1.1 we summarize the theoretical foundations and numerical implementation of our method in the context of a simple nonlinear elliptic PDE. In Subsection 1.2 we give a literature review, placing the proposed methodology in the context of other research at the intersection of machine learning and PDEs. The outline of the article is given in Subsection 1.3.

1.1 Summary of the Proposed Method

For demonstration purposes, we summarize the key ideas of our method for solving a nonlinear second-order elliptic PDE. This PDE will also serve as a running example in Section 3 where we present an abstract framework for general nonlinear PDEs.

Let $d\geq 1$ and $\Omega$ be a bounded open domain in $\mathbb{R}^{d}$ with a Lipschitz boundary $\partial\Omega$. Consider the following nonlinear elliptic equation for $u^{\star}$:

(1)  \left\{\begin{aligned} -\Delta u^{\star}(\mathbf{x})+\tau\big(u^{\star}(\mathbf{x})\big)&=f(\mathbf{x}), &&\forall\mathbf{x}\in\Omega\,,\\ u^{\star}(\mathbf{x})&=g(\mathbf{x}), &&\forall\mathbf{x}\in\partial\Omega\,,\end{aligned}\right.

where $f:\Omega\to\mathbb{R}$, $g:\partial\Omega\to\mathbb{R}$, and $\tau:\mathbb{R}\to\mathbb{R}$ are given continuous functions. We assume that $f,g,\tau$ are chosen appropriately so that the PDE has a unique classical solution (for the abstract theory of nonlinear elliptic PDEs see, for example, [56, 72]). In Subsection 1.1.4 we will present a concrete numerical experiment where $\tau(u)=u^{3}$ and $g(\mathbf{x})=0$.

1.1.1 Optimal Recovery

Our proposed method starts with an optimal recovery problem that can also be interpreted as maximum a posteriori (MAP) estimation for a GP constrained by a PDE. More precisely, consider a nondegenerate, symmetric, and positive definite kernel function $K:\overline{\Omega}\times\overline{\Omega}\rightarrow\mathbb{R}$, where $\overline{\Omega}:=\Omega\cup\partial\Omega$. Let $\mathcal{U}$ be the RKHS associated with $K$ and denote the RKHS norm by $\|\cdot\|$. Let $1\leq M_{\Omega}<M<\infty$ and fix $M$ points in $\overline{\Omega}$ such that $\mathbf{x}_{1},\ldots,\mathbf{x}_{M_{\Omega}}\in\Omega$ and $\mathbf{x}_{M_{\Omega}+1},\ldots,\mathbf{x}_{M}\in\partial\Omega$. Then, our method approximates the solution $u^{\star}$ of (1) with a minimizer $u^{\dagger}$ of the following optimal recovery problem:

(2)  \left\{\begin{aligned} &\operatorname*{minimize}_{u\in\mathcal{U}}\ \|u\| \\ &\ \ {\rm s.t.}\ \ -\Delta u(\mathbf{x}_{m})+\tau\big(u(\mathbf{x}_{m})\big)=f(\mathbf{x}_{m}), &&\text{for } m=1,\ldots,M_{\Omega}\,,\\ &\phantom{\ \ {\rm s.t.}\ \ } u(\mathbf{x}_{m})=g(\mathbf{x}_{m}), &&\text{for } m=M_{\Omega}+1,\ldots,M\,.\end{aligned}\right.

Here, we assume $K$ is chosen appropriately so that $\mathcal{U}\subset C^{2}(\Omega)\cap C(\overline{\Omega})$, which ensures the pointwise PDE constraints in (2) are well-defined.

A minimizer $u^{\dagger}$ can be interpreted as a MAP estimator of a GP $\xi\sim\mathcal{N}(0,\mathcal{K})$ (where $\mathcal{K}$ is the integral operator with kernel $K$; this Gaussian prior notation is equivalent to the GP notation $\mathcal{GP}(0,K)$, with $K$ the covariance function, see further discussion in Subsection 3.4.1) conditioned on satisfying the PDE at the collocation points $\mathbf{x}_{m}$, $1\leq m\leq M$. Such a view was introduced for solving linear PDEs in [43, 44], and a closely related approach is studied in [12, Sec. 5.2]; the methodology introduced via (2) serves as a prototype for generalization to nonlinear PDEs. It is important to note that in the nonlinear case the conditioned GP is no longer Gaussian in general; thus the solution of (2) is a MAP estimator only and not the conditional expectation, except in the case where $\tau(\cdot)$ is a linear function.

In the next subsections, we show the equivalence of (2) and a finite-dimensional constrained optimization problem (5). This provides existence of a minimizer to (2) (uniqueness is not necessarily a consequence of the assumed uniqueness of the solution to the PDE (1); we provide additional discussion of uniqueness in Subsection 5.1), as well as the basis for a numerical method to approximate the minimizer, based on the solution of an unconstrained finite-dimensional optimization problem (6).

1.1.2 Finite-Dimensional Representation

The key to our numerical algorithm for solving (2) is a representer theorem that characterizes $u^{\dagger}$ via a finite-dimensional representation formula. To achieve this we first reformulate (2) as a two-level optimization problem:

(3)  \left\{\begin{aligned} &\operatorname*{minimize}_{\mathbf{z}^{(1)}\in\mathbb{R}^{M},\,\mathbf{z}^{(2)}\in\mathbb{R}^{M_{\Omega}}}\ \left\{\begin{aligned} &\operatorname*{minimize}_{u\in\mathcal{U}}\ \|u\| \\ &\ \ {\rm s.t.}\ \ u(\mathbf{x}_{m})=z^{(1)}_{m}\ \text{for } m=1,\ldots,M,\ \text{ and } -\Delta u(\mathbf{x}_{m})=z^{(2)}_{m}\ \text{for } m=1,\ldots,M_{\Omega},\end{aligned}\right.\\ &\ \ {\rm s.t.}\ \ z^{(2)}_{m}+\tau(z^{(1)}_{m})=f(\mathbf{x}_{m}), \qquad\text{for } m=1,\ldots,M_{\Omega}\,,\\ &\phantom{\ \ {\rm s.t.}\ \ } z^{(1)}_{m}=g(\mathbf{x}_{m}), \qquad\text{for } m=M_{\Omega}+1,\ldots,M\,.\end{aligned}\right.

Denote $\phi^{(1)}_{m}=\updelta_{\mathbf{x}_{m}}$ and $\phi^{(2)}_{m}=\updelta_{\mathbf{x}_{m}}\circ(-\Delta)$, where $\updelta_{\mathbf{x}}$ is the Dirac delta function centered at $\mathbf{x}$. We further use the shorthand notation ${\bm{\phi}}^{(1)}$ and ${\bm{\phi}}^{(2)}$ for the $M$- and $M_{\Omega}$-dimensional vectors with entries $\phi^{(1)}_{m}$ and $\phi^{(2)}_{m}$ respectively, and ${\bm{\phi}}$ for the $(M+M_{\Omega})$-dimensional vector obtained by concatenating ${\bm{\phi}}^{(1)}$ and ${\bm{\phi}}^{(2)}$. Similarly, we write $\mathbf{z}$ for the $(M+M_{\Omega})$-dimensional vector concatenating $\mathbf{z}^{(1)}$ and $\mathbf{z}^{(2)}$.

For a fixed $\mathbf{z}$, we can solve the first-level optimization problem explicitly due to the representer theorem (see [47, Sec. 17.8]); this is not the standard RKHS/GP representer theorem [89, Sec. 2.2], in the sense that the measurements include pointwise observations of higher-order derivatives of the underlying GP (see [30] and [84, p. xiii] for related spline representation formulas with derivative information). This leads to the conclusion that

(4)  u(\mathbf{x})=K(\mathbf{x},{\bm{\phi}})K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z}\,;

here $K(\cdot,{\bm{\phi}})$ denotes the $(M+M_{\Omega})$-dimensional vector field with entries $\int K(\cdot,\mathbf{x}^{\prime})\phi_{m}(\mathbf{x}^{\prime})\,{\rm d}\mathbf{x}^{\prime}$, and $K({\bm{\phi}},{\bm{\phi}})$ is the $(M+M_{\Omega})\times(M+M_{\Omega})$ matrix with entries $\int K(\mathbf{x},\mathbf{x}^{\prime})\phi_{m}(\mathbf{x})\phi_{j}(\mathbf{x}^{\prime})\,{\rm d}\mathbf{x}\,{\rm d}\mathbf{x}^{\prime}$, with the $\phi_{m}$ denoting the entries of ${\bm{\phi}}$. For this solution, $\|u\|^{2}=\mathbf{z}^{T}K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z}$, so we can equivalently formulate (3) as a finite-dimensional optimization problem:

(5)  \left\{\begin{aligned} &\operatorname*{minimize}_{\mathbf{z}\in\mathbb{R}^{M+M_{\Omega}}}\ \mathbf{z}^{T}K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z} \\ &\ \ {\rm s.t.}\ \ z^{(2)}_{m}+\tau(z^{(1)}_{m})=f(\mathbf{x}_{m}), &&\text{for } m=1,\ldots,M_{\Omega}\,,\\ &\phantom{\ \ {\rm s.t.}\ \ } z^{(1)}_{m}=g(\mathbf{x}_{m}), &&\text{for } m=M_{\Omega}+1,\ldots,M\,.\end{aligned}\right.

Moreover, using the equation $z^{(2)}_{m}=f(\mathbf{x}_{m})-\tau(z^{(1)}_{m})$ and the boundary data, we can further eliminate $\mathbf{z}^{(2)}$ and rewrite (5) as an unconstrained problem:

(6)  \operatorname*{minimize}_{\mathbf{z}^{(1)}_{\Omega}\in\mathbb{R}^{M_{\Omega}}}\ \big(\mathbf{z}^{(1)}_{\Omega},\,g(\mathbf{x}_{\partial\Omega}),\,f(\mathbf{x}_{\Omega})-\tau(\mathbf{z}^{(1)}_{\Omega})\big)\,K({\bm{\phi}},{\bm{\phi}})^{-1}\begin{pmatrix}\mathbf{z}^{(1)}_{\Omega}\\ g(\mathbf{x}_{\partial\Omega})\\ f(\mathbf{x}_{\Omega})-\tau(\mathbf{z}^{(1)}_{\Omega})\end{pmatrix},

where $\mathbf{x}_{\Omega}$ and $\mathbf{x}_{\partial\Omega}$ denote the interior and boundary points respectively, $\mathbf{z}^{(1)}_{\Omega}$ denotes the $M_{\Omega}$-dimensional vector of the $z^{(1)}_{i}$, $i=1,\dots,M_{\Omega}$, associated with the interior points $\mathbf{x}_{\Omega}$, while $f(\mathbf{x}_{\Omega})$, $g(\mathbf{x}_{\partial\Omega})$, and $\tau(\mathbf{z}^{(1)}_{\Omega})$ are vectors obtained by applying the corresponding functions to the entries of their input vectors. For brevity we have suppressed the transpose signs in the row vector multiplying the matrix from the left in (6).
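To make (6) concrete, the following minimal Python sketch assembles the kernel matrix $K({\bm{\phi}},{\bm{\phi}})$ for the elliptic example (1) with a Gaussian kernel and evaluates the objective in (6). The helper names (gaussian_kernel_blocks, objective) are illustrative rather than part of the paper, and the Laplacian formulas are specialized to the Gaussian kernel used later in Subsection 1.1.4.

import numpy as np

# A minimal sketch, assuming K(x, y) = exp(-|x - y|^2 / (2 sigma^2)).
def gaussian_kernel_blocks(X_int, X_bdy, sigma):
    """Assemble K(phi, phi) for phi = (deltas at all points, delta o (-Laplacian) at interior points)."""
    X_all = np.vstack([X_int, X_bdy])            # interior points first, then boundary points
    d = X_all.shape[1]

    def sqdist(A, B):
        return np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)

    def K(A, B):                                 # delta / delta block
        return np.exp(-sqdist(A, B) / (2 * sigma**2))

    def K_lap(A, B):                             # delta / (-Laplacian) block for the Gaussian kernel
        r2 = sqdist(A, B)
        return K(A, B) * (d / sigma**2 - r2 / sigma**4)

    def K_laplap(A, B):                          # (-Laplacian) / (-Laplacian) block
        r2 = sqdist(A, B)
        return K(A, B) * (r2**2 / sigma**8 - (2 * d + 4) * r2 / sigma**6 + d * (d + 2) / sigma**4)

    top = np.hstack([K(X_all, X_all), K_lap(X_all, X_int)])
    bot = np.hstack([K_lap(X_int, X_all), K_laplap(X_int, X_int)])
    return np.vstack([top, bot])

def objective(z1_int, X_int, X_bdy, f, g, tau, L_chol):
    """Objective of (6); L_chol is a Cholesky factor of (a regularized) K(phi, phi)."""
    v = np.concatenate([z1_int, g(X_bdy), f(X_int) - tau(z1_int)])   # z^(2) eliminated via the PDE
    w = np.linalg.solve(L_chol.T, np.linalg.solve(L_chol, v))        # K(phi, phi)^{-1} v
    return v @ w

In practice the kernel matrix is replaced by a "nugget"-regularized version before factorization, as discussed in Subsection 1.1.4 and Appendix A.1.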

The foregoing considerations lead to the following existence result, which underpins our numerical method for (1); furthermore, (6) provides the basis for our numerical implementations. (Although this existence result is desirable, it is not necessary, in the sense that our variational analysis of minimizers of (2) remains true for near-minimizers; furthermore, the error estimates of Subsection 5.2 can be modified to include the error incurred by approximating $u^{\star}$ with a near-minimizer rather than a minimizer $u^{\dagger}$.) We summarize these facts:

Theorem 1.1.

The variational problem (2) has a minimizer of the form $u^{\dagger}(\mathbf{x})=K(\mathbf{x},{\bm{\phi}})K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z}^{\dagger}$, where $\mathbf{z}^{\dagger}$ is a minimizer of (5). Furthermore, $u^{\dagger}(\mathbf{x})$ may be found by solving the unconstrained minimization problem (6) for $\mathbf{z}^{(1)}_{\Omega}$.

Proof 1.2.

Problems (2), (3), and (5) are equivalent. It is therefore sufficient to show that (5) has a minimizer. Write $\mathbf{z}^{\star}$ for the vector with entries $z^{\star(1)}_{m}=u^{\star}(\mathbf{x}_{m})$ and $z^{\star(2)}_{m}=-\Delta u^{\star}(\mathbf{x}_{m})$. Since $u^{\star}$ solves the PDE (1), $\mathbf{z}^{\star}$ satisfies the constraints on $\mathbf{z}$ in (5). It follows that the minimization in (5) can be restricted to the set $\mathcal{C}$ of $\mathbf{z}$ that satisfy $\mathbf{z}^{T}K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z}\leq(\mathbf{z}^{\star})^{T}K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z}^{\star}$ and the nonlinear constraints. The set $\mathcal{C}$ is compact and non-empty: compact because the sublevel set of the positive definite quadratic form is compact, the constraint set is closed (it is the pre-image of a closed set under a continuous map, since $\tau$ is continuous), and the intersection of a closed set with a compact set is compact; non-empty because it contains $\mathbf{z}^{\star}$. Thus the objective function $\mathbf{z}^{T}K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z}$ achieves its minimum in the set $\mathcal{C}$. Once $\mathbf{z}^{(1)}_{\Omega}$ is found, we can extend to the boundary points to obtain $\mathbf{z}^{(1)}$, and use the nonlinear relation between $\mathbf{z}^{(1)}$ and $\mathbf{z}^{(2)}$ to reconstruct $\mathbf{z}^{(2)}$. This gives $\mathbf{z}^{\dagger}$.

1.1.3 Convergence Theory

The consistency of our method is guaranteed by the convergence of $u^{\dagger}$ towards $u^{\star}$ as $M$, the total number of collocation points, goes to infinity. We first present this result in the case of the nonlinear PDE (1) and defer a more general version to Subsection 3.2. We also give a simple proof of convergence here, as it is instructive in understanding why the method works and how the more general result can be established. Henceforth we use $|\cdot|$ to denote the standard Euclidean norm and write $\mathcal{U}\subseteq\mathcal{H}$ to denote that $\mathcal{U}$ is continuously embedded in the Banach space $\mathcal{H}$.

Theorem 1.3.

Suppose that $K$ is chosen so that $\mathcal{U}\subseteq H^{s}(\Omega)$ for some $s>2+d/2$ and that (1) has a unique classical solution $u^{\star}\in\mathcal{U}$. Write $u^{\dagger}_{M}$ for a minimizer of (2) with $M$ distinct collocation points $\mathbf{x}_{1},\ldots,\mathbf{x}_{M}$. Assume that, as $M\rightarrow\infty$,

\sup_{\mathbf{x}\in\Omega}\,\min_{1\leq m\leq M_{\Omega}}|\mathbf{x}-\mathbf{x}_{m}|\rightarrow 0\quad\text{and}\quad\sup_{\mathbf{x}\in\partial\Omega}\,\min_{M_{\Omega}+1\leq m\leq M}|\mathbf{x}-\mathbf{x}_{m}|\rightarrow 0\,.

Then, as $M\rightarrow\infty$, $u^{\dagger}_{M}$ converges towards $u^{\star}$ pointwise in $\Omega$ and in $H^{t}(\Omega)$ for any $t<s$.

Proof 1.4.

The proof relies on a simple compactness argument together with the Sobolev embedding theorem [1, 8]. First, as $u^{\star}$ satisfies the constraints in (2) and $u^{\dagger}_{M}$ is a minimizer, it follows that $\|u^{\dagger}_{M}\|\leq\|u^{\star}\|$ for all $M\geq 1$. This implies $\|u^{\dagger}_{M}\|_{H^{s}(\Omega)}\leq C\|u^{\star}\|$ because $\mathcal{U}$ is continuously embedded into $H^{s}(\Omega)$. Let $t\in(2+d/2,s)$, so that $H^{t}(\Omega)$ embeds continuously into $C^{2}(\Omega)\cap C(\overline{\Omega})$ [1, Thm. 4.12]. Since $H^{s}(\Omega)$ is compactly embedded into $H^{t}(\Omega)$, we deduce that there exists a subsequence $\{M_{p},p\geq 1\}\subset\mathbb{N}$ and a limit $u^{\dagger}_{\infty}\in H^{t}(\Omega)$ such that $u^{\dagger}_{M_{p}}$ converges towards $u^{\dagger}_{\infty}$ in the $H^{t}(\Omega)$ norm as $p\rightarrow\infty$. This also implies convergence in $C^{2}(\Omega)$, due to the continuous embedding of $H^{t}(\Omega)$ into $C^{2}(\Omega)\cap C(\overline{\Omega})$.

We now show that $u^{\dagger}_{\infty}$ satisfies the desired PDE in (1). Define $v=-\Delta u^{\dagger}_{\infty}+\tau\big(u^{\dagger}_{\infty}\big)-f\in C(\Omega)$ and $v_{p}=-\Delta u^{\dagger}_{M_{p}}+\tau\big(u^{\dagger}_{M_{p}}\big)-f\in C(\Omega)$. For any $\mathbf{x}\in\Omega$ and $p\geq 1$, use of the triangle inequality shows that

(7)  |v(\mathbf{x})| \leq \min_{1\leq m\leq M_{p,\Omega}}\big(|v(\mathbf{x})-v(\mathbf{x}_{m})|+|v(\mathbf{x}_{m})-v_{p}(\mathbf{x}_{m})|\big) \leq \min_{1\leq m\leq M_{p,\Omega}}|v(\mathbf{x})-v(\mathbf{x}_{m})|+\|v-v_{p}\|_{C(\Omega)}\,,

where we have used the fact that $v_{p}(\mathbf{x}_{m})=0$, and $M_{p,\Omega}$ is the number of interior points associated with the total $M_{p}$ collocation points. Since $v$ is continuous on a compact domain, it is also uniformly continuous. Thus $\lim_{p\to\infty}\min_{1\leq m\leq M_{p,\Omega}}|v(\mathbf{x})-v(\mathbf{x}_{m})|=0$, since the fill-distance converges to zero by assumption. Moreover, $v_{p}$ converges to $v$ in the $C(\Omega)$ norm, due to the $C^{2}(\Omega)$ convergence of $u^{\dagger}_{M_{p}}$ to $u^{\dagger}_{\infty}$. Combining these facts with (7) and taking $p\to\infty$, we obtain $v(\mathbf{x})=0$, and thus $-\Delta u^{\dagger}_{\infty}(\mathbf{x})+\tau\big(u^{\dagger}_{\infty}(\mathbf{x})\big)=f(\mathbf{x})$ for $\mathbf{x}\in\Omega$. Following a similar argument, we get $u^{\dagger}_{\infty}(\mathbf{x})=g(\mathbf{x})$ for $\mathbf{x}\in\partial\Omega$. Thus, $u^{\dagger}_{\infty}$ is a classical solution to (1). By assumption, the solution is unique, so we must have $u^{\dagger}_{\infty}=u^{\star}$. As the limit $u^{\dagger}_{\infty}$ is independent of the particular subsequence, the whole sequence $u^{\dagger}_{M}$ must converge to $u^{\star}$ in $H^{t}(\Omega)$. Since $t\in(2+d/2,s)$, we also get pointwise convergence and convergence in $H^{t}(\Omega)$ for any $t<s$.

We note that this convergence theorem requires $K$ to be adapted to the solution space of the PDE, so that $u^{\star}$ belongs to $\mathcal{U}$. In our numerical experiments, we will use a Gaussian kernel. However, if $f$ or the boundary $\partial\Omega$ are not sufficiently regular, then the embedding conditions $u^{\star}\in\mathcal{U}\subseteq H^{s}(\Omega)$ may be satisfied by taking the kernel to be the Green's function of some power of the Laplacian instead of a Gaussian kernel; the latter induces smoothness on $\mathcal{U}$ which may be incompatible with the regularity of $u^{\star}$ for irregular $f$ and $\partial\Omega$. We will provide further insight into the convergence properties of our method, including an error bound between the solution $u^{\star}$ and a minimizer $u^{\dagger}$, in Subsection 5.2.

Figure 1: $L^{2}$ and $L^{\infty}$ error plots for numerical approximations of $u^{\star}$, the solution to (1), as a function of the number of collocation points $M$. Left: $\tau(u)=0$; both the kernel collocation method, using the Gaussian kernel with $\sigma=0.2$ and with $\sigma=M^{-1/4}$, and the finite difference (FD) method were implemented. Right: $\tau(u)=u^{3}$; the kernel collocation method using the Gaussian kernel with $\sigma=0.2$ and with $\sigma=M^{-1/4}$ was used. In both cases, an adaptive nugget term with global parameter $\eta=10^{-13}$ was used to regularize the kernel matrix $\Theta$ (see Appendix A.1 for details).

1.1.4 Numerical Framework

The representation of $u^{\dagger}$ via the optimization problem (6) is the cornerstone of our numerical framework for solving nonlinear PDEs. Indeed, efficient solution of (6), and in turn construction and inversion of the matrix $K({\bm{\phi}},{\bm{\phi}})$, are the most costly steps of our numerical implementation. We summarize the main ingredients of our algorithm below:

  • We propose an efficient variant of the Gauss–Newton algorithm in Section 3.4.2. Although, in general, (6) is a nonconvex problem, our algorithm typically converges in between 2 and 10 steps in all the experiments we have conducted.

  • In practice we perturb $K({\bm{\phi}},{\bm{\phi}})$ to improve its condition number, and hence the numerical stability of the algorithm, by adding a small diagonal matrix; this perturbation is adapted to the problem at hand, as outlined in Appendix A.1. The approach generalizes the idea of a "nugget" as widely used in GP regression.

  • To evaluate the cost function in (6), we pre-compute the Cholesky factorization of the (perturbed) kernel matrix and store it for multiple uses. State-of-the-art linear solvers for dense kernel matrices can be used for this step. A minimal sketch combining these ingredients, in the setting of (1), is given after this list.
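The following Python sketch illustrates these ingredients on the nonlinear elliptic example: a nugget-regularized Cholesky factorization of $K({\bm{\phi}},{\bm{\phi}})$ is computed once, and a plain Gauss–Newton loop (linearizing $\tau$ about the current iterate) is run on the objective (6). It reuses the hypothetical helper gaussian_kernel_blocks from the sketch in Subsection 1.1.2 and is meant as an illustration, not the exact variant of the algorithm presented in Section 3.4.2.

import numpy as np
from scipy.linalg import solve_triangular, lstsq

def solve_nonlinear_elliptic(X_int, X_bdy, f, g, tau, dtau, sigma, eta=1e-13, iters=10):
    """Gauss-Newton sketch for the unconstrained problem (6); names are illustrative."""
    Theta = gaussian_kernel_blocks(X_int, X_bdy, sigma)                       # K(phi, phi)
    Theta = Theta + eta * np.trace(Theta) / len(Theta) * np.eye(len(Theta))   # simple nugget regularization
    L = np.linalg.cholesky(Theta)                                             # factorize once, reuse below

    z = np.zeros(X_int.shape[0])                       # initial guess for z^(1) at interior points
    for _ in range(iters):
        v = np.concatenate([z, g(X_bdy), f(X_int) - tau(z)])
        r = solve_triangular(L, v, lower=True)         # residual with |r|^2 = v^T Theta^{-1} v
        Jv = np.vstack([np.eye(X_int.shape[0]),        # d v / d z = [I; 0; -diag(tau'(z))]
                        np.zeros((X_bdy.shape[0], X_int.shape[0])),
                        -np.diag(dtau(z))])
        Jr = solve_triangular(L, Jv, lower=True)
        dz = lstsq(Jr, -r)[0]                          # Gauss-Newton step
        z = z + dz
    return z                                           # approximation of u at the interior collocation points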

As a demonstration, we present here a simple experiment to show the convergence of our method. We take $d=2$, $\Omega=(0,1)^{2}$, and $\tau(u)=0$ or $\tau(u)=u^{3}$, together with homogeneous Dirichlet boundary conditions $g(\mathbf{x})=0$. The true solution is prescribed to be $u^{\star}(\mathbf{x})=\sin(\pi x_{1})\sin(\pi x_{2})+4\sin(4\pi x_{1})\sin(4\pi x_{2})$ and the corresponding right-hand side $f(\mathbf{x})$ is computed from the equation.

We choose the Gaussian kernel

K(\mathbf{x},\mathbf{y};\sigma)=\exp\Bigl(-\frac{|\mathbf{x}-\mathbf{y}|^{2}}{2\sigma^{2}}\Bigr)\,,

with lengthscale parameter $\sigma$; we will fix this parameter, but note that the framework is naturally adapted to the learning of such hyperparameters. We set $M=64,256,1024,4096$, sampling the collocation points on a uniform grid of points within $\Omega$. We apply our algorithm to solve the PDEs and compute the $L^{2}$ and $L^{\infty}$ errors with respect to $u^{\star}$. In the case $\tau(u)=0$, which corresponds to a linear PDE, we also compare our algorithm with a second-order finite difference (FD) method. For the nonlinear case $\tau(u)=u^{3}$, we observe that the Gauss–Newton algorithm needs only 2 steps to converge. The errors versus $M$ are depicted in Figure 1. The following observations can be made from this figure:

  • In the linear case $\tau(u)=0$, where the corresponding optimization problem (6) is convex, our algorithm outperforms the FD method in terms of accuracy and rate of convergence. This can be attributed to the fact that the true solution is very smooth, so the Gaussian kernel allows for efficient approximation.

  • The choice of the kernel parameter $\sigma$ has a profound impact on the accuracy and rate of convergence of the algorithm, especially when $M$ is not very large. This underlines the importance of choosing a "good kernel" that is adapted to the solution space of the PDE, and highlights the value of hyperparameter learning.

  • In the nonlinear setting, our algorithm demonstrates convergence behavior similar to the linear setting. Once again, an appropriate choice of $\sigma$ leads to significant gains in solution accuracy.
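For completeness, a minimal driver for the nonlinear case of this experiment, chaining the hypothetical sketches above, could look as follows; for simplicity it measures the error only at the interior collocation points.

import numpy as np

u_star = lambda X: (np.sin(np.pi * X[:, 0]) * np.sin(np.pi * X[:, 1])
                    + 4 * np.sin(4 * np.pi * X[:, 0]) * np.sin(4 * np.pi * X[:, 1]))
neg_lap_u = lambda X: (2 * np.pi**2 * np.sin(np.pi * X[:, 0]) * np.sin(np.pi * X[:, 1])
                       + 128 * np.pi**2 * np.sin(4 * np.pi * X[:, 0]) * np.sin(4 * np.pi * X[:, 1]))
tau, dtau = (lambda u: u**3), (lambda u: 3 * u**2)
f = lambda X: neg_lap_u(X) + tau(u_star(X))      # right-hand side consistent with (1)
g = lambda X: np.zeros(X.shape[0])               # homogeneous Dirichlet boundary data

m = 32                                           # grid points per dimension; M = m^2 = 1024
t = np.linspace(0.0, 1.0, m)
XX, YY = np.meshgrid(t, t)
X = np.column_stack([XX.ravel(), YY.ravel()])
on_bdy = (X[:, 0] == 0) | (X[:, 0] == 1) | (X[:, 1] == 0) | (X[:, 1] == 1)
X_int, X_bdy = X[~on_bdy], X[on_bdy]

z = solve_nonlinear_elliptic(X_int, X_bdy, f, g, tau, dtau, sigma=0.2)
print(np.max(np.abs(z - u_star(X_int))))         # L^infty error at interior collocation points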

1.2 Relevant Literature

Machine learning and statistical inference approaches to numerical approximation have attracted a lot of attention in recent years thanks to their remarkable success in engineering applications. Such approaches can be broadly divided into two categories: (1) GP/Kernel-based methods, and (2) ANN-based methods.

GPs and kernel-based methods have long been used in function approximation, regression, and machine learning [89]. As surveyed in [48]:

“[kernel approaches can be traced back to] Poincaré’s course in Probability Theory [54] and to the pioneering investigations of Sul’din [75], Palasti and Renyi [52], Sard [63], Kimeldorf and Wahba [31] (on the correspondence between Bayesian estimation and spline smoothing/interpolation) and Larkin [35] (on the correspondence between Gaussian process regression and numerical approximation). Although their study initially attracted little attention among numerical analysts [35], it was revived in Information Based Complexity (IBC) [78], Bayesian numerical analysis [18], and more recently in Probabilistic Numerics [25] and Bayesian numerical homogenization [43].”

For recent reviews of the extensive applications of GP/kernel methods see [13, 48, 76] and [47, Chap. 20]. In particular, they have been introduced to motivate novel methods for solving ordinary differential equations (ODEs) [71, 9, 67] and underlie the collocation approach advocated for parameter identification in ODEs as described in [60]. For PDEs, the use of kernel methods can be traced back to meshless collocation methods with radial basis functions [91, 88, 64]. Furthermore, a recent systematic development towards solving PDEs was initiated in [43, 44, 51, 47] and has led to the identification of fast solvers for elliptic PDEs and dense kernel matrices [66, 65] with state-of-the-art complexity versus accuracy guarantees. The effectiveness of these methods has been supported by well-developed theory [47] residing at the interfaces between numerical approximation [84, 88], optimal recovery [40], information-based complexity [78], and GP regression [6], building on the perspective introduced in [18, 35, 40, 79, 84]. In particular, there is a one-to-one correspondence [47, 66] between (1) the locality of basis functions (gamblets) obtained with kernel methods and the screening effect observed in kriging [73], (2) Schur complementation and conditioning Gaussian vectors, and (3) the approximate low-rank structure of inverse stiffness matrices obtained with kernel methods and variational properties of Gaussian conditioning. Furthermore, although the approach of modeling a deterministic function (the solution $u^{\star}$ of a PDE) as a sample from a Gaussian distribution may seem counterintuitive, it can be understood (in the linear setting [47]) as an optimal minimax strategy for recovering $u^{\star}$ from partial measurements. Indeed, as in von Neumann's theory of games, optimal strategies are mixed (randomized) strategies, and (using quadratic norms to define losses) GP regression (kriging) emerges as an optimal minimax approach to numerical approximation [40, 47].

On the other hand, ANN methods can be traced back to [19, 33, 34, 61, 80] and, although developed for ODEs several decades ago [11, 61], with some of the work generalized to PDEs [32], their systematic development for solving PDEs has been initiated only recently. This systematic development includes the Deep Ritz Method [87], Physics Informed Neural Networks [58] (PINNs), DGN [70], and [23] which employs a probabilistic formulation of the nonlinear PDE via the Feynman-Kac formula. Although the basic idea is to replace unknown functions with ANNs and minimize some loss with respect to the parameters of the ANN to identify the solution, there are by now many variants of ANN approaches, and we refer to [29] for a recent review of this increasingly popular topic at the interface between scientific computing and deep learning. While ANN approaches have been shown to be empirically successful on complex problems (e.g., machine learning physics [59]), they may fail on simple ones [86, 81] if the architecture of the ANN is not adapted to the underlying PDE [85]. Moreover, the theory of ANNs is still in its infancy; most ANN-based PDE solvers lack theoretical guarantees and convergence analyses are often limited to restricted linear instances [69, 86]. Broadly speaking, in comparison with kernel methods, ANN methods have both limited theoretical foundations and unfavorable complexity vs accuracy estimates. We also remark that ANN methods are suitable for the learning of the parametric dependence of solutions of differential equations [92, 5, 36, 37, 7]; however, GP and kernel methods may also be used in this context, and the random feature method provides a natural approach to implementing such methods in high dimensions [41].

Regardless, the theory and computational framework of kernel methods may naturally be extended to ANN methods to investigate such methods (beyond the infinite-width neural tangent kernel regime [27, 86]) and possibly accelerate them, by viewing them as ridge regression with data-dependent kernels and following [66, 65]. To this end, ANN methods can be interpreted as kernel methods with data-dependent kernels in two equivalent ways: (1) as solving PDEs in an RKHS spanned by a feature map, parameterized by the initial layers of the ANN, that is learned from data; or (2) as kernel-based methods with kernels that are parameterized by the inner layers of the network. For instance, [45] shows that Residual Neural Networks [24] (ResNets) are ridge regressors with warped kernels [62, 53]. Since our approach employs an explicit kernel, it allows us to learn that kernel directly, as discussed in Subsection 5.3. Given the notorious difficulty of developing numerical methods for nonlinear PDEs [77], it is to some degree surprising that (as suggested by our framework) (A) this difficulty can universally be decomposed into three parts: (1) the compression/inversion of dense kernel matrices, (2) the selection of the kernel, and (3) the minimization of the reduced finite-dimensional optimization problem (24); and (B) a significant portion of the resulting analysis can be reduced to that of linear regression [47].

Beyond solving PDEs, ANN methods have also been used in data-driven discretizations, and discovery of PDEs that allow for the identification of the governing model [58, 39, 4]; this leads to applications in IPs. Our method, viewed as a GP conditioned on PDE constraints at collocation points, can be interpreted as solving an IP with Bayesian inference and a Gaussian prior [74]. Indeed, if we relax the PDE constraints as in Subsection 3.3.2, then a minimizer $u^{\dagger}$ coincides with a MAP estimator of a posterior measure obtained by viewing the PDE constraints as nonlinear pointwise measurements of a field $u$ with a Gaussian prior $\mathcal{N}(0,\mathcal{K})$. Bayesian IPs with Gaussian priors have been studied extensively (see [17, 14, 74] and references therein). The standard approach for their solution is to discretize the problem using spectral projection or finite element methods, and compute posterior MAP estimators [16] or employ Markov chain Monte Carlo algorithms [15] to simulate posterior samples. Our abstract representation theorem outlined in Section 2.1 completely characterizes the MAP estimator of posterior measures with Gaussian priors in settings where the forward map is written as the composition of a nonlinear map with bounded linear functionals acting on the parameter. Indeed, this is precisely the approach that we employ in Section 4 to solve IPs with PDE constraints. However, the main difference between our methodology and existing methods in the literature is that we pose the IP as that of recovering the solution of the PDE $u^{\dagger}$ simultaneously with learning the unknown PDE parameter, with independent Gaussian priors on both unknowns.

We now turn to motivation for the GP interpretation. The PDE solvers obtained here are deterministic and could be described from an entirely classical numerical approximation perspective. However, we emphasize the GP interpretation for two reasons: (i) it is integral to the derivation of the methods; and (ii) it allows the numerical solution of the PDE to be integrated into a larger engineering pipeline and, in that context, the posterior distribution of the GP conditioned on the PDE at collocation points provides a measure of uncertainty quantification. Using the GP perspective as a pathway to the discovery of numerical methods was the motivation for the work in [43, 44]. Indeed, as discussed in [43], while the discovery of numerical solvers for PDEs is usually based on a combination of insight and guesswork, this process can be facilitated to the point of being automated, using this GP perspective. For instance, for nonsmooth PDEs, basis functions with near-optimal accuracy/localization tradeoff and operator-valued wavelets can be identified by conditioning physics-informed Gaussian priors on localized linear measurements (e.g., local averages or pointwise values). Physics-informed Gaussian priors can, in turn, be identified by (a) filtering uninformed Gaussian priors through the inverse operator [43], or (b) turning the process of computing fast with partial information into repeated zero-sum games with physics-informed losses (whose optimal mixed strategies are physics-informed Gaussian priors) [44]. The paper [57] generalized (a) to time-dependent PDEs by filtering uninformed priors through linearized PDEs obtained via time stepping. The paper [12] emphasized the probabilistic interpretation of numerical errors obtained from this Bayesian perspective. In particular, [12, Sec. 5.2] describes a method identical to the one considered here (and [43]) for linear problems; in the setting of semi-linear PDEs, it is suggested in [12] that latent variables could be employed to efficiently sample from posterior/conditional measures (see also [45], where latent variables were employed to reduce nonlinear optimal recovery problems via two-level optimization as in (3)). Although the methodology proposed in our paper agrees with that in [12] for linear problems, the methodology we propose appears to be more general, and differs in the case of nonlinear problems to which both approaches apply.

1.3 Outline of Article

The remainder of this article is organized as follows. We give an overview of the abstract theory of GPs on Banach spaces in Section 2; we establish notation, and summarize basic results and ideas that are used throughout the remainder of the article. Section 3 is dedicated to the development of our numerical framework for solving nonlinear PDEs with kernel methods; we outline our assumptions on the PDE, present a general convergence theory, discuss our approach to implementation, and present numerical experiments. In Section 4 we extend our nonlinear PDE framework to IPs and discuss the implementation differences between the PDE and IP settings, followed by numerical experiments involving a benchmark IP in subsurface flow. Finally, we present additional discussions, results, and possible extensions of our method in Section 5. Appendix A is devoted to the small diagonal regularization of kernel matrices (“nugget” term) and outlines general principles as well as specifics for the examples considered in this paper.

2 Conditioning GPs on Nonlinear Observations

In this section, we outline the abstract theory of RKHSs/GPs and their connection to Banach spaces endowed with quadratic norms; this forms the framework for the proposed methodology to solve PDEs and IPs. We start by recalling some basic facts about GPs in Subsection 2.1. This is followed in Subsection 2.2 by general results pertaining to conditioning of GPs on linear and nonlinear observations, leading to a representer theorem that is the cornerstone of our numerical algorithms. Some of these results may be known to experts, but to the best of our knowledge, they are not presented in the existing literature with the coherent goal of solving nonlinear problems via the conditioning of GPs; hence this material may be of independent interest to the reader.

2.1 GPs and Banach Spaces Endowed with a Quadratic Norm

Consider a separable Banach space $\mathcal{U}$ and its dual $\mathcal{U}^{\ast}$ with their duality pairing denoted by $[\cdot,\cdot]$. We further assume that $\mathcal{U}$ is endowed with a quadratic norm $\|\cdot\|$, i.e., there exists a linear bijection $\mathcal{K}:\mathcal{U}^{\ast}\to\mathcal{U}$ that is symmetric ($[\mathcal{K}\phi,\varphi]=[\mathcal{K}\varphi,\phi]$), positive ($[\mathcal{K}\phi,\phi]>0$ for $\phi\neq 0$), and such that

\|u\|^{2}=[\mathcal{K}^{-1}u,u],\qquad\forall u\in\mathcal{U}\,.

The operator $\mathcal{K}$ endows $\mathcal{U}$ and $\mathcal{U}^{\ast}$ with the following inner products:

\langle u,v\rangle:=[\mathcal{K}^{-1}u,v],\qquad u,v\in\mathcal{U}\,,
\qquad
\langle\phi,\varphi\rangle_{\ast}:=[\phi,\mathcal{K}\varphi],\qquad \phi,\varphi\in\mathcal{U}^{\ast}\,.

Note that the $\langle\cdot,\cdot\rangle_{\ast}$ inner product defines a norm on $\mathcal{U}^{\ast}$ that coincides with the standard dual norm of $\mathcal{U}^{\ast}$, i.e.,

\|\phi\|_{\ast}=\sup_{u\in\mathcal{U},u\neq 0}\frac{[\phi,u]}{\|u\|}=\sqrt{[\phi,\mathcal{K}\phi]}\,.

As in [47, Chap. 11], although $\mathcal{U},\mathcal{U}^{\ast}$ are also Hilbert spaces under the quadratic norms $\|\cdot\|$ and $\|\cdot\|_{\ast}$, we will keep using the Banach space terminology to emphasize the fact that our dual pairings will not be based on the inner product through the Riesz representation theorem, but on a different realization of the dual space. A particular case of the setting considered here is $\mathcal{U}=H^{s}_{0}(\Omega)$ (writing $H^{s}_{0}(\Omega)$ for the closure of the set of smooth functions with compact support in $\Omega$ with respect to the Sobolev norm $\|\cdot\|_{H^{s}(\Omega)}$), with its dual $\mathcal{U}^{\ast}=H^{-s}(\Omega)$ defined by the pairing $[\phi,u]:=\int_{\Omega}\phi u$ obtained from the Gelfand triple $H^{s}(\Omega)\subset L^{2}(\Omega)\subset H^{-s}(\Omega)$. Here $\mathcal{K}$ can be defined as the solution map of an elliptic operator mapping $H^{s}_{0}(\Omega)$ to $H^{-s}(\Omega)$; given $s>0$, we call an invertible operator $\mathcal{L}:H^{s}_{0}(\Omega)\rightarrow H^{-s}(\Omega)$ elliptic if it is positive and symmetric in the sense that $\int_{\Omega}u\mathcal{L}u\geq 0$ and $\int_{\Omega}u\mathcal{L}v=\int_{\Omega}v\mathcal{L}u$.

We say that $\xi$ is the canonical GP [47, Chap. 17.6] on $\mathcal{U}$, and write $\xi\sim\mathcal{N}(0,\mathcal{K})$, if and only if $\xi$ is a linear isometry from $\mathcal{U}^{\ast}$ to a centered Gaussian space (a closed linear space of scalar-valued centered Gaussian random variables). The word canonical indicates that the covariance operator of $\xi$ coincides with the bijection $\mathcal{K}$ defining the norm on $\mathcal{U}$. Write $[\phi,\xi]$ for the image of $\phi\in\mathcal{U}^{\ast}$ under $\xi$ and note that the following properties hold:

\mathbb{E}\,[\phi,\xi]=0\quad\text{and}\quad\mathbb{E}\,[\phi,\xi][\varphi,\xi]=[\phi,\mathcal{K}\varphi],\qquad\forall\phi,\varphi\in\mathcal{U}^{\ast}\,.

The space $\mathcal{U}$ coincides with the Cameron–Martin space of the GP $\mathcal{N}(0,\mathcal{K})$. In the setting where $\mathcal{K}$ is defined through a covariance kernel $K$ (as in Subsection 1.1 and later in Subsection 3.4.2), $\mathcal{U}$ coincides with the RKHS of the kernel $K$ [82, Sec. 2.3].

2.2 Conditioning GPs with Linear and Nonlinear Observations

Let $\phi_{1},\dots,\phi_{N}$ be $N$ non-trivial elements of $\mathcal{U}^{\ast}$ and define

(8)  {\bm{\phi}}:=\left(\phi_{1},\dots,\phi_{N}\right)\in(\mathcal{U}^{\ast})^{\otimes N}\,.

Now consider the canonical GP $\xi\sim\mathcal{N}(0,\mathcal{K})$; then $[{\bm{\phi}},\xi]$ is an $\mathbb{R}^{N}$-valued Gaussian vector with $[{\bm{\phi}},\xi]\sim\mathcal{N}(0,\Theta)$, where

(9)  \Theta\in\mathbb{R}^{N\times N},\qquad\Theta_{i,n}=[\phi_{i},\mathcal{K}\phi_{n}],\quad 1\leq i,n\leq N\,.

The following proposition characterizes the conditional distribution of GPs under these linear observations; to simplify the statement it is useful to write $\mathcal{K}({\bm{\phi}},\varphi)$ for the vector with entries $[\phi_{i},\mathcal{K}\varphi]$. This type of vectorized notation is used in [47].

Proposition 2.1.

Consider a vector $\mathbf{y}\in\mathbb{R}^{N}$ and the canonical GP $\xi\sim\mathcal{N}(0,\mathcal{K})$. Then $\xi$ conditioned on $[{\bm{\phi}},\xi]=\mathbf{y}$ is also Gaussian. Moreover, if $\Theta$ is invertible, then ${\rm Law}[\xi\,|\,[{\bm{\phi}},\xi]=\mathbf{y}]=\mathcal{N}(u^{\dagger},\mathcal{K}_{\bm{\phi}})$, with conditional mean defined by $u^{\dagger}=\mathbf{y}^{T}\Theta^{-1}\mathcal{K}{\bm{\phi}}$ and conditional covariance operator defined by $[\varphi,\mathcal{K}_{\bm{\phi}}\varphi]=[\varphi,\mathcal{K}\varphi]-\mathcal{K}(\varphi,{\bm{\phi}})\Theta^{-1}\mathcal{K}({\bm{\phi}},\varphi)$ for all $\varphi\in\mathcal{U}^{\ast}$.

Proposition 2.1 gives a finite representation of the conditional mean of the GP, constituting a representer theorem [47, Cor. 17.12]. Let us define the elements

(10)  \chi_{i}:=\sum_{n=1}^{N}\Theta_{i,n}^{-1}\mathcal{K}\phi_{n}\,,

referred to as gamblets in the parlance of [44], which can equivalently be characterized as the minimizers of the variational problem

(11)  \left\{\begin{aligned} &\operatorname*{minimize}_{\chi\in\mathcal{U}}\ \|\chi\| \\ &\ \ {\rm s.t.}\ \ [\phi_{n},\chi]=\delta_{i,n},\qquad n=1,\dots,N\,.\end{aligned}\right.

This fact further enables the variational characterization of the conditional mean $u^{\dagger}$ directly in terms of the gamblets $\chi_{n}$.
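To make this conditioning concrete, here is a small Python sketch (with illustrative names, not part of the paper) that, given the matrix $\Theta$ and the vectors $\mathcal{K}(\updelta_{\mathbf{x}},{\bm{\phi}})$ at a set of test points, evaluates the gamblets, the conditional mean $u^{\dagger}=\sum_{n}y_{n}\chi_{n}$, and the pointwise conditional variance obtained from the covariance formula in Proposition 2.1 with $\varphi=\updelta_{\mathbf{x}}$.

import numpy as np

def condition_gp(Theta, K_x_phi, K_x_x, y, nugget=1e-12):
    """Conditional mean and pointwise variance of xi ~ N(0, K) given [phi, xi] = y.

    Theta   : (N, N) matrix with entries [phi_i, K phi_n]
    K_x_phi : (T, N) matrix with rows K(x, phi) at T test points x
    K_x_x   : (T,)  vector of prior variances K(x, x) at the test points
    y       : (N,)  observation vector
    """
    Theta = Theta + nugget * np.eye(Theta.shape[0])       # small regularization for stability
    L = np.linalg.cholesky(Theta)
    # gamblets evaluated at the test points: chi(x) = K(x, phi) Theta^{-1}
    chi = np.linalg.solve(L.T, np.linalg.solve(L, K_x_phi.T)).T
    mean = chi @ y                                        # u_dagger(x) = sum_n y_n chi_n(x)
    var = K_x_x - np.sum(chi * K_x_phi, axis=1)           # K(x,x) - K(x,phi) Theta^{-1} K(phi,x)
    return mean, var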

Proposition 2.2.

Let $u^{\dagger}=\mathbb{E}\left[\xi\,|\,[{\bm{\phi}},\xi]=\mathbf{y}\right]$ as in Proposition 2.1. Then $u^{\dagger}=\sum_{n=1}^{N}y_{n}\chi_{n}$ is the unique minimizer of

\left\{\begin{aligned} &\operatorname*{minimize}_{u\in\mathcal{U}}\ \|u\| \\ &\ \ {\rm s.t.}\ \ [\phi_{n},u]=y_{n},\qquad n=1,\dots,N\,.\end{aligned}\right.

Proposition 2.2 is the cornerstone of our methodology for the solution of nonlinear PDEs. It is also useful for the solution of IPs. For this purpose, consider nonlinear maps $G:\mathbb{R}^{N}\to\mathbb{R}^{I}$ and $F:\mathbb{R}^{N}\to\mathbb{R}^{M}$, vectors $\mathbf{o}\in\mathbb{R}^{I}$ and $\mathbf{y}\in\mathbb{R}^{M}$, and the optimization problem

(12)  \left\{\begin{aligned} &\operatorname*{minimize}_{u\in\mathcal{U}}\ \|u\|^{2}+\frac{1}{\gamma^{2}}|G([{\bm{\phi}},u])-\mathbf{o}|^{2} \\ &\ \ {\rm s.t.}\ \ F([{\bm{\phi}},u])=\mathbf{y}\,,\end{aligned}\right.

where $\gamma\in\mathbb{R}$ is a parameter. We will use this formulation for IPs in PDEs, with $u$ concatenating the solution of the forward PDE problem and the unknown parameter; the nonlinear constraint involving $F$ enforces the forward PDE, while $G$ encodes the observed noisy data. A representer theorem still holds for a minimizer of this problem, stating that the solution has a finite expansion in terms of the gamblets $\chi_{n}$:

Proposition 2.3.

Suppose $(\mathbf{o},\mathbf{y})\in\mathbb{R}^{I}\times\mathbb{R}^{M}$ are fixed and $\Theta$ is invertible (relaxing the interpolation constraints renders the invertibility assumption on $\Theta$ unnecessary; nevertheless we keep it for ease of presentation). Then $u^{\dagger}\in\mathcal{U}$ is a minimizer of (12) if and only if $u^{\dagger}=\sum_{n=1}^{N}z^{\dagger}_{n}\chi_{n}$ and the vector $\mathbf{z}^{\dagger}$ is a minimizer of

(13)  \left\{\begin{aligned} &\operatorname*{minimize}_{\mathbf{z}\in\mathbb{R}^{N}}\ \mathbf{z}^{T}\Theta^{-1}\mathbf{z}+\frac{1}{\gamma^{2}}|G(\mathbf{z})-\mathbf{o}|^{2} \\ &\ \ {\rm s.t.}\ \ F(\mathbf{z})=\mathbf{y}\,.\end{aligned}\right.

Proof 2.4.

The proof is nearly identical to the derivation of (5) presented in Section 1.1.2. Simply observe that minimizing (12) is equivalent to solving

(14)  \operatorname*{minimize}_{\mathbf{z}\in\mathbb{R}^{N}\,:\,F(\mathbf{z})=\mathbf{y}}\left\{\begin{aligned} &\operatorname*{minimize}_{u\in\mathcal{U}}\ \|u\|^{2}+\frac{1}{\gamma^{2}}|G(\mathbf{z})-\mathbf{o}|^{2} \\ &\ \ {\rm s.t.}\ \ [{\bm{\phi}},u]=\mathbf{z}\,.\end{aligned}\right.

Then solve the inner optimization problem for a fixed $\mathbf{z}$ and apply Proposition 2.1.

We note that this model assumes independent and identically distributed (i.i.d.) observation noise for the vector $\mathbf{o}$; it can easily be extended to correlated observation noise by replacing the misfit term $\frac{1}{\gamma^{2}}|G(\mathbf{z})-\mathbf{o}|^{2}$ in (12) with an appropriately weighted misfit term of the form $|\Sigma^{-1/2}(G(\mathbf{z})-\mathbf{o})|^{2}$, where $\Sigma$ denotes the covariance matrix of the observation noise.

Remark 2.5.

It is intuitive that a minimizer of the optimization problem we introduce and solve in this paper corresponds to a MAP point for the GP $\xi\sim\mathcal{N}(0,\mathcal{K})$ conditioned on PDE constraints at the collocation points. To prove this will require an extension of the approach introduced in [16], for example, and is left for future work. Here we describe this connection informally in the absence of the equality constraints. Consider the prior measure $\mu_{0}=\mathcal{N}(0,\mathcal{K})$ and the measurements $\mathbf{o}=G([{\bm{\phi}},u])+\eta$, $\mathbf{y}=F([{\bm{\phi}},u])+\eta^{\prime}$, with $\eta\sim\mathcal{N}(0,\gamma^{2}I)$ and $\eta^{\prime}\sim\mathcal{N}(0,\beta^{2}I)$. It then follows from Bayes' rule [74] that the posterior measure of $u$ given the data $(\mathbf{o},\mathbf{y})$ is identified as the measure

\frac{{\rm d}\mu^{(\mathbf{o},\mathbf{y})}}{{\rm d}\mu_{0}}(u)=\frac{1}{Z^{(\mathbf{o},\mathbf{y})}}\exp\left(-\frac{1}{2\gamma^{2}}|G([{\bm{\phi}},u])-\mathbf{o}|^{2}-\frac{1}{2\beta^{2}}|F([{\bm{\phi}},u])-\mathbf{y}|^{2}\right),
\qquad
Z^{(\mathbf{o},\mathbf{y})}:=\mathbb{E}_{u\sim\mu_{0}}\exp\left(-\frac{1}{2\gamma^{2}}|G([{\bm{\phi}},u])-\mathbf{o}|^{2}-\frac{1}{2\beta^{2}}|F([{\bm{\phi}},u])-\mathbf{y}|^{2}\right).

The article [16] showed that the MAP estimators of $\mu^{(\mathbf{o},\mathbf{y})}$ are solutions to

\operatorname*{minimize}_{u\in\mathcal{U}}\ \|u\|^{2}+\frac{1}{\gamma^{2}}|G([{\bm{\phi}},u])-\mathbf{o}|^{2}+\frac{1}{\beta^{2}}|F([{\bm{\phi}},u])-\mathbf{y}|^{2}.

Letting $\beta\to 0$ then yields (12).

\Diamond

3 Solving Nonlinear PDEs

In this section, we present our framework for solution of nonlinear PDEs by conditioning GPs on nonlinear constraints. In Subsection 3.1 we outline our abstract setting as well as our assumptions on PDEs of interest; this leads to Corollary 3.1 which states an analogue of Proposition 2.3 in the PDE setting. We analyze the convergence of our method in Subsection 3.2 and discuss two strategies for dealing with the nonlinear PDE constraints in Subsection 3.3. Next, we present the details pertaining to numerical implementations of our method, including the choice of kernels and a Gauss–Newton algorithm in Subsection 3.4. Finally, we present a set of numerical experiments in Subsection 3.5 that demonstrate the effectiveness of our method in the context of prototypical nonlinear PDEs.

3.1 Problem Setup

Let us consider a bounded domain $\Omega\subseteq\mathbb{R}^{d}$ for $d\geq 1$ and a nonlinear PDE of the form

(15)  \left\{\begin{aligned} \mathcal{P}(u^{\star})(\mathbf{x})&=f(\mathbf{x}), &&\forall\mathbf{x}\in\Omega\,,\\ \mathcal{B}(u^{\star})(\mathbf{x})&=g(\mathbf{x}), &&\forall\mathbf{x}\in\partial\Omega\,.\end{aligned}\right.

Here $\mathcal{P}$ is a nonlinear differential operator and $\mathcal{B}$ is an appropriate boundary operator, with data $f,g$. Throughout this section, and for brevity, we assume that the PDE at hand is well-defined pointwise and has a unique strong solution; extension of our methodology to weak solutions is left as a future research direction. We then consider $\mathcal{U}$ to be an appropriate quadratic Banach space for the solution $u^{\star}$, such as a Sobolev space $H^{s}(\Omega)$ with sufficiently large regularity index $s>0$.

We propose to solve the PDE (15) by approximating $u^{\star}$ by a GP conditioned on satisfying the PDE at a finite set of collocation points in $\overline{\Omega}$, and we proceed to approximate the solution by computing the MAP point of such a conditioned GP. More precisely, let $\{\mathbf{x}_{i}\}_{i=1}^{M}$ be a collection of points in $\overline{\Omega}$, ordered so that $\mathbf{x}_{1},\dots,\mathbf{x}_{M_{\Omega}}\in\Omega$ are in the interior of $\Omega$ while $\mathbf{x}_{M_{\Omega}+1},\dots,\mathbf{x}_{M}\in\partial\Omega$ are on the boundary. Now let $\mathcal{U}$ be a quadratic Banach space with associated covariance operator $\mathcal{K}:\mathcal{U}^{\ast}\to\mathcal{U}$ and consider the optimization problem

(16)  \left\{\begin{aligned} &\operatorname*{minimize}_{u\in\mathcal{U}}\ \|u\| \\ &\ \ {\rm s.t.}\ \ \mathcal{P}(u)(\mathbf{x}_{m})=f(\mathbf{x}_{m}), &&\text{for } m=1,\dots,M_{\Omega}\,,\\ &\phantom{\ \ {\rm s.t.}\ \ } \mathcal{B}(u)(\mathbf{x}_{m})=g(\mathbf{x}_{m}), &&\text{for } m=M_{\Omega}+1,\dots,M\,.\end{aligned}\right.

In other words, we wish to approximate $u^{\star}$ with the minimum-norm element of the Cameron–Martin space of $\mathcal{N}(0,\mathcal{K})$ that satisfies the PDE and boundary data at the collocation points $\{\mathbf{x}_{i}\}_{i=1}^{M}$. In what follows we write $\mathcal{L}(\mathcal{U};\mathcal{H})$ to denote the space of bounded linear operators from $\mathcal{U}$ to another Banach space $\mathcal{H}$. We make the following assumption regarding the operators $\mathcal{P},\mathcal{B}$:

Assumption 3.1.

There exist bounded linear operators $L_{1},\dots,L_{Q}\in\mathcal{L}(\mathcal{U};C(\Omega))$, of which $L_{1},\dots,L_{Q_{b}}\in\mathcal{L}(\mathcal{U};C(\partial\Omega))$ for some $1\leq Q_{b}\leq Q$, and there are maps $P:\mathbb{R}^{Q}\to\mathbb{R}$ and $B:\mathbb{R}^{Q_{b}}\to\mathbb{R}$, which may be nonlinear, so that $\mathcal{P}(u)(\mathbf{x})$ and $\mathcal{B}(u)(\mathbf{x})$ can be written as

(17)  \begin{aligned} \mathcal{P}(u)(\mathbf{x})&=P\big(L_{1}(u)(\mathbf{x}),\dots,L_{Q}(u)(\mathbf{x})\big), &&\forall\mathbf{x}\in\Omega\,,\\ \mathcal{B}(u)(\mathbf{x})&=B\big(L_{1}(u)(\mathbf{x}),\dots,L_{Q_{b}}(u)(\mathbf{x})\big), &&\forall\mathbf{x}\in\partial\Omega\,.\end{aligned}

\Diamond

For prototypical nonlinear PDEs the $L_{q}$, $1\leq q\leq Q$, are linear differential operators such as first- or second-order derivatives, while the maps $P$ and $B$ are often simple algebraic nonlinearities. Furthermore, observe that for ease of presentation we are assuming that fewer linear operators are used to define the boundary conditions than to define the PDE in the interior.

Example NE (Nonlinear Elliptic PDE).

Recall the nonlinear elliptic PDE (1) and consider the linear operators and nonlinear maps

L_{1}:u\mapsto u,\quad L_{2}:u\mapsto\Delta u,\quad P(v_{1},v_{2})=-v_{2}+\tau(v_{1}),\quad B(v_{1})=v_{1}\,,

where we took $Q=2$ and $Q_{b}=1$. Then this equation readily satisfies Assumption 3.1 whenever the solution is sufficiently regular so that $L_{2}(u)$ is well-defined pointwise within $\Omega$. \Diamond

Under Assumption 3.1 we can then define the functionals $\phi_{m}^{(q)}\in\mathcal{U}^{\ast}$ by setting

(18)  \phi_{m}^{(q)}:=\updelta_{\mathbf{x}_{m}}\circ L_{q},\quad\text{where}\quad\left\{\begin{aligned} &1\leq m\leq M, &&\text{if}\quad 1\leq q\leq Q_{b}\,,\\ &1\leq m\leq M_{\Omega}, &&\text{if}\quad Q_{b}+1\leq q\leq Q\,.\end{aligned}\right.

We further use the shorthand notation ${\bm{\phi}}^{(q)}$ to denote the vector of dual elements $\phi^{(q)}_{m}$ for a fixed index $q$. Observe that ${\bm{\phi}}^{(q)}\in(\mathcal{U}^{\ast})^{\otimes M}$ if $q\leq Q_{b}$, while ${\bm{\phi}}^{(q)}\in(\mathcal{U}^{\ast})^{\otimes M_{\Omega}}$ if $q>Q_{b}$. We further write

(19)  N=MQ_{b}+M_{\Omega}(Q-Q_{b})

and define

(20)  {\bm{\phi}}:=({\bm{\phi}}^{(1)},\dots,{\bm{\phi}}^{(Q)})\in(\mathcal{U}^{\ast})^{\otimes N}\,.

To this end, we define the measurement vector $\mathbf{y}\in\mathbb{R}^{M}$ by setting

(21)  y_{m}=\left\{\begin{aligned} &f(\mathbf{x}_{m}), &&\text{if } m\in\{1,\dots,M_{\Omega}\}\,,\\ &g(\mathbf{x}_{m}), &&\text{if } m\in\{M_{\Omega}+1,\dots,M\}\,,\end{aligned}\right.

as well as the nonlinear map

(22)  \big(F([{\bm{\phi}},u])\big)_{m}:=\left\{\begin{aligned} &P([\phi_{m}^{(1)},u],\dots,[\phi_{m}^{(Q)},u]) &&\text{if } m\in\{1,\dots,M_{\Omega}\}\,,\\ &B([\phi_{m}^{(1)},u],\dots,[\phi_{m}^{(Q_{b})},u]) &&\text{if } m\in\{M_{\Omega}+1,\dots,M\}\,.\end{aligned}\right.

We can now rewrite the optimization problem (16) in the same form as (12):

(23) {minimizeu𝒰us.t.F([ϕ,u])=𝐲.\left\{\begin{aligned} &\operatorname*{{\rm minimize}}_{u\in\mathcal{U}}&&\|u\|\\ &{\rm\,s.t.}&&F([{\bm{\phi}},u])=\mathbf{y}.\end{aligned}\right.

Then a direct application of Proposition 2.3 yields the following corollary.

Corollary 3.1.

Suppose Assumption 3.1 holds, 𝒦\mathcal{K} and Θ\Theta are invertible, and define ϕ,F,𝐲{\bm{\phi}},F,\mathbf{y} as above. Then uu^{\dagger} is a minimizer of (16) if and only if u=n=1Nznχnu^{\dagger}=\sum_{n=1}^{N}z^{\dagger}_{n}\chi_{n} where the χn\chi_{n} are the gamblets defined according to (11) and 𝐳\mathbf{z}^{\dagger} is a minimizer of

(24) {minimize𝐳N𝐳TΘ1𝐳s.t.F(𝐳)=𝐲.\left\{\begin{aligned} &\operatorname*{{\rm minimize}}_{\mathbf{z}\in\mathbb{R}^{N}}&&\mathbf{z}^{T}\Theta^{-1}\mathbf{z}\\ &{\rm\,s.t.}&&F(\mathbf{z})=\mathbf{y}.\end{aligned}\right.

The above corollary is the foundation of our numerical algorithms for approximation of the solution uu^{\dagger}, as Θ1\Theta^{-1} and the gamblets χn\chi_{n} can be approximated offline while the coefficients znz^{\dagger}_{n} can be computed by solving the optimization problem (24).
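To fix ideas, the following minimal sketch (an illustration only, with hypothetical names) records the structure of the finite-dimensional quantities entering (24) for the running Example NE: the vector 𝐳\mathbf{z} stacks the values of uu at all MM collocation points followed by the values of Δu\Delta u at the MΩM_{\Omega} interior points, so that N=M+MΩN=M+M_{\Omega}.

\begin{verbatim}
# Sketch of the finite-dimensional quantities entering (24) for Example NE.
# z stacks u(x_m) for all M points, then Laplacian values at the M_Omega
# interior points, so N = M + M_Omega; interior points are listed first.
import jax.numpy as jnp

def F(z, M, M_int, tau=lambda v: v**3):
    """Constraint map (22): interior rows use P, boundary rows use B."""
    z1, z2 = z[:M], z[M:]                  # u values and interior Laplacian values
    interior = -z2 + tau(z1[:M_int])       # P(v1, v2) = -v2 + tau(v1)
    boundary = z1[M_int:]                  # B(v1) = v1
    return jnp.concatenate([interior, boundary])

def y(f_interior, g_boundary):
    """Right-hand side (21): f at interior points, g at boundary points."""
    return jnp.concatenate([f_interior, g_boundary])
\end{verbatim}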

To solve (24) numerically, we will present two different approaches that transform it into an unconstrained optimization problem. Before doing so in Subsection 3.3, we first discuss the convergence theory in the next subsection.

3.2 Convergence Theory

We state and prove a more general version of Theorem 1.3 for our abstract setting of PDEs on Banach spaces with quadratic norms. It states that a minimizer uu^{\dagger} of (16) converges to the true solution uu^{\star} under sufficient regularity assumptions and for appropriate choices of the operator 𝒦\mathcal{K}.

Theorem 3.2.

Consider the PDE (15) and suppose that 𝒰Ct(Ω)Ct(Ω¯)\mathcal{U}\subset\mathcal{H}\subset C^{t}(\Omega)\cap C^{t^{\prime}}(\overline{\Omega}) where \mathcal{H} is a Banach space such that the first inclusion from the left is given by a compact embedding and tt0t\geq t^{\prime}\geq 0 are sufficiently large so that all derivatives appearing in the PDE are defined pointwise for elements of Ct(Ω)Ct(Ω¯)C^{t}(\Omega)\cap C^{t^{\prime}}(\overline{\Omega}). Furthermore assume that the PDE has a unique classical solution u𝒰u^{\star}\in\mathcal{U} and that, as MM\rightarrow\infty,

sup𝐱Ωmin1mMΩ|𝐱𝐱m|0andsup𝐱ΩminMΩ+1mM|𝐱𝐱m|0.\sup_{\mathbf{x}\in\Omega}\>\min_{1\leq m\leq M_{\Omega}}|\mathbf{x}-\mathbf{x}_{m}|\rightarrow 0\quad\text{and}\quad\sup_{\mathbf{x}\in\partial\Omega}\>\min_{M_{\Omega}+1\leq m\leq M}|\mathbf{x}-\mathbf{x}_{m}|\rightarrow 0.

Write uMu^{\dagger}_{M} for a minimizer of (16) with MM distinct collocation points. Then, as MM\rightarrow\infty, the sequence of minimizers uMu^{\dagger}_{M} converges towards uu^{\star} pointwise in Ω\Omega and in \mathcal{H}.

Proof 3.3.

The method of proof is similar to that of Theorem 1.3. Indeed, by the same argument as in the first paragraph of the proof of Theorem 1.3, there exists a subsequence uMpu_{M_{p}} that converges to uu_{\infty}^{\dagger} in \mathcal{H}. This also implies convergence in Ct(Ω)C^{t}(\Omega) and Ct(Ω¯)C^{t^{\prime}}(\overline{\Omega}) due to the assumed continuous embedding of \mathcal{H} into Ct(Ω)Ct(Ω¯)C^{t}(\Omega)\cap C^{t^{\prime}}(\overline{\Omega}). Since tt0t\geq t^{\prime}\geq 0 are sufficiently large so that all derivatives appearing in the PDE are defined pointwise for elements of Ct(Ω)Ct(Ω¯)C^{t}(\Omega)\cap C^{t^{\prime}}(\overline{\Omega}), we get that 𝒫uMp\mathcal{P}u_{M_{p}} converges to 𝒫u\mathcal{P}u^{\dagger}_{\infty} in C(Ω)C(\Omega) and 𝒫uC(Ω)\mathcal{P}u^{\dagger}_{\infty}\in C(\Omega). As Ω¯\overline{\Omega} is compact, uu^{\dagger}_{\infty} is also uniformly continuous in Ω\Omega.

For any 𝐱Ω\mathbf{x}\in\Omega and p1p\geq 1, the triangle inequality shows that

(25) \displaystyle|\mathcal{P}(u^{\dagger}_{\infty})(\mathbf{x})-f(\mathbf{x})| \leq\min_{1\leq m\leq M_{p,\Omega}}\left(|\mathcal{P}(u^{\dagger}_{\infty})(\mathbf{x})-\mathcal{P}(u^{\dagger}_{\infty})(\mathbf{x}_{m})|+|\mathcal{P}(u^{\dagger}_{\infty})(\mathbf{x}_{m})-\mathcal{P}(u_{M_{p}})(\mathbf{x}_{m})|+|f(\mathbf{x}_{m})-f(\mathbf{x})|\right)
\leq\min_{1\leq m\leq M_{p,\Omega}}\left(|\mathcal{P}(u^{\dagger}_{\infty})(\mathbf{x})-\mathcal{P}(u^{\dagger}_{\infty})(\mathbf{x}_{m})|+|f(\mathbf{x}_{m})-f(\mathbf{x})|\right)+\|\mathcal{P}u^{\dagger}_{\infty}-\mathcal{P}u_{M_{p}}\|_{C(\Omega)}\,,

where in the first inequality we have used the fact that 𝒫(uMp)(𝐱m)=f(𝐱m)\mathcal{P}(u_{M_{p}})(\mathbf{x}_{m})=f(\mathbf{x}_{m}). Here Mp,ΩM_{p,\Omega} is the number of interior points associated with the total MpM_{p} collocation points. Taking pp\to\infty, and using the continuity of 𝒫u\mathcal{P}u^{\dagger}_{\infty} and of ff, the density of the interior collocation points in Ω\Omega, and the C(Ω)C(\Omega) convergence of 𝒫uMp\mathcal{P}u_{M_{p}} to 𝒫u\mathcal{P}u^{\dagger}_{\infty}, we derive that 𝒫(u)(𝐱)=f(𝐱)\mathcal{P}(u^{\dagger}_{\infty})(\mathbf{x})=f(\mathbf{x}). In a similar manner we can derive (u)(𝐱)=g(𝐱)\mathcal{B}(u^{\dagger}_{\infty})(\mathbf{x})=g(\mathbf{x}) for 𝐱Ω\mathbf{x}\in\partial\Omega. Thus, the limit uu^{\dagger}_{\infty} is a classical solution to the PDE. By the uniqueness of the solution we must have u=uu^{\dagger}_{\infty}=u^{\star}. Finally, as the limit uu^{\dagger}_{\infty} is independent of the choice of the subsequence, the whole sequence uMu^{\dagger}_{M} must converge to uu^{\star} pointwise and in \mathcal{H}.

We note that while this theorem does not provide a rate for convergence of uu^{\dagger} towards uu^{\star}, it relies on straightforward conditions that are readily verifiable for prototypical PDEs. Typically we choose t,t>0t,t^{\prime}>0 large enough so that the PDE operators 𝒫,\mathcal{P},\mathcal{B} are pointwise defined for the elements of Ct(Ω)Ct(Ω¯)C^{t}(\Omega)\cap C^{t^{\prime}}(\overline{\Omega}) (e.g., t>t> order of PDE +d/2+d/2) and take the space \mathcal{H} to be a Sobolev-type space of appropriate regularity for the inclusion Ct(Ω)Ct(Ω¯)\mathcal{H}\subset C^{t}(\Omega)\cap C^{t^{\prime}}(\overline{\Omega}) to hold; see also the conditions of Theorem 1.3 and the subsequent discussion. The compact embedding 𝒰\mathcal{U}\subset\mathcal{H} can then be ensured by an appropriate choice of the covariance operator 𝒦\mathcal{K} (or the associated kernel KK). However, this choice should result in a sufficiently large space 𝒰\mathcal{U} that includes the solution uu^{\star} of the PDE. Our conditions on the collocation points {𝐱m}m=1M\{\mathbf{x}_{m}\}_{m=1}^{M} simply ensure that these points form a dense subset of Ω¯\overline{\Omega} as MM\to\infty.
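The density condition on the collocation points can be checked in practice by estimating the fill distance sup𝐱Ωmin1mMΩ|𝐱𝐱m|\sup_{\mathbf{x}\in\Omega}\min_{1\leq m\leq M_{\Omega}}|\mathbf{x}-\mathbf{x}_{m}|; the following sketch (an illustration only, for Ω=(0,1)2\Omega=(0,1)^{2}) does so by Monte Carlo.

\begin{verbatim}
# Monte Carlo estimate of the fill distance appearing in Theorem 3.2.
import numpy as np

def fill_distance(points, n_probe=2000, d=2, seed=0):
    rng = np.random.default_rng(seed)
    probes = rng.uniform(size=(n_probe, d))        # dense probe points in Omega
    dists = np.linalg.norm(probes[:, None, :] - points[None, :, :], axis=-1)
    return dists.min(axis=1).max()                 # worst-case nearest-collocation distance

pts = np.random.default_rng(1).uniform(size=(900, 2))   # e.g. M_Omega = 900 interior points
print(fill_distance(pts))                                # shrinks as the points fill Omega
\end{verbatim}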

3.3 Dealing with the Constraints

Now, we turn our attention to the equality constraints in (24) and present two strategies for elimination or relaxation of these constraints; these transform the optimization problem to an unconstrained one. They are crucial preliminary steps before introducing our numerical framework.

3.3.1 Eliminating the Equality Constraints

The equality constraints in (24) can be eliminated under slightly stronger assumptions on the maps P,BP,B. In particular, suppose that the following assumption holds: {assumption} The equations

P(v1,,vQ)=y,B(v1,,vQb)=y,P(v_{1},\dots,v_{Q})=y,\qquad B(v_{1},\dots,v_{Q_{b}})=y\,,

can be solved as finite-dimensional algebraic equations, i.e., there exist \overline{P}:\mathbb{R}^{Q-1}\times\mathbb{R}\to\mathbb{R} and \overline{B}:\mathbb{R}^{Q_{b}-1}\times\mathbb{R}\to\mathbb{R} so that

(26) vj=P¯(v1,,vj1,vj+1,,vQ,y),vk=B¯(v1,,vk1,vk+1,,vQb,y),v_{j}=\overline{P}(v_{1},\dots,v_{j-1},v_{j+1},\dots,v_{Q},y),\qquad v_{k}=\overline{B}(v_{1},\dots,v_{k-1},v_{k+1},\dots,v_{Q_{b}},y)\,,

for selected indices j{1,,Q}j\in\{1,\dots,Q\} and k{1,,Qb}k\in\{1,\dots,Q_{b}\}. Then, for the integer NN defined by (19) and using the solution maps P¯,B¯\overline{P},\overline{B}, we can define a new solution map F¯:NM×MN\overline{F}:\mathbb{R}^{N-M}\times\mathbb{R}^{M}\to\mathbb{R}^{N} so that

F(𝐳)=𝐲 if and only if 𝐳=F¯(𝐰,𝐲), for a unique 𝐰NM.F(\mathbf{z})=\mathbf{y}\quad\text{ if and only if }\quad\mathbf{z}=\overline{F}(\mathbf{w},\mathbf{y}),\quad\text{ for a unique }\mathbf{w}\in\mathbb{R}^{N-M}\,.

\Diamond With this new solution map we can rewrite (24) as an unconstrained optimization problem.

Corollary 3.4.

Let Assumption 3.3.1 and the conditions of Corollary 3.1 hold. Then uu^{\dagger} is a minimizer of (16) if and only if u=n=1Nznχnu^{\dagger}=\sum_{n=1}^{N}z^{\dagger}_{n}\chi_{n} with 𝐳=F¯(𝐰,𝐲)\mathbf{z}^{\dagger}=\overline{F}(\mathbf{w}^{\dagger},\mathbf{y}), where 𝐰NM\mathbf{w}^{\dagger}\in\mathbb{R}^{N-M} is a minimizer of

(27) minimize𝐰NMF¯(𝐰,𝐲)TΘ1F¯(𝐰,𝐲).\operatorname*{{\rm minimize}}_{\mathbf{w}\in\mathbb{R}^{N-M}}\quad\overline{F}(\mathbf{w},\mathbf{y})^{T}\Theta^{-1}\overline{F}(\mathbf{w},\mathbf{y})\,.

Example NE.

Let us recall that we already eliminated the equality constraints in the context of the PDE (1) through the calculations leading to the unconstrained minimization problem (6). In that example, we used the calculation

P(v1,v2)=v2+τ(v1)=yv2=τ(v1)y=P¯(v1,y),P(v_{1},v_{2})=-v_{2}+\tau(v_{1})=y\Leftrightarrow v_{2}=\tau(v_{1})-y=\overline{P}(v_{1},y)\,,

that is, we solved Δu\Delta u in terms of τ(u)\tau(u) and the source term in the interior of the domain in order to eliminate the PDE constraint. We further imposed the boundary conditions exactly since the boundary map BB is simply the pointwise evaluation function in that example.

Alternatively, we could eliminate v1v_{1} by setting v1=τ1(y+v2)v_{1}=\tau^{-1}(y+v_{2}), assuming that τ1\tau^{-1} has a closed form. While both elimination strategies are conceptually valid, they may lead to very different optimization problems. The former corresponds to solving for the values of uu at the collocation points, while the latter solves for the values of Δu\Delta u at the interior points under Dirichlet boundary conditions at the boundary collocation points. \Diamond
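For concreteness, the elimination map F¯\overline{F} of (26) for Example NE, with the values of uu at the interior points as the free variable 𝐰\mathbf{w}, could be implemented as in the following minimal sketch (an illustration only, with hypothetical names).

\begin{verbatim}
# Sketch of the elimination map F_bar for Example NE: given interior values w of u,
# return z = (u at interior, u at boundary, Laplacian of u at interior) with F(z) = y.
import jax.numpy as jnp

def F_bar(w, f_interior, g_boundary, tau=lambda v: v**3):
    u_int = w
    u_bdy = g_boundary                 # boundary condition u = g imposed exactly
    lap_int = tau(w) - f_interior      # interior equation -Lap(u) + tau(u) = f solved for Lap(u)
    return jnp.concatenate([u_int, u_bdy, lap_int])
\end{verbatim}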

3.3.2 Relaxing the Equality Constraints

The choice of the solution maps P¯,B¯\overline{P},\overline{B} in (26), i.e., the choice of the variable for which the equations are solved, has an impact on the conditioning of (27); it is not a priori clear that poor conditioning can always be avoided by the choice of variables to solve for. Moreover, for certain nonlinear PDEs Assumption 3.3.1 may not hold. In such cases it may be useful to relax the equality constraints in (23) and instead consider a loss of the following form:

(28) minimizeu𝒰u2+1β2|F([ϕ,u])𝐲|2,\left.\begin{aligned} &\operatorname*{{\rm minimize}}_{u\in\mathcal{U}}\quad\|u\|^{2}+\frac{1}{\beta^{2}}|F([{\bm{\phi}},u])-\mathbf{y}|^{2}\,,\end{aligned}\right.

where β2>0\beta^{2}>0 is a small positive parameter. Likewise (24) can be relaxed to obtain

(29) minimize𝐳N𝐳TΘ1𝐳+1β2|F(𝐳)𝐲|2.\left.\begin{aligned} &\operatorname*{{\rm minimize}}_{\mathbf{z}\in\mathbb{R}^{N}}\quad\mathbf{z}^{T}\Theta^{-1}\mathbf{z}+\frac{1}{\beta^{2}}|F(\mathbf{z})-\mathbf{y}|^{2}\,.\end{aligned}\right.

Then a similar argument to the proof of Theorem 3.2 can be used to show that a minimizer of the relaxed optimization problem for uu converges to the solution uu^{\star} of the PDE as the number of collocation points MM increases and the parameter β\beta vanishes.

Proposition 3.5.

Fix β>0.\beta>0. Then the optimization problem (28) has minimizer uβ,Mu^{\dagger}_{\beta,M} which (assuming Θ\Theta to be invertible) may be expressed in the form

uβ,M:=n=1Nzβ,nχn𝒰,u^{\dagger}_{\beta,M}:=\sum_{n=1}^{N}z^{\dagger}_{\beta,n}\chi_{n}\in\mathcal{U}\,,

where 𝐳β\mathbf{z}^{\dagger}_{\beta} denotes a minimizer of (29). Under the assumptions of Theorem 3.2 it follows that, as (β,M1)0(\beta,M^{-1})\to 0, the relaxed estimator uβ,Mu^{\dagger}_{\beta,M} converges to uu^{\star} pointwise and in \mathcal{H}.

Proof 3.6.

By the arguments used in Proposition 2.3 the minimizer of (29) delivers a minimizer of (28) in the desired form. Since uu^{\star} satisfies F([ϕ,u])𝐲=0F([{\bm{\phi}},u^{\star}])-\mathbf{y}=0 we must have uβ,Mu\|u^{\dagger}_{\beta,M}\|\leq\|u^{\star}\|. Then a compactness argument similar to that used in the proof of Theorem 3.2, noting that taking β0\beta\to 0 as MM\to\infty delivers exact satisfaction of the constraints in the limit, yields the desired result.

Example NE.

When only part of the constraints F(𝐳)=𝐲F(\mathbf{z})=\mathbf{y} can be explicitly solved, as is often the case for boundary values, we can also combine the elimination and relaxation approaches. Employing the relaxation approach for the interior constraints and the elimination approach for the boundary constraints in (1) amounts to replacing the optimization problem (5), which is the analogue of (24) for our running example, with the following problem for a small parameter β2>0\beta^{2}>0:

(30) {minimize𝐳M+MΩ𝐳TK(ϕ,ϕ)1𝐳+1β2[m=1MΩ|zm(2)+τ(zm(1))f(𝐱m)|2]s.t.zm(1)=g(𝐱m),for m=MΩ+1,,M.\left\{\begin{aligned} \operatorname*{{\rm minimize}}_{\mathbf{z}\in\mathbb{R}^{M+M_{\Omega}}}\quad&\mathbf{z}^{T}K({\bm{\phi}},{\bm{\phi}})^{-1}\mathbf{z}+\frac{1}{\beta^{2}}\left[\sum_{m=1}^{M_{\Omega}}\left|z^{(2)}_{m}+\tau(z^{(1)}_{m})-f(\mathbf{x}_{m})\right|^{2}\right]\\ \text{s.t.}\quad&z_{m}^{(1)}=g(\mathbf{x}_{m}),\quad\text{for }m=M_{\Omega}+1,...,M\,.\end{aligned}\right.

We will numerically compare the above approach with the full elimination approach (6) in Subsection 3.5.1. \Diamond

3.4 Implementation

We now outline the details of a numerical algorithm for the solution of nonlinear PDEs based on the discussion of the previous subsections and in particular Corollary 3.1. We discuss the construction of the matrix Θ\Theta in Subsection 3.4.1, followed by a variant of the Gauss–Newton algorithm in Subsection 3.4.2 for solving the unconstrained or relaxed problems outlined in Subsection 3.3. We also note that strategies for regularizing the matrix Θ\Theta by adding small diagonal (“nugget”) terms are collected in Appendix A.

3.4.1 Constructing Θ\Theta

We established through Corollary 3.1 that a solution to (16) can be completely identified by a minimizer 𝐳\mathbf{z}^{\dagger} of (24) together with the gamblets χn\chi_{n}. Since here we are concerned with the strong form of the PDE (15) it is reasonable to assume that at the very least 𝒰C(Ω¯)\mathcal{U}\subset C(\overline{\Omega}), although we often require higher regularity so that the PDE constraints can be imposed pointwise. This assumption suggests that our GP model for uu^{\star} can equivalently be identified via a covariance kernel function as opposed to the covariance operator 𝒦\mathcal{K}. To this end, given a covariance operator 𝒦\mathcal{K} we define the covariance kernel KK (equivalently the Green’s function of 𝒦1\mathcal{K}^{-1}) as

(31) K:Ω¯×Ω¯,K(𝐱,𝐱):=[δ𝐱,𝒦δ𝐱].K:\overline{\Omega}\times\overline{\Omega}\mapsto\mathbb{R},\qquad K(\mathbf{x},\mathbf{x}^{\prime}):=[\updelta_{\mathbf{x}},\mathcal{K}\updelta_{\mathbf{x}^{\prime}}]\,.

It is known that the kernel KK completely characterizes the GP 𝒩(0,𝒦)\mathcal{N}(0,\mathcal{K}) under mild conditions [82]; that is 𝒩(0,𝒦)𝒢𝒫(0,K).\mathcal{N}(0,\mathcal{K})\equiv\mathcal{GP}(0,K). Let us now consider the matrix Θ\Theta in block form

Θ=[Θ(1,1)Θ(1,2)Θ(1,Q)Θ(2,1)Θ(2,2)Θ(2,Q)Θ(Q,1)Θ(Q,2)Θ(Q,Q)].\Theta=\begin{bmatrix}\Theta^{(1,1)}&\Theta^{(1,2)}&\cdots&\Theta^{(1,Q)}\\ \Theta^{(2,1)}&\Theta^{(2,2)}&\cdots&\Theta^{(2,Q)}\\ \vdots&\vdots&\ddots&\vdots\\ \Theta^{(Q,1)}&\Theta^{(Q,2)}&\cdots&\Theta^{(Q,Q)}\end{bmatrix}\,.

Using the L2(Ω)L^{2}(\Omega) duality pairing between 𝒰\mathcal{U} and 𝒰\mathcal{U}^{\ast} we can identify the blocks

Θ(q,j)=K(ϕ(q),ϕ(j)),\Theta^{(q,j)}=K({\bm{\phi}}^{(q)},{\bm{\phi}}^{(j)})\,,

where we used the shorthand notation of Subsection 1.1 for the kernel matrix, with the ϕ(q){\bm{\phi}}^{(q)} defined as in (18) and the subsequent discussion. Thus the entries of the blocks Θ(q,j)\Theta^{(q,j)} take the form

Θm,i(q,j)=Lq𝐱Lj𝐱K(𝐱,𝐱)|(𝐱,𝐱)=(𝐱m,𝐱i),\Theta^{(q,j)}_{m,i}=L_{q}^{\mathbf{x}}L_{j}^{\mathbf{x}^{\prime}}K(\mathbf{x},\mathbf{x}^{\prime})\big{|}_{(\mathbf{x},\mathbf{x}^{\prime})=(\mathbf{x}_{m},\mathbf{x}_{i})}\,,

where we used the superscripts 𝐱,𝐱\mathbf{x},\mathbf{x}^{\prime} to denote the variables with respect to which the differential operators Lq,LjL_{q},L_{j} act. Note that ΘN×N\Theta\in\mathbb{R}^{N\times N} with N=MQb+MΩ(QQb)N=MQ_{b}+M_{\Omega}(Q-Q_{b}) following the definition of ϕ(q){\bm{\phi}}^{(q)} in Subsection 3.1.
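For Example NE with a Gaussian kernel, the blocks Θ(q,j)\Theta^{(q,j)} can be assembled by differentiating the kernel; the sketch below (an illustration only, not the code used for the experiments of this paper) employs JAX automatic differentiation in place of hand-derived kernel derivatives, with small hypothetical values of MM and MΩM_{\Omega}.

\begin{verbatim}
# Sketch: assembling Theta for Example NE (Q = 2, L1 = identity, L2 = Laplacian)
# with a Gaussian kernel, using JAX autodiff instead of hand-derived derivatives.
import jax
import jax.numpy as jnp

def K(x, xp, sigma=0.2):
    return jnp.exp(-jnp.sum((x - xp) ** 2) / (2 * sigma ** 2))

def lap_x(k):       # Laplacian of k in its first argument
    return lambda x, xp: jnp.trace(jax.hessian(k, argnums=0)(x, xp))

def lap_xp(k):      # Laplacian of k in its second argument
    return lambda x, xp: jnp.trace(jax.hessian(k, argnums=1)(x, xp))

def gram(k, X, Xp):
    return jax.vmap(lambda x: jax.vmap(lambda xp: k(x, xp))(Xp))(X)

X_all = jax.random.uniform(jax.random.PRNGKey(0), (8, 2))    # all M = 8 collocation points
X_int = X_all[:6]                                            # M_Omega = 6 interior points

Theta = jnp.block([
    [gram(K, X_all, X_all),        gram(lap_xp(K), X_all, X_int)],
    [gram(lap_x(K), X_int, X_all), gram(lap_x(lap_xp(K)), X_int, X_int)],
])
print(Theta.shape)                                           # (M + M_Omega, M + M_Omega) = (14, 14)
\end{verbatim}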

3.4.2 A Gauss–Newton Algorithm

Here we outline a variant of the Gauss–Newton algorithm [42, Sec. 10.3] for solution of the unconstrained optimization problem (27). Recall our definition of the maps P¯,B¯\overline{P},\overline{B} in (26) and in turn the map F¯\overline{F}. We then propose to approximate a minimizer 𝐰\mathbf{w}^{\dagger} of (27) with a sequence of elements 𝐰\mathbf{w}^{\ell} defined iteratively via 𝐰+1=𝐰+αδ𝐰,\mathbf{w}^{\ell+1}=\mathbf{w}^{\ell}+\alpha^{\ell}\delta\mathbf{w}^{\ell}, where α>0\alpha^{\ell}>0 is an appropriate step size while δ𝐰\delta\mathbf{w}^{\ell} is the minimizer of the optimization problem

minimizeδ𝐰NM(F¯(𝐰,𝐲)+δ𝐰TF¯(𝐰,𝐲))TΘ1(F¯(𝐰,𝐲)+δ𝐰TF¯(𝐰,𝐲)),\operatorname*{{\rm minimize}}_{\delta\mathbf{w}\in\mathbb{R}^{N-M}}\quad\left(\overline{F}(\mathbf{w}^{\ell},\mathbf{y})+\delta\mathbf{w}^{T}\nabla\overline{F}(\mathbf{w}^{\ell},\mathbf{y})\right)^{T}\Theta^{-1}\left(\overline{F}(\mathbf{w}^{\ell},\mathbf{y})+\delta\mathbf{w}^{T}\nabla\overline{F}(\mathbf{w}^{\ell},\mathbf{y})\right)\,,

and the gradient of F¯\overline{F} is computed with respect to the 𝐰\mathbf{w} variable only. 999Note that our proposed method is nothing more than the standard Gauss–Newton algorithm with Euclidean norm |||\cdot| defining the least-squares functional replaced with the weighted norm |Θ1/2||\Theta^{-1/2}\cdot| [42, Sec. 10.3].

This approach can also be applied to solve the relaxed problem (29), where this time we consider the sequence of approximations 𝐳+1=𝐳+αδ𝐳,\mathbf{z}^{\ell+1}=\mathbf{z}^{\ell}+\alpha^{\ell}\delta\mathbf{z}^{\ell}, where δ𝐳\delta\mathbf{z}^{\ell} is the minimizer of

minimizeδ𝐳N(𝐳+δ𝐳)TΘ1(𝐳+δ𝐳)+1β2|F(𝐳)+δ𝐳TF(𝐳)𝐲|2.\operatorname*{{\rm minimize}}_{\delta\mathbf{z}\in\mathbb{R}^{N}}\quad\left(\mathbf{z}^{\ell}+\delta\mathbf{z}\right)^{T}\Theta^{-1}\left(\mathbf{z}^{\ell}+\delta\mathbf{z}\right)+\frac{1}{\beta^{2}}\left|F(\mathbf{z}^{\ell})+\delta\mathbf{z}^{T}\nabla F(\mathbf{z}^{\ell})-\mathbf{y}\right|^{2}\,.

Since the two optimization problems displayed above are both quadratic in δ𝐰\delta\mathbf{w} and δ𝐳\delta\mathbf{z}, respectively, they can be solved exactly and efficiently at each step, and the step-size parameters α\alpha^{\ell} can be fixed or computed adaptively using standard step-size selection techniques [42]. However, in our experiments in Subsection 3.5, we find that both algorithms converge quickly simply by setting α=1\alpha^{\ell}=1.
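For the relaxed problem (29), each Gauss–Newton step admits a convenient closed form: with J=F(𝐳)J=\nabla F(\mathbf{z}^{\ell}) and α=1\alpha^{\ell}=1, the quadratic subproblem is minimized at 𝐳+1=ΘJT(β2I+JΘJT)1(𝐲F(𝐳)+J𝐳)\mathbf{z}^{\ell+1}=\Theta J^{T}(\beta^{2}I+J\Theta J^{T})^{-1}(\mathbf{y}-F(\mathbf{z}^{\ell})+J\mathbf{z}^{\ell}), which avoids forming Θ1\Theta^{-1} explicitly. The following minimal sketch (an illustration only) implements this update.

\begin{verbatim}
# One Gauss-Newton step for the relaxed problem (29), with step size 1.
import jax
import jax.numpy as jnp

def gauss_newton_step(F, z, y, Theta, beta):
    J = jax.jacobian(F)(z)                           # Jacobian of F at the current iterate
    r = y - F(z) + J @ z                             # linearized residual target
    A = beta**2 * jnp.eye(y.size) + J @ Theta @ J.T
    return Theta @ J.T @ jnp.linalg.solve(A, r)      # next iterate z^{l+1}
\end{verbatim}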

Example NE.

Let us return once more to the nonlinear elliptic PDE considered in Subsection 1.1. Observe that (6) is precisely in the form of (27) and so in order to formulate our Gauss–Newton iterations we need to linearize the vector valued function

𝐰(𝐰,g(𝐱Ω),f(𝐱Ω)τ(𝐰)),\mathbf{w}\mapsto\big{(}\mathbf{w},g(\mathbf{x}_{\partial\Omega}),f(\mathbf{x}_{\Omega})-\tau(\mathbf{w})\big{)}\,,

which can easily be achieved by linearizing τ\tau. To this end, we solve (27) via the iteration 𝐰+1=𝐰+αδ𝐰\mathbf{w}^{\ell+1}=\mathbf{w}^{\ell}+\alpha^{\ell}\delta\mathbf{w}^{\ell} where δ𝐰\delta\mathbf{w}^{\ell} is the minimizer of the functional

(𝐰+δ𝐰,g(𝐱Ω),f(𝐱Ω)τ(𝐰)δ𝐰Tτ(𝐰))K(ϕ,ϕ)1(𝐰+δ𝐰g(𝐱Ω)f(𝐱Ω)τ(𝐰)δ𝐰Tτ(𝐰)).\big{(}\mathbf{w}^{\ell}+\delta\mathbf{w},g(\mathbf{x}_{\partial\Omega}),f(\mathbf{x}_{\Omega})-\tau(\mathbf{w}^{\ell})-\delta\mathbf{w}^{T}\nabla\tau(\mathbf{w}^{\ell})\big{)}K({\bm{\phi}},{\bm{\phi}})^{-1}\begin{pmatrix}\mathbf{w}^{\ell}+\delta\mathbf{w}\\ g(\mathbf{x}_{\partial\Omega})\\ f(\mathbf{x}_{\Omega})-\tau(\mathbf{w}^{\ell})-\delta\mathbf{w}^{T}\nabla\tau(\mathbf{w}^{\ell})\end{pmatrix}\,.

We also note that the sequence of approximations obtained by the above implementation of Gauss–Newton coincides with successive kernel collocation approximations of the solution of the following particular linearization of the PDE,

(32) Δu+uτ(un)=fτ(un)+unτ(un),-\Delta u+u\tau^{\prime}(u^{n})=f-\tau(u^{n})+u^{n}\tau^{\prime}(u^{n})\,,

subject to the Dirichlet boundary conditions. \Diamond

3.4.3 Computational Bottlenecks

The primary computational cost of our method lies in the approximation of the matrix Θ1\Theta^{-1}. Efficient factorizations and approximations of Θ1\Theta^{-1} have been studied extensively in the GP regression literature [55] as well as spatial statistics, Kriging and numerical analysis (see [65, 66] and the discussions within). In this paper, we do not employ these algorithms and choose instead to use standard 𝒪(N3)\mathcal{O}(N^{3}) algorithms to factorize Θ\Theta.

The algorithm introduced in [65] is particularly interesting as it directly approximates the Cholesky factors of Θ1\Theta^{-1} by querying a subset of the entries of Θ\Theta; extending it to our setting could alleviate the need for a small diagonal regularization (“nugget”) term altogether. This algorithm is based on an explicit formula for computing approximate Cholesky factors LL that minimize the Kullback-Leibler divergence between 𝒩(0,Θ1)\mathcal{N}(0,\Theta^{-1}) and 𝒩(0,LLT)\mathcal{N}(0,LL^{T}) given a sparsity constraint on the entries of LL; the formula is equivalent to the Vecchia approximation [83], which is popular in geostatistics. The resulting algorithm computes ϵ\epsilon-approximate Cholesky factors of Θ1\Theta^{-1} in 𝒪(Nlog2d(N/ϵ))\mathcal{O}(N\log^{2d}(N/\epsilon)) complexity by accessing 𝒪(Nlogd(N/ϵ))\mathcal{O}(N\log^{d}(N/\epsilon)) entries of Θ\Theta.

Another possible bottleneck is the computation of the gamblets χn\chi_{n}. The articles [47, 66] show that the gamblets can be approximated with compactly supported functions in complexity 𝒪(Nlog2d+1(N/ϵ))\mathcal{O}(N\log^{2d+1}(N/\epsilon)). We also note that the complexity-vs-accuracy guarantees of [47, 65, 66] have only been established for functionals ϕn\phi_{n} that are Dirac delta functions and kernels KK that are the Green’s functions of arbitrary elliptic differential operators (mapping Hs(Ω)H^{s}(\Omega) to Hs(Ω)H^{-s}(\Omega)). Extension of those results to functionals ϕn\phi_{n} considered here is an interesting future direction.

3.5 Numerical Experiments for Nonlinear PDEs

In this subsection, we implement our algorithm to solve several nonlinear PDEs, including the nonlinear elliptic equation in Subsection 3.5.1, Burgers’ equation in Subsection 3.5.2 and the regularized Eikonal equation in Subsection 3.5.3. For all of these equations, we will start with a fixed MM and demonstrate the performance of our algorithm by showing the pattern of collocation points, the loss function history of the Gauss–Newton iteration, and contours of the solution errors. Then, we vary the value of MM and study how the errors change with respect to MM. We also compare the elimination and relaxation approaches for dealing with the nonlinear constraints.

All the experiments are conducted using Python with the JAX package for automatic differentiation101010We use JAX for convenience and all derivatives in our methodology can be computed using standard techniques such as symbolic computation or adjoint methods.. In particular, we use automatic differentiation to form the kernel matrix Θ\Theta that involves derivatives of the kernel function, and to optimize the loss function via the Gauss–Newton method. Details on the choice of small diagonal regularization (“nugget”) terms for these experiments are presented in Appendices A.2 through A.4.

Remark 3.7.

In all of the numerical experiments in this section we used a set of collocation points that are drawn randomly from the uniform distribution over the domain Ω\Omega, as opposed to the deterministic uniform grid used in Subsection 1.1.4. The choice of the random collocation points was made to highlight the flexibility of our methodology. Furthermore, random collocation points are often used in other machine learning algorithms for solution of PDEs such as PINNs [58] and so adopting this approach allows direct comparison with such methods. We observed empirically that the random grids had similar accuracy to the deterministic uniform grid in all experiments except for Burgers’ equation in Subsection 3.5.2, where random collocation points outperformed the uniform grid. Understanding this surprising performance gap is an interesting problem related to active learning and the acquisition of collocation points; we do not address this issue here. \Diamond

Remark 3.8.

Float64 data type was employed in the experiments below. This allows the use of small diagonal regularization (“nugget”) terms (see Appendix A for details) which do not affect accuracy in the computations described in this paper. In contrast, if Float32 data type (the default setting in JAX) is used, we found the need to regularize Θ\Theta with larger diagonal terms, leading to an observable accuracy floor. \Diamond

3.5.1 A Nonlinear Elliptic PDE

We revisit the nonlinear elliptic equation in (1). As in Subsection 1.1.4, we take d=2d=2, Ω=(0,1)2\Omega=(0,1)^{2} and τ(u)=u3\tau(u)=u^{3} together with homogeneous Dirichlet boundary conditions g(𝐱)=0g(\mathbf{x})=0. The true solution is prescribed to be u(𝐱)=sin(πx1)sin(πx2)+4sin(4πx1)sin(4πx2)u^{\star}(\mathbf{x})=\sin(\pi x_{1})\sin(\pi x_{2})+4\sin(4\pi x_{1})\sin(4\pi x_{2}) and the corresponding right hand side f(𝐱)f(\mathbf{x}) is computed using the equation. We choose the Gaussian kernel K(𝐱,𝐲;σ)=exp(|𝐱𝐲|22σ2)K(\mathbf{x},\mathbf{y};\sigma)=\exp(-\frac{|\mathbf{x}-\mathbf{y}|^{2}}{2\sigma^{2}}) with a lengthscale parameter σ\sigma.
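For reference, the manufactured solution, the induced right hand side, and the kernel used here can be set up as in the following minimal sketch (an illustration only), where the right hand side is obtained by automatic differentiation rather than by hand.

\begin{verbatim}
# Sketch of the setup of this experiment: manufactured solution u*, right-hand
# side f = -Lap(u*) + (u*)^3 computed by autodiff, and the Gaussian kernel.
import jax
import jax.numpy as jnp

def u_star(x):
    return (jnp.sin(jnp.pi * x[0]) * jnp.sin(jnp.pi * x[1])
            + 4.0 * jnp.sin(4 * jnp.pi * x[0]) * jnp.sin(4 * jnp.pi * x[1]))

def f(x):
    lap = jnp.trace(jax.hessian(u_star)(x))
    return -lap + u_star(x) ** 3

def K(x, xp, sigma):
    return jnp.exp(-jnp.sum((x - xp) ** 2) / (2 * sigma ** 2))
\end{verbatim}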

Figure 2: Numerical results for the nonlinear elliptic PDE (1): (a) a sample of collocation points and contours of the true solution; (b) convergence history of the Gauss–Newton algorithm; (c) contours of the solution error. An adaptive nugget term with global parameter η=1013\eta=10^{-13} was employed (see Appendix A.2)

First, for M=1024M=1024 and MΩ=900M_{\Omega}=900, we randomly sample collocation points in Ω\Omega as shown in Figure 2(a). We further show an instance of the convergence history of the Gauss–Newton algorithm in Figure 2(b) where we solved the unconstrained optimization problem (6) after eliminating the equality constraints. We used kernel parameter σ=M1/4\sigma=M^{-1/4} and appropriate nugget terms as outlined in Appendix A.2. We initiated the algorithm with a Gaussian random initial guess. We observe that only 3 steps sufficed for convergence. In Figure 2(c), we show the contours of the solution error. The error in the approximate solution is seen to be fairly uniform spatially, with larger errors observed near the boundary, when M=1024.M=1024. We note that the main difference between these experiments and those in Subsection 1.1.4 is that here we used randomly distributed collocation points while a uniform grid was used previously.

Next, we compare the two approaches for dealing with the PDE constraints outlined in Subsection 3.3. We applied both the elimination and relaxation approaches, defined by the optimization problems (27) and (29) respectively, for different choices of MM. In the relaxation approach, we set β2=1010\beta^{2}=10^{-10}. Here we set M=300,600,1200,2400M=300,600,1200,2400 and MΩ=0.9×MM_{\Omega}=0.9\times M. The L2L^{2} and LL^{\infty} errors of the converged Gauss–Newton solutions are shown in Table 1. Results were averaged over 10 realizations of the random collocation points. From the table we observe that the difference in solution errors was very mild and both methods converged as MM increased. We note that in the relaxed setting, convergence is closely tied to our choice of β\beta, and choosing an inadequate value, i.e. too small or too large, can lead to inaccurate solutions. In terms of computational cost, the elimination approach takes 2-3 steps of Gauss–Newton iterations on average, while the relaxation approach needs 5-8 steps. Thus, while the elimination strategy appears to be more efficient, we do not observe a significant difference in the order of complexity of the two methods for dealing with the constraints, especially when the number of collocation points becomes large.

MM 300 600 1200 2400
Elimination: L2L^{2} error 4.84e-02 6.20e-05 2.74e-06 2.83e-07
Elimination: LL^{\infty} error 3.78e-01 9.71e-04 4.56e-05 5.08e-06
Relaxation: L2L^{2} error 1.15e-01 1.15e-04 1.87e-06 1.68e-07
Relaxation: LL^{\infty} error 1.21e+00 1.45e-03 3.38e-05 1.84e-06
Table 1: Comparison between the elimination and relaxation approaches to deal with the equality constraints for the nonlinear elliptic PDE (1). Uniformly random collocation points were sampled with different MM and MΩ=0.9MM_{\Omega}=0.9M. Adaptive nugget terms were employed with the global nugget parameter η=1012\eta=10^{-12} (see Appendix A.2). The lengthscale parameter σ=0.2\sigma=0.2. Results were averaged over 10 realizations of the random collocation points. The maximum Gauss-Newton iteration was 10.

3.5.2 Burgers’ Equation

We consider numerical solution of the viscous Burgers equation:

(33) tu+usuνs2u\displaystyle\partial_{t}u+u\partial_{s}u-\nu\partial_{s}^{2}u =0,(s,t)(1,1)×(0,1],\displaystyle=0,\quad\forall(s,t)\in(-1,1)\times(0,1]\,,
u(s,0)\displaystyle u(s,0) =sin(πs),\displaystyle=-\sin(\pi s)\,,
u(1,t)\displaystyle u(-1,t) =u(1,t)=0.\displaystyle=u(1,t)=0\,.

We adopt an approach in which we solve the problem by conditioning a Gaussian process in space-time111111It would also be possible to look at an incremental in time approach, for example using backward Euler discretization, in which one iteratively in time solves a nonlinear elliptic two point boundary value problem by conditioning a spatial Gaussian process; we do not pursue this here and leave it as a future direction.. In our experiments we take ν=0.02\nu=0.02 and consider 𝐱=(s,t)\mathbf{x}=(s,t). We write this PDE in the form of (17) with Q=4Q=4 and Qb=1Q_{b}=1 with linear operators L1(u)=u,L2(u)=tu,L3(u)=su,L4(u)=s2uL_{1}(u)=u,L_{2}(u)=\partial_{t}u,L_{3}(u)=\partial_{s}u,L_{4}(u)=\partial_{s}^{2}u and the nonlinear map P(v1,v2,v3,v4)=v2+v1v3νv4P(v_{1},v_{2},v_{3},v_{4})=v_{2}+v_{1}v_{3}-\nu v_{4}. The boundary part is simply B(v1)=v1B(v_{1})=v_{1}. We then eliminate the equality constraints in our optimization framework following the approach of Subsection 3.3.1 using the equation v2=νv4v1v3v_{2}=\nu v_{4}-v_{1}v_{3}.
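In code, the interior nonlinearity and the elimination of v2v_{2} take the following form (a minimal sketch, an illustration only).

\begin{verbatim}
# Sketch of the Burgers' maps: v = (u, du/dt, du/ds, d2u/ds2).
def P(v1, v2, v3, v4, nu=0.02):
    return v2 + v1 * v3 - nu * v4        # interior residual of (33)

def eliminate_v2(v1, v3, v4, nu=0.02):
    return nu * v4 - v1 * v3             # solve P(v1, v2, v3, v4) = 0 for v2
\end{verbatim}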

We randomly sampled M=2400M=2400 with MΩ=2000M_{\Omega}=2000 points in the computational domain Ω=[1,1]×[0,1]\Omega=[-1,1]\times[0,1] see Figure 3(a), where we also plot contours of the true solution uu. The Gauss–Newton algorithm was then applied to solve the unconstrained optimization problem. We computed the true solution from the Cole–Hopf transformation, together with the numerical quadrature. Since the time and space variability of the solution to Burgers’ equation are significantly different, we chose an anisotropic kernel 121212One can also design/learn a non-stationary kernel using the approaches discussed in Subsection 5.3. However, the parameterization of such kernels and strategies for tuning their hyperparameters are outside the scope of this article.

K((s,t),(s,t);σ)=exp(σ12(ss)2σ22(tt)2)K\Bigl{(}(s,t),(s^{\prime},t^{\prime});\sigma\Bigr{)}=\exp\Bigl{(}-\sigma_{1}^{-2}(s-s^{\prime})^{2}-\sigma_{2}^{-2}(t-t^{\prime})^{2}\Bigr{)}\,

with σ=(1/20,1/3)\sigma=(1/20,1/3) together with an adaptive diagonal regularization (“nugget”) as outlined in Appendix A.3.

We plot the Gauss–Newton iteration history in Figure 3(b) and observe that 10 steps sufficed for convergence. We compare the converged solution to the true solution and present the contours of the error in Figure 3(c). The maximum errors occurred close to the (viscous) shock at time 1, as expected. In Figure 3(d–f), we also compare various time slices of the numerical and true solutions at times t=0.2,0.5,0.8t=0.2,0.5,0.8 to further highlight the ability of our method to capture the location and shape of the shock.

Figure 3: Numerical results for Burgers equation (33): (a) an instance of uniformly sampled collocation points in space-time over contours of the true solution; (b) Gauss–Newton iteration history; (c) contours of the pointwise error of the numerical solution; (d–f) time slices of the numerical and true solutions at t=0.2,0.5,0.8t=0.2,0.5,0.8. An adaptive nugget term with global parameter η=1010\eta=10^{-10} was employed (see Appendix A.3).

Next, we studied the convergence properties of our method as a function of MM as shown in Table 2. Here, we varied MM with a fixed ratio of interior points, MΩ/M=5/6M_{\Omega}/M=5/6. For each experiment we ran 10 steps of Gauss–Newton starting from a Gaussian random initial guess. Results were averaged over 10 realizations of the random collocation points. From the table, we observe that the error decreases rapidly as MM increases, indicating the convergence of our proposed algorithm.

Finally, we note that the accuracy of our method is closely tied to the choice of the viscosity parameter ν\nu and choosing a smaller value of ν\nu, which in turn results in a sharper shock, can significantly reduce our accuracy. This phenomenon is not surprising since a sharper shock corresponds to the presence of shorter length and time scales in the solution; these in turn, require a more careful choice of the kernel, as well as suggesting the need to carefully choose the collocation points.

MM 600 1200 2400 4800
L2L^{2} error 1.75e-02 7.90e-03 8.65e-04 9.76e-05
LL^{\infty} error 6.61e-01 6.39e-02 5.50e-03 7.36e-04
Table 2: Space-time L2L^{2} and LL^{\infty} solution errors for the Burgers’ equation (33) for different choices of MM, with kernel parameters σ=(1/20,1/3)\sigma=(1/20,1/3) and global nugget parameter η=105\eta=10^{-5} if M1200M\leq 1200 and η=1010\eta=10^{-10} otherwise (see Appendix A.3). Results were averaged over 10 realizations of the random collocation points. The maximum number of Gauss–Newton iterations was 30.

3.5.3 Eikonal PDE

We now consider the regularized Eikonal equation in Ω=[0,1]2\Omega=[0,1]^{2}:

(34) {|u(𝐱)|2=f(𝐱)2+ϵΔu(𝐱),𝐱Ω,u(𝐱)=0,𝐱Ω,\left\{\begin{aligned} |\nabla u(\mathbf{x})|^{2}&=f(\mathbf{x})^{2}+\epsilon\Delta u(\mathbf{x}),&&\forall\mathbf{x}\in\Omega\,,\\ u(\mathbf{x})&=0,&&\forall\mathbf{x}\in\partial\Omega\,,\end{aligned}\right.

with f=1f=1 and ϵ=0.1\epsilon=0.1. We write this PDE in the form of (17) with Q=4Q=4 and Qb=1Q_{b}=1 and linear operators L1(u)=u,L2(u)=x1u,L3(u)=x2u,L4(u)=ΔuL_{1}(u)=u,L_{2}(u)=\partial_{x_{1}}u,L_{3}(u)=\partial_{x_{2}}u,L_{4}(u)=\Delta u and nonlinear map P(v1,v2,v3,v4)=v22+v32ϵv4P(v_{1},v_{2},v_{3},v_{4})=v_{2}^{2}+v_{3}^{2}-\epsilon v_{4} in the interior of Ω\Omega and define the boundary operator identically to Subsection 3.5.2. We further eliminate the nonlinear constraints, as outlined in Subsection 3.3.1, by solving for v4v_{4} in terms of v2,v3v_{2},v_{3}. To obtain a “true” solution, for the purpose of estimating errors, we employ the transformation u=ϵlogvu=-\epsilon\log v, which leads to the linear PDE fvϵ2Δv=0fv-\epsilon^{2}\Delta v=0; we solve this with a highly resolved finite-difference (FD) method, using 2000 uniform grid points in each dimension of the domain, the finest mesh that our hardware could handle.
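A coarse version of this reference computation can be reproduced with a standard five-point finite-difference discretization of the transformed linear problem; the sketch below is an illustration only (not the solver used for the reported errors) and uses the boundary value v=exp(u/ϵ)=1v=\exp(-u/\epsilon)=1 since u=0u=0 on Ω\partial\Omega.

\begin{verbatim}
# Sketch: finite-difference solve of v - eps^2 Lap(v) = 0 in (0,1)^2 with v = 1
# on the boundary, followed by the map back to u = -eps*log(v); here f = 1.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def eikonal_fd(n=200, eps=0.1):
    h = 1.0 / (n + 1)
    lap1d = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
    I = sp.identity(n)
    lap2d = sp.kron(I, lap1d) + sp.kron(lap1d, I)     # interior 5-point Laplacian
    A = sp.identity(n * n) - eps**2 * lap2d
    # known boundary values v = 1 enter the stencil at near-boundary nodes
    rhs = np.zeros((n, n))
    rhs[0, :] += eps**2 / h**2; rhs[-1, :] += eps**2 / h**2
    rhs[:, 0] += eps**2 / h**2; rhs[:, -1] += eps**2 / h**2
    v = spla.spsolve(sp.csr_matrix(A), rhs.ravel())
    return -eps * np.log(v).reshape(n, n)             # interior values of u
\end{verbatim}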

As before, we began with M=2400M=2400 collocation points with MΩ=2160M_{\Omega}=2160 interior points. An instance of these collocation points along with contours of the true solution are shown in Figure 4(a). We employed a nugget term as outlined in Appendix A.4 and used the Gaussian kernel, as in Subsection 3.5.1 with σ=M1/4\sigma=M^{-1/4}. Finally we used the Gauss–Newton algorithm to find the minimizer. We show the convergence history of Gauss–Newton in Figure 4(b), observing that six iterations were sufficient for convergence. In Figure 4(c) we show the error contours of the obtained numerical approximation, which appeared to be qualitatively different to Figure 2(c) in that the errors were larger in the middle of the domain as well as close to the boundary.

Figure 4: Numerical results for the regularized Eikonal equation (34): (a) an instance of uniformly sampled collocation points over contours of the true solution; (b) Gauss–Newton iteration history; (c) contour of the solution error. An adaptive nugget term with η=1010\eta=10^{-10} was used (see Appendix A.4).

Next we performed a convergence study by varying MM and computing L2L^{2} and LL^{\infty} errors as reported in Table 3 by choosing the lengthscale parameter of the kernel σ=M1/4\sigma=M^{-1/4}. We used the same nugget terms as in the Burgers’ equation (see Appendix A.4). Results were averaged over 10 realizations of the random collocation points. Once again we observe clear improvements in accuracy as the number of collocation points increases.

MM 300 600 1200 2400
L2L^{2} error 1.01e-01 1.64e-02 2.27e-04 7.78e-05
LL^{\infty} error 3.59e-01 7.76e-02 3.22e-03 1.61e-03
Table 3: Numerical results for the regularized Eikonal equation (34). Uniformly random collocation points were sampled with different MM and with fixed ratio MΩ=0.9MM_{\Omega}=0.9M. An adaptive nugget term was used with global nugget parameter η=105\eta=10^{-5} if M1200M\leq 1200 and η=1010\eta=10^{-10} otherwise (see Appendix A.4), together with a Gaussian kernel with lengthscale parameter σ=M1/4\sigma=M^{-1/4}. Results were averaged over 10 realizations of the random collocation points. The maximum Gauss-Newton iteration was 20.

4 Solving Inverse Problems

We now turn our attention to the solution of IPs and show that the methodology of Subsections 3.1–3.4 can readily be extended to solve such problems with small modifications. We describe the abstract setting of our IPs in Subsection 4.1, leading to Corollary 4.3 which is analogous to Proposition 2.3 and Corollary 3.1 in the setting of IPs. Subsection 4.2 outlines our approach for dealing with PDE constraints in IPs and highlights the differences in this setting in comparison to the PDE setting described in Subsection 3.3. Subsection 4.3 further highlights the implementation differences between the PDE and IP settings while Subsection 4.4 presents a numerical experiment concerning an IP in subsurface flow governed by the Darcy flow PDE.

4.1 Problem Setup

Consider our usual setting of a nonlinear parametric PDE in strong form

(35) {𝒫(u;a)(𝐱)=f(𝐱),𝐱Ω,(u;a)(𝐱)=g(𝐱),𝐱Ω.\left\{\begin{aligned} \mathcal{P}(u^{\star};a^{\star})(\mathbf{x})=f(\mathbf{x}),&\qquad\forall\mathbf{x}\in\Omega\,,\\ \mathcal{B}(u^{\star};a^{\star})(\mathbf{x})=g(\mathbf{x}),&\qquad\forall\mathbf{x}\in\partial\Omega\,.\end{aligned}\right.

As before we assume the solution uu^{\star} belongs to a quadratic Banach space 𝒰\mathcal{U} while aa^{\star} is a parameter belonging to another quadratic Banach space 𝒜\mathcal{A}. Our goal in this subsection is to identify the parameter aa^{\star} from limited observations of the solution uu^{\star}. To this end, fix ψ1,,ψI𝒰\psi_{1},\dots,\psi_{I}\in\mathcal{U}^{\ast} and define

(36) 𝝍:=(ψ1,,ψI)(𝒰)I,{\bm{\psi}}:=(\psi_{1},\dots,\psi_{I})\in(\mathcal{U}^{\ast})^{\otimes I}\,,

then our goal is to recover aa^{\star} given the noisy observations

(37) 𝐨=[𝝍,u]+ϵ,ϵ𝒩(0,γ2I).\mathbf{o}=[{\bm{\psi}},u^{\star}]+\bm{\epsilon},\qquad\bm{\epsilon}\sim\mathcal{N}(0,\gamma^{2}I)\,.

We propose to solve this inverse problem by modelling both uu^{\star} and aa^{\star} with canonical GPs on the spaces 𝒰,𝒜\mathcal{U},\mathcal{A} with invertible covariance operators 𝒦:𝒰𝒰\mathcal{K}:\mathcal{U}^{\ast}\to\mathcal{U} and 𝒦~:𝒜𝒜\widetilde{\mathcal{K}}:\mathcal{A}^{\ast}\to\mathcal{A} respectively. We then condition these GPs to satisfy the PDE on collocation points 𝐱1,,𝐱MΩ¯\mathbf{x}_{1},\dots,\mathbf{x}_{M}\in\overline{\Omega} as before and propose to approximate u,au^{\star},a^{\star} simultaneously via the optimization problem:

(38) {minimize(u,a)𝒰×𝒜u𝒰2+a𝒜2+1γ2|[𝝍,u]𝐨|2s.t.𝒫(u;a)(𝐱m)=f(𝐱m),for m=1,,MΩ,(u;a)(𝐱m)=g(𝐱m),for m=MΩ+1,,M,\left\{\begin{aligned} &\operatorname*{{\rm minimize}}_{(u,a)\in\mathcal{U}\times\mathcal{A}}&&\|u\|_{\mathcal{U}}^{2}+\|a\|_{\mathcal{A}}^{2}+\frac{1}{\gamma^{2}}\big{|}[{\bm{\psi}},u]-\mathbf{o}\big{|}^{2}&&\\ &{\rm\,s.t.}&&\mathcal{P}(u;a)(\mathbf{x}_{m})=f(\mathbf{x}_{m}),\qquad&&\text{for }m=1,\dots,M_{\Omega}\,,\\ &&&\mathcal{B}(u;a)(\mathbf{x}_{m})=g(\mathbf{x}_{m}),\qquad&&\text{for }m=M_{\Omega}+1,\dots,M\,,\\ \end{aligned}\right.

where we used subscripts to distinguish the quadratic norms on the spaces 𝒰\mathcal{U} and 𝒜\mathcal{A}.

Remark 4.1.

In light of Remark 2.5 we observe that (38) corresponds to imposing a prior measure on u,au,a which assumes they are a priori independent. It is straightforward to introduce correlations between the solution uu and the parameter aa by defining the prior measure directly on the product space 𝒰×𝒜\mathcal{U}\times\mathcal{A}. This perspective will then lead to an analogous optimization problem to (38) with the same constraints but with the functional

(u,a)𝒰×𝒜2+1γ2|[𝝍,u]𝐨|2,\|(u,a)\|_{\mathcal{U}\times\mathcal{A}}^{2}+\frac{1}{\gamma^{2}}|[{\bm{\psi}},u]-\mathbf{o}|^{2},

where we used 𝒰×𝒜\|\cdot\|_{\mathcal{U}\times\mathcal{A}} to denote the RKHS norm of the GP associated with 𝒰×𝒜\mathcal{U}\times\mathcal{A}. \Diamond

Remark 4.2.

We also note that the Bayesian interpretation of (38) can be viewed as an extension of gradient matching [9, 38] from ODEs to PDEs. Indeed, gradient matching simultaneously approximates the unknown parameters and the solution of an ODE system using a joint GP prior and imposes the ODE as a constraint at finitely many time steps. \Diamond

We make analogous assumptions on the form of the operators 𝒫,\mathcal{P},\mathcal{B} as in Assumption 3.1 but this time also involving the parameters aa:

{assumption}

There exist bounded and linear operators L1,,LQ(𝒰;C(Ω))L_{1},\dots,L_{Q}\in\mathcal{L}(\mathcal{U};C(\Omega)) in which L1,,LQb(𝒰;C(Ω))L_{1},\dots,L_{Q_{b}}\in\mathcal{L}(\mathcal{U};C(\partial\Omega)) for some 1QbQ1\leq Q_{b}\leq Q, and L~1,,L~J(𝒜;C(Ω¯))\widetilde{L}_{1},\dots,\widetilde{L}_{J}\in\mathcal{L}(\mathcal{A};C(\overline{\Omega})) together with maps P:Q+JP:\mathbb{R}^{Q+J}\to\mathbb{R} and B:Qb+JB:\mathbb{R}^{Q_{b}+J}\to\mathbb{R}, which may be nonlinear, so that 𝒫(u;a)(𝐱)\mathcal{P}(u;a)(\mathbf{x}) and (u;a)(𝐱)\mathcal{B}(u;a)(\mathbf{x}) can be written as

(39) 𝒫(u;a)(𝐱)\displaystyle\mathcal{P}(u;a)(\mathbf{x}) =P(L1(u)(𝐱),,LQ(u)(𝐱);L~1(a)(𝐱),L~J(a)(𝐱)),\displaystyle=P\big{(}L_{1}(u)(\mathbf{x}),\dots,L_{Q}(u)(\mathbf{x});\widetilde{L}_{1}(a)(\mathbf{x}),\dots\widetilde{L}_{J}(a)(\mathbf{x})\big{)}, 𝐱Ω,\displaystyle\forall\mathbf{x}\in\Omega\,,
(u;a)(𝐱)\displaystyle\mathcal{B}(u;a)(\mathbf{x}) =B(L1(u)(𝐱),,LQb(u)(𝐱);L~1(a)(𝐱),L~J(a)(𝐱)),\displaystyle=B\big{(}L_{1}(u)(\mathbf{x}),\dots,L_{Q_{b}}(u)(\mathbf{x});\widetilde{L}_{1}(a)(\mathbf{x}),\dots\widetilde{L}_{J}(a)(\mathbf{x})\big{)}, 𝐱Ω.\displaystyle\forall\mathbf{x}\in\partial\Omega\,.

\Diamond

Similarly to the LqL_{q}, the L~j\widetilde{L}_{j} operators are also linear differential operators in the case of prototypical PDEs, while the maps P,BP,B remain simple algebraic nonlinearities. Let us briefly consider an IP in subsurface flow and verify the above assumption.

Example DF (Darcy flow IP).

Let Ω=(0,1)2\Omega=(0,1)^{2} and consider the Darcy flow PDE with Dirichlet boundary conditions

{div(exp(a)u)(𝐱)=f(𝐱),𝐱Ω,u(𝐱)=0,𝐱Ω.\left\{\begin{aligned} -\text{div}\left(\exp(a)\nabla u\right)(\mathbf{x})&=f(\mathbf{x}),&&\mathbf{x}\in\Omega\,,\\ u(\mathbf{x})&=0,&&\mathbf{x}\in\partial\Omega\,.\end{aligned}\right.

We wish to approximate aC1(Ω¯)a\in C^{1}(\overline{\Omega}) given noisy pointwise observations of uu at a set of points 𝐱~1,,𝐱~I\tilde{\mathbf{x}}_{1},\dots,\tilde{\mathbf{x}}_{I}. Thus, we take ψi=δ𝐱~i\psi_{i}=\updelta_{\tilde{\mathbf{x}}_{i}}. By expanding the PDE we obtain

div(exp(a)u)=exp(a)(au+Δu),-\text{div}\left(\exp(a)\nabla u\right)=-\exp(a)\left(\nabla a\cdot\nabla u+\Delta u\right)\,,

and so we simply choose Q=3Q=3, Qb=1Q_{b}=1 and J=2J=2 with the linear operators

L1(u)=u,L2(u)=u,L3(u)=Δu,L~1(a)=a,L~2(a)=a.L_{1}(u)=u,\quad L_{2}(u)=\nabla u,\quad L_{3}(u)=\Delta u,\quad\widetilde{L}_{1}(a)=a,\quad\widetilde{L}_{2}(a)=\nabla a\,.

We can then satisfy Assumption 4.1 by taking

(40) P(v1,𝐯2,v3;v4,𝐯5)=exp(v4)(𝐯5𝐯2+v3),B(v1;v4,𝐯5)=v1,P(v_{1},\mathbf{v}_{2},v_{3};v_{4},\mathbf{v}_{5})=-\exp(v_{4})\left(\mathbf{v}_{5}\cdot\mathbf{v}_{2}+v_{3}\right),\qquad B(v_{1};v_{4},\mathbf{v}_{5})=v_{1}\,,

where we have slightly abused notation by letting L2,L~2L_{2},\widetilde{L}_{2} be vector valued and defining P,BP,B to take vectors as some of their inputs. \Diamond
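In code, the maps in (40) take the following form (a minimal sketch, an illustration only), where v2 and v5 are two-dimensional arrays holding u\nabla u and a\nabla a.

\begin{verbatim}
# Sketch of the Darcy flow maps (40).
import jax.numpy as jnp

def P(v1, v2, v3, v4, v5):
    return -jnp.exp(v4) * (jnp.dot(v5, v2) + v3)   # -exp(a) (grad a . grad u + Lap u)

def B(v1, v4, v5):
    return v1                                      # Dirichlet boundary: u itself
\end{verbatim}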

As in Subsection 3.1 we now define the functionals ϕm(q)𝒰\phi_{m}^{(q)}\in\mathcal{U}^{\ast} for m=1,,Mm=1,\dots,M and q=1,,Qq=1,\dots,Q according to (18) and (20) with N=MQb+MΩ(QQb)N=MQ_{b}+M_{\Omega}(Q-Q_{b}). Similarly we define the functionals ϕ~m(j)𝒜\widetilde{\phi}_{m}^{(j)}\in\mathcal{A}^{\ast} as

(41) ϕ~m(j):=δ𝐱mL~j,form=1,,M, and j=1,,J,\widetilde{\phi}_{m}^{(j)}:=\delta_{\mathbf{x}_{m}}\circ\widetilde{L}_{j},\qquad\text{for}\qquad m=1,\dots,M,\text{ and }j=1,\dots,J\,,

together with the vectors

(42) ϕ~(j)=(ϕ~1(j),ϕ~M(j))(𝒜)Mandϕ~=(ϕ~(1),,ϕ~(J))(𝒜)N~,\widetilde{{\bm{\phi}}}^{(j)}=(\widetilde{\phi}_{1}^{(j)},\dots\widetilde{\phi}_{M}^{(j)})\in(\mathcal{A}^{\ast})^{\otimes M}\quad\text{and}\quad\widetilde{{\bm{\phi}}}=(\widetilde{{\bm{\phi}}}^{(1)},\dots,\widetilde{{\bm{\phi}}}^{(J)})\in(\mathcal{A}^{\ast})^{\otimes\widetilde{N}}\,,

where N~:=MJ\widetilde{N}:=MJ. Similarly to (22) define the map

(\displaystyle\big{(} F([ϕ,u]𝒰;[ϕ~,a]𝒜))m:=\displaystyle F([{\bm{\phi}},u]_{\mathcal{U}};[\widetilde{{\bm{\phi}}},a]_{\mathcal{A}})\big{)}_{m}:=
{P([ϕm(1),u]𝒰,,[ϕm(Q),u]𝒰;[ϕ~m(1),a]𝒜,,[ϕ~m(J),a]𝒜)if m{1,,MΩ},B([ϕm(1),u]𝒰,,[ϕm(Qb),u]𝒰;[ϕ~m(1),a]𝒜,,[ϕ~m(J),a]𝒜)if m{MΩ+1,,M},\displaystyle\left\{\begin{aligned} &P([\phi_{m}^{(1)},u]_{\mathcal{U}},\dots,[\phi_{m}^{(Q)},u]_{\mathcal{U}};[\widetilde{\phi}_{m}^{(1)},a]_{\mathcal{A}},\dots,[\widetilde{\phi}_{m}^{(J)},a]_{\mathcal{A}})&&\text{if }m\in\{1,\dots,M_{\Omega}\}\,,\\ &B([\phi_{m}^{(1)},u]_{\mathcal{U}},\dots,[\phi_{m}^{(Q_{b})},u]_{\mathcal{U}};[\widetilde{\phi}_{m}^{(1)},a]_{\mathcal{A}},\dots,[\widetilde{\phi}_{m}^{(J)},a]_{\mathcal{A}})&&\text{if }m\in\{M_{\Omega}+1,\dots,M\}\,,\end{aligned}\right.

where we used subscripts to distinguish the duality pairings between 𝒰,𝒰\mathcal{U},\mathcal{U}^{\ast} and the pairing between 𝒜,𝒜\mathcal{A},\mathcal{A}^{\ast}. With this new notation we can finally rewrite (38) in the familiar form

(43) {minimize(u,a)𝒰×𝒜u𝒰2+a𝒜2+1γ2|𝝍(u)𝐨|2s.t.F([ϕ,u]𝒰;[ϕ~,a]𝒜)=𝐲,\left\{\begin{aligned} &\operatorname*{{\rm minimize}}_{(u,a)\in\mathcal{U}\times\mathcal{A}}\quad&&\|u\|_{\mathcal{U}}^{2}+\|a\|_{\mathcal{A}}^{2}+\frac{1}{\gamma^{2}}|{\bm{\psi}}(u)-\mathbf{o}|^{2}\\ &{\rm\,s.t.}\quad&&F\big{(}[{\bm{\phi}},u]_{\mathcal{U}};[\widetilde{{\bm{\phi}}},a]_{\mathcal{A}}\big{)}=\mathbf{y},\end{aligned}\right.

with the PDE data vector 𝐲M\mathbf{y}\in\mathbb{R}^{M} defined in (21).

We can now apply Proposition 2.3 with the canonical GP defined on the product space 𝒰×𝒜\mathcal{U}\times\mathcal{A} and with a block diagonal covariance operator 𝒦𝒦~\mathcal{K}\otimes\widetilde{\mathcal{K}} to obtain a representer theorem for minimizer of (43). We state this result as a corollary below after introducing some further notation. Define the vector 𝝋=(φ1,,φN+I)(𝒰)(I+N),{\bm{\varphi}}=(\varphi_{1},\dots,\varphi_{N+I})\in(\mathcal{U}^{\ast})^{\otimes(I+N)}, with entries 131313Note that we are concatenating the II measurement functionals defining the data for the IP with the NN functionals used to define the PDE at the collocation points.

φn:={ψn,if n=1,,I,ϕnI,if n=I+1,,I+N,\varphi_{n}:=\left\{\begin{aligned} &\psi_{n},&&\text{if }n=1,\dots,I,\\ &\phi_{n-I},&&\text{if }n=I+1,\dots,I+N,\end{aligned}\right.

as well as the matrices Θ(I+N)×(I+N)\Theta\in\mathbb{R}^{(I+N)\times(I+N)} and Θ~N~×N~\widetilde{\Theta}\in\mathbb{R}^{\widetilde{N}\times\widetilde{N}} with entries

Θi,n=[φi,𝒦φn]𝒰andΘ~i,n=[ϕ~i,𝒦~ϕ~n]𝒜.\Theta_{i,n}=[\varphi_{i},\mathcal{K}\varphi_{n}]_{\mathcal{U}}\qquad\text{and}\qquad\widetilde{\Theta}_{i,n}=[\widetilde{\phi}_{i},\widetilde{\mathcal{K}}\widetilde{\phi}_{n}]_{\mathcal{A}}.

As in (10) we define the gamblets

χi=n=1N+IΘi,n1𝒦φn,andχ~i=n=1N~Θ~i,n1𝒦~ϕ~n.\chi_{i}=\sum_{n=1}^{N+I}\Theta_{i,n}^{-1}\mathcal{K}\varphi_{n},\qquad\text{and}\qquad\widetilde{\chi}_{i}=\sum_{n=1}^{\widetilde{N}}\widetilde{\Theta}_{i,n}^{-1}\widetilde{\mathcal{K}}\widetilde{\phi}_{n}.

Then Proposition 2.3 yields the following corollary.

Corollary 4.3.

Suppose Assumption 4.1 holds and that the covariance operators 𝒦\mathcal{K} and 𝒦~\widetilde{\mathcal{K}} as well as the matrices Θ\Theta and Θ~\widetilde{\Theta} are invertible. Then (u,a)𝒰×𝒜(u^{\dagger},a^{\dagger})\in\mathcal{U}\times\mathcal{A} is a minimizer of (43) if and only if

u=n=1I+Nznχn,anda=n=1N~z~nχ~n,u^{\dagger}=\sum_{n=1}^{I+N}z^{\dagger}_{n}\chi_{n},\qquad\text{and}\qquad a^{\dagger}=\sum_{n=1}^{\widetilde{N}}\widetilde{z}^{\dagger}_{n}\widetilde{\chi}_{n},

where the vectors 𝐳,𝐳~\mathbf{z}^{\dagger},\widetilde{\mathbf{z}}^{\dagger} are minimizers of

(44) {minimize(𝐳,𝐳~)(I+N×N~)𝐳TΘ1𝐳+𝐳~TΘ~1𝐳~+1γ2|ΠI𝐳𝐨|2s.t.F(ΠN𝐳;𝐳~)=𝐲,\left\{\begin{aligned} &\operatorname*{{\rm minimize}}_{(\mathbf{z},\widetilde{\mathbf{z}})\in(\mathbb{R}^{I+N}\times\mathbb{R}^{\widetilde{N}})}&&\mathbf{z}^{T}\Theta^{-1}\mathbf{z}+\widetilde{\mathbf{z}}^{T}\widetilde{\Theta}^{-1}\widetilde{\mathbf{z}}+\frac{1}{\gamma^{2}}|\Pi^{I}\mathbf{z}-\mathbf{o}|^{2}\\ &{\rm\,s.t.}&&F(\Pi_{N}\mathbf{z};\widetilde{\mathbf{z}})=\mathbf{y},\end{aligned}\right.

where ΠI:I+NI\Pi^{I}:\mathbb{R}^{I+N}\to\mathbb{R}^{I} is the projection that extracts the first II entries of a vector while ΠN:I+NN\Pi_{N}:\mathbb{R}^{I+N}\to\mathbb{R}^{N} is the projection that extracts the last NN entries.

4.2 Dealing with the Constraints

The equality constraints in (44) can be dealt with using the same strategies as in Subsection 3.3. Indeed, as in Subsection 3.3.2, we can readily relax these constraints to obtain the optimization problem

(45) minimize(𝐳,𝐳~)(I+N×N~)\displaystyle\operatorname*{{\rm minimize}}_{(\mathbf{z},\widetilde{\mathbf{z}})\in(\mathbb{R}^{I+N}\times\mathbb{R}^{\widetilde{N}})} 𝐳TΘ1𝐳+𝐳~TΘ~1𝐳~+1γ2|ΠI𝐳𝐨|2+1β2|F(ΠN𝐳;𝐳~)𝐲|2,\displaystyle\mathbf{z}^{T}\Theta^{-1}\mathbf{z}+\widetilde{\mathbf{z}}^{T}\widetilde{\Theta}^{-1}\widetilde{\mathbf{z}}+\frac{1}{\gamma^{2}}|\Pi^{I}\mathbf{z}-\mathbf{o}|^{2}+\frac{1}{\beta^{2}}\left|F(\Pi_{N}\mathbf{z};\widetilde{\mathbf{z}})-\mathbf{y}\right|^{2},

for a small parameter β2>0\beta^{2}>0. Elimination of the constraints as in Subsection 3.3.1 is slightly more delicate, but is sometimes possible. Suppose there exists a solution map F¯:N+N~M×MN+N~\overline{F}:\mathbb{R}^{N+\widetilde{N}-M}\times\mathbb{R}^{M}\to\mathbb{R}^{N+\widetilde{N}} so that

F(ΠN𝐳;𝐳~)=𝐲if and only if(ΠN𝐳,𝐳~)=F¯(𝐰,𝐲)for a unique 𝐰N+N~M.F(\Pi_{N}\mathbf{z};\widetilde{\mathbf{z}})=\mathbf{y}\quad\text{if and only if}\quad(\Pi_{N}\mathbf{z},\widetilde{\mathbf{z}})=\overline{F}(\mathbf{w},\mathbf{y})\qquad\text{for a unique }\mathbf{w}\in\mathbb{R}^{N+\widetilde{N}-M}.

Then solving (44) is equivalent to solving the unconstrained problem

(46) minimize(𝐯,𝐰)I×N+N~M\displaystyle\operatorname*{{\rm minimize}}_{(\mathbf{v},\mathbf{w})\in\mathbb{R}^{I}\times\mathbb{R}^{N+\widetilde{N}-M}} (𝐯,F¯(𝐰,𝐲))[Θ100Θ~1](𝐯F¯(𝐰,𝐲))+1γ2|𝐯𝐨|2,\displaystyle(\mathbf{v},\overline{F}(\mathbf{w},\mathbf{y}))\begin{bmatrix}\Theta^{-1}&0\\ 0&\widetilde{\Theta}^{-1}\end{bmatrix}\begin{pmatrix}\mathbf{v}\\ \overline{F}(\mathbf{w},\mathbf{y})\end{pmatrix}+\frac{1}{\gamma^{2}}|\mathbf{v}-\mathbf{o}|^{2},

and setting ΠI𝐳=𝐯\Pi^{I}\mathbf{z}^{\dagger}=\mathbf{v}^{\dagger} and (ΠN𝐳,𝐳~)=F¯(𝐰,𝐲)(\Pi_{N}\mathbf{z}^{\dagger},\widetilde{\mathbf{z}}^{\dagger})=\overline{F}(\mathbf{w}^{\dagger},\mathbf{y}).
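For reference, the relaxed objective (45) can be written down directly, as in the following minimal sketch (an illustration only, with hypothetical argument names), where z holds the II observation-related entries followed by the NN PDE-related entries and z_a plays the role of 𝐳~\widetilde{\mathbf{z}}.

\begin{verbatim}
# Sketch of the relaxed inverse-problem objective (45).
import jax.numpy as jnp

def ip_loss(z, z_a, o, y, Theta, Theta_a, F, I, gamma, beta):
    data_misfit = jnp.sum((z[:I] - o) ** 2) / gamma**2        # |Pi^I z - o|^2 / gamma^2
    pde_misfit = jnp.sum((F(z[I:], z_a) - y) ** 2) / beta**2  # |F(Pi_N z; z_tilde) - y|^2 / beta^2
    reg = z @ jnp.linalg.solve(Theta, z) + z_a @ jnp.linalg.solve(Theta_a, z_a)
    return reg + data_misfit + pde_misfit
\end{verbatim}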

4.3 Implementation

Both of the problems (45) and (46) can be solved using the same techniques outlined in Subsection 3.4 except that we now have a higher dimensional solution space. Below we briefly describe the main differences between the implementation of the PDE and IP solvers.

4.3.1 Constructing Θ\Theta and Θ~\widetilde{\Theta}

We propose to construct the matrices Θ,Θ~\Theta,\widetilde{\Theta} using appropriate kernels KK, for the solution uu of the PDE, and K~\widetilde{K}, for the parameter aa, identically to Subsection 3.4.1. Our minimum requirement on K,K~K,\widetilde{K} is sufficient regularity for the pointwise constraints in (38) to be well-defined. Since we have limited and noisy data in the inverse problem setting, it is not possible for us to recover the exact solution (u,a)(u^{\star},a^{\star}) in general, and so the kernels K,K~K,\widetilde{K} should be chosen to reflect our prior assumptions on the unknown parameter and the solution of the PDE at that parameter value.

Figure 5: Numerical results for the inverse Darcy flow: (a) an instance of uniformly sampled collocation points and data points; (d) Gauss–Newton iteration history; (b) true aa; (e) recovered aa; (c) true uu; (f) recovered uu. Adaptive diagonal regularization (“nugget”) terms were added to the kernel matrix, with parameters η=η~=105\eta=\tilde{\eta}=10^{-5} as outlined in Appendix A.5.

4.4 Numerical Experiments for Darcy Flow

In this subsection, we apply our method to an IP involving the Darcy flow PDE. We consider the IP outlined in Example DF with the true coefficient a(𝐱)a^{\star}(\mathbf{x}) satisfying

exp(a(𝐱))=exp(sin(2πx1)+sin(2πx2))+exp(sin(2πx1)sin(2πx2)),\exp\Bigl{(}a^{\star}(\mathbf{x})\Bigr{)}=\exp\Bigl{(}\sin(2\pi x_{1})+\sin(2\pi x_{2})\Bigr{)}+\exp\Bigl{(}-\sin(2\pi x_{1})-\sin(2\pi x_{2})\Bigr{)}\,,

and the right hand side source term is f=1f=1. We randomly sampled M=500M=500 collocation points with MΩ=400M_{\Omega}=400 in the interior. From these 400 interior points, we randomly chose I=40I=40 points and observed the values of u(𝐱)u(\mathbf{x}) at those points as the data for the IP. The values of u(𝐱)u(\mathbf{x}) for this purpose were generated by first solving the equation with the true coefficient on a uniform grid and then using linear interpolation to get the solution at the observation points. We further added independent Gaussian noise 𝒩(0,γ2I)\mathcal{N}(0,\gamma^{2}I) with noise standard deviation γ=103\gamma=10^{-3} to these observations. In dealing with the nonlinear constraint shown in Example DF, we eliminated the variable v3v_{3} using the relation in (40).

We chose Gaussian kernels for both uu and aa with the same lengthscale parameter σ=0.2\sigma=0.2 and adaptive diagonal (“nugget”) terms were added to the kernel matrices, with parameters η=η~=105,\eta=\tilde{\eta}=10^{-5}, to regularize Θ\Theta and Θ~\widetilde{\Theta} as outlined in Appendix A.5. In Figure 5 we show the experimental results for recovering both aa and uu. From the figure, we observe that the Gauss–Newton iterations converged after 6 steps. Moreover, the recovered aa and uu are reasonably accurate, i.e. they capture the shape of the truth, given the limited amount of observation information available.

5 Concluding Remarks

We have introduced a kernel/GP framework for solving nonlinear PDEs and IPs centered around the idea of approximating the solution of a given PDE with a MAP estimator of a GP conditioned on satisfying the PDE at a set of collocation points. Theoretically, we exhibited a nonlinear representer theorem which finite-dimensionalizes the MAP estimation problem and proved the convergence of the resulting solution towards the truth as the number of collocation points goes to infinity, under some regularity assumptions. Computationally, we demonstrated that the solution can be found by solving a finite-dimensional optimization problem with quadratic loss and nonlinear constraints. We presented two methods for dealing with the nonlinear constraints, namely the elimination and relaxation approaches. An efficient variant of the Gauss–Newton algorithm was also proposed for solving the resulting unconstrained optimization problem, where an adaptive nugget term was employed for regularization together with offline Cholesky factorizations of the underlying kernel matrices. We demonstrated that the proposed algorithm performs well in a wide range of prototypical nonlinear problems such as a nonlinear elliptic PDE, Burgers’ equation, a regularized Eikonal equation, and the identification of the permeability field in Darcy flow.

While our theoretical results established the existence of a solution of the finite-dimensional optimization problem and the convergence of u^{\dagger} to the truth, the uniqueness of the solution is not guaranteed and the convergence theorem does not provide convergence rates. In what follows, we discuss several directions that may address these two issues and point toward future work: in Subsection 5.1 we show how uniqueness can be obtained under an additional condition on the nonlinearity. In Subsection 5.2 we lay a path toward rigorous error estimates, followed by a discussion of learning hierarchical kernel parameters in Subsection 5.3. Finally, in Subsection 5.4, we connect the framework of this article to MAP estimation for Bayesian IPs with non-Gaussian priors.

5.1 Uniqueness and Optimality of Minimizers

While our theoretical results in Subsection 2.2 characterize the minimizers of nonlinearly constrained optimization problems in RKHSs, they do not imply the uniqueness of such minimizers. The lack of convexity due to the nonlinearity of F precludes uniqueness results in general; however, a positive result can be obtained under appropriate assumptions on the function F. Recall (24), and assume that the following condition holds:

(47) (\mathbf{z}-\mathbf{z}^{\dagger})^{T}\Theta^{-1}\mathbf{z}^{\dagger}\geq 0\text{ for }F(\mathbf{z})=\mathbf{y}.

Geometrically, this condition is equivalent to the requirement that the nonlinear manifold \mathcal{V}:=\big\{v\in\mathcal{U}\mid F\big([{\bm{\phi}},v]\big)=F\big([{\bm{\phi}},u^{\star}]\big)\big\} lies on only one side of the tangent hyperplane supported by \mathbf{z}^{\dagger} with \Theta^{-1}-normal direction \mathbf{z}^{\dagger}. Note that (47) is always satisfied with equality if the PDE is linear, i.e., if F is affine. The following proposition states that the minimizer u^{\dagger} of (16) is then not only unique but is also the minimax optimal approximation of u^{\star} in \mathcal{V} in the sense of optimal recovery [40].

Proposition 5.1.

Under condition (47) it holds true that:

  1. u^{\dagger} is unique.

  2. \|v-u^{\dagger}\|^{2}\leq\|v\|^{2}-\|u^{\dagger}\|^{2} for any v\in\mathcal{V}.

  3. u^{\dagger} is one of the minimizers of the minimax loss \inf_{u\in\mathcal{V}}\sup_{v\in\mathcal{V},v\neq 0}\frac{\|v-u\|}{\|v\|}.

Proof 5.2.

For any v\in\mathcal{V}, we denote by \overline{v}=\sum_{n=1}^{N}[\phi_{n},v]\chi_{n} the GP MAP solution after observing the values [\phi_{n},v] for 1\leq n\leq N. By definition, \overline{v}\in\mathcal{V} and [{\bm{\phi}},v-\overline{v}]=0. Thus, v-\overline{v} is orthogonal to the linear span of the \chi_{n} in the \|\cdot\| norm. It follows that \|v-u^{\dagger}\|^{2}=\|v-\overline{v}\|^{2}+\|\overline{v}-u^{\dagger}\|^{2}. Furthermore, writing \mathbf{z}=[{\bm{\phi}},v], we have

\|\overline{v}-u^{\dagger}\|^{2}=\|\overline{v}\|^{2}+\|u^{\dagger}\|^{2}-2\mathbf{z}^{T}\Theta^{-1}\mathbf{z}^{\dagger}=\|\overline{v}\|^{2}-\|u^{\dagger}\|^{2}-2(\mathbf{z}-\mathbf{z}^{\dagger})^{T}\Theta^{-1}\mathbf{z}^{\dagger}\leq\|\overline{v}\|^{2}-\|u^{\dagger}\|^{2},

where we used the relation \|u^{\dagger}\|^{2}=(\mathbf{z}^{\dagger})^{T}\Theta^{-1}\mathbf{z}^{\dagger} and the assumed condition (47). Thus, using the orthogonality above, we deduce that \|v-u^{\dagger}\|^{2}\leq\|v-\overline{v}\|^{2}+\|\overline{v}\|^{2}-\|u^{\dagger}\|^{2}=\|v\|^{2}-\|u^{\dagger}\|^{2}. This establishes the second point of the proposition, and the first point follows directly.

The proof of the third point is similar to that of [47, Thm. 18.2]. Here we shall prove that, for the given minimax loss, the optimal value is 1 and that \sum_{n=1}^{N}z_{n}\chi_{n} is a minimizer for any \mathbf{z} satisfying F(\mathbf{z})=\mathbf{y}. To achieve this, first observe that the loss is always less than or equal to 1 because, for v\in\mathcal{V}, \|v-\overline{v}\|^{2}/\|v\|^{2}\leq 1 and \overline{v}\in\mathcal{V}. Thus, the optimal value is not larger than 1.

Moreover, let v^{\perp} be a nonzero element in \mathcal{U} such that [{\bm{\phi}},v^{\perp}]=0. Consider v_{c}=cv^{\perp}+\sum_{n=1}^{N}z_{n}\chi_{n} for any \mathbf{z} satisfying F(\mathbf{z})=\mathbf{y}. Taking u=\sum_{n=1}^{N}z_{n}\chi_{n}, we have that

\frac{\|v_{c}-u\|^{2}}{\|v_{c}\|^{2}}=\frac{c^{2}\|v^{\perp}\|^{2}}{c^{2}\|v^{\perp}\|^{2}+\|u\|^{2}}\,;

this term converges to 1 as c\rightarrow\infty. Thus, the optimal value is exactly 1 and the functions u=\sum_{n=1}^{N}z_{n}\chi_{n} are minimizers.

Note that the above proposition implies the uniqueness of the minimizer u^{\dagger} but does not imply the uniqueness of the solution u^{\star}. It is also interesting to observe that condition (47) only involves the nonlinear map F and not the differential operators defining the PDE.
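As a simple illustration of the latter point, consider the affine case mentioned after (47); the following short verification is our own and uses only the structure of the reduced problem (24), namely that \mathbf{z}^{\dagger} minimizes \mathbf{z}^{T}\Theta^{-1}\mathbf{z} subject to F(\mathbf{z})=\mathbf{y}. Write F(\mathbf{z})=B\mathbf{z}+\mathbf{c}. Then F(\mathbf{z})=\mathbf{y} and F(\mathbf{z}^{\dagger})=\mathbf{y} give B(\mathbf{z}-\mathbf{z}^{\dagger})=0, while the first-order optimality conditions for \mathbf{z}^{\dagger} yield \Theta^{-1}\mathbf{z}^{\dagger}=B^{T}\boldsymbol{\lambda} for some multiplier \boldsymbol{\lambda}, so that

(\mathbf{z}-\mathbf{z}^{\dagger})^{T}\Theta^{-1}\mathbf{z}^{\dagger}=\bigl(B(\mathbf{z}-\mathbf{z}^{\dagger})\bigr)^{T}\boldsymbol{\lambda}=0,

and (47) indeed holds with equality in the affine case.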

5.2 Error Estimates

Here we present rigorous error estimates in the PDE setting of Section 3, bounding the error between the minimizer u^{\dagger} of (16) and the true solution u^{\star}. The main idea behind our error bounds is to connect the conditional covariance of the underlying GP to the pointwise error of the conditional MAP estimator.

In the setting of Section 2.2, consider \ker({\bm{\phi}}):=\{u\in\mathcal{U}\mid[{\bm{\phi}},u]=0\}. Then the conditional covariance operator \mathcal{K}_{\bm{\phi}} of Proposition 2.1 coincides with the short of the operator \mathcal{K} to \ker({\bm{\phi}}) [2, 46]: this is a symmetric and positive operator from \mathcal{U}^{\ast} to \mathcal{U} identified via

(48) [\varphi,\mathcal{K}_{\bm{\phi}}\varphi]=\inf_{\overline{\varphi}\in\text{span }{\bm{\phi}}}\langle\varphi-\overline{\varphi},\varphi-\overline{\varphi}\rangle_{\ast}=\inf_{\overline{\varphi}\in\text{span }{\bm{\phi}}}[\varphi-\overline{\varphi},\mathcal{K}(\varphi-\overline{\varphi})],\qquad\forall\varphi\in\mathcal{U}^{\ast},

where we used the shorthand notation \text{span }{\bm{\phi}}=\text{span}\{\phi_{1},\dots,\phi_{N}\}. This identification holds true even when the \phi_{n} are not linearly independent. With the notion of the shorted operator at hand, we can now present an abstract error bound on the action of dual elements on u^{\dagger} and u^{\star}.

Proposition 5.3.

Let \mathcal{H} be a Banach space endowed with a quadratic norm such that the embedding \mathcal{U}\subset\mathcal{H} is compact and such that \mathcal{H}^{\ast}\subset\mathcal{U}^{\ast}, where \mathcal{H}^{\ast} is defined using the same duality pairing as for \mathcal{U}^{\ast}. Let u^{\dagger}=\sum_{n=1}^{N}z_{n}^{\dagger}\chi_{n} be a minimizer of (16) approximating the unique solution u^{\star} of (15). Fix \varphi\in\mathcal{H}^{\ast} and define {\bm{\phi}} as in (20). Then it holds true that

(49) \big|[\varphi,u^{\dagger}]-[\varphi,u^{\star}]\big|\leq\sigma(\varphi)\|u^{\star}\|+\epsilon\|\varphi\|_{\mathcal{H}^{\ast}},

where \sigma^{2}(\varphi):=\mathcal{K}(\varphi,\varphi)-\mathcal{K}(\varphi,{\bm{\phi}})\mathcal{K}({\bm{\phi}},{\bm{\phi}})^{-1}\mathcal{K}({\bm{\phi}},\varphi) and

\epsilon^{2}:=\left\{\begin{aligned}&\max_{\mathbf{z}\in\mathbb{R}^{N}}&&(\mathbf{z}-\mathbf{z}^{\dagger})^{T}A(\mathbf{z}-\mathbf{z}^{\dagger})\\ &{\rm\,s.t.}&&F(\mathbf{z})=\mathbf{y},\\ &&&\mathbf{z}^{T}\Theta^{-1}\mathbf{z}\leq\|u^{\star}\|^{2},\end{aligned}\right.

where A is the N\times N positive symmetric matrix with entries A_{i,j}=\langle\chi_{i},\chi_{j}\rangle_{\mathcal{H}}. Furthermore, it also holds true that

(50) \|u^{\dagger}-u^{\star}\|\leq\inf_{\phi\in\mathrm{span}\,{\bm{\phi}}}\|u^{\star}-\mathcal{K}\phi\|+\epsilon\,.

Proof 5.4.

Define \mathbf{v}:=[{\bm{\phi}},u^{\star}] and \overline{u}:=\mathbf{v}^{T}\Theta^{-1}\mathcal{K}{\bm{\phi}}\in\mathcal{U}. By the triangle inequality we have

\big|[\varphi,u^{\star}]-[\varphi,u^{\dagger}]\big|\leq\big|[\varphi,u^{\star}]-[\varphi,\overline{u}]\big|+\big|[\varphi,\overline{u}]-[\varphi,u^{\dagger}]\big|.

Since [{\bm{\phi}},u^{\star}]=[{\bm{\phi}},\overline{u}], by [43, Thm. 5.1] we have

\big|[\varphi,u^{\star}]-[\varphi,\overline{u}]\big|=\inf_{\overline{\varphi}\in\text{span}\,{\bm{\phi}}}\big|[\varphi-\overline{\varphi},u^{\star}-\overline{u}]\big|\leq\inf_{\overline{\varphi}\in\text{span}\,{\bm{\phi}}}\|\varphi-\overline{\varphi}\|_{*}\|u^{\star}-\overline{u}\|.

Then (48) implies that \inf_{\overline{\varphi}\in\text{span}\,{\bm{\phi}}}\|\varphi-\overline{\varphi}\|_{*}=\sigma(\varphi). Since \overline{u} minimizes \|u\| subject to [{\bm{\phi}},u]=[{\bm{\phi}},u^{\star}], we have \|u^{\star}-\overline{u}\|^{2}=\|u^{\star}\|^{2}-\|\overline{u}\|^{2}, which leads to |[\varphi,u^{\star}]-[\varphi,\overline{u}]|\leq\sigma(\varphi)\|u^{\star}\|. Furthermore, |[\varphi,\overline{u}]-[\varphi,u^{\dagger}]|\leq\|\varphi\|_{\mathcal{H}^{\ast}}\|\overline{u}-u^{\dagger}\|_{\mathcal{H}}, and observe that \|\overline{u}-u^{\dagger}\|_{\mathcal{H}}^{2}=(\mathbf{v}-\mathbf{z}^{\dagger})^{T}A(\mathbf{v}-\mathbf{z}^{\dagger}). Since \|\overline{u}\|\leq\|u^{\star}\|, \|u^{\dagger}\|\leq\|u^{\star}\|, and F([{\bm{\phi}},\overline{u}])=\mathbf{y}, we deduce that \|\overline{u}-u^{\dagger}\|_{\mathcal{H}} is bounded by the supremum of \|v^{\prime}-u^{\dagger}\|_{\mathcal{H}} over all v^{\prime}\in\mathcal{U} that belong to the ball of radius \|u^{\star}\| and satisfy the PDE constraints at the collocation points. Inequality (50) follows from \|u^{\dagger}-u^{\star}\|\leq\|u^{\star}-\overline{u}\|+\epsilon and the fact that \overline{u} is the projection of u^{\star} onto \text{span}\,\mathcal{K}{\bm{\phi}} [47, Thm. 12.3].

In the case where \mathcal{H}\subset C^{t}(\Omega)\cap C^{t^{\prime}}(\overline{\Omega}), we can take \varphi to be of the form \updelta_{x}\circ L, where L is a differential operator of order at most t for x\in\Omega and at most t^{\prime} for x\in\partial\Omega. Then Proposition 5.3 allows us to bound the pointwise error of u^{\dagger} and its derivatives. As in Theorem 3.2, the compact embedding of \mathcal{U} into \mathcal{H} guarantees the convergence of \sigma(\varphi) and \epsilon towards zero as the collocation points become dense in \Omega and \partial\Omega. Note that, by (50), the strong approximation error between u^{\dagger} and u^{\star} is controlled by the sum of \epsilon and the best approximation error in \text{span}\,\mathcal{K}{\bm{\phi}}. Controlling this best approximation error is typically done in the dual space in the linear setting [47, Sec. 14.8]; in the nonlinear setting, it requires controlling the regularity of the solution (i.e., analyzing the underlying PDE). Moreover, if the fill distance of the collocation points is h, and if the norm \|\cdot\| satisfies the embedding inequality \|\cdot\|\leq C\|\cdot\|_{H^{s}(\Omega)}, then [47, Lem. 14.39] implies \|u^{\star}-\overline{u}\|\leq Ch^{s}\|u^{\star}\|_{H^{2s}(\Omega)}.
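To illustrate how the quantity \sigma(\varphi) appearing in (49) can be evaluated in practice, the following minimal sketch (illustrative Python/NumPy only, with a Gaussian kernel, \varphi=\updelta_{x}, and a measurement set consisting solely of pointwise evaluations; with derivative measurements the same formula applies with the corresponding derivative-kernel blocks) computes the posterior standard deviation at test points from the kernel matrix \mathcal{K}({\bm{\phi}},{\bm{\phi}}).

import numpy as np

def gaussian_kernel(X, Y, sigma=0.2):
    # K(x, y) = exp(-|x - y|^2 / (2 sigma^2)), evaluated pairwise.
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def pointwise_sigma(X_test, X_colloc, sigma=0.2, eta=1e-10):
    # sigma(varphi) of (49) for varphi = delta_x, when phi consists only of pointwise
    # evaluations at the collocation points; a small nugget is added for stability.
    K_tt = gaussian_kernel(X_test, X_test, sigma)      # K(varphi, varphi)
    K_tc = gaussian_kernel(X_test, X_colloc, sigma)    # K(varphi, phi)
    K_cc = gaussian_kernel(X_colloc, X_colloc, sigma)  # K(phi, phi)
    K_cc = K_cc + eta * np.eye(len(X_colloc))
    var = K_tt - K_tc @ np.linalg.solve(K_cc, K_tc.T)
    return np.sqrt(np.maximum(np.diag(var), 0.0))

# Example: posterior standard deviation on a grid, given 500 random collocation points.
rng = np.random.default_rng(0)
X_colloc = rng.uniform(0.0, 1.0, size=(500, 2))
g = np.linspace(0.0, 1.0, 11)
X_test = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
print(pointwise_sigma(X_test, X_colloc).max())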

5.3 Learning the Kernel

An important factor in the accuracy and performance of our method is the choice of the kernel K (and, in turn, of the operator \mathcal{K}). The importance of the kernel choice is readily visible in Theorem 3.2 through the assumption that u^{\star}, the solution of the PDE, should belong to \mathcal{U}, the RKHS corresponding to K. As discussed in [43], even for linear PDEs, if the underlying kernel is not adapted to the solution space of the PDE, then the resulting method can perform arbitrarily badly [3]. This point stresses the importance of learning (or possibly programming [49, 45]) an informed kernel beyond enforcing the PDE at collocation points. While the problem of selecting the kernel is well understood if the solution is smooth, or if the PDE is linear with possibly rough coefficients [43], it can be challenging when the underlying PDE is nonlinear and the solution is irregular. Since our approach employs an explicit kernel, it allows us to learn the kernel directly using kernel (hyperparameter) learning techniques such as maximum likelihood estimation (MLE) [10], MAP estimation [45], or cross-validation (CV) [21, 50, 10]. Adaptation of these techniques to the solution of nonlinear PDEs is an attractive direction for future research.
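As a simple illustration of the kind of hyperparameter learning mentioned above, the following sketch (illustrative code only; the data, criterion, and grid are stand-ins rather than the procedure used in this paper) selects a Gaussian-kernel lengthscale by minimizing the standard GP-regression negative log marginal likelihood over a grid of candidates. Adapting such criteria to the PDE-constrained setting is precisely the open direction referred to above.

import numpy as np

def gaussian_kernel(X, Y, lengthscale):
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def neg_log_marginal_likelihood(lengthscale, X, y, noise=1e-3):
    # Standard GP-regression NLML, used here only as a proxy criterion for the lengthscale.
    K = gaussian_kernel(X, X, lengthscale) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

# Grid search over candidate lengthscales; X, y stand for noisy observations of the
# unknown function, e.g. solution values on a validation subset of collocation points.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 2))
y = np.sin(2 * np.pi * X[:, 0]) * np.sin(2 * np.pi * X[:, 1]) + 1e-3 * rng.standard_normal(100)
candidates = [0.05, 0.1, 0.2, 0.4, 0.8]
best = min(candidates, key=lambda s: neg_log_marginal_likelihood(s, X, y))
print("selected lengthscale:", best)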

5.4 Non-Gaussian priors

Although we have employed a Gaussian prior in the design of our nonlinear solver and in our Bayesian IPs, our methodology can naturally be generalized to priors that can be obtained as nonlinear transformations of Gaussian measures (e.g., instead of replacing the solution u^{\star} of a nonlinear PDE by a GP \xi, one could replace it by G(\xi), where G is a nonlinear function with the appropriate degree of regularity so that the pointwise derivatives of G(\xi) remain well-defined). This approach can potentially lead to efficient methods for computing MAP estimators of Bayesian IPs with certain non-Gaussian priors, a contemporary problem in the uncertainty quantification and IP literatures.

Acknowledgments

The authors gratefully acknowledge support by the Air Force Office of Scientific Research under MURI award number FA9550-20-1-0358 (Machine Learning and Physics-Based Modeling and Simulation). HO also acknowledges support by the Air Force Office of Scientific Research under award number FA9550-18-1-0271 (Games for Computation and Learning). YC is also grateful for support from the Caltech Kortchak Scholar Program.

References

  • [1] R. A. Adams and J. J. Fournier, Sobolev Spaces, Elsevier, 2003.
  • [2] W. Anderson, Jr and G. Trapp, Shorted operators. II, SIAM Journal on Applied Mathematics, 28 (1975), pp. 60–71.
  • [3] I. Babuška and J. Osborn, Can a finite element method perform arbitrarily badly?, Mathematics of Computation, 69 (2000), pp. 443–462.
  • [4] Y. Bar-Sinai, S. Hoyer, J. Hickey, and M. P. Brenner, Learning data-driven discretizations for partial differential equations, Proceedings of the National Academy of Sciences, 116 (2019), pp. 15344–15349.
  • [5] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart, Model reduction and neural networks for parametric PDEs, The SMAI journal of computational mathematics, 7 (2021), pp. 121–157.
  • [6] V. I. Bogachev, Gaussian measures, American Mathematical Society, 1998.
  • [7] G. Boncoraglio and C. Farhat, Active manifold and model reduction for multidisciplinary analysis and optimization, in AIAA Scitech 2021 Forum, 2021, p. 1694.
  • [8] H. Brezis, Functional Analysis, Sobolev Spaces and Partial Differential Equations, Springer Science & Business Media, 2010.
  • [9] B. Calderhead, M. Girolami, and N. D. Lawrence, Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes, in Advances in Neural Information Processing Systems, 2009, pp. 217–224.
  • [10] Y. Chen, H. Owhadi, and A. Stuart, Consistency of empirical Bayes and kernel flow for hierarchical parameter estimation, Mathematics of Computation, (2021).
  • [11] A. Cochocki and R. Unbehauen, Neural networks for optimization and signal processing, John Wiley & Sons, Inc., 1993.
  • [12] J. Cockayne, C. Oates, T. Sullivan, and M. Girolami, Probabilistic numerical methods for PDE-constrained Bayesian inverse problems, in AIP Conference Proceedings, vol. 1853, 2017, p. 060001.
  • [13] J. Cockayne, C. J. Oates, T. J. Sullivan, and M. Girolami, Bayesian probabilistic numerical methods, SIAM Review, 61 (2019), pp. 756–789.
  • [14] S. L. Cotter, M. Dashti, and A. M. Stuart, Approximation of Bayesian inverse problems for PDEs, SIAM Journal on Numerical Analysis, 48 (2010), pp. 322–345.
  • [15] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, MCMC methods for functions: modifying old algorithms to make them faster, Statistical Science, (2013), pp. 424–446.
  • [16] M. Dashti, K. J. Law, A. M. Stuart, and J. Voss, MAP estimators and their consistency in Bayesian nonparametric inverse problems, Inverse Problems, 29 (2013), p. 095017.
  • [17] M. Dashti and A. M. Stuart, The Bayesian approach to inverse problems, in Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, and H. Owhadi, eds., Springer International Publishing, 2016, pp. 1–118.
  • [18] P. Diaconis, Bayesian numerical analysis, in Statistical Decision Theory and Related Topics IV, S. S. Gupta and J. Berger, eds., Springer, 1988.
  • [19] B. Djeridane and J. Lygeros, Neural approximation of PDE solutions: An application to reachability computations, in Proceedings of the 45th IEEE Conference on Decision and Control, IEEE, 2006, pp. 3034–3039.
  • [20] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of inverse problems, vol. 375, Springer Science & Business Media, 1996.
  • [21] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning, Springer, 2001.
  • [22] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning, vol. 1, MIT press, 2016.
  • [23] J. Han, A. Jentzen, and E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [25] P. Hennig, M. A. Osborne, and M. Girolami, Probabilistic numerics and uncertainty in computations, Proceedings of the Royal Society A, 471 (2015), p. 20150142.
  • [26] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich, Optimization with PDE constraints, vol. 23, Springer Science & Business Media, 2008.
  • [27] A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: convergence and generalization in neural networks, in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 8580–8589.
  • [28] J. Kaipio and E. Somersalo, Statistical and computational inverse problems, vol. 160, Springer Science & Business Media, 2006.
  • [29] G. Karniadakis, I. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, Physics-informed machine learning, Nature Reviews Physics, (2021).
  • [30] G. Kimeldorf and G. Wahba, Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis and Applications, 33 (1971), pp. 82–95.
  • [31] G. S. Kimeldorf and G. Wahba, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Annals of Mathematical Statistics, 41 (1970), pp. 495–502.
  • [32] K. Krischer, R. Rico-Martínez, I. Kevrekidis, H. Rotermund, G. Ertl, and J. Hudson, Model identification of a spatiotemporally varying catalytic reaction, AIChE Journal, 39 (1993), pp. 89–98.
  • [33] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks, 9 (1998), pp. 987–1000.
  • [34] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, Neural-network methods for boundary value problems with irregular boundaries, IEEE Transactions on Neural Networks, 11 (2000), pp. 1041–1049.
  • [35] F. M. Larkin, Gaussian measure in Hilbert space and applications in numerical analysis, Journal of Mathematics, 2 (1972).
  • [36] Z. Li, N. Kovachki, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, Fourier neural operator for parametric partial differential equations, in International Conference on Learning Representations, 2020.
  • [37] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, A. Stuart, K. Bhattacharya, and A. Anandkumar, Multipole graph neural operator for parametric partial differential equations, Advances in Neural Information Processing Systems, (2020).
  • [38] H. Liang and H. Wu, Parameter estimation for differential equation models using a framework of measurement error in regression models, Journal of the American Statistical Association, 103 (2008), pp. 1570–1583.
  • [39] Z. Long, Y. Lu, X. Ma, and B. Dong, PDE-net: Learning PDEs from data, in International Conference on Machine Learning, PMLR, 2018, pp. 3208–3216.
  • [40] C. A. Micchelli and T. J. Rivlin, A survey of optimal recovery, in Optimal estimation in approximation theory, C. Micchelli and T. J. Rivlin, eds., Springer, 1977, pp. 1–54.
  • [41] N. H. Nelsen and A. M. Stuart, The random feature model for input-output maps between Banach spaces. arXiv preprint arXiv:2005.10224, 2020.
  • [42] J. Nocedal and S. Wright, Numerical Optimization, Springer Science & Business Media, 2006.
  • [43] H. Owhadi, Bayesian numerical homogenization, Multiscale Modeling & Simulation, 13 (2015), pp. 812–828.
  • [44] H. Owhadi, Multigrid with rough coefficients and multiresolution operator decomposition from hierarchical information games, SIAM Review, 59 (2017), pp. 99–149.
  • [45] H. Owhadi, Do ideas have shape? Plato's theory of forms as the continuous limit of artificial neural networks. arXiv preprint arXiv:2008.03920, 2020.
  • [46] H. Owhadi and C. Scovel, Conditioning Gaussian measure on Hilbert space, Journal of Mathematical and Statistical Analysis, 1 (2018).
  • [47] H. Owhadi and C. Scovel, Operator-Adapted Wavelets, Fast Solvers, and Numerical Homogenization: From a Game Theoretic Approach to Numerical Approximation and Algorithm Design, Cambridge University Press, 2019.
  • [48] H. Owhadi, C. Scovel, and F. Schäfer, Statistical numerical approximation, Notices of the AMS, (2019).
  • [49] H. Owhadi, C. Scovel, and G. R. Yoo, Kernel mode decomposition and programmable/interpretable regression networks. arXiv preprint arXiv:1907.08592, 2019.
  • [50] H. Owhadi and G. R. Yoo, Kernel flows: From learning kernels from data into the abyss, Journal of Computational Physics, 389 (2019), pp. 22–47.
  • [51] H. Owhadi and L. Zhang, Gamblets for opening the complexity-bottleneck of implicit schemes for hyperbolic and parabolic ODEs/PDEs with rough coefficients, Journal of Computational Physics, 347 (2017), pp. 99–128.
  • [52] I. Palasti and A. Renyi, On interpolation theory and the theory of games, MTA Mat. Kat. Int. Kozl, 1 (1956), pp. 529–540.
  • [53] O. Perrin and P. Monestiez, Modelling of non-stationary spatial structure using parametric radial basis deformations, in GeoENV II—Geostatistics for Environmental Applications, Springer, 1999, pp. 175–186.
  • [54] H. Poincaré, Calcul des probabilités, Georges Carrés, 1896.
  • [55] J. Quinonero-Candela and C. E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, The Journal of Machine Learning Research, 6 (2005), pp. 1939–1959.
  • [56] V. D. Rădulescu, Qualitative analysis of nonlinear elliptic partial differential equations: monotonicity, analytic, and variational methods, Hindawi Publishing Corporation, 2008.
  • [57] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Numerical Gaussian processes for time-dependent and nonlinear partial differential equations, SIAM Journal on Scientific Computing, 40 (2018), pp. A172–A198.
  • [58] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686–707.
  • [59] M. Raissi, A. Yazdani, and G. E. Karniadakis, Hidden fluid mechanics: learning velocity and pressure fields from flow visualizations, Science, 367 (2020), pp. 1026–1030.
  • [60] J. O. Ramsay, G. Hooker, D. Campbell, and J. Cao, Parameter estimation for differential equations: a generalized smoothing approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69 (2007), pp. 741–796.
  • [61] R. Rico-Martinez and I. G. Kevrekidis, Continuous time modeling of nonlinear systems: A neural network-based approach, in IEEE International Conference on Neural Networks, IEEE, 1993, pp. 1522–1525.
  • [62] P. D. Sampson and P. Guttorp, Nonparametric estimation of nonstationary spatial covariance structure, Journal of the American Statistical Association, 87 (1992), pp. 108–119.
  • [63] A. Sard, Linear approximation, American Mathematical Society, 1963.
  • [64] R. Schaback and H. Wendland, Kernel techniques: from machine learning to meshless methods, Acta Numerica, 15 (2006), p. 543.
  • [65] F. Schäfer, M. Katzfuss, and H. Owhadi, Sparse cholesky factorization by Kullback–Leibler minimization, SIAM Journal on Scientific Computing, 43 (2021), pp. A2019–A2046.
  • [66] F. Schäfer, T. Sullivan, and H. Owhadi, Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity, Multiscale Modeling & Simulation, 19 (2021), pp. 688–730.
  • [67] M. Schober, D. Duvenaud, and P. Hennig, Probabilistic ODE solvers with Runge-Kutta means, in 28th Annual Conference on Neural Information Processing Systems, 2014, pp. 739–747.
  • [68] B. Scholkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT Press, 2018.
  • [69] Y. Shin, J. Darbon, and G. E. Karniadakis, On the convergence and generalization of physics informed neural networks. arXiv preprint arXiv:2004.01806, 2020.
  • [70] J. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics, 375 (2018), pp. 1339–1364.
  • [71] J. Skilling, Bayesian solution of ordinary differential equations, in Maximum entropy and Bayesian methods, C. R. Smith, G. J. Erickson, and P. O. Neudorfer, eds., Springer, 1992, pp. 23–37.
  • [72] J. Smoller, Shock Waves and Reaction—Diffusion Equations, Springer Science & Business Media, 2012.
  • [73] M. L. Stein, The screening effect in kriging, Annals of Statistics, 30 (2002), pp. 298–323.
  • [74] A. M. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica, 19 (2010), pp. 451–559.
  • [75] A. V. Sul’din, Wiener measure and its applications to approximation methods. I, Izvestiya Vysshikh Uchebnykh Zavedenii Matematika, (1959), pp. 145–158.
  • [76] L. P. Swiler, M. Gulian, A. L. Frankel, C. Safta, and J. D. Jakeman, A survey of constrained Gaussian process regression: Approaches and implementation challenges, Journal of Machine Learning for Modeling and Computing, 1 (2020).
  • [77] E. Tadmor, A review of numerical methods for nonlinear partial differential equations, Bulletin of the American Mathematical Society, 49 (2012), pp. 507–554.
  • [78] J. F. Traub, G. Wasilkowski, H. Wozniakowski, and E. Novak, Information-based complexity, SIAM Review, 36 (1994), pp. 514–514.
  • [79] J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski, Average case optimality for linear problems, Theoretical Computer Science, 29 (1984), pp. 1–25.
  • [80] T. Uchiyama and N. Sonehara, Solving inverse problems in nonlinear PDEs by recurrent neural networks, in IEEE International Conference on Neural Networks, 1993, pp. 99–102.
  • [81] R. van der Meer, C. Oosterlee, and A. Borovykh, Optimally weighted loss functions for solving PDEs with neural networks. arXiv preprint arXiv:2002.06269, 2020.
  • [82] A. W. van der Vaart and J. H. van Zanten, Reproducing kernel Hilbert spaces of Gaussian priors, in Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, Institute of Mathematical Statistics, 2008, pp. 200–222.
  • [83] A. V. Vecchia, Estimation and model identification for continuous spatial processes, Journal of the Royal Statistical Society: Series B (Methodological), 50 (1988), pp. 297–312.
  • [84] G. Wahba, Spline models for observational data, SIAM, 1990.
  • [85] S. Wang, H. Wang, and P. Perdikaris, On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks, Computer Methods in Applied Mechanics and Engineering, 384 (2021), p. 113938.
  • [86] S. Wang, X. Yu, and P. Perdikaris, When and why PINNs fail to train: a neural tangent kernel perspective. arXiv preprint arXiv:2007.14527, 2020.
  • [87] E. Weinan and B. Yu, The deep ritz method: a deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2018), pp. 1–12.
  • [88] H. Wendland, Scattered Data Approximation, Cambridge University Press, 2004.
  • [89] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, The MIT Press, 2006.
  • [90] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, Deep kernel learning, in Artificial Intelligence and Statistics, PMLR, 2016, pp. 370–378.
  • [91] X. Zhang, K. Z. Song, M. W. Lu, and X. Liu, Meshless methods based on collocation with radial basis functions, Computational Mechanics, 26 (2000), pp. 333–343.
  • [92] Y. Zhu and N. Zabaras, Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification, Journal of Computational Physics, 366 (2018), pp. 415–447.

Appendix A Diagonal Regularization of Kernel Matrices

In the context of GP regression it is common to regularize the kernel matrices by addition of a small diagonal term; in that context, doing so has the interpretation of assuming small Gaussian noise on top of the observations. This diagonal regularization is sometimes referred to as a “nugget”. Here we discuss a related approach to regularizing kernel matrices (\Theta and \widetilde{\Theta}) by perturbing them slightly; for brevity we use the terminology “nugget” throughout. In Appendix A.1 we present an adaptive approach to constructing a family of nugget terms that is tailored to our kernel matrices. Appendices A.2 through A.4 gather the detailed choices of nugget terms for the experiments in Subsections 3.5.1 through 3.5.3. Appendix A.5 contains the same details for the experiments in Subsection 4.4.

A.1 An Adaptive Nugget Term

One of the main computational issues in implementing our methodology is that the kernel matrix \Theta is ill-conditioned. As a consequence, we need to regularize this matrix to improve the stability of our algorithms. This regularization may introduce an accuracy floor, so it is important to choose the regularization term so that it has a small effect on accuracy; there is thus an accuracy-stability tradeoff. A traditional strategy for achieving this goal is to add a nugget term \eta I to \Theta, where \eta>0 is a small number and I is the identity matrix. By choosing a suitable \eta, the condition number of \Theta+\eta I can be improved significantly. However, there is an additional level of difficulty in our method since the matrix \Theta contains multiple blocks whose spectral properties can differ by orders of magnitude, since they can involve different orders of derivatives of the kernel function K. This observation implies that adding \eta I, which is uniform across all blocks, may be suboptimal in terms of the accuracy-stability tradeoff.

In what follows we adopt the same notation as Subsection 3.4.1. To address the ill-conditioning of \Theta, we consider adding an adaptive block-diagonal nugget term. More precisely, assume without loss of generality that \Theta^{(1,1)} corresponds to the pointwise measurements, i.e., L_{1}^{\mathbf{x}}=\updelta_{\mathbf{x}}. We then construct a block-diagonal matrix

R=\begin{bmatrix}I&&&\\ &\frac{\operatorname{tr}(\Theta^{(2,2)})}{\operatorname{tr}(\Theta^{(1,1)})}I&&\\ &&\ddots&\\ &&&\frac{\operatorname{tr}(\Theta^{(Q,Q)})}{\operatorname{tr}(\Theta^{(1,1)})}I\end{bmatrix}\,,

where we reweight the identity matrix in each diagonal block by a factor of the trace ratio between \Theta^{(q,q)} and \Theta^{(1,1)}. With this matrix, the adaptive nugget term is defined as \eta R with a global nugget parameter \eta>0. We find that once the parameter \eta is chosen suitably, our Gauss–Newton algorithm converges quickly and in a stable manner. During computations, we can compute the Cholesky factors of \Theta+\eta R offline and use back-substitution to invert them in each iteration of Gauss–Newton.
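A minimal sketch of this construction (illustrative Python only; the stand-in matrix, block sizes, and function names are ours and not the implementation used for the experiments) is given below: it forms R from the trace ratios of the diagonal blocks, adds \eta R, and computes a Cholesky factorization once for reuse inside the Gauss–Newton iterations.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def adaptive_nugget(Theta, block_sizes, eta):
    # Return Theta + eta * R, where R reweights the identity in each diagonal block
    # by tr(Theta^{(q,q)}) / tr(Theta^{(1,1)}), as described above.
    starts = np.cumsum([0] + list(block_sizes))
    traces = [np.trace(Theta[s:e, s:e]) for s, e in zip(starts[:-1], starts[1:])]
    weights = np.concatenate([np.full(n, tr / traces[0])
                              for n, tr in zip(block_sizes, traces)])
    return Theta + eta * np.diag(weights)

# Stand-in SPD matrix with two diagonal blocks of very different scales, mimicking the
# pointwise-evaluation and Laplacian-evaluation blocks of Theta.
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 300))
Theta = A @ A.T
Theta[200:, 200:] *= 4.0e3
Theta_reg = adaptive_nugget(Theta, block_sizes=[200, 100], eta=1e-9)

# Offline Cholesky factorization, reused at every Gauss-Newton step via back-substitution.
factor = cho_factor(Theta_reg)
rhs = rng.standard_normal(300)
z = cho_solve(factor, rhs)  # one linear solve inside a Gauss-Newton iteration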

Example NE.

For the numerical experiments in Subsection 1.1.4 pertaining to the nonlinear elliptic PDE (1), we observed that \Theta has two distinct diagonal blocks: one block corresponding to the pointwise evaluation functionals, with entries K(\mathbf{x}_{m},\mathbf{x}_{i}), and another block corresponding to pointwise evaluations of the Laplacian, with entries \Delta^{\mathbf{x}}\Delta^{\mathbf{x}^{\prime}}K(\mathbf{x},\mathbf{x}^{\prime})|_{(\mathbf{x}_{m},\mathbf{x}_{i})}. With M=1024 collocation points, the trace ratio between these blocks was of order 4000, so the difference between I and R is significant. Our experiments also showed that if we only add \eta I to regularize the matrix, then \eta as large as O(10^{-4}) is needed to obtain meaningful results, whereas with the nugget term \eta R we could choose \eta=10^{-9}, which leads to significantly improved results. We further explore these details below and in particular in Table 4. \Diamond
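As a quick sanity check of this order of magnitude (our own calculation, assuming the normalization K(\mathbf{x},\mathbf{x}^{\prime})=\exp(-|\mathbf{x}-\mathbf{x}^{\prime}|^{2}/(2\sigma^{2})), pointwise values at all M points, and Laplacian values at the M_{\Omega} interior points): for this kernel in dimension d one has K(\mathbf{x},\mathbf{x})=1 and \Delta^{\mathbf{x}}\Delta^{\mathbf{x}^{\prime}}K(\mathbf{x},\mathbf{x}^{\prime})|_{\mathbf{x}^{\prime}=\mathbf{x}}=d(d+2)/\sigma^{4}, which equals 5000 for d=2 and \sigma=0.2. With M=1024 and M_{\Omega}=900 (the setting of Table 4), the trace ratio is approximately 900\times 5000/1024\approx 4.4\times 10^{3}, consistent with the order-4000 figure quoted above.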

A.2 Choice of Nugget Terms for the Nonlinear Elliptic PDE

Below we discuss the choice of the nugget term in the numerical experiments of Subsection 3.5.1. The results in Figure 2 and Table 1 were obtained by employing the adaptive nugget term of Appendix A.1 with global parameters \eta=10^{-13} and \eta=10^{-12}, respectively.

We also compared our adaptive nugget term \eta R to the standard choice \eta I, for the nonlinearity \tau(u)=u^{3}. In each case, Cholesky factorization was applied to the regularized matrix and Gauss–Newton iterations were then employed to obtain the solutions. The L^{2} and L^{\infty} errors of the converged solutions are shown in Table 4. The results were averaged over 10 instances of a random sampling of the collocation points; “nan” means the algorithm was unstable and diverged. To obtain these results we terminated all Gauss–Newton iterations after 5 steps. Due to randomness in the initial guess, and in the sampling of the collocation points, some random trials did not converge within 5 steps. This variation also explains the non-monotonic behavior of the error in Table 4 as \eta decreases; these effects were more pronounced for the standard nugget term. Apart from these small variations, our results clearly demonstrate the superior accuracy and stability provided by our adaptive nugget term compared with the standard choice.

\eta                              10^{-1}    10^{-2}    10^{-3}    10^{-4}    10^{-5}    10^{-6}
\Theta+\eta I: L^{2} error        7.77e-02   4.46e-02   2.65e-02   1.56e-02   1.32e-02   1.46e-02
\Theta+\eta I: L^{\infty} error   6.43e-01   3.13e-01   1.99e-01   1.47e-01   1.33e-01   1.43e-01
\Theta+\eta R: L^{2} error        8.49e-02   9.29e-03   9.10e-03   3.34e-03   1.01e-03   3.36e-04
\Theta+\eta R: L^{\infty} error   4.02e-01   7.86e-02   5.58e-02   2.21e-02   7.17e-03   3.87e-03
\eta                              10^{-7}    10^{-8}    10^{-9}    10^{-10}   10^{-11}   10^{-12}
\Theta+\eta I: L^{2} error        1.37e-02   8.62e-03   1.01e-02   1.92e-02   nan        nan
\Theta+\eta I: L^{\infty} error   1.81e-01   8.28e-02   1.07e-01   3.05e-01   nan        nan
\Theta+\eta R: L^{2} error        1.55e-04   7.05e-05   4.56e-05   6.30e-06   1.73e-06   8.31e-07
\Theta+\eta R: L^{\infty} error   2.41e-03   1.07e-03   7.66e-04   8.92e-05   2.62e-05   1.17e-05
Table 4: Comparison of solution errors between the standard nugget term \eta I and our adaptive nugget term \eta R for the nonlinear elliptic PDE (1). Collocation points are uniformly sampled with M=1024 and M_{\Omega}=900; the Gaussian kernel has lengthscale parameter \sigma=0.2. Results are averaged over 10 realizations of the random collocation points. The maximum number of Gauss–Newton iterations was 5.

A.3 Choice of Nugget Terms for Burgers' Equation

For the numerical experiments in Subsection 3.5.2 we primarily used our adaptive nugget term as outlined in Appendix A.1. For the results in Figure 3 we used a global nugget parameter \eta=10^{-10}. For the convergence analysis in Table 2 we varied \eta with the number of collocation points to achieve better performance: for M\leq 1200 we used the larger nugget \eta=10^{-5}, and for M\geq 2400 we used \eta=10^{-10}. Choosing a smaller nugget for small values of M would still yield equally accurate results but required more iterations of the Gauss–Newton algorithm.

A.4 Choice of Nugget Terms for the Eikonal Equation

Our numerical experiments in Subsection 3.5.3 also used the adaptive nugget of Appendix A.1. We followed an approach to choosing the global parameter \eta similar to that used for Burgers' equation in Appendix A.3 above.

For the results in Figure 4 we used \eta=10^{-10}. For the convergence analysis in Table 3 we varied \eta for different values of M, i.e., we chose \eta=10^{-5} for M\leq 1200 and \eta=10^{-10} for M\geq 2400. As in the Burgers' experiments, we observed that smaller values of \eta for smaller choices of M cause the Gauss–Newton iterations to converge more slowly; varying \eta with M thus improved the efficiency of our framework.

A.5 Choice of Nugget Terms for Darcy Flow

Both of the matrices \Theta and \widetilde{\Theta} outlined in Subsection 4.1 are dense and ill-conditioned in the IP setting, so an appropriate nugget term should be chosen to regularize them. In the IP setting we propose to add adaptive nuggets for both \Theta and \widetilde{\Theta} using the same strategy as in Appendix A.1, except that the nuggets are constructed independently for each matrix. To this end we set \Theta\leftarrow\Theta+\eta R and \widetilde{\Theta}\leftarrow\widetilde{\Theta}+\tilde{\eta}\tilde{R}, where \tilde{\eta} and \tilde{R} denote the global nugget parameter and the re-weighted identity matrix corresponding to \widetilde{\Theta}. For the numerical experiments in Figure 5 we used \eta=\tilde{\eta}=10^{-5}.