Solving and Learning Nonlinear PDEs with Gaussian Processes
Y. Chen, B. Hosseini, H. Owhadi, A. M. Stuart
Abstract
We introduce a simple, rigorous, and unified framework for solving nonlinear partial differential equations (PDEs), and for solving inverse problems (IPs) involving the identification of parameters in PDEs, using the framework of Gaussian processes. The proposed approach: (1) provides a natural generalization of collocation kernel methods to nonlinear PDEs and IPs; (2) has guaranteed convergence for a very general class of PDEs, and comes equipped with a path to compute error bounds for specific PDE approximations; (3) inherits the state-of-the-art computational complexity of linear solvers for dense kernel matrices. The main idea of our method is to approximate the solution of a given PDE as the maximum a posteriori (MAP) estimator of a Gaussian process conditioned on solving the PDE at a finite number of collocation points. Although this optimization problem is infinite-dimensional, it can be reduced to a finite-dimensional one by introducing additional variables corresponding to the values of the derivatives of the solution at collocation points; this generalizes the representer theorem arising in Gaussian process regression. The reduced optimization problem has the form of a quadratic objective function subject to nonlinear constraints; it is solved with a variant of the Gauss–Newton method. The resulting algorithm (a) can be interpreted as solving successive linearizations of the nonlinear PDE, and (b) in practice is found to converge in a small number of iterations (2 to 10), for a wide range of PDEs. Most traditional approaches to IPs interleave parameter updates with numerical solution of the PDE; our algorithm solves for both parameter and PDE solution simultaneously. Experiments on nonlinear elliptic PDEs, Burgers’ equation, a regularized Eikonal equation, and an IP for permeability identification in Darcy flow illustrate the efficacy and scope of our framework.
Keywords: Kernel Methods, Gaussian Processes, Nonlinear Partial Differential Equations, Inverse Problems, Optimal Recovery.
AMS subject classifications: 60G15, 65M75, 65N75, 65N35, 47B34, 41A15, 35R30, 34B15.
1 Introduction
Two hundred years ago, modeling a physical problem and solving the underlying differential equations would have required the expertise of Cauchy or Laplace, and it would have been done by hand through a laborious process of discovery. Sixty years ago, the resolution of the underlying differential equation would have been addressed using computers, but modeling and design of the solver would have still been done by hand. Nowadays, there is increasing interest in automating these steps by casting them as machine learning problems. The resulting methods can be divided into two main categories: (1) methods based on variants of artificial neural networks (ANNs) [22]; and (2) methods based on kernels and Gaussian Processes (GPs) [89, 68]. In the context of (1) there has been recent activity toward solving nonlinear PDEs, whilst the systematic development of methods of type (2) for nonlinear PDEs has remained largely open. However, methods of type (2) hold potential for considerable advantages over methods of type (1), both in terms of theoretical analysis and numerical implementation. In this paper, our goal is to develop a simple kernel/GP framework for solving nonlinear PDEs and related inverse problems (IPs); in particular the methodology we introduce has the following properties:
- theoretically, the proposed method is provably convergent and amenable to rigorous numerical analysis, suggesting new research directions to generalize the analysis of linear regression methods [88] to the proposed collocation setting for solving nonlinear PDEs;
- computationally, it inherits the complexity of state-of-the-art solvers for dense kernel matrices, suggesting new research to generalize the work of [65], which developed optimal approximate methods for linear regression, to the proposed collocation setting for solving nonlinear PDEs and IPs;
- since ANN methods can also be interpreted as kernel methods with kernels learned from data [27, 45, 90], our framework could also be used for the theoretical analysis of such methods.
In Subsection 1.1 we summarize the theoretical foundations and numerical implementation of our method in the context of a simple nonlinear elliptic PDE. In Subsection 1.2 we give a literature review, placing the proposed methodology in the context of other research at the intersection of machine learning and PDEs. The outline of the article is given in Subsection 1.3.
1.1 Summary of the Proposed Method
For demonstration purposes, we summarize the key ideas of our method for solving a nonlinear second-order elliptic PDE. This PDE will also serve as a running example in Section 3 where we present an abstract framework for general nonlinear PDEs.
Let $d \ge 1$ and let $\Omega$ be a bounded open domain in $\mathbb{R}^d$ with a Lipschitz boundary $\partial\Omega$. Consider the following nonlinear elliptic equation for $u$:
(1)   $-\Delta u(x) + \tau\big(u(x)\big) = f(x), \;\; \forall x \in \Omega, \qquad u(x) = g(x), \;\; \forall x \in \partial\Omega,$
where the data $f$, $g$ and the nonlinearity $\tau$ are given continuous functions. We assume that they are chosen appropriately so that the PDE has a unique classical solution (for the abstract theory of nonlinear elliptic PDEs see, for example, [56, 72]). In Subsection 1.1.4 we will present a concrete numerical experiment where $d = 2$ and $\tau(u) = u^3$.
1.1.1 Optimal Recovery
Our proposed method starts with an optimal recovery problem that can also be interpreted as maximum a posteriori (MAP) estimation for a GP constrained by a PDE. More precisely, consider a nondegenerate, symmetric, and positive definite kernel function $K : \overline{\Omega} \times \overline{\Omega} \to \mathbb{R}$, where $\overline{\Omega} := \Omega \cup \partial\Omega$. Let $\mathcal{U}$ be the RKHS associated with $K$ and denote the RKHS norm by $\|\cdot\|$. Let $1 \le M_\Omega < M$ and fix $M$ points $\mathbf{x}_1, \dots, \mathbf{x}_M$ in $\overline{\Omega}$ such that $\mathbf{x}_1, \dots, \mathbf{x}_{M_\Omega} \in \Omega$ and $\mathbf{x}_{M_\Omega + 1}, \dots, \mathbf{x}_M \in \partial\Omega$. Then, our method approximates the solution $u^\star$ of (1) with a minimizer $u^\dagger$ of the following optimal recovery problem:
(2)   minimize $\|u\|$ over $u \in \mathcal{U}$, subject to $-\Delta u(\mathbf{x}_m) + \tau\big(u(\mathbf{x}_m)\big) = f(\mathbf{x}_m)$ for $1 \le m \le M_\Omega$, and $u(\mathbf{x}_m) = g(\mathbf{x}_m)$ for $M_\Omega + 1 \le m \le M$.
Here, we assume $K$ is chosen appropriately so that $\mathcal{U}$ embeds into a space of sufficiently regular functions, which ensures the pointwise PDE constraints in (2) are well-defined.
A minimizer $u^\dagger$ can be interpreted as a MAP estimator of a GP with prior $\mathcal{N}(0, \mathcal{K})$, where $\mathcal{K}$ is the integral operator with kernel $K$, conditioned on satisfying the PDE at the collocation points $\mathbf{x}_1, \dots, \mathbf{x}_M$. (This Gaussian prior notation is equivalent to the GP notation with covariance function $K$; see further discussion in Subsection 3.4.1.) Such a view has been introduced for solving linear PDEs in [43, 44] and a closely related approach is studied in [12, Sec. 5.2]; the methodology introduced via (2) serves as a prototype for generalization to nonlinear PDEs. Here it is important to note that in the nonlinear case the conditioned GP is no longer Gaussian in general; thus the solution of (2) is a MAP estimator only and is not the conditional expectation, except in the case where $\tau$ is a linear function.
In the next subsections, we show equivalence of (2) and a finite-dimensional constrained optimization problem (5). This provides existence of a minimizer to (2) (uniqueness of a minimizer is not necessarily a consequence of the assumed uniqueness of the solution to the PDE (1); we provide additional discussion of uniqueness results in Subsection 5.1), as well as the basis for a numerical method to approximate the minimizer, based on solution of an unconstrained finite-dimensional optimization problem (6).
1.1.2 Finite-Dimensional Representation
The key to our numerical algorithm for solving (2) is a representer theorem that characterizes via a finite-dimensional representation formula. To achieve this we first reformulate (2) as a two level optimization problem:
(3) |
Denote and , where is the Dirac delta function centered at . We further use the shorthand notation and for the and -dimensional vectors with entries and respectively, and for the -dimensional vector obtained by concatenating and . Similarly, we write for the -dimensional vector concatenating and .
For a fixed $\mathbf{z}$, we can solve the first-level optimization problem explicitly due to the representer theorem (see [47, Sec. 17.8]); this is not the standard RKHS/GP representer theorem [89, Sec. 2.2], in the sense that the measurements include pointwise observations of higher-order derivatives of the underlying GP (see [30] and [84, p. xiii] for related spline representation formulas with derivative information). This leads to the conclusion that
(4) |
here denotes the -dimensional vector field with entries and is the -matrix with entries with the denoting the entries of . For this solution, , so we can equivalently formulate (3) as a finite-dimensional optimization problem:
(5) |
Moreover, using the equation and the boundary data, we can further eliminate and rewrite it as an unconstrained problem:
(6) |
where we use $\mathbf{x}_\Omega$ and $\mathbf{x}_{\partial\Omega}$ to denote the interior and boundary points respectively, $\mathbf{z}$ denotes the $M_\Omega$-dimensional vector of values of $u$ associated with the interior points, while $f(\mathbf{x}_\Omega)$, $g(\mathbf{x}_{\partial\Omega})$, and $\tau(\mathbf{z})$ are vectors obtained by applying the corresponding functions to the entries of their input vectors. For brevity we have suppressed the transpose signs in the row vector multiplying the matrix $K(\phi,\phi)^{-1}$ from the left in (6).
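To make the objects entering (4)-(6) concrete, the following sketch, in Python with JAX (the toolchain used for the experiments of Section 3.5), assembles the kernel matrix $K(\phi,\phi)$ for measurement functionals consisting of pointwise evaluations at all collocation points and Laplacian evaluations at the interior points. It is a minimal illustration rather than our actual implementation: the Gaussian kernel, the lengthscale value, and all function names are illustrative choices.

```python
import jax
import jax.numpy as jnp

def gaussian_kernel(x, y, sigma=0.2):
    # K(x, y) = exp(-|x - y|^2 / (2 sigma^2)); sigma is an illustrative value.
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2.0 * sigma**2))

def laplacian_x(k):
    # Laplacian of k(x, y) with respect to the first argument.
    return lambda x, y: jnp.trace(jax.hessian(k, argnums=0)(x, y))

def laplacian_y(k):
    # Laplacian of k(x, y) with respect to the second argument.
    return lambda x, y: jnp.trace(jax.hessian(k, argnums=1)(x, y))

def gram(k, X, Y):
    # Pairwise evaluation of a bivariate scalar function on two point sets.
    return jax.vmap(lambda x: jax.vmap(lambda y: k(x, y))(Y))(X)

def build_kernel_matrix(X_all, X_int, k=gaussian_kernel):
    # K(phi, phi) for phi = (pointwise evaluations at X_all, Laplacians at X_int).
    # X_all is assumed to list the interior points first, then the boundary points.
    k_dy = laplacian_y(k)
    k_dxdy = laplacian_x(laplacian_y(k))
    A = gram(k, X_all, X_all)        # [K(x_i, x_j)]
    B = gram(k_dy, X_all, X_int)     # [Delta^{x'} K(x_i, x_j)]
    C = gram(k_dxdy, X_int, X_int)   # [Delta^{x} Delta^{x'} K(x_i, x_j)]
    return jnp.block([[A, B], [B.T, C]])
```

Nested automatic differentiation produces the blocks $\Delta^{x} K$, $\Delta^{x'} K$, and $\Delta^{x}\Delta^{x'} K$ appearing in $K(\phi,\phi)$.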
The foregoing considerations lead to the following existence result, which underpins our numerical method for (1); furthermore, (6) provides the basis for our numerical implementations. (Although this existence result is desirable, it is not necessary, in the sense that our variational analysis of minimizers of (2) remains true for near-minimizers; furthermore, the error estimates of Section 5.2 can be modified to include the error incurred by approximating with a near-minimizer rather than a minimizer.) We summarize these facts:
Theorem 1.1.
Proof 1.2.
Problems (2), (3) and (5) are equivalent. It is therefore sufficient to show that (5) has a minimizer. Write $\mathbf{z}^\star$ for the vector with entries given by the values of $u^\star$ and of $\Delta u^\star$ at the collocation points. Since $u^\star$ solves the PDE (1), $\mathbf{z}^\star$ satisfies the constraints on $\mathbf{z}$ in (5). It follows that the minimization in (5) can be restricted to the set of $\mathbf{z}$ that satisfy both the resulting bound on the quadratic objective and the nonlinear constraints. This set is compact and non-empty: compact because the constraint maps are continuous, so the constraint set is closed as the pre-image of a closed set, and the intersection of a closed set with a compact set is compact; and non-empty because it contains $\mathbf{z}^\star$. Thus the objective function achieves its minimum in this set. Once a minimizing $\mathbf{z}$ is found, we can extend it to the boundary points and use the nonlinear relation between the values of $u$ and of $\Delta u$ to reconstruct the full measurement vector; substituting into (4) then gives a minimizer of (2).
1.1.3 Convergence Theory
The consistency of our method is guaranteed by the convergence of towards as , the total number of collocation points, goes to infinity. We first present this result in the case of the nonlinear PDE (1) and defer a more general version to Subsection 3.2. We also give a simple proof of convergence here as it is instructive in understanding why the method works and how the more general result can be established. Henceforth we use to denote the standard Euclidean norm and write to denote being continuously embedded in Banach space .
Theorem 1.3.
Proof 1.4.
The proof relies on a simple compactness argument together with the Sobolev embedding theorem [1, 8]. First, as satisfies the constraint in (2) and is the minimizer, it follows that for all . This implies because is continuously embedded into . Let so that embeds continuously into [1, Thm. 4.12]. Since is compactly embedded into , we deduce that there exists a subsequence and a limit such that converges towards in the norm, as . This also implies convergence in due to the continuous embedding of into .
We now show that satisfies the desired PDE in (1). Denote by and . For any and , use of the triangle inequality shows that
(7)
where we have used the fact that , and is the number of interior points associated with the total collocation points. Since is continuous in a compact domain, it is also uniformly continuous. Thus, it holds that since the fill-distance converges to zero by assumption. Moreover, we have that converges to in the norm due to the convergence from to . Combining these facts with (7), and taking , we obtain , and thus , for . Following a similar argument, we get for . Thus, is a classical solution to (1). By assumption, the solution is unique, so we must have . As the limit is independent of the particular subsequence, the whole sequence must converge to in . Since , we also get pointwise convergence and convergence in for any .
We note that this convergence theorem requires to be adapted to the solution space of the PDE so that belongs to . In our numerical experiments, we will use a Gaussian kernel. However, if or the boundary are not sufficiently regular, then the embedding conditions may be satisfied by using kernel as the Green’s function of some power of the Laplacian, instead of a Gaussian kernel; the latter induces smoothness on which may be incompatible with the regularity of for irregular and . We will provide further insights into the convergence properties of our method, including an error bound between the solution and a minimizer in Section 5.2.
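As an illustration of such an alternative, a Matérn-type kernel, whose RKHS is a Sobolev-type space of finite smoothness (Matérn kernels are closely related to Green's functions of powers of shifted Laplacians), can be used in place of the Gaussian kernel. The sketch below is a minimal example with an illustrative lengthscale; the smoothness order must be taken large enough that the RKHS supports the derivatives appearing in the PDE.

```python
import jax.numpy as jnp

def matern52_kernel(x, y, sigma=0.2):
    # Matern kernel with smoothness nu = 5/2: unlike the infinitely smooth Gaussian
    # kernel, its RKHS is a Sobolev-type space of finite order.
    r = jnp.sqrt(jnp.sum((x - y) ** 2) + 1e-12)  # small shift avoids NaN derivatives at r = 0
    s = jnp.sqrt(5.0) * r / sigma
    return (1.0 + s + s**2 / 3.0) * jnp.exp(-s)
```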

1.1.4 Numerical Framework
The representation of via the optimization problem (6) is the cornerstone of our numerical framework for solving nonlinear PDEs. Indeed, efficient solution of (6), and in turn construction and inversion of the matrix , are the most costly steps of our numerical implementation. We summarize several main ingredients of our algorithm below:
- We solve the unconstrained optimization problem (6) with a variant of the Gauss–Newton algorithm, described in Subsection 3.4.2; in our experiments it converges in a small number of iterations.
- In practice we perturb the kernel matrix $K(\phi,\phi)$ to improve its condition number, and hence the numerical stability of the algorithm, by adding a small diagonal matrix; this perturbation is adapted to the problem at hand, as outlined in Appendix A.1; the approach generalizes the idea of a "nugget" as widely used in GP regression.
- To evaluate the cost function in (6), we pre-compute the Cholesky factorization of the (perturbed) kernel matrix and store it for multiple uses. State-of-the-art linear solvers for dense kernel matrices can be used for this step.
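The sketch below illustrates the last two ingredients: it adds a small diagonal perturbation (the value shown is illustrative; the adapted choices we use in practice are described in Appendix A.1), factors the perturbed kernel matrix once, and reuses the factor to evaluate the quadratic form appearing in (6).

```python
import jax.numpy as jnp
import jax.scipy.linalg as jsl

def factor_kernel_matrix(theta, nugget=1e-10):
    # Add a small diagonal ("nugget") to improve conditioning, then factor once.
    L = jnp.linalg.cholesky(theta + nugget * jnp.eye(theta.shape[0]))
    return L  # lower-triangular Cholesky factor, reused for all loss evaluations

def quadratic_form(L, v):
    # Evaluates v^T K(phi, phi)^{-1} v using the stored Cholesky factor.
    w = jsl.solve_triangular(L, v, lower=True)
    return jnp.dot(w, w)
```

In the objective (6), the vector passed to quadratic_form concatenates the candidate values of the solution at the interior points, the boundary data, and the Laplacian values obtained by solving the PDE constraint at the interior points.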
As a demonstration, we present here a simple experiment to show the convergence of our method. We take $d = 2$, $\Omega = (0,1)^2$, and $\tau(u) = u^3$ or $\tau(u) = 0$, together with homogeneous Dirichlet boundary conditions $g = 0$. The true solution is prescribed in closed form and the corresponding right-hand side is computed using the equation.
We choose the Gaussian kernel
$K(\mathbf{x}, \mathbf{x}') = \exp\!\big(-|\mathbf{x} - \mathbf{x}'|^2/(2\sigma^2)\big)$
with lengthscale parameter $\sigma$; we will fix this parameter, but note that the framework is naturally adapted to learning of such hyperparameters. We vary the total number of collocation points $M$, sampling them on a uniform grid within $\overline{\Omega}$. We apply our algorithm to solve the PDEs and compute the $L^2$ and $L^\infty$ errors with respect to the true solution. In the case $\tau(u) = 0$, which corresponds to a linear PDE, we also compare our algorithm with a second-order finite difference (FD) method. For the nonlinear case $\tau(u) = u^3$, we observe that the Gauss–Newton algorithm only needs 2 steps to converge. The errors versus $M$ are depicted in Figure 1. The following observations can be made from this figure:
- In the linear case $\tau(u) = 0$, where the corresponding optimization problem (6) is convex, our algorithm outperforms the FD method in terms of accuracy and rate of convergence. This can be attributed to the fact that the true solution is very smooth, and the Gaussian kernel allows for efficient approximation.
- The choice of the kernel parameter $\sigma$ has a profound impact on the accuracy and rate of convergence of the algorithm, especially when $M$ is not very large. This underscores the importance of choosing a "good kernel" that is adapted to the solution space of the PDE, and highlights the importance of hyperparameter learning.
- In the nonlinear setting, our algorithm demonstrates similar convergence behavior to the linear setting. Once again, an appropriate choice of $\sigma$ leads to significant gains in solution accuracy.
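To quantify the accuracy reported in Figure 1, the recovered solution can be evaluated on a dense test grid through the representer formula (4) and compared with the prescribed true solution. The sketch below is illustrative only; it assumes the helper functions of the earlier sketches, a unit-square domain, and a user-supplied true solution.

```python
import jax
import jax.numpy as jnp
import jax.scipy.linalg as jsl

def predict(x, z_full, L, X_all, X_int, k=gaussian_kernel):
    # u(x) = K(x, phi) K(phi, phi)^{-1} z, cf. the representer formula (4).
    kx = jnp.concatenate([
        jax.vmap(lambda y: k(x, y))(X_all),               # K(x, x_m)
        jax.vmap(lambda y: laplacian_y(k)(x, y))(X_int),  # Delta^{x'} K(x, x_m)
    ])
    return jnp.dot(kx, jsl.cho_solve((L, True), z_full))

def grid_errors(z_full, L, X_all, X_int, u_true, n=100):
    # Approximate L2 and Linf errors on an n-by-n test grid over the unit square.
    xs = jnp.linspace(0.0, 1.0, n)
    XX, YY = jnp.meshgrid(xs, xs)
    pts = jnp.stack([XX.ravel(), YY.ravel()], axis=-1)
    u_hat = jax.vmap(lambda x: predict(x, z_full, L, X_all, X_int))(pts)
    err = u_hat - jax.vmap(u_true)(pts)
    return jnp.sqrt(jnp.mean(err**2)), jnp.max(jnp.abs(err))
```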
1.2 Relevant Literature
Machine learning and statistical inference approaches to numerical approximation have attracted a lot of attention in recent years thanks to their remarkable success in engineering applications. Such approaches can be broadly divided into two categories: (1) GP/Kernel-based methods, and (2) ANN-based methods.
GPs and kernel-based methods have long been used in function approximation, regression, and machine learning [89]. As surveyed in [48]:
“[kernel approaches can be traced back to] Poincaré’s course in Probability Theory [54] and to the pioneering investigations of Sul’din [75], Palasti and Renyi [52], Sard [63], Kimeldorf and Wahba [31] (on the correspondence between Bayesian estimation and spline smoothing/interpolation) and Larkin [35] (on the correspondence between Gaussian process regression and numerical approximation). Although their study initially attracted little attention among numerical analysts [35], it was revived in Information Based Complexity (IBC) [78], Bayesian numerical analysis [18], and more recently in Probabilistic Numerics [25] and Bayesian numerical homogenization [43].”
For recent reviews of the extensive applications of GP/kernel methods see [13, 48, 76] and [47, Chap. 20]. In particular, they have been introduced to motivate novel methods for solving ordinary differential equations (ODEs) [71, 9, 67] and underlie the collocation approach advocated for parameter identification in ODEs as described in [60]. For PDEs, the use of kernel methods can be traced back to meshless collocation methods with radial basis functions [91, 88, 64]. Furthermore, a recent systematic development towards solving PDEs was initiated in [43, 44, 51, 47] and has led to the identification of fast solvers for elliptic PDEs and dense kernel matrices [66, 65] with state-of-the-art complexity versus accuracy guarantees. The effectiveness of these methods has been supported by well-developed theory [47] residing at the interfaces between numerical approximation [84, 88], optimal recovery [40], information-based complexity [78], and GP regression [6], building on the perspective introduced in [18, 35, 40, 79, 84]. In particular, there is a one-to-one correspondence [47, 66] between (1) the locality of basis functions (gamblets) obtained with kernel methods and the screening effect observed in kriging [73], (2) Schur complementation and conditioning of Gaussian vectors, and (3) the approximate low-rank structure of inverse stiffness matrices obtained with kernel methods and variational properties of Gaussian conditioning. Furthermore, although the approach of modeling a deterministic function (the solution of a PDE) as a sample from a Gaussian distribution may seem counterintuitive, it can be understood (in the linear setting [47]) as an optimal minimax strategy for recovering the solution from partial measurements. Indeed, as in von Neumann's theory of games, optimal strategies are mixed (randomized) strategies and (using quadratic norms to define losses) GP regression (kriging) emerges as an optimal minimax approach to numerical approximation [40, 47].
On the other hand, ANN methods can be traced back to [19, 33, 34, 61, 80] and, although developed for ODEs several decades ago [11, 61], with some of the work generalized to PDEs [32], their systematic development for solving PDEs has been initiated only recently. This systematic development includes the Deep Ritz Method [87], Physics Informed Neural Networks [58] (PINNs), DGN [70], and [23] which employs a probabilistic formulation of the nonlinear PDE via the Feynman-Kac formula. Although the basic idea is to replace unknown functions with ANNs and minimize some loss with respect to the parameters of the ANN to identify the solution, there are by now many variants of ANN approaches, and we refer to [29] for a recent review of this increasingly popular topic at the interface between scientific computing and deep learning. While ANN approaches have been shown to be empirically successful on complex problems (e.g., machine learning physics [59]), they may fail on simple ones [86, 81] if the architecture of the ANN is not adapted to the underlying PDE [85]. Moreover, the theory of ANNs is still in its infancy; most ANN-based PDE solvers lack theoretical guarantees and convergence analyses are often limited to restricted linear instances [69, 86]. Broadly speaking, in comparison with kernel methods, ANN methods have both limited theoretical foundations and unfavorable complexity vs accuracy estimates. We also remark that ANN methods are suitable for the learning of the parametric dependence of solutions of differential equations [92, 5, 36, 37, 7]; however, GP and kernel methods may also be used in this context, and the random feature method provides a natural approach to implementing such methods in high dimensions [41].
Regardless, the theory and computational framework of kernel methods may naturally be extended to ANN methods in order to investigate such methods (beyond the infinite-width neural tangent kernel regime [27, 86]) and possibly accelerate them, by viewing them as ridge regression with data-dependent kernels and following [66, 65]. To this end, ANN methods can be interpreted as kernel methods with data-dependent kernels in two equivalent ways: (1) as solving PDEs in an RKHS spanned by a feature map parameterized by the initial layers of the ANN that is learned from data; or (2) as kernel-based methods with kernels that are parameterized by the inner layers of the network. For instance, [45] shows that Residual Neural Networks [24] (ResNets) are ridge regressors with warped kernels [62, 53]. Since our approach employs an explicit kernel, it allows us to learn that kernel directly, as discussed in Subsection 5.3. Given the notorious difficulty of developing numerical methods for nonlinear PDEs [77], it is to some degree surprising that (as suggested by our framework) (A) this difficulty can universally be decomposed into three parts: (1) the compression/inversion of dense kernel matrices, (2) the selection of the kernel, and (3) the minimization of the reduced finite-dimensional optimization problem (24), and (B) a significant portion of the resulting analysis can be reduced to that of linear regression [47].
Beyond solving PDEs, ANN methods have also been used in data-driven discretizations, and discovery of PDEs that allow for the identification of the governing model [58, 39, 4]; this leads to applications in IPs. Our method, viewed as a GP conditioned on PDE constraints at collocation points, can be interpreted as solving an IP with Bayesian inference and a Gaussian prior [74]. Indeed, if we relax the PDE constraints as in Subsection 3.3.2 then a minimizer coincides with a MAP estimator of a posterior measure obtained by viewing the PDE constraints as nonlinear pointwise measurements of a field with a Gaussian prior . Bayesian IPs with Gaussian priors have been studied extensively (see [17, 14, 74] and references therein). The standard approach for their solution is to discretize the problem using spectral projection or finite element methods, and compute posterior MAP estimators [16] or employ Markov chain Monte Carlo algorithms [15] to simulate posterior samples. Our abstract representation theorem outlined in Section 2.1 completely characterizes the MAP estimator of posterior measures with Gaussian priors in settings where the forward map is written as the composition of a nonlinear map with bounded linear functionals acting on the parameter. Indeed, this is precisely the approach that we employ in Section 4 to solve IPs with PDE constraints. However, the main difference between our methodology and existing methods in the literature is that we pose the IP as that of recovering the solution of the PDE simultaneously with learning the unknown PDE parameter with independent Gaussian priors on both unknowns.
We now turn to motivation for the GP interpretation. The PDE solvers obtained here are deterministic and could be described from an entirely classical numerical approximation perspective. However, we emphasize the GP interpretation for two reasons: (i) it is integral to the derivation of the methods; and (ii) it allows the numerical solution of the PDE to be integrated into a larger engineering pipeline and, in that context, the posterior distribution of the GP conditioned on the PDE at collocation points provides a measure of uncertainty quantification. Using the GP perspective as a pathway to the discovery of numerical methods was the motivation for the work in [43, 44]. Indeed, as discussed in [43], while the discovery of numerical solvers for PDEs is usually based on a combination of insight and guesswork, this process can be facilitated to the point of being automated, using this GP perspective. For instance, for nonsmooth PDEs, basis functions with near-optimal accuracy/localization tradeoff and operator-valued wavelets can be identified by conditioning physics-informed Gaussian priors on localized linear measurements (e.g., local averages or pointwise values). Physics-informed Gaussian priors can, in turn, be identified by (a) filtering uninformed Gaussian priors through the inverse operator [43], or (b) turning the process of computing fast with partial information into repeated zero-sum games with physics-informed losses (whose optimal mixed strategies are physics-informed Gaussian priors) [44]. The paper [57] generalized (a) to time-dependent PDEs by filtering uninformed priors through linearized PDEs obtained via time stepping. The paper [12] emphasized the probabilistic interpretation of numerical errors obtained from this Bayesian perspective. In particular, [12, Sec. 5.2] describes a method identical to the one considered here (and in [43]) for linear problems; in the setting of semi-linear PDEs, it is suggested in [12] that latent variables could be employed to efficiently sample from posterior/conditional measures (see also [45], where latent variables were employed to reduce nonlinear optimal recovery problems via two-level optimization as in (3)). Although the methodology proposed in our paper agrees with that in [12] for linear problems, the methodology we propose appears to be more general, and differs in the case of nonlinear problems to which both approaches apply.
1.3 Outline of Article
The remainder of this article is organized as follows. We give an overview of the abstract theory of GPs on Banach spaces in Section 2; we establish notation, and summarize basic results and ideas that are used throughout the remainder of the article. Section 3 is dedicated to the development of our numerical framework for solving nonlinear PDEs with kernel methods; we outline our assumptions on the PDE, present a general convergence theory, discuss our approach to implementation, and present numerical experiments. In Section 4 we extend our nonlinear PDE framework to IPs and discuss the implementation differences between the PDE and IP settings, followed by numerical experiments involving a benchmark IP in subsurface flow. Finally, we present additional discussions, results, and possible extensions of our method in Section 5. Appendix A is devoted to the small diagonal regularization of kernel matrices (“nugget” term) and outlines general principles as well as specifics for the examples considered in this paper.
2 Conditioning GPs on Nonlinear Observations
In this section, we outline the abstract theory of RKHSs/GPs and their connection to Banach spaces endowed with quadratic norms; this forms the framework for the proposed methodology to solve PDEs and IPs. We start by recalling some basic facts about GPs in Subsection 2.1. This is followed in Subsection 2.2 by general results pertaining to conditioning of GPs on linear and nonlinear observations, leading to a representer theorem that is the cornerstone of our numerical algorithms. Some of these results may be known to experts, but to the best of our knowledge, they are not presented in the existing literature with the coherent goal of solving nonlinear problems via the conditioning of GPs; hence this material may be of independent interest to the reader.
2.1 GPs and Banach Spaces Endowed with a Quadratic Norm
Consider a separable Banach space and its dual with their duality pairing denoted by . We further assume that is endowed with a quadratic norm , i.e., there exists a linear bijection that is symmetric (), positive ( for ), and such that
The operator endows and with the following inner products:
Note that the inner product defines a norm on that coincides with the standard dual norm of , i.e.,
As in [47, Chap. 11], although these are also Hilbert spaces under the quadratic norms, we will keep using the Banach space terminology to emphasize the fact that our dual pairings will not be based on the inner product through the Riesz representation theorem, but on a different realization of the dual space. A particular case of the setting considered here is $\mathcal{U} = H^s_0(\Omega)$ (writing $H^s_0(\Omega)$ for the closure of the set of smooth functions with compact support in $\Omega$ with respect to the Sobolev norm $\|\cdot\|_{H^s}$), with its dual $\mathcal{U}^* = H^{-s}(\Omega)$ defined by the pairing obtained from the Gelfand triple $H^s_0(\Omega) \subset L^2(\Omega) \subset H^{-s}(\Omega)$. Here the covariance operator can be defined as the solution map of an elliptic operator mapping $H^s_0(\Omega)$ to $H^{-s}(\Omega)$ (we call an invertible operator elliptic if it is positive and symmetric in the appropriate sense).
We say that is the canonical GP [47, Chap. 17.6] on and write , if and only if is a linear isometry from to a centered Gaussian space (a closed linear space of scalar valued centered Gaussian random variables). The word canonical indicates that the covariance operator of coincides with the bijection defining the norm on . Write for the image of under and note that the following properties hold:
2.2 Conditioning GPs with Linear and Nonlinear Observations
Let be non-trivial elements of and define
(8) |
Now consider the canonical GP , then is an -valued Gaussian vector and where
(9) |
The following proposition characterizes the conditional distribution of GPs under these linear observations; to simplify the statement it is useful to write for the vector with entries . This type of vectorized notation is used in [47].
Proposition 2.1.
Consider a vector and the canonical GP . Then conditioned on is also Gaussian. Moreover if is invertible then with conditional mean defined by and conditional covariance operator defined by
Proposition 2.1 gives a finite representation of the conditional mean of the GP constituting a representer theorem [47, Cor. 17.12]. Let us define the elements
(10) |
referred to as gamblets in the parlance of [44] which can equivalently be characterized as the minimizers of the variational problem
(11) |
This fact further enables the variational characterization of the conditional mean directly in terms of the gamblets .
Proposition 2.2.
Proposition 2.2 is the cornerstone of our methodology for solution of nonlinear PDEs. It is also useful for the solution of IPs. For this purpose consider nonlinear functions and and vectors and and consider the optimization problem:
(12) |
where is a parameter. We will use this formulation of IPs in PDEs, with concatenating the solution of the forward PDE problem and the unknown parameter; the nonlinear constraint on enforces the forward PDE and the observed noisy data. Then a representer theorem still holds for a minimizer of this problem stating that the solution has a finite expansion in terms of the gamblets :
Proposition 2.3.
Suppose the functionals are fixed and the matrix defined in (9) is invertible (relaxing the interpolation constraints renders this invertibility assumption unnecessary; nevertheless we keep it for ease of presentation). Then $u^\dagger$ is a minimizer of (12) if and only if it admits a finite expansion in terms of the gamblets (10) and the corresponding coefficient vector is a minimizer of
(13) |
Proof 2.4.
We note that this model assumes independent and identically distributed (i.i.d.) observation noise for the vector and can easily be extended to correlated observation noise by replacing the misfit term in (12) with an appropriately weighted misfit term of the form where denotes the covariance matrix of the observation noise.
Remark 2.5.
It is intuitive that a minimizer of the optimization problem we introduce and solve in this paper corresponds to a MAP point for the GP conditioned on PDE constraints at the collocation points. To prove this will require extension of the approach introduced in [16], for example, and is left for future work. Here we describe this connection informally in the absence of the equality constraints. Consider the prior measure and the measurements. It then follows from Bayes' rule [74] that the posterior measure, given the data, is identified as the measure
The article [16] showed that the MAP estimators of are solutions to
Letting then yields (12).
3 Solving Nonlinear PDEs
In this section, we present our framework for solution of nonlinear PDEs by conditioning GPs on nonlinear constraints. In Subsection 3.1 we outline our abstract setting as well as our assumptions on PDEs of interest; this leads to Corollary 3.1 which states an analogue of Proposition 2.3 in the PDE setting. We analyze the convergence of our method in Subsection 3.2 and discuss two strategies for dealing with the nonlinear PDE constraints in Subsection 3.3. Next, we present the details pertaining to numerical implementations of our method, including the choice of kernels and a Gauss–Newton algorithm in Subsection 3.4. Finally, we present a set of numerical experiments in Subsection 3.5 that demonstrate the effectiveness of our method in the context of prototypical nonlinear PDEs.
3.1 Problem Setup
Let us consider a bounded domain for and a nonlinear PDE of the form
(15) |
Here is a nonlinear differential operator and is an appropriate boundary operator with data . Throughout this section and for brevity, we assume that the PDE at hand is well-defined pointwise and has a unique strong solution; extension of our methodology to weak solutions is left as a future research direction. We then consider to be an appropriate quadratic Banach space for the solution such as a Sobolev space with sufficiently large regularity index .
We propose to solve the PDE (15) by approximating by a GP conditioned on satisfying the PDE at a finite set of collocation points in and proceed to approximate the solution by computing the MAP point of such a conditioned GP. More precisely, let be a collection of points in ordered in such a way that are in the interior of while are on the boundary. Now let be a quadratic Banach space with associated covariance operator and consider the optimization problem:
(16) |
In other words, we wish to approximate with the minimum norm element of the Cameron–Martin space of that satisfies the PDE and boundary data at the collocation points . In what follows we write to denote the space of bounded and linear operators from to another Banach space . We make the following assumption regarding the operators :
There exist bounded and linear operators in which for some , and there are maps and , which may be nonlinear, so that and can be written as
(17)
For prototypical nonlinear PDEs the for are linear differential operators such as first or second order derivatives while the maps and are often simple algebraic nonlinearities. Furthermore, observe that for ease of presentation we are assuming fewer linear operators are used to define the boundary conditions than the operators that define the PDE in the interior.
Example NE (Nonlinear Elliptic PDE).
Under Assumption 3.1 we can then define the functionals by setting
(18) |
We further use the shorthand notation to denote the vector of dual elements for a fixed index . Observe that if while if . We further write
(19) |
and define
(20) |
To this end, we define the measurement vector by setting
(21) |
as well as the nonlinear map
(22) |
We can now rewrite the optimization problem (16) in the same form as (12):
(23) |
Then a direct application of Proposition 2.3 yields the following corollary.
Corollary 3.1.
The above corollary is the foundation of our numerical algorithms for approximation of the solution , as and the gamblets can be approximated offline while the coefficients can be computed by solving the optimization problem (24).
3.2 Convergence Theory
We state and prove a more general version of Theorem 1.3 for our abstract setting of PDEs on Banach spaces with quadratic norms stating that a minimizer of (16) converges to the true solution under sufficient regularity assumptions and for appropriate choices of the operator .
Theorem 3.2.
Consider the PDE (15) and suppose that where is a Banach space such that the first inclusion from the left is given by a compact embedding and are sufficiently large so that all derivatives appearing in the PDE are defined pointwise for elements of . Furthermore assume that the PDE has a unique classical solution and that, as ,
Write for a minimizer of (16) with distinct collocation points. Then, as , the sequence of minimizers converges towards pointwise in and in .
Proof 3.3.
The method of proof is similar to that of Theorem 1.3. Indeed, by the same argument as in the first paragraph of the proof for Theorem 1.3, there exists a subsequence that converges to in . This also implies convergence in and due to the assumed continuous embedding of into . Since are sufficiently large so that all derivatives appearing in the PDE are defined pointwise for elements of , we get that converges to in and . As is compact, is also uniformly continuous in .
For any and , the triangle inequality shows that
(25)
where in the first inequality we have used the fact that . Here is the number of interior points associated with the total collocation points. Taking and using the uniform continuity of and the convergence from to , we derive that . In a similar manner we can derive . Thus, the limit is a classical solution to the PDE. By the uniqueness of the solution we must have . Finally, as the limit is independent of the choice of the subsequence, the whole sequence must converge to pointwise and in .
We note that while this theorem does not provide a rate for convergence of towards it relies on straightforward conditions that are readily verifiable for prototypical PDEs. Typically we choose large enough so that the PDE operators are pointwise defined for the elements of (e.g., order of PDE ) and take the space to be a Sobolev-type space of appropriate regularity for the inclusion to hold; also see the conditions of Theorem 1.3 and the subsequent discussion. The compact embedding can then be ensured by an appropriate choice of the covariance operator (or the associated kernel ). However, this choice should result in a sufficiently large space that includes the solution of the PDE. Our conditions on the collocation points simply ensure that these points form a dense subset of as .
3.3 Dealing with the Constraints
Now, we turn our attention to the equality constraints in (24) and present two strategies for elimination or relaxation of these constraints; these transform the optimization problem to an unconstrained one. They are crucial preliminary steps before introducing our numerical framework.
3.3.1 Eliminating the Equality Constraints
The equality constraints in (24) can be eliminated under slightly stronger assumptions on the maps. In particular, suppose that the following assumption holds.
Assumption: The equations
can be solved as finite-dimensional algebraic equations, i.e., there exist and so that
(26) |
for selected indices and . Then for integer defined by (19), and using the solution maps , we can then define a new solution map so that
With this new solution map we can rewrite (24) as an unconstrained optimization problem.
Corollary 3.4.
Example NE.
Let us recall that we already eliminated the equality constraints in the context of the PDE (1) through the calculations leading to the unconstrained minimization problem (6). In that example, we used the calculation
that is, we solved in terms of and the source term in the interior of the domain in order to eliminate the PDE constraint. We further imposed the boundary conditions exactly since the boundary map is simply the pointwise evaluation function in that example.
Alternatively, we could eliminate the point values by expressing them through $\tau^{-1}$ applied to the appropriate combination of the source term and the Laplacian values, assuming that $\tau^{-1}$ has a closed form. While both elimination strategies are conceptually valid, they may lead to very different optimization problems. The former corresponds to solving for the values of $u$ at the collocation points, while the latter solves for the values of $\Delta u$ at the interior points under Dirichlet boundary conditions at the boundary collocation points.
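For the nonlinear elliptic example with $\tau(u) = u^3$, the two elimination strategies just described amount to simple algebraic maps; a minimal sketch, with illustrative function names, is given below.

```python
import jax.numpy as jnp

def tau(v):
    # Nonlinearity of the running example.
    return v**3

def eliminate_laplacian_values(z_int, f_int):
    # Solve -Delta u(x_m) + tau(u(x_m)) = f(x_m) for the Laplacian values:
    # Delta u(x_m) = tau(u(x_m)) - f(x_m); this is the elimination used for (6).
    return tau(z_int) - f_int

def eliminate_point_values(z_lap, f_int):
    # Alternative: solve for the point values, assuming tau^{-1} has closed form
    # (for tau(v) = v^3, tau^{-1}(w) = sign(w) |w|^{1/3}).
    w = f_int + z_lap
    return jnp.sign(w) * jnp.abs(w) ** (1.0 / 3.0)
```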
3.3.2 Relaxing the Equality Constraints
The choice of the solution maps in (26), i.e., the choice of the variable which the equations are solved for, has an impact on the conditioning of (27); it is not a priori clear that poor conditioning can always be avoided by choice of variables to solve for. Moreover, for certain nonlinear PDEs Assumption 3.3.1 may not hold. In such cases it may be useful to relax the equality constraints in (23) and instead consider a loss of the following form:
(28) |
where is a small positive parameter. Likewise (24) can be relaxed to obtain
(29) |
Then a similar argument to the proof of Theorem 3.2 can be used to show that a minimizer of the relaxed optimization problem for converges to the solution of the PDE as the number of collocation points increases and the parameter vanishes.
Proposition 3.5.
Proof 3.6.
By the arguments used in Proposition 2.3 the minimizer of (29) delivers a minimizer of (28) in the desired form. Since satisfies we must have . Then a compactness argument similar to that used in the proof of Theorem 3.2, noting that taking as delivers exact satisfaction of the constraints in the limit, yields the desired result.
Example NE.
When only part of the constraints can be explicitly solved, as is often the case for boundary values, we can also combine the elimination and relaxation approach. Employing the relaxation approach for the interior constraint and the elimination approach for the boundary constraints in (1) amounts to replacing the optimization problem (5), which is the analogue of (24) for our running example, with the following problem for a small parameter :
(30) |
We will numerically compare the above approach with the full elimination approach (6) in Subsection 3.5.1.
3.4 Implementation
We now outline the details of a numerical algorithm for solution of nonlinear PDEs based on the discussions of the previous subsection and in particular Corollary 3.1. We discuss the construction of the matrix in Subsection 3.4.1 followed by a variant of the Gauss–Newton algorithm in Subsection 3.4.2 for solving the unconstrained or relaxed problems outlined in Subsections 3.3. We also note that strategies for regularizing the matrix by adding small diagonal (“nugget”) terms are collected in Appendix A.
3.4.1 Constructing
We established through Corollary 3.1 that a solution to (16) can be completely identified by a minimizer of (24) as well as the gamblets . Since here we are concerned with the strong form of the PDE (15) it is reasonable to assume that at the very least ; although we often require higher regularity so that the PDE constraints can be imposed pointwise. This assumption suggests that our GP model for can equivalently be identified via a covariance kernel function as opposed to the covariance operator . To this end, given a covariance operator define the covariance kernel (equivalently Green’s function of ) as
(31) |
It is known that the kernel completely characterizes the GP under mild conditions [82]. Let us now consider the matrix in block form
Using the duality pairing between and we can identify the blocks
where we used the shorthand notation of Subsection 1.1 for the kernel matrix, with the defined as in (18) and the subsequent discussion. To this end the entries of the take the form
where we used the superscripts to denote the variables with respect to which the differential operators act. Note that with following the definition of in Subsection 3.1.
3.4.2 A Gauss–Newton Algorithm
Here we outline a variant of the Gauss–Newton algorithm [42, Sec. 10.3] for solution of the unconstrained optimization problem (27). Recall our definition of the maps in (26) and in turn the map . We then propose to approximate a minimizer of (27) with a sequence of elements defined iteratively via where is an appropriate step size while is the minimizer of the optimization problem
and the gradient is computed with respect to the optimization variable only. (Note that our proposed method is nothing more than the standard Gauss–Newton algorithm with the Euclidean norm defining the least-squares functional replaced by a weighted norm [42, Sec. 10.3].)
This approach can also be applied to solve the relaxed problem (29); this time we consider the sequence of approximations in which each update is the minimizer of
Since the two objective functionals above are both quadratic in their respective optimization variables, they can be solved exactly and efficiently at each step, and the step-size parameters can be fixed or computed adaptively using standard step-size selection techniques [42]. However, in our experiments in Section 3.5, we find that both algorithms converge quickly simply by setting the step size equal to 1.
Example NE.
Let us return once more to the nonlinear elliptic PDE considered in Subsection 1.1. Observe that (6) is precisely in the form of (27) and so in order to formulate our Gauss–Newton iterations we need to linearize the vector valued function
which can easily be achieved by linearizing . To this end, we solve (27) via the iteration where is the minimizer of the functional
We also note that the sequence of approximations obtained by the above implementation of Gauss–Newton coincides with successive kernel collocation approximations of the solution of the following particular linearization of the PDE,
(32) |
subject to the Dirichlet boundary conditions.
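A minimal sketch of this iteration for the running example is given below; it is written against the helper functions of the earlier sketches (the Cholesky factor of the perturbed kernel matrix and the nonlinearity tau with its elimination of the Laplacian values), and the step size and iteration count are illustrative.

```python
import jax
import jax.numpy as jnp
import jax.scipy.linalg as jsl

def full_vector(z, f_int, g_bnd):
    # Measurement vector (interior values, boundary values, interior Laplacians),
    # with the Laplacians eliminated via Delta u(x_m) = tau(u(x_m)) - f(x_m).
    return jnp.concatenate([z, g_bnd, tau(z) - f_int])

def gauss_newton(z0, f_int, g_bnd, L, n_iter=10, alpha=1.0):
    # Minimizes v(z)^T K(phi, phi)^{-1} v(z) by successive linearization of tau.
    z = z0
    for _ in range(n_iter):
        v = full_vector(z, f_int, g_bnd)
        J = jax.jacfwd(full_vector)(z, f_int, g_bnd)   # dv/dz
        Kinv_v = jsl.cho_solve((L, True), v)
        Kinv_J = jsl.cho_solve((L, True), J)
        step = jnp.linalg.solve(J.T @ Kinv_J, J.T @ Kinv_v)
        z = z - alpha * step   # exact minimizer of the linearized quadratic problem
    return z
```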
3.4.3 Computational Bottlenecks
The primary computational cost of our method lies in the approximation of the matrix . Efficient factorizations and approximations of have been studied extensively in the GP regression literature [55] as well as spatial statistics, Kriging and numerical analysis (see [65, 66] and the discussions within). In this paper, we do not employ these algorithms and choose instead to use standard algorithms to factorize .
The algorithm introduced in [65] is particularly interesting as it directly approximates the Cholesky factors of by querying a subset of the entries of . In fact, that algorithm alleviates the need for a small diagonal regularization (“nugget”) term by directly computing the Cholesky factors of from the entries of . This could be done by extending the algorithm introduced and analyzed in [65]. This algorithm is based on the identification of an explicit formula for computing approximate Cholesky factors minimizing the Kullback-Leibler divergence between and given a sparsity constraint on the entries of . The proposed formula is equivalent to the Vecchia approximation [83] (popular in geostatistics). The resulting algorithm outlined in [65] computes approximate Cholesky factors of in complexity by accessing entries of .
Another possible bottleneck is the computation of the gamblets . The articles [47, 66] show that the gamblets can be approximated with compactly supported functions in complexity . We also note that the complexity-vs-accuracy guarantees of [47, 65, 66] have only been established for functionals that are Dirac delta functions and kernels that are the Green’s functions of arbitrary elliptic differential operators (mapping to ). Extension of those results to functionals considered here is an interesting future direction.
3.5 Numerical Experiments for Nonlinear PDEs
In this subsection, we implement our algorithm to solve several nonlinear PDEs, including the nonlinear elliptic equation in Subsection 3.5.1, Burgers’ equation in Subsection 3.5.2 and the regularized Eikonal equation in Subsection 3.5.3. For all of these equations, we will start with a fixed and demonstrate the performance of our algorithm by showing the pattern of collocation points, the loss function history of the Gauss–Newton iteration, and contours of the solution errors. Then, we vary the value of and study how the errors change with respect to . We also compare the elimination and relaxation approaches for dealing with the nonlinear constraints.
All the experiments are conducted using Python with the JAX package for automatic differentiation (we use JAX for convenience; all derivatives in our methodology can be computed using standard techniques such as symbolic computation or adjoint methods). In particular, we use automatic differentiation to form the kernel matrix that involves derivatives of the kernel function, and to optimize the loss function via the Gauss–Newton method. Details on the choice of small diagonal regularization ("nugget") terms for these experiments are presented in Appendices A.2 through A.4.
Remark 3.7.
In all of the numerical experiments in this section we used a set of collocation points that are drawn randomly from the uniform distribution over the domain , as opposed to the deterministic uniform grid used in Subsection 1.1.4. The choice of the random collocation points was made to highlight the flexibility of our methodology. Furthermore, random collocation points are often used in other machine learning algorithms for solution of PDEs such as PINNs [58] and so adopting this approach allows direct comparison with such methods. We observed empirically that the random grids had similar accuracy to the deterministic uniform grid in all experiments except for Burgers’ equation in Subsection 3.5.2, where random collocation points outperformed the uniform grid. Understanding this surprising performance gap is an interesting problem related to active learning and the acquisition of collocation points; we do not address this issue here.
Remark 3.8.
Float64 data type was employed in the experiments below. This allows the use of small diagonal regularization (“nugget”) terms (see Appendix A for details) which do not affect accuracy in the computations described in this paper. In contrast, if Float32 data type (the default setting in JAX) is used, we found the need to regularize with larger diagonal terms, leading to an observable accuracy floor.
3.5.1 A Nonlinear Elliptic PDE
We revisit the nonlinear elliptic equation (1). As in Subsection 1.1.4, we take $d = 2$ and $\tau(u) = u^3$, together with homogeneous Dirichlet boundary conditions $g = 0$. The true solution is prescribed in closed form and the corresponding right-hand side is computed using the equation. We choose the Gaussian kernel with a lengthscale parameter $\sigma$.
First, we randomly sample the collocation points as shown in Figure 2(a). We further show an instance of the convergence history of the Gauss–Newton algorithm in Figure 2(b), where we solved the unconstrained optimization problem (6) after eliminating the equality constraints. We used an appropriate kernel lengthscale and nugget terms as outlined in Appendix A.2. We initiated the algorithm with a Gaussian random initial guess and observe that only a few steps sufficed for convergence. In Figure 2(c), we show the contours of the solution error. The error in the approximate solution is seen to be fairly uniform spatially, with larger errors observed near the boundary. We note that the main difference between these experiments and those in Subsection 1.1.4 is that here we used randomly distributed collocation points while a uniform grid was used previously.
Next, we compare two approaches for dealing with the PDE constraints as outlined in Subsection 3.3. We applied both the elimination and relaxation approaches, defined by the optimization problems (27) and (29) respectively, for different choices of $M$. The $L^2$ and $L^\infty$ errors of the converged Gauss–Newton solutions are shown in Table 1. Results were averaged over 10 realizations of the random collocation points. From the table we observe that the difference in solution errors was very mild and both methods were convergent as $M$ increases. We note that in the relaxed setting, convergence is closely tied to our choice of the relaxation parameter, and choosing an inadequate value, i.e., too small or too large, can lead to inaccurate solutions. In terms of computational costs, the elimination approach takes 2-3 Gauss–Newton iterations on average, while the relaxation approach needs 5-8 iterations. Thus while the elimination strategy appears to be more efficient, we do not observe a significant difference in the order of complexity of the two methods for dealing with the constraints, especially when the number of collocation points becomes large.
| $M$ | 300 | 600 | 1200 | 2400 |
|---|---|---|---|---|
| Elimination: $L^2$ error | 4.84e-02 | 6.20e-05 | 2.74e-06 | 2.83e-07 |
| Elimination: $L^\infty$ error | 3.78e-01 | 9.71e-04 | 4.56e-05 | 5.08e-06 |
| Relaxation: $L^2$ error | 1.15e-01 | 1.15e-04 | 1.87e-06 | 1.68e-07 |
| Relaxation: $L^\infty$ error | 1.21e+00 | 1.45e-03 | 3.38e-05 | 1.84e-06 |
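For reference, the sketch below gives one way to implement a relaxed objective of the form (30), in which the boundary values are eliminated exactly while the interior PDE constraint is enforced through a quadratic penalty; the penalty weight and variable names are illustrative, and the helpers of the earlier sketches are reused.

```python
import jax.numpy as jnp
import jax.scipy.linalg as jsl

def relaxed_loss(z_int, z_lap, f_int, g_bnd, L, beta=1e-4):
    # RKHS term plus a penalty relaxing -Delta u(x_m) + tau(u(x_m)) = f(x_m);
    # the boundary values are fixed to the data g and hence eliminated exactly.
    v = jnp.concatenate([z_int, g_bnd, z_lap])
    w = jsl.solve_triangular(L, v, lower=True)
    residual = -z_lap + tau(z_int) - f_int
    return jnp.dot(w, w) + jnp.sum(residual**2) / beta**2
```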
3.5.2 Burgers’ Equation
We consider numerical solution of the viscous Burgers equation:
(33)
We adopt an approach in which we solve the problem by conditioning a Gaussian process in space-time. (It would also be possible to consider an incremental-in-time approach, for example using a backward Euler discretization, in which one iteratively in time solves a nonlinear elliptic two-point boundary value problem by conditioning a spatial Gaussian process; we do not pursue this here and leave it as a future direction.) In our experiments we take and consider . We write this PDE in the form of (17) with and with linear operators and the nonlinear map . The boundary part is simply . We then eliminate the equality constraints in our optimization framework following the approach of Subsection 3.3.1 using the equation .
We randomly sampled collocation points in the computational domain; see Figure 3(a), where we also plot contours of the true solution. The Gauss–Newton algorithm was then applied to solve the unconstrained optimization problem. We computed the true solution from the Cole–Hopf transformation, together with numerical quadrature. Since the time and space variability of the solution to Burgers' equation are significantly different, we chose an anisotropic kernel (one can also design or learn a non-stationary kernel using the approaches discussed in Subsection 5.3; however, the parameterization of such kernels and strategies for tuning their hyperparameters are outside the scope of this article), together with an adaptive diagonal regularization ("nugget") as outlined in Appendix A.3.
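The sketch below records one possible anisotropic kernel parameterization on space-time points $(t, x)$, the interior residual of the Burgers equation, and one possible elimination of the interior constraint (solving the PDE for the time derivative); the lengthscale values and the viscosity are placeholders.

```python
import jax.numpy as jnp

def anisotropic_gaussian_kernel(s, s2, lengthscales=jnp.array([0.3, 0.05])):
    # s = (t, x); separate lengthscales reflect the different variability of the
    # solution in time and space. The values here are placeholders.
    d = (s - s2) / lengthscales
    return jnp.exp(-0.5 * jnp.sum(d**2))

def burgers_interior_residual(u, u_t, u_x, u_xx, nu):
    # Residual of u_t + u u_x - nu u_xx = 0 in terms of the pointwise linear
    # measurements (u, u_t, u_x, u_xx) at a collocation point.
    return u_t + u * u_x - nu * u_xx

def eliminate_time_derivative(u, u_x, u_xx, nu):
    # One possible elimination of the interior constraint: solve the PDE for u_t.
    return nu * u_xx - u * u_x
```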
We plot the Gauss–Newton iteration history in Figure 3(b) and observe that 10 steps sufficed for convergence. We compare the converged solution to the true solution and present the contours of the error in Figure 3(c). The maximum errors occurred close to the (viscous) shock, as expected. In Figure 3(d–f), we also compare various time slices of the numerical and true solutions to further highlight the ability of our method to capture the location and shape of the shock.
Next, we studied the convergence properties of our method as a function of as shown in Table 2. Here, we varied with a fixed ratio of interior points, . For each experiment we ran steps of Gauss–Newton starting from a Gaussian random initial guess. Results were averaged over 10 realizations of the random collocation points. From the table, we observe that the error decreases very fast as increases, implying the convergence of our proposed algorithm.
Finally, we note that the accuracy of our method is closely tied to the choice of the viscosity parameter and choosing a smaller value of , which in turn results in a sharper shock, can significantly reduce our accuracy. This phenomenon is not surprising since a sharper shock corresponds to the presence of shorter length and time scales in the solution; these in turn, require a more careful choice of the kernel, as well as suggesting the need to carefully choose the collocation points.
| $M$ | 600 | 1200 | 2400 | 4800 |
|---|---|---|---|---|
| $L^2$ error | 1.75e-02 | 7.90e-03 | 8.65e-04 | 9.76e-05 |
| $L^\infty$ error | 6.61e-01 | 6.39e-02 | 5.50e-03 | 7.36e-04 |
3.5.3 Eikonal PDE
We now consider the regularized Eikonal equation in :
(34) |
with and . We write this PDE in the form of (17) with and and linear operators and nonlinear map in the interior of and define the boundary operator identically to Subsection 3.5.2. We further eliminate the nonlinear constraints, as outlined in Subsection 3.3.1, by solving in terms of . To obtain a “true” solution, for the purpose of estimating errors, we employ the transformation , which leads to the linear PDE ; we solve this by a highly-resolved FD method; we used uniform grid points in each dimension of the domain leading to the finest mesh that our hardware could handle.
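Under the assumption that the regularized equation takes the form $|\nabla u|^2 = f^2 + \varepsilon \Delta u$ (consistent with the elimination of the Laplacian in terms of the gradient described above), the algebra used in this setup can be sketched as follows; with the substitution $v = e^{-u/\varepsilon}$ one obtains the linear PDE $\varepsilon^2 \Delta v = f^2 v$ used for the reference solution. Parameter values and names are placeholders.

```python
import jax.numpy as jnp

def eliminate_laplacian(grad_u, f_val, eps):
    # Regularized Eikonal equation |grad u|^2 = f^2 + eps * Delta u:
    # solve the interior constraint for the Laplacian value at a collocation point.
    return (jnp.sum(grad_u**2) - f_val**2) / eps

def u_from_linearized(v, eps):
    # Recover u from the substitution v = exp(-u / eps), i.e. u = -eps * log(v);
    # v solves the linear PDE eps^2 * Delta v = f^2 * v with correspondingly
    # transformed boundary values.
    return -eps * jnp.log(v)
```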
As before, we began with collocation points, of which were interior points. An instance of these collocation points, along with contours of the true solution, is shown in Figure 4(a). We employed a nugget term as outlined in Appendix A.4 and used the Gaussian kernel, as in Subsection 3.5.1, with . Finally, we used the Gauss–Newton algorithm to find the minimizer. We show the convergence history of Gauss–Newton in Figure 4(b), observing that six iterations were sufficient for convergence. In Figure 4(c) we show the error contours of the obtained numerical approximation, which are qualitatively different from those in Figure 2(c): the errors are larger in the middle of the domain as well as close to the boundary.
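As an aside on the experimental setup, the following minimal sketch draws random collocation points with a prescribed interior/boundary split; the unit-square domain and the particular counts are assumptions for illustration and need not match the experiments above.

```python
import numpy as np

def sample_collocation_points(n_interior, n_boundary, rng=None):
    """Draw collocation points in the unit square: uniform interior points plus
    points spread along the four edges of the boundary (roughly n_boundary / 4
    per edge; any remainder is dropped for simplicity)."""
    rng = np.random.default_rng(rng)
    interior = rng.uniform(0.0, 1.0, size=(n_interior, 2))

    per_edge = n_boundary // 4
    s = rng.uniform(0.0, 1.0, size=(4, per_edge))
    boundary = np.vstack([
        np.column_stack([s[0], np.zeros(per_edge)]),   # bottom edge
        np.column_stack([s[1], np.ones(per_edge)]),    # top edge
        np.column_stack([np.zeros(per_edge), s[2]]),   # left edge
        np.column_stack([np.ones(per_edge), s[3]]),    # right edge
    ])
    return interior, boundary

# Example split (placeholder numbers, not those used in the experiments above).
interior_pts, boundary_pts = sample_collocation_points(1080, 120, rng=0)
```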
Next, we performed a convergence study by varying and computing the and errors reported in Table 3, with the lengthscale parameter of the kernel set to . We used the same type of nugget terms as for Burgers' equation (see Appendix A.4). Results were averaged over 10 realizations of the random collocation points. Once again we observe clear improvements in accuracy as the number of collocation points increases.
Number of collocation points | 300 | 600 | 1200 | 2400
---|---|---|---|---
$L^2$ error | 1.01e-01 | 1.64e-02 | 2.27e-04 | 7.78e-05
$L^\infty$ error | 3.59e-01 | 7.76e-02 | 3.22e-03 | 1.61e-03
4 Solving Inverse Problems
We now turn our attention to the solution of IPs and show that the methodology of Subsections 3.1–3.4 can readily be extended to solve such problems with small modifications. We describe the abstract setting of our IPs in Subsection 4.1, leading to Corollary 4.3, which is analogous to Proposition 2.3 and Corollary 3.1 in the setting of IPs. Subsection 4.2 outlines our approach for dealing with PDE constraints in IPs and highlights the differences from the PDE setting described in Subsection 3.3. Subsection 4.3 further highlights the implementation differences between the PDE and IP settings, while Subsection 4.4 presents a numerical experiment concerning an IP in subsurface flow governed by the Darcy flow PDE.
4.1 Problem Setup
Consider our usual setting of a nonlinear parametric PDE in strong form
(35)
As before we assume the solution belongs to a quadratic Banach space while is a parameter belonging to another quadratic Banach space . Our goal in this subsection is to identify the parameter from limited observations of the solution . To this end, fix and define
(36)
then our goal is to recover given the noisy observations
(37)
We propose to solve this inverse problem by modeling both and with canonical GPs on the spaces with invertible covariance operators and , respectively. We then condition these GPs to satisfy the PDE on collocation points as before and propose to approximate simultaneously via the optimization problem:
(38)
where we used subscripts to distinguish the quadratic norms on the spaces and .
Remark 4.1.
In light of Remark 2.5 we observe that (38) corresponds to imposing a prior measure on which assumes they are a priori independent. It is straightforward to introduce correlations between the solution and the parameter by defining the prior measure directly on the product space . This perspective leads to an optimization problem analogous to (38), with the same constraints but with the functional
where we used to denote the RKHS norm of the GP associated with .
Remark 4.2.
We also note that the Bayesian interpretation of (38) can be viewed as an extension of gradient matching [9, 38] from ODEs to PDEs. Indeed, gradient matching simultaneously approximates the unknown parameters and the solution of an ODE system using a joint GP prior and imposes the ODE as a constraint at finitely many time steps.
We make assumptions on the form of the operators analogous to those of Assumption 3.1, but this time also involving the parameters :
There exist bounded and linear operators in which for some , and together with maps and , which may be nonlinear, so that and can be written as
(39)
Similarly to the , the operators are again linear differential operators in the case of prototypical PDEs, while the maps remain simple algebraic nonlinearities. Let us briefly consider an IP in subsurface flow and verify the above assumption.
Example DF (Darcy flow IP).
Let and consider the Darcy flow PDE with Dirichlet boundary conditions
We wish to approximate given noisy pointwise observations of at a set of points . Thus, we take . By expanding the PDE we obtain
and so we simply choose , and with the linear operators
We can then satisfy Assumption 4.1 by taking
(40)
where we have slightly abused notation by letting be vector-valued and defining to take vectors as some of their inputs.
As in Subsection 3.1 we now define the functionals for and according to (18) and (20) with . Similarly we define the functionals as
(41)
together with the vectors
(42)
where . Similarly to (22) define the map
where we used subscripts to distinguish the duality pairings between and the pairing between . With this new notation we can finally rewrite (38) in the familiar form
(43)
with the PDE data vector defined in (21).
We can now apply Proposition 2.3 with the canonical GP defined on the product space and with a block diagonal covariance operator to obtain a representer theorem for the minimizer of (43). We state this result as a corollary below after introducing some further notation. Define the vector with entries (note that we are concatenating the measurement functionals defining the data for the IP with the functionals used to define the PDE at the collocation points)
as well as the matrices and with entries
As in (10) we define the gamblets
Then Proposition 2.3 yields the following corollary.
Corollary 4.3.
Suppose Assumption 4.1 holds and that the covariance operators and , as well as the matrices and , are invertible. Then is a minimizer of (43) if and only if
where the vectors are minimizers of
(44)
where is the projection that extracts the first entries of a vector while is the projection that extracts the last entries.
4.2 Dealing with the Constraints
The equality constraints in (44) can be dealt with using the same strategies as in Subsection 3.3. Indeed, as in Subsection 3.3.2, we can readily relax these constraints to obtain the optimization problem
(45)
for a small parameter . Elimination of the constraints as in Subsection 3.3.1 is slightly more delicate, but is sometimes possible. Suppose there exists a solution map so that
Then solving (44) is equivalent to solving the unconstrained problem
(46)
and setting and .
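To make the relaxation concrete, the sketch below evaluates a generic objective of the penalized form just described: a quadratic (RKHS-norm) term in each of the two representer vectors, a data-misfit term, and a penalty on the relaxed PDE constraints. The exact grouping and weighting of the terms, the residual callables, and the parameters `beta` and `noise_std` are assumptions for illustration, not a reproduction of (45).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def relaxed_ip_objective(z, w, Theta, Theta_hat, pde_residual, data_residual,
                         beta, noise_std):
    """Generic penalized objective of the form

        J(z, w) = z^T Theta^{-1} z + w^T Theta_hat^{-1} w
                  + (1 / noise_std^2) * || data_residual(z, w) ||^2
                  + (1 / beta^2)      * || pde_residual(z, w)  ||^2

    z, w          : representer values (solution / parameter and their relevant
                    derivatives at the collocation points)
    pde_residual  : callable returning the residual of the nonlinear PDE
                    constraints at the collocation points
    data_residual : callable returning the misfit to the noisy observations
    """
    cz = cho_factor(Theta)
    cw = cho_factor(Theta_hat)
    quad = z @ cho_solve(cz, z) + w @ cho_solve(cw, w)
    data = np.sum(data_residual(z, w) ** 2) / noise_std**2
    pde = np.sum(pde_residual(z, w) ** 2) / beta**2
    return quad + data + pde
```

In practice one would pass a Gauss–Newton linearization of this objective to the solver described in Subsection 3.4, rather than minimizing it directly.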
4.3 Implementation
Both of the problems (45) and (46) can be solved using the same techniques outlined in Subsection 3.4 except that we now have a higher dimensional solution space. Below we briefly describe the main differences between the implementation of the PDE and IP solvers.
4.3.1 Constructing and
We propose to construct the matrices using appropriate kernels , for the solution of the PDE, and , for the parameter, identically to Subsection 3.4.1. Our minimum requirement on these kernels is sufficient regularity for the pointwise constraints in (38) to be well-defined. Since we have limited and noisy data in the inverse problem setting, it is not possible in general to recover the exact solution, and so the kernels should be chosen to reflect our prior assumptions on the unknown parameter and the solution of the PDE at that parameter value.
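As an illustration of this construction, the sketch below assembles the Gram matrix for a Gaussian kernel and measurement functionals consisting of pointwise evaluations and pointwise Laplacians. Both the kernel and the particular functionals are assumptions made for illustration; in the IP setting the matrix for the parameter is built in the same way from its own kernel.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def lap_y_kernel(x, y, sigma):
    # Closed form of Delta_y k(x, y) for the Gaussian kernel.
    d, r2 = x.size, np.sum((x - y) ** 2)
    return (r2 / sigma**4 - d / sigma**2) * gaussian_kernel(x, y, sigma)

def lap_x_lap_y_kernel(x, y, sigma):
    # Closed form of Delta_x Delta_y k(x, y) for the Gaussian kernel.
    d, r2 = x.size, np.sum((x - y) ** 2)
    f = r2 / sigma**4 - d / sigma**2
    return (2 * d / sigma**4 - 4 * r2 / sigma**6 + f**2) * gaussian_kernel(x, y, sigma)

def build_gram_matrix(points, sigma):
    """Gram matrix for the functionals {delta_x} and {delta_x o Delta}:

        Theta = [[ k(X, X),          Delta_y k(X, X)         ],
                 [ Delta_x k(X, X),  Delta_x Delta_y k(X, X) ]]
    """
    K   = np.array([[gaussian_kernel(x, y, sigma)    for y in points] for x in points])
    KL  = np.array([[lap_y_kernel(x, y, sigma)       for y in points] for x in points])
    KLL = np.array([[lap_x_lap_y_kernel(x, y, sigma) for y in points] for x in points])
    return np.block([[K, KL], [KL.T, KLL]])

# In the IP setting, Theta is built from the kernel for the PDE solution and
# Theta_hat from a (possibly different) kernel for the parameter.
X = np.random.rand(50, 2)
Theta = build_gram_matrix(X, sigma=0.3)
```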
4.4 Numerical Experiments for Darcy Flow
In this subsection, we apply our method to an IP involving the Darcy flow PDE. We consider the IP outlined in Example DF with the true coefficient satisfying
and the right-hand-side source term is . We randomly sampled collocation points, with in the interior. From these interior points, we randomly chose points and observed the values of at those points as the data for the IP. The values of for this purpose were generated by first solving the equation with the true coefficient on a uniform grid and then using linear interpolation to obtain the solution at the observation points. We further added independent Gaussian noise with standard deviation to these observations. To deal with the nonlinear constraint shown in Example DF, we eliminated the variable using the relation in (40).
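A minimal sketch of this observation process (gridded reference solution, linear interpolation at the observation points, additive Gaussian noise) is given below; the unit-square domain, the grid resolution, and the noise level are placeholder assumptions rather than the values used in the experiment.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def make_observations(u_grid, obs_points, noise_std, rng=None):
    """Interpolate a gridded reference solution at the observation points
    (linear interpolation) and corrupt with i.i.d. Gaussian noise.

    u_grid     : (n, n) array -- reference solution on a uniform grid over [0, 1]^2
    obs_points : (m, 2) array -- observation locations
    noise_std  : standard deviation of the additive observation noise
    """
    rng = np.random.default_rng(rng)
    n = u_grid.shape[0]
    grid_1d = np.linspace(0.0, 1.0, n)
    interp = RegularGridInterpolator((grid_1d, grid_1d), u_grid, method="linear")
    clean = interp(obs_points)
    return clean + noise_std * rng.normal(size=clean.shape)

# Placeholder usage: u_grid would come from a reference solve with the true
# coefficient; the grid size and noise level here are illustrative only.
# data = make_observations(u_grid, obs_points, noise_std=1e-3, rng=0)
```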
We chose Gaussian kernels for both and with the same lengthscale parameter , and adaptive diagonal ("nugget") terms were added to the kernel matrices, with parameters , to regularize and , as outlined in Appendix A.5. In Figure 5 we show the experimental results for recovering both and . From the figure, we observe that the Gauss–Newton iterations converged after 6 steps. Moreover, the recovered and are reasonably accurate, i.e., they capture the shape of the truth, given the limited amount of observational data available.
5 Concluding Remarks
We have introduced a kernel/GP framework for solving nonlinear PDEs and IPs, centered on the idea of approximating the solution of a given PDE with a MAP estimator of a GP conditioned on satisfying the PDE at a set of collocation points. Theoretically, we exhibited a nonlinear representer theorem that reduces the MAP estimation problem to a finite-dimensional one, and proved the convergence of the resulting solution towards the truth as the number of collocation points goes to infinity, under some regularity assumptions. Computationally, we demonstrated that the solution can be found by solving a finite-dimensional optimization problem with a quadratic loss and nonlinear constraints. We presented two methods for dealing with the nonlinear constraints, namely the elimination and relaxation approaches. An efficient variant of the Gauss–Newton algorithm was also proposed for solving the resulting unconstrained optimization problem, with an adaptive nugget term employed for regularization together with offline Cholesky factorizations of the underlying kernel matrices. We demonstrated that the proposed algorithm performs well on a wide range of prototypical nonlinear problems, including a nonlinear elliptic PDE, Burgers' equation, a regularized Eikonal equation, and the identification of the permeability field in Darcy flow.
While our theoretical results established the existence of a solution to the finite-dimensional optimization problem and the convergence of towards the truth, the uniqueness of the solution is not guaranteed and the convergence theorem does not imply convergence rates. In what follows, we provide several additional discussions that hold the potential to address these two issues and point towards interesting future directions: in Subsection 5.1 we discuss how uniqueness can be obtained under an appropriate additional condition; in Subsection 5.2 we lay out a path towards rigorous error estimates, followed by a discussion of learning hierarchical kernel parameters in Subsection 5.3; finally, in Subsection 5.4, we connect the framework of this article to MAP estimation for Bayesian IPs with non-Gaussian priors.
5.1 Uniqueness and Optimality of Minimizers
While our theoretical results in Subsection 2.2 characterize the minimizers of nonlinearly constrained optimization problems in RKHSs, they do not imply the uniqueness of such minimizers. The lack of convexity due to the nonlinearity of precludes such uniqueness results in general; however, a positive result can be obtained under appropriate assumptions on the function . Recall (24), and assume that the following condition holds
(47)
Generally, this condition is equivalent to the requirement that the nonlinear manifold lies on only one side of the tangent hyperplane supported by with -normal direction . Note that (47) is always satisfied with equality if the PDE is linear, i.e., if is affine. The following proposition then states that the minimizer of (16) is not only unique but also the minimax optimal approximation of in , in the sense of optimal recovery [40].
Proposition 5.1.
Under condition (47) it holds true that:
1. is unique.
2. for any .
3. is one of the minimizers of the minimax loss .
Proof 5.2.
For any , we denote by the GP MAP solution after observing the values for . By definition, and . Thus, is orthogonal to the linear span of in the norm. It follows that . Furthermore, writing , we have
where we used the relation and the assumed condition (47). Thus, by orthogonality, we deduce that . This establishes the second point of the proposition, and the first point follows directly.
The proof of the third point is similar to that of [47, Thm. 18.2]. Here we shall prove that, for the given minimax loss, the optimal value is 1 and the minimizer is for any satisfying . To achieve this, first observe that the loss is always less than or equal to 1, because for , and . Thus, the optimal value is at most 1.
Moreover, let be a nonzero element in such that . Consider for any satisfying . Taking , we have that
this term converges to as . Thus, the optimal value is exactly 1 and are minimizers.
Note that the above proposition implies the uniqueness of the minimizer but does not imply the uniqueness of the solution . It is also interesting to observe that condition (47) only involves the nonlinear map and not the differential operators defining the PDE.
5.2 Error Estimates
Here we present rigorous error estimates in the PDE setting of Section 3, bounding the error between the minimizer of (16) and the true solution . The main idea behind our error bounds is to connect the conditional covariance of the underlying GP to the pointwise error of the conditional MAP estimator.
In the setting of Section 2.2, consider . Then the conditional covariance operator of Proposition 2.1 coincides with the short of the operator to [2, 46]: this is a symmetric and positive operator from to identified via
(48)
where we used the shorthand notation . This identification holds true even when the are not linearly independent. With the notion of the shorted operator at hand we can now present an abstract error bound on the action of dual elements on and .
Proposition 5.3.
Let be a Banach space endowed with a quadratic norm such that the embedding is compact and such that where is defined using the same duality pairing as for . Let be a minimizer of (16) approximating the unique solution of (15). Fix and define as in (20). Then it holds true that
(49)
where and
where is the positive symmetric matrix with entries . Furthermore, it also holds true that
(50)
Proof 5.4.
Define and . By the triangle inequality we have
Since then by [43, Thm. 5.1],
Then (48) implies that . Since minimizes subject to , we have , which leads to . Furthermore, observe that . Since and , we deduce that is bounded by the supremum of over all that belong to the ball of radius and satisfy the PDE constraints at the collocation points. Inequality (50) follows from and the fact that is the projection of onto [47, Thm. 12.3].
In the case where , we can take to be of the form , where is a differential operator of order at most for , and for . Then Proposition 5.3 allows us to bound the pointwise error of and its derivatives. As in Theorem 3.2, the compact embedding of into guarantees the convergence of and towards zero as the collocation points become dense in and . Note that, by (50), the strong approximation error between and is controlled by the sum of and the best approximation error in . Controlling this best approximation error is typically done in the dual space in the linear setting [47, Sec. 14.8]; in the nonlinear setting, it requires controlling the regularity of the solution (i.e., analyzing the underlying PDE). Moreover, if the fill distance of the collocation points is , and if the norm satisfies the embedding inequality , then [47, Lem. 14.39] implies .
5.3 Learning the Kernel
An important factor in the accuracy and performance of our method is the choice of the kernel (and, in turn, the operator ). The importance of the kernel choice is readily visible in Theorem 3.2 through the assumption that , the solution of the PDE, should belong to the RKHS corresponding to . As discussed in [43], even for linear PDEs, if the underlying kernel is not adapted to the solution space of the PDE, then the resulting method can perform arbitrarily badly [3]. This point stresses the importance of learning (or possibly programming [49, 45]) an informed kernel, beyond enforcing the PDE at collocation points. While the problem of selecting the kernel is well understood if the solution is smooth, or if the PDE is linear with possibly rough coefficients [43], it can be challenging when the underlying PDE is nonlinear and the solution is irregular. Since our approach employs an explicit kernel, it allows us to learn the kernel directly using kernel (hyperparameter) learning techniques such as maximum likelihood estimation (MLE) [10], MAP estimation [45], or cross-validation (CV) [21, 50, 10]. Adaptation of these techniques to the solution of nonlinear PDEs is an attractive direction for future research.
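For reference, a minimal sketch of standard MLE-based lengthscale selection for a single Gaussian-kernel hyperparameter is given below. This is generic GP-regression machinery rather than the PDE-adapted procedure envisioned above, and the data (X, y) are synthetic stand-ins.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize_scalar

def neg_log_marginal_likelihood(log_sigma, X, y, nugget=1e-8):
    """Negative log marginal likelihood of a zero-mean GP with Gaussian kernel
    of lengthscale sigma = exp(log_sigma), evaluated at data (X, y)."""
    sigma = np.exp(log_sigma)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * sigma**2)) + nugget * np.eye(len(X))
    c, low = cho_factor(K)
    alpha = cho_solve((c, low), y)
    log_det = 2.0 * np.sum(np.log(np.diag(c)))
    return 0.5 * (y @ alpha + log_det + len(y) * np.log(2 * np.pi))

# Synthetic stand-in data; in the PDE context X, y would be collocation points
# and (approximate) solution values or residual information.
X = np.random.rand(50, 2)
y = np.sin(2 * np.pi * X[:, 0]) * np.cos(2 * np.pi * X[:, 1])
res = minimize_scalar(neg_log_marginal_likelihood, bounds=(-4.0, 2.0),
                      args=(X, y), method="bounded")
sigma_mle = np.exp(res.x)
```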
5.4 Non-Gaussian priors
Although we have employed a Gaussian prior in the design of our nonlinear solver and in our Bayesian IPs, our methodology can naturally be generalized to priors that can be obtained as nonlinear transformations of Gaussian measures (e.g., instead of replacing the solution of a nonlinear PDE by a GP , one could replace it by where is a nonlinear function with the appropriate degree of regularity so that the pointwise derivatives of remain well-defined). This approach can potentially lead to efficient methods for computing MAP estimators of Bayesian IPs with certain non-Gaussian priors, a contemporary problem in the uncertainty quantification and IP literatures.
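A minimal sketch of such a transformed prior is given below; the choice of the exponential map (yielding a positive random field), the Gaussian kernel, and the lengthscale are all placeholders chosen purely for illustration.

```python
import numpy as np

def sample_transformed_gp(points, sigma, tau=np.exp, rng=None):
    """Draw one sample of tau(xi), where xi is a zero-mean GP with Gaussian
    kernel of lengthscale sigma evaluated at the given points."""
    rng = np.random.default_rng(rng)
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * sigma**2)) + 1e-10 * np.eye(len(points))
    xi = rng.multivariate_normal(np.zeros(len(points)), K)   # Gaussian sample
    return tau(xi)                                           # non-Gaussian sample

pts = np.random.rand(200, 2)
a_sample = sample_transformed_gp(pts, sigma=0.2, rng=1)      # a positive random field
```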
Acknowledgments
The authors gratefully acknowledge support by the Air Force Office of Scientific Research under MURI award number FA9550-20-1-0358 (Machine Learning and Physics-Based Modeling and Simulation). HO also acknowledges support by the Air Force Office of Scientific Research under award number FA9550-18-1-0271 (Games for Computation and Learning). YC is also grateful for support from the Caltech Kortchak Scholar Program.
References
- [1] R. A. Adams and J. J. Fournier, Sobolev Spaces, Elsevier, 2003.
- [2] W. Anderson, Jr and G. Trapp, Shorted operators. II, SIAM Journal on Applied Mathematics, 28 (1975), pp. 60–71.
- [3] I. Babuška and J. Osborn, Can a finite element method perform arbitrarily badly?, Mathematics of computation, 69 (2000), pp. 443–462.
- [4] Y. Bar-Sinai, S. Hoyer, J. Hickey, and M. P. Brenner, Learning data-driven discretizations for partial differential equations, Proceedings of the National Academy of Sciences, 116 (2019), pp. 15344–15349.
- [5] K. Bhattacharya, B. Hosseini, N. B. Kovachki, and A. M. Stuart, Model reduction and neural networks for parametric PDEs, The SMAI journal of computational mathematics, 7 (2021), pp. 121–157.
- [6] V. I. Bogachev, Gaussian measures, American Mathematical Society, 1998.
- [7] G. Boncoraglio and C. Farhat, Active manifold and model reduction for multidisciplinary analysis and optimization, in AIAA Scitech 2021 Forum, 2021, p. 1694.
- [8] H. Brezis, Functional Analysis, Sobolev Spaces and Partial Differential Equations, Springer Science & Business Media, 2010.
- [9] B. Calderhead, M. Girolami, and N. D. Lawrence, Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes, in Advances in Neural Information Processing Systems, 2009, pp. 217–224.
- [10] Y. Chen, H. Owhadi, and A. Stuart, Consistency of empirical Bayes and kernel flow for hierarchical parameter estimation, Mathematics of Computation, (2021).
- [11] A. Cochocki and R. Unbehauen, Neural networks for optimization and signal processing, John Wiley & Sons, Inc., 1993.
- [12] J. Cockayne, C. Oates, T. Sullivan, and M. Girolami, Probabilistic numerical methods for PDE-constrained Bayesian inverse problems, in AIP Conference Proceedings, vol. 1853, 2017, p. 060001.
- [13] J. Cockayne, C. J. Oates, T. J. Sullivan, and M. Girolami, Bayesian probabilistic numerical methods, SIAM Review, 61 (2019), pp. 756–789.
- [14] S. L. Cotter, M. Dashti, and A. M. Stuart, Approximation of Bayesian inverse problems for PDEs, SIAM Journal on Numerical Analysis, 48 (2010), pp. 322–345.
- [15] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White, MCMC methods for functions: modifying old algorithms to make them faster, Statistical Science, (2013), pp. 424–446.
- [16] M. Dashti, K. J. Law, A. M. Stuart, and J. Voss, MAP estimators and their consistency in Bayesian nonparametric inverse problems, Inverse Problems, 29 (2013), p. 095017.
- [17] M. Dashti and A. M. Stuart, The Bayesian approach to inverse problems, in Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, and H. Owhadi, eds., Springer International Publishing, 2016, pp. 1–118.
- [18] P. Diaconis, Bayesian numerical analysis, in Statistical Decision Theory and Related Topics IV, S. S. Gupta and J. Berger, eds., Springer, 1988.
- [19] B. Djeridane and J. Lygeros, Neural approximation of PDE solutions: An application to reachability computations, in Proceedings of the 45th IEEE Conference on Decision and Control, IEEE, 2006, pp. 3034–3039.
- [20] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of inverse problems, vol. 375, Springer Science & Business Media, 1996.
- [21] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning, Springer, 2001.
- [22] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning, vol. 1, MIT press, 2016.
- [23] J. Han, A. Jentzen, and E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
- [24] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [25] P. Hennig, M. A. Osborne, and M. Girolami, Probabilistic numerics and uncertainty in computations, Proceedings of the Royal Society A, 471 (2015), p. 20150142.
- [26] M. Hinze, R. Pinnau, M. Ulbrich, and S. Ulbrich, Optimization with PDE constraints, vol. 23, Springer Science & Business Media, 2008.
- [27] A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: convergence and generalization in neural networks, in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 8580–8589.
- [28] J. Kaipio and E. Somersalo, Statistical and computational inverse problems, vol. 160, Springer Science & Business Media, 2006.
- [29] G. Karniadakis, I. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang, Physics-informed machine learning, Nature Reviews Physics, (2021).
- [30] G. Kimeldorf and G. Wahba, Some results on Tchebycheffian spline functions, Journal of mathematical analysis and applications, 33 (1971), pp. 82–95.
- [31] G. S. Kimeldorf and G. Wahba, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Annals of Mathematical Statistics, 41 (1970), pp. 495–502.
- [32] K. Krischer, R. Rico-Martínez, I. Kevrekidis, H. Rotermund, G. Ertl, and J. Hudson, Model identification of a spatiotemporally varying catalytic reaction, AIChE Journal, 39 (1993), pp. 89–98.
- [33] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE transactions on Neural Networks, 9 (1998), pp. 987–1000.
- [34] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, Neural-network methods for boundary value problems with irregular boundaries, IEEE Transactions on Neural Networks, 11 (2000), pp. 1041–1049.
- [35] F. M. Larkin, Gaussian measure in Hilbert space and applications in numerical analysis, Journal of Mathematics, 2 (1972).
- [36] Z. Li, N. Kovachki, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anandkumar, Fourier neural operator for parametric partial differential equations, in International Conference on Learning Representations, 2020.
- [37] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, A. Stuart, K. Bhattacharya, and A. Anandkumar, Multipole graph neural operator for parametric partial differential equations, Advances in Neural Information Processing Systems, (2020).
- [38] H. Liang and H. Wu, Parameter estimation for differential equation models using a framework of measurement error in regression models, Journal of the American Statistical Association, 103 (2008), pp. 1570–1583.
- [39] Z. Long, Y. Lu, X. Ma, and B. Dong, PDE-net: Learning PDEs from data, in International Conference on Machine Learning, PMLR, 2018, pp. 3208–3216.
- [40] C. A. Micchelli and T. J. Rivlin, A survey of optimal recovery, in Optimal estimation in approximation theory, C. Micchelli and T. J. Rivlin, eds., Springer, 1977, pp. 1–54.
- [41] N. H. Nelsen and A. M. Stuart, The random feature model for input-output maps between Banach spaces. arXiv preprint arXiv:2005.10224, 2020.
- [42] J. Nocedal and S. Wright, Numerical Optimization, Springer Science & Business Media, 2006.
- [43] H. Owhadi, Bayesian numerical homogenization, Multiscale Modeling & Simulation, 13 (2015), pp. 812–828.
- [44] H. Owhadi, Multigrid with rough coefficients and multiresolution operator decomposition from hierarchical information games, SIAM Review, 59 (2017), pp. 99–149.
- [45] H. Owhadi, Do ideas have shape? Plato's theory of forms as the continuous limit of artificial neural networks. arXiv preprint arXiv:2008.03920, 2020.
- [46] H. Owhadi and C. Scovel, Conditioning Gaussian measure on Hilbert space, Journal of Mathematical and Statistical Analysis, 1 (2018).
- [47] H. Owhadi and C. Scovel, Operator-Adapted Wavelets, Fast Solvers, and Numerical Homogenization: From a Game Theoretic Approach to Numerical Approximation and Algorithm Design, Cambridge University Press, 2019.
- [48] H. Owhadi, C. Scovel, and F. Schäfer, Statistical numerical approximation, Notices of the AMS, (2019).
- [49] H. Owhadi, C. Scovel, and G. R. Yoo, Kernel mode decomposition and programmable/interpretable regression networks. arXiv preprint arXiv:1907.08592, 2019.
- [50] H. Owhadi and G. R. Yoo, Kernel flows: From learning kernels from data into the abyss, Journal of Computational Physics, 389 (2019), pp. 22–47.
- [51] H. Owhadi and L. Zhang, Gamblets for opening the complexity-bottleneck of implicit schemes for hyperbolic and parabolic ODEs/PDEs with rough coefficients, Journal of Computational Physics, 347 (2017), pp. 99–128.
- [52] I. Palasti and A. Renyi, On interpolation theory and the theory of games, MTA Mat. Kat. Int. Kozl, 1 (1956), pp. 529–540.
- [53] O. Perrin and P. Monestiez, Modelling of non-stationary spatial structure using parametric radial basis deformations, in GeoENV II—Geostatistics for Environmental Applications, Springer, 1999, pp. 175–186.
- [54] H. Poincaré, Calcul des probabilités, Georges Carrés, 1896.
- [55] J. Quinonero-Candela and C. E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, The Journal of Machine Learning Research, 6 (2005), pp. 1939–1959.
- [56] V. D. Rădulescu, Qualitative analysis of nonlinear elliptic partial differential equations: monotonicity, analytic, and variational methods, Hindawi Publishing Corporation, 2008.
- [57] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Numerical Gaussian processes for time-dependent and nonlinear partial differential equations, SIAM Journal on Scientific Computing, 40 (2018), pp. A172–A198.
- [58] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686–707.
- [59] M. Raissi, A. Yazdani, and G. E. Karniadakis, Hidden fluid mechanics: learning velocity and pressure fields from flow visualizations, Science, 367 (2020), pp. 1026–1030.
- [60] J. O. Ramsay, G. Hooker, D. Campbell, and J. Cao, Parameter estimation for differential equations: a generalized smoothing approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69 (2007), pp. 741–796.
- [61] R. Rico-Martinez and I. G. Kevrekidis, Continuous time modeling of nonlinear systems: A neural network-based approach, in IEEE International Conference on Neural Networks, IEEE, 1993, pp. 1522–1525.
- [62] P. D. Sampson and P. Guttorp, Nonparametric estimation of nonstationary spatial covariance structure, Journal of the American Statistical Association, 87 (1992), pp. 108–119.
- [63] A. Sard, Linear approximation, American Mathematical Society, 1963.
- [64] R. Schaback and H. Wendland, Kernel techniques: from machine learning to meshless methods, Acta numerica, 15 (2006), p. 543.
- [65] F. Schäfer, M. Katzfuss, and H. Owhadi, Sparse Cholesky factorization by Kullback–Leibler minimization, SIAM Journal on Scientific Computing, 43 (2021), pp. A2019–A2046.
- [66] F. Schäfer, T. Sullivan, and H. Owhadi, Compression, inversion, and approximate PCA of dense kernel matrices at near-linear computational complexity, Multiscale Modeling & Simulation, 19 (2021), pp. 688–730.
- [67] M. Schober, D. Duvenaud, and P. Hennig, Probabilistic ODE solvers with Runge-Kutta means, in 28th Annual Conference on Neural Information Processing Systems, 2014, pp. 739–747.
- [68] B. Scholkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT Press, 2018.
- [69] Y. Shin, J. Darbon, and G. E. Karniadakis, On the convergence and generalization of physics informed neural networks. arXiv preprint arXiv:2004.01806, 2020.
- [70] J. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics, 375 (2018), pp. 1339–1364.
- [71] J. Skilling, Bayesian solution of ordinary differential equations, in Maximum entropy and Bayesian methods, C. R. Smith, G. J. Erickson, and P. O. Neudorfer, eds., Springer, 1992, pp. 23–37.
- [72] J. Smoller, Shock Waves and Reaction—Diffusion Equations, Springer Science & Business Media, 2012.
- [73] M. L. Stein, The screening effect in kriging, Annals of statistics, 30 (2002), pp. 298–323.
- [74] A. M. Stuart, Inverse problems: a Bayesian perspective, Acta numerica, 19 (2010), pp. 451–559.
- [75] A. V. Sul’din, Wiener measure and its applications to approximation methods. I, Izvestiya Vysshikh Uchebnykh Zavedenii Matematika, (1959), pp. 145–158.
- [76] L. P. Swiler, M. Gulian, A. L. Frankel, C. Safta, and J. D. Jakeman, A survey of constrained Gaussian process regression: Approaches and implementation challenges, Journal of Machine Learning for Modeling and Computing, 1 (2020).
- [77] E. Tadmor, A review of numerical methods for nonlinear partial differential equations, Bulletin of the American Mathematical Society, 49 (2012), pp. 507–554.
- [78] J. F. Traub, G. Wasilkowski, H. Wozniakowski, and E. Novak, Information-based complexity, SIAM Review, 36 (1994), pp. 514–514.
- [79] J. F. Traub, G. W. Wasilkowski, and H. Woźniakowski, Average case optimality for linear problems, Theoretical Computer Science, 29 (1984), pp. 1–25.
- [80] T. Uchiyama and N. Sonehara, Solving inverse problems in nonlinear PDEs by recurrent neural networks, in IEEE International Conference on Neural Networks, 1993, pp. 99–102.
- [81] R. van der Meer, C. Oosterlee, and A. Borovykh, Optimally weighted loss functions for solving PDEs with neural networks. arXiv preprint arXiv:2002.06269, 2020.
- [82] A. W. van der Vaart and J. H. van Zanten, Reproducing kernel Hilbert spaces of Gaussian priors, in Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, Institute of Mathematical Statistics, 2008, pp. 200–222.
- [83] A. V. Vecchia, Estimation and model identification for continuous spatial processes, Journal of the Royal Statistical Society: Series B (Methodological), 50 (1988), pp. 297–312.
- [84] G. Wahba, Spline models for observational data, SIAM, 1990.
- [85] S. Wang, H. Wang, and P. Perdikaris, On the eigenvector bias of Fourier feature networks: From regression to solving multi-scale PDEs with physics-informed neural networks, Computer Methods in Applied Mechanics and Engineering, 384 (2021), p. 113938.
- [86] S. Wang, X. Yu, and P. Perdikaris, When and why PINNs fail to train: a neural tangent kernel perspective. arXiv preprint arXiv:2007.14527, 2020.
- [87] E. Weinan and B. Yu, The Deep Ritz method: a deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2018), pp. 1–12.
- [88] H. Wendland, Scattered Data Approximation, Cambridge University Press, 2004.
- [89] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, The MIT Press, 2006.
- [90] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, Deep kernel learning, in Artificial Intelligence and Statistics, PMLR, 2016, pp. 370–378.
- [91] X. Zhang, K. Z. Song, M. W. Lu, and X. Liu, Meshless methods based on collocation with radial basis functions, Computational Mechanics, 26 (2000), pp. 333–343.
- [92] Y. Zhu and N. Zabaras, Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification, Journal of Computational Physics, 366 (2018), pp. 415–447.
Appendix A Diagonal Regularization of Kernel Matrices
In the context of GP regression it is common to regularize the kernel matrices by addition of a small diagonal term; in that context, doing so has the interpretation of assuming small Gaussian noise on top of the observations. This diagonal regularization is sometimes referred to as a “nugget”. Here we discuss a related approach to regularizing kernel matrices ( and ) by perturbing them slightly; for brevity we use the terminology “nugget” throughout. In Appendix A.1 we present an adaptive approach to constructing a family of nugget terms that is tailored to our kernel matrices. Appendices A.2 through A.4 gather the detailed choices of nugget terms for the experiments in Subsections 3.5.1 through 3.5.3. Appendix A.5 contains the same details for the experiments in Subsection 4.4.
A.1 An Adaptive Nugget Term
One of the main computational issues in implementing our methodology is that the kernel matrix is ill-conditioned. As a consequence, we need to regularize this matrix to improve the stability of the algorithm. This regularization may introduce an accuracy floor, so it is important to choose the regularization term so that it has a small effect on accuracy; there is thus an accuracy-stability tradeoff. A traditional strategy for achieving this goal is to add a nugget term to , where is a small number and is the identity matrix. By choosing a suitable , the condition number of can be improved significantly. However, there is an additional level of difficulty in our method, since the matrix contains multiple blocks whose spectral properties can differ by orders of magnitude, because they involve different orders of derivatives of the kernel function . This observation implies that adding , which is uniform across all blocks, may be suboptimal in terms of the accuracy-stability tradeoff.
In what follows we adopt the same notation as Subsection 3.4.1. To address the ill-conditioning of , we consider adding an adaptive block-diagonal nugget term. More precisely, without loss of generality we can assume that corresponds to the pointwise measurements, i.e., ; we then construct a block-diagonal matrix
where we reweight the identity matrix in each diagonal block by a factor given by the trace ratio between and . With this matrix, the adaptive nugget term is defined as , with a global nugget parameter . We find that, once the parameter is chosen suitably, our Gauss–Newton algorithm converges quickly and in a stable manner. During computations, we can compute the Cholesky factors of offline and use back-substitution to apply the inverse in each iteration of Gauss–Newton.
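As a sketch of this construction (assuming the trace ratios are taken relative to the first, pointwise-evaluation block, and with placeholder values for the block sizes and the global parameter), the regularized matrix and its offline Cholesky factor could be assembled as follows.

```python
import numpy as np
from scipy.linalg import block_diag, cho_factor, cho_solve

def add_adaptive_nugget(Theta, block_sizes, eta):
    """Add a block-diagonal nugget eta * R to Theta, where the identity block
    for each group of measurement functionals is re-weighted by its trace
    relative to the first (pointwise-evaluation) block.  Normalizing against
    the first block is one plausible reading of the construction above."""
    starts = np.cumsum([0] + list(block_sizes))
    traces = [np.trace(Theta[s:s + m, s:s + m]) for s, m in zip(starts, block_sizes)]
    R = block_diag(*[(tr / traces[0]) * np.eye(m) for tr, m in zip(traces, block_sizes)])
    return Theta + eta * R

# Offline Cholesky factorization of the regularized matrix; the factor is
# re-used (via triangular back-substitutions) in every Gauss-Newton iteration.
# Block sizes and eta below are placeholders:
# Theta_reg = add_adaptive_nugget(Theta, block_sizes=[M, M_interior], eta=1e-8)
# chol = cho_factor(Theta_reg)
# x = cho_solve(chol, rhs)
```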
Example NE.
For the numerical experiments in Subsection 1.1.4 pertaining to the nonlinear elliptic PDE (1), we observed that has two distinct diagonal blocks: one block corresponding to the pointwise evaluation functionals and with entries , and another block corresponding to pointwise evaluations of the Laplacian operator and with entries . With collocation points, the trace ratio between these blocks was of order ; thus, the difference between and is significant. Our experiments also showed that if we only add to regularize the matrix, then a value of as large as was needed to obtain meaningful results, whereas using the nugget term we could choose , which leads to significantly improved results. We further explore these details below, in particular in Table 4.
A.2 Choice of Nugget Terms for the Nonlinear Elliptic PDE
Below we discuss the choice of the nugget term in the numerical experiments of Subsection 3.5.1. The results in Figure 2 and Table 1 were obtained by employing the adaptive nugget term of Appendix A.1 with global parameters and respectively.
We also compared more standard choices, i.e., nugget terms of the form , against our adaptive nugget term , with the nonlinearity . Cholesky factorization was applied to the regularized matrix, and the subsequent Gauss–Newton iterations were employed to obtain the solutions. The and errors of the converged solutions are shown in Table 4. The results were averaged over 10 instances of randomly sampled collocation points. Here, "nan" means the algorithm was unstable and diverged. To obtain these results we terminated all Gauss–Newton iterations after 5 steps. Due to randomness in the initial guess (and, where random collocation points were used, in the points themselves), some trials did not converge within 5 steps. This variation also explains the non-monotonic behavior of the error in Table 4 as decreases. These effects were more pronounced for the standard nugget term. Apart from these small variations, our results clearly demonstrate the superior accuracy and stability provided by our adaptive nugget term compared with the standard nugget choice.
: error | 7.77e-02 | 4.46e-02 | 2.65e-02 | 1.56e-02 | 1.32e-02 | 1.46e-02 |
---|---|---|---|---|---|---|
: error | 6.43e-01 | 3.13e-01 | 1.99e-01 | 1.47e-01 | 1.33e-01 | 1.43e-01
: error | 8.49e-02 | 9.29e-03 | 9.10e-03 | 3.34e-03 | 1.01e-03 | 3.36e-04 |
: error | 4.02e-01 | 7.86e-02 | 5.58e-02 | 2.21e-02 | 7.17e-03 | 3.87e-03 |
: error | 1.37e-02 | 8.623e-03 | 1.01e-02 | 1.92e-02 | nan | nan
: error | 1.81e-01 | 8.28e-02 | 1.07e-01 | 3.05e-01 | nan | nan |
: error | 1.55e-04 | 7.05e-05 | 4.56e-05 | 6.30e-06 | 1.73e-06 | 8.31e-07 |
: error | 2.41e-03 | 1.07e-03 | 7.66e-04 | 8.92e-05 | 2.62e-05 | 1.17e-05 |
A.3 Choice of Nugget Terms for Burgers' Equation
For the numerical experiments in Subsection 3.5.2 we primarily used our adaptive nugget term as outlined in Appendix A.1. For the results in Figure 3 we used a global nugget parameter . For the convergence analysis in Table 2 we varied with the number of collocation points to achieve better performance. More precisely, for we used a larger nugget , and for we used . Choosing a smaller nugget for small values of would still yield equally accurate results, but would require more iterations of the Gauss–Newton algorithm.
A.4 Choice of Nugget Terms for the Eikonal Equation
Our numerical experiments in Subsection 3.5.3 also used the adaptive nugget of Appendix A.1. Indeed, we followed an approach to choosing the global parameter similar to that used for Burgers' equation, outlined in Appendix A.3 above.
For the results in Figure 4 we used . For the convergence analysis in Table 3 we varied for different values of , i.e., we chose for and for . Analogously to the Burgers' experiment, we observed that smaller values of for smaller choices of cause the Gauss–Newton iterations to converge more slowly. Thus, varying with improved the efficiency of our framework.
A.5 Choice of Nugget Terms for Darcy Flow
Both of the matrices outlined in Subsection 4.1 are dense and ill-conditioned in the IP setting, and so an appropriate nugget term should be chosen to regularize them. In the IP setting we propose to add adaptive nuggets to both, using the same strategy as in Appendix A.1, except that the nuggets are constructed independently for each matrix. To this end we set and , where the denote the global nugget parameter and the re-weighted identity matrix corresponding to . For the numerical experiments in Figure 5 we used