
Large-scale Bayesian optimal experimental design with derivative-informed projected neural network

Keyi Wu[1] ([email protected]), Thomas O’Leary-Roseberry ([email protected]), Peng Chen ([email protected]), Omar Ghattas ([email protected])

[1] Department of Mathematics, The University of Texas at Austin, 2515 Speedway, Austin, TX 78712, USA
[2] Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA
[3] School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30308, USA
[4] Departments of Geological Sciences and Mechanical Engineering, The University of Texas at Austin, Austin, TX 78712, USA

2021

This research was partially funded by DOE ASCR DE-SC0019303 and DE-SC0021239, DOD MURI FA9550-21-1-0084, and NSF DMS-2012453.
Abstract

We address the solution of large-scale Bayesian optimal experimental design (OED) problems governed by partial differential equations (PDEs) with infinite-dimensional parameter fields. The OED problem seeks to find sensor locations that maximize the expected information gain (EIG) in the solution of the underlying Bayesian inverse problem. Computation of the EIG is usually prohibitive for PDE-based OED problems. To make the evaluation of the EIG tractable, we approximate the (PDE-based) parameter-to-observable map with a derivative-informed projected neural network (DIPNet) surrogate, which exploits the geometry, smoothness, and intrinsic low-dimensionality of the map using a small and dimension-independent number of PDE solves. The surrogate is then deployed within a greedy algorithm-based solution of the OED problem such that no further PDE solves are required. We analyze the EIG approximation error in terms of the generalization error of the DIPNet and show they are of the same order. Finally, the efficiency and accuracy of the method are demonstrated via numerical experiments on OED problems governed by inverse scattering and inverse reactive transport with up to 16,641 uncertain parameters and 100 experimental design variables, where we observe up to three orders of magnitude speedup relative to a reference double loop Monte Carlo method.

keywords:
Bayesian optimal experimental design, Bayesian inverse problem, expected information gain, neural network, surrogate, greedy algorithm

1 Introduction

In modeling natural or engineered systems, uncertainties are often present due to the lack of knowledge or intrinsic variability of the system. Uncertainties may arise from sources as varied as initial and boundary conditions, material properties and other coefficients, external source terms, interaction and coupling terms, and geometries; for simplicity of exposition, we refer to all of these as parameters. Uncertainties in prior knowledge of these parameters can be reduced by incorporating indirect observational or experimental data on the system state or related quantities of interest into the forward model via solution of a Bayesian inverse problem. Prior knowledge can be incorporated through a prior distribution on the uncertain parameters. The data are typically noisy because of limited measurement precision, which induces a likelihood of the data conditioned on the given parameters. Uncertainties of the parameters are then reduced by the data and quantified by a posterior distribution, which combines the prior and the likelihood via Bayes’ rule.

Large amounts of informative data can reduce uncertainties in the parameters, and thus posterior predictions, significantly. However, the data are often sparse or limited due to the cost of their acquisition. In such cases it is critical to design the acquisition process or experiment in an optimal way so that as much information as possible can be gained from the acquired data, or the uncertainty in the parameters or posterior predictions can be reduced as much as possible. Experimental design variables can include what, when, and where to measure, which sources to use to excite the system, and under which conditions the experiments should be conducted. This is known as the optimal experimental design (OED) problem Ucinski05 , or Bayesian OED in the context of Bayesian inference. OED problems arise across numerous fields including geophysical exploration, medical imaging, nondestructive evaluation, drug testing, materials characterization, and earth system data assimilation, to name just a few. For example, two notable uses of OED include optimal observing system design in oceanography LooseHeimbach21 and optimal sensor placement for tsunami early warning FerrolinoLopeMendoza20 .

The challenges to solving OED problems in these and other fields are manifold. The models underlying the systems of interest typically take the form of partial differential equations (PDEs) and can be large-scale, complex, nonlinear, dynamic, multiscale, and coupled. The uncertain parameters may depend on both space and time, and are often characterized by infinite-dimensional random fields and/or stochastic processes. The PDE models can be extremely expensive to solve for each realization of the infinite-dimensional uncertain parameters. The computation of the OED objective involves high-dimensional (after discretization) integration with respect to (w.r.t.) the uncertain parameters, and thus requires a large number of PDE solves. Finally, the OED objective needs to be evaluated numerous times, especially when the experimental design variables are high-dimensional or when they represent discrete decisions.

To address these computational challenges, different classes of methods have been developed by exploiting (1) sparsity via polynomial chaos approximation of parameter-to-observable maps HuanMarzouk13 ; HuanMarzouk14 ; HuanMarzouk16 , (2) Laplace approximation of the posterior AlexanderianPetraStadlerEtAl16 ; BeckDiaEspathEtAl18 ; LongScavinoTemponeEtAl13a ; LongMotamedTempone15 ; BeckMansourDiaEspathEtAl20 , (3) intrinsic low dimensionality by low-rank approximation of (prior-preconditioned and data-informed) operators AlexanderianPetraStadlerEtAl14 ; AlexanderianGloorGhattas16 ; AlexanderianPetraStadlerEtAl16 ; SaibabaAlexanderianIpsen17 ; CrestelAlexanderianStadlerEtAl17 ; AttiaAlexanderianSaibaba18 , (4) decomposability by offline (for PDE-constrained approximation)–online (for design optimization) decomposition WuChenGhattas20 ; WuChenGhattas21 , and (5) surrogate models of the PDEs, parameter-to-observable map, or posterior distribution by model reduction Aretz-NellesenChenGreplEtAl20 ; AretzChenVeroy21 and deep learning FosterJankowiakBinghamEtAl19 ; KleinegesseGutmann20 ; ShenHuan21 .

Here, we consider the Bayesian OED problem for optimal sensor placement governed by large-scale and possibly nonlinear PDEs with infinite-dimensional uncertain parameters. We use the expected information gain (EIG), also known as mutual information, as the optimality criterion for the OED problem. The optimization problem is combinatorial: we seek the combination of sensors, selected from a set of candidate locations, that maximizes the EIG. The EIG is an average of the Kullback–Leibler (KL) divergence between the posterior and the prior distributions over all realizations of the data. This involves a double integral: one integral of the likelihood function w.r.t. the prior distribution to compute the normalization constant or model evidence for each data realization, and one integral w.r.t. the data distribution. To evaluate the two integrals we adopt a double-loop Monte Carlo (DLMC) method that requires the computation of the parameter-to-observable map at each of the parameter and data samples. Since the likelihood can be complex and concentrated in a small region of the parameter space, the number of prior samples needed to capture it well, and thus to obtain a relatively accurate sample average approximation of the normalization constant, can be extremely large. The requirement to evaluate the PDE-constrained parameter-to-observable map at each of the large number of samples leads to numerous PDE solves, which is prohibitive when the PDEs are expensive to solve. To tackle this challenge, we construct a derivative-informed projected neural network (DIPNet) OLearyRoseberryVillaChenEtAl2022 ; OLearyRoseberry2020 ; OLearyRoseberryDuChaudhuriEtAl2021 surrogate of the parameter-to-observable map that exploits the intrinsic low dimensionality of both the parameter and the data spaces. This intrinsic low dimensionality is due to the correlation of the high-dimensional parameters, the smoothing property of the underlying PDE solution, and redundant information contained in the data from all of the candidate sensors. In particular, the low-dimensional subspace of the parameter space can be detected via low-rank approximations of derivatives of the parameter-to-observable map, such as the Jacobian, Gauss-Newton Hessian, or higher-order derivatives. This property has been observed and exploited across a wide spectrum of Bayesian inverse problems FlathWilcoxAkcelikEtAl11 ; Bui-ThanhBursteddeGhattasEtAl12 ; Bui-ThanhGhattasMartinEtAl13 ; Bui-ThanhGhattas14 ; KalmikovHeimbach14 ; HesseStadler14 ; IsaacPetraStadlerEtAl15 ; CuiLawMarzouk16 ; ChenVillaGhattas17 ; BeskosGirolamiLanEtAl17 ; ZahmCuiLawEtAl18 ; BrennanBigoniZahmEtAl20 ; ChenWuChenEtAl19 ; ChenGhattas20 ; SubramanianScheufeleMehlEtAl20 ; BabaniyiNicholsonVillaEtAl21 and Bayesian optimal experimental design AlexanderianPetraStadlerEtAl16 ; WuChenGhattas20 ; WuChenGhattas21 . See GhattasWillcox21 for analysis of model elliptic, parabolic, and hyperbolic problems, and a lengthy list of complex inverse problems that have been found numerically to exhibit this property.

This intrinsic low-dimensionality of parameter and data spaces, along with smoothness of the parameter-to-observable map, allows us to construct an accurate (over parameter space) DIPNet surrogate with a limited and dimension-independent number of training data pairs, each requiring a PDE solve. Once trained, the DIPNet surrogate is deployed in the OED problem, which is then solved without further PDE solves, resulting in very large reductions in computing time. Under suitable assumptions, we provide an analysis of the error propagated from the DIPNet approximation to the approximation of the normalization constant and the EIG. To solve the combinatorial optimization problem of sensor selection, we use a greedy algorithm developed in our previous work WuChenGhattas20 ; WuChenGhattas21 . We demonstrate the efficiency and accuracy of our computational method by conducting two numerical experiments with infinite-dimensional parameter fields: OED for inverse scattering (with an acoustic Helmholtz forward problem) and inverse reactive transport (with a nonlinear advection-diffusion-reaction forward problem).

The rest of the paper is organized as follows. The setup of the problems including Bayesian inversion, EIG, sensor design matrix, and Bayesian OED are presented in Section 2. Section 3 is devoted to presentation of the computational methods including DLMC, DIPNet and its induced error analysis, and the greedy optimization algorithm. Results for the two OED numerical experiments are provided in Section 4, followed by conclusions in Section 5.

2 Problem setup

2.1 Bayesian inverse problems

Let 𝒟nx\mathcal{D}\subset\mathbb{R}^{n_{x}} denote a physical domain in dimension nx=1,2,3n_{x}=1,2,3. We consider the problem of inferring an uncertain parameter field mm defined in the physical domain 𝒟\mathcal{D} from noisy data 𝐲\mathbf{y} and a complex model represented by PDEs. Let 𝐲ny\mathbf{y}\in{\mathbb{R}}^{n_{y}} denote the noisy data vector of dimension nyn_{y}\in{\mathbb{N}}, given by

𝐲=(m)+𝜺,\mathbf{y}=\mathcal{F}(m)+\boldsymbol{\varepsilon}, (1)

which is contaminated by the additive Gaussian noise 𝜺𝒩(𝟎,𝚪n)\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0},\mathbf{\Gamma}_{\text{n}}) with zero mean and covariance 𝚪nny×ny\mathbf{\Gamma}_{\text{n}}\in{\mathbb{R}}^{n_{y}\times n_{y}}. Specifically, 𝐲\mathbf{y} is obtained from observation of the solution of the PDE model at nyn_{y} sensor locations. \mathcal{F} is the parameter-to-observable map, which depends on the solution of the PDE and an observation operator that extracts the solution values at the nyn_{y} locations.

We consider the above inverse problem in a Bayesian framework. First, we assume that mm lies in an infinite-dimensional real separable Hilbert space \mathcal{M}, e.g., =2(𝒟)\mathcal{M}=\mathcal{L}^{2}(\mathcal{D}) of square integrable functions defined in 𝒟\mathcal{D}. Moreover, we assume that mm follows a Gaussian prior measure μpr=𝒩(mpr,𝒞pr)\mu_{\text{pr}}=\mathcal{N}({m_{\text{pr}}},\mathcal{C}_{\text{pr}}) with mean mpr{m_{\text{pr}}}\in\mathcal{M} and covariance operator 𝒞pr\mathcal{C}_{\text{pr}}, a strictly positive, self-adjoint, and trace-class operator. As one example, we consider 𝒞pr=𝒜2\mathcal{C}_{\text{pr}}=\mathcal{A}^{-2}, where 𝒜=γΔ+δI\mathcal{A}=-\gamma\Delta+\delta I is a Laplacian-like operator with prescribed homogeneous Neumann boundary condition, with Laplacian Δ\Delta, identity II, and positive constants γ,δ>0\gamma,\delta>0; see Stuart10 ; Bui-ThanhGhattasMartinEtAl13 ; PetraMartinStadlerEtAl14 for more details. Given the Gaussian observation noise, the likelihood of the data 𝐲\mathbf{y} for the parameter mm\in\mathcal{M} satisfies

πlike(𝐲|m)exp(Φ(m,𝐲)),\pi_{\text{like}}(\mathbf{y}|m)\propto\exp\left(-\Phi(m,\mathbf{y})\right), (2)

where

Φ(m,𝐲):=12𝐲(m)𝚪n12\Phi(m,\mathbf{y}):=\frac{1}{2}\|\mathbf{y}-{\mathcal{F}}(m)\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}} (3)

is known as a potential function. By Bayes’ rule, the posterior measure, denoted as μpost(m|𝐲)\mu_{\text{post}}(m|\mathbf{y}), is given by the Radon-Nikodym derivative as

dμpost(m|𝐲)dμpr(m)=1π(𝐲)πlike(𝐲|m),\frac{d\mu_{\text{post}}(m|\mathbf{y})}{d\mu_{\text{pr}}(m)}=\frac{1}{\pi(\mathbf{y})}\pi_{\text{like}}(\mathbf{y}|m), (4)

where π(𝐲)\pi(\mathbf{y}) is the so-called normalization constant or model evidence, given by

π(𝐲)=πlike(𝐲|m)𝑑μpr(m).\pi(\mathbf{y})=\int_{\mathcal{M}}\pi_{\text{like}}(\mathbf{y}|m)d\mu_{\text{pr}}(m). (5)

This expression is often computationally intractable because of the infinite-dimensional integral, which involves a (possibly large-scale) PDE solve for each realization mm.
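To make the likelihood (2) and the potential (3) concrete, the following minimal NumPy sketch evaluates them for a generic forward map; the linear map A below is only a hypothetical placeholder for the PDE-based parameter-to-observable map, and all names are ours rather than the paper's implementation.

```python
import numpy as np

def potential(y, F_of_m, Gamma_noise):
    """Data-misfit potential Phi(m, y) = 0.5 * ||y - F(m)||^2 in the Gamma_n^{-1} norm, as in (3)."""
    res = y - F_of_m
    return 0.5 * res @ np.linalg.solve(Gamma_noise, res)

def unnormalized_likelihood(y, F_of_m, Gamma_noise):
    """pi_like(y | m) up to the Gaussian normalization constant, as in (2)."""
    return np.exp(-potential(y, F_of_m, Gamma_noise))

# toy usage with a hypothetical linear forward map F(m) = A m standing in for a PDE solve
rng = np.random.default_rng(0)
n, n_y = 8, 3
A = rng.standard_normal((n_y, n))            # stand-in for the parameter-to-observable map
Gamma_noise = 0.01 * np.eye(n_y)             # noise covariance
m_true = rng.standard_normal(n)
y = A @ m_true + rng.multivariate_normal(np.zeros(n_y), Gamma_noise)
print(unnormalized_likelihood(y, A @ m_true, Gamma_noise))
```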

2.2 Expected information gain

To measure the information gained from the data 𝐲\mathbf{y} in the inference of the parameter mm, we consider a Kullback–Leibler (KL) divergence between the posterior and the prior, defined as

DKL(μpost(|𝐲)μpr):=ln(dμpost(m|𝐲)dμpr(m))dμpost(m|𝐲),D_{\text{KL}}(\mu_{\text{post}}(\cdot|\mathbf{y})\|\mu_{\text{pr}}):=\int_{\mathcal{M}}\ln\left(\frac{d\mu_{\text{post}}(m|\mathbf{y})}{d\mu_{\text{pr}}(m)}\right)d\mu_{\text{post}}(m|\mathbf{y}), (6)

which is random since the data 𝐲\mathbf{y} is random. We consider a widely used optimality criterion, expected information gain (EIG), which is the KL divergence averaged over all realizations of the data, defined as

Ψ:=𝔼𝐲[DKL(μpost(|𝐲)μpr)]=𝒴DKL(μpost(|𝐲)μpr)π(𝐲)d𝐲=𝒴ln(dμpost(m|𝐲)dμpr(m))𝑑μpost(m|𝐲)π(𝐲)𝑑𝐲=𝒴ln(πlike(𝐲|m)π(𝐲))πlike(𝐲|m)𝑑𝐲𝑑μpr(m),\begin{split}\Psi&:=\mathbb{E}_{\mathbf{y}}\left[D_{\text{KL}}(\mu_{\text{post}}(\cdot|\mathbf{y})\|\mu_{\text{pr}})\right]\\ &=\int_{\mathcal{Y}}D_{\text{KL}}(\mu_{\text{post}}(\cdot|\mathbf{y})\|\mu_{\text{pr}})\,\pi(\mathbf{y})\,d\mathbf{y}\\ &=\int_{\mathcal{Y}}\int_{\mathcal{M}}\ln\left(\frac{d\mu_{\text{post}}(m|\mathbf{y})}{d\mu_{\text{pr}}(m)}\right)d\mu_{\text{post}}(m|\mathbf{y})\,\pi(\mathbf{y})\,d\mathbf{y}\\ &=\int_{\mathcal{M}}\int_{\mathcal{Y}}\ln\left(\frac{\pi_{\text{like}}(\mathbf{y}|m)}{\pi(\mathbf{y})}\right)\pi_{\text{like}}(\mathbf{y}|m)\,d\mathbf{y}\,d\mu_{\text{pr}}(m),\end{split} (7)

where the last equality follows from Bayes’ rule (4) and Fubini’s theorem under the assumption of proper integrability.

2.3 Optimal experimental design

We consider the OED problem for optimally acquiring data to maximize the expected information gained in the parameter inference. The experimental design seeks to choose rr sensor locations out of dd candidates {𝒙1,,𝒙d}\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{d}\}, represented by a design matrix 𝐖𝒲r×d{\mathbf{W}}\in\mathcal{W}\subset{\mathbb{R}}^{r\times d}, namely, if the ii-th sensor is placed at 𝒙j\boldsymbol{x}_{j}, then 𝐖ij=1{\mathbf{W}}_{ij}=1, otherwise 𝐖ij=0{\mathbf{W}}_{ij}=0:

𝒲:={𝐖r×d:𝐖ij{0,1},i=1r𝐖ij{0,1},j=1d𝐖ij=1}.\mathcal{W}:=\left\{{\mathbf{W}}\in{\mathbb{R}}^{r\times d}:{\mathbf{W}}_{ij}\in\{0,1\},\sum^{r}_{i=1}{\mathbf{W}}_{ij}\in\{0,1\},\sum^{d}_{j=1}{\mathbf{W}}_{ij}=1\right\}. (8)

Let d:d{\mathcal{F}}_{d}:\mathcal{M}\mapsto{\mathbb{R}}^{d} denote the parameter-to-observable map and 𝜺dd\boldsymbol{\varepsilon}_{d}\in{\mathbb{R}}^{d} denote the additive noise, both using all dd candidate sensors; then we have

=𝐖dand𝜺=𝐖𝜺d.{\mathcal{F}}={\mathbf{W}}{\mathcal{F}}_{d}\quad\text{and}\quad\boldsymbol{\varepsilon}={\mathbf{W}}\boldsymbol{\varepsilon}_{d}. (9)

Then the likelihood (2) for a specific design 𝐖{\mathbf{W}} is given by

πlike(𝐲|m,𝐖)exp(12𝐲𝐖d(m)𝚪n12),\pi_{\text{like}}(\mathbf{y}|m,{\mathbf{W}})\propto\exp\left(-\frac{1}{2}\|\mathbf{y}-{\mathbf{W}}{\mathcal{F}}_{d}(m)\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}}\right), (10)

and the normalization constant also depends on 𝐖{\mathbf{W}} as

π(𝐲,𝐖)=πlike(𝐲|m,𝐖)𝑑μpr(m).\pi(\mathbf{y},{\mathbf{W}})=\int_{\mathcal{M}}\pi_{\text{like}}(\mathbf{y}|m,{\mathbf{W}})d\mu_{\text{pr}}(m). (11)

From Section 2.2, we can see that the EIG Ψ\Psi depends on the design matrix 𝐖{\mathbf{W}} through the likelihood function πlike(𝐲|m,𝐖)\pi_{\text{like}}(\mathbf{y}|m,{\mathbf{W}}). To this end, we formulate the OED problem to find an optimal design matrix 𝐖{\mathbf{W}}^{*} such that

𝐖=argmax𝐖𝒲Ψ(𝐖),{\mathbf{W}}^{*}=\operatorname*{arg\,max}_{{\mathbf{W}}\in\mathcal{W}}\Psi({\mathbf{W}}), (12)

with the 𝐖{\mathbf{W}}-dependent EIG given by

Ψ(𝐖)=𝒴ln(πlike(𝐲|m,𝐖)π(𝐲,𝐖))πlike(𝐲|m,𝐖)𝑑𝐲𝑑μpr(m).\Psi({\mathbf{W}})=\int_{\mathcal{M}}\int_{\mathcal{Y}}\ln\left(\frac{\pi_{\text{like}}(\mathbf{y}|m,{\mathbf{W}})}{\pi(\mathbf{y},{\mathbf{W}})}\right)\pi_{\text{like}}(\mathbf{y}|m,{\mathbf{W}})\,d\mathbf{y}\,d\mu_{\text{pr}}(m). (13)
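As a concrete illustration of the design set (8) and the restriction (9), the following minimal NumPy sketch (with hypothetical names; not part of the paper's code) builds a design matrix from a list of selected candidate indices and applies it to a stand-in observable vector.

```python
import numpy as np

def design_matrix(sensor_indices, d):
    """Selection matrix W in {0,1}^{r x d} from (8): row i picks candidate sensor_indices[i]."""
    r = len(sensor_indices)
    W = np.zeros((r, d))
    W[np.arange(r), sensor_indices] = 1.0
    return W

# choose r = 3 of d = 10 candidate sensors
d = 10
W = design_matrix([1, 4, 7], d)
F_d_of_m = np.arange(d, dtype=float)   # stand-in for the full observable vector F_d(m)
print(W @ F_d_of_m)                    # restricted observables W F_d(m): [1. 4. 7.]
```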

2.4 Finite-dimensional approximation

To facilitate the presentation of our computational methods, we make a finite-dimensional approximation of the parameter field by using a finite element discretization. Let n\mathcal{M}_{n}\subset\mathcal{M} denote a subspace of \mathcal{M} spanned by nn piecewise continuous Lagrange polynomial basis functions {ψj}j=1n\{\psi_{j}\}^{n}_{j=1} over a mesh with elements of size hh. Then the discrete parameter mhnm_{h}\in\mathcal{M}_{n} is given by

mh=j=1nmjψj.m_{h}=\sum_{j=1}^{n}m_{j}\psi_{j}. (14)

The Bayesian inverse problem is then stated for the finite-dimensional coefficient vector 𝐦=(m1,,mn)T\mathbf{m}=(m_{1},\dots,m_{n})^{T} of mhm_{h}, with nn possibly very large. The prior distribution of the discrete parameter 𝐦\mathbf{m} is Gaussian 𝒩(𝐦pr,𝚪pr)\mathcal{N}(\mathbf{m}_{\text{pr}},\mathbf{\Gamma}_{\text{pr}}) with 𝐦pr\mathbf{m}_{\text{pr}} representing the coefficient vector of the discretized prior mean of mpr{m_{\text{pr}}}, and 𝚪pr\mathbf{\Gamma}_{\text{pr}} representing the covariance matrix corresponding to 𝒞pr=𝒜2\mathcal{C}_{\text{pr}}=\mathcal{A}^{-2}, given by

𝚪pr=𝐀1𝐌𝐀1,\mathbf{\Gamma}_{\text{pr}}=\mathbf{A}^{-1}\mathbf{M}\mathbf{A}^{-1}, (15)

where 𝐀\mathbf{A} is the finite element matrix of the Laplacian-like operator 𝒜\mathcal{A}, and 𝐌\mathbf{M} is the mass matrix. Moreover, letting 𝐅d:nd\mathbf{F}_{d}:{\mathbb{R}}^{n}\to{\mathbb{R}}^{d} denote the discretized parameter-to-observable map corresponding to d\mathcal{F}_{d}, we have 𝐅=𝐖𝐅d\mathbf{F}={\mathbf{W}}\mathbf{F}_{d} as in (9). Then the likelihood function corresponding to (10) for the discrete parameter 𝐦\mathbf{m} is given by

πlike(𝐲|𝐦,𝐖)exp(12𝐲𝐖𝐅d(𝐦)𝚪n12).\pi_{\text{like}}(\mathbf{y}|\mathbf{m},{\mathbf{W}})\propto\exp\left(-\frac{1}{2}\|\mathbf{y}-{\mathbf{W}}\mathbf{F}_{d}(\mathbf{m})\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}}\right). (16)
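As a minimal illustration of how the discretized prior (15) and likelihood (16) can be used computationally, the sketch below draws a prior sample through the (non-symmetric) square-root factor A^{-1} M^{1/2}, assuming for simplicity a lumped (diagonal) mass matrix, and evaluates the unnormalized log-likelihood. This is a simplified stand-in for the matrix-free operators provided by hIPPYlib/hIPPYflow, and the toy matrices are not an actual finite element assembly.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def sample_prior(A, M_lumped_diag, m_prior, rng):
    """Draw m ~ N(m_pr, Gamma_pr) with Gamma_pr = A^{-1} M A^{-1} from (15).
    Uses m = m_pr + A^{-1} M^{1/2} z with z ~ N(0, I); M is assumed lumped
    (diagonal), so its square root is taken elementwise."""
    z = rng.standard_normal(A.shape[0])
    rhs = np.sqrt(M_lumped_diag) * z
    return m_prior + spla.spsolve(A, rhs)

def log_likelihood(y, W, F_d_of_m, Gamma_noise_inv):
    """Unnormalized log pi_like(y | m, W) from (16)."""
    res = y - W @ F_d_of_m
    return -0.5 * res @ (Gamma_noise_inv @ res)

# toy usage: small sparse SPD stand-ins for the FEM matrices
n = 5
A = sp.diags([[-1.0] * (n - 1), [3.0] * n, [-1.0] * (n - 1)], [-1, 0, 1], format="csc")
M_lumped = np.full(n, 1.0 / n)
rng = np.random.default_rng(0)
m = sample_prior(A, M_lumped, np.zeros(n), rng)
```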

3 Computational methods

3.1 Double-loop Monte Carlo estimator

To solve the OED problem (12), we need to evaluate the EIG repeatedly for each given design 𝐖{\mathbf{W}}. The double integrals in the EIG expression can be computed by a double-loop Monte Carlo (DLMC) estimator Ψdl\Psi^{dl} defined as

Ψdl(𝐖):=1nouti=1noutlog(πlike(𝐲i|𝐦i,𝐖)π^(𝐲i,𝐖)),\Psi^{dl}({\mathbf{W}}):=\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\log\left(\frac{\pi_{\text{like}}(\mathbf{y}_{i}|\mathbf{m}_{i},{\mathbf{W}})}{\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}})}\right), (17)

where 𝐦i\mathbf{m}_{i}, i=1,,nouti=1,\dots,n_{\text{out}}, are i.i.d. samples from prior 𝒩(𝐦pr,𝚪pr)\mathcal{N}(\mathbf{m}_{\text{pr}},\mathbf{\Gamma}_{\text{pr}}) in the outer loop and 𝐲i=𝐅(𝐦i)+𝜺i\mathbf{y}_{i}=\mathbf{F}(\mathbf{m}_{i})+\boldsymbol{\varepsilon}_{i} are the realizations of the data with i.i.d. noise 𝜺i𝒩(𝟎,𝚪n)\boldsymbol{\varepsilon}_{i}\sim\mathcal{N}(\boldsymbol{0},\mathbf{\Gamma}_{\text{n}}). π^(𝐲i,𝐖)\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}}) is a Monte Carlo estimator of the normalization constant π(𝐲i,𝐖)\pi(\mathbf{y}_{i},{\mathbf{W}}) with ninn_{\text{in}} samples in the inner loop, given by

π^(𝐲i,𝐖):=1ninj=1ninπlike(𝐲i|𝐦i,j,𝐖),\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}}):=\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\pi_{\text{like}}(\mathbf{y}_{i}|{\mathbf{m}}_{i,j},{\mathbf{W}}), (18)

where 𝐦i,j{\mathbf{m}}_{i,j}, j=1,,ninj=1,\dots,n_{\text{in}}, are i.i.d. samples from the prior 𝒩(𝐦pr,𝚪pr)\mathcal{N}(\mathbf{m}_{\text{pr}},\mathbf{\Gamma}_{\text{pr}}).
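For concreteness, a minimal NumPy sketch of the DLMC estimator (17)-(18) is given below. A log-sum-exp form is used for numerical stability; the function names, and the assumption that the noise covariance is given directly for the selected sensors, are ours and not the paper's implementation.

```python
import numpy as np

def dlmc_eig(F, W, sample_prior, Gamma_noise, n_out, n_in, rng):
    """Double-loop Monte Carlo estimate of the EIG, following (17)-(18).
    F: callable m -> F_d(m), observables at all d candidate sensors;
    W: (r, d) design matrix; sample_prior: callable rng -> prior sample m;
    Gamma_noise: (r, r) noise covariance of the r selected sensors."""
    Gn_inv = np.linalg.inv(Gamma_noise)
    L = np.linalg.cholesky(Gamma_noise)
    eig = 0.0
    for _ in range(n_out):                                 # outer loop over data realizations
        m_i = sample_prior(rng)
        eps = L @ rng.standard_normal(W.shape[0])
        y_i = W @ F(m_i) + eps                             # synthetic data y_i = F(m_i) + eps_i
        log_like_i = -0.5 * eps @ Gn_inv @ eps             # unnormalized log pi_like(y_i | m_i, W)
        # inner loop: fresh prior samples for the evidence estimate (18),
        # hence n_out * n_in forward (PDE) evaluations in total
        F_in = np.array([W @ F(sample_prior(rng)) for _ in range(n_in)])
        R = y_i - F_in
        log_like_in = -0.5 * np.einsum("ij,jk,ik->i", R, Gn_inv, R)
        log_evidence = np.logaddexp.reduce(log_like_in) - np.log(n_in)
        eig += (log_like_i - log_evidence) / n_out
    return eig
```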

For complex posterior distributions, e.g., high-dimensional, locally supported, multi-modal, non-Gaussian, etc., evaluation of the normalization constant is often intractable, i.e., a prohibitively large number of samples ninn_{\text{in}} is needed. As one particular example, when the posterior of 𝐦\mathbf{m} for data 𝐲i\mathbf{y}_{i} generated at sample 𝐦i\mathbf{m}_{i} concentrates in a very small region far away from the mean of the prior, the likelihood πlike(𝐲i|𝐦i,j,𝐖)\pi_{\text{like}}(\mathbf{y}_{i}|\mathbf{m}_{i,j},{\mathbf{W}}) is extremely small for most samples 𝐦i,j\mathbf{m}_{i,j}, which requires a large number of samples to evaluate π^(𝐲i,𝐖)\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}}) with relatively small estimation error. This is usually prohibitive, since one evaluation of the parameter-to-observable map, and thus one solution of the large-scale PDE model, is required for each of nout×ninn_{\text{out}}\times n_{\text{in}} samples. These nout×ninn_{\text{out}}\times n_{\text{in}} PDE solves are required for each design matrix 𝐖{\mathbf{W}} at each optimization iteration.

3.2 Derivative-informed projected neural networks

Recent research has motivated the deployment of neural networks as surrogates for parametric PDE mappings BhattacharyaHosseiniKovachki2020 ; FrescaManzoni2022 ; KovachkiLiLiuEtAl2021 ; LiKovachkiAzizzadenesheliEtAl2020b ; LuJinKarniadakis2019 ; OLearyRoseberryChenVillaEtAl2022 ; NelsenStuart21 ; NguyenBui2021model ; OLearyRoseberryVillaChenEtAl2022 . These surrogates can be used to accelerate the computation of the EIG within OED problems. Specifically, to reduce the prohibitive computational cost, we build a surrogate for the parameter-to-observable map 𝐅d:nd\mathbf{F}_{d}:{\mathbb{R}}^{n}\mapsto{\mathbb{R}}^{d} at all candidate sensor locations by a derivative-informed projected neural network (DIPNet) OLearyRoseberry2020 ; OLearyRoseberryDuChaudhuriEtAl2021 ; OLearyRoseberryVillaChenEtAl2022 . Often, PDE-constrained high-dimensional parametric maps, such as the parameter-to-observable map 𝐅d\mathbf{F}_{d}, admit low-dimensional structure due to the correlation of the high-dimensional parameters, the regularizing property of the underlying PDE solution, and/or redundant information in the data from all candidate sensors. When this is the case, the DIPNet can exploit this low-dimensional structure and parametrize a parsimonious map between the most informed subspaces of the input parameter and the output observables. The dimensions of the input and output subspaces are referred to as the “information dimension” of the map, which is often significantly smaller than the parameter and data dimensions. The architectural strategy that we employ exploits compressibility of the map, by first reducing the input and output dimensionality via projection to informed reduced bases of the inputs and outputs. A neural network is then used to construct a low-dimensional nonlinear mapping between the two reduced bases. Error bounds for the effects of basis truncation, and parametrization by neural network are studied in BhattacharyaHosseiniKovachki2020 ; OLearyRoseberryDuChaudhuriEtAl2021 ; OLearyRoseberryVillaChenEtAl2022 .

For the input parameter dimension reduction, we use a vector generalization of an active subspace (AS) ZahmConstantinePrieurEtAl2020 , which is spanned by the generalized eigenvectors (input reduced basis) corresponding to the rMr_{M} largest eigenvalues of the eigenvalue problem

[n𝐦𝐅d(𝐦)T𝐦𝐅d(𝐦)𝑑μpr(𝐦)]𝐯i=λiAS𝚪pr1𝐯i,\left[\int_{\mathcal{M}_{n}}\nabla_{\mathbf{m}}\mathbf{F}_{d}(\mathbf{m})^{T}\nabla_{\mathbf{m}}\mathbf{F}_{d}(\mathbf{m})d\mu_{\text{pr}}(\mathbf{m})\right]\mathbf{v}_{i}=\lambda_{i}^{AS}\mathbf{\Gamma}_{\text{pr}}^{-1}\mathbf{v}_{i}, (19)

where the eigenvectors 𝐯i\mathbf{v}_{i} are ordered by the decreasing generalized eigenvalues λiAS\lambda_{i}^{\text{AS}}, i=1,,rMi=1,\dots,r_{M}. For the output data dimension reduction, we use a proper orthogonal decomposition (POD) ManzoniNegriQuarteroni2016 ; QuarteroniManzoniNegri2015 , which uses the eigenvectors (output reduced basis) corresponding to the rFr_{F} largest eigenvalues of the expected observable outer product matrix,

[n𝐅d(𝐦)𝐅d(𝐦)T𝑑μpr(𝐦)]ϕi=λiPODϕi,\left[\int_{\mathcal{M}_{n}}\mathbf{F}_{d}(\mathbf{m})\mathbf{F}_{d}(\mathbf{m})^{T}d\mu_{\text{pr}}(\mathbf{m})\right]\mathbf{\phi}_{i}=\lambda_{i}^{POD}\mathbf{\phi}_{i}, (20)

where the eigenvectors ϕi\mathbf{\phi}_{i} are ordered by the decreasing eigenvalues λiPOD\lambda_{i}^{\text{POD}}, i=1,,rFi=1,\dots,r_{F}. When the eigenvalues of AS and POD both decay quickly, the mapping 𝐦𝐅d(𝐦)\mathbf{m}\mapsto\mathbf{F}_{d}(\mathbf{m}) can be well approximated when 𝐦\mathbf{m} and 𝐅d\mathbf{F}_{d} are projected to the corresponding subspaces with small rMr_{M} and rFr_{F}; in this case approximation error bounds for reduced basis representation of the mapping are given by the trailing eigenvalues of the systems (19),(20). This allows one to detect appropriate “breadth” for the neural network via the direct computation of the associated eigenvalue problems, removing the need for ad-hoc neural network hyperparameter search for appropriate breadth. The neural network surrogate 𝐅~d\tilde{\mathbf{F}}_{d} of the map 𝐅d\mathbf{F}_{d} then has the form

𝐅~d(𝐦,[𝐰,𝐛])=𝚽rF𝐟r(𝐕rMT𝐦,𝐰)+𝐛,\tilde{\mathbf{F}}_{d}(\mathbf{m},[\mathbf{w},\mathbf{b}])=\mathbf{\Phi}_{r_{F}}\mathbf{f}_{r}(\mathbf{V}_{r_{M}}^{T}\mathbf{m},\mathbf{w})+\mathbf{b}, (21)

where 𝚽rFd×rF\mathbf{\Phi}_{r_{F}}\in{\mathbb{R}}^{d\times r_{F}} represents the POD reduced basis for the output, 𝐕rMn×rM\mathbf{V}_{r_{M}}\in{\mathbb{R}}^{n\times r_{M}} represents the AS reduced basis for the input, 𝐟rrF\mathbf{f}_{r}\in{\mathbb{R}}^{r_{F}} is the neural network mapping between the two bases parametrized by weights 𝐰\mathbf{w} and bias 𝐛\mathbf{b}. Since the reduced basis dimensions rF,rMr_{F},r_{M} are chosen based on spectral decay of the AS and POD operators, we can choose them to be the same; for convenience we denote the reduced basis dimension instead by rr. The remaining difficulty is how to properly parametrize and train the neural network mapping. While the use of the reduced basis representation for the approximating map allows one to detect appropriate breadth for the neural network by avoiding complex neural network hyperparameter searches, and associated nonconvex neural network trainings, how to choose appropriate depth for the network is still an open question. While neural network approximation theory suggests deeper networks have richer representative capacities, in practice, for many architectures, adding depth eventually diminishes performance in what is known as the “peaking phenomenon”Hughes1968 . In general finding appropriate depth for e.g., fully-connected feedforward neural networks requires re-training from scratch different networks with differing depths. In order to avoid this issue we employ an adaptively constructed residual network (ResNet) neural network parametrization of the mapping between the two reduced bases. This adaptive construction procedure is motivated by recent approximation theory that conceives of ResNets as discretizations of sequentially minimizing control flows LiLinShen22 , where such maps are proven to be universal approximators of LpL^{p} functions on compact sets. A schematic for our neural network architecture can be seen in Fig. 1.

This strategy adaptively constructs and trains a sequence of low-rank ResNet layers, where for convenience we take r=rM=rFr=r_{M}=r_{F} or otherwise employ a restriction or prolongation layer to enforce dimensional compatibility. The ResNet hidden neurons at layer i+1i+1 have the form

zi+1=zi+wi,2σ(wi,1zi+bi),z_{i+1}=z_{i}+w_{i,2}\sigma(w_{i,1}z_{i}+b_{i}), (22)

with zi+1,zirz_{i+1},z_{i}\in\mathbb{R}^{r}, bikb_{i}\in\mathbb{R}^{k}, and wi,2,wi,1Tr×kw_{i,2},w_{i,1}^{T}\in\mathbb{R}^{r\times k}, where the parameter k<rk<r is referred to as the layer rank; it is chosen to be smaller than rr in order to impose a compressed representation of the ResNet latent space update (22). This choice is guided by the “well function” property in LiLinShen22 . The ResNet weights 𝐰=[(wi,2,wi,1,bi)]i=0depth\mathbf{w}=[(w_{i,2},w_{i,1},b_{i})]_{i=0}^{\text{depth}} consist of all of the coefficient arrays in each layer. Given appropriate reduced bases with dimension rr, the ResNet mapping between the reduced bases is trained adaptively, one layer at a time, until over-fitting is detected in training validation metrics. When this is the case, a final global end-to-end training is employed using a stochastic Newton optimizer OLearyRoseberryAlgerGhattas2020 . This architectural strategy is able to achieve high generalizability from a small, input-output dimension-independent amount of training data; for more information on this strategy, please see OLearyRoseberryDuChaudhuriEtAl2021 .
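As an illustration of the projected architecture (21) with low-rank residual updates (22), the following NumPy sketch evaluates a DIPNet-style forward map. The weights here are random placeholders standing in for trained values, and the tanh activation is only a stand-in; the paper's actual surrogates are built and trained in TensorFlow via hIPPYflow.

```python
import numpy as np

def resnet_forward(z, layers):
    """Apply low-rank residual updates (22): z <- z + w2 @ sigma(w1 @ z + b)."""
    for (w1, w2, b) in layers:
        z = z + w2 @ np.tanh(w1 @ z + b)   # tanh is a stand-in for the activation
    return z

def dipnet_forward(m, V_rM, Phi_rF, layers, bias):
    """Projected surrogate (21): F_tilde(m) = Phi_rF f_r(V_rM^T m) + bias."""
    z = V_rM.T @ m                 # reduce the input to active subspace coordinates
    z = resnet_forward(z, layers)  # nonlinear map between the reduced bases
    return Phi_rF @ z + bias       # lift to the d candidate-sensor observables

# random placeholder weights: breadth r, layer rank k < r, depth 3
rng = np.random.default_rng(0)
n, d, r, k = 50, 20, 8, 4
V_rM, Phi_rF = rng.standard_normal((n, r)), rng.standard_normal((d, r))
layers = [(rng.standard_normal((k, r)), rng.standard_normal((r, k)),
           rng.standard_normal(k)) for _ in range(3)]
print(dipnet_forward(rng.standard_normal(n), V_rM, Phi_rF, layers,
                     rng.standard_normal(d)).shape)   # (20,)
```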

Refer to caption
Figure 1: A schematic representation of the derivative-informed projected neural network, using a ResNet as the nonlinear mapping between the reduced subspaces obtained via active subspaces for the input parameters and POD for the output observables.

By removing the dependence of the input-to-output map on the high-dimensional and uninformed subspaces (the complements of the low-dimensional and informed subspaces), we can construct a neural network of small input and output size that requires few training data. Because these architectures achieve high generalization accuracy with limited training data for parametric PDE maps, they are especially well suited to scalable EIG approximation: they can be queried many times at no cost in PDE solves and require few high-fidelity PDE solutions for their construction.
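The reduced bases in (19) and (20) can be estimated by sample averaging. The following sketch (NumPy/SciPy, dense matrices for clarity) illustrates this; in the paper these operators are applied matrix-free via hIPPYlib/hIPPYflow, and the toy usage with random Jacobians and observables is purely illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def active_subspace_basis(jacobians, Gamma_prior, r_M):
    """Sampled generalized eigenproblem (19):
    (1/N) sum_k J_k^T J_k v = lambda Gamma_prior^{-1} v; return the r_M leading eigenvectors."""
    H = sum(J.T @ J for J in jacobians) / len(jacobians)
    B = np.linalg.inv(Gamma_prior)          # dense inverse only for this small sketch
    lam, V = eigh(H, B)                     # eigenvalues in ascending order
    return V[:, ::-1][:, :r_M], lam[::-1][:r_M]

def pod_basis(observables, r_F):
    """Leading eigenvectors of (1/N) sum_k F_k F_k^T from (20)."""
    Fmat = np.asarray(observables)          # shape (N, d)
    C = Fmat.T @ Fmat / Fmat.shape[0]
    lam, Phi = eigh(C)
    return Phi[:, ::-1][:, :r_F], lam[::-1][:r_F]

# toy usage with random stand-ins for sampled Jacobians and observables
rng = np.random.default_rng(0)
n, d, N = 30, 10, 100
Gamma_prior = np.linalg.inv(np.eye(n) + 0.1 * np.ones((n, n)))   # any SPD matrix
jacs = [rng.standard_normal((d, n)) for _ in range(N)]
obs = [rng.standard_normal(d) for _ in range(N)]
V_rM, _ = active_subspace_basis(jacs, Gamma_prior, r_M=5)
Phi_rF, _ = pod_basis(obs, r_F=5)
```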

3.3 DLMC with DIPNet surrogate

We propose to train a DIPNet surrogate 𝐅~d\tilde{\mathbf{F}}_{d} for the parameter-to-observable map 𝐅d{\mathbf{F}}_{d}, so that π^(𝐲i,𝐖)\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}}) can be approximated as

π~(𝐲i,𝐖)=1ninj=1ninexp(12𝐲i𝐖𝐅~d(𝐦i,j)𝚪n12),\tilde{\pi}(\mathbf{y}_{i},{\mathbf{W}})=\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\exp\left(-\frac{1}{2}\|\mathbf{y}_{i}-{\mathbf{W}}\tilde{\mathbf{F}}_{d}({\mathbf{m}}_{i,j})\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}}\right), (23)

where we omitted the Gaussian constant 1(2π)ny/2det(𝚪n)1/2\frac{1}{(2\pi)^{n_{y}/2}\det(\mathbf{\Gamma}_{\text{n}})^{1/2}} since it appears in both the numerator and denominator of the argument of the log in the expression for the EIG. With this approximation of the normalization constant, we can formulate the approximate EIG with the DIPNet surrogate as

Ψnn\displaystyle\Psi^{nn} :=1nouti=1noutlog(πlike(𝐲i|𝐦i,𝐖)π~(𝐲i,𝐖))\displaystyle:=\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\log\left(\frac{\pi_{\text{like}}(\mathbf{y}_{i}|\mathbf{m}_{i},{\mathbf{W}})}{\tilde{\pi}(\mathbf{y}_{i},{\mathbf{W}})}\right) (24)
=1nouti=1nout12𝐖𝜺d,i𝚪n12\displaystyle=-\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\frac{1}{2}\|{\mathbf{W}}\boldsymbol{\varepsilon}_{d,i}\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}} (25)
1nouti=1noutlog(1ninj=1ninexp(12𝐲i𝐖𝐅~d(𝐦i,j)𝚪n12)),\displaystyle\quad-\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\log\left(\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\exp\left(-\frac{1}{2}\|\mathbf{y}_{i}-{\mathbf{W}}\tilde{\mathbf{F}}_{d}({\mathbf{m}}_{i,j})\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}}\right)\right), (26)

where 𝜺d,i\boldsymbol{\varepsilon}_{d,i} are i.i.d. observation noise realizations. Thanks to the DIPNet surrogate, the EIG can be evaluated at negligible cost (relative to the cost of a PDE solve) for each given 𝐖{\mathbf{W}}, and does not require any PDE solves.
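A minimal sketch of the surrogate-accelerated estimator (24)-(26) follows; it mirrors the plain DLMC sketch above, except that the inner-loop evidence (23) is evaluated with the surrogate so that no PDE solves are needed there. The callables F_true and F_surr, and the reuse of the inner-loop samples across the outer loop (their surrogate cost is negligible either way), are our assumptions for illustration rather than the paper's implementation.

```python
import numpy as np

def dipnet_dlmc_eig(F_true, F_surr, W, sample_prior, Gamma_noise, n_out, n_in, rng):
    """Surrogate-accelerated DLMC EIG, following (24)-(26)."""
    Gn_inv = np.linalg.inv(Gamma_noise)
    L = np.linalg.cholesky(Gamma_noise)
    # surrogate evaluations at inner-loop prior samples (no PDE solves)
    F_in = np.array([W @ F_surr(sample_prior(rng)) for _ in range(n_in)])
    eig = 0.0
    for _ in range(n_out):
        eps = L @ rng.standard_normal(W.shape[0])
        y_i = W @ F_true(sample_prior(rng)) + eps      # synthetic data (outer loop, exact map)
        log_like_i = -0.5 * eps @ Gn_inv @ eps         # exact log-likelihood term in (25)
        R = y_i - F_in
        log_like_in = -0.5 * np.einsum("ij,jk,ik->i", R, Gn_inv, R)
        log_evidence = np.logaddexp.reduce(log_like_in) - np.log(n_in)   # log of (23)
        eig += (log_like_i - log_evidence) / n_out
    return eig
```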

3.4 Error analysis

Theorem 1.

We assume that the parameter-to-observable map 𝐅d\mathbf{F}_{d} and its surrogate 𝐅~d\tilde{\mathbf{F}}_{d} are bounded as

𝔼𝐦μpr[𝐅d(𝐦)2]<and𝔼𝐦μpr[𝐅~d(𝐦)2]<.\mathbb{E}_{\mathbf{m}\sim\mu_{\text{pr}}}[\|\mathbf{F}_{d}(\mathbf{m})\|^{2}]<\infty\quad\text{and}\quad\mathbb{E}_{\mathbf{m}\sim\mu_{\text{pr}}}[\|\tilde{\mathbf{F}}_{d}(\mathbf{m})\|^{2}]<\infty. (27)

Moreover, we assume that the generalization error of the DIPNet surrogate can be bounded by ε\varepsilon, i.e.,

𝔼𝐦μpr[𝐅d(𝐦)𝐅~d(𝐦)2]ε2.\mathbb{E}_{\mathbf{m}\sim\mu_{\text{pr}}}[\|\mathbf{F}_{d}(\mathbf{m})-\tilde{\mathbf{F}}_{d}(\mathbf{m})\|^{2}]\leq\varepsilon^{2}. (28)

Then the error in the approximation of the normalization constant by the DIPNet surrogate can be bounded by

|π^(𝐲i,𝐖)π~(𝐲i,𝐖)|Ciε,|\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}})-\tilde{\pi}(\mathbf{y}_{i},{\mathbf{W}})|\leq C_{i}\varepsilon, (29)

for sufficiently large ninn_{\text{in}} and some constants 0<Ci<0<C_{i}<\infty, i=1,,nouti=1,\dots,n_{\text{out}}. Moreover, the approximation error for the EIG can be bounded by

|Ψdl(𝐖)Ψnn(𝐖)|Cε|\Psi^{dl}({\mathbf{W}})-{\Psi}^{nn}({\mathbf{W}})|\leq C\varepsilon (30)

for some constant 0<C<0<C<\infty.

Proof: For notational simplicity, we omit the dependence of π^\hat{\pi} and π~\tilde{\pi} on 𝐖{\mathbf{W}} and write 𝐅=𝐖𝐅d\mathbf{F}={\mathbf{W}}\mathbf{F}_{d} and 𝐅~=𝐖𝐅~d\tilde{\mathbf{F}}={\mathbf{W}}\tilde{\mathbf{F}}_{d}. Note that the bounds (27) and (28) also hold for 𝐅\mathbf{F} and 𝐅~\tilde{\mathbf{F}} since 𝐅\mathbf{F} and 𝐅~\tilde{\mathbf{F}} are selections of some entries of 𝐅d\mathbf{F}_{d} and 𝐅~d\tilde{\mathbf{F}}_{d}. By definition of π^\hat{\pi} in (18) and π~\tilde{\pi} in (23), and the fact that |exex||xx||e^{-x}-e^{-x^{\prime}}|\leq|x-x^{\prime}| for any x,x0x,x^{\prime}\geq 0, we have

|π^(𝐲i)π~(𝐲i)|,\displaystyle|\hat{\pi}(\mathbf{y}_{i})-\tilde{\pi}(\mathbf{y}_{i})|,
1ninj=1nin12|𝐲i𝐅(𝐦i,j)𝚪n12𝐲i𝐅~(𝐦i,j)𝚪n12|,\displaystyle\leq\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\frac{1}{2}\bigg{\lvert}\|\mathbf{y}_{i}-\mathbf{F}(\mathbf{m}_{i,j})\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}}-\|\mathbf{y}_{i}-\tilde{\mathbf{F}}(\mathbf{m}_{i,j})\|^{2}_{\mathbf{\Gamma}_{\text{n}}^{-1}}\bigg{\rvert},
=1ninj=1nin12|(2𝐲i𝐅(𝐦i,j)𝐅~(𝐦i,j))T𝚪n1(𝐅(𝐦i,j)𝐅~(𝐦i,j))|,\displaystyle=\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\frac{1}{2}\bigg{\lvert}\left(2\mathbf{y}_{i}-\mathbf{F}(\mathbf{m}_{i,j})-\tilde{\mathbf{F}}(\mathbf{m}_{i,j})\right)^{T}\mathbf{\Gamma}_{\text{n}}^{-1}\left(\mathbf{F}(\mathbf{m}_{i,j})-\tilde{\mathbf{F}}(\mathbf{m}_{i,j})\right)\bigg{\rvert},
1ninj=1nin𝚪n1(2𝐲i+𝐅(𝐦i,j)+𝐅~(𝐦i,j))𝐅(𝐦i,j)𝐅~(𝐦i,j),\displaystyle\leq\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\|\mathbf{\Gamma}_{\text{n}}^{-1}\|(2\|\mathbf{y}_{i}\|+\|\mathbf{F}(\mathbf{m}_{i,j})\|+\|\tilde{\mathbf{F}}(\mathbf{m}_{i,j})\|)\|\mathbf{F}(\mathbf{m}_{i,j})-\tilde{\mathbf{F}}(\mathbf{m}_{i,j})\|,
Ci(1ninj=1nin𝐅(𝐦i,j)𝐅~(𝐦i,j)2)1/2,\displaystyle\leq C_{i}\left(\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\|\mathbf{F}(\mathbf{m}_{i,j})-\tilde{\mathbf{F}}(\mathbf{m}_{i,j})\|^{2}\right)^{1/2},

where we used the Cauchy–Schwarz inequality and the triangle inequality (for the sample averages) in the last step, with CiC_{i} given by

Ci=𝚪n1(2𝐲i+(1ninj=1nin𝐅(𝐦i,j)2)1/2+(1ninj=1nin𝐅~(𝐦i,j)2)1/2).C_{i}=\|\mathbf{\Gamma}_{\text{n}}^{-1}\|\left(2\|\mathbf{y}_{i}\|+\left(\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\|\mathbf{F}(\mathbf{m}_{i,j})\|^{2}\right)^{1/2}+\left(\frac{1}{n_{\text{in}}}\sum^{n_{\text{in}}}_{j=1}\|\tilde{\mathbf{F}}(\mathbf{m}_{i,j})\|^{2}\right)^{1/2}\right). (31)

For sufficiently large ninn_{\text{in}}, we have that Ci<C_{i}<\infty by 𝐲i<\|\mathbf{y}_{i}\|<\infty and the assumption (27). Moreover, the error bound (29) holds by the assumption (28).

By the definition of the EIGs (17) and (24), we have

|Ψdl(𝐖)Ψnn(𝐖)|1nouti=1nout|log(π^(𝐲i,𝐖))log(π~(𝐲i,𝐖))|.|\Psi^{dl}({\mathbf{W}})-{\Psi}^{nn}({\mathbf{W}})|\leq\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\lvert\log(\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}}))-\log(\tilde{\pi}(\mathbf{y}_{i},{\mathbf{W}}))\rvert. (32)

For sufficiently large ninn_{\text{in}}, we have that the normalization constants π^(𝐲i,𝐖)\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}}) and π~(𝐲i,𝐖)\tilde{\pi}(\mathbf{y}_{i},{\mathbf{W}}) are bounded away from zero, i.e.,

π^(𝐲i,𝐖),π~(𝐲i,𝐖)ci,\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}}),\tilde{\pi}(\mathbf{y}_{i},{\mathbf{W}})\geq c_{i}, (33)

for some constant ci>0c_{i}>0. Then we have

|Ψdl(𝐖)Ψnn(𝐖)|1nouti=1nout1ci|π^(𝐲i,𝐖)π~(𝐲i,𝐖)|1nouti=1noutCiciε,|\Psi^{dl}({\mathbf{W}})-{\Psi}^{nn}({\mathbf{W}})|\leq\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\frac{1}{c_{i}}|\hat{\pi}(\mathbf{y}_{i},{\mathbf{W}})-\tilde{\pi}(\mathbf{y}_{i},{\mathbf{W}})|\leq\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\frac{C_{i}}{c_{i}}\varepsilon, (34)

where we used the bound (29), which implies the bound (30) with constant

C=1nouti=1noutCici.C=\frac{1}{n_{\text{out}}}\sum^{n_{\text{out}}}_{i=1}\frac{C_{i}}{c_{i}}. (35)

3.5 Greedy algorithm

With the DIPNet surrogate, the evaluation of the DLMC EIG Ψnn\Psi^{nn} defined in (26) does not involve any PDE solves. Thus to solve the optimization problem

𝐖=argmax𝐖𝒲Ψnn(𝐖),{\mathbf{W}}^{*}=\operatorname*{arg\,max}_{{\mathbf{W}}\in\mathcal{W}}\Psi^{nn}({\mathbf{W}}), (36)

we can directly use a greedy algorithm that requires evaluations of the EIG, not its derivative w.r.t. 𝐖{\mathbf{W}}. Let SdS_{d} denote the set of all dd candidate sensors; we wish to choose rr sensors from SdS_{d} that maximize the approximate EIG (approximated with the DIPNet surrogate). At the first step with t=1t=1, we select the sensor vSdv^{*}\in S_{d} corresponding to the maximum approximate EIG and set S={v}S^{*}=\{v^{*}\}. Then at step t=2,,rt=2,\dots,r, with t1t-1 sensors selected in SS^{*}, we choose the tt-th sensor vSdSv^{*}\in S_{d}\setminus S^{*} corresponding to the maximum approximate EIG evaluated with tt sensors S{v}S^{*}\cup\{v^{*}\}; see Algorithm 1 for the greedy optimization process, which can achieve (quasi-optimal) experimental designs with an approximation guarantee under suitable assumptions on the incremental information gain of an additional sensor; see JagalurMohanMarzouk21 and references therein. Note that at each step the approximate EIG can be evaluated in parallel for each sensor choice S{v}S^{*}\cup\{v\} with vSdSv\in S_{d}\setminus S^{*}.

Algorithm 1 Greedy algorithm to solve (36)
1:data {𝐲i}i=1Ns\{\mathbf{y}_{i}\}^{N_{s}}_{i=1} generated from the prior samples {𝒎i}i=1Ns\{\boldsymbol{m}_{i}\}^{N_{s}}_{i=1}, dd sensor candidates set SdS_{d}, sensor budget rr, optimal sensor set S=S^{*}=\emptyset
2:optimal sensor set SS^{*}
3:for t=1,,rt=1,\dots,r do
4:     SSdSS\Leftarrow S_{d}\setminus S^{*}
5:     for vSv\in S do
6:         𝐖v{\mathbf{W}}_{v} is the design matrix of sensor choice S{v}S^{*}\cup\{v\}
7:         evaluate Ψnn(𝐖v)\Psi^{nn}({\mathbf{W}}_{v})
8:     end for
9:     vargmaxvSΨnn(𝐖v)v^{*}\Leftarrow\operatorname*{arg\,max}_{v\in S}\Psi^{nn}({\mathbf{W}}_{v})
10:     SS{v}S^{*}\Leftarrow S^{*}\cup\{v^{*}\}
11:end for
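A minimal Python transcription of Algorithm 1 is sketched below. Here eig_of_design is a stand-in for the DIPNet-based DLMC EIG Ψnn(𝐖v) evaluated for the design defined by the selected sensors, and the toy score at the bottom is purely illustrative.

```python
def greedy_sensor_selection(candidates, budget, eig_of_design):
    """Algorithm 1: at each step add the sensor giving the largest approximate EIG.
    candidates: list of candidate sensor indices (S_d); budget: r;
    eig_of_design: callable taking a tuple of selected sensors and returning Psi^nn."""
    selected = []                                  # S* starts empty
    for _ in range(budget):
        remaining = [v for v in candidates if v not in selected]
        # these EIG evaluations are independent and could be run in parallel
        scores = {v: eig_of_design(tuple(selected + [v])) for v in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
    return selected

# toy usage with a hypothetical monotone score in place of Psi^nn
if __name__ == "__main__":
    import numpy as np
    rng = np.random.default_rng(0)
    gains = rng.random(10)
    toy_eig = lambda S: float(np.log1p(sum(gains[list(S)])))
    print(greedy_sensor_selection(list(range(10)), 3, toy_eig))
```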

4 Numerical results

In this section, we present numerical results for OED problems involving a Helmholtz acoustic inverse scattering problem and an advection-diffusion-reaction inverse transport problem to illustrate the efficiency and accuracy of our method. We compare the approximate normalization constant and EIG of our method with 1) the DLMC truth computed with a large number of Monte Carlo samples; and 2) the DLMC computed at the same total computational cost (number of PDE solves) as our method, including the cost of DIPNet training.

Both PDE problems are discretized using the finite element library FEniCS AlnaesBlechtaHakeEtAl2015 . The construction of training data and reduced bases (active subspace and proper orthogonal decomposition) is implemented in hIPPYflow hippyflow , a library for dimension reduced PDE surrogate construction, building on top of PDE adjoints implemented in hIPPYlib VillaPetraGhattas20 . The DIPNet neural network surrogates are constructed in TensorFlow AbadiBarhanChenEtAl2016 , and are adaptively trained using a combination of Adam KingmaBa2014 , and a Newton method, LRSFN, which improves local convergence and generalization OLearyRoseberryAlgerGhattas2019 ; OLearyRoseberryAlgerGhattas2020 ; OLearyRoseberryDuChaudhuriEtAl2021 .

4.1 Helmholtz problem

For the first numerical experiment, we consider an inverse wave scattering problem modeled by the Helmholtz equation with an uncertain medium in the two-dimensional physical domain Ω=(0,3)2\Omega=(0,3)^{2}:

Δue2mk2u\displaystyle-\Delta{u}-e^{2m}k^{2}u =fin Ω,\displaystyle=f\quad\text{in }\Omega, (37a)
PML boundary condition on ΩΓtop,\displaystyle\text{ on }\partial\Omega\setminus\Gamma_{\text{top}}, (37b)
u𝐧\displaystyle\nabla u\cdot\mathbf{n} =0 on Γtop,\displaystyle=0\text{ on }\Gamma_{\text{top}}, (37c)
d(m)\displaystyle\mathcal{F}_{d}(m) =[u(𝒙i,m)]at 𝒙iΩ.\displaystyle=[u(\boldsymbol{x}_{i},m)]\quad\text{at }\boldsymbol{x}_{i}\in\Omega. (37d)

where uu is the total wave field, k4.55k\approx 4.55 is the wavenumber, and e2me^{2m} models the uncertainty of the medium, with the parameter mm a log-prefactor of the squared wavenumber. The right-hand side ff is a point source located at 𝐱=(0.775,2.5)\mathbf{x}=(0.775,2.5). The perfectly matched layer (PML) boundary condition approximates a semi-infinite domain. The candidate sensor locations 𝒙i\boldsymbol{x}_{i} are linearly spaced in the line segment between the edge points (0.1,2.9)(0.1,2.9) and (2.9,2.9)(2.9,2.9), with coordinates {(0.1+2i/35,2.9)}i=049\{(0.1+2i/35,2.9)\}_{i=0}^{49}, as shown in Fig. 2. The prior distribution for the uncertain parameter mm is Gaussian μpr=𝒩(mpr,𝒞pr)\mu_{\text{pr}}=\mathcal{N}({m_{\text{pr}}},\mathcal{C}_{\text{pr}}) with zero mean mpr=0{m_{\text{pr}}}=0 and covariance 𝒞pr=(5.0IΔ)2\mathcal{C}_{\text{pr}}=(5.0I-\Delta)^{-2}. The mesh used for this problem is uniform of size 128×128128\times 128. We use quadratic elements for the discretization of the wave field uu and linear elements for the parameter mm, leading to a discrete parameter 𝐦\mathbf{m} of dimension 16,64116,641. The dimension of the wave field uu is 6604966049; the wave is thus more than sufficiently resolved with respect to the Nyquist sampling criterion for wave problems. A sample of the parameter field mm and the PDE solution uu is shown in Fig. 2 with all 5050 candidate sensor locations marked in circles.

The network has 1010 low-rank residual layers, each with a layer rank of 1010. For this numerical example we demonstrate the effects of using different breadths in the neural network representation; in each case the ResNet learns a mapping from the first rMr_{M} active subspace basis vectors to the first rMr_{M} POD basis vectors. In the case that rM>50r_{M}>50, we use a linear restriction layer to reduce the ResNet latent representation to the 5050-dimensional output. For the majority of the numerical results, we employ an rM=50r_{M}=50 dimensional network. The neural network is trained adaptively using 49154915 training samples and 12281228 validation samples. Using 512512 independent testing samples, the DIPNet surrogate was 81.56%81.56\% 2\ell^{2} accurate, measured as a percentage by

2 accuracy=100(1𝐅~d𝐅d2𝐅d2).\ell^{2}\text{ accuracy}=100\left(1-\frac{\|\tilde{\mathbf{F}}_{d}-\mathbf{F}_{d}\|_{2}}{\|\mathbf{F}_{d}\|_{2}}\right). (38)

For more details on this neural network architecture and training, see OLearyRoseberryDuChaudhuriEtAl2021 . The computational cost of the 5050 dimensional active subspace projector using 128128 samples is equivalent to the cost of 842842 additional training data; since the problem is linear the additional linear adjoint computations are comparable to the costs of the training data generation. As we will see in the next example, when the PDE is nonlinear the additional active subspace computations are marginal. Thus for fair comparison with the same computational cost of PDE solves, we use 4915+842=57574915+842=5757 samples for simple MC.

Refer to captionRefer to caption

Figure 2: A random sample of the parameter field 𝐦\mathbf{m} (left) and the corresponding solution 𝐮\mathbf{u} with candidate observation sensor locations marked in circles (right) for the Helmholtz problem.

To test the efficiency and accuracy of our DIPNet surrogate, we first compute the log normalization constant logπ(𝐲)\log\pi(\mathbf{y}) with our DIPNet surrogate for given observation data 𝐲\mathbf{y} generated by 𝐲=𝐖𝐅d(𝐦)+𝜺\mathbf{y}={\mathbf{W}}\mathbf{F}_{d}(\mathbf{m})+\boldsymbol{\varepsilon}, where 𝐦\mathbf{m} is a random sample from the prior. We use in total 6000060000 random samples for Monte Carlo (MC) to compute the normalization constant as the ground truth. Fig. 3 shows the logπ(𝐲)\log\pi(\mathbf{y}) comparison for three random designs 𝐖{\mathbf{W}} that choose 1515 sensors out of 5050 candidates. We can see that the DIPNet surrogate converges to a value close to the ground truth MC reference, while for the (simple) MC with 57575757 samples, the approximated value indicated by the green star is much less accurate than that of the DIPNet surrogate using 6000060000 samples at similar cost in PDE solves. Note that the DIPNet training and evaluation cost for this small neural network is negligible compared to the PDE solves.


Refer to captionRefer to captionRefer to caption

Figure 3: The approximation of the log normalization constant with increasing numbers of samples without (true MC) and with (DIPNet MC) surrogate at 33 random designs. Green stars indicate 57575757 samples (for simple MC), which is the same computational cost as DIPNet.

The left and middle figures in Fig. 4 illustrate the sample distributions of the relative approximation errors for the log normalization constant logπ(𝐲)\log\pi(\mathbf{y}) and the EIG Ψ\Psi (nout=200n_{\text{out}}=200) with (DIPNet MC with 6000060000 samples) and without (simple MC with 57575757 samples) the DIPNet surrogate, compared to the true MC with 6000060000 samples. These results show that using the DIPNet surrogate we can approximate logπ(𝐲)\log\pi(\mathbf{y}) and Ψ\Psi much more accurately and with less bias than the simple MC. The sample distributions of the EIG at 200200 random designs, compared to the designs chosen by the greedy optimization using the DIPNet surrogate for different numbers of sensors, are shown in the right figure, from which we can see that our method always chooses better designs, with larger EIG values, than all of the random designs.

Refer to captionRefer to captionRefer to caption

Figure 4: Sample distributions for 200 random designs of the relative approximation errors (compared to true MC) for the log normalization constant logπ(𝐲)\log\pi(\mathbf{y}) (left) and EIG Ψ\Psi (middle) by DIPNet MC and simple MC with different numbers of sensors rr. Right: Blue filled areas represent the sample distributions of the true DLMC EIG Ψdl\Psi^{dl} for 200200 random designs. Pink crosses are the true DLMC EIG Ψdl\Psi^{dl} of designs chosen by the greedy optimization using DIPNet surrogates.

Fig. 5 gives approximate values of the EIG with increasing number of outer loop samples noutn_{\text{out}} using: the DIPNet surrogate with inner loop nin=60000n_{\text{in}}=60000, simple DLMC with inner loop nin=5757n_{\text{in}}=5757, and the truth computed with inner loop nin=60000n_{\text{in}}=60000. We can see that the approximate values by the DIPNet surrogates are almost the same as the truth, while simple DLMC results are very inaccurate.

Refer to captionRefer to captionRefer to caption

Figure 5: EIG of true DLMC (nin=60000n_{\text{in}}=60000), DIPNet DLMC (nin=60000n_{\text{in}}=60000), and simple DLMC (nin=5757n_{\text{in}}=5757) with increasing number of outer loop samples noutn_{\text{out}} at 33 random designs.

To show the effect of the truncated rank (breadth) of the DIPNet surrogate, we evaluate the log normalization constant logπ(𝐲)\log\pi(\mathbf{y}) and EIG Ψ\Psi with breadth =10,25,50,100=10,25,50,100 and compare them with true MC and simple MC in Fig. 6. We can see that with increasing breadth the relative error decreases, but it gets worse when the breadth reaches 100100. With breadth =100=100, the difficulties of neural network training start to dominate and diminish the accuracy. This can also be seen in the right part of the figure. The relative error of the EIG approximation decreases (close to) linearly with respect to the generalization error of the DIPNet approximation of the observables for network breadth =10,25,50=10,25,50, which confirms the error analysis in Theorem 1. However, when the breadth increases to 100100, the neural network becomes less accurate (without using more training data), leading to a less accurate EIG approximation.

Refer to captionRefer to captionRefer to caption

Figure 6: Sample distributions for 200 random designs of the relative approximation errors (compared to true MC) for the log normalization constant logπ(𝐲)\log\pi(\mathbf{y}) (left) and EIG Ψ\Psi (middle) by DIPNet MC and simple MC with increasing breadth. Right: Sample distributions for 200 random designs of the relative approximation errors (compared to true MC) against the corresponding DIPNet generalization error at breadth (from left to right) =50,100,25,10=50,100,25,10.

4.2 Advection-diffusion-reaction problem

For the second numerical experiment we consider an OED problem for an advection-diffusion-reaction equation with a cubic nonlinear reaction term. The uncertain parameter mm appears as a log-coefficient of the cubic nonlinear reaction term. The PDE is defined in a domain Ω=(0,1)2\Omega=(0,1)^{2} as

(ku)+𝐯u+emu3\displaystyle-\nabla\cdot(k\nabla u)+\mathbf{v}\cdot\nabla u+e^{m}u^{3} =fin Ω,\displaystyle=f\quad\text{in }\Omega, (39a)
u\displaystyle u =0 on Ω,\displaystyle=0\text{ on }\partial\Omega, (39b)
d(m)\displaystyle\mathcal{F}_{d}(m) =[u(𝒙i,m)]at 𝒙iΩ.\displaystyle=[u(\boldsymbol{x}_{i},m)]\quad\text{at }\boldsymbol{x}_{i}\in\Omega. (39c)

Here k=0.01k=0.01 is the diffusion coefficient. The volumetric forcing function f is a smoothed Gaussian bump located at 𝒙=(0.7,0.7)\boldsymbol{x}=(0.7,0.7),

f(𝐱)=max(0.5,e25(x10.7)225(x20.7)2).f(\mathbf{x})=\max\bigg{(}0.5,e^{-25(x_{1}-0.7)^{2}-25(x_{2}-0.7)^{2}}\bigg{)}. (40)

The velocity field 𝐯\mathbf{v} is a solution of a steady-state Navier Stokes equation with shearing boundary conditions driving the flow (see the Appendix in OLearyRoseberryVillaChenEtAl2022 for more information on the flow). The candidate sensor locations are located in a linearly spaced mesh-grid of points in (0.1,0.9)×(0.1,0.9)(0.1,0.9)\times(0.1,0.9), with coordinates {(0.1i,0.1j),i,j=1,2,,9}\{(0.1i,0.1j),i,j=1,2,\dots,9\}. The prior distribution for the uncertain parameter mm is a mean zero Gaussian with covariance 𝒞pr=(I0.1Δ)2\mathcal{C}_{\text{pr}}=(I-0.1\Delta)^{-2}. The mesh used for this problem is uniform of size 128×128128\times 128. We use linear elements for both uu and mm, leading to a discrete parameter of dimension 16,64116,641. Fig. 7 gives a prior sample of the parameter field mm and the solution uu with all 100100 candidate sensor locations in white circles.


Refer to captionRefer to caption


Figure 7: A random sample of the parameter field 𝐦\mathbf{m} (left) and the corresponding solution 𝐮\mathbf{u} with candidate observation sensor locations marked in circles (right) for the advection-diffusion-reaction problem.

The neural network surrogate is trained adaptively using 409409 training samples and 102102 validation samples. Using 512512 independent testing samples, the DIPNet surrogate was 97.13%97.13\% 2\ell^{2} accurate (see equation (38)). The network has 2020 low-rank residual layers, each with a layer rank of 1010; the breadth of the network is rM=rF=25r_{M}=r_{F}=25. The computational cost of the 2525-dimensional active subspace projector using 256256 samples is equivalent to the cost of 3434 additional training data. As was noted before, when the PDE is nonlinear, the linear adjoint-based derivative computations become much less of a computational burden. Thus we use 409+34=443409+34=443 samples for simple MC for a fair comparison.

We first examine the log normalization constant logπ(𝐲)\log\pi(\mathbf{y}) computed with our DIPNet surrogate (trained with 409409 PDE solves), compared against the truth computed with 6000060000 MC samples and the simple MC computed with 443443 samples. Fig. 8 shows the logπ(𝐲)\log\pi(\mathbf{y}) comparison for three random designs that select 1515 sensors out of 100100 candidates. We can see that DIPNet MC converges to a value close to the true MC curves, while the simple MC estimate (green star), computed with the same number of PDE solves (443443) as DIPNet MC, is much less accurate.

Refer to captionRefer to captionRefer to caption

Figure 8: The approximation of the log normalization constant with increasing numbers of samples without (true MC) and with (DIPNet MC) surrogate at 33 random designs. Green stars indicate 443443 samples (simple MC), having the same computational cost as DIPNet.

The left figure of Fig. 9 shows the relative errors for logπ(𝐲)\log\pi(\mathbf{y}) computed with the DIPNet surrogate and with simple MC using 443443 samples, relative to the true MC with 6000060000 samples, for 200200 random designs. We see again that DIPNet gives better accuracy with less bias than simple MC.

Fig. 10 shows the EIG approximations of three random designs with increasing numbers of outer loop samples noutn_{\text{out}}, using the DIPNet MC with 6000060000 (inner loop) samples, simple MC with 443443 samples, and true MC with 6000060000 samples. We can see that the values of DIPNet MC are quite close to the true MC, while simple MC is far off. Relative errors of the EIG Ψ\Psi (nout=100n_{\text{out}}=100) computed with the DIPNet surrogate and with the simple MC for 200200 random designs are given in the middle figure of Fig. 9. With the DIPNet DLMC Ψnn\Psi^{nn}, we can use the greedy algorithm to find the optimal designs. The DIPNet greedy designs are presented as the pink crosses in the right figure of Fig. 9. We can see that the designs chosen by the greedy algorithm have much larger EIG values than all 200200 random designs.


Refer to captionRefer to captionRefer to caption

Figure 9: Sample distributions of the relative errors (compared to the truth MC) in approximating the log normalization constant logπ(𝐲)\log\pi(\mathbf{y}) (left) and EIG Ψ\Psi (middle) by DIPNet MC and simple MC with different number of sensors rr; Right: Blue filled areas represent the sample distributions of the true DLMC EIG Ψdl\Psi^{dl} for 200 random designs. Pink crosses are the true DLMC EIG Ψdl\Psi^{dl} of designs chosen by the greedy optimization using DIPNet surrogates.

Refer to captionRefer to captionRefer to caption

Figure 10: EIG of true DLMC (nin=60000n_{\text{in}}=60000), DIPNet DLMC (nin=60000n_{\text{in}}=60000) and simple DLMC (nin=443n_{\text{in}}=443) with increasing number of outer loop samples noutn_{\text{out}} at 33 random designs.

5 Conclusions

We have developed a computational method based on DIPNet surrogates for solving large-scale PDE-constrained Bayesian OED problems to determine optimal sensor locations (using the EIG criterion) to best infer infinite-dimensional parameters. We exploited the intrinsic low dimensionality of the parameter and data spaces and constructed a DIPNet surrogate for the parameter-to-observable map. The surrogate was used repeatedly in the evaluation of the normalization constant and the EIG. We presented error analysis for the approximation of the normalization constant and the EIG, showing that the errors are of the same order as the DIPNet RMS approximation error. Moreover, we used a greedy algorithm to solve the combinatorial optimization problem for sensor selection. The computational efficiency and accuracy of our approach were demonstrated by two numerical experiments. Future work will focus on gradient-based optimization that also exploits derivative information of the DIPNet w.r.t. both the parameter and the design variables, on the use of different optimality criteria such as A-optimality or D-optimality, and on exploring new network architectures for intrinsically high-dimensional Bayesian OED problems.
