Choosing observation operators to mitigate model error in Bayesian inverse problems
Abstract
In statistical inference, a discrepancy between the parameter-to-observable map that generates the data and the parameter-to-observable map that is used for inference can lead to misspecified likelihoods and thus to incorrect estimates. In many inverse problems, the parameter-to-observable map is the composition of a linear state-to-observable map called an ‘observation operator’ and a possibly nonlinear parameter-to-state map called the ‘model’. We consider such Bayesian inverse problems where the discrepancy in the parameter-to-observable map is due to the use of an approximate model that differs from the best model, i.e. to nonzero ‘model error’. Multiple approaches have been proposed to address such discrepancies, each leading to a specific posterior. We show how to use local Lipschitz stability estimates of posteriors with respect to likelihood perturbations to bound the Kullback–Leibler divergence of the posterior of each approach with respect to the posterior associated to the best model. Our bounds lead to criteria for choosing observation operators that mitigate the effect of model error for Bayesian inverse problems of this type. We illustrate one such criterion on an advection-diffusion-reaction PDE inverse problem from the literature, and use this example to discuss the importance and challenges of model error-aware inference.
Keywords: Model error, Bayesian inverse problems, experimental design, misspecified likelihood, posterior error bounds
1 Introduction
In many applications, one considers an inverse problem where the data is a noisy observation of the true state of some phenomenon of interest, where the true state is the output of a parameter-to-state mapping or ‘model’ corresponding to an unknown parameter. That is, for the unknown true parameter and the best model , the true state is , and the data is a realisation of the random variable
for a state-to-observable map or ‘observation operator’ and additive noise . The inverse problem consists in inferring the data-generating parameter from the data .
We consider Bayesian inverse problems with data taking values in , centred Gaussian observation noise, and linear observation operators. In this setting, the noise has the normal distribution for and positive definite covariance , and the observation operator is a linear mapping from the ‘state space’ of admissible states to . In the Bayesian approach to inverse problems, one fixes a possibly infinite-dimensional parameter space consisting of admissible values of , and describes the unknown true parameter using a random variable with prior law supported on . Under certain assumptions on the prior , the observation operator , and the model , the corresponding posterior measure is well-defined and admits the following density with respect to the prior :
See e.g. [30, Theorem 4.1] for sufficient conditions for well-definedness in the case of a Gaussian prior .
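For orientation, the density referred to above takes the standard Gaussian-likelihood form. A sketch in generic placeholder notation — writing y for the data, u for the parameter, G† for the best model, 𝒪 for the observation operator, Γ for the noise covariance, and μ0 for the prior; these symbols are chosen here purely for illustration and need not match the notation used elsewhere — is:

```latex
% Sketch, in placeholder notation, of the standard Gaussian-likelihood posterior:
\frac{\mathrm{d}\mu^{y}}{\mathrm{d}\mu_{0}}(u) \;\propto\; \exp\bigl(-\Phi^{\dagger}(u)\bigr),
\qquad
\Phi^{\dagger}(u) \;:=\; \tfrac{1}{2}\,\bigl\lVert \Gamma^{-1/2}\bigl(y - \mathcal{O}\,G^{\dagger}(u)\bigr)\bigr\rVert^{2}.
```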
In practice, the best model is not available, so an approximate model is used instead. Alternatively, may be available but impractical or costly to evaluate: in the context of multifidelity or reduced order models, may be the element of a collection of models that yields the most accurate predictions of state, and may be a reduced-order model, an emulator, or a surrogate, i.e. a model that yields less accurate predictions but can be evaluated more cheaply and quickly than .
We shall refer to the difference as the ‘model error of the approximate model’, or simply, the ‘model error’. Model error is also known as ‘model inadequacy’, ‘model discrepancy’, or ‘structural uncertainty’, see e.g. [18, 5]. We do not use the term ‘model misspecification’, since this term is used in the statistics literature to refer to the setting where the parameter space does not contain the true parameter ; see e.g. [12, Section 8.5].
In the context of inverse problems, model error is important because it may lead to a wrong or ‘misspecified’ likelihood, which in turn may lead to incorrect inference. These negative effects may persist even in the standard asymptotic regimes considered in statistics, such as the large-data or small-noise limits. For example, numerical experiments in [5, Section 3] show how ignoring the model error results in posterior distributions that do not converge to the true data-generating parameter as the number of observations grows. An analytical example in [1, Section 8.1] shows a similar phenomenon for the problem of inferring the initial condition of a heat equation over a finite time interval from a noisy observation of the terminal state, in the limit of small noise.
The potential for incorrect inference due to model error raises the question of how to mitigate the effect of model error in Bayesian inverse problems. We approach this question from the perspective of selecting an appropriate observation operator . To see why this perspective is reasonable, note that the linearity of and the definition of the model error imply that . Using this fact in the equation for shows that only the observed model error has the potential to affect inference. In other words, ignoring model error leads to a misspecified likelihood if and only if is nonzero with positive -probability. This suggests that a possible approach to reduce the effect of model error on Bayesian inference is to choose an observation operator for which is closer to zero with higher -probability.
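As a minimal numerical illustration of this point — all dimensions, matrices, and names below are illustrative assumptions, not objects from this paper — the following sketch checks that the Gaussian misfits computed with the best state and with the approximate state coincide whenever the observation operator annihilates their difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy state space R^3; the model error points in the direction e3.
state_best = np.array([1.0, 2.0, 3.0])
state_approx = np.array([1.0, 2.0, 0.0])   # model error = (0, 0, 3)

# Observation operator O: R^3 -> R^2 that only sees the first two
# state components, so the model error lies in ker(O).
O = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
Gamma = 0.1 * np.eye(2)                    # noise covariance

y = O @ state_best + rng.multivariate_normal(np.zeros(2), Gamma)

def misfit(y, observed_state, Gamma):
    """Negative log-likelihood (up to a constant) for additive Gaussian noise."""
    r = y - observed_state
    return 0.5 * r @ np.linalg.solve(Gamma, r)

# The two misfits coincide because O @ (state_best - state_approx) = 0,
# i.e. only the *observed* model error can affect inference.
print(misfit(y, O @ state_best, Gamma), misfit(y, O @ state_approx, Gamma))
```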
The approach of choosing observation operators suggests a connection with experimental design. In Bayesian experimental design, the main task is to select observations in order to maximise information about the parameter to be inferred. To quantify the information gain, one may use the Kullback–Leibler divergence of the posterior with respect to the prior, for example. In contrast, we control the Kullback–Leibler divergence between pairs of posteriors defined by the same prior but different likelihoods, by using the difference between the pair of negative log-likelihoods or ‘misfits’. The main task is then to select observations in order to minimise this difference. This approach can be viewed as experimental design for mitigating the effect of model error on inference.
Contributions
In this paper, we consider some approaches for accounting for model error in Bayesian inference: the approximate approach that ignores model error; the ‘enhanced error approach’ of [16]; the ‘joint inference approach’ that aims to infer both and ; and the ‘marginalisation approach’, which integrates out the model error component of the posterior from the joint inference approach. For the first three approaches, we compute upper bounds for the difference between the misfit of each approach and the misfit associated to the best model . We use these bounds to bound the Kullback–Leibler divergence between the posterior that results from each approach and the best posterior . We also compute upper bounds for the Kullback–Leibler divergence with respect to the approximate posterior that results from using the given approximate model instead of the best model . We express each upper bound in terms of the model error and the objects that the approach uses to account for the model error.
To prove these bounds, we exploit the assumption of additive Gaussian noise to prove a lemma concerning the difference of two quadratic forms that are weighted by different matrices; see Lemma A.2. We prove the upper bounds on the Kullback–Leibler divergences by combining the upper bounds on the differences between the misfits with a local Lipschitz stability estimate of posteriors from [29]. An important advantage of this estimate is that it holds for the Kullback–Leibler topology under the mild assumption of -integrable misfits; see Theorem 2.1.
We use the upper bounds on the Kullback–Leibler divergence with respect to the best posterior to derive sufficient conditions on the observation operator , the model error , and the objects that the approach uses to account for the model error, in order for the resulting posterior to closely approximate in the Kullback–Leibler topology. These conditions give ‘positive’ criteria for choosing observation operators to mitigate the effect of model error on inference. We use the upper bounds with respect to the approximate posterior to derive necessary conditions for the resulting posterior to differ from . These conditions give ‘negative’ criteria which one can use to exclude observation operators from consideration when one aims to mitigate the effect of model error on Bayesian inference.
We complement our theoretical findings with numerical experiments for an inverse problem from the literature, namely the inverse problem of inferring the time amplitude of a source term in an advection-diffusion-reaction partial differential equation (PDE) when the initial condition is unknown.
1.1 Overview of related work
The importance of accounting for the model error is well-documented in the literature on Bayesian inverse problems; see e.g. [5, 16, 18] and the references therein. The ‘Bayesian approximation error’ and ‘enhanced error’ approaches due to [17] and [16] respectively have been applied in various contexts. For example, the enhanced error approach has been applied with pre-marginalisation to electrical impedance tomography [24], diffuse optical tomography [19], and inversion for coefficient fields in the presence of uncertain conductivities [23].
Various methods have been developed to estimate or account for model error in Bayesian inference. For example, the work [6] presented an iterative algorithm to update an estimate of the model error in a model order reduction context, and proved geometric convergence of the algorithm. The authors of [27, 26] take a different perspective: instead of viewing model errors as additive perturbations to an approximate model, they incorporate these model errors into parametrisations of some phenomenon of interest, and use polynomial chaos expansions. Information theory has been used to quantify model error uncertainty or model bias in goal-oriented inference settings [14] and by exploiting concentration inequalities [13]. Optimal transport was applied to tackle problems due to model errors in [28].
In the context of Bayesian optimal experimental design, model error is also referred to as ‘model uncertainty’. The work [20] considers A-optimal designs for inverse problems in the presence of so-called ‘irreducible’ model uncertainties, i.e. uncertainties that cannot be reduced by collecting more data. In contrast, the work [4] considers reducible uncertainties, and describes an A-optimality criterion that involves marginalising out this reducible uncertainty. The work [2] combines the Laplace approximation and the Bayesian approximation error approach to find A-optimal designs for nonlinear Bayesian inverse problems.
As far as we are aware, the works most closely related to our paper are [7, 9]. The work [7] considers Bayesian inverse problems where the observation operator may be nonlinear and the model is approximated by a neural network. In particular, [7, Theorem 1] bounds the Kullback–Leibler divergence between the original and approximated posterior in terms of a norm of the model error itself. In contrast, we consider linear observation operators, do not focus on any specific class of approximate models, and bound the Kullback–Leibler divergence in terms of norms of the observed model error. In [9], the main stability estimate [9, Theorem 3.4] bounds the expected utility of a design in terms of a sequence of likelihoods. The focus of [9] is not model error, but rather the convergence of the utility of approximate optimal designs corresponding to a convergent sequence of likelihoods. The goal there is ‘standard’ optimal experimental design, i.e. choosing observations to maximise the information content of each additional data point and thereby to reduce posterior uncertainty. In contrast, we make pairwise comparisons of likelihoods and focus on choosing observations to mitigate the effect of model error on Bayesian inference.
1.2 Outline
We describe the notation that we use in Section 1.3. In Section 2, we state and prove upper bounds on errors between misfits and upper bounds on Kullback–Leibler divergences between the best posterior and posteriors that account for model error in different ways. We use these bounds to discuss positive and negative criteria for choosing observation operators to mitigate the effect of model error on Bayesian inference. In Section 3, we consider a specific inverse problem and use it to illustrate a positive criterion for choosing observation operators, in the setting where one uses the approximate posterior for inference. In Section 4, we describe and discuss the numerical experiments that we used to investigate the behaviour of the approximate posterior. We conclude in Section 5. In Appendix A, we present proofs of lemmas and results that are not stated in the main text.
1.3 Notation
The notation indicates the replacement of using , while indicates equality by definition. For , , and . Let be the underlying probability space for all random variables below. Given a measurable space and an -valued random variable , we denote the law of by . If is a topological space, we denote its Borel -algebra by , the set of all Borel probability measures on by , and the support of an arbitrary by . For , denotes the closure of . For and an -valued random variable , the expression means that is distributed according to , i.e. as measures. Given , we denote the absolute continuity of with respect to by , the corresponding Radon–Nikodym derivative by , and the Kullback–Leibler (KL) divergence of with respect to by , which has the value if and otherwise. For and , denotes the norm with respect to ; we shall specify the domain and codomain when necessary. We denote a Gaussian measure with mean and covariance operator by . Some useful facts about Gaussian measures on Banach spaces are given in [30, Section 6.3], for example.
Given and , we write to denote the measure that has Radon–Nikodym derivative , where . We shall use this notation to denote the posterior associated to a given prior and misfit below.
For and a symmetric positive semidefinite matrix ,
(1.1)
We use single bars ‘’ to emphasise when we take norms of vectors in finite-dimensional spaces. If is the identity matrix, then we omit the subscript. Given normed vector spaces for , denotes the continuous dual space of . For a bounded linear operator , we denote the kernel, adjoint, and operator norm of by , , and , and we write the image of under as . For a matrix , and denote the adjoint and Moore–Penrose pseudoinverse of respectively.
2 Bounds on KL divergence in terms of observed model error
In this section, we state the theoretical results of our paper, namely bounds on the KL divergence between pairs of posteriors, expressed in terms of the model error and the objects that some approaches use to account for model error. These bounds show that it is not the model error itself but rather the observed model error that affects inference. We consider four natural types of posteriors that account for model error in different ways. After introducing the common objects and basic assumptions, we state bounds on certain KL divergences for the approximate posterior, enhanced noise posterior, joint posterior, and marginal posterior in Sections 2.1, 2.2, 2.3, and 2.4 respectively. We emphasise that we use the KL divergence only as a measure of the discrepancy between different posteriors, and not because the KL divergence is often used in classical optimal experimental design. Indeed, our bounds could have also been expressed using other measures of discrepancy that are considered in [29], such as the Hellinger distance or the total variation distance. Moreover, in the example inverse problem in Section 3, we use one such bound to guide the choice of observations, but do not estimate the terms in the bounds. The efficient and accurate estimation of the terms in the bounds below is beyond the scope of this paper.
We assume that the ‘parameter space’ of admissible values of the unknown true data-generating parameter is a Borel subset of a separable Banach space, and fix a prior that is supported on . Let denote the ‘state space’, which is a Banach space containing both the image of under the best model and the image of under the approximate model . In many inverse problems, a state is a function on some domain that describes some observable phenomenon of interest, and is often expressed as the solution of a differential equation; see Section 3 below. We shall refer to any measurable mapping from to as a ‘model’, and to any continuous linear mapping from to as an ‘observation operator’, where is fixed but arbitrary. Thus, and may be infinite-dimensional, but the data space is finite-dimensional. Infinite-dimensional data settings will involve additional technical issues — see e.g. [30, Remark 3.8] — that are not directly related to the main ideas of this paper.
We shall use the following local Lipschitz stability result to prove our bounds.
Theorem 2.1.
Let and for . Then
and thus is absolutely continuous with respect to . In particular,
and thus and are mutually equivalent.
Proof.
We now state the main assumptions of this paper. Given and , there exists a unique best parameter and unique best model , such that for and for a chosen observation operator , the data is a realisation of the random variable
(2.1) |
where is positive definite. We write to emphasise that is linear but may be nonlinear. The data model (2.1) leads to the best misfit , which together with the prior yields the best posterior :
(2.2) |
We assume that , and that and are statistically independent. The best posterior is important because it is the posterior that is consistent with the true data model (2.1).
2.1 Bound on KL divergence for approximate posterior
In general, one cannot perform inference with the best model , and one uses instead an approximate model distinct from . For example, many mathematical models of natural phenomena based on differential equations do not admit closed-form solutions, so one must perform inference using some numerical approximation of . The ‘model error’ of is given by the difference
(2.3) |
Given the assumption that , we obtain the approximate misfit and the approximate posterior by replacing with the approximate model in (2.2):
(2.4) |
By comparing the definitions of and in (2.2) and (2.4), we conclude that if there exists some set such that and does not vanish on , then the likelihood corresponding to will be misspecified, and thus using instead of will lead to incorrect inference. To make this observation more quantitative, we first bound the misfit error in Lemma 2.2, and then bound the KL divergence in Proposition 2.3.
Lemma 2.2.
If , then
(2.5) |
where .
We defer the proof of Lemma 2.2 to Section B.1.
Combining (2.5) with the bound on implies
Since the right-hand side of the inequality above is larger than , it follows that the bound given in Lemma 2.2 is not optimal, because it is worse than the bound we could obtain using the triangle inequality. Moreover, the term inside the parentheses on the right-hand side of (2.5) cannot be evaluated if is unavailable. Nevertheless, the bound (2.5) is useful, because it bounds in terms of the average observed model error and quantities that are assumed to be finite.
Proposition 2.3.
If , then
for .
Proof of Proposition 2.3.
We apply Lemma 2.2 and the second statement of Theorem 2.1 with , , and to obtain
This completes the proof of Proposition 2.3. ∎
Although in Proposition 2.3 is not optimal, it is finite under the stated hypotheses.
Proposition 2.3 shows that the Kullback–Leibler divergences between and are controlled by the average observed model error . In order for these divergences to be small, one should choose the observation operator such that takes values in or near with high -probability. In particular, if , then using will yield the same inference as . This is an example of a positive criterion for choosing observation operators to mitigate the effect of model error on Bayesian inference.
This condition can be useful for guiding the choice of observation operator even if the model error is not fully known. For example, if one can determine a priori that the model error takes values in some proper subspace of the state space, then any observation operator whose kernel contains this subspace will satisfy the condition. However, one can also choose the kernel to be too large, in the sense that the resulting observations yield little or no information about the state itself. Thus, the approach of choosing the observation operator to mitigate model error involves the following tradeoff: its kernel should be as small as possible, yet large enough that the model error takes values in it almost surely under the prior. In Section 3, we revisit this observation on a specific example.
The preceding discussion emphasises the main idea of this work: to mitigate the effect of model error on Bayesian inference, one should exploit all available knowledge about the model error and about the inverse problem in order to guide the choice of observations. This main idea is valid, independently of the fact that the bounds in Lemma 2.2 and Proposition 2.3 are not sharp.
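The following linear-Gaussian toy problem (an illustrative sketch; the matrices, dimensions, and function names are assumptions made here, not the paper's notation) makes the criterion concrete: the KL divergence between the posteriors obtained with the best and the approximate model is computed in closed form, and it vanishes exactly when the kernel of the observation operator contains the range of the model error.

```python
import numpy as np

def gaussian_posterior(A, O, Gamma, y, C0):
    """Posterior N(m, C) for data y = O @ A @ u + noise, noise ~ N(0, Gamma),
    with Gaussian prior u ~ N(0, C0)."""
    M = O @ A
    C = np.linalg.inv(M.T @ np.linalg.solve(Gamma, M) + np.linalg.inv(C0))
    m = C @ M.T @ np.linalg.solve(Gamma, y)
    return m, C

def kl_gaussian(m1, C1, m2, C2):
    """KL divergence KL( N(m1, C1) || N(m2, C2) )."""
    k = len(m1)
    C2inv = np.linalg.inv(C2)
    d = m2 - m1
    return 0.5 * (np.trace(C2inv @ C1) + d @ C2inv @ d - k
                  + np.log(np.linalg.det(C2) / np.linalg.det(C1)))

rng = np.random.default_rng(1)
C0 = np.eye(2)                                          # prior covariance
A_best = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A_approx = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
# The model error (A_best - A_approx) maps into span{e3} of the state space R^3.

O_good = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # kernel contains span{e3}
O_bad = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])    # kernel does not contain e3

Gamma = 0.05 * np.eye(2)
u_true = np.array([0.7, -1.2])
for O in (O_good, O_bad):
    y = O @ A_best @ u_true + rng.multivariate_normal(np.zeros(2), Gamma)
    m_best, C_best = gaussian_posterior(A_best, O, Gamma, y, C0)
    m_apx, C_apx = gaussian_posterior(A_approx, O, Gamma, y, C0)
    print(kl_gaussian(m_best, C_best, m_apx, C_apx))    # 0.0 for O_good
```

In the first case the observed model error vanishes identically, so the two posteriors coincide; in the second case they do not.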
2.2 Bounds on KL divergence for enhanced noise posterior
Recall that the definition (2.2) of the best misfit follows from the best model (2.1) for the data random variable . The approximate misfit in (2.4) can be seen as a misfit that results from assuming that . In the enhanced noise approach, one models the unknown state error as a random variable that is assumed to be statistically independent of and [16]. This is related to the so-called ‘pre-marginalisation’ approach; see e.g. [19]. We do not assume that . The corresponding enhanced noise data model is
where the ‘enhanced noise’ has the law because is linear. Since we assume that is positive definite and since is nonnegative definite, is positive definite, hence invertible.
The enhanced noise misfit and enhanced noise posterior are
(2.6)
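A minimal sketch of how the enhanced noise misfit in (2.6) could be assembled — assuming, for illustration, placeholder names O for the observation operator, Sigma for the observation-noise covariance, and nu_mean, Sigma_nu for the mean and covariance of the random variable that models the state error — is:

```python
import numpy as np

def enhanced_noise_misfit(y, observed_state, O, Sigma, nu_mean, Sigma_nu):
    """Enhanced error approach (sketch): the unknown state error is modelled as a
    Gaussian random variable, independent of the observation noise, so that the
    effective noise is Gaussian with mean O @ nu_mean and covariance
    Sigma + O @ Sigma_nu @ O.T, which is positive definite."""
    Sigma_enh = Sigma + O @ Sigma_nu @ O.T
    r = y - observed_state - O @ nu_mean
    return 0.5 * r @ np.linalg.solve(Sigma_enh, r)

# Toy usage with two observations of a two-dimensional state.
O = np.eye(2)
print(enhanced_noise_misfit(np.array([0.9, 1.1]), np.array([1.0, 1.0]),
                            O, 0.1 * np.eye(2), np.zeros(2), 0.2 * np.eye(2)))
```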
Lemma 2.4 below bounds the error between and in terms of the shifted observed model error term and the difference of covariance matrices . Define the scalar
(2.7) |
By Lemma A.1, satisfies for all .
Lemma 2.4.
If , then
(2.8)
Furthermore,
We defer the proof of Lemma 2.4 to Section B.2.
Proposition 2.5.
If , then
for .
Proof of Proposition 2.5.
Given that , we may apply Theorem 2.1 with , , and , to obtain
where the second inequality follows from the bound (2.8) of Lemma 2.4. ∎
The significance of Proposition 2.5 is similar to that of Proposition 2.3. The main difference is that the error with respect to the best misfit is now controlled by
(2.9) |
where only the first term depends on the model error . Since we assume that is positive definite, the first term in (2.9) vanishes if and only if -almost surely. This condition differs from the sufficient condition for that was implied by Lemma 2.2, namely, that -almost surely. By (1.1), the second term in (2.9) vanishes if and only if
(2.10) |
If , then almost surely, which implies that . This in turn implies that (2.10) holds. Thus, if and , then . These sufficient conditions do not assume that . In fact, if in addition to it also holds that , then it follows from (2.6) that . The next lemma provides a more quantitative version of this statement, using the constant given in (2.7).
Lemma 2.6.
If , then
(2.11)
For the proof of Lemma 2.6, see Section B.2.
Proposition 2.7.
If , then
for .
The proof of Proposition 2.7 is similar to that of Proposition 2.5 except that one uses Lemma 2.6 instead of Lemma 2.4, so we omit it.
Proposition 2.7 implies the following: given an enhanced noise model with mean and covariance , if one chooses the observation operator so that
then the enhanced noise posterior and the approximate posterior coincide. Equivalently, and (2.10) together imply that . Thus, such a choice of observation operator yields an enhanced noise posterior that does not account for model error. In this case, the approximate posterior and the enhanced noise posterior perform equally well in terms of inference, and it would be better to use the approximate posterior instead, since it is simpler. Thus, the conditions and (2.10) give an example of negative criteria that one can use to exclude observation operators from consideration when one aims to mitigate the effect of model error on Bayesian inference.
2.3 Bounds on KL divergence for joint posterior
In the enhanced noise approach given by (2.6), one accounts for the uncertainty in the unknown state error by approximating the unknown state error using a random variable . The goal is to infer only. In the joint parameter-error inference approach, one aims to infer jointly, by using a random variable with prior and using Bayes’ formula. The joint prior on the random variable is given by the product of the prior on the unknown and a prior on the unknown . It is possible to specify the prior on the unknown without knowing , but one must specify a space of admissible values of . We assume that the space is a Radon space and that is a Borel probability measure.
The assumption that the prior on has product structure is equivalent to the assumption that and are independent random variables. This independence assumption is reasonable, since by (2.3) it follows that does not depend on .
Define the joint misfit and joint posterior by
(2.12) |
An important disadvantage of jointly inferring the parameter and model error is that the dimension of the space on which one performs inference increases; this tends to make the inference task more computationally expensive. It is also known that the problem of identifiability may arise; see e.g. [5]. We shall not consider the problem of identifiability here. On the other hand, jointly inferring the parameter and model error is consistent with the Bayesian approach of treating all unknowns as random variables and updating these distributions using the data. In addition, the joint inference approach offers the possibility to improve the approximate model in a data-driven way, e.g. by using the -marginal posterior mean. This improvement can be used to obtain better estimates of both the parameter and the state .
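A sketch of the joint misfit in (2.12), with illustrative placeholder names (G for the approximate model, nu for the state-error variable), is given below; the main practical consequence is that the unknown is now the pair (u, nu), so the inference is carried out over a larger space.

```python
import numpy as np

def joint_misfit(u, nu, y, G, O, Sigma):
    """Sketch of the joint negative log-likelihood (up to a constant) for the
    data model y = O(G(u) + nu) + noise with Gaussian noise N(0, Sigma);
    both the parameter u and the state error nu are treated as unknowns."""
    r = y - O @ (G(u) + nu)
    return 0.5 * r @ np.linalg.solve(Sigma, r)
```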
To compare the joint posterior with the posterior measures , and that we introduced earlier, we ‘lift’ the misfits and posteriors according to
(2.13) |
In (2.13), we use the notation to refer to two functions defined on different domains. The following lemma shows that this does not result in any fundamental changes, and that the lifted posterior is the product of the original posterior with the prior law for the unknown .
Lemma 2.8.
For ,
(2.14) |
If for some , then
(2.15) |
Proof.
Since is a probability measure, the definition of in (2.13) implies that
Thus, the normalisation constant for the likelihood that generates the lifted posterior on agrees with the corresponding normalisation constant for the posterior on . This proves (2.14). The fact that is constant with respect to implies the second statement of the lemma. ∎
Recall that denotes the probability measure in the underlying probability space. Below, we will write the random variables and explicitly, and take -norms with respect to instead of , e.g.
See (2.16) below for an example.
Lemma 2.9.
If , then
(2.16) |
Furthermore, .
Proposition 2.10.
If , then and satisfy
where .
We prove Lemma 2.9 in Section B.3. The proof of Proposition 2.10 is similar to the proofs of Proposition 2.3 and Proposition 2.5, so we omit it.
By Proposition 2.10, a sufficient condition for is that
In order to use this condition to choose a prior and observation operator , one must have some a priori knowledge about the set , e.g. that this set is contained in a proper affine subspace of , where and are an element and subspace of respectively. However, by the discussion before Section 2.2, if one knows that , then one can use this information to choose so that , e.g. by setting so that .
Next, we consider the lifted approximate posterior and the joint posterior .
Lemma 2.11.
If and , then
(2.17) |
Furthermore, .
Proposition 2.12.
If and , then
where .
We prove Lemma 2.11 in Section B.3 and omit the proof of Proposition 2.12 due to its similarity with the proofs of Proposition 2.3 and Proposition 2.5.
Recall that if , then the joint posterior and lifted best posterior agree. Proposition 2.12 shows that if , then the joint posterior and the lifted approximate posterior agree. In this case, there is no additional benefit in terms of inference from using the joint posterior, so is a negative criterion by which one can exclude an observation operator from consideration when one aims to mitigate the effect of model error on Bayesian inference. We can interpret the negation of this condition, i.e. that with positive -probability, as follows: if the joint posterior differs from the lifted approximate posterior , then the observed state update cannot vanish -almost surely. This implies that the model must be updated by some nonzero with positive -probability.
The preceding observation establishes a connection with the so-called ‘parametrised background data-weak approach’ for state estimation. Consider the setting where the state space is a Hilbert space and for continuous linear functionals . By a Hilbert space identity, is the orthogonal complement of the closure of the range of . Denote the Riesz representation operator by , so that for every . Then one can verify by direct calculations that . Since every finite-dimensional subspace of is closed, holds if . The subspace corresponds to the ‘variational update space’ in [22, Section 2.3]. The term ‘update space’ refers to the goal of updating an existing model that maps parameters to states in order to improve state prediction.
It is possible to state and prove the analogues of the preceding bounds for the lifted enhanced noise posterior . These bounds are not relevant for the main goal of this paper. However, for the sake of completeness, we state them in Section B.3.
2.4 Kullback–Leibler error of marginal posterior
We define the marginal posterior as the -marginal of the joint posterior :
(2.18) |
One can approximate the marginal posterior by using Monte Carlo integration of the joint posterior over possible realisations of , for example. Since the marginal posterior requires the joint posterior, it inherits the disadvantages and advantages of the joint posterior.
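One way to realise the Monte Carlo approximation mentioned above is sketched below; the names (G, O, Sigma, nu_sampler) are illustrative assumptions. The unnormalised marginal posterior density of the parameter with respect to its prior is estimated by averaging the joint Gaussian likelihood over prior samples of the state-error variable.

```python
import numpy as np

def marginal_density_unnormalised(u, y, G, O, Sigma, nu_sampler, n_samples=1000):
    """Monte Carlo sketch: estimate the u-marginal posterior density with respect
    to the prior on u, up to a u-independent constant, by averaging the joint
    Gaussian likelihood exp(-0.5 ||y - O(G(u) + nu)||^2_{Sigma^{-1}}) over prior
    samples nu drawn from nu_sampler()."""
    total = 0.0
    for _ in range(n_samples):
        r = y - O @ (G(u) + nu_sampler())
        total += np.exp(-0.5 * r @ np.linalg.solve(Sigma, r))
    return total / n_samples
```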
In Section 2.3, we bounded the KL divergence of the joint posterior with respect to the lifted posteriors for that were defined in (2.13), and we observed in Lemma 2.8 that the -marginal of the lifted posterior is exactly . These bounds imply analogous bounds on the KL divergence of the marginal posterior with respect to , , and .
Below, denotes the regular version of the joint posterior law of conditioned on . Given our assumption that the random variable takes values in a Radon space, it follows that the regular conditional distribution exists and is unique up to sets of -measure zero.
Proposition 2.13.
Let and . Suppose . Then
Proof.
Proposition 2.13 implies that the KL divergence of the marginal posterior with respect to for is smaller than the KL divergence of the joint posterior and the corresponding lifted posterior . In other words, marginalisation can only reduce the KL divergence. As a result, the sufficient conditions for to agree with are also sufficient conditions for to agree with , for .
3 Example
In this section, we consider an inverse problem based on a PDE initial boundary value problem (PDE-IBVP), and use it to illustrate the positive criterion from Section 2.1 for choosing an observation operator that eliminates the effect of model error on inference. In other words, combining this observation operator with the approximate model yields exactly the same inference as if one had used the best model instead of . This PDE-IBVP was discussed from the perspective of classical optimal experimental design in [3, 4], for example.
Below, we use a generic constant as a placeholder whose value may change from line to line.
3.1 Advection-diffusion initial boundary value problem
Let be a time interval and be a bounded open domain in . Fix a constant , a vector field , nonzero and nonnegative . Define the parabolic operator . For an arbitrary, sufficiently regular function , consider the PDE-IBVP with homogeneous Neumann boundary condition and inhomogeneous initial condition
(3.1a)
(3.1b)
(3.1c) |
where in (3.1b) denotes the unit exterior normal to at . We assume that a unique solution of (3.1) exists, and define the best model by . For a fixed , the true state is the function . For the theory of existence and regularity of solutions of parabolic PDE-IBVPs with Neumann boundary conditions, see e.g. [11, Part 2, Section 10].
We define the approximate model in a similar way as we defined . For the same coefficients and as above, we consider the IBVP with homogeneous Neumann boundary condition and homogeneous initial condition
(3.2a)
(3.2b)
(3.2c) |
Note that (3.1) and (3.2) differ only in their respective initial conditions. We define , for the unique solution of the IBVP (3.2). By homogeneity of the initial condition and the boundary condition, and using the superposition principle for linear PDEs, it follows that the approximate model is a linear function of its argument .
Recall from (2.3) that . Then solves the PDE-IBVP
(3.3a)
(3.3b)
(3.3c) |
Since does not depend on , it follows that is constant with respect to . Rewriting (2.3) as , it follows that is an affine, non-linear function of .
Recall the decomposition in (2.3). Suppose we keep the definition of the approximate model as the mapping that sends to the solution of (3.2), after replacing the initial condition (3.2c) with for some nonnegative, nonzero . If is small, then one expects the corresponding model error to also be small. However, is now the solution of the PDE-IBVP (3.3), after replacing the initial condition (3.3c) with . If assumes negative values on , then no longer describes the concentration of a substance. This may be undesirable, given that one uses advection-diffusion PDEs to describe the spatiotemporal evolution of concentrations. Moreover, for differential equations that model chemical reactions, negative initial values can lead to finite-time blow-up, and small negative values in solutions can lead to instability, cf. [15, Chapter I, Section 1.1]. Since we do not have a lower bound for except for , this explains the choice of the initial condition in the PDE-IBVP (3.2) for defining and .
3.2 Agreement of approximate and best posteriors
Let be , equipped with the supremum norm . Assume that the vector field in the definition of satisfies , and that both the spatial component on the right-hand side of (3.1a) and the true initial condition in (3.1c) belong to . That is, all the terms that define the IBVPs are smooth and bounded on their respective domains, with strictly positive right-hand side in (3.1a) and strictly positive initial condition in (3.1c). We define the state space to be
where for , , and
denotes the spatial derivative associated to the multi-index , and denotes the -th order time derivative. The first and second sums on the right-hand side are defined to be zero if and respectively.
Given that is constant and is bounded, the operator satisfies
for independent of . Thus, for arbitrary for , the linear observation operator
(3.4) |
is continuous, since . Moreover, for given by (3.4), it follows from (3.3a) that , since is the composition of the vectorisation of the evaluation functionals at with the differential operator on the left-hand side of (3.3a). Thus, given the definition of above and the definition of in Section 3.1 as the solution of the IBVP defined by (3.3), we have
In particular, the right-hand side of (2.5) in Lemma 2.2 vanishes for this choice of . By Proposition 2.3, it follows that , i.e. the approximate posterior and the best posterior agree.
In the discussion after Proposition 2.5, we showed that if the image of under is contained in some linear subspace of , then it is simpler and easier to use the approximate posterior with an appropriately chosen instead of the enhanced noise posterior . Since takes values in the linear subspace of , the same conclusion holds for this example, with the appropriate choice of given in (3.4). Similarly, by the discussion after Proposition 2.10, it is simpler and easier to use the approximate posterior with from (3.4) instead of the joint posterior.
In this example, the only information about the best model error that we used to find an observation operator that satisfies is contained in the PDE (3.3a). We exploited this information to choose the observation operator in (3.4). In the absence of additional conditions on , an observation operator that does not satisfy (3.4) may fail to satisfy the condition . Thus, we can consider the collection of observation operators with the structure (3.4) as being the largest possible collection, given the available information.
We end this section with some important remarks.
Suppose solves (3.1) for a given . Then the operator in (3.4) satisfies . In particular, if the spatial function is known and nonzero, then one can obtain the numbers from the vector directly, by division. We will return to this observation in Section 4.3.
For the operator in (3.4), exact evaluation of the vector for any in practical experimental settings can be difficult. For example, in practical settings, one might only be able to measure the concentration of a certain substance at fixed spatial locations and at multiples of some time lag , , . The corresponding observation operator maps every state to a vector of concentrations. We consider one such operator in (4.3a) below. In this case, and in the absence of further conditions on the spatiotemporal locations , there is no guarantee that one can approximate the operator from (3.4). However, if the spatiotemporal locations lie on a finite difference stencil, then one could potentially combine the information in the vector to approximate from (3.4). Such approximations may exhibit errors associated to the finite difference scheme, and may also be susceptible to noise in the vector . We postpone the design and analysis of approximations of in (3.4) for future work.
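To illustrate the kind of finite-difference construction alluded to above, the following one-dimensional toy sketch (with illustrative coefficients and names, not the setup of this paper) approximates the parabolic operator ∂t u + v ∂x u − κ ∂xx u at a space-time point from pointwise values of u on a small stencil; this is the quantity that an operator of the form (3.4) would observe. With noisy point values, the second-difference term would amplify the measurement noise roughly like 1/dx², which is one reason such approximations may be susceptible to noise.

```python
import numpy as np

def pde_observation_fd(u, x0, t0, dx, dt, v=1.0, kappa=0.01):
    """Approximate (du/dt + v du/dx - kappa d2u/dx2)(x0, t0) from pointwise
    values of u on a space-time stencil, using central differences."""
    du_dt = (u(x0, t0 + dt) - u(x0, t0 - dt)) / (2 * dt)
    du_dx = (u(x0 + dx, t0) - u(x0 - dx, t0)) / (2 * dx)
    d2u_dx2 = (u(x0 + dx, t0) - 2 * u(x0, t0) + u(x0 - dx, t0)) / dx**2
    return du_dt + v * du_dx - kappa * d2u_dx2

# Check against a smooth test state u(x, t) = exp(-t) * sin(x).
u = lambda x, t: np.exp(-t) * np.sin(x)
x0, t0 = 1.0, 0.5
exact = np.exp(-t0) * (-np.sin(x0) + 1.0 * np.cos(x0) + 0.01 * np.sin(x0))
print(pde_observation_fd(u, x0, t0, 1e-3, 1e-3), exact)
```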
In order to identify in (3.4) as an observation operator that satisfies , we exploited the fact that the right-hand side of (3.3a) vanishes. Suppose that we modified the right-hand sides of (3.2a) and (3.3a) to be nonzero functions respectively, where for some fixed and any , and , for every . If we do not know the initial condition in (3.3c), then the only remaining information that we can use to find a linear operator such that is captured by the homogeneous Neumann boundary conditions in (3.3). Since the PDE-IBVP (3.1) that defines uses the same boundary conditions, any observation operator based on only the Neumann boundary conditions will imply that as maps on . Thus, if , then also . This example illustrates that it may not always be possible to find an observation operator that satisfies and that yields informative observations.
4 Numerical simulations
In this section, we describe some numerical simulations of the PDE-IBVP from Section 3.1 for a specific vector field . We use these simulations to illustrate the behaviour of the best posterior and the approximate posterior that were defined in (2.2) and (2.4) respectively. The code for these simulations is available at https://gitlab.tue.nl/hbansal/model-error-study.
4.1 Setup of PDE-IBVP and Bayesian inverse problem
4.1.1 Setup of PDE-IBVP
We set the spatial domain in (3.1) to be , where and represent two obstacles. See Figure 1(b). For the differential operator in Section 3.1, we use a constant diffusivity . For the vector field we use the solution to the stationary Navier–Stokes equation
(4.1)
for a given pressure field , Reynolds number , and boundary data
see [3, Section 5]. In Figure 1(a), we plot a numerical approximation of the solution to (4.1). The numerical approximation satisfies . The boundary data ensure that the inner product of the vector field with the exterior normal vanishes everywhere on the boundary . Given the Neumann boundary condition (3.1b), one can use this fact to prove the coercivity of the bilinear form associated to (3.1a); see e.g. [10, Section 3.2] for the proof of this statement for the time-independent version of (3.1a)–(3.1c). The coercivity of the bilinear form ensures the existence of a weak solution to all the PDE-IBVPs from Section 3.1. Given , the global Péclet number — see e.g. [25, Section 13.2, Eq. (13.19)] — associated to the PDE-IBVP has the value .
Next, we define
(4.2a)
(4.2b) |
and set and . We will use these functions as reference values for the right-hand side of (3.1a). We indicate the location parameter of the spatial function by a red triangle in Figure 1(b) and plot a discrete approximation of in Figure 4(b).
To complete the specification of the PDE-IBVP (3.1), it remains to state the reference initial condition in (3.1c). To this end, we used a sample from the Gaussian distribution with constant mean on and covariance operator for and equipped with Robin boundary conditions. This Gaussian distribution and the justification for its use are given in [4, Section 5.2]. In Figure 2, we show snapshots of the solution of (3.1a)–(3.1c) at times , for two samples from the Gaussian distribution of . For the reference initial condition , we used the initial condition shown in the bottom left subfigure of Figure 2.

4.1.2 Setup of Bayesian inverse problem
Consider two different observation operators
(4.3a)
(4.3b) |
for given in Section 3.1. We refer to and as the ‘basic observation operator’ and ‘PDE observation operator’ respectively. To ensure that any differences in the numerical results are not due to differences in the spatiotemporal locations at which observations are collected, we define both observation operators using the same spatiotemporal locations, belonging to the set
(4.4) |
Thus, for both and , observations are vectors with entries. In Figure 1(b), we indicate the spatial locations by black squares. We emphasise that, unlike ‘classical’ experimental design, our goal is not to describe how to choose the spatiotemporal locations of the measurements in order to maximise the information gain for each successive measurement. Instead, our goal is to show that by choosing a different type of measurement, one can mitigate the effect of model error on Bayesian inference. In this context, the key difference between and is that vanishes on because of (3.3a), but does not.
For the prior on the unknown , we use the same prior as in [4, Section 5.2], namely a Gaussian probability measure on with constant mean and covariance operator generated by a Matérn covariance kernel
with and length scale .
Fix and a realisation of the corresponding random variable in (4.9). In Section 3.1, we observed that is a linear mapping. This implies that the approximate posterior is Gaussian. Given that we assume and to be statistically independent, it follows that the covariance of and is , and the covariance matrix of is . We then compute the mean and covariance of using the formula for conditioning of Gaussians, see e.g. [30, Theorem 6.20]:
(4.5a)
(4.5b) |
In Section 3.1, we observed that is constant with respect to , and that is affine. Hence, the covariance of and is , the covariance matrix of is , and is Gaussian, with mean and covariance
(4.6a)
(4.6b) |
By comparing (4.5) and (4.6), we conclude that the posteriors and differ only in their means, namely by the shift . Thus, if one chooses so that , then and agree. This result is consistent with Proposition 2.3. It has the interpretation that if one uses the observation operator to map states to data, then using the approximate model to map parameters to states yields exactly the same inference as when one uses the best model .
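The structure of (4.5)–(4.6) can be mimicked in a few lines of linear algebra. In the sketch below (illustrative matrices and names: B for the discretised observation-of-model map, e for the constant observed model-error contribution), the two posteriors share the same covariance and differ only by a shift of the mean, and the shift vanishes when the observed model error e is zero.

```python
import numpy as np

def linear_gaussian_posterior(B, Gamma, y, m0, C0):
    """Posterior mean and covariance for y = B m + noise, noise ~ N(0, Gamma),
    prior m ~ N(m0, C0), via the standard Gaussian conditioning formula."""
    S = B @ C0 @ B.T + Gamma
    K = C0 @ B.T @ np.linalg.inv(S)              # 'Kalman gain'
    mean = m0 + K @ (y - B @ m0)
    cov = C0 - K @ B @ C0
    return mean, cov

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 3))                  # observation of a linear model
e = rng.standard_normal(4)                       # constant observed model error
m0, C0, Gamma = np.zeros(3), np.eye(3), 0.01 * np.eye(4)

y = B @ np.array([1.0, -0.5, 2.0]) + e + 0.1 * rng.standard_normal(4)
m_apx, C_apx = linear_gaussian_posterior(B, Gamma, y, m0, C0)
m_best, C_best = linear_gaussian_posterior(B, Gamma, y - e, m0, C0)

# Same covariance; the means differ by the shift induced by the observed error e.
print(np.allclose(C_apx, C_best), m_apx - m_best)
```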
4.2 Implementation
To solve the IBVPs (3.1), (3.2), and (3.3), we implemented the space-time finite element method on an unstructured triangular mesh for the spatial domain and a uniform mesh for the temporal domain , using the hIPPYlib [31] and FEniCS/DOLFIN [21] software. We used Lagrange nodal basis functions to discretise functions. That is, given a finite element mesh of with nodes and given a finite element mesh of with nodes , we have a spatial basis of functions and a temporal basis such that for and for . This yields a space-time basis , where each belongs to . We used piecewise linear basis functions for all functions except for the advection vector field , where we used piecewise quadratic functions.
To compute the numerical approximation of the solution of the PDE-IBVP (3.1) for a given on the right-hand side of (3.1a) and a given in (3.1c), we used the weak formulation of (3.1) to form a matrix-vector equation , where
(4.7) |
We used quadrature to approximate the inner product. For the initial condition , we discretised the covariance operator described in Section 4.1.1 and sampled from the resulting finite-dimensional Gaussian distribution on .
For the spatial discretisation, 262 elements were used, yielding spatial degrees of freedom and a maximal spatial element diameter of . For the temporal discretisation, 100 elements of uniform diameter were used, yielding temporal degrees of freedom. Since in (4.1) satisfies , the local Péclet number is ; see e.g. [25, Section 13.2, Equation (13.22)].
The discretisations of , , and from (4.5) and (4.6) are
(4.8a)
(4.8b)
(4.8c) |
where in (4.8c) denotes the symmetric square root of . Since , we set . For the prior covariance matrix , we set and approximated the integral by quadrature. We discretised the approximate model as a matrix , where we computed the -th column of for by replacing in the linear form associated to (3.1a) with the temporal basis function in the definition of the right-hand side ; see (4.7). We discretised the basic observation operator in (4.3a) and the PDE observation operator in (4.3b) as matrices and respectively, where and return the values of the piecewise linear spatiotemporal functions at the locations given in (4.4) represented by and respectively. To evaluate functions off the finite element nodes, we used linear interpolation.
It remains to state how we generated the data used in (4.8a) and (4.8b). Let denote the discretisation of the truth using the temporal basis . We solved the matrix-vector equation that corresponds to (3.3) for and defined the discretisation of the true state to be . For each of the choices and , we generated data by sampling the random variable
(4.9) |
for , where we set the standard deviation to be the median of the underlying signal scaled by the reciprocal of a prescribed signal-to-noise ratio (SNR).
For our experiments, we considered a lower SNR and a higher SNR, where we set to be the median of the underlying signal, scaled by and respectively. Our SNR regimes are reasonable, given that in [4, Section 5.3] the value is used for signals whose entries take values in the interval , for a corresponding that is at most .
The minimum, median, and maximum of the signal were 52.18, 56.63, and 68.53 respectively, which is consistent with the figures in the bottom row of Figure 2 and yields and for the lower SNR and higher SNR cases respectively. In contrast, the minimum, median, and maximum of the signal were 0, , and respectively. This is consistent with the fact that returns the function represented by the vector in (4.7): the entries of are given by integrals of the function that decreases exponentially as increases, and the integration regions have space-time ‘volume’ proportional to . Thus, for , and in the lower SNR and higher SNR cases respectively.
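A compact sketch of this data-generation step — with hypothetical names (B for a discretised parameter-to-observable map, m_true for the discretised true parameter) and the assumed convention sigma = median(|signal|) / SNR — is:

```python
import numpy as np

def generate_data(B, m_true, snr, rng):
    """Generate synthetic data y = B @ m_true + noise, where the noise standard
    deviation is tied to the signal through the assumed (illustrative) convention
    sigma = median(|signal|) / snr."""
    signal = B @ m_true
    sigma = np.median(np.abs(signal)) / snr
    y = signal + sigma * rng.standard_normal(signal.shape)
    return y, sigma
```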
4.3 Behaviour of discretised best posterior mean in small noise limit
In this section, we analyse the behaviour in the small noise limit of the discretised best posterior mean defined in (4.8b). By the preceding section, for , is a matrix with and . Hence, is not invertible, but it admits a unique Moore–Penrose pseudoinverse . Multiplying both sides of (4.8b) by , using the symmetry of , and adding zero, we obtain
By the definition (4.9) of , the definition , and the linearity of ,
so , where denotes the specific realisation of that yields . If is negligible, then the preceding observations imply that
where will be negligible with increasing probability as .
By the properties of the pseudoinverse, is the projection to the support of , i.e. to the orthogonal complement in of . We discuss an important property of this projection for the case , since we will use this property in Section 4.4 below. Let be the vector of coefficients of an arbitrary in the span of the temporal basis functions ; see Section 4.2. Recall that is an approximation of the weak solution of (3.2) with . As per the discussion at the end of Section 3, it follows from the definition of in (4.3) that , where are given in (4.4). By the interpolation property of the temporal basis , it follows that for every .
We now use the preceding argument to identify the orthogonal complement of . Denote the support of a vector by . Recall from Section 4.2 that and denote the index set of observation times given in (4.4) by . Since is one of the spatial locations in (4.4) at which measurements are taken, and since lies in the support of the function defined in (4.2a), the preceding paragraph implies that is strictly positive, and that if and only if for every . Thus,
Since is the projection to the orthogonal complement in of , it follows that for every , if and is zero otherwise. In particular, we expect that as decreases to zero, the best posterior mean for will more closely match the discretised true parameter on the index set of observation times. An important corollary is that if and in particular if is close to , then by comparing (4.8a) and (4.8b), the approximate posterior mean will be close to the discretised true parameter .
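The projection property used in this section is easy to check numerically. In the sketch below (illustrative dimensions and index set), the columns of a matrix B outside an index set S are zero, and the pseudoinverse-based projector pinv(B) @ B picks out exactly the coordinates in S:

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_param = 6, 10
S = [1, 4, 7]                                   # 'observation time' indices

B = np.zeros((n_obs, n_param))
B[:, S] = rng.standard_normal((n_obs, len(S)))  # only columns in S are nonzero

P = np.linalg.pinv(B) @ B                       # orthogonal projector onto ker(B)^perp
m = rng.standard_normal(n_param)

# P reproduces the entries of m on S and maps the remaining entries to zero.
print(np.allclose((P @ m)[S], m[S]), np.allclose(np.delete(P @ m, S), 0.0))
```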
We close this section by observing that unlike , may be nonzero even if . This is because approximates the value of the solution of (3.2) at some spatiotemporal point . By the effects of diffusion and advection, may be nonzero even if , e.g. if at spatiotemporal points near the source term of the PDE is sufficiently large. Hence, we do not expect the best posterior mean for to agree with to the same extent that the best posterior mean for agrees with on the index set of observation times.
4.4 Results


Figure 3(a) and Figure 3(b) show the approximate posterior mean and the best posterior mean for the basic observation operator , for the lower SNR and higher SNR respectively. By (4.8a) and (4.8b),
is captured by the difference between the red and blue curves in Figure 3(a) and Figure 3(b). In both figures, the difference eventually decreases with time. This is because is the numerical solution of the PDE-IBVP (3.3), which has zero forcing and allows for flux across the boundary, and thus the values of decay with time. Nevertheless, the difference is large relative to the truth over the time interval . These observations show the importance of the model error for this specific inverse problem.
By comparing Figure 3(a) and Figure 3(b), we observe that the difference increases considerably after increasing the SNR. This occurs partly because the lower SNR corresponds to a larger in (4.9), which in turn leads to larger values of and thus to entries of that are more likely to be closer to zero. This behaviour is consistent with what we expect, namely that ignoring model error can lead to larger errors in the posterior mean when the data are informative and the model error is large.


In Figure 4(a) (respectively, Figure 4(b)), we show for the lower (resp. higher) SNR setting the best posterior mean for , the best posterior mean for , and the truth . For , fails to capture important features of for the lower SNR; it does better for the higher SNR, but under- and over-predictions remain visible. In contrast, for matches the truth very closely over the observation time window for both the lower and higher SNR. The close agreement between the best posterior mean and — and the different performance of the best posterior mean depending on whether or is used — are consistent with the discussion in Section 4.3.


In Figure 5(a) and Figure 5(b), we compare the diagonal entries of the best posterior covariance matrix for and for , in the lower SNR and higher SNR cases respectively. We observe that the diagonal of for is close to zero over the time observation window, whereas the diagonal of for is not. By (4.8c), the difference in behaviour is due to the size of the noise covariance . For , the noise standard deviation satisfies and in the lower and higher SNR cases respectively. Thus is almost zero for , and is close to zero as well. In contrast, for , and in the lower and higher SNR cases respectively, so that is not negligible. The differences in for in the lower and higher SNR cases are also due to the size of .
By comparing the curve for the approximate posterior mean for from Figure 3(a) with the curve for for in Figure 5(a) and recalling that , we note that the error of with respect to is the largest over the subinterval , but the marginal posterior variance is smallest over the same interval. In other words, the approximate posterior is most confident over the region where the bias in the approximate posterior mean is the largest. The approximate posterior is said to be ‘overconfident’; see e.g. [1, Section 8] and the references therein for other examples of overconfidence in the presence of model error. By comparing Figure 4(b) and Figure 5(b), we see that the phenomenon of overconfidence becomes worse for higher SNRs, i.e. for more informative data.
We make some important remarks about the results in Figure 4 and Figure 5. These results show that if one uses an observation operator that satisfies , then the best posterior and the approximate posterior are concentrated around the true parameter. We emphasise that these results are for the specific inverse problem presented in Section 4.1.2. Our explanation in Section 4.3 for why the best posterior mean is close to , and our explanation in the preceding paragraphs for why the diagonal of is close to zero over the time observation window, both make use of the size of the noise covariance . This suggests that whether one can obtain similar results for other inverse problems can depend on other factors, and not only on the condition . Indeed, in the last paragraph of Section 3, we showed for a modification of our inverse problem that the condition can even lead to uninformative observations.
5 Conclusion
We considered Bayesian inverse problems in the presence of model error, in the finite-dimensional data setting with additive Gaussian noise, where the parameter-to-observable map is the composition of a linear observation operator with a possibly nonlinear model. Under the assumption that there exists a unique best model and best parameter, the ‘model error’ is then the difference between the ‘approximate’ model that one uses and the best model. The approximate model and best model generate the ‘approximate posterior’ and ‘best posterior’ respectively.
We described some natural approaches for accounting for model error and the associated posteriors. We used the local Lipschitz stability property of posteriors with respect to perturbations in the likelihood to bound Kullback–Leibler divergences of the posterior associated to each approach, with respect to either the best posterior or the approximate posterior. These bounds have two important properties: first, they control the Kullback–Leibler divergence in terms of quantities that depend on the observation operator and the objects used to account for model error; and second, the terms in the bounds are finite under mild hypotheses, e.g. -integrability of the misfits with respect to the prior.
Our bounds yield sufficient (respectively, necessary) conditions on the observation operator and the objects used by each approach to yield a posterior that agrees with the best posterior (resp. that differs from the approximate posterior). These conditions provide guidelines for how to choose observation operators to mitigate the effect of model error on Bayesian inference, given a chosen approach. A unifying theme of these conditions is the importance of the kernel of the observation operator.
Using a linear Gaussian inverse problem involving an advection-diffusion-reaction PDE, we illustrated the sufficient condition for the approximate posterior to agree with the best posterior. We then reported numerical results from some simulations of this inverse problem under two different signal-to-noise settings, and used these results to discuss the importance of the sufficient condition itself, of observation noise, and of the information conveyed by the observation operator.
It is natural to ask whether it would be possible to combine classical optimal experimental design with the approach presented above. For example, in an ideal setting where one has access to multiple observation operators such that for every , one could aim to optimise the information gain over the set . Another potential avenue for future research would be to investigate the design of observation operators that satisfy the sufficient conditions; we expect this to be challenging and highly problem-specific. It would be of interest to study the setting of nonlinear observation operators and other approaches for accounting for model error.
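As a hedged sketch of the first direction, the information gain could be replaced by a simpler design criterion; the toy example below selects, among randomly generated candidate observation operators whose kernels contain a given model-error direction, the candidate with the smallest A-optimality criterion (trace of the Gaussian posterior covariance). All names and dimensions are illustrative.

```python
# Toy experimental-design loop: among candidate observation operators that
# annihilate the model-error direction, pick the A-optimal one.  Illustrative
# only; the suggestion above concerns the information gain.
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 4, 8, 3                        # parameter, state and data dimensions

B_approx = rng.normal(size=(d, n))       # approximate linear model
delta = rng.normal(size=(d, 1))          # model-error direction in state space

# Candidates obtained by projecting random operators onto delta^perp,
# so that every candidate O satisfies O @ delta == 0.
P = np.eye(d) - delta @ delta.T / (delta.T @ delta)
candidates = [rng.normal(size=(m, d)) @ P for _ in range(10)]

Gamma_inv = np.eye(m) / 0.1**2           # noise precision
Sigma_pr_inv = np.eye(n)                 # standard normal prior

def a_criterion(O):
    """Trace of the Gaussian posterior covariance (A-optimality)."""
    H = O @ B_approx
    return np.trace(np.linalg.inv(Sigma_pr_inv + H.T @ Gamma_inv @ H))

best = min(range(len(candidates)), key=lambda i: a_criterion(candidates[i]))
print("selected candidate:", best, "criterion:", a_criterion(candidates[best]))
```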
Acknowledgements
The research of NC and HB has received funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme — Grant Agreement ID 818473. The research of HCL has been partially funded by the Deutsche Forschungsgemeinschaft (DFG) — Project-ID 318763901 — SFB1294. The authors would like to thank Nicole Aretz for discussions on the numerics of PDEs and Giuseppe Carere for helpful comments. In addition, the authors would like to thank the three reviewers and review editor for their feedback.
Appendix A Technical lemmas
Let and , be symmetric and positive definite.
Recall the notation (1.1) for a matrix-weighted norm.
Lemma A.1.
Suppose that are symmetric, that is positive definite, and that is nonnegative definite. Then is symmetric positive definite and is symmetric nonnegative definite. In addition,
(A.1)
Proof.
By the assumptions on and , is positive definite. As the difference of two symmetric matrices, is symmetric. To show that it is nonnegative, we use the following rearrangement of the Woodbury formula, which we take from [8, Equation (3)]:
Invertibility of follows from the invertibility of and . From the second equation above, inherits the nonnegative definiteness of . ∎
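The conclusion of Lemma A.1 can be sanity-checked numerically. The sketch below generates a random symmetric positive definite matrix and a random symmetric nonnegative definite matrix and verifies, up to floating-point error, that the difference of inverses is nonnegative definite.

```python
# Numerical sanity check of Lemma A.1 with randomly generated matrices:
# for symmetric positive definite A and symmetric nonnegative definite B,
# the difference inv(A) - inv(A + B) is symmetric nonnegative definite.
import numpy as np

rng = np.random.default_rng(3)
k = 5

M = rng.normal(size=(k, k))
A = M @ M.T + k * np.eye(k)          # symmetric positive definite
L = rng.normal(size=(k, 2))
B = L @ L.T                          # symmetric nonnegative definite, rank <= 2

D = np.linalg.inv(A) - np.linalg.inv(A + B)
print(np.linalg.eigvalsh(D).min() >= -1e-12)   # True, up to round-off
```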
We will use the following bound in the proofs below.
Lemma A.2.
Let be a metric space, , and and be -valued measurable functions on . Let be symmetric and positive definite and be symmetric nonnegative definite. Then
(A.2)
More generally,
(A.3)
where . In addition,
(A.4)
(A.5) |
Proof.
We claim that for arbitrary , symmetric positive definite , and symmetric nonnegative definite ,
(A.6) |
Recall that exists and is positive definite, by Lemma A.1.
Using the matrix-weighted inner product and norm notation from (1.1),
This implies (A.6), since
Now let , , and be as in the statement of the lemma. Then by (A.6) and the triangle inequality,
(A.7) |
Next,
(A.8) |
The first and second inequalities follow by applying the Cauchy–Schwarz inequality with respect to and respectively. The third inequality and the equation follow from the -triangle inequality and the definition of the norm for . By (A.8), we bound the first term on the right-hand side of (A.7). By using , the second term on the right-hand side of (A.7) vanishes. Thus (A.2) follows from (A.7). ∎
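The two elementary ingredients of this proof, the Cauchy–Schwarz and triangle inequalities for a matrix-weighted inner product, can be checked numerically as follows. The sketch assumes the convention ⟨u, v⟩_M = uᵀMv and ‖v‖_M = (vᵀMv)^{1/2}; if the notation (1.1) weights by the inverse matrix instead, replace M by its inverse.

```python
# Check of the matrix-weighted Cauchy-Schwarz and triangle inequalities used in
# the proof of Lemma A.2, assuming the convention <u, v>_M = u^T M v.
import numpy as np

rng = np.random.default_rng(4)
k = 6

R = rng.normal(size=(k, k))
M = R @ R.T + np.eye(k)              # symmetric positive definite weight

def wnorm(v):
    return np.sqrt(v @ M @ v)

u, v = rng.normal(size=k), rng.normal(size=k)

print(abs(u @ M @ v) <= wnorm(u) * wnorm(v))          # Cauchy-Schwarz
print(wnorm(u + v) <= wnorm(u) + wnorm(v) + 1e-12)    # triangle inequality
```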
Appendix B Deferred proofs
B.1 Proof for approximate posterior
Lemma 2.2 bounds in terms of the observed model error , under the hypothesis that and . The bound is given in (2.5):
Proof of Lemma 2.2.
Recall from (2.2) and (2.4) that and respectively. By these definitions,
(B.1) |
and similarly
(B.2) |
Now set , , , , and . By (2.3), we have . Hence and . Applying (A.2) from Lemma A.2 with these choices, and then (B.1) and (B.2), yields
This proves (2.5). The bound on in the statement of Lemma 2.2 follows from (A.4) in Lemma A.2 with the choices above:
This completes the proof of Lemma 2.2. ∎
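To convey the qualitative behaviour that Lemma 2.2 quantifies, the following linear-Gaussian sketch evaluates the Kullback–Leibler divergence between the posteriors obtained with the best model and with an approximate model in closed form, and scales the model error down. This is an illustration only, with randomly generated matrices; it does not evaluate the bound (2.5) itself.

```python
# Closed-form KL divergence between Gaussian posteriors obtained with the best
# and approximate linear models; the divergence shrinks with the model error.
# Illustrative only; not an evaluation of the bound (2.5).
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 3, 5, 4                         # parameter, state and data dimensions

B_best = rng.normal(size=(d, n))          # best linear model
delta = rng.normal(size=(d, n))           # model error, to be scaled below
O = rng.normal(size=(m, d))               # observation operator
sigma = 0.2
Gamma_inv = np.eye(m) / sigma**2
Sigma_pr_inv = np.eye(n)                  # standard normal prior

x_true = rng.normal(size=n)
y = O @ B_best @ x_true + sigma * rng.normal(size=m)

def posterior(B):
    H = O @ B
    C = np.linalg.inv(Sigma_pr_inv + H.T @ Gamma_inv @ H)
    return C @ H.T @ Gamma_inv @ y, C

def kl_gauss(m0, C0, m1, C1):
    """KL( N(m0, C0) || N(m1, C1) ) in closed form."""
    C1_inv = np.linalg.inv(C1)
    return 0.5 * (np.trace(C1_inv @ C0) + (m1 - m0) @ C1_inv @ (m1 - m0)
                  - len(m0) + np.log(np.linalg.det(C1) / np.linalg.det(C0)))

m_b, C_b = posterior(B_best)
for eps in (1.0, 0.1, 0.01):              # shrink the model error
    m_a, C_a = posterior(B_best + eps * delta)
    print(f"eps={eps:5.2f}  KL(best || approximate) = {kl_gauss(m_b, C_b, m_a, C_a):.3e}")
```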
B.2 Proofs for enhanced noise posterior
Lemma 2.4 bounds in terms of the observed model error and the Gaussian model of . In particular, if and , then for as in (2.7), the bound (2.8)
holds, and all terms on the right-hand side are finite.
Proof of Lemma 2.4.
B.3 Proofs for joint posterior
In Lemma 2.9, one assumes that as defined in (2.2) belongs to , and also that . The resulting bound (2.16) is
Proof of Lemma 2.9.
Let , , , , and . Then and . Using the same argument that we used to prove (B.1), it follows from (2.12) that
(B.5) |
By (B.1) and (2.15) from Lemma 2.8,
(B.6) |
Applying (A.2) from Lemma A.2 with these choices yields (2.16):
Next,
follows from (B.5), (B.6), and (A.4) of Lemma A.2 with the choices stated above. ∎
In Lemma 2.11, one assumes that as defined in (2.4) belongs to , and also that . The resulting bound (2.17) is
The proof of Lemma 2.11 is very similar to the proof of Lemma 2.9 above.
Proof of Lemma 2.11.
Lemma B.1.
Proof of Lemma B.1.
Proof of Proposition B.2.
Applying Theorem 2.1 and Lemma B.1 yields
Using the definition of given in the statement of Proposition B.2 and (2.15) from Lemma 2.8 completes the proof. ∎
References
- [1] Assyr Abdulle and Giacomo Garegnani, Random time step probabilistic methods for uncertainty quantification in chaotic and geometric numerical integration, Stat. Comput. 30 (2020), no. 4, 907–932.
- [2] Alen Alexanderian, Ruanui Nicholson, and Noémi Petra, Optimal design of large-scale nonlinear Bayesian inverse problems under model uncertainty, 2022, arXiv:2211.03952.
- [3] Alen Alexanderian, Noemi Petra, Georg Stadler, and Omar Ghattas, A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized ℓ0-sparsification, SIAM J. Sci. Comput. 36 (2014), no. 5, A2122–A2148.
- [4] Alen Alexanderian, Noémi Petra, Georg Stadler, and Isaac Sunseri, Optimal design of large-scale Bayesian linear inverse problems under reducible model uncertainty: Good to know what you don’t know, SIAM/ASA J. Uncertainty Quantif. 9 (2021), no. 1, 163–184.
- [5] Jenný Brynjarsdóttir and Anthony O’Hagan, Learning about physical parameters: the importance of model discrepancy, Inverse Probl. 30 (2014), no. 11, 114007.
- [6] Daniela Calvetti, Matthew Dunlop, Erkki Somersalo, and Andrew Stuart, Iterative updating of model error for Bayesian inversion, Inverse Probl. 34 (2018), no. 2, 025008.
- [7] Lianghao Cao, Thomas O’Leary-Roseberry, Prashant K. Jha, J. Tinsley Oden, and Omar Ghattas, Residual-based error correction for neural operator accelerated infinite-dimensional Bayesian inverse problems, J. Comput. Phys. 486 (2023), 112104.
- [8] Feng Cheng Chang, Inversion of a perturbed matrix, Appl. Math. Lett. 19 (2006), no. 2, 169–173.
- [9] Duc-Lam Duong, Tapio Helin, and Jose Rodrigo Rojo-Garcia, Stability estimates for the expected utility in Bayesian optimal experimental design, Inverse Probl. 39 (2023), no. 12, 125008.
- [10] Howard C. Elman, David J. Silvester, and Andrew J. Wathen, Finite elements and fast iterative solvers. With applications in incompressible fluid dynamics, 2nd ed., Numer. Math. Sci. Comput., Oxford: Oxford University Press, 2014.
- [11] Avner Friedman, Partial differential equations, Dover Publications, Mineola, N.Y., 2008, Unabridged republication of the work originally published by Holt, Rinehart & Winston, New York, in 1969.
- [12] Subhashis Ghosal and Aad van der Vaart, Fundamentals of Nonparametric Bayesian Inference, vol. 44, Cambridge: Cambridge University Press, 2017.
- [13] Konstantinos Gourgoulias, Markos A. Katsoulakis, Luc Rey-Bellet, and Jie Wang, How biased is your model? Concentration inequalities, information and model bias, IEEE Trans. Inform. Theory 66 (2020), no. 5, 3079–3097.
- [14] Eric Joseph Hall and Markos A. Katsoulakis, Robust information divergences for model-form uncertainty arising from sparse data in random PDE, SIAM/ASA J. Uncertain. Quantif. 6 (2018), no. 4, 1364–1394.
- [15] Willem Hundsdorfer and Jan Verwer, Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations, Springer-Verlag Berlin, 2003.
- [16] Jari Kaipio and Erkki Somersalo, Statistical and computational inverse problems, vol. 160, Springer Science & Business Media, 2005.
- [17] Jari Kaipio and Erkki Somersalo, Statistical inverse problems: discretization, model reduction and inverse crimes, J. Comput. Appl. Math. 198 (2007), no. 2, 493–504.
- [18] Marc C. Kennedy and Anthony O’Hagan, Bayesian calibration of computer models, J. R. Stat. Soc., Ser. B, Stat. Methodol. 63 (2001), no. 3, 425–464.
- [19] Ville Kolehmainen, Tanja Tarvainen, Simon R. Arridge, and Jari P. Kaipio, Marginalization of uninteresting distributed parameters in inverse problems – application to diffuse optical tomography, Int. J. Uncertain. Quantif. 1 (2011), no. 1, 1–17.
- [20] Karina Koval, Alen Alexanderian, and Georg Stadler, Optimal experimental design under irreducible uncertainty for linear inverse problems governed by PDEs, Inverse Probl. 36 (2020), no. 7, 075007.
- [21] Anders Logg, Garth N. Wells, and Johan Hake, DOLFIN: a C++/Python finite element library, Automated Solution of Differential Equations by the Finite Element Method: The FEniCS Book (Anders Logg, Kent-Andre Mardal, and Garth Wells, eds.), Springer, Berlin, Heidelberg, 2012, pp. 173–225.
- [22] Yvon Maday and Tommaso Taddei, Adaptive PBDW approach to state estimation: noisy observations; user-defined update spaces, SIAM J. Sci. Comput. 41 (2019), no. 4, B669–B693.
- [23] Ruanui Nicholson, Noémi Petra, and Jari P Kaipio, Estimation of the Robin coefficient field in a Poisson problem with uncertain conductivity field, Inverse Probl. 34 (2018), no. 11, 115005.
- [24] A. Nissinen, L. M. Heikkinen, V. Kolehmainen, and J. P. Kaipio, Compensation of errors due to discretization, domain truncation and unknown contact impedances in electrical impedance tomography, Meas. Sci. Technol. 20 (2009), no. 10, 105504.
- [25] Alfio Quarteroni, Numerical models for differential problems, third ed., Texts in Applied Mathematics, Springer Cham, Berlin, Germany, 2017.
- [26] Khachik Sargsyan, Xun Huan, and Habib N. Najm, Embedded model error representation for Bayesian model calibration, Int. J. Uncertain. Quantif. 9 (2019), no. 4, 365–394.
- [27] Khachik Sargsyan, Habib N. Najm, and Roger Ghanem, On the statistical calibration of physical models, Int. J. Chem. Kinet. 47 (2015), no. 4, 246–276.
- [28] Andrea Scarinci, Michael Fehler, and Youssef Marzouk, Bayesian inference under model misspecification using transport-Lagrangian distances: an application to seismic inversion, 2021, arXiv:2105.07027.
- [29] Björn Sprungk, On the local Lipschitz stability of Bayesian inverse problems, Inverse Probl. 36 (2020), 055015.
- [30] Andrew M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica 19 (2010), 451–559.
- [31] Umberto Villa, Noemi Petra, and Omar Ghattas, hIPPYlib: An extensible software framework for large-scale inverse problems governed by PDEs: Part I: Deterministic inversion and linearized Bayesian inference, ACM Trans. Math. Softw. 47 (2021), no. 2, 1–34.
- [32] Martin J. Wainwright, High-dimensional statistics, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 48, Cambridge University Press, Cambridge, 2019.