
Choosing observation operators to mitigate model error in Bayesian inverse problems

Nada Cvetković, Centre for Analysis, Scientific Computing and Applications, Eindhoven University of Technology, Groene Loper 5, 5612 AZ Eindhoven, the Netherlands
Han Cheng Lie, Institut für Mathematik, Universität Potsdam, Campus Golm, Haus 9, Karl-Liebknecht-Straße 24–25, Potsdam OT Golm 14476, Germany
Harshit Bansal, Centre for Analysis, Scientific Computing and Applications, Eindhoven University of Technology, Groene Loper 5, 5612 AZ Eindhoven, the Netherlands
Karen Veroy, Centre for Analysis, Scientific Computing and Applications, Eindhoven University of Technology, Groene Loper 5, 5612 AZ Eindhoven, the Netherlands
Abstract

In statistical inference, a discrepancy between the parameter-to-observable map that generates the data and the parameter-to-observable map that is used for inference can lead to misspecified likelihoods and thus to incorrect estimates. In many inverse problems, the parameter-to-observable map is the composition of a linear state-to-observable map called an ‘observation operator’ and a possibly nonlinear parameter-to-state map called the ‘model’. We consider such Bayesian inverse problems where the discrepancy in the parameter-to-observable map is due to the use of an approximate model that differs from the best model, i.e. to nonzero ‘model error’. Multiple approaches have been proposed to address such discrepancies, each leading to a specific posterior. We show how to use local Lipschitz stability estimates of posteriors with respect to likelihood perturbations to bound the Kullback–Leibler divergence of the posterior of each approach with respect to the posterior associated to the best model. Our bounds lead to criteria for choosing observation operators that mitigate the effect of model error for Bayesian inverse problems of this type. We illustrate one such criterion on an advection-diffusion-reaction PDE inverse problem from the literature, and use this example to discuss the importance and challenges of model error-aware inference.

Keywords: Model error, Bayesian inverse problems, experimental design, misspecified likelihood, posterior error bounds

1 Introduction

In many applications, one considers an inverse problem in which the data is a noisy observation of the true state of some phenomenon of interest, and the true state is the output of a parameter-to-state mapping or ‘model’ evaluated at an unknown parameter. That is, for the unknown true parameter $\theta^{\dagger}$ and the best model $\mathcal{M}^{\dagger}$, the true state is $u^{\dagger}=\mathcal{M}^{\dagger}(\theta^{\dagger})$, and the data $y$ is a realisation of the random variable

\[
Y\coloneqq\mathcal{O}\circ\mathcal{M}^{\dagger}(\theta^{\dagger})+\varepsilon
\]

for a state-to-observable map or ‘observation operator’ $\mathcal{O}$ and additive noise $\varepsilon$. The inverse problem consists in inferring the data-generating parameter $\theta^{\dagger}$ from the data $y$.

We consider Bayesian inverse problems with data taking values in $\mathbb{R}^{n}$, centred Gaussian observation noise, and linear observation operators. In this setting, the noise $\varepsilon$ has the normal distribution $\mathcal{N}(m_{\varepsilon},\Sigma_{\varepsilon})$ for $m_{\varepsilon}=0\in\mathbb{R}^{n}$ and positive definite covariance $\Sigma_{\varepsilon}\in\mathbb{R}^{n\times n}$, and the observation operator $\mathcal{O}$ is a linear mapping from the ‘state space’ $\mathcal{U}$ of admissible states to $\mathbb{R}^{n}$. In the Bayesian approach to inverse problems, one fixes a possibly infinite-dimensional parameter space $\Theta$ consisting of admissible values of $\theta^{\dagger}$, and describes the unknown true parameter $\theta^{\dagger}$ using a random variable $\theta$ with prior law $\mu_{\theta}$ supported on $\Theta$. Under certain assumptions on the prior $\mu_{\theta}$, the observation operator $\mathcal{O}$, and the model $\mathcal{M}^{\dagger}$, the corresponding posterior measure $\mu^{y,\dagger}_{\theta}$ is well-defined and admits the following density with respect to the prior $\mu_{\theta}$:

\[
\Theta\ni\theta^{\prime}\mapsto\frac{\mathrm{d}\mu^{y,\dagger}_{\theta}}{\mathrm{d}\mu_{\theta}}(\theta^{\prime})=\frac{\exp\left(-\tfrac{1}{2}\left\|y-\mathcal{O}\circ\mathcal{M}^{\dagger}(\theta^{\prime})\right\|_{\Sigma_{\varepsilon}^{-1}}^{2}\right)}{\int_{\Theta}\exp\left(-\tfrac{1}{2}\left\|y-\mathcal{O}\circ\mathcal{M}^{\dagger}(\hat{\theta})\right\|_{\Sigma_{\varepsilon}^{-1}}^{2}\right)\mathrm{d}\mu_{\theta}(\hat{\theta})}.
\]

See e.g. [30, Theorem 4.1] for sufficient conditions for well-definedness in the case of a Gaussian prior $\mu_{\theta}$.
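To make the setup concrete, the following minimal Python sketch evaluates the density above (up to its normalisation constant) on prior samples for a toy finite-dimensional problem. The model `M_best`, the observation matrix `O`, the prior, and all dimensions are hypothetical placeholders chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: theta in R^2, state in R^50, n = 5 observations.
x_grid = np.linspace(0.0, 1.0, 50)

def M_best(theta):
    # placeholder nonlinear parameter-to-state map
    return np.sin(theta[0] * np.pi * x_grid) + theta[1] * x_grid

O = rng.standard_normal((5, 50)) / 50.0       # linear observation operator
Sigma_eps = 0.01 * np.eye(5)                  # positive definite noise covariance
Sigma_inv = np.linalg.inv(Sigma_eps)

theta_dagger = np.array([1.3, -0.7])
y = O @ M_best(theta_dagger) + rng.multivariate_normal(np.zeros(5), Sigma_eps)

def misfit(theta):
    # Phi^{y,dagger}(theta) = 0.5 * |y - O M(theta)|^2_{Sigma_eps^{-1}}
    r = y - O @ M_best(theta)
    return 0.5 * r @ Sigma_inv @ r

# The density of the posterior w.r.t. the prior is proportional to exp(-misfit);
# here it is evaluated on samples from a standard Gaussian prior on R^2 to form
# self-normalised importance weights.
thetas = rng.standard_normal((1000, 2))
weights = np.exp(-np.array([misfit(t) for t in thetas]))
weights /= weights.sum()
posterior_mean = weights @ thetas
print(posterior_mean)
```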

In practice, the best model $\mathcal{M}^{\dagger}:\Theta\to\mathcal{U}$ is not available, so an approximate model $\mathcal{M}:\Theta\to\mathcal{U}$ is used instead. Alternatively, $\mathcal{M}^{\dagger}$ may be available but impractical or costly to evaluate: in the context of multifidelity or reduced order models, $\mathcal{M}^{\dagger}$ may be the element of a collection of models that yields the most accurate predictions of state, and $\mathcal{M}$ may be a reduced-order model, an emulator, or a surrogate, i.e. a model that yields less accurate predictions but can be evaluated more cheaply and quickly than $\mathcal{M}^{\dagger}$.

We shall refer to the difference $\delta^{\dagger}\coloneqq\mathcal{M}^{\dagger}-\mathcal{M}$ as the ‘model error of the approximate model’, or simply the ‘model error’. Model error is also known as ‘model inadequacy’, ‘model discrepancy’, or ‘structural uncertainty’; see e.g. [18, 5]. We do not use the term ‘model misspecification’, since this term is used in the statistics literature to refer to the setting where the parameter space $\Theta$ does not contain the true parameter $\theta^{\dagger}$; see e.g. [12, Section 8.5].

In the context of inverse problems, model error is important because it may lead to a wrong or ‘misspecified’ likelihood, which in turn may lead to incorrect inference. The negative effects may persist even after applying or approximating common limits from statistics. For example, numerical experiments in [5, Section 3] show how ignoring the model error results in posterior distributions that do not converge to the true data-generating parameter as the number of observations grows larger. An analytical example in [1, Section 8.1] shows a similar phenomenon for the problem of inferring the initial condition of a heat equation over a finite time interval from a noisy observation of the terminal state, in the limit of small noise.

The potential for incorrect inference due to model error raises the question of how to mitigate the effect of model error in Bayesian inverse problems. We approach this question from the perspective of selecting an appropriate observation operator $\mathcal{O}$. To see why this perspective is reasonable, note that the linearity of $\mathcal{O}$ and the definition of the model error $\delta^{\dagger}$ imply that $\mathcal{O}\mathcal{M}^{\dagger}=\mathcal{O}\mathcal{M}+\mathcal{O}\delta^{\dagger}$. Using this fact in the equation for $Y$ shows that only the observed model error $\mathcal{O}\delta^{\dagger}$ has the potential to affect inference. In other words, ignoring the model error $\delta^{\dagger}$ leads to a misspecified likelihood if and only if $\mathcal{O}\delta^{\dagger}(\theta^{\prime})$ is nonzero with positive $\mu_{\theta}$-probability. This suggests that a possible approach to reduce the effect of model error on Bayesian inference is to choose an observation operator $\mathcal{O}$ for which $\mathcal{O}\delta^{\dagger}(\theta^{\prime})$ is closer to zero with higher $\mu_{\theta}$-probability.
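The following small numpy sketch illustrates this point under the simplifying assumption that the model error always lies along a single known direction in a discretised state space; the direction `v`, the error model `delta`, and both operators are hypothetical.

```python
import numpy as np

# Hypothetical illustration of the identity O M_best = O M + O delta:
# if the model error always lies in a known direction v, then an observation
# operator whose rows are orthogonal to v never 'sees' the model error.
rng = np.random.default_rng(1)
dim_state = 50
v = rng.standard_normal(dim_state)
v /= np.linalg.norm(v)                        # direction spanned by the model error

def delta(theta):
    return (1.0 + theta**2) * v               # hypothetical model error, values in span{v}

O_naive = rng.standard_normal((5, dim_state))       # generic observation operator
O_aware = O_naive - np.outer(O_naive @ v, v)        # rows projected onto v's orthogonal complement

theta_samples = rng.standard_normal(1000)
err_naive = np.mean([np.linalg.norm(O_naive @ delta(t)) for t in theta_samples])
err_aware = np.mean([np.linalg.norm(O_aware @ delta(t)) for t in theta_samples])
print(err_naive, err_aware)   # err_aware is zero up to round-off
```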

The approach of choosing observation operators suggests a connection with experimental design. In Bayesian experimental design, the main task is to select observations in order to maximise information about the parameter to be inferred. To quantify the information gain, one may use the Kullback–Leibler divergence of the posterior with respect to the prior, for example. In contrast, we control the Kullback–Leibler divergence between pairs of posteriors defined by the same prior but different likelihoods, by using the $L^{1}_{\mu_{\theta}}$ difference between the pair of negative log-likelihoods or ‘misfits’. The main task is then to select observations in order to minimise this $L^{1}_{\mu_{\theta}}$ difference. This approach can be viewed as experimental design for mitigating the effect of model error on inference.

Contributions

In this paper, we consider four approaches for accounting for model error in Bayesian inference: the approximate approach, which ignores model error; the ‘enhanced error approach’ of [16]; the ‘joint inference approach’, which aims to infer both $\theta^{\dagger}$ and $\delta^{\dagger}$; and the ‘marginalisation approach’, which integrates out the model error component of the posterior from the joint inference approach. For the first three approaches, we compute upper bounds for the $L^{1}_{\mu_{\theta}}$ difference between the misfit of each approach and the misfit associated to the best model $\mathcal{M}^{\dagger}$. We use these bounds to bound the Kullback–Leibler divergence between the posterior that results from each approach and the best posterior $\mu^{y,\dagger}_{\theta}$. We also compute upper bounds for the Kullback–Leibler divergence with respect to the approximate posterior $\mu^{y,\textup{A}}_{\theta}$ that results from using the given approximate model $\mathcal{M}$ instead of the best model $\mathcal{M}^{\dagger}$. We express each upper bound in terms of the model error $\delta^{\dagger}$ and the objects that the approach uses to account for the model error.

To prove these bounds, we exploit the assumption of additive Gaussian noise to prove a lemma concerning the difference of two quadratic forms that are weighted by different matrices; see Lemma A.2. We prove the upper bounds on the Kullback–Leibler divergences by combining the upper bounds on the $L^{1}_{\mu_{\theta}}$ differences between the misfits with a local Lipschitz stability estimate of posteriors from [29]. An important advantage of this estimate is that it holds for the Kullback–Leibler topology under the mild assumption of $L^{1}_{\mu_{\theta}}$-integrable misfits; see Theorem 2.1.

We use the upper bounds on the Kullback–Leibler divergence with respect to the best posterior $\mu^{y,\dagger}_{\theta}$ to derive sufficient conditions on the observation operator $\mathcal{O}$, the model error $\delta^{\dagger}$, and the objects that the approach uses to account for the model error, in order for the resulting posterior $\mu^{y,\bullet}_{\theta}$ to closely approximate $\mu^{y,\dagger}_{\theta}$ in the Kullback–Leibler topology. These conditions give ‘positive’ criteria for choosing observation operators to mitigate the effect of model error on inference. We use the upper bounds with respect to the approximate posterior $\mu^{y,\textup{A}}_{\theta}$ to derive necessary conditions for the resulting posterior $\mu^{y,\bullet}_{\theta}$ to differ from $\mu^{y,\textup{A}}_{\theta}$. These conditions give ‘negative’ criteria which one can use to exclude observation operators from consideration when one aims to mitigate the effect of model error on Bayesian inference.

We complement our theoretical findings with numerical experiments for an inverse problem from the literature, namely the inverse problem of inferring the time amplitude of a source term in an advection-diffusion-reaction partial differential equation (PDE) when the initial condition is unknown.

1.1 Overview of related work

The importance of accounting for the model error is well-documented in the literature on Bayesian inverse problems; see e.g. [5, 16, 18] and the references therein. The ‘Bayesian approximation error’ and ‘enhanced error’ approaches due to [17] and [16] respectively have been applied in various contexts. For example, the enhanced error approach has been applied with pre-marginalisation to electrical impedance tomography [24], diffuse optical tomography [19], and inversion for coefficient fields in the presence of uncertain conductivities [23].

Various methods have been developed to estimate or account for model error in Bayesian inference. For example, the work [6] presented an iterative algorithm to update an estimate of the model error in a model order reduction context, and proved geometric convergence of the algorithm. The authors of [27, 26] take a different perspective: instead of viewing model errors as additive perturbations to an approximate model, they incorporate these model errors into parametrisations of some phenomenon of interest, and use polynomial chaos expansions. Information theory has been used to quantify model error uncertainty or model bias in goal-oriented inference settings [14] and by exploiting concentration inequalities [13]. Optimal transport was applied to tackle problems due to model errors in [28].

In the context of Bayesian optimal experimental design, model error is also referred to as ‘model uncertainty’. The work [20] considers A-optimal designs for inverse problems in the presence of so-called ‘irreducible’ model uncertainties, i.e. uncertainties that cannot be reduced by collecting more data. In contrast, the work [4] considers reducible uncertainties, and describes an A-optimality criterion that involves marginalising out this reducible uncertainty. The work [2] combines the Laplace approximation and the Bayesian approximation error approach to find A-optimal designs for nonlinear Bayesian inverse problems.

As far as we are aware, the works most closely related to our paper are [7, 9]. The work [7] considers Bayesian inverse problems where the observation operator may be nonlinear and the model is approximated by a neural network. In particular, [7, Theorem 1] bounds the Kullback–Leibler divergence between the original and approximated posterior in terms of an $L^{p}$ norm, $p\geq 2$, of the model error itself. In contrast, we consider linear observation operators, do not focus on any specific class of approximate models, and bound the Kullback–Leibler divergence in terms of $L^{1}$ norms of the observed model error. In [9], the main stability estimate [9, Theorem 3.4] bounds the expected utility of a design in terms of a sequence of likelihoods. The focus of [9] is not model error, but the convergence of the utility of approximate optimal designs corresponding to a convergent sequence of likelihoods. The goal is ‘standard’ optimal experimental design, i.e. choosing observations to maximise the information content of each additional data point and thereby reduce posterior uncertainty. In contrast, we make pairwise comparisons of likelihoods and focus on choosing observations to mitigate the effect of model error on Bayesian inference.

1.2 Outline

We describe the notation that we use in Section 1.3. In Section 2, we state and prove upper bounds on $L^{1}_{\mu_{\theta}}$ errors between misfits and upper bounds on Kullback–Leibler divergences between the best posterior and posteriors that account for model error in different ways. We use these bounds to discuss positive and negative criteria for choosing observation operators to mitigate the effect of model error on Bayesian inference. In Section 3, we consider a specific inverse problem and use it to illustrate a positive criterion for choosing observation operators, in the setting where one uses the approximate posterior for inference. In Section 4, we describe and discuss the numerical experiments that we used to investigate the behaviour of the approximate posterior. We conclude in Section 5. In Appendix A, we present proofs of lemmas and results that are not stated in the main text.

1.3 Notation

The notation $a\leftarrow b$ indicates the replacement of $a$ using $b$, while $a\coloneqq b$ indicates equality by definition. For $N\in\mathbb{N}$, $[N]\coloneqq\{1,\ldots,N\}$, and $[N]_{0}\coloneqq\{0,1,\ldots,N\}$. Let $(\Omega,\mathcal{F},\mathbb{P})$ be the underlying probability space for all random variables below. Given a measurable space $(E,\mathcal{E})$ and an $E$-valued random variable $\xi:(\Omega,\mathcal{F})\to(E,\mathcal{E})$, we denote the law of $\xi$ by $\mu_{\xi}$. If $E$ is a topological space, we denote its Borel $\sigma$-algebra by $\mathcal{B}(E)$, the set of all Borel probability measures on $E$ by $\mathcal{P}(E)$, and the support of an arbitrary $\mu\in\mathcal{P}(E)$ by $\textup{supp}(\mu)$. For $S\in\mathcal{B}(E)$, $\overline{S}$ denotes the closure of $S$. For $\nu\in\mathcal{P}(E)$ and an $E$-valued random variable $\xi$, the expression $\xi\sim\nu$ means that $\xi$ is distributed according to $\nu$, i.e. $\mu_{\xi}=\nu$ as measures. Given $\mu,\nu\in\mathcal{P}(E)$, we denote the absolute continuity of $\mu$ with respect to $\nu$ by $\mu\ll\nu$, the corresponding Radon–Nikodym derivative by $\tfrac{\mathrm{d}\mu}{\mathrm{d}\nu}$, and the Kullback–Leibler (KL) divergence of $\mu$ with respect to $\nu$ by $d_{\textup{KL}}(\mu\|\nu)$, which has the value $\int_{E}\log\tfrac{\mathrm{d}\mu}{\mathrm{d}\nu}\,\mathrm{d}\mu$ if $\mu\ll\nu$ and $+\infty$ otherwise. For $\mu\in\mathcal{P}(E)$ and $p\geq 1$, $\left\|\cdot\right\|_{L^{p}_{\mu}}$ denotes the $L^{p}$ norm with respect to $\mu$; we shall specify the domain and codomain when necessary. We denote a Gaussian measure with mean $m$ and covariance operator $\Sigma$ by $\mathcal{N}(m,\Sigma)$. Some useful facts about Gaussian measures on Banach spaces are given in [30, Section 6.3], for example.

Given $\mu\in\mathcal{P}(E)$ and $\Phi:E\to\mathbb{R}$, we write $\mu_{\Phi}$ to denote the measure that has Radon–Nikodym derivative $\tfrac{\mathrm{d}\mu_{\Phi}}{\mathrm{d}\mu}=\tfrac{\exp(-\Phi)}{Z}$, where $Z\coloneqq\int_{E}\exp(-\Phi(x^{\prime}))\,\mathrm{d}\mu(x^{\prime})$. We shall use this notation to denote the posterior associated to a given prior and misfit below.

For $d\in\mathbb{N}$ and a symmetric positive semidefinite matrix $L\in\mathbb{R}^{d\times d}$,

\[
\langle a,b\rangle_{L}\coloneqq a^{\top}Lb=\langle L^{1/2}a,L^{1/2}b\rangle,\qquad\left|a\right|_{L}\coloneqq\langle a,a\rangle_{L}^{1/2}=\left|L^{1/2}a\right|. \tag{1.1}
\]

We use single bars ‘$\left|\cdot\right|$’ to emphasise when we take norms of vectors in finite-dimensional spaces. If $L$ is the identity matrix, then we omit the subscript. Given normed vector spaces $(V_{i},\left\|\cdot\right\|_{V_{i}})$ for $i=1,2$, $V_{i}^{\ast}$ denotes the continuous dual space of $V_{i}$. For a bounded linear operator $L:V_{1}\to V_{2}$, we denote the kernel, adjoint, and operator norm of $L$ by $\textup{ker}(L)$, $L^{\ast}$, and $\left\|L\right\|_{\textup{op}}$, and we write the image of $v\in V_{1}$ under $L$ as $Lv$. For a matrix $A\in\mathbb{R}^{m\times n}$, $A^{\ast}$ and $A^{+}$ denote the adjoint and Moore–Penrose pseudoinverse of $A$ respectively.
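As a small illustration of the weighted inner product and norm in (1.1), the following numpy sketch checks the identity $|a|_{L}=|L^{1/2}a|$ for a randomly generated symmetric positive semidefinite matrix; all quantities are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
B = rng.standard_normal((d, d))
L = B @ B.T                        # symmetric positive semidefinite weight matrix

a, b = rng.standard_normal(d), rng.standard_normal(d)

inner_L = a @ L @ b                # <a, b>_L = a^T L b
norm_L = np.sqrt(a @ L @ a)        # |a|_L = <a, a>_L^{1/2}

# Equivalent form via the symmetric matrix square root L^{1/2}:
w, V = np.linalg.eigh(L)
L_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
assert np.isclose(norm_L, np.linalg.norm(L_half @ a))
```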

2 Bounds on KL divergence in terms of observed model error

In this section, we state the theoretical results of our paper, namely bounds on the KL divergence between pairs of posteriors, expressed in terms of the model error and the objects that some approaches use to account for model error. These bounds show that it is not the model error itself but rather the observed model error that affects inference. We consider four natural types of posteriors that account for model error in different ways. After introducing the common objects and basic assumptions, we state bounds on certain KL divergences for the approximate posterior, enhanced noise posterior, joint posterior, and marginal posterior in Sections 2.1, 2.2, 2.3, and 2.4 respectively. We emphasise that we use the KL divergence only as a measure of the discrepancy between different posteriors, and not because the KL divergence is often used in classical optimal experimental design. Indeed, our bounds could have also been expressed using other measures of discrepancy that are considered in [29], such as the Hellinger distance or the total variation distance. Moreover, in the example inverse problem in Section 3, we use one such bound to guide the choice of observations, but do not estimate the terms in the bounds. The efficient and accurate estimation of the terms in the bounds below is beyond the scope of this paper.

We assume that the ‘parameter space’ $\Theta$ of admissible values of the unknown true data-generating parameter $\theta^{\dagger}$ is a Borel subset of a separable Banach space, and fix a prior $\mu_{\theta}\in\mathcal{P}(\Theta)$ that is supported on $\Theta$. Let $(\mathcal{U},\left\|\cdot\right\|_{\mathcal{U}})$ denote the ‘state space’, which is a Banach space containing both the image of $\Theta$ under the best model $\mathcal{M}^{\dagger}$ and the image of $\Theta$ under the approximate model $\mathcal{M}$. In many inverse problems, a state $u\in\mathcal{U}$ is a function on some domain that describes some observable phenomenon of interest, and is often expressed as the solution of a differential equation; see Section 3 below. We shall refer to any measurable mapping from $\Theta$ to $\mathcal{U}$ as a ‘model’, and to any continuous linear mapping $\mathcal{O}$ from $\mathcal{U}$ to $\mathbb{R}^{n}$ as an ‘observation operator’, where $n$ is fixed but arbitrary. Thus, $\Theta$ and $\mathcal{U}$ may be infinite-dimensional, but the data space is finite-dimensional. Infinite-dimensional data settings will involve additional technical issues that are not directly related to the main ideas of this paper; see e.g. [30, Remark 3.8].

We shall use the following local Lipschitz stability result to prove our bounds.

Theorem 2.1.

Let $\mu\in\mathcal{P}(\Theta)$ and $\Phi^{(i)}\in L^{1}_{\mu}(\Theta,\mathbb{R}_{\geq 0})$ for $i=1,2$. Then

\[
d_{\textup{KL}}(\mu_{\Phi^{(1)}}\|\mu_{\Phi^{(2)}})\leq 2\exp\left(\left\|\Phi^{(1)}\right\|_{L^{1}_{\mu}}+\left\|\Phi^{(1)}-\Phi^{(2)}\right\|_{L^{1}_{\mu}}\right)\left\|\Phi^{(1)}-\Phi^{(2)}\right\|_{L^{1}_{\mu}},
\]

and thus $\mu_{\Phi^{(1)}}$ is absolutely continuous with respect to $\mu_{\Phi^{(2)}}$. In particular,

\[
\max\{d_{\textup{KL}}(\mu_{\Phi^{(1)}}\|\mu_{\Phi^{(2)}}),\,d_{\textup{KL}}(\mu_{\Phi^{(2)}}\|\mu_{\Phi^{(1)}})\}\leq 2\exp\left(2\left\|\Phi^{(1)}\right\|_{L^{1}_{\mu}}+2\left\|\Phi^{(2)}\right\|_{L^{1}_{\mu}}\right)\left\|\Phi^{(1)}-\Phi^{(2)}\right\|_{L^{1}_{\mu}},
\]

and thus $\mu_{\Phi^{(1)}}$ and $\mu_{\Phi^{(2)}}$ are mutually equivalent.

Proof.

The first inequality follows by combining [29, Theorem 11] and [29, Proposition 6]. The second inequality follows by combining the triangle inequality with the first inequality. ∎
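The first estimate of Theorem 2.1 is easy to check numerically in one dimension. The sketch below uses a standard Gaussian prior and two hypothetical quadratic misfits, computes both reweighted posteriors and their KL divergence by quadrature, and compares against the bound; it is a sanity check only, not part of the proofs.

```python
import numpy as np

# Numerical sanity check of the first bound in Theorem 2.1 on a 1D example.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
prior = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # standard Gaussian prior density

def integrate(f):
    return np.sum(f) * dx

y, sigma2 = 0.8, 0.5
Phi1 = 0.5 * (y - x) ** 2 / sigma2            # misfit Phi^(1)
Phi2 = 0.5 * (y - 0.9 * x) ** 2 / sigma2      # misfit Phi^(2), perturbed model

def posterior(Phi):
    dens = prior * np.exp(-Phi)
    return dens / integrate(dens)

p1, p2 = posterior(Phi1), posterior(Phi2)
kl = integrate(p1 * np.log(p1 / p2))                  # d_KL(mu_Phi1 || mu_Phi2)

norm1 = integrate(Phi1 * prior)                       # ||Phi^(1)||_{L^1_mu}
diff = integrate(np.abs(Phi1 - Phi2) * prior)         # ||Phi^(1) - Phi^(2)||_{L^1_mu}
bound = 2.0 * np.exp(norm1 + diff) * diff
print(kl, bound)
assert kl <= bound
```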

We now state the main assumptions of this paper. Given $\Theta$ and $\mathcal{U}$, there exists a unique best parameter $\theta^{\dagger}\in\Theta$ and unique best model $\mathcal{M}^{\dagger}:\Theta\to\mathcal{U}$, such that for $u^{\dagger}=\mathcal{M}^{\dagger}(\theta^{\dagger})$ and for a chosen observation operator $\mathcal{O}$, the data is a realisation of the random variable

\[
Y=\mathcal{O}u^{\dagger}+\varepsilon=\mathcal{O}\mathcal{M}^{\dagger}(\theta^{\dagger})+\varepsilon,\qquad\varepsilon\sim\mathcal{N}(0,\Sigma_{\varepsilon}), \tag{2.1}
\]

where $\Sigma_{\varepsilon}$ is positive definite. We write $\mathcal{O}\mathcal{M}^{\dagger}(\theta^{\dagger})$ to emphasise that $\mathcal{O}$ is linear but $\mathcal{M}^{\dagger}$ may be nonlinear. The data model (2.1) leads to the best misfit $\Phi^{y,\dagger}$, which together with the prior $\mu_{\theta}$ yields the best posterior $\mu^{y,\dagger}_{\theta}$:

\[
\Theta\ni\theta^{\prime}\mapsto\Phi^{y,\dagger}(\theta^{\prime})\coloneqq\tfrac{1}{2}\left|y-\mathcal{O}\mathcal{M}^{\dagger}(\theta^{\prime})\right|^{2}_{\Sigma_{\varepsilon}^{-1}},\qquad\mu^{y,\dagger}_{\theta}\coloneqq(\mu_{\theta})_{\Phi^{y,\dagger}}. \tag{2.2}
\]

We assume that $\Phi^{y,\dagger}\in L^{1}_{\mu_{\theta}}$, and that $\theta\sim\mu_{\theta}$ and $\varepsilon$ are statistically independent. The best posterior $\mu^{y,\dagger}_{\theta}$ is important because it is the posterior that is consistent with the true data model (2.1).

2.1 Bound on KL divergence for approximate posterior

In general, one cannot perform inference with the best model $\mathcal{M}^{\dagger}$, and one uses instead an approximate model $\mathcal{M}$ distinct from $\mathcal{M}^{\dagger}$. For example, many mathematical models of natural phenomena based on differential equations do not admit closed-form solutions, so one must perform inference using some numerical approximation $\mathcal{M}$ of $\mathcal{M}^{\dagger}$. The ‘model error’ of $\mathcal{M}$ is given by the difference

\[
\delta^{\dagger}\coloneqq\mathcal{M}^{\dagger}-\mathcal{M}:\Theta\to\mathcal{U}. \tag{2.3}
\]

Given the assumption that $\varepsilon\sim\mathcal{N}(0,\Sigma_{\varepsilon})$, we obtain the approximate misfit $\Phi^{y,\textup{A}}$ and the approximate posterior $\mu^{y,\textup{A}}_{\theta}$ by replacing $\mathcal{M}^{\dagger}$ with the approximate model $\mathcal{M}$ in (2.2):

\[
\theta^{\prime}\mapsto\Phi^{y,\textup{A}}(\theta^{\prime})\coloneqq\tfrac{1}{2}\left|y-\mathcal{O}\mathcal{M}(\theta^{\prime})\right|^{2}_{\Sigma_{\varepsilon}^{-1}},\qquad\mu^{y,\textup{A}}_{\theta}\coloneqq(\mu_{\theta})_{\Phi^{y,\textup{A}}}. \tag{2.4}
\]

By comparing the definitions of $\Phi^{y,\textup{A}}$ and $\Phi^{y,\dagger}$ in (2.2) and (2.4), we conclude that if there exists some set $S\subset\Theta$ such that $\mu_{\theta}(S)>0$ and $\mathcal{O}\delta^{\dagger}$ does not vanish on $S$, then the likelihood corresponding to $\Phi^{y,\textup{A}}$ will be misspecified, and thus using $\mu^{y,\textup{A}}_{\theta}$ instead of $\mu^{y,\dagger}_{\theta}$ will lead to incorrect inference. To make this observation more quantitative, we first bound the misfit error in Lemma 2.2, and then bound the KL divergence $d_{\textup{KL}}(\mu^{y,\textup{A}}_{\theta}\|\mu^{y,\dagger}_{\theta})$ in Proposition 2.3.

Lemma 2.2.

If $\Phi^{y,\textup{A}}\in L^{1}_{\mu_{\theta}}$, then

\[
\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}\leq 2^{-1/2}\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right), \tag{2.5}
\]

where $\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\leq 2^{1/2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right)$.

We defer the proof of Lemma 2.2 to Section B.1.

Combining (2.5) with the bound on $\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}$ implies

\[
\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}\leq\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right)^{2}.
\]

Since the right-hand side of the inequality above is larger than $\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}$, the bound given in Lemma 2.2 is not optimal: it is worse than the bound one could obtain using the triangle inequality. Moreover, the term inside the parentheses on the right-hand side of (2.5) cannot be evaluated if $\mathcal{M}^{\dagger}$ is unavailable. Nevertheless, the bound (2.5) is useful, because it bounds $\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}$ in terms of the average observed model error $\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}$ and quantities that are assumed to be finite.
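In practice, the average observed model error can be estimated by Monte Carlo whenever the model error (or a proxy for it) can be evaluated on prior samples. The sketch below does this for a hypothetical pair of models, an arbitrary observation matrix, and a scalar Gaussian prior; none of these choices come from the paper.

```python
import numpy as np

# Monte Carlo sketch of the average observed model error
#   || |O delta|^2_{Sigma_eps^{-1}} ||_{L^1_{mu_theta}}^{1/2},
# the quantity that controls the KL bounds in this subsection.
rng = np.random.default_rng(3)
n_obs, dim_state = 5, 50
O = rng.standard_normal((n_obs, dim_state)) / dim_state
Sigma_inv = np.linalg.inv(0.01 * np.eye(n_obs))
x = np.linspace(0.0, 1.0, dim_state)

M_best = lambda th: np.sin(th * np.pi * x)                            # hypothetical best model
M_approx = lambda th: th * np.pi * x - (th * np.pi * x) ** 3 / 6.0    # hypothetical truncation of it

def observed_error_sq(th):
    r = O @ (M_best(th) - M_approx(th))      # O delta^dagger(theta)
    return r @ Sigma_inv @ r

thetas = rng.standard_normal(2000)           # samples from the prior mu_theta
avg_observed_error = np.sqrt(np.mean([observed_error_sq(t) for t in thetas]))
print(avg_observed_error)
```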

Proposition 2.3.

If $\Phi^{y,\textup{A}}\in L^{1}_{\mu_{\theta}}$, then

\[
\max\{d_{\textup{KL}}(\mu^{y,\textup{A}}_{\theta}\|\mu^{y,\dagger}_{\theta}),\,d_{\textup{KL}}(\mu^{y,\dagger}_{\theta}\|\mu^{y,\textup{A}}_{\theta})\}\leq C\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}
\]

for $C=2^{1/2}\exp\left(2\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}+2\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}\right)\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right)$.

Proof of Proposition 2.3.

We apply Lemma 2.2 and the second statement of Theorem 2.1 with $\Phi^{(1)}\leftarrow\Phi^{y,\dagger}$, $\Phi^{(2)}\leftarrow\Phi^{y,\textup{A}}$, and $\mu\leftarrow\mu_{\theta}$ to obtain

\begin{align*}
&\max\{d_{\textup{KL}}(\mu^{y,\textup{A}}_{\theta}\|\mu^{y,\dagger}_{\theta}),\,d_{\textup{KL}}(\mu^{y,\dagger}_{\theta}\|\mu^{y,\textup{A}}_{\theta})\}\\
&\quad\leq 2\exp\left(2\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}+2\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}\right)\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}\\
&\quad\leq 2^{1/2}\exp\left(2\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}+2\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}\right)\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right)\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}.
\end{align*}

This completes the proof of Proposition 2.3. ∎

Although the constant $C$ in Proposition 2.3 is not optimal, it is finite under the stated hypotheses.

Proposition 2.3 shows that the Kullback–Leibler divergences between $\mu^{y,\textup{A}}_{\theta}$ and $\mu^{y,\dagger}_{\theta}$ are controlled by the average observed model error $\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}$. In order for these divergences to be small, one should choose the observation operator $\mathcal{O}$ such that $\delta^{\dagger}$ takes values in or near $\textup{ker}(\mathcal{O})$ with high $\mu_{\theta}$-probability. In particular, if $\mu_{\theta}(\delta^{\dagger}\in\textup{ker}(\mathcal{O}))=1$, then using $\mu^{y,\textup{A}}_{\theta}$ will yield the same inference as $\mu^{y,\dagger}_{\theta}$. This is an example of a positive criterion for choosing observation operators to mitigate the effect of model error on Bayesian inference.

The condition $\mu_{\theta}(\mathcal{O}\delta^{\dagger}=0)=1$ can be useful for guiding the choice of observation operator $\mathcal{O}$ even if $\delta^{\dagger}$ is not fully known. For example, if one can determine a priori that $\delta^{\dagger}$ takes values in some proper subspace $V$ of $\mathcal{U}$, then any choice of observation operator $\mathcal{O}$ such that $V\subseteq\textup{ker}(\mathcal{O})$ will imply that $\mu^{y,\textup{A}}_{\theta}=\mu^{y,\dagger}_{\theta}$. However, the example $\mathcal{O}\equiv 0$ shows that one can choose $\mathcal{O}$ so that $\textup{ker}(\mathcal{O})$ is too large, in the sense that observations of the state yield little or no information about the state itself. Thus, the approach of choosing $\mathcal{O}$ to mitigate model error involves the following tradeoff: $\textup{ker}(\mathcal{O})$ should be as small as possible, but large enough to ensure that the model error takes values in $\textup{ker}(\mathcal{O})$ $\mu_{\theta}$-almost surely. In Section 3, we revisit this observation for a specific example.
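The following sketch illustrates this tradeoff in a discretised setting, under the assumption that a subspace $V$ containing the model error is known: candidate observation functionals are projected onto the orthogonal complement of $V$, so that $V\subseteq\textup{ker}(\mathcal{O})$ by construction. All matrices are hypothetical.

```python
import numpy as np

# Given a known subspace V containing the model error, modify candidate
# observation functionals so that V lies in the kernel of the resulting
# observation operator.
rng = np.random.default_rng(4)
dim_state, n_obs, dim_V = 50, 8, 3

V = np.linalg.qr(rng.standard_normal((dim_state, dim_V)))[0]   # orthonormal basis of V
O_candidate = rng.standard_normal((n_obs, dim_state))          # candidate observation rows

# Remove the components of each observation functional that 'see' V,
# so that O_aware @ v = 0 for every v in V.
O_aware = O_candidate - (O_candidate @ V) @ V.T

v = V @ rng.standard_normal(dim_V)      # arbitrary element of V, e.g. a model error value
assert np.allclose(O_aware @ v, 0.0)

# The price: the kernel of O_aware is larger than that of O_candidate, and its
# rank can drop if the candidate functionals overlap strongly with V.
print(np.linalg.matrix_rank(O_candidate), np.linalg.matrix_rank(O_aware))
```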

The preceding discussion emphasises the main idea of this work: to mitigate the effect of model error on Bayesian inference, one should exploit all available knowledge about the model error and about the inverse problem in order to guide the choice of observations. This main idea is valid, independently of the fact that the bounds in Lemma 2.2 and Proposition 2.3 are not sharp.

2.2 Bounds on KL divergence for enhanced noise posterior

Recall that the definition (2.2) of the best misfit $\Phi^{y,\dagger}$ follows from the best model (2.1) for the data random variable $Y$. The approximate misfit in (2.4) can be seen as a misfit that results from assuming that $\mathcal{O}\delta^{\dagger}(\theta^{\dagger})=0$. In the enhanced noise approach, one models the unknown state error $\delta^{\dagger}(\theta^{\dagger})$ as a random variable $\mathfrak{u}\sim\mathcal{N}(m_{\mathfrak{u}},\Sigma_{\mathfrak{u}})$ that is assumed to be statistically independent of $\theta\sim\mu_{\theta}$ and $\varepsilon\sim\mathcal{N}(0,\Sigma_{\varepsilon})$ [16]. This is related to the so-called ‘pre-marginalisation’ approach; see e.g. [19]. We do not assume that $\textup{supp}(\mathcal{N}(m_{\mathfrak{u}},\Sigma_{\mathfrak{u}}))=\mathcal{U}$. The corresponding enhanced noise data model is

\[
Y=\mathcal{O}\mathcal{M}(\theta^{\dagger})+\mathcal{O}\mathfrak{u}+\varepsilon,
\]

where the ‘enhanced noise’ $\mathcal{O}\mathfrak{u}+\varepsilon$ has the law $\mathcal{N}(\mathcal{O}m_{\mathfrak{u}},\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})$ because $\mathcal{O}$ is linear. Since we assume that $\Sigma_{\varepsilon}$ is positive definite and since $\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast}$ is nonnegative definite, $\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast}$ is positive definite, hence invertible.

The enhanced noise misfit and enhanced noise posterior are

\[
\theta^{\prime}\mapsto\Phi^{y,\textup{E}}(\theta^{\prime})\coloneqq\tfrac{1}{2}\left|y-\mathcal{O}\mathcal{M}(\theta^{\prime})-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}},\qquad\mu^{y,\textup{E}}_{\theta}\coloneqq(\mu_{\theta})_{\Phi^{y,\textup{E}}}. \tag{2.6}
\]

Lemma 2.4 below bounds the error between $\Phi^{y,\dagger}$ and $\Phi^{y,\textup{E}}$ in terms of the shifted observed model error term $\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})$ and the difference of covariance matrices $\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}$. Define the scalar

\[
C_{\textup{E}}\coloneqq\left\|\Sigma_{\varepsilon}^{-1/2}(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{1/2}\right\|_{\textup{op}}. \tag{2.7}
\]

By Lemma A.1, $C_{\textup{E}}$ satisfies $\left|z\right|_{\Sigma_{\varepsilon}^{-1}}\leq C_{\textup{E}}\left|z\right|_{(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}$ for all $z\in\mathbb{R}^{n}$.
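In a discretised setting, the enhanced noise quantities reduce to a few matrix operations. The sketch below assembles the enhanced noise covariance, evaluates the misfit (2.6), and computes the constant $C_{\textup{E}}$ of (2.7) for hypothetical matrices `O`, `Sigma_eps`, `Sigma_u`, and a hypothetical mean `m_u`.

```python
import numpy as np

rng = np.random.default_rng(5)
n_obs, dim_state = 5, 50

O = rng.standard_normal((n_obs, dim_state)) / dim_state
Sigma_eps = 0.01 * np.eye(n_obs)
B = rng.standard_normal((dim_state, dim_state))
Sigma_u = 0.1 * B @ B.T / dim_state           # covariance of the state-error model u
m_u = rng.standard_normal(dim_state) / 10.0   # mean of the state-error model u

Sigma_enh = Sigma_eps + O @ Sigma_u @ O.T     # enhanced noise covariance
Sigma_enh_inv = np.linalg.inv(Sigma_enh)

def misfit_enhanced(y, OM_theta):
    # Phi^{y,E}(theta) = 0.5 |y - O M(theta) - O m_u|^2 weighted by Sigma_enh^{-1}
    r = y - OM_theta - O @ m_u
    return 0.5 * r @ Sigma_enh_inv @ r

def sqrtm_sym(A):
    # symmetric matrix square root via the eigendecomposition
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# C_E = || Sigma_eps^{-1/2} (Sigma_eps + O Sigma_u O^*)^{1/2} ||_op
C_E = np.linalg.norm(np.linalg.inv(sqrtm_sym(Sigma_eps)) @ sqrtm_sym(Sigma_enh), ord=2)
print(C_E)
```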

Lemma 2.4.

If $\Phi^{y,\textup{E}}\in L^{1}_{\mu_{\theta}}$, then

\begin{align}
\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\leq{}&2^{-1/2}\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)\tag{2.8}\\
&+2^{-1}\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}.\nonumber
\end{align}

Furthermore,

\begin{align*}
\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}&\leq 2^{1/2}\left(\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right),\\
\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}&\leq(C_{\textup{E}}+1)\left\|2\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}.
\end{align*}

We defer the proof of Lemma 2.4 to Section B.2.

Proposition 2.5.

If $\Phi^{y,\textup{E}}\in L^{1}_{\mu_{\theta}}$, then

\begin{align*}
&\max\{d_{\textup{KL}}(\mu^{y,\dagger}_{\theta}\|\mu^{y,\textup{E}}_{\theta}),\,d_{\textup{KL}}(\mu^{y,\textup{E}}_{\theta}\|\mu^{y,\dagger}_{\theta})\}\\
&\quad\leq C\left(\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}\right)
\end{align*}

for $C=\exp\left(2\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}+2\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\right)\max\left\{\sqrt{2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right),1\right\}$.

Proof of Proposition 2.5.

Given that $\Phi^{y,\textup{E}},\Phi^{y,\dagger}\in L^{1}_{\mu_{\theta}}$, we may apply Theorem 2.1 with $\Phi^{(1)}\leftarrow\Phi^{y,\dagger}$, $\Phi^{(2)}\leftarrow\Phi^{y,\textup{E}}$, and $\mu\leftarrow\mu_{\theta}$, to obtain

\begin{align*}
&\max\{d_{\textup{KL}}(\mu^{y,\dagger}_{\theta}\|\mu^{y,\textup{E}}_{\theta}),\,d_{\textup{KL}}(\mu^{y,\textup{E}}_{\theta}\|\mu^{y,\dagger}_{\theta})\}\\
&\quad\leq 2\exp\left(2\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}+2\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}\right)\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\\
&\quad\leq\exp\left(2\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}+2\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}\right)\max\left\{\sqrt{2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right),1\right\}\\
&\qquad\times\left(\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}\right),
\end{align*}

where the second inequality follows from the bound (2.8) of Lemma 2.4. ∎

The significance of Proposition 2.5 is similar to that of Proposition 2.3. The main difference consists in the fact that the error with respect to the best misfit is now controlled by

\[
\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}, \tag{2.9}
\]

where only the first term depends on the model error $\delta^{\dagger}$. Since we assume that $\Sigma_{\varepsilon}$ is positive definite, the first term in (2.9) vanishes if and only if $\delta^{\dagger}-m_{\mathfrak{u}}\in\textup{ker}(\mathcal{O})$ $\mu_{\theta}$-almost surely. This condition differs from the sufficient condition for $\mu^{y,\dagger}_{\theta}=\mu^{y,\textup{A}}_{\theta}$ that was implied by Lemma 2.2, namely, that $\delta^{\dagger}\in\textup{ker}(\mathcal{O})$ $\mu_{\theta}$-almost surely. By (1.1), the second term in (2.9) vanishes if and only if

\[
\mu_{\theta}\left(y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\in\textup{ker}\left(\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}\right)\right)=1. \tag{2.10}
\]

If $\mathbb{P}(\mathfrak{u}\in m_{\mathfrak{u}}+\textup{ker}(\mathcal{O}))=1$, then $\mathcal{O}\mathfrak{u}=\mathcal{O}m_{\mathfrak{u}}$ almost surely, which implies that $\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast}=0$. This in turn implies that (2.10) holds. Thus, if $\mu_{\theta}(\delta^{\dagger}\in m_{\mathfrak{u}}+\textup{ker}(\mathcal{O}))=1$ and $\mathbb{P}(\mathfrak{u}\in m_{\mathfrak{u}}+\textup{ker}(\mathcal{O}))=1$, then $\mu^{y,\dagger}_{\theta}=\mu^{y,\textup{E}}_{\theta}$. These sufficient conditions do not assume that $m_{\mathfrak{u}}\in\textup{ker}(\mathcal{O})$. In fact, if in addition to $\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast}=0$ it also holds that $m_{\mathfrak{u}}\in\textup{ker}(\mathcal{O})$, then it follows from (2.6) that $\Phi^{y,\textup{E}}=\Phi^{y,\textup{A}}$. The next lemma provides a more quantitative version of this statement, using the constant $C_{\textup{E}}$ given in (2.7).

Lemma 2.6.

If $\Phi^{y,\textup{A}},\Phi^{y,\textup{E}}\in L^{1}_{\mu_{\theta}}$, then

\begin{align}
\left\|\Phi^{y,\textup{A}}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\leq{}&2^{-1/2}\left|\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}}\left(\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right)\tag{2.11}\\
&+2^{-1}\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}.\nonumber
\end{align}

For the proof of Lemma 2.6, see Section B.2.

Proposition 2.7.

If $\Phi^{y,\textup{A}},\Phi^{y,\textup{E}}\in L^{1}_{\mu_{\theta}}$, then

\begin{align*}
&\max\{d_{\textup{KL}}(\mu^{y,\textup{A}}_{\theta}\|\mu^{y,\textup{E}}_{\theta}),\,d_{\textup{KL}}(\mu^{y,\textup{E}}_{\theta}\|\mu^{y,\textup{A}}_{\theta})\}\\
&\quad\leq C\left(\left|\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}}+\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}\right)
\end{align*}

for $C=\exp\left(2\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}+2\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\right)\max\left\{\sqrt{2}\left(\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right),1\right\}$.

The proof of Proposition 2.7 is similar to that of Proposition 2.5 except that one uses Lemma 2.6 instead of Lemma 2.4, so we omit it.

Proposition 2.7 implies the following: given an enhanced noise model with mean $m_{\mathfrak{u}}$ and covariance $\Sigma_{\mathfrak{u}}$, if one chooses the observation operator $\mathcal{O}$ so that

\[
\left|\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}}=0=\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}},
\]

then the enhanced noise posterior $\mu^{y,\textup{E}}_{\theta}$ and the approximate posterior $\mu^{y,\textup{A}}_{\theta}$ coincide. Equivalently, $\mathcal{O}m_{\mathfrak{u}}=0$ and (2.10) together imply that $\mu^{y,\textup{E}}_{\theta}=\mu^{y,\textup{A}}_{\theta}$. Such a choice of observation operator therefore yields an enhanced noise posterior $\mu^{y,\textup{E}}_{\theta}$ that does not account for model error. In this case, the approximate posterior $\mu^{y,\textup{A}}_{\theta}$ and the enhanced noise posterior $\mu^{y,\textup{E}}_{\theta}$ perform equally well in terms of inference, and it would be better to use the approximate posterior instead, since it is simpler. Thus, the conditions $\mathcal{O}m_{\mathfrak{u}}=0$ and (2.10) give an example of negative criteria that one can use to exclude observation operators from consideration when one aims to mitigate the effect of model error on Bayesian inference.

2.3 Bounds on KL divergence for joint posterior

In the enhanced noise approach given by (2.6), one accounts for the uncertainty in the unknown state error $\delta^{\dagger}(\theta^{\dagger})$ by approximating the unknown state error using a random variable $\mathfrak{u}$. The goal is to infer $\theta^{\dagger}$ only. In the joint parameter-error inference approach, one aims to infer $(\theta^{\dagger},\delta^{\dagger})$ jointly, by using a random variable $(\theta,\delta)$ with prior $\mu_{\theta,\delta}$ and using Bayes’ formula. The joint prior on the random variable $(\theta,\delta)$ is given by the product $\mu_{\theta}\otimes\mu_{\delta}$ of the prior $\mu_{\theta}$ on the unknown $\theta^{\dagger}$ and a prior $\mu_{\delta}$ on the unknown $\delta^{\dagger}$. It is possible to specify the prior $\mu_{\delta}$ on the unknown $\delta^{\dagger}$ without knowing $\delta^{\dagger}$, but one must specify a space $\mathscr{D}$ of admissible values of $\delta^{\dagger}$. We assume that the space $\mathscr{D}$ is a Radon space and that $\mu_{\delta}$ is a Borel probability measure.

The assumption that the prior $\mu_{\theta,\delta}$ on $(\theta,\delta)$ has product structure is equivalent to the assumption that $\theta$ and $\delta$ are independent random variables. This independence assumption is reasonable, since by (2.3) it follows that $\delta^{\dagger}\coloneqq\mathcal{M}^{\dagger}-\mathcal{M}$ does not depend on $\theta^{\dagger}$.

Define the joint misfit and joint posterior by

\[
\Phi^{y,\textup{J}}(\theta^{\prime},\delta^{\prime})\coloneqq\tfrac{1}{2}\left|y-\mathcal{O}\mathcal{M}(\theta^{\prime})-\mathcal{O}\delta^{\prime}(\theta^{\prime})\right|^{2}_{\Sigma_{\varepsilon}^{-1}},\qquad\mu^{y,\textup{J}}_{\theta,\delta}\coloneqq(\mu_{\theta}\otimes\mu_{\delta})_{\Phi^{y,\textup{J}}}. \tag{2.12}
\]

An important disadvantage of jointly inferring the parameter and the model error is that the dimension of the space on which one performs inference increases; this tends to make the inference task more computationally expensive. Identifiability problems may also arise; see e.g. [5]. We shall not consider the problem of identifiability here. On the other hand, jointly inferring the parameter and the model error is consistent with the Bayesian approach of treating all unknowns as random variables and updating their distributions using the data. In addition, the joint inference approach offers the possibility of improving the approximate model $\mathcal{M}$ in a data-driven way, e.g. by using the $\delta$-marginal posterior mean. This improvement can be used to obtain better estimates of both the parameter $\theta^{\dagger}$ and the state $u^{\dagger}=\mathcal{M}^{\dagger}(\theta^{\dagger})$.
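As a computational illustration of (2.12), the sketch below represents the model-error argument $\delta^{\prime}$ by coefficients in a small hypothetical basis and evaluates the joint misfit on samples from the product prior; the basis, the approximate model, and the data vector are placeholders, not the objects used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(6)
n_obs, dim_state, dim_err = 5, 50, 3
x = np.linspace(0.0, 1.0, dim_state)

O = rng.standard_normal((n_obs, dim_state)) / dim_state
Sigma_inv = np.linalg.inv(0.01 * np.eye(n_obs))
M_approx = lambda th: th * x                                             # hypothetical approximate model
Psi = np.array([np.sin((k + 1) * np.pi * x) for k in range(dim_err)]).T  # hypothetical error basis

y = rng.standard_normal(n_obs)               # placeholder data vector

def joint_misfit(theta, c):
    # Phi^{y,J}(theta, delta), with delta represented as Psi @ c
    r = y - O @ (M_approx(theta) + Psi @ c)
    return 0.5 * r @ Sigma_inv @ r

# Samples from the product prior mu_theta (x) mu_delta:
thetas = rng.standard_normal(1000)
cs = 0.1 * rng.standard_normal((1000, dim_err))
misfits = np.array([joint_misfit(t, c) for t, c in zip(thetas, cs)])
print(misfits.mean())                        # empirical estimate of ||Phi^{y,J}||_{L^1}
```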

To compare the joint posterior $\mu^{y,\textup{J}}_{\theta,\delta}$ with the posterior measures $\mu^{y,\dagger}_{\theta}$, $\mu^{y,\textup{A}}_{\theta}$, and $\mu^{y,\textup{E}}_{\theta}$ that we introduced earlier, we ‘lift’ the misfits and posteriors according to

\[
\Phi^{y,\bullet}(\theta^{\prime},\delta^{\prime})\coloneqq\Phi^{y,\bullet}(\theta^{\prime}),\qquad\mu^{y,\bullet}_{\theta,\delta}\coloneqq(\mu_{\theta}\otimes\mu_{\delta})_{\Phi^{y,\bullet}},\qquad\bullet\in\{\textup{A},\textup{E},\dagger\}. \tag{2.13}
\]

In (2.13), we use the notation $\Phi^{y,\bullet}$ to refer to two functions defined on different domains. The following lemma shows that this does not result in any fundamental changes, and that the lifted posterior is the product of the original posterior with the prior law $\mu_{\delta}$ for the unknown $\delta^{\dagger}$.

Lemma 2.8.

For $\bullet\in\{\textup{A},\dagger,\textup{E}\}$,

\[
\mu^{y,\bullet}_{\theta,\delta}=\mu^{y,\bullet}_{\theta}\otimes\mu_{\delta}. \tag{2.14}
\]

If $\Phi^{y,\bullet}\in L^{q}_{\mu_{\theta}}$ for some $q>0$, then

\[
\left\|\Phi^{y,\bullet}\right\|_{L^{q}_{\mu_{\theta}\otimes\mu_{\delta}}}=\left\|\Phi^{y,\bullet}\right\|_{L^{q}_{\mu_{\theta}}}. \tag{2.15}
\]

Proof.

Since $\mu_{\delta}$ is a probability measure, the definition of $\Phi^{y,\bullet}(\theta^{\prime},\delta^{\prime})$ in (2.13) implies that

\[
\int_{\Theta\times\mathscr{D}}\exp(-\Phi^{y,\bullet}(\theta^{\prime},\delta^{\prime}))\,\mathrm{d}(\mu_{\theta}\otimes\mu_{\delta})(\theta^{\prime},\delta^{\prime})=\int_{\Theta}\exp(-\Phi^{y,\bullet}(\theta^{\prime}))\,\mathrm{d}\mu_{\theta}(\theta^{\prime}).
\]

Thus, the normalisation constant for the likelihood $\exp(-\Phi^{y,\bullet})$ that generates the lifted posterior $\mu^{y,\bullet}_{\theta,\delta}$ on $\Theta\times\mathscr{D}$ agrees with the corresponding normalisation constant for the posterior $\mu^{y,\bullet}_{\theta}$ on $\Theta$. This proves (2.14). The fact that $\Phi^{y,\bullet}$ is constant with respect to $\delta^{\prime}$ implies the second statement of the lemma. ∎

Recall that $\mathbb{P}$ denotes the probability measure of the underlying probability space. Below, we write the random variables $\theta$ and $\delta$ explicitly, and take $L^{p}$-norms with respect to $\mathbb{P}$ instead of $\mu_{\theta}\otimes\mu_{\delta}$, e.g.

\[
\left\|\left|\mathcal{O}(\delta^{\dagger}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}=\int_{\Theta\times\mathscr{D}}\left|\mathcal{O}(\delta^{\dagger}-\delta^{\prime})(\theta^{\prime})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\,\mathrm{d}(\mu_{\theta}\otimes\mu_{\delta})(\theta^{\prime},\delta^{\prime}).
\]

See (2.16) below for an example.

Lemma 2.9.

If $\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}$, then

\[
\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq 2^{-1/2}\left\|\left|\mathcal{O}(\delta^{\dagger}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right). \tag{2.16}
\]

Furthermore, $\left\|\left|\mathcal{O}(\delta^{\dagger}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\leq 2^{1/2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)$.

Proposition 2.10.

If $\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}$, then $\mu^{y,\textup{J}}_{\theta,\delta}$ and $\mu^{y,\dagger}_{\theta,\delta}$ satisfy

\[
\max\{d_{\textup{KL}}(\mu^{y,\dagger}_{\theta,\delta}\|\mu^{y,\textup{J}}_{\theta,\delta}),\,d_{\textup{KL}}(\mu^{y,\textup{J}}_{\theta,\delta}\|\mu^{y,\dagger}_{\theta,\delta})\}\leq C\left\|\left|\mathcal{O}(\delta^{\dagger}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2},
\]

where $C=2^{1/2}\exp\left(2\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}+2\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}\right)\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)$.

We prove Lemma 2.9 in Section B.3. The proof of Proposition 2.10 is similar to the proofs of Proposition 2.3 and Proposition 2.5, so we omit it.

By Proposition 2.10, a sufficient condition for $\mu^{y,\textup{J}}_{\theta,\delta}=\mu^{y,\dagger}_{\theta,\delta}$ is that

\[
\mathbb{P}\left(\delta^{\dagger}(\theta)-\delta(\theta)\in\textup{ker}(\mathcal{O})\right)=1.
\]

In order to use this condition to choose a prior $\mu_{\delta}$ and observation operator $\mathcal{O}$, one must have some a priori knowledge about the set $\{\delta^{\dagger}(\theta^{\prime})\,:\,\theta^{\prime}\in\Theta\}$, e.g. that this set is contained in a proper affine subspace $x+V=\{x+v\,:\,v\in V\}$ of $\mathcal{U}$, where $x$ and $V$ are an element and subspace of $\mathcal{U}$ respectively. However, by the discussion before Section 2.2, if one knows that $\{\delta^{\dagger}(\theta^{\prime})\,:\,\theta^{\prime}\in\Theta\}\subseteq x+V$, then one can use this information to choose $\mathcal{O}$ so that $\mu^{y,\textup{A}}_{\theta}=\mu^{y,\dagger}_{\theta}$, e.g. by setting $\mathcal{O}$ so that $\textup{span}(x+V)\subseteq\textup{ker}(\mathcal{O})$.

Next, we consider the lifted approximate posterior $\mu^{y,\textup{A}}_{\theta,\delta}$ and the joint posterior $\mu^{y,\textup{J}}_{\theta,\delta}$.

Lemma 2.11.

If $\Phi^{y,\textup{A}}\in L^{1}_{\mu_{\theta}}$ and $\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}$, then

\[
\left\|\Phi^{y,\textup{A}}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq 2^{-1/2}\left\|\left|\mathcal{O}\delta(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right). \tag{2.17}
\]

Furthermore, $\left\|\left|\mathcal{O}\delta(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\leq 2^{1/2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)$.

Proposition 2.12.

If $\Phi^{y,\textup{A}}\in L^{1}_{\mu_{\theta}}$ and $\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}$, then

\[
\max\{d_{\textup{KL}}(\mu^{y,\textup{A}}_{\theta,\delta}\|\mu^{y,\textup{J}}_{\theta,\delta}),\,d_{\textup{KL}}(\mu^{y,\textup{J}}_{\theta,\delta}\|\mu^{y,\textup{A}}_{\theta,\delta})\}\leq C\left\|\left|\mathcal{O}\delta(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2},
\]

where $C=2^{1/2}\exp\left(2\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}+2\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}\right)\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)$.

We prove Lemma 2.11 in Section B.3 and omit the proof of Proposition 2.12 due to its similarity with the proofs of Proposition 2.3 and Proposition 2.5.

Recall that if (δ(θ)δ(θ)ker(𝒪))=1\mathbb{P}\left(\delta^{\dagger}(\theta)-\delta(\theta)\in\textup{ker}(\mathcal{O})\right)=1, then the joint posterior μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta} and lifted best posterior μθ,δy,\mu^{y,\dagger}_{\theta,\delta} agree. Proposition 2.12 shows that if (δ(θ)ker(𝒪))=1\mathbb{P}\left(\delta(\theta)\in\textup{ker}(\mathcal{O})\right)=1, then the joint posterior μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta} and the lifted approximate posterior μθ,δy,A\mu^{y,\textup{A}}_{\theta,\delta} agree. In this case, there is no additional benefit in terms of inference from using the joint posterior, so (δ(θ)ker(𝒪))=1\mathbb{P}\left(\delta(\theta)\in\textup{ker}(\mathcal{O})\right)=1 is a negative criterion by which one can exclude an observation operator from consideration when one aims to mitigate the effect of model error on Bayesian inference. We can interpret the negation of this condition, i.e. that δ(θ)ker(𝒪)\delta(\theta)\notin\textup{ker}(\mathcal{O}) with positive \mathbb{P}-probability, as follows: if the joint posterior μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta} differs from the lifted approximate posterior μθ,δy,A\mu^{y,\textup{A}}_{\theta,\delta}, then the observed state update 𝒪δ(θ)\mathcal{O}\delta(\theta) cannot vanish \mathbb{P}-almost surely. This implies that the model \mathcal{M} must be updated by some nonzero δ𝒟\delta^{\prime}\in\mathscr{D} with positive \mathbb{P}-probability.

The preceding observation establishes a connection with the so-called ‘parametrised background data-weak approach’ for state estimation. Consider the setting where the state space 𝒰\mathcal{U} is a Hilbert space and 𝒪=(1,,n)\mathcal{O}=(\ell_{1},\ldots,\ell_{n}) for nn continuous linear functionals (i)i[n]𝒰(\ell_{i})_{i\in[n]}\subset\mathcal{U}^{\ast}. By a Hilbert space identity, ker(𝒪)\textup{ker}(\mathcal{O}) is the orthogonal complement of the closure of the range of 𝒪\mathcal{O}^{\ast}. Denote the Riesz representation operator by RR, so that Ri𝒰R\ell_{i}\in\mathcal{U} for every i[n]i\in[n]. Then one can verify by direct calculations that ran(𝒪)=span(Ri,i[n])\textup{ran}(\mathcal{O}^{\ast})=\textup{span}(R\ell_{i},\ i\in[n]). Since every finite-dimensional subspace of 𝒰\mathcal{U} is closed, δ(θ)ker(𝒪)\delta(\theta)\notin\textup{ker}(\mathcal{O}) holds whenever δ(θ)\delta(\theta) is a nonzero element of span(Ri,i[n])\textup{span}(R\ell_{i},\ i\in[n]). The subspace span(Ri,i[n])\textup{span}(R\ell_{i},\ i\in[n]) corresponds to the ‘variational update space’ in [22, Section 2.3]. The term ‘update space’ refers to the goal of updating an existing model that maps parameters to states in order to improve state prediction.
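The following finite-dimensional sketch is one way to illustrate these objects. It assumes 𝒰 = ℝ^m with a weighted inner product ⟨u,w⟩ = uᵀWw for a hypothetical symmetric positive definite matrix W, computes the Riesz representers Rℓi = W⁻¹aᵢ of the functionals ℓi(u) = aᵢᵀu, and checks that a (generic) nonzero element of the update space span(Rℓi, i ∈ [n]) does not lie in ker(𝒪).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 12, 4
A = rng.standard_normal((n, m))          # rows a_i define the functionals l_i(u) = a_i^T u
B = rng.standard_normal((m, m))
W = B @ B.T + m * np.eye(m)              # hypothetical Gram matrix of the inner product <u, w> = u^T W w

# Riesz representers: <R l_i, u>_W = a_i^T u for all u, hence R l_i = W^{-1} a_i.
R = np.linalg.solve(W, A.T)              # columns are the representers R l_i (the 'update space' basis)

# The observation operator O u = (l_1(u), ..., l_n(u)) is simply u -> A u.
delta = R @ rng.standard_normal(n)       # a nonzero element of the update space span(R l_i)
print("||O delta|| =", np.linalg.norm(A @ delta))   # strictly positive: delta is not in ker(O)

# Sanity check of the representer property on a random test vector.
u = rng.standard_normal(m)
assert np.allclose(R.T @ W @ u, A @ u)
```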

It is possible to state and prove the analogues of the preceding bounds for the lifted enhanced noise posterior μθ,δy,E\mu^{y,\textup{E}}_{\theta,\delta}. These bounds are not relevant for the main goal of this paper. However, for the sake of completeness, we state them in Section B.3.

2.4 Kullback–Leibler error of marginal posterior

We define the marginal posterior as the θ\theta-marginal of the joint posterior μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta}:

μθy,M(S)=S×𝒟dμθ,δy,J(θ,δ),S(Θ).\mu^{y,\textup{M}}_{\theta}(S)=\int_{S\times\mathscr{D}}\mathrm{d}\mu^{y,\textup{J}}_{\theta,\delta}(\theta^{\prime},\delta^{\prime}),\quad S\in\mathcal{B}(\Theta). (2.18)

One can approximate the marginal posterior μθy,M\mu^{y,\textup{M}}_{\theta} by using Monte Carlo integration of the joint posterior μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta} over possible realisations of δ\delta, for example. Since the marginal posterior requires the joint posterior, it inherits the disadvantages and advantages of the joint posterior.
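As a simplified illustration of such a Monte Carlo approximation, the following sketch uses scalar surrogates for θ and δ, a hypothetical joint misfit Φ^{y,J}, and self-normalised importance weights proportional to exp(−Φ^{y,J}(θ,δ)) attached to prior samples; the weighted θ-samples then approximate the marginal posterior μθy,M. All numerical values here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

def misfit(theta, delta, y=1.5):
    # Hypothetical joint misfit Phi^{y,J}(theta, delta) for scalar data y,
    # identity observation operator and unit noise variance.
    return 0.5 * (y - (theta + delta)) ** 2

K = 100_000
theta = rng.normal(0.0, 1.0, size=K)     # samples from the prior mu_theta
delta = rng.normal(0.0, 0.3, size=K)     # samples from the prior mu_delta

# Self-normalised importance weights targeting the joint posterior.
logw = -misfit(theta, delta)
w = np.exp(logw - logw.max())
w /= w.sum()

# Weighted theta-samples approximate the marginal posterior mu^{y,M}_theta.
marginal_mean = np.sum(w * theta)
marginal_var = np.sum(w * (theta - marginal_mean) ** 2)
print(f"marginal posterior mean ~ {marginal_mean:.3f}, variance ~ {marginal_var:.3f}")
```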

In Section 2.3, we bounded the KL divergence of the joint posterior μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta} with respect to the lifted posteriors μθ,δy,\mu^{y,\bullet}_{\theta,\delta} for {A,,E}\bullet\in\{\textup{A},\dagger,\textup{E}\} that were defined in (2.13), and we observed in Lemma 2.8 that the θ\theta-marginal of the lifted posterior μθ,δy,\mu^{y,\bullet}_{\theta,\delta} is exactly μθy,\mu^{y,\bullet}_{\theta}. These bounds imply analogous bounds on the KL divergence of the marginal posterior μθy,M\mu^{y,\textup{M}}_{\theta} with respect to μθy,\mu^{y,\dagger}_{\theta}, μθy,A\mu^{y,\textup{A}}_{\theta}, and μθy,E\mu^{y,\textup{E}}_{\theta}.

Below, μδ|θy,J\mu^{y,\textup{J}}_{\delta|\theta} denotes the regular version of the joint posterior law of δ\delta conditioned on θ\theta. Given our assumption that the random variable δ\delta takes values in a Radon space, it follows that the regular conditional distribution μδ|θy,J\mu^{y,\textup{J}}_{\delta|\theta} exists and is unique up to sets of μθ\mu_{\theta}-measure zero.

Proposition 2.13.

Let Φy,JLμθμδ1\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}} and {A,,E}\bullet\in\{\textup{A},\dagger,\textup{E}\}. Suppose Φy,Lμθ1\Phi^{y,\bullet}\in L^{1}_{\mu_{\theta}}. Then

dKL(μθy,Mμθy,)=\displaystyle d_{\textup{KL}}(\mu^{y,\textup{M}}_{\theta}\|\mu^{y,\bullet}_{\theta})= dKL(μθ,δy,Jμθ,δy,)ΘdKL(μδ|θy,Jμδ)dμθy,M,\displaystyle d_{\textup{KL}}(\mu^{y,\textup{J}}_{\theta,\delta}\|\mu^{y,\bullet}_{\theta,\delta})-\int_{\Theta}d_{\textup{KL}}(\mu^{y,\textup{J}}_{\delta|\theta}\|\mu_{\delta})\mathrm{d}\mu^{y,\textup{M}}_{\theta},
dKL(μθy,μθy,M)=\displaystyle d_{\textup{KL}}(\mu^{y,\bullet}_{\theta}\|\mu^{y,\textup{M}}_{\theta})= dKL(μθ,δy,μθ,δy,J)ΘdKL(μδμδ|θy,J)dμθy,.\displaystyle d_{\textup{KL}}(\mu^{y,\bullet}_{\theta,\delta}\|\mu^{y,\textup{J}}_{\theta,\delta})-\int_{\Theta}d_{\textup{KL}}(\mu_{\delta}\|\mu^{y,\textup{J}}_{\delta|\theta})\mathrm{d}\mu^{y,\bullet}_{\theta}.
Proof.

We use the chain rule for the KL divergence, see e.g. [32, Exercise 3.2]:

dKL(μθ,δy,Jμθ,δy,)=\displaystyle d_{\textup{KL}}(\mu^{y,\textup{J}}_{\theta,\delta}\|\mu^{y,\bullet}_{\theta,\delta})= dKL(μθy,Mμθy,)+ΘdKL(μδ|θy,Jμδ|θy,)dμθy,M,\displaystyle d_{\textup{KL}}(\mu^{y,\textup{M}}_{\theta}\|\mu^{y,\bullet}_{\theta})+\int_{\Theta}d_{\textup{KL}}(\mu^{y,\textup{J}}_{\delta|\theta}\|\mu^{y,\bullet}_{\delta|\theta})\mathrm{d}\mu^{y,\textup{M}}_{\theta},
dKL(μθ,δy,μθ,δy,J)=\displaystyle d_{\textup{KL}}(\mu^{y,\bullet}_{\theta,\delta}\|\mu^{y,\textup{J}}_{\theta,\delta})= dKL(μθy,μθy,M)+ΘdKL(μδ|θy,μδ|θy,J)dμθy,.\displaystyle d_{\textup{KL}}(\mu^{y,\bullet}_{\theta}\|\mu^{y,\textup{M}}_{\theta})+\int_{\Theta}d_{\textup{KL}}(\mu^{y,\bullet}_{\delta|\theta}\|\mu^{y,\textup{J}}_{\delta|\theta})\mathrm{d}\mu^{y,\bullet}_{\theta}.

Recall that by Lemma 2.8, μθ,δy,=μθy,μδ\mu^{y,\bullet}_{\theta,\delta}=\mu^{y,\bullet}_{\theta}\otimes\mu_{\delta}, for {A,,E}\bullet\in\{\textup{A},\dagger,\textup{E}\}. This implies that μδ|θy,=μδ\mu^{y,\bullet}_{\delta|\theta}=\mu_{\delta} for {A,,E}\bullet\in\{\textup{A},\dagger,\textup{E}\}. ∎

Proposition 2.13 implies that the KL divergence of the marginal posterior μθy,M\mu^{y,\textup{M}}_{\theta} with respect to μθy,\mu^{y,\bullet}_{\theta} for {A,,E}\bullet\in\{\textup{A},\dagger,\textup{E}\} is at most the KL divergence between the joint posterior μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta} and the corresponding lifted posterior μθ,δy,\mu^{y,\bullet}_{\theta,\delta}. In other words, marginalisation can only reduce the KL divergence. As a result, the sufficient conditions for μθ,δy,J\mu^{y,\textup{J}}_{\theta,\delta} to agree with μθ,δy,\mu^{y,\bullet}_{\theta,\delta} are also sufficient conditions for μθy,M\mu^{y,\textup{M}}_{\theta} to agree with μθy,\mu^{y,\bullet}_{\theta}, for {A,,E}\bullet\in\{\textup{A},\dagger,\textup{E}\}.
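The fact that marginalisation can only reduce the KL divergence can be checked numerically in a simple Gaussian setting. The sketch below, with arbitrary illustrative means and covariances and with the second measure in product form (as for the lifted posteriors), compares the KL divergence between two bivariate Gaussians with the KL divergence between their θ-marginals.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    # KL divergence between N(m0, S0) and N(m1, S1).
    d = len(m0)
    S1inv = np.linalg.inv(S1)
    dm = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + dm @ S1inv @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Two illustrative Gaussian laws on the pair (theta, delta).
m_joint = np.array([0.3, -0.1]); S_joint = np.array([[1.0, 0.4], [0.4, 0.8]])
m_prod  = np.array([0.0,  0.0]); S_prod  = np.array([[1.2, 0.0], [0.0, 0.8]])

kl_full = kl_gauss(m_joint, S_joint, m_prod, S_prod)
# theta-marginals: first component of the mean, (1,1) entry of the covariance.
kl_marg = kl_gauss(m_joint[:1], S_joint[:1, :1], m_prod[:1], S_prod[:1, :1])
print(f"KL(joint) = {kl_full:.4f} >= KL(theta-marginal) = {kl_marg:.4f}")
```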

3 Example

In this section, we consider an inverse problem based on a PDE initial boundary value problem (PDE-IBVP), and use it to illustrate the positive criterion from Section 2.1 for choosing an observation operator that eliminates the effect of model error on inference. In other words, combining this observation operator with the approximate model \mathcal{M} yields exactly the same inference as if one had used the best model \mathcal{M}^{\dagger} instead of \mathcal{M}. This PDE-IBVP was discussed from the perspective of classical optimal experimental design in [3, 4], for example.

Below, we use ‘ww’ as a placeholder symbol whose value may change from line to line.

3.1 Advection-diffusion initial boundary value problem

Let 𝒯(0,T)\mathcal{T}\coloneqq(0,T) be a time interval and 𝒟\mathcal{D} be a bounded open domain in 2\mathbb{R}^{2}. Fix a constant κ>0\kappa>0, a vector field v:𝒟2v:\mathcal{D}\to\mathbb{R}^{2}, nonzero s:𝒟s:\mathcal{D}\to\mathbb{R} and nonnegative b:𝒟b^{\dagger}:\mathcal{D}\to\mathbb{R}. Define the parabolic operator κΔ+v\mathcal{L}\coloneqq-\kappa\Delta+v\cdot\nabla. For an arbitrary, sufficiently regular function θ:𝒯\theta^{\prime}:\mathcal{T}\to\mathbb{R}, consider the PDE-IBVP with homogeneous Neumann boundary condition and inhomogeneous initial condition

(t+)w(x,t)=\displaystyle(\partial_{t}+\mathcal{L})w(x,t)= s(x)θ(t),\displaystyle s(x)\theta^{\prime}(t), (x,t)𝒟×𝒯,\displaystyle(x,t)\in\mathcal{D}\times\mathcal{T}, (3.1a)
w(x,t)n(x)=\displaystyle\nabla w(x,t)\cdot n(x)= 0,\displaystyle 0, (x,t)𝒟×𝒯¯,\displaystyle(x,t)\in\partial\mathcal{D}\times\overline{\mathcal{T}}, (3.1b)
w(x,0)=\displaystyle w(x,0)= b(x),\displaystyle b^{\dagger}(x), (x,0)𝒟×{0},\displaystyle(x,0)\in\mathcal{D}\times\{0\}, (3.1c)

where n(x)n(x) in (3.1b) denotes the unit exterior normal to 𝒟\mathcal{D} at x𝒟x\in\partial\mathcal{D}. We assume that a unique solution w=w(θ)w=w(\theta^{\prime}) of (3.1) exists, and define the best model \mathcal{M}^{\dagger} by (θ)w(θ)\mathcal{M}^{\dagger}(\theta^{\prime})\coloneqq w(\theta^{\prime}). For a fixed θ:𝒯\theta^{\dagger}:\mathcal{T}\to\mathbb{R}, the true state is the function u(θ):𝒟×𝒯u^{\dagger}\coloneqq\mathcal{M}^{\dagger}(\theta^{\dagger}):\mathcal{D}\times\mathcal{T}\to\mathbb{R}. For the theory of existence and regularity of solutions of parabolic PDE-IBVPs with Neumann boundary conditions, see e.g. [11, Part 2, Section 10].

We define the approximate model \mathcal{M} in a similar way as we defined \mathcal{M}^{\dagger}. For the same coefficients and θ\theta^{\prime} as above, we consider the IBVP with homogeneous Neumann boundary condition and homogeneous initial condition

(t+)w(x,t)=\displaystyle(\partial_{t}+\mathcal{L})w(x,t)= s(x)θ(t),\displaystyle s(x)\theta^{\prime}(t), (x,t)𝒟×𝒯,\displaystyle(x,t)\in\mathcal{D}\times\mathcal{T}, (3.2a)
w(x,t)n(x)=\displaystyle\nabla w(x,t)\cdot n(x)= 0,\displaystyle 0, (x,t)𝒟×𝒯¯,\displaystyle(x,t)\in\partial\mathcal{D}\times\overline{\mathcal{T}}, (3.2b)
w(x,0)=\displaystyle w(x,0)= 0,\displaystyle 0, (x,0)𝒟×{0}.\displaystyle(x,0)\in\mathcal{D}\times\{0\}. (3.2c)

Note that (3.1) and (3.2) differ only in their respective initial conditions. We define (θ)w(θ)\mathcal{M}(\theta^{\prime})\coloneqq w(\theta^{\prime}), for w(θ)w(\theta^{\prime}) the unique solution of the IBVP (3.2). By homogeneity of the initial condition and the boundary condition, and using the superposition principle for linear PDEs, it follows that the approximate model \mathcal{M} is a linear function of its argument θ\theta^{\prime}.

Recall from (2.3) that δ\delta^{\dagger}\coloneqq\mathcal{M}^{\dagger}-\mathcal{M}. Then δ\delta^{\dagger} solves the PDE-IBVP

(t+)w(x,t)=\displaystyle(\partial_{t}+\mathcal{L})w(x,t)= 0,\displaystyle 0, (x,t)𝒟×𝒯,\displaystyle(x,t)\in\mathcal{D}\times\mathcal{T}, (3.3a)
w(x,t)n(x)=\displaystyle\nabla w(x,t)\cdot n(x)= 0,\displaystyle 0, (x,t)𝒟×𝒯¯,\displaystyle(x,t)\in\partial\mathcal{D}\times\overline{\mathcal{T}}, (3.3b)
w(x,0)=\displaystyle w(x,0)= b(x),\displaystyle b^{\dagger}(x), (x,0)𝒟×{0}.\displaystyle(x,0)\in\mathcal{D}\times\{0\}. (3.3c)

Since ww does not depend on θ\theta^{\prime}, it follows that δ\delta^{\dagger} is constant with respect to θ\theta^{\prime}. Rewriting (2.3) as =+δ\mathcal{M}^{\dagger}=\mathcal{M}+\delta^{\dagger}, it follows that \mathcal{M}^{\dagger} is an affine, and hence non-linear, function of its argument θ\theta^{\prime}.

Recall the decomposition δ=\delta^{\dagger}=\mathcal{M}^{\dagger}-\mathcal{M} in (2.3). Suppose we keep the definition of the approximate model \mathcal{M} as the mapping that sends θ\theta^{\prime} to the solution of (3.2), after replacing the initial condition (3.2c) with w(x,0)=b(x)w(x,0)=b(x) for some nonnegative, nonzero bb. If bbb^{\dagger}-b is small, then one expects the corresponding model error δ=\delta^{\dagger}=\mathcal{M}^{\dagger}-\mathcal{M} to also be small. However, δ\delta^{\dagger} is now the solution of the PDE-IBVP (3.3), after replacing the initial condition (3.3c) with w(x,0)=b(x)b(x)w(x,0)=b^{\dagger}(x)-b(x). If bbb^{\dagger}-b assumes negative values on 𝒟\mathcal{D}, then δ\delta^{\dagger} no longer describes the concentration of a substance. This may be undesirable, given that one uses advection-diffusion PDEs to describe the spatiotemporal evolution of concentrations. Moreover, for differential equations that model chemical reactions, negative initial values can lead to finite-time blow-up, and small negative values in solutions can lead to instability, cf. [15, Chapter I, Section 1.1]. Since we do not have a lower bound for bb^{\dagger} except for b0b^{\dagger}\geq 0, this explains the choice of the initial condition b=0b=0 in the PDE-IBVP (3.2) for defining \mathcal{M} and δ\delta^{\dagger}.

3.2 Agreement of approximate and best posteriors

Let Θ\Theta be C(𝒯¯,>0)C^{\infty}(\overline{\mathcal{T}},\mathbb{R}_{>0}), equipped with the supremum norm L\left\|\cdot\right\|_{L^{\infty}}. Assume that the vector field vv in the definition of \mathcal{L} satisfies vC(𝒟,2)L(𝒟,2)v\in C^{\infty}(\mathcal{D},\mathbb{R}^{2})\cap L^{\infty}(\mathcal{D},\mathbb{R}^{2}), and that both the spatial component ss on the right-hand side of (3.1a) and the true initial condition bb^{\dagger} in (3.1c) belong to C(𝒟¯,>0)C^{\infty}(\overline{\mathcal{D}},\mathbb{R}_{>0}). That is, all the terms that define the IBVPs are smooth and bounded on their respective domains, with strictly positive right-hand side s()θ()s(\cdot)\theta^{\prime}(\cdot) in (3.1a) and strictly positive initial condition b()b^{\dagger}(\cdot) in (3.1c). We define the state space 𝒰\mathcal{U} to be

𝒰C2,1(𝒟×𝒯){w satisfies (3.1b)},\mathcal{U}\coloneqq C^{2,1}(\mathcal{D}\times\mathcal{T})\cap\{w\text{ satisfies \eqref{eq_BC}}\},

where for k,j0k,j\in\mathbb{N}_{0}, Ck,j(𝒟×𝒯){w|wCk,j(𝒟×𝒯)<}C^{k,j}(\mathcal{D}\times\mathcal{T})\coloneqq\{w\ |\ \left\|w\right\|_{C^{k,j}(\mathcal{D}\times\mathcal{T})}<\infty\}, and

wCk,j(𝒟×𝒯)wL+0<|β|kxβwL+0<αjtαwL,\left\|w\right\|_{C^{k,j}(\mathcal{D}\times\mathcal{T})}\coloneqq\left\|w\right\|_{L^{\infty}}+\sum_{0<\left|\beta\right|\leq k}\left\|\partial_{x}^{\beta}w\right\|_{L^{\infty}}+\sum_{0<\alpha\leq j}\left\|\partial_{t}^{\alpha}w\right\|_{L^{\infty}},

xβ\partial_{x}^{\beta} denotes the spatial derivative associated to the multi-index β02\beta\in\mathbb{N}_{0}^{2}, and tα\partial_{t}^{\alpha} denotes the α\alpha-th order time derivative. The first and second sums on the right-hand side are defined to be zero if k=0k=0 and j=0j=0 respectively.

Given that κ>0\kappa>0 is constant and v:𝒟2v:\mathcal{D}\to\mathbb{R}^{2} is bounded, the operator t+\partial_{t}+\mathcal{L} satisfies

(t+)wC0,0(𝒟×𝒯)CwC2,1(𝒟×𝒯),w𝒰,\left\|(\partial_{t}+\mathcal{L})w\right\|_{C^{0,0}(\mathcal{D}\times\mathcal{T})}\leq C\left\|w\right\|_{C^{2,1}(\mathcal{D}\times\mathcal{T})},\quad w\in\mathcal{U},

for C=C(κ,v)>0C=C(\kappa,v)\in\mathbb{R}_{>0} independent of ww. Thus, for arbitrary (x(i),ti)𝒟×𝒯(x^{(i)},t_{i})\in\mathcal{D}\times\mathcal{T} for i[n]i\in[n], the linear observation operator

𝒪:𝒰n,w𝒪w((t+)w(x(i),ti))i[n]\mathcal{O}:\mathcal{U}\to\mathbb{R}^{n},\quad w\mapsto\mathcal{O}w\coloneqq((\partial_{t}+\mathcal{L})w(x^{(i)},t_{i}))_{i\in[n]} (3.4)

is continuous, since 𝒪wnn(t+)wC0,0(𝒟×𝒯)\left\|\mathcal{O}w\right\|_{\mathbb{R}^{n}}\leq n\left\|(\partial_{t}+\mathcal{L})w\right\|_{C^{0,0}(\mathcal{D}\times\mathcal{T})}. Moreover, for 𝒪\mathcal{O} given by (3.4), it follows from (3.3a) that {w|w satisfies (3.3a)}ker(𝒪)\{w\ |\ w\text{ satisfies \eqref{eq_PDE_delta}}\}\subset\textup{ker}(\mathcal{O}), since 𝒪\mathcal{O} is the composition of the vectorisation of the evaluation functionals at (x(i),ti)i[n](x^{(i)},t_{i})_{i\in[n]} with the differential operator on the left-hand side of (3.3a). Thus, given the definition of 𝒪\mathcal{O} above and the definition of δ\delta^{\dagger} in Section 3.1 as the solution of the IBVP defined by (3.3), we have

𝒪δ(θ)=0,θΘ.\mathcal{O}\delta^{\dagger}(\theta^{\prime})=0,\quad\forall\theta^{\prime}\in\Theta.

In particular, the right-hand side of (2.5) in Lemma 2.2 vanishes for this choice of 𝒪\mathcal{O}. By Proposition 2.3, it follows that μθy,A=μθy,\mu^{y,\textup{A}}_{\theta}=\mu^{y,\dagger}_{\theta}, i.e. the approximate posterior and the best posterior agree.

In the discussion after Proposition 2.5, we showed that if the image of Θ\Theta under δ\delta^{\dagger} is contained in some linear subspace VV of 𝒰\mathcal{U}, then it is simpler and easier to use the approximate posterior μθy,A\mu^{y,\textup{A}}_{\theta} with an appropriately chosen 𝒪\mathcal{O} instead of the enhanced noise posterior μθy,E\mu^{y,\textup{E}}_{\theta}. Since δ\delta^{\dagger} takes values in the linear subspace {w|w satisfies (3.3a)}\{w\ |\ w\text{ satisfies \eqref{eq_PDE_delta}}\} of 𝒰\mathcal{U}, the same conclusion holds for this example, with the appropriate choice of 𝒪\mathcal{O} given in (3.4). Similarly, by the discussion after Proposition 2.10, it is simpler and easier to use the approximate posterior with 𝒪\mathcal{O} from (3.4) instead of the joint posterior.

In this example, the only information about the best model error δ\delta^{\dagger} that we used to find an observation operator 𝒪\mathcal{O} that satisfies μθ(𝒪δ=0)=1\mu_{\theta}(\mathcal{O}\delta^{\dagger}=0)=1 is contained in the PDE (3.3a). We exploited this information to choose the observation operator in (3.4). In the absence of additional conditions on δ\delta^{\dagger}, an observation operator that does not satisfy (3.4) may fail to satisfy the condition μθ(𝒪δ=0)=1\mu_{\theta}(\mathcal{O}\delta^{\dagger}=0)=1. Thus, we can consider the collection of observation operators with the structure (3.4) as being the largest possible collection, given the available information.

We end this section with some important remarks.

Suppose ww solves (3.1) for a given θ\theta^{\prime}. Then the operator 𝒪\mathcal{O} in (3.4) satisfies 𝒪w=(s(x(i))θ(ti))i[n]\mathcal{O}w=(s(x^{(i)})\theta^{\prime}(t_{i}))_{i\in[n]}. In particular, if the spatial function ss is known and nonzero, then one can obtain the numbers (θ(ti))i[n](\theta^{\prime}(t_{i}))_{i\in[n]} from the vector 𝒪wn\mathcal{O}w\in\mathbb{R}^{n} directly, by division. We will return to this observation in Section 4.3.

For the operator 𝒪\mathcal{O} in (3.4), exact evaluation of the vector 𝒪w\mathcal{O}w for any w𝒰w\in\mathcal{U} in practical experimental settings can be difficult. For example, in practical settings, one might only be able to measure the concentration of a certain substance at fixed spatial locations (x(j))j[J]𝒟(x^{(j)})_{j\in[J]}\subset\mathcal{D} and at multiples kΔt𝒯k\Delta t\in\mathcal{T} of some time lag Δt>0\Delta t>0, k[K]k\in[K], KK\in\mathbb{N}. The corresponding observation operator maps every state w𝒰w\in\mathcal{U} to a vector (w(x(j),kΔt))j[J],k[K]JK(w(x^{(j)},k\Delta t))_{j\in[J],k\in[K]}\in\mathbb{R}^{JK} of concentrations. We consider one such operator in (4.3a) below. In this case, and in the absence of further conditions on the spatiotemporal locations (x(j),kΔt)j[J],k[K](x^{(j)},k\Delta t)_{j\in[J],k\in[K]}, there is no guarantee that one can approximate the operator 𝒪\mathcal{O} from (3.4). However, if the spatiotemporal locations lie on a finite difference stencil, then one could potentially combine the information in the vector (w(x(j),kΔt))j[J],k[K]JK(w(x^{(j)},k\Delta t))_{j\in[J],k\in[K]}\in\mathbb{R}^{JK} to approximate 𝒪\mathcal{O} from (3.4). Such approximations may exhibit errors associated to the finite difference scheme, and may also be susceptible to noise in the vector (w(x(j),kΔt))j[J],k[K]JK(w(x^{(j)},k\Delta t))_{j\in[J],k\in[K]}\in\mathbb{R}^{JK}. We postpone the design and analysis of approximations of 𝒪\mathcal{O} in (3.4) for future work.
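To indicate what such a finite difference approximation might look like, the following one-dimensional sketch (our own simplification, not the operator used in Section 4) approximates (∂t − κ∂xx + v∂x)w at an interior grid point from neighbouring pointwise values w(xj, kΔt) by central differences, and compares the result with the exact value for a smooth test function.

```python
import numpy as np

def approx_pde_observation(w, j, k, dx, dt, kappa, v):
    """Approximate (d_t - kappa*d_xx + v*d_x) w at the grid point (x_j, t_k)
    from pointwise values w[j, k] on a uniform space-time grid (1D sketch)."""
    w_t  = (w[j, k + 1] - w[j, k - 1]) / (2.0 * dt)            # central difference in time
    w_x  = (w[j + 1, k] - w[j - 1, k]) / (2.0 * dx)            # central difference in space
    w_xx = (w[j + 1, k] - 2.0 * w[j, k] + w[j - 1, k]) / dx**2 # second difference in space
    return w_t - kappa * w_xx + v * w_x

# Test on a smooth function whose exact image under the operator is known.
kappa, v = 0.05, 1.0
dx, dt = 1e-2, 1e-2
xs = np.arange(0.0, 1.0 + dx / 2, dx)
ts = np.arange(0.0, 1.0 + dt / 2, dt)
X, T = np.meshgrid(xs, ts, indexing="ij")
w = np.sin(np.pi * X) * np.exp(-T)                             # w(x, t) = sin(pi x) exp(-t)

j, k = 30, 50
exact = (-1.0 + kappa * np.pi**2) * np.sin(np.pi * xs[j]) * np.exp(-ts[k]) \
        + v * np.pi * np.cos(np.pi * xs[j]) * np.exp(-ts[k])
print("FD approximation:", approx_pde_observation(w, j, k, dx, dt, kappa, v))
print("exact value:     ", exact)
```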

In order to identify 𝒪\mathcal{O} in (3.4) as an observation operator that satisfies 𝒪δ=0\mathcal{O}\delta^{\dagger}=0, we exploited the fact that the right-hand side of (3.3a) vanishes. Suppose that we modified the right-hand sides of (3.2a) and (3.3a) to be nonzero functions h,h:𝒟×𝒯h,h^{\dagger}:\mathcal{D}\times\mathcal{T}\to\mathbb{R} respectively, where for some fixed 0<λ<10<\lambda<1 and any θΘ\theta^{\prime}\in\Theta, h(x,t)λs(x)θ(t)h(x,t)\coloneqq\lambda s(x)\theta^{\prime}(t) and h(x,t)(1λ)s(x)θ(t)h^{\dagger}(x,t)\coloneqq(1-\lambda)s(x)\theta^{\prime}(t), for every (x,t)𝒟×𝒯(x,t)\in\mathcal{D}\times\mathcal{T}. If we do not know the initial condition bb^{\dagger} in (3.3c), then the only remaining information that we can use to find a linear operator 𝒪\mathcal{O} such that 𝒪δ=0\mathcal{O}\delta^{\dagger}=0 is captured by the homogeneous Neumann boundary conditions in (3.3). Since the PDE-IBVP (3.1) that defines \mathcal{M}^{\dagger} uses the same boundary conditions, any observation operator based on only the Neumann boundary conditions will imply that 𝒪δ=𝒪\mathcal{O}\delta^{\dagger}=\mathcal{O}\mathcal{M}^{\dagger} as maps on Θ\Theta. Thus, if 𝒪δ=0\mathcal{O}\delta^{\dagger}=0, then also 𝒪=0\mathcal{O}\mathcal{M}^{\dagger}=0. This example illustrates that it may not always be possible to find an observation operator 𝒪\mathcal{O} that satisfies 𝒪δ=0\mathcal{O}\delta^{\dagger}=0 and that yields informative observations.

4 Numerical simulations

In this section, we describe some numerical simulations of the PDE-IBVP from Section 3.1 for a specific vector field vv. We use these simulations to illustrate the behaviour of the best posterior μθy,\mu^{y,\dagger}_{\theta} and the approximate posterior μθy,A\mu^{y,\textup{A}}_{\theta} that were defined in (2.2) and (2.4) respectively. The code for these simulations is available at https://gitlab.tue.nl/hbansal/model-error-study.

4.1 Setup of PDE-IBVP and Bayesian inverse problem

4.1.1 Setup of PDE-IBVP

We set the spatial domain 𝒟\mathcal{D} in (3.1) to be 𝒟=[0,1]2(C1C2)\mathcal{D}=[0,1]^{2}\setminus(C_{1}\cup C_{2}), where C1C_{1} and C2C_{2} represent two obstacles. See Figure 1(b). For the differential operator =κΔ+v\mathcal{L}=-\kappa\Delta+v\cdot\nabla in Section 3.1, we use a constant diffusivity κ=0.05\kappa=0.05. For the vector field vv we use the solution to the stationary Navier–Stokes equation

1ReΔv+q+vv=\displaystyle-\frac{1}{\textup{Re}}\Delta v+\nabla q+v\cdot\nabla v= 0in 𝒟\displaystyle 0\quad\text{in }\mathcal{D} (4.1)
v=\displaystyle\nabla\cdot v= 0in 𝒟\displaystyle 0\quad\text{in }\mathcal{D}
v=\displaystyle v= gon 𝒟,\displaystyle g\quad\text{on }\partial\mathcal{D},

for a given pressure field qq, Reynolds number Re=50\textup{Re}=50, and boundary data

g:\partial\mathcal{D}\to\mathbb{R}^{2},\quad g(x_{1},x_{2})=\begin{cases}(0,1)&x_{1}=0,\ 0<x_{2}<1\\ (0,-1)&x_{1}=1,\ 0<x_{2}<1\\ (0,0)&\text{otherwise},\end{cases}

see [3, Section 5]. In Figure 1(a), we plot a numerical approximation vv of the solution to (4.1). The numerical approximation satisfies vL=1\left\|v\right\|_{L^{\infty}}=1. The boundary data ensure that the inner product of the vector field vv with the exterior normal vanishes everywhere on the boundary 𝒟\partial\mathcal{D}. Given the Neumann boundary condition (3.1b), one can use this fact to prove the coercivity of the bilinear form associated to (3.1a); see e.g. [10, Section 3.2] for the proof of this statement for the time-independent version of (3.1a)–(3.1c). The coercivity of the bilinear form ensures the existence of a weak solution to all the PDE-IBVPs from Section 3.1. Given κ=0.05\kappa=0.05, the global Péclet number — see e.g. [25, Section 13.2, Eq. (13.19)] — associated to the PDE-IBVP has the value vL(𝒟)/(2κ)=10\left\|v\right\|_{L^{\infty}(\mathcal{D})}/(2\kappa)=10.

(a) Vector field vv in (3.1a) and (4.1). Colour bar shows value of v(x1,x2)2[0,1]\left\|v(x_{1},x_{2})\right\|_{2}\in[0,1].
(b) Location x(0)x^{(0)} of source ss in (3.1a) and (4.2a) (\blacktriangle), sensor locations (\blacksquare), and obstacles.
Figure 1: Plots of vector field, sensors, source term, and obstacles.

Next, we define

s(x)\displaystyle s(x)\coloneqq 12πLexp(12L|xx(0)|2),\displaystyle\frac{1}{2\pi L}\exp\left(-\frac{1}{2L}\left|x-x^{(0)}\right|^{2}\right), (4.2a)
θ(t)\displaystyle\theta^{\dagger}(t)\coloneqq {80t[0,0.2]80+20exp(1(116(t0.45)2)1)t(0.2,0.45]50+50exp(1(116(t0.45)2)1)t(0.45,0.7]50t(0.7,1.0],\displaystyle\begin{cases}80&t\in[0,0.2]\\ 80+20\exp\left(1-\left(1-16(t-0.45)^{2}\right)^{-1}\right)&t\in(0.2,0.45]\\ 50+50\exp\left(1-\left(1-16(t-0.45)^{2}\right)^{-1}\right)&t\in(0.45,0.7]\\ 50&t\in(0.7,1.0],\end{cases} (4.2b)

and set L=0.05L=0.05 and x(0)=(0.5,0.35)x^{(0)}=(0.5,0.35). We will use these functions as reference values for the right-hand side of (3.1a). We indicate the location parameter x(0)x^{(0)} of the spatial function ss by a red triangle in Figure 1(b) and plot a discrete approximation θ~\widetilde{\theta}^{\dagger} of θ\theta^{\dagger} in Figure 4(b).
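For concreteness, the following NumPy sketch implements our reading of the reference source profile (4.2a) and the piecewise definition (4.2b) of θ†, with the stated values L=0.05 and x(0)=(0.5,0.35); it is intended only as an illustration of the formulas above.

```python
import numpy as np

L, x0 = 0.05, np.array([0.5, 0.35])

def s(x):
    """Source profile s(x) in (4.2a) for points x with shape (..., 2)."""
    r2 = np.sum((np.asarray(x) - x0) ** 2, axis=-1)
    return np.exp(-r2 / (2.0 * L)) / (2.0 * np.pi * L)

def theta_dagger(t):
    """Reference parameter theta^dagger(t) in (4.2b), defined piecewise on [0, 1]."""
    t = np.asarray(t, dtype=float)
    bump = np.zeros_like(t)
    inside = np.abs(t - 0.45) < 0.25                      # where 1 - 16 (t - 0.45)^2 > 0
    bump[inside] = np.exp(1.0 - 1.0 / (1.0 - 16.0 * (t[inside] - 0.45) ** 2))
    return np.where(t <= 0.2, 80.0,
           np.where(t <= 0.45, 80.0 + 20.0 * bump,
           np.where(t <= 0.7, 50.0 + 50.0 * bump, 50.0)))

print(s(np.array([0.5, 0.35])))                    # peak value 1 / (2 pi L)
print(theta_dagger(np.array([0.0, 0.45, 1.0])))    # 80, 100, 50
```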

To complete the specification of the PDE-IBVP (3.1), it remains to state the reference initial condition bb^{\dagger} in (3.1c). To this end, we used a sample from the Gaussian distribution with constant mean mb50m_{b}\equiv 50 on 𝒟\mathcal{D} and covariance operator (ϵΔ+αI)2(-\epsilon\Delta+\alpha I)^{-2} for ϵ=4.5×103\epsilon=4.5\times 10^{-3} and α=2.2×101\alpha=2.2\times 10^{-1} equipped with Robin boundary conditions. This Gaussian distribution and the justification for its use are given in [4, Section 5.2]. In Figure 2, we show snapshots of the solution ww of (3.1a)–(3.1c) at times t{0,0.2,0.3,0.7,1.0}t\in\{0,0.2,0.3,0.7,1.0\}, for two samples from the Gaussian distribution of bb^{\dagger}. For the reference initial condition bb^{\dagger}, we used the initial condition shown in the bottom left subfigure of Figure 2.

Figure 2: Snapshots of the solution ww of (3.1a)–(3.1c) at t{0,0.2,0.3,0.7,1.0}t\in\{0,0.2,0.3,0.7,1.0\}, for two random initial conditions bb^{\dagger}. The reference initial condition bb^{\dagger} that was used for generating observations is shown in the bottom left subfigure.

4.1.2 Setup of Bayesian inverse problem

Consider two different observation operators

𝒪B:𝒰\displaystyle\mathcal{O}_{B}:\mathcal{U} n,u(u(x(i),t(i)))i=1n\displaystyle\to\mathbb{R}^{n},\quad u\mapsto(u(x^{(i)},t^{(i)}))_{i=1}^{n} (4.3a)
\mathcal{O}_{D}:\mathcal{U}\to\mathbb{R}^{n},\quad u\mapsto\left((\partial_{t}+\mathcal{L})u(x^{(i)},t^{(i)})\right)_{i=1}^{n}, (4.3b)

for \mathcal{L} given in Section 3.1. We refer to 𝒪B\mathcal{O}_{B} and 𝒪D\mathcal{O}_{D} as the ‘basic observation operator’ and ‘PDE observation operator’ respectively. To ensure that differences in the numerical results are not due to the choice of spatiotemporal locations at which observations are collected, we define 𝒪B\mathcal{O}_{B} and 𝒪D\mathcal{O}_{D} using the same spatiotemporal locations (x(i),ti)i=1n(x^{(i)},t_{i})_{i=1}^{n} belonging to the set

{(x1,x2)=(0.2m,0.2n)𝒟:m[2],n[4]}×{t=0.1+0.01𝒯:[80]0}.\left\{(x_{1},x_{2})=(0.2m,0.2n)\in\mathcal{D}\ :\ m\in[2],\ n\in[4]\right\}\times\left\{t=0.1+0.01\ell\in\mathcal{T}\ :\ \ell\in[80]_{0}\right\}. (4.4)

Thus, for both 𝒪B\mathcal{O}_{B} and 𝒪D\mathcal{O}_{D}, observations are vectors with n=8×81=648n=8\times 81=648 entries. In Figure 1(b), we indicate the spatial locations (x(i))i(x^{(i)})_{i} by black squares. We emphasise that, unlike ‘classical’ experimental design, our goal is not to describe how to choose the spatiotemporal locations of the measurements in order to maximise the information gain for each successive measurement. Instead, our goal is to show that by choosing a different type of measurement, one can mitigate the effect of model error on Bayesian inference. In this context, the key difference between 𝒪B\mathcal{O}_{B} and 𝒪D\mathcal{O}_{D} is that 𝒪Dδ\mathcal{O}_{D}\delta^{\dagger} vanishes because (t+)δ=0(\partial_{t}+\mathcal{L})\delta^{\dagger}=0 on 𝒟×𝒯\mathcal{D}\times\mathcal{T} by (3.3a), but 𝒪Bδ\mathcal{O}_{B}\delta^{\dagger} does not vanish in general.

For the prior μθ\mu_{\theta} on the unknown θ\theta^{\dagger}, we use the same prior as in [4, Section 5.2], namely a Gaussian probability measure on L2(𝒯)L^{2}(\mathcal{T}) with constant mean mθ65m_{\theta}\equiv 65 and covariance operator Σθ:L2(𝒯)L2(𝒯)\Sigma_{\theta}:L^{2}(\mathcal{T})\to L^{2}(\mathcal{T}) generated by a Matérn covariance kernel

Σθz(s)𝒯c(s,t)z(t)dt,c(s,t)σ2(1+3|st|)exp(3|st|),s,t𝒯\Sigma_{\theta}z(s)\coloneqq\int_{\mathcal{T}}c(s,t^{\prime})z(t^{\prime})\mathrm{d}t^{\prime},\quad c(s,t)\coloneqq\sigma^{2}\left(1+\frac{\sqrt{3}\left|s-t\right|}{\ell}\right)\exp\left(-\frac{\sqrt{3}\left|s-t\right|}{\ell}\right),\quad s,t\in\mathcal{T}

with σ=80\sigma=80 and length scale =0.17\ell=0.17.
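A minimal sketch of the corresponding discretised prior covariance matrix is given below; it assumes a uniform temporal grid and approximates the integral of c(ti,⋅) against the piecewise linear basis function ψj by lumping ψj to a point mass at tj, a simple quadrature choice made here for illustration only.

```python
import numpy as np

sigma, ell = 80.0, 0.17
T, NT = 1.0, 101
t = np.linspace(0.0, T, NT)
dt = t[1] - t[0]

def matern32(s, tp):
    """Matern-3/2 covariance kernel c(s, t') with variance sigma^2 and length scale ell."""
    r = np.sqrt(3.0) * np.abs(s - tp) / ell
    return sigma**2 * (1.0 + r) * np.exp(-r)

# (Sigma_theta)_{ij} = int c(t_i, t') psi_j(t') dt', approximated by lumping the piecewise
# linear basis function psi_j to a point mass of weight dt at t_j (weight dt/2 at endpoints).
w = np.full(NT, dt); w[0] = w[-1] = dt / 2.0
Sigma_theta = matern32(t[:, None], t[None, :]) * w[None, :]

print(Sigma_theta.shape, Sigma_theta[0, 0], Sigma_theta[50, 50])
```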

Fix 𝒪{𝒪B,𝒪D}\mathcal{O}\in\{\mathcal{O}_{B},\mathcal{O}_{D}\} and a realisation yy of the corresponding random variable YY in (4.9). In Section 3.1, we observed that θ(θ)\theta^{\prime}\mapsto\mathcal{M}(\theta^{\prime}) is a linear mapping. This implies that the approximate posterior μθy,A\mu^{y,\textup{A}}_{\theta} is Gaussian. Given that we assume ε𝒩(0,Σε)\varepsilon\sim\mathcal{N}(0,\Sigma_{\varepsilon}) and θμθ=𝒩(mθ,Σθ)\theta\sim\mu_{\theta}=\mathcal{N}(m_{\theta},\Sigma_{\theta}) to be statistically independent, it follows that the covariance of 𝒪θ+ε\mathcal{O}\mathcal{M}\theta+\varepsilon and θ\theta is 𝒪Σθ\mathcal{O}\mathcal{M}\Sigma_{\theta}, and the covariance matrix of 𝒪θ+ε\mathcal{O}\mathcal{M}\theta+\varepsilon is (𝒪)Σθ(𝒪)+Σε(\mathcal{O}\mathcal{M})\Sigma_{\theta}(\mathcal{O}\mathcal{M})^{\ast}+\Sigma_{\varepsilon}. We then compute the mean and covariance of μθy,A\mu^{y,\textup{A}}_{\theta} using the formula for conditioning of Gaussians, see e.g. [30, Theorem 6.20]:

mθy,A=\displaystyle m^{y,\textup{A}}_{\theta}= mθ+(𝒪Σθ)((𝒪)Σθ(𝒪)+Σε)1(y𝒪mθ)\displaystyle m_{\theta}+(\mathcal{O}\mathcal{M}\Sigma_{\theta})^{\ast}\left((\mathcal{O}\mathcal{M})\Sigma_{\theta}(\mathcal{O}\mathcal{M})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(y-\mathcal{O}\mathcal{M}m_{\theta}) (4.5a)
Σθy,A=\displaystyle\Sigma^{y,\textup{A}}_{\theta}= Σθ(𝒪Σθ)((𝒪)Σθ(𝒪)+Σε)1(𝒪Σθ).\displaystyle\Sigma_{\theta}-(\mathcal{O}\mathcal{M}\Sigma_{\theta})^{\ast}\left((\mathcal{O}\mathcal{M})\Sigma_{\theta}(\mathcal{O}\mathcal{M})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(\mathcal{O}\mathcal{M}\Sigma_{\theta}). (4.5b)

In Section 3.1, we observed that δ\delta^{\dagger} is constant with respect to θ\theta^{\prime}, and that θθ=θ+δ\theta^{\prime}\mapsto\mathcal{M}^{\dagger}\theta^{\prime}=\mathcal{M}\theta^{\prime}+\delta^{\dagger} is affine. Hence, the covariance of 𝒪θ+ε\mathcal{O}\mathcal{M}^{\dagger}\theta+\varepsilon and θμθ\theta\sim\mu_{\theta} is 𝒪Σθ\mathcal{O}\mathcal{M}\Sigma_{\theta}, the covariance matrix of 𝒪θ+ε\mathcal{O}\mathcal{M}^{\dagger}\theta+\varepsilon is (𝒪)Σθ(𝒪)+Σε(\mathcal{O}\mathcal{M})\Sigma_{\theta}(\mathcal{O}\mathcal{M})^{\ast}+\Sigma_{\varepsilon}, and μθy,\mu^{y,\dagger}_{\theta} is Gaussian, with mean and covariance

mθy,=\displaystyle m^{y,\dagger}_{\theta}= mθ+(𝒪Σθ)((𝒪)Σθ(𝒪)+Σε)1(y𝒪mθ𝒪δ)\displaystyle m_{\theta}+(\mathcal{O}\mathcal{M}\Sigma_{\theta})^{\ast}\left((\mathcal{O}\mathcal{M})\Sigma_{\theta}(\mathcal{O}\mathcal{M})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(y-\mathcal{O}\mathcal{M}m_{\theta}-\mathcal{O}\delta^{\dagger}) (4.6a)
Σθy,=\displaystyle\Sigma^{y,\dagger}_{\theta}= Σθ(𝒪Σθ)((𝒪)Σθ(𝒪)+Σε)1(𝒪Σθ).\displaystyle\Sigma_{\theta}-(\mathcal{O}\mathcal{M}\Sigma_{\theta})^{\ast}\left((\mathcal{O}\mathcal{M})\Sigma_{\theta}(\mathcal{O}\mathcal{M})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(\mathcal{O}\mathcal{M}\Sigma_{\theta}). (4.6b)

By comparing (4.5) and (4.6), we conclude that the posteriors μθy,A\mu^{y,\textup{A}}_{\theta} and μθy,\mu^{y,\dagger}_{\theta} differ only in their means, namely by the shift (𝒪Σθ)((𝒪)Σθ(𝒪)+Σε)1𝒪δ(\mathcal{O}\mathcal{M}\Sigma_{\theta})^{\ast}\left((\mathcal{O}\mathcal{M})\Sigma_{\theta}(\mathcal{O}\mathcal{M})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}\mathcal{O}\delta^{\dagger}. Thus, if one chooses 𝒪\mathcal{O} so that 𝒪δ=0\mathcal{O}\delta^{\dagger}=0, then μθy,\mu^{y,\dagger}_{\theta} and μθy,A\mu^{y,\textup{A}}_{\theta} agree. This result is consistent with Proposition 2.3. It has the interpretation that if one uses the observation operator 𝒪D\mathcal{O}_{D} to map states to data, then using the approximate model \mathcal{M} to map parameters to states yields exactly the same inference as when one uses the best model \mathcal{M}^{\dagger}.
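The conditioning formulas (4.5) and (4.6) translate directly into a few lines of linear algebra. The sketch below uses small random matrices as placeholders for the discretised operators (the sizes and values are purely illustrative) and checks that the two posterior means differ exactly by the shift discussed above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 4                                   # data dimension and (discretised) parameter dimension
OM = rng.standard_normal((n, p))              # placeholder for the composed linear map O M
Sig_th = np.eye(p)                            # prior covariance of theta
Sig_eps = 0.1 * np.eye(n)                     # noise covariance
m_th = np.zeros(p)
Odelta = rng.standard_normal(n)               # placeholder for O(delta^dagger)
y = OM @ rng.standard_normal(p) + Odelta      # synthetic (noise-free) data from the 'best' model

S = OM @ Sig_th @ OM.T + Sig_eps              # covariance of O M theta + eps
K = Sig_th @ OM.T @ np.linalg.inv(S)          # gain (O M Sigma_theta)^* S^{-1}

m_A = m_th + K @ (y - OM @ m_th)              # approximate posterior mean, cf. (4.5a)
m_best = m_th + K @ (y - OM @ m_th - Odelta)  # best posterior mean, cf. (4.6a)
Sig_post = Sig_th - K @ OM @ Sig_th           # common posterior covariance, cf. (4.5b)/(4.6b)

# The two means differ exactly by the shift K @ Odelta discussed in the text.
assert np.allclose(m_A - m_best, K @ Odelta)
print("shift norm:", np.linalg.norm(K @ Odelta))
```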

4.2 Implementation

To solve the IBVPs (3.1), (3.2), and (3.3), we implemented the space-time finite element method on an unstructured triangular mesh for the spatial domain 𝒟\mathcal{D} and a uniform mesh for the temporal domain 𝒯\mathcal{T}, using the hIPPYlib [31] and FEniCS/DOLFIN [21] software. We used Lagrange nodal basis functions to discretise functions. That is, given a finite element mesh of 𝒟\mathcal{D} with nodes (xi)i[ND](x_{i})_{i\in[N_{D}]} and given a finite element mesh of 𝒯\mathcal{T} with nodes (tj)j[NT](t_{j})_{j\in[N_{T}]}, we have a spatial basis (φi)i[ND]L2(𝒟)(\varphi_{i})_{i\in[N_{D}]}\subset L^{2}(\mathcal{D}) of functions and a temporal basis (ψj)j[NT]L2(𝒯)(\psi_{j})_{j\in[N_{T}]}\subset L^{2}(\mathcal{T}) such that φi(xk)=δik\varphi_{i}(x_{k})=\delta_{ik} for i,k[ND]i,k\in[N_{D}] and ψj(t)=δj\psi_{j}(t_{\ell})=\delta_{j\ell} for j,[NT]j,\ell\in[N_{T}]. This yields a space-time basis (dk)k[NDNT](d_{k})_{k\in[N_{D}N_{T}]}, where each dk=φi(k)ψj(k)d_{k}=\varphi_{i(k)}\psi_{j(k)} belongs to L2(𝒟×𝒯)L^{2}(\mathcal{D}\times\mathcal{T}). We used piecewise linear basis functions for all functions except for the advection vector field vv, where we used piecewise quadratic functions.

To compute the numerical approximation 𝐰NDNT\mathbf{w}\in\mathbb{R}^{N_{D}N_{T}} of the solution ww of the PDE-IBVP (3.1) for a given θ\theta^{\prime} on the right-hand side of (3.1a) and a given bb^{\dagger} in (3.1c), we used the weak formulation of (3.1) to form a matrix-vector equation 𝐀𝐰=𝐟\mathbf{A}\mathbf{w}=\mathbf{f}, where

(𝐀)k1,k2(t+)dk1,dk2L2(𝒟×𝒯),(𝐟)k2sθ,dk2L2(𝒟×𝒯),k1,k2[NDNT].(\mathbf{A})_{k_{1},k_{2}}\coloneqq\left\langle(\partial_{t}+\mathcal{L})d_{k_{1}},d_{k_{2}}\right\rangle_{L^{2}(\mathcal{D}\times\mathcal{T})},\quad(\mathbf{f})_{k_{2}}\coloneqq\left\langle s\theta^{\prime},d_{k_{2}}\right\rangle_{L^{2}(\mathcal{D}\times\mathcal{T})},\quad k_{1},k_{2}\in[N_{D}N_{T}]. (4.7)

We used quadrature to approximate the L2(𝒟×𝒯)L^{2}(\mathcal{D}\times\mathcal{T}) inner product. For the initial condition bb^{\dagger}, we discretised the covariance operator (ϵΔ+αI)2(-\epsilon\Delta+\alpha I)^{-2} described in Section 4.1.1 and sampled from the resulting finite-dimensional Gaussian distribution on ND\mathbb{R}^{N_{D}}.

For the spatial discretisation, 262 elements were used, yielding ND=159N_{D}=159 spatial degrees of freedom and a maximal spatial element diameter of Δx0.13\Delta x\approx 0.13. For the temporal discretisation, 100 elements of uniform diameter Δt=0.01\Delta t=0.01 were used, yielding NT=101N_{T}=101 temporal degrees of freedom. Since vv in (4.1) satisfies vL(𝒟)=1\left\|v\right\|_{L^{\infty}(\mathcal{D})}=1, the local Péclet number is vL(𝒟)Δx/(2κ)=1.3\left\|v\right\|_{L^{\infty}(\mathcal{D})}\Delta x/(2\kappa)=1.3; see e.g. [25, Section 13.2, Equation (13.22)].

The discretisations of mθy,Am^{y,\textup{A}}_{\theta}, mθy,m^{y,\dagger}_{\theta}, and Σθy,A=Σθy,\Sigma^{y,\textup{A}}_{\theta}=\Sigma^{y,\dagger}_{\theta} from (4.5) and (4.6) are

𝐦θy,A=\displaystyle\mathbf{m}^{y,\textup{A}}_{\theta}= 𝐦θ+(𝐎𝐌𝚺θ)((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ)\displaystyle\mathbf{m}_{\theta}+(\mathbf{OM\Sigma}_{\theta})^{\ast}((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon})^{-1}(y-\mathbf{OMm}_{\theta}) (4.8a)
𝐦θy,=\displaystyle\mathbf{m}^{y,\dagger}_{\theta}= 𝐦θ+(𝐎𝐌𝚺θ)((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ𝐎δ~)\displaystyle\mathbf{m}_{\theta}+(\mathbf{OM\Sigma}_{\theta})^{\ast}((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon})^{-1}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger}) (4.8b)
𝚺θy,A=𝚺θy,=\displaystyle\mathbf{\Sigma}^{y,\textup{A}}_{\theta}=\mathbf{\Sigma}^{y,\dagger}_{\theta}= 𝚺θ(𝐎𝐌𝚺θ)((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1𝐎𝐌𝚺θ\displaystyle\mathbf{\Sigma}_{\theta}-(\mathbf{OM\Sigma}_{\theta})^{\ast}((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon})^{-1}\mathbf{OM\Sigma}_{\theta}
=\displaystyle= 𝚺θ1/2(I(𝐎𝐌𝚺θ1/2)((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1𝐎𝐌𝚺θ1/2)𝚺θ1/2,\displaystyle\mathbf{\Sigma}_{\theta}^{1/2}\left(I-(\mathbf{OM}\mathbf{\Sigma}_{\theta}^{1/2})^{\ast}((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon})^{-1}\mathbf{OM}\mathbf{\Sigma}_{\theta}^{1/2}\right)\mathbf{\Sigma}_{\theta}^{1/2}, (4.8c)

where 𝚺θ1/2\mathbf{\Sigma}_{\theta}^{1/2} in (4.8c) denotes the symmetric square root of 𝚺θ\mathbf{\Sigma}_{\theta}. Since mθ65m_{\theta}\equiv 65, we set 𝐦θ=(65,,65)NT\mathbf{m}_{\theta}=(65,\ldots,65)\in\mathbb{R}^{N_{T}}. For the prior covariance matrix 𝚺θNT×NT\mathbf{\Sigma}_{\theta}\in\mathbb{R}^{N_{T}\times N_{T}}, we set (𝚺θ)i,j𝒯c(ti,t)ψj(t)dt(\mathbf{\Sigma}_{\theta})_{i,j}\coloneqq\int_{\mathcal{T}}c(t_{i},t^{\prime})\psi_{j}(t^{\prime})\mathrm{d}t^{\prime} and approximated the integral by quadrature. We discretised the approximate model \mathcal{M} as a matrix 𝐌NDNT×NT\mathbf{M}\in\mathbb{R}^{N_{D}N_{T}\times N_{T}}, where we computed the jj-th column of 𝐌\mathbf{M} for j[NT]j\in[N_{T}] by replacing θ\theta^{\prime} in the linear form associated to (3.1a) with the temporal basis function ψj\psi_{j} in the definition of the right-hand side 𝐟\mathbf{f}; see (4.7). We discretised the basic observation operator 𝒪B\mathcal{O}_{B} in (4.3a) and the PDE observation operator 𝒪D\mathcal{O}_{D} in (4.3b) as n×NDNTn\times N_{D}N_{T} matrices 𝐎B\mathbf{O}_{B} and 𝐎D\mathbf{O}_{D} respectively, where 𝐎B𝐮\mathbf{O}_{B}\mathbf{u} and 𝐎D𝐮\mathbf{O}_{D}\mathbf{u} return the values, at the locations given in (4.4), of the piecewise linear spatiotemporal functions represented by 𝐮\mathbf{u} and 𝐀𝐮\mathbf{Au} respectively. To evaluate functions off the finite element nodes, we used linear interpolation.
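One way to evaluate the square-root form in (4.8c) is via the symmetric square root of 𝚺θ computed from an eigendecomposition. The following sketch (again with small placeholder matrices, not the discretised operators above) checks that the two expressions in (4.8c) agree.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 3
OM = rng.standard_normal((n, p))                       # placeholder for O M
A_ = rng.standard_normal((p, p))
Sig_th = A_ @ A_.T + np.eye(p)                         # placeholder SPD prior covariance
Sig_eps = 0.05 * np.eye(n)

# Symmetric square root of the prior covariance via eigendecomposition.
lam, Q = np.linalg.eigh(Sig_th)
Sig_th_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

S = OM @ Sig_th @ OM.T + Sig_eps
# First expression in (4.8c).
Sig_post_1 = Sig_th - (OM @ Sig_th).T @ np.linalg.inv(S) @ (OM @ Sig_th)
# Square-root form: Sigma^{1/2} (I - (OM Sigma^{1/2})^T S^{-1} OM Sigma^{1/2}) Sigma^{1/2}.
B = OM @ Sig_th_half
Sig_post_2 = Sig_th_half @ (np.eye(p) - B.T @ np.linalg.inv(S) @ B) @ Sig_th_half

assert np.allclose(Sig_post_1, Sig_post_2)
print("max asymmetry of Sig_post_2:", np.abs(Sig_post_2 - Sig_post_2.T).max())
```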

It remains to state how we generated the data yy used in (4.8a) and (4.8b). Let θ~NT\widetilde{\theta}^{\dagger}\in\mathbb{R}^{N_{T}} denote the discretisation of the truth θ\theta^{\dagger} using the temporal basis (ψj)j(\psi_{j})_{j}. We solved the matrix-vector equation 𝐀δ~=0\mathbf{A}\widetilde{\delta}^{\dagger}=0 that corresponds to (3.3) for δ~NDNT\widetilde{\delta}^{\dagger}\in\mathbb{R}^{N_{D}N_{T}} and defined the discretisation of the true state to be 𝐮𝐌θ~+δ~\mathbf{u}^{\dagger}\coloneqq\mathbf{M}\widetilde{\theta}^{\dagger}+\widetilde{\delta}^{\dagger}. For each of the choices 𝐎=𝐎B\mathbf{O}=\mathbf{O}_{B} and 𝐎=𝐎D\mathbf{O}=\mathbf{O}_{D}, we generated data by sampling the random variable

Y=𝐎𝐮+ε,ε𝒩(0,Σε)Y=\mathbf{O}\mathbf{u}^{\dagger}+\varepsilon,\quad\varepsilon\sim\mathcal{N}(0,\Sigma_{\varepsilon}) (4.9)

for Σεσε2I\Sigma_{\varepsilon}\coloneqq\sigma_{\varepsilon}^{2}I, where we set the standard deviation σε\sigma_{\varepsilon} to be the median of the underlying signal 𝐎𝐮\mathbf{O}\mathbf{u}^{\dagger}, scaled by a prescribed factor 𝔰\mathfrak{s}; smaller values of 𝔰\mathfrak{s} correspond to a higher signal-to-noise ratio (SNR).

For our experiments, we considered a lower SNR and a higher SNR, where we set σε\sigma_{\varepsilon} to be the median of the underlying signal, scaled by 𝔰=0.1\mathfrak{s}=0.1 and 𝔰=0.02\mathfrak{s}=0.02 respectively. Our SNR regimes are reasonable, given that in [4, Section 5.3] the value σε=0.25\sigma_{\varepsilon}=0.25 is used for signals whose entries take values in the interval [51,54][51,54], for a corresponding 𝔰\mathfrak{s} that is at most 0.0040.004.

The minimum, median, and maximum of the signal 𝐎B𝐮\mathbf{O}_{B}\mathbf{u}^{\dagger} were 52.18, 56.63, and 68.53 respectively, which is consistent with the figures in the bottom row of Figure 2 and yields σε5.66\sigma_{\varepsilon}\approx 5.66 and σε1.13\sigma_{\varepsilon}\approx 1.13 for the lower SNR and higher SNR cases respectively. In contrast, the minimum, median, and maximum of the signal 𝐎D𝐮\mathbf{O}_{D}\mathbf{u}^{\dagger} were 0, 2×1062\times 10^{-6}, and 3.86×1033.86\times 10^{-3} respectively. This is consistent with the fact that 𝐎D𝐮\mathbf{O}_{D}\mathbf{u}^{\dagger} returns pointwise values of the function represented by the vector 𝐟\mathbf{f} in (4.7): the entries of 𝐟\mathbf{f} are given by integrals of the function s(x)s(x) that decreases exponentially as |xx(0)|2\left|x-x^{(0)}\right|^{2} increases, and the integration regions have space-time ‘volume’ proportional to (Δx)2Δt103(\Delta x)^{2}\Delta t\approx 10^{-3}. Thus, for 𝐎D\mathbf{O}_{D}, σε2×107\sigma_{\varepsilon}\approx 2\times 10^{-7} and σε4×108\sigma_{\varepsilon}\approx 4\times 10^{-8} in the lower SNR and higher SNR cases respectively.

4.3 Behaviour of discretised best posterior mean in small noise limit

In this section, we analyse the behaviour in the small noise limit of the discretised best posterior mean defined in (4.8b). By the preceding section, for 𝐎{𝐎B,𝐎D}\mathbf{O}\in\{\mathbf{O}_{B},\mathbf{O}_{D}\}, 𝐎𝐌\mathbf{OM} is an n×NTn\times N_{T} matrix with n=648n=648 and NT=101N_{T}=101. Hence, 𝐎𝐌\mathbf{OM} is not invertible, but it admits a unique Moore–Penrose pseudoinverse (𝐎𝐌)+(\mathbf{OM})^{+}. Multiplying both sides of (4.8b) by (𝐎𝐌)+(𝐎𝐌)(\mathbf{OM})^{+}(\mathbf{OM}), using the symmetry of 𝚺θ\mathbf{\Sigma}_{\theta}, and adding zero, we obtain

(𝐎𝐌)+(𝐎𝐌)𝐦θy,\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\mathbf{m}^{y,\dagger}_{\theta}
=\displaystyle= (𝐎𝐌)+(𝐎𝐌)(𝐦θ+(𝐎𝐌𝚺θ)((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ𝐎δ~))\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\left(\mathbf{m}_{\theta}+(\mathbf{OM}\mathbf{\Sigma}_{\theta})^{\ast}((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon})^{-1}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger})\right)
=\displaystyle= (𝐎𝐌)+(𝐎𝐌)𝐦θ\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\mathbf{m}_{\theta}
+(𝐎𝐌)+(𝐎𝐌)𝚺θ(𝐎𝐌)((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ𝐎δ~)\displaystyle+(\mathbf{OM})^{+}(\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon})^{-1}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger})
=\displaystyle= (𝐎𝐌)+(𝐎𝐌)𝐦θ\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\mathbf{m}_{\theta}
+(𝐎𝐌)+(Σε+(𝐎𝐌)𝚺θ(𝐎𝐌))((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ𝐎δ~)\displaystyle+(\mathbf{OM})^{+}\left(\Sigma_{\varepsilon}+(\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}\right)\left((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger})
(𝐎𝐌)+Σε((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ𝐎δ~)\displaystyle-(\mathbf{OM})^{+}\Sigma_{\varepsilon}\left((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger})
=\displaystyle= (𝐎𝐌)+(𝐎𝐌)𝐦θ+(𝐎𝐌)+(y𝐎𝐌𝐦θ𝐎δ~)\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\mathbf{m}_{\theta}+(\mathbf{OM})^{+}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger})
(𝐎𝐌)+Σε((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ𝐎δ~).\displaystyle-(\mathbf{OM})^{+}\Sigma_{\varepsilon}\left((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger}).

By the definition (4.9) of YY, the definition 𝐮𝐌θ~+δ~\mathbf{u}^{\dagger}\coloneqq\mathbf{M}\widetilde{\theta}^{\dagger}+\widetilde{\delta}^{\dagger}, and the linearity of 𝐎\mathbf{O},

Y𝐎𝐌𝐦θ𝐎δ~=(𝐎𝐌θ~+𝐎δ~+ε)𝐎𝐌𝐦θ𝐎δ~=𝐎𝐌(θ~𝐦θ)+ε,Y-\mathbf{O}\mathbf{M}\mathbf{m}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger}=(\mathbf{O}\mathbf{M}\widetilde{\theta}^{\dagger}+\mathbf{O}\widetilde{\delta}^{\dagger}+\varepsilon)-\mathbf{O}\mathbf{M}\mathbf{m}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger}=\mathbf{O}\mathbf{M}(\widetilde{\theta}^{\dagger}-\mathbf{m}_{\theta})+\varepsilon,

so (y𝐎𝐌𝐦θ𝐎δ~)=𝐎𝐌(θ~𝐦θ)+ε~(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger})=\mathbf{OM}(\widetilde{\theta}^{\dagger}-\mathbf{m}_{\theta})+\widetilde{\varepsilon}, where ε~\widetilde{\varepsilon} denotes the specific realisation of ε\varepsilon that yields yy. If R(𝐎𝐌)+Σε((𝐎𝐌)𝚺θ(𝐎𝐌)+Σε)1(y𝐎𝐌𝐦θ𝐎δ~)R\coloneqq(\mathbf{OM})^{+}\Sigma_{\varepsilon}\left((\mathbf{OM})\mathbf{\Sigma}_{\theta}(\mathbf{OM})^{\ast}+\Sigma_{\varepsilon}\right)^{-1}(y-\mathbf{OMm}_{\theta}-\mathbf{O}\widetilde{\delta}^{\dagger}) is negligible, then the preceding observations imply that

(𝐎𝐌)+(𝐎𝐌)𝐦θy,\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\mathbf{m}^{y,\dagger}_{\theta}\approx (𝐎𝐌)+(𝐎𝐌)𝐦θ+(𝐎𝐌)+(𝐎𝐌(θ~𝐦θ)+ε~)\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\mathbf{m}_{\theta}+(\mathbf{OM})^{+}(\mathbf{OM}(\widetilde{\theta}^{\dagger}-\mathbf{m}_{\theta})+\widetilde{\varepsilon})
=\displaystyle= (𝐎𝐌)+(𝐎𝐌)θ~+(𝐎𝐌)+ε~,\displaystyle(\mathbf{OM})^{+}(\mathbf{OM})\widetilde{\theta}^{\dagger}+(\mathbf{OM})^{+}\widetilde{\varepsilon},

where ε~\widetilde{\varepsilon} will be negligible with increasing probability as Σεop0\left\|\Sigma_{\varepsilon}\right\|_{\textup{op}}\to 0.

By the properties of the pseudoinverse, (𝐎𝐌)+(𝐎𝐌)(\mathbf{OM})^{+}(\mathbf{OM}) is the orthogonal projection onto the row space of 𝐎𝐌\mathbf{OM}, i.e. onto the orthogonal complement in NT\mathbb{R}^{N_{T}} of ker(𝐎𝐌)\textup{ker}(\mathbf{OM}). We discuss an important property of this projection for the case 𝐎𝐎D\mathbf{O}\leftarrow\mathbf{O}_{D}, since we will use this property in Section 4.4 below. Let 𝐯NT\mathbf{v}\in\mathbb{R}^{N_{T}} be the vector of coefficients of an arbitrary vv in the span of the temporal basis functions (ψj)j[NT](\psi_{j})_{j\in[N_{T}]}; see Section 4.2. Recall that 𝐌𝐯\mathbf{Mv} is an approximation of the weak solution of (3.2) with θv\theta^{\prime}\leftarrow v. As per the discussion at the end of Section 3, it follows from the definition of 𝒪D\mathcal{O}_{D} in (4.3) that 𝒪Dv=(s(x(i))v(ti))i[n]\mathcal{O}_{D}\mathcal{M}v=(s(x^{(i)})v(t_{i}))_{i\in[n]}, where (x(i),ti)i[n](x^{(i)},t_{i})_{i\in[n]} are given in (4.4). By the interpolation property of the temporal basis (ψj)j[NT](\psi_{j})_{j\in[N_{T}]}, it follows that (𝐎D𝐌𝐯)i=s(x(i))v(ti)=s(x(i))𝐯j(i)(\mathbf{O}_{D}\mathbf{Mv})_{i}=s(x^{(i)})v(t_{i})=s(x^{(i)})\mathbf{v}_{j(i)} for every i[n]i\in[n], where j(i)[NT]j(i)\in[N_{T}] denotes the index of the temporal node for which tj(i)=tit_{j(i)}=t_{i}.

We now use the preceding argument to identify the orthogonal complement of ker(𝐎D𝐌)\textup{ker}(\mathbf{O}_{D}\mathbf{M}). Denote the support of a vector 𝐯NT\mathbf{v}\in\mathbb{R}^{N_{T}} by supp(𝐯){j[NT]:𝐯j0}\textup{supp}(\mathbf{v})\coloneqq\{j\in[N_{T}]\ :\ \mathbf{v}_{j}\neq 0\}. Recall from Section 4.2 that Δt=0.01\Delta t=0.01 and denote the index set of observation times given in (4.4) by I{i[NT]:(i1)Δt{0.1+0.01:[80]0}}I\coloneqq\{i\in[N_{T}]\ :\ (i-1)\Delta t\in\{0.1+0.01\ell\ :\ \ell\in[80]_{0}\}\}. Since (x1,x2)=(0.4,0.4)(x_{1},x_{2})=(0.4,0.4) is one of the spatial locations x(i)x^{(i)} in (4.4) at which measurements are taken, and since (0.4,0.4)(0.4,0.4) lies in the support of the function s()s(\cdot) defined in (4.2a), the preceding paragraph implies that s(0.4,0.4)s(0.4,0.4) is strictly positive, and that 𝐎D𝐌𝐯=0\mathbf{O}_{D}\mathbf{Mv}=0 if and only if 𝐯i=0\mathbf{v}_{i}=0 for every iI[NT]i\in I\subset[N_{T}]. Thus,

ker(𝐎D𝐌)=span{𝐯NT:supp(𝐯)I}.\textup{ker}(\mathbf{O}_{D}\mathbf{M})^{\perp}=\textup{span}\left\{\mathbf{v}\in\mathbb{R}^{N_{T}}\ :\ \textup{supp}(\mathbf{v})\subseteq I\right\}.

Since (𝐎D𝐌)+(𝐎D𝐌)(\mathbf{O}_{D}\mathbf{M})^{+}(\mathbf{O}_{D}\mathbf{M}) is the orthogonal projection onto the orthogonal complement in NT\mathbb{R}^{N_{T}} of ker(𝐎D𝐌)\textup{ker}(\mathbf{O}_{D}\mathbf{M}), it follows that for every 𝐯NT\mathbf{v}\in\mathbb{R}^{N_{T}}, ((𝐎D𝐌)+(𝐎D𝐌)𝐯)j=𝐯j\left((\mathbf{O}_{D}\mathbf{M})^{+}(\mathbf{O}_{D}\mathbf{M})\mathbf{v}\right)_{j}=\mathbf{v}_{j} if jIj\in I and is zero otherwise. In particular, we expect that as Σεop\left\|\Sigma_{\varepsilon}\right\|_{\textup{op}} decreases to zero, the best posterior mean 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎D\mathbf{O}_{D} will more closely match the discretised true parameter θ~\widetilde{\theta}^{\dagger} on the index set II of observation times. An important corollary is that if 𝐎=𝐎D\mathbf{O}=\mathbf{O}_{D}, so that 𝐎Dδ~\mathbf{O}_{D}\widetilde{\delta}^{\dagger} is close to 00, then by comparing (4.8a) and (4.8b), the approximate posterior mean 𝐦θy,A\mathbf{m}^{y,\textup{A}}_{\theta} will also be close to the discretised true parameter θ~\widetilde{\theta}^{\dagger}.
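The projection property used above can be checked directly. The sketch below builds a matrix with the same structure as 𝐎D𝐌, in which each row picks out one temporal coefficient scaled by a positive value playing the role of s(x(i)), and verifies that the associated projector preserves the entries indexed by I and annihilates the others; the sizes and values are illustrative only.

```python
import numpy as np

NT = 10
I = np.arange(2, 9)                        # indices of observation times (illustrative)
s_vals = np.array([0.7, 1.3])              # illustrative positive values s(x^(1)), s(x^(2))

# Each observation row picks out one temporal coefficient, scaled by s(x^(i)),
# mimicking (O_D M v)_i = s(x^(i)) v(t_i).
rows = []
for s_x in s_vals:
    for j in I:
        e = np.zeros(NT)
        e[j] = s_x
        rows.append(e)
B = np.array(rows)                         # shape (len(s_vals) * len(I), NT)

P = np.linalg.pinv(B) @ B                  # orthogonal projector onto the row space of B
v = np.random.default_rng(5).standard_normal(NT)
Pv = P @ v

assert np.allclose(Pv[I], v[I])            # entries at observation times are preserved
mask = np.ones(NT, dtype=bool); mask[I] = False
assert np.allclose(Pv[mask], 0.0)          # all other entries are projected to zero
print(np.round(np.diag(P), 3))
```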

We close this section by observing that, in contrast to the case of 𝐎D\mathbf{O}_{D}, the entry (𝐎B𝐌𝐯)i(\mathbf{O}_{B}\mathbf{Mv})_{i} may be nonzero even if v(ti)=0v(t_{i})=0. This is because (𝐎B𝐌𝐯)i(\mathbf{O}_{B}\mathbf{Mv})_{i} approximates the value of the solution of (3.2) at some spatiotemporal point (x(i),ti)(x^{(i)},t_{i}). By the effects of diffusion and advection, (𝐎B𝐌𝐯)i(\mathbf{O}_{B}\mathbf{Mv})_{i} may be nonzero even if s(x(i))v(ti)=0s(x^{(i)})v(t_{i})=0, e.g. if at spatiotemporal points near (x(i),ti)(x^{(i)},t_{i}) the source term of the PDE is sufficiently large. Hence, we do not expect the best posterior mean for 𝐎B\mathbf{O}_{B} to agree with θ~\widetilde{\theta}^{\dagger} to the same extent that the best posterior mean for 𝐎D\mathbf{O}_{D} agrees with θ~\widetilde{\theta}^{\dagger} on the index set II of observation times.

4.4 Results

(a) 𝐦θy,A\mathbf{m}^{y,\textup{A}}_{\theta} and 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B}, lower SNR
(b) 𝐦θy,A\mathbf{m}^{y,\textup{A}}_{\theta} and 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B}, higher SNR
Figure 3: Comparison of approximate posterior mean 𝐦θy,A\mathbf{m}^{y,\textup{A}}_{\theta} (red solid line with red \blacktriangle markers) and best posterior mean 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} (blue dash-dotted line) from (4.8) for basic observation operator 𝐎B\mathbf{O}_{B}, against prior mean 𝐦θ\mathbf{m}_{\theta} (black dashed line) and truth θ~\widetilde{\theta}^{\dagger} (black solid line).

Figure 3(a) and Figure 3(b) show the approximate posterior mean 𝐦θy,A\mathbf{m}^{y,\textup{A}}_{\theta} and the best posterior mean 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for the basic observation operator 𝐎B\mathbf{O}_{B}, for the lower SNR and higher SNR respectively. By (4.8a) and (4.8b),

𝐦θy,A𝐦θy,=(𝐎B𝐌𝚺θ)((𝐎B𝐌)𝚺θ(𝐎B𝐌)+Σε)1(𝐎Bδ~)\mathbf{m}^{y,\textup{A}}_{\theta}-\mathbf{m}^{y,\dagger}_{\theta}=(\mathbf{O}_{B}\mathbf{M\Sigma}_{\theta})^{\ast}((\mathbf{O}_{B}\mathbf{M})\mathbf{\Sigma}_{\theta}(\mathbf{O}_{B}\mathbf{M})^{\ast}+\Sigma_{\varepsilon})^{-1}(\mathbf{O}_{B}\widetilde{\delta}^{\dagger})

is captured by the difference between the red and blue curves in Figure 3(a) and Figure 3(b). In both figures, the difference eventually decreases with time. This is because δ~\widetilde{\delta}^{\dagger} is the numerical solution of the PDE-IBVP (3.3), which has zero forcing and allows for flux across the boundary, and thus the values of δ~\widetilde{\delta}^{\dagger} decay with time. Nevertheless, the difference is large relative to the truth θ~\widetilde{\theta}^{\dagger} over the time interval 𝒯\mathcal{T}. These observations show the importance of the model error for this specific inverse problem.

By comparing Figure 3(a) and Figure 3(b), we observe that the difference 𝐦θy,A𝐦θy,\mathbf{m}^{y,\textup{A}}_{\theta}-\mathbf{m}^{y,\dagger}_{\theta} increases considerably after increasing the SNR. This occurs because the lower SNR corresponds to a larger σε\sigma_{\varepsilon} in (4.9): the larger Σε\Sigma_{\varepsilon} increases the matrix (𝐎B𝐌)𝚺θ(𝐎B𝐌)+Σε(\mathbf{O}_{B}\mathbf{M})\mathbf{\Sigma}_{\theta}(\mathbf{O}_{B}\mathbf{M})^{\ast}+\Sigma_{\varepsilon} whose inverse appears in the shift above, and thus damps the contribution of the model error term 𝐎Bδ~\mathbf{O}_{B}\widetilde{\delta}^{\dagger} to the difference of the means. This behaviour is consistent with what we expect, namely that ignoring model error can lead to larger errors in the posterior mean when the data are informative and the model error is large.

(a) 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} and 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎D\mathbf{O}_{D}, lower SNR
(b) 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} and 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎D\mathbf{O}_{D}, higher SNR
Figure 4: Comparison of best posterior means 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} from (4.8b) for basic observation operator 𝐎B\mathbf{O}_{B} (blue dash-dotted line) and PDE observation operator 𝐎D\mathbf{O}_{D} (green solid line with green \blacktriangle markers), against prior mean 𝐦θ\mathbf{m}_{\theta} (black dashed line) and truth θ~\widetilde{\theta}^{\dagger} (black solid line).

In Figure 4(a) (respectively, Figure 4(b)), we show for the lower (resp. higher) SNR setting the best posterior mean 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B}, the best posterior mean 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎D\mathbf{O}_{D}, and the truth θ~\widetilde{\theta}^{\dagger}. For 𝐎B\mathbf{O}_{B}, 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} fails to capture important features of θ~\widetilde{\theta}^{\dagger} for the lower SNR; it does better for the higher SNR, but under- and over-predictions remain visible. In contrast, 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} for 𝐎D\mathbf{O}_{D} matches the truth θ~\widetilde{\theta}^{\dagger} very closely over the observation time window 0.1t0.90.1\leq t\leq 0.9 for both the lower and higher SNR. The close agreement between the best posterior mean 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} and θ~\widetilde{\theta}^{\dagger} — and the different performance of the best posterior mean depending on whether 𝐎D\mathbf{O}_{D} or 𝐎B\mathbf{O}_{B} is used — are consistent with the discussion in Section 4.3.

(a) Diagonal entries of 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} and 𝐎D\mathbf{O}_{D}, lower SNR
(b) Diagonal entries of 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} and 𝐎D\mathbf{O}_{D}, higher SNR
Figure 5: Diagonal entries of prior covariance 𝚺θ\mathbf{\Sigma}_{\theta} (black dashed line) and posterior covariance 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} from (4.8c) for basic observation operator 𝐎B\mathbf{O}_{B} (blue dash-dotted line) and PDE observation operator 𝐎D\mathbf{O}_{D} (green solid line with green \blacktriangle markers) as a function of time, for lower SNR (left) and higher SNR (right).

In Figure 5(a) and Figure 5(b), we compare the diagonal entries of the best posterior covariance matrix 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} and for 𝐎D\mathbf{O}_{D}, in the lower SNR and higher SNR cases respectively. We observe that the diagonal of 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} for 𝐎D\mathbf{O}_{D} is close to zero over the time observation window, whereas the diagonal of 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} is not. By (4.8c), the difference in behaviour is due to the size of the noise covariance Σε\Sigma_{\varepsilon}. For 𝐎D\mathbf{O}_{D}, the noise standard deviation satisfies σε=O(107)\sigma_{\varepsilon}=O(10^{-7}) and σε=O(108)\sigma_{\varepsilon}=O(10^{-8}) in the lower and higher SNR cases respectively. Thus Σε\Sigma_{\varepsilon} is almost zero for 𝐎D\mathbf{O}_{D}, and 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} is close to zero as well. In contrast, for 𝐎B\mathbf{O}_{B}, σε5.66\sigma_{\varepsilon}\approx 5.66 and σε1.13\sigma_{\varepsilon}\approx 1.13 in the lower and higher SNR cases respectively, so that Σε\Sigma_{\varepsilon} is not negligible. The differences in 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} in the lower and higher SNR cases are also due to the size of Σε\Sigma_{\varepsilon}.

By comparing the curve for the approximate posterior mean 𝐦θy,A\mathbf{m}^{y,\textup{A}}_{\theta} for 𝐎B\mathbf{O}_{B} from Figure 3(a) with the curve for 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} for 𝐎B\mathbf{O}_{B} in Figure 5(a) and recalling that 𝚺θy,=𝚺θy,A\mathbf{\Sigma}^{y,\dagger}_{\theta}=\mathbf{\Sigma}^{y,\textup{A}}_{\theta}, we note that the error of 𝐦θy,A\mathbf{m}^{y,\textup{A}}_{\theta} with respect to θ~\widetilde{\theta}^{\dagger} is the largest over the subinterval 0.1t0.20.1\leq t\leq 0.2, but the marginal posterior variance is smallest over the same interval. In other words, the approximate posterior μθy,A\mu^{y,\textup{A}}_{\theta} is most confident over the region where the bias in the approximate posterior mean is the largest. The approximate posterior is said to be ‘overconfident’; see e.g. [1, Section 8] and the references therein for other examples of overconfidence in the presence of model error. By comparing Figure 4(b) and Figure 5(b), we see that the phenomenon of overconfidence becomes worse for higher SNRs, i.e. for more informative data.

We make some important remarks about the results in Figure 4 and Figure 5. These results show that if one uses an observation operator that satisfies 𝒪δ=0\mathcal{O}\delta^{\dagger}=0, then the best posterior μθy,\mu^{y,\dagger}_{\theta} and the approximate posterior μθy,A\mu^{y,\textup{A}}_{\theta} concentrate around the true parameter. We emphasise that these results are for the specific inverse problem presented in Section 4.1.2. Our explanation in Section 4.3 for why the best posterior mean 𝐦θy,\mathbf{m}^{y,\dagger}_{\theta} is close to θ\theta^{\dagger}, and our explanation in the preceding paragraphs for why the diagonal of 𝚺θy,\mathbf{\Sigma}^{y,\dagger}_{\theta} is close to zero over the time observation window, both rely on the size of the noise covariance Σε\Sigma_{\varepsilon}. This suggests that whether similar results can be obtained for other inverse problems may depend on factors other than the condition 𝒪δ=0\mathcal{O}\delta^{\dagger}=0 alone. Indeed, in the last paragraph of Section 3, we showed for a modification of our inverse problem that the condition 𝒪δ=0\mathcal{O}\delta^{\dagger}=0 can even lead to uninformative observations.

5 Conclusion

We considered Bayesian inverse problems with model error in the setting of finite-dimensional data and additive Gaussian noise, where the parameter-to-observable map is the composition of a linear observation operator with a possibly nonlinear model. Under the assumption that there exist a unique best model and a unique best parameter, the ‘model error’ is the difference between the ‘approximate’ model that one uses for inference and the best model. The approximate model and the best model generate the ‘approximate posterior’ and the ‘best posterior’ respectively.

We described some natural approaches for accounting for model error and the associated posteriors. We used the local Lipschitz stability property of posteriors with respect to perturbations in the likelihood to bound Kullback–Leibler divergences of the posterior associated to each approach, with respect to either the best posterior or the approximate posterior. These bounds have two important properties: first, they control the Kullback–Leibler divergence in terms of quantities that depend on the observation operator and the objects used to account for model error; and second, the terms in the bounds are finite under mild hypotheses, e.g. L1L^{1}-integrability of the misfits with respect to the prior.

Our bounds yield sufficient (respectively, necessary) conditions on the observation operator and on the objects used by each approach for the resulting posterior to agree with the best posterior (resp. to differ from the approximate posterior). These conditions provide guidelines for choosing observation operators that mitigate the effect of model error on Bayesian inference, given a chosen approach. A unifying theme of these conditions is the importance of the kernel of the observation operator.

Using a linear Gaussian inverse problem involving an advection-diffusion-reaction PDE, we illustrated the sufficient condition for the approximate posterior to agree with the best posterior. We then reported numerical results from simulations of this inverse problem under two different signal-to-noise ratio settings, and used these results to discuss the importance of the sufficient condition itself, of the observation noise, and of the information conveyed by the observation operator.

It is natural to ask whether it would be possible to combine classical optimal experimental design with the approach presented above. For example, in an ideal setting where one has access to multiple observation operators (𝒪i)iI(\mathcal{O}_{i})_{i\in I} such that δker(𝒪i)\delta^{\dagger}\in\textup{ker}(\mathcal{O}_{i}) for every iIi\in I, one could aim to optimise the information gain over the set (𝒪i)iI(\mathcal{O}_{i})_{i\in I}. Another potential avenue for future research would be to investigate the design of observation operators that satisfy the sufficient conditions; we expect this to be challenging and highly problem-specific. It would be of interest to study the setting of nonlinear observation operators and other approaches for accounting for model error.
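
As a purely illustrative first step towards this design question, the following sketch shows a naive finite-dimensional construction: the rows of a candidate observation operator are projected onto the orthogonal complement of a given discretised model-error vector, so that the resulting operator annihilates that vector. The candidate operator and the model-error vector are placeholders, and the construction does not by itself address the informativeness issues discussed above.

```python
# Illustrative sketch: build a linear observation operator whose kernel contains
# a prescribed (discretised) model-error vector delta. All quantities are placeholders.
import numpy as np

rng = np.random.default_rng(7)
n_state, n_obs = 100, 8

delta = rng.standard_normal(n_state)                  # placeholder discretised model error
O_candidate = rng.standard_normal((n_obs, n_state))   # placeholder candidate observation operator

# Orthogonal projector onto the complement of span{delta}; applying it to the rows
# of O_candidate yields an operator O_proj with O_proj @ delta = 0.
P = np.eye(n_state) - np.outer(delta, delta) / (delta @ delta)
O_proj = O_candidate @ P

print(np.linalg.norm(O_candidate @ delta))   # generally far from zero
print(np.linalg.norm(O_proj @ delta))        # numerically zero: delta lies in ker(O_proj)
```

In practice, such a construction would additionally have to preserve the informativeness of the observations, which is precisely the problem-specific difficulty mentioned above.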

Acknowledgements

The research of NC and HB has received funding from the ERC under the European Union’s Horizon 2020 Research and Innovation Programme — Grant Agreement ID 818473. The research of HCL has been partially funded by the Deutsche Forschungsgemeinschaft (DFG) — Project-ID 318763901 — SFB1294. The authors would like to thank Nicole Aretz for discussions on the numerics of PDEs and Giuseppe Carere for helpful comments. In addition, the authors would like to thank the three reviewers and review editor for their feedback.

Appendix A Technical lemmas

Let dd\in\mathbb{N} and let Mid×dM_{i}\in\mathbb{R}^{d\times d}, i=1,2i=1,2, be symmetric matrices; the definiteness assumptions on M1M_{1} and M2M_{2} are specified in each statement below. Recall the notation (1.1) for a matrix-weighted norm.

Lemma A.1.

Suppose that M1,M2d×dM_{1},M_{2}\in\mathbb{R}^{d\times d} are symmetric, that M1M_{1} is positive definite, and that M2M_{2} is nonnegative definite. Then M1+M2M_{1}+M_{2} is symmetric positive definite and M11(M1+M2)1M_{1}^{-1}-(M_{1}+M_{2})^{-1} is symmetric nonnegative definite. In addition,

\frac{\left|z\right|_{(M_{1}+M_{2})^{-1}}}{\left\|(M_{1}+M_{2})^{-1/2}M_{1}^{1/2}\right\|_{\textup{op}}}\leq\left|z\right|_{M_{1}^{-1}}\leq\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\left|z\right|_{(M_{1}+M_{2})^{-1}},\quad z\in\mathbb{R}^{d}. (A.1)
Proof.

By the assumptions on M1M_{1} and M2M_{2}, M1+M2M_{1}+M_{2} is positive definite. As the difference of two symmetric matrices, M11(M1+M2)1M_{1}^{-1}-(M_{1}+M_{2})^{-1} is symmetric. To show that it is nonnegative, we use the following rearrangement of the Woodbury formula, which we take from [8, Equation (3)]:

(M1+M2)1=\displaystyle(M_{1}+M_{2})^{-1}= M11M11M2(I+M11M2)1M11\displaystyle M_{1}^{-1}-M_{1}^{-1}M_{2}(I+M_{1}^{-1}M_{2})^{-1}M_{1}^{-1}
M11(M1+M2)1=\displaystyle\Longleftrightarrow M_{1}^{-1}-(M_{1}+M_{2})^{-1}= M11M2(I+M11M2)1M11.\displaystyle M_{1}^{-1}M_{2}(I+M_{1}^{-1}M_{2})^{-1}M_{1}^{-1}.

Invertibility of I+M11M2=M11(M1+M2)I+M_{1}^{-1}M_{2}=M_{1}^{-1}(M_{1}+M_{2}) follows from the invertibility of M11M_{1}^{-1} and M1+M2M_{1}+M_{2}. From the second equation above, M11(M1+M2)1M_{1}^{-1}-(M_{1}+M_{2})^{-1} inherits the nonnegative definiteness of M2M_{2}: by the push-through identity M2(I+M11M2)1=M21/2(I+M21/2M11M21/2)1M21/2M_{2}(I+M_{1}^{-1}M_{2})^{-1}=M_{2}^{1/2}(I+M_{2}^{1/2}M_{1}^{-1}M_{2}^{1/2})^{-1}M_{2}^{1/2}, the right-hand side of the second equation equals (M21/2M11)(I+M21/2M11M21/2)1(M21/2M11)(M_{2}^{1/2}M_{1}^{-1})^{\top}(I+M_{2}^{1/2}M_{1}^{-1}M_{2}^{1/2})^{-1}(M_{2}^{1/2}M_{1}^{-1}), which is nonnegative definite because I+M21/2M11M21/2I+M_{2}^{1/2}M_{1}^{-1}M_{2}^{1/2} is positive definite.

The first inequality of (A.1) follows by

|z|(M1+M2)1\displaystyle\left|z\right|_{(M_{1}+M_{2})^{-1}} =|(M1+M2)1/2z|=|(M1+M2)1/2M11/2M11/2z|\displaystyle=\left|(M_{1}+M_{2})^{-1/2}z\right|=\left|(M_{1}+M_{2})^{-1/2}M_{1}^{1/2}M_{1}^{-1/2}z\right|
(M1+M2)1/2M11/2op|M11/2z|\displaystyle\leq\left\|(M_{1}+M_{2})^{-1/2}M_{1}^{1/2}\right\|_{\textup{op}}\left|M_{1}^{-1/2}z\right|
=(M1+M2)1/2M11/2op|z|M11.\displaystyle=\left\|(M_{1}+M_{2})^{-1/2}M_{1}^{1/2}\right\|_{\textup{op}}\left|z\right|_{M_{1}^{-1}}.

The second inequality of (A.1) follows by

|z|M11\displaystyle\left|z\right|_{M_{1}^{-1}} =|M11/2z|=|M11/2(M1+M2)1/2(M1+M2)1/2z|\displaystyle=\left|M_{1}^{-1/2}z\right|=\left|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}(M_{1}+M_{2})^{-1/2}z\right|
M11/2(M1+M2)1/2op|(M1+M2)1/2z|\displaystyle\leq\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\left|(M_{1}+M_{2})^{-1/2}z\right|
=M11/2(M1+M2)1/2op|z|(M1+M2)1.\displaystyle=\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\left|z\right|_{(M_{1}+M_{2})^{-1}}.

This completes the proof of Lemma A.1. ∎
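
The conclusions of Lemma A.1 can be checked numerically; the following short sketch (purely illustrative) verifies, for randomly generated matrices, that M11(M1+M2)1M_{1}^{-1}-(M_{1}+M_{2})^{-1} has no negative eigenvalues and that the two-sided bound (A.1) holds for random vectors. The dimension, the random seed, and the construction of M1M_{1} and M2M_{2} are arbitrary.

```python
# Numerical sanity check of Lemma A.1 (illustrative only): for random symmetric
# M1 > 0 and M2 >= 0, the matrix M1^{-1} - (M1+M2)^{-1} is nonnegative definite
# and the two-sided bound (A.1) holds. Dimensions and seeds are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d = 5

A = rng.standard_normal((d, d))
M1 = A @ A.T + d * np.eye(d)                 # symmetric positive definite
B = rng.standard_normal((d, d - 2))
M2 = B @ B.T                                 # symmetric nonnegative definite (rank-deficient)

M1_inv = np.linalg.inv(M1)
S_inv = np.linalg.inv(M1 + M2)

# Nonnegative definiteness of M1^{-1} - (M1+M2)^{-1}
assert np.linalg.eigvalsh(M1_inv - S_inv).min() >= -1e-10

def sym_sqrt(W):
    """Symmetric square root of a symmetric nonnegative definite matrix."""
    vals, vecs = np.linalg.eigh(W)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def weighted_norm(z, W):
    """|z|_W = sqrt(z^T W z), cf. the notation (1.1)."""
    return float(np.sqrt(z @ W @ z))

c_lower = np.linalg.norm(np.linalg.inv(sym_sqrt(M1 + M2)) @ sym_sqrt(M1), 2)  # ||(M1+M2)^{-1/2} M1^{1/2}||_op
c_upper = np.linalg.norm(np.linalg.inv(sym_sqrt(M1)) @ sym_sqrt(M1 + M2), 2)  # ||M1^{-1/2} (M1+M2)^{1/2}||_op

for _ in range(100):
    z = rng.standard_normal(d)
    assert weighted_norm(z, S_inv) / c_lower <= weighted_norm(z, M1_inv) + 1e-12
    assert weighted_norm(z, M1_inv) <= c_upper * weighted_norm(z, S_inv) + 1e-12
```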

We will use the following bound in the proofs below.

Lemma A.2.

Let (E,dE)(E,d_{E}) be a metric space, μ𝒫(E)\mu\in\mathcal{P}(E), and ff and gg be d\mathbb{R}^{d}-valued measurable functions on EE. Let M1d×dM_{1}\in\mathbb{R}^{d\times d} be symmetric and positive definite and M2d×dM_{2}\in\mathbb{R}^{d\times d} be symmetric nonnegative definite. Then

|f|M112|f+g|M112Lμ1|g|M112Lμ11/2(|f+g|M112Lμ11/2+|f|M112Lμ11/2).\displaystyle\left\|\left|f\right|^{2}_{M_{1}^{-1}}-\left|f+g\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}\leq\left\|\left|g\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}\left(\left\|\left|f+g\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}+\left\|\left|f\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}\right). (A.2)

More generally,

|f|M112|f+g|(M1+M2)12Lμ1\displaystyle\left\|\left|f\right|^{2}_{M_{1}^{-1}}-\left|f+g\right|^{2}_{(M_{1}+M_{2})^{-1}}\right\|_{L^{1}_{\mu}}
\displaystyle\leq |g|M112Lμ11/2(M11/2(M1+M2)1/2op|f+g|(M1+M2)12Lμ11/2+|f|M112Lμ11/2)\displaystyle\left\|\left|g\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}\left(\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\cdot\left\|\left|f+g\right|^{2}_{(M_{1}+M_{2})^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}+\left\|\left|f\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}\right) (A.3)
+|f+g|M11(M1+M2)12Lμ1,\displaystyle+\left\|\left|f+g\right|_{M_{1}^{-1}-(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}},

where |f+g|M11(M1+M2)12|f+g|M112|f+g|(M1+M2)12\left|f+g\right|_{M_{1}^{-1}-(M_{1}+M_{2})^{-1}}^{2}\coloneqq\left|f+g\right|_{M_{1}^{-1}}^{2}-\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}. In addition,

|g|M112Lμ11/2|f|M112Lμ11/2+M11/2(M1+M2)1/2op|f+g|(M1+M2)12Lμ11/2\displaystyle\left\|\left|g\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}\leq\left\|\left|f\right|_{M_{1}^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}+\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\left\|\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2} (A.4)
|f+g|M11(M1+M2)12Lμ1(1+M11/2(M1+M2)1/2op)|f+g|(M1+M2)12Lμ1.\displaystyle\left\|\left|f+g\right|_{M_{1}^{-1}-(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}\leq\left(1+\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\right)\left\|\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}. (A.5)
Proof.

We claim that for arbitrary a,bda,b\in\mathbb{R}^{d}, symmetric positive definite M1d×dM_{1}\in\mathbb{R}^{d\times d}, and symmetric nonnegative definite M2d×dM_{2}\in\mathbb{R}^{d\times d},

|a|M112|a+b|(M1+M2)12=b,2a+bM11+|a+b|M11(M1+M2)12.\left|a\right|^{2}_{M_{1}^{-1}}-\left|a+b\right|^{2}_{(M_{1}+M_{2})^{-1}}=-\langle b,2a+b\rangle_{M_{1}^{-1}}+\left|a+b\right|^{2}_{M_{1}^{-1}-(M_{1}+M_{2})^{-1}}. (A.6)

Recall that (M1+M2)1(M_{1}+M_{2})^{-1} exists and is positive definite, by Lemma A.1.

Using the matrix-weighted inner product and norm notation from (1.1),

|a|M112|a+b|M112\displaystyle\left|a\right|^{2}_{M_{1}^{-1}}-\left|a+b\right|^{2}_{M_{1}^{-1}}
=\displaystyle= aM11a(a+b)M11(a+b)\displaystyle a^{\top}M_{1}^{-1}a-(a+b)^{\top}M_{1}^{-1}(a+b)
=\displaystyle= aM11a(aM11a+2aM11b+bM11b)\displaystyle a^{\top}M_{1}^{-1}a-(a^{\top}M_{1}^{-1}a+2a^{\top}M_{1}^{-1}b+b^{\top}M_{1}^{-1}b)
=\displaystyle= bM11(2a+b)\displaystyle-b^{\top}M_{1}^{-1}(2a+b)
=\displaystyle= b,2a+bM11.\displaystyle-\langle b,2a+b\rangle_{M_{1}^{-1}}.

This implies (A.6), since

|a|M112|a+b|M112+|a+b|M112|a+b|(M1+M2)12\displaystyle\left|a\right|_{M_{1}^{-1}}^{2}-\left|a+b\right|^{2}_{M_{1}^{-1}}+\left|a+b\right|^{2}_{M_{1}^{-1}}-\left|a+b\right|^{2}_{(M_{1}+M_{2})^{-1}}
=\displaystyle= b,2a+bM11+|a+b|M112|a+b|(M1+M2)12.\displaystyle-\langle b,2a+b\rangle_{M_{1}^{-1}}+\left|a+b\right|^{2}_{M_{1}^{-1}}-\left|a+b\right|^{2}_{(M_{1}+M_{2})^{-1}}.

Now let ff, gg, and μ\mu be as in the statement of the lemma. Then by (A.6) and the triangle inequality,

|f|M112|f+g|(M1+M2)12Lμ1=\displaystyle\left\|\left|f\right|_{M_{1}^{-1}}^{2}-\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}= g,2f+gM11+|f+g|M11(M1+M2)12Lμ1\displaystyle\left\|-\langle g,2f+g\rangle_{M_{1}^{-1}}+\left|f+g\right|_{M_{1}^{-1}-(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}
\displaystyle\leq g,2f+gM11Lμ1+|f+g|M11(M1+M2)12Lμ1.\displaystyle\left\|\langle g,2f+g\rangle_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}+\left\|\left|f+g\right|_{M_{1}^{-1}-(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}. (A.7)

Next,

g,2f+gM11Lμ1\displaystyle\left\|\langle g,2f+g\rangle_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}\leq |g|M11|2f+g|M11Lμ1\displaystyle\left\|\left|g\right|_{M_{1}^{-1}}\left|2f+g\right|_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}
\displaystyle\leq |g|M11Lμ2|2f+g|M11Lμ2\displaystyle\left\|\left|g\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}\left\|\left|2f+g\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}
\displaystyle\leq |g|M11Lμ2(|f+g|M11Lμ2+|f|M11Lμ2)\displaystyle\left\|\left|g\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}\left(\left\|\left|f+g\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}+\left\|\left|f\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}\right)
=\displaystyle= |g|M112Lμ11/2(|f+g|M112Lμ11/2+|f|M112Lμ11/2).\displaystyle\left\|\left|g\right|_{M_{1}^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}\left(\left\|\left|f+g\right|_{M_{1}^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}+\left\|\left|f\right|_{M_{1}^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}\right). (A.8)

The first and second inequalities follow by applying the Cauchy–Schwarz inequality with respect to ,M11\langle\cdot,\cdot\rangle_{M_{1}^{-1}} and Lμ1\left\|\cdot\right\|_{L^{1}_{\mu}} respectively. The third inequality and the equation follow from the Lμ2\left\|\cdot\right\|_{L^{2}_{\mu}}-triangle inequality and the definition of the LμpL^{p}_{\mu} norm for p=1,2p=1,2. By (A.8), we bound the first term on the right-hand side of (A.7). By using M20M_{2}\leftarrow 0, the second term on the right-hand side of (A.7) vanishes. Thus (A.2) follows from (A.7).

Next, we bound the first term inside the parentheses on the right-hand side of (A.8). By (A.1),

|f+g|M112Lμ11/2M11/2(M1+M2)1/2op|f+g|(M1+M2)12Lμ11/2.\left\|\left|f+g\right|_{M_{1}^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}\leq\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\left\|\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}.

Using the above bound yields (A.3).

To prove (A.4),

|g|M112Lμ11/2=\displaystyle\left\|\left|g\right|^{2}_{M_{1}^{-1}}\right\|_{L^{1}_{\mu}}^{1/2}= |f+(f+g)|M11Lμ2\displaystyle\left\|\left|-f+(f+g)\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}
\displaystyle\leq |f|M11Lμ2+|f+g|M11Lμ2\displaystyle\left\|\left|f\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}+\left\|\left|f+g\right|_{M_{1}^{-1}}\right\|_{L^{2}_{\mu}}
\displaystyle\leq |f|M112Lμ11/2+M11/2(M1+M2)1/2op|f+g|(M1+M2)12Lμ11/2\displaystyle\left\|\left|f\right|_{M_{1}^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}+\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}\left\|\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}^{1/2}

where the last inequality uses (A.1). To prove (A.5), observe that

|f+g|M11(M1+M2)12Lμ1=\displaystyle\left\|\left|f+g\right|_{M_{1}^{-1}-(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}= |f+g|M112|f+g|(M1+M2)12Lμ1\displaystyle\left\|\left|f+g\right|_{M_{1}^{-1}}^{2}-\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}
\displaystyle\leq |f+g|M112Lμ1+|f+g|(M1+M2)12Lμ1\displaystyle\left\|\left|f+g\right|_{M_{1}^{-1}}^{2}\right\|_{L^{1}_{\mu}}+\left\|\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}
\displaystyle\leq (M11/2(M1+M2)1/2op+1)|f+g|(M1+M2)12Lμ1\displaystyle\left(\left\|M_{1}^{-1/2}(M_{1}+M_{2})^{1/2}\right\|_{\textup{op}}+1\right)\left\|\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}\right\|_{L^{1}_{\mu}}

where the first and second inequalities follow from the triangle inequality and (A.1), respectively. This completes the proof of Lemma A.2. ∎
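
Since Lemma A.2 holds for an arbitrary probability measure μ\mu, it holds in particular for any empirical measure, which permits a simple Monte Carlo sanity check. The following sketch (purely illustrative) verifies (A.2) and (A.3) with the Lμ1L^{1}_{\mu} norms replaced by empirical means; the choices of ff, gg, M1M_{1}, M2M_{2}, and the samples are arbitrary and are not taken from the main text.

```python
# Monte Carlo sanity check of (A.2) and (A.3) in Lemma A.2 (illustrative only).
# The L^1_mu norms are replaced by empirical means over samples from mu; since the
# lemma holds for any probability measure, it holds exactly for the empirical one.
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 3, 10_000

A = rng.standard_normal((d, d))
M1 = A @ A.T + d * np.eye(d)            # symmetric positive definite
B = rng.standard_normal((d, 1))
M2 = B @ B.T                            # symmetric nonnegative definite

M1_inv = np.linalg.inv(M1)
S_inv = np.linalg.inv(M1 + M2)

def sym_sqrt(W):
    vals, vecs = np.linalg.eigh(W)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

C_op = np.linalg.norm(np.linalg.inv(sym_sqrt(M1)) @ sym_sqrt(M1 + M2), 2)

x = rng.standard_normal((n_samples, 1))                      # samples from mu
f = np.hstack([np.sin(x), x, x**2])                          # arbitrary f: E -> R^d
g = 0.3 * np.hstack([np.cos(x), np.ones_like(x), x])         # arbitrary g: E -> R^d

def sq_norm(v, W):
    """Row-wise squared weighted norms |v_i|_W^2."""
    return np.einsum("ij,jk,ik->i", v, W, v)

L1 = lambda arr: float(np.mean(np.abs(arr)))                 # empirical L^1_mu norm

# Check (A.2) (the case M2 = 0):
lhs_a2 = L1(sq_norm(f, M1_inv) - sq_norm(f + g, M1_inv))
rhs_a2 = np.sqrt(L1(sq_norm(g, M1_inv))) * (
    np.sqrt(L1(sq_norm(f + g, M1_inv))) + np.sqrt(L1(sq_norm(f, M1_inv))))
assert lhs_a2 <= rhs_a2 + 1e-12

# Check (A.3) (general nonnegative definite M2):
lhs_a3 = L1(sq_norm(f, M1_inv) - sq_norm(f + g, S_inv))
rhs_a3 = (np.sqrt(L1(sq_norm(g, M1_inv)))
          * (C_op * np.sqrt(L1(sq_norm(f + g, S_inv))) + np.sqrt(L1(sq_norm(f, M1_inv))))
          + L1(sq_norm(f + g, M1_inv) - sq_norm(f + g, S_inv)))
assert lhs_a3 <= rhs_a3 + 1e-12
print("Both bounds hold on the sampled data.")
```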

Appendix B Deferred proofs

B.1 Proof for approximate posterior

Lemma 2.2 bounds Φy,Φy,ALμθ1\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}} in terms of the observed model error 𝒪δ\mathcal{O}\delta^{\dagger}, under the hypothesis that Φy,Lμθ1\Phi^{y,\dagger}\in L^{1}_{\mu_{\theta}} and Φy,ALμθ1\Phi^{y,\textup{A}}\in L^{1}_{\mu_{\theta}}. The bound is given in (2.5):

Φy,Φy,ALμθ121/2|𝒪δ|Σε1Lμθ2(Φy,Lμθ11/2+Φy,ALμθ11/2).\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}\leq 2^{-1/2}\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}\right\|_{L^{2}_{\mu_{\theta}}}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right).
Proof of Lemma 2.2.

Recall from (2.2) and (2.4) that Φy,(θ)=12|y𝒪(θ)|Σε12\Phi^{y,\dagger}(\theta^{\prime})=\tfrac{1}{2}\left|y-\mathcal{O}\mathcal{M}^{\dagger}(\theta^{\prime})\right|_{\Sigma_{\varepsilon}^{-1}}^{2} and Φy,A(θ)=12|y𝒪(θ)|Σε12\Phi^{y,\textup{A}}(\theta^{\prime})=\tfrac{1}{2}\left|y-\mathcal{O}\mathcal{M}(\theta^{\prime})\right|_{\Sigma_{\varepsilon}^{-1}}^{2} respectively. By these definitions,

212|y𝒪|Σε12Lμθ11/2=2Φy,Lμθ11/2=2Φy,Lμθ11/2\left\|2\cdot\tfrac{1}{2}\left|y-\mathcal{O}\mathcal{M}^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}=\left\|2\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2} (B.1)

and similarly

|y𝒪|Σε12Lμθ11/2=2Φy,ALμθ11/2.\left\|\left|y-\mathcal{O}\mathcal{M}\right|^{2}_{\Sigma_{\varepsilon}^{-1}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}=\sqrt{2}\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}. (B.2)

Now set fy𝒪f\leftarrow y-\mathcal{O}\mathcal{M}^{\dagger}, g𝒪δg\leftarrow\mathcal{O}\delta^{\dagger}, μμθ\mu\leftarrow\mu_{\theta}, M1ΣεM_{1}\leftarrow\Sigma_{\varepsilon}, and M20M_{2}\leftarrow 0. By (2.3), we have f+g=y𝒪(δ)=y𝒪f+g=y-\mathcal{O}(\mathcal{M}^{\dagger}-\delta^{\dagger})=y-\mathcal{O}\mathcal{M}. Hence |f|M112=2Φy,\left|f\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\dagger} and |f+g|M112=2Φy,A\left|f+g\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\textup{A}}. Applying (A.2) from Lemma A.2 with these choices, and then (B.1) and (B.2), yields

2Φy,Φy,ALμθ1\displaystyle 2\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}\leq |𝒪δ|Σε12Lμθ11/2(|y𝒪|Σε12Lμθ11/2+|y𝒪|Σε12Lμθ11/2)\displaystyle\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\left(\left\|\left|y-\mathcal{O}\mathcal{M}^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\left|y-\mathcal{O}\mathcal{M}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)
=\displaystyle= |𝒪δ|Σε12Lμθ11/22(Φy,Lμθ11/2+Φy,ALμθ11/2).\displaystyle\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\sqrt{2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right).

This proves (2.5). The bound on |𝒪δ|Σε12Lμθ11/2\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2} in the statement of Lemma 2.2 follows from (A.4) in Lemma A.2 with the choices above:

|𝒪δ|Σε12Lμθ11/22(Φy,Lμθ11/2+Φy,ALμθ11/2).\left\|\left|\mathcal{O}\delta^{\dagger}\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\leq\sqrt{2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right).

This completes the proof of Lemma 2.2. ∎
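
For concreteness, the bound (2.5) can also be checked directly on a toy problem: the following sketch (purely illustrative) uses a linear approximate model, a constant observed model error, and an empirical prior measure given by samples. All objects in the sketch are placeholders and do not correspond to the inverse problem of Section 4.

```python
# Toy Monte Carlo check of the bound (2.5) (illustrative only). The approximate
# model is linear, the observed model error O(delta_dagger) is a constant vector,
# and the prior L^1 norms are replaced by empirical means over prior samples.
import numpy as np

rng = np.random.default_rng(3)
n_param, n_obs, n_samples = 4, 6, 20_000

OM = rng.standard_normal((n_obs, n_param))     # placeholder for O composed with the approximate model
obs_err = rng.standard_normal(n_obs)           # placeholder observed model error O(delta_dagger)
Sigma_eps_inv = np.eye(n_obs) / 0.5**2         # noise precision for sigma_eps = 0.5
y = rng.standard_normal(n_obs)                 # placeholder data

theta = rng.standard_normal((n_samples, n_param))          # prior samples

def sq_norm(v):
    """Row-wise |v_i|^2_{Sigma_eps^{-1}}."""
    return np.einsum("ij,jk,ik->i", v, Sigma_eps_inv, v)

res_A = y - theta @ OM.T                       # y - O M(theta)
res_best = res_A - obs_err                     # y - O M_dagger(theta), since M_dagger = M + delta_dagger
Phi_best, Phi_A = 0.5 * sq_norm(res_best), 0.5 * sq_norm(res_A)

lhs = np.mean(np.abs(Phi_best - Phi_A))
l2_factor = np.sqrt(obs_err @ Sigma_eps_inv @ obs_err)     # L^2 norm of |O delta_dagger|_{Sigma_eps^{-1}}
rhs = 2**-0.5 * l2_factor * (np.sqrt(np.mean(Phi_best)) + np.sqrt(np.mean(Phi_A)))
assert lhs <= rhs + 1e-9
print(lhs, "<=", rhs)
```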

B.2 Proofs for enhanced noise posterior

Lemma 2.4 bounds Φy,Φy,ELμθ1\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}} in terms of the observed model error 𝒪δ\mathcal{O}\delta^{\dagger} and the Gaussian model 𝒩(m𝔲,Σ𝔲)\mathcal{N}(m_{\mathfrak{u}},\Sigma_{\mathfrak{u}}) of δ(θ)\delta^{\dagger}(\theta^{\dagger}). In particular, if Φy,Lμθ1\Phi^{y,\dagger}\in L^{1}_{\mu_{\theta}} and Φy,ELμθ1\Phi^{y,\textup{E}}\in L^{1}_{\mu_{\theta}}, then for CEΣε1/2(Σε+𝒪Σ𝔲𝒪)1/2C_{\textup{E}}\coloneqq\left\|\Sigma_{\varepsilon}^{-1/2}(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{1/2}\right\| as in (2.7), the bound (2.8)

Φy,Φy,ELμθ1\displaystyle\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\leq 21/2|𝒪(δm𝔲)|Σε12Lμθ11/2(Φy,Lμθ11/2+CEΦy,ELμθ11/2)\displaystyle 2^{-1/2}\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\left(\left\|\Phi^{y,\dagger}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)
+21|y𝒪𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12Lμθ1,\displaystyle+2^{-1}\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}},

holds, and all terms on the right-hand side are finite.

Proof of Lemma 2.4.

In the same way that we proved (B.1), we can use the definition (2.6) of Φy,E\Phi^{y,\textup{E}} to prove

|y𝒪𝒪m𝔲|(Σε+𝒪Σ𝔲𝒪)12Lμθ11/2=2Φy,ELμθ11/2.\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|_{(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}. (B.3)

Let fy𝒪f\leftarrow y-\mathcal{O}\mathcal{M}^{\dagger}, g𝒪(δm𝔲)g\leftarrow\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}}), μμθ\mu\leftarrow\mu_{\theta}, M1ΣεM_{1}\leftarrow\Sigma_{\varepsilon}, and M2𝒪Σ𝔲𝒪M_{2}\leftarrow\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast}. Then

f+g=y𝒪(δ)𝒪m𝔲=y𝒪𝒪m𝔲,f+g=y-\mathcal{O}(\mathcal{M}^{\dagger}-\delta^{\dagger})-\mathcal{O}m_{\mathfrak{u}}=y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}},

|f|M112=2Φy,\left|f\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\dagger}, and |f+g|(M1+M2)12=2Φy,E\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}=2\Phi^{y,\textup{E}}. Applying (A.3) from Lemma A.2 yields

2Φy,Φy,ELμθ1\displaystyle 2\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\leq |𝒪(δm𝔲)|Σε12Lμθ11/2(CE2Φy,ELμθ11/2+2Φy,Lμθ11/2)\displaystyle\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\left(C_{\textup{E}}\left\|2\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|2\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right) (B.4)
+|y𝒪𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12Lμθ1.\displaystyle+\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}.

This proves (2.8). By (A.4) and CEC_{\textup{E}} as in (2.7),

|𝒪(δm𝔲)|Σε12Lμθ11/22Φy,Lμθ11/2+CE2Φy,ELμθ11/2.\left\|\left|\mathcal{O}(\delta^{\dagger}-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\leq\left\|2\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+C_{\textup{E}}\left\|2\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}.

By (A.5) from Lemma A.2,

|y𝒪𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12Lμθ1(CE+1)2Φy,ELμθ1.\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}\leq(C_{\textup{E}}+1)\left\|2\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}.

This completes the proof of Lemma 2.4. ∎

Lemma 2.6 asserts that if Φy,ALμθ1\Phi^{y,\textup{A}}\in L^{1}_{\mu_{\theta}} and Φy,ELμθ1\Phi^{y,\textup{E}}\in L^{1}_{\mu_{\theta}}, then for CEΣε1/2(Σε+𝒪Σ𝔲𝒪)1/2opC_{\textup{E}}\coloneqq\left\|\Sigma_{\varepsilon}^{-1/2}(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{1/2}\right\|_{\textup{op}} as in (2.7), the bound (2.11)

Φy,AΦy,ELμθ1\displaystyle\left\|\Phi^{y,\textup{A}}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\leq 21/2|𝒪m𝔲|Σε1(Φy,ALμθ11/2+CEΦy,ELμθ11/2)\displaystyle 2^{-1/2}\left|\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}}\left(\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right)
+21|y𝒪𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12Lμθ1,\displaystyle+2^{-1}\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}},

holds, and all terms on the right-hand side are finite.

Proof of Lemma 2.6.

Let fy𝒪f\leftarrow y-\mathcal{O}\mathcal{M}, g𝒪m𝔲g\leftarrow-\mathcal{O}m_{\mathfrak{u}}, M1ΣεM_{1}\leftarrow\Sigma_{\varepsilon}, M2𝒪Σ𝔲𝒪M_{2}\leftarrow\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast}, and μμθ\mu\leftarrow\mu_{\theta}. Then |f|M112=2Φy,A\left|f\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\textup{A}} and |f+g|(M1+M2)12=2Φy,E\left|f+g\right|_{(M_{1}+M_{2})^{-1}}^{2}=2\Phi^{y,\textup{E}}. Applying (A.3) from Lemma A.2 yields

2Φy,AΦy,ELμθ1\displaystyle 2\left\|\Phi^{y,\textup{A}}-\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}
\displaystyle\leq |𝒪m𝔲|Σε12(Φy,ALμθ11/2+CEΦy,ELμθ11/2)+|y𝒪𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12Lμθ1,\displaystyle\left|\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}}\sqrt{2}\left(\left\|\Phi^{y,\textup{A}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}+C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|^{1/2}_{L^{1}_{\mu_{\theta}}}\right)+\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}},

for CEΣε1/2(Σε+𝒪Σ𝔲𝒪)1/2opC_{\textup{E}}\coloneqq\left\|\Sigma_{\varepsilon}^{-1/2}(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{1/2}\right\|_{\textup{op}} as in (2.7). This proves (2.11). Next, (A.5) yields

|y𝒪𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12Lμθ12(CE+1)Φy,ELμθ1.\left\|\left|y-\mathcal{O}\mathcal{M}-\mathcal{O}m_{\mathfrak{u}}\right|^{2}_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}\right\|_{L^{1}_{\mu_{\theta}}}\leq 2\left(C_{\textup{E}}+1\right)\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}.

This completes the proof of Lemma 2.6. ∎

B.3 Proofs for joint posterior

In Lemma 2.9, one assumes that Φy,\Phi^{y,\dagger} as defined in (2.2) belongs to Lμθ1L^{1}_{\mu_{\theta}}, and also that Φy,JLμθμδ1\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}. The resulting bound (2.16) is

Φy,Φy,JLμθμδ121/2|𝒪(δδ)(θ)|Σε12L11/2(Φy,JLμθμδ11/2+Φy,Lμθ11/2).\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq 2^{-1/2}\left\|\left|\mathcal{O}(\delta^{\dagger}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right).
Proof of Lemma 2.9.

Let f(θ)y𝒪(θ)f(\theta)\leftarrow y-\mathcal{O}\mathcal{M}^{\dagger}(\theta), g(θ,δ)𝒪(δδ)(θ)g(\theta,\delta)\leftarrow\mathcal{O}(\delta^{\dagger}-\delta)(\theta), M1ΣεM_{1}\leftarrow\Sigma_{\varepsilon}, M20M_{2}\leftarrow 0, and μ\mu\leftarrow\mathbb{P}. Then |f(θ)|M112=2Φy,(θ,δ)\left|f(\theta)\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\dagger}(\theta,\delta) and |f(θ)+g(θ,δ)|M112=2Φy,J(θ,δ)\left|f(\theta)+g(\theta,\delta)\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\textup{J}}(\theta,\delta). Using the same argument that we used to prove (B.1), it follows from (2.12) that

|y𝒪(θ)𝒪δ(θ)|Σε12L11/2=2Φy,JLμθμδ11/2.\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)-\mathcal{O}\delta(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}. (B.5)

By (B.1) and (2.15) from Lemma 2.8,

|y𝒪(θ)|Σε12L11/2=2Φy,Lμθ11/2=2Φy,Lμθμδ11/2.\left\|\left|y-\mathcal{O}\mathcal{M}^{\dagger}(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}. (B.6)

Applying (A.2) from Lemma A.2 with these choices yields (2.16):

2Φy,Φy,JLμθμδ1\displaystyle 2\left\|\Phi^{y,\dagger}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq |𝒪(δδ)(θ)|Σε12L11/22(Φy,JLμθμδ11/2+Φy,Lμθ11/2).\left\|\left|\mathcal{O}(\delta^{\dagger}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\sqrt{2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right).

Next,

|𝒪(δδ)(θ)|Σε12L11/22(Φy,JLμθμδ11/2+Φy,Lμθ11/2),\left\|\left|\mathcal{O}(\delta^{\dagger}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\leq\sqrt{2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\dagger}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right),

follows from (B.5), (B.6), and (A.4) of Lemma A.2 with the choices stated above. ∎

In Lemma 2.11, one assumes that Φy,A\Phi^{y,\textup{A}} as defined in (2.4) belongs to Lμθ1L^{1}_{\mu_{\theta}}, and also that Φy,JLμθμδ1\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}. The resulting bound (2.17) is

Φy,AΦy,JLμθμδ121/2|𝒪δ(θ)|Σε12L11/2(Φy,JLμθμδ11/2+Φy,ALμθ11/2).\left\|\Phi^{y,\textup{A}}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq 2^{-1/2}\left\|\left|\mathcal{O}\delta(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right).

The proof of Lemma 2.11 is very similar to the proof of Lemma 2.9 above.

Proof of Lemma 2.11.

Let f(θ)y𝒪(θ)f(\theta)\leftarrow y-\mathcal{O}\mathcal{M}(\theta), g(θ,δ)𝒪(δ)(θ)g(\theta,\delta)\leftarrow\mathcal{O}(-\delta)(\theta), M1ΣεM_{1}\leftarrow\Sigma_{\varepsilon}, M20M_{2}\leftarrow 0, and μ\mu\leftarrow\mathbb{P}. Then |f(θ)+g(θ,δ)|M112=2Φy,J(θ,δ)\left|f(\theta)+g(\theta,\delta)\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\textup{J}}(\theta,\delta) and |f(θ)|M112=2Φy,A(θ,δ)\left|f(\theta)\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\textup{A}}(\theta,\delta). Analogously to (B.6), we have by (2.4) and (2.15) of Lemma 2.8 that

|y𝒪(θ)|Σε12L11/2=2Φy,ALμθ11/2=2Φy,ALμθμδ11/2.\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}. (B.7)

Applying (A.2) of Lemma A.2 with the choices above yields (2.17):

2Φy,AΦy,JLμθμδ1\displaystyle 2\left\|\Phi^{y,\textup{A}}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq |𝒪(δ)(θ)|Σε12L11/22(Φy,JLμθμδ11/2+Φy,ALμθ11/2).\displaystyle\left\|\left|\mathcal{O}(-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\sqrt{2}\left(\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}\right).

Next,

|𝒪(δ)(θ)|Σε12L11/22(Φy,ALμθ11/2+Φy,JLμθμδ11/2)\left\|\left|\mathcal{O}(-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\leq\sqrt{2}\left(\left\|\Phi^{y,\textup{A}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}\right)

follows from (B.5), (B.7), and (A.4) of Lemma A.2. ∎

Lemma B.1.

Let Φy,E\Phi^{y,\textup{E}} be defined as in (2.13) with =E\bullet=\textup{E}. If Φy,ELμθ1\Phi^{y,\textup{E}}\in L^{1}_{\mu_{\theta}} and Φy,JLμθμδ1\Phi^{y,\textup{J}}\in L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}, then

Φy,EΦy,JLμθμδ1\displaystyle\left\|\Phi^{y,\textup{E}}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq 21/2|𝒪(δ(θ)m𝔲)|Σε12L11/2(CEΦy,ELμθ11/2+Φy,JLμθμδ11/2)\displaystyle 2^{-1/2}\left\|\left|\mathcal{O}(\delta(\theta)-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\left(C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}\right) (B.8)
+21|y𝒪(θ)𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12L1.\displaystyle+2^{-1}\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)-\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}.

Furthermore,

|𝒪(δ(θ)m𝔲)|Σε12L11/22(CEΦy,ELμθ11/2+Φy,JLμθμδ11/2)\displaystyle\left\|\left|\mathcal{O}(\delta(\theta)-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\leq\sqrt{2}\left(C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}\right)
|y𝒪(θ)𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12L12(CE+1)Φy,ELμθ1.\displaystyle\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)-\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}\leq 2(C_{\textup{E}}+1)\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}.
Proof of Lemma B.1.

Let f(θ,δ)y𝒪((θ)+δ(θ))f(\theta,\delta)\leftarrow y-\mathcal{O}(\mathcal{M}(\theta)+\delta(\theta)), g(θ,δ)𝒪(δ(θ)m𝔲)g(\theta,\delta)\leftarrow\mathcal{O}(\delta(\theta)-m_{\mathfrak{u}}), M1ΣεM_{1}\leftarrow\Sigma_{\varepsilon}, M2𝒪Σ𝔲𝒪M_{2}\leftarrow\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast}, and μ\mu\leftarrow\mathbb{P}. Then |f(θ,δ)|M112=2Φy,J(θ,δ)\left|f(\theta,\delta)\right|_{M_{1}^{-1}}^{2}=2\Phi^{y,\textup{J}}(\theta,\delta) and |(f+g)(θ,δ)|(M1+M2)12=2Φy,E(θ,δ)\left|(f+g)(\theta,\delta)\right|_{(M_{1}+M_{2})^{-1}}^{2}=2\Phi^{y,\textup{E}}(\theta,\delta). Analogously to (B.6), we have by (2.6) and (2.15) of Lemma 2.8 that

|y𝒪(θ)𝒪m𝔲|(Σε+𝒪Σ𝔲𝒪)12L11/2=2Φy,ELμθ11/2=2Φy,ELμθμδ11/2.\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)-\mathcal{O}m_{\mathfrak{u}}\right|_{(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}=\sqrt{2}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}. (B.9)

Applying (A.3) of Lemma A.2 with the choices above yields (B.8):

2Φy,EΦy,JLμθμδ1\displaystyle 2\left\|\Phi^{y,\textup{E}}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\leq |𝒪(δ(θ)m𝔲)|Σε12L11/2(CE2Φy,ELμθ11/2+2Φy,JLμθμδ11/2)\displaystyle\left\|\left|\mathcal{O}(\delta(\theta)-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\left(C_{\textup{E}}\left\|2\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|2\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}\right)
+|y𝒪(θ)𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12L1.\displaystyle+\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)-\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}.

Next, we apply (A.4) and (A.5) from Lemma A.2, and use (B.9):

|𝒪(δ(θ)m𝔲)|Σε12L11/22(CEΦy,ELμθ11/2+Φy,JLμθμδ11/2)\displaystyle\left\|\left|\mathcal{O}(\delta(\theta)-m_{\mathfrak{u}})\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2}\leq\sqrt{2}\left(C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}^{1/2}+\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}\right)
|y𝒪(θ)𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12L12(CE+1)Φy,ELμθ1.\displaystyle\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)-\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}\leq 2(C_{\textup{E}}+1)\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}.

This completes the proof of Lemma B.1. ∎

Proposition B.2.

Let Φy,E\Phi^{y,\textup{E}} and Φy,J\Phi^{y,\textup{J}} be as in Lemma B.1, and let μθ,δy,E\mu^{y,\textup{E}}_{\theta,\delta} be as in (2.13) with =E\bullet=\textup{E}. Then

max{dKL(μθ,δy,Eμθ,δy,J),dKL(μθ,δy,Jμθ,δy,E)}C|𝒪(m𝔲δ)(θ)|Σε12L11/2,\max\{d_{\textup{KL}}(\mu^{y,\textup{E}}_{\theta,\delta}\|\mu^{y,\textup{J}}_{\theta,\delta}),d_{\textup{KL}}(\mu^{y,\textup{J}}_{\theta,\delta}\|\mu^{y,\textup{E}}_{\theta,\delta})\}\leq C\left\|\left|\mathcal{O}(m_{\mathfrak{u}}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}^{1/2},

where

C=exp(2Φy,JLμθμδ1+2Φy,ELμθ1)max{1,21/2(CEΦy,ELμθμδ11/2+Φy,JLμθμδ11/2)}.C=\exp\left(2\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}+2\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}}}\right)\max\left\{1,2^{1/2}\left(C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}\right)\right\}.
Proof of Proposition B.2.

Applying Theorem 2.1 and Lemma B.1 yields

max{dKL(μθ,δy,Eμθ,δy,J),dKL(μθ,δy,Jμθ,δy,E)}\displaystyle\max\{d_{\textup{KL}}(\mu^{y,\textup{E}}_{\theta,\delta}\|\mu^{y,\textup{J}}_{\theta,\delta}),d_{\textup{KL}}(\mu^{y,\textup{J}}_{\theta,\delta}\|\mu^{y,\textup{E}}_{\theta,\delta})\}
\displaystyle\leq 2exp(2Φy,JLμθμδ1+2Φy,ELμθμδ1)Φy,EΦy,JLμθμδ1\displaystyle 2\exp\left(2\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}+2\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\right)\left\|\Phi^{y,\textup{E}}-\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}
\displaystyle\leq exp(2Φy,JLμθμδ1+2Φy,ELμθμδ1)max{21/2(CEΦy,ELμθμδ11/2+Φy,JLμθμδ11/2),1}\displaystyle\exp\left(2\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}+2\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}\right)\max\{2^{1/2}\left(C_{\textup{E}}\left\|\Phi^{y,\textup{E}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}+\left\|\Phi^{y,\textup{J}}\right\|_{L^{1}_{\mu_{\theta}\otimes\mu_{\delta}}}^{1/2}\right),1\}
×(|𝒪(m𝔲δ)(θ)|Σε12L1+|y𝒪(θ)𝒪m𝔲|Σε1(Σε+𝒪Σ𝔲𝒪)12L1).\displaystyle\times\left(\left\|\left|\mathcal{O}(m_{\mathfrak{u}}-\delta)(\theta)\right|_{\Sigma_{\varepsilon}^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}+\left\|\left|y-\mathcal{O}\mathcal{M}(\theta)-\mathcal{O}m_{\mathfrak{u}}\right|_{\Sigma_{\varepsilon}^{-1}-(\Sigma_{\varepsilon}+\mathcal{O}\Sigma_{\mathfrak{u}}\mathcal{O}^{\ast})^{-1}}^{2}\right\|_{L^{1}_{\mathbb{P}}}\right).

Using the definition of CC given in the statement of Proposition B.2 and (2.15) from Lemma 2.8 completes the proof. ∎

References

  • [1] Assyr Abdulle and Giacomo Garegnani, Random time step probabilistic methods for uncertainty quantification in chaotic and geometric numerical integration, Stat. Comput. 30 (2020), no. 4, 907–932.
  • [2] Alen Alexanderian, Ruanui Nicholson, and Noémi Petra, Optimal design of large-scale nonlinear Bayesian inverse problems under model uncertainty, 2022, arXiv:2211.03952.
  • [3] Alen Alexanderian, Noemi Petra, Georg Stadler, and Omar Ghattas, A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized 0\ell_{0}-sparsification, SIAM J. Sci. Comput. 36 (2014), no. 5, A2122–A2148.
  • [4] Alen Alexanderian, Noémi Petra, Georg Stadler, and Isaac Sunseri, Optimal design of large-scale Bayesian linear inverse problems under reducible model uncertainty: Good to know what you don’t know, SIAM/ASA J. Uncertainty Quantif. 9 (2021), no. 1, 163–184.
  • [5] Jenný Brynjarsdóttir and Anthony O’Hagan, Learning about physical parameters: the importance of model discrepancy, Inverse Probl. 30 (2014), no. 11, 114007.
  • [6] Daniela Calvetti, Matthew Dunlop, Erkki Somersalo, and Andrew Stuart, Iterative updating of model error for Bayesian inversion, Inverse Probl. 34 (2018), no. 2, 025008.
  • [7] Lianghao Cao, Thomas O’Leary-Roseberry, Prashant K. Jha, J. Tinsley Oden, and Omar Ghattas, Residual-based error correction for neural operator accelerated infinite-dimensional Bayesian inverse problems, J. Comput. Phys. 486 (2023), 112104.
  • [8] Feng Cheng Chang, Inversion of a perturbed matrix, Appl. Math. Lett. 19 (2006), no. 2, 169–173.
  • [9] Duc-Lam Duong, Tapio Helin, and Jose Rodrigo Rojo-Garcia, Stability estimates for the expected utility in Bayesian optimal experimental design, Inverse Probl. 39 (2023), no. 12, 125008.
  • [10] Howard C. Elman, David J. Silvester, and Andrew J. Wathen, Finite elements and fast iterative solvers. With applications in incompressible fluid dynamics, 2nd ed., Numer. Math. Sci. Comput., Oxford: Oxford University Press, 2014.
  • [11] Avner Friedman, Partial differential equations, Dover Publications, Mineola, N.Y., 2008, Unabridged republication work originally published by Holt, Rinehart & Winston, New York, in 1969.
  • [12] Subhashis Ghosal and Aad van der Vaart, Fundamentals of Nonparametric Bayesian Inference, vol. 44, Cambridge: Cambridge University Press, 2017.
  • [13] Konstantinos Gourgoulias, Markos A. Katsoulakis, Luc Rey-Bellet, and Jie Wang, How biased is your model? Concentration inequalities, information and model bias, IEEE Trans. Inform. Theory 66 (2020), no. 5, 3079–3097.
  • [14] Eric Joseph Hall and Markos A. Katsoulakis, Robust information divergences for model-form uncertainty arising from sparse data in random PDE, SIAM/ASA J. Uncertain. Quantif. 6 (2018), no. 4, 1364–1394.
  • [15] Willem Hundsdorfer and Jan Verwer, Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations, Springer-Verlag Berlin, 2003.
  • [16] Jari Kaipio and Erkki Somersalo, Statistical and computational inverse problems, vol. 160, Springer Science & Business Media, 2005.
  • [17] Jari Kaipio and Erkki Somersalo, Statistical inverse problems: discretization, model reduction and inverse crimes, J. Comput. Appl. Math. 198 (2007), no. 2, 493–504.
  • [18] Marc C. Kennedy and Anthony O’Hagan, Bayesian calibration of computer models, J. R. Stat. Soc., Ser. B, Stat. Methodol. 63 (2001), no. 3, 425–464.
  • [19] Ville Kolehmainen, Tanja Tarvainen, Simon R. Arridge, and Jari P. Kaipio, Marginalization of uninteresting distributed parameters in inverse problems – application to diffuse optical tomography, Int. J. Uncertain. Quantif. 1 (2011), no. 1, 1–17.
  • [20] Karina Koval, Alen Alexanderian, and Georg Stadler, Optimal experimental design under irreducible uncertainty for linear inverse problems governed by PDEs, Inverse Probl. 36 (2020), no. 7, 075007.
  • [21] Anders Logg, Garth N. Wells, and Johan Hake, DOLFIN: a C++/Python finite element library, Automated Solution of Differential Equations by the Finite Element Method: The FEniCS Book (Anders Logg, Kent-Andre Mardal, and Garth Wells, eds.), Springer, Berlin, Heidelberg, 2012, pp. 173–225.
  • [22] Yvon Maday and Tommaso Taddei, Adaptive PBDW approach to state estimation: noisy observations; user-defined update spaces, SIAM J. Sci. Comput. 41 (2019), no. 4, B669–B693.
  • [23] Ruanui Nicholson, Noémi Petra, and Jari P Kaipio, Estimation of the Robin coefficient field in a Poisson problem with uncertain conductivity field, Inverse Probl. 34 (2018), no. 11, 115005.
  • [24] A. Nissinen, L. M. Heikkinen, V. Kolehmainen, and J. P. Kaipio, Compensation of errors due to discretization, domain truncation and unknown contact impedances in electrical impedance tomography, Meas. Sci. Technol. 20 (2009), no. 10, 105504.
  • [25] Alfio Quarteroni, Numerical models for differential problems, third ed., Texts in Applied Mathematics, Springer Cham, Berlin, Germany, 2017.
  • [26] Khachik Sargsyan, Xun Huan, and Habib N. Najm, Embedded model error representation for Bayesian model calibration, Int. J. Uncertain. Quantif. 9 (2019), no. 4, 365–394.
  • [27] Khachik Sargsyan, Habib N. Najm, and Roger Ghanem, On the statistical calibration of physical models, Int. J. Chem. Kinet. 47 (2015), no. 4, 246–276.
  • [28] Andrea Scarinci, Michael Fehler, and Youssef Marzouk, Bayesian inference under model misspecification using transport-Lagrangian distances: an application to seismic inversion, 2021, arXiv:2105.07027.
  • [29] Björn Sprungk, On the local Lipschitz stability of Bayesian inverse problems, Inverse Probl. 36 (2020), 055015.
  • [30] Andrew M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica 19 (2010), 451–559.
  • [31] Umberto Villa, Noemi Petra, and Omar Ghattas, HIPPYlib: An extensible software framework for large-scale inverse problems governed by PDEs: Part I: Deterministic inversion and linearized Bayesian inference, ACM Trans. Math. Softw. 47 (2021), no. 2, 1–34.
  • [32] Martin J. Wainwright, High-dimensional statistics, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 48, Cambridge University Press, Cambridge, 2019.