

A Data-Adaptive Prior for Bayesian Learning
of Kernels in Operators

Neil K. Chada (Department of Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh, EH14 4AS, UK; [email protected])    Quanjun Lang, Fei Lu, and Xiong Wang (Department of Mathematics, Johns Hopkins University, Baltimore, MD 21218, USA; [email protected], [email protected], [email protected])
Abstract

Kernels effectively represent nonlocal dependencies and are extensively employed in formulating operators between function spaces. Thus, learning kernels in operators from data is an inverse problem of general interest. Due to the nonlocal dependence, the inverse problem is often severely ill-posed with a data-dependent normal operator. Traditional Bayesian methods address the ill-posedness by a non-degenerate prior, which may result in an unstable posterior mean in the small noise regime, especially when data induces a perturbation in the null space of the normal operator. We propose a new data-adaptive Reproducing Kernel Hilbert Space (RKHS) prior, which ensures the stability of the posterior mean in the small noise regime. We analyze this adaptive prior and showcase its efficacy through applications on Toeplitz matrices and integral operators. Numerical experiments reveal that fixed non-degenerate priors can produce divergent posterior means under errors from discretization, model inaccuracies, partial observations, or erroneous noise assumptions. In contrast, our data-adaptive RKHS prior consistently yields convergent posterior means.

Keywords: Data-adaptive prior, kernels in operators, linear Bayesian inverse problem,
RKHS, Tikhonov regularization
 2020 Mathematics Subject Classification: 62F15, 47A52, 47B32

1 Introduction

Kernels are efficient in representing nonlocal or long-range dependence and interaction between high- or infinite-dimensional variables. Thus, they are widely used to design operators between function spaces, with numerous applications in machine learning such as kernel methods (e.g., [5, 9, 13, 26, 52, 47]) and operator learning (e.g., [29, 43]), in partial differential equations (PDEs) and stochastic processes such as nonlocal and fractional diffusions (e.g., [6, 17, 19, 56, 57]), and in multi-agent systems (e.g., [7, 39, 41, 46]).

Learning kernels in operators from data is an integral part of these applications. We consider the case when the operator depends on the kernel linearly, and the learning is a linear inverse problem. However, the inverse problem is often severely ill-posed, due to the nonlocal dependence and the presence of various perturbations resulting from noise in data, numerical error, or model error. To address the ill-posedness, a Bayesian approach or a variational approach with regularization is often used. In either approach, the major challenge is the selection of a prior or a regularization term since there is limited prior knowledge about the kernel.

This study examines the selection of the prior in a Bayesian approach. The common practice is to use a non-degenerate prior. However, we show that a non-degenerate prior may result in an unstable posterior mean in the small noise regime, especially when data induces a perturbation in the null space of the normal operator.

We propose a new data-adaptive Reproducing Kernel Hilbert Space (RKHS) prior, which ensures the stability of the posterior mean in the small noise regime. We analyze this adaptive prior and showcase its efficacy through applications on learning kernels in Toeplitz matrices and integral operators.

1.1 Problem setup

We aim to learn the kernel ϕ\phi in the operator Rϕ:𝕏𝕐R_{\phi}:\mathbb{X}\to\mathbb{Y} in the following model

Rϕ[u]+η+ξ=f,R_{\phi}[u]+\eta+\xi=f, (1.1)

by fitting the model to the data consisting of NN input-output pairs:

𝒟={(uk,fk)}k=1N,(uk,fk)𝕏×𝕐.\mathcal{D}=\{(u^{k},f^{k})\}_{k=1}^{N},\quad(u^{k},f^{k})\in\mathbb{X}\times\mathbb{Y}. (1.2)

Here 𝕏\mathbb{X} is a Banach space, 𝕐\mathbb{Y} is a Hilbert space, the measurement noise η\eta is a 𝕐\mathbb{Y}-valued white noise in the sense that 𝔼[η,f𝕐2]=ση2f,f𝕐\mathbb{E}[\langle{\eta,f}\rangle_{\mathbb{Y}}^{2}]=\sigma_{\eta}^{2}\langle{f,f}\rangle_{\mathbb{Y}} for any f𝕐f\in\mathbb{Y}. The term ξ\xi represents unknown model errors such as model misspecification or computational error due to incomplete data, and it may depend on the input data uu.

The operator RϕR_{\phi} depends non-locally on the kernel ϕ\phi in the form

Rϕ[u](y)=Ωϕ(yx)g[u](x,y)μ(dx),yΩ,R_{\phi}[u](y)=\int_{\Omega}\phi(y-x)g[u](x,y)\mu(dx),\quad\forall y\in\Omega, (1.3)

where (Ω,μ)(\Omega,\mu) is a measure space that can be either a domain in the Euclidean space with the Lebesgue measure or a discrete set with an atomic measure. For simplicity, we let 𝕐=L2(Ω,μ)\mathbb{Y}=L^{2}(\Omega,\mu) throughout this study. Here g[u]g[u] is a bounded (nonlinear) functional of uu and is assumed to be known. Note that the operator RϕR_{\phi} can be nonlinear in uu, but it depends linearly on ϕ\phi. Such operators are widely seen in PDEs, matrix operators, and image processing. Examples include the Toeplitz matrix, integral and nonlocal operators; see Sect.2.1. In these examples, there is often limited prior knowledge about the kernel.

The inverse problem of learning the kernel ϕ\phi is often ill-posed, due to the nonlocal dependence of the output data ff on the kernel. The ill-posedness, in terms of minimizing the negative log-likelihood of the data,

(ϕ)=1Nση21kNRϕ[uk]fk𝕐2=12ση2[G¯ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf],\mathcal{E}(\phi)=\frac{1}{N\sigma_{\eta}^{2}}\sum_{1\leq k\leq N}\|R_{\phi}[u^{k}]-f^{k}\|_{\mathbb{Y}}^{2}=\frac{1}{2\sigma_{\eta}^{2}}\left[\langle{{\mathcal{L}_{\overline{G}}}\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f}\right], (1.4)

appears as the instability of the minimizer G¯1ϕ𝒟{\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}} when it exists. Here Lρ2L^{2}_{\rho} is the space of square-integrable functions with respect to the measure ρ\rho, the normal operator G¯{\mathcal{L}_{\overline{G}}} is a trace-class operator, and ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} comes from the data; see Sect.2.3 for details. Thus, the ill-posedness is rooted in the unboundedness of G¯1{\mathcal{L}_{\overline{G}}}^{-1} and the perturbation of ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}}.

There are two common strategies to overcome the ill-posedness: a prior in a Bayesian approach and regularization in a variational approach; see, e.g., [53, 23, 3] and a sample of the large literature in Sect.1.3.

This study focuses on selecting a prior for the Bayesian approach. The major challenge is the limited prior information about the kernel and the need to overcome the ill-posedness caused by a data-dependent normal operator.

1.2 Proposed: a data-adaptive RKHS prior

Due to the lack of prior information about the kernel, one may use the default prior, a non-degenerate prior, e.g., a Gaussian distribution 𝒩(0,𝒬0)\mathcal{N}(0,\mathcal{Q}_{0}) with a nondegenerate covariance operator 𝒬0\mathcal{Q}_{0}, with the belief that it is a safe choice to ensure a well-defined posterior.

However, we show that the fixed non-degenerate prior has the risk of leading to a divergent posterior mean in the small noise limit. Specifically, the posterior mean,

μ1=(G¯+ση2𝒬0)1ϕ𝒟,\mu_{1}=({\mathcal{L}_{\overline{G}}}+\sigma_{\eta}^{2}\mathcal{Q}_{0})^{-1}\phi^{\scriptscriptstyle\mathcal{D}},

obtained from the prior 𝒩(0,𝒬0)\mathcal{N}(0,\mathcal{Q}_{0}) and the likelihood that yields (1.4), blows up as the noise level ση0\sigma_{\eta}\to 0 if ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} contains a perturbation in the null space of the normal operator G¯{\mathcal{L}_{\overline{G}}}; see Proposition 3.2. Such a perturbation can be caused by any of the following four types of errors in data or computation: (a) discretization error, (b) model error, (c) partial observations, and (d) a wrong noise assumption. Thus, a prior adaptive to G¯{\mathcal{L}_{\overline{G}}} and ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} is needed to remove this risk.

We propose a data-adaptive RKHS prior that ensures a stable posterior mean in the small noise regime. It is a Gaussian distribution 𝒩(0,λ1G¯)\mathcal{N}(0,\lambda_{*}^{-1}{\mathcal{L}_{\overline{G}}}) with the parameter λ\lambda_{*} selected adaptive to ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}}. We prove in Theorem 4.2 that it leads to a stable posterior whose mean

μ1𝒟=(G¯2+ση2λINull(G¯))1G¯ϕ𝒟,\mu_{1}^{\scriptscriptstyle\mathcal{D}}=({\mathcal{L}_{\overline{G}}}^{2}+\sigma_{\eta}^{2}\lambda_{*}I_{\mathrm{Null({\mathcal{L}_{\overline{G}}})}^{\perp}})^{-1}{\mathcal{L}_{\overline{G}}}\phi^{\scriptscriptstyle\mathcal{D}},

always has a small noise limit, and the small noise limit converges to the identifiable parts of the true kernel. Furthermore, we show that our prior outperforms the non-degenerate prior in producing a more accurate posterior mean and smaller posterior uncertainty in terms of the trace of the posterior covariance; see Sect.4.2. The prior is called an RKHS prior because its Cameron-Martin space G¯1/2Lρ2{\mathcal{L}_{\overline{G}}}^{1/2}L^{2}_{\rho} is the RKHS with a reproducing kernel G¯{\overline{G}} determined by the operator RϕR_{\phi} and the data. Importantly, the closure of this RKHS is the space in which the components of the true kernel can be identified from data.

We also study the computational practice of the data-adaptive prior and demonstrate it on the Toeplitz matrices and integral operators. We select the hyper-parameter by the L-curve method in [24]. Numerical tests show that while a fixed non-degenerate prior leads to divergent posterior means, the data-adaptive prior always attains stable posterior means with small noise limits; see Sect.5.

The outline of this study is as follows. Sect.2 presents the mathematical setup of this study, and shows the ill-posedness of the inverse problem through the variational approach. We introduce in Sect.3 the Bayesian approach and show the issue of a fixed non-degenerate prior. To solve the issue, we introduce a data-adaptive prior in Sect.4, and analyze its advantages. Sect.5 discusses the computational practice and demonstrates the advantage of the data-adaptive prior in numerical tests on Toeplitz matrices and integral operators. Finally, Sect.6 concludes our findings and provides future research directions.

1.3 Related work

Prior selection for Bayesian inverse problems.

We focus on prior selection and don’t consider the sampling of the posterior, which is a main topic for nonlinear Bayesian inverse problems with a given prior, see, e.g., [14, 28, 51, 53, 10]. Prior selection is an important topic in statistical Bayesian modeling, which dates back to [27]. This study provides a new class of problems where prior selection is crucial: learning kernels in operators, which is an ill-posed linear inverse problem. Our data-adaptive RKHS prior re-discovers the well-known Zellner’s g-prior in [59, 1] when the kernel is finite-dimensional and the basis functions are orthonormal. Importantly, the stable small noise limit in this study provides a new criterion for prior selection, a useful addition to many criteria studied in [4].

Regularization in a variational approach.

The prior is closely related to Tikhonov or ridge regularization in a variational approach. The likelihood function provides a loss function, and the prior often provides a regularization term. Various regularization terms have been studied, including the widely-used Euclidean norm (see, e.g., [21, 23, 24, 54]), the RKHS norm with a pre-specified reproducing kernel (see, e.g., [9, 3]), the total variation norm in [49], and the data-adaptive RKHS norm in [37, 38]. It remains open to compare these norms. Appealingly, the Bayesian approach provides probabilistic tools for analyzing regularizations. Thus, to better understand regularization, it is of interest to study the priors in a Bayesian approach.

Operator learning.

This study focuses on learning the kernels, not the operators. Thus, our focus differs from the focus of the widely-used kernel methods for operator approximation (see, e.g., [47, 13]) and the operator learning (see, e.g., [15, 16, 29, 36, 42, 43]). These methods aim to approximate the operator matching the input and output, not to identify the kernel in the operator.

Gaussian process and kernel-based regression.

Selection of the reproducing kernel is an important component in Gaussian process and kernel-based regression (see, e.g., [5, 9, 26, 52, 58]). Previous methods often tune the hyper-parameter of a pre-selected class of kernels. Our data-adaptive RKHS prior provides an automatic reproducing kernel, adaptive to data and the model, for these methods.

Learning interacting kernels and nonlocal kernels.

The learning of kernels in operators has been studied in the context of identifying the interaction kernels in interacting particle systems (e.g., [20, 25, 18, 30, 35, 39, 41, 40, 44, 45, 55]) and the nonlocal kernels in homogenization of PDEs (e.g., [37, 56, 57]). This study is the first to analyze the selection of a prior in a Bayesian approach.

2 The learning of kernels in operators

In this section, we discuss learning kernels in operators as a variational inverse problem that maximizes the likelihood. In this process, we introduce a few key concepts for the Bayesian approach in later sections: the function space for the kernel, the normal operator, the function space of identifiability, and a data-adaptive RKHS.

2.1 Examples

We first present a few examples of learning kernels in operators. Note that in these examples, there is little prior information about the kernel.

Example 2.1 (Kernels in Toeplitz matrices).

Consider the estimation of the kernel ϕ\phi in the Toeplitz matrix Rϕn×nR_{\phi}\in\mathbb{R}^{n\times n}, i.e., Rϕ(i,j)=ϕ(ij)R_{\phi}(i,j)=\phi(i-j) for all 1i,jn1\leq i,j\leq n, from measurement data {(uk,fk)n×n}k=1N\{(u^{k},f^{k})\in\mathbb{R}^{n}\times\mathbb{R}^{n}\}_{k=1}^{N} by fitting the data to the model

Rϕu+η+ξ(u)=f,η𝒩(0,ση2In),𝕏=𝕐=n,R_{\phi}u+\eta+\xi(u)=f,\quad\eta\sim\mathcal{N}(0,\sigma_{\eta}^{2}I_{n}),\quad\mathbb{X}=\mathbb{Y}=\mathbb{R}^{n}, (2.1)

where ξ(u)\xi(u) represents an unknown model error. We can write the Toeplitz matrix as an integral operator in the form of (1.3) with Ω={1,2,,n}\Omega=\{1,2,\ldots,n\}, g[u](x,y)=u(x)g[u](x,y)=u(x), and μ\mu being a uniform discrete measure on Ω\Omega. The kernel is a function ϕ:𝒮\phi:\mathcal{S}\to\mathbb{R}, identified with a vector in 2n1\mathbb{R}^{2n-1}, where 𝒮={rl}l=12n1\mathcal{S}=\{r_{l}\}_{l=1}^{2n-1} with rl=lnr_{l}=l-n.
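For concreteness, the following minimal sketch (not part of the paper; the kernel phi_true, the sizes n and N, and the helper make_toeplitz are illustrative assumptions) generates synthetic data pairs for this example by applying a Toeplitz operator and adding Gaussian observation noise.

```python
# A minimal sketch of Example 2.1 (assumed setup, not the paper's code):
# build R_phi with R_phi(i, j) = phi(i - j) and generate noisy data pairs.
import numpy as np

def make_toeplitz(phi_vals, n):
    """Toeplitz matrix R with R[i, j] = phi(i - j); phi_vals[l] stores phi(r_l)
    for the lags r_l = l - (n - 1), l = 0, ..., 2n - 2."""
    R = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            R[i, j] = phi_vals[(i - j) + n - 1]
    return R

rng = np.random.default_rng(0)
n, N, sigma_eta = 20, 50, 1e-2
lags = np.arange(-(n - 1), n)               # the 2n - 1 points of S
phi_true = np.exp(-np.abs(lags) / 3.0)      # an assumed true kernel
R = make_toeplitz(phi_true, n)

U = rng.standard_normal((N, n))             # inputs u^k
F = U @ R.T + sigma_eta * rng.standard_normal((N, n))   # f^k = R_phi u^k + eta^k
```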

Example 2.2 (Integral operator).

Let 𝕏=𝕐=L2([0,1])\mathbb{X}=\mathbb{Y}=L^{2}([0,1]). We aim to find a function ϕ:[1,1]\phi:[-1,1]\to\mathbb{R} fitting the dataset in (1.2) to the model (1.1) with an integral operator

Rϕ[u](y)=01ϕ(yx)u(x)𝑑x,y[0,1].R_{\phi}[u](y)=\int_{0}^{1}\phi(y-x)u(x)dx,\quad\forall y\in[0,1]. (2.2)

We assume that η\eta is a white noise, that is, 𝔼[η(y)η(y)]=δ(yy)\mathbb{E}[\eta(y)\eta(y^{\prime})]=\delta(y^{\prime}-y) for any y,y[0,1]y,y^{\prime}\in[0,1]. In the form of the operator in (1.3), we have Ω=[0,1]\Omega=[0,1], g[u](x,y)=u(x)g[u](x,y)=u(x), and μ\mu being the Lebesgue measure. This operator is an infinite-dimensional version of the Toeplitz matrix.
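A simple way to work with this operator numerically is to discretize the integral by quadrature; the sketch below (my own discretization choices: a uniform grid and a Riemann sum, with a hypothetical kernel and input) applies (2.2) on a grid.

```python
# A minimal sketch (assumed uniform grid and Riemann-sum quadrature) of the
# integral operator (2.2): R_phi[u](y) = int_0^1 phi(y - x) u(x) dx.
import numpy as np

m = 200                                   # grid size, an arbitrary choice
x = np.linspace(0.0, 1.0, m, endpoint=False)
dx = 1.0 / m

def apply_R(phi, u):
    """(R_phi u)(y_i) ~= sum_j phi(y_i - x_j) u(x_j) dx on the grid."""
    Y, X = np.meshgrid(x, x, indexing="ij")   # Y[i, j] = y_i, X[i, j] = x_j
    return (phi(Y - X) @ u) * dx

phi = lambda r: np.exp(-5.0 * r**2)       # a hypothetical kernel
u = np.sin(2 * np.pi * x)
f_clean = apply_R(phi, u)                 # white noise eta would be added to form f
```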

Example 2.3 (Nonlocal operator).

Suppose that we want to estimate a kernel ϕ:d\phi:\mathbb{R}^{d}\to\mathbb{R} in a model (1.1) with a nonlocal operator

Rϕ[u](y)=Ωϕ(yx)[u(y)u(x)]𝑑x,yΩd,R_{\phi}[u](y)=\int_{\Omega}\phi(y-x)[u(y)-u(x)]dx,\quad\forall y\in\Omega\subset\mathbb{R}^{d},

from a given data set as in (1.2) with 𝕏=L2(Ω)\mathbb{X}=L^{2}(\Omega) and 𝕐=L2(Ω)\mathbb{Y}=L^{2}(\Omega), where Ω\Omega is a bounded set. Such nonlocal operators arise in [19, 57, 37]. Here η\eta is a white noise in the sense that 𝔼[η(y)η(y)]=δ(yy)\mathbb{E}[\eta(y)\eta(y^{\prime})]=\delta(y-y^{\prime}) for any y,yΩy,y^{\prime}\in\Omega. This example corresponds to (1.3) with g[u](x,y)=u(y)u(x)g[u](x,y)=u(y)-u(x). Note that even the support of the kernel ϕ\phi is unknown.

Example 2.4 (Interaction operator).

Let 𝕏=C01()\mathbb{X}=C^{1}_{0}(\mathbb{R}) and 𝕐=L2()\mathbb{Y}=L^{2}(\mathbb{R}) and consider the problem of estimating the interaction kernel ϕ:\phi:\mathbb{R}\to\mathbb{R} in the nonlinear operator

Rϕ[u](y)=ϕ(yx)[u(y)u(x)+u(x)u(y)]𝑑x,y,R_{\phi}[u](y)=\int_{\mathbb{R}}\phi(y-x)[u^{\prime}(y)u(x)+u^{\prime}(x)u(y)]dx,\,\quad\forall y\in\mathbb{R},

by fitting the dataset in (1.2) to the model (1.1). This nonlinear operator corresponds to (1.3) with g[u](x,y)=u(y)u(x)+u(x)u(y)g[u](x,y)=u^{\prime}(y)u(x)+u^{\prime}(x)u(y). It comes from the aggregation operator Rϕ[u]=[u(Φu)]R_{\phi}[u]=\nabla\cdot[u\nabla(\Phi*u)] in the mean-field equation of interacting particles (see, e.g., [7, 30]).

2.2 Variational approach

To identify the kernel, the variational approach seeks a maximal likelihood estimator:

ϕ^=argminϕ(ϕ), where (ϕ)=1Nση21kNRϕ[uk]fk𝕐2,\widehat{\phi}=\operatorname*{arg\,min}_{\phi\in\mathcal{H}}\mathcal{E}(\phi),\quad\text{ where }\mathcal{E}(\phi)=\frac{1}{N\sigma_{\eta}^{2}}\sum_{1\leq k\leq N}\|R_{\phi}[u^{k}]-f^{k}\|_{\mathbb{Y}}^{2}, (2.3)

over a hypothesis space \mathcal{H}, where the loss function (ϕ)\mathcal{E}(\phi) is the negative log-likelihood of the data (1.2) under the assumption that the noise η\eta in (1.1) is white.

The first step is to find a proper function space for ϕ\phi, in which one can further specify the hypothesis space \mathcal{H}. Clearly, given a data set, we can only identify the kernel where the data provides information. Examples 2.1–2.4 show that the support of the kernel ϕ\phi is yet to be extracted from data.

We set the function space for learning ϕ\phi to be L2(𝒮,ρ)L^{2}(\mathcal{S},\rho) with 𝒮={xy:x,yΩ}\mathcal{S}=\{x-y:x,y\in\Omega\}, where ρ\rho is an empirical probability measure quantifying how the data explores the kernel:

ρ(dr)=1ZN1kNΩΩδ(yxr)|g[uk](x,y)|μ(dx)μ(dy),r𝒮.\displaystyle\rho(dr)=\frac{1}{ZN}\sum_{1\leq k\leq N}\int_{\Omega}\int_{\Omega}\delta(y-x-r)\left|g[u^{k}](x,y)\right|\mu(dx)\mu(dy),\quad r\in\mathcal{S}. (2.4)

Here ZZ is the normalizing constant and δ\delta is the Dirac delta function. We call ρ\rho an exploration measure. Its support is the region in which the learning process ought to work, and outside of which, we have limited information from the data to learn the function ϕ\phi. Thus, we can restrict 𝒮\mathcal{S} to be the support of ρ\rho, and we denote Lρ2:=L2(𝒮,ρ)L^{2}_{\rho}:=L^{2}(\mathcal{S},\rho) for short. In other words, the exploration measure is a generalization of the measure ρX\rho_{X} in nonparametric regression that estimates Y=f(X)Y=f(X) from data {(xi,yi)}\{(x_{i},y_{i})\}, where ρX\rho_{X} is the distribution of XX and the data are samples from the joint distribution of (X,Y)(X,Y).
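For the discrete setting of Example 2.1, the exploration measure (2.4) reduces to a weighted count of how often each lag r is probed by the data. The sketch below (an illustrative implementation with synthetic inputs, not the paper's code) computes this empirical measure.

```python
# A minimal sketch of the exploration measure (2.4) for Example 2.1, where
# g[u](x, y) = u(x) and mu is the uniform measure on Omega = {1, ..., n};
# rho(r) is proportional to sum_k sum_{y - x = r} |u^k(x)|.
import numpy as np

def exploration_measure(U):
    N, n = U.shape
    lags = np.arange(-(n - 1), n)         # S = {y - x : x, y in Omega}
    rho = np.zeros(lags.size)
    abs_col_sums = np.abs(U).sum(axis=0)  # sum_k |u^k(x)| for each x
    for l, r in enumerate(lags):
        for x_idx in range(n):
            if 0 <= x_idx + r < n:        # y = x + r must lie in Omega
                rho[l] += abs_col_sums[x_idx]
    return lags, rho / rho.sum()          # normalize to a probability vector

rng = np.random.default_rng(1)
U = rng.standard_normal((50, 20))         # N = 50 inputs of length n = 20
lags, rho = exploration_measure(U)        # supp(rho): where phi can be learned
```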

The next step is to find the minimizer of the loss function. Note that (ϕ)\mathcal{E}(\phi) is quadratic in ϕ\phi since the operator RϕR_{\phi} depends linearly on ϕ\phi. We get the minimizer by solving the zero of the (Fréchet) derivative of the loss function. To compute its derivative, we first introduce a bilinear form ,\langle\langle{\cdot,\cdot}\rangle\rangle: ϕ,ψLρ2\forall\phi,\psi\in L^{2}_{\rho},

ϕ,ψ=\displaystyle\langle\langle{\phi,\psi}\rangle\rangle= 1N1kNRϕ[uk],Rψ[uk]𝕐,\displaystyle\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\phi}[u^{k}],R_{\psi}[u^{k}]}\rangle_{\mathbb{Y}}, (2.5)
=\displaystyle= 1N1kN[ϕ(yx)ψ(yz)g[uk](x,y)g[uk](z,y)μ(dx)μ(dz)]μ(dy)\displaystyle\frac{1}{N}\sum_{1\leq k\leq N}\int\left[\int\int\phi(y-x)\psi(y-z)g[u^{k}](x,y)g[u^{k}](z,y)\mu(dx)\mu(dz)\right]\mu(dy)
=\displaystyle= 𝒮𝒮ϕ(r)ψ(s)G¯(r,s)ρ(dr)ρ(ds),\displaystyle\int_{\mathcal{S}}\int_{\mathcal{S}}\phi(r)\psi(s){\overline{G}}(r,s)\rho(dr)\rho(ds),

where the integral kernel G¯{\overline{G}} is given by, for r,ssupp(ρ)r,s\in\mathrm{supp}(\rho),

G¯(r,s)=G(r,s)ρ(r)ρ(s) with G(r,s)=1N1kNg[uk](x,r+x)g[uk](x,s+x)μ(dx),{\overline{G}}(r,s)=\frac{G(r,s)}{\rho(r)\rho(s)}\quad\text{ with }G(r,s)=\frac{1}{N}\sum_{1\leq k\leq N}\int g[u^{k}](x,r+x)g[u^{k}](x,s+x)\mu(dx), (2.6)

in which by an abuse of notation, we also use ρ(r)\rho(r) to denote either the probability of rr when ρ\rho defined in (2.4) is discrete or the probability density of ρ\rho when the density exists.

By definition, the bivariate function G¯{\overline{G}} is symmetric and positive semi-definite in the sense that i,j=1ncicjG(ri,rj)0\sum_{i,j=1}^{n}c_{i}c_{j}G(r_{i},r_{j})\geq 0 for any {ci}i=1n\{c_{i}\}_{i=1}^{n}\subset\mathbb{R} and {ri}i=1n𝒮\{r_{i}\}_{i=1}^{n}\subset\mathcal{S}. In the following, we assume that the data is continuous and bounded so that G¯{\overline{G}} defines a self-adjoint compact operator which is fundamental for studying identifiability. This assumption holds under mild regularity conditions on the data {uk}\{u^{k}\} and the operator RϕR_{\phi}.

Assumption 2.5 (Integrability of G¯{\overline{G}}).

Assume that Ω\Omega is bounded and {g[uk](x,y)}\{g[u^{k}](x,y)\} in (1.3) with data {uk}\{u^{k}\} in (1.2) are continuous satisfying max1kNsupx,yΩ|g[uk](x,y)|<\max_{1\leq k\leq N}\sup_{x,y\in\Omega}|g[u^{k}](x,y)|<\infty.

Under Assumption 2.5, the integral operator G¯:Lρ2Lρ2{\mathcal{L}_{\overline{G}}}:L^{2}_{\rho}\to L^{2}_{\rho}

G¯ϕ(r)=𝒮ϕ(s)G¯(r,s)ρ(s)𝑑s,{\mathcal{L}_{\overline{G}}}\phi(r)=\int_{\mathcal{S}}\phi(s)\overline{G}(r,s)\rho(s)ds, (2.7)

is a positive semi-definite trace-class operator (see Lemma A.1). Hereafter we denote {λi,ψi}\{\lambda_{i},\psi_{i}\} the eigen-pairs of G¯{\mathcal{L}_{\overline{G}}} with the eigenvalues in descending order, and assume that the eigenfunctions are orthonormal, hence they provide an orthonormal basis of Lρ2L^{2}_{\rho}. Furthermore, for any ϕ,ψLρ2\phi,\psi\in L^{2}_{\rho}, the bilinear form in (2.5) can be written as

ϕ,ψ=G¯ϕ,ψLρ2=ϕ,G¯ψLρ2,\langle\langle{\phi,\psi}\rangle\rangle=\langle{{\mathcal{L}_{\overline{G}}}\phi,\psi}\rangle_{L^{2}_{\rho}}=\langle{\phi,{\mathcal{L}_{\overline{G}}}\psi}\rangle_{L^{2}_{\rho}}, (2.8)

and we can write the loss functional in (2.3) as

(ϕ)\displaystyle\mathcal{E}(\phi) =ϕ,ϕ21N1kNRϕ[uk],fk𝕐+1N1kNfk𝕐2\displaystyle=\langle\langle{\phi,\phi}\rangle\rangle-2\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\phi}[u^{k}],f^{k}}\rangle_{\mathbb{Y}}+\frac{1}{N}\sum_{1\leq k\leq N}\|f^{k}\|_{\mathbb{Y}}^{2} (2.9)
=G¯ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf,\displaystyle=\langle{{\mathcal{L}_{\overline{G}}}\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f},

where ϕ𝒟Lρ2\phi^{\scriptscriptstyle\mathcal{D}}\in L^{2}_{\rho} is the Riesz representation of the bounded linear functional:

ϕ𝒟,ψLρ2=1N1kNRψ[uk],fk𝕐,ψLρ2.\langle{\phi^{\scriptscriptstyle\mathcal{D}},\psi}\rangle_{L^{2}_{\rho}}=\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\psi}[u^{k}],f^{k}}\rangle_{\mathbb{Y}},\quad\forall\psi\in L^{2}_{\rho}. (2.10)

Then, by solving the zero of (ϕ)=2(G¯ϕϕ𝒟)\nabla\mathcal{E}(\phi)=2({\mathcal{L}_{\overline{G}}}\phi-\phi^{\scriptscriptstyle\mathcal{D}}), one may obtain the minimizer G¯1ϕ𝒟{\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}}, provided that it is well-defined. However, it is often ill-defined, e.g., when ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} is not in G¯(Lρ2){\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}), which can happen when G¯{\mathcal{L}_{\overline{G}}} is compact with infinite rank or is rank-deficient. Thus, it is important to examine the inversion and the uniqueness of the minimizer, for which we introduce the function space of identifiability in the next section.
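In the discrete setting of Example 2.1, these objects can be assembled explicitly. Writing Rϕu=A(u)cR_{\phi}u=A(u)c for the coefficient vector cc of ϕ\phi, the bilinear form (2.5) and the data term (2.10) are carried by a regression matrix A¯{\overline{A}} and vector b¯{\overline{b}} (the notation of Sect.5), and, up to normalization conventions for μ\mu and the 𝕐\mathbb{Y} inner product, G¯{\mathcal{L}_{\overline{G}}} and ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} correspond to A¯{\overline{A}} and b¯{\overline{b}} rescaled by the exploration measure. The sketch below is an illustrative implementation (not the paper's code; all variable names are assumptions) that forms these quantities, diagonalizes G¯{\mathcal{L}_{\overline{G}}} in the Lρ2L^{2}_{\rho} geometry, and evaluates the pseudo-inverse estimator restricted to the FSOI.

```python
# A minimal sketch (my own discretization, up to normalization conventions)
# of L_Gbar, phi^D, and the FSOI least-squares estimator for Example 2.1.
import numpy as np

rng = np.random.default_rng(2)
n, N, sigma_eta = 20, 50, 1e-2
lags = np.arange(-(n - 1), n)
d = lags.size                                      # d = 2n - 1 unknowns

def design(u):
    """A(u) with (A(u) c)(i) = sum_l c_l u(i - r_l) = (R_phi u)(i)."""
    A = np.zeros((u.size, d))
    for i in range(u.size):
        for l, r in enumerate(lags):
            if 0 <= i - r < u.size:
                A[i, l] = u[i - r]
    return A

phi_true = np.exp(-np.abs(lags) / 3.0)
U = rng.standard_normal((N, n))
F = np.stack([design(u) @ phi_true for u in U])
F += sigma_eta * rng.standard_normal(F.shape)

Abar = sum(design(u).T @ design(u) for u in U) / N       # carries the bilinear form (2.5)
bbar = sum(design(U[k]).T @ F[k] for k in range(N)) / N  # carries the data term (2.10)

rho = np.zeros(d)                                        # exploration measure (2.4)
abs_cols = np.abs(U).sum(axis=0)
for l, r in enumerate(lags):
    for x_idx in range(n):
        if 0 <= x_idx + r < n:
            rho[l] += abs_cols[x_idx]
rho /= rho.sum()

# L_Gbar = D_rho^{-1} Abar in coordinates; diagonalize its symmetrization
sq = np.sqrt(rho)
lam, V = np.linalg.eigh(Abar / np.outer(sq, sq))
lam, V = lam[::-1], V[:, ::-1]                           # descending eigenvalues
Psi = V / sq[:, None]                                    # L^2_rho-orthonormal eigenfunctions
phi_D = bbar / rho                                       # phi^D in coordinates

coef = Psi.T @ (rho * phi_D)                             # <phi^D, psi_i>_{L^2_rho}
tol = 1e-10 * lam.max()
coef = np.where(lam > tol, coef / np.maximum(lam, tol), 0.0)
phi_hat = Psi @ coef                                     # estimator restricted to the FSOI
```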

2.3 Function space of identifiability

We define a function space of identifiability (FSOI) when one learns the kernel by minimizing the loss function.

Definition 2.6.

The function space of identifiability (FSOI) by the loss functional \mathcal{E} in (2.3) is the largest linear subspace of Lρ2L^{2}_{\rho} in which \mathcal{E} has a unique minimizer.

The next theorem characterizes the FSOI. Its proof is deferred to Appendix A.1.

Theorem 2.7 (Function space of identifiability).

Suppose the data in (1.2) is generated from the system (1.1) with a true kernel ϕtrue\phi_{true}. Suppose that Assumption 2.5 holds. Then, the following statements hold.

  • (a)

    The function ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} in (2.10) has the following decomposition:

    ϕ𝒟=G¯ϕtrue+ϵξ+ϵη,\phi^{\scriptscriptstyle\mathcal{D}}={\mathcal{L}_{\overline{G}}}\phi_{true}+\epsilon^{\xi}+\epsilon^{\eta}, (2.11)

    where ϵξ\epsilon^{\xi} comes from the model error ξ\xi, the random ϵη\epsilon^{\eta} comes from the observation noise and it has a Gaussian distribution 𝒩(0,ση2G¯)\mathcal{N}(0,\sigma_{\eta}^{2}{\mathcal{L}_{\overline{G}}}), and they satisfy ψLρ2\forall\psi\in L^{2}_{\rho}

    ϵξ,ψLρ2=1N1kNRψ[uk],ξk𝕐,ϵη,ψLρ2=1N1kNRψ[uk],ηk𝕐.\displaystyle\langle{\epsilon^{\xi},\psi}\rangle_{L^{2}_{\rho}}=\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\psi}[u^{k}],\xi^{k}}\rangle_{\mathbb{Y}},\quad\langle{\epsilon^{\eta},\psi}\rangle_{L^{2}_{\rho}}=\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\psi}[u^{k}],\eta^{k}}\rangle_{\mathbb{Y}}\,.
  • (b)

    The Fréchet derivative of (ϕ)\mathcal{E}(\phi) in Lρ2L^{2}_{\rho} is (ϕ)=2(G¯ϕϕ𝒟)\nabla\mathcal{E}(\phi)=2({\mathcal{L}_{\overline{G}}}\phi-\phi^{\scriptscriptstyle\mathcal{D}}).

  • (c)

    The function space of identifiability (FSOI) of \mathcal{E} is H=span{ψi}¯i:λi>0H=\overline{\mathrm{span}\{\psi_{i}\}}_{i:\lambda_{i}>0} with closure in Lρ2L^{2}_{\rho}. In particular, if ϕ𝒟G¯(Lρ2)\phi^{\scriptscriptstyle\mathcal{D}}\in{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}), the unique minimizer of (ϕ)\mathcal{E}(\phi) in the FSOI is ϕ^=G¯1ϕ𝒟\widehat{\phi}={\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}}, where G¯1{\mathcal{L}_{\overline{G}}}^{-1} is the pseudo-inverse of G¯{\mathcal{L}_{\overline{G}}}. Furthermore, if ϕtrueH\phi_{true}\in H and there is no observation noise and no model error, we have ϕ^=G¯1ϕ𝒟=ϕtrue\widehat{\phi}={\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}}=\phi_{true}.

Theorem 2.7 enables us to analyze the ill-posedness of the inverse problems through the operator G¯{\mathcal{L}_{\overline{G}}} and ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}}. When ϕ𝒟G¯(Lρ2)\phi^{\scriptscriptstyle\mathcal{D}}\in{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}), the inverse problem has a unique solution in the FSOI HH, even when it is underdetermined in Lρ2L^{2}_{\rho} due to HH being a proper subspace, which happens when the compact operator G¯{\mathcal{L}_{\overline{G}}} has a zero eigenvalue. However, when ϕ𝒟G¯(Lρ2)\phi^{\scriptscriptstyle\mathcal{D}}\notin{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}), the inverse problem =0\nabla\mathcal{E}=0 has no solution in Lρ2L^{2}_{\rho} because G¯1ϕ𝒟{\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}} is undefined. According to (2.11), this happens in one or more of the following scenarios:

  • when the model error leads to ϵξG¯(Lρ2)\epsilon^{\xi}\notin{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}).

  • when the observation noise leads to ϵηG¯(Lρ2)\epsilon^{\eta}\notin{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}). In particular, since ϵη\epsilon^{\eta} is Gaussian 𝒩(0,ση2G¯)\mathcal{N}(0,\sigma_{\eta}^{2}{\mathcal{L}_{\overline{G}}}), it has the Karhunen–Loève expansion ϵη=σηiλi1/2ϵiηψi\epsilon^{\eta}=\sigma_{\eta}\sum_{i}\lambda_{i}^{1/2}\epsilon^{\eta}_{i}\psi_{i} with ϵiη\epsilon^{\eta}_{i} being i.i.d. 𝒩(0,1)\mathcal{N}(0,1). Then, G¯1ϵη=σηiλi1/2ϵiηψi{\mathcal{L}_{\overline{G}}}^{-1}\epsilon^{\eta}=\sigma_{\eta}\sum_{i}\lambda_{i}^{-1/2}\epsilon^{\eta}_{i}\psi_{i}, which diverges a.s. if i:λi>0λi1\sum_{i:\lambda_{i}>0}\lambda_{i}^{-1} diverges. Thus, we have ϵηG¯(Lρ2)\epsilon^{\eta}\notin{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}) a.s. when i:λi>0λi1\sum_{i:\lambda_{i}>0}\lambda_{i}^{-1} diverges.

Additionally, ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} only encodes information of ϕtrueH\phi_{true}^{H}, and it provides no information about ϕtrue\phi_{true}^{\perp}, where ϕtrueH\phi_{true}^{H} and ϕtrue\phi_{true}^{\perp} form an orthogonal decomposition ϕtrue=ϕtrueH+ϕtrueHH\phi_{true}=\phi_{true}^{H}+\phi_{true}^{\perp}\in H\oplus H^{\perp}. In other words, the data provides no information to recover ϕtrue\phi_{true}^{\perp}.

As a result, it is important to avoid absorbing the errors outside of the FSOI when using a Bayesian approach or a regularization method to mitigate the ill-posedness.

2.4 A data-adaptive RKHS

The RKHS with G¯{\overline{G}} as a reproducing kernel is a dense subspace of the FSOI. Hence, when using it as a prior, which we will detail in later sections, one can filter out the error outside the FSOI and ensure that the learning takes place inside the FSOI.

The next lemma is an operator characterization of this RKHS (see, e.g., [9, Section 4.4]). Its proof can be found in [38, Theorem 3.3].

Lemma 2.8 (A data-adaptive RKHS).

Suppose that Assumption 2.5 holds. Then, the following statements hold.

  • (a)

    The RKHS HGH_{G} with G¯{\overline{G}} in (2.6) as reproducing kernel satisfies HG=G¯1/2(Lρ2)H_{G}={\mathcal{L}_{\overline{G}}}^{1/2}(L^{2}_{\rho}) and its inner product satisfies

    ϕ,ψHG=G¯1/2ϕ,G¯1/2ψLρ2,ϕ,ψHG.\langle{\phi,\psi}\rangle_{H_{G}}=\langle{{\mathcal{L}_{\overline{G}}}^{-1/2}\phi,{\mathcal{L}_{\overline{G}}}^{-1/2}\psi}\rangle_{L^{2}_{\rho}},\quad\forall\phi,\psi\in H_{G}. (2.12)
  • (b)

    Denote the eigen-pairs of G¯{\mathcal{L}_{\overline{G}}} by {λi,ψi}i\{\lambda_{i},\psi_{i}\}_{i} with {ψi}\{\psi_{i}\} being orthonormal. Then, for any ϕ=iciψiLρ2\phi=\sum_{i}c_{i}\psi_{i}\in L^{2}_{\rho}, we have

    ϕ,ϕ=iλici2,ϕLρ22=ici2,ϕHG2=i:λi>0λi1ci2,\langle\langle{\phi,\phi}\rangle\rangle=\sum_{i}\lambda_{i}c_{i}^{2},\quad\|\phi\|^{2}_{L^{2}_{\rho}}=\sum_{i}c_{i}^{2},\quad\|\phi\|^{2}_{H_{G}}=\sum_{i:\lambda_{i}>0}\lambda_{i}^{-1}c_{i}^{2}, (2.13)

    where the last equation is restricted to ϕHG\phi\in H_{G}.

The above RKHS has been used for regularization in [37, 38]. The regularizer, named DARTR, regularizes the loss by the norm of this RKHS,

λ(ϕ)=(ϕ)+λϕHG2=(G¯+λG¯1)ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf.\mathcal{E}_{\lambda}(\phi)=\mathcal{E}(\phi)+\lambda\|\phi\|_{H_{G}}^{2}=\langle{({\mathcal{L}_{\overline{G}}}+\lambda{\mathcal{L}_{\overline{G}}}^{-1})\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f}. (2.14)

With the optimal hyper-parameter λ\lambda_{*} selected by the L-curve method, this regularization leads to the estimator

ϕ^HG=(G¯2+λIH)1G¯ϕ𝒟,\widehat{\phi}_{H_{G}}=({\mathcal{L}_{\overline{G}}}^{2}+\lambda_{*}I_{H})^{-1}{\mathcal{L}_{\overline{G}}}\phi^{\scriptscriptstyle\mathcal{D}}, (2.15)

where IHI_{H} is the identity operator on HH. Since the RKHS is a subset of the FSOI with more regular elements, DARTR ensures that the estimator is in the FSOI and is regularized.
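In the eigenbasis of G¯{\mathcal{L}_{\overline{G}}}, the DARTR estimator (2.15) takes an explicit spectral form with coefficients λi(λi2+λ)1ϕ𝒟,ψiLρ2\lambda_{i}(\lambda_{i}^{2}+\lambda_{*})^{-1}\langle{\phi^{\scriptscriptstyle\mathcal{D}},\psi_{i}}\rangle_{L^{2}_{\rho}}. The sketch below works in synthetic spectral coordinates (not the paper's experiments); the corner search is only one simple variant of the L-curve method in [24], and the misfit used is the loss (2.9) shifted by a λ\lambda-independent constant so that it is nonnegative.

```python
# A minimal sketch (my own simplification, in the spectral coordinates of
# L_Gbar) of the DARTR estimator (2.15) with lambda_* chosen by a simple
# discrete L-curve corner search.
import numpy as np

rng = np.random.default_rng(3)
d, K = 60, 40                                   # ambient dimension, rank of L_Gbar
lam = np.concatenate([np.exp(-0.3 * np.arange(K)), np.zeros(d - K)])   # eigenvalues
phi_true = rng.standard_normal(d) / (1 + np.arange(d))                 # coefficients of phi_true
sigma_eta = 1e-3
c = lam * phi_true + sigma_eta * np.sqrt(lam) * rng.standard_normal(d) # <phi^D, psi_i>, cf. (3.5)

def dartr(lmbda):
    """Coefficients of (L_Gbar^2 + lambda I_H)^{-1} L_Gbar phi^D."""
    a = np.zeros(d)
    a[:K] = lam[:K] * c[:K] / (lam[:K] ** 2 + lmbda)
    return a

lambdas = np.logspace(-10, 1, 60)
mis, reg = [], []
for l in lambdas:
    a = dartr(l)
    mis.append(np.sum((lam[:K] * a[:K] - c[:K]) ** 2 / lam[:K]))  # loss (2.9) up to a constant
    reg.append(np.sum(a[:K] ** 2 / lam[:K]))                      # RKHS norm squared, cf. (2.13)

# corner of the L-curve: point of maximum discrete curvature of (log misfit, log norm)
x, y = np.log(np.array(mis) + 1e-16), np.log(np.array(reg) + 1e-16)
dx, dy = np.gradient(x), np.gradient(y)
ddx, ddy = np.gradient(dx), np.gradient(dy)
curvature = np.abs(dx * ddy - dy * ddx) / ((dx**2 + dy**2) ** 1.5 + 1e-16)
lambda_star = lambdas[np.argmax(curvature)]
phi_hat_HG = dartr(lambda_star)                 # DARTR estimator in the FSOI
```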

We summarize the key terminologies and notations in this section in Table 1.

Table 1: Notations for learning kernels in operators in Sect.2.
Notation Meaning
Exploration measure ρ\rho A probability measure quantifying the data’s exploration to the kernel (2.4).
L2(𝒮,ρ)L^{2}(\mathcal{S},\rho) or Lρ2L^{2}_{\rho} Function space of learning
G¯{\mathcal{L}_{\overline{G}}}, {(λi,ψi)}i\{(\lambda_{i},\psi_{i})\}_{i}, G¯1{\mathcal{L}_{\overline{G}}}^{-1} The normal operator in (2.7), its spectral-decomposition and pseudo-inverse
(ϕ)=G¯ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf\mathcal{E}(\phi)=\langle{{\mathcal{L}_{\overline{G}}}\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f} Loss function from the likelihood in (2.9)
FSOI H=span{ψi}λi>0H=\mathrm{span}\{\psi_{i}\}_{\lambda_{i}>0} Function space of identifiability, in which \mathcal{E} has a unique minimizer.

3 Bayesian inversion and the risk in a non-degenerate prior

The Bayesian approach overcomes the ill-posedness by introducing a prior, so the posterior is stable under perturbations. Since little prior information is available about the kernel, it is common to use a non-degenerate prior to ensure the well-posedness of the posterior. However, we will show that a fixed non-degenerate prior may lead to a posterior with a divergent mean in the small noise limit. These considerations motivate the data-adaptive prior introduced in the next section.

3.1 The Bayesian approach

In this study, we focus on Gaussian priors, so that, combined with the Gaussian likelihood, the posterior is also a Gaussian measure. Also, without loss of generality, we assume that the prior is centered. Recall that the function space of learning is Lρ2L^{2}_{\rho} with the measure ρ\rho defined in (2.4). For illustration, we first specify the prior and posterior when the space Lρ2L^{2}_{\rho} is finite-dimensional, and then discuss them in the infinite-dimensional case. We follow the notations in [53] and [14].

Finite-dimensional case.

Consider first that the space Lρ2L^{2}_{\rho} is finite-dimensional, i.e., 𝒮={r1,,rd}\mathcal{S}=\{r_{1},\ldots,r_{d}\}, as in Example 2.1. Then, the space Lρ2L^{2}_{\rho} is equivalent to d\mathbb{R}^{d} with norm satisfying ϕ2=i=1dϕ(ri)2ρ(ri)\|\phi\|^{2}=\sum_{i=1}^{d}\phi(r_{i})^{2}\rho(r_{i}). Also, assume that the space 𝕐\mathbb{Y} is finite-dimensional, and the measurement noise in (1.1) is Gaussian 𝒩(0,ση2I)\mathcal{N}(0,\sigma_{\eta}^{2}I). Since ϕ\phi is finite-dimensional, we write the prior, the likelihood, and the posterior in terms of their probability densities with respect to the Lebesgue measure.

  • Prior distribution, denoted by 𝒩(0,𝒬0)\mathcal{N}(0,\mathcal{Q}_{0}), with density

    dπ0(ϕ)dϕe12ϕ,𝒬01ϕLρ2,\frac{d\pi_{0}(\phi)}{d\phi}\propto e^{-\frac{1}{2}\langle{\phi,\mathcal{Q}_{0}^{-1}\phi}\rangle_{L^{2}_{\rho}}},

    where 𝒬0\mathcal{Q}_{0} is a strictly positive matrix, so that the prior is a non-degenerate measure.

  • Likelihood distribution of the data with density

    dπL(ϕ)dϕexp(12ση2(ϕ))=exp(12ση2[G¯ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf]),\frac{d\pi_{L}(\phi)}{d\phi}\propto\exp\left(-\frac{1}{2\sigma_{\eta}^{2}}\mathcal{E}(\phi)\right)=\exp\left(-\frac{1}{2\sigma_{\eta}^{2}}\left[\langle{{\mathcal{L}_{\overline{G}}}\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f}\right]\right), (3.1)

    where (ϕ)\mathcal{E}(\phi) is the loss function defined in (2.3) and the equality follows from (2.9). Note that this distribution is a non-degenerate Gaussian 𝒩(G¯1ϕ𝒟,ση2G¯1)\mathcal{N}({\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}},\sigma_{\eta}^{2}{\mathcal{L}_{\overline{G}}}^{-1}) when G¯1{\mathcal{L}_{\overline{G}}}^{-1} exists, and it can be ill-defined when G¯{\mathcal{L}_{\overline{G}}} has a zero eigenvalue.

  • Posterior distribution with density proportional to the product of the prior and likelihood densities,

    dπ1(ϕ)dϕ\displaystyle\frac{d\pi_{1}(\phi)}{d\phi} exp(12[ση2(ϕ)+𝒬01ϕ,ϕLρ2])\displaystyle\propto\exp\left(-\frac{1}{2}[\sigma_{\eta}^{-2}\mathcal{E}(\phi)+\langle{\mathcal{Q}_{0}^{-1}\phi,\phi}\rangle_{L^{2}_{\rho}}]\right)
    =exp(12[ση2(G¯ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf)+ϕ,𝒬01ϕLρ2])\displaystyle=\exp\left(-\frac{1}{2}\Big{[}\sigma_{\eta}^{-2}\big{(}\langle{{\mathcal{L}_{\overline{G}}}\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f}\big{)}+\langle{\phi,\mathcal{Q}_{0}^{-1}\phi}\rangle_{L^{2}_{\rho}}\Big{]}\right)
    =exp(12𝒬11(ϕμ1),ϕμ1+CNf)\displaystyle=\exp\left(-\frac{1}{2}\langle{\mathcal{Q}_{1}^{-1}(\phi-\mu_{1}),\phi-\mu_{1}}\rangle+C_{N}^{f}\right) (3.2)

    where the constant term CNfC_{N}^{f} may change from line to line, and

    μ1=(G¯+ση2𝒬01)1ϕ𝒟=ση2𝒬1ϕ𝒟,and 𝒬1=ση2(G¯+ση2𝒬01)1.\mu_{1}=({\mathcal{L}_{\overline{G}}}+\sigma_{\eta}^{2}\mathcal{Q}_{0}^{-1})^{-1}\phi^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{-2}\mathcal{Q}_{1}\phi^{\scriptscriptstyle\mathcal{D}},\,\text{and }\mathcal{Q}_{1}=\sigma_{\eta}^{2}({\mathcal{L}_{\overline{G}}}+\sigma_{\eta}^{2}\mathcal{Q}_{0}^{-1})^{-1}. (3.3)

    Thus, π1(ϕ)\pi_{1}(\phi) is a Gaussian measure 𝒩(μ1,𝒬1)\mathcal{N}(\mu_{1},\mathcal{Q}_{1}).

The Bayesian approach is closely related to Tikhonov regularization (see, e.g., [33]). Note that a Gaussian prior corresponds to a regularization term (ϕ)=ϕ,𝒬01ϕLρ2\mathcal{R}(\phi)=\langle{\phi,\mathcal{Q}_{0}^{-1}\phi}\rangle_{L^{2}_{\rho}}, the negative log-likelihood is the loss function, and the posterior corresponds to the penalized loss:

2ση2logπ1(ϕ)=(ϕ)+λϕ,𝒬01ϕLρ2 with λ=ση2.-2\sigma_{\eta}^{2}\log\pi_{1}(\phi)=\mathcal{E}(\phi)+\lambda\langle{\phi,\mathcal{Q}_{0}^{-1}\phi}\rangle_{L^{2}_{\rho}}\,\text{ with }\lambda=\sigma_{\eta}^{2}.

In particular, the maximum a posteriori (MAP) estimator, which agrees with the posterior mean μ1\mu_{1}, is the minimizer of the penalized loss with penalty term ση2ϕ,𝒬01ϕLρ2\sigma_{\eta}^{2}\langle{\phi,\mathcal{Q}_{0}^{-1}\phi}\rangle_{L^{2}_{\rho}}. The difference is that a regularization approach selects the hyper-parameter according to the data.
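As a concrete illustration of the posterior (3.3) and its Tikhonov interpretation, the sketch below (illustrative diagonal matrices in an orthonormal basis of a finite-dimensional Lρ2L^{2}_{\rho}, not the paper's data) computes the posterior mean, i.e., the MAP estimator, and the posterior covariance for a fixed non-degenerate prior.

```python
# A minimal sketch (assumed matrices, not the paper's data) of the Gaussian
# posterior (3.3) under a fixed non-degenerate prior N(0, Q0), written in an
# orthonormal basis of a finite-dimensional L^2_rho.
import numpy as np

rng = np.random.default_rng(4)
d, K, sigma_eta = 30, 20, 1e-2
lam = np.concatenate([np.exp(-0.2 * np.arange(K)), np.zeros(d - K)])
L = np.diag(lam)                               # L_Gbar, rank-deficient (zero eigenvalues)
Q0 = np.diag(1.0 / (1.0 + np.arange(d)) ** 2)  # a fixed non-degenerate prior covariance
# data term with a small perturbation in all directions, including the null space of L_Gbar
phi_D = lam * rng.standard_normal(d) + 1e-3 * rng.standard_normal(d)

A = L + sigma_eta**2 * np.linalg.inv(Q0)
mu1 = np.linalg.solve(A, phi_D)                # posterior mean (3.3) = MAP estimator
Q1 = sigma_eta**2 * np.linalg.inv(A)           # posterior covariance (3.3)
# mu1 also minimizes the Tikhonov-penalized loss E(phi) + sigma_eta^2 <phi, Q0^{-1} phi>
```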

Infinite-dimensional case.

When the space Lρ2L^{2}_{\rho} is infinite-dimensional, i.e., the set 𝒮\mathcal{S} has infinitely many elements, we use the generic notion of Gaussian measures on Hilbert spaces; see Appendix A.2 for a brief review. Since there is no longer a Lebesgue measure on the infinite-dimensional space, the prior and the posterior are characterized by their means and covariance operators. We write the prior and the posterior as follows:

  • Prior 𝒩(0,𝒬0)\mathcal{N}(0,\mathcal{Q}_{0}), where 𝒬0\mathcal{Q}_{0} is a strictly positive trace-class operator on Lρ2L^{2}_{\rho};

  • Posterior 𝒩(μ1,𝒬1)\mathcal{N}(\mu_{1},\mathcal{Q}_{1}), whose Radon–Nikodym derivative with respect to the prior is

    dπ1dπ0exp(12ση2(ϕ))=exp(12ση2[G¯ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf]),\frac{d\pi_{1}}{d\pi_{0}}\propto\exp(-\frac{1}{2}\sigma_{\eta}^{-2}\mathcal{E}(\phi))=\exp\left(-\frac{1}{2}\sigma_{\eta}^{-2}\left[\langle{{\mathcal{L}_{\overline{G}}}\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f}\right]\right), (3.4)

    which is the same as the likelihood in (3.1). Its mean and covariance are given as in (3.3).

Note that, unlike the finite-dimensional case, it is problematic to write the likelihood distribution as 𝒩(G¯1ϕ𝒟,ση2G¯1)\mathcal{N}({\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}},\sigma_{\eta}^{2}{\mathcal{L}_{\overline{G}}}^{-1}), because the operator G¯1{\mathcal{L}_{\overline{G}}}^{-1} is unbounded and G¯1ϕ𝒟{\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}} may be ill-defined.

3.2 The risk in a non-degenerate prior

The prior distribution plays a crucial role in Bayesian inverse problems. To make the ill-posed inverse problem well-defined, it is often set to be a non-degenerate measure (i.e., its covariance operator 𝒬0\mathcal{Q}_{0} has no zero eigenvalues). It is fixed in many cases and not adaptive to data. Such a non-degenerate prior works well for an inverse problem whose function space of identifiability does not change with data. However, in the learning of kernels in operators, a non-degenerate prior has a risk of leading to a catastrophic error: the posterior mean may diverge in the small observation noise limit, as we show in Proposition 3.2.

Assumption 3.1.

Assume that the operator G¯{\mathcal{L}_{\overline{G}}} is finite rank and commutes with the prior covariance 𝒬0\mathcal{Q}_{0} and assume the existence of error outside of the FSOI as follows.

  • (A1)\operatorname{(A1)}

    The operator G¯{\mathcal{L}_{\overline{G}}} in (2.7) has zero eigenvalues. Let λK+1=0\lambda_{K+1}=0 be the first zero eigenvalue, where KK is less than the dimension of Lρ2L^{2}_{\rho}. As a result, the FSOI is H=span{ψi}i=1KH=\mathrm{span}\{\psi_{i}\}_{i=1}^{K}.

  • (A2)\operatorname{(A2)}

    The covariance of the prior 𝒩(0,𝒬0)\mathcal{N}(0,\mathcal{Q}_{0}) satisfies 𝒬0ψi=riψi\mathcal{Q}_{0}\psi_{i}=r_{i}\psi_{i} with ri>0r_{i}>0 for all ii, where {ψi}i\{\psi_{i}\}_{i} are orthonormal eigenfunctions of G¯{\mathcal{L}_{\overline{G}}}.

  • (A3)\operatorname{(A3)}

    The term ϵξ\epsilon^{\xi} in (2.11), which represents the model error, is outside of the FSOI, i.e., ϵξ=iϵiξψi\epsilon^{\xi}=\sum_{i}\epsilon^{\xi}_{i}\psi_{i} has a component ϵi0ξ0\epsilon^{\xi}_{i_{0}}\neq 0 for some i0>Ki_{0}>K.

Assumptions (A1-A2) are common in practice. Assumption (A1) holds because the operator G¯{\mathcal{L}_{\overline{G}}} is finite rank when the data is discrete, and it is not full rank for under-determined problems. It is natural to assume the prior has a full-rank covariance 𝒬0\mathcal{Q}_{0} as in (A2). We assume that 𝒬0\mathcal{Q}_{0} commutes with G¯{\mathcal{L}_{\overline{G}}} for simplicity, and one can extend it to the general case as in the proof of [12, Theorem 2.25, Feldman–Hajek theorem]. Assumption (A3), which requires ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} to be outside the range of G¯{\mathcal{L}_{\overline{G}}}, holds when the regression vector b¯{\overline{b}} is outside the range of the regression matrix A¯{\overline{A}} in (5.3b); see Sects. 5.2–5.3 for more discussion.

Proposition 3.2 (Risk in a fixed non-degenerate prior).

A non-degenerate prior has the risk of leading to a divergent posterior mean in the small noise limit. Specifically, under Assumption 3.1, the posterior mean μ1\mu_{1} in (3.3) diverges as ση0\sigma_{\eta}\to 0.

Proof of Proposition 3.2.

Recall that conditional on the data, the observation noise-induced term ϵη\epsilon^{\eta} in (2.11) has a distribution 𝒩(0,ση2G¯)\mathcal{N}(0,\sigma^{2}_{\eta}{\mathcal{L}_{\overline{G}}}). Thus, in the orthonormal basis {ψi}\{\psi_{i}\}, we can write ϵη=σηi:λi>0λi1/2ϵiηψi\epsilon^{\eta}=\sigma_{\eta}\sum_{i:\lambda_{i}>0}\lambda_{i}^{1/2}\epsilon^{\eta}_{i}\psi_{i}, where {ϵiη}\{\epsilon^{\eta}_{i}\} are i.i.d. 𝒩(0,1)\mathcal{N}(0,1) random variables. Additionally, write the true kernel as ϕtrue=iϕtrue,iψi\phi_{true}=\sum_{i}\phi_{true,i}\psi_{i}, where ϕtrue,i=ϕtrue,ψiLρ2\phi_{true,i}=\langle{\phi_{true},\psi_{i}}\rangle_{L^{2}_{\rho}} for all ii. Note that ϕtrue\phi_{true} does not have to be in the FSOI. Combining these facts, we have

ϕ𝒟=i=1ϕi𝒟ψi, with ϕi𝒟=λiϕtrue,i+σηλi1/2ϵiη+ϵiξ.\phi^{\scriptscriptstyle\mathcal{D}}=\sum_{i=1}^{\infty}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i},\,\text{ with }\phi^{\scriptscriptstyle\mathcal{D}}_{i}=\lambda_{i}\phi_{true,i}+\sigma_{\eta}\lambda_{i}^{1/2}\epsilon^{\eta}_{i}+\epsilon^{\xi}_{i}. (3.5)

Then, the posterior mean μ1=(G¯+ση2𝒬01)1ϕ𝒟\mu_{1}=({\mathcal{L}_{\overline{G}}}+\sigma_{\eta}^{2}\mathcal{Q}_{0}^{-1})^{-1}\phi^{\scriptscriptstyle\mathcal{D}} in (3.3) becomes

μ1\displaystyle\mu_{1} =i=1(λi+ση2ri1)1ϕi𝒟ψi=i=1K(λi+ση2ri1)1ϕi𝒟ψi+i>Kση2riϵiξψi.\displaystyle=\sum_{i=1}^{\infty}\left(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1}\right)^{-1}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i}=\sum_{i=1}^{K}\left(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1}\right)^{-1}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i}+\sum_{i>K}\sigma_{\eta}^{-2}r_{i}\epsilon^{\xi}_{i}\psi_{i}. (3.6)

Thus, the posterior mean μ1\mu_{1} is contaminated by the model error outside the FSOI, i.e., the part with components ϵiξ\epsilon^{\xi}_{i} with i>Ki>K. It diverges when ση0\sigma_{\eta}\to 0 because

limση0(μ1i>Kση2riϵiξψi)=1iK(ϕtrue,i+λi1ϵiξ)ψi,\lim_{\sigma_{\eta}\to 0}\left(\mu_{1}-\sum_{i>K}\sigma_{\eta}^{-2}r_{i}\epsilon^{\xi}_{i}\psi_{i}\right)=\sum_{1\leq i\leq K}\left(\phi_{true,i}+\lambda_{i}^{-1}\epsilon^{\xi}_{i}\right)\psi_{i},

and i>Kση2riϵiξψi\sum_{i>K}\sigma_{\eta}^{-2}r_{i}\epsilon^{\xi}_{i}\psi_{i} diverges. ∎
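The divergence in Proposition 3.2 is easy to reproduce numerically. The following self-contained sketch (synthetic eigenvalues, prior variances, and errors; not the experiments of Sect.5) evaluates the posterior mean (3.6) along a decreasing sequence of noise levels, with a small model-error component placed outside the FSOI.

```python
# A minimal sketch (synthetic spectral coordinates) illustrating Proposition 3.2:
# with a model-error component outside the FSOI, ||mu_1|| blows up as sigma_eta -> 0.
import numpy as np

rng = np.random.default_rng(5)
d, K = 30, 20
lam = np.concatenate([np.exp(-0.2 * np.arange(K)), np.zeros(d - K)])   # eigenvalues, (A1)
r = 1.0 / (1.0 + np.arange(d)) ** 2                                    # prior eigenvalues, (A2)
phi_true = rng.standard_normal(d) / (1 + np.arange(d))
eps_xi = np.zeros(d)
eps_xi[K + 2] = 1e-3                                                   # error outside the FSOI, (A3)

for sigma in [1e-1, 1e-2, 1e-3, 1e-4]:
    eps_eta = sigma * np.sqrt(lam) * rng.standard_normal(d)
    phi_D = lam * phi_true + eps_eta + eps_xi                          # decomposition (3.5)
    mu1 = phi_D / (lam + sigma**2 / r)                                 # posterior mean (3.6)
    print(f"sigma_eta = {sigma:.0e}:  ||mu_1|| = {np.linalg.norm(mu1):.2e}")
# for small sigma_eta, the printed norms grow like sigma_eta^{-2}, as predicted by (3.6)
```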

On the other hand, one may adjust the prior covariance by the standard deviation of the noise in the hope of removing the risk of divergence. However, the next proposition shows that such a noise-adaptive non-degenerate prior will have a biased small noise limit.

Proposition 3.3 (Risk in a noise-adaptive non-degenerate prior).

Let the prior be 𝒩(0,λ𝒬0)\mathcal{N}(0,\lambda\mathcal{Q}_{0}), where λ=C0ση2β\lambda=C_{0}\sigma_{\eta}^{2\beta} with C00C_{0}\neq 0 and β0\beta\geq 0. Suppose that Assumption 3.1 holds. Then, the corresponding posterior mean μ1ση\mu_{1}^{\sigma_{\eta}} either blows up or is biased, satisfying

limση0μ1ση={,β<1;(G¯+C01𝒬01)1ϕ𝒟,β=1;0,β>1.\displaystyle\lim_{\sigma_{\eta}\to 0}\mu_{1}^{\sigma_{\eta}}=\begin{cases}\infty\,,&\beta<1\,;\\ ({\mathcal{L}_{\overline{G}}}+C_{0}^{-1}\mathcal{Q}_{0}^{-1})^{-1}\phi^{\scriptscriptstyle\mathcal{D}}\,,&\beta=1\,;\\ 0\,,&\beta>1\,.\end{cases}
Proof.

Under the prior 𝒩(0,λ𝒬0)\mathcal{N}(0,\lambda\mathcal{Q}_{0}) with λ=C0ση2β\lambda=C_{0}\sigma_{\eta}^{2\beta}, we have the posterior mean

μ1ση\displaystyle\mu_{1}^{\sigma_{\eta}} =(G¯+ση2λ1𝒬01)1ϕ𝒟=i=1(λi+ση2λ1ri1)1ϕi𝒟ψi\displaystyle=({\mathcal{L}_{\overline{G}}}+\sigma_{\eta}^{2}\lambda^{-1}\mathcal{Q}_{0}^{-1})^{-1}\phi^{\scriptscriptstyle\mathcal{D}}=\sum_{i=1}^{\infty}\left(\lambda_{i}+\sigma_{\eta}^{2}\lambda^{-1}r_{i}^{-1}\right)^{-1}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i}
=i=1K(λi+ση2λ1ri1)1ϕi𝒟ψi+i>Kση2λriϵiξψi\displaystyle=\sum_{i=1}^{K}\left(\lambda_{i}+\sigma_{\eta}^{2}\lambda^{-1}r_{i}^{-1}\right)^{-1}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i}+\sum_{i>K}\sigma_{\eta}^{-2}\lambda r_{i}\epsilon^{\xi}_{i}\psi_{i}
=i=1K(λi+C01ση22βri1)1ϕi𝒟ψi+i>KC0ση2β2riϵiξψi.\displaystyle=\sum_{i=1}^{K}\left(\lambda_{i}+C_{0}^{-1}\sigma_{\eta}^{2-2\beta}r_{i}^{-1}\right)^{-1}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i}+\sum_{i>K}C_{0}\sigma_{\eta}^{2\beta-2}r_{i}\epsilon^{\xi}_{i}\psi_{i}\,.

Then, the limits for β>1\beta>1, β=1\beta=1 and β<1\beta<1 follow directly. Note that when β1\beta\geq 1, the limits are biased, not recovering the identifiable component G¯1ϕ𝒟=i=1Kλi1ϕi𝒟ψi{\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}}=\sum_{i=1}^{K}\lambda_{i}^{-1}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i} even when the model error i=1Kϵiξψi\sum_{i=1}^{K}\epsilon^{\xi}_{i}\psi_{i} vanishes. ∎

Propositions 3.2 and 3.3 highlight that the risk of a non-degenerate prior comes from the error outside the data-adaptive FSOI. Importantly, by definition of the FSOI, there is no signal to be recovered outside the FSOI. Thus, it is important to design a data-adaptive prior to restrict the learning to take place inside the FSOI.

4 Data-adaptive RKHS prior

We propose a data-adaptive prior to filter out the error outside of the FSOI, so that its posterior mean always has a small noise limit. In particular, the small noise limit converges to the identifiable part of the true kernel when the model error vanishes. Additionally, we show that this prior, even with a sub-optimal λ\lambda_{*}, outperforms a large class of fixed non-degenerate priors in the quality of the posterior.

4.1 Data-adaptive prior and its posterior

We first introduce the data-adaptive prior and specify its posterior, which will remove the risk in a non-degenerate prior as shown in Propositions 3.2–3.3. This prior is a Gaussian measure with a covariance from the likelihood.

Following the notation in Sect.2.2, the operator G¯{\mathcal{L}_{\overline{G}}} is a data-dependent positive semi-definite trace-class operator on Lρ2L^{2}_{\rho}, and we denote its eigen-pairs by {λi,ψi}i1\{\lambda_{i},\psi_{i}\}_{i\geq 1} with the eigenfunctions forming an orthonormal basis of Lρ2L^{2}_{\rho}. Then, as characterized in Theorem 2.7 and Lemma 2.8, the data-dependent FSOI and RKHS are

H=span{ψi}λi>0¯Lρ2,HG=span{ψi}λi>0¯HG,H=\overline{\mathrm{span}\{\psi_{i}\}_{\lambda_{i}>0}}^{\|\cdot\|_{L^{2}_{\rho}}},\quad H_{G}=\overline{\mathrm{span}\{\psi_{i}\}_{\lambda_{i}>0}}^{\|\cdot\|_{H_{G}}}, (4.1)

where the closure of HGH_{G} is with respect to the norm ϕHG2=i:λi>0λi1ϕ,ψiLρ22\|\phi\|_{H_{G}}^{2}=\sum_{i:\lambda_{i}>0}\lambda_{i}^{-1}\langle{\phi,\psi_{i}}\rangle_{L^{2}_{\rho}}^{2}. Note that the two spaces are closures of the same linear span under different norms; as sets, they coincide only when the operator G¯{\mathcal{L}_{\overline{G}}} has finite rank (e.g., the cases in Section 3.2). They are proper subspaces of Lρ2L^{2}_{\rho} when the operator G¯{\mathcal{L}_{\overline{G}}} has a zero eigenvalue.

Definition 4.1 (Data-adaptive RKHS prior).

Let G¯{\mathcal{L}_{\overline{G}}} be the operator defined in (2.7). The data-adaptive prior is a Gaussian measure with mean and covariance defined by

π0𝒟=𝒩(μ0𝒟,𝒬0𝒟):μ0𝒟=0;𝒬0𝒟=λ1G¯,\pi_{0}^{\scriptscriptstyle\mathcal{D}}=\mathcal{N}(\mu_{0}^{\scriptscriptstyle\mathcal{D}},\mathcal{Q}_{0}^{\scriptscriptstyle\mathcal{D}}):\quad\mu_{0}^{\scriptscriptstyle\mathcal{D}}=0;\,\,\mathcal{Q}_{0}^{\scriptscriptstyle\mathcal{D}}=\lambda_{*}^{-1}{\mathcal{L}_{\overline{G}}}, (4.2)

where the hyper-parameter λ>0\lambda_{*}>0 is determined adaptive to data.

In practice, we select the hyper-parameter λ>0\lambda_{*}>0 adaptive to data by the L-curve method in [24], which is effective in reaching an optimal trade-off between the likelihood and the prior (see Sect.5.1 for more details). We call this prior an RKHS prior because the Cameron-Martin space of its covariance operator is the RKHS HGH_{G} (see, e.g., [11, Section 1.7] and a brief review of Gaussian measures in Sect.A.2).

This data-adaptive prior is a Gaussian distribution with support in the FSOI HH in (4.1). When HH is finite-dimensional, its probability density in HH is

dπ0𝒟(ϕ)dϕe12ϕμ0𝒟,[𝒬0𝒟]1(ϕμ0𝒟)Lρ2=e12ϕ,λG¯1ϕLρ2,ϕH\frac{d\pi_{0}^{\scriptscriptstyle\mathcal{D}}(\phi)}{d\phi}\propto e^{-\frac{1}{2}\langle{\phi-\mu_{0}^{\scriptscriptstyle\mathcal{D}},[\mathcal{Q}_{0}^{\scriptscriptstyle\mathcal{D}}]^{-1}(\phi-\mu_{0}^{\scriptscriptstyle\mathcal{D}})}\rangle_{L^{2}_{\rho}}}=e^{-\frac{1}{2}\langle{\phi,\lambda_{*}{\mathcal{L}_{\overline{G}}}^{-1}\phi}\rangle_{L^{2}_{\rho}}},\quad\forall\phi\in H

by definitions (4.2), where G¯1{\mathcal{L}_{\overline{G}}}^{-1} is pseudo-inverse. Combining with the likelihood (3.1), the posterior becomes

𝒬1𝒟=ση2(G¯+ση2λG¯1)1=ση2(G¯2+ση2λIH)1G¯;\displaystyle\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{2}({\mathcal{L}_{\overline{G}}}+\sigma_{\eta}^{2}\lambda_{*}{\mathcal{L}_{\overline{G}}}^{-1})^{-1}=\sigma_{\eta}^{2}({\mathcal{L}_{\overline{G}}}^{2}+\sigma_{\eta}^{2}\lambda_{*}I_{H})^{-1}{\mathcal{L}_{\overline{G}}}\,; (4.3a)
μ1𝒟=ση2𝒬1𝒟ϕ𝒟=(G¯2+ση2λIH)1G¯ϕ𝒟.\displaystyle\mu_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{-2}\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}}\phi^{\scriptscriptstyle\mathcal{D}}=({\mathcal{L}_{\overline{G}}}^{2}+\sigma_{\eta}^{2}\lambda_{*}I_{H})^{-1}{\mathcal{L}_{\overline{G}}}\phi^{\scriptscriptstyle\mathcal{D}}. (4.3b)

We emphasize that the operator G¯1{\mathcal{L}_{\overline{G}}}^{-1} restricts the posterior to be supported on the FSOI HH, where the inverse problem is well-defined. The posterior mean in (4.3b) coincides with the minimum norm least squares estimator if there is no regularization, i.e., λ=0\lambda_{*}=0.

When HH is infinite-dimensional, the above mean and covariance remain valid, following similar arguments based on the likelihood ratio in (3.4).

In either case, the posterior is a Gaussian distribution whose support is HH, and it is degenerate in Lρ2L^{2}_{\rho} if HH is a proper subspace of Lρ2L^{2}_{\rho}. In short, the data-adaptive prior is a Gaussian distribution supported on the FSOI with a hyper-parameter adaptive to data. Both the prior and posterior are degenerate when the FSOI is a proper subspace of Lρ2L^{2}_{\rho}.
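Because the posterior (4.3) is diagonal in the eigenbasis of G¯{\mathcal{L}_{\overline{G}}}, its mean, covariance, and trace are straightforward to evaluate. The sketch below (synthetic spectral coordinates, restated so that it is self-contained, and with an assumed fixed λ=1\lambda_{*}=1 instead of the L-curve choice) illustrates how the error outside the FSOI is filtered out.

```python
# A minimal sketch of the posterior (4.3) under the data-adaptive RKHS prior
# N(0, lambda_*^{-1} L_Gbar), in the eigenbasis of L_Gbar; lambda_* is an
# assumed fixed value here.
import numpy as np

rng = np.random.default_rng(6)
d, K, sigma_eta, lambda_star = 30, 20, 1e-3, 1.0
lam = np.concatenate([np.exp(-0.2 * np.arange(K)), np.zeros(d - K)])
phi_true = rng.standard_normal(d) / (1 + np.arange(d))
eps_xi = np.zeros(d)
eps_xi[K + 2] = 1e-3                           # model error outside the FSOI
phi_D = lam * phi_true + sigma_eta * np.sqrt(lam) * rng.standard_normal(d) + eps_xi

# posterior mean (4.3b): zero outside the FSOI, so eps_xi there is filtered out
mu1_D = lam * phi_D / (lam**2 + sigma_eta**2 * lambda_star)
# posterior covariance (4.3a) is diagonal in this basis; its trace quantifies uncertainty
q1_D = sigma_eta**2 * lam / (lam**2 + sigma_eta**2 * lambda_star)
print("error vs. projection of phi_true:", np.linalg.norm((mu1_D - phi_true)[:K]))
print("trace of posterior covariance:", q1_D.sum())
```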

Table 2 compares the posteriors of the non-degenerate prior and the data-adaptive prior.

Table 2: Priors and posteriors on Lρ2L^{2}_{\rho}.
Gaussian measure Mean Covariance
Fixed non-degenerate prior and its posterior
π0=𝒩(μ0,𝒬0)\pi_{0}=\mathcal{N}(\mu_{0},\mathcal{Q}_{0}) μ0=0\mu_{0}=0 𝒬0\mathcal{Q}_{0}
π1=𝒩(μ1,𝒬1)\pi_{1}=\mathcal{N}(\mu_{1},\mathcal{Q}_{1}) μ1=ση2𝒬1ϕ𝒟\mu_{1}=\sigma_{\eta}^{-2}\mathcal{Q}_{1}\phi^{\scriptscriptstyle\mathcal{D}} 𝒬1=ση2(G¯+ση2𝒬01)1\mathcal{Q}_{1}=\sigma_{\eta}^{2}({\mathcal{L}_{\overline{G}}}+\sigma_{\eta}^{2}\mathcal{Q}_{0}^{-1})^{-1}
Data-adaptive prior and its posterior
π0𝒟=𝒩(μ0𝒟,𝒬0𝒟)\pi_{0}^{\scriptscriptstyle\mathcal{D}}=\mathcal{N}(\mu_{0}^{\scriptscriptstyle\mathcal{D}},\mathcal{Q}_{0}^{\scriptscriptstyle\mathcal{D}}) μ0𝒟=0\mu_{0}^{\scriptscriptstyle\mathcal{D}}=0 𝒬0𝒟=λ1G¯\mathcal{Q}_{0}^{\scriptscriptstyle\mathcal{D}}=\lambda_{*}^{-1}{\mathcal{L}_{\overline{G}}}
π1𝒟=𝒩(μ1𝒟,𝒬1𝒟)\pi_{1}^{\scriptscriptstyle\mathcal{D}}=\mathcal{N}(\mu_{1}^{\scriptscriptstyle\mathcal{D}},\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}}) μ1𝒟=ση2𝒬1𝒟ϕ𝒟\mu_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{-2}\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}}\phi^{\scriptscriptstyle\mathcal{D}} 𝒬1𝒟=ση2(G¯2+ση2λIH)1G¯\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{2}({\mathcal{L}_{\overline{G}}}^{2}+\sigma_{\eta}^{2}\lambda_{*}I_{H})^{-1}{\mathcal{L}_{\overline{G}}}

4.2 Quality of the posterior and its MAP estimator

The data-adaptive prior aims to improve the quality of the posterior. Compared with a fixed non-degenerate prior, we show that the data-adaptive prior improves the quality of the posterior in three aspects: (1) it improves the stability of the MAP estimator so that the MAP estimator always has a small noise limit; (2) it improves the accuracy of the MAP estimator by reducing the expected mean square error; and (3) it reduces the uncertainty in the posterior in terms of the trace of the posterior covariance.

We show first that the posterior mean always has a small noise limit, and the limit converges to the projection of the true function in the FSOI when the model error vanishes.

Theorem 4.2 (Small noise limit of the MAP estimator).

Suppose that Assumption 3.1 (A1-A2) holds. Then, the posterior mean in (4.3b) with the data-adaptive prior (4.2) always has a small noise limit. In particular, its small noise limit converges to the projection of the true kernel onto the FSOI HH in (4.1) when the model error i=1Kϵiξψi\sum_{i=1}^{K}\epsilon^{\xi}_{i}\psi_{i} vanishes.

Proof.

The claims follow directly from the definition of the new posterior mean in (4.3b) and the decomposition in Eq. (3.5), which says that ϕ𝒟=iϕi𝒟ψi\phi^{\scriptscriptstyle\mathcal{D}}=\sum_{i}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i} with ϕi𝒟=λiϕtrue,i+σηλi1/2ϵiη+ϵiξ\phi^{\scriptscriptstyle\mathcal{D}}_{i}=\lambda_{i}\phi_{true,i}+\sigma_{\eta}\lambda_{i}^{1/2}\epsilon^{\eta}_{i}+\epsilon^{\xi}_{i}. Recall that μ1𝒟=(G¯2+ση2λIH)1G¯ϕ𝒟\mu_{1}^{\scriptscriptstyle\mathcal{D}}=({\mathcal{L}_{\overline{G}}}^{2}+\sigma_{\eta}^{2}\lambda_{*}I_{H})^{-1}{\mathcal{L}_{\overline{G}}}\phi^{\scriptscriptstyle\mathcal{D}}, and if iK+1i\geq K+1, we have (G¯2+ση2λIH)1G¯ψi=λi(λi2+ση2λ)1ψi=0({\mathcal{L}_{\overline{G}}}^{2}+\sigma_{\eta}^{2}\lambda_{*}I_{H})^{-1}{\mathcal{L}_{\overline{G}}}\psi_{i}=\lambda_{i}(\lambda_{i}^{2}+\sigma_{\eta}^{2}\lambda_{*})^{-1}\psi_{i}=0. Thus, we can write

μ1𝒟\displaystyle\mu_{1}^{\scriptscriptstyle\mathcal{D}} =(G¯2+ση2λIH)1G¯ϕ𝒟=i=1Kλi(λi2+ση2λ)1ψiϕi𝒟,\displaystyle=({\mathcal{L}_{\overline{G}}}^{2}+\sigma_{\eta}^{2}\lambda_{*}I_{H})^{-1}{\mathcal{L}_{\overline{G}}}\phi^{\scriptscriptstyle\mathcal{D}}=\sum_{i=1}^{K}\lambda_{i}(\lambda_{i}^{2}+\sigma_{\eta}^{2}\lambda_{*})^{-1}\psi_{i}\phi^{\scriptscriptstyle\mathcal{D}}_{i}, (4.4)

since {ψi}i\{\psi_{i}\}_{i} are orthonormal eigenfunctions of G¯{\mathcal{L}_{\overline{G}}} with eigenvalues {λi}i\{\lambda_{i}\}_{i}. Thus, the small noise limit exists and is equal to

limση0μ1𝒟\displaystyle\lim_{\sigma_{\eta}\to 0}\mu_{1}^{\scriptscriptstyle\mathcal{D}} =limση01iKλi(λi2+ση2λ)1ϕi𝒟ψi\displaystyle=\lim_{\sigma_{\eta}\to 0}\sum_{1\leq i\leq K}\lambda_{i}\left(\lambda_{i}^{2}+\sigma_{\eta}^{2}\lambda_{*}\right)^{-1}\phi^{\scriptscriptstyle\mathcal{D}}_{i}\psi_{i}
=limση01iKλi(λi2+ση2λ)1(λiϕtrue,i+σηλi1/2ϵiη+ϵiξ)ψi\displaystyle=\lim_{\sigma_{\eta}\to 0}\sum_{1\leq i\leq K}\lambda_{i}\left(\lambda_{i}^{2}+\sigma_{\eta}^{2}\lambda_{*}\right)^{-1}(\lambda_{i}\phi_{true,i}+\sigma_{\eta}\lambda_{i}^{1/2}\epsilon^{\eta}_{i}+\epsilon^{\xi}_{i})\psi_{i}
=i=1Kλi1(λiϕtrue,i+ϵiξ)ψi=i=1K(ϕtrue,i+λi1ϵiξ)ψi.\displaystyle=\sum_{i=1}^{K}\lambda_{i}^{-1}\big{(}\lambda_{i}\phi_{true,i}+\epsilon^{\xi}_{i}\big{)}\psi_{i}=\sum_{i=1}^{K}\left(\phi_{true,i}+\lambda_{i}^{-1}\epsilon^{\xi}_{i}\right)\psi_{i}\,.

Furthermore, as the model error i=1KϵiξψiLρ22=i=1K|ϵiξ|20\|\sum_{i=1}^{K}\epsilon^{\xi}_{i}\psi_{i}\|^{2}_{L^{2}_{\rho}}=\sum_{i=1}^{K}|\epsilon^{\xi}_{i}|^{2}\to 0, this small noise limit converges to i=1Kϕtrue,iψi\sum_{i=1}^{K}\phi_{true,i}\psi_{i}, the projection of ϕtrue\phi_{true} in the FSOI. ∎
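A quick numerical check of Theorem 4.2 (again in synthetic spectral coordinates, with assumed eigenvalues and no model error) shows that the posterior mean (4.4) stays bounded and approaches the projection of the true kernel onto the FSOI as the noise level decreases, in contrast to the behavior in Proposition 3.2.

```python
# A minimal sketch (synthetic spectral coordinates, my own choices) checking
# the small-noise limit in Theorem 4.2: mu_1^D approaches P_H phi_true.
import numpy as np

rng = np.random.default_rng(7)
d, K, lambda_star = 30, 20, 1.0
lam = np.concatenate([np.exp(-0.2 * np.arange(K)), np.zeros(d - K)])
phi_true = rng.standard_normal(d) / (1 + np.arange(d))
eps_xi = np.zeros(d)                           # no model error, so the limit is P_H phi_true

for sigma in [1e-1, 1e-2, 1e-3, 1e-4]:
    phi_D = lam * phi_true + sigma * np.sqrt(lam) * rng.standard_normal(d) + eps_xi
    mu1_D = lam * phi_D / (lam**2 + sigma**2 * lambda_star)      # posterior mean (4.4)
    err = np.linalg.norm((mu1_D - phi_true)[:K])                 # distance to P_H phi_true
    print(f"sigma_eta = {sigma:.0e}:  error = {err:.2e}")
# the printed errors decrease toward 0 as sigma_eta -> 0, unlike in Prop. 3.2
```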

We show next that the data-adaptive prior leads to a MAP estimator more accurate than the non-degenerate prior’s.

Theorem 4.3 (Expected MSE of the MAP estimator).

Suppose that Assumption 3.1 (A1-A2) holds. Assume in addition that \max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}\leq\lambda_{*}\leq 1. Then, the expected mean square error of the MAP estimator under the data-adaptive prior is no larger than that under the non-degenerate prior, i.e.,

𝔼π0𝒟𝔼η[μ1𝒟ϕtrueLρ22]𝔼π0𝔼η[μ1ϕtrueLρ22],\mathbb{E}_{\pi_{0}^{\scriptscriptstyle\mathcal{D}}}\mathbb{E}_{\eta}\Big{[}\|\mu_{1}^{\scriptscriptstyle\mathcal{D}}-\phi_{true}\|_{L^{2}_{\rho}}^{2}\Big{]}\leq\mathbb{E}_{\pi_{0}}\mathbb{E}_{\eta}\Big{[}\|\mu_{1}-\phi_{true}\|_{L^{2}_{\rho}}^{2}\Big{]}, (4.5)

where the equality holds only when the two priors are the same. Here 𝔼π0𝒟\mathbb{E}_{\pi_{0}^{\scriptscriptstyle\mathcal{D}}} and 𝔼π0\mathbb{E}_{\pi_{0}} denote expectation with ϕtrueπ0𝒟\phi_{true}\sim\pi_{0}^{\scriptscriptstyle\mathcal{D}} and ϕtrueπ0\phi_{true}\sim\pi_{0}, respectively.

Proof.

Note that from (3.6) and (4.4), we have

μ1𝒟ϕtrue\displaystyle\mu_{1}^{\scriptscriptstyle\mathcal{D}}-\phi_{true} =1iKψi(λi+ση2λλi1)1[σηλi1/2ϵiηση2λλi1ϕtrue,i+ϵiξ]+i>Kϕtrue,i,\displaystyle=\sum_{1\leq i\leq K}\psi_{i}\left(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1}\right)^{-1}[\sigma_{\eta}\lambda_{i}^{1/2}\epsilon^{\eta}_{i}-\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1}\phi_{true,i}+\epsilon^{\xi}_{i}]+\sum_{i>K}\phi_{true,i},
μ1ϕtrue\displaystyle\mu_{1}-\phi_{true} =i1ψi(λi+ση2ri1)1[σηλi1/2ϵiηση2ri1ϕtrue,i+ϵiξ].\displaystyle=\sum_{i\geq 1}\psi_{i}\left(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1}\right)^{-1}[\sigma_{\eta}\lambda_{i}^{1/2}\epsilon^{\eta}_{i}-\sigma_{\eta}^{2}r_{i}^{-1}\phi_{true,i}+\epsilon^{\xi}_{i}].

Recall that {ϵiη}\{\epsilon^{\eta}_{i}\} and {ϕtrue,i}\{\phi_{true,i}\} are independent centered Gaussian with ϵiη𝒩(0,1)\epsilon^{\eta}_{i}\sim\mathcal{N}(0,1), ϕtrue,i𝒩(0,λi)\phi_{true,i}\sim\mathcal{N}(0,\lambda_{i}) when ϕtrueπ0𝒟\phi_{true}\sim\pi_{0}^{\scriptscriptstyle\mathcal{D}}, and ϕtrue,i𝒩(0,ri)\phi_{true,i}\sim\mathcal{N}(0,r_{i}) when ϕtrueπ0\phi_{true}\sim\pi_{0}. Then, the expectations of the MSEs 𝔼η[μ1𝒟ϕtrueLρ22]\mathbb{E}_{\eta}\Big{[}\|\mu_{1}^{\scriptscriptstyle\mathcal{D}}-\phi_{true}\|_{L^{2}_{\rho}}^{2}\Big{]} and 𝔼η[μ1ϕtrueLρ22]\mathbb{E}_{\eta}\Big{[}\|\mu_{1}-\phi_{true}\|_{L^{2}_{\rho}}^{2}\Big{]} are

𝔼π0𝒟𝔼η[μ1𝒟ϕtrueLρ22]\displaystyle\mathbb{E}_{\pi_{0}^{\scriptscriptstyle\mathcal{D}}}\mathbb{E}_{\eta}[\|\mu_{1}^{\scriptscriptstyle\mathcal{D}}-\phi_{true}\|_{L^{2}_{\rho}}^{2}] =1iK(λi+ση2λλi1)2[ση2(λi+ση2λ2λi1)+|ϵiξ|2],\displaystyle=\sum_{1\leq i\leq K}(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1})^{-2}[\sigma_{\eta}^{2}(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}^{2}\lambda_{i}^{-1})+|\epsilon^{\xi}_{i}|^{2}], (4.6)
𝔼π0𝔼η[μ1ϕtrueLρ22]\displaystyle\mathbb{E}_{\pi_{0}}\mathbb{E}_{\eta}[\|\mu_{1}-\phi_{true}\|_{L^{2}_{\rho}}^{2}] =1iK(λi+ση2ri1)2[ση2λi+ση4ri1+|ϵiξ|2]+i>K[ri+ση4ri2|ϵiξ|2].\displaystyle=\sum_{1\leq i\leq K}(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1})^{-2}[\sigma_{\eta}^{2}\lambda_{i}+\sigma_{\eta}^{4}r_{i}^{-1}+|\epsilon^{\xi}_{i}|^{2}]+\sum_{i>K}[r_{i}+\sigma_{\eta}^{-4}r_{i}^{2}|\epsilon^{\xi}_{i}|^{2}]\,. (4.7)

When ri=0r_{i}=0 for all i>Ki>K and λi=ri\lambda_{i}=r_{i}, λ=1\lambda_{*}=1 for all iKi\leq K, i.e., when the two priors are the same, the two expectations are equal.

To prove (4.5), it suffices to compare the summations over 1\leq i\leq K in (4.6) and (4.7). Note that

λ1\displaystyle\lambda_{*}\leq 1 λi+ση2λ2λi1λi+ση2λλi1,\displaystyle\implies\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}^{2}\lambda_{i}^{-1}\leq\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1},
max1iK{λiri1}λ\displaystyle\max_{1\leq i\leq K}\{\lambda_{i}r_{i}^{-1}\}\leq\lambda_{*} λi+ση2λλi1λi+ση2ri1.\displaystyle\implies\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1}\geq\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1}.

Then, we have

1iK\displaystyle\sum_{1\leq i\leq K} (λi+ση2λλi1)2[ση2(λi+ση2λ2λi1)+|ϵiξ|2]\displaystyle\left(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1}\right)^{-2}[\sigma_{\eta}^{2}(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}^{2}\lambda_{i}^{-1})+|\epsilon^{\xi}_{i}|^{2}]
1iK(λi+ση2λλi1)1ση2+(λi+ση2λλi1)2|ϵiξ|2\displaystyle\leq\sum_{1\leq i\leq K}\left(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1}\right)^{-1}\sigma_{\eta}^{2}+\left(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1}\right)^{-2}|\epsilon^{\xi}_{i}|^{2}
1iK(λi+ση2ri1)1ση2+(λi+ση2ri1)2|ϵiξ|2\displaystyle\leq\sum_{1\leq i\leq K}\left(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1}\right)^{-1}\sigma_{\eta}^{2}+\left(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1}\right)^{-2}|\epsilon^{\xi}_{i}|^{2}
=1iK(λi+ση2ri1)2[ση2λi+ση4ri1+|ϵiξ|2].\displaystyle=\sum_{1\leq i\leq K}\left(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1}\right)^{-2}[\sigma_{\eta}^{2}\lambda_{i}+\sigma_{\eta}^{4}r_{i}^{-1}+|\epsilon^{\xi}_{i}|^{2}]\,.

In particular, the first inequality is strict if λ<1\lambda_{*}<1, and the second inequality is strict if λi<ri\lambda_{i}<r_{i} for some 1iK1\leq i\leq K. Thus, the inequality in (4.5) is strict if the two priors differ. ∎

Additionally, the next theorem shows that under the condition \lambda_{*}>\max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}, the data-adaptive prior outperforms the non-degenerate prior by producing a posterior with a smaller trace of covariance. We note that this condition is sufficient but not necessary, since the proof is based on a component-wise comparison and does not take into account the part \sum_{i>K}r_{i} (see Remark 4.7 for more discussion).

Theorem 4.4 (Trace of the posterior covariance).

Suppose that Assumption 3.1 (A1-A2) holds. Recall that 𝒬1𝒟\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}} and 𝒬1\mathcal{Q}_{1} are the posterior covariance operators of the data-adaptive prior and the non-degenerate prior in (4.3b) and (3.3), respectively. Then, Tr(𝒬1𝒟)<Tr(𝒬1)Tr(\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}})<Tr(\mathcal{Q}_{1}) if λ>maxiK{λiri1}\lambda_{*}>\max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}. Additionally, when ri=0r_{i}=0 for all i>Ki>K, we have Tr(𝒬1𝒟)>Tr(𝒬1)Tr(\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}})>Tr(\mathcal{Q}_{1}) if λ<miniK{λiri1}\lambda_{*}<\min_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}.

Proof.

By definition, the traces of the two operators are

Tr(𝒬1𝒟)\displaystyle Tr(\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}}) =1iKση2(λi+ση2λλi1)1,\displaystyle=\sum_{1\leq i\leq K}\sigma_{\eta}^{2}(\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1})^{-1}, (4.8)
Tr(𝒬1)\displaystyle Tr(\mathcal{Q}_{1}) =1iKση2(λi+ση2ri1)1+i>Kri.\displaystyle=\sum_{1\leq i\leq K}\sigma_{\eta}^{2}(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1})^{-1}+\sum_{i>K}r_{i}.

Thus, when \lambda_{*}>\max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}, we have (\lambda_{i}+\sigma_{\eta}^{2}\lambda_{*}\lambda_{i}^{-1})^{-1}<(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1})^{-1} for each i\leq K, and hence Tr(\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}})<Tr(\mathcal{Q}_{1}). The last claim follows similarly. ∎

Remark 4.5 (Expected MSE of the MAP and the trace of the posterior covariance).

When there is no model error, we have 𝔼π0𝔼η[μ1ϕtrueLρ22]=Tr(𝒬1)\mathbb{E}_{\pi_{0}}\mathbb{E}_{\eta}\Big{[}\|\mu_{1}-\phi_{true}\|_{L^{2}_{\rho}}^{2}\Big{]}=Tr(\mathcal{Q}_{1}) in (4.8). That is, for the prior π0\pi_{0}, the expected MSE of the MAP estimator is the trace of the posterior covariance [2, Theorem 2]. However, for the data-adaptive prior π0𝒟\pi_{0}^{\scriptscriptstyle\mathcal{D}}, we have 𝔼π0𝒟𝔼η[μ1𝒟ϕtrueLρ22]=Tr(𝒬1𝒟)\mathbb{E}_{\pi_{0}^{\scriptscriptstyle\mathcal{D}}}\mathbb{E}_{\eta}\Big{[}\|\mu_{1}^{\scriptscriptstyle\mathcal{D}}-\phi_{true}\|_{L^{2}_{\rho}}^{2}\Big{]}=Tr(\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}}) if and only if λ=1\lambda_{*}=1, which follows from (4.6) and (4.8). Thus, if maxiK{λiri1}1\max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}\leq 1, a smaller expected MSE of the MAP estimator in Theorem 4.3 implies a smaller trace of the posterior covariance in Theorem 4.4.

Remark 4.6 (A-optimality).

Theorem 4.4 shows that the data-adaptive prior achieves A-optimality among all priors with \{r_{i}\} satisfying \lambda_{*}>\max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}. Here, an A-optimal design is defined as the one that minimizes the trace of the posterior covariance operator within a certain class ([2] and [8]); it is equivalent to minimizing the expected MSE of the MAP estimator (which equals Tr(\mathcal{Q}_{1})) through an optimal choice of the prior \pi_{0}. Thus, in our context, the A-optimal design seeks a prior with \{r_{i}\}_{i\geq 1} in a certain class such that g(r_{1},\ldots,r_{K}):=Tr(\mathcal{Q}_{1})=\sum_{i\leq K}\sigma_{\eta}^{2}(\lambda_{i}+\sigma_{\eta}^{2}r_{i}^{-1})^{-1} is minimized, and the data-adaptive prior attains this optimum within the above class of priors.

Remark 4.7 (Conditions on the spectra).

The condition \max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}\leq\lambda_{*} in Theorems 4.3–4.4 is far from necessary, since the proofs are based on a component-wise comparison in the sum and do not take into account the part \sum_{i>K}r_{i}. In practice, the optimal \lambda_{*} is often much smaller than the maximal ratio \max_{i\leq K}\{\lambda_{i}r_{i}^{-1}\}; it depends on the dataset and, in particular, depends nonlinearly on all the elements involved (see Figures 7–8 in Appendix A.3). Thus, a full analysis with an optimal \lambda_{*} is beyond the scope of this study.

5 Computational practice

We have followed the wisdom of [53] to “avoid discretization until the last possible moment”, presenting the analysis of the distributions on L^{2}_{\rho} in terms of operators. In the same spirit, we have avoided selecting a basis for the function space until the last possible moment. That moment arrives now. Based on the abstract theory in the previous sections, we present the implementation of the data-adaptive prior in computational practice. We demonstrate it on Toeplitz matrices and integral operators, which represent finite- and infinite-dimensional function spaces of learning, respectively.

In computational practice, the goal is to estimate the coefficient c=(c1,,cl)l×1c=(c_{1},\ldots,c_{l})^{\top}\in\mathbb{R}^{l\times 1} of ϕ=i=1lciϕi\phi=\sum_{i=1}^{l}c_{i}\phi_{i} in a prescribed hypothesis space =span{ϕi}i=1lLρ2\mathcal{H}=\mathrm{span}\{\phi_{i}\}_{i=1}^{l}\subset L^{2}_{\rho} with ll\leq\infty, where the basis functions {ϕi}\{\phi_{i}\} can be the B-splines, polynomials, or wavelets. Then, the prior and posterior are represented by distributions of clc\in\mathbb{R}^{l}. Note that the pre-specified basis {ϕi}\{\phi_{i}\} is rarely orthonormal in Lρ2L^{2}_{\rho} because ρ\rho varies with data. Hence, we only require that the basis matrix

B=[ϕi,ϕjLρ2]1i,jlB=[\langle{\phi_{i},\phi_{j}}\rangle_{L^{2}_{\rho}}]_{1\leq i,j\leq l} (5.1)

is non-singular, i.e., the basis functions are linearly independent in Lρ2L^{2}_{\rho}. This simple requirement reduces redundancy in basis functions.

In terms of cc, the negative log-likelihood in (2.9) for ϕ=i=1lciϕi\phi=\sum_{i=1}^{l}c_{i}\phi_{i} reads

(c)\displaystyle\mathcal{E}(c) =G¯ϕ,ϕLρ22ϕ𝒟,ϕLρ2+CNf\displaystyle=\langle{{\mathcal{L}_{\overline{G}}}\phi,\phi}\rangle_{L^{2}_{\rho}}-2\langle{\phi^{\scriptscriptstyle\mathcal{D}},\phi}\rangle_{L^{2}_{\rho}}+C_{N}^{f} (5.2)
=cA¯c2cb¯+CNf,\displaystyle=c^{\top}{\overline{A}}c-2c^{\top}{\overline{b}}+C_{N}^{f},

where the regression matrix A¯{\overline{A}} and vector b¯{\overline{b}} are given by

A¯(i,j)=1N1kNRϕi[uk],Rϕj[uk]𝕐=G¯ϕi,ϕjLρ2,\displaystyle{\overline{A}}(i,j)=\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\phi_{i}}[u^{k}],R_{\phi_{j}}[u^{k}]}\rangle_{\mathbb{Y}}=\langle{{\mathcal{L}_{\overline{G}}}\phi_{i},\phi_{j}}\rangle_{L^{2}_{\rho}}, (5.3a)
b¯(i)=1N1kNRϕi[uk],fk𝕐=ϕi,ϕ𝒟Lρ2.\displaystyle{\overline{b}}(i)=\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\phi_{i}}[u^{k}],f^{k}}\rangle_{\mathbb{Y}}=\langle{\phi_{i},\phi^{\scriptscriptstyle\mathcal{D}}}\rangle_{L^{2}_{\rho}}. (5.3b)

The maximum likelihood estimator (MLE) \widehat{c}={\overline{A}}^{-1}{\overline{b}} is the default choice of solution when {\overline{b}} is in the range of {\overline{A}}. However, the MLE is ill-defined when {\overline{b}} is not in the range of {\overline{A}}, which may happen when there is model error (or computational error due to incomplete data, as discussed after Theorem 2.7). A Bayesian approach makes the inverse problem well-posed by introducing a prior.

We will compare our data-adaptive prior with the widely-used standard Gaussian prior on the coefficient, that is, cπ0=𝒩(0,Q0)c\sim\pi_{0}=\mathcal{N}(0,Q_{0}) with Q0=IQ_{0}=I, the identity matrix on l\mathbb{R}^{l}. This prior leads to a posterior π1=𝒩(m1,Q1)\pi_{1}=\mathcal{N}(m_{1},Q_{1}) with

m_{1}=({\overline{A}}+\sigma^{2}_{\eta}I)^{-1}{\overline{b}},\quad Q_{1}=\sigma_{\eta}^{2}({\overline{A}}+\sigma^{2}_{\eta}I)^{-1}. (5.4)

5.1 Data-adaptive prior and posterior of the coefficient

The next proposition computes the prior and posterior distributions of the random coefficient c=(c1,,cl)l×1c=(c_{1},\ldots,c_{l})^{\top}\in\mathbb{R}^{l\times 1} of the Lρ2L^{2}_{\rho}-valued random variable ϕ=iciϕi\phi=\sum_{i}c_{i}\phi_{i} with the data-adaptive prior (4.2).

Proposition 5.1.

Assume that {ϕi}i1\{\phi_{i}\}_{i\geq 1} is a complete basis of Lρ2L^{2}_{\rho} that may not be orthonormal, and the basis matrix BB of {ϕi}i=1l\{\phi_{i}\}_{i=1}^{l} in (5.1) is invertible. Denote ϕ=iciϕi\phi=\sum_{i}c_{i}\phi_{i} the Lρ2L^{2}_{\rho}-valued random variable with the data-adaptive prior in (4.2). Then, the prior and posterior distributions of c=(c1,,cl)l×1c=(c_{1},\ldots,c_{l})^{\top}\in\mathbb{R}^{l\times 1} are 𝒩(0,Q0𝒟)\mathcal{N}(0,Q_{0}^{\scriptscriptstyle\mathcal{D}}) and 𝒩(m1𝒟,Q1𝒟)\mathcal{N}(m_{1}^{\scriptscriptstyle\mathcal{D}},Q_{1}^{\scriptscriptstyle\mathcal{D}}) with

Q0𝒟\displaystyle Q_{0}^{\scriptscriptstyle\mathcal{D}} =λ1B1A¯B1,\displaystyle=\lambda_{*}^{-1}B^{-1}{\overline{A}}B^{-1}\,, (5.5a)
Q1𝒟\displaystyle Q_{1}^{\scriptscriptstyle\mathcal{D}} =ση2(A¯+ση2λBA¯1B)1,m1𝒟=ση2Q1𝒟b¯,\displaystyle=\sigma_{\eta}^{2}({\overline{A}}+\sigma_{\eta}^{2}\lambda_{*}B{\overline{A}}^{-1}B)^{-1},\quad m_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{-2}Q_{1}^{\scriptscriptstyle\mathcal{D}}{\overline{b}}, (5.5b)

where {\overline{A}} and {\overline{b}} are defined in (5.3).

Proof.

The prior covariance (5.5a) follows directly from the definition of the data-adaptive prior in (4.2) and Lemma A.2. The posterior covariance and mean follow from the likelihood in (5.2) and the Q0𝒟Q_{0}^{\scriptscriptstyle\mathcal{D}} above:

dπ1𝒟(c)dcexp(12[ση2(cA¯c2cb¯+CNf)+c(Q0𝒟)1c]).\frac{d\pi_{1}^{\scriptscriptstyle\mathcal{D}}(c)}{dc}\propto\exp\left(-\frac{1}{2}\big{[}\sigma_{\eta}^{-2}(c^{\top}{\overline{A}}c-2c^{\top}{\overline{b}}+C_{N}^{f})+c^{\top}(Q_{0}^{{\scriptscriptstyle\mathcal{D}}})^{-1}c\big{]}\right).

Thus, completing the squares in the exponent, we obtain (5.5b). ∎
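For reference, a minimal Python sketch (ours, not the authors' code) of the coefficient-space formulas (5.4) and (5.5), assuming the regression matrix {\overline{A}}, the basis matrix B, the vector {\overline{b}}, the noise level \sigma_{\eta} and the hyper-parameter \lambda_{*} have already been assembled; pseudo-inverses guard against a rank-deficient {\overline{A}}.

```python
import numpy as np

def fixed_prior_posterior(Abar, bbar, sigma_eta):
    """Posterior N(m1, Q1) in (5.4) for the standard Gaussian prior N(0, I)."""
    M = Abar + sigma_eta**2 * np.eye(Abar.shape[0])
    return np.linalg.solve(M, bbar), sigma_eta**2 * np.linalg.inv(M)

def data_adaptive_posterior(Abar, B, bbar, sigma_eta, lambda_star):
    """Posterior N(m1_D, Q1_D) in (5.5b) for the data-adaptive prior (5.5a)."""
    Breg = B @ np.linalg.pinv(Abar) @ B            # B Abar^{-1} B (pinv if rank-deficient)
    M = Abar + sigma_eta**2 * lambda_star * Breg
    Minv = np.linalg.pinv(M)                       # pinv: directions outside the FSOI get zero
    m1D = Minv @ bbar                              # = sigma_eta^{-2} Q1D bbar
    Q1D = sigma_eta**2 * Minv
    return m1D, Q1D
```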

Remark 5.2 (Relation between distributions of the coefficient and the function).

The prior and posterior distributions of the coefficient c and the function \phi=\sum_{i=1}^{l}c_{i}\phi_{i} are different: the former depends on the basis \{\phi_{i}\}_{i=1}^{l}, while the latter does not. The relation between the distributions of c and \phi is characterized by Lemmas A.2–A.3. Specifically, if c\sim\mathcal{N}(0,Q) and \phi=\sum_{i=1}^{l}c_{i}\phi_{i} has a Gaussian measure \mathcal{N}(0,\mathcal{Q}) on \mathcal{H}=\mathrm{span}\{\phi_{i}\}_{i=1}^{l}, then A:=(\langle{\phi_{i},\mathcal{Q}\phi_{j}}\rangle)=BQB, provided that B in (5.1) is strictly positive definite. Additionally, when computing the trace of the operator \mathcal{Q}, we solve a generalized eigenvalue problem Av=\lambda Bv, which follows from the proof of Proposition 5.6 below.

Remark 5.3 (Relation to the basis matrix of the RKHS).

The matrix B^{-1}{\overline{A}}B^{-1} in the covariance Q_{0}^{\scriptscriptstyle\mathcal{D}} in (5.5a) is the pseudo-inverse of the basis matrix of \{\phi_{i}\} in the RKHS H_{G} defined in Lemma 2.8, that is, B_{rkhs}(i,j)=\langle{\phi_{i},{\mathcal{L}_{\overline{G}}}^{-1}\phi_{j}}\rangle_{L^{2}_{\rho}}=\langle{\phi_{i},\phi_{j}}\rangle_{H_{G}}, assuming that the basis functions \{\phi_{i}\} are in the RKHS. Computing the matrix B_{rkhs} involves a generalized eigenvalue problem for the eigenvalues of {\mathcal{L}_{\overline{G}}} (see Proposition 5.6).

We select the hyper-parameter λ\lambda_{*} by the L-curve method in [24]. The L-curve is a log-log plot of the curve l(λ)=(y(λ),x(λ))l(\lambda)=(y(\lambda),x(\lambda)) with y(λ)2=cλBA¯1Bcλy(\lambda)^{2}=c_{\lambda}^{\top}B{\overline{A}}^{-1}Bc_{\lambda} and x(λ)2=(cλ)x(\lambda)^{2}=\mathcal{E}(c_{\lambda}), where cλ=(A¯+λBA¯1B)1b¯c_{\lambda}=({\overline{A}}+\lambda B{\overline{A}}^{-1}B)^{-1}{\overline{b}}. The L-curve method maximizes the curvature to balance between the minimization of the likelihood and the control of the regularization:

\lambda_{*}=\underset{\lambda_{\min}\leq\lambda\leq\lambda_{\max}}{\mathrm{argmax}}\,\kappa(l(\lambda)),\qquad\kappa(l(\lambda))=\frac{x^{\prime}y^{\prime\prime}-x^{\prime\prime}y^{\prime}}{\big((x^{\prime})^{2}+(y^{\prime})^{2}\big)^{3/2}},

where λmin\lambda_{\min} and λmax\lambda_{\max} are the smallest and the largest generalized eigenvalues of (A¯,B)({\overline{A}},B).
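A possible implementation of this L-curve selection is sketched below (our code, with a hypothetical function name); the curvature is approximated by finite differences of the log-log curve on a logarithmic grid of \lambda between the extreme positive generalized eigenvalues, and the constant C_{N}^{f} is omitted from the loss.

```python
import numpy as np
from scipy.linalg import eigh

def select_lambda_lcurve(Abar, B, bbar, n_grid=100):
    """Pick lambda_* by maximizing the curvature of the (log-log) L-curve."""
    gev = eigh(Abar, B, eigvals_only=True)                # generalized eigenvalues of (Abar, B)
    pos = gev[gev > 1e-12]                                # guard against zero eigenvalues
    lambdas = np.logspace(np.log10(pos.min()), np.log10(pos.max()), n_grid)

    Breg = B @ np.linalg.pinv(Abar) @ B
    logx, logy = np.empty(n_grid), np.empty(n_grid)
    for j, lam in enumerate(lambdas):
        c = np.linalg.lstsq(Abar + lam * Breg, bbar, rcond=None)[0]
        loss = c @ Abar @ c - 2 * c @ bbar                # E(c) without the constant C_N^f
        reg = c @ Breg @ c                                # y(lambda)^2 = c^T B Abar^{-1} B c
        logx[j] = 0.5 * np.log(np.abs(loss) + 1e-30)
        logy[j] = 0.5 * np.log(reg + 1e-30)

    t = np.log(lambdas)                                   # parametrize the curve by log(lambda)
    x1, y1 = np.gradient(logx, t), np.gradient(logy, t)
    x2, y2 = np.gradient(x1, t), np.gradient(y1, t)
    kappa = (x1 * y2 - x2 * y1) / (x1**2 + y1**2) ** 1.5
    return lambdas[np.nanargmax(kappa)]
```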

We summarize the priors and posteriors in computation in Table 3.

Table 3: Priors and posteriors of the coefficients cc of ϕ=i=1lciϕiLρ2\phi=\sum_{i=1}^{l}c_{i}\phi_{i}\in\mathcal{H}\subset L^{2}_{\rho}.
Gaussian measure Mean Covariance
π0=𝒩(m0,Q0)\pi_{0}=\mathcal{N}(m_{0},Q_{0}) m0=0m_{0}=0 Q0=IQ_{0}=I
π1=𝒩(m1,Q1)\pi_{1}=\mathcal{N}(m_{1},Q_{1}) m1=(A¯+ση2I)1b¯m_{1}=({\overline{A}}+\sigma_{\eta}^{2}I)^{-1}{\overline{b}} Q1=ση2(A¯+ση2I)1Q_{1}=\sigma_{\eta}^{2}({\overline{A}}+\sigma_{\eta}^{2}I)^{-1}
π0𝒟=𝒩(m0𝒟,Q0𝒟)\pi_{0}^{\scriptscriptstyle\mathcal{D}}=\mathcal{N}(m_{0}^{\scriptscriptstyle\mathcal{D}},Q_{0}^{\scriptscriptstyle\mathcal{D}}) m0𝒟=0m_{0}^{\scriptscriptstyle\mathcal{D}}=0 Q0𝒟=λ1B1A¯B1Q_{0}^{\scriptscriptstyle\mathcal{D}}=\lambda_{*}^{-1}B^{-1}{\overline{A}}B^{-1}
π1𝒟=𝒩(m1𝒟,Q1𝒟)\pi_{1}^{\scriptscriptstyle\mathcal{D}}=\mathcal{N}(m_{1}^{\scriptscriptstyle\mathcal{D}},Q_{1}^{\scriptscriptstyle\mathcal{D}}) m1𝒟=ση2Q1𝒟b¯m_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{-2}Q_{1}^{\scriptscriptstyle\mathcal{D}}{\overline{b}} Q1𝒟=ση2(A¯+ση2λBA¯1B)1Q_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{2}({\overline{A}}+\sigma_{\eta}^{2}\lambda_{*}B{\overline{A}}^{-1}B)^{-1}
Remark 5.4 (Avoiding pseudo-inverse of singular matrix).

The inverse of the matrix in Q1𝒟Q_{1}^{\scriptscriptstyle\mathcal{D}} in (5.5b) can cause a large numerical error when A¯{\overline{A}} is singular or severely ill-conditioned. We increase the numerical stability by avoiding A¯1{\overline{A}}^{-1}: let D=B1A¯1/2D=B^{-1}{\overline{A}}^{1/2} and write Q1𝒟Q_{1}^{\scriptscriptstyle\mathcal{D}} as

Q_{1}^{\scriptscriptstyle\mathcal{D}}=\sigma_{\eta}^{2}({\overline{A}}+\sigma_{\eta}^{2}\lambda_{*}B{\overline{A}}^{-1}B)^{-1}=\sigma_{\eta}^{2}D(D^{\top}{\overline{A}}D+\sigma_{\eta}^{2}\lambda_{*}I)^{-1}D^{\top}. (5.6)
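A sketch of this stabilized computation (ours), assuming {\overline{A}} may be rank-deficient; the square root {\overline{A}}^{1/2} is formed from a symmetric eigendecomposition with small negative eigenvalues clipped to zero.

```python
import numpy as np

def stable_Q1D(Abar, B, sigma_eta, lambda_star):
    """Posterior covariance Q1_D via (5.6), without inverting Abar."""
    w, U = np.linalg.eigh((Abar + Abar.T) / 2)       # symmetrize, then eigendecompose
    A_half = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T   # Abar^{1/2}
    D = np.linalg.solve(B, A_half)                   # D = B^{-1} Abar^{1/2}
    inner = D.T @ Abar @ D + sigma_eta**2 * lambda_star * np.eye(Abar.shape[0])
    return sigma_eta**2 * D @ np.linalg.solve(inner, D.T)
```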
Remark 5.5 (Relation to Zellner’s g-prior).

When the basis functions of the hypothesis space are orthonormal in L^{2}_{\rho} (that is, the basis matrix B=(\langle{\phi_{i},\phi_{j}}\rangle_{L^{2}_{\rho}})_{1\leq i,j\leq l}=I), we have Q_{0}^{{\scriptscriptstyle\mathcal{D}}}=\lambda_{*}^{-1}{\overline{A}}. Thus, up to the scaling factor \lambda_{*}^{-1}, we recover the well-known Zellner's g-prior [1, 4, 59].

The next proposition shows that the eigenvalues of G¯{\mathcal{L}_{\overline{G}}} are solved by a generalized eigenvalue problem. Its proof is deferred to Appendix A.1.

Proposition 5.6.

Assume that the hypothesis space satisfies \mathcal{H}=\mathrm{span}\{\phi_{i}\}_{i=1}^{l}\supseteq{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}) with l\leq\infty, where {\mathcal{L}_{\overline{G}}}:L^{2}_{\rho}\to L^{2}_{\rho} is the integral operator in (2.7). Let {\overline{A}} and {\overline{b}} be defined in (5.3). Then, the operator {\mathcal{L}_{\overline{G}}} has eigenvalues (\lambda_{1},\ldots,\lambda_{l}) solved by the generalized eigenvalue problem with B in (5.1):

A¯V=BVΛ,s.t.,VBV=I,Λ=Diag(λ1,,λl).{\overline{A}}V=BV\Lambda,\quad s.t.,V^{\top}BV=I,\quad\Lambda=\mathrm{Diag}(\lambda_{1},\ldots,\lambda_{l}). (5.7)

and the corresponding eigenfunctions of G¯{\mathcal{L}_{\overline{G}}} are {ψk=j=1lVjkϕj}\{\psi_{k}=\sum_{j=1}^{l}V_{jk}\phi_{j}\}. Additionally, for any ϕ=ilciϕi\phi=\sum_{i}^{l}c_{i}\phi_{i} in G¯1/2(Lρ2){\mathcal{L}_{\overline{G}}}^{1/2}(L^{2}_{\rho}), we have ϕ,G¯1ϕLρ2=cBrkhsc\langle{\phi,{\mathcal{L}_{\overline{G}}}^{-1}\phi}\rangle_{L^{2}_{\rho}}=c^{\top}B_{rkhs}c with

Brkhs=(VΛV)1=BA¯1B.B_{rkhs}=(V\Lambda V^{\top})^{-1}=B{\overline{A}}^{-1}B.
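In code, (5.7) is a standard generalized symmetric eigenvalue problem; a minimal sketch with scipy (ours, assuming B is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def spectrum_and_rkhs_matrix(Abar, B):
    """Solve Abar V = B V Lambda with V^T B V = I; return the eigenvalues of L_Gbar,
    the eigenvector matrix V, and B_rkhs = (V Lambda V^T)^{-1}."""
    lam, V = eigh(Abar, B)                 # scipy normalizes so that V.T @ B @ V = I
    lam, V = lam[::-1], V[:, ::-1]         # descending order
    Brkhs = np.linalg.pinv(V @ np.diag(lam) @ V.T)   # pseudo-inverse: zero eigenvalues allowed
    return lam, V, Brkhs
```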

5.2 Example: discrete kernels in Toeplitz matrices

The Toeplitz matrix in Example 2.1 has a vector kernel, which lies in a finite-dimensional function space of learning Lρ2L^{2}_{\rho}. It provides a typical example of discrete kernels. We use the simplest case of a 2×22\times 2 Toeplitz matrix to demonstrate the data-adaptive function space of identifiability and the advantages of the data-adaptive prior.

We aim to recover the kernel \phi\in\mathbb{R}^{2n-1} in the n\times n Toeplitz matrix from measurement data \{(u^{k},f^{k})\in\mathbb{R}^{n}\times\mathbb{R}^{n}\}_{k=1}^{N} by fitting the data to the model (2.1). The kernel is a function \phi:\mathcal{S}\to\mathbb{R} on \mathcal{S}=\{r_{l}\}_{l=1}^{2n-1} with r_{l}=l-n, identified with a vector in \mathbb{R}^{2n-1}. Since R_{\phi}[u] is linear in \phi, there is a matrix L_{u}\in\mathbb{R}^{n\times(2n-1)} such that R_{\phi}[u]=L_{u}\phi. Note that L_{u} is linear in u since R_{\phi}[u] is; hence only linearly independent data \{u^{k}\}_{k=1}^{N} bring new information for the recovery of \phi.

A least squares estimator (LSE) of ϕ2n1\phi\in\mathbb{R}^{2n-1} is

ϕ^=A¯1b¯, with A¯=1N1kNLukLuk,b¯=1N1kNLukfk,\widehat{\phi}={\overline{A}}^{-1}{\overline{b}},\,\text{ with }{\overline{A}}=\frac{1}{N}\sum_{1\leq k\leq N}L_{u^{k}}^{\top}L_{u^{k}},\quad{\overline{b}}=\frac{1}{N}\sum_{1\leq k\leq N}L_{u^{k}}^{\top}f^{k},

Here {\overline{A}}^{-1} denotes the pseudo-inverse when {\overline{A}} is singular. However, the pseudo-inverse is unstable to perturbations, and the inverse problem is ill-posed.

We only need to identify the basis matrix B in (5.1) to obtain the data-adaptive prior and its posterior in Table 3. The basis matrix requires two elements: the exploration measure and the basis functions. Here the exploration measure \rho in (2.4) is \rho(r_{l})=Z^{-1}\sum_{1\leq k\leq N}\sum_{0\leq i,j\leq n-1}\delta(i-j-r_{l})|u_{j}^{k}| with r_{l}\in\mathcal{S}, where Z=n\sum_{k=1}^{N}\sum_{i=0}^{n-1}|u_{i}^{k}| is the normalizing constant. Meanwhile, the implicit hypothesis space for the above vector \phi=\sum_{i=1}^{2n-1}c_{i}\phi_{i} with c_{i}=\phi(r_{i}) is \mathcal{H}=\mathrm{span}\{\phi_{i}\}_{i=1}^{2n-1}=\mathbb{R}^{2n-1} with basis \phi_{i}(r)=\delta(r_{i}-r)\in L^{2}(\mathcal{S},\mathbb{R}), where \delta is the Kronecker delta function. Then, the basis matrix of \{\phi_{i}(r)=\delta(r_{i}-r)\} in L^{2}_{\rho}, as defined in (5.1), is B=\mathrm{Diag}(\rho). Thus, if \rho is not strictly positive, this basis matrix is singular and these basis functions are linearly dependent (hence redundant) in L^{2}_{\rho}. In such a case, we select a linearly independent basis for L^{2}_{\rho}, which is a proper subspace of \mathbb{R}^{2n-1}, and use the pseudo-inverses of {\overline{A}} and B to remove the redundant rows. Additionally, since the vector \phi coincides with its coefficient c, the priors and posteriors in Table 2 and Table 3 are the same.
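The following sketch (ours, with hypothetical helper names) assembles L_{u}, {\overline{A}}, the exploration measure \rho and the basis matrix B=\mathrm{Diag}(\rho) for the Toeplitz model; for the dataset \{u^{3}=(1,1)\} used below it reproduces \rho=(1/4,1/2,1/4).

```python
import numpy as np

def L_u(u):
    """Regression matrix of the Toeplitz model: (R_phi[u])_i = sum_r phi(r) u_{i-r}."""
    n = len(u)
    L = np.zeros((n, 2 * n - 1))
    for i in range(n):
        for l in range(2 * n - 1):
            r = l - (n - 1)                      # shift r ranges over -(n-1), ..., n-1
            if 0 <= i - r < n:
                L[i, l] = u[i - r]
    return L

def exploration_measure(us):
    """rho(r_l) proportional to sum_k sum_{i-j = r_l} |u_j^k|, normalized to sum to 1."""
    n = len(us[0])
    w = np.zeros(2 * n - 1)
    for u in us:
        for i in range(n):
            for j in range(n):
                w[(i - j) + (n - 1)] += abs(u[j])
    return w / w.sum()

us = [np.array([1.0, 1.0])]                      # the dataset {u^3} of Table 4 (n = 2)
rho = exploration_measure(us)                    # -> (1/4, 1/2, 1/4)
B = np.diag(rho)                                 # basis matrix of the delta basis in L^2_rho
Abar = sum(L_u(u).T @ L_u(u) for u in us) / len(us)
# the generalized eigenvalues of (Abar, B) are {8, 4, 0}, as reported in Table 4
```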

Toeplitz matrix with n=2n=2.

Table 4 shows three representative datasets for the inverse problem: (1) the dataset {u1=(1,0)}\{u^{1}=(1,0)\} leads to a well-posed inverse problem in Lρ2L^{2}_{\rho} though it appears ill-posed in 3\mathbb{R}^{3}, (2) the dataset {u1,u2=(0,1)}\{u^{1},u^{2}=(0,1)\} leads to a well-posed inverse problem, and (3) the dataset {u3=(1,1)}\{u^{3}=(1,1)\} leads to an ill-posed inverse problem and our data-adaptive prior significantly improves the accuracy of the posterior, see Table 5. Computational details are presented in Appendix A.3.

Table 4: The exploration measure, the FSOI and the eigenvalues of G¯{\mathcal{L}_{\overline{G}}} for learning the kernel in a 2×22\times 2 Toeplitz matrix from 3 typical datasets.
Data {uk}\{u^{k}\} ρ\rho on {1,0,1}\{-1,0,1\} FOSI Eigenvalues of G¯{\mathcal{L}_{\overline{G}}}
{u1=(1,0)}\{u^{1}=(1,0)^{\top}\} (0,12,12)(0,\frac{1}{2},\frac{1}{2}) span{ϕ2,ϕ3}=Lρ2\mathrm{span}\{\phi_{2},\phi_{3}\}=L^{2}_{\rho} {1,1}\{1,1\}
{u1,u2=(0,1)}\{u^{1},u^{2}=(0,1)\} (14,12,14)(\frac{1}{4},\frac{1}{2},\frac{1}{4}) span{ϕ1,ϕ2,ϕ3}=Lρ2\mathrm{span}\{\phi_{1},\phi_{2},\phi_{3}\}=L^{2}_{\rho} {2,2,2}\{2,2,2\}
{u3=(1,1)}\{u^{3}=(1,1)^{\top}\} (14,12,14)(\frac{1}{4},\frac{1}{2},\frac{1}{4}) span{ψ1,ψ2}Lρ2\mathrm{span}\{\psi_{1},\psi_{2}\}\subsetneq L^{2}_{\rho} {8,4,0}\{8,4,0\}
The basis {ϕi}\{\phi_{i}\} are defined as (ϕ1,ϕ2,ϕ3)=I3(\phi_{1},\phi_{2},\phi_{3})=I_{3}. For the dataset {u3}\{u^{3}\}, the eigenvectors of G¯{\mathcal{L}_{\overline{G}}} in Lρ2L^{2}_{\rho} are ψ1=(1,1,1)\psi_{1}=(1,1,1)^{\top}, ψ2=(2,0,2),\psi_{2}=(-\sqrt{2},0,\sqrt{2})^{\top}, and ψ3=(1,1,1)\psi_{3}=(1,-1,1)^{\top}, see the text for more details.
Table 5: Performance of the posteriors in learning the kernel of a Toeplitz matrix.
ϕtrue\phi_{true} Bias of m1m_{1} Bias of m1𝒟m_{1}^{\scriptscriptstyle\mathcal{D}} Tr(𝒬1)Tr(\mathcal{Q}_{1}) Tr(𝒬1𝒟)Tr(\mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}})
(1,1,1)(1,1,1)^{\top}\in FSOI 0.34±0.010.34\pm 0.01 0.10±0.11\bm{0.10\pm 0.11} 0.34±0.000.34\pm 0.00 0.0037±0.00\bm{0.0037\pm 0.00}
(1,0,1)(1,0,1)^{\top}\notin FSOI 0.94±0.010.94\pm 0.01 0.66±0.09\bm{0.66\pm 0.09} 0.34±0.000.34\pm 0.00 0.0037±0.00\bm{0.0037\pm 0.00}
* We compute the means and standard deviations of the relative errors of the posterior means (“bias of m_{1}” and “bias of m_{1}^{\scriptscriptstyle\mathcal{D}}”) and the traces of the covariances of the posteriors. They are computed over 100 independent datasets in which f^{3} is observed with random noise sampled from \mathcal{N}(0,\sigma_{\eta}^{2}) with \sigma_{\eta}=0.1, and the u data is \{u^{3}=(1,1)\}. The relative bias of each estimator m is computed by \|m-\phi_{true}\|_{L^{2}_{\rho}}/\|\phi_{true}\|_{L^{2}_{\rho}}. The standard deviations of the traces are less than 10^{-5}.

Table 5 demonstrates the significant advantage of the data-adaptive prior over the non-degenerate prior in the case of the third dataset. We examine the performance of the posterior in two aspects: the trace of its covariance operator and the bias of its mean. Following Remark 5.2, we compute the trace of the posterior covariance operator by solving a generalized eigenvalue problem. Table 5 presents the means and standard deviations of the traces and the relative errors of the posterior means. It considers two cases: \phi_{true}=\psi_{1} in the FSOI and \phi_{true}=(1,0,1)^{\top}=0.5\psi_{1}+0.5\psi_{3} outside of the FSOI (see Table 4). We highlight two observations.

  • The posterior mean m1𝒟m_{1}^{\scriptscriptstyle\mathcal{D}} is much more accurate than m1m_{1}. When ϕtrue\phi_{true} is in the FSOI, m1𝒟m_{1}^{\scriptscriptstyle\mathcal{D}} is relatively accurate. When ϕtrue\phi_{true} is outside the FSOI, the major bias comes from the part outside the FSOI, i.e., the part 0.5ψ30.5\psi_{3} leads to a relatively large error.

  • The trace of \mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}} is significantly smaller than that of \mathcal{Q}_{1}. Here \mathcal{Q}_{1}^{\scriptscriptstyle\mathcal{D}} has a zero eigenvalue in the direction outside of the FSOI, while \mathcal{Q}_{1} is full rank.

Discrete inverse problem: stable small noise limit.

For the discrete inverse problem of solving \phi\in\mathbb{R}^{m} in L_{k}\phi=f_{k} for 1\leq k\leq N, the next proposition shows that Assumption 3.1 (A3) does not hold, regardless of the presence of model error or missing data in f_{k}. Thus, the posterior mean has a small noise limit. Numerical tests confirm that {\overline{b}} is always in the range of the operator {\overline{A}} and that m_{1} has a small noise limit regardless of the model error (e.g., \xi(u)=0.01u|u|^{2}) or computational error due to missing data. However, for continuous inverse problems that estimate a continuous function \phi, Assumption 3.1 (A3) holds when {\overline{b}} is computed using regression arrays different from those in {\overline{A}}, due to discretization or missing data (see Section 5.3) or errors in integration by parts [30].

Proposition 5.7.

Let A¯=1kNLkLk{\overline{A}}=\sum_{1\leq k\leq N}L_{k}^{\top}L_{k} and b¯=1kNLkfk{\overline{b}}=\sum_{1\leq k\leq N}L_{k}^{\top}f_{k}, where Lkn×mL_{k}\in\mathbb{R}^{n\times m} and fkn×1f_{k}\in\mathbb{R}^{n\times 1} for each 1kN1\leq k\leq N. Then, b¯Range(A¯){\overline{b}}\in\text{Range}({\overline{A}}).

Proof.

First, we show that it suffices to consider rank-1 arrays L_{k}. The SVD (singular value decomposition) of each L_{k} gives L_{k}=\sum_{1\leq i\leq n_{k}}\sigma_{k,i}w_{k,i}v_{k,i}^{\top}, where \{\sigma_{k,i},w_{k,i},v_{k,i}\} are the singular values and the left and right singular vectors, which are orthonormal, i.e., w_{k,i}^{\top}w_{k,j}=\delta_{i,j} and v_{k,i}^{\top}v_{k,j}=\delta_{i,j}. Denote L_{k,i}=\sigma_{k,i}w_{k,i}v_{k,i}^{\top}, which is rank-1. Note that L_{k}^{\top}L_{k}=\sum_{1\leq i,j\leq n_{k}}\sigma_{k,i}\sigma_{k,j}v_{k,i}w_{k,i}^{\top}w_{k,j}v_{k,j}^{\top}=\sum_{1\leq i\leq n_{k}}\sigma_{k,i}^{2}v_{k,i}v_{k,i}^{\top}=\sum_{1\leq i\leq n_{k}}L_{k,i}^{\top}L_{k,i}. Thus, we can write {\overline{A}}=\sum_{1\leq k\leq N}\sum_{1\leq i\leq n_{k}}L_{k,i}^{\top}L_{k,i} and {\overline{b}}=\sum_{1\leq k\leq N}\sum_{1\leq i\leq n_{k}}L_{k,i}^{\top}f_{k} in terms of rank-1 arrays.

Next, for each k, write the rank-1 array as L_{k}=\sigma_{k}w_{k}v_{k}^{\top} with w_{k}\in\mathbb{R}^{n\times 1} and v_{k}\in\mathbb{R}^{m\times 1} both being unit vectors. Then, {\overline{A}}=\sum_{1\leq k\leq N}\sigma_{k}^{2}v_{k}w_{k}^{\top}w_{k}v_{k}^{\top}=\sum_{1\leq k\leq N}\sigma_{k}^{2}v_{k}v_{k}^{\top}, and \text{Range}({\overline{A}})=\mathrm{span}\{v_{k}\}_{k=1}^{N} (where the v_{k}'s can be linearly dependent). Therefore, {\overline{b}}=\sum_{1\leq k\leq N}\sigma_{k}v_{k}w_{k}^{\top}f_{k}=\sum_{1\leq k\leq N}\sigma_{k}(w_{k}^{\top}f_{k})v_{k} is in the range of {\overline{A}} because w_{k}^{\top}f_{k} is a scalar. ∎
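Proposition 5.7 is also easy to confirm numerically: the orthogonal projection of {\overline{b}} onto \mathrm{Range}({\overline{A}}) leaves a residual at round-off level even when the f_{k} are arbitrary (e.g., carry model error). A small sketch (ours) with randomly generated rank-deficient data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 2, 7, 3                                # N*n < m: the problem is rank-deficient
Ls = [rng.normal(size=(n, m)) for _ in range(N)]
fs = [rng.normal(size=n) for _ in range(N)]      # arbitrary f_k (model error allowed)

Abar = sum(L.T @ L for L in Ls)
bbar = sum(L.T @ f for L, f in zip(Ls, fs))

# orthogonal projection of bbar onto Range(Abar) and its residual
proj = Abar @ np.linalg.lstsq(Abar, bbar, rcond=None)[0]
print("rank of Abar:", np.linalg.matrix_rank(Abar), "out of", m)
print("projection residual:", np.linalg.norm(bbar - proj))   # ~ machine precision
```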

5.3 Example: continuous kernels in integral operators

For the continuous kernels of the integral operators in Examples 2.2–2.4, the function space of learning L^{2}_{\rho} is infinite-dimensional. Their Bayesian inversions are similar, so we demonstrate the computation using the convolution operator in Example 2.2. In particular, we compare our data-adaptive prior with a fixed non-degenerate prior in the presence of four types of errors: (i) discretization error, (ii) model error, (iii) partial observation (or missing data), and (iv) wrong noise assumption.

Recall that with 𝕏=𝕐=L2([0,1])\mathbb{X}=\mathbb{Y}=L^{2}([0,1]), we aim to recover the kernel ϕ\phi in the operator in (2.2), Rϕ[u](y)=01ϕ(yx)u(x)𝑑xR_{\phi}[u](y)=\int_{0}^{1}\phi(y-x)u(x)dx, by fitting the model (1.1) to an input-output dataset {uk,fk}k=13\left\{u^{k},f^{k}\right\}_{k=1}^{3}. We set {uk}k=13\{u^{k}\}_{k=1}^{3} to be the probability densities of normal distributions 𝒩(1.6+0.6k,1/15)\mathcal{N}(-1.6+0.6k,1/15) for k=1,2,3k=1,2,3 and we compute Rψ[uk]=01ψ(yx)uk(x)𝑑xR_{\psi}[u^{k}]=\int_{0}^{1}\psi(y-x)u^{k}(x)dx by the global adaptive quadrature method in [50]. The data are {uk(xj),fk(yl)}k=13\left\{u^{k}(x_{j}),f^{k}(y_{l})\right\}_{k=1}^{3} on uniform meshes {xj}j=1J\{x_{j}\}_{j=1}^{J} and {yl}l=1L\{y_{l}\}_{l=1}^{L} of [0,1][0,1] with J=100J=100 and L=50L=50. Here fk(yl)f^{k}(y_{l}) is generated by

fk(yl)=Rϕ[uk](yl)+ηlk+ξk(yl),f^{k}(y_{l})=R_{\phi}[u^{k}](y_{l})+\eta_{l}^{k}+\xi^{k}(y_{l}), (5.8)

where \eta_{l}^{k} are i.i.d. Gaussian random variables with mean zero and variance \frac{\sigma_{\eta}^{2}}{\triangle y} (except in the wrong-noise-assumption case specified later), and \xi^{k}(y)=\sigma_{\xi}u^{k}(y)\left|u^{k}(y)\right| are artificial model errors with \sigma_{\xi}=0 (no model error) or \sigma_{\xi}=0.01 (a small model error).
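A sketch of how such synthetic data can be generated (our own Python code; scipy's adaptive quadrature stands in for the routine of [50], and the test kernel phi_true below is a hypothetical choice):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

J, L, N = 100, 50, 3
xs, ys = np.linspace(0, 1, J), np.linspace(0, 1, L)
dy = ys[1] - ys[0]

# input functions u^k: densities of N(-1.6 + 0.6k, 1/15), k = 1, 2, 3
u_fun = [lambda x, k=k: norm.pdf(x, loc=-1.6 + 0.6 * k, scale=np.sqrt(1 / 15))
         for k in (1, 2, 3)]
data_u = [u(xs) for u in u_fun]                   # samples u^k(x_j)

def R_phi(phi, u, y):
    """R_phi[u](y) = int_0^1 phi(y - x) u(x) dx by adaptive quadrature."""
    return quad(lambda x: phi(y - x) * u(x), 0.0, 1.0, limit=200)[0]

def generate_f(phi, sigma_eta=0.01, sigma_xi=0.0, seed=0):
    """Noisy outputs f^k(y_l) following (5.8)."""
    rng = np.random.default_rng(seed)
    fs = []
    for u in u_fun:
        clean = np.array([R_phi(phi, u, y) for y in ys])
        noise = rng.normal(0.0, sigma_eta / np.sqrt(dy), size=L)
        model_err = sigma_xi * np.array([u(y) * abs(u(y)) for y in ys])
        fs.append(clean + noise + model_err)
    return fs

phi_true = lambda r: np.sin(2 * np.pi * r)        # a hypothetical test kernel on [-1, 1]
data_f = generate_f(phi_true, sigma_eta=0.01, sigma_xi=0.01)
```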

Figure 1: The exploration measure and the eigenvalues of the basis matrix BB, regression matrix ADA_{D} and operator G¯{\mathcal{L}_{\overline{G}}} (computed via the generalized eigenvalue problem of (AD,B)(A_{D},B)).

The exploration measure (defined in (2.4)) of this dataset has a density

ρ(r)=1ZNk=1N[0,1][r,r+1]|uk(y)|𝑑y,r[1,1],\rho(r)=\frac{1}{ZN}\sum_{k=1}^{N}\int_{[0,1]\cap[r,r+1]}\left|u^{k}(y)\right|dy,\quad r\in[-1,1],

with Z being the normalizing constant. We set the hypothesis space \mathcal{H}=\mathrm{span}\{\phi_{i}\}_{i=1}^{l}, where \{\phi_{i}\}_{i=1}^{l} are B-spline basis functions (i.e., piecewise polynomials) of degree 3 with knots from a uniform partition of [-1,1]. We approximate {\overline{A}} and {\overline{b}} using Riemann sum integration,

A¯D(i,i)\displaystyle{\overline{A}}_{D}(i,i^{\prime}) =1Nk=1Nj=1JR^ϕi[uk](yj)R^ϕi[uk](yj)y,\displaystyle=\frac{1}{N}\sum_{k=1}^{N}\sum_{j=1}^{J}\widehat{R}_{\phi_{i}}[u^{k}](y_{j})\widehat{R}_{\phi_{i^{\prime}}}[u^{k}](y_{j})\triangle y,
{\overline{b}}(i)=\frac{1}{N}\sum_{1\leq k\leq N}\sum_{l=1}^{L}\widehat{R}_{\phi_{i}}[u^{k}](y_{l})f^{k}(y_{l})\triangle y,

where we approximate R_{\psi}[u^{k}] via the Riemann sum \widehat{R}_{\psi}[u^{k}](y)=\sum_{j=1}^{J}\psi(y-x_{j})u^{k}(x_{j})\triangle x. Additionally, to illustrate the effects of discretization error, we also compute {\overline{A}} in (5.3a) using the continuous \{u^{k}\} with quadrature integration, and denote the resulting matrix by {\overline{A}}_{C}.
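A sketch of this discrete assembly with cubic B-spline basis functions (via scipy's BSpline.basis_element), assuming u data on the mesh \{x_{j}\} and f data on \{y_{l}\} as above; the number of knots is our choice for illustration, and the helper names are ours.

```python
import numpy as np
from scipy.interpolate import BSpline

# cubic B-spline basis on a uniform partition of [-1, 1] (9 knots chosen for illustration)
interior, degree = np.linspace(-1, 1, 9), 3
t = np.concatenate([[interior[0]] * degree, interior, [interior[-1]] * degree])
basis = [BSpline.basis_element(t[i:i + degree + 2], extrapolate=False)
         for i in range(len(t) - degree - 1)]

def phi_i(i, r):
    return np.nan_to_num(basis[i](np.atleast_1d(r)))     # zero outside the local support

def R_hat(i, u_vals, xs, y):
    """Riemann-sum approximation of R_{phi_i}[u](y)."""
    dx = xs[1] - xs[0]
    return np.sum(phi_i(i, y - xs) * u_vals) * dx

def assemble(u_data, f_data, xs, ys):
    """Abar_D and bbar from the discrete data {u^k(x_j)} and {f^k(y_l)}."""
    l, N, dy = len(basis), len(u_data), ys[1] - ys[0]
    R = np.array([[[R_hat(i, u_data[k], xs, y) for y in ys]
                   for i in range(l)] for k in range(N)])  # shape (N, l, len(ys))
    Abar = sum(R[k] @ R[k].T for k in range(N)) * dy / N
    bbar = sum(R[k] @ np.asarray(f_data[k]) for k in range(N)) * dy / N
    return Abar, bbar
```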

Figure 1 shows the exploration measure and the eigenvalues of the basis matrix BB, ADA_{D} and G¯{\mathcal{L}_{\overline{G}}} (which are the generalized eigenvalues of (A¯D,B)({\overline{A}}_{D},B)). Note that the support 𝒮\mathcal{S} is a proper subset of [1,1][-1,1], leading to a near singular BB. In particular, the inverse problem is severely ill-posed in Lρ2L^{2}_{\rho} since G¯{\mathcal{L}_{\overline{G}}} has multiple almost-zero eigenvalues.

We consider four types of errors in {\overline{b}}, in addition to the observation noise, that often occur in practice; a code sketch of how each perturbation can be generated follows the list.

  1. Discretization Error. We assume that f^{k} in (5.8) has no model error.

  2. Partial Observation. We assume that f^{k} misses data in the first quarter of the interval, i.e., f^{k}_{l}=0 for l=0,\dots,L/4. Also, assume that there is no model error.

  3. Model Error. Assume there are model errors.

  4. Wrong Noise Assumption. Suppose that the measurement noise \eta_{l}^{k} is uniformly distributed on the interval [-\frac{\sqrt{3}\sigma_{\eta}}{\sqrt{\triangle y}},\frac{\sqrt{3}\sigma_{\eta}}{\sqrt{\triangle y}}]. Thus, the model, which assumes Gaussian noise, has a wrong noise assumption. The factor \sqrt{3} keeps the variance at the same level as in the Gaussian case.
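Under the same notation as the data-generation sketch above, the four perturbations of f^{k} can be produced as follows (our illustration, with a hypothetical helper):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(f_clean, u_on_ys, dy, sigma_eta, case):
    """Return the observed f^k(y_l) under one of the four error types."""
    L = len(f_clean)
    noisy = f_clean + rng.normal(0.0, sigma_eta / np.sqrt(dy), size=L)
    if case == "discretization":        # only noise; the error enters via the Riemann sums
        return noisy
    if case == "partial":               # first quarter of the interval unobserved
        noisy = noisy.copy()
        noisy[: L // 4] = 0.0
        return noisy
    if case == "model":                 # artificial model error 0.01 * u|u|
        return noisy + 0.01 * u_on_ys * np.abs(u_on_ys)
    if case == "wrong_noise":           # uniform noise with the same variance as the Gaussian
        half = np.sqrt(3) * sigma_eta / np.sqrt(dy)
        return f_clean + rng.uniform(-half, half, size=L)
    raise ValueError(case)
```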

For each of the four cases, we compute the posterior means in Table 3 with the hyper-parameter \lambda_{*} selected by the L-curve method, and report the L^{2}_{\rho} errors of the function estimators. Additionally, for each case, we consider observation noise levels \sigma_{\eta} ranging from 10^{-1} to 10^{-5}, so as to examine the small noise limit of the posterior mean.

We assess the performance of the fixed prior and the data-adaptive prior in Table 3 through the accuracy of their posterior means. We report the interquartile range (IQR: the 75th, 50th and 25th percentiles) of the L^{2}_{\rho} errors of their posterior means over 200 independent simulations in which \phi_{true} is randomly sampled.

Two scenarios are considered: \phi_{true} is either inside or outside the FSOI. To draw samples of \phi_{true} outside the FSOI, we sample the coefficient c^{*} of \phi_{true}=\sum_{j=1}^{l}c_{j}^{*}\phi_{j} from the fixed prior \mathcal{N}(0,I_{l}); thus, the fixed prior is the true prior. To sample \phi_{true} inside the FSOI, we set \phi_{true}=\sum_{j=1}^{3}c_{j}^{*}\psi_{j} with c^{*} drawn from \mathcal{N}(0,I_{3}), where \psi_{j}=\sum_{i=1}^{l}v_{i,j}\phi_{i} and v_{\cdot,j} is the j-th eigenvector of {\overline{A}}_{D}.

Note that the exploration measure, the matrices A¯D{\overline{A}}_{D}, A¯C{\overline{A}}_{C} and BB are the same in all these simulations because they are determined by the data {uk}\{u^{k}\} and the basis functions. Thus, we only need to compute b¯{\overline{b}} for each simulation.

Figure 2: Interquartile range (IQR, the 75th75^{th}, 50th50^{th} and 25th25^{th} percentiles) of the Lρ2L^{2}_{\rho} errors of the posterior means. They are computed in 200200 independent simulations with ϕtrue\phi_{true} sampled from the fixed prior (hence outside the FSOI), in the presence of four types of errors. Top row: the regression matrix A¯{\overline{A}} is computed from continuous {uk}\{u^{k}\}; Bottom row: A¯{\overline{A}} is computed from discrete data. As ση0\sigma_{\eta}\to 0, the fixed prior leads to diverging posterior means in 6 out of the 8 cases, while the data-adaptive (DA) prior leads to stable posterior means.
Figure 3: IQR of the Lρ2L^{2}_{\rho} errors of the posterior means in 200200 independent simulations with ϕtrue\phi_{true} sampled inside the FSOI.

Figure 2 shows the IQR of these simulations in the scenario that the true kernels are outside the FSOI. The fixed prior leads to diverging posterior means in 6 out of the 8 cases, while the DA-prior has stable posterior means in all cases. The fixed prior leads to a diverging posterior mean when using the continuously integrated regression matrix A¯C{\overline{A}}_{C}, because the discrepancy between b¯{\overline{b}} and A¯C{\overline{A}}_{C} leads to a perturbation outside the FSOI, satisfying Assumption 3.1 (A3). Similarly, either the model error or partial observation error in b¯{\overline{b}} causes a perturbation outside the FSOI of A¯D{\overline{A}}_{D}, making the fixed prior’s posterior mean diverge. On the other hand, the discretely computed A¯D{\overline{A}}_{D} matches b¯{\overline{b}} in the sense that b¯Range(A¯D){\overline{b}}\in{\rm Range}({\overline{A}}_{D}) as proved in Proposition 5.7, so the fixed prior has a stable posterior mean in cases of discretization and wrong noise assumption. In all these cases, the error of the posterior mean of the DA-prior does not decay as ση0\sigma_{\eta}\to 0, because the error is dominated by the part outside of the FSOI that cannot be recovered from data.
Figure 3 shows the IQR of these simulations with the true kernels sampled inside the FSOI. The DA prior leads to posterior means that are not only stable but also converge to small noise limits, whereas the fixed prior leads to diverging posterior means as in Figure 2. The convergence of the posterior means of the DA prior can be seen in the cases of “Discretization” and “Wrong noise” with both continuously and discretely computed regression matrices. Meanwhile, the flat lines of the DA prior in the cases of “Model error” or “Partial observations” are due to the error inside the FSOI caused by either the model error or partial observation error in b¯{\overline{b}}, as shown in the proof of Theorem 4.2.
Additionally, Figure 4 and Figure 5 show the estimated posterior (its mean and the 75th and 25th percentiles) in a typical simulation, when \phi_{true} is outside and inside the FSOI, respectively. Here, the percentiles are computed by drawing samples from the posterior. A more accurate posterior has a more accurate mean and a narrower shaded region between the percentiles, i.e., a smaller uncertainty. In all cases, the DA prior leads to a more accurate posterior mean (MAP) than the fixed prior. When the observation noise has \sigma_{\eta}=0.1, the DA prior leads to a posterior with a larger shaded region between the percentiles than the fixed prior, but when \sigma_{\eta}=0.001, the DA prior's shaded region is much smaller than that of the fixed prior.

Figure 4: The posterior (its mean, the 75th75^{th} and 25th25^{th} percentiles) when ϕtrue\phi_{true}\notin FSOI.
Figure 5: The posterior (its mean, the 75th75^{th} and 25th25^{th} percentiles) when ϕtrue\phi_{true}\in FSOI.

In summary, these numerical results confirm that the data-adaptive prior removes the risk in a fixed non-degenerate prior, leading to a robust posterior with a small noise limit.

Remark 5.8 (Increasing data size).

We have chosen a low number of data points in the numerical tests, so that the inverse problems are under-determined and hence are meaningful for a Bayesian study. In general, the blow-up behavior caused by a fixed non-degenerate prior will remain when the data size increases as long as the inverse problem satisfies the conditions in Assumption 3.1 (i.e., under-determined with deficient rank, using a fixed non-degenerate prior, with error in the null space of the normal operator).

5.4 Limitations of the data-adaptive prior

As stated in [24]: “every practical method has advantages and disadvantages”. The major advantage of the data-adaptive RKHS prior is that it prevents the posterior from being contaminated by errors outside the data-dependent function space of identifiability.

A drawback of the data-adaptive prior is its reliance on selecting the hyper-parameter λ\lambda_{*}. The L-curve method is state-of-the-art and works well in our numerical tests, yet it has limitations in dealing with smoothness and asymptotic consistency (see [24]). An improper hyper-parameter can lead to a posterior with an inaccurate mean and unreliable covariance. Also, the premise is that the identifiable part of the true kernel is in the data-adaptive RKHS. But this RKHS can be restrictive when the data is smooth, leading to an overly-smoothed estimator if the true kernel is non-smooth. It remains open to use a prior with G¯s{\mathcal{L}_{\overline{G}}}^{s} as its covariance, as introduced in [31], with s0s\geq 0 to detect the smoothness of the true kernel. We leave this as potential future work.

Also, the data-adaptive prior in this study is for linear inverse problems, and it does not apply to nonlinear problems in which the operator depends on the kernel nonlinearly. However, the covariance of our data-adaptive prior corresponds to a scaled Fisher information matrix. Thus, for nonlinear inverse problems, a potential adaptive prior is the scaled Fisher information, which has been explored as a regularization method in [34].

6 Conclusion

The inverse problem of learning kernels in operators is often severely ill-posed. We show that a fixed non-degenerate prior leads to a divergent posterior mean when the observation noise becomes small if the data induces a perturbation in the eigenspace of zero eigenvalues of the normal operator.

We have resolved this issue with a data-adaptive RKHS prior. It leads to a stable posterior whose mean always has a small noise limit, and the limit converges to the identifiable part of the true kernel when the perturbation vanishes. Its covariance is the normal operator with a hyper-parameter selected adaptively from the data by the L-curve method. Also, the data-adaptive prior improves the quality of the posterior over the fixed prior in two aspects: a smaller expected mean square error of the posterior mean, and a smaller trace of the covariance operator, thus reducing the uncertainty.

Furthermore, we provide a detailed analysis of the data-adaptive prior in computational practice. We demonstrate its advantage on Toeplitz matrices and integral operators in the presence of four types of errors. Numerical tests show that when the noise becomes small, a fixed non-degenerate prior may lead to a divergent posterior mean, whereas the data-adaptive prior always attains stable posterior means.

We have also discussed the limitations of the data-adaptive prior, such as its dependence on the hyper-parameter selection and its tendency to over-smoothing. It is of interest to overcome these limitations in future research by adaptively selecting the regularity of the prior covariance through a fractional operator. Among various other directions to be further explored, we mention one that is particularly relevant in the era of big data: to investigate the inverse problem when the data {uk}\{u^{k}\} are randomly sampled in the setting of infinite-dimensional statistical models (e.g., [22]). When the operator Rϕ[u]R_{\phi}[u] is linear in uu, the examples of Toeplitz matrices and integral operators show that the inverse problem will become less ill-posed when the number of linearly independent data {uk}\{u^{k}\} increases. When RϕR_{\phi} is nonlinear in uu, it remains open to understand how the ill-posedness depends on the data. Another direction would be to consider sampling the posterior exploiting MCMC or sequential Monte Carlo methodologies (e.g., [48]).

Acknowledgments

The work of FL is funded by the Johns Hopkins University Catalyst Award, FA9550-20-1-0288, and NSF-DMS-2238486. XW is supported by the Johns Hopkins University research fund. FL would like to thank Yue Yu and Mauro Maggioni for inspiring discussions.

Appendix A Appendix

A.1 Identifiability theory

The main theme of the identifiability theory is to find the function space in which the quadratic loss functional has a unique minimizer. The next lemma shows that the normal operator {\mathcal{L}_{\overline{G}}} defined in (2.7) is a trace-class operator. Recall that an operator \mathcal{Q} on a Hilbert space is of trace class if it satisfies \sum_{k}\langle{\mathcal{Q}e_{k},e_{k}}\rangle<\infty for any complete orthonormal basis \{e_{k}\}_{k=1}^{\infty}.

Lemma A.1.

Under Assumption 2.5, the operator G¯:Lρ2Lρ2{\mathcal{L}_{\overline{G}}}:L^{2}_{\rho}\to L^{2}_{\rho} defined in (2.7) is a trace-class operator with Tr(G¯)=𝒮G¯(r,r)ρ(r)𝑑rTr({\mathcal{L}_{\overline{G}}})=\int_{\mathcal{S}}{\overline{G}}(r,r)\rho(r)dr.

Proof.

We have ρ(r)=1ZN1kNΩ|g[uk](x,r+x)|μ(dx)\rho(r)=\frac{1}{ZN}\sum_{1\leq k\leq N}\int_{\Omega}\left|g[u^{k}](x,r+x)\right|\mu(dx) by (2.4). Then,

G(r,s)=1N1kNg[uk](x,r+x)g[uk](x,s+x)μ(dx)Cρ(r)ρ(s)G(r,s)=\frac{1}{N}\sum_{1\leq k\leq N}\int g[u^{k}](x,r+x)g[u^{k}](x,s+x)\mu(dx)\leq C\rho(r)\wedge\rho(s)

for all r,s\in\mathcal{S}, where C=Z\max_{1\leq k\leq N}\sup_{x,y\in\Omega}|g[u^{k}](x,y)|. Thus,

G¯(r,s)=G(r,s)ρ(r)ρ(s)Cρ(r)1ρ(s)1,{\overline{G}}(r,s)=\frac{G(r,s)}{\rho(r)\rho(s)}\leq C\rho(r)^{-1}\wedge\rho(s)^{-1},

for each r,s\in\mathcal{S}. Meanwhile, since \Omega is bounded, we have |\mathcal{S}|<\infty. Hence \int_{\mathcal{S}}{\overline{G}}(r,r)\rho(r)dr\leq C|\mathcal{S}|<\infty. Also, note that {\overline{G}} is continuous since g[u^{k}] is continuous. Then, by [32, Theorem 12, p.344], the operator {\mathcal{L}_{\overline{G}}} with integral kernel {\overline{G}} has a finite trace Tr({\mathcal{L}_{\overline{G}}})=\int_{\mathcal{S}}{\overline{G}}(r,r)\rho(r)dr<\infty. ∎

Theorem 2.7 characterizes the FSOI through the normal operator G¯{\mathcal{L}_{\overline{G}}}.

Proof of Theorem 2.7.

Part (a)(a) follows from the definition of ϕ𝒟\phi^{\scriptscriptstyle\mathcal{D}} in (2.10). In fact, plugging in fk=Rϕtrue[uk]+ξk+ηkf^{k}=R_{\phi_{true}}[u^{k}]+\xi_{k}+\eta_{k} into the right hand side of (2.10), we have, ψLρ2\forall\psi\in L^{2}_{\rho},

\langle{\phi^{\scriptscriptstyle\mathcal{D}},\psi}\rangle_{L^{2}_{\rho}}=\frac{1}{N}\sum_{1\leq k\leq N}\Big(\langle{R_{\psi}[u^{k}],R_{\phi_{true}}[u^{k}]}\rangle_{\mathbb{Y}}+\langle{R_{\psi}[u^{k}],\xi_{k}}\rangle_{\mathbb{Y}}+\langle{R_{\psi}[u^{k}],\eta_{k}}\rangle_{\mathbb{Y}}\Big)
=ψ,G¯ϕtrueLρ2+ψ,ϵξLρ2+ψ,ϵηLρ2,\displaystyle=\langle{\psi,{\mathcal{L}_{\overline{G}}}\phi_{true}}\rangle_{L^{2}_{\rho}}+\langle{\psi,\epsilon^{\xi}}\rangle_{L^{2}_{\rho}}+\langle{\psi,\epsilon^{\eta}}\rangle_{L^{2}_{\rho}},

where the first term in the last equation comes from the definition of the operator {\mathcal{L}_{\overline{G}}} in (2.7), and the second and third terms come from the Riesz representation. Since each \eta_{k} is a \mathbb{Y}-valued white noise, the random variable \langle{\psi,\epsilon^{\eta}}\rangle_{L^{2}_{\rho}}=\frac{1}{N}\sum_{1\leq k\leq N}\langle{R_{\psi}[u^{k}],\eta_{k}}\rangle_{\mathbb{Y}} is Gaussian with mean zero and variance \sigma_{\eta}^{2}\langle{\psi,{\mathcal{L}_{\overline{G}}}\psi}\rangle_{L^{2}_{\rho}} for each \psi\in L^{2}_{\rho}. Thus, \epsilon^{\eta} has a Gaussian distribution \mathcal{N}(0,\sigma_{\eta}^{2}{\mathcal{L}_{\overline{G}}}).

Part (b) follows directly from the loss functional in (2.9).

For Part (c)(c), first, note that the quadratic loss functional has a unique minimizer in HH. Meanwhile, note that HH is the orthogonal complement of the null space of G¯{\mathcal{L}_{\overline{G}}}, and (ϕtrue+ϕ0)=(ϕtrue)\mathcal{E}(\phi_{true}+\phi^{0})=\mathcal{E}(\phi_{true}) for any ϕ0\phi^{0} such that G¯ϕ0=0{\mathcal{L}_{\overline{G}}}\phi^{0}=0. Thus, HH is the largest such function space, and we conclude that HH is the FSOI.

Next, for any \phi^{\scriptscriptstyle\mathcal{D}}\in{\mathcal{L}_{\overline{G}}}(L^{2}_{\rho}), the estimator \widehat{\phi}={\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}} is well-defined. By Part (b), this estimator is the unique zero of the loss functional's Fréchet derivative in H. Hence it is the unique minimizer of \mathcal{E}(\phi) in H. In particular, when the data is noiseless, has no model error, and is generated from \phi_{true}, i.e., R_{\phi_{true}}[u^{k}]=f^{k}, we have \phi^{\scriptscriptstyle\mathcal{D}}={\mathcal{L}_{\overline{G}}}\phi_{true} from Part (a). Hence \widehat{\phi}={\mathcal{L}_{\overline{G}}}^{-1}\phi^{\scriptscriptstyle\mathcal{D}}=\phi_{true}. That is, \phi_{true}\in H is the unique minimizer of the loss functional \mathcal{E}. ∎

The proof of Proposition 5.6 is an extension of Theorem 4.1 of [38].

Proof of Proposition 5.6.

Let ψk=j=1lVjkϕj\psi_{k}=\sum_{j=1}^{l}V_{jk}\phi_{j} with VBV=IV^{\top}BV=I. Then, ψk\psi_{k} is an eigenfunction of G¯{\mathcal{L}_{\overline{G}}} with eigenvalue λk\lambda_{k} if and only if for each ii,

\langle{\phi_{i},\lambda_{k}\psi_{k}}\rangle_{L^{2}_{\rho}}=\langle{\phi_{i},{\mathcal{L}_{\overline{G}}}\psi_{k}}\rangle_{L^{2}_{\rho}}=\sum_{1\leq j\leq l}\langle{\phi_{i},{\mathcal{L}_{\overline{G}}}\phi_{j}}\rangle_{L^{2}_{\rho}}V_{jk}=\sum_{1\leq j\leq l}{\overline{A}}(i,j)V_{jk},

where the last equality follows from the definition of A¯{\overline{A}}. Meanwhile, by the definition of BB we have ϕi,λkψkLρ2=j=1lB(i,j)Vjkλk\langle{\phi_{i},\lambda_{k}\psi_{k}}\rangle_{L^{2}_{\rho}}=\sum_{j=1}^{l}B(i,j)V_{jk}\lambda_{k} for each ii. Then, Equation (5.7) follows.

Next, to compute ϕ,G¯1ϕLρ2\langle{\phi,{\mathcal{L}_{\overline{G}}}^{-1}\phi}\rangle_{L^{2}_{\rho}}, we denote Ψ=(ψ1,,ψl)\Psi=(\psi_{1},\ldots,\psi_{l})^{\top} and Φ=(ϕ1,,ϕl)\Phi=(\phi_{1},\ldots,\phi_{l})^{\top}. Then, we can write

Ψ=VΦ,ϕ=1ilciϕi=cΦ=cVΨ.\Psi=V^{\top}\Phi,\qquad\phi=\sum_{1\leq i\leq l}c_{i}\phi_{i}=c^{\top}\Phi=c^{\top}V^{-\top}\Psi.

Hence, we can obtain Brkhs=(VΛV)1B_{rkhs}=(V\Lambda V^{\top})^{-1} in ϕ,G¯1ϕLρ2=cBrkhsc\langle{\phi,{\mathcal{L}_{\overline{G}}}^{-1}\phi}\rangle_{L^{2}_{\rho}}=c^{\top}B_{rkhs}c via:

ϕ,G¯1ϕLρ2\displaystyle\langle{\phi,{\mathcal{L}_{\overline{G}}}^{-1}\phi}\rangle_{L^{2}_{\rho}} =cΦ,G¯1cΦLρ2\displaystyle=\langle{c^{\top}\Phi,{\mathcal{L}_{\overline{G}}}^{-1}c^{\top}\Phi}\rangle_{L^{2}_{\rho}}
=cVΨ,G¯1cVΨLρ2\displaystyle=\langle{c^{\top}V^{-\top}\Psi,{\mathcal{L}_{\overline{G}}}^{-1}c^{\top}V^{-\top}\Psi}\rangle_{L^{2}_{\rho}}
=cVΨ,G¯1ΨLρ2V1c=cVΛ1V1c,\displaystyle=c^{\top}V^{-\top}\langle{\Psi,{\mathcal{L}_{\overline{G}}}^{-1}\Psi}\rangle_{L^{2}_{\rho}}V^{-1}c=c^{\top}V^{-\top}\Lambda^{-1}V^{-1}c,

where the last equality follows from Ψ,G¯1ΨLρ2=Λ1\langle{\Psi,{\mathcal{L}_{\overline{G}}}^{-1}\Psi}\rangle_{L^{2}_{\rho}}=\Lambda^{-1}.

Additionally, to prove B_{rkhs}=B{\overline{A}}^{-1}B, we use the generalized eigenvalue problem. Since V^{\top}BV=I, we have V^{-1}=V^{\top}B. Meanwhile, {\overline{A}}V=BV\Lambda implies that B^{-1}{\overline{A}}=V\Lambda V^{-1}. Thus, B^{-1}{\overline{A}}B^{-1}=V\Lambda V^{-1}B^{-1}=V\Lambda V^{\top}, which is B_{rkhs}^{-1}. ∎

A.2 Gaussian measures on a Hilbert space

A Gaussian measure on a Hilbert space is defined by its mean and covariance operator (see [11, Chapter 1-2] and [12]). Let HH be a Hilbert space with inner product ,\langle{\cdot,\cdot}\rangle, and let (H)\mathcal{B}(H) denote its Borel algebra. Let 𝒬\mathcal{Q} be a symmetric nonnegative trace class operator on HH, that is 𝒬x,y=x,𝒬y\langle{\mathcal{Q}x,y}\rangle=\langle{x,\mathcal{Q}y}\rangle and 𝒬x,x0\langle{\mathcal{Q}x,x}\rangle\geq 0 for any x,yHx,y\in H, and k𝒬ek,ek<\sum_{k}\langle{\mathcal{Q}e_{k},e_{k}}\rangle<\infty for any complete orthonormal basis {ek}k=1\{e_{k}\}_{k=1}^{\infty}. Additionally, denote {λk,ek}k=1\{\lambda_{k},e_{k}\}_{k=1}^{\infty} the eigenvalues (in descending order) and eigenfunctions of 𝒬\mathcal{Q}.

A measure on HH with mean aa and covariance operator 𝒬\mathcal{Q} is a Gaussian measure π=𝒩(a,𝒬)\pi=\mathcal{N}(a,\mathcal{Q}) iff its Fourier transform π^(h)=Heix,hπ(dx)\widehat{\pi}(h)=\int_{H}e^{i\langle{x,h}\rangle}\pi(dx) is eia,h12𝒬h,he^{i\langle{a,h}\rangle-\frac{1}{2}\langle{\mathcal{Q}h,h}\rangle} for any hHh\in H. The measure is non-degenerate if Ker𝒬={0}\mathrm{Ker}\mathcal{Q}=\{0\}, i.e., λk>0\lambda_{k}>0 for all kk. It is a product measure π=k=1𝒩(ak,λk)\pi=\prod_{k=1}^{\infty}\mathcal{N}(a_{k},\lambda_{k}), where ak=a,eka_{k}=\langle{a,e_{k}}\rangle\in\mathbb{R} for each kk. Note that π(𝒬1/2H)=0\pi(\mathcal{Q}^{1/2}H)=0 if HH is infinite-dimensional, that is, the Cameron-Martin space 𝒬1/2H\mathcal{Q}^{1/2}H has measure zero.

The next lemma specifies the covariance of the coefficient of an HH-valued Gaussian random variable. The coefficient can be on either a full or partial basis.

Lemma A.2 (Covariance operator to covariance matrix).

Let H be a Hilbert space with a complete basis \{\phi_{i}\}_{i=1}^{n} that may not be orthonormal, where n\leq\infty. For l\leq n, assume that the matrix B=[\langle{\phi_{i},\phi_{j}}\rangle]_{1\leq i,j\leq l} is strictly positive definite. Let \phi=\sum_{i=1}^{n}c_{i}\phi_{i} be an H-valued random variable with Gaussian measure \mathcal{N}(m,\mathcal{Q}), where \mathcal{Q} is a trace-class operator. Then, the coefficient c=(c_{1},\ldots,c_{l})^{\top}\in\mathbb{R}^{l\times 1} has a Gaussian distribution \mathcal{N}(\overline{c},B^{-1}AB^{-1}), where

\overline{c}=B^{-1}(\langle{m,\phi_{1}}\rangle,\ldots,\langle{m,\phi_{l}}\rangle)^{\top},
A\displaystyle A =[A(i,j)]1i,jl=[ϕi,𝒬ϕj]1i,jl.\displaystyle=[A(i,j)]_{1\leq i,j\leq l}=[\langle{\phi_{i},\mathcal{Q}\phi_{j}}\rangle]_{1\leq i,j\leq l}.
Proof.

Without loss of generality, we assume $m=0$; otherwise, we replace $\phi$ by $\phi-m$ in the following proof. By definition, for any $h\in H$, the random variable $\langle \phi,h\rangle$ has distribution $\mathcal{N}(0,\langle h,\mathcal{Q}h\rangle)$. Thus, $\langle \phi,\phi_{i}\rangle\sim\mathcal{N}(0,\langle \phi_{i},\mathcal{Q}\phi_{i}\rangle)$ for each $i$. Similarly, $\mathbb{E}[\langle \phi,\phi_{i}+\phi_{j}\rangle^{2}]=\langle \phi_{i}+\phi_{j},\mathcal{Q}(\phi_{i}+\phi_{j})\rangle$. Then,
\begin{align*}
\mathbb{E}[\langle \phi,\phi_{i}\rangle\langle \phi,\phi_{j}\rangle]
&= \frac{1}{2}\Big(\mathbb{E}[\langle \phi,\phi_{i}+\phi_{j}\rangle^{2}]-\mathbb{E}[\langle \phi,\phi_{i}\rangle^{2}]-\mathbb{E}[\langle \phi,\phi_{j}\rangle^{2}]\Big)\\
&= \frac{1}{2}\Big(\langle \phi_{i}+\phi_{j},\mathcal{Q}(\phi_{i}+\phi_{j})\rangle-\langle \phi_{i},\mathcal{Q}\phi_{i}\rangle-\langle \phi_{j},\mathcal{Q}\phi_{j}\rangle\Big)=\langle \phi_{i},\mathcal{Q}\phi_{j}\rangle.
\end{align*}
Hence, the random vector $X=(\langle \phi,\phi_{1}\rangle,\ldots,\langle \phi,\phi_{l}\rangle)^{\top}$ is Gaussian $\mathcal{N}(0,A)$ with $A(i,j)=\langle \phi_{i},\mathcal{Q}\phi_{j}\rangle$. Noticing that $X=Bc$ and $B=B^{\top}$, we obtain that $c=B^{-1}X$ has distribution $\mathcal{N}(0,B^{-1}AB^{-1})$, since $\mathbb{E}[cc^{\top}]=\mathbb{E}[B^{-1}XX^{\top}B^{-1}]=B^{-1}AB^{-1}$. ∎
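A short Monte Carlo check of Lemma A.2 (a sketch under the illustrative assumption $H=\mathbb{R}^{3}$ with a non-orthonormal basis; the basis vectors and covariance below are arbitrary choices, not from the paper):

\begin{verbatim}
# Monte Carlo check that the coefficients c = B^{-1} X have covariance B^{-1} A B^{-1}.
import numpy as np

rng = np.random.default_rng(2)
# Columns are the (non-orthonormal) basis vectors phi_1, phi_2, phi_3 in H = R^3.
Phi = np.column_stack([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
B = Phi.T @ Phi                           # Gram matrix <phi_i, phi_j>
Q = np.diag([1.0, 0.5, 0.25])             # covariance operator (trivially trace class)
A = Phi.T @ Q @ Phi                       # A(i, j) = <phi_i, Q phi_j>

phi = rng.multivariate_normal(np.zeros(3), Q, size=200_000)   # draws of phi ~ N(0, Q)
X = phi @ Phi                             # rows: (<phi, phi_1>, ..., <phi, phi_3>)
c = X @ np.linalg.inv(B)                  # coefficients c = B^{-1} X  (B is symmetric)
target = np.linalg.inv(B) @ A @ np.linalg.inv(B)
print(np.max(np.abs(np.cov(c, rowvar=False) - target)))       # small, O(1/sqrt(#samples))
\end{verbatim}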

Conversely, the distribution of the coefficient vector determines a Gaussian measure only on the linear space spanned by the basis.

Lemma A.3 (Covariance matrix to covariance operator).

Let $H=\mathrm{span}\{\phi_{i}\}_{i=1}^{l}$ with $l\leq\infty$ be a Hilbert space with a basis such that the matrix $B=\big(\langle \phi_{i},\phi_{j}\rangle\big)_{1\leq i,j\leq l}$ is strictly positive definite. Assume the coefficient vector $c\in\mathbb{R}^{l}$ of $\phi=\sum_{i=1}^{l}c_{i}\phi_{i}$ has Gaussian measure $\mathcal{N}(0,Q)$. Then the $H$-valued random variable $\phi$ has Gaussian distribution $\mathcal{N}(0,\mathcal{Q})$, where the operator $\mathcal{Q}$ is defined by $\langle \phi_{i},\mathcal{Q}\phi_{j}\rangle=(BQB)_{i,j}$.

Proof.

Since $\{\phi_{i}\}$ is a complete basis, we only need to determine the distribution of the random vector $X=(\langle \phi,\phi_{1}\rangle,\ldots,\langle \phi,\phi_{l}\rangle)^{\top}\in\mathbb{R}^{l}$. Note that $X=Bc$; hence its distribution is Gaussian $\mathcal{N}(0,BQB)$. ∎

A.3 Details of numerical examples

Computation for the Toeplitz matrix example.

Each dataset $\{u^{k}=(u_{0}^{k},u_{1}^{k})\}_{k}$ leads to an exploration measure on $\mathcal{S}=\{-1,0,1\}$:
\[
\rho(-1)=\frac{\sum_{k}|u_{1}^{k}|}{2\sum_{k}(|u_{1}^{k}|+|u_{0}^{k}|)},\qquad
\rho(0)=\frac{1}{2},\qquad
\rho(1)=\frac{\sum_{k}|u_{0}^{k}|}{2\sum_{k}(|u_{1}^{k}|+|u_{0}^{k}|)}.
\]

Since each $u=(u_{0},u_{1})$ leads to a rank-2 regression matrix
\[
L_{u}=\begin{bmatrix}u_{1}&u_{0}&0\\ 0&u_{1}&u_{0}\end{bmatrix},\qquad
L_{u}^{\top}L_{u}=\begin{bmatrix}u_{1}^{2}&u_{1}u_{0}&0\\ u_{1}u_{0}&u_{1}^{2}+u_{0}^{2}&u_{1}u_{0}\\ 0&u_{1}u_{0}&u_{0}^{2}\end{bmatrix},
\]

the normalized regression matrices ${\overline{A}}=\frac{1}{N}\sum_{k=1}^{N}L_{u^{k}}^{\top}L_{u^{k}}$, with $N$ the number of samples in the dataset, are
\[
{\overline{A}}_{(1)}=\begin{bmatrix}0&0&0\\ 0&1&0\\ 0&0&1\end{bmatrix},\qquad
{\overline{A}}_{(2)}=\frac{1}{2}\sum_{k=1}^{2}L_{u^{k}}^{\top}L_{u^{k}}=\frac{1}{2}\begin{bmatrix}1&0&0\\ 0&2&0\\ 0&0&1\end{bmatrix},\qquad
{\overline{A}}_{(3)}=\begin{bmatrix}1&1&0\\ 1&2&1\\ 0&1&1\end{bmatrix}. \tag{A.1}
\]

Additionally, with $B=\mathrm{Diag}(\rho)$, the prior covariances $\lambda_{*}Q_{0}^{\scriptscriptstyle\mathcal{D}}=B^{-1}{\overline{A}}B^{-1}$ are
\[
Q^{\scriptscriptstyle\mathcal{D}}_{0,(1)}=\begin{bmatrix}0&0&0\\ 0&4&0\\ 0&0&4\end{bmatrix},\qquad
Q^{\scriptscriptstyle\mathcal{D}}_{0,(2)}=\begin{bmatrix}8&0&0\\ 0&4&0\\ 0&0&8\end{bmatrix},\qquad
Q^{\scriptscriptstyle\mathcal{D}}_{0,(3)}=\begin{bmatrix}16&8&0\\ 8&8&8\\ 0&8&16\end{bmatrix}. \tag{A.2}
\]
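The matrices in (A.1)–(A.2) can be reproduced with a few lines of code (a sketch; the data values $u^{1}=(1,0)$, $u^{2}=(0,1)$, $u^{3}=(1,1)$ are inferred here from the resulting matrices, as the datasets themselves are specified in the main text and not reproduced in this appendix):

\begin{verbatim}
# Reproducing rho, Abar, and lambda_* Q_0^D in (A.1)-(A.2) for the three datasets.
import numpy as np

def L(u):                                # regression matrix L_u for u = (u0, u1)
    u0, u1 = u
    return np.array([[u1, u0, 0.0],
                     [0.0, u1, u0]])

def rho(data):                           # exploration measure (rho(-1), rho(0), rho(1))
    s1 = sum(abs(u[1]) for u in data)
    s0 = sum(abs(u[0]) for u in data)
    return np.array([s1, s1 + s0, s0]) / (2 * (s1 + s0))

def Abar(data):                          # normalized regression matrix
    return sum(L(u).T @ L(u) for u in data) / len(data)

datasets = {1: [(1.0, 0.0)], 2: [(1.0, 0.0), (0.0, 1.0)], 3: [(1.0, 1.0)]}
for j, data in datasets.items():
    A, r = Abar(data), rho(data)
    # B^{-1} A B^{-1} with B = Diag(rho); use the pseudo-inverse on the support of rho.
    Binv = np.diag([0.0 if ri == 0 else 1.0 / ri for ri in r])
    print(j, r, A, Binv @ A @ Binv, sep="\n")    # rho, Abar, lambda_* Q_0^D
\end{verbatim}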

We analyze the well-posedness of the inverse problem in terms of the operator ${\mathcal{L}_{\overline{G}}}$, whose eigenvalues are solved via the generalized eigenvalue problem (see Proposition 5.6).

  • For the dataset $\{u^{1}\}$, the exploration measure $\rho$ is degenerate with $\rho(-1)=0$; thus, the data carry no information to identify $\phi(-1)$. As a result, $L^{2}_{\rho}=\mathrm{span}\{\phi_{2},\phi_{3}\}$ is a proper subspace of $\mathbb{R}^{3}$. The regression matrix ${\overline{A}}_{(1)}$ and the covariance matrix $Q^{\scriptscriptstyle\mathcal{D}}_{0,(1)}$ are effectively the identity matrix $I_{2}$ and $4I_{2}$. The operator ${\mathcal{L}_{\overline{G}}}$ has eigenvalues $\{2,2\}$, and the FSOI is $L^{2}_{\rho}$. Thus, the inverse problem is well-posed in $L^{2}_{\rho}$.

  • For the dataset $\{u^{1},u^{2}\}$, the inverse problem is well-posed because the operator ${\mathcal{L}_{\overline{G}}}$ has eigenvalues $\{2,2,2\}$, and the FSOI is $L^{2}_{\rho}$. Note that the data-adaptive prior $Q^{\scriptscriptstyle\mathcal{D}}_{0,(2)}$ assigns weights to the entries of the coefficient according to the exploration measure.

  • For the dataset $\{u^{3}\}$, the inverse problem is ill-defined in $L^{2}_{\rho}$, but it is well-posed in the FSOI, which is a proper subset of $L^{2}_{\rho}$. Here the FSOI is $\mathrm{span}\{\psi_{1},\psi_{2}\}$, spanned by the eigenvectors of ${\mathcal{L}_{\overline{G}}}$ with positive eigenvalues. Following (5.7), these eigenvectors $\{\psi_{i}\}$ are solved from the generalized eigenvalue problem ${\overline{A}}_{(3)}\psi=\lambda\,\mathrm{Diag}(\rho)\psi$, and they are orthonormal in $L^{2}_{\rho}$. The eigenvalues are $\{8,4,0\}$ and the corresponding eigenvectors are $\psi_{1}=(1,1,1)^{\top}$, $\psi_{2}=(-\sqrt{2},0,\sqrt{2})^{\top}$, and $\psi_{3}=(1,-1,1)^{\top}$; see the numerical check after this list.
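These eigenpairs can be checked numerically (a minimal sketch; scipy.linalg.eigh normalizes the generalized eigenvectors to satisfy $\psi^{\top}\mathrm{Diag}(\rho)\psi=1$, i.e., they are orthonormal in $L^{2}_{\rho}$):

\begin{verbatim}
# Eigenpairs of L_Gbar for the dataset {u^3} via Abar psi = lambda Diag(rho) psi.
import numpy as np
from scipy.linalg import eigh

Abar3 = np.array([[1.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 1.0]])
rho3 = np.array([0.25, 0.5, 0.25])        # exploration measure of {u^3}

lam, Psi = eigh(Abar3, np.diag(rho3))     # generalized eigenvalue problem
print(lam)                                # ascending order: [0. 4. 8.]
print(Psi)                                # columns are psi_3, psi_2, psi_1 (up to sign)
\end{verbatim}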

The hyper-parameter selected by the L-curve method.

Figure 6 shows a typical L-curve, where $\mathcal{R}(\lambda)=\|\phi_{\lambda}\|_{H_{G}}$ and $\mathcal{E}$ denotes the square root of the loss $\mathcal{E}(\phi_{\lambda})$. The L-curve method selects the parameter that attains the maximal curvature at the corner of the L-shaped curve.
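A minimal sketch of the maximal-curvature selection (an illustration only; the paper's implementation may differ, and the arrays of hyper-parameters, losses $\mathcal{E}(\lambda)$, and norms $\mathcal{R}(\lambda)$ are assumed to be precomputed):

\begin{verbatim}
# Pick the hyper-parameter at the corner of the log-log L-curve via maximal curvature.
import numpy as np

def lcurve_corner(lambdas, residuals, norms):
    """lambdas must be sorted increasingly; residuals = E(lambda), norms = R(lambda)."""
    x, y = np.log(residuals), np.log(norms)            # L-curve in log-log coordinates
    dx, dy = np.gradient(x), np.gradient(y)            # first derivatives along the curve
    ddx, ddy = np.gradient(dx), np.gradient(dy)        # second derivatives
    curvature = (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return lambdas[np.argmax(curvature)]               # corner = maximal signed curvature
\end{verbatim}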

Figures 7–8 present the $\lambda_{*}$ selected in the simulations of Figures 2–3, respectively. These hyper-parameters are mostly similar, with the majority at the scale of $10^{-4}$. They show that the optimal hyper-parameter depends on the spectrum of ${\mathcal{L}_{\overline{G}}}$, the four types of errors in ${\overline{b}}$, the strength of the noise, and the smoothness of the true kernel. Generally, a large variation in $\lambda_{*}$ indicates that selecting an optimal hyper-parameter by this method is difficult. Additionally, numerical errors in matrix inversion or in solving the linear systems can affect the result when $\lambda_{*}$ is small. A full analysis of the optimal hyper-parameter is therefore beyond the scope of this study.

Figure 6: The L-curve for selecting the hyper-parameter λ\lambda_{*}.

Figure 7: The hyper-parameter λ\lambda_{*} in the 200 simulations in Figure 2.
Figure 8: The hyper-parameter λ\lambda_{*} in the 200 simulations in Figure 3.

References

  • [1] A. Agliari and C. C. Parisetti. A-g reference informative prior: A note on Zellner's $g$ prior. Journal of the Royal Statistical Society, Series D (The Statistician), 37(3):271–275, 1988.
  • [2] A. Alexanderian, P. J. Gloor, and O. Ghattas. On Bayesian A- and D-optimal experimental designs in infinite dimensions. Bayesian Analysis, 11(3):671–695, 2016.
  • [3] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
  • [4] M. J. Bayarri, J. O. Berger, A. Forte, and G. García-Donato. Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40(3), Jun 2012.
  • [5] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
  • [6] C. Bucur and E. Valdinoci. Nonlocal Diffusion and Applications, volume 20 of Lecture Notes of the Unione Matematica Italiana. Springer International Publishing, Cham, 2016.
  • [7] J. A. Carrillo, K. Craig, and Y. Yao. Aggregation-diffusion equations: dynamics, asymptotics, and singular limits. In Active Particles, Volume 2, pages 65–108. Springer, 2019.
  • [8] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.
  • [9] F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, Cambridge, 2007.
  • [10] T. Cui and X. T. Tong. A unified performance analysis of likelihood-informed subspace methods. Bernoulli, 28(4):2788–2815, 2022.
  • [11] G. Da Prato. An introduction to infinite-dimensional analysis. Springer Science & Business Media, New York, 2006.
  • [12] G. Da Prato and J. Zabczyk. Stochastic equations in infinite dimensions. Cambridge University Press, 2014.
  • [13] M. Darcy, B. Hamzi, J. Susiluoto, A. Braverman, and H. Owhadi. Learning dynamical systems from data: a simple cross-validation perspective, part ii: nonparametric kernel flows. preprint, 2021.
  • [14] M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. In Handbook of uncertainty quantification, pages 311–428. Springer, 2017.
  • [15] M. V. de Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in operator learning with neural networks. arXiv preprint arXiv:2203.13181, 2022.
  • [16] M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA Journal on Uncertainty Quantification, 2022.
  • [17] M. D’Elia, Q. Du, C. Glusa, M. Gunzburger, X. Tian, and Z. Zhou. Numerical methods for nonlocal and fractional models. Acta Numerica, 29:1–124, 2020.
  • [18] L. Della Maestra and M. Hoffmann. Nonparametric estimation for interacting particle systems: McKean-Vlasov models. Probability Theory and Related Fields, pages 1–63, 2022.
  • [19] Q. Du, M. Gunzburger, R. B. Lehoucq, and K. Zhou. Analysis and Approximation of Nonlocal Diffusion Problems with Volume Constraints. SIAM Rev., 54(4):667–696, 2012.
  • [20] J. Feng, Y. Ren, and S. Tang. Data-driven discovery of interacting particle systems using Gaussian processes. arXiv preprint arXiv:2106.02735, 2021.
  • [21] S. Gazzola, P. C. Hansen, and J. G. Nagy. IR Tools: A MATLAB package of iterative regularization methods and large-scale test problems. Numerical Algorithms, 81(3):773–811, 2019.
  • [22] E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models, volume 40. Cambridge University Press, UK, 2015.
  • [23] P. C. Hansen. Regularization Tools: A Matlab package for analysis and solution of discrete ill-posed problems. Numerical Algorithms, 6(1):1–35, 1994.
  • [24] P. C. Hansen. The L-curve and its use in the numerical treatment of inverse problems. In Computational Inverse Problems in Electrocardiology (P. Johnston, ed.), Advances in Computational Bioengineering, pages 119–142. WIT Press, 2000.
  • [25] Y. He, S. H. Kang, W. Liao, H. Liu, and Y. Liu. Numerical identification of nonlocal potential in aggregation. Commun. Comput. Phys., 32:638–670, 2022.
  • [26] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. Ann. Statist., 36(3):1171–1220, 06 2008.
  • [27] H. Jeffreys. The theory of probability. Oxford University Press, Oxford, 1961.
  • [28] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Springer, 2005.
  • [29] N. B. Kovachki, Z. Li, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481, 2021.
  • [30] Q. Lang and F. Lu. Learning interaction kernels in mean-field equations of first-order systems of interacting particles. SIAM Journal on Scientific Computing, 44(1):A260–A285, 2022.
  • [31] Q. Lang and F. Lu. Small noise analysis for Tikhonov and RKHS regularizations. arXiv preprint arXiv:2305.11055, 2023.
  • [32] P. D. Lax. Functional Analysis. John Wiley & Sons Inc., New York, 2002.
  • [33] M. S. Lehtinen, L. Paivarinta, and E. Somersalo. Linear inverse problems for generalised random variables. Inverse Problems, 5(4):599–612, 1989.
  • [34] W. Li, J. Lu, and L. Wang. Fisher information regularization schemes for Wasserstein gradient flows. Journal of Computational Physics, 416:109449, 2020.
  • [35] Z. Li, F. Lu, M. Maggioni, S. Tang, and C. Zhang. On the identifiability of interaction functions in systems of interacting particles. Stochastic Processes and their Applications, 132:135–163, 2021.
  • [36] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. International Conference on Learning Representations, 2020.
  • [37] F. Lu, Q. An, and Y. Yu. Nonparametric learning of kernels in nonlocal operators. Journal of Peridynamics and Nonlocal Modeling, pages 1–24, 2023.
  • [38] F. Lu, Q. Lang, and Q. An. Data adaptive RKHS Tikhonov regularization for learning kernels in operators. Proceedings of Mathematical and Scientific Machine Learning, PMLR 190:158-172, 2022.
  • [39] F. Lu, M. Maggioni, and S. Tang. Learning interaction kernels in heterogeneous systems of agents from multiple trajectories. Journal of Machine Learning Research, 22(32):1–67, 2021.
  • [40] F. Lu, M. Maggioni, and S. Tang. Learning interaction kernels in stochastic systems of interacting particles from multiple trajectories. Foundations of Computational Mathematics, pages 1–55, 2021.
  • [41] F. Lu, M. Zhong, S. Tang, and M. Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc. Natl. Acad. Sci. USA, 116(29):14424–14433, 2019.
  • [42] L. Lu, P. Jin, and G. E. Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2021.
  • [43] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
  • [44] C. N. Mavridis, A. Tirumalai, and J. S. Baras. Learning swarm interaction dynamics from density evolution. IEEE Transactions on Control of Network Systems, 10(1):214–225, 2022.
  • [45] D. A. Messenger and D. M. Bortz. Learning mean-field equations from particle data using WSINDy. Physica D: Nonlinear Phenomena, 439:133406, 2022.
  • [46] S. Motsch and E. Tadmor. Heterophilious Dynamics Enhances Consensus. SIAM Rev, 56(4):577 – 621, 2014.
  • [47] H. Owhadi and G. R. Yoo. Kernel flows: From learning kernels from data into the abyss. Journal of Computational Physics, 389:22–47, 2019.
  • [48] C. Robert and G. Casella, editors. Monte Carlo Statistical Methods. Springer, 1999.
  • [49] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
  • [50] L. F. Shampine. Vectorized adaptive quadrature in MATLAB. Journal of Computational and Applied Mathematics, 211(2):131–140, 2008.
  • [51] A. Spantini, A. Solonen, T. Cui, J. Martin, L. Tenorio, and Y. Marzouk. Optimal low-rank approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 37(6):A2451–A2487, 2015.
  • [52] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389–2410, 2011.
  • [53] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numer., 19:451–559, 2010.
  • [54] A. N. Tihonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math., 4:1035–1038, 1963.
  • [55] R. Yao, X. Chen, and Y. Yang. Mean-field nonparametric estimation of interacting particle systems. In Conference on Learning Theory, pages 2242–2275. PMLR, 2022.
  • [56] H. You, Y. Yu, S. Silling, and M. D’Elia. A data-driven peridynamic continuum model for upscaling molecular dynamics. Computer Methods in Applied Mechanics and Engineering, 389:114400, 2022.
  • [57] H. You, Y. Yu, N. Trask, M. Gulian, and M. D’Elia. Data-driven learning of nonlocal physics from high-fidelity synthetic data. Computer Methods in Applied Mechanics and Engineering, 374:113553, 2021.
  • [58] M. Yuan and T. T. Cai. A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics, 38(6):3412–3444, 2010.
  • [59] A. Zellner and A. Siow. Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa, 31:585–603, 1980.