A Data-Adaptive Prior for Bayesian Learning
of Kernels in Operators
Abstract
Kernels effectively represent nonlocal dependencies and are extensively employed in formulating operators between function spaces. Thus, learning kernels in operators from data is an inverse problem of general interest. Due to the nonlocal dependence, the inverse problem is often severely ill-posed with a data-dependent normal operator. Traditional Bayesian methods address the ill-posedness by a non-degenerate prior, which may result in an unstable posterior mean in the small noise regime, especially when data induces a perturbation in the null space of the normal operator. We propose a new data-adaptive Reproducing Kernel Hilbert Space (RKHS) prior, which ensures the stability of the posterior mean in the small noise regime. We analyze this adaptive prior and showcase its efficacy through applications on Toeplitz matrices and integral operators. Numerical experiments reveal that fixed non-degenerate priors can produce divergent posterior means under errors from discretization, model inaccuracies, partial observations, or erroneous noise assumptions. In contrast, our data-adaptive RKHS prior consistently yields convergent posterior means.
Keywords: Data-adaptive prior, kernels in operators, linear Bayesian inverse problem,
RKHS, Tikhonov regularization
2020 Mathematics Subject Classification: 62F15, 47A52, 47B32
1 Introduction
Kernels are efficient in representing nonlocal or long-range dependence and interaction between high- or infinite-dimensional variables. Thus, they are widely used to design operators between function spaces, with numerous applications in machine learning such as kernel methods (e.g., [5, 9, 13, 26, 52, 47]) and operator learning (e.g., [29, 43]), in partial differential equations (PDEs) and stochastic processes such as nonlocal and fractional diffusions (e.g., [6, 17, 19, 56, 57]), and in multi-agent systems (e.g., [7, 39, 41, 46]).
Learning kernels in operators from data is an integral part of these applications. We consider the case when the operator depends on the kernel linearly, and the learning is a linear inverse problem. However, the inverse problem is often severely ill-posed, due to the nonlocal dependence and the presence of various perturbations resulting from noise in data, numerical error, or model error. To address the ill-posedness, a Bayesian approach or a variational approach with regularization is often used. In either approach, the major challenge is the selection of a prior or a regularization term since there is limited prior knowledge about the kernel.
This study examines the selection of the prior in a Bayesian approach. The common practice is to use a non-degenerate prior. However, we show that a non-degenerate prior may result in an unstable posterior mean in the small noise regime, especially when data induces a perturbation in the null space of the normal operator.
We propose a new data-adaptive Reproducing Kernel Hilbert Space (RKHS) prior, which ensures the stability of the posterior mean in the small noise regime. We analyze this adaptive prior and showcase its efficacy through applications on learning kernels in Toeplitz matrices and integral operators.
1.1 Problem setup
We aim to learn the kernel in the operator in the following model
(1.1) |
by fitting the model to the data consisting of input-output pairs:
(1.2) |
Here is a Banach space, is a Hilbert space, the measurement noise is a -valued white noise in the sense that for any . The term represents unknown model errors such as model misspecification or computational error due to incomplete data, and it may depend on the input data .
The operator depends non-locally on the kernel in the form
(1.3) |
where is a measure space that can be either a domain in the Euclidean space with the Lebesgue measure or a discrete set with an atomic measure. For simplicity, we let throughout this study. Here is a bounded (nonlinear) functional of and is assumed to be known. Note that the operator can be nonlinear in , but it depends linearly on . Such operators are widely seen in PDEs, matrix operators, and image processing. Examples include the Toeplitz matrix, integral and nonlocal operators; see Sect.2.1. In these examples, there is often limited prior knowledge about the kernel.
The inverse problem of learning the kernel is often ill-posed, due to the nonlocal dependence of the output data on the kernel. The ill-posedness, in terms of minimizing the negative log-likelihood of the data,
(1.4) |
appears as the instability of the minimizer when it exists. Here is the space of square-integrable functions with measure , the normal operator is a trace-class operator, and comes from data; see Sect.2.3 for details. Thus, the ill-posedness is rooted in the unboundedness of and the perturbation of .
There are two common strategies to overcome the ill-posedness: a prior in a Bayesian approach and regularization in a variational approach; see, e.g., [53, 23, 3] and a sample of the large literature in Sect.1.3.
This study focuses on selecting a prior for the Bayesian approach. The major challenge is the limited prior information about the kernel and the need to overcome the ill-posedness caused by a data-dependent normal operator.
1.2 Proposed: a data-adaptive RKHS prior
Due to the lack of prior information about the kernel, one may use the default prior, a non-degenerate prior, e.g., a Gaussian distribution with a non-degenerate covariance operator , with the belief that it is a safe choice to ensure a well-defined posterior.
However, we show that the fixed non-degenerate prior has the risk of leading to a divergent posterior mean in the small noise limit. Specifically, the posterior mean,
obtained from the prior and the likelihood that yields (1.4), blows up as the variance of the noise tends to zero, if contains a perturbation in the null space of the normal operator ; see Proposition 3.2. Such a perturbation can be caused by any of the four types of errors in data or computation: (a) discretization error, (b) model error, (c) partial observations, and (d) wrong noise assumption. Thus, a prior adaptive to and is needed to remove the risk.
We propose a data-adaptive RKHS prior that ensures a stable posterior mean in the small noise regime. It is a Gaussian distribution with the parameter selected adaptive to . We prove in Theorem 4.2 that it leads to a stable posterior whose mean
always has a small noise limit, and the small noise limit converges to the identifiable parts of the true kernel. Furthermore, we show that our prior outperforms the non-degenerate prior in producing a more accurate posterior mean and smaller posterior uncertainty in terms of the trace of the posterior covariance; see Sect.4.2. The prior is called an RKHS prior because its Cameron-Martin space is the RKHS with a reproducing kernel determined by the operator and the data. Importantly, the closure of this RKHS is the space in which the components of the true kernel can be identified from data.
We also study the computational practice of the data-adaptive prior and demonstrate it on the Toeplitz matrices and integral operators. We select the hyper-parameter by the L-curve method in [24]. Numerical tests show that while a fixed non-degenerate prior leads to divergent posterior means, the data-adaptive prior always attains stable posterior means with small noise limits; see Sect.5.
The outline of this study is as follows. Sect.2 presents the mathematical setup of this study, and shows the ill-posedness of the inverse problem through the variational approach. We introduce in Sect.3 the Bayesian approach and show the issue of a fixed non-degenerate prior. To solve the issue, we introduce a data-adaptive prior in Sect.4, and analyze its advantages. Sect.5 discusses the computational practice and demonstrates the advantage of the data-adaptive prior in numerical tests on Toeplitz matrices and integral operators. Finally, Sect.6 concludes our findings and provides future research directions.
1.3 Related work
Prior selection for Bayesian inverse problems.
We focus on prior selection and do not consider the sampling of the posterior, which is a main topic for nonlinear Bayesian inverse problems with a given prior; see, e.g., [14, 28, 51, 53, 10]. Prior selection is an important topic in statistical Bayesian modeling, which dates back to [27]. This study provides a new class of problems where prior selection is crucial: learning kernels in operators, which is an ill-posed linear inverse problem. Our data-adaptive RKHS prior re-discovers the well-known Zellner's g-prior in [59, 1] when the kernel is finite-dimensional and the basis functions are orthonormal. Importantly, the stable small noise limit in this study provides a new criterion for prior selection, a useful addition to the many criteria studied in [4].
Regularization in a variational approach.
The prior is closely related to Tikhonov or ridge regularization in a variational approach. The likelihood function provides a loss function, and the prior often provides a regularization term. Various regularization terms have been studied, including the widely-used Euclidean norm (see, e.g., [21, 23, 24, 54]), the RKHS norm with a pre-specified reproducing kernel (see, e.g., [9, 3]), the total variation norm in [49], and the data-adaptive RKHS norm in [37, 38]. It remains open to compare these norms. Appealingly, the Bayesian approach provides probabilistic tools for analyzing regularizations. Thus, to better understand regularization, it is of interest to study the priors in a Bayesian approach.
Operator learning.
This study focuses on learning the kernels, not the operators. Thus, our focus differs from the focus of the widely-used kernel methods for operator approximation (see, e.g., [47, 13]) and the operator learning (see, e.g., [15, 16, 29, 36, 42, 43]). These methods aim to approximate the operator matching the input and output, not to identify the kernel in the operator.
Gaussian process and kernel-based regression.
Selection of the reproducing kernel is an important component in Gaussian process and kernel-based regression (see, e.g.,[5, 9, 26, 52, 58]). Previous methods often tune the hyper-parameter of a pre-selected class of kernels. Our data-adaptive RKHS prior provides an automatic reproducing kernel, adaptive to data and the model, for these methods.
Learning interacting kernels and nonlocal kernels.
The learning of kernels in operators has been studied in the context of identifying the interaction kernels in interacting particle systems (e.g., [20, 25, 18, 30, 35, 39, 41, 40, 44, 45, 55]) and the nonlocal kernels in homogenization of PDEs (e.g., [37, 56, 57]). This study is the first to analyze the selection of a prior in a Bayesian approach to these kernel learning problems.
2 The learning of kernels in operators
In this section, we discuss learning kernels in operators as a variational inverse problem that maximizes the likelihood. In this process, we introduce a few key concepts for the Bayesian approach in later sections: the function space for the kernel, the normal operator, the function space of identifiability, and a data-adaptive RKHS.
2.1 Examples
We first present a few examples of learning kernels in operators. Note that in these examples, there is little prior information about the kernel.
Example 2.1 (Kernels in Toeplitz matrices).
Consider the estimation of the kernel in the Toeplitz matrix , i.e., for all , from measurement data by fitting the data to the model
(2.1) |
where represents an unknown model error. We can write the Toeplitz matrix as an integral operator in the form of (1.3) with , , and being a uniform discrete measure on . The kernel is a vector with .
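For concreteness, here is a minimal NumPy sketch of this setup, assuming a symmetric Toeplitz structure T(φ)[i, j] = φ(|i − j|); the kernel `phi_true`, the inputs `U`, and the noise level `sigma` are illustrative choices, not the paper's experimental settings.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(0)
n = 10
phi_true = np.exp(-np.arange(n))   # hypothetical kernel vector phi(0), ..., phi(n-1)

def toeplitz_op(phi, u):
    """Apply the symmetric Toeplitz matrix T(phi), T[i, j] = phi(|i - j|), to u."""
    return toeplitz(phi) @ u

# synthetic input-output pairs with additive Gaussian noise (sigma is assumed)
sigma = 0.01
U = rng.standard_normal((3, n))                       # three input vectors u_k
F = np.array([toeplitz_op(phi_true, u) for u in U])   # noiseless outputs
F_noisy = F + sigma * rng.standard_normal(F.shape)    # observed data
```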
Example 2.2 (Integral operator).
Let . We aim to find a function fitting the dataset in (1.2) to the model (1.1) with an integral operator
(2.2) |
We assume that is a white noise, that is, for any . In the form of the operator in (1.3), we have , , and being the Lebesgue measure. This operator is an infinite-dimensional version of the Toeplitz matrix.
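A minimal sketch of a Riemann-sum discretization of such a convolution-type operator, assuming the form R_φ[u](x) = ∫ φ(r) u(x − r) dr on a shared uniform grid (the precise form of (2.2) and the integration domain are assumptions here):

```python
import numpy as np

def conv_operator(phi, u, dr):
    """Riemann-sum discretization of R_phi[u](x) = \int phi(r) u(x - r) dr.

    phi : kernel values on the r-grid; u : input values on the x-grid.
    Points x - r that fall outside the grid are treated as zero (a modeling choice).
    """
    n_x, n_r = len(u), len(phi)
    out = np.zeros(n_x)
    for i in range(n_x):
        for j in range(n_r):
            k = i - j              # index of x - r on the same uniform grid
            if 0 <= k < n_x:
                out[i] += phi[j] * u[k] * dr
    return out
```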
Example 2.3 (Nonlocal operator).
Suppose that we want to estimate a kernel in a model (1.1) with a nonlocal operator
from a given data set as in (1.2) with and , where is a bounded set. Such nonlocal operators arise in [19, 57, 37]. Here is a white noise in the sense that for any . This example corresponds to (1.3) with . Note that even the support of the kernel is unknown.
Example 2.4 (Interaction operator).
Let and and consider the problem of estimating the interaction kernel in the nonlinear operator
by fitting the dataset in (1.2) to the model (1.1). This nonlinear operator corresponds to (1.3) with . It comes from the aggregation operator in the mean-field equation of interacting particles (see, e.g., [7, 30]).
2.2 Variational approach
To identify the kernel, the variational approach seeks a maximum likelihood estimator:
(2.3) |
over a hypothesis space , where the loss function is the negative log-likelihood of the data (1.2) under the assumption that the noise in (1.1) is white.
The first step is to find a proper function space for , in which one can further specify the hypothesis space . Clearly, given a data set, we can only identify the kernel where the data provides information. Examples 2.1– 2.4 show that the support of the kernel is yet to be extracted from data.
We set the function space for learning to be with , where is an empirical probability measure quantifying the data's exploration of the kernel:
(2.4) |
Here is the normalizing constant and is the Dirac delta function. We call an exploration measure. Its support is the region in which the learning process ought to work, and outside of which, we have limited information from the data to learn the function . Thus, we can restrict to be the support of , and we denote for short. In other words, the exploration measure is a generalization of the measure in nonparametric regression that estimates from data , where is the distribution of and the data are samples from the joint distribution of .
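As a rough illustration, the following sketch builds a discrete exploration measure for the convolution setting of Example 2.2, assuming the density in (2.4) weights each r by the averaged magnitude of the data values it multiplies; the exact form of (2.4) depends on the operator, so this is only one plausible discretization (the normalization cancels the grid spacing).

```python
import numpy as np

def exploration_measure(U):
    """Empirical exploration measure on the r-grid (a hedged sketch).

    Assumes rho(r_j) is proportional to (1/N) sum_k sum_i |u_k(x_i - r_j)|,
    i.e., how strongly the data 'explores' the kernel value at r_j; the r-grid
    is assumed to coincide with the x-grid.
    """
    N, n_x = U.shape
    rho = np.zeros(n_x)
    for u in U:
        for j in range(n_x):
            idx = np.arange(n_x) - j
            valid = idx >= 0
            rho[j] += np.abs(u[idx[valid]]).sum()
    return rho / rho.sum()        # normalize to a probability vector
```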
The next step is to find the minimizer of the loss function. Note that is quadratic in since the operator depends linearly on . We get the minimizer by solving the zero of the (Fréchet) derivative of the loss function. To compute its derivative, we first introduce a bilinear form : ,
(2.5)
where the integral kernel is given by, for ,
(2.6) |
in which, by an abuse of notation, we also use to denote either the probability of when defined in (2.4) is discrete or the probability density of when the density exists.
By definition, the bivariate function is symmetric and positive semi-definite in the sense that for any and . In the following, we assume that the data is continuous and bounded so that defines a self-adjoint compact operator, which is fundamental for studying identifiability. This assumption holds under mild regularity conditions on the data and the operator .
Assumption 2.5 (Integrability of ).
Under Assumption 2.5, the integral operator
(2.7) |
is a positive semi-definite trace-class operator (see Lemma A.1). Hereafter we denote the eigen-pairs of with the eigenvalues in descending order, and assume that the eigenfunctions are orthonormal, hence they provide an orthonormal basis of . Furthermore, for any , the bilinear form in (2.5) can be written as
(2.8) |
and we can write the loss functional in (2.3) as
(2.9)
where is the Riesz representation of the bounded linear functional:
(2.10) |
Then, by solving for the zero of , one may obtain the minimizer , provided that it is well-defined. However, it is often ill-defined, e.g., when is not in and is a compact operator with infinite rank or is rank-deficient. Thus, it is important to examine the inversion and the uniqueness of the minimizer, for which we introduce the function space of identifiability in the next section.
2.3 Function space of identifiability
We define a function space of identifiability (FSOI) when one learns the kernel by minimizing the loss function.
Definition 2.6.
The function space of identifiability (FSOI) by the loss functional in (2.3) is the largest linear subspace of in which has a unique minimizer.
The next theorem characterizes the FSOI. Its proof is deferred to Appendix A.1.
Theorem 2.7 (Function space of identifiability).
Suppose the data in (1.2) is generated from the system (1.1) with a true kernel . Suppose that Assumption 2.5 holds. Then, the following statements hold.
-
(a)
The function in (2.10) has the following decomposition:
(2.11) where comes from the model error , the random comes from the observation noise and it has a Gaussian distribution , and they satisfy
-
(b)
The Fréchet derivative of in is .
-
(c)
The function space of identifiability (FSOI) of is with closure in . In particular, if , the unique minimizer of in the FSOI is , where is the pseudo-inverse of . Furthermore, if and there is no observation noise and no model error, we have .
Theorem 2.7 enables us to analyze the ill-posedness of the inverse problems through the operator and . When , the inverse problem has a unique solution in the FSOI , even when it is underdetermined in due to being a proper subspace, which happens when the compact operator has a zero eigenvalue. However, when , the inverse problem has no solution in because is undefined. According to (2.11), this happens in one or more of the following scenarios:
-
•
when the model error leads to .
-
•
when the observation noise leads to . In particular, since is Gaussian , it has the Karhunen–Loève expansion with being i.i.d. . Then, , which diverges a.s. if diverges. Thus, we have a.s. when diverges.
Additionally, only encodes information of , and it provides no information about , where and form an orthogonal decomposition . In other words, the data provides no information to recover .
As a result, it is important to avoid absorbing the errors outside of the FSOI when using a Bayesian approach or a regularization method to mitigate the ill-posedness.
2.4 A data-adaptive RKHS
The RKHS with as a reproducing kernel is a dense subspace of the FSOI. Hence, when the prior is built from this RKHS, as we detail in later sections, one can filter out the error outside the FSOI and ensure that the learning takes place inside the FSOI.
The next lemma is an operator characterization of this RKHS (see, e.g., [9, Section 4.4]). Its proof can be found in [38, Theorem 3.3].
Lemma 2.8 (A data-adaptive RKHS).
Suppose that Assumption 2.5 holds. Then, the following statements hold.
-
(a)
The RKHS with in (2.6) as reproducing kernel satisfies and its inner product satisfies
(2.12) -
(b)
Denote the eigen-pairs of by with being orthonormal. Then, for any , we have
(2.13) where the last equation is restricted to .
The above RKHS has been used for regularization in [37, 38]. The regularizer, named DARTR, regularizes the loss by the norm of this RKHS,
(2.14) |
With the optimal hyper-parameter selected by the L-curve method, it leads to the estimator
(2.15) |
where is the identity operator on . Since the RKHS is a subset of the FSOI with more regular elements, DARTR ensures that the estimator is in the FSOI and is regularized.
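Under our reading of (2.14)–(2.15) (minimizing the loss plus λ times the squared RKHS norm), the DARTR estimator has the component-wise form λ_i φ^D_i /(λ_i² + λ) in the eigenbasis of the normal operator. A sketch, assuming the normal operator and φ^D are available as a matrix `G` and a vector `b` (e.g., on a grid or in a basis); the names are illustrative:

```python
import numpy as np

def dartr_estimator(G, b, lam):
    """RKHS-regularized (DARTR-type) estimator, written in spectral form.

    G   : symmetric positive semi-definite matrix representing the normal operator.
    b   : vector representing phi^D.
    lam : regularization hyper-parameter (assumed given here; in practice it is
          chosen by the L-curve method).
    In the eigenbasis of G the components are lam_i * b_i / (lam_i**2 + lam),
    which equals (G^2 + lam I)^{-1} G b.
    """
    eigval, eigvec = np.linalg.eigh(G)
    b_hat = eigvec.T @ b                        # coordinates of b in the eigenbasis
    c_hat = eigval * b_hat / (eigval**2 + lam)  # component-wise regularized inversion
    return eigvec @ c_hat
```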
We summarize the key terminologies and notations in this section in Table 1.
Notation | Meaning |
---|---|
Exploration measure | A probability measure quantifying the data’s exploration to the kernel (2.4). |
or | Function space of learning |
, , | The normal operator in (2.7), its spectral-decomposition and pseudo-inverse |
Loss function from the likelihood in (2.9) | |
FSOI | Function space of identifiability, in which has a unique minimizer. |
3 Bayesian inversion and the risk in a non-degenerate prior
The Bayesian approach overcomes the ill-posedness by introducing a prior, so the posterior is stable under perturbations. Since little prior information is available about the kernel, it is common to use a non-degenerate prior to ensure the well-posedness of the posterior. However, we will show that a fixed non-degenerate prior may lead to a posterior with a divergent mean in the small noise limit. These discussions promote the data-adaptive prior in the next section.
3.1 The Bayesian approach
In this study, we focus on Gaussian priors, so that, combined with the Gaussian likelihood, the posterior is also a Gaussian measure. Also, without loss of generality, we assume that the prior is centered. Recall that the function space of learning is defined through the exploration measure in (2.4). For illustration, we first specify the prior and posterior when the space is finite-dimensional, and then discuss the infinite-dimensional case. We follow the notations in [53] and [14].
Finite-dimensional case.
Consider first that the space is finite-dimensional, i.e., , as in Example 2.1. Then, the space is equivalent to with norm satisfying . Also, assume that the space is finite-dimensional, and the measurement noise in (1.2) is Gaussian . Since is finite-dimensional, we write the prior, the likelihood, and the posterior in terms of their probability densities with respect to the Lebesgue measure.
-
•
Prior distribution, denoted by , with density
where is a strictly positive matrix, so that the prior is a non-degenerate measure.
- •
-
•
Posterior distribution with density proportional to the product of the prior and likelihood densities,
(3.2)
where the constant term may change from line to line, and
(3.3)
Thus, is a Gaussian measure .
The Bayesian approach is closely related to Tikhonov regularization (see, e.g., [33]). Note that a Gaussian prior corresponds to a regularization term , the negative log-likelihood is the loss function, and the posterior corresponds to the penalized loss:
In particular, the maximum a posteriori (MAP) estimator, which agrees with the posterior mean , is the minimizer of the penalized loss using a penalty term . The difference is that a regularization approach selects the hyper-parameter according to data.
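A minimal sketch of the finite-dimensional Gaussian update behind (3.2)–(3.3), written with a matrix `G` representing the normal operator, a vector `b` representing the data-dependent term, the noise variance `sigma2`, and a prior covariance `Q`; all names are illustrative and the scaling follows the standard conjugate form, up to the conventions of (2.9).

```python
import numpy as np

def gaussian_posterior(G, b, sigma2, Q):
    """Posterior N(mu, C) for a centered Gaussian prior N(0, Q) and the quadratic
    negative log-likelihood (1 / (2 sigma2)) * (c^T G c - 2 b^T c) + const.

    Returns the posterior mean and covariance:
        C  = (G / sigma2 + Q^{-1})^{-1},    mu = C b / sigma2.
    """
    precision = G / sigma2 + np.linalg.inv(Q)
    C = np.linalg.inv(precision)
    mu = C @ (b / sigma2)
    return mu, C
```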
Infinite-dimensional case.
When the space is infinite-dimensional, i.e., the set has infinitely many elements, we use the generic notion of Gaussian measures on Hilbert spaces; see Appendix A.2 for a brief review. Since there is no longer a Lebesgue measure on the infinite-dimensional space, the prior and the posterior are characterized by their means and covariance operators. We write the prior and the posterior as follows:
-
•
Prior , where is a strictly positive trace-class operator on ;
- •
Note that, unlike in the finite-dimensional case, it is problematic to write the likelihood distribution as , because the operator is unbounded and may be ill-defined.
3.2 The risk in a non-degenerate prior
The prior distribution plays a crucial role in Bayesian inverse problems. To make the ill-posed inverse problem well-defined, it is often set to be a non-degenerate measure (i.e., its covariance operator has no zero eigenvalues). It is fixed in many cases and not adaptive to data. Such a non-degenerate prior works well for an inverse problem whose function space of identifiability does not change with data. However, in the learning of kernels in operators, a non-degenerate prior has the risk of leading to a catastrophic error: the posterior mean may diverge in the small observation noise limit, as we show in Proposition 3.2.
Assumption 3.1.
Assume that the operator has finite rank and commutes with the prior covariance, and assume the existence of an error outside of the FSOI, as follows.
-
The operator in (2.7) has zero eigenvalues. Let be the first zero eigenvalue, where is less than the dimension of . As a result, the FSOI is .
-
The covariance of the prior satisfies with for all , where are orthonormal eigenfunctions of .
-
The term in (2.11), which represents the model error, is outside of the FSOI, i.e., has a component for some .
Assumptions (A1)–(A2) are common in practice. Assumption (A1) holds because the operator has finite rank when the data is discrete, and it is not full rank for under-determined problems. It is natural to assume the prior has a full-rank covariance as in (A2). We assume that commutes with for simplicity, and one can extend it to the general case as in the proof of [12, Theorem 2.25, Feldman–Hajek theorem]. Assumption (A3), which requires to be outside the range of , holds when the regression vector is outside the range of the regression matrix in (5.3b); see Sect.5.2–5.3 for more discussion.
Proposition 3.2 (Risk in a fixed non-degenerate prior).
Proof of Proposition 3.2..
Recall that conditional on the data, the observation noise-induced term in (2.11) has a distribution . Thus, in the orthonormal basis , we can write , where are i.i.d. random variables. Additionally, write the true kernel as , where for all . Note that does not have to be in the FSOI. Combining these facts, we have
(3.5) |
Then, the posterior mean in (3.3) becomes
(3.6) |
Thus, the posterior mean is contaminated by the model error outside the FSOI, i.e., the part with components with . It diverges when because
and diverges. ∎
On the other hand, one may adjust the prior covariance by the standard deviation of the noise in the hope of removing the risk of divergence. However, the next proposition shows that such a noise-adaptive non-degenerate prior will have a biased small noise limit.
Proposition 3.3 (Risk in a noise-adaptive non-degenerate prior).
Let the prior be , where with and . Suppose that Assumption 3.1 holds. Then, the corresponding posterior mean either blows up or is biased, satisfying
Proof.
Under the prior with , we have the posterior mean
Then, the limits for , and follow directly. Note that when , the limits are biased, not recovering the identifiable component even when the model error vanishes. ∎
Propositions 3.2 and 3.3 highlight that the risk of a non-degenerate prior comes from the error outside the data-adaptive FSOI. Importantly, by definition of the FSOI, there is no signal to be recovered outside the FSOI. Thus, it is important to design a data-adaptive prior to restrict the learning to take place inside the FSOI.
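A toy numerical illustration of this risk, under the rank-deficient setting of Assumption 3.1; the component-wise posterior-mean formulas below follow from the standard Gaussian conjugate update in the eigenbasis of the normal operator, and the eigenvalues, data components, and prior scaling are made-up numbers for illustration only.

```python
import numpy as np

# rank-deficient normal operator: eigenvalues (1, 0.5, 0) in an orthonormal basis
lams = np.array([1.0, 0.5, 0.0])
b_clean = np.array([0.8, 0.3, 0.0])   # identifiable part (in the range of the operator)
b_err = np.array([0.0, 0.0, 0.1])     # model-error component outside the FSOI, as in (A3)
b = b_clean + b_err

for sigma2 in [1e-1, 1e-3, 1e-5]:
    # fixed non-degenerate prior N(0, I): mean_i = b_i / (lam_i + sigma2)
    mean_fixed = b / (lams + sigma2)
    # data-adaptive RKHS prior (scaling lam0 = 1): mean_i = lam_i b_i / (lam_i^2 + lam0 sigma2)
    mean_da = lams * b / (lams**2 + sigma2)
    print(sigma2, np.linalg.norm(mean_fixed), np.linalg.norm(mean_da))
# as sigma2 -> 0, the fixed-prior mean blows up (0.1 / sigma2 in the null direction),
# while the data-adaptive mean stays bounded.
```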
4 Data-adaptive RKHS prior
We propose a data-adaptive prior to filter out the error outside of the FSOI, so that its posterior mean always has a small noise limit. In particular, the small noise limit converges to the identifiable part of the true kernel when the model error vanishes. Additionally, we show that this prior, even with a sub-optimal , outperforms a large class of fixed non-degenerate priors in the quality of the posterior.
4.1 Data-adaptive prior and its posterior
We first introduce the data-adaptive prior and specify its posterior, which will remove the risk in a non-degenerate prior as shown in Propositions 3.2–3.3. This prior is a Gaussian measure with a covariance from the likelihood.
Following the notations in Sect.2.1, the operator is a data-dependent positive semi-definite trace-class operator on , and we denote its eigen-pairs by with the eigenfunctions forming an orthonormal basis of . Then, as characterized in Theorem 2.7 and Lemma 2.8, the data-dependent FSOI and RKHS are
(4.1) |
where the closure of is with respect to the norm . When the operator has finite rank (e.g., the cases in Section 3.2), these two spaces are the same vector space but with different norms; otherwise, they are different. They are proper subspaces of when the operator has a zero eigenvalue.
Definition 4.1 (Data-adaptive RKHS prior).
Let be the operator defined in (2.7). The data-adaptive prior is a Gaussian measure with mean and covariance defined by
(4.2) |
where the hyper-parameter is determined adaptive to data.
In practice, we select the hyper-parameter adaptive to data by the L-curve method in [24], which is effective in reaching an optimal trade-off between the likelihood and the prior (see Sect.5.1 for more details). We call this prior an RKHS prior because its covariance operator ’s Cameron-Martin space is the RKHS (see, e.g., [11, Section 1.7] and a brief review of the Gaussian measures in Sect.A.2).
This data-adaptive prior is a Gaussian distribution with support in the FSOI in (4.1). When is finite-dimensional, its probability density in is
by definitions (4.2), where is the pseudo-inverse. Combining with the likelihood (3.1), the posterior becomes
(4.3a)
(4.3b)
We emphasize that the operator restricts the posterior to be supported on the FSOI , where the inverse problem is well-defined. The posterior mean in (4.3b) coincides with the minimum norm least squares estimator if there is no regularization, i.e., .
When is infinite-dimensional, the above mean and covariance remain valid, following similar arguments based on the likelihood ratio in (3.4).
In either case, the posterior is a Gaussian distribution whose support is , and it is degenerate in if is a proper subspace of . In short, the data-adaptive prior is a Gaussian distribution supported on the FSOI with a hyper-parameter adaptive to data. Both the prior and posterior are degenerate when the FSOI is a proper subspace of .
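A sketch of this posterior computed in the eigenbasis of the normal operator, assuming the prior covariance in (4.2) is the normal operator scaled by 1/λ₀ (the scaling convention is an assumption) and with the support truncated to the numerically nonzero eigenvalues:

```python
import numpy as np

def da_posterior(G, b, sigma2, lam0, tol=1e-10):
    """Posterior under the data-adaptive RKHS prior N(0, G / lam0) of Definition 4.1,
    computed in the eigenbasis of G and restricted to the FSOI (eigenvalues > tol).

    Posterior mean components:       lam_i b_i / (lam_i^2 + lam0 sigma2), zero outside the FSOI.
    Posterior covariance components: sigma2 lam_i / (lam_i^2 + lam0 sigma2).
    """
    eigval, eigvec = np.linalg.eigh(G)
    active = eigval > tol                                  # FSOI directions
    b_hat = eigvec.T @ b
    mean_hat = np.where(active, eigval * b_hat / (eigval**2 + lam0 * sigma2), 0.0)
    cov_hat = np.where(active, sigma2 * eigval / (eigval**2 + lam0 * sigma2), 0.0)
    mean = eigvec @ mean_hat
    cov = eigvec @ np.diag(cov_hat) @ eigvec.T
    return mean, cov
```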
Table 2 compares the posteriors of the non-degenerate prior and the data-adaptive prior.
Gaussian measure | Mean | Covariance | |
---|---|---|---|
Fixed non-degenerate prior and its posterior | |||
Data-adaptive prior and its posterior | |||
4.2 Quality of the posterior and its MAP estimator
The data-adaptive prior aims to improve the quality of the posterior. Compared with a fixed non-degenerate prior, we show that the data-adaptive prior improves the quality of the posterior in three aspects: (1) it improves the stability of the MAP estimator so that the MAP estimator always has a small noise limit; (2) it improves the accuracy of the MAP estimator by reducing the expected mean square error; and (3) it reduces the uncertainty in the posterior in terms of the trace of the posterior covariance.
We show first that the posterior mean always has a small noise limit, and the limit converges to the projection of the true function in the FSOI when the model error vanishes.
Theorem 4.2 (Small noise limit of the MAP estimator).
Proof.
The claims follow directly from the definition of the new posterior mean in (4.3b) and the decomposition in Eq. (3.5), which says that with . Recall that , and if , we have . Thus, we can write
(4.4) |
since are orthonormal eigenfunctions of with eigenvalues . Thus, the small noise limit exists and is equal to
Furthermore, as the model error , this small noise limit converges to , the projection of in the FSOI. ∎
We show next that the data-adaptive prior leads to a MAP estimator more accurate than the non-degenerate prior’s.
Theorem 4.3 (Expected MSE of the MAP estimator).
Suppose that Assumption 3.1 (A1-A2) holds. Assume in addition that . Then, the expected mean square error of the MAP estimator of the data-adaptive prior is smaller than the non-degenerate prior’s, i.e.,
(4.5) |
where the equality holds only when the two priors are the same. Here and denote expectation with and , respectively.
Proof.
Additionally, the next theorem shows that under the condition , the data-adaptive prior outperforms the non-degenerate prior in producing a posterior with a smaller trace of covariance. We note that this condition is sufficient but not necessary, since the proof is based on component-wise comparison and does not take into account the part (see Remark 4.7 for more discussions).
Theorem 4.4 (Trace of the posterior covariance).
Proof.
By definition, the traces of the two operators are
(4.8)
Thus, when , we have for each , and hence . The last claim follows similarly. ∎
Remark 4.5 (Expected MSE of the MAP and the trace of the posterior covariance).
When there is no model error, we have in (4.8). That is, for the prior , the expected MSE of the MAP estimator is the trace of the posterior covariance [2, Theorem 2]. However, for the data-adaptive prior , we have if and only if , which follows from (4.6) and (4.8). Thus, if , a smaller expected MSE of the MAP estimator in Theorem 4.3 implies a smaller trace of the posterior covariance in Theorem 4.4.
Remark 4.6 (A-optimality).
Theorem 4.4 shows that the data-adaptive prior achieves A-optimality among all priors with satisfying . Here, an A-optimal design is defined to be the one that minimizes the trace of the posterior covariance operator in a certain class ([2] and [8]). It is equivalent to minimizing the expected MSE of the MAP estimator (which is equal to ) through an optimal choice of the . Thus, in our context, the A-optimal design seeks a prior with in a certain class such that is minimized, and the data-adaptive prior achieves A-optimality in the above class of priors.
Remark 4.7 (Conditions on the spectra).
The condition in Theorems 4.3–4.4 is far from necessary, since their proofs are based on a component-wise comparison in the sum and it does not take into account the part . The optimal in practice is often much smaller than the maximal ratio and it depends on the dataset; in particular, it depends nonlinearly on all the elements involved (see Figures 7–8 in Appendix A.3). Thus, a full analysis with an optimal is beyond the scope of this study.
5 Computational practice
We have followed the wisdom of [53] on “avoid discretization until the last possible moment” so that we have presented the analysis of the distributions on using operators. In the same spirit, we avoid selecting a basis for the function space until the last possible moment. The moment arrives now. Based on the abstract theory in the previous sections, we present the implementation of the data-adaptive prior in computational practice. We demonstrate it on Toeplitz matrices and integral operators, which represent finite- and infinite-dimensional function spaces of learning, respectively.
In computational practice, the goal is to estimate the coefficient of in a prescribed hypothesis space with , where the basis functions can be B-splines, polynomials, or wavelets. Then, the prior and posterior are represented by distributions of . Note that the pre-specified basis is rarely orthonormal in because varies with data. Hence, we only require that the basis matrix
(5.1) |
is non-singular, i.e., the basis functions are linearly independent in . This simple requirement reduces redundancy in basis functions.
In terms of , the negative log-likelihood in (2.9) for reads
(5.2)
where the regression matrix and vector are given by
(5.3a)
(5.3b)
The maximum likelihood estimator (MLE) is the default choice of solution when is in the range of . However, the MLE is ill-defined when is not in the range of , which may happen when there is model error (or computational error due to incomplete data, as we have discussed after Theorem 2.7), and a Bayesian approach makes the inverse problem well-posed by introducing a prior.
We will compare our data-adaptive prior with the widely-used standard Gaussian prior on the coefficient, that is, with , the identity matrix on . This prior leads to a posterior with
(5.4) |
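A minimal sketch of this baseline posterior, assuming (5.4) is the standard conjugate Gaussian update for the coefficient with prior N(0, I); `A` and `b` denote the regression matrix and vector of (5.3), and the names are illustrative.

```python
import numpy as np

def posterior_standard_prior(A, b, sigma2):
    """Posterior of the coefficient c under the prior N(0, I) and the likelihood
    built from the regression matrix A and vector b in (5.3):
        C = (A / sigma2 + I)^{-1},    mu = C b / sigma2.
    """
    n = A.shape[0]
    C = np.linalg.inv(A / sigma2 + np.eye(n))
    mu = C @ (b / sigma2)
    return mu, C
```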
5.1 Data-adaptive prior and posterior of the coefficient
The next proposition computes the prior and posterior distributions of the random coefficient of the -valued random variable with the data-adaptive prior (4.2).
Proposition 5.1.
Proof.
Remark 5.2 (Relation between distributions of the coefficient and the function).
The prior and posterior distributions of the coefficient and the function are different: the former depends on the basis , but the latter does not. The relation between the distributions of and is characterized by Lemmas A.2–A.3. Specifically, if and has a Gaussian measure on , then we have provided that in (5.1) is strictly positive definite. Additionally, when computing the trace of the operator , we solve a generalized eigenvalue problem , which follows from the proof of Proposition 5.6 below.
Remark 5.3 (Relation to the basis matrix of the RKHS).
We select the hyper-parameter by the L-curve method in [24]. The L-curve is a log-log plot of the curve with and , where . The L-curve method maximizes the curvature to balance between the minimization of the likelihood and the control of the regularization:
where and are the smallest and the largest generalized eigenvalues of .
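A sketch of the L-curve selection, assuming the residual norm and the regularization norm have been evaluated on a grid of hyper-parameters; the curvature is estimated here by finite differences, which is one common implementation choice rather than the specific recipe of [24].

```python
import numpy as np

def l_curve_lambda(lambdas, residual_norms, reg_norms):
    """Pick the hyper-parameter at the corner of the L-curve.

    The L-curve is (log residual norm, log regularization norm) as lambda varies;
    the corner is located by maximizing a finite-difference estimate of the curvature.
    """
    lambdas = np.asarray(lambdas)
    x = np.log(np.asarray(residual_norms))
    y = np.log(np.asarray(reg_norms))
    t = np.log(lambdas)                       # parametrize the curve by log(lambda)
    dx, dy = np.gradient(x, t), np.gradient(y, t)
    ddx, ddy = np.gradient(dx, t), np.gradient(dy, t)
    curvature = np.abs(dx * ddy - dy * ddx) / ((dx**2 + dy**2) ** 1.5 + 1e-12)
    return lambdas[int(np.nanargmax(curvature))]
```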
We summarize the priors and posteriors in computation in Table 3.
Gaussian measure | Mean | Covariance |
---|---|---|
Remark 5.4 (Avoiding pseudo-inverse of singular matrix).
The inverse of the matrix in (5.5b) can cause a large numerical error when is singular or severely ill-conditioned. We increase the numerical stability by avoiding : let and write as
(5.6) |
Remark 5.5 (Relation to Zellner’s g-prior).
The next proposition shows that the eigenvalues of are solved by a generalized eigenvalue problem. Its proof is deferred to Appendix A.1.
Proposition 5.6.
Assume that the hypothesis space satisfies with , where is the integral operator in (2.7). Let and be defined in (5.3b). Then, the operator has eigenvalues solved by the generalized eigenvalue problem with in (5.1):
(5.7) |
and the corresponding eigenfunctions of are . Additionally, for any in , we have with
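A sketch of this reduction using SciPy's generalized symmetric eigensolver, assuming (5.7) has the form A̅ v = λ B v suggested by the proof in Appendix A.1; `A_bar` and `B` are the regression and basis matrices of (5.3a) and (5.1), and the names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def generalized_spectrum(A_bar, B):
    """Solve the generalized eigenvalue problem A_bar v = lam B v, as in (5.7).

    A_bar : regression matrix in the basis (5.3a); B : basis (Gram) matrix (5.1).
    Returns the eigenvalues of the normal operator restricted to the hypothesis
    space (in descending order) and the corresponding coefficient vectors, which
    scipy normalizes to be B-orthonormal.
    """
    eigval, eigvec = eigh(A_bar, B)   # scipy solves A v = lam B v for symmetric A, SPD B
    order = np.argsort(eigval)[::-1]
    return eigval[order], eigvec[:, order]
```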
5.2 Example: discrete kernels in Toeplitz matrices
The Toeplitz matrix in Example 2.1 has a vector kernel, which lies in a finite-dimensional function space of learning . It provides a typical example of discrete kernels. We use the simplest case of a Toeplitz matrix to demonstrate the data-adaptive function space of identifiability and the advantages of the data-adaptive prior.
We aim to recover the kernel in the Toeplitz matrix from measurement data by fitting the data to the model (2.1). The kernel is a vector with . Since is linear in , there is a matrix such that . Note that is linear in since is; hence only linearly independent data bring new information for the recovery of .
A least squares estimator (LSE) of is
Here, the inverse is a pseudo-inverse when is singular. However, the pseudo-inverse is unstable to perturbations, and the inverse problem is ill-posed.
We only need to identify the basis matrix in (5.1) to get the data-adaptive prior and its posterior in Table 3. The basis matrix requires two elements: the exploration measure and the basis functions. Here the exploration measure in (2.4) is with , where is the normalizing constant. Meanwhile, the implicit hypothesis space for the above vector with is with basis , where is the Kronecker delta function. Then, the basis matrix of in , as defined in (5.1), is . Thus, if is not strictly positive, this basis matrix is singular and these basis functions are linearly dependent (hence redundant) in . In such a case, we select a linearly independent basis for , which is a proper subspace of , and we use the pseudo-inverses of and to remove the redundant rows. Additionally, since the vector is the same as its coefficient , the priors and posteriors in Table 2 and Table 3 are the same.
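A sketch of assembling the regression matrix and vector for this example, assuming the symmetric Toeplitz structure T(φ)[i, j] = φ(|i − j|) and the averaged normal equations A = N⁻¹ Σ_k M[u_k]ᵀ M[u_k], b = N⁻¹ Σ_k M[u_k]ᵀ f_k (the normalization by N is an assumption); the names are illustrative.

```python
import numpy as np

def toeplitz_regression(U, F):
    """Assemble the regression matrix A and vector b for the Toeplitz example.

    Uses the linearity T(phi) u = M[u] phi, where M[u][i, k] = sum_{j: |i-j|=k} u_j
    for the symmetric Toeplitz structure T[i, j] = phi(|i - j|).
    U, F : (N, n) arrays of input/output vectors.
    """
    N, n = U.shape
    A, b = np.zeros((n, n)), np.zeros(n)
    for u, f in zip(U, F):
        M = np.zeros((n, n))
        for i in range(n):
            for k in range(n):
                for j in range(n):
                    if abs(i - j) == k:
                        M[i, k] += u[j]
        A += M.T @ M / N
        b += M.T @ f / N
    return A, b
```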
Toeplitz matrix with .
Table 4 shows three representative datasets for the inverse problem: (1) the dataset leads to a well-posed inverse problem in though it appears ill-posed in , (2) the dataset leads to a well-posed inverse problem, and (3) the dataset leads to an ill-posed inverse problem, and our data-adaptive prior significantly improves the accuracy of the posterior; see Table 5. Computational details are presented in Appendix A.3.
Data | on | FSOI | Eigenvalues of
---|---|---|---|
Bias of | Bias of | |||
---|---|---|---|---|
FSOI | ||||
FSOI |
Table 5 demonstrates the significant advantage of the data-adaptive prior over the non-degenerate prior in the case of the third dataset. We examine the performance of the posterior in two aspects: the trace of its covariance operator, and the bias in the posterior mean. Following Remark 5.2, we compute the trace of the covariance operator of the posterior by solving a generalized eigenvalue problem. Table 5 presents the means and standard deviations of the traces and the relative errors of the posterior mean. It considers two cases: in the FSOI and outside of the FSOI (see Table 4). We highlight two observations.
-
•
The posterior mean is much more accurate than . When is in the FSOI, is relatively accurate. When is outside the FSOI, the major bias comes from the part outside the FSOI, i.e., the part leads to a relatively large error.
-
•
The trace of is significantly smaller than . Here has a zero eigenvalue in the direction outside of the FSOI, while is full rank.
Discrete inverse problem: stable small noise limit.
For the discrete inverse problem of solving in for , the next proposition shows that Assumption 3.1 (A3) does not hold, regardless of the presence of model error or missing data in . Thus, it has a small noise limit. Numerical tests confirm that is always in the range of the operator and that has a small noise limit regardless of the model error (e.g., ) or computational error due to missing data. However, for continuous inverse problems that estimate a continuous function , Assumption 3.1 (A3) holds when is computed using different regression arrays from those in due to discretization or missing data (see Sect.5.3) or errors in integration by parts [30].
Proposition 5.7.
Let and , where and for each . Then, .
Proof.
First, we show that it suffices to consider 's being rank-1 arrays. The SVD (singular value decomposition) of each gives , where are the singular values and the left and right singular vectors, which are orthonormal, i.e., and . Denote , which is rank-1. Note that . Thus, we can write and in terms of rank-1 arrays.
Next, for each , write the rank-1 array as with and both being unitary vectors. Then, , and (where the ’s can be linearly dependent). Therefore, is in the range of because is a scalar. ∎
5.3 Example: continuous kernels in integral operators
For the continuous kernels of the integral operators in Examples 2.2-2.4, their function space of learning is infinite-dimensional. Their Bayesian inversions are similar, so we demonstrate the computation using the convolution operator in Example 2.2. In particular, we compare our data-adaptive prior with a fixed non-degenerate prior in the presence of four types of errors: (i) discretization error, (ii) model error, (iii) partial observation (or missing data), and (iv) wrong noise assumption.
Recall that with , we aim to recover the kernel in the operator in (2.2), , by fitting the model (1.1) to an input-output dataset . We set to be the probability densities of normal distributions for and we compute by the global adaptive quadrature method in [50]. The data are on uniform meshes and of with and . Here is generated by
(5.8) |
where are i.i.d. random variables (except in the wrong-noise-assumption case specified later) with variance and are artificial model errors with (no model error) or (a small model error).

The exploration measure (defined in (2.4)) of this dataset has a density
with being the normalizing constant. We set the , where are B-spline basis functions (i.e., piecewise polynomials) with degree 3 and with knots from a uniform partition of . We approximate and using the Riemann sum integration,
where we approximate via Riemann integration . Additionally, to illustrate the effects of discretization error, we also compute in (5.3b) using the continuous with quadrature integration, and denote the matrix by .
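A sketch of this discretization, assuming the convolution form R_φ[u](x) = ∫ φ(r) u(x − r) dr, uniform x- and r-grids with the same spacing, and a uniform-knot cubic B-spline basis; the helper names and normalizations are illustrative.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(r_grid, n_basis, degree=3):
    """Evaluate n_basis B-spline basis functions (clamped uniform knots on the r-grid's
    range) at the points r_grid; returns P with P[j, i] = p_i(r_j)."""
    a, b = r_grid[0], r_grid[-1]
    inner = np.linspace(a, b, n_basis - degree + 1)
    knots = np.concatenate([[a] * degree, inner, [b] * degree])
    P = np.zeros((len(r_grid), n_basis))
    for i in range(n_basis):
        coef = np.zeros(n_basis)
        coef[i] = 1.0
        P[:, i] = BSpline(knots, coef, degree, extrapolate=False)(r_grid)
    return np.nan_to_num(P)

def assemble_regression(U, F, P, dr, dx):
    """Assemble A and b of (5.3) by Riemann sums for R_phi[u](x) = \int phi(r) u(x-r) dr.

    U, F : (N, n_x) arrays of inputs/outputs on a uniform x-grid;
    P    : (n_r, n_basis) values of the basis functions on the r-grid (same spacing).
    """
    N, n_x = U.shape
    n_r, n_b = P.shape
    A, b = np.zeros((n_b, n_b)), np.zeros(n_b)
    for u, f in zip(U, F):
        # conv[i, j] = u(x_i - r_j), set to zero outside the grid
        conv = np.zeros((n_x, n_r))
        for j in range(n_r):
            idx = np.arange(n_x) - j
            valid = idx >= 0
            conv[valid, j] = u[idx[valid]]
        Phi = conv @ P * dr          # Phi[i, m] = sum_j p_m(r_j) u(x_i - r_j) dr
        A += Phi.T @ Phi * dx / N
        b += Phi.T @ f * dx / N
    return A, b
```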
Figure 1 shows the exploration measure and the eigenvalues of the basis matrix , and (which are the generalized eigenvalues of ). Note that the support is a proper subset of , leading to a near singular . In particular, the inverse problem is severely ill-posed in since has multiple almost-zero eigenvalues.
In addition to the observation noise, we consider four types of errors in that often happen in practice.
-
1.
Discretization Error. We assume that in (5.8) has no model error.
-
2.
Partial Observation. We assume that misses data in the first quarter of the interval, i.e. for . Also, assume that there is no model error.
-
3.
Model Error. Assume there are model errors.
-
4.
Wrong Noise Assumption. Suppose that the measurement noise is uniformly distributed on the interval . Thus, the model, which assumes a Gaussian noise, has a wrong noise assumption. Notice that we add a to keep the variance at the same level as the Gaussian.
For each of the four cases, we compute the posterior means in Table 3 with the optimal hyper-parameter selected by the L-curve method, and report the error of the function estimators. Additionally, for each of them, we consider different levels of observation noise in , so as to demonstrate the small noise limit of the posterior mean.
We assess the performance of the fixed prior and the data-adaptive prior in Table 3 through the accuracy of their posterior means. We report the interquartile range (IQR, the , and percentiles) of the errors of their posterior means in independent simulations in which are randomly sampled.
Two scenarios are considered: is either inside or outside the FSOI. To draw samples of outside the FSOI, we sample the coefficient of from the fixed prior . Thus, the fixed prior is the true prior. To sample inside the FSOI, we sample with from , where has being the -th eigenvector of .
Note that the exploration measure, the matrices , and are the same in all these simulations because they are determined by the data and the basis functions. Thus, we only need to compute for each simulation.


Figure 2 shows the IQR of these simulations in the scenario that the true kernels are outside the FSOI. The fixed prior leads to diverging posterior means in 6 out of the 8 cases, while the DA-prior has stable posterior means in all cases. The fixed prior leads to a diverging posterior mean when using the continuously integrated regression matrix , because the discrepancy between and leads to a perturbation outside the FSOI, satisfying Assumption 3.1 (A3). Similarly, either the model error or partial observation error in causes a perturbation outside the FSOI of , making the fixed prior’s posterior mean diverge. On the other hand, the discretely computed matches in the sense that as proved in Proposition 5.7, so the fixed prior has a stable posterior mean in cases of discretization and wrong noise assumption. In all these cases, the error of the posterior mean of the DA-prior does not decay as , because the error is dominated by the part outside of the FSOI that cannot be recovered from data.
Figure 3 shows the IQR of these simulations with the true kernels sampled inside the FSOI. The DA prior leads to posterior means that are not only stable but also converge to small noise limits, whereas the fixed prior leads to diverging posterior means as in Figure 2. The convergence of the posterior means of the DA prior can be seen in the cases of “Discretization” and “Wrong noise” with both continuously and discretely computed regression matrices. Meanwhile, the flat lines of the DA prior in the cases of “Model error” or “Partial observations” are due to the error inside the FSOI caused by either the model error or partial observation error in , as shown in the proof of Theorem 4.2.
Additionally, we show in Figure 4 and Figure 5 the estimated posterior (in terms of its mean, the and percentiles) in a typical simulation, when is outside and inside the FSOI, respectively. Here, the percentiles are computed by drawing samples from the posterior. A more accurate posterior would have a more accurate mean and a narrower shaded region between the percentiles so as to have a smaller uncertainty.
In all cases, the DA prior leads to a more accurate posterior mean (MAP) than the fixed prior.
When the observation noise has , the DA prior leads to a posterior with a larger shaded region between the percentiles than the fixed prior, but when , the DA prior's shaded region is much smaller than that of the fixed prior.


In summary, these numerical results confirm that the data-adaptive prior removes the risk in a fixed non-degenerate prior, leading to a robust posterior with a small noise limit.
Remark 5.8 (Increasing data size).
We have chosen a low number of data points in the numerical tests, so that the inverse problems are under-determined and hence are meaningful for a Bayesian study. In general, the blow-up behavior caused by a fixed non-degenerate prior will remain when the data size increases as long as the inverse problem satisfies the conditions in Assumption 3.1 (i.e., under-determined with deficient rank, using a fixed non-degenerate prior, with error in the null space of the normal operator).
5.4 Limitations of the data-adaptive prior
As stated in [24]: “every practical method has advantages and disadvantages”. The major advantage of the data-adaptive RKHS prior is to avoid the posterior being contaminated by the errors outside of the data-dependent function space of identifiability.
A drawback of the data-adaptive prior is its reliance on selecting the hyper-parameter . The L-curve method is state-of-the-art and works well in our numerical tests, yet it has limitations in dealing with smoothness and asymptotic consistency (see [24]). An improper hyper-parameter can lead to a posterior with an inaccurate mean and unreliable covariance. Also, the premise is that the identifiable part of the true kernel is in the data-adaptive RKHS. But this RKHS can be restrictive when the data is smooth, leading to an overly-smoothed estimator if the true kernel is non-smooth. It remains open to use a prior with as its covariance, as introduced in [31], with to detect the smoothness of the true kernel. We leave this as potential future work.
Also, the data-adaptive prior in this study is for linear inverse problems, and it does not apply to nonlinear problems in which the operator depends on the kernel nonlinearly. However, the covariance of our data-adaptive prior corresponds to a scaled Fisher information matrix. Thus, for nonlinear inverse problems, a potential adaptive prior is the scaled Fisher information, which has been explored as a regularization method in [34].
6 Conclusion
The inverse problem of learning kernels in operators is often severely ill-posed. We show that a fixed non-degenerate prior leads to a divergent posterior mean when the observation noise becomes small if the data induces a perturbation in the eigenspace of zero eigenvalues of the normal operator.
We have solved the issue by a data-adaptive RKHS prior. It leads to a stable posterior whose mean always has a small noise limit, and the small noise limit converges to the identifiable part of the true kernel when the perturbation vanishes. Its covariance is the normal operator with a hyper-parameter selected adaptive to data by the L-curve method. Also, the data-adaptive prior improves the quality of the posterior over the fixed prior in two aspects: a smaller expected mean square error of the posterior mean, and a smaller trace of the covariance operator, thus reducing the uncertainty.
Furthermore, we provide a detailed analysis of the data-adaptive prior in computational practice. We demonstrate its advantage on Toeplitz matrices and integral operators in the presence of four types of errors. Numerical tests show that when the noise becomes small, a fixed non-degenerate prior may lead to a divergent posterior mean, whereas the data-adaptive prior always attains stable posterior means.
We have also discussed the limitations of the data-adaptive prior, such as its dependence on the hyper-parameter selection and its tendency to over-smoothing. It is of interest to overcome these limitations in future research by adaptively selecting the regularity of the prior covariance through a fractional operator. Among various other directions to be further explored, we mention one that is particularly relevant in the era of big data: to investigate the inverse problem when the data are randomly sampled in the setting of infinite-dimensional statistical models (e.g., [22]). When the operator is linear in , the examples of Toeplitz matrices and integral operators show that the inverse problem will become less ill-posed when the number of linearly independent data increases. When is nonlinear in , it remains open to understand how the ill-posedness depends on the data. Another direction would be to consider sampling the posterior exploiting MCMC or sequential Monte Carlo methodologies (e.g., [48]).
Acknowledgments
The work of FL is funded by the Johns Hopkins University Catalyst Award, FA9550-20-1-0288, and NSF-DMS-2238486. XW is supported by the Johns Hopkins University research fund. FL would like to thank Yue Yu and Mauro Maggioni for inspiring discussions.
Appendix A Appendix
A.1 Identifiability theory
The main theme in the identifiability theory is to find the function space in which the quadratic loss functional has a unique minimizer. The next lemma shows that the normal operator defined in (2.7) is a trace-class operator. Recall that an operator on a Hilbert space is trace-class if it satisfies for any complete orthonormal basis .
Proof.
Theorem 2.7 characterizes the FSOI through the normal operator .
Proof of Theorem 2.7.
Part follows from the definition of in (2.10). In fact, plugging in into the right hand side of (2.10), we have, ,
where the first term in the last equation comes from the definition of the operator in (2.7), and the second and the third terms come from the Riesz representation. Since each is a -valued white noise, the random variable is Gaussian with mean zero and variance for each . Thus, has a Gaussian distribution .
Part follows directly from loss functional in (2.9).
For Part , first, note that the quadratic loss functional has a unique minimizer in . Meanwhile, note that is the orthogonal complement of the null space of , and for any such that . Thus, is the largest such function space, and we conclude that is the FSOI.
Next, for any , the estimator is well-defined. By Part , this estimator is the unique zero of the loss functional's Fréchet derivative in . Hence it is the unique minimizer of in . In particular, when the data is noiseless, has no model error, and is generated from , i.e., , we have from Part . Hence . That is, is the unique minimizer of the loss functional . ∎
Proof of Proposition 5.6.
Let with . Then, is an eigenfunction of with eigenvalue if and only if for each ,
where the last equality follows from the definition of . Meanwhile, by the definition of we have for each . Then, Equation (5.7) follows.
Next, to compute , we denote and . Then, we can write
Hence, we can obtain in via:
where the last equality follows from .
Additionally, to prove , we use the generalized eigenvalue problem. Since , we have . Meanwhile, implies that . Thus, , which is . ∎
A.2 Gaussian measures on a Hilbert space
A Gaussian measure on a Hilbert space is defined by its mean and covariance operator (see [11, Chapter 1-2] and [12]). Let be a Hilbert space with inner product , and let denote its Borel algebra. Let be a symmetric nonnegative trace class operator on , that is and for any , and for any complete orthonormal basis . Additionally, denote the eigenvalues (in descending order) and eigenfunctions of .
A measure on with mean and covariance operator is a Gaussian measure iff its Fourier transform is for any . The measure is non-degenerate if , i.e., for all . It is a product measure , where for each . Note that if is infinite-dimensional, that is, the Cameron-Martin space has measure zero.
The next lemma specifies the covariance of the coefficient of an -valued Gaussian random variable. The coefficient can be on either a full or partial basis.
Lemma A.2 (Covariance operator to covariance matrix).
Let be a Hilbert space with a complete basis that may not be orthonormal, where . For , assume that the matrix is strictly positive definite. Let be an -valued random variable with Gaussian measure , where is a trace class operator. Then, the coefficient has a Gaussian distribution , where
Proof.
Without loss of generality, we assume : otherwise, we replace by in the following proof. By definition, for any , the random variable has distribution . Thus, we have for each . Similarly, we have that . Then, we have
Hence, the random vector is Gaussian with . Now, noticing that and , we obtain that the distribution of is , where the covariance matrix can be derived as . ∎
On the other hand, the distribution of the coefficient only determines a Gaussian measure on the linear space its basis spans.
Lemma A.3 (Covariance matrix to covariance operator).
Let with be a Hilbert space with basis such that the matrix is strictly positive definite. Assume the coefficient of has a Gaussian measure . Then, the -valued random variable has a Gaussian distribution , where the operator is defined by .
Proof.
Since is a complete basis, we only need to determine the distribution of the random vector . Note that it satisfies . Thus, its distribution is Gaussian . ∎
A.3 Details of numerical examples
Computation for Toeplitz matrix.
Each dataset leads to an exploration measure on :
Since each leads to a rank-2 regression matrix
the regression matrices of the three datasets are
(A.1)
Additionally, with , the prior covariances are
(A.2)
We analyze the well-posedness of the inverse problem in terms of the operator , whose eigenvalues are solved via the generalized eigenvalue problem (see Proposition 5.6).
- For the dataset , the exploration measure is degenerate with ; thus, we have no information from the data to identify . As a result, is a proper subspace of . The regression matrix and the covariance matrix are effectively the identity matrix and . The operator has eigenvalues , and the FSOI is . Thus, the inverse problem is well-posed in .
- For the dataset , the inverse problem is well-posed because the operator has eigenvalues , and the FSOI is . Note that the data-adaptive prior assigns weights to the entries of the coefficient according to the exploration measure.
- For the dataset , the inverse problem is ill-defined in , but it is well-posed in the FSOI, which is a proper subset of . Here the FSOI is , spanned by the eigenvectors of with positive eigenvalues. Following (5.7), these eigenvectors are solved from the generalized eigenvalue problem, and they are orthonormal in . The eigenvalues are and the corresponding eigenvectors are , and .
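The generalized eigenvalue computation in the last item can be sketched as follows (the matrices below are illustrative placeholders, since the entries of (A.1)–(A.2) are not reproduced here): eigenpairs of A c = λ B c with positive eigenvalues span the FSOI, and the eigenvectors come out orthonormal with respect to B.

```python
# Illustrative sketch: generalized eigenvalue problem A c = lambda * B c
# (placeholder matrices; see Proposition 5.6 for the roles of A and B).
import numpy as np
from scipy.linalg import eigh

A = np.array([[1.0, 0.5, 0.0],           # a rank-deficient regression matrix (assumed)
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
B = np.diag([1.0, 0.5, 0.25])            # Gram matrix of the basis in L^2(rho) (assumed)

lam, V = eigh(A, B)                      # generalized eigenpairs, eigenvalues in ascending order
print("eigenvalues:", lam)

fsoi = V[:, lam > 1e-12]                 # eigenvectors with positive eigenvalues span the FSOI
print("orthonormal in the B-weighted inner product:",
      np.allclose(fsoi.T @ B @ fsoi, np.eye(fsoi.shape[1])))
```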
The hyper-parameter selected by the L-curve method.
Figure 6 shows a typical L-curve, where and represents the square root of the loss . The L-curve method selects the parameter that attains the maximal curvature at the corner of the L-shaped curve.
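A minimal sketch of the corner-selection step (illustrative, not the paper's implementation; the least-squares form below is a generic stand-in for the loss): compute the estimator over a grid of hyper-parameters, record the log–log curve of misfit versus penalty norm, and return the hyper-parameter of maximal discrete curvature.

```python
# L-curve corner selection by maximal discrete curvature (generic stand-in).
import numpy as np

def l_curve_lambda(M, d, L, lambdas):
    """Select a hyper-parameter for min ||M c - d||^2 + lambda^2 ||L c||^2
    by the maximal-curvature criterion on the log-log L-curve."""
    xs, ys = [], []
    for lam in lambdas:
        c = np.linalg.solve(M.T @ M + lam**2 * (L.T @ L), M.T @ d)
        xs.append(np.log(np.linalg.norm(M @ c - d) + 1e-15))   # log of the square root of the loss
        ys.append(np.log(np.linalg.norm(L @ c) + 1e-15))       # log of the regularization norm
    x, y = np.array(xs), np.array(ys)
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    curvature = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2 + 1e-15) ** 1.5
    return lambdas[int(np.argmax(curvature))]

# Example usage on a small synthetic ill-conditioned system (assumed data):
# M = np.vander(np.linspace(0, 1, 50), 8); d = M @ np.ones(8)
# lam_star = l_curve_lambda(M, d, np.eye(8), np.logspace(-8, 0, 60))
```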
Figures 7–8 present the hyper-parameters selected in the simulations of Figures 2–3, respectively. These hyper-parameters are mostly similar, with the majority at the scale of . They show that the optimal hyper-parameter depends on the spectrum of , the four types of errors in , the strength of the noise, and the smoothness of the true kernel. Generally, a large variation of suggests difficulty in selecting an optimal hyper-parameter by this method. Additionally, numerical error in the matrix inversion or in solving the linear systems can affect the result when is small. Thus, a systematic analysis of the optimal hyper-parameter is beyond the scope of this study.
[Figure 6: a typical L-curve.]
[Figure 7: the hyper-parameters selected by the L-curve method in the simulations of Figure 2.]
[Figure 8: the hyper-parameters selected by the L-curve method in the simulations of Figure 3.]
References
- [1] A. Agliari and C. C. Parisetti. A-g reference informative prior: A note on Zellner's prior. Journal of the Royal Statistical Society. Series D (The Statistician), 37(3):271–275, 1988.
- [2] A. Alexanderian, P. J. Gloor, and O. Ghattas. On Bayesian A- and D-optimal experimental designs in infinite dimensions. Bayesian Analysis, 11(3):671–695, 2016.
- [3] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007.
- [4] M. J. Bayarri, J. O. Berger, A. Forte, and G. García-Donato. Criteria for Bayesian model choice with application to variable selection. The Annals of Statistics, 40(3), Jun 2012.
- [5] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
- [6] C. Bucur and E. Valdinoci. Nonlocal Diffusion and Applications, volume 20 of Lecture Notes of the Unione Matematica Italiana. Springer International Publishing, Cham, 2016.
- [7] J. A. Carrillo, K. Craig, and Y. Yao. Aggregation-diffusion equations: dynamics, asymptotics, and singular limits. In Active Particles, Volume 2, pages 65–108. Springer, 2019.
- [8] K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273–304, 1995.
- [9] F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, Cambridge, 2007.
- [10] T. Cui and X. T. Tong. A unified performance analysis of likelihood-informed subspace methods. Bernoulli, 28(4):2788–2815, 2022.
- [11] G. Da Prato. An introduction to infinite-dimensional analysis. Springer Science & Business Media, New York, 2006.
- [12] G. Da Prato and J. Zabczyk. Stochastic equations in infinite dimensions. Cambridge University Press, 2014.
- [13] M. Darcy, B. Hamzi, J. Susiluoto, A. Braverman, and H. Owhadi. Learning dynamical systems from data: a simple cross-validation perspective, part ii: nonparametric kernel flows. preprint, 2021.
- [14] M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. In Handbook of uncertainty quantification, pages 311–428. Springer, 2017.
- [15] M. V. de Hoop, D. Z. Huang, E. Qian, and A. M. Stuart. The cost-accuracy trade-off in operator learning with neural networks. arXiv preprint arXiv:2203.13181, 2022.
- [16] M. V. de Hoop, N. B. Kovachki, N. H. Nelsen, and A. M. Stuart. Convergence rates for learning linear operators from noisy data. SIAM/ASA Journal on Uncertainty Quantification, 2022.
- [17] M. D’Elia, Q. Du, C. Glusa, M. Gunzburger, X. Tian, and Z. Zhou. Numerical methods for nonlocal and fractional models. Acta Numerica, 29:1–124, 2020.
- [18] L. Della Maestra and M. Hoffmann. Nonparametric estimation for interacting particle systems: McKean-Vlasov models. Probability Theory and Related Fields, pages 1–63, 2022.
- [19] Q. Du, M. Gunzburger, R. B. Lehoucq, and K. Zhou. Analysis and Approximation of Nonlocal Diffusion Problems with Volume Constraints. SIAM Rev., 54(4):667–696, 2012.
- [20] J. Feng, Y. Ren, and S. Tang. Data-driven discovery of interacting particle systems using gaussian processes. arXiv preprint arXiv:2106.02735, 2021.
- [21] S. Gazzola, P. C. Hansen, and J. G. Nagy. IR Tools: A MATLAB package of iterative regularization methods and large-scale test problems. Numerical Algorithms, 81(3):773–811, 2019.
- [22] E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models, volume 40. Cambridge University Press, UK, 2015.
- [23] P. C. Hansen. Regularization Tools: A MATLAB package for analysis and solution of discrete ill-posed problems. Numerical Algorithms, 6(1):1–35, 1994.
- [24] P. C. Hansen. The L-curve and its use in the numerical treatment of inverse problems. In Computational Inverse Problems in Electrocardiology (P. Johnston, ed.), Advances in Computational Bioengineering, pages 119–142. WIT Press, 2000.
- [25] Y. He, S. H. Kang, W. Liao, H. Liu, and Y. Liu. Numerical identification of nonlocal potential in aggregation. Commun. Comput. Phys., 32:638–670, 2022.
- [26] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. Ann. Statist., 36(3):1171–1220, 06 2008.
- [27] H. Jeffreys. The theory of probability. Oxford University Press, Oxford, 1961.
- [28] J. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Springer, 2005.
- [29] N. B. Kovachki, Z. Li, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Neural operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481, 2021.
- [30] Q. Lang and F. Lu. Learning interaction kernels in mean-field equations of first-order systems of interacting particles. SIAM Journal on Scientific Computing, 44(1):A260–A285, 2022.
- [31] Q. Lang and F. Lu. Small noise analysis for Tikhonov and RKHS regularizations. arXiv preprint arXiv:2305.11055, 2023.
- [32] P. D. Lax. Functional Analysis. John Wiley & Sons Inc., New York, 2002.
- [33] M. S. Lehtinen, L. Paivarinta, and E. Somersalo. Linear inverse problems for generalised random variables. Inverse Problems, 5(4):599–612, 1989.
- [34] W. Li, J. Lu, and L. Wang. Fisher information regularization schemes for Wasserstein gradient flows. Journal of Computational Physics, 416:109449, 2020.
- [35] Z. Li, F. Lu, M. Maggioni, S. Tang, and C. Zhang. On the identifiability of interaction functions in systems of interacting particles. Stochastic Processes and their Applications, 132:135–163, 2021.
- [36] Z. Li, N. B. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar. Fourier neural operator for parametric partial differential equations. International Conference on Learning Representations, 2020.
- [37] F. Lu, Q. An, and Y. Yu. Nonparametric learning of kernels in nonlocal operators. Journal of Peridynamics and Nonlocal Modeling, pages 1–24, 2023.
- [38] F. Lu, Q. Lang, and Q. An. Data adaptive RKHS Tikhonov regularization for learning kernels in operators. Proceedings of Mathematical and Scientific Machine Learning, PMLR 190:158-172, 2022.
- [39] F. Lu, M. Maggioni, and S. Tang. Learning interaction kernels in heterogeneous systems of agents from multiple trajectories. Journal of Machine Learning Research, 22(32):1–67, 2021.
- [40] F. Lu, M. Maggioni, and S. Tang. Learning interaction kernels in stochastic systems of interacting particles from multiple trajectories. Foundations of Computational Mathematics, pages 1–55, 2021.
- [41] F. Lu, M. Zhong, S. Tang, and M. Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc. Natl. Acad. Sci. USA, 116(29):14424–14433, 2019.
- [42] L. Lu, P. Jin, and G. E. Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2021.
- [43] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.
- [44] C. N. Mavridis, A. Tirumalai, and J. S. Baras. Learning swarm interaction dynamics from density evolution. IEEE Transactions on Control of Network Systems, 10(1):214–225, 2022.
- [45] D. A. Messenger and D. M. Bortz. Learning mean-field equations from particle data using WSINDy. Physica D: Nonlinear Phenomena, 439:133406, 2022.
- [46] S. Motsch and E. Tadmor. Heterophilious Dynamics Enhances Consensus. SIAM Rev, 56(4):577 – 621, 2014.
- [47] H. Owhadi and G. R. Yoo. Kernel flows: From learning kernels from data into the abyss. Journal of Computational Physics, 389:22–47, 2019.
- [48] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 1999.
- [49] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
- [50] L. F. Shampine. Vectorized adaptive quadrature in MATLAB. Journal of Computational and Applied Mathematics, 211(2):131–140, 2008.
- [51] A. Spantini, A. Solonen, T. Cui, J. Martin, L. Tenorio, and Y. Marzouk. Optimal low-rank approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 37(6):A2451–A2487, 2015.
- [52] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389–2410, 2011.
- [53] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numer., 19:451–559, 2010.
- [54] A. N. Tihonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math., 4:1035–1038, 1963.
- [55] R. Yao, X. Chen, and Y. Yang. Mean-field nonparametric estimation of interacting particle systems. In Conference on Learning Theory, pages 2242–2275. PMLR, 2022.
- [56] H. You, Y. Yu, S. Silling, and M. D’Elia. A data-driven peridynamic continuum model for upscaling molecular dynamics. Computer Methods in Applied Mechanics and Engineering, 389:114400, 2022.
- [57] H. You, Y. Yu, N. Trask, M. Gulian, and M. D’Elia. Data-driven learning of nonlocal physics from high-fidelity synthetic data. Computer Methods in Applied Mechanics and Engineering, 374:113553, 2021.
- [58] M. Yuan and T. T. Cai. A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics, 38(6):3412–3444, 2010.
- [59] A. Zellner and A. Siow. Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa, 31:585–603, 1980.