
Scalable Bayesian inference for heat kernel Gaussian processes on manifolds

Junhui He, Guoxuan Ma, Jian Kang and Ying Yang

Department of Mathematical Sciences, Tsinghua University, Beijing, China; Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States

Address for correspondence: Jian Kang ([email protected]) and Ying Yang ([email protected])
Abstract

We develop scalable manifold learning methods and theory, motivated by the problem of estimating the manifold of fMRI activation in the Human Connectome Project (HCP). We propose Fast Graph Laplacian Estimation for Heat Kernel Gaussian Processes (FLGP) in the natural exponential family model. FLGP handles large sample sizes $n$, preserves the intrinsic geometry of the data, and significantly reduces computational complexity from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$ via a novel reduced-rank approximation of the graph Laplacian's transition matrix and a truncated singular value decomposition for eigenpair computation. Our numerical experiments demonstrate FLGP's scalability and improved accuracy for manifold learning from large-scale complex data.

keywords:
Exponential family; Graph Laplacian; Heat kernel; Manifold; Scalable Gaussian process; Subsampling

1 Introduction

In recent years, a variety of data has emerged in the machine learning community, such as image and network data. These data sets often possess non-Euclidean geometry, contributing to a rapid rise in manifold learning. Let us consider a simple case: a spiral curve in $\mathbb{R}^2$ as shown in Figure 1. The spiral is a one-dimensional submanifold of $\mathbb{R}^2$. Moreover, the top left panel of Figure 1 shows that for two points $x, x'$ on the curve, the Euclidean distance is the length of the line segment connecting them (black dotted line), whereas the geodesic distance is the length of the spiral arc connecting them (black solid line). Consequently, the differences in dimension and distance substantially affect statistical inference. It is well known that in Gaussian process (GP) approaches, many covariance kernels are based on the distance, such as the squared exponential kernel and the Matérn kernel (Genton, 2001; Rasmussen and Williams, 2006), and the posterior contraction rates depend on the dimension of the data, as highlighted in van der Vaart and van Zanten (2008, 2009).

In this article, we concentrate on the construction of a domain-adaptive kernel and propose a fast graph Laplacian estimation for heat kernel Gaussian processes (FLGP). For better visualization, we plot intrinsic kernels and Euclidean-based kernels on the spiral in Figure 1. The top right panel illustrates that the heat kernel depends on the geodesic distance: its value gradually declines as the intrinsic distance increases. Conversely, the squared exponential kernel in the bottom left panel is based on the Euclidean distance. Notably, the FLGP kernel gives an accurate estimate of the heat kernel and reflects the intrinsic manifold well. Furthermore, we apply FLGP and Euclidean-based kernels (RBF GP) to make predictions on the spiral. As a result, we obtain an RMSE of 0.343 for FLGP and an RMSE of 2.323 for RBF GP. It is clear that without the intrinsic geometry of the spiral, the Euclidean-based GP yields inferior results compared with the domain-adaptive GP.

Figure 1: The spiral case whose intrinsic geometry is non-Euclidean. Top left: distances on the spiral. Top right: heat kernel at $x$. Bottom left: squared exponential kernel at $x$. Bottom right: estimated kernel at $x$ by FLGP.

Manifold learning is also motivated by the analysis of biomedical imaging data. Traditional analysis of volumetric functional magnetic resonance imaging (fMRI) data, used in studying brain activation signals, often assumes that these signals exist within a regular three-dimensional Euclidean space. However, this assumption may not capture the complex nature of brain activation patterns. The human brain's cortical surface, where these signals are mapped, is not a simple three-dimensional structure but a complex, folded surface. This unique topology challenges standard Euclidean-based analysis methods. Manifold learning methods fill this gap by offering a more sophisticated approach to modeling brain imaging data. They acknowledge and leverage the intricate geometry of the brain's surface, enabling a more accurate understanding of the brain's functional architecture. This approach has the potential to lead to discoveries in neuroscience and medicine. Thus, whether for understanding complex brain patterns or for simpler geometric structures such as spheres, exploiting the intrinsic geometry of the data is crucial for optimal performance.

We focus on the prediction problem with GPs in this paper. Suppose we have $n$ observations consisting of $m$ labelled data $\{(x_i, y_i)\}_{i=1}^m \subset \mathcal{X} \times \mathcal{Y}$ and $n-m$ unlabelled data $\{x_i\}_{i=m+1}^n \subset \mathcal{X}$. Define the point cloud as $X^n = \{x_i\}_{i=1}^n$. Assume that the domain space $\mathcal{X}$ is unknown. We are interested in the cases where $\mathcal{X}$ is a non-Euclidean manifold. The geodesic distance is completely different from the Euclidean distance, making it inappropriate to treat $\mathcal{X}$ as a Euclidean space. Our objective is to use $X^n$ to estimate the unknown $\mathcal{X}$ and to devise a prediction algorithm that respects the intrinsic geometry of $\mathcal{X}$. In the context of GPs, this involves constructing a domain-adaptive kernel rather than relying on a kernel based solely on extrinsic distance. A straightforward approach is to substitute the Euclidean distance with the geodesic distance. For example, in squared exponential kernels, for any $x, x' \in \mathcal{X}$, consider the kernel function $k(x,x') = \exp\{-\rho(x,x')^2/2\epsilon^2\}$, where $\rho$ represents the geodesic distance and $\epsilon > 0$. However, such kernels are generally not positive semi-definite and hence do not qualify as covariance functions (Feragen et al., 2015). Our primary goal is to develop a valid domain-adaptive kernel for GPs on Riemannian manifolds.

Let $\mathcal{X}$ be a $d$-dimensional submanifold of $\mathbb{R}^p$. When $\mathcal{X}$ is not naturally a submanifold of $\mathbb{R}^p$, such as Kendall shape spaces (Kendall et al., 2009; Kim et al., 2021) and Grassmann manifolds (Lee, 2012), we can embed it into a high-dimensional Euclidean space by the Whitney embedding theorem; see extrinsic GPs in Lin et al. (2019). Yang and Dunson (2016) use a simple squared exponential Gaussian process prior with a random scaling parameter adaptive to the intrinsic dimension. They achieve a posterior contraction rate depending on the intrinsic dimension $d$. Castillo et al. (2014) construct a Gaussian process prior with the rescaled heat kernel on $\mathcal{X}$. They show that the heat kernel characterizes the geometry of $\mathcal{X}$ through heat diffusion. The posterior distribution yields an optimal contraction rate adaptive to the manifold. However, $\mathcal{X}$ is unknown in many cases, and the heat kernel is analytically intractable except for some trivial cases such as Euclidean spaces or spheres. Thus, we have to compute the heat kernel numerically in practice.

Some researchers utilize the connection between the heat kernel and the transition density of Brownian motion (BM) on manifolds to estimate the heat kernel. The primary challenge is to simulate BM sample paths on $\mathcal{X}$. Niu et al. (2019) do the simulation by solving systems of stochastic differential equations based on the coordinate chart, and Ye et al. (2024) do it through the exponential map. However, even if $\mathcal{X}$ is given, a global parametrization or an explicit exponential map does not exist in general. In addition, $\mathcal{X}$ is unknown in many practical problems. In this case, Niu et al. (2023) employ probabilistic latent variable models to estimate a probabilistic parametrization and simulate BM trajectories on the unknown manifold using the probabilistic metric. The sampling procedure of trajectories is computationally expensive, as they have to simulate Brownian motion many times originating from each input.

Furthermore, the spectral property suggests that the heat kernel can also be constructed from the sum of eigenpairs of the Laplace-Beltrami operator on the manifold. Dunson et al. (2021a) use the Gaussian-kernelized graph Laplacian on $X^n$ to estimate the Laplace-Beltrami operator and propose a graph Laplacian Gaussian process (GLGP) for regression. In their algorithm, the heat kernel is estimated by the sum of finitely many eigenpairs of the graph Laplacian. Similarly, Borovitskiy et al. (2021) construct the graph Matérn kernel relying on the eigenpairs of the normalized graph Laplacian of a predefined graph. As the graph Laplacian is an $n \times n$ matrix, the time complexity of the eigendecomposition is $\mathcal{O}(n^3)$. This is unaffordable for large-scale data.

Apart from the heat kernel, there are other methods that respect the geometry. If $\mathcal{X}$ is embedded into a lower-dimensional Euclidean space preserving the intrinsic distance (Roweis and Saul, 2000; Belkin and Niyogi, 2001; Balasubramanian and Schwartz, 2002; Weinberger and Saul, 2006), then we can make inference on the lower-dimensional representation to improve the performance. However, it is very challenging to find such an embedding when $\mathcal{X}$ is unknown and complicated.

In this article, we propose a fast graph Laplacian approach to estimate heat kernel Gaussian processes efficiently and accurately. Our novel insight is to construct a random walk on $X^n$ through a transition decomposition strategy. Initially, we subsample $s$ induced points from the point cloud $X^n$, denoted by $I^s = \{u_i\}_{i=1}^s$. Then the one-step movement from $X^n$ to $X^n$ is decomposed into two steps, a jump from $X^n$ to $I^s$ and a jump from $I^s$ to $X^n$. This results in a reduced-rank approximation of the stochastic matrix. The truncated SVD is employed to calculate the eigenpairs instead of a full eigendecomposition, which dramatically speeds up the computation. Finally, the heat kernel covariance matrix is constructed from the eigenpairs via the spectral property. The time complexity decreases remarkably from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$. The algorithm is illustrated on simulated and real data, where it consistently outperforms competing algorithms.

Generally speaking, FLGP is useful in any scenario where the intrinsic geometry of $\mathcal{X}$ differs from the extrinsic geometry of $\mathbb{R}^p$. It gives a fast estimate of the heat kernel from $X^n$ without prior knowledge of $\mathcal{X}$, making domain-adaptive GPs accessible for large-scale data. We have implemented the algorithms in an R package using Rcpp, which provides a convenient interface in R. In addition, we develop a general theoretical framework for FLGP in the natural exponential family (NEF). This distribution family covers a wide range of models, including normal regression and classification. The posterior contraction rates of heat kernel GPs in NEF depend on the intrinsic dimension $d$ rather than $p$ and are much sharper than those of extrinsic kernel GPs. The approximation error of heat kernels in FLGP is estimated via the spectral convergence property of the graph Laplacian and a perturbation analysis. Finally, the predictive error is bounded in terms of the covariance approximation error.

The paper is organized as follows. Section 2 gives preliminaries on the NEF model and heat kernels on manifolds. Section 3 provides a detailed description of the FLGP methodology. The approach consists of subsampling induced points, constructing random walks, solving a truncated SVD, and computing the heat kernel. Section 4 presents the main theoretical results. Section 5 illustrates the performance in various examples, and Section 6 discusses future work.

The FLGP package is available at https://github.com/junhuihe2000/FLGP.

2 Preliminary

2.1 Natural exponential family

Suppose the real-valued random variable $Y$ has a law in the natural exponential family (NEF), denoted as $Y \sim \text{NEF}(\theta)$. That is, under the Lebesgue measure or the counting measure, we have

\pi(y|\theta) = h(y)\exp\{\theta\kappa(y) - J(\theta)\}, \quad h(y) \geq 0,   (1)

where $\theta$ is called the natural parameter, $h(y)$ does not depend on $\theta$, and $\kappa$ is the sufficient statistic of $Y$. The natural exponential family includes many of the most common models, such as normal regression and binary classification, and it arises in many fields of statistics and machine learning (Brown et al., 2010; Jonathan R. Bradley and Wikle, 2020; Schweinberger and Stewart, 2020). The expectation of $\kappa$ is given by $\mathbb{E}_\theta\{\kappa(Y)\} = J'(\theta)$. This is the quantity of interest in this article.

2.2 Heat kernels on manifolds

Let $\mathcal{X}$ be a $d$-dimensional Riemannian manifold and let $\rho$ be the geodesic distance on $\mathcal{X}$. Assume $G$ is a measure on $\mathcal{X}$ with a smooth positive density function $g$ under the volume form $V$. Let $\Delta_x$ be the weighted Laplace-Beltrami operator on $(\mathcal{X}, G)$. Define the Dirichlet-Laplace operator by $\mathcal{L} = -\Delta_x$. Then $\mathcal{L}$ is a non-negative definite self-adjoint operator on $L^2(\mathcal{X}, G)$. The heat kernel $p_t(x,x')$ is a fundamental solution of the heat equation on $(\mathcal{X}, G)$, satisfying the following conditions,

\begin{cases} \frac{\partial}{\partial t} p_t(x,x') = \Delta_x p_t(x,x'), & t \in \mathbb{R}^+,\ x, x' \in \mathcal{X}, \\ \lim_{t \to 0} p_t(x,x') = \delta(x,x'), & x, x' \in \mathcal{X}, \end{cases}

where $\delta(x,x')$ is the Dirac delta function and $t \in \mathbb{R}^+$ is the diffusion time (Grigor'yan, 2009). The heat kernel describes the heat diffusion from $x$ to $x'$ on the manifold over time $t$.

Assume that $\mathcal{X}$ is compact. The spectral theorem states that the spectrum of $\mathcal{L}$ consists of an increasing non-negative sequence $\{\lambda_i\}_{i=1}^\infty$ such that $\lambda_1 = 0$ and $\lim_{i\to\infty} \lambda_i = +\infty$. There exists an orthonormal basis $\{e_i\}_{i=1}^\infty$ of $L^2(\mathcal{X}, G)$ such that for $i \geq 1$, $e_i$ is the eigenfunction of $\mathcal{L}$ associated with $\lambda_i$. Subsequently, from Grigor'yan (2009), for $t \in \mathbb{R}^+$ and $x, x' \in \mathcal{X}$, $p_t(x,x')$ can be expanded as

p_t(x,x') = \sum_{i=1}^{\infty} \exp(-t\lambda_i) e_i(x) e_i(x').   (2)

If the eigenpairs of $\mathcal{L}$ are estimated, then we can use (2) to approximate the heat kernel. So our goal is to construct a fast numerical estimate of the spectrum of $\mathcal{L}$. We refer readers to Grigor'yan (2006, 2009) for further discussion of heat kernels on manifolds.
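As a concrete illustration, the following minimal sketch (Python with numpy, not part of the paper's R/Rcpp package) evaluates the truncated version of the expansion (2) from a supplied set of eigenpairs; the names lam, Phi and heat_kernel_truncated are illustrative.

    import numpy as np

    def heat_kernel_truncated(lam, Phi, t):
        # lam : (M,) eigenvalues 0 = lambda_1 <= ... <= lambda_M of the Dirichlet-Laplace operator
        # Phi : (n, M) eigenfunction evaluations, Phi[i, k] = e_k(x_i)
        # t   : diffusion time > 0
        # Returns the (n, n) matrix sum_k exp(-t * lambda_k) e_k(x_i) e_k(x_j), i.e. (2) truncated at M terms.
        weights = np.exp(-t * lam)
        return (Phi * weights) @ Phi.T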

3 Methodology

Assume that the observations consist of labeled observations $\{(x_i,y_i)\}_{i=1}^m$ and unlabeled observations $\{x_i\}_{i=m+1}^n$, which are independent and identically distributed (i.i.d.) samples drawn from the population $(X, Y) \in \mathcal{X} \times \mathcal{Y}$. Denote the labeled inputs $\{x_i\}_{i=1}^m$ and labeled outputs $\{y_i\}_{i=1}^m$ by $X^m$ and $Y^m$. Suppose the input variable $X$ is endowed with the distribution $G$. Assume that $G$ is absolutely continuous and denote the marginal density by $g(x)$. For any $x \in \mathcal{X}$, suppose the conditional distribution is $\pi_\theta(y|x) \sim \text{NEF}\{\theta(x)\}$ as in (1). Consider the mean function $\varphi(x) \triangleq \mathbb{E}_{\theta(x)}\{\kappa(Y)\}$. In NEF, the state space of $\varphi$ varies across concrete likelihoods and is not necessarily identical to $\mathbb{R}$.

Through a well-chosen link function $l$, we assume that $\varphi = l \circ f$ with a latent function $f$. To induce a prior on $\varphi$ through $l$, we assign a heat kernel GP with adaptive diffusion time to $f$. Specifically, we assume

f(x) \mid T \sim \mathcal{GP}\{0, p_T(x,x')\},   (3)

where $p_t(x,x')$ is the heat kernel on $\mathcal{X}$ defined in (2), and $T$ is the random scaling parameter with prior distribution $\pi(t) \propto t^{-a_0}\exp\{-t^{-d/2}\log^{1+d/2}(1/t)\}$ for $a_0 > 1$.

Let $f^m$ be the $m$-dimensional vector $\{f(x_i)\}_{i=1}^m$. Then the posterior expectation of the prediction function at a test input $x^*$ is given by

\mathbb{E}[\varphi^*|X^m, Y^m, x^*] = \int\int l(f^*)\,\pi(f^*|f^m, x^*)\,\pi(f^m|X^m, Y^m)\,df^m\,df^*,

where $\pi(f^m|X^m,Y^m) = \pi(Y^m|f^m)\pi(f^m|X^m)/\pi(Y^m|X^m)$ is the posterior density over the latent variables, and $\varphi^*, f^*$ denote $\varphi(x^*), f(x^*)$ respectively.
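This double integral is rarely available in closed form outside the conjugate normal model. A minimal Monte Carlo sketch is given below, assuming posterior draws of $f^m$ (e.g., from MCMC or a Laplace approximation) and the relevant blocks of the estimated covariance are already available; the names f_m_draws, C_mm, C_sm, C_ss and link are illustrative rather than part of the FLGP package.

    import numpy as np

    def predictive_mean(f_m_draws, C_mm, C_sm, C_ss, link, rng=np.random.default_rng(0)):
        # f_m_draws : (S, m) posterior samples of the latent vector f^m
        # C_mm      : (m, m) covariance of f^m; C_sm : (m,) cov(f*, f^m); C_ss : scalar var(f*)
        # link      : the link function l, applied to scalars
        K_inv = np.linalg.inv(C_mm + 1e-8 * np.eye(C_mm.shape[0]))  # small jitter for stability
        cond_var = max(C_ss - C_sm @ K_inv @ C_sm, 0.0)             # Var(f* | f^m)
        vals = []
        for f_m in f_m_draws:
            cond_mean = C_sm @ K_inv @ f_m                          # E(f* | f^m)
            f_star = rng.normal(cond_mean, np.sqrt(cond_var))       # draw f* | f^m
            vals.append(link(f_star))                               # phi* = l(f*)
        return np.mean(vals)                                        # E[phi* | X^m, Y^m, x*]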

3.1 Fast graph Laplacian heat kernel estimation

Given the desirable properties of $p_t$, the heat kernel Gaussian process is an excellent choice for a domain-adaptive GP. However, the heat kernel is analytically intractable except in trivial cases, even if $\mathcal{X}$ is known. In this section, we propose a fast graph Laplacian approach to estimate the heat kernel and construct the heat kernel GP for large-scale data when $\mathcal{X}$ is unknown. Given labeled observations $\{(x_i,y_i)\}_{i=1}^m$ and unlabeled observations $\{x_i\}_{i=m+1}^n$, the objective is to approximate a heat kernel covariance matrix over $X^n$ at low computational cost. We present the FLGP algorithm in five steps.

Step 1, subsample $s$ induced points from the point cloud $X^n$, denoted as $I^s = \{u_i\}_{i=1}^s$. The induced point set is used to simplify $X^n$ and represent the domain geometry. In this article, the subsampling method is k-means clustering or random sampling. Step 2, choose a suitable kernel function $k(x,x')$, such as the squared exponential kernel $k(x,x') = \exp(-\|x-x'\|^2/4\epsilon^2)$ with $\epsilon > 0$. Construct the cross kernel matrix $K$ between $X^n$ and $I^s$ with $K_{ij} = k(x_i, u_j)$ for $1 \leq i \leq n$, $1 \leq j \leq s$. Then $K$ is an $n \times s$ matrix. Step 3, obtain the cross similarity matrix $Z$ by normalizing $K$. For $1 \leq i \leq n$, $1 \leq j \leq s$, in random subsampling, $Z_{ij}$ is given by

Z_{ij} = \frac{k(x_i, u_j)}{\sum_{q=1}^{n} k(x_q, u_j) \sum_{q=1}^{s} k(x_i, u_q)},

and in k-means clustering, $Z_{ij}$ is given by

Z_{ij} = \frac{n_j k(x_i, u_j)}{\sum_{q=1}^{n} k(x_q, u_j) \sum_{q=1}^{s} n_q k(x_i, u_q)},   (4)

where $\{n_i\}_{i=1}^s$ denotes the number of points in each cluster. Construct the cross transition matrix $A = D^{-1}Z$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^s Z_{ij}$, $1 \leq i \leq n$. Notably, $A$ can serve as the probability transition matrix of the random walk from $X^n$ to $I^s$. Step 4, let $\Lambda$ be a diagonal matrix such that $\Lambda_{jj} = \sum_{i=1}^n A_{ij}$, $1 \leq j \leq s$, and define the subsampling graph Laplacian $L$ as

L = I - A\Lambda^{-1}A^{\top}.   (5)

Perform a truncated SVD of $A\Lambda^{-1/2}$ to obtain the singular values $\{\sigma_{i,\epsilon}\}_{i=1}^s$ and the left singular vectors $\{v_{i,\epsilon}\}_{i=1}^s$ with $\|v_{i,\epsilon}\|_2 = 1$, $1 \leq i \leq s$. Then the eigenpairs of $L$ are given by $\{\lambda_{i,\epsilon} = 1 - \sigma_{i,\epsilon}^2,\ v_{i,\epsilon}\}_{i=1}^s$. According to Proposition 1, suppose $\{\lambda_{i,\epsilon}\}_{i=1}^s$ are in ascending order, i.e., $0 = \lambda_{1,\epsilon} \leq \cdots \leq \lambda_{s,\epsilon} \leq 1$. Step 5, choose the first $M$ eigenvalues and corresponding eigenvectors of $L$ and estimate the heat kernel based on (2) as

C_{\epsilon,M,t} = n\sum_{i=1}^{M} \exp(-t\lambda_{i,\epsilon}) v_{i,\epsilon} v_{i,\epsilon}^{\top},   (6)

where $t > 0$ is the diffusion time.
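The following minimal sketch (Python with numpy; dense arithmetic, random subsampling and the squared exponential kernel, without the sparsity of Section 3.2) mirrors Steps 1-5 above. It is an illustrative re-implementation under these simplifying assumptions, not the authors' Rcpp code; eps, t, s and M follow the notation in the text.

    import numpy as np

    def flgp_heat_kernel(X, s, M, eps, t, rng=np.random.default_rng(0)):
        n = X.shape[0]
        # Step 1: subsample s induced points (random subsampling for simplicity).
        U = X[rng.choice(n, size=s, replace=False)]
        # Step 2: cross kernel matrix K_ij = exp(-||x_i - u_j||^2 / (4 eps^2)).
        sq_dist = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq_dist / (4 * eps ** 2))
        # Step 3: cross similarity Z and cross transition matrix A = D^{-1} Z.
        Z = K / (K.sum(axis=0, keepdims=True) * K.sum(axis=1, keepdims=True))
        A = Z / Z.sum(axis=1, keepdims=True)
        # Step 4: truncated SVD of A Lambda^{-1/2}; eigenpairs of L = I - A Lambda^{-1} A^T.
        Lam = A.sum(axis=0)                              # diagonal of Lambda
        V, sigma, _ = np.linalg.svd(A / np.sqrt(Lam), full_matrices=False)
        lam = 1.0 - sigma ** 2                           # ascending, since sigma is descending
        # Step 5: heat kernel estimate from the first M eigenpairs, as in (6).
        Vm = V[:, :M]
        return n * (Vm * np.exp(-t * lam[:M])) @ Vm.T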

[Flow chart: subsample the induced points $I^s$ from $X^n$; compute the cross kernel matrix $K$; construct the cross transition matrix $A$; perform a truncated SVD for $A\Lambda^{-1}A^{\top}$; estimate the heat kernel $C_{\epsilon,M,t}$.]
Figure 2: The flow chart of FLGP to estimate the heat kernel. The transition decomposition strategy contributes to the construction of a reduced-rank stochastic matrix in random walks.

The whole procedure of FLGP is shown in Figure 2. Finally, we can make inference in GPs with the estimated covariance matrix $C_{\epsilon,M,t}$. Owing to the construction, we have the following proposition for the spectrum of $L$. The proof is given in Appendix A.

Proposition 1.

The eigenvalues of the graph Laplacian $L$ in (5) are real-valued and lie in $[0,1]$. In particular, 0 is the smallest eigenvalue of $L$.

The key novelty in FLGP is to construct a reduced-rank approximation of the transition matrix. Then the eigendecomposition of $L$ is replaced by the truncated SVD of $A\Lambda^{-1/2}$. We explain the idea behind the approximation from the random walk viewpoint. According to graph spectral theory, there is a close relationship between the heat kernel on manifolds and random walks on the point cloud $X^n$. The stochastic transition matrix is $n \times n$. Based on the induced points $I^s$, we can decompose the direct random walk into two steps. For $x, x' \in X^n$, the movement from $x$ to $x'$ is decomposed into a jump from $x$ to $I^s$ and a jump from $I^s$ to $x'$. Then the transition probability is

q(x, I^s)q(I^s, x') \triangleq \sum_{j=1}^{s} q(x, u_j) q(u_j, x').

If we define the cross transition matrix $A$ from $X^n$ to $I^s$ as in Step 3, and let $\Lambda^{-1}A^{\top}$ be the transition matrix from $I^s$ to $X^n$, then the two-step transition probability matrix is given by $A\Lambda^{-1}A^{\top}$. This leads to the definition of the graph Laplacian in (5). In contrast, Dunson et al. (2021a) do not utilize the transition decomposition strategy. They construct the random walk with one-step movements. Consequently, they end up with a full-rank transition matrix and have to perform an eigendecomposition of that matrix.

The diagram inside the dashed rectangle in Figure 2 illustrates the transition decomposition graphically. As $I^s$ contains only $s$ induced points, the rank of $A\Lambda^{-1}A^{\top}$ is at most $s$. In contrast, the rank of the stochastic matrix in the direct random walk on $X^n$ is $n$, much larger than $s$. Thus, we obtain a reduced-rank approximation of the transition probability matrix. Subsequently, the complexity of the truncated SVD is $\mathcal{O}(ns^2 + s^3)$ instead of $\mathcal{O}(n^3)$. This accelerates the computation for large-scale data. In the next subsection, sparsity is introduced into the transition matrix to further remove the $ns^2$ term.

3.2 Sparsity and likelihood-free kernels

Let $r$ be a positive integer. For each $x_i \in X^n$, let $\{i_j\}_{j=1}^r$ be the indexes of the $r$ closest induced points to $x_i$. The local induced point set of $x_i$ is defined as $[x_i] \triangleq \{u_j\}_{j\in\{i_1,\ldots,i_r\}}$. Denote the sparse kernel matrix by $K^*$, meaning that for $1 \leq i \leq n$, if $j \in \{i_1, i_2, \ldots, i_r\}$, then $K^*_{ij} = K_{ij}$; otherwise $K^*_{ij} = 0$. Thus $K^*$ is an $n \times s$ sparse matrix with $nr$ non-zero elements. Consequently, the cross similarity matrix and the cross transition matrix are sparse matrices with $nr$ non-zero elements. This reduces memory usage, and the time complexity of the truncated SVD decreases to $\mathcal{O}(nr^2 + s^3)$. The introduced sparsity accelerates the computation when $r \ll s$. In addition, due to the locally Euclidean property (Lee, 2012), the Euclidean distance is almost equivalent to the intrinsic distance in a small neighborhood. Hence sparsity also enhances the prediction performance when the global domain geometry is highly non-Euclidean.
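A minimal sketch of the sparsified cross kernel $K^*$ is shown below (Python with numpy, scipy and scikit-learn): each $x_i$ keeps only its $r$ nearest induced points, giving an $n \times s$ sparse matrix with $nr$ non-zero entries. Function and variable names are illustrative, and the nearest-neighbour search is only one possible implementation choice.

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.neighbors import NearestNeighbors

    def sparse_cross_kernel(X, U, r, eps):
        # X : (n, p) point cloud, U : (s, p) induced points
        n, s = X.shape[0], U.shape[0]
        nn = NearestNeighbors(n_neighbors=r).fit(U)
        dist, idx = nn.kneighbors(X)                        # (n, r) distances/indices of [x_i]
        rows = np.repeat(np.arange(n), r)
        vals = np.exp(-dist.ravel() ** 2 / (4 * eps ** 2))  # squared exponential kernel on local pairs only
        return csr_matrix((vals, (rows, idx.ravel())), shape=(n, s))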

For the squared exponential kernel, selecting the bandwidth parameter $\epsilon$ involves computing the marginal likelihood, which requires redundant floating-point computation and slows down the algorithm. Moreover, the heat kernel depends on the geometry of $\mathcal{X}$, while the marginal likelihood refers only to the labeled observations. As $X^n$ contains more information about $\mathcal{X}$ than $X^m$, we want to use the whole point cloud to construct the kernel. Therefore, it is advantageous to substitute a likelihood-free alternative for the squared exponential kernel. We adopt the local anchor embedding (LAE) kernel of Liu et al. (2010) in this work. The kernel matrix $K^{\mathrm{lae}}$ is calculated by minimizing the squared loss of a convex reconstruction problem, i.e., for $1 \leq i \leq n$,

\min \left\|x_i - \sum_{j\in\{i_1,\ldots,i_r\}} K^{\mathrm{lae}}_{ij} u_j\right\|^2, \quad \text{s.t.} \quad K^{\mathrm{lae}}_{ij} \geq 0,\ j \in \{i_1,\ldots,i_r\},\ \sum_{j\in\{i_1,\ldots,i_r\}} K^{\mathrm{lae}}_{ij} = 1.

The optimal solution gives the coefficients of the projection of $x_i$ onto the convex hull spanned by $[x_i]$. Let the remaining elements of $K^{\mathrm{lae}}$ be zero. Then it is also an $n \times s$ sparse matrix with $r$ non-zero elements in each row. Compared to the squared exponential kernel, the LAE method constructs a nonparametric kernel estimate using $X^n$ and $I^s$, which makes it more flexible and robust.
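A minimal sketch of the LAE weights for a single point is given below (Python with numpy), solving the simplex-constrained least-squares problem above by projected gradient descent. Liu et al. (2010) use a faster specialized solver; this version only illustrates the optimization being solved, and the function names are ours.

    import numpy as np

    def project_simplex(w):
        # Euclidean projection onto {w : w >= 0, sum(w) = 1}.
        u = np.sort(w)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u + (1.0 - css) / (np.arange(len(w)) + 1) > 0)[0][-1]
        theta = (1.0 - css[rho]) / (rho + 1)
        return np.maximum(w + theta, 0.0)

    def lae_weights(x_i, U_loc, n_iter=200):
        # x_i : (p,) data point; U_loc : (r, p) its r closest induced points [x_i].
        r = U_loc.shape[0]
        w = np.full(r, 1.0 / r)                              # start at the barycentre of the simplex
        step = 1.0 / (np.linalg.norm(U_loc @ U_loc.T, 2) + 1e-12)
        for _ in range(n_iter):
            grad = U_loc @ (U_loc.T @ w - x_i)               # gradient of 0.5 * ||x_i - U_loc^T w||^2
            w = project_simplex(w - step * grad)             # keep w nonnegative and summing to one
        return w                                             # one row of K^lae (its non-zero entries)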

3.3 Algorithms and Complexity analysis

Given the covariance $\mathcal{C}$, we calculate the marginal likelihood of the GP under the concrete model as follows,

\pi(y|x, \mathcal{C}, \eta) = \int \pi(y|f, x, \eta)\,\pi(f|\mathcal{C}, x)\,df,

where $\eta$ is a potential nuisance parameter. The hyperparameters in FLGP consist of predefined parameters, including $s, r, M$, and optimized parameters, including $\eta$, $\epsilon$ (if any), and $t$. The optimized parameters are estimated by maximizing the marginal likelihood function; a sketch for the conjugate regression case is given after Algorithm 1. The FLGP algorithm is listed as Algorithm 1. Furthermore, the following proposition demonstrates the time efficiency of FLGP.

Algorithm 1 FLGP algorithms on manifolds
1: Input: Labeled observations $\{(x_i,y_i)\}_{i=1}^m$, unlabeled observations $\{x_i\}_{i=m+1}^n$ and predefined parameters $s, r, M$.
2: Subsample $s$ induced points $I^s$ from $X^n$ by random sampling or k-means clustering.
3: Construct the $n \times s$ sparse matrix $K$ with $nr$ non-zero elements by the squared exponential kernel or the local anchor embedding kernel.
4: Calculate the $n \times s$ sparse matrices $Z$ and $A$.
5: Compute the largest $M$ singular values and left singular vectors of $A\Lambda^{-1/2}$. Obtain the eigenpairs of $L$.
6: Construct the $n \times n$ estimated heat kernel covariance matrix as in (6). ▷ Only the leftmost $m$ columns are needed in practical inference.
7: Optimize $\eta$, $\epsilon$ (if any) and $t$ by maximizing the marginal likelihood on $(X^m, Y^m)$.
8: Make the prediction on $\{x_i\}_{i=m+1}^n$.
9: Output: The prediction on the unlabeled points.
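For the conjugate normal-regression model, step 7 reduces to maximizing the Gaussian marginal likelihood of the labelled block of (6). The minimal sketch below (Python with numpy and scipy) reuses the eigenpairs from step 5, so each candidate diffusion time only costs a covariance rebuild; the parametrization (log t, log noise) and all names are illustrative, and non-conjugate models would require an approximate marginal likelihood instead.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_marginal(params, lam, V_m, y, n):
        # lam : (M,) eigenvalues; V_m : (m, M) rows of the top singular vectors at labelled points.
        log_t, log_noise = params
        t, noise = np.exp(log_t), np.exp(log_noise)
        C = n * (V_m * np.exp(-t * lam)) @ V_m.T          # labelled block of (6)
        C[np.diag_indices_from(C)] += noise               # plus observation noise
        L = np.linalg.cholesky(C)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

    # Example call: optimize t and the noise variance on the labelled data (X^m, Y^m).
    # res = minimize(neg_log_marginal, x0=[0.0, 0.0], args=(lam, V_m, y, n), method="L-BFGS-B")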
Proposition 2.

The time complexity of the heat kernel approximation by FLGP is $\mathcal{O}(nsp + s^3 + nmM)$ for random subsampling or $\mathcal{O}(nspI + s^3 + nmM)$ for k-means clustering, where $n$ is the number of all observations, $s$ is the number of induced points, $m$ is the number of labeled observations, $p$ is the ambient dimension, $M$ is the number of eigenpairs used, and $I$ is the number of iterations in k-means.

Remark 1.

Suppose $m \ll n$. Apart from the marginal likelihood computation, the time complexity of FLGP is almost linear in $n$. Therefore, our FLGP algorithm allows scalable applications of heat kernel GPs. In contrast, the complexity of GLGP is $\mathcal{O}(n^3)$. If $s \ll n$, then our approach is much faster.

4 Theory

The theory consists of three main parts. First, we estimate the error between the heat kernel and the numerical approximation in (6). Then, we obtain the predictive error bound in GPs. Finally, we establish the posterior contraction rate of heat kernel GPs in NEF on manifolds.

Suppose $\mathcal{X}$ is the domain manifold. Denote the geodesic distance on $\mathcal{X}$ by $\rho$. Let $V$ be the volume form on $\mathcal{X}$ and let $G$ be a probability measure on $\mathcal{X}$. For $x \in \mathcal{X}$, $t > 0$, let $B_t(x) = \{z \in \mathcal{X} \,|\, \rho(x,z) \leq t\}$. In this article, we focus on manifolds satisfying the following assumption.

Assumption 1.

Suppose $\mathcal{X}$ is a compact manifold without boundary. Let $g(x)$ be the density function of $G$ under $V$. Assume there exist constants $g_{\min}, g_{\max} > 0$ such that $g_{\min} \leq g(x) \leq g_{\max}$ for all $x \in \mathcal{X}$. Suppose we have a Gaussian estimate for the heat kernel on $\mathcal{X}$. That is, there exist constants $b_1, b_2, b_3, b_4 > 0$ such that, for $x, x' \in \mathcal{X}$, $t > 0$,

\frac{b_1}{F(x,x',t)}\exp\left\{-\frac{b_2\rho^2(x,x')}{t}\right\} \leq p_t(x,x') \leq \frac{b_3}{F(x,x',t)}\exp\left\{-\frac{b_4\rho^2(x,x')}{t}\right\},

where $F(x,x',t) = \sqrt{G(B_{\sqrt{t}}(x))\,G(B_{\sqrt{t}}(x'))}$.

This assumption guarantees that the heat kernel on $\mathcal{X}$ satisfies some regularity conditions. A similar assumption is made in Dunson et al. (2021a), and Assumption 1 covers many common settings. Let us take a compact manifold with the volume form as an example.

Example 1.

Suppose $\mathcal{X}$ is a compact manifold without boundary, and $G$ is the volume form $V$. We have $0 < g(x) = 1/V(\mathcal{X}) < +\infty$, where $V(\mathcal{X})$ is the volume of $\mathcal{X}$. Grigor'yan (2009) indicates that the Gaussian estimate holds for the heat kernel.

4.1 One-step random walk graph Laplacian

As it is usually intractable to calculate the exact heat kernel, we utilize the normalized random walk graph Laplacian to estimate the heat kernel numerically. This approach is also used by Dunson et al. (2021a). The main difference is that we perform approximate inference under the exact prior, while they perform exact inference under an approximate prior. That is, we treat the prior as the exact heat kernel GP and prove the oracle contraction rate.

We first estimate the approximation error for the so-called one-step random walk graph Laplacian, and explore the impact of subsampling later. Let $X^n$ be the point cloud. Assume that $k$ is the squared exponential kernel with bandwidth $\epsilon$. Construct the kernel matrix $\bar{K}$ on $X^n$ with $\bar{K}_{ij} = k(x_i,x_j)$. For $1 \leq i \leq n$, define $\bar{K}_{i\cdot} = \sum_{j=1}^n k(x_i,x_j)$. Calculate the similarity matrix by $\bar{Z}_{ij} = k(x_i,x_j)/(\bar{K}_{i\cdot}\bar{K}_{j\cdot})$ for $1 \leq i,j \leq n$. Let $\bar{D}$ be a diagonal matrix with $\bar{D}_{ii} = \sum_{j=1}^n \bar{Z}_{ij}$, $1 \leq i \leq n$. Compute the stochastic matrix $\bar{A}$ by $\bar{A} = \bar{D}^{-1}\bar{Z}$. The one-step random walk graph Laplacian is defined as $\bar{L} = I - \bar{A}$.

Furthermore, the kernel-normalized graph Laplacian is $\bar{L}/\epsilon^2$. Denote the eigenpairs of $\bar{L}/\epsilon^2$ by $\{(\bar{\lambda}_{i,\epsilon}, \bar{v}_{i,\epsilon})\}_{i=1}^n$ with $\|\bar{v}_{i,\epsilon}\|_2 = 1$. For $t > 0$, construct the heat kernel estimate as in (6) by

\bar{C}_{\epsilon,M,t} = n\sum_{i=1}^{M} \exp(-t\bar{\lambda}_{i,\epsilon}) \bar{v}_{i,\epsilon} \bar{v}_{i,\epsilon}^{\top}.
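For comparison with the FLGP sketch in Section 3.1, a minimal dense sketch of this one-step construction is given below (Python with numpy). Its full $n \times n$ eigendecomposition is exactly the $\mathcal{O}(n^3)$ step that subsampling avoids; the eigenvectors are normalized to unit Euclidean norm as in the text, and all names are illustrative.

    import numpy as np

    def one_step_heat_kernel(X, M, eps, t):
        n = X.shape[0]
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq / (4 * eps ** 2))                              # K_bar
        Z = K / (K.sum(1, keepdims=True) * K.sum(0, keepdims=True))   # Z_bar
        A = Z / Z.sum(1, keepdims=True)                               # A_bar = D_bar^{-1} Z_bar
        evals, evecs = np.linalg.eig(A)                               # full n x n eigendecomposition
        order = np.argsort(-evals.real)[:M]                           # largest eigenvalues of A_bar
        lam = (1.0 - evals.real[order]) / eps ** 2                    # smallest eigenvalues of L_bar / eps^2
        V = evecs.real[:, order]
        V /= np.linalg.norm(V, axis=0, keepdims=True)                 # unit-norm eigenvectors
        return n * (V * np.exp(-t * lam)) @ V.T                       # C_bar_{eps, M, t}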

Let $H_t$ be the exact heat kernel matrix over $X^n$. Let $\mathcal{L}$ be the Dirichlet-Laplace operator on $\mathcal{X}$. Denote the eigenpairs of $\mathcal{L}$ by $\{(\lambda_i, e_i)\}_{i=1}^\infty$ with $\|e_i\|_2 = 1$. A technical condition is introduced on the spectrum of $\mathcal{L}$.

Assumption 2 (Dunson et al. (2021b)).

Assume that the eigenvalues of $\mathcal{L}$ are simple, i.e., have no multiplicity. For $M \in \mathbb{N}$, denote $\Gamma_M \triangleq \min_{1\leq i\leq M} \text{dist}(\lambda_i, \{\lambda_j\}_{j\neq i})$. Suppose $\epsilon$ is small enough such that

\epsilon \leq \mathcal{H}_1 \min\left\{\left(\frac{\min(\Gamma_M, 1)}{\mathcal{H}_2 + \lambda_M^{d/2+5}}\right)^2, \frac{1}{(\mathcal{H}_3 + \lambda_M^{(5d+7)/4})^2}\right\},

where $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3 > 1$ are positive constants depending on $\mathcal{X}$. Suppose $n$ is sufficiently large so that $(\log n / n)^{1/(4d+13)} \leq \epsilon$.

The above assumption and the squared exponential kernel are imposed to obtain the spectral convergence of the normalized one-step graph Laplacian in the $\ell^\infty$ norm. Other cases are feasible as long as the spectral convergence between $\bar{L}$ and $\mathcal{L}$ can be established, such as for compactly supported kernels (García Trillos et al., 2020). Assumption 2 suggests that as $d$ increases and $\epsilon$ decreases, the required $n$ grows exponentially. This implies that the current approach is probably inappropriate for high-dimensional data. The error bound between $H_t$ and $\bar{C}_{\epsilon,M,t}$ is estimated using the spectral property in the following theorem.

Theorem 1.

Under Assumption 1, suppose $\epsilon, M, n$ satisfy Assumption 2. For any diffusion time $t$ less than the diameter of $\mathcal{X}$, there exist constants $\mathcal{P}_1, \mathcal{P}_2$ and $\mathcal{P}_3$ depending on the geometry of $\mathcal{X}$, such that

\|H_t - \bar{C}_{\epsilon,M,t}\|_{\max} \leq \mathcal{P}_1 M^2 \epsilon^{1/2} + \mathcal{P}_2\exp(-\mathcal{P}_3 M^{2/d}),   (7)

with probability greater than $1 - n^{-2}$, where $\|\cdot\|_{\max}$ is the matrix maximum norm.

Remark 2.

By Weyl's law, $\lambda_M \approx M^{2/d}$. Then in Assumption 2, for fixed $n > 0$, we have $(\log n / n)^{1/(4d+13)} \leq \epsilon \lesssim M^{-(5d+7)/d}$. Let $M$ be $(-\log\epsilon/\mathcal{P}_3)^{d/2}$ and suppose Assumption 2 holds for $\epsilon, M$. Theorem 1 then implies that the estimation error is bounded by $\epsilon^{1/2}$, i.e., $(\log n/n)^{1/\{2(4d+13)\}}$, up to logarithmic terms.

Based on (2), the heat kernel is split into the sum of the first $M$ eigenpairs and the sum of the remaining eigenpairs. These are estimated separately by combining the spectral convergence property of $\bar{L}$ and the fast-decaying spectrum of the heat kernel. To control the maximum distance between $H_t$ and $\bar{C}_{\epsilon,M,t}$, we first choose $M$ large enough to bound the latter term and subsequently choose a sufficiently small $\epsilon$ to bound the former term on the right-hand side of (7). Furthermore, as $n$ increases, the point cloud becomes denser in the domain. Thus, according to Theorem 1 and Remark 2, the approximation error of the heat kernel decreases. Therefore, unlabeled observations contribute to an accurate estimation of the heat kernel by reflecting the geometry of $\mathcal{X}$.

4.2 Two-step random walk graph Laplacian with subsampling

In FLGP, we construct a two-step random walk instead of the standard one-step random walk. The errors from subsampling and transition decomposition are estimated by a perturbation analysis. A similar technique is used in the fast spectral clustering algorithm of Yan et al. (2009). Assume that the subsampling method is k-means clustering, the kernel is the squared exponential kernel, and the kernel matrix is not sparse. Suppose for $1 \leq i \leq s$, the number of points in the cluster of $u_i$ is $n_i$. Define the extended induced point set $\tilde{I}^n$ as $(\overbrace{u_1,\ldots,u_1}^{n_1}, \overbrace{u_2,\ldots,u_2}^{n_2}, \ldots, \overbrace{u_s,\ldots,u_s}^{n_s})$. Denote the extended cross kernel matrix by $\tilde{K}$. Then $\tilde{K}$ is an $n \times n$ matrix. Let $N_0 = 0$ and $N_i = \sum_{j=1}^i n_j$, $1 \leq i \leq s$. Construct the extended cross similarity matrix $\tilde{Z}$ by

\tilde{Z}_{ij'} = \frac{k(x_i, u_j)}{\sum_{q=1}^{n} k(x_q, u_j)\sum_{q=1}^{s} n_q k(x_i, u_q)}, \quad 1 \leq i \leq n,\ N_{j-1} < j' \leq N_j,\ 1 \leq j \leq s.

Let $\tilde{D}$ be a diagonal matrix with $\tilde{D}_{ii} = \sum_{j=1}^n \tilde{Z}_{ij}$, $1 \leq i \leq n$. Define the extended stochastic matrix from $X^n$ to $\tilde{I}^n$ by $\tilde{A} = \tilde{D}^{-1}\tilde{Z}$. Let $\tilde{\Lambda}$ be a diagonal matrix with $\tilde{\Lambda}_{ii} = \sum_{j=1}^n \tilde{A}_{ji}$. The extended two-step graph Laplacian is defined as $\tilde{L} = I - \tilde{A}\tilde{\Lambda}^{-1}\tilde{A}^{\top}$.

Lemma 1.

If we write $\tilde{A}$ as $(\overbrace{a_1,\ldots,a_1}^{n_1}, \overbrace{a_2,\ldots,a_2}^{n_2}, \ldots, \overbrace{a_s,\ldots,a_s}^{n_s})$, where $a_i$ is a column vector for $1 \leq i \leq s$, then $A = (n_1 a_1, n_2 a_2, \ldots, n_s a_s)$.

Lemma 2.

The two graph Laplacians are in fact identical, i.e., $L = \tilde{L}$.

Without loss of generality, assume that for $1 \leq i \leq s$, the points $x_{N_{i-1}+1},\ldots,x_{N_i}$ belong to the same cluster, whose center is $u_i$. Consider the one-step stochastic matrix $\bar{A}$ on $X^n$. Let $\bar{\Lambda}$ be a diagonal matrix with $\bar{\Lambda}_{ii} = \sum_{j=1}^n \bar{A}_{ji}$, $1 \leq i \leq n$. Subsequently, the exact two-step graph Laplacian is defined as $\hat{L} = I - \bar{A}\bar{\Lambda}^{-1}\bar{A}^{\top}$.

Assumption 3.

Let $\tilde{K}_{i\cdot} \triangleq \sum_{j=1}^s n_j k(x_i, u_j)$, $1 \leq i \leq n$, and $\tilde{K}_{\cdot j} \triangleq \sum_{i=1}^n k(x_i, u_j)$, $1 \leq j \leq s$. Suppose there exists a constant $c_1 \in (0,1)$ such that, for $1 \leq i \leq n$, $1 \leq j \leq s$, $\bar{K}_{i\cdot}, \tilde{K}_{i\cdot}, \tilde{K}_{\cdot j} \geq c_1 n$.

This assumption is used to control the spectral norm of the stochastic matrices. It imposes a condition on the distribution of the point cloud, requiring that for each $x_i \in X^n$ there are sufficiently many points in its neighborhood. Now we can estimate the spectral norm error between $L$ and $\hat{L}$.

Theorem 2.

Under Assumption 3, for any $\delta > 0$, if $\sup_{1\leq i\leq s}\sup_{N_{i-1} < j \leq N_i} \|x_j - u_i\| \leq \delta$, then there exist constants $C_1$ and $C_2$ such that $\|\hat{L} - L\|_2 \leq C_1\delta + C_2\delta^2$, where $C_1$ and $C_2$ depend on $c_1$ and $\epsilon$, and $\|\cdot\|_2$ is the spectral norm.

Corollary 1.

If $\|\bar{A} - \bar{\Lambda}^{-1}\bar{A}^{\top}\|_2 \leq \gamma$, then $\|I - \bar{A}^2 - L\|_2 \leq C_1\delta + C_2\delta^2 + \sqrt{1/c_1}\,\gamma$.

The theorem follows from a measurement-error analysis of the graph Laplacian; for related results see Karoui and Wu (2016). According to Theorem 2, the distortion bound of the k-means clustering yields the spectral norm bound of the graph Laplacian in FLGP. This spectral result further implies the error estimate of the covariance matrix. Theorem 2 also suggests that for $x$ located far away from $I^s$, the corresponding prediction by FLGP may be poor. This agrees with intuition.

4.3 Predictive errors in Gaussian processes

Let $X^m = \{x_i\}_{i=1}^m$ and $Y^m = \{y_i\}_{i=1}^m$. For any test point $x^*$, let $f^* = f(x^*)$ and $\varphi^* = \varphi(x^*)$. Then the posterior expectation of $\varphi^*$ is given by $\mathbb{E}\{\varphi^*|Y^m\} = \mathbb{E}[\mathbb{E}\{l(f^*)|f^m\}|Y^m]$. Let $\Sigma^1$ and $\Sigma^2$ be two covariance matrices of $f$ over $\{x_1,\ldots,x_m,x^*\}$. For $i = 1, 2$, write $\Sigma^i = \begin{pmatrix} \Sigma^i_{f^m f^m} & \Sigma^i_{f^m f^*} \\ \Sigma^i_{f^* f^m} & \Sigma^i_{f^* f^*} \end{pmatrix}$, where $\Sigma^i_{f^m f^m} \in \mathbb{R}^{m\times m}$. Then $\pi_{\Sigma^i}(f^*|f^m) \sim N(\mu_i, \sigma_i^2)$ with

\mu_i = \Sigma^i_{f^* f^m}(\Sigma^i_{f^m f^m})^{-1} f^m, \quad \sigma_i^2 = \Sigma^i_{f^* f^*} - \Sigma^i_{f^* f^m}(\Sigma^i_{f^m f^m})^{-1}\Sigma^i_{f^m f^*}.
Lemma 3.

Assume that $\|\Sigma^1 - \Sigma^2\|_{\max} \leq \delta < 1$. Then there exist constants $c_2, c_3 > 0$ such that, for $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$,

|\mu_1 - \mu_2| \leq \frac{\delta m\|f^m\|_\infty\{(c_2 + 3\lambda)\|\Sigma^1_{f^* f^m}\|_\infty + \lambda\}}{\lambda(\lambda + c_2\delta)}, \qquad |\sigma_1^2 - \sigma_2^2| \leq \delta + \frac{2\delta m\|\Sigma^1_{f^* f^m}\|_\infty\{(c_3 + 3\lambda)\|\Sigma^1_{f^* f^m}\|_\infty + \lambda\}}{\lambda(\lambda + c_3\delta)},

where $\lambda$ is the smallest eigenvalue of $\Sigma^1_{f^m f^m}$.

Lemma 4.

Suppose $|\mu_1 - \mu_2| \leq \delta_1$ and $|\sigma_1^2 - \sigma_2^2| \leq \delta_2$. Then we have

|\mathbb{E}_1\{l(f^*)|f^m\} - \mathbb{E}_2\{l(f^*)|f^m\}| \leq \|l'\|_\infty(\delta_1 + \mathbb{E}|\xi|\sqrt{\delta_2}),

where $\xi \sim N(0,1)$.

Let $H_t$ be the exact heat kernel matrix on $X^n$ and let $C_{\epsilon,M,t}$ be the approximate covariance matrix on $X^n$.

Theorem 3.

Assume $\|H_t - C_{\epsilon,M,t}\|_{\max} \leq \delta < 1$. Then there exists $C_3 > 0$ such that

\max_{m+1\leq i\leq n}\left|\mathbb{E}_{H_t}\{\varphi(x_i)|f^m\} - \mathbb{E}_{C_{\epsilon,M,t}}\{\varphi(x_i)|f^m\}\right| \leq C_3\|l'\|_\infty\sqrt{\delta},

where $C_3$ depends on $m$, $f^m$ and $H_t$.

Remark 3.

In the normal regression model, as the GP is the conjugate prior and $\varphi = f$, Lemma 3 implies that $\max_{m+1\leq i\leq n}|\mathbb{E}_{H_t}\{\varphi(x_i)|Y^m\} - \mathbb{E}_{C_{\epsilon,M,t}}\{\varphi(x_i)|Y^m\}|$ is bounded by the same bound as for $|\mu_1 - \mu_2|$, after replacing $\|f^m\|_\infty, \Sigma^1$ with $\|Y^m\|_\infty, H_t$. However, for other non-conjugate models, the case is probably much more complicated.

If $\|l'\|_\infty < \infty$, then the error of the posterior expectation becomes small as $\delta_1$ and $\delta_2$ decrease. Therefore, Theorems 1 and 3 imply a bounded predictive error. For non-conjugate priors in the NEF model, the error estimation of $\mathbb{E}[\varphi(x^*)|Y^m]$ is outside our scope because of the complicated integral.

4.4 Posterior contraction rates

As multiple-parameter NEF models can usually be implemented as a composition of several one-parameter NEF models, we focus on the one-parameter NEF case in this article. We follow the general framework of Rousseau (2016) and Ghosal and van der Vaart (2017) to derive posterior contraction rates of GPs in NEF.

Let $\{(x_i,y_i)\}_{i=1}^m$ be i.i.d. observations from $(X,Y) \in \mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is a $d$-dimensional submanifold of $\mathbb{R}^p$ and $\mathcal{Y}\subset\mathbb{R}$. Consider the product measure $\nu = V\times\mu$ on $\mathcal{X}\times\mathcal{Y}$, where $\mu$ is the Lebesgue measure or the counting measure on $\mathcal{Y}$. Assume that $X \sim G$ and $\pi_\theta(y|x) \sim \text{NEF}(\theta(x))$ for $x \in \mathcal{X}$ under $\mu$. Suppose $\varphi = l\circ f$ with a latent function $f$ and a link function $l$. For $x\in\mathcal{X}$, we have $J'(\theta(x)) = l\circ f(x)$ and $J''(\theta(x)) = \text{Var}_{\theta(x)}(\kappa(Y))$. Hence $J''(\theta(x)) \geq 0$ for $x\in\mathcal{X}$, i.e., $J'$ is non-decreasing. If $J'' > 0$, then $J'$ is invertible. Denote $(J')^{-1}\circ l$ by $\zeta$. We have $\theta = \zeta\circ f$.

Assumption 4.

Assume $\varphi = l\circ f$ and $\theta = \zeta\circ f$. Suppose $\theta\mapsto J(\theta) \in C^2(\mathbb{R})$ and $l, \zeta \in C^1(\mathbb{R})$. Assume that the derivatives $l'$ and $\zeta'$ are uniformly bounded.

For $x\in\mathcal{X}$, $\zeta'(f(x)) = l'(f(x))/J''(\theta(x))$ by the chain rule. If the variance $J''$ is uniformly bounded away from zero, the above assumption suffices to control the growth rate of $l$. This is reasonable in the prediction problem: if $l$ varies rapidly, then the estimate of $\varphi$ will be very sensitive to the corresponding prediction of $f$. Even if the latent function $f$ is estimated accurately, the errors in $\varphi$ are still likely to diverge to infinity. So it is almost impossible to derive posterior contraction rates without any condition on $l$. Assumption 4 holds for a wide range of models; let us take normal regression and classification as examples.

Example 2.

Suppose $\pi_{f_1}(y|x) \sim N(f(x), \sigma^2)$ with $\sigma > 0$ known. We have

\pi_{f_1}(y|x,\sigma) \propto \exp\left(-\frac{1}{2\sigma^2}(y - f(x))^2\right) = \exp\left(-\frac{1}{2\sigma^2}y^2\right)\exp\left\{\frac{f(x)}{\sigma^2}y - \frac{f^2(x)}{2\sigma^2}\right\}.

Then $\varphi = f$ and $\theta = \zeta\circ f = f/\sigma^2$. So $l' = 1$ and $\zeta' = 1/\sigma^2$ are bounded.

Example 3.

Suppose $\pi_\varphi(y|x) \sim \text{Ber}(1, \varphi(x))$ with $0 < \varphi(x) < 1$, that is,

\pi_\varphi(y|x) = \varphi(x)^y\{1 - \varphi(x)\}^{1-y} = \exp\left\{y\log\frac{\varphi(x)}{1-\varphi(x)} + \log(1 - \varphi(x))\right\}.

Choose $\varphi(x) = l(f(x)) = \exp(f(x))/\{1 + \exp(f(x))\}$. Then we have $\theta = f$. So $\zeta' = 1$ and $l'(f(x)) = \exp(f(x))/\{1+\exp(f(x))\}^2$ are bounded.

Let $\Theta = \{\zeta(u) \,|\, u\in\mathbb{R}\}$. Suppose $f_1, f_2$ are two latent functions taking values in $\Omega\subset\mathbb{R}$. Let $\pi_{f_1}, \pi_{f_2}$ be the joint density functions of $(X,Y)$ under the dominating measure $\nu$. Define the Kullback-Leibler divergence as $\text{KL}(\pi_{f_1};\pi_{f_2}) = \mathbb{E}_{\pi_{f_1}}\log(\pi_{f_1}/\pi_{f_2})$ and define the $i$-th Kullback-Leibler variation as $\text{V}_{i,0}(\pi_{f_1};\pi_{f_2}) = \mathbb{E}_{\pi_{f_1}}|\log(\pi_{f_1}/\pi_{f_2}) - \text{KL}(\pi_{f_1};\pi_{f_2})|^i$. Define the Hellinger distance as $d_H(\pi_{f_1},\pi_{f_2}) = \{\mathbb{E}_\nu(\sqrt{\pi_{f_1}} - \sqrt{\pi_{f_2}})^2\}^{1/2}$. For i.i.d. experiments, $d_H$ is closely related to common distances in many models. Take binary classification as an example: let $\varphi_{f_1} = l\circ f_1$ and $\varphi_{f_2} = l\circ f_2$; then by Lemma 2.8 in Ghosal and van der Vaart (2017), we have $\|\varphi_{f_1} - \varphi_{f_2}\|_{2,G} \lesssim d_H(\pi_{f_1},\pi_{f_2}) \lesssim \|f_1 - f_2\|_{2,G}$.

Lemma 5.

Suppose Assumption 4 holds. If $\pi_{f_1}(y|x)$ and $\pi_{f_2}(y|x)$ belong to a one-parameter NEF, then:

1. Suppose $\Omega$ is bounded. There exist constants $C_4(\Omega) > 0$ and $C_5(\Omega) > 0$ such that

\text{KL}(\pi_{f_1};\pi_{f_2}) \leq C_4(\Omega)\|f_1 - f_2\|_\infty^2, \qquad \text{V}_{2,0}(\pi_{f_1};\pi_{f_2}) \leq C_5(\Omega)\|f_1 - f_2\|_\infty^2.

2. Suppose $\Theta$ is convex. There exists a uniform constant $C_6$ such that

d_H^2(\pi_{f_1},\pi_{f_2}) \leq C_6\|f_1 - f_2\|_\infty^2.

The main technique used is the Taylor expansion. According to Lemma 5, the Kullback-Leibler divergence, the second variation, and the Hellinger distance are all bounded above by the infinity distance of the corresponding latent functions. This consequence is crucial for proving the posterior contraction rates in one-parameter NEF models. Let us state the main theorem about posterior contraction rates in NEF on manifolds.

Theorem 4.

Suppose the observations $\{(x_i,y_i)\}_{i=1}^m$ are independent and identically distributed with population density $\pi_f(x,y) = \pi_f(y|x)g(x)$, where $f$ is the latent function. Let the prior on $f$ be the heat kernel GP with the adaptive diffusion time of (3). Suppose $\pi_f(y|x)$ belongs to a one-parameter NEF model satisfying Assumption 4 and $g(x)$ satisfies Assumption 1. Then for the true parameter $f_0$ in the Besov space $B^\beta_{\infty,\infty}(\mathcal{X})$ with $\beta > 0$, the posterior contraction rate at $f_0$ w.r.t. $d_H$ is $\epsilon_m \sim (\log m/m)^{2\beta/(2\beta+d)}$; that is, as $m\to\infty$, for each $M_m\to\infty$,

\mathbb{E}_{f_0}\Pi\{f : d_H(\pi_f, \pi_{f_0}) \geq M_m\epsilon_m \,|\, X^m, Y^m\} \to 0,   (8)

where $\Pi\{\cdot|X^m,Y^m\}$ is the posterior distribution.

The theorem follows from Lemma 5 and the properties of heat kernels. $B^\beta_{\infty,\infty}(\mathcal{X})$ is the Besov space with smoothness $\beta$ (Castillo et al., 2014). Notably, the posterior contraction rates depend on $d$ instead of $p$. If $d \ll p$, then $(\log m/m)^{2\beta/(2\beta+d)}$ is much smaller than $(\log m/m)^{2\beta/(2\beta+p)}$. So when the domain $\mathcal{X}$ is highly non-Euclidean, the heat kernel Gaussian process is superior to extrinsic kernel Gaussian processes. At a heuristic level, if we could embed $\mathcal{X}\subset\mathbb{R}^p$ into $\mathbb{R}^d$ while keeping the intrinsic distances invariant, then Theorem 4 suggests that heat kernel GPs are equivalent to inference on the embedded $d$-dimensional domain space. So the contraction rates are extremely sharp.

In the FLGP algorithm, the diffusion time $t$ is estimated by an empirical Bayes method, whereas the posterior contraction theory assigns a hyperprior $T$ to $t$, which is fully Bayesian. This theoretical gap may be filled in future work.

5 Experiments

In this section, we compare the performance of various kernel-based approaches. Besides FLGP, we consider GPs with radial basis function kernels (RBF), kernel support vector machines (KSVM) with RBF kernels, GLGP and the Nyström approximation of GLGP (written as GLGP (Nyström) in the tables and figures). RBF GP and RBF KSVM are extrinsic kernel algorithms, while the other approaches are manifold learning algorithms. The Nyström approximation is one of the most common ways to improve the scalability of GPs (Quinonero-Candela and Rasmussen, 2005; Liu et al., 2020). We provide an implementation based on the work of Dunson et al. (2021a) as a benchmark of scalable domain-adaptive GPs in Appendix C, where the subsampling method is k-means clustering. We consider four FLGP variants that differ in the kernel matrix construction and the subsampling method: squared exponential kernels with random subsampling (SRFLGP), squared exponential kernels with k-means subsampling (SKFLGP), local anchor embedding with random subsampling (LRFLGP), and local anchor embedding with k-means subsampling (LKFLGP).

We consider two examples to evaluate the performance of FLGP and apply FLGP to the fMRI dataset from the Human Connectome Project (HCP). In all cases, the hyperparameters $s$, $r$, and $M$ are prespecified. The sensitivity analysis of FLGP is presented in Appendix D; the results suggest that the prediction performance is robust against changes in these prespecified hyperparameters. For the first synthetic example, we simulate multiple concentric circles in $\mathbb{R}^2$, where the class of each point is determined by the circle it lies on. The second example is based on the MNIST handwritten-digit dataset. In the third case, we reconstruct the fMRI images of the working memory task from HCP. The domain manifolds in the MNIST and HCP tasks are unknown and can be very complicated. Experimental results show that FLGP achieves better performance than competing methods in all cases. The first and second examples are run on a machine with 80GB RAM and a 3.70GHz AMD Ryzen 5 5600X 6-core processor. The HCP application is run on the Great Lakes computing cluster at the University of Michigan.

5.1 Simulation on synthetic concentric circles

In this example, we consider six circles with the same center but different radii in $\mathbb{R}^2$. Denote the six circles, from the inside out, as $S_1, S_2, S_3, S_4, S_5, S_6$, and let $S = \bigcup_{i=1}^6 S_i$. Then $S$ is a one-dimensional submanifold of $\mathbb{R}^2$, and its geometry is significantly different from the Euclidean geometry. To generate the data, we first sample $n/6$ points uniformly on each of the six circles; these $n$ points form the point cloud $X^n$. Figure 3 illustrates the generated data when $n = 4800$. Then, we randomly sample $m$ points from $X^n$ as the labelled data points $X^m = \{x_1, x_2, \ldots, x_m\}$. Points on $S_1, S_3, S_5$ have true Class 1 and points on $S_2, S_4, S_6$ have true Class 0.
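A minimal sketch of this data-generating process is shown below (Python with numpy). The radii are illustrative, as the text only specifies six concentric circles; labels alternate so that $S_1, S_3, S_5$ carry Class 1 and $S_2, S_4, S_6$ carry Class 0.

    import numpy as np

    def make_circles(n, m, radii=(1, 2, 3, 4, 5, 6), rng=np.random.default_rng(0)):
        per = n // len(radii)                                 # n/6 points per circle
        X, y = [], []
        for i, R in enumerate(radii):
            theta = rng.uniform(0.0, 2.0 * np.pi, per)        # uniform angles on circle S_{i+1}
            X.append(np.column_stack([R * np.cos(theta), R * np.sin(theta)]))
            y.append(np.full(per, (i + 1) % 2))               # odd-indexed circles -> Class 1
        X, y = np.vstack(X), np.concatenate(y)
        labelled = rng.choice(len(X), size=m, replace=False)  # indices of the m labelled points
        return X, y, labelled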

Figure 3: The circle case in $\mathbb{R}^2$, where red points are of Class 0 and orange points are of Class 1. We apply the various approaches to classify the unlabelled points.

We consider six scenarios for data generation, covering all combinations of $n$ and $m$ with $m\in\{50, 100\}$ and $n\in\{2400, 4800, 12000\}$. We classify unlabelled data points using FLGP and competing methods, repeating each scenario 20 times. For FLGP, we choose $M = 100$, $r = 3$, and $s = 600$. Other hyperparameters are optimized by maximizing the corresponding marginal likelihood. We report the error rate over unlabelled data to show the classification performance in Table 1 and the computing time to show the computational speed in Table 2. As shown in Table 1, intrinsic kernel approaches achieve lower error rates than extrinsic kernel approaches. RBF GP and KSVM collapse to random prediction because of the highly non-Euclidean geometry. Meanwhile, GLGP fails to work when $m = 50$, with an average error rate of 46.1%, but the error rate decreases to 17.6% when $m = 100$. The prediction performance of the Nyström extension is similar to that of GLGP. It is worth noting that FLGP obtains much lower error rates than GLGP and the Nyström extension in most cases. For example, when $n = 4800$ and $m = 100$, LKFLGP has an average error rate of 1.5%, which is much less than that of GLGP or the extrinsic kernel approaches. Comparing the FLGP variants, FLGP with squared exponential kernels achieves lower error rates than FLGP with local anchor embedding kernels in this example. Moreover, since clustering centers describe the domain geometry better than random points, FLGP with k-means subsampling also attains lower error rates than FLGP with random subsampling in all cases. In terms of computational efficiency, Table 2 shows that FLGP computes much faster than the other manifold approaches, including GLGP and the Nyström extension. For instance, when $n = 12000$ and $m = 100$, LKFLGP (0.4s) is over 100 times faster than GLGP (59.7s). We conclude from Tables 1 and 2 that LKFLGP achieves the best balance between performance and speed in this case.

Table 1: Error rates (%) in the circle example
\toprule n 2400 4800 12000
m 50 100 50 100 50 100
\midruleRBF GP 47.3(2.3)1 44.2(2.4) 46.1(3.0) 44.8(2.4) 46.8(2.1) 45.3(2.0)
KSVM 46.8(3.1) 44.7(2.5) 46.7(2.4) 45.0(3.1) 47.1(2.5) 46.4(1.3)
GLGP 46.1(8.3) 17.6(2.9) 46.8(6.9) 17.6(3.4) 47.7(5.7) 17.1(3.8)
GLGP* 44.7(8.6) 16.7(3.6) 44.6(8.7) 16.2(3.8) 41.3(9.6) 17.3(3.4)
SRFLGP 19.6(5.3) 12.3(2.6) 16.5(4.2) 12.0(3.5) 17.8(4.7) 9.7(3.9)
SKFLGP 3.1(3.4) 0.7(1.6) <0.1(<0.1) <0.1(0.2) <0.1(<0.1) <0.1(<0.1)
LRFLGP 27.4(5.4) 17.3(3.2) 26.9(4.8) 18.1(3.5) 27.0(4.4) 19.5(3.5)
LKFLGP 7.0(4.2) 3.9(2.5) 3.3(2.9) 1.5(1.4) 1.1(2.6) 0.3(0.7)
\botrule
{tablenotes}

The term a(b) indicates the average error rate a% with the standard deviation b%. The other tables about error rates are in the same format.

Table 2: Computation time (second) in the circle example
\toprulen 2400 4800 12000
m 50 100 50 100 50 100
\midruleRBF GP 0.1 0.11 0.19 0.2 0.48 0.52
KSVM 0.03 0.01 0.02 0.03 0.04 0.06
GLGP 1.4 1.55 9.21 9.33 59.74 59.72
GLGP* 1.58 1.74 3.99 4.12 7.38 7.54
SRFLGP 1.92 2.09 1.36 1.58 1.59 1.81
SKFLGP 5.57 5.98 2.29 2.59 1.98 2.34
LRFLGP 0.12 0.17 0.16 0.21 0.27 0.33
LKFLGP 0.14 0.2 0.18 0.24 0.33 0.4
\botrule

To explore the underlying mechanism, we plot the covariance distributions between all data points and several reference points in Figure 4. For better visualization, the covariance is scaled into [0,1]. As the geometry of the domain is highly non-Euclidean, the RBF kernel fails to capture the true similarity between points: it activates points based on the Euclidean distance, while the heat kernel estimates the covariance based on the intrinsic geometry. In this case, as the circles are very close together, GLGP and the Nyström extension yield inferior estimates without sparsity. In contrast, the covariance structure is well captured by FLGP: only the points on the same circle as the reference point are activated. These differences in the covariance distributions explain the different performance of the various approaches.

Figure 4: Covariance distributions in the circle example: the covariance between all data points and the reference point for various approaches when m=100 and n=4800, where the black point indicates the reference point. From left to right, top channels: RBF GP, GLGP, GLGP*; middle channels: SRFLGP, SKFLGP, LRFLGP; bottom channel: LKFLGP.

5.2 Application to handwritten digit data MNIST

The MNIST database contains 60000 training and 10000 test image-label pairs. Each image is a 28\times 28 handwritten digit (0-9) and the label is the digit. In this experiment, we mix the training set and the test set to construct an entire point cloud of 70000 images. We compare the multi-classification performance on image labels of RBF GP, RBF KSVM, GLGP, the Nyström extension, and FLGP. To speed up the computation, we reduce the dimension of the 28\times 28 image predictor from 784 to 100 by Principal Component Analysis (PCA). After the dimensionality reduction, the input space is unknown and is expected to be a submanifold of \mathbb{R}^{100}; the domain geometry could be very complicated and non-linear.

To generate the data, we first sample n observations from all 70000 instances and then sample m labelled observations from these n observations; the remaining n-m observations are unlabelled. We consider four scenarios of data generation, covering all combinations of n and m with m\in\{100,200\} and n\in\{7000,70000\}. Since GLGP takes too much computation time when n=70000, we only report the results of GLGP for n=7000. We repeat each scenario 10 times. We set M=200, r=3 and s=1000 for FLGP; other hyperparameters are optimized by maximizing the marginal likelihood.
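A minimal sketch of this data-generation protocol is given below (Python, using scikit-learn to fetch MNIST and run PCA). The seed and the particular scenario (n=7000, m=100) are illustrative, and this is a sketch of the protocol rather than the exact code used in our experiments.

\begin{verbatim}
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Mix the 60000 training and 10000 test images into one cloud of 70000 images.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
images, digits = mnist.data, mnist.target           # shapes (70000, 784), (70000,)

# Reduce the 28 x 28 = 784-dimensional predictor to 100 dimensions by PCA.
X100 = PCA(n_components=100).fit_transform(images)

n, m = 7000, 100                                     # one of the four scenarios
cloud = rng.choice(len(X100), size=n, replace=False)     # indices of the point cloud X^n
labelled = rng.choice(cloud, size=m, replace=False)      # indices of the labelled set X^m
# The remaining n - m observations are unlabelled and are the ones to classify.
\end{verbatim}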

Table 3 summarizes the mean and standard deviation of the error rates over the 10 replicates for each method. Generally, manifold learning algorithms perform better than RBF kernel-based algorithms, and FLGP achieves lower error rates than GLGP and the Nyström extension. In each scenario, FLGP with k-means subsampling (SKFLGP and LKFLGP) obtains the lowest error rates among all competing methods. Another observation is that the error rates of the extrinsic kernel approaches remain at the same level as n increases from 7000 to 70000, since they do not make use of the domain information. The error rates of FLGP with random subsampling do not change either, whereas the error rates of the Nyström extension and of FLGP with k-means subsampling decrease with increasing n. This is because, when the point cloud becomes larger, the clustering centers capture more manifold information but the random subsampling points do not. This phenomenon supports the intuition that the induced points determine how much domain information is used: when the induced points describe the intrinsic geometry better, the manifold learning algorithms achieve lower error rates. The computation time is summarized in Table 4. The Nyström extension is faster than GLGP, and FLGP is faster than the Nyström extension; the superior efficiency of FLGP over the Nyström extension comes from sparse matrix multiplication. Considering both the error rates and the time cost, LKFLGP performs best among the FLGP algorithms.

Table 3: Error rates (%) in the MNIST example
\toprule n 7000 70000
m 100 200 100 200
\midruleRBF GP 54.2(4.9) 43.0(5.1) 53.3(7.2) 40.6(3.0)
KSVM 65.4(3.7) 40.3(3.7) 65.7(6.3) 40.3(4.0)
GLGP 32.1(3.3) 23.3(1.9) × ×
GLGP* 31.1(3.0) 21.9(1.6) 19.9(2.0) 15.3(1.9)
SRFLGP 20.2(2.0) 15.0(1.4) 18.4(3.0) 13.8(0.9)
SKFLGP 14.1(2.4) 10.3(1.5) 8.9(1.3) 7.4(0.8)
LRFLGP 23.3(4.9) 16.4(1.4) 21.7(3.5) 16.2(1.1)
LKFLGP 15.2(2.9) 10.3(1.3) 9.7(1.4) 7.7(0.5)
\botrule
Table 4: Computation time (second) in the MNIST example
\toprule n 7000 70000
m 100 200 100 200
\midruleRBF GP 1 1.3 9.7 12.4
KSVM 0.7 0.9 6.6 9.3
GLGP 57.3 61.7 × ×
GLGP* 17.5 20.8 156.1 157.7
SRFLGP 7.1 10.9 15 18.7
SKFLGP 11.9 16.1 29.8 34.6
LRFLGP 1.2 2.2 5.8 7.1
LKFLGP 1.9 3 21.8 23
\botrule

5.3 Application to fMRI data from HCP

We further illustrate our method with an analysis of task-evoked functional magnetic resonance imaging (fMRI) data from the Human Connectome Project (HCP-1200 release). Task fMRI indirectly measures brain activity by monitoring oxygen-level changes in blood flow while the subject performs specific tasks. The standard analysis of fMRI data is carried out in Euclidean space, which does not directly take into account the complexity of the human brain's structure. Although cortical surface fMRI data (Fischl et al., 1999) offer the opportunity to analyze brain activation on a lower-dimensional manifold of the three-dimensional space, they rely on a prespecified manifold structure that is not adaptive to different subjects and different types of activation.

Our goal is to estimate the manifold of fMRI activation in the three-dimensional brain from data and evaluate the accuracy of predicting fMRI activation on the manifold, which may provide insights on the neural underpinnings associated with cognitive functions (Sripada et al., 2020).

The typical fMRI data for a single subject form a four-dimensional tensor that records the fMRI signals over time at each location (voxel) in the three-dimensional brain. The standard preprocessing of task-based fMRI data involves several key steps: gradient unwarping, motion correction, distortion correction, and grand-mean intensity normalization. To localize the fMRI activation in the brain, general linear models are fitted at each voxel for each subject, with the fMRI time series as the response variable and the stimulus pattern of the task as the covariates. The voxel-level fMRI activation intensities are estimated by the contrast statistics of the fitted general linear models, which reflect the difference in fMRI activation between task conditions. This preprocessing procedure generates the subject-level contrast maps, which are three-dimensional volumetric images of fMRI activation. In this study, we focus on the 2-back versus 0-back contrast maps during the working memory N-back tasks (van Essen et al., 2013) from 917 subjects in the HCP study. All the images have been registered on the standard MNI 2mm brain template of resolution 91\times 109\times 91.

To estimate the manifold of fMRI activation, we first define the population level activation region from all the contrast maps. For each voxel in each contrast map, if its absolute intensity value is larger than 1.96 (i.e., the 97.5% standard normal quantile), then the fMRI activation at this voxel is considered significantly different between the 2-back and 0-back tasks. For each subject, we refer to the collection of activation voxels as the activation region. In this study, we estimate the population level activation region as the superposition of the activation voxels from all 917 subjects. To improve the robustness of the estimation, we impose the constraint that each voxel in this region must be an activation voxel for at least two subjects. Figure 7 shows the map of the population level activation region. The total number of activation voxels is 17,711, and 76 of the first 90 brain regions from the automated anatomical labelling (AAL) brain atlas (Tzourio-Mazoyer et al., 2002) have at least one activation voxel.
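The thresholding rule above can be summarized by the following sketch (Python/NumPy); the function and its arguments are hypothetical illustrations, and loading the stack of contrast maps is omitted.

\begin{verbatim}
import numpy as np

def population_activation_region(contrast_maps, threshold=1.96, min_subjects=2):
    # contrast_maps: array of shape (n_subjects, 91, 109, 91) holding the
    # 2-back versus 0-back contrast maps on the MNI 2mm template.
    subject_masks = np.abs(contrast_maps) > threshold   # per-subject activation voxels
    counts = subject_masks.sum(axis=0)                  # number of subjects activating each voxel
    return counts >= min_subjects                       # population level activation mask
\end{verbatim}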

On the population level activation region, we estimate the manifold of fMRI activation by FLGP. For each subject, we estimate the prior covariance kernel on manifolds using the proposed algorithm in Section 3.1. The distance between two voxels on the manifold of fMRI activation can be well measured by the prior correlation of the GP between the two voxels. We compute the average of the covariance kernel over 917 subjects to estimate the population level manifold of fMRI activation. To summarize the manifold of fMRI activation, we consider two levels of visualizations: voxel level correlation seed map (Figure 5) and region level correlation network (Figure 6).

The voxel level seed map in Figure 5 shows the prior correlation between each activation voxel and the seed voxel of its AAL brain region. The seed voxel of each region is defined as the voxel with the largest absolute mean intensity in the region and can be considered the region center. We include the 47 brain regions that have at least 30 activation voxels, out of the 76 brain regions that have at least one activation voxel. Among the 47 regions, 26 regions (55.3%) have at least 30% of their activation voxels with a correlation greater than 0.5 with the region center, and 7 regions (14.9%) have at least 30% of their activation voxels with a correlation greater than 0.8 with the region center. In particular, 91.9% of the activation voxels in Parietal_Inf_L, 77.8% in Cuneus_L and 57.1% in Parietal_Sup_R have correlations greater than 0.8 with their corresponding region centers.
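A sketch of how such region-level summaries may be computed is given below (Python/NumPy). The inputs (the averaged correlation matrix, the mean intensities, and the AAL labels of the activation voxels) are assumed to be available, and the function name is ours.

\begin{verbatim}
import numpy as np

def seed_correlation_fraction(corr, mean_intensity, region_labels, region_id, cut=0.5):
    # corr           : (V, V) prior correlation matrix averaged over subjects
    # mean_intensity : (V,) mean contrast intensity of the V activation voxels
    # region_labels  : (V,) AAL region id of each activation voxel
    voxels = np.flatnonzero(region_labels == region_id)
    seed = voxels[np.argmax(np.abs(mean_intensity[voxels]))]   # region center
    return np.mean(corr[voxels, seed] > cut)   # fraction of voxels strongly tied to the seed
\end{verbatim}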

The region level correlation network map in Figure 6 illustrates the manifold of fMRI activation among the AAL regions: two regions with a large correlation are close to each other on the manifold of fMRI activation. We present the prior correlation between the 76 AAL regions estimated by FLGP and averaged over 917 subjects in Figure 6. Each region is represented by its region center, and the connected edges indicate correlations greater than 0.5 between regions. According to Figure 6, the Frontal Lobe, Parietal Lobe, and Temporal Lobe exhibit significant within-network correlation, with the Frontal Lobe strongly connected to the Insula and the Temporal Lobe. Besides the Frontal Lobe, the Temporal Lobe demonstrates notable correlation with the Occipital Lobe and moderate correlation with the Insula. Furthermore, the Limbic system shows a moderate connection with the Frontal Lobe and the Occipital Lobe, and the Occipital Lobe also has a modest correlation with the Parietal Lobe.

Figure 5: Manifold of fMRI activation at voxel level: the prior correlation seed map of 47 AAL brain regions that have at least 30 activation voxels.
Figure 6: Manifold of fMRI activation at the region level: the prior correlation network between the AAL regions. The connected edges indicate correlations greater than 0.5 between regions. A region is labelled if it has a correlation greater than 0.9 with another region.

To evaluate the manifold learning accuracy of FLGP, we compare the prediction performance of RBF GP, the Nyström approximation of GLGP (referred to simply as GLGP in this subsection), and FLGP with local anchor embedding and k-means subsampling (LKFLGP). Since the population level activation region has a very large number of voxels (n=17,711), we implement an extremely fast regression algorithm, described in Appendix E, for the case of a large number of labelled observations. Here we take LKFLGP as the representative of the FLGP methods, as the previous experiments show that it best balances prediction and efficiency; the original (non-Nyström) GLGP is excluded as it is computationally prohibitive. Besides, we set M=500, r=3, and s=1,500 for FLGP. We consider three scenarios with m=5,000, 7,500, and 10,000, and each scenario is repeated 10 times. To remove the effect of outliers, we disregard voxels whose prediction errors fall within the top or bottom 1%.
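The outlier-trimmed evaluation can be written as the short sketch below (Python/NumPy); it only illustrates the 1% trimming rule described above and is not the evaluation code of the package.

\begin{verbatim}
import numpy as np

def trimmed_mse(y_true, y_pred, trim=0.01):
    # Voxel-wise predictive MSE after removing the top and bottom 1% of errors.
    err = y_pred - y_true
    lo, hi = np.quantile(err, [trim, 1.0 - trim])
    keep = (err >= lo) & (err <= hi)
    return np.mean(err[keep] ** 2)
\end{verbatim}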

Figure 7: The population level activation map from sagittal, coronal, and axial views. A voxel is considered an activation voxel if and only if its absolute intensity value is larger than the 97.5% quantile of the standard normal distribution (1.96) for at least two subjects. The color represents the number of subjects for which the voxel is activated.

Experimental results are summarized in Table 5, including the average voxel-wise predictive mean squared error (MSE) and computation time over the 917 subjects and the 10 replicates. In all scenarios, FLGP obtains the lowest MSE while RBF GP has the highest; the mean predictive MSE of GLGP lies between those of RBF GP and FLGP. This observation shows that using the intrinsic geometry improves performance and that FLGP estimates the brain manifold better than GLGP. Furthermore, because of the low-rank structure and sparsity of the covariance, FLGP is the fastest approach, with much less computation time than the other methods. For instance, when m=10000, the computing time of FLGP is about 1/20 of that of GLGP and 1/30 of that of RBF GP. In conclusion, FLGP attains excellent performance in both prediction and efficiency.

Table 5: MSE and computation time (second) in the HCP example
\toprulem 5000 7500 10000
\midruleMethod MSE Time MSE Time MSE Time
\midruleRBF GP 0.066(0.039)1 97 0.064(0.037) 320 0.062(0.037) 777
GLGP 0.040(0.021) 444 0.038(0.020) 512 0.037(0.020) 573
FLGP 0.033(0.018) 17 0.032(0.017) 21 0.031(0.017) 25
\botrule
{tablenotes}

The term a(b) indicates the average MSE a with standard deviation b.

6 Conclusion

In this article, we develop a scalable Bayesian inference method for heat kernel Gaussian processes on an unknown d-dimensional submanifold of \mathbb{R}^{p}. For enormous point clouds, we construct a reduced-rank approximation of the transition probability matrix and implement an efficient truncated SVD, which together yield a scalable domain-adaptive GP approach, i.e., FLGP. Furthermore, when the domain space is highly non-Euclidean, FLGP performs much better than traditional GPs and other heat-kernel-based GPs.

Several directions are of interest for future research. The first is to combine FLGP with known geometric information. For example, \mathcal{X} is given in many practical problems, such as those in Kim et al. (2021) and Guinness and Fuentes (2016), and it may be possible to utilize the known domain structure to improve performance. The second direction is to study sample paths of heat kernel GPs, such as their smoothness. The third direction is to consider other kernels on manifolds besides the heat kernel; in particular, the generalized Matérn kernel (Borovitskiy et al., 2020) is appealing because of its simple form and nice properties.

7 Code and supplementary materials

The source package FLGP is available on the GitHub repository https://github.com/junhuihe2000/FLGP, and R code for the examples in Section 5 is available on the GitHub repository https://github.com/junhuihe2000/exampleFLGP. The supplementary materials include all proofs in the article, the implementation of the Nyström extension, and the sensitivity analysis of the predefined parameters. They also include a fast GP regression algorithm for the normal model when the number of labelled observations is enormous.

Acknowledgements

Yang’s research was partially supported by the National Natural Science Foundation of China grant (12271286, 11931001). Kang’s research was partially supported by the following grants: NIH R01DA048993, NIH R01MH105561, and NSF IIS2123777.

References

  • Balasubramanian and Schwartz (2002) Balasubramanian, M. and Schwartz, E. L. (2002) The isomap algorithm and topological stability. Science, 295, 7.
  • Belkin and Niyogi (2001) Belkin, M. and Niyogi, P. (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, vol. 14. Cambridge: MIT Press.
  • Borovitskiy et al. (2021) Borovitskiy, V., Azangulov, I., Terenin, A., Mostowsky, P., Deisenroth, M. and Durrande, N. (2021) Matérn Gaussian processes on graphs. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, vol. 130, 2593–2601.
  • Borovitskiy et al. (2020) Borovitskiy, V., Terenin, A., Mostowsky, P. and Deisenroth, M. (2020) Matérn Gaussian processes on Riemannian manifolds. In Advances in Neural Information Processing Systems, vol. 33, 12426–12437. Curran Associates, Inc.
  • Brown et al. (2010) Brown, L. D., Cai, T. T. and Zhou, H. H. (2010) Nonparametric regression in exponential families. The Annals of Statistics, 38, 2005 – 2046.
  • Castillo et al. (2014) Castillo, I., Kerkyacharian, G. and Picard, D. (2014) Thomas Bayes’ walk on manifolds. Probability Theory and Related Fields, 158, 665–710.
  • Dunson et al. (2021a) Dunson, D. B., Wu, H.-T. and Wu, N. (2021a) Graph based Gaussian processes on restricted domains. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84, 414–439.
  • Dunson et al. (2021b) — (2021b) Spectral convergence of graph Laplacian and heat kernel reconstruction in L^{\infty} from random samples. Applied and Computational Harmonic Analysis, 55, 282–336.
  • Feragen et al. (2015) Feragen, A., Lauze, F. and Hauberg, S. (2015) Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3032–3042.
  • Fischl et al. (1999) Fischl, B., Sereno, M. I. and Dale, A. M. (1999) Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. Neuroimage, 9, 195–207.
  • García Trillos et al. (2020) García Trillos, N., Gerlach, M., Hein, M. and Slepčev, D. (2020) Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace–Beltrami operator. Foundations of Computational Mathematics, 20, 827–887.
  • Genton (2001) Genton, M. G. (2001) Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2, 299–312.
  • Ghosal and van der Vaart (2017) Ghosal, S. and van der Vaart, A. (2017) Fundamentals of Nonparametric Bayesian Inference. Cambridge: Cambridge University Press.
  • Grigor’yan (2006) Grigor’yan, A. (2006) Heat kernels on weighted manifolds and applications. Contemporary Mathematics, 398, 93–191.
  • Grigor’yan (2009) — (2009) Heat Kernel and Analysis on Manifolds, vol. 47. Rhode Island: American Mathematical Society.
  • Guinness and Fuentes (2016) Guinness, J. and Fuentes, M. (2016) Isotropic covariance functions on spheres: Some properties and modeling considerations. Journal of Multivariate Analysis, 143, 143–152.
  • Bradley et al. (2020) Bradley, J. R., Holan, S. H. and Wikle, C. K. (2020) Bayesian hierarchical models with conjugate full-conditional distributions for dependent data from the natural exponential family. Journal of the American Statistical Association, 115, 2037–2052.
  • Karoui and Wu (2016) Karoui, N. E. and Wu, H.-T. (2016) Graph connection Laplacian methods can be made robust to noise. The Annals of Statistics, 44, 346 – 372.
  • Kendall et al. (2009) Kendall, D. G., Barden, D., Carne, T. K. and Le, H. (2009) Shape and Shape Theory. Hoboken: John Wiley & Sons.
  • Kim et al. (2021) Kim, K.-R., Dryden, I. L., Le, H. and Severn, K. E. (2021) Smoothing splines on Riemannian manifolds, with applications to 3D shape space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83, 108–132.
  • Lee (2012) Lee, J. M. (2012) Introduction to Smooth Manifolds. Berlin: Springer.
  • Lin et al. (2019) Lin, L., Mu, N., Cheung, P. and Dunson, D. (2019) Extrinsic Gaussian processes for regression and classification on manifolds. Bayesian Analysis, 14, 887 – 906.
  • Liu et al. (2020) Liu, H., Ong, Y.-S., Shen, X. and Cai, J. (2020) When Gaussian process meets big data: A review of scalable GPs. IEEE Transactions on Neural Networks and Learning Systems, 31, 4405–4423.
  • Liu et al. (2010) Liu, W., He, J. and Chang, S.-F. (2010) Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th International Conference on Machine Learning, 679–686.
  • Niu et al. (2019) Niu, M., Cheung, P., Lin, L., Dai, Z., Lawrence, N. and Dunson, D. (2019) Intrinsic Gaussian processes on complex constrained domains. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81, 603–627.
  • Niu et al. (2023) Niu, M., Dai, Z., Cheung, P. and Wang, Y. (2023) Intrinsic Gaussian process on unknown manifolds with probabilistic metrics. Journal of Machine Learning Research, 24, 1–42.
  • Quinonero-Candela and Rasmussen (2005) Quinonero-Candela, J. and Rasmussen, C. E. (2005) A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1939–1959.
  • Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006) Gaussian Processes for Machine Learning, vol. 1. Cambridge: MIT Press.
  • Rousseau (2016) Rousseau, J. (2016) On the frequentist properties of Bayesian nonparametric methods. Annual Review of Statistics and Its Application, 3, 211–231.
  • Roweis and Saul (2000) Roweis, S. T. and Saul, L. K. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
  • Schweinberger and Stewart (2020) Schweinberger, M. and Stewart, J. (2020) Concentration and consistency results for canonical and curved exponential-family models of random graphs. The Annals of Statistics, 48, 374 – 396.
  • Sripada et al. (2020) Sripada, C., Angstadt, M., Rutherford, S., Taxali, A. and Shedden, K. (2020) Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41, 3186–3197.
  • Tzourio-Mazoyer et al. (2002) Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., Mazoyer, B. and Joliot, M. (2002) Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 15, 273–289.
  • van der Vaart and van Zanten (2008) van der Vaart, A. W. and van Zanten, J. H. (2008) Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36, 1435 – 1463.
  • van der Vaart and van Zanten (2009) — (2009) Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics, 37, 2655 – 2675.
  • van Essen et al. (2013) van Essen, D. C., Smith, S. M., Barch, D. M., Behrens, T. E., Yacoub, E. and Ugurbil, K. (2013) The WU-Minn human connectome project: an overview. NeuroImage, 80, 62–79.
  • Weinberger and Saul (2006) Weinberger, K. Q. and Saul, L. K. (2006) Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70, 77–90.
  • Wu and Wu (2018) Wu, H.-T. and Wu, N. (2018) Think globally, fit locally under the manifold setup: Asymptotic analysis of locally linear embedding. The Annals of Statistics, 46, 3805 – 3837.
  • Yan et al. (2009) Yan, D., Huang, L. and Jordan, M. I. (2009) Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 907–916. New York: Association for Computing Machinery.
  • Yang and Dunson (2016) Yang, Y. and Dunson, D. B. (2016) Bayesian manifold regression. The Annals of Statistics, 44, 876 – 905.
  • Ye et al. (2024) Ye, K., Niu, M., Cheung, P., Dai, Z. and Liu, Y. (2024) Intrinsic Gaussian processes on manifolds and their accelerations by symmetry. Preprint arXiv:2006.14266.
{appendices}

8 Proofs in Section 3

8.1 Proof of Proposition 1

Proof.

As A\Lambda^{-1}A^{\top} is a positive semi-definite matrix, its eigenvalues are non-negative. Let \lambda be an eigenvalue of A\Lambda^{-1}A^{\top}. Since A and \Lambda^{-1}A^{\top} are row stochastic matrices, A\Lambda^{-1}A^{\top} is also a row stochastic matrix. Hence \lambda lies in the unit disc and 1 is an eigenvalue of A\Lambda^{-1}A^{\top}, so \lambda\in[0,1]. Therefore, the eigenvalues of L=I-A\Lambda^{-1}A^{\top} are real numbers belonging to [0,1], and 0 is the smallest eigenvalue of L. ∎
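As a quick numerical sanity check of Proposition 1 (an illustration with randomly generated matrices, not part of the proof), one can verify that the eigenvalues of L=I-A\Lambda^{-1}A^{\top} indeed lie in [0,1] and that the smallest one is 0:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, s = 200, 20
A = rng.random((n, s))
A /= A.sum(axis=1, keepdims=True)             # row stochastic n x s matrix A
Lam = np.diag(A.sum(axis=0))                  # Lambda_ii = sum_j A_ji
L = np.eye(n) - A @ np.linalg.inv(Lam) @ A.T  # L = I - A Lambda^{-1} A^T
eig = np.linalg.eigvalsh((L + L.T) / 2)       # L is symmetric up to rounding error
print(eig.min(), eig.max())                   # approximately 0 and a value <= 1
\end{verbatim}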

8.2 Proof of Proposition 2

Proof.

The time complexity of subsampling is \mathcal{O}(ns) for random sampling and \mathcal{O}(nspI) for k-means clustering. Assume r\ll s. The construction of K,Z,A needs \mathcal{O}(nsp) computations. The complexity of the truncated SVD with sparsity is \mathcal{O}(nr^{2}+s^{3}), and the construction of the approximate heat kernel matrix from the eigenpairs needs \mathcal{O}(nmM) computations. Therefore, the total complexity is \mathcal{O}(nsp+s^{3}+nmM) for random sampling or \mathcal{O}(nspI+s^{3}+nmM) for k-means clustering. ∎

9 Proofs in Section 4

9.1 One step random walk graph Laplacian

9.1.1 Proof of Theorem 1

Proof.

This theorem bounds the error between the estimated covariance and the heat kernel in terms of the bandwidth \epsilon and the number of eigenpairs M. In contrast, Theorem 2 in Dunson et al. (2021a) proves a similar result bounding the error between the covariance and the sum of the first eigenpairs of the heat kernel. As our exact prior is the heat kernel Gaussian process, we must additionally estimate the magnitude of the remaining terms of the heat kernel.

Denote by (\lambda_{q},e_{q}) the q-th eigenpair of \mathcal{L} and by (\bar{\lambda}_{q,\epsilon},\bar{v}_{q,\epsilon}) the q-th eigenpair of \bar{L}. Let H_{t}(i,j) and \bar{C}_{\epsilon,M,t}(i,j) be the (i,j) elements of the matrices H_{t} and \bar{C}_{\epsilon,M,t}, respectively. By the spectral decomposition of H_{t} we have

\begin{split}|H_{t}(i,j)-\bar{C}_{\epsilon,M,t}(i,j)|=\left|\sum_{q=1}^{\infty}\left\{e^{-t\lambda_{q}}e_{q}(x_{i})e_{q}(x_{j})\right\}-\bar{C}_{\epsilon,M,t}(i,j)\right|\\ \leq\left|\sum_{q=1}^{M}\left\{e^{-t\lambda_{q}}e_{q}(x_{i})e_{q}(x_{j})\right\}-\bar{C}_{\epsilon,M,t}(i,j)\right|+\sum_{q=M+1}^{\infty}|e^{-t\lambda_{q}}e_{q}(x_{i})e_{q}(x_{j})|.\end{split} (9)

Under Assumption 2 in the main text, Dunson et al. (2021a) have shown that, with probability greater than 1-n^{-2}, the first term satisfies

\left|\sum_{q=1}^{M}\left\{e^{-t\lambda_{q}}e_{q}(x_{i})e_{q}(x_{j})\right\}-\bar{C}_{\epsilon,M,t}(i,j)\right|\leq\mathcal{P}_{1}M^{2}\epsilon^{1/2}.

Furthermore, the second term in (9) is bounded using the fact that the spectrum of the heat kernel decays rapidly. By Weyl's law, there exists a constant c_{4}>0 such that c_{4}q^{2/d}\leq\lambda_{q} for all q\geq 1. Besides, for the normalized eigenfunctions e_{q} of the Laplace–Beltrami operator (Grigor'yan, 2009), there exists a decreasing function F_{\mathcal{X}}(t) such that \sup_{x\in\mathcal{X}}|e^{-t\lambda_{q}}e_{q}(x)|\leq F_{\mathcal{X}}(t) for all t>0; this gives an upper bound on the infinity norm of e_{q}. Then for any t>0 and q\geq 1,

|e^{-t\lambda_{q}}e_{q}(x_{i})e_{q}(x_{j})|\leq F_{\mathcal{X}}^{2}(t/3)\exp{\left(-t\lambda_{q}/3\right)}\leq F_{\mathcal{X}}^{2}(t/3)\exp{\left(-c_{4}tq^{2/d}/3\right)}.

Therefore, summing the above inequality over q\geq M+1, we have

\begin{split}\sum_{q=M+1}^{\infty}|e^{-t\lambda_{q}}e_{q}(x_{i})e_{q}(x_{j})|\leq\sum_{q=M+1}^{\infty}F_{\mathcal{X}}^{2}(t/3)\exp{\left(-c_{4}tq^{2/d}/3\right)}\\ \leq F_{\mathcal{X}}^{2}(t/3)\int_{M}^{\infty}\exp{\left(-c_{4}tz^{2/d}/3\right)}dz\leq\mathcal{P}_{2}\exp(-\mathcal{P}_{3}M^{2/d}).\end{split}

The last inequality holds by bounding the tail integral; one possible route is sketched in the remark below. ∎
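Remark. One way to carry out this algebra (a sketch; the constants below are one admissible choice and are not claimed to be those intended in the original argument) is to write a=c_{4}t/3 and use e^{-az^{2/d}}\leq e^{-aM^{2/d}/2}e^{-az^{2/d}/2} for z\geq M, so that

\int_{M}^{\infty}\exp\left(-az^{2/d}\right)dz\leq\exp\left(-\tfrac{a}{2}M^{2/d}\right)\int_{0}^{\infty}\exp\left(-\tfrac{a}{2}z^{2/d}\right)dz=\exp\left(-\tfrac{a}{2}M^{2/d}\right)\left(\tfrac{2}{a}\right)^{d/2}\Gamma\left(\tfrac{d}{2}+1\right),

where the equality follows from the substitution w=az^{2/d}/2. The stated bound then holds with, for example, \mathcal{P}_{3}=c_{4}t/6 and \mathcal{P}_{2}=F_{\mathcal{X}}^{2}(t/3)\left(6/(c_{4}t)\right)^{d/2}\Gamma(d/2+1).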

9.2 Two step random walk graph Laplacian with subsampling

Lemmas 1 and 2 first establish the equivalence between L and \tilde{L}. Then, based on a perturbation analysis and this equivalence, we estimate the error between L and \hat{L} in Theorem 2.

9.2.1 Proof of Lemma 1

Proof.

According to (4) in Section 3 of the main manuscript, for 1\leq i\leq n,~1\leq j\leq s,~N_{j-1}<j^{\prime}\leq N_{j}, we have

\begin{split}Z_{ij}=n_{j}\tilde{Z}_{ij^{\prime}}=\frac{n_{j}k(x_{i},u_{j})}{\left\{\sum_{q=1}^{n}k(x_{q},u_{j})\right\}\left\{\sum_{q=1}^{s}n_{q}k(x_{i},u_{q})\right\}}.\end{split}

For 1\leq i\leq s, let z_{i} be a column vector such that z_{i} is repeated n_{i} times in \tilde{Z}. Thus Z=\left(n_{1}z_{1},\ldots,n_{s}z_{s}\right) and \tilde{Z}=\left(z_{1},\ldots,z_{1},\ldots,z_{s},\ldots,z_{s}\right). As D_{ii}=\sum_{j=1}^{s}Z_{ij} and \tilde{D}_{ii}=\sum_{j=1}^{n}\tilde{Z}_{ij}, we have D=\tilde{D}. Therefore, denoting D^{-1}z_{i} by a_{i}, we have A=\left(n_{1}a_{1},\ldots,n_{s}a_{s}\right) and \tilde{A}=(\overbrace{a_{1},\ldots,a_{1}}^{n_{1}},\overbrace{a_{2},\ldots,a_{2}}^{n_{2}},\ldots,\overbrace{a_{s},\ldots,a_{s}}^{n_{s}}). ∎

9.2.2 Proof of Lemma 2

Proof.

Since L=I-A\Lambda^{-1}A^{\top} and \tilde{L}=I-\tilde{A}\tilde{\Lambda}^{-1}\tilde{A}^{\top}, it is equivalent to prove that

A\Lambda^{-1}A^{\top}=\tilde{A}\tilde{\Lambda}^{-1}\tilde{A}^{\top}. (10)

By Lemma 1, we have A=\left(n_{1}a_{1},\ldots,n_{s}a_{s}\right) and \tilde{A}=\left(a_{1},\ldots,a_{1},\ldots,a_{s},\ldots,a_{s}\right), along with \Lambda_{ii}=\sum_{j=1}^{n}A_{ji} and \tilde{\Lambda}_{ii}=\sum_{j=1}^{n}\tilde{A}_{ji}. For 1\leq i\leq s,~1\leq j\leq n,~N_{i-1}<q\leq N_{i}, we have A_{ji}=n_{i}\tilde{A}_{jq} and \Lambda_{ii}=n_{i}\tilde{\Lambda}_{qq}. Therefore, the left-hand side of (10) can be expressed as

\begin{split}\left(n_{1}a_{1},\ldots,n_{s}a_{s}\right)\begin{pmatrix}\Lambda_{11}^{-1}&&\\ &\ddots&\\ &&\Lambda_{ss}^{-1}\end{pmatrix}\begin{pmatrix}n_{1}a_{1}^{\top}\\ \vdots\\ n_{s}a_{s}^{\top}\end{pmatrix}=\sum_{i=1}^{s}(n_{i}^{2}/\Lambda_{ii})a_{i}a_{i}^{\top}.\end{split}

And the right-hand side of (10) is also expressed as

\begin{split}(a_{1},\ldots,a_{s})\begin{pmatrix}\tilde{\Lambda}_{11}^{-1}&&\\ &\ddots&\\ &&\tilde{\Lambda}_{nn}^{-1}\end{pmatrix}\begin{pmatrix}a_{1}^{\top}\\ \vdots\\ a_{s}^{\top}\end{pmatrix}=\sum_{i=1}^{s}\sum_{q=N_{i-1}+1}^{N_{i}}\tilde{\Lambda}_{qq}^{-1}a_{i}a_{i}^{\top}=\sum_{i=1}^{s}\left(n_{i}^{2}/\Lambda_{ii}\right)a_{i}a_{i}^{\top}.\end{split}

Thus we have L=\tilde{L}. ∎

9.2.3 Proof of Theorem 2

Proof.

Define \bar{K}_{\cdot i}=\sum_{j=1}^{n}k(x_{j},x_{i}). As \bar{K} is symmetric, \bar{K}_{\cdot i}=\bar{K}_{i\cdot}. Then we have \bar{D}_{ii}=\sum_{j=1}^{n}\bar{Z}_{ij}=(\sum_{j=1}^{n}\bar{K}_{ij}/\bar{K}_{\cdot j})/\bar{K}_{i\cdot}. Under Assumption 3, c_{1}n\leq\bar{K}_{\cdot j}\leq n for 1\leq j\leq n, and thereby 1/n\leq\bar{D}_{ii}\leq 1/(c_{1}n). In the row stochastic matrix \bar{A}, for 1\leq i,j\leq n,

c_{1}n\frac{\bar{K}_{ij}}{\bar{K}_{i\cdot}\bar{K}_{\cdot j}}\leq\bar{A}_{ij}=\bar{D}_{ii}^{-1}\bar{Z}_{ij}=\frac{\bar{K}_{ij}}{\bar{D}_{ii}\bar{K}_{i\cdot}\bar{K}_{\cdot j}}\leq n\frac{\bar{K}_{ij}}{\bar{K}_{i\cdot}\bar{K}_{\cdot j}}.

Then the 1-norm and the \infty-norm of \bar{A} are given by

||\bar{A}||_{1}=\sup_{1\leq j\leq n}\sum_{i=1}^{n}\bar{A}_{ij}\leq\sup_{1\leq j\leq n}\frac{n}{\bar{K}_{\cdot j}}\sum_{i=1}^{n}\frac{\bar{K}_{ij}}{\bar{K}_{i\cdot}}\leq\frac{1}{c_{1}},\quad||\bar{A}||_{\infty}=1.

Therefore, the spectral norm of \bar{A} satisfies ||\bar{A}||_{2}\leq\sqrt{||\bar{A}||_{1}||\bar{A}||_{\infty}}\leq\sqrt{1/c_{1}}. Furthermore, for the diagonal matrix \bar{\Lambda}, we have c_{1}\leq\bar{\Lambda}_{jj}\leq 1/c_{1} for 1\leq j\leq n. Then the 1-norm and the \infty-norm of the column stochastic matrix \bar{A}\bar{\Lambda}^{-1} are given by

||\bar{A}\bar{\Lambda}^{-1}||_{1}=1,\quad||\bar{A}\bar{\Lambda}^{-1}||_{\infty}=\sup_{1\leq i\leq n}\sum_{j=1}^{n}\bar{A}_{ij}\bar{\Lambda}^{-1}_{jj}\leq 1/c_{1}^{2}.

Therefore, \bar{A}\bar{\Lambda}^{-1} satisfies ||\bar{A}\bar{\Lambda}^{-1}||_{2}\leq 1/c_{1}. Similarly, we can prove that ||\tilde{A}||_{2}\leq\sqrt{1/c_{1}}, ||\tilde{A}\tilde{\Lambda}^{-1}||_{2}\leq 1/c_{1} and c_{1}\leq\tilde{\Lambda}_{jj}\leq 1/c_{1} for 1\leq j\leq n. These preparatory bounds are now complete. Subsequently, since L=\tilde{L},

\begin{split}||\hat{L}-L||_{2}&=||\bar{A}\bar{\Lambda}^{-1}\bar{A}^{\top}-\tilde{A}\tilde{\Lambda}^{-1}\tilde{A}^{\top}||_{2}\leq||\bar{A}\bar{\Lambda}^{-1}\bar{A}^{\top}-\tilde{A}\bar{\Lambda}^{-1}\bar{A}^{\top}||_{2}\\ &+||\tilde{A}\bar{\Lambda}^{-1}\bar{A}^{\top}-\tilde{A}\tilde{\Lambda}^{-1}\bar{A}^{\top}||_{2}+||\tilde{A}\tilde{\Lambda}^{-1}\bar{A}^{\top}-\tilde{A}\tilde{\Lambda}^{-1}\tilde{A}^{\top}||_{2}\\ &\leq\frac{1}{c_{1}}||\bar{A}-\tilde{A}||_{2}+\frac{1}{c_{1}}||\bar{\Lambda}^{-1}-\tilde{\Lambda}^{-1}||_{2}+\frac{1}{c_{1}}||\bar{A}^{\top}-\tilde{A}^{\top}||_{2}.\end{split} (11)

The first inequality holds by the triangle inequality, and we have ||\bar{\Lambda}^{-1}-\tilde{\Lambda}^{-1}||_{2}\leq\sup_{1\leq j\leq n}(\sum_{i=1}^{n}|\bar{A}_{ij}-\tilde{A}_{ij}|)/c_{1}^{2}. Let \delta^{\prime}=\sup_{1\leq i,j\leq n}|\bar{K}_{ij}-\tilde{K}_{ij}|. Then for 1\leq i,j\leq n,

|\bar{K}_{i\cdot}-\tilde{K}_{i\cdot}|,~|\bar{K}_{\cdot j}-\tilde{K}_{\cdot j}|\leq n\delta^{\prime},\quad|1/\bar{K}_{i\cdot}-1/\tilde{K}_{i\cdot}|,~|1/\bar{K}_{\cdot j}-1/\tilde{K}_{\cdot j}|\leq\delta^{\prime}/(c_{1}^{2}n).

The triangle inequality implies that, for 1\leq i\leq n,

\begin{split}&|\bar{D}_{ii}-\tilde{D}_{ii}|=\left|\frac{1}{\bar{K}_{i\cdot}}\sum_{j=1}^{n}\frac{\bar{K}_{ij}}{\bar{K}_{\cdot j}}-\frac{1}{\tilde{K}_{i\cdot}}\sum_{j=1}^{n}\frac{\tilde{K}_{ij}}{\tilde{K}_{\cdot j}}\right|\\ &\leq\left|\frac{1}{\bar{K}_{i\cdot}}-\frac{1}{\tilde{K}_{i\cdot}}\right|\sum_{j=1}^{n}\frac{\bar{K}_{ij}}{\bar{K}_{\cdot j}}+\frac{1}{\tilde{K}_{i\cdot}}\sum_{j=1}^{n}\bar{K}_{ij}\left|\frac{1}{\bar{K}_{\cdot j}}-\frac{1}{\tilde{K}_{\cdot j}}\right|+\frac{1}{\tilde{K}_{i\cdot}}\sum_{j=1}^{n}\frac{1}{\tilde{K}_{\cdot j}}\left|\bar{K}_{ij}-\tilde{K}_{ij}\right|\\ &\leq\frac{\delta^{\prime}}{c_{1}^{3}n}+\frac{\delta^{\prime}}{c_{1}^{3}n}+\frac{\delta^{\prime}}{c_{1}^{2}n}\leq\frac{3\delta^{\prime}}{c_{1}^{3}n}.\end{split}

Then |1/\bar{D}_{ii}-1/\tilde{D}_{ii}|\leq 3n\delta^{\prime}/c_{1}^{3}. Similarly, using the triangle inequality again, we have, for 1\leq i,j\leq n,

\begin{split}&|\bar{A}_{ij}-\tilde{A}_{ij}|=\left|\frac{\bar{K}_{ij}}{\bar{D}_{ii}\bar{K}_{i\cdot}\bar{K}_{\cdot j}}-\frac{\tilde{K}_{ij}}{\tilde{D}_{ii}\tilde{K}_{i\cdot}\tilde{K}_{\cdot j}}\right|\\ &\leq\frac{\bar{K}_{ij}}{\bar{K}_{i\cdot}\bar{K}_{\cdot j}}\left|\frac{1}{\bar{D}_{ii}}-\frac{1}{\tilde{D}_{ii}}\right|+\frac{1}{\tilde{D}_{ii}}\left|\frac{\bar{K}_{ij}}{\bar{K}_{i\cdot}\bar{K}_{\cdot j}}-\frac{\tilde{K}_{ij}}{\tilde{K}_{i\cdot}\tilde{K}_{\cdot j}}\right|\leq\frac{3\delta^{\prime}}{c_{1}^{5}n}+\frac{3\delta^{\prime}}{c_{1}^{3}n}\leq\frac{6\delta^{\prime}}{c_{1}^{5}n}.\end{split}

Consequently, ||\bar{A}-\tilde{A}||_{1},~||\bar{A}-\tilde{A}||_{\infty}\leq 6\delta^{\prime}/c_{1}^{5} and ||\bar{\Lambda}^{-1}-\tilde{\Lambda}^{-1}||_{2}\leq 6\delta^{\prime}/c_{1}^{7}. Then we have ||\bar{A}-\tilde{A}||_{2}\leq 6\delta^{\prime}/c_{1}^{5}. As \sup_{1\leq i\leq s}\sup_{N_{i-1}<k\leq N_{i}}||x_{k}-u_{i}||\leq\delta, for the squared exponential kernel, after a lengthy calculation we have \delta^{\prime}\leq\delta/\epsilon+\delta^{2}/\epsilon^{2}. Finally, (11) gives

||\hat{L}-L||_{2}\leq(12/c_{1}^{6}+6/c_{1}^{8})\delta^{\prime}\leq(12/c_{1}^{6}+6/c_{1}^{8})(\delta/\epsilon+\delta^{2}/\epsilon^{2})\leq C_{1}\delta+C_{2}\delta^{2},

where C_{1}=(12/c_{1}^{6}+6/c_{1}^{8})/\epsilon and C_{2}=(12/c_{1}^{6}+6/c_{1}^{8})/\epsilon^{2}. ∎

9.2.4 Proof of Corollary 1

Proof.

||I-\bar{A}^{2}-L||_{2}\leq||\hat{L}-L||_{2}+||\bar{A}||_{2}||\bar{A}-\bar{\Lambda}^{-1}\bar{A}^{\top}||_{2}\leq C_{1}\delta+C_{2}\delta^{2}+\sqrt{1/c_{1}}\gamma. ∎

9.3 Predictive errors in Gaussian processes

Lemmas 3 and 4 are technical results used to estimate the predictive errors in Theorem 3. In particular, Lemma 3 relies on matrix perturbation theory and Lemma 4 on properties of the Gaussian distribution.

9.3.1 Proof of Lemma 3

Proof.

This lemma reflects the fact that the posterior expectation and variance of a multivariate Gaussian distribution are uniquely determined by the covariance matrix. Theorem 4 in Dunson et al. (2021a) proves the corresponding statement for the posterior expectation only, which is sufficient in the Gaussian regression model. However, owing to the link function in the natural exponential family, we additionally have to bound the difference in posterior variances between different Gaussian distributions. We utilize matrix perturbation theory to complete the proof.

By the property of the conditional normal distribution, we have

\begin{split}|\mu_{1}-\mu_{2}|=&|\Sigma^{1}_{f^{*}f^{m}}(\Sigma^{1}_{f^{m}f^{m}})^{-1}f^{m}-\Sigma^{2}_{f^{*}f^{m}}(\Sigma^{2}_{f^{m}f^{m}})^{-1}f^{m}|,\\ |\sigma_{1}^{2}-\sigma_{2}^{2}|=|(\Sigma^{1}_{f^{*}f^{*}}-&\Sigma^{1}_{f^{*}f^{m}}(\Sigma^{1}_{f^{m}f^{m}})^{-1}\Sigma^{1}_{f^{m}f^{*}})-(\Sigma^{2}_{f^{*}f^{*}}-\Sigma^{2}_{f^{*}f^{m}}(\Sigma^{2}_{f^{m}f^{m}})^{-1}\Sigma^{2}_{f^{m}f^{*}})|.\end{split}

Similar to Theorem 4 in Dunson et al. (2021a), there exists a constant c_{2}>0 such that

|\mu_{1}-\mu_{2}|\leq\frac{\delta m||f^{m}||_{\infty}\left((c_{2}+3\lambda)||\Sigma^{1}_{f^{*}f^{m}}||_{\infty}+\lambda\right)}{\lambda(\lambda+c_{2}\delta)}.

Then the remaining work is to prove the upper bound for the conditional variance. By the triangle inequality, we have

|\sigma_{1}^{2}-\sigma_{2}^{2}|\leq\delta+|\Sigma^{2}_{f^{*}f^{m}}(\Sigma^{2}_{f^{m}f^{m}})^{-1}\Sigma^{2}_{f^{m}f^{*}}-\Sigma^{1}_{f^{*}f^{m}}(\Sigma^{1}_{f^{m}f^{m}})^{-1}\Sigma^{1}_{f^{m}f^{*}}|. (12)

Since ||\Sigma^{1}-\Sigma^{2}||_{\max}\leq\delta, rewrite \Sigma^{2}=\Sigma^{1}+\delta E with ||E||_{\max}\leq 1. Consequently, \Sigma^{2}_{f^{m}f^{m}}=\Sigma^{1}_{f^{m}f^{m}}+\delta E_{f^{m}f^{m}} and \Sigma^{2}_{f^{*}f^{m}}=\Sigma^{1}_{f^{*}f^{m}}+\delta E_{f^{*}f^{m}}. Denote the spectral decompositions of \Sigma^{1}_{f^{m}f^{m}} and \Sigma^{2}_{f^{m}f^{m}} by \Sigma^{1}_{f^{m}f^{m}}=UQU^{\top} and \Sigma^{2}_{f^{m}f^{m}}=\bar{U}\bar{Q}\bar{U}^{\top}, where U,\bar{U}\in O(m) and Q,\bar{Q} are m\times m diagonal matrices. By the perturbation theory in Wu and Wu (2018), there exist U^{\prime}, Q^{\prime} with ||U^{\prime}||_{\text{max}},||Q^{\prime}||_{\text{max}}\leq 1 and O\in O(m) commuting with Q,Q^{-1}, such that \bar{U}=UO+\delta U^{\prime} and \bar{Q}=Q+\delta Q^{\prime}. Then we have

\begin{split}|\Sigma^{2}_{f^{*}f^{m}}&(\Sigma^{2}_{f^{m}f^{m}})^{-1}\Sigma^{2}_{f^{m}f^{*}}-\Sigma^{1}_{f^{*}f^{m}}(\Sigma^{1}_{f^{m}f^{m}})^{-1}\Sigma^{1}_{f^{m}f^{*}}|=|\Sigma^{1}_{f^{*}f^{m}}UQ^{-1}U^{\top}\Sigma^{1}_{f^{m}f^{*}}-\\ &(\Sigma^{1}_{f^{*}f^{m}}+\delta E_{f^{*}f^{m}})(UO+\delta U^{\prime})(Q+\delta Q^{\prime})^{-1}(UO+\delta U^{\prime})^{\top}(\Sigma^{1}_{f^{m}f^{*}}+\delta E_{f^{m}f^{*}})|.\end{split}

Notably, both \Sigma^{1}_{f^{*}f^{m}} and \Sigma^{1}_{f^{m}f^{*}} are vectors here. There exists a constant c_{3}>0 such that

||(Q+\delta Q^{\prime})^{-1}||_{\infty}\leq\frac{1}{\lambda+c_{3}\delta},\quad||(Q+\delta Q^{\prime})^{-1}-Q^{-1}||_{\infty}\leq\frac{c_{3}\delta}{\lambda(\lambda+c_{3}\delta)}.

Consequently, by the triangle inequality, the second term of (12) is bounded by

\begin{split}&\frac{c_{3}\delta m||\Sigma^{1}_{f^{*}f^{m}}||_{\infty}^{2}}{\lambda(\lambda+c_{3}\delta)}+\frac{2\delta m}{\lambda+c_{3}\delta}||\Sigma^{1}_{f^{*}f^{m}}||_{\infty}\left(||\Sigma^{1}_{f^{*}f^{m}}||_{\infty}+1+\mathcal{O}(\delta)\right)\\ &\leq\frac{2\delta m||\Sigma^{1}_{f^{*}f^{m}}||_{\infty}\left((c_{3}+3\lambda)||\Sigma^{1}_{f^{*}f^{m}}||_{\infty}+\lambda\right)}{\lambda(\lambda+c_{3}\delta)}.\end{split}

9.3.2 Proof of Lemma 4

Proof.

Let \xi\sim N(0,1). As \pi_{\Sigma^{1}}(f^{*}|f^{m})\sim N(\mu_{1},\sigma_{1}^{2}) and \pi_{\Sigma^{2}}(f^{*}|f^{m})\sim N(\mu_{2},\sigma_{2}^{2}), we have \pi_{\Sigma^{1}}(f^{*}|f^{m})\stackrel{d}{=}\mu_{1}+\sigma_{1}\xi and \pi_{\Sigma^{2}}(f^{*}|f^{m})\stackrel{d}{=}\mu_{2}+\sigma_{2}\xi. Then |\mathbb{E}_{1}[l(f^{*})|f^{m}]-\mathbb{E}_{2}[l(f^{*})|f^{m}]|=|\mathbb{E}[l(\mu_{1}+\sigma_{1}\xi)-l(\mu_{2}+\sigma_{2}\xi)]|. By the mean value theorem,

\begin{split}|\mathbb{E}[l(\mu_{1}+\sigma_{1}\xi)-l(\mu_{2}+\sigma_{2}\xi)]|&\leq||l^{\prime}||_{\infty}\mathbb{E}|\mu_{1}+\sigma_{1}\xi-\mu_{2}-\sigma_{2}\xi|\\ &\leq||l^{\prime}||_{\infty}\left(|\mu_{1}-\mu_{2}|+|\sigma_{1}-\sigma_{2}|\mathbb{E}|\xi|\right)\\ &\leq||l^{\prime}||_{\infty}\left(\delta_{1}+\mathbb{E}|\xi|\sqrt{\delta_{2}}\right).\end{split}

The last inequality holds as \delta_{2}\geq|\sigma_{1}^{2}-\sigma_{2}^{2}|\geq|\sigma_{1}-\sigma_{2}|^{2}. ∎

9.3.3 Proof of Theorem 3

Proof.

Let H_{t}=\begin{pmatrix}H_{11}&H_{12}\\ H_{21}&H_{22}\end{pmatrix}, where H_{11}\in\mathbb{R}^{m\times m}. For m+1\leq i\leq n, Lemmas 3 and 4 give

\left|\mathbb{E}_{H_{t}}\{\varphi(x_{i})|f^{m}\}-\mathbb{E}_{C_{\epsilon,M,t}}\{\varphi(x_{i})|f^{m}\}\right|\leq||l^{\prime}||_{\infty}(\delta_{1}+\mathbb{E}|\xi|\sqrt{\delta_{2}}),

where

\begin{split}\delta_{1}=\frac{\delta m||f^{m}||_{\infty}\left((c_{2}+3\lambda)||H_{21}||_{\infty}+\lambda\right)}{\lambda(\lambda+c_{2}\delta)},\\ \delta_{2}=\delta+\frac{2\delta m||H_{21}||_{\infty}\left((c_{3}+3\lambda)||H_{21}||_{\infty}+\lambda\right)}{\lambda(\lambda+c_{3}\delta)},\end{split}

and \lambda is the smallest eigenvalue of H_{11}. ∎

9.4 Posterior contraction rates

The major objective of this part is to prove the posterior contraction rate of heat kernel GPs in the NEF model (Theorem 4). To this end, we first establish the relations among the relevant statistical metrics in Lemma 5.

9.4.1 Proof of Lemma 5

Proof.

The proof is divided into three parts, separately bounding \text{KL}(\pi_{f_{1}};\pi_{f_{2}}), \text{V}_{2,0}(\pi_{f_{1}};\pi_{f_{2}}) and d_{H}(\pi_{f_{1}},\pi_{f_{2}}) from above by ||f_{1}-f_{2}||_{\infty}. Let V be the volume form on \mathcal{X} and let \mu be the dominating measure on \mathcal{Y}.

  1.

    By the definition of KL divergence,

    \text{KL}(\pi_{f_{1}};\pi_{f_{2}})=\int_{\mathcal{X}}\int_{\mathcal{Y}}\pi_{f_{1}}(y|x)\log\frac{\pi_{f_{1}}(y|x)}{\pi_{f_{2}}(y|x)}d\mu(y)dG(x).

    Based on the NEF model and the Fubini theorem, the above KL divergence can be rewritten as

    \begin{split}\text{KL}(\pi_{f_{1}};\pi_{f_{2}})=\int_{\mathcal{X}}dG(x)\int_{\mathcal{Y}}\pi_{f_{1}(x)}(y)[\{\zeta(f_{1}(x))-\zeta(f_{2}(x))\}\kappa(y)\\ +J(\zeta(f_{2}(x)))-J(\zeta(f_{1}(x)))]d\mu(y).\end{split}

    Let \varphi_{1,u}(v)=\int_{\mathcal{Y}}\pi_{u}(y)[(\zeta(u)-\zeta(v))\kappa(y)+J(\zeta(v))-J(\zeta(u))]d\mu(y). Since \mathbb{E}_{\theta}[\kappa(y)]=J^{\prime}(\theta) and J^{\prime}(\zeta(u))=l(u), we have

    \begin{split}\varphi_{1,u}(v)&=[\zeta(u)-\zeta(v)]J^{\prime}(\zeta(u))+J(\zeta(v))-J(\zeta(u))\\ &=[\zeta(u)-\zeta(v)]l(u)+J(\zeta(v))-J(\zeta(u)).\end{split}

    The derivative of \varphi_{1,u}(v) with respect to v is \varphi_{1,u}^{\prime}(v)=\zeta^{\prime}(v)[l(v)-l(u)], and clearly \varphi_{1,u}(u)=0. So by the mean value theorem, there exists a\in(u,v)\lor(v,u) such that \varphi_{1,u}(v)=\varphi_{1,u}^{\prime}(a)(v-u)=(v-u)\zeta^{\prime}(a)[l(a)-l(u)]. Using the mean value theorem again, there exists b\in(u,a)\lor(a,u) such that \varphi_{1,u}(v)=(v-u)\zeta^{\prime}(a)l^{\prime}(b)(a-u). Suppose u and v lie in \Omega; then \zeta^{\prime}(a)\leq\sup_{u\in\Omega}|\zeta^{\prime}(u)|, l^{\prime}(b)\leq\sup_{u\in\Omega}|l^{\prime}(u)| and |a-u|\leq|v-u|. Take C_{4}(\Omega) as \left(\sup_{u\in\Omega}|\zeta^{\prime}(u)|\right)\lor\left(\sup_{u\in\Omega}|l^{\prime}(u)|\right). Thereby we have |\varphi_{1,u}(v)|\leq C_{4}(\Omega)|v-u|^{2}. So the upper bound of the KL divergence is given by

    \begin{split}\text{KL}(\pi_{f_{1}};\pi_{f_{2}})&=\int_{\mathcal{X}}\varphi_{1,f_{1}(x)}(f_{2}(x))dG(x)\\ &\leq\int_{\mathcal{X}}C_{4}(\Omega)|f_{2}(x)-f_{1}(x)|^{2}dG(x)\\ &\leq C_{4}(\Omega)||f_{2}-f_{1}||_{\infty}^{2}.\end{split}
  2.

    The second KL variation \text{V}_{2,0}(\pi_{f_{1}};\pi_{f_{2}}) is bounded by

    V_{2}(\pi_{f_{1}};\pi_{f_{2}})\triangleq\mathbb{E}_{\pi_{f_{1}}}\left[\left(\log\frac{\pi_{f_{1}}}{\pi_{f_{2}}}\right)^{2}\right].

    Moreover, for one-parameter NEF models, we have

    \begin{split}V_{2}(\pi_{f_{1}};\pi_{f_{2}})=\int_{\mathcal{X}}dG(x)\int_{\mathcal{Y}}\pi_{f_{1}(x)}(y)[\{\zeta(f_{1}(x))-\zeta(f_{2}(x))\}\kappa(y)\\ +J(\zeta(f_{2}(x)))-J(\zeta(f_{1}(x)))]^{2}d\mu(y).\end{split}

    Let \varphi_{2,u}(v)=\int_{\mathcal{Y}}\pi_{u}(y)[(\zeta(u)-\zeta(v))\kappa(y)+J(\zeta(v))-J(\zeta(u))]^{2}d\mu(y). Since \mathbb{E}_{\theta}[\kappa(y)^{2}]=J^{\prime\prime}(\theta)+J^{\prime}(\theta)^{2}, expanding \varphi_{2,u}(v) gives

    \begin{split}\varphi_{2,u}(v)=&[\zeta(v)-\zeta(u)]^{2}[J^{\prime\prime}(\zeta(u))+l(u)^{2}]+[J(\zeta(v))-J(\zeta(u))]^{2}\\ &+2[\zeta(u)-\zeta(v)][J(\zeta(v))-J(\zeta(u))]l(u).\end{split}

    Then by the mean value theorem, there exist constants a,b\in(u,v)\lor(v,u) such that \zeta(v)-\zeta(u)=\zeta^{\prime}(a)(v-u) and J(\zeta(v))-J(\zeta(u))=J^{\prime}(\zeta(b))\zeta^{\prime}(b)(v-u). Consequently, we have the elementwise upper bound for \varphi_{2,u}(v):

    \begin{split}|\varphi_{2,u}(v)|\leq|v-u|^{2}\left[\sup_{u\in\Omega}|\zeta^{\prime}(u)|^{2}\left(4\sup_{u\in\Omega}|l(u)|^{2}+\sup_{u\in\Omega}|J^{\prime\prime}(\zeta(u))|\right)\right].\end{split}

    Take C_{5}(\Omega)=\sup_{u\in\Omega}|\zeta^{\prime}(u)|^{2}\left(4\sup_{u\in\Omega}|l(u)|^{2}+\sup_{u\in\Omega}|J^{\prime\prime}(\zeta(u))|\right); then |\varphi_{2,u}(v)|\leq C_{5}(\Omega)|v-u|^{2}. So an upper bound of the second KL variation is

    \begin{split}V_{2,0}(\pi_{f_{1}};\pi_{f_{2}})&\leq V_{2}(\pi_{f_{1}};\pi_{f_{2}})=\int_{\mathcal{X}}\varphi_{2,f_{1}(x)}(f_{2}(x))dG(x)\\ &\leq\int_{\mathcal{X}}C_{5}(\Omega)|f_{2}(x)-f_{1}(x)|^{2}dG(x)\leq C_{5}(\Omega)||f_{2}-f_{1}||_{\infty}^{2}.\end{split}
  3.

    By the definition and the Fubini theorem, the Hellinger distance is

    d_{H}^{2}(\pi_{f_{1}},\pi_{f_{2}})=\int_{\mathcal{X}}dG(x)\int_{\mathcal{Y}}\left[\sqrt{\pi_{f_{1}(x)}(y)}-\sqrt{\pi_{f_{2}}(y|x)}\right]^{2}d\mu(y).

    Let \varphi_{3,u}(v)=\int_{\mathcal{Y}}\left[\sqrt{\pi_{u}(y)}-\sqrt{\pi_{v}(y)}\right]^{2}d\mu(y). Multiplying out the square, \varphi_{3,u}(v) can be rewritten as

    \varphi_{3,u}(v)=2\left(1-\int_{\mathcal{Y}}\sqrt{\pi_{u}(y)\pi_{v}(y)}d\mu(y)\right).

    Based on the one-parameter NEF likelihoods, since \Theta is convex, we have

    \begin{split}\varphi_{3,u}(v)&=2\left[1-\int_{\mathcal{Y}}h(y)\exp\left\{\frac{\zeta(u)+\zeta(v)}{2}\kappa(y)-\frac{J(\zeta(u))+J(\zeta(v))}{2}\right\}d\mu(y)\right]\\ &=2\left[1-\exp{\left\{J\left(\frac{\zeta(u)+\zeta(v)}{2}\right)-\frac{J(\zeta(u))+J(\zeta(v))}{2}\right\}}\right].\end{split}

    As J^{\prime\prime}(\theta)=\text{Var}_{\theta}\left(\kappa(Y)\right)\geq 0, J(\theta) is a convex function of \theta. Then J\left(\frac{\zeta(u)+\zeta(v)}{2}\right)-\frac{J(\zeta(u))+J(\zeta(v))}{2}\leq 0. By the elementary inequality 0\leq 1-e^{-z}\leq z for z\geq 0, we have

    0\leq\varphi_{3,u}(v)\leq 2\left[\frac{J(\zeta(u))+J(\zeta(v))}{2}-J\left(\frac{\zeta(u)+\zeta(v)}{2}\right)\right].

    Let \psi_{u}(v)=2\left[\frac{J(\zeta(u))+J(\zeta(v))}{2}-J\left(\frac{\zeta(u)+\zeta(v)}{2}\right)\right]. Then the derivative with respect to v is \psi_{u}^{\prime}(v)=\zeta^{\prime}(v)\left[J^{\prime}(\zeta(v))-J^{\prime}\left(\frac{\zeta(u)+\zeta(v)}{2}\right)\right]. As J^{\prime}(\theta) is non-decreasing, we have

    \left|J^{\prime}(\zeta(v))-J^{\prime}\left(\frac{\zeta(u)+\zeta(v)}{2}\right)\right|\leq\left|J^{\prime}(\zeta(v))-J^{\prime}(\zeta(u))\right|=\left|l(v)-l(u)\right|.

    Clearly \psi_{u}(u)=0. So by the mean value theorem, there exists a constant a\in(u,v)\lor(v,u) such that

    0\leq\psi_{u}(v)=\psi_{u}^{\prime}(a)(v-u)\leq|v-u||\zeta^{\prime}(a)||l(a)-l(u)|\leq||\zeta^{\prime}||_{\infty}||l^{\prime}||_{\infty}|v-u|^{2}.

    Under Assumption 4, ||\zeta^{\prime}||_{\infty}||l^{\prime}||_{\infty}<+\infty. Take C_{6}=||\zeta^{\prime}||_{\infty}||l^{\prime}||_{\infty}. Then the elementwise bound on \psi_{u}(v) implies that

    \begin{split}d_{H}^{2}(\pi_{f_{1}},\pi_{f_{2}})&=\int_{\mathcal{X}}\varphi_{3,f_{1}(x)}(f_{2}(x))dG(x)\\ &\leq\int_{\mathcal{X}}C_{6}|f_{2}(x)-f_{1}(x)|^{2}dG(x)\leq C_{6}||f_{2}-f_{1}||_{\infty}^{2}.\end{split}

9.4.2 Proof of Theorem 4

Proof.

With Assumption 2, Castillo et al. (2014) have demonstrated that the domain space (\mathcal{X},G) satisfies the Ahlfors property and the Gaussian two-sided bounds. Furthermore, they have proved the following properties of the adaptive heat kernel Gaussian processes: when f\sim\mathcal{GP}(0,p_{T}) with T a random diffusion time, taking vanishing sequences \epsilon_{m}\sim\bar{\epsilon}_{m}\sim(\log m/m)^{2\beta/(2\beta+d)}, there exist measurable subsets B_{m}\subset L^{\infty}(\mathcal{X},G) and a positive constant C_{7} such that

\mathbb{P}(||f-f_{0}||_{\infty}\leq\bar{\epsilon}_{m})\geq\exp{\left(-C_{7}m\bar{\epsilon}_{m}^{2}\right)},
\log N(\epsilon_{m},B_{m},||\cdot||_{\infty})\leq m\epsilon_{m}^{2},
\mathbb{P}(f\notin B_{m})\leq\exp{\left(-(C_{7}+4)m\bar{\epsilon}_{m}^{2}\right)},

where N(\epsilon_{m},B_{m},||\cdot||_{\infty}) is the covering number. These inequalities are respectively called the KL condition, the entropy condition and the remaining-mass condition in Bayesian nonparametric inference (Ghosal and van der Vaart, 2017).

Since \mathcal{X} is compact and f_{0} is continuous on \mathcal{X}, f_{0} is bounded. Then the functions in \{f,~||f-f_{0}||_{\infty}\leq\bar{\epsilon}_{m}\} are uniformly bounded; assume they take values in a bounded set \Omega\subset\mathbb{R}. Denote C_{4}(\Omega) and C_{5}(\Omega) by C_{4} and C_{5}. For f_{1},f_{2}\in\{f,~||f-f_{0}||_{\infty}\leq\bar{\epsilon}_{m}\}, Lemma 5 shows that \text{KL}(\pi_{f_{1}};\pi_{f_{2}})\leq C_{4}||f_{1}-f_{2}||_{\infty}^{2} and \text{V}_{2,0}(\pi_{f_{1}};\pi_{f_{2}})\leq C_{5}||f_{1}-f_{2}||_{\infty}^{2}. After some simple set operations, we have

\{f,~||f-f_{0}||_{\infty}\leq\bar{\epsilon}_{m}\}\subset\left\{f,~\text{KL}(\pi_{f_{0}};\pi_{f})\leq C_{4}\bar{\epsilon}_{m}^{2},\text{V}_{2,0}(\pi_{f_{0}};\pi_{f})\leq C_{5}\bar{\epsilon}_{m}^{2}\right\},
\forall f\in L^{\infty}(\mathcal{X},G),\quad B(f,\epsilon_{m},||\cdot||_{\infty})\subset B(f,\sqrt{C_{6}}\epsilon_{m},d_{H}).

Let \Pi be the prior probability distribution of f. After transforming the infinity norm into the statistical distance, we obtain

\Pi\left\{f,~\text{KL}(\pi_{f_{0}};\pi_{f})\leq C_{4}\bar{\epsilon}_{m}^{2},\text{V}_{2,0}(\pi_{f_{0}};\pi_{f})\leq C_{5}\bar{\epsilon}_{m}^{2}\right\}\geq\exp{\left(-C_{7}m\bar{\epsilon}_{m}^{2}\right)},
\log N(\sqrt{C_{6}}\epsilon_{m},B_{m},d_{H})\leq m\epsilon_{m}^{2}.

In addition, Lemma B.5 in Ghosal and van der Vaart (2017) implies that the basic testing assumption holds for the Hellinger distance. Finally, by Theorem 8.9 in Ghosal and van der Vaart (2017), the posterior contraction rate at f0f_{0} with respect to dHd_{H} is ϵm\epsilon_{m}. ∎

10 The Nyström Approximation for GLGP

Denote the point cloud by XnX^{n}. We first review the GLGP algorithm of Dunson et al. (2021a). Step 1, construct the n×nn\times n kernel matrix KK with a given kernel kk as Kij=k(xi,xj)K_{ij}=k(x_{i},x_{j}) for 1i,jn1\leq i,j\leq n; clearly KSn×nK\in S^{n\times n}. The most common choice of kk is the squared exponential kernel, whose bandwidth we denote by ϵ\epsilon. Step 2, normalize KK to obtain the similarity matrix ZZ, that is, for 1i,jn1\leq i,j\leq n,

Zij=Kijq=1nKiqq=1nKqj.Z_{ij}=\frac{K_{ij}}{\sum_{q=1}^{n}K_{iq}\sum_{q=1}^{n}K_{qj}}.

Afterward, calculate the transition matrix AA by row normalization:

Aij=Zijq=1nZiq,1i,jn.A_{ij}=\frac{Z_{ij}}{\sum_{q=1}^{n}Z_{iq}},\quad 1\leq i,j\leq n.

Thus, AA is a row-stochastic matrix. Step 3, compute the eigendecomposition of AA to obtain the eigenpairs {μi,ϵ,vi,ϵ}i=1n\{\mu_{i,\epsilon},v_{i,\epsilon}\}_{i=1}^{n} with vi,ϵ2=1||v_{i,\epsilon}||_{2}=1. Since the graph Laplacian is defined by L=IAL=I-A, the eigenpairs {λi,ϵ,vi,ϵ}i=1n\{\lambda_{i,\epsilon},v_{i,\epsilon}\}_{i=1}^{n} of LL are given by λi,ϵ=1μi,ϵ\lambda_{i,\epsilon}=1-\mu_{i,\epsilon}. Step 4, after choosing the top MM eigenvalues and eigenvectors, the heat kernel matrix in GLGP is estimated by, for t>0t>0,

Cϵ,M,t=ni=1Mexp(tλi,ϵ)vi,ϵvi,ϵ.C_{\epsilon,M,t}=n\sum_{i=1}^{M}\exp{(-t\lambda_{i,\epsilon})}v_{i,\epsilon}v_{i,\epsilon}^{\top}.
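For concreteness, a minimal Python/NumPy sketch of Steps 1–4 is given below. The squared exponential parameterisation exp(−‖x−x′‖²/(4ε)) and the function name glgp_heat_kernel are illustrative assumptions rather than the authors' implementation; the dense eigendecomposition it uses is precisely the bottleneck discussed next.

```python
import numpy as np

def glgp_heat_kernel(X, eps, M, t):
    """Sketch of the GLGP heat-kernel estimate (Steps 1-4); names and the
    kernel parameterisation are illustrative assumptions."""
    n = X.shape[0]
    # Step 1: squared exponential kernel matrix K (assumed form exp(-d^2 / (4*eps))).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (4.0 * eps))
    # Step 2: kernel normalisation Z and row-stochastic transition matrix A.
    d = K.sum(axis=1)
    Z = K / np.outer(d, d)
    A = Z / Z.sum(axis=1, keepdims=True)
    # Step 3: eigenpairs of A; L = I - A shares eigenvectors, with eigenvalues 1 - mu.
    mu, V = np.linalg.eig(A)
    order = np.argsort(-mu.real)[:M]            # keep the top M eigenvalues of A
    lam = 1.0 - mu.real[order]
    V = V.real[:, order]
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    # Step 4: truncated spectral expansion of the heat kernel.
    return n * (V * np.exp(-t * lam)) @ V.T
```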

However, since An×nA\in\mathbb{R}^{n\times n}, the computational complexity of GLGP is at least 𝒪(n3)\mathcal{O}(n^{3}), which is prohibitive for large-scale applications. It is therefore beneficial for speed to consider the Nyström approximation, which proceeds as follows. Step 1, subsample ss induced points from XnX^{n}, denoted as IsI^{s}. Step 2, calculate the kernel matrices KK^{\prime} and KnysK^{\mathrm{nys}} by Kij=k(ui,uj)K^{\prime}_{ij}=k(u_{i},u_{j}) and Kijnys=k(xi,uj)K^{\mathrm{nys}}_{i^{\prime}j}=k(x_{i^{\prime}},u_{j}) for 1i,js,1in1\leq i,j\leq s,~{}1\leq i^{\prime}\leq n, where kk is a kernel function. Then Ks×sK^{\prime}\in\mathbb{R}^{s\times s} and Knysn×sK^{\mathrm{nys}}\in\mathbb{R}^{n\times s}. Step 3, obtain the similarity matrix ZZ^{\prime} by, for 1i,js1\leq i,j\leq s,

Zij=Kijq=1sKiqq=1sKqj.Z^{\prime}_{ij}=\frac{K^{\prime}_{ij}}{\sum_{q=1}^{s}K^{\prime}_{iq}\sum_{q=1}^{s}K^{\prime}_{qj}}.

Similarly, compute ZnysZ^{\mathrm{nys}} by, for 1in,1js1\leq i\leq n,~{}1\leq j\leq s,

Zijnys=Kijnysq=1sKiqnysq=1sKqj.Z^{\mathrm{nys}}_{ij}=\frac{K^{\mathrm{nys}}_{ij}}{\sum_{q=1}^{s}K^{\mathrm{nys}}_{iq}\sum_{q=1}^{s}K^{\prime}_{qj}}.

Notice that in the denominator of ZnysZ^{\mathrm{nys}}, the second sum is q=1sKqj\sum_{q=1}^{s}K^{\prime}_{qj}, but not q=1nKqjnys\sum_{q=1}^{n}K^{\mathrm{nys}}_{qj}. Construct the transition matrices AA^{\prime} and AnysA^{\mathrm{nys}} by, for 1i,js,1in1\leq i,j\leq s,~{}1\leq i^{\prime}\leq n,

Aij=Zijq=1sZiq,Aijnys=Zijnysq=1sZiqnys.A^{\prime}_{ij}=\frac{Z^{\prime}_{ij}}{\sum_{q=1}^{s}Z^{\prime}_{iq}},\quad A^{\mathrm{nys}}_{i^{\prime}j}=\frac{Z^{\mathrm{nys}}_{i^{\prime}j}}{\sum_{q=1}^{s}Z^{\mathrm{nys}}_{i^{\prime}q}}.

Step 4, obtain the eigenpairs {(μi,ϵ,vi,ϵ)}i=1s\{(\mu^{\prime}_{i,\epsilon},v^{\prime}_{i,\epsilon})\}_{i=1}^{s} of AA^{\prime} with vi,ϵ2=1||v^{\prime}_{i,\epsilon}||_{2}=1 through the eigendecomposition. Define the extended eigenpairs {(λi,ϵnys,vi,ϵnys)}i=1s\{(\lambda^{\mathrm{nys}}_{i,\epsilon},v^{\mathrm{nys}}_{i,\epsilon})\}_{i=1}^{s} as λi,ϵnys=1μi,ϵ\lambda^{\mathrm{nys}}_{i,\epsilon}=1-\mu^{\prime}_{i,\epsilon} and vi,ϵnys=Anysvi,ϵ/μi,ϵnv^{\mathrm{nys}}_{i,\epsilon}={A^{\mathrm{nys}}v^{\prime}_{i,\epsilon}}/{\mu^{\prime}_{i,\epsilon}}\in\mathbb{R}^{n}, for 1is1\leq i\leq s. This transformation of the eigenvectors is the key idea behind the Nyström approximation. Step 5, choose the top MM eigenvalues and the corresponding eigenvectors, and construct the Nyström heat kernel estimate as in GLGP by

Cϵ,M,tnys=si=1Mexp(tλi,ϵnys)vi,ϵnys(vi,ϵnys).C^{\mathrm{nys}}_{\epsilon,M,t}=s\sum_{i=1}^{M}\exp(-t\lambda^{\mathrm{nys}}_{i,\epsilon})v^{\mathrm{nys}}_{i,\epsilon}(v^{\mathrm{nys}}_{i,\epsilon})^{\top}.

The different normalizations of the eigenvectors explain the different scaling coefficients in the two covariance matrices, which are nn and ss respectively.
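The Nyström variant (Steps 1–5) can be sketched analogously; again the kernel parameterisation and function names are assumptions for illustration. Only the s×s matrix A′ is eigendecomposed, and its eigenvectors are extended to all n points.

```python
import numpy as np

def nystrom_heat_kernel(X, U, eps, M, t):
    """Sketch of the Nystrom-approximated heat kernel (Steps 1-5).
    X: (n, d) point cloud; U: (s, d) induced points subsampled from X."""
    s = U.shape[0]
    # Step 2: kernel matrices K' (induced points) and K^nys (point cloud vs induced points).
    Kp = np.exp(-np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1) / (4.0 * eps))
    Kn = np.exp(-np.sum((X[:, None, :] - U[None, :, :]) ** 2, axis=-1) / (4.0 * eps))
    # Step 3: similarity matrices; the second sum in the Z^nys denominator uses K'.
    colsum = Kp.sum(axis=0)
    Zp = Kp / np.outer(Kp.sum(axis=1), colsum)
    Zn = Kn / np.outer(Kn.sum(axis=1), colsum)
    Ap = Zp / Zp.sum(axis=1, keepdims=True)     # s x s transition matrix A'
    An = Zn / Zn.sum(axis=1, keepdims=True)     # n x s cross transition matrix A^nys
    # Step 4: eigenpairs of A' extended to the whole point cloud (Nystrom formula).
    mu, V = np.linalg.eig(Ap)
    order = np.argsort(-mu.real)[:M]
    mu, V = mu.real[order], V.real[:, order]
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    lam = 1.0 - mu
    V_ext = (An @ V) / mu                       # v^nys = A^nys v' / mu'
    # Step 5: truncated spectral expansion of the heat kernel on X.
    return s * (V_ext * np.exp(-t * lam)) @ V_ext.T
```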

11 Sensitivity analysis on MNIST

The predefined hyperparameters in FLGP are s,rs,r and MM. We perform a sensitivity analysis of s,r,Ms,r,M in the MNIST example. Let n=70000n=70000 and m=100,200m=100,200. We randomly sample mm images as the labeled observations and treat the remaining images as the unlabeled observations. The numerical experiment is repeated 1010 times for each configuration. Error rates are computed for FLGP with different hyperparameters.

The number of induced points ss has a great impact on how well the manifold is estimated. As ss increases, the induced points IsI^{s} gradually become denser in the domain 𝒳\mathcal{X}. For clustering subsampling, the Euclidean distance between each cluster center and the points assigned to it shrinks, so the representation error decreases. Then IsI^{s} represents 𝒳\mathcal{X} more faithfully, and we obtain a better approximation of the heat kernel in FLGP. Therefore, the prediction performance improves with increasing ss. However, the computational cost also grows with ss, so a trade-off is required. Generally speaking, the suitable ss depends on the complexity of the intrinsic geometry: for a highly non-Euclidean manifold, or for random subsampling, a relatively larger ss should be chosen (a minimal sketch of the two subsampling schemes is given after Table 6). When r=3r=3 and M=200M=200, error rates with different ss are summarized in Table 6. In MNIST, as ss increases, the error rates decrease gradually for all FLGP algorithms. In particular, FLGP with random subsampling improves more than FLGP with k-means subsampling. For example, for m=200m=200, when ss increases from 500500 to 15001500, the error rates of LRFLGP decrease from 21%21\% to 13.7%13.7\% and those of LKFLGP decrease from 9.3%9.3\% to 6.7%6.7\%. The reason is that I500I^{500} with k-means subsampling already represents 𝒳\mathcal{X} well, whereas I500I^{500} with random subsampling does not, so the representative ability of random subsampling improves more markedly with increasing ss.

Table 6: Error rates (%) with different ss in the MNIST example
\toprule
s      500                  1000                 1500
m      100       200        100       200        100       200
\midrule
SRFLGP 23.7(3.7) 18.6(1.5)  18.4(3.0) 13.8(0.9)  17.2(2.8) 12.3(1.4)
SKFLGP 12.4(2.0) 9.2(0.9)   8.9(1.3)  7.4(0.8)   8.9(1.7)  6.4(0.7)
LRFLGP 27.3(4.8) 21.0(1.6)  21.7(3.5) 16.2(1.1)  19.3(2.4) 13.7(1.1)
LKFLGP 13.1(3.2) 9.3(0.9)   9.7(1.4)  7.7(0.5)   9.0(1.4)  6.7(0.7)
\botrule
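As a minimal illustration of the two subsampling schemes compared above, the hypothetical helper below returns the induced points either as k-means cluster centres (via scikit-learn) or as a uniform random subset; it is a sketch, not the FLGP implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def subsample_induced_points(X, s, method="kmeans", seed=0):
    """Hypothetical helper: return s induced points by k-means (cluster centres
    summarise X with small representation error) or by uniform random subsampling."""
    if method == "kmeans":
        km = KMeans(n_clusters=s, n_init=10, random_state=seed).fit(X)
        return km.cluster_centers_
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=s, replace=False)
    return X[idx]
```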

The number of local induced points rr determines the sparsity in FLGP. By definition, as rr decreases, the number of nonzero elements in each row of the kernel matrix KK decreases, so the cross transition matrix AA from XnX^{n} to IsI^{s} becomes sparser; the random walk is restricted to move only among those local induced points (an illustrative sketch of this sparsification is given after Table 7). In most situations, the induced sparsity improves the prediction performance and speeds up the computation. Therefore, we recommend choosing rr as small as possible, for example r=3r=3. When s=1000s=1000 and M=200M=200, error rates with different rr are summarized in Table 7. The influence of rr on error rates is less significant than that of ss. The mechanism of sparsity is complicated: in this example, FLGP obtains the lowest error rates when r=3r=3, but the trend is not clear-cut. However, since the sparsity accelerates the computation without sacrificing performance, r=3r=3 appears to be the best choice.

Table 7: Error rates (%\%) with different rr in the MNIST example
\toprule
r      3                    5                    7
m      100       200        100       200        100       200
\midrule
SRFLGP 18.4(3.0) 13.8(0.9)  19.3(2.2) 14.4(1.3)  18.2(2.6) 14.3(1.4)
SKFLGP 8.9(1.3)  7.4(0.8)   10.2(1.4) 8.4(1.6)   9.7(1.0)  7.8(1.1)
LRFLGP 21.7(3.5) 16.2(1.1)  23.0(2.3) 17.1(1.6)  21.6(2.3) 17.2(1.0)
LKFLGP 9.7(1.4)  7.7(0.5)   13.0(1.5) 9.3(1.1)   12.8(1.7) 9.9(0.8)
\botrule
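The following sketch shows one way to realise the r-local sparsity described above: each point keeps kernel weights only to its r nearest induced points, so the cross kernel matrix is sparse. The construction and names are illustrative assumptions and may differ from the exact FLGP implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_cross_kernel(X, U, r, eps):
    """Illustrative r-local sparsification: each point in X keeps squared
    exponential kernel weights only to its r nearest induced points in U."""
    d2 = np.sum((X[:, None, :] - U[None, :, :]) ** 2, axis=-1)   # (n, s) squared distances
    nn = np.argsort(d2, axis=1)[:, :r]                           # indices of r nearest induced points
    rows = np.repeat(np.arange(X.shape[0]), r)
    cols = nn.ravel()
    vals = np.exp(-d2[rows, cols] / (4.0 * eps))                 # assumed kernel parameterisation
    return csr_matrix((vals, (rows, cols)), shape=(X.shape[0], U.shape[0]))
```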

The number of used eigenpairs MM determines how well the heat kernel is approximated by the graph Laplacian. As MM increases, more eigenpairs are used to construct the covariance matrix. On the one hand, the heat kernel is estimated better by the top MM eigenpairs of the Dirichlet-Laplace operator; on the other hand, the approximation error of the top MM eigenpairs by the graph Laplacian increases at the same time. So the precise role of MM is unclear. Fortunately, Weyl's law implies that the spectrum of the heat kernel decays rapidly, so MM has little influence on the performance. When s=1000s=1000 and r=3r=3, error rates with different MM are summarized in Table 8. Considering the standard deviations, error rates with different MM are indistinguishable in FLGP. For example, in LKFLGP with m=200m=200, the error rates with M=100,200,300M=100,200,300 are respectively 7.5%,7.7%,8.2%7.5\%,7.7\%,8.2\%. The performance is similar across different MM.

Table 8: Error rates (%) with different MM in the MNIST example
\toprule
M      100                  200                  300
m      100       200        100       200        100       200
\midrule
SRFLGP 19.1(3.7) 14.2(1.3)  18.4(3.0) 13.8(0.9)  19.0(2.0) 14.4(1.2)
SKFLGP 10.3(3.0) 7.1(0.4)   8.9(1.3)  7.4(0.8)   9.4(1.0)  7.4(1.0)
LRFLGP 20.9(4.2) 15.8(1.0)  21.7(3.5) 16.2(1.1)  20.6(2.9) 16.6(1.2)
LKFLGP 11.7(3.7) 7.5(0.8)   9.7(1.4)  7.7(0.5)   10.7(1.7) 8.2(0.6)
\botrule

12 Scalable Gaussian process regression in FLGP

Let AA and BB denote generic matrices and let a,ba,b\in\mathbb{N}. We write AabA_{a}^{b} for an a×ba\times b matrix and BaB_{a} for an a×aa\times a matrix. In Gaussian process regression, rewrite the log marginal likelihood as

logp(y|x,Cm,σ2)=12y(Cm+σ2Im)1y12log|Cm+σ2Im|m2log2π.\log p(y|x,C_{m},\sigma^{2})=-\frac{1}{2}y^{\top}(C_{m}+\sigma^{2}I_{m})^{-1}y-\frac{1}{2}\log|C_{m}+\sigma^{2}I_{m}|-\frac{m}{2}\log 2\pi.

Using the spectral approximation of the heat kernel, we have Cm=i=1qexp(λit)vivi=VmqΛqVqmC_{m}=\sum_{i=1}^{q}\exp({-\lambda_{i}t})v_{i}v_{i}^{\top}=V_{m}^{q}\Lambda_{q}V_{q}^{m}, where Vqm=(Vmq)V_{q}^{m}=\left(V_{m}^{q}\right)^{\top} and Λq\Lambda_{q} is a diagonal matrix whose ii-th diagonal element is exp(λit)\exp(-\lambda_{i}t). Then by the matrix inversion lemma A.9 in Rasmussen and Williams (2006), we have

(Cm+σ2Im)1=σ2Imσ2VmqΛq1/2(σ2Iq+Λq1/2VqmVmqΛq1/2)1Λq1/2Vqm.(C_{m}+\sigma^{2}I_{m})^{-1}=\sigma^{-2}I_{m}-\sigma^{-2}V_{m}^{q}\Lambda_{q}^{1/2}\left(\sigma^{2}I_{q}+\Lambda_{q}^{1/2}V_{q}^{m}V_{m}^{q}\Lambda_{q}^{1/2}\right)^{-1}\Lambda_{q}^{1/2}V_{q}^{m}. (13)

Subsequently, define Qq=σ2Iq+Λq1/2VqmVmqΛq1/2Q_{q}=\sigma^{2}I_{q}+\Lambda_{q}^{1/2}V_{q}^{m}V_{m}^{q}\Lambda_{q}^{1/2}. Then the inversion of the m×mm\times m matrix is transformed into the inversion of the q×qq\times q matrix. Denote the Cholesky decomposition of QqQ_{q} by Qq=LqLqQ_{q}=L_{q}L_{q}^{\top}. Then by Equation A.10 in Rasmussen and Williams (2006), the determinant is given as

log|Cm+σ2Im|=(mq)logσ2+2i=1qlogLq,ii,\log|C_{m}+\sigma^{2}I_{m}|=(m-q)\log\sigma^{2}+2\sum_{i=1}^{q}\log L_{q,ii}, (14)

where Lq,iiL_{q,ii} is the ii-th diagonal element of LqL_{q}. Since the complexity of the Cholesky decomposition is 𝒪(q3/6)\mathcal{O}(q^{3}/6) and VqmVmqV_{q}^{m}V_{m}^{q} can be precomputed once, the complexity of evaluating the log marginal likelihood using (13) and (14) is 𝒪(q3+mq)\mathcal{O}(q^{3}+mq). Let Km=Cm+σ2ImK_{m}=C_{m}+\sigma^{2}I_{m}. Moreover, the partial derivatives are

θjlogp(y|x,θ)=12tr((ααKm1)Kmθj),\frac{\partial}{\partial\theta_{j}}\log p(y|x,\theta)=\frac{1}{2}\text{tr}\left((\alpha\alpha^{\top}-K_{m}^{-1})\frac{\partial K_{m}}{\partial\theta_{j}}\right),

where α=Km1y\alpha=K_{m}^{-1}y. In FLGP, we have Kmt=VmqAqVqm\frac{\partial K_{m}}{\partial t}=V_{m}^{q}A_{q}V_{q}^{m} and Kmσ2=Im\frac{\partial K_{m}}{\partial\sigma^{2}}=I_{m}, where AqA_{q} is a diagonal matrix whose ii-th diagonal element is λiexp(λit)-\lambda_{i}\exp{(-\lambda_{i}t)}. Then

tr(ααKmt)=tr(αVmqAqVqmα),tr(ααKmσ2)=tr(αα).\text{tr}\left(\alpha\alpha^{\top}\frac{\partial K_{m}}{\partial t}\right)=\text{tr}(\alpha^{\top}V_{m}^{q}A_{q}V_{q}^{m}\alpha),\quad\text{tr}\left(\alpha\alpha^{\top}\frac{\partial K_{m}}{\partial\sigma^{2}}\right)=\text{tr}(\alpha^{\top}\alpha).

According to (13),

tr(Km1Kmt)=σ2(tr{Aq(VqmVmq)}tr{(Qq1Λq1/2VqmVmq)(AqVqmVmqΛq1/2)}),tr(Km1Kmσ2)=σ2(mtr{Qq1(Λq1/2VqmVmqΛq1/2)}).\begin{split}\text{tr}\left(K_{m}^{-1}\frac{\partial K_{m}}{\partial t}\right)&=\sigma^{-2}\left(\text{tr}\{A_{q}(V_{q}^{m}V_{m}^{q})\}-\text{tr}\{(Q_{q}^{-1}\Lambda_{q}^{1/2}V_{q}^{m}V_{m}^{q})(A_{q}V_{q}^{m}V_{m}^{q}\Lambda_{q}^{1/2})\}\right),\\ \text{tr}\left(K_{m}^{-1}\frac{\partial K_{m}}{\partial\sigma^{2}}\right)&=\sigma^{-2}\left(m-\text{tr}\{Q_{q}^{-1}(\Lambda_{q}^{1/2}V_{q}^{m}V_{m}^{q}\Lambda_{q}^{1/2})\}\right).\\ \end{split}

Then the computational complexity of the derivatives is 𝒪(mq2)\mathcal{O}(mq^{2}) if qmq\leq m. For prediction, let ClmC_{l}^{m} be the covariance matrix between the unlabeled and labeled points; then Clm=VlqΛqVqmC_{l}^{m}=V_{l}^{q}\Lambda_{q}V_{q}^{m}, where the columns of VlqV_{l}^{q} are the corresponding eigenvectors evaluated at the unlabeled points. By the conditional Gaussian distribution, the posterior expectation at the unlabeled points is 𝔼[f|y,C,σ2]=Clm(Cm+σ2Im)1y=VlqΛqVqmα\mathbb{E}[f^{*}|y,C,\sigma^{2}]=C_{l}^{m}(C_{m}+\sigma^{2}I_{m})^{-1}y=V_{l}^{q}\Lambda_{q}V_{q}^{m}\alpha. The complexity is 𝒪(lq)\mathcal{O}(lq) if mlm\leq l. Therefore, the total computational complexity of the domain-adaptive Gaussian process regression is 𝒪(mq2+lq)\mathcal{O}(mq^{2}+lq), which is smaller than the 𝒪(m3+lm)\mathcal{O}(m^{3}+lm) cost of standard Gaussian process regression when q<mq<m.
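The computations in this section can be summarised in a short sketch. The function below is an illustrative implementation (not the authors' code) that assumes the truncated eigenpairs of the graph Laplacian are already available; it evaluates the log marginal likelihood via (13)–(14), the gradients via the trace identities above, and the posterior mean at the unlabeled points.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def low_rank_gp(V_m, V_l, lam, y, t, sigma2):
    """Low-rank GP computations under the spectral heat-kernel approximation.

    V_m : (m, q) eigenvectors at labeled points,  V_l : (l, q) at unlabeled points,
    lam : (q,) graph-Laplacian eigenvalues,       y   : (m,) responses.
    """
    m, q = V_m.shape
    lam_half = np.exp(-0.5 * t * lam)               # diagonal of Lambda_q^{1/2}
    W = V_m * lam_half                              # V_m^q Lambda_q^{1/2}
    G = V_m.T @ V_m                                 # V_q^m V_m^q, reusable across evaluations
    Q = sigma2 * np.eye(q) + lam_half[:, None] * G * lam_half[None, :]   # Q_q
    cQ = cho_factor(Q, lower=True)

    # alpha = (C_m + sigma^2 I)^{-1} y via the matrix inversion lemma (13)
    alpha = (y - W @ cho_solve(cQ, W.T @ y)) / sigma2
    # log determinant via (14)
    logdet = (m - q) * np.log(sigma2) + 2.0 * np.sum(np.log(np.diag(cQ[0])))
    loglik = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * m * np.log(2.0 * np.pi)

    # gradient w.r.t. t: dK_m/dt = V_m^q A_q V_q^m, A_q = diag(-lam * exp(-lam * t))
    a_q = -lam * np.exp(-t * lam)
    b = V_m.T @ alpha
    tr_alpha_t = b @ (a_q * b)                      # tr(alpha alpha^T dK_m/dt)
    T = (lam_half[:, None] * G) @ (a_q[:, None] * G * lam_half[None, :])
    tr_inv_t = (np.sum(a_q * np.diag(G)) - np.trace(cho_solve(cQ, T))) / sigma2
    grad_t = 0.5 * (tr_alpha_t - tr_inv_t)

    # gradient w.r.t. sigma^2: dK_m/dsigma^2 = I_m
    S = lam_half[:, None] * G * lam_half[None, :]
    tr_inv_s = (m - np.trace(cho_solve(cQ, S))) / sigma2
    grad_s2 = 0.5 * (alpha @ alpha - tr_inv_s)

    # posterior mean at unlabeled points: V_l^q Lambda_q V_q^m alpha
    f_mean = (V_l * np.exp(-t * lam)) @ (V_m.T @ alpha)
    return loglik, grad_t, grad_s2, f_mean
```

The sketch keeps every operation at cost 𝒪(mq²) or below, matching the complexity statements above.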