He et al.
Address for correspondence: Jian Kang ([email protected]) and Ying Yang ([email protected])
Scalable Bayesian inference for heat kernel Gaussian processes on manifolds
Abstract
We develop scalable manifold learning methods and theory, motivated by the problem of estimating the manifold of fMRI activation in the Human Connectome Project (HCP). We propose the Fast Graph Laplacian Estimation for Heat Kernel Gaussian Processes (FLGP) in the natural exponential family model. FLGP handles large sample sizes, preserves the intrinsic geometry of the data, and significantly reduces the computational complexity from cubic to nearly linear in the sample size via a novel reduced-rank approximation of the graph Laplacian's transition matrix and a truncated Singular Value Decomposition for eigenpair computation. Our numerical experiments demonstrate FLGP's scalability and improved accuracy for manifold learning from large-scale complex data.
keywords:
Exponential family; Graph Laplacian; Heat kernel; Manifold; Scalable Gaussian process; Subsampling
1 Introduction
In recent years, a variety of data types has emerged in the machine learning community, such as image and network data. These data sets often possess non-Euclidean geometry, contributing to a rapid rise in manifold learning. Let us consider a simple case: a spiral curve in $\mathbb{R}^2$, as shown in Figure 1. The spiral is a one-dimensional submanifold of $\mathbb{R}^2$. As the top left panel of Figure 1 shows, for two points on the curve the Euclidean distance is the length of the line segment connecting them (black dotted line), whereas the geodesic distance is the length of the spiral arc connecting them (black solid line). Consequently, the differences in dimension and distance substantially affect statistical inference. It is well known that in Gaussian process (GP) approaches many covariance kernels are based on the Euclidean distance, such as the squared exponential kernel and the Matérn kernel (Genton, 2001; Rasmussen and Williams, 2006), and that posterior contraction rates depend on the dimension of the data, as highlighted in van der Vaart and van Zanten (2008, 2009).
In this article, we concentrate on the construction of a domain-adaptive kernel and propose a fast graph Laplacian estimation for heat kernel Gaussian processes (FLGP). For better visualization, we plot intrinsic kernels and Euclidean-based kernels on the spiral in Figure 1. The top right panel illustrates that the heat kernel depends on the geodesic distance: its value gradually declines as the intrinsic distance increases. Conversely, the squared exponential kernel in the bottom left panel is based on the Euclidean distance. Notably, the FLGP kernel obtains an accurate estimate of the heat kernel and reflects the intrinsic manifold well. Furthermore, we apply FLGP and a Euclidean-based GP (RBF GP) to make predictions on the spiral. As a result, we obtain an RMSE of 0.343 for FLGP and an RMSE of 2.323 for RBF GP. Without the intrinsic geometry of the spiral, the Euclidean-based GP clearly yields an inferior result compared with the domain-adaptive GP.

Manifold learning is also motivated by the analysis of biomedical imaging data. Traditional analysis of volumetric functional Magnetic Resonance Imaging (fMRI) data, used to study brain activation signals, often assumes that these signals exist within a regular three-dimensional Euclidean space. However, this assumption may not capture the complex nature of brain activation patterns. The human brain's cortical surface, where these signals are mapped, is not a simple three-dimensional structure but a complex, folded surface. This unique topology challenges standard Euclidean-based analysis methods. Manifold learning methods fill this gap by offering a more sophisticated approach to modeling brain imaging data: they acknowledge and leverage the intricate geometry of the brain's surface, enabling a more accurate understanding of the brain's functional architecture. This approach has the potential to lead to discoveries in neuroscience and medicine. Thus, whether for understanding complex brain patterns or for simpler geometric structures such as spheres, utilizing the intrinsic geometry of data is crucial for optimal performance.
We focus on the prediction problem with GPs in this paper. Suppose we have observations consisting of labelled data $\{(x_i, y_i)\}_{i=1}^{m}$ and unlabelled data $\{x_i\}_{i=m+1}^{n}$. Define the point cloud as $X = \{x_1, \dots, x_n\}$. Assume that the domain space $\mathcal{M}$ containing the inputs is unknown. We are interested in cases where $\mathcal{M}$ is a non-Euclidean manifold, so that the geodesic distance is completely different from the Euclidean distance and it is inappropriate to treat $\mathcal{M}$ as a Euclidean space. Our objective is to use the point cloud $X$ to estimate the unknown geometry of $\mathcal{M}$ and to devise a prediction algorithm that respects the intrinsic geometry of $\mathcal{M}$. In the context of GPs, this involves constructing a domain-adaptive kernel rather than relying on a kernel based solely on the extrinsic distance. A straightforward approach is to substitute the geodesic distance for the Euclidean distance. For example, mimicking the squared exponential kernel, one might consider, for any $x, x' \in \mathcal{M}$, the kernel function $k(x, x') = \exp\{-d_{\mathcal{M}}(x, x')^2 / (2\ell^2)\}$, where $d_{\mathcal{M}}$ denotes the geodesic distance and $\ell > 0$ is a bandwidth parameter. However, such kernels are generally not positive semi-definite and hence do not qualify as covariance functions (Feragen et al., 2015). Our primary goal is to develop a valid domain-adaptive kernel for GPs on Riemannian manifolds.
Let $\mathcal{M}$ be a $d$-dimensional submanifold of $\mathbb{R}^D$. When $\mathcal{M}$ is not naturally a subset of a Euclidean space, such as Kendall shape spaces (Kendall et al., 2009; Kim et al., 2021) and Grassmann manifolds (Lee, 2012), we can embed it into a high-dimensional Euclidean space by the Whitney Embedding Theorem; see the extrinsic GPs in Lin et al. (2019). Yang and Dunson (2016) use a simple squared exponential Gaussian process prior with a random scaling parameter that adapts to the intrinsic dimension, and achieve a posterior contraction rate depending on the intrinsic dimension $d$. Castillo et al. (2014) construct a Gaussian process prior with the rescaled heat kernel on $\mathcal{M}$. They show that the heat kernel characterizes the geometry of $\mathcal{M}$ through heat diffusion, and the posterior distribution yields an optimal contraction rate adaptive to the manifold. However, $\mathcal{M}$ is unknown in many cases, and the heat kernel is analytically intractable except in some trivial cases such as Euclidean spaces or spheres. Thus, we have to compute the heat kernel numerically in practice.
Some researchers utilize the connection between the heat kernel and the transition density of Brownian motion (BM) on manifolds to estimate the heat kernel. The primary challenge is to simulate the BM sample paths on $\mathcal{M}$. Niu et al. (2019) do the simulation by solving systems of stochastic differential equations based on the coordinate chart, and Ye et al. (2024) do it through the exponential map. However, even if $\mathcal{M}$ is given, a global parametrization or an explicit exponential map does not exist in general. In addition, $\mathcal{M}$ is unknown in many practical problems. In this case, Niu et al. (2023) employ probabilistic latent variable models to estimate a probabilistic parametrization and simulate the BM trajectories on the unknown manifold using the probabilistic metric. The sampling procedure for the trajectories is computationally expensive, as one has to simulate Brownian motion many times originating from each input.
Furthermore, the spectral property suggests that the heat kernel can also be constructed from the eigenpairs of the Beltrami-Laplace operator on the manifold. Dunson et al. (2021a) use the Gaussian kernelized graph Laplacian on the point cloud to estimate the Beltrami-Laplace operator and propose a graph Laplacian Gaussian process (GLGP) for regression. In their algorithm, the heat kernel is estimated by the sum of finitely many eigenpairs of the graph Laplacian. Similarly, Borovitskiy et al. (2021) construct the graph Matérn kernel relying on the eigenpairs of the normalized graph Laplacian of a predefined graph. As the graph Laplacian is an $n \times n$ matrix, the time complexity of its eigendecomposition is $O(n^3)$. This is unaffordable for large-scale data.
Apart from the heat kernel, there are other methods that respect the geometry. If $\mathcal{M}$ can be embedded into a lower-dimensional Euclidean space while preserving the intrinsic distance (Roweis and Saul, 2000; Belkin and Niyogi, 2001; Balasubramanian and Schwartz, 2002; Weinberger and Saul, 2006), then we can carry out inference on the lower-dimensional representation to improve performance. However, it is very challenging to find such an embedding when $\mathcal{M}$ is unknown and complicated.
In this article, we propose a fast graph Laplacian approach to estimate heat kernel Gaussian processes efficiently and accurately. Our novel insight is to construct a random walk on the point cloud $X$ through a transition decomposition strategy. Initially, we subsample $s$ induced points from the point cloud $X$, denoted by $U$. The one-step movement from $X$ to $X$ is then decomposed into two steps: a jump from $X$ to $U$ and a jump from $U$ back to $X$. This results in a reduced-rank approximation of the stochastic matrix. A truncated SVD is employed to calculate the eigenpairs instead of a full eigendecomposition, which dramatically speeds up the computation. Finally, the heat kernel covariance matrix is constructed from the eigenpairs via the spectral property. The time complexity decreases remarkably from cubic to nearly linear in the sample size. The algorithm is illustrated on simulated and real data, and it outperforms competing algorithms in all cases.
Generally speaking, FLGP plays a role in any scenario where the intrinsic geometry of $\mathcal{M}$ differs from the extrinsic geometry of the ambient Euclidean space. It gives a fast estimation of heat kernels from the point cloud without prior knowledge of $\mathcal{M}$, so that domain-adaptive GPs become accessible for large-scale data. We have implemented the algorithms in an R package using Rcpp, which provides a convenient interface in R. In addition, we develop a general theoretical framework for FLGP in the natural exponential family (NEF). This distribution family covers a wide range of models, including normal regression and classification. The posterior contraction rates of heat kernel GPs in NEF depend on the intrinsic dimension $d$ rather than the ambient dimension $D$ and are much sharper than those of extrinsic kernel GPs. The approximation error of heat kernels in FLGP is estimated by the spectral convergence property of the graph Laplacian and a perturbation analysis. Finally, the predictive error is bounded based on the covariance approximation error.
The paper is organized as follows. Section 2 gives a preliminary about the NEF model and heat kernels on manifolds. Section 3 provides a detailed description of the FLGP methodology. The approach consists of subsampling induced points, constructing random walks, solving truncated SVD, and computing the heat kernel. Section 4 presents the main theoretical results. Section 5 illustrates the performance in various examples, and Section 6 discusses the future work.
The FLGP package is available at https://github.com/junhuihe2000/FLGP.
2 Preliminary
2.1 Natural exponential family
Suppose the real-valued random variable $Y$ follows the law of a natural exponential family (NEF) with natural parameter $\eta$, denoted as $Y \sim \mathrm{NEF}(\eta)$. That is, with respect to the Lebesgue measure or the counting measure, the density of $Y$ is
p_\eta(y) \;=\; h(y)\, \exp\{\eta\, y - \psi(\eta)\}, \qquad (1)
where $\eta$ is called the natural parameter, $h(y)$ does not depend on $\eta$, and $y$ itself is the sufficient statistic. The natural exponential family includes many of the most common models, such as normal regression and binary classification, and it emerges in many fields of statistics and machine learning (Brown et al., 2010; Bradley et al., 2020; Schweinberger and Stewart, 2020). The expectation of $Y$ is given by $\mathbb{E}[Y] = \psi'(\eta)$, which is the quantity of interest in this article.
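For instance, in the notation of (1), the Poisson model with rate $\lambda$ and natural parameter $\eta = \log\lambda$ has density

p_\eta(y) \;=\; \frac{1}{y!}\, \exp\{\eta\, y - e^{\eta}\}, \qquad y \in \{0, 1, 2, \dots\},

so that $\psi(\eta) = e^{\eta}$ and $\mathbb{E}[Y] = \psi'(\eta) = e^{\eta} = \lambda$.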
2.2 Heat kernels on manifolds
Let $\mathcal{M}$ be a $d$-dimensional Riemannian manifold and let $d_{\mathcal{M}}$ be the geodesic distance on $\mathcal{M}$. Assume $\mu$ is a measure on $\mathcal{M}$ with a smooth positive density function with respect to the volume form $V$. Let $\Delta_{\mu}$ be the weighted Beltrami-Laplace operator on $\mathcal{M}$ and define the Dirichlet-Laplace operator by $\mathcal{L} = -\Delta_{\mu}$. Then $\mathcal{L}$ is a non-negative definite self-adjoint operator on $L^2(\mathcal{M}, \mu)$. The heat kernel $p_t(x, y)$ is a fundamental solution of the heat equation on $\mathcal{M}$, satisfying the following conditions,

\frac{\partial}{\partial t}\, p_t(x, y) \;=\; -\mathcal{L}_x\, p_t(x, y), \qquad \lim_{t \to 0^+} p_t(x, \cdot) \;=\; \delta_x(\cdot),

where $\delta_x$ is the Dirac delta function at $x$ and $t > 0$ is the diffusion time (Grigor'yan, 2009). The heat kernel describes the heat diffusion from $x$ to $y$ on the manifold over time $t$.
Assume that $\mathcal{M}$ is compact. The spectral theorem states that the spectrum of $\mathcal{L}$ consists of an increasing non-negative sequence $\{\lambda_i\}_{i \ge 0}$ such that $\lambda_0 = 0$ and $\lambda_i \to \infty$. There exists an orthonormal basis $\{e_i\}_{i \ge 0}$ of $L^2(\mathcal{M}, \mu)$ such that, for each $i$, $e_i$ is an eigenfunction of $\mathcal{L}$ with eigenvalue $\lambda_i$. Subsequently, from Grigor'yan (2009), for $t > 0$ and $x, y \in \mathcal{M}$, the heat kernel can be expanded as
p_t(x, y) \;=\; \sum_{i=0}^{\infty} e^{-\lambda_i t}\, e_i(x)\, e_i(y). \qquad (2)
If the eigenpairs $(\lambda_i, e_i)$ of $\mathcal{L}$ can be estimated, then we can use (2) to approximate the heat kernel. Our goal is therefore to construct a fast numerical estimate of the spectrum of $\mathcal{L}$. We refer readers to Grigor'yan (2006, 2009) for further discussion of heat kernels on manifolds.
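The Euclidean case illustrates why closed forms are exceptional. For $\mathcal{M} = \mathbb{R}^d$ with the Lebesgue measure, the heat kernel is the Gaussian

p_t(x, y) \;=\; (4\pi t)^{-d/2} \exp\Big(-\frac{\|x - y\|^2}{4t}\Big),

which is, up to the parameterization of the diffusion time, a squared exponential kernel. On a general compact manifold no such closed form is available, which is why the expansion (2) is truncated and evaluated numerically in the next section.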
3 Methodology
Assume that the observations consist of the labeled observations $\{(x_i, y_i)\}_{i=1}^{m}$ and the unlabeled observations $\{x_i\}_{i=m+1}^{n}$, which are independent and identically distributed (i.i.d.) samples drawn from the population. Denote the labeled inputs and outputs by $\boldsymbol{x} = (x_1, \dots, x_m)$ and $\boldsymbol{y} = (y_1, \dots, y_m)$. Suppose the input variable is endowed with a distribution on the domain, assumed absolutely continuous with marginal density $g$. For any input $x$, suppose the conditional distribution of $y$ given $x$ is an NEF distribution as in (1) whose natural parameter depends on $x$. Consider the mean function $\mu(x) = \mathbb{E}[y \mid x]$. In NEF, the state space of the mean function varies across concrete likelihoods and is not identical to $\mathbb{R}$ in general.
Through a well-chosen link function $h$, we assume that $\mu(x) = h(f(x))$ with a latent function $f$. To induce a prior on $\mu$ through $f$, we assign the heat kernel GP with adaptive diffusion time $T$ as the prior for $f$. Specifically, we assume
f \mid T = t \;\sim\; \mathrm{GP}\big(0,\, p_t(\cdot, \cdot)\big), \qquad T \sim \pi, \qquad (3)
where $p_t$ is the heat kernel on $\mathcal{M}$ defined in Section 2.2, and $T$ serves as the random scaling parameter with a prior distribution $\pi$ on $(0, \infty)$.
Let $\boldsymbol{f}$ be the $m$-dimensional vector $(f(x_1), \dots, f(x_m))^{\top}$. Then the posterior expectation of the prediction function at a test input $x^*$ is given by

\mathbb{E}\big[\mu(x^*) \mid \boldsymbol{y}\big] \;=\; \int h\big(f(x^*)\big)\, p\big(f(x^*), \boldsymbol{f} \mid \boldsymbol{y}\big)\, \mathrm{d}f(x^*)\, \mathrm{d}\boldsymbol{f},

where $p(f(x^*), \boldsymbol{f} \mid \boldsymbol{y})$ is the posterior density over the latent variables $f(x^*)$ and $\boldsymbol{f}$.
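In the conjugate special case of normal regression with the identity link and Gaussian noise of variance $\sigma^2$ (the notation below is ours, given for illustration), this posterior expectation has the familiar closed form

\mathbb{E}\big[f(x^*) \mid \boldsymbol{y}\big] \;=\; \boldsymbol{h}_*^{\top}\big(\boldsymbol{H} + \sigma^2 I_m\big)^{-1}\boldsymbol{y},

where $\boldsymbol{H}$ is the $m \times m$ heat kernel covariance matrix over the labeled inputs and $\boldsymbol{h}_*$ is the $m$-vector of heat kernel covariances between $x^*$ and the labeled inputs. For non-conjugate NEF models the integral above has no closed form and must be approximated numerically.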
3.1 Fast graph Laplacian heat kernel estimation
According to its desirable properties, the heat kernel Gaussian process is an excellent choice for a domain-adaptive GP. However, the heat kernel is analytically intractable except in trivial cases, even when $\mathcal{M}$ is known. In this section, we propose a fast graph Laplacian approach to estimate the heat kernel and construct the heat kernel GP for large-scale data when $\mathcal{M}$ is unknown. Given the $m$ labeled observations and the $n - m$ unlabeled observations, the objective is to approximate the heat kernel covariance matrix over the point cloud $X$ at a low time cost. We present the FLGP algorithm in five steps.
Step 1, subsample $s$ induced points from the point cloud $X$, denoted as $U = \{u_1, \dots, u_s\}$. The induced point set is used to simplify and represent the domain geometry. In this article, the subsampling method is k-means clustering or random sampling. Step 2, choose a suitable kernel function $k(\cdot, \cdot)$, such as the squared exponential kernel with bandwidth $\epsilon$. Construct the cross kernel matrix $C$ between $X$ and $U$ with $C_{ij} = k(x_i, u_j)$ for $1 \le i \le n$ and $1 \le j \le s$; then $C$ is an $n \times s$ matrix. Step 3, obtain the cross similarity matrix $W$ by normalizing $C$. The normalization differs between the two subsampling schemes: in k-means clustering, the weights additionally account for the number of points assigned to each cluster membership. Construct the cross transition matrix $A = D^{-1} W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{s} W_{ij}$. Notably, $A$ can serve as the probability transition matrix of a random walk from $X$ to $U$. Step 4, let $\Lambda$ be a diagonal matrix with $\Lambda_{jj} = \sum_{i=1}^{n} A_{ij}$ and define the subsampling graph Laplacian as
L \;=\; I_n - A\, \Lambda^{-1} A^{\top}. \qquad (5)
Perform the truncated SVD of $A\Lambda^{-1/2}$ to obtain the leading singular values $\sigma_1 \ge \cdots \ge \sigma_K$ and the corresponding left singular vectors $v_1, \dots, v_K$. Then the eigenpairs of $L$ are given by $(1 - \sigma_i^2, v_i)$. According to Proposition 1, the eigenvalues $\mu_i = 1 - \sigma_i^2$ are in ascending order, i.e., $\mu_1 \le \mu_2 \le \cdots \le \mu_K$. Step 5, take these $K$ smallest eigenvalues and corresponding eigenvectors of $L$ and estimate the heat kernel based on (2) as
\widehat{H}_t \;=\; \sum_{i=1}^{K} e^{-\mu_i t}\, v_i v_i^{\top}, \qquad (6)
where $t > 0$ is the diffusion time.
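To make the five steps concrete, the following R sketch assembles a dense (non-sparse) version of the construction with k-means subsampling and a squared exponential cross kernel. The plain row normalization in Step 3, the factor $Q = A\Lambda^{-1/2}$ whose truncated SVD yields the eigenpairs of (5), the heuristic bandwidth, and the omission of any eigenvector normalization constants are simplifying assumptions for illustration; they are not the exact definitions used in the FLGP package.

flgp_heat_kernel_sketch <- function(X, s = 100, K = 20, t = 1, eps = NULL) {
  n <- nrow(X)
  # Step 1: subsample s induced points by k-means clustering.
  U <- kmeans(X, centers = s, iter.max = 20)$centers
  # Step 2: n x s cross kernel matrix between X and U (squared exponential).
  d2 <- pmax(outer(rowSums(X^2), rowSums(U^2), "+") - 2 * X %*% t(U), 0)
  if (is.null(eps)) eps <- sqrt(median(d2))          # heuristic bandwidth (assumption)
  C <- exp(-d2 / (2 * eps^2))
  # Step 3: row-stochastic cross transition matrix A from X to U.
  A <- C / rowSums(C)
  # Step 4: the two-step transition matrix A %*% diag(1/lambda) %*% t(A) factors as Q %*% t(Q)
  # with Q = A %*% diag(1/sqrt(lambda)), so its eigenpairs come from the truncated SVD of Q.
  lambda <- colSums(A)
  Q <- A %*% diag(1 / sqrt(lambda))
  sv <- svd(Q, nu = K, nv = 0)                       # assumes K <= s; RSpectra::svds scales better
  mu <- 1 - sv$d[1:K]^2                              # eigenvalues of L in (5), in ascending order
  V <- sv$u                                          # corresponding eigenvectors over the point cloud
  # Step 5: spectral estimate of the heat kernel covariance over X, as in (6).
  H <- V %*% diag(exp(-mu * t), nrow = K) %*% t(V)
  list(covariance = H, eigenvalues = mu, eigenvectors = V)
}

A call such as flgp_heat_kernel_sketch(X, s = 200, K = 100, t = 0.5)$covariance returns an $n \times n$ covariance matrix over the point cloud; its block over the labeled inputs can then serve as the GP prior covariance.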
The whole procedure of FLGP is shown in Figure 2. Finally, we can make inference in GPs with the estimated covariance matrix $\widehat{H}_t$. Owing to the construction, we have the following proposition for the spectrum of $L$. The proof is given in Appendix A.
Proposition 1.
The eigenvalues of the graph Laplacian $L$ in (5) are real-valued and lie in $[0, 1]$. In particular, $0$ is the smallest eigenvalue of $L$.
The key novelty in FLGP is the construction of a reduced-rank approximation of the transition matrix. The eigendecomposition of an $n \times n$ stochastic matrix is then replaced by the truncated SVD of an $n \times s$ matrix. We explain the idea behind the approximation from the random walk viewpoint. According to graph spectral theory, there is a close relationship between the heat kernel on the manifold and random walks on the point cloud $X$; the corresponding one-step stochastic transition matrix on $X$ is an $n \times n$ matrix. Based on the induced points $U$, we can decompose the direct random walk into two steps. For $x_i, x_j \in X$, the movement from $x_i$ to $x_j$ is decomposed into a jump from $x_i$ to some $u_k \in U$ and a jump from $u_k$ to $x_j$. Then the transition probability is

\Pr(x_i \to x_j) \;=\; \sum_{k=1}^{s} \Pr(x_i \to u_k)\, \Pr(u_k \to x_j).
If we define the cross transition matrix $A$ from $X$ to $U$ as in Step 3 and let $B = \Lambda^{-1} A^{\top}$ be the transition matrix from $U$ to $X$, then the two-step transition probability matrix is given by $AB = A\Lambda^{-1}A^{\top}$. This motivates the definition of the graph Laplacian in (5). In contrast, Dunson et al. (2021a) do not utilize the transition decomposition strategy. They construct the random walk with one-step movements. Consequently, they end up with a full-rank $n \times n$ transition matrix and have to perform an eigendecomposition of this matrix.
The diagram inside the dashed rectangle in Figure 2 illustrates the transition decomposition graphically. As $U$ contains only $s$ induced points, the rank of $AB$ is at most $s$. On the contrary, the rank of the stochastic matrix of the direct random walk on $X$ is $n$, much larger than $s$. Thus, we obtain a reduced-rank approximation of the transition probability matrix. Subsequently, the complexity of the truncated SVD is of order $ns^2$ instead of the $n^3$ required by a full eigendecomposition. This accelerates the computation for large-scale data. Sparsity will be introduced into the transition matrix to reduce the cost further.
3.2 Sparsity and likelihood-free kernels
Let $r$ be a positive integer. For each $x_i$, let $N_r(x_i)$ be the indices of the $r$ closest induced points to $x_i$. The local induced point set of $x_i$ is defined as $U_r(x_i) = \{u_j : j \in N_r(x_i)\}$. Denote the sparse kernel matrix by $C^{(r)}$: for $1 \le i \le n$, if $j \in N_r(x_i)$, then $C^{(r)}_{ij} = k(x_i, u_j)$; otherwise $C^{(r)}_{ij} = 0$. Thus $C^{(r)}$ is an $n \times s$ sparse matrix with $nr$ non-zero elements. Consequently, the cross similarity matrix and the cross transition matrix are also sparse matrices with $nr$ non-zero elements. This saves memory, and the time complexity of the truncated SVD decreases accordingly. The introduced sparsity accelerates the computation when $r \ll s$. In addition, owing to the locally Euclidean property (Lee, 2012), the Euclidean distance is almost equivalent to the intrinsic distance in a small neighborhood. Sparsity therefore also enhances prediction performance when the global domain geometry is highly non-Euclidean.
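As a small illustration of this sparsification (not the package's implementation), the dense cross kernel matrix can be restricted to the $r$ nearest induced points in each row and stored in sparse form; the helper below assumes the dense matrix $C$ from Step 2 as input.

library(Matrix)   # for sparseMatrix()

sparsify_cross_kernel <- function(C, r = 3) {
  n <- nrow(C); s <- ncol(C)
  # indices of the r largest kernel values per row, i.e. the r nearest induced points
  idx <- t(apply(C, 1, function(row) order(row, decreasing = TRUE)[1:r]))
  i <- rep(seq_len(n), each = r)
  j <- as.vector(t(idx))
  sparseMatrix(i = i, j = j, x = C[cbind(i, j)], dims = c(n, s))
}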
For the squared exponential kernel, selecting the bandwidth parameter involves evaluating the marginal likelihood, which incurs redundant floating-point computation and slows down the algorithm. Moreover, the heat kernel depends on the geometry of $\mathcal{M}$, while the marginal likelihood refers only to the labeled observations. As the point cloud $X$ contains more information about $\mathcal{M}$ than the labeled data alone, we want to use the whole point cloud to construct the kernel. Therefore, it is advantageous to substitute a likelihood-free alternative for the squared exponential kernel. We adopt the local anchor embedding (LAE) kernel of Liu et al. (2010) in this work. The kernel matrix is computed by minimizing the squared loss of a convex reconstruction problem, i.e., for each $x_i$,

\min_{w_i}\; \Big\| x_i - \sum_{j \in N_r(x_i)} w_{ij}\, u_j \Big\|_2^2 \quad \text{subject to}\quad w_{ij} \ge 0, \;\; \sum_{j \in N_r(x_i)} w_{ij} = 1.
The optimal solution gives the coefficients of the projection of $x_i$ onto the convex hull spanned by its local induced points $U_r(x_i)$. The remaining elements of the kernel matrix are set to zero, so it is again an $n \times s$ sparse matrix with $r$ non-zero elements in each row. Compared with the squared exponential kernel, the LAE method constructs a nonparametric kernel estimate using only $X$ and $U$, which makes it more flexible and robust.
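A minimal sketch of one LAE weight computation is shown below, using plain projected gradient descent with a Euclidean projection onto the probability simplex; Liu et al. (2010) use an accelerated solver, and the step size and iteration count here are illustrative assumptions.

# Euclidean projection of a vector onto the probability simplex.
proj_simplex <- function(v) {
  u <- sort(v, decreasing = TRUE)
  css <- cumsum(u)
  rho <- max(which(u + (1 - css) / seq_along(u) > 0))
  pmax(v - (css[rho] - 1) / rho, 0)
}

# LAE weights of a point x over its r nearest induced points (rows of Ur), i.e. the
# coefficients of the projection of x onto the convex hull spanned by those points.
lae_weights <- function(x, Ur, iters = 200) {
  r <- nrow(Ur)
  G <- Ur %*% t(Ur)                          # r x r Gram matrix
  b <- drop(Ur %*% x)
  step <- 1 / (2 * max(eigen(G, only.values = TRUE)$values))
  w <- rep(1 / r, r)
  for (k in seq_len(iters)) {
    grad <- 2 * (drop(G %*% w) - b)          # gradient of || x - t(Ur) %*% w ||^2
    w <- proj_simplex(w - step * grad)
  }
  w
}

The $i$-th row of the sparse LAE kernel matrix then holds these weights in the columns indexed by $N_r(x_i)$ and zeros elsewhere.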
3.3 Algorithms and Complexity analysis
Given the covariance matrix $\widehat{H}_t$, we calculate the marginal likelihood of the GP under the concrete model as

p(\boldsymbol{y} \mid \theta, t) \;=\; \int \prod_{i=1}^{m} p\big(y_i \mid f(x_i), \theta\big)\, \mathcal{N}\big(\boldsymbol{f} \mid \boldsymbol{0}, \widehat{H}_t\big)\, \mathrm{d}\boldsymbol{f},

where $\widehat{H}_t$ is restricted to the labeled inputs and $\theta$ is a potential nuisance parameter. The hyperparameters in FLGP consist of predefined parameters, including the numbers of induced points, local induced points, and eigenpairs, and optimized parameters, including the nuisance parameter $\theta$ (if any), the kernel bandwidth, and the diffusion time $t$. The optimized parameters are estimated by maximizing the marginal likelihood function. The FLGP algorithm is listed as Algorithm 1. Furthermore, the following proposition demonstrates the time efficiency of FLGP.
Proposition 2.
The time complexity in the heat kernel approximation by FLGP is for random subsampling or for k-means clustering, where is the number of all observations, is the number of induced points, is the number of labeled observations, is the ambient dimension, is the number of eigenpairs used, and is the iteration number in k-means.
Remark 1.
Suppose $s \ll n$ and that $m$, $D$, $K$, and the number of k-means iterations stay bounded. Except for the marginal likelihood, the time complexity of FLGP is then almost linear in $n$. Therefore, our FLGP algorithm allows scalable applications of heat kernel GPs. Conversely, the complexity of GLGP is $O(n^3)$, so for large $n$ our approach is much faster.
4 Theory
The theory consists of three main parts. First, we estimate the error between the heat kernel and the numerical approximation in (6). Then, we obtain the predictive error bound in GPs. Finally, we establish the posterior contraction rate of heat kernel GPs in NEF on manifolds.
Suppose $\mathcal{M}$ is the domain manifold. Denote the geodesic distance on $\mathcal{M}$ by $d_{\mathcal{M}}$. Let $V$ be the volume form on $\mathcal{M}$ and let $\mu$ be a probability measure on $\mathcal{M}$. For $x \in \mathcal{M}$ and $\rho > 0$, let $B(x, \rho) = \{y \in \mathcal{M} : d_{\mathcal{M}}(x, y) \le \rho\}$ denote the geodesic ball of radius $\rho$ centered at $x$. In this article, we focus on the following class of manifolds.
Assumption 1.
Suppose $\mathcal{M}$ is a compact manifold without boundary. Let $g$ be the density function of $\mu$ with respect to $V$. Assume there exist constants $0 < g_{\min} \le g_{\max} < \infty$ such that $g_{\min} \le g(x) \le g_{\max}$ for all $x \in \mathcal{M}$. Suppose we have a Gaussian estimate for the heat kernel on $\mathcal{M}$; that is, there exist constants $C_1, C_2, C_3, C_4 > 0$ such that, for all $t > 0$ and $x, y \in \mathcal{M}$,

\frac{C_1}{\mu\big(B(x, \sqrt{t})\big)} \exp\Big(-\frac{d_{\mathcal{M}}(x, y)^2}{C_2\, t}\Big) \;\le\; p_t(x, y) \;\le\; \frac{C_3}{\mu\big(B(x, \sqrt{t})\big)} \exp\Big(-\frac{d_{\mathcal{M}}(x, y)^2}{C_4\, t}\Big),

where $B(x, \sqrt{t})$ is the geodesic ball of radius $\sqrt{t}$ centered at $x$.
This assumption guarantees that the heat kernel on $\mathcal{M}$ satisfies some regularity conditions. A similar assumption is made in Dunson et al. (2021a), and Assumption 1 already covers many usual settings. Let us take a compact manifold equipped with the normalized volume form as an example.
Example 1.
Suppose $\mathcal{M}$ is a compact manifold without boundary, and $\mu$ is the normalized volume form, i.e., $\mu = V / V(\mathcal{M})$. We then have $g \equiv 1 / V(\mathcal{M})$, where $V(\mathcal{M})$ is the volume of $\mathcal{M}$. Grigor'yan (2009) indicates that the Gaussian estimate holds for the heat kernel in this case.
4.1 One-step random walk graph Laplacian
As it is usually intractable to calculate the exact heat kernel, we utilize the normalized random walk graph Laplacian to estimate the heat kernel numerically. This approach is also used by Dunson et al. (2021a). The main difference is that we make approximate inference with exact priors, while they make exact inference with approximate priors. That is, we treat the prior as the exact heat kernel GP and prove the oracle contraction rate.
We first estimate the approximate error for the so-called one-step random walk graph Laplacian and then explore the impact of subsampling later. Let be the point cloud. Assume that is the squared exponential kernel with bandwidth . Construct the kernel matrix on with . For , define . Calculate the similarity matrix by for . Let be a diagonal matrix with . Compute the stochastic matrix by . The one-step random walk graph Laplacian is defined as .
Furthermore, the kernel normalized graph Laplacian is . Denote the eigenpairs of by with . For , construct the heat kernel estimation as in (6) by
Let be the exact heat kernel matrix over . Let be the Dirichlet-Laplace operator on . Denote the eigenpairs of by with . A technical condition is introduced on the spectrum of .
Assumption 2 (Dunson et al. (2021b)).
Assume that the eigenvalues of are simple, i.e., no multiplicity. For , denote . Suppose is small enough such that
where are positive constants depending on . Suppose is sufficiently large so that .
The above assumption and the squared exponential kernel are imposed to obtain the spectral convergence of the normalized one-step graph Laplacian. Other cases are feasible, as long as the spectral convergence between the graph Laplacian and $\mathcal{L}$ can be established, for example with compactly supported kernels (García Trillos et al., 2020). Assumption 2 suggests that as the number of eigenpairs increases and the bandwidth decreases, the required sample size increases exponentially. This implies that the current approach is probably inappropriate for processing high-dimensional data. The error bound between the heat kernel and its estimate is derived from the spectral property in the following theorem.
Theorem 1.
Remark 2.
Based on (2), the heat kernel is split into the sum of the first $K$ eigenpairs and the sum of the remaining eigenpairs. They are estimated separately by combining the spectral convergence property of the graph Laplacian and the fast-decaying spectrum of the heat kernel. To limit the maximum distance between the heat kernel and its estimate, we first choose $K$ large enough to bound the latter term and subsequently choose the bandwidth sufficiently small to bound the former term on the right-hand side of (7). Furthermore, as $n$ increases, the point cloud becomes denser in the domain. Thus, according to Theorem 1 and Remark 2, the approximation error of the heat kernel will decrease. Therefore, unlabeled observations contribute to an accurate estimation of the heat kernel by reflecting the geometry of $\mathcal{M}$.
4.2 Two-step random walk graph Laplacian with subsampling
In FLGP, we construct a two-step random walk instead of the standard one-step random walk. The errors from subsampling and transition decomposition will be estimated by the perturbation analysis. A similar technique is used in the fast spectral clustering algorithm of Yan et al. (2009). Assume that the subsampling method is k-means clustering, the kernel is the squared exponential kernel and the kernel matrix is not sparse. Suppose for , the number of points in the membership of is . Define the extended induced point set as . Denote the extended cross kernel matrix by . Then is an matrix. Let and . Construct the extended cross similarity matrix by
Let be a diagonal matrix with . Define the extended stochastic matrix from to by . Let be a diagonal matrix with . The extended two-step graph Laplacian is defined as .
Lemma 1.
If we write as , where is a column vector for , then .
Lemma 2.
The two graph Laplacians are the same in fact, i.e., .
Without loss of generality, assume that for , belong to the same membership with the clustering center in coincidence. Consider the one-step stochastic matrix on . Let be a diagonal matrix with . Subsequently, the exact two-step graph Laplacian is defined as .
Assumption 3.
Let and . Suppose there exists a constant , such that, for , .
The assumption is used to constrain the spectral norm of the stochastic matrices. It imposes a prior condition on the distribution of the point cloud, namely that each induced point has sufficiently many points of $X$ in its neighborhood. Now we can estimate the spectral norm error between the two-step graph Laplacian and its exact counterpart.
Theorem 2.
Under Assumption 3, for any , if , then there exist constants and such that , where and depend on and ; and is the spectral norm.
Corollary 1.
If , then .
The theorem follows from the measurement-error analysis of graph Laplacians; for more related results, see Karoui and Wu (2016). According to Theorem 2, the distortion bound of the k-means clustering yields the spectral norm bound of the graph Laplacian in FLGP. This spectral result further implies the error estimate of the covariance matrix. Theorem 2 also suggests that for points located far away from the induced points, the corresponding FLGP predictions may be poor. This agrees with intuition.
4.3 Predictive errors in Gaussian processes
Let and . For any test point , let and . Then the posterior expectation of is given by . Let and be two covariance matrices of over . For , rewrite as , where , then with
Lemma 3.
Assume that . Then there exist constants such that, for and ,
where is the smallest eigenvalue of .
Lemma 4.
Suppose and . Then we have
where .
Let be the exact heat kernel matrix on and let be the approximate covariance matrix on .
Theorem 3.
Assume . Then there exists such that,
where depends on and .
Remark 3.
In the normal regression model, as the GP is the conjugate prior and , Lemma 3 implies that is estimated by the same bound for , after replacing with . However, for other non-conjugate models, the case is probably very complicated.
4.4 Posterior contraction rates
As multiple-parameter NEF models can usually be implemented as a composition of several one-parameter NEF models, we focus on the one-parameter NEF case in this article. We follow the general framework in Rousseau (2016) and Ghosal and van der Vaart (2017) to derive posterior contraction rates of GPs in NEF.
Let be i.i.d. observations from , where is a -dimensional submanifold of and . Consider the product measure on , where is the Lebesgue measure or the counting measure on . Assume that and , for under . Suppose with a latent function and a link function . For , we have and . Hence, , i.e., is non-decreasing. If , then is invertible. Denote as . We have .
Assumption 4.
Assume and . Suppose and . Assume that the derivatives and are uniformly bounded.
By the chain rule, the growth of the mean function is controlled by the derivatives of the link. If the variance is uniformly bounded away from zero, the above assumption suffices to control the growth rate of the mean function, which is reasonable in the prediction problem. If the mean function varies rapidly, then its estimate will be very sensitive to the corresponding prediction of the latent function; even if the latent function is estimated accurately, the errors of the mean function are still likely to diverge. So it is almost impossible to derive posterior contraction rates without any condition on the link. Assumption 4 holds for a wide range of models; let us take normal regression and classification as examples.
Example 2.
Suppose with known, we have
Then and . So and are bounded.
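For reference, a standard way to write this model in the form (1), assuming $\sigma^2$ is known and using our parameterization for illustration, is

p_\eta(y) \;=\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{y^2}{2\sigma^2}\Big)\, \exp\Big\{\eta\, y - \frac{\sigma^2\eta^2}{2}\Big\}, \qquad \eta = \frac{\mu}{\sigma^2},

so that $\psi(\eta) = \sigma^2\eta^2/2$, $\psi'(\eta) = \sigma^2\eta = \mu$ and $\psi''(\eta) = \sigma^2$, which is consistent with the boundedness claimed above.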
Example 3.
Suppose with , that is,
Choose . Then we have . So and are bounded.
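Similarly, for reference, the Bernoulli model with the logistic link (a standard choice; the exact link used above is not shown) can be written in the form (1) as

p_\eta(y) \;=\; \exp\big\{\eta\, y - \log(1 + e^{\eta})\big\}, \qquad y \in \{0, 1\}, \qquad \eta = \log\frac{p}{1 - p},

so that $\psi(\eta) = \log(1 + e^{\eta})$, the mean is $\psi'(\eta) = e^{\eta}/(1 + e^{\eta}) = p$, and $\psi''(\eta) = \psi'(\eta)\{1 - \psi'(\eta)\} \in (0, 1/4]$, so both derivatives are bounded.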
Let . Suppose are two latent functions taking values in . Let be the joint density functions of under the dominant measure . Define the Kullback-Leibler divergence as and define the -th Kullback-Leibler variation as . Define the Hellinger distance as . For i.i.d. experiments, is closely related to these common distances in many models. Take the binary classification as an example. Let and , then by Lemma 2.8 in Ghosal and van der Vaart (2017), we have .
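For concreteness, the standard forms of these quantities, in the notation of Ghosal and van der Vaart (2017) and with $p_f$, $p_g$ denoting the joint densities induced by two latent functions (our notation, assumed for illustration), are

\mathrm{KL}(p_f, p_g) \;=\; \int p_f \log\frac{p_f}{p_g}\, \mathrm{d}\nu, \qquad V_k(p_f, p_g) \;=\; \int p_f \Big|\log\frac{p_f}{p_g} - \mathrm{KL}(p_f, p_g)\Big|^k \mathrm{d}\nu, \qquad d_H(p_f, p_g) \;=\; \Big(\int \big(\sqrt{p_f} - \sqrt{p_g}\big)^2 \mathrm{d}\nu\Big)^{1/2},

where $\nu$ denotes the dominating measure.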
Lemma 5.
Suppose Assumption 4 holds. If and belong to one-parameter NEF, then
1. Suppose the range of the latent functions is bounded. Then there exist constants such that the Kullback-Leibler divergence and the second Kullback-Leibler variation are bounded above in terms of the supremum distance between the two latent functions.
2. Suppose the range of the latent functions is convex. Then there exists a uniform constant such that the Hellinger distance is bounded above in terms of the supremum distance between the two latent functions.
The main technique used is the Taylor expansion. According to Lemma 5, the Kullback-Leibler divergence, the second variation, and the Hellinger distance are all bounded above by the infinity distance of the corresponding latent functions. This consequence is crucial for proving the posterior contraction rates in one-parameter NEF models. Let us state the main theorem about posterior contraction rates in NEF on manifolds.
Theorem 4.
Suppose the observations are independent and identically distributed with the population , where is the latent function. Let the prior on be the heat kernel GP with an adaptive diffusion time of (3). Suppose belongs to one-parameter NEF models with Assumption 4 and satisfies Assumption 1. Then for the true parameter in the Besov space with , the posterior contraction rate at w.r.t. is , that is, as , for each ,
(8) |
where is the posterior distribution.
The theorem follows from Lemma 5 and the properties of heat kernels. The Besov space with smoothness $s$ is defined as in Castillo et al. (2014). Remarkably, the posterior contraction rates depend on the intrinsic dimension $d$ instead of the ambient dimension $D$. If $d < D$, the contraction rate depending on $d$ is much faster than the corresponding rate depending on $D$. So when the domain is highly non-Euclidean, the heat kernel Gaussian process is superior to the extrinsic kernel Gaussian process. At a heuristic level, if we could embed $\mathcal{M}$ into $\mathbb{R}^d$ while keeping the intrinsic distances invariant, then Theorem 4 suggests that heat kernel GPs are equivalent to performing inference on the embedded $d$-dimensional domain space. So the contraction rates are extremely sharp.
In the FLGP algorithm, the diffusion time is estimated by an empirical Bayes method. However, in the posterior contraction analysis we assign a hyperprior to the diffusion time, which is a fully Bayesian treatment. This theoretical gap may be filled in the future.
5 Experiments
In this section, we compare the performance of various kernel-based approaches. Besides FLGP, we consider GPs with radial basis function kernels (RBF), kernel support vector machines (KSVM) with RBF kernels, GLGP and the Nyström approximation of GLGP (denoted by GLGP∗). RBF GPs and RBF KSVM are extrinsic kernel algorithms while other approaches are manifold learning algorithms. The Nyström approximation is one of the most common ways to improve the scalability of GPs (Quinonero-Candela and Rasmussen, 2005; Liu et al., 2020). We provide an implementation based on Dunson et al. (2021a)’s work as a benchmark of scalable domain-adaptive GPs in Appendix C, where the subsampling method is k-means clustering. We consider four FLGP algorithms that differ in the kernel matrix constructions and the subsampling methods, including squared exponential kernels and random subsampling (SRFLGP), squared exponential kernels and k-means subsampling (SKFLGP), local anchor embedding and random subsampling (LRFLGP), and local anchor embedding and k-means subsampling (LKFLGP).
We consider two examples to evaluate the performance of FLGP and apply FLGP to the fMRI dataset from the Human Connectome Project (HCP). In all cases, the prespecified hyperparameters (the number of induced points $s$, the number of local induced points $r$, and the number of eigenpairs $K$) are fixed in advance. The sensitivity analysis of FLGP is presented in Appendix D; the results suggest that the prediction performance is robust against changes in these prespecified hyperparameters. For the first synthetic example, we simulate multiple concentric circles in $\mathbb{R}^2$, where the class of each point is determined by the circle on which it lies. The second example is based on the MNIST handwritten-digit dataset. In the third case, we reconstruct the fMRI images of the working memory task from HCP. The domain manifolds in the MNIST and HCP tasks are unknown and can be very complicated. Experimental results show that FLGP achieves better performance than competing methods in all cases. The first and second examples are run on a machine with 80GB RAM and a 3.70GHz AMD Ryzen 5 5600X 6-core processor. The HCP application is run on the Great Lakes computing cluster at the University of Michigan.
5.1 Simulation on synthetic concentric circles
In this example, we consider six circles with the same center but different radii in $\mathbb{R}^2$. Denote the six circles, from the inside out, as $S_1, \dots, S_6$, and let $\mathcal{M} = \bigcup_{k=1}^{6} S_k$. Then $\mathcal{M}$ is a one-dimensional submanifold of $\mathbb{R}^2$. The domain geometry is significantly different from the Euclidean geometry. To generate the data, we first sample $n/6$ points uniformly on each of the six circles; these points form the point cloud $X$. Figure 3 illustrates the generated data. Then, we randomly sample $m$ points from $X$; these points are the labelled data points. The circles are split into two groups: points on one group have true Class 1 and points on the other group have true Class 0.
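A minimal R sketch of this data-generating mechanism is given below. The unit-spaced radii and the alternating class assignment across circles are illustrative assumptions, since the exact radii and class split are fixed in the replication code rather than stated here.

make_circles <- function(n, m, radii = 1:6) {
  stopifnot(n %% length(radii) == 0)
  theta <- runif(n, 0, 2 * pi)                               # uniform angles on each circle
  circle <- rep(seq_along(radii), each = n / length(radii))  # circle membership of each point
  X <- cbind(radii[circle] * cos(theta), radii[circle] * sin(theta))
  y <- circle %% 2                                           # alternating class labels (assumption)
  labeled <- sample(n, m)                                    # indices of the m labelled points
  list(X = X, y = y, labeled = labeled)
}

For example, make_circles(2400, 50) generates a data set of the same size as the smallest scenario in Table 1.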

We consider six scenarios for data generation, namely all combinations of $n$ and $m$ with $n \in \{2400, 4800, 12000\}$ and $m \in \{50, 100\}$. We classify the unlabelled data points using FLGP and competing methods, repeating each scenario multiple times. For FLGP, the prespecified hyperparameters $s$, $r$, and $K$ are fixed in advance; other hyperparameters are optimized by maximizing the corresponding marginal likelihood. We report the error rate over unlabelled data to show the classification performance in Table 1 and the computing time to show the computation speed in Table 2. As shown in Table 1, intrinsic kernel approaches achieve lower error rates than extrinsic kernel approaches. RBF GP and KSVM collapse to random prediction because of the highly non-Euclidean geometry. Meanwhile, GLGP fails to work when $m = 50$, with average error rates close to random guessing, but its error rate drops to around 17% when $m = 100$. The prediction performance of the Nyström extension (GLGP∗) is similar to that of GLGP. It is worth noting that FLGP obtains much lower error rates than GLGP and the Nyström extension in most cases. For example, when $n = 12000$ and $m = 100$, LKFLGP attains an average error rate below 1%, which is much lower than that of GLGP or the extrinsic kernel approaches. Comparing different FLGP algorithms, FLGP with squared exponential kernels attains lower error rates than FLGP with local anchor embedding kernels in this example. Moreover, since clustering centers describe the domain geometry better than random points, FLGP with k-means subsampling also attains lower error rates than FLGP with random subsampling in all cases. In terms of computational efficiency, Table 2 shows that our FLGP computes much faster than the other manifold approaches, including GLGP and the Nyström extension. For instance, when $n = 12000$ and $m = 100$, LKFLGP (0.4s) is over one hundred times faster than GLGP (59.7s). We conclude from Tables 1 and 2 that LKFLGP achieves the best balance between performance and speed in this case.
| | n = 2400 | | n = 4800 | | n = 12000 | |
|---|---|---|---|---|---|---|
| m | 50 | 100 | 50 | 100 | 50 | 100 |
| RBF GP | 47.3(2.3) | 44.2(2.4) | 46.1(3.0) | 44.8(2.4) | 46.8(2.1) | 45.3(2.0) |
| KSVM | 46.8(3.1) | 44.7(2.5) | 46.7(2.4) | 45.0(3.1) | 47.1(2.5) | 46.4(1.3) |
| GLGP | 46.1(8.3) | 17.6(2.9) | 46.8(6.9) | 17.6(3.4) | 47.7(5.7) | 17.1(3.8) |
| GLGP∗ | 44.7(8.6) | 16.7(3.6) | 44.6(8.7) | 16.2(3.8) | 41.3(9.6) | 17.3(3.4) |
| SRFLGP | 19.6(5.3) | 12.3(2.6) | 16.5(4.2) | 12.0(3.5) | 17.8(4.7) | 9.7(3.9) |
| SKFLGP | 3.1(3.4) | 0.7(1.6) | <0.1(<0.1) | <0.1(0.2) | <0.1(<0.1) | <0.1(<0.1) |
| LRFLGP | 27.4(5.4) | 17.3(3.2) | 26.9(4.8) | 18.1(3.5) | 27.0(4.4) | 19.5(3.5) |
| LKFLGP | 7.0(4.2) | 3.9(2.5) | 3.3(2.9) | 1.5(1.4) | 1.1(2.6) | 0.3(0.7) |
Each entry of the form a(b) reports the average error rate a (in %) with standard deviation b. The other tables of error rates use the same format.
| n | 2400 | | 4800 | | 12000 | |
|---|---|---|---|---|---|---|
| m | 50 | 100 | 50 | 100 | 50 | 100 |
| RBF GP | 0.1 | 0.11 | 0.19 | 0.2 | 0.48 | 0.52 |
| KSVM | 0.03 | 0.01 | 0.02 | 0.03 | 0.04 | 0.06 |
| GLGP | 1.4 | 1.55 | 9.21 | 9.33 | 59.74 | 59.72 |
| GLGP∗ | 1.58 | 1.74 | 3.99 | 4.12 | 7.38 | 7.54 |
| SRFLGP | 1.92 | 2.09 | 1.36 | 1.58 | 1.59 | 1.81 |
| SKFLGP | 5.57 | 5.98 | 2.29 | 2.59 | 1.98 | 2.34 |
| LRFLGP | 0.12 | 0.17 | 0.16 | 0.21 | 0.27 | 0.33 |
| LKFLGP | 0.14 | 0.2 | 0.18 | 0.24 | 0.33 | 0.4 |
Computing time is reported in seconds.
To explore the underlying mechanism, we plot the covariance distributions between all data points and some reference points in Figure 4. For better visualization, the covariance is scaled into $[0, 1]$. As the geometry of the domain is highly non-Euclidean, the RBF kernel fails to capture the true similarity between points: it activates points based on the Euclidean distance, while the heat kernel estimates the covariance based on the intrinsic geometry. In this case, as the circles are very close together, GLGP and the Nyström extension obtain inferior estimates without sparsity. On the contrary, the covariance structure is well captured by FLGP: only the points on the same circle as the reference point are activated. These differences in the covariance distributions explain the different performance of the various approaches.

5.2 Application to handwritten digit data MNIST
The MNIST database contains 60000 training and 10000 test image-label pairs. Each image is a handwritten digit (0-9) and the label is the digit number. In this experiment, we mix the training set and the test set to construct the entire point cloud containing 70000 images. In this application, we compare the multi-classification performance on image labels of RBF GP, RBF KSVM, GLGP, the Nyström extension, and FLGP. To speed up the computation, we reduce the dimension of the image predictor from 784 to a much lower dimension by Principal Component Analysis (PCA). After the dimensionality reduction, the input space is unknown and should be a submanifold of the reduced Euclidean space. The domain geometry could be very complicated and non-linear.
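The dimension reduction step can be sketched as follows; the object name mnist_x and the number of retained components are placeholders, since the paper's exact target dimension is not shown here.

# PCA on the stacked 70000 x 784 MNIST matrix; keep the leading scores as GP inputs.
pca <- prcomp(mnist_x, center = TRUE, scale. = FALSE)
X_reduced <- pca$x[, 1:50]   # 50 retained components is an illustrative choice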
To generate the data, we first sample $n$ observations from all 70000 instances. We then sample $m$ labeled observations from these $n$ observations; the number of unlabelled observations is $n - m$. We consider four scenarios of data generation, namely all combinations of $n$ and $m$ with $n \in \{7000, 70000\}$ and $m \in \{100, 200\}$. Since GLGP takes too much computation time when $n = 70000$, we only report the results of GLGP for $n = 7000$. We repeat the experiments in each scenario 10 times. The prespecified hyperparameters $s$, $r$, and $K$ of FLGP are fixed in advance; other hyperparameters are optimized by maximizing the marginal likelihood.
Table 3 summarizes the mean and standard deviation of error rates over the 10 replicates for each method. Generally, manifold learning algorithms perform better than RBF kernel-based algorithms. FLGP achieves lower error rates than GLGP and the Nyström extension. For each scenario, FLGP with k-means subsampling (SKFLGP and LKFLGP) obtains the lowest error rates among all competing methods. Another observation is that the error rates of the extrinsic kernel approaches remain at the same level as $n$ increases from 7000 to 70000, since they do not make use of the domain information. The error rates of FLGP with random subsampling do not change much either, while the error rates of the Nyström extension and of FLGP with k-means subsampling decrease with increasing $n$. This is because when the point cloud becomes larger, the clustering centers capture more manifold information, whereas the randomly subsampled points remain similar. This phenomenon confirms the intuition that the induced points determine how much domain information is used: when the induced points describe the intrinsic geometry better, the manifold learning algorithms achieve lower error rates. The computation time is summarized in Table 4. The Nyström extension is faster than GLGP, and FLGP is faster than the Nyström extension. The superior efficiency of FLGP over the Nyström extension comes from sparse matrix multiplication. Considering both the error rates and the time cost, LKFLGP performs best among the FLGP algorithms.
| | n = 7000 | | n = 70000 | |
|---|---|---|---|---|
| m | 100 | 200 | 100 | 200 |
| RBF GP | 54.2(4.9) | 43.0(5.1) | 53.3(7.2) | 40.6(3.0) |
| KSVM | 65.4(3.7) | 40.3(3.7) | 65.7(6.3) | 40.3(4.0) |
| GLGP | 32.1(3.3) | 23.3(1.9) | | |
| GLGP∗ | 31.1(3.0) | 21.9(1.6) | 19.9(2.0) | 15.3(1.9) |
| SRFLGP | 20.2(2.0) | 15.0(1.4) | 18.4(3.0) | 13.8(0.9) |
| SKFLGP | 14.1(2.4) | 10.3(1.5) | 8.9(1.3) | 7.4(0.8) |
| LRFLGP | 23.3(4.9) | 16.4(1.4) | 21.7(3.5) | 16.2(1.1) |
| LKFLGP | 15.2(2.9) | 10.3(1.3) | 9.7(1.4) | 7.7(0.5) |
| | n = 7000 | | n = 70000 | |
|---|---|---|---|---|
| m | 100 | 200 | 100 | 200 |
| RBF GP | 1 | 1.3 | 9.7 | 12.4 |
| KSVM | 0.7 | 0.9 | 6.6 | 9.3 |
| GLGP | 57.3 | 61.7 | | |
| GLGP∗ | 17.5 | 20.8 | 156.1 | 157.7 |
| SRFLGP | 7.1 | 10.9 | 15 | 18.7 |
| SKFLGP | 11.9 | 16.1 | 29.8 | 34.6 |
| LRFLGP | 1.2 | 2.2 | 5.8 | 7.1 |
| LKFLGP | 1.9 | 3 | 21.8 | 23 |
Computing time is reported in seconds.
5.3 Application to fMRI data from HCP
We further illustrate our method with an analysis of task-evoked functional Magnetic Resonance Imaging (fMRI) data from the Human Connectome Project (HCP-1200 release). Task fMRI is a technique that indirectly measures brain activity by monitoring oxygen-level changes in blood flow while the subject performs specific tasks. The standard analysis of fMRI data is carried out in Euclidean space, which does not directly take into account the complexity of the human brain's structure. Although cortical surface fMRI data (Fischl et al., 1999) offer the opportunity to analyze brain activation on a lower-dimensional manifold of the three-dimensional space, they are based on a prespecified manifold structure that is not adaptive to different subjects and different types of activation.
Our goal is to estimate the manifold of fMRI activation in the three-dimensional brain from data and evaluate the accuracy of predicting fMRI activation on the manifold, which may provide insights on the neural underpinnings associated with cognitive functions (Sripada et al., 2020).
The typical fMRI data for a single subject form a four-dimensional tensor that records the fMRI signals over time at each location (voxel) in the three-dimensional brain. The standard preprocessing of task-based fMRI data involves several key steps: gradient unwarping, motion correction, distortion correction, and grand-mean intensity normalization. To localize the fMRI activation in the brain, general linear models are fitted at each voxel for each subject, where the fMRI time series is the response variable and the stimulus pattern of the task provides the covariates. The voxel-level fMRI activation intensities are estimated by the contrast statistics of the fitted general linear models, which reflect the difference in fMRI activation under different task conditions. This preprocessing procedure generates the subject-level contrast maps, which are three-dimensional volumetric images of fMRI activation. In this study, we focus on the 2-back versus 0-back contrast maps during the working memory N-back tasks (van Essen et al., 2013) from 917 subjects in the HCP study. All the images have been registered on the standard MNI 2mm brain template of resolution $91 \times 109 \times 91$.
To estimate the manifold of fMRI activation, we first define the population level activation regions from all the contrast maps. For each voxel in each contrast map, if its absolute intensity value is larger than the corresponding standard normal quantile threshold, then the fMRI activation at this voxel is considered significantly different between the 2-back and 0-back tasks. For each subject, we refer to the collection of activation voxels as the activation region. In this study, we estimate the population level activation region as the superposition of the activation voxels from all 917 subjects. To improve the robustness of the estimation, we impose the constraint that each voxel in this region must be an activation voxel for at least two subjects. Figure 7 shows the map of the population level activation regions. The total number of activation voxels is 17,711, and 76 out of the first 90 brain regions from the automated anatomical labelling (AAL) brain atlas (Tzourio-Mazoyer et al., 2002) have at least one activation voxel.
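The construction of the population level activation region can be sketched as follows, assuming the subject-level contrast maps are stacked into a voxel-by-subject matrix contrast_mat and the normal-quantile threshold is stored in z_thr (both names and the threshold value are placeholders).

act <- abs(contrast_mat) > z_thr            # voxel-level activation indicator per subject
pop_region <- which(rowSums(act) >= 2)      # voxels activated in at least two subjects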
On the population level activation region, we estimate the manifold of fMRI activation by FLGP. For each subject, we estimate the prior covariance kernel on manifolds using the proposed algorithm in Section 3.1. The distance between two voxels on the manifold of fMRI activation can be well measured by the prior correlation of the GP between the two voxels. We compute the average of the covariance kernel over 917 subjects to estimate the population level manifold of fMRI activation. To summarize the manifold of fMRI activation, we consider two levels of visualizations: voxel level correlation seed map (Figure 5) and region level correlation network (Figure 6).
The voxel level seed map in Figure 5 shows the prior correlation between each activation voxel and the seed voxel in the AAL brain region. The seed voxel in each region is defined as the voxel with the largest absolute mean intensity in the region and can be considered as the region center. We include 47 brain regions that have at least 30 activation voxels out of the 76 brain regions that have at least one activation voxel. Among the 47 regions, 26 regions (55.3%) have at least 30% activation voxels that have a correlation greater than 0.5 with the region center and 7 regions (14.9%) have at least 30% activation voxels that have a correlation greater than 0.8 with the region center. Particularly, 91.9% activation voxels in Parietal_Inf_L, 77.8% activation voxels in Cuneus_L and 57.1% activation voxels in Parietal_Sup_R have correlations greater than 0.8 with their corresponding region centers.
The region level correlation network map in Figure 6 illustrates the manifold of fMRI activation among different AAL regions: two regions with a large correlation are close to each other on the manifold of fMRI activation. We present the prior correlation estimated by FLGP between the 76 AAL regions over the 917 subjects in Figure 6. Each region is represented by its region center, and connected edges indicate correlations greater than 0.5 between regions. According to Figure 6, the Frontal Lobe, Parietal Lobe, and Temporal Lobe exhibit significant within-network correlation, with the Frontal Lobe strongly connected to the Insula and the Temporal Lobe. Besides the Frontal Lobe, the Temporal Lobe demonstrates notable correlation with the Occipital Lobe and moderate correlation with the Insula. Furthermore, the Limbic system shows a moderate connection with the Frontal Lobe and the Occipital Lobe. The Occipital Lobe also has a modest correlation with the Parietal Lobe.


To evaluate the manifold learning accuracy of FLGP, we compare the prediction performance of RBF GP, the Nyström approximation of GLGP (GLGP∗), and FLGP with local anchor embedding and k-means subsampling (LKFLGP). Since the population level activation region contains a very large number of voxels (17,711), we implement an extremely fast regression algorithm for the case of enormous numbers of labeled observations in Appendix E. Here we take LKFLGP as the representative of the FLGP methods, as the previous experiments show that it best balances prediction accuracy and efficiency. GLGP is excluded as it is computationally prohibitive. Besides, the prespecified hyperparameters $s$, $r$, and $K$ of FLGP are fixed in advance. We consider three scenarios with $m \in \{5000, 7500, 10000\}$. Each scenario is repeated 10 times. To remove the effect of outliers, we disregard points whose prediction errors fall in the extreme upper or lower tails.

Experimental results are summarized in Table 5, including the average voxel-wise predictive mean squared error (MSE) and the computation time over the 917 subjects and the 10 replicates. In all scenarios, FLGP obtains the lowest MSE while RBF GP has the highest MSE; the mean predictive MSE of GLGP∗ lies between those of RBF GP and FLGP. This observation shows that using the intrinsic geometry improves the performance and that FLGP estimates the brain manifold better than GLGP∗. Furthermore, because of the low-rank structure and sparsity of the covariance, FLGP is the fastest approach, with much less computation time than the other methods. For instance, when $m = 10000$, the computing time of FLGP is about 4% of that of GLGP∗ and about 3% of that of RBF GP. In conclusion, FLGP attains excellent performance in both prediction and efficiency.
| m | 5000 | | 7500 | | 10000 | |
|---|---|---|---|---|---|---|
| Method | MSE | Time | MSE | Time | MSE | Time |
| RBF GP | 0.066(0.039) | 97 | 0.064(0.037) | 320 | 0.062(0.037) | 777 |
| GLGP∗ | 0.040(0.021) | 444 | 0.038(0.020) | 512 | 0.037(0.020) | 573 |
| FLGP | 0.033(0.018) | 17 | 0.032(0.017) | 21 | 0.031(0.017) | 25 |
Each entry of the form a(b) reports the average MSE a with standard deviation b.
6 Conclusion
In this article, we develop a scalable Bayesian inference method for heat kernel Gaussian processes on an unknown $d$-dimensional submanifold of $\mathbb{R}^D$. For an enormous point cloud, we construct a reduced-rank approximation of the transition probability matrix and implement an efficient truncated SVD, which yields a scalable domain-adaptive GP approach, i.e., FLGP. Furthermore, when the domain space is highly non-Euclidean, FLGP performs much better than traditional GPs and other heat-kernel-based GPs.
There are several interesting directions for future research. The first is to combine FLGP with known geometric information. For example, $\mathcal{M}$ is given in many practical problems, such as those considered in Kim et al. (2021) and Guinness and Fuentes (2016); it may then be possible to utilize the known domain structure to improve the performance. The second direction is to study the sample paths of heat kernel GPs, such as their smoothness. The third direction is to consider other kernels on manifolds besides the heat kernel. In particular, the generalized Matérn kernel (Borovitskiy et al., 2020) is appealing because of its simple form and nice properties.
7 Code and supplementary materials
The source package FLGP is accessible on the GitHub repository https://github.com/junhuihe2000/FLGP and R codes of examples in Section 5 are available on the GitHub repository https://github.com/junhuihe2000/exampleFLGP. The supplementary materials include all proofs in the article, the implementation of the Nyström extension, and the sensitivity analysis of the predefined parameters. They also involve a fast GP normal regression algorithm when the number of labeled observations is enormous.
Acknowledgements
Yang’s research was partially supported by the National Natural Science Foundation of China grant (12271286, 11931001). Kang’s research was partially supported by the following grants: NIH R01DA048993, NIH R01MH105561, and NSF IIS2123777.
References
- Balasubramanian and Schwartz (2002) Balasubramanian, M. and Schwartz, E. L. (2002) The isomap algorithm and topological stability. Science, 295, 7.
- Belkin and Niyogi (2001) Belkin, M. and Niyogi, P. (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, vol. 14. Cambridge: MIT Press.
- Borovitskiy et al. (2021) Borovitskiy, V., Azangulov, I., Terenin, A., Mostowsky, P., Deisenroth, M. and Durrande, N. (2021) Matérn Gaussian processes on graphs. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, vol. 130, 2593–2601.
- Borovitskiy et al. (2020) Borovitskiy, V., Terenin, A., Mostowsky, P. and Deisenroth, M. (2020) Matérn Gaussian processes on Riemannian manifolds. In Advances in Neural Information Processing Systems, vol. 33, 12426–12437. Curran Associates, Inc.
- Brown et al. (2010) Brown, L. D., Cai, T. T. and Zhou, H. H. (2010) Nonparametric regression in exponential families. The Annals of Statistics, 38, 2005 – 2046.
- Castillo et al. (2014) Castillo, I., Kerkyacharian, G. and Picard, D. (2014) Thomas Bayes’ walk on manifolds. Probability Theory and Related Fields, 158, 665–710.
- Dunson et al. (2021a) Dunson, D. B., Wu, H.-T. and Wu, N. (2021a) Graph based Gaussian processes on restricted domains. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84, 414–439.
- Dunson et al. (2021b) — (2021b) Spectral convergence of graph Laplacian and heat kernel reconstruction in $L^\infty$ from random samples. Applied and Computational Harmonic Analysis, 55, 282–336.
- Feragen et al. (2015) Feragen, A., Lauze, F. and Hauberg, S. (2015) Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3032–3042.
- Fischl et al. (1999) Fischl, B., Sereno, M. I. and Dale, A. M. (1999) Cortical surface-based analysis II: Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9, 195–207.
- García Trillos et al. (2020) García Trillos, N., Gerlach, M., Hein, M. and Slepčev, D. (2020) Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace–Beltrami operator. Foundations of Computational Mathematics, 20, 827–887.
- Genton (2001) Genton, M. G. (2001) Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2, 299–312.
- Ghosal and van der Vaart (2017) Ghosal, S. and van der Vaart, A. (2017) Fundamentals of Nonparametric Bayesian Inference. Cambridge: Cambridge University Press.
- Grigor’yan (2006) Grigor’yan, A. (2006) Heat kernels on weighted manifolds and applications. Contemporary Mathematics, 398, 93–191.
- Grigor’yan (2009) — (2009) Heat Kernel and Analysis on Manifolds, vol. 47. Rhode Island: American Mathematical Society.
- Guinness and Fuentes (2016) Guinness, J. and Fuentes, M. (2016) Isotropic covariance functions on spheres: Some properties and modeling considerations. Journal of Multivariate Analysis, 143, 143–152.
- Bradley et al. (2020) Bradley, J. R., Holan, S. H. and Wikle, C. K. (2020) Bayesian hierarchical models with conjugate full-conditional distributions for dependent data from the natural exponential family. Journal of the American Statistical Association, 115, 2037–2052.
- Karoui and Wu (2016) Karoui, N. E. and Wu, H.-T. (2016) Graph connection Laplacian methods can be made robust to noise. The Annals of Statistics, 44, 346 – 372.
- Kendall et al. (2009) Kendall, D. G., Barden, D., Carne, T. K. and Le, H. (2009) Shape and Shape Theory. Hoboken: John Wiley & Sons.
- Kim et al. (2021) Kim, K.-R., Dryden, I. L., Le, H. and Severn, K. E. (2021) Smoothing splines on Riemannian manifolds, with applications to 3D shape space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83, 108–132.
- Lee (2012) Lee, J. M. (2012) Introduction to Smooth Manifolds. Berlin: Springer.
- Lin et al. (2019) Lin, L., Mu, N., Cheung, P. and Dunson, D. (2019) Extrinsic Gaussian processes for regression and classification on manifolds. Bayesian Analysis, 14, 887 – 906.
- Liu et al. (2020) Liu, H., Ong, Y.-S., Shen, X. and Cai, J. (2020) When Gaussian process meets big data: A review of scalable GPs. IEEE Transactions on Neural Networks and Learning Systems, 31, 4405–4423.
- Liu et al. (2010) Liu, W., He, J. and Chang, S.-F. (2010) Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th International Conference on Machine Learning, 679–686.
- Niu et al. (2019) Niu, M., Cheung, P., Lin, L., Dai, Z., Lawrence, N. and Dunson, D. (2019) Intrinsic Gaussian processes on complex constrained domains. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81, 603–627.
- Niu et al. (2023) Niu, M., Dai, Z., Cheung, P. and Wang, Y. (2023) Intrinsic Gaussian process on unknown manifolds with probabilistic metrics. Journal of Machine Learning Research, 24, 1–42.
- Quinonero-Candela and Rasmussen (2005) Quinonero-Candela, J. and Rasmussen, C. E. (2005) A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1939–1959.
- Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006) Gaussian Processes for Machine Learning, vol. 1. Cambridge: MIT Press.
- Rousseau (2016) Rousseau, J. (2016) On the frequentist properties of Bayesian nonparametric methods. Annual Review of Statistics and Its Application, 3, 211–231.
- Roweis and Saul (2000) Roweis, S. T. and Saul, L. K. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
- Schweinberger and Stewart (2020) Schweinberger, M. and Stewart, J. (2020) Concentration and consistency results for canonical and curved exponential-family models of random graphs. The Annals of Statistics, 48, 374 – 396.
- Sripada et al. (2020) Sripada, C., Angstadt, M., Rutherford, S., Taxali, A. and Shedden, K. (2020) Toward a “treadmill test” for cognition: Improved prediction of general cognitive ability from the task activated brain. Human Brain Mapping, 41, 3186–3197.
- Tzourio-Mazoyer et al. (2002) Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Delcroix, N., Mazoyer, B. and Joliot, M. (2002) Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 15, 273–289.
- van der Vaart and van Zanten (2008) van der Vaart, A. W. and van Zanten, J. H. (2008) Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36, 1435 – 1463.
- van der Vaart and van Zanten (2009) — (2009) Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics, 37, 2655 – 2675.
- van Essen et al. (2013) van Essen, D. C., Smith, S. M., Barch, D. M., Behrens, T. E., Yacoub, E. and Ugurbil, K. (2013) The WU-Minn human connectome project: an overview. NeuroImage, 80, 62–79.
- Weinberger and Saul (2006) Weinberger, K. Q. and Saul, L. K. (2006) Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70, 77–90.
- Wu and Wu (2018) Wu, H.-T. and Wu, N. (2018) Think globally, fit locally under the manifold setup: Asymptotic analysis of locally linear embedding. The Annals of Statistics, 46, 3805 – 3837.
- Yan et al. (2009) Yan, D., Huang, L. and Jordan, M. I. (2009) Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 907–916. New York: Association for Computing Machinery.
- Yang and Dunson (2016) Yang, Y. and Dunson, D. B. (2016) Bayesian manifold regression. The Annals of Statistics, 44, 876 – 905.
- Ye et al. (2024) Ye, K., Niu, M., Cheung, P., Dai, Z. and Liu, Y. (2024) Intrinsic Gaussian processes on manifolds and their accelerations by symmetry. Preprint arXiv:2006.14266.
8 Proofs in Section 3
8.1 Proof of Proposition 1
Proof.
As is a positive semi-definite matrix, its eigenvalues are of . Let be an eigenvalue of . Since and are row stochastic matrices, is also a row stochastic matrix. Then lies in the unit disc and is an eigenvalue of . So we have . Therefore, the eigenvalues of are real numbers and belong to , and is the smallest eigenvalue of . ∎
8.2 Proof of Proposition 2
Proof.
The time complexity of subsampling is for random sampling and for k-means clustering. Assume . The construction of needs computations. The complexity of truncated SVD with sparsity is and the construction of approximate heat kernel matrix from eigenpairs needs computations. Therefore, the total complexity is for random sampling or for k-means clustering. ∎
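To make the scaling of the eigenpair step concrete, here is a minimal sketch that runs a truncated SVD on a sparse $n \times s$ cross matrix with $r$ nonzeros per row, assuming SciPy's sparse CSR format; the variable names and values (n, s, r, M) are illustrative and do not correspond to any released FLGP code.

```python
# Sketch of the truncated SVD step in Proposition 2 (illustrative only).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
n, s, r, M = 20000, 500, 3, 100      # sample size, induced points, sparsity, eigenpairs

# Build a sparse row-stochastic cross matrix with r nonzero weights per row.
rows = np.repeat(np.arange(n), r)
cols = rng.integers(0, s, size=n * r)
vals = rng.random(n * r)
A = sp.csr_matrix((vals, (rows, cols)), shape=(n, s))
A = sp.diags(1.0 / np.asarray(A.sum(axis=1)).ravel()) @ A   # row normalization

# Truncated SVD of the sparse n x s matrix: the cost scales roughly with the
# number of nonzeros times M, rather than cubically in n.
U, sigma, Vt = svds(A, k=M)
order = np.argsort(-sigma)           # svds returns singular values in ascending order
U, sigma = U[:, order], sigma[order]
print(U.shape, sigma[:5])
```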
9 Proofs in Section 4
9.1 One step random walk graph Laplacian
9.1.1 Proof of Theorem 1
Proof.
This theorem states that the error between the estimated covariance and the heat kernel is controlled by the bandwidth and the number of eigenpairs . In contrast, Theorem 2 in Dunson et al. (2021a) proves a similar result that bounds the error between the covariance and the sum of the first eigenpairs of the heat kernel. Since our exact prior is the heat kernel Gaussian process, we must additionally estimate the magnitude of the remaining term in the heat kernel.
Denote by the -th eigenpair of and by the -th eigenpair of . Let and be the element of the matrices and , respectively. By the spectral decomposition of we have
(9)
Under Assumption 2 in the main text, Dunson et al. (2021a) have shown that, for the first term, with probability greater than , we have
Furthermore, the second term in (9) is bounded using the fact that the spectrum of the heat kernel is fast-decaying. By Weyl's law, there exists a constant , such that, for , . Besides, for the normalized eigenfunction of the Laplace–Beltrami operator (Grigor'yan, 2009), there exists a decreasing function , such that, for , we have . This gives an upper bound on the infinity norm of . Then for any , ,
Therefore, summing the above inequality over , we have
The last inequality holds by some simple algebra. ∎
9.2 Two step random walk graph Laplacian with subsampling
First, Lemmas 1 and 2 establish the equivalence between and . Then, based on a perturbation analysis, we estimate the error between and using this relation in Theorem 2.
9.2.1 Proof of Lemma 1
Proof.
According to (4) in Section 3 of the main manuscript, for , we have
For , let be a column vector and let each repeat times in . Thus, and . As and , we have . Therefore, denoting as , we have and . ∎
9.2.2 Proof of Lemma 2
9.2.3 Proof of Theorem 2
Proof.
Define . As is symmetric, . Then we have . Under Assumption 3, . Hence . In the row stochastic matrix , for ,
Then the -norm and the -norm of are given by
Therefore, the spectral norm of satisfies . Furthermore, for the diagonal matrix , we have , for . Then the -norm and the -norm of the column stochastic matrix are given by
Therefore, satisfies . Similarly, we can prove that , and for . This completes the preparatory work. Next, according to ,
(11)
The first inequality follows from the triangle inequality, and we have . Let . Then for ,
The triangle inequality implies that, for ,
Then . Similarly, using the triangle inequality again, we have, for ,
Consequently, and . Then we have . As , for the squared exponential kernel, after a lengthy calculation, we have . Finally, (11) gives that,
where is and is . ∎
9.2.4 Proof of Corollary 1
Proof.
. ∎
9.3 Predictive errors in Gaussian processes
Lemmas 3 and 4 are technical results used to estimate the predictive errors in Theorem 3. In particular, Lemma 3 relies on matrix perturbation theory and Lemma 4 involves the Gaussian distribution.
9.3.1 Proof of Lemma 3
Proof.
This lemma states that the posterior expectation and variance of a multivariate Gaussian distribution are uniquely determined by the covariance matrix. Theorem 4 in Dunson et al. (2021a) proves this result for the posterior expectation only, which suffices for the Gaussian regression model. However, owing to the link function in the natural exponential family, we additionally have to bound the difference in posterior variances between different Gaussian distributions. We use matrix perturbation theory to complete the proof.
By the property of the conditional normal distribution, we have
Similar to Theorem 4 in Dunson et al. (2021a), there exists a constant , such that,
Then the remaining work is to prove the upper bound for the conditional variance. By the triangle inequality, we have
(12)
Since , rewrite with . Consequently, and . Denote the spectral decomposition of and by and , where and are diagonal matrices. By the perturbation theory in Wu and Wu (2018), there exist , with and commutes with , such that and . Then we have
Notably, both and are vectors here. There exists a constant , such that,
Consequently, by the triangle inequality, the second term of (12) is bounded by
∎
9.3.2 Proof of Lemma 4
Proof.
Let . As and , we have and . Then . By the mean value theorem,
The last inequality holds as . ∎
9.3.3 Proof of Theorem 3
Proof.
Let , where . For , Lemmas 3 and 4 give that,
where
and is the smallest eigenvalue of . ∎
9.4 Posterior contraction rates
In this part, the main objective is to prove, in Theorem 4, the posterior contraction rate of heat kernel GPs in the NEF. To this end, we first establish the relations among the relevant metrics in Lemma 5.
9.4.1 Proof of Lemma 5
Proof.
The proof is divided into three parts, separately bounding , and from above by . Let be the volume form on and let be the dominating measure on .
1. By the definition of the KL divergence,
Based on the NEF model and Fubini's theorem, the above KL divergence can be rewritten as
Let . Since and , we have
The derivative of on is . Clearly, . So by the mean value theorem, there exists , such that . Using the mean value theorem again, there exists , such that . Suppose and lie in ; then , and . Take as . Thus we have . So the upper bound of the KL divergence is given by
2. The second KL variation is bounded by
Moreover, for one-parameter NEF models, we have
Let . Since , expand to get
Then by the mean value theorem, there exist constants , such that and . Consequently, we have elementwise upper bounds for , that is,
Take , then . So an upper bound of the second KL variation is
3. By the definition and Fubini's theorem, the Hellinger distance is
Let . Multiplying out the square, can be rewritten as
Based on the one-parameter NEF likelihoods, since is convex, we have
As , is a convex function on . Then . By a simple equality , for , we have
Let . Then the derivative on is . As is non-decreasing, we have
Clearly, . So by the mean value theorem, there exists a constant , such that
Under Assumption 4, . Take . Then the elementwise bound of implies that
∎
9.4.2 Proof of Theorem 4
Proof.
Under Assumption 2, Castillo et al. (2014) have demonstrated that the domain space satisfies the Ahlfors property and the Gaussian two-sided bounds. Furthermore, they have proved the following properties of the adaptive heat kernel Gaussian processes: when with being a random diffusion time, taking vanishing sequences , there exist measurable subsets and a positive constant , such that
where is the covering number. These inequalities are respectively called the KL condition, the entropy condition and the remaining mass condition in Bayesian nonparametric inference (Ghosal and van der Vaart, 2017).
Since is compact and is continuous on , is bounded. Then are uniformly bounded. Assume these functions take values in a bounded set . Denote and by and . For , Lemma 5 shows that and . After some simple set operations, we have
Let be the prior probability distribution of . After transforming the infinity norm into the statistical distance, we obtain
In addition, Lemma B.5 in Ghosal and van der Vaart (2017) implies that the basic testing assumption holds for the Hellinger distance. Finally, by Theorem 8.9 in Ghosal and van der Vaart (2017), the posterior contraction rate at with respect to is . ∎
10 The Nyström Approximation for GLGP
Denote the point cloud by . We first review the GLGP algorithm of Dunson et al. (2021a). Step 1: construct the kernel matrix with a given kernel as for . Clearly, . The squared exponential kernel is the most common choice of ; assume is its bandwidth. Step 2: normalize to obtain the similarity matrix , that is, for ,
Then calculate the transition matrix by row normalization:
Thus, is a row stochastic matrix. Step 3: solve the eigendecomposition of for the eigenpairs with . As the graph Laplacian is defined by , the eigenpairs of are given by . Step 4: after choosing the top eigenvalues and eigenvectors, the heat kernel matrix in GLGP is estimated by, for ,
However, since is of , the computational complexity of GLGP is at least , which is unaffordable in scalable applications. Consequently, it is beneficial for speed to consider the Nyström approximation, which we explain as follows. Step 1: subsample induced points from , denoted as . Step 2: calculate the kernel matrices and by and for , where is a kernel function. Then and . Step 3: obtain the similarity matrix by, for ,
Similarly, compute by, for ,
Notice that in the denominator of , the second sum is , but not . Construct the transition matrices and by, for ,
Step 4: obtain the eigenpairs of with through the eigendecomposition. Define the extended eigenpairs as and , for . This transformation of the eigenvectors is the crucial idea behind the Nyström approximation. Step 5: choose the top eigenvalues and the corresponding eigenvectors, and construct the Nyström approximation of the heat kernel estimate as in GLGP by
The eigenvector normalization accounts for the difference in the coefficients of the two covariance matrices, which are and , respectively.
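As a complement to the description above, here is a minimal numerical sketch of the Nyström extension of eigenvectors and the resulting heat kernel estimate. It deliberately uses plain row normalization and skips the exact similarity normalization of GLGP, so it illustrates the idea rather than reproducing the algorithm; all names, parameter values, and the graph Laplacian convention $I - P$ are illustrative assumptions.

```python
# Toy sketch of the Nystrom approximation for a heat kernel estimate.
import numpy as np

rng = np.random.default_rng(1)
n, s, M, t, eps = 2000, 200, 20, 0.1, 1.0   # samples, induced points, eigenpairs, diffusion time, bandwidth

X = rng.normal(size=(n, 2))                  # toy point cloud
U = X[rng.choice(n, s, replace=False)]       # Step 1: subsampled induced points

def sqexp(A, B, eps):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (4 * eps))           # squared exponential kernel

K_ns = sqexp(X, U, eps)                      # Step 2: n x s and s x s kernel matrices
K_ss = sqexp(U, U, eps)

P_ns = K_ns / K_ns.sum(axis=1, keepdims=True)   # row-stochastic cross transition matrix
P_ss = K_ss / K_ss.sum(axis=1, keepdims=True)   # s x s transition matrix

lam, V = np.linalg.eig(P_ss)                 # Step 4: eigenpairs of the small s x s matrix
idx = np.argsort(-lam.real)[:M]              # keep the top M eigenpairs
lam, V = lam.real[idx], V.real[:, idx]

Phi = P_ns @ V / lam                         # Nystrom extension of eigenvectors to all n points

mu = 1.0 - lam                               # eigenvalues of I - P (a common convention; the paper's scaling may differ)
H = (Phi * np.exp(-t * mu)) @ Phi.T          # Step 5: approximate heat kernel matrix
print(H.shape, np.round(mu[:5], 3))
```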
11 Sensitivity analysis on MNIST
The predefined hyperparameters in FLGP are and . We carry out a sensitivity analysis of in the MNIST example. Let and let . We randomly sample images to form the labeled observations and treat the remaining images as the unlabeled observations. The numerical experiment is repeated times for each configuration. Error rates are calculated for FLGP with different hyperparameters.
The number of induced points has a great impact on how well the manifold is estimated. As increases, the induced points become denser in the domain . For clustering subsampling, the Euclidean distance between each cluster center and the points assigned to it becomes smaller, so the representation error decreases. Then has a stronger ability to represent , and we obtain a better approximation of the heat kernel in FLGP. Therefore, the prediction performance improves with increasing . However, the computational complexity also depends on : the time cost grows as increases, so a trade-off in is needed. Generally speaking, the suitable depends on the complexity of the intrinsic geometry; for a highly non-Euclidean manifold or for random subsampling, a relatively larger should be chosen. When and , error rates with different are summarized in Table 1 (see also the illustrative sketch after Table 1). In MNIST, as increases, the error rates decrease gradually for all FLGP algorithms. In particular, FLGP with random subsampling improves more than FLGP with k-means subsampling. For example, for , when increases from to , the error rates of LRFLGP decrease from to and the error rates of LKFLGP decrease from to . The reason is that with k-means subsampling already represents well, whereas with random subsampling it does not; hence the representation ability under random subsampling improves more markedly as increases.
\begin{tabular}{lcccccc}
\toprule
$s$ & \multicolumn{2}{c}{500} & \multicolumn{2}{c}{1000} & \multicolumn{2}{c}{1500} \\
$m$ & 100 & 200 & 100 & 200 & 100 & 200 \\
\midrule
SRFLGP & 23.7(3.7) & 18.6(1.5) & 18.4(3.0) & 13.8(0.9) & 17.2(2.8) & 12.3(1.4) \\
SKFLGP & 12.4(2.0) & 9.2(0.9) & 8.9(1.3) & 7.4(0.8) & 8.9(1.7) & 6.4(0.7) \\
LRFLGP & 27.3(4.8) & 21.0(1.6) & 21.7(3.5) & 16.2(1.1) & 19.3(2.4) & 13.7(1.1) \\
LKFLGP & 13.1(3.2) & 9.3(0.9) & 9.7(1.4) & 7.7(0.5) & 9.0(1.4) & 6.7(0.7) \\
\botrule
\end{tabular}
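The representation-error effect discussed before Table 1 can be checked with a small toy computation. The sketch below compares the mean distance from each point to its nearest induced point under random and k-means subsampling as the number of induced points grows; it is a hypothetical illustration on a synthetic spiral-like cloud, not part of the MNIST experiment, and all names are ours.

```python
# Toy comparison of random vs. k-means subsampling of induced points.
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(3)
theta = rng.uniform(0, 4 * np.pi, size=10000)
X = np.column_stack([theta * np.cos(theta), theta * np.sin(theta)])     # spiral-like point cloud

for s in (100, 500, 1000):
    centers_km, _ = kmeans(X, s)                                        # k-means induced points
    centers_rd = X[rng.choice(len(X), s, replace=False)]                # randomly subsampled induced points
    _, d_km = vq(X, centers_km)                                         # distance to nearest induced point
    _, d_rd = vq(X, centers_rd)
    print(f"s={s}: k-means error {d_km.mean():.3f}, random error {d_rd.mean():.3f}")
```

In this toy setting the k-means centers typically yield a smaller representation error than random subsampling at the same s, which is consistent with the discussion above.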
The number of local induced points determines the sparsity in FLGP. By definition, as decreases, the number of nonzero elements in each row of the kernel matrix decreases, so the cross transition matrix from to becomes sparser. The random walk is then restricted to move within those local induced points. In most situations, the introduced sparsity improves the prediction performance and speeds up the computation. Therefore, we recommend choosing as small as possible. For example, let . When and , error rates with different are summarized in Table 2. The influence of on the error rates is less pronounced than that of . The mechanism of sparsity is complicated: in this example, FLGP attains the lowest error rates when , but the trend is not clear-cut. However, since the sparsity accelerates the computation without sacrificing performance, appears to be the best choice.
\begin{tabular}{lcccccc}
\toprule
$r$ & \multicolumn{2}{c}{3} & \multicolumn{2}{c}{5} & \multicolumn{2}{c}{7} \\
$m$ & 100 & 200 & 100 & 200 & 100 & 200 \\
\midrule
SRFLGP & 18.4(3.0) & 13.8(0.9) & 19.3(2.2) & 14.4(1.3) & 18.2(2.6) & 14.3(1.4) \\
SKFLGP & 8.9(1.3) & 7.4(0.8) & 10.2(1.4) & 8.4(1.6) & 9.7(1.0) & 7.8(1.1) \\
LRFLGP & 21.7(3.5) & 16.2(1.1) & 23.0(2.3) & 17.1(1.6) & 21.6(2.3) & 17.2(1.0) \\
LKFLGP & 9.7(1.4) & 7.7(0.5) & 13.0(1.5) & 9.3(1.1) & 12.8(1.7) & 9.9(0.8) \\
\botrule
\end{tabular}
The number of eigenpairs used determines how well the heat kernel is approximated by the graph Laplacian. As increases, more eigenpairs are used to construct the covariance matrix. On the one hand, the heat kernel is approximated better by the top eigenpairs of the Dirichlet-Laplace operator. On the other hand, the approximation error of those top eigenpairs by the graph Laplacian grows at the same time. So the precise role of is unclear. Fortunately, Weyl's law implies that the spectrum of the heat kernel decays quickly, so has little influence on the performance. When and , error rates with different are summarized in Table 3. Taking the standard deviations into account, the error rates with different are indistinguishable in FLGP. For example, in LKFLGP, when , the error rates with are , respectively. The performance is similar across different .
\begin{tabular}{lcccccc}
\toprule
$M$ & \multicolumn{2}{c}{100} & \multicolumn{2}{c}{200} & \multicolumn{2}{c}{300} \\
$m$ & 100 & 200 & 100 & 200 & 100 & 200 \\
\midrule
SRFLGP & 19.1(3.7) & 14.2(1.3) & 18.4(3.0) & 13.8(0.9) & 19.0(2.0) & 14.4(1.2) \\
SKFLGP & 10.3(3.0) & 7.1(0.4) & 8.9(1.3) & 7.4(0.8) & 9.4(1.0) & 7.4(1.0) \\
LRFLGP & 20.9(4.2) & 15.8(1.0) & 21.7(3.5) & 16.2(1.1) & 20.6(2.9) & 16.6(1.2) \\
LKFLGP & 11.7(3.7) & 7.5(0.8) & 9.7(1.4) & 7.7(0.5) & 10.7(1.7) & 8.2(0.6) \\
\botrule
\end{tabular}
12 Scalable Gaussian process regression in FLGP
Let and be the matrices. Let and belong to . Denote by the matrix and denote by the matrix. In the Gaussian process regression, rewrite the log marginal likelihood as
Using the spectral approximation of the heat kernel, we have , where and is a diagonal matrix whose -th diagonal element is . Then by the matrix inversion lemma A.9 in Rasmussen and Williams (2006), we have
(13)
Next, define . Then the inversion of the matrix reduces to the inversion of the matrix. Denote the Cholesky decomposition of by . Then, by Equation A.10 in Rasmussen and Williams (2006), the determinant is given by
(14)
where is the -th diagonal element of . Since the complexity of the Cholesky decomposition is , the complexity of the log marginal likelihood using (13) and (14) is . Let . Moreover, the partial derivatives are
where . In FLGP, we have and , where is a diagonal matrix whose -th diagonal element is . Then
According to (13),
Then the computational complexity of the derivatives is if . For prediction, let be the covariance matrix between the unlabeled and labeled points; then , where the columns of are the corresponding eigenvectors. By the conditional Gaussian distribution, the posterior expectation at an unlabeled point is . The complexity is if . Therefore, the total computational complexity of the domain-adaptive Gaussian process regression is , which is less than when .
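To make the computation above concrete, the following sketch evaluates the log marginal likelihood of a low-rank Gaussian process with covariance $C = \sigma_f^2 \Phi\,\mathrm{diag}(e^{-t\lambda})\,\Phi^\top + \sigma_n^2 I$ using the matrix inversion lemma and a Cholesky factorization of an $M \times M$ matrix, so the cost is of order $nM^2$. It is a self-contained toy illustration under our own naming conventions, not the FLGP implementation; the eigenpairs are random stand-ins for the graph Laplacian output.

```python
# Low-rank GP log marginal likelihood via the matrix inversion lemma (sketch).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def low_rank_gp_logml(y, Phi, lam, t, sigma_f, sigma_n):
    """Log marginal likelihood of y ~ N(0, C) with C = U U^T + sigma_n^2 I,
    where U = sigma_f * Phi * diag(exp(-t*lam))**0.5 has shape n x M."""
    n, M = Phi.shape
    U = sigma_f * Phi * np.exp(-0.5 * t * lam)      # n x M, column scaling
    B = sigma_n**2 * np.eye(M) + U.T @ U            # M x M matrix to factorize
    cB, lower = cho_factor(B)

    # Quadratic form y^T C^{-1} y via the matrix inversion lemma.
    Uty = U.T @ y
    quad = (y @ y - Uty @ cho_solve((cB, lower), Uty)) / sigma_n**2

    # log|C| via the matrix determinant lemma and the Cholesky factor of B.
    logdet = 2 * np.sum(np.log(np.diag(cB))) + (n - M) * np.log(sigma_n**2)

    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))

# Toy usage with random orthonormal eigenvectors and random eigenvalues.
rng = np.random.default_rng(2)
n, M = 5000, 100
Phi, _ = np.linalg.qr(rng.normal(size=(n, M)))      # orthonormal columns
lam = np.sort(rng.random(M)) * 10.0
y = rng.normal(size=n)
print(low_rank_gp_logml(y, Phi, lam, t=0.5, sigma_f=1.0, sigma_n=0.1))
```

The same Cholesky factor can be reused for the derivative and prediction formulas discussed above, which is what keeps the overall cost linear in the sample size.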