Geometry-Aware Hierarchical Bayesian Learning on Manifolds
Abstract
Bayesian learning with Gaussian processes demonstrates encouraging regression and classification performance in solving computer vision tasks. However, Bayesian methods on 3D manifold-valued vision data, such as meshes and point clouds, are seldom studied. One of the primary challenges is how to effectively and efficiently aggregate geometric features from the irregular inputs. In this paper, we propose a hierarchical Bayesian learning model to address this challenge. We first introduce a kernel with the properties of geometry-awareness and intra-kernel convolution. This enables geometrically reasonable inferences on manifolds without using any specific hand-crafted feature descriptors. We then use a Gaussian process regression to organize the inputs and finally implement a hierarchical Bayesian network for the feature aggregation. Furthermore, we couple the feature learning of neural networks with the feature aggregation of Bayesian models to investigate the feasibility of jointly learning on manifolds. Experimental results not only show that our method outperforms existing Bayesian methods on manifolds but also demonstrate the prospect of coupling neural networks with Bayesian networks.
1 Introduction
Three-dimensional data on Riemannian manifolds, such as the triangle meshes and point clouds shown in Figure 1, is widely used to describe shape information in object understanding, scene understanding, and many other vision tasks. Extracting and aggregating geometric features is considered the key to leveraging the intrinsic shape information of this type of data [9]. Recently, Gaussian process (GP) based Bayesian learning has emerged as an active research topic [59, 5, 21]. Theoretically, it has been proven that a single fully connected neural network (NN) layer with infinite width is essentially a GP [41]. Further, this equivalence has been extended to deep fully connected NNs and hierarchically connected GPs [25, 35]. Practically, encouraging results have been demonstrated in various applications [7]. In this work, we focus on developing GP-based Bayesian learning methods for solving vision tasks on manifolds. Specifically, we concentrate on two fundamental aspects: GP kernel design and the Bayesian learning framework on manifolds.
Since the property of a zero-mean GP is largely determined by its kernel function [48], a primary goal is to develop an expressive kernel. Essentially, we aim at integrating two important characteristics into a shallow kernel structure: geometry-awareness and intra-kernel convolution.

Geometry-awareness stresses the capability of learning geometric features in the prior knowledge so that the posterior inference respects the representative regions of a 3D shape [22]. For example, when distinguishing different human poses, the red-circled regions of interest (ROIs) in Figure 1(a) are expected to be numerically highlighted because their regional features are more geometrically significant. The current strategy for achieving geometry-awareness is to directly add geometric feature descriptors to the kernel design [23, 20]. However, this strategy relies heavily on computing specific hand-crafted features, which limits its generality to broader types of applications. We propose a paradigm shift where only the point coordinates are needed. This enables our kernel to be geometry-aware on all commonly-used types of manifold-valued data.
Intra-kernel convolution introduces convolutional filtering into the kernel construction so that the GP inference has a powerful feature aggregation ability [59]. This characteristic has been widely studied to increase the expressiveness of GPs [60, 16]. Most previous work uses an additive patch-wise computational structure to explicitly mimic the mechanism of convolutional NNs (CNNs) [15, 59]. However, this approach is not feasible for manifold-valued data because of its off-the-grid structure. As one attempt, the graph convolutional GP (GCGP) in [60] used local coordinates to adjust the inputs to a uniform on-the-grid style, but it suffers from drawbacks such as a huge computational cost and a strict requirement on the input size. Alternatively, we propose an implicit intra-kernel convolution: we rigorously show that convolutional filtering can be embedded into the kernel definition itself.
In this paper, we propose a hierarchical Bayesian model for manifold-valued tasks. The core is a kernel derived from a stochastic partial differential equation (SPDE) that generalizes a real physical process called the periodic potential diffusion process. Two observations explaining why we choose this particular physical process are discussed in Sec. 4.1. Mathematically, we prove that the kernel implicitly integrates both the mean curvature flow, which is an effective geometric feature descriptor, and a convolutional filtering. For tackling the irregular input dimensionality, we first use a GP-based salient point selection algorithm to obtain a uniform and light input, and then feed it to the Bayesian network. Additionally, because the input of a Bayesian model on manifolds is the geometric features and NNs are strong in learning expressive features, we explore the potential of incorporating NNs with hierarchical Bayesian methods to leverage the strengths of both. Our contributions are threefold:
(1) A kernel with both geometry-awareness and intra-kernel convolution properties. No hand-crafted feature is needed for involving geometric properties, and the method is applicable to all commonly-used manifold-valued data;
(2) A Bayesian network for manifold-valued tasks, including a salient point selection module that non-linearly reduces the data dimensionality and organizes the irregular inputs;
(3) An exploration on NN+Bayesian approaches to leverage both the feature learning ability of NNs and the feature aggregation ability of the Bayesian methods.
Both empirical and numerical experimental results verify the effectiveness of our methods. We hope this work not only makes contributions to the Bayesian learning methods on manifolds but also sheds new light on integrating different learning mechanisms to maximize their learning power.
2 Related Work
Kernel design has always been an important topic in GP-based Bayesian learning studies. Our initial thought originated from the problem addressed by Stein in [56]: the infinite differentiability of a Gaussian kernel leads to an unrealistic match with physical processes. Later, Stein proposed the well-known Matérn kernel family as a generalization of the Gaussian radial basis function (RBF) to solve this problem. An intriguing idea follows: why not derive a kernel function directly from the expression of a physical process? In this way, the kernel intrinsically comes with a real physical rationale. This idea was further strengthened by Särkkä's statement [50, 54] that any SPDE is a potential kernel. Given the fact that many physical processes are generalized by SPDEs, it is intuitive to combine the above two opinions as a solid theoretical foundation of kernel development [39]. For example, the Matérn kernel is actually the solution of a linear fractional SPDE [13, 51, 28].
However, classical kernels mainly deal with data in the Euclidean space and pay less attention to Riemannian manifolds. One solution is to use Riemannian metrics and space mapping techniques to achieve the domain transform [19, 40], but this approach is not feasible for data like volumetric meshes and point clouds. Alternatively, wrapping geometric features into kernels has been proven effective [45, 10, 38, 22, 20]. For example, the weighted GP (W-GP) [23] yields reasonable inferences after weighting the RBF kernel with the mean curvature and Gaussian curvature; the morphometric GP [20] uses the wave kernel signature metric and demonstrates good performance on 3-dimensional manifolds. These methods use explicit geometric expressions, which often rely on specific simplicial complexes. Conversely, we implement an implicit geometric expression that is only sensitive to the distance lag.
Bayesian learning architecture is also an emerging topic. Additive GP [15] directly enabled the convolutional GP (CGP) [59]. We also use additive structure in our kernel design. The aforementioned GCGP is an implementation of CGP on graphs. It adjusted the irregular inputs by referring to the method used in graph convolutional networks (GCNs). Instead, we use a manifold learning strategy to organize the inputs. The studies of sparse variational GP and posterior estimation facilitated the development of deep GP (DGP) [14, 52, 29, 5]. In our hierarchical Bayesian model, we follow the framework of DGPs and use the doubly stochastic variational inference method [49].
3 Preliminaries
Some notations are defined here. Consider manifold-valued data with a vertex set $V$ of size $n_v$, an edge set $E$ of size $n_e$, and a face set $F$ of size $n_f$; $E$ and $F$ can be empty. The distance lag between a vertex and a point in its neighborhood is denoted as $r$. Define a GP, $f \sim \mathcal{GP}(m(\cdot), k(\cdot,\cdot))$, as a random process where the joint distribution of a finite collection of observations follows a multivariate Gaussian distribution: $f(X) \sim \mathcal{N}(m(X), K(X, X))$, where $m(\cdot)$ is the mean function and $k(\cdot,\cdot)$ is the covariance function or kernel. The dimension of a kernel matrix is denoted by subscripts.
3.1 Periodic Potential Diffusion Process
Our kernel derivation originates from the periodic potential diffusion process. It is a special case of the reaction diffusion process. A reaction diffusion process is the solution to a reaction diffusion equation [18, 57]:
$$\frac{\partial u(x,t)}{\partial t} = c\,\Delta u(x,t) + f(x,t) \qquad (1)$$
where $\Delta$ is the Laplace operator and the constant $c$ is 1. The initial condition is $u(x,0)=0$. $f(x,t)$ is the reaction function that defines the property of the energy source [17]. Given a certain $f$, there exists a corresponding physical scenario [57, 42]. The reaction function can be expressed as the product of a Dirac delta function at location $x_0$ and a temporal function $g(t)$: $f(x,t) = \delta(x - x_0)\, g(t)$. Defining the Green's function of the Laplace operator under the Dirichlet boundary condition as $G(x, x'; t)$ [3], $u(x,t)$ is equal to:
$$u(x,t) = \int_0^t G(x, x_0;\, t - s)\, g(s)\, ds \qquad (2)$$
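To make Eqs. (1)-(2) concrete, the following is a minimal numerical sketch (not part of the proposed method) that simulates a 1D reaction-diffusion process driven by a periodic point source placed at one grid node, using explicit finite differences. The choice of source $g(t)=\cos(\omega t)$, the grid, and all parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal 1-D sketch of du/dt = c * d2u/dx2 + delta(x - x0) * cos(w * t),
# solved with explicit finite differences. All parameters are illustrative assumptions.
c, w = 1.0, 2.0 * np.pi          # diffusion constant and assumed source frequency
nx, nt = 201, 5000               # grid resolution and number of time steps
dx = 1.0 / (nx - 1)
dt = 0.4 * dx ** 2 / c           # step size respecting the explicit stability limit
u = np.zeros(nx)                 # initial condition u(x, 0) = 0
i0 = nx // 2                     # index of the point-source location x0

for n in range(nt):
    lap = np.zeros(nx)
    lap[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx ** 2   # discrete Laplace operator
    source = np.zeros(nx)
    source[i0] = np.cos(w * n * dt) / dx                     # discrete Dirac delta times g(t)
    u = u + dt * (c * lap + source)
    u[0] = u[-1] = 0.0                                       # Dirichlet boundary condition

# u now approximates the periodic potential diffusion response of Eq. (2)
print("response at the source after %d steps: %.4f" % (nt, u[i0]))
```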
3.2 Gaussian Process Regression
A GP regression (GPR) aims at learning a multivariate distribution that fits the training data and predicts the observation when a testing sample arrives [48]. Bayes' theorem is used to transform the prior knowledge into the posterior inference in the learning process. As is known, every finite marginal distribution of a GP still follows a multivariate Gaussian distribution. Therefore, the predictive distribution can be uniquely determined by the standard rules for conditioning Gaussian distributions:
$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K_{XX} + \sigma_n^2 I & K_{X*} \\ K_{*X} & K_{**} \end{bmatrix}\right) \qquad (4)$$
$$p(\mathbf{f}_* \mid X, \mathbf{y}, X_*) = \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*) \qquad (5)$$
$$\boldsymbol{\mu}_* = K_{*X}\,(K_{XX} + \sigma_n^2 I)^{-1}\,\mathbf{y} \qquad (6)$$
$$\Sigma_* = K_{**} - K_{*X}\,(K_{XX} + \sigma_n^2 I)^{-1}\,K_{X*} \qquad (7)$$
When new samples are continuously given, the GP is recursively updated. Later, we adopt the framework of GPR in a salient point selection algorithm. Each salient point is taken as a sample. We update the saliency map after adding the previous selection into the prior and then select the next one until a certain number of salient points are collected.
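For reference, below is a minimal NumPy sketch of the conditioning rules in Eqs. (4)-(7), with an RBF kernel standing in for the kernel of choice; the kernel, the toy data, and the hyper-parameter values are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """Stand-in stationary kernel k(r) = exp(-r^2 / (2 ell^2)); any valid kernel works here."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / ell ** 2)

def gpr_predict(X, y, Xs, noise=1e-2):
    """Posterior mean and covariance of Eqs. (6)-(7) for a zero-mean GP."""
    K = rbf(X, X) + noise * np.eye(len(X))      # K_XX + sigma_n^2 I
    Ks = rbf(Xs, X)                             # K_*X
    Kss = rbf(Xs, Xs)                           # K_**
    alpha = np.linalg.solve(K, y)               # (K_XX + sigma_n^2 I)^{-1} y
    mu = Ks @ alpha                             # Eq. (6)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # Eq. (7)
    return mu, cov

# toy 1-D example (illustrative)
X = np.linspace(0, 1, 8)[:, None]
y = np.sin(2 * np.pi * X).ravel()
Xs = np.linspace(0, 1, 50)[:, None]
mu, cov = gpr_predict(X, y, Xs)
print(mu.shape, np.diag(cov).max())   # predictive mean and largest predictive variance
```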
3.3 Deep Gaussian Processes with Doubly Stochastic Variational Inference
A DGP is a deep belief network that hierarchically concatenates multiple Gaussian process latent variable models (GP-LVMs) together [14]. It mimics the composition of restricted Boltzmann machines (RBMs) in NNs. Sparse variational inference is usually used in GPR to estimate the posterior and avoid the cubic complexity [52]: with $M$ inducing points, the complexity of a single GPR decreases to $\mathcal{O}(NM^2)$. For a DGP, the doubly stochastic variational inference is often applied to estimate the posterior [49, 5]. Specifically, sparse variational inference is used to simplify the correlations within layers while keeping the correlations between layers unchanged. In a DGP with $L$ layers, the prior is recursively defined on a series of vector-valued stochastic functions $F^1, \ldots, F^L$. The $i$-th row of $F^l$ is denoted as $\mathbf{f}_i^l$. The function values at the inducing points $Z^{l-1}$ are $U^l$. Each function has an independent Gaussian prior and its own inducing points. The joint density of a DGP can be expressed as:
$$p\big(\mathbf{y}, \{F^l, U^l\}_{l=1}^{L}\big) = \prod_{i=1}^{N} p\big(y_i \mid \mathbf{f}_i^L\big) \prod_{l=1}^{L} p\big(F^l \mid U^l; F^{l-1}, Z^{l-1}\big)\, p\big(U^l; Z^{l-1}\big) \qquad (8)$$
According to the theories of variational inference, a factorized form of the posterior joint density is defined as [49]:
$$q\big(\{F^l, U^l\}_{l=1}^{L}\big) = \prod_{l=1}^{L} p\big(F^l \mid U^l; F^{l-1}, Z^{l-1}\big)\, q\big(U^l\big) \qquad (9)$$
where $q(U^l) = \mathcal{N}(U^l \mid \mathbf{m}^l, S^l)$ is a Gaussian with mean $\mathbf{m}^l$ and covariance $S^l$ for layer $l$. Eq. (9) indicates that the prediction of the $l$-th layer, $F^l$, depends on the previous prediction $F^{l-1}$ and the inducing points of the current layer. By marginalising the approximation from each layer, the factorized variational posterior of the final layer is the integral over all paths through the Gaussian distributions defined by the parameters $\mathbf{m}^l$, $S^l$, and $Z^{l-1}$:
$$q\big(\mathbf{f}_i^L\big) = \int \prod_{l=1}^{L-1} q\big(\mathbf{f}_i^l \mid \mathbf{m}^l, S^l; \mathbf{f}_i^{l-1}, Z^{l-1}\big)\, d\mathbf{f}_i^l \qquad (10)$$
The objective function is the doubly stochastic evidence lower bound (ELBO):
$$\mathcal{L}_{\mathrm{ELBO}} = \sum_{i=1}^{N} \mathbb{E}_{q(\mathbf{f}_i^L)}\big[\log p\big(y_i \mid \mathbf{f}_i^L\big)\big] - \sum_{l=1}^{L} \mathrm{KL}\big[q\big(U^l\big)\,\big\|\, p\big(U^l; Z^{l-1}\big)\big] \qquad (11)$$
where $\mathrm{KL}[\cdot\,\|\,\cdot]$ is the Kullback–Leibler divergence. The ELBO has complexity $\mathcal{O}\big(NM^2(D^1 + \cdots + D^L)\big)$ to compute, where $D^l$ is the size of the $l$-th layer. The variational expected likelihood in Eq. (11) is computed using a Monte Carlo approximation. Please refer to [49] for more details.
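The structure of Eq. (11) can be sketched with torch.distributions: Monte Carlo samples of the final-layer outputs give the expected log-likelihood term, and a per-layer KL between the variational and prior distributions over inducing outputs gives the regularizer. The Gaussian likelihood, the sample shapes, and the random placeholder distributions below are illustrative assumptions; in the actual model they come from the GP layers.

```python
import torch
from torch.distributions import Normal, MultivariateNormal, kl_divergence

def doubly_stochastic_elbo(y, f_L_samples, q_u, p_u, noise=0.1):
    """Sketch of Eq. (11): MC expected log-likelihood minus per-layer KL terms.

    y           : (N,) observations
    f_L_samples : (S, N) Monte Carlo samples of the final-layer function values
    q_u, p_u    : lists of per-layer variational / prior distributions over inducing outputs
    """
    # E_q[log p(y_i | f_i^L)], estimated by averaging over the S samples
    log_lik = Normal(f_L_samples, noise).log_prob(y).mean(dim=0).sum()
    # sum_l KL[q(U^l) || p(U^l)]
    kl = sum(kl_divergence(q, p).sum() for q, p in zip(q_u, p_u))
    return log_lik - kl

# toy placeholders standing in for a 2-layer DGP with 8 inducing points per layer
N, S, M = 16, 10, 8
y = torch.randn(N)
f_L_samples = torch.randn(S, N)
q_u = [MultivariateNormal(torch.zeros(M), 0.5 * torch.eye(M)) for _ in range(2)]
p_u = [MultivariateNormal(torch.zeros(M), torch.eye(M)) for _ in range(2)]
print(float(doubly_stochastic_elbo(y, f_L_samples, q_u, p_u)))
```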
4 Methods
4.1 Geometry-Aware Convolutional Kernel
In this section, we start by briefly explaining what motivates us to choose the periodic potential diffusion process as the theoretical basis. Then, we provide two implementations of the kernel for different applications. Furthermore, we introduce our theoretical analysis about the property of the kernel. In the end, a hierarchical Bayesian model is defined.
The idea of using periodic potential diffusion process comes from two observations:
The first observation is that the integral Laplace transform in Eq. (12) (a special upper incomplete gamma function, where $\nu$ is a constant) has several terms in common with the Matérn kernel in Eq. (13):
$$\int_0^\infty t^{\nu - 1}\, e^{-\frac{a}{t} - p t}\, dt = 2\left(\frac{a}{p}\right)^{\nu/2} K_\nu\!\left(2\sqrt{a p}\right) \qquad (12)$$
$$k_{\mathrm{Mat\acute{e}rn}}(r) = \sigma^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right) \qquad (13)$$
Both equations contain the modified Bessel function of the second kind $K_\nu$, and the remaining parts are functions of the same order $\nu$. This indicates the possibility of deriving a kernel from Eq. (12).
The second observation is that the Green's function of the Laplace-Beltrami operator in the 3D diffusion problem belongs to the family of radial basis functions. Bochner's theorem states that a stationary kernel is positive definite if it is the Fourier transform of a positive bounded measure [6]. Taking Eq. (3) as the real part of a Fourier transform (decomposing the exponential term with Euler's formula), it is already a stationary kernel function regarding the temporal variable $t$. If the spatial part is also proven to be positive semi-definite (PSD), then the periodic potential diffusion process is a valid kernel function.
Summarizing these two observations and incorporating the background in Sec. 3.1, our goal is to derive a closed-form expression from Eq. (3) and prove that this expression is PSD regarding its spatial variable. Unfortunately, the integral in Eq. (3) has no explicit solution according to [50].
Alternatively, we apply the same approximation strategy as in [21] to estimate Eq. (3) as the combination of a cosine Fourier transform and a sine Fourier transform:
(14)
By solving this approximated form, we have the closed-form periodic potential diffusion process:
(15)
For simplicity, we define a frequency term and a phase term; both are hyper-parameters that are determined by regular parameter tuning methods. Since the diffusion process is dynamic, we can express it as the accumulation of values at a number of time slots. Therefore, the final kernel is expressed as an additive kernel [15]:
(16)
For regression tasks, the implementation in Eq. (16) is sufficient. But for better fitting with a hierarchical Bayesian model, we further introduce an implementation by taking the kernel as an ARD covariance function [24]. An individual length-scale parameter is added for each input dimension which determines the relevancy of the input to the task:
(17)
GPs defined with Eq. (16) or Eq. (17) can be taken as mixtures of GPs. Previous studies show that Gaussian mixtures have a universal approximation ability in fitting continuous distributions [58, 43].
The next goal is to prove that Eq. (16) is PSD regarding its spatial variable. Eq. (16) is the composition of two functions: an exponential function and a cosine function. It is acknowledged that the exponential function and the cosine function are PSD regarding the distance lag. According to the composition properties of PSD functions, the product of a positive real function/constant and a PSD function is still PSD, and so are products and sums of PSD functions [4]. Therefore, Eq. (16) is spatially PSD and can be taken as a compositional kernel.
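This compositional structure can be sketched numerically. The snippet below sums exponential-times-cosine terms over several "time slots" and verifies that the resulting Gram matrix is PSD. The specific functional form, the omission of the phase term, and all hyper-parameter values are assumptions for illustration only; the actual closed form of the GAC kernel is given by Eqs. (15)-(16).

```python
import numpy as np

def gac_like_gram(X, lengthscales, freqs, weights):
    """Gram matrix of an assumed additive exp-times-cos kernel:
       k(r) = sum_t w_t * exp(-r^2 / (2 * l_t^2)) * cos(f_t * r).
       Each summand is a product of two PSD kernels (Schur product theorem),
       so the sum is PSD, mirroring the argument made for Eq. (16)."""
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distance lags
    K = np.zeros_like(r)
    for l, f, w in zip(lengthscales, freqs, weights):
        K += w * np.exp(-0.5 * (r / l) ** 2) * np.cos(f * r)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                        # toy point coordinates
K = gac_like_gram(X, lengthscales=[0.5, 1.0, 2.0],  # illustrative hyper-parameters
                  freqs=[1.0, 2.0, 4.0], weights=[1.0, 0.5, 0.25])
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())        # >= 0 up to numerical precision
```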

Eq. (1) indicates a family of kernels. This observation is generalized as:
Theorem 1.
A real-valued function $u(x,t)$ is a spatial-temporal kernel function if it is a linear/non-linear diffusion process: $\frac{\partial u}{\partial t} = c\,\Delta u + g(t)\,\delta(x - x_0)$, where $c$ is a positive constant, $g(t)$ is a periodic function, $\delta$ is the Dirac delta function, and $\Delta$ is the Laplace operator.
Proof.
The proof is provided in Supplementary. ∎
The kernel implemented in Eq. (16) or Eq. (17) is named the geometry-aware convolutional (GAC) kernel. The GAC kernel satisfies the two characteristics discussed in the Introduction, which are summarized as two lemmas:
Lemma 1.
The GAC kernel embeds the mean curvature flow in its construction, which enables it to be geometry-aware.
Lemma 2.
The GAC Kernel embeds a convolution filtering within the kernel structure, called intra-kernel convolution.
Proof.
The proofs are provided in the Supplementary. ∎
We validate the proposed GAC kernel in two different studies. The first is to use it in a regular GPR model, where an unsupervised salient point selection algorithm is introduced; this implementation fully utilizes the geometry-awareness property. The second is to adopt it in a Bayesian network layer, where hierarchical deep Bayesian learning models are discussed; the feature aggregation benefits from the intra-kernel convolution property.
4.2 Unsupervised Salient Point Selection
GP regression is widely used in spatial inference, where it is known as the "Kriging" method [48, 22]. Inspired by landmarking in face recognition [61], we use a GPR-based unsupervised salient point selection to process the irregular inputs. This method is essentially a manifold learning technique [31, 37, 47]; therefore, applying it also achieves non-linear dimensionality reduction. One advantage of GPR is the availability of uncertainty estimation [48]. We leverage this advantage and define the saliency score as the variance-based uncertainty [65]. By iteratively selecting new salient points and adding previously selected points to the prior, we successively collect a set of salient points. A geometry-aware kernel guarantees that the salient points are representative of the original massive data. This strategy has been successfully applied in [37, 22, 20].
Suppose the set of selected salient points is denoted as $S$. The time variable of the additive kernel is discretized into a vector of equally spaced time slots. A multi-frequency, multi-phase GAC kernel is defined in a weighted squared form (the kernel matrix is symmetric), where the weight is a diagonal matrix whose diagonal entries are the sums of the absolute values of each row of the GAC kernel matrix. The saliency score of a vertex when selecting the next salient point is defined as:
(18)
(19)
(20)
Only the point with the highest uncertainty score is selected as the next salient point. The first salient point is the vertex with the maximum prior variance. From Eqs. (18)-(20) we can see that a newly-selected salient point is added to the prior knowledge, and the next saliency score is determined by the previous selections. The whole process follows a GPR framework. The algorithm is summarized in Algorithm 1. Figure 1(b) shows examples of selecting salient points on point clouds, and Figure 4 demonstrates salient points on triangle meshes. Further evaluation results are available in the Experiments.
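A minimal sketch of this selection loop is given below: at each step the GP posterior variance is recomputed conditioned on the already-selected points, and the vertex with the highest variance becomes the next salient point. The plain RBF kernel and the small jitter are stand-ins for the weighted GAC kernel and the exact scoring of Eqs. (18)-(20).

```python
import numpy as np

def posterior_variance(K, selected):
    """Diagonal of the GP posterior covariance conditioned on the selected prior points."""
    if not selected:
        return np.diag(K).copy()
    S = np.array(selected)
    K_ss = K[np.ix_(S, S)] + 1e-8 * np.eye(len(S))   # jitter for numerical stability
    K_vs = K[:, S]
    return np.diag(K) - np.einsum('ij,ij->i', K_vs @ np.linalg.inv(K_ss), K_vs)

def select_salient_points(X, num_points, ell=0.5):
    """Greedy max-posterior-variance selection; a stand-in RBF kernel replaces the GAC kernel."""
    d2 = np.sum(X ** 2, 1)[:, None] + np.sum(X ** 2, 1)[None, :] - 2.0 * X @ X.T
    K = np.exp(-0.5 * d2 / ell ** 2)
    selected = []
    for _ in range(num_points):
        var = posterior_variance(K, selected)
        if selected:
            var[selected] = -np.inf                   # never re-select a point
        selected.append(int(np.argmax(var)))          # highest uncertainty = next salient point
    return selected

X = np.random.default_rng(1).normal(size=(500, 3))    # toy vertex coordinates
print(select_salient_points(X, 10))
```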
4.3 Hierarchical Bayesian Model on Manifolds
We define a GAC-GP layer with the GAC kernel and follow the framework of DGPs to construct a hierarchical Bayesian learning model by stacking up multiple GP layers. Thanks to the intra-kernel convolution property, the GAC-GP layer has a good feature aggregation ability. The computational pipeline of a pure hierarchical Bayesian learning model on manifolds is shown in Figure 3. The first step is to process the input. Assume $m$ salient points are selected by Algorithm 1 and the feature on each salient point is a vector of length $K$. For each shape, we concatenate the features of all salient points in the order of their selection. Note that all shapes here belong to the same dataset and all salient points are selected with the same parameter setting in Algorithm 1; otherwise, point-to-point registration is needed to concatenate features. Suppose $N$ shapes are used; the input is then an $N \times mK$ matrix. We compose a sequence of layers that map the input to its label in a hierarchical Bayesian model for classification:
$$\mathbf{y} = \big(f^L \circ f^{L-1} \circ \cdots \circ f^1\big)(\mathbf{X}) \qquad (21)$$
The output of a hidden layer is a vector of size $D^l$, where $D^l$ is the layer size. This is similar to the relationship between input and output channels in a NN. When batch processing is applied, the output of each hidden layer has dimension $B \times D^l$, where $B$ is the batch size. A final layer is appended with a softmax multi-class likelihood. The output vector has dimension $C$, where $C$ is the number of classes, and each entry stands for the probability of belonging to a certain class. An arbitrary number of GP layers can be added as hidden layers. We use the doubly stochastic variational inference approach to estimate the posterior [49]. The optimization process maximizes the ELBO in Eq. (11). The K-means method is used to choose the inducing points.
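The input assembly described above can be sketched as follows: per-shape features of the $m$ salient points (each of length $K$) are concatenated in selection order into one row, producing the $N \times mK$ matrix fed to the stacked GP layers. The random features and shapes below are placeholders.

```python
import numpy as np

def assemble_inputs(per_shape_features):
    """Stack per-shape salient-point features into the N x (m * K) input matrix.

    per_shape_features: list of arrays, one per shape, each of shape (m, K),
                        with rows ordered by the selection order of Algorithm 1.
    """
    rows = [feats.reshape(-1) for feats in per_shape_features]   # concatenate in selection order
    return np.stack(rows, axis=0)

# toy example: N = 4 shapes, m = 50 salient points, K = 30 feature entries (e.g. SIWKS values)
rng = np.random.default_rng(0)
features = [rng.normal(size=(50, 30)) for _ in range(4)]
X = assemble_inputs(features)
print(X.shape)   # (4, 1500): one row per shape, ready for the GP layers and softmax likelihood
```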


Because the input of the Bayesian model consists of point-wise features on manifolds, and NNs are strong in feature learning, we are inspired to further explore the potential of NN+Bayesian methods. Such a mixed model can take advantage of both the feature learning ability of NNs and the feature aggregation ability of Bayesian models. We generally follow the pipeline of deep kernel learning [63]. The input format is determined by the NN part, whose output is the shape feature. The feature is then fed into a Bayesian model, and the subsequent processing is the same as in the pure Bayesian method. The negative marginal log-likelihood (MLL) is used as the loss function.
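A minimal sketch of this NN+Bayesian coupling in the deep kernel learning style [63] is shown below: a small feature extractor maps the raw input to shape features, a GP marginal likelihood is placed on those features, and the negative MLL is minimized jointly. The two-layer MLP, the RBF kernel, and the regression likelihood are placeholder choices, not the PointNet++/GAC configuration used in the experiments.

```python
import torch
import torch.nn as nn

class DeepKernelGP(nn.Module):
    """NN feature extractor + exact GP marginal likelihood on the extracted features (a sketch)."""
    def __init__(self, in_dim, feat_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.log_noise = nn.Parameter(torch.tensor(-2.0))
        self.log_lengthscale = nn.Parameter(torch.tensor(0.0))

    def neg_mll(self, X, y):
        Z = self.net(X)                                        # learned shape features
        d2 = torch.cdist(Z, Z) ** 2
        K = torch.exp(-0.5 * d2 / torch.exp(self.log_lengthscale) ** 2)
        K = K + torch.exp(self.log_noise) * torch.eye(len(X))  # add the noise term
        dist = torch.distributions.MultivariateNormal(torch.zeros(len(X)), covariance_matrix=K)
        return -dist.log_prob(y)                               # negative marginal log-likelihood

# joint training of the NN features and GP hyper-parameters on toy data
X, y = torch.randn(32, 10), torch.randn(32)
model = DeepKernelGP(in_dim=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = model.neg_mll(X, y)
    loss.backward()
    opt.step()
print(float(loss))
```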
5 Experiments
We evaluate our methods with three experiments. Our method is denoted as GAC-GP. All applications are implemented in PyTorch and GPyTorch with GPU acceleration [24].

5.1 Unsupervised Salient Point Selection
In the first experiment, the task is to select salient points on manifolds. The purpose is to evaluate the geometry-awareness of GAC-GP and its stability under continuous regressions. Additionally, we compare the computational efficiency by recording the average running time. When defining the distance lag, we use the fast marching algorithm to compute geodesic distances and only select the nearest neighboring points for each vertex; the neighborhood size is set to 5.
Three datasets are used [8]: (i) "Mandibular molars", or "molar", contains 116 teeth shapes; (ii) "First metatarsals", or "metatarsal", contains 57 shapes; and (iii) "Distal radii" contains 45 shapes. All shapes are triangle meshes with around 5000 vertices. Examples of each dataset are illustrated in Figure 4(a)-(c). The ROIs of such data are usually the marginal ridges, teeth crowns, and outline contours where the geometric features are rich [8, 12]. Therefore, the meshes in Figure 4(a)-(c) are rendered with the normalized mean curvature as the ground truth [27, 26]. The salient points are expected to distribute evenly in the yellow regions.
Comparison methods include: (1) RBF kernel GP (RBF, RBF-GP) [48], a classical choice; (2) spectral mixture kernel GP (SM, SpectralMixture, SMK-GP) [62], which has shown good performance in many vision tasks; 10 mixtures are used; (3) Matérn kernel GP (Matérn, Matérn-GP) [55]; Matérn kernels with other smoothness orders are excluded because of worse results; (4) mesh saliency (MeshSaliency, MS) [34], a classical saliency detection method; (5) Weighted GP (Weighted, W-GP) [23], a state-of-the-art GP method on manifolds. The same parameter settings are used.
Figure 4(a)-(c) demonstrates the salient point selection results. In (a) and (b), we focus on comparing ours with the W-GP because it is the only GP kernel method that stresses geometry-awareness, and it outperforms all other comparison methods. We show 20, 80, and 140 salient points. When selecting a small number of salient points, both methods present reasonable results, but the accuracy of W-GP gradually drops as the iterations increase, while GAC-GP shows much more stable geometry-awareness. For the other comparison methods, we give examples of 50 salient points on one molar shape in Figure 4(c). Figure 1(b) shows some examples of salient points on point clouds.
Numerically, we define an Accumulated Curvature (AC) value to measure the selection performance, which accumulates the normalized Gaussian curvature and the normalized mean curvature over all selected salient points. Drawing the AC values against the increasing number of salient points forms an AC curve. Within a dataset, we compute the log average of the AC values to plot an average AC curve. This curve reflects the geometry-awareness of different selection methods. When only selecting the points with the largest accumulated curvatures at each iteration and drawing the AC curve with these maximum values, we get the MaxGC+MC AC curve, which stands for the upper bound of saliency selections. The higher and closer to this MaxGC+MC curve an AC curve is, the better the geometry-awareness of the corresponding method. The results are plotted in Figure 5(a)-(c), with the MaxGC+MC AC curve drawn in dashes. Generally, the AC curves of GAC-GP are above those of all other methods. Both the empirical visualizations and the numerical measurements verify that GAC-GP is capable of making geometry-aware inferences on manifolds. More importantly, its geometry-awareness remains consistent and stable after continuous regressions.
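Assuming the AC value is the running sum of the normalized Gaussian and mean curvatures of the selected points (as described above), the AC curve and its MaxGC+MC upper bound can be computed as sketched below; the curvature arrays are random placeholders for the per-vertex normalized curvatures.

```python
import numpy as np

def ac_curve(gc, mc, selection_order):
    """AC values after each selection step (assumed definition: running sum of GC + MC)."""
    per_point = gc[selection_order] + mc[selection_order]
    return np.cumsum(per_point)

def max_ac_curve(gc, mc, n):
    """Upper-bound curve: always pick the remaining vertex with the largest GC + MC."""
    return np.cumsum(np.sort(gc + mc)[::-1][:n])

rng = np.random.default_rng(0)
gc = rng.random(5000)                  # placeholder normalized Gaussian curvature per vertex
mc = rng.random(5000)                  # placeholder normalized mean curvature per vertex
order = rng.permutation(5000)[:140]    # placeholder selection order from some method
print(ac_curve(gc, mc, order)[-1], max_ac_curve(gc, mc, 140)[-1])
```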
The computational efficiency is measured by averaging the running time of selecting one salient point, as shown in Figure 5(d). GPR-based methods gradually slow down due to the increased prior knowledge as shown in Eq. (18)-(20). The GAC-GP enjoys strong computational efficiency considering its superior performance because usually only a small number of salient points are needed.
5.2 Human Pose Retrieval
In the second experiment, the task is to classify different human poses modeled by triangle meshes. The first purpose is to further evaluate the salient point selections by fixing the Bayesian learning architecture; the second purpose is to fix the inputs and evaluate different hierarchical Bayesian learning architectures. The pipeline in Figure 3 is used. We choose the scale-invariant wave kernel signature (SIWKS) [36, 2] as the feature. When computing the SIWKS, the 30 smallest eigenvalues are used; the other parameter settings are the same as those in [2]. Features of 50, 100, and 250 salient points are used. In customizing the Bayesian model, we use the multitask variational strategy and the softmax likelihood in GPyTorch. The number of inducing points is 50. We use Adam as the optimizer with an initial learning rate of 0.001, which is reduced after the first 200 epochs, and train for 2000 epochs. The cost function is the variational ELBO mentioned in Sec. 3.3. The comparison methods are the same as in the prior experiment.
The SHREC14 non-rigid 3D human model is used [44]. It contains 400 triangle meshes of 40 human subjects with 10 poses. Each mesh has about 15000 vertices. The dataset is randomly split into: 90% for training, 5% for validation, and 5% for testing.
# SIWKS salient points | RBF-GP | W-GP | Matérn-GP | SMK-GP | MS | GAC-GP
---|---|---|---|---|---|---
50 | 0.850 | 0.898 | 0.885 | 0.898 | 0.866 | 0.915
100 | 0.859 | 0.901 | 0.886 | 0.908 | 0.872 | 0.921
250 | 0.862 | 0.905 | 0.890 | 0.912 | 0.899 | 0.925
For the first purpose, we fix the Bayesian model to be a one-layer GAC-GP and feed in features from the different selection methods. Table 1 shows the results. Classification with the features of all points has an accuracy of 0.910. Taking this value as a reference, we can draw two conclusions: (1) our strategy of selecting salient points works for distinguishing different shapes; when enough salient points are selected, a small subset can represent the original data; (2) the geometry-aware selection of GAC-GP is more discriminative than that of the other comparison methods.
Method | Accuracy
---|---
GCGP* | 
1RBF | 
1GAC | 
1GAC(10)+1RBF | 
1GAC(10)+1GAC | 93.4%

*Using self-reproduced code.
Method | Accuracy
---|---
PCNN | 
PointNet++ | 
PointNet++ +Normal | 
PCNN+1GAC | 87.2%
PointNet++ +1GAC | 91.8%
PointNet++ +Normal+1GAC | 
PointNet++ +Normal+2GAC | 
PointNet++ +Normal+3GAC | 93.1%
For the second purpose, we fix the inputs to be GAC-GP salient features and evaluate different Bayesian learning architectures. Here we use GCGP [60] as a comparison method. The results are shown in Table 2. Note that the code of GCGP is not available and we use our own implementation, so we mark GCGP's result with a star. We can see that the accuracy generally increases after adding a GAC layer, which supports its strong feature aggregation property. Meanwhile, a hierarchical concatenation of GAC layers shows better accuracy than a single-layer structure.
5.3 Point Cloud Classification
In the third experiment, the task is to classify different point cloud models. Our purpose is to demonstrate the integration of NNs with Bayesian learning. Here, we use the hierarchical feature learning architecture in PointNet++ [46] to learn the point-wise features. We perform multi-class classification on ModelNet40, which contains 12311 3D CAD models from 40 categories. Each point cloud has 10000 points. We use 9843 models for training and 2468 models for testing. In the feature aggregation part, we use a single GAC-GP layer (ten mixtures) with 64 inducing points. The optimizer is Adam with an initial learning rate of 0.04. The comparison methods include PointNet++, PointNet++ with normal information, and the pointwise convolutional NN (PCNN) [30]. Table 3 shows that (1) the NN+Bayesian mechanism can be jointly trained for tasks on manifolds; (2) models with Bayesian aggregation layers generally outperform the classical multiple fully connected layers in our tests. We notice that the performance gain of using a single GAC layer shrinks after adding normal information. Our hypothesis is that the features become more complicated and the inference capability of a single GAC layer is not powerful enough to aggregate the new features well. By adding 2 and 3 GAC layers, the improvements increase to 0.9% and 1.2%, respectively. The overall results show that architectures with GAC layers universally perform better than their original versions, which proves that such a co-design benefits the performance. A reasonable outlook is to investigate more effective architectures that integrate both methods for end-to-end tasks on manifolds.
6 Conclusion
In this work, we propose the GAC kernel that carries properties of geometry-awareness and intra-kernel convolution. Our methods show strong feature aggregation capability in various tasks on manifolds. We hope our work may inspire future Bayesian and NN+Bayesian studies on manifolds.
Acknowledgements This work was funded by grants R01EY032125 and R21AG065942.
7 Supplementary Materials
7.1 Theorem 1 and Proof
Theorem 1.
A real-valued function $u(x,t)$ is a spatial-temporal kernel function if it is a linear/non-linear diffusion process: $\frac{\partial u}{\partial t} = c\,\Delta u + g(t)\,\delta(x - x_0)$, where $c$ is a positive constant, $g(t)$ is a periodic function, $\delta$ is the Dirac delta function, and $\Delta$ is the Laplace operator.
Proof.
As is known, for a parabolic partial differential equation there exists a corresponding Green's function such that the diffusion process expressed by this parabolic partial differential equation has the form [18, 57]:
$$u(x,t) = \int_0^t \int_{\Omega} G(x, x';\, t - s)\, f(x', s)\, dx'\, ds \qquad (22)$$
In the main submission we have proved that the analytical solution to the following equation is PSD:
$$\frac{\partial u(x,t)}{\partial t} = c\,\Delta u(x,t) + \cos(\omega t)\,\delta(x - x_0) \qquad (23)$$
Since the spatial variable mainly exists in the Green's function and the Green's function is spatially stationary in a diffusion process, the primary task is to prove that any choice of periodic function leads to the same conclusion. Because Eq. (23) has been proven to be PSD, we can draw the same conclusion if Eq. (22) has a form similar to Eq. (23). Therefore, the main idea is to use cosine functions to generalize a periodic function $g(t)$. As is known, a periodic function can be approximated by its Fourier series expansion:
$$g(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \big[a_n \cos(n\omega t) + b_n \sin(n\omega t)\big] \qquad (24)$$
where $a_n$ and $b_n$ are arbitrary real numbers. Assume there exists a cosine function $c_n \cos(n\omega t + \phi_n)$ that is equal to $a_n \cos(n\omega t) + b_n \sin(n\omega t)$, where $c_n$ is a constant. By using the trigonometric sum formulae, we get $a_n = c_n \cos\phi_n$ and $b_n = -c_n \sin\phi_n$. Eq. (24) is then transformed to:
$$g(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} c_n \cos(n\omega t + \phi_n) \qquad (25)$$
Eq. (25) shows that any periodic function can be approximated by a linear combination of cosine functions. By substituting Eq. (25) into Eq. (22), we see that the diffusion process is approximated by the integral of the Green's function times the combination of cosine functions. Applying the PSD summation identity, we can draw the conclusion that the solution to Eq. (22) is also PSD. Therefore, it is a valid kernel function. ∎
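The amplitude-phase rewriting used in Eqs. (24)-(25) can be checked numerically: for any $a_n$, $b_n$, the term $a_n \cos(n\omega t) + b_n \sin(n\omega t)$ equals $c_n \cos(n\omega t + \phi_n)$ with $c_n = \sqrt{a_n^2 + b_n^2}$ and $\phi_n = \operatorname{atan2}(-b_n, a_n)$. The specific coefficient values below are arbitrary test numbers.

```python
import numpy as np

# Check the harmonic-addition identity behind Eqs. (24)-(25):
# a*cos(x) + b*sin(x) == c*cos(x + phi) with a = c*cos(phi), b = -c*sin(phi).
a, b, omega = 0.7, -1.3, 2.0 * np.pi          # arbitrary Fourier coefficients and frequency
c = np.hypot(a, b)
phi = np.arctan2(-b, a)
t = np.linspace(0.0, 1.0, 1000)
lhs = a * np.cos(omega * t) + b * np.sin(omega * t)
rhs = c * np.cos(omega * t + phi)
print(np.max(np.abs(lhs - rhs)))              # ~1e-16: the two expressions coincide
```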

7.2 Lemma 1 and Proof
Lemma 1.
The GAC kernel embeds the mean curvature flow in its construction, which enables it to be geometry-aware.
Proof.
In differential geometry, a curvature flow numerically links intrinsic geometric features and extrinsic flows together [32]. For convenience, we take the proof on a planar curve, assuming the manifold is a two-dimensional manifold, as an example. In this case, the GAC kernel is actually equivalent to a curve-shortening flow, which can be considered as a one-dimensional mean curvature flow [1]. Figure 6 shows sketch plots of the symbols used in this proof. Suppose $p$ is a point on the manifold and $\gamma$ is the intersection between the manifold and the normal plane at $p$. As is known, $\gamma$ is a 1-dimensional smooth curve. Assume a point moves along $\gamma$ from $p$ to a neighboring point. Let $s$ be the arc length of this movement and $\theta$ be the rotation angle of the tangent vector; then we can define the following concepts:
(i) the velocity vector at a point is $\frac{\partial \gamma}{\partial u}$, where $u$ parameterizes the curve;
(ii) the velocity is the magnitude of the velocity vector, $v = \lVert \frac{\partial \gamma}{\partial u} \rVert$;
(iii) the unit tangent vector $T = \frac{1}{v}\frac{\partial \gamma}{\partial u}$ and the unit normal vector $N = R\,T$, where $R$ is a rotation matrix (a rotation by $\pi/2$);
(iv) the curvature $\kappa$, which measures how fast the unit tangent vector rotates relative to the arc length: $\kappa = \frac{\partial \theta}{\partial s}$. We can further get $\frac{\partial T}{\partial s} = \kappa N$ and, similarly, $\frac{\partial N}{\partial s} = -\kappa T$.

Assume all points on the curve start to move along their normal directions at a velocity of $\kappa$ during time $t$; we then have the curvature flow $\frac{\partial \gamma}{\partial t} = \kappa N$. With equations (iii) and (iv), we can write the curvature flow as $\frac{\partial \gamma}{\partial t} = \frac{\partial^2 \gamma}{\partial s^2}$, which is clearly a diffusion process with a zero reaction function. This equivalence proves that the GAC kernel can theoretically reflect the geometric features. From the perspective of physical meaning, the flow describes how much the curvature changes from a point to its neighborhood during a period of time. Therefore, the physical meaning also supports that the GAC kernel embeds the geometric information of the manifolds. Lemma 1 is proved. ∎
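The equivalence used in this proof (motion along the normal with speed $\kappa$ equals a diffusion of the curve in arc length) can be illustrated numerically: evolving a closed polygon by a discrete second arc-length derivative shrinks it smoothly, exactly the curve-shortening behavior. The ellipse initialization, step sizes, and the uniform-parameter Laplacian approximation below are simplifying assumptions.

```python
import numpy as np

# Discrete curve-shortening flow: d(gamma)/dt = d^2(gamma)/ds^2 (a diffusion of the curve),
# approximated with a simple uniform-parameter Laplacian on a closed polygon.
n, dt, steps = 200, 1e-4, 2000
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
gamma = np.stack([2.0 * np.cos(theta), np.sin(theta)], axis=1)   # ellipse as initial curve

for _ in range(steps):
    ds = np.linalg.norm(np.roll(gamma, -1, 0) - gamma, axis=1).mean()    # mean edge length
    lap = (np.roll(gamma, -1, 0) - 2.0 * gamma + np.roll(gamma, 1, 0)) / ds ** 2
    gamma = gamma + dt * lap                                             # normal motion at speed ~kappa

perimeter = np.linalg.norm(np.roll(gamma, -1, 0) - gamma, axis=1).sum()
print("perimeter after the flow: %.3f (the initial ellipse has perimeter ~9.69)" % perimeter)
```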
7.3 Lemma 2 and Proof
Lemma 2.
The GAC Kernel embeds a convolution filtering within the kernel structure, called intra-kernel convolution.
Proof.
According to the reaction diffusion theory [33, 57], Eq. (22) can be expressed as:
$$u(x,t) = \int_0^t \int_{\Omega} G(x - x';\, t - s)\, f(x', s)\, dx'\, ds \qquad (26)$$
If we integrate along the temporal variable, the result has the form $\int_{\Omega} h(x - x')\, \tilde{f}(x')\, dx'$, which matches the definition of a convolutional filtering $(h * \tilde{f})(x)$. Our kernel derivation also indicates the existence of a convolution on manifolds. Recall that we approximate the integral in Eq. (22) as the summation of a sine Fourier transform and a cosine Fourier transform (Eq. (14) in the main submission); each term implements a transform from the time domain to the frequency domain. According to the Convolution Theorem, we can draw the same conclusion. A similar theory has also been applied in geometric deep learning to realize convolutions on manifolds [9]. Lemma 2 is proved. ∎

7.4 Unsupervised Salient Point Selection
We provide more salient point selection results on point clouds here. Figure 7 shows twelve examples of salient point selection on the McGill 3D shape benchmark [53]. The shapes have different sampling densities. The original data type is the triangle mesh; we remove the edge connections and only use the vertex coordinates as inputs. The results show that GAC-GP can learn the geometric properties of the inputs in the prior, and the selected salient points are representative for distinguishing the shapes.
Figure 8 illustrates two examples of salient point selection on the ModelNet40 dataset. Figure 8(a) demonstrates the saliency maps after selecting 200 salient points, Figure 8(b) shows the salient points on the original point clouds, and Figure 8(c) shows the 200 salient points alone. We can see that 200 salient points can generally depict the original shapes. Note that we did not use the salient point selection algorithm in the point cloud classification experiment in the main submission. This is mainly because the comparison methods use the whole data as inputs; for fairness and convenience, we also use all 10000 vertices as inputs in that experiment.
References
- [1] Steven J Altschuler. Shortening space curves. Differential geometry: Riemannian geometry (Los Angeles, CA, 1990), 54:45–51, 1993.
- [2] Mathieu Aubry, Ulrich Schlickewei, and Daniel Cremers. The wave kernel signature: A quantum mechanical approach to shape analysis. In 2011 IEEE international conference on computer vision workshops (ICCV workshops), pages 1626–1633. IEEE, 2011.
- [3] James V Beck, Kevin D Cole, A Haji-Sheikh, and Bahman Litkouhl. Heat conduction using Green’s function. Taylor & Francis, 1992.
- [4] Rajendra Bhatia. Positive definite matrices, princeton ser. Appl. Math., Princeton University Press, Princeton, NJ, 2007.
- [5] Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional gaussian processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 582–597. Springer, 2019.
- [6] Salomon Bochner. Harmonic analysis and the theory of probability. Courier Corporation, 2005.
- [7] Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, and Marc Peter Deisenroth. Matérn gaussian processes on riemannian manifolds. Advances in Neural Information Processing Systems, 2020.
- [8] Doug M Boyer, Yaron Lipman, Elizabeth St Clair, Jesus Puente, Biren A Patel, Thomas Funkhouser, Jukka Jernvall, and Ingrid Daubechies. Algorithms to automatically quantify the geometric similarity of anatomical surfaces. Proceedings of the National Academy of Sciences, 108(45):18221–18226, 2011.
- [9] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- [10] Ismaël Castillo, Gérard Kerkyacharian, and Dominique Picard. Thomas bayes’ walk on manifolds. Probability Theory and Related Fields, 158(3-4):665–710, 2014.
- [11] Paolo Cignoni, Marco Callieri, Massimiliano Corsini, Matteo Dellepiane, Fabio Ganovelli, and Guido Ranzuglia. MeshLab: an Open-Source Mesh Processing Tool. In Vittorio Scarano, Rosario De Chiara, and Ugo Erra, editors, Eurographics Italian Chapter Conference. The Eurographics Association, 2008.
- [12] Sébastien Couette and Jess White. 3d geometric morphometrics and missing-data. can extant taxa give clues for the analysis of fossil primates? Comptes Rendus Palevol, 9(6-7):423–433, 2010.
- [13] Noel Cressie and Christopher K Wikle. Statistics for spatio-temporal data. John Wiley & Sons, 2015.
- [14] Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
- [15] Nicolas Durrande, David Ginsbourger, and Olivier Roustant. Additive covariance kernels for high-dimensional gaussian process modeling. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 21, pages 481–499, 2012.
- [16] Vincent Dutordoir, Mark Wilk, Artem Artemev, and James Hensman. Bayesian image classification with deep convolutional gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 1529–1539. PMLR, 2020.
- [17] Klaus Ecker. Heat equations in geometry and topology. Jahresber. Deutsch. Math.-Verein, 110(3):117–141, 2008.
- [18] Gert Ehrlich and Kaj Stolt. Surface diffusion. Annual Review of Physical Chemistry, 31(1):603–637, 1980.
- [19] Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian Marx. Regression: models, methods and applications. Springer Science & Business Media, 2013.
- [20] Yonghui Fan, Natasha Lepore, and Yalin Wang. Morphometric gaussian process for landmarking on grey matter tetrahedral models. In 15th International Symposium on Medical Information Processing and Analysis, volume 11330, page 113300H. International Society for Optics and Photonics, 2020.
- [21] Yonghui Fan and Yalin Wang. Convolutional bayesian models for anatomical landmarking on multi-dimensional shapes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 786–796. Springer, 2020.
- [22] Tingran Gao, Shahar Z Kovalsky, Doug M Boyer, and Ingrid Daubechies. Gaussian process landmarking for three-dimensional geometric morphometrics. SIAM Journal on Mathematics of Data Science, 1(1):237–267, 2019.
- [23] Tingran Gao, Shahar Z Kovalsky, and Ingrid Daubechies. Gaussian process landmarking on manifolds. SIAM Journal on Mathematics of Data Science, 1(1):208–236, 2019.
- [24] Jacob R Gardner, Geoff Pleiss, David Bindel, Kilian Q Weinberger, and Andrew Gordon Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Advances in Neural Information Processing Systems, 2018.
- [25] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow gaussian processes. The International Conference on Learning Representations (ICLR), 2019.
- [26] Gaël Guennebaud, Marcel Germann, and Markus Gross. Dynamic sampling and rendering of algebraic point set surfaces. In Computer Graphics Forum, volume 27, pages 653–662. Wiley Online Library, 2008.
- [27] Gaël Guennebaud and Markus Gross. Algebraic point set surfaces. In ACM Transactions on Graphics (TOG), volume 26, page 23. ACM, 2007.
- [28] Peter Guttorp and Tilmann Gneiting. Studies in the history of probability and statistics xlix on the matern correlation family. Biometrika, 93(4):989–995, 2006.
- [29] James Hensman, Alexander G Matthews, Maurizio Filippone, and Zoubin Ghahramani. Mcmc for variationally sparse gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656, 2015.
- [30] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Pointwise convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 984–993, 2018.
- [31] Samuel Kadoury. Manifold learning in medical imaging. In Manifolds. IntechOpen, 2018.
- [32] Satyanad Kichenassamy, Arun Kumar, Peter Olver, Allen Tannenbaum, and Anthony Yezzi. Gradient flows and geometric active contour models. In Proceedings of IEEE International Conference on Computer Vision, pages 810–815. IEEE, 1995.
- [33] Christina Kuttler. Reaction-diffusion equations with applications. In Internet seminar, 2011.
- [34] Chang Ha Lee, Amitabh Varshney, and David W Jacobs. Mesh saliency. ACM transactions on graphics (TOG), 24(3):659–666, 2005.
- [35] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. International Conference on Learning Representations(ICLR), 2018.
- [36] Haisheng Li, Li Sun, Xiaoqun Wu, and Qiang Cai. Scale-invariant wave kernel signature for non-rigid 3d shape retrieval. In 2018 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 448–454. IEEE, 2018.
- [37] Dawen Liang and John Paisley. Landmarking manifolds with Gaussian processes. In International Conference on Machine Learning, pages 466–474, 2015.
- [38] Lizhen Lin, Niu Mu, Pokman Cheung, David Dunson, et al. Extrinsic gaussian processes for regression and classification on manifolds. Bayesian Analysis, 2018.
- [39] Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between gaussian fields and gaussian markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, 2011.
- [40] Anton Mallasto and Aasa Feragen. Wrapped gaussian process regression on riemannian manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5580–5588, 2018.
- [41] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
- [42] Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
- [43] Gabriel Parra and Felipe Tobar. Spectral mixture kernels for multi-output gaussian processes. In Advances in Neural Information Processing Systems, pages 6681–6690, 2017.
- [44] David Pickup, X Sun, Paul L Rosin, RR Martin, Z Cheng, Z Lian, M Aono, A Ben Hamza, A Bronstein, M Bronstein, et al. Shrec’14 track: Shape retrieval of non-rigid 3d human models. In Proceedings of the 7th Eurographics workshop on 3D Object Retrieval, volume 1, page 6. Eurographics Association, 2014.
- [45] Victor Adrian Prisacariu and Ian Reid. Nonlinear shape manifolds as shape priors in level set segmentation and tracking. In CVPR 2011, pages 2185–2192. IEEE, 2011.
- [46] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pages 5099–5108, 2017.
- [47] Manazhy Rashmi and Praveen Sankaran. Optimal landmark point selection using clustering for manifold modeling and data classification. Journal of Classification, pages 1–19, 2019.
- [48] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
- [49] Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep gaussian processes. In Advances in Neural Information Processing Systems, pages 4588–4599, 2017.
- [50] Simo Särkkä. Linear operators and stochastic partial differential equations in gaussian process regression. In International Conference on Artificial Neural Networks, pages 151–158. Springer, 2011.
- [51] Michael Sherman. Spatial statistics and spatio-temporal data: covariance functions and directional properties. John Wiley & Sons, 2011.
- [52] Rishit Sheth, Yuyang Wang, and Roni Khardon. Sparse variational inference for generalized gp models. In International Conference on Machine Learning, pages 1302–1311, 2015.
- [53] Kaleem Siddiqi, Juan Zhang, Diego Macrini, Ali Shokoufandeh, Sylvain Bouix, and Sven Dickinson. Retrieving articulated 3-d models using medial surfaces. Machine vision and applications, 19(4):261–275, 2008.
- [54] Arno Solin and Simo Särkkä. Hilbert space methods for reduced-rank gaussian process regression. Statistics and Computing, 30(2):419–446, 2020.
- [55] Michael L Stein. A kernel approximation to the kriging predictor of a spatial process. Annals of the Institute of Statistical Mathematics, 43(1):61–75, 1991.
- [56] Michael L Stein. Interpolation of spatial data: some theory for kriging. Springer Science & Business Media, 2012.
- [57] Walter A Strauss. Partielle Differentialgleichungen: eine Einführung. Springer-Verlag, 2013.
- [58] Felipe Tobar, Thang D Bui, and Richard E Turner. Learning stationary time series using gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, pages 3501–3509, 2015.
- [59] Mark Van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional gaussian processes. In Advances in Neural Information Processing Systems, pages 2849–2858, 2017.
- [60] Ian Walker and Ben Glocker. Graph convolutional gaussian processes. arXiv preprint arXiv:1905.05739, 2019.
- [61] Yue Wang and Yang Song. Facial keypoints detection, 2014.
- [62] Andrew Gordon Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014.
- [63] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial intelligence and statistics, pages 370–378, 2016.
- [64] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
- [65] James V Zidek, Constance van Eeden, et al. Uncertainty, entropy, variance and the effect of partial information. Lecture Notes-Monograph Series, 42:155–167, 2003.