
Wrapped Gaussian Process Functional Regression Model for Batch Data on Riemannian Manifolds

Jinzhao Liu, Chao Liu, Jian Qing Shi, Tom Nye
Abstract

Regression is an essential and fundamental methodology in statistical analysis. The majority of the literature focuses on linear and nonlinear regression in Euclidean spaces. However, regression models in non-Euclidean spaces deserve more attention as increasing volumes of manifold-valued data are collected. In this context, this paper proposes a concurrent functional regression model for batch data on Riemannian manifolds which estimates both the mean structure and the covariance structure simultaneously. The response variable is assumed to follow a wrapped Gaussian process distribution. The model captures nonlinear relationships between manifold-valued response variables and multiple Euclidean covariates, where the covariates can be functional and/or scalar. The performance of our model has been tested on both simulated data and real data, showing that it is an effective and efficient tool for functional data regression on Riemannian manifolds.

1 Introduction

Regression models are ubiquitous and powerful tools in data analysis, revealing the relationship between independent variables and dependent variables. Most well-known regression models are formulated under the assumption that the variables all lie in Euclidean vector spaces. However, statistical models of manifold-valued data are gaining popularity in various fields of scientific analysis, such as computational social science, medical imaging analysis and computer vision [8]. Together with the significant increase in manifold-valued data observed by current instrumentation, generalizations of Euclidean regression models to manifold-valued settings are needed. If the manifold structure is not accounted for, then the reliability of model predictions in the target spaces is compromised. Consider, for instance, Figure 1, where a regression model is trained on manifold-valued data (represented by the black curves on the sphere S^{2}). The predictions of this model (illustrated by the blue curve) deviate from the manifold. However, when the manifold structure is incorporated into the model, the predictions (depicted by the red curve) align with the target manifold. This underscores the necessity of employing non-Euclidean structure for modelling manifold-valued data. An additional example demonstrating this principle in Kendall's shape space is provided in the Appendix.

Figure 1: The black curves represent manifold-valued data, while the red and blue curves depict the predictions of Gaussian process regressions with and without a manifold structure, respectively. This illustrates the impact of incorporating a manifold structure into the regression model, as evidenced by the alignment of the red curve with the manifold-valued data, in contrast to the deviation of the blue curve. This underscores the importance of considering manifold structures in Gaussian process regression models for accurate prediction and data representation.

In recent years, functional data have garnered increasing attention due to the continuous or intermittent recording of data at discrete time points. These data often exhibit non-linearity, and a common assumption is that they lie on a nonlinear manifold. For instance, image data, which can be influenced by random domain shifts, are known to reside on such manifolds. However, analyzing functional data poses challenges, particularly when dealing with spaces lacking global or local linear structures. Traditional methods like functional principal component analysis become infeasible in these cases. One specific difficulty arises from defining a suitable metric for data residing on a Riemannian manifold. Researchers have explored alternative approaches to address this issue. For example, [16] propose using point-wise Fréchet means for non-Euclidean time-varying random objects in general metric spaces. Additionally, they introduce point-wise distance trajectories between individual time courses and the estimated Fréchet mean trajectory, which allows time-varying random objects to be represented as functional data. In summary, the study of functional data on nonlinear manifolds presents exciting opportunities for advancing statistical methodologies and understanding complex data structures.

According to the manifold hypothesis, functional data can be mapped onto a low-dimensional nonlinear manifold, a perspective also known as manifold learning [35]. Gaussian processes (GPs) have proven to be powerful tools for modeling complex data structures. While GPs are commonly applied in Euclidean spaces, recent research has extended their applicability to Riemannian manifolds. [5] utilize the spectral theory of the Laplace-Beltrami operator to compute Matérn kernels on Riemannian manifolds. These kernels, widely used in the physical sciences, capture both smoothness and spatial correlations, and the spectral approach allows for efficient computation of Matérn covariance functions, enabling GPs to model data residing on curved surfaces. [22] introduce a vector-valued GP equipped with a matrix-valued kernel; this framework enables modeling of vector fields on Riemannian manifolds, with applications including geodesic flow modeling, fluid dynamics, and other scenarios where vector-valued data arise. Recently, [2] and [3] extended stationary GPs to compact Lie groups and non-compact symmetric spaces, which play a critical role in spatiotemporal modeling. These models define a metric based on geodesic distances between predictors, accommodating both functional and scalar predictors. In addition, [26] propose a regression model for functional data on manifolds. While existing approaches assume Euclidean response variables, certain problems involve responses residing directly on Riemannian manifolds, for instance, predicting flight trajectories from flight time. In contrast to previous models, we propose a regression framework that maps Euclidean-valued predictors to manifold-valued functional responses.

A number of statistical approaches to the analysis of manifold-valued data exist in the literature, especially in the context of principal component analysis (PCA). For example, [18] introduced principal geodesic analysis, a generalisation of PCA to Riemannian manifolds. It replaces the first principal component with a principal geodesic constrained to pass through the intrinsic mean, with calculations performed approximately in a Euclidean tangent space. [25] proposed a PCA technique for functional data on 2-dimensional Riemannian manifolds, in which the authors adopt a regularization method with a smoothing penalty coherent with geodesic distances. Subsequently, [11] formulated Riemannian functional principal component analysis by mapping data to tangent spaces via the Riemannian logarithm map and then performing multivariate functional principal component analysis within the tangent space.

In addition to generalizations of PCA to manifold data, generalizations of linear regression have also been studied. Similar to manifold-based PCA, the regression line is typically replaced with a geodesic within the manifold, which gives rise to more challenging optimisation problems for model fitting than in the standard Euclidean setting. Geodesic regression [17] is a generalization of linear regression to a manifold-valued setting in which a univariate independent variable in \mathbb{R} is considered and the dependent variable lies in a Riemannian manifold. The author derived a gradient descent algorithm for model fitting via derivatives of the exponential map and Jacobi fields, because the least squares problem has no analytical solution in this setting. [24] then extended geodesic regression to the multivariate setting in which the independent variables are in \mathbb{R}^{n} and the dependent variable is still manifold-valued. In addition, the authors proposed a variational gradient descent method based on parallel transport which is more convenient for high-dimensional independent variables. In the context of shape analysis, [28] provided an analytic approach to geodesic regression in Kendall's shape space. Multiple linear regression has also been generalised by [30] for complex random objects in metric spaces with independent variables in \mathbb{R}^{n}. Using a least-squares approach, the authors derived asymptotic rates of convergence.

However, in many applications, non-Euclidean data cannot always be assumed to follow a linear relationship. [4] introduced a kernel-based nonlinear regression model for both manifold-valued independent variables and manifold-valued dependent variables. [21] extended polynomial regression to Riemannian manifolds by introducing a class of curves on Riemannian manifolds which generalize geodesics and are analogues of polynomial curves in Euclidean space. [10] developed a regression model for a manifold-valued response variable in Riemannian symmetric spaces and covariates in Euclidean spaces, with applications in medical imaging analysis. Moreover, [31] introduced an additive linear model for manifold-valued data which is based on the exponential map. In particular, they transform the manifold-valued data onto tangent spaces and then estimate the parameters of the additive linear model by a generalised least squares method.

Gaussian process regression (GPR) is a powerful non-linear and non-parametric model widely used for learning probability distributions over unknown functions. Operating within a Bayesian framework, GPR assumes that both the prior and the likelihood follow normal distributions, allowing for the derivation of a posterior distribution that is also Gaussian. This has been extended to problems involving non-Gaussian data [41] and to the use of other prior processes [43]. In addition, researchers have extended GPR to accommodate multi-output scenarios; notable contributions include the work of [7], [40], and [1] on multi-output Gaussian process regression. Over the past decade, substantial progress has been made in developing Gaussian process regression models specifically tailored for manifolds, thereby expanding the model's scope and applicability. One intriguing approach is the wrapped Gaussian process regression proposed by [27]. In this framework, the Riemannian manifold is linearized via the logarithm map, projecting data points onto a tangent space. This results in a non-parametric regression model with a probabilistic foundation. Importantly, the computational cost of wrapped Gaussian process regression remains relatively low, as it involves additional calculations only for the exponential map and logarithm map of each manifold-valued data point. The key innovation lies in defining a Gaussian distribution directly on the Riemannian manifold, leveraging insights from real-valued multivariate Gaussian distributions and the exponential map. This approach opens up avenues for Gaussian process regression on Riemannian manifolds. Within the Bayesian framework, researchers have derived manifold-valued posterior distributions, albeit under certain assumptions (e.g., the requirement of infinite injectivity radius).

To the best of our knowledge, there is little literature on regression models for functional batch data on Riemannian manifolds within a probabilistic framework. In light of this, we provide a nonlinear non-parametric regression model with uncertainty quantification for batch data on a smooth Riemannian manifold. Called the wrapped Gaussian process functional regression (WGPFR) model, it models the relationship between a functional manifold-valued response and mixed (scalar and functional) Euclidean-valued predictors by representing the mean structure and the covariance structure as different model terms. Specifically, a functional regression model is used for the mean structure, with a functional manifold-valued response and scalar covariates, while a wrapped Gaussian process is used for the covariance structure with functional covariates, building on [27], who extended Gaussian process regression to Riemannian manifolds. In this way, the proposed WGPFR model extends the field of non-parametric regression on manifolds.

The rest of this paper is organized as follows. In Section 2, some basic concepts of functional data, Riemannian manifolds, Gaussian process regression and wrapped Gaussian process regression are reviewed. In Section 3, we propose our model together with a method for inference via an efficient iterative algorithm. Numerical experiments and real data analysis are reported in Section 4. Finally, we draw conclusions and discuss the results in Section 5.

2 Background

2.1 Functional Data

Functional data analysis is a field of study wherein each observation is a random function, manifesting as a curve, surface, or other variations over a continuum. This analysis is particularly pertinent to data recorded over identical intervals, with consistent frequency and numerous repeated observations. While data are frequently modelled by parametric models incorporating randomness in the realms of statistics or machine learning, functional data analysis contemplates a smooth process observed at discrete time points, offering greater flexibility and reduced parametric constraints, see e.g. [33] and [45].

Furthermore, functional data can be procured from a diverse range of domains. For instance, in finance, volatility surfaces serve as a source of functional data. In the field of biomedical research, time-varying physiological signals yield functional data. Functional Magnetic Resonance Imaging (FMRI) is yet another domain that generates functional data. These examples underscore the broad applicability and interdisciplinary nature of functional data analysis.

2.2 Preliminaries for Riemannian Manifolds

In this section, we review a few basic concepts of Riemannian manifolds, wrapped Gaussian distributions, wrapped Gaussian processes on manifolds and then set up basic notations. More detail on basic Riemannian geometry can be found in standard texts, such as that by [14].

2.2.1 Concepts and Notation

Given a d-dimensional smooth differentiable manifold \mathcal{M} and a tangent space T_{p}\mathcal{M} for p\in\mathcal{M}, a Riemannian metric g_{p}:T_{p}\mathcal{M}\times T_{p}\mathcal{M}\rightarrow\mathbb{R} on \mathcal{M} is a family of positive definite inner products which vary smoothly with p\in\mathcal{M}. Equipped with this Riemannian metric, we call the pair (\mathcal{M},g) a Riemannian manifold.

The tangent bundle of \mathcal{M} is defined as the set \mathcal{TM}=\cup_{p\in\mathcal{M}}(p\times T_{p}\mathcal{M}), where p is a point in \mathcal{M} and v\in T_{p}\mathcal{M} is a tangent vector at p. If \gamma is a smooth curve in \mathcal{M}, then its length is defined as the integral of \|\partial\gamma/\partial t\|, where the norm is computed via the Riemannian metric at \gamma(t). For any pair (p,v)\in\mathcal{TM} with \|v\| sufficiently small, there is a unique curve \gamma:[0,1]\rightarrow\mathcal{M} of minimal length between two points p\in\mathcal{M} and q\in\mathcal{M}, with initial conditions \gamma(0)=p, \gamma^{\prime}(0)=v and \gamma(1)=q. Such a curve is called a geodesic. The exponential map \text{Exp}(p,v):\mathcal{M}\times T_{p}\mathcal{M}\rightarrow\mathcal{M} at p\in\mathcal{M} is defined as \text{Exp}(p,v)=\gamma(1) for sufficiently small v\in T_{p}\mathcal{M}. Consider the ball of largest radius around the origin in T_{p}\mathcal{M} on which \text{Exp}(p,v) is defined and let V(p)\subset\mathcal{M} denote the image of \text{Exp}(p,\cdot) on this ball. Then the exponential map has an inverse on V(p), called the Riemannian logarithm map at p, \text{Log}(p,\cdot):V(p)\rightarrow T_{p}\mathcal{M}. For any point q\in V(p), the geodesic distance between p\in\mathcal{M} and q\in\mathcal{M} is given by d_{\mathcal{M}}(p,q)=\|\text{Log}(p,q)\|.

Figure 2 illustrates these concepts: p and q are two points on a Riemannian manifold \mathcal{M}, the tangent space at p is denoted T_{p}\mathcal{M}, and \gamma(t) is a geodesic between p and q. The tangent vector v\in T_{p}\mathcal{M} points from p towards q, so that v=\text{Log}(p,q); moving from p along v for a distance \|v\| reaches q, that is, q=\text{Exp}(p,v).

Figure 2: Exponential map and logarithm map on S^{2}.

The exponential map and its inverse are analogs of vector space operations in the following way. Suppose p_{\mathcal{M}} and q_{\mathcal{M}} are two different points on \mathcal{M} linked by a geodesic \gamma, and suppose the tangent vector from the former to the latter is denoted by pq_{\mathcal{M}}=(\frac{d\gamma}{dt})_{p}. In addition, suppose p_{E} and q_{E} are two Euclidean data points and the vector from the former to the latter is denoted by pq_{E}. Table 1 shows the relationship between operations in Euclidean space and on a Riemannian manifold.

Table 1: Analogs between basic operations in Euclidean space and on a Riemannian manifold.
              Euclidean Space                 Riemannian Manifold
Addition      q_{E}=p_{E}+v_{pq}              q_{\mathcal{M}}=\text{Exp}(p_{\mathcal{M}},v_{pq})
Subtraction   v_{pq}=p_{E}-q_{E}              v_{pq}=\text{Log}(p_{\mathcal{M}},q_{\mathcal{M}})
Distance      \|p_{E}-q_{E}\|                 \|\text{Log}(p_{\mathcal{M}},q_{\mathcal{M}})\|
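As a concrete illustration of the operations in Table 1, the following is a minimal sketch (our own, in Python) of the exponential and logarithm maps on the unit sphere S^{2} embedded in \mathbb{R}^{3}; the function names and the toy points are illustrative assumptions, not part of the paper.

import numpy as np

def exp_map(p, v, eps=1e-12):
    # Exponential map Exp(p, v) on S^2: move from p along the tangent vector v
    norm_v = np.linalg.norm(v)
    if norm_v < eps:
        return p
    return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v

def log_map(p, q, eps=1e-12):
    # Logarithm map Log(p, q) on S^2: tangent vector at p pointing towards q
    w = q - np.dot(p, q) * p                    # project q onto the tangent space at p
    norm_w = np.linalg.norm(w)
    if norm_w < eps:
        return np.zeros_like(p)
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))   # geodesic distance
    return theta * w / norm_w

# "subtraction", "addition" and distance on the manifold
p = np.array([0.0, 0.0, 1.0])
q = np.array([1.0, 0.0, 0.0])
v = log_map(p, q)                               # subtraction analogue
print(np.allclose(exp_map(p, v), q))            # addition analogue recovers q -> True
print(np.linalg.norm(v))                        # geodesic distance = pi/2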

If (\mathcal{M},d_{\mathcal{M}}) is a complete metric space, i.e. every Cauchy sequence of points in \mathcal{M} has a limit that is also in \mathcal{M}, then the Hopf-Rinow theorem shows that every pair of points p,q\in\mathcal{M} is joined by at least one geodesic, a condition known as geodesic completeness, and, equivalently, \text{Exp}(p,v) is defined for all (p,v)\in\mathcal{TM}. However, given two points on \mathcal{M}, the geodesic between them may not be unique, even if \mathcal{M} is geodesically complete. The injectivity radius at p\in\mathcal{M}, denoted \textrm{inj}_{p}, is defined to be the radius of the largest ball at the origin of T_{p}\mathcal{M} on which \text{Exp}(p,\cdot) is a diffeomorphism. It follows that the geodesic ball at p of radius \textrm{inj}_{p} is contained within V(p), under which condition the wrapped Gaussian process regression can be derived in an analytical form. In the following, we assume the data lie on a Riemannian manifold with infinite injectivity radius, such as spheres and Kendall's shape spaces.

Some examples of Riemannian manifolds are given in the Appendix, including explicit formulae for the exponential and logarithm maps in certain cases.

2.3 Functional Data on Riemannian Manifolds

To study functional data located on Riemannian manifolds, [11] considered a d-dimensional complete Riemannian submanifold \mathcal{M}^{\#} of a Euclidean space \mathbb{R}^{d_{0}}, with a geodesic distance d_{\mathcal{M}} and a probability space (\Omega,\mathcal{A},P) with sample space \Omega, \sigma-algebra \mathcal{A} and probability P. The space of all \mathcal{M}-valued continuous functions on a compact interval \mathcal{T}\subset\mathbb{R} is denoted by \mathcal{X}=\{y:\mathcal{T}\rightarrow\mathcal{M}\ |\ y\in\mathcal{C}(\mathcal{T})\}. Thus, we define \mathcal{M}-valued random functions to be functions y(t,\omega), y:\mathcal{T}\times\Omega\rightarrow\mathcal{M}, satisfying y(\cdot,\omega)\in\mathcal{X}. Let \mathbb{H}=\{v:\mathcal{T}\rightarrow\mathbb{R}^{d}\ |\ \int_{\mathcal{T}}v(t)^{T}v(t)dt<\infty\} denote the ambient L^{2} Hilbert space of d-dimensional square integrable functions on \mathcal{T}. The inner product is defined as \langle v,u\rangle=\int_{\mathcal{T}}v(t)^{T}u(t)dt and the norm is given by \|v\|=\langle v,v\rangle^{1/2} for v,u\in\mathbb{H}.

2.4 Gaussian Process Regression and Wrapped Gaussian Process Regression

2.4.1 Euclidean Gaussian Process Regression

As described in [34], a general Gaussian process regression model is defined as

y=f(\boldsymbol{x})+\epsilon   (1)

where \boldsymbol{x}\in\mathbb{R}^{Q}, y\in\mathbb{R} and \epsilon\sim\mathcal{N}(0,\sigma^{2}). The regression function f is assumed to follow a Gaussian process prior, that is

f(\cdot)\sim GP(\mu(\cdot),k(\cdot,\cdot;\boldsymbol{\theta}))   (2)

where \mu(\cdot)=E[f(\cdot)] is the mean function of the Gaussian process, and k(\cdot,\cdot;\boldsymbol{\theta}) is the covariance function with hyper-parameter \boldsymbol{\theta}.

Given training data \mathcal{D}=\{\boldsymbol{x}_{i},y_{i},i=1,\ldots,n\}, we have a multivariate Gaussian distribution

\big(f(\boldsymbol{x}_{1}),\ldots,f(\boldsymbol{x}_{n})\big)\sim\mathcal{N}(\boldsymbol{\mu}(\boldsymbol{X}),K(\boldsymbol{X},\boldsymbol{X}))   (3)

where \boldsymbol{X}=(\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}), \boldsymbol{\mu}(\boldsymbol{X})=(\mu(\boldsymbol{x}_{1}),\ldots,\mu(\boldsymbol{x}_{n})) is the mean vector, and K(\boldsymbol{X},\boldsymbol{X}) is the n\times n covariance matrix whose (i,j)-th entry is k(\boldsymbol{x}_{i},\boldsymbol{x}_{j};\boldsymbol{\theta}).

To obtain the predictive distribution at a new input \boldsymbol{x}^{*}, it is convenient to use the properties of the Gaussian distribution:

\begin{pmatrix}\boldsymbol{y}\\ \boldsymbol{f}(\boldsymbol{x}^{*})\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}\boldsymbol{\mu}(\boldsymbol{X})\\ \boldsymbol{\mu}(\boldsymbol{x}^{*})\end{pmatrix},\begin{pmatrix}K(\boldsymbol{X},\boldsymbol{X})+\sigma^{2}\boldsymbol{I}&K(\boldsymbol{x}^{*},\boldsymbol{X})^{T}\\ K(\boldsymbol{x}^{*},\boldsymbol{X})&K(\boldsymbol{x}^{*},\boldsymbol{x}^{*})\end{pmatrix}\right)   (4)

where K(\boldsymbol{x}^{*},\boldsymbol{X}) is a 1\times n covariance matrix, K(\boldsymbol{x}^{*},\boldsymbol{x}^{*}) is a scalar, and their entries are computed in the same way as those of K(\boldsymbol{X},\boldsymbol{X}). The predictive distribution has an analytical form, namely

p(\boldsymbol{f}(\boldsymbol{x}^{*})|\mathcal{D},\boldsymbol{x}^{*})\sim\mathcal{N}\big(K(\boldsymbol{x}^{*},\boldsymbol{X})^{T}(K(\boldsymbol{X},\boldsymbol{X})+\sigma^{2}\boldsymbol{I})^{-1}(\boldsymbol{y}-\boldsymbol{\mu}(\boldsymbol{X}))+\boldsymbol{\mu}(\boldsymbol{x}^{*}),\ K(\boldsymbol{x}^{*},\boldsymbol{x}^{*})-K(\boldsymbol{x}^{*},\boldsymbol{X})^{T}(K(\boldsymbol{X},\boldsymbol{X})+\sigma^{2}\boldsymbol{I})^{-1}K(\boldsymbol{x}^{*},\boldsymbol{X})\big)   (5)

For more details of Gaussian process regression, see [44].

The kernel function in the covariance function can be chosen from a parametric family, such as the radial basis function (RBF) kernel:

K(\boldsymbol{x},\boldsymbol{x}^{\prime};\boldsymbol{\theta})=\lambda_{1}^{2}\exp\bigg(-\frac{\|\boldsymbol{x}-\boldsymbol{x}^{\prime}\|^{2}}{2\lambda_{2}}\bigg)   (6)

with the hyper-parameters \boldsymbol{\theta}=\{\lambda_{1},\lambda_{2}\} chosen by maximizing the marginal likelihood. Specifically, the hyper-parameters can be estimated by a conjugate gradient descent algorithm [32] or by Markov chain Monte Carlo [38].
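For concreteness, the following is a minimal sketch (our own, not the authors' code) of Gaussian process prediction with the RBF kernel in Equation (6) and the predictive distribution in Equation (5); the zero prior mean, the fixed hyper-parameter values and all variable names are illustrative choices.

import numpy as np

def rbf_kernel(X1, X2, lam1=1.0, lam2=0.5):
    # squared exponential kernel as in Eq. (6)
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return lam1**2 * np.exp(-d2 / (2 * lam2))

def gp_predict(X, y, X_star, sigma2=0.01, **kern):
    # predictive mean and covariance of Eq. (5), assuming a zero prior mean
    K = rbf_kernel(X, X, **kern) + sigma2 * np.eye(len(X))
    K_s = rbf_kernel(X, X_star, **kern)
    K_ss = rbf_kernel(X_star, X_star, **kern)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov

# toy example
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X).ravel() + 0.05 * np.random.randn(20)
mean, cov = gp_predict(X, y, np.linspace(0, 1, 50)[:, None])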

2.4.2 Wrapped Gaussian Distribution

In this subsection, we first review the definitions of the wrapped Gaussian distribution and the wrapped Gaussian process on a Riemannian manifold \mathcal{M}, and then review wrapped Gaussian process regression (WGPR), which was introduced by [27].

Suppose \mathcal{M} denotes a d-dimensional Riemannian manifold and X is a random variable on \mathcal{M}. For \mu\in\mathcal{M} and a symmetric positive definite matrix K\in\mathbb{R}^{d\times d}, we can define a wrapped Gaussian distribution as follows

X=\text{Exp}(\mu,\boldsymbol{v}),\ \boldsymbol{v}\sim\mathcal{N}(\boldsymbol{0},K)   (7)

where \mu is called the basepoint and \boldsymbol{v} is a vector belonging to the tangent space at \mu which follows a multivariate Gaussian distribution with zero mean and covariance matrix K (see Section 3.1 in [27]). The wrapped Gaussian distribution is formally denoted as

X\sim\mathcal{N}_{\mathcal{M}}(\mu,K).   (8)

[27] describe a manifold-valued Gaussian process constructed from the wrapped Gaussian distribution, and the following material is based on their work. A collection (f(X_{1}),\ldots,f(X_{n})) of random points on a manifold \mathcal{M} indexed over a set \Omega is a wrapped Gaussian process if every finite subcollection follows a jointly wrapped Gaussian distribution on \mathcal{M}. It is denoted as

f(\cdot)\sim\mathcal{GP}_{\mathcal{M}}(\mu(\cdot),k(\cdot,\cdot;\boldsymbol{\theta}))   (9)

where \mu(\cdot)=E[f(\cdot)] is the mean function on the Riemannian manifold, and k(\cdot,\cdot;\boldsymbol{\theta}) is a covariance function with hyper-parameter \boldsymbol{\theta} on the tangent space at \mu(\cdot).

Consider the wrapped Gaussian process regression model which relates the manifold-valued response p\in\mathcal{M} to the vector-valued predictor \boldsymbol{x}\in\mathbb{R}^{Q} through a link function f:\mathbb{R}^{Q}\rightarrow\mathcal{M}, which we assume to be a wrapped Gaussian process. Analogously to Gaussian process regression, the authors derive inference for wrapped Gaussian process regression in a Bayesian framework. Specifically, given the training data \mathcal{D}_{\mathcal{M}}=\{(\boldsymbol{x}_{i},p_{i})|\boldsymbol{x}_{i}\in\mathbb{R}^{Q},p_{i}\in\mathcal{M},i=1,\cdots,n\}, the prior for f(\cdot) is assumed to be a wrapped Gaussian process, that is f(\cdot)\sim\mathcal{GP}_{\mathcal{M}}(\mu(\cdot),K(\cdot,\cdot;\boldsymbol{\theta})). Then the joint distribution between the training outputs \boldsymbol{p}=(p_{1},\cdots,p_{n}) and the test output p^{*} at \boldsymbol{x}^{*} is given by

\left(\begin{array}{c}p^{*}\\ \boldsymbol{p}\end{array}\right)\sim\mathcal{N}_{\mathcal{M}}\left(\left(\begin{array}{c}\mu^{*}\\ \boldsymbol{\mu}\end{array}\right),\left(\begin{array}{cc}K^{**}&K^{*}\\ K^{*\top}&K\end{array}\right)\right)   (16)

where \mu^{*}=\mu(\boldsymbol{x}^{*}), \boldsymbol{\mu}=(\mu(\boldsymbol{x}_{1}),\ldots,\mu(\boldsymbol{x}_{n})), K=K(\boldsymbol{x},\boldsymbol{x}), K^{*}=K(\boldsymbol{x},\boldsymbol{x}^{*}) and K^{**}=K(\boldsymbol{x}^{*},\boldsymbol{x}^{*}), with \boldsymbol{x}=(\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{n}).

Therefore, by Theorem 1 in [27], the predictive distribution for new data given the training data can be obtained as

p^{*}\mid\boldsymbol{p}\sim\operatorname{Exp}\left(\mu^{*},v^{*}\right),   (17)

where

v^{*}\sim\mathcal{N}\left(m^{*},\Sigma^{*}\right),\ m^{*}=K^{*}K^{-1}\text{Log}(\boldsymbol{\mu},\boldsymbol{p}),\ \Sigma^{*}=K^{**}-K^{*}K^{-1}K^{*\top}.   (18)

Similar to the comparison of operators between Riemannian manifolds and Euclidean spaces, Table 2 shows the analogs of Gaussian processes between these two spaces.

Table 2: Analogs between Gaussian processes in Euclidean spaces and on Riemannian manifolds.
             Euclidean Spaces                                              Riemannian Manifolds
Definition   f(\cdot)\sim GP(\mu(\cdot),k(\cdot,\cdot;\boldsymbol{\theta}))   f(\cdot)\sim\mathcal{GP}_{\mathcal{M}}(\mu(\cdot),k(\cdot,\cdot;\boldsymbol{\theta}))
Mean         \mu(\cdot)\in\mathbb{R}                                        \mu(\cdot)\in\mathcal{M}
Covariance   k(\cdot,\cdot;\boldsymbol{\theta})\in\mathbb{R}^{Q\times Q}    k(\cdot,\cdot;\boldsymbol{\theta})\in T_{\mu(\cdot)}\mathcal{M}^{Q\times Q}
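The following sketch illustrates wrapped Gaussian process prediction on S^{2} in the spirit of Equations (17) and (18): the data are mapped to the tangent space at a basepoint with the logarithm map, a standard GP predictive mean is computed there, and the result is wrapped back with the exponential map. The constant basepoint, the RBF kernel and all names below are our own simplifying assumptions, not the authors' implementation.

import numpy as np

def exp_map(p, v):
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

def log_map(p, q):
    w = q - np.dot(p, q) * p
    n = np.linalg.norm(w)
    theta = np.arccos(np.clip(np.dot(p, q), -1, 1))
    return np.zeros(3) if n < 1e-12 else theta * w / n

def rbf(a, b, l=0.2):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * l**2))

# toy training data: points p_i on S^2 with scalar inputs x_i and a fixed basepoint mu
x = np.linspace(0, 1, 15)
mu = np.array([0.0, 0.0, 1.0])
p_train = np.stack([exp_map(mu, 0.3 * np.array([np.sin(2*np.pi*xi), np.cos(2*np.pi*xi), 0.0]))
                    for xi in x])

V = np.stack([log_map(mu, p) for p in p_train])       # tangent-space "responses"
K = rbf(x, x) + 1e-4 * np.eye(len(x))                 # small jitter for stability
x_new = np.array([0.5])
k_star = rbf(x, x_new)
m_star = (k_star.T @ np.linalg.solve(K, V)).ravel()   # Eq. (18), predictive mean in T_mu M
p_pred = exp_map(mu, m_star)                          # Eq. (17): wrap back onto S^2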

3 Main Model and Inference

Gaussian process regression is a powerful non-parametric tool in statistics and machine learning. In this section, we adapt GPR to functional data on manifolds by proposing the WGPFR model for batch data on a Riemannian manifold. Inference for the WGPFR model is presented in Section 3.2, where the statistical properties of the model parameters are studied and an efficient algorithm for inference is proposed.

3.1 Wrapped Gaussian Process Functional Regression Model

The data consist of discrete observations of M random functional curves on \mathcal{M}, so that the m-th curve is denoted y_{m}(t), t\in\mathcal{T}, m=1,\ldots,M. We assume that all curves are observed at a common set of times t_{i}, i=1,\ldots,N. Throughout, i will index time and m will index curves. The set of observations is therefore y_{mi}=y_{m}(t_{i}), i=1,\ldots,N, m=1,\ldots,M. Associated with the m-th curve, we assume there is a real vector-valued functional covariate \boldsymbol{x}_{m}(t)\in\mathbb{R}^{Q}. We assume, as for the functional responses y_{m}(t), that \boldsymbol{x}_{m}(t) is also observed at the times t_{i}, so that we are given points \boldsymbol{x}_{mi}=\boldsymbol{x}_{m}(t_{i}). Finally, we assume there is a vector-valued covariate \boldsymbol{u}_{m}\in\mathbb{R}^{P} observed for each curve y_{m}. Thus, the data for the m-th curve comprise y_{mi}, \boldsymbol{x}_{mi}, t_{i}, i=1,\ldots,N, together with the vector \boldsymbol{u}_{m}. These data, for fixed m, are called a batch.

Next we define our hierarchical model for the data. It has the following overall structure, and details of each part are filled out in what follows. At the top level of the hierarchy, there is a mean curve \mu_{*}(t)\in\mathcal{M}. There are two stages to obtain each curve y_{m} from \mu_{*}. First there is a perturbation of \mu_{*} to obtain an unobserved curve \mu_{m}. We assume this perturbation depends on the vector covariate \boldsymbol{u}_{m} and a functional model parameter \boldsymbol{\beta}(t)\in\mathbb{R}^{P}, but not on the functional covariates (see Equation (19)). Secondly, we assume y_{m} is obtained from \mu_{m} via a vector field \tau_{m} along \mu_{m} which depends on the functional covariate \boldsymbol{x}_{m}(t) (see Equation (20)). Figure 3 shows a schematic of the overall model.

Figure 3: Schematic diagram of the model.

3.1.1 Definition of Mean Structure

It is difficult to infer non-linear relationships non-parametrically between a functional response variable and multi-dimensional functional covariates, even when both are observed in a space of real-valued functions [37]. Therefore, we assume that the mean structure \mu_{m}(\cdot) depends on the scalar covariates \boldsymbol{u}_{m} only. Specifically, we assume

\mu_{m}(t)=\text{Exp}(\mu_{*}(t),\boldsymbol{u}_{m}^{T}\boldsymbol{\beta}(t))   (19)

where \mu_{*}(t)\in\mathcal{M} and \boldsymbol{\beta}(t)\in T_{\mu_{*}(t)}\mathcal{M} play roles similar to the intercept and slope terms of a function-on-scalar linear regression model in Euclidean space. These are model parameters which we estimate. Furthermore, Equation (19) can be seen as a functional version of the geodesic regression model proposed in [17], in which a geodesic regression model for a manifold-valued response and a scalar-valued predictor takes the form y=\text{Exp}(p,xv) with (p,v)\in\mathcal{TM}, where x\in\mathbb{R} is the predictor variable.
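As a small illustration of Equation (19), the sketch below evaluates a mean curve \mu_{m}(t_{i})=\text{Exp}(\mu_{0}(t_{i}),\boldsymbol{u}_{m}^{T}\boldsymbol{\beta}(t_{i})) pointwise on S^{2}; the particular curve \mu_{0}, slope functions and covariate values are invented purely for the example.

import numpy as np

def exp_map(p, v):
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

t = np.linspace(0, 1, 50)                         # observation times
mu0 = np.stack([np.array([np.sin(s), 0.0, np.cos(s)]) for s in t])   # a mean curve on S^2

# two tangent-valued slope functions beta_1, beta_2 along mu0 (shape N x P x 3)
beta = np.zeros((len(t), 2, 3))
beta[:, 0, 1] = 0.2 * np.sin(np.pi * t)           # tangent direction at mu0 (y-axis)
beta[:, 1, 1] = 0.1 * np.cos(np.pi * t)

u_m = np.array([1.0, 0.5])                        # scalar covariates of batch m
mu_m = np.stack([exp_map(mu0[i], u_m @ beta[i]) for i in range(len(t))])   # Eq. (19)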

3.1.2 Definition of Covariance Structure

Conditional on \mu_{m}, the m-th curve y_{m} is modelled as

y_{m}(t)=\text{Exp}(\mu_{m}(t),\tau_{m}(t)),\ \text{for}\ m=1,\ldots,M   (20)

where \tau_{m}(t)\in T_{\mu_{m}(t)}\mathcal{M} is a vector field along \mu_{m}. When the data are observed with random noise, [17] defines the geodesic model as y=\text{Exp}(\text{Exp}(p,xv),\epsilon). Similarly, the proposed WGPFR model with an error term can be defined as

y_{m}(t)=\text{Exp}(\mu_{m}(t),\tau_{m}(t)+\epsilon_{m}(t)),\ \text{for}\ m=1,\cdots,M;   (21)

where \epsilon_{m}(t) are independent measurement errors following a multivariate normal distribution with mean zero and covariance matrix \sigma^{2}\mathbf{I}. For functional data, in addition to the measurement error term, the part of most interest is the underlying covariance structure contained in the term \tau_{m}(t) in Equation (20). In what follows, we explain how \tau_{m}(t) is modelled as depending on the functional covariate \boldsymbol{x}_{m}(t).

We assume that the vector field \tau_{m}(t)\in T_{\mu_{m}(t)}\mathcal{M} can be represented in local coordinates as a d-dimensional vector function \tau_{m}(t)=(\tau_{m,1}(t),\ldots,\tau_{m,d}(t))\in\mathbb{R}^{d}. We further assume this vector-valued function follows a Gaussian process with mean zero and an underlying covariance structure k_{m}(\cdot,\cdot;\boldsymbol{\theta}), that is

\tau_{m}(t)\sim GP(\mathbf{0},k_{m}(\cdot,\cdot;\boldsymbol{\theta})).   (22)

Specifically, given two observation times t_{mi} and t_{mj} from the m-th batch of data, we assume that the covariance of the two random vectors \tau_{m}(t_{mi}) and \tau_{m}(t_{mj}) is a function of the observed covariate values \boldsymbol{x}_{m}(t_{mi}) and \boldsymbol{x}_{m}(t_{mj}).

Under the framework of the wrapped Gaussian process in [27], we can represent the above Gaussian process as a wrapped Gaussian process for manifold-valued functional data. That is,

y_{m}(t)\sim\mathcal{GP}_{\mathcal{M}}(\mu_{m}(t),k_{m}(\cdot,\cdot;\boldsymbol{\theta}))   (23)

where \mu_{m}(t)\in\mathcal{M} and k_{m}(\cdot,\cdot;\boldsymbol{\theta})\in T_{\mu_{m}(t)}\mathcal{M} are defined as above.

To summarize, the WGPFR model captures the non-linear relationship between the manifold-valued functional response and mixed (functional and vector) Euclidean-valued predictors in two parts, as shown in Figure 3. The first part is the mean structure, which is defined through a functional regression relating the manifold-valued functional mean to the vector-valued predictor. The second part is the covariance structure, for which inference is based on a Bayesian method with a Gaussian process prior.

Compared to the geodesic regression in [17], the WGPFR additionally incorporates the dependence between data points modelled by the covariance structure, which is the main focus of interest in functional data analysis. Compared to the WGPR model in [27], this paper further models the mean part, which relates the functional mean on the manifold to the scalar predictor in Euclidean space, while the mean part in [27] is not related to scalar predictors. Note that the scalar predictor plays an important role in the prediction of the model, especially for the batch data we consider in this paper. The proposed WGPFR model is more complicated but much more flexible than existing models, and it induces a more complex inference procedure, which is illustrated in detail in the next subsection.

3.2 Inference for WGPFR

3.2.1 Estimation of Mean Structure

Since the mean structure model (19) can be seen as a functional version of the geodesic regression model [17], an intuitive inference method consists of optimizing \mu_{*}(t) and \boldsymbol{\beta}(t) simultaneously for each fixed t. However, this would ignore information about the correlation of the functions \mu_{*}(t) and \boldsymbol{\beta}(t) at different times. On the other hand, if the number of observed times t_{i} is large and we account fully for correlations between all different times, then the computational cost will be very high, unless we use some efficient approximation methods, such as [12]. Due to the dependence structure, we cannot conduct the optimization at each fixed t separately, otherwise the correlation information across different t will be lost. It can thus be seen that inference for the functional mean structure model is indeed challenging.

In this paper we borrow an idea of [11] to estimate \mu_{*}(t) and \boldsymbol{\beta}(t). Note that \mu_{*}(t) plays the role of an intercept in linear regression in Euclidean space, which is usually defined as the mean of the response. Similarly, for functional data y_{m}(t), m=1,\ldots,M, on a Riemannian manifold, we can define the intrinsic population Fréchet mean function, under the assumption of its existence and uniqueness:

\mu_{0}(t)=\underset{p(t)\in\mathcal{M}}{\text{arg min}}\,E[d_{\mathcal{M}}(y(t),p(t))^{2}]   (24)

where d_{\mathcal{M}}(\cdot,\cdot) denotes geodesic distance. When t is fixed, \mu_{0}(t) is a point on the Riemannian manifold; as t varies, \mu_{0}(t) traces a curve on the Riemannian manifold. Additionally, since each y_{m}(t)\in\mathcal{M} is continuous, \mu_{0}(t) is also continuous (see equation (1) in [11]).

The intrinsic Fréchet mean function \mu_{0}(t) is a natural choice for \mu_{*}(t) since it is a natural mean of manifold-valued functional data, and it is reasonable to assume that y_{m}(t), m=1,\ldots,M, are confined to stay within the injectivity radius at \mu_{0}(t) for all t\in\mathcal{T}, so that the inverse exponential map is well-defined. We estimate \mu_{*}(t) at each time t via the sample Fréchet mean of the points y_{m}(t). This requires that the times t_{i} are the same across all curves m=1,\ldots,M. Standard gradient descent methods [29] can be used to obtain the sample Fréchet mean at each time t=t_{i}, giving estimates \hat{\mu}_{0}(t_{i}) of \mu_{*}(t_{i}), i=1,\ldots,N. Denote the logarithm process data as

V_{m}(t)=\text{Log}(\mu_{0}(t),y_{m}(t)),

where V_{m}(t) is a d-dimensional tangent vector of y_{m}(t) at \mu_{0}(t) for fixed t. Since we assume \mathcal{M} is a d-dimensional closed Riemannian submanifold of a Euclidean space, it is reasonable to identify the tangent space T_{\mu_{0}(t)}\mathcal{M} with a subspace of the Euclidean space \mathbb{R}^{d_{0}}, with d_{0}\geq d. Then V_{m}(t) can be seen as an \mathbb{R}^{d_{0}}-valued square integrable function, which can be represented by a set of basis functions. Similarly, the slope function \boldsymbol{\beta}(t) can also be represented by this set of basis functions. In this way, we can easily estimate not only the intercept term \mu_{0}(t) but also the slope term \boldsymbol{\beta}(t), which can be obtained by least squares methods in \mathbb{R}^{d_{0}}. This reduces the computational complexity compared with the optimization method in [17]. Next we illustrate the estimation procedure in detail.

For the estimation of the intrinsic Fréchet mean \mu_{0}(t), the sample Fréchet mean is calculated by minimizing the following loss function for each t\in\mathcal{T}:

\hat{\mu}_{0}(t)=\underset{p(t)\in\mathcal{M}}{\text{arg min}}\ \frac{1}{M}\sum_{m=1}^{M}d_{\mathcal{M}}(y_{m}(t),p(t))^{2}.   (25)

It is natural to use the gradient descent algorithm introduced by [29] to estimate the sample Fréchet mean for each fixed t. Using [14], the gradient is given by

\nabla_{p}\,d_{\mathcal{M}}(p,y_{m}(t))^{2}=-2\text{Log}(p,y_{m}(t)).   (26)

An implementation of the gradient descent algorithm to solve equation (25) is given in Algorithm 1. In practice, we choose the step size l=\frac{1}{N}, where N is the number of samples, and \epsilon=10^{-8}, which is relatively small.

1. Initialise with a manifold-valued point p, and set a step size l and a small positive number \epsilon;
2. Compute \nu=\frac{2l}{M}\sum_{m=1}^{M}\text{Log}(p,y_{m}(t));
3. While \|\nu\|>\epsilon, update p=\text{Exp}(p,\nu) and return to Step 2; otherwise, end.
Algorithm 1 Gradient Descent Algorithm for Sample Fréchet Mean, fixed t.
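A possible Python implementation of Algorithm 1 on S^{2} is sketched below; the initialisation at a data point, the default step size and the toy data are our own choices for illustration.

import numpy as np

def exp_map(p, v):
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

def log_map(p, q):
    w = q - np.dot(p, q) * p
    n = np.linalg.norm(w)
    theta = np.arccos(np.clip(np.dot(p, q), -1, 1))
    return np.zeros(3) if n < 1e-12 else theta * w / n

def frechet_mean(Y, step=0.5, eps=1e-8, max_iter=1000):
    # Y: (M, 3) array of points on S^2 observed at one fixed time point
    M = len(Y)
    p = Y[0] / np.linalg.norm(Y[0])              # initialise at a data point
    for _ in range(max_iter):
        nu = 2 * step / M * sum(log_map(p, y) for y in Y)   # Step 2
        if np.linalg.norm(nu) <= eps:
            break
        p = exp_map(p, nu)                       # Step 3
    return p

# toy data clustered around the north pole
rng = np.random.default_rng(0)
Y = np.array([exp_map(np.array([0., 0., 1.]), 0.2 * rng.normal(size=2) @ np.eye(2, 3))
              for _ in range(30)])
print(frechet_mean(Y))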

For the estimation of \boldsymbol{\beta}(t)=\big(\boldsymbol{\beta}_{1}(t),\cdots,\boldsymbol{\beta}_{p}(t)\big), the first step is to find a reasonable basis to represent \boldsymbol{\beta}(t) in the tangent space. Under the framework of [11], suppose we have an arbitrary system of K orthonormal basis functions

\Phi_{K}=\{\phi_{k}\in\mathbb{H}\ |\ \phi_{k}(t)\in T_{\mu_{0}(t)}\mathcal{M},\ \langle\phi_{k},\phi_{l}\rangle=\delta_{kl},\ k,l=1,\cdots,K\},   (27)

where \mathbb{H}=\{\nu:\mathcal{T}\rightarrow\mathbb{R}^{d},\int_{\mathcal{T}}\nu(t)^{T}\nu(t)dt<\infty\} is the L^{2} Hilbert space of \mathbb{R}^{d}-valued functions with inner product \langle\nu,u\rangle=\int_{\mathcal{T}}\nu(t)^{T}u(t)dt and norm \|v\|=\langle v,v\rangle^{1/2}, and \delta_{kl}=1 if k=l and 0 otherwise. Note that the value of each \phi_{k}(t) at each time t\in\mathcal{T} is restricted to the tangent space T_{\mu_{0}(t)}\mathcal{M}, which is identified with \mathbb{R}^{d_{0}}. Define the K-dimensional linear subspace of \mathbb{H}:

\mathcal{M}_{K}\left(\Phi_{K}\right):=\left\{x(t)=\sum_{k=1}^{K}a_{k}\phi_{k}(t)\text{ for }t\in\mathcal{T}\mid a_{k}\in\mathbb{R}\right\}.   (28)

Then the slope functions \boldsymbol{\beta}_{j}(t)\in T_{\mu_{0}(t)}\mathcal{M}, j=1,\cdots,p, can be approximated by a linear span on \mathcal{M}_{K}\left(\Phi_{K}\right) with expansion coefficients b_{jk}:

\boldsymbol{\beta}_{j}(t)\approx\sum_{k=1}^{K}b_{jk}\phi_{k}(t).   (29)

Then the estimation of \boldsymbol{\beta}_{j}(t) can be transformed into the estimation of b_{jk}, k=1,2,\cdots,K, j=1,2,\cdots,p.

Defining V_{m}(t)=\text{Log}(\mu_{0}(t),y_{m}(t)), \hat{V}_{m}(t)=\text{Log}(\hat{\mu}_{0}(t),y_{m}(t)) and W_{m}(t)=\text{Log}(\mu_{0}(t),\mu_{m}(t)), the multiple linear function-on-scalar regression model becomes

V_{m}(t)=W_{m}(t)+e_{m}(t)\approx\sum_{j=1}^{p}\sum_{k=1}^{K}u_{mj}b_{jk}\phi_{k}(t)+e_{m}(t)   (30)

where e_{m}(t)=V_{m}(t)-W_{m}(t), which is assumed to be independent of the covariates \boldsymbol{u}_{m} for m=1,\cdots,M.

Since V_{m}(t) cannot be obtained directly, we use \hat{V}_{m}(t) as the response of the multiple linear regression model, so that the parameters of the regression model are estimated by minimizing the following loss function with the observed data:

L(b_{jk})=\sum_{m=1}^{M}\sum_{i=1}^{n}\|\hat{V}_{m}(t_{i})-\sum_{j=1}^{p}\sum_{k=1}^{K}u_{mj}b_{jk}\phi_{k}(t_{i})\|^{2}.   (31)

The bjkb_{jk} coefficients are calculated by standard least squares methods.

To be more specific, let \hat{\textbf{V}}=(\hat{V}_{im})_{i=1,\cdots,nd_{0};m=1,\cdots,M} denote an nd_{0}\times M matrix whose m-th column corresponds to the m-th batch and stacks the vectorized elements \big(\hat{V}_{m1}(t_{1}),\cdots,\hat{V}_{m1}(t_{n}),\cdots,\hat{V}_{md_{0}}(t_{1}),\cdots,\hat{V}_{md_{0}}(t_{n})\big). Similarly, let \boldsymbol{\Phi}=(\phi_{ik})_{i=1,\cdots,nd_{0};k=1,\cdots,K} denote an nd_{0}\times K matrix; let U denote an M\times p matrix in which each row \textbf{u}_{m} corresponds to one batch; and let B denote the p\times K coefficient matrix with elements b_{jk} for j=1,\cdots,p, k=1,\cdots,K. Then the above loss function can be rewritten as

L(B)=\|vec(\hat{\textbf{V}})-(\boldsymbol{\Phi}\otimes U)vec(B)\|^{2}.   (32)

The least squares estimate of B is obtained by

vec(\hat{B})=\bigg((\boldsymbol{\Phi}\otimes U)^{\top}(\boldsymbol{\Phi}\otimes U)\bigg)^{-1}(\boldsymbol{\Phi}\otimes U)^{\top}vec(\hat{\textbf{V}})   (33)
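The following toy sketch exercises Equation (33) directly: it builds the Kronecker design matrix \boldsymbol{\Phi}\otimes U with random entries and recovers vec(B) by least squares. The dimensions, the random "basis" and the use of numpy.linalg.lstsq (rather than the explicit normal-equation inverse) are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n, d0, M, p, K = 20, 3, 10, 2, 5          # times, ambient dim, batches, covariates, basis size

Phi = rng.normal(size=(n * d0, K))        # evaluated basis functions (n*d0 x K)
U = rng.normal(size=(M, p))               # scalar covariates, one row per batch
B_true = rng.normal(size=(p, K))

# simulate vec(V_hat) = (Phi kron U) vec(B) + noise
design = np.kron(Phi, U)
vecV = design @ B_true.reshape(-1, order="F") + 0.01 * rng.normal(size=n * d0 * M)

# least squares estimate, as in Eq. (33)
vecB_hat, *_ = np.linalg.lstsq(design, vecV, rcond=None)
B_hat = vecB_hat.reshape(p, K, order="F")
print(np.allclose(B_hat, B_true, atol=0.05))   # True: coefficients are recovered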

Consistency of the estimator of the sample Fréchet mean was proved in [11]. The following theorem establishes consistency of the estimator of the regression coefficients; the proof is left to the appendix.

Theorem 1.

Under conditions (C1)-(C5) in the Appendix, the coefficient estimator vec(\hat{B}) is consistent, with probability tending to 1 as n\rightarrow\infty, in the sense that

\|vec(\hat{B})-vec({B})\|=o_{p}(1).   (34)

It follows that the functional slope coefficients can be estimated by \hat{\boldsymbol{\beta}}(t)=\big(\hat{\boldsymbol{\beta}}_{1}(t),\cdots,\hat{\boldsymbol{\beta}}_{p}(t)\big), where

\hat{\boldsymbol{\beta}}_{j}(t)=\sum_{k=1}^{K}\hat{b}_{jk}\phi_{k}(t)   (35)

for j=1,\cdots,p. The estimated mean structure for the m-th curve is given by

\hat{\mu}_{m}(t)=\text{Exp}(\hat{\mu}_{0}(t),\textbf{u}_{m}\hat{\boldsymbol{\beta}}(t)).   (36)

The function \hat{\mu}_{m}(t) is expected to approximate \mu_{m}(t). However, we omit a rigorous proof of the consistency of \hat{\mu}_{m}(t) on the manifold due to some technical difficulties.

3.2.2 Estimation of Covariance Structure

From the section above, we obtain an estimate of the mean structure \mu_{m}(t), which is a continuous function on \mathcal{M}. In this section, we focus on inference for the covariance structure \tau_{m}(t)\in T_{\mu_{m}(t)}\mathcal{M}. We assume that the tangent spaces T_{\mu_{m}(t)}\mathcal{M} for t\in\mathcal{T} can be identified with \mathbb{R}^{d_{0}} via some smooth local basis of orthonormal vector fields along \mu_{m}(t). As mentioned above, the covariance structure can be related to another functional covariate x_{m}(t), which is an \mathbb{R}^{q}-valued function. If we denote \tau_{m}(t)=\tau_{m}(x_{m}(t))=(\tau_{m1}(t),\cdots,\tau_{md_{0}}(t)), then the correlation between the different components of \tau_{m}(t) could be estimated via a cross-covariance function model, such as the convolved Gaussian process [6]. However, if there are n observations \tau_{md}(t_{i}), i=1,\cdots,n, then the size of the covariance matrix in a Gaussian process for \tau_{md}(t), d=1,\cdots,d_{0}, is n\times n, while the size of the cross-covariance matrix in a convolved Gaussian process for \tau_{m}(t) is nd_{0}\times nd_{0}, which is computationally expensive. As a result, we consider the different dimensions independently. Specifically, we assume

\tau_{md}(t)\sim GP(0,K_{md}(x_{m}(t),x_{m}(t^{\prime});\boldsymbol{\theta}_{md})),   (37)

where K_{md}(\cdot,\cdot;\boldsymbol{\theta}_{md}) denotes a covariance kernel depending on the hyper-parameter \boldsymbol{\theta}_{md}.

Given the values of \tau_{md}(t_{i}), i=1,\cdots,n, denote \boldsymbol{\tau}_{md}=(\tau_{md}(t_{m1}),\ldots,\tau_{md}(t_{mn})); then for any new input t_{m}^{*}, the conditional distribution of \tau_{md}(t_{m}^{*}) is

\tau_{md}(t_{m}^{*})|\mathcal{D}_{md}\sim\mathcal{N}(\mu_{md}^{*},\Sigma_{md}^{*}),\ \mu_{md}^{*}=\boldsymbol{k}_{md}^{*T}K_{md}^{-1}\boldsymbol{\tau}_{md},\ \Sigma_{md}^{*}=k_{md}^{**}-\boldsymbol{k}_{md}^{*T}K_{md}^{-1}\boldsymbol{k}_{md}^{*}   (38)

where \boldsymbol{k}_{md}^{*}=(K_{md}(t_{m1},t_{m}^{*}),\ldots,K_{md}(t_{mn},t_{m}^{*})) and k_{md}^{**}=K_{md}(t_{m}^{*},t_{m}^{*}).

In Gaussian process regression, the hyper-parameters for the d-th dimension, \boldsymbol{\theta}_{md}, can be estimated by maximizing the sum of the marginal likelihood functions over the batches, i.e. maximizing

(2\pi)^{-\frac{n}{2}}\det(K_{md})^{-\frac{1}{2}}e^{-\frac{1}{2}\boldsymbol{\tau}_{md}^{T}K_{md}^{-1}\boldsymbol{\tau}_{md}}.   (39)

Under some regularity conditions, the estimator \hat{\boldsymbol{\theta}}_{md} converges to \boldsymbol{\theta}_{md} [9]. With the estimated hyper-parameters, the estimated covariance structure of the d-th coordinate for the m-th batch is given by

\hat{\tau}_{md}(t)|{\boldsymbol{\tau}}_{md}\sim GP\big(\boldsymbol{k}_{md}(t)^{T}K_{md}^{-1}{\boldsymbol{\tau}}_{md},\ K_{md}(t,t)-\boldsymbol{k}_{md}(t)^{T}K_{md}^{-1}\boldsymbol{k}_{md}(t)\mid\hat{\boldsymbol{\theta}}_{md}\big)   (40)

where \boldsymbol{k}_{md}(t)=(K_{md}(t,\boldsymbol{x}_{m}(t_{m1})),\cdots,K_{md}(t,\boldsymbol{x}_{m}(t_{mn}))). Under some mild conditions, the above estimator of the covariance structure is information consistent, as the following theorem states.

Theorem 2.

Suppose (1) the underlying true vector field \tau_{m}(t) along \mu_{m}(t) can be represented using a local basis by real-valued functions \tau_{md}, d=1,\ldots,d_{0}, and that each function has a Gaussian process prior with mean zero and bounded covariance function K_{md}(\cdot,\cdot,\boldsymbol{\theta}_{md}) for any covariate x_{m}(t); (2) K_{md}(\cdot,\cdot,\boldsymbol{\theta}_{md}) is continuous in \boldsymbol{\theta}_{md}, and the estimator \hat{\boldsymbol{\theta}}_{md} converges to \boldsymbol{\theta}_{md} almost surely as n\rightarrow\infty; and (3) the mean structure \mu_{m}(t) is known.

Then \hat{\tau}_{md}(\cdot) is information consistent for the true \tau_{md}(\cdot) (that is, the estimator converges to the truth as the sample size grows) if the reproducing kernel Hilbert space norm \|\tau_{md}\|_{k} is bounded and the expected regret term E_{\boldsymbol{X}}\left(\log\left|\boldsymbol{I}+\sigma^{-2}\boldsymbol{C}_{nn}\right|\right)=o(n), where \boldsymbol{C}_{nn}=\left(k\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j};\boldsymbol{\theta}\right)\right) is the covariance matrix over \boldsymbol{x}_{i},i=1,\ldots,n, and \sigma^{2} is the variance of the measurement error \epsilon_{m}(t).

The proof is given in the Appendix. Note that the above theorem holds under the condition that \mu_{m}(t) is known; in practice, we only have an estimate of \mu_{m}(t). Furthermore, the term \boldsymbol{\tau}_{md} in the posterior distribution cannot be observed directly, and it is approximated by \tilde{\boldsymbol{\tau}}_{md}=(\tilde{\tau}_{md}(t_{m1}),\cdots,\tilde{\tau}_{md}(t_{mn})), the realization of the d-th element of \tilde{\tau}_{m}(t)=(\tilde{\tau}_{m1}(t),\cdots,\tilde{\tau}_{md_{0}}(t))=\text{Log}(\hat{\mu}_{m}(t),y_{m}(t)).
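Before moving on, here is a minimal sketch of the covariance-structure estimation in this subsection for a single coordinate d of a single batch m: the hyper-parameters of an RBF kernel are estimated by maximising the log marginal likelihood corresponding to Equation (39) with a SciPy optimizer, and Equation (38) then gives the predictive mean at a new input. The kernel family, the added noise term, the parameter bounds and the toy data are assumptions made only for this illustration.

import numpy as np
from scipy.optimize import minimize

def rbf(x1, x2, log_amp, log_len):
    d2 = (x1[:, None] - x2[None, :])**2
    return np.exp(2 * log_amp) * np.exp(-d2 / (2 * np.exp(2 * log_len)))

def neg_log_marginal_likelihood(params, x, tau):
    log_amp, log_len, log_noise = params
    K = rbf(x, x, log_amp, log_len) + np.exp(2 * log_noise) * np.eye(len(x)) + 1e-8 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, tau))
    return 0.5 * tau @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

# toy tangent-space component tau_md observed at covariate values x_m(t_i)
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
tau = np.sin(3 * x) + 0.05 * rng.normal(size=40)

res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x, tau),
               method="L-BFGS-B", bounds=[(-5, 5)] * 3)
log_amp, log_len, log_noise = res.x
K = rbf(x, x, log_amp, log_len) + np.exp(2 * log_noise) * np.eye(len(x))
t_new = np.array([0.5])
k_star = rbf(x, t_new, log_amp, log_len)
mean_new = k_star.T @ np.linalg.solve(K, tau)     # predictive mean, as in Eq. (38)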

3.2.3 Update Mean Structure and Covariance Structure

After obtaining the estimated mean structure and covariance structure, we are able to make predictions with given new inputs. In order to improve the performance of our model, we introduce an algorithm which can update the estimated mean structure and covariance structure iteratively.

The loss function with the estimated mean structure and estimated covariance structure, summed over all curves and time points, is given by

E=\sum_{m=1}^{M}\sum_{i=1}^{n}d_{\mathcal{M}}\Big(\text{Exp}\big(\text{Exp}(\hat{\mu}_{0}(t_{i}),\hat{\mu}_{m}(t_{i})),\hat{\tau}_{m}(t_{i})\big),y_{m}(t_{i})\Big)^{2}   (41)

where, with a slight abuse of notation, \hat{\mu}_{m}(t_{i})=\sum_{j=1}^{p}\sum_{k=1}^{K}u_{mj}\hat{b}_{jk}\phi_{k}(t_{i}) here denotes the fitted tangent vector, so that \text{Exp}(\hat{\mu}_{0}(t_{i}),\hat{\mu}_{m}(t_{i})) is the estimated mean curve.

Given the above estimated covariance structure \hat{\tau}_{m}(t_{i}), the mean structure can be updated by a gradient descent algorithm where the gradient is

\nabla_{\mu_{m}(t_{i})}E=-d_{\mu_{m}(t_{i})}\text{Exp}\big(\mu_{m}(t_{i}),\hat{\tau}_{m}(t_{i})\big)^{\dagger}\text{Log}\Big(y_{m}(t_{i}),\text{Exp}\big(\mu_{m}(t_{i}),\hat{\tau}_{m}(t_{i})\big)\Big),   (42)

where \dagger denotes the adjoint with respect to the Riemannian inner product, which plays a similar role to parallel transport. In practice, a variational method for gradient descent [24] can be used as a substitute for the gradient in Equation (42). Thereafter, we update the mean structure from \hat{\mu}_{m}(t) to \hat{\mu}_{m}^{(1)}(t) (here the superscript (1) denotes the first iteration). The updated coefficients b_{jk}^{(1)} can be estimated by minimizing the loss function

L(b_{jk})=\|\text{Log}(\hat{\mu}_{0}(t_{i}),\hat{\mu}^{(1)}_{m}(t_{i}))-\sum_{j=1}^{p}\sum_{k=1}^{K}u_{mj}b_{jk}\phi_{k}(t_{i})\|^{2},   (43)

which is a linear least squares problem.

Given the updated mean structure \hat{\mu}_{m}^{(1)}(t), we can re-estimate the covariance structure \hat{\tau}_{m}^{(1)}(t). The only difference from Section 3.2.2 is that the estimated mean structure is replaced by the updated mean structure. This two-way updating procedure is then repeated iteratively, stopping when the change in the updated mean or covariance structure is smaller than some given threshold. The updating procedure is specified in Algorithm 2.

Input: y_{m}(t_{i}), t_{i}, \boldsymbol{x}_{m}(t_{i}), \boldsymbol{u}_{m}, m=1,\ldots,M, i=1,\ldots,n; set iter=1.
Output: \hat{b}_{jk} and \hat{\boldsymbol{\Theta}}.
1. Compute the sample Fréchet mean function \hat{\mu}_{0}(t) for all curves.
2. Estimate the coefficients b_{jk}^{(0)} according to Equation (31), to obtain the estimate of the mean structure \hat{\mu}^{(0)}_{m}(t).
3. Estimate the hyper-parameters \boldsymbol{\Theta}^{(0)} by maximising the likelihood function in Equation (39), to obtain the estimate of the covariance structure \hat{\tau}^{(0)}_{m}(t).
4. Update \hat{\mu}_{m}^{(iter)}(t) according to Equation (42).
5. Update \boldsymbol{\Theta}^{(iter)} with the updated \hat{\mu}^{(iter)}_{m}(t), and set iter = iter + 1.
6. Repeat Steps 4 and 5 until the convergence conditions are satisfied.
Algorithm 2 Algorithm to estimate and update the coefficients B=(b_{jk})_{j=1,\cdots,p;k=1,\cdots,K} and hyper-parameters \boldsymbol{\Theta}=(\boldsymbol{\theta}_{md})_{m=1,\cdots,M;d=1,\cdots,d_{0}}.
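To convey the alternating structure of Algorithm 2 without the manifold machinery, the following is a toy Euclidean analogue (entirely our own construction, not the WGPFR algorithm itself): it alternates between refitting a mean-structure coefficient and refitting a GP to the residuals, stopping when the coefficient no longer changes.

import numpy as np

def rbf(a, b, l=0.2):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * l**2))

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 50)
u_m = 1.5                                     # scalar covariate of the batch
y = u_m * np.sin(2 * np.pi * t) + 0.3 * np.cos(6 * np.pi * t) + 0.05 * rng.normal(size=50)

phi = np.sin(2 * np.pi * t)                   # one known basis function
b_hat, tau_hat = 0.0, np.zeros(50)
K = rbf(t, t) + 1e-2 * np.eye(50)

for it in range(20):
    b_old = b_hat
    # Step 4 analogue: update the mean-structure coefficient given the current tau_hat
    b_hat = np.dot(phi * u_m, y - tau_hat) / np.dot(phi * u_m, phi * u_m)
    # Step 5 analogue: refit the GP "covariance structure" to the new residuals
    resid = y - u_m * b_hat * phi
    tau_hat = rbf(t, t) @ np.linalg.solve(K, resid)   # GP posterior mean at t
    if abs(b_hat - b_old) < 1e-6:
        break

print(b_hat)    # roughly 1, the true coefficient of the u_m * sin term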

4 Numerical Experiments

In this section we demonstrate the WGPFR model on two Riemannian manifolds: S^{2} and Kendall's shape space. As previously, we suppose there are M curves on a Riemannian manifold; in what follows we simulate data points on these random curves in different ways to form different scenarios of interpolation and extrapolation. For the extrapolation problem, we use our model to predict the last 15 data points (given all preceding data points) to test the long-term extrapolation performance, and the last 5 data points to test the short-term extrapolation performance. Moreover, since Algorithm 2 is computed in Euclidean space, we use an optimizer from SciPy, which is powerful and widely used, for the minimization problems.

4.1 Regression analysis on S^{2}

4.1.1 Simulation scenario on S^{2}

Suppose observation times t are equally spaced in the interval [0,1] with N points, where m\in\{1,\ldots,m_{1},m_{1}+1,\ldots,M\} and M=m_{1}+m_{2}. To test the performance of our model with different numbers of observed curves and data points, we consider 30 and 60 batches and 20, 40 and 60 data points on each batch respectively, that is M\in\{30,40\} and n\in\{20,40,60\}. The simulated data are shown in Figure 4, and in the remainder of this subsection we explain how the data were generated.

Refer to caption
Figure 4: Simulated data on S^{2}, where red curves refer to one batch and blue curves refer to the other.

As introduced previously, the WGPFR model consists of three parts: the intrinsic Fréchet mean function, the mean structure and the covariance structure. In order to simulate data, we first define the Fréchet mean function \mu_{0}(t) on S^{2} by

\displaystyle\mu_{0}(t)=\text{Exp}(p,(\sin{(\tfrac{t\pi}{2})}^{2},\sin{(t\pi)}^{3},0)) (44)

where p=(0,0,1) is a point on S^{2}.

Suppose there are two batches, and that we generate two mean structures based on batch-specific covariates \boldsymbol{u}_{m} and Fourier basis functions \boldsymbol{\beta}(t), defined by

\displaystyle\mu_{m}(t)=\text{Exp}(\mu_{0}(t),\boldsymbol{u}_{m}^{T}\boldsymbol{\beta}(t)),\ m=1,\ldots,m_{1}+m_{2} (45)

where \boldsymbol{u}_{m}=(1,0) for m=1,\ldots,m_{1} and \boldsymbol{u}_{m}=(1,1) for m=m_{1}+1,\ldots,m_{1}+m_{2} are the scalar covariates of the two batches, and \boldsymbol{\beta}(t)=(\boldsymbol{\beta}_{1}(t),\boldsymbol{\beta}_{2}(t)) denotes the tangent vector-valued basis functions. Specifically, we define

\displaystyle\begin{split}\boldsymbol{\beta}_{1}(t)&=R_{t}\big{(}\sum_{i=1}^{20}w_{1,i}(\phi_{i}(t),\phi_{i}(\tfrac{t+1}{2}),\phi_{i}(\tfrac{t+2}{2}))\big{)},\\ \boldsymbol{\beta}_{2}(t)&=R_{t}\big{(}\sum_{i=1}^{20}w_{2,i}(\phi_{i}(t),\phi_{i}(\tfrac{t+1}{2}),\phi_{i}(\tfrac{t+2}{2}))\big{)},\ i=1,\ldots,20\end{split} (46)

where

\displaystyle\begin{split}w_{1,i}&=\frac{i}{120},\\ w_{2,i}&=-\frac{1}{2}\sqrt{\sin{(\tfrac{i}{60})}},\ i=1,\ldots,20,\\ \phi_{i}(t)&=\frac{1}{\sqrt{2}},\ \text{for}\ i=0,\\ \phi_{i}(t)&=\sin{(2\pi it)},\ \text{for}\ i=1,3,5,\ldots,19,\\ \phi_{i}(t)&=\cos{(2\pi it)},\ \text{for}\ i=2,4,6,\ldots,20\end{split} (47)

and R_{t} is a rotation matrix transferring the vector \sum_{i=1}^{20}w_{1,i}(\phi_{i}(t),\phi_{i}(\tfrac{t+1}{2}),\phi_{i}(\tfrac{t+2}{2})) or \sum_{i=1}^{20}w_{2,i}(\phi_{i}(t),\phi_{i}(\tfrac{t+1}{2}),\phi_{i}(\tfrac{t+2}{2})) to T_{\mu_{0}(t)}\mathcal{M}; this ensures that the vectors \boldsymbol{\beta}_{1}(t),\boldsymbol{\beta}_{2}(t) are tangent at \mu_{0}(t). For example, given a point p and a vector \mathbf{v}, the map is defined by

\displaystyle R_{t}(\mathbf{v})=\frac{\mathbf{v}-\langle\mathbf{v},p\rangle p}{\|\mathbf{v}-\langle\mathbf{v},p\rangle p\|}\|\mathbf{v}\|.
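A simple way to realise the map R_{t} in practice is to project the ambient vector onto the tangent space at the relevant point of S^{2} and rescale it back to its original length; a minimal sketch (the helper name `to_tangent` is ours):

```python
import numpy as np

def to_tangent(p, v):
    """Project an ambient vector v onto T_p S^2, keeping its original norm."""
    w = v - np.dot(v, p) * p            # remove the component along p
    norm_w = np.linalg.norm(w)
    if norm_w < 1e-12:                  # v is (numerically) normal to the sphere at p
        return np.zeros_like(v)
    return np.linalg.norm(v) * w / norm_w
```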

We use the following covariance function

\displaystyle k(t_{i},t_{j})=v_{0}\exp(-\frac{1}{2w_{0}}(t_{i}-t_{j})^{2})+a_{0}+a_{1}t_{i}t_{j}+\sigma^{2}\delta_{ij} (48)

which is a combination of a squared exponential kernel (often used for stationary Gaussian processes) and a linear kernel (often used for non-stationary Gaussian processes), and is suitable for most Gaussian process regression tasks [37]. Since S^{2} can be embedded into \mathbb{R}^{3}, for each dimension of the tangent vector in the covariance structure the hyper-parameters (v_{0},w_{0},a_{0},a_{1},\sigma) are set to \boldsymbol{\theta}_{1}=(0.012,3,0.01,0.01,0.02), \boldsymbol{\theta}_{2}=(0.017,3.1,0.011,0.012,0.015) and \boldsymbol{\theta}_{3}=(0.015,3.2,0.012,0.013,0.01). Given these three Gaussian processes with zero mean and kernel (48), we can generate a tangent vector for the covariance structure. However, in practice, the generated vector may not lie exactly in the target tangent space because of numerical rounding. For example, if the ideal tangent vector is (1,1,0), the generated vector might be (0.99,1,0.01), which makes the exponential map and inverse exponential map (defined in Equations (53) and (54) respectively) inapplicable. Therefore, the generated vector must be projected onto the correct tangent space; the result is denoted v_{3}(t)|\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2},\boldsymbol{\theta}_{3}, where the subscript 3 indicates that it is 3-dimensional. As a consequence, the formula used to generate manifold-valued data for the numerical experiments is

\displaystyle y_{s,m}(t)=\text{Exp}(\mu_{s}(t),v_{3}(t)|\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2},\boldsymbol{\theta}_{3}),\ s=1,2,\ m=1,\ldots,M_{s}. (49)
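The covariance-structure term v_{3}(t) in (49) can be simulated coordinate-wise: draw one zero-mean Gaussian process path per ambient coordinate with the kernel (48), then project the resulting vectors onto the appropriate tangent spaces and wrap them with the exponential map (Equations (53)–(54) in the Appendix). A sketch of the sampling step, using the hyper-parameter values listed above (variable names are ours):

```python
import numpy as np

def kernel(ti, tj, v0, w0, a0, a1):
    """Squared-exponential plus linear kernel, Equation (48), without the noise term."""
    return v0 * np.exp(-0.5 * (ti - tj) ** 2 / w0) + a0 + a1 * ti * tj

ts = np.linspace(0.0, 1.0, 60)
thetas = [(0.012, 3.0, 0.01, 0.01, 0.02),
          (0.017, 3.1, 0.011, 0.012, 0.015),
          (0.015, 3.2, 0.012, 0.013, 0.01)]

rng = np.random.default_rng(0)
paths = []
for v0, w0, a0, a1, sigma in thetas:
    K = kernel(ts[:, None], ts[None, :], v0, w0, a0, a1) + sigma ** 2 * np.eye(len(ts))
    paths.append(rng.multivariate_normal(np.zeros(len(ts)), K))

# Rows of v3 are raw 3-dimensional tangent vectors; each row is subsequently
# projected onto the tangent space at mu_s(t_i) and wrapped with the exponential map.
v3 = np.stack(paths, axis=1)
```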

4.1.2 Model assessment on S^{2}

In this section, we delete some generated data points on a randomly selected curve, say y_{r}(t), in different ways to form training data sets and then compute predictions for the deleted data points (i.e. the test data). The performance of the WGPFR model is assessed by the root mean square error between the predictions and the true values under the Euclidean distance, since the sphere embeds naturally in \mathbb{R}^{3}.

Specifically, we selected 15 data points on y_{r}(t) uniformly at random as the test data set, and all remaining data points were used as the training data set; this is a typical interpolation problem and is denoted as Type 1 prediction. For Type 2 prediction, short-term forecasting, the last 5 data points on y_{r}(t) are used as the test data set, which is a typical extrapolation problem. Analogously, we also test the performance of WGPFR for long-term forecasting, the Type 3 prediction, in which the last 15 data points are chosen as test data.
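The three train/test splits on the held-out curve y_{r}(t) can be written compactly as follows (a sketch; the function and label names are ours, not part of the paper's code):

```python
import numpy as np

def make_split(n, kind, rng):
    """Indices of training and test points on the held-out curve."""
    idx = np.arange(n)
    if kind == "type1":                       # interpolation: 15 random points
        test = rng.choice(idx, size=15, replace=False)
    elif kind == "type2":                     # short-term extrapolation: last 5 points
        test = idx[-5:]
    else:                                     # type3, long-term extrapolation: last 15 points
        test = idx[-15:]
    train = np.setdiff1d(idx, test)
    return train, test

train_idx, test_idx = make_split(60, "type1", np.random.default_rng(0))
```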

For comparison, in each scenario the same training data set is used for several other models, namely functional linear regression on Riemannian manifolds (FLRM) and wrapped Gaussian process Fréchet mean regression (WGFmR). Specifically, FLRM is the mean structure (19) without the covariance structure; WGFmR consists of a mean structure and a covariance structure in which the mean structure is the sample Fréchet mean point of all training data, and it does not include the updating step. We compare the performance of these three models to show not only the improvement in predictive accuracy from the concurrent model (estimating the mean structure and covariance structure simultaneously), but also the benefit of the iterative updating in Algorithm 2. One could also fit models without a manifold structure, such as Gaussian process functional regression in the ambient Euclidean space; however, their predictions are not guaranteed to lie on the manifold, so they are not suitable baselines and we only compare models that respect the manifold structure.

We replicate each simulation study 100 times, so our model is tested on thousands of test data points. The numerical results reported in Table 3 are the root-mean-square errors (RMSE) averaged over the replications. Using the embedding S^{2}\subseteq\mathbb{R}^{3}, it is reasonable to use the Euclidean norm between points as a distance function (the chordal metric), which provides a way to compute the RMSE.

WGPFR provides the best predictions for Type 1, Type 2 and Type 3. The predictive accuracy of FLRM is better than that of WGFmR in most scenarios, because FLRM learns the mean structure of the training data, which is more useful for trend forecasting. In extrapolation (both long-term and short-term forecasting) the test data are distant from the training data, so the GPR output is usually inaccurate and the mean structure mainly determines the accuracy of the prediction. Because WGFmR uses the Fréchet mean point as its mean function, its predictions are very poor.

Comparing long-term and short-term forecasting, the test data of the latter are closer to the training data, so short-term predictions are more accurate than long-term ones. In addition, from the table we can see that when the number of curves is fixed, the RMSE between prediction and real data decreases as the number of points increases; when the number of points is fixed, the RMSE also tends to decrease as the number of curves increases.

Type 1 Type 2 Type 3
30 curves, 20 points
WGPFR 0.0349 0.0226 0.0341
FLRM 0.2194 0.2951 0.3188
WGFmR 0.2755 0.3336 0.4779
30 curves, 60 points
WGPFR 0.0209 0.0108 0.0248
FLRM 0.1520 0.2282 0.2311
WGFmR 0.1909 0.1608 0.2164
60 curves, 20 points
WGPFR 0.0391 0.0273 0.0428
FLRM 0.2117 0.2509 0.2865
WGFmR 0.2827 0.3165 0.4717
60 curves, 60 points
WGPFR 0.0215 0.0115 0.0126
FLRM 0.1601 0.2022 0.2385
WGFmR 0.2058 0.2199 0.3225
Table 3: Root-mean-square-error for several models with four types of predictions and equally spaced data.

Since Gaussian process regression is a Bayesian model, we also compare the log-predictive likelihood of the covariance structure, which captures the stochastic component of the model. [19] describe a practical calculation of the log pointwise predictive density. We compute this index by summing the log pointwise predictive density over the dimensions; the results are shown in Table 4.
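With a plug-in Gaussian predictive distribution for each test point, the log pointwise predictive density of [19] reduces to a sum of Gaussian log densities evaluated at the predictive means and variances; a minimal sketch for one coordinate (the totals reported sum this quantity over the ambient coordinates):

```python
import numpy as np
from scipy.stats import norm

def log_pointwise_predictive_density(y_test, pred_mean, pred_var):
    """Sum of Gaussian log predictive densities over the test points for one coordinate."""
    return norm.logpdf(y_test, loc=pred_mean, scale=np.sqrt(pred_var)).sum()
```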

Moreover, the discrete Fréchet distance is a metric measuring the similarity between two curves which takes into account the locations and orderings of the predicted and true points. It is widely used for GPS data, especially for comparing trajectories of animals, vehicles or other moving objects [13]. The results show that our model yields small predictive errors under this metric; in other words, the model is effective even when the evaluation metric differs from its objective function. The numerical results are shown in Table 4.
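For completeness, the discrete Fréchet distance can be computed by the standard dynamic programme over couplings of the two point sequences; a sketch:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Frechet distance between two polygonal curves given as
    (n, d) and (m, d) arrays of vertices."""
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)   # pairwise distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return ca[-1, -1]
```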

We also show the RMSE of 𝜷\boldsymbol{\beta} in Table 4. We can see that, when the number of curves is fixed, the RMSE between estimated β\beta and real β\beta decreases with the increasing number of data points; however, when the number of data points is fixed, the RMSE between estimated β\beta and real β\beta is almost the same (0.1779 compared to 0.1773 and 0.0212 compared to 0.0214).

Type 1 Type 2 Type 3
Log predictive likelihood
30 curves, 20 points 38.4998 11.2927 35.9895
30 curves, 60 points 42.1669 13.1486 46.1013
60 curves, 20 points 40.0731 12.7354 35.9048
60 curves, 60 points 45.3947 14.1749 52.6215
Curve similarity
30 curves, 20 points 0.0657 0.0646 0.0498
30 curves, 60 points 0.0306 0.0529 0.0336
60 curves, 20 points 0.0520 0.0596 0.0535
60 curves, 60 points 0.0247 0.0422 0.0390
RMSE between β\beta and β^\hat{\beta}
30 curves, 20 points 0.1779 0.1774 0.1778
30 curves, 60 points 0.0230 0.0212 0.0236
60 curves, 20 points 0.1773 0.1770 0.1773
60 curves, 60 points 0.0228 0.0214 0.0213
Table 4: Log-predictive likelihood, discrete Fréchet distance between predicted and true curves (curve similarity), and root-mean-square error between the estimated and true values of \beta, for the three types of prediction.

4.2 Regression Analysis on Kendall’s Shape Space

We first generated \mu_{0}(t), for t\in(0,1), which describes a varying shape: the shape is a circle at the beginning and becomes a square at the end. The varying shapes are shown in Figure 5 and are generated from an elliptic equation after removing scale, rotation and translation.

Refer to caption
Figure 5: Generated \mu_{0}(t) on Kendall's shape space.

We generated data corresponding to 80 landmarks, given by an element of \mathbb{C}^{2\times 80} (see the Appendix for more details about Kendall's shape space), and the z-th landmark of the mean shape function \mu_{0}(t) is defined as

\displaystyle\mu_{0,z}(t)=\begin{cases}(\frac{z}{20}-1,\sqrt{(\frac{z}{20}-1)^{2}+(1-(\frac{z}{20}-1)^{2})(1-\frac{t}{2})^{2}}\,i),&\text{for }z=1,\ldots,40\\ (1-\frac{z}{20},-\sqrt{(1-\frac{z}{20})^{2}+(1-(\frac{z}{20}-1)^{2})(1-\frac{t}{2})^{2}}\,i),&\text{for }z=41,\ldots,80\end{cases} (50)

Then, we use the scalar covariates to generate tangent vectors from the mean function \mu_{0}(t) to the mean structures \mu_{1}(t) and \mu_{2}(t) by

\displaystyle\begin{split}\mu_{1}(t)&=\text{Exp}(\mu_{0}(t),\sum_{p=1}^{P}u_{1,p}\sin{(t)}^{3}\sin{(r(t))}),\\ \mu_{2}(t)&=\text{Exp}(\mu_{0}(t),\sum_{p=1}^{P}u_{2,p}\sin{(t)}^{3}\sin{(r(t))}\cos{(r(t))})\end{split} (51)

where P is the number of elements in \mathbf{u}_{1}, u_{1,1}=0, u_{1,2}=0, u_{2,1}=1, u_{2,2}=2, and r(t) is the rank of t; for example, when t=t_{3}, r(t)=3. Thus we obtain 3 curves in each batch, and each curve consists of 10 points.

The generated \mu_{0}(t) and the mean structures \mu_{1}(t) and \mu_{2}(t) are shown in Figures 5 and 6 respectively.

Refer to caption
Figure 6: Generated mean structures on Kendall’s shape space.

We estimate the mean structure and covariance structure as discussed in Sections 3.2.1 and 3.2.2. The training data consist of all data in batch one together with the first and second curves in batch two; the three types of prediction differ in how the third curve in batch two is handled. Specifically, for Type 1 prediction, 8 time points are randomly selected for the training data and the remaining 2 points are used as test data; for Type 2 prediction, the first half of the curve is added to the training data and the remainder is used as test data; for Type 3 prediction, the last 3 time points are used as test data and the remaining points are added to the training data. Each numerical experiment is repeated 100 times. In addition, to compare the performance of different regression models, we also tested the WGPFR model, an approximate wrapped Gaussian process functional regression model, the functional linear regression model on a Riemannian manifold (the mean structure only) and the wrapped Gaussian process Fréchet mean regression model.

The RMSE values are given in Table 5. As in the simulation study on S^{2}, the WGPFR model achieves the best predictions in both interpolation and extrapolation. Moreover, the computation time for the simulation study on S^{2} is less than 10 minutes, while that for the shape space is about one hour, since the latter has a higher dimension: there are 3 sets of estimated hyper-parameters for a wrapped Gaussian process on S^{2}, but 160 sets for a wrapped Gaussian process on the shape space.

Type 1 Type 2 Type 3
10 curves, 20 points
WGPFR 0.0189 0.0197 0.0202
FLRM 0.2702 0.2777 0.3425
WGFmR 0.1091 0.2316 0.2453
10 curves, 40 points
WGPFR 0.0159 0.0183 0.0192
FLRM 0.2663 0.1206 0.1767
WGFmR 0.0914 0.1312 0.1481
20 curves, 20 points
WGPFR 0.0178 0.0190 0.0200
FLRM 0.2648 0.2735 0.2882
WGFmR 0.0855 0.1698 0.1531
20 curves, 40 points
WGPFR 0.0101 0.0148 0.0163
FLRM 0.2036 0.1165 0.1486
WGFmR 0.0653 0.1101 0.1383
Table 5: Root-mean-squared-error of several models with four types of shape data sets.

We also report the log pointwise predictive density; the results are given in Table 6.

Type 1 Type 2 Type 3
10 curves, 20 points 80.8193 23.3038 79.0855
10 curves, 40 points 81.1700 25.9431 83.1195
20 curves, 20 points 75.9353 22.8560 78.5120
20 curves, 40 points 77.1970 24.7091 82.4506
Table 6: Log-predictive likelihood between estimated covariance structure and the real covariance structure for several models with three types of shape data.

4.3 Flight Trajectory Data

In this section, we test our model on a real data set. The earth is roughly a sphere and can be modelled as a copy of S^{2}. Certain data sets, for example hurricane trajectories, can therefore be regarded as manifold-valued random curves [39]. Here we study flight trajectory data, shown in Figure 7, in which the red curves represent flights from Shanghai to London by British Airways and the black curves represent flight trajectories of Eastern China Airlines between the same destinations (the data were downloaded from www.variflight.com in 2020). These trajectories therefore split naturally into two batches, and the model with a common mean structure can be used.

Refer to caption
Figure 7: Flight trajectories from Shanghai to London.

The original data include the time, height, speed, longitude and latitude of each flight at discrete time points. We select the position of each airplane as the response variable, which can be transformed onto S^{2} using longitude and latitude; the company and time are regarded as a non-functional and a functional covariate, respectively. Before training the model, it is necessary to pre-process the raw data. In this step, we first set the takeoff time of each flight to 0 and the landing time to 1, excluding taxi time. We selected 25 trajectories of each company in which the number of observed data points per flight is greater than 600. In order to obtain smooth manifold-valued curves from the data, kernel smoothing with a small bandwidth was applied to the longitude and latitude of the training data. For computational reasons, we take every 6-th data point of each smoothed trajectory as training data (100 data points per curve in total).
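The transformation of longitude and latitude into points on S^{2} is the usual spherical-to-Cartesian conversion; a sketch (angles in degrees, function name is ours):

```python
import numpy as np

def lonlat_to_sphere(lon_deg, lat_deg):
    """Map longitude/latitude (degrees) to unit vectors in R^3, i.e. points on S^2."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)
```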

To model the mean structure of the flight trajectory data, we use the company as the batch-specific covariate: the covariate is 0 for Eastern China Airlines and 1 for British Airways. Estimation of the mean structure and covariance structure is described in Sections 3.2.1 and 3.2.2, respectively. The parameters of the basis functions in the mean structure and the hyper-parameters in the covariance structure were updated iteratively, as described in Section 3.2.3. The initial values of the hyper-parameters in the covariance structure were drawn independently from a standard Gaussian distribution. The predictions of the WGPFR model are then compared with those of FLRM and WGFmR on the same flight trajectory data.

The overall prediction of the WGPFR model should be better than that of the other models, since the FLRM model only learns the mean structure and ignores the dependent errors, while the mean function of WGFmR is a single point on S^{2}, so its predictions have significant errors when the data are far away from the Fréchet mean. This is verified by the numerical results in Table 7.

The training and test data are generated in the following way. We randomly selected another flight trajectory of British Airways with more than 600 time observations. After the same pre-processing steps, 50 data points were added to the training data and the remaining 50 data points were used as test data. To test interpolation on the real data set, we form the Type 1 prediction by choosing the 50 training points at random; to test extrapolation, we take the first 50 points of the trajectory as training data. We then assess short-term and long-term forecasting by using the 50-th to 60-th points and the 50-th to final points of the flight trajectory as test data, respectively; these two scenarios are denoted short-term and long-term.

The predictive accuracy of WGPFR compared with FLRM and WGFmR on the flight trajectory data is shown in Table 7, which reports the mean root-mean-square error over 20 repeated simulations to reduce the effect of random seeds. We can see that the WGPFR model outperforms FLRM and WGFmR for interpolation. For the short-term prediction, the RMSE is much smaller than that of the long-term prediction; as mentioned previously, the reason is that GPR provides little extra information when the test data are distant from the training data. The prediction of WGFmR is less accurate since its mean structure (the Fréchet mean) is only a single manifold-valued point and the test data are not close to that point.

Type 1 Short-term Long-term
WGPFR 0.0048 0.0096 0.0233
FLRM 0.0082 0.0131 0.0471
WGFmR 0.0187 0.0747 0.3569
Table 7: Root-mean-squared-error of several models for flight trajectory data.

5 Conclusion and discussion

In this paper we studied a novel functional regression model within a probabilistic framework for nonlinear manifold-valued response variables and real-valued covariates. Specifically, a wrapped Gaussian process functional regression (WGPFR) model is proposed which decomposes the regression relationship into a mean structure and a covariance structure. For the mean structure, we proposed a functional version of the geodesic regression model for manifold-valued functional responses and scalar covariates, where inference is based on the intrinsic Fréchet mean function and a traditional functional linear regression model is applied to the tangent vectors of the manifold-valued data at the intrinsic Fréchet mean function, so that the effects of the scalar covariates are estimated in an ambient Euclidean space. A wrapped Gaussian process prior is proposed to model the covariance structure in the tangent space at the mean function of each batch, where inference is conducted in a Bayesian way. Furthermore, an iterative approach based on a gradient descent or variational gradient descent algorithm is applied to update the mean structure and covariance structure efficiently. The mean structure captures the relation between the functional dependent variable and the batch-based scalar independent variables, while the covariance structure models the nonlinear concurrent relationship between the functional output and multiple functional covariates. This design mitigates the curse of dimensionality when multiple functional covariates are used as inputs for functional regression, which promotes the flexibility of the method.

Future research endeavors could encompass various extensions and enhancements to the proposed WGPFR model. Initially, the WGPFR model, as delineated in this paper, is predicated on the assumption of Riemannian manifolds with infinite injectivity radius. The potential exists to adapt this model to other manifolds that may not conform to this assumption.

Secondly, the current paper presumes the independence of Gaussian processes across different dimensions. Future investigations could explore methodologies as suggested by [1] and [40], where Gaussian processes exhibit dependence. However, such an approach would necessitate the development of additional computational efficiencies.

Thirdly, the proposed model could be applied to other intriguing real data sets. For instance, the performance of our model could be evaluated on medical imaging data. Repeated measurements of functional magnetic resonance imaging (fMRI), for example, could be construed as data residing in a specific manifold. Furthermore, the recurrently scanned shapes of the corpus callosum from diverse patients could be modeled to predict the likelihood of Alzheimer’s disease based on certain relative factors [4]. In addition, consideration could be given to non-Gaussian data as [41] and alternative methods to define mean and covariance structures.

The convergence of parameter estimation is substantiated in Theorem 1 and 2. However, the assurance of convergence for the iterative optimization Algorithm 2 remains an open question for future exploration. [20] demonstrated that, given suitable initial parameters, the estimation error of conditional maximization problems is confined within the bounds of statistical error and optimization error with a high degree of probability, potentially laying the groundwork for convergence. Nonetheless, it is crucial to acknowledge the distinction between graphical spaces and Riemannian manifolds. The approach proposed by [42], which establishes that the convergence rate of GPR is optimal under certain preconditions and is upper bounded based on the smoothness of the correlation function in Euclidean space, could potentially be extrapolated to Riemannian manifolds to facilitate the convergence of covariance structure estimation. This proposition, however, requires rigorous validation.

Acknowledgments

JQS’s work is supported by funds of National Key R&D Program of China (2023YFA1011400), the National Natural Science Foundation of China (No. 12271239) and Shenzhen Fundamental Research Program (No. 20220111). Chao Liu is supported in part by China Postdoctoral Science Foundation, No.2021M691443, No.2021TQ0141 and SUSTC Presidential Postdoctoral Fellowship.

Appendix

Comparison of A Model with or without A Manifold Structure

In Kendall’s shape space, the prediction of a regression model without a manifold structure loses the shape structure, while a model with a manifold structure preserves it.

Refer to caption
Figure 8: The blue shape is the prediction of a Gaussian process regression with a manifold structure, while the red shape is the prediction of a Gaussian process regression without a manifold structure.

Examples of Riemannian Manifolds

Sphere S^{n}: The two-dimensional sphere S^{2}=\{x\in\mathbb{R}^{3}:\|x\|=1\} is a simple example of a Riemannian manifold. It is easy to show that the tangent vectors at a point p\in S^{2} are the vectors orthogonal to p. A Riemannian metric is inherited from the ambient Euclidean metric on \mathbb{R}^{3}: it is exactly the Euclidean inner product between tangent vectors. Using this metric, it is easy to show that the shortest curve between two points p,q\in S^{2} is an arc of a great circle, contained in the intersection of S^{2} with a plane through p, q and the origin. The geodesic is unique exactly when there is only one such plane, in other words, when p and q are not antipodal. The geodesic distance is given by

\displaystyle d_{\mathcal{M}}(p,q)=\cos^{-1}(p^{T}q). (52)

The exponential map is given by

\text{Exp}(p,v)=\cos(\|v\|)p+\sin(\|v\|)\frac{v}{\|v\|},\ \text{for all\ }v\in T_{p}\mathcal{M}=\{v:p^{T}v=0\}. (53)

The logarithm map is defined for all pairs p,qp,q which are not antipodal:

\displaystyle\text{Log}(p,q)=\frac{u}{\|u\|}d_{\mathcal{M}}(p,q),\ \text{where}\ u=q-(p^{T}q)p,\ \text{for all\ }p+q\neq 0. (54)
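These three formulae translate directly into code; a minimal sketch on S^{2} (function names are ours):

```python
import numpy as np

def sphere_dist(p, q):
    """Geodesic distance on S^2, Equation (52)."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))

def sphere_exp(p, v):
    """Exponential map, Equation (53), for v in the tangent space at p."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

def sphere_log(p, q):
    """Logarithm map, Equation (54); requires p and q not antipodal."""
    u = q - np.dot(p, q) * p
    nu = np.linalg.norm(u)
    return np.zeros_like(p) if nu < 1e-12 else sphere_dist(p, q) * u / nu
```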

These concepts extend to the hypersphere S^{n} for n>2.

Kendall’s Shape Space: As a well-developed probability space, Kendall’s shape space is very useful for illustrating new theory and applications in statistics and machine learning. In 2-dimensional shape analysis [15], a landmark is a point that can be unambiguously identified on all objects in a population. Landmarking is the process of identifying and labelling a set of k landmarks on each object in an analysis. This process generates a 2k-dimensional (real) vector, which can also be represented in the k-dimensional complex space \mathbb{C}^{k}. The shape of an object is the equivalence class of the object when translation, rotation and scaling are removed. Translation is removed by moving the centroid of each object to the origin. The landmark vectors then lie in the complex subspace V=\{(z_{i})\in\mathbb{C}^{k}\,|\,\sum_{i=1}^{k}z_{i}=0\}, which is a copy of \mathbb{C}^{k-1}. Two configurations in \mathbb{C}^{k-1} are equivalent if they are related by scaling and rotation. The set of equivalence classes can be shown to be equal to the complex projective space \mathbb{CP}^{k-2} and is known as Kendall’s shape space [23]. The following formulae for the exponential and logarithm maps on Kendall’s shape space are taken from [46].

Analogously to the sphere, we write down formulae which specify the Riemannian geometry on Kendall’s shape space. Suppose p and q are two data points in \mathbb{CP}^{k-2}; the geodesic distance can be computed by

\displaystyle d_{\mathcal{M}}(p,q)=\arccos\frac{|p^{*}q|}{\|p\|\|q\|} (55)

Note that this expression is invariant under multiplication of p and q by arbitrary non-zero complex constants. The inverse exponential map is given by

\text{Log}(p,q)=\frac{\theta\cdot(q-\pi_{p}q)}{\|q-\pi_{p}q\|},\ \theta=\arccos{\frac{|\langle p,q\rangle|}{\|p\|\|q\|}},\ \pi_{p}q=\frac{p\cdot\langle p,q\rangle}{\|p\|^{2}} (56)

The exponential map is defined by

\displaystyle\text{Exp}(p,v)=\cos{\theta}\cdot p+\frac{\|p\|\sin{\theta}}{\theta}v,\ \theta=\|\text{Log}(p,q)\| (57)
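The corresponding maps can be sketched in the same way, treating configurations as centred complex landmark vectors. The sketch below follows Equations (55)–(57), with \theta taken as the norm of the tangent vector in the exponential map; function names are ours.

```python
import numpy as np

def shape_dist(p, q):
    """Geodesic distance on Kendall's shape space, Equation (55);
    p and q are centred complex landmark vectors."""
    c = np.abs(np.vdot(p, q)) / (np.linalg.norm(p) * np.linalg.norm(q))
    return np.arccos(np.clip(c, 0.0, 1.0))

def shape_log(p, q):
    """Inverse exponential map, Equation (56)."""
    theta = shape_dist(p, q)
    if theta < 1e-12:
        return np.zeros_like(q)
    pi_pq = p * np.vdot(p, q) / np.linalg.norm(p) ** 2    # projection of q onto p
    w = q - pi_pq
    return theta * w / np.linalg.norm(w)

def shape_exp(p, v):
    """Exponential map, Equation (57), with theta = ||v||."""
    theta = np.linalg.norm(v)
    return p if theta < 1e-12 else np.cos(theta) * p + np.linalg.norm(p) * np.sin(theta) / theta * v
```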

The tangent space T_{p}\mathcal{M} at a point p has the same dimensionality as Kendall’s shape space.

Proofs

In order to prove the theorems in Section 3, we need the following assumptions for the Riemannian manifold \mathcal{M} and the manifold-valued data y(t).

  • (C1) \mathcal{M} is a closed Riemannian submanifold of a Euclidean space \mathbb{R}^{d_{0}}, with geodesic distance d_{\mathcal{M}} induced by the Euclidean metric.

  • (C2) Trajectories y(t) are continuous for t\in\mathcal{T} almost surely.

  • (C3) For all t\in\mathcal{T}, the Fréchet mean \mu_{0}(t) and sample Fréchet mean \hat{\mu}_{0}(t) exist and are unique almost surely.

  • (C4) Almost surely, trajectories y(t) lie in a compact set S_{t}\subset B_{\mathcal{M}}\left(\mu_{0}(t),r\right) for t\in\mathcal{T}, where B_{\mathcal{M}}\left(\mu_{0}(t),r\right)\subset\mathcal{M} is an open ball centered at \mu_{0}(t) with radius r<\inf_{t\in\mathcal{T}}\operatorname{inj}_{\mu_{0}(t)}.

  • (C5) For any \epsilon>0,

    \inf_{t\in\mathcal{T}}\inf_{p:d_{\mathcal{M}}\left(p,\mu_{0}(t)\right)>\epsilon}M(p,t)-M\left(\mu_{0}(t),t\right)>0.

Proof of Theorem 1

Proof. Recall the notation and the multiple functional regression model V_{m}(t)=W_{m}(t)+e_{m}(t), and denote the vectors of realizations of V_{m}(t), W_{m}(t) and e_{m}(t) at t_{i}, m=1,\cdots,M, i=1,\cdots,n, as \boldsymbol{V}, \boldsymbol{W} and \boldsymbol{E}, respectively. Then we have

\displaystyle vec(\hat{B})-vec(B) =\bigg{(}(\boldsymbol{\Phi}\otimes U)^{\top}(\boldsymbol{\Phi}\otimes U)\bigg{)}^{-1}(\boldsymbol{\Phi}\otimes U)^{\top}vec(\hat{\boldsymbol{V}})-vec(B)
\displaystyle=\bigg{(}(\boldsymbol{\Phi}\otimes U)^{\top}(\boldsymbol{\Phi}\otimes U)\bigg{)}^{-1}(\boldsymbol{\Phi}\otimes U)^{\top}\bigg{(}vec(\boldsymbol{W}+\boldsymbol{E})+vec(\hat{\boldsymbol{V}}-\boldsymbol{V})\bigg{)}-vec(B)
\displaystyle=\bigg{(}(\boldsymbol{\Phi}\otimes U)^{\top}(\boldsymbol{\Phi}\otimes U)\bigg{)}^{-1}(\boldsymbol{\Phi}\otimes U)^{\top}\bigg{(}(\boldsymbol{\Phi}\otimes U)vec(B)+vec(\boldsymbol{E})+vec(\hat{\boldsymbol{V}}-\boldsymbol{V})\bigg{)}-vec(B)
\displaystyle=\bigg{(}(\boldsymbol{\Phi}\otimes U)^{\top}(\boldsymbol{\Phi}\otimes U)\bigg{)}^{-1}(\boldsymbol{\Phi}\otimes U)^{\top}vec(\boldsymbol{E})
\displaystyle\quad+\bigg{(}(\boldsymbol{\Phi}\otimes U)^{\top}(\boldsymbol{\Phi}\otimes U)\bigg{)}^{-1}(\boldsymbol{\Phi}\otimes U)^{\top}vec(\hat{\boldsymbol{V}}-\boldsymbol{V})
\displaystyle\doteq A_{1}+A_{2}

For the first term, since the e_{m}(t) are assumed to be independent of the covariates \boldsymbol{u}_{m} for m=1,\cdots,M, we have E(A_{1})=0 and Var(A_{1})=O_{p}(1/M). Therefore A_{1}=o_{p}(1) as M\rightarrow\infty.

For the second term, note that

\hat{V}_{m}(t)-V_{m}(t)=\text{Log}(\hat{\mu}_{0}(t),y_{m}(t))-\text{Log}({\mu}_{0}(t),y_{m}(t))\leq d_{\mathcal{M}}(\hat{\mu}_{0}(t),\mu_{0}(t))=O_{p}(1/\sqrt{M})

where the last equality was proved in [11] under conditions (C1)-(C5). Hence the second term also satisfies A_{2}=o_{p}(1) as M\rightarrow\infty. \blacksquare

Proof of Theorem 2

Before proving Theorem 2, we first introduce some notation to simplify the proof. Recall that \boldsymbol{\tau}_{md}=(\tau_{md}(t_{m1}),\cdots,\tau_{md}(t_{mn})); for simplicity, we omit the subscript md and denote the n observations at the m-th batch and the d-th dimension as \textbf{z}_{n}=(z_{1},\cdots,z_{n})\doteq(\tau_{md}(t_{m1}),\cdots,\tau_{md}(t_{mn})), and the corresponding covariates as \boldsymbol{X}_{n}=(\boldsymbol{x}_{1},\cdots,\boldsymbol{x}_{n}) with \boldsymbol{x}_{i}\in\mathcal{X} drawn independently from a distribution \mathcal{U}(\boldsymbol{x}). Denote the observed data \mathcal{D}_{n}=\{(\boldsymbol{x}_{i},z_{i}),i=1,\cdots,n\}.

Let \tau_{0}(\cdot) be the true underlying function. Assume that the underlying process \tau(\cdot)\sim\operatorname{GP}(0,K(\cdot,\cdot;\boldsymbol{\theta})), where all subscripts md are omitted. Denote

\displaystyle p_{gp}\left(z_{n}\right) =\int_{\mathcal{F}}p\left(z_{1},\ldots,z_{n}\mid\tau\left(\boldsymbol{X}_{n}\right)\right)dp_{n}(\tau)
\displaystyle p_{0}\left(z_{n}\right) =p\left(z_{1},\ldots,z_{n}\mid\tau_{0}\left(\boldsymbol{X}_{n}\right)\right)

then p_{gp}\left(z_{n}\right) is the Bayesian predictive distribution of z_{n} based on the GPR model. Note that p_{n}(\tau) depends on n since the hyper-parameters of \tau(\cdot) are estimated from the data.

Proof. It suffices to show

\frac{1}{n}E_{\boldsymbol{X}_{n}}\left(D\left[p_{0}\left(\boldsymbol{z}_{n}\right),p_{gp}\left(\boldsymbol{z}_{n}\right)\right]\right)\rightarrow 0\quad\text{ as }n\rightarrow\infty.

Note that

\displaystyle D\left[p_{0}\left(\boldsymbol{z}_{n}\right),{p}_{gp}\left(\boldsymbol{z}_{n}\right)\right] =\int_{\mathcal{Z}^{n}}p_{0}\left(z_{1},\cdots,z_{n}\right)\log\frac{p_{0}\left(z_{1},\cdots,z_{n}\right)}{p_{gp}\left(z_{1},\cdots,z_{n}\right)}dz_{1}\cdots dz_{n}
\displaystyle=\int_{\mathcal{Z}^{n}}p_{0}\left(z_{1},\cdots,z_{n}\right)\left[-\log p_{gp}\left(z_{1},\ldots,z_{n}\right)+\log p_{0}\left(z_{1},\ldots,z_{n}\right)\right]dz_{1}\cdots dz_{n}.

It suffices to give an upper bound for the term -\log p_{gp}\left(z_{1},\ldots,z_{n}\right)+\log p_{0}\left(z_{1},\ldots,z_{n}\right).

Let \mathcal{H} be the Reproducing Kernel Hilbert Space (RKHS) associated with the covariance function k(\cdot,\cdot;\boldsymbol{\theta}), and \mathcal{H}_{n} the span of \left\{k\left(\cdot,\boldsymbol{x}_{i};\boldsymbol{\theta}\right)\right\}, i.e. \mathcal{H}_{n}=\{f(\cdot):f(\boldsymbol{x})=\sum_{i=1}^{n}\alpha_{i}k\left(\boldsymbol{x},\boldsymbol{x}_{i};\boldsymbol{\theta}\right),\ \text{for any}\ \alpha_{i}\in\mathbb{R}\}. Assume the true underlying function \tau_{0}\in\mathcal{H}_{n}; then \tau_{0}(\cdot) can be expressed as

\tau_{0}(\cdot)=\sum_{i=1}^{n}\alpha_{i}k\left(\cdot,\boldsymbol{x}_{i};\boldsymbol{\theta}\right)\triangleq K(\cdot)\boldsymbol{\alpha}

where K(\cdot)=\left(k\left(\cdot,\boldsymbol{x}_{1};\boldsymbol{\theta}\right),\ldots,k\left(\cdot,\boldsymbol{x}_{n};\boldsymbol{\theta}\right)\right) and \boldsymbol{\alpha}=\left(\alpha_{1},\ldots,\alpha_{n}\right)^{T}. By the properties of the RKHS, \left\|\tau_{0}\right\|_{k}^{2}=\boldsymbol{\alpha}^{T}\boldsymbol{C}_{nn}\boldsymbol{\alpha} and \left(\tau_{0}\left(\boldsymbol{x}_{1}\right),\ldots,\tau_{0}\left(\boldsymbol{x}_{n}\right)\right)^{T}=\boldsymbol{C}_{nn}\boldsymbol{\alpha}, where \boldsymbol{C}_{nn}=\left(k\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j};\boldsymbol{\theta}\right)\right) is the covariance matrix over \boldsymbol{x}_{i},i=1,\ldots,n.

Let P and \bar{P} be any two measures on \mathcal{F}. Then, by the Fenchel–Legendre duality, for any functional g(\cdot) on \mathcal{F},

E_{\bar{P}}[g(\tau)]\leq\log E_{P}\left[e^{g(\tau)}\right]+D[\bar{P},P]

Let g(\tau)=\log p\left(z_{1},\ldots,z_{n}\mid\tau\right) for any z_{1},\ldots,z_{n} in \mathcal{Z} and \tau\in\mathcal{F}, and let P be the measure induced by GP\left(0,k\left(\cdot,\cdot;\hat{\boldsymbol{\theta}}_{n}\right)\right), so that its finite-dimensional distribution at z_{1},\ldots,z_{n} is \tilde{p}\left(z_{1},\ldots,z_{n}\right)=N\left(0,\hat{\boldsymbol{C}}_{nn}\right), and

E_{P}\left[e^{g(\tau)}\right]=E_{P}\left[p\left(z_{1},\ldots,z_{n}\mid\tau\right)\right]=\int_{\mathcal{F}}p\left(z_{1},\cdots,z_{n}\mid\tau\right)dp_{n}(\tau)=p_{gp}\left(\boldsymbol{z}_{n}\right)

where \hat{\boldsymbol{C}}_{nn} is defined in the same way as \boldsymbol{C}_{nn} but with \boldsymbol{\theta} replaced by its estimator \hat{\boldsymbol{\theta}}_{n}.

Let \bar{P} be the posterior distribution of \tau(\cdot) on \mathcal{F} with prior distribution GP(0,k(\cdot,\cdot;\boldsymbol{\theta})) and normal likelihood \prod_{i=1}^{n}N\left(\hat{z}_{i};\tau\left(\boldsymbol{x}_{i}\right),\sigma^{2}\right), where

\hat{\boldsymbol{z}}_{n}\triangleq\left(\begin{array}[]{c}\hat{z}_{1}\\ \vdots\\ \hat{z}_{n}\end{array}\right)=\left(\boldsymbol{C}_{nn}+\sigma^{2}\boldsymbol{I}\right)\boldsymbol{\alpha}

and \sigma^{2} is a constant to be specified. In other words, we assume a model z=\tau(\boldsymbol{x})+\eta with \eta\sim N\left(0,\sigma^{2}\right) and \tau(\cdot)\sim GP(0,k(\cdot,\cdot;\boldsymbol{\theta})), and \hat{\boldsymbol{z}}_{n} is a set of observations at \boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}. Thus \bar{P}(\tau)=p\left(\tau\mid\hat{\boldsymbol{z}}_{n},\boldsymbol{X}_{n}\right) is a probability measure on \mathcal{F}. Therefore, by Gaussian process regression, the posterior of \left(\tau_{1},\ldots,\tau_{n}\right)\triangleq\left(\tau\left(\boldsymbol{x}_{1}\right),\ldots,\tau\left(\boldsymbol{x}_{n}\right)\right) is

\displaystyle\bar{p}\left(\tau_{1},\cdots,\tau_{n}\right) \triangleq p\left(\tau_{1},\cdots,\tau_{n}\mid\hat{\boldsymbol{z}},\boldsymbol{X}_{n}\right)
\displaystyle=N\left(\boldsymbol{C}_{nn}\left(\boldsymbol{C}_{nn}+\sigma^{2}\boldsymbol{I}\right)^{-1}\hat{\boldsymbol{z}},\boldsymbol{C}_{nn}\left(\boldsymbol{C}_{nn}+\sigma^{2}\boldsymbol{I}\right)^{-1}\sigma^{2}\right)
\displaystyle=N\left(\boldsymbol{C}_{nn}\boldsymbol{\alpha},\boldsymbol{C}_{nn}\left(\boldsymbol{C}_{nn}+\sigma^{2}\boldsymbol{I}\right)^{-1}\sigma^{2}\right)
\displaystyle=N\left(\boldsymbol{C}_{nn}\boldsymbol{\alpha},\boldsymbol{C}_{nn}B^{-1}\right)

where B=\boldsymbol{I}+\sigma^{-2}\boldsymbol{C}_{nn}.

It follows that

\begin{aligned} D[\bar{P},P]&=\int_{\mathcal{F}}\log\left(\frac{d\bar{P}}{dP}\right)d\bar{P}\\ &=\int_{R^{n}}\bar{p}\left(\tau_{1},\ldots,\tau_{n}\right)\log\frac{\bar{p}\left(\tau_{1},\ldots,\tau_{n}\right)}{\tilde{p}\left(\tau_{1},\ldots,\tau_{n}\right)}d\tau_{1}\ldots d\tau_{n}\\ &=\frac{1}{2}\left[\log\left|\hat{\boldsymbol{C}}_{nn}\right|-\log\left|\boldsymbol{C}_{nn}\right|+\log|B|+\operatorname{tr}\left(\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}B^{-1}\right)+\left(\boldsymbol{C}_{nn}\boldsymbol{\alpha}\right)^{T}\hat{\boldsymbol{C}}_{nn}^{-1}\left(\boldsymbol{C}_{nn}\boldsymbol{\alpha}\right)-n\right]\\ &=\frac{1}{2}\left[-\log\left|\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}\right|+\log|B|+\operatorname{tr}\left(\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}B^{-1}\right)+\left\|\tau_{0}\right\|_{k}^{2}+\boldsymbol{\alpha}^{T}\boldsymbol{C}_{nn}\left(\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}-\boldsymbol{I}\right)\boldsymbol{\alpha}-n\right]\end{aligned}

On the other hand,

E_{\bar{P}}[g(\tau)]=E_{\bar{P}}\left[\log p\left(z_{1},\ldots,z_{n}\mid\tau\right)\right]=\sum_{i=1}^{n}E_{\bar{P}}\left[\log p\left(z_{i}\mid\tau\left(\boldsymbol{x}_{i}\right)\right)\right]

By Taylor's expansion, expanding \log p\left(z_{i}\mid\tau\left(\boldsymbol{x}_{i}\right)\right) to the second order at \tau_{0}\left(\boldsymbol{x}_{i}\right) yields

\displaystyle\log p\left(z_{i}\mid\tau\left(\boldsymbol{x}_{i}\right)\right)= \log p\left(z_{i}\mid\tau_{0}\left(\boldsymbol{x}_{i}\right)\right)+\left.\frac{d\left[\log p\left(z_{i}\mid\tau\left(\boldsymbol{x}_{i}\right)\right)\right]}{d\tau\left(\boldsymbol{x}_{i}\right)}\right|_{\tau\left(\boldsymbol{x}_{i}\right)=\tau_{0}\left(\boldsymbol{x}_{i}\right)}\left(\tau\left(\boldsymbol{x}_{i}\right)-\tau_{0}\left(\boldsymbol{x}_{i}\right)\right)
\displaystyle\quad+\left.\frac{1}{2}\frac{d^{2}\left[\log p\left(z_{i}\mid\tau\left(\boldsymbol{x}_{i}\right)\right)\right]}{\left[d\tau\left(\boldsymbol{x}_{i}\right)\right]^{2}}\right|_{\tau\left(\boldsymbol{x}_{i}\right)=\tilde{\tau}\left(\boldsymbol{x}_{i}\right)}\left(\tau\left(\boldsymbol{x}_{i}\right)-\tau_{0}\left(\boldsymbol{x}_{i}\right)\right)^{2}

where \tilde{\tau}\left(\boldsymbol{x}_{i}\right)=\tau_{0}\left(\boldsymbol{x}_{i}\right)+\lambda\left(\tau\left(\boldsymbol{x}_{i}\right)-\tau_{0}\left(\boldsymbol{x}_{i}\right)\right) for some 0\leq\lambda\leq 1. For the Gaussian probability density function, it follows that

\displaystyle E_{\bar{P}}\left[\log p\left(z_{i}\mid\tau\left(\boldsymbol{x}_{i}\right)\right)\right]=\log p\left(z_{i}\mid\tau_{0}\left(\boldsymbol{x}_{i}\right)\right)-\frac{\sigma^{2}}{2}Var\left[\tau\left(\boldsymbol{x}_{i}\right)\right]

so that

E_{\bar{P}}\left[\log p\left(z_{1},\ldots,z_{n}\mid\tau\right)\right]=\log p_{0}\left(z_{1},\ldots,z_{n}\right)-\frac{\sigma^{2}}{2}\operatorname{tr}\left(Var\left[\tau\left(\boldsymbol{X}_{n}\right)\right]\right)=\log p_{0}\left(z_{1},\ldots,z_{n}\right)-\frac{\sigma^{2}}{2}\operatorname{tr}\left(\boldsymbol{C}_{nn}B^{-1}\right)

Therefore,

\begin{aligned} &-\log p_{gp}\left(z_{1},\ldots,z_{n}\right)+\log p_{0}\left(z_{1},\ldots,z_{n}\right)\\ \leq&-\log E_{P}\left[e^{g(\tau)}\right]+E_{\bar{P}}[g(\tau)]+\frac{\sigma^{2}}{2}\operatorname{tr}\left(\boldsymbol{C}_{nn}B^{-1}\right)\\ \leq&D[\bar{P},P]+\frac{\sigma^{2}}{2}\operatorname{tr}\left(\boldsymbol{C}_{nn}B^{-1}\right)\\ =&\frac{1}{2}\left\|\tau_{0}\right\|_{k}^{2}+\frac{1}{2}\left[-\log\left|\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}\right|+\log|B|+\operatorname{tr}\left(\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}B^{-1}+\sigma^{2}\boldsymbol{C}_{nn}B^{-1}\right)+\boldsymbol{\alpha}^{T}\boldsymbol{C}_{nn}\left(\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}-\boldsymbol{I}\right)\boldsymbol{\alpha}-n\right]\end{aligned}.

Since the covariance function is continuous in \boldsymbol{\theta} and \hat{\boldsymbol{\theta}}_{n}\rightarrow\boldsymbol{\theta}, we have \hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}-\boldsymbol{I}\rightarrow 0 as n\rightarrow\infty. Therefore there exists a positive constant K such that

\begin{gathered}-\log\left|\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}\right|<K\\ \boldsymbol{\alpha}^{T}\boldsymbol{C}_{nn}\left(\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}-\boldsymbol{I}\right)\boldsymbol{\alpha}<K\\ \operatorname{tr}\left((\hat{\boldsymbol{C}}_{nn}^{-1}\boldsymbol{C}_{nn}-\boldsymbol{I})B^{-1}\right)<K\end{gathered}

Thus

-\log p_{gp}\left(z_{1},\ldots,z_{n}\right)+\log p_{0}\left(z_{1},\ldots,z_{n}\right)<\frac{1}{2}\left\|\tau_{0}\right\|_{k}^{2}+\frac{1}{2}\log|B|+\frac{3}{2}K

for any \tau_{0}(\cdot)\in\mathcal{H}_{n}. Taking the infimum over \tau_{0} and applying the Representer Theorem (see Lemma 2 in [36]), we obtain

-\log p_{gp}\left(z_{1},\ldots,z_{n}\right)+\log p_{0}\left(z_{1},\ldots,z_{n}\right)\leq\frac{1}{2}\left\|\tau_{0}\right\|_{k}^{2}+\frac{1}{2}\log\left|\boldsymbol{I}+\sigma^{-2}\boldsymbol{C}_{nn}\right|+\frac{3}{2}K

for all \tau_{0}(\cdot)\in\mathcal{H}.

Therefore, we obtain that

\frac{1}{n}E_{\boldsymbol{X}_{n}}\left(D\left[p_{0}\left(\boldsymbol{z}_{n}\right),p_{gp}\left(\boldsymbol{z}_{n}\right)\right]\right)\leq\frac{1}{2n}\left\|\tau_{0}\right\|_{k}^{2}+\frac{1}{2n}E_{\boldsymbol{X}_{n}}\left(\log\left|\boldsymbol{I}+\sigma^{-2}\boldsymbol{C}_{nn}\right|\right)+\frac{3K}{2n}\rightarrow 0

as n\rightarrow\infty. \blacksquare


References

  • [1] Mauricio A Alvarez, Lorenzo Rosasco and Neil D Lawrence “Kernels for vector-valued functions: A review” In Foundations and Trends® in Machine Learning 4.3 Now Publishers, Inc., 2012, pp. 195–266
  • [2] Iskander Azangulov, Andrei Smolensky, Alexander Terenin and Viacheslav Borovitskiy “Stationary kernels and Gaussian processes on Lie groups and their homogeneous spaces i: the compact case” In arXiv preprint arXiv:2208.14960, 2022
  • [3] Iskander Azangulov, Andrei Smolensky, Alexander Terenin and Viacheslav Borovitskiy “Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces” In arXiv preprint arXiv:2301.13088, 2023
  • [4] Monami Banerjee et al. “A Nonlinear Regression Technique for Manifold Valued Data With Applications to Medical Image Analysis” In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
  • [5] Viacheslav Borovitskiy, Alexander Terenin and Peter Mostowsky “Matérn Gaussian processes on Riemannian manifolds” In Advances in Neural Information Processing Systems 33, 2020, pp. 12426–12437
  • [6] Phillip Boyle and Marcus Frean “Dependent gaussian processes” In Advances in neural information processing systems 17, 2004
  • [7] Phillip Boyle and Marcus Frean “Dependent gaussian processes” In Advances in neural information processing systems, 2005, pp. 217–224
  • [8] Michael M. Bronstein et al. “Geometric Deep Learning: Going beyond Euclidean data” In IEEE Signal Processing Magazine 34, 2017, pp. 18–42
  • [9] Taeryon Choi, Jian Q Shi and Bo Wang “A Gaussian process regression approach to a single-index model” In Journal of Nonparametric Statistics 23.1 Taylor & Francis, 2011, pp. 21–36
  • [10] Emil Cornea, Hongtu Zhu, Peter Kim and Joseph G Ibrahim “Regression models on Riemannian symmetric spaces” In Journal of the Royal Statistical Society. Series B, Statistical methodology 79.2 NIH Public Access, 2017, pp. 463
  • [11] Xiongtao Dai and Hans-Georg Müller “Principal component analysis for functional data on riemannian manifolds and spheres” In The Annals of Statistics 46.6B Institute of Mathematical Statistics, 2018, pp. 3334–3361
  • [12] Abhirup Datta, Sudipto Banerjee, Andrew O Finley and Alan E Gelfand “Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets” In Journal of the American Statistical Association 111.514 Taylor & Francis, 2016, pp. 800–812
  • [13] Thomas Devogele, Laurent Etienne, Maxence Esnault and Florian Lardy “Optimized discrete fréchet distance between trajectories” In Proceedings of the 6th ACM SIGSPATIAL Workshop on Analytics for Big Geospatial Data, 2017, pp. 11–19
  • [14] MP Do Carmo “Mathematics: Theory & Applications” In Riemannian Geometry Birkhäuser Boston, 1992, pp. xiv+–300
  • [15] Ian L Dryden and Kanti V Mardia “Statistical shape analysis: with applications in R” John Wiley & Sons, 2016
  • [16] Paromita Dubey and Hans-Georg Müller “Modeling time-varying random objects and dynamic networks” In Journal of the American Statistical Association Taylor & Francis, 2021, pp. 1–16
  • [17] P Thomas Fletcher “Geodesic regression and the theory of least squares on Riemannian manifolds” In International journal of computer vision 105.2 Springer, 2013, pp. 171–185
  • [18] P Thomas Fletcher, Conglin Lu, Stephen M Pizer and Sarang Joshi “Principal geodesic analysis for the study of nonlinear statistics of shape” In IEEE transactions on medical imaging 23.8 IEEE, 2004, pp. 995–1005
  • [19] Andrew Gelman, Jessica Hwang and Aki Vehtari “Understanding predictive information criteria for Bayesian models” In Statistics and computing 24.6 Springer, 2014, pp. 997–1016
  • [20] Botao Hao, Will Wei Sun, Yufeng Liu and Guang Cheng “Simultaneous clustering and estimation of heterogeneous graphical models” In Journal of Machine Learning Research 18.217, 2018, pp. 1–58
  • [21] Jacob Hinkle, Prasanna Muralidharan, P Thomas Fletcher and Sarang Joshi “Polynomial regression on Riemannian manifolds” In European Conference on Computer Vision, 2012, pp. 1–14 Springer
  • [22] Michael Hutchinson et al. “Vector-valued Gaussian processes on Riemannian manifolds via gauge independent projected kernels” In Advances in Neural Information Processing Systems 34, 2021, pp. 17160–17169
  • [23] David G Kendall “Shape manifolds, procrustean metrics, and complex projective spaces” In Bulletin of the London mathematical society 16.2 Wiley Online Library, 1984, pp. 81–121
  • [24] Hyunwoo J Kim et al. “Multivariate general linear models (MGLM) on Riemannian manifolds with applications to statistical analysis of diffusion weighted images” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2705–2712
  • [25] Eardi Lila, John AD Aston and Laura M Sangalli “Smooth principal component analysis over two-dimensional manifolds with an application to neuroimaging” In The Annals of Applied Statistics 10.4 Institute of Mathematical Statistics, 2016, pp. 1854–1879
  • [26] Zhenhua Lin and Fang Yao “Functional regression on the manifold with contamination” In Biometrika 108.1 Oxford University Press, 2021, pp. 167–181
  • [27] Anton Mallasto and Aasa Feragen “Wrapped Gaussian process regression on Riemannian manifolds” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5580–5588
  • [28] Esfandiar Nava-Yazdani, Hans-Christian Hege, Timothy John Sullivan and Christoph Tycowicz “Geodesic analysis in Kendall’s shape space with epidemiological applications” In Journal of Mathematical Imaging and Vision Springer, 2020, pp. 1–11
  • [29] Xavier Pennec “Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements.” In NSIP 3, 1999, pp. 194–198 Citeseer
  • [30] Alexander Petersen and Hans-Georg Müller “Fréchet regression for random objects with Euclidean predictors” In The Annals of Statistics 47.2 Institute of Mathematical Statistics, 2019, pp. 691–719
  • [31] Davide Pigoli, Alessandra Menafoglio and Piercesare Secchi “Kriging prediction for manifold-valued random fields” In Journal of Multivariate Analysis 145 Elsevier, 2016, pp. 117–131
  • [32] William H Press, Saul A Teukolsky, William T Vetterling and Brian P Flannery “Numerical recipes 3rd edition: The art of scientific computing” Cambridge university press, 2007
  • [33] James O Ramsay “Functional data analysis” In Encyclopedia of Statistical Sciences 4 Wiley Online Library, 2004
  • [34] Carl Edward Rasmussen “Gaussian processes in machine learning” In Summer school on machine learning, 2003, pp. 63–71 Springer
  • [35] Sam T Roweis and Lawrence K Saul “Nonlinear dimensionality reduction by locally linear embedding” In science 290.5500 American Association for the Advancement of Science, 2000, pp. 2323–2326
  • [36] Matthias W Seeger, Sham M Kakade and Dean P Foster “Information consistency of nonparametric Gaussian process methods” In IEEE Transactions on Information Theory 54.5 IEEE, 2008, pp. 2376–2382
  • [37] J.Q. Shi, B. Wang, R. Murray-Smith and D.M. Titterington “Gaussian process functional regression modeling for batch data” In Biometrics 63, 2007, pp. 714–723
  • [38] Jian Qing Shi and Taeryon Choi “Gaussian process regression analysis for functional data” CRC Press, 2011
  • [39] Jingyong Su, Sebastian Kurtek, Eric Klassen and Anuj Srivastava “Statistical analysis of trajectories on Riemannian manifolds: bird migration, hurricane tracking and video surveillance” In The Annals of Applied Statistics 8.1 Institute of Mathematical Statistics, 2014, pp. 530–552
  • [40] Mark Van der Wilk et al. “A framework for interdomain and multioutput Gaussian processes” In arXiv preprint arXiv:2003.01115, 2020
  • [41] Bo Wang and Jian Qing Shi “Generalized Gaussian process regression model for non-Gaussian functional data” In Journal of the American Statistical Association 109.507 Taylor & Francis, 2014, pp. 1123–1133
  • [42] Wenjia Wang and Bing-Yi Jing “Gaussian process regression: Optimality, robustness, and relationship with kernel ridge regression” In Journal of Machine Learning Research 23.193, 2022, pp. 1–67
  • [43] Zhanfeng Wang, Maengseok Noh, Youngjo Lee and Jian Qing Shi “A general robust t-process regression model” In Computational Statistics & Data Analysis 154 Elsevier, 2021, pp. 107093
  • [44] Christopher KI Williams and Carl Edward Rasmussen “Gaussian processes for machine learning” MIT press Cambridge, MA, 2006
  • [45] Fang Yao, Hans-Georg Müller and Jane-Ling Wang “Functional data analysis for sparse longitudinal data” In Journal of the American statistical association 100.470 Taylor & Francis, 2005, pp. 577–590
  • [46] Miaomiao Zhang and Tom Fletcher “Probabilistic principal geodesic analysis” In Advances in neural information processing systems 26, 2013