From the Greene–Wu Convolution to Gradient Estimation over Riemannian Manifolds

Tianyu Wang^*^**[email protected] Yifeng Huang^†^††[email protected] Didong Li^‡^‡‡[email protected]

Abstract

Over a complete Riemannian manifold of finite dimension, Greene and Wu studied a convolution of $f$ defined as

\displaystyle\widehat{f}^{\mu}(x):=\int_{v\in T_{p}\mathcal{M}}f\left({\mathrm{Exp}}_{p}(v)\right)\kappa_{\mu}\left(v\right)d\Pi,

where $\kappa_{\mu}$ is a kernel that integrates to 1, and $d\Pi$ is the base measure on $T_{p}\mathcal{M}$ . In this paper, we study properties of the Greene–Wu (GW) convolution and apply it to non-Euclidean machine learning problems. In particular, we derive a new formula for how the curvature of the space would affect the curvature of the function through the GW convolution. Also, following the study of the GW convolution, a new method for gradient estimation over Riemannian manifolds is introduced.

1 Introduction

Recently, as data structures are becoming increasingly non-Euclidean, many non-Euclidean operations are studied and applied to machine learning problems (e.g., Absil et al.,, 2009). Among these operations, the Greene-Wu (GW) convolution (Greene and Wu,, 1973, 1976, 1979) is important but relatively less understood.

Over a complete Riemannian manifold, the seminal Greene-Wu convolution (or approximation) of function $f$ at $p$ is defined as

\displaystyle\widehat{f}^{\mu}(p):=\int_{v\in T_{p}\mathcal{M}}f\left({\mathrm{Exp}}_{p}(v)\right)\kappa_{\mu}\left(v\right)d\Pi,

(GW)

where $\kappa_{\mu}$ is a convolution kernel that integrates to 1, and $d\Pi$ is the base measure on $T_{p}\mathcal{M}$ . The GW convolution generalizes standard convolutions in Euclidean spaces and has been subsequently studied in mathematics (e.g., Parkkonen and Paulin,, 2012; Azagra and Ferrera,, 2006; Azagra,, 2013). This convolution operation also naturally arises in machine learning scenarios. For example, if one is interested in studying a function over a matrix manifold, the output of such a function $f$ is GW convoluted if the input is noisy (i.e., we can only obtain an average function value near $p$ whenever we try to obtain the function value at $p$ ). A real-life scenario where the GW convolution happens is photo taking with shaky camera. Any function defined over the manifolds of images (e.g., likelihood of containing a cat) is GW convoluted if the camera is perturbed.

Despite the importance of the GW convolution, many of its elementary properties are only inadequately understood. In particular, how would the curvature of the manifold affect the curvature of the function through the GW convolution? (Q1) The answer to this question is important to machine learning problems, since the curvature of a function depicts fundamental properties of a function, including how convex the function is. In the above-mentioned example scenario, a better understanding of the GW convolution would lead to a better understanding of how the landscape of the function $f$ would be affected by the curvature of the space when the input is noisy.

In this paper, we give a quantitative answer to question (Q1). As an example, consider $\widehat{f}^{\mu}$ the GW convolution defined with kernel $\kappa_{\mu}$ whose radial density (see Definition 1) is uniform over $[-\mu,\mu]$ . For this GW convolution, we show that

	$\displaystyle\lambda_{\min}\left(H_{p,\widehat{f}^{\mu}}\right)\geq$	$\displaystyle\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\lambda_{\min}\left(H_{{\mathrm{Exp}}_{p}(tv),f}\right)dt\right]$
		$\displaystyle+\;\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right],$

where

1.

for any twice continuously differentiable $f$ and any $p\in\mathcal{M}$ , $H_{p,{f}}$ denotes the Hessian matrix of $f$ at $p$ ;
2.

$\lambda_{\min}$ exacts the minimal eigenvalue of a matrix;
3.

$\mathbb{S}_{p}$ denotes the unit sphere in $T_{p}\mathcal{M}$ and $\mathbb{E}_{v\sim\mathbb{S}_{p}}$ denotes the expectation taken with respect to $v$ uniformly sampled from $\mathbb{S}_{p}$ ;
4.

$\nabla$ denotes the covariant derivative associated with the Levi-Civita connection.

This result implies that the GW convolution can sometimes make the function more convex, and thus often more friendly to optimize. Specifically, if $f$ and $\nabla$ satisfy that

\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right]>0,

then the minimal eigenvalue of $\widetilde{f}^{\mu}$ at $p$ is larger than the minimal eigenvalue of $f$ near $p$ . This means sometimes noisy input can make the function more convex and thus more friendly to optimize, thanks to the curvature of the space.

Besides understanding of how the curvature of the space would affect curvature of the function via GW convolution, another important question is how can we directly apply the GW convolution operation to machine learning problems? (Q2) Naturally, the GW convolution is related to anti-derivatives and thus gradient estimation, as suggested by the fundamental theorem of calculus or the Stokes’ theorem. Along this line, we introduce a new gradient estimation method over Riemannian manifolds. Our gradient estimation method improves the state-of-the-art scaling with dimension $n$ from $\left(n+3\right)^{3/2}$ to $n^{3/2}$ , while holding all other quantities the same^§^§§More details for the comparison are explained in Remark 3.. Apart from curvature, Riemannian gradient estimation differs from its Euclidean counterpart in that an open set is diffeomorphic to an Euclidean set only within the injectivity radius. This difference implies that one needs to be extra careful when taking finite difference steps in non-Euclidean spaces. To this end, we show that as long as the function is geodesically $L_{1}$ -smooth, the finite difference method always works, no matter how small the injectivity radius is. Empirically, our method outperforms the best existing method for gradient estimation over Riemannian manifolds, as evidenced by thorough experimental evaluations. Based on our answer to (Q2), we study the online convex learning problems over Hadamard manifolds under stochastic bandit feedback. In particular, single point gradient estimator is designed. Thus online convex learning problem under bandit-type feedback can be solved, and a curvature-dependent regret bound is derived.

Related Works

The study of geometric methods in machine learning has never stalled. Researchers have investigated this topic from different angles, include kernel methods (e.g., Schölkopf et al.,, 2002; Muandet et al.,, 2017; Jacot et al.,, 2021), manifold learning (e.g., Lin and Zha,, 2008; Li et al.,, 2017; Zhu et al.,, 2018; Li and Dunson,, 2020), metric learning (e.g., Kulis et al.,, 2012; Li and Dunson,, 2019). Recently, attentions are concentrated on problems of learning with Riemannian geometry. Examples include convex optimization method over negatively curved spaces (Zhang and Sra,, 2016; Zhang et al.,, 2016), online learning algorithms with Riemannian metric (Antonakopoulos et al.,, 2019), mirror map based methods for relative Lipschitz objectives (Zhou et al.,, 2020).

Despite the rich literature on geometric methods for machine learning (e.g., Absil et al.,, 2009), some important non-Euclidean operations are relatively poorly understood, including the Greene–Wu (GW) convolution (Greene and Wu,, 1973, 1976, 1979; Parkkonen and Paulin,, 2012; Azagra and Ferrera,, 2006; Azagra,, 2013). To this end, we derive new properties of the GW convolution and apply it to machine learning problems.

Enabled by our study of the GW convolution, we design a new gradient estimation method over Riemannian manifolds. Perhaps the history of gradient estimation can date back to Sir Isaac Newton and the fundamental theory of calculus. Since then, many gradient estimation methods have been invented (Lyness and Moler,, 1967; Hildebrand,, 1987). In its more recent form, gradient estimation methods are often combined with learning and optimization. Flaxman et al., (2005) used the fundamental theorem of geometric calculus and solved online convex optimization problems with bandit feedback. Nesterov and Spokoiny, (2017); Li et al., (2020) introduced a stochastic gradient estimation method using Gaussian sampling, in Euclidean space or manifolds embedded in Euclidean spaces. We improve the previous results in two ways. 1. Our gradient estimation method generalizes this previous work by removing the ambient Euclidean space. 2 We tighten the previous error bound. Over an $n$ -dimensional Riemannian manifold, we show that the error of gradient estimation is bounded by $\frac{L_{1}\left(n+3\right)^{3/2}\mu}{2}$ in contrast to $\frac{L_{1}n^{3/2}\mu}{2}$ (Li et al.,, 2020), where $n$ is the dimension of the manifold, $L_{1}$ is the smoothness parameter of the objective function and $\mu$ is an algorithm parameter controlling finite-difference step size.

The application of our method to convex bandit optimization is related to online convex learning (e.g., Shalev-Shwartz,, 2011). The problem of online convex optimization with bandit feedback was introduced and studied by Flaxman et al., (2005). Since then, a sequence of works has focused on bandit convex learning in Euclidean spaces. For example, Hazan and Kale, (2012); Chen et al., (2019); Garber and Kretzu, (2020) have studied the gradient-free counterpart for the Frank-Wolfe algorithm, and Hazan and Levy, (2014) studied this problem with self-concordant barrier. An extension to this scheme is combining the methodology with a partitioning of the space Kleinberg, (2004). By doing so, regret rate of $\mathcal{O}\left(\sqrt{T}\right)$ can be achieved in Euclidean spaces (Bubeck et al.,, 2015; Bubeck and Eldan,, 2016). This rate is optimal in terms of $T$ for stochastic environement, as shown by Shamir, (2013). While there has been extensive study of online convex optimization problems, they are all restricted to Euclidean spaces. In this paper, we generalize previous results from Euclidean space to Hadamard manifolds, using our new gradient estimation method.

2 Preliminaries and Conventions

Preliminaries for Riemannian Geometry

A Riemannian manifold $(\mathcal{M},g)$ is a smooth manifold equipped with a metric tensor $g$ such that at any point $p\in\mathcal{M}$ , $g_{p}$ defines a positive definite inner product in the tangent space $T_{p}\mathcal{M}$ . For a point $p\in\mathcal{M}$ , we use $\left<,\right>_{p}$ to denote the inner product in $T_{p}\mathcal{M}$ induced by $g$ . Also, we use $\|\cdot\|_{p}$ to denote the norm associated with $\left<,\right>_{p}$ . Intuitively, a Riemannian manifold is a space that can be locally identified as an Euclidean space. With such structures, one can locally carry out calculus and linear algebra operations.

A vector field $X$ assigns each point $p\in\mathcal{M}$ a unique vector in $T_{p}\mathcal{M}$ . Let $\mathfrak{X}(\mathcal{M})$ be the space of all smooth vector fields. An affine connection is a mapping $\nabla:\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\rightarrow\mathfrak{X}(\mathcal{M})$ so that (i) $\nabla_{f_{1}X_{1}+f_{2}X_{2}}Y=f\nabla_{X_{1}}Y+f_{2}\nabla_{X_{2}}Y$ for $f_{1},f_{2}\in C^{\infty}(\mathcal{M})$ , (ii) $\nabla_{X}(a_{1}Y_{1}+a_{2}Y_{2})=a_{1}\nabla_{X}Y_{1}+a_{2}\nabla_{X}Y_{2}$ for $a_{1},a_{2}\in\mathbb{R}$ , and (iii) $\nabla_{X}(fY)=f\nabla_{X}Y+X(f)Y$ for all $f\in C^{\infty}(\mathcal{M})$ . The Levi-Civita connection is the unique affine connection that is torsion free and compatible with the Riemannian metric $g$ . A geodesic curve $\gamma:[0,1]\rightarrow\mathcal{M}$ is a curve so that $\nabla_{\dot{\gamma}(t)}\dot{\gamma}(t)=0$ for all $t\in[0,1]$ , where $\nabla$ here denotes the Levi-Civita connection. One can think of geodesic curves as generalizations of straight line segments in Euclidean space. At any point $p\in\mathcal{M}$ , the exponential map ${\mathrm{Exp}}_{p}$ is a local diffeomorphism that maps $v\in T_{p}\mathcal{M}$ to ${\mathrm{Exp}}_{p}(v)\in\mathcal{M}$ such that there exists a geodesic curve $\gamma:[0,1]\rightarrow\mathcal{M}$ with $\gamma(0)=p$ , $\dot{\gamma}\big{|}_{0}=v$ and $\gamma(1)={\mathrm{Exp}}_{p}(v)$ . Since the exponential map is (locally) diffeomorphic, the inverse exponential map is also properly defined (locally). We use ${\mathrm{Exp}}_{p}^{-1}$ to denote the inverse exponential map of ${\mathrm{Exp}}_{p}$ .

We say a vector field $X$ is parallel along a geodesic $\gamma:[0,1]\rightarrow\mathcal{M}$ if $\nabla_{\dot{\gamma}(t)}X=0$ for all $t\in[0,1]$ . For $t_{0},t_{1}\in[0,1]$ , we call the map $\mathcal{P}_{t_{0},t_{1}}^{\gamma}:T_{\gamma(t_{0})}\mathcal{M}\rightarrow T_{\gamma(t_{1})}\mathcal{M}$ a parallel transport, if it holds that for any vector field $X$ such that $X$ is parallel along $\gamma$ , $X(\gamma(t_{0}))$ is mapped to $X(\gamma(t_{1}))$ . Parallel transport induced by the Levi-Civita connection preserves the Riemannian metric, that is, for $p\in\mathcal{M}$ , $u,v\in T_{p}\mathcal{M}$ , and geodesic $\gamma:[0,1]\rightarrow\mathcal{M}$ , it holds that $\left<u,v\right>_{\gamma(0)}=\left<\mathcal{P}_{0,t}^{\gamma}(u),\mathcal{P}_{0,t}^{\gamma}(v)\right>_{\gamma(t)}$ for any $t\in[0,1]$ .

The Riemann curvature tensor $\mathcal{R}:\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\to\mathfrak{X}(\mathcal{M})$ is defined as $\mathcal{R}(X,Y)Z=\nabla_{X}\nabla_{Y}Z-\nabla_{Y}\nabla_{X}Z-\nabla_{[X,Y]}Z$ , where $[,]$ is the Lie bracket for vector fields. The sectional curvature $\mathcal{K}:\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\to\mathbb{R}$ is defined as $\mathcal{K}(X,Y)=\frac{g(\mathcal{R}(X,Y)X,Y)}{g(X,X)g(Y,Y)-g(X,Y)^{2}}$ , which is the high dimensional generalization of Gaussian curvature. Intuitively, for a point $p$ , $\mathcal{K}\left(X(p),Y(p)\right)$ is the the Gaussian curvature of the sub-manifold spanned by $X(p),Y(p)$ at $p$ .

For a continuously connected Riemannian manifold $(\mathcal{M},g)$ , a distance metric can be defined. The distance between $p,q\in\mathcal{M}$ is the infimum over the lengths of piecewise smooth curves in $\mathcal{M}$ connecting $p$ and $q$ , where the length of a curve is defined with respect to the metric tensor $g$ . A Riemannian manifold is complete if it is complete as a metric space. By the celebrated Hopf–Rinow theorem, if a Riemannian manifold is complete, the exponential map ${\mathrm{Exp}}_{p}$ is defined over the entire tangent space $T_{p}\mathcal{M}$ for all $p\in\mathcal{M}$ . The cut locus of a point $p\in\mathcal{M}$ is the set of points $q\in\mathcal{M}$ such that there exists more than two distinct shortest path from $p$ to $q$ .

We refer the readers to books (e.g., Petersen,, 2006; Lee,, 2006) for more background on Riemannian geometry.

Notations and Conventions

For better readability, we list here some notations and conventions that will be used throughout the rest of this paper.

•

For any $x\in\mathcal{M}$ , let $U_{p}$ denote the open set near $p$ that is a diffeomorphic to a subset of $\mathbb{R}^{n}$ via the local normal coordinate chart $\phi$ . Define the distance $d_{p}(q_{1},q_{2})$ ( $q_{1},q_{2}\in U_{p}$ ) such that

$\displaystyle d_{p}(q_{1},q_{2})=d_{\text{Euc}}(\phi(q_{1}),\phi(q_{2})).$

where $d_{\text{Euc}}$ is the Euclidean distance in $\mathbb{R}^{n}$ .
•

As mentioned in the preliminaries, for any $p\in\mathcal{M}$ , we use $\left<\cdot,\cdot\right>_{p}$ and $\|\cdot\|_{p}$ to denote the inner produce and norm induced by the Riemannian metric $g$ in the tangent space $T_{p}\mathcal{M}$ . We omit the subscript when it is clear from context.
•

For any $p\in\mathcal{M}$ and differentiable function $f$ over $\mathcal{M}$ , We use $\mathrm{grad}f\big{|}_{p}$ to denote the gradient of $f$ at $p$ .
•

For any $p\in\mathcal{M}$ and $\alpha>0$ , we use $\mathbb{S}_{p}(\alpha)$ to denote the origin-centered sphere in $T_{p}\mathcal{M}$ with radius $\alpha$ . For simplicity, we write $\mathbb{S}_{p}=\mathbb{S}_{p}(1)$ .
•

We will use another set of notations for parallel transport. For two points $p,q\in\mathcal{M}$ and a smooth curve $\gamma$ connecting $p$ and $q$ , we use $\mathcal{P}_{p\rightarrow q}^{\gamma}$ to denote the parallel transport from $T_{p}\mathcal{M}$ to $T_{q}\mathcal{M}$ along the curve $\gamma$ . Let $\Omega_{\min}(p,q)$ denote the set of minimizing geodesics connecting $p$ and $q$ . Define $\mathcal{P}_{p\rightarrow q}:T_{p}\mathcal{M}\rightarrow T_{q}\mathcal{M}$ be defined such that $\mathcal{P}_{p\rightarrow q}(v)=\mathbb{E}_{\gamma\sim\Omega_{\min}(p,q)}\left[\mathcal{P}_{p\rightarrow q}^{\gamma}(v)\right]$ for any $v\in T_{p}\mathcal{M}$ , where the expectation is taken with respect to the uniform probability measure over $\Omega_{\min}(p,q)$ . Since for any point $p$ , its cut locus is of measure zero, the operation $\mathcal{P}_{p\rightarrow q}$ is deterministically defined for almost every $q$ .
•

For any $p\in\mathcal{M}$ , $v\in T_{p}\mathcal{M}$ and $q\in\mathcal{M}$ , define $v_{q}=\mathcal{P}_{p\rightarrow q}(v)$ .
•

As a convention, if $U\subseteq\mathcal{M}$ is diffeomorphic to $V\subseteq\mathbb{R}^{n}$ , we simply say $U$ is diffeomorphic. Similarly, a map is said to be diffeomorphic when the domain of the map is diffeomorphic to an open subset of the Euclidean space. For simplicity, we omit all musical isomorphisms when there is no confusion.

3 The Greene–Wu Convolution

The Greene–Wu convolution can be decomposed into two steps. In the first step, a direction $v$ in the tangent space is picked, a convolution along the direction is performed, and the extension of this convolution to other points is defined. In the second step, the direction $v$ is randomized over a sphere. To properly describe these two steps, we need to formally define kernels.

Definition 1.

A function $\kappa_{\mu}:\mathbb{R}^{n}\rightarrow\mathbb{R}_{\geq 0}$ is called a kernel if

1.

For any $\alpha>0$ and $\mu>0$ , there exists a density $\Theta_{\mu,\alpha}$ on $\mathbb{R}$ such that $\kappa_{\mu}(v)={2}\nu_{\alpha}\left(\alpha\frac{v}{\|v\|}\right)\Theta_{\mu,\alpha}\left(\frac{\|v\|}{\alpha}\right)$ , and $\Theta_{\mu,\alpha}(-x)=\Theta_{\mu,\alpha}(x)$ for all $x\in\mathbb{R}$ , where $\nu_{\alpha}$ is the uniform probability density over the Euclidean sphere of radius $\alpha$ ^¶^¶¶Here $\kappa_{\mu}(0)=2\Theta_{\mu,\alpha}(0)$ as a convention.;
2.

$\lim_{\mu\rightarrow 0^{+}}\kappa_{\mu}=\delta_{0}$ , where $\delta_{0}$ is the Dirac point mass at $0$ .

We call $\Theta_{\mu,\alpha}$ the radial density for $k_{\mu}$ .

By the Radon-Nikodym theorem, item 1 in the definition of a kernel is satisfied when the measure on $T_{p}\mathcal{M}$ induced by $\kappa_{\mu}$ is absolutely continuous with respect to the base measure on $T_{p}\mathcal{M}$ . Examples of kernels include uniform mass over a ball of radius $\mu$ .

Remark 1.

For simplicity, we restrict our attention to symmetric kernels $\kappa_{\mu}$ , where the value of $\kappa_{\mu}(v)$ only depends on $\|v\|$ . Yet many of the results generalize to non-symmetric kernels. To see this, we decompose $\kappa_{\mu}$ by $\kappa_{\mu}(v)=2\nu\left(\alpha\frac{v}{\|v\|}\right)\Theta_{\mu,\alpha;v}\left(\frac{\|v\|}{\alpha}\right)$ , where $\nu_{\alpha}$ is the uniform density over the sphere of radius $\alpha$ , and $\Theta_{\mu,\alpha;v}$ is the density induced by $\kappa_{\mu}$ along the direction of $v$ (or a parallel transport of $v$ ) up to a scaling by $\alpha$ . We can replace item 1 in Definition 1 and many of the follow-up results can be extended to these cases.

With the kernel defined in Definition 1, the GW approximation/convolution can be defined as follows. For $p\in\mathcal{M}$ and $\mu>0$ , we define

\displaystyle\widetilde{f}^{\mu}(p):=\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\widetilde{f}_{v}^{\mu}(p)\right],

(1)

where

\displaystyle\widetilde{f}_{v}^{\mu}(q)=\int_{\mathbb{R}}2f\left({\mathrm{Exp}}_{q}(tv_{q})\right)\Theta_{\mu,\alpha}(t)dt,\quad\forall q\in\mathcal{M}.

(2)

Next we show that the convolution in (1) is identical to the (GW) convolution everywhere.

Proposition 1.

For any $p\in\mathcal{M}$ and $\kappa_{\mu}$ defined on $T_{p}\mathcal{M}\cong\mathbb{R}^{n}$ , it holds that $\widetilde{f}^{\mu}(p)=\widehat{f}^{\mu}(p)$ .

Proof.

Let $dt$ be the base measure on $\mathbb{R}$ and let $dS_{\alpha}$ be the base measure on $\mathbb{S}_{p}(\alpha)$ . We then have $dtdS_{\alpha}=d\Pi$ . Thus it holds that

	$\displaystyle\widetilde{f}^{\mu}(p)=$	$\displaystyle\int_{v\in\mathbb{S}_{p}(\alpha)}\int_{t\in\mathbb{R}}2f\left({\mathrm{Exp}}_{p}(tv)\right)\nu_{\alpha}\left(\alpha\frac{v}{\\|v\\|}\right)\Theta_{\mu,\alpha}(t)\,dtdS_{\alpha}$
	$\displaystyle=$	$\displaystyle\int_{u\in T_{p}\mathcal{M}}f\left({\mathrm{Exp}}_{p}(u)\right)\kappa_{\mu}(u)\,d\Pi=\widehat{f}^{\mu}(p),$

where a change of variable $u=tv$ is used.

∎

What’s interesting with (1) is that it extends (GW) from a local definition to a global definition, as pointed out in Remark 2.

Remark 2.

With the Levi-Civita connection, spheres are preserved after parallel transport, and so are the uniform base measures over the spheres. Thus the spherical integration of $\widetilde{f}_{v}^{\mu}$ defined in (2) can be carried out at $p$ regardless whether $v\in T_{p}\mathcal{M}$ or not. Hence, the convolution defined by (1) and (2) is indeed the GW convolution in the sense that $\widetilde{f}_{v}^{\mu}$ agrees with (GW) almost everywhere^∥^∥∥The cut locus is of measure zero and thus $v_{q}$ is deterministically defined almost everywhere. if the convolution at every point is restricted to the direction specified by $v$ (or parallel transport of $v$ ). Now the advantage of our decomposition is clear – $\widetilde{f}_{v}^{\mu}$ is globally defined and agrees with (GW) along the the direction of $v$ . This advantage leads to easy computations and meaningful applications, as we will see throughout the rest of the paper.

3.1 Properties of the GW Convolution

In their seminal works, Greene and Wu showed that the approximation is convexity-preserving when the width of the kernel approaches 0 (Theorem 2 by Greene and Wu, (1973)). We show that the GW convolution is convexity-preserving even if the width of the kernel is strictly positive. Also, second order properties of the GW approximation is proved.

3.1.1 Convexity-Preserving Property of the GW Convolution

Definition 2 (Convexity).

Let $\mathcal{M}$ be a complete Riemannian manifold. A smooth function $f$ is convex near $p$ with radius exceeding $\epsilon$ if there exists $\epsilon>0$ , such that for any $v\in\mathbb{S}_{p}$ , the function $f\left({\mathrm{Exp}}_{p}(tv)\right)$ is convex in $t$ over $[-\epsilon,\epsilon]$ .

Using our decomposition trick of the GW approximation (defined in Eq. 1), one can show that the GW convolution is convexity preserving even if the width of the kernel is strictly positive.

Theorem 1.

Let $\mathcal{M}$ be a complete Riemannian manifold, and fix an arbitrary $p\in\mathcal{M}$ . If there exists $\epsilon>0$ such that 1) $f$ is convex near $p$ with radius exceeding $\epsilon$ , and 2) the density $\Theta_{\mu}$ (a shorthand for $\Theta_{\mu,1}$ ) induced by kernel $\kappa_{\mu}$ satisfies that $\int_{|t|\geq\frac{\epsilon}{2}}\Theta_{\mu}(t)=0$ , then the Greene-Wu approximation $\widehat{f}^{\mu}$ is also convex near $p$ with radius exceeding $\frac{\epsilon}{2}$ .

Proof of Theorem 1.

By Proposition 1 and Remark 2, it is sufficient to consider $\widetilde{f}^{\mu}$ . Since $f$ is convex near $p$ with radius exceeding $\epsilon$ , we have, for any $v\in T_{p}\mathcal{M}$ and any real numbers $\lambda,t\in\left(0,\frac{\epsilon}{2}\right)$ ,

\displaystyle f({\mathrm{Exp}}_{p}((\lambda+t)v))+f({\mathrm{Exp}}_{p}((-\lambda+t)v))\geq 2f({\mathrm{Exp}}_{p}(tv)).

(3)

Let $\gamma(s)={\mathrm{Exp}}_{p}(sv)$ ( $s\in[-\lambda-t,\lambda+t]$ ) for simplicity. We can then rewrite (3) as

\displaystyle f({\mathrm{Exp}}_{\gamma(\lambda)}(t\mathcal{P}_{0,\lambda}^{\gamma}(v)))+f({\mathrm{Exp}}_{\gamma(-\lambda)}(t\mathcal{P}_{0,-\lambda}^{\gamma}(v)))\geq

\displaystyle 2f({\mathrm{Exp}}_{p}(tv))

(4)

We can integrate the first term in (4) with respect to $v$ and $t$ , and get

		$\displaystyle\int_{v\in\mathbb{S}_{p}}\int_{t\in\left[0,\frac{\epsilon}{2}\right]}2f({\mathrm{Exp}}_{\gamma(\lambda)}(t\mathcal{P}_{0,\lambda}^{\gamma}(v)))\nu_{1}(v)\Theta_{\mu}(t)dtdS_{1}$
	$\displaystyle=$	$\displaystyle\int_{v\in\mathbb{S}_{\gamma(\lambda)}}\int_{t\in\left[0,\frac{\epsilon}{2}\right]}2f({\mathrm{Exp}}_{\gamma(\lambda)}(tv))\nu_{1}(v)\Theta_{\mu}(t)dt\,dS_{1},$		(5)

where (5) uses that parallel transport along geodesics preserves length and angle, for the change of tangent space for the outer integral.

Since the density $\Theta_{\mu}$ satisfies that $\int_{|t|\geq\frac{\epsilon}{2}}\Theta_{\mu}(t)=0$ , the term on the right-hand-side of (5) is the exact definition of $\widetilde{f}^{\mu}(\gamma(\lambda))$ . We repeat this computation for the second term in (4), and get

\displaystyle\widetilde{f}^{\mu}(\gamma(\lambda))+\widetilde{f}^{\mu}(\gamma(-\lambda))

\displaystyle\geq 2\widetilde{f}^{\mu}(\gamma(0)),\quad\forall\lambda\in\left[0,\frac{\epsilon}{2}\right],

(6)

which is the midpoint convexity (around $\gamma(0)$ , along $\gamma$ ) for $\widetilde{f}^{\mu}$ . Since midpoint convexity implies convexity for smooth functions, we know $\widetilde{f}^{\mu}$ is convex near $p$ along $\gamma$ . Since the above argument is true for any $\gamma$ , we know that $\widetilde{f}^{\mu}$ is convex near $p$ with radius exceeding $\frac{\epsilon}{2}$ .

∎

Previous Convexity-Preserving Result

Previously, Greene and Wu, (1973, 1976, 1979) studied the GW convolution from an analytical and topological perspective, and showed that the GW convolution is convexity-preserving (Greene and Wu,, 1973) when the width of the kernel approaches zero. For completeness, this previous result is summarized in Theorem 2. In Theorem 1 above, we show that the output of GW convolution is convex with a strictly positive radius, as long as the input is convex with a strictly positive radius.

Theorem 2 (Greene and Wu, (1973)).

Consider a smooth function defined over a complete Riemannian manifold $\mathcal{M}$ . Let $\widehat{f}^{\mu}$ be the GW approximation of $f$ with kernel $\kappa_{\mu}$ . Then it holds that

\displaystyle\underset{\mu\rightarrow 0^{+}}{\lim\inf}\left(\inf_{C\in\mathcal{C}_{p}}\frac{d^{2}\widehat{f}^{\mu}(C(t))}{dt^{2}}\Big{|}_{t=0}\right)\geq 0,

(7)

where $\mathcal{C}_{p}$ is the set of all geodesics $C$ parametrized by arc-length such that $C(0)=p$ .

3.2 Second Order Properties of the GW Convolution

Our decomposition of the Greene-Wu convolution also allows other properties of the GW convolution easily derived. Perhaps one of the most important property of a function is its Hessian (matrix), which we discuss now. The definition of Hessian is in Definition 3 (e.g., Schoen and Yau, (1994); Petersen, (2006)).

Definition 3 (Hessian).

Consider an $n$ -dimensional Riemannian manifold $(\mathcal{M},g)$ . For $f\in C^{\infty}(\mathcal{M})$ , define the Hessian of $f$ as $H(f):\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\rightarrow C^{\infty}(\mathcal{M})$ such that

\displaystyle H(f)\left(X,Y\right):=\nabla_{X}df(Y),\quad\forall X,Y\in\mathfrak{X}(\mathcal{M}),

where $\nabla$ is the Levi-Civita connection. Using Christoffel symbols, it holds that

H(f)(\partial_{i},\partial_{j})=\partial_{i}\partial_{j}f-\Gamma_{ij}^{k}\partial_{k}f.

The Hessian matrix of $f$ at $p$ is a map $H_{p,f}:T_{p}\mathcal{M}\rightarrow T_{p}\mathcal{M}$ , such that $H_{p,f}(v)=\left<\mathrm{grad}f\big{|}_{p},v\right>$ for any $v\in T_{p}\mathcal{M}$ .

With the Levi-Civita connection, which is torsion-free, one can verify that

H(f)(X,Y)=H(f)(Y,X).

For any $p\in\mathcal{M}$ , with the normal coordinate specified by the exponential map ${\mathrm{Exp}}_{p}$ , the Christoffel symbol vanishes at $p$ . This means $H(f)(\partial_{i},\partial_{j})\Big{|}_{p}=\partial_{i}\partial_{j}f\big{|}_{p}$ in local normal coordinate. Similar to the Euclidean case, the eigenvalues of the Hessian matrix describes the curvature of the function. Next, we will show how the curvature of the function is affected after applying the GW convolution.

In the study of the second order properties of the GW approximation, we will restrict our attention to kernels whose radial density function is uniform. More specifically, we will use $\alpha=1$ and $\Theta_{\mu,\alpha}(t)=\frac{1}{2\mu}\mathbb{I}_{\left[t\in\left[-\mu,\mu\right]\right]}$ . Since one can construct other kernels by combining kernels with uniform radial density functions, investigating kernels with uniform radial density is sufficient for our purpose. With this in mind, we look at

\displaystyle\widetilde{f}_{v}^{\mu}(q)=\frac{1}{2\mu}\int_{-\mu}^{\mu}f\left({\mathrm{Exp}}_{q}(tv_{q})\right)dt.

(8)

The following theorem tells us how the geometry of the function and the geometry of the space are related after the GW convolution.

Theorem 3.

Consider a complete Riemannian manifold $\mathcal{M}$ and a kernel with radial density $\Theta_{\mu,\alpha}(t)=\frac{1}{2\mu}\mathbb{I}_{\left[t\in\left[-\mu,\mu\right]\right]}$ . For any $p\in\mathcal{M}$ , there exists $\mu_{0}>0$ such that for all $\mu\in(0,\mu_{0})$ , it holds that

	$\displaystyle D_{uu}\widetilde{f}_{v}^{\mu}(p)=$	$\displaystyle\frac{1}{2\mu}\int_{t\in[-\mu,\mu]}D_{u_{{\mathrm{Exp}}_{p}(tv)}u_{{\mathrm{Exp}}_{p}(tv)}}f({\mathrm{Exp}}_{p}(tv))$
		$\displaystyle+\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt,$

for all $v,u\in\mathbb{S}_{p}$ .

We will use the double exponential map (Gavrilov,, 2007) notation in the proof. The double exponential map (also written ${\mathrm{Exp}}_{p}$ ) is defined as ${\mathrm{Exp}}_{p}:T_{p}\mathcal{M}\times T_{p}\mathcal{M}\rightarrow\mathcal{M}$ such that, when $v\in T_{p}\mathcal{M}$ is small enough ( ${\mathrm{Exp}}_{p}(v)\in U_{p}$ ),

\displaystyle{\mathrm{Exp}}_{p}(v,u)={\mathrm{Exp}}_{{\mathrm{Exp}}_{p}(v)}(u_{{\mathrm{Exp}}_{p}(v)}).

(9)

With this notation, we are ready to prove Theorem 3.

Proof of Theorem 3.

With this notion of double exponential map, we have

	$\displaystyle D_{uu}\widetilde{f}_{v}^{\mu}(p):=$	$\displaystyle\;\lim_{\tau\rightarrow 0}\frac{\widetilde{f}_{v}^{\mu}({\mathrm{Exp}}_{p}(\tau u))-2\widetilde{f}_{v}^{\mu}(p)+\widetilde{f}_{v}^{\mu}({\mathrm{Exp}}_{p}(-\tau u))}{\tau^{2}}$
	$\displaystyle=$	$\displaystyle\;\frac{1}{2\mu}\int_{-\mu}^{\mu}\lim_{\tau\rightarrow 0}\frac{{f}({\mathrm{Exp}}_{p}(\tau u,tv))-2{f}({\mathrm{Exp}}_{p}(tv))+{f}({\mathrm{Exp}}_{p}(-\tau u,tv))}{\tau^{2}}\,dt.$		(10)

Let $\varphi_{u}(t,\tau)=f({\mathrm{Exp}}_{p}(tv,\tau u))$ , and let

\psi_{u,v}(\tau,t)=f({\mathrm{Exp}}_{p}(\tau u,tv))-f({\mathrm{Exp}}_{p}(tv,\tau u))+f({\mathrm{Exp}}_{p}(-\tau u,tv))-f({\mathrm{Exp}}_{p}(tv,-\tau u)).

We can then write (10) as

\displaystyle D_{uu}\widetilde{f}_{v}^{\mu}(p)=

\displaystyle\frac{1}{2\mu}\int_{t\in[-\mu,\mu]}\lim_{\tau\rightarrow 0}\frac{\varphi_{u}(t,\tau)-2\varphi_{u}(t,0)+\varphi_{u}(t,-\tau)+\psi_{u,v}(\tau,t)}{\tau^{2}}.

(11)

For the terms involving $\varphi_{u}$ , one has

\displaystyle\lim_{\tau\rightarrow 0}\frac{\varphi_{u}(t,\tau)-2\varphi_{u}(t,0)+\varphi_{u}(t,-\tau)}{\tau^{2}}=\frac{d^{2}\varphi_{u}(t,\tau)}{d\tau^{2}}\Big{|}_{\tau=0}=D_{u_{q}u_{q}}f(q),

(12)

where $q={\mathrm{Exp}}_{p}(tv)$ and $u_{q}=\mathcal{P}_{p\rightarrow q}(u)$ .

For $\psi_{u,v}(\tau,t)$ , we consider the term $f({\mathrm{Exp}}_{p}(u,v))$ for a smooth function $f$ and small vectors $u,v\in T_{p}\mathcal{M}$ . For any $p$ , $v\in T_{p}\mathcal{M}$ and $q\in U_{p}$ , define $h_{v}^{(j)}(q)=\nabla_{v_{q}}^{j}f(q)$ . We can Taylor expand $h_{v}({\mathrm{Exp}}_{p}(u))$ by

	$\displaystyle h_{v}^{(j)}({\mathrm{Exp}}_{p}(u))=$	$\displaystyle\;h_{v}^{(j)}({\mathrm{Exp}}_{p}(tu))\big{\|}_{t=1}$
	$\displaystyle=$	$\displaystyle\;\sum_{i=0}^{\infty}\frac{1}{i!}\frac{d^{i}}{dt^{i}}h_{v}^{(j)}({\mathrm{Exp}}_{p}(tu))$
	$\displaystyle=$	$\displaystyle\;\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}h_{v}^{(j)}(p)$
	$\displaystyle\overset{(a)}{=}$	$\displaystyle\;\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}\nabla_{v}^{j}f(p).$

Thus we have, for any $p$ , $u,v\in T_{p}\mathcal{M}$ of small norm,

	$\displaystyle f\left({\mathrm{Exp}}_{p}\left(u,v\right)\right)=$	$\displaystyle\;f\left({\mathrm{Exp}}_{{\mathrm{Exp}}_{p}(u)}\left(v_{{\mathrm{Exp}}_{p}(u)}\right)\right)$
	$\displaystyle=$	$\displaystyle\;\sum_{j=0}^{\infty}\frac{1}{j!}\nabla_{v_{{\mathrm{Exp}}_{p}(u)}}^{j}f({\mathrm{Exp}}_{p}(u))$
	$\displaystyle=$	$\displaystyle\;\sum_{j=0}^{\infty}\frac{1}{j!}h_{v}^{(j)}({\mathrm{Exp}}_{p}(u))$
	$\displaystyle=$	$\displaystyle\;\sum_{j=0}^{\infty}\frac{1}{j!}\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}\nabla_{v}^{j}f(p),$

where the last equality uses $(a)$ .

Thus we have

		$\displaystyle\;f({\mathrm{Exp}}_{p}(\tau u,tv))+f({\mathrm{Exp}}_{p}(-\tau u,tv))-f({\mathrm{Exp}}_{p}(tv,\tau u))-f({\mathrm{Exp}}_{p}(tv,-\tau u))$
	$\displaystyle=$	$\displaystyle\;\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{\tau u}^{i}\nabla_{tv}^{j}f(p)+\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{-\tau u}^{i}\nabla_{tv}^{j}f(p)$
		$\displaystyle-\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{tv}^{j}\nabla_{\tau u}^{i}f(p)-\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{tv}^{j}\nabla_{-\tau u}^{i}f(p)$
	$\displaystyle=$	$\displaystyle\;\sum_{j=0}^{\infty}\frac{\tau^{2}}{2j!}\nabla_{u}^{2}\nabla_{tv}^{j}f(p)+\sum_{j=0}^{\infty}\frac{\tau^{2}}{2j!}\nabla_{u}^{2}\nabla_{tv}^{j}f(p)$
		$\displaystyle-\sum_{j=0}^{\infty}\frac{\tau^{2}}{2j!}\nabla_{tv}^{j}\nabla_{u}^{2}f(p)-\sum_{j=0}^{\infty}\frac{\tau^{2}}{2j!}\nabla_{tv}^{j}\nabla_{u}^{2}f(p)+\mathcal{O}(\tau^{3}).$

Thus we have

\displaystyle\lim_{\tau\rightarrow 0}\frac{\psi_{u,v}(\tau,t)}{\tau^{2}}=\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)-\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)+Odd(t),

(13)

where $Odd(t)$ denotes terms that are odd in $t$ .

Combining equations (11), (12), (13) and that

\displaystyle\int_{-\mu}^{\mu}Odd(t)\,dt=0

gives

	$\displaystyle D_{uu}\widetilde{f}_{v}^{\mu}(p)=$	$\displaystyle\frac{1}{2\mu}\int_{t\in[-\mu,\mu]}D_{u_{q}u_{q}}f(q)$
		$\displaystyle+\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt,$

where $q={\mathrm{Exp}}_{p}(tv)$ .

∎

Theorem 3 says that when the space is not flat, the GW convolution changes the curvature of the function (Hessian matrix) in a non-trivial way. A more concrete example of this fact is in Corollary 1.

Corollary 1.

Consider a kernel with radial density function $\Theta_{\mu,\alpha}(t)=\frac{1}{2\mu}\mathbb{I}_{\left[t\in\left[-\mu,\mu\right]\right]}$ . For a smooth function $f$ defined over a complete Riemannian manifold $\mathcal{M}$ and any $p\in\mathcal{M}$ , there exists $\mu_{0}>0$ , such that for all $\mu\in(0,\mu_{0})$ , it holds that

	$\displaystyle\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)\geq$	$\displaystyle\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\lambda_{\min}\left(H_{{\mathrm{Exp}}_{p}(tv),f}\right)dt\right]$
		$\displaystyle+\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right].$

Since the minimum eigenvalue of the Hessian matrix measures the degree of convexity of a function, this corollary establish a quantitative description of how the GW convolution would affect the degree of convexity.

Proof of Corollary 1.

For any smooth function $f$ and point $p$ , one has (e.g., Absil et al.,, 2009)

\displaystyle u^{\top}H_{p,f}u:=\left<H_{p,f}(u),u\right>=D_{uu}f(p).

and thus

\displaystyle\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)=\min_{u\in\mathbb{S}_{p}}u^{\top}H_{p,\widetilde{f}^{\mu}}u=\min_{u\in\mathbb{S}_{p}}D_{uu}\widetilde{f}^{\mu}(p).

(14)

Thus we have

		$\displaystyle\;\min_{u\in\mathbb{S}_{p}}D_{uu}\widetilde{f}^{\mu}(p)$
	$\displaystyle\geq$	$\displaystyle\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{\mu}^{\mu}\min_{w\in\mathbb{S}_{{\mathrm{Exp}}_{p}(tv)}}D_{ww}f\left({\mathrm{Exp}}_{p}(tv)\right)dt\right]$
		$\displaystyle+\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right].$

Since $\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)=\min_{u\in\mathbb{S}_{p}}u^{\top}H_{p,\widetilde{f}^{\mu}}u=\min_{u\in\mathbb{S}_{p}}D_{uu}\widetilde{f}^{\mu}(p)$ for any $p$ and $f$ , it holds that

	$\displaystyle\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)\geq$	$\displaystyle\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\lambda_{\min}\left(H_{{\mathrm{Exp}}_{p}(tv),f}\right)dt\right]$
		$\displaystyle\;+\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right].$

∎

4 Estimating Gradient over Riemannian Manifolds

From our formulation of the GW approximation, one can obtain tighter bounds for gradient estimation over Riemannian manifolds, for geodesically $L_{1}$ -smooth functions defined as follows.

Definition 4.

Let $p\in\mathcal{M}$ . A function $f:\mathcal{M}\rightarrow\mathbb{R}$ is called geodesically $L_{1}$ -smooth near $p$ if

\displaystyle\left\|\mathcal{P}_{p\rightarrow q}\left(\mathrm{grad}f\big{|}_{p}\right)-\mathrm{grad}f\big{|}_{q}\right\|_{q}\leq L_{1}d_{p}(p,q),\quad\forall q\in U_{p},

(15)

where $\mathrm{grad}f\big{|}_{p}$ is the gradient of $f$ at $p$ . If $f$ is $L_{1}$ -smooth near $p$ for all $p\in\mathcal{M}$ , we say $f$ is $L_{1}$ -smooth over $\mathcal{M}$ .

For a geodesically $L_{1}$ -smooth function, one has the following theorem that bridges the gradient at $p$ and the zeroth-order information near $p$ .

Theorem 4.

Fix $\alpha>0$ . Let $f$ be geodesically $L_{1}$ -smooth near $p$ . It holds that

\displaystyle\left\|\mathrm{grad}f\big{|}_{p}-\frac{n}{\mu\alpha^{2}}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[f\left({\mathrm{Exp}}_{p}(\mu v)\right)v\right]\right\|_{p}\leq\frac{L_{1}n\alpha\mu}{2},

(16)

where $\mathbb{S}_{p}(\alpha)$ is the origin-centered sphere in $T_{p}\mathcal{M}$ of radius $\alpha$ .

This theorem provides a very simple and practical gradient estimation method: At point $p\in\mathcal{M}$ , we can select small numbers $\mu$ and $\alpha$ , and uniformly sample a vector $v\sim\mathbb{S}_{p}(\alpha)$ . With these $\mu,\alpha,v$ , the random vector

\displaystyle\widehat{\mathrm{grad}}f\big{|}_{p}(v)=\frac{n}{2\alpha^{2}\mu}\left(f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right)

(17)

gives an estimator of $\mathrm{grad}f\big{|}_{p}$ .

One can also independently sample $v_{1},v_{2},v_{3},\cdots,v_{m}$ uniformly from $\mathbb{S}_{p}$ . Using these samples, an estimator for $\mathrm{grad}f\big{|}_{p}$ is

\displaystyle\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})=\frac{n}{2m\alpha^{2}\mu}\sum_{i=1}^{m}\left(f\left({\mathrm{Exp}}_{p}(\mu v_{i})\right)-f\left({\mathrm{Exp}}_{p}(-\mu v_{i})\right)\right)v_{i}.

(18)

Note that Theorem 4 also provides an in expectation bound for the estimator in (18). Before proving Theorem 4, we first present Propositions 2 and 3, and Lemma 1.

Proposition 2.

Consider a continuously differentiable function $f$ defined over a complete Riemannian manifold $\mathcal{M}$ . If function $f$ is geodesically $L_{1}$ -smooth over $\mathcal{M}$ , then it holds that

1.

$f\left({\mathrm{Exp}}_{p}(u)\right)\leq f(p)+\left<\mathrm{grad}f\big{|}_{p},u\right>+\frac{L_{1}}{2}\|u\|^{2}$ for all $p\in\mathcal{M}$ and $u\in T_{p}\mathcal{M}$ ;
2.

for any $v\in T_{p}\mathcal{M}$ , $\left\|\mathcal{P}_{0,1}^{\gamma}\left(\mathrm{grad}f\big{|}_{p}\right)-\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(v)}\right\|\leq L_{1}\|v\|$ , where $\gamma(t):={\mathrm{Exp}}_{p}(tv)$ .

Proof.

For any $p\in\mathcal{M}$ and $u\in T_{p}\mathcal{M}$ , let $q_{0}=p,q_{1},q_{2},\cdots,q_{N}={\mathrm{Exp}}_{p}(u)$ be a sequence of points such that $q_{i+1}\in U_{q_{i}}$ and $q_{i+1}={\mathrm{Exp}}_{q_{i}}(\tau u_{q_{i}})$ for all $\tau>0$ that is sufficiently small. Since the Levi-Civita connection preserves the Riemannian metric, we know $N\tau=1$ . At any $q_{i}$ , we have

	$\displaystyle\left<\mathrm{grad}f\big{\|}_{q_{i}},u_{q_{i}}\right>$
$\displaystyle=$	$\displaystyle\left<\mathrm{grad}f\big{\|}_{q_{i}}-\mathcal{P}_{q_{i-1}\rightarrow q_{i}}\left(\mathrm{grad}f\big{\|}_{q_{i-1}}\right),u_{q_{i}}\right>+\left<\mathrm{grad}f\big{\|}_{q_{i-1}},u_{q_{i-1}}\right>$	(19)
$\displaystyle\leq$	$\displaystyle\left<\mathrm{grad}f\big{\|}_{q_{i-1}},u_{q_{i-1}}\right>+L_{1}\tau\\|u\\|^{2}$
$\displaystyle\leq$	$\displaystyle\cdots$
$\displaystyle\leq$	$\displaystyle\left<\mathrm{grad}f\big{\|}_{p},u\right>+iL_{1}\tau\\|u\\|^{2},$

where (19) uses the definition of geodesic $L_{1}$ -smoothness.

Also, it holds that

		$\displaystyle f\left(q_{i}\right)-f\left(q_{i-1}\right)$
	$\displaystyle=$	$\displaystyle\int_{0}^{\tau}\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})},u_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}\right>dt.\qquad(\text{by Lemma \ref{lem:fundamental-calculus}})$		(20)

By the definition of geodesic $L_{1}$ -smoothness and the Cauchy-Schwarz inequality, it holds that

	$\displaystyle\int_{0}^{\tau}\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})},u_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}\right>dt-\tau\left<\mathrm{grad}f\big{\|}_{q_{i-1}},u_{q_{i-1}}\right>$
$\displaystyle\leq$	$\displaystyle\int_{0}^{\tau}\left\\|\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}-\mathcal{P}_{q_{i-1}\rightarrow{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}\left(\mathrm{grad}f\big{\|}_{q_{i-1}}\right)\right\\|\\|u\\|\,dt$
$\displaystyle\leq$	$\displaystyle\frac{L_{1}\tau^{2}\\|u\\|^{2}}{2}.$	(21)

Adding and subtracting $\tau\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right>$ to (20) and using (21) gives

\displaystyle f\left(q_{i}\right)-f\left(q_{i-1}\right)\leq\frac{L_{1}\tau^{2}\|u\|^{2}}{2}+\tau\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right>.

Thus we have

$\displaystyle f({\mathrm{Exp}}_{p}(u))-f(p)\leq$	$\displaystyle\sum_{i=1}^{N}f(q_{i})-f(q_{i-1})$
$\displaystyle\leq$	$\displaystyle\sum_{i=1}^{N}\left(\tau\left<\mathrm{grad}f\big{\|}_{q_{i-1}},u_{q_{i-1}}\right>+\frac{L_{1}\tau^{2}\\|u\\|^{2}}{2}\right)$
$\displaystyle\leq$	$\displaystyle N\tau\left<\mathrm{grad}f\big{\|}_{p},u\right>+\sum_{i=1}^{N}L_{1}(i-1)\tau^{2}\\|u\\|^{2}+\frac{L_{1}N\tau^{2}\\|u\\|^{2}}{2}$
$\displaystyle=$	$\displaystyle N\tau\left<\mathrm{grad}f\big{\|}_{p},u\right>+\frac{L_{1}N^{2}\tau^{2}\\|u\\|^{2}}{2}+O(N\tau^{2}),$	(22)

where the $O\left(\cdot\right)$ notation omits constants that do not depend on $N$ or $\tau$ . Since $N\tau=1$ and (22) is true for arbitrarily small $\tau$ , letting $\tau\rightarrow 0$ in (22) finishes the proof.

The second item can be proved in a similar way. Specifically, we can also find a sequence of points $q_{0},q_{1},\cdots,q_{N}$ on the curve of $\gamma(t)={\mathrm{Exp}}_{p}(tv)$ with $q_{0}=p$ and $q_{N}={\mathrm{Exp}}_{p}(v)$ , and repeatedly use the definition of geodesic $L_{1}$ -smoothness.

∎

Proposition 3.

For any vector $u\in T_{p}\mathcal{M}$ , we have

\displaystyle\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>v\right]=\frac{\alpha^{2}}{n}u,\quad\text{ and }\quad\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>^{2}\right]=\frac{\alpha^{2}}{n}\|u\|^{2}.

Proof.

It is sufficient to consider $u=e_{i}$ , where $\{e_{i}\}_{i}$ is the local canonical basis for $T_{p}\mathcal{M}$ with respect to $g_{p}$ . For any $i,j\in\{1,2,\cdots,n\}$ ,

\displaystyle\left(\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<e_{i},v\right>v\right]\right)_{j}=\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{i}v_{j}\right].

Since $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{j}|v_{i}=x\right]=0$ for any $x$ when $i\neq j$ , we have $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{i}v_{j}\right]=0$ . When $i=j$ , by symmetry, we have $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{j}^{2}\right]=\frac{\alpha^{2}}{n}$ for all $j=1,2,\cdots,n$ .

We have shown $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>v\right]=\frac{\alpha^{2}}{n}u$ , on which taking inner product with $u$ shows that $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>^{2}\right]=\frac{\alpha^{2}}{n}\|u\|^{2}$ . ∎

Lemma 1.

Pick $p\in\mathcal{M}$ and $v\in T_{p}\mathcal{M}$ . Let $\gamma(t)={\mathrm{Exp}}_{p}(tv)$ $(t\in[a,b])$ be a geodesic in $\mathcal{M}$ . We then have

\displaystyle\int_{a}^{b}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(sv)},\dot{\gamma}\big{|}_{s}\right>ds=f(\gamma(b))-f(\gamma(a)).

(23)

Proof.

Consider the function $f\circ\gamma:[a,b]\rightarrow\mathbb{R}$ . By fundamental theorem of calculus, we have

\displaystyle f(\gamma(b))-f(\gamma(a))

\displaystyle=\int_{a}^{b}\frac{df\circ\gamma}{dt}{d}t.

Also, the directional derivative over Riemannian manifolds can be computed by

\displaystyle\frac{df\circ\gamma}{dt}\Bigg{|}_{s}=\lim_{\tau\rightarrow 0}\frac{f\circ\gamma(s+\tau)}{\tau}=\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(sv)},\dot{\gamma}\big{|}_{s}\right>.

Combining the above results finishes the proof. ∎

Lemma 1 can be viewed as a corollary of the fundamental theorem of calculus, or a consequence of Stokes’ theorem. If one views the curve $\gamma$ as a manifold with boundary ( $\gamma(a)$ and $\gamma(b)$ ), and $\dot{\gamma}$ as a vector field parallel to $\gamma$ , then applying Stokes’ theorem proves Lemma 1.

We are now ready to prove Theorem 4, using our decomposition trick with the radial density function being $\Theta_{\mu,\alpha}(x)=\frac{1}{2\mu}\mathbb{I}_{\left[x\in\left[-\mu,\mu\right]\right]}$ .

Proof of Theorem 4.

From the definition of directional derivative, we have

$\displaystyle\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{\|}_{p},v\right>_{p}$	$\displaystyle=\frac{1}{2\mu}\lim_{\tau\rightarrow 0}\frac{\int_{-\mu}^{\mu}\left(f\left({\mathrm{Exp}}_{{\mathrm{Exp}}_{p}(\tau v)}\left(tv_{{\mathrm{Exp}}_{p}(\tau v)}\right)\right)-f\left({\mathrm{Exp}}_{p}\left(tv\right)\right)\right)dt}{\tau}$
	$\displaystyle=\frac{1}{2\mu}\int_{-\mu}^{\mu}\lim_{\tau\rightarrow 0}\frac{f\left({\mathrm{Exp}}_{p}\left(\tau v+tv\right)\right)-f\left({\mathrm{Exp}}_{p}\left(tv\right)\right)}{\tau}dt$
	$\displaystyle=\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt$	(24)

For simplicity, let $\gamma_{v}(t)={\mathrm{Exp}}_{p}(tv)$ ( $t\in[-\mu,\mu]$ ) for any $v\in T_{p}(\mathcal{M})$ . Since $f$ is geodesically $L_{1}$ -smooth, we have, for any $p$ and $v\in T_{p}\mathcal{M}$ ,

	$\displaystyle\left<\mathrm{grad}{f}\big{\|}_{p}-\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{\|}_{p},v\right>_{p}$
$\displaystyle=$	$\displaystyle\left<\mathrm{grad}{f}\big{\|}_{p},v\right>_{p}-\frac{\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt}{2\mu}$	(25)
$\displaystyle=$	$\displaystyle\frac{1}{2\mu}\int_{-\mu}^{\mu}\left(\left<\mathcal{P}_{0,t}^{\gamma_{v}}\left(\mathrm{grad}{f}\big{\|}_{p}\right),v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}\right)dt$
	$\displaystyle-\frac{1}{2\mu}\int_{-\mu}^{\mu}\left(\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}\right)dt$	(26)
$\displaystyle=$	$\displaystyle\frac{1}{2\mu}\int_{-\mu}^{\mu}\left(\left<\mathcal{P}_{0,t}^{\gamma_{v}}\left(\mathrm{grad}{f}\big{\|}_{p}\right)-\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}\right)dt$
$\displaystyle\leq$	$\displaystyle\frac{1}{2\mu}\int_{-\mu}^{\mu}L_{1}\|t\|\alpha^{2}dt=\frac{L_{1}\alpha^{2}\mu}{2},$	(27)

where (25) uses (24), (26) uses the fact that the parallel transport preserves the Riemannian metric, and (27) uses Proposition 2 and the Cauchy-Schwarz inequality.

By Proposition 3, we have

\displaystyle\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}f\big{|}_{p},v\right>_{p}v\right]=\frac{\alpha^{2}}{n}\mathrm{grad}f\big{|}_{p},

and thus from (27), we have

	$\displaystyle\left\\|\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{\|}_{p},v\right>v\right]-\frac{\alpha^{2}}{n}\mathrm{grad}f\big{\|}_{p}\right\\|$
$\displaystyle=$	$\displaystyle\left\\|\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{\|}_{p},v\right>v-\left<\mathrm{grad}f\big{\|}_{p},v\right>v\right]\right\\|$
$\displaystyle=$	$\displaystyle\left\\|\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{\|}_{p}-\mathrm{grad}f\big{\|}_{p},v\right>v\right]\right\\|$
$\displaystyle\leq$	$\displaystyle\frac{L_{1}\alpha^{3}\mu}{2}.$	(28)

Applying Lemma 1 and the above results gives

	$\displaystyle\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{\|}_{p},v\right>_{p}v\right]$
$\displaystyle=$	$\displaystyle\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt\;v\right]$	$\displaystyle(\text{by Eq. \ref{eq:above}})$
$\displaystyle=$	$\displaystyle\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left(f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(\mu v)\right)\right)v\right]$	$\displaystyle(\text{by Lemma \ref{lem:fundamental-calculus}})$
$\displaystyle=$	$\displaystyle\frac{1}{\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[f\left({\mathrm{Exp}}_{p}(\mu v)\right)v\right].$

Combining the above result with (28) gives

\displaystyle\left\|\mathrm{grad}f\big{|}_{p}-\frac{n}{\mu\alpha^{2}}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[f\left({\mathrm{Exp}}_{p}(\mu v)\right)v\right]\right\|\leq\frac{L_{1}n\alpha\mu}{2}.

(29)

∎

The variance of the estimator (17) is shown in Theorem 5.

Theorem 5.

If $f$ is geodesically $L_{1}$ -smooth, then it holds that

		$\displaystyle\mathbb{E}_{v_{1},v_{2},\cdots,v_{m}\overset{i.i.d.}{\sim}\mathbb{S}_{p}(\alpha)}\left[\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v_{1},v_{2},\cdots,v_{m}\right)\big{\|}_{p}\right\\|^{2}\right]$
	$\displaystyle\leq$	$\displaystyle\frac{n^{2}}{m}\left(\frac{L_{1}\alpha\mu}{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|\right)^{2}$
		$\displaystyle+\frac{m-1}{m}\left(\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|^{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4}\right).$

Proof.

By Lemma 1 and that $f$ is geodesically $L_{1}$ -smooth, we have, for any $v\in\mathbb{S}_{p}(\alpha)$ ,

	$\displaystyle\left\|f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right\|$
$\displaystyle=$	$\displaystyle\left\|\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}(tv)},v_{{\mathrm{Exp}}_{p}(tv)}\right>dt-\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{\|}_{p},v\right>dt+2\mu\left<\mathrm{grad}f\big{\|}_{p},v\right>\right\|$
$\displaystyle\leq$	$\displaystyle\int_{-\mu}^{\mu}\left\|\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}(tv)}-\mathcal{P}_{p\rightarrow{\mathrm{Exp}}_{p}(tv)}\left(\mathrm{grad}f\big{\|}_{p}\right),\mathcal{P}_{p\rightarrow{\mathrm{Exp}}_{p}(tv)}(v)\right>\right\|dt$
	$\displaystyle+\left\|2\mu\left<\mathrm{grad}f\big{\|}_{p},v\right>\right\|$
$\displaystyle\leq$	$\displaystyle L_{1}\int_{-\mu}^{\mu}\|t\|\\|v\\|^{2}dt+2\alpha\mu\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|$
$\displaystyle\leq$	$\displaystyle L_{1}\alpha^{2}\mu^{2}+2\alpha\mu\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|.$	(30)

By (30), it holds that,

	$\displaystyle\mathbb{E}\left[\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v\right)\right\\|^{2}\right]=$	$\displaystyle\mathbb{E}\left[\frac{n^{2}}{4\mu^{2}\alpha^{4}}\left(f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right)^{2}\\|v\\|^{2}\right]$
	$\displaystyle\leq$	$\displaystyle{n^{2}}\left(\frac{L_{1}\alpha\mu}{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|\right)^{2}$		(31)

By Theorem 4, we have

\displaystyle\mathbb{E}\left[\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right]=\mathrm{grad}f\big{|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),

where $\mathbf{1}$ is the all-one vector and the big-O notation here means that, with respect to any coordinate system, each entry in $\mathbb{E}\left[\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right]$ and $\mathrm{grad}f\big{|}_{p}$ differs by at most $\frac{L_{1}n\alpha\mu}{2}$ .

Since $v_{1},v_{2},\cdots,v_{m}$ are mutually independent, we have, for any $i\neq j$ ,

		$\displaystyle\mathbb{E}\left<\widehat{\mathrm{grad}}f\big{\|}_{p}(v_{i}),\widehat{\mathrm{grad}}f\big{\|}_{p}(v_{j})\right>$
	$\displaystyle=$	$\displaystyle\left<\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right)\right>$
	$\displaystyle\leq$	$\displaystyle\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|^{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4},$

where the last inequality uses the Cauchy-Schwartz inequality. Thus by expanding out all terms we have

		$\displaystyle\mathbb{E}\left[\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v_{1},v_{2},\cdots,v_{m}\right)\right\\|^{2}\right]$
	$\displaystyle=$	$\displaystyle\frac{1}{m^{2}}\mathbb{E}\left[\sum_{i=1}^{m}\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v_{i}\right)\right\\|^{2}\right]$
		$\displaystyle+\sum_{1\leq i,j\leq m:i\neq j}\mathbb{E}\left<\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right)\right>$
	$\displaystyle\leq$	$\displaystyle\frac{n^{2}}{m}\left(\frac{L_{1}\alpha\mu}{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|\right)^{2}$
		$\displaystyle+\frac{m-1}{m}\left(\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|^{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4}\right).$

∎

4.1 Previous Methods for Riemannian Gradient Estimation

Previously, Nesterov and Spokoiny, (2017); Li et al., (2020) have introduced gradient estimators using the following sampler. The estimator is

\displaystyle\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})=\frac{1}{2m\mu}\sum_{i=1}^{m}\left(f\left({\mathrm{Exp}}_{p}(\mu v_{i})\right)-f\left({\mathrm{Exp}}_{p}(-\mu v_{i})\right)\right)v_{i},

(32)

where $v_{i}\overset{i.i.d.}{\sim}\mathcal{N}\left(0,I\right)$ are Gaussian vectors, and $\mu>0$ is a parameter controlling the width of the estimator. We use this Gaussian sampler as a baseline for empirical evaluations. For this estimator, Li et al., (2020) studied its properties over manifolds embedded in Euclidean space. Their result is in Proposition 4.

Proposition 4 (Li et al., (2020)).

Let $\mathcal{M}$ be a manifold embedded in a Euclidean. If $f:\mathcal{M}\rightarrow\mathbb{R}$ is geodesically $L_{1}$ -smooth over $\mathcal{M}$ , then it holds that

\displaystyle\left\|\mathbb{E}_{v_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,I)}\left[\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})\right]-\mathrm{grad}f\big{|}_{p}\right\|\leq\frac{L_{1}\left(d+3\right)^{3/2}\mu}{2}

(33)

where $\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})$ is the estimator in (32).

To fairly compare our method (17) and this previous method (32), one should set $\alpha=\sqrt{n}$ for our method, as pointed out in Remark 3.

Remark 3.

If one picks $\alpha=\sqrt{n}$ in Theorem 4, one can obtain the error bound $\frac{L_{1}n^{3/2}\mu}{2}$ for our method, and the error bound for the previous method is $\frac{L_{1}(n+3)^{3/2}\mu}{2}$ (Proposition 4). Why should we pick $\alpha=\sqrt{n}$ in Theorem 4? This is because when $\alpha=\sqrt{n}$ , we have $\mathbb{E}_{v\sim\mathbb{S}(\alpha)}[\|v\|^{2}]=\mathbb{E}_{v\sim\mathcal{N}(0,I)}[\|v\|^{2}]=n$ . In words, when $\alpha=\sqrt{n}$ , the random vectors used in our method (17) and the random vector used in the previous method (32) have the same squared norm in expectation. This leads to a fair comparison for the bias, while holding the second moment of the random vector exactly the same.

5 Empirical Studies

In this section, we empirically study the spherical estimator in (18), in comparison with the Gaussian estimator in (32). The two methods (18) and (32) are compared over three manifolds: 1. the Euclidean space $\mathbb{R}^{n}$ , 2. the unit sphere $\mathbb{S}^{n}$ , and 3. the surface specified by the equation $h(x)=\frac{1}{2}\sum_{i=1}^{n/2}x_{i}^{2}-\frac{1}{2}\sum_{i=n/2+1}^{n}x_{i}^{2}$ . The evaluation is carried out at three different point, one for each manifold. The three manifolds, and corresponding points for evaluation, are listed in Table 1. Two test functions are used: a. the linear function $f(x)=\sum_{i=1}^{n}x_{i}$ , and b. the function $f(x)=\sum_{i=1}^{n}\sin\left(x_{i}-1\right)$ . The two test functions are listed in Table 2.

Label	Manifold	Point $p$	Exponential map ${\mathrm{Exp}}_{p}(v)$
1	$\mathbb{R}^{n}$	$0$	$v$
2	$\mathbb{S}^{n}$	$(1,0,\cdots,0)$	$p\cos(\\|v\\|)+\frac{v}{\\|v\\|}\sin\left(\\|v\\|\right)$
3	$(x,h(x))$ , $x\in\mathbb{R}^{n}$	$0$	$\left(v,\left(\frac{\sqrt{1+\sum_{i=1}^{n}v_{i}^{4}}}{2}+\frac{\sinh^{-1}\left(\sqrt{\sum_{i=1}^{n}v_{i}^{4}}\right)}{\sqrt{\sum_{i=1}^{n}v_{i}^{4}}}\right)h(v)\right)$

Table 1: Manifolds used for experiments. All manifolds are of dimension

n

. The coordinate system of the ambient space is used in all settings.

Label	Function
a	$f(x)=\sum_{i=1}^{n}x_{i}$
b	$f(x)=\sum_{i=1}^{n}\sin(x_{i}-1)$

Table 2: Functions used for experiments. The coordinate system of the ambient Euclidean space is used for the computations.

The three manifolds in Table 1 and the two functions in Table 2 together generate 6 settings. We use 1a, 1b, 2a, 2b, 3a, 3c to label these 6 setting, where 1a refers to the setting over manifold 1 using function a, and so on. The experimental results are summarized in Figures 1-12. In all figures, “G” on the $x$ -axis stands for the previous method using Gaussian estimator (32), and “S” on the $x$ -axis stands for our method using the spherical estimator (18). To fairly compare (32) and (18), we use $\alpha=\sqrt{n}$ for all spherical estimators so that for both the spherical estimator and the Gaussian estimator, one has $\mathbb{E}\left[\|v\|^{2}\right]=n$ . The figure captions specify the settings used. For example, setting 1a is for function $a$ in Table 2 over manifold 1 in Table 1. Below each subfigure, the values of $n$ and $\mu$ are the dimension of the manifold and the parameter used in the estimators (both G and S).

Since $\mathbb{E}\left[\|v\|^{2}\right]$ are set to the same value for both G and S, we use

\displaystyle\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\|

(34)

to measure the error (bias) of the estimator, and

\displaystyle\frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4}

(35)

to measure robustness of the estimator. For both the estimation error and the robustness, smaller values mean better performance. For both “G” and “S” in all figures, each violin plot summarizes $100$ values of (34) or (35), where we use $m=100$ for all of them. Take G and S in Figure 1 as an example. We compute (35) for 100 times for G and $100$ estimations for S, all with $m=100$ . Then we use violin plots to summarize these 100 values of (35) for G and the 100 values of (35) for S.

Figures 1-6 plot the robustness of the two estimators $G$ and $S$ , and Figures 7-12 show the corresponding errors. As shown in the figures, in all of the settings, the spherical estimator S is much more robust than the Gaussian estimator G, while achieving the same level of estimation error.

Refer to caption — Figure 1: Plot of $\frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4}$ for Setting 1a.

6 Application to Online Convex Optimization

In online learning with bandit feedback, each time $t$ , the learner chooses points $x_{t}$ and $z_{t}$ , and the environment picks a convex loss function $f_{t}$ . The learner suffers and observes loss $f_{t}(x_{t})$ and observes the function value $f_{t}(z_{t})$ . The learner then picks $x_{t+1}$ based on historical observations and the learning process moves on. Online learning agents, unlike their offline learning (or batch learning) counterpart, do not have knowledge about $f_{t}$ when they choose $x_{t}$ . Such models capture real-world scenarios such as learning with streaming data. In the machine learning community, such problem settings are often referred to as the Online Convex Optimization problem with bandit-type feedback.

6.1 Online Convex Optimization over Hadamard Manifolds: an Algorithm

We consider an online convex optimization problem (with bandit-type feedback) over a negatively curved manifold. Let there be a Hadamard manifold $\mathcal{M}$ and a closed geodesic ball $\mathcal{D}\subseteq\mathcal{M}$ . Each time $t$ , the agent and the environment executes the above learning protocol, with $f_{t}$ defined over $\mathcal{M}$ and $x_{t}\in\mathcal{D}$ . The goal of the agent is to minimize the $T$ -step hindsight regret: $Reg(T):=\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(x^{*})$ , where $x^{*}\in\arg\min_{x\in\mathcal{D}}\sum_{t=1}^{T}f_{t}(x)$ .

With the approximation guarantee in Theorem 4, one can solve this online learning problem. At each $t$ , we can approximate gradient at $x_{t}$ using the function value $f_{t}\left(z_{t}\right)$ , and perform an online gradient descent step. In particular, we can generalize the online convex optimization algorithm (with bandit-type feedback) (Flaxman et al.,, 2005) to Hadamard manifolds.

By Theorem 4, we have

\displaystyle\left\|\mathrm{grad}f\big{|}_{p}-\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\frac{n}{\mu}\left(f({\mathrm{Exp}}_{p}(\mu v))-f(p)\right)v\right]\right\|\leq\frac{L_{1}n\mu}{2},\qquad\forall p\in\mathcal{M}

(36)

Thus the gradient of $f_{t}$ at $x_{t}$ can be estimated by

\displaystyle\mathrm{grad}f_{t}\big{|}_{x_{t}}\approx\mathbb{E}_{v_{t}\sim\mathbb{S}_{x_{t}}}\left[\frac{n}{\mu}\left(f_{t}\left({\mathrm{Exp}}_{x_{t}}\left(\mu v_{t}\right)\right)-f_{t}(x_{t})\right)v_{t}\right],

where $v_{t}$ is uniformly sampled from $\mathbb{S}_{x_{t}}$ . This estimator generalizes the two-evaluation estimator in the Euclidean case (Duchi et al.,, 2015). As shown in (36), the bias of this estimator is also bounded by $\frac{L_{1}n\mu}{2}$ .

With this gradient estimator, one can perform estimated gradient descent by

\displaystyle x_{t+\frac{1}{2}}\leftarrow{\mathrm{Exp}}_{x_{t}}\left(-\eta\frac{n}{\mu}\left(f_{t}\left({\mathrm{Exp}}_{x_{t}}\left(\mu v_{t}\right)\right)-f_{t}(x_{t})\right)v_{t}\right),

where $v_{t}$ is uniformly sampled from $\mathbb{S}_{x_{t}}$ , and $\eta$ is the learning rate.

To ensure that the algorithm stays in the geodesic ball $\mathcal{D}$ , we need to project $x_{t+\frac{1}{2}}$ back to a subset of $\mathcal{D}$ , which is

\displaystyle(1-\beta)\mathcal{D}:=B(c(\mathcal{D}),(1-\beta)r(\mathcal{D})),\qquad\text{ for some }\beta\in(0,1),

where $c(\mathcal{D})$ and $r(\mathcal{D})$ are the center and radius of the geodesic ball $\mathcal{D}$ . Since the set $(1-\beta)\mathcal{D}$ is a geodesic ball in a Hadamard manifold, the projection onto it is well-defined. Thus we can project $x_{t+\frac{1}{2}}$ onto $(1-\beta)\mathcal{D}$ to get $x_{t+1}$ , or let $x_{t+1}=\mathrm{Proj}_{(1-\beta)\mathcal{D}}\left(x_{t+\frac{1}{2}}\right)$ . Now one iteration finishes and the learning process moves on to the next round. This learning algorithm is summarized in Algorithm 1.

Algorithm 1

Input: Algorithm parameter:

\mu

\eta

\beta

; time horizon:

T

;

Initialization: Arbitrarily pick

x_{1}\in\mathcal{D}

for

t=1,2,\dots,T

Environment picks loss function

f_{t}

. The agent suffers and observes loss

f_{t}(x_{t})

Sample

v_{t}\sim\mathbb{S}_{x_{t}}

Observe function value

f_{t}(z_{t})

, where

z_{t}={\mathrm{Exp}}_{x_{t}}(\mu v_{t})

, and compute

w_{t}=\frac{n}{\mu}\left(f_{t}(z_{t})-f_{t}(x_{t})\right)v_{t}

. /*

w_{t}

is the gradient estimate. */

Perform gradient update

x_{t+\frac{1}{2}}={\mathrm{Exp}}_{x_{t}}(-\eta w_{t})

Project the point to set

(1-\beta)\mathcal{D}

x_{t+1}=\mathrm{Proj}_{(1-\beta)\mathcal{D}}\left(x_{t+\frac{1}{2}}\right)

. /* In practice, the projection can be approximated by a point in

(1-\beta)\mathcal{D}

that is close to

x_{t+\frac{1}{2}}

. */

end for

6.2 Online Convex Optimization over Hadamard Manifolds: Analysis

Theorem 6.

If, for all $t\in[T]$ , the loss function $f_{t}$ satisfies that is geodesically $\mu$ -smooth, then the regret of Algorithm 1 satisfies:

\displaystyle\mathbb{E}\left[Reg(T)\right]\leq\frac{D^{2}}{2\eta}+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}T}{2\mu^{2}}+\left(\frac{DL_{1}n\mu}{2}+2\beta D\right)T,

(37)

where $D$ is the diameter of $\mathcal{D}$ , and $G:=\max_{p\in\mathcal{D}}f(p)$ .

In particular, when $\eta=\Theta\left(T^{-3/4}\right)$ , $\mu=\Theta\left(T^{-1/4}\right)$ and $\beta=\Theta\left(T^{-1/4}\right)$ , the regret satisfies $\mathbb{E}\left[Reg(T)\right]\leq\mathcal{O}\left(T^{3/4}\right)$ . This sublinear regret rate implies that, when $f_{1},f_{2},\cdots,f_{T}$ are identical, the algorithm converges to the optimal point at a rate of order $\mathcal{O}\left(T^{-1/4}\right)$ . We now start the proof of Theorem 6, by presenting an existing result by Zhang and Sra.

Lemma 2 (Corollary 8, Zhang and Sra, (2016)).

Let $\mathcal{M}$ be a Hadamard manifold, with sectional curvature $\mathcal{K}_{p}(u,v)\geq\kappa>-\infty$ for all $p\in\mathcal{M}$ and $u,v\in T_{p}\mathcal{M}$ . Then for any $x,x_{s}$ , and $x_{s+1}={\mathrm{Exp}}_{x_{s}}(-\eta v)$ for some $v\in T_{x_{s}}\mathcal{M}$ , then it holds that

\displaystyle\left<-v,{\mathrm{Exp}}_{x_{s}}^{-1}(x)\right>_{x_{s}}\leq

\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{s},x)-d^{2}(x_{s+1},x)\right)+\frac{\zeta(\kappa,d(x_{s},x))\eta}{2}\|v\|^{2},

where $\zeta(\kappa,D)=\frac{\sqrt{|\kappa|}D}{\tanh(\sqrt{|\kappa|}D)}$ .

Proof of Theorem 6.

By convexity of $f_{t}$ , we have

f_{t}(x_{t})-f_{t}(x^{*})\leq\left<-\mathrm{grad}f_{t}\big{|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>.

Together with the Lemma of Zhang and Sra, we get

		$\displaystyle f_{t}(x_{t})-f_{t}(x^{*})$
	$\displaystyle\leq$	$\displaystyle\left<-\mathrm{grad}f_{t}\big{\|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>$
	$\displaystyle\leq$	$\displaystyle\left<-w_{t},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{})\right>+\left<w_{t}-\mathrm{grad}f_{t}\big{\|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{})\right>$
	$\displaystyle\leq$	$\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{t},x^{})-d^{2}(x_{t+\frac{1}{2}},x^{})\right)+\frac{\zeta(\kappa,D)\eta}{2}\\|w_{t}\\|^{2}$
		$\displaystyle+\left<w_{t}-\mathrm{grad}f_{t}\big{\|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>,$

where the last line uses Lemma 2. Taking expectation (over $v_{t}\sim\mathbb{S}_{x_{t}}$ ) on both sides gives

	$\displaystyle\mathbb{E}\left[f_{t}(x_{t})\right]-\mathbb{E}\left[f_{t}(x^{*})\right]$
$\displaystyle\leq$	$\displaystyle\frac{1}{2\eta}\mathbb{E}\left[d^{2}(x_{t},x^{})-d^{2}(x_{t+\frac{1}{2}},x^{})\right]+\mathbb{E}\left[\frac{\zeta(\kappa,D)\eta}{2}\\|w_{t}\\|^{2}\right]$
	$\displaystyle+\left<\mathbb{E}\left[w_{t}\right]-\mathrm{grad}f_{t}\big{\|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>$
$\displaystyle\leq$	$\displaystyle\frac{\mathbb{E}\left[d^{2}(x_{t},x^{})-d^{2}(x_{t+\frac{1}{2}},x^{})\right]}{2\eta}+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}}{2\mu^{2}}+\frac{DL_{1}n\mu}{2},$	(38)

where the last line uses Theorem 4, and that $\|w_{t}\|^{2}\leq\frac{n^{2}G^{2}}{\mu^{2}}$ ( $G\geq f(p)$ for all $p\in\mathcal{D}$ ). Henceforth, we will omit the expectation operator $\mathbb{E}$ for simplicity.

By definition of $(1-\beta)\mathcal{D}$ , we have, $d\left(x_{t+\frac{1}{2}},x_{t+1}\right)\leq\beta$ . Plugging this back to (38), we get, in expectation,

	$\displaystyle f_{t}(x_{t})-f_{t}(x^{*})\leq$	$\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{t},x^{})-d^{2}(x_{t+1},x^{})\right)$
		$\displaystyle+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}}{2\mu^{2}}+\frac{DL_{1}n\mu}{2}+d^{2}(x_{t+1},x^{})-d^{2}(x_{t+\frac{1}{2}},x^{})$
	$\displaystyle\leq$	$\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{t},x^{})-d^{2}(x_{t+1},x^{})\right)+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}}{2\mu^{2}}+\frac{DL_{1}n\mu}{2}+2\beta D.$

Summing over $t$ gives, in expectation,

		$\displaystyle\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(x^{*})$
	$\displaystyle\leq$	$\displaystyle\frac{1}{2\eta}\sum_{t=1}^{T}\left(d^{2}(x_{t},x^{})-d^{2}(x_{t+1},x^{})\right)+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}T}{2\mu^{2}}+\left(\frac{DL_{1}n\mu}{2}+2\beta D\right)T$
	$\displaystyle\leq$	$\displaystyle\frac{D^{2}}{2\eta}+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}T}{2\mu^{2}}+\left(\frac{DL_{1}n\mu}{2}+2\beta D\right)T,$

where the last line telescopes out the summation term.

∎

6.3 Online Convex Optimization over Hadamard Manifolds: Experiments

In this section, we study the effectiveness of our algorithm on the benchmark task of finding the Riemannian center of mass of several positive definite matrices (Bini and Iannazzo,, 2013). The Riemannian center of mass of a set of $N$ symmetric positive definite matrices $\{A_{i}\}_{i=1}^{N}$ seeks to minimize

\displaystyle f(X,\{A_{i}\}_{i=1}^{N})=\sum_{i=1}^{N}\|\log(X^{-1/2}A_{i}X^{-1/2})\|_{F}^{2},

(39)

where $X$ is symmetric and positive definite.

This objective is non-convex in the Euclidean sense, but is geodesically convex on the manifold of positive definite matrices. The gradient update step for objective (39) over the manifold of symmetric and positive definite matrices is

\displaystyle X_{t+1}=X_{t}^{1/2}\exp\left(-\eta\;\mathrm{grad}f_{t}\big{|}_{x_{t}}\right)X_{t}^{1/2}.

For our experiments, we can only estimate the gradient $\mathrm{grad}f_{t}\big{|}_{x_{t}}$ by querying function values of $f_{t}$ . We randomly sample several positive definite matrices $A_{i}$ , and apply Algorithm 1 to search for $X$ to minimize (39). The results are summarized in Figure 13.

7 Conclusion

In this paper, we study the GW convolution/approximation and gradient estimation over Riemannian manifolds. Our new decomposition trick of the GW approximation provides a tools for computing properties of the GW convolution. Using this decomposition trick, we derive a formula for how the curvature of the manifold will affect the curvature of the function via the GW convolution. Empowered by our decomposition trick, we introduce a new gradient estimation method over Riemannian manifolds. Over an $n$ -dimensional Riemannian manifold, we improve best previous gradient estimation error (Li et al.,, 2020). We also apply our gradient estimation method to bandit convex optimization problems, and generalize previous results from Euclidean spaces to Hadamard manifolds.

References

Absil et al., (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
Antonakopoulos et al., (2019) Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. (2019). Online and stochastic optimization beyond lipschitz continuity: A riemannian approach. In International Conference on Learning Representations.
Azagra, (2013) Azagra, D. (2013). Global and fine approximation of convex functions. Proceedings of the London Mathematical Society, 107(4):799–824.
Azagra and Ferrera, (2006) Azagra, D. and Ferrera, J. (2006). Inf-convolution and regularization of convex functions on riemannian manifolds of nonpositive curvature. Revista Matemática Complutense, 19(2):323–345.
Bini and Iannazzo, (2013) Bini, D. A. and Iannazzo, B. (2013). Computing the karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–1710.
Bubeck et al., (2015) Bubeck, S., Dekel, O., Koren, T., and Peres, Y. (2015). Bandit convex optimization: $\sqrt{T}$ regret in one dimension. In Conference on Learning Theory, pages 266–278. PMLR.
Bubeck and Eldan, (2016) Bubeck, S. and Eldan, R. (2016). Multi-scale exploration of convex functions and bandit convex optimization. In Conference on Learning Theory, pages 583–589. PMLR.
Chen et al., (2019) Chen, L., Zhang, M., and Karbasi, A. (2019). Projection-free bandit convex optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2047–2056. PMLR.
Duchi et al., (2015) Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806.
Flaxman et al., (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394.
Garber and Kretzu, (2020) Garber, D. and Kretzu, B. (2020). Improved regret bounds for projection-free bandit convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 2196–2206. PMLR.
Gavrilov, (2007) Gavrilov, A. V. (2007). The double exponential map and covariant derivation. Siberian Mathematical Journal, 48(1):56–61.
Greene and Wu, (1979) Greene, R. E. and Wu, H. (1979). ${C}^{\infty}$ -approximations of convex, subharmonic, and plurisubharmonic functions. In Annales Scientifiques de l’École Normale Supérieure, volume 12, pages 47–84.
Greene and Wu, (1973) Greene, R. E. and Wu, H.-H. (1973). On the subharmonicity and plurisubharmonicity of geodesically convex functions. Indiana University Mathematics Journal, 22(7):641–653.
Greene and Wu, (1976) Greene, R. E. and Wu, H.-h. (1976). ${C}^{\infty}$ -convex functions and manifolds of positive curvature. Acta Mathematica, 137(1):209–245.
Hazan and Kale, (2012) Hazan, E. and Kale, S. (2012). Projection-free online learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1843–1850.
Hazan and Levy, (2014) Hazan, E. and Levy, K. Y. (2014). Bandit convex optimization: Towards tight bounds. In NIPS, pages 784–792.
Hildebrand, (1987) Hildebrand, F. B. (1987). Introduction to numerical analysis. Courier Corporation.
Jacot et al., (2021) Jacot, A., Gabriel, F., and Hongler, C. (2021). Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 6–6.
Kleinberg, (2004) Kleinberg, R. (2004). Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems, 17:697–704.
Kulis et al., (2012) Kulis, B. et al. (2012). Metric learning: A survey. Foundations and trends in machine learning, 5(4):287–364.
Lee, (2006) Lee, J. M. (2006). Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media.
Li and Dunson, (2019) Li, D. and Dunson, D. B. (2019). Geodesic distance estimation with spherelets. arXiv preprint arXiv:1907.00296.
Li and Dunson, (2020) Li, D. and Dunson, D. B. (2020). Classification via local manifold approximation. Biometrika, 107(4):1013–1020.
Li et al., (2017) Li, D., Mukhopadhyay, M., and Dunson, D. B. (2017). Efficient manifold and subspace approximations with spherelets. arXiv preprint arXiv:1706.08263.
Li et al., (2020) Li, J., Balasubramanian, K., and Ma, S. (2020). Stochastic zeroth-order riemannian derivative estimation and optimization. arXiv preprint arXiv:2003.11238.
Lin and Zha, (2008) Lin, T. and Zha, H. (2008). Riemannian manifold learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):796–809.
Lyness and Moler, (1967) Lyness, J. N. and Moler, C. B. (1967). Numerical differentiation of analytic functions. SIAM Journal on Numerical Analysis, 4(2):202–210.
Muandet et al., (2017) Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–144.
Nesterov and Spokoiny, (2017) Nesterov, Y. and Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566.
Parkkonen and Paulin, (2012) Parkkonen, J. and Paulin, F. (2012). On strictly convex subsets in negatively curved manifolds. Journal of Geometric Analysis, 22(3):621–632.
Petersen, (2006) Petersen, P. (2006). Riemannian geometry, volume 171. Springer.
Schoen and Yau, (1994) Schoen, R. M. and Yau, S.-T. (1994). Lectures on differential geometry, volume 2. International press Cambridge, MA.
Schölkopf et al., (2002) Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.
Shalev-Shwartz, (2011) Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and trends in Machine Learning, 4(2):107–194.
Shamir, (2013) Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24. PMLR.
Zhang et al., (2016) Zhang, H., Reddi, S. J., and Sra, S. (2016). Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4599–4607.
Zhang and Sra, (2016) Zhang, H. and Sra, S. (2016). First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638. PMLR.
Zhou et al., (2020) Zhou, Y., Sanches Portella, V., Schmidt, M., and Harvey, N. (2020). Regret bounds without lipschitz continuity: Online learning with relative-lipschitz losses. Advances in Neural Information Processing Systems, 33.
Zhu et al., (2018) Zhu, B., Liu, J. Z., Cauley, S. F., Rosen, B. R., and Rosen, M. S. (2018). Image reconstruction by domain-transform manifold learning. Nature, 555(7697):487–492.

		$\displaystyle\mathbb{E}_{v_{1},v_{2},\cdots,v_{m}\overset{i.i.d.}{\sim}\mathbb{S}_{p}(\alpha)}\left[\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v_{1},v_{2},\cdots,v_{m}\right)\big{\|}_{p}\right\\|^{2}\right]$
	$\displaystyle\leq$	$\displaystyle\frac{n^{2}}{m}\left(\frac{L_{1}\alpha\mu}{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|\right)^{2}$
		$\displaystyle+\frac{m-1}{m}\left(\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|^{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4}\right).$

	$\displaystyle\left\|f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right\|$
$\displaystyle=$	$\displaystyle\left\|\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}(tv)},v_{{\mathrm{Exp}}_{p}(tv)}\right>dt-\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{\|}_{p},v\right>dt+2\mu\left<\mathrm{grad}f\big{\|}_{p},v\right>\right\|$
$\displaystyle\leq$	$\displaystyle\int_{-\mu}^{\mu}\left\|\left<\mathrm{grad}f\big{\|}_{{\mathrm{Exp}}_{p}(tv)}-\mathcal{P}_{p\rightarrow{\mathrm{Exp}}_{p}(tv)}\left(\mathrm{grad}f\big{\|}_{p}\right),\mathcal{P}_{p\rightarrow{\mathrm{Exp}}_{p}(tv)}(v)\right>\right\|dt$
	$\displaystyle+\left\|2\mu\left<\mathrm{grad}f\big{\|}_{p},v\right>\right\|$
$\displaystyle\leq$	$\displaystyle L_{1}\int_{-\mu}^{\mu}\|t\|\\|v\\|^{2}dt+2\alpha\mu\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|$
$\displaystyle\leq$	$\displaystyle L_{1}\alpha^{2}\mu^{2}+2\alpha\mu\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|.$	(30)

	$\displaystyle\mathbb{E}\left[\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v\right)\right\\|^{2}\right]=$	$\displaystyle\mathbb{E}\left[\frac{n^{2}}{4\mu^{2}\alpha^{4}}\left(f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right)^{2}\\|v\\|^{2}\right]$
	$\displaystyle\leq$	$\displaystyle{n^{2}}\left(\frac{L_{1}\alpha\mu}{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|\right)^{2}$		(31)

		$\displaystyle\mathbb{E}\left<\widehat{\mathrm{grad}}f\big{\|}_{p}(v_{i}),\widehat{\mathrm{grad}}f\big{\|}_{p}(v_{j})\right>$
	$\displaystyle=$	$\displaystyle\left<\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right)\right>$
	$\displaystyle\leq$	$\displaystyle\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|^{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4},$

		$\displaystyle\mathbb{E}\left[\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v_{1},v_{2},\cdots,v_{m}\right)\right\\|^{2}\right]$
	$\displaystyle=$	$\displaystyle\frac{1}{m^{2}}\mathbb{E}\left[\sum_{i=1}^{m}\left\\|\widehat{\mathrm{grad}}f\big{\|}_{p}\left(v_{i}\right)\right\\|^{2}\right]$
		$\displaystyle+\sum_{1\leq i,j\leq m:i\neq j}\mathbb{E}\left<\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),\mathrm{grad}f\big{\|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right)\right>$
	$\displaystyle\leq$	$\displaystyle\frac{n^{2}}{m}\left(\frac{L_{1}\alpha\mu}{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|\right)^{2}$
		$\displaystyle+\frac{m-1}{m}\left(\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|^{2}+\left\\|\mathrm{grad}f\big{\|}_{p}\right\\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4}\right).$