
From the Greene–Wu Convolution to Gradient Estimation over Riemannian Manifolds

Tianyu Wang ([email protected])    Yifeng Huang ([email protected])    Didong Li ([email protected])
Abstract

Over a complete Riemannian manifold of finite dimension, Greene and Wu studied a convolution of $f$ defined as

\widehat{f}^{\mu}(p):=\int_{v\in T_{p}\mathcal{M}}f\left({\mathrm{Exp}}_{p}(v)\right)\kappa_{\mu}\left(v\right)d\Pi,

where $\kappa_{\mu}$ is a kernel that integrates to 1, and $d\Pi$ is the base measure on $T_{p}\mathcal{M}$. In this paper, we study properties of the Greene–Wu (GW) convolution and apply it to non-Euclidean machine learning problems. In particular, we derive a new formula for how the curvature of the space affects the curvature of the function under the GW convolution. Building on this study of the GW convolution, a new method for gradient estimation over Riemannian manifolds is also introduced.

1 Introduction

Recently, as data structures become increasingly non-Euclidean, many non-Euclidean operations have been studied and applied to machine learning problems (e.g., Absil et al., 2009). Among these operations, the Greene–Wu (GW) convolution (Greene and Wu, 1973, 1976, 1979) is important but relatively poorly understood.

Over a complete Riemannian manifold, the seminal Greene–Wu convolution (or approximation) of a function $f$ at $p$ is defined as

\widehat{f}^{\mu}(p):=\int_{v\in T_{p}\mathcal{M}}f\left({\mathrm{Exp}}_{p}(v)\right)\kappa_{\mu}\left(v\right)d\Pi, (GW)

where $\kappa_{\mu}$ is a convolution kernel that integrates to 1, and $d\Pi$ is the base measure on $T_{p}\mathcal{M}$. The GW convolution generalizes standard convolutions in Euclidean spaces and has been subsequently studied in mathematics (e.g., Parkkonen and Paulin, 2012; Azagra and Ferrera, 2006; Azagra, 2013). This convolution operation also naturally arises in machine learning scenarios. For example, if one is interested in studying a function $f$ over a matrix manifold, the output of $f$ is GW convoluted whenever the input is noisy (i.e., we can only obtain an average function value near $p$ whenever we try to obtain the function value at $p$). A real-life scenario where the GW convolution arises is taking photos with a shaky camera: any function defined over a manifold of images (e.g., the likelihood of containing a cat) is GW convoluted when the camera is perturbed.

Despite the importance of the GW convolution, many of its elementary properties remain inadequately understood. In particular, how does the curvature of the manifold affect the curvature of the function under the GW convolution? (Q1) The answer to this question is important for machine learning, since the curvature of a function captures fundamental properties of the function, including how convex it is. In the example scenario above, a better understanding of the GW convolution leads to a better understanding of how the landscape of the function $f$ is affected by the curvature of the space when the input is noisy.

In this paper, we give a quantitative answer to question (Q1). As an example, consider the GW convolution $\widehat{f}^{\mu}$ defined with a kernel $\kappa_{\mu}$ whose radial density (see Definition 1) is uniform over $[-\mu,\mu]$. For this GW convolution, we show that

\lambda_{\min}\left(H_{p,\widehat{f}^{\mu}}\right)\geq\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\lambda_{\min}\left(H_{{\mathrm{Exp}}_{p}(tv),f}\right)dt\right]
+\;\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right],

where

  1. for any twice continuously differentiable $f$ and any $p\in\mathcal{M}$, $H_{p,f}$ denotes the Hessian matrix of $f$ at $p$;

  2. $\lambda_{\min}$ extracts the minimal eigenvalue of a matrix;

  3. $\mathbb{S}_{p}$ denotes the unit sphere in $T_{p}\mathcal{M}$, and $\mathbb{E}_{v\sim\mathbb{S}_{p}}$ denotes the expectation taken with respect to $v$ uniformly sampled from $\mathbb{S}_{p}$;

  4. $\nabla$ denotes the covariant derivative associated with the Levi-Civita connection.

This result implies that the GW convolution can sometimes make the function more convex, and thus friendlier to optimize. Specifically, if $f$ and $\nabla$ satisfy

\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right]>0,

then the minimal eigenvalue of the Hessian of $\widetilde{f}^{\mu}$ at $p$ is larger than the minimal eigenvalue of the Hessian of $f$ near $p$. This means that noisy input can sometimes make the function more convex, and thus friendlier to optimize, thanks to the curvature of the space.

Besides understanding how the curvature of the space affects the curvature of the function via the GW convolution, another important question is: how can we directly apply the GW convolution to machine learning problems? (Q2) Naturally, the GW convolution is related to anti-derivatives and thus to gradient estimation, as suggested by the fundamental theorem of calculus or Stokes' theorem. Along this line, we introduce a new gradient estimation method over Riemannian manifolds. Our gradient estimation method improves the state-of-the-art scaling with the dimension $n$ from $\left(n+3\right)^{3/2}$ to $n^{3/2}$, while holding all other quantities the same (more details on this comparison are explained in Remark 3). Apart from curvature, Riemannian gradient estimation differs from its Euclidean counterpart in that an open set is diffeomorphic to a Euclidean set only within the injectivity radius. This difference implies that one needs to be extra careful when taking finite-difference steps in non-Euclidean spaces. To this end, we show that as long as the function is geodesically $L_{1}$-smooth, the finite-difference method always works, no matter how small the injectivity radius is. Empirically, our method outperforms the best existing method for gradient estimation over Riemannian manifolds, as evidenced by thorough experimental evaluations. Based on our answer to (Q2), we study online convex learning problems over Hadamard manifolds under stochastic bandit feedback. In particular, a single-point gradient estimator is designed, so that the online convex learning problem under bandit-type feedback can be solved and a curvature-dependent regret bound can be derived.

Related Works

The study of geometric methods in machine learning has never stalled. Researchers have investigated this topic from different angles, including kernel methods (e.g., Schölkopf et al., 2002; Muandet et al., 2017; Jacot et al., 2021), manifold learning (e.g., Lin and Zha, 2008; Li et al., 2017; Zhu et al., 2018; Li and Dunson, 2020), and metric learning (e.g., Kulis et al., 2012; Li and Dunson, 2019). Recently, attention has concentrated on learning problems involving Riemannian geometry. Examples include convex optimization methods over negatively curved spaces (Zhang and Sra, 2016; Zhang et al., 2016), online learning algorithms with Riemannian metrics (Antonakopoulos et al., 2019), and mirror-map-based methods for relative Lipschitz objectives (Zhou et al., 2020).

Despite the rich literature on geometric methods for machine learning (e.g., Absil et al., 2009), some important non-Euclidean operations are relatively poorly understood, including the Greene–Wu (GW) convolution (Greene and Wu, 1973, 1976, 1979; Parkkonen and Paulin, 2012; Azagra and Ferrera, 2006; Azagra, 2013). To this end, we derive new properties of the GW convolution and apply it to machine learning problems.

Enabled by our study of the GW convolution, we design a new gradient estimation method over Riemannian manifolds. The history of gradient estimation perhaps dates back to Sir Isaac Newton and the fundamental theorem of calculus. Since then, many gradient estimation methods have been invented (Lyness and Moler, 1967; Hildebrand, 1987). In their more recent forms, gradient estimation methods are often combined with learning and optimization. Flaxman et al. (2005) used the fundamental theorem of geometric calculus to solve online convex optimization problems with bandit feedback. Nesterov and Spokoiny (2017) and Li et al. (2020) introduced stochastic gradient estimation methods using Gaussian sampling, in Euclidean spaces or on manifolds embedded in Euclidean spaces. We improve these previous results in two ways. First, our gradient estimation method generalizes this previous work by removing the need for an ambient Euclidean space. Second, we tighten the previous error bound. Over an $n$-dimensional Riemannian manifold, we show that the error of gradient estimation is bounded by $\frac{L_{1}n^{3/2}\mu}{2}$, in contrast to $\frac{L_{1}\left(n+3\right)^{3/2}\mu}{2}$ of Li et al. (2020), where $n$ is the dimension of the manifold, $L_{1}$ is the smoothness parameter of the objective function, and $\mu$ is an algorithm parameter controlling the finite-difference step size.

The application of our method to convex bandit optimization is related to online convex learning (e.g., Shalev-Shwartz, 2011). The problem of online convex optimization with bandit feedback was introduced and studied by Flaxman et al. (2005). Since then, a sequence of works has focused on bandit convex learning in Euclidean spaces. For example, Hazan and Kale (2012), Chen et al. (2019) and Garber and Kretzu (2020) studied gradient-free counterparts of the Frank-Wolfe algorithm, and Hazan and Levy (2014) studied this problem with self-concordant barriers. An extension of this scheme combines the methodology with a partitioning of the space (Kleinberg, 2004). By doing so, a regret rate of $\mathcal{O}\left(\sqrt{T}\right)$ can be achieved in Euclidean spaces (Bubeck et al., 2015; Bubeck and Eldan, 2016). This rate is optimal in terms of $T$ for stochastic environments, as shown by Shamir (2013). While online convex optimization problems have been studied extensively, these results are all restricted to Euclidean spaces. In this paper, we generalize previous results from Euclidean spaces to Hadamard manifolds, using our new gradient estimation method.

2 Preliminaries and Conventions

Preliminaries for Riemannian Geometry

A Riemannian manifold $(\mathcal{M},g)$ is a smooth manifold equipped with a metric tensor $g$ such that at any point $p\in\mathcal{M}$, $g_{p}$ defines a positive definite inner product in the tangent space $T_{p}\mathcal{M}$. For a point $p\in\mathcal{M}$, we use $\left<\cdot,\cdot\right>_{p}$ to denote the inner product in $T_{p}\mathcal{M}$ induced by $g$. Also, we use $\|\cdot\|_{p}$ to denote the norm associated with $\left<\cdot,\cdot\right>_{p}$. Intuitively, a Riemannian manifold is a space that can be locally identified with a Euclidean space. With such structure, one can locally carry out calculus and linear algebra operations.

A vector field $X$ assigns to each point $p\in\mathcal{M}$ a unique vector in $T_{p}\mathcal{M}$. Let $\mathfrak{X}(\mathcal{M})$ be the space of all smooth vector fields. An affine connection is a mapping $\nabla:\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\rightarrow\mathfrak{X}(\mathcal{M})$ such that (i) $\nabla_{f_{1}X_{1}+f_{2}X_{2}}Y=f_{1}\nabla_{X_{1}}Y+f_{2}\nabla_{X_{2}}Y$ for $f_{1},f_{2}\in C^{\infty}(\mathcal{M})$, (ii) $\nabla_{X}(a_{1}Y_{1}+a_{2}Y_{2})=a_{1}\nabla_{X}Y_{1}+a_{2}\nabla_{X}Y_{2}$ for $a_{1},a_{2}\in\mathbb{R}$, and (iii) $\nabla_{X}(fY)=f\nabla_{X}Y+X(f)Y$ for all $f\in C^{\infty}(\mathcal{M})$. The Levi-Civita connection is the unique affine connection that is torsion free and compatible with the Riemannian metric $g$. A geodesic curve $\gamma:[0,1]\rightarrow\mathcal{M}$ is a curve such that $\nabla_{\dot{\gamma}(t)}\dot{\gamma}(t)=0$ for all $t\in[0,1]$, where $\nabla$ here denotes the Levi-Civita connection. One can think of geodesic curves as generalizations of straight line segments in Euclidean spaces. At any point $p\in\mathcal{M}$, the exponential map ${\mathrm{Exp}}_{p}$ is a local diffeomorphism that maps $v\in T_{p}\mathcal{M}$ to ${\mathrm{Exp}}_{p}(v)\in\mathcal{M}$ such that there exists a geodesic curve $\gamma:[0,1]\rightarrow\mathcal{M}$ with $\gamma(0)=p$, $\dot{\gamma}\big{|}_{0}=v$ and $\gamma(1)={\mathrm{Exp}}_{p}(v)$. Since the exponential map is (locally) diffeomorphic, the inverse exponential map is also properly defined (locally). We use ${\mathrm{Exp}}_{p}^{-1}$ to denote the inverse exponential map of ${\mathrm{Exp}}_{p}$.

We say a vector field $X$ is parallel along a geodesic $\gamma:[0,1]\rightarrow\mathcal{M}$ if $\nabla_{\dot{\gamma}(t)}X=0$ for all $t\in[0,1]$. For $t_{0},t_{1}\in[0,1]$, we call the map $\mathcal{P}_{t_{0},t_{1}}^{\gamma}:T_{\gamma(t_{0})}\mathcal{M}\rightarrow T_{\gamma(t_{1})}\mathcal{M}$ a parallel transport if, for any vector field $X$ that is parallel along $\gamma$, $X(\gamma(t_{0}))$ is mapped to $X(\gamma(t_{1}))$. Parallel transport induced by the Levi-Civita connection preserves the Riemannian metric, that is, for $p\in\mathcal{M}$, $u,v\in T_{p}\mathcal{M}$, and a geodesic $\gamma:[0,1]\rightarrow\mathcal{M}$, it holds that $\left<u,v\right>_{\gamma(0)}=\left<\mathcal{P}_{0,t}^{\gamma}(u),\mathcal{P}_{0,t}^{\gamma}(v)\right>_{\gamma(t)}$ for any $t\in[0,1]$.

The Riemann curvature tensor $\mathcal{R}:\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\to\mathfrak{X}(\mathcal{M})$ is defined as $\mathcal{R}(X,Y)Z=\nabla_{X}\nabla_{Y}Z-\nabla_{Y}\nabla_{X}Z-\nabla_{[X,Y]}Z$, where $[\cdot,\cdot]$ is the Lie bracket for vector fields. The sectional curvature $\mathcal{K}:\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\to\mathbb{R}$ is defined as $\mathcal{K}(X,Y)=\frac{g(\mathcal{R}(X,Y)X,Y)}{g(X,X)g(Y,Y)-g(X,Y)^{2}}$, which is the high-dimensional generalization of the Gaussian curvature. Intuitively, for a point $p$, $\mathcal{K}\left(X(p),Y(p)\right)$ is the Gaussian curvature at $p$ of the sub-manifold spanned by $X(p)$ and $Y(p)$.

For a connected Riemannian manifold $(\mathcal{M},g)$, a distance metric can be defined. The distance between $p,q\in\mathcal{M}$ is the infimum of the lengths of piecewise smooth curves in $\mathcal{M}$ connecting $p$ and $q$, where the length of a curve is defined with respect to the metric tensor $g$. A Riemannian manifold is complete if it is complete as a metric space. By the celebrated Hopf–Rinow theorem, if a Riemannian manifold is complete, the exponential map ${\mathrm{Exp}}_{p}$ is defined over the entire tangent space $T_{p}\mathcal{M}$ for all $p\in\mathcal{M}$. The cut locus of a point $p\in\mathcal{M}$ is the set of points $q\in\mathcal{M}$ such that there exist two or more distinct shortest paths from $p$ to $q$.

We refer the readers to books (e.g., Petersen, 2006; Lee, 2006) for more background on Riemannian geometry.

Notations and Conventions

For better readability, we list here some notations and conventions that will be used throughout the rest of this paper.

  • For any $p\in\mathcal{M}$, let $U_{p}$ denote the open set near $p$ that is diffeomorphic to a subset of $\mathbb{R}^{n}$ via the local normal coordinate chart $\phi$. Define the distance $d_{p}(q_{1},q_{2})$ (for $q_{1},q_{2}\in U_{p}$) by

    d_{p}(q_{1},q_{2})=d_{\text{Euc}}(\phi(q_{1}),\phi(q_{2})),

    where $d_{\text{Euc}}$ is the Euclidean distance in $\mathbb{R}^{n}$.

  • As mentioned in the preliminaries, for any $p\in\mathcal{M}$, we use $\left<\cdot,\cdot\right>_{p}$ and $\|\cdot\|_{p}$ to denote the inner product and norm induced by the Riemannian metric $g$ in the tangent space $T_{p}\mathcal{M}$. We omit the subscript when it is clear from context.

  • For any $p\in\mathcal{M}$ and any differentiable function $f$ over $\mathcal{M}$, we use $\mathrm{grad}f\big{|}_{p}$ to denote the gradient of $f$ at $p$.

  • For any $p\in\mathcal{M}$ and $\alpha>0$, we use $\mathbb{S}_{p}(\alpha)$ to denote the origin-centered sphere in $T_{p}\mathcal{M}$ with radius $\alpha$. For simplicity, we write $\mathbb{S}_{p}=\mathbb{S}_{p}(1)$.

  • We will use another set of notations for parallel transport. For two points $p,q\in\mathcal{M}$ and a smooth curve $\gamma$ connecting $p$ and $q$, we use $\mathcal{P}_{p\rightarrow q}^{\gamma}$ to denote the parallel transport from $T_{p}\mathcal{M}$ to $T_{q}\mathcal{M}$ along the curve $\gamma$. Let $\Omega_{\min}(p,q)$ denote the set of minimizing geodesics connecting $p$ and $q$. Define $\mathcal{P}_{p\rightarrow q}:T_{p}\mathcal{M}\rightarrow T_{q}\mathcal{M}$ such that $\mathcal{P}_{p\rightarrow q}(v)=\mathbb{E}_{\gamma\sim\Omega_{\min}(p,q)}\left[\mathcal{P}_{p\rightarrow q}^{\gamma}(v)\right]$ for any $v\in T_{p}\mathcal{M}$, where the expectation is taken with respect to the uniform probability measure over $\Omega_{\min}(p,q)$. Since for any point $p$ its cut locus is of measure zero, the operation $\mathcal{P}_{p\rightarrow q}$ is deterministically defined for almost every $q$.

  • For any $p\in\mathcal{M}$, $v\in T_{p}\mathcal{M}$ and $q\in\mathcal{M}$, define $v_{q}=\mathcal{P}_{p\rightarrow q}(v)$.

  • As a convention, if $U\subseteq\mathcal{M}$ is diffeomorphic to $V\subseteq\mathbb{R}^{n}$, we simply say $U$ is diffeomorphic. Similarly, a map is said to be diffeomorphic when the domain of the map is diffeomorphic to an open subset of the Euclidean space. For simplicity, we omit all musical isomorphisms when there is no confusion.

3 The Greene–Wu Convolution

The Greene–Wu convolution can be decomposed into two steps. In the first step, a direction $v$ in the tangent space is picked, a convolution along this direction is performed, and the extension of this convolution to other points is defined. In the second step, the direction $v$ is randomized over a sphere. To properly describe these two steps, we need to formally define kernels.

Definition 1.

A function $\kappa_{\mu}:\mathbb{R}^{n}\rightarrow\mathbb{R}_{\geq 0}$ is called a kernel if

  1. for any $\alpha>0$ and $\mu>0$, there exists a density $\Theta_{\mu,\alpha}$ on $\mathbb{R}$ such that $\kappa_{\mu}(v)=2\nu_{\alpha}\left(\alpha\frac{v}{\|v\|}\right)\Theta_{\mu,\alpha}\left(\frac{\|v\|}{\alpha}\right)$ and $\Theta_{\mu,\alpha}(-x)=\Theta_{\mu,\alpha}(x)$ for all $x\in\mathbb{R}$, where $\nu_{\alpha}$ is the uniform probability density over the Euclidean sphere of radius $\alpha$ (here $\kappa_{\mu}(0)=2\Theta_{\mu,\alpha}(0)$ as a convention);

  2. $\lim_{\mu\rightarrow 0^{+}}\kappa_{\mu}=\delta_{0}$, where $\delta_{0}$ is the Dirac point mass at 0.

We call $\Theta_{\mu,\alpha}$ the radial density for $\kappa_{\mu}$.

By the Radon–Nikodym theorem, item 1 in the definition of a kernel is satisfied when the measure on $T_{p}\mathcal{M}$ induced by $\kappa_{\mu}$ is absolutely continuous with respect to the base measure on $T_{p}\mathcal{M}$. Examples of kernels include the uniform mass over a ball of radius $\mu$.

Remark 1.

For simplicity, we restrict our attention to symmetric kernels $\kappa_{\mu}$, where the value of $\kappa_{\mu}(v)$ only depends on $\|v\|$. Yet many of the results generalize to non-symmetric kernels. To see this, we decompose $\kappa_{\mu}$ as $\kappa_{\mu}(v)=2\nu_{\alpha}\left(\alpha\frac{v}{\|v\|}\right)\Theta_{\mu,\alpha;v}\left(\frac{\|v\|}{\alpha}\right)$, where $\nu_{\alpha}$ is the uniform density over the sphere of radius $\alpha$, and $\Theta_{\mu,\alpha;v}$ is the density induced by $\kappa_{\mu}$ along the direction of $v$ (or a parallel transport of $v$), up to a scaling by $\alpha$. We can replace item 1 in Definition 1 with this decomposition, and many of the follow-up results extend to these cases.

With the kernel defined in Definition 1, the GW approximation/convolution can be defined as follows. For $p\in\mathcal{M}$ and $\mu>0$, we define

\widetilde{f}^{\mu}(p):=\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\widetilde{f}_{v}^{\mu}(p)\right], (1)

where

\widetilde{f}_{v}^{\mu}(q)=\int_{\mathbb{R}}2f\left({\mathrm{Exp}}_{q}(tv_{q})\right)\Theta_{\mu,\alpha}(t)\,dt,\quad\forall q\in\mathcal{M}. (2)
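
To make the two-step structure of (1) and (2) concrete, here is a minimal numerical sketch (an illustration, not part of the construction above) on the round sphere $S^{d-1}\subset\mathbb{R}^{d}$, whose exponential map has a closed form. The sketch takes $\alpha=1$ and the uniform radial density $\Theta_{\mu,1}(t)=\frac{1}{2\mu}\mathbb{I}_{[t\in[-\mu,\mu]]}$ that is also used in Section 3.2, and estimates the spherical-and-radial average $\mathbb{E}_{v\sim\mathbb{S}_{p}}\big[\int f({\mathrm{Exp}}_{p}(tv))\,\Theta_{\mu,1}(t)\,dt\big]$ by Monte Carlo, which corresponds to (1) and (2) up to the normalization convention of the kernel. The helper names sphere_exp and gw_convolution_mc are ours, for illustration only.

    import numpy as np

    def sphere_exp(p, v):
        """Exponential map of the round sphere S^{d-1}; v lies in T_p, i.e., v is orthogonal to p."""
        nv = np.linalg.norm(v)
        if nv < 1e-12:
            return p
        return np.cos(nv) * p + np.sin(nv) * v / nv

    def gw_convolution_mc(f, p, mu, n_samples=20000, seed=0):
        """Monte Carlo estimate of E_{v ~ S_p}[ int f(Exp_p(t v)) Theta_{mu,1}(t) dt ]:
        first sample a direction v uniformly from the unit sphere of T_p,
        then sample the radial variable t uniformly from [-mu, mu]."""
        rng = np.random.default_rng(seed)
        d = p.shape[0]
        total = 0.0
        for _ in range(n_samples):
            g = rng.standard_normal(d)
            g -= np.dot(g, p) * p              # project onto the tangent space T_p
            v = g / np.linalg.norm(g)          # direction, uniform on S_p
            t = rng.uniform(-mu, mu)           # radial step, uniform density 1/(2 mu)
            total += f(sphere_exp(p, t * v))
        return total / n_samples

    if __name__ == "__main__":
        p = np.array([0.0, 0.0, 1.0])                   # a point on S^2
        height = lambda x: x[2]                          # f restricted to the sphere
        print(gw_convolution_mc(height, p, mu=0.3))      # approximately sin(0.3)/0.3, slightly below f(p) = 1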

Next we show that the convolution in (1) is identical to the (GW) convolution everywhere.

Proposition 1.

For any $p\in\mathcal{M}$ and any kernel $\kappa_{\mu}$ defined on $T_{p}\mathcal{M}\cong\mathbb{R}^{n}$, it holds that $\widetilde{f}^{\mu}(p)=\widehat{f}^{\mu}(p)$.

Proof.

Let $dt$ be the base measure on $\mathbb{R}$ and let $dS_{\alpha}$ be the base measure on $\mathbb{S}_{p}(\alpha)$. We then have $dt\,dS_{\alpha}=d\Pi$. Thus it holds that

\widetilde{f}^{\mu}(p)=\int_{v\in\mathbb{S}_{p}(\alpha)}\int_{t\in\mathbb{R}}2f\left({\mathrm{Exp}}_{p}(tv)\right)\nu_{\alpha}\left(\alpha\frac{v}{\|v\|}\right)\Theta_{\mu,\alpha}(t)\,dt\,dS_{\alpha}
=\int_{u\in T_{p}\mathcal{M}}f\left({\mathrm{Exp}}_{p}(u)\right)\kappa_{\mu}(u)\,d\Pi=\widehat{f}^{\mu}(p),

where a change of variable $u=tv$ is used.

What is interesting about (1) is that it extends (GW) from a local definition to a global one, as pointed out in Remark 2.

Remark 2.

With the Levi-Civita connection, spheres are preserved after parallel transport, and so are the uniform base measures over the spheres. Thus the spherical integration of $\widetilde{f}_{v}^{\mu}$ defined in (2) can be carried out at $p$ regardless of whether $v\in T_{p}\mathcal{M}$ or not. Hence, the convolution defined by (1) and (2) is indeed the GW convolution, in the sense that $\widetilde{f}_{v}^{\mu}$ agrees with (GW) almost everywhere (the cut locus is of measure zero and thus $v_{q}$ is deterministically defined almost everywhere) if the convolution at every point is restricted to the direction specified by $v$ (or a parallel transport of $v$). Now the advantage of our decomposition is clear: $\widetilde{f}_{v}^{\mu}$ is globally defined and agrees with (GW) along the direction of $v$. This advantage leads to easy computations and meaningful applications, as we will see throughout the rest of the paper.

3.1 Properties of the GW Convolution

In their seminal works, Greene and Wu showed that the approximation is convexity-preserving when the width of the kernel approaches 0 (Theorem 2 of Greene and Wu, 1973). We show that the GW convolution is convexity-preserving even if the width of the kernel is strictly positive. Also, second-order properties of the GW approximation are proved.

3.1.1 Convexity-Preserving Property of the GW Convolution

Definition 2 (Convexity).

Let $\mathcal{M}$ be a complete Riemannian manifold and let $\epsilon>0$. A smooth function $f$ is convex near $p$ with radius exceeding $\epsilon$ if, for any $v\in\mathbb{S}_{p}$, the function $f\left({\mathrm{Exp}}_{p}(tv)\right)$ is convex in $t$ over $[-\epsilon,\epsilon]$.

Using our decomposition trick for the GW approximation (defined in Eq. 1), one can show that the GW convolution is convexity-preserving even if the width of the kernel is strictly positive.

Theorem 1.

Let $\mathcal{M}$ be a complete Riemannian manifold, and fix an arbitrary $p\in\mathcal{M}$. If there exists $\epsilon>0$ such that 1) $f$ is convex near $p$ with radius exceeding $\epsilon$, and 2) the density $\Theta_{\mu}$ (a shorthand for $\Theta_{\mu,1}$) induced by the kernel $\kappa_{\mu}$ satisfies $\int_{|t|\geq\frac{\epsilon}{2}}\Theta_{\mu}(t)\,dt=0$, then the Greene–Wu approximation $\widehat{f}^{\mu}$ is also convex near $p$ with radius exceeding $\frac{\epsilon}{2}$.

Proof of Theorem 1.

By Proposition 1 and Remark 2, it is sufficient to consider $\widetilde{f}^{\mu}$. Since $f$ is convex near $p$ with radius exceeding $\epsilon$, we have, for any $v\in\mathbb{S}_{p}$ and any real numbers $\lambda,t\in\left(0,\frac{\epsilon}{2}\right)$,

f({\mathrm{Exp}}_{p}((\lambda+t)v))+f({\mathrm{Exp}}_{p}((-\lambda+t)v))\geq 2f({\mathrm{Exp}}_{p}(tv)). (3)

Let $\gamma(s)={\mathrm{Exp}}_{p}(sv)$ (for $s\in[-\lambda-t,\lambda+t]$) for simplicity. We can then rewrite (3) as

f({\mathrm{Exp}}_{\gamma(\lambda)}(t\mathcal{P}_{0,\lambda}^{\gamma}(v)))+f({\mathrm{Exp}}_{\gamma(-\lambda)}(t\mathcal{P}_{0,-\lambda}^{\gamma}(v)))\geq 2f({\mathrm{Exp}}_{p}(tv)). (4)

We can integrate the first term in (4) with respect to $v$ and $t$, and get

\int_{v\in\mathbb{S}_{p}}\int_{t\in\left[0,\frac{\epsilon}{2}\right]}2f({\mathrm{Exp}}_{\gamma(\lambda)}(t\mathcal{P}_{0,\lambda}^{\gamma}(v)))\nu_{1}(v)\Theta_{\mu}(t)\,dt\,dS_{1}
=\int_{v\in\mathbb{S}_{\gamma(\lambda)}}\int_{t\in\left[0,\frac{\epsilon}{2}\right]}2f({\mathrm{Exp}}_{\gamma(\lambda)}(tv))\nu_{1}(v)\Theta_{\mu}(t)\,dt\,dS_{1}, (5)

where (5) uses the fact that parallel transport along geodesics preserves lengths and angles, which justifies the change of tangent space in the outer integral.

Since the density $\Theta_{\mu}$ satisfies $\int_{|t|\geq\frac{\epsilon}{2}}\Theta_{\mu}(t)\,dt=0$, the term on the right-hand side of (5) is exactly the definition of $\widetilde{f}^{\mu}(\gamma(\lambda))$. Repeating this computation for the second term in (4) gives

\widetilde{f}^{\mu}(\gamma(\lambda))+\widetilde{f}^{\mu}(\gamma(-\lambda))\geq 2\widetilde{f}^{\mu}(\gamma(0)),\quad\forall\lambda\in\left[0,\frac{\epsilon}{2}\right], (6)

which is the midpoint convexity (around $\gamma(0)$, along $\gamma$) for $\widetilde{f}^{\mu}$. Since midpoint convexity implies convexity for smooth functions, we know $\widetilde{f}^{\mu}$ is convex near $p$ along $\gamma$. Since the above argument holds for any $\gamma$, we conclude that $\widetilde{f}^{\mu}$ is convex near $p$ with radius exceeding $\frac{\epsilon}{2}$.

Previous Convexity-Preserving Result

Previously, Greene and Wu (1973, 1976, 1979) studied the GW convolution from an analytical and topological perspective, and showed that the GW convolution is convexity-preserving (Greene and Wu, 1973) when the width of the kernel approaches zero. For completeness, this previous result is summarized in Theorem 2. In Theorem 1 above, we show that the output of the GW convolution is convex with a strictly positive radius, as long as the input is convex with a strictly positive radius.

Theorem 2 (Greene and Wu, 1973).

Consider a smooth function $f$ defined over a complete Riemannian manifold $\mathcal{M}$. Let $\widehat{f}^{\mu}$ be the GW approximation of $f$ with kernel $\kappa_{\mu}$. Then it holds that

\underset{\mu\rightarrow 0^{+}}{\lim\inf}\left(\inf_{C\in\mathcal{C}_{p}}\frac{d^{2}\widehat{f}^{\mu}(C(t))}{dt^{2}}\Big{|}_{t=0}\right)\geq 0, (7)

where $\mathcal{C}_{p}$ is the set of all geodesics $C$ parametrized by arc-length such that $C(0)=p$.

3.2 Second Order Properties of the GW Convolution

Our decomposition of the Greene–Wu convolution also allows other properties of the GW convolution to be derived easily. Perhaps one of the most important properties of a function is its Hessian (matrix), which we discuss now. The Hessian is defined in Definition 3 (e.g., Schoen and Yau, 1994; Petersen, 2006).

Definition 3 (Hessian).

Consider an $n$-dimensional Riemannian manifold $(\mathcal{M},g)$. For $f\in C^{\infty}(\mathcal{M})$, define the Hessian of $f$ as $H(f):\mathfrak{X}(\mathcal{M})\times\mathfrak{X}(\mathcal{M})\rightarrow C^{\infty}(\mathcal{M})$ such that

H(f)\left(X,Y\right):=\nabla_{X}df(Y),\quad\forall X,Y\in\mathfrak{X}(\mathcal{M}),

where $\nabla$ is the Levi-Civita connection. Using Christoffel symbols, it holds that

H(f)(\partial_{i},\partial_{j})=\partial_{i}\partial_{j}f-\Gamma_{ij}^{k}\partial_{k}f.

The Hessian matrix of $f$ at $p$ is the map $H_{p,f}:T_{p}\mathcal{M}\rightarrow T_{p}\mathcal{M}$ such that $H_{p,f}(v)=\nabla_{v}\,\mathrm{grad}f\big{|}_{p}$ for any $v\in T_{p}\mathcal{M}$, so that $\left<H_{p,f}(v),u\right>_{p}=H(f)(u,v)$ at $p$ for all $u,v\in T_{p}\mathcal{M}$.

With the Levi-Civita connection, which is torsion-free, one can verify that

H(f)(X,Y)=H(f)(Y,X).

For any $p\in\mathcal{M}$, in the normal coordinates specified by the exponential map ${\mathrm{Exp}}_{p}$, the Christoffel symbols vanish at $p$. This means $H(f)(\partial_{i},\partial_{j})\Big{|}_{p}=\partial_{i}\partial_{j}f\big{|}_{p}$ in local normal coordinates. As in the Euclidean case, the eigenvalues of the Hessian matrix describe the curvature of the function. Next, we show how the curvature of the function is affected by the GW convolution.

In the study of the second-order properties of the GW approximation, we restrict our attention to kernels whose radial density function is uniform. More specifically, we use $\alpha=1$ and $\Theta_{\mu,\alpha}(t)=\frac{1}{2\mu}\mathbb{I}_{\left[t\in\left[-\mu,\mu\right]\right]}$. Since one can construct other kernels by combining kernels with uniform radial density functions, investigating kernels with uniform radial density is sufficient for our purpose. With this in mind, we look at

\widetilde{f}_{v}^{\mu}(q)=\frac{1}{2\mu}\int_{-\mu}^{\mu}f\left({\mathrm{Exp}}_{q}(tv_{q})\right)dt. (8)
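
As a quick illustration of (8), the sketch below evaluates $\widetilde{f}_{v}^{\mu}$ at $q=p$ (so $v_{q}=v$ and no parallel transport is needed) by a Riemann-sum approximation of the average of $f$ along the geodesic segment. The round sphere and the helper sphere_exp are the same illustrative assumptions as in the earlier sketch, and directional_gw is our name for the demo routine.

    import numpy as np

    def sphere_exp(p, v):
        """Exponential map of the round sphere S^{d-1} (illustrative manifold)."""
        nv = np.linalg.norm(v)
        return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

    def directional_gw(f, p, v, mu, n_grid=2001):
        """Approximate (8) at q = p: the average of f(Exp_p(t v)) over t in [-mu, mu]."""
        ts = np.linspace(-mu, mu, n_grid)
        vals = np.array([f(sphere_exp(p, t * v)) for t in ts])
        return vals.mean()      # Riemann-sum approximation of (1/(2 mu)) * integral

    p = np.array([0.0, 0.0, 1.0])                        # a point on S^2
    v = np.array([1.0, 0.0, 0.0])                        # a unit tangent vector at p
    print(directional_gw(lambda x: x[2], p, v, 0.3))     # approximately sin(0.3)/0.3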

The following theorem tells us how the geometry of the function and the geometry of the space are related after the GW convolution.

Theorem 3.

Consider a complete Riemannian manifold $\mathcal{M}$ and a kernel with radial density $\Theta_{\mu,\alpha}(t)=\frac{1}{2\mu}\mathbb{I}_{\left[t\in\left[-\mu,\mu\right]\right]}$. For any $p\in\mathcal{M}$, there exists $\mu_{0}>0$ such that for all $\mu\in(0,\mu_{0})$, it holds that

D_{uu}\widetilde{f}_{v}^{\mu}(p)=\frac{1}{2\mu}\int_{t\in[-\mu,\mu]}D_{u_{{\mathrm{Exp}}_{p}(tv)}u_{{\mathrm{Exp}}_{p}(tv)}}f({\mathrm{Exp}}_{p}(tv))\,dt
+\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt,

for all $v,u\in\mathbb{S}_{p}$.

We will use the double exponential map (Gavrilov, 2007) notation in the proof. The double exponential map (also written ${\mathrm{Exp}}_{p}$) is defined as ${\mathrm{Exp}}_{p}:T_{p}\mathcal{M}\times T_{p}\mathcal{M}\rightarrow\mathcal{M}$ such that, when $v\in T_{p}\mathcal{M}$ is small enough (${\mathrm{Exp}}_{p}(v)\in U_{p}$),

{\mathrm{Exp}}_{p}(v,u)={\mathrm{Exp}}_{{\mathrm{Exp}}_{p}(v)}(u_{{\mathrm{Exp}}_{p}(v)}). (9)

With this notation, we are ready to prove Theorem 3.

Proof of Theorem 3.

With this notion of the double exponential map, we have

D_{uu}\widetilde{f}_{v}^{\mu}(p):=\lim_{\tau\rightarrow 0}\frac{\widetilde{f}_{v}^{\mu}({\mathrm{Exp}}_{p}(\tau u))-2\widetilde{f}_{v}^{\mu}(p)+\widetilde{f}_{v}^{\mu}({\mathrm{Exp}}_{p}(-\tau u))}{\tau^{2}}
=\frac{1}{2\mu}\int_{-\mu}^{\mu}\lim_{\tau\rightarrow 0}\frac{{f}({\mathrm{Exp}}_{p}(\tau u,tv))-2{f}({\mathrm{Exp}}_{p}(tv))+{f}({\mathrm{Exp}}_{p}(-\tau u,tv))}{\tau^{2}}\,dt. (10)

Let $\varphi_{u}(t,\tau)=f({\mathrm{Exp}}_{p}(tv,\tau u))$, and let

\psi_{u,v}(\tau,t)=f({\mathrm{Exp}}_{p}(\tau u,tv))-f({\mathrm{Exp}}_{p}(tv,\tau u))+f({\mathrm{Exp}}_{p}(-\tau u,tv))-f({\mathrm{Exp}}_{p}(tv,-\tau u)).

We can then write (10) as

D_{uu}\widetilde{f}_{v}^{\mu}(p)=\frac{1}{2\mu}\int_{t\in[-\mu,\mu]}\lim_{\tau\rightarrow 0}\frac{\varphi_{u}(t,\tau)-2\varphi_{u}(t,0)+\varphi_{u}(t,-\tau)+\psi_{u,v}(\tau,t)}{\tau^{2}}\,dt. (11)

For the terms involving $\varphi_{u}$, one has

\lim_{\tau\rightarrow 0}\frac{\varphi_{u}(t,\tau)-2\varphi_{u}(t,0)+\varphi_{u}(t,-\tau)}{\tau^{2}}=\frac{d^{2}\varphi_{u}(t,\tau)}{d\tau^{2}}\Big{|}_{\tau=0}=D_{u_{q}u_{q}}f(q), (12)

where $q={\mathrm{Exp}}_{p}(tv)$ and $u_{q}=\mathcal{P}_{p\rightarrow q}(u)$.

For $\psi_{u,v}(\tau,t)$, we consider the term $f({\mathrm{Exp}}_{p}(u,v))$ for a smooth function $f$ and small vectors $u,v\in T_{p}\mathcal{M}$. For any $p$, $v\in T_{p}\mathcal{M}$ and $q\in U_{p}$, define $h_{v}^{(j)}(q)=\nabla_{v_{q}}^{j}f(q)$. We can Taylor expand $h_{v}^{(j)}({\mathrm{Exp}}_{p}(u))$ as

h_{v}^{(j)}({\mathrm{Exp}}_{p}(u))=h_{v}^{(j)}({\mathrm{Exp}}_{p}(tu))\big{|}_{t=1}
=\sum_{i=0}^{\infty}\frac{1}{i!}\frac{d^{i}}{dt^{i}}h_{v}^{(j)}({\mathrm{Exp}}_{p}(tu))\Big{|}_{t=0}
=\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}h_{v}^{(j)}(p)
\overset{(a)}{=}\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}\nabla_{v}^{j}f(p).

Thus we have, for any $p$ and $u,v\in T_{p}\mathcal{M}$ of small norm,

f\left({\mathrm{Exp}}_{p}\left(u,v\right)\right)=f\left({\mathrm{Exp}}_{{\mathrm{Exp}}_{p}(u)}\left(v_{{\mathrm{Exp}}_{p}(u)}\right)\right)
=\sum_{j=0}^{\infty}\frac{1}{j!}\nabla_{v_{{\mathrm{Exp}}_{p}(u)}}^{j}f({\mathrm{Exp}}_{p}(u))
=\sum_{j=0}^{\infty}\frac{1}{j!}h_{v}^{(j)}({\mathrm{Exp}}_{p}(u))
=\sum_{j=0}^{\infty}\frac{1}{j!}\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}\nabla_{v}^{j}f(p),

where the last equality uses (a).

Thus we have

f({\mathrm{Exp}}_{p}(\tau u,tv))+f({\mathrm{Exp}}_{p}(-\tau u,tv))-f({\mathrm{Exp}}_{p}(tv,\tau u))-f({\mathrm{Exp}}_{p}(tv,-\tau u))
=\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{\tau u}^{i}\nabla_{tv}^{j}f(p)+\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{-\tau u}^{i}\nabla_{tv}^{j}f(p)
-\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{tv}^{j}\nabla_{\tau u}^{i}f(p)-\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{tv}^{j}\nabla_{-\tau u}^{i}f(p)
=\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{u}^{2}\nabla_{tv}^{j}f(p)+\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{u}^{2}\nabla_{tv}^{j}f(p)
-\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{tv}^{j}\nabla_{u}^{2}f(p)-\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{tv}^{j}\nabla_{u}^{2}f(p)+\mathcal{O}(\tau^{3}).

Thus we have

\lim_{\tau\rightarrow 0}\frac{\psi_{u,v}(\tau,t)}{\tau^{2}}=\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)-\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)+Odd(t), (13)

where $Odd(t)$ denotes terms that are odd in $t$.

Combining equations (11), (12), (13) and that

\int_{-\mu}^{\mu}Odd(t)\,dt=0

gives

D_{uu}\widetilde{f}_{v}^{\mu}(p)=\frac{1}{2\mu}\int_{t\in[-\mu,\mu]}D_{u_{q}u_{q}}f(q)\,dt
+\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\frac{1}{2\mu}\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt,

where $q={\mathrm{Exp}}_{p}(tv)$.

Theorem 3 says that when the space is not flat, the GW convolution changes the curvature of the function (its Hessian matrix) in a non-trivial way. A more concrete instance of this fact is given in Corollary 1.

Corollary 1.

Consider a kernel with radial density function $\Theta_{\mu,\alpha}(t)=\frac{1}{2\mu}\mathbb{I}_{\left[t\in\left[-\mu,\mu\right]\right]}$. For a smooth function $f$ defined over a complete Riemannian manifold $\mathcal{M}$ and any $p\in\mathcal{M}$, there exists $\mu_{0}>0$ such that for all $\mu\in(0,\mu_{0})$, it holds that

\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)\geq\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\lambda_{\min}\left(H_{{\mathrm{Exp}}_{p}(tv),f}\right)dt\right]
+\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right].

Since the minimum eigenvalue of the Hessian matrix measures the degree of convexity of a function, this corollary establishes a quantitative description of how the GW convolution affects the degree of convexity.

Proof of Corollary 1.

For any smooth function $f$ and point $p$, one has (e.g., Absil et al., 2009)

u^{\top}H_{p,f}u:=\left<H_{p,f}(u),u\right>=D_{uu}f(p),

and thus

\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)=\min_{u\in\mathbb{S}_{p}}u^{\top}H_{p,\widetilde{f}^{\mu}}u=\min_{u\in\mathbb{S}_{p}}D_{uu}\widetilde{f}^{\mu}(p). (14)

By Theorem 3, we then have

\min_{u\in\mathbb{S}_{p}}D_{uu}\widetilde{f}^{\mu}(p)
\geq\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\min_{w\in\mathbb{S}_{{\mathrm{Exp}}_{p}(tv)}}D_{ww}f\left({\mathrm{Exp}}_{p}(tv)\right)dt\right]
+\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right].

Since $\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)=\min_{u\in\mathbb{S}_{p}}u^{\top}H_{p,\widetilde{f}^{\mu}}u=\min_{u\in\mathbb{S}_{p}}D_{uu}\widetilde{f}^{\mu}(p)$ for any $p$ and any $f$, it holds that

\lambda_{\min}\left(H_{p,\widetilde{f}^{\mu}}\right)\geq\;\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\lambda_{\min}\left(H_{{\mathrm{Exp}}_{p}(tv),f}\right)dt\right]
+\min_{u\in\mathbb{S}_{p}}\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{u}^{2}\nabla_{v}^{2j}f(p)\,dt-\int_{-\mu}^{\mu}\sum_{j=1}^{\infty}\frac{t^{2j}}{(2j)!}\nabla_{v}^{2j}\nabla_{u}^{2}f(p)\,dt\right].

4 Estimating Gradient over Riemannian Manifolds

From our formulation of the GW approximation, one can obtain tighter bounds for gradient estimation over Riemannian manifolds, for geodesically $L_{1}$-smooth functions defined as follows.

Definition 4.

Let $p\in\mathcal{M}$. A function $f:\mathcal{M}\rightarrow\mathbb{R}$ is called geodesically $L_{1}$-smooth near $p$ if

\left\|\mathcal{P}_{p\rightarrow q}\left(\mathrm{grad}f\big{|}_{p}\right)-\mathrm{grad}f\big{|}_{q}\right\|_{q}\leq L_{1}d_{p}(p,q),\quad\forall q\in U_{p}, (15)

where $\mathrm{grad}f\big{|}_{p}$ is the gradient of $f$ at $p$. If $f$ is geodesically $L_{1}$-smooth near $p$ for all $p\in\mathcal{M}$, we say $f$ is geodesically $L_{1}$-smooth over $\mathcal{M}$.

For a geodesically $L_{1}$-smooth function, one has the following theorem that bridges the gradient at $p$ and the zeroth-order information near $p$.

Theorem 4.

Fix $\alpha>0$. Let $f$ be geodesically $L_{1}$-smooth near $p$. It holds that

\left\|\mathrm{grad}f\big{|}_{p}-\frac{n}{\mu\alpha^{2}}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[f\left({\mathrm{Exp}}_{p}(\mu v)\right)v\right]\right\|_{p}\leq\frac{L_{1}n\alpha\mu}{2}, (16)

where $\mathbb{S}_{p}(\alpha)$ is the origin-centered sphere in $T_{p}\mathcal{M}$ of radius $\alpha$.

This theorem provides a very simple and practical gradient estimation method: at a point $p\in\mathcal{M}$, we select small numbers $\mu$ and $\alpha$, and uniformly sample a vector $v\sim\mathbb{S}_{p}(\alpha)$. With these $\mu,\alpha,v$, the random vector

\widehat{\mathrm{grad}}f\big{|}_{p}(v)=\frac{n}{2\alpha^{2}\mu}\left(f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right)v (17)

gives an estimator of $\mathrm{grad}f\big{|}_{p}$.

One can also independently sample $v_{1},v_{2},v_{3},\cdots,v_{m}$ uniformly from $\mathbb{S}_{p}(\alpha)$. Using these samples, an estimator for $\mathrm{grad}f\big{|}_{p}$ is

\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})=\frac{n}{2m\alpha^{2}\mu}\sum_{i=1}^{m}\left(f\left({\mathrm{Exp}}_{p}(\mu v_{i})\right)-f\left({\mathrm{Exp}}_{p}(-\mu v_{i})\right)\right)v_{i}. (18)
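
To illustrate, below is a minimal implementation of the estimators (17) and (18), again using the round sphere $S^{d-1}\subset\mathbb{R}^{d}$ as an assumed test manifold; on another manifold one would substitute the appropriate ${\mathrm{Exp}}_{p}$. For the test function $f(x)=\left<c,x\right>$ restricted to the sphere, the Riemannian gradient at $p$ is the tangential projection $c-\left<c,p\right>p$, which serves as ground truth in the usage example. The helper names sphere_exp and estimate_grad are ours.

    import numpy as np

    def sphere_exp(p, v):
        """Exponential map of the round sphere S^{d-1} (illustrative manifold)."""
        nv = np.linalg.norm(v)
        return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

    def estimate_grad(f, p, mu=1e-3, alpha=1.0, m=20000, seed=0):
        """Estimator (18): average of m symmetric finite differences taken along
        directions v_i drawn uniformly from the sphere S_p(alpha) in T_p."""
        rng = np.random.default_rng(seed)
        d = p.shape[0]
        n = d - 1                                   # intrinsic dimension of S^{d-1}
        grad_hat = np.zeros(d)
        for _ in range(m):
            g = rng.standard_normal(d)
            g -= np.dot(g, p) * p                   # project onto T_p
            v = alpha * g / np.linalg.norm(g)       # v_i ~ uniform on S_p(alpha)
            diff = f(sphere_exp(p, mu * v)) - f(sphere_exp(p, -mu * v))
            grad_hat += diff * v                    # single-sample term, as in (17)
        return n / (2 * m * alpha**2 * mu) * grad_hat

    if __name__ == "__main__":
        c = np.array([1.0, 2.0, 3.0])
        f = lambda x: np.dot(c, x)
        p = np.array([0.0, 0.0, 1.0])
        print(estimate_grad(f, p))                   # close to the true gradient below
        print(c - np.dot(c, p) * p)                  # Riemannian gradient of f at p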

Note that Theorem 4 also provides an in-expectation bound for the estimator in (18). Before proving Theorem 4, we first present Propositions 2 and 3 and Lemma 1.

Proposition 2.

Consider a continuously differentiable function $f$ defined over a complete Riemannian manifold $\mathcal{M}$. If $f$ is geodesically $L_{1}$-smooth over $\mathcal{M}$, then it holds that

  1. $f\left({\mathrm{Exp}}_{p}(u)\right)\leq f(p)+\left<\mathrm{grad}f\big{|}_{p},u\right>+\frac{L_{1}}{2}\|u\|^{2}$ for all $p\in\mathcal{M}$ and $u\in T_{p}\mathcal{M}$;

  2. for any $v\in T_{p}\mathcal{M}$, $\left\|\mathcal{P}_{0,1}^{\gamma}\left(\mathrm{grad}f\big{|}_{p}\right)-\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(v)}\right\|\leq L_{1}\|v\|$, where $\gamma(t):={\mathrm{Exp}}_{p}(tv)$.

Proof.

For any $p\in\mathcal{M}$ and $u\in T_{p}\mathcal{M}$, let $q_{0}=p,q_{1},q_{2},\cdots,q_{N}={\mathrm{Exp}}_{p}(u)$ be a sequence of points such that $q_{i+1}\in U_{q_{i}}$ and $q_{i+1}={\mathrm{Exp}}_{q_{i}}(\tau u_{q_{i}})$, for a sufficiently small $\tau>0$. Since the Levi-Civita connection preserves the Riemannian metric, we know $N\tau=1$. At any $q_{i}$, we have

\left<\mathrm{grad}f\big{|}_{q_{i}},u_{q_{i}}\right>
=\left<\mathrm{grad}f\big{|}_{q_{i}}-\mathcal{P}_{q_{i-1}\rightarrow q_{i}}\left(\mathrm{grad}f\big{|}_{q_{i-1}}\right),u_{q_{i}}\right>+\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right> (19)
\leq\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right>+L_{1}\tau\|u\|^{2}
\leq\cdots
\leq\left<\mathrm{grad}f\big{|}_{p},u\right>+iL_{1}\tau\|u\|^{2},

where the inequality following (19) uses the definition of geodesic $L_{1}$-smoothness together with the Cauchy–Schwarz inequality.

Also, it holds that

f\left(q_{i}\right)-f\left(q_{i-1}\right)
=\int_{0}^{\tau}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})},u_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}\right>dt.\qquad(\text{by Lemma 1}) (20)

By the definition of geodesic $L_{1}$-smoothness and the Cauchy–Schwarz inequality, it holds that

\int_{0}^{\tau}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})},u_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}\right>dt-\tau\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right>
\leq\int_{0}^{\tau}\left\|\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}-\mathcal{P}_{q_{i-1}\rightarrow{\mathrm{Exp}}_{q_{i-1}}(tu_{q_{i-1}})}\left(\mathrm{grad}f\big{|}_{q_{i-1}}\right)\right\|\|u\|\,dt
\leq\frac{L_{1}\tau^{2}\|u\|^{2}}{2}. (21)

Adding and subtracting $\tau\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right>$ in (20) and using (21) gives

f\left(q_{i}\right)-f\left(q_{i-1}\right)\leq\frac{L_{1}\tau^{2}\|u\|^{2}}{2}+\tau\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right>.

Thus we have

f({\mathrm{Exp}}_{p}(u))-f(p)\leq\sum_{i=1}^{N}\left(f(q_{i})-f(q_{i-1})\right)
\leq\sum_{i=1}^{N}\left(\tau\left<\mathrm{grad}f\big{|}_{q_{i-1}},u_{q_{i-1}}\right>+\frac{L_{1}\tau^{2}\|u\|^{2}}{2}\right)
\leq N\tau\left<\mathrm{grad}f\big{|}_{p},u\right>+\sum_{i=1}^{N}L_{1}(i-1)\tau^{2}\|u\|^{2}+\frac{L_{1}N\tau^{2}\|u\|^{2}}{2}
=N\tau\left<\mathrm{grad}f\big{|}_{p},u\right>+\frac{L_{1}N^{2}\tau^{2}\|u\|^{2}}{2}+O(N\tau^{2}), (22)

where the $O\left(\cdot\right)$ notation omits constants that do not depend on $N$ or $\tau$. Since $N\tau=1$ and (22) holds for arbitrarily small $\tau$, letting $\tau\rightarrow 0$ in (22) proves the first item.

The second item can be proved in a similar way. Specifically, we can also find a sequence of points $q_{0},q_{1},\cdots,q_{N}$ on the curve $\gamma(t)={\mathrm{Exp}}_{p}(tv)$ with $q_{0}=p$ and $q_{N}={\mathrm{Exp}}_{p}(v)$, and repeatedly use the definition of geodesic $L_{1}$-smoothness.

Proposition 3.

For any vector $u\in T_{p}\mathcal{M}$, we have

\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>v\right]=\frac{\alpha^{2}}{n}u,\quad\text{ and }\quad\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>^{2}\right]=\frac{\alpha^{2}}{n}\|u\|^{2}.
Proof.

It is sufficient to consider $u=e_{i}$, where $\{e_{i}\}_{i}$ is the local canonical basis for $T_{p}\mathcal{M}$ with respect to $g_{p}$. For any $i,j\in\{1,2,\cdots,n\}$,

\left(\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<e_{i},v\right>v\right]\right)_{j}=\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{i}v_{j}\right].

Since $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{j}|v_{i}=x\right]=0$ for any $x$ when $i\neq j$, we have $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{i}v_{j}\right]=0$. When $i=j$, by symmetry, we have $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[v_{j}^{2}\right]=\frac{\alpha^{2}}{n}$ for all $j=1,2,\cdots,n$.

We have shown $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>v\right]=\frac{\alpha^{2}}{n}u$; taking the inner product with $u$ then shows that $\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<u,v\right>^{2}\right]=\frac{\alpha^{2}}{n}\|u\|^{2}$. ∎
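
Proposition 3 is the standard second-moment identity for the uniform distribution on a sphere; the following short numerical sanity check (with arbitrary illustrative choices of $n$, $\alpha$ and $u$) matches both identities to Monte Carlo accuracy.

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha = 5, 2.0
    u = rng.standard_normal(n)

    # v ~ uniform on the sphere of radius alpha in R^n (identified with T_p M)
    v = rng.standard_normal((200000, n))
    v = alpha * v / np.linalg.norm(v, axis=1, keepdims=True)

    print(((v @ u)[:, None] * v).mean(axis=0))        # E[<u, v> v], approximately (alpha^2 / n) u
    print((alpha**2 / n) * u)
    print(((v @ u) ** 2).mean(), (alpha**2 / n) * np.dot(u, u))   # the two numbers approximately agree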

Lemma 1.

Pick $p\in\mathcal{M}$ and $v\in T_{p}\mathcal{M}$. Let $\gamma(t)={\mathrm{Exp}}_{p}(tv)$ $(t\in[a,b])$ be a geodesic in $\mathcal{M}$. We then have

\int_{a}^{b}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(sv)},\dot{\gamma}\big{|}_{s}\right>ds=f(\gamma(b))-f(\gamma(a)). (23)
Proof.

Consider the function $f\circ\gamma:[a,b]\rightarrow\mathbb{R}$. By the fundamental theorem of calculus, we have

f(\gamma(b))-f(\gamma(a))=\int_{a}^{b}\frac{d(f\circ\gamma)}{dt}\,dt.

Also, the directional derivative over Riemannian manifolds can be computed by

\frac{d(f\circ\gamma)}{dt}\Bigg{|}_{s}=\lim_{\tau\rightarrow 0}\frac{f\circ\gamma(s+\tau)-f\circ\gamma(s)}{\tau}=\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(sv)},\dot{\gamma}\big{|}_{s}\right>.

Combining the above results finishes the proof. ∎

Lemma 1 can be viewed as a corollary of the fundamental theorem of calculus, or as a consequence of Stokes' theorem. If one views the curve $\gamma$ as a manifold with boundary ($\gamma(a)$ and $\gamma(b)$), and $\dot{\gamma}$ as a vector field parallel to $\gamma$, then applying Stokes' theorem proves Lemma 1.

We are now ready to prove Theorem 4, using our decomposition trick with the radial density function $\Theta_{\mu,\alpha}(x)=\frac{1}{2\mu}\mathbb{I}_{\left[x\in\left[-\mu,\mu\right]\right]}$.

Proof of Theorem 4.

From the definition of the directional derivative, we have

\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{|}_{p},v\right>_{p}=\frac{1}{2\mu}\lim_{\tau\rightarrow 0}\frac{\int_{-\mu}^{\mu}\left(f\left({\mathrm{Exp}}_{{\mathrm{Exp}}_{p}(\tau v)}\left(tv_{{\mathrm{Exp}}_{p}(\tau v)}\right)\right)-f\left({\mathrm{Exp}}_{p}\left(tv\right)\right)\right)dt}{\tau}
=\frac{1}{2\mu}\int_{-\mu}^{\mu}\lim_{\tau\rightarrow 0}\frac{f\left({\mathrm{Exp}}_{p}\left(\tau v+tv\right)\right)-f\left({\mathrm{Exp}}_{p}\left(tv\right)\right)}{\tau}dt
=\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt. (24)

For simplicity, let $\gamma_{v}(t)={\mathrm{Exp}}_{p}(tv)$ ($t\in[-\mu,\mu]$) for any $v\in T_{p}\mathcal{M}$. Since $f$ is geodesically $L_{1}$-smooth, we have, for any $p$ and $v\in\mathbb{S}_{p}(\alpha)$,

\left<\mathrm{grad}{f}\big{|}_{p}-\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{|}_{p},v\right>_{p}
=\left<\mathrm{grad}{f}\big{|}_{p},v\right>_{p}-\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt (25)
=\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathcal{P}_{0,t}^{\gamma_{v}}\left(\mathrm{grad}{f}\big{|}_{p}\right),v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt
-\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt (26)
=\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathcal{P}_{0,t}^{\gamma_{v}}\left(\mathrm{grad}{f}\big{|}_{p}\right)-\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt
\leq\frac{1}{2\mu}\int_{-\mu}^{\mu}L_{1}|t|\alpha^{2}\,dt=\frac{L_{1}\alpha^{2}\mu}{2}, (27)

where (25) uses (24), (26) uses the fact that parallel transport preserves the Riemannian metric, and (27) uses Proposition 2 and the Cauchy–Schwarz inequality.

By Proposition 3, we have

\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}f\big{|}_{p},v\right>_{p}v\right]=\frac{\alpha^{2}}{n}\mathrm{grad}f\big{|}_{p},

and thus from (27), we have

\left\|\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{|}_{p},v\right>v\right]-\frac{\alpha^{2}}{n}\mathrm{grad}f\big{|}_{p}\right\|
=\left\|\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{|}_{p},v\right>v-\left<\mathrm{grad}f\big{|}_{p},v\right>v\right]\right\|
=\left\|\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{|}_{p}-\mathrm{grad}f\big{|}_{p},v\right>v\right]\right\|
\leq\frac{L_{1}\alpha^{3}\mu}{2}. (28)

Applying Lemma 1 and the above results gives

\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left<\mathrm{grad}\widetilde{f}_{v}^{\mu}\big{|}_{p},v\right>_{p}v\right]
=\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\frac{1}{2\mu}\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}\left(tv\right)},v_{{\mathrm{Exp}}_{p}(tv)}\right>_{{\mathrm{Exp}}_{p}(tv)}dt\;v\right] \qquad(\text{by Eq. 24})
=\frac{1}{2\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[\left(f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right)v\right] \qquad(\text{by Lemma 1})
=\frac{1}{\mu}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[f\left({\mathrm{Exp}}_{p}(\mu v)\right)v\right],

where the last equality uses the symmetry of the uniform distribution on $\mathbb{S}_{p}(\alpha)$ under $v\mapsto-v$.

Combining the above result with (28) gives

\left\|\mathrm{grad}f\big{|}_{p}-\frac{n}{\mu\alpha^{2}}\mathbb{E}_{v\sim\mathbb{S}_{p}(\alpha)}\left[f\left({\mathrm{Exp}}_{p}(\mu v)\right)v\right]\right\|\leq\frac{L_{1}n\alpha\mu}{2}. (29)

A bound on the second moment of the estimator (18) is given in Theorem 5.

Theorem 5.

If $f$ is geodesically $L_{1}$-smooth, then it holds that

\mathbb{E}_{v_{1},v_{2},\cdots,v_{m}\overset{i.i.d.}{\sim}\mathbb{S}_{p}(\alpha)}\left[\left\|\widehat{\mathrm{grad}}f\big{|}_{p}\left(v_{1},v_{2},\cdots,v_{m}\right)\right\|_{p}^{2}\right]
\leq\frac{n^{2}}{m}\left(\frac{L_{1}\alpha\mu}{2}+\left\|\mathrm{grad}f\big{|}_{p}\right\|\right)^{2}
+\frac{m-1}{m}\left(\left\|\mathrm{grad}f\big{|}_{p}\right\|^{2}+\left\|\mathrm{grad}f\big{|}_{p}\right\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4}\right).
Proof.

By Lemma 1 and the geodesic $L_{1}$-smoothness of $f$, we have, for any $v\in\mathbb{S}_{p}(\alpha)$,

\left|f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right|
=\left|\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(tv)},v_{{\mathrm{Exp}}_{p}(tv)}\right>dt-\int_{-\mu}^{\mu}\left<\mathrm{grad}f\big{|}_{p},v\right>dt+2\mu\left<\mathrm{grad}f\big{|}_{p},v\right>\right|
\leq\int_{-\mu}^{\mu}\left|\left<\mathrm{grad}f\big{|}_{{\mathrm{Exp}}_{p}(tv)}-\mathcal{P}_{p\rightarrow{\mathrm{Exp}}_{p}(tv)}\left(\mathrm{grad}f\big{|}_{p}\right),\mathcal{P}_{p\rightarrow{\mathrm{Exp}}_{p}(tv)}(v)\right>\right|dt
+\left|2\mu\left<\mathrm{grad}f\big{|}_{p},v\right>\right|
\leq L_{1}\int_{-\mu}^{\mu}|t|\|v\|^{2}\,dt+2\alpha\mu\left\|\mathrm{grad}f\big{|}_{p}\right\|
\leq L_{1}\alpha^{2}\mu^{2}+2\alpha\mu\left\|\mathrm{grad}f\big{|}_{p}\right\|. (30)

By (30), it holds that,

𝔼[grad^f|p(v)2]=\displaystyle\mathbb{E}\left[\left\|\widehat{\mathrm{grad}}f\big{|}_{p}\left(v\right)\right\|^{2}\right]= 𝔼[n24μ2α4(f(Expp(μv))f(Expp(μv)))2v2]\displaystyle\mathbb{E}\left[\frac{n^{2}}{4\mu^{2}\alpha^{4}}\left(f\left({\mathrm{Exp}}_{p}(\mu v)\right)-f\left({\mathrm{Exp}}_{p}(-\mu v)\right)\right)^{2}\|v\|^{2}\right]
\displaystyle\leq n2(L1αμ2+gradf|p)2\displaystyle{n^{2}}\left(\frac{L_{1}\alpha\mu}{2}+\left\|\mathrm{grad}f\big{|}_{p}\right\|\right)^{2} (31)

By Theorem 4, we have

𝔼[grad^f|p(vi)]=gradf|p+O(L1nαμ2𝟏),\displaystyle\mathbb{E}\left[\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right]=\mathrm{grad}f\big{|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),

where 𝟏\mathbf{1} is the all-one vector, and the big-O notation here means that, with respect to any orthonormal coordinate system, each entry of 𝔼[grad^f|p(vi)]\mathbb{E}\left[\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right] differs from the corresponding entry of gradf|p\mathrm{grad}f\big{|}_{p} by at most L1nαμ2\frac{L_{1}n\alpha\mu}{2}.

Since v1,v2,,vmv_{1},v_{2},\cdots,v_{m} are mutually independent, we have, for any iji\neq j,

𝔼grad^f|p(vi),grad^f|p(vj)\displaystyle\mathbb{E}\left<\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i}),\widehat{\mathrm{grad}}f\big{|}_{p}(v_{j})\right>
=\displaystyle= gradf|p+O(L1nαμ2𝟏),gradf|p+O(L1nαμ2𝟏)\displaystyle\left<\mathrm{grad}f\big{|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),\mathrm{grad}f\big{|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right)\right>
\displaystyle\leq gradf|p2+gradf|pL1n3/2αμ+L12n3α2μ24,\displaystyle\left\|\mathrm{grad}f\big{|}_{p}\right\|^{2}+\left\|\mathrm{grad}f\big{|}_{p}\right\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4},

where the last inequality uses the Cauchy-Schwarz inequality. Thus, expanding out all terms, we have

𝔼[grad^f|p(v1,v2,,vm)2]\displaystyle\mathbb{E}\left[\left\|\widehat{\mathrm{grad}}f\big{|}_{p}\left(v_{1},v_{2},\cdots,v_{m}\right)\right\|^{2}\right]
=\displaystyle= 1m2𝔼[i=1mgrad^f|p(vi)2]\displaystyle\frac{1}{m^{2}}\mathbb{E}\left[\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}\left(v_{i}\right)\right\|^{2}\right]
+1m21i,jm:ij𝔼gradf|p+O(L1nαμ2𝟏),gradf|p+O(L1nαμ2𝟏)\displaystyle+\frac{1}{m^{2}}\sum_{1\leq i,j\leq m:i\neq j}\mathbb{E}\left<\mathrm{grad}f\big{|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right),\mathrm{grad}f\big{|}_{p}+O\left(\frac{L_{1}n\alpha\mu}{2}\mathbf{1}\right)\right>
\displaystyle\leq n2m(L1αμ2+gradf|p)2\displaystyle\frac{n^{2}}{m}\left(\frac{L_{1}\alpha\mu}{2}+\left\|\mathrm{grad}f\big{|}_{p}\right\|\right)^{2}
+m1m(gradf|p2+gradf|pL1n3/2αμ+L12n3α2μ24).\displaystyle+\frac{m-1}{m}\left(\left\|\mathrm{grad}f\big{|}_{p}\right\|^{2}+\left\|\mathrm{grad}f\big{|}_{p}\right\|L_{1}n^{3/2}\alpha\mu+\frac{L_{1}^{2}n^{3}\alpha^{2}\mu^{2}}{4}\right).
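
For concreteness, the spherical estimator analyzed in Theorems 4 and 5 (the average of m single-sample estimates) can be sketched in a few lines of Python. The sketch below is ours and is not the implementation used for the experiments in this paper; the callable exp_map (playing the role of {\mathrm{Exp}}_{p}) is assumed to be supplied by the user, and T_{p}\mathcal{M} is identified with \mathbb{R}^{n} through an orthonormal basis.

import numpy as np

def spherical_grad_estimate(f, exp_map, p, n, m, mu, alpha, rng=None):
    # Two-point spherical estimator: average over m i.i.d. directions v drawn uniformly
    # from the sphere of radius alpha in T_p M (identified with R^n), each contributing
    # n / (2 * mu * alpha^2) * (f(Exp_p(mu v)) - f(Exp_p(-mu v))) * v.
    rng = np.random.default_rng() if rng is None else rng
    estimate = np.zeros(n)
    for _ in range(m):
        v = rng.standard_normal(n)
        v *= alpha / np.linalg.norm(v)  # uniform direction, rescaled to radius alpha
        diff = f(exp_map(p, mu * v)) - f(exp_map(p, -mu * v))
        estimate += n / (2.0 * mu * alpha ** 2) * diff * v
    return estimate / m

With alpha set to \sqrt{n}, this matches the configuration used in the empirical comparison of Section 5.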

4.1 Previous Methods for Riemannian Gradient Estimation

Previously, Nesterov and Spokoiny, (2017) and Li et al., (2020) introduced gradient estimators based on a Gaussian sampler. The estimator is

grad^f|p(v1,v2,,vm)=12mμi=1m(f(Expp(μvi))f(Expp(μvi)))vi,\displaystyle\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})=\frac{1}{2m\mu}\sum_{i=1}^{m}\left(f\left({\mathrm{Exp}}_{p}(\mu v_{i})\right)-f\left({\mathrm{Exp}}_{p}(-\mu v_{i})\right)\right)v_{i}, (32)

where vii.i.d.𝒩(0,I)v_{i}\overset{i.i.d.}{\sim}\mathcal{N}\left(0,I\right) are standard Gaussian vectors, and μ>0\mu>0 is a parameter controlling the width of the estimator. We use this Gaussian sampler as a baseline for empirical evaluations. Li et al., (2020) studied the properties of this estimator over manifolds embedded in Euclidean space; their result is restated in Proposition 4.

Proposition 4 (Li et al., (2020)).

Let \mathcal{M} be a manifold embedded in a Euclidean space. If f:f:\mathcal{M}\rightarrow\mathbb{R} is geodesically L1L_{1}-smooth over \mathcal{M}, then it holds that

𝔼vii.i.d.𝒩(0,I)[grad^f|p(v1,v2,,vm)]gradf|pL1(n+3)3/2μ2\displaystyle\left\|\mathbb{E}_{v_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,I)}\left[\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})\right]-\mathrm{grad}f\big{|}_{p}\right\|\leq\frac{L_{1}\left(n+3\right)^{3/2}\mu}{2} (33)

where grad^f|p(v1,v2,,vm)\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m}) is the estimator in (32).

To fairly compare our method (17) and this previous method (32), one should set α=n\alpha=\sqrt{n} for our method, as pointed out in Remark 3.

Remark 3.

If one picks α=n\alpha=\sqrt{n} in Theorem 4, one can obtain the error bound L1n3/2μ2\frac{L_{1}n^{3/2}\mu}{2} for our method, and the error bound for the previous method is L1(n+3)3/2μ2\frac{L_{1}(n+3)^{3/2}\mu}{2} (Proposition 4). Why should we pick α=n\alpha=\sqrt{n} in Theorem 4? This is because when α=n\alpha=\sqrt{n}, we have 𝔼v𝕊(α)[v2]=𝔼v𝒩(0,I)[v2]=n\mathbb{E}_{v\sim\mathbb{S}(\alpha)}[\|v\|^{2}]=\mathbb{E}_{v\sim\mathcal{N}(0,I)}[\|v\|^{2}]=n. In words, when α=n\alpha=\sqrt{n}, the random vectors used in our method (17) and the random vector used in the previous method (32) have the same squared norm in expectation. This leads to a fair comparison for the bias, while holding the second moment of the random vector exactly the same.
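
For comparison, the Gaussian baseline (32) differs from the spherical estimator only in the sampling distribution and the normalization. The following sketch is again ours, under the same assumption of a user-supplied exp_map; with \alpha=\sqrt{n} in the spherical version, both samplers satisfy \mathbb{E}[\|v\|^{2}]=n, matching the comparison in Remark 3.

import numpy as np

def gaussian_grad_estimate(f, exp_map, p, n, m, mu, rng=None):
    # Gaussian two-point estimator (32): v_i ~ N(0, I), so E||v_i||^2 = n,
    # which matches the spherical estimator with alpha = sqrt(n) (cf. Remark 3).
    rng = np.random.default_rng() if rng is None else rng
    estimate = np.zeros(n)
    for _ in range(m):
        v = rng.standard_normal(n)
        diff = f(exp_map(p, mu * v)) - f(exp_map(p, -mu * v))
        estimate += diff * v / (2.0 * mu)
    return estimate / m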

5 Empirical Studies

In this section, we empirically study the spherical estimator in (18), in comparison with the Gaussian estimator in (32). The two methods (18) and (32) are compared over three manifolds: 1. the Euclidean space n\mathbb{R}^{n}, 2. the unit sphere 𝕊n\mathbb{S}^{n}, and 3. the hypersurface (x,h(x))(x,h(x)), xnx\in\mathbb{R}^{n}, where h(x)=12i=1n/2xi212i=n/2+1nxi2h(x)=\frac{1}{2}\sum_{i=1}^{n/2}x_{i}^{2}-\frac{1}{2}\sum_{i=n/2+1}^{n}x_{i}^{2}. The evaluation is carried out at three different points, one for each manifold. The three manifolds, and corresponding points for evaluation, are listed in Table 1. Two test functions are used: a. the linear function f(x)=i=1nxif(x)=\sum_{i=1}^{n}x_{i}, and b. the function f(x)=i=1nsin(xi1)f(x)=\sum_{i=1}^{n}\sin\left(x_{i}-1\right). The two test functions are listed in Table 2.

Label Manifold Point pp Exponential map Expp(v){\mathrm{Exp}}_{p}(v)
1 n\mathbb{R}^{n} 0 vv
2 𝕊n\mathbb{S}^{n} (1,0,,0)(1,0,\cdots,0) pcos(v)+vvsin(v)p\cos(\|v\|)+\frac{v}{\|v\|}\sin\left(\|v\|\right)
3 (x,h(x))(x,h(x)), xnx\in\mathbb{R}^{n} 0 (v,(1+i=1nvi42+sinh1(i=1nvi4)i=1nvi4)h(v))\left(v,\left(\frac{\sqrt{1+\sum_{i=1}^{n}v_{i}^{4}}}{2}+\frac{\sinh^{-1}\left(\sqrt{\sum_{i=1}^{n}v_{i}^{4}}\right)}{\sqrt{\sum_{i=1}^{n}v_{i}^{4}}}\right)h(v)\right)
Table 1: Manifolds used for experiments. All manifolds are of dimension nn. The coordinate system of the ambient space is used in all settings.
Label Function
a f(x)=i=1nxif(x)=\sum_{i=1}^{n}x_{i}
b f(x)=i=1nsin(xi1)f(x)=\sum_{i=1}^{n}\sin(x_{i}-1)
Table 2: Functions used for experiments. The coordinate system of the ambient Euclidean space is used for the computations.
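
To illustrate how the queries f({\mathrm{Exp}}_{p}(\mu v)) are formed in these experiments, the exponential maps of manifolds 1 and 2 in Table 1 can be written out directly in the ambient coordinates. The sketch below is for illustration only; manifold 3 is omitted, its exponential map being the more involved expression recorded in Table 1.

import numpy as np

def exp_map_euclidean(p, v):
    # Manifold 1: R^n with Exp_p(v) = p + v (Table 1 uses p = 0, so Exp_0(v) = v).
    return p + v

def exp_map_sphere(p, v):
    # Manifold 2: unit sphere, v assumed tangent at p (i.e. <p, v> = 0), and
    # Exp_p(v) = p * cos(||v||) + (v / ||v||) * sin(||v||).
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p
    return p * np.cos(norm_v) + (v / norm_v) * np.sin(norm_v)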

The three manifolds in Table 1 and the two functions in Table 2 together generate 6 settings. We use 1a, 1b, 2a, 2b, 3a, 3b to label these 6 settings, where 1a refers to the setting over manifold 1 using function a, and so on. The experimental results are summarized in Figures 1-12. In all figures, “G” on the xx-axis stands for the previous method using the Gaussian estimator (32), and “S” on the xx-axis stands for our method using the spherical estimator (18). To fairly compare (32) and (18), we use α=n\alpha=\sqrt{n} for all spherical estimators so that for both the spherical estimator and the Gaussian estimator, one has 𝔼[v2]=n\mathbb{E}\left[\|v\|^{2}\right]=n. The figure captions specify the settings used. For example, setting 1a is for function aa in Table 2 over manifold 1 in Table 1. Below each subfigure, nn is the dimension of the manifold and μ\mu is the parameter used in the estimators (both G and S).

Since 𝔼[v2]\mathbb{E}\left[\|v\|^{2}\right] is set to the same value for both G and S, we use

grad^f|p(v1,v2,,vm)gradf|p\displaystyle\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\| (34)

to measure the error (bias) of the estimator, and

1mi=1mgrad^f|p(vi)4\displaystyle\frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4} (35)

to measure the robustness of the estimator. For both the estimation error and the robustness, smaller values mean better performance. For both “G” and “S” in all figures, each violin plot summarizes 100 values of (34) or (35), where we use m=100m=100 for all of them. Take G and S in Figure 1 as an example. We compute (35) 100 times for G and 100 times for S, all with m=100m=100. Then we use violin plots to summarize these 100 values of (35) for G and the 100 values of (35) for S.
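
Both summary statistics are computed directly from the per-sample estimates. The following sketch (ours, assuming the m per-sample estimates are stacked into an array and that the true gradient is available in the ambient coordinates) returns one value of (34) and one value of (35) per call, i.e., one point of a violin plot.

import numpy as np

def evaluate_once(per_sample_estimates, grad_true):
    # per_sample_estimates: array of shape (m, n), row i holding the estimate built from v_i.
    averaged = per_sample_estimates.mean(axis=0)              # the m-sample estimator
    error = np.linalg.norm(averaged - grad_true)              # estimation error, Eq. (34)
    robustness = np.mean(np.linalg.norm(per_sample_estimates, axis=1) ** 4)  # Eq. (35)
    return error, robustness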

Figures 1-6 plot the robustness metric (35) for the two estimators G and S, and Figures 7-12 show the corresponding estimation errors (34). As shown in the figures, in all of the settings, the spherical estimator S is much more robust than the Gaussian estimator G, while achieving the same level of estimation error.

Figure 1: Plot of \frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4} for Setting 1a.
Figure 2: Plot of \frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4} for Setting 1b.
Figure 3: Plot of \frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4} for Setting 2a.
Figure 4: Plot of \frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4} for Setting 2b.
Figure 5: Plot of \frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4} for Setting 3a.
Figure 6: Plot of \frac{1}{m}\sum_{i=1}^{m}\left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{i})\right\|^{4} for Setting 3b.
Figure 7: Plot of \left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\| for Setting 1a.
Figure 8: Plot of \left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\| for Setting 1b.
Figure 9: Plot of \left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\| for Setting 2a.
Figure 10: Plot of \left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\| for Setting 2b.
Figure 11: Plot of \left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\| for Setting 3a.
Figure 12: Plot of \left\|\widehat{\mathrm{grad}}f\big{|}_{p}(v_{1},v_{2},\cdots,v_{m})-\mathrm{grad}f\big{|}_{p}\right\| for Setting 3b.

6 Application to Online Convex Optimization

In online learning with bandit feedback, at each time tt, the learner chooses points xtx_{t} and ztz_{t}, and the environment picks a convex loss function ftf_{t}. The learner suffers and observes the loss ft(xt)f_{t}(x_{t}), and additionally observes the function value ft(zt)f_{t}(z_{t}). The learner then picks xt+1x_{t+1} based on historical observations and the learning process moves on. Online learning agents, unlike their offline learning (or batch learning) counterparts, do not have knowledge of ftf_{t} when they choose xtx_{t}. Such models capture real-world scenarios such as learning with streaming data. In the machine learning community, such problem settings are often referred to as the Online Convex Optimization problem with bandit-type feedback.

6.1 Online Convex Optimization over Hadamard Manifolds: an Algorithm

We consider an online convex optimization problem (with bandit-type feedback) over a negatively curved manifold. Let there be a Hadamard manifold \mathcal{M} and a closed geodesic ball 𝒟\mathcal{D}\subseteq\mathcal{M}. At each time tt, the agent and the environment execute the above learning protocol, with ftf_{t} defined over \mathcal{M} and xt𝒟x_{t}\in\mathcal{D}. The goal of the agent is to minimize the TT-step hindsight regret: Reg(T):=t=1Tft(xt)ft(x)Reg(T):=\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(x^{*}), where xargminx𝒟t=1Tft(x)x^{*}\in\arg\min_{x\in\mathcal{D}}\sum_{t=1}^{T}f_{t}(x).

With the approximation guarantee in Theorem 4, one can solve this online learning problem. At each tt, we can approximate the gradient at xtx_{t} using the function value ft(zt)f_{t}\left(z_{t}\right), and perform an online gradient descent step. In particular, we can generalize the online convex optimization algorithm (with bandit-type feedback) (Flaxman et al.,, 2005) to Hadamard manifolds.

By Theorem 4, we have

gradf|p𝔼v𝕊p[nμ(f(Expp(μv))f(p))v]L1nμ2,p\displaystyle\left\|\mathrm{grad}f\big{|}_{p}-\mathbb{E}_{v\sim\mathbb{S}_{p}}\left[\frac{n}{\mu}\left(f({\mathrm{Exp}}_{p}(\mu v))-f(p)\right)v\right]\right\|\leq\frac{L_{1}n\mu}{2},\qquad\forall p\in\mathcal{M} (36)

Thus the gradient of ftf_{t} at xtx_{t} can be estimated by

gradft|xt𝔼vt𝕊xt[nμ(ft(Expxt(μvt))ft(xt))vt],\displaystyle\mathrm{grad}f_{t}\big{|}_{x_{t}}\approx\mathbb{E}_{v_{t}\sim\mathbb{S}_{x_{t}}}\left[\frac{n}{\mu}\left(f_{t}\left({\mathrm{Exp}}_{x_{t}}\left(\mu v_{t}\right)\right)-f_{t}(x_{t})\right)v_{t}\right],

where vtv_{t} is uniformly sampled from 𝕊xt\mathbb{S}_{x_{t}}. This estimator generalizes the two-evaluation estimator in the Euclidean case (Duchi et al.,, 2015). As shown in (36), the bias of this estimator is also bounded by L1nμ2\frac{L_{1}n\mu}{2}.

With this gradient estimator, one can perform estimated gradient descent by

xt+12Expxt(ηnμ(ft(Expxt(μvt))ft(xt))vt),\displaystyle x_{t+\frac{1}{2}}\leftarrow{\mathrm{Exp}}_{x_{t}}\left(-\eta\frac{n}{\mu}\left(f_{t}\left({\mathrm{Exp}}_{x_{t}}\left(\mu v_{t}\right)\right)-f_{t}(x_{t})\right)v_{t}\right),

where vtv_{t} is uniformly sampled from 𝕊xt\mathbb{S}_{x_{t}}, and η\eta is the learning rate.

To ensure that the algorithm stays in the geodesic ball 𝒟\mathcal{D}, we need to project xt+12x_{t+\frac{1}{2}} back to a subset of 𝒟\mathcal{D}, which is

(1β)𝒟:=B(c(𝒟),(1β)r(𝒟)), for some β(0,1),\displaystyle(1-\beta)\mathcal{D}:=B(c(\mathcal{D}),(1-\beta)r(\mathcal{D})),\qquad\text{ for some }\beta\in(0,1),

where c(𝒟)c(\mathcal{D}) and r(𝒟)r(\mathcal{D}) are the center and radius of the geodesic ball 𝒟\mathcal{D}. Since the set (1β)𝒟(1-\beta)\mathcal{D} is a geodesic ball in a Hadamard manifold, the projection onto it is well-defined. Thus we can project xt+12x_{t+\frac{1}{2}} onto (1β)𝒟(1-\beta)\mathcal{D} to obtain xt+1x_{t+1}; that is, we let xt+1=Proj(1β)𝒟(xt+12)x_{t+1}=\mathrm{Proj}_{(1-\beta)\mathcal{D}}\left(x_{t+\frac{1}{2}}\right). Now one iteration finishes and the learning process moves on to the next round. This learning algorithm is summarized in Algorithm 1.

Algorithm 1
Input: Algorithm parameter: μ\mu, η\eta, β\beta; time horizon: TT;
Initialization: Arbitrarily pick x1𝒟x_{1}\in\mathcal{D}.
for t=1,2,,Tt=1,2,\dots,T do
   Environment picks loss function ftf_{t}. The agent suffers and observes loss ft(xt)f_{t}(x_{t}).
   Sample vt𝕊xtv_{t}\sim\mathbb{S}_{x_{t}}.
   Observe function value ft(zt)f_{t}(z_{t}), where zt=Expxt(μvt)z_{t}={\mathrm{Exp}}_{x_{t}}(\mu v_{t}), and compute wt=nμ(ft(zt)ft(xt))vtw_{t}=\frac{n}{\mu}\left(f_{t}(z_{t})-f_{t}(x_{t})\right)v_{t}. /* wtw_{t} is the gradient estimate. */
   Perform gradient update xt+12=Expxt(ηwt)x_{t+\frac{1}{2}}={\mathrm{Exp}}_{x_{t}}(-\eta w_{t}).
   Project the point to set (1β)𝒟(1-\beta)\mathcal{D}, xt+1=Proj(1β)𝒟(xt+12)x_{t+1}=\mathrm{Proj}_{(1-\beta)\mathcal{D}}\left(x_{t+\frac{1}{2}}\right). /* In practice, the projection can be approximated by a point in (1β)𝒟(1-\beta)\mathcal{D} that is close to xt+12x_{t+\frac{1}{2}}. */
end for
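
A compact Python rendering of Algorithm 1 might look as follows. This is only a sketch under simplifying assumptions: the callables exp_map and log_map (the exponential map and its inverse) are supplied for the manifold at hand, tangent spaces are identified with \mathbb{R}^{n}, and the projection onto the shrunk ball (1-\beta)\mathcal{D} is realized by moving along the geodesic from the center towards the iterate, which gives the metric projection onto a geodesic ball.

import numpy as np

def proj_geodesic_ball(x, center, radius, exp_map, log_map):
    # Metric projection onto the geodesic ball B(center, radius): if x lies outside,
    # move along the geodesic from center towards x until the distance equals radius.
    u = log_map(center, x)                  # u = Exp_center^{-1}(x), with ||u|| = d(center, x)
    dist = np.linalg.norm(u)
    return x if dist <= radius else exp_map(center, (radius / dist) * u)

def bandit_online_gd(losses, x1, n, mu, eta, beta, center, radius,
                     exp_map, log_map, rng=None):
    # Sketch of Algorithm 1: one-point bandit feedback with spherical gradient estimates.
    rng = np.random.default_rng() if rng is None else rng
    x = x1
    iterates = [x1]
    for f_t in losses:                                   # environment reveals f_t at round t
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)                           # v_t uniform on the unit sphere S_{x_t}
        z = exp_map(x, mu * v)                           # query point z_t
        w = (n / mu) * (f_t(z) - f_t(x)) * v             # gradient estimate w_t
        x_half = exp_map(x, -eta * w)                    # gradient step x_{t+1/2}
        x = proj_geodesic_ball(x_half, center, (1.0 - beta) * radius, exp_map, log_map)
        iterates.append(x)
    return iterates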

6.2 Online Convex Optimization over Hadamard Manifolds: Analysis

Theorem 6.

If, for all t[T]t\in[T], the loss function ftf_{t} is geodesically L1L_{1}-smooth, then the regret of Algorithm 1 satisfies:

𝔼[Reg(T)]D22η+ζ(κ,D)ηn2G2T2μ2+(DL1nμ2+2βD)T,\displaystyle\mathbb{E}\left[Reg(T)\right]\leq\frac{D^{2}}{2\eta}+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}T}{2\mu^{2}}+\left(\frac{DL_{1}n\mu}{2}+2\beta D\right)T, (37)

where DD is the diameter of 𝒟\mathcal{D}, and G:=maxt[T]maxp𝒟ft(p)G:=\max_{t\in[T]}\max_{p\in\mathcal{D}}f_{t}(p).

In particular, when η=Θ(T3/4)\eta=\Theta\left(T^{-3/4}\right), μ=Θ(T1/4)\mu=\Theta\left(T^{-1/4}\right) and β=Θ(T1/4)\beta=\Theta\left(T^{-1/4}\right), the regret satisfies 𝔼[Reg(T)]𝒪(T3/4)\mathbb{E}\left[Reg(T)\right]\leq\mathcal{O}\left(T^{3/4}\right). This sublinear regret rate implies that, when f1,f2,,fTf_{1},f_{2},\cdots,f_{T} are identical, the algorithm converges to the optimal point at a rate of order 𝒪(T1/4)\mathcal{O}\left(T^{-1/4}\right). We now prove Theorem 6, starting with an existing result of Zhang and Sra, (2016).

Lemma 2 (Corollary 8, Zhang and Sra, (2016)).

Let \mathcal{M} be a Hadamard manifold, with sectional curvature 𝒦p(u,v)κ>\mathcal{K}_{p}(u,v)\geq\kappa>-\infty for all pp\in\mathcal{M} and u,vTpu,v\in T_{p}\mathcal{M}. Then for any x,xsx,x_{s}, and xs+1=Expxs(ηv)x_{s+1}={\mathrm{Exp}}_{x_{s}}(-\eta v) for some vTxsv\in T_{x_{s}}\mathcal{M}, it holds that

v,Expxs1(x)xs\displaystyle\left<-v,{\mathrm{Exp}}_{x_{s}}^{-1}(x)\right>_{x_{s}}\leq 12η(d2(xs,x)d2(xs+1,x))+ζ(κ,d(xs,x))η2v2,\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{s},x)-d^{2}(x_{s+1},x)\right)+\frac{\zeta(\kappa,d(x_{s},x))\eta}{2}\|v\|^{2},

where ζ(κ,D)=|κ|Dtanh(|κ|D)\zeta(\kappa,D)=\frac{\sqrt{|\kappa|}D}{\tanh(\sqrt{|\kappa|}D)}.
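
For intuition, \zeta(\kappa,D)\geq 1 always, and \zeta(\kappa,D)\rightarrow 1 as \sqrt{|\kappa|}D\rightarrow 0, recovering the Euclidean case; for example, \zeta(-1,2)=2/\tanh(2)\approx 2.07, so a curvature lower bound of -1 over a ball of diameter 2 roughly doubles the step-size-dependent term in the bound above.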

Proof of Theorem 6.

By convexity of ftf_{t}, we have

ft(xt)ft(x)gradft|xt,Expxt1(x).f_{t}(x_{t})-f_{t}(x^{*})\leq\left<-\mathrm{grad}f_{t}\big{|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>.

Together with Lemma 2, we get

ft(xt)ft(x)\displaystyle f_{t}(x_{t})-f_{t}(x^{*})
\displaystyle\leq gradft|xt,Expxt1(x)\displaystyle\left<-\mathrm{grad}f_{t}\big{|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>
\displaystyle\leq wt,Expxt1(x)+wtgradft|xt,Expxt1(x)\displaystyle\left<-w_{t},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>+\left<w_{t}-\mathrm{grad}f_{t}\big{|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>
\displaystyle\leq 12η(d2(xt,x)d2(xt+12,x))+ζ(κ,D)η2wt2\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{t},x^{*})-d^{2}(x_{t+\frac{1}{2}},x^{*})\right)+\frac{\zeta(\kappa,D)\eta}{2}\|w_{t}\|^{2}
+wtgradft|xt,Expxt1(x),\displaystyle+\left<w_{t}-\mathrm{grad}f_{t}\big{|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>,

where the last line uses Lemma 2. Taking expectation (over vt𝕊xtv_{t}\sim\mathbb{S}_{x_{t}}) on both sides gives

𝔼[ft(xt)]𝔼[ft(x)]\displaystyle\mathbb{E}\left[f_{t}(x_{t})\right]-\mathbb{E}\left[f_{t}(x^{*})\right]
\displaystyle\leq 12η𝔼[d2(xt,x)d2(xt+12,x)]+𝔼[ζ(κ,D)η2wt2]\displaystyle\frac{1}{2\eta}\mathbb{E}\left[d^{2}(x_{t},x^{*})-d^{2}(x_{t+\frac{1}{2}},x^{*})\right]+\mathbb{E}\left[\frac{\zeta(\kappa,D)\eta}{2}\|w_{t}\|^{2}\right]
+𝔼[wt]gradft|xt,Expxt1(x)\displaystyle+\left<\mathbb{E}\left[w_{t}\right]-\mathrm{grad}f_{t}\big{|}_{x_{t}},{\mathrm{Exp}}_{x_{t}}^{-1}(x^{*})\right>
\displaystyle\leq 𝔼[d2(xt,x)d2(xt+12,x)]2η+ζ(κ,D)ηn2G22μ2+DL1nμ2,\displaystyle\frac{\mathbb{E}\left[d^{2}(x_{t},x^{*})-d^{2}(x_{t+\frac{1}{2}},x^{*})\right]}{2\eta}+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}}{2\mu^{2}}+\frac{DL_{1}n\mu}{2}, (38)

where the last line uses Theorem 4, and that wt2n2G2μ2\|w_{t}\|^{2}\leq\frac{n^{2}G^{2}}{\mu^{2}} (since Gft(p)G\geq f_{t}(p) for all p𝒟p\in\mathcal{D} and t[T]t\in[T]). Henceforth, we will omit the expectation operator 𝔼\mathbb{E} for simplicity.

By the definition of (1β)𝒟(1-\beta)\mathcal{D}, we have d(xt+12,xt+1)βd\left(x_{t+\frac{1}{2}},x_{t+1}\right)\leq\beta. Plugging this back into (38), we get, in expectation,

ft(xt)ft(x)\displaystyle f_{t}(x_{t})-f_{t}(x^{*})\leq 12η(d2(xt,x)d2(xt+1,x))\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{t},x^{*})-d^{2}(x_{t+1},x^{*})\right)
+ζ(κ,D)ηn2G22μ2+DL1nμ2+d2(xt+1,x)d2(xt+12,x)\displaystyle+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}}{2\mu^{2}}+\frac{DL_{1}n\mu}{2}+d^{2}(x_{t+1},x^{*})-d^{2}(x_{t+\frac{1}{2}},x^{*})
\displaystyle\leq 12η(d2(xt,x)d2(xt+1,x))+ζ(κ,D)ηn2G22μ2+DL1nμ2+2βD.\displaystyle\frac{1}{2\eta}\left(d^{2}(x_{t},x^{*})-d^{2}(x_{t+1},x^{*})\right)+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}}{2\mu^{2}}+\frac{DL_{1}n\mu}{2}+2\beta D.

Summing over tt gives, in expectation,

t=1Tft(xt)ft(x)\displaystyle\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(x^{*})
\displaystyle\leq 12ηt=1T(d2(xt,x)d2(xt+1,x))+ζ(κ,D)ηn2G2T2μ2+(DL1nμ2+2βD)T\displaystyle\frac{1}{2\eta}\sum_{t=1}^{T}\left(d^{2}(x_{t},x^{*})-d^{2}(x_{t+1},x^{*})\right)+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}T}{2\mu^{2}}+\left(\frac{DL_{1}n\mu}{2}+2\beta D\right)T
\displaystyle\leq D22η+ζ(κ,D)ηn2G2T2μ2+(DL1nμ2+2βD)T,\displaystyle\frac{D^{2}}{2\eta}+\frac{\zeta(\kappa,D)\eta n^{2}G^{2}T}{2\mu^{2}}+\left(\frac{DL_{1}n\mu}{2}+2\beta D\right)T,

where the last line telescopes out the summation term.
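
As a sanity check on the parameter choices stated after Theorem 6, substituting \eta=\Theta(T^{-3/4}), \mu=\Theta(T^{-1/4}) and \beta=\Theta(T^{-1/4}) into (37), and absorbing D, n, G, L_{1} and \zeta(\kappa,D) into the constants, balances the three terms:

\frac{D^{2}}{2\eta}=O\left(T^{3/4}\right),\qquad\frac{\zeta(\kappa,D)\eta n^{2}G^{2}T}{2\mu^{2}}=O\left(T^{-3/4}\cdot T^{1/2}\cdot T\right)=O\left(T^{3/4}\right),\qquad\left(\frac{DL_{1}n\mu}{2}+2\beta D\right)T=O\left(T^{-1/4}\cdot T\right)=O\left(T^{3/4}\right),

so that \mathbb{E}\left[Reg(T)\right]\leq\mathcal{O}\left(T^{3/4}\right), as claimed.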

6.3 Online Convex Optimization over Hadamard Manifolds: Experiments

In this section, we study the effectiveness of our algorithm on the benchmark task of finding the Riemannian center of mass of several positive definite matrices (Bini and Iannazzo,, 2013). Finding the Riemannian center of mass of a set of NN symmetric positive definite matrices {Ai}i=1N\{A_{i}\}_{i=1}^{N} amounts to minimizing

f(X,{Ai}i=1N)=i=1Nlog(X1/2AiX1/2)F2,\displaystyle f(X,\{A_{i}\}_{i=1}^{N})=\sum_{i=1}^{N}\|\log(X^{-1/2}A_{i}X^{-1/2})\|_{F}^{2}, (39)

where XX is symmetric and positive definite.

This objective is non-convex in the Euclidean sense, but is geodesically convex on the manifold of positive definite matrices. The gradient update step for objective (39) over the manifold of symmetric and positive definite matrices is

Xt+1=Xt1/2exp(ηgradft|xt)Xt1/2.\displaystyle X_{t+1}=X_{t}^{1/2}\exp\left(-\eta\;\mathrm{grad}f_{t}\big{|}_{x_{t}}\right)X_{t}^{1/2}.

For our experiments, we can only estimate the gradient gradft|xt\mathrm{grad}f_{t}\big{|}_{x_{t}} by querying function values of ftf_{t}. We randomly sample several positive definite matrices AiA_{i}, and apply Algorithm 1 to search for XX to minimize (39). The results are summarized in Figure 13.
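
The objective (39) and the update step above can be expressed with standard matrix functions. The sketch below is an illustration only (it is not the code behind Figure 13); it assumes SciPy's matrix functions and symmetric positive definite inputs, and the name grad stands for whatever gradient estimate is available, e.g., the estimate w_{t} built from function evaluations of (39) as in Algorithm 1.

import numpy as np
from scipy.linalg import expm, inv, logm, sqrtm

def karcher_objective(X, As):
    # Objective (39): sum over i of || log(X^{-1/2} A_i X^{-1/2}) ||_F^2.
    X_inv_sqrt = np.real(inv(sqrtm(X)))   # discard negligible imaginary round-off from sqrtm
    return sum(np.linalg.norm(logm(X_inv_sqrt @ A @ X_inv_sqrt), 'fro') ** 2 for A in As)

def spd_gradient_step(X, grad, eta):
    # Riemannian update X_{t+1} = X^{1/2} exp(-eta * grad) X^{1/2},
    # where grad is a symmetric matrix representing the gradient estimate at X.
    X_sqrt = np.real(sqrtm(X))
    return X_sqrt @ expm(-eta * grad) @ X_sqrt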

(a) Dimension 5×55\times 5
(b) Dimension 10×1010\times 10
Figure 13: Results of Algorithm 1 on the task of finding the Riemannian center of mass (39). Each line plots the average regret of 5 runs and the shaded area around the line indicates one standard deviation above and below the average. The left (resp. right) subfigure plots the result over the manifold of symmetric positive definite matrices of dimension 5×55\times 5 (resp. 10×1010\times 10). The NN in the legends is the number of matrices used in Riemannian center of mass objective (39).

7 Conclusion

In this paper, we study the GW convolution/approximation and gradient estimation over Riemannian manifolds. Our new decomposition trick for the GW approximation provides a tool for computing properties of the GW convolution. Using this decomposition trick, we derive a formula for how the curvature of the manifold affects the curvature of the function via the GW convolution. Empowered by the same decomposition trick, we introduce a new gradient estimation method over Riemannian manifolds. Over an nn-dimensional Riemannian manifold, our method improves the best previously known gradient estimation error bound (Li et al.,, 2020). We also apply our gradient estimation method to bandit convex optimization problems, and generalize previous results from Euclidean spaces to Hadamard manifolds.

References

  • Absil et al., (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
  • Antonakopoulos et al., (2019) Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. (2019). Online and stochastic optimization beyond Lipschitz continuity: A Riemannian approach. In International Conference on Learning Representations.
  • Azagra, (2013) Azagra, D. (2013). Global and fine approximation of convex functions. Proceedings of the London Mathematical Society, 107(4):799–824.
  • Azagra and Ferrera, (2006) Azagra, D. and Ferrera, J. (2006). Inf-convolution and regularization of convex functions on Riemannian manifolds of nonpositive curvature. Revista Matemática Complutense, 19(2):323–345.
  • Bini and Iannazzo, (2013) Bini, D. A. and Iannazzo, B. (2013). Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–1710.
  • Bubeck et al., (2015) Bubeck, S., Dekel, O., Koren, T., and Peres, Y. (2015). Bandit convex optimization: \sqrt{T} regret in one dimension. In Conference on Learning Theory, pages 266–278. PMLR.
  • Bubeck and Eldan, (2016) Bubeck, S. and Eldan, R. (2016). Multi-scale exploration of convex functions and bandit convex optimization. In Conference on Learning Theory, pages 583–589. PMLR.
  • Chen et al., (2019) Chen, L., Zhang, M., and Karbasi, A. (2019). Projection-free bandit convex optimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2047–2056. PMLR.
  • Duchi et al., (2015) Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806.
  • Flaxman et al., (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394.
  • Garber and Kretzu, (2020) Garber, D. and Kretzu, B. (2020). Improved regret bounds for projection-free bandit convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 2196–2206. PMLR.
  • Gavrilov, (2007) Gavrilov, A. V. (2007). The double exponential map and covariant derivation. Siberian Mathematical Journal, 48(1):56–61.
  • Greene and Wu, (1979) Greene, R. E. and Wu, H. (1979). C^{\infty}-approximations of convex, subharmonic, and plurisubharmonic functions. In Annales Scientifiques de l’École Normale Supérieure, volume 12, pages 47–84.
  • Greene and Wu, (1973) Greene, R. E. and Wu, H.-H. (1973). On the subharmonicity and plurisubharmonicity of geodesically convex functions. Indiana University Mathematics Journal, 22(7):641–653.
  • Greene and Wu, (1976) Greene, R. E. and Wu, H.-H. (1976). C^{\infty}-convex functions and manifolds of positive curvature. Acta Mathematica, 137(1):209–245.
  • Hazan and Kale, (2012) Hazan, E. and Kale, S. (2012). Projection-free online learning. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1843–1850.
  • Hazan and Levy, (2014) Hazan, E. and Levy, K. Y. (2014). Bandit convex optimization: Towards tight bounds. In NIPS, pages 784–792.
  • Hildebrand, (1987) Hildebrand, F. B. (1987). Introduction to numerical analysis. Courier Corporation.
  • Jacot et al., (2021) Jacot, A., Gabriel, F., and Hongler, C. (2021). Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 6–6.
  • Kleinberg, (2004) Kleinberg, R. (2004). Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems, 17:697–704.
  • Kulis et al., (2012) Kulis, B. et al. (2012). Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364.
  • Lee, (2006) Lee, J. M. (2006). Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media.
  • Li and Dunson, (2019) Li, D. and Dunson, D. B. (2019). Geodesic distance estimation with spherelets. arXiv preprint arXiv:1907.00296.
  • Li and Dunson, (2020) Li, D. and Dunson, D. B. (2020). Classification via local manifold approximation. Biometrika, 107(4):1013–1020.
  • Li et al., (2017) Li, D., Mukhopadhyay, M., and Dunson, D. B. (2017). Efficient manifold and subspace approximations with spherelets. arXiv preprint arXiv:1706.08263.
  • Li et al., (2020) Li, J., Balasubramanian, K., and Ma, S. (2020). Stochastic zeroth-order Riemannian derivative estimation and optimization. arXiv preprint arXiv:2003.11238.
  • Lin and Zha, (2008) Lin, T. and Zha, H. (2008). Riemannian manifold learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):796–809.
  • Lyness and Moler, (1967) Lyness, J. N. and Moler, C. B. (1967). Numerical differentiation of analytic functions. SIAM Journal on Numerical Analysis, 4(2):202–210.
  • Muandet et al., (2017) Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–144.
  • Nesterov and Spokoiny, (2017) Nesterov, Y. and Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566.
  • Parkkonen and Paulin, (2012) Parkkonen, J. and Paulin, F. (2012). On strictly convex subsets in negatively curved manifolds. Journal of Geometric Analysis, 22(3):621–632.
  • Petersen, (2006) Petersen, P. (2006). Riemannian geometry, volume 171. Springer.
  • Schoen and Yau, (1994) Schoen, R. M. and Yau, S.-T. (1994). Lectures on differential geometry, volume 2. International press Cambridge, MA.
  • Schölkopf et al., (2002) Schölkopf, B., Smola, A. J., Bach, F., et al. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.
  • Shalev-Shwartz, (2011) Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.
  • Shamir, (2013) Shamir, O. (2013). On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24. PMLR.
  • Zhang et al., (2016) Zhang, H., Reddi, S. J., and Sra, S. (2016). Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4599–4607.
  • Zhang and Sra, (2016) Zhang, H. and Sra, S. (2016). First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638. PMLR.
  • Zhou et al., (2020) Zhou, Y., Sanches Portella, V., Schmidt, M., and Harvey, N. (2020). Regret bounds without Lipschitz continuity: Online learning with relative-Lipschitz losses. Advances in Neural Information Processing Systems, 33.
  • Zhu et al., (2018) Zhu, B., Liu, J. Z., Cauley, S. F., Rosen, B. R., and Rosen, M. S. (2018). Image reconstruction by domain-transform manifold learning. Nature, 555(7697):487–492.