
On Sharp Stochastic Zeroth Order Hessian Estimators over Riemannian Manifolds

Tianyu Wang ([email protected])
Abstract

We study Hessian estimators for functions defined over an $n$-dimensional complete analytic Riemannian manifold. We introduce new stochastic zeroth-order Hessian estimators using $O(1)$ function evaluations. We show that, for an analytic real-valued function $f$, our estimator achieves a bias bound of order $O(\gamma\delta^{2})$, where $\gamma$ depends on both the Levi-Civita connection and the function $f$, and $\delta$ is the finite difference step size. To the best of our knowledge, our results provide the first bias bound for Hessian estimators that explicitly depends on the geometry of the underlying Riemannian manifold. We also study downstream computations based on our Hessian estimators. The advantage of our method is evidenced by empirical evaluations.

1 Introduction

Hessian computation is one of the central tasks in optimization, machine learning, and related fields. Understanding the landscape of the objective function is in many cases the first step towards solving a mathematical programming problem, and the Hessian is one of the key quantities that depict the function landscape. Often in real-world scenarios, the objective function is a black box, and its Hessian is not directly computable. In these cases, zeroth-order Hessian computation techniques are needed if one wants to understand the function landscape via its Hessian.

To this end, we introduce new zeroth-order methods for estimating a function's Hessian at any given point over an $n$-dimensional complete analytic Riemannian manifold $(\mathcal{M},g)$. For $p\in\mathcal{M}$ and an analytic real-valued function $f$ defined over $\mathcal{M}$, the Hessian estimator of $f$ at $p$ is

\widehat{\mathrm{H}}f(p;v,w;\delta):=\frac{n^{2}}{\delta^{2}}\,f(\mathrm{Exp}_{p}(\delta v+\delta w))\,v\otimes w, \qquad (1)

where $\mathrm{Exp}_{p}$ is the exponential map, $v,w$ are independently and uniformly sampled from the unit sphere in $T_{p}\mathcal{M}$, $v\otimes w$ denotes the tensor product of $v$ and $w$ ($v,w\in T_{p}\mathcal{M}$), and $\delta$ is the finite-difference step size.
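For concreteness, here is a minimal numerical sketch of the estimator (1), specialized to the Euclidean case where $\mathrm{Exp}_{p}(u)=p+u$ and $T_{p}\mathcal{M}\cong\mathbb{R}^{n}$ (the Python names `exp_map` and `hess_estimate` are ours and purely illustrative; on a general manifold one would substitute the exponential map of $(\mathcal{M},g)$ and sample from the unit sphere of $T_{p}\mathcal{M}$):

import numpy as np

def exp_map(p, u):
    """Euclidean exponential map: Exp_p(u) = p + u."""
    return p + u

def hess_estimate(f, p, delta, rng, exp_map=exp_map):
    """One draw of estimator (1): (n^2 / delta^2) f(Exp_p(dv + dw)) v (x) w."""
    n = p.shape[0]
    v = rng.standard_normal(n); v /= np.linalg.norm(v)  # uniform on the unit sphere
    w = rng.standard_normal(n); w /= np.linalg.norm(w)
    return (n**2 / delta**2) * f(exp_map(p, delta * (v + w))) * np.outer(v, w)

# Averaging many draws approximates the Hessian of the smoothed function (Lemma 1
# below); e.g. for a quadratic f(y) = 0.5 y^T A y at p = 0 the average approaches A.
rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0], [1.0, 3.0]])
f = lambda y: 0.5 * y @ A @ y
p = np.zeros(2)
est = sum(hess_estimate(f, p, 0.1, rng) for _ in range(200000)) / 200000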

Our Hessian estimator satisfies

\left\|\mathbb{E}_{v,w\overset{\mathrm{i.i.d.}}{\sim}\mathbb{S}_{p}}\left[\widehat{\mathrm{H}}f(p;v,w;\delta)\right]-\mathrm{Hess}f(p)\right\|
= O\left(\delta^{2}\sup_{u\in\mathbb{S}_{p}}\left|\mathbb{E}_{v,w\overset{\mathrm{i.i.d.}}{\sim}\mathbb{S}_{p}}\left[\frac{n}{n+2}\nabla_{u}^{2}\left(\nabla_{v}^{2}+\nabla_{w}^{2}\right)f(p)\right]\right|\right),

where $\|\cdot\|$ is the $\infty$-Schatten norm, $\mathbb{S}_{p}$ is the unit sphere in $T_{p}\mathcal{M}$, $\mathrm{Hess}f(p)$ is the Hessian of $f$ at $p$, and $\nabla$ is the covariant derivative associated with the Riemannian metric.

This bias bound improves previously known results in two ways:

  1. It provides, via the Levi-Civita connection, the first bias bound for Hessian estimators that explicitly depends on the local geometry of the underlying space;

  2. It significantly improves the best previously known bias bound for $O(1)$-evaluation Hessian estimators over Riemannian manifolds, which is of order $O(L_{2}n^{2}\delta)$, where $L_{2}$ is the Lipschitz constant of the Hessian, $n$ is the dimension of the manifold, and $\delta$ is the finite-difference step size. See Remark 1 for details.

We also study downstream computations for our Hessian estimator. More specifically, we introduce novel provably accurate methods for computing the adjugate and the inverse of the Hessian matrix, all using zeroth-order information only. These zeroth-order computation methods may serve as building blocks for further applications. The advantage of our method over existing methods is evidenced by careful empirical evaluations.

Related Works

Zeroth-order optimization has attracted the attention of many researchers. Under this broad umbrella, there stand Bayesian optimization methods (see the review article by Shahriari et al. (2015) for an overview), comparison-based methods (e.g., Nelder and Mead, 1965), genetic algorithms (e.g., Goldberg and Holland, 1988), best arm identification from the multi-armed bandit community (e.g., Bubeck et al., 2009; Audibert et al., 2010), and many others (see the book by Conn et al. (2009) for an overview). Among all these zeroth-order optimization schemes, one classic and prosperous line of work focuses on estimating higher-order derivatives using zeroth-order information.

Zeroth-order gradient estimators make up a large portion of the derivative estimation literature. Flaxman et al. (2005) studied a single-point stochastic gradient estimator for the purpose of bandit learning. Duchi et al. (2015) studied stabilization of stochastic gradient estimators via two-point (or multi-point) evaluations. Nesterov and Spokoiny (2017) and Li et al. (2020) studied gradient estimators using Gaussian smoothing, and investigated downstream optimization methods using the estimated gradient. Recently, Wang et al. (2021) studied stochastic gradient estimators over Riemannian manifolds, through the lens of the Greene-Wu convolution.

Zeroth-order Hessian estimation is also a central topic in derivative estimation. In the control community, gradient-based Hessian estimators were introduced for iterative optimization algorithms, and asymptotic convergence was proved (Spall, 2000). Apart from this asymptotic result, no generic non-asymptotic bounds for $O(1)$-evaluation Hessian estimators were available until recently. Based on Stein's identity (Stein, 1981), Balasubramanian and Ghadimi (2021) designed the Stein-type Hessian estimator, and combined it with the cubic-regularized Newton's method (Nesterov and Polyak, 2006) for non-convex optimization. Li et al. (2020) generalized Stein-type Hessian estimators to Riemannian manifolds embedded in Euclidean spaces. Several authors have also considered variance and higher-order moments of Hessian (and gradient) estimators (Li et al., 2020; Balasubramanian and Ghadimi, 2021; Feng and Wang, 2022). In particular, Feng and Wang (2022) showed that estimators via random orthogonal frames from Stiefel manifolds have significantly smaller variance. Yet in the case of non-trivial curvature (Li et al., 2020), no geometry-aware bias bound had been given prior to our work.

2 Preliminaries and Conventions

For better readability, we list here some notations and conventions that will be used throughout the rest of this paper.

  • For any $p\in\mathcal{M}$, let $U_{p}$ denote the open set near $p$ that is diffeomorphic to a subset of $\mathbb{R}^{n}$ via the local normal coordinate chart $\phi$. Define the distance $d_{p}(q_{1},q_{2})$ ($q_{1},q_{2}\in U_{p}$) such that

    d_{p}(q_{1},q_{2})=d_{\mathrm{Euc}}(\phi(q_{1}),\phi(q_{2})),

  • (A0, Analyticity Assumption) Throughout the paper, we assume that both the Riemannian metric and the function of interest are analytic.

  • The injectivity radius of $p\in\mathcal{M}$ (written $\mathrm{inj}(p)$) is defined as the radius of the largest geodesic ball that is contained in $U_{p}$. (A1, Small Step Size Assumption) Throughout the paper, we assume that the finite difference step size $\delta$ of the estimator at point $p\in\mathcal{M}$ satisfies $\delta\leq\frac{\mathrm{inj}(p)}{2}$.

  • All musical isomorphisms are omitted when there is no confusion.

  • For any $p\in\mathcal{M}$ and $\alpha>0$, we use $\alpha\mathbb{S}_{p}$ (resp. $\alpha\mathbb{B}_{p}$) to denote the origin-centered sphere (resp. ball) in $T_{p}\mathcal{M}$ with radius $\alpha$. For simplicity, we write $\mathbb{S}_{p}=1\mathbb{S}_{p}$ (resp. $\mathbb{B}_{p}=1\mathbb{B}_{p}$). It is worth emphasizing that $\mathbb{S}_{p}$ and $\mathbb{B}_{p}$ are subsets of $T_{p}\mathcal{M}$; they are different from geodesic balls, which reside in $\mathcal{M}$.

  • For $p\in\mathcal{M}$ and $q\in U_{p}$, we use $\mathcal{I}_{p}^{q}:T_{p}\mathcal{M}\rightarrow T_{q}\mathcal{M}$ to denote the parallel transport from $T_{p}\mathcal{M}$ to $T_{q}\mathcal{M}$ along the distance-minimizing geodesic connecting $p$ and $q$. For any $p\in\mathcal{M}$, $u\in T_{p}\mathcal{M}$ and $q\in U_{p}$, define $u_{q}=\mathcal{I}_{p}^{q}(u)$. More generally, $\mathcal{I}_{p}^{q}$ denotes the parallel transport along the distance-minimizing geodesic from $p$ to $q$, on any fiber bundle compatible with the Riemannian structure.

  • We will use the double exponential map notation (Gavrilov, 2007). For any $p\in\mathcal{M}$ and $u,v\in T_{p}\mathcal{M}$ such that $\mathrm{Exp}_{p}(u)\in U_{p}$, we write $\mathrm{Exp}_{p}^{2}(u,v)=\mathrm{Exp}_{\mathrm{Exp}_{p}(u)}(v_{\mathrm{Exp}_{p}(u)})$.

  • (Definition of Hessian; e.g., Petersen, 2006) Over an $n$-dimensional complete Riemannian manifold $\mathcal{M}$, the Hessian of a smooth function $f:\mathcal{M}\rightarrow\mathbb{R}$ at $p$ is a bilinear form $\mathrm{Hess}f(p):T_{p}\mathcal{M}\times T_{p}\mathcal{M}\rightarrow\mathbb{R}$ such that, for all $u,v\in T_{p}\mathcal{M}$, $\mathrm{Hess}f(p)(u,v)=\left\langle\nabla_{v}df\big|_{p},u\right\rangle$. Since the Levi-Civita connection is torsion-free, the Hessian is symmetric: $\mathrm{Hess}f(p)(u,v)=\mathrm{Hess}f(p)(v,u)$ for all $u,v\in T_{p}\mathcal{M}$. For a smooth function $f$, its Hessian satisfies (e.g., Chapter 5.4 of Absil et al., 2009), for any $p\in\mathcal{M}$ and any $v\in T_{p}\mathcal{M}$,

    \mathrm{Hess}f(p)(v,v)=\lim_{\tau\rightarrow 0}\frac{f(\mathrm{Exp}_{p}(\tau v))-2f(p)+f(\mathrm{Exp}_{p}(-\tau v))}{\tau^{2}}=\nabla_{v}^{2}f(p). \qquad (2)

    For simplicity and coherence with the notations in the Euclidean case, we write $u^{\top}\mathrm{Hess}f(p)v:=\mathrm{Hess}f(p)(u,v)$ for any $u,v\in T_{p}\mathcal{M}$.

  • Consider a Riemannian manifold $(\mathcal{M},g)$, a point $p\in\mathcal{M}$, and any symmetric bilinear form $A:T_{p}\mathcal{M}\times T_{p}\mathcal{M}\rightarrow\mathbb{R}$. The $g$-induced $\infty$-Schatten norm (the operator norm) of $A$ is defined as

    \|A\|=\sup_{u\in T_{p}\mathcal{M},\,\|u\|=1}|u^{\top}Au|.

    When it is clear from context, we simply say $\infty$-Schatten norm for the $g$-induced $\infty$-Schatten norm.

  • Note. When applied to a tangent vector, $\|\cdot\|$ is the standard norm induced by the Riemannian metric. When applied to a symmetric bilinear form, $\|\cdot\|$ is the $\infty$-Schatten norm defined above.

3 Zeroth Order Hessian Estimation

For $p\in\mathcal{M}$ and $f:\mathcal{M}\rightarrow\mathbb{R}$, the Hessian of $f$ at $p$ can be estimated by

\widehat{\mathrm{H}}f(p;v,w;\delta)=\frac{n^{2}}{\delta^{2}}\,f(\mathrm{Exp}_{p}(\delta v+\delta w))\,v\otimes w,

where $v,w$ are independently and uniformly sampled from $\mathbb{S}_{p}$ and $\delta$ is the finite difference step size. To study the bias of this estimator, we consider a function $\widetilde{f}^{\delta}$ defined as follows.

For $p\in\mathcal{M}$, a smooth real-valued function $f$ defined over $\mathcal{M}$, and a number $\delta\in(0,\delta_{0}]$, define a function $\widetilde{f}^{\delta}$ (at $p$) such that

\widetilde{f}^{\delta}(p)=\frac{1}{\delta^{2n}V_{n}^{2}}\int_{w\in\delta\mathbb{B}_{p}}\int_{v\in\delta\mathbb{B}_{p}}f(\mathrm{Exp}_{p}(v+w))\,dv\,dw, \qquad (3)

where $V_{n}$ is the volume of the unit ball in $T_{p}\mathcal{M}$ (equal to the volume of the unit ball in $\mathbb{R}^{n}$). Smoothings of this kind have been analytically investigated by Greene and Wu (Greene and Wu, 1973, 1976, 1979). We will first show that $\mathrm{Hess}\widetilde{f}^{\delta}(p)=\mathbb{E}_{v,w\overset{\mathrm{i.i.d.}}{\sim}\mathbb{S}_{p}}[\widehat{\mathrm{H}}f(p;v,w;\delta)]$ in Lemma 1. Then we derive a bound on $\|\mathrm{Hess}\widetilde{f}^{\delta}(p)-\mathrm{Hess}f(p)\|$, which gives a bound on $\|\mathbb{E}_{v,w\overset{\mathrm{i.i.d.}}{\sim}\mathbb{S}_{p}}[\widehat{\mathrm{H}}f(p;v,w;\delta)]-\mathrm{Hess}f(p)\|$. Henceforth, we will use $\mathbb{E}_{v,w}$ as shorthand for $\mathbb{E}_{v,w\overset{\mathrm{i.i.d.}}{\sim}\mathbb{S}_{p}}$.
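Numerically, the smoothing (3) is straightforward to approximate by Monte Carlo. The sketch below (illustrative naming, Euclidean specialization $\mathrm{Exp}_{p}(v+w)=p+v+w$) draws uniform samples from $\delta\mathbb{B}_{p}$ by scaling unit-sphere samples with radii $\delta U^{1/n}$, $U\sim\mathrm{Unif}[0,1]$:

import numpy as np

def f_tilde(f, p, delta, m, rng):
    """Monte Carlo approximation of the double-ball smoothing (3), Euclidean case."""
    n = p.shape[0]
    def uniform_ball(size):
        g = rng.standard_normal((size, n))
        g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the unit sphere
        r = delta * rng.random(size) ** (1.0 / n)       # radii for uniform ball samples
        return g * r[:, None]
    return np.mean([f(p + v + w) for v, w in zip(uniform_ball(m), uniform_ball(m))])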

Lemma 1.

Consider an $n$-dimensional complete analytic Riemannian manifold $(\mathcal{M},g)$. Consider $p\in\mathcal{M}$, an analytic function $f:\mathcal{M}\rightarrow\mathbb{R}$, and a small number $\delta\in(0,\mathrm{inj}(p)/2]$. If $v$ and $w$ are independently and uniformly sampled from $\mathbb{S}_{p}$, then it holds that

\mathbb{E}_{v,w}\left[\widehat{\mathrm{H}}f(p;v,w;\delta)\right]=\mathrm{Hess}\widetilde{f}^{\delta}(p).
Proof.

Define $\varphi_{p}=f\circ\mathrm{Exp}_{p}$. By the fundamental theorem of geometric calculus, it holds that (here $\partial_{i}$ and $\partial_{j}$ are understood as Einstein's notations)

\int_{v\in\delta\mathbb{B}_{p}}\partial_{i}\int_{w\in\delta\mathbb{B}_{p}}\partial_{j}\varphi_{p}(w+v)\,dw\,dv=\int_{v\in\delta\mathbb{B}_{p}}\partial_{i}\int_{w\in\delta\mathbb{S}_{p}}\varphi_{p}(w+v)\frac{w}{\|w\|}\,dw\,dv
\overset{(i)}{=}\int_{v\in\delta\mathbb{S}_{p}}\int_{w\in\delta\mathbb{S}_{p}}\varphi_{p}(w+v)\frac{v\otimes w}{\|w\|\|v\|}\,dw\,dv.

Since $v$ and $w$ are independently and uniformly sampled from $\mathbb{S}_{p}$, it holds that

\int_{\delta\mathbb{S}_{p}}\int_{\delta\mathbb{S}_{p}}\varphi_{p}(w+v)\frac{v\otimes w}{\|v\|\|w\|}\,dw\,dv\overset{(ii)}{=}\delta^{2n-2}A_{n-1}^{2}\,\mathbb{E}_{v,w}\left[\varphi_{p}(\delta v+\delta w)\,v\otimes w\right],

where $A_{n-1}$ is the surface area of $\mathbb{S}_{p}$ in $T_{p}\mathcal{M}$ (equal to the surface area of the unit sphere in $\mathbb{R}^{n}$).

By the dominated convergence theorem and the assumption $\delta\leq\frac{\mathrm{inj}(p)}{2}$, we have

\partial_{i}\partial_{j}\int_{v\in\delta\mathbb{B}_{p}}\int_{w\in\delta\mathbb{B}_{p}}\varphi_{p}(w+v)\,dw\,dv\overset{(iii)}{=}\int_{v\in\delta\mathbb{B}_{p}}\partial_{i}\int_{w\in\delta\mathbb{B}_{p}}\partial_{j}\varphi_{p}(w+v)\,dw\,dv.

More specifically, the $\partial_{i}$ operations (or tangent vectors) can be defined in terms of limits, and we can interchange the limit and the integral by the dominated convergence theorem.

Combining $(i)$, $(ii)$ and $(iii)$ gives

\partial_{i}\partial_{j}\int_{v\in\delta\mathbb{B}_{p}}\int_{w\in\delta\mathbb{B}_{p}}\varphi_{p}(w+v)\,dw\,dv\overset{(iv)}{=}\delta^{2n-2}A_{n-1}^{2}\,\mathbb{E}_{v,w}\left[\varphi_{p}(\delta v+\delta w)\,v\otimes w\right].

Combining the above results gives

\partial_{i}\partial_{j}\widetilde{f}^{\delta}(p)=\partial_{i}\partial_{j}\frac{1}{\delta^{2n}V_{n}^{2}}\int_{v\in\delta\mathbb{B}_{p}}\int_{w\in\delta\mathbb{B}_{p}}\varphi_{p}(w+v)\,dw\,dv
=\frac{\delta^{2n-2}A_{n-1}^{2}}{\delta^{2n}V_{n}^{2}}\mathbb{E}_{v,w}\left[\varphi_{p}(\delta v+\delta w)\,v\otimes w\right]
=\frac{n^{2}}{\delta^{2}}\mathbb{E}_{v,w}\left[f(\mathrm{Exp}_{p}(\delta v+\delta w))\,v\otimes w\right],

where the second-to-last equality uses $(iv)$, and the last equality uses $A_{n-1}=nV_{n}$. ∎

As a result of Lemma 1, a bound on $\|\mathrm{Hess}\widetilde{f}^{\delta}(p)-\mathrm{Hess}f(p)\|$ gives a bound on $\|\mathbb{E}[\widehat{\mathrm{H}}f(p;v,w;\delta)]-\mathrm{Hess}f(p)\|$. To bound $\|\mathrm{Hess}\widetilde{f}^{\delta}(p)-\mathrm{Hess}f(p)\|$, we need to explicitly extend the definition of $\widetilde{f}^{\delta}$ from $p$ to a neighborhood of $p$ (Wang et al., 2021), so that the Hessian can be computed in a precise manner. For $p\in\mathcal{M}$, a smooth function $f:\mathcal{M}\rightarrow\mathbb{R}$, and a number $\delta\in(0,\delta_{0}]$, define a function $\widetilde{f}^{\delta}$ (near $p$) such that

\widetilde{f}^{\delta}(q)=\mathbb{E}_{v,w\overset{\mathrm{i.i.d.}}{\sim}\mathbb{S}_{p}}\left[\widetilde{f}_{v,w}^{\delta}(q)\right],\quad\forall q\in U_{p}, \qquad (4)

where

\widetilde{f}_{v,w}^{\delta}(q):=\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}f\left(\mathrm{Exp}_{q}(tv_{q}+sw_{q})\right)|t|^{n-1}|s|^{n-1}\,dt\,ds, \qquad (5)

with $v,w\in\mathbb{S}_{p}$.

The advantage of defining $\widetilde{f}^{\delta}$ via $\widetilde{f}_{v,w}^{\delta}$ is that $\widetilde{f}_{v,w}^{\delta}$ is explicitly defined in a neighborhood of $p$. Thus we can carry out geometry-aware computations in a precise manner. Next, we verify that (3) and (4) agree with each other in the following proposition.

Proposition 1.

For any $p\in\mathcal{M}$ and any $\delta\leq\delta_{0}$, (3) and (4) coincide at any $q\in U_{p}$.

Proof.

At any $q\in U_{p}$, we have

(3)=\frac{1}{\delta^{2n}V_{n}^{2}}\int_{w\in\delta\mathbb{B}_{q}}\int_{v\in\delta\mathbb{B}_{q}}f(\mathrm{Exp}_{q}(v+w))\,dv\,dw
\overset{(i)}{=}\frac{n^{2}}{\delta^{2n}A_{n-1}^{2}}\int_{w\in\delta\mathbb{B}_{q}}\int_{v\in\delta\mathbb{B}_{q}}f(\mathrm{Exp}_{q}(v+w))\,dv\,dw
\overset{(ii)}{=}\frac{n^{2}}{\delta^{2n}A_{n-1}^{2}}\int_{w\in\mathbb{S}_{q}}\int_{v\in\mathbb{S}_{q}}\frac{1}{4}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}f(\mathrm{Exp}_{q}(tv+sw))|t|^{n-1}|s|^{n-1}\,dt\,ds\,dv\,dw,

where $(i)$ uses $A_{n-1}=nV_{n}$, and $(ii)$ changes from Cartesian coordinates to hyperspherical coordinates (in $T_{q}\mathcal{M}$). Since the Levi-Civita connection is compatible with the Riemannian metric, the standard Lebesgue measure in $T_{p}\mathcal{M}$ is preserved after transporting to $T_{q}\mathcal{M}$. This implies that, for any continuous function $h$ defined over $T_{q}\mathcal{M}$, we have $\int_{v\in\mathbb{S}_{q}}h(v)\,dv\overset{(iii)}{=}\int_{v\in\mathbb{S}_{p}}h(v_{q})\,dv$. Thus we have, at any $q\in U_{p}$,

(3)=\frac{n^{2}}{4\delta^{2n}A_{n-1}^{2}}\int_{w\in\mathbb{S}_{q}}\int_{v\in\mathbb{S}_{q}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}f(\mathrm{Exp}_{q}(tv+sw))|t|^{n-1}|s|^{n-1}\,dt\,ds\,dw\,dv
\overset{(iv)}{=}\frac{n^{2}}{4\delta^{2n}A_{n-1}^{2}}\int_{w\in\mathbb{S}_{p}}\int_{v\in\mathbb{S}_{p}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}f(\mathrm{Exp}_{q}(tv_{q}+sw_{q}))|t|^{n-1}|s|^{n-1}\,dt\,ds\,dw\,dv
=\frac{1}{A_{n-1}^{2}}\int_{w\in\mathbb{S}_{p}}\int_{v\in\mathbb{S}_{p}}\widetilde{f}_{v,w}^{\delta}(q)\,dw\,dv=(4),

where $(iv)$ uses $(iii)$. ∎

By Proposition 1, it is sufficient to work with $\widetilde{f}_{v,w}^{\delta}$ and randomize $v,w$ over the unit sphere. For any direction $u\in\mathbb{S}_{p}$, the Hessian of $\widetilde{f}_{v,w}^{\delta}$ along $u$ can be explicitly written out in terms of $f$ and $u,v,w$. This result is found in Lemma 2.

Lemma 2.

Consider an $n$-dimensional complete analytic Riemannian manifold $(\mathcal{M},g)$. Consider $p\in\mathcal{M}$, an analytic function $f:\mathcal{M}\rightarrow\mathbb{R}$, and a small number $\delta\in(0,\mathrm{inj}(p)/2]$. For any $u,v,w\in\mathbb{S}_{p}$, we have

u^{\top}\mathrm{Hess}\widetilde{f}_{v,w}^{\delta}(p)u=\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}u_{q}^{\top}\mathrm{Hess}f(q)u_{q}\,|t|^{n-1}|s|^{n-1}\,dt\,ds
+\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\nabla_{u}^{2}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}f(p)\,dt\,ds
-\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}\nabla_{u}^{2}f(p)\,dt\,ds,

where $q=\mathrm{Exp}_{p}(tv+sw)$.

Proof.

From the definition of the Hessian, we have

u^{\top}\mathrm{Hess}\widetilde{f}_{v,w}^{\delta}(p)u=\lim_{\tau\rightarrow 0}\frac{\widetilde{f}_{v,w}^{\delta}(\mathrm{Exp}_{p}(\tau u))-2\widetilde{f}_{v,w}^{\delta}(p)+\widetilde{f}_{v,w}^{\delta}(\mathrm{Exp}_{p}(-\tau u))}{\tau^{2}}.

Thus it is sufficient to fix any $t,s\in[-\delta,\delta]$ and consider

\lim_{\tau\rightarrow 0}\frac{f(\mathrm{Exp}_{p}^{2}(\tau u,tv+sw))-2f(\mathrm{Exp}_{p}(tv+sw))+f(\mathrm{Exp}_{p}^{2}(-\tau u,tv+sw))}{\tau^{2}}.

For simplicity, define

\phi(\tau,t,s)=f(\mathrm{Exp}_{p}^{2}(\tau u,tv+sw))+f(\mathrm{Exp}_{p}^{2}(-\tau u,tv+sw))-f(\mathrm{Exp}_{p}^{2}(tv+sw,\tau u))-f(\mathrm{Exp}_{p}^{2}(tv+sw,-\tau u)).

Let $q=\mathrm{Exp}_{p}(tv+sw)$; then we have

u^{\top}\mathrm{Hess}\widetilde{f}_{v,w}^{\delta}(p)u=\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}u_{q}^{\top}\mathrm{Hess}f(q)u_{q}\,|t|^{n-1}|s|^{n-1}\,dt\,ds
+\frac{n^{2}}{4\delta^{2n}}\lim_{\tau\rightarrow 0}\frac{\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\phi(\tau,t,s)\,|t|^{n-1}|s|^{n-1}\,dt\,ds}{\tau^{2}}, \qquad (6)

provided that the last term converges.

For any $p\in\mathcal{M}$, $v\in T_{p}\mathcal{M}$ and $q\in U_{p}$, define $h_{v}^{(j)}(q)=\nabla_{v_{q}}^{j}f(q)$. We can Taylor expand $h_{v}^{(j)}(\mathrm{Exp}_{p}(u))$ as

h_{v}^{(j)}(\mathrm{Exp}_{p}(u))=h_{v}^{(j)}(\mathrm{Exp}_{p}(tu))\big|_{t=1}
=\sum_{i=0}^{\infty}\frac{1}{i!}\frac{d^{i}}{dt^{i}}h_{v}^{(j)}(\mathrm{Exp}_{p}(tu))\bigg|_{t=0}
=\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}h_{v}^{(j)}(p)
\overset{(a)}{=}\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}\nabla_{v}^{j}f(p).

From the above, we have, for any $p$ and any $u,v\in T_{p}\mathcal{M}$ of small norm,

f\left(\mathrm{Exp}_{p}^{2}(u,v)\right)=f\left(\mathrm{Exp}_{\mathrm{Exp}_{p}(u)}\left(v_{\mathrm{Exp}_{p}(u)}\right)\right)
=\sum_{j=0}^{\infty}\frac{1}{j!}\nabla_{v_{\mathrm{Exp}_{p}(u)}}^{j}f(\mathrm{Exp}_{p}(u))
=\sum_{j=0}^{\infty}\frac{1}{j!}h_{v}^{(j)}(\mathrm{Exp}_{p}(u))
=\sum_{j=0}^{\infty}\frac{1}{j!}\sum_{i=0}^{\infty}\frac{1}{i!}\nabla_{u}^{i}\nabla_{v}^{j}f(p),

where the second equality uses Taylor expansion at $\mathrm{Exp}_{p}(u)$ and the last equality uses $(a)$.

From the above computation, we expand $f(\mathrm{Exp}_{p}^{2}(tv+sw,\tau u))$ (and similar terms) into infinite series. Thus we can write $\phi(\tau,t,s)$ as

\phi(\tau,t,s)=f(\mathrm{Exp}_{p}^{2}(\tau u,tv+sw))+f(\mathrm{Exp}_{p}^{2}(-\tau u,tv+sw))-f(\mathrm{Exp}_{p}^{2}(tv+sw,\tau u))-f(\mathrm{Exp}_{p}^{2}(tv+sw,-\tau u))
=\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{\tau u}^{i}\nabla_{tv+sw}^{j}f(p)+\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{-\tau u}^{i}\nabla_{tv+sw}^{j}f(p)
-\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{tv+sw}^{j}\nabla_{\tau u}^{i}f(p)-\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{1}{i!j!}\nabla_{tv+sw}^{j}\nabla_{-\tau u}^{i}f(p)
=\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{u}^{2}\nabla_{tv+sw}^{j}f(p)+\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{u}^{2}\nabla_{tv+sw}^{j}f(p)
-\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{tv+sw}^{j}\nabla_{u}^{2}f(p)-\sum_{j=0}^{\infty}\frac{\tau^{2}}{2\,j!}\nabla_{tv+sw}^{j}\nabla_{u}^{2}f(p)+O(\tau^{3}), \qquad (7)

where the last equality uses the fact that the zeroth-order and first-order terms in $\tau$ all cancel.

From (7), we have

\lim_{\tau\rightarrow 0}\frac{\phi(\tau,t,s)}{\tau^{2}}=\sum_{j=1}^{\infty}\frac{1}{j!}\nabla_{u}^{2}\nabla_{tv+sw}^{j}f(p)-\sum_{j=1}^{\infty}\frac{1}{j!}\nabla_{tv+sw}^{j}\nabla_{u}^{2}f(p)
=\sum_{j=1}^{\infty}\frac{1}{j!}\nabla_{u}^{2}\left(t\nabla_{v}+s\nabla_{w}\right)^{j}f(p)-\sum_{j=1}^{\infty}\frac{1}{j!}\left(t\nabla_{v}+s\nabla_{w}\right)^{j}\nabla_{u}^{2}f(p)
=\sum_{j=1}^{\infty}\frac{1}{(2j)!}\nabla_{u}^{2}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}f(p)-\sum_{j=1}^{\infty}\frac{1}{(2j)!}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}\nabla_{u}^{2}f(p)+Odd(t,s),

where $Odd(t,s)$ denotes terms that are odd in $t$ or odd in $s$.

Since $\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}Odd(t,s)\,dt\,ds=0$, we have

\frac{n^{2}}{4\delta^{2n}}\lim_{\tau\rightarrow 0}\frac{\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\phi(\tau,t,s)|t|^{n-1}|s|^{n-1}\,dt\,ds}{\tau^{2}}
=\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\nabla_{u}^{2}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}f(p)\,dt\,ds
-\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}\nabla_{u}^{2}f(p)\,dt\,ds. \qquad (8)

Collecting terms from (6) and (8), we have

u^{\top}\mathrm{Hess}\widetilde{f}_{v,w}^{\delta}(p)u=\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}u_{q}^{\top}\mathrm{Hess}f(q)u_{q}\,|t|^{n-1}|s|^{n-1}\,dt\,ds
+\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\nabla_{u}^{2}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}f(p)\,dt\,ds
-\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}\nabla_{u}^{2}f(p)\,dt\,ds,

where $q=\mathrm{Exp}_{p}(tv+sw)$. This concludes the proof. ∎

Gathering the above results gives a bias bound for (1), summarized in the following theorem.

Theorem 1.

Consider an $n$-dimensional complete analytic Riemannian manifold $(\mathcal{M},g)$. Consider $p\in\mathcal{M}$, an analytic function $f:\mathcal{M}\rightarrow\mathbb{R}$, and a small number $\delta\in(0,\mathrm{inj}(p)/2]$. For any $p\in\mathcal{M}$ and unit vectors $u,v,w\in T_{p}\mathcal{M}$, define a function $\vartheta_{p,u,v,w}$ over $(-\mathrm{inj}(p)/2,\mathrm{inj}(p)/2)\times(-\mathrm{inj}(p)/2,\mathrm{inj}(p)/2)$ such that

\vartheta_{p,u,v,w}(t,s)=\mathrm{Hess}f(\mathrm{Exp}_{p}(tv+sw))\left(u_{\mathrm{Exp}_{p}(tv+sw)},u_{\mathrm{Exp}_{p}(tv+sw)}\right).

The estimator (1) satisfies

\left\|\mathbb{E}_{v,w}\left[\widehat{\mathrm{H}}f(p;v,w;\delta)\right]-\mathrm{Hess}f(p)\right\|
\leq\sup_{u\in\mathbb{S}_{p}}\Bigg|\mathbb{E}_{v,w}\left[\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{i,j\in\mathbb{N},\,i+j\geq 1}\frac{t^{2i}s^{2j}}{(2i)!(2j)!}\partial_{1}^{2i}\partial_{2}^{2j}\vartheta_{p,u,v,w}(0,0)\,|t|^{n-1}|s|^{n-1}\,dt\,ds\right]
+\mathbb{E}_{v,w}\Bigg[\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\nabla_{u}^{2}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}f(p)\,dt\,ds
-\frac{n^{2}}{4\delta^{2n}}\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}\nabla_{u}^{2}f(p)\,dt\,ds\Bigg]\Bigg|,

where $v,w$ are independently sampled from the uniform distribution over $\mathbb{S}_{p}$.

Proof.

By Lemma 2, we have

u^{\top}\mathbb{E}_{v,w}\left[\mathrm{Hess}\widetilde{f}_{v,w}^{\delta}(p)\right]u
=\frac{n^{2}}{4\delta^{2n}}\mathbb{E}_{v,w}\left[\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}u_{\mathrm{Exp}_{p}(tv+sw)}^{\top}\mathrm{Hess}f(\mathrm{Exp}_{p}(tv+sw))\,u_{\mathrm{Exp}_{p}(tv+sw)}\,|t|^{n-1}|s|^{n-1}\,dt\,ds\right]
+\frac{n^{2}}{4\delta^{2n}}\mathbb{E}_{v,w}\left[\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\nabla_{u}^{2}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}f(p)\,dt\,ds\right]
-\frac{n^{2}}{4\delta^{2n}}\mathbb{E}_{v,w}\left[\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{j=1}^{\infty}\frac{|t|^{n-1}|s|^{n-1}}{(2j)!}\left(t\nabla_{v}+s\nabla_{w}\right)^{2j}\nabla_{u}^{2}f(p)\,dt\,ds\right].

By the analyticity assumption, we have

\vartheta_{p,u,v,w}(t,s)=\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{t^{i}s^{j}}{i!j!}\partial_{1}^{i}\partial_{2}^{j}\vartheta_{p,u,v,w}(0,0).

For any fixed $u\in\mathbb{S}_{p}$, we have

\frac{n^{2}}{4\delta^{2n}}\mathbb{E}_{v,w}\left[\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}u_{\mathrm{Exp}_{p}(tv+sw)}^{\top}\mathrm{Hess}f(\mathrm{Exp}_{p}(tv+sw))\,u_{\mathrm{Exp}_{p}(tv+sw)}\,|t|^{n-1}|s|^{n-1}\,dt\,ds\right]
=\frac{n^{2}}{4\delta^{2n}}\mathbb{E}_{v,w}\left[\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\vartheta_{p,u,v,w}(t,s)\,|t|^{n-1}|s|^{n-1}\,dt\,ds\right]
=\frac{n^{2}}{4\delta^{2n}}\mathbb{E}_{v,w}\left[\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\left(\sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\frac{t^{i}s^{j}}{i!j!}\partial_{1}^{i}\partial_{2}^{j}\vartheta_{p,u,v,w}(0,0)\right)|t|^{n-1}|s|^{n-1}\,dt\,ds\right]
=\frac{n^{2}}{4\delta^{2n}}\mathbb{E}_{v,w}\left[\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}\sum_{i,j\in\mathbb{N},\,i+j\geq 1}\frac{t^{2i}s^{2j}}{(2i)!(2j)!}\partial_{1}^{2i}\partial_{2}^{2j}\vartheta_{p,u,v,w}(0,0)\,|t|^{n-1}|s|^{n-1}\,dt\,ds\right]+u^{\top}\mathrm{Hess}f(p)u,

where in the last line the terms that are odd in $t$ or $s$ vanish. ∎

By applying (2) twice, we have

\partial_{1}^{2}\vartheta_{p,u,v,w}(0,0)=\nabla_{v}^{2}\nabla_{u}^{2}f(p)\qquad\mathrm{and}\qquad\partial_{2}^{2}\vartheta_{p,u,v,w}(0,0)=\nabla_{w}^{2}\nabla_{u}^{2}f(p).

Thus, by dropping terms of order $O(\delta^{3})$ and noting that $\int_{-\delta}^{\delta}\int_{-\delta}^{\delta}|t|^{n-1}t\,|s|^{n-1}s\,dt\,ds=0$, we have

\left\|\mathbb{E}_{v,w}\left[\widehat{\mathrm{H}}f(p;v,w;\delta)\right]-\mathrm{Hess}f(p)\right\|
=O\left(\delta^{2}\sup_{u\in\mathbb{S}_{p}}\left|\mathbb{E}_{v,w\overset{\mathrm{i.i.d.}}{\sim}\mathbb{S}_{p}}\left[\frac{n}{n+2}\nabla_{u}^{2}\left(\nabla_{v}^{2}+\nabla_{w}^{2}\right)f(p)\right]\right|\right).

3.1 Example: the $n$-sphere

We consider the Riemannian manifold $\mathbb{S}^{n-1}$ with the metric induced by the ambient Euclidean space. This space is of both theoretical and practical appeal. In this space, the exponential map is

\mathrm{Exp}_{x}(tv)=x\cos(t)+v\sin(t),

where $v$ is a unit vector in $\mathbb{R}^{n}$; the parallel transport is

\mathcal{I}_{x}^{\mathrm{Exp}_{x}(tu)}(v)=v-uu^{\top}v+uu^{\top}v\cos(t)-\|uu^{\top}v\|\,x\sin(t)

for any $x\in\mathbb{S}^{n-1}$ and $u,v\in\mathbb{S}^{n-1}$ with $u,v\perp x$.

We consider estimating the Hessian of the function $x_{i}^{2}$, where $x\in\mathbb{S}^{n-1}$ and $x_{i}$ is the $i$-th component of $x$. This simple function serves as an example for estimating the Hessian of general polynomials over $\mathbb{S}^{n-1}$.

We have

\nabla_{v}^{2}x_{i}^{2}=\lim_{t\rightarrow 0}\frac{\left(x_{i}\cos t+v_{i}\sin t\right)^{2}-2x_{i}^{2}+\left(x_{i}\cos t-v_{i}\sin t\right)^{2}}{t^{2}}
=\lim_{t\rightarrow 0}\frac{\left(2\cos^{2}t-2\right)x_{i}^{2}+2v_{i}^{2}\sin^{2}t}{t^{2}}
=-2x_{i}^{2}+2v_{i}^{2},

and

\nabla_{u}^{2}v_{i}^{2}
=\lim_{t\rightarrow 0}\frac{2\left(v_{i}-\left(uu^{\top}v\right)_{i}+\left(uu^{\top}v\right)_{i}\cos t\right)^{2}+2\left(\|uu^{\top}v\|x_{i}\sin t\right)^{2}-2v_{i}^{2}}{t^{2}}
=\lim_{t\rightarrow 0}\frac{4v_{i}\left(uu^{\top}v\right)_{i}(\cos t-1)+2(uu^{\top}v)_{i}^{2}\left(\cos t-1\right)^{2}+2\left(\|uu^{\top}v\|x_{i}\sin t\right)^{2}}{t^{2}}
=-2v_{i}\left(uu^{\top}v\right)_{i}+2\|uu^{\top}v\|^{2}x_{i}^{2}.

Thus it holds that

\nabla_{u}^{2}\nabla_{v}^{2}x_{i}^{2}=\nabla_{u}^{2}\left(-2x_{i}^{2}+2v_{i}^{2}\right)
=-2\left(-2x_{i}^{2}+2u_{i}^{2}\right)+2\left(-2v_{i}\left(uu^{\top}v\right)_{i}+2\|uu^{\top}v\|^{2}x_{i}^{2}\right)
=4x_{i}^{2}-4u_{i}^{2}-4v_{i}\left(uu^{\top}v\right)_{i}+4\|uu^{\top}v\|^{2}x_{i}^{2}.

Since $\mathbb{E}_{v}\left[vv^{\top}\right]=\frac{1}{n}I$ and $\mathbb{E}_{w}\left[ww^{\top}\right]=\frac{1}{n}I$, we have

\mathbb{E}_{v,w}\left[\nabla_{u}^{2}\left(\nabla_{v}^{2}+\nabla_{w}^{2}\right)x_{i}^{2}\right]=2\left(4x_{i}^{2}-4u_{i}^{2}-\frac{4}{n}u_{i}^{2}+\frac{4}{n}x_{i}^{2}\right)
=\left(8+\frac{8}{n}\right)x_{i}^{2}-\left(8+\frac{8}{n}\right)u_{i}^{2}.

This implies that the bias of the Hessian estimator for $x_{i}^{2}$ over the $n$-sphere with granularity $\delta$ is of order

O\left(\delta^{2}\max_{u\in\mathbb{S}^{n-1},\,u\perp x}\left|\left(8+\frac{8}{n}\right)x_{i}^{2}-\left(8+\frac{8}{n}\right)u_{i}^{2}\right|\right).
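To make the example concrete, the following sketch estimates $\mathrm{Hess}\,x_{i}^{2}$ on the sphere numerically. For a low-variance check it uses the four-point symmetrization introduced later in Section 4.1 (equation (9)), transplanted to the sphere; this combination is our own illustrative choice, not a construction from this section. The reference Hessian follows from $\nabla_{v}^{2}x_{i}^{2}=2v_{i}^{2}-2x_{i}^{2}$ by polarization: $\mathrm{Hess}\,x_{i}^{2}(u,v)=2u_{i}v_{i}-2x_{i}^{2}\langle u,v\rangle$ for tangent $u,v$.

import numpy as np

def sphere_exp(x, u):
    """Exp_x(u) on the sphere for tangent u: x cos|u| + (u/|u|) sin|u|."""
    t = np.linalg.norm(u)
    return x if t < 1e-12 else np.cos(t) * x + np.sin(t) * (u / t)

def tangent_unit(x, rng):
    """Uniform unit vector in the tangent space at x (orthogonal complement of x)."""
    g = rng.standard_normal(x.shape[0])
    g -= (g @ x) * x
    return g / np.linalg.norm(g)

rng = np.random.default_rng(0)
n_amb, i = 5, 0                      # ambient dimension; f(x) = x_i^2
dim = n_amb - 1                      # manifold dimension (the n in (1))
f = lambda y: y[i] ** 2
x = rng.standard_normal(n_amb); x /= np.linalg.norm(x)

delta, m = 0.05, 40000
est = np.zeros((n_amb, n_amb))
for _ in range(m):
    v, w = tangent_unit(x, rng), tangent_unit(x, rng)
    d = (f(sphere_exp(x, delta * (v + w))) - f(sphere_exp(x, delta * (-v + w)))
         - f(sphere_exp(x, delta * (v - w))) + f(sphere_exp(x, delta * (-v - w))))
    est += dim**2 / (8 * delta**2) * d * (np.outer(v, w) + np.outer(w, v))
est /= m

P = np.eye(n_amb) - np.outer(x, x)   # projector onto the tangent space at x
E_ii = np.zeros((n_amb, n_amb)); E_ii[i, i] = 1.0
H_true = P @ (2 * E_ii - 2 * x[i]**2 * np.eye(n_amb)) @ P
print(np.abs(est - H_true).max())    # small: O(delta^2) bias plus Monte Carlo noise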

4 The Euclidean Case

In this section, we focus on numerical stabilization of the estimator and on algorithmic zeroth-order inversion of the estimated Hessian. For numerical and algorithmic purposes, we restrict our attention to the Euclidean case. In the Euclidean case, we also use the notation $\nabla^{2}f(x)$ to denote the Hessian of $f$ at $x$.

4.1 Stabilizing the Estimate

In the Euclidean case, the estimator in (1) simplifies to

\widehat{\mathrm{H}}f(p;v,w;\delta)=\frac{n^{2}}{\delta^{2}}f(p+\delta v+\delta w)\,vw^{\top},

where $v,w$ are independently and uniformly sampled from $\mathbb{S}^{n-1}$ (the unit sphere in $\mathbb{R}^{n}$). Its stabilized version is

\widehat{\mathrm{H}}f(p;v,w;\delta)=\frac{n^{2}}{8\delta^{2}}\big[f(p+\delta v+\delta w)-f(p-\delta v+\delta w)-f(p+\delta v-\delta w)+f(p-\delta v-\delta w)\big]\left(vw^{\top}+wv^{\top}\right). \qquad (9)

To see why (9) stabilizes the estimate, we Taylor expand and get

f(p+\delta v+\delta w)-f(p-\delta v+\delta w)-f(p+\delta v-\delta w)+f(p-\delta v-\delta w)
\approx\delta^{2}\left(v+w\right)^{\top}\nabla^{2}f(p)\left(v+w\right)-\frac{\delta^{2}}{2}\left(v-w\right)^{\top}\nabla^{2}f(p)\left(v-w\right)-\frac{\delta^{2}}{2}\left(-v+w\right)^{\top}\nabla^{2}f(p)\left(-v+w\right)
=4\delta^{2}\,v^{\top}\nabla^{2}f(p)w, \qquad (10)

where $\nabla^{2}f$ denotes the Hessian of $f$.

From the above derivation, we see that (9) removes the dependence on zeroth-order and first-order information, and symmetrizes the estimate (Feng and Wang, 2022). This reduces variance and stabilizes the estimation. A similar phenomenon for gradient estimators was noted by Duchi et al. (2015).
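The stabilized estimator (9) is short to implement. Below is a minimal sketch (illustrative naming), with a sanity check on a quadratic, for which the four-point difference equals $4\delta^{2}v^{\top}Aw$ exactly, so the only error left is Monte Carlo noise:

import numpy as np

def hess_est_stab(f, p, delta, rng):
    """One draw of the stabilized estimator (9): four function evaluations."""
    n = p.shape[0]
    v = rng.standard_normal(n); v /= np.linalg.norm(v)
    w = rng.standard_normal(n); w /= np.linalg.norm(w)
    d = (f(p + delta * (v + w)) - f(p + delta * (-v + w))
         - f(p + delta * (v - w)) + f(p - delta * (v + w)))
    return n**2 / (8 * delta**2) * d * (np.outer(v, w) + np.outer(w, v))

# Sanity check: f(y) = 0.5 y^T A y has Hessian A everywhere.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
f = lambda y: 0.5 * y @ A @ y
p = rng.standard_normal(n)
est = sum(hess_est_stab(f, p, 0.1, rng) for _ in range(50000)) / 50000
print(np.abs(est - A).max())         # Monte Carlo error only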

4.1.1 A Random Projection Derivation

Similar to gradient estimators (Nesterov and Spokoiny, 2017; Li et al., 2020; Wang et al., 2021; Feng and Wang, 2022), one may also derive the Hessian estimator (9) using a random projection argument. Here we use a spherical random projection argument to derive the Hessian estimator. A more thorough study can be found in (Feng and Wang, 2022). To start, we first prove an identity for random matrix projection in Lemma 3.

Lemma 3.

Let $v,w$ be independently and uniformly sampled from the unit sphere in $\mathbb{R}^{n}$. For any matrix $A\in\mathbb{R}^{n\times n}$, we have

\mathbb{E}\left[\left(v^{\top}Aw\right)wv^{\top}\right]=\frac{1}{n^{2}}A.
Proof.

It is sufficient to show $\mathbb{E}\left[v^{i}A_{i}^{j}w_{j}v^{k}w_{l}\right]=\frac{1}{n^{2}}A_{l}^{k}$ for any $k,l\in[n]$ (Einstein's notation is used).

Since $v$ is uniformly sampled from $\mathbb{S}^{n-1}$ (the unit sphere in $\mathbb{R}^{n}$), for $k\neq i$ we have $\mathbb{E}\left[v^{i}v^{k}\,|\,v^{k}=x\right]=0$ for any $x$. This gives

\mathbb{E}\left[v^{i}v^{k}\right]\overset{(i)}{=}\int_{x\in[-1,1]}\mathbb{P}\left(v^{k}=x\right)\mathbb{E}\left[v^{i}v^{k}\,|\,v^{k}=x\right]dx=0,\qquad\forall k\neq i.

By symmetry of the sphere $\mathbb{S}^{n-1}$ and the fact that $\mathbb{E}\left[v^{i}v_{i}\right]=1$, we have $\mathbb{E}\left[v^{k}v^{k}\right]\overset{(ii)}{=}\frac{1}{n}$ for any $k\in[n]$. Combining $(i)$ and $(ii)$ gives

\mathbb{E}\left[v^{i}v^{k}\right]\overset{(iii)}{=}\frac{1}{n}\delta^{ki},

where $\delta^{ki}$ is the Kronecker delta with two superscripts.

Similarly, it holds that $\mathbb{E}\left[w_{j}w_{l}\right]\overset{(iv)}{=}\frac{1}{n}\delta_{jl}$, where $\delta_{jl}$ is the Kronecker delta with two subscripts. Since $v$ and $w$ are independent, $(iii)$ and $(iv)$ give

\mathbb{E}\left[v^{i}A_{i}^{j}w_{j}v^{k}w_{l}\right]=\frac{1}{n^{2}}A_{i}^{j}\delta^{ik}\delta_{jl}=\frac{1}{n^{2}}A_{l}^{k},

which concludes the proof. ∎

With Lemma 3, we can see that the estimator in (9) satisfies

\mathbb{E}\left[\widehat{\mathrm{H}}f(p;v,w;\delta)\right]
\overset{(i)}{\approx}\frac{n^{2}}{8\delta^{2}}\mathbb{E}\left[4\delta^{2}\left(v^{\top}\nabla^{2}f(p)w\right)\left(vw^{\top}+wv^{\top}\right)\right]
=\frac{n^{2}}{2}\left(\mathbb{E}\left[\left(v^{\top}\nabla^{2}f(p)w\right)wv^{\top}\right]+\mathbb{E}\left[\left(w^{\top}\left[\nabla^{2}f(p)\right]^{\top}v\right)vw^{\top}\right]\right)
\overset{(ii)}{=}\frac{1}{2}\nabla^{2}f(p)+\frac{1}{2}\left[\nabla^{2}f(p)\right]^{\top}
=\nabla^{2}f(p), \qquad (11)

where $(i)$ uses (10), and $(ii)$ uses Lemma 3. The above argument gives a random-projection derivation of the estimator (9).
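Lemma 3 is also easy to verify numerically. The sketch below (illustrative; $A$ is taken symmetric, as is the case for a Hessian) averages $(v^{\top}Aw)\,wv^{\top}$ over many sphere samples:

import numpy as np

# Monte Carlo check of Lemma 3: E[(v^T A w) w v^T] = A / n^2 for symmetric A.
rng = np.random.default_rng(1)
n, m = 4, 400000
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric, as for a Hessian
acc = np.zeros((n, n))
for _ in range(m):
    v = rng.standard_normal(n); v /= np.linalg.norm(v)
    w = rng.standard_normal(n); w /= np.linalg.norm(w)
    acc += (v @ A @ w) * np.outer(w, v)
print(np.abs(acc / m - A / n**2).max())              # tends to 0 as m grows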

Similar to Feng and Wang (2022), we can obtain an $O(\delta^{2})$ bias bound in Euclidean spaces.

Corollary 1.

Let $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ be a smooth function, and let $\partial^{k}f$ ($k\in\mathbb{N}_{+}$) denote the $k$-th order total derivative of $f$. Let $f$ be four times continuously differentiable, and let there be a constant $L_{4}$ such that $\|\partial^{4}f(x)\|\leq L_{4}$ for all $x\in\mathbb{R}^{n}$, where $\|\cdot\|$ denotes the spectral norm ($\infty$-Schatten norm) of a tensor. Then it holds that

\left\|\mathbb{E}_{v,w}\widehat{\mathrm{H}}f(p;v,w;\delta)-\mathrm{Hess}f(p)\right\|\leq\frac{L_{4}n\delta^{2}}{n+2},\quad\forall p\in\mathbb{R}^{n},\ \delta\in(0,\infty),

where $v,w$ are uniformly sampled from the unit sphere $\mathbb{S}^{n-1}$.

Proof.

First, note that the injectivity radius of a Euclidean space is infinite. By Lemmas 1 and 2, it suffices to consider $\|\mathrm{Hess}\widetilde{f}^{\delta}(p)-\mathrm{Hess}f(p)\|$. In the Euclidean case, for any $u,v,p\in\mathbb{R}^{n}$, Taylor's theorem gives

\mathrm{Hess}f(p+v)(u,u)=\mathrm{Hess}f(p)(u,u)+\partial^{3}f(p)(u,u,v)+\frac{1}{2}\partial^{4}f(p^{\prime})(u,u,v,v)

for some $p^{\prime}$ depending on $p,u,v$. Fixing any unit vector $u\in\mathbb{R}^{n}$, we have

\mathrm{Hess}\widetilde{f}^{\delta}(p)(u,u)-\mathrm{Hess}f(p)(u,u)
=\frac{1}{\delta^{2n}V_{n}^{2}}\int_{v\in\delta\mathbb{B}^{n}}\int_{w\in\delta\mathbb{B}^{n}}\mathrm{Hess}f(p+v+w)(u,u)\,dv\,dw-\mathrm{Hess}f(p)(u,u) \quad\text{($\delta\mathbb{B}^{n}$ is the origin-centered ball with radius $\delta$)}
=\frac{1}{\delta^{2n}V_{n}^{2}}\int_{v\in\delta\mathbb{B}^{n}}\int_{w\in\delta\mathbb{B}^{n}}\left(\partial^{3}f(p)(u,u,v+w)+\frac{1}{2}\partial^{4}f(p^{\prime})(u,u,v+w,v+w)\right)dv\,dw
=\frac{1}{2\delta^{2n}V_{n}^{2}}\int_{v\in\delta\mathbb{B}^{n}}\int_{w\in\delta\mathbb{B}^{n}}\partial^{4}f(p^{\prime})(u,u,v+w,v+w)\,dv\,dw. \quad\text{(by symmetry of the ball $\delta\mathbb{B}^{n}$)}

By symmetry of $\delta\mathbb{B}^{n}$, we know that $\int_{v\in\delta\mathbb{B}^{n}}\int_{w\in\delta\mathbb{B}^{n}}\partial^{4}f(p^{\prime})(u,u,v,w)\,dv\,dw=0$. Since $\|\partial^{4}f(p)\|\leq L_{4}$ for all $p\in\mathbb{R}^{n}$, we have

\left|\int_{v\in\delta\mathbb{B}^{n}}\int_{w\in\delta\mathbb{B}^{n}}\partial^{4}f(p^{\prime})(u,u,v+w,v+w)\,dv\,dw\right|\leq\frac{2L_{4}}{n(n+2)}A_{n-1}^{2}\,\delta^{2n+2},

where $A_{n-1}$ is the surface area of $\mathbb{S}^{n-1}$. Thus we have

\left|\mathrm{Hess}\widetilde{f}^{\delta}(p)(u,u)-\mathrm{Hess}f(p)(u,u)\right|\leq\frac{L_{4}n\delta^{2}}{n+2}.

Since the above inequality holds for an arbitrary unit vector $u$, the proof is complete. ∎

4.2 Zeroth Order Hessian Inversion

4.2.1 Hessian Adjugate Estimation by Cramer’s Rule

Cramer's rule states that the inverse of a nonsingular matrix $A$ equals

A^{-1}=\frac{1}{\det(A)}\mathrm{adj}(A),

where $\mathrm{adj}(A)$ is the adjugate of the matrix $A$. Recall that the adjugate of $A$ is

\mathrm{adj}(A)=\left[(-1)^{i+j}M_{ji}\right]_{1\leq i,j\leq n},

where $M_{ji}=\det\left(A_{-ji}\right)$ and $A_{-ji}$ is the submatrix of $A$ obtained by removing the $j$-th row and the $i$-th column. As suggested by Cramer's rule, one can estimate the inverse of the Hessian (up to scaling) by first estimating the unsigned minors of the Hessian and then gathering the minors into a matrix. This estimation procedure is summarized in Algorithm 1.

Algorithm 1 Cramer-Hessian-Adjugate (CHA)
1: Input: number of samples $m$; finite difference step size $\delta$; location for estimation $x$.
2: Uniformly and independently sample $\{(v_{k,ij,ab},w_{k,ij,ab})\}_{1\leq k\leq m,\,1\leq i,j\leq n,\,1\leq a,b\leq n}$ from $\mathbb{S}^{n-1}$ (the unit sphere in $\mathbb{R}^{n}$).
3: For all $i,j\in[n]$ and $k\in[m]$, create $n^{2}$ estimators of the submatrix $\left[\nabla^{2}\widetilde{f}^{\delta}(x)\right]_{-ij}$ by
\widehat{\mathrm{S}}_{k,ij,ab}=\left[\widehat{\mathrm{H}}f(x;v_{k,ij,ab},w_{k,ij,ab};\delta)\right]_{-ij},\quad\forall 1\leq i,j,a,b\leq n,\;\forall k\in[m].
4: Create estimators of $\left[\nabla^{2}\widetilde{f}^{\delta}(x)\right]_{-ij}$ (written $\widehat{\mathrm{S}}_{k,ij}$) such that the $(a,b)$-th entry of $\widehat{\mathrm{S}}_{k,ij}$ is the $(a,b)$-th entry of $\widehat{\mathrm{S}}_{k,ij,ab}$ for all $a,b$.
5: /* In practice, one can use entry-wise estimators in place of the estimators in Step 3. See Section 5.1.1 for more details on entry-wise Hessian estimators. */
6: For all $i,j\in[n]$, estimate the unsigned minors by
\widehat{M}_{ij}=\frac{1}{m}\sum_{k=1}^{m}\det\left(\widehat{\mathrm{S}}_{k,ij}\right).
/* The determinant can be computed via LU decomposition, QR decomposition, or similar methods. */
7: Estimate the adjugate of the Hessian by
\overline{\mathrm{A}}_{m}\widetilde{f}^{\delta}(x)=\left[(-1)^{i+j}\widehat{M}_{ji}\right]. \qquad (12)
8: Output: $\texttt{CHA}(m,\delta,x)=\overline{\mathrm{A}}_{m}\widetilde{f}^{\delta}(x)$.
Proposition 2.

Let $\texttt{CHA}(m,\delta,x)$ be the estimator returned by Algorithm 1. If $\nabla^{2}\widetilde{f}^{\delta}(x)$ is nonsingular, it holds that

\mathbb{E}\left[\texttt{CHA}(m,\delta,x)\right]=\det\left(\nabla^{2}\widetilde{f}^{\delta}(x)\right)\nabla^{-2}\widetilde{f}^{\delta}(x),

where $\nabla^{-2}\widetilde{f}^{\delta}(x):=\left[\nabla^{2}\widetilde{f}^{\delta}(x)\right]^{-1}$.

Proof.

We use the notations defined in Algorithm 1. By Lemma 1, we know that, for all $i,j,a,b\in[n]$ and $k\in[m]$,

\mathbb{E}\left[\widehat{\mathrm{S}}_{k,ij}\right]=\mathbb{E}\left[\widehat{\mathrm{S}}_{k,ij,ab}\right]=\left[\mathbb{E}\left[\widehat{\mathrm{H}}f(x;v_{k,ij,ab},w_{k,ij,ab};\delta)\right]\right]_{-ij}=\left[\nabla^{2}\widetilde{f}^{\delta}(x)\right]_{-ij}.

Since (i) the determinant of a matrix can be expressed in terms of multiplications and additions of its entries, and (ii) all entries of $\widehat{\mathrm{S}}_{k,ij}$ are independent, we have

\mathbb{E}\left[\det\left(\widehat{\mathrm{S}}_{k,ij}\right)\right]=\det\left(\mathbb{E}\left[\widehat{\mathrm{S}}_{k,ij}\right]\right).

By Cramer's rule and the above result, it holds that

\mathbb{E}\left[\texttt{CHA}(m,\delta,x)\right]=\mathbb{E}\left[(-1)^{i+j}\widehat{M}_{ji}\right]=\mathrm{adj}\left(\nabla^{2}\widetilde{f}^{\delta}(x)\right)=\det\left(\nabla^{2}\widetilde{f}^{\delta}(x)\right)\nabla^{-2}\widetilde{f}^{\delta}(x). ∎

The biggest advantage of the CHA method is that it gives an unbiased estimator of the adjugate of $\nabla^{2}\widetilde{f}^{\delta}(x)$. Moreover, Proposition 2 holds true even if $\nabla^{2}\widetilde{f}^{\delta}(x)$ is not definite. However, a shortcoming of the CHA method is its computational expense. For this reason, we introduce the following zeroth-order Hessian inversion method for a special class of Hessian matrices.
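A minimal sketch of Algorithm 1 follows (illustrative naming; `hess_est` stands for any one-draw Hessian estimator such as (9)). Note that every entry of every submatrix uses an independent draw; this independence is exactly what makes $\mathbb{E}[\det(\cdot)]=\det(\mathbb{E}[\cdot])$ in Proposition 2 go through, at the cost of many function evaluations.

import numpy as np

def cha(f, x, m, delta, rng, hess_est):
    """Sketch of Algorithm 1 (CHA): adjugate estimation via unsigned minors."""
    n = x.shape[0]
    M = np.zeros((n, n))                        # estimated unsigned minors
    for i in range(n):
        for j in range(n):
            rows = [r for r in range(n) if r != i]
            cols = [c for c in range(n) if c != j]
            dets = []
            for _ in range(m):
                S = np.empty((n - 1, n - 1))
                for a in range(n - 1):
                    for b in range(n - 1):
                        H = hess_est(f, x, delta, rng)   # independent draw per entry
                        S[a, b] = H[rows[a], cols[b]]
                dets.append(np.linalg.det(S))            # e.g. via LU decomposition
            M[i, j] = np.mean(dets)
    idx = np.arange(n)
    signs = (-1.0) ** (idx[:, None] + idx[None, :])
    return signs * M.T                           # assemble the adjugate as in (12)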

4.2.2 Hessian Inverse Estimation by Neumann Series

An approach for computing the inverse of the Hessian is via the Neumann series. For an invertible matrix $A$ satisfying $\lim_{p\rightarrow\infty}\left(I-A\right)^{p}=0$, the Neumann series expands the inverse of $A$ as

A^{-1}=\sum_{p=0}^{\infty}\left(I-A\right)^{p}.

From this observation, we can first estimate the Hessian, and then estimate its inverse by the Neumann series. Previously, Agarwal et al. (2017) studied fast Neumann-series-based Hessian inversion using first-order information. Here a similar result can be obtained using zeroth-order information only. This zeroth-order extension of (Agarwal et al., 2017) is summarized in Algorithm 2.

Algorithm 2 Neumann-Hessian-Inverse (NHI)
1: Input: numbers of samples $(m_{1},m_{2},m_{3})$; finite difference step size $\delta$; location for estimation $x$.
2: Uniformly and independently sample $\{(v_{ijk},w_{ijk})\}_{1\leq i\leq m_{1},1\leq j\leq m_{2},1\leq k\leq m_{3}}$ from $\mathbb{S}^{n-1}$.
3: For all $i,j,k$, compute $\widehat{\mathrm{H}}f(x;v_{ijk},w_{ijk};\delta)$, as defined in (9).
4: Compute
\overline{\mathrm{H}}_{m_{1},m_{2},m_{3}}^{-1}\widetilde{f}^{\delta}(x)=\frac{1}{m_{1}}\sum_{i=1}^{m_{1}}\left(I+\sum_{h=1}^{m_{2}}\prod_{j=1}^{h}\left(I-\frac{1}{m_{3}}\sum_{k=1}^{m_{3}}\widehat{\mathrm{H}}f(x;v_{ijk},w_{ijk};\delta)\right)\right). \qquad (13)
5: Output: $\texttt{NHI}(m_{1},m_{2},m_{3},\delta,x)=\overline{\mathrm{H}}_{m_{1},m_{2},m_{3}}^{-1}\widetilde{f}^{\delta}(x)$.
Proposition 3.

Suppose $f$ is twice differentiable, $\alpha$-strongly convex, and $\beta$-smooth with $\beta<1$. Then it holds that

\left\|\mathbb{E}\left[\texttt{NHI}(m_{1},m_{2},m_{3},\delta,x)\right]-\nabla^{-2}\widetilde{f}^{\delta}(x)\right\|\leq\frac{\left(1-\alpha\right)^{m_{2}+1}}{\alpha}, \qquad (14)

where $\nabla^{-2}\widetilde{f}^{\delta}(x):=\left[\nabla^{2}\widetilde{f}^{\delta}(x)\right]^{-1}$.

Proof.

Since $f$ is $\alpha$-strongly convex, it holds that, for any $x,y,v,w\in\mathbb{R}^{n}$,

f(x+v+w)\geq f(y+v+w)+\left(x-y\right)^{\top}\nabla f(y+v+w)+\frac{\alpha}{2}\left\|x-y\right\|^{2}.

Integrating over both $v$ and $w$ in $\delta\mathbb{B}^{n}$ gives

\widetilde{f}^{\delta}(x)\geq\widetilde{f}^{\delta}(y)+(x-y)^{\top}\nabla\widetilde{f}^{\delta}(y)+\frac{\alpha}{2}\left\|x-y\right\|^{2},

where we use the dominated convergence theorem to interchange the integral and the gradient operator. This shows that $\widetilde{f}^{\delta}$ is also $\alpha$-strongly convex.

Since a differentiable function $f$ is $\beta$-smooth if and only if $f(x)\leq f(y)+\nabla f(y)^{\top}\left(x-y\right)+\frac{\beta}{2}\left\|x-y\right\|^{2}$ for all $x,y\in\mathbb{R}^{n}$, one can show that $\widetilde{f}^{\delta}$ is $\beta$-smooth by repeating the above argument.

For $\texttt{NHI}(m_{1},m_{2},m_{3},\delta,x)$, we have

\mathbb{E}\left[\texttt{NHI}(m_{1},m_{2},m_{3},\delta,x)\right]=I+\sum_{h=1}^{m_{2}}\prod_{j=1}^{h}\left(I-\mathbb{E}\left[\widehat{\mathrm{H}}f(x;v_{ijk},w_{ijk};\delta)\right]\right)=\sum_{j=0}^{m_{2}}\left(I-\nabla^{2}\widetilde{f}^{\delta}(x)\right)^{j}.

Since $\widetilde{f}^{\delta}$ is $\alpha$-strongly convex, $\beta$-smooth ($\beta<1$), and twice differentiable, we have

0\preccurlyeq I-\nabla^{2}\widetilde{f}^{\delta}(x)\preccurlyeq\left(1-\alpha\right)I.

Thus we can bound the bias by

\left\|\mathbb{E}\left[\texttt{NHI}(m_{1},m_{2},m_{3},\delta,x)\right]-\nabla^{-2}\widetilde{f}^{\delta}(x)\right\|\leq\sum_{j=m_{2}+1}^{\infty}(1-\alpha)^{j}=\frac{\left(1-\alpha\right)^{m_{2}+1}}{\alpha}. ∎
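A minimal sketch of Algorithm 2 (illustrative naming; as above, `hess_est` is any one-draw estimator such as (9)):

import numpy as np

def nhi(f, x, m1, m2, m3, delta, rng, hess_est):
    """Sketch of Algorithm 2 (NHI): the truncated Neumann series (13)."""
    n = x.shape[0]
    total = np.zeros((n, n))
    for _ in range(m1):                          # outer average over i
        acc, prod = np.eye(n), np.eye(n)
        for _ in range(m2):                      # Neumann terms h = 1, ..., m2
            Hbar = sum(hess_est(f, x, delta, rng) for _ in range(m3)) / m3
            prod = prod @ (np.eye(n) - Hbar)     # running product over j
            acc += prod
        total += acc
    return total / m1

# For an alpha-strongly convex, beta-smooth f with beta < 1, the output approaches
# the inverse Hessian of the smoothed function as m2 and m3 grow (Proposition 3).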

5 Existing Methods and Experiments

5.1 Existing Methods for Hessian Estimation

5.1.1 Hessian Estimation via Collecting Single Entry Estimations

In the Euclidean case, one can fix a canonical coordinate system $\{\bm{e}_{i}\}_{i\in[n]}$, and the $(i,j)$-th entry of the Hessian matrix of $f$ at $x$ can be estimated by

\widehat{\mathrm{H}}_{ij}^{\mathrm{entry}}f(x;\delta)=\frac{1}{4\delta^{2}}\big(f(x+\delta\bm{e}_{i}+\delta\bm{e}_{j})-f(x+\delta\bm{e}_{i}-\delta\bm{e}_{j})-f(x-\delta\bm{e}_{i}+\delta\bm{e}_{j})+f(x-\delta\bm{e}_{i}-\delta\bm{e}_{j})\big). \qquad (15)

One can then gather the entries to obtain a Hessian estimator:

\widehat{\mathrm{H}}^{\mathrm{entry}}f(x;\delta)=\left[\widehat{\mathrm{H}}_{ij}^{\mathrm{entry}}f(x;\delta)\right]_{i,j\in[n]}. \qquad (16)
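A sketch of the entry-wise estimator (15)-(16) (illustrative naming), which makes the $4n^{2}$-evaluation cost explicit:

import numpy as np

def hess_est_entry(f, x, delta):
    """Entry-wise estimator (15)-(16): 4 n^2 function evaluations in dimension n."""
    n = x.shape[0]
    E = np.eye(n)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + delta * (E[i] + E[j])) - f(x + delta * (E[i] - E[j]))
                       - f(x + delta * (E[j] - E[i])) + f(x - delta * (E[i] + E[j]))
                       ) / (4 * delta**2)
    return H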

This estimator perhaps dates back to the classical era when finite difference principles were first used. Yet it needs at least $\Omega(n^{2})$ zeroth-order samples to produce a single estimate in an $n$-dimensional space. Balasubramanian and Ghadimi (2021) designed a Hessian estimator based on Stein's identity (Stein, 1981) which needs only $O(1)$ zeroth-order function evaluations. This method is discussed in the next section.

5.1.2 Hessian Estimation via Stein's Identity

A classic result for Hessian computation is Stein's identity (named after Charles Stein), stated below.

Theorem 2 (Stein’s identity).

Consider a smooth function f:nf:\mathbb{R}^{n}\rightarrow\mathbb{R}. For any point xnx\in\mathbb{R}^{n}, it holds that

2f(x)=12𝔼[(uuI)Duuf(x)],\displaystyle\nabla^{2}f(x)=\frac{1}{2}\mathbb{E}\left[\left(uu^{\top}-I\right)D_{uu}f(x)\right],

where u\sim\mathcal{N}(0,I) and

D_{uu}f(x)=\lim_{\tau\rightarrow 0}\frac{f(x+\tau u)-2f(x)+f(x-\tau u)}{\tau^{2}}.
Proof. For completeness, a short proof of Theorem 2 is provided in Appendix A. ∎

One can estimate the Hessian using Stein's identity (Balasubramanian and Ghadimi, 2021):

H^Steinf(x;u;δ)=f(x+δu)2f(x)+f(xδu)2δ2(uuI),\displaystyle\widehat{\mathrm{H}}^{\mathrm{Stein}}{f}(x;u;\delta)=\frac{{f}(x+\delta u)-2{f}(x)+{f}(x-\delta u)}{2\delta^{2}}(uu^{\top}-I), (17)

where u𝒩(0,I)u\sim\mathcal{N}(0,I) is a standard Gaussian vector. A bias bound for (17) is in Theorem 3.
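The following is a minimal Euclidean sketch of (17), averaged over m independent Gaussian directions (the averaging loop and the function name are ours):

import numpy as np

def stein_hessian(f, x, delta, m, rng=np.random.default_rng()):
    """Average of m Stein-identity Hessian estimates, cf. (17).
    Each sample spends 3 evaluations: f(x + delta*u), f(x), f(x - delta*u)."""
    n = x.size
    H = np.zeros((n, n))
    for _ in range(m):
        u = rng.standard_normal(n)
        d2 = (f(x + delta * u) - 2.0 * f(x) + f(x - delta * u)) / (2.0 * delta ** 2)
        H += d2 * (np.outer(u, u) - np.eye(n))
    return H / m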

Theorem 3 (Li et al., (2020); Balasubramanian and Ghadimi, (2021)).

Let f:\mathbb{R}^{n}\rightarrow\mathbb{R} have L_{2}-Lipschitz Hessian: there exists a constant L_{2} such that \|\nabla^{2}f(x)-\nabla^{2}f(x^{\prime})\|\leq L_{2}\|x-x^{\prime}\| for all x,x^{\prime}\in\mathbb{R}^{n}. The estimator in (17) satisfies

𝔼[H^Steinf(x;u;δ)]2f(x)L2(n+6)52δ4,\displaystyle\left\|\mathbb{E}\left[\widehat{\mathrm{H}}^{\mathrm{Stein}}{f}(x;u;\delta)\right]-\nabla^{2}f(x)\right\|\leq\frac{L_{2}\left(n+6\right)^{\frac{5}{2}}\delta}{4},

for any x\in\mathbb{R}^{n}.

The estimator (17) improves on the entry-wise estimator in the sense that only O(1) samples are needed to produce an estimate. However, its error bound in Theorem 3 is worse than that of (9) in Theorem 1. A more detailed discussion of the error bounds is in Remark 1.

Remark 1.

Our estimator (9) and the Stein-type estimator (17) use different effective finite-difference step sizes. More specifically, \mathbb{E}_{v,w\overset{i.i.d.}{\sim}\mathbb{S}^{n-1}}[\delta\|v+w\|]=\Theta(\delta) for (9), while \mathbb{E}_{u\sim\mathcal{N}(0,I_{n})}[\delta\|u\|]=\Theta(\sqrt{n}\delta) for (17). To compare the bias bounds for (9) and (17) at the same expected finite-difference step size, one replaces \delta by \delta/\sqrt{n} in Theorem 3. After this rescaling, the bias bound for the Stein-type estimator (17) is O(L_{2}n^{2}\delta), which is worse than the bias bound for our estimator (9). In the experiments, we downscale the finite-difference step size accordingly in all results for the Stein-type estimator.
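This step-size discrepancy is easy to verify numerically; a quick Monte Carlo sketch (the dimension and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, N = 8, 100_000

v = rng.standard_normal((N, n))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform on the unit sphere
w = rng.standard_normal((N, n))
w /= np.linalg.norm(w, axis=1, keepdims=True)
u = rng.standard_normal((N, n))                  # standard Gaussian

print(np.linalg.norm(v + w, axis=1).mean())  # Theta(1): about 1.4 for n = 8
print(np.linalg.norm(u, axis=1).mean())      # Theta(sqrt(n)): about 2.7 for n = 8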

As discussed in Remark 1, another difference between (9) and (17) is that they sample random vectors from different distributions: (9) uses uniformly random vectors from the unit sphere, while (17) uses standard Gaussian vectors. Higher moments of uniformly random unit vectors are smaller than those of Gaussian vectors with the same expected norm. More specifically, the k-th moment of the norm of a standard Gaussian vector v\sim\mathcal{N}(0,I_{n}), downscaled by a factor of \sqrt{n}, is

nk/2𝔼[vk]\displaystyle\;n^{-k/2}\mathbb{E}\left[\|v\|^{k}\right]
=\;n^{-k/2}\left(2\pi\right)^{-n/2}\int_{0}^{\pi}\int_{0}^{\pi}\cdots\int_{0}^{2\pi}\int_{0}^{\infty}r^{k+n-1}e^{-\frac{r^{2}}{2}}\,dr\,\sin^{n-2}(\varphi_{1})\sin^{n-3}(\varphi_{2})\cdots\sin(\varphi_{n-2})\,d\varphi_{1}\,d\varphi_{2}\cdots d\varphi_{n-1}
=\displaystyle= nk/2(2π)n/2An0rk+n1er22𝑑r\displaystyle\;n^{-k/2}\left(2\pi\right)^{-n/2}A_{n}\int_{0}^{\infty}r^{k+n-1}e^{-\frac{r^{2}}{2}}\,dr\, (AnA_{n} is the surface area of the Euclidean unit sphere 𝕊n1\mathbb{S}^{n-1})
=\displaystyle= nk/2(2π)n/22πn/2Γ(n2)2n+k21Γ(k+n2)\displaystyle\;n^{-k/2}\left(2\pi\right)^{-n/2}\frac{2\pi^{n/2}}{\Gamma\left(\frac{n}{2}\right)}2^{\frac{n+k}{2}-1}\Gamma\left(\frac{k+n}{2}\right)
=\displaystyle= nk/22k2Γ(k+n2)Γ(n2)\displaystyle\;n^{-k/2}2^{\frac{k}{2}}\frac{\Gamma\left(\frac{k+n}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}
\displaystyle\sim nk/22k22πk+n2(k+n2e)k+n22πn2(n2e)n2\displaystyle\;n^{-k/2}2^{\frac{k}{2}}\frac{{\sqrt{\frac{2\pi}{\frac{k+n}{2}}}}\,{\left({\frac{\frac{k+n}{2}}{e}}\right)}^{\frac{k+n}{2}}}{{\sqrt{\frac{2\pi}{\frac{n}{2}}}}\,{\left({\frac{\frac{n}{2}}{e}}\right)}^{\frac{n}{2}}} (by Stirling’s approximation)
=\displaystyle= (en2)k/2nn+k(n+k2)n/2+k/2(n2)n/2\displaystyle\;\left(\frac{en}{2}\right)^{-k/2}\sqrt{\frac{n}{n+k}}\left(\frac{n+k}{2}\right)^{n/2+k/2}\left(\frac{n}{2}\right)^{-n/2}

which grows rapidly with k for any fixed n. In contrast, every moment of the norm of a vector sampled uniformly from the unit sphere equals 1. This difference suggests that our estimator tends to have smaller higher-order moments.
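This moment comparison can also be checked numerically; a small sketch (the dimension, moment order, and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, k, N = 8, 6, 200_000

g = rng.standard_normal((N, n))
# k-th moment of ||g|| / sqrt(n); the closed form above gives 1.875 for n=8, k=6
print(np.mean((np.linalg.norm(g, axis=1) / np.sqrt(n)) ** k))
# For unit-sphere samples the norm is identically 1, so every moment equals 1.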

5.2 Numerical Results

We test the performance of our estimator against the previous two estimators in noisy environments. Before proceeding, we redefine the estimators so that they are compared on equal footing and noise is properly taken into account. The estimators we empirically study are:

1. Our new estimator:

\widehat{\mathrm{H}}^{\text{new}}f(p;m;\delta)=\frac{1}{\lfloor m/4\rfloor}\sum_{k=1}^{\lfloor m/4\rfloor}\frac{n^{2}}{\delta^{2}}\big{[}\epsilon_{k}+f({\mathrm{Exp}}_{p}(\delta v_{k}+\delta w_{k}))-f({\mathrm{Exp}}_{p}(-\delta v_{k}+\delta w_{k}))
-f({\mathrm{Exp}}_{p}(\delta v_{k}-\delta w_{k}))+f({\mathrm{Exp}}_{p}(-\delta v_{k}-\delta w_{k}))\big{]}(v_{k}\otimes w_{k}+w_{k}\otimes v_{k}), (18)

where v_{k},w_{k}\overset{i.i.d.}{\sim}\mathbb{S}_{p}, and \epsilon_{k} is mean-zero noise independent of all other randomness. A minimal implementation sketch of this estimator, in the Euclidean case, is given right after this list.

2. The Stein-type estimator:

\widehat{\mathrm{H}}^{\text{Stein}}f(p;m;\delta)=\frac{1}{\lfloor m/3\rfloor}\sum_{k=1}^{\lfloor m/3\rfloor}\frac{n}{2\delta^{2}}\left[f\left({\mathrm{Exp}}_{p}\left(\frac{\delta u_{k}}{\sqrt{n}}\right)\right)-2f(p)+f\left({\mathrm{Exp}}_{p}\left(\frac{-\delta u_{k}}{\sqrt{n}}\right)\right)+\epsilon_{k}\right]\cdot(u_{k}\otimes u_{k}-I), (19)

where u_{k}\overset{i.i.d.}{\sim}\mathcal{N}(0,I) (the standard Gaussian in T_{p}\mathcal{M}), I is the identity map from T_{p}\mathcal{M} to itself (as a bilinear form, I(u,v)=\langle u,v\rangle_{p} for all u,v\in T_{p}\mathcal{M}), and \epsilon_{k} is mean-zero noise independent of all other randomness.

3. The entry-wise estimator:

    H^entryf(p;m;δ)=[H^ijentryf(p;m;δ)]i,j[n],\displaystyle\widehat{\mathrm{H}}^{\text{entry}}f(p;m;\delta)=\left[\widehat{\mathrm{H}}_{ij}^{\text{entry}}f(p;m;\delta)\right]_{i,j\in[n]}, (20)

    where

\widehat{\mathrm{H}}_{ij}^{\text{entry}}f(p;m;\delta)=\frac{1}{\lfloor m/(4n^{2})\rfloor}\cdot\frac{1}{4\delta^{2}}\sum_{k=1}^{\lfloor m/(4n^{2})\rfloor}\big{(}f({\mathrm{Exp}}_{p}(\delta\bm{e}_{i}+\delta\bm{e}_{j}))-f({\mathrm{Exp}}_{p}(\delta\bm{e}_{i}-\delta\bm{e}_{j}))
-f({\mathrm{Exp}}_{p}(-\delta\bm{e}_{i}+\delta\bm{e}_{j}))+f({\mathrm{Exp}}_{p}(-\delta\bm{e}_{i}-\delta\bm{e}_{j}))+\epsilon_{k}\big{)},

\{\bm{e}_{i}\}_{i} is an orthonormal basis of T_{p}\mathcal{M} (the coordinate frame of a normal coordinate system at p), and \epsilon_{k} is mean-zero noise independent of all other randomness.
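As promised above, here is a minimal sketch of (18) in the Euclidean case of manifold (I), where Exp_p reduces to the identity; the function name, the noise injection, and the use of normalized Gaussians to sample from the unit sphere are ours.

import numpy as np

def new_hessian_estimator(f, p, m, delta, noise_std=0.05,
                          rng=np.random.default_rng()):
    """Averaged four-point estimator (18) on R^n, where Exp_p is the identity.
    Each of the m // 4 rounds draws v, w uniformly from the unit sphere (as
    normalized Gaussians) and spends 4 function evaluations; noise_std mimics
    the combined evaluation noise (set it to 0 if f is already noisy)."""
    n = p.size
    H = np.zeros((n, n))
    rounds = m // 4
    for _ in range(rounds):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        diff = (f(p + delta * (v + w)) - f(p + delta * (w - v))
                - f(p + delta * (v - w)) + f(p - delta * (v + w))
                + noise_std * rng.standard_normal())
        H += (n ** 2 / delta ** 2) * diff * (np.outer(v, w) + np.outer(w, v))
    return H / rounds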

Table 1: Manifolds used for testing. The local metric near pp is implicitly specified by the exponential map.
Manifold | p (p\in\mathbb{R}^{n+1}) | h(x), x\in T_{p}\mathcal{M}\cong\mathbb{R}^{n} | {\mathrm{Exp}}_{p}(v)
(I) | p=0 | h(x)=0 | (v,h(v))
(II) | p=0 | h(x)=1-\sqrt{1-\sum_{i=1}^{n}x_{i}^{2}} | (v,h(v))
(III) | p=0 | h(x)=\sum_{i=1}^{n/2}x_{i}^{2}-\sum_{i=n/2+1}^{n}x_{i}^{2} | (v,h(v))
Figure 1: Results for Manifold (I). Each violin plot summarizes the estimation errors of 100 estimations; each estimation uses m=3840 function evaluations. On the x-axis, “Our method” corresponds to our estimator (18), “Stein’s” to the Stein-type estimator (19), and “Entry-wise” to the entry-wise estimator (20). Subfigures (a), (b), and (c) correspond to \delta=0.05, \delta=0.1, and \delta=0.2, respectively.
Table 2: Timing results in seconds, rounded to 10^{-4}. In the table, “Our method” stands for the estimator (18), and “Stein’s” stands for the estimator (19). The time consumption of each estimator is divided into three parts: (1) sampling time, used for generating the random vectors (uniformly random unit vectors for our method, standard Gaussian vectors for the Stein-type method); (2) evaluation time, used for accessing function values; and (3) computation time, used for matrix manipulations (e.g., outer product computation). In the timing experiments, both estimators (18) and (19) use m=10{,}000, \delta=0.05 and n=8. All timing results are averaged over 10 runs and presented in “mean \pm standard deviation” format. The last two columns are cumulative, to avoid fast memory access to saved data.
 | Sampling | Sampling + Evaluation | Sampling + Evaluation + Computation
Our method | 0.1580 \pm 0.0031 | 0.5763 \pm 0.0036 | 0.6977 \pm 0.0040
Stein’s | 0.0257 \pm 0.0012 | 0.3020 \pm 0.0024 | 0.4222 \pm 0.0028

Strictly speaking, the noise terms \epsilon_{k} corrupt all zeroth-order function value observations. For example, in (18) the expression \epsilon_{k}+f({\mathrm{Exp}}_{p}(\delta v_{k}+\delta w_{k}))-f({\mathrm{Exp}}_{p}(-\delta v_{k}+\delta w_{k}))-f({\mathrm{Exp}}_{p}(\delta v_{k}-\delta w_{k}))+f({\mathrm{Exp}}_{p}(-\delta v_{k}-\delta w_{k})) should be understood as follows: each of the four function values is corrupted by independent mean-zero noise and is not directly observable, and \epsilon_{k} denotes the combined noise. Note that all previously discussed bias bounds continue to hold when the function evaluations are corrupted by independent mean-zero noise.

The above notation puts all the estimators on the same footing: each of \widehat{\mathrm{H}}^{\text{new}}f(p;m;\delta), \widehat{\mathrm{H}}^{\text{Stein}}f(p;m;\delta) and \widehat{\mathrm{H}}^{\text{entry}}f(p;m;\delta) uses m function value observations and has an expected finite-difference step size of \Theta(\delta). This redefinition is needed because (1) the entry-wise estimator needs more samples to output an estimate, and (2) the default Stein-type method uses a larger expected finite-difference step size, as discussed in Remark 1.

Remark 2.

The estimator we introduced, (18), has a practical advantage over the Stein-identity estimator (19). The reason is that estimators based on Stein's identity require one to explicitly know the identity map from T_{p}\mathcal{M} to itself, and this map may or may not admit an easy numerical representation. For example, for the real Stiefel manifold \mathrm{St}(n,k)=\{X\in\mathbb{R}^{n\times k}:X^{\top}X=I\}, we know that the map

PXZ:=(IXX)Z+12X(XZZX),Zn×k,\displaystyle P_{X}Z:=(I-XX^{\top})Z+\frac{1}{2}X\left(X^{\top}Z-Z^{\top}X\right),\qquad\forall Z\in\mathbb{R}^{n\times k},

is the identity from T_{X}\mathrm{St}(n,k) to itself (e.g., Absil et al., 2009); this map also projects any Z\in\mathbb{R}^{n\times k} onto T_{X}\mathrm{St}(n,k). Extracting a numerical representation of P_{X} may not be easy. By contrast, for any Z_{1},Z_{2}\in T_{X}\mathrm{St}(n,k), computing Z_{1}\otimes Z_{2} is tractable. More specifically, one can use the following procedure to obtain a uniformly random vector from the unit sphere in T_{X}\mathrm{St}(n,k) for a given X\in\mathrm{St}(n,k): (1) sample a matrix G\in\mathbb{R}^{n\times k} with i.i.d. standard Gaussian entries, (2) compute P_{X}G, and (3) normalize P_{X}G with respect to the Frobenius inner product. By rotational invariance (of the standard Gaussian distribution and the Frobenius norm), this procedure outputs a uniformly random unit matrix in T_{X}\mathrm{St}(n,k). Once we have unit vectors in T_{X}\mathrm{St}(n,k), we can numerically compute their tensor products.
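The sampling procedure just described admits a direct implementation; a minimal sketch (the function name is ours):

import numpy as np

def random_unit_tangent_stiefel(X, rng=np.random.default_rng()):
    """Uniformly random Frobenius-unit vector in T_X St(n, k):
    project a Gaussian matrix G through P_X, then normalize."""
    n, k = X.shape
    G = rng.standard_normal((n, k))
    Z = (np.eye(n) - X @ X.T) @ G + 0.5 * X @ (X.T @ G - G.T @ X)
    return Z / np.linalg.norm(Z)  # np.linalg.norm defaults to Frobenius here

# Example: X = np.eye(5)[:, :2] lies in St(5, 2).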

All three methods are tested on the following test function, defined in the standard Cartesian coordinates of \mathbb{R}^{n+1}:

f(x)=i=1n+1cos(xi)+exp(x1x2).\displaystyle f(x)=\sum_{i=1}^{n+1}\cos(x_{i})+\exp(x_{1}x_{2}).

Every function evaluation is corrupted by independent noise sampled from \mathcal{N}(0,0.0025). The estimators are tested over three manifolds in \mathbb{R}^{n+1}; details of the three manifolds are in Table 1. In all settings, we set the total number of function evaluations to m=3840 and the manifold dimension to n=8. The results for manifold (I), the Euclidean case, are in Figure 1; results for manifolds (II) and (III) are in Appendix B. In Figure 1 (and Figures 2 and 3 in Appendix B), the y-axis plots the norm of the difference between the empirical estimate and the ground truth (a sketch of the noisy evaluation oracle follows the formula below):

H^f(p;v,w;δ)Hessf(p).\displaystyle\left\|\widehat{\mathrm{H}}f(p;v,w;\delta)-\mathrm{Hess}{f}(p)\right\|.
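For reference, the following is a minimal sketch of the noisy evaluation oracle used in these experiments, written for manifold (II) with the graph parametrization of Table 1 (the function name is ours):

import numpy as np

def make_noisy_oracle(noise_std=0.05, rng=np.random.default_rng()):
    """Noisy zeroth-order oracle for f(x) = sum_i cos(x_i) + exp(x_1 * x_2),
    evaluated at Exp_p(v) = (v, h(v)) on manifold (II) of Table 1, where
    h(x) = 1 - sqrt(1 - ||x||^2) and p = 0; v lives in T_pM = R^n, ||v|| < 1.
    The noise N(0, 0.0025) has standard deviation 0.05."""
    def f_noisy(v):
        h = 1.0 - np.sqrt(1.0 - np.dot(v, v))
        x = np.append(v, h)  # embed the point into R^{n+1}
        return (np.sum(np.cos(x)) + np.exp(x[0] * x[1])
                + noise_std * rng.standard_normal())
    return f_noisy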

5.3 Time Efficiency Comparison

We compare the time efficiency of our method and the estimator based on Stein's identity. In general, one expects estimators based on Stein's identity to be more time-efficient, mainly because they need only 3 function evaluations per sample instead of 4. In practice, function evaluations may or may not be cheap. When they are expensive, we expect estimators based on Stein's identity to save roughly 1/4 of the time compared to our method. When they are cheap, our estimator (18) is in general more time-consuming as well, since more sampling and more matrix computations need to be carried out.

In Table 2, we compare the running time of (18) and (19). All timing experiments use the same benchmark function and underlying manifold as Figure 1, with n=8, m=10{,}000 and \delta=0.05, and are carried out in an environment with

  • 10 cores and 20 logical processors with a maximum speed of 2.80 GHz;

  • 32GB RAM;

  • Windows 11 22000.832;

  • Python 3.8.8.

6 Conclusion

In this paper, we study Hessian estimators over Riemannian manifolds. We design a new estimator such that, for real-valued analytic functions over an n-dimensional complete analytic Riemannian manifold, it achieves an O(\gamma\delta^{2}) bias, where \gamma depends both on the Levi-Civita connection and the function f, and \delta is the finite-difference step size. A downstream computation, Hessian inversion, is also studied. Empirical studies demonstrate the advantage of our method over existing methods.

Data Availability Statement

No new data were generated or analysed in support of this paper. Code for the experiments is available at https://github.com/wangt1anyu/zeroth-order-Riemann-Hess-code.

References

  • Absil et al., (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
  • Agarwal et al., (2017) Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.
  • Audibert et al., (2010) Audibert, J.-Y., Bubeck, S., and Munos, R. (2010). Best arm identification in multi-armed bandits. In COLT, pages 41–53. Citeseer.
  • Balasubramanian and Ghadimi, (2021) Balasubramanian, K. and Ghadimi, S. (2021). Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, pages 1–42.
  • Bubeck et al., (2009) Bubeck, S., Munos, R., and Stoltz, G. (2009). Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer.
  • Conn et al., (2009) Conn, A. R., Scheinberg, K., and Vicente, L. N. (2009). Introduction to derivative-free optimization. SIAM.
  • Duchi et al., (2015) Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806.
  • Feng and Wang, (2022) Feng, Y. and Wang, T. (2022). Stochastic Zeroth Order Gradient and Hessian Estimators: Variance Reduction and Refined Bias Bounds. arXiv preprint arXiv:2205.14737.
  • Flaxman et al., (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394.
  • Gavrilov, (2007) Gavrilov, A. V. (2007). The double exponential map and covariant derivation. Siberian Mathematical Journal, 48(1):56–61.
  • Goldberg and Holland, (1988) Goldberg, D. E. and Holland, J. H. (1988). Genetic algorithms and machine learning. Machine Learning, 3(2):95–99.
  • Greene and Wu, (1979) Greene, R. E. and Wu, H. (1979). C^{\infty}-approximations of convex, subharmonic, and plurisubharmonic functions. In Annales Scientifiques de l’École Normale Supérieure, volume 12, pages 47–84.
  • Greene and Wu, (1973) Greene, R. E. and Wu, H.-H. (1973). On the subharmonicity and plurisubharmonicity of geodesically convex functions. Indiana University Mathematics Journal, 22(7):641–653.
  • Greene and Wu, (1976) Greene, R. E. and Wu, H.-h. (1976). C^{\infty} convex functions and manifolds of positive curvature. Acta Mathematica, 137(1):209–245.
  • Li et al., (2020) Li, J., Balasubramanian, K., and Ma, S. (2020). Stochastic zeroth-order Riemannian derivative estimation and optimization. arXiv preprint arXiv:2003.11238.
  • Nelder and Mead, (1965) Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The computer journal, 7(4):308–313.
  • Nesterov and Polyak, (2006) Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205.
  • Nesterov and Spokoiny, (2017) Nesterov, Y. and Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566.
  • Petersen, (2006) Petersen, P. (2006). Riemannian geometry, volume 171. Springer.
  • Shahriari et al., (2015) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. (2015). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
  • Spall, (2000) Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE transactions on automatic control, 45(10):1839–1853.
  • Stein, (1981) Stein, C. M. (1981). Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151.
  • Wang et al., (2021) Wang, T., Huang, Y., and Li, D. (2021). From the Greene–Wu Convolution to Gradient Estimation over Riemannian Manifolds. arXiv preprint arXiv:2108.07406.

Appendix A Proof of Theorem 2

Proof of Theorem 2.

Consider 𝔼[ukuhuiujij]\mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\partial_{i}\partial_{j}\right] for any k,h,i,j[n]k,h,i,j\in[n].

When (k,h)=(i,j), one has \mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\partial_{i}\partial_{j}\right]=\mathbb{E}\left[u_{i}^{2}u_{j}^{2}\partial_{i}\partial_{j}\right]. In this case, it holds that

\mathbb{E}\left[u_{i}^{2}u_{j}^{2}\partial_{i}\partial_{j}\right]=\partial_{k}\partial_{h} for i\neq j, and \mathbb{E}\left[u_{i}^{4}\partial_{i}\partial_{i}\right]=3\partial_{k}\partial_{k} for i=j.

When (k,h)(i,j)(k,h)\neq(i,j), i=ji=j and k=hk=h, we have 𝔼[uk2ui2ij]=ii\mathbb{E}\left[u_{k}^{2}u_{i}^{2}\partial_{i}\partial_{j}\right]=\partial_{i}\partial_{i}.

When (k,h)(i,j)(k,h)\neq(i,j), i=ji=j and khk\neq h, we have 𝔼[ukuhuiuj]=0\mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\right]=0.

When (k,h)(i,j)(k,h)\neq(i,j), iji\neq j and k=hk=h, we have 𝔼[ukuhuiuj]=0\mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\right]=0.

When (k,h)(i,j)(k,h)\neq(i,j), iji\neq j, khk\neq h, k=jk=j and h=ih=i, we have 𝔼[ukuhuiujij]=𝔼[ui2uj2hk]=hk\mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\partial_{i}\partial_{j}\right]=\mathbb{E}\left[u_{i}^{2}u_{j}^{2}\partial_{h}\partial_{k}\right]=\partial_{h}\partial_{k}.

When (k,h)(i,j)(k,h)\neq(i,j), iji\neq j, khk\neq h, k=ik=i and h=jh=j, we have 𝔼[ukuhuiujij]=𝔼[ui2uj2kh]=kh\mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\partial_{i}\partial_{j}\right]=\mathbb{E}\left[u_{i}^{2}u_{j}^{2}\partial_{k}\partial_{h}\right]=\partial_{k}\partial_{h}.

When (k,h)(i,j)(k,h)\neq(i,j), iji\neq j, khk\neq h and kjk\neq j, we have 𝔼[ukuhuiuj]=0\mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\right]=0.

When (k,h)(i,j)(k,h)\neq(i,j), iji\neq j, khk\neq h and hih\neq i, we have 𝔼[ukuhuiuj]=0\mathbb{E}\left[u_{k}u_{h}u_{i}u_{j}\right]=0.

Now, using Einstein notation and combining all the above cases gives

𝔼[ukuhuiujij]=\displaystyle\mathbb{E}\left[u^{k}u_{h}u^{i}u_{j}\partial_{i}\partial^{j}\right]= kh(1δkh)+hk(1δhk)+δkhii+2khδkh\displaystyle\;\partial^{k}\partial_{h}(1-\delta_{k}^{h})+\partial^{h}\partial_{k}(1-\delta_{h}^{k})+\delta_{k}^{h}\partial_{i}\partial^{i}+2\partial^{k}\partial_{h}\delta_{k}^{h}
=(i)\displaystyle\overset{(i)}{=}  2kh+δkhii,\displaystyle\;2\partial^{k}\partial_{h}+\delta_{k}^{h}\partial_{i}\partial^{i},

where \delta_{k}^{h} is the Kronecker delta.

Since D_{uu}f(x)=u^{i}u_{j}\partial_{i}\partial^{j}f(x) for all u and x, we can write uu^{\top}D_{uu}f(x)=u^{k}u_{h}u^{i}u_{j}\partial_{i}\partial^{j}f(x). Thus, rearranging terms in (i) gives

𝔼[uuDuuf(x)]=(ii)22f(x)+(Δf(x))I,\displaystyle\mathbb{E}\left[uu^{\top}D_{uu}f(x)\right]\overset{(ii)}{=}2\nabla^{2}f(x)+\left(\Delta f(x)\right)I,

where Δ=ii\Delta=\partial_{i}\partial^{i} is the Laplace operator.

Since \mathbb{E}\left[D_{uu}f(x)\right]=\mathbb{E}\left[u^{i}u_{j}\right]\partial_{i}\partial^{j}f(x)=\delta_{j}^{i}\partial_{i}\partial^{j}f(x)=\Delta f(x), rearranging terms in (ii) concludes the proof. ∎
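As a sanity check of Theorem 2, one can verify the identity numerically on a quadratic, for which D_{uu}f is available in closed form; a small sketch (the test matrix and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 200_000
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                 # f(x) = x^T A x / 2 has Hessian exactly A

est = np.zeros((n, n))
for _ in range(N):
    u = rng.standard_normal(n)
    est += 0.5 * (np.outer(u, u) - np.eye(n)) * (u @ A @ u)  # D_uu f = u^T A u
est /= N
print(np.max(np.abs(est - A)))    # small Monte Carlo error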

Appendix B Additional Experimental Results

Figure 2: Results for Manifold (II). Each violin plot summarizes the estimation errors of 100 estimations; each estimation uses m=3840 function evaluations. On the x-axis, “Our method” corresponds to our estimator (18), “Stein’s” to the Stein-type estimator (19), and “Entry-wise” to the entry-wise estimator (20). Subfigures (a), (b), and (c) correspond to \delta=0.05, \delta=0.1, and \delta=0.2, respectively.
Figure 3: Results for Manifold (III). Each violin plot summarizes the estimation errors of 100 estimations, with the estimators defined in Section 5.2; each estimation uses m=3840 function evaluations. On the x-axis, “Our method” corresponds to our estimator (18), “Stein’s” to the Stein-type estimator (19), and “Entry-wise” to the entry-wise estimator (20). Subfigures (a), (b), and (c) correspond to \delta=0.05, \delta=0.1, and \delta=0.2, respectively.