On Sharp Stochastic Zeroth Order Hessian Estimators over Riemannian Manifolds
Abstract
We study Hessian estimators for functions defined over an n-dimensional complete analytic Riemannian manifold. We introduce new stochastic zeroth-order Hessian estimators that use O(1) function evaluations. We show that, for an analytic real-valued function, our estimator achieves a bias bound of order O(δ²), where the constant in the bound depends on both the Levi-Civita connection and the function, and δ is the finite-difference step size. To the best of our knowledge, our results provide the first bias bound for Hessian estimators that explicitly depends on the geometry of the underlying Riemannian manifold. We also study downstream computations based on our Hessian estimators. The advantage of our method is demonstrated by empirical evaluations.
1 Introduction
Hessian computation is one of the central tasks in optimization, machine learning, and related fields. Understanding the landscape of the objective function is in many cases the first step towards solving a mathematical programming problem, and the Hessian is one of the key quantities that depict the function landscape. Often in real-world scenarios, the objective function is a black box and its Hessian is not directly computable. In such cases, zeroth-order Hessian computation techniques are needed if one wants to understand the function landscape via its Hessian.
To this end, we introduce new zeroth-order methods for estimating a function's Hessian at any given point of an n-dimensional complete analytic Riemannian manifold. For a point on the manifold and an analytic real-valued function defined over it, our Hessian estimator at that point is
(1)
where the exponential map of the manifold is used, the two directions are independently sampled from the unit sphere in the tangent space at the point of interest, the tensor product of the two directions is formed, and the remaining parameter is the finite-difference step size.
Our Hessian estimator satisfies
where the norm is the Schatten (operator) norm defined in Section 2, the sphere is the unit sphere in the tangent space, the Hessian is that of the function at the given point, and the covariant derivative is the one associated with the Riemannian metric (the Levi-Civita connection).
This bias bound improves previously known results in two ways:
1. It provides, via the Levi-Civita connection, the first bias bound for Hessian estimators that explicitly depends on the local geometry of the underlying space;
2. It significantly improves the best previously known bias bound for O(1)-evaluation Hessian estimators over Riemannian manifolds, whose order is governed by the Lipschitz constant of the Hessian, the dimension of the manifold, and the finite-difference step size. See Remark 1 for details.
We also study downstream computations for our Hessian estimator. More specifically, we introduce novel, provably accurate methods for computing the adjugate and the inverse of the Hessian matrix, all using zeroth-order information only. These zeroth-order computation methods may be used as primitives for further applications. The advantage of our method over existing methods is demonstrated by careful empirical evaluations.
Related Works
Zeroth-order optimization has attracted the attention of many researchers. Under this broad umbrella stand Bayesian optimization methods (see the review article by Shahriari et al. (2015) for an overview), comparison-based methods (e.g., Nelder and Mead, 1965), genetic algorithms (e.g., Goldberg and Holland, 1988), best-arm identification from the multi-armed bandit community (e.g., Bubeck et al., 2009; Audibert et al., 2010), and many others (see the book by Conn et al. (2009) for an overview). Among these zeroth-order optimization schemes, one classic and prosperous line of work focuses on estimating higher-order derivatives using zeroth-order information.
Zeroth-order gradient estimators make up a large portion of the derivative estimation literature. Flaxman et al. (2005) studied a single-point stochastic gradient estimator for the purpose of bandit learning. Duchi et al. (2015) studied stabilization of stochastic gradient estimators via two-point (or multi-point) evaluations. Nesterov and Spokoiny (2017) and Li et al. (2020) studied gradient estimators using Gaussian smoothing and investigated downstream optimization methods using the estimated gradients. Recently, Wang et al. (2021) studied stochastic gradient estimators over Riemannian manifolds through the lens of the Greene-Wu convolution.
Zeroth-order Hessian estimation is also a central topic in derivative estimation. In the control community, gradient-based Hessian estimators were introduced for iterative optimization algorithms, and asymptotic convergence was proved (Spall, 2000). Apart from this asymptotic result, generic non-asymptotic bounds for O(1)-evaluation Hessian estimators were not well investigated until recently. Based on Stein's identity (Stein, 1981), Balasubramanian and Ghadimi (2021) designed the Stein-type Hessian estimator and combined it with the cubic-regularized Newton's method (Nesterov and Polyak, 2006) for non-convex optimization. Li et al. (2020) generalized Stein-type Hessian estimators to Riemannian manifolds embedded in Euclidean spaces. Several authors have also considered the variance and higher-order moments of Hessian (and gradient) estimators (Li et al., 2020; Balasubramanian and Ghadimi, 2021; Feng and Wang, 2022). In particular, Feng and Wang (2022) showed that estimators based on random orthogonal frames from Stiefel manifolds have significantly smaller variance. Yet in the case of non-trivial curvature (Li et al., 2020), no geometry-aware bias bound had been given prior to our work.
2 Preliminaries and Conventions
For better readability, we list here some notations and conventions that will be used throughout the rest of this paper.
• For any point on the manifold, we consider the open set near it that is diffeomorphic to a subset of Euclidean space via the local normal coordinate chart, and we define a distance on this set by pulling back the Euclidean distance through the chart.
• (A0, Analyticity Assumption) Throughout the paper, we assume that both the Riemannian metric and the function of interest are analytic.
• The injectivity radius at a point (written inj) is defined as the radius of the largest geodesic ball that is contained in the normal neighborhood introduced above. (A1, Small Step Size Assumption) Throughout the paper, we assume that the finite-difference step size of the estimator at any point is smaller than the injectivity radius at that point.
• All musical isomorphisms are omitted when there is no confusion.
• For any point and any positive radius, we use sphere (resp. ball) notation to denote the origin-centered sphere (resp. ball) of that radius in the tangent space at the point. For simplicity, we drop the radius from the notation when it equals one. It is worth emphasizing that these spheres and balls live in the tangent space; they are different from geodesic balls, which reside in the manifold.
• For two points within injectivity distance of each other, we use a transport notation to denote the parallel transport from the tangent space at the first point to that at the second along the distance-minimizing geodesic connecting them; a shorthand for the transported vector is also defined. More generally, the notation denotes the parallel transport along the distance-minimizing geodesic between the two points, in any fiber bundle that is compatible with the Riemannian structure.
• We will use the double exponential map notation of Gavrilov (2007): for a point and a pair of tangent vectors for which the construction is defined, the double exponential map first follows the exponential map at the point along the first vector, and then applies the exponential map at the resulting point to the parallel transport of the second vector.
• (Definition of the Hessian; e.g., Petersen, 2006) Over an n-dimensional complete Riemannian manifold, the Hessian of a smooth function at a point is the bilinear form on the tangent space obtained by pairing the covariant derivative of the gradient with a second tangent direction. Since the Levi-Civita connection is torsion-free, the Hessian is symmetric. For a smooth function, its Hessian also satisfies (e.g., Chapter 5.4 of Absil et al., 2009), for any point x and any tangent vector v,
(2) Hess f(x)[v, v] = (d²/dt²) f(exp_x(t v)) evaluated at t = 0.
For simplicity and coherence with the notation in the Euclidean case, we also write the Hessian as a linear operator acting on tangent vectors. (A numerical illustration of (2) is given at the end of this list.)
• Consider a Riemannian manifold, a point on it, and any symmetric bilinear form on the tangent space at that point. The metric-induced Schatten norm (the operator norm) of the form is defined as the supremum of its absolute value over pairs of unit tangent vectors. When it is clear from context, we simply use Schatten norm to refer to this metric-induced Schatten norm.
• Note. When applied to a tangent vector, the norm is the standard norm induced by the Riemannian metric. When applied to a symmetric bilinear form, it is the Schatten norm defined above.
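In the Euclidean case, where the exponential map reduces to straight lines, the characterization (2) can be checked directly. The following sketch is for illustration only; the quadratic test function is chosen arbitrarily.

```python
import numpy as np

# Illustrative check of the geodesic characterization (2) of the Hessian in the
# Euclidean case, where exp_x(t v) = x + t v.  The quadratic test function is
# an arbitrary choice made for illustration.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H_true = A + A.T                       # Hessian of f(x) = 0.5 * x^T (A + A^T) x

def f(x):
    return 0.5 * x @ H_true @ x

x = rng.standard_normal(n)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)                 # unit direction

t = 1e-3
# second finite difference of t -> f(exp_x(t v)) at t = 0
second_derivative = (f(x + t * v) - 2.0 * f(x) + f(x - t * v)) / t ** 2
print(second_derivative, v @ H_true @ v)   # the two values should agree
```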
3 Zeroth Order Hessian Estimation
For a point on the manifold and a finite-difference step size, the Hessian of the function at that point can be estimated by the estimator (1), where the two directions are independently and uniformly sampled from the unit sphere in the tangent space. To study the bias of this estimator, we consider a smoothed function defined as follows.
For a point on the manifold, a smooth real-valued function defined over the manifold, and a positive number, define a smoothed function (at that point) such that
(3)
where the normalizing constant is the volume of the unit ball in the tangent space (the same as the volume of the unit ball in n-dimensional Euclidean space). Smoothings of this kind have been analytically investigated by Greene and Wu (1973, 1976, 1979). We will first show in Lemma 1 that the expectation of the estimator equals the Hessian of this smoothed function. We then derive a bound on the difference between the Hessian of the smoothed function and that of the original function, which yields a bound on the bias of the estimator. Henceforth, we use a shorthand notation for this smoothed function.
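In the Euclidean case, the smoothing in (3) amounts to averaging the function over a ball of radius equal to the step size around the point of interest. The following Monte Carlo sketch is for illustration only; the test function and sample size are arbitrary, and the plain ball average is an assumption consistent with the normalization described above.

```python
import numpy as np

# Monte Carlo approximation of the ball-average smoothing (3) in the Euclidean
# case.  The test function is an arbitrary smooth function chosen for illustration.
rng = np.random.default_rng(1)
n, delta, num_samples = 5, 0.1, 100_000

def f(x):
    return np.sin(x[0]) + 0.5 * x[1] ** 2

def sample_unit_ball(n, size, rng):
    """Uniform samples from the unit ball in R^n."""
    g = rng.standard_normal((size, n))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform on the unit sphere
    r = rng.random(size) ** (1.0 / n)                # radial CDF inversion
    return g * r[:, None]

x = np.zeros(n)
b = sample_unit_ball(n, num_samples, rng)
f_delta = np.mean([f(x + delta * bi) for bi in b])   # smoothed value at x
print(f_delta, f(x))   # close to f(x) when delta is small
```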
Lemma 1.
Consider an n-dimensional complete analytic Riemannian manifold, a point on it, an analytic function, and a small finite-difference step size. If the two directions are independently and uniformly sampled from the unit sphere in the tangent space, then the expectation of the estimator (1) equals the Hessian of the smoothed function (3) at the point:
Proof.
Define the four-point difference appearing in the estimator. By the fundamental theorem of geometric calculus, it holds that (here repeated indices are understood in Einstein's summation notation)
Since and are independently uniformly sampled from , it holds that
where is the surface area of in (same as the surface area of unit sphere in ).
By the dominated convergence theorem and that , we have
More specifically, the operations (or tangent vectors) can be defined in terms of limits, and we can interchange the limit and the integral by the dominated convergence theorem.
Combining , and gives
Combining the above results gives
where the second last equality uses , and last equality uses . ∎
As a result of Lemma 1, a bound on the deviation of the Hessian of the smoothed function from the Hessian of the original function will give a bound on the bias of the estimator. To obtain such a bound, we need to explicitly extend the definition of the smoothed function from a single point to a neighborhood of that point (Wang et al., 2021), so that the Hessian can be computed in a precise manner. For a point, a smooth function, and a positive number, define a function (near the point) such that
(4)
where
(5)
with .
The advantage of defining the smoothed function via (4) and (5) is that it is explicitly defined in a neighborhood of the point, so that we can carry out geometry-aware computations in a precise manner. Next, we verify that (3) and (4) agree with each other in the following proposition.
Proposition 1.
The definitions (3) and (4) agree with each other.
Proof.
At any , we have
where uses , and changes from Cartesian coordinate to hyperspherical coordinate (in ). Since the Levi-Civita connection is compatible with the Riemannian metric, we know that the standard Lebesgue measure in is preserved after transporting to . This implies, for any continuous function defined over , we have . Thus we have, at any ,
where uses . ∎
By Proposition 1, it is sufficient to work with the extended definition (4) and randomize over the unit sphere. For any direction, the Hessian of the smoothed function along that direction can be explicitly written out. This result is given in Lemma 2.
Lemma 2.
Consider an n-dimensional complete analytic Riemannian manifold, a point on it, an analytic function, and a small finite-difference step size. For any unit tangent direction, we have
where .
Proof.
From the definition of Hessian, we have
Thus it is sufficient to fix any and consider
For simplicity, define
Let , and we have
(6) |
provided that the last term converges.
For any , and , define . We can Taylor expand by
From above, we have, for any , and of small norm,
where the second equality uses Taylor expansion at and the last equality uses .
From the above computation, we expand (and similar terms) into infinite series. Thus we can write as
(7) |
where the last equality uses that zeroth-order terms in and first-order terms in all cancel.
Since , we have
(8) |
Gathering the above results gives a bias bound for (1), which is summarized in the following theorem.
Theorem 1.
Consider an n-dimensional complete analytic Riemannian manifold, a point on it, an analytic function, and a small finite-difference step size. For any point and pair of unit tangent vectors, define an auxiliary function such that
The estimator (1) satisfies
where the two directions are independently sampled from the uniform distribution over the unit sphere in the tangent space.
Proof.
By Lemma 2, we have
By the analyticity assumption, we have
For any fixed , we have
where in the last line the terms that are odd in either of the two directions vanish. ∎
By applying (2) twice, we have
Thus by dropping terms of order and noting that , we have
3.1 Example: the n-sphere
We consider the n-sphere embedded in Euclidean space, with the metric induced by the ambient Euclidean space. This space is of both theoretical and practical appeal. In this space, the exponential map is
exp_x(t v) = cos(t) x + sin(t) v,
where v is a unit tangent vector at x and t ≥ 0; the parallel transport along this geodesic sends a tangent vector w at x to
w + (v · w) ((cos(t) − 1) v − sin(t) x),
for any point x on the sphere, unit tangent vector v, and t ≥ 0.
We will consider estimating the Hessian of a simple polynomial in the i-th coordinate, where the i-th coordinate refers to the i-th component of the point in the ambient Euclidean space. This simple function serves as an example of estimating the Hessian of general polynomials over the sphere.
We have
and
Thus it holds that
Since and , we have
This yields a bound on the bias of the Hessian estimator for this function over the n-sphere in terms of the finite-difference granularity.
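The example above can also be explored numerically. The sketch below is for illustration only: it implements the exponential map of the unit sphere embedded in Euclidean space, samples directions uniformly from the unit sphere of the tangent space, and averages a four-point estimator modeled on the Euclidean form (9). The normalization by the squared manifold dimension and the symmetrized outer product are assumptions and may differ from the exact constants in (1), and the test function (the square of the first ambient coordinate) is an arbitrary stand-in for the coordinate polynomial discussed above.

```python
import numpy as np

# Four-point Hessian estimation on the unit sphere S^{m-1} with the induced
# metric, applied to f(x) = x_1^2.  The d^2 / (4 delta^2) normalization and the
# symmetrization are assumptions modeled on the Euclidean estimator (9).
rng = np.random.default_rng(2)
m = 4                      # ambient dimension; the sphere has dimension d = m - 1
d = m - 1
delta = 1e-2

def f(x):
    return x[0] ** 2

def exp_map(x, v):
    """Exponential map of the unit sphere at x, applied to a tangent vector v."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x
    return np.cos(nv) * x + np.sin(nv) * v / nv

def random_unit_tangent(x, rng):
    """Uniform sample from the unit sphere of the tangent space at x."""
    g = rng.standard_normal(x.shape)
    g -= (x @ g) * x                   # project onto the tangent space
    return g / np.linalg.norm(g)

x = rng.standard_normal(m)
x /= np.linalg.norm(x)

num_samples = 200_000
est = np.zeros((m, m))
for _ in range(num_samples):
    u = random_unit_tangent(x, rng)
    v = random_unit_tangent(x, rng)
    diff = (f(exp_map(x, delta * (u + v))) - f(exp_map(x, delta * (u - v)))
            - f(exp_map(x, delta * (v - u))) + f(exp_map(x, -delta * (u + v))))
    est += d ** 2 * diff / (4 * delta ** 2) * 0.5 * (np.outer(u, v) + np.outer(v, u))
est /= num_samples

# Closed-form Riemannian Hessian of f(x) = x_1^2 on the sphere, written as a
# bilinear form on the tangent space (P is the projector onto the tangent space).
P = np.eye(m) - np.outer(x, x)
e1 = np.zeros(m); e1[0] = 1.0
H_true = 2.0 * P @ np.outer(e1, e1) @ P - 2.0 * x[0] ** 2 * P
print(np.linalg.norm(est - H_true, 2))   # decreases as num_samples grows
```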
4 The Euclidean Case
In this section, we focus on numerical stabilization of the estimator and on algorithmic zeroth-order inversion of the estimated Hessian. For numerical and algorithmic purposes, we restrict our attention to the Euclidean case, where we also use matrix notation for the Hessian of the function at a point.
4.1 Stabilizing the Estimate
In the Euclidean case, the estimator in (1) simplifies to
where the two directions are independently and uniformly sampled from the unit sphere in n-dimensional Euclidean space. Its stabilized version is
(9)
To see why (9) stabilizes the estimate, we use Taylor expansion and get
(10)
where denotes the Hessian of .
From the above derivation, we see that (9) removes the dependence on the zeroth-order and first-order information and symmetrizes the estimate (Feng and Wang, 2022). This reduces variance and stabilizes the estimation. A similar phenomenon for gradient estimators is noted by Duchi et al. (2015).
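A minimal sketch of the stabilized estimator (9) follows. The normalization by the squared dimension over four times the squared step size, and the symmetrized outer product, are assumptions consistent with the random-projection identity discussed in the next subsection; the exact constants in (9) may differ.

```python
import numpy as np

# Sketch of the stabilized four-point Euclidean Hessian estimator (9).
# The n^2 / (4 delta^2) normalization and the symmetrized outer product are
# assumptions; they make the estimator unbiased for quadratics (see Lemma 3).
def hessian_estimate(f, x, delta, num_samples, rng):
    n = x.shape[0]
    est = np.zeros((n, n))
    for _ in range(num_samples):
        u = rng.standard_normal(n); u /= np.linalg.norm(u)
        v = rng.standard_normal(n); v /= np.linalg.norm(v)
        diff = (f(x + delta * (u + v)) - f(x + delta * (u - v))
                - f(x + delta * (v - u)) + f(x - delta * (u + v)))
        est += n ** 2 * diff / (4 * delta ** 2) * 0.5 * (np.outer(u, v) + np.outer(v, u))
    return est / num_samples

# Quick check on a quadratic, whose Hessian the estimator should recover.
rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n)); H = A + A.T
f = lambda z: 0.5 * z @ H @ z
H_hat = hessian_estimate(f, np.zeros(n), delta=1e-2, num_samples=100_000, rng=rng)
print(np.linalg.norm(H_hat - H, 2))      # decreases as num_samples grows
```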
4.1.1 A Random Projection Derivation
Similar to gradient estimators (Nesterov and Spokoiny, 2017; Li et al., 2020; Wang et al., 2021; Feng and Wang, 2022), one may also derive the Hessian estimator (9) using a random-projection argument. Here we use a spherical random-projection argument to derive the Hessian estimator. A more thorough study can be found in Feng and Wang (2022). To start with, we first prove an identity for random matrix projection in Lemma 3.
Lemma 3.
Let the two directions be independently and uniformly sampled from the unit sphere in n-dimensional Euclidean space. For any matrix, we have
Proof.
It is sufficient to show for any (Einstein’s notation is used).
Since is uniformly sampled from (the unit sphere in ), for , we have for any . This gives that
By symmetry of the sphere and that , we have for any . Combining and gives
where the symbol is the Kronecker delta with two superscripts.
Similarly, the analogous identity holds with the Kronecker delta with two subscripts. Since the two directions are independent, combining the two identities gives
which concludes the proof. ∎
With Lemma 3, we can see that the estimator in (9) satisfies
(11)
where the first step uses (10) and the second step uses Lemma 3. The above argument gives a random-projection derivation of the estimator (9).
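The identity in Lemma 3 can be verified by direct simulation. The precise form used below, namely that the expectation of the squared dimension times the projected value times the outer product of the two directions recovers the matrix, is an assumption about the display in Lemma 3.

```python
import numpy as np

# Monte Carlo check of the random-projection identity behind Lemma 3:
# for u, v uniform on the unit sphere, E[ n^2 * (u^T A v) * u v^T ] = A.
rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))

num_samples = 200_000
acc = np.zeros((n, n))
for _ in range(num_samples):
    u = rng.standard_normal(n); u /= np.linalg.norm(u)
    v = rng.standard_normal(n); v /= np.linalg.norm(v)
    acc += n ** 2 * (u @ A @ v) * np.outer(u, v)
acc /= num_samples
print(np.linalg.norm(acc - A))   # converges to 0 as the sample size grows
```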
Similar to Feng and Wang (2022), we can obtain a bias bound in Euclidean spaces.
Corollary 1.
Let the function be smooth, with total derivatives of each order defined in the usual way, and assume it is four times continuously differentiable. Let there be a constant bounding the spectral norm (Schatten norm) of the fourth-order derivative at every point. Then it holds that
where the two directions are uniformly sampled from the unit sphere.
Proof.
Firstly, note that the injectivity radius of Euclidean spaces is infinite. By Lemmas 1 and 2, it suffices to consider . In the Euclidean case, for any , Taylor’s theorem gives
for some depending on . Fix any unit vector , we have
(here the ball denotes the origin-centered ball with the corresponding radius)
(by symmetry of the ball)
By symmetry of , we know that . Since for all , we have that
where is the surface area of . Thus we have
We can conclude the proof since the above inequality holds for arbitrary unit vector . ∎
4.2 Zeroth Order Hessian Inversion
4.2.1 Hessian Adjugate Estimation by Cramer’s Rule
Cramer’s rule states that the inverse of a nonsingular matrix equals
where is the adjugate of matrix . Recall the adjugate of matrix is
where and is the submatrix of by removing the -th row and -th column. As suggested by the Cramer’s rule, one can estimate inverse of Hessian (up to scaling) by first estimating the unsigned minors of the Hessian and then gather the minors into a matrix. This estimation procedure is summarized in Algorithm 1.
(12)
Proposition 2.
Proof.
Since (i) the determinant of a matrix can be expressed in terms of multiplications and additions of its entries, and (ii) all entries of the estimated submatrix are independent, we have
By Cramer's rule and the above result, it holds that
∎
The biggest advantage of the CHA method is that it gives an unbiased estimator of the adjugate of the Hessian. Also, Proposition 2 holds true even if the Hessian is not definite. However, a shortcoming of the CHA method is its computational expense. For this reason, we introduce the following zeroth-order Hessian inversion method for a special class of Hessian matrices.
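The assembly step of the CHA method can be sketched as follows. Algorithm 1 prescribes how the Hessian entries are estimated (with independent randomness, so that each cofactor is unbiased); the sketch below only shows the cofactor-based computation of the adjugate, applied to a generic square matrix.

```python
import numpy as np

# Cofactor-based adjugate computation used in the CHA method (assembly step only;
# in Algorithm 1 the input entries would be zeroth-order estimates of the Hessian).
def adjugate(H):
    n = H.shape[0]
    adj = np.zeros_like(H)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(H, i, axis=0), j, axis=1)
            adj[j, i] = (-1) ** (i + j) * np.linalg.det(minor)  # transpose of cofactors
    return adj

# By Cramer's rule, adj(H) = det(H) * H^{-1} for a nonsingular H.
H = np.array([[2.0, 1.0], [1.0, 3.0]])
print(adjugate(H))
print(np.linalg.det(H) * np.linalg.inv(H))
```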
4.2.2 Hessian Inverse Estimation by Neumann Series
An approach for computing the inverse of the Hessian is via the Neumann series. For an invertible matrix A satisfying ‖I − A‖ < 1, the Neumann series expands the inverse of A as
A^{-1} = Σ_{k≥0} (I − A)^k.
From this observation, we can first estimate the Hessian and then estimate the inverse via the Neumann series. Previously, Agarwal et al. (2017) studied fast Neumann-series-based Hessian inversion using first-order information. Here a similar result can be obtained using zeroth-order information only. This zeroth-order extension of Agarwal et al. (2017) is summarized in Algorithm 2.
(13)
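The sketch below illustrates a Neumann-series inversion in the spirit of Algorithm 2 and of Agarwal et al. (2017). It assumes that the (estimated) Hessian has been scaled so that its eigenvalues lie strictly between zero and one; the exact recursion and truncation level of Algorithm 2 are not reproduced.

```python
import numpy as np

# Neumann-series inversion of an (estimated) Hessian.  We assume the matrix has
# been scaled so that its eigenvalues lie in (0, 1); the recursion below sums the
# truncated series I + (I - H) + (I - H)^2 + ... .
def neumann_inverse(H_hat, num_terms):
    n = H_hat.shape[0]
    inv_est = np.eye(n)
    for _ in range(num_terms):
        inv_est = np.eye(n) + (np.eye(n) - H_hat) @ inv_est
    return inv_est

H = np.array([[0.6, 0.1], [0.1, 0.5]])   # eigenvalues in (0, 1)
print(neumann_inverse(H, 200))
print(np.linalg.inv(H))
```

In Algorithm 2, the exact matrix used here would be replaced by zeroth-order Hessian estimates.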
Proposition 3.
Suppose is twice-differentiable, -strongly convex and -smooth with . Then it holds that
(14)
where .
Proof.
Since is -strongly convex, it holds that, for any ,
Integrating both and over gives that
where we use the dominated convergence theorem to interchange the integral and the gradient operator. This shows that the smoothed function is also strongly convex with the same parameter.
Since a differentiable function is smooth (with a given constant) if and only if the corresponding gradient inequality holds for all pairs of points, one can show that the smoothed function is smooth with the same constant by repeating the above argument.
For , we have
Since the smoothed function is strongly convex, smooth, and clearly twice differentiable, we have
Thus we can bound the bias by
∎
5 Existing Methods and Experiments
5.1 Existing Methods for Hessian Estimation
5.1.1 Hessian Estimation via Collecting Single Entry Estimations
In the Euclidean case, one can fix a canonical coordinate system, and the (i, j)-th entry of the Hessian matrix of the function at a point can be estimated by a finite-difference stencil along the i-th and j-th coordinate directions:
(15)
One can then gather the entries to obtain a Hessian estimator:
(16)
This estimator perhaps dates back to the classic era when finite-difference principles were first used. Yet it needs on the order of n² zeroth-order samples to produce an estimate in an n-dimensional space. Previously, Balasubramanian and Ghadimi (2021) designed a Hessian estimator based on Stein's identity (Stein, 1981); their estimator needs only O(1) zeroth-order function evaluations. This method is discussed in the next section.
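A sketch of the entry-wise scheme (15) and (16) follows. The standard four-point central difference is used for each entry, which should be read as an assumption about the exact stencil in (15); the cost is on the order of the squared dimension in function evaluations.

```python
import numpy as np

# Entry-wise finite-difference Hessian estimation (a sketch; the standard
# central-difference stencil is used for each entry, costing O(n^2) evaluations).
def entrywise_hessian(f, x, delta):
    n = x.shape[0]
    H = np.zeros((n, n))
    E = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + delta * (E[i] + E[j])) - f(x + delta * (E[i] - E[j]))
                       - f(x - delta * (E[i] - E[j])) + f(x - delta * (E[i] + E[j]))
                       ) / (4 * delta ** 2)
    return H

f = lambda z: np.sin(z[0]) * z[1] + z[2] ** 3
x0 = np.array([0.3, -0.2, 0.5])
print(entrywise_hessian(f, x0, 1e-4))
```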
5.1.2 Hessian Estimation via the Stein’s identity
A classic result for Hessian computation is Stein's identity (named after Charles Stein), stated below.
Theorem 2 (Stein’s identity).
Consider a smooth function . For any point , it holds that
where 1. , and 2.
Proof.
For completeness, a convenient proof of Theorem 2 is provided in the Appendix. ∎
One can estimate the Hessian using Stein's identity (Balasubramanian and Ghadimi, 2021):
(17)
where the perturbation is a standard Gaussian vector. A bias bound for (17) is given in Theorem 3.
Theorem 3 (Li et al. (2020); Balasubramanian and Ghadimi (2021)).
Let the function have a Lipschitz Hessian: there exists a constant such that the Hessians at any two points differ, in spectral norm, by at most that constant times the distance between the points. The estimator in (17) satisfies
for any point and any function with a Lipschitz Hessian.
The estimator (17) improves on the entry-wise estimator in the sense that one needs only O(1) samples to produce an estimate. However, its error bound given by Theorem 3 is worse than that of (9) in Theorem 1. A more detailed discussion of the error bounds is given in Remark 1.
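For concreteness, a sketch of the Stein-type estimator (17) in the Euclidean case is given below. The normalization by twice the squared step size is the commonly stated one (Balasubramanian and Ghadimi, 2021) and should be read as an assumption about the exact display in (17).

```python
import numpy as np

# Stein-type Hessian estimator (a sketch of (17)): g is a standard Gaussian
# vector and three function evaluations are used per sample.
def stein_hessian_estimate(f, x, delta, num_samples, rng):
    n = x.shape[0]
    est = np.zeros((n, n))
    for _ in range(num_samples):
        g = rng.standard_normal(n)
        second_diff = f(x + delta * g) + f(x - delta * g) - 2.0 * f(x)
        est += second_diff / (2.0 * delta ** 2) * (np.outer(g, g) - np.eye(n))
    return est / num_samples

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n)); H = A + A.T
f = lambda z: 0.5 * z @ H @ z
H_hat = stein_hessian_estimate(f, np.zeros(n), 1e-2, 200_000, rng)
print(np.linalg.norm(H_hat - H, 2))   # decreases as the number of samples grows
```

Each sample uses three function evaluations, compared with four for (9); this difference reappears in the timing comparison of Section 5.3.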
Remark 1.
We note that our estimator (9) and the Stein-type estimator (17) use different effective finite-difference step sizes: the perturbation in (9) has unit norm, while the Gaussian perturbation in (17) has expected norm growing with the dimension. To compare the bias bounds for (9) and (17) using the same (expected) finite-difference step size, we need to downscale the bound in Theorem 3 accordingly. After this downscaling, the error bound for the Stein-type estimator (17) is still worse than the bias bound for our estimator (9). In the experiments, we downscale the finite-difference step size when studying all results of the Stein-type estimator.
As discussed in Remark 1, a difference between (9) and (17) is that they sample random vectors from different distributions: (9) uses uniformly random vectors from the unit sphere, whereas (17) uses standard Gaussian vectors. Higher moments of uniformly random vectors from the unit sphere are smaller than those of Gaussian vectors with the same expected norm. More specifically, the higher-order moments of the norm of a standard Gaussian vector, rescaled to have unit expected squared norm, can be computed in closed form (via the surface area of the Euclidean unit sphere and Stirling's approximation), and they grow very fast with the order of the moment for any fixed dimension. On the contrary, the moments of the norm of a vector uniformly sampled from the unit sphere are all equal to one. This difference implies that our estimator tends to have smaller higher-order moments.
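This growth of the moments can be observed empirically. In the sketch below, the Gaussian vector is rescaled by the square root of the dimension so that both distributions have unit expected squared norm; the rescaling factor is an illustrative normalization.

```python
import numpy as np

# Empirical norm moments: a standard Gaussian vector rescaled by 1/sqrt(n)
# versus a uniform sample from the unit sphere (whose norm moments all equal 1).
rng = np.random.default_rng(6)
n, num_samples = 50, 100_000
g = rng.standard_normal((num_samples, n)) / np.sqrt(n)
norms = np.linalg.norm(g, axis=1)
for k in (2, 4, 8, 16):
    print(k, np.mean(norms ** k))   # grows with k; the spherical counterpart is 1
```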
5.2 Numerical Results
We test the performance of our estimator against the previous two estimators in noisy environments. Before proceeding, we redefine some notation for the estimators, so that the estimators are tested on equal footing and noise is properly taken into consideration. The estimators we empirically study are
1. Our new estimator:
(18)
where the two directions are independently and uniformly sampled from the unit sphere in the tangent space, and the observation noise is mean-zero and independent of all other randomness.
2. The Stein-type estimator:
(19)
where the perturbation is a standard Gaussian vector in the tangent space, the identity map from the tangent space to itself is used as a bilinear form (sending a pair of tangent vectors to their inner product), and the observation noise is mean-zero and independent of all other randomness.
3. The entry-wise estimator:
(20)
where the differences are taken in the local normal coordinates at the point, and the observation noise is mean-zero and independent of all other randomness.
Table 1: The three manifolds, (I), (II), and (III), used in the experiments; manifold (I) is the Euclidean case.
Figure 1: Estimation errors of the three estimators over manifold (I), the Euclidean case.
Table 2: Running time of our method and the Stein-type estimator, broken down into sampling, sampling + evaluation, and sampling + evaluation + computation.
Strictly speaking, the noise corrupts all zeroth-order function-value observations. Specifically, the noise notation should be understood in the following way: all four function values used by each estimate are corrupted by mean-zero, independent noise and are not directly observable. Note that all previously discussed bias bounds continue to hold when the function evaluations are corrupted by independent mean-zero noise.
The above notation allows us to put all the estimators on the same ground more easily. With the new notation, the estimators (18), (19), and (20) all use the same number of function-value observations and have the same expected finite-difference step size. Redefining the estimators is needed because (1) the entry-wise estimator needs more samples to output an estimate, and (2) the default Stein-type method in expectation uses a larger finite-difference step size, as discussed in Remark 1.
Remark 2.
The estimator we introduced in (18) has a practical advantage over the one based on Stein's identity (19). The reason is that estimators based on Stein's identity require one to explicitly know the identity map from the tangent space to itself, and this map may or may not admit an easy numerical representation. For example, for the real Stiefel manifold, the standard tangent-space projection map is the identity from the tangent space to itself (e.g., Absil et al., 2009); it also projects any ambient matrix onto the tangent space. Extracting a numerical representation of this map may not be easy. On the contrary, for any point, computing a uniformly random unit tangent vector is tractable. More specifically, one can use the following procedure to obtain a uniformly random vector from the unit sphere in the tangent space at a given point: (1) sample a Gaussian matrix with independent standard normal entries, (2) project it onto the tangent space, and (3) normalize it with respect to the Frobenius inner product. By rotational invariance (of the standard Gaussian distribution and the Frobenius norm), this procedure outputs a uniformly random unit matrix in the tangent space. Once we have the unit vectors in the tangent space, we can numerically compute their tensor product.
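The three-step sampling procedure described above can be written out as follows. The tangent-space projection used in step (2) is the standard one for the Stiefel manifold (Absil et al., 2009); the matrix sizes are illustrative.

```python
import numpy as np

# Sampling a uniformly random unit tangent vector on the Stiefel manifold
# St(p, k) = { X in R^{p x k} : X^T X = I_k }, following the three-step procedure
# described above.
def random_unit_tangent_stiefel(X, rng):
    G = rng.standard_normal(X.shape)                 # (1) ambient Gaussian matrix
    sym = 0.5 * (X.T @ G + G.T @ X)
    U = G - X @ sym                                  # (2) project onto the tangent space
    return U / np.linalg.norm(U)                     # (3) normalize (Frobenius norm)

rng = np.random.default_rng(7)
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))     # a point on St(6, 3)
U = random_unit_tangent_stiefel(X, rng)
# Sanity check: tangent vectors satisfy X^T U + U^T X = 0 and have unit norm.
print(np.linalg.norm(X.T @ U + U.T @ X), np.linalg.norm(U))
```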
All three methods are tested using a test function defined in the standard Cartesian coordinate system of the ambient space. Every function evaluation is corrupted with independent mean-zero noise. The estimators are tested over three manifolds; more details about the three manifolds are given in Table 1. In all settings, we fix the total number of function evaluations and the dimension of the manifold. The results for manifold (I), the Euclidean case, are in Figure 1. Results for manifold (II) and manifold (III) are in Appendix B. In Figure 1 (and Figures 2 and 3 in Appendix B), the error on the y-axis is the norm of the difference between the empirical estimate and the ground truth:
5.3 Time Efficiency Comparison
We compare the time efficiency of our method and the estimator based on Stein's identity. In general, one expects the Stein-type estimator to be more time-efficient, mainly because it needs only 3 function evaluations instead of 4. In practice, function evaluations may or may not be cheap. When they are expensive, we expect the Stein-type estimator to save roughly a quarter of the time compared to our method. When function evaluations are cheap, our estimator (18) is in general still more time-consuming, since more sampling and more matrix computations need to be carried out.
In Table 2, we compare the running times of (18) and (19). All timing experiments use the same benchmark function and underlying manifold as Figure 1, with fixed values of the remaining parameters. All timing experiments are carried out in an environment with
• 10 cores and 20 logical processors with a maximum speed of 2.80 GHz;
• 32 GB RAM;
• Windows 11 22000.832;
• Python 3.8.8.
6 Conclusion
In this paper, we study Hessian estimators over Riemannian manifolds. We design a new estimator such that, for real-valued analytic functions over an n-dimensional complete analytic Riemannian manifold, the estimator achieves an expected error of order O(δ²), where the constant depends on both the Levi-Civita connection and the function, and δ is the finite-difference step size. Downstream computations for Hessian inversion are also studied. Empirical studies show the advantage of our method over existing methods.
Data Availability Statement
No new data were generated or analysed in support of this paper. Code for the experiments is available at https://github.com/wangt1anyu/zeroth-order-Riemann-Hess-code.
References
- Absil et al., (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.
- Agarwal et al., (2017) Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.
- Audibert et al., (2010) Audibert, J.-Y., Bubeck, S., and Munos, R. (2010). Best arm identification in multi-armed bandits. In COLT, pages 41–53. Citeseer.
- Balasubramanian and Ghadimi, (2021) Balasubramanian, K. and Ghadimi, S. (2021). Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, pages 1–42.
- Bubeck et al., (2009) Bubeck, S., Munos, R., and Stoltz, G. (2009). Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer.
- Conn et al., (2009) Conn, A. R., Scheinberg, K., and Vicente, L. N. (2009). Introduction to derivative-free optimization. SIAM.
- Duchi et al., (2015) Duchi, J. C., Jordan, M. I., Wainwright, M. J., and Wibisono, A. (2015). Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806.
- Feng and Wang, (2022) Feng, Y. and Wang, T. (2022). Stochastic Zeroth Order Gradient and Hessian Estimators: Variance Reduction and Refined Bias Bounds. arXiv preprint arXiv:2205.14737.
- Flaxman et al., (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394.
- Gavrilov, (2007) Gavrilov, A. V. (2007). The double exponential map and covariant derivation. Siberian Mathematical Journal, 48(1):56–61.
- Goldberg and Holland, (1988) Goldberg, D. E. and Holland, J. H. (1988). Genetic algorithms and machine learning.
- Greene and Wu, (1979) Greene, R. E. and Wu, H. (1979). C∞-approximations of convex, subharmonic, and plurisubharmonic functions. In Annales Scientifiques de l'École Normale Supérieure, volume 12, pages 47–84.
- Greene and Wu, (1973) Greene, R. E. and Wu, H.-H. (1973). On the subharmonicity and plurisubharmonicity of geodesically convex functions. Indiana University Mathematics Journal, 22(7):641–653.
- Greene and Wu, (1976) Greene, R. E. and Wu, H.-h. (1976). C∞ convex functions and manifolds of positive curvature. Acta Mathematica, 137(1):209–245.
- Li et al., (2020) Li, J., Balasubramanian, K., and Ma, S. (2020). Stochastic zeroth-order riemannian derivative estimation and optimization. arXiv preprint arXiv:2003.11238.
- Nelder and Mead, (1965) Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The computer journal, 7(4):308–313.
- Nesterov and Polyak, (2006) Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205.
- Nesterov and Spokoiny, (2017) Nesterov, Y. and Spokoiny, V. (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566.
- Petersen, (2006) Petersen, P. (2006). Riemannian geometry, volume 171. Springer.
- Shahriari et al., (2015) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. (2015). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
- Spall, (2000) Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE transactions on automatic control, 45(10):1839–1853.
- Stein, (1981) Stein, C. M. (1981). Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics, 9(6):1135 – 1151.
- Wang et al., (2021) Wang, T., Huang, Y., and Li, D. (2021). From the Greene–Wu Convolution to Gradient Estimation over Riemannian Manifolds. arXiv preprint arXiv:2108.07406.
Appendix A Proof of Theorem 2
Proof of Theorem 2.
Consider for any .
When , one has . In this case, it holds that
When , and , we have .
When , and , we have .
When , and , we have .
When , , , and , we have .
When , , , and , we have .
When , , and , we have .
When , , and , we have .
Now using Einstein’s notation and combining all above cases give
where is the Kronecker’s delta.
Since for all and , we can write . Thus rearranging terms in gives
where is the Laplace operator.
Since , rearranging terms in concludes the proof. ∎
Appendix B Additional Experimental Results
Figure 2: Estimation errors of the three estimators over manifold (II).
Figure 3: Estimation errors of the three estimators over manifold (III).