Geodesically convex $M$-estimation in metric spaces
Abstract
We study the asymptotic properties of geodesically convex $M$-estimation on non-linear spaces. Namely, we prove that under very minimal assumptions besides geodesic convexity of the cost function, one can obtain consistency and asymptotic normality, which are fundamental properties in statistical inference. Our results extend the Euclidean theory of convex $M$-estimation; they also generalize limit theorems on non-linear spaces which, essentially, were only known for barycenters, allowing one to consider robust alternatives defined through non-smooth $M$-estimation procedures.
keywords:
$M$-estimation, metric spaces, Riemannian manifolds, CAT($\kappa$) spaces, geodesic convexity, barycenters, robust location estimation
1 Preliminaries
1.1 Introduction
The problem of $M$-estimation, or empirical risk minimization, is pervasive in statistics and machine learning. In many problems, one seeks to minimize a function $F$ that takes the form of an expectation, i.e., $F(b) = \mathbb{E}[\ell(X, b)]$, where $X$ is a random variable with unknown distribution, $b$ is the target variable, which lives in some target space, and $\ell$ can be seen as a cost function. Classification, regression, location estimation and maximum likelihood estimation are standard instances of such problems. In practice, the distribution of $X$ is unknown, so the function $F$ cannot be evaluated. Hence, one rather minimizes a proxy of the form $F_n(b) = \frac{1}{n}\sum_{i=1}^n \ell(X_i, b)$, where $X_1, \ldots, X_n$ are available i.i.d. copies of $X$. This method is called $M$-estimation, or empirical risk minimization. When the target space is Euclidean, many guarantees are available for a minimizer of $F_n$, as an estimator of a minimizer of $F$. Asymptotic results (such as consistency and asymptotic normality) are classic and well known, especially under technical smoothness assumptions on the cost function [53], most of which can be dropped when the cost function is convex in the target variable [28, 44].
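For illustration only, here is a minimal Euclidean sketch of convex $M$-estimation (ours, not taken from the literature cited above): the empirical risk $F_n$ for the convex cost $\ell(x, b) = \|x - b\|$ is minimized by subgradient descent, producing a geometric median; the simulated data and the step sizes are arbitrary choices.

```python
import numpy as np

def empirical_risk(X, b):
    # F_n(b) = (1/n) * sum_i ||X_i - b||, a convex function of b.
    return np.mean(np.linalg.norm(X - b, axis=1))

def geometric_median(X, n_iter=500):
    # Subgradient descent on the convex empirical risk F_n.
    b = X.mean(axis=0)                        # start at the sample mean
    for t in range(1, n_iter + 1):
        diffs = b - X
        norms = np.linalg.norm(diffs, axis=1)
        norms[norms < 1e-12] = 1e-12          # avoid division by zero
        subgrad = (diffs / norms[:, None]).mean(axis=0)
        b = b - subgrad / np.sqrt(t)          # decreasing step size
    return b

rng = np.random.default_rng(0)
X = rng.standard_t(df=2.0, size=(500, 2))     # heavy-tailed sample
print("geometric median:", geometric_median(X))
print("sample mean:     ", X.mean(axis=0))
```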
In this work, we are interested in the framework in which the target space is not Euclidean, and does not even exhibit any linear structure. Of course, the space needs to be equipped with a metric, at the very least in order to measure the accuracy of an estimation procedure. Thus, we consider a target space that is a metric space, which we denote by $(M, d)$.
Statistics and machine learning are more and more confronted with data that lie in non-linear spaces: in spatial statistics (e.g., directional data), computational tomography (e.g., data in quotient spaces such as in shape statistics, collected up to rigid transformations), economics (e.g., optimal transport, where data are measures), etc. Moreover, data and/or target parameters that are encoded as very high dimensional vectors may have a much smaller intrinsic dimension: for instance, if they lie on low dimensional submanifolds of the ambient Euclidean space. In that case, leveraging the possibly non-linear geometry of the problems at hand can be a powerful tool in order to significantly reduce their dimensionality. This is the central motivating idea behind manifold learning [26, 1] or generative adversarial networks [50]. Even though more and more algorithms are developed to work and, in particular, optimize in non-linear spaces [42, 46, 55, 56, 7, 19], still little is understood from a statistical perspective.
A specific instance of $M$-estimation on metric spaces, which has attracted much attention, is that of barycenters. Given points $x_1, \ldots, x_n \in M$, a barycenter is any minimizer of the sum of squared distances $\sum_{i=1}^n d(x_i, b)^2$. One can easily check that if $M$ is a Euclidean or Hilbert space, then the minimizer is unique and it is given by the average of $x_1, \ldots, x_n$. Barycenters extend the notion of average to spaces that lack a linear structure. More generally, given a probability measure $P$ on $M$, one can define a barycenter of $P$ as any minimizer of $\mathbb{E}[d(X, b)^2]$, where $X \sim P$, provided that this expectation is finite for at least one (and hence, for all) value of $b$. Barycenters were initially introduced in statistics by [24] in the 1940's, and later by [33], where they were rather called Fréchet means or Karcher means. They were popularized in the fields of shape statistics [34], optimal transport [2, 20, 38, 18, 37, 36, 5, 6] and matrix analysis [9, 8, 10]. Asymptotic theory is fairly well understood for empirical barycenters in various setups, particularly laws of large numbers [57, 51] and central limit theorems in Riemannian manifolds (it is natural to impose a smooth structure on $M$ in order to derive central limit theorems) [12, 13, 35, 32, 11, 23, 22]. A few finite sample guarantees are available, even though the non-asymptotic statistical theory for barycenters is still quite limited [51, 25, 49, 3, 39]. Asymptotic theorems are also available for more general $M$-estimators on Riemannian manifolds, e.g., $p$-means, where $\ell(x, b) = d(x, b)^p$, for some $p \geq 1$ [49], including geometric medians as a particular case ($p = 1$), but only suboptimal rates are known. In particular, the standard $n^{-1/2}$-rate is unknown in general, beyond barycenters. It should be noted that all central limit theorems for barycenters in Riemannian manifolds rely on the smoothness of the cost function and use standard techniques from smooth $M$-estimation [53] by performing Taylor expansions in local charts. Moreover, they assume that the data are almost surely in a ball with small radius, which essentially ensures the convexity of the squared distance function in its second variable, even though this synthetic property is not leveraged in this line of work. For instance, asymptotic normality cannot be obtained for geometric medians or other robust alternatives to barycenters using these techniques.
1.2 Contributions
In this work, we prove strong consistency and asymptotic normality of geodesically convex $M$-estimators under very mild assumptions. We recover the central limit theorems proven for barycenters in Riemannian manifolds with small diameter, but we cover a much broader class of location estimators, including robust alternatives to barycenters. Just as in the Euclidean case, our results convey that convexity is a powerful property, since it yields strong learning guarantees under very weak assumptions.
All our results are asymptotic, but it is worth mentioning that proving learning guarantees in non-linear spaces is a very challenging, ongoing research topic, because it requires non-standard tools from metric and differential geometry. In particular, very few non-asymptotic learning guarantees are available, even for simple estimators such as barycenters, and asymptotic theory is still an active research area in that framework (e.g., understanding non-standard asymptotic rates, a.k.a. sticky and smeary central limit theorems [30, 23]). Moreover, asymptotic guarantees such as asymptotic normality provide benchmarks that are also important for a finite sample theory.
2 Background and notation
2.1 Geodesic spaces
Let $(M, d)$ be a complete and separable metric space that satisfies the following property: for all $x, y \in M$, there exists a continuous function $\gamma : [0, 1] \to M$ such that $\gamma(0) = x$, $\gamma(1) = y$, and with the property that $d(\gamma(s), \gamma(t)) = |s - t|\, d(x, y)$, for all $s, t \in [0, 1]$. Such a function is called a (constant speed) geodesic from $x$ to $y$ and it should be thought of as a (not necessarily unique) shortest path from $x$ to $y$. We denote by $\Gamma_{x, y}$ the collection of all such functions. The space $(M, d)$ is then called a geodesic space. Examples of geodesic spaces include Euclidean spaces, Euclidean spheres, metric graphs, Wasserstein spaces, quotient spaces, etc. [16, 15]. Geodesic spaces allow for a natural extension of the notion of convexity.
Definition 1.
A function $f : M \to \mathbb{R}$ is called geodesically convex (convex, for short) if for all $x, y \in M$, all $\gamma \in \Gamma_{x, y}$ and all $t \in [0, 1]$, $f(\gamma(t)) \leq (1 - t) f(x) + t f(y)$.
In this work, we will further assume that $(M, d)$ is a proper space, i.e., all bounded closed sets are compact. This assumption may seem limiting in practice but, at a high level, it essentially only discards infinite dimensional spaces. Indeed, the Hopf-Rinow theorem [15, Chapter 1, Proposition 3.7] asserts that so long as a geodesic space is complete and locally compact, it is proper. For instance, the Hopf-Rinow theorem for Riemannian manifolds asserts that any complete (finite dimensional) Riemannian manifold is a proper geodesic metric space [21, Chapter 7, Theorem 2.8].
2.2 Riemannian manifolds
A Riemannian manifold $(M, g)$ is a finite dimensional smooth manifold $M$ equipped with a Riemannian metric $g$, i.e., a smooth family of scalar products on tangent spaces. We refer the reader to [21] and [40, 41] for a clear introduction to differential geometry and Riemannian manifolds. The distance inherited on $M$ from the Riemannian metric $g$ will be denoted by $d$. In this work, for simplicity, we only consider Riemannian manifolds without boundary, even though our results extend easily to the case with boundary.
The tangent bundle of $M$ is denoted by $TM$, i.e., $TM = \{(x, v) : x \in M, v \in T_x M\}$ where, for all $x \in M$, $T_x M$ is the tangent space to $M$ at $x$. For all $x \in M$, we also let $\exp_x$ and $\log_x$ be the exponential and the logarithmic maps at $x$, respectively. The latter may not be defined on the whole manifold, but it is always defined at least on a neighborhood of $x$. To provide intuition, at a given point $x \in M$, and for a given vector $v \in T_x M$, $\exp_x(v)$ is the Riemannian analog of "$x + v$" in a Euclidean space: from the point $x$, add a vector $v$, whose direction indicates in which direction to move, and whose norm (given by the Riemannian metric $g$) indicates at which distance, in order to attain the point denoted by $\exp_x(v)$. Conversely, given $y \in M$, $\log_x(y)$ (when it is well-defined) is the Riemannian analog of "$y - x$", that is, the vector in $T_x M$ that leads from $x$ to $y$.
For all $x \in M$, we denote by $\langle \cdot, \cdot \rangle_x$ the scalar product on $T_x M$ inherited from $g$ and by $\| \cdot \|_x$ the corresponding norm.
Let $x, y \in M$ and $\gamma$ be a geodesic (in the sense of Section 2.1) from $x$ to $y$. If $x$ and $y$ are close enough, $\gamma$ is unique [41, Proposition 6.11] and in that case, for all $t \in [0, 1]$, $\gamma(t) = \exp_x(t \dot\gamma(0))$, where $\dot\gamma(0)$ is the derivative of $\gamma$ at $0$, given by $\dot\gamma(0) = \log_x(y)$. We denote the parallel transport map from $T_x M$ to $T_y M$ along $\gamma$ as $\Pi_{x, y}$, without specifying its dependence on $\gamma$ (which is non-ambiguous if $y$ is close enough to $x$). The map $\Pi_{x, y}$ is a linear isometry, i.e., $\langle \Pi_{x, y} u, \Pi_{x, y} v \rangle_y = \langle u, v \rangle_x$, for all $u, v \in T_x M$. Moreover, it holds that $\Pi_{x, y} \log_x(y) = -\log_y(x)$. Intuitively, parallel transport is an important tool that allows one to compare vectors from different tangent spaces, along a given path on the manifold.
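To make these maps concrete, the following sketch (ours, not part of the development above) implements the exponential map, the logarithmic map and parallel transport in closed form on the unit sphere $S^2 \subset \mathbb{R}^3$ with its arc-length metric, and numerically checks the isometry property and the identity $\Pi_{x, y} \log_x(y) = -\log_y(x)$.

```python
import numpy as np

def exp_map(x, v):
    # Exponential map on the unit sphere: follow the great circle from x in direction v.
    t = np.linalg.norm(v)
    if t < 1e-12:
        return x
    return np.cos(t) * x + np.sin(t) * v / t

def log_map(x, y):
    # Logarithmic map: tangent vector at x pointing to y, with norm d(x, y) = arccos(<x, y>).
    c = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(x)
    u = y - c * x                              # component of y orthogonal to x
    return theta * u / np.linalg.norm(u)

def transport(x, y, u):
    # Parallel transport of u from T_x S^2 to T_y S^2 along the minimizing geodesic (x != -y).
    return u - (np.dot(u, y) / (1.0 + np.dot(x, y))) * (x + y)

x = np.array([0.0, 0.0, 1.0])
y = np.array([1.0, 0.0, 0.0])
u = np.array([0.0, 1.0, 0.0])                  # u is tangent to the sphere at x
print(np.allclose(exp_map(x, log_map(x, y)), y))             # exp_x(log_x(y)) = y
print(np.isclose(np.linalg.norm(transport(x, y, u)), 1.0))   # transport is an isometry
print(np.allclose(transport(x, y, log_map(x, y)), -log_map(y, x)))
```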
For further details and properties on Riemannian manifolds used in the mathematical development of this work, we refer the reader to Appendix B.
2.3 -estimation
Let $\mathcal{X}$ be some abstract space, equipped with a $\sigma$-algebra $\mathcal{A}$ and a probability measure $P$. Let $\ell : \mathcal{X} \times M \to \mathbb{R}$ satisfy the following:
-
(i)
For all $b \in M$, the map $x \mapsto \ell(x, b)$ is measurable and integrable with respect to $P$;
-
(ii)
For $P$-almost all $x \in \mathcal{X}$, the map $b \mapsto \ell(x, b)$ is convex.
Finally, let $X_1, X_2, \ldots$ be i.i.d. random variables in $\mathcal{X}$ with distribution $P$ and set the functions
$F(b) = \mathbb{E}[\ell(X_1, b)]$ and $F_n(b) = \frac{1}{n} \sum_{i=1}^n \ell(X_i, b)$, for all $b \in M$ and for all positive integers $n$. We denote by $M^*$ the set of minimizers of $F$. Note that by convexity of $F$, the minimizing set $M^*$ is convex, i.e., for all $x, y \in M^*$, all $\gamma \in \Gamma_{x, y}$ and all $t \in [0, 1]$, $\gamma(t) \in M^*$.
Important examples include the case where $\mathcal{X} = M$ and $\ell$ is a function of the metric, which yields location estimation. Specifically, assume that $\mathcal{X} = M$ and that $\ell(x, b) = \rho(d(x, b))$, for all $x, b \in M$, where $\rho : [0, \infty) \to \mathbb{R}$ is a non-decreasing, convex function. Then, a natural framework to ensure convexity of the functions $\ell(x, \cdot)$ is that of spaces with curvature upper bounds, a.k.a. CAT spaces. Here, we only give an intuitive definition of CAT spaces. We refer the reader to Appendix D for a more detailed account on CAT spaces.
Fix some real number $\kappa$. We say that $(M, d)$ is CAT($\kappa$) (or that it has global curvature upper bound $\kappa$) if all small enough triangles in $M$ are thinner than what they would be in a space of constant curvature $\kappa$, i.e., a sphere if $\kappa > 0$, a Euclidean space if $\kappa = 0$ and a hyperbolic space if $\kappa < 0$. This setup is very natural to grasp the convexity of the distance function to a point. Fix $x \in M$. The function $d(x, \cdot)$ is convex if and only if for all $y_0, y_1 \in M$, all $\gamma \in \Gamma_{y_0, y_1}$ and all $t \in [0, 1]$, $d(x, \gamma(t)) \leq (1 - t)\, d(x, y_0) + t\, d(x, y_1)$, which controls the distance from $x$ to any point in the opposite edge to $x$ in a triangle with vertices $x, y_0, y_1$. In other words, it indicates how thin a triangle with vertices $x, y_0, y_1$ must be.
Hence, if $M$ is CAT($\kappa$) and has diameter at most $D_\kappa / 2$ (see Appendix D for its definition), [45, Proposition 3.1] guarantees that for all $x \in M$, $d(x, \cdot)$ is convex on $M$ and hence, $\ell(x, \cdot) = \rho(d(x, \cdot))$ is also convex as soon as $\rho$ is non-decreasing and convex itself. In that setup, important examples of functions $\rho$ include:
-
(i)
$\rho(t) = t^2$: Then, a minimizer of $F$ is a barycenter of $P$ and a minimizer of $F_n$ is an empirical barycenter of $X_1, \ldots, X_n$;
-
(ii)
$\rho(t) = t$: Then, a minimizer of $F$ is a geometric median of $P$;
-
(iii)
More generally, if $\rho(t) = t^p$ for some $p \geq 1$, a minimizer of $F$ is called a $p$-mean of $P$;
-
(iv)
$\rho(t) = t^2 / 2$ if $t \leq \delta$, $\rho(t) = \delta (t - \delta / 2)$ if $t > \delta$, where $\delta > 0$ is a fixed parameter. This function was introduced by Huber [31] in order to produce low-bias robust estimators of the mean: It provides an interpolation between barycenter and geometric median estimation. (A numerical sketch of these estimators on the sphere is given right after this list.)
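As announced in item (iv), the following sketch (ours, purely illustrative) computes these location estimators on the unit sphere by Riemannian (sub)gradient descent, using the fact that the Riemannian gradient of $b \mapsto \rho(d(x, b))$ is $-\rho'(d(x, b))\, \log_b(x) / d(x, b)$ whenever $d(x, b) > 0$; the simulated data, step sizes, exponent $p$ and Huber threshold $\delta$ are arbitrary choices.

```python
import numpy as np

def log_map(b, x):
    # Log map on the unit sphere: tangent vector at b pointing to x, with norm d(b, x).
    c = np.clip(np.dot(b, x), -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(b)
    u = x - c * b
    return theta * u / np.linalg.norm(u)

def exp_map(b, v):
    t = np.linalg.norm(v)
    return b if t < 1e-12 else np.cos(t) * b + np.sin(t) * v / t

def rho_prime(t, kind, p=1.5, delta=0.3):
    # Derivative of rho: t^2 (barycenter), t (median), t^p (p-mean), Huber with threshold delta.
    if kind == "barycenter":
        return 2.0 * t
    if kind == "median":
        return 1.0
    if kind == "p-mean":
        return p * t ** (p - 1.0)
    return t if t <= delta else delta          # Huber

def m_estimate(X, kind, n_iter=300):
    # Riemannian (sub)gradient descent for b -> (1/n) sum_i rho(d(X_i, b)).
    b = X[0].copy()
    for it in range(1, n_iter + 1):
        grads = []
        for x in X:
            v = log_map(b, x)
            d = np.linalg.norm(v)
            if d > 1e-12:
                grads.append(-rho_prime(d, kind) * v / d)
        g = np.mean(grads, axis=0)
        b = exp_map(b, -g / np.sqrt(it))       # decreasing step size
    return b

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0, 1.0])
X = []
for _ in range(200):
    v = 0.2 * rng.standard_normal(3)
    v -= np.dot(v, mu) * mu                    # project onto the tangent space at mu
    X.append(exp_map(mu, v))
X = np.array(X)
for kind in ["barycenter", "median", "p-mean", "huber"]:
    print(kind, m_estimate(X, kind))
```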
Important examples of CAT($\kappa$) spaces include Euclidean spaces, spheres, simply connected Riemannian manifolds with sectional curvature bounded from above, metric trees, etc. In Appendix D, we provide a more detailed list of examples.
3 Consistency
This section contains our first main results on the consistency of geodesically convex $M$-estimators. Here, we work under two scenarios. First, we assume that $\ell(x, \cdot)$ is convex and Lipschitz with some constant that does not depend on $x$. This is somewhat restrictive, even though it covers most of the cases mentioned above for location estimation. In the second scenario, we only assume convexity, but we consider the case where $M$ is a Riemannian manifold. We then discuss possible extensions of our findings.
3.1 Lipschitz case
Here, we assume that there exists $L > 0$ such that $\ell(x, \cdot)$ is $L$-Lipschitz on $M$, for all $x \in \mathcal{X}$. For instance, this is satisfied if $M$ has bounded diameter and $\ell(x, b) = \rho(d(x, b))$, for some locally Lipschitz function $\rho$, such as the functions mentioned above for location estimation.
Recall that we denote by $M^*$ the set of minimizers of $F$. The following theorem does not require that $F$ have a unique minimizer: It asserts that any minimizer of $F_n$ will eventually become arbitrarily close to $M^*$ with probability $1$.
Theorem 1.
Let $(M, d)$ be a proper geodesic metric space. Assume that $M^*$ is non-empty and bounded. For $n \geq 1$, let $\hat b_n$ be a minimizer of $F_n$. Then, it holds that $d(\hat b_n, M^*) \to 0$ almost surely.
A simple adaptation of the proof shows that the same conclusion applies if $\hat b_n$ is not exactly a minimizer of $F_n$ but rather satisfies $F_n(\hat b_n) \leq \inf_{b \in M} F_n(b) + \varepsilon_n$, where $\varepsilon_n$ is any non-negative error term that goes to zero almost surely, as $n \to \infty$.
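As a toy illustration of this consistency statement (ours, not from the results above), the following sketch estimates a geometric median on the circle and reports the distance of the estimator to the population minimizer for increasing sample sizes; the wrapped-Gaussian data are an arbitrary choice, the empirical risk is minimized by a crude grid search, and the example is only meant to visualize $d(\hat b_n, M^*) \to 0$, not to verify the theorem's assumptions on the whole circle.

```python
import numpy as np

def wrap(a):
    # Wrap an angle into (-pi, pi].
    return (a + np.pi) % (2 * np.pi) - np.pi

def circ_dist(a, b):
    return np.abs(wrap(a - b))

def m_estimator(X, grid):
    # Minimize F_n(b) = (1/n) sum_i d(X_i, b) over a fine grid of candidate locations.
    risks = np.array([np.mean(circ_dist(X, b)) for b in grid])
    return grid[np.argmin(risks)]

rng = np.random.default_rng(2)
grid = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
b_star = 0.0                                   # population geometric median by symmetry
for n in [50, 500, 5000]:
    X = wrap(0.8 * rng.standard_normal(n))     # sample concentrated around b_star = 0
    b_hat = m_estimator(X, grid)
    print(n, "d(b_hat, M*) =", circ_dist(b_hat, b_star))   # typically decreases with n
```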
The proof of Theorem 1 is based on the following extension of [47, Theorem 10.8]. The proof is a straightforward adaptation of that of [47, Theorem 10.8].
Lemma 1.
Let $L > 0$ and $(f_n)_{n \geq 1}$ be a sequence of $L$-Lipschitz convex functions on $M$. Assume that there is a function $f$ and a dense subset $A$ of $M$ such that $f_n(b) \to f(b)$, for all $b \in A$. Then, $f$ is convex, $L$-Lipschitz, and $(f_n)$ converges to $f$ uniformly on any compact subset of $M$.
This lemma is the reason why, in this section, we require $\ell(x, \cdot)$ to be Lipschitz, with a constant that does not depend on $x$. If $M$ were Euclidean, the Lipschitz assumption of Lemma 1 would not be necessary because, on any compact subset, it would be a consequence of the convexity and pointwise convergence of the functions $f_n$, see [47, Theorem 10.6]. However, to the best of our knowledge, this may not hold in the more general case that we treat here (and may be subject to a further research question).
As a corollary to Lemma 1, we now state the following result, which is essential in our proof of Theorem 1. The proof is deferred to Appendix C.
Lemma 2.
Let $L > 0$ and $(f_n)_{n \geq 1}$ be a sequence of real valued, $L$-Lipschitz, convex random functions on $M$. Let $f$ be a (possibly random) function on $M$ and assume that for all $b \in M$, $f_n(b)$ converges almost surely (resp. in probability) to $f(b)$. Then, the convergence holds uniformly on any compact subset $K$ of $M$, i.e., $\sup_{b \in K} |f_n(b) - f(b)|$ converges to zero almost surely (resp. in probability).
Proof of Theorem 1.
Let $\alpha$ be the smallest value of $F$ on $M$ and fix some arbitrary $\varepsilon > 0$. Fix $x_0 \in M^*$ and let $K = \{b \in M : d(b, M^*) \leq \varepsilon\}$. Since $M^*$ is bounded, $K$ is bounded. Moreover, since the distance to $M^*$ is continuous, $K$ is a closed set. Hence, it is compact, since we have assumed $(M, d)$ to be proper. Since $F$ is Lipschitz, it is continuous, so it holds that $\beta := \min\{F(b) : b \in M, d(b, M^*) = \varepsilon\} > \alpha$. Moreover, by the law of large numbers, $F_n(b) \to F(b)$ almost surely, for all $b \in M$. Therefore, by Lemma 2, $F_n$ converges uniformly to $F$ on $K$ almost surely. So, with probability $1$, $F_n(b) > \alpha + 2(\beta - \alpha)/3$ for all $b \in M$ with $d(b, M^*) = \varepsilon$, for all large enough $n$. Moreover, also with probability $1$, $F_n(x_0) < \alpha + (\beta - \alpha)/3$ for all large enough $n$.

We have established that with probability $1$, it holds simultaneously, for all large enough $n$, that $F_n(b) > F_n(x_0)$ for all $b \in M$ with $d(b, M^*) = \varepsilon$. Let us show that this, together with the convexity of $F_n$, implies that any minimizer of $F_n$ must be at a distance at most $\varepsilon$ of $M^*$. This will yield the desired result.

Assume, for the sake of contradiction, that $F_n$ has a minimizer $\hat b_n$ that satisfies $d(\hat b_n, M^*) > \varepsilon$. Consider a geodesic $\gamma \in \Gamma_{x_0, \hat b_n}$. Then, by continuity of the function $t \mapsto d(\gamma(t), M^*)$, there must be some $t_0 \in (0, 1)$ such that $d(\gamma(t_0), M^*) = \varepsilon$. The convexity of $F_n$ yields the convexity of $t \mapsto F_n(\gamma(t))$, which is minimum at $t = 1$. Hence, this function must be non-increasing on $[0, 1]$, implying that $F_n(\gamma(t_0)) \leq F_n(x_0)$, which yields a contradiction.
3.2 Riemannian framework
Now, we prove that, at least in a Riemannian manifold, the Lipschitz condition on $\ell$ can be dropped for the consistency of $\hat b_n$. In this section, we assume that $M$ is a complete Riemannian manifold and we have the following theorem.
Theorem 2.
Let $M$ be a complete Riemannian manifold. Assume that $M^*$ is non-empty and bounded. For $n \geq 1$, let $\hat b_n$ be a minimizer of $F_n$. Then, it holds that $d(\hat b_n, M^*) \to 0$ almost surely.
Again, this theorem applies if $\hat b_n$ only satisfies $F_n(\hat b_n) \leq \inf_{b \in M} F_n(b) + \varepsilon_n$, where $\varepsilon_n$ is any non-negative error term that goes to zero almost surely, as $n \to \infty$.
Once one has Lemma 4 below, which builds upon Lemma 3, the proof of Theorem 2 is very similar to that of Theorem 1, hence, we omit it.
Lemma 3.
[27] Any convex function on a Riemannian manifold is continuous and locally Lipschitz.
By adapting the proof of [47, Theorem 10.8], this lemma yields the following result.
Lemma 4.
Let $(f_n)_{n \geq 1}$ be a sequence of convex functions on $M$. Assume that $f_n(b) \to f(b)$, for all $b$ in a dense subset of $M$, where $f$ is some given convex function. Then, $(f_n)$ is equi-Lipschitz and converges to $f$ uniformly on any compact subset of $M$.
Remark 1.
Lemma 3 is key in the proof of consistency if one does not assume that $\ell(x, \cdot)$ is $L$-Lipschitz, for all $x \in \mathcal{X}$, with some $L$ that does not depend on $x$. As we pointed out in the previous section, we do not know whether this lemma still holds true in more general proper geodesic spaces and we leave this as an open question.
4 Asymptotic normality
Now, we turn to the asymptotic normality of $\hat b_n$. Riemannian manifolds offer a natural and reasonable framework because, on top of their metric structure, they enjoy smoothness, which allows one to compute first and second order expansions, and Gaussian distributions can be defined on their tangent spaces. Hence, in this section, we assume that $M$ is a complete Riemannian manifold.
Theorem 3.
Assume that the distribution $P$ and the cost function $\ell$ satisfy the following assumptions:
-
•
$F$ has a unique minimizer $b^*$;
-
•
$F$ is twice differentiable at $b^*$ and its Hessian $H = \nabla^2 F(b^*)$ is positive definite;
-
•
There exists $\varepsilon > 0$ such that $\mathbb{E}\left[\|g(X_1, b)\|_b^2\right] < \infty$, for all $b \in M$ with $d(b, b^*) \leq \varepsilon$, where $g(x, b)$ is a measurable subgradient of $\ell(x, \cdot)$ at the point $b$.
Then, $\hat b_n$ is asymptotically normal, that is,
$$\sqrt{n}\, \log_{b^*}(\hat b_n) \longrightarrow \mathcal{N}\left(0, H^{-1} \Sigma H^{-1}\right) \quad \text{in distribution, as } n \to \infty,$$
where $H = \nabla^2 F(b^*)$, $\Sigma = \mathbb{E}\left[g(X_1, b^*) \otimes g(X_1, b^*)\right]$, and we denote by $\mathcal{N}(\mu, V)$ the normal distribution on the Euclidean space $(T_{b^*} M, \langle \cdot, \cdot \rangle_{b^*})$ with mean $\mu$ and covariance operator $V$.
Here, $H$ is the Hessian of $F$ at the point $b^*$ (see Appendix B). Recall that $\log_{b^*}(\hat b_n)$ is the Riemannian analog of "$\hat b_n - b^*$", so Theorem 3 is a natural formulation of asymptotic normality of $\hat b_n$ in the Riemannian context.
A close look at the end of the proof will convince the reader that one can assume, without modifying the conclusion of this theorem, that $\hat b_n$ need not be a minimizer of $F_n$, but only satisfy $F_n(\hat b_n) \leq \inf_{b \in M} F_n(b) + \varepsilon_n$, where $\varepsilon_n$ is a possibly random error term that goes to zero in probability at a faster rate than $1/n$.
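As a sanity check (ours, not part of the proof), the following Monte Carlo sketch illustrates the statement of Theorem 3 in the simplest possible setting: the barycenter (Fréchet mean) on the circle with concentrated data, where $T_{b^*}M \cong \mathbb{R}$ and $\log_{b^*}(\hat b_n)$ is the wrapped difference $\hat b_n - b^*$. All distributional choices are arbitrary, and the predicted variance ignores wrapping effects.

```python
import numpy as np

def wrap(a):
    return (a + np.pi) % (2 * np.pi) - np.pi

def frechet_mean(X, n_iter=100):
    # Fixed-point iteration b <- b + mean(wrap(X - b)): a gradient step (step size 1/2)
    # for F_n(b) = (1/n) sum_i wrap(X_i - b)^2 on the circle.
    b = X[0]
    for _ in range(n_iter):
        b = wrap(b + np.mean(wrap(X - b)))
    return b

rng = np.random.default_rng(3)
b_star, sigma = 0.0, 0.3
n, n_rep = 400, 2000
stats = []
for _ in range(n_rep):
    X = wrap(b_star + sigma * rng.standard_normal(n))
    stats.append(np.sqrt(n) * wrap(frechet_mean(X) - b_star))
stats = np.array(stats)
# For this toy model (ignoring wrapping), H = 2 and Sigma = 4 sigma^2, so the sandwich
# H^{-1} Sigma H^{-1} equals sigma^2 = 0.09; the Monte Carlo variance should be close to it.
print("mean ~ 0:", stats.mean(), " variance ~ sigma^2:", stats.var())
```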
Remark 2.
The statement of Theorem 3 implicitly assumes the existence of measurable subgradients of $\ell$, that is, a mapping $g$ such that for $P$-almost all $x \in \mathcal{X}$ and for all $b \in M$, $g(x, b)$ is a subgradient of $\ell(x, \cdot)$ at the point $b$, and for all $b \in M$, $g(\cdot, b)$ is measurable. This is actually guaranteed by Lemma 8 on measurable selections, see Appendix A. Note also that the last assumption is automatically guaranteed if $\ell(x, \cdot)$ is locally $L(x)$-Lipschitz at $b^*$, for $P$-almost all $x \in \mathcal{X}$, for some $L$ with $\mathbb{E}[L(X_1)^2] < \infty$.
The proof of Theorem 3 is strongly inspired by [28] and [44] but requires subtle tools to account for the non-Euclidean geometry of the problem. The idea of the proof is to show that
as $n$ grows large, uniformly on compact sets. Hence, the minimizer of the left-hand side which, by definition, is $\sqrt{n}\,\log_{b^*}(\hat b_n)$, must be close to the minimizer of the right-hand side, which has a closed form, and which is asymptotically normal, thanks to the central limit theorem. The main difficulty is to handle $F_n \circ \exp_{b^*}$, which is not a convex function on $T_{b^*} M$, even though $F_n$ is convex on $M$. Lemma 5 below is our most technical result, and is the key argument to adapt the techniques of [28, 44] to the Riemannian setup.
Lemma 5.
Let $M$ be a Riemannian manifold and $b^* \in M$. There exists $r > 0$ such that the following holds. Let $L > 0$. There exists $\lambda > 0$ such that for all convex functions $f : M \to \mathbb{R}$ that are $L$-Lipschitz on the closed ball $\bar B(b^*, r)$, the map
$u \mapsto f(\exp_{b^*}(u)) + \lambda \|u\|_{b^*}^2$ is convex on $\{u \in T_{b^*} M : \|u\|_{b^*} \leq r\}$, where $\|\cdot\|_{b^*}$ is the norm induced by $g$ on the tangent space $T_{b^*} M$.
In the proof of this lemma, we first consider the case where $f$ is smooth and then proceed by approximation of convex functions by smooth functions, leveraging [27, Theorem 2].
Proof of Theorem 3.
Fix . For all , let
The definition of subgradients yields that . Let be the parallel transport from to through the geodesic . Then, . Thus, one has the following:
(1) |
where we used the fact that is an isometry in the last equality.
Therefore, , where
Note that is finite if is large enough, thanks to the last assumption of the theorem.
The inequalities above yield that for all and by Lemma 9, the sequence is non-increasing. Therefore, converges almost surely to a non-negative random variable . Let us now prove that almost surely. In order to do so, let us prove that . Since, by the monotone convergence theorem, must converge to , we will obtain that . This, combined with the fact that almost surely, will yield the desired result.
Fix with . Then, for all , . Taking the expectation readily yields that . Hence, for all integers that are large enough so , it holds that . Therefore,
since is twice differentiable at .
Finally, we obtain that almost surely. Now, is a non-increasing, non-negative sequence, so the monotone convergence theorem yields that goes to zero as , so that
Therefore, by Chebychev’s inequality, converges in probability to zero, as , that is,
in probability. Since is twice differentiable at , we obtain
(2) |
in probability. For and , denote by
and by
We have shown that converges in probability to as , for any fixed . Now, we would like to make use of Lemma 2. However, the functions are not guaranteed to be convex. We adjust these functions using Lemma 5.
Let be any fixed positive number. Let be the ball in centered at the origin, with radius and let . Then, and are both compact. By the law of large numbers, converges almost surely to , for all . Therefore, by Lemma 2, almost surely. Hence, Lemma 4 yields the existence of a (possibly random) such that the following holds with probability :
Now, let be as in Lemma 5. Then, there exists a (possibly random) such that with probability , for all , is convex on .
Denote by and , for all and . For all large enough (such that ), each is convex on with probability , and we trivially have that in probability, for all . Therefore, Lemma 2 yields that in probability.
Finally, we have shown that for any ,
(3) |
in probability.
By definition of , is minimized for . Moreover, one easily checks that is minimized for
Note that are i.i.d. random vectors in . They are centered, since by the first order condition, and they have a second moment, thanks to the last assumption of the theorem. Therefore, the multivariate central limit theorem yields that
(4) |
in distribution. In particular, the sequence is bounded in probability, so one can choose in (3) so that for all , with arbitrarily large probability.
From (3), it is easy to obtain that converges in probability to zero, as . Therefore, by Slutsky’s theorem, has the same limit in distribution as , which concludes the proof thanks to (4).
Proof of Lemma 5.
Fix (to be chosen) and denote by , which is compact by Hopf-Rinow theorem [21, Chapter 7, Theorem 2.8]. Let be convex and -Lipschitz () on and set .
We split the proof of this lemma into steps.
Step 1: Case where is smooth.
First, assume that is smooth. Then, is a smooth function. In the sequel, we denote by the Hessian of and by the Hessian of (i.e., we use H for Hessian of functions defined on and for Hessian of functions defined on the tangent space ). Fix and let , for . Denote by , for all . The Hessian of is given by . For all ,
and
(5) |
where is the covariant derivative along the path . Note that the first term on the right-hand side of (5) is non-negative, due to the convexity of (see [52, Theorem 6.2]). Therefore,
(6) |
where the second inequality follows from Cauchy-Schwarz inequality and the last one follows from the assumption that is -Lipschitz on .
For all , let . In particular, . Then, , where . Since the map is continuous, there exists a constant (that only depends on and on the geometry of ) such that
for all with and . Hence, (6) yields that . Therefore, by setting , we have established that is convex on .
Step 2: Case where is no longer assumed to be smooth.
Step 2.1: Smooth approximation of .
For all positive integers , let be the function defined by
where is a non-negative, smooth function supported on with and is the pushforward measure of the Riemannian volume onto by the exponential map . By [27, Theorem 2], each is smooth and the sequence converges to uniformly on and satisfies
where stands for the smallest eigenvalue. Fix , that we will then choose to be small enough. Without loss of generality, assume that , for all (since one might as well forget about the first terms of the sequence ).
Step 2.2: is Lipschitz on .
Let us show that is Lipschitz on , so long as is large enough. Let and let be the parallel transport map from to . Since is an isometry, it maps the measure to , hence, it holds that
Thus,
(7) |
where the last inequality holds as soon as is large enough (i.e., ) since is -Lipschitz on and in the last integral, only the vectors with contribute to the integral. Finally, the map is smooth, hence it is Lipschitz on any compact set, so there exists that is independent of and such that for all with , . Therefore, (7) yields
(8) |
for all . Thus, is -Lipschitz on , for all integers .
Step 2.3: Add a fixed function to each to make them convex.
Assume that and are chosen according to Lemma 6 below and let be the function given in that lemma. Then, for all , is convex on , since its Hessian is positive semi-definite at any .
Step 2.4: Apply the smooth case (Step 1) to each on .
For all , is smooth. Moreover, is -Lipschitz on , where does not depend on , and is smooth so it is -Lipschitz on , for some that does not depend on either. Therefore, applying the result proven in Step 1, we obtain that there exists that does not depend on , such that for all large enough integers , is convex on . By taking the limit as , we obtain that is convex on .
Since is smooth, so is , so there exists such that , for all with . Therefore, by setting , we obtain that is convex on .
Step 3: Conclusion.
We can now combine all cases by taking .
Lemma 6.
Let be a complete Riemannian manifold without boundary and . There exists and a smooth function such that for all .
Proof.
Let such that is simply connected. For all and for all linearly independent vectors with , let be the sectional curvature of at , along the -plane spanned by and (see [21, Proposition 3.1 and Definition 3.2]). Then, is smooth, hence, there exists that is an upper bound on all sectional curvatures of at points , which is compact. Therefore, Lemma 12 yields that is . The desired result follows by taking: and , where and are defined in Lemma 11, see Appendix D.2.
∎
References
- [1] Eddie Aamari and Clément Levrard. Nonasymptotic rates for manifold, tangent space and curvature estimation. 2019.
- [2] Martial Agueh and Guillaume Carlier. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
- [3] Adil Ahidar-Coutrix, Thibaut Le Gouic, and Quentin Paris. Convergence rates for empirical barycenters in metric spaces: curvature, convexity and extendable geodesics. Probability theory and related fields, 177(1-2):323–368, 2020.
- [4] Stephanie Alexander, Vitali Kapovitch, and Anton Petrunin. Alexandrov geometry: preliminary version no. 1. arXiv e-prints, pages arXiv–1903, 2019.
- [5] Jason M. Altschuler and Enric Boix-Adsera. Wasserstein barycenters can be computed in polynomial time in fixed dimension. J. Mach. Learn. Res., 22:44–1, 2021.
- [6] Jason M. Altschuler and Enric Boix-Adsera. Wasserstein barycenters are np-hard to compute. SIAM Journal on Mathematics of Data Science, 4(1):179–203, 2022.
- [7] Kimon Antonakopoulos, Elena Veronica Belmega, and Panayotis Mertikopoulos. Online and stochastic optimization beyond lipschitz continuity: A riemannian approach. In ICLR 2020-International Conference on Learning Representations, pages 1–20, 2020.
- [8] Rajendra Bhatia. Positive definite matrices. In Positive Definite Matrices. Princeton university press, 2009.
- [9] Rajendra Bhatia and John Holbrook. Riemannian geometry and matrix geometric means. Linear algebra and its applications, 413(2-3):594–618, 2006.
- [10] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165–191, 2019.
- [11] Rabi Bhattacharya and Lizhen Lin. Omnibus CLTs for Fréchet means and nonparametric inference on non-euclidean spaces. Proceedings of the American Mathematical Society, 145(1):413–428, 2017.
- [12] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds. Annals of statistics, 31(1):1–29, 2003.
- [13] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds: II. Annals of statistics, pages 1225–1259, 2005.
- [14] Louis J. Billera, Susan P. Holmes, and Karen Vogtmann. Geometry of the space of phylogenetic trees. Advances in Applied Mathematics, 27(4):733–767, 2001.
- [15] Martin R Bridson and André Haefliger. Metric spaces of non-positive curvature, volume 319. Springer Science & Business Media, 2013.
- [16] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry, volume 33. American Mathematical Society, 2001.
- [17] Yuan Shih Chow and Henry Teicher. Probability theory: independence, interchangeability, martingales. Springer Science & Business Media, 2003.
- [18] Sebastian Claici, Edward Chien, and Justin Solomon. Stochastic Wasserstein barycenters. In International Conference on Machine Learning, pages 999–1008. PMLR, 2018.
- [19] Christopher Criscitiello and Nicolas Boumal. An accelerated first-order method for non-convex optimization on manifolds. Foundations of Computational Mathematics, pages 1–77, 2022.
- [20] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International conference on machine learning, pages 685–693. PMLR, 2014.
- [21] Manfredo Perdigao Do Carmo. Riemannian geometry, volume 6. Springer, 1992.
- [22] Benjamin Eltzner, Fernando Galaz-Garcia, Septhan F. Huckemann, and Wilderich Tuschmann. Stability of the cut locus and a central limit theorem for Fréchet means of Riemannian manifolds. arXiv preprint arXiv:1909.00410, 2019.
- [23] Benjamin Eltzner and Stephan F Huckemann. A smeary central limit theorem for manifolds with application to high-dimensional spheres. The Annals of Statistics, 47(6):3360–3381, 2019.
- [24] Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10:215–310, 1948.
- [25] Kei Funano. Rate of convergence of stochastic processes with values in -trees and Hadamard manifolds. Osaka J. Math., 47(4):911–920, 2010.
- [26] Christopher R Genovese, Marco Perone Pacifico, Verdinelli Isabella, Larry Wasserman, et al. Minimax manifold estimation. Journal of machine learning research, 13:1263–1291, 2012.
- [27] Robert E. Greene and Hung Wu. On the subharmonicity and plurisubharmonicity of geodesically convex functions. Indiana University Mathematics Journal, 22(7):641–653, 1973.
- [28] Shelby J. Haberman. Concavity and estimation. The Annals of Statistics, pages 1631–1661, 1989.
- [29] CH Himmelberg, T Parthasarathy, and FS Van Vleck. On measurable relations. Fundamenta mathematicae, 61:161–167, 1982.
- [30] Thomas Hotz, Sean Skwerer, Stephan Huckemann, Huiling Le, J Stephen Marron, Jonathan C Mattingly, Ezra Miller, James Nolen, Megan Owen, and Vic Patrangenaru. Sticky central limit theorems on open books. Annals of Applied Probability, 2013.
- [31] Peter J Huber. Robust estimation of a location parameter. Breakthroughs in statistics: Methodology and distribution, pages 492–518, 1992.
- [32] Stephan F. Huckemann. Intrinsic inference on the mean geodesic of planar shapes and tree discrimination by leaf growth. The Annals of Statistics, 39(2):1098–1124, 2011.
- [33] Hermann Karcher. Riemannian center of mass and mollifier smoothing. Communications on pure and applied mathematics, 30(5):509–541, 1977.
- [34] David George Kendall, Dennis Barden, Thomas K. Carne, and Huiling Le. Shape and shape theory, volume 500. John Wiley & Sons, 2009.
- [35] Wilfrid S. Kendall and Huiling Le. Limit theorems for empirical Fréchet means of independent and non-identically distributed manifold-valued random variables. Brazilian Journal of Probability and Statistics, 25(3):323 – 352, 2011.
- [36] Alexey Kroshnin, Vladimir Spokoiny, and Alexandra Suvorikova. Statistical inference for Bures-Wasserstein barycenters. The Annals of Applied Probability, 31(3):1264–1298, 2021.
- [37] Alexey Kroshnin, Nazarii Tupitsa, Darina Dvinskikh, Pavel Dvurechensky, Alexander Gasnikov, and Cesar Uribe. On the complexity of approximating Wasserstein barycenters. In International conference on machine learning, pages 3530–3540. PMLR, 2019.
- [38] Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168(3):901–917, 2017.
- [39] Thibaut Le Gouic, Quentin Paris, Philippe Rigollet, and Austin J Stromme. Fast convergence of empirical barycenters in Alexandrov spaces and the Wasserstein space. Journal of the European Mathematical Society, 2022.
- [40] John M. Lee. Introduction to smooth manifolds. Springer, 2012.
- [41] John M. Lee. Introduction to Riemannian manifolds, volume 2. Springer, 2018.
- [42] Yongdo Lim and Miklós Pálfia. Weighted deterministic walks for the least squares mean on Hadamard spaces. Bulletin of the London Mathematical Society, 46, 05 2014.
- [43] Estelle Massart, Julien M. Hendrickx, and P-A Absil. Curvature of the manifold of fixed-rank positive-semidefinite matrices endowed with the Bures-Wasserstein metric. In Geometric Science of Information: 4th International Conference, GSI 2019, Toulouse, France, August 27–29, 2019, Proceedings, pages 739–748. Springer, 2019.
- [44] Wojciech Niemiro. Asymptotics for M-estimators defined by convex minimization. The Annals of Statistics, pages 1514–1533, 1992.
- [45] Shin-ichi Ohta. Convexities of metric spaces. Geom. Dedicata, 125:225–250, 2007.
- [46] Shin-ichi Ohta and Miklós Pálfia. Discrete-time gradient flows and law of large numbers in Alexandrov spaces. Calc. Var. Partial Differential Equations, 54(2):1591–1610, 2015.
- [47] R Tyrrell Rockafellar. Convex analysis, volume 18. Princeton university press, 1970.
- [48] Rolf Schneider. Convex bodies: the Brunn–Minkowski theory. Number 151. Cambridge university press, 2014.
- [49] Christof Schötz. Convergence rates for the generalized fréchet mean via the quadruple inequality. Electronic Journal of Statistics, 13:4280–4345, 2019.
- [50] Nicolas Schreuder, Victor-Emmanuel Brunel, and Arnak Dalalyan. Statistical guarantees for generative models without domination. In Algorithmic Learning Theory, pages 1051–1071. PMLR, 2021.
- [51] Karl-Theodor Sturm. Probability measures on metric spaces of nonpositive curvature. In Heat kernels and analysis on manifolds, graphs, and metric spaces (Paris, 2002), volume 338 of Contemp. Math., pages 357–390. Amer. Math. Soc., Providence, RI, 2003.
- [52] Constantin Udriste. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media, 2013.
- [53] Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.
- [54] Daniel H. Wagner. Survey of measurable selection theorems. SIAM Journal on Control and Optimization, 15(5):859–903, 1977.
- [55] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638. PMLR, 2016.
- [56] Hongyi Zhang and Suvrit Sra. Towards riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.
- [57] Herbert Ziezold. On expected figures and a strong law of large numbers for random elements in quasi-metric spaces. In Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pages 591–602. Springer, 1977.
Appendix A On measurable selections of subgradients
Lemma 7.
Let be a probability space and be a Riemannian manifold without boundary. Let be a function such that
-
•
is convex for -almost all ;
-
•
is measurable for all .
Then, there exists a map such that
-
•
is a subgradient of at , for all , for -almost all ;
-
•
is measurable, for all .
Proof.
Let be such that and is convex on for all . Set . It is enough to prove that for all , there exists a measurable function on such that is a subgradient of at , for all .
Fix . For all , let be the set of all subgradients of at the point . By [52, Theorem 4.5], , for all . Therefore, it is enough to show that the restriction of to is weakly measurable, i.e., that , for all open subsets (see [29] for the definition).
One can write it as
where is some fixed countable, dense subset of . The last equality follows from the continuity of , by Lemma 3. For all , let .
Since each of the countably many , is closed, [29, Corollary 4.2] shows that it is enough to show that each is weakly measurable. Fix and let be an open subset of . Then, for all ,
The last equivalence follows from the fact that since is open, is dense in , and if some satisfies , then, any of the form , for any with small enough norm such that . There is at least one such that falls in .
Finally, one obtains:
which is in since is assumed to be measurable for all .
One concludes using Lemma 8 below.
∎
Lemma 8.
Let be a probability space and a Polish space. Let be a function that maps any to a closed subset of . Assume the existence of with and such that for all . Assume further that for all open subsets of , . Then, there exists a function such that , for -almost all .
Proof.
Consider the probability space defined by setting
-
•
;
-
•
;
-
•
as the restriction of to .
Then, for all , so, thanks to [54, Theorem 4.1], one can find a measurable selection of the restriction of to . By setting to be any constant value in for , we obtain a function that satisfies the requirement.
∎
Appendix B Elements on Riemannian manifolds and on geodesic convexity
In this section, let $M$ be a complete, finite dimensional Riemannian manifold. That is, $M$ is a finite dimensional smooth manifold and the Riemannian metric $g$ equips each tangent space $T_x M$, $x \in M$, with a scalar product $\langle \cdot, \cdot \rangle_x$. Moreover, by the Hopf-Rinow theorem [21, Chapter 7, Theorem 2.8], completeness of $M$ ensures that any two points $x, y \in M$ are connected by at least one minimizing geodesic, i.e., a smooth path $\gamma : [0, 1] \to M$ such that $\gamma(0) = x$, $\gamma(1) = y$ and $d(\gamma(s), \gamma(t)) = |s - t|\, d(x, y)$, for all $s, t \in [0, 1]$.
B.1 On smooth functions and their derivatives
If is a smooth function, its gradient is the map such that for all , and for all ,
where is any smooth path such that . The Hessian of at any is represented by the linear map such that for all ,
where is a geodesic path with , .
With our notation, any smooth function has a second order Taylor expansion at any given point of the following form:
as , for any fixed .
B.2 On convex functions
Recall that a convex function on $M$ is any function $f : M \to \mathbb{R}$ that satisfies $f(\gamma(t)) \leq (1 - t) f(x) + t f(y)$ for all $x, y \in M$, all $t \in [0, 1]$ and all constant speed, minimizing geodesics $\gamma$ from $x$ to $y$. Just as in Euclidean spaces, convex functions are automatically continuous [52, Theorem 3.6] (note that we only consider convex functions on the whole space $M$, which is assumed to have no boundary). The notion of subgradients can also be extended to the Riemannian case.
Definition 2.
Let $f : M \to \mathbb{R}$ be a convex function and $x \in M$. A vector $u \in T_x M$ is called a subgradient of $f$ at $x$ if and only if
for all .
The set of all subgradients of $f$ at $x$ is denoted by $\partial f(x)$ and [52, Theorem 4.5] ensures that $\partial f(x) \neq \emptyset$, for all $x \in M$ (again, note that we only consider the case when $M$ has no boundary). For simplicity here, unlike in [52], subgradients are elements of the tangent space, not its dual.
Subgradients of convex functions satisfy the following monotonicity property.
Lemma 9.
Let be a convex function. Let , and set . Let and . Then,
where is the parallel transport from to through the geodesic given by .
Proof.
First, note that by letting , it holds that . By definition of subgradients, one has the following inequalities:
and
using the fact that is an isometry. Summing both inequalities yields the result.
∎
Appendix C On sequences of convex functions in metric spaces
C.1 Proof of Lemma 2
Almost sure convergence
Since $(M, d)$ is a proper metric space, it is separable [4, Proposition 2.2.2]. Let $A$ be a countable dense subset of $M$. By assumption, $f_n(b) \to f(b)$ almost surely, for all $b \in A$. Therefore, with probability one, the convergence holds for all $b \in A$, since $A$ is countable. Hence, because $A$ is also dense, Lemma 1 applies and we obtain the desired result.
Convergence in probability
Here, we use the following fact.
Lemma 10.
[17, Lemma 2 (ii) p.67] Let $(Z_n)_{n \geq 1}$ and $Z$ be real valued random variables. Then, $Z_n$ converges in probability to $Z$ if and only if every subsequence of $(Z_n)$ has itself a subsequence that converges almost surely to $Z$.
Let be a compact subset of and consider a subsequence of . Without loss of generality (one could renumber that subsequence), we actually consider the whole sequence and we show that it admits a subsequence that converges almost surely to zero. Let be a countable dense subset of . Denote the elements of as .
By assumption, converges to in probability, therefore, it has a subsequence, say , that converges to almost surely.
Now, again by assumption, converges to in probability, therefore, it has a subsequence, say , that converges to almost surely.
By iterating this procedure, we build, for all integers , an increasing function such that converges almost surely to , as .
Finally, let the function defined by , for all . Then, by construction, converges almost surely to , for all . Hence, with probability , converges almost surely to for all . Lemma 1 yields that converges to zero almost surely, which concludes the proof.
Appendix D A brief introduction to Alexandrov’s curvature
In this section, we give the precise definition of CAT spaces.
D.1 Model spaces
First, we introduce a family of model spaces that will allow us to define local and global curvature bounds in the sequel. Let .
: Euclidean plane
Set , equipped with its Euclidean metric. This model space corresponds to zero curvature, is a geodesic space where geodesics are unique and given by line segments.
: Sphere
Set : This is the -dimensional Euclidean sphere, embedded in , with center and radius , equipped with the arc length metric: , for all . This is a geodesic space where the geodesics are unique except for opposite points, and given by arcs of great circles. Here, a great circle is the intersection of the sphere with any plane going through the origin.
: Hyperbolic space
Set , where . The metric is given by , for all , where . This is a geodesic space where geodesics are always unique and are given by arcs of the intersections of with planes going through the origin.
For $\kappa \in \mathbb{R}$, let $D_\kappa$ be the diameter of the model space $M_\kappa^2$, i.e., $D_\kappa = \pi / \sqrt{\kappa}$ if $\kappa > 0$ and $D_\kappa = \infty$ if $\kappa \leq 0$.
D.2 Curvature bounds
Let be a geodesic space. The notion of curvature (lower or upper) bounds for is defined by comparing the triangles in with triangles with the same side lengths in model spaces.
Definition 3.
A (geodesic) triangle in is a set of three points in (the vertices) together with three geodesic segments connecting them (the sides).
Given three points , we abusively denote by a triangle with vertices , with no mention to which geodesic segments are chosen for the sides (geodesics between points are not necessarily unique, as seen for example on a sphere, between any two opposite points). The perimeter of a triangle is defined as . It does not depend on the choice of the sides.
Definition 4.
Let and be a triangle in with . A comparison triangle for in the model space is a triangle with same side lengths as , i.e., if , then where are any points in satisfying
Note that a comparison triangle in is always unique up to rigid motions. We are now ready to define curvature bounds. Intuitively, we say that has global curvature bounded from above (resp. below) by if all its triangles (with perimeter smaller than ) are thinner (resp. fatter) than their comparison triangles in the model space .
Definition 5.
Let . We say that has global curvature bounded from above (resp. below) by if and only if for all triangles with and for all , (resp. ), where and are the points on a comparison triangle in that correspond to and . We then call a (resp. ) space.
Lemma 11.
[45, Proposition 3.1]
Let be for some and assume that has diameter . Then, for all , the function is convex on . Moreover, set if , otherwise. Then, the function is -strongly convex, i.e., it satisfies
for all and .
D.3 Examples of CAT spaces
In this section, we give a few examples of CAT spaces. First, the following lemma, which is a direct consequence of [15, Theorem 1A.6] together with [16, Theorem 9.2.9], gives a whole class of Riemannian manifolds that are CAT spaces.
Lemma 12.
Let $M$ be a simply connected Riemannian manifold. Assume that all of its sectional curvatures are bounded from above by some $\kappa \in \mathbb{R}$. Then, $M$ is CAT($\kappa$).
We give two examples that are of particular relevance in optimal transport and matrix analysis.
-
1.
Let $\mathcal{S}_d^{++}$ be the collection of all $d \times d$ real, symmetric positive definite matrices. Equip $\mathcal{S}_d^{++}$ with the metric given by $d(A, B) = \|\log(A^{-1/2} B A^{-1/2})\|_F$, where $\log$ is the matrix logarithm and $\|\cdot\|_F$ is the Frobenius norm. Then, $d$ is inherited from a Riemannian metric and $(\mathcal{S}_d^{++}, d)$ is CAT(0) [9]. This metric space is particularly important in the study of matrix geometric means. For any $A, B \in \mathcal{S}_d^{++}$, their geometric mean is defined as $A \# B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}$, and it is the unique midpoint from $A$ to $B$, i.e., $A \# B = \gamma(1/2)$, where $\gamma$ is the unique geodesic between $A$ and $B$. (A numerical sketch of the two metrics in this list is given right after it.)
-
2.
Let $\mathcal{S}$ be the collection of all $d \times d$ real, symmetric positive definite matrices with spectrum included in $[\alpha, \beta]$, for some $0 < \alpha \leq \beta$. Equip $\mathcal{S}$ with the Bures-Wasserstein metric, obtained as follows. For $A, B \in \mathcal{S}$, let $d_{BW}(A, B) = W_2(\mathcal{N}(0, A), \mathcal{N}(0, B))$, where $W_2$ is the Wasserstein distance and $\mathcal{N}(0, A)$ (resp. $\mathcal{N}(0, B)$) is the $d$-variate Gaussian distribution with mean $0$ and covariance matrix $A$ (resp. $B$). Then, $d_{BW}$ is also inherited from a Riemannian metric [10] and $(\mathcal{S}, d_{BW})$ is CAT($\kappa$) for some $\kappa > 0$ depending on $\alpha$, by [43, Proposition 2].
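As announced in item 1, here is a small numerical sketch (ours) of the two geometries just described: the affine-invariant distance with its geometric-mean midpoint, and the Bures-Wasserstein distance. It relies on scipy.linalg for matrix square roots and logarithms, and the test matrices are arbitrary.

```python
import numpy as np
from scipy.linalg import sqrtm, logm, inv

def _real(X):
    # sqrtm/logm may return a complex array with negligible imaginary part for SPD inputs.
    return np.real(X)

def affine_invariant_dist(A, B):
    # d(A, B) = || log(A^{-1/2} B A^{-1/2}) ||_F, the CAT(0) metric on SPD matrices.
    A_isqrt = inv(_real(sqrtm(A)))
    return np.linalg.norm(_real(logm(A_isqrt @ B @ A_isqrt)), "fro")

def geometric_mean(A, B):
    # Midpoint of the geodesic: A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}.
    A_sqrt = _real(sqrtm(A))
    A_isqrt = inv(A_sqrt)
    return A_sqrt @ _real(sqrtm(A_isqrt @ B @ A_isqrt)) @ A_sqrt

def bures_wasserstein_dist(A, B):
    # d_BW(A, B)^2 = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}).
    A_sqrt = _real(sqrtm(A))
    cross = _real(sqrtm(A_sqrt @ B @ A_sqrt))
    return np.sqrt(max(np.trace(A) + np.trace(B) - 2.0 * np.trace(cross), 0.0))

rng = np.random.default_rng(4)
G, H = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A, B = G @ G.T + np.eye(3), H @ H.T + np.eye(3)      # two SPD matrices
mid = geometric_mean(A, B)
print(np.isclose(affine_invariant_dist(A, mid), affine_invariant_dist(mid, B)))  # midpoint
print("Bures-Wasserstein distance:", bures_wasserstein_dist(A, B))
```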
Further examples are given below.
-
•
Metric trees are CAT($\kappa$) for any $\kappa \in \mathbb{R}$ (since all their triangles are flat) [15, Section II.1.15];
-
•
The surface of a smooth convex body in a Euclidean space is CAT($\kappa$), where $\kappa$ is determined by an upper bound on the reach of the complement of the body, as a consequence of [48, Theorem 3.2.9];
-
•
The space of phylogenetic trees [14] is CAT(0).