
Geodesically convex $M$-estimation in metric spaces

Victor-Emmanuel Brunel CREST-ENSAE, [email protected]
Abstract

We study the asymptotic properties of geodesically convex $M$-estimation on non-linear spaces. Namely, we prove that under very minimal assumptions besides geodesic convexity of the cost function, one can obtain consistency and asymptotic normality, which are fundamental properties in statistical inference. Our results extend the Euclidean theory of convex $M$-estimation; they also generalize limit theorems on non-linear spaces which, essentially, were only known for barycenters, allowing one to consider robust alternatives that are defined through non-smooth $M$-estimation procedures.

keywords:
$M$-estimation, metric spaces, Riemannian manifolds, CAT spaces, geodesic convexity, barycenters, robust location estimation

1 Preliminaries

1.1 Introduction

The problem of $M$-estimation, or empirical risk minimization, is pervasive in statistics and machine learning. In many problems, one seeks to minimize a function that takes the form of an expectation, i.e., $\Phi(x)=\mathbb{E}[\phi(Z,x)]$, where $Z$ is a random variable with unknown distribution, $x$ is the target variable, which lives in some target space, and $\phi$ can be seen as a cost function. Classification, regression, location estimation and maximum likelihood estimation are standard instances of such problems. In practice, the distribution of $Z$ is unknown, so the function $\Phi$ cannot be evaluated. Hence, one rather minimizes a proxy of the form $\hat{\Phi}_{n}(x)=n^{-1}\sum_{i=1}^{n}\phi(Z_{i},x)$, where $Z_{1},\ldots,Z_{n}$ are available i.i.d. copies of $Z$. This method is called $M$-estimation, or empirical risk minimization. When the target space is Euclidean, many guarantees are available for a minimizer $\hat{x}_{n}$ of $\hat{\Phi}_{n}$, viewed as an estimator of a minimizer $x^{*}$ of $\Phi$. Asymptotic results (such as consistency and asymptotic normality) are classical and well known, especially under technical smoothness assumptions on the cost function $\phi$ [53], most of which can be dropped when the cost function is convex in the target variable [28, 44].
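
As a minimal numerical illustration (ours, not from the paper, with hypothetical variable names), take the Euclidean case $\mathcal{Z}=M=\mathbb{R}$ with the cost $\phi(z,x)=|z-x|$: the resulting $M$-estimator is the sample median.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
z = rng.standard_normal(101)  # i.i.d. copies Z_1, ..., Z_n of Z

def empirical_risk(x, sample):
    # \hat{\Phi}_n(x) = n^{-1} \sum_i \phi(Z_i, x) with \phi(z, x) = |z - x|
    return np.mean(np.abs(sample - x))

res = minimize_scalar(empirical_risk, args=(z,))
print(res.x, np.median(z))  # both are (numerically close to) the sample median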

In this work, we are interested in the framework in which the target space is not Euclidean, and does not even exhibit any linear structure. Of course, the space needs to be equipped with a metric, at the very least in order to measure the accuracy of an estimation procedure. Thus, we consider a target space that is a metric space, which we denote by $(M,d)$.

Statistics and machine learning are increasingly confronted with data that lie in non-linear spaces: in spatial statistics (e.g., directional data), computational tomography (e.g., data in quotient spaces such as in shape statistics, collected up to rigid transformations), economics (e.g., optimal transport, where data are measures), etc. Moreover, data and/or target parameters that are encoded as very high dimensional vectors may have a much smaller intrinsic dimension, for instance if they lie on low dimensional submanifolds of the ambient Euclidean space. In that case, leveraging the possibly non-linear geometry of the problems at hand can be a powerful tool to significantly reduce their dimensionality. This is the central motivating idea behind manifold learning [26, 1] or generative adversarial networks [50]. Even though more and more algorithms are developed to work and, in particular, optimize in non-linear spaces [42, 46, 55, 56, 7, 19], still little is understood from a statistical perspective.

A specific instance of $M$-estimation on metric spaces, which has attracted much attention, is that of barycenters. Given $n$ points $x_{1},\ldots,x_{n}\in M$, a barycenter is any minimizer $x\in M$ of the sum of squared distances $\sum_{i=1}^{n}d(x,x_{i})^{2}$. One can easily check that if $(M,d)$ is a Euclidean or Hilbert space, then the minimizer is unique and it is given by the average of $x_{1},\ldots,x_{n}$. Barycenters extend the notion of average to spaces that lack a linear structure. More generally, given a probability measure $P$ on $M$, one can define a barycenter of $P$ as any minimizer $x\in M$ of $\int_{M}d(x,z)^{2}\mathop{}\!\mathrm{d}P(z)$, provided that the integral is finite for at least one (and hence, for all) value of $x\in M$. Barycenters were initially introduced in statistics by [24] in the 1940’s, and later by [33], where they were rather called Fréchet means or Karcher means. They were popularized in the fields of shape statistics [34], optimal transport [2, 20, 38, 18, 37, 36, 5, 6] and matrix analysis [9, 8, 10]. Asymptotic theory is fairly well understood for empirical barycenters in various setups, particularly laws of large numbers [57, 51] and central limit theorems in Riemannian manifolds (it is natural to impose a smooth structure on $M$ in order to derive central limit theorems) [12, 13, 35, 32, 11, 23, 22]. A few finite sample guarantees are available, even though the non-asymptotic statistical theory for barycenters is still quite limited [51, 25, 49, 3, 39]. Asymptotic theorems are also available for more general $M$-estimators on Riemannian manifolds, e.g., $p$-means, where $\mathcal{Z}=M$ and $\phi(z,x)=d(x,z)^{p}$, for some $p\geq 1$ [49], including geometric medians as a particular case ($p=1$), but only suboptimal rates are known. In particular, the standard $n^{-1/2}$-rate is unknown in general, beyond barycenters. It should be noted that all central limit theorems for barycenters in Riemannian manifolds rely on the smoothness of the cost function $\phi=d^{2}$ and use standard techniques from smooth $M$-estimation [53] by performing Taylor expansions in local charts. Moreover, they assume that the data are almost surely in a ball with small radius, which essentially ensures the convexity of the squared distance function $d^{2}$ in its second variable, even though this synthetic property is not leveraged in this line of work. For instance, asymptotic normality cannot be obtained for geometric medians or other robust alternatives to barycenters, using these techniques.
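
As a toy illustration (ours, not from the paper), on the circle $M=\mathbb{R}/2\pi\mathbb{Z}$ equipped with the arc-length distance, an empirical barycenter of angles $x_{1},\ldots,x_{n}$ can be approximated by a direct search over a grid:

import numpy as np

def arc_dist(a, b):
    # geodesic (arc-length) distance on the circle R / 2*pi*Z
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

rng = np.random.default_rng(1)
data = rng.normal(loc=1.0, scale=0.3, size=50) % (2 * np.pi)  # angles x_1, ..., x_n

grid = np.linspace(0.0, 2 * np.pi, 10000, endpoint=False)
objective = np.array([np.sum(arc_dist(t, data) ** 2) for t in grid])
barycenter = grid[np.argmin(objective)]  # approximate minimizer of sum_i d(x, x_i)^2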

1.2 Contributions

In this work, we prove strong consistency and asymptotic normality of geodesically convex $M$-estimators under very mild assumptions. We recover the central limit theorems proven for barycenters in Riemannian manifolds with small diameter, but we cover a much broader class of location estimators, including robust alternatives to barycenters. Just as in the Euclidean case, our results convey that convexity is a powerful property, since it yields strong learning guarantees under very few assumptions.

All our results are asymptotic, but it is worth mentioning that proving learning guarantees in non-linear spaces is a very challenging, ongoing research topic, because it requires non-standard tools from metric and differential geometry. In particular, very few non-asymptotic learning guarantees are available, even for simple estimators such as barycenters, and asymptotic theory is still an active research area in that framework (e.g., understanding non-standard asymptotic rates, a.k.a. sticky and smeary central limit theorems [30, 23]). Moreover, asymptotic guarantees such as asymptotic normality provide benchmarks that are also important for a finite sample theory.

2 Background and notation

2.1 Geodesic spaces

Let $(M,d)$ be a complete and separable metric space that satisfies the following property: for all $x,y\in M$, there exists a continuous function $\gamma:[0,1]\to M$ such that $\gamma(0)=x$, $\gamma(1)=y$ and with the property that $d(\gamma(s),\gamma(t))=|s-t|d(x,y)$, for all $s,t\in[0,1]$. Such a function $\gamma$ is called a unit speed geodesic from $x$ to $y$ and it should be thought of as a (not necessarily unique) shortest path from $x$ to $y$. We denote by $\Gamma_{x,y}$ the collection of all such functions. The space $(M,d)$ is then called a geodesic space. Examples of geodesic spaces include Euclidean spaces, Euclidean spheres, metric graphs, Wasserstein spaces, quotient spaces, etc. [16, 15]. Geodesic spaces allow for a natural extension of the notion of convexity.

Definition 1.

A function $f:M\to\mathbb{R}$ is called geodesically convex (convex, for short) if for all $x,y\in M$, $\gamma\in\Gamma_{x,y}$ and $t\in[0,1]$, $f(\gamma(t))\leq(1-t)f(x)+tf(y)$.
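
As a sanity check, in a Euclidean space $(M,d)=(\mathbb{R}^{k},\|\cdot\|)$, the geodesics are exactly the segments $\gamma(t)=(1-t)x+ty$, so Definition 1 reduces to the usual convexity inequality

f((1-t)x+ty)\leq(1-t)f(x)+tf(y),\qquad x,y\in\mathbb{R}^{k},\ t\in[0,1].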

In this work, we will further assume that $(M,d)$ is a proper space, i.e., all bounded closed sets are compact. This assumption may seem limiting in practice but, at a high level, it essentially only discards infinite dimensional spaces. Indeed, the Hopf-Rinow theorem [15, Chapter 1, Proposition 3.7] asserts that if $(M,d)$ is complete and locally compact, then it is proper. For instance, the Hopf-Rinow theorem for Riemannian manifolds asserts that any complete (finite dimensional) Riemannian manifold is a proper geodesic metric space [21, Chapter 7, Theorem 2.8].

2.2 Riemannian manifolds

A Riemannian manifold $(M,g)$ is a finite dimensional smooth manifold $M$ equipped with a Riemannian metric $g$, i.e., a smooth family of scalar products on tangent spaces. We refer the reader to [21] and [40, 41] for a clear introduction to differential geometry and Riemannian manifolds. The distance inherited on $M$ from the Riemannian metric $g$ will be denoted by $d$. In this work, for simplicity, we only consider Riemannian manifolds without boundary, even though our results extend easily to the case with boundary.

The tangent bundle of $M$ is denoted by $TM$, i.e., $TM=\bigcup_{x\in M}T_{x}M$ where, for all $x\in M$, $T_{x}M$ is the tangent space to $M$ at $x$. For all $x\in M$, we also let $\textrm{Exp}_{x}:T_{x}M\to M$ and $\textrm{Log}_{x}:M\to T_{x}M$ be the exponential and the logarithmic maps at $x$, respectively. The latter may not be defined on the whole tangent space, but it is always defined at least on a neighborhood of the origin. To provide intuition, at a given point $x\in M$, and for a given vector $u\in T_{x}M$, $\textrm{Exp}_{x}(u)$ is the Riemannian analog of “$x+u$” in a Euclidean space: from the point $x$, move in the direction indicated by $u$, over a distance given by its norm (measured by the Riemannian metric $g$), in order to reach the point denoted by $\textrm{Exp}_{x}(u)$. Conversely, given $x,y\in M$, $\textrm{Log}_{x}(y)$ (when it is well-defined) is the Riemannian analog of “$y-x$”, that is, the vector in $T_{x}M$ that leads from $x$ to $y$.
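
For concreteness, here is a small sketch (ours, not part of the paper) of closed-form expressions for $\textrm{Exp}_{x}$ and $\textrm{Log}_{x}$ on the unit sphere $S^{2}\subset\mathbb{R}^{3}$, viewed as a Riemannian submanifold of $\mathbb{R}^{3}$:

import numpy as np

def sphere_exp(x, u):
    # Exp_x(u): move from x along the great circle with initial velocity u in T_x S^2
    nrm = np.linalg.norm(u)
    if nrm < 1e-12:
        return x
    return np.cos(nrm) * x + np.sin(nrm) * u / nrm

def sphere_log(x, y):
    # Log_x(y): tangent vector at x pointing towards y, with norm d(x, y)
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))  # geodesic distance d(x, y)
    if theta < 1e-12:
        return np.zeros_like(x)
    w = y - np.dot(x, y) * x  # component of y orthogonal to x
    return theta * w / np.linalg.norm(w)

x = np.array([0.0, 0.0, 1.0])
y = np.array([1.0, 0.0, 0.0])
print(sphere_exp(x, sphere_log(x, y)))  # recovers y (up to rounding), as Exp_x(Log_x(y)) = y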

For all $x\in M$, we denote by $\langle\cdot,\cdot\rangle_{x}$ the scalar product on $T_{x}M$ inherited from $g$ and by $\|\cdot\|_{x}$ the corresponding norm.

Let $x,y\in M$ and $\gamma\in\Gamma_{x,y}$ be a geodesic (in the sense of Section 2.1) from $x$ to $y$. If $x$ and $y$ are close enough, $\gamma$ is unique [41, Proposition 6.11] and in that case, for all $t\in[0,1]$, $\gamma(t)=\textrm{Exp}_{x}(t\dot{\gamma}(0))$, where $\dot{\gamma}(0)\in T_{x}M$ is the derivative of $\gamma$ at $t=0$, given by $\dot{\gamma}(0)=\textrm{Log}_{x}(y)$. We denote the parallel transport map from $x$ to $y$ along $\gamma$ as $\pi_{x,y}:T_{x}M\to T_{y}M$, without specifying its dependence on $\gamma$ (which is non-ambiguous if $y$ is close enough to $x$). The map $\pi_{x,y}$ is a linear isometry, i.e., $\langle u,v\rangle_{x}=\langle\pi_{x,y}(u),\pi_{x,y}(v)\rangle_{y}$, for all $u,v\in T_{x}M$. Moreover, it holds that $\pi_{x,y}(\dot{\gamma}(0))=\dot{\gamma}(1)$, or equivalently, $\pi_{x,y}(\textrm{Log}_{x}(y))=-\textrm{Log}_{y}(x)$. Intuitively, parallel transport is an important tool that allows one to compare vectors from different tangent spaces, along a given path on the manifold.
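
Continuing the sphere example above (again a sketch under our own conventions, not part of the paper), parallel transport along the minimizing geodesic from $x$ to $y$ also has a closed form: the component of a tangent vector along the direction of motion is rotated within the plane spanned by $x$ and that direction, while the orthogonal component is left unchanged.

import numpy as np

def sphere_parallel_transport(x, y, v):
    # transport v in T_x S^2 to T_y S^2 along the minimizing geodesic from x to y
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    if theta < 1e-12:
        return v
    w = y - np.dot(x, y) * x
    e = w / np.linalg.norm(w)   # unit direction of the geodesic at x
    a = np.dot(e, v)            # component of v along the direction of motion
    # rotate the component along e within span{x, e}; keep the orthogonal part fixed
    return (v - a * e) + a * (np.cos(theta) * e - np.sin(theta) * x)

One can check numerically that this map is a linear isometry and that it sends $\textrm{Log}_{x}(y)$ to $-\textrm{Log}_{y}(x)$, in accordance with the identity above.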

For further details and properties on Riemannian manifolds used in the mathematical development of this work, we refer the reader to Appendix B.

2.3 $M$-estimation

Let $\mathcal{Z}$ be some abstract space, equipped with a $\sigma$-algebra $\mathcal{F}$ and a probability measure $P$. Let $\phi:\mathcal{Z}\times M\to\mathbb{R}$ satisfy the following:

  1. (i)

    For all $x\in M$, the map $\phi(\cdot,x)$ is measurable and integrable with respect to $P$;

  2. (ii)

    For $P$-almost all $z\in\mathcal{Z}$, $\phi(z,\cdot)$ is convex.

Finally, let $Z,Z_{1},Z_{2},\ldots$ be i.i.d. random variables in $\mathcal{Z}$ with distribution $P$ and set the functions

\Phi(x)=\mathbb{E}[\phi(Z,x)]\quad\mbox{ and }\quad\hat{\Phi}_{n}(x)=n^{-1}\sum_{i=1}^{n}\phi(Z_{i},x),

for all $x\in M$ and for all positive integers $n$. We denote by $M^{*}$ the set of minimizers of $\Phi$. Note that by convexity of $\Phi$, the minimizing set $M^{*}$ is convex, i.e., for all $x,y\in M^{*}$ and for all $\gamma\in\Gamma_{x,y}$, $\gamma([0,1])\subseteq M^{*}$.

Important examples include the case where $\mathcal{Z}=M$ and $\phi$ is a function of the metric, which yields location estimation. Specifically, assume that $\mathcal{Z}=M$ and that $\phi(z,x)=\ell(d(z,x))$, for all $z,x\in M$, where $\ell:[0,\infty)\to[0,\infty)$ is a non-decreasing, convex function. Then, a natural framework to ensure convexity of the functions $\phi(z,\cdot)$ is that of spaces with curvature upper bounds, a.k.a. CAT spaces. Here, we only give an intuitive definition of CAT spaces. We refer the reader to Appendix D for a more detailed account on CAT spaces.

Fix some real number $\kappa$. We say that $(M,d)$ is $\textrm{CAT}(\kappa)$ (or that it has global curvature upper bound $\kappa$) if all small enough triangles in $M$ are thinner than what they would be in a space of constant curvature $\kappa$, i.e., a sphere if $\kappa>0$, a Euclidean space if $\kappa=0$ and a hyperbolic space if $\kappa<0$. This setup is very natural to grasp the convexity of the distance function to a point. Fix $x_{0}\in M$. The function $d(\cdot,x_{0})$ is convex if and only if for all $y,z\in M$, all $\gamma\in\Gamma_{y,z}$ and all $t\in[0,1]$, $d(\gamma(t),x_{0})\leq(1-t)d(y,x_{0})+td(z,x_{0})$, which controls the distance from $x_{0}$ to any point on the edge opposite to $x_{0}$ in a triangle with vertices $x_{0},y,z$. In other words, it indicates how thin a triangle with vertices $x_{0},y,z$ must be.

Hence, if $\mathcal{Z}=M$ is $\textrm{CAT}(\kappa)$ and has diameter at most $D_{\kappa}$ (see Appendix D for its definition), [45, Proposition 3.1] guarantees that for all $z\in M$, $d(z,\cdot)$ is convex on $M$ and hence, $\phi(z,\cdot)=\ell(d(z,\cdot))$ is also convex as soon as $\ell:[0,\infty)\to[0,\infty)$ is non-decreasing and convex itself. In that setup, important examples of functions $\ell$ include the following (a short code sketch of these losses is given after the list):

  1. (i)

    $\ell(u)=u^{2}$: Then, a minimizer of $\Phi$ is a barycenter of $P$ and a minimizer of $\hat{\Phi}_{n}$ is an empirical barycenter of $Z_{1},\ldots,Z_{n}$

  2. (ii)

    $\ell(u)=u$: Then, a minimizer of $\Phi$ is a geometric median of $P$

  3. (iii)

    More generally, if $\ell(u)=u^{p},p\geq 1$, a minimizer of $\Phi$ is called a $p$-mean of $P$

  4. (iv)

    $\ell(u)=u^{2}$ if $0\leq u\leq c$, $\ell(u)=c(2u-c)$ if $u>c$, where $c>0$ is a fixed parameter. This function was introduced by Huber [31] in order to produce low-bias robust estimators of the mean: It provides an interpolation between barycenter and geometric median estimation.
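
For reference, here is a minimal sketch of these four choices of $\ell$ (with our own, hypothetical naming conventions); each is non-decreasing and convex on $[0,\infty)$, as required above.

import numpy as np

def loss(u, kind="huber", p=3, c=1.0):
    # losses ell applied to u = d(z, x) >= 0
    u = np.asarray(u, dtype=float)
    if kind == "squared":   # (i): barycenters
        return u ** 2
    if kind == "absolute":  # (ii): geometric medians
        return u
    if kind == "power":     # (iii): p-means, p >= 1
        return u ** p
    if kind == "huber":     # (iv): quadratic on [0, c], linear with matching slope beyond c
        return np.where(u <= c, u ** 2, c * (2 * u - c))
    raise ValueError(kind)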

Important examples of CAT spaces include Euclidean spaces, spheres, simply connected Riemannian manifolds with bounded sectional curvature, metric trees, etc. In Appendix D, we provide a more detailed list of examples.

3 Consistency

This section contains our first main results on the consistency of geodesically convex $M$-estimators. Here, we work under two scenarios. First, we assume that $\phi(z,\cdot)$ is convex and Lipschitz with some constant that does not depend on $z\in\mathcal{Z}$. This is somewhat restrictive, even though it covers most of the cases mentioned above for location estimation. In the second scenario, we only assume convexity, but we consider the case where $M$ is a Riemannian manifold. We then discuss possible extensions of our findings.

3.1 Lipschitz case

Here, we assume that there exists $L>0$ such that $\phi(z,\cdot)$ is $L$-Lipschitz on $M$, for all $z\in\mathcal{Z}$. For instance, this is satisfied if $M$ has bounded diameter and $\phi=\ell\circ d$, for some locally Lipschitz function $\ell:[0,\infty)\to[0,\infty)$, such as the functions mentioned above for location estimation.

Recall that we denote by $M^{*}$ the set of minimizers of $\Phi$. The following theorem does not require that $\Phi$ has a unique minimizer: it asserts that any minimizer of $\hat{\Phi}_{n}$ eventually becomes arbitrarily close to $M^{*}$ with probability $1$.

Theorem 1.

Let $M$ be a proper geodesic metric space. Assume that $M^{*}$ is non-empty and bounded. For $n\geq 1$, let $\hat{x}_{n}$ be a minimizer of $\hat{\Phi}_{n}$. Then, it holds that $d(\hat{x}_{n},M^{*})\xrightarrow[n\to\infty]{}0$ almost surely.

A simple adaptation of the proof shows that the same conclusion applies if $\hat{x}_{n}$ is not exactly a minimizer of $\hat{\Phi}_{n}$ but rather satisfies $\hat{\Phi}_{n}(\hat{x}_{n})\leq\min_{x\in M}\hat{\Phi}_{n}(x)+\varepsilon_{n}$, where $\varepsilon_{n}$ is any non-negative error term that goes to zero almost surely, as $n\to\infty$.

The proof of Theorem 1 is based on the following extension of [47, Theorem 10.8], whose proof is a straightforward adaptation of the Euclidean argument.

Lemma 1.

Let $L>0$ and $(f_{n})_{n\geq 1}$ be a sequence of $L$-Lipschitz convex functions on $M$. Assume that there is a function $f:M\to\mathbb{R}$ and a dense subset $M_{0}$ of $M$ such that $f_{n}(x)\xrightarrow[n\to\infty]{}f(x)$, for all $x\in M_{0}$. Then, $f$ is convex, $L$-Lipschitz, and $f_{n}$ converges to $f$ uniformly on any compact subset of $M$.

This lemma is the reason why, in this section, we require $\phi(z,\cdot)$ to be Lipschitz, with a constant that does not depend on $z\in\mathcal{Z}$. If $M$ were Euclidean, the Lipschitz assumption of Lemma 1 would not be necessary because, on any compact subset, it would be a consequence of the convexity and pointwise convergence of the functions $f_{n},n\geq 1$, see [47, Theorem 10.6]. However, to the best of our knowledge, this may not hold in the more general case that we treat here (and may be the subject of further research).

As a corollary to Lemma 1, we now state the following result, which is essential in our proof of Theorem 1. The proof is deferred to Appendix C.

Lemma 2.

Let $L>0$ and $(F_{n})_{n\geq 1}$ be a sequence of real valued, $L$-Lipschitz, convex random functions on $M$. Let $F$ be a (possibly random) function on $M$ and assume that for all $x\in M$, $F_{n}(x)$ converges almost surely (resp. in probability) to $F(x)$. Then, the convergence holds uniformly on any compact subset $K$ of $M$, i.e., $\sup_{x\in K}|F_{n}(x)-F(x)|$ converges to zero almost surely (resp. in probability).

Proof of Theorem 1.

Let $\Phi^{*}=\min_{x\in M}\Phi(x)$ be the smallest value of $\Phi$ on $M$ and fix some arbitrary $x^{*}\in M^{*}$. Fix $\varepsilon>0$ and let $K_{\varepsilon}=\{x\in M:d(x,M^{*})=\varepsilon\}$. Since $M^{*}$ is bounded, $K_{\varepsilon}$ is bounded. Moreover, since the distance is continuous, $K_{\varepsilon}$ is a closed set. Hence, it is compact, since we have assumed $(M,d)$ to be proper. Since $\Phi$ is Lipschitz, it is continuous, so it holds that $\eta:=\min_{x\in K_{\varepsilon}}\Phi(x)-\Phi^{*}>0$. Moreover, by the law of large numbers, $\hat{\Phi}_{n}(x)\xrightarrow[n\to\infty]{}\Phi(x)$ almost surely, for all $x\in M$. Therefore, by Lemma 2, $\hat{\Phi}_{n}$ converges uniformly to $\Phi$ on $K_{\varepsilon}$ almost surely. So, with probability $1$, $\inf_{x\in K_{\varepsilon}}\hat{\Phi}_{n}(x)>\Phi^{*}+2\eta/3$ for all large enough $n$. Moreover, also with probability $1$, $\hat{\Phi}_{n}(x^{*})<\Phi^{*}+\eta/3$ for all large enough $n$.

We have established that with probability $1$, it holds simultaneously, for all large enough $n$, that $\inf_{x\in K_{\varepsilon}}\hat{\Phi}_{n}(x)>\hat{\Phi}_{n}(x^{*})$. Let us show that this, together with the convexity of $\hat{\Phi}_{n}$, implies that any minimizer of $\hat{\Phi}_{n}$ must be at distance at most $\varepsilon$ from $M^{*}$. This will yield the desired result.

Assume, for the sake of contradiction, that $\hat{\Phi}_{n}$ has a minimizer $\tilde{x}_{n}$ that satisfies $d(\tilde{x}_{n},M^{*})>\varepsilon$. Consider a geodesic $\gamma\in\Gamma_{x^{*},\tilde{x}_{n}}$. Then, by continuity of the function $d(\cdot,M^{*})$, there must be some $t\in[0,1]$ such that $\gamma(t)\in K_{\varepsilon}$. The convexity of $\hat{\Phi}_{n}$ yields the convexity of $\hat{\Phi}_{n}\circ\gamma$, which attains its minimum at $t=1$. Hence, $\hat{\Phi}_{n}\circ\gamma$ must be non-increasing, implying that $\inf_{x\in K_{\varepsilon}}\hat{\Phi}_{n}(x)\leq\hat{\Phi}_{n}(\gamma(t))\leq\hat{\Phi}_{n}(x^{*})$, which yields a contradiction.

\square

3.2 Riemannian framework

Now, we prove that, at least in a Riemannian manifold, the Lipschitz condition on $\phi$ can be dropped for the consistency of $\hat{x}_{n}$. In this section, we assume that $(M,g)$ is a complete Riemannian manifold and we have the following theorem.

Theorem 2.

Let $(M,g)$ be a Riemannian manifold. Assume that $M^{*}$ is non-empty and bounded. For $n\geq 1$, let $\hat{x}_{n}$ be a minimizer of $\hat{\Phi}_{n}$. Then, it holds that $d(\hat{x}_{n},M^{*})\xrightarrow[n\to\infty]{}0$ almost surely.

Again, this theorem applies if $\hat{x}_{n}$ satisfies $\hat{\Phi}_{n}(\hat{x}_{n})\leq\min_{x\in M}\hat{\Phi}_{n}(x)+\varepsilon_{n}$, where $\varepsilon_{n}$ is any non-negative error term that goes to zero almost surely, as $n\to\infty$.

Once one has Lemma 4 below, which builds upon Lemma 3, the proof of Theorem 2 is very similar to that of Theorem 1, hence, we omit it.

Lemma 3.

[27] Any convex function on a Riemannian manifold is continuous and locally Lipschitz.

By adapting the proof of [47, Theorem 10.8], this lemma yields the following result.

Lemma 4.

Let $(f_{n})_{n\geq 1}$ be a sequence of convex functions on $M$. Assume that $f_{n}(x)\xrightarrow[n\to\infty]{}f(x)$, for all $x$ in a dense subset $M_{0}$ of $M$, where $f:M\to\mathbb{R}$ is some given convex function. Then, the sequence $(f_{n})_{n\geq 1}$ is equi-Lipschitz on any compact subset of $M$ and converges to $f$ uniformly on any compact subset of $M$.

Remark 1.

Lemma 3 is key in the proof of consistency if one does not assume that $\phi(z,\cdot)$ is $L$-Lipschitz, for all $z\in\mathcal{Z}$, with some $L>0$ that does not depend on $z$. As we pointed out in the previous section, we do not know whether this lemma still holds true in more general proper geodesic spaces and we leave this as an open question.

4 Asymptotic normality

Now, we turn to the asymptotic normality of $\hat{x}_{n}$. Riemannian manifolds offer a natural and reasonable framework because, on top of their metric structure, they enjoy smoothness, which allows one to compute first and second order expansions, and Gaussian distributions can be defined on their tangent spaces. Hence, in this section, we assume that $(M,g)$ is a complete Riemannian manifold.

Theorem 3.

Assume that the distribution $P$ and the function $\phi$ satisfy the following assumptions:

  • $\Phi$ has a unique minimizer $x^{*}\in M$

  • $\Phi$ is twice differentiable at $x^{*}$ and $\textrm{H}\Phi(x^{*})$ is positive definite

  • There exists $\eta>0$ such that $\mathbb{E}[\|g(Z,x)\|_{x}^{2}]<\infty$, for all $x\in M$ with $d(x^{*},x)\leq\eta$, where $g(Z,x)$ is a measurable subgradient of $\phi(Z,\cdot)$ at the point $x$.

Then, $\hat{x}_{n}$ is asymptotically normal, that is,

\sqrt{n}\textrm{Log}_{x^{*}}(\hat{x}_{n})\xrightarrow[n\to\infty]{(d)}\mathcal{N}_{T_{x^{*}}M}(0,V(x^{*}))

where $V(x^{*})=S(x^{*})^{-1}B(x^{*})S(x^{*})^{-1}$, $S(x^{*})=\textrm{H}\Phi(x^{*})$, $B(x^{*})=\mathbb{E}[g(Z,x^{*})g(Z,x^{*})^{\top}]$ and we denote by $\mathcal{N}_{T_{x^{*}}M}(0,V(x^{*}))$ the normal distribution on the Euclidean space $T_{x^{*}}M$ with mean $0$ and covariance operator $V(x^{*})$.

Here, $\textrm{H}\Phi(x^{*})$ is the Hessian of $\Phi$ at the point $x^{*}$ (see Appendix B). Recall that $\textrm{Log}_{x^{*}}(\hat{x}_{n})$ is the Riemannian analog of “$\hat{x}_{n}-x^{*}$”, so Theorem 3 is a natural formulation of the asymptotic normality of $\hat{x}_{n}$ in the Riemannian context.
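
As a sanity check (not part of the statement of the theorem), consider the Euclidean case $M=\mathbb{R}^{k}$ with $\phi(z,x)=\|x-z\|^{2}$ and assume that $Z$ has covariance matrix $\Sigma$. Then $g(z,x)=2(x-z)$, $x^{*}=\mathbb{E}[Z]$ and

S(x^{*})=\textrm{H}\Phi(x^{*})=2I_{k},\qquad B(x^{*})=4\,\mathbb{E}[(x^{*}-Z)(x^{*}-Z)^{\top}]=4\Sigma,\qquad V(x^{*})=\tfrac{1}{2}I_{k}\cdot 4\Sigma\cdot\tfrac{1}{2}I_{k}=\Sigma,

so that Theorem 3 recovers the classical central limit theorem $\sqrt{n}(\bar{Z}_{n}-\mathbb{E}[Z])\xrightarrow[n\to\infty]{(d)}\mathcal{N}(0,\Sigma)$ for the sample mean $\bar{Z}_{n}=\hat{x}_{n}$.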

A close look at the end of the proof will convince the reader that one can assume, without modifying the conclusion of this theorem, that $\hat{x}_{n}$ need not be a minimizer of $\hat{\Phi}_{n}$, but only satisfy $\hat{\Phi}_{n}(\hat{x}_{n})\leq\min_{x\in M}\hat{\Phi}_{n}(x)+o_{P}(n^{-1/2})$, where $o_{P}(n^{-1/2})$ is a possibly random error term that goes to zero in probability at a faster rate than $n^{-1/2}$.

Remark 2.

The statement of Theorem 3 implicitly assumes the existence of measurable subgradients of $\phi$, that is, a mapping $g:\mathcal{Z}\times M\to TM$ such that for $P$-almost all $z\in\mathcal{Z}$ and for all $x\in M$, $g(z,x)\in T_{x}M$ is a subgradient of $\phi(z,\cdot)$ at the point $x\in M$, and for all $x\in M$, $g(\cdot,x):\mathcal{Z}\to T_{x}M$ is measurable. This is actually guaranteed by Lemma 8 on measurable selections, see Appendix A. Note also that the last assumption is automatically guaranteed if $\phi(z,\cdot)$ is locally $L(z)$-Lipschitz at $x^{*}$, for $P$-almost all $z\in\mathcal{Z}$, for some $L\in L^{2}(P)$.

The proof of Theorem 3 is strongly inspired by [28] and [44] but requires subtle tools to account for the non-Euclidean geometry of the problem. The idea of the proof is to show that

\hat{\Phi}_{n}(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))\approx\hat{\Phi}_{n}(x^{*})+\frac{1}{\sqrt{n}}\langle u,n^{-1}\sum_{i=1}^{n}g(Z_{i},x^{*})\rangle_{x^{*}}+\frac{1}{2n}\langle u,\textrm{H}\Phi(x^{*})u\rangle_{x^{*}}

as $n$ grows large, uniformly in $u\in T_{x^{*}}M$. Hence, the minimizer (in $u$) of the left-hand side, which, by definition, is $\sqrt{n}\,\textrm{Log}_{x^{*}}(\hat{x}_{n})$, must be close to the minimizer of the right-hand side, which has a closed form and which is asymptotically normal, thanks to the central limit theorem. The main difficulty is to handle $\hat{\Phi}_{n}(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))$, which is not a convex function of $u\in T_{x^{*}}M$, even though $\hat{\Phi}_{n}$ is convex on $M$. Lemma 5 below is our most technical result, and is the key argument to adapt the techniques of [28, 44] to the Riemannian setup.

Lemma 5.

Let $(M,g)$ be a Riemannian manifold and $x_{0}\in M$. There exists $r>0$ such that the following holds. Let $L>0$. There exists $\alpha>0$ such that for all convex functions $f:M\to\mathbb{R}$ that are $L$-Lipschitz on $B(x_{0},r)$, the map

f\circ\textrm{Exp}_{x_{0}}+\frac{\alpha}{2}\|\cdot\|_{x_{0}}^{2}:T_{x_{0}}M\to\mathbb{R}

is convex on $\{u\in T_{x_{0}}M:\|u\|_{x_{0}}\leq r/2\}$, where $\|\cdot\|_{x_{0}}$ is the norm induced by $g$ on the tangent space $T_{x_{0}}M$.

In the proof of this lemma, we first consider the case where $f$ is smooth and then proceed by approximation of convex functions by smooth functions, leveraging [27, Theorem 2].

Proof of Theorem 3.

Fix $u\in T_{x^{*}}M$. For all $i=1,\ldots,n$, let

W_{i,n}=\phi(Z_{i},\textrm{Exp}_{x^{*}}(u/\sqrt{n}))-\phi(Z_{i},x^{*})-\frac{1}{\sqrt{n}}\langle u,g(Z_{i},x^{*})\rangle_{x^{*}}.

The definition of subgradients yields that $W_{i,n}\geq 0$. Let $\pi_{n}$ be the parallel transport from $x^{*}$ to $x_{n}:=\textrm{Exp}_{x^{*}}(u/\sqrt{n})$ through the geodesic $\gamma(t)=\textrm{Exp}_{x^{*}}(tu/\sqrt{n}),t\in[0,1]$. Then, $x^{*}=\textrm{Exp}_{x_{n}}(-\pi_{n}(u)/\sqrt{n})$. Thus, one has the following:

\displaystyle W_{i,n} \leq\langle\pi_{n}(u)/\sqrt{n},g(Z_{i},x_{n})\rangle_{x_{n}}-\langle u/\sqrt{n},g(Z_{i},x^{*})\rangle_{x^{*}}
=\langle u/\sqrt{n},\pi_{n}^{-1}\left(g(Z_{i},\textrm{Exp}_{x^{*}}(u/\sqrt{n}))\right)-g(Z_{i},x^{*})\rangle_{x^{*}}, (1)

where we used the fact that $\pi_{n}$ is an isometry in the last equality.

Therefore, $\mathbb{E}[W_{i,n}^{2}]\leq n^{-1}\mathbb{E}[Y_{n}^{2}]$, where

Y_{n}=\langle u,\pi_{n}^{-1}\left(g(Z_{1},\textrm{Exp}_{x^{*}}(u/\sqrt{n}))\right)-g(Z_{1},x^{*})\rangle_{x^{*}}.

Note that $\mathbb{E}[Y_{n}^{2}]$ is finite if $n$ is large enough, thanks to the last assumption of the theorem.

The inequalities above yield that $Y_{n}\geq 0$ for all $n\geq 1$ and, by Lemma 9, the sequence $(Y_{n})_{n\geq 1}$ is non-increasing. Therefore, $Y_{n}$ converges almost surely to a non-negative random variable $Y$. Let us now prove that $Y=0$ almost surely. In order to do so, let us prove that $\mathbb{E}[Y_{n}]\xrightarrow[n\to\infty]{}0$. Since, by the monotone convergence theorem, $\mathbb{E}[Y_{n}]$ must converge to $\mathbb{E}[Y]$, we will obtain that $\mathbb{E}[Y]=0$. This, combined with the fact that $Y\geq 0$ almost surely, will yield the desired result.

Fix $x\in M$ with $d(x,x^{*})\leq\eta$. Then, for all $v\in T_{x}M$, $\phi(Z_{1},\textrm{Exp}_{x}(v))\geq\phi(Z_{1},x)+\langle v,g(Z_{1},x)\rangle_{x}$. Taking the expectation readily yields that $\mathbb{E}[g(Z_{1},x)]\in\partial\Phi(x)$. Hence, for all integers $n$ that are large enough so that $\|u/\sqrt{n}\|_{x^{*}}\leq\eta$, it holds that $g_{n}:=\mathbb{E}[g(Z_{1},\textrm{Exp}_{x^{*}}(u/\sqrt{n}))]\in\partial\Phi(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))$. Therefore,

\mathbb{E}[Y_{n}]=\langle u,\pi_{n}^{-1}(g_{n})-\nabla\Phi(x^{*})\rangle_{x^{*}}\xrightarrow[n\to\infty]{}0

since $\Phi$ is twice differentiable at $x^{*}$.

Finally, we obtain that $Y=0$ almost surely. Now, $(Y_{n})_{n\geq 1}$ is a non-increasing, non-negative sequence, so the monotone convergence theorem yields that $\mathbb{E}[Y_{n}^{2}]$ goes to zero as $n\to\infty$, so that

\mathrm{var}\left(\sum_{i=1}^{n}W_{i,n}\right)=\sum_{i=1}^{n}\mathrm{var}(W_{i,n})\leq\sum_{i=1}^{n}\mathbb{E}[W_{i,n}^{2}]\leq\mathbb{E}[Y_{n}^{2}]\xrightarrow[n\to\infty]{}0.

Therefore, by Chebyshev’s inequality, $\sum_{i=1}^{n}(W_{i,n}-\mathbb{E}[W_{i,n}])$ converges in probability to zero, as $n\to\infty$, that is,

\displaystyle n\left(\hat{\Phi}_{n}(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))-\hat{\Phi}_{n}(x^{*})\right)-\frac{1}{\sqrt{n}}\langle u,\sum_{i=1}^{n}g(Z_{i},x^{*})\rangle_{x^{*}}
\quad\quad-n\left(\Phi(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))-\Phi(x^{*})-\langle u,\nabla\Phi(x^{*})\rangle_{x^{*}}\right)\xrightarrow[n\to\infty]{}0

in probability. Since $\Phi$ is twice differentiable at $x^{*}$, we obtain

n\left(\hat{\Phi}_{n}(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))-\hat{\Phi}_{n}(x^{*})\right)-\frac{1}{\sqrt{n}}\langle u,\sum_{i=1}^{n}g(Z_{i},x^{*})\rangle_{x^{*}}-\frac{1}{2}\langle u,\textrm{H}\Phi(x^{*})u\rangle_{x^{*}}\xrightarrow[n\to\infty]{}0 (2)

in probability. For $n\geq 1$ and $u\in T_{x^{*}}M$, define

F_{n}(u)=n\left(\hat{\Phi}_{n}(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))-\hat{\Phi}_{n}(x^{*})\right)-\frac{1}{\sqrt{n}}\langle u,\sum_{i=1}^{n}g(Z_{i},x^{*})\rangle_{x^{*}}

and

F(u)=\frac{1}{2}\langle u,\textrm{H}\Phi(x^{*})u\rangle_{x^{*}}.

We have shown that $F_{n}(u)$ converges in probability to $F(u)$ as $n\to\infty$, for any fixed $u\in T_{x^{*}}M$. Now, we would like to make use of Lemma 2. However, the functions $F_{n}$ are not guaranteed to be convex. We adjust these functions using Lemma 5.

Let $R>0$ be any fixed positive number. Let $K=B_{x^{*}}(0,R)$ be the ball in $T_{x^{*}}M$ centered at the origin, with radius $R$, and let $\tilde{K}=B(x^{*},R)$. Then, $K$ and $\tilde{K}$ are both compact. By the law of large numbers, $\hat{\Phi}_{n}(x)$ converges almost surely to $\Phi(x)$, for all $x\in\tilde{K}$. Therefore, by Lemma 2, $\sup_{x\in\tilde{K}}|\hat{\Phi}_{n}(x)-\Phi(x)|\xrightarrow[n\to\infty]{}0$ almost surely. Hence, Lemma 4 yields the existence of a (possibly random) $L>0$ such that the following holds with probability $1$:

|\hat{\Phi}_{n}(x)-\hat{\Phi}_{n}(y)|\leq Ld(x,y),\quad\forall n\geq 1,\ \forall x,y\in\tilde{K}.

Now, let $r>0$ be as in Lemma 5. Then, there exists a (possibly random) $\alpha>0$ such that with probability $1$, for all $n\geq 1$, $\hat{\Phi}_{n}\circ\textrm{Exp}_{x^{*}}+\frac{\alpha}{2}\|\cdot\|_{x^{*}}^{2}$ is convex on $K$.

Denote by $\tilde{F}_{n}(u)=F_{n}(u)+\frac{\alpha}{2}\|u\|_{x^{*}}^{2}$ and $\tilde{F}(u)=F(u)+\frac{\alpha}{2}\|u\|_{x^{*}}^{2}$, for all $u\in K$ and $n\geq 1$. For all large enough $n\geq 1$ (such that $R/\sqrt{n}\leq r$), each $\tilde{F}_{n}$ is convex on $K$ with probability $1$, and we trivially have that $\tilde{F}_{n}(u)\xrightarrow[n\to\infty]{}\tilde{F}(u)$ in probability, for all $u\in K$. Therefore, Lemma 2 yields that $\sup_{u\in K}|F_{n}(u)-F(u)|=\sup_{u\in K}|\tilde{F}_{n}(u)-\tilde{F}(u)|\xrightarrow[n\to\infty]{}0$ in probability.

Finally, we have shown that for any $R>0$,

\sup_{\|u\|_{x^{*}}\leq R}\left|n\left(\hat{\Phi}_{n}(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))-\hat{\Phi}_{n}(x^{*})\right)-\frac{1}{\sqrt{n}}\langle u,\sum_{i=1}^{n}g(Z_{i},x^{*})\rangle_{x^{*}}-\frac{1}{2}\langle u,\textrm{H}\Phi(x^{*})u\rangle_{x^{*}}\right|\xrightarrow[n\to\infty]{}0 (3)

in probability.

By definition of $\hat{x}_{n}$, $\hat{\Phi}_{n}(\textrm{Exp}_{x^{*}}(u/\sqrt{n}))$ is minimized at $u=\sqrt{n}\textrm{Log}_{x^{*}}(\hat{x}_{n})$. Moreover, one easily checks that $\frac{1}{\sqrt{n}}\langle u,\sum_{i=1}^{n}g(Z_{i},x^{*})\rangle_{x^{*}}+\frac{1}{2}\langle u,\textrm{H}\Phi(x^{*})u\rangle_{x^{*}}$ is minimized at

u=-\textrm{H}\Phi(x^{*})^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g(Z_{i},x^{*}).
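
Indeed, this is just the first order condition for the strictly convex quadratic function of $u$ displayed above:

\nabla_{u}\left(\frac{1}{\sqrt{n}}\langle u,\sum_{i=1}^{n}g(Z_{i},x^{*})\rangle_{x^{*}}+\frac{1}{2}\langle u,\textrm{H}\Phi(x^{*})u\rangle_{x^{*}}\right)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g(Z_{i},x^{*})+\textrm{H}\Phi(x^{*})u=0.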

Note that $g(Z_{1},x^{*}),\ldots,g(Z_{n},x^{*})$ are i.i.d. random vectors in $T_{x^{*}}M$. They are centered, since $\mathbb{E}[g(Z_{1},x^{*})]=\nabla\Phi(x^{*})=0$ by the first order condition, and they have a second moment, thanks to the last assumption of the theorem. Therefore, the multivariate central limit theorem yields that

\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g(Z_{i},x^{*})\xrightarrow[n\to\infty]{}\mathcal{N}_{T_{x^{*}}M}\left(0,B(x^{*})\right) (4)

in distribution. In particular, the sequence $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g(Z_{i},x^{*})$ is bounded in probability, so one can choose $R$ in (3) so that $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g(Z_{i},x^{*})\in B_{x^{*}}(0,R)$ for all $n\geq 1$, with arbitrarily large probability.

From (3), it is easy to obtain that $\sqrt{n}\textrm{Log}_{x^{*}}(\hat{x}_{n})+\textrm{H}\Phi(x^{*})^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g(Z_{i},x^{*})$ converges in probability to zero, as $n\to\infty$. Therefore, by Slutsky’s theorem, $\sqrt{n}\textrm{Log}_{x^{*}}(\hat{x}_{n})$ has the same limit in distribution as $-\textrm{H}\Phi(x^{*})^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g(Z_{i},x^{*})$, which concludes the proof thanks to (4), since the Gaussian limit distribution is symmetric.

\square

Proof of Lemma 5.

Fix $r>0$ (to be chosen) and denote by $K=B(x_{0},r)$, which is compact by the Hopf-Rinow theorem [21, Chapter 7, Theorem 2.8]. Let $f:M\to\mathbb{R}$ be convex and $L$-Lipschitz ($L>0$) on $K$, and set $F=f\circ\textrm{Exp}_{x_{0}}:T_{x_{0}}M\to\mathbb{R}$.

We split the proof of this lemma into steps.

Step 1: Case where $f$ is smooth.

First, assume that $f$ is smooth. Then, $F$ is a smooth function. In the sequel, we denote by $\textrm{H}f$ the Hessian of $f$ and by $\bar{\textrm{H}}F$ the Hessian of $F$ (i.e., we use $\textrm{H}$ for Hessians of functions defined on $M$ and $\bar{\textrm{H}}$ for Hessians of functions defined on the tangent space $T_{x_{0}}M$). Fix $u,v\in T_{x_{0}}M$ and let $\phi(s)=F(u+sv)$, for $s\in\mathbb{R}$. Denote by $\eta(s,t)=\textrm{Exp}_{x_{0}}(t(u+sv))$, for all $s,t\in\mathbb{R}$. The Hessian of $F$ satisfies $\bar{\textrm{H}}F(u)(v,v)=\phi^{\prime\prime}(0)$. For all $s\in\mathbb{R}$,

\phi^{\prime}(s)=\langle\nabla f(\eta(s,1)),\frac{\partial\eta}{\partial s}(s,1)\rangle_{\eta(s,1)}

and

\phi^{\prime\prime}(s)=\langle\textrm{H}f(\eta(s,1))\frac{\partial\eta}{\partial s}(s,1),\frac{\partial\eta}{\partial s}(s,1)\rangle_{\eta(s,1)}+\langle\nabla f(\eta(s,1)),D_{s}\frac{\partial\eta}{\partial s}(s,1)\rangle_{\eta(s,1)} (5)

where $D_{s}$ is the covariant derivative along the path $\eta(\cdot,1)$. Note that the first term on the right-hand side of (5) is non-negative, due to the convexity of $f$ (see [52, Theorem 6.2]). Therefore,

\displaystyle\phi^{\prime\prime}(0) \geq\langle\nabla f(\eta(0,1)),D_{s}\frac{\partial\eta}{\partial s}(0,1)\rangle_{\eta(0,1)}\geq-\|\nabla f(\eta(0,1))\|\left\|D_{s}\frac{\partial\eta}{\partial s}(0,1)\right\|
\geq-L\left\|D_{s}\frac{\partial\eta}{\partial s}(0,1)\right\| (6)

where the second inequality follows from the Cauchy-Schwarz inequality and the last one follows from the assumption that $f$ is $L$-Lipschitz on $K$.

For all $w,w^{\prime}\in T_{x_{0}}M$, let $\Gamma_{w,w^{\prime}}(s,t)=\textrm{Exp}_{x_{0}}(t(w+sw^{\prime}))$, $s,t\in[0,1]$. In particular, $\eta=\Gamma_{u,v}$. Then, $D_{s}\frac{\partial\Gamma_{u,v}}{\partial s}(0,1)=\|v\|_{x_{0}}^{2}D_{s}\frac{\partial\Gamma_{u,\tilde{v}}}{\partial s}(0,1)$, where $\tilde{v}=\frac{v}{\|v\|_{x_{0}}}$. Since the map $(w,w^{\prime})\mapsto D_{s}\frac{\partial\Gamma_{w,w^{\prime}}}{\partial s}(0,1)$ is continuous, there exists a constant $C>0$ (that only depends on $r$ and on the geometry of $(M,g)$) such that

\left\|D_{s}\frac{\partial\Gamma_{w,w^{\prime}}}{\partial s}(0,1)\right\|\leq C,

for all $w,w^{\prime}\in T_{x_{0}}M$ with $\|w\|_{x_{0}}\leq r$ and $\|w^{\prime}\|_{x_{0}}\leq 1$. Hence, (6) yields that $\phi^{\prime\prime}(0)\geq-LC\|v\|_{x_{0}}^{2}$. Therefore, by setting $\alpha_{1}=LC$, we have established that $F+\frac{\alpha_{1}}{2}\|\cdot\|_{x_{0}}^{2}$ is convex on $\{u\in T_{x_{0}}M:\|u\|_{x_{0}}\leq r\}$.

Step 2: Case where $f$ is no longer assumed to be smooth.

Step 2.1: Smooth approximation of $f$.

For all positive integers $n$, let $f_{n}:M\to\mathbb{R}$ be the function defined by

f_{n}(x)=n^{d}\int_{T_{x}M}f(\textrm{Exp}_{x}(v))k(n\|v\|_{x})\mathop{}\!\mathrm{d}\Omega_{x}(v)

where $k:\mathbb{R}\to\mathbb{R}$ is a non-negative, smooth function supported on $[0,1]$ with $\int_{\mathbb{R}}k(t)\mathop{}\!\mathrm{d}t=1$ and $\Omega_{x}$ is the pushforward measure of the Riemannian volume onto $T_{x}M$ by the exponential map $\textrm{Exp}_{x}$. By [27, Theorem 2], each $f_{n}$ is smooth and the sequence $(f_{n})_{n\geq 1}$ converges to $f$ uniformly on $K$ and satisfies

\liminf_{n\to\infty}\inf_{x\in K}\lambda_{\min}(\textrm{H}f_{n}(x))\geq 0,

where $\lambda_{\min}$ stands for the smallest eigenvalue. Fix $\varepsilon>0$, that we will later choose to be small enough. Without loss of generality, assume that $\inf_{x\in K}\lambda_{\min}(\textrm{H}f_{n}(x))\geq-\varepsilon$, for all $n\geq 1$ (since one might as well forget about the first terms of the sequence $(f_{n})_{n\geq 1}$).

Step 2.2: $f_{n}$ is Lipschitz on $B(x_{0},r/2)$.

Let us show that $f_{n}$ is Lipschitz on $B(x_{0},r/2)$, so long as $n$ is large enough. Let $x,y\in B(x_{0},r/2)$ and let $\pi_{x,y}:T_{x}M\to T_{y}M$ be the parallel transport map from $x$ to $y$. Since $\pi_{x,y}$ is an isometry, it maps the measure $\Omega_{x}$ to $\Omega_{y}$, hence, it holds that

f_{n}(y)=n^{d}\int_{T_{x}M}f(\textrm{Exp}_{y}(\pi_{x,y}(v)))k(n\|v\|_{x})\mathop{}\!\mathrm{d}\Omega_{x}(v).

Thus,

\displaystyle|f_{n}(x)-f_{n}(y)| =n^{d}\left|\int_{T_{x}M}(f(\textrm{Exp}_{x}(v))-f(\textrm{Exp}_{y}(\pi_{x,y}(v))))k(n\|v\|_{x})\mathop{}\!\mathrm{d}\Omega_{x}(v)\right|
\leq n^{d}\int_{T_{x}M}\left|f(\textrm{Exp}_{x}(v))-f(\textrm{Exp}_{y}(\pi_{x,y}(v)))\right|k(n\|v\|_{x})\mathop{}\!\mathrm{d}\Omega_{x}(v)
\leq Ln^{d}\int_{T_{x}M}d(\textrm{Exp}_{x}(v),\textrm{Exp}_{y}(\pi_{x,y}(v)))k(n\|v\|_{x})\mathop{}\!\mathrm{d}\Omega_{x}(v) (7)

where the last inequality holds as soon as $n$ is large enough (i.e., $n\geq 2/r$) since $f$ is $L$-Lipschitz on $K=B(x_{0},r)$ and, in the last integral, only the vectors $v\in T_{x}M$ with $\|v\|_{x}\leq n^{-1}\leq r/2$ contribute. Finally, the map $(x,y,v)\mapsto\textrm{Exp}_{y}(\pi_{x,y}(v))$ is smooth, hence it is Lipschitz on any compact set, so there exists $C>0$ that is independent of $x$ and $y$ such that for all $v\in T_{x}M$ with $\|v\|_{x}\leq r/2$, $d(\textrm{Exp}_{x}(v),\textrm{Exp}_{y}(\pi_{x,y}(v)))\leq Cd(x,y)$. Therefore, (7) yields

|f_{n}(x)-f_{n}(y)|\leq LCd(x,y)n^{d}\int_{T_{x}M}k(n\|v\|_{x})\mathop{}\!\mathrm{d}\Omega_{x}(v)=LCd(x,y), (8)

for all $x,y\in B(x_{0},r/2)$. Thus, $f_{n}$ is $CL$-Lipschitz on $B(x_{0},r/2)$, for all integers $n\geq 2/r$.

Step 2.3: Add a fixed function to each $f_{n}$ to make them convex.

Assume that $r$ and $\varepsilon$ are chosen according to Lemma 6 below and let $\phi$ be the function given in that lemma. Then, for all $n\geq 1$, $f_{n}+\phi$ is convex on $K$, since its Hessian is positive semi-definite at any $x\in K$.

Step 2.4: Apply the smooth case (Step 1) to each $f_{n}+\phi$ on $B(x_{0},r/2)$.

For all $n\geq 1$, $f_{n}+\phi$ is smooth. Moreover, $f_{n}$ is $CL$-Lipschitz on $B(x_{0},r/2)$, where $C>0$ does not depend on $f$, and $\phi$ is smooth so it is $L^{\prime}$-Lipschitz on $B(x_{0},r/2)$, for some $L^{\prime}>0$ that does not depend on $f$ either. Therefore, applying the result proven in Step 1, we obtain that there exists $\alpha_{2}>0$ that does not depend on $f$, such that for all large enough integers $n$, $f_{n}\circ\textrm{Exp}_{x_{0}}+\phi\circ\textrm{Exp}_{x_{0}}+\frac{\alpha_{2}}{2}\|\cdot\|_{x_{0}}^{2}$ is convex on $\{u\in T_{x_{0}}M:\|u\|_{x_{0}}\leq r/2\}$. By taking the limit as $n\to\infty$, we obtain that $f\circ\textrm{Exp}_{x_{0}}+\phi\circ\textrm{Exp}_{x_{0}}+\frac{\alpha_{2}}{2}\|\cdot\|_{x_{0}}^{2}$ is convex on $\{u\in T_{x_{0}}M:\|u\|_{x_{0}}\leq r/2\}$.

Since $\phi$ is smooth, so is $\phi\circ\textrm{Exp}_{x_{0}}$, so there exists $\beta>0$ such that $\lambda_{\max}\left(\bar{\textrm{H}}(\phi\circ\textrm{Exp}_{x_{0}})(u)\right)\leq\beta$, for all $u\in T_{x_{0}}M$ with $\|u\|_{x_{0}}\leq r/2$. Therefore, by setting $\alpha_{3}=\alpha_{2}+\beta$, we obtain that $f\circ\textrm{Exp}_{x_{0}}+\frac{\alpha_{3}}{2}\|\cdot\|_{x_{0}}^{2}$ is convex on $\{u\in T_{x_{0}}M:\|u\|_{x_{0}}\leq r/2\}$.

Step 3: Conclusion.

We can now combine all cases by taking $\alpha=\max(\alpha_{1},\alpha_{3})$.

\square

Lemma 6.

Let $(M,g)$ be a complete Riemannian manifold without boundary and $x_{0}\in M$. There exist $r,\varepsilon>0$ and a smooth function $\phi:M\to\mathbb{R}$ such that $\lambda_{\min}(\textrm{H}\phi(x))\geq\varepsilon$ for all $x\in B(x_{0},r)$.

Proof.

Let $r_{1}>0$ be such that $B(x_{0},r_{1})$ is simply connected. For all $x\in M$ and all linearly independent vectors $u,v\in T_{x}M$ with $\|u\|_{x}=\|v\|_{x}=1$, let $\kappa(x,u,v)$ be the sectional curvature of $M$ at $x$, along the $2$-plane spanned by $u$ and $v$ (see [21, Proposition 3.1 and Definition 3.2]). Then, $\kappa$ is smooth, hence there exists $\kappa_{0}<\infty$ that is an upper bound on all sectional curvatures of $M$ at points $x\in B(x_{0},r_{1})$, which is compact. Therefore, Lemma 12 yields that $B(x_{0},r_{1})$ is $\textrm{CAT}(\kappa_{0})$. The desired result follows by taking $r=\min(r_{1},D_{\kappa_{0}}/4)$ and $\phi=\frac{\varepsilon}{2\alpha_{2r,\kappa_{0}}}d(\cdot,x_{0})^{2}$, where $D_{\kappa_{0}}$ and $\alpha_{2r,\kappa_{0}}$ are defined in Lemma 11, see Appendix D.2.

References

  • [1] Eddie Aamari and Clément Levrard. Nonasymptotic rates for manifold, tangent space and curvature estimation. 2019.
  • [2] Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • [3] Adil Ahidar-Coutrix, Thibaut Le Gouic, and Quentin Paris. Convergence rates for empirical barycenters in metric spaces: curvature, convexity and extendable geodesics. Probability theory and related fields, 177(1-2):323–368, 2020.
  • [4] Stephanie Alexander, Vitali Kapovitch, and Anton Petrunin. Alexandrov geometry: preliminary version no. 1. arXiv e-prints, pages arXiv–1903, 2019.
  • [5] Jason M. Altschuler and Enric Boix-Adsera. Wasserstein barycenters can be computed in polynomial time in fixed dimension. J. Mach. Learn. Res., 22:44–1, 2021.
  • [6] Jason M. Altschuler and Enric Boix-Adsera. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 4(1):179–203, 2022.
  • [7] Kimon Antonakopoulos, Elena Veronica Belmega, and Panayotis Mertikopoulos. Online and stochastic optimization beyond Lipschitz continuity: A Riemannian approach. In ICLR 2020-International Conference on Learning Representations, pages 1–20, 2020.
  • [8] Rajendra Bhatia. Positive definite matrices. In Positive Definite Matrices. Princeton university press, 2009.
  • [9] Rajendra Bhatia and John Holbrook. Riemannian geometry and matrix geometric means. Linear algebra and its applications, 413(2-3):594–618, 2006.
  • [10] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures-Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165–191, 2019.
  • [11] Rabi Bhattacharya and Lizhen Lin. Omnibus CLTs for Fréchet means and nonparametric inference on non-euclidean spaces. Proceedings of the American Mathematical Society, 145(1):413–428, 2017.
  • [12] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds. Annals of statistics, 31(1):1–29, 2003.
  • [13] Rabi Bhattacharya and Vic Patrangenaru. Large sample theory of intrinsic and extrinsic sample means on manifolds: II. Annals of statistics, pages 1225–1259, 2005.
  • [14] Louis J. Billera, Susan P. Holmes, and Karen Vogtmann. Geometry of the space of phylogenetic trees. Advances in Applied Mathematics, 27(4):733–767, 2001.
  • [15] Martin R Bridson and André Haefliger. Metric spaces of non-positive curvature, volume 319. Springer Science & Business Media, 2013.
  • [16] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry, volume 33. American Mathematical Society, 2001.
  • [17] Yuan Shih Chow and Henry Teicher. Probability theory: independence, interchangeability, martingales. Springer Science & Business Media, 2003.
  • [18] Sebastian Claici, Edward Chien, and Justin Solomon. Stochastic Wasserstein barycenters. In International Conference on Machine Learning, pages 999–1008. PMLR, 2018.
  • [19] Christopher Criscitiello and Nicolas Boumal. An accelerated first-order method for non-convex optimization on manifolds. Foundations of Computational Mathematics, pages 1–77, 2022.
  • [20] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International conference on machine learning, pages 685–693. PMLR, 2014.
  • [21] Manfredo Perdigao Do Carmo. Riemannian geometry, volume 6. Springer, 1992.
  • [22] Benjamin Eltzner, Fernando Galaz-Garcia, Stephan F. Huckemann, and Wilderich Tuschmann. Stability of the cut locus and a central limit theorem for Fréchet means of Riemannian manifolds. arXiv preprint arXiv:1909.00410, 2019.
  • [23] Benjamin Eltzner and Stephan F Huckemann. A smeary central limit theorem for manifolds with application to high-dimensional spheres. The Annals of Statistics, 47(6):3360–3381, 2019.
  • [24] Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. H. Poincaré, 10:215–310, 1948.
  • [25] Kei Funano. Rate of convergence of stochastic processes with values in \mathbb{R}-trees and Hadamard manifolds. Osaka J. Math., 47(4):911–920, 2010.
  • [26] Christopher R. Genovese, Marco Perone Pacifico, Isabella Verdinelli, Larry Wasserman, et al. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263–1291, 2012.
  • [27] Robert E. Greene and Hung Wu. On the subharmonicity and plurisubharmonicity of geodesically convex functions. Indiana University Mathematics Journal, 22(7):641–653, 1973.
  • [28] Shelby J. Haberman. Concavity and estimation. The Annals of Statistics, pages 1631–1661, 1989.
  • [29] CH Himmelberg, T Parthasarathy, and FS Van Vleck. On measurable relations. Fundamenta mathematicae, 61:161–167, 1982.
  • [30] Thomas Hotz, Sean Skwerer, Stephan Huckemann, Huiling Le, J Stephen Marron, Jonathan C Mattingly, Ezra Miller, James Nolen, Megan Owen, and Vic Patrangenaru. Sticky central limit theorems on open books. Annals of Applied Probability, 2013.
  • [31] Peter J Huber. Robust estimation of a location parameter. Breakthroughs in statistics: Methodology and distribution, pages 492–518, 1992.
  • [32] Stephan F. Huckemann. Intrinsic inference on the mean geodesic of planar shapes and tree discrimination by leaf growth. The Annals of Statistics, 39(2):1098–1124, 2011.
  • [33] Hermann Karcher. Riemannian center of mass and mollifier smoothing. Communications on pure and applied mathematics, 30(5):509–541, 1977.
  • [34] David George Kendall, Dennis Barden, Thomas K. Carne, and Huiling Le. Shape and shape theory, volume 500. John Wiley & Sons, 2009.
  • [35] Wilfrid S. Kendall and Huiling Le. Limit theorems for empirical Fréchet means of independent and non-identically distributed manifold-valued random variables. Brazilian Journal of Probability and Statistics, 25(3):323 – 352, 2011.
  • [36] Alexey Kroshnin, Vladimir Spokoiny, and Alexandra Suvorikova. Statistical inference for Bures-Wasserstein barycenters. The Annals of Applied Probability, 31(3):1264–1298, 2021.
  • [37] Alexey Kroshnin, Nazarii Tupitsa, Darina Dvinskikh, Pavel Dvurechensky, Alexander Gasnikov, and Cesar Uribe. On the complexity of approximating Wasserstein barycenters. In International conference on machine learning, pages 3530–3540. PMLR, 2019.
  • [38] Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields, 168(3):901–917, 2017.
  • [39] Thibaut Le Gouic, Quentin Paris, Philippe Rigollet, and Austin J Stromme. Fast convergence of empirical barycenters in Alexandrov spaces and the Wasserstein space. Journal of the European Mathematical Society, 2022.
  • [40] John M. Lee. Introduction to smooth manifolds. Springer, 2012.
  • [41] John M. Lee. Introduction to Riemannian manifolds, volume 2. Springer, 2018.
  • [42] Yongdo Lim and Miklós Pálfia. Weighted deterministic walks for the least squares mean on Hadamard spaces. Bulletin of the London Mathematical Society, 46, 05 2014.
  • [43] Estelle Massart, Julien M. Hendrickx, and P-A Absil. Curvature of the manifold of fixed-rank positive-semidefinite matrices endowed with the Bures-Wasserstein metric. In Geometric Science of Information: 4th International Conference, GSI 2019, Toulouse, France, August 27–29, 2019, Proceedings, pages 739–748. Springer, 2019.
  • [44] Wojciech Niemiro. Asymptotics for M-estimators defined by convex minimization. The Annals of Statistics, pages 1514–1533, 1992.
  • [45] Shin-ichi Ohta. Convexities of metric spaces. Geom. Dedicata, 125:225–250, 2007.
  • [46] Shin-ichi Ohta and Miklós Pálfia. Discrete-time gradient flows and law of large numbers in Alexandrov spaces. Calc. Var. Partial Differential Equations, 54(2):1591–1610, 2015.
  • [47] R Tyrrell Rockafellar. Convex analysis, volume 18. Princeton university press, 1970.
  • [48] Rolf Schneider. Convex bodies: the Brunn–Minkowski theory. Number 151. Cambridge university press, 2014.
  • [49] Christof Schötz. Convergence rates for the generalized Fréchet mean via the quadruple inequality. Electronic Journal of Statistics, 13:4280–4345, 2019.
  • [50] Nicolas Schreuder, Victor-Emmanuel Brunel, and Arnak Dalalyan. Statistical guarantees for generative models without domination. In Algorithmic Learning Theory, pages 1051–1071. PMLR, 2021.
  • [51] Karl-Theodor Sturm. Probability measures on metric spaces of nonpositive curvature. In Heat kernels and analysis on manifolds, graphs, and metric spaces (Paris, 2002), volume 338 of Contemp. Math., pages 357–390. Amer. Math. Soc., Providence, RI, 2003.
  • [52] Constantin Udriste. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media, 2013.
  • [53] Aad W Van Der Vaart and Jon A Wellner. Weak convergence. Springer, 1996.
  • [54] Daniel H. Wagner. Survey of measurable selection theorems. SIAM Journal on Control and Optimization, 15(5):859–903, 1977.
  • [55] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638. PMLR, 2016.
  • [56] Hongyi Zhang and Suvrit Sra. Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.
  • [57] Herbert Ziezold. On expected figures and a strong law of large numbers for random elements in quasi-metric spaces. In Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pages 591–602. Springer, 1977.

Appendix A On measurable selections of subgradients

Lemma 7.

Let $(\mathcal{Z},\mathcal{F},P)$ be a probability space and $(M,g)$ be a Riemannian manifold without boundary. Let $\phi:\mathcal{Z}\times M\to\mathbb{R}$ be a function such that

  • $\phi(z,\cdot)$ is convex for $P$-almost all $z\in\mathcal{Z}$;

  • $\phi(\cdot,x)$ is measurable for all $x\in M$.

Then, there exists a map $g:\mathcal{Z}\times M\to TM$ such that

  • $g(z,x)\in T_{x}M$ is a subgradient of $\phi(z,\cdot)$ at $x$, for all $x\in M$, for $P$-almost all $z\in\mathcal{Z}$;

  • $g(\cdot,x)$ is measurable, for all $x\in M$.

Proof.

Let $N\in\mathcal{F}$ be such that $P(N)=0$ and $\phi(z,\cdot)$ is convex on $M$ for all $z\in\mathcal{Z}\setminus N$. Set $\tilde{\mathcal{Z}}=\mathcal{Z}\setminus N$. It is enough to prove that for all $x\in M$, there exists a measurable function $g(\cdot,x)$ on $\tilde{\mathcal{Z}}$ such that $g(z,x)$ is a subgradient of $\phi(z,\cdot)$ at $x$, for all $z\in\tilde{\mathcal{Z}}$.

Fix $x_{0}\in M$. For all $z\in\mathcal{Z}$, let $\Gamma(z)$ be the set of all subgradients of $\phi(z,\cdot)$ at the point $x_{0}$. By [52, Theorem 4.5], $\Gamma(z)\neq\emptyset$, for all $z\in\tilde{\mathcal{Z}}$. Therefore, it is enough to show that the restriction of $\Gamma$ to $\tilde{\mathcal{Z}}$ is weakly measurable, i.e., that $\{z\in\tilde{\mathcal{Z}}:\Gamma(z)\cap U\neq\emptyset\}\in\mathcal{F}$, for all open subsets $U\subseteq T_{x_{0}}M$ (see [29] for the definition).

One can write Γ(z)\Gamma(z) as

Γ(z)\displaystyle\Gamma(z) =vTx0M{uTx0M:ϕ(z,Expx0(v))ϕ(z,x0)+u,vx0}\displaystyle=\bigcap_{v\in T_{x_{0}}M}\{u\in T_{x_{0}}M:\phi(z,\textrm{Exp}_{x_{0}}(v))\geq\phi(z,x_{0})+\langle u,v\rangle_{x_{0}}\}
=vE0{uTx0M:ϕ(z,Expx0(v))ϕ(z,x0)+u,vx0},\displaystyle=\bigcap_{v\in E_{0}}\{u\in T_{x_{0}}M:\phi(z,\textrm{Exp}_{x_{0}}(v))\geq\phi(z,x_{0})+\langle u,v\rangle_{x_{0}}\},

where E0E_{0} is some fixed countable, dense subset of Tx0MT_{x_{0}}M. The last equality follows from the continuity of ϕ(z,)\phi(z,\cdot), by Lemma 3. For all vE0v\in E_{0}, let Γv(z)={uTx0M:ϕ(z,Expx0(v))ϕ(z,x0)+u,vx0}\Gamma_{v}(z)=\{u\in T_{x_{0}}M:\phi(z,\textrm{Exp}_{x_{0}}(v))\geq\phi(z,x_{0})+\langle u,v\rangle_{x_{0}}\}.

Since each of the countably many Γv(z),vE0\Gamma_{v}(z),v\in E_{0}, is closed, [29, Corollary 4.2] shows that it is enough to show that each Γv\Gamma_{v} is weakly measurable. Fix vE0v\in E_{0} and let UU be an open subset of Tx0MT_{x_{0}}M. Then, for all z𝒵~z\in\tilde{\mathcal{Z}},

Γv(z)U\displaystyle\Gamma_{v}(z)\cap U\neq\emptyset uU,ϕ(z,Expx0(v))ϕ(z,x0)+u,vx0\displaystyle\iff\exists u\in U,\phi(z,\textrm{Exp}_{x_{0}}(v))\geq\phi(z,x_{0})+\langle u,v\rangle_{x_{0}}
uUE0,ϕ(z,Expx0(v))ϕ(z,x0)+u,vx0.\displaystyle\iff\exists u\in U\cap E_{0},\phi(z,\textrm{Exp}_{x_{0}}(v))\geq\phi(z,x_{0})+\langle u,v\rangle_{x_{0}}.

The last equivalence follows from the following observation: since UU is open, UE0U\cap E_{0} is dense in UU, and if some uUu\in U satisfies ϕ(z,Expx0(v))ϕ(z,x0)+u,vx0\phi(z,\textrm{Exp}_{x_{0}}(v))\geq\phi(z,x_{0})+\langle u,v\rangle_{x_{0}}, then so does any uUu^{\prime}\in U of the form u=uvu^{\prime}=u-v^{\prime}, for any vTx0Mv^{\prime}\in T_{x_{0}}M with small enough norm such that v,vx00\langle v,v^{\prime}\rangle_{x_{0}}\geq 0, since then u,vx0u,vx0\langle u^{\prime},v\rangle_{x_{0}}\leq\langle u,v\rangle_{x_{0}}. At least one such uu^{\prime} falls in UE0U\cap E_{0}, since the set of all such uu^{\prime} has nonempty interior in UU.

Finally, one obtains:

{z𝒵~:Γv(z)U}=uUE0{z𝒵~:ϕ(z,Expx0(v))ϕ(z,x0)+u,vx0},\{z\in\tilde{\mathcal{Z}}:\Gamma_{v}(z)\cap U\neq\emptyset\}=\bigcup_{u\in U\cap E_{0}}\{z\in\tilde{\mathcal{Z}}:\phi(z,\textrm{Exp}_{x_{0}}(v))\geq\phi(z,x_{0})+\langle u,v\rangle_{x_{0}}\},

which is in \mathcal{F} since ϕ(,x)\phi(\cdot,x) is assumed to be measurable for all xMx\in M.

One concludes using Lemma 8 below.

Lemma 8.

Let (𝒵,,P)(\mathcal{Z},\mathcal{F},P) be a probability space and EE a Polish space. Let Γ\Gamma be a function that maps any z𝒵z\in\mathcal{Z} to a closed subset of EE. Assume the existence of NN\in\mathcal{F} with P(N)=0P(N)=0 and such that Γ(z)\Gamma(z)\neq\emptyset for all z𝒵Nz\in\mathcal{Z}\setminus N. Assume further that for all open subsets UU of EE, {z𝒵N:Γ(z)U}\{z\in\mathcal{Z}\setminus N:\Gamma(z)\cap U\neq\emptyset\}\in\mathcal{F}. Then, there exists a function g:𝒵Eg:\mathcal{Z}\to E such that g(z)Γ(z)g(z)\in\Gamma(z), for PP-almost all z𝒵z\in\mathcal{Z}.

Proof.

Consider the probability space (𝒵~,~,P~)(\tilde{\mathcal{Z}},\tilde{\mathcal{F}},\tilde{P}) defined by setting

  • 𝒵~=𝒵N\tilde{\mathcal{Z}}=\mathcal{Z}\setminus N;

  • ~={FN:F}\tilde{\mathcal{F}}=\{F\setminus N:F\in\mathcal{F}\};

  • P~\tilde{P} as the restriction of PP to (𝒵~,~)(\tilde{\mathcal{Z}},\tilde{\mathcal{F}}).

Then, for all z𝒵~z\in\tilde{\mathcal{Z}}, Γ(z)\Gamma(z)\neq\emptyset so, thanks to [54, Theorem 4.1], one can find a measurable selection gg of the restriction of Γ\Gamma to 𝒵~\tilde{\mathcal{Z}}. By setting g(z)g(z) to be any constant value in EE for zNz\in N, we obtain a function g:𝒵Eg:\mathcal{Z}\to E that satisfies the requirement.

Appendix B Elements on Riemannian manifolds and on geodesic convexity

In this section, let (M,g)(M,g) be a complete dd-dimensional Riemannian manifold. That is, MM is a dd-dimensional smooth manifold and the Riemannian metric gg equips each tangent space TxMT_{x}M, xMx\in M, with an inner product gx(u,v)g_{x}(u,v), u,vTxMu,v\in T_{x}M; we denote by dd the induced Riemannian distance on MM. Moreover, by the Hopf–Rinow theorem [21, Chapter 7, Theorem 2.8], completeness of (M,d)(M,d) ensures that any two points x,yMx,y\in M are connected by at least one minimizing geodesic, i.e., a smooth path γ:[0,1]M\gamma:[0,1]\to M such that γ(0)=x,γ(1)=y\gamma(0)=x,\gamma(1)=y and d(γ(s),γ(t))=|st|d(x,y)d(\gamma(s),\gamma(t))=|s-t|d(x,y), for all s,t[0,1]s,t\in[0,1].

B.1 On smooth functions and their derivatives

If f:Mf:M\to\mathbb{R} is a smooth function, its gradient is the map f\nabla f such that for all xMx\in M, f(x)TxM\nabla f(x)\in T_{x}M and for all uTxMu\in T_{x}M,

f(x),ux=(fγ)(0)\langle\nabla f(x),u\rangle_{x}=(f\circ\gamma)^{\prime}(0)

where γ:[0,1]M\gamma:[0,1]\to M is any smooth path such that γ(0)=x\gamma(0)=x and γ(0)=u\gamma^{\prime}(0)=u. The Hessian of ff at any xMx\in M is represented by the self-adjoint linear map Hf(x):TxMTxM\textrm{H}f(x):T_{x}M\to T_{x}M such that for all uTxMu\in T_{x}M,

u,Hf(x)ux=(fγ)′′(0)\langle u,\textrm{H}f(x)u\rangle_{x}=(f\circ\gamma)^{\prime\prime}(0)

where γ\gamma is a geodesic path with γ(0)=x\gamma(0)=x, γ(0)=u\gamma^{\prime}(0)=u.

With our notation, any smooth function f:Mf:M\to\mathbb{R} has a second order Taylor expansion at any given point xMx\in M of the following form:

f(Expx(tu))=f(x)+tf(x),ux+t22u,Hf(x)ux+o(t2)f(\textrm{Exp}_{x}(tu))=f(x)+t\langle\nabla f(x),u\rangle_{x}+\frac{t^{2}}{2}\langle u,\textrm{H}f(x)u\rangle_{x}+o(t^{2})

as t0t\to 0, for any fixed uTxMu\in T_{x}M.
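As an illustration (not part of the formal development), the following minimal Python sketch probes this expansion by finite differences along geodesics of the unit sphere in dimension 3, for the function f(y) = a·y restricted to the sphere; for this f, the formula Exp_x(tu) = cos(t|u|)x + sin(t|u|)u/|u| yields the closed forms ∇f(x) = a - (a·x)x and ⟨u, Hf(x)u⟩_x = -(a·x)|u|². The helper name exp_map and the reliance on numpy are ours.

# Minimal sketch (illustration only): the definitions of the gradient and Hessian
# via geodesics, checked on the unit sphere S^2 in R^3 for f(y) = a.y.
# Assumed closed forms: grad f(x) = a - (a.x) x and <u, Hf(x) u>_x = -(a.x) |u|^2.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=3)
x = rng.normal(size=3); x /= np.linalg.norm(x)      # a point on the sphere
u = rng.normal(size=3); u -= (u @ x) * x            # a tangent vector at x

def exp_map(x, v):                                  # exponential map of the sphere
    n = np.linalg.norm(v)
    return x if n == 0 else np.cos(n) * x + np.sin(n) * v / n

f = lambda y: a @ y

t = 1e-4                                            # finite differences of f(Exp_x(tu)) at t = 0
first = (f(exp_map(x, t * u)) - f(exp_map(x, -t * u))) / (2 * t)
second = (f(exp_map(x, t * u)) - 2 * f(x) + f(exp_map(x, -t * u))) / t**2

print(first, (a - (a @ x) * x) @ u)                 # both approximate <grad f(x), u>_x
print(second, -(a @ x) * (u @ u))                   # both approximate <u, Hf(x) u>_x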

B.2 On convex functions

Recall that a convex function on MM is any function f:Mf:M\to\mathbb{R} that satisfies f(γ(t))(1t)f(x)+tf(y)f(\gamma(t))\leq(1-t)f(x)+tf(y) for all t[0,1]t\in[0,1], x,yMx,y\in M and for all constant-speed minimizing geodesics γ:[0,1]M\gamma:[0,1]\to M from xx to yy. Just as in Euclidean spaces, convex functions are automatically continuous [52, Theorem 3.6] (note that we only consider convex functions on the whole space MM, which is assumed to have no boundary). The notion of subgradients can also be extended to the Riemannian case.

Definition 2.

Let f:Mf:M\to\mathbb{R} be a convex function and xMx\in M. A vector vTxMv\in T_{x}M is called a subgradient of ff at xx if and only if

f(Expx(u))f(x)+v,ux,f(\textrm{Exp}_{x}(u))\geq f(x)+\langle v,u\rangle_{x},

for all uTxMu\in T_{x}M.

The set of all subgradients of ff at xMx\in M is denoted by f(x)\partial f(x) and [52, Theorem 4.5] ensures that f(x)\partial f(x)\neq\emptyset, for all xMx\in M (again, note that we only consider the case when MM has no boundary). For simplicity here, unlike in [52], subgradients are elements of the tangent space, not its dual.
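As an illustration (not part of the formal development), the following Python sketch checks the inequality of Definition 2 numerically on the hyperbolic plane, in the hyperboloid model recalled in Appendix D.1 below, where the distance function f = d(., p) is geodesically convex. The candidate subgradient is v = -Log_x(p)/d(x, p), the Riemannian gradient of f at any x distinct from p, hence a subgradient; the helper names mink, exp_map and log_map are ours.

# Minimal sketch (illustration only): Definition 2 checked for f = d(., p) on the
# hyperbolic plane in the hyperboloid model, with v = -Log_x(p)/d(x, p).
import numpy as np

def mink(x, y):                          # Minkowski form x1*y1 + x2*y2 - x3*y3
    return x[0]*y[0] + x[1]*y[1] - x[2]*y[2]

def lift(a, b):                          # the point of the hyperboloid above (a, b)
    return np.array([a, b, np.sqrt(1.0 + a*a + b*b)])

def dist(x, y):                          # hyperbolic distance
    return np.arccosh(max(1.0, -mink(x, y)))

def exp_map(x, u):                       # exponential map at x
    n = np.sqrt(max(mink(u, u), 0.0))
    return x if n == 0 else np.cosh(n)*x + np.sinh(n)*u/n

def log_map(x, p):                       # logarithm map at x
    th = dist(x, p)
    w = p + mink(x, p)*x                 # equals p - cosh(th) x, a tangent vector at x
    return th * w / np.sqrt(mink(w, w))

rng = np.random.default_rng(1)
x, p = lift(0.3, -0.2), lift(-1.0, 0.7)
v = -log_map(x, p) / dist(x, p)          # candidate subgradient of f = d(., p) at x

for _ in range(1000):
    u = rng.normal(size=3)
    u = u + mink(x, u)*x                 # projection onto the tangent space at x
    lhs = dist(exp_map(x, u), p)         # f(Exp_x(u))
    rhs = dist(x, p) + mink(v, u)        # f(x) + <v, u>_x
    assert lhs >= rhs - 1e-9             # the inequality of Definition 2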

Subgradients of convex functions satisfy the following monotonicity property.

Lemma 9.

Let f:Mf:M\to\mathbb{R} be a convex function. Let xMx\in M, uTxMu\in T_{x}M and set y=Expx(u)y=\textrm{Exp}_{x}(u). Let g1f(x)g_{1}\in\partial f(x) and g2f(y)g_{2}\in\partial f(y). Then,

u,πx,y1(g2)g1x0,\langle u,\pi_{x,y}^{-1}(g_{2})-g_{1}\rangle_{x}\geq 0,

where πx,y\pi_{x,y} is the parallel transport from xx to yy through the geodesic given by γ(t)=Expx(tu),t[0,1]\gamma(t)=\textrm{Exp}_{x}(tu),t\in[0,1].

Proof.

First, note that by letting v=πx,y(u)v=\pi_{x,y}(u), it holds that x=Expy(v)x=\textrm{Exp}_{y}(-v). By definition of subgradients, one has the following inequalities:

f(y)f(x)+u,g1xf(y)\geq f(x)+\langle u,g_{1}\rangle_{x}

and

f(x)f(y)+v,g2y=f(y)u,πx,y1(g2)x,f(x)\geq f(y)+\langle-v,g_{2}\rangle_{y}=f(y)-\langle u,\pi_{x,y}^{-1}(g_{2})\rangle_{x},

using the fact that πx,y\pi_{x,y} is an isometry. Summing both inequalities yields the result.
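As a sanity check (illustration only), consider the Euclidean special case M = R^d, in which Exp_x(u) = x + u and parallel transport is the identity: Lemma 9 then reduces to the classical monotonicity <y - x, g_2 - g_1> >= 0 of the subdifferential. The short Python snippet below verifies it for f(x) = ||x||_1, with the subgradient selection g(x) = sign(x).

# Minimal sketch (illustration only): Lemma 9 in the flat case M = R^d, for the
# l1 norm and the subgradient selection sign(.).
import numpy as np

rng = np.random.default_rng(2)
d = 5
for _ in range(1000):
    x, u = rng.normal(size=d), rng.normal(size=d)
    y = x + u                                # y = Exp_x(u) in the flat case
    g1, g2 = np.sign(x), np.sign(y)          # subgradients of ||.||_1 at x and y
    assert u @ (g2 - g1) >= -1e-12           # the monotonicity of Lemma 9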

Appendix C On sequences of convex functions in metric spaces

C.1 Proof of Lemma 2

Almost sure convergence

Since (M,d)(M,d) is a proper metric space, it is separable [4, Proposition 2.2.2]. Let M0M_{0} be a countable dense subset of MM. By assumption, Fn(x)nF(x)F_{n}(x)\xrightarrow[n\to\infty]{}F(x) almost surely, for all xM0x\in M_{0}. Therefore, with probability one, the convergence holds for all xM0x\in M_{0}, since M0M_{0} is countable. Hence, because M0M_{0} is also dense, Theorem 1 applies and we obtain the desired result.

Convergence in probability

Here, we use the following fact.

Lemma 10.

[17, Lemma 2 (ii) p.67] Let X,X1,X2,X,X_{1},X_{2},\ldots be real-valued random variables. Then, XnX_{n} converges in probability to XX if and only if every subsequence of (Xn)n1(X_{n})_{n\geq 1} admits a further subsequence that converges almost surely to XX.

Let KK be a compact subset of MM and consider a subsequence of (supxK|Fn(x)F(x)|)n1(\sup_{x\in K}|F_{n}(x)-F(x)|)_{n\geq 1}. Without loss of generality (after relabeling that subsequence), we work with the whole sequence (supxK|Fn(x)F(x)|)n1(\sup_{x\in K}|F_{n}(x)-F(x)|)_{n\geq 1} and show that it admits a further subsequence that converges almost surely to zero. Let M0M_{0} be a countable dense subset of MM and denote its elements by x1,x2,x_{1},x_{2},\ldots.

By assumption, Fn(x1)F_{n}(x_{1}) converges to F(x1)F(x_{1}) in probability, therefore, it has a subsequence, say Fϕ1(n)(x1)F_{\phi_{1}(n)}(x_{1}), that converges to F(x1)F(x_{1}) almost surely.

Now, again by assumption, Fϕ1(n)(x2)F_{\phi_{1}(n)}(x_{2}) converges to F(x2)F(x_{2}) in probability, therefore, it has a subsequence, say Fϕ1ϕ2(n)(x2)F_{\phi_{1}\circ\phi_{2}(n)}(x_{2}), that converges to F(x2)F(x_{2}) almost surely.

By iterating this procedure, we build, for all integers k1k\geq 1, an increasing function ϕk:\phi_{k}:\mathbb{N}^{*}\to\mathbb{N}^{*} such that Fϕ1ϕk(n)(xk)F_{\phi_{1}\circ\ldots\circ\phi_{k}(n)}(x_{k}) converges almost surely to F(xk)F(x_{k}), as nn\to\infty.

Finally, let ψ:\psi:\mathbb{N}^{*}\to\mathbb{N}^{*} be the function defined by ψ(n)=ϕ1ϕn(n)\psi(n)=\phi_{1}\circ\ldots\circ\phi_{n}(n), for all n1n\geq 1. Then, by construction, for all k1k\geq 1, (ψ(n))nk(\psi(n))_{n\geq k} is a subsequence of (ϕ1ϕk(n))n1(\phi_{1}\circ\ldots\circ\phi_{k}(n))_{n\geq 1}, so Fψ(n)(xk)F_{\psi(n)}(x_{k}) converges almost surely to F(xk)F(x_{k}). Hence, with probability 11, Fψ(n)(x)F_{\psi(n)}(x) converges to F(x)F(x) for all xM0x\in M_{0}. Lemma 1 then yields that supxK|Fψ(n)(x)F(x)|\sup_{x\in K}|F_{\psi(n)}(x)-F(x)| converges to zero almost surely, which concludes the proof.

Appendix D A brief introduction to Alexandrov’s curvature

In this section, we give the precise definition of CAT spaces.

D.1 Model spaces

First, we introduce a family of model spaces that will allow us to define local and global curvature bounds in the sequel. Let κ\kappa\in\mathbb{R}.

κ=0\kappa=0: Euclidean plane

Set M0=2M_{0}=\mathbb{R}^{2}, equipped with its Euclidean metric. This model space corresponds to zero curvature; it is a geodesic space in which geodesics are unique and given by line segments.

κ>0\kappa>0: Sphere

Set Mκ=1κ𝕊2M_{\kappa}=\frac{1}{\sqrt{\kappa}}\mathbb{S}^{2}: This is the 22-dimensional Euclidean sphere, embedded in 3\mathbb{R}^{3}, with center 0 and radius 1/κ1/\sqrt{\kappa}, equipped with the arc length metric: dκ(x,y)=1κarccos(κxy)d_{\kappa}(x,y)=\frac{1}{\sqrt{\kappa}}\arccos(\kappa x^{\top}y), for all x,yMκx,y\in M_{\kappa}. This is a geodesic space where the geodesics are unique except for opposite points, and given by arcs of great circles. Here, a great circle is the intersection of the sphere with any plane going through the origin.

κ<0\kappa<0: Hyperbolic space

Set Mκ=1κ2M_{\kappa}=\frac{1}{\sqrt{-\kappa}}\mathbb{H}^{2}, where 2={(x1,x2,x3)3:x3>0,x12+x22x32=1}\mathbb{H}^{2}=\{(x_{1},x_{2},x_{3})\in\mathbb{R}^{3}:x_{3}>0,x_{1}^{2}+x_{2}^{2}-x_{3}^{2}=-1\}. The metric is given by dκ(x,y)=1κarccosh(κx,y)d_{\kappa}(x,y)=\frac{1}{\sqrt{-\kappa}}\textrm{arccosh}(\kappa\langle x,y\rangle), for all x,yMκx,y\in M_{\kappa}, where x,y=x1y1+x2y2x3y3\langle x,y\rangle=x_{1}y_{1}+x_{2}y_{2}-x_{3}y_{3} (note that κx,y1\kappa\langle x,y\rangle\geq 1 for all x,yMκx,y\in M_{\kappa}, since κ<0\kappa<0 and x,y1/κ\langle x,y\rangle\leq 1/\kappa). This is a geodesic space where geodesics are always unique and are given by arcs of the intersections of MκM_{\kappa} with planes going through the origin.

For κ\kappa\in\mathbb{R}, let DκD_{\kappa} be the diameter of the model space MκM_{\kappa}, i.e., Dκ={ if κ0πκ if κ>0\displaystyle D_{\kappa}=\begin{cases}\infty\mbox{ if }\kappa\leq 0\\ \frac{\pi}{\sqrt{\kappa}}\mbox{ if }\kappa>0\end{cases}.
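For concreteness, the following Python sketch (illustration only, relying on numpy) transcribes the model-space distances d_kappa and the diameters D_kappa defined above, with the sign conventions of this subsection.

# Minimal transcription (illustration only) of the model-space distances and diameters.
import numpy as np

def d_kappa(kappa, x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    if kappa == 0:                                        # Euclidean plane
        return np.linalg.norm(x - y)
    if kappa > 0:                                         # sphere of radius 1/sqrt(kappa)
        return np.arccos(np.clip(kappa * (x @ y), -1.0, 1.0)) / np.sqrt(kappa)
    m = x[0]*y[0] + x[1]*y[1] - x[2]*y[2]                 # Minkowski form
    return np.arccosh(max(1.0, kappa * m)) / np.sqrt(-kappa)   # hyperbolic plane

def D_kappa(kappa):                                       # diameter of the model space
    return np.inf if kappa <= 0 else np.pi / np.sqrt(kappa)

# Example: two points of the unit sphere (kappa = 1), a quarter turn apart.
print(d_kappa(1.0, [1, 0, 0], [0, 1, 0]))                 # ~ pi / 2
print(D_kappa(1.0))                                       # pi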

D.2 Curvature bounds

Let (M,d)(M,d) be a geodesic space. The notion of curvature (lower or upper) bounds for (M,d)(M,d) is defined by comparing the triangles in MM with triangles with the same side lengths in model spaces.

Definition 3.

A (geodesic) triangle in MM is a set of three points in MM (the vertices) together with three geodesic segments connecting them (the sides).

Given three points x,y,zMx,y,z\in M, we abusively denote by Δ(x,y,z)\Delta(x,y,z) a triangle with vertices x,y,zx,y,z, with no mention of which geodesic segments are chosen for the sides (geodesics between points are not necessarily unique, as seen for example on a sphere between two antipodal points). The perimeter of a triangle Δ=Δ(x,y,z)\Delta=\Delta(x,y,z) is defined as per(Δ)=d(x,y)+d(y,z)+d(x,z)\mathrm{per}(\Delta)=d(x,y)+d(y,z)+d(x,z). It does not depend on the choice of the sides.

Definition 4.

Let κ\kappa\in\mathbb{R} and Δ\Delta be a triangle in MM with per(Δ)<2Dκ\mathrm{per}(\Delta)<2D_{\kappa}. A comparison triangle for Δ\Delta in the model space MκM_{\kappa} is a triangle Δ¯Mκ\bar{\Delta}\subseteq M_{\kappa} with the same side lengths as Δ\Delta, i.e., if Δ=Δ(x,y,z)\Delta=\Delta(x,y,z), then Δ¯=Δ(x¯,y¯,z¯)\bar{\Delta}=\Delta(\bar{x},\bar{y},\bar{z}) where x¯,y¯,z¯\bar{x},\bar{y},\bar{z} are any points in MκM_{\kappa} satisfying

{d(x,y)=dκ(x¯,y¯)d(y,z)=dκ(y¯,z¯)d(x,z)=dκ(x¯,z¯).\begin{cases}d(x,y)=d_{\kappa}(\bar{x},\bar{y})\\ d(y,z)=d_{\kappa}(\bar{y},\bar{z})\\ d(x,z)=d_{\kappa}(\bar{x},\bar{z}).\end{cases}

Note that a comparison triangle in MκM_{\kappa} is always unique up to rigid motions. We are now ready to define curvature bounds. Intuitively, we say that (M,d)(M,d) has global curvature bounded from above (resp. below) by κ\kappa if all its triangles (with perimeter smaller than 2Dκ2D_{\kappa}) are thinner (resp. fatter) than their comparison triangles in the model space MκM_{\kappa}.
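As an illustration (not part of the formal development), the following Python sketch builds a comparison triangle in the flat model space M_0 = R^2 from three prescribed side lengths, by placing the vertices with the law of cosines; the construction is determined once one vertex and the direction of one side are fixed, which reflects the uniqueness up to rigid motions noted above.

# Minimal sketch (illustration only): a comparison triangle in the Euclidean model
# space M_0 = R^2, obtained from the three side lengths of a triangle Delta(x, y, z).
import numpy as np

def comparison_triangle_flat(dxy, dyz, dxz):
    # Place x_bar at the origin, y_bar on the positive horizontal axis, and z_bar
    # in the upper half-plane, using the law of cosines at x_bar.
    xb = np.array([0.0, 0.0])
    yb = np.array([dxy, 0.0])
    cos_at_x = (dxy**2 + dxz**2 - dyz**2) / (2 * dxy * dxz)
    ang = np.arccos(np.clip(cos_at_x, -1.0, 1.0))
    zb = dxz * np.array([np.cos(ang), np.sin(ang)])
    return xb, yb, zb

xb, yb, zb = comparison_triangle_flat(3.0, 4.0, 2.0)
# The comparison points reproduce the prescribed side lengths:
print(np.linalg.norm(xb - yb), np.linalg.norm(yb - zb), np.linalg.norm(xb - zb))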

Definition 5.

Let κ\kappa\in\mathbb{R}. We say that (M,d)(M,d) has global curvature bounded from above (resp. below) by κ\kappa if and only if for all triangles ΔM\Delta\subseteq M with per(Δ)<2Dκ\mathrm{per}(\Delta)<2D_{\kappa} and for all x,yΔx,y\in\Delta, d(x,y)dκ(x¯,y¯)d(x,y)\leq d_{\kappa}(\bar{x},\bar{y}) (resp. d(x,y)dκ(x¯,y¯)d(x,y)\geq d_{\kappa}(\bar{x},\bar{y})), where x¯\bar{x} and y¯\bar{y} are the points on a comparison triangle Δ¯\bar{\Delta} in MκM_{\kappa} that correspond to xx and yy. We then call (M,d)(M,d) a CAT(κ)\textrm{CAT}(\kappa) (resp. CAT+(κ)\textrm{CAT}^{+}(\kappa)) space.

Lemma 11.

[45, Proposition 3.1]

Let (M,d)(M,d) be CAT(κ)\textrm{CAT}(\kappa) for some κ\kappa\in\mathbb{R} and assume that MM has diameter D<Dκ/2D<D_{\kappa}/2. Then, for all x0Mx_{0}\in M, the function d(,x0)d(\cdot,x_{0}) is convex on MM. Moreover, set αD,κ=κDtan(κD)\alpha_{D,\kappa}=\frac{\sqrt{\kappa}D}{\tan(\sqrt{\kappa}D)} if κ>0\kappa>0, αD,κ=1\alpha_{D,\kappa}=1 otherwise. Then, the function 12d2(,x0)\frac{1}{2}d^{2}(\cdot,x_{0}) is αD,κ\alpha_{D,\kappa}-strongly convex, i.e., it satisfies

12d2(γ(t),x0)1t2d2(x,x0)+t2d2(y,x0)αD,κt(1t)4d(x,y)2,\frac{1}{2}d^{2}(\gamma(t),x_{0})\leq\frac{1-t}{2}d^{2}(x,x_{0})+\frac{t}{2}d^{2}(y,x_{0})-\frac{\alpha_{D,\kappa}t(1-t)}{4}d(x,y)^{2},

for all x,yMx,y\in M and γΓx,y\gamma\in\Gamma_{x,y}.
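As an illustration (not part of the formal development), the following Python sketch checks the inequality of Lemma 11, exactly as stated above, on the unit sphere (a CAT(1) space), with points sampled in a cap of radius D/2 around a pole so that the diameter bound D < D_kappa/2 = pi/2 holds. The helpers exp_map and log_map below implement the usual exponential and logarithm maps of the sphere.

# Minimal sketch (illustration only): the strong convexity inequality of Lemma 11
# on the unit sphere, with kappa = 1 and diameter bound D = 1 < pi/2.
import numpy as np

def exp_map(x, v):                              # exponential map of the unit sphere
    n = np.linalg.norm(v)
    return x if n == 0 else np.cos(n)*x + np.sin(n)*v/n

def log_map(x, y):                              # logarithm map of the unit sphere
    c = np.clip(x @ y, -1.0, 1.0)
    th = np.arccos(c)
    if th == 0:
        return np.zeros(3)
    w = y - c*x
    return th * w / np.linalg.norm(w)

def dist(x, y):
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

kappa, D = 1.0, 1.0                                       # D < pi/2 = D_kappa / 2
alpha = np.sqrt(kappa) * D / np.tan(np.sqrt(kappa) * D)   # alpha_{D,kappa}
pole = np.array([0.0, 0.0, 1.0])
rng = np.random.default_rng(3)

def sample_in_cap():                            # point within distance D/2 of the pole
    v = rng.normal(size=3); v -= (v @ pole) * pole
    v *= rng.uniform(0, D / 2) / np.linalg.norm(v)
    return exp_map(pole, v)

for _ in range(1000):
    x, y, x0 = sample_in_cap(), sample_in_cap(), sample_in_cap()
    t = rng.uniform()
    gt = exp_map(x, t * log_map(x, y))          # the point gamma(t) on the geodesic
    lhs = 0.5 * dist(gt, x0)**2
    rhs = (1-t)/2 * dist(x, x0)**2 + t/2 * dist(y, x0)**2 - alpha*t*(1-t)/4 * dist(x, y)**2
    assert lhs <= rhs + 1e-9                    # the inequality of Lemma 11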

D.3 Examples of CAT spaces

In this section, we give a few examples of CAT spaces. First, the following lemma, which is a direct consequence of [15, Theorem 1A.6] together with [16, Theorem 9.2.9], gives a whole class of Riemannian manifolds that are CAT spaces.

Lemma 12.

Let (M,g)(M,g) be a simply connected Riemannian manifold. Assume that all the sectional curvatures are bounded from above by some κ\kappa\in\mathbb{R}. Then, MM is CAT(κ)\textrm{CAT}(\kappa).

We give two examples that are of particular relevance in optimal transport and matrix analysis.

  1. 1.

    Let MM be the collection of all p×pp\times p real, symmetric positive definite matrices. Equip MM with the metric given by d(A,B)=log(A1/2BA1/2)Fd(A,B)=\|\log(A^{-1/2}BA^{-1/2})\|_{F}, where log\log is the matrix logarithm and F\|\cdot\|_{F} is the Frobenius norm. Then, dd is inherited from a Riemannian metric and (M,d)(M,d) is CAT(0)\textrm{CAT}(0) [9]. This metric space is particularly important in the study of matrix geometric means. For any A,BMA,B\in M, their geometric mean is defined as A#B=A1/2(A1/2BA1/2)1/2A1/2A\#B=A^{1/2}(A^{-1/2}BA^{-1/2})^{1/2}A^{1/2}, and it is the midpoint of the unique geodesic γ\gamma from AA to BB, i.e., A#B=γ(1/2)A\#B=\gamma(1/2).

  2. 2.

    Let MM be the collection of all p×pp\times p real, symmetric positive definite matrices with spectrum included in [λ0,)[\lambda_{0},\infty), for some λ0>0\lambda_{0}>0. Equip MM with the Bures-Wasserstein metric, obtained as follows. For A,BMA,B\in M, let d(A,B)=W2(𝒩p(0,A),𝒩p(0,B))d(A,B)=\textrm{W}_{2}(\mathcal{N}_{p}(0,A),\mathcal{N}_{p}(0,B)), where W2\textrm{W}_{2} is the Wasserstein distance of order 22 and 𝒩p(0,A)\mathcal{N}_{p}(0,A) (resp. 𝒩p(0,B)\mathcal{N}_{p}(0,B)) is the pp-variate Gaussian distribution with mean 0 and covariance matrix AA (resp. BB). Then, dd is also inherited from a Riemannian metric [10] and (M,d)(M,d) is CAT(3/(2λ0))\textrm{CAT}(3/(2\lambda_{0})), by [43, Proposition 2]. Both this metric and the affine-invariant metric of the previous item are illustrated in a short numerical sketch after this list.
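Both metrics can be evaluated in a few lines. The following Python sketch (illustration only, relying on numpy and scipy) computes the affine-invariant distance and the geometric mean A#B of Item 1, and the Bures-Wasserstein distance of Item 2 through the standard closed form W_2(N_p(0,A), N_p(0,B))^2 = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}).

# Minimal sketch (illustration only) of the two metrics on positive definite matrices.
import numpy as np
from scipy.linalg import sqrtm, logm, inv

def affine_invariant_dist(A, B):                 # d(A, B) = || log(A^{-1/2} B A^{-1/2}) ||_F
    Ais = inv(sqrtm(A))
    return np.linalg.norm(logm(Ais @ B @ Ais), 'fro')

def geometric_mean(A, B):                        # A # B, the midpoint of the geodesic
    As = sqrtm(A); Ais = inv(As)
    return np.real(As @ sqrtm(Ais @ B @ Ais) @ As)

def bures_wasserstein_dist(A, B):                # W_2(N(0, A), N(0, B)), closed form
    As = sqrtm(A)
    val = np.trace(A) + np.trace(B) - 2 * np.real(np.trace(sqrtm(As @ B @ As)))
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
A, B = X.T @ X + np.eye(3), Y.T @ Y + np.eye(3)  # two symmetric positive definite matrices

G = geometric_mean(A, B)
print(affine_invariant_dist(A, B))
print(affine_invariant_dist(A, G), affine_invariant_dist(G, B))   # both ~ d(A, B) / 2
print(bures_wasserstein_dist(A, B))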

Further examples are given below.

  • Metric trees are CAT(κ)\textrm{CAT}(\kappa) for any κ\kappa\in\mathbb{R} (since all their triangles are flat) [15, Section II.1.15].

  • The surface of a smooth convex body in a Euclidean space is CAT(κ)\textrm{CAT}(\kappa), where κ>0\kappa>0 is an upper bound on the reach of the complement of the body, as a consequence of [48, Theorem 3.2.9].

  • The space of phylogenetic trees [14] is CAT(0)\textrm{CAT}(0).