
Zihao Hu ([email protected]), Georgia Institute of Technology
Guanghui Wang ([email protected]), Georgia Institute of Technology
Jacob Abernethy ([email protected]), Georgia Institute of Technology and Google Research

Minimizing Dynamic Regret on Geodesic Metric Spaces

Abstract

In this paper, we consider the sequential decision problem where the goal is to minimize the general dynamic regret on a complete Riemannian manifold. The task of offline optimization on such a domain, also known as a geodesic metric space, has recently received significant attention. The online setting has received significantly less attention, and it has remained an open question whether the body of results that hold in the Euclidean setting can be transplanted into the land of Riemannian manifolds, where new challenges (e.g., curvature) come into play. We show how to obtain optimistic regret bounds on manifolds with non-positive curvature whenever improper learning is allowed, and we propose an array of adaptive no-regret algorithms. To the best of our knowledge, this is the first work that considers general dynamic regret and develops “optimistic” online learning algorithms that can be employed on geodesic metric spaces.

keywords:
Riemannian Manifolds, Optimistic Online Learning, Dynamic Regret

1 Introduction

Online convex optimization (OCO) in Euclidean space is a well-developed area with numerous applications. In each round, the learner takes an action from a decision set, while the adversary chooses a loss function. The long-term performance metric of the learner is (static) regret, which is defined as the difference between the learner’s cumulative loss and the loss of playing the best-fixed decision in hindsight. As the name suggests, OCO requires both the losses and the decision set to be convex. From the theoretical perspective, convex functions and sets are well-behaved objects with many desirable properties that are generally required to obtain tight regret bounds.

Typical algorithms in OCO, such as mirror descent, determine how one should adjust parameter estimates in response to arriving data, typically by shifting parameters against the gradient of the loss. But in many cases of interest, the underlying parameter space is not only non-convex but non-Euclidean. The hyperboloid, for example, arising from the solution set of a degree-two polynomial, is a Riemannian manifold that has garnered interest as a tool in tree-embedding tasks (Lou et al., 2020). On such manifolds, we do have a generalized notion of convexity, known as geodesic convexity (Udriste, 2013). There are many popular problems of interest (Hosseini and Sra, 2015; Vishnoi, 2018; Sra et al., 2018) where the underlying objective function is geodesically convex (gsc-convex) under a suitable Riemannian metric. But there has thus far been significantly limited research on how to do adaptive learning in such spaces and to understand when regret bounds are obtainable.

Table 1: Summary of bounds. \delta describes the discrepancy between the decision set and the comparator set. We define \zeta in Def. 1 and let B_{T}\coloneqq\min\{V_{T},F_{T}\}.

Algorithm   | Type                | Dynamic regret
Radar       | Standard            | O(\sqrt{\zeta(1+P_{T})T})
Lower bound | -                   | \Omega(\sqrt{(1+P_{T})T})
Radarv      | Gradient-variation  | O(\sqrt{\zeta(\frac{1+P_{T}}{\delta^{2}}+V_{T})(P_{T}+1)})
Radars      | Small-loss          | O(\sqrt{\zeta((1+P_{T})\zeta+F_{T})(P_{T}+1)})
Radarb      | Best-of-both-worlds | O\left(\sqrt{\zeta(P_{T}(\zeta+\frac{1}{\delta^{2}})+B_{T}+1)(P_{T}+1)+B_{T}\ln T}\right)

Let \mathcal{N} be a gsc-convex subset of a geodesic metric space \mathcal{M}. In this paper, we consider the problem of minimizing the general dynamic regret on \mathcal{N}, defined as

\text{D-Regret}_{T}:=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t}),

where \mathbf{x}_{1},\ldots,\mathbf{x}_{T}\in\mathcal{N} is the sequence of actions taken by the learner, whose loss is evaluated relative to the sequence of “comparator” points \mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{N}. There has been recent work establishing that sublinear regret is possible as long as \mathcal{N} and the f_{t}'s are gsc-convex, for example using a Riemannian variant of Online Gradient Descent (Wang et al., 2021). But so far there are no such results that achieve better D-Regret using adaptive algorithms.

What do we mean by “adaptive” in this context? In the online learning literature, three key quantities of interest have emerged in the context of sequential decision making: the comparator path length, the gradient variation, and the comparator loss, defined respectively as

P_{T} := \sum_{t=2}^{T}d(\mathbf{u}_{t},\mathbf{u}_{t-1}), \qquad (1)
V_{T} := \sum_{t=2}^{T}\sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t-1}(\mathbf{x})-\nabla f_{t}(\mathbf{x})\|^{2},
F_{T} := \sum_{t=1}^{T}f_{t}(\mathbf{u}_{t}).

Let us start by considering regret minimization with respect to the path length. While it has been observed that O(\sqrt{(1+P_{T})T}) is optimal in a minimax sense (Zhang et al., 2018), a great deal of research in the Euclidean setting (Srebro et al., 2010; Chiang et al., 2012; Rakhlin and Sridharan, 2013) has shown that significantly smaller regret is achievable when any one of the above quantities is small. These are not just cosmetic improvements either: many fundamental applications of online learning rely on these adaptive methods and bounds. We give a thorough overview in Section 3.

The goal of the present paper is to translate to the Riemannian setting an array of adaptive regret algorithms and to prove corresponding bounds. We propose a family of algorithms which we call Radar, for Riemannian adaptive dynamic regret. The three important variants of Radar are Radarv, Radars, and Radarb; we prove regret bounds for each, summarized in Table 1. We allow improper learning for the gradient-variation bound, which means the player can choose \mathbf{x}_{1},\dots,\mathbf{x}_{T} from a slightly larger set \mathcal{N}_{\delta G} (formally defined in Definition 4).

As a general matter, convex constraints on a Riemannian manifold introduce new difficulties in optimization that are not typically present in the Euclidean case, and there has been limited work on addressing these. To the best of our knowledge, there are only three papers considering how to incorporate constraints on manifolds, and they all make further assumptions on either the curvature or the diameter of the feasible set. Martínez-Rubio (2022) only applies to hyperbolic and spherical spaces. Criscitiello and Boumal (2022) works for complete Riemannian manifolds with sectional curvature in [-K,K], but the diameter of the decision set is at most O(\frac{1}{\sqrt{K}}). Martínez-Rubio and Pokutta (2022) mainly works for locally symmetric Hadamard manifolds. Our paper is the first to handle the projection distortion in the online setting; it applies to all Hadamard manifolds, as long as improper learning is allowed, without imposing further constraints on the diameter or the curvature.

Obtaining adaptive regret guarantees in the Riemannian setting is by no means a trivial task, as the new geometry introduces various additional technical challenges. Here is but one example: whereas the cost of a (Bregman) projection into a feasible region can be controlled using a generalized “Pythagorean” theorem in the Euclidean setting, this same issue becomes more difficult on a manifold as we encounter geometric distortion due to curvature. To better appreciate this, for a Hadamard manifold \mathcal{M}, assume the projection of \mathbf{x}\in\mathcal{M} onto a convex subset \mathcal{N}\subset\mathcal{M} is \mathbf{z}. While it is true that for any \mathbf{y}\in\mathcal{N}\setminus\{\mathbf{z}\} the angle between the geodesics \overline{\mathbf{z}\mathbf{x}} and \overline{\mathbf{z}\mathbf{y}} is obtuse, this property is only relevant at the tangent space of \mathbf{z}, yet we need to analyze gradients at the tangent space of \mathbf{x}. The use of parallel transport between \mathbf{x} and \mathbf{z} unavoidably incurs extra distortion and could potentially lead to O(T) regret.

The last challenge comes from averaging on manifolds. For example, many adaptive OCO algorithms rely on the meta-expert framework, described by Van Erven and Koolen (2016), which runs several learning algorithms in parallel and combines them through appropriately-weighted averaging. There is not, unfortunately, a single natural way to take convex combinations in a geodesic metric space, and all such averaging schemes need to account for the curvature of the manifold and incorporate the associated costs. We ultimately find the Fréchet mean to be a desirable choice, but the analysis must proceed with care.

The key contributions of our work can be summarized as follows:

  • We develop the optimistic mirror descent (OMD) algorithm on Hadamard manifolds (we focus on Hadamard manifolds in the main paper and extend the guarantee to CAT(\kappa) spaces in Appendix B.2) in the online improper learning setting. Interestingly, we also show that Optimistic Hedge, a variant of OMD, works for gsc-convex losses. We believe these tools may have significant applications to research in online learning and Riemannian optimization.

  • We combine our analysis on OMD with the meta-expert framework  (Van Erven and Koolen, 2016) to get several adaptive regret bounds, as shown in Table 1.

  • We develop a novel dynamic regret lower bound, which shows that our O(\sqrt{\zeta(1+P_{T})T}) bound is tight up to the geometric constant \zeta.

2 Preliminaries

In this section, we introduce background knowledge of OCO and Riemannian manifolds.

OCO in Euclidean space.

We first formally describe OCO in Euclidean space. In each round t=1,\dots,T, the learner makes a decision \mathbf{x}_{t}\in\mathcal{X} based on the historical losses f_{1},\dots,f_{t-1}, where \mathcal{X} is a convex decision set, and then the adversary reveals a convex loss function f_{t}. The goal of the learner is to minimize the difference between the cumulative loss and that of the best fixed decision in hindsight: \text{Regret}_{T}=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\min_{\mathbf{x}\in\mathcal{X}}\sum_{t=1}^{T}f_{t}(\mathbf{x}), which is usually referred to as the static regret, since the comparator is a fixed decision.

In the literature, there exists a large number of algorithms (Hazan et al., 2016) for minimizing the static regret. However, the underlying assumption behind static regret is that the adversary's behavior does not change drastically, which can be unrealistic in real applications. To resolve this issue, dynamic regret has been proposed (Zinkevich, 2003), defined as

\text{Regret}_{T}(\mathbf{u}_{1},\dots,\mathbf{u}_{T})=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t}),

where \mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{X} is a comparator sequence. Dynamic regret has received considerable attention recently (Besbes et al., 2015; Jadbabaie et al., 2015; Mokhtari et al., 2016; Zhang et al., 2017, 2018; Zhao et al., 2020; Wan et al., 2021; Zhao and Zhang, 2021; Baby and Wang, 2021) due to its flexibility. However, dynamic regret can be as large as O(T) in general. Thus, some regularity needs to be imposed on the comparator sequence to ensure no-regret online learning. A common assumption (Zinkevich, 2003) is that the path-length (see Equation (1)) of the comparator sequence is bounded. We refer to the corresponding dynamic regret as general dynamic regret because any assignment of \mathbf{u}_{1},\dots,\mathbf{u}_{T} subject to the path-length constraint is feasible.

Riemannian manifolds. Here, we give a brief overview of Riemannian geometry, but this is a complex subject, and we refer the reader to, e.g., Petersen (2006) for a full treatment. A Riemannian manifold (\mathcal{M},g) is a smooth manifold \mathcal{M} equipped with a Riemannian metric g. The tangent space T_{\mathbf{x}}\mathcal{M}\cong\mathbb{R}^{d}, generalizing the concept of the tangent plane, contains the vectors tangent to any curve passing through \mathbf{x}. The Riemannian metric g induces the inner product \langle\mathbf{u},\mathbf{v}\rangle_{\mathbf{x}} and the Riemannian norm \|\mathbf{u}\|_{\mathbf{x}}=\sqrt{\langle\mathbf{u},\mathbf{u}\rangle_{\mathbf{x}}} for \mathbf{u},\mathbf{v}\in T_{\mathbf{x}}\mathcal{M} (we omit the reference point \mathbf{x} when it is clear from the context). We use d(\mathbf{x},\mathbf{y}) to denote the Riemannian distance between \mathbf{x},\mathbf{y}\in\mathcal{M}, which is the greatest lower bound of the lengths of all piecewise smooth curves joining \mathbf{x} and \mathbf{y}.

A curve connecting \mathbf{x},\mathbf{y}\in\mathcal{M} is a geodesic if it is locally length-minimizing. For two points \mathbf{x},\mathbf{y}\in\mathcal{M}, suppose there exists a geodesic \gamma(t):[0,1]\rightarrow\mathcal{M} such that \gamma(0)=\mathbf{x}, \gamma(1)=\mathbf{y}, and \gamma^{\prime}(0)=\mathbf{v}\in T_{\mathbf{x}}\mathcal{M}. The exponential map \textnormal{Exp}_{\mathbf{x}}(\cdot):T_{\mathbf{x}}\mathcal{M}\rightarrow\mathcal{M} maps \mathbf{v}\in T_{\mathbf{x}}\mathcal{M} to \mathbf{y}\in\mathcal{M}. Correspondingly, the inverse exponential map \textnormal{Exp}_{\mathbf{x}}^{-1}(\cdot):\mathcal{M}\rightarrow T_{\mathbf{x}}\mathcal{M} maps \mathbf{y}\in\mathcal{M} to \mathbf{v}\in T_{\mathbf{x}}\mathcal{M}. Since travel along a geodesic has constant velocity, we have d(\mathbf{x},\mathbf{y})=\|\textnormal{Exp}_{\mathbf{x}}^{-1}\mathbf{y}\|_{\mathbf{x}}. It is also useful to compare tangent vectors in different tangent spaces. Parallel transport \Gamma_{\mathbf{x}}^{\mathbf{y}}\mathbf{u} translates \mathbf{u} from T_{\mathbf{x}}\mathcal{M} to T_{\mathbf{y}}\mathcal{M} and preserves the inner product, i.e., \langle\mathbf{u},\mathbf{v}\rangle_{\mathbf{x}}=\langle\Gamma_{\mathbf{x}}^{\mathbf{y}}\mathbf{u},\Gamma_{\mathbf{x}}^{\mathbf{y}}\mathbf{v}\rangle_{\mathbf{y}}.
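
To make these primitives concrete, here is a minimal sketch (ours, for illustration only) of the distance, exponential map, and inverse exponential map on the Lorentz (hyperboloid) model of hyperbolic space, a standard Hadamard manifold of curvature -1; all function and variable names are ours, and the formulas are the textbook ones for this model.

import numpy as np

def minkowski_inner(u, v):
    """Minkowski inner product <u, v>_L = -u_0 v_0 + sum_i u_i v_i."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def dist(x, y):
    """Riemannian distance on the hyperboloid: d(x, y) = arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-minkowski_inner(x, y), 1.0, None))

def exp_map(x, v):
    """Exp_x(v): follow the geodesic from x with initial velocity v for unit time."""
    norm_v = np.sqrt(max(minkowski_inner(v, v), 0.0))
    if norm_v < 1e-12:
        return x
    return np.cosh(norm_v) * x + np.sinh(norm_v) * v / norm_v

def log_map(x, y):
    """Exp_x^{-1}(y): the tangent vector at x pointing to y with length d(x, y)."""
    d = dist(x, y)
    if d < 1e-12:
        return np.zeros_like(x)
    u = y + minkowski_inner(x, y) * x   # component of y in the tangent space at x
    return d * u / np.sqrt(minkowski_inner(u, u))

def lift(p):
    """Embed p in R^d as a point on the hyperboloid {x : <x, x>_L = -1, x_0 > 0}."""
    return np.concatenate(([np.sqrt(1.0 + p @ p)], p))

# Quick sanity check: d(x, y) equals the (Minkowski) norm of Exp_x^{-1}(y).
x, y = lift(np.array([0.3, -0.2])), lift(np.array([-0.5, 0.1]))
v = log_map(x, y)
assert abs(dist(x, y) - np.sqrt(minkowski_inner(v, v))) < 1e-8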

The curvature of a Riemannian manifold reflects the extent to which the manifold differs from a Euclidean surface. For optimization purposes, it usually suffices to consider the sectional curvature. Following Zhang and Sra (2016); Wang et al. (2021), in this paper we mainly consider Hadamard manifolds, which are complete, simply connected manifolds with non-positive sectional curvature. On such manifolds, every two points are connected by a unique distance-minimizing geodesic by the Hopf-Rinow theorem (Petersen, 2006).

A subset \mathcal{N} of \mathcal{M} is gsc-convex if for any \mathbf{x},\mathbf{y}\in\mathcal{N}, there exists a geodesic connecting \mathbf{x} and \mathbf{y} that lies entirely in \mathcal{N}. A function f:\mathcal{N}\rightarrow\mathbb{R} is gsc-convex if \mathcal{N} is gsc-convex and the composition f(\gamma(t)) satisfies f(\gamma(t))\leq(1-t)f(\gamma(0))+tf(\gamma(1)) for any geodesic \gamma(t)\subseteq\mathcal{N} and t\in[0,1]. An alternative definition of geodesic convexity is

f(\mathbf{y})\geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}),\textnormal{Exp}_{\mathbf{x}}^{-1}\mathbf{y}\rangle,\qquad\forall\;\mathbf{x},\mathbf{y}\in\mathcal{N},

where the Riemannian gradient \nabla f(\mathbf{x})\in T_{\mathbf{x}}\mathcal{M} is the unique vector determined by Df(\mathbf{x})[\mathbf{v}]=\langle\mathbf{v},\nabla f(\mathbf{x})\rangle, and Df(\mathbf{x})[\mathbf{v}] is the differential of f along \mathbf{v}\in T_{\mathbf{x}}\mathcal{M}.

Similarly, an L-geodesically-smooth (L-gsc-smooth) function f satisfies \|\Gamma_{\mathbf{y}}^{\mathbf{x}}\nabla f(\mathbf{y})-\nabla f(\mathbf{x})\|\leq L\cdot d(\mathbf{x},\mathbf{y}) for all \mathbf{x},\mathbf{y}\in\mathcal{N}, or

f(\mathbf{y})\leq f(\mathbf{x})+\langle\nabla f(\mathbf{x}),\textnormal{Exp}_{\mathbf{x}}^{-1}\mathbf{y}\rangle+\frac{L}{2}d(\mathbf{x},\mathbf{y})^{2}.

3 Related Work

In this part, we briefly review past work on OCO in Euclidean space, online optimization and optimism on Riemannian manifolds.

3.1 OCO in Euclidean Space

Static regret.

We first consider work on static regret. In Euclidean space, it is well known that online gradient descent (OGD) guarantees O(\sqrt{T}) and O(\log T) regret for convex and strongly convex losses, respectively (Hazan et al., 2016), which is also minimax optimal (Abernethy et al., 2008). However, the aforementioned bounds are not fully adaptive due to their dependence on T. Therefore, there is a tendency to replace T with problem-dependent quantities. Srebro et al. (2010) first observe that smooth and non-negative losses satisfy the self-bounding property, thus establishing the small-loss bound O(\sqrt{F_{T}^{\star}}), where F_{T}^{\star}=\sum_{t=1}^{T}f_{t}(\mathbf{x}^{\star}) is the cumulative loss of the best action in hindsight. Chiang et al. (2012) propose extra-gradient methods to get an O(\sqrt{V_{T}}) gradient-variation regret bound for convex and smooth losses, where V_{T}=\sum_{t=2}^{T}\sup_{\mathbf{x}\in\mathcal{X}}\|\nabla f_{t-1}(\mathbf{x})-\nabla f_{t}(\mathbf{x})\|_{2}^{2}. Rakhlin and Sridharan (2013) generalize the work of Chiang et al. (2012) and propose optimistic mirror descent, which has since become a standard tool in online learning.

Dynamic regret.

Now we switch to the related work on dynamic regret. Zinkevich (2003) proposes to use OGD to get an O\left(\eta T+\frac{1+P_{T}}{\eta}\right) regret bound, where \eta is the step size, but the result turns out to be O((1+P_{T})\sqrt{T}) since the value of P_{T} is unknown to the learner. The seminal work of Zhang et al. (2018) uses Hedge to combine the advice of experts with different step sizes, and shows an O\left(\sqrt{(1+P_{T})T}\right) regret. A matching lower bound is also established therein. Zhao et al. (2020) utilize smoothness to get a gradient-variation bound, a small-loss bound, and a best-of-both-worlds bound in Euclidean space.

3.2 Online Learning and Optimism on Riemannian Manifolds

In the online setting, Bécigneul and Ganea (2019) consider adaptive stochastic optimization on Riemannian manifolds, but their results only apply to Cartesian products of one-manifolds. Maass et al. (2022) study the restricted dynamic regret on Hadamard manifolds in the gradient-free setting and provide an O(\sqrt{T}+P_{T}^{\star}) bound for gsc-strongly convex and gsc-smooth functions, where P_{T}^{\star} is the path-length of the comparator sequence formed by \mathbf{u}_{t}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{X}}f_{t}(\mathbf{x}). On Hadamard manifolds, Wang et al. (2021) apply Riemannian OGD (R-OGD) to get an O(\sqrt{T}) upper bound and an \Omega(\sqrt{T}) randomized lower bound. Comparatively, we focus on general and adaptive dynamic regret on Hadamard manifolds. Our minimax lower bound is also novel.

There also exist algorithms considering optimism on Riemannian manifolds. Zhang et al. (2022) propose Riemannian Corrected Extra Gradient (RCEG) for unconstrained minimax optimization on manifolds. Karimi et al. (2022) consider a Robbins-Monro framework on Hadamard manifolds which subsumes Riemannian stochastic extra-gradient. By imposing weak asymptotic coercivity and using a decaying step size, the trajectory is guaranteed to be finite (Karimi et al., 2022). However, our paper is the first to consider the constrained case and the online setting. For the improper learning setting, we show that a constant step size achieves the same guarantee as in Euclidean space.

4 Path Length Dynamic Regret Bound on Manifolds

In this section, we present the results related to the minimax path-length bound on manifolds. Before diving into the details, following previous work (Zinkevich, 2003; Wang et al., 2021), we introduce some standard assumptions and definitions.

Assumption 1

\mathcal{M} is a Hadamard manifold and its sectional curvature is lower bounded by \kappa\leq 0.

Assumption 2

The decision set \mathcal{N} is a gsc-convex compact subset of \mathcal{M} with diameter upper bounded by D, i.e., \sup_{\mathbf{x},\mathbf{y}\in\mathcal{N}}d(\mathbf{x},\mathbf{y})\leq D. For optimistic online learning, we allow the player to choose decisions from \mathcal{N}_{\delta M}, which is defined in Definition 4; the diameter then becomes (D+2\delta M).

Assumption 3

The norms of the Riemannian gradients are bounded by G, i.e., \sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t}(\mathbf{x})\|\leq G. When improper learning is allowed, we assume \sup_{\mathbf{x}\in\mathcal{N}_{\delta M}}\|\nabla f_{t}(\mathbf{x})\|\leq G.

Definition 1

Under Assumptions 1, 2, we denote \zeta\coloneqq\sqrt{-\kappa}D\coth(\sqrt{-\kappa}D). When improper learning is allowed, \zeta\coloneqq\sqrt{-\kappa}(D+2\delta M)\coth(\sqrt{-\kappa}(D+2\delta M)), where M is given in Definition 4. Note that on manifolds of zero sectional curvature (\kappa=0), we have \zeta=\lim_{x\to 0}x\coth{x}=1.

The seminal work of Zinkevich (2003) shows that the classical OGD algorithm can minimize the general dynamic regret in Euclidean space. Motivated by this, we consider the Riemannian OGD (R-OGD) algorithm (Wang et al., 2021):

\mathbf{x}_{t+1}=\Pi_{\mathcal{N}}\,\textnormal{Exp}_{\mathbf{x}_{t}}(-\eta\nabla f_{t}(\mathbf{x}_{t})), \qquad (2)

which is a natural extension of OGD to the manifold setting. We show that R-OGD can also minimize the general dynamic regret on manifolds. Due to space limitations, we postpone the details to Appendix A.
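
As an illustration of the update rule (2) (a sketch of ours, not the authors' code), R-OGD only needs an exponential map, a gradient oracle, and a metric projection onto \mathcal{N}; these are passed in as callables below, and the tiny demo instantiates them in the Euclidean (zero-curvature) case, where the exponential map is vector addition.

import numpy as np

def r_ogd(x0, grads, exp_map, project, eta):
    """Riemannian online gradient descent (Equation (2)), sketched generically.

    x0      : initial decision in the feasible set N
    grads   : list of callables; grads[t](x) returns the Riemannian gradient of f_t at x
    exp_map : exp_map(x, v) moves from x along the geodesic with initial velocity v
    project : metric projection onto the gsc-convex decision set N
    eta     : step size
    """
    xs, x = [], x0
    for grad_t in grads:
        xs.append(x)                               # play x_t
        x = project(exp_map(x, -eta * grad_t(x)))  # x_{t+1} = Pi_N Exp_{x_t}(-eta grad f_t(x_t))
    return xs

# Euclidean sanity check (curvature 0): exp_map(x, v) = x + v, N = unit ball.
euclid_exp = lambda x, v: x + v
ball_proj = lambda x: x / max(1.0, np.linalg.norm(x))
targets = [np.array([0.5, 0.0]), np.array([0.0, -0.5])]
grads = [lambda x, u=u: 2.0 * (x - u) for u in targets]   # f_t(x) = ||x - u_t||^2
plays = r_ogd(np.zeros(2), grads, euclid_exp, ball_proj, eta=0.1)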

Theorem 1

Suppose Assumptions 1, 2 and 3 hold. Then the general dynamic regret of R-OGD, defined in Equation (2), satisfies

\text{D-Regret}_{T}\leq\frac{D^{2}+2DP_{T}}{2\eta}+\frac{\eta\zeta G^{2}T}{2}. \qquad (3)

Theorem 1 implies that R-OGD yields an O(\frac{P_{T}+1}{\eta}+\eta T) general dynamic regret bound, which means the optimal step size is \eta=O\left(\sqrt{\frac{1+P_{T}}{T}}\right). However, this configuration of \eta is infeasible, as P_{T} is unknown to the learner. Although a sub-optimal choice for \eta, i.e., \eta=O\left(\frac{1}{\sqrt{T}}\right), is accessible, the resulting algorithm suffers O((1+P_{T})\sqrt{T}) regret.
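
For completeness, the optimal tuning mentioned above is the elementary balancing of the two terms in (3):

\min_{\eta>0}\left\{\frac{D^{2}+2DP_{T}}{2\eta}+\frac{\eta\zeta G^{2}T}{2}\right\}=\sqrt{(D^{2}+2DP_{T})\,\zeta G^{2}T}\quad\text{at}\quad\eta^{\star}=\sqrt{\frac{D^{2}+2DP_{T}}{\zeta G^{2}T}},

which is O(\sqrt{\zeta(1+P_{T})T}) since D and G are constants; the difficulty is precisely that P_{T}, and hence \eta^{\star}, is unknown in advance.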

The meta-expert framework (Van Erven and Koolen, 2016) consists of a meta algorithm and several expert algorithm instances. The construction is modular, so that different meta algorithms and expert algorithms can be combined to achieve different regret guarantees. For optimizing dynamic regret, the seminal work of Zhang et al. (2018) proposes Ader based on this framework. In every round t, each expert runs OGD with a different step size, and the meta algorithm applies Hedge to learn the best weights. The step sizes used by the experts are carefully designed so that there always exists an expert which is almost optimal. The regret of Ader is O(\sqrt{(1+P_{T})T}), which is minimax-optimal in Euclidean space (Zhang et al., 2018).

However, it is unclear at first glance how to extend Ader to manifolds, since we need to figure out the “correct” way to do averaging. In this paper, we resolve this problem using the Fréchet mean and the geodesic mean. Our proposed algorithm, called Radar, consists of N instances of the expert algorithm (Algorithm 4), each of which runs R-OGD with a different step size, and a meta algorithm (Algorithm 4), which enjoys a regret approximately the same as that of the best expert. We denote the set of all step sizes \{\eta_{i}\} by \mathcal{H}. In the t-th round, the expert algorithms submit all \mathbf{x}_{t,i}'s (i=1,\dots,N) to the meta algorithm. Then the meta algorithm computes either the Fréchet mean or the geodesic mean (see Algorithm E in Appendix E for details) as \mathbf{x}_{t}. After receiving f_{t}, the meta algorithm updates the weight w_{t+1,i} of each expert via Hedge and sends \nabla f_{t}(\mathbf{x}_{t,i}) to the i-th expert, which computes \mathbf{x}_{t+1,i} by R-OGD. The regret of the meta algorithm of Radar is bounded in Lemma 4.1.

Radar: Meta Algorithm
Input: learning rate \beta, set of step sizes \mathcal{H}, initial weights w_{1,i}=\frac{N+1}{i(i+1)N}
for t=1,\dots,T:
  Receive \mathbf{x}_{t,i} from each expert with step size \eta_{i}
  \mathbf{x}_{t}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{N}}\sum_{i\in[N]}w_{t,i}\,d(\mathbf{x},\mathbf{x}_{t,i})^{2}
  Observe the loss function f_{t}
  Update w_{t+1,i} by Hedge with f_{t}(\mathbf{x}_{t,i})
  Send the gradient \nabla f_{t}(\mathbf{x}_{t,i}) to each expert

Radar: Expert Algorithm
Input: a step size \eta_{i}
Let \mathbf{x}_{1,i} be any point in \mathcal{N}
for t=1,\dots,T:
  Submit \mathbf{x}_{t,i} to the meta algorithm
  Receive the gradient \nabla f_{t}(\mathbf{x}_{t,i}) from the meta algorithm
  Update: \mathbf{x}_{t+1,i}=\Pi_{\mathcal{N}}\,\textnormal{Exp}_{\mathbf{x}_{t,i}}(-\eta_{i}\nabla f_{t}(\mathbf{x}_{t,i}))
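
The weighted Fréchet mean used in the meta algorithm, \operatorname*{argmin}_{\mathbf{x}}\sum_{i}w_{t,i}d(\mathbf{x},\mathbf{x}_{t,i})^{2}, generally has no closed form. One standard way to approximate it is a Karcher-mean style Riemannian gradient iteration, sketched below (ours, for illustration; it is not the exact routine referenced in Appendix E). Any exponential and inverse exponential maps can be plugged in, e.g., the hyperboloid primitives sketched earlier.

import numpy as np

def frechet_mean(points, weights, exp_map, log_map, n_iters=50, lr=1.0):
    """Approximate argmin_x sum_i w_i d(x, x_i)^2 by Riemannian gradient descent.

    The Riemannian gradient of x -> (1/2) sum_i w_i d(x, x_i)^2 is
    -sum_i w_i Exp_x^{-1}(x_i), so each step moves along the weighted
    average of the log maps.
    """
    x = points[0]
    for _ in range(n_iters):
        step = sum(w * log_map(x, p) for w, p in zip(weights, points))
        x = exp_map(x, lr * step)
    return x

# Euclidean sanity check: the Frechet mean reduces to the weighted average.
euclid_exp = lambda x, v: x + v
euclid_log = lambda x, y: y - x
pts = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
w = [0.5, 0.25, 0.25]
m = frechet_mean(pts, w, euclid_exp, euclid_log)
assert np.allclose(m, np.array([0.5, 0.5]), atol=1e-6)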

Lemma 4.1.

Under Assumptions 1, 2, 3, and setting \beta=\sqrt{\frac{8}{G^{2}D^{2}T}}, the regret of Algorithm 4 satisfies

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq\sqrt{\frac{G^{2}D^{2}T}{8}}\left(1+\ln\frac{1}{w_{1,i}}\right).

We show that, by configuring the step sizes in \mathcal{H} carefully, Radar ensures an O(\sqrt{\zeta(1+P_{T})T}) bound on geodesic metric spaces.

Theorem 2.

Set \mathcal{H}=\left\{\eta_{i}=2^{i-1}\sqrt{\frac{D^{2}}{G^{2}\zeta T}}\;\Big{|}\;i\in[N]\right\}, where N=\lceil\frac{1}{2}\log_{2}(1+2T)\rceil+1, and \beta=\sqrt{\frac{8}{G^{2}D^{2}T}}. Under Assumptions 1, 2, 3, for any comparator sequence \mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{N}, the general dynamic regret of Radar satisfies

\text{D-Regret}_{T}=O(\sqrt{\zeta(1+P_{T})T}).
Remark 4.2.

Note that if \mathcal{M} is Euclidean space, then \zeta=1 and we get O(\sqrt{(1+P_{T})T}) regret, which is the same as in Zhang et al. (2018).
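
For concreteness, a small illustrative sketch (ours) of the step-size grid in Theorem 2, together with a check that the geometric grid always contains a step size within a factor of 2 of the ideal tuning \eta^{\star}=\sqrt{(D^{2}+2DP_{T})/(\zeta G^{2}T)} for any P_{T}\in[0,DT]; the numerical constants in the demo are arbitrary.

import math

def radar_stepsize_grid(D, G, zeta, T):
    """Step-size grid from Theorem 2: eta_i = 2^{i-1} sqrt(D^2 / (G^2 zeta T))."""
    N = math.ceil(0.5 * math.log2(1 + 2 * T)) + 1
    eta_min = math.sqrt(D ** 2 / (G ** 2 * zeta * T))
    return [2 ** (i - 1) * eta_min for i in range(1, N + 1)]

# Check that the grid brackets the ideal step size for several path lengths.
D, G, zeta, T = 2.0, 1.0, 1.5, 10_000
grid = radar_stepsize_grid(D, G, zeta, T)
for P_T in [0.0, 1.0, 50.0, D * T]:
    eta_star = math.sqrt((D ** 2 + 2 * D * P_T) / (zeta * G ** 2 * T))
    assert grid[0] <= eta_star <= 2 * grid[-1]
    assert any(eta <= eta_star <= 2 * eta for eta in grid)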

A disadvantage of Radar is that \Theta(\log T) gradient queries are required in each round. In Euclidean space, Zhang et al. (2018) use a linear surrogate loss to achieve the same bound with O(1) gradient queries. But on manifolds, the existence of such functions implies that the sectional curvature of the manifold is everywhere 0 (Kristály et al., 2016). It is interesting to investigate whether \Omega(\log T) gradient queries are necessary to achieve this dynamic regret on manifolds. We would also like to point out that O(\log T) is reasonably small, and the work of Zhang et al. (2018) still needs O(\log T) computational complexity per round.

Using the Busemann function as a bridge, we show the following dynamic regret lower bound, with proof deferred to Appendix A.4.

Theorem 3.

There exists a comparator sequence which satisfies \sum_{t=2}^{T}d(\mathbf{u}_{t},\mathbf{u}_{t-1})\leq P_{T} and incurs \Omega(\sqrt{(1+P_{T})T}) dynamic regret on Hadamard manifolds.

Although the regret guarantee in Theorem 2 is optimal up to constants in terms of T and P_{T} in view of the corresponding lower bound, it still depends on T and thus cannot adapt to mild environments. In Euclidean space, the smoothness of the losses induces adaptive regret bounds, including the gradient-variation bound (Chiang et al., 2012) and the small-loss bound (Srebro et al., 2010). It is then natural to ask whether similar bounds can be established on manifolds by assuming gsc-smoothness. We provide an affirmative answer to this question and show how to get problem-dependent bounds under the Radar framework.

5 Gradient-variation Bound on Manifolds

In this section, we show how to obtain the gradient-variation bound on manifolds under the Radar framework with alternative expert and meta algorithms.

Expert Algorithm.

For minimax optimization on Riemannian manifolds, Zhang et al. (2022) propose Riemannian Corrected Extra Gradient (RCEG), which performs the following iterates:

\mathbf{x}_{t}=\textnormal{Exp}_{\mathbf{y}_{t}}(-\eta\nabla f_{t-1}(\mathbf{y}_{t})),
\mathbf{y}_{t+1}=\textnormal{Exp}_{\mathbf{x}_{t}}\left(-\eta\nabla f_{t}(\mathbf{x}_{t})+\textnormal{Exp}_{\mathbf{x}_{t}}^{-1}(\mathbf{y}_{t})\right).

However, this algorithm does not work in the constrained case, which has been left as an open problem (Zhang et al., 2022). The online improper learning setting (Hazan et al., 2018; Baby and Wang, 2021) allows the decision set to be different from (usually larger than) the set of strategies we want to compete against. Under such a setting, we find that the geometric distortion due to projection can be bounded in an elegant way, and we generalize RCEG to incorporate an optimism term M_{t}\in T_{\mathbf{y}_{t}}\mathcal{M}.

Definition 4.

We use M_{t} to denote the optimism at round t and assume there exists M such that \|M_{t}\|\leq M for all t. We define \mathcal{N}_{c}=\{\mathbf{x}\,|\,d(\mathbf{x},\mathcal{N})\leq c\}, where d(\mathbf{x},\mathcal{N})\coloneqq\inf_{\mathbf{y}\in\mathcal{N}}d(\mathbf{x},\mathbf{y}). In the improper setting, we allow the player to choose decisions from \mathcal{N}_{\delta M}.

Theorem 5.

Suppose all losses f_{t} are L-gsc-smooth on \mathcal{M}. Under Assumptions 1, 2, 3, the iterates

\mathbf{x}_{t}^{\prime}=\textnormal{Exp}_{\mathbf{y}_{t}}(-\eta M_{t}),
\mathbf{x}_{t}=\Pi_{\mathcal{N}_{\delta M}}\mathbf{x}_{t}^{\prime},
\mathbf{y}_{t+1}=\Pi_{\mathcal{N}}\,\textnormal{Exp}_{\mathbf{x}_{t}^{\prime}}\left(-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+\textnormal{Exp}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})\right) \qquad (4)

satisfy

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\leq\eta\zeta\sum_{t=1}^{T}\|\nabla f_{t}(\mathbf{y}_{t})-M_{t}\|^{2}+\frac{D^{2}+2DP_{T}}{2\eta}

for any \mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{N} and \eta\leq\frac{\delta M}{G+(G^{2}+2\zeta\delta^{2}M^{2}L^{2})^{\frac{1}{2}}}. Specifically, we achieve \eta\zeta V_{T}+\frac{D^{2}+2DP_{T}}{2\eta} regret by choosing M_{t}=\nabla f_{t-1}(\mathbf{y}_{t}). In this case, M=G and we need \eta\leq\frac{\delta}{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}.
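
A minimal sketch (ours, not the authors' implementation) of one round of the iterates (4); the metric projections onto \mathcal{N} and \mathcal{N}_{\delta M} are assumed to be available as callables, and the demo again uses the Euclidean instantiation with balls as the two sets.

import numpy as np

def optimistic_rceg_round(y_t, M_t, grad_f_t, eta, exp_map, log_map,
                          project_N, project_N_delta):
    """One round of the improper-learning iterates in Equation (4).

    y_t      : current reference point in N
    M_t      : optimism vector in the tangent space at y_t
    grad_f_t : callable returning the Riemannian gradient of f_t
    Returns (x_t, y_{t+1}): the played (possibly improper) point and the next state.
    """
    x_prime = exp_map(y_t, -eta * M_t)                       # optimistic step
    x_t = project_N_delta(x_prime)                           # played point in N_{delta M}
    v = -eta * grad_f_t(x_prime) + log_map(x_prime, y_t)     # corrected gradient step
    y_next = project_N(exp_map(x_prime, v))                  # project back onto N
    return x_t, y_next

# Euclidean sanity check with the gradient-variation choice M_t = grad f_{t-1}(y_t).
euclid_exp = lambda x, v: x + v
euclid_log = lambda x, y: y - x
proj_ball = lambda r: (lambda x: x if np.linalg.norm(x) <= r else r * x / np.linalg.norm(x))
grad_prev = lambda y: 2.0 * (y - np.array([1.0, 0.0]))       # gradient of f_{t-1}
grad_curr = lambda x: 2.0 * (x - np.array([0.0, 1.0]))       # gradient of f_t
y = np.zeros(2)
x_t, y = optimistic_rceg_round(y, grad_prev(y), grad_curr, 0.1,
                               euclid_exp, euclid_log, proj_ball(1.0), proj_ball(1.2))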

Proof sketch.

We use the special case M_{t}=\nabla f_{t-1}(\mathbf{y}_{t}) to illustrate the main idea of the proof. We first decompose f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t}) into two terms,

\begin{split}&f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})=(f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t}^{\prime}))+(f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t}))\\ \leq&\,G\cdot d(\mathbf{x}_{t},\mathbf{x}_{t}^{\prime})+(f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t}))\leq\underbrace{G\cdot d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})}_{\text{troublesome term }1}+\underbrace{(f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t}))}_{\text{unconstrained RCEG}}\end{split}

where the first inequality is due to the gradient Lipschitzness condition, and the second one follows from the non-expansiveness of the projection. For the unconstrained RCEG term, we have the following decomposition,

\begin{split}f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t})\leq&\underbrace{\frac{1}{2\eta}(2\eta^{2}\zeta L^{2}-1)d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}}_{\text{troublesome term }2}\\ +&\underbrace{\eta\zeta\|\nabla f_{t}(\mathbf{y}_{t})-\nabla f_{t-1}(\mathbf{y}_{t})\|^{2}}_{\eta\zeta V_{T}}+\underbrace{\frac{1}{2\eta}\left(d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1},\mathbf{u}_{t})^{2}\right)}_{\frac{D^{2}+2DP_{T}}{2\eta}}\end{split}

where the second and third terms correspond to the gradient-variation term and the dynamic-regret term, respectively.

In the improper learning setting, we can show d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\geq\delta G. Combining both troublesome terms, it suffices to find \eta which satisfies

2\eta G+2\eta^{2}\zeta L^{2}\lambda-\lambda\leq 0,\qquad\forall\,\lambda\coloneqq d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\geq\delta G.
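
Solving this condition recovers the step-size range of Theorem 5 (for M=G): the left-hand side is linear in \lambda with slope 2\eta^{2}\zeta L^{2}-1, so once the slope is non-positive it suffices to enforce the inequality at \lambda=\delta G, i.e., 2\zeta L^{2}\delta G\,\eta^{2}+2G\eta-\delta G\leq 0, whose positive root gives

\eta\leq\frac{-G+\sqrt{G^{2}+2\zeta\delta^{2}G^{2}L^{2}}}{2\zeta L^{2}\delta G}=\frac{\delta}{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}},

and one can check that any such \eta also satisfies 2\eta^{2}\zeta L^{2}\leq 1.
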
Remark 5.1.

We generalize Theorem 5 from Hadamard manifolds to CAT(\kappa) spaces in Appendix B.2. Note that although we allow the player to make improper decisions, V_{T} is still defined on \mathcal{N} instead of \mathcal{N}_{\delta G}. For the static setting, P_{T}=0 and the resulting regret bound is O(\sqrt{V_{T}}+\frac{1}{\delta}). Also, in this setting, we can use an adaptive step size

\eta_{t}=\min\left\{\frac{1}{\sqrt{1+\sum_{s=2}^{t}\|\nabla f_{s}(\mathbf{y}_{s})-\nabla f_{s-1}(\mathbf{y}_{s})\|^{2}}},\;\frac{\delta}{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}\right\}

to eliminate the dependence on V_{T}.

Meta algorithm.

Intuitively, we can run OMD with different step sizes and apply a meta algorithm to estimate the optimal step size. Previous studies on learning with multiple step sizes usually adopt Hedge to aggregate the experts' advice. However, the regret of Hedge is O(\sqrt{T\ln N}) and is thus undesirable for our purpose. Inspired by optimistic online learning (Rakhlin and Sridharan, 2013; Syrgkanis et al., 2015), Zhao et al. (2020) adopt Optimistic Hedge as the meta algorithm to get an O(\sqrt{(V_{T}+P_{T})P_{T}}) gradient-variation bound. After a careful analysis, we show that Optimistic Hedge works for gsc-convex losses despite the geometric distortion, and we obtain the desired gradient-variation bound.

Radarv: Expert Algorithm
Input: a step size \eta_{i}
Let \mathbf{x}_{1,i} be any point in \mathcal{N}
for t=1,\dots,T:
  Submit \mathbf{x}_{t,i} to the meta algorithm
  Receive the gradient \nabla f_{t}(\cdot) from the meta algorithm
  Run Equation (4) with M_{t}=\nabla f_{t-1}(\mathbf{y}_{t}), M=G and step size \eta_{i}

Radarv: Meta Algorithm
Input: a learning rate \beta, a set of step sizes \mathcal{H}, initial weights w_{1,i}=w_{0,i}=\frac{1}{N}
for t=1,\dots,T:
  Receive all \mathbf{x}_{t,i}'s from the experts with step sizes \eta_{i}
  \bar{\mathbf{x}}_{t}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{N}_{\delta G}}\sum_{i\in[N]}w_{t-1,i}\,d(\mathbf{x},\mathbf{x}_{t,i})^{2}
  Update w_{t,i}\propto\exp\left(-\beta\left(\sum_{s=1}^{t-1}\ell_{s,i}+m_{t,i}\right)\right) by Equation (5)
  \mathbf{x}_{t}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{N}_{\delta G}}\sum_{i\in[N]}w_{t,i}\,d(\mathbf{x},\mathbf{x}_{t,i})^{2}
  Observe f_{t}(\cdot) and send \nabla f_{t}(\cdot) to the experts

We denote by \boldsymbol{\ell}_{t},\mathbf{m}_{t}\in\mathbb{R}^{N} the surrogate loss and the optimism at round t. The update rule of Optimistic Hedge is:

w_{t,i}\propto\exp\left(-\beta\left(\sum_{s=1}^{t-1}\ell_{s,i}+m_{t,i}\right)\right),

which achieves adaptive regret due to the optimism. The following technical lemma (Syrgkanis et al., 2015) is critical for our analysis of Optimistic Hedge, and the proof is in Appendix B.3 for completeness.

Lemma 5.2.

For any i\in[N], Optimistic Hedge satisfies

\sum_{t=1}^{T}\left\langle\mathbf{w}_{t},\boldsymbol{\ell}_{t}\right\rangle-\ell_{t,i}\leq\frac{2+\ln N}{\beta}+\beta\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}-\frac{1}{4\beta}\sum_{t=2}^{T}\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}.

Following the insightful work of Zhao et al. (2020), we also adopt the Optimistic Hedge algorithm as the meta algorithm, but there are some key differences in the design of the surrogate loss and optimism. To respect the Riemannian metric, we propose the following:

\ell_{t,i}=\left\langle\nabla f_{t}(\mathbf{x}_{t}),\textnormal{Exp}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle,\qquad m_{t,i}=\left\langle\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),\textnormal{Exp}_{\bar{\mathbf{x}}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle, \qquad (5)

where \mathbf{x}_{t} and \bar{\mathbf{x}}_{t} are Fréchet averages of the \mathbf{x}_{t,i}'s w.r.t. the combination coefficients \mathbf{w}_{t} and \mathbf{w}_{t-1}, respectively. Under the Fréchet mean, we can show

f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t,i})\leq\left\langle\mathbf{w}_{t},\boldsymbol{\ell}_{t}\right\rangle-\ell_{t,i},

which ensures Lemma 5.2 can be applied to bound the meta-regret; the geodesic mean does not satisfy this requirement. We also emphasize that the design of the surrogate loss and optimism is highly non-trivial. As we will see in the proof of Theorem 6, the combination of the surrogate loss and the gradient-vanishing property of the Fréchet mean ensures Lemma 5.2 can be invoked to upper bound the regret of the meta algorithm. However, \mathbf{m}_{t} cannot depend on \mathbf{x}_{t}; thus, we need to design the optimism on the tangent space of \bar{\mathbf{x}}_{t}, which incurs extra cost. Luckily, under Equation (5), we find a reasonable upper bound on this geometric distortion by showing

\begin{split}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}&\leq O(1)\cdot\sup_{\mathbf{x}\in\mathcal{N}_{\delta G}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2}+O(1)\cdot d(\mathbf{x}_{t},\bar{\mathbf{x}}_{t})^{2}\\ &\leq O(1)\cdot\sup_{\mathbf{x}\in\mathcal{N}_{\delta G}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2}+\tilde{O}(1)\cdot\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}.\end{split}

Thus we can apply the negative term in Lemma 5.2 to eliminate the undesired terms in \|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}.
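
To make the meta step concrete, here is an illustrative sketch (ours, not the authors' implementation) of one round of the Radarv meta algorithm: form the optimism and surrogate losses of Equation (5) via inverse exponential maps, then run the Optimistic Hedge update. The manifold primitives and the Fréchet-mean routine are passed in as callables; the demo uses the Euclidean instantiation, where np.dot stands in for the Riemannian inner product at the base point.

import numpy as np

def radarv_meta_step(expert_points, w_prev, cum_loss, grad_curr, grad_prev,
                     log_map, frechet_mean, beta):
    """One meta step of Radarv (Optimistic Hedge with the surrogates of Eq. (5)).

    expert_points  : list of expert decisions x_{t,i}
    w_prev         : weights w_{t-1} over the experts
    cum_loss       : cumulative surrogate losses sum_{s<t} ell_{s,i}
    grad_curr/prev : callables giving the Riemannian gradients of f_t and f_{t-1}
    Returns (x_t, w_t, updated cum_loss) after observing f_t.
    """
    # Optimism m_{t,i} = <grad f_{t-1}(xbar_t), Exp_{xbar_t}^{-1} x_{t,i}>.
    xbar = frechet_mean(expert_points, w_prev)
    g_bar = grad_prev(xbar)
    m = np.array([np.dot(g_bar, log_map(xbar, xi)) for xi in expert_points])

    # Optimistic Hedge: w_{t,i} proportional to exp(-beta (cum_loss_i + m_i)).
    logits = -beta * (cum_loss + m)
    w = np.exp(logits - logits.max())
    w /= w.sum()

    # Play the Frechet mean under the new weights, then observe f_t and form
    # the surrogate ell_{t,i} = <grad f_t(x_t), Exp_{x_t}^{-1} x_{t,i}>.
    x_t = frechet_mean(expert_points, w)
    g_t = grad_curr(x_t)
    ell = np.array([np.dot(g_t, log_map(x_t, xi)) for xi in expert_points])
    return x_t, w, cum_loss + ell

# Euclidean demo: log map is y - x and the Frechet mean is the weighted average.
euclid_log = lambda x, y: y - x
euclid_mean = lambda pts, w: sum(wi * p for wi, p in zip(w, pts))
pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w0, cum = np.ones(3) / 3, np.zeros(3)
g = lambda x: 2.0 * (x - np.array([0.5, 0.5]))
x_t, w1, cum = radarv_meta_step(pts, w0, cum, g, g, euclid_log, euclid_mean, beta=0.5)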

Algorithms 5 and 5 describe the expert algorithm and meta algorithm of Radarv. We show the meta-regret and total regret of Radarv in Theorems 6 and 7, respectively. Detailed proof in this section is deferred to Appendix B.

Theorem 6.

Assume all losses are L-gsc-smooth on \mathcal{M}. Then under Assumptions 1, 2, 3, the regret of Algorithm 5 satisfies:

\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\\ \leq&\frac{2+\ln N}{\beta}+3D^{2}\beta(V_{T}+G^{2})+\sum_{t=2}^{T}\left(3\beta(D^{4}L^{2}+D^{2}G^{2}\zeta^{2})-\frac{1}{4\beta}\right)\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}.\end{split}
Theorem 7.

Let \beta=\min\left\{\sqrt{\frac{2+\ln N}{3D^{2}V_{T}}},\frac{1}{\sqrt{12(D^{4}L^{2}+D^{2}G^{2}\zeta^{2})}}\right\} and \mathcal{H}=\left\{\eta_{i}=2^{i-1}\sqrt{\frac{D^{2}}{8\zeta G^{2}T}}\;\Big{|}\;i\in[N]\right\}, where N=\left\lceil\frac{1}{2}\log\frac{8\zeta\delta^{2}G^{2}T}{(1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}})^{2}}\right\rceil+1. Assume all losses are L-gsc-smooth on \mathcal{M} and improper learning is allowed. Under Assumptions 1, 2 and 3, the regret of Radarv satisfies

\text{D-Regret}_{T}=O\left(\sqrt{\zeta(V_{T}+(1+P_{T})/\delta^{2})(1+P_{T})}\right).

In Theorem 7, \beta relies on V_{T}; this dependence can be eliminated by showing a variant of Lemma 5.2 with an adaptive learning rate \beta_{t}.

6 Small-loss Bound on Manifolds

For dynamic regret, the small-loss bound replaces the dependence on T by F_{T}=\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t}), which adapts to the function values of the comparator sequence. In Euclidean space, Srebro et al. (2010) show this adaptive regret by combining OGD with the self-bounding property of smooth and non-negative functions, which reads \|\nabla f(\mathbf{x})\|_{2}^{2}\leq 4L\cdot f(\mathbf{x}), where L is the smoothness constant. We show a similar conclusion on manifolds and defer the proof details of this part to Appendix C.

Lemma 6.1.

Suppose f:\mathcal{M}\rightarrow\mathbb{R} is both L-gsc-smooth and non-negative on its domain, where \mathcal{M} is a Hadamard manifold. Then we have \|\nabla f(\mathbf{x})\|^{2}\leq 2L\cdot f(\mathbf{x}).
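
For intuition, the argument is the standard self-bounding calculation (the formal proof is in Appendix C): applying L-gsc-smoothness along the geodesic leaving \mathbf{x} with initial velocity -\nabla f(\mathbf{x})/L gives

f\left(\textnormal{Exp}_{\mathbf{x}}\left(-\tfrac{1}{L}\nabla f(\mathbf{x})\right)\right)\leq f(\mathbf{x})-\frac{1}{L}\|\nabla f(\mathbf{x})\|^{2}+\frac{1}{2L}\|\nabla f(\mathbf{x})\|^{2}=f(\mathbf{x})-\frac{\|\nabla f(\mathbf{x})\|^{2}}{2L},

and non-negativity of the left-hand side yields \|\nabla f(\mathbf{x})\|^{2}\leq 2L\cdot f(\mathbf{x}).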

To facilitate the discussion, we denote \bar{F}_{T}=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t}) and \bar{F}_{T,i}=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i}). We use R-OGD as the expert algorithm (Algorithm 4) and Hedge with the surrogate loss

\ell_{t,i}=\left\langle\nabla f_{t}(\mathbf{x}_{t}),\textnormal{Exp}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle

as the meta algorithm (Algorithm 4). The following lemma considers the regret of a single expert and shows that R-OGD achieves a small-loss dynamic regret bound on geodesic metric spaces.

Lemma 6.2.

Suppose all losses are L-gsc-smooth and non-negative on \mathcal{M}. Under Assumptions 1, 2, by choosing any step size \eta\leq\frac{1}{2\zeta L}, R-OGD achieves O\left(\frac{P_{T}}{\eta}+\eta F_{T}\right) regret.

Again, we cannot directly set \eta=O\left(\sqrt{\frac{1+P_{T}}{F_{T}}}\right) because P_{T} is unknown, which is precisely why we need the meta algorithm. The meta-regret of Hedge is as follows.

Lemma 6.3.

Suppose all losses are L-gsc-smooth and non-negative on \mathcal{M}. Under Assumptions 1, 2, by setting the learning rate of Hedge as \beta=\sqrt{\frac{2+\ln N}{D^{2}\bar{F}_{T}}}, the regret of the meta algorithm satisfies

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq 8D^{2}L(2+\ln N)+\sqrt{8D^{2}L(2+\ln N)\bar{F}_{T,i}}.

Now that we have guarantees for both the expert algorithm and the meta algorithm, a direct combination yields the following general dynamic regret guarantee.

Theorem 8.

Suppose all losses are L-gsc-smooth and non-negative on \mathcal{M}, and Assumptions 1 and 2 hold. Set \mathcal{H}=\left\{\eta_{i}=2^{i-1}\sqrt{\frac{D}{2\zeta LGT}}\;\Big{|}\;i\in[N]\right\}, where N=\lceil\frac{1}{2}\log\frac{GT}{2LD\zeta}\rceil+1, and \beta=\sqrt{\frac{2+\ln N}{D^{2}\bar{F}_{T}}}. Then for any comparator sequence \mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{N}, we have

\text{D-Regret}_{T}=O(\sqrt{\zeta(\zeta(P_{T}+1)+F_{T})(P_{T}+1)}).
Remark 6.4.

If we take \mathcal{M} to be Euclidean space, the regret guarantee shown in Theorem 8 becomes O(\sqrt{(P_{T}+F_{T}+1)(P_{T}+1)}), which matches the result of Zhao et al. (2020).

7 Best-of-both-worlds Bound on Manifolds

We have now achieved the gradient-variation bound and the small-loss bound on manifolds. To highlight the differences between them, we provide an example in Appendix D.1 showing that, under certain scenarios, one bound can be much tighter than the other. The next natural question is: is it possible to get a best-of-both-worlds bound on manifolds?

We initialize N\coloneqq N^{v}+N^{s} experts as specified in Theorems 7 and 8, where N^{v} and N^{s} are the numbers of experts running OMD and R-OGD, respectively. For each expert i\in[N], the surrogate loss and the optimism are

\ell_{t,i}=\left\langle\nabla f_{t}(\mathbf{x}_{t}),\textnormal{Exp}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle,\qquad m_{t,i}=\gamma_{t}\left\langle\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),\textnormal{Exp}_{\bar{\mathbf{x}}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle. \qquad (6)

Here \gamma_{t} controls the optimism used in the meta algorithm. When \gamma_{t}=1, the optimism for the gradient-variation bound is recovered, and \gamma_{t}=0 corresponds to the optimism for the small-loss bound.

Following Zhao et al. (2020), we use Hedge over two experts to get a best-of-both-worlds bound. The analysis therein relies on the strong convexity of \|\nabla f_{t}(\mathbf{x}_{t})-\mathbf{m}\|_{2}^{2} in \mathbf{m}, which is generally not the case on manifolds, so an alternative scheme needs to be proposed. We denote

m_{t,i}^{v}=\left\langle\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),\textnormal{Exp}_{\bar{\mathbf{x}}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle,\qquad m_{t,i}^{s}=0, \qquad (7)

and let \mathbf{m}_{t}^{v} and \mathbf{m}_{t}^{s} be the corresponding vectors. Then \mathbf{m}_{t}=\gamma_{t}\mathbf{m}_{t}^{v}+(1-\gamma_{t})\mathbf{m}_{t}^{s}, which is exactly the combination rule of Hedge. The function \|\boldsymbol{\ell}_{t}-\mathbf{m}\|_{\infty}^{2} is convex with respect to \mathbf{m} but not strongly convex, so we instead use d_{t}(\mathbf{m})\coloneqq\|\boldsymbol{\ell}_{t}-\mathbf{m}\|_{2}^{2} for Hedge, and \gamma_{t} is updated as

\gamma_{t}=\frac{\exp\left(-\tau\sum_{r=1}^{t-1}d_{r}(\mathbf{m}_{r}^{v})\right)}{\exp\left(-\tau\sum_{r=1}^{t-1}d_{r}(\mathbf{m}_{r}^{v})\right)+\exp\left(-\tau\sum_{r=1}^{t-1}d_{r}(\mathbf{m}_{r}^{s})\right)}. \qquad (8)
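
A small illustrative sketch (ours) of this two-candidate Hedge step, i.e., Equation (8) together with the blending \mathbf{m}_{t}=\gamma_{t}\mathbf{m}_{t}^{v}+(1-\gamma_{t})\mathbf{m}_{t}^{s}:

import numpy as np

def gamma_update(ell_hist, m_v_hist, tau):
    """Hedge over the two optimism candidates (Equation (8)).

    ell_hist : list of past surrogate-loss vectors ell_1, ..., ell_{t-1}
    m_v_hist : list of past gradient-variation optimism vectors m_1^v, ..., m_{t-1}^v
    tau      : learning rate of the two-expert Hedge
    Returns gamma_t.
    """
    # d_r(m) = ||ell_r - m||_2^2; the small-loss candidate is m_r^s = 0.
    loss_v = sum(np.sum((ell - mv) ** 2) for ell, mv in zip(ell_hist, m_v_hist))
    loss_s = sum(np.sum(ell ** 2) for ell in ell_hist)
    cum = np.array([loss_v, loss_s])
    z = np.exp(-tau * (cum - cum.min()))   # shift for numerical stability
    return z[0] / z.sum()

def blended_optimism(gamma_t, m_v_t):
    """m_t = gamma_t * m_t^v + (1 - gamma_t) * m_t^s with m_t^s = 0."""
    return gamma_t * m_v_t

# Tiny example with made-up two-expert surrogate vectors.
ell_hist = [np.array([0.2, -0.1]), np.array([0.0, 0.3])]
m_v_hist = [np.array([0.1, -0.1]), np.array([0.1, 0.2])]
gamma = gamma_update(ell_hist, m_v_hist, tau=0.5)
m_t = blended_optimism(gamma, np.array([0.05, 0.25]))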

Algorithm 7 summarizes the meta algorithm as well as the expert algorithm for Radarb.

Radarb: Algorithm
Input: learning rate \beta for Optimistic Hedge and \gamma_{t} for hedging the two experts, \mathcal{H}=\{\eta_{i}\} consisting of N=N^{v}+N^{s} step sizes, \tau=\frac{1}{8NG^{2}D^{2}}
for t=1,\dots,T:
  Run Algorithms 5 and 4 on the first N^{v} experts and the remaining N^{s} experts, respectively
  \bar{\mathbf{x}}_{t}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{N}_{\delta G}}\sum_{i\in[N]}w_{t-1,i}\,d(\mathbf{x},\mathbf{x}_{t,i})^{2}
  Update \gamma_{t} as in Equation (8)
  Update w_{t,i}\propto\exp\left(-\beta\left(\sum_{s=1}^{t-1}\ell_{s,i}+m_{t,i}\right)\right) by Equation (6)
  \mathbf{x}_{t}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{N}_{\delta G}}\sum_{i\in[N]}w_{t,i}\,d(\mathbf{x},\mathbf{x}_{t,i})^{2}
  Observe f_{t} and send \nabla f_{t}(\cdot) to each expert

In Theorem 9 we show the guarantee of the meta algorithm of Radarb and postpone proof details of this section to Appendix D.

Theorem 9.

Setting the learning rates \tau=\frac{1}{8NG^{2}D^{2}} and

\beta=\min\left\{\sqrt{\frac{2+\ln N}{N(D^{2}\min\{3(V_{T}+G^{2}),\bar{F}_{T}\}+8G^{2}D^{2}\ln 2)}},\frac{1}{\sqrt{12(D^{4}L^{2}+D^{2}G^{2}\zeta^{2})}}\right\}.

Suppose all losses are L-gsc-smooth and non-negative on \mathcal{M}. Under Assumptions 1, 2, 3, the regret of the meta algorithm satisfies

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})=O\left(\sqrt{\ln T\cdot\min\{V_{T},\bar{F}_{T}\}}\right),

where N=N^{v}+N^{s} and \bar{F}_{T}=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t}).

Finally, we show the regret of Radarb is bounded by the smaller of two problem-dependent bounds as follows.

Theorem 10.

Suppose all losses are L-gsc-smooth and non-negative on \mathcal{M}, and improper learning is allowed. Under Assumptions 1, 2, 3, if we set the set of candidate step sizes as

\mathcal{H}=\mathcal{H}^{v}\cup\mathcal{H}^{s}, \qquad (9)

where \mathcal{H}^{v} and \mathcal{H}^{s} are the sets of step sizes from Theorem 7 with N=N^{v} and Theorem 8 with N=N^{s}, respectively, then Algorithm 7 satisfies

\text{D-Regret}_{T}=O\left(\sqrt{\zeta(P_{T}(\zeta+1/\delta^{2})+B_{T}+1)(1+P_{T})+\ln T\cdot B_{T}}\right),

where B_{T}\coloneqq\min\{V_{T},F_{T}\}.

Remark 7.1.

Compared to the result of Zhao et al. (2020), the bound in Theorem 10 has an additional \sqrt{\ln T} factor, which comes from our construction of hedging the two experts. It would be interesting to remove this dependence.

8 Conclusion

In this paper, we consider adaptive online learning on Riemannian manifolds. Equipped with the ideas of learning with multiple step sizes and optimistic mirror descent, we derive a series of no-regret algorithms that adapt to quantities reflecting the intrinsic difficulty of the online learning problem in different respects. In the future, it would be interesting to investigate how to achieve optimistic online learning in the proper learning setting. Moving forward, one could further examine whether \Omega(\log T) gradient queries per round are truly necessary. A curvature-dependent lower bound for Riemannian online optimization, like the one in Criscitiello and Boumal (2022), also remains open.

Acknowledgments

We would like to thank three anonymous referees for constructive comments and suggestions. We gratefully thank the AI4OPT Institute for funding, as part of NSF Award 2112533. We also acknowledge the NSF for their support through Award IIS-1910077. GW would like to acknowledge an ARC-ACO fellowship provided by Georgia Tech.

References

  • Abernethy et al. (2008) Jacob Abernethy, Peter L Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 415–423, 2008.
  • Ahn and Sra (2020) Kwangjun Ahn and Suvrit Sra. From nesterov’s estimate sequence to riemannian acceleration. In Proceedings of the 33rd Annual Conference on Learning Theory, pages 84–118. PMLR, 2020.
  • Alimisis et al. (2020) Foivos Alimisis, Antonio Orvieto, Gary Bécigneul, and Aurelien Lucchi. A continuous-time perspective for modeling acceleration in riemannian optimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 1297–1307. PMLR, 2020.
  • Baby and Wang (2021) Dheeraj Baby and Yu-Xiang Wang. Optimal dynamic regret in exp-concave online learning. In Proceedings of 34th Conference on Learning Theory, pages 359–409. PMLR, 2021.
  • Bacák (2014a) Miroslav Bacák. Computing medians and means in hadamard spaces. SIAM journal on optimization, 24(3):1542–1566, 2014a.
  • Bacák (2014b) Miroslav Bacák. Convex analysis and optimization in Hadamard spaces. de Gruyter, 2014b.
  • Ballmann (2012) Werner Ballmann. Lectures on spaces of nonpositive curvature, volume 25. Birkhäuser, 2012.
  • Bécigneul and Ganea (2019) Gary Bécigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods. In 7th International Conference on Learning Representations, 2019.
  • Bento et al. (2021) GC Bento, JX Neto, and IDL Melo. Elements of convex geometry in hadamard manifolds with application to equilibrium problems. arXiv preprint arXiv:2107.02223, 2021.
  • Besbes et al. (2015) Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations research, 63(5):1227–1244, 2015.
  • Bhatia (2009) Rajendra Bhatia. Positive definite matrices. In Positive Definite Matrices. Princeton university press, 2009.
  • Bridson and Haefliger (2013) Martin R Bridson and André Haefliger. Metric spaces of non-positive curvature, volume 319. Springer Science & Business Media, 2013.
  • Cesa-Bianchi and Lugosi (2006) Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • Chiang et al. (2012) Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In Proceedings of the 25th Annual Conference on Learning Theory, pages 6.1–6.20, 2012.
  • Criscitiello and Boumal (2022) Christopher Criscitiello and Nicolas Boumal. Negative curvature obstructs acceleration for strongly geodesically convex optimization, even with exact first-order oracles. In Proceedings of the 35th Annual Conference on Learning Theory, pages 496–542. PMLR, 2022.
  • Duchi et al. (2012) John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
  • Hazan et al. (2018) Elad Hazan, Wei Hu, Yuanzhi Li, and Zhiyuan Li. Online improper learning with an approximation oracle. Advances in Neural Information Processing Systems, 31, 2018.
  • Hazan et al. (2016) Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Hosseini and Sra (2015) Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for gaussian mixtures. In Advances in Neural Information Processing Systems 28, pages 910–918, 2015.
  • Jadbabaie et al. (2015) Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. Online optimization: Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 398–406. PMLR, 2015.
  • Karimi et al. (2022) Mohammad Karimi, Ya-Ping Hsieh, Panayotis Mertikopoulos, and Andreas Krause. Riemannian stochastic approximation algorithms. arXiv preprint arXiv:2206.06795, 2022.
  • Kristály et al. (2016) Alexandru Kristály, Chong Li, Genaro López-Acedo, and Adriana Nicolae. What do ‘convexities’ imply on hadamard manifolds? Journal of Optimization Theory and Applications, 170:1068–1074, 2016.
  • Lou et al. (2020) Aaron Lou, Isay Katsman, Qingxuan Jiang, Serge Belongie, Ser-Nam Lim, and Christopher De Sa. Differentiating through the fréchet mean. In Proceeddings of the 37th International Conference on Machine Learning, pages 6393–6403. PMLR, 2020.
  • Luo and Schapire (2015) Haipeng Luo and Robert E Schapire. Achieving all with no parameters: Adanormalhedge. In Proceedings of the 28th Annual Conference on Learning Theory, pages 1286–1304. PMLR, 2015.
  • Maass et al. (2022) Alejandro I Maass, Chris Manzie, Dragan Nesic, Jonathan H Manton, and Iman Shames. Tracking and regret bounds for online zeroth-order euclidean and riemannian optimization. SIAM Journal on Optimization, 32(2):445–469, 2022.
  • Martínez-Rubio (2022) David Martínez-Rubio. Global riemannian acceleration in hyperbolic and spherical spaces. In Proceedings of the 33rd International Conference on Algorithmic Learning Theory, pages 768–826. PMLR, 2022.
  • Martínez-Rubio and Pokutta (2022) David Martínez-Rubio and Sebastian Pokutta. Accelerated riemannian optimization: Handling constraints with a prox to bound geometric penalties. arXiv preprint arXiv:2211.14645, 2022.
  • Mokhtari et al. (2016) Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 2016 IEEE 55th Conference on Decision and Control, pages 7195–7201. IEEE, 2016.
  • Petersen (2006) Peter Petersen. Riemannian geometry, volume 171. Springer, 2006.
  • Rakhlin and Sridharan (2013) Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, pages 993–1019. PMLR, 2013.
  • Sakai (1996) Takashi Sakai. Riemannian geometry, volume 149. American Mathematical Soc., 1996.
  • Sra et al. (2018) Suvrit Sra, Nisheeth K Vishnoi, and Ozan Yildiz. On geodesically convex formulations for the Brascamp–Lieb constant. In Proceedings of the 21st International Conference on Approximation Algorithms for Combinatorial Optimization Problems. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
  • Srebro et al. (2010) Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in neural information processing systems 23, 2010.
  • Sturm (2003) Karl-Theodor Sturm. Probability measures on metric spaces of nonpositive curvature. Heat Kernels and Analysis on Manifolds, Graphs, and Metric Spaces: Lecture Notes from a Quarter Program on Heat Kernels, Random Walks, and Analysis on Manifolds and Graphs, 338:357, 2003.
  • Sun et al. (2019) Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. In Advances in Neural Information Processing Systems 32, pages 7276–7286, 2019.
  • Syrgkanis et al. (2015) Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28, pages 2989–2997, 2015.
  • Udriste (2013) Constantin Udriste. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media, 2013.
  • Van Erven and Koolen (2016) Tim Van Erven and Wouter M Koolen. MetaGrad: Multiple learning rates in online learning. In Advances in Neural Information Processing Systems 29, pages 3666–3674, 2016.
  • Vishnoi (2018) Nisheeth K Vishnoi. Geodesic convex optimization: Differentiation on manifolds, geodesics, and convexity. arXiv preprint arXiv:1806.06373, 2018.
  • Wan et al. (2021) Yuanyu Wan, Bo Xue, and Lijun Zhang. Projection-free online learning in dynamic environments. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, pages 10067–10075, 2021.
  • Wang et al. (2021) Xi Wang, Zhipeng Tu, Yiguang Hong, Yingyi Wu, and Guodong Shi. No-regret online learning over Riemannian manifolds. In Advances in Neural Information Processing Systems 34, pages 28323–28335, 2021.
  • Zhang and Sra (2016) Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In The 29th Annual Conference on Learning Theory, pages 1617–1638. PMLR, 2016.
  • Zhang et al. (2017) Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems 30, pages 732–741, 2017.
  • Zhang et al. (2018) Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems, 31, 2018.
  • Zhang et al. (2022) Peiyuan Zhang, Jingzhao Zhang, and Suvrit Sra. Minimax in geodesic metric spaces: Sion’s theorem and algorithms. arXiv preprint arXiv:2202.06950, 2022.
  • Zhao and Zhang (2021) Peng Zhao and Lijun Zhang. Improved analysis for dynamic regret of strongly convex and smooth functions. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, pages 48–59, 2021.
  • Zhao et al. (2020) Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33, pages 12510–12520, 2020.
  • Zhou and Huang (2019) Li-wen Zhou and Nan-jing Huang. A revision on geodesic pseudo-convex combination and Knaster–Kuratowski–Mazurkiewicz theorem on Hadamard manifolds. Journal of Optimization Theory and Applications, 182(3):1186–1198, 2019.
  • Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.

Appendix A Omitted Proof for Section 4

A.1 Proof of Theorem 1

We denote 𝐱t+1=Exp𝐱t(ηft(𝐱t))\mathbf{x}_{t+1}^{\prime}={\textnormal{Exp}}_{\mathbf{x}_{t}}(-\eta\nabla f_{t}(\mathbf{x}_{t})) and start from the geodesic convexity:

ft(𝐱t)ft(𝐮t)(1)ft(𝐱t),Exp𝐱t1(𝐮t)=1ηExp𝐱t1𝐱t+1,Exp𝐱t1𝐮t(2)12η(Exp𝐱t1𝐮t2Exp𝐱t+11𝐮t2+ζExp𝐱t1𝐱t+12)(3)12η(Exp𝐱t1𝐮t2Exp𝐱t+11𝐮t2)+ηζG22=12η(Exp𝐱t1𝐮t2Exp𝐱t+11𝐮t+12+Exp𝐱t+11𝐮t+12Exp𝐱t+11𝐮t2)+ηζG22(4)12η(Exp𝐱t1𝐮t2Exp𝐱t+11𝐮t+12+2DExp𝐮t1𝐮t+1)+ηζG22,\begin{split}f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})\stackrel{{\scriptstyle\rm(1)}}{{\leq}}&\langle\nabla f_{t}(\mathbf{x}_{t}),-{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}(\mathbf{u}_{t})\rangle\\ =&\frac{1}{\eta}\langle{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t+1}^{\prime},{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{u}_{t}\rangle\\ \stackrel{{\scriptstyle\rm(2)}}{{\leq}}&\frac{1}{2\eta}\left(\|{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{u}_{t}\|^{2}-\|{\textnormal{Exp}}_{\mathbf{x}_{t+1}^{\prime}}^{-1}\mathbf{u}_{t}\|^{2}+{\zeta}\|{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t+1}^{\prime}\|^{2}\right)\\ \stackrel{{\scriptstyle\rm(3)}}{{\leq}}&\frac{1}{2\eta}\left(\|{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{u}_{t}\|^{2}-\|{\textnormal{Exp}}_{\mathbf{x}_{t+1}}^{-1}\mathbf{u}_{t}\|^{2}\right)+\frac{\eta{\zeta}G^{2}}{2}\\ =&\frac{1}{2\eta}\left(\|{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{u}_{t}\|^{2}-\|{\textnormal{Exp}}_{\mathbf{x}_{t+1}}^{-1}\mathbf{u}_{t+1}\|^{2}+\|{\textnormal{Exp}}_{\mathbf{x}_{t+1}}^{-1}\mathbf{u}_{t+1}\|^{2}-\|{\textnormal{Exp}}_{\mathbf{x}_{t+1}}^{-1}\mathbf{u}_{t}\|^{2}\right)+\frac{\eta{\zeta}G^{2}}{2}\\ \stackrel{{\scriptstyle\rm(4)}}{{\leq}}&\frac{1}{2\eta}\left(\|{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{u}_{t}\|^{2}-\|{\textnormal{Exp}}_{\mathbf{x}_{t+1}}^{-1}\mathbf{u}_{t+1}\|^{2}+2D\|{\textnormal{Exp}}_{\mathbf{u}_{t}}^{-1}\mathbf{u}_{t+1}\|\right)+\frac{\eta{\zeta}G^{2}}{2},\end{split} (10)

where for the second inequality we use Lemma E.5, while the third is due to Lemma E.7 and Assumption 3. For the last inequality, we invoke the triangle inequality and Assumption 2.

WLOG, we can assume 𝐮T+1=𝐮T\mathbf{u}_{T+1}=\mathbf{u}_{T} and sum from t=1t=1 to TT:

t=1Tft(𝐱t)ft(𝐮t)D22η+DPTη+ηζG2T2.\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})\leq\frac{D^{2}}{2\eta}+\frac{DP_{T}}{\eta}+\frac{\eta{\zeta}G^{2}T}{2}. (11)

A.2 Proof of Lemma 4.1

This is a generalization of (Cesa-Bianchi and Lugosi, 2006, Theorem 2.2) to Riemannian manifolds. Let L_{t,i}=\sum_{s=1}^{t}f_{s}(\mathbf{x}_{s,i}) and W_{t}=\sum_{i=1}^{N}w_{1,i}e^{-\beta L_{t,i}}. We have the following lower bound for \ln W_{T}:

ln(WT)=ln(i[N]w1,ieβLt,i)βmini[N](LT,i+1βln1w1,i).\ln(W_{T})=\ln\left(\sum_{i\in[N]}w_{1,i}e^{-\beta L_{t,i}}\right)\geq-\beta\min_{i\in[N]}\left(L_{T,i}+\frac{1}{\beta}\ln\frac{1}{w_{1,i}}\right).

For the next step, we try to get an upper bound on lnWT\ln W_{T}. When t2t\geq 2, we have

ln(WtWt1)=lni[N]w1,ieβLt1,ieβft(𝐱t,i)i[N]w1,ieβLt1,i=ln(i[N]wt,ieβft(𝐱t,i)),\ln\left(\frac{W_{t}}{W_{t-1}}\right)=\ln\frac{\sum_{i\in[N]}w_{1,i}e^{-\beta L_{t-1,i}}e^{-\beta f_{t}(\mathbf{x}_{t,i})}}{\sum_{i\in[N]}w_{1,i}e^{-\beta L_{t-1,i}}}=\ln\left(\sum_{i\in[N]}w_{t,i}e^{-\beta f_{t}(\mathbf{x}_{t,i})}\right),

where the updating rule of Hedge

wt,i=w1,ieβLt1,ij[N]w1,jeβLt1,jw_{t,i}=\frac{w_{1,i}e^{-\beta L_{t-1,i}}}{\sum_{j\in[N]}w_{1,j}e^{-\beta L_{t-1,j}}}

is applied. Therefore

lnWT=lnW1+t=2Tln(WtWt1)=t=1Tln(i[N]wt,ieβft(𝐱t,i))t=1T(βi[N]wt,ift(𝐱t,i)+β2G2D28)t=1T(βft(𝐱t)+β2G2D28),\begin{split}&\ln W_{T}=\ln W_{1}+\sum_{t=2}^{T}\ln\left(\frac{W_{t}}{W_{t-1}}\right)=\sum_{t=1}^{T}\ln\left(\sum_{i\in[N]}w_{t,i}e^{-\beta f_{t}(\mathbf{x}_{t,i})}\right)\\ \leq&\sum_{t=1}^{T}\left(-\beta\sum_{i\in[N]}w_{t,i}f_{t}(\mathbf{x}_{t,i})+\frac{\beta^{2}G^{2}D^{2}}{8}\right)\leq\sum_{t=1}^{T}\left(-\beta f_{t}(\mathbf{x}_{t})+\frac{\beta^{2}G^{2}D^{2}}{8}\right),\end{split} (12)

where the first inequality follows from Hoeffding’s inequality together with the fact that f_{t}(\mathbf{x}^{\star})\leq f_{t}(\mathbf{x})\leq f_{t}(\mathbf{x}^{\star})+G\cdot D holds for any \mathbf{x}\in\mathcal{N} with \mathbf{x}^{\star}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{N}}f_{t}(\mathbf{x}), and the second inequality holds because both the Fréchet mean and the geodesic mean satisfy Jensen’s inequality: for the Fréchet mean we apply Lemma E.8, while Lemmas E.1 and E.11 ensure that the geodesic mean satisfies the requirement.

Combining the lower and upper bounds on \ln W_{T}, we see that

βmini[N](LT,i+1βln1w1,i)t=1T(βft(𝐱t)+β2G2D28).-\beta\min_{i\in[N]}\left(L_{T,i}+\frac{1}{\beta}\ln\frac{1}{w_{1,i}}\right)\leq\sum_{t=1}^{T}\left(-\beta f_{t}(\mathbf{x}_{t})+\frac{\beta^{2}G^{2}D^{2}}{8}\right).

After simplifying, we get

t=1Tft(𝐱t)mini[N](t=1Tft(𝐱t,i)+1βln1w1,i)βG2D2T8.\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\min_{i\in[N]}\left(\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})+\frac{1}{\beta}\ln\frac{1}{w_{1,i}}\right)\leq\frac{\beta G^{2}D^{2}T}{8}.

Setting β=8G2D2T\beta=\sqrt{\frac{8}{G^{2}D^{2}T}}, we have

t=1Tft(𝐱t)t=1Tft(𝐱t,i)G2D2T8(1+ln1w1,i).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq\sqrt{\frac{G^{2}D^{2}T}{8}}\left(1+\ln\frac{1}{w_{1,i}}\right).
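For concreteness, the Hedge update analyzed above is just a softmax over cumulative losses. Below is a minimal NumPy sketch (ours, not code from the paper; the helper name, the uniform prior, and the toy losses in [0,GD] are our illustrative choices). It maintains w_{t,i}\propto w_{1,i}e^{-\beta L_{t-1,i}} with the tuned \beta=\sqrt{8/(G^{2}D^{2}T)} and compares the realized gap with the bound above.

```python
import numpy as np

def hedge_weights(initial_weights, cumulative_losses, beta):
    # w_{t,i} proportional to w_{1,i} * exp(-beta * L_{t-1,i}); subtract the max for stability
    logits = np.log(initial_weights) - beta * cumulative_losses
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
N, T, G, D = 5, 1000, 1.0, 2.0
beta = np.sqrt(8.0 / (G**2 * D**2 * T))
w1 = np.full(N, 1.0 / N)
cum_losses = np.zeros(N)
meta_loss = 0.0
for t in range(T):
    w = hedge_weights(w1, cum_losses, beta)
    expert_losses = rng.uniform(0.0, G * D, size=N)   # stand-in for f_t(x_{t,i})
    meta_loss += w @ expert_losses                     # stand-in for f_t(x_t), via Jensen's inequality
    cum_losses += expert_losses
# realized gap versus the bound sqrt(G^2 D^2 T / 8) * (1 + ln N) with the uniform prior
print(meta_loss - cum_losses.min(), np.sqrt(G**2 * D**2 * T / 8) * (1 + np.log(N)))
```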

A.3 Proof of Theorem 2

Each expert runs R-OGD with its own step size, so by Theorem 1 we have

t=1Tft(𝐱t,i)ft(𝐮t)D2+2DPT2η+ηζG2T2.\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})-f_{t}(\mathbf{u}_{t})\leq\frac{D^{2}+2DP_{T}}{2\eta}+\frac{\eta{\zeta}G^{2}T}{2}. (13)

holds with \eta=\eta_{i} for the i-th expert, for any i\in[N]. Now it suffices to verify that there always exists \eta_{k}\in\mathcal{H} which is close to the optimal step size

η=D2+2DPTζG2T.\eta^{\star}=\sqrt{\frac{D^{2}+2DP_{T}}{{\zeta}G^{2}T}}. (14)

By Assumption 2,

0PT=t=2Td(𝐮t1,𝐮t)TD.0\leq P_{T}=\sum_{t=2}^{T}d(\mathbf{u}_{t-1},\mathbf{u}_{t})\leq TD.

Thus

D2TG2ζηD2+2TD2TG2ζ\sqrt{\frac{D^{2}}{TG^{2}{\zeta}}}\leq\eta^{*}\leq\sqrt{\frac{D^{2}+2TD^{2}}{TG^{2}{\zeta}}}

It is obvious that

min=D2TG2ζ, and max2D2+2TD2TG2ζ\min\mathcal{H}=\sqrt{\frac{D^{2}}{TG^{2}{\zeta}}},\text{ and }\max\mathcal{H}\geq 2\sqrt{\frac{D^{2}+2TD^{2}}{TG^{2}{\zeta}}}

Therefore, there exists k[N1]k\in[N-1] such that

ηk=2k1D2TG2ζη2ηk\eta_{k}=2^{k-1}\sqrt{\frac{D^{2}}{TG^{2}{\zeta}}}\leq\eta^{*}\leq 2\eta_{k} (15)

The dynamic regret of the kk-th expert is

t=1Tft(𝐱t,k)t=1Tft(𝐮t)(1)D22ηk+DPTηk+(ηkTG2ζ2)(2)D2η+2DPTη+(ηTG2ζ2)=32TG2ζ(D2+2DPT).\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})-\sum_{t=1}^{T}f_{t}\left(\mathbf{u}_{t}\right)\\ \stackrel{{\scriptstyle\rm(1)}}{{\leq}}&\frac{D^{2}}{2\eta_{k}}+\frac{DP_{T}}{\eta_{k}}+\left(\frac{\eta_{k}TG^{2}{\zeta}}{2}\right)\\ \stackrel{{\scriptstyle\rm(2)}}{{\leq}}&\frac{D^{2}}{\eta^{*}}+\frac{2DP_{T}}{\eta^{*}}+\left(\frac{\eta^{*}TG^{2}{\zeta}}{2}\right)\\ {=}&\frac{3}{2}\sqrt{TG^{2}{\zeta}\left(D^{2}+2DP_{T}\right)}.\end{split} (16)

The first inequality is Equation (13) with \eta=\eta_{k}, the second follows from Equation (15), and we use Equation (14) to derive the last equality.

Since the initial weight of the kk-th expert satisfies

w1,k=N+1k(k+1)N1(k+1)2,w_{1,k}=\frac{N+1}{k(k+1)N}\geq\frac{1}{(k+1)^{2}},

the regret of the meta algorithm with respect to the kk-th expert is bounded by

t=1Tft(𝐱t)t=1Tft(𝐱t,k)G2D2T8(1+2ln(k+1))\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})\leq\sqrt{\frac{G^{2}D^{2}T}{8}}\left(1+2\ln(k+1)\right) (17)

in view of Lemma 4.1. Combining Equations (16) and (17), we have

t=1Tft(𝐱t)t=1Tft(𝐮t)32TG2ζ(D2+2DPT)+G2D2T8(1+2ln(k+1))2(94TG2ζ(D2+2DPT)+G2D2T8(1+2ln(k+1))2)=O(ζ(1+PT)T),\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ \leq&\frac{3}{2}\sqrt{TG^{2}{\zeta}\left(D^{2}+2DP_{T}\right)}+\sqrt{\frac{G^{2}D^{2}T}{8}}\left(1+2\ln(k+1)\right)\\ \leq&\sqrt{2\left(\frac{9}{4}TG^{2}{\zeta}\left(D^{2}+2DP_{T}\right)+\frac{G^{2}D^{2}T}{8}\left(1+2\ln(k+1)\right)^{2}\right)}\\ =&O\left(\sqrt{{\zeta}(1+P_{T})T}\right),\end{split} (18)

where \sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)} is applied to derive the second inequality, and the last equality follows from \ln k=O(\log\log P_{T})=o(\sqrt{P_{T}}).
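The covering argument above is easy to make concrete. The sketch below (our own illustration; the constants and the number of experts N are toy values chosen to satisfy the conditions on \mathcal{H}) builds the geometric grid \eta_{k}=2^{k-1}\sqrt{D^{2}/(\zeta G^{2}T)} and checks that, for every path length P_{T}\in[0,TD], the optimal step size of Equation (14) is sandwiched as \eta_{k}\leq\eta^{\star}\leq 2\eta_{k} for some grid point.

```python
import math
import numpy as np

D, G, zeta, T = 2.0, 1.0, 1.5, 10_000                  # toy problem constants (ours)
eta_min = math.sqrt(D**2 / (zeta * G**2 * T))
# choose N so that max(H) = 2**(N-1) * eta_min >= 2 * sqrt((1 + 2T) D^2 / (zeta G^2 T))
N = 2 + math.ceil(0.5 * math.log2(1 + 2 * T))
grid = eta_min * 2.0 ** np.arange(N)                    # H = {2^{k-1} eta_min : k = 1, ..., N}

for P_T in np.linspace(0.0, T * D, num=50):
    eta_star = math.sqrt((D**2 + 2 * D * P_T) / (zeta * G**2 * T))   # Equation (14)
    k = np.searchsorted(grid, eta_star, side="right") - 1            # largest eta_k <= eta_star
    assert grid[k] <= eta_star <= 2 * grid[k]
print("every eta* is within a factor 2 of some grid point")
```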

A.4 Minimax Dynamic Regret Lower Bound on Hadamard Manifolds

In this part, we first establish a Ω(T)\Omega(\sqrt{T}) minimax static regret lower bound on Hadamard manifolds following the classical work of Abernethy et al. (2008), then follow the reduction in Zhang et al. (2018) to get its dynamic counterpart. We focus on the manifold of SPD matrices (Bhatia, 2009) 𝒮++n={p:pn×n,p=pTand p0n×n}\mathcal{S}^{n}_{++}=\{p:p\in\mathbb{R}^{n\times n},p=p^{T}\text{and }p\succ 0^{n\times n}\}, which becomes Hadamard when equipped with the affine-invariant metric U,Vp=tr(p1Up1V)\left\langle U,V\right\rangle_{p}=tr(p^{-1}Up^{-1}V) where tr()tr(\cdot) is the trace operator. The tangent space of 𝒮++n\mathcal{S}^{n}_{++} is 𝒮n={U:Un×n,U=UT}\mathcal{S}^{n}=\{U:U\in\mathbb{R}^{n\times n},U=U^{T}\}. Under the affine-invariant metric, we have

ExppU=p12exp(p12Up12)p12Expp1q=p12log(p12qp12)p12d(p,q)=i=1n(logλi(q12pq12))2.\begin{split}&{\textnormal{Exp}}_{p}U=p^{\frac{1}{2}}\exp\left(p^{-\frac{1}{2}}Up^{-\frac{1}{2}}\right)p^{\frac{1}{2}}\\ &\textnormal{Exp}^{-1}_{p}q=p^{\frac{1}{2}}\log\left(p^{-\frac{1}{2}}qp^{-\frac{1}{2}}\right)p^{\frac{1}{2}}\\ &d(p,q)=\sqrt{\sum_{i=1}^{n}(\log\lambda_{i}(q^{-\frac{1}{2}}pq^{-\frac{1}{2}}))^{2}}.\end{split}
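These maps are straightforward to implement. Below is a minimal NumPy/SciPy sketch (the function names are ours) of Exp, Exp^{-1}, and d under the affine-invariant metric; it also checks numerically that {\textnormal{Exp}}_{p}({\textnormal{Exp}}_{p}^{-1}q)=q and that d(p,q)=\|{\textnormal{Exp}}_{p}^{-1}q\|_{p}, where \|U\|_{p}^{2}=tr(p^{-1}Up^{-1}U).

```python
import numpy as np
from scipy.linalg import expm, logm, fractional_matrix_power as powm

def exp_map(p, U):        # Exp_p U = p^{1/2} exp(p^{-1/2} U p^{-1/2}) p^{1/2}
    ph, pmh = powm(p, 0.5), powm(p, -0.5)
    return ph @ expm(pmh @ U @ pmh) @ ph

def log_map(p, q):        # Exp_p^{-1} q = p^{1/2} log(p^{-1/2} q p^{-1/2}) p^{1/2}
    ph, pmh = powm(p, 0.5), powm(p, -0.5)
    return ph @ logm(pmh @ q @ pmh) @ ph

def dist(p, q):           # d(p, q) = sqrt(sum_i log^2 lambda_i(q^{-1/2} p q^{-1/2}))
    qmh = powm(q, -0.5)
    lam = np.linalg.eigvalsh(qmh @ p @ qmh)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def norm_at(p, U):        # ||U||_p = sqrt(tr(p^{-1} U p^{-1} U))
    pinv = np.linalg.inv(p)
    return np.sqrt(np.trace(pinv @ U @ pinv @ U))

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
p, q = A @ A.T + 3 * np.eye(3), B @ B.T + 3 * np.eye(3)    # two SPD matrices
assert np.allclose(exp_map(p, log_map(p, q)), q, atol=1e-6)
assert np.isclose(dist(p, q), norm_at(p, log_map(p, q)), atol=1e-6)
```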

For technical reasons, we restrict to the submanifold of diagonal matrices with positive diagonal entries \mathcal{D}_{++}^{n}=\{p:p\in\mathbb{R}^{n\times n},p_{i,i}>0,p\text{ is diagonal}\}, whose tangent space is \mathcal{D}^{n}=\{U:U\in\mathbb{R}^{n\times n},U\text{ is diagonal}\}. A key component of the proof is that the Busemann function (Ballmann, 2012) on \mathcal{D}_{++}^{n} equipped with the affine-invariant metric has a closed form, which we describe as follows.

Definition 11.

(Ballmann, 2012) Suppose \mathcal{M} is a Hadamard manifold and c:[0,\infty)\rightarrow\mathcal{M} is a geodesic ray on \mathcal{M} with \|\dot{c}(0)\|=1. Then the Busemann function associated with c is defined as

bc(p)=limt(d(p,c(t))t).b_{c}(p)=\lim_{t\rightarrow\infty}(d(p,c(t))-t).

Busemann functions enjoy the following useful properties.

Lemma A.1.

(Ballmann, 2012) A Busemann function bcb_{c} satisfies

  1. 1)

    bcb_{c} is gsc-convex;

  2. 2)

    bc(c(t))=c˙(t)\nabla b_{c}(c(t))=\dot{c}(t) for any t[0,)t\in[0,\infty);

  3. 3)

    bc(p)1\|\nabla b_{c}(p)\|\leq 1 for every pp\in\mathcal{M}.

Lemma A.2.

(Bridson and Haefliger, 2013, Chapter II.10) On 𝒟++n\mathcal{D}_{++}^{n}, suppose c(t)=ExpI(tX)c(t)={\textnormal{Exp}}_{I}(tX), p=ExpI(Y)p={\textnormal{Exp}}_{I}(Y) and X=1\|X\|=1, then the Busemann function is

bc(p)=tr(XY).b_{c}(p)=-\text{tr}(XY).
Proof A.3.

We first compute

d(p,c(t))2=d(ExpI(Y),ExpI(tX))2=d(eY,etX)2=i=1n(logλi(etX+Y))2=d(I,eYtX)2=d(I,ExpI(YtX))2=YtX2=tr((YtX)(YtX))=tr(Y2)2ttr(XY)+t2tr(X2),\begin{split}d(p,c(t))^{2}=&d({\textnormal{Exp}}_{I}(Y),{\textnormal{Exp}}_{I}(tX))^{2}\\ =&d(e^{Y},e^{tX})^{2}\\ =&\sum_{i=1}^{n}(\log\lambda_{i}(e^{-tX+Y}))^{2}\\ =&d(I,e^{Y-tX})^{2}\\ =&d(I,{\textnormal{Exp}}_{I}(Y-tX))^{2}\\ =&\|Y-tX\|^{2}\\ =&tr((Y-tX)(Y-tX))\\ =&tr(Y^{2})-2t\cdot tr(XY)+t^{2}\cdot tr(X^{2}),\end{split} (19)

where we use the facts that {\textnormal{Exp}}_{I}(X)=e^{X} and d(p,q)=\sqrt{\sum_{i=1}^{n}(\log\lambda_{i}(q^{-\frac{1}{2}}pq^{-\frac{1}{2}}))^{2}}. Meanwhile,

limtd(p,c(t))t=1\lim_{t\rightarrow\infty}\frac{d(p,c(t))}{t}=1

due to the triangle inequality on (p,c(0),c(t))\triangle(p,c(0),c(t)). Therefore,

bc(p)=limt(d(p,c(t))t)=limtd(p,c(t))2t22t=tr(XY).\begin{split}b_{c}(p)=&\lim_{t\rightarrow\infty}(d(p,c(t))-t)\\ =&\lim_{t\rightarrow\infty}\frac{d(p,c(t))^{2}-t^{2}}{2t}\\ =&-tr(XY).\end{split}
Remark A.4.

We consider \mathcal{D}_{++}^{n} to ensure that X and Y commute, so that e^{Y} and e^{tX} also commute. This is what allows the Busemann function to take the closed form above.
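As a numerical sanity check on Lemma A.2, one can evaluate d(p,c(t))-t for increasingly large t on \mathcal{D}_{++}^{n} and compare it with -\text{tr}(XY). A small sketch (ours), working directly with the diagonal entries, for which the affine-invariant distance reduces to the Euclidean distance between entrywise logarithms:

```python
import numpy as np

def dist_diag(p_diag, q_diag):        # affine-invariant distance restricted to D_{++}^n
    return np.sqrt(np.sum(np.log(p_diag / q_diag) ** 2))

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal(n)
X /= np.linalg.norm(X)                # unit-speed direction, so c(t) = Exp_I(tX) = diag(e^{tX})
Y = rng.standard_normal(n)            # p = Exp_I(Y) = diag(e^{Y})
p = np.exp(Y)
for t in [10.0, 100.0, 1000.0]:
    gap = dist_diag(p, np.exp(t * X)) - t
    print(t, gap, -np.dot(X, Y))      # the gap converges to -tr(XY) as t grows
```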

Now we describe the minimax game on \mathcal{D}_{++}^{n}. In each round t, the player chooses p_{t} from \mathcal{N}=\{p:d(p,I)\leq\frac{D}{2}\}=\{p:p={\textnormal{Exp}}_{I}(Y),\|Y\|\leq\frac{D}{2}\}, and the adversary is allowed to pick a geodesic ray c^{t}, which determines a loss function in

t={αtGtbc(p):c˙(0)=1,αt[0,1]}={αtGttr(XtY):Xt=1,αt[0,1]}={tr(XtY):XtGt}\begin{split}\mathcal{F}_{t}=&\{\alpha_{t}G_{t}b_{c}(p):\|\dot{c}(0)\|=1,\alpha_{t}\in[0,1]\}\\ =&\{\alpha_{t}G_{t}\cdot-tr(X_{t}Y):\|X_{t}\|=1,\alpha_{t}\in[0,1]\}\\ =&\{-tr(X_{t}Y):\|X_{t}\|\leq G_{t}\}\\ \end{split} (20)

The domain is gsc-convex by Lemmas E.13 and E.14. Each loss function is gsc-convex, and its gradient norm is upper bounded by G_{t} by the third item of Lemma A.1. WLOG, we assume D=2, and the value of the game is

𝒱T(𝒩,{t})infY11supX1G1infYT1supXTGT[t=1Ttr(XtYt)infY1t=1Ttr(XtY)]\mathcal{V}_{T}(\mathcal{N},\{\mathcal{F}_{t}\})\coloneqq\inf_{\|Y_{1}\|\leq 1}\sup_{\|X_{1}\|\leq G_{1}}\dots\inf_{\|Y_{T}\|\leq 1}\sup_{\|X_{T}\|\leq G_{T}}\left[\sum_{t=1}^{T}-tr(X_{t}Y_{t})-\inf_{\|Y\|\leq 1}\sum_{t=1}^{T}-tr(X_{t}Y)\right]
Lemma A.5.

The value of the minimax game 𝒱T\mathcal{V}_{T} can be written as

𝒱T(𝒩,{t})=infY11supX1G1infYT1supXTGT[t=1Ttr(XtYt)+t=1TXt]\mathcal{V}_{T}(\mathcal{N},\{\mathcal{F}_{t}\})=\inf_{\|Y_{1}\|\leq 1}\sup_{\|X_{1}\|\leq G_{1}}\dots\inf_{\|Y_{T}\|\leq 1}\sup_{\|X_{T}\|\leq G_{T}}\left[\sum_{t=1}^{T}-tr(X_{t}Y_{t})+\left\|\sum_{t=1}^{T}X_{t}\right\|\right]
Proof A.6.

This follows from the Cauchy–Schwarz inequality:

infY1t=1Ttr(XtY)=supY1tr(t=1TXtY)=t=1TXt.\inf_{\|Y\|\leq 1}\sum_{t=1}^{T}-tr(X_{t}Y)=-\sup_{\|Y\|\leq 1}tr\left(\sum_{t=1}^{T}X_{t}Y\right)=-\left\|\sum_{t=1}^{T}X_{t}\right\|.
Lemma A.7.

For n>2, the adversary can guarantee a payoff of at least \sqrt{\sum_{t=1}^{T}G_{t}^{2}} regardless of the player’s strategy.

Proof A.8.

Each round, after the player chooses YtY_{t}, the adversary chooses XtX_{t} such that Xt=Gt\|X_{t}\|=G_{t}, Xt,Yt=0\left\langle X_{t},Y_{t}\right\rangle=0 and Xt,s=1t1Xs=0\left\langle X_{t},\sum_{s=1}^{t-1}X_{s}\right\rangle=0. This is always possible when n>2n>2. Under this strategy, t=1Ttr(XtYt)=0\sum_{t=1}^{T}-tr(X_{t}Y_{t})=0 and we can show t=1TXt=t=1TGt2\left\|\sum_{t=1}^{T}X_{t}\right\|=\sqrt{\sum_{t=1}^{T}G_{t}^{2}} by induction. The case for T=1T=1 is obvious. Assume s=1t1Xs=s=1t1Gs2\left\|\sum_{s=1}^{t-1}X_{s}\right\|=\sqrt{\sum_{s=1}^{t-1}G_{s}^{2}}, then

\left\|\sum_{s=1}^{t}X_{s}\right\|=\left\|\sum_{s=1}^{t-1}X_{s}+X_{t}\right\|=\sqrt{\left\|\sum_{s=1}^{t-1}X_{s}\right\|^{2}+\|X_{t}\|^{2}}=\sqrt{\sum_{s=1}^{t}G_{s}^{2}},

where the second equality is due to \left\langle X_{t},\sum_{s=1}^{t-1}X_{s}\right\rangle=0, and the last equality uses the induction hypothesis together with \|X_{t}\|=G_{t}.
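The adversary of Lemma A.7 is constructive and easy to simulate: in each round, draw a candidate direction, remove its components along Y_{t} and along \sum_{s<t}X_{s}, and rescale it to norm G_{t}. The sketch below (ours; it works with the diagonal entries as vectors in \mathbb{R}^{n}, n>2) verifies the Pythagorean identity \|\sum_{t}X_{t}\|=\sqrt{\sum_{t}G_{t}^{2}}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 5, 200                                      # n > 2, as the lemma requires
G = rng.uniform(0.5, 2.0, size=T)                  # per-round gradient bounds G_t
running_sum = np.zeros(n)
for t in range(T):
    Y_t = rng.standard_normal(n)                   # an arbitrary play of the learner
    # Orthonormal basis of span{Y_t, sum_{s<t} X_s} via Gram-Schmidt.
    basis = []
    for v in (Y_t, running_sum):
        w = v.copy()
        for b in basis:
            w -= (w @ b) * b
        if np.linalg.norm(w) > 1e-10:
            basis.append(w / np.linalg.norm(w))
    # Draw X_t orthogonal to that span and rescale to norm G_t.
    X_t = rng.standard_normal(n)
    for b in basis:
        X_t -= (X_t @ b) * b
    X_t *= G[t] / np.linalg.norm(X_t)
    running_sum += X_t
print(np.linalg.norm(running_sum), np.sqrt(np.sum(G ** 2)))   # the two numbers coincide
```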

Lemma A.9.

Let X0=0X_{0}=0. If the player plays

Yt=s=1t1Xss=1t1Xs2+s=tTGs2,Y_{t}=\frac{\sum_{s=1}^{t-1}X_{s}}{\sqrt{\left\|\sum_{s=1}^{t-1}X_{s}\right\|^{2}+\sum_{s=t}^{T}G_{s}^{2}}},

then

supX1G1supX2G2supXTGT[t=1Ttr(XtYt)+t=1TXt]t=1TGt2.\sup_{\|X_{1}\|\leq G_{1}}\sup_{\|X_{2}\|\leq G_{2}}\dots\sup_{\|X_{T}\|\leq G_{T}}\left[\sum_{t=1}^{T}-tr(X_{t}Y_{t})+\left\|\sum_{t=1}^{T}X_{t}\right\|\right]\leq\sqrt{\sum_{t=1}^{T}G_{t}^{2}}.
Proof A.10.

Let Γt2=s=tTGs2\Gamma_{t}^{2}=\sum_{s=t}^{T}G_{s}^{2}, ΓT+1=0\Gamma_{T+1}=0 and X~t=s=1tXs\tilde{X}_{t}=\sum_{s=1}^{t}X_{s}. We define

Φt(X1,,Xt1)=s=1t1tr(XsYs)+s=1t1Xs2+Γt2\Phi_{t}(X_{1},\dots,X_{t-1})=\sum_{s=1}^{t-1}-tr(X_{s}Y_{s})+\sqrt{\left\|\sum_{s=1}^{t-1}X_{s}\right\|^{2}+\Gamma_{t}^{2}}

and Φ1=t=1TGt2\Phi_{1}=\sqrt{\sum_{t=1}^{T}G_{t}^{2}}. We further let

Ψt(X1,,Xt1)=supXtGtsupXTGT[s=1Ttr(XsYs)+s=1TXs]\Psi_{t}(X_{1},\dots,X_{t-1})=\sup_{\|X_{t}\|\leq G_{t}}\dots\sup_{\|X_{T}\|\leq G_{T}}\left[\sum_{s=1}^{T}-tr(X_{s}Y_{s})+\left\|\sum_{s=1}^{T}X_{s}\right\|\right]

be the payoff of the adversary when he plays X1,,Xt1X_{1},\dots,X_{t-1} and then plays optimally.

We prove by backward induction that, for all t\in\{1,\dots,T+1\},

Ψt(X1,,Xt1)Φt(X1,,Xt1).\Psi_{t}(X_{1},\dots,X_{t-1})\leq\Phi_{t}(X_{1},\dots,X_{t-1}).

The case t=T+1 is immediate because \Psi_{T+1}=\Phi_{T+1}. Assume the claim holds for t+1; we now show it for t.

Ψt(X1,,Xt1)=supXtGtΨt+1(X1,,Xt)supXtGtΦt+1(X1,,Xt)=s=1t1tr(XsYs)+supXtGt[tr(XtYt)+s=1tXs2+Γt+12].\begin{split}&\Psi_{t}(X_{1},\dots,X_{t-1})\\ =&\sup_{\|X_{t}\|\leq G_{t}}\Psi_{t+1}(X_{1},\dots,X_{t})\\ \leq&\sup_{\|X_{t}\|\leq G_{t}}\Phi_{t+1}(X_{1},\dots,X_{t})\\ =&\sum_{s=1}^{t-1}-tr(X_{s}Y_{s})+\sup_{\|X_{t}\|\leq G_{t}}\left[-tr(X_{t}Y_{t})+\sqrt{\left\|\sum_{s=1}^{t}X_{s}\right\|^{2}+\Gamma_{t+1}^{2}}\right].\end{split} (21)

Now it suffices to show

supXtGt[tr(XtYt)+s=1tXs2+Γt+12]s=1t1Xs2+Γt2\sup_{\|X_{t}\|\leq G_{t}}\left[-tr(X_{t}Y_{t})+\sqrt{\left\|\sum_{s=1}^{t}X_{s}\right\|^{2}+\Gamma_{t+1}^{2}}\right]\leq\sqrt{\left\|\sum_{s=1}^{t-1}X_{s}\right\|^{2}+\Gamma_{t}^{2}}

to establish our argument. Recall that

Yt=s=1t1Xss=1t1Xs2+s=tTGs2,Y_{t}=\frac{\sum_{s=1}^{t-1}X_{s}}{\sqrt{\left\|\sum_{s=1}^{t-1}X_{s}\right\|^{2}+\sum_{s=t}^{T}G_{s}^{2}}},

and denote X~t1=s=1t1Xs\tilde{X}_{t-1}=\sum_{s=1}^{t-1}X_{s}. It turns out that what we need to show is

supXtGttr(Xt,X~t1X~t12+Γt2)+X~t1+Xt2+Γt+12X~t12+Γt2.\sup_{\|X_{t}\|\leq G_{t}}-tr\left(\frac{\left\langle X_{t},\tilde{X}_{t-1}\right\rangle}{\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}}\right)+\sqrt{\|\tilde{X}_{t-1}+X_{t}\|^{2}+\Gamma_{t+1}^{2}}\leq\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}.

We use the Lagrange multiplier method to prove this argument. Let

g(Xt)=tr(Xt,X~t1X~t12+Γt2)+X~t1+Xt2+Γt+12+λ(Xt2Gt2),g(X_{t})=-tr\left(\frac{\left\langle X_{t},\tilde{X}_{t-1}\right\rangle}{\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}}\right)+\sqrt{\|\tilde{X}_{t-1}+X_{t}\|^{2}+\Gamma_{t+1}^{2}}+\lambda(\|X_{t}\|^{2}-G_{t}^{2}),

then the stationary point of gg satisfies

g(Xt)Xt=X~t1X~t12+Γt2+X~t1+XtX~t1+Xt2+Γt+12+2λXt=0\frac{\partial g(X_{t})}{\partial X_{t}}=-\frac{\tilde{X}_{t-1}}{\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}}+\frac{\tilde{X}_{t-1}+X_{t}}{\sqrt{\|\tilde{X}_{t-1}+X_{t}\|^{2}+\Gamma_{t+1}^{2}}}+2\lambda X_{t}=0

and

λ(Xt2Gt2)=0.\lambda(\|X_{t}\|^{2}-G_{t}^{2})=0.

We first consider the case where X_{t} is collinear with \tilde{X}_{t-1}. When \lambda=0, we have X_{t}=c\tilde{X}_{t-1} where

c=Γt+1Γt1.c=\frac{\Gamma_{t+1}}{\Gamma_{t}}-1.

If X~t1\tilde{X}_{t-1} is co-linear with XtX_{t} and λ0\lambda\neq 0, we know Xt=Gt\|X_{t}\|=G_{t} and again let Xt=GtX~t1X~t1X_{t}=G_{t}\frac{\tilde{X}_{t-1}}{\|\tilde{X}_{t-1}\|} or Xt=GtX~t1X~t1X_{t}=-G_{t}\frac{\tilde{X}_{t-1}}{\|\tilde{X}_{t-1}\|}. Then we need to ensure

g(cX~t1)X~t12+Γt2g(c\tilde{X}_{t-1})\leq\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}

holds for c=Γt+1Γt1,GtX~t1c=\frac{\Gamma_{t+1}}{\Gamma_{t}}-1,\frac{G_{t}}{\|\tilde{X}_{t-1}\|} and GtX~t1-\frac{G_{t}}{\|\tilde{X}_{t-1}\|}.

By Lemma A.11, it suffices to verify

(c21)X~t12Γt2+(X~t12+Γt2)Γt+12Γt4.(c^{2}-1)\|\tilde{X}_{t-1}\|^{2}\Gamma_{t}^{2}+(\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2})\Gamma_{t+1}^{2}\leq\Gamma_{t}^{4}.

If c=Γt+1Γt1c=\frac{\Gamma_{t+1}}{\Gamma_{t}}-1, we have to ensure

(c21)X~t12Γt2+(X~t12+Γt2)Γt+12Γt4=(Γt+12Γt22Γt+1Γt)X~t12Γt2+X~t12Γt+12+Γt2Γt+12Γt4=2(Γt+1Γt)Γt+1X~t12+Γt2(Γt+12Γt2)0.\begin{split}&(c^{2}-1)\|\tilde{X}_{t-1}\|^{2}\Gamma_{t}^{2}+(\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2})\Gamma_{t+1}^{2}-\Gamma_{t}^{4}\\ =&\left(\frac{\Gamma_{t+1}^{2}}{\Gamma_{t}^{2}}-2\frac{\Gamma_{t+1}}{\Gamma_{t}}\right)\|\tilde{X}_{t-1}\|^{2}\Gamma_{t}^{2}+\|\tilde{X}_{t-1}\|^{2}\Gamma_{t+1}^{2}+\Gamma_{t}^{2}\Gamma_{t+1}^{2}-\Gamma_{t}^{4}\\ =&2(\Gamma_{t+1}-\Gamma_{t})\Gamma_{t+1}\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}(\Gamma_{t+1}^{2}-\Gamma_{t}^{2})\leq 0.\end{split} (22)

For the case where c2=Gt2X~t12c^{2}=\frac{G_{t}^{2}}{\|\tilde{X}_{t-1}\|^{2}}, we have

(c21)X~t12Γt2+(X~t12+Γt2)Γt+12Γt4=(Gt2X~t121)X~t12Γt2+(X~t12+Γt2)Γt+12Γt4=Γt2(Gt2+Γt+12Γt2)Gt2X~t12=Gt2X~t120.\begin{split}&(c^{2}-1)\|\tilde{X}_{t-1}\|^{2}\Gamma_{t}^{2}+(\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2})\Gamma_{t+1}^{2}-\Gamma_{t}^{4}\\ =&\left(\frac{G_{t}^{2}}{\|\tilde{X}_{t-1}\|^{2}}-1\right)\|\tilde{X}_{t-1}\|^{2}\Gamma_{t}^{2}+(\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2})\Gamma_{t+1}^{2}-\Gamma_{t}^{4}\\ =&\Gamma_{t}^{2}(G_{t}^{2}+\Gamma_{t+1}^{2}-\Gamma_{t}^{2})-G_{t}^{2}\|\tilde{X}_{t-1}\|^{2}\\ =&-G_{t}^{2}\|\tilde{X}_{t-1}\|^{2}\leq 0.\end{split} (23)

The only case left is when XtX_{t} is not parallel to X~t1\tilde{X}_{t-1}. λ=0\lambda=0 implies Xt=0X_{t}=0 and thus

g(0)=X~t12+Γt+12X~t12+Γt2.g(0)=\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t+1}^{2}}\leq\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}.

If \lambda\neq 0, then \|X_{t}\|=G_{t}. Since X_{t} is not parallel to \tilde{X}_{t-1}, the coefficient of \tilde{X}_{t-1} in the stationarity condition must vanish:

X~t1X~t12+Γt2+X~t1X~t1+Xt2+Γt+12=0-\frac{\tilde{X}_{t-1}}{\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}}+\frac{\tilde{X}_{t-1}}{\sqrt{\|\tilde{X}_{t-1}+X_{t}\|^{2}+\Gamma_{t+1}^{2}}}=0

which, together with \Gamma_{t}^{2}-\Gamma_{t+1}^{2}=G_{t}^{2} and \|X_{t}\|=G_{t}, implies \left\langle X_{t},\tilde{X}_{t-1}\right\rangle=0. At such a point the required bound holds with equality, as

g(Xt)=X~t12+Γt2.g(X_{t})=\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}.

This finishes the induction step and establishes the lemma.

Lemma A.11.
tr(Xt,X~t1X~t12+Γt2)+X~t1+Xt2+Γt+12X~t12+Γt2.-tr\left(\frac{\left\langle X_{t},\tilde{X}_{t-1}\right\rangle}{\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}}\right)+\sqrt{\|\tilde{X}_{t-1}+X_{t}\|^{2}+\Gamma_{t+1}^{2}}\leq\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}.

holds for Xt=cX~t1X_{t}=c\tilde{X}_{t-1} iff

(c21)X~t12Γt2+(X~t12+Γt2)Γt+12Γt4.(c^{2}-1)\|\tilde{X}_{t-1}\|^{2}\Gamma_{t}^{2}+(\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2})\Gamma_{t+1}^{2}\leq\Gamma_{t}^{4}.
Proof A.12.

The statement we want to show is

cX~t12X~t12+Γt2+X~t12(1+c)2+Γt+12X~t12+Γt2.-\frac{c\|\tilde{X}_{t-1}\|^{2}}{\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}}+\sqrt{\|\tilde{X}_{t-1}\|^{2}(1+c)^{2}+\Gamma_{t+1}^{2}}\leq\sqrt{\|\tilde{X}_{t-1}\|^{2}+\Gamma_{t}^{2}}.

Let \alpha=\|\tilde{X}_{t-1}\|^{2}, \beta=\Gamma_{t}^{2} and \gamma=\Gamma_{t+1}^{2}. Moving the first term to the right-hand side, squaring, and rearranging, the statement is equivalent to

(c21)αβ+(α+β)γβ2.(c^{2}-1)\alpha\beta+(\alpha+\beta)\gamma\leq\beta^{2}.

The claim follows after substituting \alpha, \beta and \gamma back.

Theorem 12.

There exists a game on 𝒟++n\mathcal{D}_{++}^{n} such that we can exactly compute the value of the minimax regret. Specifically, the decision set of the player is 𝒩={p:p=ExpI(Y),YD2}\mathcal{N}=\{p:p={\textnormal{Exp}}_{I}(Y),\|Y\|\leq\frac{D}{2}\}, and the adversary is allowed to pick a loss function in

t={αtGtbc(p):c˙(0)=1,αt[0,1]}={tr(XtY):XtGt}.\mathcal{F}_{t}=\{\alpha_{t}G_{t}b_{c}(p):\|\dot{c}(0)\|=1,\alpha_{t}\in[0,1]\}=\{-tr(X_{t}Y):\|X_{t}\|\leq G_{t}\}\\ .

Then the minimax value of the game is

𝒱T(𝒩,{t})=D2t=1TGt2.\mathcal{V}_{T}(\mathcal{N},\{\mathcal{F}_{t}\})=\frac{D}{2}\sqrt{\sum_{t=1}^{T}G_{t}^{2}}.

In addition, the optimal strategy of the player is

Yt=s=1t1Xss=1t1Xs2+s=tTGs2.Y_{t}=\frac{\sum_{s=1}^{t-1}X_{s}}{\sqrt{\left\|\sum_{s=1}^{t-1}X_{s}\right\|^{2}+\sum_{s=t}^{T}G_{s}^{2}}}.
Proof A.13.

The theorem follows directly from Lemmas A.5, A.7, and A.9; the value for general D is obtained by rescaling the D=2 case by D/2.
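To illustrate Theorem 12, the sketch below (ours) plays the strategy Y_{t} of Lemma A.9 against a randomly drawn adversary in the diagonal model with D=2 and reports the realized regret next to the game value \frac{D}{2}\sqrt{\sum_{t}G_{t}^{2}}; by Lemma A.9 the regret never exceeds this value, and the adversary of Lemma A.7 attains it.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, D = 5, 100, 2.0
G = rng.uniform(0.5, 2.0, size=T)
Gamma2 = np.concatenate([np.cumsum(G[::-1] ** 2)[::-1], [0.0]])   # Gamma_t^2 = sum_{s >= t} G_s^2

X_sum, player_loss = np.zeros(n), 0.0
for t in range(T):
    Y_t = X_sum / np.sqrt(np.linalg.norm(X_sum) ** 2 + Gamma2[t])  # Lemma A.9 strategy, ||Y_t|| <= 1
    X_t = rng.standard_normal(n)
    X_t *= G[t] / np.linalg.norm(X_t)                              # some adversary with ||X_t|| <= G_t
    player_loss += -np.dot(X_t, Y_t)                               # loss -tr(X_t Y_t)
    X_sum += X_t
comparator_loss = -np.linalg.norm(X_sum)                           # best fixed ||Y|| <= 1 in hindsight
regret = player_loss - comparator_loss
print(regret, (D / 2) * np.sqrt(np.sum(G ** 2)))                   # regret <= value of the game
```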

Theorem 13.

There exists a comparator sequence satisfying \sum_{t=2}^{T}d(\mathbf{u}_{t},\mathbf{u}_{t-1})\leq P_{T} such that the minimax dynamic regret on Hadamard manifolds is lower bounded by \Omega(G\sqrt{T(D^{2}+DP_{T})}).

Proof A.14.

We combine Theorem 12 with a reduction in Zhang et al. (2018) to finish the proof. By Theorem 12 we have

𝒱T=infY11supX1GinfYT1supXTG[t=1Ttr(XtYt)infY1t=1Ttr(XtY)]=GDT2.\mathcal{V}_{T}=\inf_{\|Y_{1}\|\leq 1}\sup_{\|X_{1}\|\leq G}\dots\inf_{\|Y_{T}\|\leq 1}\sup_{\|X_{T}\|\leq G}\left[\sum_{t=1}^{T}-tr(X_{t}Y_{t})-\inf_{\|Y\|\leq 1}\sum_{t=1}^{T}-tr(X_{t}Y)\right]=\frac{GD\sqrt{T}}{2}.

Note that the path-length is upper bounded by TDTD. For any τ[0,TD]\tau\in[0,TD], we define the set of comparators with path-length bounded by τ\tau as

C(τ)={𝐮1,,𝐮T𝒩:t=2Td(𝐮t,𝐮t1)τ}C(\tau)=\left\{\mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{N}:\sum_{t=2}^{T}d(\mathbf{u}_{t},\mathbf{u}_{t-1})\leq\tau\right\}

where 𝒩={𝐮:d(I,𝐮)D2}\mathcal{N}=\{\mathbf{u}:d(I,\mathbf{u})\leq\frac{D}{2}\} is a gsc-convex subset and the minimax dynamic regret w.r.t. C(τ)C(\tau) is

𝒱T(C(τ))=infY11supX1GinfYT1supXTG[t=1Ttr(XtYt)inf𝐮1,,𝐮TC(τ)t=1Ttr(XtExpI1𝐮t)].\mathcal{V}_{T}(C(\tau))=\inf_{\small\|Y_{1}\|\leq 1}\sup_{\small\|X_{1}\|\leq G}\dots\inf_{\small\|Y_{T}\|\leq 1}\sup_{\small\|X_{T}\|\leq G}\left[\sum_{t=1}^{T}-tr(X_{t}Y_{t})-\inf_{\mathbf{u}_{1},\dots,\mathbf{u}_{T}\in C(\tau)}\sum_{t=1}^{T}-tr(X_{t}\textnormal{Exp}^{-1}_{I}\mathbf{u}_{t})\right].

We distinguish two cases. When \tau\leq D, we invoke the minimax static regret lower bound directly to get

𝒱T(C(τ))𝒱T=GDT2.\mathcal{V}_{T}(C(\tau))\geq\mathcal{V}_{T}=\frac{GD\sqrt{T}}{2}. (24)

For the case of \tau\geq D, WLOG, we assume \lceil\tau/D\rceil divides T and let L be the quotient. We construct a subset of C(\tau), named C^{\prime}(\tau), which contains comparator sequences that are constant on each block of L consecutive rounds. Specifically,

C^{\prime}(\tau)=\left\{\mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{N}:\mathbf{u}_{(i-1)L+1}=\dots=\mathbf{u}_{iL},\forall i\in[1,\lceil\tau/D\rceil]\right\}.

Note that the path-length of comparators in C(τ)C^{\prime}(\tau) is at most (τ/D1)Dτ(\lceil\tau/D\rceil-1)D\leq\tau, which implies C(τ)C^{\prime}(\tau) is a subset of C(τ)C(\tau). Thus we have

𝒱T(C(τ))𝒱T(C(τ)).\mathcal{V}_{T}(C(\tau))\geq\mathcal{V}_{T}(C^{\prime}(\tau)). (25)

The purpose of introducing C^{\prime}(\tau) is that we can set \mathbf{u}_{(i-1)L+1}=\dots=\mathbf{u}_{iL} to be the offline minimizer of the i-th segment and invoke the minimax static regret lower bound on each segment. Thus we have

𝒱T(C(τ))=infY11supX1GinfYT1supXTG[t=1Ttr(XtYt)inf𝐮1,,𝐮TC(τ)t=1Ttr(XtExpI1𝐮t)]=infY11supX1GinfYT1supXTG[t=1Ttr(XtYt)i=1τ/DinfY1t=(i1)L+1iLtr(XtY)]=τ/DGDL2=GDTτ/D2GTDτ2.\begin{split}&\mathcal{V}_{T}(C^{\prime}(\tau))\\ =&\inf_{\small\|Y_{1}\|\leq 1}\sup_{\small\|X_{1}\|\leq G}\dots\inf_{\small\|Y_{T}\|\leq 1}\sup_{\small\|X_{T}\|\leq G}\left[\sum_{t=1}^{T}-tr(X_{t}Y_{t})-\inf_{\mathbf{u}_{1},\dots,\mathbf{u}_{T}\in C^{\prime}(\tau)}\sum_{t=1}^{T}-tr(X_{t}\textnormal{Exp}^{-1}_{I}\mathbf{u}_{t})\right]\\ =&\inf_{\small\|Y_{1}\|\leq 1}\sup_{\small\|X_{1}\|\leq G}\dots\inf_{\small\|Y_{T}\|\leq 1}\sup_{\small\|X_{T}\|\leq G}\left[\sum_{t=1}^{T}-tr(X_{t}Y_{t})-\sum_{i=1}^{\lceil\tau/D\rceil}\inf_{\|Y\|\leq 1}\sum_{t=(i-1)L+1}^{iL}-tr(X_{t}Y)\right]\\ =&\lceil\tau/D\rceil\frac{GD\sqrt{L}}{2}=\frac{GD\sqrt{T\lceil\tau/D\rceil}}{2}\geq\frac{G\sqrt{TD\tau}}{2}.\end{split} (26)

Combining Equations (24), (25) and (26) yields

𝒱T(C(τ))G2max(DT,TDτ)=Ω(GT(D2+Dτ)).\mathcal{V}_{T}(C(\tau))\geq\frac{G}{2}\max(D\sqrt{T},\sqrt{TD\tau})=\Omega(G\sqrt{T(D^{2}+D\tau)}).

Appendix B Omitted Proof for Section 5

B.1 Proof of Theorem 5

We first argue 𝒩c\mathcal{N}_{c} is gsc-convex for any c0c\geq 0 to ensure the algorithm is well-defined. By Lemma E.13, d(𝐱,𝒩)d(\mathbf{x},\mathcal{N}) is gsc-convex on Hadamard manifolds. The sub-level set of a gsc-convex function is a gsc-convex set due to Lemma E.14, which implies 𝒩c\mathcal{N}_{c} is gsc-convex. We notice that

ft(𝐱t)ft(𝐮t)=(ft(𝐱t)ft(𝐱t))+(ft(𝐱t)ft(𝐮t))f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})=(f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t}^{\prime}))+(f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t}))

and derive upper bounds for the two terms individually. If \mathbf{x}_{t}^{\prime}\in\mathcal{N}_{\delta M} then \mathbf{x}_{t}^{\prime}=\mathbf{x}_{t} and f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t}^{\prime})=0. If this is not the case, by Lemma E.7, we have

d(𝐱t,𝐱t)d(𝐱t,𝐳)d(𝐱t,𝐲t)d(\mathbf{x}_{t}^{\prime},\mathbf{x}_{t})\leq d(\mathbf{x}_{t}^{\prime},\mathbf{z})\leq d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})

where \mathbf{z} is the point at which the geodesic segment connecting \mathbf{x}_{t}^{\prime} and \mathbf{y}_{t} enters \mathcal{N}_{\delta M}. Thus

ft(𝐱t)ft(𝐱t)ft(𝐱t),Exp𝐱t1𝐱tft(𝐱t)d(𝐱t,𝐱t)Gd(𝐱t,𝐲t),f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t}^{\prime})\leq\left\langle\nabla f_{t}(\mathbf{x}_{t}),-\textnormal{Exp}^{-1}_{\mathbf{x}_{t}}{\mathbf{x}_{t}^{\prime}}\right\rangle\leq\|\nabla f_{t}(\mathbf{x}_{t})\|\cdot d(\mathbf{x}_{t}^{\prime},\mathbf{x}_{t})\leq G\cdot d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t}), (27)

where we notice 𝐱t𝒩δM\mathbf{x}_{t}\in\mathcal{N}_{\delta M} and use Assumption 3. Let 𝐲t+1=Exp𝐱t(ηft(𝐱t)+Exp𝐱t1(𝐲t))\mathbf{y}_{t+1}^{\prime}={\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}\left(-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})\right). The second term ft(𝐱t)ft(𝐮t)f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t}) can be bounded by

ft(𝐱t)ft(𝐮t)(1)Exp𝐱t1(𝐮t),ft(𝐱t)=1ηExp𝐱t1(𝐮t),Exp𝐱t1(𝐲t+1)Exp𝐱t1(𝐲t)(2)12η(ζd(𝐱t,𝐲t+1)2+d(𝐱t,𝐮t)2d(𝐲t+1,𝐮t)2)12η(d(𝐱t,𝐲t)2+d(𝐱t,𝐮t)2d(𝐲t,𝐮t)2)=12η(ζd(𝐱t,𝐲t+1)2d(𝐱t,𝐲t)2+d(𝐲t,𝐮t)2d(𝐲t+1,𝐮t)2)=12η(ζηft(𝐱t)+Exp𝐱t1(𝐲t)2d(𝐱t,𝐲t)2+d(𝐲t,𝐮t)2d(𝐲t+1,𝐮t)2)=12η(ζηft(𝐱t)+ηΓ𝐲t𝐱tMt2d(𝐱t,𝐲t)2+d(𝐲t,𝐮t)2d(𝐲t+1,𝐮t)2)(3)12η(ζηft(𝐱t)+ηΓ𝐲t𝐱tMt2d(𝐱t,𝐲t)2+d(𝐲t,𝐮t)2d(𝐲t+1,𝐮t)2)\begin{split}&f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t})\stackrel{{\scriptstyle\rm(1)}}{{\leq}}-\langle{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{u}_{t}),\nabla f_{t}(\mathbf{x}_{t}^{\prime})\rangle\\ =&\frac{1}{\eta}\langle{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{u}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t+1}^{\prime})-{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})\rangle\\ \stackrel{{\scriptstyle\rm(2)}}{{\leq}}&\frac{1}{2\eta}\left({\zeta}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t+1}^{\prime})^{2}+d(\mathbf{x}_{t}^{\prime},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1}^{\prime},\mathbf{u}_{t})^{2}\right)-\frac{1}{2\eta}\left(d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+d(\mathbf{x}_{t}^{\prime},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}\right)\\ =&\frac{1}{2\eta}\left({\zeta}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t+1}^{\prime})^{2}-d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1}^{\prime},\mathbf{u}_{t})^{2}\right)\\ =&\frac{1}{2\eta}\left({\zeta}\|-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})\|^{2}-d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1}^{\prime},\mathbf{u}_{t})^{2}\right)\\ =&\frac{1}{2\eta}\left({\zeta}\|-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+\eta\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}M_{t}\|^{2}-d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1}^{\prime},\mathbf{u}_{t})^{2}\right)\\ \stackrel{{\scriptstyle\rm(3)}}{{\leq}}&\frac{1}{2\eta}\left({\zeta}\|-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+\eta\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}M_{t}\|^{2}-d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1},\mathbf{u}_{t})^{2}\right)\end{split} (28)

where the second inequality follows from Lemmas E.5 and E.6, and the third one is due to the non-expansive property of the projection onto gsc-convex subsets of Hadamard manifolds. We apply \Gamma_{\mathbf{x}}^{\mathbf{y}}{\textnormal{Exp}}_{\mathbf{x}}^{-1}\mathbf{y}=-{\textnormal{Exp}}_{\mathbf{y}}^{-1}\mathbf{x} to derive the last equality.

Now we can get the desired squared term ft(𝐲t)Mt2\|\nabla f_{t}(\mathbf{y}_{t})-M_{t}\|^{2} by considering

ft(𝐱t)Γ𝐲t𝐱tMt2=ft(𝐱t)Γ𝐲t𝐱tft(𝐲t)+Γ𝐲t𝐱tft(𝐲t)Γ𝐲t𝐱tMt22(ft(𝐱t)Γ𝐲t𝐱tft(𝐲t)2+Γ𝐲t𝐱tft(𝐲t)Γ𝐲t𝐱tMt2)2L2d(𝐱t,𝐲t)2+2ft(𝐲t)Mt2,\begin{split}&\|\nabla f_{t}(\mathbf{x}_{t}^{\prime})-\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}M_{t}\|^{2}\\ =&\|\nabla f_{t}(\mathbf{x}_{t}^{\prime})-\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}\nabla f_{t}(\mathbf{y}_{t})+\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}\nabla f_{t}(\mathbf{y}_{t})-\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}M_{t}\|^{2}\\ \leq&2\left(\|\nabla f_{t}(\mathbf{x}_{t}^{\prime})-\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}\nabla f_{t}(\mathbf{y}_{t})\|^{2}+\|\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}\nabla f_{t}(\mathbf{y}_{t})-\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}M_{t}\|^{2}\right)\\ \leq&2L^{2}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+2\|\nabla f_{t}(\mathbf{y}_{t})-M_{t}\|^{2},\end{split} (29)

where in the first inequality we use that \|\mathbf{x}+\mathbf{y}\|^{2}\leq 2\|\mathbf{x}\|^{2}+2\|\mathbf{y}\|^{2} holds for any norm \|\cdot\|, and the second inequality is due to the smoothness of f_{t} and the fact that parallel transport is an isometry. Combining Equations (27), (28) and (29), we have

ft(𝐱t)ft(𝐮t)Gd(𝐱t,𝐲t)+ηζ2(2ft(𝐲t)Mt2+2L2d(𝐱t,𝐲t)2)12ηd(𝐱t,𝐲t)2+12η(d(𝐲t,𝐮t)2d(𝐲t+1,𝐮t)2)ηζft(𝐲t)Mt2+12η(2ηG+2η2ζL2d(𝐱t,𝐲t)d(𝐱t,𝐲t))d(𝐱t,𝐲t)+12η(d(𝐲t,𝐮t)2d(𝐲t+1,𝐮t)2).\begin{split}&f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})\\ \leq&Gd(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})+\frac{\eta{\zeta}}{2}\left(2\|\nabla f_{t}(\mathbf{y}_{t})-M_{t}\|^{2}+2L^{2}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}\right)\\ &-\frac{1}{2\eta}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+\frac{1}{2\eta}(d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1},\mathbf{u}_{t})^{2})\\ \leq&\eta{\zeta}\|\nabla f_{t}(\mathbf{y}_{t})-M_{t}\|^{2}+\frac{1}{2\eta}\left(2\eta G+2\eta^{2}{\zeta}L^{2}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})-d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\right)d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\\ &+\frac{1}{2\eta}(d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1},\mathbf{u}_{t})^{2}).\end{split} (30)

Now we show

12η(2ηG+2η2ζL2d(𝐱t,𝐲t)d(𝐱t,𝐲t))d(𝐱t,𝐲t)0\frac{1}{2\eta}\left(2\eta G+2\eta^{2}{\zeta}L^{2}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})-d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\right)d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\leq 0 (31)

holds for any t\in[T]. First we consider the case d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\leq\delta M, which means \mathbf{x}_{t}^{\prime}\in\mathcal{N}_{\delta M} and f_{t}(\mathbf{x}_{t})=f_{t}(\mathbf{x}_{t}^{\prime}), so the term Gd(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t}) coming from Equation (27) can be dropped. It then suffices to have

2η2ζL2d(𝐱t,𝐲t)d(𝐱t,𝐲t)0,2\eta^{2}{\zeta}L^{2}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})-d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\leq 0,

which is obviously true by considering our assumption on η\eta:

ηδMG+(G2+2ζδ2M2L2)1212ζL2.\eta\leq\frac{\delta M}{G+(G^{2}+2\zeta\delta^{2}M^{2}L^{2})^{\frac{1}{2}}}\leq\frac{1}{\sqrt{2\zeta L^{2}}}.

When d(𝐱t,𝐲t)δMd(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\geq\delta M, to simplify the proof, we denote λ=d(𝐱t,𝐲t)\lambda=d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t}) and try to find η\eta such that

h(η;λ)2ηG+2η2ζL2λλ0h(\eta;\lambda)\coloneqq 2\eta G+2\eta^{2}\zeta L^{2}\lambda-\lambda\leq 0 (32)

holds for any λδM\lambda\geq\delta M. We denote the only non-negative root of h(η;λ)h(\eta;\lambda) as η(λ)\eta(\lambda), which can be solved explicitly as

η(λ)=G+(G2+2ζλ2L2)122ζλL2.\eta(\lambda)=\frac{-G+(G^{2}+2\zeta\lambda^{2}L^{2})^{\frac{1}{2}}}{2\zeta\lambda L^{2}}.

Applying Lemma E.9 with a=Ga=G and b=2ζL2b=2\zeta L^{2}, we know η(λ)\eta(\lambda) increases on [0,)[0,\infty). Thus η(λ)η(δM)\eta(\lambda)\geq\eta(\delta M) holds for any λδM\lambda\geq\delta M. Combining with the fact that h(0;λ)=λ<0h(0;\lambda)=-\lambda<0, we know h(η;λ)0h(\eta;\lambda)\leq 0 holds for any ηη(λ)\eta\leq\eta(\lambda), so we can simply set

\eta\leq\min_{\lambda\geq\delta M}\eta(\lambda)=\eta\left(\delta M\right)=\frac{\delta M}{G+(G^{2}+2\zeta\delta^{2}M^{2}L^{2})^{\frac{1}{2}}}

to ensure h(η;λ)0h(\eta;\lambda)\leq 0 always holds.

Now it suffices to bound

t=1T12η(d(𝐲t,𝐮t)2d(𝐲t+1,𝐮t)2)d(𝐲1,𝐮1)22η+t=2T(d(𝐲t,𝐮t)2d(𝐲t,𝐮t1)2)2ηD22η+2Dt=2Td(𝐮t,𝐮t1)2η=D2+2DPT2η.\begin{split}&\sum_{t=1}^{T}\frac{1}{2\eta}(d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1},\mathbf{u}_{t})^{2})\\ \leq&\frac{d(\mathbf{y}_{1},\mathbf{u}_{1})^{2}}{2\eta}+\sum_{t=2}^{T}\frac{\left(d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t},\mathbf{u}_{t-1})^{2}\right)}{2\eta}\\ \leq&\frac{D^{2}}{2\eta}+2D\frac{\sum_{t=2}^{T}d(\mathbf{u}_{t},\mathbf{u}_{t-1})}{2\eta}=\frac{D^{2}+2DP_{T}}{2\eta}.\end{split} (33)

Finally, summing Equation (30) over all rounds and applying Equations (31) and (33), we obtain

t=1Tft(𝐱t)t=1Tft(𝐮t)ηζt=1Tft(𝐲t)Mt2+D2+2DPT2η.\begin{split}\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\leq\eta{\zeta}\sum_{t=1}^{T}\|\nabla f_{t}(\mathbf{y}_{t})-M_{t}\|^{2}+\frac{D^{2}+2DP_{T}}{2\eta}.\end{split} (34)
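The update analyzed above (written out explicitly as Equation (35) in the next subsection) is the optimistic two-step scheme \mathbf{x}_{t}^{\prime}={\textnormal{Exp}}_{\mathbf{y}_{t}}(-\eta M_{t}), \mathbf{x}_{t}=\Pi_{\mathcal{N}_{\delta M}}\mathbf{x}_{t}^{\prime}, \mathbf{y}_{t+1}=\Pi_{\mathcal{N}}{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}(-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})). For intuition only, the sketch below (ours) instantiates it in the flat Euclidean special case, where {\textnormal{Exp}}_{\mathbf{x}}(\mathbf{v})=\mathbf{x}+\mathbf{v}, {\textnormal{Exp}}_{\mathbf{x}}^{-1}\mathbf{y}=\mathbf{y}-\mathbf{x}, \zeta=1, and the domains are balls; the quadratic losses, the drifting targets, and the hint M_{t+1}=\nabla f_{t}(\mathbf{y}_{t+1}) are toy choices of ours, not the paper’s. A manifold version would only swap in the exponential map, its inverse, parallel transport, and the metric projection.

```python
import numpy as np

def proj_ball(x, radius):                   # metric projection onto a centered ball (flat analogue of Pi)
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

rng = np.random.default_rng(4)
d, T, eta, delta_M, R = 3, 500, 0.05, 0.2, 1.0
targets = np.cumsum(0.05 * rng.standard_normal((T, d)), axis=0)   # slowly drifting minimizers
y, M = np.zeros(d), np.zeros(d)
total_regret = 0.0
for t in range(T):
    u_t = proj_ball(targets[t], R)          # comparator u_t, minimizer of f_t(x) = 0.5 ||x - u_t||^2
    x_prime = y - eta * M                   # x'_t = Exp_{y_t}(-eta M_t)
    x = proj_ball(x_prime, R + delta_M)     # x_t = Pi_{N_{delta M}}(x'_t)
    total_regret += 0.5 * np.linalg.norm(x - u_t) ** 2            # f_t(x_t) - f_t(u_t)
    g = x_prime - u_t                       # grad f_t(x'_t)
    y = proj_ball(x_prime + (-eta * g + (y - x_prime)), R)        # y_{t+1} = Pi_N Exp_{x'_t}(-eta grad + Exp^{-1}_{x'_t} y_t)
    M = y - u_t                             # hint M_{t+1} = grad f_t(y_{t+1}), a gradient-variation choice
print(total_regret)                         # dynamic regret against the drifting comparators u_t
```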

B.2 Extension of Theorem 5 to CAT(κ)(\kappa) Spaces

In this part, we show how to obtain an optimistic regret bound on CAT(\kappa) spaces, whose sectional curvature is upper bounded by \kappa. Note that Hadamard manifolds are complete CAT(0) spaces. To proceed, we make the following assumption.

Assumption 4

The sectional curvature of manifold \mathcal{M} satisfies κ1κκ2-\kappa_{1}\leq\kappa\leq\kappa_{2} where κ10\kappa_{1}\geq 0. We define

𝒟(κ){,κ0πκ,κ>0.\mathcal{D}(\kappa)\coloneqq\begin{cases}\infty,\quad&\kappa\leq 0\\ \frac{\pi}{\sqrt{\kappa}},\quad&\kappa>0.\end{cases}

The diameter of the gsc-convex set 𝒩\mathcal{N}\subset\mathcal{M} is DD and we assume D+2δM𝒟(κ2)D+2\delta M\leq\mathcal{D}(\kappa_{2}). The gradient satisfies sup𝐱𝒩δMft(𝐱)G.\sup_{\mathbf{x}\in\mathcal{N}_{\delta M}}\|\nabla f_{t}(\mathbf{x})\|\leq G.

Lemma B.1.

(Alimisis et al., 2020, Corollary 2.1) Let \mathcal{M} be a Riemannian manifold with sectional curvature upper bounded by κ2\kappa_{2} and 𝒩\mathcal{N}\subset\mathcal{M} be a gsc-convex set with diameter upper bounded by 𝒟(κ2)\mathcal{D}(\kappa_{2}). For a geodesic triangle fully lies within 𝒩\mathcal{N} with side lengths a,b,ca,b,c, we have

a2ξ(κ2,D)b2+c22bccosAa^{2}\geq{\xi}(\kappa_{2},D)b^{2}+c^{2}-2bc\cos A

where ξ(κ2,D)=κ2Dcoth(κ2D){\xi}(\kappa_{2},D)=\sqrt{-\kappa_{2}}D\operatorname{coth}(\sqrt{-\kappa_{2}}D) when κ20\kappa_{2}\leq 0 and ξ(κ2,D)=κ2Dcot(κ2D){\xi}(\kappa_{2},D)=\sqrt{\kappa_{2}}D\operatorname{cot}(\sqrt{\kappa_{2}}D) when κ2>0\kappa_{2}>0.

Definition 14.

We define ζ=ζ(κ1,D+2δM)\zeta=\zeta(-\kappa_{1},D+2\delta M) and ξ=ξ(κ2,D+2δM)\xi=\xi(\kappa_{2},D+2\delta M) where ξ(,)\xi(\cdot,\cdot) and ζ(,)\zeta(\cdot,\cdot) are defined in Lemmas B.1 and E.5, respectively.

Lemma B.2.

(Bridson and Haefliger, 2013) For a CAT(κ)(\kappa) space \mathcal{M}, a ball of diameter smaller than 𝒟(κ)\mathcal{D}(\kappa) is convex. Let CC be a convex subset in \mathcal{M}. If d(𝐱,C)𝒟(κ)2d(\mathbf{x},C)\leq\frac{\mathcal{D}(\kappa)}{2} then d(𝐱,C)d(\mathbf{x},C) is convex and there exists a unique point ΠC(𝐱)𝒞\Pi_{C}(\mathbf{x})\in\mathcal{C} such that d(𝐱,ΠC(𝐱))=d(𝐱,C)=inf𝐲Cd(𝐱,𝐲)d(\mathbf{x},\Pi_{C}(\mathbf{x}))=d(\mathbf{x},C)=\inf_{\mathbf{y}\in C}d(\mathbf{x},\mathbf{y}).

Theorem 15.

Suppose all losses f_{t} are L-gsc-smooth on \mathcal{M}. Under Assumption 4, the iterates

𝐱t=Exp𝐲t(ηMt)𝐱t=Π𝒩δM𝐱t𝐲t+1=Π𝒩Exp𝐱t(ηft(𝐱t)+Exp𝐱t1(𝐲t)).\begin{split}\textstyle&\mathbf{x}_{t}^{\prime}={\textnormal{Exp}}_{\mathbf{y}_{t}}(-\eta M_{t})\\ &\mathbf{x}_{t}=\Pi_{\mathcal{N}_{\delta M}}\mathbf{x}_{t}^{\prime}\\ &\mathbf{y}_{t+1}=\Pi_{\mathcal{N}}{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}\left(-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})\right).\end{split} (35)

satisfy

t=1Tft(𝐱t)t=1Tft(𝐮t)ηζt=1Tft(𝐲t)Mt2+D2+2DPT2η.\begin{split}\textstyle\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\leq\eta{\zeta}\sum_{t=1}^{T}\|\nabla f_{t}(\mathbf{y}_{t})-M_{t}\|^{2}+\frac{D^{2}+2DP_{T}}{2\eta}.\end{split}

for any 𝐮1,,𝐮T𝒩\mathbf{u}_{1},\dots,\mathbf{u}_{T}\in\mathcal{N} and ηmin{ξδMG+(G2+2ζξδ2M2L2)12,𝒟(κ2)2(G+2M)}\eta\leq\min\left\{\frac{\xi\delta M}{G+(G^{2}+2\zeta\xi\delta^{2}M^{2}L^{2})^{\frac{1}{2}}},\frac{\mathcal{D}(\kappa_{2})}{2(G+2M)}\right\}.

Proof B.3.

We highlight the key differences from the proof of Theorem 5. Again we let \mathbf{y}_{t+1}^{\prime}={\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}\left(-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})\right). First, we need to argue that the algorithm is well-defined. The diameter of \mathcal{N}_{\delta M} is at most D+2\delta M by the triangle inequality, so \mathcal{N}_{\delta M} is gsc-convex by Assumption 4 and Lemma B.2. We also need to ensure that d(\mathbf{x}_{t}^{\prime},\mathcal{N}_{\delta M})\leq\frac{\mathcal{D}(\kappa_{2})}{2} and d(\mathbf{y}_{t+1}^{\prime},\mathcal{N})\leq\frac{\mathcal{D}(\kappa_{2})}{2}, and apply Lemma B.2 to show that the projection is unique and non-expansive. For d(\mathbf{x}_{t}^{\prime},\mathcal{N}_{\delta M}), we have

d(𝐱t,𝒩δM)d(𝐱t,𝐲t)ηM𝒟(κ2)2.d(\mathbf{x}_{t}^{\prime},\mathcal{N}_{\delta M})\leq d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\leq\eta M\leq\frac{\mathcal{D}(\kappa_{2})}{2}.

by η𝒟(κ2)2(G+2M)\eta\leq\frac{\mathcal{D}(\kappa_{2})}{2(G+2M)}. Similarly, for d(𝐲t+1,𝒩)d(\mathbf{y}_{t+1}^{\prime},\mathcal{N})

d(𝐲t+1,𝒩)d(𝐲t+1,𝐲t)d(𝐲t+1,𝐱t)+d(𝐲t,𝐱t)ηft(𝐱t)+ηΓ𝐲t𝐱tMt+ηMη(G+2M)𝒟(κ2)2.\begin{split}&d(\mathbf{y}_{t+1}^{\prime},\mathcal{N})\leq d(\mathbf{y}_{t+1}^{\prime},\mathbf{y}_{t})\leq d(\mathbf{y}_{t+1}^{\prime},\mathbf{x}_{t}^{\prime})+d(\mathbf{y}_{t},\mathbf{x}_{t}^{\prime})\\ \leq&\|-\eta\nabla f_{t}(\mathbf{x}_{t}^{\prime})+\eta\Gamma_{\mathbf{y}_{t}}^{\mathbf{x}_{t}^{\prime}}M_{t}\|+\eta M\\ \leq&\eta(G+2M)\leq\frac{\mathcal{D}(\kappa_{2})}{2}.\end{split}

We can bound f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t}^{\prime}) in the same way as in Theorem 5, but now we use Lemmas B.1 and E.5 to bound f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t}).

ft(𝐱t)ft(𝐮t)Exp𝐱t1(𝐮t),ft(𝐱t)=1ηExp𝐱t1(𝐮t),Exp𝐱t1(𝐲t+1)Exp𝐱t1(𝐲t)12η(ζd(𝐱t,𝐲t+1)2+d(𝐱t,𝐮t)2d(𝐲t+1,𝐮t)2)12η(ξd(𝐱t,𝐲t)2+d(𝐱t,𝐮t)2d(𝐲t,𝐮t)2).\begin{split}&f_{t}(\mathbf{x}_{t}^{\prime})-f_{t}(\mathbf{u}_{t}){\leq}-\langle{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{u}_{t}),\nabla f_{t}(\mathbf{x}_{t}^{\prime})\rangle\\ =&\frac{1}{\eta}\langle{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{u}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t+1}^{\prime})-{\textnormal{Exp}}_{\mathbf{x}_{t}^{\prime}}^{-1}(\mathbf{y}_{t})\rangle\\ {\leq}&\frac{1}{2\eta}\left({\zeta}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t+1}^{\prime})^{2}+d(\mathbf{x}_{t}^{\prime},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t+1}^{\prime},\mathbf{u}_{t})^{2}\right)-\frac{1}{2\eta}\left(\xi d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})^{2}+d(\mathbf{x}_{t}^{\prime},\mathbf{u}_{t})^{2}-d(\mathbf{y}_{t},\mathbf{u}_{t})^{2}\right).\\ \end{split} (36)

Finally, we need to show

2ηG+2η2ζL2d(𝐱t,𝐲t)ξd(𝐱t,𝐲t)02\eta G+2\eta^{2}{\zeta}L^{2}d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})-\xi d(\mathbf{x}_{t}^{\prime},\mathbf{y}_{t})\leq 0 (37)

Following the proof of Theorem 5, we find

ηξδMG+(G2+2ζξδ2M2L2)12\eta\leq\frac{\xi\delta M}{G+(G^{2}+2\zeta\xi\delta^{2}M^{2}L^{2})^{\frac{1}{2}}}

satisfies the required condition. The guarantee is thus established.

B.3 Proof of Lemma 5.2

We first show

t=1Tt,𝐰t𝐰lnN+R(𝐰)β+βt=1Tt𝐦t212βt=1T(𝐰t𝐰t12+𝐰t𝐰t112)\begin{split}\sum_{t=1}^{T}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{t}-\mathbf{w}^{*}\right\rangle\leq\frac{\ln N+R(\mathbf{w}^{*})}{\beta}+\beta\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}-\frac{1}{2\beta}\sum_{t=1}^{T}\left(\|\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\|_{1}^{2}+\|\mathbf{w}_{t}-\mathbf{w}_{t-1}^{\prime}\|_{1}^{2}\right)\end{split} (38)

holds for any 𝐰ΔN\mathbf{w}^{*}\in\Delta_{N}, where

𝐰t=argmin𝐰ΔNβs=1t1s+𝐦t,𝐰+R(𝐰),t1\mathbf{w}_{t}=\operatorname*{argmin}_{\mathbf{w}\in\Delta_{N}}\;\beta\left\langle\sum_{s=1}^{t-1}\boldsymbol{\ell}_{s}+\mathbf{m}_{t},\mathbf{w}\right\rangle+R(\mathbf{w}),\quad t\geq 1

and

𝐰t=argmin𝐰ΔNβs=1ts,𝐰+R(𝐰),t0.\mathbf{w}_{t}^{\prime}=\operatorname*{argmin}_{\mathbf{w}\in\Delta_{N}}\;\beta\left\langle\sum_{s=1}^{t}\boldsymbol{\ell}_{s},\mathbf{w}\right\rangle+R(\mathbf{w}),\quad t\geq 0.

Note that R(\mathbf{w})=\sum_{i\in[N]}w_{i}\ln w_{i} is the negative entropy. By the equivalence between Hedge and follow-the-regularized-leader with the negative-entropy regularizer, we have w_{t,i}\propto e^{-\beta\left(\sum_{s=1}^{t-1}\ell_{s,i}+m_{t,i}\right)} and w_{t,i}^{\prime}\propto e^{-\beta\left(\sum_{s=1}^{t}\ell_{s,i}\right)}. To prove Equation (38), we consider the following decomposition:

t,𝐰t𝐰=t𝐦t,𝐰t𝐰t+𝐦t,𝐰t𝐰t+t,𝐰t𝐰.\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{t}-\mathbf{w}^{*}\right\rangle=\left\langle\boldsymbol{\ell}_{t}-\mathbf{m}_{t},\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\right\rangle+\left\langle\mathbf{m}_{t},\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\right\rangle+\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{t}^{\prime}-\mathbf{w}^{*}\right\rangle.

Since R()R(\cdot) is 11-strongly convex w.r.t. the 1\boldsymbol{\ell}_{1} norm, by Lemma E.16, we have 𝐰t𝐰t1βt𝐦t\|\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\|_{1}\leq\beta\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty} and

t𝐦t,𝐰t𝐰tt𝐦t𝐰t𝐰t1βt𝐦t2\left\langle\boldsymbol{\ell}_{t}-\mathbf{m}_{t},\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\right\rangle\leq\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}\|\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\|_{1}\leq\beta\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}

by Hölder’s inequality. Hence it suffices to show

t=1T𝐦t,𝐰t𝐰t+t,𝐰t𝐰lnN+R(𝐰)β12βt=1T(𝐰t𝐰t12+𝐰t𝐰t112)\sum_{t=1}^{T}\left\langle\mathbf{m}_{t},\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\right\rangle+\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{t}^{\prime}-\mathbf{w}^{*}\right\rangle\leq\frac{\ln N+R(\mathbf{w}^{*})}{\beta}-\frac{1}{2\beta}\sum_{t=1}^{T}\left(\|\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\|_{1}^{2}+\|\mathbf{w}_{t}-\mathbf{w}_{t-1}^{\prime}\|_{1}^{2}\right) (39)

to prove Equation (38).

Equation (39) holds for T=0T=0 because R(𝐰)lnNR(\mathbf{w}^{*})\geq-\ln N holds for any 𝐰ΔN\mathbf{w}^{*}\in\Delta_{N}. To proceed, we need the following proposition:

g(𝐰)+c2𝐰𝐰2g(𝐰)g(\mathbf{w}^{*})+\frac{c}{2}\|\mathbf{w}-\mathbf{w}^{*}\|^{2}\leq g(\mathbf{w}) (40)

holds for any cc-strongly convex g():𝒲g(\cdot):\mathcal{W}\rightarrow\mathbb{R} where 𝐰=argmin𝐰𝒲g(𝐰)\mathbf{w}^{*}=\operatorname*{argmin}_{\mathbf{w}\in\mathcal{W}}g(\mathbf{w}). This fact can be easily seen by combining the strong convexity and the first-order optimality condition for convex functions.

Assume Equation (39) holds for round T1(T1)T-1\;(T\geq 1) and we denote

CT=12βt=1T(𝐰t𝐰t12+𝐰t𝐰t112).C_{T}=\frac{1}{2\beta}\sum_{t=1}^{T}\left(\|\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\|_{1}^{2}+\|\mathbf{w}_{t}-\mathbf{w}_{t-1}^{\prime}\|_{1}^{2}\right).

Now for round TT,

t=1T(𝐦t,𝐰t𝐰t+t,𝐰t)(1)t=1T1t,𝐰T1+lnN+R(𝐰T1)βCT1+𝐦T,𝐰T𝐰T+T,𝐰T(2)t=1T1t,𝐰T+lnN+R(𝐰T)βCT1+𝐦T,𝐰T𝐰T+T,𝐰T12β𝐰T𝐰T112=(t=1T1t,𝐰T+𝐦T,𝐰T+lnN+R(𝐰T)β)+T𝐦T,𝐰TCT112β𝐰T𝐰T112(3)(t=1T1t,𝐰T+𝐦T,𝐰T+lnN+R(𝐰T)β)+T𝐦T,𝐰TCT112β𝐰T𝐰T11212β𝐰T𝐰T12=t=1Tt,𝐰T+lnN+R(𝐰T)βCT(4)t=1Tt,𝐰+lnN+R(𝐰)βCT.\begin{split}&\sum_{t=1}^{T}\left(\left\langle\mathbf{m}_{t},\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\right\rangle+\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{t}^{\prime}\right\rangle\right)\\ \stackrel{{\scriptstyle\rm(1)}}{{\leq}}&\sum_{t=1}^{T-1}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{T-1}^{\prime}\right\rangle+\frac{\ln N+R(\mathbf{w}_{T-1}^{\prime})}{\beta}-C_{T-1}+\left\langle\mathbf{m}_{T},\mathbf{w}_{T}-\mathbf{w}_{T}^{\prime}\right\rangle+\left\langle\boldsymbol{\ell}_{T},\mathbf{w}_{T}^{\prime}\right\rangle\\ \stackrel{{\scriptstyle\rm(2)}}{{\leq}}&\sum_{t=1}^{T-1}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{T}\right\rangle+\frac{\ln N+R(\mathbf{w}_{T})}{\beta}-C_{T-1}+\left\langle\mathbf{m}_{T},\mathbf{w}_{T}-\mathbf{w}_{T}^{\prime}\right\rangle+\left\langle\boldsymbol{\ell}_{T},\mathbf{w}_{T}^{\prime}\right\rangle-\frac{1}{2\beta}\|\mathbf{w}_{T}-\mathbf{w}_{T-1}^{\prime}\|_{1}^{2}\\ =&\left(\sum_{t=1}^{T-1}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{T}\right\rangle+\left\langle\mathbf{m}_{T},\mathbf{w}_{T}\right\rangle+\frac{\ln N+R(\mathbf{w}_{T})}{\beta}\right)+\left\langle\boldsymbol{\ell}_{T}-\mathbf{m}_{T},\mathbf{w}_{T}^{\prime}\right\rangle-C_{T-1}-\frac{1}{2\beta}\|\mathbf{w}_{T}-\mathbf{w}_{T-1}^{\prime}\|_{1}^{2}\\ \stackrel{{\scriptstyle\rm(3)}}{{\leq}}&\left(\sum_{t=1}^{T-1}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{T}^{\prime}\right\rangle+\left\langle\mathbf{m}_{T},\mathbf{w}_{T}^{\prime}\right\rangle+\frac{\ln N+R(\mathbf{w}_{T}^{\prime})}{\beta}\right)+\left\langle\boldsymbol{\ell}_{T}-\mathbf{m}_{T},\mathbf{w}_{T}^{\prime}\right\rangle\\ &-C_{T-1}-\frac{1}{2\beta}\|\mathbf{w}_{T}-\mathbf{w}_{T-1}^{\prime}\|_{1}^{2}-\frac{1}{2\beta}\|\mathbf{w}_{T}-\mathbf{w}_{T}^{\prime}\|_{1}^{2}\\ =&\sum_{t=1}^{T}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}_{T}^{\prime}\right\rangle+\frac{\ln N+R(\mathbf{w}_{T}^{\prime})}{\beta}-C_{T}\\ \stackrel{{\scriptstyle\rm(4)}}{{\leq}}&\sum_{t=1}^{T}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}^{*}\right\rangle+\frac{\ln N+R(\mathbf{w}^{*})}{\beta}-C_{T}.\end{split} (41)

The first inequality is due to the induction hypothesis with \mathbf{w}^{\star}=\mathbf{w}_{T-1}^{\prime}. The second and the third inequalities are applications of Equation (40): specifically, \mathbf{w}_{T-1}^{\prime} and \mathbf{w}_{T} minimize \sum_{t=1}^{T-1}\beta\left\langle\boldsymbol{\ell}_{t},\mathbf{w}\right\rangle+R(\mathbf{w}) and \sum_{t=1}^{T-1}\beta\left\langle\boldsymbol{\ell}_{t},\mathbf{w}\right\rangle+\beta\left\langle\mathbf{m}_{T},\mathbf{w}\right\rangle+R(\mathbf{w}), respectively. The fourth inequality follows from the fact that \mathbf{w}_{T}^{\prime} minimizes \beta\sum_{t=1}^{T}\left\langle\boldsymbol{\ell}_{t},\mathbf{w}\right\rangle+R(\mathbf{w}).

We now demonstrate how to remove the dependence on 𝐰t\mathbf{w}_{t}^{\prime}:

12βt=1T(𝐰t𝐰t12+𝐰t𝐰t112)12βt=1T(𝐰t𝐰t12+𝐰t+1𝐰t12)12β𝐰T+1𝐰T1214βt=2T𝐰t𝐰t1122β,\begin{split}&\frac{1}{2\beta}\sum_{t=1}^{T}\left(\|\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\|_{1}^{2}+\|\mathbf{w}_{t}-\mathbf{w}_{t-1}^{\prime}\|_{1}^{2}\right)\\ \geq&\frac{1}{2\beta}\sum_{t=1}^{T}\left(\|\mathbf{w}_{t}-\mathbf{w}_{t}^{\prime}\|_{1}^{2}+\|\mathbf{w}_{t+1}-\mathbf{w}_{t}^{\prime}\|_{1}^{2}\right)-\frac{1}{2\beta}\|\mathbf{w}_{T+1}-\mathbf{w}_{T}^{\prime}\|_{1}^{2}\\ \geq&\frac{1}{4\beta}\sum_{t=2}^{T}\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}-\frac{2}{\beta},\end{split} (42)

where the last inequality follows from the fact that \|\mathbf{x}+\mathbf{y}\|^{2}\leq 2(\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}) holds for any norm, together with \|\mathbf{w}_{T+1}-\mathbf{w}_{T}^{\prime}\|_{1}\leq 2. The proof is completed by combining Equations (38) and (42).
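In implementation terms, the two sequences in the proof are just softmax weights over (optimistically hinted) cumulative losses: w_{t,i}\propto e^{-\beta(\sum_{s<t}\ell_{s,i}+m_{t,i})} and w_{t,i}^{\prime}\propto e^{-\beta\sum_{s\leq t}\ell_{s,i}}. A minimal sketch of this optimistic Hedge step (ours; the toy losses and the choice \mathbf{m}_{t+1}=\boldsymbol{\ell}_{t} are illustrative only):

```python
import numpy as np

def softmax_weights(scores):
    z = scores - scores.max()                    # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def optimistic_hedge_step(cum_losses, hint, beta):
    # w_t in the proof: FTRL with the negative-entropy regularizer on the hinted cumulative loss
    return softmax_weights(-beta * (cum_losses + hint))

rng = np.random.default_rng(5)
N, T, beta = 4, 300, 0.1
cum = np.zeros(N)
hint = np.zeros(N)                               # m_t; below we reuse the previous loss vector
for t in range(T):
    w = optimistic_hedge_step(cum, hint, beta)   # play w_t
    losses = rng.uniform(0.0, 1.0, size=N)       # observe ell_t
    cum += losses
    hint = losses                                # m_{t+1} = ell_t, one common optimistic choice
print(w)
```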

B.4 Proof of Theorem 6

We first show ft(𝐱t)ft(𝐱t,i)𝐰t,tt,if_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t,i})\leq\left\langle\mathbf{w}_{t},\boldsymbol{\ell}_{t}\right\rangle-\ell_{t,i} so that Lemma 5.2 can be invoked to bound the regret of the meta algorithm. We start from gsc-convexity,

ft(𝐱t)ft(𝐱t,i)ft(𝐱t),Exp𝐱t1𝐱t,i=ft(𝐱t),j=1Nwt,jExp𝐱t1𝐱t,jft(𝐱t),Exp𝐱t1𝐱t,i=j=1Nwt,jt,jt,i=𝐰t,tt,i,\begin{split}&f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t,i})\leq-\left\langle\nabla f_{t}(\mathbf{x}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle\\ =&\left\langle\nabla f_{t}(\mathbf{x}_{t}),\sum_{j=1}^{N}w_{t,j}{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,j}\right\rangle-\left\langle\nabla f_{t}(\mathbf{x}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle\\ =&\sum_{j=1}^{N}w_{t,j}\ell_{t,j}-\ell_{t,i}=\left\langle\mathbf{w}_{t},\boldsymbol{\ell}_{t}\right\rangle-\ell_{t,i},\end{split} (43)

where the first equality holds because the gradient of \sum_{i=1}^{N}w_{t,i}d(\mathbf{x},\mathbf{x}_{t,i})^{2} vanishes at \mathbf{x}_{t}. Now we can bound the regret as

t=1Tft(𝐱t)ft(𝐱t,i)t=1T𝐰t,tt,i2+lnNβ+βt=1Tt𝐦t214βt=2T𝐰t𝐰t112\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t,i})\leq\sum_{t=1}^{T}\left\langle\mathbf{w}_{t},\boldsymbol{\ell}_{t}\right\rangle-\ell_{t,i}\\ \leq&\frac{2+\ln N}{\beta}+\beta\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}-\frac{1}{4\beta}\sum_{t=2}^{T}\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}\\ \end{split} (44)

It now suffices to bound t𝐦t2\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2} in terms of the gradient-variation VTV_{T} and 𝐰t𝐰t112\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}. We start with the definition of the infinity norm.

t𝐦t2=maxi[N](t,imt,i)2=maxi[N](ft(𝐱t),Exp𝐱t1𝐱t,ift1(𝐱¯t),Exp𝐱¯t1𝐱t,i)2=maxi[N](.ft(𝐱t),Exp𝐱t1𝐱t,ift1(𝐱t),Exp𝐱t1𝐱t,i+ft1(𝐱t),Exp𝐱t1𝐱t,iΓ𝐱¯t𝐱tft1(𝐱¯t),Exp𝐱t1𝐱t,i+Γ𝐱¯t𝐱tft1(𝐱¯t),Exp𝐱t1𝐱t,iΓ𝐱¯t𝐱tft1(𝐱¯t),Γ𝐱¯t𝐱tExp𝐱¯t1𝐱t,i.)2(1)3maxi[N](.ft(𝐱t)ft1(𝐱t),Exp𝐱t1𝐱t,i2+ft1(𝐱t)Γ𝐱¯t𝐱tft1(𝐱¯t),Exp𝐱t1𝐱t,i2+Γ𝐱¯t𝐱tft1(𝐱¯t),Exp𝐱t1𝐱t,iΓ𝐱¯t𝐱tExp𝐱¯t1𝐱t,i2.)(2)3(D2sup𝐱𝒩ft(𝐱)ft1(𝐱)2+D2L2d(𝐱t,𝐱¯t)2+G2Exp𝐱t1(𝐱t,i)Γ𝐱¯t𝐱tExp𝐱¯t1(𝐱t,i)2)\begin{split}&\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}\\ =&\max_{i\in[N]}(\ell_{t,i}-m_{t,i})^{2}\\ =&\max_{i\in[N]}\left(\left\langle\nabla f_{t}(\mathbf{x}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle-\left\langle\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),{\textnormal{Exp}}_{\bar{\mathbf{x}}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle\right)^{2}\\ =&\max_{i\in[N]}\Bigl{(}\Bigl{.}\left\langle\nabla f_{t}(\mathbf{x}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle-\left\langle\nabla f_{t-1}(\mathbf{x}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle+\left\langle\nabla f_{t-1}(\mathbf{x}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle\\ &-\left\langle\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle+\left\langle\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle-\left\langle\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}{\textnormal{Exp}}_{\bar{\mathbf{x}}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle\Bigl{.}\Bigl{)}^{2}\\ \stackrel{{\scriptstyle\rm(1)}}{{\leq}}&3\max_{i\in[N]}\Bigl{(}\Bigl{.}\left\langle\nabla f_{t}(\mathbf{x}_{t})-\nabla f_{t-1}(\mathbf{x}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle^{2}+\left\langle\nabla f_{t-1}(\mathbf{x}_{t})-\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle^{2}\\ &+\left\langle\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}\nabla f_{t-1}(\bar{\mathbf{x}}_{t}),{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}\mathbf{x}_{t,i}-\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}{\textnormal{Exp}}_{\bar{\mathbf{x}}_{t}}^{-1}\mathbf{x}_{t,i}\right\rangle^{2}\Bigl{.}\Bigl{)}\\ \stackrel{{\scriptstyle\rm(2)}}{{\leq}}&3\left(D^{2}\sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2}+D^{2}L^{2}d(\mathbf{x}_{t},\bar{\mathbf{x}}_{t})^{2}+G^{2}\|{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}(\mathbf{x}_{t,i})-\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}{\textnormal{Exp}}_{\bar{\mathbf{x}}_{t}}^{-1}(\mathbf{x}_{t,i})\|^{2}\right)\end{split} (45)

where the first inequality relies on the fact that (a+b+c)23(a2+b2+c2)(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2}) holds for any a,b,ca,b,c\in\mathbb{R}, and the second one follows from Assumptions 2 and 3, LL-gsc-smoothness, and Hölder’s inequality.

By Lemma E.12, on a Hadamard manifold with sectional curvature lower bounded by κ-\kappa for some κ>0\kappa>0, the function h(𝐱)12d(𝐱,𝐱t,i)2h(\mathbf{x})\coloneqq\frac{1}{2}d(\mathbf{x},\mathbf{x}_{t,i})^{2} is κDtanh(κD)\frac{\sqrt{\kappa}D}{\tanh(\sqrt{\kappa}D)}-smooth (which is exactly ζ{\zeta} as in Definition 1). Since h(𝐱)=Exp𝐱1(𝐱t,i)\nabla h(\mathbf{x})=-{\textnormal{Exp}}_{\mathbf{x}}^{-1}(\mathbf{x}_{t,i}), gsc-smoothness gives

Exp𝐱t1(𝐱t,i)Γ𝐱¯t𝐱tExp𝐱¯t1(𝐱t,i)=h(𝐱t)+Γ𝐱¯t𝐱th(𝐱¯t)ζd(𝐱t,𝐱¯t).\|{\textnormal{Exp}}_{\mathbf{x}_{t}}^{-1}(\mathbf{x}_{t,i})-\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}{\textnormal{Exp}}_{\bar{\mathbf{x}}_{t}}^{-1}(\mathbf{x}_{t,i})\|=\|-\nabla h(\mathbf{x}_{t})+\Gamma_{\bar{\mathbf{x}}_{t}}^{\mathbf{x}_{t}}\nabla h(\bar{\mathbf{x}}_{t})\|\leq{\zeta}d(\mathbf{x}_{t},\bar{\mathbf{x}}_{t}). (46)

We need to bound d(𝐱t,𝐱¯t)d(\mathbf{x}_{t},\bar{\mathbf{x}}_{t}) in terms of 𝐰t𝐰t11\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1} to make full use of the negative term in Lemma 5.2. By Lemma E.3

d(𝐱t,𝐱¯t)i=1Nwt,id(𝐱t,i,𝐱t,i)+D𝐰t𝐰t11=D𝐰t𝐰t11.d(\mathbf{x}_{t},\bar{\mathbf{x}}_{t})\leq\sum_{i=1}^{N}w_{t,i}\cdot d(\mathbf{x}_{t,i},\mathbf{x}_{t,i})+D\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}=D\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}. (47)

Combining Equations (44), (45), (46) and (47), we have

t=1Tft(𝐱t)ft(𝐱t,i)2+lnNβ+3βD2(VT+G2)+t=2T(3β(D4L2+D2G2ζ2)14β)𝐰t𝐰t112,\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t,i})\leq\frac{2+\ln N}{\beta}+3\beta D^{2}(V_{T}+G^{2})+\sum_{t=2}^{T}\left(3\beta(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})-\frac{1}{4\beta}\right)\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}, (48)

where the 3βD2G23\beta D^{2}G^{2} term arises because the summation defining VTV_{T} starts from t=2t=2, while 𝐰0=𝐰1\mathbf{w}_{0}=\mathbf{w}_{1} ensures 𝐱1=𝐱¯1\mathbf{x}_{1}=\bar{\mathbf{x}}_{1} and thus d(𝐱1,𝐱¯1)=0d(\mathbf{x}_{1},\bar{\mathbf{x}}_{1})=0.
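For concreteness, substituting Equation (46) into Equation (45) and then applying Equation (47) shows that, for every t2t\geq 2,

\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}\leq 3D^{2}\sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2}+3\left(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2}\right)\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2};

summing over tt (the t=1t=1 term contributes the additional 3D2G23D^{2}G^{2} discussed above) and plugging the result into Equation (44) gives Equation (48).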

B.5 Proof of Theorem 7

The optimal step size, according to Theorem 5, is

η=min{δ1+(1+2ζδ2L2)12,D2+2DPT2ζVT}.\eta^{\star}=\min\left\{\frac{\delta}{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}},\sqrt{\frac{D^{2}+2DP_{T}}{2{\zeta}V_{T}}}\right\}.

Based on Assumption 3, VTV_{T} is upper bounded as VT=t=2Tsup𝐱𝒩ft(𝐱)ft1(𝐱)24G2TV_{T}=\sum_{t=2}^{T}\sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2}\leq 4G^{2}T. Therefore, η\eta^{\star} can be bounded by

D28ζG2Tηδ1+(1+2ζδ2L2)12.\sqrt{\frac{D^{2}}{8{\zeta}G^{2}T}}\leq\eta^{\star}\leq\frac{\delta}{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}.

According to the construction of \mathcal{H},

min=D28ζG2T,max2δ1+(1+2ζδ2L2)12.\min\mathcal{H}=\sqrt{\frac{D^{2}}{8{\zeta}G^{2}T}},\qquad\max\mathcal{H}\geq 2\frac{\delta}{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}.

Therefore, there always exists k[N]k\in[N] such that ηkη2ηk\eta_{k}\leq\eta^{\star}\leq 2\eta_{k}. We can bound the regret of the kk-th expert as

t=1Tft(𝐱t,k)t=1Tft(𝐮t)ηkζVT+D2+2DPT2ηkηζVT+D2+2DPTηζVTD2+2DPT2ζVT+(D2+2DPT)max{2ζVTD2+2DPT,1+(1+2ζδ2L2)12δ}322(D2+2DPT)ζVT+(D2+2DPT)1+(1+2ζδ2L2)12δ.\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\leq\eta_{k}{\zeta}V_{T}+\frac{D^{2}+2DP_{T}}{2\eta_{k}}\leq\eta^{\star}{\zeta}V_{T}+\frac{D^{2}+2DP_{T}}{\eta^{\star}}\\ \leq&\zeta V_{T}\sqrt{\frac{D^{2}+2DP_{T}}{2\zeta V_{T}}}+{(D^{2}+2DP_{T})}\cdot\max\left\{\sqrt{\frac{2\zeta V_{T}}{D^{2}+2DP_{T}}},\frac{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}{\delta}\right\}\\ \leq&\frac{3}{2}\sqrt{2(D^{2}+2DP_{T}){\zeta}V_{T}}+(D^{2}+2DP_{T})\frac{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}{\delta}.\\ \end{split} (49)

Since the dynamic regret can be decomposed as the sum of the meta-regret and the expert-regret,

t=1Tft(𝐱t)t=1Tft(𝐮t)=t=1Tft(𝐱t)t=1Tft(𝐱t,k)+t=1Tft(𝐱t,k)t=1Tft(𝐮t).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})+\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t}).

Applying Theorem 6 with β112(D4L2+D2G2ζ2)\beta\leq\frac{1}{\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}}, we have

(3β(D4L2+D2G2ζ2)14β)0\left(3\beta(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})-\frac{1}{4\beta}\right)\leq 0

and

t=1Tft(𝐱t)t=1Tft(𝐱t,i)2+lnNβ+3βD2(VT+G2).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq\frac{2+\ln N}{\beta}+3\beta D^{2}(V_{T}+G^{2}).

We need to consider two cases based on the value of β\beta.

If 2+lnN3D2(VT+G2)112(D4L2+D2G2ζ2)\sqrt{\frac{2+\ln N}{3D^{2}(V_{T}+G^{2})}}\leq\frac{1}{\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}}, then

t=1Tft(𝐱t)t=1Tft(𝐱t,i)23D2(VT+G2)(2+lnN).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq 2\sqrt{3D^{2}(V_{T}+G^{2})(2+\ln N)}.

Otherwise, we have

t=1Tft(𝐱t)t=1Tft(𝐱t,i)2(2+lnN)12(D4L2+D2G2ζ2).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq 2(2+\ln N)\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}.

In sum,

t=1Tft(𝐱t)t=1Tft(𝐱t,i)max{23D2(VT+G2)(2+lnN),2(2+lnN)12(D4L2+D2G2ζ2)}.\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\\ \leq&\max\left\{2\sqrt{3D^{2}(V_{T}+G^{2})(2+\ln N)},2(2+\ln N)\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}\right\}.\\ \end{split} (50)

Combining Equations (50) and (49), we have

t=1Tft(𝐱t)t=1Tft(𝐮t)max{23D2(VT+G2)(2+lnN),2(2+lnN)12(D4L2+D2G2ζ2)}+322(D2+2DPT)ζVT+(D2+2DPT)1+(1+2ζδ2L2)12δ=O((VT+ζ2lnN)lnN)+O(ζ(VT+(1+PT)/δ2)(1+PT))=O(ζ(VT+(1+PT)/δ2)(1+PT)),\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ \leq&\max\left\{2\sqrt{3D^{2}(V_{T}+G^{2})(2+\ln N)},2(2+\ln N)\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}\right\}\\ &+\frac{3}{2}\sqrt{2(D^{2}+2DP_{T}){\zeta}V_{T}}+(D^{2}+2DP_{T})\frac{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}{\delta}\\ =&O\left(\sqrt{(V_{T}+{\zeta}^{2}\ln N)\ln N}\right)+O\left(\sqrt{{\zeta}(V_{T}+(1+P_{T})/\delta^{2})(1+P_{T})}\right)\\ =&{O}\left(\sqrt{{\zeta}(V_{T}+(1+P_{T})/\delta^{2})(1+P_{T})}\right),\end{split} (51)

where we use O(){O}(\cdot) to hide O(loglogT)O(\log\log T) factors, following Luo and Schapire (2015) and Zhao et al. (2020); since N=O(logT)N=O(\log T), we have lnN=O(loglogT)\ln N=O(\log\log T).

Appendix C Omitted Proof for Section 6

C.1 Proof of Lemma 6.1

By LL-gsc-smoothness, we have

f(𝐲)f(𝐱)+f(𝐱),Exp𝐱1(𝐲)+Ld(𝐱,𝐲)22.f(\mathbf{y})\leq f(\mathbf{x})+\left\langle\nabla f(\mathbf{x}),{\textnormal{Exp}}_{\mathbf{x}}^{-1}(\mathbf{y})\right\rangle+\frac{L\cdot d(\mathbf{x},\mathbf{y})^{2}}{2}.

Setting 𝐲=Exp𝐱(1Lf(𝐱))\mathbf{y}={\textnormal{Exp}}_{\mathbf{x}}\left(-\frac{1}{L}\nabla f(\mathbf{x})\right), we have

0f(𝐲)f(𝐱)1Lf(𝐱)2+L21L2f(𝐱)2=f(𝐱)12Lf(𝐱)2,\begin{split}0\leq&f(\mathbf{y})\leq f(\mathbf{x})-\frac{1}{L}\|\nabla f(\mathbf{x})\|^{2}+\frac{L}{2}\cdot\frac{1}{L^{2}}\|\nabla f(\mathbf{x})\|^{2}\\ =&f(\mathbf{x})-\frac{1}{2L}\|\nabla f(\mathbf{x})\|^{2},\end{split} (52)

where we use the non-negativity of ff. The above inequality is equivalent to

f(𝐱)22Lf(𝐱),\|\nabla f(\mathbf{x})\|^{2}\leq 2L\cdot f(\mathbf{x}),

in which the constant improves upon that of Srebro et al. (2010) by a factor of two.

C.2 Proof of Lemma 6.2

The proof is similar to that of Theorem 1. Let 𝐱t+1=Exp𝐱t(ηft(𝐱t))\mathbf{x}_{t+1}^{\prime}={\textnormal{Exp}}_{\mathbf{x}_{t}}\left(-\eta\nabla f_{t}(\mathbf{x}_{t})\right); then, analogously to Equation (10), we have

ft(𝐱t)ft(𝐮t)12η(Exp𝐱t1𝐮t2Exp𝐱t+11𝐮t+12+2DExp𝐮t1𝐮t+1)+ηζft(𝐱t)2212η(Exp𝐱t1𝐮t2Exp𝐱t+11𝐮t+12+2DExp𝐮t1𝐮t+1)+ηζLft(𝐱t),\begin{split}f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})\leq&\frac{1}{2\eta}\left(\|\textnormal{Exp}^{-1}_{\mathbf{x}_{t}}\mathbf{u}_{t}\|^{2}-\|\textnormal{Exp}^{-1}_{\mathbf{x}_{t+1}}\mathbf{u}_{t+1}\|^{2}+2D\|\textnormal{Exp}^{-1}_{\mathbf{u}_{t}}\mathbf{u}_{t+1}\|\right)+\frac{\eta{\zeta}\|\nabla f_{t}(\mathbf{x}_{t})\|^{2}}{2}\\ \leq&\frac{1}{2\eta}\left(\|\textnormal{Exp}^{-1}_{\mathbf{x}_{t}}\mathbf{u}_{t}\|^{2}-\|\textnormal{Exp}^{-1}_{\mathbf{x}_{t+1}}\mathbf{u}_{t+1}\|^{2}+2D\|\textnormal{Exp}^{-1}_{\mathbf{u}_{t}}\mathbf{u}_{t+1}\|\right)+\eta{\zeta}Lf_{t}(\mathbf{x}_{t}),\end{split} (53)

where for the second inequality we apply Lemma 6.1. WLOG, we can assume 𝐮T+1=𝐮T\mathbf{u}_{T+1}=\mathbf{u}_{T} and sum from t=1t=1 to TT:

t=1T(ft(𝐱t)ft(𝐮t))D2+2DPT2η+ηζLt=1Tft(𝐱t).\sum_{t=1}^{T}\left(f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})\right)\leq\frac{D^{2}+2DP_{T}}{2\eta}+\eta{\zeta}L\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t}).

Writing t=1Tft(𝐱t)=t=1T(ft(𝐱t)ft(𝐮t))+FT\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})=\sum_{t=1}^{T}\left(f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})\right)+F_{T} on the right-hand side and rearranging, we get

t=1T(ft(𝐱t)ft(𝐮t))D2+2DPT2η(1ηζL)+ηζLt=1Tft(𝐮t)1ηζL=D2+2DPT2η(1ηζL)+ηζLFT1ηζLD2+2DPTη+2ηζLFT=O(1+PTη+ηFT).\begin{split}\sum_{t=1}^{T}\left(f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{u}_{t})\right)\leq&\frac{D^{2}+2DP_{T}}{2\eta(1-\eta{\zeta}L)}+\frac{\eta{\zeta}L\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})}{1-\eta{\zeta}L}\\ =&\frac{D^{2}+2DP_{T}}{2\eta(1-\eta{\zeta}L)}+\frac{\eta{\zeta}LF_{T}}{1-\eta{\zeta}L}\\ \leq&\frac{D^{2}+2DP_{T}}{\eta}+2\eta{\zeta}LF_{T}\\ =&O\left(\frac{1+P_{T}}{\eta}+\eta F_{T}\right).\\ \end{split} (54)

Here η12ζL\eta\leq\frac{1}{2\zeta L} is used to obtain the second inequality.

C.3 Proof of Lemma 6.3

We again apply Lemma 5.2, with the surrogate loss t,i=ft(𝐱t),Exp𝐱t1𝐱t,i\ell_{t,i}=\left\langle\nabla f_{t}(\mathbf{x}_{t}),\textnormal{Exp}^{-1}_{\mathbf{x}_{t}}\mathbf{x}_{t,i}\right\rangle and the optimism mt,i=0m_{t,i}=0 for any i[N]i\in[N]. In this way,

t=1Tft(𝐱t)t=1Tft(𝐱t,i)t=1Tft(𝐱t),Exp𝐱t1𝐱t,i=t=1T𝐰t,tt,i2+lnNβ+βt=1Tt214βt=2T𝐰t𝐰t1122+lnNβ+βD2t=1Tft(𝐱t)22+lnNβ+2βD2Lt=1Tft(𝐱t)=2+lnNβ+2βD2LF¯T,\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\\ \leq&-\sum_{t=1}^{T}\left\langle\nabla f_{t}(\mathbf{x}_{t}),\textnormal{Exp}^{-1}_{\mathbf{x}_{t}}\mathbf{x}_{t,i}\right\rangle=\sum_{t=1}^{T}\left\langle\mathbf{w}_{t},\boldsymbol{\ell}_{t}\right\rangle-\ell_{t,i}\\ \leq&\frac{2+\ln N}{\beta}+\beta\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}\|_{\infty}^{2}-\frac{1}{4\beta}\sum_{t=2}^{T}\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2}\\ \leq&\frac{2+\ln N}{\beta}+\beta D^{2}\sum_{t=1}^{T}\|\nabla f_{t}(\mathbf{x}_{t})\|^{2}\\ \leq&\frac{2+\ln N}{\beta}+2\beta D^{2}L\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})=\frac{2+\ln N}{\beta}+2\beta D^{2}L\bar{F}_{T},\end{split} (55)

where the first inequality is due to gsc-convexity, the second follows from Lemma 5.2, the third follows from Assumption 2 and Hölder’s inequality (dropping the non-positive last term), and the last is due to Lemma 6.1. By setting β=2+lnN2LD2F¯T\beta=\sqrt{\frac{2+\ln N}{2LD^{2}\bar{F}_{T}}}, the regret of the meta algorithm is upper bounded by

t=1Tft(𝐱t)t=1Tft(𝐱t,i)8D2L(2+lnN)F¯T=8D2L(2+lnN)t=1Tft(𝐱t).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq\sqrt{8D^{2}L(2+\ln N)\bar{F}_{T}}=\sqrt{8D^{2}L(2+\ln N)\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})}. (56)

Although F¯T\bar{F}_{T} is unknown in advance, similar to the case of Optimistic Hedge, we can use techniques such as the doubling trick or a time-varying step size βt=O(11+F¯t)\beta_{t}=O\left(\frac{1}{\sqrt{1+\bar{F}_{t}}}\right) to overcome this hurdle.
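To make these two standard remedies concrete, the following is a minimal Python sketch (it is not part of the algorithms in this paper; the constant c and the restart bookkeeping are illustrative assumptions):

import math

def time_varying_beta(c, fbar_t):
    # beta_t = c / sqrt(1 + cumulative loss observed so far); c is a placeholder
    # constant that would be chosen from the problem parameters (e.g., D, L, ln N).
    return c / math.sqrt(1.0 + fbar_t)

def doubling_trick_betas(losses, c):
    # Guess a budget B for the unknown cumulative loss, tune beta for that budget,
    # and double B (conceptually restarting the meta-algorithm) whenever it is exceeded.
    budget, running, betas = 1.0, 0.0, []
    for loss in losses:
        running += loss
        if running > budget:
            budget *= 2.0
            running = loss  # restart the accumulation for the new phase
        betas.append(c / math.sqrt(budget))
    return betas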

The RHS of Equation (56) depends on the cumulative loss of the plays 𝐱t\mathbf{x}_{t}, which is not what we want to bound against. Here we apply the algebraic fact that xyaxx-y\leq\sqrt{ax} implies xya+ayx-y\leq a+\sqrt{ay} for any non-negative x,yx,y and aa. Then

t=1Tft(𝐱t)t=1Tft(𝐱t,i)8D2L(2+lnN)+8D2L(2+lnN)F¯T,i\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq 8D^{2}L(2+\ln N)+\sqrt{8D^{2}L(2+\ln N)\bar{F}_{T,i}} (57)

where we recall that F¯T,i=t=1Tft(𝐱t,i)\bar{F}_{T,i}=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i}).
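For completeness, the algebraic fact used above can be verified directly. If xyx\leq y the implication is trivial; otherwise xyaxx-y\leq\sqrt{ax} means that x\sqrt{x} satisfies the quadratic inequality xaxy0x-\sqrt{a}\sqrt{x}-y\leq 0, so that

\sqrt{x}\leq\frac{\sqrt{a}+\sqrt{a+4y}}{2}\qquad\Longrightarrow\qquad x-y\leq\sqrt{a}\cdot\sqrt{x}\leq\frac{a+\sqrt{a^{2}+4ay}}{2}\leq\frac{a+a+2\sqrt{ay}}{2}=a+\sqrt{ay},

where the last inequality uses a2+4aya+2ay\sqrt{a^{2}+4ay}\leq a+2\sqrt{ay}, which follows by squaring both sides.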

C.4 Proof of Theorem 8

Recall the regret of the meta algorithm as in Lemma 6.3:

t=1Tft(𝐱t)t=1Tft(𝐱t,i)8D2L(2+lnN)+8D2L(2+lnN)F¯T,i.\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq 8D^{2}L(2+\ln N)+\sqrt{8D^{2}L(2+\ln N)\bar{F}_{T,i}}. (58)

On the other hand, we know the regret can be decomposed as

t=1Tft(𝐱t)t=1Tft(𝐮t)=t=1Tft(𝐱t)t=1Tft(𝐱t,i)+t=1Tft(𝐱t,i)t=1Tft(𝐮t)=t=1Tft(𝐱t)t=1Tft(𝐱t,i)+F¯T,iFT.\begin{split}\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})=&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})+\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ =&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})+\bar{F}_{T,i}-F_{T}.\end{split} (59)

We first show that there exists an almost optimal step size and bound the regret of the corresponding expert. That regret immediately provides an upper bound on F¯T,i\bar{F}_{T,i} in terms of FTF_{T}. This argument eliminates the dependence on F¯T,i\bar{F}_{T,i} and leads to a regret bound depending solely on FTF_{T} and PTP_{T}.

Now we bound the regret of the best expert. By Lemma 6.2, the optimal step size is η=min{12ζL,D2+2DPT2ζLFT}\eta^{\star}=\min\left\{\frac{1}{2{\zeta}L},\sqrt{\frac{D^{2}+2DP_{T}}{2{\zeta}LF_{T}}}\right\}. According to Assumptions 2 and 3, FTGDTF_{T}\leq GDT. Thus η\eta^{\star} is bounded by

D4LGTη12ζL.\sqrt{\frac{D}{4LGT}}\leq\eta^{\star}\leq\frac{1}{2{\zeta}L}.

Due to our construction of \mathcal{H}, there exists k[N]k\in[N] such that ηkη2ηk\eta_{k}\leq\eta^{\star}\leq 2\eta_{k}.
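For intuition, such a grid \mathcal{H} can be realized as a geometric sequence; the following Python sketch (the function name and the numerical endpoints are illustrative, not taken from the paper) checks the 2-approximation property used here:

import math

def build_step_size_grid(eta_min, eta_max):
    # Geometric grid {eta_min * 2^k} covering [eta_min, eta_max]: for any eta_star in
    # this range there is a grid point eta_k with eta_k <= eta_star <= 2 * eta_k.
    n = max(1, math.ceil(math.log2(eta_max / eta_min)) + 1)
    return [eta_min * (2.0 ** k) for k in range(n)]

grid = build_step_size_grid(1e-3, 0.5)
eta_star = 0.07
eta_k = max(eta for eta in grid if eta <= eta_star)
assert eta_k <= eta_star <= 2.0 * eta_k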

According to Lemma 6.2,

t=1Tft(𝐱t,k)t=1Tft(𝐮t)D2+2DPT2ηk(1ηkζL)+ηkζLFT1ηkζLD2+2DPTηk+2ηkζLFT2(D2+2DPT)η+2ηζLFT2(D2+2DPT)(2ζL+2ζLFTD2+2DPT)+2ζLFTD2+2DPT2ζLFT=4ζL(D2+2DPT)+32ζLFT(D2+2DPT)2(16ζ2L2(D2+2DPT)2+18ζLFT(D2+2DPT))\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ \leq&\frac{D^{2}+2DP_{T}}{2\eta_{k}(1-\eta_{k}{\zeta}L)}+\frac{\eta_{k}{\zeta}LF_{T}}{1-\eta_{k}{\zeta}L}\leq\frac{D^{2}+2DP_{T}}{\eta_{k}}+2\eta_{k}{\zeta}LF_{T}\\ \leq&\frac{2(D^{2}+2DP_{T})}{\eta^{\star}}+2\eta^{\star}{\zeta}LF_{T}\\ \leq&2(D^{2}+2DP_{T})\left(2{\zeta}L+\sqrt{\frac{2{\zeta}LF_{T}}{D^{2}+2DP_{T}}}\right)+2{\zeta}LF_{T}\cdot\sqrt{\frac{D^{2}+2DP_{T}}{2{\zeta}LF_{T}}}\\ =&4{\zeta}L(D^{2}+2DP_{T})+3\sqrt{2{\zeta}LF_{T}(D^{2}+2DP_{T})}\\ \leq&\sqrt{2\left(16{\zeta}^{2}L^{2}(D^{2}+2DP_{T})^{2}+18{\zeta}LF_{T}(D^{2}+2DP_{T})\right)}\end{split} (60)

where we apply ηD2+2DPT2ζLFT\eta^{\star}\leq\sqrt{\frac{D^{2}+2DP_{T}}{2\zeta LF_{T}}}, 1η2ζL+2ζLFTD2+2DPT\frac{1}{\eta^{\star}}\leq 2{\zeta}L+\sqrt{\frac{2\zeta LF_{T}}{D^{2}+2DP_{T}}} and a+b2(a+b)\sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)}.

Now, combining Equations (58), (59), and (60),

t=1Tft(𝐱t)t=1Tft(𝐮t)8D2L(2+lnN)+8D2L(2+lnN)F¯T,k+2(16ζ2L2(D2+2DPT)2+18ζLFT(D2+2DPT))8D2L(2+lnN)+8D2L(2+lnN)(FT+2(16ζ2L2(D2+2DPT)2+18ζLFT(D2+2DPT)))+2(16ζ2L2(D2+2DPT)2+18ζLFT(D2+2DPT))=O(ζ(ζ(1+PT)+FT)(PT+1)).\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ \leq&8D^{2}L(2+\ln N)+\sqrt{8D^{2}L(2+\ln N)\bar{F}_{T,k}}+\sqrt{2\left(16{\zeta}^{2}L^{2}(D^{2}+2DP_{T})^{2}+18{\zeta}LF_{T}(D^{2}+2DP_{T})\right)}\\ \leq&8D^{2}L(2+\ln N)+\sqrt{8D^{2}L(2+\ln N)\left(F_{T}+\sqrt{2\left(16{\zeta}^{2}L^{2}(D^{2}+2DP_{T})^{2}+18{\zeta}LF_{T}(D^{2}+2DP_{T})\right)}\right)}\\ &+\sqrt{2\left(16{\zeta}^{2}L^{2}(D^{2}+2DP_{T})^{2}+18{\zeta}LF_{T}(D^{2}+2DP_{T})\right)}\\ =&O(\sqrt{\zeta({\zeta}(1+P_{T})+F_{T})(P_{T}+1)}).\end{split} (61)

where we again use O()O(\cdot) to hide the loglogT\log\log T term.

Appendix D Omitted Proof for Section 7

D.1 Necessity of Best-of-both-worlds Bound

We highlight the necessity of achieving a best-of-both-worlds bound by computing the Fréchet mean in the online setting on the dd-dimensional Poincaré ball. As a set, the Poincaré ball is the open unit ball in Euclidean space, but its Riemannian metric blows up near the boundary:

𝐮,𝐯𝐱=4𝐮,𝐯2(1𝐱22)2\left\langle\mathbf{u},\mathbf{v}\right\rangle_{\mathbf{x}}=\frac{4\left\langle\mathbf{u},\mathbf{v}\right\rangle_{2}}{(1-\|\mathbf{x}\|_{2}^{2})^{2}}

and it has constant sectional curvature 1-1. We use 𝟎\mathbf{0} to denote the origin of the Poincaré ball and 𝐞i\mathbf{e}_{i} to denote the ii-th standard basis vector. The Poincaré ball has the following properties (Lou et al., 2020):

d(𝐱,𝐲)=arcosh(1+2𝐱𝐲22(1𝐱22)(1𝐲22))Exp𝟎1𝐲=arctanh(𝐲2)𝐲𝐲2.\begin{split}&d(\mathbf{x},\mathbf{y})=\operatorname{arcosh}\left(1+\frac{2\|\mathbf{x}-\mathbf{y}\|_{2}^{2}}{(1-\|\mathbf{x}\|_{2}^{2})(1-\|\mathbf{y}\|_{2}^{2})}\right)\\ &\textnormal{Exp}^{-1}_{\mathbf{0}}\mathbf{y}=\operatorname{arctanh}(\|\mathbf{y}\|_{2})\frac{\mathbf{y}}{\|\mathbf{y}\|_{2}}.\end{split}
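As a quick numerical sanity check of these formulas (a NumPy sketch; the factor of two below comes from the metric 4,4\left\langle\cdot,\cdot\right\rangle at the origin):

import numpy as np

def poincare_distance(x, y):
    # Geodesic distance on the Poincare ball via the arcosh formula above.
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

def log_map_at_origin(y):
    # Exp_0^{-1}(y) = arctanh(||y||) * y / ||y||.
    norm_y = np.linalg.norm(y)
    if norm_y == 0.0:
        return np.zeros_like(y)
    return np.arctanh(norm_y) * y / norm_y

# The Riemannian length of Exp_0^{-1}(y) at the origin equals d(0, y).
y = np.array([0.3, 0.1])
assert np.isclose(2.0 * np.linalg.norm(log_map_at_origin(y)),
                  poincare_distance(np.zeros(2), y))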

Now consider the following loss function

ft(𝐱)=i=12dd(𝐱,𝐱t,i)22df_{t}(\mathbf{x})=\sum_{i=1}^{2d}\frac{d(\mathbf{x},\mathbf{x}_{t,i})^{2}}{2d}

where 𝐱t,i=t2T𝐞i\mathbf{x}_{t,i}=\frac{t}{2T}\mathbf{e}_{i} for 1id1\leq i\leq d and 𝐱t,i=t2T𝐞id\mathbf{x}_{t,i}=-\frac{t}{2T}\mathbf{e}_{i-d} for d+1i2dd+1\leq i\leq 2d. We choose 𝒩\mathcal{N} to be the convex hull of ±12𝐞i\pm\frac{1}{2}\mathbf{e}_{i}, i=1,,di=1,\dots,d, and the comparator is 𝐮t=argmin𝐮t𝒩ft(𝐮t)\mathbf{u}_{t}=\operatorname*{argmin}_{\mathbf{u}_{t}\in\mathcal{N}}f_{t}(\mathbf{u}_{t}), which is the origin 𝟎\mathbf{0} by symmetry. Now we can bound VTV_{T} by

VT=t=2Tsup𝐱𝒩ft(𝐱)ft1(𝐱)2=14d2t=2Tsup𝐱𝒩i=12d(Exp𝐱1𝐱t,iExp𝐱1𝐱t1,i)214d2t=2Tc(i=12dd(𝐱t,i,𝐱t1,i))2=t=2Tc(arcosh(1+2(t2Tt12T)2(1(t2T)2)(1(t12T)2)))2t=2Tc(arcosh(1+89T2))2t=2Tc(89T2+6481T4+169T2)2=t=2TcO(1T2)=O(1T),\begin{split}V_{T}=&\sum_{t=2}^{T}\sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2}\\ =&\frac{1}{4d^{2}}\sum_{t=2}^{T}\sup_{\mathbf{x}\in\mathcal{N}}\left\|\sum_{i=1}^{2d}\left(\textnormal{Exp}^{-1}_{\mathbf{x}}\mathbf{x}_{t,i}-\textnormal{Exp}^{-1}_{\mathbf{x}}\mathbf{x}_{t-1,i}\right)\right\|^{2}\\ \leq&\frac{1}{4d^{2}}\sum_{t=2}^{T}c\cdot\left(\sum_{i=1}^{2d}d(\mathbf{x}_{t,i},\mathbf{x}_{t-1,i})\right)^{2}=\sum_{t=2}^{T}c\left(\operatorname{arcosh}\left(1+\frac{2(\frac{t}{2T}-\frac{t-1}{2T})^{2}}{(1-(\frac{t}{2T})^{2})(1-(\frac{t-1}{2T})^{2})}\right)\right)^{2}\\ \leq&\sum_{t=2}^{T}c\left(\operatorname{arcosh}\left(1+\frac{8}{9T^{2}}\right)\right)^{2}\\ \leq&\sum_{t=2}^{T}c\cdot\left(\frac{8}{9T^{2}}+\sqrt{\frac{64}{81T^{4}}+\frac{16}{9T^{2}}}\right)^{2}=\sum_{t=2}^{T}c\cdot O\left(\frac{1}{{T^{2}}}\right)=O\left(\frac{1}{T}\right),\\ \end{split} (62)

where the first inequality is due to the triangle inequality and Lemma E.15; here cc is a constant depending on the diameter of 𝒩\mathcal{N} and the sectional curvature of \mathcal{M}. The second inequality is due to tTt\leq T, while the third follows from arcosh(1+x)x+x2+2x\operatorname{arcosh}(1+x)\leq x+\sqrt{x^{2}+2x}.
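For completeness, the last elementary inequality follows from the closed form of arcosh\operatorname{arcosh} and ln(1+u)u\ln(1+u)\leq u:

\operatorname{arcosh}(1+x)=\ln\left(1+x+\sqrt{(1+x)^{2}-1}\right)=\ln\left(1+x+\sqrt{x^{2}+2x}\right)\leq x+\sqrt{x^{2}+2x},\qquad x\geq 0.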

Similarly, we can evaluate FTF_{T}

FT=t=1Tft(𝐮t)=t=1Tft(𝟎)=t=1Ti=12dd(𝟎,𝐱t,i)22d=t=1T(arcosh(1+t24T21t24T2))2=t=1T4arctanh2(t2T)=Θ(T),\begin{split}F_{T}=&\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})=\sum_{t=1}^{T}f_{t}(\mathbf{0})=\sum_{t=1}^{T}\sum_{i=1}^{2d}\frac{d(\mathbf{0},\mathbf{x}_{t,i})^{2}}{2d}\\ =&\sum_{t=1}^{T}\left(\operatorname{arcosh}\left(\frac{1+\frac{t^{2}}{4T^{2}}}{1-\frac{t^{2}}{4T^{2}}}\right)\right)^{2}=\sum_{t=1}^{T}4\operatorname{arctanh}^{2}\left(\frac{t}{2T}\right)=\Theta(T),\end{split} (63)

where we use arcosh(1+a21a2)=2arctanh(a)\operatorname{arcosh}\left(\frac{1+a^{2}}{1-a^{2}}\right)=2\operatorname{arctanh}(a); since aarctanh(a)arctanh(12)a\leq\operatorname{arctanh}(a)\leq\operatorname{arctanh}(\frac{1}{2}) for a=t2T(0,12]a=\frac{t}{2T}\in(0,\frac{1}{2}], the sum is lower bounded by t=1Tt2T2=Ω(T)\sum_{t=1}^{T}\frac{t^{2}}{T^{2}}=\Omega(T) and upper bounded by 4arctanh2(12)T4\operatorname{arctanh}^{2}(\frac{1}{2})T.

When the input losses change smoothly, VTFTV_{T}\ll F_{T} and the gradient-variation bound is much tighter than the small-loss bound.

There also exist scenarios in which the small-loss bound is tighter. We again consider computing the Fréchet mean on the Poincaré ball,

ft(𝐱)=i=1nd(𝐱,𝐱t,i)2/n,f_{t}(\mathbf{x})=\sum_{i=1}^{n}d(\mathbf{x},\mathbf{x}_{t,i})^{2}/n,

but assume 𝐱t,i=𝐲i\mathbf{x}_{t,i}=\mathbf{y}_{i} when tt is odd and 𝐱t,i=𝐲i\mathbf{x}_{t,i}=-\mathbf{y}_{i} when tt is even. We restrict 𝐲1,,𝐲n𝔹(𝐞12,Tα)\mathbf{y}_{1},\dots,\mathbf{y}_{n}\in\mathbb{B}(\frac{\mathbf{e}_{1}}{2},T^{-\alpha}), where 𝔹(𝐩,r)\mathbb{B}(\mathbf{p},r) is the geodesic ball centered at 𝐩\mathbf{p} with radius rr, and 𝒩\mathcal{N} is the convex hull of 𝔹(𝐞12,Tα)𝔹(𝐞12,Tα)\mathbb{B}(\frac{\mathbf{e}_{1}}{2},T^{-\alpha})\cup\mathbb{B}(-\frac{\mathbf{e}_{1}}{2},T^{-\alpha}). Since the input sequence is alternating, sup𝐱𝒩ft(𝐱)ft1(𝐱)2\sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2} is constant over time, and we can lower bound it by

sup𝐱𝒩ft(𝐱)ft1(𝐱)2=sup𝐱𝒩1n(i=1nExp𝐱1𝐱t,ii=1nExp𝐱1𝐱t1,i)2=sup𝐱𝒩2ni=1nExp𝐱1𝐲i24n2i=1nExp𝟎1𝐲i2=4n2((i=1nExp𝟎1𝐲i)2+(i=1nExp𝟎1𝐲i)2)4n2(i=1nExp𝟎1𝐲i)24n2narctanh(12Tα)𝐞1𝟎2=16arctanh2(12Tα)16(12Tα32Tα)2=Ω(1).\begin{split}&\sup_{\mathbf{x}\in\mathcal{N}}\|\nabla f_{t}(\mathbf{x})-\nabla f_{t-1}(\mathbf{x})\|^{2}\\ =&\sup_{\mathbf{x}\in\mathcal{N}}\left\|\frac{1}{n}\left(\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{x}}\mathbf{x}_{t,i}-\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{x}}\mathbf{x}_{t-1,i}\right)\right\|^{2}\\ =&\sup_{\mathbf{x}\in\mathcal{N}}\left\|\frac{2}{n}\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{x}}\mathbf{y}_{i}\right\|^{2}\geq\frac{4}{n^{2}}\left\|\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{0}}\mathbf{y}_{i}\right\|^{2}\\ =&\frac{4}{n^{2}}\left(\left\|\left(\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{0}}\mathbf{y}_{i}\right)^{\parallel}\right\|^{2}+\left\|\left(\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{0}}\mathbf{y}_{i}\right)^{\perp}\right\|^{2}\right)\\ \geq&\frac{4}{n^{2}}\left\|\left(\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{0}}\mathbf{y}_{i}\right)^{\parallel}\right\|^{2}\geq\frac{4}{n^{2}}\left\|n\cdot\operatorname{arctanh}\left(\frac{1}{2}-T^{-\alpha}\right)\mathbf{e}_{1}\right\|_{\mathbf{0}}^{2}\\ =&16\operatorname{arctanh}^{2}\left(\frac{1}{2}-T^{-\alpha}\right)\\ \geq&16\left(\frac{\frac{1}{2}-T^{-\alpha}}{\frac{3}{2}-T^{-\alpha}}\right)^{2}=\Omega(1).\\ \end{split} (64)

where we use 𝐚\mathbf{a}^{\parallel} and 𝐚\mathbf{a}^{\perp} to denote components parallel and orthogonal to the direction of 𝐞1\mathbf{e}_{1}, respectively. The key observation is that the lower bound is attained when (i=1nExp𝟎1𝐲i)\left(\sum_{i=1}^{n}\textnormal{Exp}^{-1}_{\mathbf{0}}\mathbf{y}_{i}\right)^{\perp} is zero, and each 𝐲i\mathbf{y}_{i} has the smallest component along 𝐞1\mathbf{e}_{1}, i.e., 𝐲i=(12Tα)𝐞1\mathbf{y}_{i}=\left(\frac{1}{2}-T^{-\alpha}\right)\mathbf{e}_{1}. We also use arctanh(x)x1+x\operatorname{arctanh}(x)\geq\frac{x}{1+x}. Thus we have VT=Ω(T)V_{T}=\Omega(T). Now we consider FTF_{T}. By Lemma E.8, we know that 𝐮t\mathbf{u}_{t} lies within the same geodesic ball as 𝐱t,i\mathbf{x}_{t,i}, i[n]i\in[n]. Thus

FT=t=1Tft(𝐮t)=t=1Ti=1nd(𝐮t,𝐱t,i)2/nT(2Tα)2=O(T12α).F_{T}=\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})=\sum_{t=1}^{T}\sum_{i=1}^{n}d(\mathbf{u}_{t},\mathbf{x}_{t,i})^{2}/n\leq T\cdot\left(\frac{2}{T^{\alpha}}\right)^{2}=O(T^{1-2\alpha}).

We can see that whenever α>0\alpha>0, FT=o(T)F_{T}=o(T) while VT=Ω(T)V_{T}=\Omega(T).

D.2 Proof of Theorem 9

By Lemma 5.2 we have

t=1Tft(𝐱t)ft(𝐱t,i)2+lnNβ+βt=1Tt𝐦t214βt=2T𝐰t𝐰t112\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}_{t,i})\leq\frac{2+\ln N}{\beta}+\beta\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}-\frac{1}{4\beta}\sum_{t=2}^{T}\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\|_{1}^{2} (65)

We bound t=1Tt𝐦t2\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2} in terms of t=1Tt𝐦tv2\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}^{v}\|_{\infty}^{2} and t=1Tt𝐦ts2\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}^{s}\|_{\infty}^{2} as follows. By Assumptions 2 and 3, t𝐦t224NG2D2\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{2}^{2}\leq 4NG^{2}D^{2}, and one can verify that dt(𝐦)=t𝐦22d_{t}(\mathbf{m})=\|\boldsymbol{\ell}_{t}-\mathbf{m}\|_{2}^{2} is 18NG2D2\frac{1}{8NG^{2}D^{2}}-exp-concave. We have

t=1Tt𝐦t2t=1Tt𝐦t22min{t=1Tt𝐦tv22,t=1Tt𝐦ts22}+8NG2D2ln2Nmin{t=1Tt𝐦tv2,t=1Tt𝐦ts2}+8NG2D2ln2,\begin{split}&\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{\infty}^{2}\leq\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}\|_{2}^{2}\\ \leq&\min\left\{\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}^{v}\|_{2}^{2},\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}^{s}\|_{2}^{2}\right\}+8NG^{2}D^{2}\ln 2\\ \leq&N\min\left\{\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}^{v}\|_{\infty}^{2},\sum_{t=1}^{T}\|\boldsymbol{\ell}_{t}-\mathbf{m}_{t}^{s}\|_{\infty}^{2}\right\}+8NG^{2}D^{2}\ln 2,\\ \end{split} (66)

where the second inequality uses Lemma E.17, while the first and the third use the norm inequality 𝐱𝐱2N𝐱\|\mathbf{x}\|_{\infty}\leq\|\mathbf{x}\|_{2}\leq\sqrt{N}\|\mathbf{x}\|_{\infty}.
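For completeness, the exp-concavity constant claimed above can be checked with the standard second-order criterion: a twice-differentiable function ff is α\alpha-exp-concave whenever 2fαff\nabla^{2}f\succeq\alpha\nabla f\nabla f^{\top}. Here

\nabla d_{t}(\mathbf{m})=-2(\boldsymbol{\ell}_{t}-\mathbf{m}),\qquad\nabla^{2}d_{t}(\mathbf{m})=2I,\qquad\nabla d_{t}(\mathbf{m})\nabla d_{t}(\mathbf{m})^{\top}\preceq 4\|\boldsymbol{\ell}_{t}-\mathbf{m}\|_{2}^{2}I\preceq 16NG^{2}D^{2}I,

so 2Iαdt(𝐦)dt(𝐦)2I\succeq\alpha\nabla d_{t}(\mathbf{m})\nabla d_{t}(\mathbf{m})^{\top} holds with α=18NG2D2\alpha=\frac{1}{8NG^{2}D^{2}} over the region where t𝐦224NG2D2\|\boldsymbol{\ell}_{t}-\mathbf{m}\|_{2}^{2}\leq 4NG^{2}D^{2}.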

Combining Equations (50), (65), (66) and Lemma 6.3, we have

t=1Tft(𝐱t)t=1Tft(𝐱t,i)2+lnNβ+βN(D2min{3(VT+G2),F¯T}+8G2D2ln2)\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq\frac{2+\ln N}{\beta}+\beta N\left(D^{2}\min\{3(V_{T}+G^{2}),\bar{F}_{T}\}+8G^{2}D^{2}\ln 2\right)

holds for any β112(D4L2+D2G2ζ2)\beta\leq\frac{1}{\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}} and i[N]i\in[N].

Suppose β=2+lnNN(D2min{3(VT+G2),F¯T}+8G2D2ln2)112(D4L2+D2G2ζ2)\beta^{\star}=\sqrt{\frac{2+\ln N}{N(D^{2}\min\{3(V_{T}+G^{2}),\bar{F}_{T}\}+8G^{2}D^{2}\ln 2)}}\leq\frac{1}{\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}}, then

t=1Tft(𝐱t)t=1Tft(𝐱t,i)2(2+lnN)N(D2min{3(VT+G2),F¯T}+8G2D2ln2).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq 2\sqrt{(2+\ln N)N(D^{2}\min\{3(V_{T}+G^{2}),\bar{F}_{T}\}+8G^{2}D^{2}\ln 2)}.

Otherwise

t=1Tft(𝐱t)t=1Tft(𝐱t,i)2(2+lnN)12(D4L2+D2G2ζ2).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\leq 2(2+\ln N)\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}.

In sum, we have

t=1Tft(𝐱t)t=1Tft(𝐱t,i)max{2(2+lnN)N(D2min{3(VT+G2),F¯T}+8G2D2ln2),2(2+lnN)12(D4L2+D2G2ζ2)}=O(logTmin{VT,F¯T}).\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})\\ \leq&\max\left\{2\sqrt{(2+\ln N)N(D^{2}\min\{3(V_{T}+G^{2}),\bar{F}_{T}\}+8G^{2}D^{2}\ln 2)},2(2+\ln N)\sqrt{12(D^{4}L^{2}+D^{2}G^{2}{\zeta}^{2})}\right\}\\ =&O\left(\sqrt{\log T\cdot\min\{V_{T},\bar{F}_{T}\}}\right).\end{split} (67)

D.3 Proof of Theorem 10

By Theorem 9, we know

t=1Tft(𝐱t)t=1Tft(𝐱t,i)=O(lnT(min{VT,F¯T}))\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,i})=O\left(\sqrt{\ln T(\min\{V_{T},\bar{F}_{T}\})}\right)

holds for any i[Nv+Ns]i\in[N^{v}+N^{s}]. Let kk and kk^{\prime} denote the indices of the best experts for the gradient-variation bound and the small-loss bound, respectively. Then by Theorem 7,

t=1Tft(𝐱t,k)ft(𝐮t)322(D2+2DPT)ζVT+(D2+2DPT)1+(1+2ζδ2L2)12δ\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})-f_{t}(\mathbf{u}_{t})\leq\frac{3}{2}\sqrt{2(D^{2}+2DP_{T}){\zeta}V_{T}}+(D^{2}+2DP_{T})\frac{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}{\delta} (68)

while by Theorem 8,

t=1Tft(𝐱t,k)ft(𝐮t)2(16ζ2L2(D2+2DPT)2+18ζLFT(D2+2DPT)).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k^{\prime}})-f_{t}(\mathbf{u}_{t})\leq\sqrt{2\left(16{\zeta}^{2}L^{2}(D^{2}+2DP_{T})^{2}+18{\zeta}LF_{T}(D^{2}+2DP_{T})\right)}. (69)

Since the regret admits the following decompositions

t=1Tft(𝐱t)t=1Tft(𝐮t)=t=1Tft(𝐱t)t=1Tft(𝐱t,k)+t=1Tft(𝐱t,k)t=1Tft(𝐮t)=t=1Tft(𝐱t)t=1Tft(𝐱t,k)+t=1Tft(𝐱t,k)t=1Tft(𝐮t),\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ =&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})+\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ =&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k^{\prime}})+\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t,k^{\prime}})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t}),\end{split} (70)

we indeed have

t=1Tft(𝐱t)t=1Tft(𝐮t)O(lnT(min{VT,F¯T}))+min{322(D2+2DPT)ζVT+(D2+2DPT)1+(1+2ζδ2L2)12δ,2(16ζ2L2(D2+2DPT)2+18ζLFT(D2+2DPT))}=O(lnT(min{VT,F¯T}))+O(min{ζ(1+PT)((1+PT)/δ2+VT),ζ(1+PT)(ζ(1+PT)+FT)})\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})\\ \leq&O\left(\sqrt{\ln T(\min\{V_{T},\bar{F}_{T}\})}\right)+\min\left\{\frac{3}{2}\sqrt{2(D^{2}+2DP_{T}){\zeta}V_{T}}+(D^{2}+2DP_{T})\frac{1+(1+2\zeta\delta^{2}L^{2})^{\frac{1}{2}}}{\delta},\right.\\ &\left.\sqrt{2\left(16{\zeta}^{2}L^{2}(D^{2}+2DP_{T})^{2}+18{\zeta}LF_{T}(D^{2}+2DP_{T})\right)}\right\}\\ =&O\left(\sqrt{\ln T(\min\{V_{T},\bar{F}_{T}\})}\right)\\ &+O\left(\min\left\{\sqrt{{\zeta}(1+P_{T})((1+P_{T})/\delta^{2}+V_{T})},\sqrt{{\zeta}(1+P_{T})(\zeta(1+P_{T})+F_{T})}\right\}\right)\end{split} (71)

Note that F¯T\bar{F}_{T} can be handled in the same way as in Lemma 6.3 and Theorem 8 to replace it with FTF_{T}. In sum, the regret is bounded by

t=1Tft(𝐱t)t=1Tft(𝐮t)=O(ζ(PT(ζ+1/δ2)+min{VT,FT}+1)(1+PT)+lnTmin{VT,FT}).\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{u}_{t})=O\left(\sqrt{\zeta(P_{T}(\zeta+1/\delta^{2})+\min\{V_{T},F_{T}\}+1)(1+P_{T})+\ln T\min\{V_{T},F_{T}\}}\right).

Appendix E Technical Lemmas

We need the following technical lemmas.

Lemma E.1.

(Bento et al., 2021, Theorem 2.1) Suppose f:f:\mathcal{M}\rightarrow\mathbb{R} is geodesically convex and \mathcal{M} is Hadamard. The geodesic mean 𝐱¯k\bar{\mathbf{x}}_{k} w.r.t. coefficients a1,,aNa_{1},\dots,a_{N} (i=1Nai=1\sum_{i=1}^{N}a_{i}=1, ai0a_{i}\geq 0) is defined as:

𝐱¯1=𝐱1𝐱¯k=Exp𝐱¯k1(aki=1kaiExp𝐱¯k11𝐱k),k>1.\begin{split}&\bar{\mathbf{x}}_{1}=\mathbf{x}_{1}\\ &\bar{\mathbf{x}}_{k}={\textnormal{Exp}}_{\bar{\mathbf{x}}_{k-1}}\left(\frac{a_{k}}{\sum_{i=1}^{k}a_{i}}{\textnormal{Exp}}_{\bar{\mathbf{x}}_{k-1}}^{-1}\mathbf{x}_{k}\right),\quad k>1.\end{split} (72)

Then we have

f(𝐱¯N)i=1Naif(𝐱i).f(\bar{\mathbf{x}}_{N})\leq\sum_{i=1}^{N}a_{i}f(\mathbf{x}_{i}). (73)
Proof E.2.

We use induction to show a stronger statement

f(𝐱¯k)i=1kaij=1kajf(𝐱i)f(\bar{\mathbf{x}}_{k})\leq\sum_{i=1}^{k}\frac{a_{i}}{\sum_{j=1}^{k}a_{j}}f(\mathbf{x}_{i})

holds for k=1,,Nk=1,\dots,N.

For k=1k=1, this is obviously true because 𝐱¯1=𝐱1\bar{\mathbf{x}}_{1}=\mathbf{x}_{1}. Suppose

f(𝐱¯k)i=1kaij=1kajf(𝐱i)f(\bar{\mathbf{x}}_{k})\leq\sum_{i=1}^{k}\frac{a_{i}}{\sum_{j=1}^{k}a_{j}}f(\mathbf{x}_{i})

holds for some kk, then by geodesic convexity,

f(𝐱¯k+1)(1ak+1j=1k+1aj)f(𝐱¯k)+ak+1j=1k+1ajf(𝐱k+1)i=1kaij=1k+1ajf(𝐱i)+ak+1j=1k+1ajf(𝐱k+1)=i=1k+1aif(𝐱i)j=1k+1aj.\begin{split}f(\bar{\mathbf{x}}_{k+1})\leq&\left(1-\frac{a_{k+1}}{\sum_{j=1}^{k+1}a_{j}}\right)f(\bar{\mathbf{x}}_{k})+\frac{a_{k+1}}{\sum_{j=1}^{k+1}a_{j}}f(\mathbf{x}_{k+1})\\ \leq&\sum_{i=1}^{k}\frac{a_{i}}{\sum_{j=1}^{k+1}a_{j}}f(\mathbf{x}_{i})+\frac{a_{k+1}}{\sum_{j=1}^{k+1}a_{j}}f(\mathbf{x}_{k+1})\\ =&\sum_{i=1}^{k+1}\frac{a_{i}f(\mathbf{x}_{i})}{\sum_{j=1}^{k+1}a_{j}}.\end{split} (74)

The first inequality is due to gsc-convexity: for the geodesic determined by γ(0)=𝐱¯k\gamma(0)=\bar{\mathbf{x}}_{k} and γ(1)=𝐱k+1\gamma(1)=\mathbf{x}_{k+1} we have 𝐱¯k+1=γ(ak+1i=1k+1ai)\bar{\mathbf{x}}_{k+1}=\gamma\left(\frac{a_{k+1}}{\sum_{i=1}^{k+1}a_{i}}\right) and thus f(γ(t))(1t)f(γ(0))+tf(γ(1))f(\gamma(t))\leq(1-t)f(\gamma(0))+tf(\gamma(1)). For the second inequality, we use the induction hypothesis. Given i=1Nai=1\sum_{i=1}^{N}a_{i}=1, the lemma is proved.

The computation of the geodesic averaging is summarized in Algorithm E, which serves as a sub-routine of Radar.

{algorithm2e}

[H] Geodesic Averaging \KwDataNN points 𝐱1,,𝐱N𝒩\mathbf{x}_{1},\dots,\mathbf{x}_{N}\in\mathcal{N} and NN real weights w1,,wNw_{1},\dots,w_{N}. Let 𝐱¯1=𝐱1\bar{\mathbf{x}}_{1}=\mathbf{x}_{1}
\Fork=2,,Nk=2,\dots,N 𝐱¯k=Exp𝐱¯k1(wki=1kwiExp𝐱¯k11𝐱k)\bar{\mathbf{x}}_{k}={\textnormal{Exp}}_{\bar{\mathbf{x}}_{k-1}}\left(\frac{w_{k}}{\sum_{i=1}^{k}w_{i}}{\textnormal{Exp}}^{-1}_{\bar{\mathbf{x}}_{k-1}}\mathbf{x}_{k}\right)
Return 𝐱¯N\bar{\mathbf{x}}_{N}.
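A minimal Python sketch of this subroutine, with the exponential map and its inverse passed in as assumed callables (they are manifold-specific and not implemented here); in Euclidean space the recursion reduces to the ordinary running weighted average, which gives an easy sanity check:

def geodesic_average(points, weights, exp_map, log_map):
    # Iteratively move a fraction w_k / (w_1 + ... + w_k) of the way along the
    # geodesic from the current average towards the next point x_k.
    x_bar = points[0]
    w_sum = weights[0]
    for x_k, w_k in zip(points[1:], weights[1:]):
        w_sum += w_k
        x_bar = exp_map(x_bar, (w_k / w_sum) * log_map(x_bar, x_k))
    return x_bar

# Sanity check in Euclidean space, where exp_map(p, v) = p + v and log_map(p, q) = q - p.
avg = geodesic_average([1.0, 3.0, 5.0], [0.5, 0.25, 0.25],
                       lambda p, v: p + v, lambda p, q: q - p)
assert abs(avg - 2.5) < 1e-9  # equals the weighted mean 0.5*1 + 0.25*3 + 0.25*5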

Lemma E.3.

Suppose 𝐱1,,𝐱N,𝐲1,,𝐲N𝒩\mathbf{x}_{1},\dots,\mathbf{x}_{N},\mathbf{y}_{1},\dots,\mathbf{y}_{N}\in\mathcal{N} where 𝒩\mathcal{N} is a gsc-convex subset of a Hadamard manifold \mathcal{M} and the diameter of 𝒩\mathcal{N} is upper bounded by DD. Let 𝐱¯\bar{\mathbf{x}} and 𝐲¯\bar{\mathbf{y}} be the weighted Fréchet mean with respect to coefficient vectors 𝐚\mathbf{a} and 𝐛\mathbf{b} (𝐚,𝐛ΔN)(\mathbf{a},\mathbf{b}\in\Delta_{N}), defined as 𝐱¯=argmin𝐱𝒩i=1Naid(𝐱,𝐱i)2\bar{\mathbf{x}}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{N}}\sum_{i=1}^{N}a_{i}\cdot d(\mathbf{x},\mathbf{x}_{i})^{2} and 𝐲¯=argmin𝐲𝒩i=1Nbid(𝐲,𝐲i)2\bar{\mathbf{y}}=\operatorname*{argmin}_{\mathbf{y}\in\mathcal{N}}\sum_{i=1}^{N}b_{i}\cdot d(\mathbf{y},\mathbf{y}_{i})^{2}. Then we have

d(𝐱¯,𝐲¯)i=1Naid(𝐱i,𝐲i)+Di=1N|aibi|.d(\bar{\mathbf{x}},\bar{\mathbf{y}})\leq\sum_{i=1}^{N}a_{i}\cdot d(\mathbf{x}_{i},\mathbf{y}_{i})+D\sum_{i=1}^{N}|a_{i}-b_{i}|. (75)
Proof E.4.

Recall that on Hadamard manifolds, the following inequality (Sturm, 2003, Prop. 2.4)

d(𝐱,𝐲)2+d(𝐮,𝐯)2d(𝐱,𝐯)2+d(𝐲,𝐮)2+2d(𝐱,𝐮)d(𝐲,𝐯)d(\mathbf{x},\mathbf{y})^{2}+d(\mathbf{u},\mathbf{v})^{2}\leq d(\mathbf{x},\mathbf{v})^{2}+d(\mathbf{y},\mathbf{u})^{2}+2d(\mathbf{x},\mathbf{u})\cdot d(\mathbf{y},\mathbf{v})

holds for any 𝐱,𝐲,𝐮,𝐯\mathbf{x},\mathbf{y},\mathbf{u},\mathbf{v}\in\mathcal{M}. A direct application of the above inequality yields

d(𝐱i,𝐲¯)2+d(𝐲i,𝐱¯)2d(𝐱i,𝐱¯)2+d(𝐲i,𝐲¯)2+2d(𝐱¯,𝐲¯)d(𝐱i,𝐲i)i[N].d(\mathbf{x}_{i},\bar{\mathbf{y}})^{2}+d(\mathbf{y}_{i},\bar{\mathbf{x}})^{2}\leq d(\mathbf{x}_{i},\bar{\mathbf{x}})^{2}+d(\mathbf{y}_{i},\bar{\mathbf{y}})^{2}+2d(\bar{\mathbf{x}},\bar{\mathbf{y}})\cdot d(\mathbf{x}_{i},\mathbf{y}_{i})\qquad\forall i\in[N]. (76)

By (Bacák, 2014a, Theorem 2.4):

i=1Naid(𝐱i,𝐱¯)2+i=1Nbid(𝐲i,𝐲¯)2+2d(𝐱¯,𝐲¯)2i=1Naid(𝐱i,𝐲¯)2+i=1Nbid(𝐲i,𝐱¯)2\sum_{i=1}^{N}a_{i}\cdot d(\mathbf{x}_{i},\bar{\mathbf{x}})^{2}+\sum_{i=1}^{N}b_{i}\cdot d(\mathbf{y}_{i},\bar{\mathbf{y}})^{2}+2d(\bar{\mathbf{x}},\bar{\mathbf{y}})^{2}\leq\sum_{i=1}^{N}a_{i}\cdot d(\mathbf{x}_{i},\bar{\mathbf{y}})^{2}+\sum_{i=1}^{N}b_{i}\cdot d(\mathbf{y}_{i},\bar{\mathbf{x}})^{2} (77)

Multiplying Equation (76) by aia_{i}, summing from i=1i=1 to NN, and adding Equation (77), we have

2d(𝐱¯,𝐲¯)22d(𝐱¯,𝐲¯)i=1Naid(𝐱i,𝐲i)+i=1N(aibi)d(𝐲i,𝐲¯)2+i=1N(biai)d(𝐲i,𝐱¯)2=2d(𝐱¯,𝐲¯)i=1Naid(𝐱i,𝐲i)+i=1N(aibi)(d(𝐲i,𝐲¯)d(𝐲i,𝐱¯))(d(𝐲i,𝐲¯)+d(𝐲i,𝐱¯))2d(𝐱¯,𝐲¯)i=1Naid(𝐱i,𝐲i)+2Di=1N|aibi|d(𝐱¯,𝐲¯),\begin{split}2d(\bar{\mathbf{x}},\bar{\mathbf{y}})^{2}\leq&2d(\bar{\mathbf{x}},\bar{\mathbf{y}})\sum_{i=1}^{N}a_{i}\cdot d(\mathbf{x}_{i},\mathbf{y}_{i})+\sum_{i=1}^{N}(a_{i}-b_{i})d(\mathbf{y}_{i},\bar{\mathbf{y}})^{2}+\sum_{i=1}^{N}(b_{i}-a_{i})d(\mathbf{y}_{i},\bar{\mathbf{x}})^{2}\\ =&2d(\bar{\mathbf{x}},\bar{\mathbf{y}})\sum_{i=1}^{N}a_{i}\cdot d(\mathbf{x}_{i},\mathbf{y}_{i})+\sum_{i=1}^{N}(a_{i}-b_{i})\cdot(d(\mathbf{y}_{i},\bar{\mathbf{y}})-d(\mathbf{y}_{i},\bar{\mathbf{x}}))\cdot(d(\mathbf{y}_{i},\bar{\mathbf{y}})+d(\mathbf{y}_{i},\bar{\mathbf{x}}))\\ \leq&2d(\bar{\mathbf{x}},\bar{\mathbf{y}})\sum_{i=1}^{N}a_{i}\cdot d(\mathbf{x}_{i},\mathbf{y}_{i})+2D\sum_{i=1}^{N}|a_{i}-b_{i}|d(\bar{\mathbf{x}},\bar{\mathbf{y}}),\end{split} (78)

where for the last inequality we use the triangle inequality for geodesic metric spaces and Assumption 2. Dividing both sides by 2d(𝐱¯,𝐲¯)2d(\bar{\mathbf{x}},\bar{\mathbf{y}}) (the claim is trivial when d(𝐱¯,𝐲¯)=0d(\bar{\mathbf{x}},\bar{\mathbf{y}})=0) completes the proof.

Lemma E.5.

(Zhang and Sra, 2016, Lemma 5). Let \mathcal{M} be a Riemannian manifold with sectional curvature lower bounded by κ0\kappa\leq 0. Consider 𝒩\mathcal{N}, a gsc-convex subset of \mathcal{M} with diameter DD. For a geodesic triangle lying entirely within 𝒩\mathcal{N}, with side lengths a,b,ca,b,c and AA the angle opposite side aa, we have

a2ζ(κ,D)b2+c22bccosAa^{2}\leq{\zeta}(\kappa,D)b^{2}+c^{2}-2bc\cos A

where ζ(κ,D):=κDcoth(κD){\zeta}(\kappa,D):=\sqrt{-\kappa}D\operatorname{coth}(\sqrt{-\kappa}D).

Lemma E.6.

(Sakai, 1996, Prop. 4.5) Let \mathcal{M} be a Riemannian manifold with sectional curvature upper bounded by κ0\kappa\leq 0. Consider 𝒩\mathcal{N}, a gsc-convex subset of \mathcal{M} with diameter DD. For a geodesic triangle lying entirely within 𝒩\mathcal{N}, with side lengths a,b,ca,b,c and AA the angle opposite side aa, we have

a2b2+c22bccosA.a^{2}\geq b^{2}+c^{2}-2bc\cos A.
Lemma E.7.

(Bacák, 2014b, Theorem 2.1.12) Let (,d)(\mathcal{H},d) be a Hadamard space and CC\subset\mathcal{H} be a closed convex set. Then ΠC𝐱\Pi_{C}{\mathbf{x}} is a singleton and d(𝐱,ΠC𝐱)d(𝐱,𝐳)d(\mathbf{x},\Pi_{C}\mathbf{x})\leq d(\mathbf{x},\mathbf{z}) for any 𝐳C{ΠC𝐱}\mathbf{z}\in C\setminus\{\Pi_{C}\mathbf{x}\}.

Lemma E.8.

(Sturm, 2003, Prop. 6.1 and Theorem 6.2) Suppose 𝐱1,,𝐱N𝒩\mathbf{x}_{1},\dots,\mathbf{x}_{N}\in\mathcal{N} where 𝒩\mathcal{N} is a bounded gsc-convex subset of a Hadamard space. Let 𝐱¯\bar{\mathbf{x}} be the weighted Fréchet mean of 𝐱1,,𝐱N\mathbf{x}_{1},\dots,\mathbf{x}_{N} w.r.t. non-negative weights w1,,wNw_{1},\dots,w_{N} such that i=1Nwi=1\sum_{i=1}^{N}w_{i}=1, and let f:𝒩f:\mathcal{N}\rightarrow\mathbb{R} be a gsc-convex function. Then 𝐱¯𝒩\bar{\mathbf{x}}\in\mathcal{N} and

f(𝐱¯)i=1Nwif(𝐱i).f(\bar{\mathbf{x}})\leq\sum_{i=1}^{N}w_{i}f(\mathbf{x}_{i}).
Lemma E.9.

Let

g(x)a+(a2+bx2)12x,g(x)\coloneqq\frac{-a+({a^{2}+bx^{2}})^{\frac{1}{2}}}{x},

where a,ba,b\in\mathbb{R}^{+}. Then g(x)g(x), extended by continuity with g(0)=0g(0)=0, is increasing on [0,)[0,\infty).

Proof E.10.

Taking the derivative w.r.t. xx, we have

g(x)=122bx(a2+bx2)12x(a+(a2+bx2)12)1x2=bx2(aa2+bx2+a2+bx2)x2a2+bx2=aa2+bx2a2x2a2+bx20\begin{split}g^{\prime}(x)=&\frac{\frac{1}{2}\cdot 2bx(a^{2}+bx^{2})^{-\frac{1}{2}}\cdot x-(-a+(a^{2}+bx^{2})^{\frac{1}{2}})\cdot 1}{x^{2}}\\ =&\frac{bx^{2}-(-a\sqrt{a^{2}+bx^{2}}+a^{2}+bx^{2})}{x^{2}\sqrt{a^{2}+bx^{2}}}\\ =&\frac{a\sqrt{a^{2}+bx^{2}}-a^{2}}{x^{2}\sqrt{a^{2}+bx^{2}}}\geq 0\end{split}

holds for any x>0x>0. By L’Hôpital’s rule, g(0)=0g(0)=0 and g(0)=b2ag^{\prime}(0)=\frac{b}{2a}. Thus we know g(x)g(x) increases on [0,)[0,\infty).

Lemma E.11.

(Zhou and Huang, 2019, Theorem 3.1) On a Hadamard manifold \mathcal{M}, a subset CC is gsc-convex if and only if it contains the geodesic convex combinations of any countably many points in CC.

Lemma E.12.

(Ahn and Sra, 2020, Prop. H.1) Let \mathcal{M} be a Riemannian manifold with sectional curvatures lower bounded by κ<0-\kappa<0, and define h(𝐱)12d(𝐱,𝐩)2h(\mathbf{x})\coloneqq\frac{1}{2}d(\mathbf{x},\mathbf{p})^{2} for a fixed 𝐩\mathbf{p}\in\mathcal{M}. For D0D\geq 0, h()h(\cdot) is κDcoth(κD)\sqrt{\kappa}D\operatorname{coth}(\sqrt{\kappa}D)-gsc-smooth within the domain {𝐮:d(𝐮,𝐩)D}\{\mathbf{u}\in\mathcal{M}:d(\mathbf{u},\mathbf{p})\leq D\}.

Lemma E.13.

(Ballmann, 2012, Corollary 5.6) Let \mathcal{M} be a Hadamard space and CC\subset\mathcal{M} a convex subset. Then the function 𝐳d(𝐳,C)\mathbf{z}\mapsto d(\mathbf{z},C) is gsc-convex on \mathcal{M}.

Lemma E.14.

(Bacák, 2014b, Section 2.1) Let \mathcal{H} be a Hadamard manifold, f:(,)f:\mathcal{H}\rightarrow(-\infty,\infty) be a convex lower semicontinuous function. Then any β\beta-sublevel set of ff:

{𝐱:f(𝐱)β}\{\mathbf{x}\in\mathcal{H}:f(\mathbf{x})\leq\beta\}

is a closed convex set.

Lemma E.15.

(Sun et al., 2019, Lemma 4) Suppose the sectional curvature of \mathcal{M} lies in [K,K][-K,K] and 𝐱,𝐲,𝐳\mathbf{x},\mathbf{y},\mathbf{z}\in\mathcal{M} have pairwise distances upper bounded by RR. Then

Exp𝐱1𝐲Exp𝐱1𝐳(1+c(K)R2)d(𝐲,𝐳).\|\textnormal{Exp}^{-1}_{\mathbf{x}}\mathbf{y}-\textnormal{Exp}^{-1}_{\mathbf{x}}\mathbf{z}\|\leq(1+c(K)R^{2})d(\mathbf{y},\mathbf{z}).
Lemma E.16.

(Duchi et al., 2012, Lemma 2) Let

Π𝒳ψ(𝐳,α)=argmin𝐱𝒳𝐳,𝐱+ψ(𝐱)α\Pi_{\mathcal{X}}^{\psi}(\mathbf{z},\alpha)=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{X}}\left\langle\mathbf{z},\mathbf{x}\right\rangle+\frac{\psi(\mathbf{x})}{\alpha}

where ψ()\psi(\cdot) is 11-strongly convex w.r.t. \|\cdot\|, then

Π𝒳ψ(𝐮,α)Π𝒳ψ(𝐯,α)α𝐮𝐯.\|\Pi_{\mathcal{X}}^{\psi}(\mathbf{u},\alpha)-\Pi_{\mathcal{X}}^{\psi}(\mathbf{v},\alpha)\|\leq\alpha\|\mathbf{u}-\mathbf{v}\|_{\star}.
Lemma E.17.

(Cesa-Bianchi and Lugosi, 2006, Prop. 3.1 and Theorem 3.2) Suppose the loss function t\ell_{t} is η\eta-exp-concave for some η>0\eta>0. Then the regret of Hedge is at most lnNη\frac{\ln N}{\eta}, where NN is the number of experts.