Escaping Saddle Points Faster on Manifolds via Perturbed Riemannian Stochastic Recursive Gradient
Abstract
In this paper, we propose a variant of the Riemannian stochastic recursive gradient method that achieves a second-order convergence guarantee and escapes saddle points using simple perturbation. The idea is to perturb the iterates when the gradient is small and to carry out stochastic recursive gradient updates over the tangent space. This avoids the complication of exploiting Riemannian geometry. We show that, under the finite-sum setting, the number of stochastic gradient queries our algorithm requires to find an $(\epsilon,\delta)$-second-order critical point strictly improves on that of perturbed Riemannian gradient descent and is superior to that of perturbed Riemannian accelerated gradient descent under large-sample settings. We also provide a complexity bound for online optimization, which is novel on Riemannian manifolds in terms of second-order convergence using only first-order information.
1 Introduction
Consider the following finite-sum and online optimization problem defined on a Riemannian manifold $\mathcal{M}$:
$$\min_{x \in \mathcal{M}} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where each component $f_i : \mathcal{M} \to \mathbb{R}$ is a non-convex function. The finite-sum setting is a special case of the online setting in which $f$ can be finitely sampled. For simplicity of analysis, we only consider the finite-sum formulation and refer to it as online optimization when $n$ approaches infinity. Solving problem (1) globally on Euclidean space (i.e., when $\mathcal{M} = \mathbb{R}^d$) is NP-hard in general, let alone on an arbitrary Riemannian manifold. Thus many algorithms set out to find only approximate first-order critical points, which have small gradients. But this is usually insufficient because, for non-convex problems, a point with small gradient can be near a local minimum, a local maximum, or a saddle point. Therefore, to avoid being trapped by saddle points (and possibly local maxima), we need to find approximate second-order critical points (see Definition 1), which have small gradients as well as nearly positive semi-definite Hessians.
Second-order algorithms can exploit curvature information via the Hessian and thus escape saddle points by construction. However, it is desirable to design simpler algorithms that use only gradient information for this purpose, because access to the Hessian can be costly and is sometimes unavailable in real applications. It turns out that gradient descent (GD) can inherently escape saddle points, yet it may require exponential time (Du et al., 2017). Instead, Jin et al. (2017) showed that by injecting isotropic noise into the iterates, perturbed gradient descent (PGD) can reach approximate $\epsilon$-second-order stationarity within $\widetilde{O}(\epsilon^{-2})$ gradient queries with high probability. Similarly, a perturbed variant of stochastic gradient descent (PSGD) requires a complexity of $\widetilde{O}(\epsilon^{-4})$ (Jin et al., 2019). These complexities match those of vanilla GD and SGD for finding approximate first-order stationary points, up to poly-logarithmic factors.
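To make the perturbation mechanism concrete, here is a minimal sketch of the idea in the Euclidean case. All names, step sizes, and thresholds below are illustrative assumptions of ours, not the tuned constants of (Jin et al., 2017):

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta=1e-2, eps=1e-3, radius=1e-2,
                 escape_steps=100, max_iters=10_000, seed=0):
    """Sketch of perturbed gradient descent: when the gradient is small,
    inject isotropic noise sampled uniformly from a small ball, then keep
    taking plain gradient steps so the iterate can slide off the saddle."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    last_perturb = -escape_steps          # allow a perturbation immediately
    for t in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps and t - last_perturb >= escape_steps:
            direction = rng.normal(size=d)
            direction /= np.linalg.norm(direction)
            # Uniform sample from the ball of the given radius.
            x = x + direction * radius * rng.uniform() ** (1.0 / d)
            last_perturb = t
        x = x - eta * g
    return x
```

The waiting period between perturbations mirrors the analysis: after injecting noise, the algorithm needs a number of plain descent steps for the negative-curvature direction to amplify.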
When considering problem (1) on an arbitrary manifold, we require the Riemannian gradient as well as a retraction that allows the iterates to be updated on the manifold along directions determined by the Riemannian gradient. Defining a "pullback" function $\hat{f}_x := f \circ R_x$, Criscitiello and Boumal (2019) extended the idea of perturbed gradient descent (referred to as PRGD) by executing perturbations and the subsequent gradient steps on the tangent space. Given that $\hat{f}_x$ is naturally defined on a vector space, the analysis largely carries over from (Jin et al., 2017), with similar convergence guarantees. Recently, Criscitiello and Boumal (2020) also proposed perturbed Riemannian accelerated gradient descent (PRAGD), a direct generalization of its Euclidean counterpart originally proposed in (Jin et al., 2018). It achieves an even lower complexity of $\widetilde{O}(\epsilon^{-7/4})$ compared to perturbed GD. However, these algorithms are sub-optimal under finite-sum settings and even fail to work for online problems where the full gradient is inaccessible.
When the objective can be decomposed into component functions as in problem (1), variance reduction techniques can improve the gradient complexity of both GD and SGD (Reddi et al., 2016; Nguyen et al., 2017b; Fang et al., 2018). The main idea is to exploit past gradient information to correct for the deviation of the current stochastic gradients, by comparing either to a fixed snapshot point (SVRG, Reddi et al. (2016)) or recursively to previous iterates (SRG/SPIDER, Nguyen et al. (2017a); Fang et al. (2018)). Notably, the gradient complexities $O(\sqrt{n}\epsilon^{-2})$ and $O(\epsilon^{-3})$ achieved by SRG and SPIDER are optimal for first-order guarantees under the finite-sum and online settings respectively (Fang et al., 2018; Arjevani et al., 2019). Motivated by the success of variance reduction and the simplicity of adding perturbation to ensure second-order convergence, Li (2019) proposed the perturbed stochastic recursive gradient method (PSRG), which can escape saddle points faster than perturbed GD and SGD. In this paper, we generalize PSRG to optimization problems defined on Riemannian manifolds with the following contributions:
• Our proposed method PRSRG is the first simple stochastic algorithm that achieves second-order guarantees on Riemannian manifolds using only first-order information. That is, the algorithm only requires simple perturbations and does not involve any negative curvature exploitation as in PRAGD. Our algorithm adopts the idea of tangent space steps as in (Criscitiello and Boumal, 2019), which results in a simple convergence analysis.
• Our complexity is measured against $(\epsilon,\delta)$-second-order stationarity, which is more general than $\epsilon$-second-order stationarity, where $\delta = \sqrt{\rho\epsilon}$. Under the finite-sum setting, PRSRG strictly improves the complexity of PRGD and is superior to PRAGD for large-sample problems. We also provide a complexity bound for PRSRG under the online setting, which is novel for Riemannian optimization.
2 Other related work
Following the work of Jin et al. (2017), besides perturbed SGD (Jin et al., 2019) and the perturbed accelerated gradient method (Jin et al., 2018), Ge et al. (2019) showed that SVRG with perturbation (PSVRG) suffices to escape saddle points, and a stabilized version was also developed to further improve the dependency on the Hessian Lipschitz constant. However, its complexity is strictly worse than that of PSRG (Li, 2019). We suspect that, with a similar tangent space trick, PSVRG can be generalized to Riemannian manifolds with little effort.
Another line of research incorporates negative curvature search subroutines (Xu et al., 2018; Allen-Zhu and Li, 2018) into classic first-order algorithms, such as GD and SGD, as well as variance-reduction algorithms including SVRG (Allen-Zhu and Li, 2018), SNVRG (Zhou et al., 2018) and SPIDER (Fang et al., 2018). This usually requires access to Hessian-vector products when searching for the direction of the smallest eigenvector (Xu et al., 2018). Even though Allen-Zhu and Li (2018) managed to use only first-order information for the same goal, their method still involves an iterative process that can be difficult to implement in practice.
To solve general optimization problems on Riemannian manifolds, gradient descent and stochastic gradient descent have been generalized with provable guarantees (Boumal et al., 2019; Bonnabel, 2013; Hosseini and Sra, 2020). Acceleration and variance reduction techniques have also been considered for speeding up the convergence of GD and SGD (Zhang and Sra, 2018; Ahn and Sra, 2020; Zhang et al., 2016; Kasai et al., 2018). These first-order methods, however, can only guarantee first-order convergence. To find second-order critical points, Newton-type methods, particularly the well-known globalized variants, trust region and cubic regularization, have been extended to manifold optimization with similar second-order guarantees (Boumal et al., 2019; Agarwal et al., 2018).
Finally, we note that Sun et al. (2019) also proposed perturbed gradient descent on manifolds. There, the perturbations and the subsequent updates are performed directly on the manifold, in sharp contrast to the tangent space steps in this paper. We prefer tangent space steps to manifold steps mainly because they simplify the analysis: Sun et al. (2019) require more sophisticated geometric results as well as the use of the exponential map, given its natural connection to geodesics and Riemannian distance. By executing all steps on the tangent space, we can bypass these results, although some regularity conditions must be carefully managed.
3 Preliminaries
Here we start with a short review of some preliminary definitions. Readers can refer to (Absil et al., 2009) for more detailed discussions of manifold geometry. We consider $\mathcal{M}$ to be a $d$-dimensional Riemannian manifold. The tangent space $T_x\mathcal{M}$ at $x \in \mathcal{M}$ is a $d$-dimensional vector space. $\mathcal{M}$ is equipped with an inner product $\langle \cdot, \cdot \rangle_x$ and a corresponding norm $\|\cdot\|_x$ on each tangent space $T_x\mathcal{M}$. The tangent bundle is the union of the tangent spaces, defined as $T\mathcal{M} := \{(x, v) : x \in \mathcal{M}, v \in T_x\mathcal{M}\}$. The Riemannian gradient of a function $f : \mathcal{M} \to \mathbb{R}$ is the unique vector field $\mathrm{grad} f$ that satisfies $\langle \mathrm{grad} f(x), v \rangle_x = \mathrm{D} f(x)[v]$ for all $v \in T_x\mathcal{M}$, where $\mathrm{D} f(x)[v]$ is the directional derivative of $f$ along $v$. The Riemannian Hessian of $f$ is the covariant derivative of $\mathrm{grad} f$: for all $v \in T_x\mathcal{M}$, it satisfies $\mathrm{Hess} f(x)[v] = \tilde{\nabla}_v\, \mathrm{grad} f(x)$, where $\tilde{\nabla}$ is the Riemannian connection (also known as the Levi-Civita connection). Note that we also use $\nabla$ to represent differentiation on a vector space, which is a special case of the covariant derivative.
A retraction $R_x : T_x\mathcal{M} \to \mathcal{M}$ maps a tangent vector to the manifold while satisfying (i) $R_x(0) = x$, and (ii) $\mathrm{D}R_x(0)$ is the identity map, where $\mathrm{D}R_x(u)$ denotes the differential of the retraction. The exponential map is a special instance of retraction, and hence our results in this paper trivially extend to this particular retraction. Define the pullback function $\hat{f}_x := f \circ R_x$; that is, $\hat{f}_x(u) = f(R_x(u))$ for $u \in T_x\mathcal{M}$, where we also define the pullback components as $\hat{f}_{i,x} := f_i \circ R_x$. Given that the domain of the pullback function is a vector space, we can represent its gradient and Hessian in terms of the Riemannian gradient and Hessian as well as the differentiated retraction, as in the following lemma (see Lemma 2.5 in Criscitiello and Boumal (2020)).
Lemma 1 (Gradient and Hessian of the pullback)
For a twice continuously differentiable function $f$ and $u \in T_x\mathcal{M}$, we have
$$\nabla \hat{f}_x(u) = \mathrm{D}R_x(u)^*[\mathrm{grad} f(R_x(u))], \qquad \nabla^2 \hat{f}_x(u) = \mathrm{D}R_x(u)^* \circ \mathrm{Hess} f(R_x(u)) \circ \mathrm{D}R_x(u) + W_u,$$
where $\mathrm{D}R_x(u)^*$ denotes the adjoint operator of $\mathrm{D}R_x(u)$ and $W_u$ is a symmetric linear operator defined by $\langle W_u[\dot{u}], \dot{u} \rangle = \langle \mathrm{grad} f(R_x(u)), c''(0) \rangle$, with $c(t) := R_x(u + t\dot{u})$ a perturbation to $u$.
If the retraction is a second-order retraction, $\nabla \hat{f}_x(0)$ and $\nabla^2 \hat{f}_x(0)$ match the Riemannian gradient and Hessian at $x$; see Lemma 8 (Appendix B) for more details. A vector transport $\mathcal{T}$ with respect to a retraction $R$ is a linear map that satisfies (i) $\mathcal{T}_u v \in T_{R_x(u)}\mathcal{M}$ for all $v \in T_x\mathcal{M}$; (ii) $\mathcal{T}_0 v = v$. In this paper, we only consider isometric vector transport, which satisfies $\langle \mathcal{T}_u v, \mathcal{T}_u w \rangle = \langle v, w \rangle$ for all $v, w \in T_x\mathcal{M}$.
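As a concrete illustration of the pullback and of the correspondence at the origin (Lemma 8 below), the following sketch works on the unit sphere with the metric-projection retraction, which is a second-order retraction; the quadratic objective and all names are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d)); A = (A + A.T) / 2       # arbitrary symmetric matrix

f = lambda y: 0.5 * y @ A @ y                        # smooth objective on the sphere
egrad = lambda y: A @ y                              # Euclidean gradient of f

x = rng.normal(size=d); x /= np.linalg.norm(x)       # a point on S^{d-1}
proj = lambda v: v - (x @ v) * x                     # orthogonal projector onto T_x S^{d-1}
retract = lambda u: (x + u) / np.linalg.norm(x + u)  # metric-projection retraction R_x

fhat = lambda u: f(retract(u))                       # pullback: fhat_x = f o R_x
rgrad = proj(egrad(x))                               # Riemannian gradient grad f(x)

# Central finite differences of the pullback at the origin of T_x S^{d-1}.
h = 1e-6
num_grad = np.array([(fhat(h * proj(e)) - fhat(-h * proj(e))) / (2 * h)
                     for e in np.eye(d)])
print(np.allclose(num_grad, rgrad, atol=1e-4))       # True: grad fhat_x(0) = grad f(x)
```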
Below are some common notations used throughout this paper.
Notation. Denote $\lambda_{\min}(\cdot)$ as the minimum eigenvalue of a symmetric operator and use $\|\cdot\|$ to represent either the vector norm or the spectral norm of a matrix. We write $f(n) = O(g(n))$ if there exist a positive constant $c$ and an $n_0$ such that $f(n) \le c\, g(n)$ for all $n \ge n_0$, and use $\widetilde{O}(\cdot)$ to hide poly-logarithmic factors. Let $\mathbb{B}_x(r)$ be the Euclidean ball of radius $r$ centred at the origin of the tangent space $T_x\mathcal{M}$; that is, $\mathbb{B}_x(r) := \{u \in T_x\mathcal{M} : \|u\| \le r\}$. We use $\mathrm{Unif}(\mathbb{B}_x(r))$ to denote the uniform distribution on this set. Denote $[n] := \{1, \dots, n\}$ and let $\nabla \hat{f}_{\mathcal{I},x} := \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \nabla \hat{f}_{i,x}$ be a mini-batch gradient of the pullback component functions, where $\mathcal{I} \subseteq [n]$ has cardinality $|\mathcal{I}|$. Similarly, $\mathrm{grad} f_{\mathcal{I}}$ represents a mini-batch Riemannian gradient. Finally, we denote $\log(\cdot)$ as the natural logarithm and $\log_a(\cdot)$ as the logarithm with base $a$.
3.1 Assumptions
Now we state the main assumptions as follows. The first assumption is required to bound function decrease and to ensure existence of stationary points on the manifold.
Assumption 1 (Lower bounded function)
$f$ is lower bounded; that is, there exists $f^* > -\infty$ such that $f(x) \ge f^*$ for all $x \in \mathcal{M}$.
Following (Criscitiello and Boumal, 2019), we need to impose some Lipschitzness conditions on the pullback, because the main gradient steps (see Algorithm 2) are performed on the tangent space with respect to the pullback function. The next assumption requires both gradient and Hessian Lipschitzness of the pullback component functions $\hat{f}_{i,x}$. Note that we only require the condition to hold with respect to the origin of $T_x\mathcal{M}$, within a constraint ball around the origin. This assumption is much weaker than requiring Lipschitz continuity over the entire tangent space.
Assumption 2 (Gradient and Hessian Lipschitz)
The gradient and Hessian of each pullback component function $\hat{f}_{i,x}$ are Lipschitz with respect to the origin. That is, there exist constants $L$ and $\rho$ such that, for all $u$ inside the constraint ball,
$$\|\nabla \hat{f}_{i,x}(u) - \nabla \hat{f}_{i,x}(0)\| \le L \|u\|, \qquad \|\nabla^2 \hat{f}_{i,x}(u) - \nabla^2 \hat{f}_{i,x}(0)\| \le \rho \|u\|.$$
This also implies that the gradient and Hessian of the pullback function $\hat{f}_x$ satisfy the same Lipschitzness conditions with respect to the origin.
Based directly on this assumption, we can show (in Lemma 2) that the gradient of the pullback function is Lipschitz continuous within the constraint ball. This result implies smoothness of the pullback function and is fundamental for analysing first-order optimization algorithms.
Lemma 2 ($\ell$-Lipschitz continuous)
Under Assumption 2, for all $x \in \mathcal{M}$, there exists a constant $\ell$ such that each $\nabla \hat{f}_{i,x}$ is $\ell$-Lipschitz continuous inside the constraint ball. This also implies that $\nabla \hat{f}_x$ is $\ell$-Lipschitz continuous; that is, for any $u, v$ inside the ball, we have $\|\nabla \hat{f}_x(u) - \nabla \hat{f}_x(v)\| \le \ell \|u - v\|$.
The next assumption ensures the correspondence between the gradient and Hessian of the pullback at the origin and those of the original function $f$. Note that, similar to (Criscitiello and Boumal, 2019), we can relax this assumption to bounded initial acceleration of the retraction curves, in which case the results only differ by some constants.
Assumption 3 (Second-order retraction)
$R$ is a second-order retraction; that is, for all $x \in \mathcal{M}$ and $u \in T_x\mathcal{M}$, the retraction curve $t \mapsto R_x(tu)$ has zero initial acceleration: $\frac{\mathrm{D}^2}{\mathrm{d}t^2} R_x(tu)\big|_{t=0} = 0$.
The following assumption is needed to bound the difference between the differentiated retraction and the vector transport. Although our algorithm does not require vector transport, as all updates are completed on the tangent space, the main purpose of this assumption is to establish a bound on the singular values of the differentiated retraction (Lemma 3). The lemma can then be used to relate the norm of the pullback gradient $\nabla \hat{f}_x(u)$ to that of the Riemannian gradient at $R_x(u)$.
Assumption 4 (Difference between differentiated retraction and vector transport)
For any $x \in \mathcal{M}$, there exists a neighbourhood of $x$ such that, for all points in this neighbourhood and all tangent vectors $u$ under consideration, there exists a uniform constant $c_0$ with
$$\|\mathrm{D}R_x(u) - \mathcal{T}_u\| \le c_0 \|u\|.$$
Lemma 3 (Singular value bound of differentiated retraction)
For all $u \in T_x\mathcal{M}$ with $\|u\|$ sufficiently small relative to the constant $c_0$ of Assumption 4, the singular values of $\mathrm{D}R_x(u)$ are bounded below by a positive constant.
It is not difficult to satisfy these assumptions. For compact Riemannian manifolds with a second-order retraction and a three-times continuously differentiable function $f$, Assumptions 1, 2 and 3 are easily satisfied (see Lemma 3.2 in (Criscitiello and Boumal, 2019)). Assumption 4 can be guaranteed by requiring the vector transport to be constructed from the Taylor expansion of the retraction (Huang et al., 2015).
Apart from the main assumptions, one additional assumption of bounded variance is particularly important for solving online problems.
Assumption 5 (Uniform bounded variance)
The variance of the gradients of the pullback component functions is bounded uniformly by $\sigma^2$; that is, for all $x \in \mathcal{M}$ and all $i$,
$$\|\nabla \hat{f}_{i,x}(u) - \nabla \hat{f}_x(u)\|^2 \le \sigma^2$$
holds for any $u$ inside the constraint ball.
This assumption is more stringent than the standard variance bound, which holds only in expectation. However, we can relax it by requiring sub-Gaussian tails, which is sufficient to achieve a high-probability bound (Li, 2019). Lastly, we conclude this section by defining second-order critical points as follows.
Definition 1 ($(\epsilon,\delta)$-second-order and $\epsilon$-second-order critical points)
$x \in \mathcal{M}$ is an $(\epsilon,\delta)$-second-order critical point if
$$\|\mathrm{grad} f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}(\mathrm{Hess} f(x)) \ge -\delta.$$
It is an $\epsilon$-second-order critical point if $\|\mathrm{grad} f(x)\| \le \epsilon$ and $\lambda_{\min}(\mathrm{Hess} f(x)) \ge -\sqrt{\rho\epsilon}$, where $\rho$ is the Hessian Lipschitz constant. The second definition is a special case of the first with $\delta = \sqrt{\rho\epsilon}$.
4 Algorithm
In this section, we introduce the perturbed Riemannian stochastic recursive gradient method (PRSRG) in Algorithm 1, whose main updates are performed by the tangent space stochastic recursive gradient method (TSSRG) in Algorithm 2. The key idea is simple: when the gradient is large, we repeatedly execute standard recursive gradient iterations (i.e., an epoch) on the tangent space before retracting back to the manifold. Essentially, we are minimizing the pullback function within the constraint ball by SRG updates, which translates to minimizing $f$ within a neighbourhood of the current point.
This process repeats until the gradient is small. Then we perform the same SRG updates, but for at most a prescribed number of iterations, starting from a perturbed iterate with isotropic noise added on the tangent space. Notice that the small-gradient condition in Line 3 of Algorithm 1 is examined against a batch Riemannian gradient, where the batch contains samples drawn without replacement from $[n]$. This ensures that the full batch gradient is computed under the finite-sum setting, where we choose the batch to be the full sample set. Under the online setting, when $n$ approaches infinity, access to the full gradient is unavailable and we can therefore only rely on a large-batch gradient.
TSSRG is mainly based on the stochastic recursive gradient algorithm of (Nguyen et al., 2017b). The algorithm adopts a double-loop structure. At the start of each outer loop (called an epoch), a full gradient (or a large-batch gradient for online optimization) is evaluated. Within each inner loop, mini-batch gradients are computed at the current iterate $u_t$ and at the previous iterate $u_{t-1}$. Then the modified gradient $v_t$ at $u_t$ is constructed recursively from $v_{t-1}$ and the difference between the two mini-batch gradients. That is,
$$v_t = \nabla \hat{f}_{\mathcal{I}_t,x}(u_t) - \nabla \hat{f}_{\mathcal{I}_t,x}(u_{t-1}) + v_{t-1}. \qquad (2)$$
Note that we do not require any vector transport, since all gradients (of the pullback) are defined on the same tangent space. Hence, TSSRG is very similar to Euclidean SRG, differing only in its stopping criteria, discussed as follows. When the gradient is large, we perform at most one epoch of updates and break the loop with a probability that depends on the index of the inner iteration (Line 12, Algorithm 2). This stopping rule is equivalent to uniformly selecting the output at random from the iterates within this epoch. It ensures that either the small-gradient condition is triggered or a sufficient decrease is achieved; more details can be found in Section 6. When the gradient is small (i.e., around a saddle point), we only break once the maximum number of iterations has been reached.
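To see why a per-step stopping rule yields a uniformly selected output, consider one natural choice (an illustrative assumption on our part; the exact rule is specified in Algorithm 2): break at inner iteration $j \in \{0, \dots, m-1\}$ with probability $1/(m-j)$. Then
$$\mathbb{P}(\text{output is iterate } j) = \frac{1}{m-j} \prod_{k=0}^{j-1} \Big(1 - \frac{1}{m-k}\Big) = \frac{1}{m-j} \cdot \frac{m-j}{m} = \frac{1}{m},$$
since the product telescopes: $\prod_{k=0}^{j-1} \frac{m-k-1}{m-k} = \frac{m-j}{m}$. Every iterate in the epoch is thus returned with equal probability.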
Finally, we note that the Lipschitzness conditions are only guaranteed within a constraint ball of fixed radius, while taking updates on the tangent space can violate this requirement. Therefore, in Line 9 (Algorithm 2), we explicitly control the deviation of the iterates from the origin and break the loop as soon as an iterate leaves the ball, returning a point on the boundary of the ball. By carefully balancing the parameters, we can show that whenever the iterates escape the constraint ball, the function value has already decreased substantially.
To simplify notation, most of the time we refer to Algorithm 2 as TSSRG($\cdot$). A schematic implementation of both routines is sketched below.
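The sketch below is our paraphrase of Algorithms 1 and 2, not the authors' exact pseudocode: the interface `pullback_grads`, the stopping rule, and all parameter names are illustrative assumptions, and the tangent space is identified with $\mathbb{R}^d$.

```python
import numpy as np

def tssrg(x, pullback_grads, retract, n, m, b, eta, ball_radius, T, rng, xi=None):
    """Sketch of TSSRG (Algorithm 2): SRG updates on the tangent space T_x M.

    pullback_grads(idx, u): mini-batch gradient of the pullback components
    {fhat_{i,x}} at u, averaged over the index set idx (assumed interface).
    """
    u = np.zeros_like(x) if xi is None else xi        # optional perturbation
    v = pullback_grads(np.arange(n), u)               # full/large-batch gradient
    for t in range(T):
        if t > 0 and t % m == 0:                      # start of a new epoch
            v = pullback_grads(np.arange(n), u)
        u_next = u - eta * v
        if np.linalg.norm(u_next) > ball_radius:      # Lipschitzness holds only in the ball:
            u_next *= ball_radius / np.linalg.norm(u_next)
            return retract(x, u_next)                 # return a boundary point
        idx = rng.choice(n, size=b, replace=False)
        # Recursive estimator (2): correct v by a mini-batch difference.
        v = pullback_grads(idx, u_next) - pullback_grads(idx, u) + v
        u = u_next
        # Probabilistic break <=> uniform output over the epoch (large-gradient mode).
        if xi is None and rng.random() < 1.0 / (m - (t % m)):
            break
    return retract(x, u)

def prsrg(x0, riem_grad_norm, make_pullback_grads, retract, n, m, b, eta,
          ball_radius, perturb_radius, T_escape, eps, rounds, rng):
    """Sketch of PRSRG (Algorithm 1): perturb whenever the gradient is small."""
    x = x0
    for _ in range(rounds):
        grads = make_pullback_grads(x)                # pullback components at x
        if riem_grad_norm(x) > eps:                   # large gradient: one SRG epoch
            x = tssrg(x, grads, retract, n, m, b, eta, ball_radius, T=m, rng=rng)
        else:                                         # small gradient: perturb, then escape
            d = x.size
            xi = rng.normal(size=d)
            xi *= perturb_radius * rng.uniform() ** (1.0 / d) / np.linalg.norm(xi)
            x = tssrg(x, grads, retract, n, m, b, eta, ball_radius,
                      T=T_escape, rng=rng, xi=xi)
    return x
```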
5 Main results
In this section, we present the main complexity results of PRSRG in finding second-order stationary points.
Under the finite-sum setting, we choose the check batch to be the full sample set, so Line 3 uses the exact gradient. We set the parameters as in (3), and we also require the first-order tolerance to be suitably small. This is not a strict condition, given that the step size can be chosen arbitrarily small. The second-order tolerance $\delta$ needs to be smaller than the Lipschitz constant $L$ in Assumption 2. Otherwise, when $\delta \ge L$, any $x$ with small gradient is already an $(\epsilon,\delta)$-second-order stationary point, because Assumption 2 bounds the smallest eigenvalue of the pullback Hessian from below by $-L \ge -\delta$.
Theorem 1 (Finite-sum complexity)
Up to some constants and poly-log factors, the complexity in Theorem 1 is exactly the same as when the optimization domain is a vector space (i.e., $\mathcal{M} = \mathbb{R}^d$). Indeed, our result is a direct generalization of the Euclidean counterpart, where the retraction reduces to the identity and the Lipschitzness conditions are made with respect to the iterates themselves. Setting the same parameters as in (3), except for the conditions that become vacuous in the Euclidean case, we recover the Euclidean perturbed stochastic recursive gradient algorithm (Li, 2019).
Under online setting, the complexities of stochastic gradient queries are presented below.
Theorem 2 (Online complexity)
Note that the complexities in this paper are analysed in terms of achieving $(\epsilon,\delta)$-second-order stationarity. Some literature prefers to choose $\delta = \sqrt{\rho\epsilon}$ to match the units of gradient and Hessian, following the work in (Nesterov and Polyak, 2006). In this case, our complexities reduce accordingly for finite-sum and online problems. Compared to the optimal rates of $O(\sqrt{n}\epsilon^{-2})$ and $O(\epsilon^{-3})$ for finding first-order stationary points, our rates are sub-optimal even if we ignore the poly-log factors. This nevertheless appears to be the case for all stochastic variance reduction methods (Ge et al., 2019; Li, 2019).
6 High-level proof ideas
Now we provide a high-level proof roadmap for the second-order convergence guarantees of our proposed method. The main proof strategies are similar to those in (Li, 2019). However, we need to carefully handle the particularities of manifold geometry as well as manage the regularity conditions unique to this setting. We focus on finite-sum problems and only highlight the key differences under the online setting.
6.1 Finite-sum setting
We first show how stochastic recursive gradient updates achieve sublinear convergence in expectation by periodically computing a large-batch gradient. From the smoothness of the pullback function, we have
$$\hat{f}_x(u_{t+1}) \le \hat{f}_x(u_t) - \frac{\eta}{2}\|\nabla \hat{f}_x(u_t)\|^2 - \Big(\frac{\eta}{2} - \frac{\ell\eta^2}{2}\Big)\|v_t\|^2 + \frac{\eta}{2}\|v_t - \nabla \hat{f}_x(u_t)\|^2. \qquad (4)$$
To achieve first-order stationarity, it is sufficient to bound the variance term $\|v_t - \nabla \hat{f}_x(u_t)\|^2$ in expectation (see Nguyen et al. (2017b)). This suggests that the variance is gradually reduced when approaching optimality, where the gradient is small. Then, by carefully choosing the parameters, we can show that a single epoch satisfies
$$\mathbb{E}\big[\hat{f}_x(u_m)\big] \le \hat{f}_x(u_0) - \frac{\eta}{2} \sum_{t=0}^{m-1} \mathbb{E}\|\nabla \hat{f}_x(u_t)\|^2. \qquad (5)$$
Telescoping this result over all epochs and choosing the output uniformly at random from all iterates, we can guarantee that the output is an approximate first-order stationary point. This gives the optimal stochastic gradient complexity of $O(\sqrt{n}\epsilon^{-2})$ by choosing the epoch length and mini-batch size on the order of $\sqrt{n}$.
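For intuition, the arithmetic behind this choice (our back-of-the-envelope version, with constants suppressed) reads: with epoch length $m = \sqrt{n}$ and mini-batch size $b = \sqrt{n}$, each epoch costs $n + mb = 2n$ stochastic gradients over $m$ inner steps, i.e., an amortized $O(\sqrt{n})$ per step; since roughly $O(\Delta_f/(\eta\epsilon^2))$ steps suffice to drive the gradient norm below $\epsilon$, the total cost is
$$O\Big(\frac{n + mb}{m} \cdot \frac{\Delta_f}{\eta\epsilon^2}\Big) = O\Big(\frac{\sqrt{n}\,\Delta_f}{\eta\epsilon^2}\Big) = O\big(\sqrt{n}\,\epsilon^{-2}\big).$$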
To achieve second-order stationarity, the algorithm goes through two phases: large gradients and around saddle points. We present two informal lemmas corresponding to these two phases, with parameter settings omitted; see the appendix for more details.
Lemma 4
When the current iterate has a large gradient, running TSSRG for a single epoch gives three possible cases:
1. When the iterates do not leave the constraint ball:
(a) If at least half of the iterates in the epoch have pullback gradient norm below the threshold, then with probability at least 1/2, the output satisfies the small-gradient condition.
(b) Otherwise, with constant probability, the function value decreases sufficiently.
2. When one of the iterates leaves the constraint ball, with high probability, the function value decreases sufficiently, where the failure probability can be made arbitrarily small.
No matter which case occurs, the function value does not increase, with high probability.
Lemma 5
When the current iterate is around a saddle point, i.e., it has a small gradient and the smallest eigenvalue of the Hessian is below $-\delta$, running TSSRG from a perturbed iterate gives a sufficient decrease of the function value with high probability. That is, the function value decreases by at least a prescribed threshold, whose explicit form is deferred to the appendix.
Lemma 4 claims that when the gradient is large (phase 1), the output after running TSSRG for a single epoch either has a small gradient (Case 1a) or reduces the function value by a sufficient amount (Case 1b), with high probability. Note that we need to explicitly address the case when the iterates leave the constraint ball (Case 2). In this case, we show that the function value already decreases by the same desired amount, given that the first-order tolerance is chosen reasonably small, as in Theorem 1. Note that for Case 1a, the output satisfies the small-gradient condition and hence is immediately followed by perturbation and the follow-up updates in TSSRG. Notice that we can only show that the pullback gradient is small; to connect it to the Riemannian gradient, we need the singular value bound on the differentiated retraction in Lemma 3. In the other cases, the function value decreases sufficiently with high probability. As a result, given that the optimality gap is uniformly bounded, we can bound the number of such descent epochs, and with the chosen epoch length, mini-batch size and step size, the corresponding stochastic gradient complexity follows.
In phase 2, where the gradient is small, the current iterate is either already a second-order stationary point or around a saddle point. Lemma 5 states that running TSSRG from any perturbation within the perturbation ball decreases the function value by a sufficient amount with high probability. Again, since the optimality gap is bounded, the number of such escape epochs is bounded, and similarly for the number of stochastic gradient queries under the chosen parameters. Combining the complexities of phase 1 and phase 2 gives the result. For a more rigorous analysis, we need to consider the number of wasted epochs, in which neither the function decreases sufficiently nor the gradient of the output is small. The complexity of such epochs, however, turns out not to exceed the complexities established above. Detailed proofs are included in Appendix C.1.
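The epoch-counting arithmetic behind both phases can be summarized schematically (this is our abstraction; $E_1$ and $E_2$ stand for the per-epoch decreases guaranteed by Lemmas 4 and 5, whose explicit values are set in the appendix):
$$\#\{\text{descent epochs}\} \le \frac{\Delta_f}{E_1}, \qquad \#\{\text{escape epochs}\} \le \frac{\Delta_f}{E_2}, \qquad \Delta_f := f(x_0) - f^*,$$
and since each epoch costs at most $n + mb$ stochastic gradient queries, the total complexity is of order
$$(n + mb)\Big(\frac{\Delta_f}{E_1} + \frac{\Delta_f}{E_2}\Big).$$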
Next we will briefly explain how Lemma 4 (large gradient phase) and Lemma 5 (around saddle point phase) are derived.
Large gradient phase. The key result underlying Lemma 4 is a high-probability version of (5). To this end, we first need a high-probability bound on the variance term $\|v_t - \nabla \hat{f}_x(u_t)\|$. It is not difficult to verify that the estimation error forms a martingale sequence. As required by the Azuma-Hoeffding inequality (Lemma 7, Appendix A), in order to bound this martingale, we need to bound its difference sequence, which can be done by applying the vector Bernstein inequality (Lemma 6, Appendix A). After bounding the variance term, we can substitute this result into (4) to obtain
(6)
for all inner iterations. Note that we always call TSSRG for only one epoch at a time; therefore, it is sufficient to consider the first epoch in TSSRG. Next, the analysis divides according to whether the iterates leave the constraint ball. When all iterates stay within the boundary of the ball, inequality (6) suggests that if at least half of the iterates in this epoch have large gradients, then we obtain a sufficient decrease; otherwise, the output, selected uniformly from the iterates of this epoch, has a small gradient with high probability. On the other hand, when one of the iterates escapes the constraint ball, we can still show a sufficient decrease by a localization lemma (Lemma 10, Appendix C.2), derived from (4) and the high-probability bound on the variance term, which controls the distance travelled by the iterates in terms of the accumulated function value decrease. This bound implies that if the iterates are far from the origin, the function value has already decreased a lot.
Around saddle point phase. When the current iterate is around a saddle point, we need to show that the objective can still decrease at a reasonable rate with high probability. At a high level, we adopt the same coupling-sequence analysis originally introduced in (Jin et al., 2017). Define the stuck region as the set of perturbations from which running TSSRG does not give a sufficient function value decrease. Consider two initializations that differ only along the direction of the smallest eigenvector of the Hessian at the origin, with an offset that can be chosen arbitrarily small. Then we can prove that, with high probability, at least one of the two sequences generated by running TSSRG from these perturbations achieves a large deviation from its initialization within the prescribed number of steps (see Lemma 12).
This result, together with the localization lemma, indicates that at least one of the sequences also achieves a large function value decrease. This directly suggests that the width of the stuck region along the escape direction is small. Based on some geometric results, the probability of any perturbation falling in the stuck region is correspondingly small; in other words, with high probability, an arbitrarily chosen perturbation falls outside the stuck region, in which case we achieve a sufficient decrease of the function value. By carefully selecting the radius of the perturbation ball, we can also bound the function value change introduced by the perturbation itself. Finally, combining these two results yields Lemma 5.
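Schematically (our paraphrase of the standard coupling argument; the symbols are illustrative rather than the paper's exact choices), write $w_t := u_t - u_t'$ for the difference of the coupled sequences. Linearizing the recursive updates around the saddle gives
$$w_{t+1} \approx (I - \eta\mathcal{H})\, w_t, \qquad \mathcal{H} := \nabla^2 \hat{f}_x(0),$$
so the component of $w_t$ along the smallest eigenvector, whose eigenvalue is at most $-\delta$, grows like $(1 + \eta\delta)^t$. Once this exponential growth exceeds the ball radius, at least one of the two sequences must have moved far from its initialization.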
6.2 Online setting
Consider online problems where the full gradient is inaccessible. The proof roadmap is the same as in the finite-sum setting, but the reference gradient at the start of each epoch is now a large-batch estimate. Most key results are relaxed with an additional term that relates to the variance of the stochastic gradients (Assumption 5).
Large gradient phase. In phase 1, we can show that (6) holds with an additional variance term. That is,
Note that under the online setting, the small-gradient condition is checked against the large-batch gradient rather than the full gradient (Line 3, Algorithm 1). Therefore, compared to the finite-sum case, we require an extra bound on the difference between the full gradient and the large-batch gradient, which can be obtained by the Bernstein inequality. By choosing the batch size appropriately, results similar to Lemma 11 can be derived.
Around saddle point phase. In phase 2, we obtain the same inequality as in Lemma 14, with differences only in the parameter settings.
These results guarantee that the numbers of phase-1 and phase-2 epochs match those of the finite-sum setting, up to some constants and poly-log factors. Following similar logic and choosing the parameters accordingly, we obtain the complexity in Theorem 2.
7 Conclusion
In this paper, we generalized the perturbed stochastic recursive gradient method to Riemannian manifolds by adopting the idea of tangent space steps introduced in (Criscitiello and Boumal, 2019). This avoids the more involved geometric results used in (Sun et al., 2019) and thus largely simplifies the analysis. We showed that, up to some constants and poly-log factors, our generalization achieves the same stochastic gradient complexities as the Euclidean version (Li, 2019). Under the finite-sum setting, our result is strictly superior to PRGD by Criscitiello and Boumal (2019), and to PRAGD by Criscitiello and Boumal (2020) for large-scale problems. We also proved an online complexity which is, to the best of our knowledge, the first result on finding second-order stationary points on manifolds using only first-order information.
References
- Absil et al. (2009) P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
- Agarwal et al. (2018) Naman Agarwal, Nicolas Boumal, Brian Bullins, and Coralia Cartis. Adaptive regularization with cubics on manifolds. arXiv preprint arXiv:1806.00065, 2018.
- Ahn and Sra (2020) Kwangjun Ahn and Suvrit Sra. From Nesterov's estimate sequence to Riemannian acceleration. arXiv preprint arXiv:2001.08876, 2020.
- Allen-Zhu and Li (2018) Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3716-3726, 2018.
- Arjevani et al. (2019) Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
- Bonnabel (2013) Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, 2013.
- Boumal (2020) Nicolas Boumal. An introduction to optimization on smooth manifolds. 2020.
- Boumal et al. (2019) Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1-33, 2019.
- Chung and Lu (2006) Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79-127, 2006.
- Criscitiello and Boumal (2020) Chris Criscitiello and Nicolas Boumal. An accelerated first-order method for non-convex optimization on manifolds. arXiv preprint arXiv:2008.02252, 2020.
- Criscitiello and Boumal (2019) Christopher Criscitiello and Nicolas Boumal. Efficiently escaping saddle points on manifolds. In Advances in Neural Information Processing Systems, pages 5987-5997, 2019.
- Du et al. (2017) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067-1077, 2017.
- Fang et al. (2018) Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689-699, 2018.
- Ge et al. (2019) Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang. Stabilized SVRG: Simple variance reduction for nonconvex optimization. arXiv preprint arXiv:1905.00529, 2019.
- Hoeffding (1994) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409-426. Springer, 1994.
- Hosseini and Sra (2020) Reshad Hosseini and Suvrit Sra. An alternative to EM for Gaussian mixture models: Batch and stochastic Riemannian optimization. Mathematical Programming, 181(1):187-223, 2020.
- Huang et al. (2015) Wen Huang, Kyle A Gallivan, and P-A Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660-1685, 2015.
- Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
- Jin et al. (2018) Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pages 1042-1085. PMLR, 2018.
- Jin et al. (2019) Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. arXiv preprint arXiv:1902.04811, 2019.
- Kasai et al. (2018) Hiroyuki Kasai, Hiroyuki Sato, and Bamdev Mishra. Riemannian stochastic recursive gradient algorithm. In International Conference on Machine Learning, pages 2516-2524, 2018.
- Li (2019) Zhize Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. In Advances in Neural Information Processing Systems, pages 1523-1533, 2019.
- Nesterov and Polyak (2006) Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
- Nguyen et al. (2017a) Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017a.
- Nguyen et al. (2017b) Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017b.
- Reddi et al. (2016) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.
- Sun et al. (2019) Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 7276-7286, 2019.
- Tropp (2012) Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.
- Xu et al. (2018) Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5530-5540, 2018.
- Zhang and Sra (2018) Hongyi Zhang and Suvrit Sra. Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.
- Zhang et al. (2016) Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592-4600, 2016.
- Zhou et al. (2018) Dongruo Zhou, Pan Xu, and Quanquan Gu. Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782, 2018.
A Useful concentration bound
This section presents some useful concentration bounds on vector spaces, which are used to derive high-probability bounds for sequences defined on the tangent spaces of the manifold.
Lemma 6 (Vector Bernstein inequality, Tropp (2012))
Given a sequence of independent, zero-mean random vectors $\{x_k\}$ in $\mathbb{R}^d$ which satisfies $\|x_k\| \le R$ almost surely, then for all $t \ge 0$,
$$\mathbb{P}\Big\{\Big\|\sum_k x_k\Big\| \ge t\Big\} \le (d+1)\exp\Big(\frac{-t^2/2}{\sigma^2 + Rt/3}\Big),$$
where $\sigma^2 = \sum_k \mathbb{E}\|x_k\|^2$.
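The following quick Monte Carlo check illustrates the statement numerically, assuming the form of the bound as reconstructed above; all numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_vecs, R, trials, t = 3, 200, 1.0, 20_000, 25.0

def sum_norm():
    # Independent, zero-mean, bounded vectors: uniform direction, norm R.
    dirs = rng.normal(size=(n_vecs, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return np.linalg.norm((R * dirs).sum(axis=0))

sigma2 = n_vecs * R**2                      # sum of E||x_k||^2 (here ||x_k|| = R exactly)
empirical = np.mean([sum_norm() >= t for _ in range(trials)])
bernstein = (d + 1) * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
print(f"empirical tail: {empirical:.4f} <= Bernstein bound: {bernstein:.4f}")
```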
B Regularity conditions on Riemannian manifold
In this section, we prove some regularity conditions on manifolds that are fundamental for Riemannian optimization, as seen in the literature (Criscitiello and Boumal, 2020; Boumal, 2020).
Lemma 2 ($\ell$-Lipschitz continuous) Under Assumption 2, for all $x \in \mathcal{M}$, there exists a constant $\ell$ such that each $\nabla \hat{f}_{i,x}$ is $\ell$-Lipschitz continuous inside the constraint ball. This also implies that $\nabla \hat{f}_x$ is $\ell$-Lipschitz continuous; that is, for any $u, v$ inside the ball, we have $\|\nabla \hat{f}_x(u) - \nabla \hat{f}_x(v)\| \le \ell \|u - v\|$.
Proof  The proof that each pullback component gradient is Lipschitz continuous is the same as in Criscitiello and Boumal (2019); we include it here for completeness. From Assumption 2, we have the stated Hessian bound. Thus,
Hence, for any , we obtain
This implies that the full gradient is Lipschitz continuous because for any
This completes the proof.
∎
Lemma 3 (Singular value bound of the differentiated retraction) For all $u \in T_x\mathcal{M}$ with $\|u\|$ sufficiently small relative to the constant $c_0$ of Assumption 4, the singular values of $\mathrm{D}R_x(u)$ are bounded below by a positive constant.
Proof
From Assumption 4, we have $\|\mathrm{D}R_x(u) - \mathcal{T}_u\| \le c_0\|u\|$. Therefore $\sigma_{\min}(\mathrm{D}R_x(u)) \ge \sigma_{\min}(\mathcal{T}_u) - c_0\|u\| = 1 - c_0\|u\|$, where the last equality uses the fact that all singular values of an isometric operator are equal to 1.
∎
Lemma 8 (Gradient and Hessian of the pullback under second-order retraction)
Given a second-order retraction $R$, both the gradient and the Hessian of the pullback function evaluated at the origin of $T_x\mathcal{M}$ match the Riemannian gradient and Hessian of $f$. That is, for all $x \in \mathcal{M}$,
$$\nabla \hat{f}_x(0) = \mathrm{grad} f(x), \qquad \nabla^2 \hat{f}_x(0) = \mathrm{Hess} f(x).$$
Proof  The proof is mainly based on (Boumal, 2020); we include it here for completeness. First note that for any retraction (not necessarily second-order), the gradients match. That is, for any tangent vector, by the chain rule,
where we use the defining properties of a retraction, namely $R_x(0) = x$ and $\mathrm{D}R_x(0) = \mathrm{id}$. Then we can use the definition of the Riemannian gradient and its uniqueness property to show the result. Next, we prove the second result. Consider the second-order Taylor expansion of $f$ from $x$ to $R_x(u)$ along the retraction curve:
$$f(R_x(u)) = f(x) + \langle \mathrm{grad} f(x), u \rangle_x + \tfrac{1}{2}\langle \mathrm{Hess} f(x)[u], u \rangle_x + O(\|u\|^3), \qquad (8)$$
where the absence of a first-order curvature term is due to the zero initial acceleration of a second-order retraction. Also, since $\hat{f}_x$ is a "classic" function from a vector space to the real numbers, we can use a classic Taylor expansion of this function:
$$\hat{f}_x(u) = \hat{f}_x(0) + \langle \nabla \hat{f}_x(0), u \rangle + \tfrac{1}{2}\langle \nabla^2 \hat{f}_x(0)[u], u \rangle + O(\|u\|^3). \qquad (9)$$
Given that we already have $\nabla \hat{f}_x(0) = \mathrm{grad} f(x)$ and $\hat{f}_x(u) = f(R_x(u))$, comparing (9) with (8) yields $\nabla^2 \hat{f}_x(0) = \mathrm{Hess} f(x)$.
∎
C Proof for finite-sum setting
In this section, we prove the main complexity results under the finite-sum setting. In this case, the check batch equals the full sample set, and hence in Algorithm 2 we have access to the full gradient. We start with the proof of the main complexity theorem in Subsection C.1. We then prove the key lemmas necessary to derive the theorem in C.2.
C.1 Proof for main Theorem
Theorem 1 (Finite-sum complexity) Under Assumptions 1 to 4, consider the finite-sum optimization setting. For any starting point, with the choice of parameters as
specified, and supposing the tolerances satisfy the conditions stated in Section 5, with high probability, PRSRG will at least once visit an $(\epsilon,\delta)$-second-order critical point using
stochastic gradient queries, where $\Delta_f := f(x_0) - f^*$.
Proof  We enumerate all possible cases when running the main algorithm PRSRG. Notice that we need to explicitly discuss the case where the iterates escape the constraint ball within an epoch. In the large-gradient situation, when the iterates leave the constraint ball, the epoch achieves a large function decrease with high probability (hence it is labelled a descent epoch). Around saddle points, when the iterates leave the constraint ball, the function value already decreases with probability 1 (hence this case is merged with the case where the iterates stay inside the ball).
• Large gradients:
1. Type-1 descent epoch: iterates escape the constraint ball.
2. Iterates do not escape the constraint ball:
(a) Type-2 descent epoch: at least half of the iterates in the current epoch have pullback gradients larger than the threshold.
(b) Useful epoch: at least half of the iterates in the current epoch have pullback gradients no larger than the threshold, and the output of the current epoch has a small gradient. (Since the output satisfies the small-gradient condition, the next epoch will run TSSRG to escape saddle points.)
(c) Wasted epoch: at least half of the iterates in the current epoch have pullback gradients no larger than the threshold, yet the output of the current epoch has a large gradient.
• Around saddle points (small gradient, sufficiently negative smallest Hessian eigenvalue):
3. Type-3 descent epoch: the current iterate is around a saddle point.
First, because the output of the current epoch is randomly selected from its iterates, the probability of a wasted epoch is at most 1/2. Also, due to the independence of each wasted epoch, with high probability a wasted epoch occurs consecutively at most logarithmically many times before either a descent epoch (of Type 1 or 2) or a useful epoch. (Footnote 1: That is, the probability that at least half of the iterates have small pullback gradients yet the uniformly selected output has a large gradient is at most 1/2; hence the probability of many consecutive wasted epochs decays geometrically, and with high probability there is at least one useful epoch or Type-2 descent epoch among them.) Hereafter, we use $N_1$, $N_2$ and $N_3$ to respectively denote the numbers of the three types of descent epochs.
Consider a Type-1 descent epoch. From Case 2 in Lemma 11, with constant probability the function value decreases by at least the prescribed amount, and with high probability the function value does not increase. Hence, by standard concentration, after $N_1$ such epochs the function value is reduced proportionally with high probability. Given that the optimality gap is bounded by $\Delta_f$, the decrease cannot exceed $\Delta_f$; therefore $N_1$ is bounded. Similarly, the number $N_2$ of Type-2 descent epochs is bounded.
Consider a useful epoch with output $x$. If, furthermore, the smallest Hessian eigenvalue at $x$ is not too negative, then $x$ is already an $(\epsilon,\delta)$-second-order critical point. Otherwise, the useful epoch is followed by a Type-3 descent epoch around a saddle point. From Lemma 14, we know that the function value then decreases sufficiently with high probability. Similar to the argument for the other types of descent epochs, $N_3$ is bounded, where we omit lower-order terms.
Hence, we have the following stochastic gradient complexity:
where , , .
∎
C.2 Proof for key Lemmas
The organization of these lemmas is as follows. In Lemma 9, we first prove a high-probability bound on the estimation error of the modified gradient. This replaces the in-expectation bound commonly used in deriving first-order guarantees. Lemma 10 shows that iterates that deviate far from the initialization also achieve a large function value decrease. These two results are subsequently used to derive Lemma 11, a descent lemma for the large-gradient phase.
For the saddle-point phase, we first prove that at least one of the coupled sequences achieves a large deviation from its initialization (Lemma 12). This then translates into a sufficient function value decrease in Lemma 13. Finally, Lemma 14 shows that, with high probability, the iterates escape the saddle point and decrease the function value by the desired amount.
Lemma 9 (High probability bound on estimation error)
Under Assumption 2, we have the following high probability bound for estimation error of the modified gradient under finite-sum setting. That is, for ,
Proof  For simplicity of notation, consider a single epoch. Consider two sequences on the tangent space, one being the estimation error and the other its increments. It is easily verified that the former is a martingale and the latter is its difference sequence with respect to the natural filtration, where we use the unbiasedness of i.i.d. sampling. Hence, to bound the martingale, we first need to bound its difference sequence, as required by the Azuma-Hoeffding lemma. We can use the vector Bernstein inequality to bound the differences as follows. First note that
Denote and therefore with . In order to apply Bernstein inequality, we need to show that is bounded. This is achieved by Lemma 2. That is,
Also, the total variance is computed as
where the first inequality holds due to and the second last inequality applies and the last inequality uses the gradient Lipschitzness result in Lemma 2. Finally we can apply Bernstein inequality (Lemma 6) to bound . That is,
where the second inequality substitutes as and as in Lemma 6. It also uses the fact that . The last inequality holds by the choice (for example, ). This gives a probability bound for , which is
Now given the bound on , we can bound using the Azuma-Hoeffding inequality. Suppose we set , where is the epoch length. Then by union bound, for
(10)
Therefore, the probability that for all is at least . Hence by Lemma 7, we have
(11)
where the last inequality holds due to and the choice that
where we denote . Note under finite-sum setting, . Thus (11) implies for ,
(12)
holds with probability . Note that we can always set while the result still holds because . Hence the probability reduces to .
∎
Lemma 10 (Improve or localize)
Let the sequence be generated by running TSSRG, and suppose we choose the step size as specified. Then we have
Proof  First, by generalizing (16) to any epoch, we have
(13)
where the last inequality holds due to the choice of step size and the stated assumption. Summing (13) over all epochs up to the current iterate gives
Also by Cauchy-Schwarz inequality and triangle inequality,
The proof is complete by noting .
∎
Lemma 11 (Large gradient descent lemma)
(Lemma 4 in the main text.) Under Assumptions 2, 3, and 4, suppose we choose the parameters as specified, and consider an iterate with a large gradient. Then, by running TSSRG for a single epoch, we have the following three cases:
1. When the iterates do not leave the constraint ball:
(a) If at least half of the iterates in the epoch have pullback gradient norm below the threshold, then with probability at least 1/2, the output satisfies the small-gradient condition.
(b) Otherwise, with constant probability, the function value decreases sufficiently.
2. When one of the iterates leaves the constraint ball, with high probability, the function value decreases sufficiently.
Regardless of which case occurs, the function value does not increase with high probability.
Proof  First note that when the gradient is large, we always call TSSRG with the total number of iterations set to a single epoch; hence it suffices to consider the first epoch in TSSRG. Compared to the proof in (Li, 2019), we further need to address the case where the iterates fall outside the prescribed ball, so we divide the proof into two parts.
1. Iterates do not leave the constraint ball. By -Lipschitzness in Lemma 2, we have
(14)
From Lemma 9, we know that
(15)
holds with high probability . By a union bound, and therefore for all , (15) holds with probability . Setting as we have for all ,
Substituting this result into (14) and summing over this epoch from to gives
(16)
(17)
where the constants are as specified. The second inequality uses the variance bound, the third inequality holds due to the choice of step size, and the last inequality holds under the stated assumption. Note that we require (17) to hold for all inner iterations, and thus we adjust the failure probability accordingly. Then we have the following two cases.
• (Case 1a) Suppose at least half of the iterates in the epoch have pullback gradient norm below the threshold. Then, by uniform sampling (i.e., uniformly breaking as in Algorithm 2, Line 12), the output has pullback gradient norm below the threshold with probability at least 1/2. Recalling the expression for the pullback gradient in Lemma 1, the output of TSSRG then satisfies the corresponding bound on its Riemannian gradient, where we use the singular value bound of Lemma 3.
• (Case 1b) Suppose at least half of the points in the epoch have pullback gradient norm above the threshold. With probability 1/4, the output falls within the last quarter of the epoch by uniform sampling. In this case, at least a quarter of the points with large gradient appear before the output. Thus, by (17), we have
(18)
Note that (17) holds with high probability; then, by a union bound, (18) holds with high probability as well. Without loss of generality, we can choose the constants so that (18) holds with at least a constant probability.
2. Iterates leave the constraint ball. Suppose that at some inner iteration, the iterate leaves the ball. Then, by Lemma 10, we know that the function value has already decreased a lot. That is, with high probability, running TSSRG gives
where the last inequality follows from the choice of the first-order tolerance, which can be made suitably small. Note that this requirement is not difficult to satisfy, since the associated constant can be made sufficiently large. Hence, by returning the boundary point as the output, we have
In summary, regardless of whether the iterates stay within the ball, with high probability, either the gradient norm of the output is small or the function value decreases a lot. Note that for Case 1a, the function value still decreases, by (17).
∎
Lemma 12 (Small stuck region)
Consider a point with small gradient and sufficiently negative smallest Hessian eigenvalue. Let there be two random perturbations satisfying the stated coupling relation, differing only along the smallest eigenvector direction of the Hessian, with an offset that can be chosen arbitrarily small. Also set the parameters as specified. Then, for the coupled sequences generated by running TSSRG twice with the same sets of mini-batches, with high probability, we have
where and .
Proof  The proof is by contradiction. So we assume
(19)
First, we note that neither of the two sequences escapes the ball within the prescribed number of steps under condition (19). This is because (taking one sequence for example),
where we apply the stated bounds. Hence, if condition (19) is satisfied, the sequences must remain inside the ball. As a result, we can proceed with the proof by following a similar idea as in (Li, 2019), which is to show an exponential growth in the distance between the two coupled sequences that ultimately exceeds the bound in (19) via the triangle inequality. This gives rise to a contradiction.
Denote and . With , we can express as
(20)
(21)
where and . Note we can bound as
(22)
where . The first inequality uses Assumption 2 and the last inequality follows from . Denote for ,
(23)
(24)
By induction, we can show the stated decomposition. Observe that the first term grows exponentially, so it remains to verify that the second term is dominated by the first as the iteration index increases. To this end, we inductively show two conditions. First note that at the initial step these two conditions clearly hold. Now suppose the claims have been proved up to the current step. Then we immediately have, by the triangle inequality,
(25)
which holds up to the current step. We first prove that the second condition holds at the next step.
Proof that is bounded by . Note that from the definition in (23),
Then we can bound by respectively bounding the two terms. The first term is bounded as follows,
(26)
where the second inequality holds because . The third inequality applies the first condition for . The fourth inequality is by and . The last inequality follows by the choice of parameters and where is a constant defined later. The second term can also be similarly bounded as
(27)
where we apply the bound on in the second inequality. The last inequality uses and . From (26) and (27), we prove . Note we can always assume given its requirement. Therefore and it is sufficient to require to guarantee the result. Now we can proceed to prove the first condition, which is an intermediate result that has been used in proving the second condition.
Proof that . We first re-write into a recursive form as
where we denote . It is easy to verify that is a martingale sequence and is its difference sequence. Similar to the proof strategy as for Lemma 9, we first need to derive bound for by Bernstein inequality. Denote as the component of . That is
Then is bounded as
(28)
where , , , . The second last inequality uses triangle inequality and the last inequality considers Assumption 2 and also the constraint on (similarly ) in (22). The total variance is derived as
where the first inequality uses and the last inequality follows similar logic as (28). Hence by vector Bernstein Lemma 6, we obtain
where we choose based on similar logic as in Lemma 9. Therefore, we can bound with high probability. That is, with probability ,
We can now obtain a bound on the martingale. We only need to consider a single epoch, because of the full gradient evaluation at the start of each epoch. Note that, similar to (10) and a union bound, the bound on the difference sequence holds for all steps with high probability. Then, applying the Azuma-Hoeffding inequality (Lemma 7) to the martingale sequence yields
where we choose
By a union bound, and therefore for all ,
Note by simply setting as , we obtain with probability , for all ,
(29)
What remains to be shown is that the right hand side of (29) is bounded. From (20), we first have
(30)
The first term can be bounded as follows. First note that the Hessian satisfies by construction, where represents eigenvalues of . Although , it is difficult to compare and since . Thus, we bound the term by projecting it to the following two subspaces:
• the subspace spanned by the eigenvectors of the Hessian whose eigenvalues are within the first range;
• the subspace spanned by the eigenvectors of the Hessian whose eigenvalues are within the second range.
That is, for the first case
(31)
where we use the bound on and the fact that . For the second case, we have from (21),
(32)
(33)
The second inequality is due to for . That is, given the choice and , we obtain . The third inequality uses the finite-sum bound on Harmonic series. Inequality (32) applies (i) the bound on as in (22) (ii) the inductive bound on as in (25) and (iii) the inductive assumption on . The second last inequality is by the choice . It is easy to verify that the right-hand-side of (33) is larger than right-hand-side of (31). Hence combining the two cases gives
(34)
The second term in (30) is bounded as
(35)
where the second inequality is derived similarly as (32) and the last inequality is again due to the assumption that . Combining (34) and (35) gives a bound,
(36)
where the second inequality uses the stated definitions. The last inequality holds by considering the dominant term, since one constant is sufficiently small while the other factor can be made sufficiently large to ensure the validity of this result. Here, for simplicity, we keep the same notation, since this will not affect the result.
Similarly, we can bound as
(37)
where the first inequality applies the second condition for and the second inequality again uses . By combining results in (36) and (37), we have
where we denote (note in Lemma 11). The second inequality considers and . The last inequality follows by the parameter setting that and , . This completes the proof of the bound on .
Finally, we can proceed to derive a contradiction. First, given the bound established above, we have
where the last inequality follows by the choice that , where is chosen such that . This requirement is reasonable since when , increases while is a constant. In this case,
However, we have for , , which gives a contradiction. Hence , such that . Given changes throughout optimization process, we may choose . Since , the result still holds.
∎
Lemma 13 (Descent around saddle points)
Proof  We again prove this result by contradiction. Suppose
(38)
Then we first claim that both coupled sequences stay within the prescribed ball. This is verified by contradiction. Assume, without loss of generality, that at some iteration one of the sequences leaves the ball. In this case, TSSRG returns a boundary point, and hence
(39)
where the second inequality uses Lemma 10, applied from the start of the current epoch. However, by the parameter choice, the right-hand side exceeds the assumed bound, which gives a contradiction. Therefore, under condition (38), the two coupled sequences do not escape the ball, and we can proceed similarly to (Li, 2019).
First, Lemma 12 claims that at least one of the sequences achieves a large deviation at some step. Without loss of generality, suppose it is achieved by the first sequence. Then, by Lemma 10, the corresponding function value decreases substantially. This implies
where we use the choice of the parameters. This contradicts (38), and therefore the proof is complete.
∎
Without loss of generality, the result in Lemma 13 can be written with the maximum number of iterations. This represents the worst-case scenario, where we require the full budget of TSSRG iterations to reach a large function decrease. (Footnote 2: We can also add a stopping criterion that breaks at the earliest iteration at which a large function value decrease is reached.)
Lemma 14 (Escape stuck region)
Proof  First, define the stuck region formally as the set of perturbations such that running TSSRG from any point initialized in this region will not give a sufficient decrease of the function value. Similar to (Li, 2019), the aim is to show that this region is small in volume. First note that iterates with initialization within the stuck region do not escape the constraint ball, by the argument in (39). Hence, if the iterates leave the ball, the output already escapes the stuck region with a large function decrease. Given that the tangent space is a $d$-dimensional vector space, we can perform a similar analysis as in (Jin et al., 2017; Li, 2019).
Consider the two coupled sequences with starting points satisfying the coupling relation of Lemma 12, where the offset is along the smallest eigenvector of the Hessian. Therefore, from Lemma 13, at least one of the two sequences finally leaves the stuck region after the prescribed number of steps with high probability. Consequently, the width of the stuck region along the escape direction is small. Based on a similar argument as in (Criscitiello and Boumal, 2019), the volume of the stuck region can be compared with that of the perturbation ball, via the volume of the $(d-1)$-dimensional sphere. Therefore,
where we have used Gautschi's inequality for the Gamma function. This shows that the probability of the perturbation falling into the stuck region is small. Therefore, with high probability, the perturbation lands outside the stuck region, in which case the function value decreases a lot. This is verified as follows. First note that the perturbed starting points lie within the perturbation ball. Then, by the Lipschitz continuity of the gradient,
(40)
using and also . Therefore
where we apply (40) and the fact that .
∎
D Proof for online setting
Here we provide the proof for online problems, where the full gradient must be replaced by a large-batch gradient. The proof is similar to that of the finite-sum setting, so we mainly present the key steps.
D.1 Proof for main Theorem
Theorem 2 (Online complexity) Under Assumptions 1 to 5, consider the online optimization setting. For any starting point, with the choice of parameters
as specified, and supposing the tolerances satisfy the stated conditions, with high probability, PRSRG will at least once visit an $(\epsilon,\delta)$-second-order critical point using
stochastic gradient oracle calls, where $\Delta_f := f(x_0) - f^*$.
Proof  Similar to Theorem 1, we have the following possible cases when running the main algorithm PRSRG:
• Large gradients, where :
  1. Type-1 descent epoch: Iterates escape the constraint ball.
  2. Iterates do not escape the constraint ball:
     (a) Type-2 descent epoch: At least half of the iterates in the current epoch have pullback gradient larger than .
     (b) Useful epoch: At least half of the iterates in the current epoch have pullback gradient no larger than and the output from the current epoch has batch gradient no larger than . (Since the output satisfies the small-gradient condition, the next epoch will run TSSRG to escape saddle points.)
     (c) Wasted epoch: At least half of the iterates in the current epoch have pullback gradient no larger than and the output from the current epoch has batch gradient larger than .
• Around saddle points, where and :
  3. Type-3 descent epoch: The current iterate is around a saddle point.
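As promised, here is a minimal Python-style sketch of the loop these cases classify: perturb and run TSSRG when the (batch) gradient is small, and run TSSRG directly otherwise. The helpers `batch_gradient`, `tssrg`, and `perturb` are hypothetical placeholders for the paper's Algorithms 1 and 2, not their actual implementation.

```python
import numpy as np

def prsrg(x, n_epochs, eps, radius, batch_gradient, tssrg, perturb):
    """Sketch of the perturbed Riemannian stochastic recursive gradient loop.

    Each pass corresponds to one epoch in the case analysis above: a
    large-gradient epoch runs TSSRG directly, while a small-gradient epoch
    perturbs within a tangent-space ball of the given radius and then runs
    TSSRG to escape a potential saddle point.
    """
    for _ in range(n_epochs):
        g = batch_gradient(x)          # large-batch estimate in the online setting
        if np.linalg.norm(g) <= eps:   # small gradient: possibly near a saddle
            x = tssrg(perturb(x, radius))
        else:                          # large gradient: a descent epoch
            x = tssrg(x)
    return x
```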
The following proof is very similar to that of Lemma 1. First, from Lemma 17, we know that the probability of a wasted epoch occurring is at most . Given the independence of different wasted epochs, with high probability a wasted epoch happens consecutively at most times before either a descent epoch (Type 1 or 2) or a useful epoch. We use to denote the three types of descent epochs, respectively.
For a Type-1 descent epoch, with probability , the function value decreases by at least . Hence, by standard concentration, after such epochs the function value decreases by with high probability. Given that is bounded by , the decrease cannot exceed . Thus, . Similarly, for a Type-2 descent epoch, . For a useful epoch, the output is either an -second-order critical point or the epoch is immediately followed by a Type-3 descent epoch around a saddle point. From Lemma 20, we know that the function value then decreases by with high probability. Therefore, by similar arguments, .
Combining these cases, we obtain the following stochastic gradient complexity:
where , and .
∎
D.2 Proofs of key lemmas
Lemma 15 (High-probability bound on estimation error)
Proof. For simplicity of notation, consider a single epoch . Because, under the online setting, , we first need to bound by the Bernstein inequality. By the assumption of bounded variance, we have
By Lemma 6,
where the first inequality also uses . The last inequality holds by setting . Thus, we obtain
(41)
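For reference, one standard form of the vector Bernstein inequality behind this step reads as follows (a sketch, not necessarily the exact statement of Lemma 6): for independent mean-zero random vectors $z_1,\dots,z_B\in\mathbb{R}^d$ with $\|z_i\|\le R$ almost surely and $\sum_{i}\mathbb{E}\|z_i\|^2\le v$,

```latex
\Pr\Big(\Big\|\sum_{i=1}^{B} z_i\Big\| \ge t\Big)
\le (d+1)\exp\!\Big(\frac{-t^{2}/2}{v + Rt/3}\Big),
```

so that averaging $B = \widetilde{O}(\sigma^2/\epsilon^2)$ bounded-variance component gradients keeps the estimation error at most $O(\epsilon)$ with high probability.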
Next, denote and . Then we follow the same steps as in Lemma 9 to obtain the bound on except that . From (12), we have
By a union bound, for , with probability at least ,
The proof is complete by setting as .
∎
Lemma 16 (Improve or localize)
Consider the sequence generated by running . Suppose we choose , , where . Then we have
with .
Proof. First, we generalize (44) to any epoch (i.e., ) as
(42)
where we choose and assume . Summing (42) for all epochs up to gives
Lastly, by the Cauchy–Schwarz and triangle inequalities, . Hence,
The proof is now complete.
∎
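For completeness, the Cauchy–Schwarz step used in the proof above can be spelled out as follows, writing $x_0,\dots,x_t$ for the tangent-space iterates under consideration:

```latex
\|x_t - x_0\| \;\le\; \sum_{s=0}^{t-1}\|x_{s+1}-x_s\|
\;\le\; \sqrt{\,t\sum_{s=0}^{t-1}\|x_{s+1}-x_s\|^{2}\,},
```

which converts the accumulated per-step movement into the localization bound stated in the lemma.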
Lemma 17 (Large gradient descent lemma)
Under Assumptions 2 to 5, suppose we set , where . Consider where , with . Then by running , we have the following three cases:
1. When the iterates do not leave the constraint ball :
   (a) If at least half of the iterates in the epoch satisfy for , then with probability at least , we have .
   (b) Otherwise, with probability at least , we have .
2. When one of the iterates leaves the constraint ball , with probability at least , we have .
Regardless of which case occurs, with high probability.
Proof. Similar to the proof of Lemma 11, we only need to consider a single epoch in TSSRG, i.e., . We also divide the proof into two parts.
1. Iterates do not leave the constraint ball. From Lemma 15 and a union bound, we have, for all ,
(43)
where we denote . [Footnote 3: More precisely, .] We also use . Summing up (14) from , we have
(44)
(45) |
where the last inequality holds due to the choice , assuming . Note that we need (45) to hold for all ; hence we change . There are two cases.
• (Case 1a) If at least half of the iterates in the epoch satisfy for , then by uniform sampling (i.e., uniformly breaking by setting as in Algorithm 2, Line 12; see the sketch after this case analysis), the output satisfies with probability at least . Under the online setting, the full gradient is inaccessible and we need to use the approximate batch gradient to check the small-gradient condition in Line 3 of Algorithm 1. Based on (41), we know that with probability . By a union bound, with probability at least ,
where the last inequality holds by . Without loss of generality, we set and therefore the probability reduces to . Lastly, from the definition of the pullback gradient, we have
holds with probability at least . The last inequality is due to Lemma 3. Note that in this case, we also have .
• (Case 1b) If at least half of the points in the epoch satisfy for , then with probability at least , the selected output falls within the last quarter of . In this case, . Note that (45) holds with probability at least . By a union bound, we have, with probability at least (by letting ),
where we use and .
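As referenced in Case 1a, the uniform sampling of the output can be realized by breaking the epoch at a uniformly random iteration. A minimal sketch, where `step` and `m` are hypothetical placeholders for one TSSRG update and the epoch length:

```python
import random

def epoch_with_uniform_output(x, step, m):
    """Run updates but return the iterate at a uniformly random index,
    matching the uniform-break rule invoked in Case 1a."""
    t_stop = random.randrange(m)  # uniform over {0, ..., m-1}
    for _ in range(t_stop):
        x = step(x)               # run exactly t_stop updates, then break
    return x
```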
2. Iterates leave the constraint ball. Suppose that at we have . Then, by Lemma 16, we know the function value already decreases significantly. That is, with probability ,
where the last inequality follows from the choices and . Hence, by returning as , we have, with high probability,
In summary, with high probability, either the gradient norm of the output is small or the function value decreases significantly. Even under Case 1a, with high probability by (45).
∎
Lemma 18 (Small stuck region)
Consider with and . Let be two random perturbations satisfying and , where denotes the eigenvector corresponding to the smallest eigenvalue of , and . Set the parameters , where is chosen sufficiently small such that . Also choose , . Then, for generated by running TSSRG and TSSRG with the same sets of mini-batches, with probability ,
where , and , .
Proof. The proof is by contradiction, so we assume
(46) |
First, we again note that both do not escape the constraint ball under condition (46). This is because of the choices and . Next, we show that the distance between the two coupled sequences grows exponentially and eventually exceeds the bound in (46). The proof roadmap is exactly the same as in Lemma 12, and thus we only highlight the main results under the online setting.
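To make the exponential-growth mechanism explicit, recall the (Li, 2019)-style decomposition that this roadmap refers to: writing $w_t := x_t - x'_t$ for the difference of the coupled tangent-space sequences, $H$ for the Hessian of the pullback at the perturbation center, and $\delta_t$ for the accumulated stochastic and Hessian-approximation errors (the exact form is as in Lemma 12), one step of the coupled updates gives

```latex
w_{t+1} = (I - \eta H)\,w_t - \eta\,\delta_t
\qquad\Longrightarrow\qquad
w_{t} = (I - \eta H)^{t} w_0 \;-\; \eta\sum_{s=0}^{t-1}(I-\eta H)^{t-1-s}\,\delta_s ,
```

so the leading term grows like $(1+\eta\gamma)^t$ along the escape direction $e_1$ (with $-\gamma$ the smallest eigenvalue of $H$), while the induction below shows the error term stays dominated by it.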
Denote , , . Recall that we can bound as , where . Thus, from the proof of Lemma 12, we know that we can rewrite where
Next, we inductively show that and . When , these two conditions are easily verified. Now suppose that these two conditions hold for ; then we immediately obtain, for all ,
(47)
Also, unlike the finite-sum setting, we need a bound on . Given that and , we have
For each component term, we have
where the last inequality uses (47). Also the variance term is bounded as
Therefore, we can substitute these two bounds into the Bernstein inequality (Lemma 6) to bound . That is,
where the second inequality also uses and the last inequality holds due to
By a union bound, for all , we have
(48) |
where , and the second inequality is due to , , and the assumption (without loss of generality) that . We first prove that the second condition holds for .
Proof that is bounded by . Recall that we can decompose into two terms:
The first term is bounded in the same way as in Lemma 12. Consider the parameter settings , , , where satisfies . Then we have
The second term can be bounded as
where we choose and . These two results complete the proof that . Now we proceed to prove the first condition.
Proof that is bounded by . Denote . We can then verify that is a martingale sequence and that is its difference sequence. We first bound by the Bernstein inequality; the following result is exactly the same as in Lemma 12. With probability ,
Now we can bound . We only need to consider a single epoch because is bounded as in (48). Similarly, setting , we have for all . By the Azuma–Hoeffding inequality (Lemma 7), we have, for all , with probability ,
This is the same result as (29) except that . Combining this result with (48) and using a union bound, we have, with probability , for all ,
(49)
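For reference, the Azuma–Hoeffding step above uses the following standard bound (the paper's Lemma 7 may differ in constants): for a martingale difference sequence $\zeta_1,\dots,\zeta_t$ in a Hilbert space with $\|\zeta_s\|\le c_s$ almost surely,

```latex
\Pr\Big(\Big\|\sum_{s=1}^{t}\zeta_s\Big\| \ge \beta\Big)
\le 2\exp\!\Big(-\frac{\beta^{2}}{2\sum_{s=1}^{t} c_s^{2}}\Big).
```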
What remains to be shown is that the right-hand side of (49) is bounded. Recalling the analogous steps from (35) and (37), we have
where there is a slight difference in one of the constants because now . Therefore, we obtain
where . (Note in Lemma 17.) The second inequality uses the choices and . This completes the proof of the first condition, and hence both conditions hold by induction.
The contradiction argument is the same as in the proof of Lemma 12 and is hence omitted for brevity. Note that, similar to the argument in Lemma 12, we choose so that the result still holds.
∎
Lemma 19 (Descent around saddle points)
Proof. We prove this result by contradiction. Suppose the contrary holds, that is,
(50)
Then we first notice that, under this condition, both and do not escape the constraint ball . This is verified by contradiction. Assume, without loss of generality, that at iteration , . In this case, returns with such that . Hence, by similar logic as in (39),
where we set the parameters , , and , . However, we know that , which gives a contradiction. Therefore, and stay within the constraint ball for steps.
Also based on Lemma 18, assume for some . Then from Lemma 16, we know that
where we consider and , . This contradicts (50) and hence the proof is complete.
∎
Lemma 20 (Escape stuck region)
Let satisfy and . Given that the result in Lemma 19 holds, and choosing the perturbation radius , we have a sufficient decrease of the function value with high probability: