
Escaping Saddle Points Faster on Manifolds via Perturbed Riemannian Stochastic Recursive Gradient

Andi Han ([email protected])
Junbin Gao ([email protected])
Discipline of Business Analytics
University of Sydney
Abstract

In this paper, we propose a variant of the Riemannian stochastic recursive gradient method that achieves a second-order convergence guarantee and escapes saddle points using simple perturbation. The idea is to perturb the iterates when the gradient is small and to carry out stochastic recursive gradient updates on the tangent space, which avoids the complication of exploiting Riemannian geometry. We show that in the finite-sum setting, our algorithm requires $\widetilde{\mathcal{O}}\big(\frac{\sqrt{n}}{\epsilon^{2}}+\frac{\sqrt{n}}{\delta^{4}}+\frac{n}{\delta^{3}}\big)$ stochastic gradient queries to find an $(\epsilon,\delta)$-second-order critical point. This strictly improves the complexity of perturbed Riemannian gradient descent and is superior to perturbed Riemannian accelerated gradient descent in large-sample settings. We also provide a complexity of $\widetilde{\mathcal{O}}\big(\frac{1}{\epsilon^{3}}+\frac{1}{\delta^{3}\epsilon^{2}}+\frac{1}{\delta^{4}\epsilon}\big)$ for online optimization, which is novel on Riemannian manifolds in terms of second-order convergence using only first-order information.

1 Introduction

Consider the following finite-sum and online optimization problem defined on a Riemannian manifold $\mathcal{M}$:

\[
\min_{x\in\mathcal{M}}F(x):=\begin{cases}\mathbb{E}_{\omega}[f(x;\omega)],&\text{ online}\\ \frac{1}{n}\sum_{i=1}^{n}f_{i}(x),&\text{ finite-sum}\end{cases}
\tag{1}
\]

where $F:\mathcal{M}\to\mathbb{R}$ is a non-convex function. The finite-sum setting is a special case of the online setting where $\omega$ can only be sampled from finitely many values. For simplicity of analysis, we only consider the finite-sum formulation $\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)$ and refer to it as online optimization when $n$ approaches infinity. Solving problem (1) globally is NP-hard in general already on Euclidean space (i.e. when $\mathcal{M}\equiv\mathbb{R}^{d}$), let alone on an arbitrary Riemannian manifold. Thus many algorithms set out to find only approximate first-order critical points, i.e. points with small gradient. This is usually insufficient because, for non-convex problems, a point with small gradient can be near a local minimum, a local maximum or a saddle point. Therefore, to avoid being trapped by saddle points (and possibly local maxima), we need to find approximate second-order critical points (see Definition 1), which have small gradient as well as a nearly positive semi-definite Hessian.

Second-order algorithms can exploit curvature information via the Hessian and thus escape saddle points by construction. However, it is desirable to design simpler algorithms that use only gradient information for this purpose, because access to the Hessian can be costly and is sometimes unavailable in real applications. It turns out that gradient descent (GD) can inherently escape saddle points, yet it may require exponential time (Du et al., 2017). Instead, Jin et al. (2017) showed that by injecting isotropic Gaussian noise into the iterates, perturbed gradient descent (PGD) can reach approximate $\epsilon$-second-order stationarity within $\widetilde{\mathcal{O}}(n\epsilon^{-2})$ stochastic gradient queries with high probability. Similarly, a perturbed variant of stochastic gradient descent (PSGD) requires a complexity of $\widetilde{\mathcal{O}}(\epsilon^{-4})$ (Jin et al., 2019). These complexities match those of vanilla GD and SGD for finding approximate first-order stationary points up to poly-logarithmic factors.

When considering problem (1) on an arbitrary manifold, we require the Riemannian gradient $\text{grad}F(x)\in T_{x}\mathcal{M}$ as well as a retraction $R_{x}:T_{x}\mathcal{M}\to\mathcal{M}$ that allows iterates to be updated on the manifold along a direction determined by the Riemannian gradient. Defining a 'pullback' function $\hat{F}_{x}:=F\circ R_{x}:T_{x}\mathcal{M}\to\mathbb{R}$, Criscitiello and Boumal (2019) extended the idea of perturbed gradient descent (referred to as PRGD) by executing the perturbations and the subsequent gradient steps on the tangent space. Given that $\hat{F}_{x}$ is naturally defined on a vector space, the analysis can be largely drawn from (Jin et al., 2017), with similar convergence guarantees. Recently, Criscitiello and Boumal (2020) also proposed perturbed Riemannian accelerated gradient descent (PRAGD), a direct generalization of its Euclidean counterpart originally proposed in (Jin et al., 2018). It achieves an even lower complexity of $\widetilde{\mathcal{O}}(n\epsilon^{-7/4})$ compared to perturbed GD. However, these algorithms are sub-optimal in the finite-sum setting and fail to work for online problems where the full gradient is inaccessible.

When the objective can be decomposed into component functions as in problem (1), variance reduction techniques can improve the gradient complexity of both GD and SGD (Reddi et al., 2016; Nguyen et al., 2017b; Fang et al., 2018). The main idea is to exploit past gradient information to correct the deviation of current stochastic gradients, by comparing either to a fixed snapshot point (SVRG, Reddi et al. (2016)) or recursively to previous iterates (SRG/SPIDER, Nguyen et al. (2017a); Fang et al. (2018)). Notably, the gradient complexities $\mathcal{O}(\sqrt{n}\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-3})$ achieved by SRG and SPIDER are optimal for first-order guarantees under the finite-sum and online settings respectively (Fang et al., 2018; Arjevani et al., 2019). Motivated by the success of variance reduction and the simplicity of adding perturbation to ensure second-order convergence, Li (2019) proposed the perturbed stochastic recursive gradient method (PSRG), which can escape saddle points faster than perturbed GD and SGD. In this paper, we generalize PSRG to optimization problems defined on Riemannian manifolds with the following contributions:

  • Our proposed method PRSRG is the first simple stochastic algorithm that achieves second-order guarantees on Riemannian manifolds using only first-order information. That is, the algorithm only requires simple perturbations and does not involve any negative curvature exploitation as in PRAGD. Our algorithm adopts the idea of tangent space steps from (Criscitiello and Boumal, 2019), which results in a simple convergence analysis.

  • Our complexity is measured against $(\epsilon,\delta)$-second-order stationarity, which is more general than $\epsilon$-second-order stationarity, where $\delta=\mathcal{O}(\sqrt{\epsilon})$. In the finite-sum setting, PRSRG strictly improves the complexity of PRGD by $\widetilde{\mathcal{O}}(\min\{n/\epsilon^{1/2},\sqrt{n}/\epsilon^{2}\})$ and is superior to PRAGD when $n\geq\epsilon^{-1/2}$. We also provide a complexity of PRSRG under the online setting, which is novel for Riemannian optimization.

2 Other related work

Following the work of (Jin et al., 2017), in addition to perturbed SGD (Jin et al., 2019) and the perturbed accelerated gradient method (Jin et al., 2018), Ge et al. (2019) showed that SVRG with perturbation (PSVRG) is sufficient to escape saddle points, and a stabilized version was also developed to further improve the dependency on the Hessian Lipschitz constant. However, its complexity is strictly worse than that of PSRG (Li, 2019). We suspect that, with a similar tangent space trick, PSVRG can be generalized to Riemannian manifolds with little effort.

Another line of research incorporates negative curvature search subroutines (Xu et al., 2018; Allen-Zhu and Li, 2018) into classic first-order algorithms, such as GD and SGD, and also into variance-reduction algorithms including SVRG (Allen-Zhu and Li, 2018), SNVRG (Zhou et al., 2018) and SPIDER (Fang et al., 2018). This usually requires access to Hessian-vector products when searching for the eigenvector direction of the smallest eigenvalue (Xu et al., 2018). Even though Allen-Zhu and Li (2018) managed to use only first-order information for the same goal, their method still involves an iterative subroutine that can be difficult to implement in practice.

To solve general optimization problems on Riemannian manifolds, gradient descent and stochastic gradient descent have been generalized with provable guarantees (Boumal et al., 2019; Bonnabel, 2013; Hosseini and Sra, 2020). Acceleration and variance reduction techniques have also been considered for speeding up the convergence of GD and SGD (Zhang and Sra, 2018; Ahn and Sra, 2020; Zhang et al., 2016; Kasai et al., 2018). These first-order methods, however, only guarantee first-order convergence. To find second-order critical points, Newton-type methods, particularly the well-known globalized variants, trust region and cubic regularization, have been extended to manifold optimization, achieving similar second-order guarantees (Boumal et al., 2019; Agarwal et al., 2018).

Finally, we note that Sun et al. (2019) also proposed perturbed gradient descent on manifolds. There, the perturbations and the subsequent updates are performed directly on the manifold, in contrast to the tangent space steps in this paper. The preference for tangent space steps over manifold steps is mainly for simplicity of analysis. That is, Sun et al. (2019) requires more involved geometric results as well as the use of the exponential map, given its natural connection to geodesics and the Riemannian distance. By executing all the steps on the tangent space, we can bypass these results, although some regularity conditions must be carefully managed.

Table 1: Comparison of stochastic gradient complexities to achieve second-order guarantees

Finite-sum setting:
PRGD (Criscitiello and Boumal, 2019): $\widetilde{\mathcal{O}}\big(\frac{n}{\epsilon^{2}}\big)$ for $\epsilon$-second-order stationarity; no $(\epsilon,\delta)$-second-order result.
PRAGD (Criscitiello and Boumal, 2020): $\widetilde{\mathcal{O}}\big(\frac{n}{\epsilon^{7/4}}\big)$ for $\epsilon$-second-order stationarity; no $(\epsilon,\delta)$-second-order result.
PRSRG (this work): $\widetilde{\mathcal{O}}\big(\frac{n}{\epsilon^{3/2}}+\frac{\sqrt{n}}{\epsilon^{2}}\big)$ for $\epsilon$-second-order stationarity; $\widetilde{\mathcal{O}}\big(\frac{\sqrt{n}}{\epsilon^{2}}+\frac{\sqrt{n}}{\delta^{4}}+\frac{n}{\delta^{3}}\big)$ for $(\epsilon,\delta)$-second-order stationarity.

Online setting:
PRSRG (this work): $\widetilde{\mathcal{O}}\big(\frac{1}{\epsilon^{7/2}}\big)$ for $\epsilon$-second-order stationarity; $\widetilde{\mathcal{O}}\big(\frac{1}{\epsilon^{3}}+\frac{1}{\delta^{3}\epsilon^{2}}+\frac{1}{\delta^{4}\epsilon}\big)$ for $(\epsilon,\delta)$-second-order stationarity.

3 Preliminaries

Here we start with a short review of some preliminary definitions. Readers can refer to (Absil et al., 2009) for more detailed discussions on manifold geometry. We consider $\mathcal{M}$ to be a $d$-dimensional Riemannian manifold. The tangent space $T_{x}\mathcal{M}$ at $x\in\mathcal{M}$ is a $d$-dimensional vector space. $\mathcal{M}$ is equipped with an inner product $\langle\cdot,\cdot\rangle_{x}$ and a corresponding norm $\|\cdot\|_{x}$ on each tangent space $T_{x}\mathcal{M}$, for all $x\in\mathcal{M}$. The tangent bundle is the union of tangent spaces, defined as $T\mathcal{M}:=\{(x,u):x\in\mathcal{M},u\in T_{x}\mathcal{M}\}$. The Riemannian gradient of a function $F:\mathcal{M}\to\mathbb{R}$ is the vector field $\text{grad}F$ that uniquely satisfies $\text{D}F(x)[u]=\langle\text{grad}F(x),u\rangle_{x}$ for all $(x,u)\in T\mathcal{M}$, where $\text{D}F(x)[u]$ is the directional derivative of $F$ at $x$ along $u$. The Riemannian Hessian of $F$ is the covariant derivative of $\text{grad}F$, defined so that for all $(x,u)\in T\mathcal{M}$, $\text{Hess}F(x)[u]=\nabla_{u}\text{grad}F(x)$, where $\nabla$ is the Riemannian connection (also known as the Levi-Civita connection). Note that we also use $\nabla$ to represent differentiation on a vector space, which is a special case of the covariant derivative.

A retraction $R_{x}:T_{x}\mathcal{M}\to\mathcal{M}$ maps a tangent vector to the manifold and satisfies (i) $R_{x}(0)=x$, and (ii) $\text{D}R_{x}(0)$ is the identity map. $\text{D}R_{x}(u)$ represents the differential of the retraction, denoted as $T_{u}:T_{x}\mathcal{M}\to T_{R_{x}(u)}\mathcal{M}$. The exponential map is a special instance of a retraction, and hence our results in this paper apply in particular to this choice. Define the pullback function $\hat{F}_{x}:=F\circ R_{x}:T_{x}\mathcal{M}\to\mathbb{R}$. That is, $\hat{F}_{x}(u)=\frac{1}{n}\sum_{i=1}^{n}\hat{f}_{i,x}(u)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(R_{x}(u))$, where we also define the pullback components as $\hat{f}_{i,x}:=f_{i}\circ R_{x}$. Given that the domain of the pullback function is a vector space, we can represent its gradient $\nabla\hat{F}_{x}$ and Hessian $\nabla^{2}\hat{F}_{x}$ in terms of the Riemannian gradient and Hessian as well as the differentiated retraction $T_{u}$ in the following Lemma (see Lemma 2.5 in Criscitiello and Boumal (2020)).

Lemma 1 (Gradient and Hessian of the pullback)

For a twice continuously differentiable function $F:\mathcal{M}\to\mathbb{R}$ and $(x,u)\in T\mathcal{M}$, we have

\[
\nabla\hat{F}_{x}(u)=T_{u}^{*}\text{grad}F(R_{x}(u))\quad\text{ and }\quad\nabla^{2}\hat{F}_{x}(u)=T_{u}^{*}\circ\text{Hess}F(R_{x}(u))\circ T_{u}+W_{u},
\]

where $T_{u}^{*}$ denotes the adjoint operator of $T_{u}$ and $W_{u}$ is a symmetric linear operator defined by $\langle W_{u}[\dot{u}],\dot{u}\rangle_{x}=\langle\text{grad}F(R_{x}(u)),c^{\prime\prime}(0)\rangle_{R_{x}(u)}$ with $c(t)=R_{x}(u+t\dot{u})$, where $\dot{u}$ is a perturbation to $u$.

If the retraction $R_{x}$ is a second-order retraction, $\nabla\hat{F}_{x}(0)$ and $\nabla^{2}\hat{F}_{x}(0)$ match the Riemannian gradient and Hessian at $x$. See Lemma 8 (Appendix B) for more details. A vector transport $\mathcal{T}$ with respect to the retraction $R_{x}$ is a linear map $\mathcal{T}:T\mathcal{M}\oplus T\mathcal{M}\to T\mathcal{M}$ satisfying (i) $\mathcal{T}_{u}[\xi]\in T_{R_{x}(u)}\mathcal{M}$; (ii) $\mathcal{T}_{0_{x}}[\xi]=\xi$. In this paper, we only consider isometric vector transports, which satisfy $\langle\xi,v\rangle_{x}=\langle\mathcal{T}_{u}\xi,\mathcal{T}_{u}v\rangle_{R_{x}(u)}$. Below are some common notations used throughout this paper.

Notation. Denote $\lambda_{\min}(H)$ as the minimum eigenvalue of a symmetric operator $H$ and use $\|\cdot\|$ to represent either the vector norm or the spectral norm of a matrix. We write $h(t)=\mathcal{O}(g(t))$ if there exist a positive constant $M$ and $t_{0}$ such that $h(t)\leq Mg(t)$ for all $t\geq t_{0}$, and use $\widetilde{\mathcal{O}}(\cdot)$ to hide poly-logarithmic factors. Let $\mathbb{B}_{x}(r)$ be the Euclidean ball of radius $r$ around the origin of the tangent space $T_{x}\mathcal{M}$, that is, $\mathbb{B}_{x}(r):=\{u\in T_{x}\mathcal{M}:\|u\|_{x}\leq r\}$. We use $\text{Unif}(\mathbb{B}_{x}(r))$ to denote the uniform distribution on the set $\mathbb{B}_{x}(r)$. Denote $[n]:=\{1,2,\ldots,n\}$ and let $\nabla\hat{f}_{\mathcal{I},x}(u)=\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\nabla\hat{f}_{i,x}(u)$ be a mini-batch gradient of the pullback component functions, where $\mathcal{I}\subseteq[n]$ has cardinality $|\mathcal{I}|$. Similarly, $\text{grad}f_{\mathcal{I}}(x)$ represents a mini-batch Riemannian gradient. Finally, we denote $\log(x)$ as the natural logarithm and $\log_{\alpha}(x)$ as the logarithm with base $\alpha$.

3.1 Assumptions

Now we state the main assumptions as follows. The first assumption is required to bound the function decrease and to ensure the existence of stationary points on the manifold.

Assumption 1 (Lower bounded function)

$F$ is lower bounded by $F^{*}$, i.e. $F(x)\geq F^{*}$ for all $x\in\mathcal{M}$.

Following (Criscitiello and Boumal, 2019), we impose some Lipschitzness conditions on the pullback $\hat{F}_{x}$ because the main gradient steps (see Algorithm 2) are performed on the tangent space with respect to the pullback function. The next assumption requires both gradient and Hessian Lipschitzness of the pullback component functions $\hat{f}_{i,x}$. Note that we only require the condition to hold with respect to the origin of $T_{x}\mathcal{M}$ within a constraint ball $\mathbb{B}_{x}(D)$. This assumption is much weaker than requiring Lipschitz continuity over the entire tangent space.

Assumption 2 (Gradient and Hessian Lipschitz)

The gradient and Hessian of the pullback component functions are Lipschitz relative to the origin. That is, for all $x\in\mathcal{M}$ and $u\in T_{x}\mathcal{M}$ with $\|u\|\leq D$, there exist $\ell,\rho\geq 0$ such that

\[
\|\nabla\hat{f}_{i,x}(u)-\nabla\hat{f}_{i,x}(0)\|\leq\ell\|u\|,\qquad
\|\nabla^{2}\hat{f}_{i,x}(u)-\nabla^{2}\hat{f}_{i,x}(0)\|\leq\rho\|u\|.
\]

This also implies that the gradient and Hessian of the pullback function $\hat{F}_{x}$ satisfy the same Lipschitzness condition with respect to the origin.

Immediately based on this assumption, we can show (in Lemma 2) that the gradient of the pullback function is Lipschitz continuous in the constraint ball $\mathbb{B}_{x}(D)$. This result implies smoothness of the pullback function and is fundamental for analysing first-order optimization algorithms.

Lemma 2 (LL-Lipschitz continuous)

Under Assumption 2, for all $x\in\mathcal{M}$, there exists $L=\ell+\rho D$ such that $\nabla\hat{f}_{i,x}$ is $L$-Lipschitz continuous inside the ball $\mathbb{B}_{x}(D)$. This also implies that $\nabla\hat{F}_{x}$ is $L$-Lipschitz continuous. That is, for any $u,v\in\mathbb{B}_{x}(D)$, we have

\[
\|\nabla\hat{f}_{i,x}(u)-\nabla\hat{f}_{i,x}(v)\|\leq L\|u-v\|\quad\text{ and }\quad\|\nabla\hat{F}_{x}(u)-\nabla\hat{F}_{x}(v)\|\leq L\|u-v\|.
\]

The next assumption ensures the correspondence between the gradient and Hessian of the pullback $\hat{F}_{x}$ at the origin and those of the original function $F$. Note that, similar to (Criscitiello and Boumal, 2019), we can relax this assumption to bounded initial acceleration, i.e. $\|c^{\prime\prime}(0)\|\leq\beta$ for some $\beta\geq 0$. In that case, the results only differ by some constants.

Assumption 3 (Second-order retraction)

$R_{x}$ is a second-order retraction, i.e. for all $(x,u)\in T\mathcal{M}$, the retraction curve $c(t)=R_{x}(tu)$ has zero initial acceleration: $c^{\prime\prime}(0)=0$.

The following assumption is needed to bound the difference between the differentiated retraction and a vector transport. Although our algorithm does not require a vector transport, since all updates are completed on the tangent space, the main purpose of this assumption is to establish a bound on the singular values of the differentiated retraction (Lemma 3). This Lemma can then be used to bound the difference between $\nabla\hat{F}_{x}(u)$ and $\text{grad}F(R_{x}(u))$ for any $u\in\mathbb{B}_{x}(D)$.

Assumption 4 (Difference between differentiated retraction and vector transport)

For any $\bar{x}\in\mathcal{M}$, there exists a neighbourhood $\mathcal{X}$ of $\bar{x}$ such that for all $x,y=R_{x}(u)\in\mathcal{X}$, there exists a constant $c_{0}\geq 0$ such that, uniformly,

\[
\|T_{u}-\mathcal{T}_{u}\|\leq c_{0}\|u\|\quad\text{ and }\quad\|T_{u}^{-1}-\mathcal{T}_{u}^{-1}\|\leq c_{0}\|u\|.
\]
Lemma 3 (Singular value bound of differentiated retraction)

For all $x,y=R_{x}(u)\in\mathcal{X}$ where $u\in\mathbb{B}_{x}(D)$ with $D\leq\frac{1}{2c_{0}}$, we have $\sigma_{\min}(T_{u})\geq\frac{1}{2}$.

It is not difficult to satisfy these assumptions. For a compact Riemannian manifold with a second-order retraction and a three-times continuously differentiable function $F$, Assumptions 1, 2 and 3 are easily satisfied (see Lemma 3.2 in (Criscitiello and Boumal, 2019)). Assumption 4 can be guaranteed by requiring the vector transport to be $C^{0}$, based on a Taylor expansion (Huang et al., 2015).

Apart from the main assumptions, one additional assumption of bounded variance is particularly important for solving online problems.

Assumption 5 (Uniform bounded variance)

The variance of the gradient of the pullback component functions is bounded uniformly by $\sigma^{2}$. That is, for all $x\in\mathcal{M}$ and all $\hat{f}_{i,x}$,

\[
\|\nabla\hat{f}_{i,x}(u)-\nabla\hat{F}_{x}(u)\|^{2}\leq\sigma^{2}
\]

holds for any $u\in T_{x}\mathcal{M}$ such that $\|u\|\leq D$.

This assumption is more stringent than the standard variance bound, which is in expectation. However, we can relax this assumption by requiring sub-Gaussian tails, which is sufficient to achieve a high probability bound (Li, 2019). Lastly, we conclude this section by defining second-order critical points as follows.

Definition 1 ((Ο΅,Ξ΄)(\epsilon,\delta)-second-order and Ο΅\epsilon-second-order critical points)

$x\in\mathcal{M}$ is an $(\epsilon,\delta)$-second-order critical point if

\[
\|\text{grad}F(x)\|\leq\epsilon,\quad\text{ and }\quad\lambda_{\min}(\text{Hess}F(x))\geq-\delta.
\]

It is an $\epsilon$-second-order critical point if $\|\text{grad}F(x)\|\leq\epsilon$ and $\lambda_{\min}(\text{Hess}F(x))\geq-\sqrt{\rho\epsilon}$. The second definition is a special case of the first one with $\delta=\sqrt{\rho\epsilon}$.
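As a concrete reading of Definition 1, the following small check is a sketch only: it assumes the Riemannian gradient and Hessian at $x$ have already been expressed as a vector and a symmetric matrix in an orthonormal basis of $T_{x}\mathcal{M}$, which is not part of the algorithms in this paper.

```python
import numpy as np

def is_second_order_critical(grad, hess, eps, delta):
    """(eps, delta)-second-order criticality of Definition 1: small gradient norm
    and Hessian eigenvalues bounded below by -delta."""
    return (np.linalg.norm(grad) <= eps
            and np.linalg.eigvalsh(hess).min() >= -delta)

def is_eps_second_order_critical(grad, hess, eps, rho):
    """eps-second-order criticality: the special case delta = sqrt(rho * eps)."""
    return is_second_order_critical(grad, hess, eps, np.sqrt(rho * eps))
```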

4 Algorithm

In this section, we introduce the perturbed Riemannian stochastic recursive gradient method (PRSRG) in Algorithm 1, where the main updates are performed by tangent space stochastic recursive gradient (TSSRG) in Algorithm 2. The key idea is simple: when the gradient is large, we repeatedly execute $m$ standard recursive gradient iterations (i.e. an epoch) on the tangent space before retracting back to the manifold. Essentially, we are minimizing the pullback function $\hat{F}_{x}$ within the constraint ball by SRG updates, which translates to minimizing $F$ within a neighbourhood of $x$.

This process repeats until the gradient is small. Then we perform the same SRG updates, but for at most $\mathscr{T}$ iterations, from a perturbed iterate with isotropic noise added on the tangent space. Notice that the small gradient condition in Line 3 of Algorithm 1 is examined against $\text{grad}f_{\mathcal{B}}(x)$, where $\mathcal{B}$ contains $B$ samples drawn without replacement from $[n]$. This ensures that the full batch gradient is computed under the finite-sum setting, where we choose $B=n$. Under the online setting, where $n$ approaches infinity, access to the full gradient is unavailable and we can therefore only rely on a large-batch gradient.

TSSRG is mainly based on the stochastic recursive gradient algorithm in (Nguyen et al., 2017b). The algorithm adopts a double-loop structure. At the start of each outer loop (called an epoch), a full gradient (or a large-batch gradient for online optimization) is evaluated. Within each inner loop, mini-batch gradients are computed at the current iterate $u_{t}$ and the previous iterate $u_{t-1}$. Then a modified gradient $v_{t}$ at $u_{t}$ is constructed recursively from $v_{t-1}$ and the difference of the mini-batch gradients at $u_{t}$ and $u_{t-1}$. That is,

\[
v_{t}\leftarrow\nabla\hat{f}_{\mathcal{I},x}(u_{t})-\big(\nabla\hat{f}_{\mathcal{I},x}(u_{t-1})-v_{t-1}\big).
\tag{2}
\]
Algorithm 1 Perturbed Riemannian stochastic recursive gradient (PRSRG)
1:  Input: initialization $x_{0}$, step size $\eta$, inner loop size $m$, mini-batch size $b$, large batch size $B$, perturbation radius $r$, perturbation interval $\mathscr{T}$, diameter bound $D$, tolerance $\epsilon$
2:  for $t=0,1,2,\ldots$ do
3:      if $\|\text{grad}f_{\mathcal{B}}(x_{t})\|\leq\epsilon$ then   ▷ $\mathcal{B}$ contains $B$ samples drawn randomly without replacement
4:          Draw $u_{0}\sim\text{Unif}(\mathbb{B}_{x_{t}}(r))$.
5:          $x_{t+\mathscr{T}}=\textsf{TSSRG}(x_{t},u_{0},\eta,m,b,B,D,\mathscr{T})$.
6:          $t\leftarrow t+\mathscr{T}$.
7:      else
8:          $x_{t+m}=\textsf{TSSRG}(x_{t},0,\eta,m,b,B,D,m)$.
9:          $t\leftarrow t+m$.
10:     end if
11: end for
Algorithm 2 Tangent space stochastic recursive gradient (TSSRG$(x,u_{0},\eta,m,b,B,D,\mathscr{T})$)
1:  Input: initialization $x$, initial perturbation $u_{0}$, step size $\eta$, inner loop size $m$, mini-batch size $b$, large batch size $B$, diameter bound $D$, max iteration $\mathscr{T}$
2:  If $u_{0}=0$ set perturbed $=0$; else, set perturbed $=1$.
3:  for $s=0,1,\ldots$ do
4:      $v_{sm}=\nabla\hat{f}_{\mathcal{B},x}(u_{sm})$
5:      for $k=1,2,\ldots,m$ do
6:          $t\leftarrow sm+k$
7:          $u_{t}\leftarrow u_{t-1}-\eta v_{t-1}$
8:          if $\|u_{t}\|\geq D$ then
9:              break with $u_{\mathscr{T}}\leftarrow u_{t-1}-\alpha\eta v_{t-1}$, where $\alpha\in(0,1]$ is such that $\|u_{t-1}-\alpha\eta v_{t-1}\|=D$.
10:         end if
11:         $v_{t}\leftarrow\nabla\hat{f}_{\mathcal{I},x}(u_{t})-\nabla\hat{f}_{\mathcal{I},x}(u_{t-1})+v_{t-1}$   ▷ $\mathcal{I}$ contains $b$ i.i.d. samples
12:         if (not perturbed) and $t<\mathscr{T}$ then
13:             break with $u_{\mathscr{T}}\leftarrow u_{t}$ with probability $\frac{1}{m-k+1}$
14:         end if
15:         if $t\geq\mathscr{T}$ then
16:             break with $u_{\mathscr{T}}\leftarrow u_{t}$
17:         end if
18:     end for
19:     $u_{(s+1)m}\leftarrow u_{t}$.
20: end for
21: return $R_{x}(u_{\mathscr{T}})$

Note that we do not require any vector transport since all gradients (of the pullback) are defined on the same tangent space. Hence, TSSRG is very similar to Euclidean SRG, the only differences being the stopping criteria, discussed as follows. When the gradient is large, we perform at most $m$ updates and break the loop with probability $\frac{1}{m-k+1}$, where $k$ is the index of the inner iteration (Lines 12 and 13, Algorithm 2). This stopping rule is equivalent to selecting the output uniformly at random from the iterates of the epoch. It ensures that either the small gradient condition is triggered or a sufficient decrease is achieved; more details can be found in Section 6. When the gradient is small (i.e. around a saddle point), we only break once the maximum number of iterations has been reached.

Finally, we note that the Lipschitzness conditions are only guaranteed within a constraint ball of radius $D$, while taking $\mathscr{T}$ updates on the tangent space can violate this requirement. Therefore, in Line 9 of Algorithm 2, we explicitly control the deviation of the iterates from the origin and break the loop as soon as an iterate leaves the ball, returning a point on the boundary of the ball. By carefully balancing the parameters, we can show that whenever the iterates escape the constraint ball, the function value has already decreased by a large amount.

To simplify notation, we mostly refer to Algorithm 2 as TSSRG$(x,u_{0},\mathscr{T})$.
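To make the control flow of Algorithms 1 and 2 concrete, the following is a minimal numerical sketch in Python. It assumes the unit sphere with the projection retraction $R_{x}(u)=(x+u)/\|x+u\|$ and a finite-sum quadratic objective; the manifold, objective, parameter values and the simplified outer-loop bookkeeping are illustrative choices, not prescribed by the paper. Only the update structure (SRG epochs on the tangent space, the boundary break of Line 9, the probabilistic break of Lines 12 and 13, and perturbation when the gradient is small) follows PRSRG/TSSRG.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 400
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2                      # symmetric component matrices, f_i(x) = x^T A_i x

def retract(x, u):                                       # R_x(u) on the sphere
    y = x + u
    return y / np.linalg.norm(y)

def pullback_grad(x, u, idx):                            # gradient of u -> mean_{i in idx} f_i(R_x(u)), restricted to T_x M
    y = x + u
    r = np.linalg.norm(y)
    z = y / r
    g = 2 * np.mean(A[idx] @ z, axis=0)                  # ambient gradient of the mean quadratic at z
    g = (g - np.dot(g, z) * z) / r                       # chain rule through the projection retraction
    return g - np.dot(g, x) * x                          # restrict to the tangent space at x

def tssrg(x, u0, eta, m, b, D, T, perturbed):            # Algorithm 2, simplified (B = n, full anchor gradient)
    u, t = u0.copy(), 0
    while t < T:
        v = pullback_grad(x, u, np.arange(n))            # anchor gradient at the start of each epoch
        for k in range(1, m + 1):
            t += 1
            u_new = u - eta * v
            if np.linalg.norm(u_new) >= D:               # iterate left the constraint ball: stop on the boundary
                w = eta * v
                a, bq, c = w @ w, -2 * (u @ w), u @ u - D ** 2
                alpha = (-bq + np.sqrt(bq ** 2 - 4 * a * c)) / (2 * a)
                return retract(x, u - alpha * w)
            idx = rng.integers(0, n, size=b)             # i.i.d. mini-batch
            v = pullback_grad(x, u_new, idx) - pullback_grad(x, u, idx) + v   # recursive estimator (2)
            u = u_new
            if (not perturbed) and t < T and rng.random() < 1.0 / (m - k + 1):
                return retract(x, u)                     # uniform-over-epoch output
            if t >= T:
                return retract(x, u)
    return retract(x, u)

def prsrg(x0, eta=0.05, m=20, b=20, D=0.5, r=1e-3, T_escape=60, eps=1e-2, outer=50):
    x = x0
    for _ in range(outer):                               # fixed outer budget instead of the paper's counter
        g = pullback_grad(x, np.zeros(d), np.arange(n))  # grad F(x) = nabla F_hat_x(0) for this retraction
        if np.linalg.norm(g) <= eps:                     # small gradient: perturb on T_x M, then escape phase
            u0 = rng.standard_normal(d)
            u0 -= np.dot(u0, x) * x                      # tangent direction at x
            u0 *= r * rng.random() / np.linalg.norm(u0)  # crude (not exactly uniform) draw from B_x(r)
            x = tssrg(x, u0, eta, m, b, D, T_escape, perturbed=True)
        else:                                            # large gradient: one SRG epoch
            x = tssrg(x, np.zeros(d), eta, m, b, D, m, perturbed=False)
    return x

x0 = rng.standard_normal(d); x0 /= np.linalg.norm(x0)
x = prsrg(x0)
print(np.linalg.norm(pullback_grad(x, np.zeros(d), np.arange(n))))
```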

5 Main results

In this section, we present the main complexity results of PRSRG in finding second-order stationary points.

Under the finite-sum setting, we choose $B=n$ and thus $\text{grad}f_{\mathcal{B}}\equiv\text{grad}F$. We set the parameters as in (3) and additionally require the first-order tolerance $\epsilon\leq\widetilde{\mathcal{O}}(\frac{D\sqrt{L}}{m\sqrt{\eta}})$. This is not a strict condition given that the step size $\eta$ can be chosen arbitrarily small. The second-order tolerance $\delta$ needs to be smaller than the Lipschitz constant $\ell$ in Assumption 2. Otherwise, when $\delta>\ell$, any $x\in\mathcal{M}$ with small gradient $\|\text{grad}F(x)\|\leq\epsilon$ is already an $(\epsilon,\delta)$-second-order stationary point: since $\|\nabla^{2}\hat{F}_{x}(0)\|\leq\ell$ by Assumption 2, we have $\lambda_{\min}(\text{Hess}F(x))=\lambda_{\min}(\nabla^{2}\hat{F}_{x}(0))\geq-\ell\geq-\delta$.

Theorem 1 (Finite-sum complexity)

Under Assumptions 1 to 4, consider the finite-sum optimization setting. For any starting point $x_{0}\in\mathcal{M}$, with the choice of parameters

\[
\mathscr{T}=\widetilde{\mathcal{O}}\Big(\frac{1}{\delta}\Big),\quad\eta\leq\widetilde{\mathcal{O}}\Big(\frac{1}{L}\Big),\quad m=b=\sqrt{n},\quad B=n,\quad r=\widetilde{\mathcal{O}}\Big(\min\Big\{\frac{\delta^{3}}{\rho^{2}\epsilon},\sqrt{\frac{\delta^{3}}{\rho^{2}L}}\Big\}\Big),
\tag{3}
\]

and supposing $\epsilon,\delta$ satisfy $\epsilon\leq\widetilde{\mathcal{O}}(\frac{D\sqrt{L}}{m\sqrt{\eta}})$ and $\delta\leq\ell$ where $\ell\leq L$, then with high probability, PRSRG$(x_{0},\eta,m,b,B,r,\mathscr{T},D,\epsilon)$ visits an $(\epsilon,\delta)$-second-order critical point at least once within

\[
\widetilde{\mathcal{O}}\Big(\frac{\Delta L\sqrt{n}}{\epsilon^{2}}+\frac{\Delta\rho^{2}\sqrt{n}}{\delta^{4}}+\frac{\Delta\rho^{2}n}{\delta^{3}}\Big)
\]

stochastic gradient queries, where $\Delta:=F(x_{0})-F^{*}$.

Up to constants and poly-log factors, the complexity in Theorem 1 is exactly the same as when the optimization domain is a vector space (i.e. $\mathcal{M}\equiv\mathbb{R}^{d}$). Indeed, our result is a direct generalization of the Euclidean counterpart, where the retraction is $R_{x}(u)=x+u$ and the Lipschitzness conditions are imposed on $F:\mathbb{R}^{d}\to\mathbb{R}$. Setting the same parameters as in (3), except that $\epsilon$ need not be small since $D=+\infty$, and requiring $\delta\leq L$ given $\ell=L$, we recover the Euclidean perturbed stochastic recursive gradient algorithm (Li, 2019).

Under online setting, the complexities of stochastic gradient queries are presented below.

Theorem 2 (Online complexity)

Under Assumptions 1 to 5, consider the online optimization setting. For any starting point $x_{0}\in\mathcal{M}$, with the choice of parameters

\[
\mathscr{T}=\widetilde{\mathcal{O}}\Big(\frac{1}{\delta}\Big),\quad\eta\leq\widetilde{\mathcal{O}}\Big(\frac{1}{L}\Big),\quad m=b=\widetilde{\mathcal{O}}\Big(\frac{\sigma}{\epsilon}\Big),\quad B=\widetilde{\mathcal{O}}\Big(\frac{\sigma^{2}}{\epsilon^{2}}\Big),\quad r=\widetilde{\mathcal{O}}\Big(\min\Big\{\frac{\delta^{3}}{\rho^{2}\epsilon},\sqrt{\frac{\delta^{3}}{\rho^{2}L}}\Big\}\Big),
\]

and supposing $\epsilon,\delta$ satisfy $\epsilon\leq\min\big\{\frac{\delta^{2}}{\rho},\widetilde{\mathcal{O}}(\frac{D\sqrt{L}}{m\sqrt{\eta}})\big\}$ and $\delta\leq\ell$ where $\ell\leq L$, then with high probability, PRSRG$(x_{0},\eta,m,b,B,r,\mathscr{T},D,\epsilon)$ visits an $(\epsilon,\delta)$-second-order critical point at least once within

\[
\widetilde{\mathcal{O}}\Big(\frac{\Delta L\sigma}{\epsilon^{3}}+\frac{\Delta\rho^{2}\sigma^{2}}{\delta^{3}\epsilon^{2}}+\frac{\Delta\rho^{2}\sigma}{\delta^{4}\epsilon}\Big)
\]

stochastic gradient queries, where $\Delta:=F(x_{0})-F^{*}$.

Note that the complexities in this paper are analysed in terms of achieving $(\epsilon,\delta)$-second-order stationarity. Some literature prefers to choose $\delta=\sqrt{\rho\epsilon}$ to match the units of gradient and Hessian, following the work of (Nesterov and Polyak, 2006). In this case, our complexities reduce to $\widetilde{\mathcal{O}}(\frac{n}{\epsilon^{3/2}}+\frac{\sqrt{n}}{\epsilon^{2}})$ for finite-sum problems and $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^{7/2}})$ for online problems. Compared to the optimal rates of $\mathcal{O}(\frac{\sqrt{n}}{\epsilon^{2}})$ and $\mathcal{O}(\frac{1}{\epsilon^{3}})$ for finding first-order stationary points, our rates are sub-optimal even if we ignore the poly-log factors. This nevertheless appears to be a problem for all stochastic variance reduction methods (Ge et al., 2019; Li, 2019).

6 High-level proof ideas

Now we provide a high-level proof road map of our proposed method for achieving second-order convergence guarantees. The main proof strategies are similar to those in (Li, 2019). However, we need to carefully handle the particularities of the manifold geometry as well as manage the additional regularity conditions. We focus on finite-sum problems and only highlight the key differences under the online setting.

6.1 Finite-sum setting

We first show how stochastic recursive gradient updates achieve sublinear convergence in expectation by periodically computing a large-batch gradient. From the smoothness of the pullback function, we have

\[
\hat{F}_{x}(u_{t})\leq\hat{F}_{x}(u_{t-1})-\frac{\eta}{2}\|\nabla\hat{F}_{x}(u_{t-1})\|^{2}+\frac{\eta}{2}\|v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})\|^{2}-\Big(\frac{1}{2\eta}-\frac{L}{2}\Big)\|u_{t}-u_{t-1}\|^{2}.
\tag{4}
\]

To achieve first-order stationarity, it is sufficient to bound the variance term in expectation, i.e. to bound $\mathbb{E}\|v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})\|^{2}$ in terms of the expected squared distances between consecutive iterates within the epoch (see Nguyen et al. (2017b)). This shows that the variance is gradually reduced when approaching optimality, where the gradient is small. Then, by carefully choosing the parameters, we can show that a single epoch satisfies

\[
\mathbb{E}[\hat{F}_{x}(u_{sm+m})-\hat{F}_{x}(u_{sm})]\leq-\frac{\eta}{2}\sum_{j=sm+1}^{sm+m}\mathbb{E}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}.
\tag{5}
\]

Telescoping this result over all epochs and choosing the output uniformly at random from all iterates, we can guarantee that the output is an approximate first-order stationary point. This gives the optimal stochastic gradient complexity of $\mathcal{O}(\frac{\sqrt{n}}{\epsilon^{2}})$ by choosing $m=b=\sqrt{n}$.
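For intuition, the rough count behind this claim (dropping constants and logarithmic factors, and taking $\eta=\Theta(1/L)$ as in (3)) is as follows: each epoch costs of the order of $B+mb$ stochastic gradients, and telescoping (5) shows that $\mathcal{O}(\Delta/(\eta\epsilon^{2}))$ inner iterations, i.e. $\mathcal{O}(\Delta/(\eta m\epsilon^{2}))$ epochs, suffice, so the total is

\[
(B+mb)\cdot\mathcal{O}\Big(\frac{\Delta}{\eta m\epsilon^{2}}\Big)
=\mathcal{O}(n)\cdot\mathcal{O}\Big(\frac{\Delta L}{\sqrt{n}\,\epsilon^{2}}\Big)
=\mathcal{O}\Big(\frac{\Delta L\sqrt{n}}{\epsilon^{2}}\Big),
\qquad B=n,\;m=b=\sqrt{n}.
\]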

To achieve second-order stationarity, the algorithm goes through two phases: large gradients and around saddle points. We present two informal lemmas corresponding to the two phases, with parameter settings omitted. See the Appendix for more details.

Lemma 4

When the current iterate has a large gradient, i.e. $\|\text{grad}F(x)\|\geq\epsilon$, running TSSRG$(x,0,m)$ gives three possible cases:

  1. When the iterates $\{u_{j}\}_{j=1}^{m}$ do not leave the constraint ball $\mathbb{B}_{x}(D)$:

     (a) If at least half of the iterates in the epoch satisfy $\|\nabla\hat{F}_{x}(u_{j})\|\leq\epsilon/2$ for $j=1,\ldots,m$, then with probability at least $1/2$, we have $\|\text{grad}F(\textsf{TSSRG}(x,0,m))\|\leq\epsilon$.

     (b) Otherwise, with probability at least $1/5$, we have $F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{\eta m\epsilon^{2}}{32}$.

  2. When one of the iterates $\{u_{j}\}_{j=1}^{m}$ leaves the constraint ball $\mathbb{B}_{x}(D)$, with probability at least $1-\vartheta$, we have $F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{\eta m\epsilon^{2}}{32}$, where $\vartheta\in(0,1)$ can be made arbitrarily small.

No matter which case occurs, $F(\textsf{TSSRG}(x,0,m))\leq F(x)$ with high probability.

Lemma 5

When the current iterate is around a saddle point, i.e. $\|\text{grad}F(x)\|\leq\epsilon$ and $\lambda_{\min}(\text{Hess}F(x))\leq-\delta$, running TSSRG$(x,u_{0},\mathscr{T})$ with perturbation $u_{0}\in\mathbb{B}_{x}(r)$ gives a sufficient decrease of the function value with high probability. That is,

\[
F(\textsf{TSSRG}(x,u_{0},\mathscr{T}))-F(x)\leq-\mathscr{F},
\]

where $\mathscr{F}=\widetilde{\mathcal{O}}(\frac{\delta^{3}}{\rho^{2}})$.

Lemma 4 states that when the gradient is large (phase 1), the output after running TSSRG for a single epoch either has a small gradient (Case 1a) or reduces the function value by a sufficient amount (Case 1b) with high probability. Note that we need to explicitly address the case where the iterates leave the constraint ball (Case 2); there, we show that the function value already decreases by the same desired amount, given that the first-order tolerance $\epsilon$ is chosen reasonably small as in Theorem 1. In Case 1a, the output satisfies the small gradient condition and hence is immediately followed by perturbation and the follow-up updates in TSSRG$(x,u_{0},\mathscr{T})$. Notice that we can only show that $\|\nabla\hat{F}_{x}(u_{m})\|$ is small; to connect this to the Riemannian gradient $\|\text{grad}F(R_{x}(u_{m}))\|$, we need the singular value bound of the differentiated retraction in Lemma 3. In the other cases, the function value decreases by $\mathcal{O}(\eta m\epsilon^{2})$ with high probability. As a result, given that $F$ is lower bounded by $F^{*}$, we can bound the number of such descent epochs $N_{1}$ by $\mathcal{O}(\frac{\Delta}{\eta m\epsilon^{2}})$, where $\Delta:=F(x_{0})-F^{*}$. Since we choose $\eta=\widetilde{\mathcal{O}}(\frac{1}{L})$, $m=b=\sqrt{n}$, $B=n$, the stochastic gradient complexity is computed as $N_{1}(B+mb)=\widetilde{\mathcal{O}}(\frac{\Delta L\sqrt{n}}{\epsilon^{2}})$.

In phase 2, where the gradient is small, the current iterate is either already a second-order stationary point or around a saddle point. Lemma 5 states that running TSSRG from any perturbation within the ball $\mathbb{B}_{x}(r)$ decreases the function value by at least $\mathscr{F}=\widetilde{\mathcal{O}}(\frac{\delta^{3}}{\rho^{2}})$ with high probability. Again, since the optimality gap is bounded, the number of such escape epochs $N_{2}$ is bounded by $\widetilde{\mathcal{O}}(\frac{\rho^{2}\Delta}{\delta^{3}})$. Similarly, the number of stochastic gradient queries is $N_{2}(\lceil\mathscr{T}/m\rceil n+\mathscr{T}b)=\widetilde{\mathcal{O}}(\frac{\Delta\rho^{2}\sqrt{n}}{\delta^{4}}+\frac{\Delta\rho^{2}n}{\delta^{3}})$, where we choose $\mathscr{T}=\widetilde{\mathcal{O}}(\frac{1}{\delta})$. Combining the complexities under phase 1 and phase 2 gives the result. For a more rigorous analysis, we also need to consider the number of wasted epochs, where neither the function decreases sufficiently nor the gradient of the output is small. The complexity of such epochs, however, turns out not to exceed the complexities established above. Detailed proofs are included in Appendix C.1.
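Spelling out this phase-2 count with $\mathscr{T}=\widetilde{\mathcal{O}}(1/\delta)$, $m=b=\sqrt{n}$, $B=n$ (constants and logarithmic factors dropped): $\lceil\mathscr{T}/m\rceil n\leq n+(\mathscr{T}/m)n=n+\sqrt{n}/\delta$ and $\mathscr{T}b=\sqrt{n}/\delta$, so

\[
N_{2}\big(\lceil\mathscr{T}/m\rceil n+\mathscr{T}b\big)
\leq\widetilde{\mathcal{O}}\Big(\frac{\rho^{2}\Delta}{\delta^{3}}\Big)\Big(n+\frac{2\sqrt{n}}{\delta}\Big)
=\widetilde{\mathcal{O}}\Big(\frac{\Delta\rho^{2}n}{\delta^{3}}+\frac{\Delta\rho^{2}\sqrt{n}}{\delta^{4}}\Big).
\]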

Next we will briefly explain how Lemma 4 (large gradient phase) and Lemma 5 (around saddle point phase) are derived.

Large gradient phase. The key result underlying Lemma 4 is a high-probability version of (5). To this end, we first need a high-probability bound for the variance term $\|v_{t}-\nabla\hat{F}_{x}(u_{t})\|^{2}$. It is not difficult to verify that $y_{t}:=v_{t}-\nabla\hat{F}_{x}(u_{t})$, $t=sm,\ldots,(s+1)m$, is a martingale sequence. As required by the Azuma-Hoeffding inequality (Lemma 7, Appendix A), in order to bound $\|y_{t}\|$ we need to bound its difference sequence $z_{t}:=y_{t}-y_{t-1}$, which can be done by applying the vector Bernstein inequality (Lemma 6, Appendix A). After bounding $\|y_{t}\|$, we can substitute this result into (4) to obtain

\[
\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{sm})\leq-\frac{\eta}{2}\sum_{j=sm+1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}\quad\text{ with high probability, }
\tag{6}
\]

for $1\leq t\leq m$. Note that we always call TSSRG for only one epoch at a time, so it is sufficient to consider the first epoch in TSSRG. Next, the analysis is divided according to whether the iterates leave the constraint ball or not. When all iterates stay within the ball, inequality (6) shows that if at least half of the iterates in the epoch have large gradient, we obtain a sufficient decrease; otherwise, the output uniformly selected from the iterates of the epoch has a small gradient with high probability. On the other hand, when one of the iterates escapes the constraint ball, we can still show a sufficient decrease via a localization Lemma (Lemma 10, Appendix C.2). Specifically, we have $\|u_{t}\|^{2}\leq\widetilde{\mathcal{O}}\big(\hat{F}_{x}(0)-\hat{F}_{x}(u_{t})\big)$, which is derived from (4) and the high-probability bound on the variance term. This bound implies that if an iterate is far from the origin, the function value has already decreased a lot.

Around saddle point phase. When the current iterate is around a saddle point, we need to show that the objective still decreases at a reasonable rate with high probability. At a high level, we adopt the coupling-sequence analysis originally introduced in (Jin et al., 2017). Define the stuck region as $\mathcal{X}_{\text{stuck}}:=\{u\in\mathbb{B}_{x}(r):F(\textsf{TSSRG}(x,u,\mathscr{T}))-\hat{F}_{x}(u)\geq-2\mathscr{F}\}$, the set of initializations from which running TSSRG does not give a sufficient decrease of the function value. Consider two initializations $u_{0},u_{0}^{\prime}$ that differ only in the direction of the eigenvector $e_{1}$ corresponding to the smallest eigenvalue of the Hessian $\text{Hess}F(x)$. That is, $u_{0}-u_{0}^{\prime}=r_{0}e_{1}$, where $r_{0}=\frac{\nu r}{\sqrt{d}}$ ($\nu\in(0,1)$ can be chosen arbitrarily small). Then we can prove that at least one of the sequences $\{u_{t}\},\{u_{t}^{\prime}\}$ generated by running TSSRG from the perturbations $u_{0}$, $u_{0}^{\prime}$ achieves a large deviation from its initialization within $\mathscr{T}$ steps (see Lemma 12). That is, with high probability,

\[
\exists\,t\leq\mathscr{T},\quad\max\{\|u_{t}-u_{0}\|,\|u_{t}^{\prime}-u_{0}^{\prime}\|\}\geq\widetilde{\mathcal{O}}\Big(\frac{\delta}{\rho}\Big).
\]

This result, together with the localization Lemma, indicates that at least one of the sequences also achieves a large decrease of the function value. In particular, we obtain $\max\{\hat{F}_{x}(u_{0})-F(\textsf{TSSRG}(x,u_{0},\mathscr{T})),\hat{F}_{x}(u_{0}^{\prime})-F(\textsf{TSSRG}(x,u_{0}^{\prime},\mathscr{T}))\}\geq 2\mathscr{F}$, where we note that $\hat{F}_{x}(u_{\mathscr{T}})=F(\textsf{TSSRG}(x,u_{0},\mathscr{T}))$. This directly implies that the width of the stuck region $\mathcal{X}_{\text{stuck}}$ along $e_{1}$ is at most $r_{0}$. Based on some geometric results, the probability of a perturbation $u_{0}\in\mathbb{B}_{x}(r)$ falling in the stuck region is at most $\nu$. In other words, with high probability, an arbitrarily chosen $u_{0}$ falls outside of the stuck region, in which case we achieve a sufficient decrease of the function value, $\hat{F}_{x}(u_{0})-F(\textsf{TSSRG}(x,u_{0},\mathscr{T}))\geq 2\mathscr{F}$. By carefully selecting the radius $r$ of the perturbation ball, we can bound the difference between $\hat{F}_{x}(0)$ and $\hat{F}_{x}(u_{0})$ by $\mathscr{F}$. Finally, combining these two results yields Lemma 5:

\[
F(x)-F(\textsf{TSSRG}(x,u_{0},\mathscr{T}))=\hat{F}_{x}(0)-\hat{F}_{x}(u_{0})+\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{\mathscr{T}})\geq-\mathscr{F}+2\mathscr{F}=\mathscr{F}.
\]

6.2 Online setting

Consider online problems where the full gradient is inaccessible. The proof road map is the same as in the finite-sum setting, but now $v_{sm}=\nabla\hat{f}_{\mathcal{B},x}(u_{sm})\neq\nabla\hat{F}_{x}(u_{sm})$. Most key results are relaxed with an additional term that depends on the variance of the stochastic gradient (Assumption 5).

Large gradient phase. Under phase 1, we can show that (6) holds with an additional term. That is,

\[
\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{0})\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}+\widetilde{\mathcal{O}}\Big(\frac{\eta t\sigma^{2}}{B}\Big)\quad\text{ with high probability. }
\]

Note that under the online setting, the small gradient condition is checked against the large-batch gradient $\text{grad}f_{\mathcal{B}}(x)$ rather than the full gradient (Line 3, Algorithm 1). Therefore, compared to the finite-sum case, we require an extra bound on the difference between the full gradient and the large-batch gradient, $\|\nabla\hat{f}_{\mathcal{B},x}(u_{m})-\nabla\hat{F}_{x}(u_{m})\|$, which can be obtained by the Bernstein inequality. By choosing $B=\widetilde{\mathcal{O}}(\frac{\sigma^{2}}{\epsilon^{2}})$, results similar to Lemma 11 can be derived.

Around saddle point phase. Under phase 2, we can obtain the same inequality as in Lemma 14, with differences only in terms of parameter settings.

These results guarantee that the numbers of phase-1 and phase-2 epochs match those of the finite-sum setting, up to constants and poly-log factors. That is, $N_{1}\leq\mathcal{O}(\frac{\Delta}{\eta m\epsilon^{2}})$ and $N_{2}\leq\widetilde{\mathcal{O}}(\frac{\Delta\rho^{2}}{\delta^{3}})$. Following similar logic and choosing the parameters $m=b=\sqrt{B}=\widetilde{\mathcal{O}}(\frac{\sigma}{\epsilon})$, we obtain the complexity in Theorem 2.
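With these parameters, a rough per-phase accounting mirrors the finite-sum case (constants and logarithmic factors dropped, $\eta=\widetilde{\mathcal{O}}(1/L)$, and using $\lceil\mathscr{T}/m\rceil B\leq B+(\mathscr{T}/m)B$):

\[
N_{1}(B+mb)=\widetilde{\mathcal{O}}\Big(\frac{\Delta L}{m\epsilon^{2}}\Big)\cdot\widetilde{\mathcal{O}}\Big(\frac{\sigma^{2}}{\epsilon^{2}}\Big)=\widetilde{\mathcal{O}}\Big(\frac{\Delta L\sigma}{\epsilon^{3}}\Big),\qquad
N_{2}\big(\lceil\mathscr{T}/m\rceil B+\mathscr{T}b\big)=\widetilde{\mathcal{O}}\Big(\frac{\Delta\rho^{2}\sigma^{2}}{\delta^{3}\epsilon^{2}}+\frac{\Delta\rho^{2}\sigma}{\delta^{4}\epsilon}\Big),
\]

which together give the complexity stated in Theorem 2.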

7 Conclusion

In this paper, we generalize the perturbed stochastic recursive gradient method to Riemannian manifolds by adopting the idea of tangent space steps introduced in (Criscitiello and Boumal, 2019). This avoids the more involved geometric results used in (Sun et al., 2019) and thus largely simplifies the analysis. We show that, up to constants and poly-log factors, our generalization achieves the same stochastic gradient complexities as the Euclidean version (Li, 2019). Under the finite-sum setting, our result is strictly superior to PRGD by Criscitiello and Boumal (2019) and superior to PRAGD by Criscitiello and Boumal (2020) for large-scale problems. We also prove an online complexity, which is, to the best of our knowledge, the first result on Riemannian manifolds for finding second-order stationary points using only first-order information in the online setting.

References

  • Absil etΒ al. (2009) P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
  • Agarwal etΒ al. (2018) Naman Agarwal, Nicolas Boumal, Brian Bullins, and Coralia Cartis. Adaptive regularization with cubics on manifolds. arXiv preprint arXiv:1806.00065, 2018.
  • Ahn and Sra (2020) Kwangjun Ahn and Suvrit Sra. From Nesterov’s estimate sequence to Riemannian acceleration. arXiv preprint arXiv:2001.08876, 2020.
  • Allen-Zhu and Li (2018) Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3716–3726, 2018.
  • Arjevani etΒ al. (2019) Yossi Arjevani, Yair Carmon, JohnΒ C Duchi, DylanΒ J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
  • Bonnabel (2013) Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
  • Boumal (2020) Nicolas Boumal. An introduction to optimization on smooth manifolds. 2020.
  • Boumal etΒ al. (2019) Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1–33, 2019.
  • Chung and Lu (2006) Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79–127, 2006.
  • Criscitiello and Boumal (2020) Chris Criscitiello and Nicolas Boumal. An accelerated first-order method for non-convex optimization on manifolds. arXiv preprint arXiv:2008.02252, 2020.
  • Criscitiello and Boumal (2019) Christopher Criscitiello and Nicolas Boumal. Efficiently escaping saddle points on manifolds. In Advances in Neural Information Processing Systems, pages 5987–5997, 2019.
  • Du etΒ al. (2017) SimonΒ S Du, Chi Jin, JasonΒ D Lee, MichaelΒ I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.
  • Fang etΒ al. (2018) Cong Fang, ChrisΒ Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689–699, 2018.
  • Ge etΒ al. (2019) Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang. Stabilized svrg: Simple variance reduction for nonconvex optimization. arXiv preprint arXiv:1905.00529, 2019.
  • Hoeffding (1994) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
  • Hosseini and Sra (2020) Reshad Hosseini and Suvrit Sra. An alternative to em for Gaussian mixture models: Batch and stochastic Riemannian optimization. Mathematical Programming, 181(1):187–223, 2020.
  • Huang etΒ al. (2015) Wen Huang, KyleΒ A Gallivan, and P-A Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660–1685, 2015.
  • Jin etΒ al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, ShamΒ M Kakade, and MichaelΒ I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
  • Jin etΒ al. (2018) Chi Jin, Praneeth Netrapalli, and MichaelΒ I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pages 1042–1085. PMLR, 2018.
  • Jin etΒ al. (2019) Chi Jin, Praneeth Netrapalli, Rong Ge, ShamΒ M Kakade, and MichaelΒ I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. arXiv preprint arXiv:1902.04811, 2019.
  • Kasai etΒ al. (2018) Hiroyuki Kasai, Hiroyuki Sato, and Bamdev Mishra. Riemannian stochastic recursive gradient algorithm. In International Conference on Machine Learning, pages 2516–2524, 2018.
  • Li (2019) Zhize Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. In Advances in Neural Information Processing Systems, pages 1523–1533, 2019.
  • Nesterov and Polyak (2006) Yurii Nesterov and BorisΒ T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • Nguyen etΒ al. (2017a) LamΒ M Nguyen, Jie Liu, Katya Scheinberg, and Martin TakÑč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017a.
  • Nguyen etΒ al. (2017b) LamΒ M Nguyen, Jie Liu, Katya Scheinberg, and Martin TakÑč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017b.
  • Reddi etΒ al. (2016) SashankΒ J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.
  • Sun etΒ al. (2019) Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 7276–7286, 2019.
  • Tropp (2012) JoelΒ A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
  • Xu etΒ al. (2018) YiΒ Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5530–5540, 2018.
  • Zhang and Sra (2018) Hongyi Zhang and Suvrit Sra. Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.
  • Zhang etΒ al. (2016) Hongyi Zhang, SashankΒ J Reddi, and Suvrit Sra. Riemannian svrg: Fast stochastic optimization on riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592–4600, 2016.
  • Zhou etΒ al. (2018) Dongruo Zhou, Pan Xu, and Quanquan Gu. Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782, 2018.

A Useful concentration bound

This section presents some useful concentration bounds on vector spaces, which are used to derive high-probability bounds for sequences defined on tangent spaces of the manifold.

Lemma 6 (Vector Bernstein inequality, Tropp (2012))

Given a sequence of independent vector random variables $\{v_{k}\}$ in $\mathbb{R}^{d}$ satisfying $\|v_{k}-\mathbb{E}[v_{k}]\|\leq R$ almost surely, then for $\varsigma\geq 0$,

\[
\text{Pr}\Big\{\Big\|\sum_{k}(v_{k}-\mathbb{E}[v_{k}])\Big\|\geq\varsigma\Big\}\leq(d+1)\exp\Big(\frac{-\varsigma^{2}/2}{\sigma^{2}+R\varsigma/3}\Big),
\]

where $\sigma^{2}:=\sum_{k}\mathbb{E}\|v_{k}-\mathbb{E}[v_{k}]\|^{2}$.

Lemma 7 (Azuma-Hoeffding inequality, Hoeffding (1994); Chung and Lu (2006))

Consider a vector-valued martingale sequence $\{y_{k}\}$ and its corresponding martingale difference sequence $\{z_{k}\}$ in $\mathbb{R}^{d}$. If $\{z_{k}\}$ satisfies $\|z_{k}\|=\|y_{k}-y_{k-1}\|\leq c_{k}$ almost surely, then for $\varsigma\geq 0$,

\[
\text{Pr}\{\|y_{k}-y_{0}\|\geq\varsigma\}\leq(d+1)\exp\Big(\frac{-\varsigma^{2}}{8\sum_{i=1}^{k}c_{i}^{2}}\Big).
\tag{7}
\]

If $\|z_{k}\|=\|y_{k}-y_{k-1}\|\leq c_{k}$ with probability $1-\zeta_{k}$, then for $\varsigma\geq 0$,

\[
\text{Pr}\{\|y_{k}-y_{0}\|\geq\varsigma\}\leq(d+1)\exp\Big(\frac{-\varsigma^{2}}{8\sum_{i=1}^{k}c_{i}^{2}}\Big)+\sum_{i=1}^{k}\zeta_{i}.
\]

B Regularity conditions on Riemannian manifold

In this section, we prove some regularity conditions on manifolds, which are fundamental for Riemannian optimization, as seen in several prior works (Criscitiello and Boumal, 2020; Boumal, 2020).

Lemma 2 (LL-Lipschitz continuous) Under Assumption 2, for all xβˆˆβ„³x\in\mathcal{M}, there exists L=β„“+ρ​DL=\ell+\rho D such that βˆ‡f^i,x\nabla\hat{f}_{i,x} is LL-Lipschitz continuous inside the ball 𝔹x​(D)\mathbb{B}_{x}(D). This also implies that βˆ‡F^x\nabla\hat{F}_{x} is LL-Lipschitz continuous. That is, for any u,vβˆˆπ”Ήx​(D)u,v\in\mathbb{B}_{x}(D), we have

β€–βˆ‡f^i,x​(u)βˆ’βˆ‡f^i,x​(v)‖≀L​‖uβˆ’vβ€–Β andΒ β€–βˆ‡F^x​(u)βˆ’βˆ‡F^x​(v)‖≀L​‖uβˆ’vβ€–.\|\nabla\hat{f}_{i,x}(u)-\nabla\hat{f}_{i,x}(v)\|\leq L\|u-v\|\quad\text{ and }\quad\|\nabla\hat{F}_{x}(u)-\nabla\hat{F}_{x}(v)\|\leq L\|u-v\|.

ProofΒ  The proof of βˆ‡f^i,x\nabla\hat{f}_{i,x} being Lipschitz continuous is the same as in Criscitiello and Boumal (2019). We include it here for completeness. From Assumption 2, we have β€–βˆ‡2f^i,x​(0)‖≀ℓ\|\nabla^{2}\hat{f}_{i,x}(0)\|\leq\ell. Thus,

β€–βˆ‡2f^i,x​(u)β€–β‰€β€–βˆ‡2f^i,x​(u)βˆ’βˆ‡2f^i,x​(0)β€–+β€–βˆ‡2f^i,x​(0)‖≀ρ​‖uβ€–+ℓ≀ℓ+ρ​D=L.\|\nabla^{2}\hat{f}_{i,x}(u)\|\leq\|\nabla^{2}\hat{f}_{i,x}(u)-\nabla^{2}\hat{f}_{i,x}(0)\|+\|\nabla^{2}\hat{f}_{i,x}(0)\|\leq\rho\|u\|+\ell\leq\ell+\rho D=L.

Hence, for any u,vβˆˆπ”Ήx​(D)u,v\in\mathbb{B}_{x}(D), we obtain

β€–βˆ‡f^i,x​(u)βˆ’βˆ‡f^i,x​(v)β€–=β€–βˆ«01βˆ‡2f^i,x​(v+(uβˆ’v)​τ)​[uβˆ’v]​𝑑τ‖≀L​‖uβˆ’vβ€–.\|\nabla\hat{f}_{i,x}(u)-\nabla\hat{f}_{i,x}(v)\|=\|\int_{0}^{1}\nabla^{2}\hat{f}_{i,x}(v+(u-v)\tau)[u-v]d\tau\|\leq L\|u-v\|.

This implies that the full gradient $\nabla\hat{F}_{x}$ is Lipschitz continuous because, for any $u,v\in\mathbb{B}_{x}(D)$, the triangle inequality gives

\|\nabla\hat{F}_{x}(u)-\nabla\hat{F}_{x}(v)\|=\Big{\|}\frac{1}{n}\sum_{i=1}^{n}\big{(}\nabla\hat{f}_{i,x}(u)-\nabla\hat{f}_{i,x}(v)\big{)}\Big{\|}\leq\frac{1}{n}\sum_{i=1}^{n}\|\nabla\hat{f}_{i,x}(u)-\nabla\hat{f}_{i,x}(v)\|\leq L\|u-v\|.

This completes the proof. Β 

Lemma 3 (Singular value bound of differentiated retraction) For all x,y=Rx​(u)βˆˆπ’³x,y=R_{x}(u)\in\mathcal{X} where uβˆˆπ”Ήx​(D)u\in\mathbb{B}_{x}(D) with D≀12​c0D\leq\frac{1}{2c_{0}}, we have Οƒmin​(Tu)β‰₯12\sigma_{\min}(T_{u})\geq\frac{1}{2}.

ProofΒ  From Assumption 4, we have β€–Tuβˆ’π’―u‖≀c0​‖u‖≀c0​D≀12\|T_{u}-\mathcal{T}_{u}\|\leq c_{0}\|u\|\leq c_{0}D\leq\frac{1}{2}. Therefore Οƒmin​(Tu)=minβ€–ΞΎβ€–=1⁑‖Tu​ξ‖β‰₯minξ⁑‖𝒯uβ€‹ΞΎβ€–βˆ’β€–(Tuβˆ’π’―u)​ξ‖β‰₯1βˆ’12=12\sigma_{\min}(T_{u})=\min_{\|\xi\|=1}\|T_{u}\xi\|\geq\min_{\xi}\|\mathcal{T}_{u}\xi\|-\|(T_{u}-\mathcal{T}_{u})\xi\|\geq 1-\frac{1}{2}=\frac{1}{2}, where the last inequality uses the fact that all singular values of an isometric operator are 11. Β 
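A direct consequence, used later in Case 1a of Lemma 11, is that the adjoint $T_{u}^{*}$ is invertible with a well-controlled inverse: since $T_{u}^{*}$ has the same singular values as $T_{u}$,

\|(T_{u}^{*})^{-1}\|=\frac{1}{\sigma_{\min}(T_{u}^{*})}=\frac{1}{\sigma_{\min}(T_{u})}\leq 2.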

Lemma 8 (Gradient and Hessian of the pullback under second-order retraction)

Given a second order retraction Rx:Tx​ℳ→ℳR_{x}:T_{x}\mathcal{M}\xrightarrow[]{}\mathcal{M}, both gradient and Hessian of the pullback function f^x:=f∘Rx:Tx​ℳ→ℝ\hat{f}_{x}:=f\circ R_{x}:T_{x}\mathcal{M}\xrightarrow{}\mathbb{R} evaluated at the origin of Tx​ℳT_{x}\mathcal{M} match the Riemannian gradient and Hessian of ff. That is, for all xβˆˆβ„³x\in\mathcal{M},

βˆ‡f^x​(0)=grad​f​(x),Β andΒ β€‹βˆ‡2f^x​(0)=Hess​f​(x)\nabla\hat{f}_{x}(0)=\emph{grad}f(x),\,\emph{ and }\,\nabla^{2}\hat{f}_{x}(0)=\emph{Hess}f(x)

ProofΒ  The proof is mainly based on (Boumal, 2020) and we include it here for completeness. First note that for any retraction (not necessarily second-order), the gradient is matched. That is, for any u∈Tx​ℳu\in T_{x}\mathcal{M}, by chain rule,

D​f^x​(0)​[u]=D​(f∘Rx)​(0)​[u]=D​f​(Rx​(0))​[D​Rx​(0)​[u]]=D​f​(x)​[u],\text{D}\hat{f}_{x}(0)[u]=\text{D}(f\circ R_{x})(0)[u]=\text{D}f(R_{x}(0))[\text{D}R_{x}(0)[u]]=Df(x)[u],

where we use the definition of retraction where Rx​(0)=xR_{x}(0)=x and D​Rx​(0)​[u]=u\text{D}R_{x}(0)[u]=u. Then we can use the definition of Riemannian gradient and its uniqueness property to show the result. Next we prove the second result. Consider a second-order Taylor expansion of ff from xx to Rx​(u)R_{x}(u) along the retraction curve as

f^x​(u)=f​(Rx​(u))\displaystyle\hat{f}_{x}(u)=f(R_{x}(u)) =f​(x)+⟨grad​f​(x),u⟩+12β€‹βŸ¨Hess​f​(x)​[u],u⟩+12β€‹βŸ¨grad​f​(x),c′′​(0)⟩+π’ͺ​(β€–uβ€–3)\displaystyle=f(x)+\langle\text{grad}f(x),u\rangle+\frac{1}{2}\langle\text{Hess}f(x)[u],u\rangle+\frac{1}{2}\langle\text{grad}f(x),c^{\prime\prime}(0)\rangle+\mathcal{O}(\|u\|^{3})
=f​(x)+⟨grad​f​(x),u⟩+12β€‹βŸ¨Hess​f​(x)​[u],u⟩+π’ͺ​(β€–uβ€–3),\displaystyle=f(x)+\langle\text{grad}f(x),u\rangle+\frac{1}{2}\langle\text{Hess}f(x)[u],u\rangle+\mathcal{O}(\|u\|^{3}), (8)

due to c′′​(0)=0c^{\prime\prime}(0)=0 for second-order retraction. Also since f^x\hat{f}_{x} is a β€˜classic’ function from vector space to real number, we can use a classic Taylor expansion of this function as

f^x​(u)=f^x​(0)+βŸ¨βˆ‡f^x​(0),u⟩+12β€‹βŸ¨βˆ‡2f^x​(0)​[u],u⟩+π’ͺ​(β€–uβ€–3).\displaystyle\hat{f}_{x}(u)=\hat{f}_{x}(0)+\langle\nabla\hat{f}_{x}(0),u\rangle+\frac{1}{2}\langle\nabla^{2}\hat{f}_{x}(0)[u],u\rangle+\mathcal{O}(\|u\|^{3}). (9)

Given that we already have βˆ‡f^x​(0)=grad​f​(x)\nabla\hat{f}_{x}(0)=\text{grad}f(x), we have by comparing (9) with (8), βˆ‡2f^x​(0)=Hess​f​(x)\nabla^{2}\hat{f}_{x}(0)=\text{Hess}f(x). Β 
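As a standard illustration (not needed for the proofs), consider the unit sphere $\mathcal{M}=\{x\in\mathbb{R}^{d}:\|x\|=1\}$ with the retraction $R_{x}(u)=\frac{x+u}{\|x+u\|}$. The curve $c(t)=R_{x}(tu)$ expands as

c(t)=x+tu-\frac{t^{2}}{2}\|u\|^{2}x+\mathcal{O}(t^{3}),

so $c^{\prime\prime}(0)=-\|u\|^{2}x$ is normal to $T_{x}\mathcal{M}$. Hence the cross term $\langle\text{grad}f(x),c^{\prime\prime}(0)\rangle$ in (8) vanishes, this retraction is second order, and Lemma 8 applies to it.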

C Proof for finite-sum setting

In this section, we prove the main complexity results under finite-sum setting. In this case, we choose B=nB=n and hence from Algorithm 2, we have access to the full gradient and grad​fℬ≑grad​F\text{grad}f_{\mathcal{B}}\equiv\text{grad}F. We start by showing the proof for the main complexity Theorem in subsection C.1. Then we prove some key lemmas necessary to derive the Theorem in C.2.

C.1 Proof for main Theorem

Theorem 1 (Finite-sum complexity) Under Assumptions 1 to 4, consider finite-sum optimization setting. For any starting point x0βˆˆβ„³x_{0}\in\mathcal{M} with the choice of parameters as

𝒯=π’ͺ~​(1Ξ΄),η≀π’ͺ~​(1L),m=b=n,B=n,r=π’ͺ~​(min⁑{Ξ΄3ρ2​ϡ,Ξ΄3ρ2​L}),\mathscr{T}=\widetilde{\mathcal{O}}\Big{(}\frac{1}{\delta}\Big{)},\quad\eta\leq\widetilde{\mathcal{O}}\Big{(}\frac{1}{L}\Big{)},\quad m=b=\sqrt{n},\quad B=n,\quad r=\widetilde{\mathcal{O}}\Big{(}\min\Big{\{}\frac{\delta^{3}}{\rho^{2}\epsilon},\sqrt{\frac{\delta^{3}}{\rho^{2}L}}\Big{\}}\Big{)},

suppose Ο΅,Ξ΄\epsilon,\delta satisfy ϡ≀π’ͺ~​(D​Lm​η),δ≀L\epsilon\leq\widetilde{\mathcal{O}}(\frac{D\sqrt{L}}{m\sqrt{\eta}}),\,\delta\leq L. Then with high probability, PRSRG(x0,Ξ·,m,b,B,r,𝒯,D,Ο΅)(x_{0},\eta,m,b,B,r,\mathscr{T},D,\epsilon) will at least once visit an (Ο΅,Ξ΄)(\epsilon,\delta)-second-order critical point with

π’ͺ~​(Δ​L​nΟ΅2+Δ​ρ2​nΞ΄4+Δ​ρ2​nΞ΄3)\widetilde{\mathcal{O}}\Big{(}\frac{\Delta L\sqrt{n}}{\epsilon^{2}}+\frac{\Delta\rho^{2}\sqrt{n}}{\delta^{4}}+\frac{\Delta\rho^{2}n}{\delta^{3}}\Big{)}

stochastic gradient queries, where Ξ”:=F​(x0)βˆ’Fβˆ—\Delta:=F(x_{0})-F^{*}.

ProofΒ  Here are all possible cases when running the main algorithm PRSRG. Notice that we need to explicitly discuss the case where the iterates escape the constraint ball $\mathbb{B}_{x}(D)$ within an epoch. In the large-gradient situation, when the iterates leave the constraint ball the epoch achieves a large function value decrease with high probability (hence it is labelled a descent epoch). Around a saddle point, when the iterates leave the constraint ball the function value already decreases with probability 1 (hence this case is merged with the case where the iterates stay inside the ball).

  • Large gradients, where $\|\text{grad}F(x)\|\geq\epsilon$:

    1. Type-1 descent epoch: the iterates escape the constraint ball.

    2. The iterates do not escape the constraint ball.

      (a) Type-2 descent epoch: at least half of the iterates in the current epoch have pullback gradient larger than $\epsilon/2$.

      (b) Useful epoch: at least half of the iterates in the current epoch have pullback gradient no larger than $\epsilon/2$, and the output $\tilde{x}$ of the current epoch has gradient no larger than $\epsilon$. (Since the output satisfies the small-gradient condition, the next epoch runs TSSRG$(\tilde{x},u_{0},\mathscr{T})$ to escape saddle points.)

      (c) Wasted epoch: at least half of the iterates in the current epoch have pullback gradient no larger than $\epsilon/2$, but the output $\tilde{x}$ of the current epoch has gradient larger than $\epsilon$.

  • Around saddle points, where $\|\text{grad}F(x)\|\leq\epsilon$ and $\lambda_{\min}(\text{Hess}F(x))\leq-\delta$:

    3. Type-3 descent epoch: the current iterate is around a saddle point.

First, because the output $\tilde{x}$ of the current epoch is selected uniformly at random from the iterates, the probability of a wasted epoch is at most $1/2$. Also, due to the independence of each wasted epoch, with high probability a wasted epoch occurs consecutively at most $\widetilde{\mathcal{O}}(1)$ times before either a descent epoch (Type 1 or 2) or a useful epoch. (That is, suppose the probability that at least half of the iterates in an epoch that stays inside the constraint ball have small pullback gradient is $\theta$. Then the probability of $X$ consecutive wasted epochs is at most $(\frac{\theta}{2})^{X}$, so with high probability $1-\iota$ there exists at least one useful epoch or Type-2 descent epoch within $X=\mathcal{O}(\log(1/\iota))=\widetilde{\mathcal{O}}(1)$ epochs.) Hereafter, we use $N_{1}$, $N_{2}$ and $N_{3}$ to denote the numbers of the three types of descent epochs, respectively.

Consider Type-1 descent epochs. From Case 2 in Lemma 11, with probability $1-\vartheta$ the function value decreases by at least $\frac{\eta m\epsilon^{2}}{32}$, and with high probability the function value does not increase. Hence, by a standard concentration argument, after $N_{1}$ such epochs the function value is reduced by $\mathcal{O}(\eta m\epsilon^{2}N_{1})$ with high probability. Given that $F(x)$ is bounded below by $F^{*}$, the total decrease cannot exceed $\Delta:=F(x_{0})-F^{*}$. Therefore, $N_{1}\leq\mathcal{O}(\frac{\Delta}{\eta m\epsilon^{2}})$. Similarly, for Type-2 descent epochs, $N_{2}\leq\mathcal{O}(\frac{\Delta}{\eta m\epsilon^{2}})$.

Consider a useful epoch, whose output satisfies $\|\text{grad}F(\tilde{x})\|\leq\epsilon$. If in addition $\lambda_{\min}(\text{Hess}F(\tilde{x}))\geq-\delta$, then $\tilde{x}$ is already an $(\epsilon,\delta)$-second-order critical point. Otherwise, the useful epoch is followed by a Type-3 descent epoch around a saddle point. From Lemma 14, we know that the function value decreases by $\mathscr{F}=\frac{\delta^{3}}{2c_{3}\rho^{2}}$ with high probability. Similar to the argument for the other types of descent epochs, $N_{3}\leq\widetilde{\mathcal{O}}(\frac{\rho^{2}\Delta}{\delta^{3}})$, where we omit $c_{3}=\widetilde{\mathcal{O}}(1)$.
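Before assembling the total cost, it may help to spell out the per-epoch stochastic gradient query counts implied by Algorithm 2 under the stated parameters (finite-sum setting, so $B=n$, with $m=b=\sqrt{n}$ and $\mathscr{T}=\widetilde{\mathcal{O}}(\frac{1}{\delta})$): a large-gradient call TSSRG$(x,0,m)$ costs $n+mb=\mathcal{O}(n)$ queries, while a saddle-escape call TSSRG$(\tilde{x},u_{0},\mathscr{T})$ costs $\lceil\mathscr{T}/m\rceil n+\mathscr{T}b=\widetilde{\mathcal{O}}(n+\frac{\sqrt{n}}{\delta})$ queries; in addition, each descent or useful epoch is preceded by at most $\widetilde{\mathcal{O}}(1)$ wasted epochs, which accounts for the $\widetilde{\mathcal{O}}(1)(n+mb)$ terms below.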

Hence, we have the following stochastic gradient complexity:

(N1+N2)​(π’ͺ~​(1)​(n+m​b)+n+m​b)+N3​(π’ͺ~​(1)​(n+m​b)+βŒˆπ’―/mβŒ‰β€‹n+𝒯​b)\displaystyle(N_{1}+N_{2})\big{(}\widetilde{\mathcal{O}}(1)\big{(}n+mb\big{)}+n+mb\big{)}+N_{3}\big{(}\widetilde{\mathcal{O}}(1)\big{(}n+mb\big{)}+\lceil\mathscr{T}/m\rceil n+\mathscr{T}b\big{)}
≀π’ͺ~​(Δη​m​ϡ2​n+ρ2​Δδ3​n+ρ2​Δδ3​(n+nΞ΄))\displaystyle\leq\widetilde{\mathcal{O}}\Big{(}\frac{\Delta}{\eta m\epsilon^{2}}n+\frac{\rho^{2}\Delta}{\delta^{3}}n+\frac{\rho^{2}\Delta}{\delta^{3}}(n+\frac{\sqrt{n}}{\delta})\Big{)}
≀π’ͺ~​(Δ​L​nΟ΅2+Δ​ρ2​nΞ΄4+Δ​ρ2​nΞ΄3),\displaystyle\leq\widetilde{\mathcal{O}}\Big{(}\frac{\Delta L\sqrt{n}}{\epsilon^{2}}+\frac{\Delta\rho^{2}\sqrt{n}}{\delta^{4}}+\frac{\Delta\rho^{2}n}{\delta^{3}}\Big{)},

where 𝒯=π’ͺ~​(1Ξ΄)\mathscr{T}=\widetilde{\mathcal{O}}(\frac{1}{\delta}), m=b=nm=b=\sqrt{n}, Ξ·=π’ͺ~​(1L)\eta=\widetilde{\mathcal{O}}(\frac{1}{L}). Β 
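To make the last substitution in the proof explicit (using only the stated choices $\eta=\widetilde{\mathcal{O}}(\frac{1}{L})$, $m=b=\sqrt{n}$ and $\mathscr{T}=\widetilde{\mathcal{O}}(\frac{1}{\delta})$):

\frac{\Delta}{\eta m\epsilon^{2}}\,n=\widetilde{\mathcal{O}}\Big{(}\frac{\Delta L\sqrt{n}}{\epsilon^{2}}\Big{)},\qquad\frac{\rho^{2}\Delta}{\delta^{3}}\cdot\frac{\sqrt{n}}{\delta}=\frac{\Delta\rho^{2}\sqrt{n}}{\delta^{4}},\qquad\frac{\rho^{2}\Delta}{\delta^{3}}\cdot n=\frac{\Delta\rho^{2}n}{\delta^{3}},

which are exactly the three terms in the stated complexity.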

C.2 Proof for key Lemmas

The organization of these lemmas is as follows. In Lemma 9, we first prove a high-probability bound on the estimation error of the modified gradient $v_{t}$. This replaces the expectation bound commonly used to derive first-order guarantees. Lemma 10 then shows that iterates that deviate far from the initialization also achieve a large function value decrease. These two results are subsequently used to derive Lemma 11, a descent lemma for the large-gradient phase.

For the saddle-point phase, we first show that at least one of the coupled sequences achieves a large deviation from its initialization (Lemma 12). This then translates into a sufficient function value decrease (Lemma 13). Finally, Lemma 14 shows that, with high probability, the iterates escape the saddle point and decrease the function value by the desired amount.

Lemma 9 (High probability bound on estimation error)

Under Assumption 2, we have the following high probability bound for estimation error of the modified gradient under finite-sum setting. That is, for s​m+1≀t≀(s+1)​msm+1\leq t\leq(s+1)m,

β€–vtβˆ’βˆ‡F^x​(ut)‖≀π’ͺ​(log32⁑(d/Ο‘))​Lbβ€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2Β with probability ​1βˆ’Ο‘,\|v_{t}-\nabla\hat{F}_{x}(u_{t})\|\leq\frac{\mathcal{O}\big{(}\log^{\frac{3}{2}}(d/\vartheta)\big{)}L}{\sqrt{b}}\sqrt{\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}}\quad\text{ with probability }1-\vartheta,

ProofΒ  For simplicity of notation, consider a single epoch with $k=1,\ldots,m$ (the argument is identical for every $s$). Consider two sequences on $T_{x}\mathcal{M}\subseteq\mathbb{R}^{d}$, defined as $y_{k}:=v_{k}-\nabla\hat{F}_{x}(u_{k})$ and $z_{k}:=y_{k}-y_{k-1}$. It is easily verified that $y_{k}$ is a martingale and $z_{k}$ is its difference sequence. Indeed, letting $\mathcal{F}_{t}$ denote the filtration at time $t$, we have $\mathbb{E}[y_{t}|\mathcal{F}_{t-1}]=\mathbb{E}_{\mathcal{F}_{t-1}}[\nabla\hat{f}_{\mathcal{I},x}(u_{t})-\nabla\hat{f}_{\mathcal{I},x}(u_{t-1})+v_{t-1}-\nabla\hat{F}_{x}(u_{t})]=v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})=y_{t-1}$, where we use the unbiasedness of i.i.d. sampling. Hence, to bound $\|y_{k}\|$ via the Azuma-Hoeffding inequality, we first need to bound $\|z_{k}\|$, which we do with the vector Bernstein inequality. First note that

zk\displaystyle z_{k} =vkβˆ’βˆ‡F^x​(uk)βˆ’vkβˆ’1+βˆ‡F^x​(ukβˆ’1)\displaystyle=v_{k}-\nabla\hat{F}_{x}(u_{k})-v_{k-1}+\nabla\hat{F}_{x}(u_{k-1})
=βˆ‡f^ℐ,x​(uk)βˆ’βˆ‡f^ℐ,x​(ukβˆ’1)+vkβˆ’1βˆ’βˆ‡F^x​(uk)βˆ’vkβˆ’1+βˆ‡F^x​(ukβˆ’1)\displaystyle=\nabla\hat{f}_{\mathcal{I},x}(u_{k})-\nabla\hat{f}_{\mathcal{I},x}(u_{k-1})+v_{k-1}-\nabla\hat{F}_{x}(u_{k})-v_{k-1}+\nabla\hat{F}_{x}(u_{k-1})
=βˆ‡f^ℐ,x​(uk)βˆ’βˆ‡f^ℐ,x​(ukβˆ’1)βˆ’βˆ‡F^x​(uk)+βˆ‡F^x​(ukβˆ’1).\displaystyle=\nabla\hat{f}_{\mathcal{I},x}(u_{k})-\nabla\hat{f}_{\mathcal{I},x}(u_{k-1})-\nabla\hat{F}_{x}(u_{k})+\nabla\hat{F}_{x}(u_{k-1}).

Denote wi:=βˆ‡f^i,x​(uk)βˆ’βˆ‡f^i,x​(ukβˆ’1)βˆ’βˆ‡F^x​(uk)+βˆ‡F^x​(ukβˆ’1)w_{i}:=\nabla\hat{f}_{i,x}(u_{k})-\nabla\hat{f}_{i,x}(u_{k-1})-\nabla\hat{F}_{x}(u_{k})+\nabla\hat{F}_{x}(u_{k-1}) and therefore zk=1bβ€‹βˆ‘iβˆˆβ„wiz_{k}=\frac{1}{b}\sum_{i\in\mathcal{I}}w_{i} with |ℐ|=b|\mathcal{I}|=b. In order to apply Bernstein inequality, we need to show that β€–wiβ€–\|w_{i}\| is bounded. This is achieved by Lemma 2. That is,

β€–wiβ€–\displaystyle\|w_{i}\| =β€–βˆ‡f^i,x​(uk)βˆ’βˆ‡f^i,x​(ukβˆ’1)βˆ’βˆ‡F^x​(uk)+βˆ‡F^x​(ukβˆ’1)β€–\displaystyle=\|\nabla\hat{f}_{i,x}(u_{k})-\nabla\hat{f}_{i,x}(u_{k-1})-\nabla\hat{F}_{x}(u_{k})+\nabla\hat{F}_{x}(u_{k-1})\|
≀βˆ₯βˆ‡f^i,x(uk)βˆ’βˆ‡f^i,x(ukβˆ’1)βˆ₯+βˆ₯βˆ‡F^x(uk)βˆ’βˆ‡F^x(ukβˆ’1)βˆ₯≀2Lβˆ₯ukβˆ’ukβˆ’1βˆ₯=:R.\displaystyle\leq\|\nabla\hat{f}_{i,x}(u_{k})-\nabla\hat{f}_{i,x}(u_{k-1})\|+\|\nabla\hat{F}_{x}(u_{k})-\nabla\hat{F}_{x}(u_{k-1})\|\leq 2L\|u_{k}-u_{k-1}\|=:R.

Also, the total variance Οƒ2\sigma^{2} is computed as

βˆ‘iβˆˆβ„π”Όβ€‹β€–wiβˆ’π”Όβ€‹[wi]β€–2\displaystyle\sum_{i\in\mathcal{I}}\mathbb{E}\|w_{i}-\mathbb{E}[w_{i}]\|^{2} =βˆ‘iβˆˆβ„π”Όβ€‹β€–wiβ€–2\displaystyle=\sum_{i\in\mathcal{I}}\mathbb{E}\|w_{i}\|^{2}
=βˆ‘iβˆˆβ„π”Όβ€‹β€–βˆ‡f^i,x​(uk)βˆ’βˆ‡f^i,x​(ukβˆ’1)βˆ’βˆ‡F^x​(uk)+βˆ‡F^x​(ukβˆ’1)β€–2\displaystyle=\sum_{i\in\mathcal{I}}\mathbb{E}\|\nabla\hat{f}_{i,x}(u_{k})-\nabla\hat{f}_{i,x}(u_{k-1})-\nabla\hat{F}_{x}(u_{k})+\nabla\hat{F}_{x}(u_{k-1})\|^{2}
β‰€βˆ‘iβˆˆβ„π”Όβˆ₯βˆ‡f^i,x(uk)βˆ’βˆ‡f^i,x(ukβˆ’1)βˆ₯2≀bL2βˆ₯ukβˆ’ukβˆ’1βˆ₯2=:Οƒ2,\displaystyle\leq\sum_{i\in\mathcal{I}}\mathbb{E}\|\nabla\hat{f}_{i,x}(u_{k})-\nabla\hat{f}_{i,x}(u_{k-1})\|^{2}\leq bL^{2}\|u_{k}-u_{k-1}\|^{2}=:\sigma^{2},

where the first inequality holds due to 𝔼​[wi]=0\mathbb{E}[w_{i}]=0 and the second last inequality applies 𝔼​‖xβˆ’π”Όβ€‹[x]β€–2≀𝔼​‖xβ€–2\mathbb{E}\|x-\mathbb{E}[x]\|^{2}\leq\mathbb{E}\|x\|^{2} and the last inequality uses the gradient Lipschitzness result in Lemma 2. Finally we can apply Bernstein inequality (Lemma 6) to bound β€–zkβ€–\|z_{k}\|. That is,

Pr​{β€–βˆ‘iβˆˆβ„wiβ€–β‰₯Ο‚}\displaystyle\text{Pr}\{\|\sum_{i\in\mathcal{I}}w_{i}\|\geq\varsigma\} =Pr​{β€–1bβ€‹βˆ‘iβˆˆβ„wiβ€–β‰₯Ο‚/b}=Pr​{β€–zkβ€–β‰₯Ο‚/b}\displaystyle=\text{Pr}\{\|\frac{1}{b}\sum_{i\in\mathcal{I}}w_{i}\|\geq\varsigma/b\}=\text{Pr}\{\|z_{k}\|\geq{\varsigma}/b\}
≀(d+1)​exp⁑(βˆ’Ο‚2/2Οƒ2+R​ς/3)\displaystyle\leq(d+1)\exp\big{(}\frac{-\varsigma^{2}/2}{\sigma^{2}+R\varsigma/3}\big{)}
≀(d+1)​exp⁑(βˆ’Ο‚2/2b​L2​‖ukβˆ’ukβˆ’1β€–2+2​b​L​‖ukβˆ’ukβˆ’1‖​ς/3)\displaystyle\leq(d+1)\exp\big{(}\frac{-\varsigma^{2}/2}{bL^{2}\|u_{k}-u_{k-1}\|^{2}+2\sqrt{b}L\|u_{k}-u_{k-1}\|\varsigma/3}\big{)}
≀ϑk,\displaystyle\leq\vartheta_{k},

where the second inequality substitutes b​L2​‖ukβˆ’ukβˆ’1β€–2bL^{2}\|u_{k}-u_{k-1}\|^{2} as Οƒ2\sigma^{2} and 2​L​‖ukβˆ’ukβˆ’1β€–2L\|u_{k}-u_{k-1}\| as RR in Lemma 6. It also uses the fact that bβ‰₯1\sqrt{b}\geq 1. The last inequality holds by the choice Ο‚=Ο‚k=π’ͺ​(log⁑(d/Ο‘k)​b​L)​‖ukβˆ’ukβˆ’1β€–\varsigma=\varsigma_{k}=\mathcal{O}\big{(}\log(d/\vartheta_{k})\sqrt{b}L\big{)}\|u_{k}-u_{k-1}\| (for example, 103​log⁑((d+1)/Ο‘k)​b​L​‖ukβˆ’ukβˆ’1β€–\frac{10}{3}\log((d+1)/\vartheta_{k})\sqrt{b}L\|u_{k}-u_{k-1}\|). This gives a probability bound for {zk}\{z_{k}\}, which is

β€–zk‖≀π’ͺ​(log⁑(d/Ο‘k)​Lb)​‖ukβˆ’ukβˆ’1β€–Β with probability ​1βˆ’Ο‘k.\|z_{k}\|\leq\mathcal{O}\Big{(}\frac{{\log(d/\vartheta_{k})}L}{\sqrt{b}}\Big{)}\|u_{k}-u_{k-1}\|\quad\text{ with probability }1-\vartheta_{k}.
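As a quick sanity check on the constant (taking, say, $\varsigma_{k}=\frac{10}{3}\log((d+1)/\vartheta_{k})\sqrt{b}L\|u_{k}-u_{k-1}\|$ as above and writing $c=\frac{10}{3}\log((d+1)/\vartheta_{k})$, assumed to satisfy $c\geq 3$), the exponent in the Bernstein bound evaluates to

\frac{\varsigma_{k}^{2}/2}{bL^{2}\|u_{k}-u_{k-1}\|^{2}+2\sqrt{b}L\|u_{k}-u_{k-1}\|\varsigma_{k}/3}=\frac{c^{2}/2}{1+2c/3}\geq\frac{c^{2}/2}{c/3+2c/3}=\frac{c}{2}\geq\log((d+1)/\vartheta_{k}),

so the right-hand side is at most $(d+1)\exp(-\log((d+1)/\vartheta_{k}))=\vartheta_{k}$, as claimed.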

Now given the bound on β€–zkβ€–\|z_{k}\|, we can bound β€–ykβ€–\|y_{k}\| using the Azuma-Hoeffding inequality. Suppose we set Ο‘k=Ο‘/m\vartheta_{k}=\vartheta/m, where mm is the epoch length. Then by union bound, for k≀mk\leq m

\text{Pr}\Big{\{}\bigcup\limits_{j=1}^{k}\big{(}\|z_{j}\|\geq\varsigma_{j}\big{)}\Big{\}}\leq\sum_{j=1}^{k}\vartheta_{j}\leq\sum_{j=1}^{m}\vartheta_{j}=\vartheta. (10)

Therefore, the probability that β€–ykβˆ’ykβˆ’1β€–=β€–zk‖≀ςk\|y_{k}-y_{k-1}\|=\|z_{k}\|\leq\varsigma_{k} for all k=1,…,mk=1,...,m is at least 1βˆ’Ο‘1-\vartheta. Hence by Lemma 7, we have

Pr​{β€–ykβˆ’y0β€–β‰₯Ξ²}≀(d+1)​exp⁑(βˆ’Ξ²28β€‹βˆ‘j=1kΟ‚j2)+k​ϑk≀2​ϑ,\text{Pr}\{\|y_{k}-y_{0}\|\geq\beta\}\leq(d+1)\exp\big{(}\frac{-\beta^{2}}{8\sum_{j=1}^{k}\varsigma_{j}^{2}}\big{)}+k\vartheta_{k}\leq 2\vartheta, (11)

where the last inequality holds due to k≀mk\leq m and the choice that

Ξ²=8β€‹βˆ‘j=1kΟ‚j2​log⁑((d+1)/Ο‘)=π’ͺ​(log32⁑(d/Ο‘)​Lb)β€‹βˆ‘j=1kβ€–ujβˆ’ujβˆ’1β€–2,\beta=\sqrt{8\sum_{j=1}^{k}\varsigma^{2}_{j}\log((d+1)/\vartheta)}=\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d/\vartheta)L}{\sqrt{b}}\Big{)}\sqrt{\sum_{j=1}^{k}\|u_{j}-u_{j-1}\|^{2}},

where we denote log32⁑(a)=(log⁑(a))32\log^{\frac{3}{2}}(a)=(\log(a))^{\frac{3}{2}}. Note under finite-sum setting, y0=v0βˆ’βˆ‡F^x​(u0)=0y_{0}=v_{0}-\nabla\hat{F}_{x}(u_{0})=0. Thus (11) implies for k∈[1,m]k\in[1,m],

β€–vkβˆ’βˆ‡F^x​(uk)β€–=β€–yk‖≀π’ͺ​(log32⁑(d/Ο‘)​Lb)β€‹βˆ‘j=1kβ€–ujβˆ’ujβˆ’1β€–2\|v_{k}-\nabla\hat{F}_{x}(u_{k})\|=\|y_{k}\|\leq\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d/\vartheta)L}{\sqrt{b}}\Big{)}\sqrt{\sum_{j=1}^{k}\|u_{j}-u_{j-1}\|^{2}} (12)

holds with probability $1-2\vartheta$. Note that we can always rerun the argument with $\vartheta/2$ in place of $\vartheta$ while the bound keeps the same order, because $\mathcal{O}(\log^{\frac{3}{2}}(2d/\vartheta))=\mathcal{O}(\log^{\frac{3}{2}}(d/\vartheta))$. Hence the stated probability becomes $1-\vartheta$. Β 

Lemma 10 (Improve or localize)

Let $\{u_{t}\}_{t=1}^{\mathscr{T}}$ be the sequence generated by running $\textsf{TSSRG}(x,u_{0},\mathscr{T})$. Suppose we choose $b\geq m$ and $\eta\leq\frac{1}{2c_{1}L}$, where $c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(dt/\vartheta))$. Then, for any $1\leq t\leq\mathscr{T}$, we have

β€–utβˆ’u0‖≀4​tc1​L​(F^x​(u0)βˆ’F^x​(ut))Β with probability ​1βˆ’Ο‘,\|u_{t}-u_{0}\|\leq\sqrt{\frac{4t}{c_{1}L}(\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}))}\quad\quad\text{ with probability }1-\vartheta,
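Equivalently, squaring and rearranging gives the form invoked in Lemma 11 when the iterates leave the constraint ball:

\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t})\geq\frac{c_{1}L}{4t}\|u_{t}-u_{0}\|^{2}\quad\text{ with probability }1-\vartheta.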

ProofΒ  First by generalizing (16) to any epoch (i.e. 1≀t≀𝒯1\leq t\leq\mathscr{T}), we have

F^x​(ut)βˆ’F^x​(us​m)\displaystyle\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{sm}) β‰€βˆ’Ξ·2β€‹βˆ‘j=s​m+1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2βˆ’Ξ·β€‹c12​L22)β€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2\displaystyle\leq-\frac{\eta}{2}\sum_{j=sm+1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}-\big{(}\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2}\big{)}\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}
β‰€βˆ’(12β€‹Ξ·βˆ’L2βˆ’Ξ·β€‹c12​L22)β€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2\displaystyle\leq-\big{(}\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2}\big{)}\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}
β‰€βˆ’c1​L4β€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2,\displaystyle\leq-\frac{c_{1}L}{4}\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}, (13)

where the last inequality holds due to the choice $\eta\leq\frac{1}{2c_{1}L}$ and the assumption that $c_{1}\geq 1$. Summing (13) over all epochs up to $t$ gives

F^x​(ut)βˆ’F^x​(u0)β‰€βˆ’c1​L4β€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2.\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{0})\leq-\frac{c_{1}L}{4}\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}.

Also by Cauchy-Schwarz inequality and triangle inequality,

tβ€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2β‰₯βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–β‰₯β€–utβˆ’u0β€–.\sqrt{t\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}}\geq\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|\geq\|u_{t}-u_{0}\|.

The proof is complete by noting tβ€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2≀4​tc1​L​(F^x​(u0)βˆ’F^x​(ut))\sqrt{t\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}}\leq\sqrt{\frac{4t}{c_{1}L}(\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}))}. Β 
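For completeness, the elementary estimate behind the last inequality in (13), using only $\eta\leq\frac{1}{2c_{1}L}$ and $c_{1}\geq 1$, is

\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2}\geq c_{1}L-\frac{c_{1}L}{2}-\frac{c_{1}L}{4}=\frac{c_{1}L}{4},

since $\frac{1}{2\eta}\geq c_{1}L$, $\frac{L}{2}\leq\frac{c_{1}L}{2}$ and $\frac{\eta c_{1}^{2}L^{2}}{2}\leq\frac{c_{1}L}{4}$.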

Lemma 11 (Large gradient descent lemma)

(Lemma 4 in the main text.) Under Assumptions 2, 3, and 4, suppose we choose $\eta\leq\frac{1}{2c_{1}L}$ and $b\geq m$, where $c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(dm/\vartheta))$. Consider $x\in\mathcal{M}$ with $\|\text{grad}F(x)\|\geq\epsilon$, where $\epsilon\leq\frac{D}{m}\sqrt{\frac{8c_{1}L}{\eta}}$. Then by running $\textsf{TSSRG}(x,0,m)$, we have the following three cases:

  1. When the iterates $\{u_{j}\}_{j=1}^{m}$ do not leave the constraint ball $\mathbb{B}_{x}(D)$:

    (a) If at least half of the iterates in the epoch satisfy $\|\nabla\hat{F}_{x}(u_{j})\|\leq\epsilon/2$ for $j=1,...,m$, then with probability at least $1/2$, we have $\|\text{grad}F(\textsf{TSSRG}(x,0,m))\|\leq\epsilon$.

    (b) Otherwise, with probability at least $1/5$, we have $F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{\eta m\epsilon^{2}}{32}$.

  2. When one of the iterates $\{u_{j}\}_{j=1}^{m}$ leaves the constraint ball $\mathbb{B}_{x}(D)$, with probability at least $1-\vartheta$, we have $F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{\eta m\epsilon^{2}}{32}$.

Regardless of which case occurs, $F(\textsf{TSSRG}(x,0,m))\leq F(x)$ with high probability.

ProofΒ  First note that when the gradient is large, we always call $\textsf{TSSRG}(x,0,m)$ with the total number of iterations set to $m$ (i.e. a single epoch). Hence, we consider $t=1,\ldots,m$ (in TSSRG). Compared to the proof in (Li, 2019), we additionally need to address the case where the iterates fall outside the prescribed ball $\mathbb{B}_{x}(D)$, so we divide the proof into two parts.

1. Iterates do not leave the constraint ball. By LL-Lipschitzness in Lemma 2, we have

F^x​(ut)\displaystyle\hat{F}_{x}(u_{t}) ≀F^x​(utβˆ’1)+βŸ¨βˆ‡F^x​(utβˆ’1),utβˆ’utβˆ’1⟩+L2​‖utβˆ’utβˆ’1β€–2\displaystyle\leq\hat{F}_{x}(u_{t-1})+\langle\nabla\hat{F}_{x}(u_{t-1}),u_{t}-u_{t-1}\rangle+\frac{L}{2}\|u_{t}-u_{t-1}\|^{2}
=F^x​(utβˆ’1)βˆ’Ξ·β€‹βŸ¨βˆ‡F^x​(utβˆ’1),vtβˆ’1⟩+Ξ·2​L2​‖vtβˆ’1β€–2\displaystyle=\hat{F}_{x}(u_{t-1})-\eta\langle\nabla\hat{F}_{x}(u_{t-1}),v_{t-1}\rangle+\frac{\eta^{2}L}{2}\|v_{t-1}\|^{2}
=F^x​(utβˆ’1)βˆ’Ξ·2β€‹β€–βˆ‡F^x​(utβˆ’1)β€–2βˆ’Ξ·2​‖vtβˆ’1β€–2+Ξ·2​‖vtβˆ’1βˆ’βˆ‡F^x​(utβˆ’1)β€–2+Ξ·2​L2​‖vtβˆ’1β€–2\displaystyle=\hat{F}_{x}(u_{t-1})-\frac{\eta}{2}\|\nabla\hat{F}_{x}(u_{t-1})\|^{2}-\frac{\eta}{2}\|v_{t-1}\|^{2}+\frac{\eta}{2}\|v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})\|^{2}+\frac{\eta^{2}L}{2}\|v_{t-1}\|^{2}
=F^x​(utβˆ’1)βˆ’Ξ·2β€‹β€–βˆ‡F^x​(utβˆ’1)β€–2+Ξ·2​‖vtβˆ’1βˆ’βˆ‡F^x​(utβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2)​‖utβˆ’utβˆ’1β€–2.\displaystyle=\hat{F}_{x}(u_{t-1})-\frac{\eta}{2}\|\nabla\hat{F}_{x}(u_{t-1})\|^{2}+\frac{\eta}{2}\|v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})\|^{2}-\big{(}\frac{1}{2\eta}-\frac{L}{2}\big{)}\|u_{t}-u_{t-1}\|^{2}. (14)

From Lemma 9, we know that

β€–vtβˆ’1βˆ’βˆ‡F^x​(utβˆ’1)β€–2≀π’ͺ​(log3⁑(d/Ο‘))​L2bβ€‹βˆ‘j=s​m+1tβˆ’1β€–ujβˆ’ujβˆ’1β€–2\|v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})\|^{2}\leq\frac{\mathcal{O}(\log^{3}(d/\vartheta))L^{2}}{b}\sum_{j=sm+1}^{t-1}\|u_{j}-u_{j-1}\|^{2} (15)

holds with high probability $1-\vartheta$. By a union bound, the probability that (15) fails for some $1\leq\tau\leq t$ is at most $t\vartheta$, and therefore (15) holds for all $1\leq\tau\leq t$ with probability $1-t\vartheta$. Replacing $\vartheta$ by $\vartheta/t$, we have for all $1\leq\tau\leq t$,

β€–vΟ„βˆ’1βˆ’βˆ‡F^x​(uΟ„βˆ’1)β€–2≀π’ͺ​(log3⁑(d​t/Ο‘))​L2bβ€‹βˆ‘j=1Ο„βˆ’1β€–ujβˆ’ujβˆ’1β€–2Β with probability ​1βˆ’Ο‘.\|v_{\tau-1}-\nabla\hat{F}_{x}(u_{\tau-1})\|^{2}\leq\frac{\mathcal{O}(\log^{3}(dt/\vartheta))L^{2}}{b}\sum_{j=1}^{\tau-1}\|u_{j}-u_{j-1}\|^{2}\quad\text{ with probability }1-\vartheta.

Substituting this result into (14) and summing over this epoch from 11 to tt gives

F^x​(ut)βˆ’F^x​(u0)\displaystyle\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{0})
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2+η​c12​L22​bβ€‹βˆ‘k=1tβˆ’1βˆ‘j=1kβ€–ujβˆ’ujβˆ’1β€–2βˆ’(12β€‹Ξ·βˆ’L2)β€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}+\frac{\eta c_{1}^{2}L^{2}}{2b}\sum_{k=1}^{t-1}\sum_{j=1}^{k}\|u_{j}-u_{j-1}\|^{2}-\big{(}\frac{1}{2\eta}-\frac{L}{2}\big{)}\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2+η​c12​L2​m2​bβ€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2βˆ’(12β€‹Ξ·βˆ’L2)β€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}+\frac{\eta c_{1}^{2}L^{2}m}{2b}\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}-\big{(}\frac{1}{2\eta}-\frac{L}{2}\big{)}\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2βˆ’Ξ·β€‹c12​L22)β€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}-\big{(}\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2}\big{)}\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2} (16)
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2,\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}, (17)

where $c_{1}:=\mathcal{O}(\log^{\frac{3}{2}}(dt/\vartheta))$. The second inequality uses the fact that $t\leq m$ and the third inequality holds due to the choice $b\geq m$. The last inequality holds by noticing that $\eta\leq\frac{1}{2c_{1}L}\leq\frac{\sqrt{4c_{1}^{2}+1}-1}{2c_{1}^{2}L}$ when $c_{1}\geq 1$; indeed, writing $x=\eta L$, the requirement $\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2}\geq 0$ is equivalent to $c_{1}^{2}x^{2}+x-1\leq 0$, i.e., $x\leq\frac{\sqrt{4c_{1}^{2}+1}-1}{2c_{1}^{2}}$. Note that we require (17) to hold for all $t\leq m$ and thus we set $c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(dm/\vartheta))$. Then we have the following two cases.

  • (Case 1a) Suppose at least half of the iterates in the epoch satisfy $\|\nabla\hat{F}_{x}(u_{j})\|\leq\epsilon/2$ for $j=1,...,m$. Then by uniformly sampling from $\{u_{j}\}_{j=1}^{m}$ (i.e. breaking uniformly at random and returning $u_{t}$ as $u_{m}$, as in Algorithm 2, Line 12), the output $u_{m}$ has gradient norm $\|\nabla\hat{F}_{x}(u_{m})\|$ no larger than $\epsilon/2$ with probability $1/2$. Recall the definition of the pullback gradient in Lemma 1, i.e. $\nabla\hat{F}_{x}(u_{m})=T_{u_{m}}^{*}\text{grad}F(R_{x}(u_{m}))$. Then the output of $\textsf{TSSRG}(x,0,m)$ satisfies

    β€–grad​F​(TSSRG​(x,0,m))β€–=β€–grad​F​(Rx​(um))β€–\displaystyle\|\text{grad}F(\textsf{TSSRG}(x,0,m))\|=\|\text{grad}F(R_{x}(u_{m}))\| =β€–(Tumβˆ—)βˆ’1β€‹βˆ‡F^x​(um)β€–\displaystyle=\|(T^{*}_{u_{m}})^{-1}\nabla\hat{F}_{x}(u_{m})\|
    ≀‖(Tumβˆ—)βˆ’1β€–β€‹β€–βˆ‡F^x​(um)β€–\displaystyle\leq\|(T^{*}_{u_{m}})^{-1}\|\|\nabla\hat{F}_{x}(u_{m})\|
    ≀2β‹…Ο΅2=Ο΅,\displaystyle\leq 2\cdot\frac{\epsilon}{2}=\epsilon,

    where we use Οƒmin​(Tu)β‰₯1/2\sigma_{\min}(T_{u})\geq 1/2 in Lemma 3.

  • (Case 1b) Suppose at least half of the points in the epoch satisfy $\|\nabla\hat{F}_{x}(u_{j})\|\geq\epsilon/2$ for $j=1,...,m$. With probability $1/4$, the output $u_{t}$ falls within the last quarter of $\{u_{j}\}_{j=1}^{m}$ by uniform sampling. In this case, $\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}\geq\frac{m}{4}\cdot\frac{\epsilon^{2}}{4}=\frac{m\epsilon^{2}}{16}$, because at least a quarter of the points with large gradient appear before $u_{t}$. Thus, by (17), we have

    F​(TSSRG​(x,0,m))βˆ’F​(x)=F​(Rx​(ut))βˆ’F​(x)\displaystyle F(\textsf{TSSRG}(x,0,m))-F(x)=F(R_{x}(u_{t}))-F(x) =F^x​(ut)βˆ’F^x​(0)\displaystyle=\hat{F}_{x}(u_{t})-\hat{F}_{x}(0)
    β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}
    β‰€βˆ’Ξ·β€‹m​ϡ232.\displaystyle\leq-\frac{\eta m\epsilon^{2}}{32}. (18)

    Note that (17) holds with probability $1-\vartheta$. Then by a union bound, (18) holds with probability at least $\frac{1}{4}-\vartheta$. Choosing $\vartheta\leq\frac{1}{20}$, (18) thus holds with probability at least $\frac{1}{5}$.

2. Iterates leave the constraint ball. Suppose that at some iteration $\tau\leq m$ we have $\|u_{\tau}\|>D$. Then, by Lemma 10, the function value has already decreased substantially. That is, with probability $1-\vartheta$, running $\textsf{TSSRG}(x,0,m)$ gives

F^x​(0)βˆ’F^x​(uΟ„)β‰₯c1​L4​τ​‖uΟ„β€–2β‰₯c1​L​D24​mβ‰₯η​m​ϡ232,\displaystyle\hat{F}_{x}(0)-\hat{F}_{x}(u_{\tau})\geq\frac{c_{1}L}{4\tau}\|u_{\tau}\|^{2}\geq\frac{c_{1}LD^{2}}{4m}\geq\frac{\eta m\epsilon^{2}}{32},

where the last inequality follows from the choice $\epsilon\leq\frac{D}{m}\sqrt{\frac{8c_{1}L}{\eta}}$; this requirement is not difficult to satisfy since $\eta$ can be taken sufficiently small and $c_{1}$ sufficiently large. Hence, by returning $u_{\tau}$ as $u_{m}$, we have

F​(TSSRG​(x,0,m))βˆ’F​(x)β‰€βˆ’Ξ·β€‹m​ϡ232.F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{\eta m\epsilon^{2}}{32}.
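For completeness, the comparison $\frac{c_{1}LD^{2}}{4m}\geq\frac{\eta m\epsilon^{2}}{32}$ used above is just a restatement of the stated restriction on $\epsilon$:

\frac{c_{1}LD^{2}}{4m}\geq\frac{\eta m\epsilon^{2}}{32}\iff\epsilon^{2}\leq\frac{8c_{1}LD^{2}}{\eta m^{2}}\iff\epsilon\leq\frac{D}{m}\sqrt{\frac{8c_{1}L}{\eta}}.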

In summary, regardless of whether the iterates stay within the ball 𝔹x​(D)\mathbb{B}_{x}(D), with high probability, either the gradient norm of the output is small, or the function value decreases a lot. Note for Case 1a, the function still decreases by (17). Β 

Lemma 12 (Small stuck region)

Consider $x\in\mathcal{M}$ with $\|\text{grad}F(x)\|\leq\epsilon$ and $-\gamma:=\lambda_{\min}(\nabla^{2}\hat{F}_{x}(0))=\lambda_{\min}(\text{Hess}F(x))\leq-\delta$, and suppose $L\geq\delta$. Let $u_{0},u_{0}^{\prime}\in T_{x}\mathcal{M}$ be two random perturbations satisfying $\|u_{0}\|,\|u_{0}^{\prime}\|\leq r$ and $u_{0}-u_{0}^{\prime}=r_{0}e_{1}$, where $e_{1}$ denotes the eigenvector corresponding to the smallest eigenvalue of $\nabla^{2}\hat{F}_{x}(0)$, and $r_{0}=\frac{\nu r}{\sqrt{d}}$, $r\leq\frac{\delta}{c_{2}\rho}$. Also set the parameters $\mathscr{T}=\frac{2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})}{\delta}=\widetilde{\mathcal{O}}(\frac{1}{\delta})$ and $\eta\leq\min\{\frac{1}{8C_{1}\log(\mathscr{T})L},\frac{1}{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})L}\}=\widetilde{\mathcal{O}}(\frac{1}{L})$, where $\alpha>1$ is chosen sufficiently close to $1$ such that $\log_{\alpha}(1+\eta\gamma)>\gamma$. Then for $\{u_{t}\},\{u_{t}^{\prime}\}$ generated by running $\textsf{TSSRG}(x,u_{0},\mathscr{T})$ and $\textsf{TSSRG}(x,u_{0}^{\prime},\mathscr{T})$ with the same sets of mini-batches, with probability $1-\zeta$, we have

βˆƒt≀𝒯,max⁑{β€–utβˆ’u0β€–,β€–utβ€²βˆ’u0β€²β€–}β‰₯Ξ΄c2​ρ,\exists\,t\leq\mathscr{T},\,\max\{\|u_{t}-u_{0}\|,\|u_{t}^{\prime}-u_{0}^{\prime}\|\}\geq\frac{\delta}{c_{2}\rho},

where c2β‰₯max⁑{12​C1L,2​δD​ρ}c_{2}\geq\max\{\frac{12C_{1}}{L},\frac{2\delta}{D\rho}\} and C1=π’ͺ​(log32⁑(d​𝒯/ΞΆ))=π’ͺ~​(1)C_{1}=\mathcal{O}(\log^{\frac{3}{2}}(d\mathscr{T}/\zeta))=\widetilde{\mathcal{O}}(1).

ProofΒ  The proof is by contradiction. So we assume

\forall\,t\leq\mathscr{T},\,\max\{\|u_{t}-u_{0}\|,\|u_{t}^{\prime}-u_{0}^{\prime}\|\}\leq\frac{\delta}{c_{2}\rho}. (19)

First, note that neither of the two sequences $\{u_{j}\}_{j=0}^{\mathscr{T}}$, $\{u_{j}^{\prime}\}_{j=0}^{\mathscr{T}}$ escapes the ball $\mathbb{B}_{x}(D)$ within $\mathscr{T}$ steps under condition (19). This is because (taking $\{u_{j}\}$ for example),

β€–ut‖≀‖utβˆ’u0β€–+β€–u0‖≀δc2​ρ+r≀2​δc2​ρ≀D,\|u_{t}\|\leq\|u_{t}-u_{0}\|+\|u_{0}\|\leq\frac{\delta}{c_{2}\rho}+r\leq\frac{2\delta}{c_{2}\rho}\leq D,

where we apply $r\leq\frac{\delta}{c_{2}\rho}$ and $c_{2}\geq\frac{2\delta}{D\rho}$. Hence, if condition (19) is satisfied, $\{u_{j}\},\{u_{j}^{\prime}\}$ must remain inside the ball $\mathbb{B}_{x}(D)$. As a result, we can proceed by following a similar idea as in (Li, 2019): we show that the distance between the two coupled sequences grows exponentially, so that by the triangle inequality it eventually exceeds the bound in (19), which gives the desired contradiction.

Denote u^t:=utβˆ’utβ€²\hat{u}_{t}:=u_{t}-u_{t}^{\prime} and β„‹:=βˆ‡2F^x​(0)\mathcal{H}:=\nabla^{2}\hat{F}_{x}(0). With ut=utβˆ’1βˆ’Ξ·β€‹vtβˆ’1u_{t}=u_{t-1}-\eta v_{t-1}, we can express u^t\hat{u}_{t} as

u^t\displaystyle\hat{u}_{t} =u^tβˆ’1βˆ’Ξ·β€‹(vtβˆ’1βˆ’vtβˆ’1β€²)\displaystyle=\hat{u}_{t-1}-\eta(v_{t-1}-v_{t-1}^{\prime})
=u^tβˆ’1βˆ’Ξ·β€‹(βˆ‡F^x​(utβˆ’1)βˆ’βˆ‡F^x​(utβˆ’1β€²)+vtβˆ’1βˆ’βˆ‡F^x​(utβˆ’1)βˆ’vtβˆ’1β€²+βˆ‡F^x​(utβˆ’1β€²))\displaystyle=\hat{u}_{t-1}-\eta(\nabla\hat{F}_{x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t-1}^{\prime})+v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})-v_{t-1}^{\prime}+\nabla\hat{F}_{x}(u_{t-1}^{\prime}))
=u^tβˆ’1βˆ’Ξ·(∫01βˆ‡2F^x(utβˆ’1β€²+ΞΈ(utβˆ’1βˆ’utβˆ’1β€²))[utβˆ’1βˆ’utβˆ’1β€²]dΞΈ+vtβˆ’1βˆ’βˆ‡F^x(utβˆ’1)\displaystyle=\hat{u}_{t-1}-\eta\big{(}\int_{0}^{1}\nabla^{2}\hat{F}_{x}(u_{t-1}^{\prime}+\theta(u_{t-1}-u_{t-1}^{\prime}))[u_{t-1}-u_{t-1}^{\prime}]d\theta+v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})
βˆ’vtβˆ’1β€²+βˆ‡F^x(utβˆ’1β€²))\displaystyle\quad-v_{t-1}^{\prime}+\nabla\hat{F}_{x}(u_{t-1}^{\prime})\big{)}
=u^tβˆ’1βˆ’Ξ·β€‹((β„‹+Ξ”tβˆ’1)​u^tβˆ’1+y^tβˆ’1)\displaystyle=\hat{u}_{t-1}-\eta\big{(}(\mathcal{H}+\Delta_{t-1})\hat{u}_{t-1}+\hat{y}_{t-1}\big{)} (20)
=(Iβˆ’Ξ·β€‹β„‹)​u^tβˆ’1βˆ’Ξ·β€‹(Ξ”tβˆ’1​u^tβˆ’1+y^tβˆ’1).\displaystyle=(I-\eta\mathcal{H})\hat{u}_{t-1}-\eta(\Delta_{t-1}\hat{u}_{t-1}+\hat{y}_{t-1}). (21)

where Ξ”t:=∫01[βˆ‡2F^x​(utβ€²+θ​(utβˆ’utβ€²))βˆ’β„‹]​𝑑θ\Delta_{t}:=\int_{0}^{1}[\nabla^{2}\hat{F}_{x}(u_{t}^{\prime}+\theta(u_{t}-u_{t}^{\prime}))-\mathcal{H}]d\theta and y^t:=ytβˆ’ytβ€²:=vtβˆ’βˆ‡F^x​(ut)βˆ’vtβ€²+βˆ‡F^x​(utβ€²)\hat{y}_{t}:=y_{t}-y_{t}^{\prime}:=v_{t}-\nabla\hat{F}_{x}(u_{t})-v_{t}^{\prime}+\nabla\hat{F}_{x}(u_{t}^{\prime}). Note we can bound β€–Ξ”tβ€–\|\Delta_{t}\| as

β€–Ξ”tβ€–=β€–βˆ«01(βˆ‡2F^x​(utβ€²+θ​(utβˆ’utβ€²))βˆ’βˆ‡2F^x​(0))​𝑑θ‖\displaystyle\|\Delta_{t}\|=\|\int_{0}^{1}\big{(}\nabla^{2}\hat{F}_{x}(u_{t}^{\prime}+\theta(u_{t}-u_{t}^{\prime}))-\nabla^{2}\hat{F}_{x}(0)\big{)}d\theta\| β‰€βˆ«01ρ​‖utβ€²+θ​(utβˆ’utβ€²)‖​𝑑θ\displaystyle\leq\int_{0}^{1}\rho\|u_{t}^{\prime}+\theta(u_{t}-u_{t}^{\prime})\|d\theta
β‰€βˆ«01ρ​(θ​‖utβ€–+(1βˆ’ΞΈ)​‖utβ€²β€–)​𝑑θ\displaystyle\leq\int_{0}^{1}\rho\big{(}\theta\|u_{t}\|+(1-\theta)\|u_{t}^{\prime}\|\big{)}d\theta
≀ρ​max⁑{β€–utβ€–,β€–utβ€²β€–}\displaystyle\leq\rho\max\{\|u_{t}\|,\|u_{t}^{\prime}\|\}
=Οβ€‹π’Ÿt≀ρ​(Ξ΄c2​ρ+r),\displaystyle=\rho\mathscr{D}_{t}\leq\rho\big{(}\frac{\delta}{c_{2}\rho}+r\big{)}, (22)

where π’Ÿt:=max⁑{β€–utβ€–,β€–utβ€²β€–}\mathscr{D}_{t}:=\max\{\|u_{t}\|,\|u_{t}^{\prime}\|\}. The first inequality uses Assumption 2 and the last inequality follows from β€–ut‖≀‖utβˆ’u0β€–+β€–u0‖≀δc2​ρ+r\|u_{t}\|\leq\|u_{t}-u_{0}\|+\|u_{0}\|\leq\frac{\delta}{c_{2}\rho}+r. Denote for tβ‰₯1t\geq 1,

p​(t):=(Iβˆ’Ξ·β€‹β„‹)t​u^0=(Iβˆ’Ξ·β€‹β„‹)t​r0​e1=(1+η​γ)t​r0​e1,\displaystyle p(t):=(I-\eta\mathcal{H})^{t}\hat{u}_{0}=(I-\eta\mathcal{H})^{t}r_{0}e_{1}=(1+\eta\gamma)^{t}r_{0}e_{1},
q​(t):=Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​(Ξ”j​u^j+y^j),Β and\displaystyle q(t):=\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}(\Delta_{j}\hat{u}_{j}+\hat{y}_{j}),\quad\text{ and } (23)
p​(0)=u^0,q​(0)=0.\displaystyle p(0)=\hat{u}_{0},\,q(0)=0. (24)

By induction, we can show that $\hat{u}_{t}=p(t)-q(t)$: indeed, if $\hat{u}_{t-1}=p(t-1)-q(t-1)$, then (21) gives $\hat{u}_{t}=(I-\eta\mathcal{H})p(t-1)-\big{(}(I-\eta\mathcal{H})q(t-1)+\eta(\Delta_{t-1}\hat{u}_{t-1}+\hat{y}_{t-1})\big{)}=p(t)-q(t)$. It is noticed that $\|p(t)\|=(1+\eta\gamma)^{t}r_{0}$ grows exponentially, so the only thing left is to show that $q(t)$ is dominated by $p(t)$ as $t$ increases. To this end, we inductively show that (1) $\|\hat{y}_{t}\|\leq\gamma L(1+\eta\gamma)^{t}r_{0}$ and (2) $\|q(t)\|\leq\|p(t)\|/2$. First note that when $t=0$, these two conditions clearly hold. Now suppose these claims hold for all $j\leq t-1$. Then we immediately have, by the triangle inequality,

β€–u^j‖≀‖p​(j)β€–+β€–q​(j)‖≀32​‖p​(j)‖≀32​(1+η​γ)j​r0\|\hat{u}_{j}\|\leq\|p(j)\|+\|q(j)\|\leq\frac{3}{2}\|p(j)\|\leq\frac{3}{2}(1+\eta\gamma)^{j}r_{0} (25)

holds for j≀tβˆ’1j\leq t-1. We first prove that second condition holds for j=tj=t.

Proof that β€–q​(t)β€–\boldsymbol{\|q(t)\|} is bounded by β€–p​(t)β€–/𝟐\boldsymbol{\|p(t)\|/2}. Note that from the definition in (23),

β€–q​(t)β€–β‰€β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​Δj​u^jβ€–+β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​y^jβ€–.\|q(t)\|\leq\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\Delta_{j}\hat{u}_{j}\|+\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\hat{y}_{j}\|.

Then we can bound β€–q​(t)β€–\|q(t)\| by respectively bounding the two terms. The first term is bounded as follows,

β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​Δj​u^jβ€–\displaystyle\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\Delta_{j}\hat{u}_{j}\| β‰€Ξ·β€‹βˆ‘j=0tβˆ’1β€–(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j‖​‖Δj‖​‖u^jβ€–\displaystyle\leq\eta\sum_{j=0}^{t-1}\|(I-\eta\mathcal{H})^{t-1-j}\|\|\Delta_{j}\|\|\hat{u}_{j}\|
≀η​ρ​(Ξ΄c2​ρ+r)β€‹βˆ‘j=0tβˆ’1(1+η​γ)tβˆ’1βˆ’j​‖u^jβ€–\displaystyle\leq\eta\rho(\frac{\delta}{c_{2}\rho}+r)\sum_{j=0}^{t-1}(1+\eta\gamma)^{t-1-j}\|\hat{u}_{j}\|
≀32​η​ρ​(Ξ΄c2​ρ+r)​t​(1+η​γ)tβˆ’1​r0≀3​η​δc2​𝒯​‖p​(t)‖≀14​‖p​(t)β€–,\displaystyle\leq\frac{3}{2}\eta\rho(\frac{\delta}{c_{2}\rho}+r)t(1+\eta\gamma)^{t-1}r_{0}\leq\frac{3\eta\delta}{c_{2}}\mathscr{T}\|p(t)\|\leq\frac{1}{4}\|p(t)\|, (26)

where the second inequality holds because β€–Iβˆ’Ξ·β€‹β„‹β€–=Ξ»max​(Iβˆ’Ξ·β€‹β„‹)=1+η​γ\|I-\eta\mathcal{H}\|=\lambda_{\max}(I-\eta\mathcal{H})=1+\eta\gamma. The third inequality applies the first condition for j≀tβˆ’1j\leq t-1. The fourth inequality is by r≀δc2​ρr\leq\frac{\delta}{c_{2}\rho} and t≀𝒯t\leq\mathscr{T}. The last inequality follows by the choice of parameters 𝒯≀2​logα⁑(8​δ​dc2​ρ​ν​r)/(η​γ),δ≀γ\mathscr{T}\leq 2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})/(\eta\gamma),\delta\leq\gamma and c2β‰₯24​logα⁑(8​δ​dc2​ρ​ν​r)c_{2}\geq 24\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r}) where Ξ±β‰₯1\alpha\geq 1 is a constant defined later. The second term can also be similarly bounded as

β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​y^jβ€–\displaystyle\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\hat{y}_{j}\| β‰€Ξ·β€‹βˆ‘j=0tβˆ’1(1+η​γ)tβˆ’1βˆ’j​‖y^jβ€–β‰€Ξ·β€‹βˆ‘j=0tβˆ’1(1+η​γ)tβˆ’1βˆ’j​γ​L​(1+η​γ)j​r0\displaystyle\leq\eta\sum_{j=0}^{t-1}(1+\eta\gamma)^{t-1-j}\|\hat{y}_{j}\|\leq\eta\sum_{j=0}^{t-1}(1+\eta\gamma)^{t-1-j}\gamma L(1+\eta\gamma)^{j}r_{0}
≀η​γ​L​𝒯​(1+η​γ)tβˆ’1​r0≀14​(1+η​γ)t​r0=14​‖p​(t)β€–,\displaystyle\leq\eta\gamma L\mathscr{T}(1+\eta\gamma)^{t-1}r_{0}\leq\frac{1}{4}(1+\eta\gamma)^{t}r_{0}=\frac{1}{4}\|p(t)\|, (27)

where we apply the inductive bound on $\|\hat{y}_{j}\|$ in the second inequality. The last inequality uses $\mathscr{T}\leq 2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})/\gamma$ and $\eta\leq\frac{1}{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})L}$, which together give $\eta\gamma L\mathscr{T}\leq 2\eta L\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})\leq\frac{1}{4}$. From (26) and (27), we conclude that $\|q(t)\|\leq\|p(t)\|/2$. Note that we can always assume $\eta\leq 1$ given its requirement; therefore $1/\gamma\leq 1/(\eta\gamma)$, and it is sufficient to require $\mathscr{T}=2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})/\gamma$ to guarantee the result. We can now proceed to prove the first condition, which is the intermediate result used above in proving the second condition.

Proof that β€–y^t‖​ is bounded by ​γ​L​(𝟏+η​γ)t​r𝟎\boldsymbol{\|\hat{y}_{t}\|\text{ is bounded by }\gamma L(1+\eta\gamma)^{t}r_{0}}. We first re-write y^t\hat{y}_{t} into a recursive form as

y^t\displaystyle\hat{y}_{t} =vtβˆ’βˆ‡F^x​(ut)βˆ’vtβ€²+βˆ‡F^x​(utβ€²)\displaystyle=v_{t}-\nabla\hat{F}_{x}(u_{t})-v_{t}^{\prime}+\nabla\hat{F}_{x}(u_{t}^{\prime})
=βˆ‡f^ℐ,x​(ut)βˆ’βˆ‡f^ℐ,x​(utβˆ’1)+vtβˆ’1βˆ’βˆ‡F^x​(ut)βˆ’βˆ‡f^ℐ,x​(utβ€²)+βˆ‡f^ℐ,x​(utβˆ’1β€²)βˆ’vtβˆ’1β€²+βˆ‡F^x​(utβ€²)\displaystyle=\nabla\hat{f}_{\mathcal{I},x}(u_{t})-\nabla\hat{f}_{\mathcal{I},x}(u_{t-1})+v_{t-1}-\nabla\hat{F}_{x}(u_{t})-\nabla\hat{f}_{\mathcal{I},x}(u_{t}^{\prime})+\nabla\hat{f}_{\mathcal{I},x}(u_{t-1}^{\prime})-v_{t-1}^{\prime}+\nabla\hat{F}_{x}(u_{t}^{\prime})
=βˆ‡f^ℐ,x(ut)βˆ’βˆ‡f^ℐ,x(utβˆ’1)βˆ’βˆ‡F^x(ut)+βˆ‡F^x(utβˆ’1)βˆ’(βˆ‡f^ℐ,x(utβ€²)βˆ’βˆ‡f^ℐ,x(utβˆ’1β€²)\displaystyle=\nabla\hat{f}_{\mathcal{I},x}(u_{t})-\nabla\hat{f}_{\mathcal{I},x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t})+\nabla\hat{F}_{x}(u_{t-1})-\big{(}\nabla\hat{f}_{\mathcal{I},x}(u_{t}^{\prime})-\nabla\hat{f}_{\mathcal{I},x}(u_{t-1}^{\prime})
βˆ’βˆ‡F^x(utβ€²)+βˆ‡F^x(utβˆ’1β€²))+vtβˆ’1βˆ’βˆ‡F^x(utβˆ’1)βˆ’vtβˆ’1β€²+βˆ‡F^x(utβˆ’1β€²)\displaystyle\quad-\nabla\hat{F}_{x}(u_{t}^{\prime})+\nabla\hat{F}_{x}(u_{t-1}^{\prime})\big{)}+v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})-v_{t-1}^{\prime}+\nabla\hat{F}_{x}(u_{t-1}^{\prime})
=zt+y^tβˆ’1,\displaystyle=z_{t}+\hat{y}_{t-1},

where we denote zt:=βˆ‡f^ℐ,x​(ut)βˆ’βˆ‡f^ℐ,x​(utβˆ’1)βˆ’βˆ‡F^x​(ut)+βˆ‡F^x​(utβˆ’1)βˆ’(βˆ‡f^ℐ,x​(utβ€²)βˆ’βˆ‡f^ℐ,x​(utβˆ’1β€²)βˆ’βˆ‡F^x​(utβ€²)+βˆ‡F^x​(utβˆ’1β€²))=y^tβˆ’y^tβˆ’1z_{t}:=\nabla\hat{f}_{\mathcal{I},x}(u_{t})-\nabla\hat{f}_{\mathcal{I},x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t})+\nabla\hat{F}_{x}(u_{t-1})-\big{(}\nabla\hat{f}_{\mathcal{I},x}(u_{t}^{\prime})-\nabla\hat{f}_{\mathcal{I},x}(u_{t-1}^{\prime})-\nabla\hat{F}_{x}(u_{t}^{\prime})+\nabla\hat{F}_{x}(u_{t-1}^{\prime})\big{)}=\hat{y}_{t}-\hat{y}_{t-1}. It is easy to verify that {y^t}\{\hat{y}_{t}\} is a martingale sequence and {zt}\{z_{t}\} is its difference sequence. Similar to the proof strategy as for Lemma 9, we first need to derive bound for β€–ztβ€–\|z_{t}\| by Bernstein inequality. Denote wiw_{i} as the component of ztz_{t}. That is

zt=1bβˆ‘iβˆˆβ„wi=1bβˆ‘iβˆˆβ„(\displaystyle z_{t}=\frac{1}{b}\sum_{i\in\mathcal{I}}w_{i}=\frac{1}{b}\sum_{i\in\mathcal{I}}\Big{(} βˆ‡f^i,x​(ut)βˆ’βˆ‡f^i,x​(utβˆ’1)βˆ’βˆ‡F^x​(ut)+βˆ‡F^x​(utβˆ’1)\displaystyle\nabla\hat{f}_{i,x}(u_{t})-\nabla\hat{f}_{i,x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t})+\nabla\hat{F}_{x}(u_{t-1})
βˆ’(βˆ‡f^i,x(utβ€²)βˆ’βˆ‡f^i,x(utβˆ’1β€²)βˆ’βˆ‡F^x(utβ€²)+βˆ‡F^x(utβˆ’1β€²))).\displaystyle-\big{(}\nabla\hat{f}_{i,x}(u_{t}^{\prime})-\nabla\hat{f}_{i,x}(u_{t-1}^{\prime})-\nabla\hat{F}_{x}(u_{t}^{\prime})+\nabla\hat{F}_{x}(u_{t-1}^{\prime})\big{)}\Big{)}.

Then β€–wiβ€–\|w_{i}\| is bounded as

β€–wiβ€–\displaystyle\|w_{i}\|
=βˆ₯βˆ‡f^i,x(ut)βˆ’βˆ‡f^i,x(utβˆ’1)βˆ’βˆ‡F^x(ut)+βˆ‡F^x(utβˆ’1)βˆ’βˆ‡f^i,x(utβ€²)+βˆ‡f^i,x(utβˆ’1β€²)\displaystyle=\|\nabla\hat{f}_{i,x}(u_{t})-\nabla\hat{f}_{i,x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t})+\nabla\hat{F}_{x}(u_{t-1})-\nabla\hat{f}_{i,x}(u_{t}^{\prime})+\nabla\hat{f}_{i,x}(u_{t-1}^{\prime})
+βˆ‡F^x(utβ€²)βˆ’βˆ‡F^x(utβˆ’1β€²)βˆ₯\displaystyle+\nabla\hat{F}_{x}(u_{t}^{\prime})-\nabla\hat{F}_{x}(u_{t-1}^{\prime})\|
=βˆ₯(βˆ‡f^i,x(ut)βˆ’βˆ‡f^i,x(utβ€²))βˆ’(βˆ‡f^i,x(utβˆ’1)βˆ’βˆ‡f^i,x(utβˆ’1β€²))βˆ’(βˆ‡F^x(ut)βˆ’βˆ‡F^x(utβ€²))\displaystyle=\|(\nabla\hat{f}_{i,x}(u_{t})-\nabla\hat{f}_{i,x}(u_{t}^{\prime}))-(\nabla\hat{f}_{i,x}(u_{t-1})-\nabla\hat{f}_{i,x}(u_{t-1}^{\prime}))-(\nabla\hat{F}_{x}(u_{t})-\nabla\hat{F}_{x}(u_{t}^{\prime}))
+(βˆ‡F^x(utβˆ’1)βˆ’βˆ‡F^x(utβˆ’1β€²))βˆ₯\displaystyle+(\nabla\hat{F}_{x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t-1}^{\prime}))\|
=βˆ₯∫01βˆ‡2f^i,x(utβ€²+ΞΈ(utβˆ’utβ€²))[utβˆ’utβ€²]dΞΈβˆ’βˆ«01βˆ‡2f^i,x(utβˆ’1β€²+ΞΈ(utβˆ’1βˆ’utβˆ’1β€²))[utβˆ’1βˆ’utβˆ’1β€²]dΞΈ\displaystyle=\|\int_{0}^{1}\nabla^{2}\hat{f}_{i,x}(u_{t}^{\prime}+\theta(u_{t}-u_{t}^{\prime}))[u_{t}-u_{t}^{\prime}]d\theta-\int_{0}^{1}\nabla^{2}\hat{f}_{i,x}(u_{t-1}^{\prime}+\theta(u_{t-1}-u_{t-1}^{\prime}))[u_{t-1}-u_{t-1}^{\prime}]d\theta
βˆ’βˆ«01βˆ‡2F^x(utβ€²+ΞΈ(utβˆ’utβ€²))[utβˆ’utβ€²]dΞΈ+∫01βˆ‡2F^x(utβˆ’1β€²+ΞΈ(utβˆ’1βˆ’utβˆ’1β€²))[utβˆ’1βˆ’utβˆ’1β€²]dΞΈβˆ₯\displaystyle-\int_{0}^{1}\nabla^{2}\hat{F}_{x}(u_{t}^{\prime}+\theta(u_{t}-u_{t}^{\prime}))[u_{t}-u_{t}^{\prime}]d\theta+\int_{0}^{1}\nabla^{2}\hat{F}_{x}(u_{t-1}^{\prime}+\theta(u_{t-1}-u_{t-1}^{\prime}))[u_{t-1}-u_{t-1}^{\prime}]d\theta\|
=β€–(Ξ”ti+β„‹i)​u^tβˆ’(Ξ”tβˆ’1i+β„‹i)​u^tβˆ’1βˆ’(Ξ”t+β„‹)​u^t+(Ξ”tβˆ’1+β„‹)​u^tβˆ’1β€–\displaystyle=\|(\Delta_{t}^{i}+\mathcal{H}_{i})\hat{u}_{t}-(\Delta_{t-1}^{i}+\mathcal{H}_{i})\hat{u}_{t-1}-(\Delta_{t}+\mathcal{H})\hat{u}_{t}+(\Delta_{t-1}+\mathcal{H})\hat{u}_{t-1}\|
≀‖ℋi​(u^tβˆ’u^tβˆ’1)β€–+‖ℋ​(u^tβˆ’u^tβˆ’1)β€–+β€–Ξ”ti​u^tβ€–+β€–Ξ”t​u^tβ€–+β€–Ξ”tβˆ’1i​u^tβˆ’1β€–+β€–Ξ”tβˆ’1​u^tβˆ’1β€–\displaystyle\leq\|\mathcal{H}_{i}(\hat{u}_{t}-\hat{u}_{t-1})\|+\|\mathcal{H}(\hat{u}_{t}-\hat{u}_{t-1})\|+\|\Delta_{t}^{i}\hat{u}_{t}\|+\|\Delta_{t}\hat{u}_{t}\|+\|\Delta_{t-1}^{i}\hat{u}_{t-1}\|+\|\Delta_{t-1}\hat{u}_{t-1}\|
≀2Lβˆ₯u^tβˆ’u^tβˆ’1βˆ₯+2Οπ’Ÿtβˆ₯u^tβˆ₯+2Οπ’Ÿtβˆ’1βˆ₯u^tβˆ’1βˆ₯=:R,\displaystyle\leq 2L\|\hat{u}_{t}-\hat{u}_{t-1}\|+2\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+2\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|=:R, (28)

where u^t:=utβˆ’utβ€²\hat{u}_{t}:=u_{t}-u_{t}^{\prime}, β„‹i:=βˆ‡2f^i,x​(0)\mathcal{H}_{i}:=\nabla^{2}\hat{f}_{i,x}(0), Ξ”ti:=∫01(βˆ‡2f^i​(utβ€²+θ​(utβˆ’utβ€²))βˆ’β„‹i)​𝑑θ\Delta_{t}^{i}:=\int_{0}^{1}(\nabla^{2}\hat{f}_{i}(u_{t}^{\prime}+\theta(u_{t}-u_{t}^{\prime}))-\mathcal{H}_{i})d\theta, π’Ÿt:=max⁑{β€–utβ€–,β€–utβ€²β€–}\mathscr{D}_{t}:=\max\{\|u_{t}\|,\|u_{t}^{\prime}\|\}. The second last inequality uses triangle inequality and the last inequality considers Assumption 2 and also the constraint on β€–Ξ”tβ€–\|\Delta_{t}\| (similarly β€–Ξ”tiβ€–\|\Delta_{t}^{i}\|) in (22). The total variance Οƒ2\sigma^{2} is derived as

βˆ‘iβˆˆβ„π”Όβ€‹β€–wiβ€–2\displaystyle\sum_{i\in\mathcal{I}}\mathbb{E}\|w_{i}\|^{2}
=βˆ‘iβˆˆβ„π”Όβˆ₯βˆ‡f^i,x​(ut)βˆ’βˆ‡f^i,x​(utβˆ’1)βˆ’βˆ‡F^x​(ut)+βˆ‡F^x​(utβˆ’1)βˆ’βˆ‡f^i,x​(utβ€²)+βˆ‡f^i,x​(utβˆ’1β€²)\displaystyle=\sum_{i\in\mathcal{I}}\mathbb{E}\|\nabla\hat{f}_{i,x}(u_{t})-\nabla\hat{f}_{i,x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t})+\nabla\hat{F}_{x}(u_{t-1})-\nabla\hat{f}_{i,x}(u_{t}^{\prime})+\nabla\hat{f}_{i,x}(u_{t-1}^{\prime})
+βˆ‡F^x(utβ€²)βˆ’βˆ‡F^x(utβˆ’1β€²)βˆ₯2\displaystyle+\nabla\hat{F}_{x}(u_{t}^{\prime})-\nabla\hat{F}_{x}(u_{t-1}^{\prime})\|^{2}
β‰€βˆ‘iβˆˆβ„π”Όβ€‹β€–βˆ‡f^i,x​(ut)βˆ’βˆ‡f^i,x​(utβ€²)βˆ’βˆ‡f^i,x​(utβˆ’1)+βˆ‡f^i,x​(utβˆ’1β€²)β€–2\displaystyle\leq\sum_{i\in\mathcal{I}}\mathbb{E}\|\nabla\hat{f}_{i,x}(u_{t})-\nabla\hat{f}_{i,x}(u_{t}^{\prime})-\nabla\hat{f}_{i,x}(u_{t-1})+\nabla\hat{f}_{i,x}(u_{t-1}^{\prime})\|^{2}
=βˆ‘iβˆˆβ„π”Όβ€‹β€–(Ξ”ti+β„‹i)​u^tβˆ’(Ξ”tβˆ’1i+β„‹i)​u^tβˆ’1β€–2\displaystyle=\sum_{i\in\mathcal{I}}\mathbb{E}\|(\Delta_{t}^{i}+\mathcal{H}_{i})\hat{u}_{t}-(\Delta_{t-1}^{i}+\mathcal{H}_{i})\hat{u}_{t-1}\|^{2}
≀b(Lβˆ₯u^tβˆ’u^tβˆ’1βˆ₯+Οπ’Ÿtβˆ₯u^tβˆ₯+Οπ’Ÿtβˆ’1βˆ₯u^tβˆ’1βˆ₯)2=:Οƒ2\displaystyle\leq b(L\|\hat{u}_{t}-\hat{u}_{t-1}\|+\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|)^{2}=:\sigma^{2}

where the first inequality uses 𝔼​‖xβˆ’π”Όβ€‹[x]β€–2≀𝔼​‖xβ€–2\mathbb{E}\|x-\mathbb{E}[x]\|^{2}\leq\mathbb{E}\|x\|^{2} and the last inequality follows similar logic as (28). Hence by vector Bernstein Lemma 6, we obtain

Pr​{β€–ztβ€–β‰₯Ο‚/b}=Pr​{β€–βˆ‘iβˆˆβ„wiβ€–β‰₯Ο‚}\displaystyle\text{Pr}\{\|z_{t}\|\geq\varsigma/b\}=\text{Pr}\{\|\sum_{i\in\mathcal{I}}w_{i}\|\geq\varsigma\} ≀(d+1)​exp⁑(βˆ’Ο‚2/2Οƒ2+R​ς/3)\displaystyle\leq(d+1)\exp\big{(}\frac{-\varsigma^{2}/2}{\sigma^{2}+R\varsigma/3}\big{)}
≀΢t,\displaystyle\leq\zeta_{t},

where we choose Ο‚=π’ͺ​(log⁑(d/ΞΆt)​b)​(L​‖u^tβˆ’u^tβˆ’1β€–+Οβ€‹π’Ÿt​‖u^tβ€–+Οβ€‹π’Ÿtβˆ’1​‖u^tβˆ’1β€–)\varsigma=\mathcal{O}\big{(}\log(d/\zeta_{t})\sqrt{b}\big{)}(L\|\hat{u}_{t}-\hat{u}_{t-1}\|+\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|) based on similar logic as in Lemma 9. Therefore, we can bound β€–ztβ€–\|z_{t}\| with high probability. That is, with probability 1βˆ’ΞΆt1-\zeta_{t},

βˆ₯ztβˆ₯≀π’ͺ(log⁑(d/ΞΆt)b)(Lβˆ₯u^tβˆ’u^tβˆ’1βˆ₯+Οπ’Ÿtβˆ₯u^tβˆ₯+Οπ’Ÿtβˆ’1βˆ₯u^tβˆ’1βˆ₯)=:Ο‚t.\displaystyle\|z_{t}\|\leq\mathcal{O}\Big{(}{\frac{\log(d/\zeta_{t})}{\sqrt{b}}}\Big{)}(L\|\hat{u}_{t}-\hat{u}_{t-1}\|+\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|)=:\varsigma_{t}.

We can now obtain a bound on $\|\hat{y}_{t}\|$. We only need to consider a single epoch because $\hat{y}_{sm}=0$, due to the full gradient evaluation at the start of each epoch. Similar to (10), by a union bound with $\zeta_{t}:=\zeta/m$, we know that $\|z_{t}\|\leq\varsigma_{t}$ holds for all $t=sm+1,...,(s+1)m$ with probability at least $1-\zeta$. Then applying the Azuma-Hoeffding inequality (Lemma 7) to the martingale sequence $\{\hat{y}_{t}\}$ yields

Pr​{β€–y^tβ€–β‰₯Ξ²}≀(d+1)​exp⁑(βˆ’Ξ²28β€‹βˆ‘j=s​m+1tΟ‚j2)+΢≀2​΢,βˆ€t∈[s​m+1,s​m+m],\displaystyle\text{Pr}\{\|\hat{y}_{t}\|\geq\beta\}\leq(d+1)\exp\Big{(}\frac{-\beta^{2}}{8\sum_{j=sm+1}^{t}\varsigma_{j}^{2}}\Big{)}+\zeta\leq 2\zeta,\quad\forall\,t\in[sm+1,sm+m],

where we choose

Ξ²=Ξ²t\displaystyle\beta=\beta_{t} =8​log⁑((d+1)/ΞΆ)β€‹βˆ‘j=s​m+1tΟ‚j2\displaystyle=\sqrt{8\log((d+1)/\zeta)\sum_{j=sm+1}^{t}\varsigma_{j}^{2}}
=π’ͺ​(log32⁑(d/ΞΆ)b)β€‹βˆ‘j=s​m+1t(L​‖u^tβˆ’u^tβˆ’1β€–+Οβ€‹π’Ÿt​‖u^tβ€–+Οβ€‹π’Ÿtβˆ’1​‖u^tβˆ’1β€–)2.\displaystyle=\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d/\zeta)}{\sqrt{b}}\Big{)}\sqrt{\sum_{j=sm+1}^{t}(L\|\hat{u}_{t}-\hat{u}_{t-1}\|+\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|)^{2}}.

By a union bound, Pr​{⋃t=1𝒯(β€–y^tβ€–β‰₯Ξ²t)}≀2​𝒯​΢\text{Pr}\{\bigcup\limits_{t=1}^{\mathscr{T}}\big{(}\|\hat{y}_{t}\|\geq\beta_{t}\big{)}\}\leq 2\mathscr{T}\zeta and therefore for all t≀𝒯t\leq\mathscr{T},

β€–y^t‖≀βtΒ with probability ​1βˆ’2​𝒯​΢.\|\hat{y}_{t}\|\leq\beta_{t}\quad\text{ with probability }1-2\mathscr{T}\zeta.

Note that by simply replacing $\zeta$ with $\zeta/(2\mathscr{T})$, we obtain with probability $1-\zeta$, for all $t\leq\mathscr{T}$,

β€–y^t‖≀π’ͺ​(log32⁑(d​𝒯/ΞΆ)b)β€‹βˆ‘j=s​m+1t(L​‖u^tβˆ’u^tβˆ’1β€–+Οβ€‹π’Ÿt​‖u^tβ€–+Οβ€‹π’Ÿtβˆ’1​‖u^tβˆ’1β€–)2.\|\hat{y}_{t}\|\leq\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d\mathscr{T}/\zeta)}{\sqrt{b}}\Big{)}\sqrt{\sum_{j=sm+1}^{t}(L\|\hat{u}_{t}-\hat{u}_{t-1}\|+\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|)^{2}}. (29)

What remains to be shown is that the right hand side of (29) is bounded. From (20), we first have

β€–u^tβˆ’u^tβˆ’1β€–=‖η​((β„‹+Ξ”tβˆ’1)​u^tβˆ’1+y^tβˆ’1)‖≀‖η​ℋ​u^tβˆ’1β€–+‖η​(Ξ”tβˆ’1​u^tβˆ’1+y^tβˆ’1)β€–.\|\hat{u}_{t}-\hat{u}_{t-1}\|=\|\eta\big{(}(\mathcal{H}+\Delta_{t-1})\hat{u}_{t-1}+\hat{y}_{t-1}\big{)}\|\leq\|\eta\mathcal{H}\hat{u}_{t-1}\|+\|\eta(\Delta_{t-1}\hat{u}_{t-1}+\hat{y}_{t-1})\|. (30)

The first term ‖η​ℋ​u^tβˆ’1β€–\|\eta\mathcal{H}\hat{u}_{t-1}\| can be bounded as follows. First note that the Hessian satisfies βˆ’Ξ³β‰€Ξ»β€‹(β„‹)≀L-\gamma\leq\lambda(\mathcal{H})\leq L by construction, where λ​(β„‹)\lambda(\mathcal{H}) represents eigenvalues of β„‹\mathcal{H}. Although Lβ‰₯Ξ΄L\geq\delta, it is difficult to compare LL and Ξ³\gamma since Ξ³β‰₯Ξ΄\gamma\geq\delta. Thus, we bound the term by projecting it to the following two subspaces:

  • $S_{-}$: the subspace spanned by the eigenvectors of $\mathcal{H}$ whose eigenvalues lie in $[-\gamma,0]$.

  • $S_{+}$: the subspace spanned by the eigenvectors of $\mathcal{H}$ whose eigenvalues lie in $(0,L]$.

That is, for the first case

β€–ProjSβˆ’β€‹(η​ℋ​u^tβˆ’1)‖≀η​‖ProjSβˆ’β€‹(β„‹)‖​‖u^tβˆ’1‖≀η​γ​32​(1+η​γ)tβˆ’1​r0,\displaystyle\|\text{Proj}_{S_{-}}\big{(}\eta\mathcal{H}\hat{u}_{t-1}\big{)}\|\leq\eta\|\text{Proj}_{S_{-}}(\mathcal{H})\|\|\hat{u}_{t-1}\|\leq\eta\gamma\frac{3}{2}(1+\eta\gamma)^{t-1}r_{0}, (31)

where we use the inductive bound (25) on $\|\hat{u}_{t-1}\|$ and the fact that $\|\text{Proj}_{S_{-}}(\mathcal{H})\|\leq\gamma$. For the second case, we have from (21),

β€–ProjS+​(η​ℋ​u^tβˆ’1)β€–\displaystyle\|\text{Proj}_{S_{+}}(\eta\mathcal{H}\hat{u}_{t-1})\| =β€–ProjS+​(η​ℋ​(1+η​γ)tβˆ’1​r0​e1βˆ’Ξ·β€‹βˆ‘j=0tβˆ’2η​ℋ​(Iβˆ’Ξ·β€‹β„‹)tβˆ’2βˆ’j​(Ξ”j​u^j+y^j))β€–\displaystyle=\|\text{Proj}_{S_{+}}\big{(}\eta\mathcal{H}(1+\eta\gamma)^{t-1}r_{0}e_{1}-\eta\sum_{j=0}^{t-2}\eta\mathcal{H}(I-\eta\mathcal{H})^{t-2-j}(\Delta_{j}\hat{u}_{j}+\hat{y}_{j})\big{)}\|
≀η​γ​(1+η​γ)tβˆ’1​r0+Ξ·β€‹β€–βˆ‘j=0tβˆ’2ProjS+​(η​ℋ​(Iβˆ’Ξ·β€‹β„‹)tβˆ’2βˆ’j)‖​max0≀j≀tβˆ’2⁑‖Δj​u^j+y^jβ€–\displaystyle\leq\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\eta\|\sum_{j=0}^{t-2}\text{Proj}_{S_{+}}\big{(}\eta\mathcal{H}(I-\eta\mathcal{H})^{t-2-j}\big{)}\|\max_{0\leq j\leq t-2}\|\Delta_{j}\hat{u}_{j}+\hat{y}_{j}\|
≀η​γ​(1+η​γ)tβˆ’1​r0+βˆ‘j=0tβˆ’2Ξ·tβˆ’1βˆ’j​max0≀j≀tβˆ’2⁑‖Δj​u^j+y^jβ€–\displaystyle\leq\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\sum_{j=0}^{t-2}\frac{\eta}{t-1-j}\max_{0\leq j\leq t-2}\|\Delta_{j}\hat{u}_{j}+\hat{y}_{j}\|
≀η​γ​(1+η​γ)tβˆ’1​r0+η​log⁑(t)​(max0≀j≀tβˆ’2⁑(β€–Ξ”j‖​‖u^jβ€–+β€–y^jβ€–))\displaystyle\leq\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\eta\log(t)\big{(}\max_{0\leq j\leq t-2}(\|\Delta_{j}\|\|\hat{u}_{j}\|+\|\hat{y}_{j}\|)\big{)}
≀η​γ​(1+η​γ)tβˆ’1​r0+η​log⁑(t)​(ρ​(Ξ΄c2​ρ+r)​32​(1+η​γ)tβˆ’2​r0+γ​L​(1+η​γ)tβˆ’2​r0)\displaystyle\leq\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\eta\log(t)\big{(}\rho(\frac{\delta}{c_{2}\rho}+r)\frac{3}{2}(1+\eta\gamma)^{t-2}r_{0}+\gamma L(1+\eta\gamma)^{t-2}r_{0}\big{)} (32)
≀η​γ​(1+η​γ)tβˆ’1​r0+η​log⁑(t)​(3​δc2​(1+η​γ)tβˆ’2​r0+γ​L​(1+η​γ)tβˆ’2​r0)\displaystyle\leq\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\eta\log(t)\big{(}\frac{3\delta}{c_{2}}(1+\eta\gamma)^{t-2}r_{0}+\gamma L(1+\eta\gamma)^{t-2}r_{0}\big{)}
≀η​γ​(1+η​γ)tβˆ’1​r0+η​log⁑(𝒯)​(3​δc2​(1+η​γ)tβˆ’1​r0+γ​L​(1+η​γ)tβˆ’1​r0)\displaystyle\leq\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\eta\log(\mathscr{T})\big{(}\frac{3\delta}{c_{2}}(1+\eta\gamma)^{t-1}r_{0}+\gamma L(1+\eta\gamma)^{t-1}r_{0}\big{)} (33)

The second inequality is due to $\lambda(1-\lambda)^{t-1}\leq\frac{1}{t}$ for $0<\lambda\leq 1$. That is, given the choice $\eta\leq\widetilde{\mathcal{O}}(\frac{1}{L})$ and $\boldsymbol{0}\preceq\text{Proj}_{S_{+}}(\mathcal{H})\preceq LI$, we obtain $0\leq\lambda(\text{Proj}_{S_{+}}(\eta\mathcal{H}))\leq 1$. The third inequality uses the standard logarithmic bound on partial sums of the harmonic series. Inequality (32) applies (i) the bound on $\|\Delta_{t}\|$ as in (22), (ii) the inductive bound on $\|\hat{u}_{j}\|$ as in (25), and (iii) the inductive assumption on $\|\hat{y}_{j}\|$. The second last inequality is by the choice $r\leq\frac{\delta}{c_{2}\rho}$. It is easy to verify that the right-hand side of (33) is larger than the right-hand side of (31). Hence combining the two cases gives

‖η​ℋ​u^tβˆ’1‖≀η​γ​(1+η​γ)tβˆ’1​r0+η​log⁑(𝒯)​(3​δc2​(1+η​γ)tβˆ’1​r0+γ​L​(1+η​γ)tβˆ’1​r0).\|\eta\mathcal{H}\hat{u}_{t-1}\|\leq\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\eta\log(\mathscr{T})\big{(}\frac{3\delta}{c_{2}}(1+\eta\gamma)^{t-1}r_{0}+\gamma L(1+\eta\gamma)^{t-1}r_{0}\big{)}. (34)
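For completeness, the two elementary facts invoked above can be checked directly (a routine verification, stated here for clarity): for 0<\lambda\leq 1, the map \lambda\mapsto\lambda(1-\lambda)^{t-1} is maximized at \lambda=\frac{1}{t}, so that \lambda(1-\lambda)^{t-1}\leq\frac{1}{t}(1-\frac{1}{t})^{t-1}\leq\frac{1}{t}; and the harmonic-series bound reads \sum_{j=0}^{t-2}\frac{1}{t-1-j}=\sum_{k=1}^{t-1}\frac{1}{k}\leq 1+\log(t-1)=\mathcal{O}(\log(t)).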

The second term in (30) is bounded as

‖η​(Ξ”tβˆ’1​u^tβˆ’1+y^tβˆ’1)β€–\displaystyle\|\eta(\Delta_{t-1}\hat{u}_{t-1}+\hat{y}_{t-1})\| ≀η​‖Δtβˆ’1‖​‖u^tβˆ’1β€–+η​‖y^tβˆ’1β€–\displaystyle\leq\eta\|\Delta_{t-1}\|\|\hat{u}_{t-1}\|+\eta\|\hat{y}_{t-1}\|
≀η​ρ​(Ξ΄c2​ρ+r)​32​(1+η​γ)tβˆ’1​r0+η​γ​L​(1+η​γ)tβˆ’1​r0\displaystyle\leq\eta\rho\big{(}\frac{\delta}{c_{2}\rho}+r\big{)}\frac{3}{2}(1+\eta\gamma)^{t-1}r_{0}+\eta\gamma L(1+\eta\gamma)^{t-1}r_{0}
≀3​η​δc2​(1+η​γ)tβˆ’1​r0+η​γ​L​(1+η​γ)tβˆ’1​r0,\displaystyle\leq\frac{3\eta\delta}{c_{2}}(1+\eta\gamma)^{t-1}r_{0}+\eta\gamma L(1+\eta\gamma)^{t-1}r_{0}, (35)

where the second inequality is derived similarly to (32) and the last inequality is again due to the assumption that r\leq\frac{\delta}{c_{2}\rho}. Combining (34) and (35) gives the bound

L​‖u^tβˆ’u^tβˆ’1β€–\displaystyle L\|\hat{u}_{t}-\hat{u}_{t-1}\| ≀L(Ξ·Ξ³(1+Ξ·Ξ³)tβˆ’1r0+Ξ·log(𝒯)(3​δc2(1+Ξ·Ξ³)tβˆ’1r0+Ξ³L(1+Ξ·Ξ³)tβˆ’1r0)+\displaystyle\leq L\Big{(}\eta\gamma(1+\eta\gamma)^{t-1}r_{0}+\eta\log(\mathscr{T})\big{(}\frac{3\delta}{c_{2}}(1+\eta\gamma)^{t-1}r_{0}+\gamma L(1+\eta\gamma)^{t-1}r_{0}\big{)}+
+3​η​δc2(1+Ξ·Ξ³)tβˆ’1r0+Ξ·Ξ³L(1+Ξ·Ξ³)tβˆ’1r0)\displaystyle+\frac{3\eta\delta}{c_{2}}(1+\eta\gamma)^{t-1}r_{0}+\eta\gamma L(1+\eta\gamma)^{t-1}r_{0}\Big{)}
≀(Ξ·+3​η​log⁑(𝒯)c2+η​L​log⁑(𝒯)+3​ηc2+η​L)​γ​L​(1+η​γ)t​r0\displaystyle\leq\Big{(}{\eta}+\frac{3\eta\log(\mathscr{T})}{c_{2}}+\eta L\log(\mathscr{T})+\frac{3\eta}{c_{2}}+\eta L\Big{)}\gamma L(1+\eta\gamma)^{t}r_{0}
≀(3​η​log⁑(𝒯)c2+2​η​L​log⁑(𝒯))​γ​L​(1+η​γ)t​r0,\displaystyle\leq\Big{(}\frac{3\eta\log(\mathscr{T})}{c_{2}}+2\eta L\log(\mathscr{T})\Big{)}\gamma L(1+\eta\gamma)^{t}r_{0}, (36)

where the second inequality uses \delta\leq\gamma and 1+\eta\gamma>1. The last inequality follows by requiring 1+\frac{3}{c_{2}}+L\leq L\log(\mathscr{T}+a) for some a\in\mathbb{N}^{+}. This holds because 1/c_{2} is a sufficiently small constant while \log(\mathscr{T}+a) can be made sufficiently large. For simplicity, we still write \log(\mathscr{T}), since this does not affect the result.

Similarly, we can bound Οβ€‹π’Ÿt​‖u^tβ€–+Οβ€‹π’Ÿtβˆ’1​‖u^tβˆ’1β€–\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\| as

Οβ€‹π’Ÿt​‖u^tβ€–+Οβ€‹π’Ÿtβˆ’1​‖u^tβˆ’1β€–\displaystyle\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\| ≀ρ​(Ξ΄c2​ρ+r)​(32​(1+η​γ)t​r0+32​(1+η​γ)tβˆ’1​r0)\displaystyle\leq\rho\big{(}\frac{\delta}{c_{2}\rho}+r\big{)}\Big{(}\frac{3}{2}(1+\eta\gamma)^{t}r_{0}+\frac{3}{2}(1+\eta\gamma)^{t-1}r_{0}\Big{)}
≀6​δc2​(1+η​γ)t​r0,\displaystyle\leq\frac{6\delta}{c_{2}}(1+\eta\gamma)^{t}r_{0}, (37)

where the first inequality applies the second condition for j=t​ and ​tβˆ’1j=t\text{ and }t-1 and the second inequality again uses r≀δc2​ρr\leq\frac{\delta}{c_{2}\rho}. By combining results in (36) and (37), we have

β€–y^tβ€–\displaystyle\|\hat{y}_{t}\| ≀π’ͺ​(log32⁑(d​𝒯/ΞΆ)b)​m​((3​η​log⁑(𝒯)c2+2​η​L​log⁑(𝒯))​γ​L​(1+η​γ)t​r0+6​δc2​(1+η​γ)t​r0)\displaystyle\leq\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d\mathscr{T}/\zeta)}{\sqrt{b}}\Big{)}\sqrt{m}\Big{(}\big{(}\frac{3\eta\log(\mathscr{T})}{c_{2}}+2\eta L\log(\mathscr{T})\big{)}\gamma L(1+\eta\gamma)^{t}r_{0}+\frac{6\delta}{c_{2}}(1+\eta\gamma)^{t}r_{0}\Big{)}
≀C1​(3​η​log⁑(𝒯)c2+2​η​L​log⁑(𝒯)+6c2​L)​γ​L​(1+η​γ)t​r0\displaystyle\leq C_{1}\Big{(}\frac{3\eta\log(\mathscr{T})}{c_{2}}+2\eta L\log(\mathscr{T})+\frac{6}{c_{2}L}\Big{)}\gamma L(1+\eta\gamma)^{t}r_{0}
≀γ​L​(1+η​γ)t​r0Β with probability ​1βˆ’ΞΆ,\displaystyle\leq\gamma L(1+\eta\gamma)^{t}r_{0}\qquad\qquad\qquad\text{ with probability }1-\zeta,

where we denote C_{1}:=\mathcal{O}(\log^{\frac{3}{2}}(d\mathscr{T}/\zeta)) (note that c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(dm/\vartheta)) in Lemma 11). The second inequality uses b\geq m and \delta\leq\gamma. The last inequality follows from the parameter settings \eta\leq\frac{1}{8L\log(\mathscr{T})C_{1}}, c_{2}\geq\frac{12C_{1}}{L} and C_{1}\geq 1. This completes the proof of the bound on \|\hat{y}_{t}\|.
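For completeness, one can verify that C_{1}\big(\frac{3\eta\log(\mathscr{T})}{c_{2}}+2\eta L\log(\mathscr{T})+\frac{6}{c_{2}L}\big)\leq 1 under these parameter settings (a routine check): \eta\leq\frac{1}{8L\log(\mathscr{T})C_{1}} gives 2\eta L\log(\mathscr{T})C_{1}\leq\frac{1}{4}; it also gives \frac{3\eta\log(\mathscr{T})C_{1}}{c_{2}}\leq\frac{3}{8Lc_{2}}\leq\frac{3}{8L}\cdot\frac{L}{12C_{1}}=\frac{1}{32C_{1}}\leq\frac{1}{32} using c_{2}\geq\frac{12C_{1}}{L}; and c_{2}\geq\frac{12C_{1}}{L} alone gives \frac{6C_{1}}{c_{2}L}\leq\frac{1}{2}. The three contributions sum to at most \frac{1}{32}+\frac{1}{4}+\frac{1}{2}<1, which justifies the last inequality.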

Finally, we can proceed to raise a contradiction. First given the bound on β€–q​(t)β€–\|q(t)\|, we have

β€–u^𝒯‖=β€–p​(𝒯)βˆ’q​(𝒯)β€–β‰₯β€–p​(𝒯)β€–βˆ’β€–q​(𝒯)β€–β‰₯β€–p​(𝒯)β€–/2\displaystyle\|\hat{u}_{\mathscr{T}}\|=\|p(\mathscr{T})-q(\mathscr{T})\|\geq\|p(\mathscr{T})\|-\|q(\mathscr{T})\|\geq\|p(\mathscr{T})\|/2 =12​(1+η​γ)𝒯​r0\displaystyle=\frac{1}{2}(1+\eta\gamma)^{\mathscr{T}}r_{0}
=12​(1+η​γ)𝒯​ν​rd>4​δc2​ρ,\displaystyle=\frac{1}{2}(1+\eta\gamma)^{\mathscr{T}}\frac{\nu r}{\sqrt{d}}>\frac{4\delta}{c_{2}\rho},

where the last inequality follows by the choice that 𝒯=2​logα⁑(8​δ​dc2​ρ​ν​r)/Ξ³\mathscr{T}={2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})}/\gamma, where Ξ±β‰₯1\alpha\geq 1 is chosen such that logα⁑(1+η​γ)>Ξ³\log_{\alpha}(1+\eta\gamma)>\gamma. This requirement is reasonable since when Ξ±β†’1\alpha\xrightarrow[]{}1, logα⁑(1+η​γ)\log_{\alpha}(1+\eta\gamma) increases while Ξ³\gamma is a constant. In this case,

logα⁑((1+η​γ)𝒯)>Ξ³β‹…2​logα⁑(8​δ​dc2​ρ​ν​r)/Ξ³>logα⁑(8​δ​dc2​ρ​ν​r).\log_{\alpha}\big{(}(1+\eta\gamma)^{\mathscr{T}}\big{)}>\gamma\cdot 2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})/\gamma>\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r}).

However, we have for t\leq\mathscr{T}, \|\hat{u}_{t}\|=\|u_{t}-u_{t}^{\prime}\|\leq\|u_{t}\|+\|u_{t}^{\prime}\|\leq 2(r+\frac{\delta}{c_{2}\rho})\leq\frac{4\delta}{c_{2}\rho}, which gives a contradiction. Hence there exists t\leq\mathscr{T}={2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})}/\gamma such that \max\{\|u_{t}-u_{0}\|,\|u_{t}^{\prime}-u_{0}\|\}\geq\frac{\delta}{c_{2}\rho}. Since \gamma changes throughout the optimization process, we may instead choose \mathscr{T}={2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})}/\delta; because \delta\leq\gamma, the result still holds. Β 
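As an informal illustration of the mechanism behind Lemma 12 (not part of the proof), the following sketch simulates the two coupled sequences for a purely quadratic pullback with Hessian \mathcal{H}; in this noiseless case \Delta_{t}=\hat{y}_{t}=0 and the distance grows exactly as (1+\eta\gamma)^{t}r_{0}. All numerical values below are arbitrary illustrative choices.

import numpy as np

# Coupled gradient-descent sequences on a quadratic saddle F(u) = 0.5 * u^T H u,
# where lambda_min(H) = -gamma. The starting points differ by r0 along the escape
# direction e1, so u_t - u_t' = (I - eta*H)^t (r0*e1) = (1 + eta*gamma)^t * r0 * e1.
rng = np.random.default_rng(0)
d, gamma, L = 10, 0.5, 1.0
H = np.diag(np.concatenate(([-gamma], rng.uniform(0.1, L, d - 1))))
eta, r0, T = 0.1, 1e-6, 60

e1 = np.zeros(d)
e1[0] = 1.0
u = rng.normal(size=d) * 1e-3        # arbitrary common starting point
u_prime = u - r0 * e1                # coupled copy shifted along e1

for t in range(1, T + 1):
    u = u - eta * (H @ u)            # exact gradient step on the quadratic
    u_prime = u_prime - eta * (H @ u_prime)
    if t % 20 == 0:
        ratio = np.linalg.norm(u - u_prime) / (r0 * (1 + eta * gamma) ** t)
        print(f"t={t:3d}  ||u_t - u_t'|| / ((1+eta*gamma)^t r0) = {ratio:.3f}")
# Each printed ratio equals 1 up to floating-point error, matching the exponential
# separation exploited in the proof.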

Lemma 13 (Descent around saddle points)

Let xβˆˆβ„³x\in\mathcal{M} satisfy β€–grad​F​(x)‖≀ϡ\|\emph{grad}F(x)\|\leq\epsilon and Ξ»min​(βˆ‡2F^x​(0))\lambda_{\min}(\nabla^{2}\hat{F}_{x}(0)) β‰€βˆ’Ξ΄\leq-\delta. With the same setting as in Lemma 12, the two coupled sequences satisfy, with probability 1βˆ’ΞΆ1-\zeta,

βˆƒt≀𝒯,max⁑{F^x​(u0)βˆ’F^x​(ut),F^x​(u0β€²)βˆ’F^x​(utβ€²)}β‰₯2​ℱ,\exists\,t\leq\mathscr{T},\,\max\{\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}),\hat{F}_{x}(u_{0}^{\prime})-\hat{F}_{x}(u_{t}^{\prime})\}\geq 2\mathscr{F},

where β„±=Ξ΄32​c3​ρ2\mathscr{F}=\frac{\delta^{3}}{2c_{3}\rho^{2}}, c3=8​logα⁑(8​δ​dc2​ρ​ν​r)​c22c1​Lc_{3}=\frac{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}}{c_{1}L}, and c1c_{1} and c2c_{2} are defined in Lemma 11 and Lemma 12, respectively.

ProofΒ  We will again prove this result by contradiction. Suppose

βˆ€t≀𝒯,max⁑{F^x​(u0)βˆ’F^x​(ut),F^x​(u0β€²)βˆ’F^x​(utβ€²)}≀2​ℱ.\forall\,t\leq\mathscr{T},\,\max\{\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}),\hat{F}_{x}(u_{0}^{\prime})-\hat{F}_{x}(u_{t}^{\prime})\}\leq 2\mathscr{F}. (38)

Then we first claim that both \{u_{t}\} and \{u_{t}^{\prime}\} fall within the prescribed ball \mathbb{B}_{x}(D). This is verified by contradiction. Assume, without loss of generality, that at some iteration j\leq\mathscr{T} we have \|u_{j}\|\geq D. In this case, \textsf{TSSRG}(x,u_{0},\mathscr{T}) returns R_{x}(u_{j-1}-\alpha\eta v_{j-1}) with \alpha\in(0,1) such that \|u_{j-1}-\alpha\eta v_{j-1}\|=D. Hence

D=\|u_{j-1}-\alpha\eta v_{j-1}\|\leq\|u_{j-1}-\alpha\eta v_{j-1}-u_{0}\|+\|u_{0}\|\leq\sqrt{\frac{4j}{c_{1}L}(\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{j}))}+r
≀8​𝒯​ℱc1​L+r=8​logα⁑(8​δ​dc2​ρ​ν​r)δ​c1​Lβ‹…Ξ΄3c3​ρ2+r=Ξ΄c2​ρ+r\displaystyle\leq\sqrt{\frac{8\mathscr{T}\mathscr{F}}{c_{1}L}}+r=\sqrt{\frac{{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})}}{\delta c_{1}L}\cdot\frac{\delta^{3}}{c_{3}\rho^{2}}}+r=\frac{\delta}{c_{2}\rho}+r (39)

where we recall \mathscr{F}=\frac{\delta^{3}}{2c_{3}\rho^{2}} with c_{3}=\frac{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}}{c_{1}L} and \mathscr{T}=2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})/\delta. The second inequality uses Lemma 10, where we can replace \eta with \alpha\eta for iteration j. However, by the parameter choice we have \frac{\delta}{c_{2}\rho}+r\leq D, which gives a contradiction. Therefore, under condition (38), the two coupled sequences do not escape the ball, and we can proceed similarly to (Li, 2019).

First Lemma 12 claims that, for some t≀𝒯t\leq\mathscr{T}, max⁑{β€–utβˆ’u0β€–,β€–utβ€²βˆ’u0β€–}β‰₯Ξ΄c2​ρ\max\{\|u_{t}-u_{0}\|,\|u_{t}^{\prime}-u_{0}\|\}\geq\frac{\delta}{c_{2}\rho}. Without loss of generality, suppose β€–uTβˆ’u0β€–β‰₯Ξ΄c2​ρ\|u_{T}-u_{0}\|\geq\frac{\delta}{c_{2}\rho}, for T≀𝒯T\leq\mathscr{T}. Then by Lemma 10, we have β€–uTβˆ’u0‖≀4​Tc1​L​(F^x​(u0)βˆ’F^x​(uT))\|u_{T}-u_{0}\|\leq\sqrt{\frac{4T}{c_{1}L}(\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{T}))}. This implies

F^x​(u0)βˆ’F^x​(uT)β‰₯c1​L4​T​‖uTβˆ’u0β€–2β‰₯c1​L​δ24​𝒯​c22​ρ2β‰₯c1​L​δ38​logα⁑(8​δ​dc2​ρ​ν​r)​c22​ρ2β‰₯Ξ΄3c3​ρ2=2​ℱ\displaystyle\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{T})\geq\frac{c_{1}L}{4T}\|u_{T}-u_{0}\|^{2}\geq\frac{c_{1}L\delta^{2}}{4\mathscr{T}c_{2}^{2}\rho^{2}}\geq\frac{c_{1}L\delta^{3}}{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}\rho^{2}}\geq\frac{\delta^{3}}{c_{3}\rho^{2}}=2\mathscr{F}

where we use the choice of c3c_{3}. This contradicts (38). Therefore the proof is complete. Β 

Without loss of generality, the result in Lemma 13 can be written as \max\{\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{\mathscr{T}}),\hat{F}_{x}(u_{0}^{\prime})-\hat{F}_{x}(u_{\mathscr{T}}^{\prime})\}\geq 2\mathscr{F}. This represents the worst-case scenario where we require \mathscr{T} iterations of TSSRG to reach a large function decrease. (We could also add a stopping criterion that breaks with u_{\mathscr{T}} (resp. u_{\mathscr{T}}^{\prime}) set to u_{T} (resp. u_{T}^{\prime}), where T is the earliest iteration at which a large function-value decrease is reached.)

Lemma 14 (Escape stuck region)

(Lemma 5 in the main text). Let x\in\mathcal{M} satisfy \|\emph{grad}F(x)\|\leq\epsilon and -\gamma=\lambda_{\min}(\nabla^{2}\hat{F}_{x}(0))\leq-\delta. Given that the result in Lemma 13 holds and choosing the perturbation radius r\leq\min\{\frac{\delta^{3}}{4c_{3}\rho^{2}\epsilon},\sqrt{\frac{\delta^{3}}{2c_{3}\rho^{2}L}}\}, we have a sufficient decrease in function value with high probability:

F​(TSSRG​(x,u0,𝒯))βˆ’F​(x)β‰€βˆ’β„±Β with probability ​1βˆ’Ξ½.{F}(\textsf{TSSRG}(x,u_{0},\mathscr{T}))-F(x)\leq-\mathscr{F}\quad\text{ with probability }1-\nu.

ProofΒ  First define the stuck region formally as

𝒳stuck={uβˆˆπ”Ήx​(r):F​(TSSRG​(x,u,𝒯))βˆ’F^x​(u)β‰₯βˆ’2​ℱ}.\mathcal{X}_{\text{stuck}}=\{u\in\mathbb{B}_{x}(r):F(\textsf{TSSRG}(x,u,\mathscr{T}))-\hat{F}_{x}(u)\geq-2\mathscr{F}\}.

This suggests that running TSSRG from any point initialized in this region will not give a sufficient decrease in function value. Similar to (Li, 2019), the aim is to show that this region is small in volume. First note that, by the argument in (39), iterates initialized within \mathcal{X}_{\text{stuck}} do not leave the constraint ball; conversely, if the iterates do leave the ball \mathbb{B}_{x}(D), the function value has already decreased sufficiently and the starting point escapes the stuck region. Given that the tangent space T_{x}\mathcal{M} is a d-dimensional vector space, we can perform a similar analysis to that in (Jin et al., 2017; Li, 2019).

Consider the two coupled sequences with starting points u_{0} and u_{0}^{\prime} satisfying u_{0}-u_{0}^{\prime}=r_{0}e_{1}, where e_{1} is the eigenvector corresponding to the smallest eigenvalue of \nabla^{2}\hat{F}_{x}(0) and r_{0}=\frac{\nu r}{\sqrt{d}}. From Lemma 13, with probability 1-\zeta, at least one of the two starting points lies outside \mathcal{X}_{\text{stuck}}. Hence the width of the stuck region along the direction e_{1} is at most r_{0}. Based on a similar argument as in (Criscitiello and Boumal, 2019), \text{Vol}(\mathcal{X}_{\text{stuck}})\leq r_{0}\text{Vol}(\mathcal{B}_{r}^{d-1}), where \mathcal{B}_{r}^{d} denotes the d-dimensional ball of radius r. Therefore,

Pr​{u0βˆˆπ’³stuck}=Vol​(𝒳stuck)Vol​(ℬrd)≀r0​Vol​(ℬrdβˆ’1)Vol​(ℬrd)=r0​Γ​(d2+1)π​r​Γ​(d2+12)≀r0π​r​d2+1≀r0​dr=Ξ½,\text{Pr}\{u_{0}\in\mathcal{X}_{\text{stuck}}\}=\frac{\text{Vol}(\mathcal{X}_{\text{stuck}})}{\text{Vol}(\mathcal{B}_{r}^{d})}\leq\frac{r_{0}\text{Vol}(\mathcal{B}_{r}^{d-1})}{\text{Vol}(\mathcal{B}_{r}^{d})}=\frac{r_{0}\Gamma(\frac{d}{2}+1)}{\sqrt{\pi}r\Gamma(\frac{d}{2}+\frac{1}{2})}\leq\frac{r_{0}}{\sqrt{\pi}r}\sqrt{\frac{d}{2}+1}\leq\frac{r_{0}\sqrt{d}}{r}=\nu,

where we have used Gautschi's inequality for the Gamma function. Hence the probability that u_{0}\in\mathcal{X}_{\text{stuck}} is at most \nu, and with probability at least 1-\nu we have u_{0}\notin\mathcal{X}_{\text{stuck}}, in which case the function value decreases sufficiently. This is verified as follows. First note that F(x)=\hat{F}_{x}(0) and F(\textsf{TSSRG}(x,u_{0},\mathscr{T}))=\hat{F}_{x}(u_{\mathscr{T}}). Then, by L-Lipschitzness of the gradient \nabla\hat{F}_{x},

F^x​(u0)βˆ’F^x​(0)β‰€βŸ¨βˆ‡F^x​(0),u0⟩+L2​‖u0β€–2\displaystyle\hat{F}_{x}(u_{0})-\hat{F}_{x}(0)\leq\langle\nabla\hat{F}_{x}(0),u_{0}\rangle+\frac{L}{2}\|u_{0}\|^{2} β‰€β€–βˆ‡F^x​(0)‖​‖u0β€–+L2​‖u0β€–2≀ϡ​r+L​r22\displaystyle\leq\|\nabla\hat{F}_{x}(0)\|\|u_{0}\|+\frac{L}{2}\|u_{0}\|^{2}\leq\epsilon r+\frac{Lr^{2}}{2}
≀δ32​c3​ρ2=β„±,\displaystyle\leq\frac{\delta^{3}}{2c_{3}\rho^{2}}=\mathscr{F}, (40)

using β€–u0‖≀r,β€–βˆ‡F^x​(0)‖≀ϡ\|u_{0}\|\leq r,\|\nabla\hat{F}_{x}(0)\|\leq\epsilon and also r≀min⁑{Ξ΄34​c3​ρ2​ϡ,Ξ΄32​c3​ρ2​L}r\leq\min\{\frac{\delta^{3}}{4c_{3}\rho^{2}\epsilon},\sqrt{\frac{\delta^{3}}{2c_{3}\rho^{2}L}}\}. Therefore

F​(x)βˆ’F​(TSSRG​(x,u0,𝒯))=F^x​(0)βˆ’F^x​(u0)+F^x​(u0)βˆ’F^x​(u𝒯)β‰₯βˆ’β„±+2​ℱ=β„±,\displaystyle F(x)-F(\textsf{TSSRG}(x,u_{0},\mathscr{T}))=\hat{F}_{x}(0)-\hat{F}_{x}(u_{0})+\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{\mathscr{T}})\geq-\mathscr{F}+2\mathscr{F}=\mathscr{F},

where we apply (40) and the fact that u0βˆ‰π’³stucku_{0}\notin\mathcal{X}_{\text{stuck}}. Β 
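As a quick numerical sanity check of the two Gamma-function bounds used in the probability estimate above (an illustration only; the dimensions d below are arbitrary choices):

from math import pi, sqrt, exp, lgamma

# Check Gautschi's inequality Gamma(d/2+1)/Gamma(d/2+1/2) <= sqrt(d/2+1), and the
# final step sqrt((d/2+1)/pi) <= sqrt(d), for a few dimensions d.
for d in (1, 2, 5, 10, 100, 1000):
    ratio = exp(lgamma(d / 2 + 1) - lgamma(d / 2 + 0.5))
    assert ratio <= sqrt(d / 2 + 1)
    assert sqrt((d / 2 + 1) / pi) <= sqrt(d)
    print(f"d={d:5d}  Gamma ratio = {ratio:.4f}  <=  sqrt(d/2+1) = {sqrt(d / 2 + 1):.4f}")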

D Proof for online setting

Here we provide the proof for online problems, where the full gradient \text{grad}F needs to be replaced by the large-batch gradient \text{grad}f_{\mathcal{B}}. The proof is similar to that of the finite-sum setting, so we mainly present the key steps.

D.1 Proof for main Theorem

Theorem 2 (Online complexity) Under Assumptions 1 to 5, consider the online optimization setting. For any starting point x_{0}\in\mathcal{M}, with the choice of parameters

𝒯=π’ͺ~​(1Ξ΄),η≀π’ͺ~​(1L),m=b=π’ͺ~​(σϡ),B=π’ͺ~​(Οƒ2Ο΅2),r=π’ͺ~​(min⁑{Ξ΄3ρ2​ϡ,Ξ΄3ρ2​L}),\mathscr{T}=\widetilde{\mathcal{O}}\Big{(}\frac{1}{\delta}\Big{)},\quad\eta\leq\widetilde{\mathcal{O}}\Big{(}\frac{1}{L}\Big{)},\quad m=b=\widetilde{\mathcal{O}}\Big{(}\frac{\sigma}{\epsilon}\Big{)},\quad B=\widetilde{\mathcal{O}}\Big{(}\frac{\sigma^{2}}{\epsilon^{2}}\Big{)},\quad r=\widetilde{\mathcal{O}}\Big{(}\min\Big{\{}\frac{\delta^{3}}{\rho^{2}\epsilon},\sqrt{\frac{\delta^{3}}{\rho^{2}L}}\Big{\}}\Big{)},

suppose Ο΅,Ξ΄\epsilon,\delta satisfy ϡ≀min⁑{Ξ΄2ρ,π’ͺ~​(D​Lm​η)}\epsilon\leq\min\big{\{}\frac{\delta^{2}}{\rho},\widetilde{\mathcal{O}}(\frac{D\sqrt{L}}{m\sqrt{\eta}})\big{\}}, δ≀ℓ\delta\leq\ell where ℓ≀L\ell\leq L. Then with high probability, PRSRG​(x0,Ξ·,m,b,B,r,𝒯,D,Ο΅)\textsf{PRSRG}(x_{0},\eta,m,b,B,r,\mathscr{T},D,\epsilon) will at least once visit an (Ο΅,Ξ΄)(\epsilon,\delta)-second-order critical point with

π’ͺ~​(Δ​L​σϡ3+Δ​ρ2​σ2Ξ΄3​ϡ2+Δ​ρ2​σδ4​ϡ)\widetilde{\mathcal{O}}\Big{(}\frac{\Delta L\sigma}{\epsilon^{3}}+\frac{\Delta\rho^{2}\sigma^{2}}{\delta^{3}\epsilon^{2}}+\frac{\Delta\rho^{2}\sigma}{\delta^{4}\epsilon}\Big{)}

stochastic gradient oracle calls, where \Delta:=F(x_{0})-F^{*}.

ProofΒ  Similar to Theorem 1, we have the following possible cases when running the main algorithm PRSRG:

  • β€’

    Large gradients where β€–grad​fℬ​(x)β€–β‰₯Ο΅{\|\text{grad}f_{\mathcal{B}}(x)\|\geq\epsilon}.

    1. 1.

      Type-1 descent epoch: Iterates escape the constraint ball.

    2. 2.

      Iterates do not escape the constraint ball.

      1. (a)

        Type-2 descent epoch: At least half of iterates in current epoch have pullback gradient larger than Ο΅/4\epsilon/4.

      2. (b)

        Useful epoch: At least half of the iterates in the current epoch have pullback gradient no larger than \epsilon/4 and the output \tilde{x} from the current epoch has batch gradient \|\text{grad}f_{\mathcal{B}}(\tilde{x})\| no larger than \epsilon. (Since the output satisfies the small-gradient condition, the next epoch will run TSSRG(\tilde{x},u_{0},\mathscr{T}) to escape saddle points.)

      3. (c)

        Wasted epoch: At least half of the iterates in the current epoch have pullback gradient no larger than \epsilon/4 and the output \tilde{x} from the current epoch has batch gradient \|\text{grad}f_{\mathcal{B}}(\tilde{x})\| larger than \epsilon.

  • β€’

    Around saddle points, where \|\text{grad}f_{\mathcal{B}}(x)\|\leq\epsilon and \lambda_{\min}(\text{Hess}F(x))\leq-\delta.

    1. 3.

      Type-3 descent epoch: The current iterate is around saddle point.

The following proof is very similar to that of Lemma 1. First, from Lemma 17, we know that the probability of a wasted epoch occurring is at most 2/3. Given the independence of different wasted epochs, with high probability a wasted epoch happens consecutively at most \widetilde{\mathcal{O}}(1) times before either a descent epoch (Type 1 or 2) or a useful epoch occurs. We use N_{1},N_{2},N_{3} to denote the numbers of the three types of descent epochs, respectively.

For a Type-1 descent epoch, with probability 1-\vartheta, the function value decreases by at least \frac{3\eta m\epsilon^{2}}{512}. Hence by standard concentration, after N_{1} such epochs the function value decreases by {\mathcal{O}}(\eta m\epsilon^{2}N_{1}) with high probability. Given that F(x) is bounded below by F^{*}, the total decrease cannot exceed \Delta:=F(x_{0})-F^{*}. Thus, N_{1}\leq\mathcal{O}(\frac{\Delta}{\eta m\epsilon^{2}}). Similarly, for Type-2 descent epochs, N_{2}\leq\mathcal{O}(\frac{\Delta}{\eta m\epsilon^{2}}). For a useful epoch, the output is either an (\epsilon,\delta)-second-order critical point or the epoch is immediately followed by a Type-3 descent epoch around saddle points. From Lemma 20, we know that the function value then decreases by \mathscr{F}=\frac{\delta^{3}}{2c_{3}\rho^{2}} with high probability. Therefore, by similar arguments, N_{3}\leq\widetilde{\mathcal{O}}(\frac{\rho^{2}\Delta}{\delta^{3}}).

Combining these bounds, we have the following stochastic gradient complexity:

(N1+N2)​(π’ͺ~​(1)​(B+m​b)+B+m​b)+N3​(π’ͺ~​(1)​(B+m​b)+βŒˆπ’―/mβŒ‰β€‹B+𝒯​b)\displaystyle(N_{1}+N_{2})\big{(}\widetilde{\mathcal{O}}(1)\big{(}B+mb\big{)}+B+mb\big{)}+N_{3}\big{(}\widetilde{\mathcal{O}}(1)\big{(}B+mb\big{)}+\lceil\mathscr{T}/m\rceil B+\mathscr{T}b\big{)}
≀π’ͺ~​(Δ​L​σϡ3+Δ​ρ2​σ2Ξ΄3​ϡ2+ρ2​Δδ3​(Οƒ2Ο΅2+σδ​ϡ))\displaystyle\leq\widetilde{\mathcal{O}}\Big{(}\frac{\Delta L\sigma}{\epsilon^{3}}+\frac{\Delta\rho^{2}\sigma^{2}}{\delta^{3}\epsilon^{2}}+\frac{\rho^{2}\Delta}{\delta^{3}}\big{(}\frac{\sigma^{2}}{\epsilon^{2}}+\frac{\sigma}{\delta\epsilon}\big{)}\Big{)}
≀π’ͺ~​(Δ​L​σϡ3+Δ​ρ2​σ2Ξ΄3​ϡ2+Δ​ρ2​σδ4​ϡ),\displaystyle\leq\widetilde{\mathcal{O}}\Big{(}\frac{\Delta L\sigma}{\epsilon^{3}}+\frac{\Delta\rho^{2}\sigma^{2}}{\delta^{3}\epsilon^{2}}+\frac{\Delta\rho^{2}\sigma}{\delta^{4}\epsilon}\Big{)},

where 𝒯=π’ͺ~​(1Ξ΄)\mathscr{T}=\widetilde{\mathcal{O}}(\frac{1}{\delta}), m=b=B=π’ͺ~​(σϡ)m=b=\sqrt{B}=\widetilde{\mathcal{O}}(\frac{\sigma}{\epsilon}) and Ξ·=π’ͺ~​(1L)\eta=\widetilde{\mathcal{O}}(\frac{1}{L}). Β 
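As a sanity check of the order arithmetic above (an illustration only; all logarithmic factors and absolute constants are dropped, and \eta=1/L is already substituted into the epoch counts), one can carry out the substitution symbolically:

import sympy as sp

# Symbols for the problem parameters; everything below is order-of-magnitude only.
eps, delta, sigma, L, rho, Delta = sp.symbols('eps delta sigma L rho Delta', positive=True)
m = b = sigma / eps                  # epoch length and mini-batch size
B = sigma**2 / eps**2                # large-batch size
T = 1 / delta                        # escape-episode length (script T)
N12 = Delta * L / (m * eps**2)       # number of Type-1 and Type-2 descent epochs
N3 = rho**2 * Delta / delta**3       # number of Type-3 descent epochs

total = N12 * (B + m * b) + N3 * (B + m * b + (T / m) * B + T * b)
print(sp.expand(total))
# Output (up to constants): Delta*L*sigma/eps**3 + Delta*rho**2*sigma**2/(delta**3*eps**2)
#                           + Delta*rho**2*sigma/(delta**4*eps)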

D.2 Proof for key Lemmas

Lemma 15 (High probability bound on estimation error)

Under Assumptions 2 and 5, we have the following high-probability bound on the estimation error of the modified gradient under the online setting. That is, for sm+1\leq t\leq(s+1)m,

β€–vtβˆ’βˆ‡F^x​(ut)‖≀π’ͺ​(log32⁑(d/Ο‘))​Lbβ€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2+π’ͺ​(log⁑(d/Ο‘))​σB​ with probability ​1βˆ’Ο‘.\|v_{t}-\nabla\hat{F}_{x}(u_{t})\|\leq\frac{\mathcal{O}(\log^{\frac{3}{2}}(d/\vartheta))L}{\sqrt{b}}\sqrt{\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}}+\frac{\mathcal{O}(\log(d/\vartheta))\sigma}{\sqrt{B}}\,\,\text{ with probability }1-\vartheta.

Proof  For simplicity of notation, consider a single epoch, i.e. k=1,\ldots,m, for any s. Because under the online setting v_{0}=\nabla\hat{f}_{\mathcal{B},x}(u_{0})\neq\nabla\hat{F}_{x}(u_{0}), we first need to bound \|v_{0}-\nabla\hat{F}_{x}(u_{0})\| by the Bernstein inequality. By the bounded-variance assumption, we have

β€–βˆ‡f^i,x​(u0)βˆ’βˆ‡F^x​(u0)‖≀σ​ andΒ β€‹βˆ‘iβˆˆβ„¬β€–βˆ‡f^i,x​(u0)βˆ’βˆ‡F^x​(u0)β€–2≀B​σ2.\displaystyle\|\nabla\hat{f}_{i,x}(u_{0})-\nabla\hat{F}_{x}(u_{0})\|\leq\sigma\text{ and }\sum_{i\in\mathcal{B}}\|\nabla\hat{f}_{i,x}(u_{0})-\nabla\hat{F}_{x}(u_{0})\|^{2}\leq B\sigma^{2}.

By Lemma 6,

Pr​{β€–βˆ‘iβˆˆβ„¬(βˆ‡f^i,x​(u0)βˆ’βˆ‡F^x​(u0))β€–β‰₯Ο‚}\displaystyle\text{Pr}\{\|\sum_{i\in\mathcal{B}}\big{(}\nabla\hat{f}_{i,x}(u_{0})-\nabla\hat{F}_{x}(u_{0})\big{)}\|\geq\varsigma\} =Pr​{β€–1Bβ€‹βˆ‘iβˆˆβ„¬(βˆ‡f^i,x​(u0)βˆ’βˆ‡F^x​(u0))β€–β‰₯Ο‚/B}\displaystyle=\text{Pr}\{\|\frac{1}{B}\sum_{i\in\mathcal{B}}\big{(}\nabla\hat{f}_{i,x}(u_{0})-\nabla\hat{F}_{x}(u_{0})\big{)}\|\geq\varsigma/B\}
\displaystyle=\text{Pr}\{\|v_{0}-\nabla\hat{F}_{x}(u_{0})\|\geq\varsigma/B\}
≀(d+1)​exp⁑(βˆ’Ο‚2/2B​σ2+B​σ​ς/3)≀ϑ,\displaystyle\leq(d+1)\exp\big{(}\frac{-\varsigma^{2}/2}{B\sigma^{2}+\sqrt{B}\sigma\varsigma/3}\big{)}\leq\vartheta,

where the first inequality also considers Bβ‰₯1\sqrt{B}\geq 1. The last inequality holds by setting Ο‚=π’ͺ​(log⁑(d/Ο‘))​B​σ\varsigma=\mathcal{O}(\log(d/\vartheta))\sqrt{B}\sigma. Thus, we obtain

\|v_{0}-\nabla\hat{F}_{x}(u_{0})\|\leq\frac{\mathcal{O}(\log(d/\vartheta))\sigma}{\sqrt{B}}\,\text{ with probability }1-\vartheta. (41)

Next, denote yk:=vkβˆ’βˆ‡F^x​(uk)y_{k}:=v_{k}-\nabla\hat{F}_{x}(u_{k}) and zk:=ykβˆ’ykβˆ’1z_{k}:=y_{k}-y_{k-1}. Then we follow the same steps as in Lemma 9 to obtain the bound on β€–ykβ€–\|y_{k}\| except that y0β‰ 0y_{0}\neq 0. From (12), we have

β€–ykβˆ’y0‖≀π’ͺ​(log32⁑(d/Ο‘)​Lb)β€‹βˆ‘j=1kβ€–ujβˆ’ujβˆ’1β€–2​ with probability ​1βˆ’2​ϑ.\|y_{k}-y_{0}\|\leq\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d/\vartheta)L}{\sqrt{b}}\Big{)}\sqrt{\sum_{j=1}^{k}\|u_{j}-u_{j-1}\|^{2}}\,\text{ with probability }1-2\vartheta.

By union bound, for k∈[1,m]k\in[1,m], with probability at least 1βˆ’3​ϑ1-3\vartheta,

β€–vkβˆ’βˆ‡F^x​(uk)β€–=β€–yk‖≀‖ykβˆ’y0β€–+β€–y0‖≀π’ͺ​(log32⁑(d/Ο‘))​Lbβ€‹βˆ‘j=1kβ€–ujβˆ’ujβˆ’1β€–2+π’ͺ​(log⁑(d/Ο‘))​σB.\|v_{k}-\nabla\hat{F}_{x}(u_{k})\|=\|y_{k}\|\leq\|y_{k}-y_{0}\|+\|y_{0}\|\leq\frac{\mathcal{O}(\log^{\frac{3}{2}}(d/\vartheta))L}{\sqrt{b}}\sqrt{\sum_{j=1}^{k}\|u_{j}-u_{j-1}\|^{2}}+\frac{\mathcal{O}(\log(d/\vartheta))\sigma}{\sqrt{B}}.

The proof is complete by setting Ο‘/3\vartheta/3 as Ο‘\vartheta. Β 
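As an informal numerical illustration of the \mathcal{O}(\sigma/\sqrt{B}) concentration rate in (41) (not part of the proof; the dimension, \sigma and batch sizes below are arbitrary choices):

import numpy as np

# Average B i.i.d. mean-zero vectors with per-sample norm sigma and compare the
# deviation of the average from its mean (zero) with the sigma/sqrt(B) rate.
rng = np.random.default_rng(1)
d, sigma = 50, 2.0
for B in (10**2, 10**3, 10**4, 10**5):
    g = rng.uniform(-1.0, 1.0, size=(B, d))
    g *= sigma / np.linalg.norm(g, axis=1, keepdims=True)   # per-sample norm = sigma
    dev = np.linalg.norm(g.mean(axis=0))                    # analogue of ||v_0 - grad F||
    print(f"B={B:>7d}  deviation = {dev:.2e}   sigma/sqrt(B) = {sigma / np.sqrt(B):.2e}")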

Lemma 16 (Improve or localize)

Consider {ut}t=1𝒯\{u_{t}\}_{t=1}^{\mathscr{T}} as the sequence generated by running TSSRG(x,\textsf{TSSRG}(x, u0,𝒯)u_{0},\mathscr{T}). Suppose we choose bβ‰₯mb\geq m, η≀12​c1​L\eta\leq\frac{1}{2c_{1}L}, where c1=π’ͺ​(log32⁑(d​tΟ‘))c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(\frac{dt}{\vartheta})). Then we have

β€–utβˆ’u0‖≀4​t​(F^x​(u0)βˆ’F^x​(ut))c1​L+2​η​t2​c1′⁣2​σ2c1​L​BΒ with probability ​1βˆ’Ο‘,\|u_{t}-u_{0}\|\leq\sqrt{\frac{4t(\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}))}{c_{1}L}+\frac{2\eta t^{2}c_{1}^{\prime 2}\sigma^{2}}{c_{1}LB}}\quad\text{ with probability }1-\vartheta,

with c1β€²=π’ͺ​(log⁑(d​tϑ​m))c_{1}^{\prime}=\mathcal{O}(\log(\frac{dt}{\vartheta m})).

ProofΒ  First we generalize (44) to any epoch (i.e. 1≀t≀𝒯1\leq t\leq\mathscr{T}) as

F^x​(ut)βˆ’F^x​(us​m)\displaystyle\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{sm})
β‰€βˆ’Ξ·2β€‹βˆ‘j=s​m+1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2βˆ’Ξ·β€‹c12​L22)β€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2+η​(tβˆ’s​m)​c1′⁣2​σ22​B\displaystyle\leq-\frac{\eta}{2}\sum_{j=sm+1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}-(\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2})\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}+\frac{\eta(t-sm)c_{1}^{\prime 2}\sigma^{2}}{2B}
β‰€βˆ’(12β€‹Ξ·βˆ’L2βˆ’Ξ·β€‹c12​L22)β€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2+η​(tβˆ’s​m)​c1′⁣2​σ22​B\displaystyle\leq-(\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2})\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}+\frac{\eta(t-sm)c_{1}^{\prime 2}\sigma^{2}}{2B}
β‰€βˆ’c1​L4β€‹βˆ‘j=s​m+1tβ€–ujβˆ’ujβˆ’1β€–2+η​(tβˆ’s​m)​c1′⁣2​σ22​Bwith probability ​1βˆ’Ο‘,\displaystyle\leq-\frac{c_{1}L}{4}\sum_{j=sm+1}^{t}\|u_{j}-u_{j-1}\|^{2}+\frac{\eta(t-sm)c_{1}^{\prime 2}\sigma^{2}}{2B}\quad\text{with probability }1-\vartheta, (42)

where we choose η≀12​c1​L\eta\leq\frac{1}{2c_{1}L} and assume c1β‰₯1c_{1}\geq 1. Summing (42) for all epochs up to tt gives

F^x​(ut)βˆ’F^x​(u0)β‰€βˆ’c1​L4β€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2+η​t​c1′⁣2​σ22​B.\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{0})\leq-\frac{c_{1}L}{4}\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}+\frac{\eta tc_{1}^{\prime 2}\sigma^{2}}{2B}.

Lastly, by the triangle inequality and the Cauchy-Schwarz inequality, \|u_{t}-u_{0}\|\leq\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|\leq\sqrt{t\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}}. Hence,

β€–utβˆ’u0‖≀4​t​(F^x​(u0)βˆ’F^x​(ut))c1​L+2​η​t2​c1′⁣2​σ2c1​L​B.\|u_{t}-u_{0}\|\leq\sqrt{\frac{4t(\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}))}{c_{1}L}+\frac{2\eta t^{2}c_{1}^{\prime 2}\sigma^{2}}{c_{1}LB}}.

The proof is now complete. Β 

Lemma 17 (Large gradient descent lemma)

Under Assumptions 2 to 5, suppose we set η≀12​c1​L,bβ‰₯m,Bβ‰₯256​c1′⁣2​σ2Ο΅2\eta\leq\frac{1}{2c_{1}L},b\geq m,B\geq\frac{256c_{1}^{\prime 2}\sigma^{2}}{\epsilon^{2}}, where c1=π’ͺ​(log32⁑(d​mΟ‘)),c1β€²=π’ͺ​(log⁑(dΟ‘))c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(\frac{dm}{\vartheta})),c_{1}^{\prime}=\mathcal{O}(\log(\frac{d}{\vartheta})). Consider xβˆˆβ„³x\in\mathcal{M} where β€–grad​fℬ​(x)β€–β‰₯Ο΅\|\emph{grad}f_{\mathcal{B}}(x)\|\geq\epsilon, with ϡ≀Dm​32​c1​LΞ·\epsilon\leq\frac{D}{m}\sqrt{\frac{32c_{1}L}{\eta}}. Then by running TSSRG​(x,0,m)\textsf{TSSRG}(x,0,m), we have the following three cases:

  1. 1.

    When the iterates {uj}j=1m\{u_{j}\}_{j=1}^{m} do not leave the constraint ball 𝔹x​(D)\mathbb{B}_{x}(D):

    1. (a)

      If at least half of the iterates in the epoch satisfy β€–βˆ‡F^x​(uj)‖≀ϡ/4\|\nabla\hat{F}_{x}(u_{j})\|\leq\epsilon/4 for j=1,…,mj=1,...,m, then with probability at least 1/31/3, we have β€–grad​fℬ​(TSSRG​(x,0,m))‖≀ϡ\|\emph{grad}f_{\mathcal{B}}(\textsf{TSSRG}(x,0,m))\|\leq\epsilon.

    2. (b)

      Otherwise, with probability at least 1/51/5, we have F​(TSSRG​(x,0,m))βˆ’F​(x)β‰€βˆ’3​η​m​ϡ2512F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{3\eta m\epsilon^{2}}{512}.

  2. 2.

    When one of the iterates {uj}j=1m\{u_{j}\}_{j=1}^{m} leaves the constraint ball 𝔹x​(D)\mathbb{B}_{x}(D), with probability at least 1βˆ’Ο‘1-\vartheta, we have F​(TSSRG​(x,0,m))βˆ’F​(x)β‰€βˆ’3​η​m​ϡ2512F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{3\eta m\epsilon^{2}}{512}.

Regardless of which case occurs, F(\textsf{TSSRG}(x,0,m))-F(x)\leq\frac{\eta mc_{1}^{\prime 2}\sigma^{2}}{2B} with high probability.

ProofΒ  Similar to proof of Lemma 11, we only need to consider a single epoch in TSSRG(x,0,m)(x,0,m), i.e. t=1,…,mt=1,...,m. We also divide the proof into two parts.

1. Iterates do not leave the constraint ball. From Lemma 15 and union bound, we have for all 1≀τ≀t1\leq\tau\leq t,

β€–vΟ„βˆ’βˆ‡F^x​(uΟ„)β€–2≀c12​L2bβ€‹βˆ‘j=1Ο„βˆ’1β€–ujβˆ’ujβˆ’1β€–2+c1′⁣2​σ2B​ with probability ​1βˆ’Ο‘.\|v_{\tau}-\nabla\hat{F}_{x}(u_{\tau})\|^{2}\leq\frac{c_{1}^{2}L^{2}}{{b}}\sum_{j=1}^{\tau-1}\|u_{j}-u_{j-1}\|^{2}+\frac{c_{1}^{\prime 2}\sigma^{2}}{B}\,\,\text{ with probability }1-\vartheta. (43)

where we denote c_{1}:=\mathcal{O}(\log^{\frac{3}{2}}(\frac{dt}{\vartheta})) and c_{1}^{\prime}=\mathcal{O}(\log(\frac{dt}{\vartheta m})) (more precisely, c_{1}^{\prime}=\mathcal{O}(\log(\frac{d}{\vartheta}\lceil\frac{t}{m}\rceil))=\mathcal{O}(\log(\frac{dt}{\vartheta m}))). We also use (A+B)^{2}\leq 2A^{2}+2B^{2}. Summing up (14) for j=1,\ldots,t, we have

F^x​(ut)βˆ’F^x​(u0)\displaystyle\hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{0})
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2+Ξ·2β€‹βˆ‘j=1tβ€–vtβˆ’1βˆ’βˆ‡F^x​(utβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2)β€‹βˆ‘j=1tβ€–utβˆ’utβˆ’1β€–2\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}+\frac{\eta}{2}\sum_{j=1}^{t}\|v_{t-1}-\nabla\hat{F}_{x}(u_{t-1})\|^{2}-(\frac{1}{2\eta}-\frac{L}{2})\sum_{j=1}^{t}\|u_{t}-u_{t-1}\|^{2}
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2)β€‹βˆ‘j=1tβ€–utβˆ’utβˆ’1β€–2+η​c12​L22​bβ€‹βˆ‘k=1tβˆ’1βˆ‘j=1kβ€–ujβˆ’ujβˆ’1β€–2+η​t​c1′⁣2​σ22​B\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}-(\frac{1}{2\eta}-\frac{L}{2})\sum_{j=1}^{t}\|u_{t}-u_{t-1}\|^{2}+\frac{\eta c_{1}^{2}L^{2}}{2b}\sum_{k=1}^{t-1}\sum_{j=1}^{k}\|u_{j}-u_{j-1}\|^{2}+\frac{\eta tc_{1}^{\prime 2}\sigma^{2}}{2B}
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2)β€‹βˆ‘j=1tβ€–utβˆ’utβˆ’1β€–2+η​c12​L2​m2​bβ€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2+η​t​c1′⁣2​σ22​B\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}-(\frac{1}{2\eta}-\frac{L}{2})\sum_{j=1}^{t}\|u_{t}-u_{t-1}\|^{2}+\frac{\eta c_{1}^{2}L^{2}m}{2b}\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}+\frac{\eta tc_{1}^{\prime 2}\sigma^{2}}{2B}
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2βˆ’(12β€‹Ξ·βˆ’L2βˆ’Ξ·β€‹c12​L22)β€‹βˆ‘j=1tβ€–ujβˆ’ujβˆ’1β€–2+η​t​c1′⁣2​σ22​B\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}-(\frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2})\sum_{j=1}^{t}\|u_{j}-u_{j-1}\|^{2}+\frac{\eta tc_{1}^{\prime 2}\sigma^{2}}{2B} (44)
β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2+η​t​c1′⁣2​σ22​B,\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}+\frac{\eta tc_{1}^{\prime 2}\sigma^{2}}{2B}, (45)

where the last inequality holds because the choice \eta\leq\frac{1}{2c_{1}L}\leq\frac{\sqrt{4c_{1}^{2}+1}-1}{2c_{1}^{2}L} (using c_{1}\geq 1) makes the coefficient \frac{1}{2\eta}-\frac{L}{2}-\frac{\eta c_{1}^{2}L^{2}}{2} nonnegative. Note that we need (45) to hold for all t\leq m; hence we set c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(\frac{dm}{\vartheta})) and c_{1}^{\prime}=\mathcal{O}(\log(\frac{d}{\vartheta})). There are two cases.

  • β€’

    (Case 1a) If at least half of the iterates in the epoch satisfy \|\nabla\hat{F}_{x}(u_{j})\|\leq\epsilon/4 for j=1,\ldots,m, then by uniformly sampling \{u_{j}\}_{j=1}^{m} (i.e. breaking uniformly at random by setting u_{t} as u_{m}, as in Algorithm 2, Line 12), the output u_{m} satisfies \|\nabla\hat{F}_{x}(u_{m})\|\leq\epsilon/4 with probability at least \frac{1}{2}. Under the online setting, the full gradient \text{grad}F is inaccessible and we need to use the approximate batch gradient \text{grad}f_{\mathcal{B}} to check the small-gradient condition in Line 3 of Algorithm 1. From (41), we know that \|\nabla\hat{f}_{\mathcal{B},x}(u_{m})-\nabla\hat{F}_{x}(u_{m})\|\leq\frac{c_{1}^{\prime}\sigma}{\sqrt{B}} with probability 1-\vartheta. By a union bound, with probability at least \frac{1}{2}-\vartheta,

    β€–βˆ‡f^ℬ,x​(um)β€–β‰€β€–βˆ‡f^ℬ,x​(um)βˆ’βˆ‡F^x​(um)β€–+β€–βˆ‡F^x​(um)‖≀c1′​σB+Ο΅4≀ϡ2,\|\nabla\hat{f}_{\mathcal{B},x}(u_{m})\|\leq\|\nabla\hat{f}_{\mathcal{B},x}(u_{m})-\nabla\hat{F}_{x}(u_{m})\|+\|\nabla\hat{F}_{x}(u_{m})\|\leq\frac{c_{1}^{\prime}\sigma}{\sqrt{B}}+\frac{\epsilon}{4}\leq\frac{\epsilon}{2},

    where the last inequality holds by Bβ‰₯256​c1′⁣2​σ2Ο΅2B\geq\frac{256c_{1}^{\prime 2}\sigma^{2}}{\epsilon^{2}}. Without loss of generality, we set ϑ≀16\vartheta\leq\frac{1}{6} and therefore the probability reduces to 13\frac{1}{3}. Lastly, from the definition of the pullback gradient, we have

    β€–grad​fℬ​(TSSRG​(x,0,m))β€–=β€–grad​fℬ​(Rx​(um))‖≀‖(Tumβˆ—)βˆ’1β€–β€‹β€–βˆ‡f^ℬ,x​(um)‖≀ϡ,\|\text{grad}f_{\mathcal{B}}(\textsf{TSSRG}(x,0,m))\|=\|\text{grad}f_{\mathcal{B}}(R_{x}(u_{m}))\|\leq\|(T^{*}_{u_{m}})^{-1}\|\|\nabla\hat{f}_{\mathcal{B},x}(u_{m})\|\leq\epsilon,

    holds with probability at least 13\frac{1}{3}. The last inequality is due to Lemma 3. Note that in this case, we also have β€–grad​F​(TSSRG​(x,0,m))‖≀‖(Tumβˆ—)βˆ’1β€–β€‹β€–βˆ‡F^x​(um)‖≀ϡ/2\|\text{grad}F(\textsf{TSSRG}(x,0,m))\|\leq\|(T^{*}_{u_{m}})^{-1}\|\|\nabla\hat{F}_{x}(u_{m})\|\leq\epsilon/2.

  • β€’

    (Case 1b) If at least half of the points in the epoch satisfy β€–βˆ‡F^x​(uj)β€–β‰₯Ο΅/4\|\nabla\hat{F}_{x}(u_{j})\|\geq\epsilon/4 for j=1,…,mj=1,...,m, with probability at least 14\frac{1}{4}, the selected output umu_{m} falls within the last quarter of {uj}j=1m\{u_{j}\}_{j=1}^{m}. In this case, βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2β‰₯m4β‹…Ο΅216=m​ϡ264\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}\geq\frac{m}{4}\cdot\frac{\epsilon^{2}}{16}=\frac{m\epsilon^{2}}{64}. Note that (45) holds with probability at least 1βˆ’Ο‘1-\vartheta. By a union bound, we have with probability at least 15\frac{1}{5} (by letting ϑ≀120\vartheta\leq\frac{1}{20}),

    F​(TSSRG​(x,0,m))βˆ’F​(x)=F^x​(ut)βˆ’F^x​(0)\displaystyle F(\textsf{TSSRG}(x,0,m))-F(x)=\hat{F}_{x}(u_{t})-\hat{F}_{x}(0) β‰€βˆ’Ξ·2β€‹βˆ‘j=1tβ€–βˆ‡F^x​(ujβˆ’1)β€–2+η​t​c1′⁣2​σ22​B\displaystyle\leq-\frac{\eta}{2}\sum_{j=1}^{t}\|\nabla\hat{F}_{x}(u_{j-1})\|^{2}+\frac{\eta tc_{1}^{\prime 2}\sigma^{2}}{2B}
    β‰€βˆ’Ξ·β€‹m​ϡ2128+η​m​ϡ2512=βˆ’3​η​m​ϡ2512,\displaystyle\leq-\frac{\eta m\epsilon^{2}}{128}+\frac{\eta m\epsilon^{2}}{512}=-\frac{3\eta m\epsilon^{2}}{512},

    where we use Bβ‰₯256​c1′⁣2​σ2Ο΅2B\geq\frac{256c_{1}^{\prime 2}\sigma^{2}}{\epsilon^{2}} and t≀mt\leq m.

2. Iterates leave the constraint ball. Suppose at some \tau\leq m we have \|u_{\tau}\|>D. Then by Lemma 16, the function value has already decreased sufficiently. That is, with probability 1-\vartheta,

F^x​(0)βˆ’F^x​(uΟ„)β‰₯c1​L4​τ​‖uΟ„β€–2βˆ’Ξ·β€‹Ο„β€‹c1′⁣2​σ22​Bβ‰₯c1​L​D24​mβˆ’Ξ·β€‹m​c1′⁣2​σ22​Bβ‰₯3​η​m​ϡ2512,\displaystyle\hat{F}_{x}(0)-\hat{F}_{x}(u_{\tau})\geq\frac{c_{1}L}{4\tau}\|u_{\tau}\|^{2}-\frac{\eta\tau c_{1}^{\prime 2}\sigma^{2}}{2B}\geq\frac{c_{1}LD^{2}}{4m}-\frac{\eta mc_{1}^{\prime 2}\sigma^{2}}{2B}\geq\frac{3\eta m\epsilon^{2}}{512},

where the last inequality follows from the choice that ϡ≀Dm​32​c1​LΞ·\epsilon\leq\frac{D}{m}\sqrt{\frac{32c_{1}L}{\eta}} and Bβ‰₯256​c1′⁣2​σ2Ο΅2B\geq\frac{256c_{1}^{\prime 2}\sigma^{2}}{\epsilon^{2}}. Hence by returning uΟ„u_{\tau} as umu_{m}, we have with high probability,

F​(TSSRG​(x,0,m))βˆ’F​(x)β‰€βˆ’3​η​m​ϡ2512.F(\textsf{TSSRG}(x,0,m))-F(x)\leq-\frac{3\eta m\epsilon^{2}}{512}.

In summary, with high probability, either the gradient norm of the output is small or the function value decreases substantially. Even under Case 1a, \hat{F}_{x}(u_{t})-\hat{F}_{x}(u_{0})\leq\frac{\eta tc_{1}^{\prime 2}\sigma^{2}}{2B} with high probability by (45). Β 

Lemma 18 (Small stuck region)

Consider xβˆˆβ„³x\in\mathcal{M} with β€–grad​fℬ​(x)‖≀ϡ\|\emph{grad}f_{\mathcal{B}}(x)\|\leq\epsilon, βˆ’Ξ³:=Ξ»min​(βˆ‡2F^x​(0))-\gamma:=\lambda_{\min}(\nabla^{2}\hat{F}_{x}(0)) =Ξ»min​(Hess​F​(x))β‰€βˆ’Ξ΄=\lambda_{\min}(\emph{Hess}F(x))\leq-\delta and Lβ‰₯Ξ΄L\geq\delta. Let u0,u0β€²βˆˆTx​ℳu_{0},u_{0}^{\prime}\in T_{x}\mathcal{M} be two random perturbations, satisfying β€–u0β€–,β€–u0′‖≀r\|u_{0}\|,\|u_{0}^{\prime}\|\leq r and u0βˆ’u0β€²=r0​e1u_{0}-u_{0}^{\prime}=r_{0}e_{1}, where e1e_{1} denotes the smallest eigenvector of βˆ‡2f^x​(0)\nabla^{2}\hat{f}_{x}(0) and r0=ν​rd,r≀δc2​ρr_{0}=\frac{\nu r}{\sqrt{d}},r\leq\frac{\delta}{c_{2}\rho}. Also set parameters 𝒯=2​logα⁑(8​δ​dc2​ρ​ν​r)Ξ΄=π’ͺ~​(1Ξ΄)\mathscr{T}=\frac{2\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})}{\delta}=\widetilde{\mathcal{O}}(\frac{1}{\delta}), η≀min⁑{18​log⁑(𝒯)​C1​L,116​logα⁑(8​δ​dc2​ρ​ν​r)​L}=π’ͺ~​(1L)\eta\leq\min\{\frac{1}{8\log(\mathcal{T})C_{1}L},\frac{1}{16\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})L}\}=\widetilde{\mathcal{O}}(\frac{1}{L}) where Ξ±β‰₯1\alpha\geq 1 is chosen sufficiently small such that logα⁑(1+η​γ)>Ξ³\log_{\alpha}(1+\eta\gamma)>\gamma. Also choose Bβ‰₯π’ͺ​(c2′⁣2​σ2Ο΅2)=π’ͺ~​(Οƒ2Ο΅2)B\geq\mathcal{O}(\frac{c_{2}^{\prime 2}\sigma^{2}}{\epsilon^{2}})=\widetilde{\mathcal{O}}(\frac{\sigma^{2}}{\epsilon^{2}}), ϡ≀δ2ρ\epsilon\leq\frac{\delta^{2}}{\rho}. Then for {ut},{utβ€²}\{u_{t}\},\{u_{t}^{\prime}\} generated by running TSSRG(x,u0,𝒯)(x,u_{0},\mathscr{T}) and TSSRG(x,u0β€²,𝒯)(x,u_{0}^{\prime},\mathscr{T}) with same sets of mini-batches, with probability 1βˆ’ΞΆ1-\zeta,

βˆƒt≀𝒯,max⁑{β€–utβˆ’u0β€–,β€–utβ€²βˆ’u0β€²β€–}β‰₯Ξ΄c2​ρ,\exists\,t\leq\mathscr{T},\,\max\{\|u_{t}-u_{0}\|,\|u_{t}^{\prime}-u_{0}^{\prime}\|\}\geq\frac{\delta}{c_{2}\rho},

where c2β‰₯max⁑{12​C1L,2​δD​ρ}c_{2}\geq\max\{\frac{12C_{1}}{L},\frac{2\delta}{D\rho}\}, and C1=π’ͺ​(log32⁑(d​𝒯/ΞΆ))=π’ͺ~​(1)C_{1}=\mathcal{O}(\log^{\frac{3}{2}}(d\mathscr{T}/\zeta))=\widetilde{\mathcal{O}}(1), c2β€²=π’ͺ​(log⁑(d​𝒯΢​m))=π’ͺ~​(1)c_{2}^{\prime}=\mathcal{O}(\log(\frac{d\mathscr{T}}{\zeta m}))=\widetilde{\mathcal{O}}(1).

ProofΒ  The proof is by contradiction. So we assume

βˆ€t≀𝒯,max⁑{β€–utβˆ’u0β€–,β€–utβ€²βˆ’u0β€–}≀δc2​ρ.\forall\,t\leq\mathscr{T},\,\max\{\|u_{t}-u_{0}\|,\|u_{t}^{\prime}-u_{0}\|\}\leq\frac{\delta}{c_{2}\rho}. (46)

First, we again note that both \{u_{j}\}_{j=0}^{\mathscr{T}} and \{u_{j}^{\prime}\}_{j=0}^{\mathscr{T}} do not escape the constraint ball under condition (46). This is because \|u_{t}\|\leq\|u_{t}-u_{0}\|+\|u_{0}\|\leq\frac{\delta}{c_{2}\rho}+r\leq D by the choice r\leq\frac{\delta}{c_{2}\rho} and c_{2}\geq\frac{2\delta}{D\rho}. Next, we show that the distance between the two coupled sequences grows exponentially and eventually exceeds the bound in (46). The proof roadmap is exactly the same as in Lemma 12, so we only highlight the main results under the online setting.

Denote u^t:=utβˆ’utβ€²\hat{u}_{t}:=u_{t}-u_{t}^{\prime}, β„‹:=βˆ‡2F^x​(0),Ξ”t:=∫01[βˆ‡2F^x​(utβ€²+θ​(utβˆ’utβ€²))βˆ’β„‹]​𝑑θ\mathcal{H}:=\nabla^{2}\hat{F}_{x}(0),\Delta_{t}:=\int_{0}^{1}[\nabla^{2}\hat{F}_{x}(u_{t}^{\prime}+\theta(u_{t}-u_{t}^{\prime}))-\mathcal{H}]d\theta, y^t:=ytβˆ’ytβ€²:=vtβˆ’βˆ‡F^x​(ut)βˆ’vtβ€²+βˆ‡F^x​(utβ€²)\hat{y}_{t}:=y_{t}-y_{t}^{\prime}:=v_{t}-\nabla\hat{F}_{x}(u_{t})-v_{t}^{\prime}+\nabla\hat{F}_{x}(u_{t}^{\prime}). Recall we can bound β€–Ξ”tβ€–\|\Delta_{t}\| as β€–Ξ”tβ€–β‰€Οβ€‹π’Ÿt≀ρ​(Ξ΄c2​ρ+r)\|\Delta_{t}\|\leq\rho\mathscr{D}_{t}\leq\rho(\frac{\delta}{c_{2}\rho}+r), where π’Ÿt:=max⁑{β€–utβ€–,β€–utβ€²β€–}\mathscr{D}_{t}:=\max\{\|u_{t}\|,\|u_{t}^{\prime}\|\}. Thus from proof of Lemma 12, we know that we can re-write u^t=p​(t)βˆ’q​(t)\hat{u}_{t}=p(t)-q(t) where

p​(t):=(Iβˆ’Ξ·β€‹β„‹)t​u^0=(1+η​γ)t​r0​e1,\displaystyle p(t):=(I-\eta\mathcal{H})^{t}\hat{u}_{0}=(1+\eta\gamma)^{t}r_{0}e_{1},
q​(t):=Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​(Ξ”j​u^j+y^j),Β and\displaystyle q(t):=\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}(\Delta_{j}\hat{u}_{j}+\hat{y}_{j}),\quad\text{ and }
p​(0)=u^0,q​(0)=0.\displaystyle p(0)=\hat{u}_{0},\,q(0)=0.
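For clarity, this decomposition follows from unrolling the update of the difference sequence: since u_{t}=u_{t-1}-\eta v_{t-1} within the ball (and similarly for u_{t}^{\prime}, with the same mini-batches), and v_{t-1}-v_{t-1}^{\prime}=\nabla\hat{F}_{x}(u_{t-1})-\nabla\hat{F}_{x}(u_{t-1}^{\prime})+\hat{y}_{t-1}=(\mathcal{H}+\Delta_{t-1})\hat{u}_{t-1}+\hat{y}_{t-1} by the definition of \Delta_{t-1}, we obtain the recursion \hat{u}_{t}=(I-\eta\mathcal{H})\hat{u}_{t-1}-\eta(\Delta_{t-1}\hat{u}_{t-1}+\hat{y}_{t-1}). Unrolling this recursion from \hat{u}_{0}=r_{0}e_{1} and using \mathcal{H}e_{1}=-\gamma e_{1} gives exactly \hat{u}_{t}=p(t)-q(t).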

Next, we can inductively show that (1)​‖y^t‖≀2​γ​L​(1+η​γ)t​r0(1)\,\|\hat{y}_{t}\|\leq 2\gamma L(1+\eta\gamma)^{t}r_{0} and (2)​‖q​(t)‖≀‖p​(t)β€–/2(2)\,\|q(t)\|\leq\|p(t)\|/2. When t=0t=0, these two conditions are easily verified. Now suppose for j≀tβˆ’1j\leq t-1, these two conditions are true, then we can immediately obtain that for all j≀tβˆ’1j\leq t-1,

β€–u^j‖≀‖p​(j)β€–+β€–q​(j)‖≀32​‖p​(j)‖≀32​(1+η​γ)j​r0.\|\hat{u}_{j}\|\leq\|p(j)\|+\|q(j)\|\leq\frac{3}{2}\|p(j)\|\leq\frac{3}{2}(1+\eta\gamma)^{j}r_{0}. (47)

Also, different from the finite-sum setting, we need a bound on \|\hat{y}_{sm}\|. Given that v_{sm}=\nabla\hat{f}_{\mathcal{B},x}(u_{sm}) and v_{sm}^{\prime}=\nabla\hat{f}_{\mathcal{B},x}(u_{sm}^{\prime}), we have

y^s​m\displaystyle\hat{y}_{sm} =vs​mβˆ’βˆ‡F^x​(us​m)βˆ’vs​mβ€²+βˆ‡F^x​(us​mβ€²)\displaystyle=v_{sm}-\nabla\hat{F}_{x}(u_{sm})-v_{sm}^{\prime}+\nabla\hat{F}_{x}(u_{sm}^{\prime})
=1Bβ€‹βˆ‘iβˆˆβ„¬(βˆ‡f^i,x​(us​m)βˆ’βˆ‡f^i,x​(us​mβ€²)βˆ’βˆ‡F^x​(us​m)+βˆ‡F^x​(us​mβ€²)).\displaystyle=\frac{1}{B}\sum_{i\in\mathcal{B}}\big{(}\nabla\hat{f}_{i,x}(u_{sm})-\nabla\hat{f}_{i,x}(u_{sm}^{\prime})-\nabla\hat{F}_{x}(u_{sm})+\nabla\hat{F}_{x}(u_{sm}^{\prime})\big{)}.

For each component term, we have

β€–βˆ‡f^i,x​(us​m)βˆ’βˆ‡f^i,x​(us​mβ€²)βˆ’βˆ‡F^x​(us​m)+βˆ‡F^x​(us​mβ€²)β€–\displaystyle\|\nabla\hat{f}_{i,x}(u_{sm})-\nabla\hat{f}_{i,x}(u_{sm}^{\prime})-\nabla\hat{F}_{x}(u_{sm})+\nabla\hat{F}_{x}(u_{sm}^{\prime})\| ≀2​L​‖us​mβˆ’us​mβ€²β€–\displaystyle\leq 2L\|u_{sm}-u_{sm}^{\prime}\|
=2​L​‖u^s​m‖≀3​L​(1+η​γ)s​m​r0,\displaystyle=2L\|\hat{u}_{sm}\|\leq 3L(1+\eta\gamma)^{sm}r_{0},

where the last inequality uses (47). Also the variance term is bounded as

βˆ‘iβˆˆβ„¬π”Όβ€‹β€–βˆ‡f^i,x​(us​m)βˆ’βˆ‡f^i,x​(us​mβ€²)βˆ’βˆ‡F^x​(us​m)+βˆ‡F^x​(us​mβ€²)β€–2\displaystyle\sum_{i\in\mathcal{B}}\mathbb{E}\|\nabla\hat{f}_{i,x}(u_{sm})-\nabla\hat{f}_{i,x}(u_{sm}^{\prime})-\nabla\hat{F}_{x}(u_{sm})+\nabla\hat{F}_{x}(u_{sm}^{\prime})\|^{2}
β‰€βˆ‘iβˆˆβ„¬π”Όβ€‹β€–βˆ‡f^i,x​(us​m)βˆ’βˆ‡f^i,x​(us​mβ€²)β€–2\displaystyle\leq\sum_{i\in\mathcal{B}}\mathbb{E}\|\nabla\hat{f}_{i,x}(u_{sm})-\nabla\hat{f}_{i,x}(u_{sm}^{\prime})\|^{2}
≀B​L2​‖us​mβˆ’us​mβ€²β€–2\displaystyle\leq BL^{2}\|u_{sm}-u_{sm}^{\prime}\|^{2}
=B​L2​‖u^s​mβ€–2≀94​B​L2​(1+η​γ)2​s​m​r02.\displaystyle=BL^{2}\|\hat{u}_{sm}\|^{2}\leq\frac{9}{4}BL^{2}(1+\eta\gamma)^{2sm}r_{0}^{2}.

Therefore, we can substitute these two bounds in Bernstein inequality (Lemma 6) to bound β€–y^s​mβ€–\|\hat{y}_{sm}\|. That is,

Pr​{β€–y^s​mβ€–β‰₯Ο‚B}\displaystyle\text{Pr}\{\|\hat{y}_{sm}\|\geq\frac{\varsigma}{B}\} ≀(d+1)​exp⁑(βˆ’Ο‚2/2Οƒ2+R​ς/3)\displaystyle\leq(d+1)\exp\Big{(}\frac{-\varsigma^{2}/2}{\sigma^{2}+R\varsigma/3}\Big{)}
≀(d+1)​exp⁑(βˆ’Ο‚2/29​B​L2​(1+η​γ)2​s​m​r02/4+3​B​L​(1+η​γ)s​m​r0​ς/3)\displaystyle\leq(d+1)\exp\Big{(}\frac{-\varsigma^{2}/2}{9BL^{2}(1+\eta\gamma)^{2sm}r_{0}^{2}/4+3\sqrt{B}L(1+\eta\gamma)^{sm}r_{0}\varsigma/3}\Big{)}
≀΢,\displaystyle\leq\zeta,

where the second inequality also uses Bβ‰₯1\sqrt{B}\geq 1 and the last inequality holds due to

\varsigma=\mathcal{O}(\log(d/\zeta))L\sqrt{B}(1+\eta\gamma)^{sm}r_{0}.

By a union bound, for all t≀𝒯,s=⌈tmβŒ‰t\leq\mathscr{T},s=\lceil\frac{t}{m}\rceil, we have

β€–y^s​m‖≀c2′​L​(1+η​γ)s​m​r0B≀γ​L​(1+η​γ)s​m​r0,Β with probability ​1βˆ’ΞΆ\|\hat{y}_{sm}\|\leq\frac{c_{2}^{\prime}L(1+\eta\gamma)^{sm}r_{0}}{\sqrt{B}}\leq\gamma L(1+\eta\gamma)^{sm}r_{0},\quad\text{ with probability }1-\zeta (48)

where c2β€²=π’ͺ​(log⁑(d​𝒯΢​m))=π’ͺ~​(1)c_{2}^{\prime}=\mathcal{O}(\log(\frac{d\mathscr{T}}{\zeta m}))=\widetilde{\mathcal{O}}(1) and the second inequality is due to Bβ‰₯π’ͺ​(c2′⁣2​σ2/Ο΅2)=π’ͺ~​(Οƒ2/Ο΅2)B\geq\mathcal{O}(c_{2}^{\prime 2}\sigma^{2}/\epsilon^{2})=\widetilde{\mathcal{O}}(\sigma^{2}/\epsilon^{2}), ϡ≀δ2/ρ\epsilon\leq\delta^{2}/\rho, δ≀γ\delta\leq\gamma and assume without loss of generality δ≀1\delta\leq 1. Now we will first prove that the second condition holds for j=tj=t.

Proof that β€–q​(t)β€–\boldsymbol{\|q(t)\|} is bounded by β€–p​(t)β€–/𝟐\boldsymbol{\|p(t)\|/2}. Recall we can decompose β€–q​(t)β€–\|q(t)\| into two terms:

β€–q​(t)β€–β‰€β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​Δj​u^jβ€–+β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​y^jβ€–.\|q(t)\|\leq\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\Delta_{j}\hat{u}_{j}\|+\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\hat{y}_{j}\|.

The first term is bounded the same way as in Lemma 12. Consider parameter settings, r≀δc2​ρr\leq\frac{\delta}{c_{2}\rho}, 𝒯≀2​logα⁑(8​δ​dc2​ρ​ν​r)/(η​γ)\mathscr{T}\leq 2\log_{\alpha}\big{(}\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r}\big{)}/(\eta\gamma), c2β‰₯24​logα⁑(8​δ​dc2​ρ​ν​r)c_{2}\geq 24\log_{\alpha}\big{(}\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r}\big{)}, where 1≀α≀e1\leq\alpha\leq e satisfying logα⁑(1+η​γ)>Ξ³\log_{\alpha}(1+\eta\gamma)>\gamma. Then we have

β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​Δj​u^j‖≀14​‖p​(t)β€–.\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\Delta_{j}\hat{u}_{j}\|\leq\frac{1}{4}\|p(t)\|.

The second term can be bounded as

β€–Ξ·β€‹βˆ‘j=0tβˆ’1(Iβˆ’Ξ·β€‹β„‹)tβˆ’1βˆ’j​y^jβ€–\displaystyle\|\eta\sum_{j=0}^{t-1}(I-\eta\mathcal{H})^{t-1-j}\hat{y}_{j}\| β‰€Ξ·β€‹βˆ‘j=0tβˆ’1(1+η​γ)tβˆ’1βˆ’j​‖y^j‖≀2β€‹Ξ·β€‹βˆ‘j=0tβˆ’1(1+η​γ)tβˆ’1βˆ’j​γ​L​(1+η​γ)j​r0\displaystyle\leq\eta\sum_{j=0}^{t-1}(1+\eta\gamma)^{t-1-j}\|\hat{y}_{j}\|\leq 2\eta\sum_{j=0}^{t-1}(1+\eta\gamma)^{t-1-j}\gamma L(1+\eta\gamma)^{j}r_{0}
≀2​η​γ​L​𝒯​(1+η​γ)tβˆ’1​r0≀14​‖p​(t)β€–,\displaystyle\leq 2\eta\gamma L\mathscr{T}(1+\eta\gamma)^{t-1}r_{0}\leq\frac{1}{4}\|p(t)\|,

where we choose 𝒯≀2​logα⁑(8​δ​dc2​ρ​ν​r)/Ξ³\mathscr{T}\leq 2\log_{\alpha}\big{(}\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r}\big{)}/\gamma and η≀116​logα⁑(8​δ​dc2​ρ​ν​r)​L\eta\leq\frac{1}{16\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})L}. These two results complete the proof that β€–q​(t)‖≀‖p​(t)β€–/2\|q(t)\|\leq\|p(t)\|/2. Now we proceed to prove the first condition.

Proof that β€–y^tβ€–\boldsymbol{\|\hat{y}_{t}\|} is bounded by πŸβ€‹Ξ³β€‹L​(𝟏+η​γ)t​r𝟎\boldsymbol{2\gamma L(1+\eta\gamma)^{t}r_{0}}. Denote zt=y^tβˆ’y^tβˆ’1z_{t}=\hat{y}_{t}-\hat{y}_{t-1}. Hence we can verify that {y^t}\{\hat{y}_{t}\} is a martingale sequence and {zt}\{z_{t}\} is its difference sequence. Then we can first bound β€–ztβ€–\|z_{t}\| by Bernstein inequality. The following result is exactly the same as in Lemma 12. With probability 1βˆ’ΞΆt1-\zeta_{t},

βˆ₯ztβˆ₯≀π’ͺ(log⁑(d/ΞΆt)b)(Lβˆ₯u^tβˆ’u^tβˆ’1βˆ₯+Οπ’Ÿtβˆ₯u^tβˆ₯+Οπ’Ÿtβˆ’1βˆ₯u^tβˆ’1βˆ₯)=:Ο±t.\|z_{t}\|\leq\mathcal{O}\Big{(}\frac{\log(d/\zeta_{t})}{\sqrt{b}}\Big{)}(L\|\hat{u}_{t}-\hat{u}_{t-1}\|+\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|)=:\varrho_{t}.

Now we can bound \|\hat{y}_{t}\|. We only need to consider a single epoch because \|\hat{y}_{sm}\| is bounded as in (48). Similarly, setting \zeta_{t}=\zeta/m, we have \|z_{t}\|\leq\varrho_{t} for all t=sm+1,\ldots,(s+1)m. By the Azuma-Hoeffding inequality (Lemma 7), we have for all t\leq\mathscr{T}, with probability 1-\zeta,

\displaystyle\|\hat{y}_{t}-\hat{y}_{sm}\|\leq\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d\mathscr{T}/\zeta)}{\sqrt{b}}\Big{)}\sqrt{\sum_{j=sm+1}^{t}(L\|\hat{u}_{j}-\hat{u}_{j-1}\|+\rho\mathscr{D}_{j}\|\hat{u}_{j}\|+\rho\mathscr{D}_{j-1}\|\hat{u}_{j-1}\|)^{2}}.

This is the same result as (29) except that \hat{y}_{sm}\neq 0. Combining this result with (48) and using a union bound, we have, with probability 1-2\zeta, for all t\leq\mathscr{T},

β€–y^t‖≀‖y^tβˆ’y^s​mβ€–+β€–y^s​mβ€–\displaystyle\|\hat{y}_{t}\|\leq\|\hat{y}_{t}-\hat{y}_{sm}\|+\|\hat{y}_{sm}\|
≀π’ͺ​(log32⁑(d​𝒯/ΞΆ)b)β€‹βˆ‘j=s​m+1t(L​‖u^tβˆ’u^tβˆ’1β€–+Οβ€‹π’Ÿt​‖u^tβ€–+Οβ€‹π’Ÿtβˆ’1​‖u^tβˆ’1β€–)2+γ​L​(1+η​γ)⌈t/mβŒ‰β€‹m.\displaystyle\leq\mathcal{O}\Big{(}\frac{\log^{\frac{3}{2}}(d\mathcal{T}/\zeta)}{\sqrt{b}}\Big{)}\sqrt{\sum_{j=sm+1}^{t}(L\|\hat{u}_{t}-\hat{u}_{t-1}\|+\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\|)^{2}}+\gamma L(1+\eta\gamma)^{\lceil t/m\rceil m}. (49)

What remains is to show that the right-hand side of (49) is bounded. Recalling (35) and (37), we similarly have

L​‖u^tβˆ’u^tβˆ’1β€–\displaystyle L\|\hat{u}_{t}-\hat{u}_{t-1}\| ≀(3​η​log⁑(𝒯)c2+3​η​L​log⁑(𝒯))​γ​L​(1+η​γ)t​r0,Β and\displaystyle\leq\Big{(}\frac{3\eta\log(\mathscr{T})}{c_{2}}+3\eta L\log(\mathscr{T})\Big{)}\gamma L(1+\eta\gamma)^{t}r_{0},\text{ and }
Οβ€‹π’Ÿt​‖u^tβ€–+Οβ€‹π’Ÿtβˆ’1​‖u^tβˆ’1β€–\displaystyle\rho\mathscr{D}_{t}\|\hat{u}_{t}\|+\rho\mathscr{D}_{t-1}\|\hat{u}_{t-1}\| ≀6​δc2​(1+η​γ)t​r0,\displaystyle\leq\frac{6\delta}{c_{2}}(1+\eta\gamma)^{t}r_{0},

where there is a slight difference in one of the constants because now \|\hat{y}_{j}\|\leq 2\gamma L(1+\eta\gamma)^{j}r_{0}. Therefore, we obtain

β€–y^tβ€–\displaystyle\|\hat{y}_{t}\| ≀C1​(3​η​log⁑(t)c2+3​η​L​log⁑(t)+6c2​L)​γ​L​(1+η​γ)t​r0+γ​L​(1+η​γ)⌈t/mβŒ‰β€‹m​r0\displaystyle\leq C_{1}\Big{(}\frac{3\eta\log(t)}{c_{2}}+3\eta L\log(t)+\frac{6}{c_{2}L}\Big{)}\gamma L(1+\eta\gamma)^{t}r_{0}+\gamma L(1+\eta\gamma)^{\lceil t/m\rceil m}r_{0}
≀2​γ​L​(1+η​γ)t​r0\displaystyle\leq 2\gamma L(1+\eta\gamma)^{t}r_{0}

where C_{1}:=\mathcal{O}(\log^{\frac{3}{2}}(d\mathscr{T}/\zeta)) (note that c_{1}=\mathcal{O}(\log^{\frac{3}{2}}(dm/\vartheta)) in Lemma 17). The second inequality uses the choice \eta\leq\frac{1}{8L\log(\mathscr{T})C_{1}}\leq\frac{1}{8L\log({t})C_{1}} and c_{2}\geq\frac{12C_{1}}{L}, C_{1}\geq 1. This completes the proof of the first condition and hence the inductive step, so both conditions hold for j=t.

The contradiction argument is then the same as in the proof of Lemma 12 and is hence omitted. Note that, similar to the argument in Lemma 12, we choose \mathscr{T}=2\log_{\alpha}\big{(}\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r}\big{)}/\delta so that the result still holds. Β 

Lemma 19 (Descent around saddle points)

Let xβˆˆβ„³x\in\mathcal{M} satisfy β€–grad​fℬ​(x)‖≀ϡ\|\emph{grad}f_{\mathcal{B}}(x)\|\leq\epsilon and Ξ»min​(βˆ‡2F^x​(0))\lambda_{\min}(\nabla^{2}\hat{F}_{x}(0)) β‰€βˆ’Ξ΄\leq-\delta. With the same settings as in Lemma 18, the two coupled sequences satisfy, with probability 1βˆ’ΞΆ1-\zeta,

βˆƒt≀𝒯,max⁑{F^x​(u0)βˆ’F^x​(ut),F^x​(u0β€²)βˆ’F^x​(utβ€²)}β‰₯2​ℱ,\exists\,t\leq\mathscr{T},\,\max\{\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}),\hat{F}_{x}(u_{0}^{\prime})-\hat{F}_{x}(u_{t}^{\prime})\}\geq 2\mathscr{F},

where β„±=Ξ΄32​c3​ρ2\mathscr{F}=\frac{\delta^{3}}{2c_{3}\rho^{2}}, c3=16​logα⁑(8​δ​dc2​ρ​ν​r)​c22c1​Lc_{3}=\frac{16\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}}{c_{1}L} and c1,c2c_{1},c_{2} are defined in Lemma 17 and 18 respectively.

ProofΒ  We will prove this result by contradiction. Suppose the contrary holds. That is,

βˆ€t≀𝒯,max⁑{F^x​(u0)βˆ’F^x​(ut),F^x​(u0β€²)βˆ’F^x​(utβ€²)}≀2​ℱ.\forall\,t\leq\mathscr{T},\,\max\{\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{t}),\hat{F}_{x}(u_{0}^{\prime})-\hat{F}_{x}(u_{t}^{\prime})\}\leq 2\mathscr{F}. (50)

Then we first notice that under this condition, both \{u_{t}\} and \{u_{t}^{\prime}\} do not escape the constraint ball \mathbb{B}_{x}(D). This is verified by contradiction. Assume, without loss of generality, that at some iteration j\leq\mathscr{T} we have \|u_{j}\|\geq D. In this case, \textsf{TSSRG}(x,u_{0},\mathscr{T}) returns R_{x}(u_{j-1}-\alpha\eta v_{j-1}) with \alpha\in(0,1) such that \|u_{j-1}-\alpha\eta v_{j-1}\|=D. Hence, by similar logic to (39),

D=\|u_{j-1}-\alpha\eta v_{j-1}\|\leq\|u_{j-1}-\alpha\eta v_{j-1}-u_{0}\|+\|u_{0}\|
\displaystyle\leq\sqrt{\frac{4j(\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{j}))}{c_{1}L}+\frac{2\eta j^{2}c_{1}^{\prime 2}\sigma^{2}}{c_{1}LB}}+r\leq\sqrt{\frac{8\mathscr{T}\mathscr{F}}{c_{1}L}+\frac{2\eta\mathscr{T}^{2}c_{1}^{\prime 2}\sigma^{2}}{c_{1}LB}}+r
≀8​logα⁑(8​δ​dc2​ρ​ν​r)δ​c1​Lβ‹…Ξ΄3c3​ρ2+8​η​logΞ±2⁑(8​δ​dc2​ρ​ν​r)​c1′⁣2​σ2Ξ΄2​c1​L​B+r≀δ22​c22​ρ2+logα⁑(8​δ​dc2​ρ​ν​r)​c1′⁣2​σ22​L2​c1​δ2​B+r\displaystyle\leq\sqrt{\frac{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})}{\delta c_{1}L}\cdot\frac{\delta^{3}}{c_{3}\rho^{2}}+\frac{8\eta\log_{\alpha}^{2}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{1}^{\prime 2}\sigma^{2}}{\delta^{2}c_{1}LB}}+r\leq\sqrt{\frac{\delta^{2}}{2c_{2}^{2}\rho^{2}}+\frac{\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{1}^{\prime 2}\sigma^{2}}{2L^{2}c_{1}\delta^{2}B}}+r
≀δ22​c22​ρ2+Ξ΄22​c22​ρ2+r=Ξ΄c2​ρ+r.\displaystyle\leq\sqrt{\frac{\delta^{2}}{2c_{2}^{2}\rho^{2}}+\frac{\delta^{2}}{2c_{2}^{2}\rho^{2}}}+r=\frac{\delta}{c_{2}\rho}+r.

where we set parameters, β„±=Ξ΄32​c3​ρ2,𝒯=2​logα⁑(8​δ​dc2​ρ​ν​r)/Ξ΄\mathscr{F}=\frac{\delta^{3}}{2c_{3}\rho^{2}},\mathscr{T}=2\log_{\alpha}\big{(}\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r}\big{)}/\delta, c3=16​logα⁑(8​δ​dc2​ρ​ν​r)​c22c1​Lc_{3}=\frac{16\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}}{c_{1}L}, η≀116​logα⁑(8​δ​dc2​ρ​ν​r)​L\eta\leq\frac{1}{16\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})L} and Bβ‰₯π’ͺ​(logα⁑(8​δ​dc2​ρ​ν​r)​c22​c1′⁣2​σ2L2​c1​ϡ2)=π’ͺ~​(Οƒ2Ο΅2)B\geq\mathcal{O}(\frac{\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}c_{1}^{\prime 2}\sigma^{2}}{L^{2}c_{1}\epsilon^{2}})=\widetilde{\mathcal{O}}(\frac{\sigma^{2}}{\epsilon^{2}}), ϡ≀δ2/ρ\epsilon\leq\delta^{2}/\rho. However, we know that Ξ΄c2​ρ+r≀D\frac{\delta}{c_{2}\rho}+r\leq D, which gives a contradiction. Therefore, we claim that {ut}\{u_{t}\} and {utβ€²}\{u_{t}^{\prime}\} stay within the constraint ball within 𝒯\mathscr{T} steps.

Also based on Lemma 18, assume β€–uTβˆ’u0β€–β‰₯Ξ΄c2​ρ\|u_{T}-u_{0}\|\geq\frac{\delta}{c_{2}\rho} for some T≀𝒯T\leq\mathscr{T}. Then from Lemma 16, we know that

F^x​(u0)βˆ’F^x​(uT)β‰₯c1​L4​T​‖uTβˆ’u0β€–2βˆ’Ξ·β€‹T​c1′⁣2​σ22​B\displaystyle\hat{F}_{x}(u_{0})-\hat{F}_{x}(u_{T})\geq\frac{c_{1}L}{4T}\|u_{T}-u_{0}\|^{2}-\frac{\eta Tc_{1}^{\prime 2}\sigma^{2}}{2B} β‰₯c1​L​δ38​logα⁑(8​δ​dc2​ρ​ν​r)​c22​ρ2βˆ’logα⁑(8​δ​dc2​ρ​ν​r)​c1′⁣2​σ2B​δ\displaystyle\geq\frac{c_{1}L\delta^{3}}{8\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}\rho^{2}}-\frac{\log_{\alpha}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{1}^{\prime 2}\sigma^{2}}{B\delta}
β‰₯2​δ3c3​ρ2βˆ’Ξ΄3c3​ρ2=Ξ΄3c3​ρ2=2​ℱ,\displaystyle\geq\frac{2\delta^{3}}{c_{3}\rho^{2}}-\frac{\delta^{3}}{c_{3}\rho^{2}}=\frac{\delta^{3}}{c_{3}\rho^{2}}=2\mathscr{F},

where we consider T≀𝒯,η≀1T\leq\mathscr{T},\eta\leq 1 and Bβ‰₯π’ͺ​(logΞ±2⁑(8​δ​dc2​ρ​ν​r)​c22​c1′⁣2​σ2Ο΅2)=π’ͺ~​(Οƒ2Ο΅2)B\geq\mathcal{O}(\frac{\log_{\alpha}^{2}(\frac{8\delta\sqrt{d}}{c_{2}\rho\nu r})c_{2}^{2}c_{1}^{\prime 2}\sigma^{2}}{\epsilon^{2}})=\widetilde{\mathcal{O}}(\frac{\sigma^{2}}{\epsilon^{2}}), ϡ≀δ2/ρ\epsilon\leq\delta^{2}/\rho. This contradicts (50) and hence the proof is complete. Β 

Lemma 20 (Escape stuck region)

Let x\in\mathcal{M} satisfy \|\emph{grad}f_{\mathcal{B}}(x)\|\leq\epsilon and \lambda_{\min}(\nabla^{2}\hat{F}_{x}(0))\leq-\delta. Given that the result in Lemma 19 holds and choosing the perturbation radius r\leq\min\{\frac{\delta^{3}}{4c_{3}\rho^{2}\epsilon},\sqrt{\frac{\delta^{3}}{2c_{3}\rho^{2}L}}\}, we have a sufficient decrease in function value with high probability:

F​(TSSRG​(x,u0,𝒯))βˆ’F​(x)β‰€βˆ’β„±Β with probability ​1βˆ’Ξ½{F}(\textsf{TSSRG}(x,u_{0},\mathscr{T}))-F(x)\leq-\mathscr{F}\quad\text{ with probability }1-\nu

Proof  The proof is exactly the same as that of Lemma 14 and is hence omitted. One remark is that when the small-gradient condition \|\text{grad}f_{\mathcal{B}}(x)\|\leq\epsilon is triggered, we already have \|\text{grad}F(x)\|\leq\epsilon/2 with high probability from Lemma 17. Β 