Escaping Saddle Points Faster on Manifolds via Perturbed Riemannian Stochastic Recursive Gradient
Abstract
In this paper, we propose a variant of the Riemannian stochastic recursive gradient method that achieves a second-order convergence guarantee and escapes saddle points using simple perturbation. The idea is to perturb the iterates when the gradient is small and to carry out stochastic recursive gradient updates over the tangent space. This avoids the complication of exploiting Riemannian geometry. We show that, under the finite-sum setting, the number of stochastic gradient queries our algorithm requires to find an $(\epsilon,\delta)$-second-order critical point strictly improves on that of perturbed Riemannian gradient descent and is superior to that of perturbed Riemannian accelerated gradient descent under large-sample settings. We also provide a complexity bound for online optimization, which is novel on Riemannian manifolds in terms of second-order convergence using only first-order information.
1 Introduction
Consider the following finite-sum and online optimization problem defined on a Riemannian manifold $\mathcal{M}$:
$$\min_{x \in \mathcal{M}} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where each component $f_i : \mathcal{M} \to \mathbb{R}$ is a non-convex function. The finite-sum setting is a special case of the online setting in which $f$ can be finitely sampled. For simplicity of analysis, we only consider the finite-sum formulation and refer to it as online optimization when $n$ approaches infinity. Solving problem (1) globally on Euclidean space (i.e., when $\mathcal{M} = \mathbb{R}^d$) is NP-hard in general, let alone on an arbitrary Riemannian manifold. Thus many algorithms set out to find only approximate first-order critical points, which have small gradients. But this is usually insufficient because, for non-convex problems, a point with small gradient can be near a local minimum, a local maximum, or a saddle point. Therefore, to avoid being trapped by saddle points (and possibly local maxima), we need to find approximate second-order critical points (see Definition 1), which have small gradients as well as nearly positive semi-definite Hessians.
Second-order algorithms can exploit curvature information via the Hessian and thus escape saddle points by construction. However, it is desirable to design simpler algorithms that use only gradient information for this purpose, because access to the Hessian can be costly and is sometimes unavailable in real applications. It turns out that gradient descent (GD) can inherently escape saddle points, yet it may require exponential time (Du et al., 2017). Instead, Jin et al. (2017) showed that by injecting isotropic noise into the iterates, perturbed gradient descent (PGD) can reach approximate $\epsilon$-second-order stationarity within $\widetilde{O}(\epsilon^{-2})$ gradient queries with high probability. Similarly, a perturbed variant of stochastic gradient descent (PSGD) requires a complexity of $\widetilde{O}(\epsilon^{-4})$ (Jin et al., 2019). These complexities match those of vanilla GD and SGD for finding approximate first-order stationary points, up to poly-logarithmic factors.
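To make the perturbation mechanism concrete, here is a minimal sketch of the idea in the Euclidean case. All names, step sizes, and thresholds below are illustrative assumptions of ours, not the tuned constants of (Jin et al., 2017):

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta=1e-2, eps=1e-3, radius=1e-2,
                 escape_steps=100, max_iters=10_000, seed=0):
    """Sketch of perturbed gradient descent: when the gradient is small,
    inject isotropic noise sampled uniformly from a small ball, then keep
    taking plain gradient steps so the iterate can slide off the saddle."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    last_perturb = -escape_steps          # allow a perturbation immediately
    for t in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps and t - last_perturb >= escape_steps:
            direction = rng.normal(size=d)
            direction /= np.linalg.norm(direction)
            # Uniform sample from the ball of the given radius.
            x = x + direction * radius * rng.uniform() ** (1.0 / d)
            last_perturb = t
        x = x - eta * g
    return x
```

The waiting period between perturbations mirrors the analysis: after injecting noise, the algorithm needs a number of plain descent steps for the negative-curvature direction to amplify.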
When considering problem (1) on an arbitrary manifold, we require the Riemannian gradient as well as a retraction that allows the iterates to be updated on the manifold along directions determined by the Riemannian gradient. Defining a "pullback" function $\hat{f}_x := f \circ R_x$, Criscitiello and Boumal (2019) extended the idea of perturbed gradient descent (referred to as PRGD) by executing perturbations and the subsequent gradient steps on the tangent space. Given that $\hat{f}_x$ is naturally defined on a vector space, the analysis largely carries over from (Jin et al., 2017), with similar convergence guarantees. Recently, Criscitiello and Boumal (2020) also proposed perturbed Riemannian accelerated gradient descent (PRAGD), a direct generalization of its Euclidean counterpart originally proposed in (Jin et al., 2018). It achieves an even lower complexity of $\widetilde{O}(\epsilon^{-7/4})$ compared to perturbed GD. However, these algorithms are sub-optimal under finite-sum settings and even fail to work for online problems where the full gradient is inaccessible.
When the objective can be decomposed into component functions as in problem (1), variance reduction techniques can improve the gradient complexity of both GD and SGD (Reddi et al., 2016; Nguyen et al., 2017b; Fang et al., 2018). The main idea is to exploit past gradient information to correct for the deviation of the current stochastic gradients, by comparing either to a fixed snapshot point (SVRG, Reddi et al. (2016)) or recursively to previous iterates (SRG/SPIDER, Nguyen et al. (2017a); Fang et al. (2018)). Notably, the gradient complexities $O(\sqrt{n}\epsilon^{-2})$ and $O(\epsilon^{-3})$ achieved by SRG and SPIDER are optimal for first-order guarantees under the finite-sum and online settings respectively (Fang et al., 2018; Arjevani et al., 2019). Motivated by the success of variance reduction and the simplicity of adding perturbation to ensure second-order convergence, Li (2019) proposed the perturbed stochastic recursive gradient method (PSRG), which can escape saddle points faster than perturbed GD and SGD. In this paper, we generalize PSRG to optimization problems defined on Riemannian manifolds with the following contributions:
• Our proposed method PRSRG is the first simple stochastic algorithm that achieves second-order guarantees on Riemannian manifolds using only first-order information. That is, the algorithm only requires simple perturbations and does not involve any negative curvature exploitation as in PRAGD. Our algorithm adopts the idea of tangent space steps as in (Criscitiello and Boumal, 2019), which results in a simple convergence analysis.
• Our complexity is measured against $(\epsilon,\delta)$-second-order stationarity, which is more general than $\epsilon$-second-order stationarity, where $\delta = \sqrt{\rho\epsilon}$. Under the finite-sum setting, PRSRG strictly improves the complexity of PRGD and is superior to PRAGD for large-sample problems. We also provide a complexity bound for PRSRG under the online setting, which is novel for Riemannian optimization.
2 Other related work
Following the work of Jin et al. (2017), besides perturbed SGD (Jin et al., 2019) and the perturbed accelerated gradient method (Jin et al., 2018), Ge et al. (2019) showed that SVRG with perturbation (PSVRG) suffices to escape saddle points, and a stabilized version was also developed to further improve the dependency on the Hessian Lipschitz constant. However, its complexity is strictly worse than that of PSRG (Li, 2019). We suspect that, with a similar tangent space trick, PSVRG can be generalized to Riemannian manifolds with little effort.
Another line of research incorporates negative curvature search subroutines (Xu et al., 2018; Allen-Zhu and Li, 2018) into classic first-order algorithms, such as GD and SGD, as well as variance-reduction algorithms including SVRG (Allen-Zhu and Li, 2018), SNVRG (Zhou et al., 2018) and SPIDER (Fang et al., 2018). This usually requires access to Hessian-vector products when searching for the direction of the smallest eigenvector (Xu et al., 2018). Even though Allen-Zhu and Li (2018) managed to use only first-order information for the same goal, their method still involves an iterative process that can be difficult to implement in practice.
To solve general optimization problems on Riemannian manifolds, gradient descent and stochastic gradient descent have been generalized with provable guarantees (Boumal et al., 2019; Bonnabel, 2013; Hosseini and Sra, 2020). Acceleration and variance reduction techniques have also been considered for speeding up the convergence of GD and SGD (Zhang and Sra, 2018; Ahn and Sra, 2020; Zhang et al., 2016; Kasai et al., 2018). These first-order methods, however, can only guarantee first-order convergence. To find second-order critical points, Newton-type methods, particularly the well-known globalized variants, trust region and cubic regularization, have been extended to manifold optimization with similar second-order guarantees (Boumal et al., 2019; Agarwal et al., 2018).
Finally, we note that Sun et al. (2019) also proposed perturbed gradient descent on manifolds. There, the perturbations and the subsequent updates are performed directly on the manifold, in sharp contrast to the tangent space steps in this paper. We prefer tangent space steps to manifold steps mainly because they simplify the analysis: Sun et al. (2019) require more sophisticated geometric results as well as the use of the exponential map, given its natural connection to geodesics and Riemannian distance. By executing all steps on the tangent space, we can bypass these results, although some regularity conditions must be carefully managed.
3 Preliminaries
Here we start with a short review of some preliminary definitions. Readers can refer to (Absil et al., 2009) for more detailed discussions of manifold geometry. We consider $\mathcal{M}$ to be a $d$-dimensional Riemannian manifold. The tangent space $T_x\mathcal{M}$ at $x \in \mathcal{M}$ is a $d$-dimensional vector space. $\mathcal{M}$ is equipped with an inner product $\langle \cdot, \cdot \rangle_x$ and a corresponding norm $\|\cdot\|_x$ on each tangent space $T_x\mathcal{M}$. The tangent bundle is the union of the tangent spaces, defined as $T\mathcal{M} := \{(x, v) : x \in \mathcal{M}, v \in T_x\mathcal{M}\}$. The Riemannian gradient of a function $f : \mathcal{M} \to \mathbb{R}$ is the unique vector field $\mathrm{grad} f$ that satisfies $\langle \mathrm{grad} f(x), v \rangle_x = \mathrm{D} f(x)[v]$ for all $v \in T_x\mathcal{M}$, where $\mathrm{D} f(x)[v]$ is the directional derivative of $f$ along $v$. The Riemannian Hessian of $f$ is the covariant derivative of $\mathrm{grad} f$: for all $v \in T_x\mathcal{M}$, it satisfies $\mathrm{Hess} f(x)[v] = \tilde{\nabla}_v\, \mathrm{grad} f(x)$, where $\tilde{\nabla}$ is the Riemannian connection (also known as the Levi-Civita connection). Note that we also use $\nabla$ to represent differentiation on a vector space, which is a special case of the covariant derivative.
A retraction $R_x : T_x\mathcal{M} \to \mathcal{M}$ maps a tangent vector to the manifold while satisfying (i) $R_x(0) = x$, and (ii) $\mathrm{D}R_x(0)$ is the identity map, where $\mathrm{D}R_x(u)$ denotes the differential of the retraction. The exponential map is a special instance of retraction, and hence our results in this paper trivially extend to this particular retraction. Define the pullback function $\hat{f}_x := f \circ R_x$; that is, $\hat{f}_x(u) = f(R_x(u))$ for $u \in T_x\mathcal{M}$, where we also define the pullback components as $\hat{f}_{i,x} := f_i \circ R_x$. Given that the domain of the pullback function is a vector space, we can represent its gradient and Hessian in terms of the Riemannian gradient and Hessian as well as the differentiated retraction, as in the following lemma (see Lemma 2.5 in Criscitiello and Boumal (2020)).
Lemma 1 (Gradient and Hessian of the pullback)
For a twice continuously differentiable function $f$ and $u \in T_x\mathcal{M}$, we have
$$\nabla \hat{f}_x(u) = \mathrm{D}R_x(u)^*[\mathrm{grad} f(R_x(u))], \qquad \nabla^2 \hat{f}_x(u) = \mathrm{D}R_x(u)^* \circ \mathrm{Hess} f(R_x(u)) \circ \mathrm{D}R_x(u) + W_u,$$
where $\mathrm{D}R_x(u)^*$ denotes the adjoint operator of $\mathrm{D}R_x(u)$ and $W_u$ is a symmetric linear operator defined by $\langle W_u[\dot{u}], \dot{u} \rangle = \langle \mathrm{grad} f(R_x(u)), c''(0) \rangle$, with $c(t) := R_x(u + t\dot{u})$ a perturbation to $u$.
If the retraction is a second-order retraction, $\nabla \hat{f}_x(0)$ and $\nabla^2 \hat{f}_x(0)$ match the Riemannian gradient and Hessian at $x$; see Lemma 8 (Appendix B) for more details. A vector transport $\mathcal{T}$ with respect to a retraction $R$ is a linear map that satisfies (i) $\mathcal{T}_u v \in T_{R_x(u)}\mathcal{M}$ for all $v \in T_x\mathcal{M}$; (ii) $\mathcal{T}_0 v = v$. In this paper, we only consider isometric vector transport, which satisfies $\langle \mathcal{T}_u v, \mathcal{T}_u w \rangle = \langle v, w \rangle$ for all $v, w \in T_x\mathcal{M}$.
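As a concrete illustration of the pullback and of the correspondence at the origin (Lemma 8 below), the following sketch works on the unit sphere with the metric-projection retraction, which is a second-order retraction; the quadratic objective and all names are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d)); A = (A + A.T) / 2       # arbitrary symmetric matrix

f = lambda y: 0.5 * y @ A @ y                        # smooth objective on the sphere
egrad = lambda y: A @ y                              # Euclidean gradient of f

x = rng.normal(size=d); x /= np.linalg.norm(x)       # a point on S^{d-1}
proj = lambda v: v - (x @ v) * x                     # orthogonal projector onto T_x S^{d-1}
retract = lambda u: (x + u) / np.linalg.norm(x + u)  # metric-projection retraction R_x

fhat = lambda u: f(retract(u))                       # pullback: fhat_x = f o R_x
rgrad = proj(egrad(x))                               # Riemannian gradient grad f(x)

# Central finite differences of the pullback at the origin of T_x S^{d-1}.
h = 1e-6
num_grad = np.array([(fhat(h * proj(e)) - fhat(-h * proj(e))) / (2 * h)
                     for e in np.eye(d)])
print(np.allclose(num_grad, rgrad, atol=1e-4))       # True: grad fhat_x(0) = grad f(x)
```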
Below are some common notations used throughout this paper.
Notation. Denote $\lambda_{\min}(\cdot)$ as the minimum eigenvalue of a symmetric operator and use $\|\cdot\|$ to represent either the vector norm or the spectral norm of a matrix. We write $f(n) = O(g(n))$ if there exist a positive constant $c$ and an $n_0$ such that $f(n) \le c\, g(n)$ for all $n \ge n_0$, and use $\widetilde{O}(\cdot)$ to hide poly-logarithmic factors. Let $\mathbb{B}_x(r)$ be the Euclidean ball of radius $r$ centred at the origin of the tangent space $T_x\mathcal{M}$; that is, $\mathbb{B}_x(r) := \{u \in T_x\mathcal{M} : \|u\| \le r\}$. We use $\mathrm{Unif}(\mathbb{B}_x(r))$ to denote the uniform distribution on this set. Denote $[n] := \{1, \dots, n\}$ and let $\nabla \hat{f}_{\mathcal{I},x} := \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \nabla \hat{f}_{i,x}$ be a mini-batch gradient of the pullback component functions, where $\mathcal{I} \subseteq [n]$ has cardinality $|\mathcal{I}|$. Similarly, $\mathrm{grad} f_{\mathcal{I}}$ represents a mini-batch Riemannian gradient. Finally, we denote $\log(\cdot)$ as the natural logarithm and $\log_a(\cdot)$ as the logarithm with base $a$.
3.1 Assumptions
Now we state the main assumptions as follows. The first assumption is required to bound function decrease and to ensure existence of stationary points on the manifold.
Assumption 1 (Lower bounded function)
$f$ is lower bounded; that is, there exists $f^* > -\infty$ such that $f(x) \ge f^*$ for all $x \in \mathcal{M}$.
Following (Criscitiello and Boumal, 2019), we need to impose some Lipschitzness conditions on the pullback, because the main gradient steps (see Algorithm 2) are performed on the tangent space with respect to the pullback function. The next assumption requires both gradient and Hessian Lipschitzness of the pullback component functions $\hat{f}_{i,x}$. Note that we only require the condition to hold with respect to the origin of $T_x\mathcal{M}$, within a constraint ball around the origin. This assumption is much weaker than requiring Lipschitz continuity over the entire tangent space.
Assumption 2 (Gradient and Hessian Lipschitz)
The gradient and Hessian of each pullback component function $\hat{f}_{i,x}$ are Lipschitz with respect to the origin. That is, there exist constants $L$ and $\rho$ such that, for all $u$ inside the constraint ball,
$$\|\nabla \hat{f}_{i,x}(u) - \nabla \hat{f}_{i,x}(0)\| \le L \|u\|, \qquad \|\nabla^2 \hat{f}_{i,x}(u) - \nabla^2 \hat{f}_{i,x}(0)\| \le \rho \|u\|.$$
This also implies that the gradient and Hessian of the pullback function $\hat{f}_x$ satisfy the same Lipschitzness conditions with respect to the origin.
Based directly on this assumption, we can show (in Lemma 2) that the gradient of the pullback function is Lipschitz continuous within the constraint ball. This result implies smoothness of the pullback function and is fundamental for analysing first-order optimization algorithms.
Lemma 2 ($\ell$-Lipschitz continuous)
Under Assumption 2, for all $x \in \mathcal{M}$, there exists a constant $\ell$ such that each $\nabla \hat{f}_{i,x}$ is $\ell$-Lipschitz continuous inside the constraint ball. This also implies that $\nabla \hat{f}_x$ is $\ell$-Lipschitz continuous; that is, for any $u, v$ inside the ball, we have $\|\nabla \hat{f}_x(u) - \nabla \hat{f}_x(v)\| \le \ell \|u - v\|$.
The next assumption ensures the correspondence between the gradient and Hessian of the pullback at the origin and those of the original function $f$. Note that, similar to (Criscitiello and Boumal, 2019), we can relax this assumption to bounded initial acceleration of the retraction curves, in which case the results only differ by some constants.
Assumption 3 (Second-order retraction)
$R$ is a second-order retraction; that is, for all $x \in \mathcal{M}$ and $u \in T_x\mathcal{M}$, the retraction curve $t \mapsto R_x(tu)$ has zero initial acceleration: $\frac{\mathrm{D}^2}{\mathrm{d}t^2} R_x(tu)\big|_{t=0} = 0$.
The following assumption is needed to bound the difference between the differentiated retraction and the vector transport. Although our algorithm does not require vector transport, as all updates are completed on the tangent space, the main purpose of this assumption is to establish a bound on the singular values of the differentiated retraction (Lemma 3). The lemma can then be used to relate the norm of the pullback gradient $\nabla \hat{f}_x(u)$ to that of the Riemannian gradient at $R_x(u)$.
Assumption 4 (Difference between differentiated retraction and vector transport)
For any $x \in \mathcal{M}$, there exists a neighbourhood of $x$ such that, for all points in this neighbourhood and all tangent vectors $u$ under consideration, there exists a uniform constant $c_0$ with
$$\|\mathrm{D}R_x(u) - \mathcal{T}_u\| \le c_0 \|u\|.$$
Lemma 3 (Singular value bound of differentiated retraction)
For all $u \in T_x\mathcal{M}$ with $\|u\|$ sufficiently small relative to the constant $c_0$ of Assumption 4, the singular values of $\mathrm{D}R_x(u)$ are bounded below by a positive constant.
It is not difficult to satisfy these assumptions. For compact Riemannian manifolds with a second-order retraction and a three-times continuously differentiable function $f$, Assumptions 1, 2 and 3 are easily satisfied (see Lemma 3.2 in (Criscitiello and Boumal, 2019)). Assumption 4 can be guaranteed by requiring the vector transport to be constructed from the Taylor expansion of the retraction (Huang et al., 2015).
Apart from the main assumptions, one additional assumption of bounded variance is particularly important for solving online problems.
Assumption 5 (Uniform bounded variance)
The variance of the gradients of the pullback component functions is bounded uniformly by $\sigma^2$; that is, for all $x \in \mathcal{M}$ and all $i$,
$$\|\nabla \hat{f}_{i,x}(u) - \nabla \hat{f}_x(u)\|^2 \le \sigma^2$$
holds for any $u$ inside the constraint ball.
This assumption is more stringent than the standard variance bound, which holds only in expectation. However, we can relax it by requiring sub-Gaussian tails, which is sufficient to achieve a high-probability bound (Li, 2019). Lastly, we conclude this section by defining second-order critical points as follows.
Definition 1 ($(\epsilon,\delta)$-second-order and $\epsilon$-second-order critical points)
$x \in \mathcal{M}$ is an $(\epsilon,\delta)$-second-order critical point if
$$\|\mathrm{grad} f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}(\mathrm{Hess} f(x)) \ge -\delta.$$
It is an $\epsilon$-second-order critical point if $\|\mathrm{grad} f(x)\| \le \epsilon$ and $\lambda_{\min}(\mathrm{Hess} f(x)) \ge -\sqrt{\rho\epsilon}$, where $\rho$ is the Hessian Lipschitz constant. The second definition is a special case of the first with $\delta = \sqrt{\rho\epsilon}$.
4 Algorithm
In this section, we introduce the perturbed Riemannian stochastic recursive gradient method (PRSRG) in Algorithm 1, whose main updates are performed by the tangent space stochastic recursive gradient method (TSSRG) in Algorithm 2. The key idea is simple: when the gradient is large, we repeatedly execute standard recursive gradient iterations (i.e., an epoch) on the tangent space before retracting back to the manifold. Essentially, we are minimizing the pullback function within the constraint ball by SRG updates, which translates to minimizing $f$ within a neighbourhood of the current point.
This process repeats until the gradient is small. Then we perform the same SRG updates, but for at most a prescribed number of iterations, starting from a perturbed iterate with isotropic noise added on the tangent space. Notice that the small-gradient condition in Line 3 of Algorithm 1 is examined against a batch Riemannian gradient, where the batch contains samples drawn without replacement from $[n]$. This ensures that the full batch gradient is computed under the finite-sum setting, where we choose the batch to be the full sample set. Under the online setting, when $n$ approaches infinity, access to the full gradient is unavailable and we can therefore only rely on a large-batch gradient.
TSSRG is mainly based on the stochastic recursive gradient algorithm of (Nguyen et al., 2017b). The algorithm adopts a double-loop structure. At the start of each outer loop (called an epoch), a full gradient (or a large-batch gradient for online optimization) is evaluated. Within each inner loop, mini-batch gradients are computed at the current iterate $u_t$ and at the previous iterate $u_{t-1}$. Then the modified gradient $v_t$ at $u_t$ is constructed recursively from $v_{t-1}$ and the difference between the two mini-batch gradients. That is,
$$v_t = \nabla \hat{f}_{\mathcal{I}_t,x}(u_t) - \nabla \hat{f}_{\mathcal{I}_t,x}(u_{t-1}) + v_{t-1}. \qquad (2)$$
Note that we do not require any vector transport, since all gradients (of the pullback) are defined on the same tangent space. Hence, TSSRG is very similar to Euclidean SRG, differing only in its stopping criteria, discussed as follows. When the gradient is large, we perform at most one epoch of updates and break the loop with a probability that depends on the index of the inner iteration (Line 12, Algorithm 2). This stopping rule is equivalent to uniformly selecting the output at random from the iterates within this epoch. It ensures that either the small-gradient condition is triggered or a sufficient decrease is achieved; more details can be found in Section 6. When the gradient is small (i.e., around a saddle point), we only break once the maximum number of iterations has been reached.
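To see why a per-step stopping rule yields a uniformly selected output, consider one natural choice (an illustrative assumption on our part; the exact rule is specified in Algorithm 2): break at inner iteration $j \in \{0, \dots, m-1\}$ with probability $1/(m-j)$. Then
$$\mathbb{P}(\text{output is iterate } j) = \frac{1}{m-j} \prod_{k=0}^{j-1} \Big(1 - \frac{1}{m-k}\Big) = \frac{1}{m-j} \cdot \frac{m-j}{m} = \frac{1}{m},$$
since the product telescopes: $\prod_{k=0}^{j-1} \frac{m-k-1}{m-k} = \frac{m-j}{m}$. Every iterate in the epoch is thus returned with equal probability.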
Finally, we note that the Lipschitzness conditions are only guaranteed within a constraint ball of fixed radius, while taking updates on the tangent space can violate this requirement. Therefore, in Line 9 (Algorithm 2), we explicitly control the deviation of the iterates from the origin and break the loop as soon as an iterate leaves the ball, returning a point on the boundary of the ball. By carefully balancing the parameters, we can show that whenever the iterates escape the constraint ball, the function value has already decreased substantially.
To simplify notation, most of the time we refer to Algorithm 2 as TSSRG($\cdot$). A schematic implementation of both routines is sketched below.
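The sketch below is our paraphrase of Algorithms 1 and 2, not the authors' exact pseudocode: the interface `pullback_grads`, the stopping rule, and all parameter names are illustrative assumptions, and the tangent space is identified with $\mathbb{R}^d$.

```python
import numpy as np

def tssrg(x, pullback_grads, retract, n, m, b, eta, ball_radius, T, rng, xi=None):
    """Sketch of TSSRG (Algorithm 2): SRG updates on the tangent space T_x M.

    pullback_grads(idx, u): mini-batch gradient of the pullback components
    {fhat_{i,x}} at u, averaged over the index set idx (assumed interface).
    """
    u = np.zeros_like(x) if xi is None else xi        # optional perturbation
    v = pullback_grads(np.arange(n), u)               # full/large-batch gradient
    for t in range(T):
        if t > 0 and t % m == 0:                      # start of a new epoch
            v = pullback_grads(np.arange(n), u)
        u_next = u - eta * v
        if np.linalg.norm(u_next) > ball_radius:      # Lipschitzness holds only in the ball:
            u_next *= ball_radius / np.linalg.norm(u_next)
            return retract(x, u_next)                 # return a boundary point
        idx = rng.choice(n, size=b, replace=False)
        # Recursive estimator (2): correct v by a mini-batch difference.
        v = pullback_grads(idx, u_next) - pullback_grads(idx, u) + v
        u = u_next
        # Probabilistic break <=> uniform output over the epoch (large-gradient mode).
        if xi is None and rng.random() < 1.0 / (m - (t % m)):
            break
    return retract(x, u)

def prsrg(x0, riem_grad_norm, make_pullback_grads, retract, n, m, b, eta,
          ball_radius, perturb_radius, T_escape, eps, rounds, rng):
    """Sketch of PRSRG (Algorithm 1): perturb whenever the gradient is small."""
    x = x0
    for _ in range(rounds):
        grads = make_pullback_grads(x)                # pullback components at x
        if riem_grad_norm(x) > eps:                   # large gradient: one SRG epoch
            x = tssrg(x, grads, retract, n, m, b, eta, ball_radius, T=m, rng=rng)
        else:                                         # small gradient: perturb, then escape
            d = x.size
            xi = rng.normal(size=d)
            xi *= perturb_radius * rng.uniform() ** (1.0 / d) / np.linalg.norm(xi)
            x = tssrg(x, grads, retract, n, m, b, eta, ball_radius,
                      T=T_escape, rng=rng, xi=xi)
    return x
```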
5 Main results
In this section, we present the main complexity results of PRSRG in finding second-order stationary points.
Under the finite-sum setting, we choose the check batch to be the full sample set, so Line 3 uses the exact gradient. We set the parameters as in (3), and we also require the first-order tolerance to be suitably small. This is not a strict condition, given that the step size can be chosen arbitrarily small. The second-order tolerance $\delta$ needs to be smaller than the Lipschitz constant $L$ in Assumption 2. Otherwise, when $\delta \ge L$, any $x$ with small gradient is already an $(\epsilon,\delta)$-second-order stationary point, because Assumption 2 bounds the smallest eigenvalue of the pullback Hessian from below by $-L \ge -\delta$.
Theorem 1 (Finite-sum complexity)
Up to some constants and poly-log factors, the complexity in Theorem 1 is exactly the same as when the optimization domain is a vector space (i.e., $\mathcal{M} = \mathbb{R}^d$). Indeed, our result is a direct generalization of the Euclidean counterpart, where the retraction reduces to the identity and the Lipschitzness conditions are made with respect to the iterates themselves. Setting the same parameters as in (3), except for the conditions that become vacuous in the Euclidean case, we recover the Euclidean perturbed stochastic recursive gradient algorithm (Li, 2019).
Under online setting, the complexities of stochastic gradient queries are presented below.
Theorem 2 (Online complexity)
Note that the complexities in this paper are analysed in terms of achieving $(\epsilon,\delta)$-second-order stationarity. Some literature prefers to choose $\delta = \sqrt{\rho\epsilon}$ to match the units of gradient and Hessian, following the work in (Nesterov and Polyak, 2006). In this case, our complexities reduce accordingly for finite-sum and online problems. Compared to the optimal rates of $O(\sqrt{n}\epsilon^{-2})$ and $O(\epsilon^{-3})$ for finding first-order stationary points, our rates are sub-optimal even if we ignore the poly-log factors. This nevertheless appears to be the case for all stochastic variance reduction methods (Ge et al., 2019; Li, 2019).
6 High-level proof ideas
Now we provide a high-level proof roadmap for the second-order convergence guarantees of our proposed method. The main proof strategies are similar to those in (Li, 2019). However, we need to carefully handle the particularities of manifold geometry as well as manage the regularity conditions unique to this setting. We focus on finite-sum problems and only highlight the key differences under the online setting.
6.1 Finite-sum setting
We first show how stochastic recursive gradient updates achieve sublinear convergence in expectation by periodically computing a large-batch gradient. From the smoothness of the pullback function, we have
$$\hat{f}_x(u_{t+1}) \le \hat{f}_x(u_t) - \frac{\eta}{2}\|\nabla \hat{f}_x(u_t)\|^2 - \Big(\frac{\eta}{2} - \frac{\ell\eta^2}{2}\Big)\|v_t\|^2 + \frac{\eta}{2}\|v_t - \nabla \hat{f}_x(u_t)\|^2. \qquad (4)$$
To achieve first-order stationarity, it is sufficient to bound the variance term $\|v_t - \nabla \hat{f}_x(u_t)\|^2$ in expectation (see Nguyen et al. (2017b)). This suggests that the variance is gradually reduced when approaching optimality, where the gradient is small. Then, by carefully choosing the parameters, we can show that a single epoch satisfies
$$\mathbb{E}\big[\hat{f}_x(u_m)\big] \le \hat{f}_x(u_0) - \frac{\eta}{2} \sum_{t=0}^{m-1} \mathbb{E}\|\nabla \hat{f}_x(u_t)\|^2. \qquad (5)$$
Telescoping this result over all epochs and choosing the output uniformly at random from all iterates, we can guarantee that the output is an approximate first-order stationary point. This gives the optimal stochastic gradient complexity of $O(\sqrt{n}\epsilon^{-2})$ by choosing the epoch length and mini-batch size on the order of $\sqrt{n}$.
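For intuition, the arithmetic behind this choice (our back-of-the-envelope version, with constants suppressed) reads: with epoch length $m = \sqrt{n}$ and mini-batch size $b = \sqrt{n}$, each epoch costs $n + mb = 2n$ stochastic gradients over $m$ inner steps, i.e., an amortized $O(\sqrt{n})$ per step; since roughly $O(\Delta_f/(\eta\epsilon^2))$ steps suffice to drive the gradient norm below $\epsilon$, the total cost is
$$O\Big(\frac{n + mb}{m} \cdot \frac{\Delta_f}{\eta\epsilon^2}\Big) = O\Big(\frac{\sqrt{n}\,\Delta_f}{\eta\epsilon^2}\Big) = O\big(\sqrt{n}\,\epsilon^{-2}\big).$$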
To achieve second-order stationarity, the algorithm goes through two phases: large gradients and around saddle points. We present two informal lemmas corresponding to these two phases, with parameter settings omitted; see the appendix for more details.
Lemma 4
When the current iterate has a large gradient, running TSSRG for a single epoch gives three possible cases:
1. When the iterates do not leave the constraint ball:
(a) If at least half of the iterates in the epoch have pullback gradient norm below the threshold, then with probability at least 1/2, the output satisfies the small-gradient condition.
(b) Otherwise, with constant probability, the function value decreases sufficiently.
2. When one of the iterates leaves the constraint ball, with high probability, the function value decreases sufficiently, where the failure probability can be made arbitrarily small.
No matter which case occurs, the function value does not increase, with high probability.
Lemma 5
When the current iterate is around a saddle point, i.e., it has a small gradient and the smallest eigenvalue of the Hessian is below $-\delta$, running TSSRG from a perturbed iterate gives a sufficient decrease of the function value with high probability. That is, the function value decreases by at least a prescribed threshold, whose explicit form is deferred to the appendix.
Lemma 4 claims that when the gradient is large (phase 1), the output after running TSSRG for a single epoch either has a small gradient (Case 1a) or reduces the function value by a sufficient amount (Case 1b), with high probability. Note that we need to explicitly address the case when the iterates leave the constraint ball (Case 2). In this case, we show that the function value already decreases by the same desired amount, given that the first-order tolerance is chosen reasonably small, as in Theorem 1. Note that for Case 1a, the output satisfies the small-gradient condition and hence is immediately followed by perturbation and the follow-up updates in TSSRG. Notice that we can only show that the pullback gradient is small; to connect it to the Riemannian gradient, we need the singular value bound on the differentiated retraction in Lemma 3. In the other cases, the function value decreases sufficiently with high probability. As a result, given that the optimality gap is uniformly bounded, we can bound the number of such descent epochs, and with the chosen epoch length, mini-batch size and step size, the corresponding stochastic gradient complexity follows.
In phase 2, where the gradient is small, the current iterate is either already a second-order stationary point or around a saddle point. Lemma 5 states that running TSSRG from any perturbation within the perturbation ball decreases the function value by a sufficient amount with high probability. Again, since the optimality gap is bounded, the number of such escape epochs is bounded, and similarly for the number of stochastic gradient queries under the chosen parameters. Combining the complexities of phase 1 and phase 2 gives the result. For a more rigorous analysis, we need to consider the number of wasted epochs, in which neither the function decreases sufficiently nor the gradient of the output is small. The complexity of such epochs, however, turns out not to exceed the complexities established above. Detailed proofs are included in Appendix C.1.
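The epoch-counting arithmetic behind both phases can be summarized schematically (this is our abstraction; $E_1$ and $E_2$ stand for the per-epoch decreases guaranteed by Lemmas 4 and 5, whose explicit values are set in the appendix):
$$\#\{\text{descent epochs}\} \le \frac{\Delta_f}{E_1}, \qquad \#\{\text{escape epochs}\} \le \frac{\Delta_f}{E_2}, \qquad \Delta_f := f(x_0) - f^*,$$
and since each epoch costs at most $n + mb$ stochastic gradient queries, the total complexity is of order
$$(n + mb)\Big(\frac{\Delta_f}{E_1} + \frac{\Delta_f}{E_2}\Big).$$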
Next we will briefly explain how Lemma 4 (large gradient phase) and Lemma 5 (around saddle point phase) are derived.
Large gradient phase. The key result underlying Lemma 4 is a high-probability version of (5). To this end, we first need a high-probability bound on the variance term $\|v_t - \nabla \hat{f}_x(u_t)\|$. It is not difficult to verify that the estimation error forms a martingale sequence. As required by the Azuma-Hoeffding inequality (Lemma 7, Appendix A), in order to bound this martingale, we need to bound its difference sequence, which can be done by applying the vector Bernstein inequality (Lemma 6, Appendix A). After bounding the variance term, we can substitute this result into (4) to obtain
(6)
for all inner iterations. Note that we always call TSSRG for only one epoch at a time; therefore, it is sufficient to consider the first epoch in TSSRG. Next, the analysis divides according to whether the iterates leave the constraint ball. When all iterates stay within the boundary of the ball, inequality (6) suggests that if at least half of the iterates in this epoch have large gradients, then we obtain a sufficient decrease; otherwise, the output, selected uniformly from the iterates of this epoch, has a small gradient with high probability. On the other hand, when one of the iterates escapes the constraint ball, we can still show a sufficient decrease by a localization lemma (Lemma 10, Appendix C.2), derived from (4) and the high-probability bound on the variance term, which controls the distance travelled by the iterates in terms of the accumulated function value decrease. This bound implies that if the iterates are far from the origin, the function value has already decreased a lot.
Around saddle point phase. When the current iterate is around a saddle point, we need to show that the objective can still decrease at a reasonable rate with high probability. At a high level, we adopt the same coupling-sequence analysis originally introduced in (Jin et al., 2017). Define the stuck region as the set of perturbations from which running TSSRG does not give a sufficient function value decrease. Consider two initializations that differ only along the direction of the smallest eigenvector of the Hessian at the origin, with an offset that can be chosen arbitrarily small. Then we can prove that, with high probability, at least one of the two sequences generated by running TSSRG from these perturbations achieves a large deviation from its initialization within the prescribed number of steps (see Lemma 12).
This result, together with the localization lemma, indicates that at least one of the sequences also achieves a large function value decrease. This directly suggests that the width of the stuck region along the escape direction is small. Based on some geometric results, the probability of any perturbation falling in the stuck region is correspondingly small; in other words, with high probability, an arbitrarily chosen perturbation falls outside the stuck region, in which case we achieve a sufficient decrease of the function value. By carefully selecting the radius of the perturbation ball, we can also bound the function value change introduced by the perturbation itself. Finally, combining these two results yields Lemma 5.
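Schematically (our paraphrase of the standard coupling argument; the symbols are illustrative rather than the paper's exact choices), write $w_t := u_t - u_t'$ for the difference of the coupled sequences. Linearizing the recursive updates around the saddle gives
$$w_{t+1} \approx (I - \eta\mathcal{H})\, w_t, \qquad \mathcal{H} := \nabla^2 \hat{f}_x(0),$$
so the component of $w_t$ along the smallest eigenvector, whose eigenvalue is at most $-\delta$, grows like $(1 + \eta\delta)^t$. Once this exponential growth exceeds the ball radius, at least one of the two sequences must have moved far from its initialization.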
6.2 Online setting
Consider online problems where the full gradient is inaccessible. The proof roadmap is the same as in the finite-sum setting, but the reference gradient at the start of each epoch is now a large-batch estimate. Most key results are relaxed with an additional term that relates to the variance of the stochastic gradients (Assumption 5).
Large gradient phase. In phase 1, we can show that (6) holds with an additional variance term. That is,
Note that under the online setting, the small-gradient condition is checked against the large-batch gradient rather than the full gradient (Line 3, Algorithm 1). Therefore, compared to the finite-sum case, we require an extra bound on the difference between the full gradient and the large-batch gradient, which can be obtained by the Bernstein inequality. By choosing the batch size appropriately, results similar to Lemma 11 can be derived.
Around saddle point phase. In phase 2, we obtain the same inequality as in Lemma 14, with differences only in the parameter settings.
These results guarantee that the numbers of phase-1 and phase-2 epochs match those of the finite-sum setting, up to some constants and poly-log factors. Following similar logic and choosing the parameters accordingly, we obtain the complexity in Theorem 2.
7 Conclusion
In this paper, we generalized the perturbed stochastic recursive gradient method to Riemannian manifolds by adopting the idea of tangent space steps introduced in (Criscitiello and Boumal, 2019). This avoids the more involved geometric results used in (Sun et al., 2019) and thus largely simplifies the analysis. We showed that, up to some constants and poly-log factors, our generalization achieves the same stochastic gradient complexities as the Euclidean version (Li, 2019). Under the finite-sum setting, our result is strictly superior to PRGD by Criscitiello and Boumal (2019), and to PRAGD by Criscitiello and Boumal (2020) for large-scale problems. We also proved an online complexity which is, to the best of our knowledge, the first result on finding second-order stationary points on manifolds using only first-order information.
References
- Absil et al. (2009) P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
- Agarwal et al. (2018) Naman Agarwal, Nicolas Boumal, Brian Bullins, and Coralia Cartis. Adaptive regularization with cubics on manifolds. arXiv preprint arXiv:1806.00065, 2018.
- Ahn and Sra (2020) Kwangjun Ahn and Suvrit Sra. From Nesterov's estimate sequence to Riemannian acceleration. arXiv preprint arXiv:2001.08876, 2020.
- Allen-Zhu and Li (2018) Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems, pages 3716-3726, 2018.
- Arjevani et al. (2019) Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
- Bonnabel (2013) Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, 2013.
- Boumal (2020) Nicolas Boumal. An introduction to optimization on smooth manifolds. 2020.
- Boumal et al. (2019) Nicolas Boumal, Pierre-Antoine Absil, and Coralia Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 39(1):1-33, 2019.
- Chung and Lu (2006) Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79-127, 2006.
- Criscitiello and Boumal (2020) Chris Criscitiello and Nicolas Boumal. An accelerated first-order method for non-convex optimization on manifolds. arXiv preprint arXiv:2008.02252, 2020.
- Criscitiello and Boumal (2019) Christopher Criscitiello and Nicolas Boumal. Efficiently escaping saddle points on manifolds. In Advances in Neural Information Processing Systems, pages 5987-5997, 2019.
- Du et al. (2017) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067-1077, 2017.
- Fang et al. (2018) Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689-699, 2018.
- Ge et al. (2019) Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang. Stabilized SVRG: Simple variance reduction for nonconvex optimization. arXiv preprint arXiv:1905.00529, 2019.
- Hoeffding (1994) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409-426. Springer, 1994.
- Hosseini and Sra (2020) Reshad Hosseini and Suvrit Sra. An alternative to EM for Gaussian mixture models: Batch and stochastic Riemannian optimization. Mathematical Programming, 181(1):187-223, 2020.
- Huang et al. (2015) Wen Huang, Kyle A Gallivan, and P-A Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660-1685, 2015.
- Jin et al. (2017) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.
- Jin et al. (2018) Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pages 1042-1085. PMLR, 2018.
- Jin et al. (2019) Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. arXiv preprint arXiv:1902.04811, 2019.
- Kasai et al. (2018) Hiroyuki Kasai, Hiroyuki Sato, and Bamdev Mishra. Riemannian stochastic recursive gradient algorithm. In International Conference on Machine Learning, pages 2516-2524, 2018.
- Li (2019) Zhize Li. SSRGD: Simple stochastic recursive gradient descent for escaping saddle points. In Advances in Neural Information Processing Systems, pages 1523-1533, 2019.
- Nesterov and Polyak (2006) Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
- Nguyen et al. (2017a) Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017a.
- Nguyen et al. (2017b) Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017b.
- Reddi et al. (2016) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314-323, 2016.
- Sun et al. (2019) Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 7276-7286, 2019.
- Tropp (2012) Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.
- Xu et al. (2018) Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5530-5540, 2018.
- Zhang and Sra (2018) Hongyi Zhang and Suvrit Sra. Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.
- Zhang et al. (2016) Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592-4600, 2016.
- Zhou et al. (2018) Dongruo Zhou, Pan Xu, and Quanquan Gu. Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782, 2018.
A Useful concentration bound
This section presents some useful concentration bounds on vector spaces, which are used to derive high-probability bounds for sequences defined on the tangent spaces of the manifold.
Lemma 6 (Vector Bernstein inequality, Tropp (2012))
Given a sequence of independent, zero-mean random vectors $\{x_k\}$ in $\mathbb{R}^d$ which satisfies $\|x_k\| \le R$ almost surely, then for all $t \ge 0$,
$$\mathbb{P}\Big\{\Big\|\sum_k x_k\Big\| \ge t\Big\} \le (d+1)\exp\Big(\frac{-t^2/2}{\sigma^2 + Rt/3}\Big),$$
where $\sigma^2 = \sum_k \mathbb{E}\|x_k\|^2$.
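The following quick Monte Carlo check illustrates the statement numerically, assuming the form of the bound as reconstructed above; all numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_vecs, R, trials, t = 3, 200, 1.0, 20_000, 25.0

def sum_norm():
    # Independent, zero-mean, bounded vectors: uniform direction, norm R.
    dirs = rng.normal(size=(n_vecs, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return np.linalg.norm((R * dirs).sum(axis=0))

sigma2 = n_vecs * R**2                      # sum of E||x_k||^2 (here ||x_k|| = R exactly)
empirical = np.mean([sum_norm() >= t for _ in range(trials)])
bernstein = (d + 1) * np.exp(-(t**2 / 2) / (sigma2 + R * t / 3))
print(f"empirical tail: {empirical:.4f} <= Bernstein bound: {bernstein:.4f}")
```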
B Regularity conditions on Riemannian manifold
In this section, we prove some regularity conditions on manifolds that are fundamental for Riemannian optimization, as seen in the literature (Criscitiello and Boumal, 2020; Boumal, 2020).
Lemma 2 ($\ell$-Lipschitz continuous) Under Assumption 2, for all $x \in \mathcal{M}$, there exists a constant $\ell$ such that each $\nabla \hat{f}_{i,x}$ is $\ell$-Lipschitz continuous inside the constraint ball. This also implies that $\nabla \hat{f}_x$ is $\ell$-Lipschitz continuous; that is, for any $u, v$ inside the ball, we have $\|\nabla \hat{f}_x(u) - \nabla \hat{f}_x(v)\| \le \ell \|u - v\|$.
Proof  The proof that each pullback component gradient is Lipschitz continuous is the same as in Criscitiello and Boumal (2019); we include it here for completeness. From Assumption 2, we have the stated Hessian bound. Thus,
Hence, for any , we obtain
This implies that the full gradient is Lipschitz continuous because for any
This completes the proof.
∎
Lemma 3 (Singular value bound of the differentiated retraction) For all $u \in T_x\mathcal{M}$ with $\|u\|$ sufficiently small relative to the constant $c_0$ of Assumption 4, the singular values of $\mathrm{D}R_x(u)$ are bounded below by a positive constant.
Proof
From Assumption 4, we have $\|\mathrm{D}R_x(u) - \mathcal{T}_u\| \le c_0\|u\|$. Therefore $\sigma_{\min}(\mathrm{D}R_x(u)) \ge \sigma_{\min}(\mathcal{T}_u) - c_0\|u\| = 1 - c_0\|u\|$, where the last equality uses the fact that all singular values of an isometric operator are equal to 1.
∎
Lemma 8 (Gradient and Hessian of the pullback under second-order retraction)
Given a second-order retraction $R$, both the gradient and the Hessian of the pullback function evaluated at the origin of $T_x\mathcal{M}$ match the Riemannian gradient and Hessian of $f$. That is, for all $x \in \mathcal{M}$,
$$\nabla \hat{f}_x(0) = \mathrm{grad} f(x), \qquad \nabla^2 \hat{f}_x(0) = \mathrm{Hess} f(x).$$
Proof  The proof is mainly based on (Boumal, 2020); we include it here for completeness. First note that for any retraction (not necessarily second-order), the gradients match. That is, for any tangent vector, by the chain rule,
where we use the defining properties of a retraction, namely $R_x(0) = x$ and $\mathrm{D}R_x(0) = \mathrm{id}$. Then we can use the definition of the Riemannian gradient and its uniqueness property to show the result. Next, we prove the second result. Consider the second-order Taylor expansion of $f$ from $x$ to $R_x(u)$ along the retraction curve:
$$f(R_x(u)) = f(x) + \langle \mathrm{grad} f(x), u \rangle_x + \tfrac{1}{2}\langle \mathrm{Hess} f(x)[u], u \rangle_x + O(\|u\|^3), \qquad (8)$$
where the absence of a first-order curvature term is due to the zero initial acceleration of a second-order retraction. Also, since $\hat{f}_x$ is a "classic" function from a vector space to the real numbers, we can use a classic Taylor expansion of this function:
$$\hat{f}_x(u) = \hat{f}_x(0) + \langle \nabla \hat{f}_x(0), u \rangle + \tfrac{1}{2}\langle \nabla^2 \hat{f}_x(0)[u], u \rangle + O(\|u\|^3). \qquad (9)$$
Given that we already have $\nabla \hat{f}_x(0) = \mathrm{grad} f(x)$ and $\hat{f}_x(u) = f(R_x(u))$, comparing (9) with (8) yields $\nabla^2 \hat{f}_x(0) = \mathrm{Hess} f(x)$.
∎
C Proof for finite-sum setting
In this section, we prove the main complexity results under the finite-sum setting. In this case, the check batch equals the full sample set, and hence in Algorithm 2 we have access to the full gradient. We start with the proof of the main complexity theorem in Subsection C.1. We then prove the key lemmas necessary to derive the theorem in C.2.
C.1 Proof for main Theorem
Theorem 1 (Finite-sum complexity) Under Assumptions 1 to 4, consider the finite-sum optimization setting. For any starting point, with the choice of parameters as
specified, and supposing the tolerances satisfy the conditions stated in Section 5, with high probability, PRSRG will at least once visit an $(\epsilon,\delta)$-second-order critical point using
stochastic gradient queries, where $\Delta_f := f(x_0) - f^*$.
Proof  We enumerate all possible cases when running the main algorithm PRSRG. Notice that we need to explicitly discuss the case where the iterates escape the constraint ball within an epoch. In the large-gradient situation, when the iterates leave the constraint ball, the epoch achieves a large function decrease with high probability (hence it is labelled a descent epoch). Around saddle points, when the iterates leave the constraint ball, the function value already decreases with probability 1 (hence this case is merged with the case where the iterates stay inside the ball).
• Large gradients:
1. Type-1 descent epoch: iterates escape the constraint ball.
2. Iterates do not escape the constraint ball:
(a) Type-2 descent epoch: at least half of the iterates in the current epoch have pullback gradients larger than the threshold.
(b) Useful epoch: at least half of the iterates in the current epoch have pullback gradients no larger than the threshold, and the output of the current epoch has a small gradient. (Since the output satisfies the small-gradient condition, the next epoch will run TSSRG to escape saddle points.)
(c) Wasted epoch: at least half of the iterates in the current epoch have pullback gradients no larger than the threshold, yet the output of the current epoch has a large gradient.
• Around saddle points (small gradient, sufficiently negative smallest Hessian eigenvalue):
3. Type-3 descent epoch: the current iterate is around a saddle point.
First, because the output of the current epoch is randomly selected from its iterates, the probability of a wasted epoch is at most 1/2. Also, due to the independence of each wasted epoch, with high probability a wasted epoch occurs consecutively at most logarithmically many times before either a descent epoch (of Type 1 or 2) or a useful epoch. (Footnote 1: That is, the probability that at least half of the iterates have small pullback gradients yet the uniformly selected output has a large gradient is at most 1/2; hence the probability of many consecutive wasted epochs decays geometrically, and with high probability there is at least one useful epoch or Type-2 descent epoch among them.) Hereafter, we use $N_1$, $N_2$ and $N_3$ to respectively denote the numbers of the three types of descent epochs.
Consider a Type-1 descent epoch. From Case 2 in Lemma 11, with constant probability the function value decreases by at least the prescribed amount, and with high probability the function value does not increase. Hence, by standard concentration, after $N_1$ such epochs the function value is reduced proportionally with high probability. Given that the optimality gap is bounded by $\Delta_f$, the decrease cannot exceed $\Delta_f$; therefore $N_1$ is bounded. Similarly, the number $N_2$ of Type-2 descent epochs is bounded.
Consider a useful epoch with output $x$. If, furthermore, the smallest Hessian eigenvalue at $x$ is not too negative, then $x$ is already an $(\epsilon,\delta)$-second-order critical point. Otherwise, the useful epoch is followed by a Type-3 descent epoch around a saddle point. From Lemma 14, we know that the function value then decreases sufficiently with high probability. Similar to the argument for the other types of descent epochs, $N_3$ is bounded, where we omit lower-order terms.
Hence, we have the following stochastic gradient complexity:
where , , .
∎
C.2 Proof for key Lemmas
The organization of these lemmas is as follows. In Lemma 9, we first prove a high-probability bound on the estimation error of the modified gradient. This replaces the in-expectation bound commonly used in deriving first-order guarantees. Lemma 10 shows that iterates that deviate far from the initialization also achieve a large function value decrease. These two results are subsequently used to derive Lemma 11, a descent lemma for the large-gradient phase.
For the saddle-point phase, we first prove that at least one of the coupled sequences achieves a large deviation from its initialization (Lemma 12). This then translates into a sufficient function value decrease in Lemma 13. Finally, Lemma 14 shows that, with high probability, the iterates escape the saddle point and decrease the function value by the desired amount.
Lemma 9 (High probability bound on estimation error)
Under Assumption 2, we have the following high probability bound for estimation error of the modified gradient under finite-sum setting. That is, for ,
Proof  For simplicity of notation, consider a single epoch. Consider two sequences on the tangent space, one being the estimation error and the other its increments. It is easily verified that the former is a martingale and the latter is its difference sequence with respect to the natural filtration, where we use the unbiasedness of i.i.d. sampling. Hence, to bound the martingale, we first need to bound its difference sequence, as required by the Azuma-Hoeffding lemma. We can use the vector Bernstein inequality to bound the differences as follows. First note that
Denote and therefore with . In order to apply Bernstein inequality, we need to show that is bounded. This is achieved by Lemma 2. That is,
Also, the total variance is computed as
where the first inequality holds due to and the second last inequality applies and the last inequality uses the gradient Lipschitzness result in Lemma 2. Finally we can apply Bernstein inequality (Lemma 6) to bound . That is,
where the second inequality substitutes as and as in Lemma 6. It also uses the fact that . The last inequality holds by the choice (for example, ). This gives a probability bound for , which is
Now given the bound on , we can bound using the Azuma-Hoeffding inequality. Suppose we set , where is the epoch length. Then by union bound, for
(10)
Therefore, the probability that for all is at least . Hence by Lemma 7, we have
(11)
where the last inequality holds due to and the choice that
where we denote . Note under finite-sum setting, . Thus (11) implies for ,
(12)
holds with probability . Note that we can always set while the result still holds because . Hence the probability reduces to .
∎
Lemma 10 (Improve or localize)
Let the sequence be generated by running TSSRG, and suppose we choose the step size as specified. Then we have
Proof  First, by generalizing (16) to any epoch, we have
(13)
where the last inequality holds due to the choice of step size and the stated assumption. Summing (13) over all epochs up to the current iterate gives
Also by Cauchy-Schwarz inequality and triangle inequality,
The proof is complete by noting .
∎
Lemma 11 (Large gradient descent lemma)
(Lemma 4 in the main text.) Under Assumptions 2, 3, and 4, suppose we choose the parameters as specified, and consider an iterate with a large gradient. Then, by running TSSRG for a single epoch, we have the following three cases:
1. When the iterates do not leave the constraint ball:
(a) If at least half of the iterates in the epoch have pullback gradient norm below the threshold, then with probability at least 1/2, the output satisfies the small-gradient condition.
(b) Otherwise, with constant probability, the function value decreases sufficiently.
2. When one of the iterates leaves the constraint ball, with high probability, the function value decreases sufficiently.
Regardless of which case occurs, the function value does not increase with high probability.
Proof  First note that when the gradient is large, we always call TSSRG with the total number of iterations set to a single epoch; hence it suffices to consider the first epoch in TSSRG. Compared to the proof in (Li, 2019), we further need to address the case where the iterates fall outside the prescribed ball, so we divide the proof into two parts.
1. Iterates do not leave the constraint ball. By -Lipschitzness in Lemma 2, we have
(14)
From Lemma 9, we know that
(15)
holds with high probability . By a union bound, and therefore for all , (15) holds with probability . Setting as we have for all ,
Substituting this result into (14) and summing over this epoch from to gives
(16)
(17)
where the constants are as specified. The second inequality uses the variance bound, the third inequality holds due to the choice of step size, and the last inequality holds under the stated assumption. Note that we require (17) to hold for all inner iterations, and thus we adjust the failure probability accordingly. Then we have the following two cases.
• (Case 1a) Suppose at least half of the iterates in the epoch have pullback gradient norm below the threshold. Then, by uniform sampling (i.e., uniformly breaking as in Algorithm 2, Line 12), the output has pullback gradient norm below the threshold with probability at least 1/2. Recalling the expression for the pullback gradient in Lemma 1, the output of TSSRG then satisfies the corresponding bound on its Riemannian gradient, where we use the singular value bound of Lemma 3.
• (Case 1b) Suppose at least half of the points in the epoch have pullback gradient norm above the threshold. With probability 1/4, the output falls within the last quarter of the epoch by uniform sampling. In this case, at least a quarter of the points with large gradient appear before the output. Thus, by (17), we have
(18)
Note that (17) holds with high probability; then, by a union bound, (18) holds with high probability as well. Without loss of generality, we can choose the constants so that (18) holds with at least a constant probability.
2. Iterates leave the constraint ball. Suppose that at some inner iteration, the iterate leaves the ball. Then, by Lemma 10, we know that the function value has already decreased a lot. That is, with high probability, running TSSRG gives
where the last inequality follows from the choice of the first-order tolerance, which can be made suitably small. Note that this requirement is not difficult to satisfy, since the associated constant can be made sufficiently large. Hence, by returning the boundary point as the output, we have
In summary, regardless of whether the iterates stay within the ball, with high probability, either the gradient norm of the output is small or the function value decreases a lot. Note that for Case 1a, the function value still decreases, by (17).
∎
Lemma 12 (Small stuck region)
Consider a point with small gradient and sufficiently negative smallest Hessian eigenvalue. Let there be two random perturbations satisfying the stated coupling relation, differing only along the smallest eigenvector direction of the Hessian, with an offset that can be chosen arbitrarily small. Also set the parameters as specified. Then, for the coupled sequences generated by running TSSRG twice with the same sets of mini-batches, with high probability, we have
where and .
Proof  The proof is by contradiction. So we assume
(19)
First, we note that neither of the two sequences escapes the ball within the prescribed number of steps under condition (19). This is because (taking one sequence for example),
where we apply the stated bounds. Hence, if condition (19) is satisfied, the sequences must remain inside the ball. As a result, we can proceed with the proof by following a similar idea as in (Li, 2019), which is to show an exponential growth in the distance between the two coupled sequences that ultimately exceeds the bound in (19) via the triangle inequality. This gives rise to a contradiction.
Denote and . With , we can express as
(20)
(21)
where and . Note we can bound as
(22)
where . The first inequality uses Assumption 2 and the last inequality follows from . Denote for ,
(23)
(24)
By induction, we can show the stated decomposition. Observe that the first term grows exponentially, so it remains to verify that the second term is dominated by the first as the iteration index increases. To this end, we inductively show two conditions. First note that at the initial step these two conditions clearly hold. Now suppose the claims have been proved up to the current step. Then we immediately have, by the triangle inequality,
(25)
which holds up to the current step. We first prove that the second condition holds at the next step.
Proof that is bounded by . Note that from the definition in (23),
Then we can bound by respectively bounding the two terms. The first term is bounded as follows,
(26)
where the second inequality holds because . The third inequality applies the first condition for . The fourth inequality is by and . The last inequality follows by the choice of parameters and where is a constant defined later. The second term can also be similarly bounded as
(27)
where we apply the bound on in the second inequality. The last inequality uses and . From (26) and (27), we prove . Note we can always assume given its requirement. Therefore and it is sufficient to require to guarantee the result. Now we can proceed to prove the first condition, which is an intermediate result that has been used in proving the second condition.
Proof that . We first re-write into a recursive form as
where we denote . It is easy to verify that is a martingale sequence and is its difference sequence. Similar to the proof strategy as for Lemma 9, we first need to derive bound for by Bernstein inequality. Denote as the component of . That is
Then is bounded as
(28)
where , , , . The second last inequality uses triangle inequality and the last inequality considers Assumption 2 and also the constraint on (similarly ) in (22). The total variance is derived as
where the first inequality uses and the last inequality follows similar logic as (28). Hence by vector Bernstein Lemma 6, we obtain
where we choose based on similar logic as in Lemma 9. Therefore, we can bound with high probability. That is, with probability ,
We can now obtain a bound on the martingale. We only need to consider a single epoch, because of the full gradient evaluation at the start of each epoch. Note that, similar to (10) and a union bound, the bound on the difference sequence holds for all steps with high probability. Then, applying the Azuma-Hoeffding inequality (Lemma 7) to the martingale sequence yields
where we choose
By a union bound, and therefore for all ,
Note by simply setting as , we obtain with probability , for all ,
(29)
What remains to be shown is that the right hand side of (29) is bounded. From (20), we first have
(30)
The first term can be bounded as follows. First note that the Hessian satisfies by construction, where represents eigenvalues of . Although , it is difficult to compare and since . Thus, we bound the term by projecting it to the following two subspaces:
• the subspace spanned by the eigenvectors of the Hessian whose eigenvalues are within the first range;
• the subspace spanned by the eigenvectors of the Hessian whose eigenvalues are within the second range.
That is, for the first case
(31)
where we use the bound on and the fact that . For the second case, we have from (21),
(32)
(33)
The second inequality is due to for . That is, given the choice and , we obtain . The third inequality uses the finite-sum bound on Harmonic series. Inequality (32) applies (i) the bound on as in (22) (ii) the inductive bound on as in (25) and (iii) the inductive assumption on . The second last inequality is by the choice . It is easy to verify that the right-hand-side of (33) is larger than right-hand-side of (31). Hence combining the two cases gives
(34)
The second term in (30) is bounded as
(35)
where the second inequality is derived similarly as (32) and the last inequality is again due to the assumption that . Combining (34) and (35) gives a bound,
(36)
where the second inequality uses the stated definitions. The last inequality holds by considering the dominant term, since one constant is sufficiently small while the other factor can be made sufficiently large to ensure the validity of this result. Here, for simplicity, we keep the same notation, since this will not affect the result.
Similarly, we can bound as
(37)
where the first inequality applies the second condition for and the second inequality again uses . By combining results in (36) and (37), we have
where we denote (note in Lemma 11). The second inequality considers and . The last inequality follows by the parameter setting that and , . This completes the proof of the bound on .
Finally, we can proceed to derive a contradiction. First, given the bound established above, we have
where the last inequality follows by the choice that , where is chosen such that . This requirement is reasonable since when , increases while is a constant. In this case,
However, we have for , , which gives a contradiction. Hence , such that . Given changes throughout optimization process, we may choose . Since , the result still holds.
∎
Lemma 13 (Descent around saddle points)
Proof  We again prove this result by contradiction. Suppose
(38)
Then we first claim that both coupled sequences stay within the prescribed ball. This is verified by contradiction. Assume, without loss of generality, that at some iteration one of the sequences leaves the ball. In this case, TSSRG returns a boundary point, and hence
(39)
where the second inequality uses Lemma 10, applied from the start of the current epoch. However, by the parameter choice, the right-hand side exceeds the assumed bound, which gives a contradiction. Therefore, under condition (38), the two coupled sequences do not escape the ball, and we can proceed similarly to (Li, 2019).
First, Lemma 12 claims that at least one of the sequences achieves a large deviation at some step. Without loss of generality, suppose it is achieved by the first sequence. Then, by Lemma 10, the corresponding function value decreases substantially. This implies
where we use the choice of the parameters. This contradicts (38), and therefore the proof is complete.
∎
Without loss of generality, the result in Lemma 13 can be written with the maximum number of iterations. This represents the worst-case scenario, where we require the full budget of TSSRG iterations to reach a large function decrease. (Footnote 2: We can also add a stopping criterion that breaks at the earliest iteration at which a large function value decrease is reached.)
Lemma 14 (Escape stuck region)
Proof  First, define the stuck region formally as the set of perturbations such that running TSSRG from any point initialized in this region will not give a sufficient decrease of the function value. Similar to (Li, 2019), the aim is to show that this region is small in volume. First note that iterates with initialization within the stuck region do not escape the constraint ball, by the argument in (39). Hence, if the iterates leave the ball, the output already escapes the stuck region with a large function decrease. Given that the tangent space is a $d$-dimensional vector space, we can perform a similar analysis as in (Jin et al., 2017; Li, 2019).
Consider the two coupled sequences with starting points satisfying the coupling relation of Lemma 12, where the offset is along the smallest eigenvector of the Hessian. Therefore, from Lemma 13, at least one of the two sequences finally leaves the stuck region after the prescribed number of steps with high probability. Consequently, the width of the stuck region along the escape direction is small. Based on a similar argument as in (Criscitiello and Boumal, 2019), the volume of the stuck region can be compared with that of the perturbation ball, via the volume of the $(d-1)$-dimensional sphere. Therefore,
where we have used Gautschi's inequality for the Gamma function. This shows that the probability of the perturbation falling into the stuck region is small. Therefore, with high probability, the perturbation lands outside the stuck region, in which case the function value decreases a lot. This is verified as follows. First note that the perturbed starting points lie within the perturbation ball. Then, by the Lipschitz continuity of the gradient,
(40)
using and also . Therefore
where we apply (40) and the fact that .
∎
D Proof for online setting
Here we provide the proof for online problems, where the full gradient must be replaced by a large-batch gradient. The proof is similar to that of the finite-sum setting, so we mainly present the key steps.
D.1 Proof for main Theorem
Theorem 2 (Online complexity) Under Assumptions 1 to 5, consider the online optimization setting. For any starting point, with the choice of parameters
as specified, and supposing the tolerances satisfy the stated conditions, with high probability, PRSRG will at least once visit an $(\epsilon,\delta)$-second-order critical point using
stochastic gradient oracle calls, where $\Delta_f := f(x_0) - f^*$.
Proof  Similar to Theorem 1, we have the following possible cases when running the main algorithm PRSRG:
• Large gradients, where :
  1. Type-1 descent epoch: Iterates escape the constraint ball.
  2. Iterates do not escape the constraint ball:
     (a) Type-2 descent epoch: At least half of the iterates in the current epoch have pullback gradient larger than .
     (b) Useful epoch: At least half of the iterates in the current epoch have pullback gradient no larger than and the output from the current epoch has batch gradient no larger than . (Since the output satisfies the small-gradient condition, the next epoch will run TSSRG to escape saddle points.)
     (c) Wasted epoch: At least half of the iterates in the current epoch have pullback gradient no larger than and the output from the current epoch has batch gradient larger than .
• Around saddle points, where and :
  3. Type-3 descent epoch: The current iterate is around a saddle point.
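As promised, here is a minimal Python-style sketch of the loop these cases classify: perturb and run TSSRG when the (batch) gradient is small, and run TSSRG directly otherwise. The helpers `batch_gradient`, `tssrg`, and `perturb` are hypothetical placeholders for the paper's Algorithms 1 and 2, not their actual implementation.

```python
import numpy as np

def prsrg(x, n_epochs, eps, radius, batch_gradient, tssrg, perturb):
    """Sketch of the perturbed Riemannian stochastic recursive gradient loop.

    Each pass corresponds to one epoch in the case analysis above: a
    large-gradient epoch runs TSSRG directly, while a small-gradient epoch
    perturbs within a tangent-space ball of the given radius and then runs
    TSSRG to escape a potential saddle point.
    """
    for _ in range(n_epochs):
        g = batch_gradient(x)          # large-batch estimate in the online setting
        if np.linalg.norm(g) <= eps:   # small gradient: possibly near a saddle
            x = tssrg(perturb(x, radius))
        else:                          # large gradient: a descent epoch
            x = tssrg(x)
    return x
```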
The following proof is very similar to that of Lemma 1. First, from Lemma 17, we know that the probability of a wasted epoch occurring is at most . Given the independence of different wasted epochs, with high probability a wasted epoch happens consecutively at most times before either a descent epoch (Type 1 or 2) or a useful epoch. We use to denote the three types of descent epochs, respectively.
For a Type-1 descent epoch, with probability , the function value decreases by at least . Hence, by standard concentration, after such epochs the function value decreases by with high probability. Given that is bounded by , the decrease cannot exceed . Thus, . Similarly, for a Type-2 descent epoch, . For a useful epoch, the output is either an -second-order critical point or the epoch is immediately followed by a Type-3 descent epoch around a saddle point. From Lemma 20, we know that the function value then decreases by with high probability. Therefore, by similar arguments, .
Combining these cases, we obtain the following stochastic gradient complexity:
where , and .
∎
D.2 Proofs of key lemmas
Lemma 15 (High-probability bound on estimation error)
Proof. For simplicity of notation, consider a single epoch . Because, under the online setting, , we first need to bound by the Bernstein inequality. By the assumption of bounded variance, we have
By Lemma 6,
where the first inequality also uses . The last inequality holds by setting . Thus, we obtain
(41)
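For reference, one standard form of the vector Bernstein inequality behind this step reads as follows (a sketch, not necessarily the exact statement of Lemma 6): for independent mean-zero random vectors $z_1,\dots,z_B\in\mathbb{R}^d$ with $\|z_i\|\le R$ almost surely and $\sum_{i}\mathbb{E}\|z_i\|^2\le v$,

```latex
\Pr\Big(\Big\|\sum_{i=1}^{B} z_i\Big\| \ge t\Big)
\le (d+1)\exp\!\Big(\frac{-t^{2}/2}{v + Rt/3}\Big),
```

so that averaging $B = \widetilde{O}(\sigma^2/\epsilon^2)$ bounded-variance component gradients keeps the estimation error at most $O(\epsilon)$ with high probability.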
Next, denote and . Then we follow the same steps as in Lemma 9 to obtain the bound on except that . From (12), we have
By a union bound, for , with probability at least ,
The proof is complete by setting as .
∎
Lemma 16 (Improve or localize)
Consider the sequence generated by running . Suppose we choose , , where . Then we have
with .
Proof. First, we generalize (44) to any epoch (i.e., ) as
(42)
where we choose and assume . Summing (42) for all epochs up to gives
Lastly, by the Cauchy–Schwarz and triangle inequalities, . Hence,
The proof is now complete.
∎
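For completeness, the Cauchy–Schwarz step used in the proof above can be spelled out as follows, writing $x_0,\dots,x_t$ for the tangent-space iterates under consideration:

```latex
\|x_t - x_0\| \;\le\; \sum_{s=0}^{t-1}\|x_{s+1}-x_s\|
\;\le\; \sqrt{\,t\sum_{s=0}^{t-1}\|x_{s+1}-x_s\|^{2}\,},
```

which converts the accumulated per-step movement into the localization bound stated in the lemma.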
Lemma 17 (Large gradient descent lemma)
Under Assumptions 2 to 5, suppose we set , where . Consider where , with . Then by running , we have the following three cases:
1. When the iterates do not leave the constraint ball :
   (a) If at least half of the iterates in the epoch satisfy for , then with probability at least , we have .
   (b) Otherwise, with probability at least , we have .
2. When one of the iterates leaves the constraint ball , with probability at least , we have .
Regardless of which case occurs, with high probability.
Proof. Similar to the proof of Lemma 11, we only need to consider a single epoch in TSSRG, i.e., . We also divide the proof into two parts.
1. Iterates do not leave the constraint ball. From Lemma 15 and a union bound, we have, for all ,
(43)
where we denote . [Footnote 3: More precisely, .] We also use . Summing up (14) from , we have
(44)
(45) |
where the last inequality holds due to the choice , assuming . Note that we need (45) to hold for all ; hence we change . There are two cases.
• (Case 1a) If at least half of the iterates in the epoch satisfy for , then by uniform sampling (i.e., uniformly breaking by setting as in Algorithm 2, Line 12; see the sketch after this case analysis), the output satisfies with probability at least . Under the online setting, the full gradient is inaccessible and we need to use the approximate batch gradient to check the small-gradient condition in Line 3 of Algorithm 1. Based on (41), we know that with probability . By a union bound, with probability at least ,
where the last inequality holds by . Without loss of generality, we set and therefore the probability reduces to . Lastly, from the definition of the pullback gradient, we have
holds with probability at least . The last inequality is due to Lemma 3. Note that in this case, we also have .
• (Case 1b) If at least half of the points in the epoch satisfy for , then with probability at least , the selected output falls within the last quarter of . In this case, . Note that (45) holds with probability at least . By a union bound, we have, with probability at least (by letting ),
where we use and .
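As referenced in Case 1a, the uniform sampling of the output can be realized by breaking the epoch at a uniformly random iteration. A minimal sketch, where `step` and `m` are hypothetical placeholders for one TSSRG update and the epoch length:

```python
import random

def epoch_with_uniform_output(x, step, m):
    """Run updates but return the iterate at a uniformly random index,
    matching the uniform-break rule invoked in Case 1a."""
    t_stop = random.randrange(m)  # uniform over {0, ..., m-1}
    for _ in range(t_stop):
        x = step(x)               # run exactly t_stop updates, then break
    return x
```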
2. Iterates leave the constraint ball. Suppose that at we have . Then, by Lemma 16, we know the function value already decreases significantly. That is, with probability ,
where the last inequality follows from the choices and . Hence, by returning as , we have, with high probability,
In summary, with high probability, either the gradient norm of the output is small or the function value decreases significantly. Even under Case 1a, with high probability by (45).
∎
Lemma 18 (Small stuck region)
Consider with and . Let be two random perturbations satisfying and , where denotes the eigenvector corresponding to the smallest eigenvalue of , and . Set the parameters , where is chosen sufficiently small such that . Also choose , . Then, for generated by running TSSRG and TSSRG with the same sets of mini-batches, with probability ,
where , and , .
Proof. The proof is by contradiction, so we assume
(46) |
First, we again note that both do not escape the constraint ball under condition (46). This is because of the choices and . Next, we show that the distance between the two coupled sequences grows exponentially and eventually exceeds the bound in (46). The proof roadmap is exactly the same as in Lemma 12, and thus we only highlight the main results under the online setting.
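To make the exponential-growth mechanism explicit, recall the (Li, 2019)-style decomposition that this roadmap refers to: writing $w_t := x_t - x'_t$ for the difference of the coupled tangent-space sequences, $H$ for the Hessian of the pullback at the perturbation center, and $\delta_t$ for the accumulated stochastic and Hessian-approximation errors (the exact form is as in Lemma 12), one step of the coupled updates gives

```latex
w_{t+1} = (I - \eta H)\,w_t - \eta\,\delta_t
\qquad\Longrightarrow\qquad
w_{t} = (I - \eta H)^{t} w_0 \;-\; \eta\sum_{s=0}^{t-1}(I-\eta H)^{t-1-s}\,\delta_s ,
```

so the leading term grows like $(1+\eta\gamma)^t$ along the escape direction $e_1$ (with $-\gamma$ the smallest eigenvalue of $H$), while the induction below shows the error term stays dominated by it.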
Denote , , . Recall that we can bound as , where . Thus, from the proof of Lemma 12, we know that we can rewrite where
Next, we inductively show that and . When , these two conditions are easily verified. Now suppose that these two conditions hold for ; then we immediately obtain, for all ,
(47)
Also, unlike the finite-sum setting, we need a bound on . Given that and , we have
For each component term, we have
where the last inequality uses (47). Also the variance term is bounded as
Therefore, we can substitute these two bounds into the Bernstein inequality (Lemma 6) to bound . That is,
where the second inequality also uses and the last inequality holds due to
By a union bound, for all , we have
(48) |
where , and the second inequality is due to , , and the assumption (without loss of generality) that . We first prove that the second condition holds for .
Proof that is bounded by . Recall that we can decompose into two terms:
The first term is bounded in the same way as in Lemma 12. Consider the parameter settings , , , where satisfies . Then we have
The second term can be bounded as
where we choose and . These two results complete the proof that . Now we proceed to prove the first condition.
Proof that is bounded by . Denote . We can then verify that is a martingale sequence and that is its difference sequence. We first bound by the Bernstein inequality; the following result is exactly the same as in Lemma 12. With probability ,
Now we can bound . We only need to consider a single epoch because is bounded as in (48). Similarly, setting , we have for all . By the Azuma–Hoeffding inequality (Lemma 7), we have, for all , with probability ,
This is the same result as (29) except that . Combining this result with (48) and using a union bound, we have, with probability , for all ,
(49)
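For reference, the Azuma–Hoeffding step above uses the following standard bound (the paper's Lemma 7 may differ in constants): for a martingale difference sequence $\zeta_1,\dots,\zeta_t$ in a Hilbert space with $\|\zeta_s\|\le c_s$ almost surely,

```latex
\Pr\Big(\Big\|\sum_{s=1}^{t}\zeta_s\Big\| \ge \beta\Big)
\le 2\exp\!\Big(-\frac{\beta^{2}}{2\sum_{s=1}^{t} c_s^{2}}\Big).
```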
What remains to be shown is that the right-hand side of (49) is bounded. Recalling the analogous steps from (35) and (37), we have
where there is a slight difference in one of the constants because now . Therefore, we obtain
where . (Note in Lemma 17.) The second inequality uses the choices and . This completes the proof of the first condition, and hence both conditions hold by induction.
The contradiction argument is the same as in the proof of Lemma 12 and is hence omitted for brevity. Note that, similar to the argument in Lemma 12, we choose so that the result still holds.
∎
Lemma 19 (Descent around saddle points)
Proof. We prove this result by contradiction. Suppose the contrary holds, that is,
(50)
Then we first notice that, under this condition, both and do not escape the constraint ball . This is verified by contradiction. Assume, without loss of generality, that at iteration , . In this case, returns with such that . Hence, by similar logic as in (39),
where we set the parameters , , and , . However, we know that , which gives a contradiction. Therefore, and stay within the constraint ball for steps.
Also based on Lemma 18, assume for some . Then from Lemma 16, we know that
where we consider and , . This contradicts (50) and hence the proof is complete.
∎
Lemma 20 (Escape stuck region)
Let satisfy and . Given that the result in Lemma 19 holds, and choosing the perturbation radius , we have a sufficient decrease of the function value with high probability: