Escape saddle points by a simple gradient-descent based algorithm
Abstract
Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function $f\colon\mathbb{R}^n\to\mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with $\tilde{O}(\log^{4} n/\epsilon^{2})$ or $\tilde{O}(\log^{6} n/\epsilon^{1.75})$ iterations, our algorithm is polynomially better in terms of $\log n$ and matches their complexities in terms of $\epsilon$. For the stochastic setting, our algorithm outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log^{2} n/\epsilon^{4})$ iterations. Technically, our main contribution is an idea of implementing a robust Hessian power method using only gradients, which can find negative curvature near saddle points and achieve the polynomial speedup in $\log n$ compared to the perturbed gradient descent methods. Finally, we also perform numerical experiments that support our results.
1 Introduction
Nonconvex optimization is a central research area in optimization theory, since many modern machine learning problems can be formulated as models with nonconvex loss functions, including deep neural networks, principal component analysis, tensor decomposition, etc. In general, finding a global minimum of a nonconvex function is NP-hard in the worst case. Instead, many theoretical works focus on finding a local minimum rather than a global one, because recent works (both empirical and theoretical) suggest that local minima are nearly as good as global minima for a significant number of well-studied machine learning problems; see e.g. [4, 11, 13, 14, 16, 17]. On the other hand, saddle points are major obstacles for solving these problems, not only because they are ubiquitous in high-dimensional settings where the directions for escaping may be few (see e.g. [5, 7, 10]), but also because saddle points can correspond to highly suboptimal solutions (see e.g. [18, 27]).
Hence, one of the most important topics in nonconvex optimization is escaping saddle points. Specifically, we consider a twice-differentiable function $f\colon\mathbb{R}^n\to\mathbb{R}$ such that
- $f$ is $\ell$-smooth: $\|\nabla f(x_1)-\nabla f(x_2)\|\le \ell\|x_1-x_2\|$ for any $x_1,x_2\in\mathbb{R}^n$,
- $f$ is $\rho$-Hessian Lipschitz: $\|\nabla^2 f(x_1)-\nabla^2 f(x_2)\|\le \rho\|x_1-x_2\|$ for any $x_1,x_2\in\mathbb{R}^n$;
here $\nabla^2 f$ is the Hessian of $f$. The goal is to find an $\epsilon$-approximate second-order stationary point $x$ (more generally, one can ask for an approximate second-order stationary point with separate tolerances for the gradient norm and the smallest Hessian eigenvalue; the scaling in (1) was adopted as a standard in the literature [1, 6, 9, 19, 20, 21, 25, 28, 29, 30]):
$$\|\nabla f(x)\|\le\epsilon,\qquad \lambda_{\min}\big(\nabla^2 f(x)\big)\ge-\sqrt{\rho\epsilon}. \tag{1}$$
In other words, at any $\epsilon$-approximate second-order stationary point $x$, the gradient is small, with norm at most $\epsilon$, and the Hessian is close to being positive semi-definite, with all its eigenvalues at least $-\sqrt{\rho\epsilon}$.
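For illustration, the following minimal sketch makes the definition operational; it assumes access to callables `grad` and `hess` returning the exact gradient and Hessian (names of our own choosing), which the algorithms developed in this paper deliberately avoid.

```python
import numpy as np

def is_approx_sosp(grad, hess, x, eps, rho):
    """Check the two conditions in (1): small gradient norm and
    an almost positive semi-definite Hessian."""
    g = grad(x)
    lam_min = np.linalg.eigvalsh(hess(x))[0]   # smallest eigenvalue of the Hessian
    return np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps)
```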
Algorithms for escaping saddle points are mainly evaluated from two aspects. On the one hand, considering the enormous dimensions of machine learning models in practice, dimension-free or almost dimension-free (i.e., having only poly-logarithmic dependence on the dimension) algorithms are highly preferred. On the other hand, recent empirical discoveries in machine learning suggest that it is often feasible to tackle difficult real-world problems using simple algorithms, which can be implemented and maintained more easily in practice. On the contrary, algorithms with nested loops often suffer from significant overheads at large scale, or raise concerns about the setting of hyperparameters and numerical stability (see e.g. [1, 6]), making practical implementations relatively rare.
It is then natural to explore simple gradient-based algorithms for escaping from saddle points. The reason we do not assume access to Hessians is that constructing the full Hessian takes $O(n^2)$ cost in general, which is computationally infeasible when the dimension $n$ is large. A seminal work along this line was by Ge et al. [11], which found an $\epsilon$-approximate second-order stationary point satisfying (1) using only gradients, in a number of iterations polynomial in $n$ and $1/\epsilon$. This was later improved to an almost dimension-free $\tilde{O}(\log^{4} n/\epsilon^{2})$ iterations in the follow-up work [19] (throughout, the $\tilde{O}(\cdot)$ notation omits poly-logarithmic factors), and the perturbed accelerated gradient descent algorithm [21] based on Nesterov's accelerated gradient descent [26] takes $\tilde{O}(\log^{6} n/\epsilon^{1.75})$ iterations. However, these results still suffer from a significant overhead in terms of $\log n$. In another direction, Refs. [3, 24, 29] demonstrate that an $\epsilon$-approximate second-order stationary point can be found using gradients in $\tilde{O}(1/\epsilon^{1.75})$ iterations with only poly-logarithmic dependence on $n$. Their results are based on previous works [1, 6] using Hessian-vector products and the observation that a Hessian-vector product can be approximated via the difference of two gradient queries. Hence, their implementations contain nested-loop structures with relatively large numbers of hyperparameters. It has been an open question whether it is possible to keep both the merit of using only first-order information and that of being close to dimension-free using a simple, gradient-based algorithm without a nested-loop structure [22]. This paper answers this question in the affirmative.
Contributions.
Our main contribution is a simple, single-loop, and robust gradient-based algorithm that can find an $\epsilon$-approximate second-order stationary point of a smooth, Hessian Lipschitz function $f\colon\mathbb{R}^n\to\mathbb{R}$. Compared to previous works [3, 24, 29] exploiting the idea of a gradient-based Hessian power method, our algorithm has a single-loop, simpler structure and better numerical stability. Compared to the previous state-of-the-art results with single-loop structures by [21] and [19, 20], using $\tilde{O}(\log^{6} n/\epsilon^{1.75})$ or $\tilde{O}(\log^{4} n/\epsilon^{2})$ iterations respectively, our algorithm achieves a polynomial speedup in $\log n$:
Theorem 1 (informal).
Our single-loop algorithm finds an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations.
Technically, our work is inspired by the perturbed gradient descent (PGD) algorithm in [19, 20] and the perturbed accelerated gradient descent (PAGD) algorithm in [21]. Specifically, PGD applies gradient descent iteratively until it reaches a point with small gradient, which is a potential saddle point. Then PGD generates a uniform perturbation in a small ball centered at that point and continues the GD procedure. It is demonstrated that, with an appropriate choice of the perturbation radius, PGD can shake the iterate out of the neighborhood of the saddle point and converge to a second-order stationary point with high probability. PAGD [21] adopts a similar perturbation idea, but the GD steps are replaced by Nesterov's AGD [26].
Our algorithm is built upon PGD and PAGD, with one main modification to the perturbation step: it is more efficient to add a perturbation along the negative curvature direction near the saddle point, rather than the uniform perturbation used in PGD and PAGD, which is a compromise made because the Hessian at the saddle is in general too expensive to access. Our key observation is that we do not have to compute the entire Hessian to detect the negative curvature. Instead, in a small neighborhood of a saddle point, gradients can be viewed as Hessian-vector products plus some bounded deviation. In particular, a GD step near the saddle with learning rate $\eta\le 1/\ell$ is approximately one step of the power method applied to the matrix $I-\eta\nabla^2 f$. As a result, the most negative eigenvalues stand out because they have the leading exponents in the power method, and thus the iterate approximately moves along the direction of the most negative curvature near the saddle point. Following this approach, we can escape saddle points more rapidly than previous algorithms: near a saddle point, PGD and PAGD need a certain number of iterations to achieve a function value decrease with high probability; in contrast, we can first take a comparable number of iterations to identify a negative curvature direction, and then add a larger perturbation along this direction to obtain a larger function value decrease. See Proposition 3 and Proposition 5. After escaping the saddle point, similar to PGD and PAGD, we switch back to GD and AGD iterations, which decrease the function value efficiently when the gradient is large [19, 20, 21].
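To make the power-method intuition concrete, the toy sketch below (an illustrative quadratic saddle with constants of our own choosing) iterates the map $y\mapsto (I-\eta H)y$, which is exactly what a GD step with learning rate $\eta$ does on the local quadratic approximation; the iterate aligns with the most negative eigendirection.

```python
import numpy as np

# Toy quadratic saddle: Hessian with a single negative eigenvalue (illustrative).
H = np.diag([2.0, 1.0, -0.5])
eta = 0.5                              # learning rate, at most 1/ell for an ell-smooth f
rng = np.random.default_rng(0)
y = rng.standard_normal(3)
y /= np.linalg.norm(y)

for _ in range(200):
    y = y - eta * (H @ y)              # one GD step = one power iteration on (I - eta*H)
    y /= np.linalg.norm(y)             # renormalize for numerical stability

# y now approximates the eigenvector of the most negative eigenvalue,
# i.e., the negative curvature direction (the third coordinate axis here).
print(np.round(y, 3))
```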
Our algorithm is also applicable to the stochastic setting where we can only access stochastic gradients, and the stochasticity is not under the control of our algorithm. We further assume that the stochastic gradients are Lipschitz (or equivalently, that the underlying functions are gradient-Lipschitz; see Assumption 2), which is also adopted in most of the existing works; see e.g. [8, 19, 20, 34]. We demonstrate that a simple extended version of our algorithm detects a negative curvature direction using only stochastic gradients and then obtains a comparable function value decrease with high probability, using polynomially fewer iterations in $\log n$ than the perturbed stochastic gradient descent (PSGD) algorithm in [19, 20], the stochastic version of PGD.
Theorem 2 (informal).
In the stochastic setting, our algorithm finds an $\epsilon$-approximate second-order stationary point using $\tilde{O}(\log^{2} n/\epsilon^{4})$ iterations of stochastic gradients.
Our results are summarized in Table 1. Although the underlying dynamics in [3, 24, 29] and in our algorithm are similar, the main focus of our work is different. Specifically, Refs. [3, 24, 29] mainly aim at using novel techniques to reduce the iteration complexity for finding a second-order stationary point, whereas our work mainly focuses on reducing the number of loops and hyperparameters of negative curvature finding methods while preserving their advantage in iteration complexity, since a much simpler structure accords with empirical observations and enables wider applications. Moreover, the choice of perturbation in [3] is based on Chebyshev approximation theory, which may require additional nested-loop structures to boost the success probability. In the stochastic setting, there are also other results studying nonconvex optimization [15, 23, 31, 36, 12, 32, 35] from perspectives other than escaping saddle points, which are incomparable to our results.
Setting | Reference | Oracle | Iterations | Simplicity
---|---|---|---|---
Non-stochastic | [1, 6] | Hessian-vector product | | Nested-loop
Non-stochastic | [19, 20] | Gradient | | Single-loop
Non-stochastic | [21] | Gradient | | Single-loop
Non-stochastic | [3, 24, 29] | Gradient | | Nested-loop
Non-stochastic | this work | Gradient | | Single-loop
Stochastic | [19, 20] | Gradient | | Single-loop
Stochastic | [9] | Gradient | | Single-loop
Stochastic | [3] | Gradient | | Nested-loop
Stochastic | [8] | Gradient | | Nested-loop
Stochastic | this work | Gradient | | Single-loop
It is worth highlighting that our gradient-descent based algorithm enjoys the following nice features:
- Simplicity: Some of the previous algorithms have nested-loop structures, with attendant concerns about setting their hyperparameters in practice. In contrast, our algorithm based on negative curvature finding contains only a single loop with two components: gradient descent (including AGD or SGD) and perturbation. As mentioned above, such a simple structure is preferred in machine learning, which increases the likelihood that our algorithm finds real-world applications.
- Numerical stability: Our algorithm contains an additional renormalization step at each iteration when escaping from saddle points. Although in theory a renormalization step affects neither the output nor the complexity of our algorithm, when finding negative curvature near saddle points it enables us to sample gradients in a larger region, which makes our algorithm more numerically stable against floating-point and other errors. The introduction of the renormalization step is enabled by the simple structure of our algorithm, and may not be feasible for more complicated algorithms [3, 24, 29].
- Robustness: Our algorithm is robust against adversarial attacks on the evaluation of the gradient. Specifically, when analyzing the performance of our algorithm near saddle points, we essentially view the deviation from the pure quadratic geometry as external noise. Hence, the effectiveness of our algorithm is unaffected by external attacks as long as the adversarial perturbations are bounded by the deviation from the quadratic landscape.
Finally, we perform numerical experiments that support our polynomial speedup in $\log n$. We run our negative curvature finding algorithms using GD or SGD on various landscapes and general classes of nonconvex functions, and use comparative studies to show that our Algorithm 1 and Algorithm 3 achieve a higher probability of escaping saddle points using far fewer iterations than PGD and PSGD, respectively. Moreover, we perform numerical experiments benchmarking the solution quality and iteration complexity of our algorithm against accelerated methods. Compared to PAGD [21] and even advanced optimization algorithms such as NEON+ [29], Algorithm 2 possesses better solution quality and iteration complexity on various landscapes given by more general nonconvex functions: with fewer iterations than PAGD and NEON+, our Algorithm 2 achieves a higher probability of escaping from saddle points.
Open questions.
This work leaves a couple of natural open questions for future investigation:
- •
- •
Broader impact.
This work focuses on the theory of nonconvex optimization, and as far as we see, we do not anticipate its potential negative societal impact. Nevertheless, it might have a positive impact for researchers who are interested in understanding the theoretical underpinnings of (stochastic) gradient descent methods for machine learning applications.
Organization.
In Section 2, we introduce our gradient-based Hessian power method algorithm for negative curvature finding, and present how our algorithms provide a polynomial speedup in $\log n$ over both PGD and PAGD. In Section 3, we present the stochastic version of our negative curvature finding algorithm using stochastic gradients and demonstrate its polynomial speedup in $\log n$ over PSGD. Numerical experiments are presented in Section 4. We provide detailed proofs and additional numerical experiments in the supplementary material.
2 A simple algorithm for negative curvature finding
We show how to find negative curvature near a saddle point using a gradient-based Hessian power method, and extend it to a version with a faster convergence rate by replacing gradient descent with accelerated gradient descent. The intuition is as follows: in a small enough region near a saddle point, the gradient can be approximately expressed as a Hessian-vector product, and the approximation error can be efficiently upper bounded; see Eq. (6). Hence, using only gradient information, we can implement an accurate enough Hessian power method to find the eigenvectors corresponding to the negative eigenvalues of the Hessian, and thus the negative curvature direction near the saddle.
2.1 Negative curvature finding based on gradient descents
We first present an algorithm for negative curvature finding based on gradient descents. Specifically, for any with , it finds a unit vector such that .
Proposition 3.
Suppose the function is -smooth and -Hessian Lipschitz. For any , we specify our choice of parameters and constants we use as follows:
(2) |
Suppose that satisfies . Then with probability at least , Algorithm 1 outputs a unit vector satisfying
(3) |
using iterations, where stands for the Hessian matrix of function .
Proof.
Without loss of generality, we assume the point under consideration is the origin, by shifting $\mathbb{R}^n$ so that it is mapped to $0$. Define a new $n$-dimensional function
(4) |
for the ease of our analysis. Since the additional linear term has zero Hessian, the Hessian of the new function equals the Hessian of $f$, and the new function is also $\ell$-smooth and $\rho$-Hessian Lipschitz. In addition, note that its gradient vanishes at the origin. Then for all ,
(5) |
Furthermore, due to the -Hessian Lipschitz condition of both and , for any we have , which leads to
(6) |
Observe that the Hessian matrix admits the following eigen-decomposition:
(7) |
where the set forms an orthonormal basis of . Without loss of generality, we assume the eigenvalues corresponding to satisfy
(8) |
in which . If , Proposition 3 holds directly. Hence, we only need to prove the case where , in which there exists some , with
(9) |
We use , to separately denote the subspace of spanned by , , and use , to denote the subspace of spanned by , . Furthermore, we define
(10) | ||||
(11) |
respectively to denote the component of in Line 1 in the subspaces , , , , and let . Observe that
(12) |
where denotes the component of along . Consider the case where , which can be achieved with probability
(13) |
We prove that there exists some with such that
(14) |
Assume the contrary: for any , we have . Then satisfies the following recurrence formula:
(15) |
where
(16) |
stands for the deviation from pure quadratic approximation and due to (6). Hence,
(17) |
which leads to
(18) |
Similarly, we can have the recurrence formula for :
(19) |
Considering that due to (6), we can further have
(20) |
Intuitively, should have a faster increasing rate than in this gradient-based Hessian power method, ignoring the deviation from the quadratic approximation. As a result, the value should be non-decreasing. It is demonstrated in Lemma 18 in Appendix B that, even if we take this deviation into account, can still be lower bounded by some constant :
(21) |
by which we can further deduce that
(22) |
Observe that
(23) |
Since , we have , a contradiction. Hence, there exists some with such that . Considering the normalized vector , we use and to separately denote its components in and . Then, whereas . Therefore,
(24) |
since and . Due to the $\ell$-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value upper bounded by $\ell$. Hence,
(25) |
Further according to the definition of , we have
(26) |
Combining these two inequalities together, we can obtain
(27) |
∎
Remark 4.
In practice, the norm of can become large during the execution of Algorithm 1. To fix this, we can renormalize it to a fixed $\ell_2$-norm at the end of such iterations, and this does not influence the performance of the algorithm.
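For concreteness, a hedged sketch of this negative curvature finding routine is given below. It follows the description in this subsection (a random perturbation of radius r around the point, GD-style updates that use only gradient differences as Hessian-vector products, and the renormalization of Remark 4), but the constants are placeholders rather than the precise choices in (2), so it should be read as an illustration of Algorithm 1 rather than a faithful reproduction.

```python
import numpy as np

def negative_curvature_finding(grad, x, ell, r, num_iters, rng=None):
    """Sketch: return a unit vector roughly aligned with the most negative
    curvature direction of the Hessian at x, using only gradient queries."""
    rng = rng or np.random.default_rng()
    g_x = grad(x)                              # gradient at the (potential) saddle point
    y = rng.standard_normal(x.shape)
    y *= r / np.linalg.norm(y)                 # random perturbation of radius r
    for _ in range(num_iters):
        # grad(x + y) - grad(x) approximates Hessian(x) @ y up to a bounded deviation
        hvp_est = grad(x + y) - g_x
        y = y - (1.0 / ell) * hvp_est          # gradient-based Hessian power iteration
        y *= r / np.linalg.norm(y)             # renormalization step (Remark 4)
    return y / np.linalg.norm(y)
```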
2.2 Faster negative curvature finding based on accelerated gradient descents
In this subsection, we replace the GD part in Algorithm 1 by AGD to obtain an accelerated negative curvature finding subroutine with similar effect and faster convergence rate, based on which we further implement our Accelerated Gradient Descent with Negative Curvature Finding (Algorithm 2). Near any saddle point with , Algorithm 2 finds a unit vector such that .
The following proposition exhibits the effectiveness of Algorithm 2 for finding negative curvatures near saddle points:
Proposition 5.
Suppose the function is -smooth and -Hessian Lipschitz. For any , we specify our choice of parameters and constants we use as follows:
(28) |
Then for a point satisfying , if we run Algorithm 2 with the uniform perturbation in Line 2 added at that point, the unit vector obtained in Line 2 after iterations satisfies:
(29) |
The proof of Proposition 5 is similar to the proof of Proposition 3, and is deferred to Appendix B.2.
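For intuition, the sketch below applies a Nesterov-style momentum update to the same gradient-difference dynamics. The actual Algorithm 2 interleaves this with the AGD iterations of the main loop and uses the parameter choices in (28); the momentum parameter, step size, and radius here are placeholders of our own choosing.

```python
import numpy as np

def accelerated_ncf(grad, x, ell, r, theta, num_iters, rng=None):
    """Sketch: AGD-style negative curvature finding using only gradient differences."""
    rng = rng or np.random.default_rng()
    g_x = grad(x)
    y = rng.standard_normal(x.shape)
    y *= r / np.linalg.norm(y)
    y_prev = y.copy()
    for _ in range(num_iters):
        v = y + (1.0 - theta) * (y - y_prev)               # momentum (look-ahead) point
        y_next = v - (1.0 / (4.0 * ell)) * (grad(x + v) - g_x)
        y_prev, y = y, y_next
        scale = r / np.linalg.norm(y)
        y, y_prev = y * scale, y_prev * scale              # renormalize both iterates
    return y / np.linalg.norm(y)
```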
2.3 Escaping saddle points using negative curvature finding
In this subsection, we demonstrate that our Algorithm 1 and Algorithm 2 with the ability to find negative curvature near saddle points can further escape saddle points of nonconvex functions. The intuition works as follows: we start with gradient descents or accelerated gradient descents until the gradient becomes small. At this position, we compute the negative curvature direction, described by a unit vector , via Algorithm 1 or the negative curvature finding subroutine of Algorithm 2. Then, we add a perturbation along this direction of negative curvature and go back to gradient descents or accelerated gradient descents with an additional NegativeCurvatureExploitation subroutine (see Footnote 3). It has the following guarantee:
Lemma 6.
Suppose the function is -smooth and -Hessian Lipschitz. Then for any point , if there exists a unit vector satisfying where stands for the Hessian matrix of function , the following inequality holds:
(30) |
where stands for the gradient component of along the direction of .
Proof.
Without loss of generality, we assume . We can also assume ; if this is not the case we can pick instead, which still satisfies . In practice, to figure out whether we should use or , we apply both of them in (30) and choose the one with smaller function value. Then, for any with some , we have due to the -Hessian Lipschitz condition of . Hence,
(31) |
by which we can further derive that
(32) |
Setting gives (30). ∎
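In code, the sign choice discussed in the proof simply evaluates both candidate perturbations and keeps the better one; a minimal sketch is below, with the step length `t` left as a parameter that the analysis sets according to Lemma 6.

```python
import numpy as np

def perturb_along_negative_curvature(f, x, e_hat, t):
    """Sketch: move distance t along +e_hat or -e_hat, whichever decreases f more."""
    x_plus, x_minus = x + t * e_hat, x - t * e_hat
    return x_plus if f(x_plus) <= f(x_minus) else x_minus
```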
We give the full algorithm details based on Algorithm 1 in Appendix C.1. Along this approach, we achieve the following:
Theorem 7 (informal, full version deferred to Appendix C.3).
For any and a constant , Algorithm 2 satisfies that at least one of the iterations will be an -approximate second-order stationary point in
(33) |
iterations, with probability at least , where is the global minimum of .
Intuitively, the proof of Theorem 7 has two parts. The first part is similar to the proof of [21, Theorem 3], which shows that PAGD uses iterations to escape saddle points. We show that there can be at most iterations with the norm of gradient larger than using almost the same techniques, but with slightly different parameter choices. The second part is based on the negative curvature part of Algorithm 2, our accelerated negative curvature finding algorithm. Specifically, at each saddle point we encounter, we can take iterations to find its negative curvature (Proposition 5), and add a perturbation in this direction to decrease the function value by (Lemma 6). Hence, the iterations introduced by Algorithm 4 can be at most , which is simply an upper bound on the overall iteration number. The detailed proof is deferred to Appendix C.3.
Remark 8.
Although Theorem 7 only demonstrates that our algorithms will visit some $\epsilon$-approximate second-order stationary point during their execution with high probability, it is straightforward to identify one of them if we add a termination condition: once Negative Curvature Finding (Algorithm 1 or Algorithm 2) is applied, we record the position and the function value decrease due to the subsequent perturbation. If the function value decrease is larger than the threshold given by Lemma 6, then the algorithm makes progress. Otherwise, the recorded position is an $\epsilon$-approximate second-order stationary point with high probability.
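Putting the pieces together, the following sketch outlines the GD-based variant (the full Algorithm 5 of Appendix C.1) together with the termination rule of Remark 8. It reuses the helper sketches above; the step sizes, radii, and the decrease threshold are placeholders rather than the parameter choices used in the analysis.

```python
import numpy as np

def escape_saddle_points(f, grad, x0, ell, eps, r, t_perturb,
                         ncf_iters, max_iters, decrease_threshold):
    """Sketch: GD while the gradient is large; near a potential saddle, find a
    negative curvature direction and perturb along it; stop once the perturbation
    fails to give enough decrease (Remark 8)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) > eps:
            x = x - (1.0 / ell) * g                       # ordinary gradient descent step
        else:
            e_hat = negative_curvature_finding(grad, x, ell, r, ncf_iters)
            x_new = perturb_along_negative_curvature(f, x, e_hat, t_perturb)
            if f(x) - f(x_new) < decrease_threshold:      # threshold set as in Lemma 6
                return x                                  # likely an approximate SOSP
            x = x_new
    return x
```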
3 Stochastic setting
In this section, we present a stochastic version of Algorithm 1 using stochastic gradients, and demonstrate that it can also be used to escape saddle points and obtain a polynomial speedup in $\log n$ compared to the perturbed stochastic gradient descent (PSGD) algorithm in [20].
3.1 Stochastic negative curvature finding
In the stochastic gradient descent setting, the exact gradient oracle of the function cannot be accessed. Instead, we only have access to unbiased stochastic gradients such that
(34) |
where stands for the probability distribution followed by the random variable . Define
(35) |
to be the error term of the stochastic gradient. Then, the expected value of this error vector at any point is zero. Further, we assume the stochastic gradient also satisfies the following assumptions, which were also adopted in the previous literature; see e.g. [8, 19, 20, 34].
Assumption 1.
For any , the stochastic gradient with satisfies:
(36) |
Assumption 2.
For any , is -Lipschitz for some constant :
(37) |
Assumption 2 emerges from the fact that the stochastic gradient is often obtained as an exact gradient of some smooth function,
(38) |
In this case, Assumption 2 guarantees that for any , the spectral norm of the Hessian of is upper bounded by . Under these two assumptions, we can construct the stochastic version of Algorithm 1, as shown in Algorithm 3.
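A hedged sketch of this stochastic variant is given below; it replaces the exact gradient differences with stochastic ones through a generic `sgrad` oracle (our notation for an unbiased stochastic gradient satisfying Assumptions 1 and 2) and keeps the renormalization, while the averaging and parameter choices of Proposition 24 are omitted.

```python
import numpy as np

def stochastic_ncf(sgrad, x, ell, r, num_iters, rng=None):
    """Sketch: negative curvature finding using only stochastic gradients."""
    rng = rng or np.random.default_rng()
    y = rng.standard_normal(x.shape)
    y *= r / np.linalg.norm(y)
    for _ in range(num_iters):
        # Each stochastic gradient difference approximates Hessian(x) @ y,
        # up to the gradient noise and the Hessian-Lipschitz deviation.
        hvp_est = sgrad(x + y) - sgrad(x)
        y = y - (1.0 / ell) * hvp_est
        y *= r / np.linalg.norm(y)          # renormalization (Line 3 of Algorithm 3)
    return y / np.linalg.norm(y)
```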
Similar to the non-stochastic setting, Algorithm 3 can be used to escape from saddle points and obtain a polynomial speedup in $\log n$ compared to the PSGD algorithm in [20]. This is shown quantitatively in the following theorem:
Theorem 9 (informal, full version deferred to Appendix D.2).
For any and a constant , our algorithm based on Algorithm 3 (which has similarities to the Neon2 algorithm in [3]: both algorithms find a second-order stationary point for stochastic optimization using a comparable number of iterations, and both apply directed perturbations based on the result of negative curvature finding; nevertheless, our algorithm enjoys simplicity by having only a single loop, whereas Neon2 has a nested loop for boosting its confidence), using only stochastic gradient descent, satisfies that at least one of the iterations will be an $\epsilon$-approximate second-order stationary point in
(39) |
iterations, with probability at least , where is the global minimum of .
4 Numerical experiments
In this section, we provide numerical results that exhibit the power of our negative curvature finding algorithms for escaping saddle points. More experimental results can be found in Appendix E. All experiments are performed in MATLAB R2019b on a computer with a six-core Intel Core i7 processor and 16 GB of memory, and the code is given in the supplementary material.
Comparison between Algorithm 1 and PGD.
We compare the performance of our Algorithm 1 with the perturbed gradient descent (PGD) algorithm in [20] on a test function with a saddle point at . The advantage of Algorithm 1 is illustrated in Figure 1.

Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 1 at time step and , and red points represent samplings of PGD at time step and . Algorithm 1 transforms an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PGD even when .
Right: A histogram of descent values obtained by Algorithm 1 and PGD, respectively. Set and . Although we run three times as many iterations for PGD, there are still over of gradient descent paths with function value decrease no greater than , while this ratio for Algorithm 1 is less than .
Comparison between Algorithm 3 and PSGD.
We compare the performance of our Algorithm 3 with the perturbed stochastic gradient descent (PSGD) algorithm in [20] on a test function . Compared to the landscape of the previous experiment, this function deviates more from a quadratic landscape due to the cubic terms. Nevertheless, Algorithm 3 still possesses a notable advantage over PSGD, as demonstrated in Figure 2.

Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 3 at time step and , and red points represent samplings of PSGD at time step and . Algorithm 3 transforms an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PSGD even when .
Right: A histogram of descent values obtained by Algorithm 3 and PSGD, respectively. Set and . Although we run two times as many iterations for PSGD, there are still over of SGD paths with function value decrease no greater than , while this ratio for Algorithm 3 is less than .
Acknowledgements
We thank Jiaqi Leng for valuable suggestions and generous help on the design of numerical experiments. TL was supported by the NSF grant PHY-1818914 and a Samsung Advanced Institute of Technology Global Research Partnership.
References
- [1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma, Finding approximate local minima faster than gradient descent, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199, 2017, arXiv:1611.01146.
- [2] Zeyuan Allen-Zhu, Natasha 2: Faster non-convex optimization than SGD, Advances in Neural Information Processing Systems, pp. 2675–2686, 2018, arXiv:1708.08694.
- [3] Zeyuan Allen-Zhu and Yuanzhi Li, Neon2: Finding local minima via first-order oracles, Advances in Neural Information Processing Systems, pp. 3716–3726, 2018, arXiv:1711.06673.
- [4] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro, Global optimality of local search for low rank matrix recovery, Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3880–3888, 2016, arXiv:1605.07221.
- [5] Alan J. Bray and David S. Dean, Statistics of critical points of Gaussian fields on large-dimensional spaces, Physical Review Letters 98 (2007), no. 15, 150201, arXiv:cond-mat/0611023.
- [6] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization 28 (2018), no. 2, 1751–1772, arXiv:1611.00756.
- [7] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Advances in neural information processing systems, pp. 2933–2941, 2014, arXiv:1406.2572.
- [8] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, Advances in Neural Information Processing Systems, pp. 689–699, 2018, arXiv:1807.01695.
- [9] Cong Fang, Zhouchen Lin, and Tong Zhang, Sharp analysis for nonconvex SGD escaping from saddle points, Conference on Learning Theory, pp. 1192–1234, 2019, arXiv:1902.00247.
- [10] Yan V. Fyodorov and Ian Williams, Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity, Journal of Statistical Physics 129 (2007), no. 5-6, 1081–1116, arXiv:cond-mat/0702601.
- [11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, Proceedings of the 28th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 40, pp. 797–842, 2015, arXiv:1503.02101.
- [12] Rong Ge, Chi Jin, and Yi Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, International Conference on Machine Learning, pp. 1233–1242, PMLR, 2017, arXiv:1704.00708.
- [13] Rong Ge, Jason D. Lee, and Tengyu Ma, Matrix completion has no spurious local minimum, Advances in Neural Information Processing Systems, pp. 2981–2989, 2016, arXiv:1605.07272.
- [14] Rong Ge, Jason D. Lee, and Tengyu Ma, Learning one-hidden-layer neural networks with landscape design, International Conference on Learning Representations, 2018, arXiv:1711.00501.
- [15] Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang, Stabilized SVRG: Simple variance reduction for nonconvex optimization, Conference on Learning Theory, pp. 1394–1448, PMLR, 2019, arXiv:1905.00529.
- [16] Rong Ge and Tengyu Ma, On the optimization landscape of tensor decompositions, Advances in Neural Information Processing Systems, pp. 3656–3666, Curran Associates Inc., 2017, arXiv:1706.05598.
- [17] Moritz Hardt, Tengyu Ma, and Benjamin Recht, Gradient descent learns linear dynamical systems, Journal of Machine Learning Research 19 (2018), no. 29, 1–44, arXiv:1609.05191.
- [18] Prateek Jain, Chi Jin, Sham Kakade, and Praneeth Netrapalli, Global convergence of non-convex gradient descent for computing matrix squareroot, Artificial Intelligence and Statistics, pp. 479–488, 2017, arXiv:1507.05854.
- [19] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan, How to escape saddle points efficiently, Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1724–1732, 2017, arXiv:1703.00887.
- [20] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I. Jordan, On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points, Journal of the ACM (JACM) 68 (2021), no. 2, 1–29, arXiv:1902.04811.
- [21] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, Conference on Learning Theory, pp. 1042–1085, 2018, arXiv:1711.10456.
- [22] Michael I. Jordan, On gradient-based optimization: Accelerated, distributed, asynchronous and stochastic optimization, https://www.youtube.com/watch?v=VE2ITg%5FhGnI, 2017.
- [23] Zhize Li, SSRGD: Simple stochastic recursive gradient descent for escaping saddle points, Advances in Neural Information Processing Systems 32 (2019), 1523–1533, arXiv:1904.09265.
- [24] Mingrui Liu, Zhe Li, Xiaoyu Wang, Jinfeng Yi, and Tianbao Yang, Adaptive negative curvature descent with applications in non-convex optimization, Advances in Neural Information Processing Systems 31 (2018), 4853–4862.
- [25] Yurii Nesterov and Boris T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming 108 (2006), no. 1, 177–205.
- [26] Yurii E. Nesterov, A method for solving the convex programming problem with convergence rate $O(1/k^2)$, Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
- [27] Ju Sun, Qing Qu, and John Wright, A geometric analysis of phase retrieval, Foundations of Computational Mathematics 18 (2018), no. 5, 1131–1198, arXiv:1602.06664.
- [28] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, Advances in neural Information Processing Systems, pp. 2899–2908, 2018, arXiv:1711.02838.
- [29] Yi Xu, Rong Jin, and Tianbao Yang, NEON+: Accelerated gradient methods for extracting negative curvature for non-convex optimization, 2017, arXiv:1712.01033.
- [30] Yi Xu, Rong Jin, and Tianbao Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, Advances in Neural Information Processing Systems, pp. 5530–5540, 2018, arXiv:1711.01944.
- [31] Yaodong Yu, Pan Xu, and Quanquan Gu, Third-order smoothness helps: Faster stochastic optimization algorithms for finding local minima, Advances in Neural Information Processing Systems (2018), 4530–4540.
- [32] Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu, A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery, International Conference on Machine Learning, pp. 5862–5871, PMLR, 2018.
- [33] Dongruo Zhou and Quanquan Gu, Stochastic recursive variance-reduced cubic regularization methods, International Conference on Artificial Intelligence and Statistics, pp. 3980–3990, 2020, arXiv:1901.11518.
- [34] Dongruo Zhou, Pan Xu, and Quanquan Gu, Finding local minima via stochastic nested variance reduction, 2018, arXiv:1806.08782.
- [35] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic variance-reduced cubic regularized Newton methods, International Conference on Machine Learning, pp. 5990–5999, PMLR, 2018, arXiv:1802.04796.
- [36] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic nested variance reduction for nonconvex optimization, Journal of Machine Learning Research 21 (2020), no. 103, 1–63, arXiv:1806.07811.
Appendix A Auxiliary lemmas
In this section, we introduce auxiliary lemmas that are necessary for our proofs.
Lemma 10 ([20, Lemma 19]).
If is -smooth and -Hessian Lipschitz, , then the gradient descent sequence obtained by satisfies:
(40) |
for any step in which Negative Curvature Finding is not called.
Lemma 11 ([20, Lemma 23]).
For a -smooth and -Hessian Lipschitz function with its stochastic gradient satisfying Assumption 1, there exists an absolute constant such that, for any fixed , with probability at least , the stochastic gradient descent sequence in Algorithm 8 satisfies:
(41) |
if during , Stochastic Negative Curvature Finding has not been called.
Lemma 12 ([20, Lemma 29]).
Denote and . If , then we have
1. for any ;
2. .
Lemma 13 ([20, Lemma 30]).
Definition 14 ([20, Definition 32]).
A random vector is norm-subGaussian (or nSG()), if there exists so that:
(45) |
Lemma 15 ([20, Lemma 37]).
Given i.i.d. all being zero-mean nSG defined in Definition 14, then for any , and , there exists an absolute constant such that, with probability at least :
(46) |
The next two lemmas are frequently used in the large gradient scenario of the accelerated gradient descent method:
Lemma 16 ([21, Lemma 7]).
Consider the setting of Theorem 23, define a new parameter
(47) |
for some large enough constant . If for all , then there exists a large enough positive constant , such that if we choose , by running Algorithm 2 we have , in which , and is defined as:
(48) |
Lemma 17 ([21, Lemma 4 and Lemma 5]).
Assume that the function is -smooth. Consider the setting of Theorem 23, for every iteration that is not within steps after uniform perturbation, we have:
(49) |
Appendix B Proof details of negative curvature finding by gradient descents
B.1 Proof of Lemma 18
Lemma 18.
Under the setting of Proposition 3, we use to denote
(50) |
where defined in Eq. (10) is the component of in the subspace spanned by . Then, during all the iterations of Algorithm 1, we have for
(51) |
given that .
To prove Lemma 18, we first introduce the following lemma:
Lemma 19.
Under the setting of Proposition 3 and Lemma 18, for any and initial value , achieves its minimum possible value in the specific landscapes satisfying
(52) |
Proof.
We prove this lemma by contradiction. Suppose for some and initial value , achieves its minimum value in some landscape that does not satisfy Eq. (52). That is to say, there exists some such that .
We first consider the case . Since , we have . We use to denote the iteration sequence of in this landscape.
Consider another landscape with and all other eigenvalues being the same as , and we use to denote the iteration sequence of in this landscape. Furthermore, we assume that at each gradient query, the deviation from pure quadratic approximation defined in Eq. (16), denoted , is the same as the corresponding deviation in the landscape along all the directions other than . As for its component along , we set with its sign being the same as .
Under these settings, we have for any and . As for the component along , we have . Hence, by the definition of in Eq. (10), we have
(53) |
whereas
(54) |
indicating
(55) |
contradicting the supposition that achieves its minimum possible value in . Similarly, we can also obtain a contradiction for the case . ∎
Proof.
In this proof, we consider the worst case, where the initial value . Also, according to Lemma 19, we assume that the eigenvalues satisfy
(56) |
Moreover, we assume that each time we make a gradient call at some point , the deviation term from the pure quadratic approximation
(57) |
lies in the direction that can make as small as possible. Then, the component in should be in the direction of , and the component in should be in the opposite direction to , as long as . Hence in this case, we have being non-decreasing. Also, it admits the following recurrence formula:
(58) | ||||
(59) | ||||
(60) |
where the second inequality is due to the fact that can be an arbitrarily small positive number. Note that since is non-decreasing in this worst-case scenario, we have
(61) |
which leads to
(62) |
On the other hand, suppose for some value , we have with any . Then,
(63) | ||||
(64) |
Note that since , we have
(65) |
which leads to
(66) |
Then we can observe that
(67) |
where
(68) | ||||
(69) |
by which we can derive that
(70) | ||||
(71) |
indicating
(72) |
Hence, as long as for any , we can also have if . Since we have , we can thus claim that for any using recurrence. ∎
B.2 Proof of Proposition 5
To make it easier to analyze the properties and running time of Algorithm 2, we introduce a new Algorithm 4,
which has a more straightforward structure and has the same effect as Algorithm 2 near any saddle point . Quantitatively, this is demonstrated in the following lemma:
Lemma 20.
Under the setting of Proposition 5, suppose the perturbation in Line 2 of Algorithm 2 is added at . Then with the same value of , , and , the output of Algorithm 4 is the same as the unit vector in Algorithm 2 obtained steps later. In other words, if we separately denote the iteration sequence of in Algorithm 2 and Algorithm 4 as
(73) |
we have
(74) |
Proof.
Without loss of generality, we assume and . We prove the desired relationship by induction. Suppose the following identities hold for all for some natural number :
(75) |
Then,
(76) |
Adopting the notation in Algorithm 2, we use and to separately denote the value of and before renormalization:
(77) |
Then,
(78) |
which further leads to
(79) |
Note that , we thus have
(80) |
Hence,
(81) |
Since (75) holds for , we can now claim that it also holds for . ∎
Lemma 20 shows that Algorithm 4 also works in principle for finding the negative curvature near any saddle point . However, Algorithm 4 may produce iterates of exponentially large norm during its execution, and it is hard to merge with the AGD algorithm used in the large gradient scenario. Hence, only Algorithm 2 is applicable in practical situations.
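To see the numerical point, the toy comparison below (an illustrative two-dimensional quadratic with constants of our own choosing) runs the same power-method dynamics with and without renormalization: the un-renormalized, Algorithm 4-style iterate grows geometrically, while the renormalized one stays on a sphere of radius r.

```python
import numpy as np

H = np.diag([1.0, -0.5])            # illustrative Hessian with one negative eigenvalue
ell, r = 1.0, 1e-3
y0 = np.array([1e-3, 1e-3])
y_plain, y_renorm = y0.copy(), y0.copy()

for _ in range(100):
    y_plain = y_plain - (1.0 / ell) * (H @ y_plain)    # no renormalization (Algorithm 4 style)
    y_renorm = y_renorm - (1.0 / ell) * (H @ y_renorm)
    y_renorm *= r / np.linalg.norm(y_renorm)           # renormalized (Algorithm 2/3 style)

print(np.linalg.norm(y_plain))    # grows like 1.5**t here: numerically dangerous
print(np.linalg.norm(y_renorm))   # stays at r
```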
Use to denote the Hessian matrix of at . Observe that admits the following eigen-decomposition:
(82) |
where the set forms an orthonormal basis of . Without loss of generality, we assume the eigenvalues corresponding to satisfy
(83) |
in which . If , Proposition 5 holds directly, since no matter the value of , we can have . Hence, we only need to prove the case where , where there exists some with
(84) |
We use to denote the subspace of spanned by , and use to denote the subspace spanned by . Then we can have the following lemma:
Lemma 21.
Under the setting of Proposition 5, we use to denote
(85) |
in which is the component of in the subspace . Then, during all the iterations of Algorithm 4, we have for
(86) |
given that .
Proof.
Without loss of generality, we assume and . In this proof, we consider the worst case, where the initial value and the component along equal 0. In addition, by the same argument as in the proof of Lemma 19, we assume that the eigenvalues satisfy
(87) |
For the same reason, we assume that each time we make a gradient call at point , the deviation term from the pure quadratic approximation
(88) |
lies in the direction that can make as small as possible. Then, the component in should be in the opposite direction to , and the component in should be in the direction of . Hence in this case, we have both and being non-decreasing. Also, it admits the following recurrence formula:
(89) |
Since is non-decreasing in this worst-case scenario, we have
(90) |
which leads to
(91) |
On the other hand, suppose for some value , we have with any . Then,
(92) | ||||
(93) |
Note that since for all , we also have
(94) |
which further leads to
(95) |
which leads to
(96) |
Consider the sequences with recurrence that can be written as
(97) |
for some . Its characteristic equation can be written as
(98) |
whose roots satisfy
(99) |
indicating
(100) |
where , for constants and being
(101) |
Then by the inequalities (91) and (96), as long as for any , the values and satisfy
(102) | ||||
(103) |
and
(104) | ||||
(105) |
where
(106) | ||||||||
(107) |
Furthermore, we can derive that
(108) | ||||
(109) | ||||
(110) |
and
(111) | ||||
(112) | ||||
(113) | ||||
(114) | ||||
(115) |
Then we can observe that
(116) |
where
(117) | ||||
(118) | ||||
(119) |
and
(120) | ||||
(121) | ||||
(122) | ||||
(123) |
by which we can derive that
(124) | ||||
(125) | ||||
(126) |
indicating
(127) |
Hence, as long as for any , we can also have if . Since we have and , we can claim that for any using recurrence. ∎
Equipped with Lemma 21, we are now ready to prove Proposition 5.
Proof.
By Lemma 20, the unit vector in Line 2 of Algorithm 2 obtained after iterations equals the output of Algorithm 4 starting from . Hence, in this proof we consider the output of Algorithm 4 instead of the original Algorithm 2.
If , Proposition 5 holds directly. Hence, we only need to prove the case where , in which there exists some with
(128) |
We use , to denote the subspace of spanned by , . Furthermore, we define , , , respectively to denote the component of and in Algorithm 4 in the subspaces , , and let . Consider the case where , which can be achieved with probability
(129) |
we prove that there exists some with such that
(130) |
Assume the contrary: for any with , we have and . Focus on the case where , the component of in subspace , achieves the largest possible value. Then in this case, we have the following recurrence formula:
(131) |
Since for any , we can derive that
(132) |
which leads to
(133) | ||||
(134) |
Using similar characteristic function techniques shown in the proof of Lemma 21, it can be further derived that
(135) |
for and , given and for any . Due to Lemma 21,
(136) |
and it is demonstrated in the proof of Lemma 21 that,
(137) |
for and . Observe that
(138) | ||||
(139) |
where
(140) |
and
(141) | ||||
(142) | ||||
(143) | ||||
(144) |
Hence,
(145) |
Since , we have , a contradiction. Hence, there exists some with such that . Considering the normalized vector , we use and to separately denote its components in and . Then, whereas . Therefore,
(146) |
since and , it can be further simplified to
(147) |
Due to the $\ell$-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value upper bounded by $\ell$. Hence,
(148) |
Further according to the definition of , we have
(149) |
Combining these two inequalities together, we can obtain
(150) |
∎
Appendix C Proof details of escaping from saddle points by negative curvature finding
C.1 Algorithms for escaping from saddle points using negative curvature finding
In this subsection, we first present the algorithm for escaping from saddle points using Algorithm 1, given as Algorithm 5.
Observe that Algorithm 5 and Algorithm 2 are similar to perturbed gradient descent and perturbed accelerated gradient descent, but with the uniform perturbation step replaced by our negative curvature finding algorithms. One may worry that Algorithm 5 seems to involve nested loops, since the negative curvature finding algorithm is contained in the primary loop, contradicting our previous claim that Algorithm 5 only contains a single loop. In fact, Algorithm 5 contains only two operations: gradient descent steps and one perturbation step, the same operations as outside the negative curvature finding algorithm. Hence, Algorithm 5 is essentially a single-loop algorithm, and we count its iteration number as the total number of gradient calls.
C.2 Proof details of escaping saddle points using Algorithm 1
In this subsection, we prove:
Theorem 22.
For any and , Algorithm 5 with parameters chosen as in Proposition 3 satisfies that at least of its iterations will be -approximate second-order stationary points, using
iterations, with probability at least , where is the global minimum of .
Proof.
Let the parameters be chosen according to (2), and set the total step number to be:
(151) |
similar to the perturbed gradient descent algorithm [20, Algorithm 4]. We first assume that each time we apply negative curvature finding (Algorithm 1), with its parameter chosen as
(152) |
we can successfully obtain a unit vector with , as long as . The error probability of this assumption is provided later.
Under this assumption, Algorithm 1 can be called at most times, for otherwise the function value decrease would be greater than , which is impossible. Then, the probability that some call to Algorithm 1 fails is upper bounded by
(153) |
The rest of the iterations, in which Algorithm 1 is not called, are either large gradient steps, , or -approximate second-order stationary points. Among them, the number of large gradient steps cannot be more than , because otherwise, by Lemma 10 in Appendix A:
a contradiction. Therefore, we conclude that at least of the iterations must be -approximate second-order stationary points, with probability at least .
The number of iterations can be viewed as the sum of two parts: the number of iterations needed for gradient descent, denoted by , and the number of iterations needed for negative curvature finding, denoted by . With probability at least ,
(154) |
As for , with probability at least , Algorithm 1 is called for at most times, and by Proposition 3 it takes iterations each time. Hence,
(155) |
As a result, the total iteration number is
(156) |
∎
C.3 Proof details of escaping saddle points using Algorithm 2
We first present here the Negative Curvature Exploitation algorithm proposed in [21, Algorithm 3], which appears in Line 2 of Algorithm 2:
Now, we give the full version of Theorem 7 as follows:
Theorem 23.
Suppose that the function is -smooth and -Hessian Lipschitz. For any and a constant , we choose the parameters appearing in Algorithm 2 as follows:
(157) | ||||||||
(158) | ||||||||
(159) |
where and is the global minimum of , and the constant is chosen large enough to satisfy both the condition in Lemma 16 and . Then, Algorithm 2 satisfies that at least one of the iterations will be an -approximate second-order stationary point in
(160) |
iterations, with probability at least .
Proof.
Set the total step number to be:
(161) |
where is as defined in Lemma 16, similar to the perturbed accelerated gradient descent algorithm [21, Algorithm 2]. We first assert that for each iteration at which a uniform perturbation is added, after iterations we can successfully obtain a unit vector with , as long as . The error probability of this assertion is bounded later.
Under this assumption, the uniform perturbation can be added at most times, for otherwise the function value decrease would be greater than , which is impossible. Then, the probability that at least one negative curvature finding subroutine after a uniform perturbation fails is upper bounded by
(162) |
The remaining steps, which are not within steps after a uniform perturbation, are either large gradient steps, , or -approximate second-order stationary points. Next, we demonstrate that at least one of these steps is an -approximate second-order stationary point.
Assume the contrary. We use to denote the number of disjoint time periods of length larger than that contain only large gradient steps and do not contain any step within steps after a uniform perturbation. Then, it satisfies
(163) |
By Lemma 16, during these time intervals the Hamiltonian decreases in total by at least , which is impossible: by Lemma 17, the Hamiltonian decreases monotonically for every step except the steps after a uniform perturbation, and the overall decrease cannot be greater than . This is a contradiction. Therefore, we conclude that at least one of the iterations must be an -approximate second-order stationary point, with probability at least . ∎
Appendix D Proofs of the stochastic setting
D.1 Proof details of negative curvature finding using stochastic gradients
In this subsection, we demonstrate that Algorithm 3 can find a negative curvature efficiently. Specifically, we prove the following proposition:
Proposition 24.
Suppose the function is -smooth and -Hessian Lipschitz. For any , we specify our choice of parameters and constants we use as follows:
(164) | ||||||
(165) |
Then for any point satisfying , with probability at least , Algorithm 3 outputs a unit vector satisfying
(166) |
where stands for the Hessian matrix of function , using iterations.
Similarly to Algorithm 1 and Algorithm 2, the renormalization step in Line 3 of Algorithm 3 only guarantees that the iterate does not scale exponentially during the algorithm, and does not affect the output. We thus introduce the following Algorithm 7, the no-renormalization version of Algorithm 3, which possesses the same output and a simpler structure. Hence, in this subsection we analyze Algorithm 7 instead of Algorithm 3.
Without loss of generality, we assume the point under consideration is the origin, by shifting $\mathbb{R}^n$ so that it is mapped to $0$. As argued in the proof of Proposition 3, the Hessian admits the following eigen-decomposition:
(167) |
where the set forms an orthonormal basis of . Without loss of generality, we assume the eigenvalues corresponding to satisfy
(168) |
where . If , Proposition 24 holds directly. Hence, we only need to prove the case where , where there exists some and with
(169) |
Notation:
Throughout this subsection, let . Use , to separately denote the subspace of spanned by , , and use , to denote the subspace of spanned by , . Furthermore, define , , , respectively to denote the component of in Line 7 of Algorithm 7 in the subspaces , , , , and let .
To prove Proposition 24, we first introduce the following lemma:
Lemma 25.
Under the setting of Proposition 24, for any point satisfying , with probability at least , Algorithm 3 outputs a unit vector satisfying
(170) |
using iterations.
D.1.1 Proof of Lemma 25
In the proof of Lemma 25, we consider the worst case, where is the only eigenvalue less than , and all other eigenvalues equal to for an arbitrarily small constant . Under this scenario, the component is as small as possible at each time step.
The following lemma characterizes the dynamics of Algorithm 7:
Lemma 26.
Consider the sequence and let . Further, for any we define
(171) |
to be the errors caused by the stochastic gradients. Then , where:
(172) |
for , and
(173) |
Proof.
Without loss of generality we assume . The update formula for can be written as
(174) |
where
(175) |
indicating
(176) | ||||
(177) |
which finishes the proof. ∎
At a high level, under our parameter choice in Proposition 24, is the dominating term controlling the dynamics, and will be small compared to . Quantitatively, this is shown in the following lemma:
Lemma 27.
Under the setting of Proposition 24 while using the notation in Lemma 12 and Lemma 26, we have
(178) |
where .
Proof.
Divide into two parts:
(179) |
and
(180) |
Then by Lemma 13, we have
(181) |
and
(182) |
Similarly,
(183) |
and
(184) |
where . Set . Then for all , we have
(185) |
which further leads to
(186) |
Next, we use induction to prove that the following inequality holds for all :
(187) |
For the base case , the claim holds trivially. Suppose it holds for all for some . Then due to Lemma 13, with probability at least , we have
(188) |
By the Hessian Lipschitz property, satisfies:
(189) |
Hence,
(190) | ||||
(191) | ||||
(192) | ||||
(193) |
As for , note that satisfies the norm-subGaussian property defined in Definition 14. Specifically, . By applying Lemma 15 with and , with probability at least
(194) |
we have
(195) |
Then by union bound, with probability at least
(196) |
we have
(197) |
indicating that (187) holds. Then with probability at least
(198) |
we have
(199) |
Based on this, we prove that the following inequality holds for any :
(200) |
We again prove this by induction. Note that the base case is guaranteed by (187). Suppose it holds for all for some . Then with probability at least , we have
(201) | ||||
(202) | ||||
(203) |
Then following a similar procedure as before, we can claim that
(204) |
holds with probability
(205) |
indicating that (200) holds. Then under our choice of parameters, the desired inequality
(206) |
holds with probability at least . ∎
Proof.
First note that under our choice of , we have
(207) |
Further by Lemma 27 and union bound, with probability at least ,
(208) |
For the output , observe that its component , since is the only component in subspace . Then with probability at least ,
(209) |
∎
D.1.2 Proof of Proposition 24
Based on Lemma 25, we present the proof of Proposition 24 as follows:
Proof.
By Lemma 25, the component of output satisfies
(210) |
Since , we can derive that
(211) |
Note that
(212) |
which can be further simplified to
(213) |
Due to the $\ell$-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value upper bounded by $\ell$. Hence,
(214) |
whereas
(215) |
Combining these two inequalities together, we can obtain
(216) |
∎
D.2 Proof details of escaping saddle points using Algorithm 3
In this subsection, we demonstrate that Algorithm 3 can be used to escape from saddle points in the stochastic setting. We first present the explicit Algorithm 8, and then introduce the full version of Theorem 9 with its proof.
Theorem 28 (Full version of Theorem 9).
Suppose that the function is -smooth and -Hessian Lipschitz. For any and a constant , we choose the parameters appearing in Algorithm 8 as
(217) | ||||||||
(218) |
where and is the global minimum of . Then, Algorithm 8 satisfies that at least of the iterations will be -approximate second-order stationary points, using
(219) |
iterations, with probability at least .
Proof.
Let the parameters be chosen according to (2), and set the total step number to be:
(220) |
We will show that the following two claims hold simultaneously with probability :
1. At most steps have gradients larger than ;
2. Algorithm 3 can be called for at most times.
Therefore, at least steps are -approximate second-order stationary points. We prove the two claims separately.
Claim 1.
Suppose that within steps, we have more than steps with gradients larger than . Then with probability ,
(221) |
a contradiction.
Claim 2.
We first assume that each time we apply negative curvature finding (Algorithm 3), we successfully obtain a unit vector with , as long as . The error probability of this assumption is provided later.
Under this assumption, Algorithm 3 can be called at most times, for otherwise the function value decrease would be greater than , which is impossible. Then, the probability that some call to Algorithm 3 fails is upper bounded by
(222) |
The number of iterations can be viewed as the sum of two parts: the number of iterations needed in the large gradient scenario, denoted by , and the number of iterations needed for negative curvature finding, denoted by . With probability at least ,
(223) |
As for , with probability at least , Algorithm 3 is called for at most times, and by Proposition 24 it takes iterations each time. Hence,
(224) |
As a result, the total iteration number is
(225) |
∎
Appendix E More numerical experiments
In this section, we present more numerical experiment results that support our theoretical claims from a few different perspectives compared to Section 4. Specifically, considering that the previous experiments all lie in a two-dimensional space, and that theoretically our algorithms have a better dependence on the problem dimension , it is reasonable to check the actual performance of our algorithm on high-dimensional test functions, which is presented in Appendix E.1. Then, in Appendix E.2, we introduce experiments on various landscapes that demonstrate the advantage of Algorithm 2 over PAGD [21]. Moreover, we compare the performance of our Algorithm 2 with the NEON+ algorithm [29] on a few test functions in Appendix E.3. More precisely, we compare the negative curvature extracting part of NEON+ with Algorithm 2 at saddle points in different types of nonconvex landscapes.
E.1 Dimension dependence
Recall that is the dimension of the problem. We choose a test function where is an -by- diagonal matrix: . The function has a saddle point at the origin, and only one negative curvature direction. Throughout the experiment, we set . For the sake of comparison, the iteration numbers are chosen in a manner such that the statistics of Algorithm 1 and PGD in each category of the histogram are of similar magnitude.

E.2 Comparison between Algorithm 2 and PAGD on various nonconvex landscapes
Quartic-type test function
Consider the test function with a saddle point at . The advantage of Algorithm 2 is illustrated in Figure 4.

Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time step and , and red points represent samplings of PAGD at time step and . Similarly to Algorithm 1, Algorithm 2 transforms an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PAGD even when .
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set and . Although we run two times as many iterations for PAGD, there are still over of PAGD paths with function value decrease no greater than , while this ratio for Algorithm 2 is less than .
Triangle-type test function.
Consider the test function with a saddle point at . The advantage of Algorithm 2 is illustrated in Figure 5.

Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time step and , and red points represent samplings of PAGD at time step and . Algorithm 2 converges faster than PAGD even when .
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set and . Although we run four times as many iterations for PAGD, there are still over of gradient descent paths with function value decrease no greater than , while this ratio for Algorithm 2 is less than .
Exponential-type test function.
Consider the test function with a saddle point at . The advantage of Algorithm 2 is illustrated in Figure 6.

Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time step and , and red points represent samplings of PAGD at time step and . Algorithm 2 converges faster than PAGD even when .
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set and . Although we run three times as many iterations for PAGD, its performance is still dominated by our Algorithm 2.
Compared to the previous experiment on Algorithm 1 and PGD shown in Figure 1 in Section 4, these experiments also demonstrate the faster convergence rates enjoyed by the general family of "momentum methods". Specifically, using fewer iterations, Algorithm 2 and PAGD achieve larger function value decreases than Algorithm 1 and PGD, respectively.
E.3 Comparison between Algorithm 2 and NEON+ on various nonconvex landscapes
Triangle-type test function.
Consider the test function with a saddle point at . The advantage of Algorithm 2 is illustrated in Figure 7.

Left: The contour of the landscape is placed on the background with labels being function values. Red points represent samplings of NEON+ at time step , and blue points represent samplings of Algorithm 2 at time step . Algorithm 2 and the negative curvature extracting part of NEON+ both transform an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature. Note that Algorithm 2 converges faster than NEON+ even when .
Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set and . Although we run two times as many iterations for NEON+, none of the NEON+ paths has a function value decrease greater than , while this ratio for Algorithm 2 is larger than .
Exponential-type test function.
Consider the test function with a saddle point at . The advantage of Algorithm 2 is illustrated in Figure 8

Left: The contour of the landscape is placed on the background with labels being function values. Red points represent samplings of NEON+ at time step , and blue points represent samplings of Algorithm 2 at time step . Algorithm 2 converges faster than NEON+ even when .
Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set and . Although we run two times as many iterations for NEON+, there are still over of NEON+ paths with function value decrease no greater than , while this ratio for Algorithm 2 is less than .
Compared to the previous experiments on Algorithm 2 and PAGD in Appendix E.2, these two experiments also reveal the faster convergence rate of both NEON+ and Algorithm 2 compared to PAGD [21] in small-gradient regions.