
Escape saddle points by a simple gradient-descent based algorithm

Chenyi Zhang1  Tongyang Li2,3,4
1 Institute for Interdisciplinary Information Sciences, Tsinghua University, China
2 Center on Frontiers of Computing Studies, Peking University, China
3 School of Computer Science, Peking University, China
4 Center for Theoretical Physics, Massachusetts Institute of Technology, USA
Corresponding author. Email: [email protected]
Abstract

Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with $\tilde{O}(\log^{4}n/\epsilon^{2})$ or $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ iterations, our algorithm is polynomially better in terms of $\log n$ and matches their complexities in terms of $1/\epsilon$. For the stochastic setting, our algorithm outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log^{2}n/\epsilon^{4})$ iterations. Technically, our main contribution is an idea of implementing a robust Hessian power method using only gradients, which can find negative curvature near saddle points and achieve the polynomial speedup in $\log n$ compared to the perturbed gradient descent methods. Finally, we also perform numerical experiments that support our results.

1 Introduction

Nonconvex optimization is a central research area in optimization theory, since many modern machine learning problems can be formulated as models with nonconvex loss functions, including deep neural networks, principal component analysis, tensor decomposition, etc. In general, finding a global minimum of a nonconvex function is NP-hard in the worst case. Instead, many theoretical works focus on finding a local minimum, because recent works (both empirical and theoretical) suggest that local minima are nearly as good as global minima for a significant number of well-studied machine learning problems; see e.g. [4, 11, 13, 14, 16, 17]. On the other hand, saddle points are major obstacles for solving these problems, not only because they are ubiquitous in high-dimensional settings where the directions for escaping may be few (see e.g. [5, 7, 10]), but also because saddle points can correspond to highly suboptimal solutions (see e.g. [18, 27]).

Hence, one of the most important topics in nonconvex optimization is to escape saddle points. Specifically, we consider a twice-differentiable function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ such that

  • $f$ is $\ell$-smooth: $\|\nabla f(\mathbf{x}_{1})-\nabla f(\mathbf{x}_{2})\|\leq\ell\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}$;

  • $f$ is $\rho$-Hessian Lipschitz: $\|\mathcal{H}(\mathbf{x}_{1})-\mathcal{H}(\mathbf{x}_{2})\|\leq\rho\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}$;

here $\mathcal{H}$ is the Hessian of $f$. The goal is to find an $\epsilon$-approximate second-order stationary point $\mathbf{x}_{\epsilon}$ (more generally, one can ask for an $(\epsilon_{1},\epsilon_{2})$-approximate second-order stationary point such that $\|\nabla f(\mathbf{x})\|\leq\epsilon_{1}$ and $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq-\epsilon_{2}$; the scaling in (1) was adopted as a standard in the literature [1, 6, 9, 19, 20, 21, 25, 28, 29, 30]):

$\|\nabla f(\mathbf{x}_{\epsilon})\|\leq\epsilon,\qquad\lambda_{\min}(\mathcal{H}(\mathbf{x}_{\epsilon}))\geq-\sqrt{\rho\epsilon}.$ (1)

In other words, at any $\epsilon$-approximate second-order stationary point $\mathbf{x}_{\epsilon}$, the gradient is small, with norm at most $\epsilon$, and the Hessian is close to positive semi-definite, with all its eigenvalues at least $-\sqrt{\rho\epsilon}$.
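As a purely conceptual illustration of condition (1), a low-dimensional sanity check in code could look as follows; the function name `is_approx_sosp` is ours, and the algorithms in this paper never form the Hessian explicitly, precisely to avoid the cost discussed below.

    import numpy as np

    def is_approx_sosp(grad, hess, eps, rho):
        # Condition (1): gradient norm at most eps and smallest Hessian eigenvalue
        # at least -sqrt(rho * eps).  Only a low-dimensional sanity check.
        lam_min = np.linalg.eigvalsh(hess)[0]
        return np.linalg.norm(grad) <= eps and lam_min >= -np.sqrt(rho * eps)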

Algorithms for escaping saddle points are mainly evaluated from two aspects. On the one hand, considering the enormous dimensions of machine learning models in practice, dimension-free or almost dimension-free (i.e., with $\operatorname{poly}(\log n)$ dependence) algorithms are highly preferred. On the other hand, recent empirical discoveries in machine learning suggest that it is often feasible to tackle difficult real-world problems using simple algorithms, which can be implemented and maintained more easily in practice. In contrast, algorithms with nested loops often suffer from significant overheads at large scale, or raise concerns about the setting of hyperparameters and numerical stability (see e.g. [1, 6]), making them relatively hard to implement in practice.

It is then natural to explore simple gradient-based algorithms for escaping saddle points. The reason we do not assume access to Hessians is that constructing them takes $\Omega(n^{2})$ cost in general, which is computationally infeasible when the dimension is large. A seminal work along this line was by Ge et al. [11], which found an $\epsilon$-approximate second-order stationary point satisfying (1) using only gradients in $O(\operatorname{poly}(n,1/\epsilon))$ iterations. This was later improved to an almost dimension-free $\tilde{O}(\log^{4}n/\epsilon^{2})$ in the follow-up work [19] (here the $\tilde{O}$ notation omits poly-logarithmic terms, i.e., $\tilde{O}(g)=O(g\operatorname{poly}(\log g))$), and the perturbed accelerated gradient descent algorithm [21] based on Nesterov's accelerated gradient descent [26] takes $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ iterations. However, these results still suffer from a significant overhead in terms of $\log n$. In another direction, Refs. [3, 24, 29] demonstrate that an $\epsilon$-approximate second-order stationary point can be found using gradients in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Their results are based on previous works [1, 6] using Hessian-vector products and the observation that a Hessian-vector product can be approximated via the difference of two gradient queries. Hence, their implementations contain nested-loop structures with relatively large numbers of hyperparameters. It has been an open question whether it is possible to keep both the merits of using only first-order information and of being close to dimension-free using a simple, gradient-based algorithm without a nested-loop structure [22]. This paper answers this question in the affirmative.
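To make the gradient-difference observation concrete, here is a minimal numpy sketch of approximating a Hessian-vector product with two gradient queries; the names `approx_hvp` and `grad_f` are our own and not from the paper, and by the $\rho$-Hessian Lipschitz condition the approximation error is $O(\rho r)$ (cf. Eq. (6) below).

    import numpy as np

    def approx_hvp(grad_f, x, v, r=1e-4):
        # Approximate H(x) @ v by a finite difference of two gradient queries:
        # H(x) v ~ (grad f(x + r * v/||v||) - grad f(x)) * ||v|| / r.
        v_norm = np.linalg.norm(v)
        unit = v / v_norm
        return (grad_f(x + r * unit) - grad_f(x)) * v_norm / r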

Contributions.

Our main contribution is a simple, single-loop, and robust gradient-based algorithm that can find an $\epsilon$-approximate second-order stationary point of a smooth, Hessian Lipschitz function $f\colon\mathbb{R}^{n}\to\mathbb{R}$. Compared to previous works [3, 24, 29] exploiting the idea of a gradient-based Hessian power method, our algorithm has a single-loop, simpler structure and better numerical stability. Compared to the previous state-of-the-art single-loop results by [21] and [19, 20] using $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ or $\tilde{O}(\log^{4}n/\epsilon^{2})$ iterations, our algorithm achieves a polynomial speedup in $\log n$:

Theorem 1 (informal).

Our single-loop algorithm finds an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations.

Technically, our work is inspired by the perturbed gradient descent (PGD) algorithm in [19, 20] and the perturbed accelerated gradient descent (PAGD) algorithm in [21]. Specifically, PGD applies gradient descent iteratively until it reaches a point with small gradient, which can be a potential saddle point. Then PGD generates a uniform perturbation in a small ball centered at that point and continues the GD procedure. It is demonstrated that, with an appropriate choice of the perturbation radius, PGD can shake the iterate out of the neighborhood of the saddle point and converge to a second-order stationary point with high probability. PAGD [21] adopts a similar perturbation idea, but GD is replaced by Nesterov's AGD [26].

Our algorithm is built upon PGD and PAGD with one main modification regarding the perturbation idea: it is more efficient to add a perturbation along the negative curvature direction near the saddle point, rather than the uniform perturbation used in PGD and PAGD, which is a compromise made because we generally cannot access the Hessian at the saddle due to its high computational cost. Our key observation is that we do not have to compute the entire Hessian to detect the negative curvature. Instead, in a small neighborhood of a saddle point, gradients can be viewed as Hessian-vector products plus some bounded deviation. In particular, GD near the saddle with learning rate $1/\ell$ is approximately the same as the power method applied to the matrix $(I-\mathcal{H}/\ell)$. As a result, the most negative eigenvalues stand out during GD because they have leading exponents in the power method, and thus the iterate approximately moves along the direction of the most negative curvature near the saddle point. Following this approach, we can escape saddle points more rapidly than previous algorithms: for a constant $\epsilon$, PGD and PAGD take $O(\log n)$ iterations to decrease the function value by $\Omega(1/\log^{3}n)$ and $\Omega(1/\log^{5}n)$ with high probability, respectively; in contrast, we can first take $O(\log n)$ iterations to identify a negative curvature direction, and then add a larger perturbation along this direction to decrease the function value by $\Omega(1)$. See Proposition 3 and Proposition 5. After escaping the saddle point, similar to PGD and PAGD, we switch back to GD and AGD iterations, which decrease the function value efficiently when the gradient is large [19, 20, 21].
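To spell out this power-method view in one line (under the idealized assumption that the landscape around the saddle is exactly quadratic, so the deviation terms analyzed in Section 2.1 vanish), write $\mathcal{H}=\sum_{i}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}$ for the Hessian at the saddle; then GD on the gradient differences with learning rate $1/\ell$ gives

$\mathbf{y}_{t}\approx\big(I-\mathcal{H}/\ell\big)^{t}\mathbf{y}_{0}=\sum_{i=1}^{n}\big(1-\lambda_{i}/\ell\big)^{t}\langle\mathbf{u}_{i},\mathbf{y}_{0}\rangle\,\mathbf{u}_{i},$

so any component with $\lambda_{i}\leq-\sqrt{\rho\epsilon}$ is amplified by a factor of at least $(1+\sqrt{\rho\epsilon}/\ell)^{t}$ while components with $\lambda_{i}\geq 0$ do not grow; after $t=O\big(\frac{\ell}{\sqrt{\rho\epsilon}}\log n\big)$ iterations the iterate is dominated by the most negative curvature directions, consistent with the choice of $\mathscr{T}$ in Proposition 3 below.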

Our algorithm is also applicable to the stochastic setting where we can only access stochastic gradients, and the stochasticity is not under the control of our algorithm. We further assume that the stochastic gradients are Lipschitz (or equivalently, the underlying functions are gradient-Lipschitz, see Assumption 2), which is also adopted in most of the existing works; see e.g. [8, 19, 20, 34]. We demonstrate that a simple extended version of our algorithm takes $O(\log^{2}n)$ iterations to detect a negative curvature direction using only stochastic gradients, and then obtains an $\Omega(1)$ function value decrease with high probability. In contrast, the perturbed stochastic gradient descent (PSGD) algorithm in [19, 20], the stochastic version of PGD, takes $O(\log^{10}n)$ iterations to decrease the function value by $\Omega(1/\log^{5}n)$ with high probability.

Theorem 2 (informal).

In the stochastic setting, our algorithm finds an $\epsilon$-approximate second-order stationary point using $\tilde{O}(\log^{2}n/\epsilon^{4})$ iterations via stochastic gradients.

Our results are summarized in Table 1. Although the underlying dynamics in [3, 24, 29] and in our algorithm have similarities, the main focus of our work is different. Specifically, Refs. [3, 24, 29] mainly aim at using novel techniques to reduce the iteration complexity for finding a second-order stationary point, whereas our work mainly focuses on reducing the number of loops and hyperparameters of negative curvature finding methods while preserving their advantage in iteration complexity, since a much simpler structure accords with empirical observations and enables wider applications. Moreover, the choice of perturbation in [3] is based on Chebyshev approximation theory, which may require additional nested-loop structures to boost the success probability. In the stochastic setting, there are also other results studying nonconvex optimization [15, 23, 31, 36, 12, 32, 35] from perspectives other than escaping saddle points, which are incomparable to our results.

Setting | Reference | Oracle | Iterations | Simplicity
Non-stochastic | [1, 6] | Hessian-vector product | $\tilde{O}(\log n/\epsilon^{1.75})$ | Nested-loop
Non-stochastic | [19, 20] | Gradient | $\tilde{O}(\log^{4}n/\epsilon^{2})$ | Single-loop
Non-stochastic | [21] | Gradient | $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ | Single-loop
Non-stochastic | [3, 24, 29] | Gradient | $\tilde{O}(\log n/\epsilon^{1.75})$ | Nested-loop
Non-stochastic | this work | Gradient | $\tilde{O}(\log n/\epsilon^{1.75})$ | Single-loop
Stochastic | [19, 20] | Gradient | $\tilde{O}(\log^{15}n/\epsilon^{4})$ | Single-loop
Stochastic | [9] | Gradient | $\tilde{O}(\log^{5}n/\epsilon^{3.5})$ | Single-loop
Stochastic | [3] | Gradient | $\tilde{O}(\log^{2}n/\epsilon^{3.5})$ | Nested-loop
Stochastic | [8] | Gradient | $\tilde{O}(\log^{2}n/\epsilon^{3})$ | Nested-loop
Stochastic | this work | Gradient | $\tilde{O}(\log^{2}n/\epsilon^{4})$ | Single-loop
Table 1: A summary of the state-of-the-art results on finding approximate second-order stationary points with the first-order (gradient) oracle. Iteration numbers are highlighted in terms of the dimension $n$ and the precision $\epsilon$.

It is worth highlighting that our gradient-descent based algorithm enjoys the following nice features:

  • Simplicity: Some of the previous algorithms have nested-loop structures, which raises practical concerns when setting the hyperparameters. In contrast, our algorithm based on negative curvature finding only contains a single loop with two components: gradient descent (including AGD or SGD) and perturbation. As mentioned above, such a simple structure is preferred in machine learning, which increases the possibility of our algorithm finding real-world applications.

  • Numerical stability: Our algorithm contains an additional renormalization step at each iteration when escaping from saddle points. Although from a theoretical perspective a renormalization step affects neither the output nor the complexity of our algorithm, when finding negative curvature near saddle points it enables us to sample gradients in a larger region, which makes our algorithm more numerically stable against floating point and other errors. The introduction of the renormalization step is enabled by the simple structure of our algorithm, and may not be feasible for more complicated algorithms [3, 24, 29].

  • Robustness: Our algorithm is robust against adversarial attacks on the evaluation of gradients. Specifically, when analyzing the performance of our algorithm near saddle points, we essentially treat the deviation from pure quadratic geometry as an external noise. Hence, the effectiveness of our algorithm is unaffected by external attacks as long as the adversarial error is bounded by the deviation from the quadratic landscape.

Finally, we perform numerical experiments that support our polynomial speedup in $\log n$. We run our negative curvature finding algorithms using GD or SGD on various landscapes and general classes of nonconvex functions, and use comparative studies to show that our Algorithm 1 and Algorithm 3 achieve a higher probability of escaping saddle points using far fewer iterations than PGD and PSGD (typically fewer than $1/3$ of the iterations of PGD and $1/2$ of the iterations of PSGD, respectively). Moreover, we perform numerical experiments benchmarking the solution quality and iteration complexity of our algorithm against accelerated methods. Compared to PAGD [21] and even advanced optimization algorithms such as NEON+ [29], Algorithm 2 achieves better solution quality and iteration complexity on various landscapes given by more general nonconvex functions: with fewer iterations (typically fewer than $1/3$ of the iterations of PAGD and $1/2$ of the iterations of NEON+, respectively), our Algorithm 2 achieves a higher probability of escaping from saddle points.

Open questions.

This work leaves a couple of natural open questions for future investigation:

  • Can we achieve the polynomial speedup in $\log n$ for more advanced stochastic optimization algorithms with complexity $\tilde{O}(\operatorname{poly}(\log n)/\epsilon^{3.5})$ [2, 3, 9, 28, 30] or $\tilde{O}(\operatorname{poly}(\log n)/\epsilon^{3})$ [8, 33]?

  • How do our algorithms for escaping saddle points perform in real-world applications, such as tensor decomposition [11, 16], matrix completion [13], etc.?

Broader impact.

This work focuses on the theory of nonconvex optimization and, as far as we can see, we do not anticipate any potential negative societal impact. Nevertheless, it might have a positive impact for researchers who are interested in understanding the theoretical underpinnings of (stochastic) gradient descent methods for machine learning applications.

Organization.

In Section 2, we introduce our gradient-based Hessian power method algorithm for negative curvature finding, and present how our algorithms provide a polynomial speedup in $\log n$ for both PGD and PAGD. In Section 3, we present the stochastic version of our negative curvature finding algorithm using stochastic gradients and demonstrate its polynomial speedup in $\log n$ for PSGD. Numerical experiments are presented in Section 4. We provide detailed proofs and additional numerical experiments in the supplementary material.

2 A simple algorithm for negative curvature finding

We show how to find negative curvature near a saddle point using a gradient-based Hessian power method, and extend it to a version with a faster convergence rate by replacing gradient descent with accelerated gradient descent. The intuition is as follows: in a small enough region around a saddle point, the gradient can be approximately expressed by a Hessian-vector product formula, and the approximation error can be efficiently upper bounded; see Eq. (6). Hence, using only gradient information, we can implement an accurate enough Hessian power method to find negative eigenvectors of the Hessian matrix, and thereby find the negative curvature near the saddle.

2.1 Negative curvature finding based on gradient descents

We first present an algorithm for negative curvature finding based on gradient descent. Specifically, for any $\tilde{\mathbf{x}}\in\mathbb{R}^{n}$ with $\lambda_{\min}(\mathcal{H}(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$, it finds a unit vector $\hat{\mathbf{e}}$ such that $\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4$.

1  $\mathbf{y}_{0}\leftarrow\text{Uniform}(\mathbb{B}_{\tilde{\mathbf{x}}}(r))$, where $\mathbb{B}_{\tilde{\mathbf{x}}}(r)$ is the $\ell_{2}$-norm ball centered at $\tilde{\mathbf{x}}$ with radius $r$;
2  for $t=1,\ldots,\mathscr{T}$ do
3      $\mathbf{y}_{t}\leftarrow\mathbf{y}_{t-1}-\frac{\|\mathbf{y}_{t-1}\|}{\ell r}\big(\nabla f(\tilde{\mathbf{x}}+r\mathbf{y}_{t-1}/\|\mathbf{y}_{t-1}\|)-\nabla f(\tilde{\mathbf{x}})\big)$;
Output $\mathbf{y}_{\mathscr{T}}/r$.
Algorithm 1: Negative Curvature Finding($\tilde{\mathbf{x}},r,\mathscr{T}$).
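For readers who prefer code, the following is a minimal numpy sketch of Algorithm 1 (the helper names `uniform_ball` and `negative_curvature_finding` are ours). It treats $\mathbf{y}_{t}$ as a displacement from $\tilde{\mathbf{x}}$, as in the proof below where $\tilde{\mathbf{x}}$ is shifted to $\mathbf{0}$, and normalizes the output explicitly, which agrees with $\mathbf{y}_{\mathscr{T}}/r$ once the renormalization of Remark 4 is applied.

    import numpy as np

    def uniform_ball(n, radius, rng):
        # Sample uniformly from the n-dimensional l2 ball of the given radius.
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        return radius * rng.uniform() ** (1.0 / n) * v

    def negative_curvature_finding(grad_f, x_tilde, r, T, ell, rng=None):
        # Gradient-based Hessian power method around x_tilde (Algorithm 1).
        # grad_f: gradient oracle; r, T: as in Eq. (2); ell: smoothness parameter.
        rng = np.random.default_rng() if rng is None else rng
        g0 = grad_f(x_tilde)
        y = uniform_ball(len(x_tilde), r, rng)     # random displacement from x_tilde
        for _ in range(T):
            u = y / np.linalg.norm(y)
            # (grad f(x_tilde + r*u) - grad f(x_tilde)) / r approximates H(x_tilde) @ u,
            # so this update is approximately y <- (I - H/ell) y, a power iteration.
            y = y - (np.linalg.norm(y) / (ell * r)) * (grad_f(x_tilde + r * u) - g0)
        return y / np.linalg.norm(y)               # unit vector along negative curvature

With the parameter choices of Eq. (2) below, Proposition 3 guarantees that the output is a negative curvature direction with probability at least $1-\delta_{0}$.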
Proposition 3.

Suppose the function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. For any $0<\delta_{0}\leq 1$, we specify our choice of parameters and constants as follows:

$\mathscr{T}=\frac{8\ell}{\sqrt{\rho\epsilon}}\log\Big(\frac{\ell}{\delta_{0}}\sqrt{\frac{n}{\pi\rho\epsilon}}\Big),\qquad r=\frac{\epsilon}{8\ell}\sqrt{\frac{\pi}{n}}\,\delta_{0}.$ (2)

Suppose that $\tilde{\mathbf{x}}$ satisfies $\lambda_{\min}(\nabla^{2}f(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$. Then with probability at least $1-\delta_{0}$, Algorithm 1 outputs a unit vector $\hat{\mathbf{e}}$ satisfying

$\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4,$ (3)

using $O(\mathscr{T})=\tilde{O}\big(\frac{\log n}{\sqrt{\rho\epsilon}}\big)$ iterations, where $\mathcal{H}$ stands for the Hessian matrix of the function $f$.

Proof.

Without loss of generality we assume $\tilde{\mathbf{x}}=\mathbf{0}$ by shifting $\mathbb{R}^{n}$ so that $\tilde{\mathbf{x}}$ is mapped to $\mathbf{0}$. Define a new $n$-dimensional function

$h_{f}(\mathbf{x}):=f(\mathbf{x})-\left\langle\nabla f(\mathbf{0}),\mathbf{x}\right\rangle$ (4)

for the ease of our analysis. Since $\left\langle\nabla f(\mathbf{0}),\mathbf{x}\right\rangle$ is a linear function with Hessian $0$, the Hessian of $h_{f}$ equals the Hessian of $f$, and $h_{f}$ is also $\ell$-smooth and $\rho$-Hessian Lipschitz. In addition, note that $\nabla h_{f}(\mathbf{0})=\nabla f(\mathbf{0})-\nabla f(\mathbf{0})=0$. Then for all $\mathbf{x}\in\mathbb{R}^{n}$,

$\nabla h_{f}(\mathbf{x})=\int_{\xi=0}^{1}\mathcal{H}(\xi\mathbf{x})\,\mathbf{x}\,\mathrm{d}\xi=\mathcal{H}(\mathbf{0})\mathbf{x}+\int_{\xi=0}^{1}\big(\mathcal{H}(\xi\mathbf{x})-\mathcal{H}(\mathbf{0})\big)\,\mathbf{x}\,\mathrm{d}\xi.$ (5)

Furthermore, due to the $\rho$-Hessian Lipschitz condition on both $f$ and $h_{f}$, for any $\xi\in[0,1]$ we have $\|\mathcal{H}(\xi\mathbf{x})-\mathcal{H}(\mathbf{0})\|\leq\rho\|\mathbf{x}\|$, which leads to

$\|\nabla h_{f}(\mathbf{x})-\mathcal{H}(\mathbf{0})\mathbf{x}\|\leq\rho\|\mathbf{x}\|^{2}.$ (6)

Observe that the Hessian matrix $\mathcal{H}(\mathbf{0})$ admits the eigen-decomposition

$\mathcal{H}(\mathbf{0})=\sum_{i=1}^{n}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T},$ (7)

where the set $\{\mathbf{u}_{i}\}_{i=1}^{n}$ forms an orthonormal basis of $\mathbb{R}^{n}$. Without loss of generality, we assume the eigenvalues $\lambda_{1},\lambda_{2},\ldots,\lambda_{n}$ corresponding to $\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n}$ satisfy

$\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n},$ (8)

in which $\lambda_{1}\leq-\sqrt{\rho\epsilon}$. If $\lambda_{n}\leq-\sqrt{\rho\epsilon}/2$, Proposition 3 holds directly. Hence, we only need to prove the case where $\lambda_{n}>-\sqrt{\rho\epsilon}/2$, in which there exist some $p\geq 1$ and $p^{\prime}\geq 1$ with

$\lambda_{p}\leq-\sqrt{\rho\epsilon}\leq\lambda_{p+1},\qquad\lambda_{p^{\prime}}\leq-\sqrt{\rho\epsilon}/2<\lambda_{p^{\prime}+1}.$ (9)

We use $\mathfrak{S}_{\parallel}$, $\mathfrak{S}_{\perp}$ to denote the subspaces of $\mathbb{R}^{n}$ spanned by $\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\}$ and $\{\mathbf{u}_{p+1},\mathbf{u}_{p+2},\ldots,\mathbf{u}_{n}\}$, respectively, and use $\mathfrak{S}_{\parallel}^{\prime}$, $\mathfrak{S}_{\perp}^{\prime}$ to denote the subspaces of $\mathbb{R}^{n}$ spanned by $\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p^{\prime}}\}$ and $\{\mathbf{u}_{p^{\prime}+1},\mathbf{u}_{p^{\prime}+2},\ldots,\mathbf{u}_{n}\}$. Furthermore, we define

$\mathbf{y}_{t,\parallel}:=\sum_{i=1}^{p}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i},\qquad\mathbf{y}_{t,\perp}:=\sum_{i=p+1}^{n}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i};$ (10)
$\mathbf{y}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i},\qquad\mathbf{y}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i},$ (11)

respectively, to denote the components of the iterate $\mathbf{y}_{t}$ of Algorithm 1 in the subspaces $\mathfrak{S}_{\parallel}$, $\mathfrak{S}_{\perp}$, $\mathfrak{S}_{\parallel}^{\prime}$, $\mathfrak{S}_{\perp}^{\prime}$, and let $\alpha_{t}:=\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|$. Observe that

$\Pr\left\{\alpha_{0}\geq\delta_{0}\sqrt{\pi/n}\right\}\geq\Pr\left\{|y_{0,1}|/r\geq\delta_{0}\sqrt{\pi/n}\right\},$ (12)

where $y_{0,1}:=\left\langle\mathbf{u}_{1},\mathbf{y}_{0}\right\rangle$ denotes the component of $\mathbf{y}_{0}$ along $\mathbf{u}_{1}$. Consider the case $\alpha_{0}\geq\delta_{0}\sqrt{\pi/n}$, which happens with probability

$\Pr\left\{\alpha_{0}\geq\sqrt{\frac{\pi}{n}}\delta_{0}\right\}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\frac{\text{Vol}(\mathbb{B}_{0}^{n-1}(1))}{\text{Vol}(\mathbb{B}_{0}^{n}(1))}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\sqrt{\frac{n}{\pi}}=1-\delta_{0}.$ (13)

We prove that there exists some $t_{0}$ with $1\leq t_{0}\leq\mathscr{T}$ such that

$\|\mathbf{y}_{t_{0},\perp^{\prime}}\|/\|\mathbf{y}_{t_{0}}\|\leq\sqrt{\rho\epsilon}/(8\ell).$ (14)

Assume the contrary, i.e., $\|\mathbf{y}_{t,\perp^{\prime}}\|/\|\mathbf{y}_{t}\|>\sqrt{\rho\epsilon}/(8\ell)$ for all $1\leq t\leq\mathscr{T}$. Then $\|\mathbf{y}_{t,\perp^{\prime}}\|$ satisfies the following recurrence:

$\|\mathbf{y}_{t+1,\perp^{\prime}}\|\leq(1+\sqrt{\rho\epsilon}/(2\ell))\|\mathbf{y}_{t,\perp^{\prime}}\|+\|\Delta_{\perp^{\prime}}\|\leq(1+\sqrt{\rho\epsilon}/(2\ell)+\|\Delta\|/\|\mathbf{y}_{t,\perp^{\prime}}\|)\|\mathbf{y}_{t,\perp^{\prime}}\|,$ (15)

where

$\Delta:=\frac{\|\mathbf{y}_{t}\|}{r\ell}\big(\nabla h_{f}(r\mathbf{y}_{t}/\|\mathbf{y}_{t}\|)-\mathcal{H}(\mathbf{0})\,(r\mathbf{y}_{t}/\|\mathbf{y}_{t}\|)\big)$ (16)

stands for the deviation from the pure quadratic approximation and satisfies $\|\Delta\|/\|\mathbf{y}_{t}\|\leq\rho r/\ell$ due to (6). Hence,

$\|\mathbf{y}_{t+1,\perp^{\prime}}\|\leq\Big(1+\frac{\sqrt{\rho\epsilon}}{2\ell}+\frac{\|\Delta\|}{\|\mathbf{y}_{t,\perp^{\prime}}\|}\Big)\|\mathbf{y}_{t,\perp^{\prime}}\|\leq\Big(1+\frac{\sqrt{\rho\epsilon}}{2\ell}+\frac{8\rho r}{\sqrt{\rho\epsilon}}\Big)\|\mathbf{y}_{t,\perp^{\prime}}\|,$ (17)

which leads to

$\|\mathbf{y}_{t,\perp^{\prime}}\|\leq\|\mathbf{y}_{0,\perp^{\prime}}\|\big(1+\sqrt{\rho\epsilon}/(2\ell)+8\rho r/\sqrt{\rho\epsilon}\big)^{t}\leq\|\mathbf{y}_{0,\perp^{\prime}}\|\big(1+5\sqrt{\rho\epsilon}/(8\ell)\big)^{t},\qquad\forall t\in[\mathscr{T}].$ (18)

Similarly, we have the recurrence for $\|\mathbf{y}_{t,\parallel}\|$:

$\|\mathbf{y}_{t+1,\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\parallel}\|-\|\Delta_{\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell-\|\Delta\|/(\alpha_{t}\|\mathbf{y}_{t}\|))\|\mathbf{y}_{t,\parallel}\|.$ (19)

Since $\|\Delta\|/\|\mathbf{y}_{t}\|\leq\rho r/\ell$ due to (6), we further have

$\|\mathbf{y}_{t+1,\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{t}\ell))\|\mathbf{y}_{t,\parallel}\|.$ (20)

Intuitively, $\|\mathbf{y}_{t,\parallel}\|$ should increase faster than $\|\mathbf{y}_{t,\perp}\|$ in this gradient-based Hessian power method if we ignore the deviation from the quadratic approximation, so the value $\alpha_{t}=\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|$ should be non-decreasing. It is demonstrated in Lemma 18 in Appendix B that, even taking this deviation into account, $\alpha_{t}$ can still be lower bounded by some constant $\alpha_{\min}$:

$\alpha_{t}\geq\alpha_{\min}=\frac{\delta_{0}}{4}\sqrt{\frac{\pi}{n}},\qquad\forall 1\leq t\leq\mathscr{T},$ (21)

by which we can further deduce that

$\|\mathbf{y}_{t,\parallel}\|\geq\|\mathbf{y}_{0,\parallel}\|\big(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell)\big)^{t}\geq\|\mathbf{y}_{0,\parallel}\|\big(1+7\sqrt{\rho\epsilon}/(8\ell)\big)^{t},\qquad\forall 1\leq t\leq\mathscr{T}.$ (22)

Observe that

$\frac{\|\mathbf{y}_{\mathscr{T},\perp^{\prime}}\|}{\|\mathbf{y}_{\mathscr{T},\parallel}\|}\leq\frac{\|\mathbf{y}_{0,\perp^{\prime}}\|}{\|\mathbf{y}_{0,\parallel}\|}\cdot\Big(\frac{1+5\sqrt{\rho\epsilon}/(8\ell)}{1+7\sqrt{\rho\epsilon}/(8\ell)}\Big)^{\mathscr{T}}\leq\frac{1}{\delta_{0}}\sqrt{\frac{n}{\pi}}\Big(\frac{1+5\sqrt{\rho\epsilon}/(8\ell)}{1+7\sqrt{\rho\epsilon}/(8\ell)}\Big)^{\mathscr{T}}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}.$ (23)

Since $\|\mathbf{y}_{\mathscr{T},\parallel}\|\leq\|\mathbf{y}_{\mathscr{T}}\|$, we have $\|\mathbf{y}_{\mathscr{T},\perp^{\prime}}\|/\|\mathbf{y}_{\mathscr{T}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$, a contradiction. Hence, there exists some $t_{0}$ with $1\leq t_{0}\leq\mathscr{T}$ such that $\|\mathbf{y}_{t_{0},\perp^{\prime}}\|/\|\mathbf{y}_{t_{0}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$. Consider the normalized vector $\hat{\mathbf{e}}=\mathbf{y}_{t_{0}}/r$, and use $\hat{\mathbf{e}}_{\perp^{\prime}}$ and $\hat{\mathbf{e}}_{\parallel^{\prime}}$ to denote the components of $\hat{\mathbf{e}}$ in $\mathfrak{S}_{\perp}^{\prime}$ and $\mathfrak{S}_{\parallel}^{\prime}$, respectively. Then $\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$, whereas $\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|\geq 1-\rho\epsilon/(8\ell)^{2}$. Then,

$\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})^{T}\mathcal{H}(\mathbf{0})(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}$ (24)

since $\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\in\mathfrak{S}_{\perp}^{\prime}$ and $\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\in\mathfrak{S}_{\parallel}^{\prime}$. Due to the $\ell$-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value at most $\ell$. Hence,

$\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\leq\ell\|\hat{\mathbf{e}}_{\perp^{\prime}}\|^{2}\leq\rho\epsilon/(64\ell).$ (25)

Further, by the definition of $\mathfrak{S}_{\parallel}^{\prime}$, we have

$\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\sqrt{\rho\epsilon}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}/2.$ (26)

Combining these two inequalities, we obtain

$\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\sqrt{\rho\epsilon}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}/2+\rho\epsilon/(64\ell)\leq-\sqrt{\rho\epsilon}/4.$ (27) ∎

Remark 4.

In practice, the value of $\|\mathbf{y}_{t}\|$ can become large during the execution of Algorithm 1. To address this, we can renormalize $\mathbf{y}_{t}$ to have $\ell_{2}$-norm $r$ at the end of each iteration, and this does not influence the performance of the algorithm.
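In the Algorithm 1 sketch above, this amounts to one extra rescaling at the end of each loop iteration; a one-line helper (the name `renormalize` is ours) could look as follows.

    import numpy as np

    def renormalize(y, r):
        # Remark 4: rescale the iterate back to norm r; only the direction of y matters,
        # so this changes neither the output direction nor the iteration count.
        return r * y / np.linalg.norm(y)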

2.2 Faster negative curvature finding based on accelerated gradient descents

In this subsection, we replace the GD part of Algorithm 1 with AGD to obtain an accelerated negative curvature finding subroutine with a similar guarantee and a faster convergence rate, based on which we further implement our Accelerated Gradient Descent with Negative Curvature Finding (Algorithm 2). Near any saddle point $\tilde{\mathbf{x}}\in\mathbb{R}^{n}$ with $\lambda_{\min}(\mathcal{H}(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$, Algorithm 2 finds a unit vector $\hat{\mathbf{e}}$ such that $\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4$.

1   $t_{\text{perturb}}\leftarrow 0$, $\mathbf{z}_{0}\leftarrow\mathbf{x}_{0}$, $\tilde{\mathbf{x}}\leftarrow\mathbf{x}_{0}$, $\zeta\leftarrow 0$;
2   for $t=0,1,2,\ldots,T$ do
3       if $\|\nabla f(\mathbf{x}_{t})\|\leq\epsilon$ and $t-t_{\text{perturb}}>\mathscr{T}^{\prime}$ then
4           $\tilde{\mathbf{x}}\leftarrow\mathbf{x}_{t}$;
5           $\mathbf{x}_{t}\leftarrow\text{Uniform}(\mathbb{B}_{\tilde{\mathbf{x}}}(r^{\prime}))$, where $\mathbb{B}_{\tilde{\mathbf{x}}}(r^{\prime})$ is the $\ell_{2}$-norm ball centered at $\tilde{\mathbf{x}}$ with radius $r^{\prime}$; $\mathbf{z}_{t}\leftarrow\mathbf{x}_{t}$, $\zeta\leftarrow\nabla f(\tilde{\mathbf{x}})$, $t_{\text{perturb}}\leftarrow t$;
6       if $t-t_{\text{perturb}}=\mathscr{T}^{\prime}$ then
7           $\hat{\mathbf{e}}:=\frac{\mathbf{x}_{t}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{t}-\tilde{\mathbf{x}}\|}$;
8           $\mathbf{x}_{t}\leftarrow\tilde{\mathbf{x}}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\tilde{\mathbf{x}})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\tilde{\mathbf{x}})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}}$, $\mathbf{z}_{t}\leftarrow\mathbf{x}_{t}$, $\zeta\leftarrow\mathbf{0}$;
9       $\mathbf{x}_{t+1}\leftarrow\mathbf{z}_{t}-\eta(\nabla f(\mathbf{z}_{t})-\zeta)$;
10      $\mathbf{v}_{t+1}\leftarrow\mathbf{x}_{t+1}-\mathbf{x}_{t}$;
11      $\mathbf{z}_{t+1}\leftarrow\mathbf{x}_{t+1}+(1-\theta)\mathbf{v}_{t+1}$;
12      if $t_{\text{perturb}}\neq 0$ and $t-t_{\text{perturb}}<\mathscr{T}^{\prime}$ then
13          $\mathbf{z}_{t+1}\leftarrow\tilde{\mathbf{x}}+r^{\prime}\cdot\frac{\mathbf{z}_{t+1}-\tilde{\mathbf{x}}}{\|\mathbf{z}_{t+1}-\tilde{\mathbf{x}}\|}$, $\mathbf{x}_{t+1}\leftarrow\tilde{\mathbf{x}}+r^{\prime}\cdot\frac{\mathbf{x}_{t+1}-\tilde{\mathbf{x}}}{\|\mathbf{z}_{t+1}-\tilde{\mathbf{x}}\|}$;
14      else
15          if $f(\mathbf{x}_{t+1})\leq f(\mathbf{z}_{t+1})+\left\langle\nabla f(\mathbf{z}_{t+1}),\mathbf{x}_{t+1}-\mathbf{z}_{t+1}\right\rangle-\frac{\gamma}{2}\|\mathbf{z}_{t+1}-\mathbf{x}_{t+1}\|^{2}$ then
16              $(\mathbf{x}_{t+1},\mathbf{v}_{t+1})\leftarrow\text{NegativeCurvatureExploitation}(\mathbf{x}_{t+1},\mathbf{v}_{t+1},s)$;
17              $\mathbf{z}_{t+1}\leftarrow\mathbf{x}_{t+1}+(1-\theta)\mathbf{v}_{t+1}$;
Algorithm 2: Perturbed Accelerated Gradient Descent with Accelerated Negative Curvature Finding($\mathbf{x}_{0},\eta,\theta,\gamma,s,\mathscr{T}^{\prime},r^{\prime}$).
Footnote 3: The NegativeCurvatureExploitation (NCE) subroutine was originally introduced in [21, Algorithm 3] and is called when we detect that the current momentum $\mathbf{v}_{t}$ coincides with a negative curvature direction of $\mathbf{z}_{t}$. In this case, we reset the momentum $\mathbf{v}_{t}$ and decide whether to exploit this direction based on the value of $\|\mathbf{v}_{t}\|$.
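As with Algorithm 1, a minimal Python sketch may help parse the structure of Algorithm 2. It reuses the `uniform_ball` helper from the Algorithm 1 sketch, stubs out NCE (keeping only the momentum reset described in Footnote 3; see [21, Algorithm 3] for the full subroutine), and adds a guard so the $\mathscr{T}^{\prime}$-step check only fires after a perturbation. All function names are our own, and this is a sketch rather than the authors' reference implementation.

    import numpy as np

    def nce_stub(x, v):
        # Placeholder for NegativeCurvatureExploitation of [21, Algorithm 3]:
        # here we only reset the momentum, as described in Footnote 3.
        return x, np.zeros_like(v)

    def pagd_with_ncf(f, grad_f, x0, eps, rho, ell, delta0=0.1, T=100000, rng=None):
        # Sketch of Algorithm 2 with the parameter choices of Eq. (28).
        rng = np.random.default_rng() if rng is None else rng
        n = len(x0)
        eta = 1.0 / (4 * ell)
        theta = (rho * eps) ** 0.25 / (4 * np.sqrt(ell))
        gamma = theta ** 2 / eta
        s = gamma / (4 * rho)
        T_prime = int(np.ceil(32 * np.sqrt(ell) / (rho * eps) ** 0.25
                              * np.log(ell / delta0 * np.sqrt(n / (rho * eps)))))
        r_prime = delta0 * eps / 32 * np.sqrt(np.pi / (rho * n))

        t_perturb, x, z, x_tilde, zeta = 0, x0.copy(), x0.copy(), x0.copy(), np.zeros(n)
        for t in range(T):
            if np.linalg.norm(grad_f(x)) <= eps and t - t_perturb > T_prime:
                # potential saddle: start the accelerated negative-curvature-finding phase
                x_tilde = x.copy()
                x = x_tilde + uniform_ball(n, r_prime, rng)
                z, zeta, t_perturb = x.copy(), grad_f(x_tilde), t
            if t_perturb > 0 and t - t_perturb == T_prime:
                # end of the NCF phase: perturb along the extracted direction e_hat (Lemma 6)
                e_hat = (x - x_tilde) / np.linalg.norm(x - x_tilde)
                sign = 1.0 if np.dot(grad_f(x_tilde), e_hat) >= 0 else -1.0
                x = x_tilde - sign * 0.25 * np.sqrt(eps / rho) * e_hat
                z, zeta = x.copy(), np.zeros(n)
            x_next = z - eta * (grad_f(z) - zeta)     # AGD gradient step (shifted by zeta during NCF)
            v = x_next - x
            z_next = x_next + (1 - theta) * v         # AGD momentum step
            if t_perturb > 0 and t - t_perturb < T_prime:
                # during NCF, rescale the iterates back to distance r_prime from x_tilde
                scale = r_prime / np.linalg.norm(z_next - x_tilde)
                z_next = x_tilde + scale * (z_next - x_tilde)
                x_next = x_tilde + scale * (x_next - x_tilde)
            elif f(x_next) <= (f(z_next) + np.dot(grad_f(z_next), x_next - z_next)
                               - gamma / 2 * np.linalg.norm(z_next - x_next) ** 2):
                x_next, v = nce_stub(x_next, v)
                z_next = x_next + (1 - theta) * v
            x, z = x_next, z_next
        return x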

The following proposition exhibits the effectiveness of Algorithm 2 for finding negative curvatures near saddle points:

Proposition 5.

Suppose the function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. For any $0<\delta_{0}\leq 1$, we specify our choice of parameters and constants as follows:

$\eta:=\frac{1}{4\ell},\quad\theta:=\frac{(\rho\epsilon)^{1/4}}{4\sqrt{\ell}},\quad\mathscr{T}^{\prime}:=\frac{32\sqrt{\ell}}{(\rho\epsilon)^{1/4}}\log\Big(\frac{\ell}{\delta_{0}}\sqrt{\frac{n}{\rho\epsilon}}\Big),\quad\gamma:=\frac{\theta^{2}}{\eta},\quad s:=\frac{\gamma}{4\rho},\quad r^{\prime}:=\frac{\delta_{0}\epsilon}{32}\sqrt{\frac{\pi}{\rho n}}.$ (28)

Then for a point $\tilde{\mathbf{x}}$ satisfying $\lambda_{\min}(\nabla^{2}f(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$, if we run Algorithm 2 with the uniform perturbation (Line 5) added at $t=0$, the unit vector $\hat{\mathbf{e}}$ computed in Line 7 after $\mathscr{T}^{\prime}$ iterations satisfies

$\Pr\Big(\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4\Big)\geq 1-\delta_{0}.$ (29)

The proof of Proposition 5 is similar to the proof of Proposition 3, and is deferred to Appendix B.2.

2.3 Escaping saddle points using negative curvature finding

In this subsection, we demonstrate that Algorithm 1 and Algorithm 2, with their ability to find negative curvature near saddle points, can further escape saddle points of nonconvex functions. The intuition is as follows: we start with gradient descent or accelerated gradient descent until the gradient becomes small. At this position, we compute the negative curvature direction, described by a unit vector $\hat{\mathbf{e}}$, via Algorithm 1 or the negative curvature finding subroutine of Algorithm 2. Then, we add a perturbation along this direction of negative curvature and go back to gradient descent or accelerated gradient descent with an additional NegativeCurvatureExploitation subroutine (see Footnote 3). This perturbation has the following guarantee:

Lemma 6.

Suppose the function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. Then for any point $\mathbf{x}_{0}\in\mathbb{R}^{n}$, if there exists a unit vector $\hat{\mathbf{e}}$ satisfying $\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{x}_{0})\hat{\mathbf{e}}\leq-\frac{\sqrt{\rho\epsilon}}{4}$, where $\mathcal{H}$ stands for the Hessian matrix of the function $f$, then the following inequality holds:

$f\Big(\mathbf{x}_{0}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}}\Big)\leq f(\mathbf{x}_{0})-\frac{1}{384}\sqrt{\frac{\epsilon^{3}}{\rho}},$ (30)

where $f^{\prime}_{\hat{\mathbf{e}}}$ stands for the component of the gradient of $f$ along the direction $\hat{\mathbf{e}}$.

Proof.

Without loss of generality, we assume $\mathbf{x}_{0}=\mathbf{0}$. We can also assume $\left\langle\nabla f(\mathbf{0}),\hat{\mathbf{e}}\right\rangle\leq 0$; if this is not the case, we can pick $-\hat{\mathbf{e}}$ instead, which still satisfies $(-\hat{\mathbf{e}})^{T}\mathcal{H}(\mathbf{x}_{0})(-\hat{\mathbf{e}})\leq-\frac{\sqrt{\rho\epsilon}}{4}$. In practice, to decide whether we should use $\hat{\mathbf{e}}$ or $-\hat{\mathbf{e}}$, we apply both of them in (30) and choose the one with the smaller function value. Then, for any $\mathbf{x}=x_{\hat{\mathbf{e}}}\hat{\mathbf{e}}$ with some $x_{\hat{\mathbf{e}}}>0$, we have $\frac{\partial^{2}f}{\partial x_{\hat{\mathbf{e}}}^{2}}(\mathbf{x})\leq-\frac{\sqrt{\rho\epsilon}}{4}+\rho x_{\hat{\mathbf{e}}}$ due to the $\rho$-Hessian Lipschitz condition on $f$. Hence,

$\frac{\partial f}{\partial x_{\hat{\mathbf{e}}}}(\mathbf{x})\leq f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{0})-\frac{\sqrt{\rho\epsilon}}{4}x_{\hat{\mathbf{e}}}+\rho x_{\hat{\mathbf{e}}}^{2},$ (31)

by which we can further derive

$f(x_{\hat{\mathbf{e}}}\hat{\mathbf{e}})-f(\mathbf{0})\leq f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{0})x_{\hat{\mathbf{e}}}-\frac{\sqrt{\rho\epsilon}}{8}x_{\hat{\mathbf{e}}}^{2}+\frac{\rho}{3}x_{\hat{\mathbf{e}}}^{3}\leq-\frac{\sqrt{\rho\epsilon}}{8}x_{\hat{\mathbf{e}}}^{2}+\frac{\rho}{3}x_{\hat{\mathbf{e}}}^{3}.$ (32)

Setting $x_{\hat{\mathbf{e}}}=\frac{1}{4}\sqrt{\frac{\epsilon}{\rho}}$ makes the right-hand side of (32) equal to $-\frac{\sqrt{\rho\epsilon}}{8}\cdot\frac{\epsilon}{16\rho}+\frac{\rho}{3}\cdot\frac{1}{64}\big(\frac{\epsilon}{\rho}\big)^{3/2}=\big(-\frac{1}{128}+\frac{1}{192}\big)\sqrt{\frac{\epsilon^{3}}{\rho}}=-\frac{1}{384}\sqrt{\frac{\epsilon^{3}}{\rho}}$, which gives (30). ∎

We give the full algorithm details based on Algorithm 1 in Appendix C.1. Along this approach, we achieve the following:

Theorem 7 (informal, full version deferred to Appendix C.3).

For any $\epsilon>0$ and a constant $0<\delta\leq 1$, Algorithm 2 satisfies that at least one of the iterates $\mathbf{x}_{t}$ will be an $\epsilon$-approximate second-order stationary point within

$\tilde{O}\Big(\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{1.75}}\cdot\log n\Big)$ (33)

iterations, with probability at least $1-\delta$, where $f^{*}$ is the global minimum of $f$.

Intuitively, the proof of Theorem 7 has two parts. The first part is similar to the proof of [21, Theorem 3], which shows that PAGD uses $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ iterations to escape saddle points. We show that there can be at most $\tilde{O}(\Delta_{f}/\epsilon^{1.75})$ iterations with the norm of the gradient larger than $\epsilon$, using almost the same techniques but with slightly different parameter choices. The second part is based on the negative curvature finding part of Algorithm 2, our accelerated negative curvature finding algorithm. Specifically, at each saddle point we encounter, we take $\tilde{O}(\log n/\epsilon^{1/4})$ iterations to find its negative curvature (Proposition 5), and add a perturbation in this direction to decrease the function value by $\Omega(\epsilon^{1.5})$ (Lemma 6). Hence, the number of iterations introduced by Algorithm 4 is at most $\tilde{O}\big(\frac{1}{\epsilon^{1.5}}\cdot\frac{\log n}{\epsilon^{0.25}}\big)=\tilde{O}(\log n/\epsilon^{1.75})$, which gives an upper bound on the overall iteration number. The detailed proof is deferred to Appendix C.3.

Remark 8.

Although Theorem 7 only demonstrates that our algorithms will visit some $\epsilon$-approximate second-order stationary point during their execution with high probability, it is straightforward to identify one of them if we add a termination condition: once Negative Curvature Finding (Algorithm 1 or Algorithm 2) is applied, we record the position $\mathbf{x}_{t_{0}}$ and the function value decrease due to the subsequent perturbation. If the function value decrease is larger than $\frac{1}{384}\sqrt{\frac{\epsilon^{3}}{\rho}}$, as per Lemma 6, then the algorithm makes progress. Otherwise, $\mathbf{x}_{t_{0}}$ is an $\epsilon$-approximate second-order stationary point with high probability.
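A sketch of this termination test in code (the function name `made_progress` is ours), to be called right after the perturbation step:

    import numpy as np

    def made_progress(f, x_saddle, x_perturbed, eps, rho):
        # Remark 8: after a negative-curvature perturbation, check whether the function
        # value dropped by at least (1/384) * sqrt(eps^3 / rho).  If not, x_saddle is an
        # eps-approximate second-order stationary point with high probability.
        return f(x_saddle) - f(x_perturbed) >= np.sqrt(eps ** 3 / rho) / 384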

3 Stochastic setting

In this section, we present a stochastic version of Algorithm 1 using stochastic gradients, and demonstrate that it can also be used to escape saddle points and obtains a polynomial speedup in $\log n$ compared to the perturbed stochastic gradient descent (PSGD) algorithm in [20].

3.1 Stochastic negative curvature finding

In the stochastic gradient descent setting, the exact gradient oracle $\nabla f$ of the function $f$ cannot be accessed. Instead, we only have unbiased stochastic gradients $\mathbf{g}(\mathbf{x};\theta)$ such that

$\nabla f(\mathbf{x})=\mathbb{E}_{\theta\sim\mathcal{D}}[\mathbf{g}(\mathbf{x};\theta)]\qquad\forall\mathbf{x}\in\mathbb{R}^{n},$ (34)

where $\mathcal{D}$ stands for the probability distribution followed by the random variable $\theta$. Define

$\zeta(\mathbf{x};\theta):=\mathbf{g}(\mathbf{x};\theta)-\nabla f(\mathbf{x})$ (35)

to be the error term of the stochastic gradient. Then the expected value of the vector $\zeta(\mathbf{x};\theta)$ at any $\mathbf{x}\in\mathbb{R}^{n}$ equals $\mathbf{0}$. Further, we assume the stochastic gradient $\mathbf{g}(\mathbf{x};\theta)$ also satisfies the following assumptions, which were adopted in previous literature; see e.g. [8, 19, 20, 34].

Assumption 1.

For any $\mathbf{x}\in\mathbb{R}^{n}$, the stochastic gradient $\mathbf{g}(\mathbf{x};\theta)$ with $\theta\sim\mathcal{D}$ satisfies:

$\Pr\big[\|\mathbf{g}(\mathbf{x};\theta)-\nabla f(\mathbf{x})\|\geq t\big]\leq 2\exp(-t^{2}/(2\sigma^{2})),\qquad\forall t\in\mathbb{R}.$ (36)
Assumption 2.

For any $\theta\in\text{supp}(\mathcal{D})$, $\mathbf{g}(\mathbf{x};\theta)$ is $\tilde{\ell}$-Lipschitz for some constant $\tilde{\ell}$:

$\|\mathbf{g}(\mathbf{x}_{1};\theta)-\mathbf{g}(\mathbf{x}_{2};\theta)\|\leq\tilde{\ell}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|,\qquad\forall\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}.$ (37)

Assumption 2 arises from the fact that the stochastic gradient $\mathbf{g}$ is often obtained as an exact gradient of some smooth function,

$\mathbf{g}(\mathbf{x};\theta)=\nabla f(\mathbf{x};\theta).$ (38)

In this case, Assumption 2 guarantees that for any $\theta\sim\mathcal{D}$, the spectral norm of the Hessian of $f(\mathbf{x};\theta)$ is upper bounded by $\tilde{\ell}$. Under these two assumptions, we can construct the stochastic version of Algorithm 1, as shown in Algorithm 3 below.
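As a concrete instance of (38), the stochastic gradient can be a minibatch of component gradients. A schematic sketch follows; the interface (`grad_f_theta`, a list of `thetas`) is our own choice, and Assumption 1 additionally holds, for example, when the component gradients are uniformly bounded.

    import numpy as np

    def stochastic_gradient(grad_f_theta, x, thetas, rng, m=1):
        # Unbiased stochastic gradient: average of m component gradients grad f(x; theta_j),
        # where grad_f_theta(x, theta) is the exact gradient of a component f(.; theta).
        # If each component is ell_tilde-smooth, Assumption 2 holds for this oracle as well.
        idx = rng.choice(len(thetas), size=m)
        return np.mean([grad_f_theta(x, thetas[j]) for j in idx], axis=0)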

1  $\mathbf{y}_{0}\leftarrow 0$, $L_{0}\leftarrow r_{s}$;
2  for $t=1,\ldots,\mathscr{T}_{s}$ do
3      Sample $\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(m)}\}\sim\mathcal{D}$;
4      $\mathbf{g}(\mathbf{y}_{t-1})\leftarrow\frac{1}{m}\sum_{j=1}^{m}\big(\mathbf{g}(\mathbf{x}_{0}+\mathbf{y}_{t-1};\theta^{(j)})-\mathbf{g}(\mathbf{x}_{0};\theta^{(j)})\big)$;
5      $\mathbf{y}_{t}\leftarrow\mathbf{y}_{t-1}-\frac{1}{\ell}\big(\mathbf{g}(\mathbf{y}_{t-1})+\xi_{t}/L_{t-1}\big)$, where $\xi_{t}\sim\mathcal{N}\big(0,\frac{r_{s}^{2}}{n}I\big)$;
6      $L_{t}\leftarrow\frac{\|\mathbf{y}_{t}\|}{r_{s}}L_{t-1}$ and $\mathbf{y}_{t}\leftarrow\mathbf{y}_{t}\cdot\frac{r_{s}}{\|\mathbf{y}_{t}\|}$;
Output $\mathbf{y}_{\mathscr{T}_{s}}/r_{s}$.
Algorithm 3: Stochastic Negative Curvature Finding($\mathbf{x}_{0},r_{s},\mathscr{T}_{s},m$).
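A minimal Python sketch of Algorithm 3 (the callables `stoch_grad(x, theta)` and `sample_theta(rng)` are assumptions about how the stochastic gradient oracle of (34) is exposed; all names are ours):

    import numpy as np

    def stochastic_ncf(stoch_grad, sample_theta, x0, r_s, T_s, m, ell, rng=None):
        # Sketch of Algorithm 3: negative curvature finding with stochastic gradients.
        rng = np.random.default_rng() if rng is None else rng
        n = len(x0)
        y = np.zeros(n)
        L = r_s
        for _ in range(T_s):
            thetas = [sample_theta(rng) for _ in range(m)]
            # minibatch estimate of H(x0) @ y via gradient differences with shared thetas
            g = np.mean([stoch_grad(x0 + y, th) - stoch_grad(x0, th) for th in thetas], axis=0)
            xi = rng.normal(0.0, r_s / np.sqrt(n), size=n)   # xi ~ N(0, (r_s^2 / n) I)
            y = y - (g + xi / L) / ell
            L = L * np.linalg.norm(y) / r_s                  # track cumulative growth
            y = y * r_s / np.linalg.norm(y)                  # renormalize to norm r_s
        return y / r_s                                       # unit vector (since ||y|| = r_s)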

Similarly to the non-stochastic setting, Algorithm 3 can be used to escape saddle points and obtain a polynomial speedup in $\log n$ compared to the PSGD algorithm in [20]. This is quantified in the following theorem:

Theorem 9 (informal, full version deferred to Appendix D.2).

For any $\epsilon>0$ and a constant $0<\delta\leq 1$, our algorithm based on Algorithm 3 using only stochastic gradient descent satisfies that at least one of the iterates $\mathbf{x}_{t}$ will be an $\epsilon$-approximate second-order stationary point within

$\tilde{O}\Big(\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big)$ (39)

iterations, with probability at least $1-\delta$, where $f^{*}$ is the global minimum of $f$.

Footnote 4: Our algorithm based on Algorithm 3 has similarities to the Neon2$^{\text{online}}$ algorithm in [3]. Both algorithms find a second-order stationary point for stochastic optimization in $\tilde{O}(\log^{2}n/\epsilon^{4})$ iterations, and both apply directed perturbations based on the results of negative curvature finding. Nevertheless, our algorithm enjoys simplicity by only having a single loop, whereas Neon2$^{\text{online}}$ has a nested loop for boosting its confidence.

4 Numerical experiments

In this section, we provide numerical results that exhibit the power of our negative curvature finding algorithms for escaping saddle points. More experimental results can be found in Appendix E. All the experiments are performed in MATLAB R2019b on a computer with a six-core Intel Core i7 processor and 16GB memory, and the code is given in the supplementary material.

Comparison between Algorithm 1 and PGD.

We compare the performance of our Algorithm 1 with the perturbed gradient descent (PGD) algorithm in [20] on the test function $f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2}$, which has a saddle point at $(0,0)$. The advantage of Algorithm 1 is illustrated in Figure 1.

Figure 1: Runs of Algorithm 1 and PGD on the landscape $f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2}$. Parameters: $\eta=0.05$ (step length), $r=0.1$ (ball radius in PGD and parameter $r$ in Algorithm 1), $M=300$ (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 1 at time steps $t_{\text{NCGD}}=15$ and $t_{\text{NCGD}}=30$, and red points represent samplings of PGD at time steps $t_{\text{PGD}}=45$ and $t_{\text{PGD}}=90$. Algorithm 1 transforms an initial uniform-circle distribution into a distribution concentrated on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PGD even when $t_{\text{NCGD}}\ll t_{\text{PGD}}$.
Right: A histogram of the descent values obtained by Algorithm 1 and PGD, respectively, with $t_{\text{NCGD}}=30$ and $t_{\text{PGD}}=90$. Although we run three times as many iterations in PGD, over $40\%$ of the gradient descent paths still have a function value decrease no greater than $0.9$, while this ratio for Algorithm 1 is less than $5\%$.
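The full experiment code is given in the supplementary material; for reference, a minimal definition of this landscape and its exact gradient, which can be plugged into the Algorithm 1 sketch of Section 2.1 (function names are ours):

    import numpy as np

    def f1(x):
        # Test landscape of Figure 1: saddle at the origin, negative curvature along x1.
        return x[0] ** 4 / 16 - x[0] ** 2 / 2 + 9 * x[1] ** 2 / 8

    def grad_f1(x):
        return np.array([x[0] ** 3 / 4 - x[0], 9 * x[1] / 4])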
Comparison between Algorithm 3 and PSGD.

We compare the performance of our Algorithm 3 with the perturbed stochastic gradient descent (PSGD) algorithm in [20] on the test function $f(x_{1},x_{2})=(x_{1}^{3}-x_{2}^{3})/2-3x_{1}x_{2}+(x_{1}^{2}+x_{2}^{2})^{2}/2$. Compared to the landscape of the previous experiment, this function deviates more from a quadratic form due to the cubic terms. Nevertheless, Algorithm 3 still possesses a notable advantage over PSGD, as demonstrated in Figure 2.

Figure 2: Runs of Algorithm 3 and PSGD on the landscape $f(x_{1},x_{2})=\frac{x_{1}^{3}-x_{2}^{3}}{2}-3x_{1}x_{2}+\frac{1}{2}(x_{1}^{2}+x_{2}^{2})^{2}$. Parameters: $\eta=0.02$ (step length), $r=0.01$ (variance in PSGD and $r_{s}$ in Algorithm 3), $M=300$ (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 3 at time steps $t_{\text{SNCGD}}=15$ and $t_{\text{SNCGD}}=30$, and red points represent samplings of PSGD at time steps $t_{\text{PSGD}}=30$ and $t_{\text{PSGD}}=60$. Algorithm 3 transforms an initial uniform-circle distribution into a distribution concentrated on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PSGD even when $t_{\text{SNCGD}}\ll t_{\text{PSGD}}$.
Right: A histogram of the descent values obtained by Algorithm 3 and PSGD, respectively, with $t_{\text{SNCGD}}=30$ and $t_{\text{PSGD}}=60$. Although we run twice as many iterations in PSGD, over $50\%$ of the SGD paths still have a function value decrease no greater than $0.6$, while this ratio for Algorithm 3 is less than $10\%$.
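Again for reference, a minimal definition of this landscape and its exact gradient (function names are ours; a stochastic oracle for the Algorithm 3 and PSGD comparison can be obtained, e.g., by adding independent noise to `grad_f2`, while the exact noise model used in the experiment is in the supplementary material):

    import numpy as np

    def f2(x):
        # Test landscape of Figure 2: saddle at the origin with Hessian eigenvalues +3 and -3.
        return (x[0] ** 3 - x[1] ** 3) / 2 - 3 * x[0] * x[1] + (x[0] ** 2 + x[1] ** 2) ** 2 / 2

    def grad_f2(x):
        sq = x[0] ** 2 + x[1] ** 2
        return np.array([1.5 * x[0] ** 2 - 3 * x[1] + 2 * x[0] * sq,
                         -1.5 * x[1] ** 2 - 3 * x[0] + 2 * x[1] * sq])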

Acknowledgements

We thank Jiaqi Leng for valuable suggestions and generous help on the design of numerical experiments. TL was supported by the NSF grant PHY-1818914 and a Samsung Advanced Institute of Technology Global Research Partnership.

References

  • [1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma, Finding approximate local minima faster than gradient descent, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199, 2017, arXiv:1611.01146.
  • [2] Zeyuan Allen-Zhu, Natasha 2: Faster non-convex optimization than SGD, Advances in Neural Information Processing Systems, pp. 2675–2686, 2018, arXiv:1708.08694.
  • [3] Zeyuan Allen-Zhu and Yuanzhi Li, Neon2: Finding local minima via first-order oracles, Advances in Neural Information Processing Systems, pp. 3716–3726, 2018, arXiv:1711.06673.
  • [4] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro, Global optimality of local search for low rank matrix recovery, Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3880–3888, 2016, arXiv:1605.07221.
  • [5] Alan J. Bray and David S. Dean, Statistics of critical points of Gaussian fields on large-dimensional spaces, Physical Review Letters 98 (2007), no. 15, 150201, arXiv:cond-mat/0611023.
  • [6] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization 28 (2018), no. 2, 1751–1772, arXiv:1611.00756.
  • [7] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Advances in neural information processing systems, pp. 2933–2941, 2014, arXiv:1406.2572.
  • [8] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, Advances in Neural Information Processing Systems, pp. 689–699, 2018, arXiv:1807.01695.
  • [9] Cong Fang, Zhouchen Lin, and Tong Zhang, Sharp analysis for nonconvex SGD escaping from saddle points, Conference on Learning Theory, pp. 1192–1234, 2019, arXiv:1902.00247.
  • [10] Yan V. Fyodorov and Ian Williams, Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity, Journal of Statistical Physics 129 (2007), no. 5-6, 1081–1116, arXiv:cond-mat/0702601.
  • [11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, Proceedings of the 28th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 40, pp. 797–842, 2015, arXiv:1503.02101.
  • [12] Rong Ge, Chi Jin, and Yi Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, International Conference on Machine Learning, pp. 1233–1242, PMLR, 2017, arXiv:1704.00708.
  • [13] Rong Ge, Jason D. Lee, and Tengyu Ma, Matrix completion has no spurious local minimum, Advances in Neural Information Processing Systems, pp. 2981–2989, 2016, arXiv:1605.07272.
  • [14] Rong Ge, Jason D. Lee, and Tengyu Ma, Learning one-hidden-layer neural networks with landscape design, International Conference on Learning Representations, 2018, arXiv:1711.00501.
  • [15] Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang, Stabilized SVRG: Simple variance reduction for nonconvex optimization, Conference on Learning Theory, pp. 1394–1448, PMLR, 2019, arXiv:1905.00529.
  • [16] Rong Ge and Tengyu Ma, On the optimization landscape of tensor decompositions, Advances in Neural Information Processing Systems, pp. 3656–3666, Curran Associates Inc., 2017, arXiv:1706.05598.
  • [17] Moritz Hardt, Tengyu Ma, and Benjamin Recht, Gradient descent learns linear dynamical systems, Journal of Machine Learning Research 19 (2018), no. 29, 1–44, arXiv:1609.05191.
  • [18] Prateek Jain, Chi Jin, Sham Kakade, and Praneeth Netrapalli, Global convergence of non-convex gradient descent for computing matrix squareroot, Artificial Intelligence and Statistics, pp. 479–488, 2017, arXiv:1507.05854.
  • [19] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan, How to escape saddle points efficiently, Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1724–1732, 2017, arXiv:1703.00887.
  • [20] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I. Jordan, On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points, Journal of the ACM (JACM) 68 (2021), no. 2, 1–29, arXiv:1902.04811.
  • [21] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, Conference on Learning Theory, pp. 1042–1085, 2018, arXiv:1711.10456.
  • [22] Michael I. Jordan, On gradient-based optimization: Accelerated, distributed, asynchronous and stochastic optimization, https://www.youtube.com/watch?v=VE2ITg%5FhGnI, 2017.
  • [23] Zhize Li, SSRGD: Simple stochastic recursive gradient descent for escaping saddle points, Advances in Neural Information Processing Systems 32 (2019), 1523–1533, arXiv:1904.09265.
  • [24] Mingrui Liu, Zhe Li, Xiaoyu Wang, Jinfeng Yi, and Tianbao Yang, Adaptive negative curvature descent with applications in non-convex optimization, Advances in Neural Information Processing Systems 31 (2018), 4853–4862.
  • [25] Yurii Nesterov and Boris T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming 108 (2006), no. 1, 177–205.
  • [26] Yurii E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k2){O}(1/k^{2}), Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
  • [27] Ju Sun, Qing Qu, and John Wright, A geometric analysis of phase retrieval, Foundations of Computational Mathematics 18 (2018), no. 5, 1131–1198, arXiv:1602.06664.
  • [28] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, Advances in neural Information Processing Systems, pp. 2899–2908, 2018, arXiv:1711.02838.
  • [29] Yi Xu, Rong Jin, and Tianbao Yang, NEON+: Accelerated gradient methods for extracting negative curvature for non-convex optimization, 2017, arXiv:1712.01033.
  • [30] Yi Xu, Rong Jin, and Tianbao Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, Advances in Neural Information Processing Systems, pp. 5530–5540, 2018, arXiv:1711.01944.
  • [31] Yaodong Yu, Pan Xu, and Quanquan Gu, Third-order smoothness helps: Faster stochastic optimization algorithms for finding local minima, Advances in Neural Information Processing Systems (2018), 4530–4540.
  • [32] Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu, A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery, International Conference on Machine Learning, pp. 5862–5871, PMLR, 2018.
  • [33] Dongruo Zhou and Quanquan Gu, Stochastic recursive variance-reduced cubic regularization methods, International Conference on Artificial Intelligence and Statistics, pp. 3980–3990, 2020, arXiv:1901.11518.
  • [34] Dongruo Zhou, Pan Xu, and Quanquan Gu, Finding local minima via stochastic nested variance reduction, 2018, arXiv:1806.08782.
  • [35] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic variance-reduced cubic regularized Newton methods, International Conference on Machine Learning, pp. 5990–5999, PMLR, 2018, arXiv:1802.04796.
  • [36] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic nested variance reduction for nonconvex optimization, Journal of Machine Learning Research 21 (2020), no. 103, 1–63, arXiv:1806.07811.

Appendix A Auxiliary lemmas

In this section, we introduce auxiliary lemmas that are necessary for our proofs.

Lemma 10 ([20, Lemma 19]).

If f()f(\cdot) is \ell-smooth and ρ\rho-Hessian Lipschitz, η=1/\eta=1/\ell, then the gradient descent sequence {𝐱t}\{\mathbf{x}_{t}\} obtained by 𝐱t+1:=𝐱tηf(𝐱t)\mathbf{x}_{t+1}:=\mathbf{x}_{t}-\eta\nabla f(\mathbf{x}_{t}) satisfies:

\displaystyle f(\mathbf{x}_{t+1})-f(\mathbf{x}_{t})\leq-\eta\|\nabla f(\mathbf{x}_{t})\|^{2}/2, (40)

for any step tt in which Negative Curvature Finding is not called.

Lemma 11 ([20, Lemma 23]).

For an $\ell$-smooth and $\rho$-Hessian Lipschitz function $f$ whose stochastic gradient satisfies Assumption 1, there exists an absolute constant $c$ such that, for any fixed $t,t_{0},\iota>0$, with probability at least $1-4e^{-\iota}$, the stochastic gradient descent sequence in Algorithm 8 satisfies:

f(𝐱t0+t)f(𝐱t0)η8i=0t1f(𝐱t0+i)2+cσ2(t+ι),\displaystyle f(\mathbf{x}_{t_{0}+t})-f(\mathbf{x}_{t_{0}})\leq-\frac{\eta}{8}\sum_{i=0}^{t-1}\|\nabla f(\mathbf{x}_{t_{0}+i})\|^{2}+c\cdot\frac{\sigma^{2}}{\ell}(t+\iota), (41)

provided that Stochastic Negative Curvature Finding has not been called during steps $t_{0}$ to $t_{0}+t$.

Lemma 12 ([20, Lemma 29]).

Denote α(t):=[τ=0t1(1+ηγ)2(t1τ)]1/2\alpha(t):=\Big{[}\sum_{\tau=0}^{t-1}(1+\eta\gamma)^{2(t-1-\tau)}\Big{]}^{1/2} and β(t):=(1+ηγ)t/2ηγ\beta(t):=(1+\eta\gamma)^{t}/\sqrt{2\eta\gamma}. If ηγ[0,1]\eta\gamma\in[0,1], then we have

  1. 1.

    α(t)β(t)\alpha(t)\leq\beta(t) for any tt\in\mathbb{N};

  2. 2.

    α(t)β(t)/3,tln2/(ηγ)\alpha(t)\geq\beta(t)/\sqrt{3},\quad\forall t\geq\ln 2/(\eta\gamma).
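For intuition, part 1 can be verified directly by summing the geometric series defining $\alpha(t)$ (a routine check, not part of the cited lemma): for $\eta\gamma>0$,

\displaystyle\alpha(t)^{2}=\sum_{s=0}^{t-1}(1+\eta\gamma)^{2s}=\frac{(1+\eta\gamma)^{2t}-1}{(1+\eta\gamma)^{2}-1}\leq\frac{(1+\eta\gamma)^{2t}}{2\eta\gamma}=\beta(t)^{2},

since $(1+\eta\gamma)^{2}-1=2\eta\gamma+\eta^{2}\gamma^{2}\geq 2\eta\gamma$; the case $\eta\gamma=0$ is trivial because $\beta(t)$ then diverges.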

Lemma 13 ([20, Lemma 30]).

Under the notation of Lemma 12 and Appendix D.1, letting γ:=λmin(~)-\gamma:=\lambda_{\min}(\tilde{\mathcal{H}}), for any 0t𝒯s0\leq t\leq\mathscr{T}_{s} and ι>0\iota>0 we have

\displaystyle\Pr\Big(\|\mathbf{q}_{p}(t)\|\leq\beta(t)\eta r_{s}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}, (42)

and

\displaystyle\Pr\Big(\|\mathbf{q}_{p}(t)\|\geq\frac{\beta(t)\eta r_{s}}{10\sqrt{n}}\cdot\frac{\delta}{\mathscr{T}_{s}}\Big)\geq 1-\frac{\delta}{\mathscr{T}_{s}}, (43)
\displaystyle\Pr\Big(\|\mathbf{q}_{p}(t)\|\geq\frac{\beta(t)\eta r_{s}}{10\sqrt{n}}\cdot\delta\Big)\geq 1-\delta. (44)
Definition 14 ([20, Definition 32]).

A random vector $\mathbf{X}\in\mathbb{R}^{n}$ is norm-subGaussian with parameter $\sigma$ (denoted nSG($\sigma$)) if:

Pr(𝐗𝔼[𝐗]t)2et22σ2,t.\displaystyle\Pr(\|\mathbf{X}-\mathbb{E}[\mathbf{X}]\|\geq t)\leq 2e^{-\frac{t^{2}}{2\sigma^{2}}},\quad\forall t\in\mathbb{R}. (45)
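As a simple illustration (ours, not part of [20]): any random vector with almost surely bounded deviation $\|\mathbf{X}-\mathbb{E}[\mathbf{X}]\|\leq R$ is nSG($R$), since

\displaystyle\Pr(\|\mathbf{X}-\mathbb{E}[\mathbf{X}]\|\geq t)\leq\mathbf{1}\{t\leq R\}\leq 2e^{-\frac{t^{2}}{2R^{2}}}\quad\forall t\geq 0,

where the second inequality uses $2e^{-t^{2}/(2R^{2})}\geq 2e^{-1/2}>1$ for $t\leq R$.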
Lemma 15 ([20, Lemma 37]).

For independent random vectors $\mathbf{X}_{1},\ldots,\mathbf{X}_{\tau}\in\mathbb{R}^{n}$, each zero-mean and nSG($\sigma_{i}$) as defined in Definition 14, and for any $\iota>0$ and $B>b>0$, there exists an absolute constant $c$ such that, with probability at least $1-2n\log(B/b)\cdot e^{-\iota}$:

\displaystyle\sum_{i=1}^{\tau}\sigma_{i}^{2}\geq B\quad\text{or}\quad\big\|\sum_{i=1}^{\tau}\mathbf{X}_{i}\big\|\leq c\cdot\sqrt{\max\left\{\sum_{i=1}^{\tau}\sigma_{i}^{2},b\right\}\cdot\iota}. (46)

The next two lemmas are frequently used in the large gradient scenario of the accelerated gradient descent method:

Lemma 16 ([21, Lemma 7]).

Consider the setting of Theorem 23 and define a new parameter

𝒯~:=(ρϵ)1/4cA,\displaystyle\tilde{\mathscr{T}}:=\frac{\sqrt{\ell}}{(\rho\epsilon)^{1/4}}\cdot c_{A}, (47)

for some large enough constant cAc_{A}. If f(𝐱τ)ϵ\|\nabla f(\mathbf{x}_{\tau})\|\geq\epsilon for all τ[0,𝒯~]\tau\in[0,\tilde{\mathscr{T}}], then there exists a large enough positive constant cA0c_{A0}, such that if we choose cAcA0c_{A}\geq c_{A0}, by running Algorithm 2 we have E𝒯~E0E_{\tilde{\mathscr{T}}}-E_{0}\leq-\mathscr{E}, in which =ϵ3ρcA7\mathscr{E}=\sqrt{\frac{\epsilon^{3}}{\rho}}\cdot c_{A}^{-7}, and EτE_{\tau} is defined as:

Eτ:=f(𝐱τ)+12η𝐯τ2.\displaystyle E_{\tau}:=f(\mathbf{x}_{\tau})+\frac{1}{2\eta}\|\mathbf{v}_{\tau}\|^{2}. (48)
Lemma 17 ([21, Lemma 4 and Lemma 5]).

Assume that the function $f$ is $\ell$-smooth. In the setting of Theorem 23, for every iteration that is not within $\mathscr{T}^{\prime}$ steps after a uniform perturbation, we have:

Eτ+1Eτ,\displaystyle E_{\tau+1}\leq E_{\tau}, (49)

where EτE_{\tau} is defined in Eq. (48) in Lemma 16.

Appendix B Proof details of negative curvature finding by gradient descents

B.1 Proof of Lemma 18

Lemma 18.

Under the setting of Proposition 3, we use αt\alpha_{t} to denote

αt=𝐲t,/𝐲t,\displaystyle\alpha_{t}=\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|, (50)

where 𝐲t,\mathbf{y}_{t,\parallel} defined in Eq. (10) is the component of 𝐲t\mathbf{y}_{t} in the subspace 𝔖\mathfrak{S}_{\parallel} spanned by {𝐮1,𝐮2,,𝐮p}\left\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\right\}. Then, during all the 𝒯\mathscr{T} iterations of Algorithm 1, we have αtαmin\alpha_{t}\geq\alpha_{\min} for

αmin=δ04πn,\displaystyle\alpha_{\min}=\frac{\delta_{0}}{4}\sqrt{\frac{\pi}{n}}, (51)

given that α0πnδ0\alpha_{0}\geq\sqrt{\frac{\pi}{n}}\delta_{0}.

To prove Lemma 18, we first introduce the following lemma:

Lemma 19.

Under the setting of Proposition 3 and Lemma 18, for any t>0t>0 and initial value 𝐲0\mathbf{y}_{0}, αt\alpha_{t} achieves its minimum possible value in the specific landscapes satisfying

λ1=λ2==λp=λp+1=λp+2==λn1=ρϵ.\displaystyle\lambda_{1}=\lambda_{2}=\cdots=\lambda_{p}=\lambda_{p+1}=\lambda_{p+2}=\cdots=\lambda_{n-1}=-\sqrt{\rho\epsilon}. (52)
Proof.

We prove this lemma by contradiction. Suppose for some τ>0\tau>0 and initial value 𝐲0\mathbf{y}_{0}, ατ\alpha_{\tau} achieves its minimum value in some landscape gg that does not satisfy Eq. (52). That is to say, there exists some k[n1]k\in[n-1] such that λkρϵ\lambda_{k}\neq-\sqrt{\rho\epsilon}.

We first consider the case k>pk>p. Since λn1,,λp+1ρϵ\lambda_{n-1},\ldots,\lambda_{p+1}\geq-\sqrt{\rho\epsilon}, we have λk>ρϵ\lambda_{k}>-\sqrt{\rho\epsilon}. We use {𝐲g,t}\left\{\mathbf{y}_{g,t}\right\} to denote the iteration sequence of 𝐲t\mathbf{y}_{t} in this landscape.

Consider another landscape hh with λk=ρϵ\lambda_{k}=-\sqrt{\rho\epsilon} and all other eigenvalues being the same as gg, and we use {𝐲h,t}\left\{\mathbf{y}_{h,t}\right\} to denote the iteration sequence of 𝐲t\mathbf{y}_{t} in this landscape. Furthermore, we assume that at each gradient query, the deviation from pure quadratic approximation defined in Eq. (16), denoted Δh,t\Delta_{h,t}, is the same as the corresponding deviation Δg,t\Delta_{g,t} in the landscape gg along all the directions other than 𝐮k\mathbf{u}_{k}. As for its component Δh,t,k\Delta_{h,t,k} along 𝐮k\mathbf{u}_{k}, we set |Δh,t,k|=|Δg,t,k||\Delta_{h,t,k}|=|\Delta_{g,t,k}| with its sign being the same as yt,ky_{t,k}.

Under these settings, we have $y_{h,\tau,j}=y_{g,\tau,j}$ for any $j\in[n]$ with $j\neq k$. As for the component along $\mathbf{u}_{k}$, we have $|y_{h,\tau,k}|>|y_{g,\tau,k}|$. Hence, by the definition of $\mathbf{y}_{t,\parallel}$ in Eq. (10), we have

𝐲g,τ,=(j=1pyg,τ,j2)1/2=(j=1pyh,τ,j2)1/2=𝐲h,τ,,\displaystyle\|\mathbf{y}_{g,\tau,\parallel}\|=\big{(}\sum_{j=1}^{p}y_{g,\tau,j}^{2}\big{)}^{1/2}=\big{(}\sum_{j=1}^{p}y_{h,\tau,j}^{2}\big{)}^{1/2}=\|\mathbf{y}_{h,\tau,\parallel}\|, (53)

whereas

𝐲g,τ=(j=1nyg,τ,j2)1/2<(j=1nyh,τ,j2)1/2=𝐲h,τ,\displaystyle\|\mathbf{y}_{g,\tau}\|=\big{(}\sum_{j=1}^{n}y_{g,\tau,j}^{2}\big{)}^{1/2}<\big{(}\sum_{j=1}^{n}y_{h,\tau,j}^{2}\big{)}^{1/2}=\|\mathbf{y}_{h,\tau}\|, (54)

indicating

αg,τ=𝐲g,τ,𝐲g,τ>𝐲h,τ,𝐲h,τ=αh,τ,\displaystyle\alpha_{g,\tau}=\frac{\|\mathbf{y}_{g,\tau,\parallel}\|}{\|\mathbf{y}_{g,\tau}\|}>\frac{\|\mathbf{y}_{h,\tau,\parallel}\|}{\|\mathbf{y}_{h,\tau}\|}=\alpha_{h,\tau}, (55)

contradicting the supposition that $\alpha_{\tau}$ achieves its minimum possible value in $g$. Similarly, we can also obtain a contradiction for the case $k\leq p$. ∎

Equipped with Lemma 19, we are now ready to prove Lemma 18.

Proof.

In this proof, we consider the worst case, where the initial value α0=πnδ0\alpha_{0}=\sqrt{\frac{\pi}{n}}\delta_{0}. Also, according to Lemma 19, we assume that the eigenvalues satisfy

λ1=λ2==λp=λp+1=λp+2==λn1=ρϵ.\displaystyle\lambda_{1}=\lambda_{2}=\cdots=\lambda_{p}=\lambda_{p+1}=\lambda_{p+2}=\cdots=\lambda_{n-1}=-\sqrt{\rho\epsilon}. (56)

Moreover, we assume that each time we make a gradient call at some point $\mathbf{x}$, the deviation term $\Delta$ from the pure quadratic approximation

Δ=hf(𝐱)(𝟎)𝐱\displaystyle\Delta=\nabla h_{f}(\mathbf{x})-\mathcal{H}(\mathbf{0})\mathbf{x} (57)

lies in the direction that can make αt\alpha_{t} as small as possible. Then, the component Δ\Delta_{\perp} in 𝔖\mathfrak{S}_{\perp} should be in the direction of 𝐱\mathbf{x}_{\perp}, and the component Δ\Delta_{\parallel} in 𝔖\mathfrak{S}_{\parallel} should be in the opposite direction to 𝐱\mathbf{x}_{\parallel}, as long as Δ𝐱\|\Delta_{\parallel}\|\leq\|\mathbf{x}_{\parallel}\|. Hence in this case, we have 𝐲t,/𝐲t\|\mathbf{y}_{t,\perp}\|/\|\mathbf{y}_{t}\| being non-decreasing. Also, it admits the following recurrence formula:

𝐲t+1,\displaystyle\|\mathbf{y}_{t+1,\perp}\| =(1+ρϵ/)𝐲t,+Δ/\displaystyle=(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\perp}\|+\|\Delta_{\perp}\|/\ell (58)
(1+ρϵ/)𝐲t,+Δ/\displaystyle\leq(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\perp}\|+\|\Delta\|/\ell (59)
(1+ρϵ/+Δ𝐲t,)𝐲t,,\displaystyle\leq\Big{(}1+\sqrt{\rho\epsilon}/\ell+\frac{\|\Delta\|}{\ell\|\mathbf{y}_{t,\perp}\|}\Big{)}\|\mathbf{y}_{t,\perp}\|, (60)

where the second inequality is due to the fact that ν\nu can be an arbitrarily small positive number. Note that since 𝐲t,/𝐲t\|\mathbf{y}_{t,\perp}\|/\|\mathbf{y}_{t}\| is non-decreasing in this worst-case scenario, we have

Δ𝐲t,Δ𝐲t𝐲0𝐲0,2Δ𝐲t2ρr,\displaystyle\frac{\|\Delta\|}{\|\mathbf{y}_{t,\perp}\|}\leq\frac{\|\Delta\|}{\|\mathbf{y}_{t}\|}\cdot\frac{\|\mathbf{y}_{0}\|}{\|\mathbf{y}_{0,\perp}\|}\leq\frac{2\|\Delta\|}{\|\mathbf{y}_{t}\|}\leq 2\rho r, (61)

which leads to

𝐲t+1,(1+ρϵ/+2ρr/)𝐲t,.\displaystyle\|\mathbf{y}_{t+1,\perp}\|\leq(1+\sqrt{\rho\epsilon}/\ell+2\rho r/\ell)\|\mathbf{y}_{t,\perp}\|. (62)

On the other hand, suppose for some value $t$, we have $\alpha_{k}\geq\alpha_{\min}$ for all $1\leq k\leq t$. Then,

\displaystyle\|\mathbf{y}_{t+1,\parallel}\|=(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\parallel}\|-\|\Delta_{\parallel}\|/\ell (63)
\displaystyle\geq\Big(1+\sqrt{\rho\epsilon}/\ell-\frac{\|\Delta\|}{\ell\|\mathbf{y}_{t,\parallel}\|}\Big)\|\mathbf{y}_{t,\parallel}\|. (64)

Note that since 𝐲t,/𝐲tαmin\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|\geq\alpha_{\min}, we have

\displaystyle\frac{\|\Delta\|}{\|\mathbf{y}_{t,\parallel}\|}\leq\frac{\|\Delta\|}{\alpha_{\min}\|\mathbf{y}_{t}\|}\leq\rho r/\alpha_{\min}, (65)

which leads to

𝐲t+1,(1+ρϵ/ρr/(αmin))𝐲t,.\displaystyle\|\mathbf{y}_{t+1,\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell))\|\mathbf{y}_{t,\parallel}\|. (66)

Then we can observe that

𝐲t,𝐲t,𝐲0,𝐲0,(1+ρϵ/ρr/(αmin)1+ρϵ/+2ρr/)t,\displaystyle\frac{\|\mathbf{y}_{t,\parallel}\|}{\|\mathbf{y}_{t,\perp}\|}\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{\|\mathbf{y}_{0,\perp}\|}\cdot\Big{(}\frac{1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell)}{1+\sqrt{\rho\epsilon}/\ell+2\rho r/\ell}\Big{)}^{t}, (67)

where

1+ρϵ/ρr/(αmin)1+ρϵ/+2ρr/\displaystyle\frac{1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell)}{1+\sqrt{\rho\epsilon}/\ell+2\rho r/\ell} (1+ρϵ/ρr/(αmin))(1ρϵ/2ρr/)\displaystyle\geq(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell))(1-\sqrt{\rho\epsilon}/\ell-2\rho r/\ell) (68)
1ρϵ/22ρrαmin11𝒯,\displaystyle\geq 1-\rho\epsilon/\ell^{2}-\frac{2\rho r}{\alpha_{\min}\ell}\geq 1-\frac{1}{\mathscr{T}}, (69)

by which we can derive that

\displaystyle\frac{\|\mathbf{y}_{t,\parallel}\|}{\|\mathbf{y}_{t,\perp}\|}\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{\|\mathbf{y}_{0,\perp}\|}(1-1/\mathscr{T})^{t} (70)
\displaystyle\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{\|\mathbf{y}_{0,\perp}\|}\exp\Big(-\frac{t}{\mathscr{T}-1}\Big)\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{2\|\mathbf{y}_{0,\perp}\|}, (71)

indicating

αt=𝐲t,𝐲t,2+𝐲t,2𝐲0,4𝐲0,αmin.\displaystyle\alpha_{t}=\frac{\|\mathbf{y}_{t,\parallel}\|}{\sqrt{\|\mathbf{y}_{t,\parallel}\|^{2}+\|\mathbf{y}_{t,\perp}\|^{2}}}\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{4\|\mathbf{y}_{0,\perp}\|}\geq\alpha_{\min}. (72)

Hence, as long as $\alpha_{k}\geq\alpha_{\min}$ for all $1\leq k\leq t-1$, we also have $\alpha_{t}\geq\alpha_{\min}$ provided $t\leq\mathscr{T}$. Since $\alpha_{0}\geq\alpha_{\min}$, we can thus claim that $\alpha_{t}\geq\alpha_{\min}$ for any $t\leq\mathscr{T}$ by induction. ∎

B.2 Proof of Proposition 5

To make it easier to analyze the properties and running time of Algorithm 2, we introduce a new Algorithm 4,

1 𝐱0\mathbf{x}_{0}\leftarrowUniform(𝔹0(r))(\mathbb{B}_{0}(r^{\prime})) where 𝔹0(r)\mathbb{B}_{0}(r^{\prime}) is the 2\ell_{2}-norm ball centered at 𝐱~\tilde{\mathbf{x}} with radius rr^{\prime};
2 𝐳0𝐱0\mathbf{z}_{0}\leftarrow\mathbf{x}_{0};
3 for t=0,...,\mathscr{T}^{\prime}-1 do
4       𝐱t+1𝐳tη(𝐳t𝐱~rf(r𝐳t𝐱~𝐳t𝐱~+𝐱~)f(𝐱~))\mathbf{x}_{t+1}\leftarrow\mathbf{z}_{t}-\eta\Big{(}\frac{\|\mathbf{z}_{t}-\tilde{\mathbf{x}}\|}{r^{\prime}}\nabla f\Big{(}r^{\prime}\cdot\frac{\mathbf{z}_{t}-\tilde{\mathbf{x}}}{\|\mathbf{z}_{t}-\tilde{\mathbf{x}}\|}+\tilde{\mathbf{x}}\Big{)}-\nabla f(\tilde{\mathbf{x}})\Big{)};
5       𝐯t+1𝐱t+1𝐱t\mathbf{v}_{t+1}\leftarrow\mathbf{x}_{t+1}-\mathbf{x}_{t};
6       𝐳t+1𝐱t+1+(1θ)𝐯t+1\mathbf{z}_{t+1}\leftarrow\mathbf{x}_{t+1}+(1-\theta)\mathbf{v}_{t+1};
7      
Output 𝐱𝒯𝐱~𝐱𝒯𝐱~\frac{\mathbf{x}_{\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}\|}.
Algorithm 4 Accelerated Negative Curvature Finding without Renormalization(𝐱~,r,𝒯\tilde{\mathbf{x}},r^{\prime},\mathscr{T}^{\prime}).

which has a more straightforward structure and has the same effect as Algorithm 2 near any saddle point 𝐱~\tilde{\mathbf{x}}. Quantitatively, this is demonstrated in the following lemma:

Lemma 20.

Under the setting of Proposition 5, suppose the perturbation in Line 2 of Algorithm 2 is added at t=0t=0. Then with the same value of rr^{\prime}, 𝒯\mathscr{T}^{\prime}, 𝐱~\tilde{\mathbf{x}} and 𝐱0\mathbf{x}_{0}, the output of Algorithm 4 is the same as the unit vector 𝐞^\hat{\mathbf{e}} in Algorithm 2 obtained 𝒯\mathscr{T}^{\prime} steps later. In other words, if we separately denote the iteration sequence of {𝐱t}\left\{\mathbf{x}_{t}\right\} in Algorithm 2 and Algorithm 4 as

{𝐱1,0,𝐱1,1,,𝐱1,𝒯},{𝐱2,0,𝐱2,1,,𝐱2,𝒯},\displaystyle\left\{\mathbf{x}_{1,0},\mathbf{x}_{1,1},\ldots,\mathbf{x}_{1,\mathscr{T}^{\prime}}\right\},\qquad\left\{\mathbf{x}_{2,0},\mathbf{x}_{2,1},\ldots,\mathbf{x}_{2,\mathscr{T}^{\prime}}\right\}, (73)

we have

𝐱1,𝒯𝐱~𝐱1,𝒯𝐱~=𝐱2,𝒯𝐱~𝐱2,𝒯𝐱~.\displaystyle\frac{\mathbf{x}_{1,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{1,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}\|}=\frac{\mathbf{x}_{2,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{2,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}\|}. (74)
Proof.

Without loss of generality, we assume $\tilde{\mathbf{x}}=\mathbf{0}$ and $\nabla f(\tilde{\mathbf{x}})=\mathbf{0}$. We prove the desired relationship by induction. Suppose the following identities hold for all $k\leq t$, where $t$ is some natural number:

\displaystyle\frac{\mathbf{x}_{2,k}}{\|\mathbf{x}_{2,k}\|}=\frac{\mathbf{x}_{1,k}}{r^{\prime}},\qquad\frac{\mathbf{z}_{2,k}}{\|\mathbf{z}_{2,k}\|}=\frac{\mathbf{z}_{1,k}}{r^{\prime}}. (75)

Then,

𝐱2,t+1=𝐳2,tη𝐳2,trf(𝐳2,tr/𝐳2,t)=𝐳2,tr(𝐳1,tηf(𝐳1,t)).\displaystyle\mathbf{x}_{2,t+1}=\mathbf{z}_{2,t}-\eta\cdot\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\nabla f(\mathbf{z}_{2,t}\cdot r^{\prime}/\|\mathbf{z}_{2,t}\|)=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}(\mathbf{z}_{1,t}-\eta\nabla f(\mathbf{z}_{1,t})). (76)

Adopting the notation in Algorithm 2, we use 𝐱1,t+1\mathbf{x}_{1,t+1}^{\prime} and 𝐳1,t+1\mathbf{z}_{1,t+1}^{\prime} to separately denote the value of 𝐱1,t+1\mathbf{x}_{1,t+1} and 𝐳1,t+1\mathbf{z}_{1,t+1} before renormalization:

𝐱1,t+1=𝐳1,tηf(𝐳1,t),𝐳1,t+1=𝐱1,t+1+(1θ)(𝐱1,t+1𝐱1,t).\displaystyle\mathbf{x}_{1,t+1}^{\prime}=\mathbf{z}_{1,t}-\eta\nabla f(\mathbf{z}_{1,t}),\quad\mathbf{z}_{1,t+1}^{\prime}=\mathbf{x}_{1,t+1}^{\prime}+(1-\theta)(\mathbf{x}_{1,t+1}^{\prime}-\mathbf{x}_{1,t}). (77)

Then,

𝐱2,t+1=𝐳2,tr(𝐳1,tηf(𝐳1,t))=𝐳2,tr𝐱1,t+1,\displaystyle\mathbf{x}_{2,t+1}=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}(\mathbf{z}_{1,t}-\eta\nabla f(\mathbf{z}_{1,t}))=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\mathbf{x}_{1,t+1}^{\prime}, (78)

which further leads to

𝐳2,t+1=𝐱2,t+1+(1θ)(𝐱2,t+1𝐱2,t)=𝐳2,tr𝐳1,t+1.\displaystyle\mathbf{z}_{2,t+1}=\mathbf{x}_{2,t+1}+(1-\theta)(\mathbf{x}_{2,t+1}-\mathbf{x}_{2,t})=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\mathbf{z}_{1,t+1}^{\prime}. (79)

Note that 𝐳1,t+1=r𝐳1,t+1𝐳1,t+1\mathbf{z}_{1,t+1}=\frac{r^{\prime}}{\|\mathbf{z}_{1,t+1}^{\prime}\|}\cdot\mathbf{z}_{1,t+1}^{\prime}, we thus have

𝐳2,t+1𝐳2,t+1=𝐳1,t+1r.\displaystyle\frac{\mathbf{z}_{2,t+1}}{\|\mathbf{z}_{2,t+1}\|}=\frac{\mathbf{z}_{1,t+1}}{r^{\prime}}. (80)

Hence,

𝐱2,t+1=𝐳2,tr𝐱1,t+1=𝐳2,tr𝐳1,t+1𝐳1,t𝐱1,t+1=𝐳2,t+1r𝐱1,t+1.\displaystyle\mathbf{x}_{2,t+1}=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\mathbf{x}_{1,t+1}^{\prime}=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\frac{\|\mathbf{z}_{1,t+1}\|}{\|\mathbf{z}_{1,t}\|}\cdot\mathbf{x}_{1,t+1}=\frac{\|\mathbf{z}_{2,t+1}\|}{r^{\prime}}\cdot\mathbf{x}_{1,t+1}. (81)

Since (75) holds for $k=0$, by induction it holds for all $k\leq\mathscr{T}^{\prime}$; in particular, it holds for $k=\mathscr{T}^{\prime}$, which gives (74). ∎

Lemma 20 shows that Algorithm 4 also works in principle for finding the negative curvature near any saddle point $\tilde{\mathbf{x}}$. However, Algorithm 4 may produce an exponentially large $\|\mathbf{x}_{t}\|$ during execution, and it is hard to merge with the AGD iterations used in the large-gradient scenario. Hence, only Algorithm 2 is applicable in practical situations.
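For concreteness, the following Python sketch (our illustration, not the authors' implementation) mirrors the pseudocode of Algorithm 4; `grad_f` is an assumed gradient oracle, and `eta`, `theta`, `r_prime`, `T_prime` stand for $\eta$, $\theta$, $r^{\prime}$, $\mathscr{T}^{\prime}$.

```python
import numpy as np

def accelerated_ncf_no_renorm(grad_f, x_tilde, r_prime, T_prime, eta, theta,
                              rng=np.random.default_rng(0)):
    """Minimal sketch of Algorithm 4: accelerated negative curvature finding
    without renormalization, run around a candidate saddle point x_tilde."""
    n = x_tilde.shape[0]
    # Line 1: sample x_0 uniformly from the ball of radius r_prime around x_tilde.
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    x = x_tilde + r_prime * rng.uniform() ** (1.0 / n) * u
    z = x.copy()
    for _ in range(T_prime):
        d = z - x_tilde
        nd = np.linalg.norm(d)
        # Line 4: gradients are always queried on the sphere of radius r_prime
        # and rescaled back, so the oracle is never evaluated far from x_tilde.
        x_new = z - eta * ((nd / r_prime) * grad_f(x_tilde + r_prime * d / nd)
                           - grad_f(x_tilde))
        # Lines 5-6: momentum update.
        z = x_new + (1.0 - theta) * (x_new - x)
        x = x_new
    # ||x - x_tilde|| may grow exponentially with T_prime; only the direction
    # is returned, matching the output line of Algorithm 4.
    return (x - x_tilde) / np.linalg.norm(x - x_tilde)
```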

Use (𝐱~)\mathcal{H}(\tilde{\mathbf{x}}) to denote the Hessian matrix of ff at 𝐱~\tilde{\mathbf{x}}. Observe that (𝐱~)\mathcal{H}(\tilde{\mathbf{x}}) admits the following eigen-decomposition:

(𝐱~)=i=1nλi𝐮i𝐮iT,\displaystyle\mathcal{H}(\tilde{\mathbf{x}})=\sum_{i=1}^{n}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}, (82)

where the set {𝐮i}i=1n\{\mathbf{u}_{i}\}_{i=1}^{n} forms an orthonormal basis of n\mathbb{R}^{n}. Without loss of generality, we assume the eigenvalues λ1,λ2,,λn\lambda_{1},\lambda_{2},\ldots,\lambda_{n} corresponding to 𝐮1,𝐮2,,𝐮n\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n} satisfy

λ1λ2λn,\displaystyle\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}, (83)

in which $\lambda_{1}\leq-\sqrt{\rho\epsilon}$. If $\lambda_{n}\leq-\sqrt{\rho\epsilon}/2$, Proposition 5 holds directly, since no matter the value of $\hat{\mathbf{e}}$, we have $f(\mathbf{x}_{\mathscr{T}^{\prime}})-f(\tilde{\mathbf{x}})\leq-\sqrt{\epsilon^{3}/\rho}/384$. Hence, we only need to prove the case where $\lambda_{n}>-\sqrt{\rho\epsilon}/2$, in which there exists some $1\leq p<n$ with

λpρϵλp+1.\displaystyle\lambda_{p}\leq-\sqrt{\rho\epsilon}\leq\lambda_{p+1}. (84)

We use 𝔖\mathfrak{S}_{\parallel} to denote the subspace of n\mathbb{R}^{n} spanned by {𝐮1,𝐮2,,𝐮p}\left\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\right\}, and use 𝔖\mathfrak{S}_{\perp} to denote the subspace spanned by {𝐮p+1,𝐮p+2,,𝐮n}\left\{\mathbf{u}_{p+1},\mathbf{u}_{p+2},\ldots,\mathbf{u}_{n}\right\}. Then we can have the following lemma:

Lemma 21.

Under the setting of Proposition 5, we use αt\alpha_{t}^{\prime} to denote

αt=𝐱t,𝐱~𝐱t𝐱~,\displaystyle\alpha_{t}^{\prime}=\frac{\|\mathbf{x}_{t,\parallel}-\tilde{\mathbf{x}}_{\parallel}\|}{\|\mathbf{x}_{t}-\tilde{\mathbf{x}}\|}, (85)

in which 𝐱t,\mathbf{x}_{t,\parallel} is the component of 𝐱t\mathbf{x}_{t} in the subspace 𝔖\mathfrak{S}_{\parallel}. Then, during all the 𝒯\mathscr{T}^{\prime} iterations of Algorithm 4, we have αtαmin\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime} for

αmin=δ08πn,\displaystyle\alpha_{\min}^{\prime}=\frac{\delta_{0}}{8}\sqrt{\frac{\pi}{n}}, (86)

given that α0πnδ0\alpha_{0}^{\prime}\geq\sqrt{\frac{\pi}{n}}\delta_{0}.

Proof.

Without loss of generality, we assume 𝐱~=𝟎\tilde{\mathbf{x}}=\mathbf{0} and f(𝐱~)=𝟎\nabla f(\tilde{\mathbf{x}})=\mathbf{0}. In this proof, we consider the worst case, where the initial value α0=πnδ0\alpha_{0}^{\prime}=\sqrt{\frac{\pi}{n}}\delta_{0} and the component x0,nx_{0,n} along 𝐮n\mathbf{u}_{n} equals 0. In addition, according to the same proof of Lemma 19, we assume that the eigenvalues satisfy

λ1=λ2=λ3==λp=λp+1=λp+2==λn1=ρϵ.\displaystyle\lambda_{1}=\lambda_{2}=\lambda_{3}=\cdots=\lambda_{p}=\lambda_{p+1}=\lambda_{p+2}=\cdots=\lambda_{n-1}=-\sqrt{\rho\epsilon}. (87)

For the same reason, we assume that each time we make a gradient call at the point $\mathbf{z}_{t}$, the deviation term $\Delta$ from the pure quadratic approximation

Δ=𝐳tr(f(𝐳tr/𝐳t)(𝟎)r𝐳t𝐳t)\displaystyle\Delta=\frac{\|\mathbf{z}_{t}\|}{r^{\prime}}\cdot\Big{(}\nabla f(\mathbf{z}_{t}\cdot r^{\prime}/\|\mathbf{z}_{t}\|)-\mathcal{H}(\mathbf{0})\cdot\frac{r^{\prime}}{\|\mathbf{z}_{t}\|}\cdot\mathbf{z}_{t}\Big{)} (88)

lies in the direction that can make αt\alpha_{t}^{\prime} as small as possible. Then, the component Δ\Delta_{\parallel} in 𝔖\mathfrak{S}_{\parallel} should be in the opposite direction to 𝐯\mathbf{v}_{\parallel}, and the component Δ\Delta_{\perp} in 𝔖\mathfrak{S}_{\perp} should be in the direction of 𝐯\mathbf{v}_{\perp}. Hence in this case, we have both 𝐱t,/𝐱t\|\mathbf{x}_{t,\perp}\|/\|\mathbf{x}_{t}\| and 𝐳t,/𝐳t\|\mathbf{z}_{t,\perp}\|/\|\mathbf{z}_{t}\| being non-decreasing. Also, it admits the following recurrence formula:

𝐱t+2,\displaystyle\|\mathbf{x}_{t+2,\perp}\| (1+ηρϵ)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ.\displaystyle\leq(1+\eta\sqrt{\rho\epsilon})\big{(}\|\mathbf{x}_{t+1,\perp}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp}\|-\|\mathbf{x}_{t,\perp}\|)\big{)}+\eta\|\Delta_{\perp}\|. (89)

Since 𝐱t,/𝐱t\|\mathbf{x}_{t,\perp}\|/\|\mathbf{x}_{t}\| is non-decreasing in this worst-case scenario, we have

Δ𝐱t+1,Δ𝐱t+1𝐱0𝐱0,2Δ𝐱t+12ρr,\displaystyle\frac{\|\Delta_{\perp}\|}{\|\mathbf{x}_{t+1,\perp}\|}\leq\frac{\|\Delta\|}{\|\mathbf{x}_{t+1}\|}\cdot\frac{\|\mathbf{x}_{0}\|}{\|\mathbf{x}_{0,\perp}\|}\leq\frac{2\|\Delta\|}{\|\mathbf{x}_{t+1}\|}\leq 2\rho r^{\prime}, (90)

which leads to

𝐱t+2,(1+ηρϵ+2ηρr)((2θ)𝐱t+1,(1θ)𝐱t,).\displaystyle\|\mathbf{x}_{t+2,\perp}\|\leq(1+\eta\sqrt{\rho\epsilon}+2\eta\rho r^{\prime})\big{(}(2-\theta)\|\mathbf{x}_{t+1,\perp}\|-(1-\theta)\|\mathbf{x}_{t,\perp}\|\big{)}. (91)

On the other hand, suppose for some value $t$, we have $\alpha_{k}^{\prime}\geq\alpha_{\min}^{\prime}$ for all $1\leq k\leq t+1$. Then,

𝐱t+2,\displaystyle\|\mathbf{x}_{t+2,\parallel}\| (1+η(ρϵν))(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ\displaystyle\geq(1+\eta(\sqrt{\rho\epsilon}-\nu))\big{(}\|\mathbf{x}_{t+1,\parallel}\|+(1-\theta)(\|\mathbf{x}_{t+1,\parallel}\|-\|\mathbf{x}_{t,\parallel}\|)\big{)}+\eta\|\Delta_{\parallel}\| (92)
(1+ηρϵ)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))ηΔ.\displaystyle\geq(1+\eta\sqrt{\rho\epsilon})\big{(}\|\mathbf{x}_{t+1,\parallel}\|+(1-\theta)(\|\mathbf{x}_{t+1,\parallel}\|-\|\mathbf{x}_{t,\parallel}\|)\big{)}-\eta\|\Delta\|. (93)

Note that since 𝐱t+1,/𝐱tαmin\|\mathbf{x}_{t+1,\parallel}\|/\|\mathbf{x}_{t}\|\geq\alpha_{\min}^{\prime} for all t>0t>0, we also have

𝐱t+1,𝐱tαmin,t0,\displaystyle\frac{\|\mathbf{x}_{t+1,\parallel}\|}{\|\mathbf{x}_{t}\|}\geq\alpha_{\min}^{\prime},\quad\forall t\geq 0, (94)

which further leads to

\displaystyle\frac{\|\Delta\|}{\|\mathbf{z}_{t+1,\parallel}\|}\leq\frac{\|\Delta\|}{\alpha_{\min}^{\prime}\|\mathbf{z}_{t+1}\|}\leq\rho r^{\prime}/\alpha_{\min}^{\prime}, (95)

which leads to

𝐱t+2,(1+ηρϵηρr/αmin)((2θ)𝐱t+1,(1θ)𝐱t,).\displaystyle\|\mathbf{x}_{t+2,\parallel}\|\geq(1+\eta\sqrt{\rho\epsilon}-\eta\rho r^{\prime}/\alpha_{\min}^{\prime})\big{(}(2-\theta)\|\mathbf{x}_{t+1,\parallel}\|-(1-\theta)\|\mathbf{x}_{t,\parallel}\|\big{)}. (96)

Consider sequences whose recurrence can be written as

ξt+2=(1+κ)((2θ)ξt+1(1θ)ξt)\displaystyle\xi_{t+2}=(1+\kappa)\big{(}(2-\theta)\xi_{t+1}-(1-\theta)\xi_{t}\big{)} (97)

for some κ>0\kappa>0. Its characteristic equation can be written as

x2(1+κ)(2θ)x+(1+κ)(1θ)=0,\displaystyle x^{2}-(1+\kappa)(2-\theta)x+(1+\kappa)(1-\theta)=0, (98)

whose roots satisfy

x=1+κ2((2θ)±(2θ)24(1θ)1+κ),\displaystyle x=\frac{1+\kappa}{2}\Big{(}(2-\theta)\pm\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa}}\Big{)}, (99)

indicating

ξt=(1+κ2)t(C1(2θ+μ)t+C2(2θμ)t),\displaystyle\xi_{t}=\Big{(}\frac{1+\kappa}{2}\Big{)}^{t}\big{(}C_{1}(2-\theta+\mu)^{t}+C_{2}(2-\theta-\mu)^{t}\big{)}, (100)

where μ:=(2θ)24(1θ)1+κ\mu:=\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa}}, for constants C1C_{1} and C2C_{2} being

{C1=2θμ2μξ0+1(1+κ)μξ1,C2=2θ+μ2μξ01(1+κ)μξ1.\displaystyle\left\{\begin{aligned} &C_{1}=-\frac{2-\theta-\mu}{2\mu}\xi_{0}+\frac{1}{(1+\kappa)\mu}\xi_{1},\\ &C_{2}=\frac{2-\theta+\mu}{2\mu}\xi_{0}-\frac{1}{(1+\kappa)\mu}\xi_{1}.\end{aligned}\right. (101)
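To see where (101) comes from, substitute $t=0$ and $t=1$ into (100) (a routine check):

\displaystyle\xi_{0}=C_{1}+C_{2},\qquad\xi_{1}=\frac{1+\kappa}{2}\big(C_{1}(2-\theta+\mu)+C_{2}(2-\theta-\mu)\big),

so that $C_{1}-C_{2}=\frac{1}{\mu}\big(\frac{2\xi_{1}}{1+\kappa}-(2-\theta)\xi_{0}\big)$; solving this $2\times 2$ linear system yields exactly the constants in (101).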

Then by the inequalities (91) and (96), as long as αkαmin\alpha_{k}^{\prime}\geq\alpha_{\min}^{\prime} for any 1kt11\leq k\leq t-1, the values 𝐱t,\|\mathbf{x}_{t,\perp}\| and 𝐱t,\|\mathbf{x}_{t,\parallel}\| satisfy

𝐱t,\displaystyle\|\mathbf{x}_{t,\perp}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\leq\Big{(}-\frac{2-\theta-\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}+\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t} (102)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θμ)t,\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}-\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta-\mu_{\perp})^{t}, (103)

and

𝐱t,\displaystyle\|\mathbf{x}_{t,\parallel}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\geq\Big{(}-\frac{2-\theta-\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}+\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (104)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θμ)t,\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}-\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta-\mu_{\parallel})^{t}, (105)

where

κ\displaystyle\kappa_{\perp} =ηρϵ+2ηρr,\displaystyle=\eta\sqrt{\rho\epsilon}+2\eta\rho r^{\prime},\qquad ξ0,\displaystyle\xi_{0,\perp} =𝐱0,,\displaystyle=\|\mathbf{x}_{0,\perp}\|,\qquad ξ1,\displaystyle\xi_{1,\perp} =(1+κ)ξ0,,\displaystyle=(1+\kappa_{\perp})\xi_{0,\perp}, (106)
κ\displaystyle\kappa_{\parallel} =ηρϵηρr/αmin,\displaystyle=\eta\sqrt{\rho\epsilon}-\eta\rho r^{\prime}/\alpha_{\min}^{\prime},\qquad ξ0,\displaystyle\xi_{0,\parallel} =𝐱0,,\displaystyle=\|\mathbf{x}_{0,\parallel}\|,\qquad ξ1,\displaystyle\xi_{1,\parallel} =(1+κ)ξ0,.\displaystyle=(1+\kappa_{\parallel})\xi_{0,\parallel}. (107)

Furthermore, we can derive that

𝐱t,\displaystyle\|\mathbf{x}_{t,\perp}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\leq\Big{(}-\frac{2-\theta-\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}+\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t} (108)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}-\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t} (109)
ξ0,(1+κ2)t(2θ+μ)t=𝐱0,(1+κ2)t(2θ+μ)t,\displaystyle\leq\xi_{0,\perp}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t}=\|\mathbf{x}_{0,\perp}\|\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t}, (110)

and

𝐱t,\displaystyle\|\mathbf{x}_{t,\parallel}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\geq\Big{(}-\frac{2-\theta-\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}+\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (111)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θμ)t\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}-\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta-\mu_{\parallel})^{t} (112)
(2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\geq\Big{(}-\frac{2-\theta-\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}+\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (113)
=μ+θ2μξ0,(1+κ2)t(2θ+μ)t\displaystyle=\frac{\mu_{\parallel}+\theta}{2\mu_{\parallel}}\xi_{0,\parallel}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (114)
𝐱0,2(1+κ2)t(2θ+μ)t.\displaystyle\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t}. (115)

Then we can observe that

𝐱t,𝐱t,𝐱0,2𝐱0,(1+κ1+κ)t(2θ+μ2θ+μ)t,\displaystyle\frac{\|\mathbf{x}_{t,\parallel}\|}{\|\mathbf{x}_{t,\perp}\|}\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{1+\kappa_{\perp}}\Big{)}^{t}\cdot\Big{(}\frac{2-\theta+\mu_{\parallel}}{2-\theta+\mu_{\perp}}\Big{)}^{t}, (116)

where

1+κ1+κ\displaystyle\frac{1+\kappa_{\parallel}}{1+\kappa_{\perp}} (1+κ)(1κ)\displaystyle\geq(1+\kappa_{\parallel})(1-\kappa_{\perp}) (117)
1(2+1/αmin)ηρrκκ\displaystyle\geq 1-(2+1/\alpha_{\min}^{\prime})\eta\rho r^{\prime}-\kappa_{\parallel}\kappa_{\perp} (118)
12ηρr/αmin,\displaystyle\geq 1-2\eta\rho r^{\prime}/\alpha^{\prime}_{\min}, (119)

and

2θ+μ2θ+μ\displaystyle\frac{2-\theta+\mu_{\parallel}}{2-\theta+\mu_{\perp}} =1+μ/(2θ)1+μ/(2θ)\displaystyle=\frac{1+\mu_{\parallel}/(2-\theta)}{1+\mu_{\perp}/(2-\theta)} (120)
=1+14(1θ)(1+κ)(2θ)21+14(1θ)(1+κ)(2θ)2\displaystyle=\frac{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\parallel})(2-\theta)^{2}}}}{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\perp})(2-\theta)^{2}}}} (121)
(1+12θθ2+κ(2θ)21+κ)(112θθ2+κ(2θ)21+κ)\displaystyle\geq\Big{(}1+\frac{1}{2-\theta}\sqrt{\frac{\theta^{2}+\kappa_{\parallel}(2-\theta)^{2}}{1+\kappa_{\parallel}}}\Big{)}\Big{(}1-\frac{1}{2-\theta}\sqrt{\frac{\theta^{2}+\kappa_{\perp}(2-\theta)^{2}}{1+\kappa_{\perp}}}\Big{)} (122)
12(κκ)θ13ηρrαminθ,\displaystyle\geq 1-\frac{2(\kappa_{\perp}-\kappa_{\parallel})}{\theta}\geq 1-\frac{3\eta\rho r^{\prime}}{\alpha_{\min}^{\prime}\theta}, (123)

by which we can derive that

\displaystyle\frac{\|\mathbf{x}_{t,\parallel}\|}{\|\mathbf{x}_{t,\perp}\|}\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}\cdot\Big(1-\frac{4\eta\rho r^{\prime}}{\alpha_{\min}^{\prime}\theta}\Big)^{t} (124)
\displaystyle\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}(1-1/\mathscr{T}^{\prime})^{t} (125)
\displaystyle\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}\exp\Big(-\frac{t}{\mathscr{T}^{\prime}-1}\Big)\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{4\|\mathbf{x}_{0,\perp}\|}, (126)

indicating

αt=𝐱t,𝐱t,2+𝐱t,2𝐱0,8𝐱0,αmin.\displaystyle\alpha_{t}^{\prime}=\frac{\|\mathbf{x}_{t,\parallel}\|}{\sqrt{\|\mathbf{x}_{t,\parallel}\|^{2}+\|\mathbf{x}_{t,\perp}\|^{2}}}\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{8\|\mathbf{x}_{0,\perp}\|}\geq\alpha_{\min}^{\prime}. (127)

Hence, as long as $\alpha_{k}^{\prime}\geq\alpha_{\min}^{\prime}$ for all $1\leq k\leq t-1$, we also have $\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime}$ provided $t\leq\mathscr{T}^{\prime}$. Since $\alpha_{0}^{\prime}\geq\alpha_{\min}^{\prime}$ and $\alpha_{1}^{\prime}\geq\alpha_{\min}^{\prime}$, we can claim that $\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime}$ for any $t\leq\mathscr{T}^{\prime}$ by induction. ∎

Equipped with Lemma 21, we are now ready to prove Proposition 5.

Proof.

By Lemma 20, the unit vector $\hat{\mathbf{e}}$ in Line 2 of Algorithm 2 obtained after $\mathscr{T}^{\prime}$ iterations equals the output of Algorithm 4 starting from $\tilde{\mathbf{x}}$. Hence, in this proof we consider the output of Algorithm 4 instead of the original Algorithm 2.

If λnρϵ/2\lambda_{n}\leq-\sqrt{\rho\epsilon}/2, Proposition 5 holds directly. Hence, we only need to prove the case where λn>ρϵ/2\lambda_{n}>-\sqrt{\rho\epsilon}/2, in which there exists some pp^{\prime} with

\displaystyle\lambda_{p^{\prime}}\leq-\sqrt{\rho\epsilon}/2<\lambda_{p^{\prime}+1}. (128)

We use $\mathfrak{S}_{\parallel}^{\prime}$ and $\mathfrak{S}_{\perp}^{\prime}$ to denote the subspaces of $\mathbb{R}^{n}$ spanned by $\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p^{\prime}}\}$ and $\{\mathbf{u}_{p^{\prime}+1},\mathbf{u}_{p^{\prime}+2},\ldots,\mathbf{u}_{n}\}$, respectively. Furthermore, we define $\mathbf{x}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\langle\mathbf{u}_{i},\mathbf{x}_{t}\rangle\mathbf{u}_{i}$, $\mathbf{x}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\langle\mathbf{u}_{i},\mathbf{x}_{t}\rangle\mathbf{u}_{i}$, $\mathbf{v}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\langle\mathbf{u}_{i},\mathbf{v}_{t}\rangle\mathbf{u}_{i}$, and $\mathbf{v}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\langle\mathbf{u}_{i},\mathbf{v}_{t}\rangle\mathbf{u}_{i}$, which denote the components of $\mathbf{x}_{t}$ and $\mathbf{v}_{t}$ in Algorithm 4 lying in the subspaces $\mathfrak{S}_{\parallel}^{\prime}$ and $\mathfrak{S}_{\perp}^{\prime}$, and let $\alpha_{t}^{\prime}:=\|\mathbf{x}_{t,\parallel}\|/\|\mathbf{x}_{t}\|$. Consider the case where $\alpha_{0}^{\prime}\geq\sqrt{\frac{\pi}{n}}\delta_{0}$, which can be achieved with probability

Pr{α0πnδ0}1πnδ0Vol(𝔹0n1(1))Vol(𝔹0n(1))1πnδ0nπ=1δ0,\displaystyle\Pr\left\{\alpha_{0}^{\prime}\geq\sqrt{\frac{\pi}{n}}\delta_{0}\right\}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\frac{\text{Vol}(\mathbb{B}_{0}^{n-1}(1))}{\text{Vol}(\mathbb{B}_{0}^{n}(1))}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\sqrt{\frac{n}{\pi}}=1-\delta_{0}, (129)

we prove that there exists some t0t_{0} with 1t0𝒯1\leq t_{0}\leq\mathscr{T}^{\prime} such that

𝐱t0,𝐱t0ρϵ8.\displaystyle\frac{\|\mathbf{x}_{t_{0},\perp^{\prime}}\|}{\|\mathbf{x}_{t_{0}}\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}. (130)

Assume the contrary, i.e., for every $t$ with $1\leq t\leq\mathscr{T}^{\prime}$ we have $\frac{\|\mathbf{x}_{t,\perp^{\prime}}\|}{\|\mathbf{x}_{t}\|}>\frac{\sqrt{\rho\epsilon}}{8\ell}$ and $\frac{\|\mathbf{z}_{t,\perp^{\prime}}\|}{\|\mathbf{z}_{t}\|}>\frac{\sqrt{\rho\epsilon}}{8\ell}$. Focus on the case where $\|\mathbf{x}_{t,\perp^{\prime}}\|$, the component of $\mathbf{x}_{t}$ in the subspace $\mathfrak{S}_{\perp}^{\prime}$, achieves the largest possible value. Then in this case, we have the following recurrence formula:

𝐱t+2,(1+ηρϵ/2)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ.\displaystyle\|\mathbf{x}_{t+2,\perp^{\prime}}\|\leq(1+\eta\sqrt{\rho\epsilon}/2)\big{(}\|\mathbf{x}_{t+1,\perp^{\prime}}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp^{\prime}}\|-\|\mathbf{x}_{t,\perp^{\prime}}\|)\big{)}+\eta\|\Delta_{\perp^{\prime}}\|. (131)

Since 𝐳k,𝐳kρϵ8\frac{\|\mathbf{z}_{k,\perp^{\prime}}\|}{\|\mathbf{z}_{k}\|}\geq\frac{\sqrt{\rho\epsilon}}{8\ell} for any 1kt+11\leq k\leq t+1, we can derive that

\displaystyle\frac{\|\Delta_{\perp^{\prime}}\|}{\|\mathbf{x}_{t+1,\perp^{\prime}}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp^{\prime}}\|-\|\mathbf{x}_{t,\perp^{\prime}}\|)}\leq\frac{\|\Delta\|}{\|\mathbf{z}_{t,\perp^{\prime}}\|}\leq\frac{2\rho r^{\prime}}{\sqrt{\rho\epsilon}}, (132)

which leads to

𝐱t+2,\displaystyle\|\mathbf{x}_{t+2,\perp^{\prime}}\| (1+ηρϵ/2)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ\displaystyle\leq(1+\eta\sqrt{\rho\epsilon}/2)\big{(}\|\mathbf{x}_{t+1,\perp^{\prime}}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp^{\prime}}\|-\|\mathbf{x}_{t,\perp^{\prime}}\|)\big{)}+\eta\|\Delta_{\perp^{\prime}}\| (133)
(1+ηρϵ/2+2ρr/ρϵ)((2θ)𝐱t+1,(1θ)𝐱t,).\displaystyle\leq(1+\eta\sqrt{\rho\epsilon}/2+2\rho r^{\prime}/\sqrt{\rho\epsilon})\big{(}(2-\theta)\|\mathbf{x}_{t+1,\perp^{\prime}}\|-(1-\theta)\|\mathbf{x}_{t,\perp^{\prime}}\|\big{)}. (134)

Using the characteristic equation technique from the proof of Lemma 21, it can be further derived that

𝐱t,𝐱0,(1+κ2)t(2θ+μ)t,\displaystyle\|\mathbf{x}_{t,\perp^{\prime}}\|\leq\|\mathbf{x}_{0,\perp^{\prime}}\|\cdot\Big{(}\frac{1+\kappa_{\perp^{\prime}}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp^{\prime}})^{t}, (135)

for κ=ηρϵ/2+2ρr/ρϵ\kappa_{\perp^{\prime}}=\eta\sqrt{\rho\epsilon}/2+2\rho r^{\prime}/\sqrt{\rho\epsilon} and μ=(2θ)24(1θ)1+κ\mu_{\perp^{\prime}}=\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa_{\perp^{\prime}}}}, given 𝐱k,𝐱kρϵ8\frac{\|\mathbf{x}_{k,\perp^{\prime}}\|}{\|\mathbf{x}_{k}\|}\geq\frac{\sqrt{\rho\epsilon}}{8\ell} and 𝐳k,𝐳kρϵ8\frac{\|\mathbf{z}_{k,\perp^{\prime}}\|}{\|\mathbf{z}_{k}\|}\geq\frac{\sqrt{\rho\epsilon}}{8\ell} for any 1kt11\leq k\leq t-1. Due to Lemma 21,

αtαmin=δ08πn,1t𝒯.\displaystyle\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime}=\frac{\delta_{0}}{8}\sqrt{\frac{\pi}{n}},\qquad\forall 1\leq t\leq\mathscr{T}^{\prime}. (136)

Moreover, it is demonstrated in the proof of Lemma 21 that

𝐱t,𝐱0,2(1+κ2)t(2θ+μ)t,1t𝒯,\displaystyle\|\mathbf{x}_{t,\parallel}\|\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t},\qquad\forall 1\leq t\leq\mathscr{T}^{\prime}, (137)

for κ=ηρϵηρr/αmin\kappa_{\parallel}=\eta\sqrt{\rho\epsilon}-\eta\rho r^{\prime}/\alpha_{\min}^{\prime} and μ=(2θ)24(1θ)1+κ\mu_{\parallel}=\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa_{\parallel}}}. Observe that

𝐱𝒯,𝐱𝒯,\displaystyle\frac{\|\mathbf{x}_{\mathscr{T}^{\prime},\perp^{\prime}}\|}{\|\mathbf{x}_{\mathscr{T}^{\prime},\parallel}\|} 2𝐱0,𝐱0,(1+κ1+κ)𝒯(2θ+μ2θ+μ)𝒯\displaystyle\leq\frac{2\|\mathbf{x}_{0,\perp^{\prime}}\|}{\|\mathbf{x}_{0,\parallel}\|}\cdot\Big{(}\frac{1+\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}}\cdot\Big{(}\frac{2-\theta+\mu_{\perp^{\prime}}}{2-\theta+\mu_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}} (138)
2δ0nπ(1+κ1+κ)𝒯(2θ+μ2θ+μ)𝒯,\displaystyle\leq\frac{2}{\delta_{0}}\sqrt{\frac{n}{\pi}}\Big{(}\frac{1+\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}}\cdot\Big{(}\frac{2-\theta+\mu_{\perp^{\prime}}}{2-\theta+\mu_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}}, (139)

where

\displaystyle\frac{1+\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}=1-\frac{\kappa_{\parallel}-\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}=1-\frac{\eta\sqrt{\rho\epsilon}/2-\rho r^{\prime}(\eta/\alpha_{\min}^{\prime}+2/\sqrt{\rho\epsilon})}{1+\kappa_{\parallel}}\leq 1-\frac{\eta\sqrt{\rho\epsilon}}{4}, (140)

and

2θ+μ2θ+μ\displaystyle\frac{2-\theta+\mu_{\perp^{\prime}}}{2-\theta+\mu_{\parallel}} =1+14(1θ)(1+κ)(2θ)21+14(1θ)(1+κ)(2θ)2\displaystyle=\frac{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\perp^{\prime}})(2-\theta)^{2}}}}{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\parallel})(2-\theta)^{2}}}} (141)
11+(14(1θ)(1+κ)(2θ)214(1θ)(1+κ)(2θ)2)\displaystyle\leq\frac{1}{1+\Big{(}\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\perp^{\prime}})(2-\theta)^{2}}}-\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\parallel})(2-\theta)^{2}}}\Big{)}} (142)
1κκθ\displaystyle\leq 1-\frac{\kappa_{\parallel}-\kappa_{\perp^{\prime}}}{\theta} (143)
1ηρϵ4θ=1(ρϵ)1/416.\displaystyle\leq 1-\frac{\eta\sqrt{\rho\epsilon}}{4\theta}=1-\frac{(\rho\epsilon)^{1/4}}{16\sqrt{\ell}}. (144)

Hence,

𝐱𝒯,𝐱𝒯,\displaystyle\frac{\|\mathbf{x}_{\mathscr{T}^{\prime},\perp^{\prime}}\|}{\|\mathbf{x}_{\mathscr{T}^{\prime},\parallel}\|} 2δ0nπ(1(ρϵ)1/416)𝒯ρϵ8.\displaystyle\leq\frac{2}{\delta_{0}}\sqrt{\frac{n}{\pi}}\Big{(}1-\frac{(\rho\epsilon)^{1/4}}{16\sqrt{\ell}}\Big{)}^{\mathscr{T}^{\prime}}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}. (145)

Since $\|\mathbf{x}_{\mathscr{T}^{\prime},\parallel}\|\leq\|\mathbf{x}_{\mathscr{T}^{\prime}}\|$, we have $\frac{\|\mathbf{x}_{\mathscr{T}^{\prime},\perp^{\prime}}\|}{\|\mathbf{x}_{\mathscr{T}^{\prime}}\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}$, a contradiction. Hence, there exists some $t_{0}$ with $1\leq t_{0}\leq\mathscr{T}^{\prime}$ such that $\frac{\|\mathbf{x}_{t_{0},\perp^{\prime}}\|}{\|\mathbf{x}_{t_{0}}\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}$. Consider the normalized vector $\hat{\mathbf{e}}=\mathbf{x}_{t_{0}}/\|\mathbf{x}_{t_{0}}\|$; we use $\hat{\mathbf{e}}_{\perp^{\prime}}$ and $\hat{\mathbf{e}}_{\parallel^{\prime}}$ to separately denote the components of $\hat{\mathbf{e}}$ in $\mathfrak{S}_{\perp}^{\prime}$ and $\mathfrak{S}_{\parallel}^{\prime}$. Then $\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$ whereas $\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|\geq 1-\rho\epsilon/(8\ell)^{2}$. Then,

𝐞^T(𝟎)𝐞^=(𝐞^+𝐞^)T(𝟎)(𝐞^+𝐞^),\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})^{T}\mathcal{H}(\mathbf{0})(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}), (146)

since (𝟎)𝐞^𝔖\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\in\mathfrak{S}_{\perp}^{\prime} and (𝟎)𝐞^𝔖\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\in\mathfrak{S}_{\parallel}^{\prime}, it can be further simplified to

\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}. (147)

Due to the $\ell$-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value upper bounded by $\ell$. Hence,

\displaystyle\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\leq\ell\|\hat{\mathbf{e}}_{\perp^{\prime}}\|^{2}\leq\frac{\rho\epsilon}{64\ell}. (148)

Further, according to the definition of $\mathfrak{S}_{\parallel}^{\prime}$, we have

𝐞^T(𝟎)𝐞^ρϵ2𝐞^2.\displaystyle\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}. (149)

Combining these two inequalities together, we can obtain

\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}+\frac{\rho\epsilon}{64\ell}\leq-\frac{\sqrt{\rho\epsilon}}{4}. (150) ∎

Appendix C Proof details of escaping from saddle points by negative curvature finding

C.1 Algorithms for escaping from saddle points using negative curvature finding

In this subsection, we first present the algorithm for escaping from saddle points using Algorithm 1; it is given as Algorithm 5.

1 Input: 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n};
2 for t=0,1,,Tt=0,1,...,T do
3       if f(𝐱t)ϵ\|\nabla f(\mathbf{x}_{t})\|\leq\epsilon then
4             𝐞^\hat{\mathbf{e}}\leftarrowNegativeCurvatureFinding(𝐱t,r,𝒯\mathbf{x}_{t},r,\mathscr{T}) ;
5             \mathbf{x}_{t}\leftarrow\mathbf{x}_{t}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{t})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{t})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}};
6            
7      𝐱t+1𝐱t1f(𝐱t)\mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t}-\frac{1}{\ell}\nabla f(\mathbf{x}_{t});
8      
Algorithm 5 Perturbed Gradient Descent with Negative Curvature Finding

Observe that Algorithm 5 and Algorithm 2 are similar to perturbed gradient descent and perturbed accelerated gradient descent, respectively, with the uniform perturbation step replaced by our negative curvature finding algorithms. One may worry that Algorithm 5 involves nested loops, since the negative curvature finding subroutine is contained in the main loop, contradicting our previous claim that Algorithm 5 contains only a single loop. However, every step of Algorithm 5, inside or outside the negative curvature finding subroutine, consists only of gradient descent steps and a single perturbation. Hence, Algorithm 5 is essentially a single-loop algorithm, and we count its iteration number as the total number of gradient calls.
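To illustrate this single-loop structure, here is a minimal Python sketch (ours, not the authors' code); `grad_f` is an assumed gradient oracle and `negative_curvature_finding` stands for Algorithm 1, treated as a black box that returns a unit vector $\hat{\mathbf{e}}$.

```python
import numpy as np

def perturbed_gd_with_ncf(grad_f, negative_curvature_finding, x0,
                          T, eps, ell, rho, r, T_ncf):
    """Sketch of Algorithm 5: plain gradient descent, with Algorithm 1 invoked
    whenever the gradient is small (Lines 3-5 of the pseudocode above)."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        if np.linalg.norm(grad_f(x)) <= eps:
            e_hat = negative_curvature_finding(x, r, T_ncf)
            # Move a distance sqrt(eps / rho) / 4 along -sign(f'_e(x)) * e_hat,
            # i.e., downhill along the negative curvature direction.
            fp = float(np.dot(grad_f(x), e_hat))
            x = x - np.sign(fp) * 0.25 * np.sqrt(eps / rho) * e_hat
        # Line 7: gradient descent step with step size 1 / ell.
        x = x - grad_f(x) / ell
    return x
```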

C.2 Proof details of escaping saddle points using Algorithm 1

In this subsection, we prove:

Theorem 22.

For any $\epsilon>0$ and $0<\delta\leq 1$, Algorithm 5 with the parameters chosen in Proposition 3 satisfies that at least $1/4$ of its iterations are $\epsilon$-approximate second-order stationary points, using

O~((f(𝐱0)f)ϵ2logn)\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\cdot\log n\Big{)}

iterations, with probability at least 1δ1-\delta, where ff^{*} is the global minimum of ff.

Proof.

Let the parameters be chosen according to (2), and set the total step number TT to be:

T=max{8(f(𝐱𝟎)f)ϵ2,768(f(𝐱𝟎)f)ρϵ3},\displaystyle T=\max\left\{\frac{8\ell(f(\mathbf{x_{0}})-f^{*})}{\epsilon^{2}},768(f(\mathbf{x_{0}})-f^{*})\cdot\sqrt{\frac{\rho}{\epsilon^{3}}}\right\}, (151)

similar to the perturbed gradient descent algorithm [20, Algorithm 4]. We first assume that, each time we apply negative curvature finding (Algorithm 1) at some iterate $\mathbf{x}_{t}$ with the parameter $\delta_{0}$ chosen as

\displaystyle\delta_{0}=\frac{\delta}{384(f(\mathbf{x}_{0})-f^{*})}\sqrt{\frac{\epsilon^{3}}{\rho}}, (152)

we can successfully obtain a unit vector 𝐞^\hat{\mathbf{e}} with 𝐞^T𝐞^ρϵ/4\hat{\mathbf{e}}^{T}\mathcal{H}\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4, as long as λmin((𝐱t))ρϵ\lambda_{\min}(\mathcal{H}(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}. The error probability of this assumption is provided later.

Under this assumption, Algorithm 1 can be called for at most 384(f(𝐱𝟎)f)ρϵ3T2384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\leq\frac{T}{2} times, for otherwise the function value decrease will be greater than f(𝐱𝟎)ff(\mathbf{x_{0}})-f^{*}, which is not possible. Then, the error probability that some calls to Algorithm 1 fails is upper bounded by

384(f(𝐱𝟎)f)ρϵ3δ0=δ.\displaystyle 384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\delta_{0}=\delta. (153)

The remaining iterations, in which Algorithm 1 is not called, are either large-gradient steps with $\|\nabla f(\mathbf{x}_{t})\|\geq\epsilon$ or $\epsilon$-approximate second-order stationary points. Among them, the number of large-gradient steps cannot be more than $T/4$ because otherwise, by Lemma 10 in Appendix A:

f(𝐱T)f(𝐱0)Tηϵ2/8<f,\displaystyle f(\mathbf{x}_{T})\leq f(\mathbf{x}_{0})-T\eta\epsilon^{2}/8<f^{*},

a contradiction. Therefore, we conclude that at least T/4T/4 of the iterations must be ϵ\epsilon-approximate second-order stationary points, with probability at least 1δ1-\delta.

The number of iterations can be viewed as the sum of two parts: the number of iterations needed for gradient descent, denoted by $T_{1}$, and the number of iterations needed for negative curvature finding, denoted by $T_{2}$. With probability at least $1-\delta$,

T1=T=O~((f(𝐱0)f)ϵ2).\displaystyle T_{1}=T=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\Big{)}. (154)

As for T2T_{2}, with probability at least 1δ1-\delta, Algorithm 1 is called for at most 384(f(𝐱𝟎)f)ρϵ3384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}} times, and by Proposition 3 it takes O~(lognρϵ)\tilde{O}\Big{(}\frac{\log n}{\sqrt{\rho\epsilon}}\Big{)} iterations each time. Hence,

T2=384(f(𝐱𝟎)f)ρϵ3O~(lognρϵ)=O~((f(𝐱0)f)ϵ2logn).\displaystyle T_{2}=384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\tilde{O}\Big{(}\frac{\log n}{\sqrt{\rho\epsilon}}\Big{)}=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\cdot\log n\Big{)}. (155)

As a result, the total iteration number T1+T2T_{1}+T_{2} is

O~((f(𝐱0)f)ϵ2logn).\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\cdot\log n\Big{)}. (156)

C.3 Proof details of escaping saddle points using Algorithm 2

We first present here the Negative Curvature Exploitation algorithm proposed in [21, Algorithm 3], which appears in Line 2 of Algorithm 2:

1 if 𝐯ts\|\mathbf{v}_{t}\|\geq s then
2       𝐱t+1𝐱t\mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t};
3      
4else
5       \xi=s\cdot\mathbf{v}_{t}/\|\mathbf{v}_{t}\|;
6       \mathbf{x}_{t+1}\leftarrow\text{argmin}_{\mathbf{x}\in\left\{\mathbf{x}_{t}+\xi,\mathbf{x}_{t}-\xi\right\}}f(\mathbf{x});
7      
Output (𝐳t+1,𝟎)(\mathbf{z}_{t+1},\mathbf{0}).
Algorithm 6 Negative Curvature Exploitation(𝐱t,𝐯t,s)\mathbf{x}_{t},\mathbf{v}_{t},s)
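A minimal Python rendering of this NCE step (our sketch; `f` is the objective, and, following the pseudocode, the momentum is reset to zero on output):

```python
import numpy as np

def negative_curvature_exploitation(f, x_t, v_t, s):
    """Sketch of Algorithm 6: if the momentum v_t is already long, keep x_t;
    otherwise probe a distance s along +/- v_t and keep the lower point."""
    if np.linalg.norm(v_t) >= s:
        x_next = x_t
    else:
        xi = s * v_t / np.linalg.norm(v_t)
        x_next = min((x_t + xi, x_t - xi), key=f)
    # The momentum is reset to zero, matching the output of the pseudocode.
    return x_next, np.zeros_like(x_t)
```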

Now, we give the full version of Theorem 7 as follows:

Theorem 23.

Suppose that the function ff is \ell-smooth and ρ\rho-Hessian Lipschitz. For any ϵ>0\epsilon>0 and a constant 0<δ10<\delta\leq 1, we choose the parameters appearing in Algorithm 2 as follows:

\displaystyle\delta_{0}=\frac{\delta}{384\Delta_{f}}\sqrt{\frac{\epsilon^{3}}{\rho}},\qquad\mathscr{T}^{\prime}=\frac{32\sqrt{\ell}}{(\rho\epsilon)^{1/4}}\log\Big(\frac{\ell}{\delta_{0}}\sqrt{\frac{n}{\rho\epsilon}}\Big),\qquad\zeta=\frac{\ell}{\sqrt{\rho\epsilon}}, (157)
r\displaystyle r^{\prime} =δ0ϵ32πρn,\displaystyle=\frac{\delta_{0}\epsilon}{32}\sqrt{\frac{\pi}{\rho n}}, η\displaystyle\eta =14,\displaystyle=\frac{1}{4\ell}, θ\displaystyle\theta =14ζ,\displaystyle=\frac{1}{4\sqrt{\zeta}}, (158)
\displaystyle\mathscr{E} =ϵ3ρcA7,\displaystyle=\sqrt{\frac{\epsilon^{3}}{\rho}}\cdot c_{A}^{-7}, γ\displaystyle\gamma =θ2η,\displaystyle=\frac{\theta^{2}}{\eta}, s\displaystyle s =γ4ρ,\displaystyle=\frac{\gamma}{4\rho}, (159)

where Δf:=f(𝐱0)f\Delta_{f}:=f(\mathbf{x}_{0})-f^{*} and ff^{*} is the global minimum of ff, and the constant cAc_{A} is chosen large enough to satisfy both the condition in Lemma 16 and cA(384)1/7c_{A}\geq(384)^{1/7}. Then, Algorithm 2 satisfies that at least one of the iterations 𝐳t\mathbf{z}_{t} will be an ϵ\epsilon-approximate second-order stationary point in

O~((f(𝐱0)f)ϵ1.75logn)\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{1.75}}\cdot\log n\Big{)} (160)

iterations, with probability at least 1δ1-\delta.

Proof.

Set the total step number TT to be:

T=max{4Δf(𝒯~+𝒯),768Δf𝒯ρϵ3}=O~((f(𝐱0)f)ϵ1.75logn),\displaystyle T=\max\left\{\frac{4\Delta_{f}(\tilde{\mathscr{T}}+\mathscr{T}^{\prime})}{\mathscr{E}},768\Delta_{f}\mathscr{T}^{\prime}\sqrt{\frac{\rho}{\epsilon^{3}}}\right\}=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{1.75}}\cdot\log n\Big{)}, (161)

where 𝒯~=ζcA\tilde{\mathscr{T}}=\sqrt{\zeta}\cdot c_{A} as defined in Lemma 16, similar to the perturbed accelerated gradient descent algorithm [21, Algorithm 2]. We first assert that for each iteration 𝐱t\mathbf{x}_{t} that a uniform perturbation is added, after 𝒯\mathscr{T}^{\prime} iterations we can successfully obtain a unit vector 𝐞^\hat{\mathbf{e}} with 𝐞^T𝐞^ρϵ/4\hat{\mathbf{e}}^{T}\mathcal{H}\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4, as long as λmin((𝐱t))ρϵ\lambda_{\min}(\mathcal{H}(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}. The error probability of this assumption is provided later.

Under this assumption, the uniform perturbation can be called for at most 384(f(𝐱𝟎)f)ρϵ3384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}} times, for otherwise the function value decrease will be greater than f(𝐱𝟎)ff(\mathbf{x_{0}})-f^{*}, which is not possible. Then, the probability that at least one negative curvature finding subroutine after uniform perturbation fails is upper bounded by

384(f(𝐱𝟎)f)ρϵ3δ0=δ.\displaystyle 384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\delta_{0}=\delta. (162)

The remaining steps, which are not within $\mathscr{T}^{\prime}$ steps after a uniform perturbation, are either large-gradient steps with $\|\nabla f(\mathbf{x}_{t})\|\geq\epsilon$ or $\epsilon$-approximate second-order stationary points. Next, we demonstrate that at least one of these steps is an $\epsilon$-approximate second-order stationary point.

Assume the contrary. We use $N_{\tilde{\mathscr{T}}}$ to denote the number of disjoint time periods of length larger than $\tilde{\mathscr{T}}$ that contain only large-gradient steps and do not contain any step within $\mathscr{T}^{\prime}$ steps after a uniform perturbation. Then it satisfies

N𝒯~T2(𝒯~+𝒯)384Δfρϵ3(2cA7384)Δfρϵ3Δf.\displaystyle N_{\tilde{\mathscr{T}}}\geq\frac{T}{2(\tilde{\mathscr{T}}+\mathscr{T}^{\prime})}-384\Delta_{f}\sqrt{\frac{\rho}{\epsilon^{3}}}\geq(2c_{A}^{7}-384)\Delta_{f}\sqrt{\frac{\rho}{\epsilon^{3}}}\geq\frac{\Delta_{f}}{\mathscr{E}}. (163)

From Lemma 16, during these time intervals the Hamiltonian decreases in total by at least $N_{\tilde{\mathscr{T}}}\cdot\mathscr{E}=\Delta_{f}$. This is impossible: by Lemma 17, the Hamiltonian decreases monotonically in every step except the $\mathscr{T}^{\prime}$ steps after a uniform perturbation, so the overall decrease cannot exceed $\Delta_{f}$, a contradiction. Therefore, we conclude that at least one of the iterations must be an $\epsilon$-approximate second-order stationary point, with probability at least $1-\delta$. ∎
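For reference, the parameter choices (157)–(159) can be evaluated programmatically; the following Python sketch (ours) simply transcribes the formulas, with `c_A` the constant of Theorem 23 and all argument names our own.

```python
import numpy as np

def theorem23_parameters(ell, rho, eps, delta, Delta_f, n, c_A):
    """Sketch: evaluate the parameter choices (157)-(159) of Theorem 23."""
    delta0 = delta / (384.0 * Delta_f) * np.sqrt(eps ** 3 / rho)
    T_prime = 32.0 * np.sqrt(ell) / (rho * eps) ** 0.25 \
        * np.log(ell / delta0 * np.sqrt(n / (rho * eps)))
    zeta = ell / np.sqrt(rho * eps)
    r_prime = delta0 * eps / 32.0 * np.sqrt(np.pi / (rho * n))
    eta = 1.0 / (4.0 * ell)
    theta = 1.0 / (4.0 * np.sqrt(zeta))
    script_E = np.sqrt(eps ** 3 / rho) * c_A ** (-7.0)
    gamma = theta ** 2 / eta
    s = gamma / (4.0 * rho)
    return dict(delta0=delta0, T_prime=T_prime, zeta=zeta, r_prime=r_prime,
                eta=eta, theta=theta, E=script_E, gamma=gamma, s=s)
```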

Appendix D Proofs of the stochastic setting

D.1 Proof details of negative curvature finding using stochastic gradients

In this subsection, we demonstrate that Algorithm 3 can find a negative curvature direction efficiently. Specifically, we prove the following proposition:

Proposition 24.

Suppose the function f\colon\mathbb{R}^{n}\to\mathbb{R} is \ell-smooth and \rho-Hessian Lipschitz. For any 0<\delta<1, we specify our choice of parameters and constants as follows:

\displaystyle\mathscr{T}_{s}=\frac{8\ell}{\sqrt{\rho\epsilon}}\cdot\log\Big(\frac{\ell n}{\delta\sqrt{\rho\epsilon}}\Big),\qquad\iota=10\log\Big(\frac{n\mathscr{T}_{s}^{2}}{\delta}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)\Big), (164)
rs\displaystyle r_{s} =δ480ρn𝒯sρϵι,\displaystyle=\frac{\delta}{480\rho n\mathscr{T}_{s}}\sqrt{\frac{\rho\epsilon}{\iota}}, m\displaystyle m =160(+~)δρϵ𝒯sι,\displaystyle=\frac{160(\ell+\tilde{\ell})}{\delta\sqrt{\rho\epsilon}}\sqrt{\mathscr{T}_{s}\iota}, (165)

Then for any point 𝐱~n\tilde{\mathbf{x}}\in\mathbb{R}^{n} satisfying λmin((𝐱~))ρϵ\lambda_{\min}(\mathcal{H}(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}, with probability at least 13δ1-3\delta, Algorithm 3 outputs a unit vector 𝐞^\hat{\mathbf{e}} satisfying

𝐞^T(𝐱~)𝐞^ρϵ4,\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\frac{\sqrt{\rho\epsilon}}{4}, (166)

where \mathcal{H} stands for the Hessian matrix of the function f, using O(m\cdot\mathscr{T}_{s})=\tilde{O}\Big(\frac{\log^{2}n}{\delta\epsilon^{1/2}}\Big) iterations.

Similarly to Algorithm 1 and Algorithm 2, the renormalization step in Line 3 of Algorithm 3 only guarantees that the value \|\mathbf{y}_{t}\| does not scale exponentially during the algorithm, and it does not affect the output. We thus introduce the following Algorithm 7, the no-renormalization version of Algorithm 3, which has the same output and a simpler structure. Hence, in this subsection we analyze Algorithm 7 instead of Algorithm 3.

1 𝐳00\mathbf{z}_{0}\leftarrow 0;
2 for t=1,,𝒯st=1,...,\mathscr{T}_{s} do
3       Sample {θ(1),θ(2),,θ(m)}𝒟\left\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(m)}\right\}\sim\mathcal{D};
4       𝐠(𝐳t1)𝐳t1rs1mj=1m(𝐠(𝐱~+rs𝐳t1𝐳t1;θ(j))𝐠(𝐱~;θ(j)))\mathbf{g}(\mathbf{z}_{t-1})\leftarrow\frac{\|\mathbf{z}_{t-1}\|}{r_{s}}\cdot\frac{1}{m}\sum_{j=1}^{m}\Big{(}\mathbf{g}\Big{(}\tilde{\mathbf{x}}+\frac{r_{s}}{\|\mathbf{z}_{t-1}\|}\mathbf{z}_{t-1};\theta^{(j)}\Big{)}-\mathbf{g}(\tilde{\mathbf{x}};\theta^{(j)})\Big{)};
5       𝐳t𝐳t11(𝐠(𝐳t1)+ξt),ξt𝒩(0,rs2dI)\mathbf{z}_{t}\leftarrow\mathbf{z}_{t-1}-\frac{1}{\ell}(\mathbf{g}(\mathbf{z}_{t-1})+\xi_{t}),\qquad\xi_{t}\sim\mathcal{N}\Big{(}0,\frac{r_{s}^{2}}{d}I\Big{)};
6      
Output \mathbf{z}_{\mathscr{T}_{s}}/\|\mathbf{z}_{\mathscr{T}_{s}}\|.
Algorithm 7 Stochastic Negative Curvature Finding without Renormalization(𝐱~,rs,𝒯s,m\tilde{\mathbf{x}},r_{s},\mathscr{T}_{s},m).
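To make the update rule concrete, below is a minimal Python sketch of Algorithm 7. It is an illustrative sketch only, not the implementation used in our experiments: the stochastic gradient oracle g(x, theta) and the sampler sample_theta(rng) are hypothetical stand-ins for \mathbf{g}(\cdot;\theta) and the distribution \mathcal{D}, and the toy usage at the end replaces f by a simple quadratic with additive gradient noise.

```python
import numpy as np

def stochastic_ncf(g, sample_theta, x_tilde, r_s, T_s, m, ell, rng):
    # Sketch of Algorithm 7: a power-method-like iteration driven by averaged
    # stochastic gradient differences plus injected Gaussian noise.
    n = x_tilde.shape[0]
    z = np.zeros(n)
    for _ in range(T_s):
        if np.linalg.norm(z) > 0:
            thetas = [sample_theta(rng) for _ in range(m)]
            probe = x_tilde + (r_s / np.linalg.norm(z)) * z
            diffs = [g(probe, th) - g(x_tilde, th) for th in thetas]
            g_z = (np.linalg.norm(z) / r_s) * np.mean(diffs, axis=0)
        else:
            g_z = np.zeros(n)  # z = 0: the injected noise below starts the iteration
        xi = rng.normal(0.0, r_s / np.sqrt(n), size=n)  # xi_t ~ N(0, (r_s^2 / n) I)
        z = z - (g_z + xi) / ell
    return z / np.linalg.norm(z)

# Toy usage: f(x) = 0.5 x^T diag(-0.1, 1, ..., 1) x with additive gradient noise.
rng = np.random.default_rng(0)
n = 10
diag = np.ones(n)
diag[0] = -0.1
g = lambda x, theta: diag * x + theta
sample_theta = lambda rng: 0.1 * rng.standard_normal(n)
e_hat = stochastic_ncf(g, sample_theta, np.zeros(n), r_s=0.01, T_s=200, m=20, ell=1.0, rng=rng)
print(e_hat @ (diag * e_hat))  # close to -0.1, the most negative eigenvalue
```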

Without loss of generality, we assume \tilde{\mathbf{x}}=\mathbf{0} by shifting \mathbb{R}^{n} such that \tilde{\mathbf{x}} is mapped to \mathbf{0}. As argued in the proof of Proposition 3, \mathcal{H}(\mathbf{0}) admits the following eigen-decomposition:

(𝟎)=i=1nλi𝐮i𝐮iT,\displaystyle\mathcal{H}(\mathbf{0})=\sum_{i=1}^{n}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}, (167)

where the set {𝐮i}i=1n\{\mathbf{u}_{i}\}_{i=1}^{n} forms an orthonormal basis of n\mathbb{R}^{n}. Without loss of generality, we assume the eigenvalues λ1,λ2,,λn\lambda_{1},\lambda_{2},\ldots,\lambda_{n} corresponding to 𝐮1,𝐮2,,𝐮n\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n} satisfy

λ1λ2λn,\displaystyle\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}, (168)

where \lambda_{1}\leq-\sqrt{\rho\epsilon}. If \lambda_{n}\leq-\sqrt{\rho\epsilon}/2, Proposition 24 holds directly. Hence, we only need to prove the case where \lambda_{n}>-\sqrt{\rho\epsilon}/2, in which there exist 1\leq p\leq p^{\prime}<n with

λpρϵλp+1,λpρϵ/2<λp+1.\displaystyle\lambda_{p}\leq-\sqrt{\rho\epsilon}\leq\lambda_{p+1},\quad\lambda_{p^{\prime}}\leq-\sqrt{\rho\epsilon}/2<\lambda_{p^{\prime}+1}. (169)
Notation:

Throughout this subsection, let \tilde{\mathcal{H}}:=\mathcal{H}(\tilde{\mathbf{x}}). Use \mathfrak{S}_{\parallel}, \mathfrak{S}_{\perp} to separately denote the subspaces of \mathbb{R}^{n} spanned by \{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\} and \{\mathbf{u}_{p+1},\mathbf{u}_{p+2},\ldots,\mathbf{u}_{n}\}, and use \mathfrak{S}_{\parallel}^{\prime}, \mathfrak{S}_{\perp}^{\prime} to denote the subspaces of \mathbb{R}^{n} spanned by \{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p^{\prime}}\} and \{\mathbf{u}_{p^{\prime}+1},\mathbf{u}_{p^{\prime}+2},\ldots,\mathbf{u}_{n}\}. Furthermore, define \mathbf{z}_{t,\parallel}:=\sum_{i=1}^{p}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, \mathbf{z}_{t,\perp}:=\sum_{i=p+1}^{n}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, \mathbf{z}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, and \mathbf{z}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, respectively, to denote the components of the iterate \mathbf{z}_{t} in Algorithm 7 in the subspaces \mathfrak{S}_{\parallel}, \mathfrak{S}_{\perp}, \mathfrak{S}_{\parallel}^{\prime}, \mathfrak{S}_{\perp}^{\prime}, and let \gamma:=-\lambda_{1}.

To prove Proposition 24, we first introduce the following lemma:

Lemma 25.

Under the setting of Proposition 24, for any point \tilde{\mathbf{x}}\in\mathbb{R}^{n} satisfying \lambda_{\min}(\nabla^{2}f(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}, with probability at least 1-3\delta, Algorithm 3 outputs a unit vector \hat{\mathbf{e}} satisfying

𝐞^:=i=pn𝐮i,𝐞^𝐮iρϵ8\displaystyle\|\hat{\mathbf{e}}_{\perp^{\prime}}\|:=\Big{\|}\sum_{i=p^{\prime}}^{n}\left\langle\mathbf{u}_{i},\hat{\mathbf{e}}\right\rangle\mathbf{u}_{i}\Big{\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell} (170)

using O(m\cdot\mathscr{T}_{s})=\tilde{O}\Big(\frac{\log^{2}n}{\delta\epsilon^{1/2}}\Big) iterations.

D.1.1 Proof of Lemma 25

In the proof of Lemma 25, we consider the worst case, where \lambda_{1}=-\gamma=-\sqrt{\rho\epsilon} is the only eigenvalue less than -\sqrt{\rho\epsilon}/2, and all other eigenvalues are equal to -\sqrt{\rho\epsilon}/2+\nu for an arbitrarily small constant \nu. Under this scenario, the component \mathbf{z}_{t,\parallel^{\prime}} is as small as possible relative to \mathbf{z}_{t,\perp^{\prime}} at each time step, so this is the hardest case for the lemma.

The following lemma characterizes the dynamics of Algorithm 7:

Lemma 26.

Consider the sequence {𝐳i}\left\{\mathbf{z}_{i}\right\} and let η=1/\eta=1/\ell. Further, for any 0t𝒯s0\leq t\leq\mathscr{T}_{s} we define

\displaystyle\zeta_{t}:=\mathbf{g}(\mathbf{z}_{t})-\frac{\|\mathbf{z}_{t}\|}{r_{s}}\Big(\nabla f\Big(\tilde{\mathbf{x}}+\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big)-\nabla f(\tilde{\mathbf{x}})\Big), (171)

to be the errors caused by the stochastic gradients. Then 𝐳t=𝐪h(t)𝐪sg(t)𝐪p(t)\mathbf{z}_{t}=-\mathbf{q}_{h}(t)-\mathbf{q}_{sg}(t)-\mathbf{q}_{p}(t), where:

\displaystyle\mathbf{q}_{h}(t):=\eta\sum_{\tau=0}^{t-1}(I-\eta\tilde{\mathcal{H}})^{t-1-\tau}\Delta_{\tau}\mathbf{z}_{\tau}, (172)

for Δτ=01f(ψrs𝐳τ𝐳τ)dψ~\Delta_{\tau}=\int_{0}^{1}\mathcal{H}_{f}\big{(}\psi\frac{r_{s}}{\|\mathbf{z}_{\tau}\|}\mathbf{z}_{\tau}\big{)}\mathrm{d}\psi-\tilde{\mathcal{H}}, and

𝐪sg(t):=ητ=0t1(Iη~)t1τζτ,𝐪p(t):=ητ=0t1(Iη~)t1τξτ.\displaystyle\mathbf{q}_{sg}(t):=\eta\sum_{\tau=0}^{t-1}(I-\eta\tilde{\mathcal{H}})^{t-1-\tau}\zeta_{\tau},\quad\mathbf{q}_{p}(t):=\eta\sum_{\tau=0}^{t-1}(I-\eta\tilde{\mathcal{H}})^{t-1-\tau}\xi_{\tau}. (173)
Proof.

Without loss of generality we assume 𝐱~=𝟎\tilde{\mathbf{x}}=\mathbf{0}. The update formula for 𝐳t\mathbf{z}_{t} can be written as

𝐳t+1=𝐳tη(𝐳trs(f(rs𝐳t𝐳t)f(𝟎))+ζt+ξt),\displaystyle\mathbf{z}_{t+1}=\mathbf{z}_{t}-\eta\Big{(}\frac{\|\mathbf{z}_{t}\|}{r_{s}}\Big{(}\nabla f\Big{(}\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big{)}-\nabla f(\mathbf{0})\Big{)}+\zeta_{t}+\xi_{t}\Big{)}, (174)

where

𝐳trs(f(rs𝐳t𝐳t)f(𝟎))=𝐳trs01f(ψrs𝐳t𝐳t)rs𝐳t𝐳tdψ=(~+Δt)𝐳t,\displaystyle\frac{\|\mathbf{z}_{t}\|}{r_{s}}\Big{(}\nabla f\Big{(}\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big{)}-\nabla f(\mathbf{0})\Big{)}=\frac{\|\mathbf{z}_{t}\|}{r_{s}}\int_{0}^{1}\mathcal{H}_{f}\Big{(}\psi\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big{)}\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\mathrm{d}\psi=(\tilde{\mathcal{H}}+\Delta_{t})\mathbf{z}_{t}, (175)

indicating

\displaystyle\mathbf{z}_{t+1} =(I-\eta\tilde{\mathcal{H}})\mathbf{z}_{t}-\eta(\Delta_{t}\mathbf{z}_{t}+\zeta_{t}+\xi_{t}) (176)
\displaystyle=-\eta\sum_{\tau=0}^{t}(I-\eta\tilde{\mathcal{H}})^{t-\tau}(\Delta_{\tau}\mathbf{z}_{\tau}+\zeta_{\tau}+\xi_{\tau}), (177)

which finishes the proof. ∎
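As a quick numerical sanity check of this unrolling (not part of the analysis itself), the following Python sketch iterates the recursion \mathbf{z}_{t+1}=(I-\eta\tilde{\mathcal{H}})\mathbf{z}_{t}-\eta(\Delta_{t}\mathbf{z}_{t}+\zeta_{t}+\xi_{t}) with arbitrary stand-in sequences for \Delta_{t}, \zeta_{t}, \xi_{t} and compares it against the closed form of Lemma 26; the two agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 5, 30, 0.1
H = np.diag(rng.uniform(-1.0, 1.0, size=n))  # stands in for the Hessian \tilde{H}
A = np.eye(n) - eta * H

# Arbitrary stand-in sequences for Delta_t, zeta_t, xi_t.
Deltas = [0.01 * rng.standard_normal((n, n)) for _ in range(T)]
zetas = [0.01 * rng.standard_normal(n) for _ in range(T)]
xis = [0.01 * rng.standard_normal(n) for _ in range(T)]

# Direct recursion with z_0 = 0.
z = np.zeros(n)
zs = [z.copy()]
for t in range(T):
    z = A @ z - eta * (Deltas[t] @ z + zetas[t] + xis[t])
    zs.append(z.copy())

# Unrolled form of Lemma 26: z_t = -(q_h(t) + q_sg(t) + q_p(t)).
def unrolled(t):
    total = np.zeros(n)
    for tau in range(t):
        total += np.linalg.matrix_power(A, t - 1 - tau) @ (Deltas[tau] @ zs[tau] + zetas[tau] + xis[tau])
    return -eta * total

print(max(np.linalg.norm(zs[t] - unrolled(t)) for t in range(T + 1)))  # ~1e-16
```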

At a high level, under our parameter choice in Proposition 24, 𝐪p(t)\mathbf{q}_{p}(t) is the dominating term controlling the dynamics, and 𝐪h(t)+𝐪sg(t)\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t) will be small compared to 𝐪p(t)\mathbf{q}_{p}(t). Quantitatively, this is shown in the following lemma:

Lemma 27.

Under the setting of Proposition 24 while using the notation in Lemma 12 and Lemma 26, we have

\displaystyle\Pr\Big(\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\|\leq\frac{\beta(t)\eta r_{s}\delta}{20\sqrt{n}}\cdot\frac{\sqrt{\rho\epsilon}}{16\ell},\ \forall t\leq\mathscr{T}_{s}\Big)\geq 1-\delta, (178)

where γ:=λmin(~)=ρϵ-\gamma:=\lambda_{\min}(\tilde{\mathcal{H}})=-\sqrt{\rho\epsilon}.

Proof.

Divide 𝐪p(t)\mathbf{q}_{p}(t) into two parts:

𝐪p,1(t):=𝐪p(t),𝐮1𝐮1,\displaystyle\mathbf{q}_{p,1}(t):=\left\langle\mathbf{q}_{p}(t),\mathbf{u}_{1}\right\rangle\mathbf{u}_{1}, (179)

and

𝐪p,(t):=𝐪p(t)𝐪p,1(t).\displaystyle\mathbf{q}_{p,\perp^{\prime}}(t):=\mathbf{q}_{p}(t)-\mathbf{q}_{p,1}(t). (180)

Then by Lemma 13, we have

\displaystyle\Pr\Big(\|\mathbf{q}_{p,1}(t)\|\leq\frac{\beta(t)\eta r_{s}}{\sqrt{n}}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}, (181)

and

\displaystyle\Pr\Big(\|\mathbf{q}_{p,1}(t)\|\geq\frac{\beta(t)\eta r_{s}}{20\sqrt{n}}\cdot\delta\Big)\geq 1-\delta/4. (182)

Similarly,

\displaystyle\Pr\Big(\|\mathbf{q}_{p,\perp^{\prime}}(t)\|\leq\beta_{\perp^{\prime}}(t)\eta r_{s}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}, (183)

and

\displaystyle\Pr\Big(\|\mathbf{q}_{p,\perp^{\prime}}(t)\|\geq\frac{\beta_{\perp^{\prime}}(t)\eta r_{s}}{20}\cdot\delta\Big)\geq 1-\delta/4, (184)

where β(t):=(1+ηγ/2)tηγ\beta_{\perp^{\prime}}(t):=\frac{(1+\eta\gamma/2)^{t}}{\sqrt{\eta\gamma}}. Set t:=lognηγt_{\perp^{\prime}}:=\frac{\log n}{\eta\gamma}. Then for all τt\tau\leq t_{\perp^{\prime}}, we have

β(τ)β(τ)n,\displaystyle\frac{\beta(\tau)}{\beta_{\perp^{\prime}}(\tau)}\leq\sqrt{n}, (185)

which further leads to

\displaystyle\Pr\Big(\|\mathbf{q}_{p,\perp^{\prime}}(\tau)\|\leq 2\beta_{\perp^{\prime}}(t)\eta r_{s}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}. (186)

Next, we use induction to prove that the following inequality holds for all ttt\leq t_{\perp^{\prime}}:

\displaystyle\Pr\Big(\|\mathbf{q}_{h}(\tau)+\mathbf{q}_{sg}(\tau)\|\leq\beta_{\perp^{\prime}}(\tau)\eta r_{s}\cdot\frac{\delta}{20},\ \forall\tau\leq t\Big)\geq 1-10nt^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}. (187)

For the base case t=0t=0, the claim holds trivially. Suppose it holds for all τt\tau\leq t for some tt. Then due to Lemma 13, with probability at least 12teι1-2t_{\perp^{\prime}}e^{-\iota}, we have

\displaystyle\|\mathbf{z}_{t}\|\leq\|\mathbf{q}_{p}(t)\|+\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\|\leq 3\beta_{\perp^{\prime}}(t)\eta r_{s}\cdot\sqrt{\iota}. (188)

By the Hessian Lipschitz property, Δτ\Delta_{\tau} satisfies:

Δτρrs.\displaystyle\|\Delta_{\tau}\|\leq\rho r_{s}. (189)
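Indeed, every point of the form \psi\frac{r_{s}}{\|\mathbf{z}_{\tau}\|}\mathbf{z}_{\tau} with \psi\in[0,1] lies within distance r_{s} of \tilde{\mathbf{x}}=\mathbf{0}, so the \rho-Hessian Lipschitz property gives

\displaystyle\|\Delta_{\tau}\|\leq\int_{0}^{1}\Big\|\mathcal{H}_{f}\Big(\psi\frac{r_{s}}{\|\mathbf{z}_{\tau}\|}\mathbf{z}_{\tau}\Big)-\tilde{\mathcal{H}}\Big\|\mathrm{d}\psi\leq\int_{0}^{1}\rho\psi r_{s}\,\mathrm{d}\psi=\frac{\rho r_{s}}{2}\leq\rho r_{s}.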

Hence,

𝐪h(t+1)\displaystyle\|\mathbf{q}_{h}(t+1)\| ητ=0t(Iη~)tτΔτ𝐳τ\displaystyle\leq\big{\|}\eta\sum_{\tau=0}^{t}(I-\eta\tilde{\mathcal{H}})^{t-\tau}\Delta_{\tau}\mathbf{z}_{\tau}\big{\|} (190)
\displaystyle\leq\eta\rho r_{s}\sum_{\tau=0}^{t}\big\|(I-\eta\tilde{\mathcal{H}})^{t-\tau}\big\|\,\|\mathbf{z}_{\tau}\| (191)
(ηρrsn𝒯s)(3β(t)ηrs)ι\displaystyle\leq(\eta\rho r_{s}n\mathscr{T}_{s})\cdot(3\beta_{\perp^{\prime}}(t)\eta r_{s})\cdot\sqrt{\iota} (192)
β(t+1)ηrs10nδρϵ16.\displaystyle\leq\frac{\beta_{\perp^{\prime}}(t+1)\eta r_{s}}{10\sqrt{n}}\cdot\frac{\delta\sqrt{\rho\epsilon}}{16\ell}. (193)

As for 𝐪sg(t)\mathbf{q}_{sg}(t), note that ζ^τ|τ1\hat{\zeta}_{\tau}|\mathcal{F}_{\tau-1} satisfies the norm-subGaussian property defined in Definition 14. Specifically, ζ^τ|τ1nSG((+~)𝐳^τ/m)\hat{\zeta}_{\tau}|\mathcal{F}_{\tau-1}\sim\text{nSG}((\ell+\tilde{\ell})\|\hat{\mathbf{z}}_{\tau}\|/\sqrt{m}). By applying Lemma 15 with b=α2(t)η2(+~)2/mb=\alpha^{2}(t)\cdot\eta^{2}(\ell+\tilde{\ell})^{2}/m and b=α2(t)η2(+~)2η2rs2/(mn)b=\alpha^{2}(t)\eta^{2}(\ell+\tilde{\ell})^{2}\eta^{2}r_{s}^{2}/(mn), with probability at least

\displaystyle 1-4n\cdot\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)\cdot e^{-\iota}, (194)

we have

\displaystyle\|\mathbf{q}_{sg}(t+1)\|\leq\frac{\eta(\ell+\tilde{\ell})\sqrt{t}}{m}\cdot(\beta_{\perp^{\prime}}(t)\eta r_{s})\cdot\sqrt{\iota}\leq\frac{\beta_{\perp^{\prime}}(t+1)\eta r_{s}}{20}\cdot\frac{\delta\sqrt{\rho\epsilon}}{8\ell}. (195)

Then by union bound, with probability at least

\displaystyle 1-10n(t+1)^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}, (196)

we have

𝐪h(t+1)+𝐪sg(t+1)β(t+1)ηrsδ20ρϵ8,\displaystyle\|\mathbf{q}_{h}(t+1)+\mathbf{q}_{sg}(t+1)\|\leq\beta_{\perp^{\prime}}(t+1)\eta r_{s}\cdot\frac{\delta}{20}\cdot\frac{\sqrt{\rho\epsilon}}{8\ell}, (197)

indicating that (187) holds. Then with probability at least

\displaystyle 1-10nt_{\perp^{\prime}}^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}-\delta/4, (198)

we have

𝐪h(t)+𝐪sg(t)𝐪p,1(t)ρϵ16.\displaystyle\|\mathbf{q}_{h}(t_{\perp^{\prime}})+\mathbf{q}_{sg}(t_{\perp^{\prime}})\|\leq\|\mathbf{q}_{p,1}(t_{\perp^{\prime}})\|\cdot\frac{\sqrt{\rho\epsilon}}{16\ell}. (199)

Based on this, we prove that the following inequality holds for any tt𝒯st_{\perp^{\prime}}\leq t\leq\mathscr{T}_{s}:

\displaystyle\Pr\Big(\|\mathbf{q}_{h}(\tau)+\mathbf{q}_{sg}(\tau)\|\leq\frac{\beta(\tau)\eta r_{s}}{20\sqrt{n}}\cdot\frac{\delta\sqrt{\rho\epsilon}}{16\ell},\ \forall t_{\perp^{\prime}}\leq\tau\leq t\Big)\geq 1-10nt^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}. (200)

We again use induction to prove this. The base case \tau=t_{\perp^{\prime}} is guaranteed by (187). Suppose it holds for all \tau\leq t for some t. Then with probability at least 1-2te^{-\iota}, we have

\displaystyle\|\mathbf{z}_{t}\| \leq\|\mathbf{q}_{p}(t)\|+\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\| (201)
\displaystyle\leq 2\|\mathbf{q}_{p,1}(t)\|+\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\| (202)
\displaystyle\leq\frac{3\beta(t)\eta r_{s}}{\sqrt{n}}\cdot\sqrt{\iota}. (203)

Then following a similar procedure as before, we can claim that

𝐪h(t+1)+𝐪sg(t+1)β(t+1)ηrsnδ20ρϵ8,\displaystyle\|\mathbf{q}_{h}(t+1)+\mathbf{q}_{sg}(t+1)\|\leq\frac{\beta(t+1)\eta r_{s}}{\sqrt{n}}\cdot\frac{\delta}{20}\cdot\frac{\sqrt{\rho\epsilon}}{8\ell}, (204)

holds with probability

\displaystyle 1-10n(t+1)^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}-\frac{\delta}{4}, (205)

indicating that (200) holds. Then under our choice of parameters, the desired inequality

𝐪h(t)+𝐪sg(t)β(t)ηrsδ20nρϵ16\displaystyle\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\|\leq\frac{\beta(t)\eta r_{s}\delta}{20\sqrt{n}}\cdot\frac{\sqrt{\rho\epsilon}}{16\ell} (206)

holds with probability at least 1δ1-\delta. ∎

Equipped with Lemma 27, we are now ready to prove Lemma 25.

Proof.

First note that under our choice of 𝒯s\mathscr{T}_{s}, we have

\displaystyle\Pr\Big(\frac{\|\mathbf{q}_{p,\perp^{\prime}}(\mathscr{T}_{s})\|}{\|\mathbf{q}_{p,1}(\mathscr{T}_{s})\|}\leq\frac{\sqrt{\rho\epsilon}}{16\ell}\Big)\geq 1-\delta. (207)

Further by Lemma 27 and union bound, with probability at least 12δ1-2\delta,

𝐪h(𝒯s)+𝐪sg(𝒯s)𝐪p(𝒯s)𝐪h(𝒯s)+𝐪sg(𝒯s)20nδβ(t)ηrsρϵ16.\displaystyle\frac{\|\mathbf{q}_{h}(\mathscr{T}_{s})+\mathbf{q}_{sg}(\mathscr{T}_{s})\|}{\|\mathbf{q}_{p}(\mathscr{T}_{s})\|}\leq\|\mathbf{q}_{h}(\mathscr{T}_{s})+\mathbf{q}_{sg}(\mathscr{T}_{s})\|\cdot\frac{20\sqrt{n}}{\delta\beta(t)\eta r_{s}}\leq\frac{\sqrt{\rho\epsilon}}{16\ell}. (208)

For the output \hat{\mathbf{e}}, observe that its component \hat{\mathbf{e}}_{\perp^{\prime}} satisfies \hat{\mathbf{e}}_{\perp^{\prime}}=\hat{\mathbf{e}}-\langle\mathbf{u}_{1},\hat{\mathbf{e}}\rangle\mathbf{u}_{1}, since in the worst case under consideration \mathbf{u}_{1} spans the subspace \mathfrak{S}_{\parallel^{\prime}}. Then with probability at least 1-3\delta,

𝐞^ρϵ/(8).\displaystyle\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\sqrt{\rho\epsilon}/(8\ell). (209)

D.1.2 Proof of Proposition 24

Based on Lemma 25, we present the proof of Proposition 24 as follows:

Proof.

By Lemma 25, the component \hat{\mathbf{e}}_{\perp^{\prime}} of the output \hat{\mathbf{e}} satisfies

𝐞^ρϵ8.\displaystyle\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\frac{\sqrt{\rho\epsilon}}{8\ell}. (210)

Since 𝐞^=𝐞^+𝐞^\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\parallel^{\prime}}+\hat{\mathbf{e}}_{\perp^{\prime}}, we can derive that

𝐞^1ρϵ(8)21ρϵ(8)2.\displaystyle\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|\geq\sqrt{1-\frac{\rho\epsilon}{(8\ell)^{2}}}\geq 1-\frac{\rho\epsilon}{(8\ell)^{2}}. (211)

Note that

𝐞^T~𝐞^=(𝐞^+𝐞^)T~(𝐞^+𝐞^),\displaystyle\hat{\mathbf{e}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}=(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})^{T}\tilde{\mathcal{H}}(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}), (212)

which, since \mathfrak{S}_{\parallel^{\prime}} and \mathfrak{S}_{\perp^{\prime}} are invariant subspaces of \tilde{\mathcal{H}} so that the cross terms vanish, simplifies to

𝐞^T~𝐞^=𝐞^T~𝐞^+𝐞^T~𝐞^.\displaystyle\hat{\mathbf{e}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\parallel^{\prime}}. (213)

Due to the \ell-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value at most \ell. Hence,

\displaystyle\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\perp^{\prime}}\leq\ell\|\hat{\mathbf{e}}_{\perp^{\prime}}\|_{2}^{2}\leq\frac{\rho\epsilon}{64\ell}, (214)

whereas

𝐞^T~𝐞^ρϵ2𝐞^2.\displaystyle\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}. (215)

Combining these two inequalities together, we can obtain

\displaystyle\hat{\mathbf{e}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}} =\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}+\frac{\rho\epsilon}{64\ell}\leq-\frac{\sqrt{\rho\epsilon}}{4}. (216)
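The last inequality can be checked directly. Since \|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}=1-\|\hat{\mathbf{e}}_{\perp^{\prime}}\|^{2}\geq 1-\frac{\rho\epsilon}{64\ell^{2}}, and since we may assume \sqrt{\rho\epsilon}\leq\ell (otherwise \ell-smoothness gives \lambda_{\min}(\mathcal{H})\geq-\ell>-\sqrt{\rho\epsilon} everywhere and the second-order condition holds trivially), we have

\displaystyle-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}+\frac{\rho\epsilon}{64\ell}\leq-\sqrt{\rho\epsilon}\Big(\frac{1}{2}-\frac{\rho\epsilon}{128\ell^{2}}-\frac{\sqrt{\rho\epsilon}}{64\ell}\Big)\leq-\sqrt{\rho\epsilon}\Big(\frac{1}{2}-\frac{1}{128}-\frac{1}{64}\Big)\leq-\frac{\sqrt{\rho\epsilon}}{4}.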

D.2 Proof details of escaping saddle points using Algorithm 3

In this subsection, we demonstrate that Algorithm 3 can be used to escape from saddle points in the stochastic setting. We first present the explicit Algorithm 8, and then introduce the full version of Theorem 9 with proof.

1 Input: 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n};
2 for t=0,1,,Tt=0,1,...,T do
3       Sample {θ(1),θ(2),,θ(M)}𝒟\left\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(M)}\right\}\sim\mathcal{D};
4       𝐠(𝐱t)=1Mj=1M𝐠(𝐱t;θ(j))\mathbf{g}(\mathbf{x}_{t})=\frac{1}{M}\sum_{j=1}^{M}\mathbf{g}(\mathbf{x}_{t};\theta^{(j)});
5       if 𝐠(𝐱t)3ϵ/4\|\mathbf{g}(\mathbf{x}_{t})\|\leq 3\epsilon/4 then
6             𝐞^\hat{\mathbf{e}}\leftarrowStochasticNegativeCurvatureFinding(𝐱t,rs,𝒯s,m\mathbf{x}_{t},r_{s},\mathscr{T}_{s},m);
7             𝐱t𝐱tf𝐞^(𝐱0)4|f𝐞^(𝐱0)|ϵρ𝐞^\mathbf{x}_{t}\leftarrow\mathbf{x}_{t}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}};
8             Sample {θ(1),θ(2),,θ(M)}𝒟\left\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(M)}\right\}\sim\mathcal{D};
9             𝐠(𝐱t)=1Mj=1M𝐠(𝐱t;θ(j))\mathbf{g}(\mathbf{x}_{t})=\frac{1}{M}\sum_{j=1}^{M}\mathbf{g}(\mathbf{x}_{t};\theta^{(j)});
10            
11      \mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t}-\frac{1}{\ell}\mathbf{g}(\mathbf{x}_{t});
12      
Algorithm 8 Stochastic Gradient Descent with Negative Curvature Finding.
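For illustration only, here is a minimal Python sketch of the outer loop of Algorithm 8, reusing the stochastic_ncf sketch from Appendix D.1; g(x, theta) and sample_theta(rng) are the same hypothetical oracle stand-ins, and the projection of the sampled gradient onto \hat{\mathbf{e}} is used as a stand-in for the directional derivative f^{\prime}_{\hat{\mathbf{e}}}.

```python
import numpy as np

def sgd_with_ncf(g, sample_theta, x0, ell, rho, eps, M, r_s, T_s, m, T, rng):
    # Sketch of Algorithm 8: minibatch SGD steps of size 1/ell, plus a negative
    # curvature descent step of length sqrt(eps/rho)/4 whenever the sampled
    # gradient is small.
    x = x0.copy()
    for _ in range(T):
        thetas = [sample_theta(rng) for _ in range(M)]
        grad = np.mean([g(x, th) for th in thetas], axis=0)
        if np.linalg.norm(grad) <= 0.75 * eps:
            e_hat = stochastic_ncf(g, sample_theta, x, r_s, T_s, m, ell, rng)
            # Step against the sign of the (sampled) directional derivative.
            direction = 1.0 if float(grad @ e_hat) >= 0 else -1.0
            x = x - direction * 0.25 * np.sqrt(eps / rho) * e_hat
            thetas = [sample_theta(rng) for _ in range(M)]
            grad = np.mean([g(x, th) for th in thetas], axis=0)
        x = x - grad / ell
    return x
```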
Theorem 28 (Full version of Theorem 9).

Suppose that the function ff is \ell-smooth and ρ\rho-Hessian Lipschitz. For any ϵ>0\epsilon>0 and a constant 0<δs10<\delta_{s}\leq 1, we choose the parameters appearing in Algorithm 8 as

\displaystyle\delta=\frac{\delta_{s}}{2304\Delta_{f}}\sqrt{\frac{\epsilon^{3}}{\rho}},\qquad\mathscr{T}_{s}=\frac{8\ell}{\sqrt{\rho\epsilon}}\cdot\log\Big(\frac{\ell n}{\delta\sqrt{\rho\epsilon}}\Big),\qquad\iota=10\log\Big(\frac{n\mathscr{T}_{s}^{2}}{\delta}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)\Big), (217)
rs\displaystyle r_{s} =δ480ρn𝒯sρϵι,\displaystyle=\frac{\delta}{480\rho n\mathscr{T}_{s}}\sqrt{\frac{\rho\epsilon}{\iota}}, m\displaystyle m =160(+~)δρϵ𝒯sι,\displaystyle=\frac{160(\ell+\tilde{\ell})}{\delta\sqrt{\rho\epsilon}}\sqrt{\mathscr{T}_{s}\iota}, M\displaystyle M =16Δfϵ2\displaystyle=\frac{16\ell\Delta_{f}}{\epsilon^{2}} (218)

where \Delta_{f}:=f(\mathbf{x}_{0})-f^{*} and f^{*} is the global minimum of f. Then, at least 1/4 of the iterates \mathbf{x}_{t} of Algorithm 8 will be \epsilon-approximate second-order stationary points, using

O~((f(𝐱0)f)ϵ4log2n)\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big{)} (219)

iterations, with probability at least 1δs1-\delta_{s}.

Proof.

Let the parameters be chosen according to (217) and (218), and set the total step number T to be:

T=max{8(f(𝐱𝟎)f)ϵ2,768(f(𝐱𝟎)f)ρϵ3}.\displaystyle T=\max\left\{\frac{8\ell(f(\mathbf{x_{0}})-f^{*})}{\epsilon^{2}},768(f(\mathbf{x_{0}})-f^{*})\cdot\sqrt{\frac{\rho}{\epsilon^{3}}}\right\}. (220)

We will show that the following two claims hold simultaneously with probability 1δs1-\delta_{s}:

  1. At most T/4 steps have gradients larger than \epsilon;

  2. Algorithm 3 can be called at most 384\Delta_{f}\sqrt{\frac{\rho}{\epsilon^{3}}} times.

Therefore, at least T/4 steps are \epsilon-approximate second-order stationary points. We prove the two claims separately.

Claim 1.

Suppose that within TT steps, we have more than T/4T/4 steps with gradients larger than ϵ\epsilon. Then with probability 1δs/21-\delta_{s}/2,

f(𝐱T)f(𝐱0)η8i=0T1f(𝐱i)2+cσ2M(T+log(1/δs))ff(𝐱0),\displaystyle f(\mathbf{x}_{T})-f(\mathbf{x}_{0})\leq-\frac{\eta}{8}\sum_{i=0}^{T-1}\|\nabla f(\mathbf{x}_{i})\|^{2}+c\cdot\frac{\sigma^{2}}{M\ell}(T+\log(1/\delta_{s}))\leq f^{*}-f(\mathbf{x}_{0}), (221)

which is a contradiction.

Claim 2.

We first assume that each time we apply negative curvature finding (Algorithm 3) at some \mathbf{x}_{t} with \lambda_{\min}(\mathcal{H}(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}, we successfully obtain a unit vector \hat{\mathbf{e}} with \hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{x}_{t})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4. The error probability of this assumption is bounded below.

Under this assumption, Algorithm 3 can be called at most 384(f(\mathbf{x}_{0})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\leq\frac{T}{2} times; otherwise the function value would decrease by more than f(\mathbf{x}_{0})-f^{*}, which is impossible. Then, the probability that some call to Algorithm 3 fails is upper bounded by

384(f(𝐱𝟎)f)ρϵ3(3δ)=δs/2.\displaystyle 384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot(3\delta)=\delta_{s}/2. (222)

The number of iterations can be viewed as the sum of two parts: the number of iterations needed in the large-gradient scenario, denoted by T_{1}, and the number of iterations needed for negative curvature finding, denoted by T_{2}. With probability at least 1-\delta_{s},

T1=O(MT)=O~((f(𝐱0)f)ϵ4).\displaystyle T_{1}=O(M\cdot T)=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\Big{)}. (223)

As for T2T_{2}, with probability at least 1δs1-\delta_{s}, Algorithm 3 is called for at most 384(f(𝐱𝟎)f)ρϵ3384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}} times, and by Proposition 24 it takes O~(log2nδρϵ)\tilde{O}\Big{(}\frac{\log^{2}n}{\delta\sqrt{\rho\epsilon}}\Big{)} iterations each time. Hence,

T2=384(f(𝐱𝟎)f)ρϵ3O~(log2nδρϵ)=O~((f(𝐱0)f)ϵ4log2n).\displaystyle T_{2}=384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\tilde{O}\Big{(}\frac{\log^{2}n}{\delta\sqrt{\rho\epsilon}}\Big{)}=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big{)}. (224)

As a result, the total iteration number T1+T2T_{1}+T_{2} is

O~((f(𝐱0)f)ϵ4log2n).\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big{)}. (225)

Appendix E More numerical experiments

In this section, we present more numerical experiment results that support our theoretical claims from a few perspectives different from those in Section 4. Specifically, considering that the previous experiments all lie in a two-dimensional space, while theoretically our algorithms have a better dependence on the problem dimension n, it is reasonable to check the actual performance of our algorithm on high-dimensional test functions; this is presented in Appendix E.1. Then in Appendix E.2, we introduce experiments on various landscapes that demonstrate the advantage of Algorithm 2 over PAGD [21]. Moreover, we compare the performance of our Algorithm 2 with the NEON+ algorithm [29] on a few test functions in Appendix E.3. To be more precise, we compare the negative curvature extracting part of NEON+ with Algorithm 2 at saddle points in different types of nonconvex landscapes.

E.1 Dimension dependence

Recall that n is the dimension of the problem. We choose the test function h(x)=\frac{1}{2}x^{T}\mathcal{H}x+\frac{1}{16}x_{1}^{4}, where \mathcal{H} is an n-by-n diagonal matrix \mathcal{H}=\text{diag}(-\epsilon,1,1,...,1). The function h(x) has a saddle point at the origin with only one negative curvature direction. Throughout the experiment, we set \epsilon=1. For the sake of comparison, the iteration numbers are chosen so that the statistics of Algorithm 1 and PGD in each category of the histogram are of similar magnitude.
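For reference, a short hypothetical sketch of this test function and its gradient is given below; it is illustrative only and not the experiment script.

```python
import numpy as np

def h(x, eps=1.0):
    # h(x) = 0.5 * x^T H x + x_1^4 / 16, with H = diag(-eps, 1, ..., 1).
    d = np.ones_like(x)
    d[0] = -eps
    return 0.5 * np.dot(d * x, x) + x[0] ** 4 / 16.0

def grad_h(x, eps=1.0):
    d = np.ones_like(x)
    d[0] = -eps
    g = d * x
    g[0] += x[0] ** 3 / 4.0  # derivative of x_1^4 / 16
    return g

# The origin is a saddle point: grad_h(0) = 0, and the Hessian there has a
# single negative eigenvalue -eps along the first coordinate.
```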

Figure 3: Dimension dependence of Algorithm 1 and PGD. We set \epsilon=0.01, r=0.1, and n=10^{p} for p=1,2,3. The iteration numbers of Algorithm 1 and PGD are set to 30p and 20p^{2}+10, respectively, and the sample size is M=100. As we can see, to maintain the same performance, the number of iterations in PGD grows faster than the number of iterations in Algorithm 1.

E.2 Comparison between Algorithm 2 and PAGD on various nonconvex landscapes

Quartic-type test function.

Consider the test function f(x1,x2)=116x1412x12+98x22f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2} with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 4.
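As with the test function in Appendix E.1, a short hypothetical sketch of this landscape and its gradient is:

```python
import numpy as np

def f_quartic(x1, x2):
    return x1 ** 4 / 16.0 - x1 ** 2 / 2.0 + 9.0 * x2 ** 2 / 8.0

def grad_f_quartic(x1, x2):
    return np.array([x1 ** 3 / 4.0 - x1, 9.0 * x2 / 4.0])

# At the saddle point (0, 0) the Hessian is diag(-1, 9/4), so there is a single
# negative curvature direction along x_1 for Algorithm 2 and PAGD to find.
```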

Figure 4: Run Algorithm 2 and PAGD on landscape f(x1,x2)=116x1412x12+98x22f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2}. Parameters: η=0.05\eta=0.05 (step length), r=0.08r=0.08 (ball radius in PAGD and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time step tANCGD=10t_{\text{ANCGD}}=10 and tANCGD=20t_{\text{ANCGD}}=20, and red points represent samplings of PAGD at time step tPAGD=20t_{\text{PAGD}}=20 and tPAGD=40t_{\text{PAGD}}=40. Similarly to Algorithm 1, Algorithm 2 transforms an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PAGD even when tANCGDtPAGDt_{\text{ANCGD}}\ll t_{\text{PAGD}}.
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{PAGD}}=40. Although we run twice as many iterations in PAGD, there are still over 20\% of PAGD paths with function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 5\%.
Triangle-type test function.

Consider the test function f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\Big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\Big{)}^{2}-\frac{1}{2} with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 5.

Figure 5: Run Algorithm 2 and PAGD on landscape f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\Big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\Big{)}^{2}-\frac{1}{2}. Parameters: η=0.01\eta=0.01 (step length), r=0.1r=0.1 (ball radius in PAGD and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time step tANCGD=10t_{\text{ANCGD}}=10 and tANCGD=20t_{\text{ANCGD}}=20, and red points represent samplings of PAGD at time step tPAGD=40t_{\text{PAGD}}=40 and tPAGD=80t_{\text{PAGD}}=80. Algorithm 2 converges faster than PAGD even when tANCGDtPAGDt_{\text{ANCGD}}\ll t_{\text{PAGD}}.
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{PAGD}}=80. Although we run four times as many iterations in PAGD, there are still over 20\% of PAGD paths with function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 5\%.
Exponential-type test function.

Consider the test function f(x1,x2)=11+ex12+12(x2x12ex12)21f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1 with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 6.

Figure 6: Run Algorithm 2 and PAGD on landscape f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1. Parameters: \eta=0.03 (step length), r=0.1 (ball radius in PAGD and parameter r in Algorithm 2), M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time steps t_{\text{ANCGD}}=10 and t_{\text{ANCGD}}=20, and red points represent samplings of PAGD at time steps t_{\text{PAGD}}=30 and t_{\text{PAGD}}=60. Algorithm 2 converges faster than PAGD even when t_{\text{ANCGD}}\ll t_{\text{PAGD}}.
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{PAGD}}=60. Although we run three times as many iterations in PAGD, its performance is still dominated by our Algorithm 2.

Compared to the previous experiment on Algorithm 1 and PGD shown in Figure 1 in Section 4, these experiments also demonstrate the faster convergence rates enjoyed by the general family of "momentum methods": using fewer iterations, Algorithm 2 and PAGD achieve larger function value decreases than Algorithm 1 and PGD, respectively.

E.3 Comparison between Algorithm 2 and NEON+ on various nonconvex landscapes

Triangle-type test function.

Consider the test function f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\Big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\Big{)}^{2}-\frac{1}{2} with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 7.

Figure 7: Run Algorithm 2 and NEON+ on landscape f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\big{)}^{2}-\frac{1}{2}. Parameters: η=0.04\eta=0.04 (step length), r=0.1r=0.1 (ball radius in NEON+ and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Red points represent samplings of NEON+ at time step tNEON=20t_{\text{NEON}}=20, and blue points represent samplings of Algorithm 2 at time step tANCGD=10t_{\text{ANCGD}}=10. Algorithm 2 and the negative curvature extracting part of NEON+ both transform an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature. Note that Algorithm 2 converges faster than NEON+ even when tANCGDtNEONt_{\text{ANCGD}}\ll t_{\text{NEON}}.
Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set t_{\text{ANCGD}}=10 and t_{\text{NEON}}=20. Although we run twice as many iterations in NEON+, none of the NEON+ paths has a function value decrease greater than 0.95, while more than 90\% of the Algorithm 2 paths do.
Exponential-type test function.

Consider the test function f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1 with a saddle point at (0,0). The advantage of Algorithm 2 is illustrated in Figure 8.

Figure 8: Run Algorithm 2 and NEON+ on landscape f(x1,x2)=11+ex12+12(x2x12ex12)21f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1. Parameters: η=0.03\eta=0.03 (step length), r=0.1r=0.1 (ball radius in NEON+ and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Red points represent samplings of NEON+ at time step tNEON=40t_{\text{NEON}}=40, and blue points represent samplings of Algorithm 2 at time step tANCGD=20t_{\text{ANCGD}}=20. Algorithm 2 converges faster than NEON+ even when tANCGDtNEONt_{\text{ANCGD}}\ll t_{\text{NEON}}.
Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{NEON}}=40. Although we run twice as many iterations in NEON+, there are still over 20\% of NEON+ paths with function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 10\%.

Compared to the previous experiments on Algorithm 2 and PAGD in Appendix E.2, these two experiments also reveal the faster convergence rate of both NEON+ and Algorithm 2 compared to PAGD [21] in small-gradient regions.