
Escape saddle points by a simple gradient-descent based algorithm

Chenyi Zhang1  Tongyang Li2,3,4
1 Institute for Interdisciplinary Information Sciences, Tsinghua University, China
2 Center on Frontiers of Computing Studies, Peking University, China
3 School of Computer Science, Peking University, China
4 Center for Theoretical Physics, Massachusetts Institute of Technology, USA
Corresponding author. Email: [email protected]
Abstract

Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with $\tilde{O}(\log^{4}n/\epsilon^{2})$ or $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ iterations, our algorithm is polynomially better in terms of $\log n$ and matches their complexities in terms of $1/\epsilon$. For the stochastic setting, our algorithm outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log^{2}n/\epsilon^{4})$ iterations. Technically, our main contribution is an idea of implementing a robust Hessian power method using only gradients, which can find negative curvature near saddle points and achieve the polynomial speedup in $\log n$ compared to the perturbed gradient descent methods. Finally, we also perform numerical experiments that support our results.

1 Introduction

Nonconvex optimization is a central research area in optimization theory, since many modern machine learning problems can be formulated as models with nonconvex loss functions, including deep neural networks, principal component analysis, tensor decomposition, etc. In general, finding a global minimum of a nonconvex function is NP-hard in the worst case. Instead, many theoretical works focus on finding a local minimum, because recent works (both empirical and theoretical) suggest that local minima are nearly as good as global minima for a significant number of well-studied machine learning problems; see e.g. [4, 11, 13, 14, 16, 17]. On the other hand, saddle points are major obstacles for solving these problems, not only because they are ubiquitous in high-dimensional settings where the directions for escaping may be few (see e.g. [5, 7, 10]), but also because saddle points can correspond to highly suboptimal solutions (see e.g. [18, 27]).

Hence, one of the most important topics in nonconvex optimization is to escape saddle points. Specifically, we consider a twice-differentiable function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ such that

  • $f$ is $\ell$-smooth: $\|\nabla f(\mathbf{x}_{1})-\nabla f(\mathbf{x}_{2})\|\leq\ell\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}$;

  • $f$ is $\rho$-Hessian Lipschitz: $\|\mathcal{H}(\mathbf{x}_{1})-\mathcal{H}(\mathbf{x}_{2})\|\leq\rho\|\mathbf{x}_{1}-\mathbf{x}_{2}\|$ for all $\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}$;

here $\mathcal{H}$ is the Hessian of $f$. The goal is to find an $\epsilon$-approximate second-order stationary point $\mathbf{x}_{\epsilon}$ (more generally, one can ask for an $(\epsilon_{1},\epsilon_{2})$-approximate second-order stationary point such that $\|\nabla f(\mathbf{x})\|\leq\epsilon_{1}$ and $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq-\epsilon_{2}$; the scaling in (1) was adopted as a standard in the literature [1, 6, 9, 19, 20, 21, 25, 28, 29, 30]):

$\|\nabla f(\mathbf{x}_{\epsilon})\|\leq\epsilon,\qquad\lambda_{\min}(\mathcal{H}(\mathbf{x}_{\epsilon}))\geq-\sqrt{\rho\epsilon}.$ (1)

In other words, at any $\epsilon$-approximate second-order stationary point $\mathbf{x}_{\epsilon}$, the gradient is small, with norm at most $\epsilon$, and the Hessian is close to positive semi-definite, with all its eigenvalues at least $-\sqrt{\rho\epsilon}$.
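As a purely conceptual illustration of condition (1), a low-dimensional sanity check in code could look as follows; the function name `is_approx_sosp` is ours, and the algorithms in this paper never form the Hessian explicitly, precisely to avoid the cost discussed below.

    import numpy as np

    def is_approx_sosp(grad, hess, eps, rho):
        # Condition (1): gradient norm at most eps and smallest Hessian eigenvalue
        # at least -sqrt(rho * eps).  Only a low-dimensional sanity check.
        lam_min = np.linalg.eigvalsh(hess)[0]
        return np.linalg.norm(grad) <= eps and lam_min >= -np.sqrt(rho * eps)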

Algorithms for escaping saddle points are mainly evaluated from two aspects. On the one hand, considering the enormous dimensions of machine learning models in practice, dimension-free or almost dimension-free (i.e., with $\operatorname{poly}(\log n)$ dependence) algorithms are highly preferred. On the other hand, recent empirical discoveries in machine learning suggest that it is often feasible to tackle difficult real-world problems using simple algorithms, which can be implemented and maintained more easily in practice. In contrast, algorithms with nested loops often suffer from significant overheads at large scale, or raise concerns about the setting of hyperparameters and numerical stability (see e.g. [1, 6]), making them relatively hard to implement in practice.

It is then natural to explore simple gradient-based algorithms for escaping saddle points. The reason we do not assume access to Hessians is that constructing them takes $\Omega(n^{2})$ cost in general, which is computationally infeasible when the dimension is large. A seminal work along this line was by Ge et al. [11], which found an $\epsilon$-approximate second-order stationary point satisfying (1) using only gradients in $O(\operatorname{poly}(n,1/\epsilon))$ iterations. This was later improved to an almost dimension-free $\tilde{O}(\log^{4}n/\epsilon^{2})$ in the follow-up work [19] (here the $\tilde{O}$ notation omits poly-logarithmic terms, i.e., $\tilde{O}(g)=O(g\operatorname{poly}(\log g))$), and the perturbed accelerated gradient descent algorithm [21] based on Nesterov's accelerated gradient descent [26] takes $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ iterations. However, these results still suffer from a significant overhead in terms of $\log n$. In another direction, Refs. [3, 24, 29] demonstrate that an $\epsilon$-approximate second-order stationary point can be found using gradients in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Their results are based on previous works [1, 6] using Hessian-vector products and the observation that a Hessian-vector product can be approximated via the difference of two gradient queries. Hence, their implementations contain nested-loop structures with relatively large numbers of hyperparameters. It has been an open question whether it is possible to keep both the merits of using only first-order information and of being close to dimension-free using a simple, gradient-based algorithm without a nested-loop structure [22]. This paper answers this question in the affirmative.
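To make the gradient-difference observation concrete, here is a minimal numpy sketch of approximating a Hessian-vector product with two gradient queries; the names `approx_hvp` and `grad_f` are our own and not from the paper, and by the $\rho$-Hessian Lipschitz condition the approximation error is $O(\rho r)$ (cf. Eq. (6) below).

    import numpy as np

    def approx_hvp(grad_f, x, v, r=1e-4):
        # Approximate H(x) @ v by a finite difference of two gradient queries:
        # H(x) v ~ (grad f(x + r * v/||v||) - grad f(x)) * ||v|| / r.
        v_norm = np.linalg.norm(v)
        unit = v / v_norm
        return (grad_f(x + r * unit) - grad_f(x)) * v_norm / r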

Contributions.

Our main contribution is a simple, single-loop, and robust gradient-based algorithm that can find an $\epsilon$-approximate second-order stationary point of a smooth, Hessian Lipschitz function $f\colon\mathbb{R}^{n}\to\mathbb{R}$. Compared to previous works [3, 24, 29] exploiting the idea of a gradient-based Hessian power method, our algorithm has a single-loop, simpler structure and better numerical stability. Compared to the previous state-of-the-art single-loop results by [21] and [19, 20] using $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ or $\tilde{O}(\log^{4}n/\epsilon^{2})$ iterations, our algorithm achieves a polynomial speedup in $\log n$:

Theorem 1 (informal).

Our single-loop algorithm finds an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations.

Technically, our work is inspired by the perturbed gradient descent (PGD) algorithm in [19, 20] and the perturbed accelerated gradient descent (PAGD) algorithm in [21]. Specifically, PGD applies gradient descent iteratively until it reaches a point with small gradient, which can be a potential saddle point. Then PGD generates a uniform perturbation in a small ball centered at that point and continues the GD procedure. It is demonstrated that, with an appropriate choice of the perturbation radius, PGD can shake the iterate out of the neighborhood of the saddle point and converge to a second-order stationary point with high probability. PAGD [21] adopts a similar perturbation idea, but GD is replaced by Nesterov's AGD [26].

Our algorithm is built upon PGD and PAGD with one main modification regarding the perturbation idea: it is more efficient to add a perturbation along the negative curvature direction near the saddle point, rather than the uniform perturbation used in PGD and PAGD, which is a compromise made because we generally cannot access the Hessian at the saddle due to its high computational cost. Our key observation is that we do not have to compute the entire Hessian to detect the negative curvature. Instead, in a small neighborhood of a saddle point, gradients can be viewed as Hessian-vector products plus some bounded deviation. In particular, GD near the saddle with learning rate $1/\ell$ is approximately the same as the power method applied to the matrix $(I-\mathcal{H}/\ell)$. As a result, the most negative eigenvalues stand out during GD because they have leading exponents in the power method, and thus the iterate approximately moves along the direction of the most negative curvature near the saddle point. Following this approach, we can escape saddle points more rapidly than previous algorithms: for a constant $\epsilon$, PGD and PAGD take $O(\log n)$ iterations to decrease the function value by $\Omega(1/\log^{3}n)$ and $\Omega(1/\log^{5}n)$ with high probability, respectively; in contrast, we can first take $O(\log n)$ iterations to identify a negative curvature direction, and then add a larger perturbation along this direction to decrease the function value by $\Omega(1)$. See Proposition 3 and Proposition 5. After escaping the saddle point, similar to PGD and PAGD, we switch back to GD and AGD iterations, which decrease the function value efficiently when the gradient is large [19, 20, 21].
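To spell out this power-method view in one line (under the idealized assumption that the landscape around the saddle is exactly quadratic, so the deviation terms analyzed in Section 2.1 vanish), write $\mathcal{H}=\sum_{i}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}$ for the Hessian at the saddle; then GD on the gradient differences with learning rate $1/\ell$ gives

$\mathbf{y}_{t}\approx\big(I-\mathcal{H}/\ell\big)^{t}\mathbf{y}_{0}=\sum_{i=1}^{n}\big(1-\lambda_{i}/\ell\big)^{t}\langle\mathbf{u}_{i},\mathbf{y}_{0}\rangle\,\mathbf{u}_{i},$

so any component with $\lambda_{i}\leq-\sqrt{\rho\epsilon}$ is amplified by a factor of at least $(1+\sqrt{\rho\epsilon}/\ell)^{t}$ while components with $\lambda_{i}\geq 0$ do not grow; after $t=O\big(\frac{\ell}{\sqrt{\rho\epsilon}}\log n\big)$ iterations the iterate is dominated by the most negative curvature directions, consistent with the choice of $\mathscr{T}$ in Proposition 3 below.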

Our algorithm is also applicable to the stochastic setting where we can only access stochastic gradients, and the stochasticity is not under the control of our algorithm. We further assume that the stochastic gradients are Lipschitz (or equivalently, the underlying functions are gradient-Lipschitz, see Assumption 2), which is also adopted in most of the existing works; see e.g. [8, 19, 20, 34]. We demonstrate that a simple extended version of our algorithm takes $O(\log^{2}n)$ iterations to detect a negative curvature direction using only stochastic gradients, and then obtains an $\Omega(1)$ function value decrease with high probability. In contrast, the perturbed stochastic gradient descent (PSGD) algorithm in [19, 20], the stochastic version of PGD, takes $O(\log^{10}n)$ iterations to decrease the function value by $\Omega(1/\log^{5}n)$ with high probability.

Theorem 2 (informal).

In the stochastic setting, our algorithm finds an $\epsilon$-approximate second-order stationary point using $\tilde{O}(\log^{2}n/\epsilon^{4})$ iterations via stochastic gradients.

Our results are summarized in Table 1. Although the underlying dynamics in [3, 24, 29] and in our algorithm have similarities, the main focus of our work is different. Specifically, Refs. [3, 24, 29] mainly aim at using novel techniques to reduce the iteration complexity for finding a second-order stationary point, whereas our work mainly focuses on reducing the number of loops and hyperparameters of negative curvature finding methods while preserving their advantage in iteration complexity, since a much simpler structure accords with empirical observations and enables wider applications. Moreover, the choice of perturbation in [3] is based on Chebyshev approximation theory, which may require additional nested-loop structures to boost the success probability. In the stochastic setting, there are also other results studying nonconvex optimization [15, 23, 31, 36, 12, 32, 35] from perspectives other than escaping saddle points, which are incomparable to our results.

Setting | Reference | Oracle | Iterations | Simplicity
Non-stochastic | [1, 6] | Hessian-vector product | $\tilde{O}(\log n/\epsilon^{1.75})$ | Nested-loop
Non-stochastic | [19, 20] | Gradient | $\tilde{O}(\log^{4}n/\epsilon^{2})$ | Single-loop
Non-stochastic | [21] | Gradient | $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ | Single-loop
Non-stochastic | [3, 24, 29] | Gradient | $\tilde{O}(\log n/\epsilon^{1.75})$ | Nested-loop
Non-stochastic | this work | Gradient | $\tilde{O}(\log n/\epsilon^{1.75})$ | Single-loop
Stochastic | [19, 20] | Gradient | $\tilde{O}(\log^{15}n/\epsilon^{4})$ | Single-loop
Stochastic | [9] | Gradient | $\tilde{O}(\log^{5}n/\epsilon^{3.5})$ | Single-loop
Stochastic | [3] | Gradient | $\tilde{O}(\log^{2}n/\epsilon^{3.5})$ | Nested-loop
Stochastic | [8] | Gradient | $\tilde{O}(\log^{2}n/\epsilon^{3})$ | Nested-loop
Stochastic | this work | Gradient | $\tilde{O}(\log^{2}n/\epsilon^{4})$ | Single-loop
Table 1: A summary of the state-of-the-art results on finding approximate second-order stationary points with the first-order (gradient) oracle. Iteration numbers are highlighted in terms of the dimension $n$ and the precision $\epsilon$.

It is worth highlighting that our gradient-descent based algorithm enjoys the following nice features:

  • Simplicity: Some of the previous algorithms have nested-loop structures, which raises practical concerns when setting the hyperparameters. In contrast, our algorithm based on negative curvature finding only contains a single loop with two components: gradient descent (including AGD or SGD) and perturbation. As mentioned above, such a simple structure is preferred in machine learning, which increases the possibility of our algorithm finding real-world applications.

  • Numerical stability: Our algorithm contains an additional renormalization step at each iteration when escaping from saddle points. Although from a theoretical perspective a renormalization step affects neither the output nor the complexity of our algorithm, when finding negative curvature near saddle points it enables us to sample gradients in a larger region, which makes our algorithm more numerically stable against floating point and other errors. The introduction of the renormalization step is enabled by the simple structure of our algorithm, and may not be feasible for more complicated algorithms [3, 24, 29].

  • Robustness: Our algorithm is robust against adversarial attacks on the evaluation of gradients. Specifically, when analyzing the performance of our algorithm near saddle points, we essentially treat the deviation from pure quadratic geometry as an external noise. Hence, the effectiveness of our algorithm is unaffected by external attacks as long as the adversarial error is bounded by the deviation from the quadratic landscape.

Finally, we perform numerical experiments that support our polynomial speedup in $\log n$. We run our negative curvature finding algorithms using GD or SGD on various landscapes and general classes of nonconvex functions, and use comparative studies to show that our Algorithm 1 and Algorithm 3 achieve a higher probability of escaping saddle points using far fewer iterations than PGD and PSGD (typically fewer than $1/3$ of the iterations of PGD and $1/2$ of the iterations of PSGD, respectively). Moreover, we perform numerical experiments benchmarking the solution quality and iteration complexity of our algorithm against accelerated methods. Compared to PAGD [21] and even advanced optimization algorithms such as NEON+ [29], Algorithm 2 achieves better solution quality and iteration complexity on various landscapes given by more general nonconvex functions: with fewer iterations (typically fewer than $1/3$ of the iterations of PAGD and $1/2$ of the iterations of NEON+, respectively), our Algorithm 2 achieves a higher probability of escaping from saddle points.

Open questions.

This work leaves a couple of natural open questions for future investigation:

  • Can we achieve the polynomial speedup in $\log n$ for more advanced stochastic optimization algorithms with complexity $\tilde{O}(\operatorname{poly}(\log n)/\epsilon^{3.5})$ [2, 3, 9, 28, 30] or $\tilde{O}(\operatorname{poly}(\log n)/\epsilon^{3})$ [8, 33]?

  • How do our algorithms for escaping saddle points perform in real-world applications, such as tensor decomposition [11, 16], matrix completion [13], etc.?

Broader impact.

This work focuses on the theory of nonconvex optimization and, as far as we can see, we do not anticipate any potential negative societal impact. Nevertheless, it might have a positive impact for researchers who are interested in understanding the theoretical underpinnings of (stochastic) gradient descent methods for machine learning applications.

Organization.

In Section 2, we introduce our gradient-based Hessian power method algorithm for negative curvature finding, and present how our algorithms provide a polynomial speedup in $\log n$ for both PGD and PAGD. In Section 3, we present the stochastic version of our negative curvature finding algorithm using stochastic gradients and demonstrate its polynomial speedup in $\log n$ for PSGD. Numerical experiments are presented in Section 4. We provide detailed proofs and additional numerical experiments in the supplementary material.

2 A simple algorithm for negative curvature finding

We show how to find negative curvature near a saddle point using a gradient-based Hessian power method, and extend it to a version with a faster convergence rate by replacing gradient descent with accelerated gradient descent. The intuition is as follows: in a small enough region around a saddle point, the gradient can be approximately expressed by a Hessian-vector product formula, and the approximation error can be efficiently upper bounded; see Eq. (6). Hence, using only gradient information, we can implement an accurate enough Hessian power method to find negative eigenvectors of the Hessian matrix, and thereby find the negative curvature near the saddle.

2.1 Negative curvature finding based on gradient descents

We first present an algorithm for negative curvature finding based on gradient descent. Specifically, for any $\tilde{\mathbf{x}}\in\mathbb{R}^{n}$ with $\lambda_{\min}(\mathcal{H}(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$, it finds a unit vector $\hat{\mathbf{e}}$ such that $\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4$.

1  $\mathbf{y}_{0}\leftarrow\text{Uniform}(\mathbb{B}_{\tilde{\mathbf{x}}}(r))$, where $\mathbb{B}_{\tilde{\mathbf{x}}}(r)$ is the $\ell_{2}$-norm ball centered at $\tilde{\mathbf{x}}$ with radius $r$;
2  for $t=1,\ldots,\mathscr{T}$ do
3      $\mathbf{y}_{t}\leftarrow\mathbf{y}_{t-1}-\frac{\|\mathbf{y}_{t-1}\|}{\ell r}\big(\nabla f(\tilde{\mathbf{x}}+r\mathbf{y}_{t-1}/\|\mathbf{y}_{t-1}\|)-\nabla f(\tilde{\mathbf{x}})\big)$;
Output $\mathbf{y}_{\mathscr{T}}/r$.
Algorithm 1: Negative Curvature Finding($\tilde{\mathbf{x}},r,\mathscr{T}$).
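For readers who prefer code, the following is a minimal numpy sketch of Algorithm 1 (the helper names `uniform_ball` and `negative_curvature_finding` are ours). It treats $\mathbf{y}_{t}$ as a displacement from $\tilde{\mathbf{x}}$, as in the proof below where $\tilde{\mathbf{x}}$ is shifted to $\mathbf{0}$, and normalizes the output explicitly, which agrees with $\mathbf{y}_{\mathscr{T}}/r$ once the renormalization of Remark 4 is applied.

    import numpy as np

    def uniform_ball(n, radius, rng):
        # Sample uniformly from the n-dimensional l2 ball of the given radius.
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        return radius * rng.uniform() ** (1.0 / n) * v

    def negative_curvature_finding(grad_f, x_tilde, r, T, ell, rng=None):
        # Gradient-based Hessian power method around x_tilde (Algorithm 1).
        # grad_f: gradient oracle; r, T: as in Eq. (2); ell: smoothness parameter.
        rng = np.random.default_rng() if rng is None else rng
        g0 = grad_f(x_tilde)
        y = uniform_ball(len(x_tilde), r, rng)     # random displacement from x_tilde
        for _ in range(T):
            u = y / np.linalg.norm(y)
            # (grad f(x_tilde + r*u) - grad f(x_tilde)) / r approximates H(x_tilde) @ u,
            # so this update is approximately y <- (I - H/ell) y, a power iteration.
            y = y - (np.linalg.norm(y) / (ell * r)) * (grad_f(x_tilde + r * u) - g0)
        return y / np.linalg.norm(y)               # unit vector along negative curvature

With the parameter choices of Eq. (2) below, Proposition 3 guarantees that the output is a negative curvature direction with probability at least $1-\delta_{0}$.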
Proposition 3.

Suppose the function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. For any $0<\delta_{0}\leq 1$, we specify our choice of parameters and constants as follows:

$\mathscr{T}=\frac{8\ell}{\sqrt{\rho\epsilon}}\log\Big(\frac{\ell}{\delta_{0}}\sqrt{\frac{n}{\pi\rho\epsilon}}\Big),\qquad r=\frac{\epsilon}{8\ell}\sqrt{\frac{\pi}{n}}\,\delta_{0}.$ (2)

Suppose that $\tilde{\mathbf{x}}$ satisfies $\lambda_{\min}(\nabla^{2}f(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$. Then with probability at least $1-\delta_{0}$, Algorithm 1 outputs a unit vector $\hat{\mathbf{e}}$ satisfying

$\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4,$ (3)

using $O(\mathscr{T})=\tilde{O}\big(\frac{\log n}{\sqrt{\rho\epsilon}}\big)$ iterations, where $\mathcal{H}$ stands for the Hessian matrix of the function $f$.

Proof.

Without loss of generality we assume $\tilde{\mathbf{x}}=\mathbf{0}$ by shifting $\mathbb{R}^{n}$ so that $\tilde{\mathbf{x}}$ is mapped to $\mathbf{0}$. Define a new $n$-dimensional function

$h_{f}(\mathbf{x}):=f(\mathbf{x})-\left\langle\nabla f(\mathbf{0}),\mathbf{x}\right\rangle$ (4)

for the ease of our analysis. Since $\left\langle\nabla f(\mathbf{0}),\mathbf{x}\right\rangle$ is a linear function with Hessian $0$, the Hessian of $h_{f}$ equals the Hessian of $f$, and $h_{f}$ is also $\ell$-smooth and $\rho$-Hessian Lipschitz. In addition, note that $\nabla h_{f}(\mathbf{0})=\nabla f(\mathbf{0})-\nabla f(\mathbf{0})=0$. Then for all $\mathbf{x}\in\mathbb{R}^{n}$,

$\nabla h_{f}(\mathbf{x})=\int_{\xi=0}^{1}\mathcal{H}(\xi\mathbf{x})\,\mathbf{x}\,\mathrm{d}\xi=\mathcal{H}(\mathbf{0})\mathbf{x}+\int_{\xi=0}^{1}\big(\mathcal{H}(\xi\mathbf{x})-\mathcal{H}(\mathbf{0})\big)\,\mathbf{x}\,\mathrm{d}\xi.$ (5)

Furthermore, due to the $\rho$-Hessian Lipschitz condition on both $f$ and $h_{f}$, for any $\xi\in[0,1]$ we have $\|\mathcal{H}(\xi\mathbf{x})-\mathcal{H}(\mathbf{0})\|\leq\rho\|\mathbf{x}\|$, which leads to

$\|\nabla h_{f}(\mathbf{x})-\mathcal{H}(\mathbf{0})\mathbf{x}\|\leq\rho\|\mathbf{x}\|^{2}.$ (6)

Observe that the Hessian matrix $\mathcal{H}(\mathbf{0})$ admits the eigen-decomposition

$\mathcal{H}(\mathbf{0})=\sum_{i=1}^{n}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T},$ (7)

where the set $\{\mathbf{u}_{i}\}_{i=1}^{n}$ forms an orthonormal basis of $\mathbb{R}^{n}$. Without loss of generality, we assume the eigenvalues $\lambda_{1},\lambda_{2},\ldots,\lambda_{n}$ corresponding to $\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n}$ satisfy

$\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n},$ (8)

in which $\lambda_{1}\leq-\sqrt{\rho\epsilon}$. If $\lambda_{n}\leq-\sqrt{\rho\epsilon}/2$, Proposition 3 holds directly. Hence, we only need to prove the case where $\lambda_{n}>-\sqrt{\rho\epsilon}/2$, in which there exist some $p\geq 1$ and $p^{\prime}\geq 1$ with

$\lambda_{p}\leq-\sqrt{\rho\epsilon}\leq\lambda_{p+1},\qquad\lambda_{p^{\prime}}\leq-\sqrt{\rho\epsilon}/2<\lambda_{p^{\prime}+1}.$ (9)

We use $\mathfrak{S}_{\parallel}$, $\mathfrak{S}_{\perp}$ to denote the subspaces of $\mathbb{R}^{n}$ spanned by $\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\}$ and $\{\mathbf{u}_{p+1},\mathbf{u}_{p+2},\ldots,\mathbf{u}_{n}\}$, respectively, and use $\mathfrak{S}_{\parallel}^{\prime}$, $\mathfrak{S}_{\perp}^{\prime}$ to denote the subspaces of $\mathbb{R}^{n}$ spanned by $\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p^{\prime}}\}$ and $\{\mathbf{u}_{p^{\prime}+1},\mathbf{u}_{p^{\prime}+2},\ldots,\mathbf{u}_{n}\}$. Furthermore, we define

$\mathbf{y}_{t,\parallel}:=\sum_{i=1}^{p}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i},\qquad\mathbf{y}_{t,\perp}:=\sum_{i=p+1}^{n}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i};$ (10)
$\mathbf{y}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i},\qquad\mathbf{y}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\left\langle\mathbf{u}_{i},\mathbf{y}_{t}\right\rangle\mathbf{u}_{i},$ (11)

respectively, to denote the components of the iterate $\mathbf{y}_{t}$ of Algorithm 1 in the subspaces $\mathfrak{S}_{\parallel}$, $\mathfrak{S}_{\perp}$, $\mathfrak{S}_{\parallel}^{\prime}$, $\mathfrak{S}_{\perp}^{\prime}$, and let $\alpha_{t}:=\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|$. Observe that

$\Pr\left\{\alpha_{0}\geq\delta_{0}\sqrt{\pi/n}\right\}\geq\Pr\left\{|y_{0,1}|/r\geq\delta_{0}\sqrt{\pi/n}\right\},$ (12)

where $y_{0,1}:=\left\langle\mathbf{u}_{1},\mathbf{y}_{0}\right\rangle$ denotes the component of $\mathbf{y}_{0}$ along $\mathbf{u}_{1}$. Consider the case $\alpha_{0}\geq\delta_{0}\sqrt{\pi/n}$, which happens with probability

$\Pr\left\{\alpha_{0}\geq\sqrt{\frac{\pi}{n}}\delta_{0}\right\}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\frac{\text{Vol}(\mathbb{B}_{0}^{n-1}(1))}{\text{Vol}(\mathbb{B}_{0}^{n}(1))}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\sqrt{\frac{n}{\pi}}=1-\delta_{0}.$ (13)

We prove that there exists some $t_{0}$ with $1\leq t_{0}\leq\mathscr{T}$ such that

$\|\mathbf{y}_{t_{0},\perp^{\prime}}\|/\|\mathbf{y}_{t_{0}}\|\leq\sqrt{\rho\epsilon}/(8\ell).$ (14)

Assume the contrary, i.e., $\|\mathbf{y}_{t,\perp^{\prime}}\|/\|\mathbf{y}_{t}\|>\sqrt{\rho\epsilon}/(8\ell)$ for all $1\leq t\leq\mathscr{T}$. Then $\|\mathbf{y}_{t,\perp^{\prime}}\|$ satisfies the following recurrence:

$\|\mathbf{y}_{t+1,\perp^{\prime}}\|\leq(1+\sqrt{\rho\epsilon}/(2\ell))\|\mathbf{y}_{t,\perp^{\prime}}\|+\|\Delta_{\perp^{\prime}}\|\leq(1+\sqrt{\rho\epsilon}/(2\ell)+\|\Delta\|/\|\mathbf{y}_{t,\perp^{\prime}}\|)\|\mathbf{y}_{t,\perp^{\prime}}\|,$ (15)

where

$\Delta:=\frac{\|\mathbf{y}_{t}\|}{r\ell}\big(\nabla h_{f}(r\mathbf{y}_{t}/\|\mathbf{y}_{t}\|)-\mathcal{H}(\mathbf{0})\,(r\mathbf{y}_{t}/\|\mathbf{y}_{t}\|)\big)$ (16)

stands for the deviation from the pure quadratic approximation and satisfies $\|\Delta\|/\|\mathbf{y}_{t}\|\leq\rho r/\ell$ due to (6). Hence,

$\|\mathbf{y}_{t+1,\perp^{\prime}}\|\leq\Big(1+\frac{\sqrt{\rho\epsilon}}{2\ell}+\frac{\|\Delta\|}{\|\mathbf{y}_{t,\perp^{\prime}}\|}\Big)\|\mathbf{y}_{t,\perp^{\prime}}\|\leq\Big(1+\frac{\sqrt{\rho\epsilon}}{2\ell}+\frac{8\rho r}{\sqrt{\rho\epsilon}}\Big)\|\mathbf{y}_{t,\perp^{\prime}}\|,$ (17)

which leads to

$\|\mathbf{y}_{t,\perp^{\prime}}\|\leq\|\mathbf{y}_{0,\perp^{\prime}}\|\big(1+\sqrt{\rho\epsilon}/(2\ell)+8\rho r/\sqrt{\rho\epsilon}\big)^{t}\leq\|\mathbf{y}_{0,\perp^{\prime}}\|\big(1+5\sqrt{\rho\epsilon}/(8\ell)\big)^{t},\qquad\forall t\in[\mathscr{T}].$ (18)

Similarly, we have the recurrence for $\|\mathbf{y}_{t,\parallel}\|$:

$\|\mathbf{y}_{t+1,\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\parallel}\|-\|\Delta_{\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell-\|\Delta\|/(\alpha_{t}\|\mathbf{y}_{t}\|))\|\mathbf{y}_{t,\parallel}\|.$ (19)

Since $\|\Delta\|/\|\mathbf{y}_{t}\|\leq\rho r/\ell$ due to (6), we further have

$\|\mathbf{y}_{t+1,\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{t}\ell))\|\mathbf{y}_{t,\parallel}\|.$ (20)

Intuitively, $\|\mathbf{y}_{t,\parallel}\|$ should increase faster than $\|\mathbf{y}_{t,\perp}\|$ in this gradient-based Hessian power method if we ignore the deviation from the quadratic approximation, so the value $\alpha_{t}=\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|$ should be non-decreasing. It is demonstrated in Lemma 18 in Appendix B that, even taking this deviation into account, $\alpha_{t}$ can still be lower bounded by some constant $\alpha_{\min}$:

$\alpha_{t}\geq\alpha_{\min}=\frac{\delta_{0}}{4}\sqrt{\frac{\pi}{n}},\qquad\forall 1\leq t\leq\mathscr{T},$ (21)

by which we can further deduce that

$\|\mathbf{y}_{t,\parallel}\|\geq\|\mathbf{y}_{0,\parallel}\|\big(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell)\big)^{t}\geq\|\mathbf{y}_{0,\parallel}\|\big(1+7\sqrt{\rho\epsilon}/(8\ell)\big)^{t},\qquad\forall 1\leq t\leq\mathscr{T}.$ (22)

Observe that

$\frac{\|\mathbf{y}_{\mathscr{T},\perp^{\prime}}\|}{\|\mathbf{y}_{\mathscr{T},\parallel}\|}\leq\frac{\|\mathbf{y}_{0,\perp^{\prime}}\|}{\|\mathbf{y}_{0,\parallel}\|}\cdot\Big(\frac{1+5\sqrt{\rho\epsilon}/(8\ell)}{1+7\sqrt{\rho\epsilon}/(8\ell)}\Big)^{\mathscr{T}}\leq\frac{1}{\delta_{0}}\sqrt{\frac{n}{\pi}}\Big(\frac{1+5\sqrt{\rho\epsilon}/(8\ell)}{1+7\sqrt{\rho\epsilon}/(8\ell)}\Big)^{\mathscr{T}}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}.$ (23)

Since $\|\mathbf{y}_{\mathscr{T},\parallel}\|\leq\|\mathbf{y}_{\mathscr{T}}\|$, we have $\|\mathbf{y}_{\mathscr{T},\perp^{\prime}}\|/\|\mathbf{y}_{\mathscr{T}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$, a contradiction. Hence, there exists some $t_{0}$ with $1\leq t_{0}\leq\mathscr{T}$ such that $\|\mathbf{y}_{t_{0},\perp^{\prime}}\|/\|\mathbf{y}_{t_{0}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$. Consider the normalized vector $\hat{\mathbf{e}}=\mathbf{y}_{t_{0}}/r$, and use $\hat{\mathbf{e}}_{\perp^{\prime}}$ and $\hat{\mathbf{e}}_{\parallel^{\prime}}$ to denote the components of $\hat{\mathbf{e}}$ in $\mathfrak{S}_{\perp}^{\prime}$ and $\mathfrak{S}_{\parallel}^{\prime}$, respectively. Then $\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$, whereas $\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|\geq 1-\rho\epsilon/(8\ell)^{2}$. Then,

$\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})^{T}\mathcal{H}(\mathbf{0})(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}$ (24)

since $\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\in\mathfrak{S}_{\perp}^{\prime}$ and $\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\in\mathfrak{S}_{\parallel}^{\prime}$. Due to the $\ell$-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value at most $\ell$. Hence,

$\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\leq\ell\|\hat{\mathbf{e}}_{\perp^{\prime}}\|^{2}\leq\rho\epsilon/(64\ell).$ (25)

Further, by the definition of $\mathfrak{S}_{\parallel}^{\prime}$, we have

$\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\sqrt{\rho\epsilon}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}/2.$ (26)

Combining these two inequalities, we obtain

$\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\sqrt{\rho\epsilon}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}/2+\rho\epsilon/(64\ell)\leq-\sqrt{\rho\epsilon}/4.$ (27) ∎

Remark 4.

In practice, the value of $\|\mathbf{y}_{t}\|$ can become large during the execution of Algorithm 1. To address this, we can renormalize $\mathbf{y}_{t}$ to have $\ell_{2}$-norm $r$ at the end of each iteration, and this does not influence the performance of the algorithm.
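In the Algorithm 1 sketch above, this amounts to one extra rescaling at the end of each loop iteration; a one-line helper (the name `renormalize` is ours) could look as follows.

    import numpy as np

    def renormalize(y, r):
        # Remark 4: rescale the iterate back to norm r; only the direction of y matters,
        # so this changes neither the output direction nor the iteration count.
        return r * y / np.linalg.norm(y)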

2.2 Faster negative curvature finding based on accelerated gradient descents

In this subsection, we replace the GD part of Algorithm 1 with AGD to obtain an accelerated negative curvature finding subroutine with a similar guarantee and a faster convergence rate, based on which we further implement our Accelerated Gradient Descent with Negative Curvature Finding (Algorithm 2). Near any saddle point $\tilde{\mathbf{x}}\in\mathbb{R}^{n}$ with $\lambda_{\min}(\mathcal{H}(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$, Algorithm 2 finds a unit vector $\hat{\mathbf{e}}$ such that $\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4$.

1   $t_{\text{perturb}}\leftarrow 0$, $\mathbf{z}_{0}\leftarrow\mathbf{x}_{0}$, $\tilde{\mathbf{x}}\leftarrow\mathbf{x}_{0}$, $\zeta\leftarrow 0$;
2   for $t=0,1,2,\ldots,T$ do
3       if $\|\nabla f(\mathbf{x}_{t})\|\leq\epsilon$ and $t-t_{\text{perturb}}>\mathscr{T}^{\prime}$ then
4           $\tilde{\mathbf{x}}\leftarrow\mathbf{x}_{t}$;
5           $\mathbf{x}_{t}\leftarrow\text{Uniform}(\mathbb{B}_{\tilde{\mathbf{x}}}(r^{\prime}))$, where $\mathbb{B}_{\tilde{\mathbf{x}}}(r^{\prime})$ is the $\ell_{2}$-norm ball centered at $\tilde{\mathbf{x}}$ with radius $r^{\prime}$; $\mathbf{z}_{t}\leftarrow\mathbf{x}_{t}$, $\zeta\leftarrow\nabla f(\tilde{\mathbf{x}})$, $t_{\text{perturb}}\leftarrow t$;
6       if $t-t_{\text{perturb}}=\mathscr{T}^{\prime}$ then
7           $\hat{\mathbf{e}}:=\frac{\mathbf{x}_{t}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{t}-\tilde{\mathbf{x}}\|}$;
8           $\mathbf{x}_{t}\leftarrow\tilde{\mathbf{x}}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\tilde{\mathbf{x}})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\tilde{\mathbf{x}})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}}$, $\mathbf{z}_{t}\leftarrow\mathbf{x}_{t}$, $\zeta\leftarrow\mathbf{0}$;
9       $\mathbf{x}_{t+1}\leftarrow\mathbf{z}_{t}-\eta(\nabla f(\mathbf{z}_{t})-\zeta)$;
10      $\mathbf{v}_{t+1}\leftarrow\mathbf{x}_{t+1}-\mathbf{x}_{t}$;
11      $\mathbf{z}_{t+1}\leftarrow\mathbf{x}_{t+1}+(1-\theta)\mathbf{v}_{t+1}$;
12      if $t_{\text{perturb}}\neq 0$ and $t-t_{\text{perturb}}<\mathscr{T}^{\prime}$ then
13          $\mathbf{z}_{t+1}\leftarrow\tilde{\mathbf{x}}+r^{\prime}\cdot\frac{\mathbf{z}_{t+1}-\tilde{\mathbf{x}}}{\|\mathbf{z}_{t+1}-\tilde{\mathbf{x}}\|}$, $\mathbf{x}_{t+1}\leftarrow\tilde{\mathbf{x}}+r^{\prime}\cdot\frac{\mathbf{x}_{t+1}-\tilde{\mathbf{x}}}{\|\mathbf{z}_{t+1}-\tilde{\mathbf{x}}\|}$;
14      else
15          if $f(\mathbf{x}_{t+1})\leq f(\mathbf{z}_{t+1})+\left\langle\nabla f(\mathbf{z}_{t+1}),\mathbf{x}_{t+1}-\mathbf{z}_{t+1}\right\rangle-\frac{\gamma}{2}\|\mathbf{z}_{t+1}-\mathbf{x}_{t+1}\|^{2}$ then
16              $(\mathbf{x}_{t+1},\mathbf{v}_{t+1})\leftarrow\text{NegativeCurvatureExploitation}(\mathbf{x}_{t+1},\mathbf{v}_{t+1},s)$;
17              $\mathbf{z}_{t+1}\leftarrow\mathbf{x}_{t+1}+(1-\theta)\mathbf{v}_{t+1}$;
Algorithm 2: Perturbed Accelerated Gradient Descent with Accelerated Negative Curvature Finding($\mathbf{x}_{0},\eta,\theta,\gamma,s,\mathscr{T}^{\prime},r^{\prime}$).
Footnote 3: The NegativeCurvatureExploitation (NCE) subroutine was originally introduced in [21, Algorithm 3] and is called when we detect that the current momentum $\mathbf{v}_{t}$ coincides with a negative curvature direction of $\mathbf{z}_{t}$. In this case, we reset the momentum $\mathbf{v}_{t}$ and decide whether to exploit this direction based on the value of $\|\mathbf{v}_{t}\|$.
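As with Algorithm 1, a minimal Python sketch may help parse the structure of Algorithm 2. It reuses the `uniform_ball` helper from the Algorithm 1 sketch, stubs out NCE (keeping only the momentum reset described in Footnote 3; see [21, Algorithm 3] for the full subroutine), and adds a guard so the $\mathscr{T}^{\prime}$-step check only fires after a perturbation. All function names are our own, and this is a sketch rather than the authors' reference implementation.

    import numpy as np

    def nce_stub(x, v):
        # Placeholder for NegativeCurvatureExploitation of [21, Algorithm 3]:
        # here we only reset the momentum, as described in Footnote 3.
        return x, np.zeros_like(v)

    def pagd_with_ncf(f, grad_f, x0, eps, rho, ell, delta0=0.1, T=100000, rng=None):
        # Sketch of Algorithm 2 with the parameter choices of Eq. (28).
        rng = np.random.default_rng() if rng is None else rng
        n = len(x0)
        eta = 1.0 / (4 * ell)
        theta = (rho * eps) ** 0.25 / (4 * np.sqrt(ell))
        gamma = theta ** 2 / eta
        s = gamma / (4 * rho)
        T_prime = int(np.ceil(32 * np.sqrt(ell) / (rho * eps) ** 0.25
                              * np.log(ell / delta0 * np.sqrt(n / (rho * eps)))))
        r_prime = delta0 * eps / 32 * np.sqrt(np.pi / (rho * n))

        t_perturb, x, z, x_tilde, zeta = 0, x0.copy(), x0.copy(), x0.copy(), np.zeros(n)
        for t in range(T):
            if np.linalg.norm(grad_f(x)) <= eps and t - t_perturb > T_prime:
                # potential saddle: start the accelerated negative-curvature-finding phase
                x_tilde = x.copy()
                x = x_tilde + uniform_ball(n, r_prime, rng)
                z, zeta, t_perturb = x.copy(), grad_f(x_tilde), t
            if t_perturb > 0 and t - t_perturb == T_prime:
                # end of the NCF phase: perturb along the extracted direction e_hat (Lemma 6)
                e_hat = (x - x_tilde) / np.linalg.norm(x - x_tilde)
                sign = 1.0 if np.dot(grad_f(x_tilde), e_hat) >= 0 else -1.0
                x = x_tilde - sign * 0.25 * np.sqrt(eps / rho) * e_hat
                z, zeta = x.copy(), np.zeros(n)
            x_next = z - eta * (grad_f(z) - zeta)     # AGD gradient step (shifted by zeta during NCF)
            v = x_next - x
            z_next = x_next + (1 - theta) * v         # AGD momentum step
            if t_perturb > 0 and t - t_perturb < T_prime:
                # during NCF, rescale the iterates back to distance r_prime from x_tilde
                scale = r_prime / np.linalg.norm(z_next - x_tilde)
                z_next = x_tilde + scale * (z_next - x_tilde)
                x_next = x_tilde + scale * (x_next - x_tilde)
            elif f(x_next) <= (f(z_next) + np.dot(grad_f(z_next), x_next - z_next)
                               - gamma / 2 * np.linalg.norm(z_next - x_next) ** 2):
                x_next, v = nce_stub(x_next, v)
                z_next = x_next + (1 - theta) * v
            x, z = x_next, z_next
        return x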

The following proposition exhibits the effectiveness of Algorithm 2 for finding negative curvatures near saddle points:

Proposition 5.

Suppose the function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. For any $0<\delta_{0}\leq 1$, we specify our choice of parameters and constants as follows:

$\eta:=\frac{1}{4\ell},\quad\theta:=\frac{(\rho\epsilon)^{1/4}}{4\sqrt{\ell}},\quad\mathscr{T}^{\prime}:=\frac{32\sqrt{\ell}}{(\rho\epsilon)^{1/4}}\log\Big(\frac{\ell}{\delta_{0}}\sqrt{\frac{n}{\rho\epsilon}}\Big),\quad\gamma:=\frac{\theta^{2}}{\eta},\quad s:=\frac{\gamma}{4\rho},\quad r^{\prime}:=\frac{\delta_{0}\epsilon}{32}\sqrt{\frac{\pi}{\rho n}}.$ (28)

Then for a point $\tilde{\mathbf{x}}$ satisfying $\lambda_{\min}(\nabla^{2}f(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}$, if we run Algorithm 2 with the uniform perturbation (Line 5) added at $t=0$, the unit vector $\hat{\mathbf{e}}$ computed in Line 7 after $\mathscr{T}^{\prime}$ iterations satisfies

$\Pr\Big(\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4\Big)\geq 1-\delta_{0}.$ (29)

The proof of Proposition 5 is similar to the proof of Proposition 3, and is deferred to Appendix B.2.

2.3 Escaping saddle points using negative curvature finding

In this subsection, we demonstrate that Algorithm 1 and Algorithm 2, with their ability to find negative curvature near saddle points, can further escape saddle points of nonconvex functions. The intuition is as follows: we start with gradient descent or accelerated gradient descent until the gradient becomes small. At this position, we compute the negative curvature direction, described by a unit vector $\hat{\mathbf{e}}$, via Algorithm 1 or the negative curvature finding subroutine of Algorithm 2. Then, we add a perturbation along this direction of negative curvature and go back to gradient descent or accelerated gradient descent with an additional NegativeCurvatureExploitation subroutine (see Footnote 3). This perturbation has the following guarantee:

Lemma 6.

Suppose the function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ is $\ell$-smooth and $\rho$-Hessian Lipschitz. Then for any point $\mathbf{x}_{0}\in\mathbb{R}^{n}$, if there exists a unit vector $\hat{\mathbf{e}}$ satisfying $\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{x}_{0})\hat{\mathbf{e}}\leq-\frac{\sqrt{\rho\epsilon}}{4}$, where $\mathcal{H}$ stands for the Hessian matrix of the function $f$, then the following inequality holds:

$f\Big(\mathbf{x}_{0}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}}\Big)\leq f(\mathbf{x}_{0})-\frac{1}{384}\sqrt{\frac{\epsilon^{3}}{\rho}},$ (30)

where $f^{\prime}_{\hat{\mathbf{e}}}$ stands for the component of the gradient of $f$ along the direction $\hat{\mathbf{e}}$.

Proof.

Without loss of generality, we assume $\mathbf{x}_{0}=\mathbf{0}$. We can also assume $\left\langle\nabla f(\mathbf{0}),\hat{\mathbf{e}}\right\rangle\leq 0$; if this is not the case, we can pick $-\hat{\mathbf{e}}$ instead, which still satisfies $(-\hat{\mathbf{e}})^{T}\mathcal{H}(\mathbf{x}_{0})(-\hat{\mathbf{e}})\leq-\frac{\sqrt{\rho\epsilon}}{4}$. In practice, to decide whether we should use $\hat{\mathbf{e}}$ or $-\hat{\mathbf{e}}$, we apply both of them in (30) and choose the one with the smaller function value. Then, for any $\mathbf{x}=x_{\hat{\mathbf{e}}}\hat{\mathbf{e}}$ with some $x_{\hat{\mathbf{e}}}>0$, we have $\frac{\partial^{2}f}{\partial x_{\hat{\mathbf{e}}}^{2}}(\mathbf{x})\leq-\frac{\sqrt{\rho\epsilon}}{4}+\rho x_{\hat{\mathbf{e}}}$ due to the $\rho$-Hessian Lipschitz condition on $f$. Hence,

$\frac{\partial f}{\partial x_{\hat{\mathbf{e}}}}(\mathbf{x})\leq f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{0})-\frac{\sqrt{\rho\epsilon}}{4}x_{\hat{\mathbf{e}}}+\rho x_{\hat{\mathbf{e}}}^{2},$ (31)

by which we can further derive

$f(x_{\hat{\mathbf{e}}}\hat{\mathbf{e}})-f(\mathbf{0})\leq f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{0})x_{\hat{\mathbf{e}}}-\frac{\sqrt{\rho\epsilon}}{8}x_{\hat{\mathbf{e}}}^{2}+\frac{\rho}{3}x_{\hat{\mathbf{e}}}^{3}\leq-\frac{\sqrt{\rho\epsilon}}{8}x_{\hat{\mathbf{e}}}^{2}+\frac{\rho}{3}x_{\hat{\mathbf{e}}}^{3}.$ (32)

Setting $x_{\hat{\mathbf{e}}}=\frac{1}{4}\sqrt{\frac{\epsilon}{\rho}}$ makes the right-hand side of (32) equal to $-\frac{\sqrt{\rho\epsilon}}{8}\cdot\frac{\epsilon}{16\rho}+\frac{\rho}{3}\cdot\frac{1}{64}\big(\frac{\epsilon}{\rho}\big)^{3/2}=\big(-\frac{1}{128}+\frac{1}{192}\big)\sqrt{\frac{\epsilon^{3}}{\rho}}=-\frac{1}{384}\sqrt{\frac{\epsilon^{3}}{\rho}}$, which gives (30). ∎

We give the full algorithm details based on Algorithm 1 in Appendix C.1. Along this approach, we achieve the following:

Theorem 7 (informal, full version deferred to Appendix C.3).

For any $\epsilon>0$ and a constant $0<\delta\leq 1$, Algorithm 2 satisfies that at least one of the iterates $\mathbf{x}_{t}$ will be an $\epsilon$-approximate second-order stationary point within

$\tilde{O}\Big(\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{1.75}}\cdot\log n\Big)$ (33)

iterations, with probability at least $1-\delta$, where $f^{*}$ is the global minimum of $f$.

Intuitively, the proof of Theorem 7 has two parts. The first part is similar to the proof of [21, Theorem 3], which shows that PAGD uses $\tilde{O}(\log^{6}n/\epsilon^{1.75})$ iterations to escape saddle points. We show that there can be at most $\tilde{O}(\Delta_{f}/\epsilon^{1.75})$ iterations with the norm of the gradient larger than $\epsilon$, using almost the same techniques but with slightly different parameter choices. The second part is based on the negative curvature finding part of Algorithm 2, our accelerated negative curvature finding algorithm. Specifically, at each saddle point we encounter, we take $\tilde{O}(\log n/\epsilon^{1/4})$ iterations to find its negative curvature (Proposition 5), and add a perturbation in this direction to decrease the function value by $\Omega(\epsilon^{1.5})$ (Lemma 6). Hence, the number of iterations introduced by Algorithm 4 is at most $\tilde{O}\big(\frac{1}{\epsilon^{1.5}}\cdot\frac{\log n}{\epsilon^{0.25}}\big)=\tilde{O}(\log n/\epsilon^{1.75})$, which gives an upper bound on the overall iteration number. The detailed proof is deferred to Appendix C.3.

Remark 8.

Although Theorem 7 only demonstrates that our algorithms will visit some $\epsilon$-approximate second-order stationary point during their execution with high probability, it is straightforward to identify one of them if we add a termination condition: once Negative Curvature Finding (Algorithm 1 or Algorithm 2) is applied, we record the position $\mathbf{x}_{t_{0}}$ and the function value decrease due to the subsequent perturbation. If the function value decrease is larger than $\frac{1}{384}\sqrt{\frac{\epsilon^{3}}{\rho}}$, as per Lemma 6, then the algorithm makes progress. Otherwise, $\mathbf{x}_{t_{0}}$ is an $\epsilon$-approximate second-order stationary point with high probability.
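A sketch of this termination test in code (the function name `made_progress` is ours), to be called right after the perturbation step:

    import numpy as np

    def made_progress(f, x_saddle, x_perturbed, eps, rho):
        # Remark 8: after a negative-curvature perturbation, check whether the function
        # value dropped by at least (1/384) * sqrt(eps^3 / rho).  If not, x_saddle is an
        # eps-approximate second-order stationary point with high probability.
        return f(x_saddle) - f(x_perturbed) >= np.sqrt(eps ** 3 / rho) / 384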

3 Stochastic setting

In this section, we present a stochastic version of Algorithm 1 using stochastic gradients, and demonstrate that it can also be used to escape saddle points and obtains a polynomial speedup in $\log n$ compared to the perturbed stochastic gradient descent (PSGD) algorithm in [20].

3.1 Stochastic negative curvature finding

In the stochastic gradient descent setting, the exact gradient oracle $\nabla f$ of the function $f$ cannot be accessed. Instead, we only have unbiased stochastic gradients $\mathbf{g}(\mathbf{x};\theta)$ such that

$\nabla f(\mathbf{x})=\mathbb{E}_{\theta\sim\mathcal{D}}[\mathbf{g}(\mathbf{x};\theta)]\qquad\forall\mathbf{x}\in\mathbb{R}^{n},$ (34)

where $\mathcal{D}$ stands for the probability distribution followed by the random variable $\theta$. Define

$\zeta(\mathbf{x};\theta):=\mathbf{g}(\mathbf{x};\theta)-\nabla f(\mathbf{x})$ (35)

to be the error term of the stochastic gradient. Then the expected value of the vector $\zeta(\mathbf{x};\theta)$ at any $\mathbf{x}\in\mathbb{R}^{n}$ equals $\mathbf{0}$. Further, we assume the stochastic gradient $\mathbf{g}(\mathbf{x};\theta)$ also satisfies the following assumptions, which were adopted in previous literature; see e.g. [8, 19, 20, 34].

Assumption 1.

For any $\mathbf{x}\in\mathbb{R}^{n}$, the stochastic gradient $\mathbf{g}(\mathbf{x};\theta)$ with $\theta\sim\mathcal{D}$ satisfies:

$\Pr\big[\|\mathbf{g}(\mathbf{x};\theta)-\nabla f(\mathbf{x})\|\geq t\big]\leq 2\exp(-t^{2}/(2\sigma^{2})),\qquad\forall t\in\mathbb{R}.$ (36)
Assumption 2.

For any $\theta\in\text{supp}(\mathcal{D})$, $\mathbf{g}(\mathbf{x};\theta)$ is $\tilde{\ell}$-Lipschitz for some constant $\tilde{\ell}$:

$\|\mathbf{g}(\mathbf{x}_{1};\theta)-\mathbf{g}(\mathbf{x}_{2};\theta)\|\leq\tilde{\ell}\|\mathbf{x}_{1}-\mathbf{x}_{2}\|,\qquad\forall\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}.$ (37)

Assumption 2 arises from the fact that the stochastic gradient $\mathbf{g}$ is often obtained as an exact gradient of some smooth function,

$\mathbf{g}(\mathbf{x};\theta)=\nabla f(\mathbf{x};\theta).$ (38)

In this case, Assumption 2 guarantees that for any $\theta\sim\mathcal{D}$, the spectral norm of the Hessian of $f(\mathbf{x};\theta)$ is upper bounded by $\tilde{\ell}$. Under these two assumptions, we can construct the stochastic version of Algorithm 1, as shown in Algorithm 3 below.
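As a concrete instance of (38), the stochastic gradient can be a minibatch of component gradients. A schematic sketch follows; the interface (`grad_f_theta`, a list of `thetas`) is our own choice, and Assumption 1 additionally holds, for example, when the component gradients are uniformly bounded.

    import numpy as np

    def stochastic_gradient(grad_f_theta, x, thetas, rng, m=1):
        # Unbiased stochastic gradient: average of m component gradients grad f(x; theta_j),
        # where grad_f_theta(x, theta) is the exact gradient of a component f(.; theta).
        # If each component is ell_tilde-smooth, Assumption 2 holds for this oracle as well.
        idx = rng.choice(len(thetas), size=m)
        return np.mean([grad_f_theta(x, thetas[j]) for j in idx], axis=0)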

1  $\mathbf{y}_{0}\leftarrow 0$, $L_{0}\leftarrow r_{s}$;
2  for $t=1,\ldots,\mathscr{T}_{s}$ do
3      Sample $\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(m)}\}\sim\mathcal{D}$;
4      $\mathbf{g}(\mathbf{y}_{t-1})\leftarrow\frac{1}{m}\sum_{j=1}^{m}\big(\mathbf{g}(\mathbf{x}_{0}+\mathbf{y}_{t-1};\theta^{(j)})-\mathbf{g}(\mathbf{x}_{0};\theta^{(j)})\big)$;
5      $\mathbf{y}_{t}\leftarrow\mathbf{y}_{t-1}-\frac{1}{\ell}\big(\mathbf{g}(\mathbf{y}_{t-1})+\xi_{t}/L_{t-1}\big)$, where $\xi_{t}\sim\mathcal{N}\big(0,\frac{r_{s}^{2}}{n}I\big)$;
6      $L_{t}\leftarrow\frac{\|\mathbf{y}_{t}\|}{r_{s}}L_{t-1}$ and $\mathbf{y}_{t}\leftarrow\mathbf{y}_{t}\cdot\frac{r_{s}}{\|\mathbf{y}_{t}\|}$;
Output $\mathbf{y}_{\mathscr{T}_{s}}/r_{s}$.
Algorithm 3: Stochastic Negative Curvature Finding($\mathbf{x}_{0},r_{s},\mathscr{T}_{s},m$).
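A minimal Python sketch of Algorithm 3 (the callables `stoch_grad(x, theta)` and `sample_theta(rng)` are assumptions about how the stochastic gradient oracle of (34) is exposed; all names are ours):

    import numpy as np

    def stochastic_ncf(stoch_grad, sample_theta, x0, r_s, T_s, m, ell, rng=None):
        # Sketch of Algorithm 3: negative curvature finding with stochastic gradients.
        rng = np.random.default_rng() if rng is None else rng
        n = len(x0)
        y = np.zeros(n)
        L = r_s
        for _ in range(T_s):
            thetas = [sample_theta(rng) for _ in range(m)]
            # minibatch estimate of H(x0) @ y via gradient differences with shared thetas
            g = np.mean([stoch_grad(x0 + y, th) - stoch_grad(x0, th) for th in thetas], axis=0)
            xi = rng.normal(0.0, r_s / np.sqrt(n), size=n)   # xi ~ N(0, (r_s^2 / n) I)
            y = y - (g + xi / L) / ell
            L = L * np.linalg.norm(y) / r_s                  # track cumulative growth
            y = y * r_s / np.linalg.norm(y)                  # renormalize to norm r_s
        return y / r_s                                       # unit vector (since ||y|| = r_s)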

Similarly to the non-stochastic setting, Algorithm 3 can be used to escape saddle points and obtain a polynomial speedup in $\log n$ compared to the PSGD algorithm in [20]. This is quantified in the following theorem:

Theorem 9 (informal, full version deferred to Appendix D.2).

For any $\epsilon>0$ and a constant $0<\delta\leq 1$, our algorithm based on Algorithm 3 using only stochastic gradient descent satisfies that at least one of the iterates $\mathbf{x}_{t}$ will be an $\epsilon$-approximate second-order stationary point within

$\tilde{O}\Big(\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big)$ (39)

iterations, with probability at least $1-\delta$, where $f^{*}$ is the global minimum of $f$.

Footnote 4: Our algorithm based on Algorithm 3 has similarities to the Neon2$^{\text{online}}$ algorithm in [3]. Both algorithms find a second-order stationary point for stochastic optimization in $\tilde{O}(\log^{2}n/\epsilon^{4})$ iterations, and both apply directed perturbations based on the results of negative curvature finding. Nevertheless, our algorithm enjoys simplicity by only having a single loop, whereas Neon2$^{\text{online}}$ has a nested loop for boosting its confidence.

4 Numerical experiments

In this section, we provide numerical results that exhibit the power of our negative curvature finding algorithms for escaping saddle points. More experimental results can be found in Appendix E. All the experiments are performed in MATLAB R2019b on a computer with a six-core Intel Core i7 processor and 16GB memory, and the code is given in the supplementary material.

Comparison between Algorithm 1 and PGD.

We compare the performance of our Algorithm 1 with the perturbed gradient descent (PGD) algorithm in [20] on the test function $f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2}$, which has a saddle point at $(0,0)$. The advantage of Algorithm 1 is illustrated in Figure 1.

Figure 1: Runs of Algorithm 1 and PGD on the landscape $f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2}$. Parameters: $\eta=0.05$ (step length), $r=0.1$ (ball radius in PGD and parameter $r$ in Algorithm 1), $M=300$ (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 1 at time steps $t_{\text{NCGD}}=15$ and $t_{\text{NCGD}}=30$, and red points represent samplings of PGD at time steps $t_{\text{PGD}}=45$ and $t_{\text{PGD}}=90$. Algorithm 1 transforms an initial uniform-circle distribution into a distribution concentrated on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PGD even when $t_{\text{NCGD}}\ll t_{\text{PGD}}$.
Right: A histogram of the descent values obtained by Algorithm 1 and PGD, respectively, with $t_{\text{NCGD}}=30$ and $t_{\text{PGD}}=90$. Although we run three times as many iterations in PGD, over $40\%$ of the gradient descent paths still have a function value decrease no greater than $0.9$, while this ratio for Algorithm 1 is less than $5\%$.
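The full experiment code is given in the supplementary material; for reference, a minimal definition of this landscape and its exact gradient, which can be plugged into the Algorithm 1 sketch of Section 2.1 (function names are ours):

    import numpy as np

    def f1(x):
        # Test landscape of Figure 1: saddle at the origin, negative curvature along x1.
        return x[0] ** 4 / 16 - x[0] ** 2 / 2 + 9 * x[1] ** 2 / 8

    def grad_f1(x):
        return np.array([x[0] ** 3 / 4 - x[0], 9 * x[1] / 4])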
Comparison between Algorithm 3 and PSGD.

We compare the performance of our Algorithm 3 with the perturbed stochastic gradient descent (PSGD) algorithm in [20] on the test function $f(x_{1},x_{2})=(x_{1}^{3}-x_{2}^{3})/2-3x_{1}x_{2}+(x_{1}^{2}+x_{2}^{2})^{2}/2$. Compared to the landscape of the previous experiment, this function deviates more from a quadratic form due to the cubic terms. Nevertheless, Algorithm 3 still possesses a notable advantage over PSGD, as demonstrated in Figure 2.

Figure 2: Runs of Algorithm 3 and PSGD on the landscape $f(x_{1},x_{2})=\frac{x_{1}^{3}-x_{2}^{3}}{2}-3x_{1}x_{2}+\frac{1}{2}(x_{1}^{2}+x_{2}^{2})^{2}$. Parameters: $\eta=0.02$ (step length), $r=0.01$ (variance in PSGD and $r_{s}$ in Algorithm 3), $M=300$ (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 3 at time steps $t_{\text{SNCGD}}=15$ and $t_{\text{SNCGD}}=30$, and red points represent samplings of PSGD at time steps $t_{\text{PSGD}}=30$ and $t_{\text{PSGD}}=60$. Algorithm 3 transforms an initial uniform-circle distribution into a distribution concentrated on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PSGD even when $t_{\text{SNCGD}}\ll t_{\text{PSGD}}$.
Right: A histogram of the descent values obtained by Algorithm 3 and PSGD, respectively, with $t_{\text{SNCGD}}=30$ and $t_{\text{PSGD}}=60$. Although we run twice as many iterations in PSGD, over $50\%$ of the SGD paths still have a function value decrease no greater than $0.6$, while this ratio for Algorithm 3 is less than $10\%$.
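Again for reference, a minimal definition of this landscape and its exact gradient (function names are ours; a stochastic oracle for the Algorithm 3 and PSGD comparison can be obtained, e.g., by adding independent noise to `grad_f2`, while the exact noise model used in the experiment is in the supplementary material):

    import numpy as np

    def f2(x):
        # Test landscape of Figure 2: saddle at the origin with Hessian eigenvalues +3 and -3.
        return (x[0] ** 3 - x[1] ** 3) / 2 - 3 * x[0] * x[1] + (x[0] ** 2 + x[1] ** 2) ** 2 / 2

    def grad_f2(x):
        sq = x[0] ** 2 + x[1] ** 2
        return np.array([1.5 * x[0] ** 2 - 3 * x[1] + 2 * x[0] * sq,
                         -1.5 * x[1] ** 2 - 3 * x[0] + 2 * x[1] * sq])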

Acknowledgements

We thank Jiaqi Leng for valuable suggestions and generous help on the design of numerical experiments. TL was supported by the NSF grant PHY-1818914 and a Samsung Advanced Institute of Technology Global Research Partnership.

References

  • [1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma, Finding approximate local minima faster than gradient descent, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199, 2017, arXiv:1611.01146.
  • [2] Zeyuan Allen-Zhu, Natasha 2: Faster non-convex optimization than SGD, Advances in Neural Information Processing Systems, pp. 2675–2686, 2018, arXiv:1708.08694.
  • [3] Zeyuan Allen-Zhu and Yuanzhi Li, Neon2: Finding local minima via first-order oracles, Advances in Neural Information Processing Systems, pp. 3716–3726, 2018, arXiv:1711.06673.
  • [4] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro, Global optimality of local search for low rank matrix recovery, Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3880–3888, 2016, arXiv:1605.07221.
  • [5] Alan J. Bray and David S. Dean, Statistics of critical points of Gaussian fields on large-dimensional spaces, Physical Review Letters 98 (2007), no. 15, 150201, arXiv:cond-mat/0611023.
  • [6] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization 28 (2018), no. 2, 1751–1772, arXiv:1611.00756.
  • [7] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Advances in neural information processing systems, pp. 2933–2941, 2014, arXiv:1406.2572.
  • [8] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang, Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, Advances in Neural Information Processing Systems, pp. 689–699, 2018, arXiv:1807.01695.
  • [9] Cong Fang, Zhouchen Lin, and Tong Zhang, Sharp analysis for nonconvex SGD escaping from saddle points, Conference on Learning Theory, pp. 1192–1234, 2019, arXiv:1902.00247.
  • [10] Yan V. Fyodorov and Ian Williams, Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity, Journal of Statistical Physics 129 (2007), no. 5-6, 1081–1116, arXiv:cond-mat/0702601.
  • [11] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, Proceedings of the 28th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 40, pp. 797–842, 2015, arXiv:1503.02101.
  • [12] Rong Ge, Chi Jin, and Yi Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, International Conference on Machine Learning, pp. 1233–1242, PMLR, 2017, arXiv:1704.00708.
  • [13] Rong Ge, Jason D. Lee, and Tengyu Ma, Matrix completion has no spurious local minimum, Advances in Neural Information Processing Systems, pp. 2981–2989, 2016, arXiv:1605.07272.
  • [14] Rong Ge, Jason D. Lee, and Tengyu Ma, Learning one-hidden-layer neural networks with landscape design, International Conference on Learning Representations, 2018, arXiv:1711.00501.
  • [15] Rong Ge, Zhize Li, Weiyao Wang, and Xiang Wang, Stabilized SVRG: Simple variance reduction for nonconvex optimization, Conference on Learning Theory, pp. 1394–1448, PMLR, 2019, arXiv:1905.00529.
  • [16] Rong Ge and Tengyu Ma, On the optimization landscape of tensor decompositions, Advances in Neural Information Processing Systems, pp. 3656–3666, Curran Associates Inc., 2017, arXiv:1706.05598.
  • [17] Moritz Hardt, Tengyu Ma, and Benjamin Recht, Gradient descent learns linear dynamical systems, Journal of Machine Learning Research 19 (2018), no. 29, 1–44, arXiv:1609.05191.
  • [18] Prateek Jain, Chi Jin, Sham Kakade, and Praneeth Netrapalli, Global convergence of non-convex gradient descent for computing matrix squareroot, Artificial Intelligence and Statistics, pp. 479–488, 2017, arXiv:1507.05854.
  • [19] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan, How to escape saddle points efficiently, Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1724–1732, 2017, arXiv:1703.00887.
  • [20] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I. Jordan, On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points, Journal of the ACM (JACM) 68 (2021), no. 2, 1–29, arXiv:1902.04811.
  • [21] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, Conference on Learning Theory, pp. 1042–1085, 2018, arXiv:1711.10456.
  • [22] Michael I. Jordan, On gradient-based optimization: Accelerated, distributed, asynchronous and stochastic optimization, https://www.youtube.com/watch?v=VE2ITg%5FhGnI, 2017.
  • [23] Zhize Li, SSRGD: Simple stochastic recursive gradient descent for escaping saddle points, Advances in Neural Information Processing Systems 32 (2019), 1523–1533, arXiv:1904.09265.
  • [24] Mingrui Liu, Zhe Li, Xiaoyu Wang, Jinfeng Yi, and Tianbao Yang, Adaptive negative curvature descent with applications in non-convex optimization, Advances in Neural Information Processing Systems 31 (2018), 4853–4862.
  • [25] Yurii Nesterov and Boris T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming 108 (2006), no. 1, 177–205.
  • [26] Yurii E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k2){O}(1/k^{2}), Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
  • [27] Ju Sun, Qing Qu, and John Wright, A geometric analysis of phase retrieval, Foundations of Computational Mathematics 18 (2018), no. 5, 1131–1198, arXiv:1602.06664.
  • [28] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, Advances in neural Information Processing Systems, pp. 2899–2908, 2018, arXiv:1711.02838.
  • [29] Yi Xu, Rong Jin, and Tianbao Yang, NEON+: Accelerated gradient methods for extracting negative curvature for non-convex optimization, 2017, arXiv:1712.01033.
  • [30] Yi Xu, Rong Jin, and Tianbao Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, Advances in Neural Information Processing Systems, pp. 5530–5540, 2018, arXiv:1711.01944.
  • [31] Yaodong Yu, Pan Xu, and Quanquan Gu, Third-order smoothness helps: Faster stochastic optimization algorithms for finding local minima, Advances in Neural Information Processing Systems (2018), 4530–4540.
  • [32] Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu, A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery, International Conference on Machine Learning, pp. 5862–5871, PMLR, 2018.
  • [33] Dongruo Zhou and Quanquan Gu, Stochastic recursive variance-reduced cubic regularization methods, International Conference on Artificial Intelligence and Statistics, pp. 3980–3990, 2020, arXiv:1901.11518.
  • [34] Dongruo Zhou, Pan Xu, and Quanquan Gu, Finding local minima via stochastic nested variance reduction, 2018, arXiv:1806.08782.
  • [35] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic variance-reduced cubic regularized Newton methods, International Conference on Machine Learning, pp. 5990–5999, PMLR, 2018, arXiv:1802.04796.
  • [36] Dongruo Zhou, Pan Xu, and Quanquan Gu, Stochastic nested variance reduction for nonconvex optimization, Journal of Machine Learning Research 21 (2020), no. 103, 1–63, arXiv:1806.07811.

Appendix A Auxiliary lemmas

In this section, we introduce auxiliary lemmas that are necessary for our proofs.

Lemma 10 ([20, Lemma 19]).

If f()f(\cdot) is \ell-smooth and ρ\rho-Hessian Lipschitz, η=1/\eta=1/\ell, then the gradient descent sequence {𝐱t}\{\mathbf{x}_{t}\} obtained by 𝐱t+1:=𝐱tηf(𝐱t)\mathbf{x}_{t+1}:=\mathbf{x}_{t}-\eta\nabla f(\mathbf{x}_{t}) satisfies:

\displaystyle f(\mathbf{x}_{t+1})-f(\mathbf{x}_{t})\leq-\eta\|\nabla f(\mathbf{x}_{t})\|^{2}/2, (40)

for any step tt in which Negative Curvature Finding is not called.

Lemma 11 ([20, Lemma 23]).

For an $\ell$-smooth and $\rho$-Hessian Lipschitz function $f$ whose stochastic gradient satisfies Assumption 1, there exists an absolute constant $c$ such that, for any fixed $t,t_{0},\iota>0$, with probability at least $1-4e^{-\iota}$, the stochastic gradient descent sequence in Algorithm 8 satisfies:

f(𝐱t0+t)f(𝐱t0)η8i=0t1f(𝐱t0+i)2+cσ2(t+ι),\displaystyle f(\mathbf{x}_{t_{0}+t})-f(\mathbf{x}_{t_{0}})\leq-\frac{\eta}{8}\sum_{i=0}^{t-1}\|\nabla f(\mathbf{x}_{t_{0}+i})\|^{2}+c\cdot\frac{\sigma^{2}}{\ell}(t+\iota), (41)

provided that Stochastic Negative Curvature Finding has not been called during steps $t_{0}$ to $t_{0}+t$.

Lemma 12 ([20, Lemma 29]).

Denote α(t):=[τ=0t1(1+ηγ)2(t1τ)]1/2\alpha(t):=\Big{[}\sum_{\tau=0}^{t-1}(1+\eta\gamma)^{2(t-1-\tau)}\Big{]}^{1/2} and β(t):=(1+ηγ)t/2ηγ\beta(t):=(1+\eta\gamma)^{t}/\sqrt{2\eta\gamma}. If ηγ[0,1]\eta\gamma\in[0,1], then we have

  1. 1.

    α(t)β(t)\alpha(t)\leq\beta(t) for any tt\in\mathbb{N};

  2. 2.

    α(t)β(t)/3,tln2/(ηγ)\alpha(t)\geq\beta(t)/\sqrt{3},\quad\forall t\geq\ln 2/(\eta\gamma).
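For intuition, part 1 can be verified directly by summing the geometric series defining $\alpha(t)$ (a routine check, not part of the cited lemma): for $\eta\gamma>0$,

\displaystyle\alpha(t)^{2}=\sum_{s=0}^{t-1}(1+\eta\gamma)^{2s}=\frac{(1+\eta\gamma)^{2t}-1}{(1+\eta\gamma)^{2}-1}\leq\frac{(1+\eta\gamma)^{2t}}{2\eta\gamma}=\beta(t)^{2},

since $(1+\eta\gamma)^{2}-1=2\eta\gamma+\eta^{2}\gamma^{2}\geq 2\eta\gamma$; the case $\eta\gamma=0$ is trivial because $\beta(t)$ then diverges.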

Lemma 13 ([20, Lemma 30]).

Under the notation of Lemma 12 and Appendix D.1, letting γ:=λmin(~)-\gamma:=\lambda_{\min}(\tilde{\mathcal{H}}), for any 0t𝒯s0\leq t\leq\mathscr{T}_{s} and ι>0\iota>0 we have

\displaystyle\Pr\Big(\|\mathbf{q}_{p}(t)\|\leq\beta(t)\eta r_{s}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}, (42)

and

\displaystyle\Pr\Big(\|\mathbf{q}_{p}(t)\|\geq\frac{\beta(t)\eta r_{s}}{10\sqrt{n}}\cdot\frac{\delta}{\mathscr{T}_{s}}\Big)\geq 1-\frac{\delta}{\mathscr{T}_{s}}, (43)
\displaystyle\Pr\Big(\|\mathbf{q}_{p}(t)\|\geq\frac{\beta(t)\eta r_{s}}{10\sqrt{n}}\cdot\delta\Big)\geq 1-\delta. (44)
Definition 14 ([20, Definition 32]).

A random vector $\mathbf{X}\in\mathbb{R}^{n}$ is norm-subGaussian with parameter $\sigma$ (denoted nSG($\sigma$)) if:

Pr(𝐗𝔼[𝐗]t)2et22σ2,t.\displaystyle\Pr(\|\mathbf{X}-\mathbb{E}[\mathbf{X}]\|\geq t)\leq 2e^{-\frac{t^{2}}{2\sigma^{2}}},\quad\forall t\in\mathbb{R}. (45)
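As a simple illustration (ours, not part of [20]): any random vector with almost surely bounded deviation $\|\mathbf{X}-\mathbb{E}[\mathbf{X}]\|\leq R$ is nSG($R$), since

\displaystyle\Pr(\|\mathbf{X}-\mathbb{E}[\mathbf{X}]\|\geq t)\leq\mathbf{1}\{t\leq R\}\leq 2e^{-\frac{t^{2}}{2R^{2}}}\quad\forall t\geq 0,

where the second inequality uses $2e^{-t^{2}/(2R^{2})}\geq 2e^{-1/2}>1$ for $t\leq R$.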
Lemma 15 ([20, Lemma 37]).

For independent random vectors $\mathbf{X}_{1},\ldots,\mathbf{X}_{\tau}\in\mathbb{R}^{n}$, each zero-mean and nSG($\sigma_{i}$) as defined in Definition 14, and for any $\iota>0$ and $B>b>0$, there exists an absolute constant $c$ such that, with probability at least $1-2n\log(B/b)\cdot e^{-\iota}$:

\displaystyle\sum_{i=1}^{\tau}\sigma_{i}^{2}\geq B\quad\text{or}\quad\big\|\sum_{i=1}^{\tau}\mathbf{X}_{i}\big\|\leq c\cdot\sqrt{\max\left\{\sum_{i=1}^{\tau}\sigma_{i}^{2},b\right\}\cdot\iota}. (46)

The next two lemmas are frequently used in the large gradient scenario of the accelerated gradient descent method:

Lemma 16 ([21, Lemma 7]).

Consider the setting of Theorem 23 and define a new parameter

𝒯~:=(ρϵ)1/4cA,\displaystyle\tilde{\mathscr{T}}:=\frac{\sqrt{\ell}}{(\rho\epsilon)^{1/4}}\cdot c_{A}, (47)

for some large enough constant cAc_{A}. If f(𝐱τ)ϵ\|\nabla f(\mathbf{x}_{\tau})\|\geq\epsilon for all τ[0,𝒯~]\tau\in[0,\tilde{\mathscr{T}}], then there exists a large enough positive constant cA0c_{A0}, such that if we choose cAcA0c_{A}\geq c_{A0}, by running Algorithm 2 we have E𝒯~E0E_{\tilde{\mathscr{T}}}-E_{0}\leq-\mathscr{E}, in which =ϵ3ρcA7\mathscr{E}=\sqrt{\frac{\epsilon^{3}}{\rho}}\cdot c_{A}^{-7}, and EτE_{\tau} is defined as:

Eτ:=f(𝐱τ)+12η𝐯τ2.\displaystyle E_{\tau}:=f(\mathbf{x}_{\tau})+\frac{1}{2\eta}\|\mathbf{v}_{\tau}\|^{2}. (48)
Lemma 17 ([21, Lemma 4 and Lemma 5]).

Assume that the function $f$ is $\ell$-smooth. In the setting of Theorem 23, for every iteration that is not within $\mathscr{T}^{\prime}$ steps after a uniform perturbation, we have:

Eτ+1Eτ,\displaystyle E_{\tau+1}\leq E_{\tau}, (49)

where EτE_{\tau} is defined in Eq. (48) in Lemma 16.

Appendix B Proof details of negative curvature finding by gradient descents

B.1 Proof of Lemma 18

Lemma 18.

Under the setting of Proposition 3, we use αt\alpha_{t} to denote

αt=𝐲t,/𝐲t,\displaystyle\alpha_{t}=\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|, (50)

where 𝐲t,\mathbf{y}_{t,\parallel} defined in Eq. (10) is the component of 𝐲t\mathbf{y}_{t} in the subspace 𝔖\mathfrak{S}_{\parallel} spanned by {𝐮1,𝐮2,,𝐮p}\left\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\right\}. Then, during all the 𝒯\mathscr{T} iterations of Algorithm 1, we have αtαmin\alpha_{t}\geq\alpha_{\min} for

αmin=δ04πn,\displaystyle\alpha_{\min}=\frac{\delta_{0}}{4}\sqrt{\frac{\pi}{n}}, (51)

given that α0πnδ0\alpha_{0}\geq\sqrt{\frac{\pi}{n}}\delta_{0}.

To prove Lemma 18, we first introduce the following lemma:

Lemma 19.

Under the setting of Proposition 3 and Lemma 18, for any t>0t>0 and initial value 𝐲0\mathbf{y}_{0}, αt\alpha_{t} achieves its minimum possible value in the specific landscapes satisfying

λ1=λ2==λp=λp+1=λp+2==λn1=ρϵ.\displaystyle\lambda_{1}=\lambda_{2}=\cdots=\lambda_{p}=\lambda_{p+1}=\lambda_{p+2}=\cdots=\lambda_{n-1}=-\sqrt{\rho\epsilon}. (52)
Proof.

We prove this lemma by contradiction. Suppose for some τ>0\tau>0 and initial value 𝐲0\mathbf{y}_{0}, ατ\alpha_{\tau} achieves its minimum value in some landscape gg that does not satisfy Eq. (52). That is to say, there exists some k[n1]k\in[n-1] such that λkρϵ\lambda_{k}\neq-\sqrt{\rho\epsilon}.

We first consider the case k>pk>p. Since λn1,,λp+1ρϵ\lambda_{n-1},\ldots,\lambda_{p+1}\geq-\sqrt{\rho\epsilon}, we have λk>ρϵ\lambda_{k}>-\sqrt{\rho\epsilon}. We use {𝐲g,t}\left\{\mathbf{y}_{g,t}\right\} to denote the iteration sequence of 𝐲t\mathbf{y}_{t} in this landscape.

Consider another landscape hh with λk=ρϵ\lambda_{k}=-\sqrt{\rho\epsilon} and all other eigenvalues being the same as gg, and we use {𝐲h,t}\left\{\mathbf{y}_{h,t}\right\} to denote the iteration sequence of 𝐲t\mathbf{y}_{t} in this landscape. Furthermore, we assume that at each gradient query, the deviation from pure quadratic approximation defined in Eq. (16), denoted Δh,t\Delta_{h,t}, is the same as the corresponding deviation Δg,t\Delta_{g,t} in the landscape gg along all the directions other than 𝐮k\mathbf{u}_{k}. As for its component Δh,t,k\Delta_{h,t,k} along 𝐮k\mathbf{u}_{k}, we set |Δh,t,k|=|Δg,t,k||\Delta_{h,t,k}|=|\Delta_{g,t,k}| with its sign being the same as yt,ky_{t,k}.

Under these settings, we have $y_{h,\tau,j}=y_{g,\tau,j}$ for any $j\in[n]$ with $j\neq k$. As for the component along $\mathbf{u}_{k}$, we have $|y_{h,\tau,k}|>|y_{g,\tau,k}|$. Hence, by the definition of $\mathbf{y}_{t,\parallel}$ in Eq. (10), we have

𝐲g,τ,=(j=1pyg,τ,j2)1/2=(j=1pyh,τ,j2)1/2=𝐲h,τ,,\displaystyle\|\mathbf{y}_{g,\tau,\parallel}\|=\big{(}\sum_{j=1}^{p}y_{g,\tau,j}^{2}\big{)}^{1/2}=\big{(}\sum_{j=1}^{p}y_{h,\tau,j}^{2}\big{)}^{1/2}=\|\mathbf{y}_{h,\tau,\parallel}\|, (53)

whereas

𝐲g,τ=(j=1nyg,τ,j2)1/2<(j=1nyh,τ,j2)1/2=𝐲h,τ,\displaystyle\|\mathbf{y}_{g,\tau}\|=\big{(}\sum_{j=1}^{n}y_{g,\tau,j}^{2}\big{)}^{1/2}<\big{(}\sum_{j=1}^{n}y_{h,\tau,j}^{2}\big{)}^{1/2}=\|\mathbf{y}_{h,\tau}\|, (54)

indicating

αg,τ=𝐲g,τ,𝐲g,τ>𝐲h,τ,𝐲h,τ=αh,τ,\displaystyle\alpha_{g,\tau}=\frac{\|\mathbf{y}_{g,\tau,\parallel}\|}{\|\mathbf{y}_{g,\tau}\|}>\frac{\|\mathbf{y}_{h,\tau,\parallel}\|}{\|\mathbf{y}_{h,\tau}\|}=\alpha_{h,\tau}, (55)

contradicting the supposition that $\alpha_{\tau}$ achieves its minimum possible value in $g$. Similarly, we can also obtain a contradiction for the case $k\leq p$. ∎

Equipped with Lemma 19, we are now ready to prove Lemma 18.

Proof.

In this proof, we consider the worst case, where the initial value α0=πnδ0\alpha_{0}=\sqrt{\frac{\pi}{n}}\delta_{0}. Also, according to Lemma 19, we assume that the eigenvalues satisfy

λ1=λ2==λp=λp+1=λp+2==λn1=ρϵ.\displaystyle\lambda_{1}=\lambda_{2}=\cdots=\lambda_{p}=\lambda_{p+1}=\lambda_{p+2}=\cdots=\lambda_{n-1}=-\sqrt{\rho\epsilon}. (56)

Moreover, we assume that each time we make a gradient call at some point $\mathbf{x}$, the deviation term $\Delta$ from the pure quadratic approximation

Δ=hf(𝐱)(𝟎)𝐱\displaystyle\Delta=\nabla h_{f}(\mathbf{x})-\mathcal{H}(\mathbf{0})\mathbf{x} (57)

lies in the direction that can make αt\alpha_{t} as small as possible. Then, the component Δ\Delta_{\perp} in 𝔖\mathfrak{S}_{\perp} should be in the direction of 𝐱\mathbf{x}_{\perp}, and the component Δ\Delta_{\parallel} in 𝔖\mathfrak{S}_{\parallel} should be in the opposite direction to 𝐱\mathbf{x}_{\parallel}, as long as Δ𝐱\|\Delta_{\parallel}\|\leq\|\mathbf{x}_{\parallel}\|. Hence in this case, we have 𝐲t,/𝐲t\|\mathbf{y}_{t,\perp}\|/\|\mathbf{y}_{t}\| being non-decreasing. Also, it admits the following recurrence formula:

𝐲t+1,\displaystyle\|\mathbf{y}_{t+1,\perp}\| =(1+ρϵ/)𝐲t,+Δ/\displaystyle=(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\perp}\|+\|\Delta_{\perp}\|/\ell (58)
(1+ρϵ/)𝐲t,+Δ/\displaystyle\leq(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\perp}\|+\|\Delta\|/\ell (59)
(1+ρϵ/+Δ𝐲t,)𝐲t,,\displaystyle\leq\Big{(}1+\sqrt{\rho\epsilon}/\ell+\frac{\|\Delta\|}{\ell\|\mathbf{y}_{t,\perp}\|}\Big{)}\|\mathbf{y}_{t,\perp}\|, (60)

where the second inequality is due to the fact that ν\nu can be an arbitrarily small positive number. Note that since 𝐲t,/𝐲t\|\mathbf{y}_{t,\perp}\|/\|\mathbf{y}_{t}\| is non-decreasing in this worst-case scenario, we have

Δ𝐲t,Δ𝐲t𝐲0𝐲0,2Δ𝐲t2ρr,\displaystyle\frac{\|\Delta\|}{\|\mathbf{y}_{t,\perp}\|}\leq\frac{\|\Delta\|}{\|\mathbf{y}_{t}\|}\cdot\frac{\|\mathbf{y}_{0}\|}{\|\mathbf{y}_{0,\perp}\|}\leq\frac{2\|\Delta\|}{\|\mathbf{y}_{t}\|}\leq 2\rho r, (61)

which leads to

𝐲t+1,(1+ρϵ/+2ρr/)𝐲t,.\displaystyle\|\mathbf{y}_{t+1,\perp}\|\leq(1+\sqrt{\rho\epsilon}/\ell+2\rho r/\ell)\|\mathbf{y}_{t,\perp}\|. (62)

On the other hand, suppose for some value $t$, we have $\alpha_{k}\geq\alpha_{\min}$ for all $1\leq k\leq t$. Then,

\displaystyle\|\mathbf{y}_{t+1,\parallel}\|=(1+\sqrt{\rho\epsilon}/\ell)\|\mathbf{y}_{t,\parallel}\|-\|\Delta_{\parallel}\|/\ell (63)
\displaystyle\geq\Big(1+\sqrt{\rho\epsilon}/\ell-\frac{\|\Delta\|}{\ell\|\mathbf{y}_{t,\parallel}\|}\Big)\|\mathbf{y}_{t,\parallel}\|. (64)

Note that since 𝐲t,/𝐲tαmin\|\mathbf{y}_{t,\parallel}\|/\|\mathbf{y}_{t}\|\geq\alpha_{\min}, we have

\displaystyle\frac{\|\Delta\|}{\|\mathbf{y}_{t,\parallel}\|}\leq\frac{\|\Delta\|}{\alpha_{\min}\|\mathbf{y}_{t}\|}\leq\rho r/\alpha_{\min}, (65)

which leads to

𝐲t+1,(1+ρϵ/ρr/(αmin))𝐲t,.\displaystyle\|\mathbf{y}_{t+1,\parallel}\|\geq(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell))\|\mathbf{y}_{t,\parallel}\|. (66)

Then we can observe that

𝐲t,𝐲t,𝐲0,𝐲0,(1+ρϵ/ρr/(αmin)1+ρϵ/+2ρr/)t,\displaystyle\frac{\|\mathbf{y}_{t,\parallel}\|}{\|\mathbf{y}_{t,\perp}\|}\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{\|\mathbf{y}_{0,\perp}\|}\cdot\Big{(}\frac{1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell)}{1+\sqrt{\rho\epsilon}/\ell+2\rho r/\ell}\Big{)}^{t}, (67)

where

1+ρϵ/ρr/(αmin)1+ρϵ/+2ρr/\displaystyle\frac{1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell)}{1+\sqrt{\rho\epsilon}/\ell+2\rho r/\ell} (1+ρϵ/ρr/(αmin))(1ρϵ/2ρr/)\displaystyle\geq(1+\sqrt{\rho\epsilon}/\ell-\rho r/(\alpha_{\min}\ell))(1-\sqrt{\rho\epsilon}/\ell-2\rho r/\ell) (68)
1ρϵ/22ρrαmin11𝒯,\displaystyle\geq 1-\rho\epsilon/\ell^{2}-\frac{2\rho r}{\alpha_{\min}\ell}\geq 1-\frac{1}{\mathscr{T}}, (69)

by which we can derive that

\displaystyle\frac{\|\mathbf{y}_{t,\parallel}\|}{\|\mathbf{y}_{t,\perp}\|}\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{\|\mathbf{y}_{0,\perp}\|}(1-1/\mathscr{T})^{t} (70)
\displaystyle\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{\|\mathbf{y}_{0,\perp}\|}\exp\Big(-\frac{t}{\mathscr{T}-1}\Big)\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{2\|\mathbf{y}_{0,\perp}\|}, (71)

indicating

αt=𝐲t,𝐲t,2+𝐲t,2𝐲0,4𝐲0,αmin.\displaystyle\alpha_{t}=\frac{\|\mathbf{y}_{t,\parallel}\|}{\sqrt{\|\mathbf{y}_{t,\parallel}\|^{2}+\|\mathbf{y}_{t,\perp}\|^{2}}}\geq\frac{\|\mathbf{y}_{0,\parallel}\|}{4\|\mathbf{y}_{0,\perp}\|}\geq\alpha_{\min}. (72)

Hence, as long as $\alpha_{k}\geq\alpha_{\min}$ for all $1\leq k\leq t-1$, we also have $\alpha_{t}\geq\alpha_{\min}$ provided $t\leq\mathscr{T}$. Since $\alpha_{0}\geq\alpha_{\min}$, we can thus claim that $\alpha_{t}\geq\alpha_{\min}$ for any $t\leq\mathscr{T}$ by induction. ∎

B.2 Proof of Proposition 5

To make it easier to analyze the properties and running time of Algorithm 2, we introduce a new Algorithm 4,

1 𝐱0\mathbf{x}_{0}\leftarrowUniform(𝔹0(r))(\mathbb{B}_{0}(r^{\prime})) where 𝔹0(r)\mathbb{B}_{0}(r^{\prime}) is the 2\ell_{2}-norm ball centered at 𝐱~\tilde{\mathbf{x}} with radius rr^{\prime};
2 𝐳0𝐱0\mathbf{z}_{0}\leftarrow\mathbf{x}_{0};
3 for t=0,...,\mathscr{T}^{\prime}-1 do
4       𝐱t+1𝐳tη(𝐳t𝐱~rf(r𝐳t𝐱~𝐳t𝐱~+𝐱~)f(𝐱~))\mathbf{x}_{t+1}\leftarrow\mathbf{z}_{t}-\eta\Big{(}\frac{\|\mathbf{z}_{t}-\tilde{\mathbf{x}}\|}{r^{\prime}}\nabla f\Big{(}r^{\prime}\cdot\frac{\mathbf{z}_{t}-\tilde{\mathbf{x}}}{\|\mathbf{z}_{t}-\tilde{\mathbf{x}}\|}+\tilde{\mathbf{x}}\Big{)}-\nabla f(\tilde{\mathbf{x}})\Big{)};
5       𝐯t+1𝐱t+1𝐱t\mathbf{v}_{t+1}\leftarrow\mathbf{x}_{t+1}-\mathbf{x}_{t};
6       𝐳t+1𝐱t+1+(1θ)𝐯t+1\mathbf{z}_{t+1}\leftarrow\mathbf{x}_{t+1}+(1-\theta)\mathbf{v}_{t+1};
7      
Output 𝐱𝒯𝐱~𝐱𝒯𝐱~\frac{\mathbf{x}_{\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}\|}.
Algorithm 4 Accelerated Negative Curvature Finding without Renormalization(𝐱~,r,𝒯\tilde{\mathbf{x}},r^{\prime},\mathscr{T}^{\prime}).

which has a more straightforward structure and has the same effect as Algorithm 2 near any saddle point 𝐱~\tilde{\mathbf{x}}. Quantitatively, this is demonstrated in the following lemma:

Lemma 20.

Under the setting of Proposition 5, suppose the perturbation in Line 2 of Algorithm 2 is added at t=0t=0. Then with the same value of rr^{\prime}, 𝒯\mathscr{T}^{\prime}, 𝐱~\tilde{\mathbf{x}} and 𝐱0\mathbf{x}_{0}, the output of Algorithm 4 is the same as the unit vector 𝐞^\hat{\mathbf{e}} in Algorithm 2 obtained 𝒯\mathscr{T}^{\prime} steps later. In other words, if we separately denote the iteration sequence of {𝐱t}\left\{\mathbf{x}_{t}\right\} in Algorithm 2 and Algorithm 4 as

{𝐱1,0,𝐱1,1,,𝐱1,𝒯},{𝐱2,0,𝐱2,1,,𝐱2,𝒯},\displaystyle\left\{\mathbf{x}_{1,0},\mathbf{x}_{1,1},\ldots,\mathbf{x}_{1,\mathscr{T}^{\prime}}\right\},\qquad\left\{\mathbf{x}_{2,0},\mathbf{x}_{2,1},\ldots,\mathbf{x}_{2,\mathscr{T}^{\prime}}\right\}, (73)

we have

𝐱1,𝒯𝐱~𝐱1,𝒯𝐱~=𝐱2,𝒯𝐱~𝐱2,𝒯𝐱~.\displaystyle\frac{\mathbf{x}_{1,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{1,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}\|}=\frac{\mathbf{x}_{2,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}}{\|\mathbf{x}_{2,\mathscr{T}^{\prime}}-\tilde{\mathbf{x}}\|}. (74)
Proof.

Without loss of generality, we assume $\tilde{\mathbf{x}}=\mathbf{0}$ and $\nabla f(\tilde{\mathbf{x}})=\mathbf{0}$. We prove the desired relationship by induction. Suppose the following identities hold for all $k\leq t$, where $t$ is some natural number:

\displaystyle\frac{\mathbf{x}_{2,k}}{\|\mathbf{x}_{2,k}\|}=\frac{\mathbf{x}_{1,k}}{r^{\prime}},\qquad\frac{\mathbf{z}_{2,k}}{\|\mathbf{z}_{2,k}\|}=\frac{\mathbf{z}_{1,k}}{r^{\prime}}. (75)

Then,

𝐱2,t+1=𝐳2,tη𝐳2,trf(𝐳2,tr/𝐳2,t)=𝐳2,tr(𝐳1,tηf(𝐳1,t)).\displaystyle\mathbf{x}_{2,t+1}=\mathbf{z}_{2,t}-\eta\cdot\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\nabla f(\mathbf{z}_{2,t}\cdot r^{\prime}/\|\mathbf{z}_{2,t}\|)=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}(\mathbf{z}_{1,t}-\eta\nabla f(\mathbf{z}_{1,t})). (76)

Adopting the notation in Algorithm 2, we use 𝐱1,t+1\mathbf{x}_{1,t+1}^{\prime} and 𝐳1,t+1\mathbf{z}_{1,t+1}^{\prime} to separately denote the value of 𝐱1,t+1\mathbf{x}_{1,t+1} and 𝐳1,t+1\mathbf{z}_{1,t+1} before renormalization:

𝐱1,t+1=𝐳1,tηf(𝐳1,t),𝐳1,t+1=𝐱1,t+1+(1θ)(𝐱1,t+1𝐱1,t).\displaystyle\mathbf{x}_{1,t+1}^{\prime}=\mathbf{z}_{1,t}-\eta\nabla f(\mathbf{z}_{1,t}),\quad\mathbf{z}_{1,t+1}^{\prime}=\mathbf{x}_{1,t+1}^{\prime}+(1-\theta)(\mathbf{x}_{1,t+1}^{\prime}-\mathbf{x}_{1,t}). (77)

Then,

𝐱2,t+1=𝐳2,tr(𝐳1,tηf(𝐳1,t))=𝐳2,tr𝐱1,t+1,\displaystyle\mathbf{x}_{2,t+1}=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}(\mathbf{z}_{1,t}-\eta\nabla f(\mathbf{z}_{1,t}))=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\mathbf{x}_{1,t+1}^{\prime}, (78)

which further leads to

𝐳2,t+1=𝐱2,t+1+(1θ)(𝐱2,t+1𝐱2,t)=𝐳2,tr𝐳1,t+1.\displaystyle\mathbf{z}_{2,t+1}=\mathbf{x}_{2,t+1}+(1-\theta)(\mathbf{x}_{2,t+1}-\mathbf{x}_{2,t})=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\mathbf{z}_{1,t+1}^{\prime}. (79)

Note that 𝐳1,t+1=r𝐳1,t+1𝐳1,t+1\mathbf{z}_{1,t+1}=\frac{r^{\prime}}{\|\mathbf{z}_{1,t+1}^{\prime}\|}\cdot\mathbf{z}_{1,t+1}^{\prime}, we thus have

𝐳2,t+1𝐳2,t+1=𝐳1,t+1r.\displaystyle\frac{\mathbf{z}_{2,t+1}}{\|\mathbf{z}_{2,t+1}\|}=\frac{\mathbf{z}_{1,t+1}}{r^{\prime}}. (80)

Hence,

𝐱2,t+1=𝐳2,tr𝐱1,t+1=𝐳2,tr𝐳1,t+1𝐳1,t𝐱1,t+1=𝐳2,t+1r𝐱1,t+1.\displaystyle\mathbf{x}_{2,t+1}=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\mathbf{x}_{1,t+1}^{\prime}=\frac{\|\mathbf{z}_{2,t}\|}{r^{\prime}}\cdot\frac{\|\mathbf{z}_{1,t+1}\|}{\|\mathbf{z}_{1,t}\|}\cdot\mathbf{x}_{1,t+1}=\frac{\|\mathbf{z}_{2,t+1}\|}{r^{\prime}}\cdot\mathbf{x}_{1,t+1}. (81)

Since (75) holds for $k=0$, by induction it holds for all $k\leq\mathscr{T}^{\prime}$; in particular, it holds for $k=\mathscr{T}^{\prime}$, which gives (74). ∎

Lemma 20 shows that Algorithm 4 also works in principle for finding the negative curvature near any saddle point $\tilde{\mathbf{x}}$. However, Algorithm 4 may produce an exponentially large $\|\mathbf{x}_{t}\|$ during execution, and it is hard to merge with the AGD iterations used in the large-gradient scenario. Hence, only Algorithm 2 is applicable in practical situations.
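For concreteness, the following Python sketch (our illustration, not the authors' implementation) mirrors the pseudocode of Algorithm 4; `grad_f` is an assumed gradient oracle, and `eta`, `theta`, `r_prime`, `T_prime` stand for $\eta$, $\theta$, $r^{\prime}$, $\mathscr{T}^{\prime}$.

```python
import numpy as np

def accelerated_ncf_no_renorm(grad_f, x_tilde, r_prime, T_prime, eta, theta,
                              rng=np.random.default_rng(0)):
    """Minimal sketch of Algorithm 4: accelerated negative curvature finding
    without renormalization, run around a candidate saddle point x_tilde."""
    n = x_tilde.shape[0]
    # Line 1: sample x_0 uniformly from the ball of radius r_prime around x_tilde.
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    x = x_tilde + r_prime * rng.uniform() ** (1.0 / n) * u
    z = x.copy()
    for _ in range(T_prime):
        d = z - x_tilde
        nd = np.linalg.norm(d)
        # Line 4: gradients are always queried on the sphere of radius r_prime
        # and rescaled back, so the oracle is never evaluated far from x_tilde.
        x_new = z - eta * ((nd / r_prime) * grad_f(x_tilde + r_prime * d / nd)
                           - grad_f(x_tilde))
        # Lines 5-6: momentum update.
        z = x_new + (1.0 - theta) * (x_new - x)
        x = x_new
    # ||x - x_tilde|| may grow exponentially with T_prime; only the direction
    # is returned, matching the output line of Algorithm 4.
    return (x - x_tilde) / np.linalg.norm(x - x_tilde)
```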

Use (𝐱~)\mathcal{H}(\tilde{\mathbf{x}}) to denote the Hessian matrix of ff at 𝐱~\tilde{\mathbf{x}}. Observe that (𝐱~)\mathcal{H}(\tilde{\mathbf{x}}) admits the following eigen-decomposition:

(𝐱~)=i=1nλi𝐮i𝐮iT,\displaystyle\mathcal{H}(\tilde{\mathbf{x}})=\sum_{i=1}^{n}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}, (82)

where the set {𝐮i}i=1n\{\mathbf{u}_{i}\}_{i=1}^{n} forms an orthonormal basis of n\mathbb{R}^{n}. Without loss of generality, we assume the eigenvalues λ1,λ2,,λn\lambda_{1},\lambda_{2},\ldots,\lambda_{n} corresponding to 𝐮1,𝐮2,,𝐮n\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n} satisfy

λ1λ2λn,\displaystyle\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}, (83)

in which $\lambda_{1}\leq-\sqrt{\rho\epsilon}$. If $\lambda_{n}\leq-\sqrt{\rho\epsilon}/2$, Proposition 5 holds directly, since no matter the value of $\hat{\mathbf{e}}$, we have $f(\mathbf{x}_{\mathscr{T}^{\prime}})-f(\tilde{\mathbf{x}})\leq-\sqrt{\epsilon^{3}/\rho}/384$. Hence, we only need to prove the case where $\lambda_{n}>-\sqrt{\rho\epsilon}/2$, in which there exists some $1\leq p<n$ with

λpρϵλp+1.\displaystyle\lambda_{p}\leq-\sqrt{\rho\epsilon}\leq\lambda_{p+1}. (84)

We use 𝔖\mathfrak{S}_{\parallel} to denote the subspace of n\mathbb{R}^{n} spanned by {𝐮1,𝐮2,,𝐮p}\left\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\right\}, and use 𝔖\mathfrak{S}_{\perp} to denote the subspace spanned by {𝐮p+1,𝐮p+2,,𝐮n}\left\{\mathbf{u}_{p+1},\mathbf{u}_{p+2},\ldots,\mathbf{u}_{n}\right\}. Then we can have the following lemma:

Lemma 21.

Under the setting of Proposition 5, we use αt\alpha_{t}^{\prime} to denote

αt=𝐱t,𝐱~𝐱t𝐱~,\displaystyle\alpha_{t}^{\prime}=\frac{\|\mathbf{x}_{t,\parallel}-\tilde{\mathbf{x}}_{\parallel}\|}{\|\mathbf{x}_{t}-\tilde{\mathbf{x}}\|}, (85)

in which 𝐱t,\mathbf{x}_{t,\parallel} is the component of 𝐱t\mathbf{x}_{t} in the subspace 𝔖\mathfrak{S}_{\parallel}. Then, during all the 𝒯\mathscr{T}^{\prime} iterations of Algorithm 4, we have αtαmin\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime} for

αmin=δ08πn,\displaystyle\alpha_{\min}^{\prime}=\frac{\delta_{0}}{8}\sqrt{\frac{\pi}{n}}, (86)

given that α0πnδ0\alpha_{0}^{\prime}\geq\sqrt{\frac{\pi}{n}}\delta_{0}.

Proof.

Without loss of generality, we assume 𝐱~=𝟎\tilde{\mathbf{x}}=\mathbf{0} and f(𝐱~)=𝟎\nabla f(\tilde{\mathbf{x}})=\mathbf{0}. In this proof, we consider the worst case, where the initial value α0=πnδ0\alpha_{0}^{\prime}=\sqrt{\frac{\pi}{n}}\delta_{0} and the component x0,nx_{0,n} along 𝐮n\mathbf{u}_{n} equals 0. In addition, according to the same proof of Lemma 19, we assume that the eigenvalues satisfy

λ1=λ2=λ3==λp=λp+1=λp+2==λn1=ρϵ.\displaystyle\lambda_{1}=\lambda_{2}=\lambda_{3}=\cdots=\lambda_{p}=\lambda_{p+1}=\lambda_{p+2}=\cdots=\lambda_{n-1}=-\sqrt{\rho\epsilon}. (87)

For the same reason, we assume that each time we make a gradient call at the point $\mathbf{z}_{t}$, the deviation term $\Delta$ from the pure quadratic approximation

Δ=𝐳tr(f(𝐳tr/𝐳t)(𝟎)r𝐳t𝐳t)\displaystyle\Delta=\frac{\|\mathbf{z}_{t}\|}{r^{\prime}}\cdot\Big{(}\nabla f(\mathbf{z}_{t}\cdot r^{\prime}/\|\mathbf{z}_{t}\|)-\mathcal{H}(\mathbf{0})\cdot\frac{r^{\prime}}{\|\mathbf{z}_{t}\|}\cdot\mathbf{z}_{t}\Big{)} (88)

lies in the direction that can make αt\alpha_{t}^{\prime} as small as possible. Then, the component Δ\Delta_{\parallel} in 𝔖\mathfrak{S}_{\parallel} should be in the opposite direction to 𝐯\mathbf{v}_{\parallel}, and the component Δ\Delta_{\perp} in 𝔖\mathfrak{S}_{\perp} should be in the direction of 𝐯\mathbf{v}_{\perp}. Hence in this case, we have both 𝐱t,/𝐱t\|\mathbf{x}_{t,\perp}\|/\|\mathbf{x}_{t}\| and 𝐳t,/𝐳t\|\mathbf{z}_{t,\perp}\|/\|\mathbf{z}_{t}\| being non-decreasing. Also, it admits the following recurrence formula:

𝐱t+2,\displaystyle\|\mathbf{x}_{t+2,\perp}\| (1+ηρϵ)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ.\displaystyle\leq(1+\eta\sqrt{\rho\epsilon})\big{(}\|\mathbf{x}_{t+1,\perp}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp}\|-\|\mathbf{x}_{t,\perp}\|)\big{)}+\eta\|\Delta_{\perp}\|. (89)

Since 𝐱t,/𝐱t\|\mathbf{x}_{t,\perp}\|/\|\mathbf{x}_{t}\| is non-decreasing in this worst-case scenario, we have

Δ𝐱t+1,Δ𝐱t+1𝐱0𝐱0,2Δ𝐱t+12ρr,\displaystyle\frac{\|\Delta_{\perp}\|}{\|\mathbf{x}_{t+1,\perp}\|}\leq\frac{\|\Delta\|}{\|\mathbf{x}_{t+1}\|}\cdot\frac{\|\mathbf{x}_{0}\|}{\|\mathbf{x}_{0,\perp}\|}\leq\frac{2\|\Delta\|}{\|\mathbf{x}_{t+1}\|}\leq 2\rho r^{\prime}, (90)

which leads to

𝐱t+2,(1+ηρϵ+2ηρr)((2θ)𝐱t+1,(1θ)𝐱t,).\displaystyle\|\mathbf{x}_{t+2,\perp}\|\leq(1+\eta\sqrt{\rho\epsilon}+2\eta\rho r^{\prime})\big{(}(2-\theta)\|\mathbf{x}_{t+1,\perp}\|-(1-\theta)\|\mathbf{x}_{t,\perp}\|\big{)}. (91)

On the other hand, suppose for some value $t$, we have $\alpha_{k}^{\prime}\geq\alpha_{\min}^{\prime}$ for all $1\leq k\leq t+1$. Then,

𝐱t+2,\displaystyle\|\mathbf{x}_{t+2,\parallel}\| (1+η(ρϵν))(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ\displaystyle\geq(1+\eta(\sqrt{\rho\epsilon}-\nu))\big{(}\|\mathbf{x}_{t+1,\parallel}\|+(1-\theta)(\|\mathbf{x}_{t+1,\parallel}\|-\|\mathbf{x}_{t,\parallel}\|)\big{)}+\eta\|\Delta_{\parallel}\| (92)
(1+ηρϵ)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))ηΔ.\displaystyle\geq(1+\eta\sqrt{\rho\epsilon})\big{(}\|\mathbf{x}_{t+1,\parallel}\|+(1-\theta)(\|\mathbf{x}_{t+1,\parallel}\|-\|\mathbf{x}_{t,\parallel}\|)\big{)}-\eta\|\Delta\|. (93)

Note that since 𝐱t+1,/𝐱tαmin\|\mathbf{x}_{t+1,\parallel}\|/\|\mathbf{x}_{t}\|\geq\alpha_{\min}^{\prime} for all t>0t>0, we also have

𝐱t+1,𝐱tαmin,t0,\displaystyle\frac{\|\mathbf{x}_{t+1,\parallel}\|}{\|\mathbf{x}_{t}\|}\geq\alpha_{\min}^{\prime},\quad\forall t\geq 0, (94)

which further leads to

\displaystyle\frac{\|\Delta\|}{\|\mathbf{z}_{t+1,\parallel}\|}\leq\frac{\|\Delta\|}{\alpha_{\min}^{\prime}\|\mathbf{z}_{t+1}\|}\leq\rho r^{\prime}/\alpha_{\min}^{\prime}, (95)

which leads to

𝐱t+2,(1+ηρϵηρr/αmin)((2θ)𝐱t+1,(1θ)𝐱t,).\displaystyle\|\mathbf{x}_{t+2,\parallel}\|\geq(1+\eta\sqrt{\rho\epsilon}-\eta\rho r^{\prime}/\alpha_{\min}^{\prime})\big{(}(2-\theta)\|\mathbf{x}_{t+1,\parallel}\|-(1-\theta)\|\mathbf{x}_{t,\parallel}\|\big{)}. (96)

Consider sequences whose recurrence can be written as

ξt+2=(1+κ)((2θ)ξt+1(1θ)ξt)\displaystyle\xi_{t+2}=(1+\kappa)\big{(}(2-\theta)\xi_{t+1}-(1-\theta)\xi_{t}\big{)} (97)

for some κ>0\kappa>0. Its characteristic equation can be written as

x2(1+κ)(2θ)x+(1+κ)(1θ)=0,\displaystyle x^{2}-(1+\kappa)(2-\theta)x+(1+\kappa)(1-\theta)=0, (98)

whose roots satisfy

x=1+κ2((2θ)±(2θ)24(1θ)1+κ),\displaystyle x=\frac{1+\kappa}{2}\Big{(}(2-\theta)\pm\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa}}\Big{)}, (99)

indicating

ξt=(1+κ2)t(C1(2θ+μ)t+C2(2θμ)t),\displaystyle\xi_{t}=\Big{(}\frac{1+\kappa}{2}\Big{)}^{t}\big{(}C_{1}(2-\theta+\mu)^{t}+C_{2}(2-\theta-\mu)^{t}\big{)}, (100)

where μ:=(2θ)24(1θ)1+κ\mu:=\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa}}, for constants C1C_{1} and C2C_{2} being

{C1=2θμ2μξ0+1(1+κ)μξ1,C2=2θ+μ2μξ01(1+κ)μξ1.\displaystyle\left\{\begin{aligned} &C_{1}=-\frac{2-\theta-\mu}{2\mu}\xi_{0}+\frac{1}{(1+\kappa)\mu}\xi_{1},\\ &C_{2}=\frac{2-\theta+\mu}{2\mu}\xi_{0}-\frac{1}{(1+\kappa)\mu}\xi_{1}.\end{aligned}\right. (101)
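To see where (101) comes from, substitute $t=0$ and $t=1$ into (100) (a routine check):

\displaystyle\xi_{0}=C_{1}+C_{2},\qquad\xi_{1}=\frac{1+\kappa}{2}\big(C_{1}(2-\theta+\mu)+C_{2}(2-\theta-\mu)\big),

so that $C_{1}-C_{2}=\frac{1}{\mu}\big(\frac{2\xi_{1}}{1+\kappa}-(2-\theta)\xi_{0}\big)$; solving this $2\times 2$ linear system yields exactly the constants in (101).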

Then by the inequalities (91) and (96), as long as αkαmin\alpha_{k}^{\prime}\geq\alpha_{\min}^{\prime} for any 1kt11\leq k\leq t-1, the values 𝐱t,\|\mathbf{x}_{t,\perp}\| and 𝐱t,\|\mathbf{x}_{t,\parallel}\| satisfy

𝐱t,\displaystyle\|\mathbf{x}_{t,\perp}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\leq\Big{(}-\frac{2-\theta-\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}+\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t} (102)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θμ)t,\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}-\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta-\mu_{\perp})^{t}, (103)

and

𝐱t,\displaystyle\|\mathbf{x}_{t,\parallel}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\geq\Big{(}-\frac{2-\theta-\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}+\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (104)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θμ)t,\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}-\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta-\mu_{\parallel})^{t}, (105)

where

κ\displaystyle\kappa_{\perp} =ηρϵ+2ηρr,\displaystyle=\eta\sqrt{\rho\epsilon}+2\eta\rho r^{\prime},\qquad ξ0,\displaystyle\xi_{0,\perp} =𝐱0,,\displaystyle=\|\mathbf{x}_{0,\perp}\|,\qquad ξ1,\displaystyle\xi_{1,\perp} =(1+κ)ξ0,,\displaystyle=(1+\kappa_{\perp})\xi_{0,\perp}, (106)
κ\displaystyle\kappa_{\parallel} =ηρϵηρr/αmin,\displaystyle=\eta\sqrt{\rho\epsilon}-\eta\rho r^{\prime}/\alpha_{\min}^{\prime},\qquad ξ0,\displaystyle\xi_{0,\parallel} =𝐱0,,\displaystyle=\|\mathbf{x}_{0,\parallel}\|,\qquad ξ1,\displaystyle\xi_{1,\parallel} =(1+κ)ξ0,.\displaystyle=(1+\kappa_{\parallel})\xi_{0,\parallel}. (107)

Furthermore, we can derive that

𝐱t,\displaystyle\|\mathbf{x}_{t,\perp}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\leq\Big{(}-\frac{2-\theta-\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}+\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t} (108)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\perp}}{2\mu_{\perp}}\xi_{0,\perp}-\frac{1}{(1+\kappa_{\perp})\mu_{\perp}}\xi_{1,\perp}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t} (109)
ξ0,(1+κ2)t(2θ+μ)t=𝐱0,(1+κ2)t(2θ+μ)t,\displaystyle\leq\xi_{0,\perp}\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t}=\|\mathbf{x}_{0,\perp}\|\cdot\Big{(}\frac{1+\kappa_{\perp}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp})^{t}, (110)

and

𝐱t,\displaystyle\|\mathbf{x}_{t,\parallel}\| (2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\geq\Big{(}-\frac{2-\theta-\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}+\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (111)
+(2θ+μ2μξ0,1(1+κ)μξ1,)(1+κ2)t(2θμ)t\displaystyle\quad+\Big{(}\frac{2-\theta+\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}-\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta-\mu_{\parallel})^{t} (112)
(2θμ2μξ0,+1(1+κ)μξ1,)(1+κ2)t(2θ+μ)t\displaystyle\geq\Big{(}-\frac{2-\theta-\mu_{\parallel}}{2\mu_{\parallel}}\xi_{0,\parallel}+\frac{1}{(1+\kappa_{\parallel})\mu_{\parallel}}\xi_{1,\parallel}\Big{)}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (113)
=μ+θ2μξ0,(1+κ2)t(2θ+μ)t\displaystyle=\frac{\mu_{\parallel}+\theta}{2\mu_{\parallel}}\xi_{0,\parallel}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t} (114)
𝐱0,2(1+κ2)t(2θ+μ)t.\displaystyle\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t}. (115)

Then we can observe that

𝐱t,𝐱t,𝐱0,2𝐱0,(1+κ1+κ)t(2θ+μ2θ+μ)t,\displaystyle\frac{\|\mathbf{x}_{t,\parallel}\|}{\|\mathbf{x}_{t,\perp}\|}\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{1+\kappa_{\perp}}\Big{)}^{t}\cdot\Big{(}\frac{2-\theta+\mu_{\parallel}}{2-\theta+\mu_{\perp}}\Big{)}^{t}, (116)

where

1+κ1+κ\displaystyle\frac{1+\kappa_{\parallel}}{1+\kappa_{\perp}} (1+κ)(1κ)\displaystyle\geq(1+\kappa_{\parallel})(1-\kappa_{\perp}) (117)
1(2+1/αmin)ηρrκκ\displaystyle\geq 1-(2+1/\alpha_{\min}^{\prime})\eta\rho r^{\prime}-\kappa_{\parallel}\kappa_{\perp} (118)
12ηρr/αmin,\displaystyle\geq 1-2\eta\rho r^{\prime}/\alpha^{\prime}_{\min}, (119)

and

2θ+μ2θ+μ\displaystyle\frac{2-\theta+\mu_{\parallel}}{2-\theta+\mu_{\perp}} =1+μ/(2θ)1+μ/(2θ)\displaystyle=\frac{1+\mu_{\parallel}/(2-\theta)}{1+\mu_{\perp}/(2-\theta)} (120)
=1+14(1θ)(1+κ)(2θ)21+14(1θ)(1+κ)(2θ)2\displaystyle=\frac{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\parallel})(2-\theta)^{2}}}}{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\perp})(2-\theta)^{2}}}} (121)
(1+12θθ2+κ(2θ)21+κ)(112θθ2+κ(2θ)21+κ)\displaystyle\geq\Big{(}1+\frac{1}{2-\theta}\sqrt{\frac{\theta^{2}+\kappa_{\parallel}(2-\theta)^{2}}{1+\kappa_{\parallel}}}\Big{)}\Big{(}1-\frac{1}{2-\theta}\sqrt{\frac{\theta^{2}+\kappa_{\perp}(2-\theta)^{2}}{1+\kappa_{\perp}}}\Big{)} (122)
12(κκ)θ13ηρrαminθ,\displaystyle\geq 1-\frac{2(\kappa_{\perp}-\kappa_{\parallel})}{\theta}\geq 1-\frac{3\eta\rho r^{\prime}}{\alpha_{\min}^{\prime}\theta}, (123)

by which we can derive that

\displaystyle\frac{\|\mathbf{x}_{t,\parallel}\|}{\|\mathbf{x}_{t,\perp}\|}\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}\cdot\Big(1-\frac{4\eta\rho r^{\prime}}{\alpha_{\min}^{\prime}\theta}\Big)^{t} (124)
\displaystyle\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}(1-1/\mathscr{T}^{\prime})^{t} (125)
\displaystyle\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2\|\mathbf{x}_{0,\perp}\|}\exp\Big(-\frac{t}{\mathscr{T}^{\prime}-1}\Big)\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{4\|\mathbf{x}_{0,\perp}\|}, (126)

indicating

αt=𝐱t,𝐱t,2+𝐱t,2𝐱0,8𝐱0,αmin.\displaystyle\alpha_{t}^{\prime}=\frac{\|\mathbf{x}_{t,\parallel}\|}{\sqrt{\|\mathbf{x}_{t,\parallel}\|^{2}+\|\mathbf{x}_{t,\perp}\|^{2}}}\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{8\|\mathbf{x}_{0,\perp}\|}\geq\alpha_{\min}^{\prime}. (127)

Hence, as long as $\alpha_{k}^{\prime}\geq\alpha_{\min}^{\prime}$ for all $1\leq k\leq t-1$, we also have $\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime}$ provided $t\leq\mathscr{T}^{\prime}$. Since $\alpha_{0}^{\prime}\geq\alpha_{\min}^{\prime}$ and $\alpha_{1}^{\prime}\geq\alpha_{\min}^{\prime}$, we can claim that $\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime}$ for any $t\leq\mathscr{T}^{\prime}$ by induction. ∎

Equipped with Lemma 21, we are now ready to prove Proposition 5.

Proof.

By Lemma 20, the unit vector $\hat{\mathbf{e}}$ in Line 2 of Algorithm 2 obtained after $\mathscr{T}^{\prime}$ iterations equals the output of Algorithm 4 starting from $\tilde{\mathbf{x}}$. Hence, in this proof we consider the output of Algorithm 4 instead of the original Algorithm 2.

If λnρϵ/2\lambda_{n}\leq-\sqrt{\rho\epsilon}/2, Proposition 5 holds directly. Hence, we only need to prove the case where λn>ρϵ/2\lambda_{n}>-\sqrt{\rho\epsilon}/2, in which there exists some pp^{\prime} with

\displaystyle\lambda_{p^{\prime}}\leq-\sqrt{\rho\epsilon}/2<\lambda_{p^{\prime}+1}. (128)

We use $\mathfrak{S}_{\parallel}^{\prime}$ and $\mathfrak{S}_{\perp}^{\prime}$ to denote the subspaces of $\mathbb{R}^{n}$ spanned by $\{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p^{\prime}}\}$ and $\{\mathbf{u}_{p^{\prime}+1},\mathbf{u}_{p^{\prime}+2},\ldots,\mathbf{u}_{n}\}$, respectively. Furthermore, we define $\mathbf{x}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\langle\mathbf{u}_{i},\mathbf{x}_{t}\rangle\mathbf{u}_{i}$, $\mathbf{x}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\langle\mathbf{u}_{i},\mathbf{x}_{t}\rangle\mathbf{u}_{i}$, $\mathbf{v}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\langle\mathbf{u}_{i},\mathbf{v}_{t}\rangle\mathbf{u}_{i}$, and $\mathbf{v}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\langle\mathbf{u}_{i},\mathbf{v}_{t}\rangle\mathbf{u}_{i}$, which denote the components of $\mathbf{x}_{t}$ and $\mathbf{v}_{t}$ in Algorithm 4 lying in the subspaces $\mathfrak{S}_{\parallel}^{\prime}$ and $\mathfrak{S}_{\perp}^{\prime}$, and let $\alpha_{t}^{\prime}:=\|\mathbf{x}_{t,\parallel}\|/\|\mathbf{x}_{t}\|$. Consider the case where $\alpha_{0}^{\prime}\geq\sqrt{\frac{\pi}{n}}\delta_{0}$, which can be achieved with probability

Pr{α0πnδ0}1πnδ0Vol(𝔹0n1(1))Vol(𝔹0n(1))1πnδ0nπ=1δ0,\displaystyle\Pr\left\{\alpha_{0}^{\prime}\geq\sqrt{\frac{\pi}{n}}\delta_{0}\right\}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\frac{\text{Vol}(\mathbb{B}_{0}^{n-1}(1))}{\text{Vol}(\mathbb{B}_{0}^{n}(1))}\geq 1-\sqrt{\frac{\pi}{n}}\delta_{0}\cdot\sqrt{\frac{n}{\pi}}=1-\delta_{0}, (129)

we prove that there exists some t0t_{0} with 1t0𝒯1\leq t_{0}\leq\mathscr{T}^{\prime} such that

𝐱t0,𝐱t0ρϵ8.\displaystyle\frac{\|\mathbf{x}_{t_{0},\perp^{\prime}}\|}{\|\mathbf{x}_{t_{0}}\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}. (130)

Assume the contrary, i.e., for every $t$ with $1\leq t\leq\mathscr{T}^{\prime}$ we have $\frac{\|\mathbf{x}_{t,\perp^{\prime}}\|}{\|\mathbf{x}_{t}\|}>\frac{\sqrt{\rho\epsilon}}{8\ell}$ and $\frac{\|\mathbf{z}_{t,\perp^{\prime}}\|}{\|\mathbf{z}_{t}\|}>\frac{\sqrt{\rho\epsilon}}{8\ell}$. Focus on the case where $\|\mathbf{x}_{t,\perp^{\prime}}\|$, the component of $\mathbf{x}_{t}$ in the subspace $\mathfrak{S}_{\perp}^{\prime}$, achieves the largest possible value. Then in this case, we have the following recurrence formula:

𝐱t+2,(1+ηρϵ/2)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ.\displaystyle\|\mathbf{x}_{t+2,\perp^{\prime}}\|\leq(1+\eta\sqrt{\rho\epsilon}/2)\big{(}\|\mathbf{x}_{t+1,\perp^{\prime}}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp^{\prime}}\|-\|\mathbf{x}_{t,\perp^{\prime}}\|)\big{)}+\eta\|\Delta_{\perp^{\prime}}\|. (131)

Since 𝐳k,𝐳kρϵ8\frac{\|\mathbf{z}_{k,\perp^{\prime}}\|}{\|\mathbf{z}_{k}\|}\geq\frac{\sqrt{\rho\epsilon}}{8\ell} for any 1kt+11\leq k\leq t+1, we can derive that

\displaystyle\frac{\|\Delta_{\perp^{\prime}}\|}{\|\mathbf{x}_{t+1,\perp^{\prime}}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp^{\prime}}\|-\|\mathbf{x}_{t,\perp^{\prime}}\|)}\leq\frac{\|\Delta\|}{\|\mathbf{z}_{t,\perp^{\prime}}\|}\leq\frac{2\rho r^{\prime}}{\sqrt{\rho\epsilon}}, (132)

which leads to

𝐱t+2,\displaystyle\|\mathbf{x}_{t+2,\perp^{\prime}}\| (1+ηρϵ/2)(𝐱t+1,+(1θ)(𝐱t+1,𝐱t,))+ηΔ\displaystyle\leq(1+\eta\sqrt{\rho\epsilon}/2)\big{(}\|\mathbf{x}_{t+1,\perp^{\prime}}\|+(1-\theta)(\|\mathbf{x}_{t+1,\perp^{\prime}}\|-\|\mathbf{x}_{t,\perp^{\prime}}\|)\big{)}+\eta\|\Delta_{\perp^{\prime}}\| (133)
(1+ηρϵ/2+2ρr/ρϵ)((2θ)𝐱t+1,(1θ)𝐱t,).\displaystyle\leq(1+\eta\sqrt{\rho\epsilon}/2+2\rho r^{\prime}/\sqrt{\rho\epsilon})\big{(}(2-\theta)\|\mathbf{x}_{t+1,\perp^{\prime}}\|-(1-\theta)\|\mathbf{x}_{t,\perp^{\prime}}\|\big{)}. (134)

Using the characteristic equation technique from the proof of Lemma 21, it can be further derived that

𝐱t,𝐱0,(1+κ2)t(2θ+μ)t,\displaystyle\|\mathbf{x}_{t,\perp^{\prime}}\|\leq\|\mathbf{x}_{0,\perp^{\prime}}\|\cdot\Big{(}\frac{1+\kappa_{\perp^{\prime}}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\perp^{\prime}})^{t}, (135)

for κ=ηρϵ/2+2ρr/ρϵ\kappa_{\perp^{\prime}}=\eta\sqrt{\rho\epsilon}/2+2\rho r^{\prime}/\sqrt{\rho\epsilon} and μ=(2θ)24(1θ)1+κ\mu_{\perp^{\prime}}=\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa_{\perp^{\prime}}}}, given 𝐱k,𝐱kρϵ8\frac{\|\mathbf{x}_{k,\perp^{\prime}}\|}{\|\mathbf{x}_{k}\|}\geq\frac{\sqrt{\rho\epsilon}}{8\ell} and 𝐳k,𝐳kρϵ8\frac{\|\mathbf{z}_{k,\perp^{\prime}}\|}{\|\mathbf{z}_{k}\|}\geq\frac{\sqrt{\rho\epsilon}}{8\ell} for any 1kt11\leq k\leq t-1. Due to Lemma 21,

αtαmin=δ08πn,1t𝒯.\displaystyle\alpha_{t}^{\prime}\geq\alpha_{\min}^{\prime}=\frac{\delta_{0}}{8}\sqrt{\frac{\pi}{n}},\qquad\forall 1\leq t\leq\mathscr{T}^{\prime}. (136)

Moreover, it is demonstrated in the proof of Lemma 21 that

𝐱t,𝐱0,2(1+κ2)t(2θ+μ)t,1t𝒯,\displaystyle\|\mathbf{x}_{t,\parallel}\|\geq\frac{\|\mathbf{x}_{0,\parallel}\|}{2}\cdot\Big{(}\frac{1+\kappa_{\parallel}}{2}\Big{)}^{t}\cdot(2-\theta+\mu_{\parallel})^{t},\qquad\forall 1\leq t\leq\mathscr{T}^{\prime}, (137)

for κ=ηρϵηρr/αmin\kappa_{\parallel}=\eta\sqrt{\rho\epsilon}-\eta\rho r^{\prime}/\alpha_{\min}^{\prime} and μ=(2θ)24(1θ)1+κ\mu_{\parallel}=\sqrt{(2-\theta)^{2}-\frac{4(1-\theta)}{1+\kappa_{\parallel}}}. Observe that

𝐱𝒯,𝐱𝒯,\displaystyle\frac{\|\mathbf{x}_{\mathscr{T}^{\prime},\perp^{\prime}}\|}{\|\mathbf{x}_{\mathscr{T}^{\prime},\parallel}\|} 2𝐱0,𝐱0,(1+κ1+κ)𝒯(2θ+μ2θ+μ)𝒯\displaystyle\leq\frac{2\|\mathbf{x}_{0,\perp^{\prime}}\|}{\|\mathbf{x}_{0,\parallel}\|}\cdot\Big{(}\frac{1+\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}}\cdot\Big{(}\frac{2-\theta+\mu_{\perp^{\prime}}}{2-\theta+\mu_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}} (138)
2δ0nπ(1+κ1+κ)𝒯(2θ+μ2θ+μ)𝒯,\displaystyle\leq\frac{2}{\delta_{0}}\sqrt{\frac{n}{\pi}}\Big{(}\frac{1+\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}}\cdot\Big{(}\frac{2-\theta+\mu_{\perp^{\prime}}}{2-\theta+\mu_{\parallel}}\Big{)}^{\mathscr{T}^{\prime}}, (139)

where

\displaystyle\frac{1+\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}=1-\frac{\kappa_{\parallel}-\kappa_{\perp^{\prime}}}{1+\kappa_{\parallel}}=1-\frac{\eta\sqrt{\rho\epsilon}/2-\rho r^{\prime}(\eta/\alpha_{\min}^{\prime}+2/\sqrt{\rho\epsilon})}{1+\kappa_{\parallel}}\leq 1-\frac{\eta\sqrt{\rho\epsilon}}{4}, (140)

and

2θ+μ2θ+μ\displaystyle\frac{2-\theta+\mu_{\perp^{\prime}}}{2-\theta+\mu_{\parallel}} =1+14(1θ)(1+κ)(2θ)21+14(1θ)(1+κ)(2θ)2\displaystyle=\frac{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\perp^{\prime}})(2-\theta)^{2}}}}{1+\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\parallel})(2-\theta)^{2}}}} (141)
11+(14(1θ)(1+κ)(2θ)214(1θ)(1+κ)(2θ)2)\displaystyle\leq\frac{1}{1+\Big{(}\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\perp^{\prime}})(2-\theta)^{2}}}-\sqrt{1-\frac{4(1-\theta)}{(1+\kappa_{\parallel})(2-\theta)^{2}}}\Big{)}} (142)
1κκθ\displaystyle\leq 1-\frac{\kappa_{\parallel}-\kappa_{\perp^{\prime}}}{\theta} (143)
1ηρϵ4θ=1(ρϵ)1/416.\displaystyle\leq 1-\frac{\eta\sqrt{\rho\epsilon}}{4\theta}=1-\frac{(\rho\epsilon)^{1/4}}{16\sqrt{\ell}}. (144)

Hence,

𝐱𝒯,𝐱𝒯,\displaystyle\frac{\|\mathbf{x}_{\mathscr{T}^{\prime},\perp^{\prime}}\|}{\|\mathbf{x}_{\mathscr{T}^{\prime},\parallel}\|} 2δ0nπ(1(ρϵ)1/416)𝒯ρϵ8.\displaystyle\leq\frac{2}{\delta_{0}}\sqrt{\frac{n}{\pi}}\Big{(}1-\frac{(\rho\epsilon)^{1/4}}{16\sqrt{\ell}}\Big{)}^{\mathscr{T}^{\prime}}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}. (145)

Since $\|\mathbf{x}_{\mathscr{T}^{\prime},\parallel}\|\leq\|\mathbf{x}_{\mathscr{T}^{\prime}}\|$, we have $\frac{\|\mathbf{x}_{\mathscr{T}^{\prime},\perp^{\prime}}\|}{\|\mathbf{x}_{\mathscr{T}^{\prime}}\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}$, a contradiction. Hence, there exists some $t_{0}$ with $1\leq t_{0}\leq\mathscr{T}^{\prime}$ such that $\frac{\|\mathbf{x}_{t_{0},\perp^{\prime}}\|}{\|\mathbf{x}_{t_{0}}\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell}$. Consider the normalized vector $\hat{\mathbf{e}}=\mathbf{x}_{t_{0}}/\|\mathbf{x}_{t_{0}}\|$; we use $\hat{\mathbf{e}}_{\perp^{\prime}}$ and $\hat{\mathbf{e}}_{\parallel^{\prime}}$ to separately denote the components of $\hat{\mathbf{e}}$ in $\mathfrak{S}_{\perp}^{\prime}$ and $\mathfrak{S}_{\parallel}^{\prime}$. Then $\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\sqrt{\rho\epsilon}/(8\ell)$ whereas $\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|\geq 1-\rho\epsilon/(8\ell)^{2}$. Then,

𝐞^T(𝟎)𝐞^=(𝐞^+𝐞^)T(𝟎)(𝐞^+𝐞^),\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})^{T}\mathcal{H}(\mathbf{0})(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}), (146)

since (𝟎)𝐞^𝔖\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\in\mathfrak{S}_{\perp}^{\prime} and (𝟎)𝐞^𝔖\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\in\mathfrak{S}_{\parallel}^{\prime}, it can be further simplified to

\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}. (147)

Due to the $\ell$-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value upper bounded by $\ell$. Hence,

\displaystyle\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}\leq\ell\|\hat{\mathbf{e}}_{\perp^{\prime}}\|^{2}\leq\frac{\rho\epsilon}{64\ell}. (148)

Further, according to the definition of $\mathfrak{S}_{\parallel}^{\prime}$, we have

𝐞^T(𝟎)𝐞^ρϵ2𝐞^2.\displaystyle\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}. (149)

Combining these two inequalities together, we can obtain

\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\mathcal{H}(\mathbf{0})\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}+\frac{\rho\epsilon}{64\ell}\leq-\frac{\sqrt{\rho\epsilon}}{4}. (150) ∎

Appendix C Proof details of escaping from saddle points by negative curvature finding

C.1 Algorithms for escaping from saddle points using negative curvature finding

In this subsection, we first present the algorithm for escaping from saddle points using Algorithm 1; it is given as Algorithm 5.

1 Input: 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n};
2 for t=0,1,,Tt=0,1,...,T do
3       if f(𝐱t)ϵ\|\nabla f(\mathbf{x}_{t})\|\leq\epsilon then
4             𝐞^\hat{\mathbf{e}}\leftarrowNegativeCurvatureFinding(𝐱t,r,𝒯\mathbf{x}_{t},r,\mathscr{T}) ;
5             \mathbf{x}_{t}\leftarrow\mathbf{x}_{t}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{t})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{t})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}};
6            
7      𝐱t+1𝐱t1f(𝐱t)\mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t}-\frac{1}{\ell}\nabla f(\mathbf{x}_{t});
8      
Algorithm 5 Perturbed Gradient Descent with Negative Curvature Finding

Observe that Algorithm 5 and Algorithm 2 are similar to perturbed gradient descent and perturbed accelerated gradient descent, respectively, with the uniform perturbation step replaced by our negative curvature finding algorithms. One may worry that Algorithm 5 involves nested loops, since the negative curvature finding subroutine is contained in the main loop, contradicting our previous claim that Algorithm 5 contains only a single loop. However, every step of Algorithm 5, inside or outside the negative curvature finding subroutine, consists only of gradient descent steps and a single perturbation. Hence, Algorithm 5 is essentially a single-loop algorithm, and we count its iteration number as the total number of gradient calls.
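To illustrate this single-loop structure, here is a minimal Python sketch (ours, not the authors' code); `grad_f` is an assumed gradient oracle and `negative_curvature_finding` stands for Algorithm 1, treated as a black box that returns a unit vector $\hat{\mathbf{e}}$.

```python
import numpy as np

def perturbed_gd_with_ncf(grad_f, negative_curvature_finding, x0,
                          T, eps, ell, rho, r, T_ncf):
    """Sketch of Algorithm 5: plain gradient descent, with Algorithm 1 invoked
    whenever the gradient is small (Lines 3-5 of the pseudocode above)."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        if np.linalg.norm(grad_f(x)) <= eps:
            e_hat = negative_curvature_finding(x, r, T_ncf)
            # Move a distance sqrt(eps / rho) / 4 along -sign(f'_e(x)) * e_hat,
            # i.e., downhill along the negative curvature direction.
            fp = float(np.dot(grad_f(x), e_hat))
            x = x - np.sign(fp) * 0.25 * np.sqrt(eps / rho) * e_hat
        # Line 7: gradient descent step with step size 1 / ell.
        x = x - grad_f(x) / ell
    return x
```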

C.2 Proof details of escaping saddle points using Algorithm 1

In this subsection, we prove:

Theorem 22.

For any $\epsilon>0$ and $0<\delta\leq 1$, Algorithm 5 with the parameters chosen in Proposition 3 satisfies that at least $1/4$ of its iterations are $\epsilon$-approximate second-order stationary points, using

O~((f(𝐱0)f)ϵ2logn)\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\cdot\log n\Big{)}

iterations, with probability at least 1δ1-\delta, where ff^{*} is the global minimum of ff.

Proof.

Let the parameters be chosen according to (2), and set the total step number TT to be:

T=max{8(f(𝐱𝟎)f)ϵ2,768(f(𝐱𝟎)f)ρϵ3},\displaystyle T=\max\left\{\frac{8\ell(f(\mathbf{x_{0}})-f^{*})}{\epsilon^{2}},768(f(\mathbf{x_{0}})-f^{*})\cdot\sqrt{\frac{\rho}{\epsilon^{3}}}\right\}, (151)

similar to the perturbed gradient descent algorithm [20, Algorithm 4]. We first assume that, each time we apply negative curvature finding (Algorithm 1) at some iterate $\mathbf{x}_{t}$ with the parameter $\delta_{0}$ chosen as

\displaystyle\delta_{0}=\frac{\delta}{384(f(\mathbf{x}_{0})-f^{*})}\sqrt{\frac{\epsilon^{3}}{\rho}}, (152)

we can successfully obtain a unit vector 𝐞^\hat{\mathbf{e}} with 𝐞^T𝐞^ρϵ/4\hat{\mathbf{e}}^{T}\mathcal{H}\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4, as long as λmin((𝐱t))ρϵ\lambda_{\min}(\mathcal{H}(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}. The error probability of this assumption is provided later.

Under this assumption, Algorithm 1 can be called for at most 384(f(𝐱𝟎)f)ρϵ3T2384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\leq\frac{T}{2} times, for otherwise the function value decrease will be greater than f(𝐱𝟎)ff(\mathbf{x_{0}})-f^{*}, which is not possible. Then, the error probability that some calls to Algorithm 1 fails is upper bounded by

384(f(𝐱𝟎)f)ρϵ3δ0=δ.\displaystyle 384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\delta_{0}=\delta. (153)

The remaining iterations, in which Algorithm 1 is not called, are either large-gradient steps with $\|\nabla f(\mathbf{x}_{t})\|\geq\epsilon$ or $\epsilon$-approximate second-order stationary points. Among them, the number of large-gradient steps cannot be more than $T/4$ because otherwise, by Lemma 10 in Appendix A:

f(𝐱T)f(𝐱0)Tηϵ2/8<f,\displaystyle f(\mathbf{x}_{T})\leq f(\mathbf{x}_{0})-T\eta\epsilon^{2}/8<f^{*},

a contradiction. Therefore, we conclude that at least T/4T/4 of the iterations must be ϵ\epsilon-approximate second-order stationary points, with probability at least 1δ1-\delta.

The number of iterations can be viewed as the sum of two parts: the number of iterations needed for gradient descent, denoted by $T_{1}$, and the number of iterations needed for negative curvature finding, denoted by $T_{2}$. With probability at least $1-\delta$,

T1=T=O~((f(𝐱0)f)ϵ2).\displaystyle T_{1}=T=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\Big{)}. (154)

As for T2T_{2}, with probability at least 1δ1-\delta, Algorithm 1 is called for at most 384(f(𝐱𝟎)f)ρϵ3384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}} times, and by Proposition 3 it takes O~(lognρϵ)\tilde{O}\Big{(}\frac{\log n}{\sqrt{\rho\epsilon}}\Big{)} iterations each time. Hence,

T2=384(f(𝐱𝟎)f)ρϵ3O~(lognρϵ)=O~((f(𝐱0)f)ϵ2logn).\displaystyle T_{2}=384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\tilde{O}\Big{(}\frac{\log n}{\sqrt{\rho\epsilon}}\Big{)}=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\cdot\log n\Big{)}. (155)

As a result, the total iteration number T1+T2T_{1}+T_{2} is

O~((f(𝐱0)f)ϵ2logn).\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{2}}\cdot\log n\Big{)}. (156)

C.3 Proof details of escaping saddle points using Algorithm 2

We first present here the Negative Curvature Exploitation algorithm proposed in [21, Algorithm 3], which appears in Line 2 of Algorithm 2:

1 if 𝐯ts\|\mathbf{v}_{t}\|\geq s then
2       𝐱t+1𝐱t\mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t};
3      
4else
5       \xi=s\cdot\mathbf{v}_{t}/\|\mathbf{v}_{t}\|;
6       \mathbf{x}_{t+1}\leftarrow\text{argmin}_{\mathbf{x}\in\left\{\mathbf{x}_{t}+\xi,\mathbf{x}_{t}-\xi\right\}}f(\mathbf{x});
7      
Output (𝐳t+1,𝟎)(\mathbf{z}_{t+1},\mathbf{0}).
Algorithm 6 Negative Curvature Exploitation(𝐱t,𝐯t,s)\mathbf{x}_{t},\mathbf{v}_{t},s)
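A minimal Python rendering of this NCE step (our sketch; `f` is the objective, and, following the pseudocode, the momentum is reset to zero on output):

```python
import numpy as np

def negative_curvature_exploitation(f, x_t, v_t, s):
    """Sketch of Algorithm 6: if the momentum v_t is already long, keep x_t;
    otherwise probe a distance s along +/- v_t and keep the lower point."""
    if np.linalg.norm(v_t) >= s:
        x_next = x_t
    else:
        xi = s * v_t / np.linalg.norm(v_t)
        x_next = min((x_t + xi, x_t - xi), key=f)
    # The momentum is reset to zero, matching the output of the pseudocode.
    return x_next, np.zeros_like(x_t)
```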

Now, we give the full version of Theorem 7 as follows:

Theorem 23.

Suppose that the function ff is \ell-smooth and ρ\rho-Hessian Lipschitz. For any ϵ>0\epsilon>0 and a constant 0<δ10<\delta\leq 1, we choose the parameters appearing in Algorithm 2 as follows:

\displaystyle\delta_{0}=\frac{\delta}{384\Delta_{f}}\sqrt{\frac{\epsilon^{3}}{\rho}},\qquad\mathscr{T}^{\prime}=\frac{32\sqrt{\ell}}{(\rho\epsilon)^{1/4}}\log\Big(\frac{\ell}{\delta_{0}}\sqrt{\frac{n}{\rho\epsilon}}\Big),\qquad\zeta=\frac{\ell}{\sqrt{\rho\epsilon}}, (157)
r\displaystyle r^{\prime} =δ0ϵ32πρn,\displaystyle=\frac{\delta_{0}\epsilon}{32}\sqrt{\frac{\pi}{\rho n}}, η\displaystyle\eta =14,\displaystyle=\frac{1}{4\ell}, θ\displaystyle\theta =14ζ,\displaystyle=\frac{1}{4\sqrt{\zeta}}, (158)
\displaystyle\mathscr{E} =ϵ3ρcA7,\displaystyle=\sqrt{\frac{\epsilon^{3}}{\rho}}\cdot c_{A}^{-7}, γ\displaystyle\gamma =θ2η,\displaystyle=\frac{\theta^{2}}{\eta}, s\displaystyle s =γ4ρ,\displaystyle=\frac{\gamma}{4\rho}, (159)

where Δf:=f(𝐱0)f\Delta_{f}:=f(\mathbf{x}_{0})-f^{*} and ff^{*} is the global minimum of ff, and the constant cAc_{A} is chosen large enough to satisfy both the condition in Lemma 16 and cA(384)1/7c_{A}\geq(384)^{1/7}. Then, Algorithm 2 satisfies that at least one of the iterations 𝐳t\mathbf{z}_{t} will be an ϵ\epsilon-approximate second-order stationary point in

O~((f(𝐱0)f)ϵ1.75logn)\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{1.75}}\cdot\log n\Big{)} (160)

iterations, with probability at least 1δ1-\delta.

Proof.

Set the total step number TT to be:

T=max{4Δf(𝒯~+𝒯),768Δf𝒯ρϵ3}=O~((f(𝐱0)f)ϵ1.75logn),\displaystyle T=\max\left\{\frac{4\Delta_{f}(\tilde{\mathscr{T}}+\mathscr{T}^{\prime})}{\mathscr{E}},768\Delta_{f}\mathscr{T}^{\prime}\sqrt{\frac{\rho}{\epsilon^{3}}}\right\}=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{1.75}}\cdot\log n\Big{)}, (161)

where 𝒯~=ζcA\tilde{\mathscr{T}}=\sqrt{\zeta}\cdot c_{A} as defined in Lemma 16, similar to the perturbed accelerated gradient descent algorithm [21, Algorithm 2]. We first assert that for each iteration 𝐱t\mathbf{x}_{t} that a uniform perturbation is added, after 𝒯\mathscr{T}^{\prime} iterations we can successfully obtain a unit vector 𝐞^\hat{\mathbf{e}} with 𝐞^T𝐞^ρϵ/4\hat{\mathbf{e}}^{T}\mathcal{H}\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4, as long as λmin((𝐱t))ρϵ\lambda_{\min}(\mathcal{H}(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}. The error probability of this assumption is provided later.

Under this assumption, the uniform perturbation can be called for at most 384(f(𝐱𝟎)f)ρϵ3384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}} times, for otherwise the function value decrease will be greater than f(𝐱𝟎)ff(\mathbf{x_{0}})-f^{*}, which is not possible. Then, the probability that at least one negative curvature finding subroutine after uniform perturbation fails is upper bounded by

384(f(𝐱𝟎)f)ρϵ3δ0=δ.\displaystyle 384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\delta_{0}=\delta. (162)

The remaining steps, which are not within $\mathscr{T}^{\prime}$ steps after a uniform perturbation, are either large-gradient steps with $\|\nabla f(\mathbf{x}_{t})\|\geq\epsilon$ or $\epsilon$-approximate second-order stationary points. Next, we demonstrate that at least one of these steps is an $\epsilon$-approximate second-order stationary point.

Assume the contrary. We use $N_{\tilde{\mathscr{T}}}$ to denote the number of disjoint time periods of length larger than $\tilde{\mathscr{T}}$ that contain only large-gradient steps and do not contain any step within $\mathscr{T}^{\prime}$ steps after a uniform perturbation. Then it satisfies

N𝒯~T2(𝒯~+𝒯)384Δfρϵ3(2cA7384)Δfρϵ3Δf.\displaystyle N_{\tilde{\mathscr{T}}}\geq\frac{T}{2(\tilde{\mathscr{T}}+\mathscr{T}^{\prime})}-384\Delta_{f}\sqrt{\frac{\rho}{\epsilon^{3}}}\geq(2c_{A}^{7}-384)\Delta_{f}\sqrt{\frac{\rho}{\epsilon^{3}}}\geq\frac{\Delta_{f}}{\mathscr{E}}. (163)

From Lemma 16, during these time intervals the Hamiltonian decreases in total by at least $N_{\tilde{\mathscr{T}}}\cdot\mathscr{E}=\Delta_{f}$. This is impossible: by Lemma 17, the Hamiltonian decreases monotonically in every step except the $\mathscr{T}^{\prime}$ steps after a uniform perturbation, so the overall decrease cannot exceed $\Delta_{f}$, a contradiction. Therefore, we conclude that at least one of the iterations must be an $\epsilon$-approximate second-order stationary point, with probability at least $1-\delta$. ∎
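For reference, the parameter choices (157)–(159) can be evaluated programmatically; the following Python sketch (ours) simply transcribes the formulas, with `c_A` the constant of Theorem 23 and all argument names our own.

```python
import numpy as np

def theorem23_parameters(ell, rho, eps, delta, Delta_f, n, c_A):
    """Sketch: evaluate the parameter choices (157)-(159) of Theorem 23."""
    delta0 = delta / (384.0 * Delta_f) * np.sqrt(eps ** 3 / rho)
    T_prime = 32.0 * np.sqrt(ell) / (rho * eps) ** 0.25 \
        * np.log(ell / delta0 * np.sqrt(n / (rho * eps)))
    zeta = ell / np.sqrt(rho * eps)
    r_prime = delta0 * eps / 32.0 * np.sqrt(np.pi / (rho * n))
    eta = 1.0 / (4.0 * ell)
    theta = 1.0 / (4.0 * np.sqrt(zeta))
    script_E = np.sqrt(eps ** 3 / rho) * c_A ** (-7.0)
    gamma = theta ** 2 / eta
    s = gamma / (4.0 * rho)
    return dict(delta0=delta0, T_prime=T_prime, zeta=zeta, r_prime=r_prime,
                eta=eta, theta=theta, E=script_E, gamma=gamma, s=s)
```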

Appendix D Proofs of the stochastic setting

D.1 Proof details of negative curvature finding using stochastic gradients

In this subsection, we demonstrate that Algorithm 3 can find a negative curvature direction efficiently. Specifically, we prove the following proposition:

Proposition 24.

Suppose the function f\colon\mathbb{R}^{n}\to\mathbb{R} is \ell-smooth and \rho-Hessian Lipschitz. For any 0<\delta<1, we specify our choice of parameters and constants as follows:

\displaystyle\mathscr{T}_{s}=\frac{8\ell}{\sqrt{\rho\epsilon}}\cdot\log\Big(\frac{\ell n}{\delta\sqrt{\rho\epsilon}}\Big),\qquad\iota=10\log\Big(\frac{n\mathscr{T}_{s}^{2}}{\delta}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)\Big), (164)
rs\displaystyle r_{s} =δ480ρn𝒯sρϵι,\displaystyle=\frac{\delta}{480\rho n\mathscr{T}_{s}}\sqrt{\frac{\rho\epsilon}{\iota}}, m\displaystyle m =160(+~)δρϵ𝒯sι,\displaystyle=\frac{160(\ell+\tilde{\ell})}{\delta\sqrt{\rho\epsilon}}\sqrt{\mathscr{T}_{s}\iota}, (165)

Then for any point 𝐱~n\tilde{\mathbf{x}}\in\mathbb{R}^{n} satisfying λmin((𝐱~))ρϵ\lambda_{\min}(\mathcal{H}(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}, with probability at least 13δ1-3\delta, Algorithm 3 outputs a unit vector 𝐞^\hat{\mathbf{e}} satisfying

𝐞^T(𝐱~)𝐞^ρϵ4,\displaystyle\hat{\mathbf{e}}^{T}\mathcal{H}(\tilde{\mathbf{x}})\hat{\mathbf{e}}\leq-\frac{\sqrt{\rho\epsilon}}{4}, (166)

where \mathcal{H} stands for the Hessian matrix of the function f, using O(m\cdot\mathscr{T}_{s})=\tilde{O}\Big(\frac{\log^{2}n}{\delta\epsilon^{1/2}}\Big) iterations.

Similarly to Algorithm 1 and Algorithm 2, the renormalization step in Line 3 of Algorithm 3 only guarantees that the value \|\mathbf{y}_{t}\| does not scale exponentially during the algorithm, and it does not affect the output. We thus introduce the following Algorithm 7, the no-renormalization version of Algorithm 3, which has the same output and a simpler structure. Hence, in this subsection we analyze Algorithm 7 instead of Algorithm 3.

1 𝐳00\mathbf{z}_{0}\leftarrow 0;
2 for t=1,,𝒯st=1,...,\mathscr{T}_{s} do
3       Sample {θ(1),θ(2),,θ(m)}𝒟\left\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(m)}\right\}\sim\mathcal{D};
4       𝐠(𝐳t1)𝐳t1rs1mj=1m(𝐠(𝐱~+rs𝐳t1𝐳t1;θ(j))𝐠(𝐱~;θ(j)))\mathbf{g}(\mathbf{z}_{t-1})\leftarrow\frac{\|\mathbf{z}_{t-1}\|}{r_{s}}\cdot\frac{1}{m}\sum_{j=1}^{m}\Big{(}\mathbf{g}\Big{(}\tilde{\mathbf{x}}+\frac{r_{s}}{\|\mathbf{z}_{t-1}\|}\mathbf{z}_{t-1};\theta^{(j)}\Big{)}-\mathbf{g}(\tilde{\mathbf{x}};\theta^{(j)})\Big{)};
5       𝐳t𝐳t11(𝐠(𝐳t1)+ξt),ξt𝒩(0,rs2dI)\mathbf{z}_{t}\leftarrow\mathbf{z}_{t-1}-\frac{1}{\ell}(\mathbf{g}(\mathbf{z}_{t-1})+\xi_{t}),\qquad\xi_{t}\sim\mathcal{N}\Big{(}0,\frac{r_{s}^{2}}{d}I\Big{)};
6      
Output \mathbf{z}_{\mathscr{T}_{s}}/\|\mathbf{z}_{\mathscr{T}_{s}}\|.
Algorithm 7 Stochastic Negative Curvature Finding without Renormalization(𝐱~,rs,𝒯s,m\tilde{\mathbf{x}},r_{s},\mathscr{T}_{s},m).
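To make the update rule concrete, below is a minimal Python sketch of Algorithm 7. It is an illustrative sketch only, not the implementation used in our experiments: the stochastic gradient oracle g(x, theta) and the sampler sample_theta(rng) are hypothetical stand-ins for \mathbf{g}(\cdot;\theta) and the distribution \mathcal{D}, and the toy usage at the end replaces f by a simple quadratic with additive gradient noise.

```python
import numpy as np

def stochastic_ncf(g, sample_theta, x_tilde, r_s, T_s, m, ell, rng):
    # Sketch of Algorithm 7: a power-method-like iteration driven by averaged
    # stochastic gradient differences plus injected Gaussian noise.
    n = x_tilde.shape[0]
    z = np.zeros(n)
    for _ in range(T_s):
        if np.linalg.norm(z) > 0:
            thetas = [sample_theta(rng) for _ in range(m)]
            probe = x_tilde + (r_s / np.linalg.norm(z)) * z
            diffs = [g(probe, th) - g(x_tilde, th) for th in thetas]
            g_z = (np.linalg.norm(z) / r_s) * np.mean(diffs, axis=0)
        else:
            g_z = np.zeros(n)  # z = 0: the injected noise below starts the iteration
        xi = rng.normal(0.0, r_s / np.sqrt(n), size=n)  # xi_t ~ N(0, (r_s^2 / n) I)
        z = z - (g_z + xi) / ell
    return z / np.linalg.norm(z)

# Toy usage: f(x) = 0.5 x^T diag(-0.1, 1, ..., 1) x with additive gradient noise.
rng = np.random.default_rng(0)
n = 10
diag = np.ones(n)
diag[0] = -0.1
g = lambda x, theta: diag * x + theta
sample_theta = lambda rng: 0.1 * rng.standard_normal(n)
e_hat = stochastic_ncf(g, sample_theta, np.zeros(n), r_s=0.01, T_s=200, m=20, ell=1.0, rng=rng)
print(e_hat @ (diag * e_hat))  # close to -0.1, the most negative eigenvalue
```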

Without loss of generality, we assume \tilde{\mathbf{x}}=\mathbf{0} by shifting \mathbb{R}^{n} such that \tilde{\mathbf{x}} is mapped to \mathbf{0}. As argued in the proof of Proposition 3, \mathcal{H}(\mathbf{0}) admits the following eigen-decomposition:

(𝟎)=i=1nλi𝐮i𝐮iT,\displaystyle\mathcal{H}(\mathbf{0})=\sum_{i=1}^{n}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{T}, (167)

where the set {𝐮i}i=1n\{\mathbf{u}_{i}\}_{i=1}^{n} forms an orthonormal basis of n\mathbb{R}^{n}. Without loss of generality, we assume the eigenvalues λ1,λ2,,λn\lambda_{1},\lambda_{2},\ldots,\lambda_{n} corresponding to 𝐮1,𝐮2,,𝐮n\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n} satisfy

λ1λ2λn,\displaystyle\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}, (168)

where \lambda_{1}\leq-\sqrt{\rho\epsilon}. If \lambda_{n}\leq-\sqrt{\rho\epsilon}/2, Proposition 24 holds directly. Hence, we only need to prove the case where \lambda_{n}>-\sqrt{\rho\epsilon}/2, in which there exist 1\leq p\leq p^{\prime}<n with

λpρϵλp+1,λpρϵ/2<λp+1.\displaystyle\lambda_{p}\leq-\sqrt{\rho\epsilon}\leq\lambda_{p+1},\quad\lambda_{p^{\prime}}\leq-\sqrt{\rho\epsilon}/2<\lambda_{p^{\prime}+1}. (169)
Notation:

Throughout this subsection, let \tilde{\mathcal{H}}:=\mathcal{H}(\tilde{\mathbf{x}}). Use \mathfrak{S}_{\parallel}, \mathfrak{S}_{\perp} to separately denote the subspaces of \mathbb{R}^{n} spanned by \{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p}\} and \{\mathbf{u}_{p+1},\mathbf{u}_{p+2},\ldots,\mathbf{u}_{n}\}, and use \mathfrak{S}_{\parallel}^{\prime}, \mathfrak{S}_{\perp}^{\prime} to denote the subspaces of \mathbb{R}^{n} spanned by \{\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{p^{\prime}}\} and \{\mathbf{u}_{p^{\prime}+1},\mathbf{u}_{p^{\prime}+2},\ldots,\mathbf{u}_{n}\}. Furthermore, define \mathbf{z}_{t,\parallel}:=\sum_{i=1}^{p}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, \mathbf{z}_{t,\perp}:=\sum_{i=p+1}^{n}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, \mathbf{z}_{t,\parallel^{\prime}}:=\sum_{i=1}^{p^{\prime}}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, and \mathbf{z}_{t,\perp^{\prime}}:=\sum_{i=p^{\prime}+1}^{n}\langle\mathbf{u}_{i},\mathbf{z}_{t}\rangle\mathbf{u}_{i}, respectively, to denote the components of the iterate \mathbf{z}_{t} in Algorithm 7 in the subspaces \mathfrak{S}_{\parallel}, \mathfrak{S}_{\perp}, \mathfrak{S}_{\parallel}^{\prime}, \mathfrak{S}_{\perp}^{\prime}, and let \gamma:=-\lambda_{1}.

To prove Proposition 24, we first introduce the following lemma:

Lemma 25.

Under the setting of Proposition 24, for any point \tilde{\mathbf{x}}\in\mathbb{R}^{n} satisfying \lambda_{\min}(\nabla^{2}f(\tilde{\mathbf{x}}))\leq-\sqrt{\rho\epsilon}, with probability at least 1-3\delta, Algorithm 3 outputs a unit vector \hat{\mathbf{e}} satisfying

𝐞^:=i=pn𝐮i,𝐞^𝐮iρϵ8\displaystyle\|\hat{\mathbf{e}}_{\perp^{\prime}}\|:=\Big{\|}\sum_{i=p^{\prime}}^{n}\left\langle\mathbf{u}_{i},\hat{\mathbf{e}}\right\rangle\mathbf{u}_{i}\Big{\|}\leq\frac{\sqrt{\rho\epsilon}}{8\ell} (170)

using O(m\cdot\mathscr{T}_{s})=\tilde{O}\Big(\frac{\log^{2}n}{\delta\epsilon^{1/2}}\Big) iterations.

D.1.1 Proof of Lemma 25

In the proof of Lemma 25, we consider the worst case, where \lambda_{1}=-\gamma=-\sqrt{\rho\epsilon} is the only eigenvalue less than -\sqrt{\rho\epsilon}/2, and all other eigenvalues are equal to -\sqrt{\rho\epsilon}/2+\nu for an arbitrarily small constant \nu. Under this scenario, the component \mathbf{z}_{t,\parallel^{\prime}} is as small as possible relative to \mathbf{z}_{t,\perp^{\prime}} at each time step, so this is the hardest case for the lemma.

The following lemma characterizes the dynamics of Algorithm 7:

Lemma 26.

Consider the sequence {𝐳i}\left\{\mathbf{z}_{i}\right\} and let η=1/\eta=1/\ell. Further, for any 0t𝒯s0\leq t\leq\mathscr{T}_{s} we define

\displaystyle\zeta_{t}:=\mathbf{g}(\mathbf{z}_{t})-\frac{\|\mathbf{z}_{t}\|}{r_{s}}\Big(\nabla f\Big(\tilde{\mathbf{x}}+\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big)-\nabla f(\tilde{\mathbf{x}})\Big), (171)

to be the errors caused by the stochastic gradients. Then 𝐳t=𝐪h(t)𝐪sg(t)𝐪p(t)\mathbf{z}_{t}=-\mathbf{q}_{h}(t)-\mathbf{q}_{sg}(t)-\mathbf{q}_{p}(t), where:

\displaystyle\mathbf{q}_{h}(t):=\eta\sum_{\tau=0}^{t-1}(I-\eta\tilde{\mathcal{H}})^{t-1-\tau}\Delta_{\tau}\mathbf{z}_{\tau}, (172)

for Δτ=01f(ψrs𝐳τ𝐳τ)dψ~\Delta_{\tau}=\int_{0}^{1}\mathcal{H}_{f}\big{(}\psi\frac{r_{s}}{\|\mathbf{z}_{\tau}\|}\mathbf{z}_{\tau}\big{)}\mathrm{d}\psi-\tilde{\mathcal{H}}, and

𝐪sg(t):=ητ=0t1(Iη~)t1τζτ,𝐪p(t):=ητ=0t1(Iη~)t1τξτ.\displaystyle\mathbf{q}_{sg}(t):=\eta\sum_{\tau=0}^{t-1}(I-\eta\tilde{\mathcal{H}})^{t-1-\tau}\zeta_{\tau},\quad\mathbf{q}_{p}(t):=\eta\sum_{\tau=0}^{t-1}(I-\eta\tilde{\mathcal{H}})^{t-1-\tau}\xi_{\tau}. (173)
Proof.

Without loss of generality we assume 𝐱~=𝟎\tilde{\mathbf{x}}=\mathbf{0}. The update formula for 𝐳t\mathbf{z}_{t} can be written as

𝐳t+1=𝐳tη(𝐳trs(f(rs𝐳t𝐳t)f(𝟎))+ζt+ξt),\displaystyle\mathbf{z}_{t+1}=\mathbf{z}_{t}-\eta\Big{(}\frac{\|\mathbf{z}_{t}\|}{r_{s}}\Big{(}\nabla f\Big{(}\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big{)}-\nabla f(\mathbf{0})\Big{)}+\zeta_{t}+\xi_{t}\Big{)}, (174)

where

𝐳trs(f(rs𝐳t𝐳t)f(𝟎))=𝐳trs01f(ψrs𝐳t𝐳t)rs𝐳t𝐳tdψ=(~+Δt)𝐳t,\displaystyle\frac{\|\mathbf{z}_{t}\|}{r_{s}}\Big{(}\nabla f\Big{(}\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big{)}-\nabla f(\mathbf{0})\Big{)}=\frac{\|\mathbf{z}_{t}\|}{r_{s}}\int_{0}^{1}\mathcal{H}_{f}\Big{(}\psi\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\Big{)}\frac{r_{s}}{\|\mathbf{z}_{t}\|}\mathbf{z}_{t}\mathrm{d}\psi=(\tilde{\mathcal{H}}+\Delta_{t})\mathbf{z}_{t}, (175)

indicating

\displaystyle\mathbf{z}_{t+1} =(I-\eta\tilde{\mathcal{H}})\mathbf{z}_{t}-\eta(\Delta_{t}\mathbf{z}_{t}+\zeta_{t}+\xi_{t}) (176)
\displaystyle=-\eta\sum_{\tau=0}^{t}(I-\eta\tilde{\mathcal{H}})^{t-\tau}(\Delta_{\tau}\mathbf{z}_{\tau}+\zeta_{\tau}+\xi_{\tau}), (177)

which finishes the proof. ∎
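As a quick numerical sanity check of this unrolling (not part of the analysis itself), the following Python sketch iterates the recursion \mathbf{z}_{t+1}=(I-\eta\tilde{\mathcal{H}})\mathbf{z}_{t}-\eta(\Delta_{t}\mathbf{z}_{t}+\zeta_{t}+\xi_{t}) with arbitrary stand-in sequences for \Delta_{t}, \zeta_{t}, \xi_{t} and compares it against the closed form of Lemma 26; the two agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 5, 30, 0.1
H = np.diag(rng.uniform(-1.0, 1.0, size=n))  # stands in for the Hessian \tilde{H}
A = np.eye(n) - eta * H

# Arbitrary stand-in sequences for Delta_t, zeta_t, xi_t.
Deltas = [0.01 * rng.standard_normal((n, n)) for _ in range(T)]
zetas = [0.01 * rng.standard_normal(n) for _ in range(T)]
xis = [0.01 * rng.standard_normal(n) for _ in range(T)]

# Direct recursion with z_0 = 0.
z = np.zeros(n)
zs = [z.copy()]
for t in range(T):
    z = A @ z - eta * (Deltas[t] @ z + zetas[t] + xis[t])
    zs.append(z.copy())

# Unrolled form of Lemma 26: z_t = -(q_h(t) + q_sg(t) + q_p(t)).
def unrolled(t):
    total = np.zeros(n)
    for tau in range(t):
        total += np.linalg.matrix_power(A, t - 1 - tau) @ (Deltas[tau] @ zs[tau] + zetas[tau] + xis[tau])
    return -eta * total

print(max(np.linalg.norm(zs[t] - unrolled(t)) for t in range(T + 1)))  # ~1e-16
```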

At a high level, under our parameter choice in Proposition 24, 𝐪p(t)\mathbf{q}_{p}(t) is the dominating term controlling the dynamics, and 𝐪h(t)+𝐪sg(t)\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t) will be small compared to 𝐪p(t)\mathbf{q}_{p}(t). Quantitatively, this is shown in the following lemma:

Lemma 27.

Under the setting of Proposition 24 while using the notation in Lemma 12 and Lemma 26, we have

\displaystyle\Pr\Big(\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\|\leq\frac{\beta(t)\eta r_{s}\delta}{20\sqrt{n}}\cdot\frac{\sqrt{\rho\epsilon}}{16\ell},\ \forall t\leq\mathscr{T}_{s}\Big)\geq 1-\delta, (178)

where γ:=λmin(~)=ρϵ-\gamma:=\lambda_{\min}(\tilde{\mathcal{H}})=-\sqrt{\rho\epsilon}.

Proof.

Divide 𝐪p(t)\mathbf{q}_{p}(t) into two parts:

𝐪p,1(t):=𝐪p(t),𝐮1𝐮1,\displaystyle\mathbf{q}_{p,1}(t):=\left\langle\mathbf{q}_{p}(t),\mathbf{u}_{1}\right\rangle\mathbf{u}_{1}, (179)

and

𝐪p,(t):=𝐪p(t)𝐪p,1(t).\displaystyle\mathbf{q}_{p,\perp^{\prime}}(t):=\mathbf{q}_{p}(t)-\mathbf{q}_{p,1}(t). (180)

Then by Lemma 13, we have

\displaystyle\Pr\Big(\|\mathbf{q}_{p,1}(t)\|\leq\frac{\beta(t)\eta r_{s}}{\sqrt{n}}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}, (181)

and

\displaystyle\Pr\Big(\|\mathbf{q}_{p,1}(t)\|\geq\frac{\beta(t)\eta r_{s}}{20\sqrt{n}}\cdot\delta\Big)\geq 1-\delta/4. (182)

Similarly,

\displaystyle\Pr\Big(\|\mathbf{q}_{p,\perp^{\prime}}(t)\|\leq\beta_{\perp^{\prime}}(t)\eta r_{s}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}, (183)

and

\displaystyle\Pr\Big(\|\mathbf{q}_{p,\perp^{\prime}}(t)\|\geq\frac{\beta_{\perp^{\prime}}(t)\eta r_{s}}{20}\cdot\delta\Big)\geq 1-\delta/4, (184)

where β(t):=(1+ηγ/2)tηγ\beta_{\perp^{\prime}}(t):=\frac{(1+\eta\gamma/2)^{t}}{\sqrt{\eta\gamma}}. Set t:=lognηγt_{\perp^{\prime}}:=\frac{\log n}{\eta\gamma}. Then for all τt\tau\leq t_{\perp^{\prime}}, we have

β(τ)β(τ)n,\displaystyle\frac{\beta(\tau)}{\beta_{\perp^{\prime}}(\tau)}\leq\sqrt{n}, (185)

which further leads to

\displaystyle\Pr\Big(\|\mathbf{q}_{p,\perp^{\prime}}(\tau)\|\leq 2\beta_{\perp^{\prime}}(t)\eta r_{s}\cdot\sqrt{\iota}\Big)\geq 1-2e^{-\iota}. (186)

Next, we use induction to prove that the following inequality holds for all ttt\leq t_{\perp^{\prime}}:

\displaystyle\Pr\Big(\|\mathbf{q}_{h}(\tau)+\mathbf{q}_{sg}(\tau)\|\leq\beta_{\perp^{\prime}}(\tau)\eta r_{s}\cdot\frac{\delta}{20},\ \forall\tau\leq t\Big)\geq 1-10nt^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}. (187)

For the base case t=0t=0, the claim holds trivially. Suppose it holds for all τt\tau\leq t for some tt. Then due to Lemma 13, with probability at least 12teι1-2t_{\perp^{\prime}}e^{-\iota}, we have

\displaystyle\|\mathbf{z}_{t}\|\leq\|\mathbf{q}_{p}(t)\|+\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\|\leq 3\beta_{\perp^{\prime}}(t)\eta r_{s}\cdot\sqrt{\iota}. (188)

By the Hessian Lipschitz property, Δτ\Delta_{\tau} satisfies:

Δτρrs.\displaystyle\|\Delta_{\tau}\|\leq\rho r_{s}. (189)
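Indeed, every point of the form \psi\frac{r_{s}}{\|\mathbf{z}_{\tau}\|}\mathbf{z}_{\tau} with \psi\in[0,1] lies within distance r_{s} of \tilde{\mathbf{x}}=\mathbf{0}, so the \rho-Hessian Lipschitz property gives

\displaystyle\|\Delta_{\tau}\|\leq\int_{0}^{1}\Big\|\mathcal{H}_{f}\Big(\psi\frac{r_{s}}{\|\mathbf{z}_{\tau}\|}\mathbf{z}_{\tau}\Big)-\tilde{\mathcal{H}}\Big\|\mathrm{d}\psi\leq\int_{0}^{1}\rho\psi r_{s}\,\mathrm{d}\psi=\frac{\rho r_{s}}{2}\leq\rho r_{s}.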

Hence,

𝐪h(t+1)\displaystyle\|\mathbf{q}_{h}(t+1)\| ητ=0t(Iη~)tτΔτ𝐳τ\displaystyle\leq\big{\|}\eta\sum_{\tau=0}^{t}(I-\eta\tilde{\mathcal{H}})^{t-\tau}\Delta_{\tau}\mathbf{z}_{\tau}\big{\|} (190)
\displaystyle\leq\eta\rho r_{s}\sum_{\tau=0}^{t}\big\|(I-\eta\tilde{\mathcal{H}})^{t-\tau}\big\|\,\|\mathbf{z}_{\tau}\| (191)
(ηρrsn𝒯s)(3β(t)ηrs)ι\displaystyle\leq(\eta\rho r_{s}n\mathscr{T}_{s})\cdot(3\beta_{\perp^{\prime}}(t)\eta r_{s})\cdot\sqrt{\iota} (192)
β(t+1)ηrs10nδρϵ16.\displaystyle\leq\frac{\beta_{\perp^{\prime}}(t+1)\eta r_{s}}{10\sqrt{n}}\cdot\frac{\delta\sqrt{\rho\epsilon}}{16\ell}. (193)

As for 𝐪sg(t)\mathbf{q}_{sg}(t), note that ζ^τ|τ1\hat{\zeta}_{\tau}|\mathcal{F}_{\tau-1} satisfies the norm-subGaussian property defined in Definition 14. Specifically, ζ^τ|τ1nSG((+~)𝐳^τ/m)\hat{\zeta}_{\tau}|\mathcal{F}_{\tau-1}\sim\text{nSG}((\ell+\tilde{\ell})\|\hat{\mathbf{z}}_{\tau}\|/\sqrt{m}). By applying Lemma 15 with b=α2(t)η2(+~)2/mb=\alpha^{2}(t)\cdot\eta^{2}(\ell+\tilde{\ell})^{2}/m and b=α2(t)η2(+~)2η2rs2/(mn)b=\alpha^{2}(t)\eta^{2}(\ell+\tilde{\ell})^{2}\eta^{2}r_{s}^{2}/(mn), with probability at least

\displaystyle 1-4n\cdot\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)\cdot e^{-\iota}, (194)

we have

\displaystyle\|\mathbf{q}_{sg}(t+1)\|\leq\frac{\eta(\ell+\tilde{\ell})\sqrt{t}}{m}\cdot(\beta_{\perp^{\prime}}(t)\eta r_{s})\cdot\sqrt{\iota}\leq\frac{\beta_{\perp^{\prime}}(t+1)\eta r_{s}}{20}\cdot\frac{\delta\sqrt{\rho\epsilon}}{8\ell}. (195)

Then by union bound, with probability at least

\displaystyle 1-10n(t+1)^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}, (196)

we have

𝐪h(t+1)+𝐪sg(t+1)β(t+1)ηrsδ20ρϵ8,\displaystyle\|\mathbf{q}_{h}(t+1)+\mathbf{q}_{sg}(t+1)\|\leq\beta_{\perp^{\prime}}(t+1)\eta r_{s}\cdot\frac{\delta}{20}\cdot\frac{\sqrt{\rho\epsilon}}{8\ell}, (197)

indicating that (187) holds. Then with probability at least

\displaystyle 1-10nt_{\perp^{\prime}}^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}-\delta/4, (198)

we have

𝐪h(t)+𝐪sg(t)𝐪p,1(t)ρϵ16.\displaystyle\|\mathbf{q}_{h}(t_{\perp^{\prime}})+\mathbf{q}_{sg}(t_{\perp^{\prime}})\|\leq\|\mathbf{q}_{p,1}(t_{\perp^{\prime}})\|\cdot\frac{\sqrt{\rho\epsilon}}{16\ell}. (199)

Based on this, we prove that the following inequality holds for any tt𝒯st_{\perp^{\prime}}\leq t\leq\mathscr{T}_{s}:

\displaystyle\Pr\Big(\|\mathbf{q}_{h}(\tau)+\mathbf{q}_{sg}(\tau)\|\leq\frac{\beta(\tau)\eta r_{s}}{20\sqrt{n}}\cdot\frac{\delta\sqrt{\rho\epsilon}}{16\ell},\ \forall t_{\perp^{\prime}}\leq\tau\leq t\Big)\geq 1-10nt^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}. (200)

We again use induction to prove this. The base case \tau=t_{\perp^{\prime}} is guaranteed by (187). Suppose it holds for all \tau\leq t for some t. Then with probability at least 1-2te^{-\iota}, we have

\displaystyle\|\mathbf{z}_{t}\| \leq\|\mathbf{q}_{p}(t)\|+\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\| (201)
\displaystyle\leq 2\|\mathbf{q}_{p,1}(t)\|+\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\| (202)
\displaystyle\leq\frac{3\beta(t)\eta r_{s}}{\sqrt{n}}\cdot\sqrt{\iota}. (203)

Then following a similar procedure as before, we can claim that

𝐪h(t+1)+𝐪sg(t+1)β(t+1)ηrsnδ20ρϵ8,\displaystyle\|\mathbf{q}_{h}(t+1)+\mathbf{q}_{sg}(t+1)\|\leq\frac{\beta(t+1)\eta r_{s}}{\sqrt{n}}\cdot\frac{\delta}{20}\cdot\frac{\sqrt{\rho\epsilon}}{8\ell}, (204)

holds with probability

\displaystyle 1-10n(t+1)^{2}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)e^{-\iota}-\frac{\delta}{4}, (205)

indicating that (200) holds. Then under our choice of parameters, the desired inequality

𝐪h(t)+𝐪sg(t)β(t)ηrsδ20nρϵ16\displaystyle\|\mathbf{q}_{h}(t)+\mathbf{q}_{sg}(t)\|\leq\frac{\beta(t)\eta r_{s}\delta}{20\sqrt{n}}\cdot\frac{\sqrt{\rho\epsilon}}{16\ell} (206)

holds with probability at least 1δ1-\delta. ∎

Equipped with Lemma 27, we are now ready to prove Lemma 25.

Proof.

First note that under our choice of 𝒯s\mathscr{T}_{s}, we have

\displaystyle\Pr\Big(\frac{\|\mathbf{q}_{p,\perp^{\prime}}(\mathscr{T}_{s})\|}{\|\mathbf{q}_{p,1}(\mathscr{T}_{s})\|}\leq\frac{\sqrt{\rho\epsilon}}{16\ell}\Big)\geq 1-\delta. (207)

Further by Lemma 27 and union bound, with probability at least 12δ1-2\delta,

𝐪h(𝒯s)+𝐪sg(𝒯s)𝐪p(𝒯s)𝐪h(𝒯s)+𝐪sg(𝒯s)20nδβ(t)ηrsρϵ16.\displaystyle\frac{\|\mathbf{q}_{h}(\mathscr{T}_{s})+\mathbf{q}_{sg}(\mathscr{T}_{s})\|}{\|\mathbf{q}_{p}(\mathscr{T}_{s})\|}\leq\|\mathbf{q}_{h}(\mathscr{T}_{s})+\mathbf{q}_{sg}(\mathscr{T}_{s})\|\cdot\frac{20\sqrt{n}}{\delta\beta(t)\eta r_{s}}\leq\frac{\sqrt{\rho\epsilon}}{16\ell}. (208)

For the output \hat{\mathbf{e}}, observe that its component \hat{\mathbf{e}}_{\perp^{\prime}} satisfies \hat{\mathbf{e}}_{\perp^{\prime}}=\hat{\mathbf{e}}-\langle\mathbf{u}_{1},\hat{\mathbf{e}}\rangle\mathbf{u}_{1}, since in the worst case under consideration \mathbf{u}_{1} spans the subspace \mathfrak{S}_{\parallel^{\prime}}. Then with probability at least 1-3\delta,

𝐞^ρϵ/(8).\displaystyle\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\sqrt{\rho\epsilon}/(8\ell). (209)

D.1.2 Proof of Proposition 24

Based on Lemma 25, we present the proof of Proposition 24 as follows:

Proof.

By Lemma 25, the component \hat{\mathbf{e}}_{\perp^{\prime}} of the output \hat{\mathbf{e}} satisfies

𝐞^ρϵ8.\displaystyle\|\hat{\mathbf{e}}_{\perp^{\prime}}\|\leq\frac{\sqrt{\rho\epsilon}}{8\ell}. (210)

Since 𝐞^=𝐞^+𝐞^\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\parallel^{\prime}}+\hat{\mathbf{e}}_{\perp^{\prime}}, we can derive that

𝐞^1ρϵ(8)21ρϵ(8)2.\displaystyle\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|\geq\sqrt{1-\frac{\rho\epsilon}{(8\ell)^{2}}}\geq 1-\frac{\rho\epsilon}{(8\ell)^{2}}. (211)

Note that

𝐞^T~𝐞^=(𝐞^+𝐞^)T~(𝐞^+𝐞^),\displaystyle\hat{\mathbf{e}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}=(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}})^{T}\tilde{\mathcal{H}}(\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}), (212)

which, since \mathfrak{S}_{\parallel^{\prime}} and \mathfrak{S}_{\perp^{\prime}} are invariant subspaces of \tilde{\mathcal{H}} so that the cross terms vanish, simplifies to

𝐞^T~𝐞^=𝐞^T~𝐞^+𝐞^T~𝐞^.\displaystyle\hat{\mathbf{e}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}=\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\parallel^{\prime}}. (213)

Due to the \ell-smoothness of the function, every eigenvalue of the Hessian matrix has absolute value at most \ell. Hence,

\displaystyle\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\perp^{\prime}}\leq\ell\|\hat{\mathbf{e}}_{\perp^{\prime}}\|_{2}^{2}\leq\frac{\rho\epsilon}{64\ell}, (214)

whereas

𝐞^T~𝐞^ρϵ2𝐞^2.\displaystyle\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}. (215)

Combining these two inequalities together, we can obtain

\displaystyle\hat{\mathbf{e}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}} =\hat{\mathbf{e}}_{\perp^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\perp^{\prime}}+\hat{\mathbf{e}}_{\parallel^{\prime}}^{T}\tilde{\mathcal{H}}\hat{\mathbf{e}}_{\parallel^{\prime}}\leq-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}+\frac{\rho\epsilon}{64\ell}\leq-\frac{\sqrt{\rho\epsilon}}{4}. (216)
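The last inequality can be checked directly. Since \|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}=1-\|\hat{\mathbf{e}}_{\perp^{\prime}}\|^{2}\geq 1-\frac{\rho\epsilon}{64\ell^{2}}, and since we may assume \sqrt{\rho\epsilon}\leq\ell (otherwise \ell-smoothness gives \lambda_{\min}(\mathcal{H})\geq-\ell>-\sqrt{\rho\epsilon} everywhere and the second-order condition holds trivially), we have

\displaystyle-\frac{\sqrt{\rho\epsilon}}{2}\|\hat{\mathbf{e}}_{\parallel^{\prime}}\|^{2}+\frac{\rho\epsilon}{64\ell}\leq-\sqrt{\rho\epsilon}\Big(\frac{1}{2}-\frac{\rho\epsilon}{128\ell^{2}}-\frac{\sqrt{\rho\epsilon}}{64\ell}\Big)\leq-\sqrt{\rho\epsilon}\Big(\frac{1}{2}-\frac{1}{128}-\frac{1}{64}\Big)\leq-\frac{\sqrt{\rho\epsilon}}{4}.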

D.2 Proof details of escaping saddle points using Algorithm 3

In this subsection, we demonstrate that Algorithm 3 can be used to escape from saddle points in the stochastic setting. We first present the explicit Algorithm 8, and then introduce the full version of Theorem 9 with proof.

1 Input: 𝐱0n\mathbf{x}_{0}\in\mathbb{R}^{n};
2 for t=0,1,,Tt=0,1,...,T do
3       Sample {θ(1),θ(2),,θ(M)}𝒟\left\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(M)}\right\}\sim\mathcal{D};
4       𝐠(𝐱t)=1Mj=1M𝐠(𝐱t;θ(j))\mathbf{g}(\mathbf{x}_{t})=\frac{1}{M}\sum_{j=1}^{M}\mathbf{g}(\mathbf{x}_{t};\theta^{(j)});
5       if 𝐠(𝐱t)3ϵ/4\|\mathbf{g}(\mathbf{x}_{t})\|\leq 3\epsilon/4 then
6             𝐞^\hat{\mathbf{e}}\leftarrowStochasticNegativeCurvatureFinding(𝐱t,rs,𝒯s,m\mathbf{x}_{t},r_{s},\mathscr{T}_{s},m);
7             𝐱t𝐱tf𝐞^(𝐱0)4|f𝐞^(𝐱0)|ϵρ𝐞^\mathbf{x}_{t}\leftarrow\mathbf{x}_{t}-\frac{f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})}{4|f^{\prime}_{\hat{\mathbf{e}}}(\mathbf{x}_{0})|}\sqrt{\frac{\epsilon}{\rho}}\cdot\hat{\mathbf{e}};
8             Sample {θ(1),θ(2),,θ(M)}𝒟\left\{\theta^{(1)},\theta^{(2)},\cdots,\theta^{(M)}\right\}\sim\mathcal{D};
9             𝐠(𝐱t)=1Mj=1M𝐠(𝐱t;θ(j))\mathbf{g}(\mathbf{x}_{t})=\frac{1}{M}\sum_{j=1}^{M}\mathbf{g}(\mathbf{x}_{t};\theta^{(j)});
10            
11      \mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t}-\frac{1}{\ell}\mathbf{g}(\mathbf{x}_{t});
12      
Algorithm 8 Stochastic Gradient Descent with Negative Curvature Finding.
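For illustration only, here is a minimal Python sketch of the outer loop of Algorithm 8, reusing the stochastic_ncf sketch from Appendix D.1; g(x, theta) and sample_theta(rng) are the same hypothetical oracle stand-ins, and the projection of the sampled gradient onto \hat{\mathbf{e}} is used as a stand-in for the directional derivative f^{\prime}_{\hat{\mathbf{e}}}.

```python
import numpy as np

def sgd_with_ncf(g, sample_theta, x0, ell, rho, eps, M, r_s, T_s, m, T, rng):
    # Sketch of Algorithm 8: minibatch SGD steps of size 1/ell, plus a negative
    # curvature descent step of length sqrt(eps/rho)/4 whenever the sampled
    # gradient is small.
    x = x0.copy()
    for _ in range(T):
        thetas = [sample_theta(rng) for _ in range(M)]
        grad = np.mean([g(x, th) for th in thetas], axis=0)
        if np.linalg.norm(grad) <= 0.75 * eps:
            e_hat = stochastic_ncf(g, sample_theta, x, r_s, T_s, m, ell, rng)
            # Step against the sign of the (sampled) directional derivative.
            direction = 1.0 if float(grad @ e_hat) >= 0 else -1.0
            x = x - direction * 0.25 * np.sqrt(eps / rho) * e_hat
            thetas = [sample_theta(rng) for _ in range(M)]
            grad = np.mean([g(x, th) for th in thetas], axis=0)
        x = x - grad / ell
    return x
```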
Theorem 28 (Full version of Theorem 9).

Suppose that the function ff is \ell-smooth and ρ\rho-Hessian Lipschitz. For any ϵ>0\epsilon>0 and a constant 0<δs10<\delta_{s}\leq 1, we choose the parameters appearing in Algorithm 8 as

\displaystyle\delta=\frac{\delta_{s}}{2304\Delta_{f}}\sqrt{\frac{\epsilon^{3}}{\rho}},\qquad\mathscr{T}_{s}=\frac{8\ell}{\sqrt{\rho\epsilon}}\cdot\log\Big(\frac{\ell n}{\delta\sqrt{\rho\epsilon}}\Big),\qquad\iota=10\log\Big(\frac{n\mathscr{T}_{s}^{2}}{\delta}\log\Big(\frac{\sqrt{n}}{\eta r_{s}}\Big)\Big), (217)
rs\displaystyle r_{s} =δ480ρn𝒯sρϵι,\displaystyle=\frac{\delta}{480\rho n\mathscr{T}_{s}}\sqrt{\frac{\rho\epsilon}{\iota}}, m\displaystyle m =160(+~)δρϵ𝒯sι,\displaystyle=\frac{160(\ell+\tilde{\ell})}{\delta\sqrt{\rho\epsilon}}\sqrt{\mathscr{T}_{s}\iota}, M\displaystyle M =16Δfϵ2\displaystyle=\frac{16\ell\Delta_{f}}{\epsilon^{2}} (218)

where \Delta_{f}:=f(\mathbf{x}_{0})-f^{*} and f^{*} is the global minimum of f. Then, at least 1/4 of the iterates \mathbf{x}_{t} of Algorithm 8 will be \epsilon-approximate second-order stationary points, using

O~((f(𝐱0)f)ϵ4log2n)\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big{)} (219)

iterations, with probability at least 1δs1-\delta_{s}.

Proof.

Let the parameters be chosen according to (217) and (218), and set the total step number T to be:

T=max{8(f(𝐱𝟎)f)ϵ2,768(f(𝐱𝟎)f)ρϵ3}.\displaystyle T=\max\left\{\frac{8\ell(f(\mathbf{x_{0}})-f^{*})}{\epsilon^{2}},768(f(\mathbf{x_{0}})-f^{*})\cdot\sqrt{\frac{\rho}{\epsilon^{3}}}\right\}. (220)

We will show that the following two claims hold simultaneously with probability 1δs1-\delta_{s}:

  1. At most T/4 steps have gradients larger than \epsilon;

  2. Algorithm 3 can be called at most 384\Delta_{f}\sqrt{\frac{\rho}{\epsilon^{3}}} times.

Therefore, at least T/4 steps are \epsilon-approximate second-order stationary points. We prove the two claims separately.

Claim 1.

Suppose that within TT steps, we have more than T/4T/4 steps with gradients larger than ϵ\epsilon. Then with probability 1δs/21-\delta_{s}/2,

f(𝐱T)f(𝐱0)η8i=0T1f(𝐱i)2+cσ2M(T+log(1/δs))ff(𝐱0),\displaystyle f(\mathbf{x}_{T})-f(\mathbf{x}_{0})\leq-\frac{\eta}{8}\sum_{i=0}^{T-1}\|\nabla f(\mathbf{x}_{i})\|^{2}+c\cdot\frac{\sigma^{2}}{M\ell}(T+\log(1/\delta_{s}))\leq f^{*}-f(\mathbf{x}_{0}), (221)

which is a contradiction.

Claim 2.

We first assume that each time we apply negative curvature finding (Algorithm 3) at some \mathbf{x}_{t} with \lambda_{\min}(\mathcal{H}(\mathbf{x}_{t}))\leq-\sqrt{\rho\epsilon}, we successfully obtain a unit vector \hat{\mathbf{e}} with \hat{\mathbf{e}}^{T}\mathcal{H}(\mathbf{x}_{t})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4. The error probability of this assumption is bounded below.

Under this assumption, Algorithm 3 can be called at most 384(f(\mathbf{x}_{0})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\leq\frac{T}{2} times; otherwise the function value would decrease by more than f(\mathbf{x}_{0})-f^{*}, which is impossible. Then, the probability that some call to Algorithm 3 fails is upper bounded by

384(f(𝐱𝟎)f)ρϵ3(3δ)=δs/2.\displaystyle 384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot(3\delta)=\delta_{s}/2. (222)

The number of iterations can be viewed as the sum of two parts: the number of iterations needed in the large-gradient scenario, denoted by T_{1}, and the number of iterations needed for negative curvature finding, denoted by T_{2}. With probability at least 1-\delta_{s},

T1=O(MT)=O~((f(𝐱0)f)ϵ4).\displaystyle T_{1}=O(M\cdot T)=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\Big{)}. (223)

As for T2T_{2}, with probability at least 1δs1-\delta_{s}, Algorithm 3 is called for at most 384(f(𝐱𝟎)f)ρϵ3384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}} times, and by Proposition 24 it takes O~(log2nδρϵ)\tilde{O}\Big{(}\frac{\log^{2}n}{\delta\sqrt{\rho\epsilon}}\Big{)} iterations each time. Hence,

T2=384(f(𝐱𝟎)f)ρϵ3O~(log2nδρϵ)=O~((f(𝐱0)f)ϵ4log2n).\displaystyle T_{2}=384(f(\mathbf{x_{0}})-f^{*})\sqrt{\frac{\rho}{\epsilon^{3}}}\cdot\tilde{O}\Big{(}\frac{\log^{2}n}{\delta\sqrt{\rho\epsilon}}\Big{)}=\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big{)}. (224)

As a result, the total iteration number T1+T2T_{1}+T_{2} is

O~((f(𝐱0)f)ϵ4log2n).\displaystyle\tilde{O}\Big{(}\frac{(f(\mathbf{x}_{0})-f^{*})}{\epsilon^{4}}\cdot\log^{2}n\Big{)}. (225)

Appendix E More numerical experiments

In this section, we present more numerical experiment results that support our theoretical claims from a few perspectives different from those in Section 4. Specifically, considering that the previous experiments all lie in a two-dimensional space, while theoretically our algorithms have a better dependence on the problem dimension n, it is reasonable to check the actual performance of our algorithm on high-dimensional test functions; this is presented in Appendix E.1. Then in Appendix E.2, we introduce experiments on various landscapes that demonstrate the advantage of Algorithm 2 over PAGD [21]. Moreover, we compare the performance of our Algorithm 2 with the NEON+ algorithm [29] on a few test functions in Appendix E.3. To be more precise, we compare the negative curvature extracting part of NEON+ with Algorithm 2 at saddle points in different types of nonconvex landscapes.

E.1 Dimension dependence

Recall that n is the dimension of the problem. We choose the test function h(x)=\frac{1}{2}x^{T}\mathcal{H}x+\frac{1}{16}x_{1}^{4}, where \mathcal{H} is an n-by-n diagonal matrix \mathcal{H}=\text{diag}(-\epsilon,1,1,...,1). The function h(x) has a saddle point at the origin with only one negative curvature direction. Throughout the experiment, we set \epsilon=1. For the sake of comparison, the iteration numbers are chosen so that the statistics of Algorithm 1 and PGD in each category of the histogram are of similar magnitude.
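For reference, a short hypothetical sketch of this test function and its gradient is given below; it is illustrative only and not the experiment script.

```python
import numpy as np

def h(x, eps=1.0):
    # h(x) = 0.5 * x^T H x + x_1^4 / 16, with H = diag(-eps, 1, ..., 1).
    d = np.ones_like(x)
    d[0] = -eps
    return 0.5 * np.dot(d * x, x) + x[0] ** 4 / 16.0

def grad_h(x, eps=1.0):
    d = np.ones_like(x)
    d[0] = -eps
    g = d * x
    g[0] += x[0] ** 3 / 4.0  # derivative of x_1^4 / 16
    return g

# The origin is a saddle point: grad_h(0) = 0, and the Hessian there has a
# single negative eigenvalue -eps along the first coordinate.
```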

Figure 3: Dimension dependence of Algorithm 1 and PGD. We set \epsilon=0.01, r=0.1, and n=10^{p} for p=1,2,3. The iteration numbers of Algorithm 1 and PGD are set to 30p and 20p^{2}+10, respectively, and the sample size is M=100. As we can see, to maintain the same performance, the number of iterations in PGD grows faster than the number of iterations in Algorithm 1.

E.2 Comparison between Algorithm 2 and PAGD on various nonconvex landscapes

Quartic-type test function.

Consider the test function f(x1,x2)=116x1412x12+98x22f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2} with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 4.
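As with the test function in Appendix E.1, a short hypothetical sketch of this landscape and its gradient is:

```python
import numpy as np

def f_quartic(x1, x2):
    return x1 ** 4 / 16.0 - x1 ** 2 / 2.0 + 9.0 * x2 ** 2 / 8.0

def grad_f_quartic(x1, x2):
    return np.array([x1 ** 3 / 4.0 - x1, 9.0 * x2 / 4.0])

# At the saddle point (0, 0) the Hessian is diag(-1, 9/4), so there is a single
# negative curvature direction along x_1 for Algorithm 2 and PAGD to find.
```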

Figure 4: Run Algorithm 2 and PAGD on landscape f(x1,x2)=116x1412x12+98x22f(x_{1},x_{2})=\frac{1}{16}x_{1}^{4}-\frac{1}{2}x_{1}^{2}+\frac{9}{8}x_{2}^{2}. Parameters: η=0.05\eta=0.05 (step length), r=0.08r=0.08 (ball radius in PAGD and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time step tANCGD=10t_{\text{ANCGD}}=10 and tANCGD=20t_{\text{ANCGD}}=20, and red points represent samplings of PAGD at time step tPAGD=20t_{\text{PAGD}}=20 and tPAGD=40t_{\text{PAGD}}=40. Similarly to Algorithm 1, Algorithm 2 transforms an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature, and these two figures represent intermediate states of this process. It converges faster than PAGD even when tANCGDtPAGDt_{\text{ANCGD}}\ll t_{\text{PAGD}}.
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{PAGD}}=40. Although we run twice as many iterations in PAGD, there are still over 20\% of PAGD paths with function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 5\%.
Triangle-type test function.

Consider the test function f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\Big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\Big{)}^{2}-\frac{1}{2} with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 5.

Figure 5: Run Algorithm 2 and PAGD on landscape f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\Big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\Big{)}^{2}-\frac{1}{2}. Parameters: η=0.01\eta=0.01 (step length), r=0.1r=0.1 (ball radius in PAGD and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time step tANCGD=10t_{\text{ANCGD}}=10 and tANCGD=20t_{\text{ANCGD}}=20, and red points represent samplings of PAGD at time step tPAGD=40t_{\text{PAGD}}=40 and tPAGD=80t_{\text{PAGD}}=80. Algorithm 2 converges faster than PAGD even when tANCGDtPAGDt_{\text{ANCGD}}\ll t_{\text{PAGD}}.
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{PAGD}}=80. Although we run four times as many iterations in PAGD, there are still over 20\% of PAGD paths with function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 5\%.
Exponential-type test function.

Consider the test function f(x1,x2)=11+ex12+12(x2x12ex12)21f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1 with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 6.

Figure 6: Run Algorithm 2 and PAGD on landscape f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1. Parameters: \eta=0.03 (step length), r=0.1 (ball radius in PAGD and parameter r in Algorithm 2), M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Blue points represent samplings of Algorithm 2 at time steps t_{\text{ANCGD}}=10 and t_{\text{ANCGD}}=20, and red points represent samplings of PAGD at time steps t_{\text{PAGD}}=30 and t_{\text{PAGD}}=60. Algorithm 2 converges faster than PAGD even when t_{\text{ANCGD}}\ll t_{\text{PAGD}}.
Right: A histogram of descent values obtained by Algorithm 2 and PAGD, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{PAGD}}=60. Although we run three times as many iterations in PAGD, its performance is still dominated by our Algorithm 2.

Compared to the previous experiment on Algorithm 1 and PGD shown in Figure 1 in Section 4, these experiments also demonstrate the faster convergence rates enjoyed by the general family of "momentum methods": using fewer iterations, Algorithm 2 and PAGD achieve larger function value decreases than Algorithm 1 and PGD, respectively.

E.3 Comparison between Algorithm 2 and NEON+ on various nonconvex landscapes

Triangle-type test function.

Consider the test function f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\Big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\Big{)}^{2}-\frac{1}{2} with a saddle point at (0,0)(0,0). The advantage of Algorithm 2 is illustrated in Figure 7.

Figure 7: Run Algorithm 2 and NEON+ on landscape f(x1,x2)=12cos(πx1)+12(x2+cos(2πx1)12)212f(x_{1},x_{2})=\frac{1}{2}\cos(\pi x_{1})+\frac{1}{2}\big{(}x_{2}+\frac{\cos(2\pi x_{1})-1}{2}\big{)}^{2}-\frac{1}{2}. Parameters: η=0.04\eta=0.04 (step length), r=0.1r=0.1 (ball radius in NEON+ and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Red points represent samplings of NEON+ at time step tNEON=20t_{\text{NEON}}=20, and blue points represent samplings of Algorithm 2 at time step tANCGD=10t_{\text{ANCGD}}=10. Algorithm 2 and the negative curvature extracting part of NEON+ both transform an initial uniform-circle distribution into a distribution concentrating on two points indicating negative curvature. Note that Algorithm 2 converges faster than NEON+ even when tANCGDtNEONt_{\text{ANCGD}}\ll t_{\text{NEON}}.
Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set t_{\text{ANCGD}}=10 and t_{\text{NEON}}=20. Although we run twice as many iterations in NEON+, none of the NEON+ paths has a function value decrease greater than 0.95, while more than 90\% of the Algorithm 2 paths do.
Exponential-type test function.

Consider the test function f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1 with a saddle point at (0,0). The advantage of Algorithm 2 is illustrated in Figure 8.

Figure 8: Run Algorithm 2 and NEON+ on landscape f(x1,x2)=11+ex12+12(x2x12ex12)21f(x_{1},x_{2})=\frac{1}{1+e^{x_{1}^{2}}}+\frac{1}{2}\big{(}x_{2}-x_{1}^{2}e^{-x_{1}^{2}}\big{)}^{2}-1. Parameters: η=0.03\eta=0.03 (step length), r=0.1r=0.1 (ball radius in NEON+ and parameter rr in Algorithm 2), M=300M=300 (number of samplings).
Left: The contour of the landscape is placed on the background with labels being function values. Red points represent samplings of NEON+ at time step tNEON=40t_{\text{NEON}}=40, and blue points represent samplings of Algorithm 2 at time step tANCGD=20t_{\text{ANCGD}}=20. Algorithm 2 converges faster than NEON+ even when tANCGDtNEONt_{\text{ANCGD}}\ll t_{\text{NEON}}.
Right: A histogram of descent values obtained by Algorithm 2 and NEON+, respectively. Set t_{\text{ANCGD}}=20 and t_{\text{NEON}}=40. Although we run twice as many iterations in NEON+, there are still over 20\% of NEON+ paths with function value decrease no greater than 0.9, while this ratio for Algorithm 2 is less than 10\%.

Compared to the previous experiments on Algorithm 2 and PAGD in Appendix E.2, these two experiments also reveal the faster convergence rate of both NEON+ and Algorithm 2 compared to PAGD [21] in small-gradient regions.