
Comparisons Are All You Need for
Optimizing Smooth Functions

Chenyi Zhang1  Tongyang Li2,3 (corresponding author; email: [email protected])
1 Computer Science Department, Stanford University
2 Center on Frontiers of Computing Studies, Peking University, China
3 School of Computer Science, Peking University, China
Abstract

When optimizing machine learning models, there are various scenarios where gradient computations are challenging or even infeasible. Furthermore, in reinforcement learning (RL), preference-based RL that only compares between options has wide applications, including reinforcement learning with human feedback in large language models. In this paper, we systematically study optimization of a smooth function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ only assuming an oracle that compares function values at two points and tells which is larger. When $f$ is convex, we give two algorithms using $\tilde{O}(n/\epsilon)$ and $\tilde{O}(n^{2})$ comparison queries to find an $\epsilon$-optimal solution, respectively. When $f$ is nonconvex, our algorithm uses $\tilde{O}(n/\epsilon^{2})$ comparison queries to find an $\epsilon$-approximate stationary point. All these results match the best-known zeroth-order algorithms with function evaluation queries in $n$ dependence, thus suggesting that comparisons are all you need for optimizing smooth functions using derivative-free methods. In addition, we also give an algorithm for escaping saddle points and reaching an $\epsilon$-second-order stationary point of a nonconvex $f$, using $\tilde{O}(n^{1.5}/\epsilon^{2.5})$ comparison queries.

1 Introduction

Optimization is pivotal in the realm of machine learning. For instance, advancements in stochastic gradient descent (SGD) such as ADAM [25] and Adagrad [13] serve as foundational methods for the training of deep neural networks. However, there exist scenarios where gradient computations are challenging or even infeasible, such as black-box adversarial attacks on neural networks [40, 33, 8] and policy search in reinforcement learning [42, 10]. Consequently, zeroth-order optimization methods with function evaluations have gained prominence, with provable guarantees for convex optimization [14, 37] and nonconvex optimization [16, 15, 22, 20, 51, 45, 4].

Furthermore, optimization for machine learning has recently been making do with even less information. For instance, it is known that taking only the signs of gradients still enjoys good performance [32, 31, 6]. Moreover, in the breakthrough of large language models (LLMs), reinforcement learning from human feedback (RLHF) played an important role in training these LLMs, especially GPTs by OpenAI [39]. Compared to standard RL, which applies function evaluations for rewards, RLHF is preference-based RL that only compares between options and tells which is better. There is emerging research interest in preference-based RL. Refs. [9, 41, 38, 48, 52, 44] established provable guarantees for learning a near-optimal policy from preference feedback. Ref. [46] proved that for a wide range of preference models, preference-based RL can be solved with small or no extra costs compared to those of standard reward-based RL.

In this paper, we systematically study optimization of smooth functions using comparisons. Specifically, for a function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, we define the comparison oracle of $f$ as $O_{f}^{\operatorname{Comp}}\colon\mathbb{R}^{n}\times\mathbb{R}^{n}\to\{-1,1\}$ such that

O_{f}^{\operatorname{Comp}}(\mathbf{x},\mathbf{y})=\begin{cases}1&\text{if }f(\mathbf{x})\geq f(\mathbf{y})\\ -1&\text{if }f(\mathbf{x})\leq f(\mathbf{y})\end{cases}.   (1)

(When $f(\mathbf{x})=f(\mathbf{y})$, outputting either $1$ or $-1$ is acceptable.) We consider an $L$-smooth function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, defined as

\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|\leq L\|\mathbf{x}-\mathbf{y}\|\quad\forall\,\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}.

Furthermore, we say $f$ is $\rho$-Hessian Lipschitz if

\|\nabla^{2}f(\mathbf{x})-\nabla^{2}f(\mathbf{y})\|\leq\rho\|\mathbf{x}-\mathbf{y}\|\quad\forall\,\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}.
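To make the query model concrete, the following is a minimal Python sketch (ours, not from the paper) of a comparison oracle wrapped around an explicit test function; the quadratic $f$ below is purely illustrative.

```python
import numpy as np

def make_comparison_oracle(f):
    """Wrap a function f: R^n -> R into the comparison oracle of Eq. (1)."""
    def oracle(x, y):
        # Returns 1 if f(x) >= f(y), and -1 otherwise; ties may return either value.
        return 1 if f(x) >= f(y) else -1
    return oracle

# Example: a smooth convex quadratic, used only for illustration.
A = np.diag([1.0, 2.0, 3.0])
f = lambda x: 0.5 * x @ A @ x
comp = make_comparison_oracle(f)
print(comp(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.1])))  # prints 1
```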

In terms of the goal of optimization, we define:

  • $\mathbf{x}\in\mathbb{R}^{n}$ is an $\epsilon$-optimal point if $f(\mathbf{x})\leq f^{*}+\epsilon$, where $f^{*}\coloneqq\inf_{\mathbf{x}}f(\mathbf{x})$.

  • $\mathbf{x}\in\mathbb{R}^{n}$ is an $\epsilon$-first-order stationary point ($\epsilon$-FOSP) if $\|\nabla f(\mathbf{x})\|\leq\epsilon$.

  • $\mathbf{x}\in\mathbb{R}^{n}$ is an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) if $\|\nabla f(\mathbf{x})\|\leq\epsilon$ and $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\geq-\sqrt{\rho\epsilon}$ (this is a standard definition in the nonconvex optimization literature on escaping saddle points and reaching approximate second-order stationary points; see for instance [36, 11, 1, 7, 23, 2, 47, 51, 50]).

Our main results can be listed as follows:

  • For an $L$-smooth convex $f$, Theorem 2 finds an $\epsilon$-optimal point in $O(nL/\epsilon\cdot\log(nL/\epsilon))$ comparisons.

  • For an $L$-smooth convex $f$, Theorem 3 finds an $\epsilon$-optimal point in $O(n^{2}\log(nL/\epsilon))$ comparisons.

  • For an $L$-smooth $f$, Theorem 4 finds an $\epsilon$-FOSP using $O(Ln\log n/\epsilon^{2})$ comparisons.

  • For an $L$-smooth, $\rho$-Hessian Lipschitz $f$, Theorem 5 finds an $\epsilon$-SOSP in $\tilde{O}(n^{1.5}/\epsilon^{2.5})$ comparisons.

Intuitively, our results can be described as comparisons are all you need for derivative-free methods: For finding an approximate minimum of a convex function, the state-of-the-art zeroth-order methods with full function evaluations have query complexities $O(n/\sqrt{\epsilon})$ [37] or $\tilde{O}(n^{2})$ [28], which are matched in $n$ by our Theorem 2 and Theorem 3 using comparisons, respectively. For finding an approximate stationary point of a nonconvex function, the state-of-the-art zeroth-order result has query complexity $O(n/\epsilon^{2})$ [15], which is matched by our Theorem 4 up to a logarithmic factor. In other words, in derivative-free scenarios for optimizing smooth functions, what matters is not the function values per se but their comparisons, which indicate the direction in which the function decreases.

Among the literature on derivative-free optimization methods [27], direct search methods [26] proceed by comparing function values, including the directional direct search method [3] and the Nelder-Mead method [34] as examples. However, the directional direct search method does not have a known rate of convergence, while the Nelder-Mead method may fail to converge to a stationary point for smooth functions [12]. As far as we know, the most relevant result is by Bergou et al. [5], who proposed the stochastic three points (STP) method and found an $\epsilon$-optimal point of a convex function and an $\epsilon$-FOSP of a nonconvex function in $\tilde{O}(n/\epsilon)$ and $\tilde{O}(n/\epsilon^{2})$ comparisons, respectively. STP also has a version with momentum [18]. Our Theorem 2 and Theorem 4 can be seen as rediscoveries of these results using different methods. However, for comparison-based convex optimization with $\operatorname{poly}(\log 1/\epsilon)$ dependence, Ref. [19] achieved this for strongly convex functions, and the state-of-the-art result for general convex optimization, by Karabag et al. [24], takes $\tilde{O}(n^{4})$ comparison queries. Their algorithm applies the ellipsoid method, which has $\tilde{O}(n^{2})$ iterations, and each iteration takes $\tilde{O}(n^{2})$ comparisons to construct the ellipsoid. This $\tilde{O}(n^{4})$ bound is noticeably worse than our Theorem 3. As far as we know, our Theorem 5 is the first provable guarantee for finding an $\epsilon$-SOSP of a nonconvex function by comparisons.

Techniques.

Our first technical contribution is Theorem 1, which for a point $\mathbf{x}$ estimates the direction of $\nabla f(\mathbf{x})$ within precision $\delta$. This is achieved by Algorithm 2, named Comparison-GDE (GDE is the acronym for gradient direction estimation). It is built upon a directional preference subroutine (Algorithm 1), which takes as input a unit vector $\mathbf{v}\in\mathbb{R}^{n}$ and a precision parameter $\Delta>0$, and outputs whether $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\geq-\Delta$ or $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\leq\Delta$ using the value of the comparison oracle $O_{f}^{\operatorname{Comp}}(\mathbf{x}+\frac{2\Delta}{L}\mathbf{v},\mathbf{x})$. Comparison-GDE then has three phases:

  • First, it sets $\mathbf{v}$ to be all standard basis directions $\mathbf{e}_{i}$ to determine the signs of all $\nabla_{i}f(\mathbf{x})$ (up to $\Delta$).

  • It then sets $\mathbf{v}$ as $\frac{1}{\sqrt{2}}(\mathbf{e}_{i}-\mathbf{e}_{j})$, which can determine whether $|\nabla_{i}f(\mathbf{x})|$ or $|\nabla_{j}f(\mathbf{x})|$ is larger (up to $\Delta$). Starting with $\mathbf{e}_{1}$ and $\mathbf{e}_{2}$ and iterating, it finds the $i^{*}$ with the largest $|\nabla_{i^{*}}f(\mathbf{x})|$ (up to $\Delta$).

  • Finally, for each $i\neq i^{*}$, it sets $\mathbf{v}$ to have the form $\frac{1}{\sqrt{1+\alpha_{i}^{2}}}(\alpha_{i}\mathbf{e}_{i^{*}}-\mathbf{e}_{i})$ and applies binary search to find the value of $\alpha_{i}$ such that $\alpha_{i}|\nabla_{i^{*}}f(\mathbf{x})|$ equals $|\nabla_{i}f(\mathbf{x})|$ up to enough precision.

Comparison-GDE outputs $\bm{\alpha}/\|\bm{\alpha}\|$ as the gradient direction estimate, where $\bm{\alpha}=(\alpha_{1},\ldots,\alpha_{n})^{\top}$ with $\alpha_{i^{*}}=1$. In total it uses $O(n\log(n/\delta))$ comparison queries, with the main cost coming from the binary searches in the last step (the first two steps each take at most $n$ comparisons).

We then leverage Comparison-GDE to solve various optimization problems. In convex optimization, we develop two algorithms that find an $\epsilon$-optimal point in Section 3.1 and Section 3.2, respectively. Our first algorithm is a specialization of the adaptive version of normalized gradient descent (NGD) introduced in [30], where we replace the normalized gradient query in their algorithm by Comparison-GDE. It is a natural choice to apply gradient direction estimation to normalized gradient descent, given that the comparison model only allows us to estimate the gradient direction without providing information about its norm. Note that Ref. [5] also discussed NGD, but their algorithm using NGD still needs the full gradient and cannot be directly implemented by comparisons. Our second algorithm builds upon the framework of cutting plane methods, where we show that the output of Comparison-GDE is a valid separation oracle as long as it is accurate enough.

In nonconvex optimization, we develop two algorithms that find an $\epsilon$-FOSP and an $\epsilon$-SOSP, respectively, in Section 4.1 and Section 4.2. Our algorithm for finding an $\epsilon$-FOSP is a specialization of the NGD algorithm, where the normalized gradient is given by Comparison-GDE. Our algorithm for finding an $\epsilon$-SOSP uses a similar approach as the corresponding first-order methods by [2, 47] and proceeds in rounds, where we alternately apply NGD and negative curvature descent to ensure that the function value has a large decrease if more than $1/9$ of the iterations in a round are not $\epsilon$-SOSPs. The normalized gradient descent part is essentially the same as our algorithm for finding an $\epsilon$-FOSP in Section 4.1. The negative curvature descent part with comparison information, however, is much more technically involved. In particular, previous first-order methods [2, 47, 49] all contain a subroutine that finds a negative curvature direction near a saddle point $\mathbf{x}$ with $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon}$. One crucial step in this subroutine is to approximate the Hessian-vector product $\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}$ for some unit vector $\mathbf{y}\in\mathbb{R}^{n}$ by taking the difference between $\nabla f(\mathbf{x}+r\mathbf{y})$ and $\nabla f(\mathbf{x})$, where $r$ is a very small parameter. However, this is infeasible in the comparison model, which only allows us to estimate the gradient direction without providing information about its norm. Instead, we find the directions of $\nabla f(\mathbf{x})$, $\nabla f(\mathbf{x}+r\mathbf{y})$, and $\nabla f(\mathbf{x}-r\mathbf{y})$ by Comparison-GDE, and we determine the direction of $\nabla f(\mathbf{x}+r\mathbf{y})-\nabla f(\mathbf{x})$ using the fact that the segment it cuts between the rays along $\nabla f(\mathbf{x})$ and $\nabla f(\mathbf{x}+r\mathbf{y})$ and the segment it cuts between the rays along $\nabla f(\mathbf{x})$ and $\nabla f(\mathbf{x}-r\mathbf{y})$ have the same length (see Figure 1).

Figure 1: The intuition of Algorithm 10 for computing Hessian-vector products using gradient directions. [Figure omitted: it depicts, at a point $\mathbf{x}$, the unit vectors $\frac{\nabla f(\mathbf{x}+r\mathbf{y})}{\|\nabla f(\mathbf{x}+r\mathbf{y})\|}$, $\frac{\nabla f(\mathbf{x}-r\mathbf{y})}{\|\nabla f(\mathbf{x}-r\mathbf{y})\|}$, $\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}$, and the direction of $\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}$.]
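The following Python sketch illustrates one way such a geometric fact can be exploited, via the identity of Lemma 7 in Appendix A.2: from the three unit directions alone one can recover the ratio $\|\nabla f(\mathbf{x}-r\mathbf{y})\|/\|\nabla f(\mathbf{x}+r\mathbf{y})\|$ and hence the direction of $\nabla f(\mathbf{x}+r\mathbf{y})-\nabla f(\mathbf{x}-r\mathbf{y})\approx 2r\nabla^{2}f(\mathbf{x})\mathbf{y}$. This is our illustrative reconstruction on a quadratic (where the relation is exact), not a transcription of the paper's Algorithm 10, and all names below are ours.

```python
import numpy as np

def hv_direction_from_unit_gradients(u0, u_plus, u_minus):
    """Given u0 = v/||v||, u_plus = (v+g)/||v+g||, u_minus = (v-g)/||v-g||,
    return the direction of g. By Lemma 7,
        ||v-g|| / ||v+g|| = sqrt((1 - <u_plus, u0>^2) / (1 - <u_minus, u0>^2)),
    and since 2g = ||v+g|| * u_plus - ||v-g|| * u_minus, the direction of g is
    proportional to u_plus - ratio * u_minus."""
    ratio = np.sqrt((1 - np.dot(u_plus, u0) ** 2) / (1 - np.dot(u_minus, u0) ** 2))
    d = u_plus - ratio * u_minus
    return d / np.linalg.norm(d)

# Illustration on a quadratic f(x) = 0.5 x^T A x + b^T x, where gradients are linear.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2
b = rng.standard_normal(5)
x, y, r = rng.standard_normal(5), rng.standard_normal(5), 1e-3
grad = lambda z: A @ z + b
v, g = grad(x), r * (A @ y)          # here g = grad(x + r*y) - grad(x) exactly
unit = lambda w: w / np.linalg.norm(w)
d = hv_direction_from_unit_gradients(unit(v), unit(v + g), unit(v - g))
print(np.linalg.norm(d - unit(A @ y)))  # ≈ 0
```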
Open questions.

Our work leaves several natural directions for future investigation:

  • Can we give comparison-based optimization algorithms based on accelerated gradient descent (AGD) methods? This is challenging because AGD requires carefully chosen step sizes, but with comparisons we can only learn gradient directions and not the norms of gradients. This is also the main reason why the $1/\epsilon$ dependence in our Theorem 2 and Theorem 5 is worse than that of [37] and [50], which use evaluations in their respective settings.

  • Can we improve our result for finding second-order stationary points in nonconvex optimization? Compared to gradient-based methods that choose the step size in negative curvature finding [2, 47], our comparison-based perturbed normalized gradient descent (Algorithm 5) can only utilize gradient directions and has no information about gradient norms, resulting in a fixed and conservative step size and in total $\tilde{O}(\sqrt{n}/\epsilon)$ iterations.

  • Can we apply our algorithms to machine learning? Ref. [44] made attempts on preference-based RL, and it is worth further exploring whether we can prove more theoretical results for preference-based RL and other machine learning settings. It would also be of general interest to see whether our results can provide theoretical justification for quantization in neural networks [17].

Notations.

We use bold letters, e.g., $\mathbf{x}$, $\mathbf{y}$, to denote vectors and capital letters, e.g., $A$, $B$, to denote matrices. We use $\|\cdot\|$ to denote the Euclidean norm ($\ell_{2}$-norm) and $\mathcal{S}^{n-1}$ to denote the unit sphere in $\mathbb{R}^{n}$, i.e., $\mathcal{S}^{n-1}:=\{\mathbf{x}\in\mathbb{R}^{n}:\|\mathbf{x}\|=1\}$. We denote $\mathbb{B}_{R}(\mathbf{x})\coloneqq\{\mathbf{y}\in\mathbb{R}^{n}\colon\|\mathbf{y}-\mathbf{x}\|\leq R\}$ and $[T]\coloneqq\{0,1,\ldots,T\}$. For a convex set $\mathcal{K}\subseteq\mathbb{R}^{n}$, its diameter is defined as $D\coloneqq\sup_{\mathbf{x},\mathbf{y}\in\mathcal{K}}\|\mathbf{x}-\mathbf{y}\|$ and its projection operator $\Pi_{\mathcal{K}}$ is defined as

\Pi_{\mathcal{K}}(\mathbf{x})\coloneqq\mathrm{argmin}_{\mathbf{y}\in\mathcal{K}}\|\mathbf{x}-\mathbf{y}\|,\quad\forall\mathbf{x}\in\mathbb{R}^{n}.

2 Estimation of Gradient Direction by Comparisons

First, we show that given a point $\mathbf{x}\in\mathbb{R}^{n}$ and a direction $\mathbf{v}\in\mathbb{R}^{n}$, we can use one comparison query to determine whether the inner product $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle$ is roughly positive or negative. Intuitively, this inner product determines whether $\mathbf{x}+\mathbf{v}$ is following or against the direction of $\nabla f(\mathbf{x})$, also known as directional preference (DP) in [24].

Lemma 1.

Given a point $\mathbf{x}\in\mathbb{R}^{n}$, a unit vector $\mathbf{v}\in\mathbb{B}_{1}(0)$, and a precision $\Delta>0$ for directional preference, Algorithm 1 is correct:

  • If $O_{f}^{\operatorname{Comp}}(\mathbf{x}+\frac{2\Delta}{L}\mathbf{v},\mathbf{x})=1$, then $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\geq-\Delta$.

  • If $O_{f}^{\operatorname{Comp}}(\mathbf{x}+\frac{2\Delta}{L}\mathbf{v},\mathbf{x})=-1$, then $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\leq\Delta$.

Input: Comparison oracle $O_{f}^{\operatorname{Comp}}$ of $f\colon\mathbb{R}^{n}\to\mathbb{R}$, $\mathbf{x}\in\mathbb{R}^{n}$, unit vector $\mathbf{v}\in\mathbb{B}_{1}(0)$, $\Delta>0$
1 if $O_{f}^{\operatorname{Comp}}(\mathbf{x}+\frac{2\Delta}{L}\mathbf{v},\mathbf{x})=1$ then
2      return "$\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\geq-\Delta$"
3 else (in this case $O_{f}^{\operatorname{Comp}}(\mathbf{x}+\frac{2\Delta}{L}\mathbf{v},\mathbf{x})=-1$)
4      return "$\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\leq\Delta$"
Algorithm 1 DP($\mathbf{x}$, $\mathbf{v}$, $\Delta$)
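As a concrete illustration, here is a minimal Python sketch of the DP subroutine on top of the comparison oracle sketched above; the smoothness constant $L$ is assumed known, and all names are illustrative.

```python
def dp(comp, L, x, v, Delta):
    """Directional preference (Algorithm 1): one comparison query decides whether
    <grad f(x), v> >= -Delta or <grad f(x), v> <= Delta, for a unit vector v."""
    if comp(x + (2.0 * Delta / L) * v, x) == 1:
        return +1   # certifies <grad f(x), v> >= -Delta
    else:
        return -1   # certifies <grad f(x), v> <= Delta
```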
Proof.

Since $f$ is an $L$-smooth differentiable function,

|f(\mathbf{y})-f(\mathbf{x})-\langle\nabla f(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle|\leq\frac{1}{2}L\|\mathbf{y}-\mathbf{x}\|^{2}

for any $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$. Taking $\mathbf{y}=\mathbf{x}+\frac{2\Delta}{L}\mathbf{v}$, this gives

\left|f(\mathbf{y})-f(\mathbf{x})-\frac{2\Delta}{L}\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\right|\leq\frac{1}{2}L\left(\frac{2\Delta}{L}\right)^{2}=\frac{2\Delta^{2}}{L}.

Therefore, if $O_{f}^{\operatorname{Comp}}(\mathbf{y},\mathbf{x})=1$, i.e., $f(\mathbf{y})\geq f(\mathbf{x})$,

\frac{2\Delta}{L}\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\geq\frac{2\Delta}{L}\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle+f(\mathbf{x})-f(\mathbf{y})\geq-\frac{2\Delta^{2}}{L}

and hence $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\geq-\Delta$. On the other hand, if $O_{f}^{\operatorname{Comp}}(\mathbf{y},\mathbf{x})=-1$, i.e., $f(\mathbf{y})\leq f(\mathbf{x})$,

\frac{2\Delta}{L}\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\leq f(\mathbf{y})-f(\mathbf{x})+\frac{2\Delta^{2}}{L}\leq\frac{2\Delta^{2}}{L}

and hence $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\leq\Delta$. ∎

Now, we prove that we can use $\tilde{O}(n)$ comparison queries to approximate the direction of the gradient at a point, which is one of our main technical contributions.

Theorem 1.

For an $L$-smooth function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ and a point $\mathbf{x}\in\mathbb{R}^{n}$, Algorithm 2 outputs an estimate $\tilde{\mathbf{g}}(\mathbf{x})$ of the direction of $\nabla f(\mathbf{x})$ using $O(n\log(n/\delta))$ queries to the comparison oracle $O_{f}^{\operatorname{Comp}}$ of $f$ (Eq. (1)) that satisfies

\left\|\tilde{\mathbf{g}}(\mathbf{x})-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|\leq\delta

if we are given a parameter $\gamma>0$ such that $\|\nabla f(\mathbf{x})\|\geq\gamma$.

Input: Comparison oracle $O_{f}^{\operatorname{Comp}}$ of $f\colon\mathbb{R}^{n}\to\mathbb{R}$, precision $\delta$, lower bound $\gamma$ on $\|\nabla f(\mathbf{x})\|$
1 Set $\Delta\leftarrow\delta\gamma/(4n^{3/2})$. Denote $\nabla f(\mathbf{x})=(g_{1},\ldots,g_{n})^{\top}$
2
3 Call Algorithm 1 with inputs $(\mathbf{x},\mathbf{e}_{1},\Delta),\ldots,(\mathbf{x},\mathbf{e}_{n},\Delta)$, where $\mathbf{e}_{i}$ is the $i^{\text{th}}$ standard basis vector with $i^{\text{th}}$ coordinate being 1 and others being 0. This determines whether $g_{i}\geq-\Delta$ or $g_{i}\leq\Delta$ for each $i\in[n]$. WLOG
    g_{i}\geq-\Delta\quad\forall i\in[n]   (2)
(otherwise take a minus sign for the $i^{\text{th}}$ coordinate)
4
5 We next find the approximately largest one among $g_{1},\ldots,g_{n}$. Call Algorithm 1 with input $(\mathbf{x},\frac{1}{\sqrt{2}}(\mathbf{e}_{1}-\mathbf{e}_{2}),\Delta)$. This determines whether $g_{1}\geq g_{2}-\sqrt{2}\Delta$ or $g_{2}\geq g_{1}-\sqrt{2}\Delta$. If the former, call Algorithm 1 with input $(\mathbf{x},\frac{1}{\sqrt{2}}(\mathbf{e}_{1}-\mathbf{e}_{3}),\Delta)$. If the latter, call Algorithm 1 with input $(\mathbf{x},\frac{1}{\sqrt{2}}(\mathbf{e}_{2}-\mathbf{e}_{3}),\Delta)$. Iterate this until $\mathbf{e}_{n}$; we find the $i^{*}\in[n]$ such that
    g_{i^{*}}\geq\max_{i\in[n]}g_{i}-\sqrt{2}\Delta   (3)
6 for $i=1$ to $i=n$ (except $i=i^{*}$) do
7      Initialize $\alpha_{i}\leftarrow 1/2$
8      Apply binary search to $\alpha_{i}$ in $\lceil\log_{2}(\gamma/\Delta)+1\rceil$ iterations by calling Algorithm 1 with input $(\mathbf{x},\frac{1}{\sqrt{1+\alpha_{i}^{2}}}(\alpha_{i}\mathbf{e}_{i^{*}}-\mathbf{e}_{i}),\Delta)$. For the first iteration with $\alpha_{i}=1/2$, if the output certifies $\alpha_{i}g_{i^{*}}-g_{i}\geq-\sqrt{2}\Delta$ we then take $\alpha_{i}=1/4$; if it certifies $\alpha_{i}g_{i^{*}}-g_{i}\leq\sqrt{2}\Delta$ we then take $\alpha_{i}=3/4$. Later iterations are similar. Upon finishing the binary search, $\alpha_{i}$ satisfies
    g_{i}-\sqrt{2}\Delta\leq\alpha_{i}g_{i^{*}}\leq g_{i}+\sqrt{2}\Delta   (4)
return $\tilde{\mathbf{g}}(\mathbf{x})=\frac{\bm{\alpha}}{\|\bm{\alpha}\|}$ where $\bm{\alpha}=(\alpha_{1},\ldots,\alpha_{n})^{\top}$, $\alpha_{i}$ ($i\neq i^{*}$) is the output of the for loop, and $\alpha_{i^{*}}=1$
Algorithm 2 Comparison-based Gradient Direction Estimation (Comparison-GDE($\mathbf{x},\delta,\gamma$))
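For intuition, the following Python sketch mirrors the structure of Comparison-GDE, reusing the `comp` oracle and `dp` subroutine sketched above. It is a simplified illustration under our own assumptions (fixed binary-search depth, no edge-case handling), not a verbatim transcription of Algorithm 2.

```python
import numpy as np

def comparison_gde(comp, L, x, delta, gamma):
    """Estimate grad f(x)/||grad f(x)|| with O(n log(n/delta)) comparisons,
    assuming ||grad f(x)|| >= gamma (simplified sketch of Algorithm 2)."""
    n = len(x)
    Delta = delta * gamma / (4 * n ** 1.5)
    e = np.eye(n)

    # Phase 1: signs of each coordinate of the gradient (up to Delta).
    signs = np.array([dp(comp, L, x, e[i], Delta) for i in range(n)], dtype=float)

    # Phase 2: tournament to find i* with (approximately) the largest |g_i|.
    i_star = 0
    for i in range(1, n):
        v = (signs[i_star] * e[i_star] - signs[i] * e[i]) / np.sqrt(2)
        if dp(comp, L, x, v, Delta) == -1:   # evidence that |g_i| is larger
            i_star = i

    # Phase 3: binary search for alpha_i with alpha_i * |g_{i*}| ≈ |g_i|.
    alpha = np.ones(n)
    iters = int(np.ceil(np.log2(gamma / Delta) + 1))
    for i in range(n):
        if i == i_star:
            continue
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            a = (lo + hi) / 2
            v = (a * signs[i_star] * e[i_star] - signs[i] * e[i]) / np.sqrt(1 + a * a)
            if dp(comp, L, x, v, Delta) == 1:   # a*|g_{i*}| >= |g_i| - sqrt(2)*Delta
                hi = a                           # a is (approximately) large enough
            else:
                lo = a
        alpha[i] = (lo + hi) / 2

    g_hat = signs * alpha
    return g_hat / np.linalg.norm(g_hat)
```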
Proof.

The correctness of (2) and (3) follows directly from the arguments in Line 3 and Line 5, respectively. For Line 8, since $\alpha_{i}\leq 1$ for any $i\in[n]$, the binary search can be regarded as having bins with interval lengths $\sqrt{1+\alpha_{i}^{2}}\Delta\leq\sqrt{2}\Delta$, and when the binary search ends Eq. (4) is satisfied. Furthermore, Eq. (4) can be written as

\left|\alpha_{i}-\frac{g_{i}}{g_{i^{*}}}\right|\leq\frac{\sqrt{2}\Delta}{g_{i^{*}}}\leq\frac{2\Delta\sqrt{n}}{\gamma}.

This is because $\|\nabla f(\mathbf{x})\|=\|(g_{1},\ldots,g_{n})^{\top}\|\geq\gamma$ implies $\max_{i\in[n]}g_{i}\geq\gamma/\sqrt{n}$, and together with (3) we have $g_{i^{*}}\geq\gamma/\sqrt{n}-\sqrt{2}\Delta\geq\gamma/\sqrt{2n}$ because $\Delta\leq\gamma/4\sqrt{n}$.

We now estimate $\left\|\tilde{\mathbf{g}}(\mathbf{x})-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|$. Note that $\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}=\frac{\nabla f(\mathbf{x})/g_{i^{*}}}{\|\nabla f(\mathbf{x})/g_{i^{*}}\|}$ and $\tilde{\mathbf{g}}(\mathbf{x})=\bm{\alpha}/\|\bm{\alpha}\|$. Moreover,

\left\|\bm{\alpha}-\frac{\nabla f(\mathbf{x})}{g_{i^{*}}}\right\|\leq\sum_{i=1}^{n}\left|\alpha_{i}-\frac{g_{i}}{g_{i^{*}}}\right|\leq\frac{2\Delta\sqrt{n}(n-1)}{\gamma}.

By Lemma 5 (which bounds the distance between normalized vectors) and the fact that $\|\bm{\alpha}\|\geq 1$,

\left\|\tilde{\mathbf{g}}(\mathbf{x})-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|=\left\|\frac{\bm{\alpha}}{\|\bm{\alpha}\|}-\frac{\nabla f(\mathbf{x})/g_{i^{*}}}{\|\nabla f(\mathbf{x})/g_{i^{*}}\|}\right\|\leq\frac{4\Delta n^{3/2}}{\gamma}\leq\delta.

Thus the correctness has been established. For the query complexity, Line 3 takes $n$ queries, Line 5 takes $n-1$ queries, and Line 8 throughout the for loop takes $(n-1)\lceil\log_{2}(\gamma/\sqrt{2}\Delta)+1\rceil=O(n\log(n/\delta))$ queries to the comparison oracle, given that each $\alpha_{i}$ is within the range $[0,1]$ and we approximate it to accuracy $\sqrt{2}\Delta/g_{i^{*}}\geq\sqrt{2}\Delta/\gamma$. This finishes the proof. ∎

3 Convex Optimization by Comparisons

In this section, we study convex optimization with function value comparisons:

Problem 1 (Comparison-based convex optimization).

In the comparison-based convex optimization (CCO) problem we are given query access to a comparison oracle $O_{f}^{\operatorname{Comp}}$ (1) for an $L$-smooth convex function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ whose minimum is achieved at $\mathbf{x}^{*}$ with $\|\mathbf{x}^{*}\|\leq R$. The goal is to output a point $\tilde{\mathbf{x}}$ such that $\|\tilde{\mathbf{x}}\|\leq R$ and $f(\tilde{\mathbf{x}})-f(\mathbf{x}^{*})\leq\epsilon$, i.e., $\tilde{\mathbf{x}}$ is an $\epsilon$-optimal point.

We provide two algorithms that solve Problem 1. In Section 3.1, we use normalized gradient descent to achieve linear dependence on $n$ (up to a log factor) in terms of comparison queries. In Section 3.2, we use a cutting plane method to achieve $\log(1/\epsilon)$ dependence in terms of comparison queries.

3.1 Comparison-based adaptive normalized gradient descent

In this subsection, we present our first algorithm for Problem 1, Algorithm 3, which plugs the gradient direction estimated by Comparison-GDE (Algorithm 2) at each iteration into adaptive normalized gradient descent (AdaNGD), originally introduced by [30].

Input: Function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, precision $\epsilon$, radius $R$
1 $T\leftarrow\frac{64LR^{2}}{\epsilon}$, $\delta\leftarrow\frac{1}{4R}\sqrt{\frac{\epsilon}{2L}}$, $\gamma\leftarrow\frac{\epsilon}{2R}$, $\mathbf{x}_{0}\leftarrow\mathbf{0}$
2 for $t=0,\ldots,T-1$ do
3      $\hat{\mathbf{g}}_{t}\leftarrow$ Comparison-GDE($\mathbf{x}_{t},\delta,\gamma$)
4      $\eta_{t}\leftarrow R\sqrt{2/t}$
5      $\mathbf{x}_{t+1}=\Pi_{\mathbb{B}_{R}(\mathbf{0})}(\mathbf{x}_{t}-\eta_{t}\hat{\mathbf{g}}_{t})$
6
7 $t_{\mathrm{out}}\leftarrow\mathrm{argmin}_{t\in[T]}f(\mathbf{x}_{t})$
return $\mathbf{x}_{t_{\mathrm{out}}}$
Algorithm 3 Comparison-based Approximate Adaptive Normalized Gradient Descent (Comparison-AdaNGD)
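A compact Python sketch of this loop, reusing the `comparison_gde` routine sketched in Section 2 (illustrative only; the step-size schedule uses $t+1$ to avoid the $t=0$ corner case, and the final argmin is realized by pairwise comparisons):

```python
import numpy as np

def comparison_adangd(comp, L, eps, R, n):
    """Sketch of Algorithm 3: projected normalized steps using comparison-estimated
    gradient directions; the best iterate is selected by pairwise comparisons."""
    T = int(64 * L * R ** 2 / eps)
    delta = np.sqrt(eps / (2 * L)) / (4 * R)
    gamma = eps / (2 * R)
    x = np.zeros(n)
    iterates = [x.copy()]
    for t in range(T):
        g_hat = comparison_gde(comp, L, x, delta, gamma)
        eta = R * np.sqrt(2.0 / (t + 1))
        x = x - eta * g_hat
        if np.linalg.norm(x) > R:               # projection onto the ball B_R(0)
            x = x * (R / np.linalg.norm(x))
        iterates.append(x.copy())
    best = iterates[0]
    for y in iterates[1:]:                      # argmin via T comparison queries
        if comp(best, y) == 1:
            best = y
    return best
```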
Theorem 2.

Algorithm 3 solves Problem 1 using $O\big(nLR^{2}/\epsilon\cdot\log(nLR^{2}/\epsilon)\big)$ queries.

The following result bounds the rate at which Algorithm 3 decreases the function value of $f$.

Lemma 2.

In the setting of Problem 1, Algorithm 3 satisfies

\min_{t\in[T]}f(\mathbf{x}_{t})-f^{*}\leq 2L(2R\sqrt{2T}+2T\delta R)^{2}/T^{2},

if at each step we have

\left\|\tilde{\mathbf{g}}_{t}-\frac{\nabla f_{t}(\mathbf{x}_{t})}{\|\nabla f_{t}(\mathbf{x}_{t})\|}\right\|\leq\delta\leq 1.

The proof of Lemma 2 is deferred to Appendix B. We now prove Theorem 2 using Lemma 2.

Proof of Theorem 2.

We show that Algorithm 3 solves Problem 1 by contradiction. Assume that the output of Algorithm 3 is not an $\epsilon$-optimal point of $f$, or equivalently, $f(\mathbf{x}_{t})-f^{*}\geq\epsilon$ for every $t\in[T]$. Given that $f$ is convex, this leads to

\|\nabla f(\mathbf{x}_{t})\|\geq\frac{f(\mathbf{x}_{t})-f^{*}}{\|\mathbf{x}_{t}-\mathbf{x}^{*}\|}\geq\frac{\epsilon}{2R},\quad\forall t\in[T].

Hence, Theorem 1 promises that

\left\|\hat{\mathbf{g}}_{t}-\frac{\nabla f(\mathbf{x}_{t})}{\|\nabla f(\mathbf{x}_{t})\|}\right\|\leq\delta\leq 1.

With these approximate gradient directions, by Lemma 2 we can derive that

\min_{t\in[T]}f(\mathbf{x}_{t})-f^{*}\leq 2L(2R\sqrt{2T}+2T\delta R)^{2}/T^{2}\leq\epsilon,

a contradiction. This proves the correctness of Algorithm 3. The query complexity of Algorithm 3 comes only from the gradient direction estimation step in Line 3, and equals

T\cdot O(n\log(n/\delta))=O\left(\frac{nLR^{2}}{\epsilon}\log\left(\frac{nLR^{2}}{\epsilon}\right)\right).

3.2 Comparison-based cutting plane method

In this subsection, we provide a comparison-based cutting plane method that solves Problem 1. We begin by introducing the basic notation and concepts of cutting plane methods, which are algorithms that solve the feasibility problem defined as follows.

Problem 2 (Feasibility Problem, [21, 43]).

We are given query access to a separation oracle for a set $K\subset\mathbb{R}^{n}$ such that on query $\mathbf{x}\in\mathbb{R}^{n}$ the oracle outputs a vector $\mathbf{c}$ and either $\mathbf{c}=\mathbf{0}$, in which case $\mathbf{x}\in K$, or $\mathbf{c}\neq\mathbf{0}$, in which case $H\coloneqq\{\mathbf{z}\colon\mathbf{c}^{\top}\mathbf{z}\leq\mathbf{c}^{\top}\mathbf{x}\}\supset K$. The goal is to query a point $\mathbf{x}\in K$.

Ref. [21] developed a cutting plane method that solves Problem 2 using $O(n\log(nR/r))$ queries to a separation oracle, where $R$ and $r$ are parameters related to the convex set $K$.

Lemma 3 (Theorem 1.1, [21]).

There is a cutting plane method which solves Problem 2 using at most $C\cdot n\log(nR/r)$ queries for some constant $C$, given that the set $K$ is contained in the ball of radius $R$ centered at the origin and contains a ball of radius $r$.

Refs. [35, 29] showed that running a cutting plane method on a Lipschitz convex function $f$ with the separation oracle being the gradient of $f$ yields a sequence of points of which at least one is $\epsilon$-optimal. Furthermore, Ref. [43] showed that even if we cannot access the exact gradient of $f$, it suffices to use an approximate gradient estimate with absolute error at most $O(\epsilon/R)$.

In this work, we show that this result can be extended to the case where we have an estimate of the gradient direction instead of the gradient itself. Specifically, we prove the following result.

Theorem 3.

There exists an algorithm based on the cutting plane method that solves Problem 1 using $O(n^{2}\log(nLR^{2}/\epsilon))$ queries.

Note that Theorem 3 improves the prior state-of-the-art $\tilde{O}(n^{4})$ bound of [24] to $\tilde{O}(n^{2})$.

Proof of Theorem 3.

The proof follows a similar intuition as the proof of Proposition 1 in [43]. Define $\mathcal{K}_{\epsilon/2}$ to be the set of $\epsilon/2$-optimal points of $f$, and $\mathcal{K}_{\epsilon}$ to be the set of $\epsilon$-optimal points of $f$. Given that $f$ is $L$-smooth, $\mathcal{K}_{\epsilon/2}$ must contain a ball of radius at least $r_{\mathcal{K}}=\sqrt{\epsilon/L}$ since for any $\mathbf{x}$ with $\|\mathbf{x}-\mathbf{x}^{*}\|\leq r_{\mathcal{K}}$ we have

f(\mathbf{x})-f(\mathbf{x}^{*})\leq L\|\mathbf{x}-\mathbf{x}^{*}\|^{2}/2\leq\epsilon/2.

We apply the cutting plane method, as described in Lemma 3, to query a point in $\mathcal{K}_{\epsilon/2}$, which is a subset of the ball $\mathbb{B}_{2R}(\mathbf{0})$. To achieve this, at each query $\mathbf{x}$ of the cutting plane method, we use Comparison-GDE($\mathbf{x},\delta,\gamma$), our comparison-based gradient direction estimation algorithm (Algorithm 2), as the separation oracle for the cutting plane method, where we set

\delta=\frac{1}{16R}\sqrt{\frac{\epsilon}{L}},\qquad\gamma=\sqrt{2L\epsilon}.

We show that any query outside of $\mathcal{K}_{\epsilon}$ to Comparison-GDE($\mathbf{x},\delta,\gamma$) yields a valid separation oracle for $\mathcal{K}_{\epsilon/2}$. In particular, if we ever queried Comparison-GDE($\mathbf{x},\delta,\gamma$) at any $\mathbf{x}\in\mathbb{B}_{2R}(\mathbf{0})\setminus\mathcal{K}_{\epsilon}$ with output $\hat{\mathbf{g}}$, then for any $\mathbf{y}\in\mathcal{K}_{\epsilon/2}$ we have

\left<\hat{\mathbf{g}},\mathbf{y}-\mathbf{x}\right>\leq\left\langle\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|},\mathbf{y}-\mathbf{x}\right\rangle+\left\|\hat{\mathbf{g}}-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|\cdot\|\mathbf{y}-\mathbf{x}\|\leq\frac{f(\mathbf{y})-f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}+\left\|\hat{\mathbf{g}}-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|\cdot\|\mathbf{y}-\mathbf{x}\|,

where

\|\nabla f(\mathbf{x})\|\geq(f(\mathbf{x})-f^{*})/\|\mathbf{x}-\mathbf{x}^{*}\|\geq(f(\mathbf{x})-f^{*})/(2R)

given that $f$ is convex. Combined with Theorem 1, this guarantees that

\left\|\hat{\mathbf{g}}-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|\leq\delta=\frac{1}{16R}\sqrt{\frac{\epsilon}{L}}.

Hence,

\left<\hat{\mathbf{g}},\mathbf{y}-\mathbf{x}\right>\leq\frac{f(\mathbf{y})-f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}+\left\|\hat{\mathbf{g}}-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|\cdot\|\mathbf{y}-\mathbf{x}\|\leq-\frac{1}{2}\sqrt{\frac{\epsilon}{2L}}+\frac{1}{16R}\sqrt{\frac{\epsilon}{L}}\cdot 4R<0,

indicating that $\hat{\mathbf{g}}$ is a valid separation oracle for the set $\mathcal{K}_{\epsilon/2}$. Consequently, by Lemma 3, after $Cn\log(nR/r_{\mathcal{K}})$ iterations at least one of the queries must lie within $\mathcal{K}_{\epsilon}$, and we can output the query with minimum function value, which can be found by making $Cn\log(nR/r_{\mathcal{K}})$ comparisons.

Note that in each iteration $O(n\log(n/\delta))$ queries to $O_{f}^{\operatorname{Comp}}$ (1) are needed. Hence, the overall query complexity equals

Cn\log(nR/r_{\mathcal{K}})\cdot O(n\log(n/\delta))+Cn\log(nR/r_{\mathcal{K}})=O\left(n^{2}\log\left(nLR^{2}/\epsilon\right)\right).
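To make the reduction concrete, here is a hedged Python sketch of how Comparison-GDE would be wrapped as the separation oracle consumed by an off-the-shelf cutting plane routine; the solver itself (e.g., that of [21]) is treated as a black box, and `cutting_plane_solve` is our placeholder name.

```python
import numpy as np

def make_separation_oracle(comp, L, eps, R):
    """Each query x is answered with the comparison-estimated gradient direction,
    which the proof of Theorem 3 shows is a valid separating-hyperplane normal
    for the set of eps/2-optimal points whenever x is not eps-optimal."""
    delta = np.sqrt(eps / L) / (16 * R)
    gamma = np.sqrt(2 * L * eps)
    def oracle(x):
        c = comparison_gde(comp, L, x, delta, gamma)
        return c   # hyperplane {z : <c, z> <= <c, x>} contains K_{eps/2}
    return oracle

# Usage (schematic): queries = cutting_plane_solve(make_separation_oracle(comp, L, eps, R), radius=2 * R)
# The output is the query with the smallest function value, selected by comparisons.
```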

4 Nonconvex Optimization by Comparisons

In this section, we study nonconvex optimization with function value comparisons. We first develop an algorithm that finds an $\epsilon$-FOSP of a smooth nonconvex function in Section 4.1. Then in Section 4.2, we further develop an algorithm that finds an $\epsilon$-SOSP of a nonconvex function that is smooth and Hessian-Lipschitz.

4.1 First-order stationary point computation by comparisons

In this subsection, we focus on the problem of finding an $\epsilon$-FOSP of a smooth nonconvex function by making function value comparisons.

Problem 3 (Comparison-based first-order stationary point computation).

In the comparison-based first-order stationary point computation (Comparison-FOSP) problem we are given query access to a comparison oracle $O_{f}^{\operatorname{Comp}}$ (1) for an $L$-smooth (possibly) nonconvex function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ satisfying $f(\mathbf{0})-\inf_{\mathbf{x}}f(\mathbf{x})\leq\Delta$. The goal is to output an $\epsilon$-FOSP of $f$.

We develop a comparison-based normalized gradient descent algorithm that solves Problem 3.

Input: Function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, $\Delta$, precision $\epsilon$
1 $T\leftarrow\frac{18L\Delta}{\epsilon^{2}}$, $\mathbf{x}_{0}\leftarrow\mathbf{0}$
2 for $t=0,\ldots,T-1$ do
3      $\hat{\mathbf{g}}_{t}\leftarrow$ Comparison-GDE($\mathbf{x}_{t},1/6,\epsilon/12$)
4      $\mathbf{x}_{t+1}=\mathbf{x}_{t}-\epsilon\hat{\mathbf{g}}_{t}/(3L)$
5
6 Uniformly randomly select $\mathbf{x}_{\mathrm{out}}$ from $\{\mathbf{x}_{0},\ldots,\mathbf{x}_{T}\}$
7 return $\mathbf{x}_{\mathrm{out}}$
Algorithm 4 Comparison-based Approximate Normalized Gradient Descent (Comparison-NGD)
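A minimal Python sketch of this loop, again reusing `comparison_gde` (illustrative only; the random output selection follows line 6 of Algorithm 4):

```python
import numpy as np

def comparison_ngd(comp, L, Delta_gap, eps, n, rng=np.random.default_rng()):
    """Sketch of Algorithm 4: normalized gradient descent with comparison-based
    gradient direction estimates; returns a uniformly random iterate."""
    T = int(18 * L * Delta_gap / eps ** 2)
    x = np.zeros(n)
    iterates = [x.copy()]
    for _ in range(T):
        g_hat = comparison_gde(comp, L, x, delta=1.0 / 6, gamma=eps / 12)
        x = x - (eps / (3 * L)) * g_hat
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]
```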
Theorem 4.

With success probability at least $2/3$, Algorithm 4 solves Problem 3 using $O(L\Delta n\log n/\epsilon^{2})$ queries.

The proof of Theorem 4 is deferred to Appendix C.1.

4.2 Escaping saddle points of nonconvex functions by comparisons

In this subsection, we focus on the problem of escaping from saddle points, i.e., finding an $\epsilon$-SOSP of a nonconvex function that is smooth and Hessian-Lipschitz, by making function value comparisons.

Problem 4 (Comparison-based escaping from saddle point).

In the comparison-based escaping from saddle point (Comparison-SOSP) problem we are given query access to a comparison oracle $O_{f}^{\operatorname{Comp}}$ (1) for a (possibly) nonconvex function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ satisfying $f(\mathbf{0})-\inf_{\mathbf{x}}f(\mathbf{x})\leq\Delta$ that is $L$-smooth and $\rho$-Hessian Lipschitz. The goal is to output an $\epsilon$-SOSP of $f$.

Our algorithm for Problem 4, given in Algorithm 5, is a combination of comparison-based normalized gradient descent and comparison-based negative curvature descent (Comparison-NCD). Specifically, Comparison-NCD is built upon our comparison-based negative curvature finding algorithms, Comparison-NCF1 (Algorithm 8) and Comparison-NCF2 (Algorithm 9), which work when the gradient is small or large, respectively, and can decrease the function value efficiently when applied at a point with a large negative curvature.

Input: Function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, $\Delta$, precision $\epsilon$
1 $\mathcal{S}\leftarrow 350\Delta\sqrt{\frac{\rho}{\epsilon^{3}}}$, $\delta\leftarrow\frac{1}{6}$, $\mathbf{x}_{1,0}\leftarrow\mathbf{0}$
2 $\mathscr{T}\leftarrow\frac{384L^{2}\sqrt{n}}{\delta\rho\epsilon}\log\frac{36nL}{\sqrt{\rho\epsilon}}$, $p\leftarrow\frac{100}{\mathscr{T}}\log\mathcal{S}$
3 for $s=1,\ldots,\mathcal{S}$ do
4      for $t=0,\ldots,\mathscr{T}-1$ do
5            $\hat{\mathbf{g}}_{t}\leftarrow$ Comparison-GDE($\mathbf{x}_{s,t},\delta,\gamma$)
6            $\mathbf{y}_{s,t}\leftarrow\mathbf{x}_{s,t}-\epsilon\hat{\mathbf{g}}_{t}/(3L)$
7            Choose $\mathbf{x}_{s,t+1}$ to be the point between $\{\mathbf{x}_{s,t},\mathbf{y}_{s,t}\}$ with smaller function value
8            $\mathbf{x}_{s,t+1}^{\prime}\leftarrow\mathbf{0}$ with probability $1-p$, and $\mathbf{x}_{s,t+1}^{\prime}\leftarrow$ Comparison-NCD($\mathbf{x}_{s,t+1},\epsilon,\delta$) with probability $p$
9      Choose $\mathbf{x}_{s+1,0}$ among $\{\mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}},\mathbf{x}_{s,0}^{\prime},\ldots,\mathbf{x}_{s,\mathscr{T}}^{\prime}\}$ with the smallest function value
10     $\mathbf{x}_{s+1,0}^{\prime}\leftarrow\mathbf{0}$ with probability $1-p$, and $\mathbf{x}_{s+1,0}^{\prime}\leftarrow$ Comparison-NCD($\mathbf{x}_{s+1,0},\epsilon,\delta$) with probability $p$
11 Uniformly randomly select $s_{\mathrm{out}}\in\{1,\ldots,\mathcal{S}\}$ and $t_{\mathrm{out}}\in[\mathscr{T}]$
12 return $\mathbf{x}_{s_{\mathrm{out}},t_{\mathrm{out}}}$
Algorithm 5 Comparison-based Perturbed Normalized Gradient Descent (Comparison-PNGD)
Input: Function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, precision $\epsilon$, input point $\mathbf{z}$, error probability $\delta$
1 $\mathbf{v}_{1}\leftarrow$ Comparison-NCF1($\mathbf{z},\epsilon,\delta$)
2 $\mathbf{v}_{2}\leftarrow$ Comparison-NCF2($\mathbf{z},\epsilon,\delta$)
3 $\mathbf{z}_{1,+}=\mathbf{z}+\frac{1}{2}\sqrt{\frac{\epsilon}{\rho}}\mathbf{v}_{1}$, $\mathbf{z}_{1,-}=\mathbf{z}-\frac{1}{2}\sqrt{\frac{\epsilon}{\rho}}\mathbf{v}_{1}$, $\mathbf{z}_{2,+}=\mathbf{z}+\frac{1}{2}\sqrt{\frac{\epsilon}{\rho}}\mathbf{v}_{2}$, $\mathbf{z}_{2,-}=\mathbf{z}-\frac{1}{2}\sqrt{\frac{\epsilon}{\rho}}\mathbf{v}_{2}$
return $\mathbf{z}_{\mathrm{out}}\in\{\mathbf{z}_{1,+},\mathbf{z}_{1,-},\mathbf{z}_{2,+},\mathbf{z}_{2,-}\}$ with the smallest function value
Algorithm 6 Comparison-based Negative Curvature Descent (Comparison-NCD)
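Below is a hedged Python sketch of Comparison-NCD; Comparison-NCF1 and Comparison-NCF2 (Algorithms 8 and 9) are not reproduced in this section, so they appear as assumed black-box callables, and the selection of the best candidate by comparisons is simplified.

```python
import numpy as np

def comparison_ncd(comp, eps, rho, z, delta, ncf1, ncf2):
    """Sketch of Algorithm 6: step by (1/2)*sqrt(eps/rho) along +/- each candidate
    negative-curvature direction and keep the point with the smallest function value,
    selected via the comparison oracle. ncf1/ncf2 are assumed black boxes."""
    v1 = ncf1(z, eps, delta)
    v2 = ncf2(z, eps, delta)
    step = 0.5 * np.sqrt(eps / rho)
    candidates = [z + step * v1, z - step * v1, z + step * v2, z - step * v2]
    best = candidates[0]
    for c in candidates[1:]:
        if comp(best, c) == 1:   # f(best) >= f(c), so c is at least as good
            best = c
    return best
```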
Lemma 4.

In the setting of Problem 4, for any $\mathbf{z}$ satisfying $\lambda_{\min}(\nabla^{2}f(\mathbf{z}))\leq-\sqrt{\rho\epsilon}$, Algorithm 6 outputs a point $\mathbf{z}_{\mathrm{out}}\in\mathbb{R}^{n}$ satisfying

f(\mathbf{z}_{\mathrm{out}})-f(\mathbf{z})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}

with success probability at least $1-\zeta$ using $O\big(\frac{L^{2}n^{3/2}}{\zeta\rho\epsilon}\log^{2}\frac{nL}{\zeta\sqrt{\rho\epsilon}}\big)$ queries.

The proof of Lemma 4 is deferred to Appendix C.3. Next, we present the main result of this subsection, which describes the complexity of solving Problem 4 using Algorithm 5.

Theorem 5.

With success probability at least $2/3$, Algorithm 5 solves Problem 4 using $O\big(\frac{\Delta L^{2}n^{3/2}}{\rho^{1/2}\epsilon^{5/2}}\log^{3}\frac{nL}{\sqrt{\rho\epsilon}}\big)$ queries in expectation.

The proof of Theorem 5 is deferred to Appendix C.4.

Acknowledgements

We thank Yexin Zhang for helpful discussions. TL was supported by the National Natural Science Foundation of China (Grant Numbers 62372006 and 92365117), and The Fundamental Research Funds for the Central Universities, Peking University.

References

  • [1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma, Finding approximate local minima faster than gradient descent, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195--1199, 2017, arXiv:1611.01146
  • [2] Zeyuan Allen-Zhu and Yuanzhi Li, Neon2: Finding local minima via first-order oracles, Advances in Neural Information Processing Systems, pp. 3716--3726, 2018, arXiv:1711.06673
  • [3] Charles Audet and John E. Dennis Jr, Mesh adaptive direct search algorithms for constrained optimization, SIAM Journal on Optimization 17 (2006), no. 1, 188--217.
  • [4] Krishnakumar Balasubramanian and Saeed Ghadimi, Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points, Foundations of Computational Mathematics (2022), 1--42, arXiv:1809.06474
  • [5] El Houcine Bergou, Eduard Gorbunov, and Peter Richtárik, Stochastic three points method for unconstrained smooth minimization, SIAM Journal on Optimization 30 (2020), no. 4, 2726--2749, arXiv:1902.03591
  • [6] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar, signSGD: Compressed optimisation for non-convex problems, International Conference on Machine Learning, pp. 560--569, PMLR, 2018, arXiv:1802.04434
  • [7] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization 28 (2018), no. 2, 1751--1772, arXiv:1611.00756
  • [8] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh, Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15--26, 2017, arXiv:1708.03999
  • [9] Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, and Liwei Wang, Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation, International Conference on Machine Learning, pp. 3773--3793, PMLR, 2022, arXiv:2205.11140
  • [10] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller, Structured evolution with compact architectures for scalable policy optimization, International Conference on Machine Learning, pp. 970--978, PMLR, 2018, arXiv:1804.02395
  • [11] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi, A trust region algorithm with a worst-case iteration complexity of 𝒪(ϵ3/2)\mathcal{O}(\epsilon^{-3/2}) for nonconvex optimization, Mathematical Programming 162 (2017), no. 1-2, 1--32.
  • [12] John E. Dennis, Jr and Virginia Torczon, Direct search methods on parallel machines, SIAM Journal on Optimization 1 (1991), no. 4, 448--474.
  • [13] John Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011), no. 7, 2121--2159.
  • [14] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono, Optimal rates for zero-order convex optimization: The power of two function evaluations, IEEE Transactions on Information Theory 61 (2015), no. 5, 2788--2806, arXiv:1312.2139
  • [15] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang, SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, Advances in Neural Information Processing Systems 31 (2018), arXiv:1807.01695
  • [16] Saeed Ghadimi and Guanghui Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM Journal on Optimization 23 (2013), no. 4, 2341--2368, arXiv:1309.5549
  • [17] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer, A survey of quantization methods for efficient neural network inference, Low-Power Computer Vision, Chapman and Hall/CRC, 2022, arXiv:2103.13630, pp. 291--326.
  • [18] Eduard Gorbunov, Adel Bibi, Ozan Sener, El Houcine Bergou, and Peter Richtarik, A stochastic derivative free optimization method with momentum, International Conference on Learning Representations, 2020, arXiv:1905.13278
  • [19] Kevin G. Jamieson, Robert Nowak, and Ben Recht, Query complexity of derivative-free optimization, Advances in Neural Information Processing Systems 25 (2012), arXiv:1209.2434
  • [20] Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang, Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization, International Conference on Machine Learning, pp. 3100--3109, PMLR, 2019, arXiv:1910.12166
  • [21] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong, An improved cutting plane method for convex optimization, convex-concave games, and its applications, Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 944--953, 2020, arXiv:2004.04250
  • [22] Chi Jin, Lydia T. Liu, Rong Ge, and Michael I. Jordan, On the local minima of the empirical risk, Advances in Neural Information Processing Systems 31 (2018), arXiv:1803.09357
  • [23] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, Conference on Learning Theory, pp. 1042--1085, 2018, arXiv:1711.10456
  • [24] Mustafa O. Karabag, Cyrus Neary, and Ufuk Topcu, Smooth convex optimization using sub-zeroth-order oracles, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021), no. 5, 3815--3822, arXiv:2103.00667
  • [25] Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015, arXiv:1412.6980
  • [26] Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon, Optimization by direct search: New perspectives on some classical and modern methods, SIAM review 45 (2003), no. 3, 385--482.
  • [27] Jeffrey Larson, Matt Menickelly, and Stefan M. Wild, Derivative-free optimization methods, Acta Numerica 28 (2019), 287--404, arXiv:1904.11585
  • [28] Yin Tat Lee, Aaron Sidford, and Santosh S. Vempala, Efficient convex optimization with membership oracles, Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 75, pp. 1292--1294, 2018, arXiv:1706.07357
  • [29] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong, A faster cutting plane method and its implications for combinatorial and convex optimization, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1049--1065, IEEE, 2015, arXiv:1508.04874
  • [30] Kfir Levy, Online to offline conversions, universality and adaptive minibatch sizes, Advances in Neural Information Processing Systems 30 (2017), arXiv:1705.10499
  • [31] Xiuxian Li, Kuo-Yi Lin, Li Li, Yiguang Hong, and Jie Chen, On faster convergence of scaled sign gradient descent, IEEE Transactions on Industrial Informatics (2023), arXiv:2109.01806
  • [32] Sijia Liu, Pin-Yu Chen, Xiangyi Chen, and Mingyi Hong, signSGD via zeroth-order oracle, International Conference on Learning Representations, 2019.
  • [33] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, Towards deep learning models resistant to adversarial attacks, International Conference on Learning Representations, 2018, arXiv:1706.06083
  • [34] John A. Nelder and Roger Mead, A simplex method for function minimization, The Computer Journal 7 (1965), no. 4, 308--313.
  • [35] Arkadi Nemirovski, Efficient methods in convex programming, Lecture notes (1994).
  • [36] Yurii Nesterov and Boris T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming 108 (2006), no. 1, 177--205.
  • [37] Yurii Nesterov and Vladimir Spokoiny, Random gradient-free minimization of convex functions, Foundations of Computational Mathematics 17 (2017), 527--566.
  • [38] Ellen Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel Burdick, Dueling posterior sampling for preference-based reinforcement learning, Conference on Uncertainty in Artificial Intelligence, pp. 1029--1038, PMLR, 2020, arXiv:1908.01289
  • [39] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022), 27730--27744, arXiv:2203.02155
  • [40] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami, Practical black-box attacks against machine learning, Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506--519, 2017, arXiv:1602.02697
  • [41] Aadirupa Saha, Aldo Pacchiano, and Jonathan Lee, Dueling RL: Reinforcement learning with trajectory preferences, International Conference on Artificial Intelligence and Statistics, pp. 6263--6289, PMLR, 2023, arXiv:2111.04850
  • [42] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever, Evolution strategies as a scalable alternative to reinforcement learning, 2017, arXiv:1703.03864
  • [43] Aaron Sidford and Chenyi Zhang, Quantum speedups for stochastic optimization, Advances in Neural Information Processing Systems 37 (2023), arXiv:2308.01582
  • [44] Zhiwei Tang, Dmitry Rybin, and Tsung-Hui Chang, Zeroth-order optimization meets human feedback: Provable learning via ranking oracles, 2023, arXiv:2303.03751
  • [45] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, and Georgios Piliouras, Efficiently avoiding saddle points with zero order methods: No gradients required, Advances in Neural Information Processing Systems 32 (2019), arXiv:1910.13021
  • [46] Yuanhao Wang, Qinghua Liu, and Chi Jin, Is RLHF more difficult than standard RL? a theoretical perspective, Thirty-seventh Conference on Neural Information Processing Systems, 2023, arXiv:2306.14111
  • [47] Yi Xu, Rong Jin, and Tianbao Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, Advances in Neural Information Processing Systems, pp. 5530--5540, 2018, arXiv:1711.01944
  • [48] Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, and Artur Dubrawski, Preference-based reinforcement learning with finite-time guarantees, Advances in Neural Information Processing Systems 33 (2020), 18784--18794, arXiv:2006.08910
  • [49] Chenyi Zhang and Tongyang Li, Escape saddle points by a simple gradient-descent based algorithm, Advances in Neural Information Processing Systems 34 (2021), 8545--8556, arXiv:2111.14069
  • [50] Hualin Zhang and Bin Gu, Faster gradient-free methods for escaping saddle points, The Eleventh International Conference on Learning Representations, 2023.
  • [51] Hualin Zhang, Huan Xiong, and Bin Gu, Zeroth-order negative curvature finding: Escaping saddle points without gradients, Advances in Neural Information Processing Systems 35 (2022), 38332--38344, arXiv:2210.01496
  • [52] Banghua Zhu, Jiantao Jiao, and Michael Jordan, Principled reinforcement learning with human feedback from pairwise or kk-wise comparisons, ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023, arXiv:2301.11270

Appendix A Auxiliary Lemmas

A.1 Distance between normalized vectors

Lemma 5.

If $\mathbf{v},\mathbf{v}^{\prime}\in\mathbb{R}^{n}$ are two vectors such that $\|\mathbf{v}\|\geq\gamma$ and $\|\mathbf{v}-\mathbf{v}^{\prime}\|\leq\tau$, we have

\left\|\frac{\mathbf{v}}{\|\mathbf{v}\|}-\frac{\mathbf{v}^{\prime}}{\|\mathbf{v}^{\prime}\|}\right\|\leq\frac{2\tau}{\gamma}.
Proof.

By the triangle inequality, we have

\left\|\frac{\mathbf{v}}{\|\mathbf{v}\|}-\frac{\mathbf{v}^{\prime}}{\|\mathbf{v}^{\prime}\|}\right\|\leq\left\|\frac{\mathbf{v}}{\|\mathbf{v}\|}-\frac{\mathbf{v}^{\prime}}{\|\mathbf{v}\|}\right\|+\left\|\frac{\mathbf{v}^{\prime}}{\|\mathbf{v}\|}-\frac{\mathbf{v}^{\prime}}{\|\mathbf{v}^{\prime}\|}\right\|=\frac{\|\mathbf{v}-\mathbf{v}^{\prime}\|}{\|\mathbf{v}\|}+\frac{|\|\mathbf{v}\|-\|\mathbf{v}^{\prime}\||\,\|\mathbf{v}^{\prime}\|}{\|\mathbf{v}\|\|\mathbf{v}^{\prime}\|}\leq\frac{\tau}{\gamma}+\frac{\tau}{\gamma}=\frac{2\tau}{\gamma}.

Lemma 6.

If $\mathbf{v}_{1},\mathbf{v}_{2}\in\mathbb{R}^{n}$ are two vectors such that $\|\mathbf{v}_{1}\|,\|\mathbf{v}_{2}\|\geq\gamma$, and $\mathbf{v}_{1}^{\prime},\mathbf{v}_{2}^{\prime}\in\mathbb{R}^{n}$ are another two vectors such that $\|\mathbf{v}_{1}-\mathbf{v}_{1}^{\prime}\|,\|\mathbf{v}_{2}-\mathbf{v}_{2}^{\prime}\|\leq\tau$ where $0<\tau<\gamma$, we have

\left|\left\langle\frac{\mathbf{v}_{1}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}}{\|\mathbf{v}_{2}\|}\right\rangle-\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}^{\prime}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}^{\prime}\|}\right\rangle\right|\leq\frac{6\tau}{\gamma}.
Proof.

By the triangle inequality, we have

|𝐯1𝐯1,𝐯2𝐯2𝐯1𝐯1,𝐯2𝐯2|\displaystyle\left|\left\langle\frac{\mathbf{v}_{1}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}}{\|\mathbf{v}_{2}\|}\right\rangle-\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}^{\prime}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}^{\prime}\|}\right\rangle\right|
|𝐯1𝐯1,𝐯2𝐯2𝐯1𝐯1,𝐯2𝐯2|+|𝐯1𝐯1,𝐯2𝐯2𝐯1𝐯1,𝐯2𝐯2|.\displaystyle\qquad\qquad\leq\left|\left\langle\frac{\mathbf{v}_{1}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}}{\|\mathbf{v}_{2}\|}\right\rangle-\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}\|}\right\rangle\right|+\left|\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}\|}\right\rangle-\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}^{\prime}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}^{\prime}\|}\right\rangle\right|.

On the one hand, by the triangle inequality and the Cauchy-Schwarz inequality,

|𝐯1𝐯1,𝐯2𝐯2𝐯1𝐯1,𝐯2𝐯2|\displaystyle\left|\left\langle\frac{\mathbf{v}_{1}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}}{\|\mathbf{v}_{2}\|}\right\rangle-\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}\|}\right\rangle\right| 1𝐯1𝐯2(|𝐯1,𝐯2𝐯1,𝐯2|+|𝐯1,𝐯2𝐯1,𝐯2|)\displaystyle\leq\frac{1}{\|\mathbf{v}_{1}\|\|\mathbf{v}_{2}\|}(\left|\langle\mathbf{v}_{1},\mathbf{v}_{2}\rangle-\langle\mathbf{v}_{1},\mathbf{v}_{2}^{\prime}\rangle\right|+\left|\langle\mathbf{v}_{1},\mathbf{v}_{2}^{\prime}\rangle-\langle\mathbf{v}_{1}^{\prime},\mathbf{v}_{2}^{\prime}\rangle\rangle\right|)
𝐯2𝐯2𝐯2+𝐯1𝐯1𝐯2𝐯1𝐯2\displaystyle\leq\frac{\|\mathbf{v}_{2}-\mathbf{v}_{2}^{\prime}\|}{\|\mathbf{v}_{2}\|}+\frac{\|\mathbf{v}_{1}-\mathbf{v}_{1}^{\prime}\|\|\mathbf{v}_{2}^{\prime}\|}{\|\mathbf{v}_{1}\|\|\mathbf{v}_{2}\|}
τγ+τ(γ+τ)γ2.\displaystyle\leq\frac{\tau}{\gamma}+\frac{\tau(\gamma+\tau)}{\gamma^{2}}.

On the other hand, by the Cauchy-Schwarz inequality, |𝐯1,𝐯2|𝐯1𝐯2|\langle\mathbf{v}_{1}^{\prime},\mathbf{v}_{2}^{\prime}\rangle|\leq\|\mathbf{v}_{1}^{\prime}\|\|\mathbf{v}_{2}^{\prime}\|, and hence

|𝐯1𝐯1,𝐯2𝐯2𝐯1𝐯1,𝐯2𝐯2|\displaystyle\left|\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}\|}\right\rangle-\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}^{\prime}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}^{\prime}\|}\right\rangle\right| =|𝐯1,𝐯2||1𝐯1𝐯21𝐯1𝐯2|\displaystyle=|\langle\mathbf{v}_{1}^{\prime},\mathbf{v}_{2}^{\prime}\rangle|\left|\frac{1}{\|\mathbf{v}_{1}\|\|\mathbf{v}_{2}\|}-\frac{1}{\|\mathbf{v}_{1}^{\prime}\|\|\mathbf{v}_{2}^{\prime}\|}\right|
|𝐯1𝐯2𝐯1𝐯21|\displaystyle\leq\left|\frac{\|\mathbf{v}_{1}^{\prime}\|\|\mathbf{v}_{2}^{\prime}\|}{\|\mathbf{v}_{1}\|\|\mathbf{v}_{2}\|}-1\right|
(γ+τγ)21.\displaystyle\leq\left(\frac{\gamma+\tau}{\gamma}\right)^{2}-1.

In all, due to τ<γ\tau<\gamma,

|𝐯1𝐯1,𝐯2𝐯2𝐯1𝐯1,𝐯2𝐯2|τγ+τ(γ+τ)γ2+(γ+τγ)21=2τ(2γ+τ)γ26τγ.\displaystyle\left|\left\langle\frac{\mathbf{v}_{1}}{\|\mathbf{v}_{1}\|},\frac{\mathbf{v}_{2}}{\|\mathbf{v}_{2}\|}\right\rangle-\left\langle\frac{\mathbf{v}_{1}^{\prime}}{\|\mathbf{v}_{1}^{\prime}\|},\frac{\mathbf{v}_{2}^{\prime}}{\|\mathbf{v}_{2}^{\prime}\|}\right\rangle\right|\leq\frac{\tau}{\gamma}+\frac{\tau(\gamma+\tau)}{\gamma^{2}}+\left(\frac{\gamma+\tau}{\gamma}\right)^{2}-1=\frac{2\tau(2\gamma+\tau)}{\gamma^{2}}\leq\frac{6\tau}{\gamma}.
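As a quick numerical sanity check of Lemmas 5 and 6, the following minimal Python sketch draws random instances and verifies both bounds; the dimension, the thresholds $\gamma,\tau$, and the perturbation model are illustrative assumptions, not part of the statements.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, tau = 50, 1.0, 0.05
unit = lambda v: v / np.linalg.norm(v)

for _ in range(1000):
    # draw v1, v2 with norm at least gamma, and perturbations of norm at most tau
    v1 = unit(rng.normal(size=n)) * (gamma + rng.random())
    v2 = unit(rng.normal(size=n)) * (gamma + rng.random())
    v1p = v1 + unit(rng.normal(size=n)) * tau * rng.random()
    v2p = v2 + unit(rng.normal(size=n)) * tau * rng.random()

    # Lemma 5: a normalized vector moves by at most 2*tau/gamma under the perturbation
    assert np.linalg.norm(unit(v1) - unit(v1p)) <= 2 * tau / gamma + 1e-12
    # Lemma 6: the inner product of the normalized vectors moves by at most 6*tau/gamma
    assert abs(unit(v1) @ unit(v2) - unit(v1p) @ unit(v2p)) <= 6 * tau / gamma + 1e-12

print("Lemmas 5 and 6 verified on all sampled instances")
```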

A.2 A fact for vector norms

Lemma 7.

For any nonzero vectors 𝐯,𝐠n\mathbf{v},\mathbf{g}\in\mathbb{R}^{n},

1𝐯+𝐠𝐯+𝐠,𝐯𝐯21𝐯𝐠𝐯𝐠,𝐯𝐯2=𝐯𝐠𝐯+𝐠.\displaystyle\sqrt{\frac{1-\langle\frac{\mathbf{v}+\mathbf{g}}{\|\mathbf{v}+\mathbf{g}\|},\frac{\mathbf{v}}{\|\mathbf{v}\|}\rangle^{2}}{1-\langle\frac{\mathbf{v}-\mathbf{g}}{\|\mathbf{v}-\mathbf{g}\|},\frac{\mathbf{v}}{\|\mathbf{v}\|}\rangle^{2}}}=\frac{\|\mathbf{v}-\mathbf{g}\|}{\|\mathbf{v}+\mathbf{g}\|}.
Proof.

We have

1𝐯+𝐠𝐯+𝐠,𝐯𝐯21𝐯𝐠𝐯𝐠,𝐯𝐯2𝐯+𝐠2𝐯𝐠2\displaystyle\frac{1-\langle\frac{\mathbf{v}+\mathbf{g}}{\|\mathbf{v}+\mathbf{g}\|},\frac{\mathbf{v}}{\|\mathbf{v}\|}\rangle^{2}}{1-\langle\frac{\mathbf{v}-\mathbf{g}}{\|\mathbf{v}-\mathbf{g}\|},\frac{\mathbf{v}}{\|\mathbf{v}\|}\rangle^{2}}\cdot\frac{\|\mathbf{v}+\mathbf{g}\|^{2}}{\|\mathbf{v}-\mathbf{g}\|^{2}} =𝐯+𝐠2𝐯+𝐠,𝐯𝐯2𝐯𝐠2𝐯𝐠,𝐯𝐯2\displaystyle=\frac{\|\mathbf{v}+\mathbf{g}\|^{2}-\langle\mathbf{v}+\mathbf{g},\frac{\mathbf{v}}{\|\mathbf{v}\|}\rangle^{2}}{\|\mathbf{v}-\mathbf{g}\|^{2}-\langle\mathbf{v}-\mathbf{g},\frac{\mathbf{v}}{\|\mathbf{v}\|}\rangle^{2}}
=𝐯+𝐠,𝐯+𝐠(𝐯+𝐯,𝐠𝐯)2𝐯𝐠,𝐯𝐠(𝐯𝐯,𝐠𝐯)2\displaystyle=\frac{\langle\mathbf{v}+\mathbf{g},\mathbf{v}+\mathbf{g}\rangle-(\|\mathbf{v}\|+\frac{\langle\mathbf{v},\mathbf{g}\rangle}{\|\mathbf{v}\|})^{2}}{\langle\mathbf{v}-\mathbf{g},\mathbf{v}-\mathbf{g}\rangle-(\|\mathbf{v}\|-\frac{\langle\mathbf{v},\mathbf{g}\rangle}{\|\mathbf{v}\|})^{2}}
=𝐯2+𝐠2+2𝐯,𝐠(𝐯2+2𝐯,𝐠+𝐯,𝐠2𝐯2)𝐯2+𝐠22𝐯,𝐠(𝐯22𝐯,𝐠+𝐯,𝐠2𝐯2)\displaystyle=\frac{\|\mathbf{v}\|^{2}+\|\mathbf{g}\|^{2}+2\langle\mathbf{v},\mathbf{g}\rangle-(\|\mathbf{v}\|^{2}+2\langle\mathbf{v},\mathbf{g}\rangle+\frac{\langle\mathbf{v},\mathbf{g}\rangle^{2}}{\|\mathbf{v}\|^{2}})}{\|\mathbf{v}\|^{2}+\|\mathbf{g}\|^{2}-2\langle\mathbf{v},\mathbf{g}\rangle-(\|\mathbf{v}\|^{2}-2\langle\mathbf{v},\mathbf{g}\rangle+\frac{\langle\mathbf{v},\mathbf{g}\rangle^{2}}{\|\mathbf{v}\|^{2}})}
=1.\displaystyle=1.
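The identity of Lemma 7 can be checked numerically in the same spirit (a minimal sketch; the random instances and the guard against nearly degenerate denominators are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
unit = lambda v: v / np.linalg.norm(v)

checked = 0
for _ in range(1000):
    n = int(rng.integers(2, 20))
    v, g = rng.normal(size=n), rng.normal(size=n)
    num = 1 - (unit(v + g) @ unit(v)) ** 2
    den = 1 - (unit(v - g) @ unit(v)) ** 2
    if den < 1e-6:   # skip nearly parallel v - g and v, where both sides are ill-conditioned
        continue
    lhs = np.sqrt(num / den)
    rhs = np.linalg.norm(v - g) / np.linalg.norm(v + g)
    assert np.isclose(lhs, rhs, rtol=1e-6, atol=1e-9)
    checked += 1

print(f"Lemma 7 verified on {checked} random instances")
```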

A.3 Gradient upper bound of smooth convex functions

Lemma 8 (Lemma A.2, [30]).

For any LL-smooth convex function f:nf\colon\mathbb{R}^{n}\to\mathbb{R} and any 𝐱n\mathbf{x}\in\mathbb{R}^{n}, we have

f(𝐱)22L(f(𝐱)f).\displaystyle\|\nabla f(\mathbf{x})\|^{2}\leq 2L(f(\mathbf{x})-f^{*}).
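For example, for $f(\mathbf{x})=\frac{L}{2}\|\mathbf{x}\|^{2}$ (so that $f^{*}=0$) we have $\|\nabla f(\mathbf{x})\|^{2}=L^{2}\|\mathbf{x}\|^{2}=2L(f(\mathbf{x})-f^{*})$, so the bound is attained with equality.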

Appendix B Approximate adaptive normalized gradient descent (Approx-AdaNGD)

In this section, we prove the technical details of the normalized gradient descent method we use for convex optimization. Inspired by [30], which conducted a detailed analysis of the normalized gradient descent method, we first introduce the Approximate Adaptive Gradient Descent (Approx-AdaGrad) algorithm below:

Input: # Iterations T, a set of convex functions \{f_{t}\}_{t=1}^{T}, initial point \mathbf{x}_{1}\in\mathbb{R}^{n}, a convex set \mathcal{K} with diameter D
1 for t=1,,Tt=1,\ldots,T do
2       Calculate an estimate 𝐠~t\tilde{\mathbf{g}}_{t} of ft(𝐱t)\nabla f_{t}(\mathbf{x}_{t})
3       ηtD/2t\eta_{t}\leftarrow D/\sqrt{2t}
4       \mathbf{x}_{t+1}\leftarrow\Pi_{\mathcal{K}}(\mathbf{x}_{t}-\eta_{t}\tilde{\mathbf{g}}_{t})
5      
Algorithm 7 Approximate Adaptive Gradient Descent (Approx-AdaGrad)
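For concreteness, the following minimal Python sketch runs Algorithm 7 on a ball with linear losses, so that $\|\nabla f_{t}(\mathbf{x}_{t})\|=1$ as required by Lemma 9 below; the loss sequence, the bounded noise standing in for a genuine comparison-based estimate, and all problem sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, delta = 10, 500, 0.05      # dimension, iterations, accuracy of each gradient estimate
R = 1.0                          # K is the ball of radius R centered at 0, so D = 2R
D = 2 * R
unit = lambda v: v / np.linalg.norm(v)

def project(x):
    """Euclidean projection onto the ball of radius R."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

# linear losses f_t(x) = <a_t, x> with unit-norm a_t, so every exact gradient has norm 1
A = np.array([unit(rng.normal(size=n)) for _ in range(T)])

x = np.zeros(n)
iterates = []
for t in range(1, T + 1):
    grad = A[t - 1]
    # unit-norm estimate within delta of the true unit-norm gradient (cf. Lemma 5)
    g_tilde = unit(grad + (delta / 2) * unit(rng.normal(size=n)))
    eta = D / np.sqrt(2 * t)
    iterates.append(x.copy())
    x = project(x - eta * g_tilde)

# empirical regret against the best fixed point in the ball (explicit for linear losses)
a_sum = A.sum(axis=0)
best_point = -R * unit(a_sum)
regret = sum(A[t] @ iterates[t] for t in range(T)) - a_sum @ best_point
print(f"regret = {regret:.2f},  Lemma 9 bound = {D * np.sqrt(2 * T) + T * delta * D:.2f}")
```

The empirical regret on such instances typically sits well below the bound of Lemma 9, which is worst-case over all admissible loss sequences and estimates.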
Lemma 9.

Algorithm 7 guarantees the regret bound

\displaystyle\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\underset{\mathbf{x}\in\mathcal{K}}{\min}\sum_{t=1}^{T}f_{t}(\mathbf{x})\leq D\sqrt{2T}+T\delta D,

provided that at each step $t$ we have

\displaystyle\|\nabla f_{t}(\mathbf{x}_{t})\|=1,\quad\|\tilde{\mathbf{g}}_{t}-\nabla f_{t}(\mathbf{x}_{t})\|\leq\delta,\quad\|\tilde{\mathbf{g}}_{t}\|=1.
Proof.

The proof follows the flow of the proof of Theorem 1.1 in [30]. For any t[T]t\in[T] and 𝐱𝒦\mathbf{x}\in\mathcal{K} we have

𝐱t+1𝐱2𝐱t𝐱22ηt𝐠~t,𝐱t𝐱+ηt2𝐠~t2\displaystyle\|\mathbf{x}_{t+1}-\mathbf{x}\|^{2}\leq\|\mathbf{x}_{t}-\mathbf{x}\|^{2}-2\eta_{t}\langle\tilde{\mathbf{g}}_{t},\mathbf{x}_{t}-\mathbf{x}\rangle+\eta_{t}^{2}\|\tilde{\mathbf{g}}_{t}\|^{2}

and

ηt𝐠~t,𝐱t𝐱12ηt(𝐱t𝐱2𝐱t+1𝐱2)+ηt2𝐠~t2.\displaystyle\eta_{t}\langle\tilde{\mathbf{g}}_{t},\mathbf{x}_{t}-\mathbf{x}\rangle\leq\frac{1}{2\eta_{t}}\left(\|\mathbf{x}_{t}-\mathbf{x}\|^{2}-\|\mathbf{x}_{t+1}-\mathbf{x}\|^{2}\right)+\frac{\eta_{t}}{2}\|\tilde{\mathbf{g}}_{t}\|^{2}.

Since ftf_{t} is convex for each tt, we have

ft(𝐱t)ft(𝐱)\displaystyle f_{t}(\mathbf{x}_{t})-f_{t}(\mathbf{x}) ft(𝐱t),𝐱t𝐱\displaystyle\leq\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}\rangle
𝐠~t,𝐱t𝐱+𝐠~tft(𝐱t)𝐱t𝐱\displaystyle\leq\langle\tilde{\mathbf{g}}_{t},\mathbf{x}_{t}-\mathbf{x}\rangle+\|\tilde{\mathbf{g}}_{t}-\nabla f_{t}(\mathbf{x}_{t})\|\cdot\|\mathbf{x}_{t}-\mathbf{x}\|
𝐠~t,𝐱t𝐱+δD,\displaystyle\leq\langle\tilde{\mathbf{g}}_{t},\mathbf{x}_{t}-\mathbf{x}\rangle+\delta D,

which leads to

t=1Tft(𝐱t)t=1Tft(𝐱)t=1T𝐱t𝐱22(1ηt1ηt1)+t=1Tηt2𝐠~t2+TδD,\displaystyle\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x})\leq\sum_{t=1}^{T}\frac{\|\mathbf{x}_{t}-\mathbf{x}\|^{2}}{2}\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)+\sum_{t=1}^{T}\frac{\eta_{t}}{2}\|\tilde{\mathbf{g}}_{t}\|^{2}+T\delta D,

where we denote η0=\eta_{0}=\infty. Further we can derive that

t=1Tft(𝐱t)t=1Tft(𝐱)\displaystyle\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}) D22t=1T(1ηt1ηt1)+D22t=1T𝐠~t2t+TδD\displaystyle\leq\frac{D^{2}}{2}\sum_{t=1}^{T}\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)+\frac{D}{2\sqrt{2}}\sum_{t=1}^{T}\frac{\|\tilde{\mathbf{g}}_{t}\|^{2}}{\sqrt{t}}+T\delta D
D22ηT+D22t=1T1t+TδD,\displaystyle\leq\frac{D^{2}}{2\eta_{T}}+\frac{D}{2\sqrt{2}}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}+T\delta D,

Moreover, we have

t=1T1t2T,\displaystyle\sum_{t=1}^{T}\frac{1}{\sqrt{t}}\leq 2\sqrt{T},

which leads to

t=1Tft(𝐱t)t=1Tft(𝐱)\displaystyle\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}) D22ηT+D22t=1T1t+TδD\displaystyle\leq\frac{D^{2}}{2\eta_{T}}+\frac{D}{2\sqrt{2}}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}+T\delta D
D2T+TδD.\displaystyle\leq D\sqrt{2T}+T\delta D.

Now, we can prove Lemma 2 which guarantees the completeness of Theorem 2.

Proof of Lemma 2.

The proof follows the flow of the proof of Theorem 2.1 in [30]. In particular, observe that Algorithm 3 is equivalent to applying Approx-AdaGrad (Algorithm 7) to the following sequence of functions

f~t(𝐱)f(𝐱t),𝐱f(𝐱t),t[T].\displaystyle\tilde{f}_{t}(\mathbf{x})\coloneqq\frac{\langle\nabla f(\mathbf{x}_{t}),\mathbf{x}\rangle}{\|\nabla f(\mathbf{x}_{t})\|},\quad\forall t\in[T].

Then by Lemma 9, for any 𝐱𝒦\mathbf{x}\in\mathcal{K} we have

t=1Tf(𝐱t),𝐱t𝐱f(𝐱t)D2T+TδD,\displaystyle\sum_{t=1}^{T}\frac{\langle\nabla f(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}\rangle}{\|\nabla f(\mathbf{x}_{t})\|}\leq D\sqrt{2T}+T\delta D,

where

f(𝐱t)f(𝐱)f(𝐱t),𝐱t𝐱,t[T]\displaystyle f(\mathbf{x}_{t})-f(\mathbf{x})\leq\langle\nabla f(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}\rangle,\quad\forall t\in[T]

given that ff is convex, and D=2RD=2R is the diameter of 𝔹R(𝟎)\mathbb{B}_{R}(\mathbf{0}). Hence,

mint[T]f(𝐱t)ft=1T(f(𝐱t)f)/f(𝐱t)t=1T1/f(𝐱t)2R2T+2TδRt=1T1/f(𝐱t).\displaystyle\min_{t\in[T]}f(\mathbf{x}_{t})-f^{*}\leq\frac{\sum_{t=1}^{T}(f(\mathbf{x}_{t})-f^{*})/\|\nabla f(\mathbf{x}_{t})\|}{\sum_{t=1}^{T}1/\|\nabla f(\mathbf{x}_{t})\|}\leq\frac{2R\sqrt{2T}+2T\delta R}{\sum_{t=1}^{T}1/\|\nabla f(\mathbf{x}_{t})\|}.

Next, we proceed to bound the term t=1T1/f(𝐱t)\sum_{t=1}^{T}1/\|\nabla f(\mathbf{x}_{t})\| on the denominator. By the Cauchy-Schwarz inequality,

(t=1T1/f(𝐱t))(t=1Tf(𝐱t))T2,\displaystyle\left(\sum_{t=1}^{T}1/\|\nabla f(\mathbf{x}_{t})\|\right)\cdot\left(\sum_{t=1}^{T}\|\nabla f(\mathbf{x}_{t})\|\right)\geq T^{2},

which leads to

t=1T1f(𝐱t)T2t=1Tf(𝐱t),\displaystyle\sum_{t=1}^{T}\frac{1}{\|\nabla f(\mathbf{x}_{t})\|}\geq\frac{T^{2}}{\sum_{t=1}^{T}\|\nabla f(\mathbf{x}_{t})\|},

where

t=1Tf(𝐱t)\displaystyle\sum_{t=1}^{T}\|\nabla f(\mathbf{x}_{t})\| =t=1Tf(𝐱t)2f(𝐱t)\displaystyle=\sum_{t=1}^{T}\frac{\|\nabla f(\mathbf{x}_{t})\|^{2}}{\|\nabla f(\mathbf{x}_{t})\|}
t=1T2L(f(𝐱t)f)f(𝐱t)\displaystyle\leq\sum_{t=1}^{T}\frac{2L(f(\mathbf{x}_{t})-f^{*})}{\|\nabla f(\mathbf{x}_{t})\|}
2Lt=1Tf(𝐱t),𝐱t𝐱f(𝐱t)\displaystyle\leq 2L\sum_{t=1}^{T}\frac{\langle\nabla f(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}^{*}\rangle}{\|\nabla f(\mathbf{x}_{t})\|}
2L(2R2T+2TδR),\displaystyle\leq 2L(2R\sqrt{2T}+2T\delta R),

where the first inequality is by Lemma 8, the second inequality is by the convexity of ff, and the third inequality is due to Lemma 9. Further we can derive that

\displaystyle\min_{t\in[T]}f(\mathbf{x}_{t})-f^{*}\leq\frac{2R\sqrt{2T}+2T\delta R}{\sum_{t=1}^{T}1/\|\nabla f(\mathbf{x}_{t})\|}\leq\frac{2L(2R\sqrt{2T}+2T\delta R)^{2}}{T^{2}}.
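To illustrate this guarantee, the following minimal Python sketch runs the update analyzed above, namely projected steps of length $D/\sqrt{2t}$ along an approximate gradient direction, on a smooth convex quadratic over a ball. Algorithm 3 itself is not reproduced here; the test function, the noise model standing in for Comparison-GDE, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, R, L, delta = 20, 2000, 1.0, 4.0, 0.01
D = 2 * R
unit = lambda v: v / np.linalg.norm(v)

# L-smooth convex test function f(x) = 0.5 x^T H x, minimized at 0 inside the ball B_R(0)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
H = Q @ np.diag(np.linspace(0.1, L, n)) @ Q.T
f = lambda x: 0.5 * x @ H @ x
f_star = 0.0

def project(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

x = project(rng.normal(size=n))
best_gap = np.inf
for t in range(1, T + 1):
    direction = unit(H @ x)                                            # exact normalized gradient
    g_hat = unit(direction + (delta / 2) * unit(rng.normal(size=n)))   # within delta of it
    best_gap = min(best_gap, f(x) - f_star)
    x = project(x - D / np.sqrt(2 * t) * g_hat)

bound = 2 * L * (2 * R * np.sqrt(2 * T) + 2 * T * delta * R) ** 2 / T ** 2
print(f"min_t f(x_t) - f* = {best_gap:.5f}   (bound from the proof: {bound:.5f})")
```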

Appendix C Proof details of nonconvex optimization by comparisons

C.1 Proof of Theorem 4

Proof of Theorem 4.

We prove the correctness of Theorem 4 by contradiction. For any iteration t[T]t\in[T] with f(𝐱t)>ϵ\|\nabla f(\mathbf{x}_{t})\|>\epsilon, by Theorem 1 we have

𝐠^tf(𝐱t)f(𝐱t)δ=16,\displaystyle\left\|\hat{\mathbf{g}}_{t}-\frac{\nabla f(\mathbf{x}_{t})}{\|\nabla f(\mathbf{x}_{t})\|}\right\|\leq\delta=\frac{1}{6},

indicating

f(𝐱t+1)f(𝐱t)\displaystyle f(\mathbf{x}_{t+1})-f(\mathbf{x}_{t}) f(𝐱t),𝐱t+1𝐱t+L2𝐱t+1𝐱t2\displaystyle\leq\langle\nabla f(\mathbf{x}_{t}),\mathbf{x}_{t+1}-\mathbf{x}_{t}\rangle+\frac{L}{2}\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}
ϵ3Lf(𝐱t),𝐠^t+L2(ϵ3L)2\displaystyle\leq-\frac{\epsilon}{3L}\langle\nabla f(\mathbf{x}_{t}),\hat{\mathbf{g}}_{t}\rangle+\frac{L}{2}\left(\frac{\epsilon}{3L}\right)^{2}
ϵ3Lf(𝐱t)(1δ)+ϵ218L2ϵ29L.\displaystyle\leq-\frac{\epsilon}{3L}\|\nabla f(\mathbf{x}_{t})\|(1-\delta)+\frac{\epsilon^{2}}{18L}\leq-\frac{2\epsilon^{2}}{9L}.

That is to say, for any iteration $t$ such that $\mathbf{x}_{t}$ is not an $\epsilon$-FOSP, the function value decreases by at least $\frac{2\epsilon^{2}}{9L}$ in this iteration. Furthermore, for any iteration $t\in[T]$ with $\frac{\epsilon}{12}<\|\nabla f(\mathbf{x}_{t})\|\leq\epsilon$, by Theorem 1 we have

𝐠^tf(𝐱t)f(𝐱t)δ=16,\displaystyle\left\|\hat{\mathbf{g}}_{t}-\frac{\nabla f(\mathbf{x}_{t})}{\|\nabla f(\mathbf{x}_{t})\|}\right\|\leq\delta=\frac{1}{6},

indicating

f(𝐱t+1)f(𝐱t)\displaystyle f(\mathbf{x}_{t+1})-f(\mathbf{x}_{t}) f(𝐱t),𝐱t+1𝐱t+L2𝐱t+1𝐱t2\displaystyle\leq\langle\nabla f(\mathbf{x}_{t}),\mathbf{x}_{t+1}-\mathbf{x}_{t}\rangle+\frac{L}{2}\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}
ϵ3Lf(𝐱t)(1δ)+ϵ218L0.\displaystyle\leq-\frac{\epsilon}{3L}\|\nabla f(\mathbf{x}_{t})\|(1-\delta)+\frac{\epsilon^{2}}{18L}\leq 0. (5)

For any iteration t[T]t\in[T] with f(𝐱t)ϵ/12\|\nabla f(\mathbf{x}_{t})\|\leq\epsilon/12, we have

f(𝐱t+1)f(𝐱t)\displaystyle f(\mathbf{x}_{t+1})-f(\mathbf{x}_{t}) f(𝐱t),𝐱t+1𝐱t+L2𝐱t+1𝐱t2\displaystyle\leq\langle\nabla f(\mathbf{x}_{t}),\mathbf{x}_{t+1}-\mathbf{x}_{t}\rangle+\frac{L}{2}\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}
f(𝐱t)𝐱t+1𝐱t+L2𝐱t+1𝐱t2ϵ212L.\displaystyle\leq\|\nabla f(\mathbf{x}_{t})\|\kern-1.42262pt\cdot\kern-1.42262pt\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|+\frac{L}{2}\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}\leq\frac{\epsilon^{2}}{12L}.

Combining (5) and the above inequality, we know that for any iteration $t$ such that $\mathbf{x}_{t}$ is an $\epsilon$-FOSP, the function value increases by at most $\epsilon^{2}/(12L)$ in this iteration. Moreover, since

f(𝟎)f(𝐱T)f(𝟎)fΔ,\displaystyle f(\mathbf{0})-f(\mathbf{x}_{T})\leq f(\mathbf{0})-f^{*}\leq\Delta,

we can conclude that at least 2/32/3 of the iterations have 𝐱t\mathbf{x}_{t} being an ϵ\epsilon-FOSP, and randomly outputting one of them solves Problem 3 with success probability at least 2/32/3.

The query complexity of Algorithm 4 only comes from the gradient direction estimation step in Line 4, which equals

TO(nlog(n/δ))=O(LΔnlogn/ϵ2).\displaystyle T\cdot O(n\log(n/\delta))=O\left(L\Delta n\log n/\epsilon^{2}\right).
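As an illustration of the descent loop analyzed above, the following Python sketch runs the update $\mathbf{x}_{t+1}=\mathbf{x}_{t}-\frac{\epsilon}{3L}\hat{\mathbf{g}}_{t}$ with a stand-in for Comparison-GDE (the exact normalized gradient plus bounded noise); the test function, the iteration budget, and the noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, eps, delta = 5, 1.0, 0.05, 1.0 / 6.0
unit = lambda v: v / np.linalg.norm(v)

# 1-smooth nonconvex test function, bounded below by -n
f = lambda x: np.sum(np.cos(x))
grad = lambda x: -np.sin(x)

x = 0.5 * np.ones(n)
Delta = f(x) - (-n)
# iteration budget of the same order as in the descent argument above (illustrative)
T = int(np.ceil(9 * L * Delta / (2 * eps ** 2)))

fosp = 0
for _ in range(T):
    g = grad(x)
    gnorm = np.linalg.norm(g)
    if gnorm <= eps:
        fosp += 1
    # unit-norm estimate within delta = 1/6 of the true gradient direction
    g_hat = unit(g / max(gnorm, 1e-12) + (delta / 2) * unit(rng.normal(size=n)))
    x = x - (eps / (3 * L)) * g_hat

print(f"fraction of eps-FOSP iterates: {fosp / T:.2f}")
```

On such instances the reported fraction is typically far above the $2/3$ established in the proof, since the counting argument is worst-case.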

C.2 Negative curvature finding by comparisons

In this subsection, we show how to find a negative curvature direction at a point $\mathbf{x}$ satisfying $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon}$. Observe that the Hessian matrix $\nabla^{2}f(\mathbf{x})$ admits the following eigendecomposition:

2f(𝐱)=i=1nλi𝐮i𝐮i,\displaystyle\nabla^{2}f(\mathbf{\mathbf{x}})=\sum_{i=1}^{n}\lambda_{i}\mathbf{u}_{i}\mathbf{u}_{i}^{\top}, (6)

where the vectors $\{\mathbf{u}_{i}\}_{i=1}^{n}$ form an orthonormal basis of $\mathbb{R}^{n}$. Without loss of generality, we assume that the eigenvalues $\lambda_{1},\lambda_{2},\ldots,\lambda_{n}$ corresponding to $\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{n}$ satisfy

λ1λ2λn,\displaystyle\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}, (7)

where λ1ρϵ\lambda_{1}\leq-\sqrt{\rho\epsilon}. Throughout this subsection, for any vector 𝐯n\mathbf{v}\in\mathbb{R}^{n}, we denote

𝐯𝐯𝐯,𝐮1𝐮1\displaystyle\mathbf{v}_{\perp}\coloneqq\mathbf{v}-\langle\mathbf{v},\mathbf{u}_{1}\rangle\mathbf{u}_{1}

to be the component of 𝐯\mathbf{v} that is orthogonal to 𝐮1\mathbf{u}_{1}.

C.2.1 Negative curvature finding when the gradient is relatively small

In this part, we present our negative curvature finding algorithm that finds the negative curvature of a point 𝐱\mathbf{x} with λmin(2f(𝐱))ρϵ\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon} when the norm of the gradient f(𝐱)\nabla f(\mathbf{x}) is relatively small.

Input: Function f:nf\colon\mathbb{R}^{n}\to\mathbb{R}, 𝐱\mathbf{x}, precision ϵ\epsilon, error probability δ\delta
1 𝒯384L2nδρϵlog36nLρϵ\mathscr{T}\leftarrow\frac{384L^{2}\sqrt{n}}{\delta\rho\epsilon}\log\frac{36nL}{\sqrt{\rho\epsilon}}, δ^18𝒯(ρϵ)1/4πLn\hat{\delta}\leftarrow\frac{1}{8\mathscr{T}(\rho\epsilon)^{1/4}}\sqrt{\frac{\pi L}{n}}, rπδ(ρϵ)1/4L128ρn𝒯r\leftarrow\frac{\pi\delta(\rho\epsilon)^{1/4}\sqrt{L}}{128\rho n\mathscr{T}}, γδr16πρϵn\gamma\leftarrow\frac{\delta r}{16}\sqrt{\frac{\pi\rho\epsilon}{n}}
2 𝐲0\mathbf{y}_{0}\leftarrowUniform(𝒮n1)(\mathcal{S}^{n-1})
3 for t=0,,𝒯1t=0,\ldots,\mathscr{T}-1 do
4       𝐠^t\hat{\mathbf{g}}_{t}\leftarrowComparison-GDE(𝐱+r𝐲t,δ^,γ)(\mathbf{x}+r\mathbf{y}_{t},\hat{\delta},\gamma)
5       𝐲¯t+1𝐲tδ16Lρϵn𝐠^t\bar{\mathbf{y}}_{t+1}\leftarrow\mathbf{y}_{t}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\hat{\mathbf{g}}_{t}
6       \mathbf{y}_{t+1}\leftarrow\bar{\mathbf{y}}_{t+1}/\|\bar{\mathbf{y}}_{t+1}\|
7      
return 𝐞^𝐲𝒯\hat{\mathbf{e}}\leftarrow\mathbf{y}_{\mathscr{T}}
Algorithm 8 Comparison-based Negative Curvature Finding 1 (Comparison-NCF1)
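To visualize the dynamics of Algorithm 8, the following minimal Python sketch runs it on a quadratic saddle with Comparison-GDE replaced by the exact normalized gradient; the radius $r$, the step size, and the iteration count are far smaller than the theoretical choices in Line 1 and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, rho, eps = 10, 1.0, 1.0, 0.01
unit = lambda v: v / np.linalg.norm(v)

# quadratic saddle f(z) = 0.5 z^T H z: one eigenvalue -sqrt(rho*eps), the rest L/2
lams = np.full(n, L / 2)
lams[0] = -np.sqrt(rho * eps)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
H = Q @ np.diag(lams) @ Q.T
grad = lambda z: H @ z

x = np.zeros(n)                    # the saddle point under study
r, step, T = 1e-3, 1e-3, 20000     # illustrative values only

y = unit(rng.normal(size=n))
for _ in range(T):
    g_hat = unit(grad(x + r * y))  # stand-in for Comparison-GDE(x + r*y, ...)
    y = unit(y - step * g_hat)     # the update of Lines 5-6 of Algorithm 8

print(f"e^T (Hessian) e = {y @ H @ y:.4f}   (target: <= {-np.sqrt(rho * eps) / 4:.4f})")
```

On this instance the printed Rayleigh quotient approaches the most negative eigenvalue, in line with the guarantee of Lemma 10 below.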
Lemma 10.

In the setting of Problem 4, for any 𝐱\mathbf{x} satisfying

f(𝐱)L(πδ256n𝒯)2ϵρ,λmin(2f(𝐱))ρϵ,\displaystyle\|\nabla f(\mathbf{x})\|\leq L\left(\frac{\pi\delta}{256n\mathscr{T}}\right)^{2}\sqrt{\frac{\epsilon}{\rho}},\qquad\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon},

Algorithm 8 outputs a unit vector 𝐞^\hat{\mathbf{e}} satisfying

\displaystyle\hat{\mathbf{e}}^{\top}\nabla^{2}f(\mathbf{x})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4,

with success probability at least 1δ1-\delta using

O(L2n3/2δρϵlog2nLδρϵ)\displaystyle O\left(\frac{L^{2}n^{3/2}}{\delta\rho\epsilon}\log^{2}\frac{nL}{\delta\sqrt{\rho\epsilon}}\right)

queries.

To prove Lemma 10, without loss of generality we assume 𝐱=𝟎\mathbf{x}=\mathbf{0} by shifting n\mathbb{R}^{n} such that 𝐱\mathbf{x} is mapped to 𝟎\mathbf{0}. We denote 𝐳tr𝐲t/𝐲t\mathbf{z}_{t}\coloneqq r\mathbf{y}_{t}/\|\mathbf{y}_{t}\| for each iteration t[𝒯]t\in[\mathscr{T}] of Algorithm 8.

Lemma 11.

In the setting of Problem 4, for any iteration t[𝒯]t\in[\mathscr{T}] of Algorithm 8 with |yt,1|δ8πn|y_{t,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi}{n}}, we have

f(𝐳t)δr16πρϵn.\displaystyle\|\nabla f(\mathbf{z}_{t})\|\geq\frac{\delta r}{16}\sqrt{\frac{\pi\rho\epsilon}{n}}.
Proof.

Observe that

f(𝐳k)\displaystyle\|\nabla f(\mathbf{z}_{k})\| |1f(𝐳k)|\displaystyle\geq|\nabla_{1}f(\mathbf{z}_{k})|
=|1f(𝟎)+(2f(𝟎)𝐳k)1+1f(𝐳k)1f(𝟎)(2f(𝟎)𝐳k)1|\displaystyle=|\nabla_{1}f(\mathbf{0})+(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}+\nabla_{1}f(\mathbf{z}_{k})-\nabla_{1}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}|
|(2f(𝟎)𝐳k)1||1f(𝟎)||1f(𝐳k)1f(𝟎)(2f(𝟎)𝐳k)1|.\displaystyle\geq|(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}|-|\nabla_{1}f(\mathbf{0})|-|\nabla_{1}f(\mathbf{z}_{k})-\nabla_{1}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}|.

Given that ff is ρ\rho-Hessian Lipschitz, we have

|1f(𝐳k)1f(𝟎)(2f(𝟎)𝐳k)1|ρ𝐳k22=ρr22δr32πρϵn.\displaystyle|\nabla_{1}f(\mathbf{z}_{k})-\nabla_{1}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}|\leq\frac{\rho\|\mathbf{z}_{k}\|^{2}}{2}=\frac{\rho r^{2}}{2}\leq\frac{\delta r}{32}\sqrt{\frac{\pi\rho\epsilon}{n}}.

Moreover, we have

\displaystyle|(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}|\geq\sqrt{\rho\epsilon}\,|z_{k,1}|\geq\frac{\delta r}{8}\sqrt{\frac{\pi\rho\epsilon}{n}},

which leads to

f(𝐳k)\displaystyle\|\nabla f(\mathbf{z}_{k})\| |1f(𝐳k)|\displaystyle\geq|\nabla_{1}f(\mathbf{z}_{k})|
|(2f(𝟎)𝐳k)1||1f(𝟎)||1f(𝐳k)1f(𝟎)(2f(𝟎)𝐳k)1|\displaystyle\geq|(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}|-|\nabla_{1}f(\mathbf{0})|-|\nabla_{1}f(\mathbf{z}_{k})-\nabla_{1}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}|
δr16πρϵn,\displaystyle\geq\frac{\delta r}{16}\sqrt{\frac{\pi\rho\epsilon}{n}},

where the last inequality is due to the fact that

\displaystyle|\nabla_{1}f(\mathbf{0})|\leq\|\nabla f(\mathbf{0})\|\leq\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{256n\mathscr{T}}\leq\frac{\delta r}{32}\sqrt{\frac{\pi\rho\epsilon}{n}}.

Lemma 12.

In the setting of Problem 4, for any iteration t[𝒯]t\in[\mathscr{T}] of Algorithm 8 we have

|yt,1|δ8πn\displaystyle|y_{t,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi}{n}} (8)

if |y0,1|δ2πn|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}} and f(𝟎)δr32πρϵn\|\nabla f(\mathbf{0})\|\leq\frac{\delta r}{32}\sqrt{\frac{\pi\rho\epsilon}{n}}.

Proof.

We prove this lemma by induction. In particular, assume that

|yt,1|𝐲t,δ2πn(112𝒯)t\displaystyle\frac{|y_{t,1}|}{\|\mathbf{y}_{t,\perp}\|}\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\left(1-\frac{1}{2\mathscr{T}}\right)^{t} (9)

is true for all tkt\leq k for some kk, which guarantees that

\displaystyle|y_{t,1}|\geq\frac{\delta}{4}\sqrt{\frac{\pi}{n}}\left(1-\frac{1}{2\mathscr{T}}\right)^{t}.

Then for t=k+1t=k+1, we have

𝐲¯k+1,=𝐲k,δ16Lρϵn𝐠^k,,\displaystyle\bar{\mathbf{y}}_{k+1,\perp}=\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\hat{\mathbf{g}}_{k,\perp},

and

𝐲¯k+1,𝐲k,δ16Lρϵnf(𝐳k)f(𝐳k)+δ16Lρϵn𝐠^k,f(𝐳k)f(𝐳k).\displaystyle\|\bar{\mathbf{y}}_{k+1,\perp}\|\leq\left\|\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\|+\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k,\perp}-\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\|. (10)

Since $\|\nabla f(\mathbf{z}_{k})\|\geq\frac{\delta r}{16}\sqrt{\frac{\pi\rho\epsilon}{n}}$ by Lemma 11, we have

\displaystyle\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k,\perp}-\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\|\leq\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k}-\frac{\nabla f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\|\leq\frac{\delta\hat{\delta}}{16L}\sqrt{\frac{\rho\epsilon}{n}}\leq\frac{\delta}{64\mathscr{T}\sqrt{n}}

by Theorem 1. Moreover, observe that

f(𝐳k)\displaystyle\nabla_{\perp}f(\mathbf{z}_{k}) =(2f(𝟎)𝐳k)+f(𝟎)+(f(𝐳k)f(𝟎)(2f(𝟎)𝐳k))\displaystyle=(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{\perp}+\nabla_{\perp}f(\mathbf{0})+(\nabla_{\perp}f(\mathbf{z}_{k})-\nabla_{\perp}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{\perp})
=2f(𝟎)𝐳k,+f(𝟎)+(f(𝐳k)f(𝟎)(2f(𝟎)𝐳k)),\displaystyle=\nabla^{2}f(\mathbf{0})\mathbf{z}_{k,\perp}+\nabla_{\perp}f(\mathbf{0})+(\nabla_{\perp}f(\mathbf{z}_{k})-\nabla_{\perp}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{\perp}), (11)

where the norm of

σk,f(𝐳k)f(𝟎)(2f(𝟎)𝐳k)\displaystyle\sigma_{k,\perp}\coloneqq\nabla_{\perp}f(\mathbf{z}_{k})-\nabla_{\perp}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{\perp}

is upper bounded by

ρr22+πδr(ρϵ)1/4L256n𝒯πδr(ρϵ)1/4L128n𝒯δr16πρϵn\displaystyle\frac{\rho r^{2}}{2}+\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{256n\mathscr{T}}\leq\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{128n\mathscr{T}}\leq\frac{\delta r}{16}\sqrt{\frac{\pi\rho\epsilon}{n}}

given that ff is ρ\rho-Hessian Lipschitz and f(𝟎)δr32πρϵn\|\nabla f(\mathbf{0})\|\leq\frac{\delta r}{32}\sqrt{\frac{\pi\rho\epsilon}{n}}. Next, we proceed to bound the first term on the RHS of (10), where

𝐲k,δ16Lρϵnf(𝐳k)f(𝐳k)\displaystyle\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|} =𝐲k,δ16Lρϵnf(𝐳k)f(𝐳k)\displaystyle=\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}
=𝐲k,δ16Lρϵn2f(𝟎)𝐳k,f(𝐳k)δ16Lρϵnσk,f(𝐳k),\displaystyle=\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla^{2}f(\mathbf{0})\mathbf{z}_{k,\perp}}{\|\nabla f(\mathbf{z}_{k})\|}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\sigma_{k,\perp}}{\|\nabla f(\mathbf{z}_{k})\|},

where

2f(𝟎)𝐳k,=i=2nλi𝐳k,,𝐮i𝐮i=ri=2nλi𝐲k,,𝐮i𝐮i,\displaystyle\nabla^{2}f(\mathbf{0})\mathbf{z}_{k,\perp}=\sum_{i=2}^{n}\lambda_{i}\langle\mathbf{z}_{k,\perp},\mathbf{u}_{i}\rangle\mathbf{u}_{i}=r\sum_{i=2}^{n}\lambda_{i}\langle\mathbf{y}_{k,\perp},\mathbf{u}_{i}\rangle\mathbf{u}_{i},

and

𝐲k,δ16Lρϵn2f(𝟎)𝐳k,f(𝐳k)=i=2n(1rδ16f(𝐳k)ρϵnλiL)𝐲k,,𝐮i𝐮i.\displaystyle\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla^{2}f(\mathbf{0})\mathbf{z}_{k,\perp}}{\|\nabla f(\mathbf{z}_{k})\|}=\sum_{i=2}^{n}\left(1-\frac{r\delta}{16\|\nabla f(\mathbf{z}_{k})\|}\sqrt{\frac{\rho\epsilon}{n}}\frac{\lambda_{i}}{L}\right)\langle\mathbf{y}_{k,\perp},\mathbf{u}_{i}\rangle\mathbf{u}_{i}.

Given that

1rδ16f(𝐳k)ρϵnλiL1\displaystyle-1\leq\frac{r\delta}{16\|\nabla f(\mathbf{z}_{k})\|}\sqrt{\frac{\rho\epsilon}{n}}\frac{\lambda_{i}}{L}\leq 1

is always true, we have

𝐲k,δ16Lρϵn2f(𝟎)𝐳k,f(𝐳k)\displaystyle\left\|\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla^{2}f(\mathbf{0})\mathbf{z}_{k,\perp}}{\|\nabla f(\mathbf{z}_{k})\|}\right\| i=2n(1+rδρϵ16f(𝐳k)Ln)𝐲k,,𝐮i𝐮i\displaystyle\leq\left\|\sum_{i=2}^{n}\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)\langle\mathbf{y}_{k,\perp},\mathbf{u}_{i}\rangle\mathbf{u}_{i}\right\|
(1+rδρϵ16f(𝐳k)Ln)𝐲k,\displaystyle\leq\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|

and

𝐲k,δ16Lρϵnf(𝐳k)f(𝐳k)\displaystyle\left\|\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\| (1+rδρϵ16f(𝐳k)Ln)𝐲k,+δ16Lρϵnσk,f(𝐳k)\displaystyle\leq\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\left\|\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\sigma_{k,\perp}}{\|\nabla f(\mathbf{z}_{k})\|}\right\|
(1+rδρϵ16f(𝐳k)Ln)𝐲k,+δ64𝒯n.\displaystyle\leq\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\frac{\delta}{64\mathscr{T}\sqrt{n}}.

Combined with (10), we can derive that

𝐲¯k+1,\displaystyle\|\bar{\mathbf{y}}_{k+1,\perp}\| 𝐲k,δ16Lρϵnf(𝐳k)f(𝐳k)+δ16Lρϵn𝐠^k,f(𝐳k)f(𝐳k)\displaystyle\leq\left\|\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\|+\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k,\perp}-\frac{\nabla_{\perp}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\| (12)
(1+rδρϵ16f(𝐳k)Ln)𝐲k,+δ32𝒯n.\displaystyle\leq\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\frac{\delta}{32\mathscr{T}\sqrt{n}}. (13)

Similarly, we have

|y¯k+1,1||yk,1δ16Lρϵn1f(𝐳k)f(𝐳k)|δ16Lρϵn|g^k,11f(𝐳k)f(𝐳k)|,\displaystyle|\bar{y}_{k+1,1}|\geq\left|y_{k,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{1}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right|-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{k,1}-\frac{\nabla_{1}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right|, (14)

where the second term on the RHS of (14) satisfies

δ16Lρϵn|g^k,11f(𝐳k)f(𝐳k)|δ16Lρϵn𝐠^kf(𝐳k)f(𝐳k)δδ^16Lρϵnδ64𝒯n,\displaystyle\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{k,1}-\frac{\nabla_{1}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right|\leq\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k}-\frac{\nabla f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right\|\leq\frac{\delta\hat{\delta}}{16L}\sqrt{\frac{\rho\epsilon}{n}}\leq\frac{\delta}{64\mathscr{T}\sqrt{n}},

by Theorem 1, whereas the first term on the RHS of (14) satisfies

yk,1δ16Lρϵn1f(𝐳k)f(𝐳k)\displaystyle y_{k,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{1}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|} =yk,1δ16Lρϵn𝐮12f(𝟎)𝐮1yk,1f(𝐳k)δ16Lρϵnσk,1f(𝐳k)\displaystyle=y_{k,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\mathbf{u}_{1}^{\top}\nabla^{2}f(\mathbf{0})\mathbf{u}_{1}y_{k,1}}{\|\nabla f(\mathbf{z}_{k})\|}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\sigma_{k,1}}{\|\nabla f(\mathbf{z}_{k})\|}
=(1+rδρϵ16f(𝐳k)Ln)yk,1δ16Lρϵnσk,1f(𝐳k),\displaystyle=\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)y_{k,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\sigma_{k,1}}{\|\nabla f(\mathbf{z}_{k})\|},

where the absolute value of

σk,11f(𝐳k)1f(𝟎)(2f(𝟎)𝐳k)1\displaystyle\sigma_{k,1}\coloneqq\nabla_{1}f(\mathbf{z}_{k})-\nabla_{1}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{k})_{1}

is upper bounded by

ρr22+πδr(ρϵ)1/4L256n𝒯πδr(ρϵ)1/4L128n𝒯δr16πρϵn\displaystyle\frac{\rho r^{2}}{2}+\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{256n\mathscr{T}}\leq\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{128n\mathscr{T}}\leq\frac{\delta r}{16}\sqrt{\frac{\pi\rho\epsilon}{n}}

given that ff is ρ\rho-Hessian Lipschitz and

\|\nabla f(\mathbf{0})\|\leq\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{256n\mathscr{T}}.

Combined with (14), we can derive that

|y¯k+1,1|\displaystyle|\bar{y}_{k+1,1}| |yk,1δ16Lρϵn1f(𝐳k)f(𝐳k)|δ16Lρϵn|g^k,11f(𝐳k)f(𝐳k)|\displaystyle\geq\left|y_{k,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{1}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right|-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{k,1}-\frac{\nabla_{1}f(\mathbf{z}_{k})}{\|\nabla f(\mathbf{z}_{k})\|}\right|
(1+rδρϵ16f(𝐳k)Ln)|yk,1|δ32𝒯n.\displaystyle\geq\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)|y_{k,1}|-\frac{\delta}{32\mathscr{T}\sqrt{n}}.

Combined with (12), we have

\displaystyle\frac{|y_{k+1,1}|}{\|\mathbf{y}_{k+1,\perp}\|}=\frac{|\bar{y}_{k+1,1}|}{\|\bar{\mathbf{y}}_{k+1,\perp}\|}
(1+rδρϵ16f(𝐳k)Ln)|yk,1|δ32𝒯n(1+rδρϵ16f(𝐳k)Ln)𝐲k,+δ32𝒯n.\displaystyle\geq\frac{\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)|y_{k,1}|-\frac{\delta}{32\mathscr{T}\sqrt{n}}}{\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\frac{\delta}{32\mathscr{T}\sqrt{n}}}.

Hence, if |yk,1|12|y_{k,1}|\geq\frac{1}{2}, (9) is also true for t=k+1t=k+1. Otherwise, we have 𝐲k,3/2\|\mathbf{y}_{k,\perp}\|\geq\sqrt{3}/2 and

\displaystyle\frac{|y_{k+1,1}|}{\|\mathbf{y}_{k+1,\perp}\|}\geq\frac{\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)|y_{k,1}|-\frac{\delta}{32\mathscr{T}\sqrt{n}}}{\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\frac{\delta}{32\mathscr{T}\sqrt{n}}}
(1+rδρϵ16f(𝐳k)Ln18𝒯)(1+rδρϵ16f(𝐳k)Ln+18𝒯)|yk,1|𝐲k,\displaystyle\geq\frac{\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}-\frac{1}{8\mathscr{T}}\right)}{\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{k})\|L\sqrt{n}}+\frac{1}{8\mathscr{T}}\right)}\frac{|y_{k,1}|}{\|\mathbf{y}_{k,\perp}\|}
(112𝒯)|yk,1|𝐲k,δ2πn(112𝒯)k+1.\displaystyle\geq\left(1-\frac{1}{2\mathscr{T}}\right)\frac{|y_{k,1}|}{\|\mathbf{y}_{k,\perp}\|}\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\left(1-\frac{1}{2\mathscr{T}}\right)^{k+1}.

Thus, we can conclude that (9) is true for all t[𝒯]t\in[\mathscr{T}]. This completes the proof. ∎

Lemma 13.

In the setting of Problem 4, for any ii with λiρϵ2\lambda_{i}\geq-\frac{\sqrt{\rho\epsilon}}{2}, the 𝒯\mathscr{T}-th iteration of Algorithm 8 satisfies

|y𝒯,i||y𝒯,1|(ρϵ)1/44nL\displaystyle\frac{|y_{\mathscr{T},i}|}{|y_{\mathscr{T},1}|}\leq\frac{(\rho\epsilon)^{1/4}}{4\sqrt{nL}} (15)

if |y0,1|δ2πn|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}} and f(𝟎)δr32πρϵn\|\nabla f(\mathbf{0})\|\leq\frac{\delta r}{32}\sqrt{\frac{\pi\rho\epsilon}{n}}.

Proof.

For any t[𝒯1]t\in[\mathscr{T}-1], similar to (14) in the proof of Lemma 12, we have

y¯t+1,i=yt,iδ16Lρϵng^t,i,\displaystyle\bar{y}_{t+1,i}=y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\hat{g}_{t,i},

and

\displaystyle|\bar{y}_{t+1,i}|\leq\left|y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{i}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right|+\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{t,i}-\frac{\nabla_{i}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right|. (16)

By Lemma 12 we have |yt,1|δ8πn|y_{t,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi}{n}} for each t[𝒯]t\in[\mathscr{T}], which combined with Lemma 11 leads to f(𝐳t)δr16πρϵn\|\nabla f(\mathbf{z}_{t})\|\geq\frac{\delta r}{16}\sqrt{\frac{\pi\rho\epsilon}{n}}. Thus, the second term on the RHS of (16) satisfies

δ16Lρϵn|g^t,iif(𝐳t)f(𝐳t)|δ16Lρϵn𝐠^tf(𝐳t)f(𝐳t)δδ^16Lρϵnδ(ρϵ)1/4128𝒯nπL\displaystyle\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{t,i}-\frac{\nabla_{i}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right|\leq\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{t}-\frac{\nabla f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right\|\leq\frac{\delta\hat{\delta}}{16L}\sqrt{\frac{\rho\epsilon}{n}}\leq\frac{\delta(\rho\epsilon)^{1/4}}{128\mathscr{T}n}\sqrt{\frac{\pi}{L}}

by Theorem 1. Moreover, the first term on the RHS of (16) satisfies

yt,iδ16Lρϵnif(𝐳t)f(𝐳t)\displaystyle y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{i}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|} =yt,iδ16Lρϵn𝐮i2f(𝟎)𝐮iyt,if(𝐳t)δ16Lρϵnσt,if(𝐳t)\displaystyle=y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\mathbf{u}_{i}^{\top}\nabla^{2}f(\mathbf{0})\mathbf{u}_{i}y_{t,i}}{\|\nabla f(\mathbf{z}_{t})\|}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\sigma_{t,i}}{\|\nabla f(\mathbf{z}_{t})\|}
(1+rδρϵ32f(𝐳t)Ln)yt,iδ16Lρϵnσt,if(𝐳t),\displaystyle\leq\left(1+\frac{r\delta\rho\epsilon}{32\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\sigma_{t,i}}{\|\nabla f(\mathbf{z}_{t})\|},

where the absolute value of

σt,iif(𝐳t)if(𝟎)(2f(𝟎)𝐳t)i\displaystyle\sigma_{t,i}\coloneqq\nabla_{i}f(\mathbf{z}_{t})-\nabla_{i}f(\mathbf{0})-(\nabla^{2}f(\mathbf{0})\mathbf{z}_{t})_{i}

is upper bounded by

\displaystyle\frac{\rho r^{2}}{2}+\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{256n\mathscr{T}}\leq\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{128n\mathscr{T}}

given that ff is ρ\rho-Hessian Lipschitz and

f(𝟎)πδr(ρϵ)1/4L256n𝒯.\|\nabla f(\mathbf{0})\|\leq\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{256n\mathscr{T}}.

Combined with (16), we can derive that

|y¯t+1,i|\displaystyle|\bar{y}_{t+1,i}| |yt,iδ16Lρϵnif(𝐳t)f(𝐳t)|+δ16Lρϵn|g^t,iif(𝐳t)f(𝐳t)|\displaystyle\leq\left|y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{i}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right|+\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{t,i}-\frac{\nabla_{i}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right|
(1+rδρϵ32f(𝐳t)Ln)|yt,i|+δ(ρϵ)1/464𝒯nπL.\displaystyle\leq\left(1+\frac{r\delta\rho\epsilon}{32\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)|y_{t,i}|+\frac{\delta(\rho\epsilon)^{1/4}}{64\mathscr{T}n}\sqrt{\frac{\pi}{L}}.

Considering that |yt,1|δ8πn|y_{t,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi}{n}},

|y¯t+1,1|\displaystyle|\bar{y}_{t+1,1}| |yt,1δ16Lρϵn1f(𝐳t)f(𝐳t)|δ16Lρϵn|g^t,11f(𝐳t)f(𝐳t)|\displaystyle\geq\left|y_{t,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\nabla_{1}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right|-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{t,1}-\frac{\nabla_{1}f(\mathbf{z}_{t})}{\|\nabla f(\mathbf{z}_{t})\|}\right|
(1+rδρϵ16f(𝐳t)Ln)|yt,1|δ(ρϵ)1/464𝒯nπL\displaystyle\geq\left(1+\frac{r\delta\rho\epsilon}{16\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)|y_{t,1}|-\frac{\delta(\rho\epsilon)^{1/4}}{64\mathscr{T}n}\sqrt{\frac{\pi}{L}}
(1+rδρϵ24f(𝐳t)Ln)|yt,1|,\displaystyle\geq\left(1+\frac{r\delta\rho\epsilon}{24\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)|y_{t,1}|,

where the last inequality is due to the fact that |yt,1|δ8πn|y_{t,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi}{n}} by Lemma 12. Hence, for any t[𝒯1]t\in[\mathscr{T}-1] we have

|yt+1,i||yt+1,1|\displaystyle\frac{|y_{t+1,i}|}{|y_{t+1,1}|} =|y¯t+1,i||y¯t+1,1|\displaystyle=\frac{|\bar{y}_{t+1,i}|}{|\bar{y}_{t+1,1}|}
(1+rδρϵ32f(𝐳t)Ln)|yt,i|+δ(ρϵ)1/464𝒯nπL(1+rδρϵ24f(𝐳t)Ln)|yt,1|\displaystyle\leq\frac{\left(1+\frac{r\delta\rho\epsilon}{32\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)|y_{t,i}|+\frac{\delta(\rho\epsilon)^{1/4}}{64\mathscr{T}n}\sqrt{\frac{\pi}{L}}}{\left(1+\frac{r\delta\rho\epsilon}{24\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)|y_{t,1}|}
(1+rδρϵ32f(𝐳t)Ln)|yt,i|(1+rδρϵ24f(𝐳t)Ln)|yt,1|+(ρϵ)1/48𝒯nL\displaystyle\leq\frac{\left(1+\frac{r\delta\rho\epsilon}{32\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)|y_{t,i}|}{\left(1+\frac{r\delta\rho\epsilon}{24\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}
(1rδρϵ192f(𝐳t)Ln)|yt,i||yt,1|+(ρϵ)1/48𝒯nL.\displaystyle\leq\left(1-\frac{r\delta\rho\epsilon}{192\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)\frac{|y_{t,i}|}{|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}.

Since ff is LL-smooth, we have

f(𝐳t)f(𝟎)+L𝐳t2Lr,\displaystyle\|\nabla f(\mathbf{z}_{t})\|\leq\|\nabla f(\mathbf{0})\|+L\|\mathbf{z}_{t}\|\leq 2Lr,

which leads to

|yt+1,i||yt+1,1|\displaystyle\frac{|y_{t+1,i}|}{|y_{t+1,1}|} (1rδρϵ192f(𝐳t)Ln)|yt,i||yt,1|+(ρϵ)1/48𝒯nL\displaystyle\leq\left(1-\frac{r\delta\rho\epsilon}{192\|\nabla f(\mathbf{z}_{t})\|L\sqrt{n}}\right)\frac{|y_{t,i}|}{|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}
(1δρϵ384L2n)|yt,i||yt,1|+(ρϵ)1/48𝒯nL.\displaystyle\leq\left(1-\frac{\delta\rho\epsilon}{384L^{2}\sqrt{n}}\right)\frac{|y_{t,i}|}{|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}.

Thus,

\displaystyle\frac{|y_{\mathscr{T},i}|}{|y_{\mathscr{T},1}|}\leq\left(1-\frac{\delta\rho\epsilon}{384L^{2}\sqrt{n}}\right)^{\mathscr{T}}\frac{|y_{0,i}|}{|y_{0,1}|}+\sum_{t=1}^{\mathscr{T}}\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}\left(1-\frac{\delta\rho\epsilon}{384L^{2}\sqrt{n}}\right)^{\mathscr{T}-t}
(1δρϵ384L2n)𝒯|y0,i||y0,1|+(ρϵ)1/48nL(ρϵ)1/44nL.\displaystyle\leq\left(1-\frac{\delta\rho\epsilon}{384L^{2}\sqrt{n}}\right)^{\mathscr{T}}\frac{|y_{0,i}|}{|y_{0,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\sqrt{nL}}\leq\frac{(\rho\epsilon)^{1/4}}{4\sqrt{nL}}.

Equipped with Lemma 13, we are now ready to prove Lemma 10.

Proof of Lemma 10.

We consider the case where |y0,1|δ2πn|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}, which happens with probability

Pr{|y0,1|δ2πn}1δ2πnVol(𝒮n2)Vol(𝒮n1)1δ.\displaystyle\Pr\left\{|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\right\}\geq 1-\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\cdot\frac{\text{Vol}(\mathcal{S}^{n-2})}{\text{Vol}(\mathcal{S}^{n-1})}\geq 1-\delta.

In this case, by Lemma 13 we have

|y𝒯,1|2=|y𝒯,1|2i=1n|y𝒯,i|2=(1+i=2n(|y𝒯,i||y𝒯,1|)2)1(1+ρϵ16L)11ρϵ8L,\displaystyle|y_{\mathscr{T},1}|^{2}=\frac{|y_{\mathscr{T},1}|^{2}}{\sum_{i=1}^{n}|y_{\mathscr{T},i}|^{2}}=\left(1+\sum_{i=2}^{n}\left(\frac{|y_{\mathscr{T},i}|}{|y_{\mathscr{T},1}|}\right)^{2}\right)^{-1}\geq\left(1+\frac{\sqrt{\rho\epsilon}}{16L}\right)^{-1}\geq 1-\frac{\sqrt{\rho\epsilon}}{8L},

and

𝐲𝒯,2=1|y𝒯,1|2ρϵ8L.\displaystyle\|\mathbf{y}_{\mathscr{T},\perp}\|^{2}=1-|y_{\mathscr{T},1}|^{2}\leq\frac{\sqrt{\rho\epsilon}}{8L}.

Let ss be the smallest integer such that λs0\lambda_{s}\geq 0. Then the output 𝐞^=𝐲𝒯\hat{\mathbf{e}}=\mathbf{y}_{\mathscr{T}} of Algorithm 8 satisfies

𝐞^2f(𝐱)𝐞^\displaystyle\hat{\mathbf{e}}^{\top}\nabla^{2}f(\mathbf{x})\hat{\mathbf{e}} =|y𝒯,1|2𝐮12f(𝐱)𝐮1+𝐲𝒯,2f(𝐱)𝐲𝒯,\displaystyle=|y_{\mathscr{T},1}|^{2}\mathbf{u}_{1}^{\top}\nabla^{2}f(\mathbf{x})\mathbf{u}_{1}+\mathbf{y}_{\mathscr{T},\perp}^{\top}\nabla^{2}f(\mathbf{x})\mathbf{y}_{\mathscr{T},\perp}
\displaystyle\leq-\sqrt{\rho\epsilon}\cdot|y_{\mathscr{T},1}|^{2}+L\sum_{i=s}^{n}|y_{\mathscr{T},i}|^{2}
ρϵ|y𝒯,1|2+L𝐲𝒯,2ρϵ4.\displaystyle\leq-\sqrt{\rho\epsilon}\cdot|y_{\mathscr{T},1}|^{2}+L\|\mathbf{y}_{\mathscr{T},\perp}\|^{2}\leq-\frac{\sqrt{\rho\epsilon}}{4}.

The query complexity of Algorithm 8 only comes from the gradient direction estimation step in Line 4, which equals

\displaystyle\mathscr{T}\cdot O\left(n\log(n/\hat{\delta})\right)=O\left(\frac{L^{2}n^{3/2}}{\delta\rho\epsilon}\log^{2}\frac{nL}{\delta\sqrt{\rho\epsilon}}\right).

C.2.2 Negative curvature finding when the gradient is relatively large

In this part, we present our negative curvature finding algorithm that finds the negative curvature of a point 𝐱\mathbf{x} with λmin(2f(𝐱))ρϵ\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon} when the norm of the gradient f(𝐱)\nabla f(\mathbf{x}) is relatively large.

Input: Function f:nf\colon\mathbb{R}^{n}\to\mathbb{R}, 𝐱\mathbf{x}, precision ϵ\epsilon, error probability δ\delta
1 𝒯384L2nδρϵlog36nLρϵ\mathscr{T}\leftarrow\frac{384L^{2}\sqrt{n}}{\delta\rho\epsilon}\log\frac{36nL}{\sqrt{\rho\epsilon}}, δ^18𝒯(ρϵ)1/4πLn\hat{\delta}\leftarrow\frac{1}{8\mathscr{T}(\rho\epsilon)^{1/4}}\sqrt{\frac{\pi L}{n}}, γ𝐱πδr(ρϵ)1/4L256n𝒯\gamma_{\mathbf{x}}\leftarrow\frac{\pi\delta r(\rho\epsilon)^{1/4}\sqrt{L}}{256n\mathscr{T}}, γ𝐲δ8πn\gamma_{\mathbf{y}}\leftarrow\frac{\delta}{8}\sqrt{\frac{\pi}{n}}
2 𝐲0\mathbf{y}_{0}\leftarrowUniform(𝒮n1)(\mathcal{S}^{n-1})
3 for t=0,,𝒯1t=0,\ldots,\mathscr{T}-1 do
4       𝐠^t\hat{\mathbf{g}}_{t}\leftarrowComparison-Hessian-Vector(𝐱,𝐲t,δ^,γ𝐱,γ𝐲)(\mathbf{x},\mathbf{y}_{t},\hat{\delta},\gamma_{\mathbf{x}},\gamma_{\mathbf{y}})
5       𝐲¯t+1𝐲tδ16Lρϵn𝐠^t\bar{\mathbf{y}}_{t+1}\leftarrow\mathbf{y}_{t}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\hat{\mathbf{g}}_{t}
6       \mathbf{y}_{t+1}\leftarrow\bar{\mathbf{y}}_{t+1}/\|\bar{\mathbf{y}}_{t+1}\|
7      
return 𝐞^𝐲𝒯\hat{\mathbf{e}}\leftarrow\mathbf{y}_{\mathscr{T}}
Algorithm 9 Comparison-based Negative Curvature Finding 2 (Comparison-NCF2)

The subroutine Comparison-Hessian-Vector in Line 4 of Algorithm 9 is given as Algorithm 10, whose output approximates the direction of the Hessian-vector product $\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}_{t}$.

Input: Function f:nf\colon\mathbb{R}^{n}\to\mathbb{R}, 𝐱,𝐲n\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}, precision δ^\hat{\delta}, lower bound γ𝐱\gamma_{\mathbf{x}} on f(𝐱)\|\nabla f(\mathbf{x})\|, lower bound γ𝐲\gamma_{\mathbf{y}} on |y1||y_{1}|
1 Set r0min{γ𝐱100L,γ𝐱100ρ,γ𝐱δ^20ρ,γ𝐲δ^ϵ20ρ}r_{0}\leftarrow\min\left\{\frac{\gamma_{\mathbf{x}}}{100L},\frac{\gamma_{\mathbf{x}}}{100\rho},\frac{\sqrt{\gamma_{\mathbf{x}}\hat{\delta}}}{20\sqrt{\rho}},\frac{\gamma_{\mathbf{y}}\hat{\delta}\sqrt{\epsilon}}{20\sqrt{\rho}}\right\}
2 𝐠^0\hat{\mathbf{g}}_{0}\leftarrowComparison-GDE(𝐱,ρr02γ𝐱,γ𝐱)(\mathbf{x},\frac{\rho r_{0}^{2}}{\gamma_{\mathbf{x}}},\gamma_{\mathbf{x}}), 𝐠^1\hat{\mathbf{g}}_{1}\leftarrowComparison-GDE(𝐱+r0𝐲,ρr02γ𝐱,γ𝐱/2)(\mathbf{x}+r_{0}\mathbf{y},\frac{\rho r_{0}^{2}}{\gamma_{\mathbf{x}}},\gamma_{\mathbf{x}}/2), 𝐠^1\hat{\mathbf{g}}_{-1}\leftarrowComparison-GDE(𝐱r0𝐲,ρr02γ𝐱,γ𝐱/2)(\mathbf{x}-r_{0}\mathbf{y},\frac{\rho r_{0}^{2}}{\gamma_{\mathbf{x}}},\gamma_{\mathbf{x}}/2)
3 Set 𝐠=1𝐠^1,𝐠^02𝐠^11𝐠^1,𝐠^02𝐠^1\mathbf{g}=\sqrt{1-\langle\hat{\mathbf{g}}_{-1},\hat{\mathbf{g}}_{0}\rangle^{2}}\hat{\mathbf{g}}_{1}-\sqrt{1-\langle\hat{\mathbf{g}}_{1},\hat{\mathbf{g}}_{0}\rangle^{2}}\hat{\mathbf{g}}_{-1}
return 𝐠^=𝐠/𝐠\hat{\mathbf{g}}=\mathbf{g}/\|\mathbf{g}\|
Algorithm 10 Comparison-based Hessian-vector product (Comparison-Hessian-Vector)
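The construction in Algorithm 10 can be checked on a quadratic, where the Taylor expansions used in the proof of Lemma 14 below are exact. In the following sketch the three Comparison-GDE calls are replaced by exact normalized gradients, which is only a stand-in; the test function and the displacement $r_{0}$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
unit = lambda v: v / np.linalg.norm(v)

# quadratic with nonzero gradient and indefinite Hessian: f(z) = 0.5 z^T H z + b^T z
lams = np.linspace(-0.5, 1.0, n)
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
H = Q @ np.diag(lams) @ Q.T
b = rng.normal(size=n)
grad = lambda z: H @ z + b

x = rng.normal(size=n)
y = unit(rng.normal(size=n))
r0 = 1e-3                          # small displacement, as set in Line 1 of Algorithm 10

# stand-ins for the three Comparison-GDE calls in Line 2
g0 = unit(grad(x))
gp = unit(grad(x + r0 * y))
gm = unit(grad(x - r0 * y))

# the combination in Line 3 of Algorithm 10
g = np.sqrt(1 - (gm @ g0) ** 2) * gp - np.sqrt(1 - (gp @ g0) ** 2) * gm
g_hat = unit(g)

print("distance to the normalized Hessian-vector product:",
      np.linalg.norm(g_hat - unit(H @ y)))
```

For a quadratic the combination recovers the direction of $\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}$ up to floating-point error, since the gradient is exactly linear; the proof of Lemma 14 below quantifies the additional error for general $\rho$-Hessian Lipschitz functions.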
Lemma 14.

In the setting of Problem 4, for any $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$ satisfying

f(𝐱)γ𝐱,λmin(2f(𝐱))ρϵ,𝐲=1,|y1|γ𝐲,\displaystyle\|\nabla f(\mathbf{x})\|\geq\gamma_{\mathbf{x}},\quad\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon},\quad\|\mathbf{y}\|=1,\quad|y_{1}|\geq\gamma_{\mathbf{y}},

Algorithm 10 outputs a vector 𝐠^\hat{\mathbf{g}} satisfying

𝐠^2f(𝐱)𝐲2f(𝐱)𝐲δ^\displaystyle\left\|\hat{\mathbf{g}}-\frac{\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|}\right\|\leq\hat{\delta}

using $O\left(n\log\left(n\rho L^{2}/(\gamma_{\mathbf{x}}\gamma_{\mathbf{y}}^{2}\epsilon\hat{\delta}^{2})\right)\right)$ queries.

Proof of Lemma 14.

Since ff is a ρ\rho-Hessian Lipschitz function,

f(𝐱+r0𝐲)f(𝐱)r02f(𝐱)𝐲ρ2r02;\displaystyle\left\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})-\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\right\|\leq\frac{\rho}{2}r_{0}^{2}; (17)
f(𝐱r0𝐲)f(𝐱)+r02f(𝐱)𝐲ρ2r02.\displaystyle\left\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})-\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\right\|\leq\frac{\rho}{2}r_{0}^{2}. (18)

Therefore,

f(𝐱+r0𝐲)+f(𝐱r0𝐲)2f(𝐱)\displaystyle\left\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})+\nabla f(\mathbf{x}-r_{0}\mathbf{y})-2\nabla f(\mathbf{x})\right\| ρr02;\displaystyle\leq\rho r_{0}^{2}; (19)
2f(𝐱)𝐲12r0(f(𝐱+r0𝐲)f(𝐱r0𝐲))\displaystyle\left\|\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}-\frac{1}{2r_{0}}\left(\nabla f(\mathbf{x}+r_{0}\mathbf{y})-\nabla f(\mathbf{x}-r_{0}\mathbf{y})\right)\right\| ρ2r0.\displaystyle\leq\frac{\rho}{2}r_{0}. (20)

Furthermore, because r0γ𝐱100Lr_{0}\leq\frac{\gamma_{\mathbf{x}}}{100L} and ff is LL-smooth,

f(𝐱+r0𝐲),f(𝐱r0𝐲)γ𝐱Lγ𝐱100L=0.99γ𝐱.\displaystyle\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|,\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|\geq\gamma_{\mathbf{x}}-L\cdot\frac{\gamma_{\mathbf{x}}}{100L}=0.99\gamma_{\mathbf{x}}.

We first understand how to approximate 2f(𝐱)𝐲\nabla^{2}f(\mathbf{x})\cdot\mathbf{y} by normalized vectors f(𝐱)f(𝐱),f(𝐱+r0𝐲)f(𝐱+r0𝐲),f(𝐱r0𝐲)f(𝐱r0𝐲)\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|},\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|},\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|}, and then analyze the approximation error due to using 𝐠^0,𝐠^1,𝐠^1\hat{\mathbf{g}}_{0},\hat{\mathbf{g}}_{1},\hat{\mathbf{g}}_{-1}, respectively. By Lemma 7, we have

12f(𝐱)f(𝐱)r02f(𝐱)𝐲1f(𝐱)+r02f(𝐱)𝐲f(𝐱)+r02f(𝐱)𝐲,f(𝐱)f(𝐱)2\displaystyle\frac{1}{2\|\nabla f(\mathbf{x})\|}\frac{\|\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|}{\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}}
=12f(𝐱)f(𝐱)+r02f(𝐱)𝐲1f(𝐱)r02f(𝐱)𝐲f(𝐱)r02f(x)𝐲,f(𝐱)f(𝐱)2=:α,\displaystyle\qquad\qquad=\frac{1}{2\|\nabla f(\mathbf{x})\|}\frac{\|\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|}{\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(x)\cdot\mathbf{y}\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}}=:\alpha, (21)

i.e., we denote the value above as $\alpha$. Because $f$ is $L$-smooth, $\|r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|\leq r_{0}L$; since $r_{0}\leq\frac{\gamma_{\mathbf{x}}}{100L}$, this gives $\|r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|\leq\frac{\gamma_{\mathbf{x}}}{100}$. Also note that by Lemma 6 we have

f(𝐱)+r02f(𝐱)𝐲f(𝐱)+r02f(𝐱)𝐲,f(𝐱)f(𝐱)0.94,f(𝐱)r02f(𝐱)𝐲f(𝐱)r02f(𝐱)𝐲,f(𝐱)f(𝐱)0.94.\displaystyle\left\langle\frac{\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle\geq 0.94,\quad\left\langle\frac{\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle\geq 0.94.

This promises that

α0.99210.9421.\displaystyle\alpha\geq\frac{0.99}{2\sqrt{1-0.94^{2}}}\geq 1. (22)

In the arguments that follow, we say that a vector $\mathbf{u}$ is $d$-close to a vector $\mathbf{v}$ if $\|\mathbf{u}-\mathbf{v}\|\leq d$. We prove that the vector

𝐠~1:=f(𝐱)f(𝐱)+α(1f(𝐱r0𝐲)f(𝐱r0𝐲),f(𝐱)f(𝐱)2f(𝐱+r0𝐲)f(𝐱+r0𝐲)\displaystyle\tilde{\mathbf{g}}_{1}:=\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}+\alpha\cdot\Bigg{(}\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|}
1f(𝐱+r0𝐲)f(𝐱+r0𝐲),f(𝐱)f(𝐱)2f(𝐱r0𝐲)f(𝐱r0𝐲))\displaystyle-\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|}\Bigg{)} (23)

is 7ρr02γ𝐱\frac{7\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}-close to a vector proportional to f(𝐱+r0𝐲)\nabla f(\mathbf{x}+r_{0}\mathbf{y}). This is because (17), (18), and Lemma 5 imply that

f(𝐱+r0𝐲)f(𝐱+r0𝐲) and f(𝐱)+r02f(𝐱)𝐲f(𝐱)+r02f(𝐱)𝐲\displaystyle\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|}\text{\quad and\quad}\frac{\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|}

are ρr020.99γ𝐱\frac{\rho r_{0}^{2}}{0.99\gamma_{\mathbf{x}}}-close to each other,

1f(𝐱r0𝐲)f(𝐱r0𝐲),f(𝐱)f(𝐱)2f(𝐱+r0𝐲)f(𝐱+r0𝐲)\displaystyle\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|} (24)

is proportional to f(𝐱+r0𝐲)\nabla f(\mathbf{x}+r_{0}\mathbf{y}), and the definition of α\alpha implies

f(𝐱)f(𝐱)α1f(𝐱)+r02f(𝐱)𝐲f(𝐱)+r02f(𝐱)𝐲,f(𝐱)f(𝐱)2f(𝐱)r02f(𝐱)𝐲f(𝐱)r02f(𝐱)𝐲\displaystyle\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}-\alpha\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}\frac{\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|}
=f(𝐱)+r02f(𝐱)𝐲2f(𝐱).\displaystyle=\frac{\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{2\|\nabla f(\mathbf{x})\|}. (25)

The above vector is $\frac{\rho r_{0}^{2}}{4\gamma_{\mathbf{x}}}$-close to $\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{2\|\nabla f(\mathbf{x})\|}$ by (17), and the error accumulated in the above steps is at most $\frac{6\rho r_{0}^{2}}{0.99\gamma_{\mathbf{x}}}$ by Lemma 6. In total, $\frac{6\rho r_{0}^{2}}{0.99\gamma_{\mathbf{x}}}+\frac{\rho r_{0}^{2}}{4\gamma_{\mathbf{x}}}\leq\frac{7\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}$.

Furthermore, this vector proportional to f(𝐱+r0𝐲)\nabla f(\mathbf{x}+r_{0}\mathbf{y}) that is ρr024γ𝐱\frac{\rho r_{0}^{2}}{4\gamma_{\mathbf{x}}}-close to (23) has norm at least (10.01)/2=0.495(1-0.01)/2=0.495 because the coefficient in (24) is positive, while in the equality above we have r02f(𝐱)𝐲γ𝐱100\|r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|\leq\frac{\gamma_{\mathbf{x}}}{100}. Therefore, applying Lemma 5, the vector 𝐠~1\tilde{\mathbf{g}}_{1} in (23) satisfies

𝐠~1𝐠~1f(𝐱+r0𝐲)f(𝐱+r0𝐲)29ρr02γ𝐱.\displaystyle\left\|\frac{\tilde{\mathbf{g}}_{1}}{\|\tilde{\mathbf{g}}_{1}\|}-\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|}\right\|\leq\frac{29\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}. (26)

Following the same proof, we can prove that the vector

𝐠~1:=f(𝐱)f(𝐱)α(1f(𝐱r0𝐲)f(𝐱r0𝐲),f(𝐱)f(𝐱)2f(𝐱+r0𝐲)f(𝐱+r0𝐲)\displaystyle\tilde{\mathbf{g}}_{-1}:=\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}-\alpha\cdot\Bigg{(}\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|}
1f(𝐱+r0𝐲)f(𝐱+r0𝐲),f(𝐱)f(𝐱)2f(𝐱r0𝐲)f(𝐱r0𝐲))\displaystyle-\sqrt{1-\left\langle\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|},\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\rangle^{2}}\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|}\Bigg{)} (27)

satisfies

𝐠~1𝐠~1f(𝐱r0𝐲)f(𝐱r0𝐲)29ρr02γ𝐱.\displaystyle\left\|\frac{\tilde{\mathbf{g}}_{-1}}{\|\tilde{\mathbf{g}}_{-1}\|}-\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|}\right\|\leq\frac{29\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}. (28)

Furthermore, (25) implies that 𝐠~1𝐠~1\tilde{\mathbf{g}}_{1}-\tilde{\mathbf{g}}_{-1} is 27ρr02γ𝐱=14ρr02γ𝐱2\cdot\frac{7\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}=\frac{14\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}-close to

f(𝐱)+r02f(𝐱)𝐲2f(𝐱)f(𝐱)r02f(𝐱)𝐲2f(𝐱)=r0f(𝐱)2f(𝐱)𝐲.\displaystyle\frac{\nabla f(\mathbf{x})+r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{2\|\nabla f(\mathbf{x})\|}-\frac{\nabla f(\mathbf{x})-r_{0}\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{2\|\nabla f(\mathbf{x})\|}=\frac{r_{0}}{\|\nabla f(\mathbf{x})\|}\ \nabla^{2}f(\mathbf{x})\cdot\mathbf{y}. (29)

Because λmin(2f(𝐱))ρϵ\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon} and |y1|γ𝐲|y_{1}|\geq\gamma_{\mathbf{y}}, 2f(𝐱)𝐲ρϵγ𝐲\|\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|\geq\sqrt{\rho\epsilon}\gamma_{\mathbf{y}}. Therefore, the RHS of (29) has norm at least r0ρϵγ𝐲γ𝐱\frac{r_{0}\sqrt{\rho\epsilon}\gamma_{\mathbf{y}}}{\gamma_{\mathbf{x}}}, and by Lemma 5 we have

𝐠~1𝐠~1𝐠~1𝐠~12f(𝐱)𝐲2f(𝐱)𝐲14ρr02γ𝐱/r0ρϵγ𝐲γ𝐱=14r0ρϵγ𝐲.\displaystyle\left\|\frac{\tilde{\mathbf{g}}_{1}-\tilde{\mathbf{g}}_{-1}}{\|\tilde{\mathbf{g}}_{1}-\tilde{\mathbf{g}}_{-1}\|}-\frac{\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|}\right\|\leq\frac{14\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}/\frac{r_{0}\sqrt{\rho\epsilon}\gamma_{\mathbf{y}}}{\gamma_{\mathbf{x}}}=\frac{14r_{0}\sqrt{\rho}}{\sqrt{\epsilon}\gamma_{\mathbf{y}}}. (30)

Finally, by Theorem 1 and our choice of the precision parameter, the error coming from running Comparison-GDE is:

𝐠^0f(𝐱)f(𝐱),𝐠^1f(𝐱+r0𝐲)f(𝐱+r0𝐲),𝐠^1f(𝐱r0𝐲)f(𝐱r0𝐲)ρr02γ𝐱.\displaystyle\left\|\hat{\mathbf{g}}_{0}-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\right\|,\ \left\|\hat{\mathbf{g}}_{1}-\frac{\nabla f(\mathbf{x}+r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}+r_{0}\mathbf{y})\|}\right\|,\ \left\|\hat{\mathbf{g}}_{-1}-\frac{\nabla f(\mathbf{x}-r_{0}\mathbf{y})}{\|\nabla f(\mathbf{x}-r_{0}\mathbf{y})\|}\right\|\leq\frac{\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}. (31)

Combined with (26) and (28), we know that the vector 𝐠\mathbf{g} we obtained in Algorithm 10 is

29ρr02γ𝐱+29ρr02γ𝐱+3ρr02γ𝐱=61ρr02γ𝐱\displaystyle\frac{29\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}+\frac{29\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}+3\cdot\frac{\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}=\frac{61\rho r_{0}^{2}}{\gamma_{\mathbf{x}}} (32)

close to (𝐠~1𝐠~1)/2α(\tilde{\mathbf{g}}_{1}-\tilde{\mathbf{g}}_{-1})/2\alpha. Since α1\alpha\geq 1 by (22), by Lemma 5 we have

𝐠𝐠𝐠~1𝐠~1𝐠~1𝐠~161ρr02γ𝐱.\displaystyle\left\|\frac{\mathbf{g}}{\|\mathbf{g}\|}-\frac{\tilde{\mathbf{g}}_{1}-\tilde{\mathbf{g}}_{-1}}{\|\tilde{\mathbf{g}}_{1}-\tilde{\mathbf{g}}_{-1}\|}\right\|\leq\frac{61\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}. (33)

In total, all the errors we have accumulated are (30) and (33):

𝐠𝐠2f(𝐱)𝐲2f(𝐱)𝐲61ρr02γ𝐱+14r0ρϵγ𝐲.\displaystyle\left\|\frac{\mathbf{g}}{\|\mathbf{g}\|}-\frac{\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}}{\|\nabla^{2}f(\mathbf{x})\cdot\mathbf{y}\|}\right\|\leq\frac{61\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}+\frac{14r_{0}\sqrt{\rho}}{\sqrt{\epsilon}\gamma_{\mathbf{y}}}. (34)

Our selection of r0=min{γ𝐱100L,γ𝐱100ρ,γ𝐱δ^20ρ,γ𝐲δ^ϵ20ρ}r_{0}=\min\left\{\frac{\gamma_{\mathbf{x}}}{100L},\frac{\gamma_{\mathbf{x}}}{100\rho},\frac{\sqrt{\gamma_{\mathbf{x}}\hat{\delta}}}{20\sqrt{\rho}},\frac{\gamma_{\mathbf{y}}\hat{\delta}\sqrt{\epsilon}}{20\sqrt{\rho}}\right\} can guarantee that (34) is at most δ^\hat{\delta}.

In terms of query complexity, we made 3 calls to Comparison-GDE. By Theorem 1 and that our precision is

ρr02γ𝐱=Ω(γ𝐱γ𝐲2ϵδ^2ρL2),\displaystyle\frac{\rho r_{0}^{2}}{\gamma_{\mathbf{x}}}=\Omega\left(\frac{\gamma_{\mathbf{x}}\gamma_{\mathbf{y}}^{2}\epsilon\hat{\delta}^{2}}{\rho L^{2}}\right),

the total query complexity is O(nlog(nρL2/γ𝐱γ𝐲2ϵδ^2))O\left(n\log(n\rho L^{2}/\gamma_{\mathbf{x}}\gamma_{\mathbf{y}}^{2}\epsilon\hat{\delta}^{2})\right). ∎

Based on Lemma 14, we obtain the following result.

Lemma 15.

In the setting of Problem 4, for any 𝐱\mathbf{x} satisfying

f(𝐱)L(πδ256n𝒯)2ϵρ,λmin(2f(𝐱))ρϵ,\displaystyle\|\nabla f(\mathbf{x})\|\geq L\left(\frac{\pi\delta}{256n\mathscr{T}}\right)^{2}\sqrt{\frac{\epsilon}{\rho}},\qquad\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\leq-\sqrt{\rho\epsilon},

Algorithm 9 outputs a unit vector 𝐞^\hat{\mathbf{e}} satisfying

𝐞^2f(𝐱)𝐞^ρϵ/4,\displaystyle\hat{\mathbf{e}}^{\top}\nabla^{2}f(\mathbf{x})\hat{\mathbf{e}}\leq-\sqrt{\rho\epsilon}/4,

with success probability at least 1δ1-\delta using

O(L2n3/2δρϵlog2nLδρϵ)O\left(\frac{L^{2}n^{3/2}}{\delta\rho\epsilon}\log^{2}\frac{nL}{\delta\sqrt{\rho\epsilon}}\right)

queries.

The proof of Lemma 15 is similar to the proof of Lemma 10. Without loss of generality we assume 𝐱=𝟎\mathbf{x}=\mathbf{0} by shifting n\mathbb{R}^{n} such that 𝐱\mathbf{x} is mapped to 𝟎\mathbf{0}. We denote 𝐠t2f(𝟎)𝐲t\mathbf{g}_{t}\coloneqq\nabla^{2}f(\mathbf{0})\cdot\mathbf{y}_{t} for each iteration t[𝒯]t\in[\mathscr{T}] of Algorithm 9.

Lemma 16.

In the setting of Problem 4, for any iteration t[𝒯]t\in[\mathscr{T}] of Algorithm 9 we have

|yt,1|δ8πn\displaystyle|y_{t,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi}{n}} (35)

if |y0,1|δ2πn|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}} and f(𝟎)δr32πρϵn\|\nabla f(\mathbf{0})\|\leq\frac{\delta r}{32}\sqrt{\frac{\pi\rho\epsilon}{n}}.

Proof.

We prove this lemma by induction. In particular, assume that

|yt,1|𝐲t,δ2πn(112𝒯)t\displaystyle\frac{|y_{t,1}|}{\|\mathbf{y}_{t,\perp}\|}\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\left(1-\frac{1}{2\mathscr{T}}\right)^{t} (36)

is true for all t\leq k for some k. Since \mathbf{y}_{t} is a unit vector, this guarantees that

\displaystyle|y_{t,1}|\geq\frac{\delta}{4}\sqrt{\frac{\pi}{n}}\left(1-\frac{1}{2\mathscr{T}}\right)^{t}.

Then for t=k+1t=k+1, we have

𝐲¯k+1,=𝐲k,δ16Lρϵn𝐠^k,,\displaystyle\bar{\mathbf{y}}_{k+1,\perp}=\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\hat{\mathbf{g}}_{k,\perp},

and

𝐲¯k+1,𝐲k,δ16Lρϵn𝐠k,𝐠k+δ16Lρϵn𝐠^k,𝐠k,𝐠k,\displaystyle\|\bar{\mathbf{y}}_{k+1,\perp}\|\leq\left\|\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\mathbf{g}_{k,\perp}}{\|\mathbf{g}_{k}\|}\right\|+\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k,\perp}-\frac{\mathbf{g}_{k,\perp}}{\|\mathbf{g}_{k}\|}\right\|, (37)

where

δ16Lρϵn𝐠^k,𝐠k,𝐠kδ16Lρϵn𝐠^k𝐠k𝐠kδδ^16Lρϵnδ64𝒯n.\displaystyle\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k,\perp}-\frac{\mathbf{g}_{k,\perp}}{\|\mathbf{g}_{k}\|}\right\|\leq\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k}-\frac{\mathbf{g}_{k}}{\|\mathbf{g}_{k}\|}\right\|\leq\frac{\delta\hat{\delta}}{16L}\sqrt{\frac{\rho\epsilon}{n}}\leq\frac{\delta}{64\mathscr{T}\sqrt{n}}.

The second inequality follows from Lemma 14. Next, we proceed to bound the first term on the RHS of (37). Note that

𝐠k,=2f(𝟎)𝐲k,=i=2nλi𝐲k,,𝐮i𝐮i,\displaystyle\mathbf{g}_{k,\perp}=\nabla^{2}f(\mathbf{0})\mathbf{y}_{k,\perp}=\sum_{i=2}^{n}\lambda_{i}\langle\mathbf{y}_{k,\perp},\mathbf{u}_{i}\rangle\mathbf{u}_{i},

and

𝐲k,δ16Lρϵn𝐠k,𝐠k=i=2n(1δ16𝐠kρϵnλiL)𝐲k,,𝐮i𝐮i,\displaystyle\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\mathbf{g}_{k,\perp}}{\|\mathbf{g}_{k}\|}=\sum_{i=2}^{n}\left(1-\frac{\delta}{16\|\mathbf{g}_{k}\|}\sqrt{\frac{\rho\epsilon}{n}}\frac{\lambda_{i}}{L}\right)\langle\mathbf{y}_{k,\perp},\mathbf{u}_{i}\rangle\mathbf{u}_{i},

where

\displaystyle\|\mathbf{g}_{k}\|\geq|g_{k,1}|\geq\sqrt{\rho\epsilon}|y_{k,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi\rho\epsilon}{n}}.

Consequently, we have

1δ16𝐠kρϵnλiL1,i=1,,n,\displaystyle-1\leq\frac{\delta}{16\|\mathbf{g}_{k}\|}\sqrt{\frac{\rho\epsilon}{n}}\frac{\lambda_{i}}{L}\leq 1,\qquad\forall i=1,\ldots,n,

which leads to

𝐲k,δ16Lρϵn𝐠k,𝐠k\displaystyle\left\|\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\mathbf{g}_{k,\perp}}{\|\mathbf{g}_{k}\|}\right\| i=2n(1+δρϵ16𝐠kLn)𝐲k,,𝐮i𝐮i\displaystyle\leq\left\|\sum_{i=2}^{n}\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)\langle\mathbf{y}_{k,\perp},\mathbf{u}_{i}\rangle\mathbf{u}_{i}\right\|
(1+δρϵ16𝐠kLn)𝐲k,.\displaystyle\leq\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|.

Combined with (37), we can derive that

𝐲¯k+1,\displaystyle\|\bar{\mathbf{y}}_{k+1,\perp}\| 𝐲k,δ16Lρϵn𝐠k,𝐠k+δ16Lρϵn𝐠^k,𝐠k,𝐠k\displaystyle\leq\left\|\mathbf{y}_{k,\perp}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\mathbf{g}_{k,\perp}}{\|\mathbf{g}_{k}\|}\right\|+\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k,\perp}-\frac{\mathbf{g}_{k,\perp}}{\|\mathbf{g}_{k}\|}\right\| (38)
(1+δρϵ16𝐠kLn)𝐲k,+δ64𝒯n.\displaystyle\leq\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\frac{\delta}{64\mathscr{T}\sqrt{n}}. (39)

Similarly, we have

|y¯k+1,1||yk,1δ16Lρϵngk,1𝐠k|δ16Lρϵn|g^k,1gk,1𝐠k|,\displaystyle|\bar{y}_{k+1,1}|\geq\left|y_{k,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{g_{k,1}}{\|\mathbf{g}_{k}\|}\right|-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{k,1}-\frac{g_{k,1}}{\|\mathbf{g}_{k}\|}\right|, (40)

where the second term on the RHS of (40) satisfies

δ16Lρϵn|g^k,1gk,1𝐠k|δ16Lρϵn𝐠^k𝐠k𝐠kδδ^16Lρϵnδ64𝒯n,\displaystyle\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{k,1}-\frac{g_{k,1}}{\|\mathbf{g}_{k}\|}\right|\leq\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{k}-\frac{\mathbf{g}_{k}}{\|\mathbf{g}_{k}\|}\right\|\leq\frac{\delta\hat{\delta}}{16L}\sqrt{\frac{\rho\epsilon}{n}}\leq\frac{\delta}{64\mathscr{T}\sqrt{n}},

by Lemma 14. Combined with (40), we can derive that

|y¯k+1,1|\displaystyle|\bar{y}_{k+1,1}| |yk,1δ16Lρϵngk,1𝐠k|δ16Lρϵn|g^k,1gk,1𝐠k|\displaystyle\geq\left|y_{k,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{g_{k,1}}{\|\mathbf{g}_{k}\|}\right|-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{k,1}-\frac{g_{k,1}}{\|\mathbf{g}_{k}\|}\right|
(1+δρϵ16𝐠kLn)|yk,1|δ64𝒯n.\displaystyle\geq\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)|y_{k,1}|-\frac{\delta}{64\mathscr{T}\sqrt{n}}.

Consequently,

\displaystyle\frac{|y_{k+1,1}|}{\|\mathbf{y}_{k+1,\perp}\|} \displaystyle=\frac{|\bar{y}_{k+1,1}|}{\|\bar{\mathbf{y}}_{k+1,\perp}\|}
(1+δρϵ16𝐠kLn)|yk,1|δ64𝒯n(1+δρϵ16𝐠kLn)𝐲k,+δ64𝒯n.\displaystyle\geq\frac{\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)|y_{k,1}|-\frac{\delta}{64\mathscr{T}\sqrt{n}}}{\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\frac{\delta}{64\mathscr{T}\sqrt{n}}}.

Thus, if |yk,1|12|y_{k,1}|\geq\frac{1}{2}, (36) is also true for t=k+1t=k+1. Otherwise, we have 𝐲k,3/2\|\mathbf{y}_{k,\perp}\|\geq\sqrt{3}/2 and

\displaystyle\frac{|y_{k+1,1}|}{\|\mathbf{y}_{k+1,\perp}\|} \displaystyle\geq\frac{\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)|y_{k,1}|-\frac{\delta}{64\mathscr{T}\sqrt{n}}}{\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}\right)\|\mathbf{y}_{k,\perp}\|+\frac{\delta}{64\mathscr{T}\sqrt{n}}}
(1+δρϵ16𝐠kLn18𝒯)(1+δρϵ16𝐠kLn+18𝒯)|yk,1|𝐲k,\displaystyle\geq\frac{\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}-\frac{1}{8\mathscr{T}}\right)}{\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{k}\|L\sqrt{n}}+\frac{1}{8\mathscr{T}}\right)}\frac{|y_{k,1}|}{\|\mathbf{y}_{k,\perp}\|}
(112𝒯)|yk,1|𝐲k,δ2πn(112𝒯)k+1.\displaystyle\geq\left(1-\frac{1}{2\mathscr{T}}\right)\frac{|y_{k,1}|}{\|\mathbf{y}_{k,\perp}\|}\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\left(1-\frac{1}{2\mathscr{T}}\right)^{k+1}.

Thus, (36) is true for all t\in[\mathscr{T}]. Since \mathbf{y}_{t} is a unit vector and \left(1-\frac{1}{2\mathscr{T}}\right)^{t}\geq\frac{1}{2}, this implies (35). This completes the proof. ∎

Lemma 17.

In the setting of Problem 4, for any ii with λiρϵ2\lambda_{i}\geq-\frac{\sqrt{\rho\epsilon}}{2}, the 𝒯\mathscr{T}-th iteration of Algorithm 9 satisfies

|y𝒯,i||y𝒯,1|(ρϵ)1/44nL\displaystyle\frac{|y_{\mathscr{T},i}|}{|y_{\mathscr{T},1}|}\leq\frac{(\rho\epsilon)^{1/4}}{4\sqrt{nL}} (41)

if |y0,1|δ2πn|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}.

Proof.

For any t[𝒯1]t\in[\mathscr{T}-1], similar to (40) in the proof of Lemma 16, we have

y¯t+1,i=yt,iδ16Lρϵng^t,i,\displaystyle\bar{y}_{t+1,i}=y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\hat{g}_{t,i},

and

\displaystyle|\bar{y}_{t+1,i}|\leq\left|y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{g_{t,i}}{\|\mathbf{g}_{t}\|}\right|+\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{t,i}-\frac{g_{t,i}}{\|\mathbf{g}_{t}\|}\right|, (42)

where the second term on the RHS of (42) satisfies

δ16Lρϵn|g^t,igt,i𝐠t|δ16Lρϵn𝐠^t𝐠t𝐠tδδ^16Lρϵnδ(ρϵ)1/4128𝒯nπL\displaystyle\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{t,i}-\frac{g_{t,i}}{\|\mathbf{g}_{t}\|}\right|\leq\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left\|\hat{\mathbf{g}}_{t}-\frac{\mathbf{g}_{t}}{\|\mathbf{g}_{t}\|}\right\|\leq\frac{\delta\hat{\delta}}{16L}\sqrt{\frac{\rho\epsilon}{n}}\leq\frac{\delta(\rho\epsilon)^{1/4}}{128\mathscr{T}n}\sqrt{\frac{\pi}{L}}

by Lemma 14. Moreover, the first term on the RHS of (42) satisfies

\displaystyle\left|y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{g_{t,i}}{\|\mathbf{g}_{t}\|}\right|=\left|y_{t,i}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{\mathbf{u}_{i}^{\top}\nabla^{2}f(\mathbf{0})\mathbf{u}_{i}\,y_{t,i}}{\|\mathbf{g}_{t}\|}\right|\leq\left(1+\frac{\delta\rho\epsilon}{32\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,i}|,

where the inequality uses \lambda_{i}=\mathbf{u}_{i}^{\top}\nabla^{2}f(\mathbf{0})\mathbf{u}_{i}\geq-\frac{\sqrt{\rho\epsilon}}{2}. Consequently, we have

|y¯t+1,i|\displaystyle|\bar{y}_{t+1,i}| (1+δρϵ32𝐠tLn)|yt,i|+δ(ρϵ)1/4128𝒯nπL.\displaystyle\leq\left(1+\frac{\delta\rho\epsilon}{32\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,i}|+\frac{\delta(\rho\epsilon)^{1/4}}{128\mathscr{T}n}\sqrt{\frac{\pi}{L}}.

Meanwhile,

|y¯t+1,1|\displaystyle|\bar{y}_{t+1,1}| |yt,1δ16Lρϵngt,1𝐠t|δ16Lρϵn|g^t,1gt,1𝐠t|\displaystyle\geq\left|y_{t,1}-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\cdot\frac{g_{t,1}}{\|\mathbf{g}_{t}\|}\right|-\frac{\delta}{16L}\sqrt{\frac{\rho\epsilon}{n}}\left|\hat{g}_{t,1}-\frac{g_{t,1}}{\|\mathbf{g}_{t}\|}\right|
(1+δρϵ16𝐠tLn)|yt,1|δ(ρϵ)1/4128𝒯nπL\displaystyle\geq\left(1+\frac{\delta\rho\epsilon}{16\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,1}|-\frac{\delta(\rho\epsilon)^{1/4}}{128\mathscr{T}n}\sqrt{\frac{\pi}{L}}
(1+δρϵ24𝐠tLn)|yt,1|,\displaystyle\geq\left(1+\frac{\delta\rho\epsilon}{24\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,1}|,

where the last inequality is due to the fact that |yt,1|δ8πn|y_{t,1}|\geq\frac{\delta}{8}\sqrt{\frac{\pi}{n}} by Lemma 16. Hence, for any t[𝒯1]t\in[\mathscr{T}-1] we have

|yt+1,i||yt+1,1|\displaystyle\frac{|y_{t+1,i}|}{|y_{t+1,1}|} =|y¯t+1,i||y¯t+1,1|\displaystyle=\frac{|\bar{y}_{t+1,i}|}{|\bar{y}_{t+1,1}|}
(1+δρϵ32𝐠tLn)|yt,i|+δ(ρϵ)1/4128𝒯nπL(1+δρϵ24𝐠tLn)|yt,1|\displaystyle\leq\frac{\left(1+\frac{\delta\rho\epsilon}{32\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,i}|+\frac{\delta(\rho\epsilon)^{1/4}}{128\mathscr{T}n}\sqrt{\frac{\pi}{L}}}{\left(1+\frac{\delta\rho\epsilon}{24\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,1}|}
(1+δρϵ32𝐠tLn)|yt,i|(1+δρϵ24𝐠tLn)|yt,1|+(ρϵ)1/48𝒯nL\displaystyle\leq\frac{\left(1+\frac{\delta\rho\epsilon}{32\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,i}|}{\left(1+\frac{\delta\rho\epsilon}{24\|\mathbf{g}_{t}\|L\sqrt{n}}\right)|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}
(1δρϵ192𝐠tLn)|yt,i||yt,1|+(ρϵ)1/48𝒯nL.\displaystyle\leq\left(1-\frac{\delta\rho\epsilon}{192\|\mathbf{g}_{t}\|L\sqrt{n}}\right)\frac{|y_{t,i}|}{|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}.

Since ff is LL-smooth, we have

\displaystyle\|\mathbf{g}_{t}\|=\|\nabla^{2}f(\mathbf{0})\cdot\mathbf{y}_{t}\|\leq L\|\mathbf{y}_{t}\|\leq L,

which leads to

|yt+1,i||yt+1,1|\displaystyle\frac{|y_{t+1,i}|}{|y_{t+1,1}|} (1δρϵ192𝐠tLn)|yt,i||yt,1|+(ρϵ)1/48𝒯nL\displaystyle\leq\left(1-\frac{\delta\rho\epsilon}{192\|\mathbf{g}_{t}\|L\sqrt{n}}\right)\frac{|y_{t,i}|}{|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}
(1δρϵ192L2n)|yt,i||yt,1|+(ρϵ)1/48𝒯nL.\displaystyle\leq\left(1-\frac{\delta\rho\epsilon}{192L^{2}\sqrt{n}}\right)\frac{|y_{t,i}|}{|y_{t,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}.

Thus,

\displaystyle\frac{|y_{\mathscr{T},i}|}{|y_{\mathscr{T},1}|} \displaystyle\leq\left(1-\frac{\delta\rho\epsilon}{192L^{2}\sqrt{n}}\right)^{\mathscr{T}}\frac{|y_{0,i}|}{|y_{0,1}|}+\sum_{t=1}^{\mathscr{T}}\frac{(\rho\epsilon)^{1/4}}{8\mathscr{T}\sqrt{nL}}\left(1-\frac{\delta\rho\epsilon}{192L^{2}\sqrt{n}}\right)^{\mathscr{T}-t}
(1δρϵ192L2n)𝒯|y0,i||y0,1|+(ρϵ)1/48nL(ρϵ)1/44nL.\displaystyle\leq\left(1-\frac{\delta\rho\epsilon}{192L^{2}\sqrt{n}}\right)^{\mathscr{T}}\frac{|y_{0,i}|}{|y_{0,1}|}+\frac{(\rho\epsilon)^{1/4}}{8\sqrt{nL}}\leq\frac{(\rho\epsilon)^{1/4}}{4\sqrt{nL}}.

Equipped with Lemma 17, we are now ready to prove Lemma 15.

Proof of Lemma 15.

We consider the case where |y0,1|δ2πn|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}, which happens with probability

Pr{|y0,1|δ2πn}1δ2πnVol(𝒮n2)Vol(𝒮n1)1δ.\displaystyle\Pr\left\{|y_{0,1}|\geq\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\right\}\geq 1-\frac{\delta}{2}\sqrt{\frac{\pi}{n}}\cdot\frac{\text{Vol}(\mathcal{S}^{n-2})}{\text{Vol}(\mathcal{S}^{n-1})}\geq 1-\delta.
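The anti-concentration bound above (the first coordinate of a uniformly random unit vector is rarely much smaller than 1/\sqrt{n}) can also be checked numerically. The following Monte Carlo sketch is ours, with illustrative values of n and \delta:

import numpy as np

rng = np.random.default_rng(2)
n, delta, trials = 50, 0.1, 100_000
# Sample uniformly random unit vectors by normalizing standard Gaussian vectors.
y0 = rng.standard_normal((trials, n))
y0 /= np.linalg.norm(y0, axis=1, keepdims=True)
threshold = (delta / 2) * np.sqrt(np.pi / n)
print(np.mean(np.abs(y0[:, 0]) >= threshold))  # empirically at least 1 - delta, as claimed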

In this case, by Lemma 17 we have

|y𝒯,1|2=|y𝒯,1|2i=1n|y𝒯,i|2=(1+i=2n(|y𝒯,i||y𝒯,1|)2)1(1+ρϵ16L)11ρϵ8L,\displaystyle|y_{\mathscr{T},1}|^{2}=\frac{|y_{\mathscr{T},1}|^{2}}{\sum_{i=1}^{n}|y_{\mathscr{T},i}|^{2}}=\left(1+\sum_{i=2}^{n}\left(\frac{|y_{\mathscr{T},i}|}{|y_{\mathscr{T},1}|}\right)^{2}\right)^{-1}\geq\left(1+\frac{\sqrt{\rho\epsilon}}{16L}\right)^{-1}\geq 1-\frac{\sqrt{\rho\epsilon}}{8L},

and

𝐲𝒯,2=1|y𝒯,1|2ρϵ8L.\displaystyle\|\mathbf{y}_{\mathscr{T},\perp}\|^{2}=1-|y_{\mathscr{T},1}|^{2}\leq\frac{\sqrt{\rho\epsilon}}{8L}.

Let ss be the smallest integer such that λs0\lambda_{s}\geq 0. Then the output 𝐞^=𝐲𝒯\hat{\mathbf{e}}=\mathbf{y}_{\mathscr{T}} of Algorithm 9 satisfies

𝐞^2f(𝐱)𝐞^\displaystyle\hat{\mathbf{e}}^{\top}\nabla^{2}f(\mathbf{x})\hat{\mathbf{e}} =|y𝒯,1|2𝐮12f(𝐱)𝐮1+𝐲𝒯,2f(𝐱)𝐲𝒯,\displaystyle=|y_{\mathscr{T},1}|^{2}\mathbf{u}_{1}^{\top}\nabla^{2}f(\mathbf{x})\mathbf{u}_{1}+\mathbf{y}_{\mathscr{T},\perp}^{\top}\nabla^{2}f(\mathbf{x})\mathbf{y}_{\mathscr{T},\perp}
\displaystyle\leq-\sqrt{\rho\epsilon}\cdot|y_{\mathscr{T},1}|^{2}+L\sum_{i=s}^{n}|y_{\mathscr{T},i}|^{2}
ρϵ|y𝒯,1|2+L𝐲𝒯,2ρϵ4.\displaystyle\leq-\sqrt{\rho\epsilon}\cdot|y_{\mathscr{T},1}|^{2}+L\|\mathbf{y}_{\mathscr{T},\perp}\|^{2}\leq-\frac{\sqrt{\rho\epsilon}}{4}.

The query complexity of Algorithm 9 comes only from the Hessian-vector product estimation step, which equals

\displaystyle\mathscr{T}\cdot O\left(n\log\left(n\rho L^{2}/\gamma_{\mathbf{x}}\gamma_{\mathbf{y}}^{2}\epsilon\hat{\delta}^{2}\right)\right)=O\left(\frac{L^{2}n^{3/2}}{\delta\rho\epsilon}\log^{2}\frac{nL}{\delta\sqrt{\rho\epsilon}}\right). ∎

C.3 Proof of Lemma 4

Proof.

By Lemma 10 and Lemma 15, at least one of the two unit vectors 𝐯1,𝐯2\mathbf{v}_{1},\mathbf{v}_{2} is a negative curvature direction. Quantitatively, with probability at least 1δ1-\delta, at least one of the following two inequalities is true:

𝐯12f(𝐳)𝐯1ρϵ4,𝐯22f(𝐳)𝐯2ρϵ4.\displaystyle\mathbf{v}_{1}^{\top}\nabla^{2}f(\mathbf{z})\mathbf{v}_{1}\leq-\frac{\sqrt{\rho\epsilon}}{4},\qquad\mathbf{v}_{2}^{\top}\nabla^{2}f(\mathbf{z})\mathbf{v}_{2}\leq-\frac{\sqrt{\rho\epsilon}}{4}.

WLOG we assume the first inequality is true. Denote η=12ϵρ\eta=\frac{1}{2}\sqrt{\frac{\epsilon}{\rho}}. Given that ff is ρ\rho-Hessian Lipschitz, we have

f(𝐳1,+)\displaystyle f(\mathbf{z}_{1,+}) f(𝐳)+ηf(𝐳),𝐯1+0η(0a(ρϵ4+ρb)db)da\displaystyle\leq f(\mathbf{z})+\eta\langle\nabla f(\mathbf{z}),\mathbf{v}_{1}\rangle+\int_{0}^{\eta}\left(\int_{0}^{a}\left(-\frac{\sqrt{\rho\epsilon}}{4}+\rho b\right)\mathrm{d}b\right)\mathrm{d}a
=f(𝐳)+ηf(𝐳),𝐯1148ϵ3ρ,\displaystyle=f(\mathbf{z})+\eta\langle\nabla f(\mathbf{z}),\mathbf{v}_{1}\rangle-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}},

and

f(𝐳1,)\displaystyle f(\mathbf{z}_{1,-}) f(𝐳)ηf(𝐳),𝐯1+0η(0a(ρϵ4+ρb)db)da\displaystyle\leq f(\mathbf{z})-\eta\langle\nabla f(\mathbf{z}),\mathbf{v}_{1}\rangle+\int_{0}^{\eta}\left(\int_{0}^{a}\left(-\frac{\sqrt{\rho\epsilon}}{4}+\rho b\right)\mathrm{d}b\right)\mathrm{d}a
=f(𝐳)ηf(𝐳),𝐯1148ϵ3ρ.\displaystyle=f(\mathbf{z})-\eta\langle\nabla f(\mathbf{z}),\mathbf{v}_{1}\rangle-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}.

Hence,

f(𝐳1,+)+f(𝐳1,)2f(𝐳)148ϵ3ρ,\displaystyle\frac{f(\mathbf{z}_{1,+})+f(\mathbf{z}_{1,-})}{2}\leq f(\mathbf{z})-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}},

which leads to

f(𝐳out)min{f(𝐳1,+),f(𝐳1,)}f(𝐳)148ϵ3ρ.\displaystyle f(\mathbf{z}_{\mathrm{out}})\leq\min\{f(\mathbf{z}_{1,+}),f(\mathbf{z}_{1,-})\}\leq f(\mathbf{z})-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}.

By Lemma 10 and Lemma 15, the query complexity of Algorithm 6 equals

O(L2n3/2δρϵlog2nLδρϵ).O\left(\frac{L^{2}n^{3/2}}{\delta\rho\epsilon}\log^{2}\frac{nL}{\delta\sqrt{\rho\epsilon}}\right).
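The descent step analyzed in this proof needs only a single comparison: probe \mathbf{z}\pm\eta\mathbf{v}_{1} and keep whichever point has the smaller function value. Below is a minimal Python sketch of this step, assuming an exact negative curvature direction and exact comparisons; the quadratic saddle and the parameter values are illustrative and not taken from the paper.

import numpy as np

def ncd_step(f, z, v, eta):
    # One comparison decides which of the two probes along the negative
    # curvature direction has the smaller function value.
    z_plus, z_minus = z + eta * v, z - eta * v
    return z_plus if f(z_plus) <= f(z_minus) else z_minus

# Saddle point of f(x) = 0.5*(x1^2 - x2^2) at the origin; curvature -1 along e2.
f = lambda x: 0.5 * (x[0] ** 2 - x[1] ** 2)
z = np.zeros(2)
v = np.array([0.0, 1.0])   # negative curvature direction
eta = 0.5                  # stands in for (1/2) * sqrt(eps / rho)
z_new = ncd_step(f, z, v, eta)
print(f(z_new) - f(z))     # negative: the comparison step decreases f and escapes the saddle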

C.4 Escape from saddle point via negative curvature finding

Lemma 18.

In the setting of Problem 4, if the iterations 𝐱s,0,,𝐱s,𝒯\mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}} of Algorithm 5 satisfy

f(𝐱s,𝒯)f(𝐱s,0)148ϵ3ρ,\displaystyle f(\mathbf{x}_{s,\mathscr{T}})-f(\mathbf{x}_{s,0})\geq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}},

then the number of ϵ\epsilon-FOSP among 𝐱s,0,,𝐱s,𝒯\mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}} is at least 𝒯3L32ρϵ\mathscr{T}-\frac{3L}{32\sqrt{\rho\epsilon}}.

Proof.

For any iteration t[𝒯]t\in[\mathscr{T}] with f(𝐱s,t)>ϵ\|\nabla f(\mathbf{x}_{s,t})\|>\epsilon, by Theorem 1 we have

𝐠^tf(𝐱s,t)f(𝐱s,t)δ=16,\displaystyle\left\|\hat{\mathbf{g}}_{t}-\frac{\nabla f(\mathbf{x}_{s,t})}{\|\nabla f(\mathbf{x}_{s,t})\|}\right\|\leq\delta=\frac{1}{6},

indicating

f(𝐱s,t+1)f(𝐱s,t)\displaystyle f(\mathbf{x}_{s,t+1})-f(\mathbf{x}_{s,t}) f(𝐲s,t)f(𝐱s,t)\displaystyle\leq f(\mathbf{y}_{s,t})-f(\mathbf{x}_{s,t})
\displaystyle\leq\langle\nabla f(\mathbf{x}_{s,t}),\mathbf{y}_{s,t}-\mathbf{x}_{s,t}\rangle+\frac{L}{2}\|\mathbf{y}_{s,t}-\mathbf{x}_{s,t}\|^{2}
ϵ3Lf(𝐱s,t),𝐠^t+L2(ϵ3L)2\displaystyle\leq-\frac{\epsilon}{3L}\langle\nabla f(\mathbf{x}_{s,t}),\hat{\mathbf{g}}_{t}\rangle+\frac{L}{2}\left(\frac{\epsilon}{3L}\right)^{2}
ϵ3Lf(𝐱s,t)(1δ)+ϵ218L2ϵ29L.\displaystyle\leq-\frac{\epsilon}{3L}\|\nabla f(\mathbf{x}_{s,t})\|(1-\delta)+\frac{\epsilon^{2}}{18L}\leq-\frac{2\epsilon^{2}}{9L}.

That is to say, for any t\in[\mathscr{T}] such that \mathbf{x}_{s,t} is not an \epsilon-FOSP, the function value decreases by at least \frac{2\epsilon^{2}}{9L} in this iteration. Moreover, given that

f(𝐱s,t+1)=min{f(𝐱s,t),f(𝐲s,t)}f(𝐱s,t)\displaystyle f(\mathbf{x}_{s,t+1})=\min\{f(\mathbf{x}_{s,t}),f(\mathbf{y}_{s,t})\}\leq f(\mathbf{x}_{s,t})

and

f(𝐱s,0)f(𝐱s,𝒯)148ϵ3ρ,\displaystyle f(\mathbf{x}_{s,0})-f(\mathbf{x}_{s,\mathscr{T}})\leq\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}},

we can conclude that the number of ϵ\epsilon-FOSP among 𝐱s,1,,𝐱s,𝒯\mathbf{x}_{s,1},\ldots,\mathbf{x}_{s,\mathscr{T}} is at least

𝒯148ϵ3ρ9L2ϵ2=𝒯3L32ρϵ.\displaystyle\mathscr{T}-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}\cdot\frac{9L}{2\epsilon^{2}}=\mathscr{T}-\frac{3L}{32\sqrt{\rho\epsilon}}.
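For concreteness, each iteration analyzed above takes a step of length \epsilon/(3L) along a \delta-accurate estimate of the normalized gradient and keeps the better of the two points. The following Python sketch is ours: the estimate is simulated by perturbing the exact normalized gradient, and the test function and parameter values are illustrative.

import numpy as np

L, eps, delta = 1.0, 0.1, 1 / 6
f = lambda x: 0.5 * np.dot(x, x)   # L-smooth with L = 1
grad = lambda x: x

def descent_step(x, rng):
    # Step along a delta-accurate unit-gradient estimate, then keep the better point.
    u = grad(x) / np.linalg.norm(grad(x))
    w = rng.standard_normal(x.size)
    g_hat = u + delta * w / np.linalg.norm(w)   # ||g_hat - u|| <= delta, as in Theorem 1
    y = x - (eps / (3 * L)) * g_hat
    return y if f(y) <= f(x) else x

rng = np.random.default_rng(3)
x = np.ones(10)                                  # here ||grad f(x)|| > eps
x_next = descent_step(x, rng)
print(f(x) - f(x_next), 2 * eps ** 2 / (9 * L))  # observed decrease vs. the 2*eps^2/(9L) bound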

Lemma 19.

In the setting of Problem 4, if there are less than 8𝒯9\frac{8\mathscr{T}}{9} ϵ\epsilon-SOSP among the iterations 𝐱s,0,,𝐱s,𝒯\mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}} of Algorithm 5, with probability at least 1(1p(1δ))𝒯/181-\left(1-p(1-\delta)\right)^{\mathscr{T}/18} we have

f(𝐱s+1,0)f(𝐱s,0)148ϵ3ρ.\displaystyle f(\mathbf{x}_{s+1,0})-f(\mathbf{x}_{s,0})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}.
Proof.

If f(𝐱s,𝒯)f(𝐱s,0)148ϵ3ρf(\mathbf{x}_{s,\mathscr{T}})-f(\mathbf{x}_{s,0})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}, we directly have

f(𝐱s+1,0)f(𝐱s,0)\displaystyle f(\mathbf{x}_{s+1,0})-f(\mathbf{x}_{s,0}) =min{f(𝐱s,0),,f(𝐱s,𝒯),f(𝐱s,0),,f(𝐱s,𝒯)}f(𝐱s,0)\displaystyle=\min\{f(\mathbf{x}_{s,0}),\ldots,f(\mathbf{x}_{s,\mathscr{T}}),f(\mathbf{x}_{s,0}^{\prime}),\ldots,f(\mathbf{x}_{s,\mathscr{T}}^{\prime})\}-f(\mathbf{x}_{s,0})
f(𝐱s,𝒯)f(𝐱s,0)148ϵ3ρ.\displaystyle\leq f(\mathbf{x}_{s,\mathscr{T}})-f(\mathbf{x}_{s,0})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}.

Hence, we only need to prove the case with f(\mathbf{x}_{s,\mathscr{T}})-f(\mathbf{x}_{s,0})>-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}, where by Lemma 18 the number of \epsilon-FOSP among \mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}} is at least \mathscr{T}-\frac{3L}{32\sqrt{\rho\epsilon}}. Since there are less than \frac{8\mathscr{T}}{9} \epsilon-SOSP among the iterations \mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}}, there exist

𝒯3L32ρϵ8𝒯9𝒯18\displaystyle\mathscr{T}-\frac{3L}{32\sqrt{\rho\epsilon}}-\frac{8\mathscr{T}}{9}\geq\frac{\mathscr{T}}{18}

different values of t[𝒯]t\in[\mathscr{T}] such that

\displaystyle\|\nabla f(\mathbf{x}_{s,t})\|\leq\epsilon,\quad\lambda_{\min}(\nabla^{2}f(\mathbf{x}_{s,t}))\leq-\sqrt{\rho\epsilon}.

For each such tt, with probability pp the subroutine Comparison-NCD (Algorithm 6) is executed in this iteration. Conditioned on that, with probability at least 1δ1-\delta its output 𝐱s,t\mathbf{x}_{s,t}^{\prime} satisfies

f(𝐱s,t)f(𝐱s,t)148ϵ3ρ\displaystyle f(\mathbf{x}_{s,t}^{\prime})-f(\mathbf{x}_{s,t})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}

by Lemma 4. Hence, with probability at least

1(1p(1δ))𝒯/18,\displaystyle 1-\left(1-p(1-\delta)\right)^{\mathscr{T}/18},

there exists a t[𝒯]t^{\prime}\in[\mathscr{T}] with

f(𝐱s,t)f(𝐱s,t)148ϵ3ρ,\displaystyle f(\mathbf{x}_{s,t^{\prime}}^{\prime})-f(\mathbf{x}_{s,t^{\prime}})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}},

which leads to

f(𝐱s+1,0)f(𝐱s,0)\displaystyle f(\mathbf{x}_{s+1,0})-f(\mathbf{x}_{s,0}) =min{f(𝐱s,0),,f(𝐱s,𝒯),f(𝐱s,0),,f(𝐱s,𝒯)}f(𝐱s,0)\displaystyle=\min\{f(\mathbf{x}_{s,0}),\ldots,f(\mathbf{x}_{s,\mathscr{T}}),f(\mathbf{x}_{s,0}^{\prime}),\ldots,f(\mathbf{x}_{s,\mathscr{T}}^{\prime})\}-f(\mathbf{x}_{s,0})
f(𝐱s,t)f(𝐱s,t)148ϵ3ρ,\displaystyle\leq f(\mathbf{x}_{s,t^{\prime}}^{\prime})-f(\mathbf{x}_{s,t^{\prime}})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}},

where the second inequality is due to the fact that f(\mathbf{x}_{s,t^{\prime}})\leq f(\mathbf{x}_{s,0}) for any possible value of t^{\prime} in [\mathscr{T}]. ∎

Proof of Theorem 5.

We assume that, for every s=1,\ldots,\mathcal{S} such that \mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}} contains fewer than \frac{8\mathscr{T}}{9} \epsilon-SOSP, we have

f(𝐱s+1,0)f(𝐱s,0)148ϵ3ρ.\displaystyle f(\mathbf{x}_{s+1,0})-f(\mathbf{x}_{s,0})\leq-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}.

Given that there are at most 𝒮\mathcal{S} different values of ss, by Lemma 19, the probability of this assumption being true is at least

(1(1p(1δ))𝒯/18)𝒮89.\displaystyle\big{(}1-\left(1-p(1-\delta)\right)^{\mathscr{T}/18}\big{)}^{\mathcal{S}}\geq\frac{8}{9}. (43)
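As a numerical illustration of (43) (with parameter values chosen by us rather than taken from the paper), the per-epoch failure probability (1-p(1-\delta))^{\mathscr{T}/18} is exponentially small in p(1-\delta)\mathscr{T}, so the product over \mathcal{S} epochs stays above 8/9 whenever p(1-\delta)\mathscr{T}/18 exceeds roughly \ln(9\mathcal{S}):

p, delta = 0.02, 1 / 6          # illustrative values
T, S = 10_000, 50               # illustrative numbers of inner/outer iterations
per_epoch = 1 - (1 - p * (1 - delta)) ** (T / 18)   # success probability from Lemma 19
print(per_epoch ** S, per_epoch ** S >= 8 / 9)      # ~0.996, True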

Moreover, given that

\displaystyle\sum_{s=1}^{\mathcal{S}}f(\mathbf{x}_{s+1,0})-f(\mathbf{x}_{s,0})=f(\mathbf{x}_{\mathcal{S}+1,0})-f(\mathbf{0})\geq f^{*}-f(\mathbf{0})\geq-\Delta

and that f(\mathbf{x}_{s+1,0})\leq f(\mathbf{x}_{s,0}) for every s, our choice of \mathcal{S} guarantees that there are at least \frac{27}{32}\mathcal{S} different values of s=1,\ldots,\mathcal{S} with

\displaystyle f(\mathbf{x}_{s+1,0})-f(\mathbf{x}_{s,0})>-\frac{1}{48}\sqrt{\frac{\epsilon^{3}}{\rho}}.

By our assumption, for each such s the iterations \mathbf{x}_{s,0},\ldots,\mathbf{x}_{s,\mathscr{T}} contain at least \frac{8\mathscr{T}}{9} \epsilon-SOSP. Hence, in this case the proportion of \epsilon-SOSP among all the iterations is at least

2732𝒮89𝒯𝒮𝒯=34.\displaystyle\frac{\frac{27}{32}\mathcal{S}\cdot\frac{8}{9}\mathscr{T}}{\mathcal{S}\mathscr{T}}=\frac{3}{4}.

Combined with (43), the overall success probability of outputting an ϵ\epsilon-SOSP is at least 34×89=23\frac{3}{4}\times\frac{8}{9}=\frac{2}{3}.

The query complexity of Algorithm 5 comes from both the gradient estimation step and the negative curvature descent step. By Theorem 1, the query complexity of the first part equals

𝒮𝒯O(nlog(n/δ))=O(ΔL2n3/2ρ1/2ϵ5/2logn),\displaystyle\mathcal{S}\mathscr{T}\cdot O(n\log(n/\delta))=O\left(\frac{\Delta L^{2}n^{3/2}}{\rho^{1/2}\epsilon^{5/2}}\log n\right),

whereas the expected query complexity of the second part equals

𝒮𝒯pO(L2n3/2δρϵlog2nLδρϵ)=O(ΔL2n3/2ρ1/2ϵ5/2log3nLρϵ).\displaystyle\mathcal{S}\mathscr{T}p\cdot O\left(\frac{L^{2}n^{3/2}}{\delta\rho\epsilon}\log^{2}\frac{nL}{\delta\sqrt{\rho\epsilon}}\right)=O\left(\frac{\Delta L^{2}n^{3/2}}{\rho^{1/2}\epsilon^{5/2}}\log^{3}\frac{nL}{\sqrt{\rho\epsilon}}\right).

Hence, the overall query complexity of Algorithm 5 equals

O(ΔL2n3/2ρ1/2ϵ5/2log3nLρϵ).\displaystyle O\left(\frac{\Delta L^{2}n^{3/2}}{\rho^{1/2}\epsilon^{5/2}}\log^{3}\frac{nL}{\sqrt{\rho\epsilon}}\right).