
Generalized Metric Subregularity with Applications to
High-Order Regularized Newton Methods

Guoyin Li¹, Boris Mordukhovich², Jiangxing Zhu³

¹University of New South Wales, Sydney 2052, Australia ([email protected]). Research of this author was partly supported by the Australian Research Council under Discovery Project DP190100555.
²Wayne State University, Detroit, MI 48202, USA ([email protected]). Research of this author was partly supported by the US National Science Foundation under grant DMS-2204519, by the Australian Research Council under Discovery Project DP190100555, and by Project 111 of China under grant D21024.
³Yunnan University, Kunming 650091, China ([email protected]). Research of this author was partly supported by the Basic Research Program of Yunnan Province (Grant No. 202201AT070066), the Project for Young-notch Talents in the Ten Thousand Talent Program of Yunnan Province (Grant No. YNWR-QNBJ-2020-080), and the National Natural Science Foundation of People's Republic of China (Grant Nos. 12171419 and 12261109).
Abstract

This paper pursues a twofold goal. First, we introduce and study in detail a new notion of variational analysis called generalized metric subregularity, which is a far-reaching extension of the conventional metric subregularity conditions. Our primary focus is on examining this concept with respect to first-order and second-order stationary points. We develop an extended convergence framework that enables us to derive superlinear and quadratic convergence under the generalized metric subregularity condition, broadening the widely used KL convergence analysis framework. We present verifiable sufficient conditions that ensure the proposed generalized metric subregularity condition and provide examples demonstrating that the derived convergence rates are sharp. Second, we design a new high-order regularized Newton method with momentum steps, and apply the generalized metric subregularity to establish its superlinear convergence. Quadratic convergence is obtained under additional assumptions. Specifically, when applying the proposed method to solve the (nonconvex) over-parameterized compressed sensing model, we achieve global convergence with a quadratic local convergence rate towards a global minimizer under a strict complementarity condition.
Keywords: Variational analysis and optimization, generalized metric subregularity, error bounds, high-order regularized Newton methods, Kurdyka-Łojasiewicz property, superlinear and quadratic convergence
Mathematics Subject Classification (2020) 49J53, 49M15, 90C26

1 Introduction

It has been well recognized, especially in recent years, that concepts and tools of modern variational analysis play a crucial role in the design and justification of numerical algorithms in optimization and related areas. We mention, in particular, generalized Newton methods for which, in addition to the fundamental developments summarized in [12, 17, 19], new approaches and results have been recently proposed in [1, 14, 18, 22, 27] among other publications. A common feature of these developments is the usage of metric regularity/subregularity assumptions in a certain form. Another common feature of the aforementioned publications is justifying superlinear convergence rates under some nondegeneracy condition. For example, [32] develops the cubic regularization (CR) method for minimizing a twice continuously differentiable function, which is now widely recognized as a globally convergent variant of Newton's method with superior iteration complexity [10, 32]. The local quadratic convergence of the CR method was derived under a nondegeneracy condition in [32]. In [40], the authors proposed an error bound condition related to the second-order stationary set and examined the quadratic local convergence of the cubic regularized Newton method under this condition, extending the results in [32]. Interestingly, it was shown in [40] that this error bound condition applies to nonconvex and degenerate cases and is satisfied with overwhelming probability for some concrete nonconvex optimization problems that arise in phase retrieval and low-rank matrix recovery.

On the other hand, it is known that the metric subregularity of the subdifferential mapping has a close relationship with the Kurdyka-Łojasiewicz (KL) property. The KL property is satisfied for various important classes of functions arising in diverse applications, and it has emerged as one of the most widely used tools for analyzing the convergence of important numerical algorithms [4, 5]. In particular, if the potential function of the algorithm satisfies the KL property and the associated desingularization function $\vartheta$ involved in the KL property takes the form $\vartheta(t)=c\,t^{1-\theta}$ with $c>0$ and $\theta\in[0,1)$, then the corresponding exponent $\theta$ determines the linear or sublinear convergence rate of the underlying algorithms [4, 5, 21, 39]. Although the KL property and its associated analysis framework are well-adopted in the literature, this framework, in general, cannot detect superlinear or quadratic convergence rates, which are typical for many second-order numerical methods including Newton-type methods.

In this paper, we introduce and thoroughly investigate novel generalized metric subregularity conditions, significantly extending conventional metric subregularity and error bound conditions from [40]. We develop an abstract extended convergence framework that enables us to derive superlinear and quadratic convergence towards specific target sets, such as first-order and second-order stationary points, under the introduced generalized metric subregularity condition. This new framework extends the widely used KL convergence analysis framework. We provide examples demonstrating that the derived quadratic convergence rates are sharp, meaning they can be attained. We also present verifiable sufficient conditions that ensure the proposed generalized metric subregularity condition and showcase that the sufficient conditions are satisfied for functions that arise in practical models including the over-parameterized compressed sensing, best rank-one matrix approximation, and generalized phase retrieval problems. Finally, we use the generalized metric subregularity condition to establish superlinear (and in some cases quadratic) convergence of the newly proposed high-order regularized Newton methods with momentum. Specifically, when applying the proposed method to solve the (nonconvex) over-parameterized compressed sensing model, we achieve global convergence with a quadratic local convergence rate towards a global minimizer under a strict complementarity condition.

Organization. The main achievements of the current research are summarized in the concluding Section 8. The rest of the paper is structured as follows.

Section 2 presents basic definitions and preliminaries from variational analysis and generalized differentiation broadly used in what follows. In Section 3, we define two versions of generalized metric subregularity for subgradient mappings with respect to first-order and second-order stationary points and then discuss their relationships with notions known in the literature and with growth conditions.

Section 4 develops a new framework for abstract convergence analysis to establish superlinear and quadratic convergence rates of numerical algorithms under generalized metric subregularity. Being of independent interest, the obtained results are instrumental in justifying such fast convergence rates for our main high-order regularized Newtonian algorithm in the subsequent parts of the paper. In Section 5, we provide some abstract convergence analysis under the Kurdyka-Łojasiewicz property.

Section 6 revisits generalized metric subregularity while now concentrating on this property with respect to second-order stationary points. Our special attention is paid to the strict saddle point and weak separation properties, which provide verifiable sufficient conditions for the fulfillment of generalized metric subregularity. Here we present several examples of practical models where these properties are satisfied.

In Section 7, we design the family of high-order regularized Newton methods with momentum steps for problems of ${\cal C}^2$ optimization, where the order of the method is determined by the Hölder exponent $q\in(0,1]$ of the Hessian matrix associated with the objective function. Using the machinery of generalized metric subregularity with respect to second-order stationary points leads us to establish superlinear convergence of iterates with explicit convergence rates, and also quadratic convergence in certain particular settings.

Section 8 provides an overview of the major results obtained in the paper and discusses some directions of future research. Section 9 is an appendix, which contains the proofs of some technical lemmas.

2 Preliminaries from Variational Analysis

Let $\mathbb{R}^m$ be the $m$-dimensional Euclidean space with the inner product denoted by $\langle x,y\rangle=x^Ty$ for all $x,y\in\mathbb{R}^m$ and the norm $\|\cdot\|$. The symbols $B_{\mathbb{R}^m}(x,\gamma)$ and $B_{\mathbb{R}^m}[x,\gamma]$ stand for the open and closed balls centered at $x$ with radius $\gamma$, respectively. Given a symmetric matrix $M\in\mathbb{R}^{m\times m}$, the notation $M\succeq 0$ signifies that $M$ is positive semidefinite. We use $\mathbb{N}$ to denote the set of nonnegative integers $\{0,1,\ldots\}$.

An extended-real-valued function $f\colon\mathbb{R}^m\to\overline{\mathbb{R}}:=(-\infty,\infty]$ is proper if its effective domain ${\rm dom}\,f:=\{x\in\mathbb{R}^m\;|\;f(x)<\infty\}$ is nonempty. The (limiting, Mordukhovich) subdifferential of such a function $f$ at $\bar{x}\in{\rm dom}\,f$ is

\partial f(\bar{x}):=\Big\{v\in\mathbb{R}^m\;\Big|\;\exists\,x_k\to\bar{x}\mbox{ with }f(x_k)\to f(\bar{x})\mbox{ as }k\to\infty\mbox{ and }\liminf_{x\to x_k}\frac{f(x)-f(x_k)-\langle v,x-x_k\rangle}{\|x-x_k\|}\geq 0\mbox{ for all }k\in\mathbb{N}\Big\}.   (2.1)

The subdifferential (2.1) reduces to the subdifferential of convex analysis when $f$ is convex and to the gradient $\nabla f(\bar{x})$ when $f$ is continuously differentiable (${\cal C}^1$-smooth) around $\bar{x}$. For the general class of lower semicontinuous (l.s.c.) functions, $\partial f$ enjoys comprehensive calculus rules and is used in numerous applications of variational analysis and optimization; see, e.g., the books [26, 27, 37] and the references therein.

Having (2.1), consider the set of first-order stationary points of an l.s.c. function $f\colon\mathbb{R}^m\to\overline{\mathbb{R}}$ defined by

\Gamma:=\big\{x\in\mathbb{R}^m\;\big|\;0\in\partial f(x)\big\}.   (2.2)

Suppose that $f$ is a twice continuously differentiable (${\cal C}^2$-smooth) function, and denote the collection of second-order stationary points of $f$ by

\Theta:=\big\{x\in\mathbb{R}^m\;\big|\;\nabla f(x)=0,\;\nabla^2 f(x)\succeq 0\big\}.   (2.3)

As in [42], we say that $\varphi:\mathbb{R}_+\to\mathbb{R}_+$, where $\mathbb{R}_+:=\{t\;|\;t\geq 0\}$, is an admissible function if $\varphi(0)=0$ and $\varphi(t)\to 0\Rightarrow t\to 0$. It is well known that for convex admissible functions, the directional derivative

\varphi_+^{\prime}(t):=\lim_{s\to 0^+}\frac{\varphi(t+s)-\varphi(t)}{s}

always exists and is nondecreasing on $\mathbb{R}_+$. Furthermore (see, e.g., [43, Theorem 2.1.5]), $\varphi_+^{\prime}$ is increasing on $\mathbb{R}_+$ if and only if $\varphi$ is strictly convex, i.e.,

\varphi(\lambda t_1+(1-\lambda)t_2)<\lambda\varphi(t_1)+(1-\lambda)\varphi(t_2)

for any $\lambda\in(0,1)$ and $t_1,t_2\in\mathbb{R}_+$ with $t_1\neq t_2$. It is also known that a convex function $\varphi$ is differentiable on $\mathbb{R}_+$ if and only if $\varphi_+^{\prime}$ is continuous on $\mathbb{R}_+$. The following lemma from [43, Theorem 2.1.5] is used below.

Lemma 2.1 (properties of convex admissible functions).

Let $\varphi:\mathbb{R}_+\to\mathbb{R}_+$ be a convex admissible function. Then we have the relationships:
(i) $0<\frac{\varphi(t_1)}{t_1}\leq\varphi_+^{\prime}(t_1)\leq\varphi_+^{\prime}(t_2)$ for all $t_1,t_2\in(0,\infty)$ with $t_1\leq t_2$.
(ii) $\varphi(t)\geq\varphi_+^{\prime}\left(\frac{t}{2}\right)\frac{t}{2}$ for all $t\in(0,\infty)$.
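For instance, taking $\varphi(t)=t^2$ gives $\varphi_+^{\prime}(t)=2t$, so (i) reads $t_1\leq 2t_1\leq 2t_2$, while (ii) reduces to $t^2\geq 2\cdot\frac{t}{2}\cdot\frac{t}{2}=\frac{t^2}{2}$ for all $t\in(0,\infty)$.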

Finally in this section, we recall the celebrated property often used in convergence analysis.

Definition 2.2 (Kurdyka-Łojasiewicz property).

Let $f:\mathbb{R}^m\to\overline{\mathbb{R}}$ be a proper l.s.c. function, and let $\vartheta:[0,\eta)\to\mathbb{R}_+$ be a continuous concave function. We say that $f$ has the Kurdyka-Łojasiewicz (KL) property at $\overline{x}$ with respect to $\vartheta$ if the following conditions hold:
(i) $\vartheta(0)=0$ and $\vartheta$ is continuously differentiable on $(0,\eta)$.
(ii) $\vartheta^{\prime}(s)>0$ for all $s\in(0,\eta)$.
(iii) There exists $\varepsilon>0$ such that

\vartheta^{\prime}(f(x)-f(\overline{x}))\,d(0,\partial f(x))\geq 1

for all $x\in B_{\mathbb{R}^m}(\overline{x},\varepsilon)\cap[f(\overline{x})<f<f(\overline{x})+\eta]$, where $d(\cdot,S)$ stands for the distance function associated with the set $S$. If in addition $\vartheta$ takes the form $\vartheta(t)=c\,t^{1-\theta}$ for some $c>0$ and $\theta\in[0,1)$, then we say that $f$ satisfies the KL property at $\bar{x}$ with the KL exponent $\theta$.

A widely used class of functions satisfying the KL property consists of subanalytic functions. It is known that the maximum and minimum of finitely many analytic functions and also semialgebraic functions (such as $\|x\|_p=[\sum_{i=1}^m|x_i|^p]^{\frac{1}{p}}$ with $p$ being a nonnegative rational number) are subanalytic. We refer the reader to [4, 5, 6, 21] for more details on the KL property and its exponent version, as well as on subanalytic functions.
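As a simple illustration of the exponent version, consider $f(x)=|x|^p$ on $\mathbb{R}$ with $p>1$ and $\overline{x}=0$. Here $d(0,\partial f(x))=p|x|^{p-1}$ and $f(x)-f(\overline{x})=|x|^p$, so for $\vartheta(t)=c\,t^{1-\theta}$ the inequality in (iii) becomes $cp(1-\theta)|x|^{p-1-p\theta}\geq 1$, which holds near the origin (with a suitable $c>0$) exactly when $p-1-p\theta\leq 0$. Hence $f$ satisfies the KL property at $\overline{x}=0$ with the KL exponent $\theta=1-\frac{1}{p}$.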

3 Generalized Metric Subregularity

The main variational concept for the subdifferential (2.1) of an extended-real-valued function used in this paper is introduced below, where we consider its two versions.

Definition 3.1 (generalized metric subregularity).

Let $f:\mathbb{R}^m\to\overline{\mathbb{R}}$ be a proper l.s.c. function, let $\Gamma$ be taken from (2.2), and let $\psi:\mathbb{R}_+\to\mathbb{R}_+$ be an admissible function. Given a target set $\Xi\subset\Gamma$ and a point $\overline{x}\in\Xi$, we say that:
(i) The subdifferential $\partial f$ satisfies the (pointwise) generalized metric subregularity property with respect to $(\psi,\Xi)$ at $\overline{x}$ if there exist $\kappa,\delta\in(0,\infty)$ such that

\psi\big(d(x,\Xi)\big)\leq\kappa\,d\big(0,\partial f(x)\big)\;\mbox{ for all }x\in B_{\mathbb{R}^m}(\overline{x},\delta).   (3.1)

(ii) The subdifferential $\partial f$ satisfies the uniform generalized metric subregularity property with respect to $(\psi,\Xi)$ if there exist $\kappa,\rho\in(0,\infty)$ such that

\psi\big(d(x,\Xi)\big)\leq\kappa\,d\big(0,\partial f(x)\big)\;\mbox{ for all }x\in\mathcal{N}(\Xi,\rho):=\big\{x\in\mathbb{R}^m\;\big|\;d(x,\Xi)\leq\rho\big\}.   (3.2)

(iii) If $\psi(t)=t$ in (i) and (ii), then we simply say that $\partial f$ satisfies the metric subregularity (resp. uniform metric subregularity) property with respect to $\Xi$.

Remark 3.2 (connections with related notions).

Let us briefly discuss connections of the generalized metric subregularity notions from Definition 3.1 with some related notions.

$\bullet$ Consider the case where the target set $\Xi:=\Gamma$ is the set of first-order stationary points (2.2). If $\psi(t)=t$, then (3.1) reduces to the conventional notion of metric subregularity applied to subgradient mappings. If $\psi(t)=t^p$ for $p>0$, which we refer to as exponent metric subregularity, then this notion has also received a lot of attention due to its important applications in numerical optimization. It is called the Hölder metric subregularity if $p>1$, and the high-order metric subregularity if $p\in(0,1)$. Other cases of convex and smooth admissible functions are considered in [42]. We refer the reader to [3, 13, 20, 28, 30, 41, 42] and the bibliographies therein for more details.

$\bullet$ In the case where $f$ is a ${\cal C}^2$-smooth function, $\psi(t)=t$, and the target set $\Xi=\Theta$ consists of second-order stationary points (2.3), Definition 3.1(ii) takes the form: there exist $\kappa,\rho\in(0,\infty)$ such that

d(x,\Theta)\leq\kappa\|\nabla f(x)\|\;\mbox{ for all }x\in\mathcal{N}(\Theta,\rho):=\big\{x\in\mathbb{R}^m\;\big|\;d(x,\Theta)\leq\rho\big\},   (3.3)

i.e., it falls into the framework of the uniform generalized metric subregularity (3.2) for smooth functions. In this case, it was referred to as the error bound (EB) condition [40, Assumption 2]. This condition plays a key role in [40] to justify the quadratic convergence of the cubic regularized Newton method.

Now we present three examples illustrating several remarkable features of the introduced notions. The first example shows that the generalized metric subregularity from (3.1) can go beyond the exponent cases. Note that the usage of metric subregularity in the nonexponent setting has been recently identified in [23] for the case of exponential cone programming.

Example 3.3 (nonexponent generalized metric subregularity).

Let $\varphi:\mathbb{R}_+\to\mathbb{R}_+$ be given by

\varphi(t):=\left\{\begin{array}{ll}\int_0^t e^{-\frac{1}{s}}\,ds,& t>0,\\ 0,& t=0.\end{array}\right.

Define $f:\mathbb{R}\to\mathbb{R}$ by $f(x):=\varphi(|x|)$ for all $x\in\mathbb{R}$. It is easy to see that $\varphi$ is a differentiable convex function on $\mathbb{R}_+$, and that we have

\partial f(x)=\left\{\begin{array}{ll}\varphi^{\prime}(|x|)\frac{x}{|x|},& x\neq 0,\\ 0,& x=0\end{array}\right.\quad\mbox{and}\quad\mathop{\arg\min}_{x\in\mathbb{R}}f(x)=\Gamma=\{0\}.

This tells us that $\varphi^{\prime}(d(x,\Gamma))=\varphi^{\prime}(|x|)=d(0,\partial f(x))$ for all $x\in\mathbb{R}$, and so $\partial f$ has the generalized metric subregularity property with respect to $(\varphi^{\prime},\Gamma)$. On the other hand, for any $q\in(0,\infty)$ we get

\lim_{x\to 0}\frac{d(x,\Gamma)^q}{d(0,\partial f(x))}=\lim_{x\to 0}\frac{|x|^q}{e^{-\frac{1}{|x|}}}=\lim_{t\to 0^+}e^{\frac{1}{t}}t^q=\infty,

which shows that $\partial f$ does not enjoy any exponent-type metric subregularity.
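The blow-up above is easy to observe numerically. Here is a minimal Python sketch (the variable names and sample points are our own); since $e^{1/|x|}$ overflows quickly in floating point, it compares logarithms instead:

    import math

    # log10 of the ratio d(x, Gamma)^q / d(0, df(x)) = |x|^q * e^{1/|x|}
    for q in (1.0, 0.5, 0.1):
        for x in (1e-1, 1e-2, 1e-3):
            log10_ratio = q * math.log10(x) + (1.0 / x) * math.log10(math.e)
            print(f"q={q}, x={x:g}, log10(ratio)={log10_ratio:.1f}")

For every exponent $q>0$, the printed values grow without bound as $x\to 0$, in agreement with the limit displayed above.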

The next example illustrates the fulfillment of the generalized metric subregularity (3.2) with respect to $(\psi,\Theta)$ for some admissible function $\psi$, while the error bound condition (3.3) from [40] fails.

Example 3.4 (generalized metric subregularity vs. error bound condition).

Given $p>2$, define a ${\cal C}^2$-smooth function $f:\mathbb{R}\to\mathbb{R}$ by

f(x):=\left\{\begin{array}{ll}x^p,& x\geq 0,\\ 0,& x<0\end{array}\right.

for which we get by direct calculations that

\nabla f(x)=\left\{\begin{array}{ll}px^{p-1},& x\geq 0,\\ 0,& x<0,\end{array}\right.\quad\nabla^2 f(x)=\left\{\begin{array}{ll}p(p-1)x^{p-2},& x\geq 0,\\ 0,& x<0,\end{array}\right.\quad\mbox{and}\quad\Theta=\Gamma=(-\infty,0].

It is easy to see that $d(x,\Theta)=x$ and $d(0,\nabla f(x))=px^{p-1}$ for $x>0$, while for $x\leq 0$ we have $d(x,\Theta)=d(0,\nabla f(x))=0$. This implies therefore that

\limsup_{x\to 0^+}\frac{d(x,\Theta)}{d(0,\nabla f(x))}=\infty\quad\mbox{and}\quad d(x,\Theta)^{p-1}\leq\frac{1}{p}\,d(0,\nabla f(x))\quad\mbox{whenever }x\in\mathbb{R}.

Consequently, the EB condition (3.3) fails, while $\nabla f$ enjoys the generalized metric subregularity property with respect to $(\psi,\Theta)$ at $0$, where $\psi(t):=t^{p-1}$.

For a fixed admissible function $\psi$ and a given set $\Xi\subset\Gamma$, we observe directly from the definitions that uniform generalized metric subregularity with respect to $(\psi,\Xi)$ yields the generalized metric subregularity at each point $\overline{x}$ of $\Xi$. The following simple one-dimensional example shows that the reverse implication fails.

Example 3.5 (pointwise version is strictly weaker than uniform version for generalized metric subregularity).

Consider $f(x):=(x-a)^2(x-b)^{2p}$ for $p>1$ with $a,b\in\mathbb{R}$ and $0\leq a<b$. We get $\nabla f(x)=f^{\prime}(x)=2(x-a)(x-b)^{2p}+2p(x-a)^2(x-b)^{2p-1}$ and $f^{\prime\prime}(x)=2(x-b)^{2p}+8p(x-a)(x-b)^{2p-1}+2p(2p-1)(x-a)^2(x-b)^{2p-2}$. Furthermore, $\Gamma=\{a,b,\frac{b+pa}{1+p}\}$ and $\Theta=\{a,b\}$. Thus there exists $M>0$ such that $d(x,\Theta)=|x-a|\leq M\,d(0,\nabla f(x))$ for all $x\in B_{\mathbb{R}}(a,\epsilon)$ with $\epsilon<\frac{b-a}{4}$. This tells us that $\nabla f$ satisfies the generalized metric subregularity property with respect to $(\psi,\Xi)$ at $\overline{x}=a$ with $\psi(t)=t$ and $\Xi=\Theta$.

On the other hand, $\nabla f$ does not satisfy the uniform generalized metric subregularity with respect to $(\psi,\Xi)$. Indeed, for any $\epsilon_k>0$ with $\epsilon_k\to 0$, we get $x_k=b+\epsilon_k\in\mathcal{N}(\Theta,\epsilon_k)$ and $\frac{d(x_k,\Theta)}{\|\nabla f(x_k)\|}=O(\frac{\epsilon_k}{\epsilon_k^{2p-1}})\to\infty$.
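Both behaviors can be checked numerically. The following Python sketch (with our own sample choice of $a$, $b$, and $p$) shows that the ratio $d(x,\Theta)/\|\nabla f(x)\|$ stays bounded near $a$ but blows up like $\epsilon^{2-2p}$ near $b$:

    a, b, p = 0.0, 1.0, 2   # sample choice with 0 <= a < b and p > 1

    def grad(x):
        return 2*(x - a)*(x - b)**(2*p) + 2*p*(x - a)**2*(x - b)**(2*p - 1)

    for eps in (1e-1, 1e-2, 1e-3):
        print("near a:", abs(a + eps - a) / abs(grad(a + eps)))   # bounded
    for eps in (1e-1, 1e-2, 1e-3):
        print("near b:", eps / abs(grad(b + eps)))                # blows up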

It has been well recognized in variational analysis that there exist close relationships between metric subregularity of subgradient mappings and second-order growth conditions for the functions in question; see [3, 11, 27] for more details. We now extend such relationships to the case of generalized metric subregularity. The following useful lemma is of its own interest.

Lemma 3.6 (first-order stationary points of compositions).

Let $f(x)=(\varphi\circ g)(x)$, where $\varphi:\mathbb{R}^n\to\overline{\mathbb{R}}$ is a proper l.s.c. convex function, and where $g:\mathbb{R}^m\to\mathbb{R}^n$ is a ${\cal C}^1$-smooth mapping with the surjective derivative $\nabla g(\overline{x})$ at some first-order stationary point $\overline{x}\in\Gamma$ associated with the composition $f$ in (2.2). Then there exists $\delta>0$ such that we have the inclusion

\Gamma\cap B_{\mathbb{R}^m}(\overline{x},\delta)\subseteq\mathop{\arg\min}_{x\in\mathbb{R}^m}f(x).   (3.4)
Proof.

The surjectivity of $\nabla g(\overline{x})$ for smooth $g$ allows us to deduce from the Lyusternik–Graves theorem (see, e.g., [26, Theorem 1.57]) the existence of $\delta,L>0$ such that $\nabla g(x)$ is surjective on $B_{\mathbb{R}^m}(\overline{x},\delta)$ and

\|\nabla g(x)^T v\|\geq L\|v\|\quad\mbox{whenever }(x,v)\in B_{\mathbb{R}^m}(\overline{x},\delta)\times\mathbb{R}^n.   (3.5)

Consequently, it follows from [26, Proposition 1.112] that

\partial f(x)=\nabla g(x)^T\partial\varphi(g(x))\quad\mbox{for all }x\in B_{\mathbb{R}^m}(\overline{x},\delta).   (3.6)

Picking any $u\in\Gamma\cap B_{\mathbb{R}^m}(\overline{x},\delta)$, we deduce from (3.6) that

0\in\partial f(u)=\nabla g(u)^T\partial\varphi(g(u)).

Combining this with (3.5) yields $0\in\partial\varphi(g(u))$. Hence $g(u)$ is a minimizer of the convex function $\varphi$, and

f(u)=\varphi(g(u))\leq\varphi(g(x))=f(x)\;\mbox{ for all }x\in\mathbb{R}^m,

which gives us (3.4) and thus completes the proof. ∎

Now we are ready to establish relationships between the generalized metric subregularity and corresponding growth conditions extending the known ones [3, 11] when $\psi(t):=t^2$, $\overline{\psi}(t):=t$, and $\Xi=\Gamma$.

Theorem 3.7 (generalized metric subregularity via growth conditions).

Let $f(x)=(\varphi\circ g)(x)$ in the setting of Lemma 3.6. Consider the following statements:
(i) There exist an admissible function $\psi:\mathbb{R}_+\to\mathbb{R}_+$ and constants $\gamma_1,\kappa_1\in(0,\infty)$ such that

f(x)\geq f(\overline{x})+\kappa_1\psi(d(x,\Xi))\quad\mbox{for all }x\in B_{\mathbb{R}^m}(\overline{x},\gamma_1).   (3.7)

(ii) There exist an admissible function $\overline{\psi}:\mathbb{R}_+\to\mathbb{R}_+$ and constants $\gamma_2,\kappa_2\in(0,\infty)$ such that

\overline{\psi}(d(x,\Xi))\leq\kappa_2\,d(0,\partial f(x))\quad\mbox{for all }x\in B_{\mathbb{R}^m}(\overline{x},\gamma_2).

Then implication (i)$\Rightarrow$(ii) holds with $\overline{\psi}(t):=\psi_+^{\prime}(\frac{t}{2})$ if $\psi$ is convex and $\lim_{t\to 0^+}\psi_+^{\prime}(t)=\psi_+^{\prime}(0)=0$, while the reverse one (ii)$\Rightarrow$(i) is always satisfied with $\psi(t):=\int_0^t\overline{\psi}(s)\,ds$.

Proof.

To verify (i)$\Rightarrow$(ii), we proceed as in the proof of Lemma 3.6 and find $\delta,L>0$ such that (3.4), (3.5), and (3.6) hold. Letting $\widetilde{\delta}:=\min\{\gamma_1,\frac{\delta}{2}\}$ and then picking any $x\in B_{\mathbb{R}^m}(\overline{x},\widetilde{\delta})$ and $u\in\partial f(x)$, take a sequence $\{x_k\}\subset\Xi$ such that $\|x-x_k\|\to d(x,\Xi)$ as $k\to\infty$. Noting that $\overline{x}\in\Xi$ and $\|x-\overline{x}\|<\widetilde{\delta}\leq\frac{\delta}{2}$, assume without loss of generality that $\{x_k\}$ entirely lies in $\Xi\cap B_{\mathbb{R}^m}(\overline{x},2\widetilde{\delta})$. Select further $L_0>0$ with $\|g(x_2)-g(x_1)\|\leq L_0\|x_1-x_2\|$ for all $x_1,x_2\in B_{\mathbb{R}^m}(\overline{x},\delta)$ and show that

f(x_k)\geq f(x)-\frac{L_0}{L}\|u\|\cdot\|x_k-x\|\;\mbox{ for all }k\in\mathbb{N}   (3.8)

whenever $u\in\partial f(x)$. Indeed, it follows from (3.5) and (3.6) that for any $u\in\partial f(x)$ there is $v\in\partial\varphi(g(x))$ with

u=\nabla g(x)^T v\quad\mbox{and}\quad L\|v\|\leq\|u\|.

Combining the latter with the convexity of $\varphi$ gives us

f(x)-f(x_k)=\varphi(g(x))-\varphi(g(x_k))\leq\langle v,g(x)-g(x_k)\rangle\leq\|v\|\cdot\|g(x)-g(x_k)\|\leq L_0\|v\|\cdot\|x-x_k\|\leq\frac{L_0}{L}\|u\|\cdot\|x-x_k\|

for all $k\in\mathbb{N}$, which brings us to (3.8). Note further that (3.4) ensures that $f(x_k)=f(\overline{x})$ for all $k\in\mathbb{N}$. Letting $k\to\infty$ there and denoting $\ell:=\frac{L_0}{L}$, we get $f(x)-f(\overline{x})\leq\ell\|u\|\,d(x,\Xi)$, which yields in turn

f(x)-f(\overline{x})\leq\ell\,d(0,\partial f(x))\,d(x,\Xi)\;\mbox{ whenever }x\in B_{\mathbb{R}^m}(\overline{x},\widetilde{\delta}).

This allows us to derive from (3.7) that

0<\frac{\kappa_1}{\ell}\,\frac{\psi(d(x,\Xi))}{d(x,\Xi)}\leq d(0,\partial f(x))\;\mbox{ for all }x\in B_{\mathbb{R}^m}(\overline{x},\widetilde{\delta})\setminus{\rm cl}(\Xi),

where ${\rm cl}(\Xi)$ stands for the closure of $\Xi$. Employing Lemma 2.1(ii) tells us that

\frac{\kappa_1}{2\ell}\,\psi_+^{\prime}\Big(\frac{d(x,\Xi)}{2}\Big)\leq d(0,\partial f(x))

for any $x\in B_{\mathbb{R}^m}(\overline{x},\widetilde{\delta})\setminus{\rm cl}(\Xi)$, and thus it follows from $\psi_+^{\prime}(0)=0$ that (ii) is satisfied. The reverse implication (ii)$\Rightarrow$(i) is a consequence of Lemma 3.6 and [38, Theorem 4.2]. ∎
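To illustrate Theorem 3.7 in the classical setting mentioned before its statement, take $\psi(t):=t^2$. Then $\psi_+^{\prime}(t)=2t$, so (i)$\Rightarrow$(ii) holds with $\overline{\psi}(t)=\psi_+^{\prime}(\frac{t}{2})=t$, i.e., quadratic growth yields metric subregularity of $\partial f$ with respect to $\Xi$. Conversely, starting from $\overline{\psi}(t)=t$, the implication (ii)$\Rightarrow$(i) returns $\psi(t)=\int_0^t s\,ds=\frac{t^2}{2}$, recovering the known relationship from [3, 11].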

4 Abstract Framework for Convergence Analysis

In this section, we develop a general framework for convergence analysis of an abstract minimization algorithm to ensure its superlinear and quadratic convergence. This means finding appropriate conditions on an abstract algorithm structure and the given objective function under which the generated iterates exhibit the desired fast convergence. We will see that generalized metric subregularity is an important condition for establishing fast convergence. The obtained abstract results are implemented in Section 7 to derive specific convergence conditions for the newly proposed high-order regularized Newton method with momentum.

Our study in this section focuses on a coupled sequence $\{(x_k,e_k)\}\subset\mathbb{R}^m\times\mathbb{R}_+$, where $\{x_k\}$ is the main sequence generated by the algorithm, and where $\{e_k\}$ is an auxiliary numerical sequence that serves as a surrogate for $\|x_{k+1}-x_k\|$ (the magnitude of the successive change sequence). Throughout this section, we assume that these two sequences satisfy the following set of basic assumptions (BA):

  • (i)

    Descent condition:

    f(x_{k+1})\leq f(x_k)\quad\mbox{for all }k\in\mathbb{N}.   (H0)

  • (ii)

    Surrogate condition: there exists $c>0$ such that

    \|x_{k+1}-x_k\|\leq c\,e_k\;\mbox{ for all }k\in\mathbb{N}\;\mbox{ with }\lim_{k\to\infty}e_k=0.   (H1)

  • (iii)

    Relative error condition: there exists $w_{k+1}\in\partial f(x_{k+1})$ such that

    \|w_{k+1}\|\leq b\,\beta(e_k),   (H2)

    where $b$ is a fixed positive constant, and where $\beta:\mathbb{R}_+\to\mathbb{R}_+$ is an admissible function.

  • (iv)

    Continuity condition: for each subsequence $\{x_{k_j}\}$ with $\lim_{j\to\infty}x_{k_j}=\widetilde{x}$, we have

    \limsup_{j\to\infty}f(x_{k_j})\leq f(\widetilde{x}).   (H3)

Note that the above descent condition (H0) and surrogate condition (H1) can be enforced by the following stronger property: there exist constants $a,c>0$ such that

f(x_{k+1})+a\,\varphi(e_k)\leq f(x_k)\;\mbox{ and }\;\|x_{k+1}-x_k\|\leq c\,e_k\;\mbox{ for all }k\in\mathbb{N},   (H1′)

where $\varphi$ is an admissible function. When $\varphi(t)=t^2$ and $e_k=\|x_{k+1}-x_k\|$, condition (H1′) is often referred to as the sufficient descent condition. Both conditions (H1′) and (H2) are automatically satisfied for various commonly used algorithms. For example, for proximal-type methods [5], these conditions always hold with $\varphi(t)=t^2$, $\beta(t)=t$, $c=1$, and $e_k=\|x_{k+1}-x_k\|$; for cubic regularized Newton methods with Lipschitz continuous Hessian [40], the latter conditions are fulfilled with $\varphi(t)=t^3$, $\beta(t)=t^2$, $c=1$, and $e_k=\|x_{k+1}-x_k\|$; for high-order proximal point methods [2], $\varphi(t)=t^{p+1}$ with $p\geq 1$, $\beta(t)=t^p$, $c=1$, and $e_k=\|x_{k+1}-x_k\|$.

Incorporating the surrogate sequence $\{e_k\}$ gives us some flexibility in handling multistep numerical methods with, e.g., momentum steps. As we see in Section 7, conditions (H1′) and (H2) hold with $e_k=\|\widehat{x}_{k+1}-x_k\|$, $\varphi(t)=t^{q+2}$, and $\beta(t)=t^{q+1}$ for some $q>0$, where $\widehat{x}_k$ is an auxiliary sequence generated by the subproblem of the high-order regularization methods, where $x_k={\rm argmin}_{x\in\{\widehat{x}_k,w_k\}}f(x)$, and where $w_k=\widehat{x}_k+\beta_k(\widehat{x}_k-\widehat{x}_{k-1})$ with $\beta_k\geq 0$. In the special case where $\varphi(t)=t^2$ and $\beta(t)=t$, there are other general unified frameworks [33], which involve a bivariate function $F$ serving as an approximation for $f$. While we can potentially extend further in this direction, our main aim is to provide a framework applicable to superlinear and quadratic convergence, which is not discussed in [33].
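To make the proximal-type instance of (H1′) and (H2) concrete, here is a minimal Python sketch of the proximal point method on $f(x)=x^4$ (our own example; the subproblem is solved by bisection on its optimality condition):

    def prox_step(x, lam):
        # solve the optimality condition 4y^3 + lam*(y - x) = 0 of the subproblem
        # argmin_y { y^4 + (lam/2)*(y - x)^2 } by bisection on [min(0,x), max(0,x)]
        lo, hi = min(0.0, x), max(0.0, x)
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if 4.0 * mid**3 + lam * (mid - x) > 0.0:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    lam, x = 1.0, 1.0
    for k in range(6):
        x_new = prox_step(x, lam)
        e_k = abs(x_new - x)
        # (H1'): sufficient descent with phi(t) = t^2 and a = lam/2
        assert x_new**4 + 0.5 * lam * e_k**2 <= x**4 + 1e-12
        # (H2): w_{k+1} = f'(x_{k+1}) = -lam*(x_{k+1} - x_k), so |w_{k+1}| = lam*e_k
        print(k, abs(4.0 * x_new**3), lam * e_k)   # the two columns agree
        x = x_new

Since $x_{k+1}$ minimizes $y\mapsto f(y)+\frac{\lambda}{2}(y-x_k)^2$, comparing with $y=x_k$ gives the sufficient descent inequality, and the subproblem optimality condition gives $\beta(t)=t$ with $b=\lambda$ in (H2).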

From now on, we assume that $f:\mathbb{R}^m\to\overline{\mathbb{R}}$ is a proper, l.s.c., bounded from below function. For a given sequence $\{x_k\}$, use $\Omega$ to denote the set of its accumulation points. Given $x\in\mathbb{R}^m$, define $\mathcal{L}(f(x)):=\{y\in\mathbb{R}^m\;|\;f(y)\leq f(x)\}$. The next two lemmas present some useful facts needed in what follows.

Lemma 4.1 (accumulation points under basic assumptions).

Consider a sequence $\{(x_k,e_k)\}$ under assumptions (H0)–(H3), and let $\mathcal{L}(f(x_{k_0}))$ be bounded for some $k_0\in\mathbb{N}$. Then the following assertions hold:
(i) The set of accumulation points $\Omega$ is nonempty and compact.
(ii) The sequence $\{x_k\}$ satisfies the relationships

f(x)=\inf_{k\in\mathbb{N}}f(x_k)=\lim_{k\to\infty}f(x_k)\quad\mbox{for all }x\in\Omega.

(iii) $\Omega\subset\Gamma$, and for any set $\Xi$ satisfying $\Omega\subset\Xi\subset\Gamma$ we have

\lim_{k\to\infty}d(x_k,\Xi)=0,   (4.1)

where $\Gamma$ is the collection of the first-order stationary points (2.2).

Proof.

To verify (i), observe that the descent condition (H0) and the boundedness of $\mathcal{L}(f(x_{k_0}))$ ensure the boundedness of $\{x_k\}$, which yields (i) since the closedness of $\Omega$ is easily derived by the diagonal process.

To proceed with (ii), we deduce from (H0) that

\inf_{k\in\mathbb{N}}f(x_k)=\lim_{k\to\infty}f(x_k).   (4.2)

Taking any $x\in\Omega$ allows us to find a subsequence $\{x_{k_j}\}$ such that $\lim_{j\to\infty}x_{k_j}=x$. Combining this with condition (H3) and the l.s.c. of $f$ gives us $\limsup_{j\to\infty}f(x_{k_j})\leq f(x)\leq\liminf_{j\to\infty}f(x_{k_j})$, which establishes (ii) based on the relationship provided in (4.2).

To verify the final assertion (iii), combine (i), (ii), and condition (H2) to get $\Omega\subset\Gamma$. For any set $\Xi$ satisfying $\Omega\subset\Xi\subset\Gamma$, (4.1) is a direct consequence of the fact that $\lim_{k\to\infty}d(x_k,\Omega)=0$. ∎

We say that a function $\tau:\mathbb{R}_+\to\mathbb{R}_+$ is asymptotically shrinking around zero if $\tau(0)=0$ and

\limsup_{t\to 0^+}\sum_{n=0}^{\infty}\frac{\tau^n(t)}{t}<\infty,   (4.3)

where $\tau^0(t):=t$ and $\tau^n(t):=\tau(\tau^{n-1}(t))$ for all $n\in\mathbb{N}$. It follows from the definition that the function $\tau(t)=\alpha t$ with $\alpha\in(0,1)$ is asymptotically shrinking around zero, while $\tau(t)=t$ is not. The next simple lemma provides an easily verifiable sufficient condition ensuring this property.

Lemma 4.2 (asymptotically shrinking functions).

Let $\tau:\mathbb{R}_+\to\mathbb{R}_+$ be a nondecreasing function with $\tau(0)=0$ and such that $\limsup_{t\to 0^+}\frac{\tau(t)}{t}<1$. Then this function is asymptotically shrinking around zero.

Proof.

To verify (4.3), denote $r:=\limsup_{t\to 0^+}\frac{\tau(t)}{t}<1$ and for any $\varepsilon\in(r,1)$ find $\delta>0$ such that $\tau(t)<\varepsilon t$ whenever $t\in(0,\delta)$, which yields $\tau^2(t)=\tau(\tau(t))<\varepsilon\tau(t)<\varepsilon^2 t$ as $t\in(0,\delta)$. By induction, we get $\tau^n(t)\leq\varepsilon^n t$ for all $n\in\mathbb{N}$ and $t\in(0,\delta)$. Hence

\sum_{n=0}^{\infty}\frac{\tau^n(t)}{t}\leq\sum_{n=0}^{\infty}\varepsilon^n=\frac{1}{1-\varepsilon}\quad\mbox{whenever }t\in(0,\delta),

which justifies (4.3) and thus completes the proof. ∎
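The geometric bound in the proof is easy to observe numerically. The following Python sketch (with $\tau(t)=t^2$ as our sample choice, which satisfies the assumptions of Lemma 4.2) evaluates the sum in (4.3):

    def tau(t):
        return t * t   # nondecreasing, tau(0) = 0, tau(t)/t = t -> 0 < 1

    for t in (0.5, 0.1, 0.01):
        total, s = 0.0, t
        while s > 1e-300:
            total += s        # accumulates sum_{n >= 0} tau^n(t)
            s = tau(s)
        print(t, total / t)   # the ratio stays bounded as t -> 0+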

Now we are ready to establish our first abstract convergence result.

Theorem 4.3 (abstract convergence framework).

For some $\eta>0$, let $\xi:[0,\eta)\to\mathbb{R}_+$ be a nondecreasing continuous function with $\xi(0)=0$, and let $\{s_k\}$ be a nonnegative sequence with $\lim_{k\to\infty}s_k=0$. Consider a sequence $\{(x_k,e_k)\}\subset\mathbb{R}^m\times\mathbb{R}_+$, which satisfies conditions (H0)–(H3), and assume that $\mathcal{L}(f(x_{k_0}))$ is bounded for some $k_0\in\mathbb{N}$. Take further any point $\overline{x}$, a closed set $\Xi$ with $\overline{x}\in\Omega\subset\Xi\subset\Gamma$, and a nondecreasing function $\tau:\mathbb{R}_+\to\mathbb{R}_+$ that is asymptotically shrinking around zero. Suppose that there exist $\ell_1\in[0,1)$, $\ell_2,\ell_3\in[0,\infty)$, and $k_1\in\mathbb{N}\setminus\{0\}$ such that the following hold whenever $k\geq k_1$ with $x_k\in B_{\mathbb{R}^m}(\overline{x},\eta)$:

  • (i)

    The sequence $\{s_k\}$ shrinks with respect to the mapping $\tau$, i.e.,

    s_k\leq\tau(s_{k-1}).   (4.4)

  • (ii)

    The surrogate of successive change grows mildly, i.e.,

    e_k\leq\ell_1 e_{k-1}+\ell_2\Lambda_{k,k+1}+\ell_3 s_k,   (4.5)

    where the sequence $\{\Lambda_{k,k+1}\}$ is defined by

    \Lambda_{k,k+1}:=\xi(f(x_k)-f(\overline{x}))-\xi(f(x_{k+1})-f(\overline{x})).   (4.6)

Then there exists $k_2\in\mathbb{N}$ with $k_2\geq k_1$ such that the inclusion $x_k\in B_{\mathbb{R}^m}(\overline{x},\eta)$ holds for all $k>k_2$, and therefore the estimates in (4.4) and (4.5) are also fulfilled for all $k>k_2$. Moreover, it follows that

\sum_{k=0}^{\infty}e_k<\infty,\quad\sum_{k=1}^{\infty}\|x_{k+1}-x_k\|<\infty,   (4.7)

and the sequence $\{x_k\}$ converges to $\overline{x}\in\Xi$ as $k\to\infty$.

Proof.

We first show that $\limsup_{k\to\infty}\sum_{n=0}^{\infty}\tau^n(s_k)=0$. To this end, observe that $\sum_{n=0}^{\infty}\tau^n(0)=0$ and assume without loss of generality that $s_k\neq 0$ for all $k\in\mathbb{N}$. Since $\lim_{k\to\infty}s_k=0$ and $\tau$ is asymptotically shrinking around zero, we get the equalities

\limsup_{k\to\infty}\sum_{n=0}^{\infty}\tau^n(s_k)=\limsup_{k\to\infty}\frac{\sum_{n=0}^{\infty}\tau^n(s_k)}{s_k}\cdot s_k=\lim_{k\to\infty}s_k\cdot\limsup_{k\to\infty}\sum_{n=0}^{\infty}\frac{\tau^n(s_k)}{s_k}=0.   (4.8)

Recalling that $\xi$ is continuous, it follows from (4.8) combined with (H0), (H1), and Lemma 4.1 that there exists $k_2\geq\max\{k_0,k_1\}$ such that $x_{k_2}\in B_{\mathbb{R}^m}(\overline{x},\frac{\eta}{2})$, and for all $k\geq k_2$ we have $f(\overline{x})\leq f(x_k)<f(\overline{x})+\eta$ and

\frac{\ell_1}{1-\ell_1}e_{k-1}+\frac{\ell_2}{1-\ell_1}\xi(f(x_k)-f(\overline{x}))+\frac{\ell_3}{1-\ell_1}\sum_{n=0}^{\infty}\tau^n(s_k)<\frac{\eta}{2c}.   (4.9)

The rest of the proof is split into the following two claims.

Claim 1: For every $k>k_2$, the inclusion $x_k\in B_{\mathbb{R}^m}(\overline{x},\eta)$ holds. Indeed, suppose on the contrary that there exists $k>k_2$ such that $x_k\notin B_{\mathbb{R}^m}(\overline{x},\eta)$. Letting $\bar{k}:=\min\{k>k_2\;|\;x_k\notin B_{\mathbb{R}^m}(\overline{x},\eta)\}$, observe that for any $k\in\{k_2,\dots,\bar{k}-1\}$ we have $x_k\in B_{\mathbb{R}^m}(\overline{x},\eta)$, and thus the inequalities in (4.4) and (4.5) are satisfied. Using them together with (4.6) implies that

\sum_{k=k_2}^{\bar{k}-1}e_k\leq\ell_1\sum_{k=k_2}^{\bar{k}-1}e_{k-1}+\ell_2\sum_{k=k_2}^{\bar{k}-1}\Lambda_{k,k+1}+\ell_3\sum_{k=k_2}^{\bar{k}-1}s_k
\leq\ell_1\sum_{k=k_2}^{\bar{k}-1}e_{k-1}+\ell_2\,\xi\big(f(x_{k_2})-f(\overline{x})\big)+\ell_3\big(s_{k_2}+\tau(s_{k_2})+\cdots+\tau^{\bar{k}-k_2-1}(s_{k_2})\big)
\leq\ell_1\sum_{k=k_2}^{\bar{k}-1}e_{k-1}+\ell_2\,\xi\big(f(x_{k_2})-f(\overline{x})\big)+\ell_3\sum_{n=0}^{\infty}\tau^n(s_{k_2}).

Rearranging the terms above brings us to the estimate

\sum_{k=k_2}^{\bar{k}-1}e_k\leq\frac{\ell_1}{1-\ell_1}e_{k_2-1}+\frac{\ell_2}{1-\ell_1}\xi\big(f(x_{k_2})-f(\overline{x})\big)+\frac{\ell_3}{1-\ell_1}\sum_{n=0}^{\infty}\tau^n(s_{k_2}).   (4.10)

Combining this with (4.9) and condition (H1), we arrive at

\|x_{\bar{k}}-\overline{x}\|\leq\|x_{k_2}-\overline{x}\|+\sum_{k=k_2}^{\bar{k}-1}\|x_{k+1}-x_k\|\leq\|x_{k_2}-\overline{x}\|+c\sum_{k=k_2}^{\bar{k}-1}e_k<\eta,

which contradicts the assumption that $x_{\bar{k}}\notin B_{\mathbb{R}^m}(\overline{x},\eta)$ and thus justifies the claim.

Claim 2: It holds that $\sum_{k=1}^{\infty}\|x_k-x_{k+1}\|<\infty$. Indeed, Claim 1 tells us that the inequalities in (4.4) and (4.5) are satisfied for all $k>k_2$. Consequently, for any $k\geq k_2$ and $p>k$, similar arguments to those for deriving (4.10) give us the estimate

\sum_{i=k}^{p}e_i\leq\frac{\ell_1}{1-\ell_1}e_{k-1}+\frac{\ell_2}{1-\ell_1}\xi\big(f(x_k)-f(\overline{x})\big)+\frac{\ell_3}{1-\ell_1}\sum_{n=0}^{\infty}\tau^n(s_k).   (4.11)

Letting $p\to\infty$ and employing (4.9) yields

\sum_{k=0}^{\infty}e_k\leq\sum_{k=0}^{k_2-1}e_k+\sum_{i=k_2}^{\infty}e_i\leq\sum_{k=0}^{k_2-1}e_k+\frac{\ell_1}{1-\ell_1}e_{k_2-1}+\frac{\ell_2}{1-\ell_1}\xi\big(f(x_{k_2})-f(\overline{x})\big)+\frac{\ell_3}{1-\ell_1}\sum_{n=0}^{\infty}\tau^n(s_{k_2})<\infty,

which gives us $\sum_{k=1}^{\infty}\|x_{k+1}-x_k\|<\infty$ due to (H1). Therefore, $\{x_k\}$ is a Cauchy sequence. Since $\overline{x}$ is an accumulation point of $\{x_k\}$, it follows that $x_k\to\overline{x}\in\Xi$ as $k\to\infty$. ∎

In what follows, we study the convergence rate of $\{x_k\}$. For each $k$, denote $\Delta_k:=\sum_{i=k}^{\infty}e_i$ and deduce from Theorem 4.3 that $\Delta_k<\infty$. For any $k$ sufficiently large, it follows from (4.11) that

\Delta_k=\sum_{i=k}^{\infty}e_i\leq\frac{\ell_1}{1-\ell_1}e_{k-1}+\frac{\ell_2}{1-\ell_1}\xi\big(f(x_k)-f(\overline{x})\big)+\frac{\ell_3}{1-\ell_1}\sum_{n=0}^{\infty}\tau^n(s_k),

which, together with (H1), ensures that

\|\overline{x}-x_k\|=\lim_{j\to\infty}\|x_{k+j}-x_k\|\leq\sum_{i=k}^{\infty}\|x_{i+1}-x_i\|\leq c\sum_{i=k}^{\infty}e_i=c\,\Delta_k\leq\frac{c\,\ell_1}{1-\ell_1}e_{k-1}+\frac{c\,\ell_2}{1-\ell_1}\xi\big(f(x_k)-f(\overline{x})\big)+\frac{c\,\ell_3}{1-\ell_1}\sum_{n=0}^{\infty}\tau^n(s_k).   (4.12)

The next theorem establishes a fast convergence rate of the iterative sequence generated by the abstract algorithm satisfying the basic assumptions (BA) under generalized metric subregularity.

Theorem 4.4 (convergence rate under generalized metric subregularity).

Let $\psi,\beta:\mathbb{R}_+\to\mathbb{R}_+$ be increasing admissible functions, let $\{(x_k,e_k)\}$ be a sequence satisfying the basic assumptions (H0)–(H3), and let $\overline{x}\in\Omega$ be an accumulation point of $\{x_k\}$. Fixing any closed set $\Xi$ with $\Omega\subset\Xi\subset\Gamma$, assume that $\mathcal{L}(f(x_{k_0}))$ is bounded for some $k_0\in\mathbb{N}$ and the following properties are fulfilled:

  • (i)

    Generalized metric subregularity holds at $\overline{x}$ with respect to $(\psi,\Xi)$, meaning that there exist numbers $\gamma,\eta\in(0,\infty)$ such that

    \psi\big(d(x,\Xi)\big)\leq\gamma\,d\big(0,\partial f(x)\big)\quad\mbox{for all }x\in B_{\mathbb{R}^m}(\overline{x},\eta).   (4.13)

  • (ii)

    The surrogate of successive change is bounded in terms of the target set, meaning that there exist $\ell>0$ and $k_1\in\mathbb{N}$ such that

    e_k\leq\ell\,d(x_k,\Xi)\quad\mbox{for all }k\geq k_1.   (4.14)

  • (iii)

    Growth rate control is satisfied, meaning that $\limsup_{t\to 0^+}\frac{\tau(t)}{t}<1$, where $\tau:\mathbb{R}_+\to\mathbb{R}_+$ is given by $\tau(t):=\psi^{-1}\left(\gamma b\beta(\ell t)\right)$ for all $t\in\mathbb{R}_+$.

Then the sequence $\{x_k\}$ converges to $\overline{x}\in\Xi$ as $k\to\infty$ with the rate $\|x_k-\overline{x}\|=O(\tau(\|x_{k-1}-\overline{x}\|))$, i.e.,

\limsup_{k\to\infty}\frac{\|x_k-\overline{x}\|}{\tau(\|x_{k-1}-\overline{x}\|)}<\infty.   (4.15)
Proof.

Let us first check that, for any $k\geq k_1+1$ with $x_k\in B_{\mathbb{R}^m}(\overline{x},\eta)$, we have the estimate

d(x_k,\Xi)\leq\tau\big(d(x_{k-1},\Xi)\big).   (4.16)

To this end, fix $k\geq k_1+1$ such that $x_k\in B_{\mathbb{R}^m}(\overline{x},\eta)$ and deduce from (H2), (4.13), and (4.14) that

\psi\big(d(x_k,\Xi)\big)\leq\gamma\,d\big(0,\partial f(x_k)\big)\leq\gamma b\,\beta(e_{k-1})\leq\gamma b\,\beta\big(\ell\,d(x_{k-1},\Xi)\big),

where the last inequality is due to the fact that $\beta$ is an increasing admissible function. Since $\psi$ is also an increasing admissible function, we get

d(x_k,\Xi)\leq\psi^{-1}\big(\gamma b\,\beta(\ell\,d(x_{k-1},\Xi))\big)=\tau(d(x_{k-1},\Xi))

and thus verify (4.16). It follows from Lemma 4.1(iii), Lemma 4.2, and Theorem 4.3 (applied to $\ell_1=\ell_2=0$, $\ell_3=\ell$, and $s_k=d(x_k,\Xi)$) that (4.16) holds for all large $k$, and hence $\{x_k\}$ converges to $\overline{x}\in\Xi$ as $k\to\infty$.

Using now (4.16), the increasing property of $\tau$, and the fact that $\overline{x}\in\Xi$, we arrive at

d(x_k,\Xi)\leq\tau(d(x_{k-1},\Xi))\leq\tau(\|x_{k-1}-\overline{x}\|)

for all $k$ sufficiently large. Combining this with (4.12) and recalling that $\ell_1=\ell_2=0$, $\ell_3=\ell$, and $s_k=d(x_k,\Xi)$ in this case, we obtain the estimates

\|x_k-\overline{x}\|\leq c\,\Delta_k\leq c\,\ell\,d(x_k,\Xi)\sum_{n=0}^{\infty}\frac{\tau^n(d(x_k,\Xi))}{d(x_k,\Xi)}\leq c\,\ell\,\tau(\|x_{k-1}-\overline{x}\|)\sum_{n=0}^{\infty}\frac{\tau^n(d(x_k,\Xi))}{d(x_k,\Xi)},

which being combined with Lemma 4.1(iii) and Lemma 4.2 justifies (4.15) and completes the proof. ∎

In the rest of this section, we discuss some remarkable features of our abstract convergence analysis and compare it with known results in the literature.

Remark 4.5 (discussions on abstract convergence analysis and its specifications).

$\bullet$ Note first that, among the three assumptions imposed in Theorem 4.4, the second one (surrogate of successive change bounded in terms of the target set) is often tied up with and indeed guaranteed by the construction of many common algorithms, while the other two assumptions are related to some regularity of the function $f$ with respect to the target set. In particular, in the special case where $e_k=\|x_{k+1}-x_k\|$, assumption (ii) in Theorem 4.4 is automatically satisfied for the proximal methods [5] with $\Xi=\Gamma$, and for the cubic regularized Newton methods [32] with $\Xi=\Theta$, $p\in(0,q+1)$.

$\bullet$ For the aforementioned proximal method from [5] to solve $\min_{x\in\mathbb{R}^m}f(x)$ with a fixed proximal parameter $\lambda>0$, where $f$ is a proper l.s.c. function, the automatic fulfillment of assumption (ii) allows us to deduce from Theorem 4.4 that if $\partial f$ is $p$th-order exponent metrically subregular at $\overline{x}$ with respect to the first-order stationary set $\Gamma$ for $p\in(0,1)$, then the proximal method converges at least superlinearly with the convergence rate $\frac{1}{p}$. To the best of our knowledge, this appears to be new in the literature. Moreover, we see in Example 4.6 below that the derived convergence rate can indeed be attained.

$\bullet$ In Section 7, we apply the conducted convergence analysis to the higher-order regularized Newton method with momentum steps to solve $\min_{x\in\mathbb{R}^m}f(x)$, where $f$ is a $C^2$-smooth function whose Hessian is Hölder continuous with exponent $q\in(0,1]$. As mentioned, assumption (ii) holds automatically with respect to the second-order stationary points $\Theta$ from (2.3). Therefore, if $\partial f$ is $p$th-order exponent metrically subregular at $\overline{x}$ with respect to $\Theta$ for $p\in(0,q+1)$, then the sequence generated by the algorithm converges to a second-order stationary point at least superlinearly with order $\frac{q+1}{p}$. Even in the special case where $p=1$, this improves the current literature on high-order regularized Newton methods by obtaining fast convergence rates under the pointwise metric subregularity (instead of the uniform metric subregularity/error bound condition as in (3.3)), and in the more general context allowing momentum steps; see Remark 7.10 for more details.
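The exponent rates above follow directly from the form of $\tau$ in Theorem 4.4(iii): for $\psi(t)=t^p$ and $\beta(t)=t^{q+1}$ we have

\tau(t)=\psi^{-1}\big(\gamma b\,\beta(\ell t)\big)=\big(\gamma b\,\ell^{q+1}\big)^{\frac{1}{p}}\,t^{\frac{q+1}{p}},

so the growth rate control $\limsup_{t\to 0^+}\frac{\tau(t)}{t}<1$ holds whenever $p<q+1$, and (4.15) gives $\|x_k-\overline{x}\|=O(\|x_{k-1}-\overline{x}\|^{\frac{q+1}{p}})$.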

Now we provide a simple example showing that the established convergence rate can be attained for the proximal method of minimizing a smooth univariate function.

Example 4.6 (tightness of the convergence rate).

Consider applying the proximal point method for the univariate function $f(t)=|t|^{\frac{3}{2}}$, $t\in\mathbb{R}$, with the iterations

t_{k+1}={\rm argmin}_{t\in\mathbb{R}}\big\{f(t)+\frac{\lambda}{2}(t-t_k)^2\big\},\quad t_0=1,

where $\lambda$ is a fixed positive parameter. Then we have

\nabla f(t_{k+1})+\lambda(t_{k+1}-t_k)=\frac{3}{2}{\rm sign}(t_{k+1})|t_{k+1}|^{\frac{1}{2}}+\lambda(t_{k+1}-t_k)=0.

This shows that $t_k>0\Rightarrow t_{k+1}>0$, and so $t_k>0$ for all $k$, which further implies that $t_k=\frac{3}{2\lambda}(t_{k+1})^{\frac{1}{2}}+t_{k+1}$, yielding $t_k\geq\frac{3}{2\lambda}(t_{k+1})^{\frac{1}{2}}$. Take $\Xi=\Gamma$, the first-order stationary set (2.2), and get

|t_{k+1}-t_k|=\frac{3}{2\lambda}|t_{k+1}|^{\frac{1}{2}}\leq|t_k|=d(t_k,\Xi).

For $w_{k+1}=\nabla f(t_{k+1})=\frac{3}{2}t_{k+1}^{\frac{1}{2}}$, we have $|w_{k+1}|=\frac{3}{2}t_{k+1}^{\frac{1}{2}}=\lambda|t_{k+1}-t_k|$. Choosing now $\psi(t)=\frac{3}{2}t^{\frac{1}{2}}$ and $\beta(t)=t$ gives us $\tau(t)=\psi^{-1}(\lambda\beta(t))=O(t^2)$ and the equalities $\psi(d(t_k,\Xi))=\psi(|t_k|)=\frac{3}{2}|t_k|^{\frac{1}{2}}=d(0,\nabla f(t_k))$. Direct verifications show that all the conditions in Theorem 4.4 are satisfied, and thus it follows from the theorem that $t_k\to\overline{t}=0$ at a quadratic rate, i.e., $|t_{k+1}-\overline{t}|=t_{k+1}=O(t_k^2)=O(|t_k-\overline{t}|^2)$.

To determine the exact convergence rate, we solve $t_k=\frac{3}{2\lambda}t^{\frac{1}{2}}+t$ in terms of the variable $t\in\mathbb{R}_+$. Letting $u=\sqrt{t}$ provides $t_k=\frac{3}{2\lambda}u+u^2$, which implies that $u=-\frac{3}{4\lambda}+\sqrt{t_k+\frac{9}{16\lambda^2}}$. Passing finally to the limit as $k\to\infty$ and remembering that $t_k\to 0$ tell us that

t_{k+1}=\left[-\frac{3}{4\lambda}+\sqrt{t_k+\frac{9}{16\lambda^2}}\right]^2=\left[\frac{t_k}{\frac{3}{4\lambda}+\sqrt{t_k+\frac{9}{16\lambda^2}}}\right]^2=O(t_k^2),

which agrees with the exact quadratic convergence rate derived in Theorem 4.4.
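Running the recursion confirms this numerically: the ratio $t_{k+1}/t_k^2$ settles at $(\frac{2\lambda}{3})^2$, matching the closed form above. A minimal Python sketch (with $\lambda=1$ as our sample choice):

    import math

    lam, t = 1.0, 1.0   # proximal parameter and t_0 = 1
    for k in range(8):
        # closed-form prox step: solves (3/2)*sqrt(s) + lam*(s - t) = 0 for s > 0
        t_next = (-3.0/(4.0*lam) + math.sqrt(t + 9.0/(16.0*lam**2)))**2
        print(k, t_next, t_next / t**2)   # last column tends to (2*lam/3)**2 = 4/9
        t = t_next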

5 Abstract Convergence Analysis under the KL Property

In this section, we continue our convergence analysis of abstract algorithms involving now the Kurdyka-Łojasiewicz (KL) property from Definition 2.2. Here is the main result of this section.

Theorem 5.1 (convergence under the KL properties).

Consider a sequence $\{(x_k,e_k)\}$, which satisfies assumptions (H1′), (H2), (H3) and is such that the set $\mathcal{L}(f(x_{k_0}))$ is bounded for some $k_0\in\mathbb{N}$. Let $\overline{x}$ be an accumulation point of the iterated sequence, and let $f:\mathbb{R}^m\to\overline{\mathbb{R}}$ be an l.s.c. function satisfying the KL property at $\overline{x}$, where $\eta>0$ and $\vartheta:[0,\eta)\to\mathbb{R}_+$ are taken from Definition 2.2. Suppose also that there exists a pair $(\mu_1,\mu_2)\in(0,1)\times\mathbb{R}_+$ for which

\limsup_{(s,t)\to(0,0)}\frac{\varphi^{-1}(\beta(s)t)}{\mu_1 s+\mu_2 t}\leq 1,   (5.1)

where $\varphi(\cdot)$ and $\beta(\cdot)$ are increasing functions taken from conditions (H1′) and (H2). Then we have $\sum_{k=1}^{\infty}\|x_{k+1}-x_k\|<\infty$, and the sequence $\{x_k\}$ converges to $\overline{x}$ as $k\to\infty$.

Proof.

It follows from (5.1) that for any $\alpha\in(1,\frac{1}{\mu_1})$ there exists $\delta>0$ such that

\varphi^{-1}\big(\beta(s)t\big)\leq\alpha\mu_1 s+\alpha\mu_2 t\quad\mbox{whenever }s,t\in(0,\delta).   (5.2)

Let $\overline{\eta}:=\min\{\eta,\varepsilon\}$ with $\varepsilon,\eta$ given in Definition 2.2. It follows from the abstract convergence framework of Theorem 4.3 with $s_k=0$ that the claimed assertions are verified if we show that there exist constants $\ell_1\in(0,1)$, $\ell_2\in[0,\infty)$, and $k_1\in\mathbb{N}\setminus\{0\}$ such that for any $k\geq k_1$ with $x_k\in B_{\mathbb{R}^m}(\overline{x},\overline{\eta})$ we have

e_k\leq\ell_1 e_{k-1}+\ell_2\Lambda_{k,k+1}   (5.3)

with the quantities $\Lambda_{k,k+1}$ defined by

\Lambda_{k,k+1}:=\vartheta\big(f(x_k)-f(\overline{x})\big)-\vartheta\big(f(x_{k+1})-f(\overline{x})\big).

To see this, recall that φ\varphi is an admissible function and then deduce from (H1{\rm H1^{\prime}}) and Lemma 4.1 that

f(xk+1)f(xk) for all k,f(x¯)=infkf(xk)=limkf(xk),f(x_{k+1})\leq f(x_{k})\;\mbox{ for all }\;k\in\mathbb{N},\quad f(\overline{x})=\inf_{k\in\mathbb{N}}f(x_{k})=\lim_{k\to\infty}f(x_{k}),

and limkek=0\lim_{k\to\infty}e_{k}=0. It follows from the continuity of ϑ\vartheta that there exists k11k_{1}\geq 1 for which

f(x¯)f(xk)<f(x¯)+η¯,ek1<δ, and f(\overline{x})\leq f(x_{k})<f(\overline{x})+\overline{\eta},\quad e_{k-1}<\delta,\;\mbox{ and } (5.4)
ba[ϑ(f(xk)f(x¯))ϑ(f(xk+1)f(x¯))]<δ whenever kk1.\frac{b}{a}[\vartheta(f(x_{k})-f(\overline{x}))-\vartheta(f(x_{k+1})-f(\overline{x}))]<\delta\;\mbox{ whenever }\;k\geq k_{1}. (5.5)

Take any kk1k\geq k_{1} with xkBm(x¯,η¯)x_{k}\in B_{\mathbb{R}^{m}}(\overline{x},\overline{\eta}) and observe that if ek=0e_{k}=0, then (5.3) holds trivially. Assuming that ek0e_{k}\neq 0 and then using (H1{\rm H1^{\prime}}) and (5.4), we get f(x¯)f(xk+1)<f(xk)f(\overline{x})\leq f(x_{k+1})<f(x_{k}), which being combined with the KL property and condition (H2{\rm H2}) shows that wk0w_{k}\neq 0 and ek10e_{k-1}\neq 0. Since wkf(xk)w_{k}\in\partial f(x_{k}), the aforementioned conditions readily yield the estimates

ϑ(f(xk)f(x¯))1wk1bβ(ek1).\vartheta^{\prime}(f(x_{k})-f(\overline{x}))\geq\frac{1}{\|w_{k}\|}\geq\frac{1}{b\beta(e_{k-1})}. (5.6)

The concavity of ϑ\vartheta in the KL property and assumption (H1{\rm H1^{\prime}}) imply that

ϑ(f(xk)f(x¯))ϑ(f(xk+1)f(x¯))\displaystyle\vartheta\big{(}f(x_{k})-f(\overline{x})\big{)}-\vartheta(f(x_{k+1})-f(\overline{x})) ϑ(f(xk)f(x¯))(f(xk)f(xk+1))\displaystyle\geq\vartheta^{\prime}\big{(}f(x_{k})-f(\overline{x})\big{)}\big{(}f(x_{k})-f(x_{k+1})\big{)} (5.7)
aϑ(f(xk)f(x¯))φ(ek).\displaystyle\geq a\vartheta^{\prime}\big{(}f(x_{k})-f(\overline{x})\big{)}\varphi(e_{k}).

Substituting (5.6) into (5.7) guarantees that

ba[ϑ(f(xk)f(x¯))ϑ(f(xk+1)f(x¯)]φ(ek)β(ek1),\frac{b}{a}\big{[}\vartheta\big{(}f(x_{k})-f(\overline{x})\big{)}-\vartheta\big{(}f(x_{k+1})-f(\overline{x}\big{)}\big{]}\geq\frac{\varphi(e_{k})}{\beta(e_{k-1})},

which establishes in turn that

ekφ1(baβ(ek1)[ϑ(f(xk)f(x¯))ϑ(f(xk+1)f(x¯))]).\displaystyle e_{k}\leq\varphi^{-1}\Big{(}\frac{b}{a}\beta(e_{k-1})\big{[}\vartheta\big{(}f(x_{k})-f(\overline{x})\big{)}-\vartheta\big{(}f(x_{k+1})-f(\overline{x})\big{)}\big{]}\Big{)}.

Combining the latter with (5.2) and (5.5), we get

ek\displaystyle e_{k} αμ1ek1+αμ2ba(ϑ(f(xk)f(x¯))ϑ(f(xk+1)f(x¯))),\displaystyle\leq\alpha\mu_{1}e_{k-1}+\frac{\alpha\mu_{2}b}{a}\Big{(}\vartheta\big{(}f(x_{k})-f(\overline{x})\big{)}-\vartheta\big{(}f(x_{k+1})-f(\overline{x})\big{)}\Big{)},

which shows that (5.3) holds with 1:=αμ1<1\ell_{1}:=\alpha\mu_{1}<1 and 2:=αμ2ba\ell_{2}:=\frac{\alpha\mu_{2}b}{a} and thus completes the proof. ∎

Now we employ Theorem 5.1 and the results of the previous section to find explicit conditions ensuring superlinear convergence of an abstract algorithm under the KL property. Taking $\vartheta\colon[0,\eta]\to\mathbb{R}_{+}$ from Definition 2.2 and the admissible functions $\varphi(\cdot)$ and $\beta(\cdot)$ from Theorem 5.1, define $\rho:\mathbb{R}_{+}\to\mathbb{R}_{+}$ by

\rho(t):=\begin{cases}\big((\vartheta^{-1})^{\prime}\big)^{-1}(b\beta(t))&\mbox{ if }\;t>0,\\ 0&\mbox{ if }\;t=0\end{cases}

and consider the function $\xi:\mathbb{R}_{+}\to\mathbb{R}_{+}$ given by $\xi(t):=\varphi^{-1}\left(\frac{\vartheta^{-1}(\rho(t))}{a}\right)$ for all $t\in\mathbb{R}_{+}$.

Theorem 5.2 (superlinear convergence rate under the KL property).

Let $\{(x_{k},e_{k})\}$, $\mathcal{L}(f(x_{k_{0}}))$, $\overline{x}$, $f$, $\vartheta$, $\varphi$, and $\beta$ be as in the general setting of Theorem 5.1 under assumptions $({\rm H1}^{\prime})$, $({\rm H2})$, and $({\rm H3})$. Take the function $\xi:\mathbb{R}_{+}\to\mathbb{R}_{+}$ as defined above and assume in addition that it is increasing on $\mathbb{R}_{+}$ with $\lim_{t\to 0^{+}}\frac{\xi(t)}{t}=0$. The following assertions hold:

  • (i)

    If there exists $r\geq 1$ such that $e_{k}\leq r\|x_{k+1}-x_{k}\|$ for all large $k\in\mathbb{N}$, then the sequence $\{x_{k}\}$ converges to $\overline{x}$ as $k\to\infty$ with the convergence rate

    \limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\widetilde{\xi}(\|x_{k-1}-\overline{x}\|)}<\infty,\ \mbox{ where }\ \widetilde{\xi}(t):=\xi(2rt). (5.8)

  • (ii)

    If there exist $\ell>0$ and a closed set $\Xi$ satisfying $\Omega\subset\Xi\subset\Gamma$ and such that

    e_{k}\leq\ell\,d(x_{k},\Xi)\ \mbox{ for all large }\ k\in\mathbb{N}, (5.9)

    then the sequence $\{x_{k}\}$ converges to $\overline{x}\in\Xi$ as $k\to\infty$ with the convergence rate

    \limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\widehat{\xi}(\|x_{k-1}-\overline{x}\|)}<\infty,\ \mbox{ where }\ \widehat{\xi}(t):=\xi(\ell\,t). (5.10)
Proof.

Suppose without loss of generality that $e_{k}\neq 0$ for all $k\in\mathbb{N}$. It follows from Theorem 4.3 that (5.6) holds when $k$ is sufficiently large and that there exists $k_{1}$ such that

(\vartheta^{-1})^{\prime}\big(\vartheta(f(x_{k})-f(\overline{x}))\big)=\frac{1}{\vartheta^{\prime}\big(f(x_{k})-f(\overline{x})\big)}\leq b\beta(e_{k-1})\quad\mbox{for all }\ k\geq k_{1}.

The latter readily implies that

\vartheta\big(f(x_{k})-f(\overline{x})\big)\leq\big((\vartheta^{-1})^{\prime}\big)^{-1}\big(b\beta(e_{k-1})\big)=\rho(e_{k-1})\quad\mbox{for all }\ k\geq k_{1}. (5.11)

Furthermore, condition $({\rm H1}^{\prime})$ and the first part of (5.4) tell us that

a\varphi(e_{k})\leq f(x_{k})-f(x_{k+1})\leq f(x_{k})-f(\overline{x})\ \mbox{ for all }\ k\in\mathbb{N},

which together with (5.11) yields $e_{k}\leq\xi(e_{k-1})$. Combining this with the limiting conditions $\lim_{k\to\infty}e_{k}=0$ and $\lim_{t\to 0^{+}}\frac{\xi(t)}{t}=0$ allows us to find a sufficiently large number $k_{2}\in\mathbb{N}$ such that

e_{k}\leq\xi(e_{k-1})<\frac{1}{2}e_{k-1}\quad\mbox{for all }\ k\geq k_{2}.

Invoking again condition $({\rm H1}^{\prime})$ ensures that for any $k\geq k_{2}$ we have the estimates

\|\overline{x}-x_{k}\|=\lim_{j\to\infty}\|x_{k+j}-x_{k}\|\leq\sum_{i=k}^{\infty}\|x_{i+1}-x_{i}\|\leq c\sum_{i=k}^{\infty}e_{i}\leq 2c\,e_{k}\leq 2c\,\xi(e_{k-1}). (5.12)

Now we are in a position to justify both assertions of the theorem. Starting with (i), note that

\xi(e_{k-1})\leq\xi(r\|x_{k}-x_{k-1}\|)\leq\xi(r\|x_{k}-\overline{x}\|+r\|x_{k-1}-\overline{x}\|)\leq\xi(2r\|x_{k}-\overline{x}\|)+\xi(2r\|x_{k-1}-\overline{x}\|)

for large $k$, where the last inequality follows from the increasing property of $\xi$ ensuring that

\xi(\alpha+\gamma)\leq\max\big\{\xi(2\alpha),\xi(2\gamma)\big\}\leq\xi(2\alpha)+\xi(2\gamma)

for any $\alpha,\gamma\geq 0$. Observing that $\lim_{t\to 0^{+}}\frac{\xi(2rt)}{t}=0$, it readily follows from $x_{k}\to\overline{x}$ and (5.12) that

\limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\xi(2r\|x_{k-1}-\overline{x}\|)}=\limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\xi(2r\|x_{k-1}-\overline{x}\|)+\xi(2r\|x_{k}-\overline{x}\|)}\leq\limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\xi(e_{k-1})}<\infty,

which brings us to (5.8) and thus justifies assertion (i).

To verify finally assertion (ii), observe that the increasing property of $\xi$ and the inclusion $\overline{x}\in\Xi$ allow us to deduce from (5.9) the estimates

\xi(e_{k-1})\leq\xi\big(\ell d(x_{k-1},\Xi)\big)\leq\xi\big(\ell\|x_{k-1}-\overline{x}\|\big)\quad\mbox{ for all }\ k\geq k_{2},

which together with (5.12) gives us (5.10) and thus completes the proof of the theorem. ∎

We conclude this section with some specifications of the convergence results obtained under the KL property and comparisons with known results in the literature.

Remark 5.3 (further discussions and comparisons for KL convergence analysis).

• Observe first that assumption (5.1) in Theorem 5.1 holds automatically for many common algorithms. For example, in the standard proximal methods [5] we have $\varphi(t)=t^{2}$ and $\beta(t)=t$, and thus by setting $(\mu_{1},\mu_{2})=(1/2,1/2)$ in Theorem 5.1, condition (5.1) holds by the classical AM–GM inequality.

• In the case where $\vartheta(t)=ct^{1-\theta}$ with the KL exponent $\theta\in[0,1)$, we can obtain the explicit sublinear, linear, and superlinear convergence rates. In fact, let $\beta,\varphi,\vartheta:\mathbb{R}_{+}\to\mathbb{R}_{+}$ be such that $\beta(t)=t^{\alpha}$, $\varphi(t)=t^{1+\alpha}$, and $\vartheta(t)=ct^{1-\theta}$, where $c,\alpha>0$ and $\theta\in[0,1)$. It is easy to see that assumption (5.1) in Theorem 5.1 holds with $(\mu_{1},\mu_{2})=(\frac{\alpha}{\alpha+1},\frac{1}{\alpha+1})$ due to the AM–GM inequality. In this case, direct verifications show that

\rho(t):=\big((\vartheta^{-1})^{\prime}\big)^{-1}(b\beta(t))=(1-\theta)^{\frac{1-\theta}{\theta}}c^{\frac{1}{\theta}}b^{\frac{1-\theta}{\theta}}t^{\frac{\alpha(1-\theta)}{\theta}}=:\kappa t^{\frac{\alpha(1-\theta)}{\theta}},
\xi(t):=\varphi^{-1}\left(\frac{\vartheta^{-1}(\rho(t))}{a}\right)=a^{-\frac{1}{1+\alpha}}\left(\frac{\kappa}{c}\right)^{\frac{1}{(1+\alpha)(1-\theta)}}t^{\frac{\alpha}{(1+\alpha)\theta}}=:\mu t^{\frac{\alpha}{(1+\alpha)\theta}}.

Suppose that condition (5.9) holds, which is true for various common numerical methods such as proximal-type methods and high-order regularized methods. If $\theta<\frac{\alpha}{1+\alpha}$, then Theorem 5.2 implies that the sequence $\{x_{k}\}$ converges to $\overline{x}$ superlinearly with the convergence rate $\frac{\alpha}{(1+\alpha)\theta}$. It will be seen in Section 7 that if we further assume that $f$ is a ${\cal C}^{2}$-smooth function and satisfies the strict saddle point property, then Theorem 4.4 and Corollary 6.5 allow us to improve the superlinear convergence rate to $\frac{(1-\theta)\alpha}{\theta}$.

• Following the arguments from [4, Theorem 2], we can also obtain convergence rates in the remaining cases for $\theta$. Indeed, when $\theta=\frac{\alpha}{1+\alpha}$, there exist $\varrho\in(0,1)$ and $\zeta\in(0,\infty)$ such that for large $k$ we get

\|x_{k}-\overline{x}\|\leq\Delta_{k}:=\sum_{i=k}^{\infty}\|x_{i+1}-x_{i}\|\leq\zeta\varrho^{k}.

If finally $\theta>\frac{\alpha}{1+\alpha}$, then there exists $\mu>0$ ensuring the estimates

\|x_{k}-\overline{x}\|\leq\Delta_{k}\leq\mu k^{-\frac{\alpha(1-\theta)}{(1+\alpha)\theta-\alpha}}\ \mbox{ for all large }\ k\in\mathbb{N}.

The proof of this fact is rather standard, and so we omit details here.

• Let us highlight that the superlinear convergence rate derived under the KL assumptions need not be tight in general and can be worse than the one derived under the generalized metric subregularity. For instance, consider the situation in Example 4.6, where the proximal point method is applied to $f(t)=|t|^{\frac{3}{2}}$. As discussed in that example, the sequence $\{t_{k}\}$ generated by the algorithm converges with a quadratic rate, which agrees with the rate derived under the generalized metric subregularity. Direct verification shows that the function $f(t)=|t|^{\frac{3}{2}}$ satisfies the KL property with exponent $\theta=\frac{1}{3}$ and that $\alpha=1$ for the proximal point method. Thus Theorem 5.2 only implies that $\{t_{k}\}$ converges superlinearly with rate $\frac{3}{2}$, not quadratically as under the generalized metric subregularity.

6 Generalized Metric Subregularity for Second-Order Conditions

In this section, we continue our study of generalized metric subregularity while concentrating now on deriving verifiable conditions for the fulfillment of this notion with respect to the set $\Xi=\Theta$ of second-order stationary points defined in (2.3), where the function $f$ in question is ${\cal C}^{2}$-smooth. Such second-order generalized metric subregularity plays a crucial role in the justification of fast convergence of our new high-order regularized Newton method in Section 7. We will see below that this notion holds in broad settings of variational analysis and optimization and is satisfied in important practical models.

Note that if $f$ is convex, then there is no difference between $\Theta$ and the set $\Gamma$ of first-order stationary points (2.2). In what follows, we focus on nonconvex settings for generalized metric subregularity, where $\Theta\neq\Gamma$, as in the following simple yet rather instructive example.

Example 6.1 (generalized metric subregularity for nonconvex functions).

Consider the function $f:\mathbb{R}^{m}\rightarrow\mathbb{R}$ given by $f(x)=(\|x\|^{2}-r)^{2p}$ with $r>0$ and $p\geq 1$. Direct computations show that $\nabla f(x)=4p(\|x\|^{2}-r)^{2p-1}x$ and thus

\nabla^{2}f(x)=8p(2p-1)(\|x\|^{2}-r)^{2p-2}xx^{T}+4p(\|x\|^{2}-r)^{2p-1}I_{m},

where $I_{m}$ stands for the $m\times m$ identity matrix. Since $\nabla^{2}f(0)$ is negative-definite, $f$ is nonconvex. Moreover, $\Gamma=\{x\;|\;\|x\|=\sqrt{r}\}\cup\{0\}$ and $\Theta=\{x\;|\;\|x\|=\sqrt{r}\}$. Therefore, for any $\overline{x}\in\Theta$ we have

d(x,\Theta)=\big|\|x\|-\sqrt{r}\big|=\frac{\big|\|x\|^{2}-r\big|}{\|x\|+\sqrt{r}}\leq\Big(\frac{3}{2}\sqrt{r}\Big)^{-1}\big|\|x\|^{2}-r\big|=\Big(\frac{3}{2}\sqrt{r}\Big)^{-1}\frac{\big[4p\big|\|x\|^{2}-r\big|^{2p-1}\|x\|\big]^{\frac{1}{2p-1}}}{(4p\|x\|)^{\frac{1}{2p-1}}}\leq\Big(\frac{3}{2}\sqrt{r}\Big)^{-1}\frac{\|\nabla f(x)\|^{\frac{1}{2p-1}}}{(2p\sqrt{r})^{\frac{1}{2p-1}}}\ \mbox{ whenever }\ x\in B_{\mathbb{R}^{m}}(\overline{x},\sqrt{r}/2),

and thus $\nabla f$ satisfies the generalized metric subregularity with respect to $(\psi,\Xi)$ for $\psi(t)=t^{2p-1}$ and $\Xi=\Theta$.
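
The estimate above is easy to confirm numerically. The following sketch samples points near $\Theta$ and checks that the ratio $d(x,\Theta)^{2p-1}/\|\nabla f(x)\|$ stays bounded; the values $r=1$, $p=2$, and $m=3$ are our own illustrative choices, not prescribed by the example.

import numpy as np

rng = np.random.default_rng(0)
r, p, m = 1.0, 2, 3  # illustrative data for f(x) = (||x||^2 - r)^{2p}

def grad_f(x):
    return 4 * p * (x @ x - r) ** (2 * p - 1) * x

worst = 0.0
for _ in range(10000):
    xbar = rng.standard_normal(m)
    xbar *= np.sqrt(r) / np.linalg.norm(xbar)        # a point of Theta
    v = rng.standard_normal(m)
    v /= np.linalg.norm(v)
    x = xbar + 0.5 * np.sqrt(r) * rng.uniform(0.01, 1.0) * v  # x in B(xbar, sqrt(r)/2)
    dist = abs(np.linalg.norm(x) - np.sqrt(r))                # d(x, Theta)
    worst = max(worst, dist ** (2 * p - 1) / np.linalg.norm(grad_f(x)))
print("sup of d(x,Theta)^(2p-1) / ||grad f(x)||:", worst)     # bounded, as predicted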

Let us now introduce a new property that plays an important role below. Given a proper l.s.c. function $f\colon\mathbb{R}^{m}\to\overline{\mathbb{R}}$, we say that $f$ has the weak separation property (WSP) at $\overline{x}\in\Gamma$ if there is $\varsigma>0$ with

f(y)\leq f(\overline{x})\quad\mbox{for all }\ y\in\Gamma\cap B_{\mathbb{R}^{m}}(\overline{x},\varsigma), (6.1)

where $\Gamma$ is taken from (2.2). The constant $\varsigma$ is called the modulus of weak separation. If the inequality in (6.1) is replaced by equality, such a property is frequently used in the study of error bounds under the name of “separation property” (e.g., in [21, 25]), which inspires our terminology. While the separation property obviously yields WSP, the converse fails. For example, consider $f(x):=-e^{\frac{-1}{x^{2}}}[\cos(\frac{1}{x})]^{2p}$ if $x\neq 0$ and $f(x):=0$ if $x=0$, where $p\in\mathbb{N}$. Take $\overline{x}=0$ and deduce from the definition that $f$ satisfies WSP at $\overline{x}$. On the other hand, $f(\frac{1}{2k\pi+\frac{\pi}{2}})=f(\frac{1}{2k\pi+\frac{3\pi}{2}})=0$ for all $k\in\mathbb{N}$, and hence there exists $a_{k}\in(\frac{1}{2k\pi+\frac{3\pi}{2}},\frac{1}{2k\pi+\frac{\pi}{2}})$ such that $f^{\prime}(a_{k})=0$ and $f(a_{k})<0=f(\overline{x})$, which tells us that the separation property fails.

The next proposition shows that a large class of functions satisfies the weak separation property.

Proposition 6.2 (sufficient conditions for weak separation).

Let $f\colon\mathbb{R}^{m}\to\overline{\mathbb{R}}$ be a proper l.s.c. function, and let $\overline{x}\in\Gamma$. Suppose that either one of the following conditions holds:

  • (i)

    $f$ is continuous at $\overline{x}$ with respect to ${\rm dom}(\partial f)$, i.e., $\lim_{z\in{\rm dom}(\partial f),\,z\rightarrow\overline{x}}f(z)=f(\overline{x})$, and it satisfies the KL property at $\overline{x}\in\Gamma$.

  • (ii)

    $f=\varphi\circ g$, where $\varphi:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a convex function, and where $g:\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ is a smooth mapping such that the Jacobian matrix $\nabla g(\overline{x})$ is surjective.

Then $f$ enjoys the weak separation property at $\overline{x}$.

Proof.

To verify (i), suppose on the contrary that WSP fails at $\overline{x}$. Then there exists a sequence $y_{k}\rightarrow\overline{x}$ with $y_{k}\in\Gamma$ such that $f(y_{k})>f(\overline{x})$ for all $k\in\mathbb{N}$. It follows from the continuity of $f$ at $\overline{x}$ with respect to ${\rm dom}(\partial f)$ that $y_{k}\in\{x\;|\;f(\overline{x})<f(x)<f(\overline{x})+\eta\}$ for all large $k$, where $\eta$ is taken from the assumed KL property at $\overline{x}$. The latter condition implies that

\vartheta^{\prime}(f(y_{k})-f(\overline{x}))\,d(0,\partial f(y_{k}))\geq 1,

which contradicts $y_{k}\in\Gamma$ and hence justifies the claimed WSP. Assertion (ii) follows directly from Lemma 3.6, which therefore completes the proof of the proposition. ∎

Another concept used below to study second-order generalized metric subregularity applies to ${\cal C}^{2}$-smooth functions. Given such a function, recall that $\overline{x}$ is its strict saddle point if $\lambda_{\min}(\nabla^{2}f(\overline{x}))<0$. We say that $f$ satisfies the strict saddle point property at $\overline{x}\in\Gamma$ if $\overline{x}$ is either a local minimizer or a strict saddle point for $f$. If $f$ satisfies the strict saddle point property at every first-order stationary point, then we simply say that $f$ satisfies the strict saddle point property. Furthermore, if every first-order stationary point $\overline{x}$ of $f$ is either a global minimizer or a strict saddle point for $f$, then we say that $f$ satisfies the strong strict saddle point property. It is known from Morse theory that the strict saddle point property holds generically for ${\cal C}^{2}$-smooth functions. Moreover, it has been recognized in the literature that various classes of nonconvex functions arising in matrix completion, phase retrieval, neural networks, etc., enjoy the strict saddle point property; see some instructive examples below.

Before establishing the main result of this section, we present one more proposition of independent interest telling us that, around a local minimizer, the distance to $\Gamma$, the set of first-order stationary points, agrees with the distance to $\Theta$, the set of second-order stationary points, under WSP. Let us emphasize that, although $\Gamma$ and $\Theta$ are generally different sets, the distances to them coincide under the imposed assumptions.

Proposition 6.3 (distance to stationary points).

Given a ${\cal C}^{2}$-smooth function $f\colon\mathbb{R}^{m}\to\mathbb{R}$ and $\overline{x}\in\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}[\overline{x},\alpha]}f(x)$ for some $\alpha>0$, suppose that WSP holds at $\overline{x}$ with modulus $\varsigma>0$. Denoting $\gamma:=\min\{\alpha,\varsigma\}$, we have the equality

d(x,\Theta)=d(x,\Gamma)=d\big(x,\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}[\overline{x},\gamma]}f(x)\big)=d\big(x,[f\leq f(\overline{x})]\big)\ \mbox{ for all }\ x\in B_{\mathbb{R}^{m}}\big(\overline{x},\gamma/4\big). (6.2)
Proof.

It follows from (6.1), the fact that $\overline{x}\in\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}[\overline{x},\alpha]}f(x)$, and the choice of $\gamma$ that

\Gamma\cap B_{\mathbb{R}^{m}}(\overline{x},\gamma)\subset\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}[\overline{x},\gamma]}f(x)\subset[f\leq f(\overline{x})].

This implies that for any $x\in B_{\mathbb{R}^{m}}(\overline{x},\gamma/2)$ we get

d(x,[f\leq f(\overline{x})])\leq d\big(x,\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}[\overline{x},\gamma]}f(x)\big)\leq d(x,\Gamma\cap B_{\mathbb{R}^{m}}(\overline{x},\gamma))=\min\big\{d(x,\Gamma\cap B_{\mathbb{R}^{m}}(\overline{x},\gamma)),d(x,\Gamma\setminus B_{\mathbb{R}^{m}}(\overline{x},\gamma))\big\}=d(x,\Gamma)\leq d(x,\Theta), (6.3)

where the first equality is due to $d(x,\Gamma\cap B_{\mathbb{R}^{m}}(\overline{x},\gamma))<\gamma/2$ (by $\overline{x}\in\Gamma$) and $d(x,\Gamma\setminus B_{\mathbb{R}^{m}}(\overline{x},\gamma))\geq\gamma/2$. Recalling that $\overline{x}\in\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}[\overline{x},\alpha]}f(x)$ leads us to

[f\leq f(\overline{x})]\cap B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)=\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)}f(x)

and to the distance estimate $d\big(x,\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)}f(x)\big)\leq\|x-\overline{x}\|<\gamma/4$ for all $x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/4\right)$. Therefore, taking $x$ from the above neighborhood of $\overline{x}$ brings us to the equality

d\big(x,[f\leq f(\overline{x})]\big)=\min\big\{d\big(x,\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)}f(x)\big),d\big(x,[f\leq f(\overline{x})]\setminus B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)\big)\big\}=d\big(x,\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)}f(x)\big). (6.4)

Observing finally that $\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)}f(x)\subset\Theta$, we deduce from (6.3) and (6.4) that

d(x,[f\leq f(\overline{x})])\leq d\big(x,\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}[\overline{x},\gamma]}f(x)\big)\leq d(x,\Gamma)\leq d(x,\Theta)\leq d\big(x,\mathop{\arg\min}_{x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/2\right)}f(x)\big)=d(x,[f\leq f(\overline{x})])

for any $x\in B_{\mathbb{R}^{m}}\left(\overline{x},\gamma/4\right)$, which justifies (6.2) and thus completes the proof of the proposition. ∎

Now we are ready to derive sufficient conditions for the generalized metric subregularity that are broadly used below.

Theorem 6.4 (sufficient conditions for generalized metric subregularity).

Let $f:\mathbb{R}^{m}\to\mathbb{R}$ be a ${\cal C}^{2}$-smooth function, let $\psi:\mathbb{R}_{+}\to\mathbb{R}_{+}$ be an admissible function, and let $\overline{x}\in\Theta$. Suppose that both WSP and the strict saddle point property hold for $f$ at $\overline{x}$. Then $\nabla f$ has the generalized metric subregularity property with respect to $(\psi,\Theta)$ at $\overline{x}$ if and only if $\nabla f$ has this property at $\overline{x}$ with respect to $(\psi,\Gamma)$. In particular, if $f$ is a subanalytic ${\cal C}^{2}$-smooth function satisfying the strict saddle point property, then $\nabla f$ enjoys the generalized metric subregularity property with respect to $(\psi,\Theta)$ at $\overline{x}\in\Theta$, where $\psi(t)=t^{\gamma}$ for some $\gamma>0$.

Proof.

Note first that by $\overline{x}\in\Theta$ we have $\nabla f(\overline{x})=0$ and $\nabla^{2}f(\overline{x})\succeq 0$. Since $f$ satisfies the strict saddle point property, $\overline{x}$ must be a local minimizer for $f$. Therefore, the first conclusion of the theorem follows immediately from Proposition 6.3.

Suppose now that $f$ is a subanalytic ${\cal C}^{2}$-smooth function satisfying the strict saddle point property. As mentioned earlier, subanalytic functions satisfy the KL property [6], and hence we deduce from Proposition 6.2 that the weak separation property holds at $\overline{x}$. By [7, Proposition 2.13(ii)], the subanalytic property of $f$ ensures that the functions $h_{1}(x):=d(0,\nabla f(x))$ and $h_{2}(x):=d(x,\Gamma)$ are subanalytic. Observing that $h_{1}^{-1}(0)\subset h_{2}^{-1}(0)$, we deduce from the Łojasiewicz factorization lemma for subanalytic functions in [6, Theorem 6.4] that for each compact set $K$ there exist $c>0$ and $\gamma_{0}\in(0,1]$ such that $|h_{2}(x)|\leq c|h_{1}(x)|^{\gamma_{0}}$ whenever $x\in K$. This verifies the second conclusion of the theorem and thus completes the proof. ∎

Next we provide some sufficient conditions for exponent-type metric subregularity with explicit exponents relative to second-order stationary sets.

Corollary 6.5 (exponent metric subregularity for second-order stationary points).

Let $f\colon\mathbb{R}^{m}\to\mathbb{R}$ be a ${\cal C}^{2}$-smooth function, and let $\overline{x}\in\Theta$ for the second-order stationary set (2.3). Suppose that either one of the following conditions holds:

  • (i)

    $f$ satisfies the strict saddle point property and the KL property at $\overline{x}$ with the KL exponent $\theta$.

  • (ii)

    $f=\varphi\circ g$, where $\varphi:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a convex function satisfying the KL property at $g(\overline{x})$ with the KL exponent $\theta$, and where $g:\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ is a smooth mapping with the surjective derivative $\nabla g(\overline{x})$.

Then $\nabla f$ enjoys the generalized metric subregularity property at $\overline{x}$ with respect to $(\psi,\Theta)$, where $\psi(t)=t^{\frac{\theta}{1-\theta}}$.

Proof.

For (i), we get by the KL property of $f$ at $\overline{x}$ with the exponent $\theta$ that there are $c,\eta,\epsilon>0$ such that

d(0,\nabla f(x))\geq c^{-1}\big(f(x)-f(\overline{x})\big)^{\theta} (6.5)

whenever $x\in B_{\mathbb{R}^{m}}(\overline{x},\epsilon)\cap[f(\overline{x})<f<f(\overline{x})+\eta]$. Using the continuity of $f$ and shrinking $\epsilon$ if needed tell us that (6.5) holds for all $x\in B_{\mathbb{R}^{m}}(\overline{x},\epsilon)\cap[f>f(\overline{x})]$. This together with [20, Lemma 6.1] implies that there exist $\alpha>0$ and $\epsilon_{0}\in(0,\epsilon)$ such that

d(x,[f\leq f(\overline{x})])\leq\alpha\,\big(f(x)-f(\overline{x})\big)^{1-\theta}\ \mbox{ for all }\ x\in B_{\mathbb{R}^{m}}(\overline{x},\epsilon_{0}).

Shrinking $\epsilon_{0}$ again if necessary, we deduce from Propositions 6.2 and 6.3 that

d(x,\Theta)\leq\alpha\,\big(f(x)-f(\overline{x})\big)^{1-\theta}\ \mbox{ whenever }\ x\in B_{\mathbb{R}^{m}}(\overline{x},\epsilon_{0}).

Thus there exists $M>0$ with $d(x,\Theta)\leq M\,d\big(0,\nabla f(x)\big)^{\frac{1-\theta}{\theta}}$ for all $x\in B_{\mathbb{R}^{m}}(\overline{x},\epsilon_{0})$, and so (i) holds.

To justify (ii) under the imposed assumptions, it follows from [21, Theorem 3.2] that $f$ satisfies the KL property at $\overline{x}$ with the KL exponent $\theta$. Since $\overline{x}\in\Theta\subset\Gamma$, we get from Lemma 3.6 that $f$ satisfies the strict saddle point property at $\overline{x}$, and therefore the claimed conclusion follows from part (i). ∎

In the rest of this section, we discuss three practical models, frequently arising in machine learning, image processing, and statistics, for which the obtained sufficient conditions allow us to verify the second-order generalized metric subregularity.

Example 6.6 (over-parameterization of compressed sensing models).

Consider the $\ell_{1}$-regularization problem, known also as the Lasso problem:

\min_{x\in\mathbb{R}^{m}}\|Ax-b\|^{2}+\nu\|x\|_{1}, (6.6)

where $A\in\mathbb{R}^{n\times m}$, $b\in\mathbb{R}^{n}$, $\nu>0$, and $\|\cdot\|_{1}$ is the usual $\ell_{1}$ norm. A recent interesting way to solve this problem [34, 35, 36] is to transform it into an equivalent smooth problem

\min_{(u,v)\in\mathbb{R}^{m}\times\mathbb{R}^{m}}f_{OP}(u,v):=\|A(u\circ v)-b\|^{2}+\nu(\|u\|^{2}+\|v\|^{2}), (6.7)

where $u\circ v$ is the Hadamard (entrywise) product of the vectors $u$ and $v$ in the sense that $(u\circ v)_{i}:=u_{i}v_{i}$, $i=1,\ldots,m$. A nice feature of this over-parameterized model is that its objective function is a polynomial (and so ${\cal C}^{2}$-smooth and subanalytic), and it satisfies the strong strict saddle point property; see [35, Appendix C] for the detailed derivation, and [34, 36] for more information.

To proceed, we first use Theorem 6.4 to see that the generalized metric subregularity holds for $\nabla f_{OP}$ with respect to $(\psi,\Theta)$, where $\psi(t)=t^{\gamma}$ for some $\gamma>0$. In fact, it is possible to identify the exponent $\gamma$ explicitly. As shown in [34], the fulfillment of the strict complementarity condition

0\in 2A^{T}(A\overline{x}-b)+{\rm ri}\,\big(\nu\,\partial\|\cdot\|_{1}(\overline{x})\big), (6.8)

where ‘${\rm ri}$’ denotes the relative interior, ensures that $f_{OP}$ satisfies the KL property at any $\overline{x}\in\Gamma$ with the KL exponent $1/2$. In the absence of the strict complementarity condition, it is shown in [34] that $f_{OP}$ satisfies the KL property at any $\overline{x}\in\Gamma$ with the KL exponent $3/4$.

Therefore, Corollary 6.5 tells us that for any $\overline{x}\in\Theta$ satisfying the strict complementarity condition (6.8), $\nabla f_{OP}$ enjoys the standard metric subregularity property (as $\psi(t)=t$) at $\overline{x}$ with respect to $\Theta$. In the absence of the strict complementarity condition, $\nabla f_{OP}$ satisfies the generalized metric subregularity property at $\overline{x}$ with respect to $(\psi,\Theta)$, where $\psi(t)=t^{3}$.
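
For concreteness, the following sketch sets up the over-parameterized objective (6.7) and checks the balanced substitution $u_{i}={\rm sign}(x_{i})\sqrt{|x_{i}|}$, $v_{i}=\sqrt{|x_{i}|}$, under which $u\circ v=x$ and, by AM–GM, $\nu(\|u\|^{2}+\|v\|^{2})\geq 2\nu\|u\circ v\|_{1}$ holds with equality, so (6.7) matches the Lasso objective with the $\ell_{1}$ weight $2\nu$; the random data $A$, $b$ and the weight $\nu$ are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
n, m, nu = 10, 30, 0.1  # illustrative sizes and regularization weight
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

def f_op(u, v):  # the over-parameterized objective (6.7)
    res = A @ (u * v) - b
    return res @ res + nu * (u @ u + v @ v)

x = rng.standard_normal(m)
u = np.sign(x) * np.sqrt(np.abs(x))
v = np.sqrt(np.abs(x))                  # balanced factorization: u o v = x, |u| = |v|
lasso = (A @ x - b) @ (A @ x - b) + 2 * nu * np.linalg.norm(x, 1)
print(np.isclose(f_op(u, v), lasso))    # True: the penalty acts as 2*nu*||x||_1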

Example 6.7 (best rank-one matrix approximation).

Consider the following function $f:\mathbb{R}^{m}\rightarrow\mathbb{R}$ that appears in the rank-one matrix approximation

f(x):=\frac{1}{2}\|xx^{T}-M\|^{2}_{F}=\frac{1}{2}\sum_{i,j=1}^{m}(x_{i}x_{j}-M_{ij})^{2},

where we suppose for simplicity that $M$ is a symmetric $(m\times m)$ matrix with its maximum eigenvalue $\lambda_{\max}(M)>0$. Direct verification shows that

\nabla f(x)=\|x\|^{2}x-Mx\ \mbox{ and }\ \nabla^{2}f(x)=2xx^{T}-M+\|x\|^{2}I_{m},

and hence we have $\Gamma=\{x\in\mathbb{R}^{m}\;|\;Mx=\|x\|^{2}x\}$. Let us check next that $\Theta=\Omega$, where

\Omega:=\big\{x\in\mathbb{R}^{m}\;\big|\;x\mbox{ is an eigenvector of }M\mbox{ associated with the largest eigenvalue and }\|x\|=\sqrt{\lambda_{\max}(M)}\big\}.

Indeed, take $x\in\Theta$ and observe that if $x=0$, then $\nabla^{2}f(x)=-M$ has a negative eigenvalue, which is impossible by $x\in\Theta$. For $x\neq 0$, $x$ is an eigenvector of $M$ with the associated eigenvalue $\lambda=\|x\|^{2}$. Suppose that $\lambda<\lambda_{\max}(M)$ and get, for any unit eigenvector $d$ associated with $\lambda_{\max}(M)$ (so that $x^{T}d=0$, since eigenvectors of the symmetric matrix $M$ corresponding to distinct eigenvalues are orthogonal), that

0\leq d^{T}\nabla^{2}f(x)d=2(x^{T}d)^{2}-d^{T}Md+\|x\|^{2}\|d\|^{2}=-\lambda_{\max}(M)+\lambda.

This is also impossible, and so $\lambda=\lambda_{\max}(M)$, implying that $\|x\|=\sqrt{\lambda_{\max}(M)}$. Thus $\Theta\subset\Omega$.

Now take $x\in\Omega$ and see that for any $d\in\mathbb{R}^{m}$ we have

d^{T}(\nabla^{2}f(x))d=2(x^{T}d)^{2}-d^{T}Md+\|x\|^{2}\|d\|^{2}\geq 2(x^{T}d)^{2}-\lambda_{\max}(M)\|d\|^{2}+\lambda_{\max}(M)\|d\|^{2}\geq 0,

which shows that the reverse inclusion holds, and the claim follows.

Suppose next that the largest eigenvalue $\lambda_{\max}(M)$ is simple, i.e., its geometric multiplicity is one; note that this condition holds generically. Then the eigenspace of $M$ associated with $\lambda_{\max}(M)$ is one-dimensional, and so $\Theta$ consists of only two elements $\{\pm v\}$, where $v$ is an eigenvector associated with the largest eigenvalue and of length $\sqrt{\lambda_{\max}(M)}$. Both of these elements are global minimizers of $f$. Hence in this case, $f$ satisfies the strong strict saddle point property. Furthermore, it follows from Theorem 6.4 that the generalized metric subregularity holds for $\nabla f$ with respect to $(\psi,\Theta)$, where $\psi(t)=t^{\gamma}$ for some $\gamma>0$. Indeed, direct computations show that at $\overline{x}=\pm v$ we have $\nabla^{2}f(\overline{x})=2vv^{T}-M+\|v\|^{2}I_{m}=2vv^{T}-M+\lambda_{\max}(M)I_{m}$. This tells us that $d^{T}\nabla^{2}f(\overline{x})d\geq 0$ for all $d\in\mathbb{R}^{m}$, and that $d^{T}\nabla^{2}f(\overline{x})d=0$ if and only if $v^{T}d=0$ and $Md=\lambda_{\max}(M)d$, which happens only when $d=0$. Therefore, $\nabla^{2}f(\overline{x})\succ 0$, and so the metric subregularity holds for $\nabla f$ with respect to $\Theta$ at any $\overline{x}\in\Theta$.
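
The conclusions of this example are straightforward to verify numerically; in the sketch below, the symmetric matrix $M$ is randomly generated (so its largest eigenvalue is simple with probability one), which is our own illustrative choice.

import numpy as np

rng = np.random.default_rng(2)
m = 5
B = rng.standard_normal((m, m))
M = (B + B.T) / 2                         # symmetric test matrix
lam, V = np.linalg.eigh(M)                # eigenvalues in ascending order
lam_max, v = lam[-1], V[:, -1]
assert lam_max > 0                        # as assumed in the example
xbar = np.sqrt(lam_max) * v               # Theta = {xbar, -xbar}

grad = (xbar @ xbar) * xbar - M @ xbar    # grad f(x) = ||x||^2 x - Mx
hess = 2 * np.outer(xbar, xbar) - M + (xbar @ xbar) * np.eye(m)
print(np.linalg.norm(grad))               # ~ 0: xbar is stationary
print(np.linalg.eigvalsh(hess).min())     # > 0: Hessian positive definite at xbar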

Example 6.8 (generalized phase retrieval problem).

Consider the optimization problem

\min_{x\in\mathbb{R}^{m}}f(x)=\sum_{i=1}^{n}\varphi\big(g_{i}(x)\big),

where $\varphi:\mathbb{R}\rightarrow\mathbb{R}$ is a ${\cal C}^{2}$ strictly convex subanalytic function such that $\varphi^{\prime}(0)=0$, and where $g_{i}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ are defined by $g_{i}(x):=\frac{1}{\gamma_{i}}[(a_{i}^{T}x)^{2}-b_{i}^{2}]$ with $\gamma_{i}>0$, $a_{i}\in\mathbb{R}^{m}\backslash\{0\}$, and $b_{i}\in\mathbb{R}\backslash\{0\}$, $i=1,\ldots,n$. We consider the two typical choices of $\gamma_{i}$ given by $\gamma_{i}=1$ and $\gamma_{i}=|b_{i}|$, $i=1,\ldots,n$, and the two typical choices of $\varphi$ given by:

  • the polynomial loss function $\varphi_{PL}(t)=t^{2p}$ with $p\geq 1$;

  • the pseudo-Huber loss function $\varphi_{PHL}(t)=\delta^{2}\Big(\sqrt{1+\frac{t^{2}}{\delta^{2}}}-1\Big)$ with $\delta>0$.

This class of problems arises in phase retrieval models (see [8, 9, 40]) when the goal is to recover an unknown signal from the observed magnitude of some linear transformation such as, e.g., the Fourier transform.

Below we consider two settings: (I) the vectors $\{a_{1},\ldots,a_{n}\}$ are linearly independent; (II) the vectors $\{a_{1},\ldots,a_{n}\}$ are independent and identically distributed (i.i.d.) standard Gaussian random vectors, $b_{i}=a_{i}^{T}u$ for some $u\in\mathbb{R}^{m}\setminus\{0\}$, and $\varphi(t)=t^{2}$. Note that in the first setting, the linear independence condition can be easily verified, and it is usually satisfied when $m$ is much larger than $n$. The second setting is also frequently considered in the literature; see, e.g., [8, 9, 40].

First we consider setting (I), where $\{a_{1},\ldots,a_{n}\}$ is linearly independent, to verify that $\nabla f$ satisfies the generalized metric subregularity property with respect to the second-order stationary set $\Theta$. Observe that

\nabla f(x)=\sum_{i=1}^{n}\varphi^{\prime}(g_{i}(x))\nabla g_{i}(x)=\sum_{i=1}^{n}\frac{2}{\gamma_{i}}\,\big[(a_{i}^{T}x)\,\varphi^{\prime}\big(g_{i}(x)\big)\big]\,a_{i},

and thus it follows from the linear independence condition that

\Gamma=\big\{x\;\big|\;\nabla f(x)=0\big\}=\bigcap_{i=1}^{n}\Big\{x\;\Big|\;a_{i}^{T}x=0\ \mbox{or}\ \varphi^{\prime}\big(g_{i}(x)\big)=0\Big\},

and for all $x\in\mathbb{R}^{m}$ we have $\nabla^{2}f(x)=\sum_{i=1}^{n}\frac{2}{\gamma_{i}}\,\big\{\varphi^{\prime}\big(g_{i}(x)\big)+\frac{2}{\gamma_{i}}(a_{i}^{T}x)^{2}\,\varphi^{\prime\prime}(g_{i}(x))\big\}\,a_{i}a_{i}^{T}$. Let us now show that

\Theta=\big\{x\;\big|\;\varphi^{\prime}\big(g_{i}(x)\big)=0,\ i=1,\ldots,n\big\}. (6.9)

The inclusion “$\supset$” in (6.9) is obvious. To check the reverse inclusion, pick $x\in\Theta$ and suppose that there exists $i_{0}$ such that $\varphi^{\prime}\big(g_{i_{0}}(x)\big)\neq 0$. By $\Theta\subset\Gamma$, we have $a_{i_{0}}^{T}x=0$, which tells us that $g_{i_{0}}(x)=\frac{1}{\gamma_{i_{0}}}[(a_{i_{0}}^{T}x)^{2}-b_{i_{0}}^{2}]=-\frac{b_{i_{0}}^{2}}{\gamma_{i_{0}}}<0$. Since $\varphi^{\prime}(0)=0$ and $\varphi$ is strictly convex (and hence $\varphi^{\prime}$ is strictly increasing), it follows that $\varphi^{\prime}(g_{i_{0}}(x))<\varphi^{\prime}(0)=0$. The linear independence of $\{a_{1},\ldots,a_{n}\}$ allows us to choose $d\in\mathbb{R}^{m}$ with $a_{i_{0}}^{T}d\neq 0$ and $a_{i}^{T}d=0$ for all $i\neq i_{0}$. We then have $d^{T}\nabla^{2}f(x)d=\frac{2}{\gamma_{i_{0}}}\varphi^{\prime}(g_{i_{0}}(x))(a_{i_{0}}^{T}d)^{2}<0$, which contradicts the assumption that $x\in\Theta$ and thus justifies (6.9). Observe further that

\Theta=\big\{x\;\big|\;\varphi^{\prime}\big(g_{i}(x)\big)=0,\ i=1,\ldots,n\big\}=\big\{x\;\big|\;g_{i}(x)=0,\ i=1,\ldots,n\big\}=\big\{x\;\big|\;|a_{i}^{T}x|=|b_{i}|,\ i=1,\ldots,n\big\},

telling us that $\Theta$ reduces to the set of global minimizers. Therefore, the strong strict saddle point property holds for $f$ if the linear independence condition is satisfied. Then it follows from Theorem 6.4 that generalized metric subregularity is fulfilled for $\nabla f$ with respect to $(\psi,\Theta)$, where $\psi(t)=t^{\gamma}$ for some $\gamma>0$.

Let us further specify the exponent $\gamma$ in both bullet cases above by using Corollary 6.5. In the polynomial case $\varphi(t)=t^{2p}$ with $p\geq 1$, direct verification shows that $|\varphi^{\prime}(t)|\geq c\,\varphi(t)^{\frac{2p-1}{2p}}$ for some $c>0$. Thus $\varphi$ satisfies the KL property with the KL exponent $1-\frac{1}{2p}$. Let $g(x)=(g_{1}(x),\ldots,g_{n}(x))$ for all $x\in\mathbb{R}^{m}$, where the components $g_{i}$ are defined above. We deduce from the previous considerations that the inclusion $x\in\Theta$ implies that $a_{i}^{T}x\neq 0$ whenever $i=1,\ldots,n$. Hence for any $\overline{x}\in\Theta$, the derivative operator $\nabla g(\overline{x}):\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ is surjective by the linear independence assumption. Corollary 6.5 tells us in this case that the generalized metric subregularity holds for $\nabla f$ with respect to $(\psi,\Theta)$, where $\psi(t)=t^{2p-1}$.

In the case where $\varphi$ is the pseudo-Huber loss, we have $\varphi^{\prime\prime}(t)>0$ for all $t\in\mathbb{R}$, and it is not hard to verify that the function $\overline{\varphi}(y):=\sum_{i=1}^{n}\varphi(y_{i})$ satisfies the KL property at $\overline{y}=(g_{1}(\overline{x}),\ldots,g_{n}(\overline{x}))$ with the KL exponent $1/2$. Note also that $f(x):=(\overline{\varphi}\circ g)(x)$ with $g(x)=(g_{1}(x),\ldots,g_{n}(x))$ for all $x\in\mathbb{R}^{m}$, and so Corollary 6.5 implies that the standard metric subregularity holds for $\nabla f$ at any $\overline{x}\in\Theta$ with respect to $\Theta$.

Next we consider the second setting (II). As shown in [8] in the typical case of $\gamma_{i}=|b_{i}|=|a_{i}^{T}u|$, there exist positive constants $c,C$ such that if $n\geq Cm$, where $n$ is the number of samples and $m$ is the dimension of the underlying space, then with probability at least $1-e^{-cn}$ the only local minimizers of $f$ are $\pm u$ (which are also global minimizers), $f$ is strongly convex in a neighborhood of $\pm u$, and all the other critical points are either strict saddle points or local maximizers with strictly negative definite Hessians. Thus in this setting, when the number of samples $n$ is sufficiently large, we get with high probability that $\Theta=\{\pm u\}$ and that the strict saddle point property holds at any $\overline{x}\in\Theta$. It follows therefore from Corollary 6.5 that the standard metric subregularity is satisfied for $\nabla f$ at $\overline{x}$ with respect to $\Theta$. Note finally that in the case where $\gamma_{i}=1$, a similar result on the saddle point property is obtained in [9] under the stronger assumption that $n\geq Cm\log(m)$ for some positive number $C$.
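
As a quick numerical illustration of setting (I) with the polynomial loss, $p=1$, and $\gamma_{i}=1$, the following sketch (with randomly generated data, hence linearly independent $a_{i}$ with probability one) confirms both parts of the above analysis: points with $|a_{i}^{T}x|=|b_{i}|$ for all $i$ belong to $\Theta$, while a stationary point annihilating all $a_{i}^{T}x$ is a strict saddle.

import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3
a = rng.standard_normal((n, m))   # rows a_i, linearly independent (generic)
u0 = rng.standard_normal(m)
b = a @ u0                        # b_i = a_i^T u0, so g_i(u0) = 0

def grad_f(x):                    # phi(t) = t^2, phi'(t) = 2t, gamma_i = 1
    s = a @ x
    return a.T @ (4 * s * (s**2 - b**2))

def hess_f(x):
    s = a @ x
    w = 2 * (s**2 - b**2) + 4 * s**2   # phi'(g_i) + 2 (a_i^T x)^2 phi''(g_i)
    return (a.T * (2 * w)) @ a

print(np.linalg.norm(grad_f(u0)))                     # 0: u0 is stationary
print(np.linalg.eigvalsh(hess_f(u0)).min() >= -1e-9)  # True: u0 in Theta, cf. (6.9)

x0 = np.linalg.svd(a)[2][-1]      # a_i^T x0 = 0 for all i, hence x0 in Gamma
print(np.linalg.norm(grad_f(x0)))                     # ~ 0: stationary as well
print(np.linalg.eigvalsh(hess_f(x0)).min())           # < 0: strict saddle point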

7 High-order Regularized Newton Method with Momentum

In this section, we design a new high-order Newtonian method with momentum and justify its fast convergence by using the above results on generalized metric subregularity and related investigations.

For the optimization problem $\min_{x\in\mathbb{R}^{m}}f(x)$, the following assumptions are imposed:

Assumption 7.1 (major assumptions).

The objective function $f$ satisfies the following conditions:

(1) $f$ is ${\cal C}^{2}$-smooth and bounded below, i.e., $f^{*}=\inf_{x\in\mathbb{R}^{m}}f(x)>-\infty$.

(2) There exists a closed convex set $\mathcal{F}\subset\mathbb{R}^{m}$ with nonempty interior such that $\mathcal{L}(f(x_{0}))\subset{\rm int}(\mathcal{F})$, where $x_{0}\in{\rm int}(\mathcal{F})$ is a starting point of the iterative process.

(3) The gradient $\nabla f$ is Lipschitz continuous with modulus $L_{1}>0$ on $\mathcal{F}$, and the Hessian of $f$ is Hölder-continuous on $\mathcal{F}$ with exponent $q$, i.e., there exist numbers $L_{2}>0$ and $q\in(0,1]$ such that $\|\nabla^{2}f(x)-\nabla^{2}f(y)\|\leq L_{2}\|x-y\|^{q}$ for all $x,y\in\mathcal{F}$.

Given $x\in\mathbb{R}^{m}$ and $q\in(0,1]$, define the $(q+2)$th-order regularized quadratic approximation of $f$ at $x$ by

f_{\sigma}(y;x):=f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{1}{2}\langle\nabla^{2}f(x)(y-x),y-x\rangle+\frac{\sigma}{(q+1)(q+2)}\|y-x\|^{q+2}, (7.1)

where $\sigma>0$ serves as a regularization parameter. Consider also

\overline{f}_{\sigma}(x):=\min_{y\in\mathbb{R}^{m}}f_{\sigma}(y;x)\quad\text{and}\quad y_{\sigma}(x)\in\mathop{\arg\min}_{y\in\mathbb{R}^{m}}f_{\sigma}(y;x). (7.2)

Here is our high-order Newtonian method with momentum steps.

Algorithm 1 High-order regularization method with momentum
 1: Input: initial point $x_{0}=\widehat{x}_{0}\in\mathbb{R}^{m}$, scalars $\overline{\sigma}\in\left(\frac{2L_{2}}{q+2},L_{2}\right]$ and $\zeta\in[0,1)$.
 2: for $k=0,1,\ldots$ do
 3: Regularization step: Choose $\sigma_{k}\in[\overline{\sigma},2L_{2}]$ and find

\widehat{x}_{k+1}:=y_{\sigma_{k}}(x_{k})\in\mathop{\arg\min}_{y\in\mathbb{R}^{m}}f_{\sigma_{k}}(y;x_{k}). (7.3)

 4: Momentum step:

\beta_{k+1}=\min\big\{\zeta,\|\nabla f(\widehat{x}_{k+1})\|,\|\widehat{x}_{k+1}-x_{k}\|\big\}, (7.4)
\widetilde{x}_{k+1}=\widehat{x}_{k+1}+\beta_{k+1}(\widehat{x}_{k+1}-\widehat{x}_{k}). (7.5)

 5: Monotone step:

x_{k+1}=\mathop{\arg\min}_{x\in\{\widehat{x}_{k+1},\widetilde{x}_{k+1}\}}f(x). (7.6)

 6: end for

Note that when $q=1$ and $\zeta=0$ (i.e., the Hessian is locally Lipschitzian and there is no momentum step), this algorithm reduces to the classical cubic regularized Newton method proposed in [32]. In the case where $\zeta=0$ (i.e., no momentum steps are considered), this method has also been considered in [15] with a comprehensive analysis of its complexity. The superlinear/quadratic convergence of the cubic Newton method was first established in [32] under the nondegeneracy condition and then relaxed in [40] to the uniform metric subregularity/error bound condition on $\nabla f$. We also note that if $L_{2}$ can be obtained easily, then the parameter $\sigma_{k}$ in Algorithm 1 can be chosen as $L_{2}$; otherwise, $\sigma_{k}$ can be obtained by using the line search strategies outlined in [32]. Incorporating momentum steps is a well-adopted technique in first-order optimization to accelerate numerical performance, and so it is natural to incorporate the momentum technique into high-order regularization methods as well.
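
The following compact sketch shows the structure of Algorithm 1 for $q=1$ (the cubic case). It is not the implementation used in our experiments: for brevity, a generic local solver stands in for the global minimization required in the regularization step (7.3), and $\sigma_{k}$ is held at a fixed illustrative value instead of being produced by a line search.

import numpy as np
from scipy.optimize import minimize

def high_order_newton_momentum(f, grad_f, hess_f, x0, sigma=1.0, zeta=0.1,
                               iters=50, q=1.0):
    x = x0.copy()
    x_hat_prev = x0.copy()
    for _ in range(iters):
        g, H = grad_f(x), hess_f(x)
        def model(y):  # regularized quadratic approximation (7.1) at x
            d = y - x
            return (f(x) + g @ d + 0.5 * d @ H @ d
                    + sigma / ((q + 1) * (q + 2)) * np.linalg.norm(d) ** (q + 2))
        x_hat = minimize(model, x).x                       # regularization step (7.3)
        beta = min(zeta, np.linalg.norm(grad_f(x_hat)),
                   np.linalg.norm(x_hat - x))              # momentum weight (7.4)
        x_tilde = x_hat + beta * (x_hat - x_hat_prev)      # momentum step (7.5)
        x = x_hat if f(x_hat) <= f(x_tilde) else x_tilde   # monotone step (7.6)
        x_hat_prev = x_hat
    return x

In serious use, the subproblem should be solved globally via the convex or eigenvalue reformulations discussed below, which is what guarantees the optimality conditions (7.7)–(7.8).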

To illustrate this, we see from the following plots (Figures 1 and 2) that incorporating the momentum steps into the cubic Newton method for solving phase retrieval and matrix completion problems can lead to faster convergence with comparable quality of numerical solutions. Here we follow the setup of [40] for both problems with dimension $128$ and use their code (available at https://github.com/ZiruiZhou/cubicreg_app.git) for implementing the cubic regularized Newton method. For the cubic regularized Newton method with momentum steps, we choose $\zeta=0.1$ and set $q=1$ in Algorithm 1 and follow the same approach as in [40] to solve the subproblems (7.3). Then we use the same notion of relative error as in [40] and plot the relative error (in the log scale) against the number of iterations. While it is clear from the presented numerical illustrations that incorporating momentum steps can have numerical advantages, further extensive numerical experiments are definitely needed; this goes beyond the scope of this paper and will be carried out in a separate study.

Figure 1: Phase retrieval problem

Figure 2: Matrix completion problem

Let us emphasize that, to the best of our knowledge, superlinear/quadratic convergence for high-order regularized methods with momentum steps has not been considered in the literature. We also mention that, even when it is specialized to the celebrated cubic regularized Newton method without momentum steps, our assumptions for quadratic convergence are strictly weaker than the ones imposed in [40].

Note that in the regularization step, Algorithm 1 requires computing $y_{\sigma_{k}}(x_{k})$, which is a global minimizer of $f_{\sigma_{k}}(\cdot;x_{k})$. This can be done by solving either an equivalent convex optimization problem or an eigenvalue problem; see, e.g., [15, 32, 40]. The necessary and sufficient optimality conditions for global minimizers of the regularized quadratic approximation (7.1) presented in the next lemma are taken from [16].

Lemma 7.2 (optimality conditions for regularized quadratic approximations).

Given $x\in\mathcal{F}$, we have that $y_{\sigma}(x)$ is a global minimizer of (7.1) if and only if the following conditions hold:

\nabla f(x)+\nabla^{2}f(x)(y_{\sigma}(x)-x)+\frac{\sigma}{q+1}\|y_{\sigma}(x)-x\|^{q}(y_{\sigma}(x)-x)=0, (7.7)
\nabla^{2}f(x)+\frac{\sigma}{q+1}\|y_{\sigma}(x)-x\|^{q}I_{m}\succeq 0. (7.8)
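
For $q=1$, conditions (7.7) and (7.8) suggest a standard computational recipe, sketched below under the simplifying assumption that the degenerate “hard case” (where the gradient is orthogonal to the bottom eigenspace) does not occur: diagonalize $\nabla^{2}f(x)$, observe that (7.7) gives $d=-(\nabla^{2}f(x)+\frac{\sigma}{2}rI_{m})^{-1}\nabla f(x)$ with $r=\|d\|$, and find $r$ by one-dimensional bisection.

import numpy as np

def solve_cubic_subproblem(grad, H, sigma, tol=1e-12):
    lam, Q = np.linalg.eigh(H)                       # H = Q diag(lam) Q^T
    c = Q.T @ grad
    norm_d = lambda r: np.linalg.norm(c / (lam + 0.5 * sigma * r))
    lo = max(0.0, -2.0 * lam[0] / sigma) + 1e-12     # (7.8): the shifted Hessian must be PSD
    hi = max(lo, 1.0)
    while norm_d(hi) > hi:                           # bracket the root of ||d(r)|| = r
        hi *= 2.0
    for _ in range(200):                             # bisection on the decreasing gap
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_d(mid) > mid else (lo, mid)
        if hi - lo < tol:
            break
    r = 0.5 * (lo + hi)
    return -Q @ (c / (lam + 0.5 * sigma * r))        # the step y_sigma(x) - x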

For the reader’s convenience, we also formulate straightforward consequences of our major requirements in Assumption 7.1 that can be found in [15, (2.7) and (2.8)].

Lemma 7.3 (consequences of major assumptions).

Suppose that all the requirements of Assumption 7.1 are satisfied. Then for any $x,y\in\mathcal{F}$ we have the estimates

\|\nabla f(y)-\nabla f(x)-\nabla^{2}f(x)(y-x)\|\leq\frac{L_{2}}{q+1}\|y-x\|^{q+1}, (7.9)
\left|f(y)-f(x)-\langle\nabla f(x),y-x\rangle-\frac{1}{2}\langle\nabla^{2}f(x)(y-x),y-x\rangle\right|\leq\frac{L_{2}}{(q+1)(q+2)}\|y-x\|^{q+2}. (7.10)

The next technical lemma is needed for the convergence justification below.

Lemma 7.4 (relationships for Hölder exponents).

Given $b\in(0,\infty)$ and $q\in(0,1]$, the following assertions hold.

(i) For each $\alpha>0$, there is a unique solution $t(\alpha)>2$ to the equation

\alpha t^{q+1}-2\alpha t^{q}-b(q+1)t-b=0.

Moreover, the function $\alpha\mapsto t(\alpha)$ is decreasing on $(0,\infty)$.

(ii) Letting $a\in(0,\infty)$ and $\ell:=t(a)>0$, suppose that

ax^{q+1}\leq by^{q+1}+2ax^{q}y+b(q+1)xy^{q}\ \mbox{ for }\ x,y\in\mathbb{R}_{+}. (7.11)

Then for such numbers $x$ and $y$, we have the inequality $x\leq\ell y$.

Proof.

To verify (i), define the function $f(\alpha,t):=\alpha t^{q+1}-2\alpha t^{q}-b(q+1)t-b$ for all $(\alpha,t)\in(0,\infty)\times\mathbb{R}_{+}$ and denote $t(\alpha)=\{t\;|\;f(\alpha,t)=0\}$ for $\alpha\in(0,\infty)$, which turns out to be single-valued as shown below. Direct calculations give us the expressions

\frac{\partial f}{\partial t}(\alpha,t)=\alpha(q+1)t^{q}-2\alpha qt^{q-1}-b(q+1),
\frac{\partial^{2}f}{\partial t^{2}}(\alpha,t)=\alpha q(q+1)t^{q-1}-2\alpha q(q-1)t^{q-2}=\alpha qt^{q-2}\big((q+1)t-2(q-1)\big)>0,\quad t\in(0,\infty),

and hence $\frac{\partial f}{\partial t}(\alpha,\cdot)$ is increasing on $\mathbb{R}_{+}$. Combining this with the facts that $\lim_{t\to 0^{+}}\frac{\partial f}{\partial t}(\alpha,t)<0$ and $\lim_{t\to\infty}\frac{\partial f}{\partial t}(\alpha,t)=\infty$ for any $\alpha\in(0,\infty)$ guarantees the existence of $\widetilde{t}(\alpha)>0$ such that $f(\alpha,\cdot)$ is decreasing on $(0,\widetilde{t}(\alpha)]$ and increasing on $(\widetilde{t}(\alpha),\infty)$. Observing that $f(\alpha,0)=-b<0$ and $\lim_{t\to\infty}f(\alpha,t)=\infty$ for any $\alpha\in(0,\infty)$, we see that the mapping $\alpha\mapsto t(\alpha)$ with $\alpha\in(0,\infty)$ is single-valued and satisfies the implication

f(\alpha,t)\leq 0\Longrightarrow t\leq t(\alpha). (7.12)

Defining further the univariate function

\alpha(t):=\frac{b+b(q+1)t}{t^{q+1}-2t^{q}}\quad\mbox{for all }\ t\in(2,\infty),

we claim that $t\mapsto\alpha(t)$ is decreasing on $(2,\infty)$ with the range $(0,\infty)$. This clearly ensures that its inverse function $\alpha\mapsto t(\alpha)$ is decreasing on $(0,\infty)$ with the range $(2,\infty)$, and so (i) follows.

Thus to prove (i), it remains to verify the formulated claim. Direct calculations show that

\alpha^{\prime}(t)=\frac{b(q+1)(t^{q+1}-2t^{q})-(b+b(q+1)t)\big((q+1)t^{q}-2qt^{q-1}\big)}{(t^{q+1}-2t^{q})^{2}}=\frac{bt^{q-1}}{(t^{q+1}-2t^{q})^{2}}\Big[(q+1)(t^{2}-2t)-(1+(q+1)t)\big((q+1)t-2q\big)\Big]=\frac{bt^{q-1}}{(t^{q+1}-2t^{q})^{2}}\Big[-q(q+1)t^{2}+(2q^{2}-q-3)t+2q\Big]. (7.13)

Considering the function $h(t):=-q(q+1)t^{2}+(2q^{2}-q-3)t+2q$ for all $t\in\mathbb{R}$, observe that $\lim_{t\to-\infty}h(t)=-\infty$, $h(0)=2q>0$, and $h(2)=-4q-6<0$, which allows us to deduce from (7.13) that $\alpha(t)$ is decreasing on $(2,\infty)$. Combining this with the facts that $\lim_{t\to 2}\alpha(t)=\infty$ and $\lim_{t\to\infty}\alpha(t)=0$ justifies the above claim and hence the entire assertion (i).

To verify now assertion (ii), we deduce from (7.11) that $ax^{q+1}-by^{q+1}-2ax^{q}y-b(q+1)xy^{q}\leq 0$. Letting $x:=ty$ for some $t\in\mathbb{R}_{+}$ and substituting this into the above inequality tells us that

\left(at^{q+1}-2at^{q}-b(q+1)t-b\right)y^{q+1}\leq 0. (7.14)

If $y=0$, then (7.11) implies that $x=0$, and so (ii) trivially holds. To justify the inequality $x\leq\ell y$, it thus suffices to consider the case where $y>0$. Dividing both sides of (7.14) by $y^{q+1}$, we get

f(a,t)=at^{q+1}-2at^{q}-b(q+1)t-b\leq 0,

and therefore the claimed inequality follows from (7.12) and the fact that $t(\alpha)>2$. The proof is complete. ∎
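
Since $f(\alpha,2)=-2b(q+1)-b<0$ and $f(\alpha,\cdot)$ eventually increases to $+\infty$, the root $t(\alpha)$ (and hence the constant $\ell_{\sigma}$ of Lemma 7.5 below, obtained with $\alpha=\sigma$ and $b=L_{2}$) can be computed by simple bisection; a minimal sketch with illustrative inputs follows, which also lets one observe the monotonicity asserted in (i).

def t_alpha(alpha, b, q, tol=1e-12):
    # unique root > 2 of f(alpha, t) = alpha t^{q+1} - 2 alpha t^q - b(q+1)t - b
    f = lambda t: alpha * t ** (q + 1) - 2 * alpha * t ** q - b * (q + 1) * t - b
    lo, hi = 2.0, 4.0
    while f(hi) < 0:          # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

print(t_alpha(0.5, 1.0, 1.0) > t_alpha(2.0, 1.0, 1.0))  # True: t(alpha) is decreasing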

The following lemma is an important step to justify the desired performance of Algorithm 1.

Lemma 7.5 (estimates for regularized quadratic approximations).

Under Assumption 7.1, let $x\in\mathcal{F}$, and let $\widecheck{x}$ be a projection point of $x$ onto $\Theta$. If $\widecheck{x}\in\mathcal{F}$, then for any $\sigma>0$ we have the estimate

\|y_{\sigma}(x)-x\|\leq\ell_{\sigma}\,d(x,\Theta), (7.15)

where $y_{\sigma}(x)$ is a minimizer of the regularized quadratic approximation from (7.1) and (7.2), and where $\ell_{\sigma}:=t^{*}$ is the unique positive solution $t^{*}$ of the equation $H_{\sigma}(t)=\sigma t^{q+1}-2\sigma t^{q}-L_{2}(q+1)t-L_{2}=0$ with $L_{2}$ taken from part (3) of Assumption 7.1.

Proof.

Denote $z:=y_{\sigma}(x)$ and deduce from condition (7.2) and Lemma 7.2 that

0=\nabla f(x)+\nabla^{2}f(x)(z-x)+\frac{\sigma}{q+1}\|z-x\|^{q}(z-x). (7.16)

By $\widecheck{x}\in\Theta$, we have $\nabla f(\widecheck{x})=0$ and $\nabla^{2}f(\widecheck{x})\succeq 0$. Then it follows from (7.16) that

\left(\nabla^{2}f(\widecheck{x})+\frac{\sigma}{q+1}\|z-x\|^{q}I_{m}\right)(z-\widecheck{x})=\nabla^{2}f(\widecheck{x})\big((z-x)-(\widecheck{x}-x)\big)+\frac{\sigma}{q+1}\|z-x\|^{q}\big((z-x)-(\widecheck{x}-x)\big)=\nabla f(\widecheck{x})-\nabla f(x)-\nabla^{2}f(\widecheck{x})(\widecheck{x}-x)-\frac{\sigma}{q+1}\|z-x\|^{q}(\widecheck{x}-x)-(\nabla^{2}f(x)-\nabla^{2}f(\widecheck{x}))(z-x).

Combining the latter with $\nabla^{2}f(\widecheck{x})\succeq 0$ gives us

\left\|\left(\nabla^{2}f(\widecheck{x})+\frac{\sigma}{q+1}\|z-x\|^{q}I_{m}\right)(z-\widecheck{x})\right\|\geq\frac{\sigma}{q+1}\|z-x\|^{q}\|z-\widecheck{x}\|,

and then employing $\widecheck{x}\in\mathcal{F}$, Lemma 7.3, and Assumption 7.1(3), we arrive at

\frac{\sigma}{q+1}\|z-x\|^{q}\|z-\widecheck{x}\|\leq\|\nabla f(x)-\nabla f(\widecheck{x})-\nabla^{2}f(\widecheck{x})(x-\widecheck{x})\|+\frac{\sigma}{q+1}\|z-x\|^{q}\|x-\widecheck{x}\|+\|\nabla^{2}f(x)-\nabla^{2}f(\widecheck{x})\|\cdot\|z-x\|\leq\frac{L_{2}}{q+1}\|x-\widecheck{x}\|^{q+1}+\frac{\sigma}{q+1}\|z-x\|^{q}\|x-\widecheck{x}\|+L_{2}\|z-x\|\cdot\|x-\widecheck{x}\|^{q}. (7.17)

Applying further the triangle inequality $\|z-\widecheck{x}\|\geq\|z-x\|-\|x-\widecheck{x}\|$ ensures that

\frac{\sigma}{q+1}\|z-x\|^{q+1}-\frac{\sigma}{q+1}\|z-x\|^{q}\|x-\widecheck{x}\|\leq\frac{\sigma}{q+1}\|z-x\|^{q}\|z-\widecheck{x}\|,

which being substituted into (7.17) yields

\frac{\sigma}{q+1}\|z-x\|^{q+1}\leq\frac{L_{2}}{q+1}\|x-\widecheck{x}\|^{q+1}+\frac{2\sigma}{q+1}\|z-x\|^{q}\|x-\widecheck{x}\|+L_{2}\|z-x\|\cdot\|x-\widecheck{x}\|^{q}.

By Lemma 7.4(ii), this allows us to find a positive scalar $\ell_{\sigma}$ (depending only on $\sigma$) such that $\|z-x\|\leq\ell_{\sigma}\|x-\widecheck{x}\|$. Recalling that $z=y_{\sigma}(x)$ and $d(x,\Theta)=\|x-\widecheck{x}\|$ verifies the desired inequality (7.15). ∎

The last lemma here plays a significant role in the convergence analysis of our algorithm. Its proof is rather technical and mainly follows the lines of justifying fast convergence of the cubic regularization method in [40] with additional adaptations to accommodate the momentum steps and the Hölderian continuity of Hessians. For completeness and the reader's convenience, we present details in the Appendix (Section 9).

Lemma 7.6 (estimates for iterates of the high-order regularized method with momentum).

Let \{x_{k}\} be a sequence of iterates in Algorithm 1 under Assumption 7.1. If the set \mathcal{L}(f(x_{k_{0}})) is bounded for some k_{0}\in\mathbb{N}, then the following assertions hold.

(i) For all k\in\mathbb{N} with k\geq k_{0}, we have the inequalities

f(x_{k+1})+\frac{(q+2)\overline{\sigma}-2L_{2}}{2(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}\leq f(x_{k}), (7.18)

and the limit v:=\lim_{k\to\infty}f(x_{k}) exists.

(ii) For all k\in\mathbb{N} with k\geq k_{0}, the relationships

\|x_{k+1}-x_{k}\|\leq(1+\gamma)\|\widehat{x}_{k+1}-x_{k}\|,\quad\lim_{k\to\infty}\|\widehat{x}_{k+1}-x_{k}\|=0, (7.19)
\|\nabla f(x_{k+1})\|\leq\frac{3(1+\gamma L_{1})L_{2}}{q+1}\|\widehat{x}_{k+1}-x_{k}\|^{q+1} (7.20)

are fulfilled, where the number \gamma>0 in (7.19) is defined by

\gamma:=\frac{1}{1-\zeta}\left[\frac{2(q+1)(q+2)}{(q+2)\overline{\sigma}-2L_{2}}(f(x_{0})-f^{*})\right]^{\frac{1}{q+2}}. (7.21)

(iii) The set \Omega of accumulation points of \{x_{k}\} is nonempty. Moreover, for any \overline{x}\in\Omega we have

f(\overline{x})=v,\quad\nabla f(\overline{x})=0,\;\mbox{ and }\;\nabla^{2}f(\overline{x})\succeq 0. (7.22)

Now we are in a position to establish the main result on the performance of Algorithm 1.

Theorem 7.7 (convergence rate of the high-order algorithm under generalized metric subregularity).

Let the iterative sequence \{x_{k}\} be generated by Algorithm 1, and let \overline{x} be an accumulation point of it, provided that the set \mathcal{L}(f(x_{k_{0}})) is bounded for some k_{0}\in\mathbb{N}. In addition to Assumption 7.1, suppose that there exists \eta>0 such that the generalized metric subregularity condition

\psi\big{(}d(x,\Theta)\big{)}\leq\|\nabla f(x)\|\;\mbox{ for all }\;x\in B_{\mathbb{R}^{m}}(\overline{x},\eta) (7.23)

holds with respect to (\psi,\Theta), where \psi\colon\mathbb{R}_{+}\to\mathbb{R}_{+} is an admissible function, and where \Theta is the set of second-order stationary points (2.3). Define the function \tau:\mathbb{R}_{+}\to\mathbb{R}_{+} by

\tau(t):=\psi^{-1}\Big{(}\frac{3(1+\gamma L_{1})L_{2}\ell^{q+1}}{q+1}t^{q+1}\Big{)},\quad t\in\mathbb{R}_{+}, (7.24)

where \ell:=\ell_{\overline{\sigma}} with \ell_{\sigma} taken from Lemma 7.5, and where \gamma is given in (7.21). If \limsup_{t\to 0^{+}}\frac{\tau(t)}{t}<1, then \overline{x}\in\Theta and the sequence \{x_{k}\} converges to \overline{x} as k\to\infty with the convergence rate

\limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\tau(\|x_{k-1}-\overline{x}\|)}<\infty.
Proof.

By Lemma 7.6, we have the inclusion \Omega\subset\Theta together with all the conditions in (7.18)–(7.20). Let us now verify the estimate

\|\widehat{x}_{k+1}-x_{k}\|\leq\ell\,d(x_{k},\Theta) (7.25)

for any k\in\mathbb{N} sufficiently large. Having this, we arrive at all of the claimed conclusions of the theorem by applying the abstract convergence results of Theorem 4.4 with \beta(t):=t^{q+1} and e_{k}:=\|\widehat{x}_{k+1}-x_{k}\|.

To justify (7.25), let \widecheck{x}_{k} be a projection point of x_{k} to \Theta. Then we deduce from Lemma 4.1(iii) that

\lim_{k\to\infty}d(x_{k},\Theta)=\lim_{k\to\infty}\|x_{k}-\widecheck{x}_{k}\|=0. (7.26)

Moreover, observing that x_{k}\in\mathcal{L}(f(x_{0}))\subset{\rm int}(\mathcal{F}) for all k\in\mathbb{N} and that \{x_{k}\} remains bounded allows us to find a compact set \mathcal{C}\subset{\rm int}(\mathcal{F}) such that \{x_{k}\}\subset\mathcal{C}. Based on \{x_{k}\}\subset\mathcal{C} and the conditions given by (7.26), we get that \lim_{k\to\infty}d(\widecheck{x}_{k},\mathcal{C})=0. Coupling this with the inclusion \mathcal{C}\subset{\rm int}(\mathcal{F}) and the compactness of \mathcal{C} indicates that \widecheck{x}_{k}\in{\rm int}(\mathcal{F}) for all large k. Consequently, there exists k_{1}\geq 1 such that x_{k},\widecheck{x}_{k}\in\mathcal{F} whenever k\geq k_{1}. Employing Lemmas 7.4 and 7.5 together with \sigma_{k}\geq\overline{\sigma}>0 for all k gives us the number \ell:=\ell_{\overline{\sigma}} from Lemma 7.5 ensuring the estimate

\|\widehat{x}_{k+1}-x_{k}\|\leq\ell\,d(x_{k},\Theta)\;\mbox{ for all large }\;k\in\mathbb{N},

which justifies (7.25) and thus completes the proof of the theorem. ∎
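To see the mechanism behind the rate statement, one can iterate the one-dimensional model recursion d_{k+1}=\tau(d_{k}) with the prototypical choice \tau(t)=Ct^{r}, r>1, mimicking \|x_{k}-\overline{x}\|\leq\tau(\|x_{k-1}-\overline{x}\|) up to a constant. A minimal Python sketch with illustrative constants (our own choices, not derived from the theorem):

C, r, d = 2.0, 2.0, 0.1    # model recursion d_{k+1} = C*d_k^r with r > 1; here C*d^(r-1) < 1 near zero
for k in range(6):
    d_next = C*d**r
    print(k, d_next, d_next/d**r)    # middle column collapses superlinearly; last column stays bounded
    d = d_next

The bounded last column is precisely the form of the conclusion \limsup_{k\to\infty}\|x_{k}-\overline{x}\|/\tau(\|x_{k-1}-\overline{x}\|)<\infty.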

As a consequence of Theorem 7.7, we now show that if \nabla f enjoys the (pointwise) exponent metric subregularity property with respect to the second-order stationary set \Theta, then the sequence generated by our algorithm converges to a second-order stationary point at least superlinearly, with an explicit convergence order under a suitable condition on the exponent.

Corollary 7.8 (superlinear convergence under exponent metric subregularity).

Let f satisfy Assumption 7.1 with the qth-order Hölder continuous Hessian for some q\in(0,1], let \{x_{k}\} be generated by Algorithm 1, and let the set \mathcal{L}(f(x_{k_{0}})) be bounded for some k_{0}\in\mathbb{N}. Taking \overline{x}\in\Omega and p>0, suppose that there exist c,\eta>0 such that

d(x,\Theta)^{p}\leq c\,\|\nabla f(x)\|\;\mbox{ for all }\;x\in B_{\mathbb{R}^{m}}(\overline{x},\eta).

If p<q+1, then \{x_{k}\} converges at least superlinearly to \overline{x}\in\Theta as k\to\infty with the rate \frac{q+1}{p}, i.e.,

\limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\|x_{k-1}-\overline{x}\|^{\frac{q+1}{p}}}<\infty.
Proof.

Choosing \psi(t):=c^{-1}t^{p}, we get \tau(t)=O(t^{\frac{q+1}{p}}) in (7.24). Since p<q+1, this gives \limsup_{t\to 0^{+}}\tau(t)/t=0<1, and so the conclusion follows from Theorem 7.7. ∎

Following the discussion in Remark 5.3, we can establish global convergence of the sequence \{x_{k}\} under the KL property, together with convergence rates depending on the KL exponent. For brevity, the proof is omitted; a small sketch tabulating the three rate regimes is given after the proposition.

Proposition 7.9 (convergence under the KL property).

Let f satisfy Assumption 7.1 with the qth-order Hölder continuous Hessian for some q\in(0,1]. Let \{x_{k}\} be generated by Algorithm 1, and let the set \mathcal{L}(f(x_{k_{0}})) be bounded for some k_{0}\in\mathbb{N}. Take \overline{x}\in\Omega and suppose that f satisfies the KL property at \overline{x}. Then the sequence \{x_{k}\} converges to \overline{x}, which is a second-order stationary point of f. If furthermore f satisfies the KL property at \overline{x} with the KL exponent \theta, then we have the following convergence rate:

\bullet If \theta=\frac{q+1}{q+2}, then the sequence \{x_{k}\} converges linearly.

\bullet If \theta>\frac{q+1}{q+2}, then the sequence \{x_{k}\} converges sublinearly with the rate O(k^{-\frac{(q+1)(1-\theta)}{(q+2)\theta-(q+1)}}).

\bullet If \theta<\frac{q+1}{q+2}, then the sequence \{x_{k}\} converges superlinearly with the rate \frac{q+1}{(q+2)\theta}.
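The trichotomy in Proposition 7.9 is easy to tabulate. Below is a minimal Python helper (an illustrative utility of our own, not part of Algorithm 1) returning the regime and rate for given \theta and q:

def kl_rate(theta, q):
    # rate regimes of Proposition 7.9 for KL exponent theta and Hoelder order q in (0, 1]
    crit = (q + 1)/(q + 2)
    if theta == crit:
        return "linear"
    if theta > crit:
        expo = (q + 1)*(1 - theta)/((q + 2)*theta - (q + 1))
        return "sublinear, O(k^(-%.4g))" % expo
    return "superlinear, order %.4g" % ((q + 1)/((q + 2)*theta))

print(kl_rate(theta=0.75, q=1))   # sublinear, O(k^(-2)); cf. Corollary 7.13(ii)
print(kl_rate(theta=2/3, q=1))    # linear
print(kl_rate(theta=0.5, q=1))    # superlinear, order 1.333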

Remark 7.10 (comparison with known results).

To the best of our knowledge, the derived (superlinear/linear/sublinear) convergence rates are new in the literature for high-order regularization methods with momentum steps without any convexity assumptions. To compare in more detail, consider the case where p=1, i.e., the metric subregularity of \nabla f at \overline{x} holds with respect to the second-order stationary set \Theta. In this case, when applying the high-order regularization methods with momentum steps to a {\cal C}^{2}-smooth function f with the qth-order Hölder continuous Hessian, we obtain the superlinear convergence of \{x_{k}\} with rate q+1 to a second-order stationary point. Suppose further that q=1 and no momentum step is used. In that case, we recover the quadratic convergence rate to a second-order stationary point for the cubic regularized Newton method, but under the weaker assumption of the pointwise metric subregularity at \overline{x} instead of the stronger error bound condition used in [40].

Next we provide two examples with explicit degeneracy illustrating that our convergence results go beyond the current literature on regularized Newton-type methods in optimization.

Example 7.11 (fast and slow convergence without the error bound condition).

Consider the function f(x):=(w^{T}x-a)^{2}(w^{T}x-b)^{2p} with x\in\mathbb{R}^{m}, w\in\mathbb{R}^{m}\backslash\{0\}, 0\leq a<b, and m,p\geq 2. Direct calculations give us the exact expressions for the first-order stationary set \Gamma=\bigcup_{r\in\{a,b,\frac{b+pa}{1+p}\}}\big{\{}x\;\big{|}\;w^{T}x=r\big{\}} and the second-order stationary set \Theta=\bigcup_{r\in\{a,b\}}\big{\{}x\;\big{|}\;w^{T}x=r\big{\}}. For any \overline{x}\in\Theta, we see that \nabla^{2}f(\overline{x}) has a zero eigenvalue. Similarly to Example 3.5, it is easy to check that the uniform metric subregularity/error bound condition for \nabla f with respect to \Theta fails in this case.

Let \overline{x}\in\Theta=\Theta_{1}\cup\Theta_{2}, where \Theta_{1}=\{x\;|\;w^{T}x=a\} and \Theta_{2}=\{x\;|\;w^{T}x=b\}. If \overline{x}\in\Theta_{1}, then there exists C>0 such that d(x,\Theta)=d(x,\Theta_{1})\leq C\,\|\nabla f(x)\| for all x\in\mathbb{B}_{\mathbb{R}^{m}}(\overline{x},(b-a)/4), and so the metric subregularity condition of \nabla f holds at \overline{x} with respect to \Theta. On the other hand, direct verification shows that f satisfies the KL property at \overline{x} with the KL exponent 1-\frac{1}{2p} for \overline{x}\in\Theta_{2}.

Let \mathcal{F} be a compact set, and let x_{0}\in\mathcal{F} be such that Assumption 7.1 is satisfied. Then we see that the Hessian of f is Lipschitz continuous on \mathcal{F}, i.e., q=1. This yields the convergence x_{k}\to\overline{x}\in\Theta of the iterative sequence in Algorithm 1. If \overline{x}\in\Theta_{1}, then Corollary 7.8 implies that the convergence rate is quadratic. If \overline{x}\in\Theta_{2}, then it follows from Proposition 7.9 that \{x_{k}\} converges to \overline{x} sublinearly with the rate O(k^{-\frac{2}{2p-3}}).
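For a concrete feel of this dichotomy, the cubic-regularized step (q=1) can be computed by a one-dimensional bisection on \lambda=\|s\| in the stationarity condition \nabla f(x)+\nabla^{2}f(x)s+\frac{\sigma}{2}\|s\|s=0. Below is a minimal Python sketch for the instance m=2, w=(1,1), a=0, b=1, p=2, with a fixed \sigma of our own choosing, so it illustrates the plain regularized iteration rather than the full safeguarded Algorithm 1; starting near \Theta_{1}, the printed distances d(x_{k},\Theta_{1}) decay roughly quadratically:

import numpy as np

w = np.array([1.0, 1.0])                      # instance of Example 7.11 with m = 2
a, b, p, sigma = 0.0, 1.0, 2, 50.0            # a, b, p from the example; sigma is our own choice

def grad(x):
    u = w @ x
    gp = 2*(u - a)*(u - b)**(2*p) + 2*p*(u - a)**2*(u - b)**(2*p - 1)
    return gp*w

def hess(x):
    u = w @ x
    gpp = (2*(u - b)**(2*p) + 8*p*(u - a)*(u - b)**(2*p - 1)
           + 2*p*(2*p - 1)*(u - a)**2*(u - b)**(2*p - 2))
    return gpp*np.outer(w, w)

def cubic_step(x):
    # global minimizer s of the cubic model: grad + hess @ s + (sigma/2)*||s||*s = 0,
    # found by bisection on lam = ||s|| with hess + (sigma/2)*lam*I positive definite
    g, H = grad(x), hess(x)
    lo = max(0.0, -2.0*np.linalg.eigvalsh(H)[0]/sigma) + 1e-14
    s_of = lambda lam: np.linalg.solve(H + 0.5*sigma*lam*np.eye(len(x)), -g)
    hi = lo + 1.0
    while np.linalg.norm(s_of(hi)) > hi:      # expand until ||s(lam)|| <= lam
        hi *= 2.0
    for _ in range(200):                      # ||s(lam)|| - lam decreases on (lo, infinity)
        mid = 0.5*(lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(s_of(mid)) > mid else (lo, mid)
    return s_of(hi)

x = np.array([0.15, 0.05])                    # start near Theta_1 = {x : w^T x = a}
for k in range(6):
    x = x + cubic_step(x)
    print(k, abs(w @ x - a)/np.linalg.norm(w))   # d(x_k, Theta_1)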

Example 7.12 (linear and sublinear convergence under the KL property).

For \gamma>2 and M>0, define the {\cal C}^{2}-smooth function f:\mathbb{R}\rightarrow\mathbb{R} by

f(x):=|\min\{\max\{x,0\},x+M\}|^{\gamma}=\left\{\begin{array}[]{ccl}x^{\gamma}&\mbox{ if }&x>0,\\ 0&\mbox{ if }&x\in[-M,0],\\ |x+M|^{\gamma}&\mbox{ if }&x<-M.\end{array}\right.

We have that \Gamma=\Theta=[-M,0] and that f satisfies the KL property at any \overline{x}\in\Theta with the KL exponent \theta=1-\frac{1}{\gamma}. On the other hand, the error bound property of \nabla f with respect to \Theta fails here.

Let \mathcal{F} be a compact set, and let x_{0}\in\mathcal{F} be such that Assumption 7.1 is satisfied. It can be easily verified that f^{\prime\prime} is qth-order Hölder continuous on \mathcal{F} with q=\min\{\gamma-2,1\}. If \gamma\in(2,3], then \theta=\frac{q+1}{q+2}, and the sequence of iterates \{x_{k}\} of Algorithm 1 converges to \overline{x}\in\Theta linearly by Proposition 7.9. If \gamma>3, then \theta>\frac{q+1}{q+2}, and so \{x_{k}\} converges to \overline{x}\in\Theta sublinearly with the rate O(k^{-\frac{2}{\gamma-3}}). Observe that for \gamma=3, linear convergence can indeed be achieved. In this case, we have q=1 and L_{2}=6; take then \sigma_{k}:=\sigma\in\big{(}\frac{2L_{2}}{q+2},2L_{2}\big{]}=(4,12] for all k and x_{0}>0. Assuming without loss of generality that \{x_{k}\} is an infinite sequence, we get by the construction of Algorithm 1 that

x_{k+1}\in{\rm argmin}_{y}\big{\{}f^{\prime}(x_{k})(y-x_{k})+\frac{1}{2}f^{\prime\prime}(x_{k})(y-x_{k})^{2}+\frac{\sigma}{6}|y-x_{k}|^{3}\big{\}}.

Letting t:=x_{k+1}-x_{k} gives us f^{\prime}(x_{k})+f^{\prime\prime}(x_{k})t+\frac{\sigma}{2}t|t|=0. Suppose now that x_{k}>0; then f^{\prime}(x_{k})>0 and f^{\prime\prime}(x_{k})>0. This implies that t<0 and hence 3x_{k}^{2}+6x_{k}t-\frac{\sigma}{2}t^{2}=0. Therefore, t=\big{(}\frac{6}{\sigma}-\frac{\sqrt{6\sigma+36}}{\sigma}\big{)}x_{k}, and so x_{k+1}=\rho x_{k}>0 with \rho=1-\frac{\sqrt{6\sigma+36}-6}{\sigma}\in(0,1). Arguing by induction tells us that x_{k}>0 and x_{k+1}=\rho x_{k} for all k\in\mathbb{N}. This yields x_{k}\rightarrow\overline{x}=0\in\Theta, which confirms that the convergence rate is linear.
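The closed-form contraction above is easy to confirm numerically. A minimal Python sketch of the \gamma=3 iteration with the illustrative choices \sigma=6\in(4,12] and x_{0}=0.5 (our own values):

import math

sigma, x = 6.0, 0.5
rho = 1.0 - (math.sqrt(6.0*sigma + 36.0) - 6.0)/sigma
for k in range(5):
    # cubic-regularized step on f(x) = x^3 for x > 0: t solves 3x^2 + 6x*t - (sigma/2)*t^2 = 0, t < 0
    t = (6.0 - math.sqrt(6.0*sigma + 36.0))/sigma * x
    x = x + t
    print(k, x, rho**(k + 1)*0.5)   # the two columns coincide: x_k = rho^k * x_0

With \sigma=6 this gives \rho=2-\sqrt{2}\approx 0.586, in full agreement with the linear rate established above.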

Finally in this section, we specialize the convergence results to the case where the strict saddle point property holds, which allows us to establish stronger conclusions. Observe to this end that if the objective function f satisfies the strict saddle point property (resp. the strong strict saddle point property), then Algorithm 1 indeed converges to a local minimizer (resp. global minimizer) of f with the corresponding superlinear rate. In particular, this covers the situations discussed in the illustrative examples of Section 6.

The next corollary addresses a concrete case of the over-parameterized compressed sensing model. In this case, Algorithm 1 converges to a global minimizer with a quadratic convergence rate under the strict complementarity condition.

Corollary 7.13 (quadratic convergence for over-parameterized compressed sensing models).

Consider the \ell_{1}-regularization model (6.6) and the associated over-parameterized smooth optimization problem \min_{x\in\mathbb{R}^{2m}}f(x), where f:=f_{OP} as in (6.7). Then the iterative sequence \{x_{k}\} of Algorithm 1 converges to a global minimizer \overline{x} of (6.7) as k\to\infty, and the following assertions hold:

  • (i)

    Under the validity of the strict complementarity condition (6.8), \{x_{k}\} converges to \overline{x} quadratically, i.e.,

    \limsup_{k\to\infty}\frac{\|x_{k}-\overline{x}\|}{\|x_{k-1}-\overline{x}\|^{2}}<\infty.
  • (ii)

    If the strict complementarity condition (6.8) fails, then \{x_{k}\} converges to \overline{x} sublinearly with the rate O(k^{-2}).

Proof.

We know from Section 6 that the strong strict saddle point property is fulfilled for f=f_{OP}. To verify (i), recall that if the strict complementarity condition holds, then \nabla f satisfies the metric subregularity property at \overline{x} with respect to \Theta. Direct verifications show that f is bounded from below by zero, that the set \mathcal{L}(f(x_{0})) is bounded, and that the Hessian of f is Lipschitz continuous on \mathcal{F}=\mathcal{L}(f(x_{0}))+\mathbb{B}_{\mathbb{R}^{2m}}(0,1). By setting p=q=1, we see that \{x_{k}\} converges to \overline{x}\in\Theta at a quadratic rate. Note also that f satisfies the strong strict saddle point property, and thus \overline{x} is indeed a global minimizer.

To prove (ii), observe that if the strict complementarity condition fails, then f satisfies the KL property with the exponent \theta=\frac{3}{4}. Therefore, by setting q=1 in Proposition 7.9, we see that \{x_{k}\} converges to \overline{x}\in\Theta with the rate O(k^{-\frac{(q+1)(1-\theta)}{(q+2)\theta-(q+1)}})=O(k^{-2}). Similarly to (i), it follows that \overline{x} is a global minimizer of f. ∎

Remark 7.14 (further comparison).

During the completion of this paper, we became aware of the preprint [24], in which the authors examined the convergence rate of the high-order regularized method without momentum steps for composite optimization problems under the assumption that the Hessian of the smooth part is Lipschitz continuous, labeled there as "strict continuity." Recall that the main focus of our paper is on the development of generalized metric subregularity with applications to fast convergence analysis for general descent methods. This includes, in particular, the high-order regularized methods with momentum steps. We also mention that the results obtained in [24] cannot cover the setting of Section 7, where we incorporate momentum steps and can also handle the case of Hölder continuous Hessians.

8 Concluding Remarks

The main contributions of this paper are summarized as follows.

\bullet We introduce a generalized metric subregularity condition, which serves as an extension of the standard metric subregularity and its Hölderian version while being strictly weaker than the uniform version (error bound condition) recently employed in the convergence analysis of the cubic regularized Newton method. We then develop an abstract extended convergence framework that enables us to derive superlinear convergence towards specific target sets (such as the first-order and second-order stationary points) under the introduced generalized metric subregularity condition. Interestingly, this framework also extends the widely used KL convergence analysis, and the derived superlinear convergence rates of this new framework are sharp in the sense that the rate can be attained.

\bullet We provide easily verifiable sufficient conditions for the generalized metric subregularity with respect to the second-order stationary set under the strict saddle point property. We also demonstrate that several modern practical models appearing in machine learning, statistics, etc., enjoy the generalized metric subregularity property with respect to the second-order stationary set in explicit forms. This includes, e.g., objective functions that arise in the over-parameterized compressed sensing model, best rank-one matrix approximation, and generalized phase retrieval problems.

\bullet As a major application of generalized metric subregularity and abstract convergence analysis, we derive superlinear convergence results for the new high-order regularized Newton method with momentum. This improves the existing literature by weakening the regularity conditions and extending the analysis to include momentum steps. Furthermore, when applying the cubic regularized Newton methods with momentum steps to solve the over-parameterized compressed sensing model, we obtain global convergence with a quadratic local convergence rate towards a global minimizer under the strict complementarity condition.

One important potential future research topic is to extend the abstract convergence framework to cover nonmonotone numerical algorithms. Another major goal of our future research is to extend the obtained second-order results from {\cal C}^{2}-smooth functions to the class of {\cal C}^{1,1} functions [27], i.e., continuously differentiable functions with Lipschitzian gradients, or to functions with a composite structure; see, e.g., [31].

References

  • [1] Ahookhosh, M., Aragón-Artacho, F.J., Fleming, R.M.T., Vuong, P.T.: Local convergence of the Levenberg–Marquardt method under Hölder metric subregularity. Adv. Comput. Math. 45, 2771–2806 (2019)
  • [2] Ahookhosh, M., Nesterov, Y.: High-order methods beyond the classical complexity bounds: inexact high-order proximal-point methods. Math. Prog. (2024). https://doi.org/10.1007/s10107-023-02041-4
  • [3] Aragón-Artacho, F.J., Geoffroy, M.H.: Metric subregularity of the convex subdifferential in Banach spaces. J. Nonlinear Convex Anal. 15, 35–47 (2014)
  • [4] Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Prog. 116, 5–16 (2009)
  • [5] Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Prog. 137, 91–129 (2013)
  • [6] Bierstone, E., Milman, P.: Semianalytic and subanalytic sets. IHES Publ. Math. 67, 5–42 (1988)
  • [7] Bolte, J., Daniilidis, A., Lewis, A.S.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17, 1205–1223 (2007)
  • [8] Cai, J.-F., Huang, M., Li, D., Wang, Y.: The global landscape of phase retrieval II: quotient intensity models. Ann. Appl. Math. 38, 62–114 (2022)
  • [9] Cai, J.-F., Huang, M., Li, D., Wang, Y.: Nearly optimal bounds for the global geometric landscape of phase retrieval. Inverse Problems 39, 075011 (2023)
  • [10] Cartis, C., Gould, N.I.M., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Prog. 130, 295–319 (2011)
  • [11] Drusvyatskiy, D., Mordukhovich, B.S., Nghia, T.T.A.: Second-order growth, tilt stability, and metric regularity of the subdifferential. J. Convex Anal. 21, 1165–1192 (2014)
  • [12] Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, Volumes I and II, Springer, New York (2003)
  • [13] Gaydu, M., Geoffroy, M.H., Jean-Alexis, C.: Metric subregularity of order q and the solving of inclusions. Cent. Eur. J. Math. 9, 147–161 (2011)
  • [14] Gfrerer, H., Outrata, J.V.: On a semismooth* Newton method for solving generalized equations. SIAM J. Optim. 31, 489–517 (2021)
  • [15] Grapiglia, G.N., Nesterov, Yu.: Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM J. Optim. 27, 478–506 (2017)
  • [16] Hsia, Y., Sheu, R.-L., Yuan, Y.-X.: Theory and application of p-regularized subproblems for p>2. Optim. Methods Softw. 32, 1059–1077 (2017)
  • [17] Izmailov, A.F., Solodov, M.V.: Newton-Type Methods for Optimization and Variational Problems. Springer, New York (2014)
  • [18] Khanh, P.D., Mordukhovich, B.S., Phat, V.T., Tran, D.B.: Globally convergent coderivative-based generalized Newton methods in nonsmooth optimization. Math. Program. 205, 373–429 (2024)
  • [19] Klatte, D., Kummer, B.: Nonsmooth Equations in Optimization: Regularity, Calculus, and Applications. Kluwer, Boston (2002)
  • [20] Li, G., Mordukhovich, B.S.: Hölder metric subregularity with applications to proximal point method. SIAM J. Optim. 22, 1655–1684 (2012)
  • [21] Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)
  • [22] Li, X., Sun, D., Toh, K.-C.: A highly efficient semismooth Newton augmented Lagrangian method for solving Lasso problems. SIAM J. Optim. 28, 433–458 (2018)
  • [23] Lindstrom, S.B., Lourenço, B.F., Pong, T.K.: Error bounds, facial residual functions and applications to the exponential cone. Math. Program. 200, 229–278 (2023)
  • [24] Liu, Y., Pan, S., Qian, Y.: An inexact q-order regularized proximal Newton method for nonconvex composite optimization. https://arxiv.org/abs/2311.06871
  • [25] Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: A general approach. Ann. Oper. Res. 46, 157–178 (1993)
  • [26] Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation, I: Basic Theory, II: Applications. Springer, Berlin (2006)
  • [27] Mordukhovich, B.S.: Second-Order Variational Analysis in Optimization, Variational Stability, and Control: Theory, Algorithm, Applications. Springer, Cham (2024)
  • [28] Mordukhovich, B.S., Ouyang, W.: Higher-order metric subregularity and its applications. J. Global Optim. 63, 777–795 (2015)
  • [29] Mordukhovich, B.S., Sarabi, M.E.: Generalized Newton algorithms for tilt-stable minimizers in nonsmooth Optimization. SIAM J. Optim. 31, 1184–1214 (2021)
  • [30] Mordukhovich, B.S., Yuan, X., Zeng, S., Zhang, J.: A globally convergent proximal Newton-type method in nonsmooth convex optimization. Math. Program. 198, 899–936 (2023)
  • [31] Nabou, Y., Necoara, I.: Efficiency of higher-order algorithms for minimizing composite functions. Comput. Optim. Appl. 87, 441–473 (2024)
  • [32] Nesterov, Yu., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108, 177–205 (2006)
  • [33] Ochs, P.: Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano. SIAM J. Optim. 29(1), 541–570 (2019)
  • [34] Ouyang, W., Liu, Y., Pong, T.K., Wang, H.: Kurdyka-Łojasiewicz exponent via Hadamard parametrization. https://arxiv.org/abs/2402.00377
  • [35] Poon, C., Peyré, G.: Smooth bilevel programming for sparse regularization. In: Proceedings of NeurIPS’21 (2021). https://arxiv.org/abs/2106.01429
  • [36] Poon, C., Peyré, G.: Smooth over-parameterized solvers for nonsmooth structured optimization. Math. Program. 201, 897–952 (2023)
  • [37] Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, Berlin (1998)
  • [38] Yao, J.C., Zheng, X.Y., Zhu, J.: Stable minimizers of \varphi-regular functions. SIAM J. Optim. 27, 1150–1170 (2017)
  • [39] Yu, P., Li, G., Pong, T.K.: Kurdyka-Łojasiewicz exponent via inf-projection. Found. Comput. Math. 22, 1171–1217 (2022)
  • [40] Yue, M.-C., Zhou, Z., Man-Cho So, A.: On the quadratic convergence of the cubic regularization method under a local error bound condition. SIAM J. Optim. 29, 904–932 (2019)
  • [41] Zhang, B., Ng, K.F., Zheng, X.Y., He, Q.: Hölder metric subregularity for multifunctions in \mathcal{C}^{2} type Banach spaces. Optimization 65, 1963–1982 (2016)
  • [42] Zheng, X.Y., Zhu, J.: Generalized metric subregularity and regularity with respect to an admissible function. SIAM J. Optim. 26, 535–563 (2016)
  • [43] Zălinescu, C.: Convex Analysis in General Vector Spaces. World Scientific, Singapore (2002)

9 Proofs of Technical Lemmas

The main goal of this appendix is to give a detailed proof of Lemma 7.6. We split this proof into several technical steps (also called lemmas), which are of independent interest.

Lemma 9.1.

Under Assumption 7.1, for any x\in\mathcal{F} we have

\langle\nabla f(x),x-y_{\sigma}(x)\rangle\geq 0. (9.1)

If furthermore \sigma\geq\frac{2L_{2}}{q+2} and f(x)\leq f(x_{0}), then y_{\sigma}(x)\in\mathcal{L}(f(x))\subset\mathcal{F}.

Proof.

It readily follows from (7.7) that

\langle\nabla f(x),y_{\sigma}(x)-x\rangle+\langle\nabla^{2}f(x)(y_{\sigma}(x)-x),y_{\sigma}(x)-x\rangle+\frac{\sigma}{q+1}\|y_{\sigma}(x)-x\|^{q+2}=0. (9.2)

Employing (7.8), we obtain

\langle\nabla^{2}f(x)(y_{\sigma}(x)-x),y_{\sigma}(x)-x\rangle+\frac{\sigma}{q+1}\|y_{\sigma}(x)-x\|^{q+2}\geq 0,

and hence (9.1) is a direct consequence of (9.2).

Let now \sigma\geq\frac{2L_{2}}{q+2} and f(x)\leq f(x_{0}). To see the conclusion, we argue by contradiction and suppose that y_{\sigma}(x)\notin\mathcal{F}, which yields \|x-y_{\sigma}(x)\|>0. Denote

y_{\alpha}:=x+\alpha(y_{\sigma}(x)-x),\quad\alpha\in[0,1], (9.3)

and deduce from \mathcal{L}(f(x_{0}))\subset{\rm int}(\mathcal{F}) (by (2) in Assumption 7.1) and f(x)\leq f(x_{0}) that y_{0}=x\in{\rm int}(\mathcal{F}). So the number \overline{\alpha}:=\min\{\alpha>0\;|\;y_{\alpha}\in{\rm bd}(\mathcal{F})\} is well-defined. It follows that \overline{\alpha}<1 and that y_{\alpha}\in\mathcal{F} for all \alpha\in[0,\overline{\alpha}]. Consequently, combining Lemma 7.3 with (9.1)–(9.3) gives us the estimates

\displaystyle f(y_{\alpha})\leq f(x)+\langle\nabla f(x),y_{\alpha}-x\rangle+\frac{1}{2}\langle\nabla^{2}f(x)(y_{\alpha}-x),y_{\alpha}-x\rangle+\frac{L_{2}}{(q+1)(q+2)}\|y_{\alpha}-x\|^{q+2}
\displaystyle\leq f(x)+\alpha\langle\nabla f(x),y_{\sigma}(x)-x\rangle+\frac{\alpha^{2}}{2}\langle\nabla^{2}f(x)(y_{\sigma}(x)-x),y_{\sigma}(x)-x\rangle+\frac{\alpha^{q+2}\sigma}{2(q+1)}\|y_{\sigma}(x)-x\|^{q+2}
\displaystyle\leq f(x)+\left(\alpha-\frac{\alpha^{2}}{2}\right)\langle\nabla f(x),y_{\sigma}(x)-x\rangle-\left(\frac{\sigma\alpha^{2}}{2(q+1)}-\frac{\alpha^{q+2}\sigma}{2(q+1)}\right)\|y_{\sigma}(x)-x\|^{q+2}
\displaystyle\leq f(x)-\frac{\sigma\alpha^{2}(1-\alpha^{q})}{2(q+1)}\|y_{\sigma}(x)-x\|^{q+2},

which imply in turn that f(y_{\overline{\alpha}})<f(x). Therefore, y_{\overline{\alpha}}\in{\rm int}(\mathcal{F}), a contradiction ensuring that y_{\sigma}(x)\in\mathcal{F}. In a similar way, we can check that f(y_{\sigma}(x))\leq f(x). ∎
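In dimension one, the point y_{\sigma}(x) can be computed in closed form from the stationarity condition (7.7), which makes (9.1) and the descent property above easy to observe numerically. A minimal Python sketch for the illustrative function f(x)=x^{4} (so q=1; the value \sigma=24 is our own choice, of the order of the Lipschitz constant of f^{\prime\prime} on [-1,1]):

import math

sigma = 24.0
f, df, d2f = (lambda x: x**4), (lambda x: 4*x**3), (lambda x: 12*x**2)

def y_sigma(x):
    # y = x + t with t the global minimizer of df(x)*t + d2f(x)*t^2/2 + sigma*|t|^3/6
    g1, g2 = df(x), d2f(x)
    model = lambda t: g1*t + 0.5*g2*t*t + sigma*abs(t)**3/6.0
    cands = [0.0]
    for s in (1.0, -1.0):          # stationarity g1 + g2*t + (sigma/2)*t*|t| = 0, per sign branch
        A, B, C = s*sigma/2.0, g2, g1
        disc = B*B - 4*A*C
        if disc >= 0:
            cands += [t for t in ((-B + math.sqrt(disc))/(2*A),
                                  (-B - math.sqrt(disc))/(2*A)) if s*t >= 0]
    return x + min(cands, key=model)

x = 1.0
for k in range(5):
    y = y_sigma(x)
    print(k, df(x)*(x - y), f(x) - f(y))   # both columns stay nonnegative, as Lemma 9.1 predicts
    x = y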

Lemma 9.2.

Under Assumption 7.1, for any x\in\mathcal{F} with y_{\sigma}(x)\in\mathcal{F} we have

\|\nabla f(y_{\sigma}(x))\|\leq\frac{L_{2}+\sigma}{q+1}\|y_{\sigma}(x)-x\|^{q+1}. (9.4)
Proof.

It follows from equation (7.7) that

\|\nabla f(x)+\nabla^{2}f(x)(y_{\sigma}(x)-x)\|=\frac{\sigma}{q+1}\|y_{\sigma}(x)-x\|^{q+1}.

On the other hand, we deduce from (7.9) that

\|\nabla f(y_{\sigma}(x))-\nabla f(x)-\nabla^{2}f(x)(y_{\sigma}(x)-x)\|\leq\frac{L_{2}}{q+1}\|y_{\sigma}(x)-x\|^{q+1}.

Combining these two relations verifies the desired inequality (9.4). ∎

Lemma 9.3.

Under Assumption 7.1, for any x\in\mathcal{F} we have the estimates

\overline{f}_{\sigma}(x)\leq\min_{y\in\mathcal{F}}\left[f(y)+\frac{L_{2}+\sigma}{(q+1)(q+2)}\|y-x\|^{q+2}\right], (9.5)
f(x)-\overline{f}_{\sigma}(x)\geq\frac{q\sigma}{2(q+1)(q+2)}\|y_{\sigma}(x)-x\|^{q+2}. (9.6)

Moreover, if \sigma\geq L_{2} and f(x)\leq f(x_{0}), then y_{\sigma}(x)\in\mathcal{F} and

f(y_{\sigma}(x))\leq\overline{f}_{\sigma}(x). (9.7)
Proof.

Employing the lower bound in (7.10), for any x,y\in\mathcal{F} we get

f(x)+\langle\nabla f(x),y-x\rangle+\frac{1}{2}\langle\nabla^{2}f(x)(y-x),y-x\rangle\leq f(y)+\frac{L_{2}}{(q+1)(q+2)}\|y-x\|^{q+2},

and so inequality (9.5) follows from the construction of \overline{f}_{\sigma}(x). Furthermore, we deduce from y_{\sigma}(x)\in\mathop{\arg\min}_{y\in\mathbb{R}^{m}}\overline{f}_{\sigma}(x), (9.2), and Lemma 9.1 that

\displaystyle f(x)-\overline{f}_{\sigma}(x)=\langle\nabla f(x),x-y_{\sigma}(x)\rangle-\frac{1}{2}\langle\nabla^{2}f(x)(y_{\sigma}(x)-x),y_{\sigma}(x)-x\rangle-\frac{\sigma}{(q+1)(q+2)}\|y_{\sigma}(x)-x\|^{q+2}
\displaystyle=\frac{1}{2}\langle\nabla f(x),x-y_{\sigma}(x)\rangle+\frac{q\sigma}{2(q+1)(q+2)}\|y_{\sigma}(x)-x\|^{q+2}\geq\frac{q\sigma}{2(q+1)(q+2)}\|y_{\sigma}(x)-x\|^{q+2}.

Finally, assuming \sigma\geq L_{2} and f(x)\leq f(x_{0}) gives us by Lemma 9.1 that y_{\sigma}(x)\in\mathcal{F}, and thus (9.7) follows from the upper bound in (7.10). ∎

Lemma 9.4.

Under Assumption 7.1, for any x\in\mathcal{F} with y_{\sigma}(x)\in\mathcal{F} we have

\lambda_{\min}\Big{(}\nabla^{2}f(y_{\sigma}(x))\Big{)}\geq-\left(\frac{\sigma}{q+1}+L_{2}\right)\|y_{\sigma}(x)-x\|^{q}. (9.8)
Proof.

The Hölder continuity of \nabla^{2}f and inequality (7.8) tell us that

\nabla^{2}f(y_{\sigma}(x))\succeq\nabla^{2}f(x)-L_{2}\|y_{\sigma}(x)-x\|^{q}I_{m}\succeq-\left(\frac{\sigma}{q+1}+L_{2}\right)\|y_{\sigma}(x)-x\|^{q}I_{m},

which clearly yields the estimate in (9.8). ∎

Lemma 9.5.

Let Assumption 7.1 hold, and let the sequences \{x_{k}\} and \{\widehat{x}_{k}\} be taken from Algorithm 1 with \overline{\sigma}>\frac{2L_{2}}{q+2}. Then for all k\in\mathbb{N} we have the estimates

f(\widehat{x}_{k+1})-f(x_{k})\leq-\frac{(q+2)\overline{\sigma}-2L_{2}}{2(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}, (9.9)
\sum_{i=0}^{k}\|\widehat{x}_{i+1}-x_{i}\|^{q+2}\leq\frac{2(q+1)(q+2)}{(q+2)\overline{\sigma}-2L_{2}}(f(x_{0})-f^{*}), (9.10)
\|\nabla f(\widehat{x}_{k+1})\|\leq\frac{3L_{2}}{q+1}\|\widehat{x}_{k+1}-x_{k}\|^{q+1}, (9.11)
\lambda_{\min}(\nabla^{2}f(\widehat{x}_{k+1}))\geq-\frac{(q+3)L_{2}}{q+1}\|\widehat{x}_{k+1}-x_{k}\|^{q}. (9.12)
Proof.

First we verify (9.9). It follows from Lemma 7.3, (7.2), and (9.6) that

\displaystyle f(\widehat{x}_{k+1})\leq f(x_{k})+\langle\nabla f(x_{k}),\widehat{x}_{k+1}-x_{k}\rangle+\frac{1}{2}\langle\nabla^{2}f(x_{k})(\widehat{x}_{k+1}-x_{k}),\widehat{x}_{k+1}-x_{k}\rangle+\frac{L_{2}}{(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}
\displaystyle=f(x_{k})+\langle\nabla f(x_{k}),\widehat{x}_{k+1}-x_{k}\rangle+\frac{1}{2}\langle\nabla^{2}f(x_{k})(\widehat{x}_{k+1}-x_{k}),\widehat{x}_{k+1}-x_{k}\rangle+\frac{\sigma_{k}}{(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}+\frac{L_{2}-\sigma_{k}}{(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}
\displaystyle\leq f(x_{k})-\frac{q\sigma_{k}}{2(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}+\frac{L_{2}-\sigma_{k}}{(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}
\displaystyle=f(x_{k})-\frac{(q+2)\sigma_{k}-2L_{2}}{2(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2}\leq f(x_{k})-\frac{(q+2)\overline{\sigma}-2L_{2}}{2(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2},

which justifies (9.9). To prove (9.10), we use (7.6) and (9.9) to get

\displaystyle f(x_{0})-f^{*}\geq\sum_{i=0}^{k}[f(x_{i})-f(x_{i+1})]\geq\sum_{i=0}^{k}[f(x_{i})-f(\widehat{x}_{i+1})]\geq\sum_{i=0}^{k}\frac{(q+2)\overline{\sigma}-2L_{2}}{2(q+1)(q+2)}\|\widehat{x}_{i+1}-x_{i}\|^{q+2},

which yields (9.10). The inequalities in (9.11) and (9.12) follow from Lemmas 9.2 and 9.4, respectively. ∎

The next estimates, which are useful for our subsequent analysis, follow from (7.6) and (7.5):

\|x_{k+1}-\widehat{x}_{k+1}\|\leq\max\big{\{}\|\widehat{x}_{k+1}-\widehat{x}_{k+1}\|,\|\widetilde{x}_{k+1}-\widehat{x}_{k+1}\|\big{\}}=\|\widetilde{x}_{k+1}-\widehat{x}_{k+1}\|\leq\beta_{k+1}\|\widehat{x}_{k+1}-\widehat{x}_{k}\|. (9.13)
Lemma 9.6.

Under Assumption 7.1, the sequence \{\widehat{x}_{k}\} generated by Algorithm 1 satisfies

\|\widehat{x}_{k+1}-\widehat{x}_{k}\|\leq\frac{1}{1-\zeta}\left[\frac{2(q+1)(q+2)}{(q+2)\overline{\sigma}-2L_{2}}(f(x_{0})-f^{*})\right]^{\frac{1}{q+2}}=:\gamma.
Proof.

For any index i\geq 1, we deduce from (9.13) and the condition \beta_{k+1}\leq\zeta for all k\geq 0 that

\|\widehat{x}_{i+1}-\widehat{x}_{i}\|\leq\|\widehat{x}_{i+1}-x_{i}\|+\|x_{i}-\widehat{x}_{i}\|\leq\|\widehat{x}_{i+1}-x_{i}\|+\beta_{i}\|\widehat{x}_{i}-\widehat{x}_{i-1}\|\leq\|\widehat{x}_{i+1}-x_{i}\|+\zeta\|\widehat{x}_{i}-\widehat{x}_{i-1}\|. (9.14)

Recursively applying (9.14), we deduce from \widehat{x}_{0}=x_{0} that

\displaystyle\|\widehat{x}_{k+1}-\widehat{x}_{k}\|\leq\zeta^{k}\|\widehat{x}_{1}-\widehat{x}_{0}\|+\sum_{i=1}^{k}\zeta^{k-i}\|\widehat{x}_{i+1}-x_{i}\|\leq\sum_{i=0}^{k}\zeta^{k-i}\|\widehat{x}_{i+1}-x_{i}\|

whenever k\in\mathbb{N}. Then it follows from (9.10) that

\displaystyle\|\widehat{x}_{k+1}-\widehat{x}_{k}\|\leq\max_{i\in\{0,\ldots,k\}}\|\widehat{x}_{i+1}-x_{i}\|\sum_{i=0}^{k}\zeta^{k-i}\leq\frac{1}{1-\zeta}\max_{i\in\{0,\ldots,k\}}\|\widehat{x}_{i+1}-x_{i}\|\leq\frac{1}{1-\zeta}\left[\frac{2(q+1)(q+2)}{(q+2)\overline{\sigma}-2L_{2}}(f(x_{0})-f^{*})\right]^{\frac{1}{q+2}},

which justifies the claim of the lemma. ∎

Now we are ready to finalize the proof of Lemma 7.6:
To verify assertion (i) therein, it suffices to show that \{f(x_{k})\} is a nonincreasing sequence bounded from below. By Assumption 7.1, f is bounded from below, and so is \{f(x_{k})\}. Employing (7.6) and (9.9) yields

f(x_{k+1})\leq f(\widehat{x}_{k+1})\leq f(x_{k})-\frac{(q+2)\overline{\sigma}-2L_{2}}{2(q+1)(q+2)}\|\widehat{x}_{k+1}-x_{k}\|^{q+2} (9.15)

which allows us to deduce that \{f(x_{k})\} converges to some v\in\mathbb{R}.
To verify (ii), we begin by using (9.13), \beta_{k+1}\leq\|\widehat{x}_{k+1}-x_{k}\|, and Lemma 9.6 to get

\displaystyle\|x_{k+1}-x_{k}\|\leq\|x_{k+1}-\widehat{x}_{k+1}\|+\|\widehat{x}_{k+1}-x_{k}\|\leq\beta_{k+1}\|\widehat{x}_{k+1}-\widehat{x}_{k}\|+\|\widehat{x}_{k+1}-x_{k}\|
\displaystyle\leq\|\widehat{x}_{k+1}-x_{k}\|(\|\widehat{x}_{k+1}-\widehat{x}_{k}\|+1)\leq(1+\gamma)\|\widehat{x}_{k+1}-x_{k}\|,

which, together with (9.10), tells us that

\lim_{k\to\infty}\|\widehat{x}_{k+1}-x_{k}\|=0. (9.16)

Combining further Assumption 7.1 and Lemma 9.6 with (7.4), (9.13), and (9.11), we arrive at

\displaystyle\|\nabla f(x_{k+1})\|\leq\|\nabla f(\widehat{x}_{k+1})\|+\|\nabla f(\widehat{x}_{k+1})-\nabla f(x_{k+1})\| (9.17)
\displaystyle\leq\|\nabla f(\widehat{x}_{k+1})\|+L_{1}\|x_{k+1}-\widehat{x}_{k+1}\|\leq\|\nabla f(\widehat{x}_{k+1})\|+L_{1}\beta_{k+1}\|\widehat{x}_{k+1}-\widehat{x}_{k}\|
\displaystyle\leq\|\nabla f(\widehat{x}_{k+1})\|(1+L_{1}\|\widehat{x}_{k+1}-\widehat{x}_{k}\|)\leq\frac{3(1+\gamma L_{1})L_{2}}{q+1}\|\widehat{x}_{k+1}-x_{k}\|^{q+1},

and therefore verify assertion (ii).
To verify the final assertion (iii), note that (9.15) yields \{x_{k}\}\subset\mathcal{L}(f(x_{0})). Recalling that \mathcal{L}(f(x_{k_{0}})) is bounded for some k_{0}\in\mathbb{N}, we conclude that \{x_{k}\} is bounded as well, which ensures that \Omega\neq\emptyset. Pick any \overline{x}\in\Omega and observe that justifying (7.22) requires bounding the minimum eigenvalue of the Hessian from below. Recalling that |\lambda_{\min}(C)-\lambda_{\min}(D)|\leq\|C-D\| holds for any real symmetric matrices C and D, we have

\lambda_{\min}(\nabla^{2}f(x_{k+1}))\geq\lambda_{\min}(\nabla^{2}f(\widehat{x}_{k+1}))-\|\nabla^{2}f(x_{k+1})-\nabla^{2}f(\widehat{x}_{k+1})\|.

This together with Assumption 7.1, Lemma 9.6, (9.12), (9.13), and \beta_{k+1}\leq\|\widehat{x}_{k+1}-x_{k}\| tells us that

\displaystyle\lambda_{\min}(\nabla^{2}f(x_{k+1}))\geq\lambda_{\min}(\nabla^{2}f(\widehat{x}_{k+1}))-\|\nabla^{2}f(x_{k+1})-\nabla^{2}f(\widehat{x}_{k+1})\|\geq\lambda_{\min}(\nabla^{2}f(\widehat{x}_{k+1}))-L_{2}\|x_{k+1}-\widehat{x}_{k+1}\|^{q}
\displaystyle\geq\lambda_{\min}(\nabla^{2}f(\widehat{x}_{k+1}))-L_{2}\beta_{k+1}^{q}\|\widehat{x}_{k+1}-\widehat{x}_{k}\|^{q}\geq\lambda_{\min}(\nabla^{2}f(\widehat{x}_{k+1}))-L_{2}\gamma^{q}\|\widehat{x}_{k+1}-x_{k}\|^{q}
\displaystyle\geq-\frac{(q+3)L_{2}}{q+1}\|\widehat{x}_{k+1}-x_{k}\|^{q}-L_{2}\gamma^{q}\|\widehat{x}_{k+1}-x_{k}\|^{q}=-\left(\frac{q+3}{q+1}+\gamma^{q}\right)L_{2}\|\widehat{x}_{k+1}-x_{k}\|^{q}.

Unifying the latter with (9.16) and (9.17) leads us to the inequalities

\|\nabla f(\overline{x})\|\leq\limsup_{k\to\infty}\|\nabla f(x_{k+1})\|\leq 0\quad\mbox{and}\quad\lambda_{\min}(\nabla^{2}f(\overline{x}))\geq\liminf_{k\to\infty}\lambda_{\min}(\nabla^{2}f(x_{k+1}))\geq 0,

which show that \nabla f(\overline{x})=0 and \nabla^{2}f(\overline{x})\succeq 0. Finally, employing Lemma 7.6(i), we obtain f(\overline{x})=v, which therefore completes the proof of the lemma. ∎