Generalized Metric Subregularity with Applications to
High-Order Regularized Newton Methods
Abstract
This paper pursues a twofold goal. First, we introduce and study in detail a new notion of variational analysis called generalized metric subregularity, which is a far-reaching extension of the conventional metric subregularity conditions. Our primary focus is on examining this concept with respect to first-order and second-order stationary points. We develop an extended convergence framework that enables us to derive superlinear and quadratic convergence under the generalized metric subregularity condition, broadening the widely used KL convergence analysis framework. We present verifiable sufficient conditions to ensure the proposed generalized metric subregularity condition and provide examples demonstrating that the derived convergence rates are sharp. Second, we design a new high-order regularized Newton method with momentum steps, and apply the generalized metric subregularity to establish its superlinear convergence. Quadratic convergence is obtained under additional assumptions. Specifically, when applying the proposed method to solve the (nonconvex) over-parameterized compressed sensing model, we achieve global convergence with a quadratic local convergence rate towards a global minimizer under a strict complementarity condition.
Keywords: Variational analysis and optimization, generalized metric subregularity, error bounds, high-order regularized Newton methods, Kurdyka-Łojasiewicz property, superlinear and quadratic convergence
Mathematics Subject Classification (2020) 49J53, 49M15, 90C26
1 Introduction
It has been well recognized, especially during the recent years, that concepts and tools of modern variational analysis play a crucial role in the design and justification of numerical algorithms in optimization and related areas. We mention, in particular, generalized Newton methods for which, in addition to the fundamental developments summarized in [12, 17, 19], new approaches and results have been recently proposed in [1, 14, 18, 22, 27] among other publications. A common feature of these developments is the usage of metric regularity/subregularity assumptions in a certain form. Another common feature of the aforementioned publications is justifying superlinear convergence rates under some nondegeneracy condition. For example, [32] develops the cubic regularization (CR) method for minimizing a twice continuously differentiable function, which is now widely recognized as a globally convergent variant of Newton’s method with superior iteration complexity [10, 32]. The local quadratic convergence of the CR method was derived under a nondegeneracy condition in [32]. In [40], the authors proposed an error bound condition related to the second-order stationary set and examined the quadratic local convergence of the cubic regularized Newton method under this condition, extending the results in [32]. Interestingly, it was shown in [40] that this error bound condition can be applied to nonconvex and degenerate cases, being satisfied with overwhelming probability for some concrete nonconvex optimization problems that arise in phase retrieval and low-rank matrix recovery.
On the other hand, it is known that the metric subregularity of the subdifferential mapping has a close relationship with the Kurdyka-Łojasiewicz (KL) property. The KL property is satisfied for various important classes of functions arising in diverse applications, and it has emerged as one of the most widely used tools for analyzing the convergence of important numerical algorithms [4, 5]. In particular, if the potential function of the algorithm satisfies the KL property and the associated desingularization function involved in the KL property takes the form $\varphi(s)=c\,s^{1-\theta}$ with $c>0$ and $\theta\in[0,1)$, then the corresponding exponent $\theta$ determines the linear or sublinear convergence rate of the underlying algorithms [4, 5, 21, 39]. Although the KL property and its associated analysis framework are well-adopted in the literature, this framework, in general, cannot detect superlinear or quadratic convergence rates, which are typical for many second-order numerical methods including Newton-type methods.
In this paper, we introduce and thoroughly investigate novel generalized metric subregularity conditions, significantly extending conventional metric subregularity and error bound conditions from [40]. We develop an abstract extended convergence framework that enables us to derive superlinear and quadratic convergence towards specific target sets, such as first-order and second-order stationary points, under the introduced generalized metric subregularity condition. This new framework extends the widely used KL convergence analysis framework. We provide examples demonstrating that the derived quadratic convergence rates are sharp, meaning they can be attained. We also present verifiable sufficient conditions that ensure the proposed generalized metric subregularity condition and showcase that the sufficient conditions are satisfied for functions that arise in practical models including the over-parameterized compressed sensing, best rank-one matrix approximation, and generalized phase retrieval problems. Finally, we use the generalized metric subregularity condition to establish superlinear (and in some cases quadratic) convergence of the newly proposed high-order regularized Newton methods with momentum. Specifically, when applying the proposed method to solve the (nonconvex) over-parameterized compressed sensing model, we achieve global convergence with a quadratic local convergence rate towards a global minimizer under a strict complementarity condition.
Organization. The main achievements of the current research are summarized in the concluding Section 8. The rest of the paper is structured as follows.
Section 2 presents basic definitions and preliminaries from variational analysis and generalized differentiation broadly used in what follows. In Section 3, we define two versions of generalized metric subregularity for subgradient mappings with respect to first-order and second-order stationary points and then discuss their relationships with notions known in the literature and with growth conditions.
Section 4 develops a new framework for abstract convergence analysis to establish superlinear and quadratic convergence rates of numerical algorithms under generalized metric subregularity. Being of independent interest, the obtained results are instrumental in justifying such fast convergence rates for our main high-order regularized Newtonian algorithm in the subsequent parts of the paper. In Section 5, we provide some abstract convergence analysis under the Kurdyka-Łojasiewicz property.
Section 6 revisits generalized metric subregularity while now concentrating on this property with respect to second-order stationary points. Our special attention is paid to the strict saddle point and weak separation properties, which provide verifiable sufficient conditions for the fulfillment of generalized metric subregularity. Here we present several examples of practical modeling, where these properties are satisfied.
In Section 7, we design the family of high-order regularized Newton methods with momentum steps for problems of optimization, where the order of the method is determined by the Hölder exponent of the Hessian matrix associated with the objective function. Using the machinery of generalized metric subregularity with respect to second-order stationary points leads us to establishing superlinear convergence of iterates with explicit convergence rates and also quadratic convergence in certain particular settings.
2 Preliminaries from Variational Analysis
Let $\mathbb{R}^n$ be the $n$-dimensional Euclidean space with the inner product denoted by $\langle x,y\rangle$ for all $x,y\in\mathbb{R}^n$ and the norm $\|x\|:=\sqrt{\langle x,x\rangle}$. The symbols $\mathbb{B}(x,r)$ and $\mathbb{B}[x,r]$ stand for the open and closed ball centered at $x$ with radius $r>0$, respectively. Given a symmetric matrix $A$, the notation $A\succeq 0$ signifies that $A$ is positive-semidefinite. We use $\mathbb{N}$ to denote the set of nonnegative integers $\{0,1,2,\ldots\}$.
An extended-real-valued function $f\colon\mathbb{R}^n\to\overline{\mathbb{R}}:=(-\infty,\infty]$ is proper if $\operatorname{dom}f:=\{x\in\mathbb{R}^n\mid f(x)<\infty\}\ne\emptyset$ for its effective domain. The (limiting, Mordukhovich) subdifferential for such functions at $\bar{x}\in\operatorname{dom}f$ is
$$\partial f(\bar{x}):=\Big\{v\in\mathbb{R}^n\;\Big|\;\exists\,x_k\to\bar{x},\ f(x_k)\to f(\bar{x}),\ v_k\to v\ \text{with}\ \liminf_{x\to x_k}\frac{f(x)-f(x_k)-\langle v_k,x-x_k\rangle}{\|x-x_k\|}\ge 0\ \text{for all}\ k\in\mathbb{N}\Big\}. \tag{2.1}$$
The subdifferential (2.1) reduces to the subdifferential of convex analysis when $f$ is convex and to the gradient $\{\nabla f(\bar{x})\}$ when $f$ is continuously differentiable ($\mathcal{C}^1$-smooth) around $\bar{x}$. For the general class of lower semicontinuous (l.s.c.) functions, $\partial f$ enjoys comprehensive calculus rules and is used in numerous applications of variational analysis and optimization; see, e.g., the books [26, 27, 37] and the references therein.
Having (2.1), consider the set of first-order stationary points of an l.s.c. function $f$ defined by
$$\Gamma:=\big\{x\in\mathbb{R}^n\;\big|\;0\in\partial f(x)\big\}. \tag{2.2}$$
Suppose that $f\colon\mathbb{R}^n\to\mathbb{R}$ is a twice continuously differentiable ($\mathcal{C}^2$-smooth) function and denote the collection of second-order stationary points of $f$ by
$$\Theta:=\big\{x\in\mathbb{R}^n\;\big|\;\nabla f(x)=0\ \text{and}\ \nabla^2 f(x)\succeq 0\big\}. \tag{2.3}$$
As in [42], we say that , where , is an admissible function if and . It is well known that for convex admissible functions, the directional derivative
always exists being nondecreasing on . Furthermore (see, e.g., [43, Theorem 2.1.5]), is increasing on if and only if is strictly convex, i.e.,
for any and with . It is also known that a convex function is differentiable on if and only if is continuous on . The following lemma from [43, Theorem 2.1.5] is used below.
Lemma 2.1 (properties of convex admissible functions).
Let be a convex admissible function. Then we have the relationships:
(i)
(ii)
Finally in this section, we recall the celebrated property often used in convergence analysis.
Definition 2.2 (Kurdyka-Łojasiewicz property).
Let $f\colon\mathbb{R}^n\to\overline{\mathbb{R}}$ be a proper l.s.c. function, and let $\varphi\colon[0,\eta)\to[0,\infty)$ be a continuous concave function. We say that $f$ has the Kurdyka-Łojasiewicz property at $\bar{x}\in\operatorname{dom}\partial f$ with respect to $\varphi$ if the following conditions hold:
(i) $\varphi(0)=0$ and $\varphi$ is continuously differentiable on $(0,\eta)$.
(ii) $\varphi'(s)>0$ for all $s\in(0,\eta)$.
(iii) There exists $\varepsilon>0$ such that
$$\varphi'\big(f(x)-f(\bar{x})\big)\,\operatorname{dist}\big(0,\partial f(x)\big)\ge 1$$
for all $x\in\mathbb{B}(\bar{x},\varepsilon)$ with $f(\bar{x})<f(x)<f(\bar{x})+\eta$, where $\operatorname{dist}(\cdot,\Omega)$ stands for the distance function associated with the set $\Omega$. If in addition $\varphi$ takes the form of $\varphi(s)=c\,s^{1-\theta}$ for some $c>0$ and $\theta\in[0,1)$, then we say $f$ satisfies the KL property at $\bar{x}$ with the KL exponent $\theta$.
A widely used class of functions satisfying the KL property consists of subanalytic functions. It is known that the maximum and minimum of finitely many analytic functions and also semialgebraic functions (such as $|x|^{p}$ with $p$ being a nonnegative rational number) are subanalytic. We refer the reader to [4, 5, 6, 21] for more details on the KL property and its exponent version, as well as on subanalytic functions.
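As an informal numerical illustration (not taken from the original text), the following Python snippet probes the KL inequality of Definition 2.2 for the smooth function $f(x)=x^2$, whose KL exponent at the origin is $\theta=1/2$; the desingularization constant $c=1$ and the sampling grid are ad hoc choices made only for this check.

```python
import numpy as np

def f(x):        # f(x) = x^2, a smooth function with KL exponent 1/2 at x = 0
    return x ** 2

def grad_f(x):
    return 2.0 * x

theta, c = 0.5, 1.0           # assumed KL exponent and desingularization constant
xs = np.linspace(-1.0, 1.0, 2001)
xs = xs[xs != 0.0]            # the KL inequality is only required where f(x) > f(0)

# KL inequality with phi(s) = c * s^(1 - theta):
#   |f'(x)| >= (f(x) - f(0))^theta / (c * (1 - theta))
lhs = np.abs(grad_f(xs))
rhs = (f(xs) - f(0.0)) ** theta / (c * (1.0 - theta))
print("KL inequality holds on the grid:", bool(np.all(lhs >= rhs - 1e-12)))
```

For this particular function the two sides of the inequality coincide, so the check passes with equality up to rounding.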
3 Generalized Metric Subregularity
The main variational concept for the subdifferential (2.1) of an extended-real-valued function used in this paper is introduced below, where we consider its two versions.
Definition 3.1 (generalized metric subregularity).
Let $f\colon\mathbb{R}^n\to\overline{\mathbb{R}}$ be a proper l.s.c. function, let $\Gamma$ be taken from (2.2), and let $\psi$ be an admissible function. Given a target set $\Lambda\subseteq\Gamma$ and a point $\bar{x}\in\Lambda$, we say that:
(i) The subdifferential $\partial f$ satisfies the (pointwise) generalized metric subregularity property with respect to $(\psi,\Lambda)$ at $\bar{x}$
if there exist $\kappa,\varepsilon>0$ such that
$$\operatorname{dist}(x,\Lambda)\le\kappa\,\psi\big(\operatorname{dist}(0,\partial f(x))\big)\quad\text{for all}\ x\in\mathbb{B}(\bar{x},\varepsilon). \tag{3.1}$$
(ii) The subdifferential $\partial f$ satisfies the uniform generalized metric subregularity property with respect to $(\psi,\Lambda)$ if there exist $\kappa,\varepsilon>0$ such that
$$\operatorname{dist}(x,\Lambda)\le\kappa\,\psi\big(\operatorname{dist}(0,\partial f(x))\big)\quad\text{for all}\ x\ \text{with}\ \operatorname{dist}(x,\Lambda)\le\varepsilon. \tag{3.2}$$
(iii) If $\psi(t)=t$ in (i) and (ii), then we simply say that $\partial f$ satisfies the metric subregularity (resp. uniform metric subregularity) property with respect to $\Lambda$.
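To make the pointwise inequality (3.1) concrete, here is a small numerical sanity check that is our own illustration rather than part of the paper: for $f(x)=x^4$ with the target set $\Lambda=\{0\}$, the admissible function $\psi(t)=t^{1/3}$ and the constant $\kappa=1$ (both assumptions chosen for this toy function) satisfy the generalized metric subregularity estimate on a grid around the stationary point.

```python
import numpy as np

# Toy check of generalized metric subregularity (3.1) for f(x) = x^4,
# target set Lambda = {0} (its only first-order stationary point),
# admissible function psi(t) = t^(1/3) and constant kappa = 1.0.
# These choices are illustrative assumptions, not taken from the paper.
f_grad = lambda x: 4.0 * x ** 3
psi = lambda t: t ** (1.0 / 3.0)
kappa = 1.0

xs = np.linspace(-0.5, 0.5, 1001)
dist_to_target = np.abs(xs)                      # dist(x, Lambda)
residual = psi(np.abs(f_grad(xs)))               # psi(dist(0, grad f(x)))
print("subregularity inequality holds:",
      bool(np.all(dist_to_target <= kappa * residual + 1e-12)))
```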
Remark 3.2 (connections with related notions).
Let us briefly discuss connections of the generalized metric subregularity notions from Definition 3.1 with some related notions.
Consider the case where the target set $\Lambda$ is the set of first-order stationary points (2.2). If $\psi(t)=t$, then (3.1) reduces to the conventional notion of metric subregularity applied to subgradient mappings. If $\psi(t)=t^{q}$ for some $q>0$, which we refer to as exponent metric subregularity, this notion has also received a lot of attention due to its important applications in numerical optimization. It is called the Hölder metric subregularity if $q\in(0,1)$, and the high-order metric subregularity if $q>1$. Other cases of convex and smooth admissible functions are considered in [42]. We refer the reader to [3, 13, 20, 28, 30, 41, 42] and the bibliographies therein for more details.
In the case where $f$ is a $\mathcal{C}^2$-smooth function, $\psi(t)=t$, and the target set $\Lambda=\Theta$ consists of second-order stationary points (2.3), Definition 3.1(ii) takes the form: there exist $\kappa,\varepsilon>0$ such that
$$\operatorname{dist}(x,\Theta)\le\kappa\,\|\nabla f(x)\|\quad\text{for all}\ x\ \text{with}\ \operatorname{dist}(x,\Theta)\le\varepsilon, \tag{3.3}$$
i.e., it falls into the framework of the uniform generalized metric subregularity (3.2) for smooth functions. In this case, it was referred to as the error bound (EB) condition [40, Assumption 2]. This condition plays a key role in [40] to justify the quadratic convergence of the cubic regularized Newton method.
Now we present three examples illustrating several remarkable features of the introduced notions. The first example shows that the generalized metric subregularity from (3.1) can go beyond the exponent cases. Note that the usage of metric subregularity in the nonexponent setting has been recently identified in [23] for the case of exponential cone programming.
Example 3.3 (nonexponent generalized metric subregularity).
Let be given by
Define by for all . It is easy to see that is a differentiable convex function on , and that we have
This tells us that for all , and so has the generalized metric subregularity property with respect to . On the other hand, for any we get
which shows that does not enjoy any exponent-type metric subregularity.
The next example illustrates the fulfillment of the generalized metric subregularity (3.2) with respect to the second-order stationary set for some admissible function, while the error bound condition (3.3) from [40] fails.
Example 3.4 (generalized metric subregularity vs. error bound condition).
Given , define a -smooth function by
for which we get by the direct calculations that
It is easy to see that and for , while for we have . This implies therefore that
Consequently, the EB condition (3.3) fails, while enjoys the generalized metric subregularity property with respect to at , where .
For a fixed admissible function and a given set , we observe directly from the definitions that uniform generalized metric subregularity with respect to yields the generalized metric subregularity at each point of . The following simple one-dimensional example shows that the reverse implication fails.
Example 3.5 (pointwise version is strictly weaker than uniform version for generalized metric subregularity).
Consider for with and . We get and . Furthermore, and . Thus there exists such that for all with . This tells us that satisfies the generalized metric subregularity property with respect to at with and .
On the other hand, does not satisfy the uniform generalized metric subregularity with respect to . Indeed, for any with , we get and .
It has been well recognized in variational analysis that there exist close relationships between metric subregularity of subgradient mappings and second-order growth conditions for the functions in question; see [3, 11, 27] for more details. We now extend such relationships to the case of generalized metric subregularity. The following useful lemma is of its own interest.
Lemma 3.6 (first-order stationary points of compositions).
Let , where is a proper l.s.c. convex function, and where is a -smooth mapping with the surjective derivative at some first-order stationary point associated with the composition in (2.2). Then there exists such that we have the inclusion
(3.4) |
Proof.
The surjectivity of for smooth allows us to deduce from the Lyusternik–Graves theorem (see, e.g., [26, Theorem 1.57]) the existence of such that is surjective on and
(3.5) |
Consequently, it follows from [26, Proposition 1.112] that
(3.6) |
Picking any , we deduce from (3.6) that
Combining this with (3.5) yields . Hence is a minimizer of the convex function , and
which gives us (3.4) and thus completes the proof. ∎
Now we are ready to establish relationships between the generalized metric subregularity and corresponding growth conditions extending the known ones [3, 11] when , and .
Theorem 3.7 (generalized metric subregularity via growth conditions).
Let in the setting of Lemma 3.6. Consider the following statements:
(i) There exists an admissible function and constants such that
(3.7) |
(ii) There exists an admissible function and constants such that
Then implication holds with if is convex and , while the reverse one is always satisfied with .
Proof.
To verify , we proceed as in the proof of Lemma 3.6 and find such that (3.4), (3.5), and (3.6) hold. Letting and then picking any and , take a sequence such that as . Noting that and , assume without loss of generality that entirely lies in . Select further with for all and show that
(3.8) |
whenever . Indeed, it follows from (3.5) and (3.6) that for any there is with
Combining the latter with the convexity of gives us
for all , and thus brings us to (3.8). Note further that (3.4) ensures that for all . Letting there and denoting , we get , which yields in turn
This allows us to derive from (3.7) that
where stands for the closure of . Employing Lemma 2.1(ii) tells us that
for any , and thus it follows from that (ii) is satisfied. The reverse implication (ii)(i) is a consequence of Lemma 3.6 and [38, Theorem 4.2]. ∎
4 Abstract Framework for Convergence Analysis
In this section, we develop a general framework for convergence analysis of an abstract minimization algorithm to ensure its superlinear and quadratic convergence. This means finding appropriate conditions on an abstract algorithm structure and the given objective function under which the generated iterates exhibit the desired fast convergence. We will see that generalized metric subregularity is an important condition for establishing fast convergence. The obtained abstract results are implemented in Section 7 to derive specific convergence conditions for the newly proposed high-order regularized Newton method with momentum.
Our study in this section focuses on a coupled sequence , where is the main sequence generated by the algorithm, and where is an auxiliary numerical sequence that serves as a surrogate for (the magnitude of the successive change sequence). Throughout this section, we assume that these two sequences satisfy the following set of basic assumptions (BA):
-
(i)
Descent condition:
() -
(ii)
Surrogate condition: there exists such that
() -
(iii)
Relative error condition:
() where is a fixed positive constant, and where is an admissible function.
-
(iv)
Continuity condition: For each subsequence with , we have
()
Note that the above descent condition (H0) and surrogate condition (H1) can be enforced by the following stronger property: there exist constants such that
() |
where is an admissible function. When and , condition () is often referred to as the sufficient descent condition. Both conditions () and () are automatically satisfied for various commonly used algorithms. For example, for proximal-type methods [5], these conditions always hold with , , , and ; for cubic regularized Newton methods with Lipschitz continuous Hessian [40], the latter conditions are fulfilled with , , , and ; for high-order proximal point methods [2], with , , , and .
Incorporating the surrogate sequence gives us some flexibility in handling multistep numerical methods with, e.g., momentum steps. As we see in Section 7, conditions () and () hold with , , and for some , where is an auxiliary sequence generated by the subproblem of the high-order regularization methods, where , and where with . In the special case where and , there are other general unified frameworks [33], which involve a bivariate function serving as an approximation for . While we can potentially extend further in this direction, our main aim is to provide a framework applicable for superlinear and quadratic convergence, which is not discussed in [33].
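As a concrete (and entirely illustrative) instance of the basic assumptions, the sketch below runs the proximal point method on a toy univariate function and monitors the descent and relative error quantities along the iterates; the test function, the proximal parameter, and the surrogate choice $e_k=\|x_{k+1}-x_k\|$ are our own assumptions made for this illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Proximal point method on a smooth univariate test function (our own toy example).
# We monitor the quantities appearing in the basic assumptions:
#   descent:        f(x_{k+1}) + (1/(2*lam)) * |x_{k+1} - x_k|^2 <= f(x_k)
#   relative error: |f'(x_{k+1})| = (1/lam) * |x_{k+1} - x_k|   (exact for prox steps)
f = lambda x: x ** 4
fprime = lambda x: 4.0 * x ** 3
lam = 0.5                      # proximal parameter (assumed)

x = 1.0
for k in range(10):
    prox = minimize_scalar(lambda y: f(y) + (y - x) ** 2 / (2.0 * lam))
    x_next = prox.x
    e_k = abs(x_next - x)      # surrogate for the successive change
    descent_gap = f(x) - f(x_next) - e_k ** 2 / (2.0 * lam)
    rel_err_gap = abs(fprime(x_next)) - e_k / lam
    print(f"k={k:2d}  x={x_next:+.3e}  descent_gap={descent_gap:+.2e}  rel_err_gap={rel_err_gap:+.2e}")
    x = x_next
```

Along the run, the descent gap stays nonnegative and the relative error gap stays at roundoff level, as expected from the optimality condition of the proximal subproblem.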
From now on, we assume that is a proper, l.s.c., bounded from below function. For a given sequence , use to denote the set of its accumulation points. Given , define . The next two lemmas present some useful facts needed in what follows.
Lemma 4.1 (accumulation points under basic assumptions).
Consider a sequence under assumption ()–(), and let be bounded for some . Then the following assertions hold:
(i) The set of accumulation points is nonempty and compact.
(ii) The sequence satisfies the relationships
(iii) and for any set satisfying , we have
(4.1) |
where is the collection of the first-order stationary points (2.2).
Proof.
To verify (i), observe that the descent condition () and the boundedness of ensure the boundedness of , which yields (i) since the closedness of is easily derived by the diagonal process.
To proceed with (ii), we deduce from () that
(4.2) |
Taking any allows us to find a subsequence such that . Combining this with condition () and the l.s.c. of gives us , which establishes (ii) based on the relationship provided in (4.2).
To verify the final assertion (iii), combine (i), (ii), and condition () to get . For any set satisfying , (4.1) is a direct consequence of the fact that . ∎
For a function , we say it is asymptotically shrinking around zero if and
(4.3) |
where and for all . It follows from the definition that the function with is asymptotically shrinking around zero, while is not. The next simple lemma provides an easily verifiable sufficient condition ensuring this property.
Lemma 4.2 (asymptotically shrinking functions).
Let be a nondecreasing function with and such that . Then this function is asymptotically shrinking around zero.
Proof.
Now we are ready to establish our first abstract convergence result.
Theorem 4.3 (abstract convergence framework).
For some , let be a nondecreasing continuous function with , and let be a nonnegative sequence with . Consider a sequence , which satisfies conditions ()–(), and assume that is bounded for some . Take further any point , a closed set with , and a nondecreasing function that is asymptotically shrinking around zero. Suppose that there exist , and such that the following hold whenever with :
-
(i)
The sequence shrinks with respect to the mapping , i.e.,
(4.4) -
(ii)
Surrogate of successive change grows mildly, i.e.,
(4.5) where the sequence is defined by
(4.6)
Then there exists with such that the inclusion holds for all , and therefore the estimates in (4.4) and (4.5) are also fulfilled for all . Moreover, it follows that
(4.7) |
and the sequence converges to as .
Proof.
We first see that . To show this, we observe that and assume without loss of generality that for all . Since and is asymptotically shrinking around zero, we get the equalities
(4.8) |
Recalling that is continuous, it follows from (4.8) combined with (), (), and Lemma 4.1 that there exists such that , and for all we have and
(4.9) |
The rest of the proof is split into the following two claims.
Claim 1: For every , the inclusion holds. Indeed, suppose on the contrary that there exists such that . Letting , observe that for any , we have , and thus the inequalities in (4.4) and (4.5) are satisfied. Using them together with (4.6) implies that
Rearranging the terms above brings us to the estimate
(4.10) |
Combining this with (4.9) and condition (), we arrive at
which contradicts the assumption that and thus justifies the claim.
Claim 2: It holds that . Indeed, Claim 1 tells us that the inequalities in (4.4) and (4.5) are satisfied for all . Consequently, for any and , by similar arguments as those for deriving (4.10) we get the estimate
(4.11) |
Letting and employing (4.9) yields
which gives us due to (). Therefore, is a Cauchy sequence. Since is an accumulation point of , it follows that as . ∎
In what follows, we study the convergence rate of . For each , denote and deduce from Theorem 4.3 that . For any sufficiently large, it follows from (4.11) that
(4.12) | ||||
The next theorem establishes a fast convergence rate of the iterative sequence generated by the abstract algorithm satisfying the basic assumptions (BA) under generalized metric subregularity.
Theorem 4.4 (convergence rate under generalized metric subregularity).
Let be increasing admissible functions, let be a sequence satisfying the basic assumptions ()–(), and let be an accumulation point of . Fixing any closed set with , assume that is bounded for some and the following properties are fulfilled:
-
(i)
Generalized metric subregularity holds at with respect to , meaning that there exist numbers such that
(4.13) -
(ii)
Surrogate of successive change bounded in terms of the target set, meaning that there exist and such that
(4.14) -
(iii)
Growth rate control is satisfied, meaning that , where is given by for all .
Then the sequence converges to as with the rate , i.e.,
(4.15) |
Proof.
Let us first check that, for any with , we have the estimate
(4.16) |
To this end, fix such that and deduce from (), (4.13), and (4.14) that
where the last inequality is due to the fact that is an increasing admissible function. Since is also an increasing admissible function, we get
and thus verifies (4.16). It follows from Lemma 4.1(iii), Lemma 4.2, and Theorem 4.3 (applied to , , and ) that (4.16) holds for all large , and hence converges to as .
In the rest of this section, we present discussions on some remarkable features of our abstract convergence analysis and its comparison with known results in the literature.
Remark 4.5 (discussions on abstract convergence analysis and its specifications).
Note first that, for the three assumptions imposed in Theorem 4.4, the second one (surrogate sequence bounded in terms of the target set) is often tied up with and indeed guaranteed by the construction of many common algorithms, while the other two assumptions are related to some regularity of the function with respect to the target set. In particular, in the special case where , Assumption (ii) in Theorem 4.4 is automatically satisfied for the proximal methods [5] with , and for the cubic regularized Newton methods [32] with , .
For the aforementioned proximal method from [5] to solve with a fixed proximal parameter , where is a proper l.s.c. function, the automatic fulfillment of Assumption (ii) allows us to deduce from Theorem 4.4 that if is th-order exponent metrically subregular at with respect to the first-order stationary set for , then the proximal method converges at least superlinearly with the convergence rate . To the best of our knowledge, this appears to be new to the literature. Moreover, we see in Example 4.6 below that the derived convergence rate can indeed be attained.
In Section 7, we apply the conducted convergence analysis to the higher-order regularized Newton method with momentum steps to solve , where is a -smooth function whose Hessian is Hölder continuous with exponent . As mentioned, Assumption (ii) holds automatically with respect to the second-order stationary points from (2.3). Therefore, if is th-order exponent metrically subregular at with respect to for , then the sequence generated by the algorithm converges to a second-order stationary point at least superlinearly with order . Even in the special case where , this improves the current literature on high-order regularized Newton methods by obtaining fast convergence rates under the pointwise metric subregularity (instead of the uniform metric subregularity/error bound condition as in (3.3)), and in the more general context with allowing momentum steps; see Remark 7.10 for more details.
Now we provide a simple example showing that the established convergence rate can be attained for the proximal method of minimizing a smooth univariate function.
Example 4.6 (tightness of the convergence rate).
Consider applying the proximal point method for the univariate function , , with the iterations
where is a fixed positive parameter. Then we have
This shows that , and so for all , which further implies that yielding . Take , the first-order stationary set (2.2), and get
For , we have . Choosing now and gives us and the equalities . Direct verifications show that all the conditions in Theorem 4.4 are satisfied, and thus it follows from the theorem that in a quadratic rate, e.g., .
To determine the exact convergence rate, we solve in terms of the variable . Letting provides , which implies that . Passing finally to the limit as and remembering that tell us that
which agrees with the exact quadratic convergence rate derived in Theorem 4.4.
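The exact univariate function used in Example 4.6 is not fully recoverable from our copy of the text, so the following sketch uses $f(x)=\tfrac{2}{3}|x|^{3/2}$ as a stand-in: its proximal step has a closed form, its subdifferential satisfies a second-order exponent metric subregularity condition at the unique stationary point $0$, and the iterates exhibit the quadratic rate discussed above.

```python
import numpy as np

# Proximal point method on a univariate function with 2nd-order exponent metric
# subregularity at its unique stationary point 0.  The exact function of Example 4.6
# is garbled in our copy; we take f(x) = (2/3)|x|^(3/2) as a stand-in, for which the
# prox step has a closed form and the iterates contract quadratically.
lam = 1.0                                    # proximal parameter (assumed)

def prox_step(x):
    # argmin_y (2/3)|y|^(3/2) + (y - x)^2 / (2*lam); for x >= 0 the minimizer
    # y >= 0 solves y + lam*sqrt(y) = x, i.e. sqrt(y) = (-lam + sqrt(lam^2 + 4x)) / 2.
    s = np.sign(x)
    r = (-lam + np.sqrt(lam ** 2 + 4.0 * abs(x))) / 2.0
    return s * r ** 2

x = 0.5
for k in range(6):
    x_next = prox_step(x)
    ratio = abs(x_next) / abs(x) ** 2 if x != 0 else float("nan")
    print(f"k={k}  |x_k|={abs(x):.3e}  |x_(k+1)|/|x_k|^2={ratio:.3f}")
    x = x_next
```

The printed ratios stay bounded (they approach $1/\lambda^{2}$ for this stand-in function), which is exactly the quadratic contraction pattern described above.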
5 Abstract Convergence Analysis under the KL Property
In this section, we continue our convergence analysis of abstract algorithms involving now the Kurdyka-Łojasiewicz (KL) property from Definition 2.2. Here is the main result of this section.
Theorem 5.1 (convergence under the KL properties).
Consider a sequence , which satisfies assumptions (), (), () being such that the set is bounded for some . Let be an accumulation point of the iterated sequence, and let be an l.s.c. function satisfying the KL property at , where and are taken from Definition 2.2. Suppose also that there exists a pair for which
(5.1) |
where and are increasing functions taken from conditions () and (). Then we have , and the sequence converges to as .
Proof.
It follows from (5.1) that for any there exists such that
(5.2) |
Let with given in Definition 2.2. It follows from the abstract convergence framework of Theorem 4.3 with that the claimed assertions are verified by showing that there exist constants , and such that for any with we have
(5.3) |
with the quantities defined by
To see this, recall that is an admissible function and then deduce from () and Lemma 4.1 that
and . It follows from the continuity of that there exists for which
(5.4) |
(5.5) |
Take any with and observe that if , then (5.3) holds trivially. Assuming that and then using () and (5.4), we get , which being combined with the KL property and condition () shows that and . Since , the aforementioned conditions readily yield the estimates
(5.6) |
The concavity of in the KL property and assumption () imply that
(5.7) | ||||
Substituting (5.6) into (5.7) guarantees that
which establishes in turn that
Combining the latter with (5.2) and (5.5), we get
which shows that (5.3) holds with and and thus completes the proof. ∎
Now we employ Theorem 5.1 and the results of the previous section to find explicit conditions ensuring superlinear convergence of an abstract algorithm under the KL property. Taking from Definition 2.2 and the admissible functions and from Theorem 5.1, define by
and consider the function given by for all .
Theorem 5.2 (superlinear convergence rate under the KL property).
Let , , , , , , and be as in the general setting of Theorem 5.1 under the assumptions (), (), and (). Take the function as defined above and assume in addition that it is increasing on with . The following assertions hold:
-
(i)
If there exists such that for all large , then the sequence converges to as with the convergence rate
(5.8) -
(ii)
If there exist and a closed set satisfying and such that
(5.9) then the sequence converges to as with the convergence rate
(5.10)
Proof.
Suppose without loss of generality that for all . It follows from Theorem 4.3 that (5.6) holds when is sufficiently large and that there exists such that
The latter readily implies that
(5.11) |
Furthermore, condition () and the first part of (5.4) tell us that
which together with (5.11) yields . Combining this with the limiting conditions and allows us to find a sufficiently large number such that
Invoking again condition () ensures that for any we have the estimates
(5.12) |
Now we are in a position to justify both assertions of the theorem. Starting with (i), note that
for large , where the last inequality follows from the increasing property of ensuring that
for any . Observing that , it readily follows from and (5.12) that
which brings us to (5.8) and thus justifies assertion (i).
We conclude this section with some specifications of the convergence results obtained under the KL property and comparisons with known results in the literature.
Remark 5.3 (further discussions and comparisons for KL convergence analysis).
Observe first that assumption (5.1) in Theorem 5.1 holds automatically for many common algorithms. For example, in the standard proximal methods [5] we have and , and thus by setting in Theorem 5.1, condition (5.1) holds by the classical AM–GM inequality.
In the case where with the KL exponent , we can obtain the explicit sublinear, linear, and superlinear convergence rates. In fact, let be such that , where and . It is easy to see that the assumption (5.1) in Theorem 5.1 holds with due to the AM–GM inequality. In this case, direct verifications show that
Suppose that condition (5.9) holds, which is true for various common numerical methods such as proximal-type methods and high-order regularized methods. If , then Theorem 5.2 implies that the sequence converges to superlinearly with the convergence rate . It will be seen in Section 7 that if we further assume that is a -smooth function and satisfies the strict saddle point property, then Theorem 4.4 and Corollary 6.5 allow us to improve the superlinear convergence rate to .
Following the arguments from [4, Theorem 2], we can also obtain convergence rates in the remaining cases for . Indeed, when there exist and such that for large we get
If finally , then there exists ensuring the estimates
The proof of this fact is rather standard, and so we omit details here.
Let us highlight that the superlinear convergence rate derived under the KL assumptions need not be tight in general, and it can be worse than the one derived under the generalized metric subregularity. For instance, consider the situation in Example 4.6, where the proximal point method is applied to . As discussed in that example, the sequence generated by the algorithm converges with a quadratic rate, and the convergence rate agrees with the rate derived under the generalized metric subregularity. Direct verification shows that the function satisfies the KL property with exponent , and that we have for the proximal point method. Thus Theorem 5.2 only implies that converges superlinearly with rate while not quadratically as under generalized metric subregularity.
6 Generalized Metric Subregularity for Second-Order Conditions
In this section, we continue our study of generalized metric subregularity while concentrating now on deriving verifiable conditions for the fulfillment of this notion with respect to the set of second-order stationary points defined in (2.3), where the function in question is -smooth. Such a second-order generalized metric subregularity plays a crucial role in the justification of fast convergence of our new high-order regularized Newton method in Section 7. We’ll see below that this notion holds in broad settings of variational analysis and optimization and is satisfied in important practical models.
Note that if is convex, then there is no difference between and the set of first-order stationary points (2.2). We focus in what follows on nonconvex settings for generalized metric subregularity, where these two sets may differ, as in the following simple while rather instructive example.
Example 6.1 (generalized metric subregularity for nonconvex functions).
Consider the function given by with and . Direct computations show that and thus
where stands for the identity matrix. Since is negative-definite, is nonconvex. Moreover, and . Therefore, for any we have
and thus satisfies the generalized metric subregularity with respect to for and .
Let us now introduce a new property that plays an important role below. Given a proper l.s.c. function , we say that has the weak separation property (WSP) at if there is with
(6.1) |
where is taken from (2.2). The constant is called the modulus of weak separation. If the inequality in (6.1) is replaced by equality, such a property is frequently used in the study of error bounds under the name of “separation property” (e.g., in [21, 25]), which inspires our terminology. While the separation property obviously yields WSP, the converse fails. For example, consider if and if , where . Take and deduce from the definition that satisfies WSP at . On the other hand, for all , and hence there exists such that and , which tells us that the separation property fails.
The next proposition shows that a large class of functions satisfies the weak separation property.
Proposition 6.2 (sufficient conditions for weak separation).
Let be a proper l.s.c. function, and let . Suppose that either one of the following conditions holds:
-
(i)
is continuous at with respect to , i.e., , and it satisfies the KL property at .
-
(ii)
, where is a convex function, and where is a smooth mapping such that the Jacobian matrix is surjective.
Then enjoys the weak separation property at .
Proof.
To verify (i), suppose on the contrary that WSP fails at . Then there exists a sequence with such that for all . It follows from the continuity of at with respect to that for all large , where is taken from the assumed KL property at . The latter condition implies that
while contradicting and hence justifying the claimed WSP. Assertion (ii) follows directly from Lemma 3.6, which therefore completes the proof of the proposition. ∎
Another concept used below to study the second-order generalized metric subregularity applies to -smooth functions. Given such a function, recall that is its strict saddle point if . We say that satisfies the strict saddle point property at if is either a local minimizer for , or a strict saddle point for . If satisfies the strict saddle point property at any first-order stationary point, then we simply say that satisfies the strict saddle point property. Furthermore, if for any first-order stationary point of , it is either a global minimizer for , or a strict saddle point for , then we say that satisfies the strong strict saddle point property. It is known from Morse theory that the strict saddle point property holds generically for -smooth functions. Moreover, it has been recognized in the literature that various classes of nonconvex functions arising in matrix completion, phase retrieval, neural networks, etc. enjoy the strict saddle point property; see some instructive examples below.
Before establishing the main result of this section, we present one more proposition of its own interest telling us that, around a local minimizer, the distance to , the set of first-order stationary points, agrees with the distance to , the set of second-order stationary points, under WSP. Let us emphasize that in general and are not the same while the distances to and are under the imposed assumptions.
Proposition 6.3 (distance to stationary points).
Given a -smooth function and for some , suppose that WSP holds at with modulus . Denoting , we have the equality
(6.2) |
Proof.
It follows from (6.1), the fact that , and the choice of that
This implies that for any we get
(6.3) | ||||
where the first equality is due to (by ) and . Recalling that leads us to
and to the distance estimate for all . Therefore, taking from the above neighborhood of brings us to the equality
(6.4) | ||||
Observing finally that , we deduce from (6.3) and (6.4) that
for any , which justifies (6.2) and thus completes the proof of the proposition. ∎
Now we are ready to derive sufficient conditions for generalized metric subregularity broadly used below.
Theorem 6.4 (sufficient conditions for generalized metric subregularity).
Let be a -smooth function, let be an admissible function, and let . Suppose that both WSP and strict saddle point property hold for at . Then has the generalized metric subregularity property with respect to at if and only if has this property at with respect to . In particular, if is a subanalytic -smooth function satisfying the strict saddle point property, then enjoys the generalized metric subregularity property with respect to at , where for some .
Proof.
Note first that by we have and . Since satisfies the strict saddle point property, must be a local minimizer for . Therefore, the first conclusion of the theorem follows immediately from Proposition 6.3.
Suppose now that is a subanalytic -smooth function satisfying the strict saddle point property. As mentioned earlier, subanalytic functions satisfy the KL property [6], and hence we deduce from Proposition 6.2 that the weak separation property holds at . By [7, Proposition 2.13(ii)], the subanalytic property of ensures that the functions and are subanalytic. Observing that , we deduce from the Lojasiewicz factorization lemma for subanalytic functions in [6, Theorem 6.4] that for each compact set there exist and such that whenever . This verifies the second conclusion of the theorem and thus completes the proof. ∎
Next we provide some sufficient conditions for the exponent metric subregularity with explicit exponents for second-order stationary sets.
Corollary 6.5 (exponent metric subregularity for second-order stationary points).
Let be a -smooth function, and let for the second-order stationary set (2.3). Suppose that either one of the following conditions holds:
-
(i)
satisfies the strict saddle point property and the KL property at with the KL exponent .
-
(ii)
, where is a convex function satisfying the KL property at with the KL exponent , and where is a smooth mapping with the surjective derivative .
Then enjoys the generalized metric subregularity property at with respect to , where .
Proof.
For (i), we get by the KL property of at with the exponent that there are such that
(6.5) |
whenever . Using the continuity of and shrinking if needed tell us that (6.5) holds for all . This together with [20, Lemma 6.1] implies that there exist and such that
Shrinking again if necessary, we deduce from Propositions 6.2 and 6.3 that
Thus there exists with for all , and so (i) holds.
In the rest of this section, we discuss three practical models that often appear in, e.g., machine learning, image processing, and statistics for which the obtained sufficient conditions allow us to verify the second-order generalized metric subregularity.
Example 6.6 (over-parameterization of compressed sensing models).
Consider the $\ell_1$-regularization problem, also known as the Lasso problem:
$$\min_{x\in\mathbb{R}^n}\ \frac{1}{2}\|Ax-b\|^2+\mu\|x\|_1, \tag{6.6}$$
where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^m$, $\mu>0$, and $\|\cdot\|_1$ is the usual $\ell_1$ norm. A recent interesting way to solve this problem [34, 35, 36] is to transform it into an equivalent smooth problem
$$\min_{u,v\in\mathbb{R}^n}\ \frac{1}{2}\|A(u\circ v)-b\|^2+\frac{\mu}{2}\big(\|u\|^2+\|v\|^2\big), \tag{6.7}$$
where $u\circ v$ is the Hadamard (entrywise) product of the vectors $u$ and $v$ in the sense that $(u\circ v)_i=u_iv_i$ for $i=1,\ldots,n$. A nice feature of this over-parameterized model is that its objective function is a polynomial (and so smooth and subanalytic), and it satisfies the strong strict saddle point property; see [35, Appendix C] for the detailed derivation, and [34, 36] for more information.
To proceed, we first use Theorem 6.4 to see that the generalized metric subregularity holds for with respect to , where for some . In fact, it is possible to identify the exponent explicitly. As shown in [34], the fulfillment of the strict complementarity condition
(6.8) |
where ‘’ denotes the relative interior, ensures that satisfies the KL property at any with the KL exponent . In the absence of the strict complementarity condition, it is shown in [34] that satisfies the KL property at any with the KL exponent .
Therefore, Corollary 6.5 tells us that for any satisfying the strict complementarity condition (6.8), enjoys the standard metric subregularity property (as ) at with respect to . In the absence of the strict complementarity condition, satisfies the generalized metric subregularity property at with respect to , where .
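For readers who want to experiment with this model, the snippet below sets up a small random instance of the over-parameterized objective (6.7) and its gradient; the problem sizes, the regularization weight, and the data are arbitrary choices made for illustration.

```python
import numpy as np

# Over-parameterized reformulation (6.7) of the Lasso problem (6.6):
# replace x by the Hadamard product u * v and penalize (||u||^2 + ||v||^2)/2,
# which upper-bounds ||u*v||_1 with equality at |u| = |v|.  The smooth objective
# below and its gradient are a sketch; mu, A, b are randomly generated here.
rng = np.random.default_rng(0)
m, n, mu = 20, 50, 0.1
A = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true

def F(u, v):
    r = A @ (u * v) - b
    return 0.5 * r @ r + 0.5 * mu * (u @ u + v @ v)

def grad_F(u, v):
    r = A @ (u * v) - b
    g = A.T @ r                              # gradient of the data-fit term w.r.t. x = u*v
    return g * v + mu * u, g * u + mu * v    # chain rule through the Hadamard product

u = rng.standard_normal(n); v = rng.standard_normal(n)
gu, gv = grad_F(u, v)
print("objective:", F(u, v), " |grad|:", np.linalg.norm(np.concatenate([gu, gv])))
```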
Example 6.7 (best rank-one matrix approximation).
Consider the following function that appears in the rank-one matrix approximation
where we suppose for simplicity that is a symmetric matrix with its maximum eigenvalue . Direct verification shows that
and hence we have . Let us check next that , where
Indeed, take and observe that if , then having a negative eigenvalue. This is impossible by . For , is an eigenvector of with the associated eigenvalue . Suppose that and get, for any unit eigenvector associated with , that
This is also impossible, and so implying that Thus .
Now take and see that for any we have
which shows that the reverse inclusion holds, and the claim follows.
Suppose next that the largest eigenvalue is simple, i.e., its geometric multiplicity is one; note that this condition holds generically. Then the eigenspace of associated with is one-dimensional, and so consists only of two elements , where is the eigenvector associated with the largest eigenvalue, and of length . Both of these two elements are global minimizers of . Hence in this case, satisfies the strong strict saddle point property. Furthermore, it follows from Theorem 6.4 that the generalized metric subregularity holds for with respect to , where for some . Indeed, direct computations show that at , where is the eigenvector associated with the largest eigenvalue, we have . This tells us that for all , and that if and only if and , which happens only when . Therefore, , and so the metric subregularity holds for with respect to at any .
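Since the objective in this example is partially elided in our copy, the following check assumes the common choice $f(x)=\tfrac{1}{4}\|xx^{T}-A\|_{F}^{2}$, for which $\nabla f(x)=\|x\|^{2}x-Ax$ and $\nabla^{2}f(x)=\|x\|^{2}I+2xx^{T}-A$, and verifies numerically that $\pm\sqrt{\lambda_{1}}\,v_{1}$ are second-order stationary points when the largest eigenvalue $\lambda_{1}>0$ is simple.

```python
import numpy as np

# Best rank-one approximation sketch.  The extracted text elides the exact
# objective; we assume f(x) = (1/4) * ||x x^T - A||_F^2 with A symmetric,
# for which grad f(x) = ||x||^2 x - A x and Hess f(x) = ||x||^2 I + 2 x x^T - A.
# We check numerically that x* = sqrt(lambda_1) * v_1 is a second-order
# stationary point when lambda_1 > 0 is the (simple) largest eigenvalue.
rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = (B + B.T) / 2.0

eigvals, eigvecs = np.linalg.eigh(A)
lam1, v1 = eigvals[-1], eigvecs[:, -1]          # eigh returns eigenvalues in ascending order
x_star = np.sqrt(max(lam1, 0.0)) * v1

grad = np.dot(x_star, x_star) * x_star - A @ x_star
hess = np.dot(x_star, x_star) * np.eye(n) + 2.0 * np.outer(x_star, x_star) - A
print("||grad f(x*)|| =", np.linalg.norm(grad))
print("min eigenvalue of Hess f(x*) =", np.linalg.eigvalsh(hess).min())
```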
Example 6.8 (generalized phase retrieval problem).
Consider the optimization problem
where is a strictly convex subanalytic function such that , and where are defined by with , , and , . We consider the two typical choices for as and , , and the two typical choices for given by:
-
•
the polynomial loss function with ;
-
•
the pseudo-Huber loss function in the form with .
This class of problems arises in phase retrieval models (see [8, 9, 40]) when the goal is to recover an unknown signal from the observed magnitude of some linear transformation such as, e.g., the Fourier transform.
Below we consider the two settings: (I) the vectors are linearly independent; (II) the vectors are independent and identically distributed (i.i.d.) standard Gaussian random vectors, for some and . Note that in the first setting, the linear independence condition can be easily verified, and it is usually satisfied when is much larger than . The second setting is also frequently considered in the literature; see, e.g., [8, 9, 40].
First we consider setting (I), where is linearly independent, to verify that satisfies the generalized metric subregularity property with respect to the second-order stationary set . Observe that
and thus it follows from the linear independence condition that
and for all we have Let us now show that
(6.9) |
The inclusion “” in (6.9) is obvious. To check the reverse inclusion, pick and suppose that there exists such that . By , we have , which tells us that . Since and is strictly convex (and hence is strictly increasing), it follows that . For , we then have , which contradicts the assumption that and thus justifies (6.9). Observe further that
telling us that reduces to the set of global minimizers. Therefore, the strong strict saddle point property holds for if the linear independence condition is satisfied. Then it follows from Theorem 6.4 that generalized metric subregularity is fulfilled for with respect to , where for some .
Let us further specify the exponent in both bullet cases above by using Corollary 6.5. In the polynomial case with , the direct verification shows that for some . Thus satisfies the KL property with the KL exponent . Let for all , where the components are defined above. We deduce from the previous considerations that the inclusion implies that whenever . Hence for any , the derivative operator is surjective by the linear independence assumption. Corollary 6.5 tells us in this case that the generalized metric subregularity holds for with respect to , where .
In the case where is the pseudo-Huber loss, we have for all , and it is not hard to verify that the function satisfies the KL property at with the KL exponent . Note also that with for all , and so Corollary 6.5 implies that the standard metric subregularity holds for at any with respect to .
Next we consider the second setting (II). As shown in [8] in the typical case of , there exist positive constants such that if , where is the number of samples and is the dimension of the underlying space, then with probability at least the only local minimizers of are (which are also global minimizers), is strongly convex in a neighborhood of , and all the other critical points are either strict saddle points or local maximizers with strictly negative definite Hessians. Thus in this setting, when the number of samples is sufficiently large, we get with high probability that , and that for any the strong saddle point property holds at . It follows therefore from Corollary 6.5 that the standard metric subregularity is satisfied for at with respect to . Note finally that in the case where , a similar result on the saddle point property is obtained in [9] under the stronger assumption that for some positive number .
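The precise loss used in this example is only partially visible in our copy, so the sketch below takes the common quartic instance $f(x)=\frac{1}{4m}\sum_{i}\big((a_{i}^{T}x)^{2}-b_{i}\big)^{2}$ corresponding to the polynomial loss choice, generates i.i.d. Gaussian measurement vectors as in setting (II), and checks that the ground truth $\pm\bar{x}$ are second-order stationary points of this model.

```python
import numpy as np

# Generalized phase retrieval sketch under setting (II): i.i.d. Gaussian
# measurement vectors a_i and observations b_i = (a_i^T x_bar)^2.  The exact
# loss used in the paper is partially elided here; we take the common quartic
# instance f(x) = (1/(4m)) * sum_i ((a_i^T x)^2 - b_i)^2 and check that the
# ground truth +/- x_bar are second-order stationary points of this model.
rng = np.random.default_rng(2)
n, m = 10, 200
A = rng.standard_normal((m, n))
x_bar = rng.standard_normal(n)
b = (A @ x_bar) ** 2

def grad(x):
    z = A @ x
    return A.T @ ((z ** 2 - b) * z) / m

def hess(x):
    z = A @ x
    return (A.T * (3.0 * z ** 2 - b)) @ A / m

for x in (x_bar, -x_bar):
    print("||grad|| =", np.linalg.norm(grad(x)),
          " min eig Hess =", np.linalg.eigvalsh(hess(x)).min())
```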
7 High-order Regularized Newton Method with Momentum
In this section, we design a new high-order Newtonian method with momentum and justify its fast convergence by using the above results on generalized metric subregularity and related investigations.
For the optimization problem , the following assumptions are imposed:
Assumption 7.1 (major assumptions).
The objective function satisfies the conditions:
(1) is -smooth and bounded below, i.e., .
(2) There exists a closed convex set with nonempty interior such that , where is a starting point of the iterative process.
(3) The gradient is Lipschitz continuous with modulus on , and the Hessian of is Hölder-continuous on with exponent , i.e., there exist numbers and such that for all .
Given and , define the th-order regularized quadratic approximation for at by
(7.1) |
where serves as a regularization parameter. Consider also
(7.2) |
Here is our high-order Newtonian method with momentum steps.
(7.3) |
(7.4) |
(7.5) |
(7.6) |
Note that when and (i.e., the Hessian is locally Lipschitzian and there is no momentum step), this algorithm reduces to the classical cubic regularized Newton method proposed in [32]. In the case where (i.e., no momentum steps are considered), this method has also been considered in [15] with a comprehensive analysis of its complexity. The superlinear/quadratic convergence for the cubic Newton method was first established in [32] under the nondegeneracy condition, and then was relaxed to the uniform metric subregularity/error bound condition on in [40]. We also note that, if can be obtained easily, then the parameter in Algorithm 1 can be chosen as . Otherwise, the parameter therein can be obtained by using line search strategies outlined in [32]. Incorporating momentum steps is a well-adopted technique in first-order optimization to accelerate numerical performances. Therefore, it is conceivable to also consider incorporating the momentum technique into high-order regularization methods.
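Since several details of Algorithm 1 (the exact momentum rule and the parameter updates) are only partially visible in our copy of the text, the following Python sketch should be read as one plausible instance rather than the paper's algorithm: it performs a regularized Newton step built from a model of the form $\langle\nabla f(x_k),d\rangle+\frac{1}{2}\langle\nabla^{2}f(x_k)d,d\rangle+\frac{\sigma}{2+q}\|d\|^{2+q}$, tries a simple extrapolation (momentum) point, and keeps it only if it does not increase the objective.

```python
import numpy as np
from scipy.optimize import minimize

# Schematic sketch of a q-th order regularized Newton step with a momentum trial.
# The momentum rule and parameter updates of Algorithm 1 are only partially visible
# in our copy of the text, so the acceptance test below (keep the extrapolated point
# only if it does not increase f) is an assumption, not the paper's exact rule.
def regularized_newton_momentum(f, grad, hess, x0, sigma=1.0, q=1.0, beta=0.5, iters=30):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        g, H = grad(x), hess(x)
        # Regularized model in the spirit of (7.1):
        #   <g, d> + 0.5 <H d, d> + sigma/(2+q) * ||d||^(2+q)
        model = lambda d: g @ d + 0.5 * d @ H @ d + sigma / (2.0 + q) * np.linalg.norm(d) ** (2.0 + q)
        d = minimize(model, np.zeros_like(x)).x     # generic solver for the subproblem
        y = x + d                                   # regularization step
        z = y + beta * (y - x_prev)                 # momentum (extrapolation) trial point
        x_prev, x = x, (z if f(z) <= f(y) else y)   # assumed safeguard: accept momentum only if it helps
    return x

f = lambda x: 0.25 * np.sum(x ** 4) + 0.5 * np.sum(x ** 2)
grad = lambda x: x ** 3 + x
hess = lambda x: np.diag(3.0 * x ** 2 + 1.0)
print(regularized_newton_momentum(f, grad, hess, np.array([2.0, -1.5])))
```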
To illustrate this, we see from the following plots (Figures 1 and 2) that incorporating the momentum steps into the cubic Newton methods in solving phase retrieval and matrix completion problems can lead to faster convergence with comparable quality of numerical solutions. Here we follow the setup as in [40] for both problems with dimension , and use their code (available at https://github.com/ZiruiZhou/cubicreg app.git) for implementing the cubic regularized Newton method. For the cubic regularized Newton method with momentum steps, we choose and set in Algorithm 1 and follow the same approach as in [40] to solve the subproblems (7.3). Then we use the same notion of relative error as in [40] and plot the relative error (in the log scale) in terms of the number of iterations. While it is clear from the presented numerical illustrations that incorporating momentum steps can have numerical advantages, further extensive numerical experiments are definitely needed; this goes beyond the scope of this paper and will be carried out in a separate study.


Let us emphasize that, to the best of our knowledge, superlinear/quadratic convergence for high-order regularized methods with momentum steps has not been considered in the literature. We also mention that, even when it is specialized to the celebrated cubic regularized Newton method without momentum steps, our assumptions for quadratic convergence are strictly weaker than the ones imposed in [40].
Note that in the regularization step, Algorithm 1 requires computing , which is a global minimizer of . This can be done by solving either an equivalent convex optimization problem, or an eigenvalue problem; see, e.g., [15, 32, 40]. The necessary and sufficient optimality conditions for global minimizers of the regularized quadratic approximation (7.1) presented in the next lemma are taken from [16].
Lemma 7.2 (optimality conditions for regularized quadratic approximations).
Given , we have that is a global minimizer of (7.1) if and only if the following conditions hold:
(7.7) |
(7.8) |
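In the spirit of Lemma 7.2, a global minimizer $d$ of the regularized model satisfies $(\nabla^{2}f(x)+\sigma\|d\|^{q}I)d=-\nabla f(x)$ with $\nabla^{2}f(x)+\sigma\|d\|^{q}I\succeq 0$, so the subproblem reduces to a one-dimensional root-finding problem in $r=\|d\|$. The sketch below is our own illustration of this reduction, assuming the nondegenerate case where the regularized Hessian is positive definite at the solution.

```python
import numpy as np

# Solving the regularized Newton subproblem via optimality conditions in the
# spirit of Lemma 7.2: a global minimizer d satisfies (H + sigma*||d||^q I) d = -g
# with H + sigma*||d||^q I >= 0.  Writing r = ||d||, this is a scalar equation
# ||(H + sigma*r^q I)^{-1} g|| = r, solved below by bisection, assuming the
# strict inequality lambda_min(H) + sigma*r^q > 0 at the solution (the "easy" case).
def solve_subproblem(g, H, sigma=1.0, q=1.0, tol=1e-12):
    lam_min = np.linalg.eigvalsh(H).min()
    r_lo = max(0.0, (-lam_min / sigma)) ** (1.0 / q) + 1e-12   # ensure positive definiteness
    phi = lambda r: np.linalg.norm(np.linalg.solve(H + sigma * r ** q * np.eye(len(g)), -g)) - r
    r_hi = max(1.0, 2.0 * r_lo)
    while phi(r_hi) > 0:                                       # expand until the norm falls below r
        r_hi *= 2.0
    while r_hi - r_lo > tol:
        r_mid = 0.5 * (r_lo + r_hi)
        r_lo, r_hi = (r_mid, r_hi) if phi(r_mid) > 0 else (r_lo, r_mid)
    r = 0.5 * (r_lo + r_hi)
    return np.linalg.solve(H + sigma * r ** q * np.eye(len(g)), -g)

H = np.array([[2.0, 0.0], [0.0, -1.0]])       # indefinite Hessian
g = np.array([1.0, 0.5])
d = solve_subproblem(g=g, H=H, sigma=2.0, q=1.0)
print("d =", d, " ||d|| =", np.linalg.norm(d))
```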
For the reader’s convenience, we also formulate straightforward consequences of our major requirements in Assumption 7.1 that can be found in [15, (2.7) and (2.8)].
Lemma 7.3 (consequences of major assumptions).
Suppose that all the requirements in Assumption 7.1 are satisfied. Then for any we get the estimates
(7.9) |
(7.10) |
The next technical lemma is needed for the convergence justification below.
Lemma 7.4 (relationships for Hölder exponents).
Given and , the following hold.
(i) For each , there is a unique solution to the equation
Moreover, the function is decreasing on .
(ii) Letting and , suppose that
(7.11) |
Then for such numbers and , we have the inequality .
Proof.
To verify (i), we define the function for all and denote with , which occurs to be single-valued as shown below. Direct calculations give us the expressions
and hence is increasing on . Combining this with the facts that and for any guarantees the existence of such that is decreasing on and increasing on . Observing that and for any , we see that the mapping with is single-valued and satisfies the implication
(7.12) |
Defining further the univariate function
we claim that is decreasing on with the range . This clearly ensures that its inverse function is decreasing on with a range of , and so (i) follows.
Thus to prove (i), it remains to verify the formulated claim. Direct calculations show that
(7.13) | ||||
For all , we consider the function and observe that and and , which allows us to deduce from (7.13) that is decreasing on . Combining this with the facts that and justifies the above claim and hence the entire assertion (i).
To verify now assertion (ii), we deduce from (7.11) that . Letting for some and substituting it into the above inequality tells us that
(7.14) |
If , then (7.11) implies that , and so (ii) trivially holds. To justify the inequality , it suffices to consider the case where . Dividing both sides of (7.14) by , we get
and therefore the claimed inequality follows from (7.12) and the fact that . The proof is complete. ∎
The following lemma is an important step to justify the desired performance of Algorithm 1.
Lemma 7.5 (estimates for regularized quadratic approximations).
Under Assumption 7.1, let , and let be a projection point of to . If , then for any we have the estimate
(7.15) |
where is a minimizer of the regularized quadratic approximation from (7.1) and (7.2) with defined as the unique positive solution of the equation with taken from part (3) of Assumption 7.1.
Proof.
Denote and deduce from condition (7.2) and Lemma 7.2 that
(7.16) |
By , we have and . Then it follows from (7.16) that
Combining the latter with gives us
and then employing , Lemma 7.3, and Assumption 7.1(3), we arrive at
(7.17) | ||||
Applying further the triangle inequality ensures that
which being substituted into (7.17) yields
By Lemma 7.4(ii), this allows us to find a positive scalar (depending only on ) such that . Recalling that and verifies the desired inequality (7.15). ∎
The last lemma here plays a significant role in the convergence analysis of our algorithm. Its proof is rather technical and mainly follows the lines of justifying fast convergence of the cubic regularization method in [40] with additional adaptations to accommodate the momentum steps and the Hölderian continuity of Hessians. For completeness and the reader’s convenience, we present details in the Appendix (Section 9).
Lemma 7.6 (estimates for iterates of the high-order regularized method with momentum).
Let be a sequence of iterates in Algorithm 1 under Assumption 7.1. If the set is bounded for some , then the following assertions hold.
(i) For all with , we have the inequalities
(7.18) |
and the limit exists.
(ii) For all with , the relationships
(7.19) |
(7.20) |
are fulfilled, where the number in (7.19) is defined by
(7.21) |
(iii) The set of accumulation points of is nonempty. Moreover, for any we have
(7.22) |
Now we are in a position to establish the main result on the performance of Algorithm 1.
Theorem 7.7 (convergence rate of the high-order algorithm under generalized metric subregularity).
Let the iterative sequence be generated by Algorithm 1, and let be its accumulation point provided that the set is bounded for some . In addition to Assumption 7.1, suppose that there exists such that the generalized metric subregularity condition
(7.23)
holds with respect to , where is an admissible function, and where is the set of second-order stationary points (2.3). Define the function by
(7.24)
where with taken from Lemma 7.5, and where is given in (7.21). If , then and the sequence converges to as with the convergence rate
Proof.
By Lemma 7.6, we have the inclusion together with all the conditions in (7.18)–(7.20). Let us now verify the estimate
(7.25)
for all sufficiently large. Having this estimate at hand, we arrive at all of the claimed conclusions of the theorem by applying the abstract convergence results of Theorem 4.4 with and .
To justify (7.25), let be a projection point of to . Then we deduce from Lemma 4.1(iii) that
(7.26)
Moreover, observing that for all and that remains bounded allows us to find a compact set such that . Based on and the conditions given by (7.26), we get that . Coupling this with the inclusion and the compactness of indicates that for all large . Consequently, there exists such that whenever . Employing Lemmas 7.4 and 7.5 together with for all gives us the number from Lemma 7.5 ensuring the estimate
which justifies (7.25) and thus completes the proof of the theorem. ∎
As a consequence of Theorem 7.7, we show now that if enjoys the (pointwise) exponent metric subregularity property with respect to the second-order stationary set , then the sequence generated by our algorithm converges to a second-order stationary point at least superlinearly with an explicit convergence order under a suitable condition on the exponent.
Corollary 7.8 (superlinear convergence under exponent metric subregularity).
Let satisfy Assumption 7.1 with the -th order Hölder continuous Hessian for some , let be generated by Algorithm 1, and let the set be bounded for some . Taking and , suppose that there exist such that
If , then converges at least superlinearly to as with the rate , i.e.,
Proof.
Choosing , we have and deduce the conclusion from Theorem 7.7. ∎
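To see how an estimate of this kind produces an explicit superlinear order, consider the following schematic computation with generic placeholders c > 0 and rho > 1 (these are not the specific constants and exponents of Corollary 7.8): if the distances d_k of the iterates to the target set obey a one-step recursion of power type, then

```latex
% Schematic propagation of a power-type recursion into a convergence rate of order rho > 1.
\[
   d_{k+1} \le c\, d_k^{\rho}
   \;\Longrightarrow\;
   c^{\frac{1}{\rho-1}} d_{k+1} \le \bigl(c^{\frac{1}{\rho-1}} d_k\bigr)^{\rho}
   \;\Longrightarrow\;
   d_k \le c^{-\frac{1}{\rho-1}} \bigl(c^{\frac{1}{\rho-1}} d_0\bigr)^{\rho^{k}},
\]
% so d_k -> 0 superlinearly with order rho whenever c^{1/(rho-1)} d_0 < 1.
```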
Following the discussions in Remark 5.3, we can establish global convergence of the sequence under the KL property, together with convergence rates depending on the KL exponent. For brevity, the proof is omitted.
Proposition 7.9 (convergence under the KL property).
Let satisfy Assumption 7.1 with the -th order Hölder continuous Hessian for some . Let be generated by Algorithm 1, and let the set be bounded for some . Take and suppose that satisfies the KL property at . Then the sequence converges to , which is a second-order stationary point of . If furthermore satisfies the KL property at with the KL exponent , then we have the following convergence rate:
If , then the sequence converges linearly.
If , then the sequence converges sublinearly with the rate .
If , then the sequence converges superlinearly with the rate .
Remark 7.10 (comparison with known results).
To the best of our knowledge, the derived (superlinear/linear/sublinear) convergence rates are new in the literature for high-order regularization methods with momentum steps without any convexity assumptions. To compare in more detail, consider the case where , i.e., the metric subregularity of at holds with respect to the second-order stationary set . In this case, when applying the high-order regularization methods with momentum steps to a -smooth function with the th-order Hölder continuous Hessian, we obtain the superlinear convergence of with rate to a second-order stationary point. Suppose further that and no momentum step is used. In that case, we recover the quadratic convergence rate to a second-order stationary point for the cubic regularized Newton method, but under the weaker assumption of the pointwise metric subregularity at instead of the stronger error bound condition used in [40].
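To make the cubic special case just discussed more tangible, the following sketch implements a cubic-regularized Newton iteration with a simple extrapolation step standing in for the momentum step. It is not the paper's Algorithm 1: the fixed regularization parameter sigma, the extrapolation coefficient beta, and the use of a generic solver for the cubic subproblem are simplifying assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def cubic_newton_momentum(grad, hess, x0, sigma=1.0, beta=0.1, tol=1e-8, max_iter=100):
    """Illustrative cubic-regularized Newton iteration with a heavy-ball style extrapolation.

    Not the paper's Algorithm 1: parameter choices and the acceptance rule are simplified.
    """
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(max_iter):
        y = x + beta * (x - x_prev)  # momentum (extrapolation) step
        g, H = grad(y), hess(y)
        if np.linalg.norm(g) <= tol:
            return y
        # Cubic-regularized model around y: m(d) = g^T d + 0.5 d^T H d + (sigma/3) ||d||^3.
        model = lambda d: g @ d + 0.5 * d @ H @ d + (sigma / 3.0) * np.linalg.norm(d) ** 3
        d = minimize(model, np.zeros_like(y), method="BFGS").x
        x_prev, x = x, y + d
    return x


# Hypothetical test on a degenerate quartic whose Hessian vanishes at the minimizer.
grad = lambda x: x ** 3
hess = lambda x: np.diag(3.0 * x ** 2)
print(cubic_newton_momentum(grad, hess, np.array([1.0, -2.0])))
```

The quartic test function is deliberately degenerate (its Hessian vanishes at the minimizer), in the spirit of the examples below.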
Next we provide two examples with explicit degeneracy illustrating that our convergence results go beyond the current literature on regularized Newton-type methods of optimization.
Example 7.11 (fast and slow convergence without the error bound condition).
Consider the function with , , , and . Direct calculations give us the exact expressions for the first-order stationary set and the second-order stationary set . For any , we see that has a zero eigenvalue. Similarly to Example 3.5, it is easy to check that the uniform metric subregularity/error bound condition for with respect to fails in this case.
Let , where and . If , then there exists such that for all , and so the metric subregularity condition of holds at with respect to . On the other hand, direct verification shows that satisfies the KL property at with the KL exponent for .
Let be a compact set, and let be such that Assumption 7.1 is satisfied. Then we see that the Hessian of is Lipschitz continuous on , i.e., . This yields the convergence of the iterative sequence in Algorithm 1. If , then Corollary 7.8 implies that the convergence rate is quadratic. If , then it follows from Proposition 7.9 that converges to sublinearly with the rate .
Example 7.12 (linear and sublinear convergence under the KL property).
For and , define the -smooth function by
We have that and that satisfies the KL property at any with the KL exponent . On the other hand, the error bound property of with respect to fails here.
Let be a compact set, and let be such that Assumption 7.1 is satisfied. It can be easily verified that is th-order Hölder continuous on with . If , then , and the sequence of iterates of Algorithm 1 converges to linearly by Proposition 7.9. If , then , and so converges to sublinearly with the rate . Observe that for , the linear convergence can be indeed achieved. In this case, we have and , and then take for all and . Assuming without loss of generality that is an infinite sequence, we get by the construction of Algorithm 1 that
Letting gives us . Suppose now that , and then and . This implies that and hence . Therefore, and so with . Arguing by induction tells us that and for all . This yields , which confirms that the convergence rate is linear.
Finally in this section, we specify our results for the case where the strict saddle point property holds, which allows us to establish stronger convergence guarantees. Observe to this end that if the objective function satisfies the strict saddle point property (resp. strong strict saddle point property), then Algorithm 1 indeed converges to a local minimizer (resp. global minimizer) of with the corresponding superlinear rate. In particular, this covers the situations discussed in the illustrative examples of Section 6.
The next corollary addresses a concrete case of the over-parameterized compressed sensing model. In this case, Algorithm 1 converges to a global minimizer with a quadratic convergence rate under the strict complementarity condition.
Corollary 7.13 (quadratic convergence for over-parameterized compressed sensing models).
Proof.
We know from Section 6 that the strong strict saddle point property is fulfilled for . To verify (i), recall that if the strict complementarity condition holds, then satisfies the metric subregularity property at with respect to . Direct verifications show that is bounded from below by zero, that the set is bounded, and that the Hessian of is Lipschitz continuous on . By setting , we see that converges to at a quadratic rate. Note also that satisfies the strong strict saddle point property, and thus is indeed a global minimizer.
To prove (ii), observe that if the strict complementarity condition fails, then satisfies the KL property with the exponent . Therefore, by setting in Proposition 7.9, we see that converges to with the rate . Similarly to (i), it follows that is a global minimizer of . ∎
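For readers who wish to experiment with this application, the sketch below writes down a commonly used over-parameterized (Hadamard product) reformulation of l1-regularized least squares together with its gradient. Whether this expression coincides exactly with the model of Section 6, as well as the synthetic data and the parameter lam, are assumptions made only for illustration.

```python
import numpy as np


def f(u, v, A, b, lam):
    """Over-parameterized least-squares objective with Hadamard factorization x = u * v."""
    r = A @ (u * v) - b
    return 0.5 * r @ r + 0.5 * lam * (u @ u + v @ v)


def grad(u, v, A, b, lam):
    """Gradient of f with respect to (u, v), stacked into a single vector."""
    r = A @ (u * v) - b
    g = A.T @ r
    return np.concatenate([g * v + lam * u, g * u + lam * v])


# Hypothetical data for a quick smoke test of the formulas above.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true
u = rng.standard_normal(50)
v = rng.standard_normal(50)
print(f(u, v, A, b, lam=0.1), np.linalg.norm(grad(u, v, A, b, lam=0.1)))
```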
Remark 7.14 (further comparison).
During the completion of this paper, we became aware of the preprint [24], in which the authors examined the convergence rate of the high-order regularized method without momentum steps for composite optimization problems under the assumption that the Hessian of the smooth part is Lipschitz continuous, labeled there as “strict continuity.” Recall that the main focus of our paper is on the development of generalized metric subregularity with applications to fast convergence analysis for general descent methods. This includes, in particular, the high-order regularized methods with momentum steps. We also mention that the results obtained in [24] cannot cover the setting of Section 7, where we incorporate momentum steps and can also handle the case of Hölder continuous Hessians.
8 Concluding Remarks
The main contributions of this paper are summarized as follows.
We introduce a generalized metric subregularity condition, which serves as an extension of the standard metric subregularity and its Hölderian version while being strictly weaker than the uniform version (error bound condition) recently employed in the convergence analysis of the cubic regularized Newton method. We then develop an abstract extended convergence framework that enables us to derive superlinear convergence towards specific target sets (such as the first-order and second-order stationary points) under the introduced generalized metric subregularity condition. Interestingly, this framework also extends the widely used KL convergence analysis, and the derived superlinear convergence rates of this new framework are sharp in the sense that the rate can be attained.
We provide easily verifiable sufficient conditions for the generalized metric subregularity with respect to the second-order stationary set under the strict saddle point property. We also demonstrate that several modern practical models appearing in machine learning, statistics, etc., enjoy the generalized metric subregularity property with respect to the second-order stationary set in explicit forms. This includes, e.g., objective functions that arise in the over-parameterized compressed sensing model, best rank-one matrix approximation, and generalized phase retrieval problems.
As a major application of generalized metric subregularity and abstract convergence analysis, we derive superlinear convergence results for the new high-order regularized Newton method with momentum. This improves the existing literature by weakening the regularity conditions and extending the analysis to include momentum steps. Furthermore, when applying the cubic regularized Newton methods with momentum steps to solve the over-parameterized compressed sensing model, we obtain global convergence with a quadratic local convergence rate towards a global minimizer under the strict complementarity condition.
One important potential future research topic is to extend the abstract convergence framework to cover nonmonotone numerical algorithms. Another major goal of our future research is to extend the obtained second-order results from C²-smooth functions to the class of C^{1,1} functions [27], i.e., continuously differentiable functions with Lipschitzian gradients, or to functions with a composite structure; see, e.g., [31].
References
- [1] Ahookhosh, M., Aragón-Artacho, F.J., Fleming, R.M.T., Vuong, P.T.: Local convergence of the Levenberg–Marquardt method under Hölder metric subregularity. Adv. Comput. Math. 45, 2771–2806 (2019)
- [2] Ahookhosh, M., Nesterov, Y.: High-order methods beyond the classical complexity bounds: inexact high-order proximal-point methods. Math. Program. (2024). https://doi.org/10.1007/s10107-023-02041-4
- [3] Aragón-Artacho, F.J., Geoffroy, M.H.: Metric subregularity of the convex subdifferential in Banach spaces. J. Nonlinear Convex Anal. 15, 35–47 (2014)
- [4] Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116, 5–16 (2009)
- [5] Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. 137, 91–129 (2013)
- [6] Bierstone, E., Milman, P.: Semianalytic and subanalytic sets. IHES Publ. Math. 67, 5–42 (1988)
- [7] Bolte, J., Daniilidis, A., Lewis, A.S.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17, 1205–1223 (2007)
- [8] Cai, J.-F., Huang, M., Li, D., Wang, Y.: The global landscape of phase retrieval II. quotient intensity models. Ann. Appl. Math. 38, 62–114 (2022)
- [9] Cai, J.-F., Huang, M., Li, D., Wang, Y.: Nearly optimal bounds for the global geometric landscape of phase retrieval. Inverse Problems 39, 075011 (2023)
- [10] Cartis, C., Gould, N.I.M., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130, 295–319 (2011)
- [11] Drusvyatskiy, D., Mordukhovich, B.S., Nghia, T.T.A.: Second-order growth, tilt stability, and metric regularity of the subdifferential. J. Convex Anal. 21, 1165–1192 (2014)
- [12] Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Problems, Volumes I and II, Springer, New York (2003)
- [13] Gaydu, M., Geoffroy, M.H., Jean-Alexis, C.: Metric subregularity of order and the solving of inclusions. Cent. Eur. J. Math. 9, 147–161 (2011)
- [14] Gfrerer, H., Outrata, J.V.: On a semismooth∗ Newton method for solving generalized equations. SIAM J. Optim. 31, 489–517 (2021)
- [15] Grapiglia, G.N., Nesterov, Yu.: Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM J. Optim. 27, 478–506 (2017)
- [16] Hsia, Y., Sheu, R.-L., Yuan, Y.-X.: Theory and application of -regularized subproblems for . Optim. Methods Softw. 32, 1059–1077 (2017)
- [17] Izmailov, A.F., Solodov, M.V.: Newton-Type Methods for Optimization and Variational Problems. Springer, New York (2014)
- [18] Khanh, P.D., Mordukhovich, B.S., Phat, V.T., Tran, D.B.: Globally convergent coderivative-based generalized Newton methods in nonsmooth optimization. Math. Program. 205, 373–429 (2024)
- [19] Klatte, D., Kummer, B.: Nonsmooth Equations in Optimization: Regularity, Calculus, and Applications. Kluwer, Boston (2002)
- [20] Li, G., Mordukhovich, B.S.: Hölder metric subregularity with applications to proximal point method. SIAM J. Optim. 22, 1655–1684 (2012)
- [21] Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18, 1199–1232 (2018)
- [22] Li, X., Sun, D., Toh, K.-C.: A highly efficient semismooth Newton augmented Lagrangian method for solving Lasso problems. SIAM J. Optim. 28, 433–458 (2018)
- [23] Lindstrom, S.B., Lourenço, B.F., Pong, T.K.: Error bounds, facial residual functions and applications to the exponential cone. Math. Program. 200, 229–278 (2023)
- [24] Liu, Y., Pan, S., Qian, Y.: An inexact -order regularized proximal Newton method for nonconvex composite optimization. https://arxiv.org/abs/2311.06871.
- [25] Luo, Z.-Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: A general approach. Ann. Oper. Res. 46, 157–178 (1993)
- [26] Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation, I: Basic Theory, II: Applications. Springer, Berlin (2006)
- [27] Mordukhovich, B.S.: Second-Order Variational Analysis in Optimization, Variational Stability, and Control: Theory, Algorithms, Applications. Springer, Cham (2024)
- [28] Mordukhovich, B.S., Ouyang, W.: Higher-order metric subregularity and its applications. J. Global Optim. 63, 777–795 (2015)
- [29] Mordukhovich, B.S., Sarabi, M.E.: Generalized Newton algorithms for tilt-stable minimizers in nonsmooth optimization. SIAM J. Optim. 31, 1184–1214 (2021)
- [30] Mordukhovich, B.S., Yuan, X., Zeng, S., Zhang, J.: A globally convergent proximal Newton-type method in nonsmooth convex optimization. Math. Program. 198, 899–936 (2023)
- [31] Nabou, Y., Necoara, I.: Efficiency of higher-order algorithms for minimizing composite functions. Comput. Optim. Appl. 87, 441–473 (2024)
- [32] Nesterov, Yu., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108, 177–205 (2006)
- [33] Ochs, P.: Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano. SIAM J. Optim. 29(1), 541–570 (2019)
- [34] Ouyang, W., Liu, Y., Pong, T.K., Wang, H.: Kurdyka-Łojasiewicz exponent via Hadamard parametrization. https://arxiv.org/abs/2402.00377
- [35] Poon, C., Peyré, G.: Smooth bilevel programming for sparse regularization. In: Proceedings of NeurIPS’21 (2021). https://arxiv.org/abs/2106.01429
- [36] Poon, C., Peyré, G.: Smooth over-parameterized solvers for nonsmooth structured optimization. Math. Program. 201, 897–952 (2023)
- [37] Rockafellar, R.T., Wets, R.J-B.: Variational Analysis. Springer, Berlin (1998)
- [38] Yao, J.C., Zheng, X.Y., Zhu, J.: Stable minimizers of -regular functions. SIAM J. Optim. 27, 1150–1170 (2017)
- [39] Yu, P., Li, G., Pong, T.K.: Kurdyka-Łojasiewicz exponent via inf-projection. Found. Comput. Math. 22, 1171–1217 (2022)
- [40] Yue, M.-C., Zhou, Z., So, A.M.-C.: On the quadratic convergence of the cubic regularization method under a local error bound condition. SIAM J. Optim. 29, 904–932 (2019)
- [41] Zhang, B., Ng, K.F., Zheng, X.Y., He, Q.: Hölder metric subregularity for multifunctions in type Banach spaces. Optimization 65, 1963–1982 (2016)
- [42] Zheng, X.Y., Zhu, J.: Generalized metric subregularity and regularity with respect to an admissible function. SIAM J. Optim. 26, 535–563 (2016)
- [43] Zălinescu, C.: Convex Analysis in General Vector Spaces. World Scientific, Singapore (2002)
9 Proofs of Technical Lemmas
The main goal of this appendix is to give a detailed proof of Lemma 7.6. We split this proof into several technical steps (also called lemmas), which are of independent interest.
Lemma 9.1.
Proof.
Let now and . To see the conclusion, we argue by contradiction and suppose that , which yields . Denote
(9.3)
and deduce from (by Assumption 7.1(2)) and that . So the number is well-defined. It follows that and that for all . Consequently, combining Lemma 7.3 with (9.1)–(9.3) gives us the estimates
which imply in turn that . Therefore, , a contradiction ensuring that . In a similar way, we can check that . ∎
Lemma 9.2.
Under Assumption 7.1, for any with we have
(9.4)
Proof.
Lemma 9.3.
Proof.
Lemma 9.4.
Under Assumption 7.1, for any with we have
(9.8)
Proof.
Lemma 9.5.
Let Assumption 7.1 hold, and let the sequences and be taken from Algorithm 1 with . Then for all we have the estimates
(9.9)
(9.10)
(9.11)
(9.12)
Proof.
Lemma 9.6.
Under Assumption 7.1, the sequence generated by Algorithm 1 satisfies
Proof.
Now we are ready to finalize the proof of Lemma 7.6.
To verify assertion (i) therein, it suffices to show that is a decreasing sequence having a lower bound. By Assumption 7.1, is bounded from below and so is . Employing (7.6) and (9.9) yields
(9.15)
and allows us to deduce that converges to some .
To verify (ii), we begin by using (9.13), , and Lemma 9.6 to get
which, together with (9.10), tells us that
(9.16)
Combining further Assumption 7.1 and Lemma 9.6 with (7.4), (9.13), and (9.11), we arrive at
(9.17)
and therefore verify assertion (ii).
To verify the final assertion (iii), note that (9.15) yields . Recalling that is bounded for some , we conclude that is bounded as well, which ensures that . Pick any and observe that justifying (7.22) requires bounding the minimum eigenvalue of the Hessian. Recalling that holds for any real symmetric matrices and , we have
This together with Assumption 7.1, Lemma 9.6, (9.12), (9.13), and tells us that
Unifying the latter with (9.16) and (9.17) leads us to the inequalities
which show that and . Employing finally the already verified assertion (i), we obtain and therefore complete the proof of the lemma. ∎