A mirror inertial forward-reflected-backward splitting: Global convergence and linesearch extension beyond convexity and Lipschitz smoothness

This work was supported by NSERC Discovery Grants and the JSPS KAKENHI grant number JP21K17710.
Abstract
This work investigates a Bregman and inertial extension of the forward-reflected-backward algorithm [Y. Malitsky and M. Tam, SIAM J. Optim., 30 (2020), pp. 1451–1472] applied to structured nonconvex minimization problems under relative smoothness. To this end, the proposed algorithm hinges on two key features: taking inertial steps in the dual space, and allowing for possibly negative inertial values. Our analysis begins by studying an associated envelope function that takes inertial terms into account through a novel product space formulation. Such a construction substantially differs from similar objects in the literature and could offer new insights for extensions of splitting algorithms. Global convergence and rates are obtained by appealing to the generalized concave Kurdyka-Łojasiewicz (KL) property, which allows us to describe a sharp upper bound on the total length of iterates. Finally, a linesearch extension is given to enhance the proposed method.
Keywords. Nonsmooth nonconvex optimization, forward-reflected-backward splitting, inertia, Bregman distance, relative smoothness.
AMS subject classifications. 90C26, 49J52, 49J53.
1 Introduction
Consider the following composite minimization problem

(P)    $\operatorname*{minimize}_{x \in \operatorname{cl} C} \; f(x) + g(x),$

where $C \subseteq \mathbb{R}^n$ is a nonempty open and convex set with closure $\operatorname{cl} C$, $f$ is differentiable on $C$, and $g$ is proper and lower semicontinuous (lsc). For notational brevity, we define $\varphi := f + g + \delta_{\operatorname{cl} C}$, with $\delta_{\operatorname{cl} C}$ denoting the indicator function of the set $\operatorname{cl} C$, namely, $\delta_{\operatorname{cl} C}(x) = 0$ if $x \in \operatorname{cl} C$ and $\delta_{\operatorname{cl} C}(x) = \infty$ otherwise. By doing so, problem (P) can equivalently be cast as the “unconstrained” minimization of $\varphi$ over $\mathbb{R}^n$.
Note that (P) is beyond the scope of traditional first-order methods that require global Lipschitz continuity of $\nabla f$ and the consequential descent lemma [9, Prop. A.24]; see, e.g., [3, 31, 24, 20, 21, 19] for such algorithms. To resolve this issue, a Lipschitz-like convexity condition was introduced in the seminal work [5], furnishing a descent lemma beyond the aforementioned setting. This notion was later referred to as relative smoothness (see Definition 2.1) and has played a central role in extending splitting algorithms to the setting of (P); see, e.g., [10, 13, 16, 23, 29, 35].
The goal of this paper is to propose a Bregman inertial forward-reflected-backward method FRB (Algorithm 1) for solving (P), which, roughly speaking, iterates
$$x^{k+1} \in \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \Bigl\{ g(x) + \langle 2\nabla f(x^k) - \nabla f(x^{k-1}), x\rangle + \tfrac{1}{\gamma}\bigl(h(x) - \langle \nabla h(x^k) + \alpha(\nabla h(x^k) - \nabla h(x^{k-1})), x\rangle\bigr) \Bigr\},$$
where $\gamma > 0$ is the stepsize, $\alpha$ is the inertial parameter, and $h$ is the kernel. In the convex case, the above scheme reduces to the inertial forward-reflected-backward method proposed in [25] when $h = \frac12\|\cdot\|^2$, which is, however, not applicable to (P) due to its assumption on Lipschitz continuity of $\nabla f$.
A fundamental tool in our analysis is the FRB-envelope (see Definition 3.4), which is the value function associated with the parametric minimization of a “model” of (P); see Section 3.1. The term “envelope” is borrowed from the celebrated Moreau envelope [28] and its relation with the proximal operator. Indeed, there has been renewed interest in employing an associated envelope function to study convergence of splitting methods, such as forward-backward splitting [38, 1], Douglas-Rachford splitting [37, 39], the alternating minimization algorithm [34], as well as the splitting scheme of Davis and Yin [22]. The aforementioned works share one common theme: regularity properties of the associated envelope function are used for further enhancements and deeper algorithmic insights.
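For reference, the classical construction that these envelopes generalize is the Moreau envelope and proximal mapping of a proper lsc function $g$ with parameter $\gamma > 0$:
$$g^{\gamma}(x) := \inf_{z \in \mathbb{R}^n} \Bigl\{ g(z) + \tfrac{1}{2\gamma}\|z - x\|^2 \Bigr\}, \qquad \operatorname{prox}_{\gamma g}(x) := \operatorname*{arg\,min}_{z \in \mathbb{R}^n} \Bigl\{ g(z) + \tfrac{1}{2\gamma}\|z - x\|^2 \Bigr\}.$$
Under prox-boundedness, $g^{\gamma}$ is real-valued and continuous while sharing infimum and minimizers with $g$; the FRB-envelope of Section 3.2 plays the analogous role for the model underlying FRB.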
Pursuing the same pattern, in addition to studying global convergence of the proposed algorithm in Section 4 by means of the FRB-envelope, we will showcase in Section 5 how it offers a versatile globalization framework for fast local methods in the full generality of Assumption I. This framework reveals the necessity of interleaving noninertial trial steps in the globalization strategy, an observation that led to a substantial change in the linesearch strategy with respect to related works.
A major departure of our analysis from previous work is that we consider an envelope function with two independent variables, allowing us to take inertial terms into account. In this regard, we believe that our methodology is appealing in its own right, as it can be instrumental for deriving inertial extensions of other splitting methods. Another notable feature of this work is that, as one shall see in Section 3.4, a nonpositive inertial parameter is required for the sake of convergence under relative smoothness. This result, although more pessimistic, aligns with the recent work [14] regarding the impossibility of an accelerated Bregman forward-backward method under the same assumption; see Remark 3.7 for a detailed discussion. Our work differs from the analysis carried out in [40], which also deals with an inertial forward-reflected-backward algorithm using Bregman metrics but is still limited to the Lipschitz smoothness assumption. The game changer that enables us to cope with relative smoothness is taking the inertial step in the dual space, that is, interpolating applications of $\nabla h$ (cf. step 4 of Algorithm 1), whence the name, inspired by [8], mirror inertial forward-reflected-backward splitting (FRB).
The rest of the paper is structured as follows. In the remainder of the section we formally define the problem setting and the proposed algorithm, and in Section 2 we discuss some preliminary material and notational conventions. Section 3 introduces the FRB-envelope and an associated merit function; the proof of the main result therein is deferred to Appendix A. The convergence analysis is carried out in Section 4, and finally Section 5 presents the FRB-based globalization framework. Section 6 draws some concluding remarks.
1.1 Problem setting and proposed algorithm
Throughout, we fix a Legendre kernel $h$ with $C = \operatorname{int} \operatorname{dom} h$, namely, a proper, lsc, and strictly convex function that is 1-coercive and essentially smooth, i.e., such that $\|\nabla h(x^k)\| \to \infty$ for every sequence $(x^k)_{k \in \mathbb{N}} \subseteq C$ converging to a boundary point of $\operatorname{dom} h$. We will consider the following iterative scheme for addressing problem (P), where $\mathrm{D}_h$ denotes the Bregman distance induced by $h$, defined as

(1.1)    $\mathrm{D}_h(x, y) := \begin{cases} h(x) - h(y) - \langle \nabla h(y), x - y\rangle & \text{if } y \in C, \\ \infty & \text{otherwise.} \end{cases}$
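To make the scheme concrete, the following is a minimal Python sketch of the iteration displayed in the introduction, specialized to the Euclidean kernel $h = \frac12\|\cdot\|^2$, for which $\nabla h$ is the identity and the backward step is a classical proximal step. The function names and the test problem are our own illustration, not the paper's Algorithm 1 listing; for a general kernel the last step becomes the Bregman proximal subproblem $\operatorname{arg\,min}_x\{\gamma g(x) + \langle \gamma v^k - z^k, x\rangle + h(x)\}$.

```python
import numpy as np

def frb_euclidean(grad_f, prox_g, x0, gamma, alpha, iters=200):
    """Illustrative mirror inertial FRB iteration for h = 0.5*||x||^2.

    For this kernel grad h is the identity, so the dual-space interpolation
    z^k = grad_h(x^k) + alpha*(grad_h(x^k) - grad_h(x^{k-1})) reduces to
    ordinary extrapolation, and the backward step is prox_{gamma*g}(z - gamma*v).
    """
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        v = 2.0 * grad_f(x) - grad_f(x_prev)   # forward-reflected gradient term
        z = x + alpha * (x - x_prev)           # inertial step (dual space = primal here)
        x_prev, x = x, prox_g(z - gamma * v, gamma)
    return x

# Usage on min 0.5*||A x - b||^2 + ||x||_1 with a (negative) inertial value.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda y, t: np.sign(y) * np.maximum(np.abs(y) - t, 0.0)  # soft-threshold
L = np.linalg.norm(A.T @ A, 2)                 # Lipschitz modulus of grad f
x_sol = frb_euclidean(grad_f, prox_g, np.zeros(2), gamma=0.2 / L, alpha=-0.1)
```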
Note that Algorithm 1 takes its inertial step in the dual space, hence the abbreviation FRB. We will work under the following assumptions.
Assumption I.
The following hold in problem (P):
1. $f$ is smooth relative to $h$ (see Section 2.2).
2. $g$ is proper and lsc.
3. $\inf \varphi > -\infty$.
4. For any $\gamma > 0$ small enough and any reference points in $C$, the set of minimizers defining the $x$-update of Algorithm 1 is contained in $C$.
As will be made explicit in Lemma 3.2, Assumption 4 is a requirement ensuring that Algorithm 1 is well defined. Note that, in general, the minimizers therein form a (possibly empty) subset of $\operatorname{cl} C$; Assumption 4 thus only excludes points on the boundary of $C$. This standard requirement is trivially satisfied when $\operatorname{dom} h$ is open (in which case $\operatorname{dom} h = C$), or more generally when constraint qualifications enabling a subdifferential calculus rule on the boundary are met, as is the case when $g$ is convex.
Remark 1.1 (constraint qualifications for Assumption 4).
If $g + \delta_{\operatorname{cl} C}$ is proper and lsc, Assumption 4 is satisfied if the subdifferential sum rule holds everywhere (this condition being automatically guaranteed at all points outside the boundary of $C$, $\partial h$ having empty values outside $\operatorname{dom} h$ and being a singleton in its interior). Indeed, optimality of a minimizer implies nonemptiness of the subdifferential of the sum there, with the inclusion holding by [33, Cor. 10.9] and implying nonemptiness of $\partial h$, which by essential smoothness of $h$ can only happen in $C$; see Section 2.1 for the definitions of subdifferentials. ∎
Regardless, it will be shown in Lemma 2.4 that existence of minimizers is guaranteed for small enough values of $\gamma$, which will then be linked in Lemma 3.2 to the well definedness of Algorithm 1.
2 Preliminaries
2.1 Notation
The extended-real line is denoted by $\overline{\mathbb{R}} := \mathbb{R} \cup \{\pm\infty\}$. The positive and negative parts of $r \in \mathbb{R}$ are respectively defined as $[r]_+ := \max\{r, 0\}$ and $[r]_- := \max\{-r, 0\}$, so that $r = [r]_+ - [r]_-$.
The distance of a point $x$ to a nonempty set $S \subseteq \mathbb{R}^n$ is given by $\operatorname{dist}(x, S) := \inf_{z \in S} \|z - x\|$. The interior, closure, and boundary of $S$ are respectively denoted as $\operatorname{int} S$, $\operatorname{cl} S$, and $\operatorname{bdry} S$. The indicator function of $S$ is defined as $\delta_S(x) = 0$ if $x \in S$ and $\delta_S(x) = \infty$ otherwise.
A function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is proper if $f \not\equiv \infty$ and $f > -\infty$, in which case its domain is defined as the set $\operatorname{dom} f := \{x : f(x) < \infty\}$. For $\beta \in \mathbb{R}$, $[f \le \beta]$ denotes the $\beta$-sublevel set of $f$; $[f \ge \beta]$ and alike are defined accordingly. We say that $f$ is level bounded (or coercive) if $\lim_{\|x\|\to\infty} f(x) = \infty$, and 1-coercive if $\lim_{\|x\|\to\infty} f(x)/\|x\| = \infty$. A point $x_\star$ is a local minimum for $f$ if $f(x) \ge f(x_\star)$ holds for all $x$ in a neighborhood of $x_\star$. If the inequality can be strengthened to $f(x) \ge f(x_\star) + \tfrac{\mu}{2}\|x - x_\star\|^2$ for some $\mu > 0$, then $x_\star$ is a strong local minimum. The convex conjugate of $f$ is denoted as $f^*(y) := \sup_x \{\langle y, x\rangle - f(x)\}$. Given $x \in \operatorname{dom} f$, $\partial f(x)$ denotes the Mordukhovich (limiting) subdifferential of $f$ at $x$, given by
$$\partial f(x) := \bigl\{ v : \exists (x^k, v^k) \to (x, v) \text{ with } v^k \in \hat\partial f(x^k) \text{ and } f(x^k) \to f(x) \bigr\},$$
and $\hat\partial f(x)$ is the set of regular subgradients of $f$ at $x$, namely vectors $v$ such that $\liminf_{z \to x,\, z \ne x} \frac{f(z) - f(x) - \langle v, z - x\rangle}{\|z - x\|} \ge 0$; see, e.g., [33, 27]. $\mathcal{C}^k(E)$ is the set of functions which are $k$ times continuously differentiable on $E$. We write $\mathcal{C}^k$ if $E$ is clear from context. The notation $T : \mathbb{R}^n \rightrightarrows \mathbb{R}^m$ indicates a set-valued mapping, whose domain and graph are respectively defined as $\operatorname{dom} T := \{x : T(x) \ne \emptyset\}$ and $\operatorname{gph} T := \{(x, y) : y \in T(x)\}$.
2.2 Relative smoothness and weak convexity
In the following definition, as well as throughout the entire paper, we follow the extended-real convention $\infty - \infty = \infty$ to resolve possible ill definitions of differences of extended-real-valued functions. Similarly, we will adopt the convention that $\frac{0}{0} = 0$.
Definition 2.1 (relative smoothness).
We say that an lsc function $f$ is smooth relative to $h$ if $f$ is differentiable on $C$ and there exists a constant $L_f \ge 0$ such that $L_f h - f$ and $L_f h + f$ are convex functions on $C$. We may alternatively say that $f$ is $L_f$-smooth relative to $h$ to make the smoothness modulus explicit.
Note that the constant $L_f$ may be loose. For instance, if $f$ is convex, then $L h + f$ is convex for any $L \ge 0$. This motivates us to consider one-sided conditions and treat $f$ and $-f$ separately.
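As a one-dimensional illustration (our own, in the spirit of the quartic problems considered in [10]), take $f(x) = \frac14 x^4$ and the kernel $h(x) = \frac14 x^4 + \frac12 x^2$: then
$$h - f = \tfrac12 x^2 \qquad\text{and}\qquad h + f = \tfrac12 x^4 + \tfrac12 x^2$$
are both convex, so $f$ is 1-smooth relative to $h$, even though $\nabla f$ is not globally Lipschitz ($f''(x) = 3x^2$ is unbounded). Moreover, since $f$ is convex, one side of the bound is loose: $L h + f$ is convex for every $L \ge 0$, as remarked above.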
Definition 2.2 (relative weak convexity).
We say that an lsc function $f$ is weakly convex relative to $h$ if there exists a (possibly negative) constant $\sigma_f$ such that $f - \sigma_f h$ is a convex function. We may alternatively say that $f$ is $\sigma_f$-weakly convex relative to $h$ to make the weak convexity modulus explicit.
Note that relative smoothness of $f$ is equivalent to relative weak convexity of both $f$ and $-f$. Indeed, if $f$ is $L_f$-smooth relative to $h$, then both $f$ and $-f$ are $(-L_f)$-weakly convex. Conversely, if $f$ and $-f$ are $\sigma_f$- and $\sigma_{-f}$-weakly convex relative to $h$, respectively, then $f$ (as well as $-f$) is $L_f$-smooth relative to $h$ with

(2.1)    $L_f = \max\{-\sigma_f,\, -\sigma_{-f}\}.$
The relative weak convexity moduli $\sigma_f$ and $\sigma_{-f}$ will be henceforth adopted when referring to Assumption 1. It will be convenient to normalize these quantities into pure numbers

(2.2)    $p_f := \sigma_f / L_f \qquad\text{and}\qquad p_{-f} := \sigma_{-f} / L_f.$
To make all definitions and implications well posed, we will ignore the uninteresting case in which $f$ is affine, in which case $f$ is $L$-smooth relative to $h$ for any $L \ge 0$. The comment below will be instrumental in Section 3.4.
Remark 2.3.
Invoking (2.1) and (2.2) yields that

(2.3)    $-2 \le p_f + p_{-f} \le 0 \qquad\text{and}\qquad p_f, p_{-f} \in [-1, 1],$

where the second inequality owes to the fact that, by definition, both $f - \sigma_f h$ and $-f - \sigma_{-f} h$ are convex functions, and therefore so is their sum $-(\sigma_f + \sigma_{-f}) h$. Thus, whenever $f$ is convex (resp. concave), one can take $\sigma_f = 0$ (resp. $\sigma_{-f} = 0$), and by virtue of the inclusion in (2.3) it directly follows that $p_{-f} \le 0$ (resp. $p_f \le 0$). ∎
We now turn to a lemma that guarantees well definedness of Algorithm 1.
Lemma 2.4 (relative prox-boundedness).
Suppose that Assumption I holds. Then, the set of minimizers as in Assumption 4 is nonempty for any $\gamma \in (0, 1/L_f)$ and any reference points in $C$. In other words, the problem is prox-bounded relative to $h$ with threshold $1/L_f$ [18, Def. 2.3].

Proof.
Recall that a proper convex function admits an affine minorant; see, e.g., [7, Cor. 16.18]. Applying this to the convex function $L_f h - f$ yields $f \le L_f h + \langle a, \cdot\rangle + b$ on $\operatorname{cl} C$ for some $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$, so that $g \ge \inf\varphi - f \ge \inf\varphi - L_f h - \langle a, \cdot\rangle - b$ on $\operatorname{cl} C$. It then follows that the minimand is 1-coercive, being lower bounded by $(\tfrac1\gamma - L_f) h$ plus an affine function, with $h$ 1-coercive. ∎
We point out that the content of this subsection is a generalization of the well-known equivalence between smoothness relative to the Euclidean kernel and Lipschitz differentiability, a fact that will be invoked in Section 3 and whose proof is given next for the sake of completeness.
Lemma 2.5 (Lipschitz smoothness from weak convexity).
Let $f : \mathbb{R}^n \to \mathbb{R}$ be differentiable. Then, the following are equivalent:
1. there exist $\sigma_f, \sigma_{-f} \in \mathbb{R}$ such that both $f - \tfrac{\sigma_f}{2}\|\cdot\|^2$ and $-f - \tfrac{\sigma_{-f}}{2}\|\cdot\|^2$ are proper, convex and lsc;
2. there exist $\sigma_f, \sigma_{-f} \in \mathbb{R}$ such that for all $x, y \in \mathbb{R}^n$ it holds that $\sigma_f \|x - y\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y\rangle \le -\sigma_{-f}\|x - y\|^2$;
3. $\nabla f$ is $\ell_f$-Lipschitz continuous for some $\ell_f \ge 0$.
In particular, 1 and/or 2 imply 3 with $\ell_f = \max\{-\sigma_f, -\sigma_{-f}\}$, and conversely 3 implies 1 and 2 with $\sigma_f = \sigma_{-f} = -\ell_f$.
Proof.
2 ⇒ 3: Invoking again [33, Ex. 12.28(c)] yields that both $f - \tfrac{\sigma_f}{2}\|\cdot\|^2$ and $-f - \tfrac{\sigma_{-f}}{2}\|\cdot\|^2$ are lower-$\mathcal{C}^1$ in the sense of [33, Def. 10.29], and continuous differentiability then follows from [33, Prop. 10.30]. Thus, in assertion 2, denoting $q := f - \tfrac{\sigma_f}{2}\|\cdot\|^2$, the function $q$ satisfies
$$0 \le \langle \nabla q(x) - \nabla q(y), x - y\rangle \le (-\sigma_f - \sigma_{-f})\|x - y\|^2 \quad\text{for all } x, y.$$
It then follows from [30, Thm. 2.1.5] that $q$ is convex and $(-\sigma_f - \sigma_{-f})$-Lipschitz differentiable, and that
$$\|\nabla q(x) - \nabla q(y)\|^2 \le (-\sigma_f - \sigma_{-f})\,\langle \nabla q(x) - \nabla q(y), x - y\rangle.$$
Using the definition of $q$ and expanding the squares yields $\|\nabla f(x) - \nabla f(y)\| \le \max\{-\sigma_f, -\sigma_{-f}\}\|x - y\|$, proving that $\nabla f$ is $\ell_f$-Lipschitz continuous.
3 Algorithmic analysis toolbox
In the literature, convergence analysis for nonconvex splitting algorithms typically revolves around the identification of a ‘Lyapunov potential’, namely, a lower bounded function that decreases its value along the iterates. In this section we will pursue this direction. We will actually go one step further in identifying a function which, in addition to serving as Lyapunov potential, is also continuous. This property, whose utility will be showcased in Section 5, will come as the result of a parametric minimization, as discussed in the following two subsections.
In what follows, to simplify the discussion we introduce

(3.1)    $h_\gamma := h - \gamma f,$

and observe that $h_\gamma$ too is a Legendre kernel for $\gamma$ small enough.
Lemma 3.1 ([1, Thm. 4.1]).
Suppose that Assumption 1 holds. Then, for every $\gamma \in (0, 1/L_f)$ the function $h_\gamma = h - \gamma f$ is a Legendre kernel with $\operatorname{dom} h_\gamma = \operatorname{dom} h$.
We will also (ab)use the notation $\mathrm{D}_\psi$ of the Bregman distance (1.1) for functions $\psi$ differentiable on $C$ that are not necessarily convex, thereby possibly having negative values. This notational abuse is justified by the fact that all algebraic identities of the Bregman distance used in the manuscript (e.g., the three-point identity [12, Lem. 3.1]) are valid regardless of whether $\psi$ is convex or not, and will overall yield a major simplification of the math.
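For ease of reference, the three-point identity of [12] reads: for every $x \in \operatorname{dom} h$ and $y, z \in C$,
$$\mathrm{D}_h(x, z) - \mathrm{D}_h(x, y) - \mathrm{D}_h(y, z) = \langle \nabla h(z) - \nabla h(y),\, y - x\rangle.$$
As is immediate from expanding the definitions, the identity only uses differentiability of $h$ at $y$ and $z$, and not convexity, which is precisely what licenses the notational abuse above.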
3.1 Parametric minimization model
As a first step towards the desired goals, as well as to considerably simplify the discussion, we begin by observing that the FRB-update is the result of a parametric minimization. Namely, by introducing the “model”

(3.2a)    $\mathcal{M}_{\gamma,\alpha}(x; y, y^-) := f(y) + \langle 2\nabla f(y) - \nabla f(y^-),\, x - y\rangle + g(x) + \tfrac1\gamma \mathrm{D}_h(x, y) - \tfrac\alpha\gamma \langle \nabla h(y) - \nabla h(y^-),\, x - y\rangle$
(3.2b)    $\phantom{\mathcal{M}_{\gamma,\alpha}(x; y, y^-) :}= f(y) + \langle 2\nabla f(y) - \nabla f(y^-),\, x - y\rangle + g(x) + \tfrac1\gamma \bigl( h(x) - h(y) - \langle \nabla h(y) + \alpha(\nabla h(y) - \nabla h(y^-)),\, x - y\rangle \bigr),$

observe that the $x$-update in FRB can be compactly expressed as

(3.3a)    $x^{k+1} \in \mathrm{T}_{\gamma,\alpha}(x^k, x^{k-1}),$

where $\mathrm{T}_{\gamma,\alpha} : C \times C \rightrightarrows \mathbb{R}^n$, defined by

(3.3b)    $\mathrm{T}_{\gamma,\alpha}(y, y^-) := \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \mathcal{M}_{\gamma,\alpha}(x; y, y^-),$
is the FRB-operator with stepsize $\gamma$ and inertial parameter $\alpha$. The fact that $\mathrm{T}_{\gamma,\alpha}$ maps pairs in $C \times C$ to subsets of $C$ is a consequence of Assumption 4, as we are about to formalize in Item 3. Note that many models can be defined giving rise to the same operator, and all these differ by additive terms which are constant with respect to $x$. Among these, the one given in (3.2b) reflects the tangency condition $\mathcal{M}_{\gamma,\alpha}(y; y, y) = \varphi(y)$ for every $y \in C$. A consequence of this fact and other basic properties are summarized next.
Lemma 3.2 (basic properties of the model $\mathcal{M}_{\gamma,\alpha}$ and the operator $\mathrm{T}_{\gamma,\alpha}$).
Suppose that Assumption I holds, and let $\gamma \in (0, 1/L_f)$ and $\alpha \in \mathbb{R}$ be fixed. The following hold:
1. $\mathcal{M}_{\gamma,\alpha}(y; y, y) = \varphi(y)$ for all $y \in C$.
2. $\mathcal{M}_{\gamma,\alpha}$ is level bounded in $x$ locally uniformly in $(y, y^-)$ on $C \times C$ (namely, $\bigcup_{(y,y^-) \in K} \{x : \mathcal{M}_{\gamma,\alpha}(x; y, y^-) \le r\}$ is bounded for any compact $K \subseteq C \times C$ and $r \in \mathbb{R}$).
3. $\mathrm{T}_{\gamma,\alpha}$ is locally bounded and osc (being defined on $C \times C$, osc and local boundedness are meant relative to $C \times C$: namely, $\operatorname{gph} \mathrm{T}_{\gamma,\alpha}$ is closed relative to $C \times C \times \mathbb{R}^n$, and $\mathrm{T}_{\gamma,\alpha}(K)$ is bounded for any compact $K \subseteq C \times C$), and $\mathrm{T}_{\gamma,\alpha}(y, y^-)$ is a nonempty and compact subset of $C$ for any $(y, y^-) \in C \times C$.
4. $\nabla f(\bar x) - 2\nabla f(y) + \nabla f(y^-) + \tfrac1\gamma\bigl(\nabla h(y) - \nabla h(\bar x) + \alpha(\nabla h(y) - \nabla h(y^-))\bigr) \in \hat\partial\varphi(\bar x)$ for any $(y, y^-) \in C \times C$ and $\bar x \in \mathrm{T}_{\gamma,\alpha}(y, y^-)$.
5. If $\bar x \in \mathrm{T}_{\gamma,\alpha}(\bar x, \bar x)$, then $0 \in \hat\partial\varphi(\bar x)$ and $\mathcal{M}_{\gamma,\alpha}(x; \bar x, \bar x) \ge \varphi(\bar x)$ for every $x$.
Proof.
We start by observing that Lemma 2.4 ensures that the set $\mathrm{T}_{\gamma,\alpha}(y, y^-)$ is nonempty for any $(y, y^-) \in C \times C$; this follows by considering the expression (3.2a) of the model and by observing that, for any $(y, y^-) \in C \times C$, it differs from the minimand in Assumption 4 by terms affine in $x$. For the same reason, it then follows from Assumption 4 that $\mathrm{T}_{\gamma,\alpha}(y, y^-) \subseteq C$.
Remark 3.3 (inertial effect).
Letting $\tilde f := f + \tfrac{c}{\gamma} h$ and $\tilde g := g - \tfrac{c}{\gamma} h$ for some $c \in \mathbb{R}$ gives an alternative decomposition of $\varphi$ which still complies with Assumption I. Relative to this decomposition, it is easy to verify that the corresponding model is related to the original one in (3.2a), and in particular FRB steps with the respective parameters coincide. The effect of inertia can then be explained as a redistribution of multiples of $h$ among $f$ and $g$ in the problem formulation. ∎
3.2 The FRB-envelope
Having defined the model $\mathcal{M}_{\gamma,\alpha}$ and its solution mapping $\mathrm{T}_{\gamma,\alpha}$ resulting from parametric minimization, we now introduce the associated value function, which we name FRB-envelope.
Definition 3.4 (FRB-envelope).
The envelope function associated to FRB with stepsize $\gamma$ and inertia $\alpha$ is defined as

(3.4)    $\varphi^{\rm FRB}_{\gamma,\alpha}(y, y^-) := \inf_{x \in \mathbb{R}^n} \mathcal{M}_{\gamma,\alpha}(x; y, y^-).$
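As a sanity check of the value-function construction, the snippet below evaluates (3.4) numerically in one dimension for the model form (3.2a) with Euclidean kernel; the brute-force inner minimization and the specific $f$, $g$ are our own illustrative choices.

```python
import numpy as np

# 1-D illustration of the FRB-envelope (3.4) with Euclidean kernel h(x) = x^2/2,
# for which D_h(x, y) = (x - y)^2/2 and grad h is the identity.
f  = lambda x: 0.5 * x**2          # smooth part
df = lambda x: x                   # its derivative
g  = lambda x: np.abs(x)           # nonsmooth part

def model(x, y, y_minus, gamma, alpha):
    """Model (3.2a): linearization of f at y, plus g, Bregman and inertial terms."""
    linearization = f(y) + (2.0 * df(y) - df(y_minus)) * (x - y)
    bregman = (x - y) ** 2 / (2.0 * gamma)
    inertial = -(alpha / gamma) * (y - y_minus) * (x - y)
    return linearization + g(x) + bregman + inertial

def frb_envelope(y, y_minus, gamma, alpha, grid=np.linspace(-5.0, 5.0, 20001)):
    """Value function (3.4), computed by brute force over a fine grid."""
    return model(grid, y, y_minus, gamma, alpha).min()

# Continuity in (y, y^-) and tightness at a fixed point: at y = y^- = 0 the
# inner minimizer is x = 0, so the envelope equals phi(0) = f(0) + g(0) = 0.
print(frb_envelope(0.8, 1.0, gamma=0.5, alpha=-0.1))
print(frb_envelope(0.0, 0.0, gamma=0.5, alpha=-0.1))
```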
Lemma 3.5 (basic properties of $\varphi^{\rm FRB}_{\gamma,\alpha}$).
Suppose that Assumption I holds. Then, for any $\gamma \in (0, 1/L_f)$ and $\alpha \in \mathbb{R}$ the following hold:
1. $\varphi^{\rm FRB}_{\gamma,\alpha}$ is (real-valued and) continuous on $C \times C$; in fact, it is locally Lipschitz provided that $f$ and $h$ are $\mathcal{C}^2$ on $C$.
2. For any $(y, y^-) \in C \times C$ and $\bar x \in \mathrm{T}_{\gamma,\alpha}(y, y^-)$, $\varphi^{\rm FRB}_{\gamma,\alpha}(y, y^-) = \mathcal{M}_{\gamma,\alpha}(\bar x; y, y^-)$.
3. $\varphi^{\rm FRB}_{\gamma,\alpha}(y, y) \le \varphi(y)$ for any $y \in C$.
Proof.
1: In light of the uniform level boundedness asserted in Item 2, continuity of $\varphi^{\rm FRB}_{\gamma,\alpha}$ follows from [33, Thm. 1.17(c)] by observing that the mapping $(y, y^-) \mapsto \mathcal{M}_{\gamma,\alpha}(x; y, y^-)$ is continuous for every $x$; in fact, when $f$ and $h$ are both $\mathcal{C}^2$ on $C$, its gradient exists and is continuous with respect to all its arguments, which together with local boundedness of $\mathrm{T}_{\gamma,\alpha}$, cf. Item 3, gives that $\varphi^{\rm FRB}_{\gamma,\alpha}$ is a lower-$\mathcal{C}^1$ function in the sense of [33, Def. 10.29], and in particular locally Lipschitz continuous by virtue of [33, Thm.s 10.31 and 9.2].
3.3 Establishing a merit function
We now work towards establishing a merit function for FRB, starting from a comparison of the values of $\varphi^{\rm FRB}_{\gamma,\alpha}$ at the pairs $(x^{k+1}, x^k)$ and $(x^k, x^{k-1})$, with $x^{k+1} \in \mathrm{T}_{\gamma,\alpha}(x^k, x^{k-1})$. Owing to Item 3, we have

(3.5)

From here two separate cases can be considered, each yielding surprisingly different results. The watershed lies in whether the bracketed term is positive or not: one case will result in a very straightforward convergence analysis in the full generality of Assumption I, while the other will necessitate an additional Lipschitz differentiability requirement. The convergence analysis in both cases revolves around the identification of a constant $c$ determining a lower bounded merit function

(3.6)    $\mathcal{L}_{\gamma,\alpha}(y, y^-) := \varphi^{\rm FRB}_{\gamma,\alpha}(y, y^-) + c\, \mathrm{D}_{\tilde h}(y^-, y).$

The difference between the two cases is determined by the function $\tilde h$ appearing in the last Bregman operator $\mathrm{D}_{\tilde h}$, having $\tilde h = h$ in the former case and $\tilde h$ augmented by a multiple of $\mathrm{j}$ in the latter, where $\ell_f$ is a Lipschitz constant for $\nabla f$ and

(3.7)    $\mathrm{j} := \tfrac12 \|\cdot\|^2$
is the squared Euclidean norm. The two cases are stated in the next theorem, which constitutes the main result of this section. Special and worst-case scenarios leading to simplified statements will be given in Section 3.4. In what follows, patterning the normalization of $(\sigma_f, \sigma_{-f})$ into $(p_f, p_{-f})$ detailed in Section 2.2, we also introduce the scaled stepsize

(3.8)    $\lambda := \gamma L_f,$

which as a result of the convergence analysis will be confined to a subinterval of $(0, 1)$.
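For orientation, note that in the Euclidean noninertial specialization $h = \mathrm{j}$ and $\alpha = 0$ (so that $L_f$ can be taken as the Lipschitz modulus $\ell_f$ of $\nabla f$), the stepsize condition of the original method [25] reads
$$\gamma \in \Bigl(0, \tfrac{1}{2\ell_f}\Bigr), \qquad\text{i.e.,}\qquad \lambda = \gamma\,\ell_f \in \bigl(0, \tfrac12\bigr),$$
which gives a sense of the scale of the normalization (3.8).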
Theorem 3.6.
Suppose that Assumption I holds and consider one of the following scenarios:
1. either  is convex (i.e., ) and , in which case ;
2. or $f$ is $\ell_f$-Lipschitz differentiable, $h$ is strongly convex, and , in which case .
Then, for $\mathcal{L}_{\gamma,\alpha}$ as in (3.6) the following assertions hold:
1. For every $(y, y^-) \in C \times C$ and $x \in \mathrm{T}_{\gamma,\alpha}(y, y^-)$, inequalities (3.9a) and (3.9b) hold.
2. $\inf \mathcal{L}_{\gamma,\alpha} = \inf \varphi$, and $\mathcal{L}_{\gamma,\alpha}$ is level bounded iff so is $\varphi$.
The proof of this result is detailed in the dedicated Appendix A; before that, let us draw some comments. As clarified in the statement of scenario 1, the convexity required therein can be enforced by suitably choosing $\gamma$ and $\alpha$ without imposing additional requirements on the problem. However, an unusual yet reasonable condition on the inertial parameter may be necessary.
Remark 3.7.
In order to furnish scenario 1, we shall see soon that $\alpha \le 0$ may be required; see Section 3.4. Such an assumption, although more pessimistic, coincides with a recent conjecture by Dragomir et al. [14, §4.5.3], which states that inertial methods with nonadaptive coefficients fail to converge in the relative smoothness setting, and provides an alternative perspective on the same matter through the lens of the convexity required in scenario 1. ∎
Unlike scenario 1, however, additional assumptions are needed for the Lipschitz differentiable case of scenario 2. This is because the requirement is equivalent to smoothness relative to the Euclidean Bregman kernel $\mathrm{j}$, while Assumption I prescribes bounds only relative to $h$. For simplicity, for functions $\psi_1, \psi_2$ we write $\psi_1 \succeq \psi_2$ to indicate that $\psi_1 - \psi_2$ is convex.
Remark 3.8.
Under Assumption I, one has that $f$ is Lipschitz differentiable with modulus $\ell_f$ under either one of the following conditions:
1. either $\nabla h$ is $\ell_h$-Lipschitz, in which case $\ell_f = L_f \ell_h$,
2. or $\nabla f$ is $\ell_f$-Lipschitz, in which case .
Recalling that , the second condition is tautological. In case $\nabla h$ is $\ell_h$-Lipschitz, Lemma 2.5 yields that
 and that consequently
$\ell_f = L_f \ell_h$
is a Lipschitz modulus for $\nabla f$. ∎
3.4 Simplified bounds
In this section we provide bounds that only discern whether $f$ is convex, concave, or neither of the above. As discussed in Remark 2.3, these cases can be recovered by suitable combinations of the coefficients $p_f, p_{-f}$, and thus lead to easier, though possibly looser, bounds compared to those in Theorem 3.6. We will also avail ourselves of the estimates of $\ell_f$ in Remark 3.8 to discuss the cases in which $\nabla f$ is Lipschitz continuous.
Without distinguishing between upper and lower relative bounds, whenever $f$ is $L_f$-smooth relative to $h$ as in Assumption I one can consider $\sigma_f = \sigma_{-f} = -L_f$ or, equivalently, $p_f = p_{-f} = -1$. Plugging these values into Theorem 3.6 yields the following.
Corollary 3.9 (worst-case bounds).
Suppose that Assumption I holds. All the claims of Theorem 3.6 hold when $\gamma$, $\alpha$ and $c$ are such that
1. either  and , in which case ;
2. or $h$ is strongly convex and Lipschitz differentiable, and , in which case ;
3. or $h$ is strongly convex, $\nabla f$ is Lipschitz, and , in which case .

Proof.
We will discuss only the bounds on $\gamma$, as those on $c$ will automatically follow from the fact that . Setting $p_f = p_{-f} = -1$ in Theorem 3.6, one has:
When $f$ is convex, $p_f = 0$ can be considered, resulting in the following bounds.
Corollary 3.10 (bounds when $f$ is convex).
Suppose that Assumption I holds and that $f$ is convex. All the claims of Theorem 3.6 remain valid if $\gamma$, $\alpha$ and $c$ are such that
1. either  and , in which case ;
2. or $h$ is strongly convex and Lipschitz differentiable, in which case ;
3. or $h$ is strongly convex, $\nabla f$ is Lipschitz, , and , in which case .
Similarly, when $f$ is concave (that is, $-f$ is convex), then $p_{-f} = 0$ can be considered, resulting in the following. The proof is omitted, as it uses the same arguments as in the previous results.
Corollary 3.11 (bounds when $f$ is concave).
4 Convergence analysis
In this section we study the behavior of sequences generated by FRB. Although some basic convergence results can be derived in the full generality of Assumption I, establishing local optimality guarantees of the limit point(s) will ultimately require an additional full domain assumption.
Assumption II.
Function $h$ has full domain, that is, $\operatorname{dom} h = \mathbb{R}^n$.
Assumption II is standard for nonconvex splitting algorithms in the relatively smooth setting. To the best of our knowledge, the question of whether this requirement can be removed remains open; see, e.g., [35] and the references therein.
4.1 Function value convergence
We begin with the convergence of the merit function value.
Theorem 4.1 (function value convergence of FRB).
Let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by FRB (Algorithm 1) in the setting of Theorem 3.6. Then:
1. It holds that
(4.1)
Then,  and  as $k \to \infty$ for some .
2. If $\varphi$ is level bounded, then $(x^k)_{k \in \mathbb{N}}$ is bounded.
3. Suppose that Assumption II holds, and let $\Omega$ be the set of limit points of $(x^k)_{k \in \mathbb{N}}$. Then, $\varphi$ is constant on $\Omega$ with value , and for every $x_\star \in \Omega$ it holds that $0 \in \hat\partial\varphi(x_\star)$ and .
Proof.
3: Suppose that a subsequence $(x^{k_j})_{j \in \mathbb{N}}$ converges to a point $x_\star \in \Omega$; then so do the shifted subsequences by Item 1 and [6, Prop. 2.2(iii)]. Since $x^{k_j+1} \in \mathrm{T}_{\gamma,\alpha}(x^{k_j}, x^{k_j-1})$, by passing to the limit, osc of $\mathrm{T}_{\gamma,\alpha}$ (Item 3) implies that $x_\star \in \mathrm{T}_{\gamma,\alpha}(x_\star, x_\star)$. Invoking Item 5 yields the stationarity condition $0 \in \hat\partial\varphi(x_\star)$. Moreover, by continuity of $\varphi^{\rm FRB}_{\gamma,\alpha}$ one has
where the last equality follows from Item 2, owing to the inclusion  (and the fact that $\mathrm{D}_\psi(x, x) = 0$ for any differentiable function $\psi$). From the arbitrariness of $x_\star \in \Omega$ we conclude that $\varphi$ is constant on $\Omega$. ∎
It is now possible to demonstrate the necessity of some of the bounds on the stepsize that were discussed in Section 3.4, by showing that the difference of consecutive iterates may otherwise fail to vanish. Note that, for $\alpha = 0$, the following counterexample constitutes a tightness certificate for the bound derived in [42] in the noninertial Euclidean case.
Example 4.2.
The bound is tight even in the Euclidean case. To see this, consider and for a fixed let . Then, one has and . For , it is easy to see that
(with ). If , then is a sequence generated by FRB for which . ∎
As a consequence of Item 1, the condition  is satisfied in finitely many iterations for any tolerance $\varepsilon > 0$. While this could be used as a termination criterion, in the generality of Assumptions I and II there is no guarantee on the relaxed stationarity measure $\operatorname{dist}(0, \hat\partial\varphi(x^{k+1}))$, which through Item 4 can only be estimated as

(4.2)
On the other hand, in accounting for possibly unbounded sequences, additional assumptions are needed for the condition to be met in finitely many iterations.
Lemma 4.3 (termination criteria).
Suppose that Assumption II holds, and let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by FRB (Algorithm 1) in the setting of Theorem 3.6. If
1. either $\varphi$ is level bounded,
2. or $h$ is uniformly convex (equivalently, $h^*$ is uniformly smooth),
then the quantity in (4.2) vanishes as $k \to \infty$. Thus, for any $\varepsilon > 0$ the condition  is satisfied for all $k$ large enough and guarantees $\varepsilon$-stationarity of $x^{k+1}$.
Proof.
The implication between the condition and $\varepsilon$-stationarity of $x^{k+1}$ has already been discussed. If $\varphi$ is level bounded, then Theorem 4.1 implies that $(x^k)_{k \in \mathbb{N}}$ is bounded, and the claim holds by continuity of $\nabla f$ and $\nabla h$. In case $h$ is uniformly convex, this being equivalent to uniform smoothness of $h^*$ as shown in [4, Cor. 2.8], the vanishing of the Bregman distances implies through [32, Prop. 4.13(IV)] that $\|x^{k+1} - x^k\| \to 0$. In turn, by relative smoothness necessarily the gradient differences vanish as well, overall proving that the quantity in (4.2) vanishes. ∎
4.2 Global convergence
In this subsection, we work towards the global sequential convergence of FRB. To this end, we introduce a key concept which will be useful soon. For $\eta \in (0, \infty]$, denote by $\Phi_\eta$ the class of functions $\phi : [0, \eta) \to [0, \infty)$ satisfying the following: (i) $\phi$ is right-continuous at $0$ with $\phi(0) = 0$; (ii) $\phi$ is strictly increasing on $[0, \eta)$.
Definition 4.4 ([41, Def. 5]).
Let $F : \mathbb{R}^n \to \overline{\mathbb{R}}$ be proper and lsc. Let $\bar x \in \operatorname{dom} \partial F$ and $\eta \in (0, \infty]$, and let $\Omega \subseteq \operatorname{dom} \partial F$ be a nonempty subset.
1. We say that $F$ has the pointwise generalized concave Kurdyka-Łojasiewicz (KL) property at $\bar x$ if there exist a neighborhood $U \ni \bar x$ and a concave $\phi \in \Phi_\eta$ such that for all $x \in U$ with $F(\bar x) < F(x) < F(\bar x) + \eta$,
$$\phi'_-\bigl(F(x) - F(\bar x)\bigr) \cdot \operatorname{dist}\bigl(0, \partial F(x)\bigr) \ge 1,$$
where $\phi'_-$ denotes the left derivative. Moreover, $F$ is a generalized concave KL function if it has the generalized concave KL property at every $\bar x \in \operatorname{dom} \partial F$.
2. Suppose that $F$ is constant on $\Omega$, with value $F_\Omega$. We say $F$ has the setwise (we shall omit adjectives “pointwise” and “setwise” whenever there is no ambiguity) generalized concave KL property on $\Omega$ if there exist $\varepsilon > 0$, $\eta \in (0, \infty]$ and a concave $\phi \in \Phi_\eta$ such that for every $x$ with $\operatorname{dist}(x, \Omega) < \varepsilon$ and $F_\Omega < F(x) < F_\Omega + \eta$,
$$\phi'_-\bigl(F(x) - F_\Omega\bigr) \cdot \operatorname{dist}\bigl(0, \partial F(x)\bigr) \ge 1.$$
For a subset $S \subseteq \mathbb{R}^n$ and $\varepsilon > 0$, define $S_\varepsilon := \{x \in \mathbb{R}^n : \operatorname{dist}(x, S) < \varepsilon\}$.
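As a simple illustration of Definition 4.4 (our own example, not from [41]): for $F(x) = |x|^p$ with $p > 1$ and $\bar x = 0$, one has $\partial F(x) = \{p\,\operatorname{sign}(x)|x|^{p-1}\}$ for $x \ne 0$, and the concave function $\phi(t) = t^{1/p} \in \Phi_\infty$ satisfies
$$\phi'_-\bigl(F(x) - F(0)\bigr) \cdot \operatorname{dist}\bigl(0, \partial F(x)\bigr) = \tfrac1p |x|^{1-p} \cdot p\,|x|^{p-1} = 1,$$
so the KL inequality holds with equality; in the terminology of Section 4.3, $F$ has KL exponent $\theta = 1 - 1/p$ at the origin.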
Lemma 4.5 ([41, Lem. 4.4]).
Let $F$ be proper and lsc, and let $\eta \in (0, \infty]$. Let $\Omega$ be a nonempty compact set on which $F$ is constant. Then the following hold:
1. Suppose that $F$ satisfies the pointwise generalized concave KL property at each $\bar x \in \Omega$. Then there exist $\varepsilon > 0$ and a concave $\phi \in \Phi_\eta$ such that $F$ has the setwise generalized concave KL property on $\Omega$ with respect to $\varepsilon$, $\eta$, and $\phi$.
2. Set  and define  by
Then the function  defined by  with  is well defined and belongs to $\Phi_\eta$. It has the setwise generalized concave KL property on $\Omega$ with respect to $\varepsilon$, $\eta$, and . Moreover,
We say that it is the exact modulus of the setwise generalized concave KL property of $F$ on $\Omega$ with respect to $\varepsilon$ and $\eta$.
In the remainder of the section, we will make use of the norm on the product space $\mathbb{R}^n \times \mathbb{R}^n$ defined as $\|(x, y)\| := \sqrt{\|x\|^2 + \|y\|^2}$.
Theorem 4.6 (sequential convergence of FRB).
Suppose that Assumption II holds, and let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by FRB (Algorithm 1) in the setting of Theorem 3.6. Define $z^k := (x^k, x^{k-1})$ and let $\Omega$ be the set of limit points of $(z^k)_{k \in \mathbb{N}}$. Define
Assume in addition the following:
1. $\varphi$ is level bounded.
2. $f$ and $h$ are twice continuously differentiable and $\nabla^2 h$ is positive definite.
3. The merit function satisfies the generalized concave KL property on $\Omega$ with respect to  and .
Then $(x^k)_{k \in \mathbb{N}}$ has finite length and there exists $x_\star$ with $0 \in \hat\partial\varphi(x_\star)$ such that $x^k \to x_\star$ as $k \to \infty$. To be specific, there exists $\bar k \in \mathbb{N}$ such that for all $k \ge \bar k$

(4.3)

where  is some constant,  is the strong convexity modulus of $h$ on , the closed ball in which $(x^k)_{k \in \mathbb{N}}$ lies, and  is the exact modulus of the generalized concave KL property associated with  produced by Item 2 of Lemma 4.5.
Proof.
Set  for simplicity. Then , decreasingly, and  as  by invoking Theorem 4.1. Assume without loss of generality that , otherwise we would have  due to Item 1, from which the desired result readily follows by simple induction. Thus . Appealing to Item 3 and Item 1 yields that the merit function is constantly equal to its limit value on the compact set . In turn, all conditions in Item 2 are satisfied, which implies that for
(4.4)
Define
Then and , where , , , and . Applying subdifferential calculus to yields that
which together with Item 4 entails that . In turn, summing the aforementioned bounds on and gives
(4.5)
where .
Finally, we show that is convergent. For simplicity, define . Then, combining (4.4) and (4.5) yields
where the second inequality is implied by concavity of , the third one follows from (4.1), and the fourth one holds because is the strong convexity modulus of on . Hence,
(4.6)
Summing (4.6) from  to an arbitrary index and passing to the limit justifies (4.3). A similar procedure shows that the sequence is Cauchy, which together with Item 3 entails the rest of the statement. ∎
Remark 4.7.
The corollary below states that semialgebraic functions satisfy the assumptions in Theorem 4.6.
Corollary 4.8.
Suppose that Assumption II holds, and let $(x^k)_{k \in \mathbb{N}}$ be a sequence generated by FRB (Algorithm 1) in the setting of Theorem 3.6. Assume in addition that
1. $\varphi$ is level bounded,
2. $f$ and $h$ are twice continuously differentiable and $\nabla^2 h$ is positive definite, and
3. $f$, $g$, and $h$ are semialgebraic.
Then $(x^k)_{k \in \mathbb{N}}$ has finite length and there exists $x_\star$ with $0 \in \hat\partial\varphi(x_\star)$ such that $x^k \to x_\star$.
4.3 Convergence rates
Having established convergence of FRB, we now turn to its rate. Recall that a function $F$ is said to have KL exponent $\theta \in [0, 1)$ if it satisfies the generalized concave KL property (recall Definition 4.4) with a desingularizing function of the form $\phi(t) = c\, t^{1-\theta}$ for some $c > 0$.
Theorem 4.9 (function value and sequential convergence rate).
Suppose that all the assumptions in Theorem 4.6 are satisfied, and follow the notation therein. Define . Assume in addition that  has KL exponent $\theta$ at . Then the following hold:
1. If $\theta = 0$, then  and  after finitely many steps.
2. If $\theta \in (0, 1/2]$, then there exist  and  such that for sufficiently large $k$,
3. If $\theta \in (1/2, 1)$, then there exist  such that for sufficiently large $k$,
Proof.
Assume without loss of generality that  has a desingularizing function $\phi(t) = c\, t^{1-\theta}$ and let . We claim that

(4.7)

which will be justified at the end of this proof. It is routine to see that the desired sequential rate is implied by the function value rate through (4.7); see, e.g., [40, Thm. 5.3]. Therefore, it suffices to prove the convergence rate of the function values.
Recall from Item 1 that is a decreasing sequence converging to . Then invoking the KL exponent assumption yields , which together with (4.5) implies that
(4.8)
Appealing again to Item 1 gives
where the last inequality is implied by (4.8). Then applying [11, Lem. 10] justifies the desired convergence rate of .
Finally, we show that (4.7) holds. Invoking Assumptions 2 and 1 entails , thus . In turn,
where the second inequality follows from (4.3), the third one holds due to the fact that and the monotonicity of , and the last one is implied by the KL exponent assumption and Lemma 4.5, as claimed. ∎
5 Globalizing fast local methods with FRB
With some due care, the method can be enhanced in the context of the continuous-Lyapunov descent (CLyD) framework [36, §4], so as to globalize fast local methods using the same oracle as FRB. Methods of quasi-Newton type constitute a convenient class of candidate local methods. A prototypical use case hinges on the inclusion encoding necessary optimality conditions, cf. Item 5, so that fast update directions can be retrieved based on the associated root-finding problem; see, e.g., [15, §7] and [17]. Regardless, while the globalization framework is flexible enough to accommodate any update direction while retaining the convergence of FRB, it promotes the directions triggering fast local convergence, as will be demonstrated in Theorem 5.3.
Set  and  with the largest  in  such that
(5.1)
The algorithmic framework revolves around two key facts: continuity of the merit function and its decrease after FRB steps with $\alpha = 0$. (The reason for introducing an auxiliary variable will be discussed after Theorem 5.1.) Thus, not only is the merit value smaller after such a step at the current point, but also at sufficiently close points, thereby ensuring that, by gradually pushing the tentative fast update towards the safeguard, a decrease of the merit function is eventually achieved.
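In pseudocode terms, the interleaving logic can be sketched as follows; this is a schematic sketch with illustrative names only, and the actual safeguard, auxiliary variable, and acceptance test are those of Algorithm 2 and (5.1).

```python
def globalized_frb(frb_step, merit, fast_direction, x0, max_iter=100, tol=1e-8):
    """Schematic CLyD-style globalization of a fast local method with FRB.

    frb_step(x, x_prev): one noninertial (alpha = 0) FRB step, used as safeguard.
    merit(x, x_prev):    continuous merit function (the role played by (3.6)).
    fast_direction(x):   candidate update direction, e.g. quasi-Newton.
    """
    x_prev, x = x0, x0
    for _ in range(max_iter):
        x_bar = frb_step(x, x_prev)               # noninertial trial (safeguard)
        target = merit(x_bar, x)                  # reference merit value
        d = fast_direction(x_bar)                 # fast candidate direction
        tau = 1.0                                 # unit stepsize tried FIRST (Thm 5.3)
        while merit(x_bar + tau * d, x_bar) > target and tau > 1e-12:
            tau *= 0.5                            # push candidate towards safeguard
        x_prev, x = x, (x_bar + tau * d if tau > 1e-12 else x_bar)
        if abs(merit(x, x_prev) - target) <= tol: # heuristic stop, illustration only
            return x
    return x

# 1-D toy usage: minimize (x - 1)^2 with a gradient-step safeguard.
frb = lambda x, xp: x - 0.4 * (2 * (x - 1))       # stand-in for an FRB step
mer = lambda x, xp: (x - 1) ** 2                  # stand-in merit
newton = lambda x: -(2 * (x - 1)) / 2.0           # Newton direction
print(globalized_frb(frb, mer, newton, x0=5.0))
```

This fact is formalized next.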
Theorem 5.1 (well definedness and asymptotic analysis of Algorithm 2).
Suppose that Assumption I and the bounds on $\gamma$ and $\alpha$ as in Theorem 3.6 are satisfied. Then, the following hold for the iterates generated by Algorithm 2:
1. The linesearch terminates in finitely many backtrackings, so that Algorithm 2 is well defined.
2. , and in particular .
3. If $\varphi$ is coercive, then the sequence remains bounded.
4. Suppose that Assumption II holds, and let $\Omega$ be the set of limit points of the iterates. Then, $\varphi$ is constant on $\Omega$ with value , and for every $x_\star \in \Omega$ it holds that $0 \in \hat\partial\varphi(x_\star)$ and .
5. Under the assumptions of Lemma 4.3, for any $\varepsilon > 0$ the condition  holds for all $k$ large enough and guarantees that .
Proof.
It should be noted that Theorem 5.1 remains valid even with the simpler choice . The apparently wasteful choice of  as in Algorithm 2 is instead of paramount importance in promoting “good” directions by enabling acceptance of the unit stepsize. This is in sharp contrast with known issues of other nonsmooth globalization strategies, where convergence is severely hampered (the Maratos effect [26]; see also [17, §6.2]). A quality assessment of the direction is provided by the following notion.
Definition 5.2 (superlinearly convergent directions [15, Eq. (7.5.2)]).
Relative to a sequence $(x^k)_{k \in \mathbb{N}}$ converging to a point $x_\star$, we say that $(d^k)_{k \in \mathbb{N}}$ are superlinearly convergent directions if
$$\lim_{k \to \infty} \frac{\|x^k + d^k - x_\star\|}{\|x^k - x_\star\|} = 0.$$
Theorem 5.3 (acceptance of unit stepsize).
Suppose that Assumptions I and II and the bounds on $\gamma$ and $\alpha$ as in Theorem 3.6 are satisfied. Suppose further that  with , and that the sequence converges to a strong local minimum $x_\star$ for $\varphi$ satisfying . (Although it is only the inclusion  which is guaranteed for any limit point in the full generality of Assumptions I and II (cf. Item 4), additionally requiring single-valuedness is a negligible extra assumption, as can be inferred from Item 5; see [1, §3.1] and [36, §2.4] for a detailed discussion.) If the directions are superlinearly convergent with respect to the sequence, then eventually the unit stepsize is always accepted at step 6, and the algorithm reduces to the local method and converges at superlinear rate.
Proof.
Let $x_\star$ be the limit point of the sequence. Since , as well, and strong local minimality of $x_\star$ implies that  holds for some  and all $k$ large enough. In turn, for large $k$ it holds that
 for some  (since ), and by Young's inequality  for every . By choosing  one obtains that
(5.2)
where . On the other hand, since  for any , by definition of the envelope (cf. (3.4) with $\alpha = 0$) it follows that
(5.3)
holds for some  and all $k$ large enough, where the last inequality uses the fact that  (hence  too). Combined with (5.2) and superlinearity of the directions we obtain that
(5.4)
Observe that for any , eventually it holds that . In fact, the first inequality owes to (3.9b), and the second one holds for $k$ large enough because of local minimality of $x_\star$ and the fact that
 by osc of the FRB-operator (cf. Item 3). Then, for $k$ large enough so that  also holds, we have
 so that Theorem 3.6 yields
. Since  and , eventually , hence
, implying that the first attempt with the unit stepsize passes the linesearch condition (since, when , one has ). In particular, the sequence eventually reduces to the local method, and thus converges superlinearly. ∎
The core of the proof hinges on showing that the quantity in (5.4) vanishes. The same argument cannot be achieved by a naive linesearch prescribing . Indeed,
 where the first inequality owes to local minimality of $x_\star$ and the fact that , the second one to the expression (3.2a) of the model and the definition of the envelope, cf. (3.4), and the big-$O$ estimates to local smoothness of $f$ and $h$. In particular, . Observing that (5.2) is still valid, denoting  so that , one then has
 where “$\simeq$” denotes equality up to vanishing terms and  are some constants due to [15, Lem. 7.5.7]. In this process, the information about superlinearity of the directions is lost, and the vanishing of the quantity in (5.4) cannot be established. Instead, as is apparent from (5.3), the choice made at step 6 guarantees that on the first trial step the merit coincides with a Bregman Moreau envelope, a function more tightly connected to the cost $\varphi$.
6 Conclusions
This work contributes a mirror inertial forward-reflected-backward splitting algorithm (FRB) and its linesearch enhancement, extending the forward-reflected-backward method proposed in [25] to the nonconvex and relatively smooth setting. We have shown that the proposed algorithms enjoy pleasant properties akin to other splitting methods in the same setting. However, our methodology deviates from tradition through the FRB-envelope, an envelope function defined on a product space that takes inertial terms into account, which, to the best of our knowledge, is the first of its kind and could thus be instrumental for future research. This approach also requires the inertial parameter to be nonpositive, which is consistent with a recent result [14] regarding the impossibility of accelerated non-Euclidean algorithms under relative smoothness. It would thus be tempting to see whether an explicit example can be constructed proving the sharpness of such a restrictive assumption. It is also worth applying our technique to other two-stage splitting methods, such as Tseng's method, to obtain similar extensions.
Appendix A Proof of Theorem 3.6
A.1 Convex case
Lemma A.1.
Suppose that Assumption I holds and let $\gamma$, $\alpha$ and $c$ be such that  is a convex function. Then, for every $(y, y^-) \in C \times C$ and $x \in \mathrm{T}_{\gamma,\alpha}(y, y^-)$

(A.1)

If  too is convex, then $\mathcal{L}_{\gamma,\alpha}$ has the same infimum as $\varphi$, and is level bounded iff so is $\varphi$.
Proof.
All the claimed inequalities follow from (3.5) together with the fact that , and that  too when  is convex. When both  and  are convex, Item 2 implies that

(A.2)

proving that $\inf \mathcal{L}_{\gamma,\alpha} \ge \inf \varphi$. The converse inequality follows from Item 3 by observing that $\mathcal{L}_{\gamma,\alpha}(y, y) \le \varphi(y)$ for every $y \in C$.
To conclude, suppose that $\varphi$ is not level bounded, and consider an unbounded sequence  such that , for some . Then, it follows from (A.2) that , where . From local boundedness of $\mathrm{T}_{\gamma,\alpha}$ (Item 3) it follows that  too is unbounded, showing that $\mathcal{L}_{\gamma,\alpha}$ is not level bounded either. The converse holds by observing that , cf. Item 3. ∎
In the setting of scenario 1, inequality (A.1) can equivalently be written in terms of  as
 where the second inequality owes to the fact that , since  is convex, having
 the coefficient of  being null by definition of , and  being convex by definition of the relative weak convexity modulus , cf. Definition 2.2. This proves (3.9a); inequality (3.9b) follows similarly by observing that
 so that
. In turn, Item 2 follows from the same arguments as in the proof of Lemma A.1.
A.2 Lipschitz differentiable case
Lemma A.2.
In addition to Assumption I, suppose that $f$ is $\ell_f$-Lipschitz differentiable for some $\ell_f \ge 0$. Then, for every $(y, y^-) \in C \times C$ and $x \in \mathrm{T}_{\gamma,\alpha}(y, y^-)$

(A.3)

If  is a convex function, then $\mathcal{L}_{\gamma,\alpha}$ has the same infimum as $\varphi$, and is level bounded iff so is $\varphi$.
We will pattern the arguments of the previous section, observing that inequality (A.3) can equivalently be written in terms of  as
. Once again, the fact that  owes to convexity of , having
, altogether proving (3.9a). Similarly, inequality (3.9b) follows by observing that , so that
. The assertion of Item 2 once again follows from the same arguments as in the proof of Lemma A.2.
References
- [1] M. Ahookhosh, A. Themelis, and P. Patrinos. A Bregman forward-backward linesearch algorithm for nonconvex composite optimization: Superlinear convergence to nonisolated local minima. SIAM Journal on Optimization, 31(1):653–685, 2021.
- [2] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
- [3] H. Attouch, J. Bolte, and B.F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
- [4] D. Azé and J. Penot. Uniformly convex and uniformly smooth convex functions. Annales de la Faculté des sciences de Toulouse : Mathématiques, Ser. 6, 4(4):705–730, 1995.
- [5] H.H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2017.
- [6] H.H. Bauschke and P.L. Combettes. Iterating Bregman retractions. SIAM Journal on Optimization, 13(4):1159–1173, 2003.
- [7] H.H. Bauschke and P.L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert spaces. CMS Books in Mathematics. Springer, 2017.
- [8] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- [9] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 2016.
- [10] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd. First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM Journal on Optimization, 28(3):2131–2151, 2018.
- [11] R.I. Boţ and D. Nguyen. The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates. Mathematics of Operations Research, 45(2):682–712, 2020.
- [12] G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.
- [13] R. Dragomir, A. d’Aspremont, and J. Bolte. Quartic first-order methods for low-rank minimization. Journal of Optimization Theory and Applications, 189(2):341–363, 2021.
- [14] R. Dragomir, A.B. Taylor, A. d’Aspremont, and J. Bolte. Optimal complexity and certification of Bregman first-order methods. Mathematical Programming, 194(1):41–83, 2022.
- [15] F. Facchinei and J. Pang. Finite-dimensional Variational Inequalities and Complementarity Problems, volume II. Springer, 2003.
- [16] F. Hanzely, P. Richtarik, and L. Xiao. Accelerated Bregman proximal gradient methods for relatively smooth convex optimization. Computational Optimization and Applications, 79(2):405–440, 2021.
- [17] A.F. Izmailov and M.V. Solodov. Newton-type Methods for Optimization and Variational Problems. Springer, 2014.
- [18] C. Kan and W. Song. The Moreau envelope function and proximal mapping in the sense of the Bregman distance. Nonlinear Analysis: Theory, Methods & Applications, 75(3):1385 – 1399, 2012.
- [19] G. Li, T. Liu, and T.K. Pong. Peaceman–Rachford splitting for a class of nonconvex optimization problems. Computational Optimization and Applications, 68(2):407–436, Nov 2017.
- [20] G. Li and T.K. Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
- [21] G. Li and T.K. Pong. Douglas-Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Mathematical Programming, 159(1):371–401, Sep 2016.
- [22] Y. Liu and W. Yin. An envelope for Davis–Yin splitting and strict saddle-point avoidance. Journal of Optimization Theory and Applications, 181(2):567–587, 2019.
- [23] H. Lu, R.M. Freund, and Y. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
- [24] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
- [25] Y. Malitsky and M.K. Tam. A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM Journal on Optimization, 30(2):1451–1472, 2020.
- [26] N. Maratos. Exact penalty function algorithms for finite dimensional and control optimization problems. PhD thesis, Imperial College London (University of London), 1978.
- [27] B. Mordukhovich. Variational Analysis and Applications, volume 30. Springer, 2018.
- [28] J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France, 93:273–299, 1965.
- [29] Y. Nesterov. Implementable tensor methods in unconstrained convex optimization. Mathematical Programming, pages 1–27, 2019.
- [30] Y. Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
- [31] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
- [32] D. Reem, S. Reich, and A. De Pierro. Re-examination of Bregman functions and new properties of their divergences. Optimization, 68(1):279–348, 2019.
- [33] R.T. Rockafellar and R.J. Wets. Variational Analysis, volume 317. Springer, 2011.
- [34] L. Stella, A. Themelis, and P. Patrinos. Newton-type alternating minimization algorithm for convex optimization. IEEE Transactions on Automatic Control, 2018.
- [35] M. Teboulle. A simplified view of first order methods for optimization. Mathematical Programming, pages 1–30, 2018.
- [36] A. Themelis. Proximal Algorithms for Structured Nonconvex Optimization. PhD thesis, KU Leuven, 12 2018.
- [37] A. Themelis and P. Patrinos. Douglas–Rachford splitting and ADMM for nonconvex optimization: Tight convergence results. SIAM Journal on Optimization, 30(1):149–181, 2020.
- [38] A. Themelis, L. Stella, and P. Patrinos. Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone linesearch algorithms. SIAM Journal on Optimization, 28(3):2274–2303, 2018.
- [39] A. Themelis, L. Stella, and P. Patrinos. Douglas-Rachford splitting and ADMM for nonconvex optimization: Accelerated and Newton-type algorithms. Computational Optimization and Applications, 82:395–440, 2022.
- [40] X. Wang and Z. Wang. A Bregman inertial forward-reflected-backward method for nonconvex minimization. arXiv:2207.01170, 2022.
- [41] X. Wang and Z. Wang. The exact modulus of the generalized concave Kurdyka-Łojasiewicz property. Mathematics of Operations Research, 47(4):2765–2783, 2022.
- [42] X. Wang and Z. Wang. Malitsky-Tam forward-reflected-backward splitting method for nonconvex minimization problems. Computational Optimization and Applications, 82(2):441–463, 2022.