Universal subgradient and proximal bundle methods for
convex and strongly convex hybrid composite optimization
Abstract
This paper develops two parameter-free methods for solving convex and strongly convex hybrid composite optimization problems, namely, a composite subgradient type method and a proximal bundle type method.
Both functional and stationary complexity bounds for the two methods are established in terms of the unknown strong convexity parameter.
To the best of our knowledge, the two proposed methods are the first universal methods for solving hybrid strongly convex composite optimization problems that do not rely on any restart scheme nor require the knowledge of the optimal value.
Key words. hybrid composite optimization, iteration complexity, universal method, proximal bundle method
AMS subject classifications. 49M37, 65K05, 68Q25, 90C25, 90C30, 90C60
1 Introduction
This paper considers the convex hybrid composite optimization (HCO) problem
(1) $\phi_* := \min\{\phi(x) := f(x) + h(x) : x \in \mathbb{R}^n\},$
where $f, h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ are proper lower semi-continuous convex functions such that $\mathrm{dom}\, h \subseteq \mathrm{dom}\, f$ and the following conditions hold: there exist scalars $M \ge 0$ and $L \ge 0$ and a first-order oracle $f' : \mathrm{dom}\, h \to \mathbb{R}^n$ (i.e., $f'(x) \in \partial f(x)$ for every $x \in \mathrm{dom}\, h$) satisfying the $(M, L)$-hybrid condition that $\|f'(u) - f'(v)\| \le 2M + L\|u - v\|$ for every $u, v \in \mathrm{dom}\, h$. Moreover, assume that $\mu \ge 0$ is the largest scalar such that $\phi - \mu\|\cdot\|^2/2$ is convex, i.e., $\mu$ is the intrinsic convexity parameter of $\phi$.
This work is concerned with parameter-free (PF) methods for solving (1), i.e., ones that do not require knowledge of any of the parameters associated with the instance $(f, h)$, such as the parameter pair $(M, L)$ or the intrinsic convexity parameter $\mu$ of $\phi$. More specifically, it considers PF methods whose complexities for solving (1) are expressed in terms of $\mu$ (in addition to other parameters associated with the instance). We refer to them as $\mu$-universal methods. PF methods whose (provable) complexities do not depend on $\mu$ are called universal ones (even if $\mu > 0$). Moreover, PF methods whose complexities are given in terms of the intrinsic convexity parameter $\mu_f$ of $f$ (resp., $\mu_h$ of $h$) are called $\mu_f$-universal (resp., $\mu_h$-universal). It is worth noting that $\mu$ can be substantially larger than $\max\{\mu_f, \mu_h\}$ (e.g., for $f = \mu_f\|\cdot\|^2/2$ and $h = \mu_h\|\cdot\|^2/2$, we have $\mu = \mu_f + \mu_h$). Hence, complexities for $\mu$-universal methods are usually better than those of $\mu_f$-universal and $\mu_h$-universal methods, and even of (universal or non-universal) methods whose complexities depend on $\max\{\mu_f, \mu_h\}$.
Related literature. We divide our discussion here into universal and $\mu$-universal methods.
Universal methods: The first universal methods for solving (1), under the condition that $f$ has Hölder continuous gradient, were presented in [21] and [10]. Specifically, the first paper develops universal variants of the primal gradient, the dual gradient, and the accelerated gradient methods, while the second one shows that accelerated versions of the bundle-level and prox-level methods are universal. Additional universal methods for solving (1) have been studied in [14, 16, 19, 25] under the condition that $f$ is smooth, and in [13, 15, 17] for the case where $f$ is either smooth or nonsmooth. The methods in [15, 17] (resp., [19, 25]) are also shown to be -universal (resp., -universal under the condition that ). The papers [14, 16] present universal accelerated composite gradient methods for solving (1) under the more general condition that $f$ is a smooth weakly convex function. Since any convex function is $\rho$-weakly convex for any $\rho \ge 0$, the results of [14, 16] also apply to the convex case and yield complexity bounds similar to the ones of [19, 25].
$\mu$-Universal methods: Under the assumption that $f$ is a smooth function (i.e., $M = 0$), various works [1, 2, 3, 4, 5, 6, 8, 12, 20, 23] have developed -universal (or -universal) methods for (1) with a (or ) iteration complexity bound. For the sake of our discussion, we refer to a convex (resp., strongly convex) version of an accelerated gradient method as ACG (resp., S-ACG). Among papers concerned with finding an -solution of (1), [20] proposes the first -universal method based on a restart S-ACG scheme where each iteration adaptively updates an estimate of and calls S-FISTA (see also [8] for a -universal variant in this vein); moreover, using restart ACG schemes motivated by previous works such as [11, 22, 24], paper [23] develops -universal methods under the assumption that is known.
Among papers concerned with finding an -stationary solution of (1), [1, 2, 3, 4, 6] (resp., [12]) develop -universal (resp., -universal) methods based on restart ACG (resp., S-ACG) schemes that estimate the condition number (resp., ), or some related quantity; [3, 4, 6] then use the estimation to determine the number of iterations of each ACG call while the SCAR method of [12] uses a stationary termination to end each S-ACG call. Moreover, under the assumption that and are known, [5] develops a -universal method that performs only one call to an ACG variant (for convex CO).
Under the assumption that $f$ is non-smooth (i.e., ), [23] (see also [9] for an extension) proposes -universal methods under the assumption that is known. Specifically, the -universal method of [23] repeatedly invokes a universal oracle that halves the primal gap on each call.¹

¹ Paper [23] removes the assumption that is known but forces its method to make multiple parallel calls to the universal oracle.
Our contribution. The goal of this work is to present two $\mu$-universal methods for problem (1), namely: a composite subgradient (U-CS) type method and a proximal bundle (U-PB) type method. The first method is a variant of the universal primal gradient method of [21] (see also Appendix C.2 of [17]), which is still not known to be $\mu$-universal. The second one is a variant of the generic proximal bundle (GPB) method of [17] that bounds the number of consecutive null iterations and adaptively chooses the prox stepsize under this policy. Both methods are analyzed in a unified manner using a general framework for strongly convex optimization problems (referred to as FSCO) which specifies sufficient conditions for its PF instances to be $\mu$-universal. Both functional and stationary complexities are established for FSCO in terms of $\mu$, and these are then used to obtain complexity bounds for both U-CS and U-PB. Interestingly, in contrast to previous $\mu$-universal methods, U-CS and U-PB do not perform any restart scheme nor require $\mu$ to be known.
Organization of the paper. Subsection 1.1 presents basic definitions and notation used throughout the paper. Section 2 formally describes FSCO and the assumptions on the problem of interest, and provides both functional and stationarity complexity analyses of FSCO. Section 3 presents U-CS and U-PB, as two instances of FSCO, for solving problem (1) and establishes their corresponding complexity bounds. Section 4 presents some concluding remarks and possible extensions. Finally, Appendix A provides technical results of FSCO and U-PB.
1.1 Basic definitions and notation
Let $\mathbb{R}$ denote the set of real numbers. Let $\mathbb{R}_+$ (resp., $\mathbb{R}_{++}$) denote the set of non-negative (resp., positive) real numbers. Let $\mathbb{R}^n$ denote the standard $n$-dimensional Euclidean space equipped with inner product and norm denoted by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$, respectively. Let $\log(\cdot)$ denote the natural logarithm.
For a given function $\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, let $\mathrm{dom}\,\psi := \{x \in \mathbb{R}^n : \psi(x) < +\infty\}$ denote the effective domain of $\psi$; $\psi$ is proper if $\mathrm{dom}\,\psi \ne \emptyset$. A proper function $\psi$ is $\mu$-convex for some $\mu \ge 0$ if
$\psi(\alpha x + (1-\alpha)y) \le \alpha\psi(x) + (1-\alpha)\psi(y) - \frac{\alpha(1-\alpha)\mu}{2}\|x - y\|^2$
for every $x, y \in \mathrm{dom}\,\psi$ and $\alpha \in [0, 1]$. Let $\overline{\mathrm{Conv}}_\mu(\mathbb{R}^n)$ denote the set of all proper lower semicontinuous $\mu$-convex functions, which we simply denote by $\overline{\mathrm{Conv}}(\mathbb{R}^n)$ when $\mu = 0$. For $\varepsilon \ge 0$, the $\varepsilon$-subdifferential of $\psi$ at $x \in \mathrm{dom}\,\psi$ is denoted by
$\partial_\varepsilon \psi(x) := \{ s \in \mathbb{R}^n : \psi(y) \ge \psi(x) + \langle s, y - x \rangle - \varepsilon, \ \forall y \in \mathbb{R}^n \}.$
We denote the subdifferential of $\psi$ at $x$ by $\partial\psi(x)$, which is the set $\partial_0\psi(x)$ by definition. For a given subgradient $f'(x) \in \partial f(x)$, we denote the linearization of the convex function $f$ at $x$ by $\ell_f(\cdot; x)$, which is defined as
(2) $\ell_f(u; x) := f(x) + \langle f'(x), u - x \rangle \quad \forall u \in \mathbb{R}^n.$
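To make the linearization in (2) concrete, the following minimal sketch (our own illustration, not code from the paper) numerically verifies its defining property: since $f$ is convex, $\ell_f(\cdot; x)$ is a global lower bound on $f$ that is tight at $x$. The names f, f_prime, and linearization are our own.

```python
import numpy as np

# Minimal sketch (our own illustration): for convex f, the linearization
# l_f(u; x) = f(x) + <f'(x), u - x> from (2) under-estimates f everywhere
# and agrees with f at u = x.  Here f(x) = ||x||^2 / 2, so f'(x) = x.

def f(x):
    return 0.5 * np.dot(x, x)

def f_prime(x):
    # subgradient oracle; f is differentiable here, so f'(x) = grad f(x) = x
    return x

def linearization(u, x):
    """l_f(u; x) = f(x) + <f'(x), u - x>, cf. (2)."""
    return f(x) + np.dot(f_prime(x), u - x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
assert np.isclose(linearization(x, x), f(x))       # tight at u = x
for _ in range(1000):
    u = rng.standard_normal(5)
    assert linearization(u, x) <= f(u) + 1e-12     # global lower bound
```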
2 A framework for strongly convex optimization
This section presents a general framework, namely FSCO, for convex optimization problems and establishes both functional and stationary complexity bounds for any of its instances. These results will then be used in Subsections 3.1 and 3.2 to analyze the complexities of two specific algorithms, namely: U-CS and U-PB. This section is divided into two subsections. Subsection 2.1 focuses on the functional complexity analysis, while Subsection 2.2 provides the stationary complexity analysis.
FSCO is presented in the context of the convex optimization problem
(3) |
for which the following conditions are assumed:
- (A1) the set of optimal solutions of problem (3) is nonempty;
- (A2) the objective function of (3) is $\mu$-convex for some $\mu \ge 0$.
Clearly, the HCO problem (1) with the assumptions described underneath it is a special case of (3) where .
We now describe FSCO.
FSCO
0. Let , , and be given, and set ;
1. Compute , , , and satisfying
(4)
and set
(5)
2. Check whether a termination criterion holds and if so stop; else go to step 3;
3. Set and go to step 1.
FSCO does not specify how sequences and are generated, how models are updated, and how stepsizes are computed. Rather, it provides sufficient conditions on these sequences to ensure that its instances are -universal.
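Although FSCO leaves its main ingredients abstract, its overall control flow is simple. The following schematic sketch (our own rendering; compute_model, prox_step, and terminated are hypothetical placeholders for the quantities FSCO deliberately leaves unspecified) shows how steps 0-3 fit together.

```python
# Schematic FSCO loop (our own rendering; the three callables are
# hypothetical placeholders for what an instance of FSCO must supply).

def fsco(x0, lam0, compute_model, prox_step, terminated, max_iter=10_000):
    """Generic FSCO-style loop: build a model, take a prox step, test stopping."""
    x, lam = x0, lam0
    for j in range(1, max_iter + 1):
        model = compute_model(x, j)             # step 1: model of the objective
        x_next, lam = prox_step(model, x, lam)  # step 1: prox subproblem + stepsize
        if terminated(x_next, j):               # step 2: termination criterion
            return x_next, j
        x = x_next                              # step 3: move to next iteration
    return x, max_iter
```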
The complexity analysis of FSCO requires two additional assumptions, namely:
- (F1) there exists such that ;
- (F2) there exists such that for every iteration of FSCO.
2.1 Functional complexity analysis
This subsection studies the iteration complexity for the framework FSCO to obtain an iterate such that . Its main result is stated in Theorem 2.3.
The following lemma will be useful in the sequel.
Lemma 2.1
Consider sequences , , , and generated by FSCO. Define for ,
(6) | ||||
(7) |
where
(8) |
Then for , we have:
- a) and for every ,
(9)
- b) and
(10)
- c) for any and every , we have
(11)
Proof: (a) The optimality condition of (5) yields . This inclusion and the fact that for every imply that
where the identity is due to the definition of in (6). Hence, the inclusion in a) holds. Relation (9) immediately follows from the inclusion in a) and the definition of the subdifferential.
(b) It follows from the relation (see step 1 of FSCO) and the definition of and in (8) that . This inequality, (9), and the definition of in (7) imply that for every ,
which yields the inclusion in b). Taking in the above inequality gives , and hence the first inequality in (10) holds. Using the definitions of and in (6) and (7), respectively, the definitions of and in (8), and (4) and (5), we have
Hence, the second inequality in (10) holds.
(c) Using the inclusion in b), the fact that is -convex, and Lemma A.1 of [18], we have for any and every ,
Now, (11) follows from the above inequality with where is as in step 0 of FSCO.
Proposition 2.2
Consider sequences , , and generated by FSCO. Then for every , , , and , we have:
(12) |
and
(13) |
where
(14) |
Proof: Let . Since is a quadratic function in , its Taylor’s expansion gives
where and
Using the above formulas, we have for the expression
(15) |
where the inequality is due to (11) and the last identity is due to the definition of in (8). Rearranging the terms in (15) and using Lemma 2.1(b), we have
where in the last inequality we have used Assumption (F1).
Rearranging the terms and using Assumption (F1), we have
(16) |
It is clear that (12) follows from (16) with and observing that . Using the triangle inequality and the fact that with and we have
(17) |
Plugging the above inequality into (16), we have
which is the same as (13) after simplification.
Before giving the first main complexity result for FSCO, namely, a complexity bound for obtaining a -solution of (3), we first introduce some terminology used throughout our analysis. Let denote the closest solution of (3) to the initial point of FSCO and let denote its distance to , i.e.,
(18) |
Theorem 2.3
For a given tolerance , consider FSCO with , where is as in step 0 of FSCO. Then, the number of iterations of FSCO to generate an iterate satisfying is at most
(19) |
Proof: It is easy to see that (13) with , where is given by (18), satisfies (93) with as in (14) and
(20) |
Also, note that
It follows from Lemma A.1(c) with the above parameters and the definition of in (14) that the complexity to find a -solution is
(21) |
Since , it is easy to verify that
and hence that
Moreover, noting that and , we also have
Combining the above two inequalities, we obtain
The result now immediately follows from (21), the above inequality, and the fact that
We now comment on the complexity bound obtained in Theorem 2.3. First, the bound
(22) |
implied by (19) is meaningful only when or (otherwise, it should be understood as being infinity). Second, the validity of the second bound in (22) is well-known and can be found for example in [7, 15]. Third, if (see assumption (F2) in Section 2) and is sufficiently close to one, the smallest term in (22) is the first one, in which case (19) reduces to
Fourth, since the second bound does not depend on , and , it holds for any parameters and .
The drawback of using the termination criterion is that it requires knowledge of . The next subsection presents the complexity of FSCO to obtain a point satisfying a stopping criterion that is easily checkable for any instance of (3).
2.2 Stationarity complexity analysis
This subsection studies the iteration complexity for FSCO to obtain a near-stationary solution of (3) (see the definition below). Its main result is stated in Theorem 2.8.
We start by defining the notion of near-stationary solutions considered in this subsection.
Definition 2.4
A triple is called -compatible if it satisfies the inclusion . For a given tolerance pair , a -compatible triple is called a -stationary solution of (3) if it satisfies and .
We now comment on the benefits of using near-stationary solutions as a way to terminate an algorithm. First, many algorithms, including the ones considered in this paper, naturally generate a sequence of -compatible triples where the sequence of residual pairs can be made arbitrarily small (see Proposition 2.7 below). As a consequence, some will eventually become a -stationary solution of (3). Moreover, verifying this only requires checking whether the two inequalities and hold as the inclusion is guaranteed to hold for every . Second, this notion is related to the one considered in Subsection 2.1 as follows. If is a -stationary solution and has a finite diameter , then it follows that is a -solution of (3).
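As a concrete illustration of this termination rule, the sketch below (our own code; rho_hat and eta_hat are our labels for the tolerance pair) implements the two inequality checks of Definition 2.4 for a compatible triple.

```python
import numpy as np

# Sketch of the near-stationarity test of Definition 2.4 (our own naming):
# given a compatible triple (x, v, eps), i.e., v lies in the eps-subdifferential
# of the objective at x, accept once both residuals are within tolerance.

def is_near_stationary(v, eps, rho_hat, eta_hat):
    """True if the compatible triple qualifies as a near-stationary solution."""
    return np.linalg.norm(v) <= rho_hat and eps <= eta_hat
```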
The following lemma will be useful in deriving the complexity bounds for stationarity conditions:
Lemma 2.5
Proof: First note that (13) is a special case of (93) with
Then, (25) follows from Lemma A.1(a). It is also easy to verify that (13) with satisfies (93) with parameters as in (20). Then, (26) follows from Lemma A.1(b).
Lemma 2.6
For every , we have
(27) | ||||
(28) |
Proof: Using (26) and the fact that , we have for every ,
(29) |
It follows from the triangle inequality that
which together with (29) implies (27). Using the triangle inequality and the fact that with and , we have
(30) |
where the last inequality is due to (12). Noting that for some , and using (27), (29), and (30), we have
where the last inequality is due to the fact that . Hence, (28) is proved.
The following result shows that FSCO naturally generates a sequence of residual pairs such that is -compatible for every and provides suitable bounds for it.
Proposition 2.7
Proof: (a) Using (25) and the fact that , we have for every ,
where the last identity is due to the definitions of and in (31). Hence, the statement holds.
(b) Using the triangle inequality, (26), and the fact that , we have
Hence, the first inequality in (32) follows from the fact that . Moreover, the second inequality in (32) follows immediately from the definition of in (31).
The following Theorem 2.8 provides the complexity for stationarity conditions announced at the beginning of this subsection.
Theorem 2.8
For a given tolerance pair , FSCO with
(33) |
generates a triple satisfying
(34) |
in at most
(35) |
iterations where
(36) |
Proof: First, it follows from Proposition 2.7(a) that the inclusion holds for every . We next show that the number of iterations required for (34) is at most
(37) |
Hence, it suffices to show that and for every . Using the definition of in (24), assumption (F2), and (98), we have
(38) |
It is easy to verify that for every , we have
(39) |
It immediately follows from the inequality and the definition of in (36) that
(40) |
Using Lemma 2.6, Proposition 2.7(b), (33), the definition of in (36), and the above observation, we have
(41) | ||||
(42) |
Finally, plugging (39) into (41) and (42), we conclude that and for every , where is as in (37). It follows from the same argument as in the proof of Theorem 2.3 that (35) is an upper bound on , and hence the theorem holds.
We now make some remarks about Theorem 2.8. In contrast to Theorem 2.3, it does not apply to the case where because the quantity that appears in (35) depends on (see in (36)). The following result considers two special cases which cover the case .
Theorem 2.9
For a given tolerance pair , the following statements hold for FSCO with :
- a)
- b)
Proof: a) Since and are in , it follows from the boundedness assumption that . Using the inequality and the triangle inequality, we have
Thus, it follows from the second inequality in (32) and the fact that that
where the last inequality is due to the definition of in (43). Similarly, using the first inequality in (32), the fact that , and (43), we have
where the second inequality is also due to the observation that in view of (43). Finally, the rest of the proof follows from the same argument as in the proof of Theorem 2.8.
b) Since for every , we know for some . This observation, (27), and the second inequality in (32) imply that
where the last inequality is due to the fact that for . It follows from the fact that and the definition of in (44) that
Similarly, using the first inequality in (32), the fact that , and (44), we have
where the second inequality is also due to the observation that in view of (44). Finally, the rest of the proof follows from the same argument as in the proof of Theorem 2.8.
3 Universal composite subgradient and proximal bundle methods
This section presents the two -universal methods for solving (1), namely, U-CS in Subsection 3.1 and U-PB in Subsection 3.2. We prove that both methods are instances of FSCO and establish their iteration complexities in terms of function value and stationarity based on the analysis in Section 2.
We assume that conditions (A1) and (A2) hold for (1). We also assume that
- (A3) for some ;
- (A4) is such that , and a subgradient oracle, i.e., a function satisfying for every , is available;
- (A5) there exists such that for every ,
It is well known that (A5) implies that for every ,
(45) |
Also, for a given tolerance , the fact that the set is a (nonempty) closed convex set implies that there exists a unique pair that minimizes over (referred to as the -best pair).
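To illustrate the hybrid condition from the introduction, the snippet below (our own toy example) checks it numerically for two one-dimensional instances: a purely nonsmooth $f$ and a purely smooth one.

```python
import numpy as np

# Our own toy illustration of the (M, L)-hybrid condition
# ||f'(u) - f'(v)|| <= 2M + L||u - v||: for f(x) = |x| with oracle
# f'(x) = sign(x) it holds with (M, L) = (1, 0), since any two subgradients
# differ by at most 2; for the smooth f(x) = x^2/2 it holds with (M, L) = (0, 1).

rng = np.random.default_rng(1)
for _ in range(1000):
    u, v = rng.standard_normal(2)
    assert abs(np.sign(u) - np.sign(v)) <= 2 * 1 + 0 * abs(u - v)  # f = |.|
    assert abs(u - v) <= 2 * 0 + 1 * abs(u - v)                    # f = x^2/2
```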
3.1 A universal composite subgradient method
The U-CS method is a variant of the universal primal gradient method of [21], obtained by introducing a parameter . It can also be shown to be an instance of FSCO, and we thus establish its complexity using the analysis of Section 2. The method is described below.
U-CS
0. Let , , , and be given, and set and ;
1. Compute
2. If does not hold, then set and go to step 1; else, go to step 3;
3. Set , , , and go to step 1.
We now make some remarks about U-CS. First, no stopping criterion is added to it since our goal is to analyze its iteration-complexity for obtaining two types of approximate solutions, i.e., either a -solution (see Theorem 3.2 below) or a -stationary solution (see Theorem 3.3 below). Second, U-CS with and is exactly the universal primal gradient method analyzed in [21]. Hence, with the introduction of the damping parameter , U-CS can be viewed as a generalization of the method of [21].
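For intuition, here is a sketch of one U-CS-style iteration (our own simplification, not the paper's exact method): a composite prox-subgradient step whose stepsize is halved until an acceptance test passes. Since the precise test of step 2 is not reproduced here, we use a standard quadratic-upper-bound test on the linearization as a stand-in, and we specialize $h$ to a squared norm so the prox step has a closed form; ucs_step and mu_h are our own names.

```python
import numpy as np

# Sketch of a U-CS-style step (our own simplification).  The acceptance test
# below is a standard quadratic-upper-bound test used as a stand-in for the
# test in step 2 of U-CS; h is specialized to (mu_h/2)||.||^2 so that the
# prox subproblem has a closed-form solution.

def ucs_step(x, lam, f, f_prime, mu_h=1.0):
    """One backtracking prox-subgradient step for phi = f + (mu_h/2)||.||^2."""
    while True:
        g = f_prime(x)
        # prox step: argmin_u  <g, u - x> + (mu_h/2)||u||^2 + ||u - x||^2/(2 lam)
        x_new = (x - lam * g) / (1.0 + lam * mu_h)
        lin = f(x) + g @ (x_new - x)                       # l_f(x_new; x)
        if f(x_new) <= lin + (x_new - x) @ (x_new - x) / (2.0 * lam):
            return x_new, lam                              # accept, keep stepsize
        lam *= 0.5                                         # halve and retry

# toy usage with f(x) = ||x||^2/2, f'(x) = x
x, lam = np.ones(3), 10.0
for _ in range(20):
    x, lam = ucs_step(x, lam, lambda z: 0.5 * (z @ z), lambda z: z)
```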
The following result shows that U-CS is an instance of FSCO and that assumptions (F1) and (F2) of Section 2 are satisfied.
Proposition 3.1
The following statements hold for U-CS:
- a) is a non-increasing sequence;
- b) for every , we have
(46) (47) (48)
- c) U-CS is a special case of FSCO where:
  - i) and for every ;
  - ii) assumptions (F1) and (F2) are satisfied with given by (48) and from assumption (A3).
Proof: a) This statement directly follows from the description of U-CS.
b) Relations (46) and (47) directly follow from the description of U-CS. Using (45) with and the inequality for , we have
(49) |
Observe that implies that , and hence that
The above inequality and (49) thus imply that (47) holds with replaced by . This indicates that once is small enough, it remains unchanged. Hence, in view of the update scheme for in step 2 of U-CS, the lower bound in (48) holds.
c) Relations (46) and (47) are the analogues of relations (4) and (5) of FSCO with and as in (i). Inequality (48) shows that Assumption (F2) is satisfied with . Finally, since in the -th iteration of U-CS the model for is simply the linearization with for some (from Assumption (A3)), Assumption (F1) is satisfied.
We are now in a position to state the main result for the functional complexity of U-CS.
Theorem 3.2
Let be given and consider U-CS with , where is as in step 0 of U-CS. Then, the number of iterations of U-CS to generate an iterate satisfying is at most
(50) |
where
(51) |
Proof: Define
(52) |
Observe that for , and it follows from Proposition 3.1 that cannot be halved more than times. It follows from that , which together with implies that
Therefore, the second term on the right-hand side of (50) gives an upper bound on the number of iterations with backtracking of . We now provide a bound on the number of remaining iterations to obtain a -solution with U-CS. Theorem 2.3 (which can be applied since we have shown that U-CS is a special case of FSCO) gives for U-CS the upper bound
(53) |
on the number of the remaining iterations (where is not halved) required to find an -optimal solution of (1) with given by (48). Using the inequality
(54) |
the assumption that , and the definition of in (51), we conclude that . This observation and (53) thus imply that the first term in (50) is an upper bound on the number of the remaining iterations (where is not halved). This completes the proof.
We now make some comments about Theorem 3.2. First, Theorem 3.2 applies to any , and hence to the universal primal gradient method of [21]. In this case, if , then the strong convexity part of the bound in (50) is
(55) |
which is identical to the one in Proposition C.3 of [17]. Second, if and , then the complexity bound (50) becomes
(56) |
which is smaller than (55) whenever . For example, if and , then (56) is significantly smaller than (55). In summary, the performance of U-CS with the damping parameter depends on the strong convexity parameter of the overall objective function, and hence U-CS is potentially more universal than the universal primal gradient method of [21], whose complexity bound depends only on the strong convexity parameter of the composite function .
The following result states the stationarity complexity of U-CS.
Theorem 3.3
Proof: As in the proof of Theorem 3.2, the integer given by (52) gives an upper bound on the number of iterations in which is halved. Using and , we have that
which gives the second term on the right-hand side of (57). Next, using Theorem 2.9(b) (observe that this theorem can be applied to U-CS since we have for U-CS and since may be zero), the number of remaining iterations to satisfy (34) is upper bounded by
(58) |
where
with given by (48). Now, using the inequality (54), the assumption that , and the definition of in (51), we conclude that . This observation and (58) then imply that the first term in (57) is an upper bound on the number of the remaining iterations (where is not halved) required to satisfy (34). This completes the proof.
3.2 A universal proximal bundle method
This subsection describes the U-PB method and establishes its iteration complexities. More specifically, § 3.2.1 describes U-PB and states the main results, namely Theorems 3.6 and 3.7, and § 3.2.2 is devoted to the proof of a technical result (i.e., Proposition 3.4) that is crucial to the proof of the main results. Conditions (A1)-(A5) are assumed to hold in this subsection.
3.2.1 Description of U-PB and related complexity results
The U-PB method is an extension of the GPB method of [17]. In contrast to GPB, U-PB uses an adaptive stepsize and imposes a maximal number (which can be as small as one) of iterations per cycle. Similarly to U-CS, U-PB is an instance of FSCO, and we establish both functional and stationary complexities for it using the results of Section 2. Compared with the complexity results in [17], those obtained in this paper are sharper, since they are expressed in terms of instead of .
U-PB is based on the following bundle update (BU) blackbox which builds a model for on the basis of a previous model of and of a new linearization of . This blackbox BU is given below and takes as inputs a prox-center , a current approximate solution , an initial model for , and a stepsize .
BU
Inputs: and such that and
Find function such that
(59) |
where is as in (2) and is such that
(60) |
Output: .
In the following, we give two examples of BU, namely, the two-cuts and multiple-cuts schemes. Proofs that the two schemes comply with BU can be given along the lines of Appendix D of [17].
- (E1) two-cuts scheme (illustrated in the sketch after this list): We assume that is of the form where is an affine function satisfying . The scheme then sets and updates as , where satisfies
- (E2) multiple-cuts scheme: We assume that has the form where is the current bundle set and is defined as . This scheme selects the next bundle set so that where , and then outputs .
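The following sketch (our own rendering under an assumed data layout, with the aggregation weight theta passed in explicitly) illustrates the two-cuts scheme (E1): the two current affine cuts are aggregated into a single cut, and the linearization at the new iterate is added as the second cut.

```python
import numpy as np

# Two-cuts scheme (E1), sketched under our own data layout: an affine cut
# a(u) = <s, u> + b is stored as the pair (s, b), and the model of f is the
# pointwise maximum of the two stored cuts.  The aggregation weight theta,
# which (E1) obtains from the previous prox subproblem, is passed in here.

def linearization_cut(f_val, g, x):
    """Cut (s, b) representing l_f(u; x) = f(x) + <g, u - x>."""
    return g, f_val - g @ x

def two_cuts_update(cuts, theta, f_val_new, g_new, x_new):
    """Aggregate the two old cuts with weight theta; add the new linearization."""
    (s1, b1), (s2, b2) = cuts
    aggregated = (theta * s1 + (1 - theta) * s2, theta * b1 + (1 - theta) * b2)
    return aggregated, linearization_cut(f_val_new, g_new, x_new)

def model_value(cuts, u):
    """Evaluate the two-cut model of f at u."""
    return max(s @ u + b for s, b in cuts)
```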
Before giving the motivation of U-PB, we briefly review the GPB method of [17]. GPB is an inexact proximal point method (PPM, with fixed stepsize) in that, given a prox-center and a prox stepsize , it computes the next prox-center as a suitable approximate solution of the prox subproblem
(61) |
More specifically, a sequence of prox bundle subproblems of the form
(62) |
where is a bundle approximation of , is solved until for the first time an iterate as in (62) approximately solves (61), and such is then set to be . The bundle approximation is sequentially updated, for example, according to either one of the schemes (E1) and (E2) described above.
U-PB is also an inexact PPM but with variable prox stepsizes (i.e., with in (61) replaced by ) instead of a constant one as in GPB. Given iteration upper limit and prox-center , it adaptively computes as follows: starting with , it solves at most prox subproblems of the form (62) in an attempt to obtain an approximate solution of (61) and, if it fails, repeats this procedure with divided by ; otherwise, it sets to be the first successful and as described in the previous paragraph.
U-PB is given below.
U-PB
0. Let , , , , and integer be given, and set , , , and . Find such that ;
1. Compute as in (62);
2. Choose such that
(63)
and set and
(64)
3. If and then
   perform a null update, i.e., set ;
else
   if and
      perform a reset update, i.e., set ;
   else (i.e., and )
      perform a serious update, i.e., set , , , , and ;
   end if
   set and find such that ;
end if
4. Set and go to step 1.
We now give further explanation about U-PB. U-PB performs three types of iterations, namely, null, reset, and serious, corresponding to the types of updates performed at the end. A reset (resp., serious) cycle of U-PB consists of a reset (resp., serious) iteration and all the consecutive null iterations preceding it. The index counts the total iterations including null, reset, and serious ones. The index counts the serious cycles which, together with the quantities , , and computed at the end of cycle , is used to cast U-PB as an instance of FSCO. All iterations within a cycle are referred to as inner iterations. The quantity counts the number of inner iterations performed in the current cycle. Each cycle of U-PB performs at most iterations. A serious cycle successfully finds within iterations, while a reset cycle fails to do so. In both cases, U-PB resets the counter to and starts a new cycle. The differences between the two cases are: 1) the stepsize is halved at the end of a reset cycle, while it is kept as is at the end of a serious cycle; and 2) the prox-center is kept the same at the end of a reset cycle, but it is updated to the latest at the end of a serious cycle.
We now make some remarks about U-PB. First, it follows from the fact that and the definition of in (64) that the primal gap of the prox subproblem in (61) is upper bounded by . Hence, if , then is an -solution of (61) where . Second, the GPB method of [17] (resp., [15]) computes using (63) with (resp., ). In the case where , it can be easily seen that is an upper bound on the primal gap of the prox subproblem in (61), and hence that is an -solution of (61) if . Third, the iterate computed in step 2 of U-PB satisfies
(65) |
where denotes the first iteration index of the cycle containing . In other words, is the best point in terms of among all the points obtained in the course of solving (62) and the point obtained at the end of the previous cycle.
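The cycle structure just described can be summarized by the following schematic outer loop (our own rendering; solve_prox_subproblem and cycle_succeeded are hypothetical placeholders for the computations in steps 1-3 of U-PB). It shows the three update types: null (continue the cycle), serious (move the prox-center, keep the stepsize), and reset (keep the prox-center, halve the stepsize).

```python
# Schematic U-PB outer loop (our own rendering; the two callables are
# hypothetical placeholders for the computations in steps 1-3 of U-PB).

def upb(x0, lam0, K, solve_prox_subproblem, cycle_succeeded, max_cycles=1000):
    xc, lam = x0, lam0                         # prox-center and prox stepsize
    for _ in range(max_cycles):
        bundle, inner = None, 0                # fresh cycle
        while True:
            inner += 1
            x, bundle = solve_prox_subproblem(xc, lam, bundle)  # step 1 + BU
            if cycle_succeeded(x, xc, lam):
                xc = x                         # serious update: new prox-center
                break
            if inner >= K:                     # cycle exceeded the inner cap
                lam *= 0.5                     # reset update: halve the stepsize
                break
            # otherwise: null update; continue the cycle with the same center
    return xc
```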
The next proposition shows some useful relations about the sequences , and generated at the end of the serious cycles of U-PB. This proposition will be proved in § 3.2.2.
Proposition 3.4
Define
(66) |
The following statements hold for U-PB:
- a) every cycle in U-PB has at most inner iterations;
- b) each stepsize generated by U-PB satisfies
(67)
- c) the number of reset cycles is upper bounded by
(68)
- d) for a serious cycle, the pair of U-PB satisfies
(69) (70)
The following result shows that serious iterations of U-PB generate sequences , , , and satisfying the requirements of FSCO.
Proposition 3.5
b) By Assumption (A3) we have that and therefore Assumption (F1) of FSCO is satisfied for U-PB. Assumption (F2) is given for U-PB in Proposition 3.4(b).
We now state the first main result of this subsection where the functional iteration complexity of U-PB is established.
Theorem 3.6
Given a tolerance , consider U-PB with , where is as in step 0 of U-PB. Let and be the sequences generated by U-PB. Then, the number of iterations of U-PB to generate an iterate satisfying is at most
(71) |
where
Proof: Since every serious cycle of U-PB has at most inner iterations (by Proposition 3.4(a)), the complexity of serious cycles of U-PB is that of FSCO, given by (19) in Theorem 2.3, multiplied by (observe that the functional complexity of FSCO can be applied to U-PB since we have shown in Proposition 3.5 that serious iterates of U-PB follow the FSCO framework). In this expression, by Proposition 3.4(b), we can bound from above by which gives the first term in (71). By Proposition 3.4(c), the number of reset cycles is at most (68) which is bounded from above by
and each of these cycles also has at most inner iterations (by Proposition 3.4(a)). This gives the second term in (71) and the result follows.
For the case where is not too small and is neither close to one nor to zero, the complexity result of Theorem 3.6 reduces to
(72)
which matches the functional complexity (56) of U-CS. For the case where , the above complexity is optimal up to logarithmic terms.
We now state the second main result of this subsection where the stationary iteration complexity of U-PB is established.
Theorem 3.7
For a given tolerance pair , U-PB with
(73) |
generates a triple satisfying (34) in at most
(74) |
iterations where
(75) |
Proof: The complexity of the serious cycles of U-PB to satisfy the stationarity conditions (34) is that of FSCO, given by (35) in Theorem 2.8, multiplied by . In this expression,
where
(76) |
Using the expression (73) for , we can bound from above by , which gives the first term in (74).
and each of these cycles also has at most inner iterations (by Proposition 3.4(a)). This gives the second term in (74) and the result follows.
Theorem 3.7 does not hold for since the definitions of and in (75) depend on . However, if and the domain of is bounded with diameter , then we can apply Theorem 2.9(b) to conclude that U-PB with generates a triple satisfying (34) within a number of iterations bounded by (35), but with now given by (43) and bounded from above by , where
3.2.2 Proof of Proposition 3.4
To prove Proposition 3.4, we first state Lemmas 3.8 and 3.9. Lemma 3.8 will be used to show Lemma 3.9. Lemma 3.9 plays an important role in the analysis of the null iterates and establishes a key recursive formula for the sequence defined in (64).
Lemma 3.8
For the -th iteration of a cycle with prox stepsize and prox-center , define
(77) |
If is not the last iteration of the cycle, then
(78) |
where
(79) |
Proof: It follows from the definitions of in (79) that
(80) |
Using the definition of in (77), and relations (99) and (101) with , we have
Lemma 3.9
The following statements about U-PB hold:
- a)
- b) if is the first iteration of the cycle and , then
(82)
Proof: a) In what follows, we use the notation
(83) |
It follows from the above notation and the definitions of and in (64) and (77), respectively, that
(84) |
Using (45) with and the fact that , we have
(85) |
This inequality, the definition of in (83), and relation (78), imply that
(86) |
where the last inequality follows from the fact that and the inequality with and . Using relations (84) and (86), we conclude that
and hence that (81) holds.
b) Using relations (77), (83), and (84), we have
where the last equality follows from the facts that and . It follows from U-PB that if is the first iteration of the cycle, then
(87) |
Combining the above two inequalities and using (45), we obtain
Maximizing the right-hand side with respect to , we deduce ; then, using the fact that , we obtain (82).
Before presenting the proof of Proposition 3.4, we also need the following technical lemma.
Lemma 3.10
Assume that the prox stepsize of some cycle of U-PB is such that where
(88) |
Then, this cycle must be a serious one.
Proof: Let denote the first iteration of a cycle whose prox stepsize satisfies (88). It suffices to prove that where is the last iteration of the cycle. We consider two cases: and . If , using Lemma 3.9(b) (which can be applied since ) and (88), we have
which shows that the cycle is a serious one (with one iteration only). Let us now consider the case . Lemma 3.9 implies that
(89) |
It then suffices to show that the right-hand side of (89) is bounded above by , or equivalently, that
(90) |
Using (88), we have
(91) |
where the last inequality above is easily checked for . Using (88), (79), and (91), we have
which, because of the inequality , implies inequality (90).
We are now ready to prove Proposition 3.4.
Proof of Proposition 3.4. a) This statement immediately follows from the description of U-PB.
b) By Lemma 3.10, if for a cycle we have with given in (88), then the cycle ends with a serious step, is kept unchanged, and all subsequent cycles are serious. Therefore, if we have for all (where the last inequality can be easily checked using the definitions of and ). Otherwise, if , we cannot have for some , i.e., for all .
c) This statement immediately follows from (b) and the update rule of in U-PB.
4 Concluding remarks
In this paper, we present two -universal methods, namely U-CS and U-PB, to solve HCO (1). We propose FSCO to analyze both methods in a unified manner and establish both functional and stationary complexity bounds. We then prove that both U-CS and U-PB are instances of FSCO and apply the complexity bounds for FSCO to obtain iteration complexities for the two methods. One interesting property of our proposed methods is that they do not rely on any restart scheme based on estimating or knowing .
Some papers about universal methods (see for example [10, 21]) assume that, for some , in (1) has -Hölder continuous gradient, i.e., there exists such that for every . It is shown in [21] that the universal primal gradient method proposed in it (i.e., the U-CS method with ) finds a -solution of (1) in
(92) |
iterations. This result also follows as a consequence of our results in this paper. Indeed, first note that the dominant term in the iteration complexity (50) for the U-CS method is . Second, Proposition 2.1 of [17] implies that there exists a pair as in (A5) and that the -best pair defined below (45) satisfies
Hence, it follows from these two observations that (50) is sharper than bound (92) obtained in [21].
We finally discuss some possible extensions of the analysis in this paper. First, it is shown in Theorems 3.2 and 3.3 (resp., Theorem 3.6) that U-CS (resp., U-PB) is -universal if and is -universal if . It would be interesting to investigate whether they are also -universal for too. Note that this question is related to whether the universal primal gradient method of [21] (which is the same as U-CS with ) is -universal. Finally, it would also be interesting to study whether the general results obtained for the FSCO framework can be used to show that other methods for solving the HCO problem (1) are -universal.
References
- [1] T. Alamo, P. Krupa, and D. Limon. Gradient based restart FISTA. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3936–3941. IEEE, 2019.
- [2] T. Alamo, P. Krupa, and D. Limon. Restart of accelerated first-order methods with linear convergence under a quadratic functional growth condition. IEEE Transactions on Automatic Control, 68(1):612–619, 2022.
- [3] T. Alamo, D. Limon, and P. Krupa. Restart FISTA with global linear convergence. In 2019 18th European Control Conference (ECC), pages 1969–1974. IEEE, 2019.
- [4] J.-F. Aujol, L. Calatroni, C. Dossal, H. Labarrière, and A. Rondepierre. Parameter-free FISTA by adaptive restart and backtracking. arXiv preprint arXiv:2307.14323, 2023.
- [5] J.-F. Aujol, C. Dossal, and A. Rondepierre. FISTA is an automatic geometrically optimized algorithm for strongly convex functions. Mathematical Programming, pages 1–43, 2023.
- [6] J.-F. Aujol, C. H. Dossal, H. Labarrière, and A. Rondepierre. FISTA restart using an automatic estimation of the growth parameter. Preprint, 2022.
- [7] Y. Du and A. Ruszczyński. Rate of convergence of the bundle method. Journal of Optimization Theory and Applications, 173(3):908–922, 2017.
- [8] O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA Journal of Numerical Analysis, 39(4):2069–2095, 2019.
- [9] B. Grimmer. On optimal universal first-order methods for minimizing heterogeneous sums. Optimization Letters, pages 1–19, 2023.
- [10] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Mathematical Programming, 149(1-2):1–45, 2015.
- [11] G. Lan and R. D. C. Monteiro. Iteration-complexity of first-order penalty methods for convex programming. Mathematical Programming, 138(1):115–139, 2013.
- [12] G. Lan, Y. Ouyang, and Z. Zhang. Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization. arXiv: 2310.12139, 2023.
- [13] T. Li and G. Lan. A simple uniformly optimal method without line search for convex optimization. arXiv:2310.10082, 2023.
- [14] J. Liang and R. D. C. Monteiro. An average curvature accelerated composite gradient method for nonconvex smooth composite optimization problems. SIAM Journal on Optimization, 31(1):217–243, 2021.
- [15] J. Liang and R. D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. SIAM Journal on Optimization, 31(4):2955–2986, 2021.
- [16] J. Liang and R. D. C. Monteiro. Average curvature fista for nonconvex smooth composite optimization problems. Computational Optimization and Applications, 86(1):275–302, 2023.
- [17] J. Liang and R. D. C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 49(2):832–855, 2024.
- [18] J. G. Melo, R. D. C. Monteiro, and H. Wang. A proximal augmented lagrangian method for linearly constrained nonconvex composite optimization problems. Journal of Optimization Theory and Applications, pages 1–33, 2023.
- [19] K. Mishchenko and Y. Malitsky. Adaptive gradient descent without descent. In 37th International Conference on Machine Learning (ICML 2020), 2020.
- [20] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
- [21] Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381–404, 2015.
- [22] B. O’donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15:715–732, 2015.
- [23] J. Renegar and B. Grimmer. A simple nearly optimal restart scheme for speeding up first-order methods. Foundations of Computational Mathematics, 22(1):211–256, 2022.
- [24] V. Roulet and A. d’Aspremont. Sharpness, restart and acceleration. SIAM Journal on Optimization, 30:262–289, 2020.
- [25] D. Zhou, S. Ma, and J. Yang. AdaBB: Adaptive Barzilai-Borwein method for convex optimization. arXiv:2401.08024, 2024.
Appendix A Technical results
A.1 Technical results for FSCO
Lemma A.1
Assume that sequences , , and satisfy for every , and
(93) |
for some , and . Then, the following statements hold:
- a) for every ,
(94)
- b) if the sequence is nonnegative, then for every ,
(95)
- c) if the sequence is nonnegative, then for every such that
with the convention that the first term is equal to the second term when . (Note that the first term converges to the second term as .)
Proof: a) Multiplying (93) by and summing the resulting inequality from to , we have
(96) |
Inequality (94) follows immediately from the above inequality.
b) This statement follows immediately from (96) and the fact that .
A.2 Technical results for U-PB
Lemma A.2
The following statements hold for a cycle of U-PB with stepsize :
- a) for every iteration that is not the last iteration of the cycle, there exists a function such that
(99)
(100)
where is as in (79);
- b) for every iteration of the cycle and , we have
(101)