Parameter-free proximal bundle methods with adaptive stepsizes for hybrid convex composite optimization problems
Abstract
This paper develops a parameter-free adaptive proximal bundle method with two important features: 1) an adaptive choice of variable prox stepsizes that "closely fits" the instance under consideration; and 2) an adaptive criterion that makes the occurrence of serious steps easier. Computational experiments show that our method performs substantially fewer consecutive null steps (i.e., has shorter cycles) while keeping the number of serious steps under control. As a result, our method performs significantly fewer iterations than its counterparts based on a constant prox stepsize choice and a non-adaptive cycle termination criterion. Moreover, our method is very robust with respect to the user-provided initial stepsize.
Key words. hybrid convex composite optimization, iteration-complexity, adaptive stepsize, parameter-free proximal bundle methods.
AMS subject classifications. 49M37, 65K05, 68Q25, 90C25, 90C30, 90C60
1 Introduction
Let $f, h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper lower semicontinuous convex functions such that $\operatorname{dom} h \subseteq \operatorname{dom} f$, and consider the optimization problem
$$\phi^* := \min \left\{ \phi(x) := f(x) + h(x) : x \in \mathbb{R}^n \right\}. \tag{1}$$
It is said that (1) is a hybrid convex composite optimization (HCCO) problem if there exist nonnegative scalars $(M_f, L_f)$ and a first-order oracle $f'(\cdot)$ (i.e., $f'(x) \in \partial f(x)$ for every $x \in \operatorname{dom} h$) satisfying $\|f'(x) - f'(y)\| \le 2M_f + L_f \|x - y\|$ for every $x, y \in \operatorname{dom} h$. The main goal of this paper is to study the complexity of an adaptive proximal bundle method (Ad-GPB) for solving the HCCO problem (1) based on a unified bundle update scheme.
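To make the oracle condition concrete, the following sketch (all helper names are illustrative, not from the paper) builds first-order oracles for two standard HCCO instances: a nonsmooth one for which the bound holds with $L_f = 0$, and a smooth one for which it holds with $M_f = 0$.

```python
import numpy as np

def make_hcco_oracles(A, b):
    """Sketch of two HCCO first-order oracles (names are illustrative).

    Nonsmooth: f(x) = ||Ax - b||, which satisfies the oracle bound with
               M_f = ||A||_2 and L_f = 0.
    Smooth:    f(x) = 0.5*||Ax - b||^2, which satisfies it with
               M_f = 0 and L_f = ||A||_2^2.
    """
    def f_nonsmooth(x):
        return np.linalg.norm(A @ x - b)

    def fprime_nonsmooth(x):            # a subgradient of ||A. - b|| at x
        r = A @ x - b
        nr = np.linalg.norm(r)
        return A.T @ r / nr if nr > 0 else np.zeros(A.shape[1])

    def f_smooth(x):
        return 0.5 * np.linalg.norm(A @ x - b) ** 2

    def fprime_smooth(x):               # the gradient of f at x
        return A.T @ (A @ x - b)

    return (f_nonsmooth, fprime_nonsmooth), (f_smooth, fprime_smooth)
```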
Proximal bundle (PB) methods solve a sequence of prox bundle subproblems
$$x = \operatorname*{argmin}_{u \in \mathbb{R}^n} \left\{ \Gamma(u) + \frac{1}{2\lambda} \|u - x_c\|^2 \right\}, \tag{2}$$
where $\Gamma$ is a bundle approximation of $\phi$ (i.e., a simple convex function underneath $\phi$) and $x_c$ is the current prox center. The prox center is updated to $x$ (i.e., a serious step is performed) only when the pair $(x_c, x)$ satisfies a certain error criterion; otherwise, the prox center is kept the same (i.e., a null step is performed). Regardless of the step performed, the bundle is updated to account for the newest iterate $x$. In the discussion below, a sequence of consecutive null steps followed by a serious step is referred to as a cycle. Classical PB methods (see e.g. [5, 6, 12, 15, 16, 21, 27]) perform the serious step when $x$ satisfies a relaxed descent condition (e.g., see the paragraph containing equation (15) in [18]), which in its unrelaxed form implies that $\phi(x) \le m$, where $m$ denotes the optimal value of (2). On the other hand, modern PB methods (see e.g. [8, 18, 19, 20]) perform the serious step when the best $\phi$-valued iterate of the cycle, say $y$, satisfies $\phi(y) - m \le \delta$, where $m$ is the optimal value of (2) and $\delta > 0$ is a suitably chosen tolerance. Although $y$ does not necessarily satisfy the descent condition, it does satisfy a $\delta$-relaxed version of it. It is shown in [18, 19] that modern PB methods with $\lambda_k = \lambda$ for every $k$, where $\lambda$ lies in a suitable range depending on $\varepsilon$, achieve an $\mathcal{O}(\varepsilon^{-2})$ iteration complexity to obtain an $\varepsilon$-solution regardless of whether $\operatorname{dom} h$ is bounded or not. In contrast, papers [5, 12] show that the classical PB methods achieve: i) an $\mathcal{O}(\varepsilon^{-3})$ iteration complexity under a constant prox stepsize choice, regardless of whether $\operatorname{dom} h$ is bounded or not; and ii) an $\mathcal{O}(\varepsilon^{-2})$ iteration complexity, under a suitable stepsize choice, for the case where $\operatorname{dom} h$ is bounded.
The goal of this paper is to develop a parameter-free adaptive modern PB method, namely Ad-GPB, with two important features: 1) an adaptive choice of variable prox stepsizes that "closely fits" the instance under consideration; and 2) an adaptive criterion that makes the occurrence of serious steps easier. Computational experiments show that Ad-GPB performs substantially fewer consecutive null steps while keeping the number of serious steps under control. As a result, Ad-GPB performs significantly fewer iterations than the modern PB methods of [8, 18, 19]. Moreover, in contrast to GPB, Ad-GPB is very robust with respect to the user-provided initial stepsize.
Several papers (see e.g. [2, 4, 5, 11, 13, 17], of which only [5] deals with complexity analysis) have proposed ways of generating variable prox stepsizes to improve the computational performance of classical PB methods. More recently, [8] developed a modern PB method for solving either the convex or strongly convex version of (1) which requires no knowledge of the Lipschitz parameters $(M_f, L_f)$ or the strong convexity parameter of $\phi$, and allows the stepsize to change only at the beginning of each cycle. A potential drawback of the method of [8] is that it can restart a cycle with its current initial prox stepsize divided by two if that stepsize is found to be too large, i.e., the method can backtrack. In contrast, by allowing the prox stepsizes to vary within a cycle, Ad-GPB never has to restart a cycle.
In theory, classical PB methods perform substantially more consecutive null iterations per cycle than modern PB methods do in the worst case. The explanation for this phenomenon is the more relaxed $\delta$-criterion used by modern PB methods to end a cycle. Our Ad-GPB method pursues the idea of further relaxing the cycle termination criterion so as to reduce the overall number of iterations, and hence to improve computational performance, while retaining all the theoretical guarantees of the modern PB methods of [8, 18, 19]. More specifically, under the simplifying assumption that $\phi^*$ is known, an Ad-GPB cycle stops when, for some universal constant $\beta \in (0,1)$, the inequality $\phi(y) - m \le \delta + \beta \, [\phi(y) - \phi^*]$ is satisfied. The addition of the (usually large) term $\beta \, [\phi(y) - \phi^*]$ makes this inequality easier to satisfy, thereby resulting in Ad-GPB performing shorter cycles. Even though the previous observation assumes that $\phi^*$ is known, Ad-GPB removes this assumption, at the expense of assuming that the domain of $h$ is bounded, by replacing $\phi^*$ in the above inequality with a suitable lower bound on it.
Organization of the paper. Subsection 1.1 presents basic definitions and notation used throughout the paper. Section 2 contains two subsections. Subsection 2.1 formally describes problem (1) and the assumptions made on it. Subsection 2.2 presents a generic bundle update scheme, the Ad-GPB framework, and states the main iteration-complexity result for Ad-GPB. Section 3 provides a bound on the number of iterations within a cycle of Ad-GPB. Section 4 contains two subsections. The first (resp., second) one establishes bounds on the number of cycles and the total number of iterations performed by Ad-GPB under the assumption that $\phi^*$ is known (resp., unknown). Section 5 presents numerical results comparing Ad-GPB with the two-cut bundle update scheme against other modern PB methods.
1.1 Basic definitions and notation
The sets of real numbers and positive real numbers are denoted by $\mathbb{R}$ and $\mathbb{R}_{++}$, respectively. Let $\mathbb{R}^n$ denote the standard $n$-dimensional Euclidean space equipped with inner product and norm denoted by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$, respectively. Let $\log(\cdot)$ denote the natural logarithm and $\log^+(\cdot)$ denote $\max\{\log(\cdot), 0\}$. Let $\mathcal{O}(\cdot)$ denote the standard big-O notation.
For a given function $\psi : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, let $\operatorname{dom} \psi := \{x \in \mathbb{R}^n : \psi(x) < +\infty\}$ denote the effective domain of $\psi$; $\psi$ is proper if $\operatorname{dom} \psi \neq \emptyset$. A proper function $\psi$ is convex if
$$\psi(\alpha x + (1-\alpha) y) \le \alpha \psi(x) + (1-\alpha) \psi(y)$$
for every $x, y \in \operatorname{dom} \psi$ and $\alpha \in [0,1]$. Denote the set of all proper lower semicontinuous convex functions by $\overline{\operatorname{Conv}}(\mathbb{R}^n)$.
The subdifferential of $\psi$ at $x \in \operatorname{dom} \psi$ is denoted by
$$\partial \psi(x) := \left\{ s \in \mathbb{R}^n : \psi(u) \ge \psi(x) + \langle s, u - x \rangle \ \text{ for every } u \in \mathbb{R}^n \right\}. \tag{3}$$
The set of proper closed convex functions $\Gamma$ such that $\Gamma \le \phi$ is denoted by $\mathcal{B}(\phi)$, and any such $\Gamma$ is called a bundle for $\phi$.
The sign function is defined as $\operatorname{sgn}(t) := 1$ if $t > 0$, $\operatorname{sgn}(t) := -1$ if $t < 0$, and $\operatorname{sgn}(0) := 0$.
2 Main problem and algorithm
This section contains two subsections. The first one describes the main problem and corresponding assumptions. The second one presents the motivation and the description of Ad-GPB, as well as its main complexity result.
2.1 Main problem
The problem of interest in this paper is (1), which is assumed to satisfy the following conditions for some constants $M_f, L_f \ge 0$ and $D > 0$:

(A1) $h \in \overline{\operatorname{Conv}}(\mathbb{R}^n)$ and there exists $D > 0$ such that
$$\sup \left\{ \|x - y\| : x, y \in \operatorname{dom} h \right\} \le D; \tag{4}$$

(A2) $f \in \overline{\operatorname{Conv}}(\mathbb{R}^n)$ is such that $\operatorname{dom} h \subseteq \operatorname{dom} f$, and a subgradient oracle, i.e., a function $f' : \operatorname{dom} h \to \mathbb{R}^n$ satisfying $f'(x) \in \partial f(x)$ for every $x \in \operatorname{dom} h$, is available;

(A3) for every $x, y \in \operatorname{dom} h$,
$$\|f'(x) - f'(y)\| \le 2M_f + L_f \|x - y\|.$$
In addition to the above assumptions, it is also assumed that $h$ is simple in the sense that, for any $\lambda > 0$, $b \in \mathbb{R}^n$, and affine function $A(\cdot)$, the following two optimization problems
$$\min_{u \in \mathbb{R}^n} \left\{ A(u) + h(u) \right\}, \qquad \min_{u \in \mathbb{R}^n} \left\{ A(u) + h(u) + \frac{1}{2\lambda} \|u - b\|^2 \right\} \tag{5}$$
are easy to solve.
We now make three remarks about assumptions (A1)-(A3). First, it can be shown that (A1) implies that both problems in (5) have optimal solutions. Second, it can also be shown that (A1) implies that the set of optimal solutions of problem (1) is nonempty. Third, letting $\ell_f(\cdot\,; x)$ denote the linearization of $f$ at $x \in \operatorname{dom} h$, i.e.,
$$\ell_f(u; x) := f(x) + \langle f'(x), u - x \rangle \quad \text{for every } u \in \mathbb{R}^n, \tag{6}$$
it is well-known that (A3) implies that for every $x, u \in \operatorname{dom} h$,
$$f(u) - \ell_f(u; x) \le 2M_f \|u - x\| + \frac{L_f}{2} \|u - x\|^2. \tag{7}$$
Finally, define the composite linearization of the objective $\phi$ of (1) at $x \in \operatorname{dom} h$ as
$$\ell_\phi(\cdot\,; x) := \ell_f(\cdot\,; x) + h(\cdot). \tag{8}$$
2.2 Algorithm
As mentioned in the Introduction, PB methods use a bundle (convex) function $\Gamma$ underneath $\phi$ to construct subproblem (2) at a given iteration, and then update $\Gamma$ to obtain the bundle function for the next iteration. This subsection describes ways of performing this update. Instead of focusing on a specific bundle update scheme, it describes a generic bundle update framework (BUF), which is a restricted version of the one introduced in Subsection 3.1 of [19]. It also discusses two concrete bundle update schemes lying within this framework.
We start by describing the generic BUF.
BUF

Input: $\lambda > 0$ and $(x_c, x, \Gamma) \in \mathbb{R}^n \times \mathbb{R}^n \times \mathcal{B}(\phi)$ such that
$$x = \operatorname*{argmin}_{u \in \mathbb{R}^n} \left\{ \Gamma(u) + \frac{1}{2\lambda} \|u - x_c\|^2 \right\}; \tag{9}$$

• find a bundle $\Gamma^+ \in \mathcal{B}(\phi)$ satisfying
$$\Gamma^+ \ge \max \left\{ \bar\Gamma, \, \ell_\phi(\cdot\,; x) \right\} \tag{10}$$
for some $\bar\Gamma \in \mathcal{B}(\phi)$ such that
$$\bar\Gamma(x) = \Gamma(x), \qquad x = \operatorname*{argmin}_{u \in \mathbb{R}^n} \left\{ \bar\Gamma(u) + \frac{1}{2\lambda} \|u - x_c\|^2 \right\}. \tag{11}$$

Output: $\Gamma^+$.
Now we make some remarks about BUF. First, observe that if $\Gamma \in \mathcal{B}(\phi)$ and $x \in \operatorname{dom} h$, then $\ell_\phi(\cdot\,; x) \le \phi$. Hence, it follows from (10) and the definition of $\mathcal{B}(\phi)$ that the output $\Gamma^+$ of BUF satisfies
$$\max \left\{ \bar\Gamma, \, \ell_\phi(\cdot\,; x) \right\} \le \Gamma^+ \le \phi.$$
Second, the bundle update framework of [19] replaces (10) by the weaker inequality $\Gamma^+ \ge \tau \bar\Gamma + (1-\tau) \ell_\phi(\cdot\,; x)$ for some $\tau \in [0,1)$ and, as a result, contains the one-cut bundle update scheme described in Subsection 3.1 of [19]. Even though BUF does not include the one-cut bundle update scheme, it contains the two other bundle update schemes discussed in [19] (see Subsection 3.1), which for convenience are briefly described below.
• 2-cut: For this scheme, it is assumed that $\Gamma$ has the form
$$\Gamma = \max \left\{ A_f, \, \ell_f(\cdot\,; x^-) \right\} + h, \tag{12}$$
where $x^- \in \operatorname{dom} h$ and $A_f$ is an affine function satisfying $A_f + h \le \phi$. In view of (9), it can be shown that there exists $\theta \in [0,1]$ such that
$$\frac{x_c - x}{\lambda} \in \partial \left( \theta A_f + (1-\theta) \ell_f(\cdot\,; x^-) + h \right)(x), \tag{13}$$
$$\theta A_f(x) + (1-\theta) \ell_f(x; x^-) + h(x) = \Gamma(x). \tag{14}$$
The scheme then sets
$$A_f^+ := \theta A_f + (1-\theta) \ell_f(\cdot\,; x^-) \tag{15}$$
and outputs the function $\Gamma^+$ defined as
$$\Gamma^+ := \max \left\{ A_f^+, \, \ell_f(\cdot\,; x) \right\} + h. \tag{16}$$

• multiple-cut (M-cut): Suppose $\Gamma = \Gamma_C$, where $C \subseteq \operatorname{dom} h$ is a finite set (i.e., the current bundle set) and $\Gamma_C$ is defined as
$$\Gamma_C(\cdot) := \max \left\{ \ell_f(\cdot\,; c) : c \in C \right\} + h(\cdot). \tag{17}$$
This scheme chooses the next bundle set $C^+$ so that
$$C(x) \cup \{x\} \subseteq C^+ \subseteq C \cup \{x\}, \tag{18}$$
where
$$C(x) := \left\{ c \in C : \ell_f(x; c) + h(x) = \Gamma(x) \right\}, \tag{19}$$
and then outputs $\Gamma^+ = \Gamma_{C^+}$.
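The set logic of the M-cut update is easy to implement once the bundle is stored as a list of linearizations. The sketch below (with illustrative names, numpy vectors for the slopes, and a numerical tolerance standing in for exact tightness in (19)) returns a next bundle set satisfying (18).

```python
def m_cut_update(cuts, x, f_x, g_x, h_x, tol=1e-12):
    """One M-cut bundle update, as a sketch.

    cuts : list of (g, b) pairs encoding linearizations u -> b + <g, u>,
           so that Gamma_C(u) = max over cuts + h(u) as in (17).
    Keeps the cuts that are tight at x (within tol), as required by (19),
    and appends the new linearization ell_f(.; x), as required by (18).
    """
    gamma_x = max(b + g @ x for (g, b) in cuts) + h_x   # Gamma(x)
    active = [(g, b) for (g, b) in cuts
              if b + g @ x + h_x >= gamma_x - tol]      # the set C(x)
    new_cut = (g_x, f_x - g_x @ x)                      # ell_f(.; x)
    return active + [new_cut]
```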
The following facts, whose proofs can be found in Appendix D of [19], imply that the 2-cut and M-cut schemes are special implementations of BUF:

(a) in the 2-cut scheme, the function $\bar\Gamma := A_f^+ + h$ belongs to $\mathcal{B}(\phi)$ and satisfies (11), and hence the output (16) satisfies (10);

(b) in the M-cut scheme, the function $\bar\Gamma := \Gamma_{C(x)}$ belongs to $\mathcal{B}(\phi)$ and satisfies (11), and hence the output $\Gamma_{C^+}$ satisfies (10).
We next give an outline of Ad-GPB. Ad-GPB is an inexact proximal point method which, given the $(k-1)$-th prox center $\hat x_{k-1}$, finds a pair $(\lambda, \hat x_k)$ of prox stepsize and $k$-th prox center satisfying a suitable error criterion for $\hat x_k$ being an approximate solution of the prox subproblem
$$\min_{u \in \mathbb{R}^n} \left\{ \phi(u) + \frac{1}{2\lambda} \|u - \hat x_{k-1}\|^2 \right\}. \tag{20}$$
More specifically, Ad-GPB solves a sequence of prox bundle subproblems
$$x_j = \operatorname*{argmin}_{u \in \mathbb{R}^n} \left\{ \Gamma_j(u) + \frac{1}{2\lambda_j} \|u - \hat x_{k-1}\|^2 \right\}, \tag{21}$$
where $\Gamma_j$ is a bundle approximation of $\phi$ and $\lambda_j$ is an adaptively chosen prox stepsize, until a pair $(\lambda_j, x_j)$ satisfies the approximate error criterion for (20). In contrast to the GPB method of [19], which can also be viewed in the setting outlined above, Ad-GPB: i) (adaptively) changes $\lambda_j$ while computing the next prox center $\hat x_k$; and ii) stops the search for the next prox center using a termination criterion based not only on the user-provided tolerance (the quantity $\varepsilon$ in the description below), as GPB also does, but also on a suitable primal-dual gap for (20), a feature that considerably speeds up the computation of $\hat x_k$ for many subproblems (20).
We now formally describe Ad-GPB. Its description uses the definition of the set $\mathcal{B}(\phi)$ of bundles for the function $\phi$ given in Subsection 1.1.
Ad-GPB

0. Let , , , , and be given; find such that and set , , , , and ;

1. compute as in (21) and the quantities in (22)-(24);

2. if the cycle-stopping criterion is violated, then perform a null update, i.e.:
   if either or (25) holds, then set ; else, set ;
   set ;
else perform a serious update, i.e.:
   set ;
   compute (26) and choose ;
   if , output and stop; else compute (27) and (28); if , then set ; else set ;
   set and find such that ;
   set and ;
end if

3. set and go to step 1.
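The overall control flow of Ad-GPB can be summarized by the following schematic sketch (illustrative only: the callbacks `cycle_stop` and `stepsize_ok` stand in for the cycle-stopping criterion of step 2 and for test (25), respectively, and the bookkeeping of steps 0-3 is omitted).

```python
def ad_gpb_skeleton(x0, lam0, solve_subproblem, update_bundle,
                    cycle_stop, stepsize_ok, done, max_iter=100_000):
    """Schematic serious/null loop of Ad-GPB (not the exact method).

    solve_subproblem(bundle, xc, lam) returns the next iterate as in (21);
    cycle_stop(..) stands in for the criterion in the first line of step 2;
    stepsize_ok(..) stands in for test (25): when it fails (a "bad"
    iteration) the stepsize is halved, otherwise it is kept (a "good" one).
    """
    xc, lam = x0, lam0
    bundle = update_bundle(None, x0)        # callback initializes the bundle
    for _ in range(max_iter):
        x = solve_subproblem(bundle, xc, lam)
        if not cycle_stop(x, xc, lam):      # null update
            if not stepsize_ok(x, xc, lam):
                lam /= 2.0                  # bad iteration: halve stepsize
        else:                               # serious update: new prox center
            xc = x
            if done(xc):
                break
        bundle = update_bundle(bundle, x)   # both step types update the bundle
    return xc
```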
We now introduce some terminology related to Ad-GPB. Ad-GPB performs two types of iterations, namely, null and serious, corresponding to the kind of update performed at their end. The index $j$ counts the iterations (both null and serious). Let $\{j_k\}_{k \ge 1}$ denote the sequence of all serious iterations (i.e., the ones ending with a serious update) and, for every $k \ge 1$, define $j_0 := 0$ and the $k$-th cycle as
$$\mathcal{C}_k := \left\{ j \in \mathbb{N} : j_{k-1} < j \le j_k \right\}. \tag{29}$$
Observe that for every $k \ge 1$, we have $\hat x_k = x_{j_k}$, where $\hat x_k$ is computed in the serious update part of step 2 of Ad-GPB. (Hence, the index $k$ counts the cycles generated by Ad-GPB.) An iteration $j$ is called good (resp., bad) if the prox stepsize is kept (resp., halved) at its end. Note that the logic of Ad-GPB implies that serious iterations are good and that (25) is violated whenever $j$ is a bad iteration.
We next make some remarks about the quantities related to the different $\Gamma$-functions that appear in Ad-GPB and the associated quantities and . First, the observation immediately following BUF implies that
$$\ell_\phi(\cdot\,; x_j) \le \Gamma_{j+1} \le \phi, \tag{30}$$
which, together with the fact that is the last generated within a cycle, implies that and . Moreover, the first identity in (26) and the latter conclusion then imply that and , and hence that . Second, the facts that $\operatorname{dom} h$ is bounded (see assumption (A1)) and $\phi$ is a closed convex function imply that . Moreover, the problem has the same format as the first one that appears in (5), and hence is easily solvable by assumption. Its optimal value, which is a lower bound on $\phi^*$ as already observed above, is used to update the lower bound for $\phi^*$ to a possibly sharper one, namely, . Thus, the choice of in the line following (26) makes sense. For the sake of future reference, we note that
(31) |
Third, obvious ways of choosing in the interval are: i) ; and ii) . While choice i) requires knowledge of $\phi^*$, choice ii) does not and can be easily implemented in view of the previous remark. Moreover, if $\phi^*$ is known and is chosen as in i), then there is no need to compute , and hence the min term in (26), at the end of every cycle.
We now make some other relevant remarks about Ad-GPB. First, it follows from (30) and the definition of in (21) that for every . Second, an induction argument using (23) and the fact that implies that and
(32) |
(hence, ) for every . Third, the cycle-stopping criterion, i.e., the inequality in the first line of step 2, is a relaxation of the one used by the GPB method of [19], in the sense that its right-hand side has the extra term involving the relaxation factor . The addition of this term allows earlier cycles to terminate in a smaller number of inner iterations, and hence speeds up the overall performance of the method. The quantities in (27) and (28) are used to update at the end of the -th cycle (see the 'if' statement after (28)). Fourth, the condition imposed on at the end of a serious iteration (see the second line below (28)) does not completely specify it. An obvious way (cold start) of choosing it is to set it to ; another way (warm start) is to choose it using the update rule of a null iteration, i.e., , since (10) implies the required condition on .
We now comment on the inexactness of as a solution of prox subproblem (20) and of as a solution of (1) upon termination of Ad-GPB. The facts that and imply that the primal gap of (20) at is upper bounded by . Hence, if the inequality for stopping the cycle in step 2 holds, then we conclude that is an -solution of (20), where
Finally, if the test inequality before (27) in step 2 holds, then the second component of the pair output by Ad-GPB satisfies due to the fact that .
Lastly, Ad-GPB never restarts a cycle, i.e., it never attempts to inexactly solve two or more subproblems (20) with the same prox center . Instead, Ad-GPB has a key rule for updating the stepsize within a cycle which always allows it to inexactly solve subproblem (20) with set to the last generated within the -th cycle (see the second line of the serious update part of Ad-GPB).
We now state the main complexity result for Ad-GPB whose proof is postponed to the end of Section 4.
Theorem 2.1
Define
(33) |
(34) |
where
(35) |
Then, Ad-GPB finds a pair satisfying in at most cycles and
(36) |
iterations.
We now make some remarks about Theorem 2.1. First, in terms of and only, it follows from Theorem 2.1 that the iteration complexity of Ad-GPB to find a -solution of (1) is . Hence, when satisfies and , the total iteration complexity of Ad-GPB is . Moreover, under the assumption that satisfies , Ad-GPB performs
-
•
cycles whenever ;
-
•
more generally, cycles whenever for some .
3 Bounding cycle lengths
The main goal of this section is to derive a bound (Proposition 3.5 below) on the number of iterations within a cycle.
Recall from (29) that (resp., ) denotes the first (resp., last) iteration index of the -th cycle of Ad-GPB. The first result describes some basic facts about the iterations within any given cycle.
Lemma 3.1
For every , the following statements hold:
-
a)
there exists function such that
(37) (38) (39) -
b)
for every , we have
(40)
Proof: a) This statement immediately follows from (10), (11), and the fact that is the output of the BUF blackbox with inputs , , and .
b) Using (39) and the fact that is strongly convex, we have for every ,
The statement follows from the above inequality and the second identity in (38).
The next result presents some basic recursive inequalities for .
Lemma 3.2
For every , the following statements hold:
-
a)
for every , there holds
-
b)
if , then we have
(41)
Proof: a) Inequality (37) implies that for every , we have
The definition of in (22), the above inequality, and (40) with , imply that
Using this inequality and the definition of in (24), we have
where the last inequality is due to the definition of in (23). The conclusion of the statement now follows from the above inequality and relation (7) with .
b) Using the assumption of this statement and statement a) with , we easily see that
The statement now follows from the above inequality and the inequality with
The next result describes some properties about the stepsizes within any given cycle. It uses the fact that if is a bad iteration of Ad-GPB, then (25) is violated (see step 2 of Ad-GPB and the first paragraph following Ad-GPB).
Lemma 3.3
Define
(42) |
where is an input to Ad-GPB, and $M_f$ and $L_f$ are as in assumption (A3). Then, the following statements hold:
-
a)
for every index , we have
-
b)
the number of bad iterations in is bounded by
Proof: a) Assume for contradiction that there exists such that
(43) |
and that is the smallest index in satisfying the above inequality. We claim that this assumption implies that
(44) |
Before showing the claim, we argue that (44) implies the conclusion of the lemma. Indeed, noting that (42) and (44) imply that , it follows from (41) with and the definition of in (42) that
where the last inequality is due to the fact that . This conclusion then implies that (25) holds for iteration , and hence that due to the logic of step 2 of Ad-GPB. Since this contradicts (44), statement (a) follows.
We will now show the above claim, i.e., that the definition of implies (44). Indeed, since the logic of step 2 implies that and is the smallest index in satisfying (43), we conclude that and . Using these conclusions and the fact that the logic of step 2 of Ad-GPB implies that either or for every , we then conclude that both the inequality and the identity in (44) hold.
b) Since (resp., ) if is a good (resp., bad) iteration, we easily see that . This observation together with (a) then implies that statement (b) holds.
It follows from Lemma 3.3(b) that the number of bad iterations within the -th cycle is finite. Proposition 3.5 below provides a bound on , and hence shows that every cycle terminates. Before showing this result, we state a technical result which provides some key properties about the sequence .
Lemma 3.4
The following statements hold:
-
a)
if , then .
-
b)
if is a good iteration that is not the last one in , then
-
c)
if is not the last iteration of , then
(45) where denotes the number of bad iterations within cycle .
Proof: a) The statement immediately follows from Lemma 3.2(a) with .
b) Assume that is a good iteration that is not the last one in . This, together with the logic of Ad-GPB, implies that (25) is satisfied and the cycle-stopping criterion is violated at iteration , i.e.,
(46) |
These two observations then imply that
which can be easily seen to imply that statement b) holds.
c) If , then (45) obviously follows. Assume then that . The fact that there are at least good iterations in , and statements a) and b), imply that
(47) |
and thus (45) follows.
Proposition 3.5
For every cycle index generated by Ad-GPB, its size is bounded by where denotes the number of bad iterations within it and is defined as
(48) |
Proof: If , then the cycle-stopping criterion is satisfied with . This implies that , and hence that the result trivially holds in this case. From now on, assume and suppose for contradiction that . This implies that there exists a nonnegative integer such that and
(49) |
because the left-hand side of (49) is the cardinality of the index set . Since is not the last iteration of , the cycle-stopping criterion in step 2 of Ad-GPB is violated with , i.e.,
This observation together with Lemma 3.4(c) with then imply that
which together with the definition of in (48) can be easily seen to imply that
Since this conclusion contradicts (49), the conclusion of the proposition follows.
We now make some remarks about Proposition 3.5. First, the bound on the length of each cycle depends on the number of bad iterations within it. Second, to obtain the overall iteration complexity of Ad-GPB, it suffices to derive a bound on the number of cycles generated by Ad-GPB, which is the main goal of the subsequent section.
4 Bounding the number of cycles
This section establishes a bound on the number of cycles generated by Ad-GPB. It contains two subsections. The first one considers the (much simpler) case where is known and in step 2 of Ad-GPB is set to for every . The second one considers the general case where is an arbitrary scalar satisfying the condition in step 2 of Ad-GPB.
We start by stating two technical results that are used in both subsections.
The first one describes basic facts about the sextuple generated at the end of the -th cycle.
Lemma 4.1
For every cycle index of Ad-GPB, the following statements hold:
-
a)
, , and ;
-
b)
we have
-
c)
, , and , where by convention ;
-
d)
and ;
-
e)
for every given , we have
(50)
Proof: a) It follows from (30) and the fact that is the last generated within the -th cycle.
b) It follows from the fact that the cycle-stopping criterion in the first line of step 2 of Ad-GPB is satisfied at iteration and the definitions of the quantities and in step 2 of Ad-GPB.
c) It follows from the first two remarks in the paragraph containing (32), the definition of in (27), and the fact that (resp., ) is the last (resp., ) generated within the -th cycle.
d) The first inequality in (d) follows from the definition in (26). Moreover, the rule for updating in step 2 of Ad-GPB implies that .
e) Observe that (30) and the definitions of the quantities , , , and in step 2 of Ad-GPB, imply that is the pair of optimal solution and optimal value of
(51) |
The above observation and the fact that the objective of (51) is strongly convex together imply that, for the given , we have
(52) |
and hence that
where the equality is due to the definition of in step 2 of Ad-GPB. This shows that (50), and hence statement (e), holds.
The next result provides a uniform upper (resp., lower) bound on the sequence (resp. ), and also a bound on the total number of bad iterations generated by Ad-GPB.
Lemma 4.2
Proof: a) Using the facts that (see step 2 of Ad-GPB) and Lemma 3.3(a) with , we conclude that
The statement then follows by using the above inequality recursively and the convention that .
b) Since the last iteration of a cycle is not bad, and since is equal to (resp., equal to ) if is a good (resp., bad) iteration, we easily see that , or equivalently, , for every cycle of Ad-GPB, under the convention that . Statement (b) now follows by summing the above inequality from to and using statement a).
c) Using the facts that and (see the serious update in step 2 of Ad-GPB), and the definition of , and in (24), (22) and (23), respectively, we have
(53) |
Statement c) now follows from the above inequality, assumption (A4), and the fact that .
It follows from Lemma 4.2(b) and the definition of in (42) that the overall number of bad iterations is
4.1 Case where is known
This subsection considers the special case of Ad-GPB where is known and
(54) |
Even though the general result in Theorem 2.1 holds for this case, the simpler proof presented here covering the above case helps to understand the proof of the more general case given in Subsection 4.2 and has the advantage that it does not assume that is bounded.
For convenience, the simplified version of Ad-GPB, referred to as Ad-GPB*, is explicitly stated below.
Ad-GPB*

0. Let , , , and be given; find such that and set , , ;

1. same as step 1 of Ad-GPB;

2. if the cycle-stopping criterion is violated, then perform a null update, i.e.:
   set ;
   if either or (55) holds, then set ; else, set ;
else perform a serious update, i.e.:
   set and find such that ;
   set and ;
   if , then output and stop;
   set ;

3. set and go to step 1.
We now make some remarks about Ad-GPB*. First, even though the parameter can be arbitrarily chosen in , Ad-GPB* is stated with for simplicity. Second, if Ad-GPB* reaches step 2, then the primal gap is greater than , and it is substantially larger than this lower bound at the early cycles. Hence, the right-hand side of the cycle termination criterion in step 2 is always larger than , and substantially larger at the early cycles. Since, in contrast, GPB terminates a cycle when the inequality is satisfied, the cycle termination criterion of Ad-GPB* is always looser, and potentially much looser at its early cycles, than that of GPB.
The next lemma formally shows that Ad-GPB* is a specific instance of Ad-GPB.
Lemma 4.3
The following statements hold:
-
a)
Ad-GPB* is a special instance of Ad-GPB with and chosen as in (54); moreover, for every cycle index ;
-
b)
for every cycle index of Ad-GPB*, we have
(56)
Proof: a) The first claim of (a) is obvious. To show that for every cycle index generated by Ad-GPB, it suffices to show that because . Indeed, using (54), the fact that for due to Lemma 4.1(c), and the definitions of and in (28) and (27), respectively, we conclude that
and hence that due to the update rule for at the end of step 2 of Ad-GPB.
b) This statement follows from (a), Lemma 4.1(b), and the fact that .
Let denote the distance of the initial point to the set of optimal solutions , i.e.,
(57) |
Lemma 4.4
If is a cycle index generated by Ad-GPB*, then we have
(58) |
Proof: Relation (50) with , and the facts that and , imply that
where the last inequality is due to (56). Simplifying the above inequality and summing the resulting inequality from , we conclude that
The statement now follows from the above inequality and (57).
We are now ready to prove Theorem 4.5.
Theorem 4.5
Ad-GPB* finds an iterate satisfying in at most cycles and
Proof: We first prove that Ad-GPB* finds an iterate satisfying in at most cycles. Suppose for contradiction that Ad-GPB* generates a cycle . Since Ad-GPB* did not stop at any of the previous iterations, we have that for every . Using the previous observation, inequality (58), the fact that , and Lemma 4.2(a), we conclude that
The definition of in (60), the above inequality, and some simple algebraic manipulation of the min term yield the desired contradiction.
Let denote the number of cycles generated by Ad-GPB*. Proposition 3.5 and statements (b) and (c) of Lemma 4.2 then imply that the total number of iterations performed by Ad-GPB* until it finds an iterate satisfying is bounded by
and hence by (59), due to the definitions of and in (35) and (42), respectively.
4.2 General case where is unknown
As already mentioned above, this subsection considers the general case (see step 2 of Ad-GPB) where is in the interval , where is as in (26), and derives an upper bound on the number of cycles generated by Ad-GPB.
Lemma 4.6
Proof: Let be given. Multiplying (50) by and summing the resulting inequality from , we have
where the last inequality is due to Lemma 4.1(b). Dividing both sides of the above inequality by , and using the definition of , , and in (27), (26), and (28), respectively, we have
where the last inequality is due to Assumption (A4) and the fact that and where the last inclusion is due to Lemma 4.1(c). Inequality (61) now follows from (31) and by maximizing the above inequality relative to .
Lemma 4.7
For every cycle index of Ad-GPB, the following statements hold:
-
a)
If , then we have
(62) - b)
Proof: a) The update rule for just after equation (28) and the assumption that imply that . This observation and inequality (61) then immediately imply (62).
b) The first inequality in (63) follows from the assumption that and the update rule for just after equation (28). Moreover, using the definition of in (28), the fact that for every due to Lemma 4.1(d), and the fact that for every in (31) we conclude that
where the last inequality is because statements (c) and (d) of Lemma 4.1 imply that and for every . The above inequality, the convention that , and the fact that (see step 0 of Ad-GPB) then imply the second inequality in (63).
We are now ready to prove the main result of this paper.
Proof of Theorem 2.1. To simplify notation, let . It is easy to see that
(64) |
We first prove that Ad-GPB finds an iterate satisfying in at most cycles. Suppose for contradiction that Ad-GPB generates a cycle . Since Ad-GPB did not stop at cycles to , we have that
(65) |
We then have that
(66) |
since otherwise we would have for some , and this, together with (65), Lemma 4.1(c), Lemma 4.7(a), and Lemma 4.2(a), would yield the contradiction
where the last inequality is due to the first inequality in (64). Relations (65) and (66), and Lemmas 4.1(c) and 4.7(b), all with , then yield
where the last inequality is due to the second inequality in (64). Since , the above inequality gives the desired contradiction, and hence the first conclusion of the theorem holds.
To show the second conclusion of the theorem, let denote the number of cycles generated by Ad-GPB. Proposition 3.5 and statements (b) and (c) of Lemma 4.2 then imply that the total number of iterations performed until Ad-GPB finds an iterate satisfying is bounded by
and hence by (36), due to the definitions of and in (35) and (42), respectively.
5 Computational experiments
This section reports the computational results of Ad-GPB* and two corresponding practical variants against other modern PB methods and the subgradient method. It contains two subsections. The first one presents computational results for a simple feasibility problem. The second one showcases computational results for the Lagrangian cut problem that appears in the area of integer programming.
All the methods tested in the following two subsections are terminated based on the following criterion:
(67) |
where the target value used in (67) is specified in each of the two subsections below. All experiments were performed in MATLAB 2023a and run on a PC with a 16-core Intel Core i9 processor and 32 GB of memory.
Now we describe the algorithmic details used in the following two subsections. We first describe the Polyak subgradient method. Given $x_k$, it computes
$$x_{k+1} = x_k - s_k f'(x_k), \qquad \text{where} \qquad s_k := \frac{\phi(x_k) - \phi^*}{\|f'(x_k)\|^2}. \tag{68}$$
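For reference, a single Polyak step as in (68) can be sketched as follows; note that $\phi^*$ must be known (for problem (69) below it equals zero).

```python
import numpy as np

def polyak_subgradient_step(x, phi_x, g, phi_star):
    """One Polyak subgradient step x+ = x - s*g with the stepsize (68),
    s = (phi(x) - phi*) / ||g||^2, where g is a subgradient of phi at x."""
    s = (phi_x - phi_star) / max(np.dot(g, g), 1e-16)  # guard against g = 0
    return x - s * g
```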
Next we describe five GPB-related methods, namely: GPB, Ad-GPB*, Ad-GPB**, Pol-Ad-GPB*, and Pol-GPB. A cycle of Ad-GPB*, regardless of the way its initial prox stepsize is chosen, is called good if the prox stepsizes do not change within it (i.e., inequality (25) is never violated). First, GPB is the method stated in [18, 19]. Second, Ad-GPB* is stated in Subsection 4.1. Third, Ad-GPB** is a variant of Ad-GPB* that allows the prox stepsize to increase at the beginning of its initial cycles; a sketch of this rule is given after this paragraph. Specifically, if denotes the largest cycle index for which cycles 1 to are good, then Ad-GPB** sets for every and afterwards sets for every as Ad-GPB* does, where and are defined in step 2 of Ad-GPB* and (29), respectively. The motivation behind Ad-GPB** is to prevent Ad-GPB or Ad-GPB* from generating only small stepsizes due to a poor choice of the initial prox stepsize . Pol-GPB and Pol-Ad-GPB* are two Polyak-type variants of GPB and Ad-GPB* where the initial prox stepsize for the $k$-th cycle is set to a Polyak-type stepsize. All five methods update the bundle according to the 2-cut update scheme described in Subsection 2.2. Finally, Ad-GPB*, Ad-GPB**, and Pol-Ad-GPB* use .
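The Ad-GPB** stepsize rule just described can be sketched in one line; the growth factor 2 below is our assumption, since the exact factor is not recorded here.

```python
def next_cycle_stepsize(lam_last, all_cycles_good_so_far):
    """Sketch of the Ad-GPB** rule for the initial stepsize of a new cycle:
    grow it (factor 2 assumed here) while every cycle so far has been good;
    afterwards keep the last stepsize, as Ad-GPB* does."""
    return 2.0 * lam_last if all_cycles_good_so_far else lam_last
```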
5.1 Feasibility problem
This subsection reports the computational results on a feasibility problem. It contains two subsections. The first one presents computational results of Ad-GPB* and Ad-GPB** against GPB method in [18, 19]. The second one showcases the computational results of Pol-Ad-GPB* and Pol-GPB against subgradient method with Polyak stepsize.
We start by describing the feasibility problem. The problem can be formulated as
(69) |
where $A$ and $b$ are known. We consider two different ways of generating the data, i.e., sparse and dense instances. For dense problems, the matrix $A$ is randomly generated as a product of two factors whose entries are i.i.d. sampled from the standard normal (resp., uniform) distribution. For sparse problems, the matrix $A$ is randomly generated as a product of a sparse matrix, whose nonzero entries are i.i.d. sampled from the standard normal distribution, and a diagonal matrix $D$ whose diagonal entries are i.i.d. sampled. In both cases, the vector $b$ is determined as $b = A x^*$ for some vector $x^*$ whose entries are i.i.d. sampled from the standard normal distribution. Finally, we generated the initial point from a vector whose entries are i.i.d. sampled from the uniform distribution. Clearly, $x^*$ is a global minimizer of (69), whose optimal value equals zero in both cases. We test our methods on six dense and six sparse instances.
We now describe some details about all the tables that appear in this subsection. We set the target in (67) as for dense instances and for sparse instances. The quantities , , and are defined as , , and , where is the number of nonzero entries of . An entry in each table is given as a fraction with the numerator expressing the (rounded) number of iterations and the denominator expressing the CPU running time in seconds. An entry marked as */* indicates that the CPU running time exceeded the allocated time limit. Bold numbers highlight the method with the best performance for each instance.
5.1.1 Ad-GPB* versus GPB
This subsection presents computational results of GPB method in [18, 19] against Ad-GPB* and Ad-GPB**.
To check how sensitive the three methods are relative to the initial choice of prox stepsize , we test them for where . The computational results for the above three methods are given in Table 1 (resp., Table 2) for six sparse (resp., dense) instances. The time limit is four hours for Table 1 and two hours for Table 2.
Table 1: Results (iterations/CPU time) of GPB, Ad-GPB*, and Ad-GPB** on the six sparse instances, for three choices of the initial prox stepsize.
Table 2: Results (iterations/CPU time) of GPB, Ad-GPB*, and Ad-GPB** on the six dense instances, for three choices of the initial prox stepsize.
First, the results in Tables 1 and 2 show that Ad-GPB* and Ad-GPB** are generally at least two to three times faster than GPB in terms of CPU running time. Second, they also show that Ad-GPB* and Ad-GPB** are more robust to the initial stepsize than GPB. We also observe that Ad-GPB** generally performs slightly better than Ad-GPB* for small initial stepsizes, which is accounted for by the increase of the stepsize at the end of its early cycles.
5.1.2 Polyak-type methods
This subsection considers the two Polyak-type variants Pol-GPB and Pol-Ad-GPB* of GPB and Ad-GPB*, where the initial prox stepsize for the $k$-th cycle is set to a Polyak-type stepsize. These two variants are in turn compared against the subgradient method with Polyak stepsize (see (68)) and the Ad-GPB* and Ad-GPB** variants described in Subsection 5.1.1.
The computational results for the above five methods are given in Table 3 (resp., Table 4) for six sparse (resp., dense) instances. The results for Ad-GPB* and Ad-GPB** are the same ones that appear in Tables 1 and 2 with . They are duplicated here for the sake of convenience.
Table 3: Results for the six sparse instances (iterations/CPU time in seconds).

Instance | Pol-Sub | Pol-GPB | Pol-Ad-GPB* | Ad-GPB* | Ad-GPB**
(1,20,) | 431.3K/354 | 33.8K/58 | 15.5K/22 | 19.6K/27 | 21.3K/31
(3,30,) | 413.3K/786 | 126.9K/392 | 21.2K/60 | 39.5K/102 | 36.3K/94
(5,50,) | 389.8K/2440 | 91.4K/581 | 19.7K/128 | 32.7K/212 | 32.1K/211
(10,100,) | 473.2K/2092 | 137.12K/876 | 21.3K/157 | 40.5K/226 | 44.4K/242
(20,200,) | */* | 115.7K/5576 | 25.9K/1070 | 41.3K/1401 | 40.8K/1441
(50,500,) | */* | 133.0K/13754 | 25.5K/1847 | 42.9K/2460 | 43.0K/2392
Table 4: Results for the six dense instances (iterations/CPU time in seconds).

Instance | Pol-Sub | Pol-GPB | Pol-Ad-GPB* | Ad-GPB* | Ad-GPB**
(0.5,1.5) | 479.8K/36.5 | 143.2K/22.8 | 73.5K/9.4 | 47.9K/5 | 56.4K/6
(1,3) | 1644.3K/1440 | 390.3K/196.7 | 144.4K/72.1 | 81.9K/40 | 79.0K/41
(2,6) | 354.5K/2513 | 46.4K/233 | 182.5K/959 | 181.4K/911 | 176.6K/882
(1.5,0.5) | 699.4K/79.5 | 109.9K/12.0 | 56.5K/5.6 | 102.5K/10 | 104.6K/10
(3,1) | 1034.6K/1046 | 147.8K/57.9 | 89.2K/38.8 | 155.5K/65 | 156.7K/64
(6,2) | */* | 136.1K/602 | 143.1K/713 | 242.4K/1211 | 238.0K/1151
5.2 Lagrangian cut problem
This subsection presents the numerical results comparing Ad-GPB* and Pol-Ad-GPB* against GPB and Pol-Sub on a convex nonsmooth optimization problem that has broad applications in the field of integer programming (see e.g. [28]).
The problem considered in this subsection arises in the context of solving the stochastic binary multi-knapsack problem
(70) |
where and
(71) |
s.t. | ||||
for every . In the second-stage problem, only the objective vector is a random variable. Moreover, it is assumed that its support is a finite set, i.e., has a finite number of scenarios ’s.
Benders decomposition (see e.g. [7, 26]) is an efficient cutting-plane approach for solving (70) which approximates by a pointwise maximum of cuts for . Specifically, an affine function such that for every is called a cut for ; moreover, a cut for is tight at if . Benders decomposition starts with a cut for and computes a sequence of iterates as follows: given cuts for , it computes as
(72) |
s.t. | ||||
where
it then uses to generate a new cut for and repeats the above steps with replaced by . Problem (72) can easily be reformulated as an equivalent linear integer programming problem.
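Schematically, the Benders loop just described alternates between solving the master (72) and generating a tight cut at the new iterate; the callbacks below are placeholders for the problem-specific solvers, so this is a sketch rather than the exact implementation.

```python
def benders_loop(solve_master, make_tight_cut, x0, num_iters=50):
    """Schematic Benders decomposition for (70): solve_master(cuts) solves
    (72) with the current cut collection; make_tight_cut(x) returns a cut
    that is tight at x (e.g., the Lagrangian cut of Subsection 5.2)."""
    x, cuts = x0, [make_tight_cut(x0)]
    for _ in range(num_iters):
        x = solve_master(cuts)          # next iterate from the master (72)
        cuts.append(make_tight_cut(x))  # refine the approximation at x
    return x, cuts
```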
We now describe how to generate a Lagrangian cut for using a given point . First, for every , (71) is equivalent to
s.t. | |||
By dualizing the constraint , we obtain the Lagrangian dual (LD) problem
(73) |
where
(74) |
Let denote an optimal solution of (73). The optimal values and of (71) and (73), respectively, are known to satisfy the following two properties for every and :
(i) for every ;

(ii) .
Property (i) can be found in many textbooks dealing with Lagrangian duality theory and property (ii) has been established in [28]. Defining
and taking expectation of the relations in (i) and (ii), we easily see that
and hence that is a tight cut for at .
Computation of assumes that the optimal value of (71) can be efficiently computed for every . Computation of assumes that an optimal solution of (73) can be computed for every . Noting that (73) is an unconstrained convex nonsmooth optimization problem in terms of variable and its optimal value is the (already computed) optimal value of (71), we use the Ad-GPB* variant of Ad-GPB to obtain a near optimal solution of (73). For the purpose of this subsection, we use several instances of (73) to benchmark the methods described at the beginning of this section.
For every , recall that using Ad-GPB* to solve (73) requires the ability to evaluate and to compute a subgradient of at every . The value is evaluated by solving the MILP (74). Moreover, if denotes an optimal solution of (74), then yields a subgradient of at . It is worth noting that (73) is a nonsmooth convex problem that does not seem to be tractable by the methods discussed in the papers (see e.g. [3, 9, 10, 14, 22, 23, 24, 25]) on min-max smooth convex-concave saddle-point problems, mainly due to the integrality condition imposed on the decision variable in (74).
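In this setting, the first-order oracle required by Ad-GPB* can be sketched as follows; `solve_milp_74` is a hypothetical placeholder for a MILP solver of (74), parameterized (our assumption) so that it returns an optimal solution together with the optimal value of the part that does not depend on the first-stage point.

```python
def make_ld_oracle(solve_milp_74, x):
    """Sketch of a first-order oracle for the LD problem (73) at a fixed x.

    solve_milp_74(pi) is a hypothetical solver for (74) returning an optimal
    z and the value v(pi) = min_z { obj(z) - <pi, z> }; then the dual
    function is L(pi) = v(pi) + <pi, x> and x - z is a supergradient.
    Since (73) is a maximization, Ad-GPB* is applied to -L.
    """
    def oracle(pi):
        z, v = solve_milp_74(pi)
        L = v + pi @ x                  # dual function value at pi
        g = x - z                       # a supergradient of L at pi
        return -L, -g                   # oracle for the minimization of -L
    return oracle
```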
Next we describe how the data of (70) and (71) is generated. We generate three random instances of (70), each following the same methodology as in [1]. We set and with each scenario being equiprobable. We generate matrices , , and vector , with all entries i.i.d. sampled from the uniform distribution over the integers . We then set and where the zero block of is . Twenty vectors with components i.i.d. sampled from are generated. Finally, we set and where denotes the vector of ones.
For each randomly generated instance of (70), we run Benders decomposition started from to obtain three iterates , , and . Each and for is computed using the twenty randomly generated vectors as described above, and hence each iteration solves twenty LD subproblems as in (73). Hence, each randomly generated instance of (70) yields a total of sixty LD instances as in (73). The total time to solve these sixty LD instances is given in Table 5 for the three instances of (70) (numbered 1 to 3 in the table) and all the benchmarked methods considered in this section. In this comparison, both Ad-GPB* and GPB set the initial stepsize to , where the entries of are i.i.d. generated from the uniform distribution.
Instance | GPB | Ad-GPB* | Pol-Subgrad | Pol-Ad-GPB*
1 | 751s | 600s | 1390s | 98s
2 | 721s | 550s | 1503s | 32s
3 | 1250s | 963s | 5206s | 98s
Table 5 shows that Ad-GPB* consistently outperforms GPB. Additionally, it shows that Pol-Ad-GPB* once again surpasses all other methods.
6 Concluding remarks
This paper presents a parameter-free adaptive proximal bundle method featuring two key ingredients: i) an adaptive strategy for selecting variable proximal step sizes tailored to specific problem instances, and ii) an adaptive cycle-stopping criterion that enhances the effectiveness of serious steps. Computational experiments reveal that our method significantly reduces the number of consecutive null steps (i.e., shorter cycles) while maintaining a manageable number of serious steps. As a result, it requires fewer iterations than methods employing a constant proximal step size and a non-adaptive cycle termination criterion. Moreover, our approach demonstrates considerable robustness to variations in the initial step size provided by the user.
We now discuss a possible extension of our results. Recall that the complexity analysis of Ad-GPB* assumes that is bounded; i.e., Lemma 4.2(c) uses this to give a simple proof that is bounded in terms of and the diameter of . Using more complex arguments, similar to those used in Appendix A of [19], we can show that the complexity analysis of Ad-GPB* extends to the setting where is unbounded.
We finally discuss some possible extensions of the analysis in this paper. First, establishing the iteration complexity of Ad-GPB for the case where is unknown and is unbounded is a challenging and interesting research topic. Second, it would be interesting to analyze the complexities of Pol-GPB and Pol-Ad-GPB* described in Section 5, as they both performed well in our computational experiments. Finally, our current analysis does not apply to the one-cut bundle update scheme (see Subsection 3.1 of [19]) since it is not a special case of BUF, as already observed in the second remark following BUF. It would be interesting to extend the analysis of this paper to establish the complexity of Ad-GPB based on the one-cut bundle update scheme.
References
- [1] G. Angulo, S. Ahmed, and S. S. Dey. Improving the integer L-shaped method. INFORMS Journal on Computing, 28(3):483–499, 2016.
- [2] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. A family of variable metric proximal methods. Mathematical Programming, 68:15–47, 1995.
- [3] Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.
- [4] W. de Oliveira and M. Solodov. A doubly stabilized bundle method for nonsmooth convex optimization. Mathematical Programming, 156(1-2):125–159, 2016.
- [5] M. Díaz and B. Grimmer. Optimal convergence rates for the proximal bundle method. SIAM Journal on Optimization, 33(2):424–454, 2023.
- [6] Y. Du and A. Ruszczyński. Rate of convergence of the bundle method. Journal of Optimization Theory and Applications, 173:908–922, 2017.
- [7] A. M. Geoffrion. Generalized Benders decomposition. Journal of Optimization Theory and Applications, 10:237–260, 1972.
- [8] V. Guigues, J. Liang, and R. D. C. Monteiro. Universal subgradient and proximal bundle methods for convex and strongly convex hybrid composite optimization. arXiv preprint arXiv:2407.10073, 2024.
- [9] Y. He and R. D. C. Monteiro. Accelerating block-decomposition first-order methods for solving composite saddle-point and two-player Nash equilibrium problems. SIAM Journal on Optimization, 25(4):2182–2211, 2015.
- [10] Y. He and R. D. C. Monteiro. An accelerated HPE-type algorithm for a class of composite convex-concave saddle-point problems. SIAM Journal on Optimization, 26(1):29–56, 2016.
- [11] N. Karmitsa and M. M. Mäkelä. Adaptive limited memory bundle method for bound constrained large-scale nonsmooth optimization. Optimization, 59(6):945–962, 2010.
- [12] K. C. Kiwiel. Efficiency of proximal bundle methods. Journal of Optimization Theory and Applications, 104(3):589–603, 2000.
- [13] K. C. Kiwiel. A proximal bundle method with approximate subgradient linearizations. SIAM Journal on Optimization, 16(4):1007–1023, 2006.
- [14] O. Kolossoski and R. D. C. Monteiro. An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex–concave saddle-point problems. Optimization Methods and Software, 32(6):1244–1272, 2017.
- [15] C. Lemaréchal. An extension of Davidon methods to nondifferentiable problems. In Nondifferentiable Optimization, pages 95–109. Springer, 1975.
- [16] C. Lemaréchal. Nonsmooth optimization and descent methods. Research Report, IIASA, 1978.
- [17] C. Lemaréchal and C. Sagastizábal. Variable metric bundle methods: from conceptual to implementable forms. Mathematical Programming, 76:393–410, 1997.
- [18] J. Liang and R. D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. SIAM Journal on Optimization, 31(4):2955–2986, 2021.
- [19] J. Liang and R. D. C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 49(2):832–855, 2024.
- [20] J. Liang, R. D. C. Monteiro, and H. Zhang. Proximal bundle methods for hybrid weakly convex composite optimization problems. arXiv preprint arXiv:2303.14896, 2023.
- [21] R. Mifflin. A modification and an extension of Lemarechal’s algorithm for nonsmooth minimization, pages 77–90. Springer Berlin Heidelberg, Berlin, Heidelberg, 1982.
- [22] R. D. C. Monteiro and B. F. Svaiter. Complexity of variants of Tseng's modified F-B splitting and Korpelevich's methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM Journal on Optimization, 21(4):1688–1720, 2011.
- [23] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
- [24] A. S. Nemirovski and D. B. Yudin. Cesari convergence of the gradient method of approximating saddle points of convex-concave functions. In Doklady Akademii Nauk, volume 239, pages 1056–1059. Russian Academy of Sciences, 1978.
- [25] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
- [26] R. Rahmaniani, T. G. Crainic, M. Gendreau, and W. Rei. The Benders decomposition algorithm: A literature review. European Journal of Operational Research, 259(3):801–817, 2017.
- [27] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable optimization, pages 145–173. Springer, 1975.
- [28] J. Zou, S. Ahmed, and X. A. Sun. Stochastic dual dynamic integer programming. Mathematical Programming, 175:461–502, 2019.