A single cut proximal bundle method for
stochastic convex composite optimization
1st revision: November 3, 2022
2nd revision: April 30, 2023
3rd revision: October 20, 2023
Abstract
This paper considers optimization problems where the objective is the sum of a function given by an expectation and a closed convex function, and proposes stochastic composite proximal bundle (SCPB) methods for solving them. Complexity guarantees are established for these methods without requiring knowledge of parameters associated with the problem instance. Moreover, it is shown that they have optimal complexity when these problem parameters are known. To the best of our knowledge, this is the first proximal bundle method for stochastic programming able to deal with continuous distributions. Finally, we present computational results showing that SCPB substantially outperforms the robust stochastic approximation (RSA) method in all instances considered.
Keywords. stochastic convex composite optimization, stochastic approximation, proximal bundle method, optimal complexity bound.
AMS subject classifications. 49M37, 65K05, 68Q25, 90C25, 90C30, 90C60.
1 Introduction
The main goal of this paper is to propose and study the complexity of some stochastic composite proximal bundle (SCPB) variants to solve the stochastic convex composite optimization (SCCO) problem
φ_* := min { φ(x) := f(x) + h(x) : x ∈ ℝ^n }   (1)
where
f(x) := E_ξ[ F(x, ξ) ].   (2)
We assume the following conditions hold: i) are proper closed convex functions such that ; ii) a stochastic first-order oracle, which for every and almost every random vector returns such that , is available; and iii) for every , for some .
Literature Review. Proximal bundle methods for solving the deterministic version of (1), i.e., where an oracle that outputs for any is available, have been proposed in [17, 18, 21, 37]. Moreover, convergence (but not complexity) analyses of proximal bundle methods have been developed for example in [7, 26, 30, 34], and their iteration complexities have been derived for example in [2, 5, 6, 15, 19, 20].
We now discuss methods for solving (the stochastic version of) problem (1). Methods for solving (1) when can be computed exactly (e.g., is a discrete random vector with small support) have been discussed for example in [3, 4] and are usually based on solving a deterministic (but large-scale) reformulation of (1), using decomposition (such as the L-shaped method [33]) possibly combined with regularization as in [12, 13].
Solution methods for problem (1) in which has a continuous distribution are based on one of the following three ideas: i) a single (usually expensive) approximation of (1), in which is approximated by a Monte Carlo average over a large i.i.d. sample of , is constructed at the beginning of the method and is then solved to yield an approximate solution of (1) (SAA-type methods); see for instance [11, 14, 16, 31, 35] and also Chapter 5 of [32] for their complexity analysis; ii) simple approximations of (1) are constructed at every iteration based on a small sample (usually a single one) and their solutions are used to obtain an approximate solution of (1) (SA-type methods); SA-type methods were originally proposed in [29] and further extended in [9, 10, 22, 23, 25, 27, 28]; and iii) hybrid-type methods, which sit in between SAA- and SA-type ones in that they use partial Monte Carlo averages (and their expensive subgradients) for increasing iteration indices [13].
Contributions. Although the cutting plane methodology can be used in the context of SAA methods to solve single approximations of (1) generated at their outset, such methodology has not been used in the context of SA-type methods. This paper partially addresses this issue by developing regularized aggregated cutting plane methods for solving (1) where some of the most recent (linear approximation) cuts (in expectation) are combined, i.e., a suitable convex combination of them is chosen, so that a single aggregated cut (in expectation) is obtained. Two SCPB variants based on the aforementioned aggregated one-cut scheme are proposed which can be viewed as natural extensions of the one-cut variant developed in [20] (based on the analysis of [19]) for solving the deterministic version of (1). More specifically, at every iteration, these SCPB variants solve the prox bundle subproblem
min_{x ∈ ℝ^n} { Γ(x) + (1/(2λ)) ‖x − x_c‖² }   (3)
where is the prox stepsize, is the current prox-center, and is the current bundle function in expectation, i.e., it satisfies . The prox-center remains the same for several consecutive iterations which are referred to as a cycle. In the beginning of a cycle, the prox-center is updated to and the bundle function is chosen to be the composite linear approximation of the function at for some new independent sample . For other iterations of the cycle, the prox-center remains the same but is set to be a convex combination of the previous bundle function and the most recent composite linear approximation as constructed above. It is then shown that both SCPB variants obtain a stochastic iterate (determined by some of the above generated ’s) such that , where is as in (1), in iterations/resolvent evaluations. To our knowledge, these are the first SA-type SCPB methods for solving SCCO problems where can have either a discrete or continuous distribution. Finally, it is shown that the robust stochastic approximation (RSA) method of [22] is a special case of SCPB with a relatively small prox stepsize.
Organization of the paper. Subsection 1.1 presents basic definitions and notation used throughout the paper. Section 2 formally describes the assumptions on the SCCO problem (1), presents the SCPB scheme, and gives two cycle rules for determining the length of the cycles in SCPB. Section 3 presents various convergence rate bounds for the SCPB variant based on the first cycle rule and discusses the relationship between SCPB and RSA. Section 4 provides convergence rate bounds for the SCPB variant based on the second cycle rule. Section 5 collects proofs of the main results in Sections 3 and 4. Section 6 reports the numerical experiments. Finally, Section 7 presents some concluding remarks and possible extensions.
1.1 Basic definitions and notation
Let denote the set of positive integers. The sets of real numbers, non-negative and positive real numbers are denoted by , and , respectively. Let denote the standard -dimensional Euclidean space equipped with inner product and norm denoted by and , respectively.
Let be given. The effective domain of is denoted by and is proper if . For , the -subdifferential of at is denoted by . The subdifferential of at , denoted by , is by definition the set . Moreover, a proper function is -strongly convex for some if
for every and . Note that we say is convex when . We use the notation for the history of the sampled observations of up to iteration . Define . Define the diameter of a set to be .
2 Assumptions and two SCPB variants
This section presents the assumptions made on problem (1) and states two SCPB variants for solving it.
2.1 Assumptions
Let denote the support of the random vector and assume that the following conditions on (1) hold:
Assumption 2.1.
We now make some observations about the above conditions. First, as in [22], condition (A2) does not require to be convex. Second, condition (A3) implies that
(4) |
Third, defining for every and ,
(5) |
it follows from (A2), the second identity in (5), and the convexity of by (A1), that
(6) |
where is as in (1). Hence, is a stochastic composite linear approximation of in the sense that its expectation is a true composite linear approximation of . (The terminology “composite” refers to the function which is included in the approximation as is.)
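As a concrete rendering of this construction (with illustrative symbols $F$, $s$, $h$, $\ell$, $\bar x$, since the paper's notation did not survive extraction), the stochastic composite linear approximation built from one sample $\xi$ at a point $\bar x$ can be sketched as

```latex
% Stochastic composite linear approximation at \bar{x} for one sample \xi
% (symbols illustrative): an affine model of F(\cdot,\xi), plus h kept as is.
% Here f'(\bar x) := \mathbb{E}_\xi[s(\bar x,\xi)] \in \partial f(\bar x),
% as guaranteed by the stochastic first-order oracle.
\ell(x;\bar x,\xi) \;:=\; F(\bar x,\xi) \;+\; \langle s(\bar x,\xi),\, x-\bar x\rangle \;+\; h(x),
\qquad
\mathbb{E}_\xi\big[\ell(x;\bar x,\xi)\big]
\;=\; f(\bar x) + \langle f'(\bar x),\, x-\bar x\rangle + h(x)
\;\le\; f(x) + h(x),
```

so that, in expectation, $\ell$ is a true composite linear approximation minorizing the objective, which is the content of the remark above.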
2.2 Description of the two SCPB variants
Before describing the two SCPB variants, we motivate them by interpreting them as inexact implementations of the (theoretical) proximal point method for solving (1).
The -th cycle of each variant performs a finite number of iterations to approximately solve the prox subproblem
where denotes the prox-center during the cycle. Each iteration within the cycle solves a subproblem of the form
min_{x ∈ ℝ^n} { Γ(x) + (1/(2λ)) ‖x − x_c‖² }   (7)
where is an affine bundle for in expectation, i.e., an affine function such that . (This type of bundle has been considered in the inexact proximal point approach considered in [20] where it is referred to as a one-cut bundle for .) The bundle for the next subproblem in the cycle is then taken to be a linear combination of the current bundle and a newly generated stochastic linear approximation of of the form . Moreover, the first iteration of every cycle starts by setting the prox-center to the most recently generated as in (7) and the bundle to the most recently generated stochastic linear approximation of at .
Both SCPB variants are based on the SCPB scheme described below. As stated below, the scheme is not a completely specified algorithm since its step 2 does not describe how to select the index . Two rules for doing so, and hence the complete description of the two aforementioned SCPB variants, are then given following the statement of the scheme.
At every iteration , the SCPB scheme samples an independent realization of .
SCPB
Input: Scalars , integer , and initial point .
0. Set , , and
(8)
1. Take a sample of r.v. independent from the previous samples and compute
(11) (14) (15) and
(18)
2. If , then choose an integer such that
if , then set and go to step 1; else, set , and go to step 3;
3. If , then set and , and go to step 1; otherwise, compute
(19) and stop.
Output: .
We first discuss the roles played by the two index counts and used by SCPB. First, counts the total number of iterations/resolvent evaluations performed by SCPB since it is increased by one every time SCPB returns to step 1. Second, defining the -th cycle as the iteration indices lying in
(20) |
it immediately follows that counts the number of cycles generated by SCPB. Third, step 1 determines two types of iterations depending on whether (serious iteration) or (null iteration). Hence, the last iteration of a cycle is a serious one while the others are null ones.
We now make several basic remarks about SCPB. First, every execution of step 1 involves one resolvent evaluation of , i.e., an evaluation of the point-to-point operator for some . Second, SCPB generates three sequences of iterates, namely, the sequence of prox-centers computed in (11), the sequence determined by (15), and the sequence given by (18). Third, it follows from (11) that for every . Hence, the prox-center remains constant between consecutive iterations within a cycle and (possibly) changes only at the beginning of the first iteration of the following cycle. Fourth, is the subsequence of consisting of all the last cycle iterates generated by SCPB. Fifth, the convergence rates for the two specific variants of the SCPB scheme described below are with respect to the average of the iterates , namely, the point as in (19) (see Theorems 3.1 and 4.1 below).
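To make the cycle structure concrete, here is a minimal, self-contained Python sketch of the one-cut scheme (not the authors' MATLAB implementation): it assumes that h is the indicator of the Euclidean unit ball so that subproblem (15) reduces to a projection, uses a fixed cycle length and a fixed convex-combination weight in place of the rules of steps 1-2, and runs on an illustrative least-squares instance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative instance: f(x) = E[0.5*||A x - xi||^2], h = indicator of the
# unit ball.  The cycle length K_len, weight tau, and stepsize lam are fixed
# here for simplicity; they stand in for the quantities chosen in steps 1-2.
d, n_cycles, K_len, lam, tau = 5, 20, 10, 0.5, 0.5
A = rng.normal(size=(d, d))

def sgrad(x):
    """One stochastic oracle call: a subgradient of F(., xi) at x."""
    xi = rng.normal(size=d)
    return A.T @ (A @ x - xi)

def proj_ball(y):
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

xc = np.zeros(d)          # prox-center, constant within a cycle
g = sgrad(xc)             # slope of the current aggregated (one-cut) bundle
centers = []
for _ in range(n_cycles):
    for _ in range(K_len):
        # Prox bundle subproblem with an affine bundle <g, .> plus h:
        # argmin <g,x> + (1/(2*lam))*||x - xc||^2 + h(x) is a projection.
        x = proj_ball(xc - lam * g)
        # One-cut aggregation: convex combination of the old bundle slope
        # and a fresh stochastic cut taken at the new iterate x.
        g = tau * g + (1.0 - tau) * sgrad(x)
    xc = x                # serious step: prox-center for the next cycle
    centers.append(xc.copy())
xhat = np.mean(centers, axis=0)   # averaged output, in the spirit of (19)
```

Since every iterate is a projection onto the feasible set, the averaged output xhat is feasible as well; the actual SCPB variants replace the fixed K_len and tau by the cycle rules (B1)/(B2) and the parameter choices of Corollaries 3.2 and 3.3.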
As already mentioned in the second paragraph preceding the description of SCPB, the scheme is not completely specified since its step 2 does not describe how to select . We now describe two cycle rules for doing so which depend on a pre-specified parameter , namely:
(B1) for every , let be the smallest integer such that ;
(B2)
We make the following remarks about cycle rules (B1) and (B2). First, the sequence determined by cycle rule (B1) is deterministic, while the one determined by (B2) is stochastic since the sequence used in (21) is stochastic. Second, another difference between the two cycle rules is that (B1) allows , while in (B2) is at least . In other words, the cycle length for (B1) may be equal to one, but the one for (B2) is at least two. Third, the cycle length for both rules depends on the cycle index . Hence, even though (B1) is deterministic, the length of the cycles generated by it changes with .
Throughout our presentation, SCPB based on cycle rule (B1) (resp., (B2)) is referred to as SCPB1 (resp., SCPB2).
3 Complexity results for SCPB1
This section presents the main complexity results for SCPB1 under various assumptions and discusses the relationship between SCPB1 and RSA.
3.1 Convergence rate bounds of SCPB1 with bounded
The following result states a general convergence rate result for SCPB1 that holds for bounded and for any choice of input in SCPB1 and constant as in (B1). The proof is postponed to Subsection 5.2.
Theorem 3.1.
We now make some remarks about Theorem 3.1. First, its overall iteration complexity is given by , which is its outer iteration complexity, times its inner iteration complexity given in (23). Second, (24) gives a bound on the expected primal gap in terms of , and hence provides a sufficient condition on how large should be chosen for SCPB1 to generate a desired approximate solution.
Even though Theorem 3.1 holds for the general case in which is unbounded, all its corollaries stated in this subsection and Subsection 3.3 assume that is bounded. For any given , the following result describes a convergence rate bound for SCPB1 with a specific choice of .
Corollary 3.2.
Assume that Assumption 2.1 holds and has a finite diameter . Let a pair be given and consider SCPB1 with input and in (B1) given by
(25) |
where is an estimate for the (usually unknown) pair . Then, the following statements hold:
a) we have
where
(26)
b) its expected overall iteration complexity (up to a logarithmic term) is
(27)
Proof: a) Using (24), the definitions of and in (26), and the definitions of and in (25), we get
where in the second inequality we have used and the definitions of and while in the last inequality we have used the relation for every .
b) It follows from Theorem 3.1(a) that the overall complexity (up to a logarithmic term) is , which in turn is (27) in view of as in (25).
We now argue that the overall iteration (and sample) complexity of the SCPB1 variant of Corollary 3.2 for finding an -solution of (1), i.e., one that satisfies , is optimal for a large range of prox stepsizes. Indeed, setting where
it follows from the above result that . Since , we conclude from (27) that the expected overall iteration complexity of SCPB1 is bounded by
In particular, if and , or equivalently, and , then the above complexity reduces to
Moreover, under the assumption that the prox stepsize lies in the interval , the above complexity bound further reduces to , which is known to be the optimal complexity of finding an -solution for any instance of (1) such that its corresponding pair satisfies the condition that and (e.g., see [24]).
3.2 Relationship between SCPB1 and the RSA method of [22]
We argue in this subsection that RSA can be viewed as a special instance of SCPB1 where every cycle contains only one index (or equivalently, an instance for which every iteration is serious).
Recall that the RSA method of [22], which is developed under the assumption that is the indicator function of a nonempty compact convex set , with a given initial point and constant prox stepsize recursively computes its iteration sequence according to
x_{t+1} = Π_X( x_t − γ G(x_t, ξ_t) ),  t = 1, …, N,   (28)
For the purpose of reviewing the iteration complexity of RSA, assume that is an upper bound on the diameter of and is an upper bound on . For , let denote the average of the iterates , i.e.,
x̄_N = (1/N) Σ_{t=1}^{N} x_t.   (29)
It is shown in equation (2.24) of [22] that if the stepsize is chosen as
γ = θ D_X / (M √N)   (30)
for some fixed scalar , then the ergodic iterate with satisfies
(31) |
Hence, for a given tolerance , the smallest satisfying has the property that
(32) |
It turns out that RSA is a special case of SCPB1 with in (B1) given by
Indeed, it follows from the above choice of and as in (30) with replaced by that
and hence that satisfies (B1). Thus, every cycle performs only one iteration, namely its serious one; in particular, every iteration of this SCPB1 variant is a serious one and is its total number of iterations.
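For comparison with the SCPB sketch, here is a minimal Python sketch of the RSA recursion (28)-(29) on an illustrative instance with X the unit ball; the constants D, M, and theta below are rough stand-ins, not the calibrated values of Section 6.

```python
import numpy as np

rng = np.random.default_rng(2)

d, N = 5, 500
A = rng.normal(size=(d, d))

def sgrad(x):
    """Stochastic subgradient oracle for f(x) = E[0.5*||A x - xi||^2]."""
    xi = rng.normal(size=d)
    return A.T @ (A @ x - xi)

def proj_ball(y):
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

D = 2.0                                  # diameter of the unit ball
M = 10.0 * np.linalg.norm(A, 2)          # crude bound on subgradient norms
theta = 1.0
gamma = theta * D / (M * np.sqrt(N))     # constant stepsize, as in (30)

x = np.zeros(d)
avg = np.zeros(d)
for t in range(N):
    x = proj_ball(x - gamma * sgrad(x))  # projected SA step, as in (28)
    avg += (x - avg) / (t + 1)           # running ergodic average, as in (29)
```

Each iteration performs exactly one projection and immediately moves the center, which is the sense in which RSA corresponds to SCPB1 with one-iteration cycles.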
3.3 A practical SCPB1 variant
From a computational point of view, the choice of in Corollary 3.2 usually results in the quantity , and hence the inner complexity bound (23), being large. The following result provides a practical variant of SCPB1 with an alternative choice for and which partially remedies this drawback by forcing to be constant. A nice feature of this variant is that it is able to choose large prox stepsizes without losing the optimality of its overall iteration complexity.
Corollary 3.3.
Assume that Assumption 2.1 holds and has a finite diameter . Let positive integer and constant be given, and define
(33) |
where is an estimate for the pair . Then, the following statements about SCPB1 with input and as above hold:
a)
b) the number of iterations within the -th cycle is bounded by
and hence, up to a logarithmic term, is ;
c) its expected overall iteration complexity, up to a logarithmic term, is .
Proof: a) Using (24), the definitions of and in (26), and the definitions of and in (33), we get
(35) |
where in the second inequality we have used and the definition of . It follows from (35) and the fact that that
c) This statement follows from (b) and the fact that SCPB1 has cycles.
We now make two remarks about the practical SCPB1 variant of Corollary 3.3. First, although in (33) depends neither on nor , the choice of depends on both of these estimates. On the other hand, SCPB2 will be analyzed in Section 4 where depends neither on nor , and depends on but not . Second, Corollary 3.3 (see its statement (b)) implies that the number of iterations within a cycle of SCPB1 is bounded (up to a logarithmic term) by the a priori (user specified) constant . Thus, the SCPB1 variant of Corollary 3.3 can be viewed as an extended version of RSA where the number of iterations within a cycle can be larger than one, instead of being equal to one as in RSA (see the discussion in the second paragraph of Subsection 3.2).
In the remaining part of this subsection, we compare the performance of RSA and SCPB1 when both use the prox stepsize as in (33) for some relatively large scalar . For this discussion, we assume that their performance measure is the overall iteration complexity (or sample complexity) for finding an -solution of (1). For simplicity, we assume as in Subsection 3.2 that is the indicator function of a nonempty closed convex set and that the estimation pair satisfies and , or equivalently, and .
We first consider the performance (see the previous paragraph) of SCPB1. It follows from Corollary 3.3(a) that there exists such that is an -solution of (1). Hence, it follows from Corollary 3.3(c) that the performance of SCPB1 is . In conclusion, SCPB1 with the above choice of is able to choose a prox stepsize as in (33) with a large constant while preserving its optimal performance. We now consider the performance of RSA. It follows from (30) and (32) with that the performance of RSA is . In conclusion, while both RSA and SCPB1 with prox stepsize as in (33) have their own performance guarantee, the one for RSA becomes worse than that of SCPB1 as becomes large.
Finally, although the SCPB1 variant of Corollary 3.3 chooses as in (33), our numerical experiments use a more aggressive prox stepsize, i.e.,
where . It is possible to show that conclusions similar to those of Corollary 3.3 (with slightly modified bounds) hold for this choice of . Moreover, SCPB1 with this prox stepsize substantially outperforms RSA on the (relatively small number of) instances considered in the computational experiments of Section 6.
4 Complexity results for SCPB2
This section provides the main complexity results for SCPB2.
The following result is an analogue of Theorem 3.1 and describes the convergence rate bound for SCPB2 without imposing any condition on its input and the constant in (B2). The proof is postponed to Subsection 5.3.
Theorem 4.1.
Following a similar argument as in the paragraph following Corollary 3.2, it can be shown that SCPB2 has optimal iteration complexity (up to a logarithmic term) for finding an -solution of (1) for a large range of prox stepsizes.
The following result is the analogue of Corollary 3.3 when SCPB2 is implemented using cycle rule (B2) instead of (B1). As in Corollary 3.3, it forces the quantity to be constant but, in contrast to the choice of of Corollary 3.3, its choice for does not depend on an estimate for .
Corollary 4.2.
Assume that Assumption 2.1 holds and has a finite diameter . Let positive integers and constant be given, and define
(37) |
where is an estimate for and is an estimate for . Then, the following statements for SCPB2 with input based on cycle rule (B2) with , , and as above hold:
a)
b) the expected number of iterations within the -th cycle is bounded by
and hence, up to a logarithmic term, is ;
c) its expected overall iteration complexity, up to a logarithmic term, is .
c) This statement follows from (b) and the fact that SCPB2 has cycles.
5 Proofs of main results in Sections 3 and 4
This section contains three subsections. The first one presents some technical results that apply to the SCPB scheme regardless of how the index is chosen in step 2. The second and third ones are then devoted to the proofs of Theorems 3.1 and 4.1, respectively.
5.1 Proofs of some technical results
We assume Assumption 2.1 holds throughout this subsection. Recall that for every
and for positive integers we denote by the portion of realizations of the r.v. over the iterations . For convenience, in what follows we set
(39) |
For every and , define
(42) |
and
(45) |
where and are as in (5) and and is as in (22). It is easy to see from (14), (15), and the above definition of that
(46) |
The first result below provides some basic relations which are often used in our analysis.
Lemma 5.1.
For every , we have
(47) | ||||
(48) | ||||
(49) |
Proof: Observe that is a function of and not of . Hence, is independent of in view of the fact that is chosen in step 1 of SCPB to be independent of . Using the relation (see (A2)), it follows that
which is identity (47). It then suffices to show that, for any given , (48) and (49) hold for every in the -th cycle, i.e., . We show this by induction on where is the iteration count. If , then it follows from (18), (42), and (47) that
and from (45) with , (22), and Assumption 2.1 (A1)-(A2) that for every ,
Let be such that and (48) and (49) hold for . Then, it follows from (18), (42), the fact that (48) holds for , and the convexity of , that
and from (6), (45) and the fact that (49) holds for , that
It is worth noting that the proof of (49) is strongly based on the fact that is a convex combination of affine functions whose expected values are underneath . Moreover, this inequality would not necessarily be true if were for example the maximum of functions as just described.
The next result provides a useful estimate for the quantity .
Lemma 5.2.
For every such that , we have:
(50) |
Proof: Using the definitions of and in (1) and (5), respectively, we have
where the inequality is due to the convexity of . The above inequality, the Cauchy-Schwarz inequality, the triangle inequality and (4) then imply (50).
The technical result below introduces a key quantity, namely, scalar below, and provides a useful recursive relation for it over the iterations of the -th cycle. This recursive relation will then be used in Proposition 5.6 to show that the at the end of the -th cycle, namely , is relatively small in expectation.
Lemma 5.3.
Proof: Let with be given. It follows from the definitions of and in (45) and (46), respectively, that
(54) |
where for the first inequality we used the fact that and for with , while the second inequality is due to the facts that is -strongly convex and is the minimizer of (see (46)). Using (8), (50) and (54), we have
where the last inequality is obtained by minimizing its left hand side with respect to . The above inequality, the fact that for every , and the definition of in (51) imply that
Rearranging the above inequality and using the definition of in (51), identity (42), and the fact that , we then conclude that
which, in view of the definition of in (51), implies (52). Inequality (53) follows immediately from (52) and an induction argument.
The following technical result provides some useful bounds on .
Lemma 5.4.
For every and , we have
(55) |
Proof: We first show that for every and ,
(56) |
Fix . Since becomes deterministic when is given, it follows from Assumption 2.1 (A3) with and the definition of in (39) that
Now, if , then the above relations together with the law of total expectation imply that and
We have thus shown that (56) holds for any .
The first inequality in (55) then follows from the definition of in (51). The second inequality in (55) follows from the first one and the law of total expectation.
The next technical result provides a bound on the initial for the -th cycle, namely , in expectation.
Proof: Let
(57) |
Using the definitions of and in (51) and (46), respectively, and (42) with (see (20)), we have
(58) |
where the inequality is due to Lemma 5.2. Maximizing the right hand side of the last inequality above with respect to and using the relation for every , we obtain
(59) |
Moreover, (58) and the fact that also imply that
(60) |
It follows from (39), (57), and Assumption 2.1 (A2) and (A3) that
Hence, the lemma follows by taking expectations of (59) and (60) and using the above three relations.
We emphasize that all results developed in this subsection hold regardless of the way is chosen in step 2. On the other hand, the results in the following two subsections strongly use the fact that is chosen according to either (B1) or (B2).
5.2 Proof of Theorem 3.1
This subsection is devoted to the proof of Theorem 3.1. The following result derives a bound in expectation for when is chosen according to cycle rule (B1).
Proposition 5.6.
In addition to Assumption 2.1, assume also that (B1) holds. Then, for every , we have
(61) |
where is as in (51).
Proof: Fix . It follows from cycle rule (B1) and inequality (53) with that
(62) |
In view of (B1) and (20), it follows that and are both deterministic. Hence, taking expectation of the above inequality and using the last inequality in (55), cycle rule (B1), and Lemma 5.5, we conclude that
and hence that (61) holds.
It is worth noting that rule (B1) plays an important role in showing that the expectation of the first term on the right-hand side of (62) is . On the other hand, the proof of the bound for the expectation of the second term on the right-hand side of (62) does not depend on rule (B1) but on the fact that the expectation of is small, namely, (see (55) and its definition in (51)). In conclusion, rule (B1) provides a way to estimate the magnitude of , a quantity which by itself is intractable to compute exactly.
In the remaining part of this subsection, we analyze the behavior of the “outer” sequence of iterations generated in step 2 of SCPB. For this purpose, define
(63) |
and
(64) |
In what follows, we make some remarks about the above “outer” sequences which follow as immediate consequences of the results developed above. In view of the above definitions, relation (46) with , and the way the prox-centers are updated in (11), we have that
(65) |
Moreover, it follows from (48) and (49) with that
(66) |
and
(67) |
The following result describes an important recursive formula for the outer sequence generated by SCPB.
Lemma 5.7.
In addition to Assumption 2.1, assume also that (B1) holds. Then, for every and , we have
where
(68) |
Proof: First observe that (63), (64), and the definitions of and in (46) and (51), respectively, imply that (61) is equivalent to
(69) |
It follows from (65) and the fact that the objective function of (65) is -strongly convex that for every ,
and hence that
Taking expectation of the above inequality and using (68) and (69), we conclude that
which, in view of (66) and (67), immediately implies the conclusion of the lemma.
We are now in a position to prove Theorem 3.1.
Proof of Theorem 3.1: a) This statement directly follows from (8), cycle rule (B1), the definition of , and the facts that and .
b) Using the definition of in (19), Lemma 5.7 with , and the facts that and
we then conclude that for every ,
(70) |
where the first inequality is due to the convexity of . It is also easy to see from Lemma 5.7 that (70) holds for . Then (24) follows from (70) and the fact that . ∎
The above result strongly uses the fact that is bounded in view of the last inequality of its proof.
We end this subsection by describing a complexity bound for a slightly modified SCPB1 variant which is derived without assuming that is bounded. We start by describing the two changes one needs to make to SCPB1 in order to obtain the aforementioned variant. First, instead of the point as in (19), it outputs
(71) |
Second, instead of computing as in (B1), it sets as the smallest integer greater than or equal to such that .
Theorem 5.8.
Assume that Assumption 2.1 holds and let denote the distance of the initial point to the optimal set , i.e.,
(72) |
Then, the aforementioned SCPB1 variant satisfies the following statements:
a) the number of iterations within each cycle is bounded by
b) there holds
Proof: (a) The proof of (a) is similar to that of Theorem 3.1(a).
(b) First note that the new way of choosing and slightly different arguments than the ones used in the proofs of Proposition 5.6 and Lemma 5.7 imply that for every ,
Using the fact that is convex and the definition of in (71), and summing the above inequality from to , we conclude that for every ,
The statement now follows from the above inequality with where is as in (72).
5.3 Proof of Theorem 4.1
The following result, which is an analogue of Proposition 5.6 with cycle rule (B1) replaced by (B2), derives a bound on in expectation.
Proposition 5.9.
In addition to Assumption 2.1, assume also that cycle rule (B2) is used. Then, for every , we have
(73) |
Proof: Using (53) with , we conclude that
(74) |
where the second inequality is due to cycle rule (B2), the observation that (B2) is equivalent to , and in view of (8). Noting that becomes deterministic once is given, taking expectation of the above inequality conditioned on , rearranging the terms, and using the first inequality in (55), we have
Taking expectation of the above inequality with respect to , rearranging the terms, and using the second inequality (55) and the fact that by (8), we conclude that
and hence that (73) holds.
It is worth noting that rule (B2) plays an important role in showing that the first term on the right-hand side of (74) is .
The following result is an analogue of Lemma 5.7 with (B1) replaced by (B2).
Lemma 5.10.
Proof: First observe that the definitions of and in (46) and (51), respectively, imply that (73) is equivalent to
(75) |
The remaining part of the proof is now similar to that of Lemma 5.7 except that (75) is used in place of (69).
We are now ready to prove Theorem 4.1.
6 Numerical experiments
In this section, we report the results of numerical experiments comparing the performance of RSA and our two SCPB variants on three stochastic programming problems, namely: a stochastic utility problem given in Section 4.2 of [22] and the two two-stage nonlinear stochastic programs considered in the numerical experiments of [10]. These three problems are of the form (1)-(2) with the indicator function of a convex compact set with diameter . Therefore, the problems can be written as
(76) |
The implementations are coded in MATLAB, using the Mosek optimization library [1] to generate the stochastic oracles and , and run on a laptop with an Intel i7 1.80 GHz processor. For solving subproblem (15), we do not use Mosek but instead implement algorithms for projection onto . In particular, we follow [36] to implement an exact algorithm for projection onto the unit simplex.
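For concreteness, here is a standard sort-based exact projection onto the unit simplex in Python; it is a sketch of the kind of routine used for subproblem (15), not necessarily the exact algorithm of [36].

```python
import numpy as np

def proj_simplex(y):
    """Exact Euclidean projection of y onto {x : x >= 0, sum(x) = 1},
    via the classical sort-and-threshold algorithm (O(n log n))."""
    u = np.sort(y)[::-1]                        # sort in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, y.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]  # last coordinate kept positive
    theta = (css[rho] - 1.0) / (rho + 1.0)      # shift making the sum one
    return np.maximum(y - theta, 0.0)

x = proj_simplex(np.array([0.6, 0.1, 1.2, -0.3]))
# x is feasible: nonnegative entries summing to one.
```

The returned point clips every coordinate below the threshold theta to zero, which is what makes the projection exact rather than approximate.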
Parameters for Robust Stochastic Approximation. Robust Stochastic Approximation, denoted by E-SA (Euclidean Stochastic Approximation) in what follows, is described in Section 2.2 of [22]; as explained there, in the terminology of Section 2.3 of [22], it is mirror descent robust SA with the Euclidean setup. In the notation of [22], for E-SA run for iterations we output (this is given by (29) with and corresponds to the usual output of RSA), where is computed at iteration taking the constant steps given in (2.23) of [22] by
where is the diameter of the feasible set in (76). (Parameter is denoted by in [22].) As in [22], we take , which was calibrated in [22] using an instance of the stochastic utility problem. For each problem, the value of is estimated as in [22], taking the maximum of over 10,000 calls to the stochastic oracle at randomly generated feasible solutions.
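In words, E-SA performs projected stochastic subgradient steps with a constant step size and outputs the average of the iterates. A minimal pure-Python sketch on the simplex (the quadratic toy oracle and the step-size value below are ours, chosen for illustration rather than via (2.23) of [22]):

```python
import random

def project_simplex(v):
    # sort-based exact projection onto {x >= 0, sum(x) = 1} (cf. [36])
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u):
        cumsum += uj
        t = (cumsum - 1.0) / (j + 1)
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def esa(oracle, x0, gamma, n_iters, rng):
    """Euclidean (robust) SA with constant step gamma; returns the averaged iterate."""
    x = list(x0)
    avg = [0.0] * len(x0)
    for t in range(n_iters):
        g = oracle(x, rng)                                   # stochastic subgradient
        x = project_simplex([xj - gamma * gj for xj, gj in zip(x, g)])
        avg = [a + (xj - a) / (t + 1) for a, xj in zip(avg, x)]  # running mean
    return avg

# toy problem: minimize E[0.5 * ||x - xi||^2] over the simplex, E[xi] = c
c = [0.7, 0.2, 0.1]
def oracle(x, rng):
    xi = [ci + 0.01 * rng.gauss(0.0, 1.0) for ci in c]       # noisy observation
    return [xj - xij for xj, xij in zip(x, xi)]              # grad of 0.5||x - xi||^2

rng = random.Random(0)
xbar = esa(oracle, [1 / 3, 1 / 3, 1 / 3], gamma=0.5, n_iters=2000, rng=rng)
```

Since c lies on the simplex, the averaged iterate `xbar` should be close to c; the averaging is what makes the method robust to the noise in the oracle.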
Remark 6.1.
In [22], E-SA generates approximately candidate solutions with , and an additional sample is used to estimate the objective at these candidate solutions in order to choose the best one. In [22], the computational effort required by this postprocessing is not reflected in the experiments. We believe, however, that a fair comparison between E-SA with this set of candidate solutions and SCPB should take this effort into account; even without this additional computational burden, SCPB is already faster than E-SA in our experiments.
Parameters for SCPB1. SCPB1 uses parameters , , , and given by
where constant and constant was calibrated with the stochastic utility problem (see below). We take in all our experiments. Constant was estimated as for RSA, taking the maximum of over 10,000 calls to the stochastic oracle at randomly generated feasible solutions.
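This Monte Carlo estimation (the maximum of the oracle's subgradient norm over many calls at randomly generated feasible points) can be sketched as follows; the toy oracle and the simplex sampler below are illustrative stand-ins for the paper's instances, and we use 1,000 calls instead of 10,000 to keep the example cheap:

```python
import math
import random

def estimate_sigma(oracle, sample_point, n_calls, rng):
    """Monte Carlo upper estimate of the oracle constant:
    max of ||G(x, xi)|| over n_calls calls at random feasible points x."""
    best = 0.0
    for _ in range(n_calls):
        x = sample_point(rng)              # random feasible point
        g = oracle(x, rng)                 # stochastic subgradient at x
        best = max(best, math.sqrt(sum(gi * gi for gi in g)))
    return best

# illustrative setup on the unit simplex in R^3 (not the paper's instances)
def sample_simplex(rng):
    e = [rng.expovariate(1.0) for _ in range(3)]   # Dirichlet(1,1,1) sample
    s = sum(e)
    return [ei / s for ei in e]

def oracle(x, rng):
    xi = [rng.gauss(0.0, 1.0) for _ in range(3)]
    return [xj - xij for xj, xij in zip(x, xi)]    # toy stochastic gradient

rng = random.Random(1)
sigma_hat = estimate_sigma(oracle, sample_simplex, 1000, rng)
```

The resulting `sigma_hat` is then plugged into the parameter formulas above in place of the unknown constant.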
Parameters for SCPB2. SCPB2 uses parameters , , , and given by
where constant and constant was calibrated with the stochastic utility problem (see below). We take in all our experiments. Constant was again estimated as for RSA, taking the maximum of over 10,000 calls to the stochastic oracle at randomly generated feasible solutions.
Notation in the tables. In what follows, we denote by
- the design dimension of an instance;
- the sample size used to run the methods; this is also the number of iterations of E-SA;
- the number of SCPB outer iterations;
- Obj the empirical mean (77) of at , based on a sample of of size , which provides an estimate of ; the empirical means are computed with being the final iterate output by the algorithm and ;
- CPU the CPU time in seconds.
6.1 A stochastic utility problem
Our first set of experiments was carried out with the stochastic utility problem given by
where
(78) |
are independent and is piecewise convex with 10 breakpoints, all located on . (Although the same problem class and a similar procedure to build were used in the experiments of Section 4.2 in [22], we could not find in that reference the precise choices of , and the optimal values of our instances differ from those of the instances in [22]. Also, contrary to [22], we use the same function for all instances; the instances differ only in the problem dimension .)
Calibration of and . We run SCPB1 and SCPB2 with 7 values of and on four instances of the stochastic utility problem, for outer iterations (i.e., cycles) and , , , and . For this experiment, the values of and , the corresponding stepsizes , and the optimal values computed by SCPB1 and SCPB2 are reported in Table 1. We found that slightly outperforms the other choices of for SCPB1. Surprisingly, SCPB2 was not affected by changes in , and all tested values yielded a good approximate optimal value with similar CPU times. This value and the same value will be used for all runs of SCPB on all problem instances (the stepsizes in [22] were calibrated similarly, on the basis of an instance of the stochastic utility problem).
| | 0.01 | 0.1 | 1 | 10 | 50 | 150 | 1000 |
|---|---|---|---|---|---|---|---|
| stepsize | 0.0017 | 0.017 | 0.09 | 0.26 | 1.7 | | |
| Obj1 | 14.2795 | 10.6439 | 10.1819 | 10.1811 | 10.1811 | 10.1838 | 10.1937 |
| Obj2 | 10.1937 | 10.1937 | 10.1937 | 10.1937 | 10.1937 | 10.1937 | 10.1937 |
| stepsize | 0.0012 | 0.012 | 0.0585 | 0.18 | 1.17 | | |
| Obj1 | 14.6307 | 11.1325 | 10.0510 | 10.0504 | 10.0509 | 10.0523 | 10.0710 |
| Obj2 | 10.0710 | 10.0710 | 10.0710 | 10.0710 | 10.0710 | 10.0710 | 10.0710 |
| stepsize | 0.0084 | 0.0418 | 0.1255 | 0.8364 | | | |
| Obj1 | 13.7451 | 11.0836 | 10.0365 | 10.0363 | 10.0364 | 10.0375 | 10.0613 |
| Obj2 | 10.0613 | 10.0613 | 10.0613 | 10.0613 | 10.0613 | 10.0613 | 10.0613 |
| stepsize | 0.0079 | 0.0397 | 0.119 | 0.793 | | | |
| Obj1 | 14.0830 | 11.3370 | 10.0228 | 10.0228 | 10.0231 | 10.0237 | 10.0540 |
| Obj2 | 10.0540 | 10.0540 | 10.0540 | 10.0540 | 10.0540 | 10.0540 | 10.0540 |
We now run E-SA, SCPB1, and SCPB2 on three instances , , and of the stochastic utility problem with , , and , respectively. For SCPB1 and SCPB2, we used outer iterations. The results are reported in Table 2; several comments on them are in order.
- For SCPB, approximate solutions can only be computed at the end of every cycle; namely, at the end of the -th cycle, at iteration , we can compute the approximate solution
For a given value of in Table 2, the approximate objective value Obj reported for E-SA is the empirical mean of at the approximate solution (where the 's are computed along iterations of E-SA), while for SCPB the reported value Obj is the empirical mean of at the approximate solution where
(since a cycle may not end at iteration ).
- Each iteration of E-SA and SCPB takes a similar amount of time (in both cases we evaluate an inexact prox-operator at some point), so for a given sample size the CPU times of E-SA and SCPB are similar.
- For all instances, SCPB computes a good approximate optimal value much more quickly than E-SA, and the objective decreases much more slowly with E-SA. We also refer to Table 7, which reports for and the distance between the SCPB approximate optimal value and the E-SA approximate value as a percentage of the SCPB decrease in the objective, for several sample sizes. This table confirms the slower convergence of E-SA on these instances.
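Since the displayed formulas above lost their math in this copy, the following sketch only illustrates the bookkeeping: assuming the approximate solution reported at a cycle end is the running average of the iterates generated so far (consistent with the averaged output rule used for E-SA; the helper name is ours), the candidate solutions available at the end of each cycle can be collected as:

```python
def cycle_end_solutions(iterates, cycle_ends):
    """Return, for each cycle-end index i in cycle_ends (1-based, increasing),
    the average of the first i iterates -- the candidate solution available
    once that cycle is complete."""
    sols, run = [], [0.0] * len(iterates[0])
    for k, x in enumerate(iterates, start=1):
        run = [r + xj for r, xj in zip(run, x)]   # running coordinate-wise sum
        if k in cycle_ends:
            sols.append([r / k for r in run])     # average over the first k iterates
    return sols
```

This is why, for a sample size that falls strictly inside a cycle, the reported SCPB value corresponds to the last completed cycle rather than to iteration itself.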
ALG. | N | Obj | CPU | Obj | CPU | Obj | CPU
---|---|---|---|---|---|---|---
E-SA | 10 | 14.6449 | 0.001 | 14.6892 | 0.05 | 15.4 | 0.05 |
50 | 14.6322 | 0.006 | 14.6813 | 0.07 | 14.7 | 0.35 | |
100 | 14.6169 | 0.01 | 14.6725 | 0.1 | 14.6 | 0.74 | |
200 | 14.5880 | 0.03 | 14.6574 | 0.2 | 14.6 | 1.44 | |
1000 | 14.3992 | 0.1 | 14.5604 | 0.5 | 14.3 | 17.2 | |
12.9656 | 1.28 | 12.7410 | 3.7 | 14.2 | 80.3 | ||
- | - | - | - | 13.2 | 860.1 | ||
SCPB1 | 10 | 13.9539 | 0.003 | 13.7763 | 0.008 | 14.7 | 0.08 |
50 | 13.6527 | 0.01 | 13.4672 | 0.02 | 14.4 | 0.39 | |
100 | 13.5986 | 0.02 | 13.5346 | 0.05 | 14.2 | 0.9 | |
200 | 13.5349 | 0.03 | 13.4686 | 0.08 | 14.3 | 1.6 | |
1000 | 13.0370 | 0.2 | 12.8376 | 1.6 | 14.2 | 12.5 | |
- | - | - | - | 12.7 | 72.3 | ||
SCPB2 | 10 | 13.5968 | 0.002 | 13.7777 | 0.01 | 14.2 | 0.06 |
50 | 12.9421 | 0.008 | 12.6959 | 0.05 | 13.2 | 0.7 | |
100 | 12.1317 | 0.02 | 11.7614 | 0.09 | 12.2 | 1.5 | |
200 | 11.3640 | 0.03 | 11.3698 | 0.2 | 11.6 | 3.4 | |
1000 | 11.5681 | 0.1 | 11.5572 | 0.9 | 11.2 | 25.4 |
6.2 A first two-stage stochastic program
Our second test problem is the nonlinear two-stage stochastic program
(79) |
where the second stage recourse function is given by
(80) |
Problem (79)-(80) is of form (1)-(2), where with given by (80), and where is the indicator function of the set , where given by (78) is the unit simplex. For the computation of stochastic subgradients for problem (79), we refer to Lemma 2.1 in [8]. We take and consider a Gaussian random vector in for . We consider two instances of problem (79), with and . For each instance, the components of are independent, with means and standard deviations randomly generated in the intervals and , respectively. The components of are generated randomly in the interval .
We run E-SA, SCPB1, and SCPB2 on our two instances and with and , respectively. For SCPB1 and SCPB2, we used outer iterations. The results are reported in Table 3. The conclusions are similar to those for the stochastic utility problem: SCPB computes a good approximate optimal value much more quickly than E-SA, and the objective decreases much more slowly with E-SA. We again refer to Table 7, which reports the distance between the SCPB approximate optimal value and the E-SA approximate value as a percentage of the SCPB decrease in the objective, for several sample sizes. This percentage is again above 90% for almost all instances and sample sizes.
ALG. | N | Obj | CPU | Obj | CPU
---|---|---|---|---|---
E-SA | 10 | 24.3477 | 0.13 | 7.5134 | 0.5 |
50 | 24.2378 | 0.6 | 7.5018 | 2.5 | |
100 | 24.0816 | 1.2 | 7.4868 | 5.0 | |
200 | 23.7947 | 3.0 | 7.4566 | 10.1 | |
500 | 22.9185 | 8.8 | 7.3790 | 25.9 | |
1000 | 21.5328 | 24.6 | 7.2587 | 55.5 | |
8.5482 | 377 | 5.1339 | 1282 | ||
5.7358 | 1555.6 | 3.9193 | 6147 | ||
SCPB1 | 10 | 11.5047 | 0.2 | 3.0063 | 1.3 |
50 | 9.2959 | 0.6 | 2.7269 | 3.2 | |
100 | 7.2031 | 1.5 | 2.4914 | 6.9 | |
200 | 6.4626 | 2.9 | 2.2899 | 13.0 | |
500 | 5.3700 | 7.5 | 2.0635 | 39.2 | |
1000 | 5.0582 | 15.1 | 1.9609 | 70.4 | |
SCPB2 | 10 | 8.6325 | 0.15 | 3.3113 | 0.6 |
50 | 7.8378 | 0.7 | 2.2478 | 3.2 | |
100 | 7.8602 | 1.5 | 2.1929 | 6.4 | |
200 | 6.5839 | 3.0 | 2.2913 | 13.4 | |
500 | 6.0361 | 7.4 | 1.9974 | 33.7 | |
1000 | 6.1989 | 14.9 | 1.8058 | 65.1 |
Table 4 reports the impact of overestimating (taking 10 times the Monte Carlo estimate) and underestimating (taking 10 times smaller than the Monte Carlo estimate). In this experiment, SCPB is essentially unaffected by a bad estimate of , while E-SA converges much more slowly when is overestimated. Additionally, Table 5 reports the computational results for all methods applied to a variant of the two-stage stochastic program (79)-(80) of size , where the feasible set of the first-stage problem is replaced by the larger simplex . These results show that the SCPB variants are more efficient than E-SA on this specific instance. Moreover, SCPB is not much affected by an overestimation of the diameter .
ALG. | N | Obj | CPU | Obj | CPU | Obj | CPU
---|---|---|---|---|---|---|---
E-SA | 2000 | 9.8 | 29.7 | 22.1 | 32.5 | 10.5 | 36.2 |
SCPB1 | 2000 | 5.1 | 36.3 | 5.2 | 33.8 | 5.1 | 42.2 |
SCPB2 | 2000 | 5.5 | 31.2 | 4.6 | 33.5 | 5.6 | 37.2 |
ALG. | N | Obj | CPU | Obj | CPU
---|---|---|---|---|---
E-SA | 2000 | 33.6 | 33.8 | ||
10000 | 155.5 | 160.8 | |||
SCPB1 | 2000 | 35.4 | 35.6 | ||
SCPB2 | 2000 | 34.2 | 30.8 |
Finally, we report the lengths of the SCPB cycles along iterations in the left plot of Figure 1. A few comments on the cycle lengths of SCPB1 and SCPB2 are in order:
- The cycles are much longer with SCPB1.
- For SCPB1, the sequence (and therefore the length of the cycles) can be computed independently of the sequence , before running SCPB, once constant is known. It is worth mentioning that we have an analytic expression for as a function of , , , and , namely if and
otherwise. Therefore, the cycle length of SCPB1 is a piecewise constant nondecreasing function of the outer iteration , and the number of consecutive iterations with constant cycle length increases along the cycles.
- For SCPB2, the cycles are in general short with small variability: the average cycle length is 2.2 for the instance with and 2.1 for the instance with .
6.3 A second two-stage stochastic program
Our third test problem is the nonlinear two-stage stochastic program
(81) |
where cost-to-go function has nonlinear objective and constraint coupling functions and is given by
(82) |
Problem (81)-(82) is of form (1)-(2) where with given by (82) and where is the indicator function of set
For problem (81), we again refer to Lemma 2.1 in [8] for the computation of stochastic subgradients . We take and consider for a Gaussian random vector in , with the components of independent, with means and standard deviations randomly generated in the intervals and , respectively. The components of are generated randomly in the interval , and we take , and for .
We run E-SA, SCPB1, and SCPB2 on two instances and with and , respectively. For SCPB1 and SCPB2, we used outer iterations. The results are reported in Tables 6 and 7. The conclusions are similar to those for the stochastic utility problem: it still takes much longer for E-SA to compute a solution of a given accuracy. We also report in the right plot of Figure 1 the evolution of the cycle lengths along outer iterations; their behavior is similar to what was observed for the previous problem (79)-(80). For SCPB2 the cycles remain short at all iterations, with almost constant length.
ALG. | N | Obj | CPU | Obj | CPU
---|---|---|---|---|---
E-SA | 10 | 15182 | 0.2 | 18571 | 0.6 |
50 | 15108 | 0.9 | 18497 | 4.0 | |
100 | 15017 | 0.9 | 18405 | 7.4 | |
200 | 14836 | 1.8 | 18222 | 13.9 | |
1000 | 13481 | 3.4 | 16830 | 63.8 | |
99.8 | 16.2 | 177.5 | 6275 | ||
SCPB1 | 10 | 2981.5 | 0.2 | 4914 | 0.8 |
50 | 761.2 | 0.8 | 1574 | 2.8 | |
100 | 288.1 | 1.7 | 679 | 5.7 | |
200 | 64.8 | 2.9 | 191 | 10.2 | |
1000 | -4.38 | 14.3 | -7.87 | 55.5 | |
SCPB2 | 10 | 1400.9 | 0.13 | 2766 | 0.5 |
50 | 0.45 | 0.5 | 17.5 | 2.1 | |
100 | -4.38 | 1.0 | -7.88 | 4.1 | |
200 | -4.38 | 1.9 | -7.95 | 8.1 | |
1000 | -4.38 | 9.0 | -7.95 | 42.2 |
6.4 Summarizing performance indicators
The computational results reported in Subsections 6.1-6.3 show that SCPB computes a good approximate solution more quickly than E-SA. To quantify this speed-up over a fixed number of iterations , we compute the quantity
(83) |
associated with SCPBi, where is the empirical mean (see (77)) of with and equal to the initial point , and , , and are the empirical means of with and equal to the final iterates output by E-SA, SCPB1, and SCPB2, respectively. After iterations this percentage is above for most instances, which clearly shows that both variants of SCPB are faster than E-SA.
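Assuming (83) has the form suggested by the surrounding text -- the gap between the E-SA and SCPB final objective values, expressed as a percentage of the SCPB decrease from the common starting value -- the indicator can be computed as follows (the function name is ours):

```python
def speedup_percentage(obj_start, obj_esa, obj_scpb):
    """Quantity of form (83): the part of SCPB's objective decrease
    (obj_start - obj_scpb) that E-SA has failed to achieve, in percent."""
    return 100.0 * (obj_esa - obj_scpb) / (obj_start - obj_scpb)
```

For example, with a starting value of 10, an E-SA final value of 9, and an SCPB final value of 5, the indicator is 80%: E-SA has realized only one fifth of the decrease achieved by SCPB.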
Sample size | 10 | 50 | 100 | 200 | 1000 |
---|---|---|---|---|---|
, SCPB1 | 95.0 | 95.2 | 94.1 | 91.9 | 82.8 |
, SCPB2 | 96.6 | 97.2 | 97.5 | 97.2 | 90.9 |
, SCPB1 | 92.1 | 93.3 | 92.3 | 91.5 | 77.7 |
, SCPB2 | 92.1 | 95.8 | 96.8 | 96.7 | 93.5 |
, SCPB1 | 99.5 | 98.8 | 98.1 | 96.6 | 85.1 |
, SCPB2 | 99.6 | 99.0 | 98.0 | 96.5 | 84.2 |
, SCPB1 | 99.8 | 99.6 | 99.3 | 98.8 | 95.0 |
, SCPB2 | 99.8 | 99.6 | 99.3 | 98.8 | 95.4 |
, SCPB1 | 99.8 | 99.4 | 98.8 | 97.6 | 88.7 |
, SCPB2 | 99.9 | 99.4 | 98.8 | 97.6 | 88.7 |
, SCPB1 | 99.9 | 99.4 | 99.0 | 98.0 | 90.5 |
, SCPB2 | 99.9 | 99.5 | 99.0 | 98.0 | 90.5 |
7 Concluding remarks
This paper proposes two single-cut stochastic composite proximal bundle variants, called SCPB, for solving SCCO problem (1)-(2), where at each iteration a subproblem of form (3) is solved. The two SCPB variants, which differ in the way their cycle lengths are determined, are analyzed in Sections 3 and 4, respectively. More specifically, it is shown that both variants of SCPB with properly chosen parameters have optimal iteration complexity (up to a logarithmic term) for finding an -solution of (1) for a large range of prox stepsizes. Practical variants of SCPB which keep their cycle lengths bounded are also proposed, and numerical experiments demonstrating their excellent performance relative to the RSA method of [22] on the instances considered in this paper are reported.
Comparison with other methods: First, we have shown in Subsection 3.2 that RSA is a special case of SCPB1 which performs only one iteration per cycle. Second, it is worth noting that SCPB has a slight similarity with the stochastic dual averaging (SDA) method discussed in [25, 38] since both methods explore the idea of aggregating cuts into a single one. However, there are essential differences between the two methods, namely: 1) while SCPB updates the prox-centers whenever a serious iteration occurs, SDA uses a fixed prox-center, and hence only performs null iterations; and 2) SDA uses variable stepsizes which have to grow sufficiently large, while SCPB uses constant prox stepsizes. In summary, from the viewpoint of this paper, SDA is closest to the special case of SCPB with a single cycle and a sufficiently large prox stepsize; the difference between the latter two methods is that SDA allows the prox stepsizes within its single cycle to gradually become sufficiently large.
In summary, while RSA (resp., SDA) performs only serious (resp., null) iterations, SCPB performs a balanced mix of serious and null iterations. Hence, it is reasonable to conclude that SCPB lies between RSA and SDA.
Extensions. We finally discuss some possible extensions of our analysis in this paper. A first question is how to extend SCPB and the corresponding complexity analysis if instead of Assumption 2.1 (A3) we use the assumption that for every , we have
which is called a uniform -condition in [20]. A second natural question is how to extend SCPB and its complexity analysis when the prox stepsize and/or the parameter are allowed to change with the iteration count . Recalling that the prox stepsize is the only ingredient of the second variant of SCPB that depends on an estimate of , a third natural question is whether it is possible to develop an SCPB variant which adaptively chooses a (variable) prox stepsize without the need to know . A fourth question is whether it is possible to establish global convergence rate guarantees for proximal bundle methods based on two-cut or multiple-cut bundle models instead of the single-cut bundle models considered in this paper. Finally, SCPB can solve two-stage convex stochastic programs with continuous distributions under the assumption that the second-stage subproblems can be solved exactly (see, e.g., Subsections 6.2 and 6.3); it would be interesting to extend it to the setting of multistage stochastic convex problems with continuous distributions.
References
- [1] E. D. Andersen and K.D. Andersen. The MOSEK optimization toolbox for MATLAB manual. Version 9.2, 2019. https://www.mosek.com/documentation/.
- [2] A. Astorino, A. Frangioni, A. Fuduli, and E. Gorgone. A nonmonotone proximal bundle method with (potentially) continuous step decisions. SIAM Journal on Optimization, 23(3):1784–1809, 2013.
- [3] J. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer-Verlag, New York, 1997.
- [4] J.R. Birge and F.V. Louveaux. A multicut algorithm for two-stage stochastic linear programs. European Journal of Operational Research, 34:384–392, 1988.
- [5] M. Díaz and B. Grimmer. Optimal convergence rates for the proximal bundle method. SIAM Journal on Optimization, 33(2):424–454, 2023.
- [6] Y. Du and A. Ruszczyński. Rate of convergence of the bundle method. Journal of Optimization Theory and Applications, 173(3):908–922, 2017.
- [7] A. Frangioni. Generalized bundle methods. SIAM Journal on Optimization, 13(1):117–156, 2002.
- [8] V. Guigues. Convergence analysis of sampling-based decomposition methods for risk-averse multistage stochastic convex programs. SIAM Journal on Optimization, 26:2468–2494, 2016.
- [9] V. Guigues. Multistep stochastic mirror descent for risk-averse convex stochastic programs based on extended polyhedral risk measures. Mathematical Programming, 163:169–212, 2017.
- [10] V. Guigues. Inexact Stochastic Mirror Descent for two-stage nonlinear stochastic programs. Mathematical Programming, 187:533–577, 2021.
- [11] V. Guigues, A. Juditsky, and A. Nemirovski. Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optimization Methods & Software, 32:1033–1058, 2017.
- [12] V. Guigues, W. Tekaya, and M. Lejeune. Regularized decomposition methods for deterministic and stochastic convex optimization and application to portfolio selection with direct transaction and market impact costs. Optimization & Engineering, 21:1133–1165, 2020.
- [13] J.L. Higle and S. Sen. Stochastic Decomposition. Kluwer, Dordrecht, 1996.
- [14] A.J. King and R.T. Rockafellar. Asymptotic theory for solutions in statistical estimation and stochastic programming. Math. Oper. Res., 18:148–162, 1993.
- [15] K. C. Kiwiel. Efficiency of proximal bundle methods. Journal of Optimization Theory and Applications, 104(3):589–603, 2000.
- [16] A. J. Kleywegt, A. Shapiro, and T. Homem-de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.
- [17] C. Lemaréchal. An extension of davidon methods to non differentiable problems. In Nondifferentiable optimization, pages 95–109. Springer, 1975.
- [18] C. Lemaréchal. Nonsmooth optimization and descent methods. 1978.
- [19] J. Liang and R. D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. SIAM Journal on Optimization, 31(4):2955–2986, 2021.
- [20] J. Liang and R. D. C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 2023.
- [21] R. Mifflin. A modification and an extension of Lemaréchal’s algorithm for nonsmooth minimization. In Nondifferential and variational techniques in optimization, pages 77–90. Springer, 1982.
- [22] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19:1574–1609, 2009.
- [23] A. Nemirovski and D. Yudin. On Cezari’s convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Math. Dokl., 19, 1978.
- [24] A. Nemirovski and D.B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
- [25] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
- [26] W. de Oliveira, C. Sagastizábal, and C. Lemaréchal. Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Mathematical Programming, 148(1-2):241–277, 2014.
- [27] B.T. Polyak. New stochastic approximation type procedures. Automat. i Telemekh (English translation: Automation and Remote Control), 7:98–107, 1990.
- [28] B.T. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Contr. and Optim., 30:838–855, 1992.
- [29] H. Robbins and S. Monro. A stochastic approximation method. Annals of Math. Stat., 22:400–407, 1951.
- [30] A. Ruszczyński. Nonlinear optimization. Princeton university press, 2011.
- [31] A. Shapiro. Asymptotic analysis of stochastic programs. Ann. Oper. Res., 30:169–186, 1991.
- [32] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, 2009.
- [33] R.M. Van Slyke and R.J.-B. Wets. L-shaped linear programs with applications to optimal control and stochastic programming. SIAM Journal of Applied Mathematics, 17:638–663, 1969.
- [34] W. van Ackooij, V. Berge, W. de Oliveira, and C. Sagastizábal. Probabilistic optimization via approximate p-efficient points and bundle methods. Computers & Operations Research, 77:177–193, 2017.
- [35] B. Verweij, S. Ahmed, A. J. Kleywegt, G. Nemhauser, and A. Shapiro. The sample average approximation method applied to stochastic routing problems: a computational study. Computational Optimization and Applications, 24(2-3):289–333, 2003.
- [36] W. Wang and M. A. Carreira-Perpinán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. Available on arXiv:1309.1541, 2013.
- [37] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable optimization, pages 145–173. Springer, 1975.
- [38] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(88):2543–2596, 2010.