
A single cut proximal bundle method for
stochastic convex composite optimization

Jiaming Liang, Vincent Guigues, Renato D.C. Monteiro

Goergen Institute for Data Science and Department of Computer Science, University of Rochester, Rochester, NY 14620 (email: [email protected]). School of Applied Mathematics FGV/EMAp, 22250-900 Rio de Janeiro, Brazil (email: [email protected]). School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332 (email: [email protected]). This work was partially supported by AFOSR Grant FA9550-22-1-0088.
(July 18, 2022
1st revision: November 3, 2022
2nd revision: April 30, 2023
3rd revision: October 20, 2023)
Abstract

This paper considers optimization problems where the objective is the sum of a function given by an expectation and a closed convex function, and proposes stochastic composite proximal bundle (SCPB) methods for solving it. Complexity guarantees are established for them without requiring knowledge of parameters associated with the problem instance. Moreover, it is shown that they have optimal complexity when these problem parameters are known. To the best of our knowledge, this is the first proximal bundle method for stochastic programming able to deal with continuous distributions. Finally, we present computational results showing that SCPB substantially outperforms the robust stochastic approximation (RSA) method in all instances considered.

Keywords. stochastic convex composite optimization, stochastic approximation, proximal bundle method, optimal complexity bound.

AMS subject classifications. 49M37, 65K05, 68Q25, 90C25, 90C30, 90C60.

1 Introduction

The main goal of this paper is to propose and study the complexity of some stochastic composite proximal bundle (SCPB) variants to solve the stochastic convex composite optimization (SCCO) problem

\phi_{*}:=\min\left\{\phi(x):=f(x)+h(x):x\in\mathbb{R}^{n}\right\} (1)

where

f(x)=\mathbb{E}_{\xi}[F(x,\xi)]. (2)

We assume the following conditions hold: i) $f,h:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ are proper closed convex functions such that $\mathrm{dom}\,h\subseteq\mathrm{dom}\,f$; ii) a stochastic first-order oracle, which for every $x\in\mathrm{dom}\,h$ and almost every random vector $\xi$ returns $s(x,\xi)$ such that $\mathbb{E}[s(x,\xi)]\in\partial f(x)$, is available; and iii) for every $x\in\mathrm{dom}\,h$, $\mathbb{E}[\|s(x,\xi)\|^{2}]\leq\bar{M}^{2}$ for some $\bar{M}\in\mathbb{R}_{+}$.
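To make conditions i)-iii) concrete, the sketch below implements a stochastic first-order oracle for the illustrative instance $F(x,\xi)=|\langle a_{\xi},x\rangle-b_{\xi}|$; this instance and all names in the code are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xi(n=5):
    """Draw one realization xi = (a, b); here a ~ N(0, I_n), b ~ N(0, 1)."""
    return rng.standard_normal(n), rng.standard_normal()

def F(x, xi):
    """F(x, xi) = |<a, x> - b|, so f(x) = E[F(x, xi)] is finite and convex."""
    a, b = xi
    return abs(a @ x - b)

def s(x, xi):
    """Stochastic subgradient: sign(<a, x> - b) * a is a subgradient of
    F(., xi) at x, so E[s(x, xi)] lies in df(x); moreover
    E||s(x, xi)||^2 = E||a||^2 = n, so condition iii) holds with
    Mbar = sqrt(n)."""
    a, b = xi
    return np.sign(a @ x - b) * a
```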

Literature Review. Proximal bundle methods for solving the deterministic version of (1), i.e., where an oracle that outputs $f(x)$ for any $x$ is available, have been proposed in [17, 18, 21, 37]. Moreover, convergence (but not complexity) analyses of proximal bundle methods have been developed for example in [7, 26, 30, 34], and their iteration complexities have been derived for example in [2, 5, 6, 15, 19, 20].

We now discuss methods for solving (the stochastic version of) problem (1). Methods for solving (1) when $f$ can be computed exactly (e.g., $\xi$ is a discrete random vector with small support) have been discussed for example in [3, 4] and are usually based on solving a deterministic (but large-scale) reformulation of (1), using decomposition (such as the L-shaped method [33]) possibly combined with regularization as in [12, 13].

Solution methods for problem (1) in which $\xi$ has a continuous distribution are basically based on one of the following three ideas: i) a single (usually expensive) approximation of (1), in which $f$ is replaced by a Monte Carlo average $H_{N}(\cdot):=\sum_{i=1}^{N}F(\cdot,\xi_{i})/N$ for a large i.i.d. sample $(\xi_{1},\ldots,\xi_{N})$ of $\xi$, is constructed at the beginning of the method and is then solved to yield an approximate solution of (1) (SAA-type methods); see for instance [11, 14, 16, 31, 35] and also Chapter 5 of [32] for their complexity analysis; ii) simple approximations of (1) are constructed at every iteration based on a small (usually a single) sample, and their solutions are used to obtain an approximate solution of (1) (SA-type methods); SA-type methods were originally proposed in [29] and further extended in [9, 10, 22, 23, 25, 27, 28]; and iii) hybrid-type methods which sit in between SAA- and SA-type ones in that they use partial Monte Carlo averages $H_{k}(\cdot)$ (and their expensive subgradients) for increasing iteration indices $k$ [13].

Contributions. Although the cutting plane methodology can be used in the context of SAA methods to solve single approximations of (1) generated at their outset, such methodology has not been used in the context of SA-type methods. This paper partially addresses this issue by developing regularized aggregated cutting plane methods for solving (1) where some of the most recent (linear approximation) cuts (in expectation) are combined, i.e., a suitable convex combination of them is chosen, so that a single aggregated cut (in expectation) is obtained. Two SCPB variants based on the aforementioned aggregated one-cut scheme are proposed, which can be viewed as natural extensions of the one-cut variant developed in [20] (based on the analysis of [19]) for solving the deterministic version of (1). More specifically, at every iteration, these SCPB variants solve the prox bundle subproblem

x=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}}\left\{\Gamma(u)+\frac{1}{2\lambda}\|u-x^{c}\|^{2}\right\} (3)

where $\lambda>0$ is the prox stepsize, $x^{c}$ is the current prox-center, and $\Gamma$ is the current bundle function in expectation, i.e., it satisfies $\mathbb{E}[\Gamma(\cdot)]\leq\phi(\cdot)$. The prox-center remains the same for several consecutive iterations, which are referred to as a cycle. At the beginning of a cycle, the prox-center is updated to $x^{c}\leftarrow x$ and the bundle function $\Gamma$ is chosen to be the composite linear approximation $F(x,\xi)+\langle s(x,\xi),\cdot-x\rangle+h(\cdot)$ of the function $F(\cdot,\xi)+h(\cdot)$ at $x$ for some new independent sample $\xi$. For the other iterations of the cycle, the prox-center remains the same but $\Gamma$ is set to be a convex combination of the previous bundle function and the most recent composite linear approximation as constructed above. It is then shown that both SCPB variants obtain a stochastic iterate $y\in\mathbb{R}^{n}$ (determined by some of the above generated $x$'s) such that $\mathbb{E}[\phi(y)]-\phi_{*}\leq\varepsilon$, where $\phi_{*}$ is as in (1), in ${\cal O}(\varepsilon^{-2})$ iterations/resolvent evaluations. To our knowledge, these are the first SA-type SCPB methods for solving SCCO problems where $\xi$ can have either a discrete or continuous distribution. Finally, it is shown that the robust stochastic approximation (RSA) method of [22] is a special case of SCPB with a relatively small prox stepsize.

Organization of the paper. Subsection 1.1 presents basic definitions and notation used throughout the paper. Section 2 formally describes the assumptions on the SCCO problem (1), presents the SCPB scheme, and describes two cycle rules for determining the length of the cycles in SCPB. Section 3 presents various convergence rate bounds for the SCPB variant based on the first cycle rule and discusses the relationship between SCPB and RSA. Section 4 provides convergence rate bounds for the SCPB variant based on the second cycle rule. Section 5 collects proofs of the main results in Sections 3 and 4. Section 6 reports the numerical experiments. Finally, Section 7 presents some concluding remarks and possible extensions.

1.1 Basic definitions and notation

Let $\mathbb{N}_{++}$ denote the set of positive integers. The sets of real numbers, non-negative and positive real numbers are denoted by $\mathbb{R}$, $\mathbb{R}_{+}$ and $\mathbb{R}_{++}$, respectively. Let $\mathbb{R}^{n}$ denote the standard $n$-dimensional Euclidean space equipped with inner product and norm denoted by $\left\langle\cdot,\cdot\right\rangle$ and $\|\cdot\|$, respectively.

Let $\psi:\mathbb{R}^{n}\rightarrow(-\infty,+\infty]$ be given. The effective domain of $\psi$ is denoted by $\mathrm{dom}\,\psi:=\{x\in\mathbb{R}^{n}:\psi(x)<\infty\}$ and $\psi$ is proper if $\mathrm{dom}\,\psi\neq\emptyset$. For $\varepsilon\geq 0$, the $\varepsilon$-subdifferential of $\psi$ at $z\in\mathrm{dom}\,\psi$ is denoted by $\partial_{\varepsilon}\psi(z):=\left\{s\in\mathbb{R}^{n}:\psi(u)\geq\psi(z)+\left\langle s,u-z\right\rangle-\varepsilon,\ \forall u\in\mathbb{R}^{n}\right\}$. The subdifferential of $\psi$ at $z\in\mathrm{dom}\,\psi$, denoted by $\partial\psi(z)$, is by definition the set $\partial_{0}\psi(z)$. Moreover, a proper function $\psi:\mathbb{R}^{n}\rightarrow(-\infty,+\infty]$ is $\mu$-strongly convex for some $\mu\geq 0$ if

\psi(\alpha z+(1-\alpha)u)\leq\alpha\psi(z)+(1-\alpha)\psi(u)-\frac{\alpha(1-\alpha)\mu}{2}\|z-u\|^{2}

for every $z,u\in\mathrm{dom}\,\psi$ and $\alpha\in[0,1]$. Note that we say $\psi$ is convex when $\mu=0$. We use the notation $\xi_{[t]}=(\xi_{0},\xi_{1},\ldots,\xi_{t})$ for the history of the sampled observations of $\xi$ up to iteration $t$. Define $\ln_{0}^{+}(\cdot):=\max\{0,\ln(\cdot)\}$. Define the diameter of a set $X$ to be $D_{X}:=\sup\{\|x-x^{\prime}\|:x,x^{\prime}\in X\}$.

2 Assumptions and two SCPB variants

This section presents the assumptions made on problem (1) and states two SCPB variants for solving it.

2.1 Assumptions

Let $\Xi$ denote the support of the random vector $\xi$ and assume that the following conditions on (1) hold:

Assumption 2.1.
  • (A1)

    $f$ and $h$ are proper closed convex functions satisfying $\mathrm{dom}\,f\supset\mathrm{dom}\,h$;

  • (A2)

    for almost every $\xi\in\Xi$, a functional oracle $F(\cdot,\xi):\mathrm{dom}\,h\to\mathbb{R}$ and a stochastic subgradient oracle $s(\cdot,\xi):\mathrm{dom}\,h\to\mathbb{R}^{n}$ satisfying

    f(x)=\mathbb{E}[F(x,\xi)],\quad f^{\prime}(x):=\mathbb{E}[s(x,\xi)]\in\partial f(x)

    for every $x\in\mathrm{dom}\,h$ are available;

  • (A3)

    \bar{M}:=\sup\{\mathbb{E}[\|s(x,\xi)\|^{2}]^{1/2}:x\in\mathrm{dom}\,h\}<\infty;

  • (A4)

    the set of optimal solutions $X^{*}$ of (1)-(2) is nonempty.

We now make some observations about the above conditions. First, as in [22], condition (A2) does not require $F(\cdot,\xi)$ to be convex. Second, condition (A3) implies that

\|f^{\prime}(x)\|=\|\mathbb{E}[s(x,\xi)]\|\leq\mathbb{E}[\|s(x,\xi)\|]\leq\left(\mathbb{E}[\|s(x,\xi)\|^{2}]\right)^{1/2}\leq\bar{M}\quad\forall x\in\mathrm{dom}\,h. (4)

Third, defining for every $\xi\in\Xi$ and $x\in\mathrm{dom}\,h$,

\Phi(\cdot,\xi)=F(\cdot,\xi)+h(\cdot),\quad\ell(\cdot;x,\xi)=f(x)+\langle s(x,\xi),\cdot-x\rangle+h(\cdot), (5)

it follows from (A2), the second identity in (5), and the convexity of $f$ by (A1), that

\mathbb{E}[\Phi(\cdot,\xi)]=\phi(\cdot)\geq f(x)+\langle f^{\prime}(x),\cdot-x\rangle+h(\cdot)=\mathbb{E}[\ell(\cdot;x,\xi)] (6)

where $\phi(\cdot)$ is as in (1). Hence, $\ell(\cdot;x,\xi)$ is a stochastic composite linear approximation of $\phi(\cdot)$ in the sense that its expectation is a true composite linear approximation of $\phi(\cdot)$. (The terminology “composite” refers to the function $h$, which is included in the approximation $\ell(\cdot;x,\xi)$ as is.)

2.2 Description of the two SCPB variants

Before describing the two SCPB variants, we motivate them by interpreting them as inexact implementations of the (theoretical) proximal point method for solving (1).

Their $k$-th cycle of iterations performs a finite number of iterations to solve the prox subproblem

\min_{u\in\mathbb{R}^{n}}\left\{\phi(u)+\frac{1}{2\lambda}\|u-\hat{x}_{k-1}\|^{2}\right\}

where $\hat{x}_{k-1}$ denotes the prox-center during the cycle. Each iteration within the cycle solves a subproblem of the form

x=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}}\left\{{\cal A}(u)+h(u)+\frac{1}{2\lambda}\|u-\hat{x}_{k-1}\|^{2}\right\} (7)

where ${\cal A}(\cdot)$ is an affine bundle for $f$ in expectation, i.e., an affine function such that $\mathbb{E}[{\cal A}(\cdot)]\leq f(\cdot)$. (This type of bundle has been considered in the inexact proximal point approach of [20], where it is referred to as a one-cut bundle for $f$.) The bundle ${\cal A}^{+}$ for the next subproblem in the cycle is then taken to be a convex combination of the current bundle ${\cal A}$ and a newly generated stochastic linear approximation of $f$ of the form $F(x,\xi)+\langle s(x,\xi),\cdot-x\rangle$. Moreover, the first iteration of every cycle starts by setting the prox-center to the most recently generated $x$ as in (7) and the bundle to the most recently generated stochastic linear approximation of $f$ at $x$.

Both SCPB variants are based on the SCPB scheme described below. As stated below, the scheme is not a completely specified algorithm since its step 2 does not describe how to select the index $j_{k}$. Two rules for doing so, and hence the complete description of the two aforementioned SCPB variants, are then given following the statement of the scheme.

At every iteration $j\geq 1$, the SCPB scheme samples an independent realization $\xi_{j-1}$ of $\xi$.

 

SCPB

  Input: Scalars $\lambda,\theta>0$, an integer $K\geq 1$, and an initial point $x_{0}\in\mathrm{dom}\,h$.

  • 0.

    Set $j=k=1$, $j_{0}=0$, and

    \tau=\frac{\theta K}{\theta K+1}; (8)
  • 1.

    take a sample $\xi_{j-1}$ of the r.v. $\xi$ independent from the previous samples $\xi_{0},\ldots,\xi_{j-2}$ and compute

    x_{j}^{c}=\left\{\begin{array}{ll}x_{j_{k-1}},&\text{ if }j=j_{k-1}+1,\\ x_{j-1}^{c},&\text{ otherwise},\end{array}\right. (11)

    S_{j}=\left\{\begin{array}{ll}s(x_{j_{k-1}},\xi_{j_{k-1}}),&\text{ if }j=j_{k-1}+1,\\ (1-\tau)s(x_{j-1},\xi_{j-1})+\tau S_{j-1},&\text{ otherwise},\end{array}\right. (14)

    x_{j}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}}\left\{h(u)+\langle S_{j},u\rangle+\frac{1}{2\lambda}\|u-x_{j}^{c}\|^{2}\right\}, (15)

    and

    y_{j}=\left\{\begin{array}{ll}x_{j},&\text{ if }j=j_{k-1}+1,\\ (1-\tau)x_{j}+\tau y_{j-1},&\text{ otherwise};\end{array}\right. (18)
  • 2.

    if $j=j_{k-1}+1$, then choose an integer $j_{k}$ such that

    j_{k}\geq j_{k-1}+1;

    if $j<j_{k}$, then set $j\leftarrow j+1$ and go to step 1; else, set $\hat{y}_{k}=y_{j_{k}}$ and go to step 3;

  • 3.

    if $k<K$, then set $k\leftarrow k+1$ and $j\leftarrow j+1$, and go to step 1; otherwise, compute

    \hat{y}_{K}^{a}=\frac{1}{\lceil K/2\rceil}\sum_{k=\lfloor K/2\rfloor+1}^{K}\hat{y}_{k} (19)

    and stop.

Output: $\hat{y}_{K}^{a}$.

 

We first discuss the roles played by the two index counts $j$ and $k$ used by SCPB. First, $j$ counts the total number of iterations/resolvent evaluations performed by SCPB since it is increased by one every time SCPB returns to step 1. Second, defining the $k$-th cycle as the iteration indices $j$ lying in

{\cal C}_{k}:=\{i_{k},\ldots,j_{k}\},\quad\text{where}\ \ i_{k}:=j_{k-1}+1, (20)

it immediately follows that $k$ counts the number of cycles generated by SCPB. Third, step 1 determines two types of iterations depending on whether $j=j_{k}$ (serious iteration) or $j\in{\cal C}_{k}\setminus\{j_{k}\}$ (null iteration). Hence, the last iteration of a cycle is a serious one while the others are null ones.

We now make several basic remarks about SCPB. First, every execution of step 1 involves one resolvent evaluation of $\partial h$, i.e., an evaluation of the point-to-point operator $(I+\alpha\partial h)^{-1}(\cdot)$ for some $\alpha>0$. Second, SCPB generates three sequences of iterates, namely, the sequence of prox-centers $\{x_{j}^{c}\}$ computed in (11), the sequence $\{x_{j}\}$ determined by (15), and the sequence $\{y_{j}\}$ given by (18). Third, it follows from (11) that $x^{c}_{j}=x_{j_{k-1}}$ for every $j\in{\cal C}_{k}$. Hence, the prox-center $x^{c}_{j}$ remains constant between consecutive iterations within a cycle and (possibly) changes only at the beginning of the first iteration of the following cycle. Fourth, $\{\hat{y}_{k}\}$ is the subsequence of $\{y_{j}\}$ consisting of all the last cycle iterates $y_{j_{k}}$ generated by SCPB. Fifth, the convergence rates for the two specific variants of the SCPB scheme described below are with respect to the average of the iterates $\hat{y}_{\lfloor K/2\rfloor+1},\ldots,\hat{y}_{K}$, namely, the point $\hat{y}_{K}^{a}$ as in (19) (see Theorems 3.1 and 4.1 below).

As already mentioned in the second paragraph preceding the description of SCPB, the scheme is not completely specified since its step 2 does not describe how to select $j_{k}$. We now describe two cycle rules for doing so, both of which depend on a pre-specified parameter $R>0$, namely:

  • (B1)

    for every $k\geq 1$, let $j_{k}$ be the smallest integer $\geq i_{k}$ such that $\lambda k\tau^{j_{k}-i_{k}}\leq R$;

  • (B2)

    for every $k\geq 1$, let $j_{k}$ be the smallest integer $\geq i_{k}+1$ such that

    \lambda k\tau^{j_{k}-i_{k}}\left(F(x_{i_{k}},\xi_{i_{k}})-\tilde{\ell}_{k}(x_{i_{k}})-\frac{1}{2\lambda}\|x_{i_{k}}-x_{i_{k}}^{c}\|^{2}\right)\leq R (21)

    where $i_{k}$ is as in (20) and

    \tilde{\ell}_{k}(\cdot):=F(x_{i_{k}-1},\xi_{i_{k}-1})+\langle s(x_{i_{k}-1},\xi_{i_{k}-1}),\cdot-x_{i_{k}-1}\rangle. (22)

We make the following remarks about cycle rules (B1) and (B2). First, the sequence $\{j_{k}\}$ determined by cycle rule (B1) is deterministic, while the one determined by (B2) is stochastic since the sequence $\{x_{i_{k}}\}$ used in (21) is stochastic. Second, another difference between the two cycle rules is that (B1) allows $j_{k}=i_{k}$, while $j_{k}$ in (B2) is at least $i_{k}+1$. In other words, the cycle length for (B1) may be equal to one, but the one for (B2) is at least two. Third, the length of cycle ${\cal C}_{k}$ for both rules above depends on the cycle index $k$. Hence, even though (B1) is deterministic, the length of the cycles generated by it changes with $k$.
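For concreteness, note that rule (B1) admits a closed form (this observation also underlies the proof of Theorem 3.1(a) below): since (B1) asks for the smallest integer $j_{k}\geq i_{k}$ with $\tau^{j_{k}-i_{k}}\leq R/(\lambda k)$, taking logarithms yields

j_{k}-i_{k}=\left\lceil\frac{\ln_{0}^{+}(\lambda k/R)}{\ln(1/\tau)}\right\rceil,

so the length of cycle ${\cal C}_{k}$ can be computed before the cycle starts. Combining this identity with $\ln(1/\tau)\geq 1-\tau=1/(\theta K+1)$ yields the bound (23) stated in Theorem 3.1(a).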

Throughout our presentation, SCPB based on cycle rule (B1) (resp., (B2)) is referred to as SCPB1 (resp., SCPB2).
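To fix ideas, the following is a minimal Python sketch of SCPB1 for the special case where $h$ is the indicator function of a closed convex set with an easy projection, so that subproblem (15) reduces to $x_{j}=\mathrm{proj}(x_{j}^{c}-\lambda S_{j})$. All function names are our own illustrative choices (this is not the authors' code), and the closed form for the cycle length from the remark above is used in place of testing (B1) iteratively.

```python
import math
import numpy as np

def scpb1(sample_xi, subgrad, project, x0, lam, theta, K, R):
    """Illustrative sketch of SCPB1: h is the indicator of the set given by
    `project`, and subgrad(x, xi) returns s(x, xi)."""
    tau = theta * K / (theta * K + 1.0)                    # eq. (8)
    x = np.asarray(x0, dtype=float)                        # holds x_{j_{k-1}} at cycle starts
    hat_y = []
    for k in range(1, K + 1):
        # cycle rule (B1) in closed form: smallest m >= 0 with lam*k*tau**m <= R
        m = math.ceil(math.log(lam * k / R) / math.log(1.0 / tau)) if lam * k > R else 0
        xc = x.copy()                                      # eq. (11): prox-center for cycle k
        for t in range(m + 1):                             # the m+1 iterations of cycle C_k
            g = subgrad(x, sample_xi())                    # s(x_{j-1}, xi_{j-1})
            S = g if t == 0 else (1 - tau) * g + tau * S   # eq. (14)
            x = project(xc - lam * S)                      # eq. (15) for h = indicator
            y = x if t == 0 else (1 - tau) * x + tau * y   # eq. (18)
        hat_y.append(y)                                    # serious iterate: hat y_k = y_{j_k}
    tail = hat_y[K // 2:]                                  # hat y_k for k > floor(K/2)
    return sum(tail) / len(tail)                           # eq. (19)

# Example run on f(x) = E|<a, x> - b| over the unit ball (illustrative):
rng = np.random.default_rng(0)
project_ball = lambda z: z / max(1.0, float(np.linalg.norm(z)))
subgrad = lambda x, ab: np.sign(ab[0] @ x - ab[1]) * ab[0]
draw = lambda: (rng.standard_normal(5), rng.standard_normal())
y = scpb1(draw, subgrad, project_ball, np.zeros(5), lam=0.5, theta=0.1, K=200, R=1.0)
```

Note that only the scalars $S_{j}$-update and one projection are needed per iteration, which reflects the single-cut nature of the scheme: the bundle is a single aggregated affine cut rather than a growing collection of cuts.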

3 Complexity results for SCPB1

This section presents the main complexity results for SCPB1 under various assumptions and discusses the relationship between SCPB1 and RSA.

3.1 Convergence rate bounds of SCPB1 with bounded $\mathrm{dom}\,h$

The following result states a general convergence rate result for SCPB1 that holds for bounded $\mathrm{dom}\,h$ and for any choice of input $(\lambda,\theta,K)$ in SCPB1 and constant $R$ as in (B1). The proof is postponed to Subsection 5.2.

Theorem 3.1.

Assume that Assumption 2.1 holds and $\mathrm{dom}\,h$ has a finite diameter $D_{h}\geq 0$. Then, for any given $(\lambda,\theta,K)\in\mathbb{R}_{++}^{2}\times\mathbb{N}_{++}$ and $R>0$, SCPB1 with input $(\lambda,\theta,K)$ and constant $R$ in (B1) satisfies the following statements:

  • a)

    the number of iterations within the $k$-th cycle ${\cal C}_{k}$ (see (20)) is bounded by

    \left\lceil(\theta K+1)\ln_{0}^{+}\left(\frac{\lambda k}{R}\right)\right\rceil+1; (23)

  • b)

    we have

    \mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\frac{1}{K}\left(\frac{D_{h}^{2}}{\lambda}+\frac{6R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda}+\frac{2\lambda\bar{M}^{2}}{\theta}\right) (24)

    where $D_{h}$ is the diameter of $\mathrm{dom}\,h$.

We now make some remarks about Theorem 3.1. First, its overall iteration complexity is given by $K$, which is its outer iteration complexity, times its inner iteration complexity given in (23). Second, (24) gives a bound on the expected primal gap $\mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}$ in terms of $K$, and hence provides a sufficient condition on how large $K$ should be chosen for SCPB1 to generate a desired approximate solution.

Even though Theorem 3.1 holds for the general case in which $\mathrm{dom}\,h$ is unbounded, all its corollaries stated in this subsection and Subsection 3.3 assume that $\mathrm{dom}\,h$ is bounded. For any given $(\lambda,K)$, the following result describes a convergence rate bound for SCPB1 with a specific choice of $(\theta,R)$.

Corollary 3.2.

Assume that Assumption 2.1 holds and $\mathrm{dom}\,h$ has a finite diameter $D_{h}>0$. Let a pair $(\lambda,K)$ be given and consider SCPB1 with input $(\lambda,\theta,K)$ and $R$ in (B1) given by

\theta=\frac{2\lambda^{2}M^{2}}{D^{2}},\quad R=\frac{D}{6M} (25)

where $(D,M)$ is an estimate for the (usually unknown) pair $(D_{h},\bar{M})$. Then, the following statements hold:

  • a)

    we have

    \mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\frac{3D^{2}}{2\lambda K}\left(\kappa_{D}+\kappa_{M}\right)

    where

    \kappa_{D}:=\frac{D_{h}^{2}}{D^{2}},\quad\kappa_{M}:=\frac{\bar{M}^{2}}{M^{2}}; (26)
  • b)

    its expected overall iteration complexity (up to a logarithmic term) is

    {\cal O}\left(\frac{\lambda^{2}M^{2}K^{2}}{D^{2}}+K\right). (27)

Proof: a) Using (24), the definitions of $\kappa_{D}$ and $\kappa_{M}$ in (26), and the definitions of $\theta$ and $R$ in (25), we get

\begin{aligned}
\mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}&\leq\frac{1}{K}\left(\frac{\kappa_{D}D^{2}}{\lambda}+\frac{D\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda M}+\frac{\kappa_{M}D^{2}}{\lambda}\right)\\
&\leq\frac{D^{2}}{\lambda K}\left(\kappa_{D}+\kappa_{M}+\sqrt{\kappa_{D}\kappa_{M}}\right)\\
&\leq\frac{3D^{2}}{2\lambda K}\left(\kappa_{D}+\kappa_{M}\right),
\end{aligned}

where in the second inequality we have used $\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}\leq\bar{M}D_{h}$ and the definitions of $\kappa_{D}$ and $\kappa_{M}$, while in the last inequality we have used the relation $\sqrt{ab}\leq(a+b)/2$ for every $a,b\geq 0$.

b) It follows from Theorem 3.1(a) that the overall complexity (up to a logarithmic term) is ${\cal O}(\theta K^{2}+K)$, which in turn is (27) in view of $\theta$ as in (25).

We now argue that the overall iteration (and sample) complexity of the SCPB1 variant of Corollary 3.2 for finding an $\varepsilon$-solution $x$ of (1), i.e., one that satisfies $\mathbb{E}[\phi(x)]-\phi_{*}\leq\varepsilon$, is optimal for a large range of prox stepsizes. Indeed, setting $K=\lceil T_{\varepsilon}\rceil$ where

T_{\varepsilon}:=\frac{3D^{2}}{2\lambda\varepsilon}\left(\kappa_{D}+\kappa_{M}\right),

it follows from the above result that $\mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\varepsilon$. Since $K\leq T_{\varepsilon}+1$, we conclude from (27) that the expected overall iteration complexity of SCPB1 is bounded by

{\cal O}\left(\frac{\lambda^{2}M^{2}(T_{\varepsilon}+1)^{2}}{D^{2}}+T_{\varepsilon}+1\right)={\cal O}\left(\frac{M^{2}D^{2}}{\varepsilon^{2}}\left[\kappa_{D}^{2}+\kappa_{M}^{2}\right]+\frac{\lambda^{2}M^{2}}{D^{2}}+\frac{D^{2}}{\lambda\varepsilon}\left[\kappa_{D}+\kappa_{M}\right]+1\right).

In particular, if $D\geq D_{h}$ and $M\geq\bar{M}$, or equivalently, $\kappa_{D}\leq 1$ and $\kappa_{M}\leq 1$, then the above complexity reduces to

{\cal O}\left(\frac{M^{2}D^{2}}{\varepsilon^{2}}+\frac{\lambda^{2}M^{2}}{D^{2}}+\frac{D^{2}}{\lambda\varepsilon}+1\right).

Moreover, under the assumption that the prox stepsize $\lambda$ lies in the interval $[\varepsilon/M^{2},D^{2}/\varepsilon]$, the above complexity bound further reduces to ${\cal O}(M^{2}D^{2}/\varepsilon^{2})$, which is known to be the optimal complexity of finding an $\varepsilon$-solution for any instance of (1) whose corresponding pair $(D_{h},\bar{M})$ satisfies $D\geq D_{h}$ and $M\geq\bar{M}$ (e.g., see [24]).

3.2 Relationship between SCPB1 and the RSA method of [22]

We argue in this subsection that RSA can be viewed as a special instance of SCPB1 where every cycle ${\cal C}_{k}$ contains only one index (or equivalently, an instance for which every iteration is serious).

Recall that the RSA method of [22], which is developed under the assumption that $h$ is the indicator function of a nonempty compact convex set $X$, with a given initial point $x_{0}\in X$ and constant prox stepsize $\lambda>0$, recursively computes its iteration sequence $\{x_{j}\}_{j=1}^{N}$ according to

x_{j}=\underset{u\in X}{\mathrm{argmin}}\left\{\langle s(x_{j-1},\xi_{j-1}),u\rangle+\frac{1}{2\lambda}\|u-x_{j-1}\|^{2}\right\}\qquad\forall j=1,\ldots,N. (28)

For the purpose of reviewing the iteration complexity of RSA, assume that $D$ is an upper bound on the diameter of $X$ and $M$ is an upper bound on $\bar{M}$. For $1\leq i\leq N$, let $\tilde{x}_{i}^{N}$ denote the average of the iterates $\{x_{j}\}_{j=i}^{N}$, i.e.,

\tilde{x}^{N}_{i}=\frac{1}{N-i+1}\sum_{j=i}^{N}x_{j}. (29)

It is shown in equation (2.24) of [22] that if the stepsize $\lambda>0$ is chosen as

\lambda=\frac{\alpha D}{M\sqrt{N}} (30)

for some fixed scalar $\alpha>0$, then the ergodic iterate $\tilde{x}^{N}_{i}$ with $i=\lfloor N/2\rfloor+1$ satisfies

\mathbb{E}[\phi(\tilde{x}^{N}_{\lfloor N/2\rfloor+1})]-\phi_{*}\leq\max\{\alpha,\alpha^{-1}\}\frac{9DM}{2\sqrt{N}}. (31)

Hence, for a given tolerance $\varepsilon>0$, the smallest $N$ satisfying $\mathbb{E}[\phi(\tilde{x}^{N}_{\lfloor N/2\rfloor+1})]-\phi_{*}\leq\varepsilon$ has the property that

N={\cal O}\left(\frac{\max\{\alpha^{2},\alpha^{-2}\}M^{2}D^{2}}{\varepsilon^{2}}\right). (32)
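In the same illustrative setting as the sketch of Subsection 2.2 (with $h$ the indicator of a compact convex set given by a projection), the RSA recursion (28) and the tail average (29) can be sketched as follows; all names are ours:

```python
import numpy as np

def rsa(sample_xi, subgrad, project, x0, lam, N):
    """Illustrative sketch of RSA: projected stochastic subgradient steps (28)
    followed by the ergodic average (29) with i = floor(N/2) + 1."""
    x = np.asarray(x0, dtype=float)
    tail = []
    for j in range(1, N + 1):
        x = project(x - lam * subgrad(x, sample_xi()))   # eq. (28)
        if j >= N // 2 + 1:                              # keep x_j for j >= floor(N/2)+1
            tail.append(x.copy())
    return sum(tail) / len(tail)                         # eq. (29)
```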

It turns out that RSA is a special case of SCPB1 with $R$ in (B1) given by

R=\frac{\alpha D\sqrt{K}}{M}.

Indeed, it follows from the above choice of $R$ and $\lambda$ as in (30) with $N$ replaced by $K$ that

\frac{R}{\lambda k}\geq\frac{R}{\lambda K}=1

and hence that $j_{k}=i_{k}$ satisfies (B1). Thus, every cycle performs only one iteration, namely, its serious iteration. Hence, every iteration of this SCPB1 variant is a serious one and $K$ is its total number of iterations.

3.3 A practical SCPB1 variant

From a computational point of view, the choice of $\theta$ in Corollary 3.2 usually results in the quantity $\theta K$, and hence the inner complexity bound (23), being large. The following result provides a practical variant of SCPB1 with an alternative choice of $\theta$ and $R$ which partially remedies this drawback by forcing $\theta K$ to be constant. A nice feature of this variant is that it is able to choose large prox stepsizes without losing the optimality of its overall iteration complexity.

Corollary 3.3.

Assume that Assumption 2.1 holds and $\mathrm{dom}\,h$ has a finite diameter $D_{h}>0$. Let a positive integer $K$ and a constant $C\geq 1$ be given, and define

\theta=\frac{C}{K},\quad R=\frac{D}{M},\quad\lambda=\frac{\sqrt{C}D}{M\sqrt{K}} (33)

where $(D,M)$ is an estimate for the pair $(D_{h},\bar{M})$. Then, the following statements about SCPB1 with input $(\lambda,\theta,K)$ and $R$ as above hold:

  • a)

    we have

    \mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\frac{(4\kappa_{D}+5\kappa_{M})DM}{\sqrt{CK}} (34)

    where $\kappa_{D}$ and $\kappa_{M}$ are as in (26);

  • b)

    the number of iterations within the $k$-th cycle ${\cal C}_{k}$ is bounded by

    \left\lceil(C+1)\ln_{0}^{+}\left(\frac{\sqrt{C}k}{\sqrt{K}}\right)\right\rceil+1,

    and hence, up to a logarithmic term, is ${\cal O}(C)$;

  • c)

    its expected overall iteration complexity, up to a logarithmic term, is ${\cal O}(CK)$.

Proof: a) Using (24), the definitions of $\kappa_{D}$ and $\kappa_{M}$ in (26), and the definitions of $\theta$ and $R$ in (33), we get

\begin{aligned}
\mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}&\leq\frac{1}{K}\left(\frac{\kappa_{D}D^{2}}{\lambda}+\frac{6D\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda M}+\frac{2\lambda K\bar{M}^{2}}{C}\right)\\
&\leq\frac{1}{\lambda K}\left(\kappa_{D}D^{2}+\frac{6D\bar{M}D_{h}}{M}\right)+\frac{2\lambda M^{2}\kappa_{M}}{C}\\
&=\frac{D^{2}}{\lambda K}(\kappa_{D}+6\sqrt{\kappa_{M}\kappa_{D}})+\frac{2\lambda M^{2}\kappa_{M}}{C}, (35)
\end{aligned}

where in the second inequality we have used $\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}\leq\bar{M}D_{h}$ and the definition of $\kappa_{M}$. It follows from (35) and the fact that $\sqrt{\kappa_{M}\kappa_{D}}\leq(\kappa_{M}+\kappa_{D})/2$ that

\mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\frac{(4\kappa_{D}+3\kappa_{M})D^{2}}{\lambda K}+\frac{2\kappa_{M}\lambda M^{2}}{C}.

Finally, the above bound with $\lambda$ as in (33) implies (34).

b) This statement immediately follows from Theorem 3.1(a) with $\theta$, $R$, and $\lambda$ as in (33).

c) This statement follows from (b) and the fact that SCPB1 has $K$ cycles.

We now make two remarks about the practical SCPB1 variant of Corollary 3.3. First, although $\theta$ in (33) depends neither on $M$ nor $D$, the choice of $R$ depends on both of these estimates. On the other hand, SCPB2 will be analyzed in Section 4 with a $\theta$ that depends neither on $M$ nor $D$, and an $R$ that depends on $D$ but not on $M$. Second, Corollary 3.3 (see its statement (b)) implies that the number of iterations within a cycle of SCPB1 is bounded (up to a logarithmic term) by the a priori (user-specified) constant $C$; a parameter instantiation is sketched below. Thus, the SCPB1 variant of Corollary 3.3 can be viewed as an extended version of RSA where the number of iterations within a cycle can be larger than one, instead of being equal to one as in RSA (see the discussion in the second paragraph of Subsection 3.2).
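In terms of the hypothetical scpb1 sketch of Subsection 2.2, the parameter choice (33) would be instantiated as follows ($C$, $K$, and the estimates $D$, $M$ are user inputs; the numeric values are ours, chosen only for illustration):

```python
import math

# Hypothetical instantiation of eq. (33); D and M are the user's estimates
# of (Dh, Mbar), and C >= 1 controls the (logarithmic) cycle length.
C, K, D, M = 4.0, 1000, 2.0, math.sqrt(5.0)
theta = C / K                                    # forces theta*K = C, a constant
R = D / M
lam = math.sqrt(C) * D / (M * math.sqrt(K))
# y = scpb1(sample_xi, subgrad, project, x0, lam, theta, K, R)
```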

In the remaining part of this subsection, we compare the performance of RSA and SCPB1 when both use the prox stepsize $\lambda$ as in (33) for some relatively large scalar $C\geq 1$. For this discussion, we assume that their performance measure is the overall iteration complexity (or sample complexity) for finding an $\varepsilon$-solution of (1). For simplicity, we assume as in Subsection 3.2 that $h$ is the indicator function of a nonempty closed convex set and that the estimation pair $(D,M)$ satisfies $D\geq D_{h}$ and $M\geq\bar{M}$, or equivalently, $\kappa_{D}\leq 1$ and $\kappa_{M}\leq 1$.

We first consider the performance (see the previous paragraph) of SCPB1. It follows from Corollary 3.3(a) that there exists $K={\cal O}(D^{2}M^{2}/(C\varepsilon^{2}))$ such that $\hat{y}_{K}^{a}$ is an $\varepsilon$-solution of (1). Hence, it follows from Corollary 3.3(c) that the performance of SCPB1 is ${\cal O}(M^{2}D^{2}/\varepsilon^{2})$. In conclusion, SCPB1 with the above choice of $K$ is able to choose a prox stepsize $\lambda$ as in (33) with a large constant $C$ while preserving its optimal performance. We now consider the performance of RSA. It follows from (30) and (32) with $\alpha=\sqrt{C}$ that the performance of RSA is ${\cal O}\left(CD^{2}M^{2}/\varepsilon^{2}\right)$. In conclusion, while both RSA and SCPB1 with prox stepsize $\lambda$ as in (33) have their own performance guarantees, the one for RSA becomes worse than that of SCPB1 as $C$ becomes large.

Finally, although the SCPB1 variant of Corollary 3.3 chooses $\lambda$ as in (33), our numerical experiments use a more aggressive prox stepsize, namely,

\lambda=\beta_{1}\frac{\sqrt{C}D}{M\sqrt{K}}

where $\beta_{1}=10$. It is possible to show that similar conclusions (with slightly modified bounds) to those of Corollary 3.3 hold for this choice of $\lambda$. Moreover, SCPB1 with this prox stepsize $\lambda$ substantially outperforms RSA on the (relatively small number of) instances considered in the computational experiments of Section 6.

4 Complexity results for SCPB2

This section provides the main complexity results for SCPB2.

The following result is an analogue of Theorem 3.1 and describes a convergence rate bound for SCPB2 without imposing any condition on its input $(\lambda,\theta,K)$ or on the constant $R$ in (B2). The proof is postponed to Subsection 5.3.

Theorem 4.1.

Assume that Assumption 2.1 holds and $\mathrm{dom}\,h$ has a finite diameter $D_{h}>0$. Then, SCPB2 satisfies the following statements:

  • a)

    the expected number of iterations within the $k$-th cycle ${\cal C}_{k}$ (see (20)) is bounded by

    \left\lceil(\theta K+1)\ln_{0}^{+}\left(\frac{2\bar{M}^{2}\lambda^{2}k}{R}\right)\right\rceil+1; (36)
  • b)

    we have

    \mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\frac{1}{K}\left(\frac{3R+D_{h}^{2}}{\lambda}+\frac{2\lambda\bar{M}^{2}}{\theta}+\frac{2\lambda\bar{M}^{2}}{\theta^{2}K}\right).

Following an argument similar to the one in the paragraph following Corollary 3.2, it can be shown that SCPB2 has optimal iteration complexity (up to a logarithmic term) for finding an $\varepsilon$-solution of (1) for a large range of prox stepsizes.

The following result is the analogue of Corollary 3.3 when SCPB2 is implemented using cycle rule (B2) instead of (B1). As in Corollary 3.3, it forces the quantity $\theta K$ to be constant but, in contrast to the choice of $R$ in Corollary 3.3, its choice of $R$ does not depend on an estimate $M$ for $\bar{M}$.

Corollary 4.2.

Assume that Assumption 2.1 holds and $\mathrm{dom}\,h$ has a finite diameter $D_{h}>0$. Let a positive integer $K$ and a constant $C\geq 1$ be given, and define

\theta=\frac{C}{K},\quad R=D^{2},\quad\lambda=\frac{\sqrt{C}D}{M\sqrt{K}} (37)

where $D$ is an estimate for $D_{h}$ and $M$ is an estimate for $\bar{M}$. Then, the following statements for SCPB2 with input $(\lambda,\theta,K)$ based on cycle rule (B2) with $R$, $\theta$, and $\lambda$ as above hold:

  • a)

    we have

    \mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\frac{(3+\kappa_{D}+4\kappa_{M})DM}{\sqrt{CK}} (38)

    where $\kappa_{D}$ and $\kappa_{M}$ are as in (26);

  • b)

    the expected number of iterations within the $k$-th cycle ${\cal C}_{k}$ is bounded by

    \left\lceil(C+1)\ln_{0}^{+}\left(\frac{2\kappa_{M}Ck}{K}\right)\right\rceil+1,

    and hence, up to a logarithmic term, is ${\cal O}(C)$;

  • c)

    its expected overall iteration complexity, up to a logarithmic term, is ${\cal O}(CK)$.

Proof: a) Using Theorem 4.1(b) with $\theta$ and $R$ as in (37) and the definitions of $\kappa_{D}$ and $\kappa_{M}$ in (26), we have

\mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}\leq\frac{(3+\kappa_{D})D^{2}}{\lambda K}+\frac{4\kappa_{M}\lambda M^{2}}{C},

which together with $\lambda$ in (37) implies (38).

b) This statement follows from (36) with $\theta$, $R$, and $\lambda$ as in (37) and the definition of $\kappa_{M}$ in (26).

c) This statement follows from (b) and the fact that SCPB2 has $K$ cycles.

5 Proofs of main results in Sections 3 and 4

This section contains three subsections. The first one presents some technical results that apply to the SCPB scheme regardless of how the index $j_{k}$ is chosen in step 2. The second and third ones are then devoted to the proofs of Theorems 3.1 and 4.1, respectively.

5.1 Proofs of some technical results

We assume Assumption 2.1 holds throughout this subsection. Recall that for every $j\geq 0$,

\xi_{[j]}=(\xi_{0},\xi_{1},\ldots,\xi_{j})

and, for positive integers $p\leq q$, we denote by $\xi_{[p:q]}=(\xi_{p},\xi_{p+1},\ldots,\xi_{q})$ the portion of realizations of the r.v. $\xi$ over the iterations $p,p+1,\ldots,q$. For convenience, in what follows we set

s_{j}:=s(x_{j},\xi_{j}). (39)

For every $k\geq 1$ and $j\in{\cal C}_{k}$, define

u_{j}:=\left\{\begin{array}{ll}\Phi(x_{i_{k}},\xi_{i_{k}}),&\text{ if }j=i_{k},\\ (1-\tau)\phi(x_{j})+\tau u_{j-1},&\text{ otherwise},\end{array}\right. (42)

and

\Gamma_{j}(\cdot):=\left\{\begin{array}{ll}\tilde{\ell}_{k}(\cdot)+h(\cdot),&\text{ if }j=i_{k},\\ (1-\tau)\ell(\cdot;x_{j-1},\xi_{j-1})+\tau\Gamma_{j-1}(\cdot),&\text{ otherwise},\end{array}\right. (45)

where $\Phi(\cdot,\xi)$ and $\ell(\cdot;x,\xi)$ are as in (5) and $\tilde{\ell}_{k}(\cdot)$ is as in (22). It is easy to see from (14), (15), and the above definition of $\Gamma_{j}$ that

x_{j}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}}\left\{\Gamma_{j}^{\lambda}(u):=\Gamma_{j}(u)+\frac{1}{2\lambda}\|u-x_{j}^{c}\|^{2}\right\}. (46)

The first result below provides some basic relations which are often used in our analysis.

Lemma 5.1.

For every $j\geq 1$, we have

\begin{aligned}
\mathbb{E}[\Phi(x_{j},\xi_{j})]&=\mathbb{E}[\phi(x_{j})], (47)\\
\mathbb{E}[\phi(y_{j})]&\leq\mathbb{E}[u_{j}], (48)\\
\mathbb{E}[\Gamma_{j}(x)]&\leq\phi(x)\quad\forall x\in\mathrm{dom}\,h. (49)
\end{aligned}

Proof: Observe that $x_{j}$ is a function of $\xi_{[j-1]}$ and not of $\xi_{j}$. Hence, $x_{j}$ is independent of $\xi_{j}$ in view of the fact that $\xi_{j}$ is chosen in step 1 of SCPB to be independent of $\xi_{[j-1]}$. Using the relation $f(x)=\mathbb{E}[F(x,\xi)]$ (see (A2)), it follows that

\mathbb{E}[\Phi(x_{j},\xi_{j})]=\mathbb{E}_{\xi_{[j]}}[F(x_{j},\xi_{j})+h(x_{j})]=\mathbb{E}_{\xi_{[j-1]}}[\mathbb{E}_{\xi_{j}}[F(x_{j},\xi_{j})+h(x_{j})\,|\,\xi_{[j-1]}]]=\mathbb{E}_{\xi_{[j-1]}}[f(x_{j})+h(x_{j})]=\mathbb{E}[\phi(x_{j})],

which is identity (47). It then suffices to show that, for any given $k\geq 1$, (48) and (49) hold for every $j$ in the $k$-th cycle, i.e., $j\in\mathcal{C}_{k}$. We show this by induction on the iteration count $j$. If $j=i_{k}$, then it follows from (18), (42), and (47) that

\mathbb{E}[u_{j}]\overset{(42)}{=}\mathbb{E}[\Phi(x_{j},\xi_{j})]\overset{(47)}{=}\mathbb{E}[\phi(x_{j})]\overset{(18)}{=}\mathbb{E}[\phi(y_{j})],

and from (45) with $j=i_{k}$, (22), and Assumption 2.1 (A1)-(A2) that for every $x\in\mathrm{dom}\,h$,

\mathbb{E}[\Gamma_{j}(x)]\overset{(45)}{=}\mathbb{E}[\tilde{\ell}_{k}(x)+h(x)]\overset{(22),(A2)}{=}f(x_{i_{k}-1})+\langle f^{\prime}(x_{i_{k}-1}),x-x_{i_{k}-1}\rangle+h(x)\overset{(A1)}{\leq}\phi(x).

Let $j$ be such that $j>i_{k}$ and (48) and (49) hold for $j$. Then, it follows from (18), (42), the fact that (48) holds for $j$, and the convexity of $\phi$, that

\mathbb{E}[u_{j+1}]\overset{(42),(48)}{\geq}(1-\tau)\mathbb{E}[\phi(x_{j+1})]+\tau\mathbb{E}[\phi(y_{j})]\geq\mathbb{E}[\phi((1-\tau)x_{j+1}+\tau y_{j})]\overset{(18)}{=}\mathbb{E}[\phi(y_{j+1})],

and from (6), (45), and the fact that (49) holds for $j$, that

\mathbb{E}[\Gamma_{j+1}(x)]\overset{(45)}{=}\tau\mathbb{E}[\Gamma_{j}(x)]+(1-\tau)\mathbb{E}[\ell(x;x_{j},\xi_{j})]\overset{(6),(49)}{\leq}\tau\phi(x)+(1-\tau)\phi(x)=\phi(x).

We have thus shown that (48) and (49) hold for every $j\in{\cal C}_{k}$.

It is worth noting that the proof of (49) is strongly based on the fact that $\Gamma_{j}$ is a convex combination of affine functions whose expected values lie underneath $\phi$. Moreover, this inequality would not necessarily be true if $\Gamma_{j}$ were, for example, the maximum of functions as just described.

The next result provides a useful estimate for the quantity $\phi(x_{j})-\ell(x_{j};x_{j-1},\xi_{j-1})$.

Lemma 5.2.

For every $j\in{\cal C}_{k}$ such that $j\geq i_{k}$, we have:

\phi(x_{j})-\ell(x_{j};x_{j-1},\xi_{j-1})\leq(\bar{M}+\|s_{j-1}\|)\|x_{j}-x_{j-1}\|. (50)

Proof: Using the definitions of $\phi$ and $\ell(\cdot;x,\xi)$ in (1) and (5), respectively, we have

\phi(x_{j})-\ell(x_{j};x_{j-1},\xi_{j-1})=f(x_{j})-f(x_{j-1})-\langle s_{j-1},x_{j}-x_{j-1}\rangle\leq\langle f^{\prime}(x_{j})-s_{j-1},x_{j}-x_{j-1}\rangle

where the inequality is due to the convexity of $f$. The above inequality, the Cauchy-Schwarz inequality, the triangle inequality, and (4) then imply (50).

The technical result below introduces a key quantity, namely, the scalar $t_{j}$ below, and provides a useful recursive relation for it over the iterations of the $k$-th cycle. This recursive relation will then be used in Proposition 5.6 to show that the value of $t_{j}$ at the end of the $k$-th cycle, namely $t_{j_{k}}$, is relatively small in expectation.

Lemma 5.3.

For every $j\geq 1$, define

t_{j}:=u_{j}-\Gamma_{j}^{\lambda}(x_{j}),\quad b_{j+1}:=\frac{\lambda(\bar{M}^{2}+\|s_{j}\|^{2})}{\theta K} (51)

where $\lambda$, $\theta$, and $K$ are as in step 0 of SCPB, and $s_{j}$ is as in (39). Then, for every $j\in{\cal C}_{k}$ such that $j\geq i_{k}+1$, we have

t_{j}\leq\tau t_{j-1}+(1-\tau)b_{j} (52)

where $\tau$ is as in (8), and hence

t_{j}\leq\tau^{j-i_{k}}t_{i_{k}}+(1-\tau)\sum_{i=i_{k}+1}^{j}\tau^{j-i}b_{i}. (53)

Proof: Let $j\in\mathcal{C}_{k}$ with $j\geq i_{k}+1$ be given. It follows from the definitions of $\Gamma_{j}$ and $\Gamma_{j}^{\lambda}$ in (45) and (46), respectively, that

\begin{aligned}
\Gamma_{j}^{\lambda}(x_{j})&=(1-\tau)\ell(x_{j};x_{j-1},\xi_{j-1})+\tau\Gamma_{j-1}(x_{j})+\frac{1}{2\lambda}\|x_{j}-x_{j}^{c}\|^{2}\\
&\geq(1-\tau)\ell(x_{j};x_{j-1},\xi_{j-1})+\tau\left[\Gamma_{j-1}(x_{j})+\frac{1}{2\lambda}\|x_{j}-x_{j-1}^{c}\|^{2}\right]\\
&=(1-\tau)\ell(x_{j};x_{j-1},\xi_{j-1})+\tau\Gamma_{j-1}^{\lambda}(x_{j})\\
&\geq(1-\tau)\ell(x_{j};x_{j-1},\xi_{j-1})+\tau\left[\Gamma_{j-1}^{\lambda}(x_{j-1})+\frac{1}{2\lambda}\|x_{j}-x_{j-1}\|^{2}\right], (54)
\end{aligned}

where the first inequality uses the facts that $\tau<1$ and $x_{j}^{c}=x_{j-1}^{c}$ for $j\in\mathcal{C}_{k}$ with $j\geq i_{k}+1$, while the second inequality is due to the facts that $\Gamma_{j}^{\lambda}$ is $(1/\lambda)$-strongly convex and $x_{j-1}$ is the minimizer of $\Gamma_{j-1}^{\lambda}$ (see (46)). Using (8), (50), and (54), we have

\begin{aligned}
\Gamma_{j}^{\lambda}(x_{j})-\tau\Gamma_{j-1}^{\lambda}(x_{j-1})&\overset{(8),(54)}{\geq}(1-\tau)\left[\ell(x_{j};x_{j-1},\xi_{j-1})+\frac{\theta K}{2\lambda}\|x_{j}-x_{j-1}\|^{2}\right]\\
&\overset{(50)}{\geq}(1-\tau)\phi(x_{j})+(1-\tau)\left[\frac{\theta K}{2\lambda}\|x_{j}-x_{j-1}\|^{2}-(\bar{M}+\|s_{j-1}\|)\|x_{j}-x_{j-1}\|\right]\\
&\geq(1-\tau)\phi(x_{j})-(1-\tau)\frac{\lambda(\bar{M}+\|s_{j-1}\|)^{2}}{2\theta K}
\end{aligned}

where the last inequality is obtained by minimizing the quantity in brackets with respect to $\|x_{j}-x_{j-1}\|$. The above inequality, the fact that $(\alpha_{1}+\alpha_{2})^{2}\leq 2\alpha_{1}^{2}+2\alpha_{2}^{2}$ for every $\alpha_{1},\alpha_{2}\in\mathbb{R}$, and the definition of $b_{j}$ in (51) imply that

\Gamma_{j}^{\lambda}(x_{j})-\tau\Gamma_{j-1}^{\lambda}(x_{j-1})\geq(1-\tau)\phi(x_{j})-(1-\tau)\frac{\lambda(\bar{M}^{2}+\|s_{j-1}\|^{2})}{\theta K}\overset{(51)}{=}(1-\tau)\phi(x_{j})-(1-\tau)b_{j}.

Rearranging the above inequality and using the definition of $t_{j}$ in (51), identity (42), and the fact that $j\geq i_{k}+1$, we then conclude that

\Gamma_{j}^{\lambda}(x_{j})+(1-\tau)b_{j}\geq\tau\Gamma_{j-1}^{\lambda}(x_{j-1})+(1-\tau)\phi(x_{j})\overset{(51)}{=}\tau(u_{j-1}-t_{j-1})+(1-\tau)\phi(x_{j})\overset{(42)}{=}u_{j}-\tau t_{j-1},

which, in view of the definition of $t_{j}$ in (51), implies (52). Inequality (53) follows immediately from (52) and an induction argument.

The following technical result provides some useful bounds on $b_{j}$.

Lemma 5.4.

For every $\ell\geq 0$ and $j\geq\ell+2$, we have

\mathbb{E}[b_{j}\,|\,\xi_{[\ell]}]\leq\frac{2\lambda\bar{M}^{2}}{\theta K},\quad\mathbb{E}[b_{j}]\leq\frac{2\lambda\bar{M}^{2}}{\theta K}. (55)

Proof: We first show that for every $j\geq 1$ and $\ell\leq j-1$,

\mathbb{E}[\|s_{j}\|^{2}\,|\,\xi_{[\ell]}]\leq\bar{M}^{2}. (56)

Fix $j\geq 1$. Since $x_{j}$ becomes deterministic when $\xi_{[j-1]}$ is given, it follows from Assumption 2.1 (A3) with $x=x_{j}$ and the definition of $s_{j}$ in (39) that

\mathbb{E}_{\xi_{j}}[\|s_{j}\|^{2}\,|\,\xi_{[j-1]}]\leq\bar{M}^{2}.

Now, if $\ell\leq j-2$, then the above relation together with the law of total expectation implies that

\mathbb{E}[\|s_{j}\|^{2}\,|\,\xi_{[\ell]}]=\mathbb{E}_{\xi_{[\ell+1:j]}}[\|s_{j}\|^{2}\,|\,\xi_{[\ell]}]=\mathbb{E}_{\xi_{[\ell+1:j-1]}}[\mathbb{E}_{\xi_{j}}[\|s_{j}\|^{2}\,|\,\xi_{[j-1]}]]\leq\bar{M}^{2}.

We have thus shown that (56) holds for any $\ell\leq j-1$.

The first inequality in (55) then follows from the definition of $b_{j}$ in (51). The second inequality in (55) follows from the first one and the law of total expectation.

The next technical result provides a bound, in expectation, on the initial $t_{j}$ for the $k$-th cycle, namely $t_{i_{k}}$.

Lemma 5.5.

For every $k\geq 1$, we have $\mathbb{E}[t_{i_{k}}]\leq 2\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}$, where $i_{k}$ and $t_{j}$ are as in (20) and (51), respectively.

Proof: Let

\Delta_{j}=\Phi(x_{j},\xi_{j})-\phi(x_{j})=F(x_{j},\xi_{j})-f(x_{j}). (57)

Using the definitions of $t_{j}$ and $\Gamma_{j}^{\lambda}$ in (51) and (46), respectively, and (42) with $j=i_{k}=j_{k-1}+1$ (see (20)), we have

\begin{aligned}
t_{i_{k}}&\overset{(51)}{=}u_{i_{k}}-\Gamma_{i_{k}}^{\lambda}(x_{i_{k}})\\
&\overset{(42),(45)}{=}\Phi(x_{i_{k}},\xi_{i_{k}})-\left[F(x_{j_{k-1}},\xi_{j_{k-1}})+\langle s_{j_{k-1}},x_{i_{k}}-x_{j_{k-1}}\rangle+h(x_{i_{k}})\right]-\frac{1}{2\lambda}\|x_{i_{k}}-x_{j_{k-1}}\|^{2}\\
&=\Delta_{i_{k}}-\Delta_{j_{k-1}}+\phi(x_{i_{k}})-\ell(x_{i_{k}};x_{j_{k-1}},\xi_{j_{k-1}})-\frac{1}{2\lambda}\|x_{i_{k}}-x_{j_{k-1}}\|^{2}\\
&\leq\Delta_{i_{k}}-\Delta_{j_{k-1}}+\left(\bar{M}+\|s_{j_{k-1}}\|\right)\|x_{i_{k}}-x_{j_{k-1}}\|-\frac{1}{2\lambda}\|x_{i_{k}}-x_{j_{k-1}}\|^{2} (58)
\end{aligned}

where the inequality is due to Lemma 5.2. Maximizing the right-hand side of the last inequality above with respect to $\|x_{i_{k}}-x_{j_{k-1}}\|$ and using the relation $(a+b)^{2}\leq 2a^{2}+2b^{2}$ for every $a,b\in\mathbb{R}$, we obtain

t_{i_{k}}\leq\Delta_{i_{k}}-\Delta_{j_{k-1}}+\frac{\lambda}{2}\left(\bar{M}+\|s_{j_{k-1}}\|\right)^{2}\leq\Delta_{i_{k}}-\Delta_{j_{k-1}}+\lambda\left(\bar{M}^{2}+\|s_{j_{k-1}}\|^{2}\right). (59)

Moreover, (58) and the fact that $\|x_{i_{k}}-x_{j_{k-1}}\|\leq D_{h}$ also imply that

t_{i_{k}}\leq\Delta_{i_{k}}-\Delta_{j_{k-1}}+\left(\bar{M}+\|s_{j_{k-1}}\|\right)D_{h}. (60)

It follows from (39), (57), and Assumption 2.1 (A2) and (A3) that

\mathbb{E}[\Delta_{i_{k}}]=0,\quad\mathbb{E}[\Delta_{j_{k-1}}]=0,\quad\mathbb{E}[\|s_{j_{k-1}}\|^{2}]\leq\bar{M}^{2}.

Hence, the lemma follows by taking expectations of (59) and (60) and using the above three relations.    

We emphasize that all results developed in this subsection hold regardless of the way $j_{k}$ is chosen in step 2. On the other hand, the results in the following two subsections strongly use the fact that $j_{k}$ is chosen according to either (B1) or (B2).

5.2 Proof of Theorem 3.1

This subsection is devoted to the proof of Theorem 3.1. The following result derives a bound in expectation for $t_{j_{k}}$ when $j_{k}$ is chosen according to cycle rule (B1).

Proposition 5.6.

In addition to Assumption 2.1, assume also that (B1) holds. Then, for every $k\geq 1$, we have

\mathbb{E}[t_{j_{k}}]\leq\frac{2R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K} (61)

where $t_{j}$ is as in (51).

Proof: Fix $k\geq 1$. It follows from cycle rule (B1) and inequality (53) with $j=j_{k}$ that

t_{j_{k}}\leq\tau^{j_{k}-i_{k}}t_{i_{k}}+(1-\tau)\sum_{i=i_{k}+1}^{j_{k}}\tau^{j_{k}-i}b_{i}. (62)

In view of (B1) and (20), it follows that $j_{k}$ and $i_{k}$ are both deterministic. Hence, taking expectation of the above inequality and using the last inequality in (55), cycle rule (B1), and Lemma 5.5, we conclude that

\begin{aligned}
\mathbb{E}[t_{j_{k}}]&\leq\tau^{j_{k}-i_{k}}\mathbb{E}[t_{i_{k}}]+(1-\tau)\sum_{i=i_{k}+1}^{j_{k}}\tau^{j_{k}-i}\mathbb{E}[b_{i}]\\
&\leq\frac{2R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda k}+(1-\tau)\frac{2\lambda\bar{M}^{2}}{\theta K}\sum_{i=i_{k}+1}^{j_{k}}\tau^{j_{k}-i},
\end{aligned}

and hence that (61) holds.    

It is worth noting that rule (B1) plays an important role in showing that the expectation of the first term on the right-hand side of (62) is ${\cal O}(1/k)$. On the other hand, the proof of the ${\cal O}(1/K)$ bound for the expectation of the second term on the right-hand side of (62) does not depend on rule (B1) but on the fact that the expectation of $b_{j}$ is small, namely, ${\cal O}(1/K)$ (see (55) and its definition in (51)). In conclusion, rule (B1) provides a way to estimate the magnitude of $\mathbb{E}[t_{j_{k}}]$, a quantity which by itself is intractable to compute exactly.

In the remaining part of this subsection, we analyze the behavior of the “outer” sequence of iterates $\{\hat{y}_{k}\}=\{y_{j_{k}}\}\subset\mathbb{R}^{n}$ generated in step 2 of SCPB. For this purpose, define

\hat{\Gamma}_{k}:=\Gamma_{j_{k}}\quad\forall k\geq 1 (63)

and

\hat{x}_{k}:=x_{j_{k}},\quad\hat{u}_{k}:=u_{j_{k}}. (64)

In what follows, we make some remarks about the above “outer” sequences which follow as immediate consequences of the results developed above. In view of the above definitions, relation (46) with $j=j_{k}$, and the way the prox-centers $x_{j}^{c}$ are updated in (11), we have that

\hat{x}_{k}=\underset{x\in\mathbb{R}^{n}}{\mathrm{argmin}}\left\{\hat{\Gamma}_{k}(x)+\frac{1}{2\lambda}\|x-\hat{x}_{k-1}\|^{2}\right\}\quad\forall k\geq 1. (65)

Moreover, it follows from (48) and (49) with $j=j_{k}$ that

\mathbb{E}[\phi(\hat{y}_{k})]\leq\mathbb{E}[\hat{u}_{k}] (66)

and

\mathbb{E}[\hat{\Gamma}_{k}(z)]\leq\phi(z)\quad\forall z\in\mathrm{dom}\,h. (67)

The following result describes an important recursive formula for the outer sequence $\{\hat{y}_{k}\}$ generated by SCPB.

Lemma 5.7.

In addition to Assumption 2.1, assume also that (B1) holds. Then, for every $k\geq 1$ and $z\in\mathrm{dom}\,h$, we have

\frac{2R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{1}{2\lambda}[d_{k-1}(z)]^{2}-\frac{1}{2\lambda}[d_{k}(z)]^{2}\geq\mathbb{E}[\phi(\hat{y}_{k})]-\phi(z)

where

d_{k}(z):=\left(\mathbb{E}[\|\hat{x}_{k}-z\|^{2}]\right)^{1/2}. (68)

Proof: First observe that (63), (64), and the definitions of $\Gamma_{j}^{\lambda}$ and $t_{j}$ in (46) and (51), respectively, imply that (61) is equivalent to

\mathbb{E}\left[\hat{u}_{k}-\hat{\Gamma}_{k}(\hat{x}_{k})-\frac{1}{2\lambda}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}\right]\leq\frac{2R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K}. (69)

It follows from (65) and the fact that the objective function of (65) is $(1/\lambda)$-strongly convex that for every $z\in\mathrm{dom}\,h$,

\hat{\Gamma}_{k}(\hat{x}_{k})+\frac{1}{2\lambda}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}\leq\hat{\Gamma}_{k}(z)+\frac{1}{2\lambda}\|z-\hat{x}_{k-1}\|^{2}-\frac{1}{2\lambda}\|z-\hat{x}_{k}\|^{2},

and hence that

\hat{u}_{k}-\hat{\Gamma}_{k}(\hat{x}_{k})-\frac{1}{2\lambda}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}+\frac{1}{2\lambda}\|\hat{x}_{k-1}-z\|^{2}\geq\hat{u}_{k}-\hat{\Gamma}_{k}(z)+\frac{1}{2\lambda}\|\hat{x}_{k}-z\|^{2}.

Taking expectation of the above inequality and using (68) and (69), we conclude that

\frac{2R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{1}{2\lambda}[d_{k-1}(z)]^{2}\geq\mathbb{E}[\hat{u}_{k}]-\mathbb{E}[\hat{\Gamma}_{k}(z)]+\frac{1}{2\lambda}[d_{k}(z)]^{2}

which, in view of (66) and (67), immediately implies the conclusion of the lemma.    

We are now in a position to prove Theorem 3.1.

Proof of Theorem 3.1: a) This statement directly follows from (8), cycle rule (B1), the definition of $\ln_{0}^{+}$, and the facts that $|{\cal C}_{k}|=j_{k}-i_{k}+1$ and $\ln\tau^{-1}\geq 1-\tau$.

b) Using the definition of $\hat{y}_{K}^{a}$ in (19), Lemma 5.7 with $z=x^{*}\in X^{*}$, and the facts that $\lceil K/2\rceil\geq K/2$ and

\sum_{k=\lfloor K/2\rfloor+1}^{K}\frac{1}{k}\leq\int_{\lfloor K/2\rfloor}^{K}\frac{1}{x}\,dx=\ln\frac{K}{\lfloor K/2\rfloor}\leq\ln\frac{K}{K/4}=\ln 4\leq\frac{3}{2}\quad\forall K\geq 2,

we then conclude that for every $K\geq 2$,

\begin{aligned}
\mathbb{E}[\phi(\hat{y}_{K}^{a})]-\phi_{*}&\leq\frac{1}{\lceil K/2\rceil}\sum_{k=\lfloor K/2\rfloor+1}^{K}(\mathbb{E}[\phi(\hat{y}_{k})]-\phi_{*})\\
&\leq\frac{1}{\lceil K/2\rceil}\sum_{k=\lfloor K/2\rfloor+1}^{K}\left(\frac{2R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{1}{2\lambda}[d_{k-1}(x^{*})]^{2}-\frac{1}{2\lambda}[d_{k}(x^{*})]^{2}\right)\\
&\leq\frac{6R\min\{\lambda\bar{M}^{2},\bar{M}D_{h}\}}{\lambda K}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{\left[d_{\lfloor K/2\rfloor}(x^{*})\right]^{2}}{\lambda K} (70)
\end{aligned}

where the first inequality is due to the convexity of $\phi$. It is also easy to see from Lemma 5.7 that (70) holds for $K=1$. Then (24) follows from (70) and the fact that $d_{\lfloor K/2\rfloor}(x^{*})\leq D_{h}$.

The above result relies strongly on the boundedness of $\mathrm{dom}\,h$, as can be seen from the last inequality of its proof.

We end this subsection by describing a complexity bound for a slightly modified SCPB1 variant which is derived without assuming that $\mathrm{dom}\,h$ is bounded. We start by describing the two changes one needs to make to SCPB1 in order to obtain the aforementioned variant. First, instead of the point $\hat{y}_{K}^{a}$ as in (19), it outputs

\bar{y}_{K}^{a}=\frac{1}{K}\sum_{k=1}^{K}\hat{y}_{k}. (71)

Second, instead of computing $j_{k}$ as in (B1), it sets $j_{k}$ to be the smallest integer greater than or equal to $i_{k}$ such that $\lambda K\tau^{j_{k}-i_{k}}\leq R$.
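
Note that this modified rule makes the cycle lengths deterministic and available in closed form: since $\lambda K\tau^{m}\leq R$ if and only if $m\geq\ln(\lambda K/R)/\ln(1/\tau)$, we have

j_{k}-i_{k}=\max\left\{0,\left\lceil\frac{\ln(\lambda K/R)}{\ln(1/\tau)}\right\rceil\right\},

independently of $k$, which is consistent with the bound in statement (a) of the result below.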

Theorem 5.8.

Assume that Assumption 2.1 holds and let $d_{0}$ denote the distance of the initial point $x_{0}$ to the optimal set $X^{*}$, i.e.,

d_{0}:=\|x_{0}-x_{0}^{*}\|,\quad\mbox{where}\quad x_{0}^{*}:=\mathrm{argmin}\,\{\|x_{0}-x^{*}\|:x^{*}\in X^{*}\}. (72)

Then, the aforementioned SCPB1 variant satisfies the following statements:

  • a)

    the number of iterations within each cycle is bounded by

    \left\lceil(\theta K+1)\ln_{0}^{+}\left(\frac{\lambda K}{R}\right)\right\rceil+1;
  • b)

    there holds

    \mathbb{E}[\phi(\bar{y}_{K}^{a})]-\phi_{*}\leq\frac{1}{K}\left(\frac{d_{0}^{2}}{2\lambda}+2R\bar{M}^{2}+\frac{2\lambda\bar{M}^{2}}{\theta}\right).

Proof: (a) This proof is similar to that of Theorem 3.1(a).

(b) First note that the new way of choosing $j_{k}$, together with slightly different arguments than the ones used in the proofs of Proposition 5.6 and Lemma 5.7, implies that for every $z\in\mathrm{dom}\,h$,

\frac{2R\bar{M}^{2}}{K}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{1}{2\lambda}[d_{k-1}(z)]^{2}-\frac{1}{2\lambda}[d_{k}(z)]^{2}\geq\mathbb{E}[\phi(\hat{y}_{k})]-\phi(z).

Using the fact that $\phi$ is convex and the definition of $\bar{y}_{K}^{a}$ in (71), and summing the above inequality from $k=1$ to $K$, we conclude that for every $z\in\mathrm{dom}\,h$,

\mathbb{E}[\phi(\bar{y}_{K}^{a})]-\phi(z)\leq\frac{1}{K}\sum_{k=1}^{K}[\mathbb{E}[\phi(\hat{y}_{k})]-\phi(z)]
\leq\frac{1}{K}\sum_{k=1}^{K}\left(\frac{2R\bar{M}^{2}}{K}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{1}{2\lambda}[d_{k-1}(z)]^{2}-\frac{1}{2\lambda}[d_{k}(z)]^{2}\right)
\leq\frac{2R\bar{M}^{2}}{K}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{[d_{0}(z)]^{2}}{2\lambda K}.

The statement now follows from the above inequality with $z=x_{0}^{*}$, where $x_{0}^{*}$ is as in (72).

5.3 Proof of Theorem 4.1

The following result, which is an analogue of Proposition 5.6 with cycle rule (B1) replaced by (B2), derives a bound on $t_{j_{k}}$ in expectation.

Proposition 5.9.

In addition to Assumption 2.1, assume also that cycle rule (B2) is used. Then, for every $k\geq 1$, we have

\mathbb{E}[t_{j_{k}}]\leq\frac{R}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{2\lambda\bar{M}^{2}}{\theta^{2}K^{2}}. (73)

Proof: Using (53) with $j=j_{k}$, we conclude that

t_{j_{k}}-(1-\tau)\sum_{i=i_{k}+2}^{j_{k}}\tau^{j_{k}-i}b_{i}\overset{(53)}{\leq}\tau^{j_{k}-i_{k}}t_{i_{k}}+(1-\tau)\tau^{j_{k}-i_{k}-1}b_{i_{k}+1}\overset{(B2)}{\leq}\frac{R}{\lambda k}+(1-\tau)b_{i_{k}+1} (74)

where the second inequality is due to cycle rule (B2), the observation that (B2) is equivalent to $\lambda k\tau^{j_{k}-i_{k}}t_{i_{k}}\leq R$, and the fact that $\tau\in(0,1)$ in view of (8). Noting that $j_{k}$ becomes deterministic once $\xi_{[i_{k}]}$ is given, taking the expectation of the above inequality conditioned on $\xi_{[i_{k}]}$, rearranging the terms, and using the first inequality in (55), we have

\mathbb{E}\left[t_{j_{k}}\,|\,\xi_{[i_{k}]}\right]-\frac{R}{\lambda k}-(1-\tau)\mathbb{E}[b_{i_{k}+1}\,|\,\xi_{[i_{k}]}]\leq(1-\tau)\sum_{i=i_{k}+2}^{j_{k}}\tau^{j_{k}-i}\mathbb{E}[b_{i}\,|\,\xi_{[i_{k}]}]\overset{(55)}{\leq}(1-\tau)\left(\sum_{i=i_{k}+2}^{j_{k}}\tau^{j_{k}-i}\right)\frac{2\lambda\bar{M}^{2}}{\theta K}\leq\frac{2\lambda\bar{M}^{2}}{\theta K}.

Taking the expectation of the above inequality with respect to $\xi_{[i_{k}]}$, rearranging the terms, and using the second inequality in (55) and the fact that $1-\tau\leq 1/(\theta K)$ by (8), we conclude that

\mathbb{E}[t_{j_{k}}]\leq\frac{R}{\lambda k}+(1-\tau)\mathbb{E}[b_{i_{k}+1}]+\frac{2\lambda\bar{M}^{2}}{\theta K}\leq\frac{R}{\lambda k}+\frac{1}{\theta K}\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{2\lambda\bar{M}^{2}}{\theta K},

and hence that (73) holds.

It is worth noting that rule (B2) plays an important role in showing that the first term on the right-hand side of (74) is ${\cal O}(1/k)$.

The following result is an analogue of Lemma 5.7 with (B1) replaced by (B2).

Lemma 5.10.

In addition to Assumption 2.1, assume also that cycle rule (B2) is used. Then, for every $z\in\mathrm{dom}\,h$ and $k\geq 1$, we have

\frac{R}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{2\lambda\bar{M}^{2}}{\theta^{2}K^{2}}+\frac{1}{2\lambda}[d_{k-1}(z)]^{2}-\frac{1}{2\lambda}[d_{k}(z)]^{2}\geq\mathbb{E}[\phi(\hat{y}_{k})]-\phi(z)

where $d_{k}(z)$ is as in (68).

Proof: First observe that the definitions of $\Gamma_{j}^{\lambda}$ and $t_{j}$ in (46) and (51), respectively, imply that (73) is equivalent to

\mathbb{E}\left[\hat{u}_{k}-\hat{\Gamma}_{k}(\hat{x}_{k})-\frac{1}{2\lambda}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}\right]\leq\frac{R}{\lambda k}+\frac{2\lambda\bar{M}^{2}}{\theta K}+\frac{2\lambda\bar{M}^{2}}{\theta^{2}K^{2}}. (75)

The remaining part of the proof is now similar to that of Lemma 5.7 except that (75) is used in place of (69).    

We are now ready to prove Theorem 4.1.

Proof of Theorem 4.1: a) Using (8), (21), the definition of $\ln_{0}^{+}$, and the facts that $|{\cal C}_{k}|=j_{k}-i_{k}+1$ and $\ln\tau^{-1}\geq 1-\tau$, we have

|{\cal C}_{k}|\leq\frac{1}{1-\tau}\ln_{0}^{+}\left(\frac{t_{i_{k}}\lambda k}{R}\right)+1=(\theta K+1)\ln_{0}^{+}\left(\frac{t_{i_{k}}\lambda k}{R}\right)+1.

Taking the expectation of the above inequality, and using Jensen's inequality and the fact that $\ln x$ is a concave function, we then conclude that

\mathbb{E}[|{\cal C}_{k}|]\leq(\theta K+1)\mathbb{E}\left[\ln_{0}^{+}\left(\frac{t_{i_{k}}\lambda k}{R}\right)\right]+1\leq(\theta K+1)\ln_{0}^{+}\left(\frac{\mathbb{E}[t_{i_{k}}]\lambda k}{R}\right)+1\leq(\theta K+1)\ln_{0}^{+}\left(\frac{2\bar{M}^{2}\lambda^{2}k}{R}\right)+1,

where the last inequality is due to Lemma 5.5.

b) This statement follows from the same argument as in the proof of Theorem 3.1(b) except that Lemma 5.10 is used in place of Lemma 5.7.

6 Numerical experiments

In this section, we report the results of numerical experiments in which we compare the performance of RSA and our two variants of SCPB on three stochastic programming problems, namely: a stochastic utility problem given in Section 4.2 of [22] and the two two-stage nonlinear stochastic programs considered in the numerical experiments of [10]. These three problems are of the form (1)-(2) with $h$ the indicator function of a convex compact set $X$ with diameter $D_{X}$. Therefore, the problems can be written as

\min\{f(x):=\mathbb{E}[F(x,\xi)]:x\in X\}. (76)

The implementations are coded in MATLAB, using the Mosek optimization library [1] to generate the stochastic oracles $F(x,\xi)$ and $s(x,\xi)$, and run on a laptop with an Intel i7 1.80 GHz processor. For solving subproblem (15), we do not use Mosek but instead implement algorithms for projection onto $X$. In particular, we follow [36] to implement an exact algorithm for projection onto the unit simplex.
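
To make the subproblem solver concrete, the following is a minimal Python transcription of the sorting-based exact projection of [36] onto the unit simplex (our actual implementation is in MATLAB; the function name is ours):

```python
import numpy as np

def project_onto_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1},
    following the sorting-based algorithm of [36]."""
    u = np.sort(v)[::-1]                        # sort entries in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    # rho: largest j such that u_j + (1 - sum_{i<=j} u_i)/j > 0
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1.0)        # shift enforcing sum(x) = 1
    return np.maximum(v + lam, 0.0)
```

The dominant cost is the sort, so each projection takes $O(n\log n)$ operations.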

Parameters for Robust Stochastic Approximation. Robust stochastic approximation, denoted by E-SA (Euclidean Stochastic Approximation) in what follows, is described in Section 2.2 of [22] (as explained in [22], in the terminology of Section 2.3 of that paper, this is mirror descent robust SA with the Euclidean setup). In the notation of [22], for E-SA run for $N$ iterations, we output $\tilde{x}_{1}^{N}=\frac{1}{N}\sum_{i=1}^{N}x_{i}$ (this is $\tilde{x}_{i}^{N}$ given by (29) with $i=1$ and corresponds to the usual output of RSA), where $x_{i}$ is computed at iteration $i$ using the constant step size given in (2.23) of [22],

\gamma_{t}=\frac{\theta D_{X}}{M\sqrt{N}},

where $D_{X}$ is the diameter of the feasible set $X$ in (76). (Parameter $M$ is denoted by $M_{*}$ in [22].) As in [22], we take $\theta=0.1$, which was calibrated in [22] using an instance of the stochastic utility problem. For each problem, the value of $M$ is estimated as in [22] by taking the maximum of $\|s(\cdot,\cdot)\|$ over 10,000 calls to the stochastic oracle at randomly generated feasible solutions.
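
For completeness, the E-SA loop we benchmark against can be sketched as follows (a simplified Python rendition under our notation; `s_oracle` and `project` are hypothetical placeholders for the stochastic oracle and the projection onto $X$):

```python
import numpy as np

def esa(s_oracle, project, x0, N, D_X, M, theta=0.1):
    """Sketch of E-SA: projected stochastic subgradient steps with the
    constant step size gamma = theta * D_X / (M * sqrt(N)) of [22],
    returning the average of the iterates x_1, ..., x_N."""
    gamma = theta * D_X / (M * np.sqrt(N))
    x = x0.copy()
    x_avg = np.zeros_like(x0)
    for _ in range(N):
        x = project(x - gamma * s_oracle(x))   # one projected subgradient step
        x_avg += x / N                         # running average \tilde{x}_1^N
    return x_avg
```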

Remark 6.1.

In [22], E-SA generates approximately $\log_{2}(N)$ candidate solutions $\tilde{x}_{i}^{N}=\frac{1}{N-i+1}\sum_{k=i}^{N}x_{k}$ with $N-i+1=\min[2^{k},N]$, $k=0,1,\ldots,\log_{2}(N)$, and an additional sample is used to estimate the objective at these candidate solutions in order to choose the best of them. In [22], the computational effort required by this postprocessing is not reflected in the experiments. However, we believe that, for a fair comparison between E-SA using this set of candidate solutions and SCPB, this computational effort should be taken into account; even without this additional computational burden, SCPB is already faster than E-SA in our experiments.

Parameters for SCPB1. SCPB1 uses parameters $\theta$, $\tau$, $R$, and $\lambda$ given by

\theta=\frac{C}{K},\quad\tau=\frac{\theta K}{\theta K+1},\quad R=\frac{D_{X}}{M},\quad\lambda=\beta_{1}\frac{\sqrt{C}D_{X}}{M\sqrt{K}}

where $C=9$ and the constant $\beta_{1}$ was calibrated on the stochastic utility problem (see below). We take $\beta_{1}=10$ in all our experiments. The constant $M$ was estimated as for RSA, by taking the maximum of $\|s(\cdot,\cdot)\|$ over 10,000 calls to the stochastic oracle at randomly generated feasible solutions.

Parameters for SCPB2. SCPB2 uses parameters $\theta$, $\tau$, $R$, and $\lambda$ given by

\theta=\frac{C}{K},\quad\tau=\frac{\theta K}{\theta K+1},\quad R=D_{X}^{2},\quad\lambda=\beta_{2}\frac{\sqrt{C}D_{X}}{M\sqrt{K}}

where $C=9$ and the constant $\beta_{2}$ was calibrated on the stochastic utility problem (see below). We take $\beta_{2}=10$ in all our experiments. The constant $M$ was again estimated as for RSA, by taking the maximum of $\|s(\cdot,\cdot)\|$ over 10,000 calls to the stochastic oracle at randomly generated feasible solutions.
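
The parameter choices of the two variants can be summarized in a few lines; the following hypothetical helper mirrors the two displays above:

```python
import numpy as np

def scpb_parameters(variant, K, D_X, M, C=9.0, beta=10.0):
    """Parameters (theta, tau, R, lambda) of SCPB1 (variant=1) and
    SCPB2 (variant=2) as used in our experiments."""
    theta = C / K
    tau = theta * K / (theta * K + 1.0)            # = C / (C + 1)
    lam = beta * np.sqrt(C) * D_X / (M * np.sqrt(K))
    R = D_X / M if variant == 1 else D_X ** 2      # the only difference
    return theta, tau, R, lam
```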

Notation in the tables. In what follows, we denote by

  • $n$: the design dimension of an instance;

  • $N$: the sample size used to run the methods; this is also the number of iterations of E-SA;

  • $K$: the number of SCPB outer iterations;

  • Obj: the empirical mean

    \hat{F}_{T}(x):=\frac{1}{T}\sum_{i=1}^{T}F(x,\xi_{i}) (77)

    of $F$ at $x$ based on a sample $\xi_{1},\ldots,\xi_{T}$ of $\xi$ of size $T$, which provides an estimate of $f(x)$; the empirical means are computed with $x$ being the final iterate output by the algorithm and $T=10^{4}$ (a minimal estimator is sketched after this list);

  • CPU: the CPU time in seconds.
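
A minimal sketch of the estimator (77), where `F` is the stochastic oracle and `xi_sample` is a list of $T$ draws of $\xi$ (both assumed helpers):

```python
import numpy as np

def empirical_mean(F, x, xi_sample):
    """Monte Carlo estimate \hat{F}_T(x) of f(x), as in (77)."""
    return np.mean([F(x, xi) for xi in xi_sample])
```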

6.1 A stochastic utility problem

Our first set of experiments was carried out with the stochastic utility problem given by

\min_{x\in X}\mathbb{E}\left[\phi\left(\sum_{i=1}^{n}\left(\frac{i}{n}+\xi_{i}\right)x_{i}\right)\right]

where

X=\left\{x\in\mathbb{R}^{n}:\sum_{i=1}^{n}x_{i}=1,\,x\geq 0\right\}, (78)

the $\xi_{i}\sim\mathcal{N}(0,1)$ are independent, and $\phi(t)=\max(v_{1}+s_{1}t,\ldots,v_{m}+s_{m}t)$ is piecewise linear convex with 10 breakpoints, all located in $[0,1]$. (Although the same problem class and a similar procedure to build $\phi$ were used in the experiments of Section 4.2 in [22], we could not find in that reference the precise choices of $v_{k},s_{k}$, and the optimal values of our instances differ from those of the instances in [22]. Also, contrary to [22], we use the same function $\phi$ for all instances; the instances differ only in the problem dimension $n$.)
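
A sketch of the corresponding stochastic oracle is given below (Python; the arrays `v` and `s` hold our breakpoint data, which, as noted above, differ from those of [22]):

```python
import numpy as np

def utility_oracle(x, xi, v, s):
    """Returns F(x, xi) and a stochastic subgradient s(x, xi) for the
    stochastic utility problem, where phi(t) = max_k (v_k + s_k * t)."""
    n = x.size
    a = np.arange(1, n + 1) / n + xi      # coefficient vector, entries i/n + xi_i
    t = a @ x
    k_star = np.argmax(v + s * t)         # piece of phi active at t
    F_val = v[k_star] + s[k_star] * t
    subgrad = s[k_star] * a               # chain rule: phi'(t) * grad_x t
    return F_val, subgrad
```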

Calibration of $\beta_{1}$ and $\beta_{2}$. We run SCPB1 and SCPB2 with 7 values of $\beta_{1}$ and $\beta_{2}$ on four instances of the stochastic utility problem for $K=1000$ outer iterations (i.e., cycles) and $n=500$, $1000$, $2000$, and $5000$. For this experiment, the values of $\beta_{1}$ and $\beta_{2}$, the corresponding values of the stepsize $\lambda$, and the optimal values computed by SCPB1 and SCPB2 are reported in Table 1. We found that $\beta_{1}=10$ slightly outperforms the other choices of $\beta_{1}$ for SCPB1. Surprisingly, SCPB2 was not affected by changes in $\beta_{2}$: all tested values yielded a good approximate optimal value with similar CPU times. The values $\beta_{1}=10$ and $\beta_{2}=10$ are used for all runs of SCPB and all problem instances (the stepsizes in [22] were calibrated similarly, on the basis of an instance of the stochastic utility problem).

β1, β2          0.01        0.1         1           10        50        150       1000
λ (n=500)       1.70×10^-5  1.70×10^-4  0.0017      0.017     0.09      0.26      1.7
Obj1            14.2795     10.6439     10.1819     10.1811   10.1811   10.1838   10.1937
Obj2            10.1937     10.1937     10.1937     10.1937   10.1937   10.1937   10.1937
λ (n=10^3)      1.17×10^-5  1.17×10^-4  0.0012      0.012     0.0585    0.18      1.17
Obj1            14.6307     11.1325     10.0510     10.0504   10.0509   10.0523   10.0710
Obj2            10.0710     10.0710     10.0710     10.0710   10.0710   10.0710   10.0710
λ (n=2×10^3)    8.36×10^-6  8.36×10^-5  8.36×10^-4  0.0084    0.0418    0.1255    0.8364
Obj1            13.7451     11.0836     10.0365     10.0363   10.0364   10.0375   10.0613
Obj2            10.0613     10.0613     10.0613     10.0613   10.0613   10.0613   10.0613
λ (n=5×10^3)    7.93×10^-6  7.93×10^-5  7.93×10^-4  0.0079    0.0397    0.119     0.793
Obj1            14.0830     11.3370     10.0228     10.0228   10.0231   10.0237   10.0540
Obj2            10.0540     10.0540     10.0540     10.0540   10.0540   10.0540   10.0540

Table 1: Selecting parameters β1 and β2 of SCPB1 and SCPB2. Framework: SCPB, K=1000 outer iterations, four instances of the stochastic utility problem with n=500, 1000, 2000, and 5000. Obj1 (resp., Obj2) is the approximate optimal value obtained with SCPB1 (resp., SCPB2).

We now run E-SA, SCPB1, and SCPB2 on three instances $L_{1}$, $L_{2}$, and $L_{3}$ of the stochastic utility problem with $n=2000$, $5000$, and $10^{5}$, respectively. For SCPB1 and SCPB2, we use $K=1000$ outer iterations. The results are reported in Table 2. Several comments are now in order on the results reported in this table.

  • For SCPB, approximate solutions can only be computed at the end of every cycle. Namely, at the end of the $L$-th cycle, at iteration $j_{L}$, we can compute the approximate solution

    \frac{1}{\lceil L/2\rceil}\sum_{\ell=\lfloor L/2\rfloor+1}^{L}\hat{y}_{\ell}=\frac{1}{\lceil L/2\rceil}\sum_{\ell=\lfloor L/2\rfloor+1}^{L}y_{j_{\ell}}.

    For a given value of $N$ in Table 2, the approximate objective value Obj we report for E-SA is the empirical mean of $F(x,\xi)$ at the approximate solution $\frac{1}{N}\sum_{i=1}^{N}x_{i}$ (where the $x_{i}$'s are computed along iterations of E-SA), while for SCPB the approximate value Obj is the empirical mean of $F(x,\xi)$ at the approximate solution $\frac{1}{\lceil L(N)/2\rceil}\sum_{\ell=\lfloor L(N)/2\rfloor+1}^{L(N)}\hat{y}_{\ell}$, where

    L(N)=\min\{k:j_{k}\geq N\}

    (since a cycle may not end at iteration $N$).

  • Each iteration of E-SA and SCPB takes a similar amount of time (in both cases we evaluate an inexact prox-operator at some point), and therefore, for a given sample size $N$, the CPU times for E-SA and SCPB are similar.

  • For all instances, SCPB computes a good approximate optimal value much more quickly than E-SA, and the decrease of the objective function value is much slower with E-SA. We also refer to Table 7, which reports for $L_{1}$ and $L_{2}$ the distance between the SCPB approximate optimal value and the E-SA approximate value as a percentage of the SCPB decrease in the objective for several sample sizes. This table confirms the slower convergence of E-SA on these instances.

-       L1: n=2000       L2: n=5000       L3: n=10^5
ALG.    N      Obj      CPU    Obj      CPU    Obj    CPU
E-SA    10     14.6449  0.001  14.6892  0.05   15.4   0.05
        50     14.6322  0.006  14.6813  0.07   14.7   0.35
        100    14.6169  0.01   14.6725  0.1    14.6   0.74
        200    14.5880  0.03   14.6574  0.2    14.6   1.44
        1000   14.3992  0.1    14.5604  0.5    14.3   17.2
        10^4   12.9656  1.28   12.7410  3.7    14.2   80.3
        10^5   -        -      -        -      13.2   860.1
SCPB1   10     13.9539  0.003  13.7763  0.008  14.7   0.08
        50     13.6527  0.01   13.4672  0.02   14.4   0.39
        100    13.5986  0.02   13.5346  0.05   14.2   0.9
        200    13.5349  0.03   13.4686  0.08   14.3   1.6
        1000   13.0370  0.2    12.8376  1.6    14.2   12.5
        10^4   -        -      -        -      12.7   72.3
SCPB2   10     13.5968  0.002  13.7777  0.01   14.2   0.06
        50     12.9421  0.008  12.6959  0.05   13.2   0.7
        100    12.1317  0.02   11.7614  0.09   12.2   1.5
        200    11.3640  0.03   11.3698  0.2    11.6   3.4
        1000   11.5681  0.1    11.5572  0.9    11.2   25.4

Table 2: E-SA versus two variants of SCPB on the stochastic utility problem, run with K=1000 outer iterations.

6.2 A first two-stage stochastic program

Our second test problem is the nonlinear two-stage stochastic program

\left\{\begin{array}{l}\min\;c^{T}x_{1}+\mathbb{E}[\mathfrak{Q}(x_{1},\xi)]\\ x_{1}\in\mathbb{R}^{n}:x_{1}\geq 0,\ \sum_{i=1}^{n}x_{1}(i)=1\end{array}\right. (79)

where the second-stage recourse function is given by

\mathfrak{Q}(x_{1},\xi)=\left\{\begin{array}{l}\min_{x_{2}\in\mathbb{R}^{n}}\;\frac{1}{2}\binom{x_{1}}{x_{2}}^{T}\big(\xi\xi^{T}+\gamma_{0}I_{2n}\big)\binom{x_{1}}{x_{2}}+\xi^{T}\binom{x_{1}}{x_{2}}\\ \mathrm{s.t.}\ \ x_{2}\geq 0,\ \sum_{i=1}^{n}x_{2}(i)=1.\end{array}\right. (80)

Problem (79)-(80) is of the form (1)-(2), where $F(x,\xi)=c^{T}x+\mathfrak{Q}(x,\xi)$ with $\mathfrak{Q}$ given by (80), and where $h$ is the indicator function of the set $X$ given by (78), i.e., the unit simplex. For problem (79), we refer to Lemma 2.1 in [8] for the computation of the stochastic subgradients $s(x,\xi)$. We take $\gamma_{0}=2$ and let $\xi$ be a Gaussian random vector in $\mathbb{R}^{2n}$. We consider two instances of problem (79), with $n=50$ and $n=100$. For each instance, the components of $\xi$ are independent, with means and standard deviations randomly generated in the intervals $[5,25]$ and $[5,15]$, respectively. The components of $c$ are generated randomly in the interval $[1,3]$.
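
To illustrate how the stochastic oracle is evaluated, the following Python sketch solves the second-stage QP for a sampled $\xi$ and returns $F(x_{1},\xi)$ together with a stochastic subgradient; since $x_{1}$ enters (80) only through the objective, the subgradient is the $x_{1}$-block of the gradient of the objective at the optimal $x_{2}$ (cf. Lemma 2.1 in [8]). We use cvxpy here purely for illustration; our experiments rely on Mosek [1].

```python
import numpy as np
import cvxpy as cp

def two_stage_oracle(x1, xi, c, gamma0=2.0):
    """Returns F(x1, xi) = c^T x1 + Q(x1, xi) and a stochastic
    subgradient for problem (79)-(80)."""
    n = x1.size
    H = np.outer(xi, xi) + gamma0 * np.eye(2 * n)    # positive definite
    x2 = cp.Variable(n)
    z = cp.hstack([x1, x2])                          # affine in x2
    obj = 0.5 * cp.quad_form(z, H) + xi @ z
    prob = cp.Problem(cp.Minimize(obj), [x2 >= 0, cp.sum(x2) == 1])
    prob.solve()
    z_star = np.concatenate([x1, x2.value])
    F_val = c @ x1 + prob.value
    subgrad = c + (H @ z_star)[:n] + xi[:n]          # c + grad_{x1} of (80)
    return F_val, subgrad
```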

We run E-SA, SCPB1, and SCPB2 on our two instances $A_{1}$ and $A_{2}$ with $n=50$ and $n=100$, respectively. For SCPB1 and SCPB2, we use $K=1000$ outer iterations. The results are reported in Table 3. The conclusions are similar to those of the experiments on the stochastic utility problem: SCPB computes a good approximate optimal value much more quickly than E-SA, and the decrease of the objective function value is much slower with E-SA. We again refer to Table 7, which reports the distance between the SCPB approximate optimal value and the E-SA approximate value as a percentage of the SCPB decrease in the objective for several sample sizes. This percentage is again above 90% for almost all instances and sample sizes.

-       A1: n=50          A2: n=100
ALG.    N       Obj      CPU     Obj     CPU
E-SA    10      24.3477  0.13    7.5134  0.5
        50      24.2378  0.6     7.5018  2.5
        100     24.0816  1.2     7.4868  5.0
        200     23.7947  3.0     7.4566  10.1
        500     22.9185  8.8     7.3790  25.9
        1000    21.5328  24.6    7.2587  55.5
        2×10^4  8.5482   377     5.1339  1282
        10^5    5.7358   1555.6  3.9193  6147
SCPB1   10      11.5047  0.2     3.0063  1.3
        50      9.2959   0.6     2.7269  3.2
        100     7.2031   1.5     2.4914  6.9
        200     6.4626   2.9     2.2899  13.0
        500     5.3700   7.5     2.0635  39.2
        1000    5.0582   15.1    1.9609  70.4
SCPB2   10      8.6325   0.15    3.3113  0.6
        50      7.8378   0.7     2.2478  3.2
        100     7.8602   1.5     2.1929  6.4
        200     6.5839   3.0     2.2913  13.4
        500     6.0361   7.4     1.9974  33.7
        1000    6.1989   14.9    1.8058  65.1

Table 3: E-SA versus two variants of SCPB on the two-stage stochastic program (79)-(80).

Table 4 reports the impact of overestimating $M$ (taking $M$ equal to 10 times the Monte Carlo estimate $\overline{M}_{e}$) and underestimating $M$ (taking $M$ equal to 10 times smaller than the Monte Carlo estimate $\overline{M}_{e}$). In this experiment, SCPB is essentially not affected by a bad estimate of $M$, while E-SA converges much more slowly when $M$ is overestimated. Additionally, Table 5 reports the computational results for all methods applied to a variant of the two-stage stochastic program (79)-(80) of size $n=50$ where the feasible set of the first-stage problem is replaced by the larger simplex $\{x_{1}\in\mathbb{R}^{n}:x_{1}\geq 0,\sum_{i=1}^{n}x_{1}(i)=100\}$. These results show that the SCPB variants are more efficient than E-SA on this specific instance. Moreover, SCPB is not much affected by an overestimation of the diameter $D_{X}$.

-       M = M̄e           M = 10 M̄e       M = 0.1 M̄e
ALG.    N     Obj   CPU    Obj   CPU    Obj   CPU
E-SA    2000  9.8   29.7   22.1  32.5   10.5  36.2
SCPB1   2000  5.1   36.3   5.2   33.8   5.1   42.2
SCPB2   2000  5.5   31.2   4.6   33.5   5.6   37.2

Table 4: E-SA versus two variants of SCPB on the two-stage stochastic program (79)-(80) with overestimated and underestimated values of M on an instance with n=50.

-       D = D̄                    D = 5 D̄
ALG.    N      Obj          CPU    Obj          CPU
E-SA    2000   1.0338×10^6  33.6   1.0365×10^6  33.8
        10000  8.894×10^5   155.5  1.0055×10^6  160.8
SCPB1   2000   2.894×10^5   35.4   3.029×10^5   35.6
SCPB2   2000   2.889×10^5   34.2   3.003×10^5   30.8

Table 5: E-SA versus SCPB1 and SCPB2 on a variant of the two-stage stochastic program (79)-(80) of size n=50 where the simplex feasible set of the first-stage problem is replaced by the larger feasible set {x1 ∈ R^n : x1 ≥ 0, Σ x1(i) = 100}. The exact value D = DX = D̄ of the diameter is used to solve the first instance, and the overestimated value D = 5D̄ = 5DX is used to solve the second instance.

Finally, we report the SCPB cycle lengths along iterations in the left plot of Figure 1. A few comments are now in order on the cycle lengths of SCPB1 and SCPB2:

  • We observe that the length of the cycles is much larger with SCPB1.

  • For SCPB1, the sequence $\{j_{k}\}$ (and therefore the length of the cycles) can be computed independently of the sequence $\{x_{k}\}$, before running SCPB, once the constant $R$ is known. It is worth mentioning that we have an analytic expression for $j_{k}$ as a function of $\lambda$, $R$, $\tau$, and $k$, namely $j_{k}-i_{k}=0$ if $R\geq\lambda k$ and

    j_{k}-i_{k}=\left\lceil\frac{\log\left(\frac{R}{\lambda k}\right)}{\log\left(\tau\right)}\right\rceil

    otherwise (see the sketch after this list). Therefore, the cycle length with SCPB1 is a piecewise constant nondecreasing function of the outer iteration $k$, and the cardinality of the set of consecutive iterations with constant cycle length increases along the cycles.

  • For SCPB2, the length of the cycles is in general small, with small variability: the average cycle length is 2.2 for the instance with $n=50$ and 2.1 for the instance with $n=100$.
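
The SCPB1 cycle-length schedule can thus be tabulated offline; a minimal Python sketch of the closed-form expression above:

```python
import numpy as np

def scpb1_cycle_lengths(lam, R, tau, K):
    """Cycle lengths |C_k| = j_k - i_k + 1 of SCPB1 under rule (B1),
    from the closed-form expression above; independent of the iterates."""
    ks = np.arange(1, K + 1)
    gaps = np.where(R >= lam * ks, 0,
                    np.ceil(np.log(R / (lam * ks)) / np.log(tau)))
    return gaps.astype(int) + 1
```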

Figure 1: Cycle length for SCPB1 and SCPB2 applied to two-stage stochastic program (79)-(80) (left figure) and two-stage stochastic program (81)-(82) (right figure).

6.3 A second two-stage stochastic program

Our third test problem is the nonlinear two-stage stochastic program

\left\{\begin{array}{l}\min\;c^{T}x_{1}+\mathbb{E}[\mathfrak{Q}(x_{1},\xi)]\\ x_{1}\in\mathbb{R}^{n}:\|x_{1}-x_{0}\|_{2}\leq 100\end{array}\right. (81)

where the cost-to-go function $\mathfrak{Q}(x_{1},\xi)$ has nonlinear objective and constraint coupling functions and is given by

\mathfrak{Q}(x_{1},\xi)=\left\{\begin{array}{l}\min_{x_{2}\in\mathbb{R}^{n}}\;\frac{1}{2}\binom{x_{1}}{x_{2}}^{T}\big(\xi\xi^{T}+\gamma_{0}I_{2n}\big)\binom{x_{1}}{x_{2}}+\xi^{T}\binom{x_{1}}{x_{2}}\\ \mathrm{s.t.}\ \ \|x_{2}-y_{0}\|_{2}^{2}+\|x_{1}-x_{0}\|_{2}^{2}-R^{2}\leq 0.\end{array}\right. (82)

Problem (81)-(82) is of the form (1)-(2), where $F(x,\xi)=c^{T}x+\mathfrak{Q}(x,\xi)$ with $\mathfrak{Q}$ given by (82), and where $h$ is the indicator function of the set

X=\{x\in\mathbb{R}^{n}:\|x-x_{0}\|_{2}\leq 100\}.

For problem (81), we again refer to Lemma 2.1 in [8] for the computation of the stochastic subgradients $s(x,\xi)$. We take $\gamma_{0}=2$ and let $\xi$ be a Gaussian random vector in $\mathbb{R}^{2n}$ whose components are independent, with means and standard deviations randomly generated in the intervals $[-5,5]$ and $[0,10]$, respectively. The components of $c$ are generated randomly in the interval $[-1,1]$, and we take $R=200$, $x_{0}(i)=10$, and $y_{0}(i)=1$ for $i=1,\ldots,n$.

We run E-SA, SCPB1, and SCPB2 on two instances $B_{1}$ and $B_{2}$ with $n=50$ and $n=100$, respectively. For SCPB1 and SCPB2, we use $K=1500$ outer iterations. The results are reported in Tables 6 and 7. The conclusions are similar to those of the experiments on the stochastic utility problem: it still takes much longer for E-SA to compute a solution of a given accuracy. We also report, in the right plot of Figure 1, the evolution of the cycle lengths along outer iterations. The behavior of these cycle lengths is similar to what was observed for the previous problem (79)-(80). For SCPB2, the cycle lengths remain small at all iterations and almost constant.

-       B1: n=50         B2: n=100
ALG.    N      Obj     CPU    Obj    CPU
E-SA    10     15182   0.2    18571  0.6
        50     15108   0.9    18497  4.0
        100    15017   0.9    18405  7.4
        200    14836   1.8    18222  13.9
        1000   13481   3.4    16830  63.8
        10^5   99.8    16.2   177.5  6275
SCPB1   10     2981.5  0.2    4914   0.8
        50     761.2   0.8    1574   2.8
        100    288.1   1.7    679    5.7
        200    64.8    2.9    191    10.2
        1000   -4.38   14.3   -7.87  55.5
SCPB2   10     1400.9  0.13   2766   0.5
        50     0.45    0.5    17.5   2.1
        100    -4.38   1.0    -7.88  4.1
        200    -4.38   1.9    -7.95  8.1
        1000   -4.38   9.0    -7.95  42.2

Table 6: E-SA versus two variants of SCPB on the two-stage stochastic program (81)-(82).

6.4 Summarizing performance indicators

The computational results reported in Subsections 6.1-6.3 show that SCPB computes a good approximate solution more quickly than E-SA. To properly quantify the speed-up over a fixed number of iterations $N$, we compute the quantity

100\,\frac{{\tt Obj}\mbox{(E-SA)}-{\tt Obj}\mbox{(SCPBi)}}{\hat{F}_{T}(x_{0})-{\tt Obj}\mbox{(SCPBi)}} (83)

associated with SCPBi, where $\hat{F}_{T}(x_{0})$ is the empirical mean (see (77)) of $F$ with $T=10^{4}$ and $x$ equal to the initial point $x_{0}$, and ${\tt Obj}$(E-SA), ${\tt Obj}$(SCPB1), and ${\tt Obj}$(SCPB2) are the empirical means of $F$ with $T=10^{4}$ and $x$ equal to the final iterates output by E-SA, SCPB1, and SCPB2, respectively. We see that after $N=1000$ iterations this percentage is above 90% for most instances, which clearly shows that both variants of SCPB are faster than E-SA.
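
In code, the indicator (83) is simply the following (where `obj_esa`, `obj_scpb`, and `f0_hat` stand for the three empirical means just described):

```python
def speedup_percentage(obj_esa, obj_scpb, f0_hat):
    """Performance indicator (83): the E-SA/SCPB objective gap as a
    percentage of the total objective decrease achieved by SCPB."""
    return 100.0 * (obj_esa - obj_scpb) / (f0_hat - obj_scpb)
```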

Sample size N   10    50    100   200   1000
L1, SCPB1       95.0  95.2  94.1  91.9  82.8
L1, SCPB2       96.6  97.2  97.5  97.2  90.9
L2, SCPB1       92.1  93.3  92.3  91.5  77.7
L2, SCPB2       92.1  95.8  96.8  96.7  93.5
A1, SCPB1       99.5  98.8  98.1  96.6  85.1
A1, SCPB2       99.6  99.0  98.0  96.5  84.2
A2, SCPB1       99.8  99.6  99.3  98.8  95.0
A2, SCPB2       99.8  99.6  99.3  98.8  95.4
B1, SCPB1       99.8  99.4  98.8  97.6  88.7
B1, SCPB2       99.9  99.4  98.8  97.6  88.7
B2, SCPB1       99.9  99.4  99.0  98.0  90.5
B2, SCPB2       99.9  99.5  99.0  98.0  90.5

Table 7: Percentages (83) for SCPB1 and SCPB2.

7 Concluding remarks

This paper proposes two single-cut stochastic composite proximal bundle variants, called SCPB, for solving the SCCO problem (1)-(2), where at each iteration a subproblem of the form (3) is solved. The two SCPB variants, which differ in the way their cycle lengths are determined, are analyzed in Sections 3 and 4, respectively. More specifically, it is shown that both variants of SCPB with properly chosen parameters have optimal iteration complexity (up to a logarithmic term) for finding an $\varepsilon$-solution of (1) for a large range of prox stepsizes. Practical variants of SCPB which keep their cycle lengths bounded are also proposed, and numerical experiments demonstrating their excellent performance against the RSA method of [22] on the instances considered in this paper are reported.

Comparison with other methods. First, we have shown in Subsection 3.2 that RSA is a special case of SCPB1 which performs only one iteration per cycle. Second, it is worth noting that SCPB bears some similarity to the stochastic dual averaging (SDA) method discussed in [25, 38], since both methods explore the idea of aggregating cuts into a single one. However, there are essential differences between the two methods, namely: 1) while SCPB updates the prox-centers whenever a serious iteration occurs, SDA uses a fixed prox-center, and hence only performs null iterations; and 2) SDA uses variable stepsizes which have to grow sufficiently large, while SCPB uses constant prox stepsizes. Thus, from the viewpoint of this paper, SDA is closest to the special case of SCPB with a single cycle and a sufficiently large prox stepsize; the difference between the latter two methods is that SDA allows the prox stepsizes within its single cycle to gradually become sufficiently large.

In summary, while RSA (resp., SDA) performs only serious (resp., null) iterations, SCPB performs a balanced mix of serious and null iterations. Hence, it is reasonable to conclude that SCPB lies between RSA and SDA.

Extensions. We finally discuss some possible extensions of the analysis in this paper. A first question is how to extend SCPB and the corresponding complexity analysis if, instead of Assumption 2.1 (A3), we use the assumption that for every $u,v\in\mathrm{dom}\,h$, we have

\|f^{\prime}(u)-f^{\prime}(v)\|\leq 2M+L\|u-v\|,

which is called a uniform $(M,L)$-condition in [20]. A second natural question is how to extend SCPB and its complexity analysis when the prox stepsize $\lambda$ and/or the parameter $\theta$ are allowed to change with the iteration count $k$. Recalling that the prox stepsize is the only ingredient of the second variant of SCPB that depends on an estimate $M$ of $\bar{M}$, a third natural question is whether it is possible to develop an SCPB variant which adaptively chooses a (variable) prox stepsize without the need to know $M$. A fourth question is whether it is possible to establish global convergence rate guarantees for proximal bundle methods based on two-cut or multiple-cut bundle models instead of the single-cut bundle models considered in this paper. Finally, SCPB is able to solve two-stage convex stochastic programs with continuous distributions, under the assumption that the second-stage subproblems can be solved exactly (e.g., see Subsections 6.2 and 6.3). It would be interesting to extend it to the setting of multistage stochastic convex problems with continuous distributions.

References

  • [1] E. D. Andersen and K.D. Andersen. The MOSEK optimization toolbox for MATLAB manual. Version 9.2, 2019. https://www.mosek.com/documentation/.
  • [2] A. Astorino, A. Frangioni, A. Fuduli, and E. Gorgone. A nonmonotone proximal bundle method with (potentially) continuous step decisions. SIAM Journal on Optimization, 23(3):1784–1809, 2013.
  • [3] J. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer-Verlag, New York, 1997.
  • [4] J.R. Birge and F.V. Louveaux. A multicut algorithm for two-stage stochastic linear programs. European Journal of Operational Research, 34:384–392, 1988.
  • [5] M. Díaz and B. Grimmer. Optimal convergence rates for the proximal bundle method. SIAM Journal on Optimization, 33(2):424–454, 2023.
  • [6] Y. Du and A. Ruszczyński. Rate of convergence of the bundle method. Journal of Optimization Theory and Applications, 173(3):908–922, 2017.
  • [7] A. Frangioni. Generalized bundle methods. SIAM Journal on Optimization, 13(1):117–156, 2002.
  • [8] V. Guigues. Convergence analysis of sampling-based decomposition methods for risk-averse multistage stochastic convex programs. SIAM Journal on Optimization, 26:2468–2494, 2016.
  • [9] V. Guigues. Multistep stochastic mirror descent for risk-averse convex stochastic programs based on extended polyhedral risk measures. Mathematical Programming, 163:169–212, 2017.
  • [10] V. Guigues. Inexact Stochastic Mirror Descent for two-stage nonlinear stochastic programs. Mathematical Programming, 187:533–577, 2021.
  • [11] V. Guigues, A. Juditsky, and A. Nemirovski. Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optimization Methods & Software, 32:1033–1058, 2017.
  • [12] V. Guigues, W. Tekaya, and M. Lejeune. Regularized decomposition methods for deterministic and stochastic convex optimization and application to portfolio selection with direct transaction and market impact costs. Optimization & Engineering, 21:1133–1165, 2020.
  • [13] J.L. Higle and S. Sen. Stochastic Decomposition. Kluwer, Dordrecht, 1996.
  • [14] A.J. King and R.T. Rockafellar. Asymptotic theory for solutions in statistical estimation and stochastic programming. Math. Oper. Res., 18:148–162, 1993.
  • [15] K. C. Kiwiel. Efficiency of proximal bundle methods. Journal of Optimization Theory and Applications, 104(3):589–603, 2000.
  • [16] A. J. Kleywegt, A. Shapiro, and T. Homem-de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.
  • [17] C. Lemaréchal. An extension of davidon methods to non differentiable problems. In Nondifferentiable optimization, pages 95–109. Springer, 1975.
  • [18] C. Lemaréchal. Nonsmooth optimization and descent methods. 1978.
  • [19] J. Liang and R. D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. SIAM Journal on Optimization, 31(4):2955–2986, 2021.
  • [20] J. Liang and R. D. C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 2023.
  • [21] R. Mifflin. A modification and an extension of Lemaréchal’s algorithm for nonsmooth minimization. In Nondifferential and variational techniques in optimization, pages 77–90. Springer, 1982.
  • [22] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19:1574–1609, 2009.
  • [23] A. Nemirovski and D. Yudin. On Cezari’s convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Math. Dokl., 19, 1978.
  • [24] A. Nemirovski and D.B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
  • [25] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
  • [26] W. de Oliveira, C. Sagastizábal, and C. Lemaréchal. Convex proximal bundle methods in depth: a unified analysis for inexact oracles. Mathematical Programming, 148(1-2):241–277, 2014.
  • [27] B.T. Polyak. New stochastic approximation type procedures. Automat. i Telemekh (English translation: Automation and Remote Control), 7:98–107, 1990.
  • [28] B.T. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Contr. and Optim., 30:838–855, 1992.
  • [29] H. Robbins and S. Monro. A stochastic approximation method. Annals of Math. Stat., 22:400–407, 1951.
  • [30] A. Ruszczyński. Nonlinear optimization. Princeton university press, 2011.
  • [31] A. Shapiro. Asymptotic analysis of stochastic programs. Ann. Oper. Res., 30:169–186, 1991.
  • [32] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia, 2009.
  • [33] R.M. Van Slyke and R.J.-B. Wets. L-shaped linear programs with applications to optimal control and stochastic programming. SIAM Journal of Applied Mathematics, 17:638–663, 1969.
  • [34] W. van Ackooij, V. Berge, W. de Oliveira, and C. Sagastizábal. Probabilistic optimization via approximate p-efficient points and bundle methods. Computers & Operations Research, 77:177–193, 2017.
  • [35] B. Verweij, S. Ahmed, A. J. Kleywegt, G. Nemhauser, and A. Shapiro. The sample average approximation method applied to stochastic routing problems: a computational study. Computational Optimization and Applications, 24(2-3):289–333, 2003.
  • [36] W. Wang and M. A. Carreira-Perpinán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. Available on arXiv:1309.1541, 2013.
  • [37] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable optimization, pages 145–173. Springer, 1975.
  • [38] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(88):2543–2596, 2010.