
Parameter-free proximal bundle methods with adaptive
stepsizes for hybrid convex composite optimization problems

Renato D.C. Monteiro, Honghao Zhang. School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205 (email: [email protected] and [email protected]). This work was partially supported by AFOSR Grant FA9550-22-1-0088.
(October 27, 2023)
Abstract

This paper develops a parameter-free adaptive proximal bundle method with two important features: 1) an adaptive choice of variable prox stepsizes that "closely fits" the instance under consideration; and 2) an adaptive criterion that makes the occurrence of serious steps easier. Computational experiments show that our method performs substantially fewer consecutive null steps (i.e., shorter cycles) while keeping the number of serious steps under control. As a result, our method performs significantly fewer iterations than its counterparts based on a constant prox stepsize and a non-adaptive cycle termination criterion. Moreover, our method is very robust with respect to the user-provided initial stepsize.

Key words. hybrid convex composite optimization, iteration-complexity, adaptive stepsize, parameter-free proximal bundle methods.

AMS subject classifications. 49M37, 65K05, 68Q25, 90C25, 90C30, 90C60

1 Introduction

Let f,h:n{+}f,h:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\} be proper lower semi-continuous convex functions such that domhdomf\mathrm{dom}\,h\subseteq\mathrm{dom}\,f and consider the optimization problem

ϕ:=min{ϕ(x):=f(x)+h(x):xn}.\phi_{*}:=\min\left\{\phi(x):=f(x)+h(x):x\in\mathbb{R}^{n}\right\}. (1)

It is said that (1) is a hybrid convex composite optimization (HCCO) problem if there exist nonnegative scalars M,LM,L and a first-order oracle f:domhnf^{\prime}:\mathrm{dom}\,h\to\mathbb{R}^{n} (i.e., f(x)f(x)f^{\prime}(x)\in\partial f(x) for every xdomhx\in\mathrm{dom}\,h) satisfying f(u)f(v)2M+Luv\|f^{\prime}(u)-f^{\prime}(v)\|\leq 2M+L\|u-v\| for every u,vdomhu,v\in\mathrm{dom}\,h. The main goal of this paper is to study the complexity of adaptive proximal bundle methods (Ad-GPB) for solving the HCCO problem (1) based on a unified bundle update scheme.

Proximal bundle (PB) methods solve a sequence of prox bundle subproblems

xj=argminun{Γj(u)+12λjuxc2},x_{j}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\Gamma_{j}(u)+\frac{1}{2{\lambda}_{j}}\|u-x^{c}\|^{2}\right\}, (2)

where Γj\Gamma_{j} is a bundle approximation of ϕ\phi (i.e., a simple convex function underneath ϕ\phi) and xcx^{c} is the current prox center. The prox center is updated to xjx_{j} (i.e., a serious step is performed) only when the pair (xj,λj)(x_{j},{\lambda}_{j}) satisfies a certain error criterion; otherwise, the prox center is kept the same (i.e., a null step is performed). Regardless of the step performed, the bundle Γj\Gamma_{j} is updated to account for the newest iterate xjx_{j}. In the discussion below, a sequence of consecutive null steps followed by a serious step is referred to as a cycle. Classical PB methods (see e.g. [5, 6, 12, 15, 16, 21, 27]) perform the serious step when xjx_{j} satisfies a relaxed descent condition (e.g., see the paragraph containing equation (15) in [18]), which in its unrelaxed form implies that ϕ(xj)ϕ(xc)\phi(x_{j})\leq\phi(x^{c}). On the other hand, modern PB methods (see e.g. [8, 18, 19, 20]) perform the serious step when the best ϕ\phi-valued iterate xjx_{j}, say yjy_{j}, satisfies ϕ(yj)mjδ\phi(y_{j})-m_{j}\leq\delta where mjm_{j} is the optimal value of (2) and δ\delta is a suitably chosen tolerance. Although yjy_{j} does not necessarily satisfy the descent condition, it does satisfy a δ\delta-relaxed version of it. It is shown in [18, 19] that if λ>0{\lambda}>0 is such that max{λ,λ1}=𝒪(ε1)\max\{{\lambda},{\lambda}^{-1}\}={\cal O}(\varepsilon^{-1}), then modern PB methods with λj=λ{\lambda}_{j}={\lambda} for every jj achieve an 𝒪~(ε2)\tilde{\cal O}(\varepsilon^{-2}) iteration complexity to obtain an ε\varepsilon-solution regardless of whether domh\mathrm{dom}\,h is bounded or not. In contrast, papers [5, 12] show that the classical PB methods achieve: i) an 𝒪(ε3){\cal O}(\varepsilon^{-3}) iteration complexity under the assumption that λ=Θ(1){\lambda}=\Theta(1) regardless of whether domh\mathrm{dom}\,h is bounded or not; and ii) an 𝒪(ε2){\cal O}(\varepsilon^{-2}) iteration complexity under the assumption that λ=Θ(ε1){\lambda}=\Theta(\varepsilon^{-1}) for the case where domh\mathrm{dom}\,h is bounded.
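To make the role of subproblem (2) concrete, the snippet below is a minimal sketch (ours, not part of the paper) of a single prox bundle subproblem solve. It assumes a polyhedral bundle of affine cuts for f and takes h to be the indicator of a box; the data cuts_g, cuts_b, xc, and lam are illustrative placeholders, and cvxpy is used only for convenience.

```python
import cvxpy as cp
import numpy as np

# Illustrative data: two affine cuts l_f(.; c) = <g_c, .> + b_c for f,
# h = indicator of the box [-1, 1]^2, current prox center xc, stepsize lam.
cuts_g = np.array([[1.0, 0.0], [0.0, 1.0]])   # subgradients f'(c) at bundle points c
cuts_b = np.array([0.0, -0.5])                # intercepts f(c) - <f'(c), c>
xc = np.array([0.5, 0.5])
lam = 1.0

u = cp.Variable(2)
bundle = cp.max(cuts_g @ u + cuts_b)          # polyhedral part of Gamma_j(u)
objective = bundle + cp.sum_squares(u - xc) / (2 * lam)
prob = cp.Problem(cp.Minimize(objective), [u >= -1, u <= 1])  # box encodes h
prob.solve()

x_j = u.value        # next iterate, the minimizer in (2)
m_j = prob.value     # optimal value of (2), used by modern PB cycle tests
```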

The goal of this paper is to develop a parameter-free adaptive modern PB method, namely Ad-GPB, with two important features: 1) an adaptive choice of variable prox stepsizes that "closely fits" the instance under consideration; and 2) an adaptive criterion for making the occurrence of serious steps easier. Computational experiments show that Ad-GPB performs substantially fewer consecutive null steps while keeping the number of serious steps under control. As a result, Ad-GPB performs significantly fewer iterations than the modern PB methods of [8, 18, 19]. Moreover, in contrast to GPB, Ad-GPB is very robust with respect to the user-provided initial stepsize.

Several papers (see e.g. [2, 4, 5, 11, 13, 17], of which only [5] deals with complexity analysis) have proposed ways of generating variable prox stepsizes to improve the computational performance of classical PB methods. More recently, [8] developed a modern PB method for solving either the convex or strongly convex version of (1) which: requires no knowledge of the Lipschitz parameters (M,L)(M,L) and the strong convexity parameter μ\mu of ϕ\phi; and allows the stepsize to change only at the beginning of each cycle. A potential drawback of the method of [8] is that it can restart a cycle with its current initial prox stepsize λ{\lambda} divided by two if λ{\lambda} is found to be large, i.e., the method can backtrack. In contrast, by allowing the prox stepsizes to vary within a cycle, Ad-GPB never has to restart a cycle.

In theory, classical PB methods perform on average 𝒪(ε2){\cal O}(\varepsilon^{-2}) consecutive null iterations while modern PB methods perform only 𝒪(ε1){\cal O}(\varepsilon^{-1}) consecutive null iterations in the worst case. This phenomenon is due to the more relaxed δ\delta-criterion used by modern PB methods to end a cycle. Our Ad-GPB method pursues the idea of further relaxing the cycle termination criterion to reduce its overall number of iterations, and hence improve its computational performance while retaining all the theoretical guarantees of the modern PB methods of [8, 18, 19]. More specifically, under the simplifying assumption that ϕ\phi_{*} is known, an Ad-GPB cycle stops when, for some universal constant β(0,1]\beta\in(0,1], the inequality ϕ(yj)mjδ+β[ϕ(yj)ϕ]\phi(y_{j})-m_{j}\leq\delta+\beta[\phi(y_{j})-\phi_{*}] is satisfied. The addition of the (usually large) term β[ϕ(yj)ϕ]\beta[\phi(y_{j})-\phi_{*}] makes this inequality easier to satisfy, thereby resulting in Ad-GPB performing shorter cycles. Even though the previous observation assumes that ϕ\phi_{*} is known, Ad-GPB removes this assumption, at the expense of assuming that the domain of hh is bounded, by replacing ϕ\phi_{*} in the above inequality with a suitable lower bound on ϕ\phi_{*}.

Organization of the paper. Subsection 1.1 presents basic definitions and notation used throughout the paper. Section 2 contains two subsections. Subsection 2.1 formally describes problem (1) and the assumptions made on it. Subsection 2.2 presents a generic bundle update scheme, the Ad-GPB framework, and states the main iteration-complexity result for Ad-GPB. Section 3 provides a bound on the number of iterations within a cycle of Ad-GPB. Section 4 contains two subsections. The first (resp., second) one establishes bounds on the number of cycles and the total number of iterations performed by Ad-GPB under the assumption that ϕ\phi_{*} is known (resp., unknown). Section 5 presents the numerical results comparing Ad-GPB with the two-cut bundle update scheme against other modern PB methods.

1.1 Basic definitions and notation

The sets of real numbers and positive real numbers are denoted by \mathbb{R} and ++\mathbb{R}_{++}, respectively. Let n\mathbb{R}^{n} denote the standard nn-dimensional Euclidean space equipped with inner product and norm denoted by ,\left\langle\cdot,\cdot\right\rangle and \|\cdot\|, respectively. Let log()\log(\cdot) denote the natural logarithm and log+()\log^{+}(\cdot) denote max{log(),0}\max\{\log(\cdot),0\}. Let 𝒪{\cal O} denote the standard big-O notation.

For a given function φ:n(,+]\varphi:\mathbb{R}^{n}\rightarrow(-\infty,+\infty], let domφ:={xn:φ(x)<}\mathrm{dom}\,\varphi:=\{x\in\mathbb{R}^{n}:\varphi(x)<\infty\} denote the effective domain of φ\varphi; the function φ\varphi is said to be proper if domφ\mathrm{dom}\,\varphi\neq\emptyset. A proper function φ:n(,+]\varphi:\mathbb{R}^{n}\rightarrow(-\infty,+\infty] is convex if

φ(αx+(1α)y)αφ(x)+(1α)φ(y)\varphi(\alpha x+(1-\alpha)y)\leq\alpha\varphi(x)+(1-\alpha)\varphi(y)

for every x,ydomφx,y\in\mathrm{dom}\,\varphi and α[0,1]\alpha\in[0,1]. Denote the set of all proper lower semicontinuous convex functions by Conv¯(n)\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}).

The subdifferential of φ\varphi at xdomφx\in\mathrm{dom}\,\varphi is denoted by

φ(x):={sn:φ(y)φ(x)+s,yx,yn}.\partial\varphi(x):=\left\{s\in\mathbb{R}^{n}:\varphi(y)\geq\varphi(x)+\left\langle s,y-x\right\rangle,\forall y\in\mathbb{R}^{n}\right\}. (3)

The set of proper closed convex functions Γ\Gamma such that Γϕ\Gamma\leq\phi is denoted by (ϕ){\cal B}(\phi) and any such Γ\Gamma is called a bundle for ϕ\phi.

The sign function sign:nn\mathrm{sign}:\mathbb{R}^{n}\to\mathbb{R}^{n} is defined as

sign(x)i={1,if xi<0, 0,if xi=0, 1,if xi>0.\rm sign(x)_{i}=\begin{cases}-1,&\mbox{if $x_{i}<0$},\\ \ \ 0,&\mbox{if $x_{i}=0$},\\ \ \ 1,&\mbox{if $x_{i}>0$}.\end{cases}

2 Main problem and algorithm

This section contains two subsections. The first one describes the main problem and corresponding assumptions. The second one presents the motivation and the description of Ad-GPB, as well as its main complexity result.

2.1 Main problem

The problem of interest in this paper is (1) which is assumed to satisfy the following conditions for some constants M0M\geq 0 and L0L\geq 0:

  • (A1)

    hConv¯(n)h\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) and there exists D0D\geq 0 such that

    supx,ydomhyxD;\sup_{x,y\in\mathrm{dom}\,h}\|y-x\|\leq D; (4)
  • (A2)

    fConv¯(n)f\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) is such that domhdomf\mathrm{dom}\,h\subset\mathrm{dom}\,f, and a subgradient oracle, i.e., a function f:domhnf^{\prime}:\mathrm{dom}\,h\to\mathbb{R}^{n} satisfying f(x)f(x)f^{\prime}(x)\in\partial f(x) for every xdomhx\in\mathrm{dom}\,h, is available;

  • (A3)

    for every x,ydomhx,y\in\mathrm{dom}\,h,

    f(x)f(y)2M+Lxy.\|f^{\prime}(x)-f^{\prime}(y)\|\leq 2M+L\|x-y\|.

In addition to the above assumptions, it is also assumed that hh is simple in the sense that, for any λ>0{\lambda}>0 and affine function 𝒜{\cal A}, the following two optimization problems

minu𝒜(u)+h(u),minu𝒜(u)+h(u)+12λu2\min_{u}{\cal A}(u)+h(u),\qquad\min_{u}{\cal A}(u)+h(u)+\frac{1}{2{\lambda}}\|u\|^{2} (5)

are easy to solve.
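To illustrate what "simple" means here, the sketch below (our own, assuming the illustrative choice h = indicator of a box and an affine 𝒜(u) = ⟨g,u⟩ + c) solves both problems in (5) in closed form: the first by picking the cheapest endpoint of each coordinate, the second by clipping the unconstrained minimizer onto the box.

```python
import numpy as np

def solve_affine_plus_box(g, c, lb, ub, lam=None):
    """Minimal sketch of the two subproblems in (5) for A(u) = <g, u> + c and
    h = indicator of the box [lb, ub] (an assumed, illustrative simple h)."""
    if lam is None:
        # min_u <g, u> + c over the box: coordinatewise, take lb_i if g_i > 0, else ub_i.
        u = np.where(g > 0, lb, ub)
        val = g @ u + c
    else:
        # min_u <g, u> + c + ||u||^2 / (2 * lam): the objective is separable, so the
        # minimizer is the unconstrained solution -lam * g projected onto the box.
        u = np.clip(-lam * g, lb, ub)
        val = g @ u + c + (u @ u) / (2 * lam)
    return u, val
```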

We now make three remarks about assumptions (A1)-(A3). First, it can be shown that (A1) implies that both problems in (5) have optimal solutions. Second, it can also be shown that (A1) implies that the set of optimal solutions XX^{*} of problem (1) is nonempty. Third, letting ~f(;x)\tilde{\ell}_{f}(\cdot;x) denote the linearization of ff at xx, i.e.,

~f(;x):=f(x)+f(x),xxdomh,\tilde{\ell}_{f}(\cdot;x):=f(x)+\langle f^{\prime}(x),\cdot-x\rangle\quad\forall x\in\mathrm{dom}\,h, (6)

then it is well-known that (A3) implies that for every x,ydomhx,y\in\mathrm{dom}\,h,

f(x)~f(x;y)2Mxy+L2xy2.f(x)-\tilde{\ell}_{f}(x;y)\leq 2M\|x-y\|+\frac{L}{2}\|x-y\|^{2}. (7)

Finally, define the composite linearization of the objective ϕ\phi of (1) at xx as

ϕ(;x):=~f(;x)+h()xdomh.\ell_{\phi}(\cdot;x):=\tilde{\ell}_{f}(\cdot;x)+h(\cdot)\quad\forall x\in\mathrm{dom}\,h. (8)
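As a quick numerical sanity check of (6) and (7) (this snippet is ours, not the paper's), the smooth case f(x) = ½‖Ax−b‖² satisfies (A3) with M = 0 and L = ‖AᵀA‖₂, and the gap between f and its linearization is indeed bounded by (L/2)‖x−y‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

def f(x):          # smooth convex f, so (A3) holds with M = 0
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def fprime(x):     # exact first-order oracle f'(x) = grad f(x)
    return A.T @ (A @ x - b)

M, L = 0.0, np.linalg.norm(A.T @ A, 2)   # constants in (A3)

x, y = rng.standard_normal(3), rng.standard_normal(3)
lin = f(y) + fprime(y) @ (x - y)         # linearization (6) evaluated at x
gap = f(x) - lin
assert gap <= 2 * M * np.linalg.norm(x - y) + 0.5 * L * np.linalg.norm(x - y) ** 2 + 1e-12
```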

2.2 Algorithm

As mentioned in the Introduction, PB uses a bundle (convex) function underneath ϕ()\phi(\cdot) to construct subproblem (2) at a given iteration, and then updates Γj\Gamma_{j} to obtain the bundle function Γj+1\Gamma_{j+1} for the next iteration. This subsection describes ways of updating the bundle. Instead of focusing on a specific bundle update scheme, this subsection describes a generic bundle update framework (BUF) which is a restricted version of the one introduced in Subsection 3.1 of [19]. It also discusses two concrete bundle update schemes lying within the framework.

We start by describing the generic BUF.

 

BUF

  Input: λ++{\lambda}\in\mathbb{R}_{++} and (xc,x,Γ)n×n×(ϕ)(x^{c},x,\Gamma)\in\mathbb{R}^{n}\times\mathbb{R}^{n}\times{\cal B}(\phi) such that

x=argminun{Γ(u)+12λuxc2},x=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\Gamma(u)+\frac{1}{2{\lambda}}\|u-x^{c}\|^{2}\right\}, (9)
  • find bundle Γ+(ϕ)\Gamma^{+}\in{\cal B}(\phi) satisfying

    Γ+()max{Γ¯(),ϕ(;x)},\Gamma^{+}(\cdot)\geq\max\{\bar{\Gamma}(\cdot),\ell_{\phi}(\cdot;x)\}, (10)

    for some Γ¯()Conv¯(n)\bar{\Gamma}(\cdot)\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) such that

    Γ¯(x)=Γ(x),x=argminun{Γ¯(u)+12λuxc2}.\bar{\Gamma}(x)=\Gamma(x),\quad x=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\bar{\Gamma}(u)+\frac{1}{2{\lambda}}\|u-x^{c}\|^{2}\right\}. (11)

Output: Γ+\Gamma^{+}.

 

Now we make some remarks about BUF. First, observe that if Γ(ϕ)\Gamma\in{\cal B}(\phi) and Γϕ(;x¯)\Gamma\geq\ell_{\phi}(\cdot;\bar{x}) for some x¯n\bar{x}\in\mathbb{R}^{n}, then domΓ=domh\mathrm{dom}\,\Gamma=\mathrm{dom}\,h. Hence, it follows from (10) and the definition of (ϕ){\cal B}(\phi) that the output Γ+\Gamma^{+} of BUF satisfies

Γ+ϕ,Γ+Conv¯(n),domΓ+=domh.\Gamma^{+}\leq\phi,\quad\Gamma^{+}\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}),\quad\mathrm{dom}\,\Gamma^{+}=\mathrm{dom}\,h.

Second, the bundle update framework of [19] replaces (10) by the weaker inequality Γ+()τΓ¯()+(1τ)ϕ(;x)\Gamma^{+}(\cdot)\geq\tau\bar{\Gamma}(\cdot)+(1-\tau)\ell_{\phi}(\cdot;x) for some τ(0,1)\tau\in(0,1) and, as a result, contains the one-cut bundle update scheme described in Subsection 3.1 of [19]. Even though BUF does not include the one-cut bundle update scheme, it contains the two other bundle update schemes discussed in [19] (see Subsection 3.1), which for convenience are briefly described below.

  • 2-cut: For this scheme, it is assumed that Γ\Gamma has the form

    Γ=max{Af,~f(;x)}+h\Gamma=\max\{A_{f},\tilde{\ell}_{f}(\cdot;x^{-})\}+h (12)

    where hConv(n)h\in\mbox{\rm Conv}(\mathbb{R}^{n}) and AfA_{f} is an affine function satisfying AffA_{f}\leq f. In view of (9), it can be shown that there exists θ[0,1]\theta\in[0,1] such that

    1λ(xxc)+h(x)+θAf+(1θ)f(x)0,\displaystyle\frac{1}{{\lambda}}(x-x^{c})+\partial h(x)+\theta\nabla A_{f}+(1-\theta)f^{\prime}(x^{-})\ni 0, (13)
θAf(x)+(1θ)~f(x;x)=max{Af(x),~f(x;x)}.\displaystyle\theta A_{f}(x)+(1-\theta)\tilde{\ell}_{f}(x;x^{-})=\max\{A_{f}(x),\tilde{\ell}_{f}(x;x^{-})\}. (14)

    The scheme then sets

    Af+():=θAf()+(1θ)~f(;x)A_{f}^{+}(\cdot):=\theta A_{f}(\cdot)+(1-\theta)\tilde{\ell}_{f}(\cdot;x^{-}) (15)

    and outputs the function Γ+\Gamma^{+} defined as

    Γ+():=max{Af+(),~f(;x)}+h().\Gamma^{+}(\cdot):=\max\{A_{f}^{+}(\cdot),\tilde{\ell}_{f}(\cdot;x)\}+h(\cdot). (16)
  • multiple-cut (M-cut): Suppose Γ=Γ(;C)\Gamma=\Gamma(\cdot;C) where CnC\subset\mathbb{R}^{n} is a finite set (i.e., the current bundle set) and Γ(;C)\Gamma(\cdot;C) is defined as

    Γ(;C):=max{~f(;c):cC}+h().\Gamma(\cdot;C):=\max\{\tilde{\ell}_{f}(\cdot;c):c\in C\}+h(\cdot). (17)

    This scheme chooses the next bundle set C+C^{+} so that

    C(x){x}C+C{x}C(x)\cup\{x\}\subset C^{+}\subset C\cup\{x\} (18)

    where

    C(x):={cC:~f(x;c)+h(x)=Γ(x)},C(x):=\{c\in C:\tilde{\ell}_{f}(x;c)+h(x)=\Gamma(x)\}, (19)

    and then output Γ+=Γ(;C+)\Gamma^{+}=\Gamma(\cdot;C^{+}).

The following facts, whose proofs can be found in Appendix D of [19], imply that 2-cut and M-cut schemes are special implementations of BUF:

  • (a)

    If Γ+\Gamma^{+} is obtained according to 2-cut, then (Γ+,Γ¯)(\Gamma^{+},\bar{\Gamma}) where Γ¯=Af++h\bar{\Gamma}=A_{f}^{+}+h satisfies (10) and (11);

  • (b)

    If Γ+\Gamma^{+} is obtained according to M-cut, then (Γ+,Γ¯)(\Gamma^{+},\bar{\Gamma}) where Γ¯=Γ(;C(x))\bar{\Gamma}=\Gamma(\cdot;C(x)) satisfies (10) and (11).
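For illustration, here is a minimal Python sketch (our own, with an assumed callable interface for f, f′, and h) of the M-cut rule (17)-(19): it keeps the cuts active at x and appends x itself, which is the smallest admissible choice of C⁺ in (18).

```python
import numpy as np

def m_cut_update(C, x, f, fprime, h, tol=1e-10):
    """Smallest admissible M-cut update: C+ = C(x) ∪ {x}, cf. (17)-(19).
    C is a list of bundle points (numpy arrays); f, fprime, h are callables."""
    # values of the cuts l_f(x; c) and of the bundle Gamma(x; C) = max cut + h(x)
    cut_vals = [f(c) + fprime(c) @ (x - c) for c in C]
    gamma_x = max(cut_vals) + h(x)
    # cuts that are active at x, i.e., the set C(x) in (19)
    active = [c for c, v in zip(C, cut_vals) if v + h(x) >= gamma_x - tol]
    return active + [x]
```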

We next give an outline of Ad-GPB. Ad-GPB is an inexact proximal point method which, given the (k1)(k-1)-th prox center x^k1n\hat{x}_{k-1}\in\mathbb{R}^{n}, finds a pair (x^k,λ^k)(\hat{x}_{k},\hat{\lambda}_{k}) of prox stepsize λ^k>0\hat{\lambda}_{k}>0 and kk-th prox-center x^k\hat{x}_{k} satisfying a suitable error criterion for being an approximate solution of the prox subproblem

x^kargminun{ϕ(u)+12λ^kux^k12}.\hat{x}_{k}\approx\underset{u\in\mathbb{R}^{n}}{\operatorname{argmin}}\left\{\phi(u)+\frac{1}{2\hat{\lambda}_{k}}\left\|u-\hat{x}_{k-1}\right\|^{2}\right\}. (20)

More specifically, Ad-GPB solves a sequence of prox bundle subproblems

xj=argminun{Γj(u)+12λjux^k12},x_{j}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\Gamma_{j}(u)+\frac{1}{2{\lambda}_{j}}\|u-\hat{x}_{k-1}\|^{2}\right\}, (21)

where Γj\Gamma_{j} is a bundle approximation of ϕ\phi and λjλj1{\lambda}_{j}\leq{\lambda}_{j-1} is an adaptively chosen prox stepsize, until the pair (x^k,λ^k)=(xj,λj)(\hat{x}_{k},\hat{\lambda}_{k})=(x_{j},{\lambda}_{j}) satisfies the approximate error criterion for (20). In contrast to the GPB method of [19], which can also be viewed in the setting outlined above, Ad-GPB: i) (adaptively) changes λj{\lambda}_{j} while computing the next prox center x^k\hat{x}_{k}; and ii) stops the search for the next prox center x^k\hat{x}_{k} using a termination criterion based not only on the user-provided tolerance (the quantity ε\varepsilon in the description below), as GPB does, but also on a suitable primal-dual gap for (20), a feature that considerably speeds up the computation of x^k\hat{x}_{k} for many subproblems (20).

We now formally describe Ad-GPB. Its description uses the definition of the set of bundles (ϕ){\cal B}(\phi) for the function ϕ\phi given in Subsection 1.1.

 

Ad-GPB

 

  • 0.

    Let x^0domh\hat{x}_{0}\in\mathrm{dom}\,h, λ1>0{\lambda}_{1}>0, 0β01/20\leq\beta_{0}\leq 1/2, τ(0,1)\tau\in(0,1), and ε>0\varepsilon>0 be given; find Γ1(ϕ)\Gamma_{1}\in{\cal B}(\phi) such that Γ1ϕ(;x^0)\Gamma_{1}\geq\ell_{\phi}(\cdot;\hat{x}_{0}) and set ^0=minuΓ1(u)\hat{\ell}_{0}=\min_{u}\Gamma_{1}(u), y0=x^0y_{0}=\hat{x}_{0}, j0=0j_{0}=0, j=k=1j=k=1, and n^0=^0\hat{n}_{0}=\hat{\ell}_{0};

  • 1.

    compute xjx_{j} as in (21) and

    mj\displaystyle m_{j} :=Γj(xj)+12λjxjx^k12\displaystyle:=\Gamma_{j}(x_{j})+\frac{1}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2} (22)
    yj\displaystyle y_{j} :=argmin{ϕ(x):x{yj1,xj}}\displaystyle:=\mathrm{argmin}\,\left\{\phi(x):x\in\{y_{j-1},x_{j}\}\right\} (23)
    tj\displaystyle t_{j} :=ϕ(yj)mj;\displaystyle:=\phi(y_{j})-m_{j}; (24)
  • 2.

    if tjβk1[ϕ(yj)n^k1]+ε/4t_{j}\leq\beta_{k-1}[\phi(y_{j})-\hat{n}_{k-1}]+\varepsilon/4 is violated then perform a null update, i.e.:
           if either j=jk1+1j=j_{k-1}+1 or

    tjτtj1(1τ){βk1[ϕ(yj)n^k1]2+ε8},t_{j}-\tau t_{j-1}\leq(1-\tau)\left\{\frac{\beta_{k-1}[\phi(y_{j})-\hat{n}_{k-1}]}{2}+\frac{\varepsilon}{8}\right\}, (25)

           then set λj+1=λj{\lambda}_{j+1}={\lambda}_{j}; else, set λj+1=λj/2{\lambda}_{j+1}={\lambda}_{j}/2;
           set Γj+1=BUF(x^k1,xj,Γj,λj)\Gamma_{j+1}=\mbox{BUF}(\hat{x}_{k-1},x_{j},\Gamma_{j},\lambda_{j});
    else perform a serious update, i.e.:
           set (λ^k,x^k,y^k,Γ^k,m^k,t^k)=(λj,xj,yj,Γj,mj,tj)(\hat{\lambda}_{k},\hat{x}_{k},\hat{y}_{k},\hat{\Gamma}_{k},\hat{m}_{k},\hat{t}_{k})=(\lambda_{j},x_{j},y_{j},\Gamma_{j},m_{j},t_{j});
           compute

    Γ^ka(u):=l=k/2kλ^lΓ^l(u)l=k/2kλ^l,^k\displaystyle\hat{\Gamma}^{a}_{k}(u):=\frac{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}\hat{\Gamma}_{l}(u)}{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}},\quad\hat{\ell}_{k} :=max{^k1,infuΓ^ka(u)},\displaystyle:=\max\left\{\hat{\ell}_{k-1},\inf_{u}\hat{\Gamma}_{k}^{a}(u)\right\}, (26)

           and choose n^k[^k,ϕ]\hat{n}_{k}\in[\hat{\ell}_{k},\phi_{*}];
           if ϕ(y^k)n^kε\phi(\hat{y}_{k})-\hat{n}_{k}\leq\varepsilon, output (x^k,y^k)(\hat{x}_{k},\hat{y}_{k}), and stop; else compute

    ϕ^ka\displaystyle\hat{\phi}^{a}_{k} :=l=k/2kλ^lϕ(y^l)l=k/2kλ^l,\displaystyle:=\frac{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}\phi(\hat{y}_{l})}{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}, (27)
g^k\displaystyle\hat{g}_{k} :=1l=k/2kλ^ll=k/2kβl1λ^l[ϕ(y^l)n^l1];:=\frac{1}{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}\sum_{l=\lceil k/2\rceil}^{k}\beta_{l-1}\hat{\lambda}_{l}[\phi(\hat{y}_{l})-\hat{n}_{l-1}]; (28)

           if g^k(ϕ^kan^k)/2\hat{g}_{k}\leq(\hat{\phi}^{a}_{k}-\hat{n}_{k})/2, then set βk=βk1\beta_{k}=\beta_{k-1}; else set βk=βk1/2\beta_{k}=\beta_{k-1}/2;
           set λj+1=λj{\lambda}_{j+1}={\lambda}_{j} and find Γj+1(ϕ)\Gamma_{j+1}\in{\cal B}(\phi) such that Γj+1ϕ(;xj)\Gamma_{j+1}\geq\ell_{\phi}(\cdot;x_{j});
           set jk=jj_{k}=j and kk+1k\leftarrow k+1;
    end if

  • 3.

    set jj+1j\leftarrow j+1 and go to step 1.

 

We now introduce some terminology related to Ad-GPB. Ad-GPB performs two types of iterations, namely, null and serious, corresponding to the kinds of updates performed at the end. The index jj counts the iterations (including null and serious). Let j1j2j_{1}\leq j_{2}\leq\ldots denote the sequence of all serious iterations (i.e., the ones ending with a serious update) and, for every k1k\geq 1, define ik=jk1+1i_{k}=j_{k-1}+1 and the kk-th cycle 𝒞k{\cal C}_{k} as

𝒞k:={ik,,jk}.{\cal C}_{k}:=\{i_{k},\ldots,j_{k}\}. (29)

Observe that for every k1k\geq 1, we have λ^k=λjk\hat{\lambda}_{k}={\lambda}_{j_{k}} where λ^k\hat{\lambda}_{k} is computed in the serious update part of step 2 of Ad-GPB. (Hence, index kk counts the cycles generated by Ad-GPB.) An iteration jj is called good (resp., bad) if λj+1=λj{\lambda}_{j+1}={\lambda}_{j} (resp., λj+1=λj/2{\lambda}_{j+1}={\lambda}_{j}/2). Note that the logic of Ad-GPB implies that iki_{k} and jkj_{k} are good iterations and that (25) is violated whenever jj is a bad iteration.
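To summarize the stepsize logic of a null update in code form, the following is a sketch of the corresponding branch of step 2 (the variable names are ours): the stepsize is kept on a good iteration and halved on a bad one, i.e., whenever (25) is violated.

```python
def null_update_stepsize(lam_j, t_j, t_prev, first_of_cycle,
                         phi_yj, n_prev, beta_prev, tau, eps):
    """Sketch of the null-update branch of step 2 of Ad-GPB: keep lambda on a
    good iteration, halve it on a bad one (i.e., when (25) is violated)."""
    good = first_of_cycle or (
        t_j - tau * t_prev
        <= (1 - tau) * (beta_prev * (phi_yj - n_prev) / 2 + eps / 8)
    )
    return lam_j if good else lam_j / 2
```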

We next make some remarks about the quantities related to different Γ\Gamma-functions that appear in Ad-GPB and the associated quantities ^k\hat{\ell}_{k} and n^k\hat{n}_{k}. First, the observation immediately following BUF implies that

Γjϕ,ΓjConv¯(n),domΓj=domhj1,\Gamma_{j}\leq\phi,\quad\Gamma_{j}\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}),\quad\mathrm{dom}\,\Gamma_{j}=\mathrm{dom}\,h\quad\forall j\geq 1, (30)

which together with the fact that Γ^k\hat{\Gamma}_{k} is the last Γj\Gamma_{j} generated within a cycle imply that Γ^kConv¯(n)\hat{\Gamma}_{k}\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) and Γ^kϕ\hat{\Gamma}_{k}\leq\phi. Moreover, the first identity in (26) and the latter conclusion then imply that Γ^kaConv¯(n)\hat{\Gamma}_{k}^{a}\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) and Γ^kaϕ\hat{\Gamma}^{a}_{k}\leq\phi, and hence that infuΓ^ka(u)infuϕ(u)=ϕ\inf_{u}\hat{\Gamma}^{a}_{k}(u)\leq\inf_{u}\phi(u)=\phi_{*}. Second, the facts that domΓ^ka=domh\mathrm{dom}\,\hat{\Gamma}^{a}_{k}=\mathrm{dom}\,h is bounded (see assumption A1) and Γ^ka\hat{\Gamma}^{a}_{k} is a closed convex function imply that infuΓ^ka(u)>\inf_{u}\hat{\Gamma}^{a}_{k}(u)>-\infty. Moreover, the problem infuΓ^ka(u)\inf_{u}\hat{\Gamma}^{a}_{k}(u) has the same format as the first one that appears in (5), and hence is easily solvable by assumption. Its optimal value, which is a lower bound on ϕ\phi_{*} as already observed above, is used to update the lower bound ^k1\hat{\ell}_{k-1} for ϕ\phi_{*} to a possibly sharper one, namely, ^k^k1\hat{\ell}_{k}\geq\hat{\ell}_{k-1}. Thus, the choice of n^k\hat{n}_{k} in the line following (26) makes sense. For the sake of future reference, we note that

ϕn^k^kinfuΓ^ka(u).\phi_{*}\geq\hat{n}_{k}\geq\hat{\ell}_{k}\geq\inf_{u}\hat{\Gamma}_{k}^{a}(u). (31)

Third, obvious ways of choosing n^k\hat{n}_{k} in the interval [^k,ϕ][\hat{\ell}_{k},\phi_{*}] are: i) n^k=ϕ\hat{n}_{k}=\phi_{*}; and ii) n^k=^k\hat{n}_{k}=\hat{\ell}_{k}. While choice i) requires knowledge of ϕ\phi_{*}, choice ii) does not and can be easily implemented in view of the previous remark. Moreover, if ϕ\phi_{*} is known and n^k\hat{n}_{k} is chosen as in i), then there is no need to compute ^k\hat{\ell}_{k}, and hence the max term in (26), at the end of every cycle.

We now make some other relevant remarks about Ad-GPB. First, it follows from (30) and the definition of xjx_{j} in (21) that xjdomhx_{j}\in\mathrm{dom}\,h for every j1j\geq 1. Second, an induction argument using (23) and the fact that y0=x^0domhy_{0}=\hat{x}_{0}\in\mathrm{dom}\,h imply that yj{x^0,x1,,xj}domhy_{j}\in\{\hat{x}_{0},x_{1},\ldots,x_{j}\}\subset\mathrm{dom}\,h and

yjArgmin{ϕ(x):x{x^0,x1,,xj}}y_{j}\in\mathrm{Argmin}\,\left\{\phi(x):x\in\{\hat{x}_{0},x_{1},\ldots,x_{j}\}\right\} (32)

(hence, ϕ(yj+1)ϕ(yj)\phi(y_{j+1})\leq\phi(y_{j})) for every j1j\geq 1. Third, the cycle-stopping criterion, i.e., the inequality in the first line of step 2, is a relaxation of the one used by the GPB method of [19], in the sense that its right-hand side has the extra term βk1[ϕ(yj)n^k1]\beta_{k-1}[\phi(y_{j})-\hat{n}_{k-1}] involving the relaxation factor βk1\beta_{k-1}. The addition of this term allows earlier cycles to terminate in fewer inner iterations, and hence speeds up the overall performance of the method. The quantities in (27) and (28) are used to update βk1\beta_{k-1} at the end of the kk-th cycle (see the ’if’ statement after (28)). Fourth, the condition imposed on Γj+1\Gamma_{j+1} at the end of a serious iteration (see the second line below (28)) does not completely specify it. An obvious way (cold start) of choosing this Γj+1\Gamma_{j+1} is to set it to be ϕ(;xj)\ell_{\phi}(\cdot;x_{j}); another way (warm start) is to choose it using the update rule of a null iteration, i.e., Γj+1=BUF(x^k1,xj,Γj,λj)\Gamma_{j+1}=\mbox{BUF}(\hat{x}_{k-1},x_{j},\Gamma_{j},\lambda_{j}) since (10) implies the required condition on Γj+1\Gamma_{j+1}.

We now comment on the inexactness of y^k\hat{y}_{k} as a solution of prox subproblem (20) and as a solution of (1) upon termination of Ad-GPB. The fact that Γ^kϕ\hat{\Gamma}_{k}\leq\phi and the fact that t^k=tjk\hat{t}_{k}=t_{j_{k}} imply that the primal gap of (20) at y^k\hat{y}_{k} is upper bounded by t^k+y^kx^k12/(2λ^k)\hat{t}_{k}+\left\|\hat{y}_{k}-\hat{x}_{k-1}\right\|^{2}/(2\hat{\lambda}_{k}). Hence, if the inequality for stopping the cycle in step 2 holds, then we conclude that y^k\hat{y}_{k} is an εk\varepsilon_{k}-solution of (20), where

εk:=ε4+βk1[ϕ(y^k)n^k1]+y^kx^k122λ^k.\varepsilon_{k}:=\frac{\varepsilon}{4}+\beta_{k-1}[\phi(\hat{y}_{k})-\hat{n}_{k-1}]+\frac{\left\|\hat{y}_{k}-\hat{x}_{k-1}\right\|^{2}}{2\hat{\lambda}_{k}}.

Finally, if the test inequality before (27) in step 2 holds, then the second component y^k\hat{y}_{k} of the pair output by Ad-GPB satisfies ϕ(y^k)ϕε\phi(\hat{y}_{k})-\phi_{*}\leq\varepsilon due to the fact that n^kϕ\hat{n}_{k}\leq\phi_{*}.

Lastly, Ad-GPB never restarts a cycle, i.e., it never attempts to inexactly solve two or more subproblems (20) with the same prox center x^k1\hat{x}_{k-1}. Instead, Ad-GPB has a key rule for updating the inner stepsize λj{\lambda}_{j} which always allows it to inexactly solve subproblem (20) with λ^k\hat{\lambda}_{k} set to the last λj{\lambda}_{j} generated within the kk-th cycle (see the second line of the serious update part of Ad-GPB).

We now state the main complexity result for Ad-GPB whose proof is postponed to the end of Section 4.

Theorem 2.1

Define

t¯\displaystyle\bar{t} :=2MD+L2D2,\displaystyle:=2MD+\frac{L}{2}D^{2}, (33)
K¯(ε)\displaystyle\bar{K}(\varepsilon) :=22D2Q¯ε(M2ε+L16+1λ1Q¯)+log+{β0(ϕ(x^0)n^0)ε}+1\displaystyle:=2\left\lceil\frac{2D^{2}\bar{Q}}{\varepsilon}\left(\frac{M^{2}}{\varepsilon}+\frac{L}{16}+\frac{1}{{\lambda}_{1}\bar{Q}}\right)+\log^{+}\left\{\frac{\beta_{0}(\phi(\hat{x}_{0})-\hat{n}_{0})}{\varepsilon}\right\}+1\right\rceil (34)

where

Q¯=128(1τ)τ.\bar{Q}=\frac{128(1-\tau)}{\tau}. (35)

Then, Ad-GPB finds a pair (y^k,n^k)domϕ×(\hat{y}_{k},\hat{n}_{k})\in\mathrm{dom}\,\phi\times\mathbb{R} satisfying ϕ(y^k)ϕϕ(y^k)n^kε\phi(\hat{y}_{k})-\phi_{*}\leq\phi(\hat{y}_{k})-\hat{n}_{k}\leq\varepsilon in at most 4K¯(ε)4\bar{K}(\varepsilon) cycles and

4K¯(ε)(1+τ1τlog+[8t¯ε1]+2)+log2+(Q¯λ1(M2ε+L16))4\bar{K}(\varepsilon)\left(\frac{1+\tau}{1-\tau}\log^{+}\left[8\bar{t}\varepsilon^{-1}\right]+2\right)+\log_{2}^{+}\left(\bar{Q}{\lambda}_{1}\left(\frac{M^{2}}{\varepsilon}+\frac{L}{16}\right)\right) (36)

iterations.

We now make some remarks about Theorem 2.1. First, in terms of τ\tau and ε\varepsilon only, it follows from Theorem 2.1 that the iteration complexity of Ad-GPB to find an ε\varepsilon-solution of (1) is 𝒪~(ε2τ1+(1τ)1)\tilde{\cal O}\left(\varepsilon^{-2}\tau^{-1}+(1-\tau)^{-1}\right). Hence, when τ(0,1)\tau\in(0,1) satisfies τ1=𝒪(1)\tau^{-1}={\cal O}(1) and τ=1Ω(ε2)\tau=1-\Omega(\varepsilon^{2}), the total iteration complexity of Ad-GPB is 𝒪~(ε2)\tilde{\cal O}(\varepsilon^{-2}). Moreover, under the assumption that τ(0,1)\tau\in(0,1) satisfies τ1=𝒪(1)\tau^{-1}={\cal O}(1), Ad-GPB performs

  • 𝒪(ε2){\cal O}(\varepsilon^{-2}) cycles whenever τ=1Θ(1)\tau=1-\Theta(1);

  • more generally, 𝒪(εα){\cal O}(\varepsilon^{-\alpha}) cycles whenever τ=1Θ(ε2α)\tau=1-\Theta(\varepsilon^{2-\alpha}) for some α[0,2]\alpha\in[0,2].
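A rough sketch of the mechanism behind the last bullet (keeping only the Q¯\bar{Q}-dependent leading term of K¯(ε)\bar{K}(\varepsilon) in (34), treating DD and MM as constants, and using τ1=𝒪(1)\tau^{-1}={\cal O}(1)): since Q¯=128(1τ)/τ=Θ(1τ)\bar{Q}=128(1-\tau)/\tau=\Theta(1-\tau), the choice 1τ=Θ(ε2α)1-\tau=\Theta(\varepsilon^{2-\alpha}) gives

\frac{2D^{2}\bar{Q}}{\varepsilon}\cdot\frac{M^{2}}{\varepsilon}={\cal O}\left(\frac{(1-\tau)D^{2}M^{2}}{\varepsilon^{2}}\right)={\cal O}\left(\varepsilon^{2-\alpha}\cdot\varepsilon^{-2}\right)={\cal O}(\varepsilon^{-\alpha}).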

3 Bounding cycle lengths

The main goal of this section is to derive a bound (Proposition 3.5 below) on the number of iterations within a cycle.

Recall from (29) that iki_{k} (resp., jkj_{k}) denotes the first (resp., last) iteration index of the kk-th cycle of Ad-GPB. The first result describes some basic facts about the iterations within any given cycle.

Lemma 3.1

For every j𝒞k{ik}j\in{\cal C}_{k}\setminus\{i_{k}\}, the following statements hold:

  • a)

    there exists function Γ¯j1()\bar{\Gamma}_{j-1}(\cdot) such that

max{Γ¯j1(),~f(;xj1)+h()}Γj()ϕ(),\displaystyle\max\left\{\bar{\Gamma}_{j-1}(\cdot),\tilde{\ell}_{f}(\cdot;x_{j-1})+h(\cdot)\right\}\leq\Gamma_{j}(\cdot)\leq\phi(\cdot), (37)
    Γ¯j1Conv(n),Γ¯j1(xj1)=Γj1(xj1),\displaystyle\bar{\Gamma}_{j-1}\in\mbox{\rm Conv}(\mathbb{R}^{n}),\quad\bar{\Gamma}_{j-1}(x_{j-1})=\Gamma_{j-1}(x_{j-1}), (38)
    xj1=argminun{Γ¯j1(u)+12λj1ux^k12};\displaystyle x_{j-1}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\bar{\Gamma}_{j-1}(u)+\frac{1}{2{\lambda}_{j-1}}\|u-\hat{x}_{k-1}\|^{2}\right\}; (39)
  • b)

    for every unu\in\mathbb{R}^{n}, we have

    Γ¯j1(u)+12λj1ux^k12mj1+12λj1uxj12.\bar{\Gamma}_{j-1}(u)+\frac{1}{2{\lambda}_{j-1}}\|u-\hat{x}_{k-1}\|^{2}\geq m_{j-1}+\frac{1}{2{\lambda}_{j-1}}\|u-x_{j-1}\|^{2}. (40)

Proof: a) This statement immediately follows from (10), (11), and the facts that Γj\Gamma_{j} is the output of the BUF blackbox with input λj1{\lambda}_{j-1} and (xc,x,Γ)=(xj1c,xj1,Γj1)(x^{c},x,\Gamma)=(x_{j-1}^{c},x_{j-1},\Gamma_{j-1}) and xj1c=x^k1x_{j-1}^{c}=\hat{x}_{k-1}.

b) Using (39) and the fact that the function Γ¯j1+x^k12/(2λj1)\bar{\Gamma}_{j-1}+\|\cdot-\hat{x}_{k-1}\|^{2}/(2{\lambda}_{j-1}) is λj11{\lambda}_{j-1}^{-1}-strongly convex, we have for every udomhu\in\mathrm{dom}\,h,

Γ¯j1(u)+12λj1ux^k12Γ¯j1(xj1)+12λj1xj1x^k12+12λj1uxj12.\bar{\Gamma}_{j-1}(u)+\frac{1}{2{\lambda}_{j-1}}\|u-\hat{x}_{k-1}\|^{2}\geq\bar{\Gamma}_{j-1}(x_{j-1})+\frac{1}{2{\lambda}_{j-1}}\|x_{j-1}-\hat{x}_{k-1}\|^{2}+\frac{1}{2{\lambda}_{j-1}}\|u-x_{j-1}\|^{2}.

The statement follows from the above inequality and the second identity in (38).    

The next result presents some basic recursive inequalities for {tj}\{t_{j}\}.

Lemma 3.2

For every j𝒞k{ik}j\in{\cal C}_{k}\setminus\{i_{k}\}, the following statements hold:

  • a)

    for every τ[0,1]\tau^{\prime}\in[0,1], there holds

    tjτtj12M(1τ)xjxj1(τ2λj1(1τ)L2)xjxj121τ2λjxjx^k12;t_{j}-\tau^{\prime}t_{j-1}\leq 2M(1-\tau^{\prime})\|x_{j}-x_{j-1}\|-\left(\frac{\tau^{\prime}}{2{\lambda}_{j-1}}-\frac{(1-\tau^{\prime})L}{2}\right)\|x_{j}-x_{j-1}\|^{2}-\frac{1-\tau^{\prime}}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2};
  • b)

    if λj1τ/(2(1τ)L){\lambda}_{j-1}\leq\tau/(2(1-\tau)L), then we have

    tjτtj14M2(1τ)2λj1τ.t_{j}-\tau t_{j-1}\leq\frac{4M^{2}(1-\tau)^{2}{\lambda}_{j-1}}{\tau}. (41)

Proof: a) Inequality (37) implies that for every τ[0,1]\tau^{\prime}\in[0,1], we have

Γj(xj)\displaystyle\Gamma_{j}(x_{j}) max{Γ¯j1(xj),ϕ(xj;xj1)}(1τ)ϕ(xj;xj1)+τΓ¯j1(xj).\displaystyle\geq\max\left\{\bar{\Gamma}_{j-1}(x_{j}),\ell_{\phi}(x_{j};x_{j-1})\right\}\geq(1-\tau^{\prime})\ell_{\phi}(x_{j};x_{j-1})+\tau^{\prime}\bar{\Gamma}_{j-1}(x_{j}).

The definition of mjm_{j} in (22), the above inequality, and (40) with u=xju=x_{j}, imply that

mj\displaystyle m_{j} (1τ)ϕ(xj;xj1)+τΓ¯j1(xj)+12λjxjx^k12\displaystyle\geq(1-\tau^{\prime})\ell_{\phi}(x_{j};x_{j-1})+\tau^{\prime}\bar{\Gamma}_{j-1}(x_{j})+\frac{1}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}
=(1τ)[ϕ(xj;xj1)+12λjxjx^k12]+τ[Γ¯j1(xj)+12λjxjx^k12]\displaystyle=(1-\tau^{\prime})\left[\ell_{\phi}(x_{j};x_{j-1})+\frac{1}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}\right]+\tau^{\prime}\left[\bar{\Gamma}_{j-1}(x_{j})+\frac{1}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}\right]
λjλj1(1τ)[ϕ(xj;xj1)+12λjxjx^k12]+τ[Γ¯j1(xj)+12λj1xjx^k12]\displaystyle\overset{{\lambda}_{j}\leq{\lambda}_{j-1}}{\geq}(1-\tau^{\prime})\left[\ell_{\phi}(x_{j};x_{j-1})+\frac{1}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}\right]+\tau^{\prime}\left[\bar{\Gamma}_{j-1}(x_{j})+\frac{1}{2{\lambda}_{j-1}}\|x_{j}-\hat{x}_{k-1}\|^{2}\right]
(40)(1τ)[ϕ(xj;xj1)+12λjxjx^k12]+τ[mj1+12λj1xjxj12].\displaystyle\overset{\eqref{ineq:Gammaj-1}}{\geq}(1-\tau^{\prime})\left[\ell_{\phi}(x_{j};x_{j-1})+\frac{1}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}\right]+\tau^{\prime}\left[m_{j-1}+\frac{1}{2{\lambda}_{j-1}}\|x_{j}-x_{j-1}\|^{2}\right].

Using this inequality and the definition of tjt_{j} in (24), we have

tjτtj1=[ϕ(yj)mj]τ[ϕ(yj1)mj1]\displaystyle t_{j}-\tau^{\prime}t_{j-1}=[\phi(y_{j})-m_{j}]-\tau^{\prime}[\phi(y_{j-1})-m_{j-1}]
=[ϕ(yj)τϕ(yj1)][mjτmj1]\displaystyle=[\phi(y_{j})-\tau^{\prime}\phi(y_{j-1})]-[m_{j}-\tau^{\prime}m_{j-1}]
[ϕ(yj)τϕ(yj1)](1τ)[ϕ(xj;xj1)+12λjxjx^k12]τ2λj1xjxj12\displaystyle\leq[\phi(y_{j})-\tau^{\prime}\phi(y_{j-1})]-(1-\tau^{\prime})\left[\ell_{\phi}(x_{j};x_{j-1})+\frac{1}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}\right]-\frac{\tau^{\prime}}{2{\lambda}_{j-1}}\|x_{j}-x_{j-1}\|^{2}
=[ϕ(yj)τϕ(yj1)(1τ)ϕ(xj)]\displaystyle=[\phi(y_{j})-\tau^{\prime}\phi(y_{j-1})-(1-\tau^{\prime})\phi(x_{j})]
+(1τ)[ϕ(xj)ϕ(xj;xj1)]1τ2λjxjx^k12τ2λj1xjxj12\displaystyle\ \ \ +(1-\tau^{\prime})\left[\phi(x_{j})-\ell_{\phi}(x_{j};x_{j-1})\right]-\frac{1-\tau^{\prime}}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}-\frac{\tau^{\prime}}{2{\lambda}_{j-1}}\|x_{j}-x_{j-1}\|^{2}
(1τ)[ϕ(xj)ϕ(xj;xj1)]1τ2λjxjx^k12τ2λj1xjxj12,\displaystyle\leq(1-\tau^{\prime})\left[\phi(x_{j})-\ell_{\phi}(x_{j};x_{j-1})\right]-\frac{1-\tau^{\prime}}{2{\lambda}_{j}}\|x_{j}-\hat{x}_{k-1}\|^{2}-\frac{\tau^{\prime}}{2{\lambda}_{j-1}}\|x_{j}-x_{j-1}\|^{2},

where the last inequality is due to the definition of yjy_{j} in (23). The conclusion of the statement now follows from the above inequality and relation (7) with (y,x)=(xj1,xj)(y,x)=(x_{j-1},x_{j}).

b) Using the assumption of this statement and statement a) with τ=τ\tau^{\prime}=\tau, we easily see that

tjτtj12M(1τ)xjxj1τ4λj1xjxj12.t_{j}-\tau t_{j-1}\leq 2M(1-\tau)\|x_{j}-x_{j-1}\|-\frac{\tau}{4{\lambda}_{j-1}}\|x_{j}-x_{j-1}\|^{2}.

The statement now follows from the above inequality and the inequality 2abb2a22ab-b^{2}\leq a^{2} with

a=2M(1τ)λj1τ,b=τxjxj12λj1.a=\frac{2M(1-\tau)\sqrt{{\lambda}_{j-1}}}{\sqrt{\tau}},\quad b=\frac{\sqrt{\tau}\|x_{j}-x_{j-1}\|}{2\sqrt{{\lambda}_{j-1}}}.
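Indeed, substituting these values of aa and bb (a one-line check added here for completeness) gives

2ab-b^{2}=2M(1-\tau)\|x_{j}-x_{j-1}\|-\frac{\tau}{4{\lambda}_{j-1}}\|x_{j}-x_{j-1}\|^{2}\;\leq\;a^{2}=\frac{4M^{2}(1-\tau)^{2}{\lambda}_{j-1}}{\tau},

which, combined with the displayed bound on tjτtj1t_{j}-\tau t_{j-1}, yields (41).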

 

The next result describes some properties about the stepsizes λj{\lambda}_{j} within any given cycle. It uses the fact that if jj is a bad iteration of Ad-GPB, then (25) is violated (see step 2 of Ad-GPB and the first paragraph following Ad-GPB).

Lemma 3.3

Define

λ¯:=min{τε128(1τ)M2,τ8(1τ)L};\underline{{\lambda}}:=\min\left\{\frac{\tau\varepsilon}{128(1-\tau)M^{2}},\frac{\tau}{8(1-\tau)L}\right\}; (42)

where τ\tau is an input to Ad-GPB, and MM and LL are as in assumption (A3). Then, the following statements hold:

  • a)

    for every index j𝒞kj\in{\cal C}_{k}, we have

    λjmin{λ¯,λik};{\lambda}_{j}\geq\min\left\{\underline{{\lambda}},{\lambda}_{i_{k}}\right\};
  • b)

    the number of bad iterations in 𝒞k{\cal C}_{k} is bounded by log2+(λik/λ¯).\log_{2}^{+}\left({\lambda}_{i_{k}}/\underline{{\lambda}}\right).

Proof: a) Assume for contradiction that there exists j𝒞kj\in{\cal C}_{k} such that

λj<min{λ¯,λik},{\lambda}_{j}<\min\left\{\underline{{\lambda}},{\lambda}_{i_{k}}\right\}, (43)

and that jj is the smallest index in 𝒞k{\cal C}_{k} satisfying the above inequality. We claim that this assumption implies that

λj24λj12=λj.\frac{{\lambda}_{j-2}}{4}\leq\frac{{\lambda}_{j-1}}{2}={\lambda}_{j}. (44)

Before showing the claim, we argue that (44) leads to a contradiction, so that statement (a) follows. Indeed, noting that (42) and (44) imply that λj24λ¯τ/(2(1τ)L){\lambda}_{j-2}\leq 4\underline{{\lambda}}\leq\tau/(2(1-\tau)L), it follows from (41) with jj replaced by j1j-1 and the definition of λ¯\underline{{\lambda}} in (42) that

tj1τtj2\displaystyle t_{j-1}-\tau t_{j-2} (41)4(1τ)2λj2M2τ16(1τ)2λjM2τ16(1τ)2λ¯M2τ\displaystyle\overset{\eqref{ineq:re_t2}}{\leq}\frac{4(1-\tau)^{2}{\lambda}_{j-2}M^{2}}{\tau}\leq\frac{16(1-\tau)^{2}{\lambda}_{j}M^{2}}{\tau}\leq\frac{16(1-\tau)^{2}\underline{{\lambda}}M^{2}}{\tau}
(42)(1τ)ε8(1τ)(βk1(ϕ(yj1)n^k1)2+ε8),\displaystyle\overset{\eqref{def:barlam}}{\leq}(1-\tau)\frac{\varepsilon}{8}\leq(1-\tau)\left(\frac{\beta_{k-1}(\phi(y_{j-1})-\hat{n}_{k-1})}{2}+\frac{\varepsilon}{8}\right),

where the last inequality is due to the fact that ϕ(yj1)n^k10\phi(y_{j-1})-\hat{n}_{k-1}\geq 0. This conclusion then implies that (25) holds for iteration j1j-1, and hence that λj=λj1{\lambda}_{j}={\lambda}_{j-1} due to the logic of step 2 of Ad-GPB. Since this contradicts (44), statement (a) follows.

We will now show the above claim, i.e., that the definition of jj implies (44). Indeed, since the logic of step 2 implies that λik+1=λik{\lambda}_{i_{k}+1}={\lambda}_{i_{k}} and jj is the smallest index in 𝒞k{\cal C}_{k} satisfying (43), we conclude that jik+2j\geq i_{k}+2 and λjλj1{\lambda}_{j}\neq{\lambda}_{j-1}. Using these conclusions and the fact that the logic of step 2 of Ad-GPB implies that either λi=λi1{\lambda}_{i}={\lambda}_{i-1} or λi=λi1/2{\lambda}_{i}={\lambda}_{i-1}/2 for every i𝒞k{ik}i\in{\cal C}_{k}\setminus\{i_{k}\}, we then conclude that both the inequality and the identity in (44) hold.

b) Let sks_{k} denote the number of bad iterations in 𝒞k{\cal C}_{k}. Since λj+1=λj{\lambda}_{j+1}={\lambda}_{j} (resp., λj+1=λj/2{\lambda}_{j+1}={\lambda}_{j}/2) if jj is a good (resp., bad) iteration, we easily see that λik/λ^k=2sk{\lambda}_{i_{k}}/\hat{\lambda}_{k}=2^{s_{k}}. This observation together with (a) then implies that statement (b) holds.    

It follows from Lemma 3.3(b) that the number of bad iterations within the kk-th cycle 𝒞k{\cal C}_{k} is finite. Proposition 3.5 below provides a bound on |𝒞k||{\cal C}_{k}|, and hence shows that every cycle 𝒞k{\cal C}_{k} terminates. Before showing this result, we state a technical result which provides some key properties about the sequence {tj}\{t_{j}\}.

Lemma 3.4

The following statements hold:

  • a)

    if j𝒞k{ik}j\in{\cal C}_{k}\setminus\{i_{k}\}, then tjtj1t_{j}\leq t_{j-1}.

  • b)

    if j𝒞k{ik}j\in{\cal C}_{k}\setminus\{i_{k}\} is a good iteration that is not the last one in 𝒞k{\cal C}_{k}, then

    tjε82τ1+τ(tj1ε8);t_{j}-\frac{\varepsilon}{8}\leq\frac{2\tau}{1+\tau}\left(t_{j-1}-\frac{\varepsilon}{8}\right);
  • c)

    if j𝒞kj\in{\cal C}_{k} is not the last iteration of 𝒞k{\cal C}_{k}, then

    tjε8(2τ1+τ)jiksk(tikε8)t_{j}-\frac{\varepsilon}{8}\leq\left(\frac{2\tau}{1+\tau}\right)^{j-i_{k}-s_{k}}\left(t_{i_{k}}-\frac{\varepsilon}{8}\right) (45)

    where sks_{k} denotes the number of bad iterations within cycle kk.

Proof: a) The statement immediately follows from Lemma 3.2(a) with τ=1\tau^{\prime}=1.

b) Assume that j𝒞k{ik}j\in{\cal C}_{k}\setminus\{i_{k}\} is a good iteration that is not the last one in 𝒞k{\cal C}_{k}. This, together with the logic of Ad-GPB, implies that (25) is satisfied and the cycle-stopping criterion is violated at iteration jj, i.e.,

tjε4>βk1(ϕ(yj)n^k1).t_{j}-\frac{\varepsilon}{4}>\beta_{k-1}(\phi(y_{j})-\hat{n}_{k-1}). (46)

These two observations then imply that

tjε8\displaystyle t_{j}-\frac{\varepsilon}{8} (25)τtj1+(1τ)[βk1(ϕ(yj)n^k1)2+ε8]ε8\displaystyle\overset{\eqref{keyineq}}{\leq}\tau t_{j-1}+(1-\tau)\left[\frac{\beta_{k-1}(\phi(y_{j})-\hat{n}_{k-1})}{2}+\frac{\varepsilon}{8}\right]-\frac{\varepsilon}{8}
=τ(tj1ε8)+1τ2[βk1(ϕ(yj)n^k1)]\displaystyle=\tau\left(t_{j-1}-\frac{\varepsilon}{8}\right)+\frac{1-\tau}{2}\left[\beta_{k-1}(\phi(y_{j})-\hat{n}_{k-1})\right]
(46)τ(tj1ε8)+1τ2(tjε4)τ(tj1ε8)+1τ2(tjε8),\displaystyle\overset{\eqref{notjk}}{\leq}\tau\left(t_{j-1}-\frac{\varepsilon}{8}\right)+\frac{1-\tau}{2}\left(t_{j}-\frac{\varepsilon}{4}\right)\leq\tau\left(t_{j-1}-\frac{\varepsilon}{8}\right)+\frac{1-\tau}{2}\left(t_{j}-\frac{\varepsilon}{8}\right),

which can be easily seen to imply that statement b) holds.

c) If jiksk0j-i_{k}-s_{k}\leq 0, then (45) obviously follows. Assume then that jiksk>0j-i_{k}-s_{k}>0. The fact that there are at least jikskj-i_{k}-s_{k} good iterations in {ik+1,,j}\{i_{k}+1,\ldots,j\}, and statements a) and b), imply that

tjε8(tikε8)(2τ1+τ)jiksk.t_{j}-\frac{\varepsilon}{8}\leq\left(t_{i_{k}}-\frac{\varepsilon}{8}\right)\left(\frac{2\tau}{1+\tau}\right)^{j-i_{k}-s_{k}}. (47)

and thus (45) follows.    

Proposition 3.5

For every cycle index k1k\geq 1 generated by Ad-GPB, its size is bounded by |𝒞k|sk+N¯k(ε)+1|{\cal C}_{k}|\leq s_{k}+\bar{N}_{k}(\varepsilon)+1 where sks_{k} denotes the number of bad iterations within it and N¯k()\bar{N}_{k}(\cdot) is defined as

N¯k(ε):=1+τ1τlog+[8tikε1].\bar{N}_{k}(\varepsilon):=\left\lceil\frac{1+\tau}{1-\tau}\log^{+}\left[8t_{i_{k}}\varepsilon^{-1}\right]\right\rceil. (48)

Proof: If tik<ε/8t_{i_{k}}<\varepsilon/8, then the cycle-stopping criterion is satisfied with j=ikj=i_{k}. This implies that |𝒞k|=1|{\cal C}_{k}|=1, and hence that the result trivially holds in this case. From now on, assume tikε/8t_{i_{k}}\geq\varepsilon/8 and suppose for contradiction that |𝒞k|>sk+N¯k(ε)+1|{\cal C}_{k}|>s_{k}+\bar{N}_{k}(\varepsilon)+1. This implies that there exists a nonnegative integer JikJ\geq i_{k} such that J+1𝒞kJ+1\in{\cal C}_{k} and

Jik+2>sk+N¯k(ε)+1J-i_{k}+2>s_{k}+\bar{N}_{k}(\varepsilon)+1 (49)

because the left-hand side of (49) is the cardinality of the index set {ik,,J+1}\{i_{k},\ldots,J+1\}. Since JJ is not the last iteration of 𝒞k{\cal C}_{k}, the cycle-stopping criterion in step 2 of Ad-GPB is violated with j=Jj=J, i.e.,

tJ>βk1[ϕ(yJ)n^k1]+ε4ε4.t_{J}>\beta_{k-1}[\phi(y_{J})-\hat{n}_{k-1}]+\frac{\varepsilon}{4}\geq\frac{\varepsilon}{4}.

This observation together with Lemma 3.4(c) with j=Jj=J then imply that

ε8tJε8(2τ1+τ)Jiksk(tikε8)(2τ1+τ)Jiksktik,\frac{\varepsilon}{8}\leq t_{J}-\frac{\varepsilon}{8}\leq\left(\frac{2\tau}{1+\tau}\right)^{J-i_{k}-s_{k}}\left(t_{i_{k}}-\frac{\varepsilon}{8}\right)\leq\left(\frac{2\tau}{1+\tau}\right)^{J-i_{k}-s_{k}}t_{i_{k}},

which together with the definition of N¯k(ε)\bar{N}_{k}(\varepsilon) in (48) can be easily seen to imply that

JikskN¯k(ε)1.J-i_{k}-s_{k}\leq\bar{N}_{k}(\varepsilon)-1.

Since this conclusion contradicts (49), the conclusion of the proposition follows.    

We now make some remarks about Proposition 3.5. First, the bound on the length of each cycle depends on the number of bad iterations within it. Second, to obtain the overall iteration complexity of Ad-GPB, it suffices to derive a bound on the number of cycles generated by Ad-GPB, which is the main goal of the subsequent section.

4 Bounding the number of cycles

This section establishes a bound on the number of cycles generated by Ad-GPB. It contains two subsections. The first one considers the (much simpler) case where ϕ\phi_{*} is known and n^k\hat{n}_{k} in step 2 of Ad-GPB is set to ϕ\phi_{*} for every k1k\geq 1. The second one considers the general case where n^k\hat{n}_{k} is an arbitrary scalar satisfying the condition in step 2 of Ad-GPB.

We start by stating two technical results that are used in both subsections.

The first one describes basic facts about the sextuple (λ^k,x^k,y^k,Γ^k,m^k,t^k)(\hat{\lambda}_{k},\hat{x}_{k},\hat{y}_{k},\hat{\Gamma}_{k},\hat{m}_{k},\hat{t}_{k}) generated at the end of the kk-th cycle.

Lemma 4.1

For every cycle index kk of Ad-GPB, the following statements hold:

  • a)

    Γ^kConv¯(n)\hat{\Gamma}_{k}\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}), Γ^kϕ\hat{\Gamma}_{k}\leq\phi, and domΓ^k=domh\mathrm{dom}\,\hat{\Gamma}_{k}=\mathrm{dom}\,h;

  • b)

    we have

    t^kβk1[ϕ(y^k)n^k1]+ε4;\hat{t}_{k}\leq\beta_{k-1}[\phi(\hat{y}_{k})-\hat{n}_{k-1}]+\frac{\varepsilon}{4};
  • c)

    x^k,y^kdomh\hat{x}_{k},\hat{y}_{k}\in\mathrm{dom}\,h, ϕ(y^k)ϕ^ka\phi(\hat{y}_{k})\leq\hat{\phi}_{k}^{a}, and ϕ(y^k)ϕ(y^k1)\phi(\hat{y}_{k})\leq\phi(\hat{y}_{k-1}), where by convention y^0=x^0\hat{y}_{0}=\hat{x}_{0};

  • d)

    ^k^k1\hat{\ell}_{k}\geq\hat{\ell}_{k-1} and βkβk1\beta_{k}\leq\beta_{k-1};

  • e)

    for every given udomhu\in\mathrm{dom}\,h, we have

    ϕ(y^k)Γ^k(u)t^k+12λ^k[ux^k12ux^k2];\phi(\hat{y}_{k})-\hat{\Gamma}_{k}(u)\leq\hat{t}_{k}+\frac{1}{2\hat{\lambda}_{k}}\left[\|u-\hat{x}_{k-1}\|^{2}-\|u-\hat{x}_{k}\|^{2}\right]; (50)

Proof: a) It follows from (30) and the fact that Γ^k\hat{\Gamma}_{k} is the last Γj\Gamma_{j} generated within the kk-th cycle.

b) It follows from the fact that the cycle-stopping criterion in the first line of step 2 of Ad-GPB is satisfied at iteration jkj_{k} and the definitions of the quantities t^k\hat{t}_{k} and y^k\hat{y}_{k} in step 2 of Ad-GPB.

c) It follows from the first two remarks in the paragraph containing (32), the definition of ϕ^ka\hat{\phi}_{k}^{a} in (27), and the fact that x^k\hat{x}_{k} (resp., y^k\hat{y}_{k}) is the last xjx_{j} (resp., yjy_{j}) generated within the kk-th cycle.

d) The first inequality in (d) follows from the definition of ^k\hat{\ell}_{k} in (26). Moreover, the rule for updating βk\beta_{k} in step 2 of Ad-GPB implies that βkβk1\beta_{k}\leq\beta_{k-1}.

e) Observe that (30) and the definitions of the quantities x^k\hat{x}_{k}, m^k\hat{m}_{k}, Γ^k\hat{\Gamma}_{k}, and λ^k\hat{\lambda}_{k} in step 2 of Ad-GPB, imply that (x^k,m^k)(\hat{x}_{k},\hat{m}_{k}) is the pair of optimal solution and optimal value of

min{Γ^k(u)+12λ^kux^k12:un}.\min\left\{\hat{\Gamma}_{k}(u)+\frac{1}{2\hat{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2}:u\in\mathbb{R}^{n}\right\}. (51)

The above observation and the fact that Γ^k()+1/(2λ^k)x^k12\hat{\Gamma}_{k}(\cdot)+1/(2\hat{\lambda}_{k})\|\cdot-\hat{x}_{k-1}\|^{2} is 1/λ^k1/\hat{\lambda}_{k}-strongly convex together imply that, for the given udomhu\in\mathrm{dom}\,h, we have

m^k+12λ^kux^k2Γ^k(u)+12λ^kux^k12,\hat{m}_{k}+\frac{1}{2\hat{\lambda}_{k}}\|u-\hat{x}_{k}\|^{2}\leq\hat{\Gamma}_{k}(u)+\frac{1}{2\hat{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2}, (52)

and hence that

ϕ(y^k)Γ^k(u)+12λ^kux^k2\displaystyle\phi(\hat{y}_{k})-\hat{\Gamma}_{k}(u)+\frac{1}{2\hat{\lambda}_{k}}\|u-\hat{x}_{k}\|^{2} ϕ(y^k)m^k+12λ^kux^k12=t^k+12λ^kux^k12,\displaystyle{\leq}\phi(\hat{y}_{k})-\hat{m}_{k}+\frac{1}{2\hat{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2}=\hat{t}_{k}+\frac{1}{2\hat{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2},

where the equality is due to the definition of t^k\hat{t}_{k} in step 2 of Ad-GPB. This shows that (50), and hence statement (e), holds.    

The next result provides a uniform upper (resp., lower) bound on the sequence {tik}\{t_{i_{k}}\} (resp., {λ^k}\{\hat{\lambda}_{k}\}), and also a bound on the total number of bad iterations generated by Ad-GPB.

Lemma 4.2

For every cycle index k1k\geq 1 generated by Ad-GPB, the following statements hold:

  • a)

    λ^kmin{λ¯,λ1}\hat{\lambda}_{k}\geq\min\{\underline{{\lambda}},{\lambda}_{1}\} where λ¯\underline{{\lambda}} is as in (42);

  • b)

    l=1ksllog2+(λ1/λ¯)\sum_{l=1}^{k}s_{l}\leq\log_{2}^{+}({\lambda}_{1}/\underline{{\lambda}}) where sls_{l} denotes the number of bad iterations within cycle ll.

  • c)

    we have tikt¯t_{i_{k}}\leq\bar{t} where t¯\bar{t} is as in (33).

Proof: a) Using the facts that λik=λ^k1{\lambda}_{i_{k}}=\hat{\lambda}_{k-1} (see step 2 of Ad-GPB) and Lemma 3.3(a) with j=jkj=j_{k}, we conclude that

λ^kmin{λ¯,λ^k1}.\hat{\lambda}_{k}\geq\min\left\{\underline{{\lambda}},\hat{\lambda}_{k-1}\right\}.

The statement then follows by using the above inequality recursively and the convention that λ^0=λ1\hat{\lambda}_{0}={\lambda}_{1}.

b) Since the last iteration of a cycle is not bad and λj+1/λj{\lambda}_{j+1}/{\lambda}_{j} is equal to 1/21/2 (resp., equal to 11) if jj is a bad (resp., good) iteration, we easily see that λ^l/λ^l1=(1/2)sl\hat{\lambda}_{l}/\hat{\lambda}_{l-1}=(1/2)^{s_{l}}, or equivalently, log2λ^l1log2λ^l=sl\log_{2}\hat{\lambda}_{l-1}-\log_{2}\hat{\lambda}_{l}=s_{l}, for every cycle ll of Ad-GPB, under the convention that λ^0:=λ1\hat{\lambda}_{0}:={\lambda}_{1}. Statement b) now follows by summing the above identity from l=1l=1 to kk and using statement a).

c) Using the facts that ϕ=f+h\phi=f+h and Γik()~f(;x^k1)+h()\Gamma_{i_{k}}(\cdot)\geq\tilde{\ell}_{f}(\cdot;\hat{x}_{k-1})+h(\cdot) (see the serious update in step 2 of Ad-GPB), and the definition of tjt_{j}, mjm_{j} and yjy_{j} in (24), (22) and (23), respectively, we have

tik\displaystyle t_{i_{k}} =(24)ϕ(yik)mik(23)ϕ(xik)mik=(22)ϕ(xik)Γik(xik)12λikxikx^k12\displaystyle\overset{\eqref{def:tj}}{=}\phi(y_{i_{k}})-m_{i_{k}}\overset{\eqref{def:yj}}{\leq}\phi(x_{i_{k}})-m_{i_{k}}\overset{\eqref{def:mj}}{=}\phi(x_{i_{k}})-\Gamma_{i_{k}}(x_{i_{k}})-\frac{1}{2{\lambda}_{i_{k}}}\|x_{i_{k}}-\hat{x}_{k-1}\|^{2}
f(xik)~f(xik;x^k1)12λikxikx^k12\displaystyle\leq f(x_{i_{k}})-\tilde{\ell}_{f}(x_{i_{k}};\hat{x}_{k-1})-\frac{1}{2{\lambda}_{i_{k}}}\|x_{i_{k}}-\hat{x}_{k-1}\|^{2}
(7)2Mxikx^k1+L2xikx^k12.\displaystyle\overset{\eqref{ineq:est}}{\leq}2M\|x_{i_{k}}-\hat{x}_{k-1}\|+\frac{L}{2}\|x_{i_{k}}-\hat{x}_{k-1}\|^{2}. (53)

Statement c) now follows from the above inequality, the bound (4) in assumption (A1), and the fact that xik,x^k1domhx_{i_{k}},\hat{x}_{k-1}\in\mathrm{dom}\,h.    

It follows from Lemma 4.2(b) and the definition of λ¯\underline{{\lambda}} in (42) that the overall number of bad iterations is

𝒪(log2((1τ)(M2ε+L))).{\cal O}\left(\log_{2}\left((1-\tau)\left(\frac{M^{2}}{\varepsilon}+L\right)\right)\right).
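For completeness, this bound follows by combining Lemma 4.2(b) with the definition of λ¯\underline{{\lambda}} in (42), treating λ1{\lambda}_{1} and τ1\tau^{-1} as 𝒪(1){\cal O}(1) constants (as in the remarks following Theorem 2.1):

\sum_{l\geq 1}s_{l}\;\leq\;\log_{2}^{+}\frac{{\lambda}_{1}}{\underline{{\lambda}}}\;=\;\log_{2}^{+}\left({\lambda}_{1}\max\left\{\frac{128(1-\tau)M^{2}}{\tau\varepsilon},\frac{8(1-\tau)L}{\tau}\right\}\right)\;=\;{\cal O}\left(\log\left((1-\tau)\left(\frac{M^{2}}{\varepsilon}+L\right)\right)\right).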

4.1 Case where ϕ\phi_{*} is known

This subsection considers the special case of Ad-GPB where ϕ\phi_{*} is known and

β0=12,n^k=ϕk1.\beta_{0}=\frac{1}{2},\qquad\hat{n}_{k}=\phi_{*}\quad\forall k\geq 1. (54)

Even though the general result in Theorem 2.1 holds for this case, the simpler proof presented here covering the above case helps to understand the proof of the more general case given in Subsection 4.2 and has the advantage that it does not assume that domh\mathrm{dom}\,h is bounded.

For convenience, the simplified version of Ad-GPB, referred to as Ad-GPB*, is explicitly stated below.

 

Ad-GPB*

 

  • 0.

Let x^0domh\hat{x}_{0}\in\mathrm{dom}\,h, λ1>0{\lambda}_{1}>0, τ(0,1)\tau\in(0,1), and ε>0\varepsilon>0 be given; find Γ1(ϕ)\Gamma_{1}\in{\cal B}(\phi) such that Γ1ϕ(;x^0)\Gamma_{1}\geq\ell_{\phi}(\cdot;\hat{x}_{0}) and set y0=x^0y_{0}=\hat{x}_{0}, j0=0j_{0}=0, j=k=1j=k=1;

  • 1.

    compute xj,mj,yjx_{j},m_{j},y_{j}, and tjt_{j} as in (21), (22), (23), and (24), respectively;

  • 2.

    if tj(ϕ(yj)ϕ)/2+ε/4t_{j}\leq(\phi(y_{j})-\phi_{*})/2+\varepsilon/4 is violated then perform a null update, i.e.:
set Γj+1=BUF(x^k1,xj,Γj,λj)\Gamma_{j+1}=\mbox{BUF}(\hat{x}_{k-1},x_{j},\Gamma_{j},\lambda_{j});
           if either j=jk1+1j=j_{k-1}+1 or

    tjτtj1(1τ)[ϕ(yj)ϕ4+ε8],t_{j}-\tau t_{j-1}\leq(1-\tau)\left[\frac{\phi(y_{j})-\phi_{*}}{4}+\frac{\varepsilon}{8}\right], (55)

           then set λj+1=λj{\lambda}_{j+1}={\lambda}_{j}; else, set λj+1=λj/2{\lambda}_{j+1}={\lambda}_{j}/2;
    else perform a serious update, i.e.:
           set λj+1=λj{\lambda}_{j+1}={\lambda}_{j} and find Γj+1(ϕ)\Gamma_{j+1}\in{\cal B}(\phi) such that Γj+1ϕ(;xj)\Gamma_{j+1}\geq\ell_{\phi}(\cdot;x_{j});
           set jk=jj_{k}=j and (λ^k,x^k,y^k,Γ^k,m^k,t^k)=(λj,xj,yj,Γj,mj,tj)(\hat{\lambda}_{k},\hat{x}_{k},\hat{y}_{k},\hat{\Gamma}_{k},\hat{m}_{k},\hat{t}_{k})=(\lambda_{j},x_{j},y_{j},\Gamma_{j},m_{j},t_{j});
           if ϕ(y^k)ϕε\phi(\hat{y}_{k})-\phi_{*}\leq\varepsilon, then output (x^k,y^k)(\hat{x}_{k},\hat{y}_{k}), and stop;
           kk+1k\leftarrow k+1;

  • 3.

    set jj+1j\leftarrow j+1 and go to step 1.

 

We now make some remarks about Ad-GPB*. First, even though the parameter β0\beta_{0} can be arbitrarily chosen in (0,1/2](0,1/2], Ad-GPB* is stated with β0=1/2\beta_{0}=1/2 for simplicity. Second, whenever Ad-GPB* reaches step 2, the primal gap ϕ(yj)ϕ\phi(y_{j})-\phi_{*} is greater than ε\varepsilon due to the stopping test in step 2, and it is substantially larger than this lower bound during the early cycles. Hence, the right-hand side (ϕ(yj)ϕ)/2+ε/4(\phi(y_{j})-\phi_{*})/2+\varepsilon/4 of its cycle termination criterion in step 2 is always larger than 3ε/43\varepsilon/4 and is substantially larger than 3ε/43\varepsilon/4 during the early cycles. Since, in contrast, GPB terminates a cycle when the inequality tjε/2t_{j}\leq\varepsilon/2 is satisfied, the cycle termination criterion of Ad-GPB* is always looser, and potentially much looser in the early cycles, than that of GPB.

The next lemma formally shows that Ad-GPB* is a specific instance of Ad-GPB.

Lemma 4.3

The following statements hold:

  • a)

    Ad-GPB* is a special instance of Ad-GPB with β0\beta_{0} and {n^k}\{\hat{n}_{k}\} chosen as in (54); moreover, βk=1/2\beta_{k}=1/2 for every cycle index k1k\geq 1;

  • b)

    for every cycle index kk of Ad-GPB*, we have

    t^k[ϕ(y^k)ϕ]/2+ε/4.\hat{t}_{k}\leq[\phi(\hat{y}_{k})-\phi_{*}]/2+\varepsilon/4. (56)

Proof: a) The first claim of (a) is obvious. To show that βk=1/2\beta_{k}=1/2 for every cycle index k1k\geq 1 generated by Ad-GPB, it suffices to show that βk=βk1\beta_{k}=\beta_{k-1} because β0=1/2\beta_{0}=1/2. Indeed, using (54), the facts that βlβ0=1/2\beta_{l}\leq\beta_{0}=1/2 for l0l\geq 0 due to Lemma 4.1(d), and the definitions of ϕ^ka\hat{\phi}_{k}^{a} and g^k\hat{g}_{k} in (27) and (28), respectively, we conclude that

g^k12l=k/2kλ^ll=k/2kλ^l[ϕ(y^l)ϕ]=ϕ^kaϕ2,\hat{g}_{k}\leq\frac{1}{2\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}[\phi(\hat{y}_{l})-\phi_{*}]=\frac{\hat{\phi}^{a}_{k}-\phi_{*}}{2},

and hence that βk=βk1\beta_{k}=\beta_{k-1} due to the update rule for βk\beta_{k} at the end of step 2 of Ad-GPB.

b) This statement follows from (a), Lemma 4.1(b), and the fact that β0=1/2\beta_{0}=1/2.    

Let d0d_{0} denote the distance of the initial point x^0domh\hat{x}_{0}\in\mathrm{dom}\,h to the set of optimal solutions XX^{*}, i.e.,

d0:=x^0x^,wherex^:=argmin{x^0x:xX}.d_{0}:=\|\hat{x}_{0}-\hat{x}_{*}\|,\ \ \mbox{\rm where}\ \ \ \hat{x}_{*}:=\mathrm{argmin}\,\{\|\hat{x}_{0}-x_{*}\|:x_{*}\in X_{*}\}. (57)
Lemma 4.4

If K1K\geq 1 is a cycle index generated by Ad-GPB*, then we have

k=1Kλ^k[ϕ(y^k)ϕ]ε2k=1Kλ^k+d02.\sum_{k=1}^{K}\hat{\lambda}_{k}[\phi(\hat{y}_{k})-\phi_{*}]\leq\frac{\varepsilon}{2}\sum_{k=1}^{K}\hat{\lambda}_{k}+d_{0}^{2}. (58)

Proof: Relation (50) with u=x^u=\hat{x}_{*}, and the facts that Γ^kϕ\hat{\Gamma}_{k}\leq\phi and ϕ=ϕ(x^)\phi_{*}=\phi(\hat{x}_{*}), imply that

λ^k[ϕ(y^k)ϕ]\displaystyle\hat{\lambda}_{k}[\phi(\hat{y}_{k})-\phi_{*}] λ^k[ϕ(y^k)Γ^k(x^)]\displaystyle\leq\hat{\lambda}_{k}[\phi(\hat{y}_{k})-\hat{\Gamma}_{k}(\hat{x}_{*})]
λ^kt^k+12x^x^k1212x^kx^2\displaystyle\leq\hat{\lambda}_{k}\hat{t}_{k}+\frac{1}{2}\|\hat{x}_{*}-\hat{x}_{k-1}\|^{2}-\frac{1}{2}\|\hat{x}_{k}-\hat{x}_{*}\|^{2}
λ^k[ϕ(y^k)ϕ2+ε4]+12x^x^k1212x^kx^2\displaystyle\leq\hat{\lambda}_{k}\left[\frac{\phi(\hat{y}_{k})-\phi_{*}}{2}+\frac{\varepsilon}{4}\right]+\frac{1}{2}\|\hat{x}_{*}-\hat{x}_{k-1}\|^{2}-\frac{1}{2}\|\hat{x}_{k}-\hat{x}_{*}\|^{2}

where the last inequality is due to (56). Simplifying the above inequality and summing the resulting inequality from k=1,,Kk=1,\ldots,K, we conclude that

12k=1Kλ^k[ϕ(y^k)ϕ]ε4k=1Kλ^k+12x^x^0212x^Kx^2.\frac{1}{2}\sum_{k=1}^{K}\hat{\lambda}_{k}[\phi(\hat{y}_{k})-\phi_{*}]\leq\frac{\varepsilon}{4}\sum_{k=1}^{K}\hat{\lambda}_{k}+\frac{1}{2}\|\hat{x}_{*}-\hat{x}_{0}\|^{2}-\frac{1}{2}\|\hat{x}_{K}-\hat{x}_{*}\|^{2}.

The statement now follows from the above inequality and (57).    

We are now ready to state and prove Theorem 4.5.

Theorem 4.5

Ad-GPB* finds an iterate y^k\hat{y}_{k} satisfying ϕ(y^k)ϕε\phi(\hat{y}_{k})-\phi_{*}\leq\varepsilon in at most K^(ε)\hat{K}(\varepsilon) cycles and

log2+(Q¯λ1(M2ε+L16))+(1+τ1τlog+[8t¯ε1]+2)K^(ε)\log_{2}^{+}\left(\bar{Q}{\lambda}_{1}\left(\frac{M^{2}}{\varepsilon}+\frac{L}{16}\right)\right)+\left(\frac{1+\tau}{1-\tau}\log^{+}\left[8\bar{t}\varepsilon^{-1}\right]+2\right)\hat{K}(\varepsilon) (59)

iterations, where Q¯\bar{Q} is as in (35) and

K^(ε):=2d02Q¯ε(M2ε+L16+1λ1Q¯).\hat{K}(\varepsilon):=\left\lceil\frac{2d_{0}^{2}\bar{Q}}{\varepsilon}\left(\frac{M^{2}}{\varepsilon}+\frac{L}{16}+\frac{1}{{\lambda}_{1}\bar{Q}}\right)\right\rceil. (60)

Proof: We first prove that Ad-GPB* finds an iterate y^k\hat{y}_{k} satisfying ϕ(y^k)ϕε\phi(\hat{y}_{k})-\phi_{*}\leq\varepsilon in at most K^(ε)\hat{K}(\varepsilon) cycles. Suppose for contradiction that Ad-GPB* generates a cycle K>K^(ε)K>\hat{K}(\varepsilon). Since Ad-GPB* did not stop at any of the previous cycles, we have that ϕ(y^k)ϕ>ε\phi(\hat{y}_{k})-\phi_{*}>\varepsilon for every k=1,,K1k=1,\ldots,K-1. Using the previous observation, inequality (58), the fact that K1K^(ε)K-1\geq\hat{K}(\varepsilon), and Lemma 4.2(a), we conclude that

d02ε\displaystyle\frac{d_{0}^{2}}{\varepsilon} (58)1εk=1K1λ^k[ϕ(y^k)ϕ]12k=1K1λ^k>k=1K1λ^k12k=1K1λ^k\displaystyle\overset{\eqref{outer:rec}}{\geq}\frac{1}{\varepsilon}\sum_{k=1}^{K-1}\hat{\lambda}_{k}[\phi(\hat{y}_{k})-\phi_{*}]-\frac{1}{2}\sum_{k=1}^{K-1}\hat{\lambda}_{k}>\sum_{k=1}^{K-1}\hat{\lambda}_{k}-\frac{1}{2}\sum_{k=1}^{K-1}\hat{\lambda}_{k}
12min{λ¯,λ1}(K1)12min{λ¯,λ1}K^(ε).\displaystyle\geq\frac{1}{2}\min\left\{\underline{{\lambda}},{\lambda}_{1}\right\}(K-1)\geq\frac{1}{2}\min\left\{\underline{{\lambda}},{\lambda}_{1}\right\}\hat{K}(\varepsilon).

The definition of K^(ε)\hat{K}(\varepsilon) in (60), the above inequality, and some simple algebraic manipulation on the min term, yield the desired contradiction.

Let k¯K^(ε)\bar{k}\leq\hat{K}(\varepsilon) denote the number of cycles generated by Ad-GPB*. Proposition 3.5 and statements (b) and (c) of Lemma 4.2 then imply that the total number of iterations performed by Ad-GPB* until it finds an iterate y^k\hat{y}_{k} satisfying ϕ(y^k)ϕε\phi(\hat{y}_{k})-\phi_{*}\leq\varepsilon is bounded by

k=1k¯\displaystyle\sum_{k=1}^{\bar{k}} |𝒞k|k=1k¯(1+τ1τlog+[8t¯ε1]+2+sk)\displaystyle|{\cal C}_{k}|\leq\sum_{k=1}^{\bar{k}}\left(\frac{1+\tau}{1-\tau}\log^{+}\left[8\bar{t}\varepsilon^{-1}\right]+2+s_{k}\right)
log2+λ1λ¯+(1+τ1τlog+[8t¯ε1]+2)K^(ε)\displaystyle\leq\log_{2}^{+}\frac{{\lambda}_{1}}{\underline{{\lambda}}}+\left(\frac{1+\tau}{1-\tau}\log^{+}\left[8\bar{t}\varepsilon^{-1}\right]+2\right)\hat{K}(\varepsilon)

and hence by (59), due to the definitions of Q¯\bar{Q} and λ¯\underline{{\lambda}} in (35) and (42), respectively.    

4.2 General case where ϕ\phi_{*} is unknown

As already mentioned above, this subsection considers the general case (see step 2 of Ad-GPB) where n^k\hat{n}_{k} is in the interval [^k,ϕ][\hat{\ell}_{k},\phi_{*}], where ^k\hat{\ell}_{k} is as in (26), and derives an upper bound on the number of cycles generated by Ad-GPB.

Lemma 4.6

For every cycle index kk of Ad-GPB, we have:

ϕ^kan^kg^k+D22l=k/2kλ^l+ε4\hat{\phi}_{k}^{a}-\hat{n}_{k}\leq\hat{g}_{k}+\frac{D^{2}}{2\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}+\frac{\varepsilon}{4} (61)

where DD is as in (4).

Proof: Let udomhu\in\mathrm{dom}\,h be given. Multiplying (50) by λ^l\hat{\lambda}_{l} and summing the resulting inequality over l=k/2,,kl=\lceil k/2\rceil,\ldots,k, we have

l=k/2kλ^l[ϕ(y^l)Γ^l(u)]\displaystyle\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}[\phi(\hat{y}_{l})-\hat{\Gamma}_{l}(u)] l=k/2k(λ^lt^l+12[ux^l12ux^l2])\displaystyle\leq\sum_{l=\lceil k/2\rceil}^{k}\left(\hat{\lambda}_{l}\hat{t}_{l}+\frac{1}{2}\left[\|u-\hat{x}_{l-1}\|^{2}-\|u-\hat{x}_{l}\|^{2}\right]\right)
l=k/2kλ^l[βl1[ϕ(y^l)n^l1]+ε4]+12ux^k/21212x^ku2,\displaystyle\leq\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}\left[\beta_{l-1}[\phi(\hat{y}_{l})-\hat{n}_{l-1}]+\frac{\varepsilon}{4}\right]+\frac{1}{2}\|u-\hat{x}_{\lceil k/2\rceil-1}\|^{2}-\frac{1}{2}\|\hat{x}_{k}-u\|^{2},

where the last inequality is due to Lemma 4.1(b). Dividing both sides of the above inequality by l=k/2kλ^l\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}, and using the definitions of ϕ^ka\hat{\phi}_{k}^{a}, Γ^ka()\hat{\Gamma}_{k}^{a}(\cdot), and g^k\hat{g}_{k} in (27), (26), and (28), respectively, we have

ϕ^kaΓ^ka(u)g^k+ε4+ux^k/2122l=k/2kλ^lg^k+ε4+D22l=k/2kλ^l,\hat{\phi}^{a}_{k}-\hat{\Gamma}^{a}_{k}(u)\leq\hat{g}_{k}+\frac{\varepsilon}{4}+\frac{\|u-\hat{x}_{\lceil k/2\rceil-1}\|^{2}}{2\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}\leq\hat{g}_{k}+\frac{\varepsilon}{4}+\frac{D^{2}}{2\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}},

where the last inequality is due to Assumption (A4) and the fact that udomhu\in\mathrm{dom}\,h and x^k/21domh\hat{x}_{\lceil k/2\rceil-1}\in\mathrm{dom}\,h, where the last inclusion is due to Lemma 4.1(c). Inequality (61) now follows from (31) and from maximizing the above inequality over udomhu\in\mathrm{dom}\,h.    

Lemma 4.7

For every cycle index kk of Ad-GPB, the following statements hold:

  • a)

    If βk=βk1\beta_{k}=\beta_{k-1}, then we have

    ϕ^kan^kD2l=k/2kλ^l+ε2;\hat{\phi}_{k}^{a}-\hat{n}_{k}\leq\frac{D^{2}}{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}+\frac{\varepsilon}{2}; (62)
  • b)

    If βk=βk1/2\beta_{k}=\beta_{k-1}/2, then we have

    ϕ^kan^k2<g^kβk21(ϕ(x^0)n^0)\frac{\hat{\phi}^{a}_{k}-\hat{n}_{k}}{2}<\hat{g}_{k}\leq\beta_{\lceil\frac{k}{2}\rceil-1}(\phi(\hat{x}_{0})-\hat{n}_{0}) (63)

    where ϕ^ka\hat{\phi}^{a}_{k} is as in (27).

Proof: a) The update rule for βk\beta_{k} just after equation (28) and the assumption that βk=βk1\beta_{k}=\beta_{k-1} imply that g^k(ϕ^kan^k)/2\hat{g}_{k}\leq(\hat{\phi}_{k}^{a}-\hat{n}_{k})/2. This observation and inequality (61) then immediately imply (62).

b) The first inequality in (63) follows from the assumption that βk=βk1/2\beta_{k}=\beta_{k-1}/2 and the update rule for βk\beta_{k} just after equation (28). Moreover, using the definition of g^k\hat{g}_{k} in (28), the fact that βl1βk/21\beta_{l-1}\leq\beta_{\lceil k/2\rceil-1} for every lk/2l\geq\lceil k/2\rceil due to Lemma 4.1(d), and the fact that n^k^k\hat{n}_{k}\geq\hat{\ell}_{k} for every k1k\geq 1 (see (31)), we conclude that

g^k\displaystyle\hat{g}_{k} βk/21l=k/2kλ^ll=k/2kλ^l[ϕ(y^l)n^l1]βk/21l=k/2kλ^ll=k/2kλ^l[ϕ(y^l)^l1]βk/21[ϕ(y^0)^0],\displaystyle\leq\frac{\beta_{\lceil k/2\rceil-1}}{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}[\phi(\hat{y}_{l})-\hat{n}_{l-1}]\leq\frac{\beta_{\lceil k/2\rceil-1}}{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}[\phi(\hat{y}_{l})-\hat{\ell}_{l-1}]\leq\beta_{\lceil k/2\rceil-1}[\phi(\hat{y}_{0})-\hat{\ell}_{0}],

where the last inequality is because statements (c) and (d) of Lemma 4.1 imply that ϕ(y^l)ϕ(y^0)\phi(\hat{y}_{l})\leq\phi(\hat{y}_{0}) and ^l^0\hat{\ell}_{l}\geq\hat{\ell}_{0} for every l1l\geq 1. The above inequality, the convention that y^0=x^0\hat{y}_{0}=\hat{x}_{0}, and the fact that n^0=^0\hat{n}_{0}=\hat{\ell}_{0} (see step 0 of Ad-GPB) then imply the second inequality in (63).    

We are now ready to prove the main result of this paper.

Proof of Theorem 2.1: To simplify notation, let K¯=K¯(ε)\bar{K}=\bar{K}(\varepsilon). It is easy to see that

D2K¯(1λ¯+1λ1)ε4,12K¯2(ϕ(x^0)n^0)<εβ0.\frac{D^{2}}{\bar{K}}\left(\frac{1}{\underline{{\lambda}}}+\frac{1}{{\lambda}_{1}}\right)\leq\frac{\varepsilon}{4},\qquad\frac{1}{2^{\bar{K}-2}}(\phi(\hat{x}_{0})-\hat{n}_{0})<\frac{\varepsilon}{\beta_{0}}. (64)

We first prove that Ad-GPB finds an iterate y^k\hat{y}_{k} satisfying ϕ(y^k)n^kε\phi(\hat{y}_{k})-\hat{n}_{k}\leq\varepsilon in at most 4K¯4\bar{K} cycles. Suppose for contradiction that Ad-GPB generates a cycle K4K¯+1K\geq 4\bar{K}+1. Since Ad-GPB did not stop at cycles 11 to K1K-1, we have that

ϕ(y^k)n^k>εk{1,,4K¯}.\phi(\hat{y}_{k})-\hat{n}_{k}>\varepsilon\quad\forall k\in\{1,\ldots,4\bar{K}\}. (65)

We then have that

βk=βk12k{K¯,,4K¯}\beta_{k}=\frac{\beta_{k-1}}{2}\quad\forall k\in\{\bar{K},\ldots,4\bar{K}\} (66)

since otherwise we would have βk=βk1\beta_{k}=\beta_{k-1} for some k{K¯,,4K¯}k\in\{\bar{K},\ldots,4\bar{K}\}, and this, together with (65), Lemma 4.1(c), Lemma 4.7(a), and Lemma 4.2(a), would yield the contradiction that

ε<ϕ(y^k)n^k\displaystyle\varepsilon<\phi(\hat{y}_{k})-\hat{n}_{k} L.4.1(c)ϕ^kan^kL.4.7(a)D2l=k/2kλ^l+ε2L.4.2(a)2D2kmin{λ¯,λ1}+ε2\displaystyle\overset{\textbf{L.}\ref{lem:iterate}(c)}{\leq}\hat{\phi}_{k}^{a}-\hat{n}_{k}\overset{\textbf{L.}\ref{lem:typeA}(a)}{\leq}\frac{D^{2}}{\sum_{l=\lceil k/2\rceil}^{k}\hat{\lambda}_{l}}+\frac{\varepsilon}{2}\overset{\textbf{L.}\ref{lem:t1}(a)}{\leq}\frac{2D^{2}}{k\min\{\underline{{\lambda}},{\lambda}_{1}\}}+\frac{\varepsilon}{2}
2D2K¯(1λ¯+1λ1)+ε2(64)ε2+ε2=ε,\displaystyle\ \ \ \ \leq\frac{2D^{2}}{\bar{K}}\left(\frac{1}{\underline{{\lambda}}}+\frac{1}{{\lambda}_{1}}\right)+\frac{\varepsilon}{2}\overset{\eqref{coroN}}{\leq}\frac{\varepsilon}{2}+\frac{\varepsilon}{2}=\varepsilon,

where the last inequality is due to the first inequality in (64). Relations (65) and (66), and Lemmas 4.1(c) and 4.7(b), all with k=4K¯k=4\bar{K}, then yield

ε<(65)ϕ(y^4K¯)n^4K¯L.4.1(c)ϕ^4K¯an^4K¯L.4.7(b)2β2K¯1(ϕ(x^0)n^0)(66)12K¯2βK¯(ϕ(x^0)n^0)<(64)βK¯β0ε\varepsilon\overset{\eqref{notfinished}}{<}\phi(\hat{y}_{4\bar{K}})-\hat{n}_{4\bar{K}}\overset{\textbf{L.}\ref{lem:iterate}(c)}{\leq}\hat{\phi}_{4\bar{K}}^{a}-\hat{n}_{4\bar{K}}\overset{\textbf{L.}\ref{lem:typeA}(b)}{\leq}2\beta_{2\bar{K}-1}(\phi(\hat{x}_{0})-\hat{n}_{0})\overset{\eqref{halfbeta}}{\leq}\frac{1}{2^{\bar{K}-2}}\beta_{\bar{K}}(\phi(\hat{x}_{0})-\hat{n}_{0})\overset{\eqref{coroN}}{<}\frac{\beta_{\bar{K}}}{\beta_{0}}\varepsilon

where the last inequality is due to the second inequality in (64). Since βK¯β0\beta_{\bar{K}}\leq\beta_{0}, the above inequality gives the desired contradiction, and hence the first conclusion of the theorem holds.

To show the second conclusion of the theorem, let k¯4K¯\bar{k}\leq 4\bar{K} denote the number of cycles generated by Ad-GPB. Proposition 3.5 and statements (b) and (c) of Lemma 4.2 then imply that the total number of iterations performed by Ad-GPB until it finds an iterate y^k\hat{y}_{k} satisfying ϕ(y^k)n^kε\phi(\hat{y}_{k})-\hat{n}_{k}\leq\varepsilon is bounded by

k=1k¯\displaystyle\sum_{k=1}^{\bar{k}} |𝒞k|k=1k¯(1+τ1τlog+[8t¯ε1]+2+sk)\displaystyle|{\cal C}_{k}|\leq\sum_{k=1}^{\bar{k}}\left(\frac{1+\tau}{1-\tau}\log^{+}\left[8\bar{t}\varepsilon^{-1}\right]+2+s_{k}\right)
log2+λ1λ¯+4K¯(1+τ1τlog+[8t¯ε1]+2)\displaystyle\leq\log_{2}^{+}\frac{{\lambda}_{1}}{\underline{{\lambda}}}+4\bar{K}\left(\frac{1+\tau}{1-\tau}\log^{+}\left[8\bar{t}\varepsilon^{-1}\right]+2\right)

and hence by (36), due to the definitions of Q¯\bar{Q} and λ¯\underline{{\lambda}} in (35) and (42), respectively.    

5 Computational experiments

This section reports the computational results of Ad-GPB* and two corresponding practical variants against other modern PB methods and the subgradient method. It contains two subsections. The first one presents the computational results for a simple l1l_{1} feasibility problem. The second one showcases the computational results for the Lagrangian cut problem arising in the area of integer programming.

All the methods tested in the following two subsections are terminated based on the following criterion:

ϕ(xk)ϕε¯[ϕ(x0)ϕ]\phi(x_{k})-\phi_{*}\leq\bar{\varepsilon}[\phi(x_{0})-\phi_{*}] (67)

where ε¯=106\bar{\varepsilon}=10^{-6}, 10510^{-5} or 10410^{-4}. All experiments were performed in MATLAB 2023a and run on a PC with a 16-core Intel Core i9 processor and 32 GB of memory.

Now we describe the algorithmic details used in the following two subsections. We first describe the Polyak subgradient method. Given xkx_{k}, it computes

xk+1=argminx{ϕ(x;xk)+12λpol(xk)xxk2}x_{k+1}=\mathrm{argmin}\,_{x}\left\{\ell_{\phi}(x;x_{k})+\frac{1}{2{\lambda}_{\rm pol}(x_{k})}\|x-x_{k}\|^{2}\right\}

where

λpol(x):=ϕ(x)ϕg(x)2{\lambda}_{\rm pol}(x):=\frac{\phi(x)-\phi_{*}}{\|g(x)\|^{2}} (68)

where g(x)f(x)g(x)\in\partial f(x). Next we describe five GPB-related methods, namely: GPB, Ad-GPB*, Ad-GPB**, Pol-Ad-GPB*, and Pol-GPB. A cycle k1k\geq 1 of Ad-GPB*, regardless of the way its initial prox stepsize is chosen, is called good if the prox stepsizes λj{\lambda}_{j} do not change (i.e., the inequality (25) is not violated) within it. First, GPB is stated in [18, 19]. Second, Ad-GPB* is stated in Subsection 4.1. Third, Ad-GPB** is a corresponding variant of Ad-GPB* that allows the prox stepsize to increase at the beginning of its initial cycles. Specifically, if k¯\bar{k} denotes the largest cycle index for which cycles 1 to k¯\bar{k} are good, then Ad-GPB** sets λik+1=2λ^k{\lambda}_{i_{k+1}}=2\hat{\lambda}_{k} for every kk¯k\leq\bar{k} and afterwards sets λik+1=λ^k{\lambda}_{i_{k+1}}=\hat{\lambda}_{k} for every k>k¯k>\bar{k} as Ad-GPB* does, where λ^k\hat{\lambda}_{k} and ik+1i_{k+1} are defined in step 2 of Ad-GPB* and (29), respectively. The motivation behind Ad-GPB** is to prevent Ad-GPB or Ad-GPB* from generating only small λj{\lambda}_{j}’s due to a poor choice of initial prox stepsize λ1{\lambda}_{1}. Pol-GPB and Pol-Ad-GPB* are two Polyak-type variants of GPB and Ad-GPB* where the initial prox stepsize for the kkth cycle is set to λik=40λpol(x^k1){\lambda}_{i_{k}}=40{\lambda}_{\rm pol}(\hat{x}_{k-1}). The above five methods all update the bundle Γ\Gamma according to the 2-cut update scheme described in Subsection 2.2. Finally, Ad-GPB*, Ad-GPB**, and Pol-Ad-GPB* use τ=0.95\tau=0.95.
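For illustration, the following minimal sketch implements the Polyak subgradient method above, specialized to the l1 feasibility problem (69) of the next subsection; it assumes that the composite term is the indicator of the nonnegative orthant, so that the prox step with stepsize (68) reduces to a projected subgradient step, and the function and variable names are ours rather than those of the MATLAB implementation used in the experiments.

import numpy as np

def polyak_subgradient_l1(A, b, x0, phi_star=0.0, rel_tol=1e-4, max_iter=200000):
    # Projected subgradient method with the Polyak stepsize (68) applied to
    # phi(x) = ||A x - b||_1 subject to x >= 0, cf. problem (69).
    x = np.maximum(x0, 0.0)
    phi0 = np.linalg.norm(A @ x - b, 1)
    for _ in range(max_iter):
        r = A @ x - b
        phi = np.linalg.norm(r, 1)
        if phi - phi_star <= rel_tol * (phi0 - phi_star):  # stopping rule (67)
            break
        g = A.T @ np.sign(r)                               # subgradient of f at x
        lam = (phi - phi_star) / np.linalg.norm(g) ** 2    # Polyak stepsize (68)
        x = np.maximum(x - lam * g, 0.0)                   # prox step = projection onto x >= 0
    return x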

5.1 l1l_{1} feasibility problem

This subsection reports the computational results on an l1l_{1} feasibility problem. It contains two subsections. The first one presents computational results of Ad-GPB* and Ad-GPB** against the GPB method of [18, 19]. The second one showcases the computational results of Pol-Ad-GPB* and Pol-GPB against the subgradient method with Polyak stepsize.

We start by describing the l1l_{1} feasibility problem. The problem can be formulated as:

ϕ:=minx0f(x):=Axb1\phi_{*}:=\min_{x\geq 0}f(x):=\|Ax-b\|_{1} (69)

where Am×nA\in\mathbb{R}^{m\times n} and bmb\in\mathbb{R}^{m} are known. We consider two different ways of generating the data, namely, dense and sparse. For dense problems, matrix AA is randomly generated in the form A=NUA=NU where the entries of the matrix Nm×nN\in\mathbb{R}^{m\times n} (resp., Un×nU\in\mathbb{R}^{n\times n}) are i.i.d. sampled from the standard normal 𝒩(0,1)\mathcal{N}(0,1) (resp., uniform 𝒰[0,100]\mathcal{U}[0,100]) distribution. For sparse problems, matrix AA is randomly generated in the form A=DNA=DN where the nonzero entries of the sparse matrix Nm×nN\in\mathbb{R}^{m\times n} are i.i.d. sampled from the standard normal 𝒩(0,1)\mathcal{N}(0,1) distribution and Dm×mD\in\mathbb{R}^{m\times m} is a diagonal matrix whose diagonal entries are i.i.d. sampled from 𝒰[0,1000]\mathcal{U}[0,1000]. In both cases, vector bb is determined as b=Axb=Ax_{*} where x=(v)2x_{*}=(v_{*})^{2} (entrywise square) for some vector vnv_{*}\in\mathbb{R}^{n} whose entries are i.i.d. sampled from the standard normal distribution 𝒩(0,1)\mathcal{N}(0,1). Finally, we generate x0=(v0)2x_{0}=(v_{0})^{2} for some vector v0nv_{0}\in\mathbb{R}^{n} whose entries are i.i.d. sampled from the uniform distribution over (0,1)(0,1). Clearly, xx_{*} is a global minimizer of (69), whose optimal value ϕ\phi_{*} equals zero in both cases. We test our methods on six dense and six sparse instances.
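For concreteness, the following sketch reproduces the above data-generation recipe; the random seed, the use of scipy's sparse random generator, and the helper names are our choices and are not taken from the implementation used in the experiments.

import numpy as np
import scipy.sparse as sp

def dense_instance(m, n, seed=0):
    rng = np.random.default_rng(seed)
    N = rng.standard_normal((m, n))                  # entries ~ N(0, 1)
    U = rng.uniform(0.0, 100.0, size=(n, n))         # entries ~ U[0, 100]
    A = N @ U                                        # A = N U
    x_star = rng.standard_normal(n) ** 2             # x_* = (v_*)^2 entrywise
    b = A @ x_star                                   # so the optimal value of (69) is zero
    x0 = rng.uniform(0.0, 1.0, size=n) ** 2          # x_0 = (v_0)^2 entrywise
    return A, b, x0

def sparse_instance(m, n, density, seed=0):
    rng = np.random.default_rng(seed)
    N = sp.random(m, n, density=density, random_state=seed,
                  data_rvs=rng.standard_normal)      # sparse with nonzeros ~ N(0, 1)
    D = sp.diags(rng.uniform(0.0, 1000.0, size=m))   # diagonal entries ~ U[0, 1000]
    A = D @ N                                        # A = D N
    x_star = rng.standard_normal(n) ** 2
    b = A @ x_star
    x0 = rng.uniform(0.0, 1.0, size=n) ** 2
    return A, b, x0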

We now describe some details about all the tables that appear in this subsection. We set the target ε¯\bar{\varepsilon} in (67) as 10510^{-5} for dense instances and 10410^{-4} for sparse instances. The quantities θm\theta_{m}, θn\theta_{n} and θs\theta_{s} are defined as θm=m/103\theta_{m}=m/10^{3}, θn=n/103\theta_{n}=n/10^{3}, and θs:=nnz(A)/mn\theta_{s}:=\text{nnz}(A)/mn, where nnz(A)\text{nnz}(A) is the number of non-zero entries of AA. An entry in each table is given as a fraction with the numerator expressing the (rounded) number of iterations and the denominator expressing the CPU running time in seconds. An entry marked as /*/* indicates that the CPU running time exceeds the allocated time limit. The bold numbers highlight the method that has the best performance for each instance.

5.1.1 Ad-GPB* versus GPB

This subsection presents computational results of the GPB method of [18, 19] against Ad-GPB* and Ad-GPB**.

To check how sensitive the three methods are relative to the initial choice of prox stepsize λ1{\lambda}_{1}, we test them for λ1=αλpol(x0){\lambda}_{1}=\alpha{\lambda}_{\rm pol}(x_{0}) where α{0.01,1,100}\alpha\in\{0.01,1,100\}. The computational results for the above three methods are given in Table 1 (resp., Table 2) for six sparse (resp., dense) instances. The time limit is four hours for Table 1 and two hours for Table 2.

ALG. GPB Ad-GPB* Ad-GPB**
(θm,θn,θs)(\theta_{m},\theta_{n},\theta_{s}) α\alpha\qquad 10210^{-2} 1 10210^{2} 10210^{-2} 1 10210^{2} 10210^{-2} 1 10210^{2}
(1,20,10210^{-2}) 68.3K125\frac{68.3K}{125} 153.7K314\frac{153.7K}{314} \frac{*}{*} 26.1K36\frac{26.1K}{36} 19.6K27\frac{19.6K}{27} 23.3K33\frac{23.3K}{33} 17.8𝑲𝟐𝟔\frac{17.8K}{26} 21.3K31\frac{21.3K}{31} 23.3K34\frac{23.3K}{34}
(3,30,10210^{-2}) 164.2K450\frac{164.2K}{450} 120.8K381\frac{120.8K}{381} \frac{*}{*} 59.8K152\frac{59.8K}{152} 39.5k102\frac{39.5k}{102} 62.4K164\frac{62.4K}{164} 55.4K144\frac{55.4K}{144} 36.3𝑲𝟗𝟒\frac{36.3K}{94} 62.4K165\frac{62.4K}{165}
(5,50,10210^{-2}) 123.4K811\frac{123.4K}{811} 99.4K654\frac{99.4K}{654} \frac{*}{*} 62.8K409\frac{62.8K}{409} 32.7K212\frac{32.7K}{212} 64.5K409\frac{64.5K}{409} 59.1K387\frac{59.1K}{387} 32.1𝑲𝟐𝟏𝟏\frac{32.1K}{211} 64.5K420\frac{64.5K}{420}
(10,100,10310^{-3}) 152.2K928\frac{152.2K}{928} 132.0K837\frac{132.0K}{837} \frac{*}{*} 67.4K363\frac{67.4K}{363} 40.5𝑲𝟐𝟐𝟔\frac{40.5K}{226} 66.9K360\frac{66.9K}{360} 61.8K337\frac{61.8K}{337} 44.4K242\frac{44.4K}{242} 66.9K360\frac{66.9K}{360}
(20,200,10310^{-3}) 136.2K7119\frac{136.2K}{7119} 111.2K5725\frac{111.2K}{5725} \frac{*}{*} 78.4K2636\frac{78.4K}{2636} 41.3K1401\frac{41.3K}{1401} 76.0K2577\frac{76.0K}{2577} 67.9K2384\frac{67.9K}{2384} 40.8𝑲𝟏𝟒𝟒𝟏\frac{40.8K}{1441} 76.0K3280\frac{76.0K}{3280}
(50,500,10410^{-4}) 148.1K12574\frac{148.1K}{12574} 130.1K11864\frac{130.1K}{11864} \frac{*}{*} 70.4K3947\frac{70.4K}{3947} 42.9𝑲𝟐𝟒𝟔𝟎\frac{42.9K}{2460} 65.0K3803\frac{65.0K}{3803} 67.1K3820\frac{67.1K}{3820} 43.0K2392\frac{43.0K}{2392} 65.0K3610\frac{65.0K}{3610}
Table 1: Numerical results for sparse instances. A relative tolerance of ε¯=104\bar{\varepsilon}=10^{-4} is set and a time limit of 14400 seconds (4 hours) is given.
ALG. GPB Ad-GPB* Ad-GPB**
(θm,θn)(\theta_{m},\theta_{n}) α\alpha\qquad 10210^{-2} 1 10210^{2} 10210^{-2} 1 10210^{2} 10210^{-2} 1 10210^{2}
(0.5,1.5) 9354.7K1502\frac{9354.7K}{1502} 3323.6K462\frac{3323.6K}{462} \frac{*}{*} 74.2K6\frac{74.2K}{6} 47.9𝑲𝟓\frac{47.9K}{5} 49.7K5\frac{49.7K}{5} 53.8K5\frac{53.8K}{5} 56.4K6\frac{56.4K}{6} 49.7K5\frac{49.7K}{5}
(1,3) \frac{*}{*} 5384.9K3114\frac{5384.9K}{3114} \frac{*}{*} 86.7K42\frac{86.7K}{42} 81.9K40\frac{81.9K}{40} 87.0K43\frac{87.0K}{43} 137.3K70\frac{137.3K}{70} 79.0𝑲𝟒𝟏\frac{79.0K}{41} 143.1K75\frac{143.1K}{75}
(2,6) \frac{*}{*} 221.8K1214\frac{221.8K}{1214} 59.0𝑲𝟓𝟎𝟗\frac{59.0K}{509} 305.6K1552\frac{305.6K}{1552} 181.4K911\frac{181.4K}{911} 134.7K677\frac{134.7K}{677} 136.7K685\frac{136.7K}{685} 176.6K882\frac{176.6K}{882} 133.9K669\frac{133.9K}{669}
(1.5,0.5) 1630.7K181\frac{1630.7K}{181} 495.1K55\frac{495.1K}{55} \frac{*}{*} 135.5K13\frac{135.5K}{13} 102.5𝑲𝟏𝟎\frac{102.5K}{10} 113.1K11\frac{113.1K}{11} 128.4K13\frac{128.4K}{13} 104.6K10\frac{104.6K}{10} 117.8K11\frac{117.8K}{11}
(3,1) 2170.7K869\frac{2170.7K}{869} 502.5K199\frac{502.5K}{199} \frac{*}{*} 233.5K100\frac{233.5K}{100} 155.5𝑲𝟔𝟓\frac{155.5K}{65} 166.5K74\frac{166.5K}{74} 180.8K71\frac{180.8K}{71} 156.7K64\frac{156.7K}{64} 175.5K73\frac{175.5K}{73}
(6,2) \frac{*}{*} 757.9K3542\frac{757.9K}{3542} \frac{*}{*} 351.4K1779\frac{351.4K}{1779} 242.4K1211\frac{242.4K}{1211} 276.6K1376\frac{276.6K}{1376} 304.0K1515\frac{304.0K}{1515} 238.0𝑲𝟏𝟏𝟓𝟏\frac{238.0K}{1151} 276.6K1340\frac{276.6K}{1340}
Table 2: Numerical results for dense instances. A relative tolerance of ε¯=105\bar{\varepsilon}=10^{-5} is set and a time limit of 7200 seconds (2 hours) is given.

The results in Tables 1 and 2 show that Ad-GPB* and Ad-GPB** are generally at least two to three times faster than GPB in terms of CPU running time. They also show that Ad-GPB* and Ad-GPB** are more robust to the choice of initial stepsize than GPB. Finally, we observe that Ad-GPB** generally performs slightly better than Ad-GPB* for small initial stepsizes, which can be attributed to the stepsize increase it performs at the end of its good cycles.

5.1.2 Polyak type methods

This subsection considers two Polyak-type variants of GPB and Ad-GPB* where the initial prox stepsize for the kkth-cycle is set to λik=40λpol(x^k1){\lambda}_{i_{k}}=40{\lambda}_{\rm pol}(\hat{x}_{k-1}). These two variants in turn are compared with the subgradient method with Polyak stepsize (see (68)) and the Ad-GPB* and Ad-GPB** variants described in Subsection 5.1.1.

The computational results for the above five methods are given in Table 3 (resp., Table 4) for six sparse (resp., dense) instances. The results for Ad-GPB* and Ad-GPB** are the same ones that appear in Tables 1 and 2 with α=1\alpha=1. They are duplicated here for the sake of convenience.

(θm,θn,θs)(\theta_{m},\theta_{n},\theta_{s}) Pol-Sub Pol-GPB Pol-Ad-GPB* Ad-GPB* Ad-GPB**
(1,20,10210^{-2}) 431.3K/354 33.8K/58 15.5K/22 19.6K/27 21.3K/31
(3,30,10210^{-2}) 413.3K/786 126.9K/392 21.2K/60 39.5K/102 36.3K/94
(5,50,10210^{-2}) 389.8K/2440 91.4K/581 19.7K/128 32.7K/212 32.1K/211
(10,100,10310^{-3}) 473.2K/2092 137.12K/876 21.3K/157 40.5K/226 44.4K/242
(20,200,10310^{-3}) */* 115.7K/5576 25.9K/1070 41.3K/1401 40.8K/1441
(50,500,10410^{-4}) */* 133.0K/13754 25.5K/1847 42.9K/2460 43.0K/2392
Table 3: Numerical results for sparse instances. A relative tolerance of ε¯=104\bar{\varepsilon}=10^{-4} is set and a time limit of 14400 seconds (4 hours) is given.
(θm,θn)(\theta_{m},\theta_{n}) Pol-Sub Pol-GPB Pol-Ad-GPB* Ad-GPB* Ad-GPB**
(0.5,1.5) 479.8K/36.5 143.2K/22.8 73.5K/9.4 47.9K/5 56.4K/6
(1,3) 1644.3K/1440 390.3K/196.7 144.4K/72.1 81.9K/40 79.0K/41
(2,6) 354.5K/2513 46.4K/233 182.5K/959 181.4K/911 176.6K/882
(1.5,0.5) 699.4K/79.5 109.9K/12.0 56.5K/5.6 102.5K/10 104.6K/10
(3,1) 1034.6K/1046 147.8K/57.9 89.2K/38.8 155.5K/65 156.7K/64
(6,2) */* 136.1K/602 143.1K/713 242.4K/1211 238.0K/1151
Table 4: Numerical results for dense instances. A relative tolerance of ε¯=105\bar{\varepsilon}=10^{-5} is set and a time limit of 7200 seconds (2 hours) is given.

Tables 3 and 4 demonstrate that PB methods generally outperform Pol-Subgrad in terms of CPU running time. Additionally, Pol-Ad-GPB* stands out as a particularly effective variant, outperforming other methods in eight out of twelve instances.

5.2 Lagrangian cut problem

This subsection presents the numerical results comparing Ad-GPB* and Pol-Ad-GPB* against GPB and Pol-Sub on a convex nonsmooth optimization problem that has broad applications in the field of integer programming (see e.g. [28]).

The problem considered in this subsection arises in the context of solving the stochastic binary multi-knapsack problem

mincTx+P(x) s.t. Axbx{0,1}n\begin{array}[]{cl}\min&c^{T}x+P(x)\\ \text{ s.t. }&Ax\geq b\\ &x\in\{0,1\}^{n}\\ \end{array} (70)

where P(x):=𝔼ξ[Pξ(x)]P(x):=\mathbb{E}_{\xi}\left[P_{\xi}(x)\right] and

Pξ(x):=min\displaystyle P_{\xi}(x):=\min q(ξ)Ty\displaystyle\quad q(\xi)^{T}y (71)
s.t. WyhTx\displaystyle Wy\geq h-Tx
y{0,1}n\displaystyle y\in\{0,1\}^{n}

for every x{0,1}nx\in\{0,1\}^{n}. In the second-stage problem, only the objective vector q(ξ)q(\xi) is a random variable. Moreover, it is assumed that its support Ξ\Xi is a finite set, i.e., q()q(\cdot) has a finite number of scenarios ξ\xi’s.

Benders decomposition (see e.g. [7, 26]) is an efficient cutting-plane approach for solving (70) which approximates P()P(\cdot) by a pointwise maximum of cuts for P()P(\cdot). Specifically, an affine function 𝒜(){\cal A}(\cdot) such that P(x)𝒜(x)P(x^{\prime})\geq{\cal A}(x^{\prime}) for every x{0,1}nx^{\prime}\in\{0,1\}^{n} is called a cut for P()P(\cdot); moreover, a cut for P()P(\cdot) is tight at xx if P(x)=𝒜(x)P(x)={\cal A}(x). Benders decomposition starts with a cut 𝒜0{\cal A}_{0} for P()P(\cdot) and computes a sequence {xk}\{x_{k}\} of iterates as follows: given cuts {𝒜i()}i=0k1\{{\cal A}_{i}(\cdot)\}_{i=0}^{k-1} for P()P(\cdot), it computes xkx_{k} as

xk=argminx\displaystyle x_{k}=\mathrm{argmin}\,_{x} cTx+Pk(x)\displaystyle c^{T}x+P_{k}(x) (72)
s.t. Axb\displaystyle Ax\geq b
x{0,1}n\displaystyle x\in\{0,1\}^{n}

where

Pk()=maxi=0,,k1𝒜i();P_{k}(\cdot)=\max_{i=0,\cdots,k-1}{\cal A}_{i}(\cdot);

it then uses xkx_{k} to generate a new cut 𝒜k{\cal A}_{k} for P()P(\cdot) and repeats the above steps with kk replaced by k+1k+1. Problem (72) can be easily formulated as an equivalent linear integer programming problem.
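The loop just described can be summarized by the following skeleton, in which solve_master and make_cut are hypothetical callables standing in, respectively, for the integer-programming solve of (72) and for a cut-generation procedure such as the Lagrangian cut described next; the stopping rule compares the true value P(x_k) with the model value P_k(x_k).

def benders(solve_master, make_cut, initial_cut, tol=1e-6, max_cuts=100):
    # solve_master(cuts) -> (x_k, model_value), where model_value = P_k(x_k) from (72);
    # make_cut(x)        -> (cut, true_value), a cut tight at x and the value P(x).
    cuts = [initial_cut]
    while True:
        x_k, model_value = solve_master(cuts)       # master problem (72)
        new_cut, true_value = make_cut(x_k)         # e.g. the Lagrangian cut of this subsection
        if true_value - model_value <= tol or len(cuts) >= max_cuts:
            return x_k, cuts
        cuts.append(new_cut)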

We now describe how to generate a Lagrangian cut for P()P(\cdot) using a given point x{0,1}nx\in\{0,1\}^{n}. First, for every ξΞ\xi\in\Xi, (71) is equivalent to

miny,u\displaystyle\min_{y,u} q(ξ)Ty\displaystyle\quad q(\xi)^{T}y
s.t. Wy+Tuh\displaystyle Wy+Tu\geq h
y{0,1}n,u[0,1]n\displaystyle y\in\{0,1\}^{n},\,u\in[0,1]^{n}
ux=0.\displaystyle u-x=0.

By dualizing the constraint ux=0u-x=0, we obtain the Lagrangian dual (LD) problem

Dξ(x):=maxπLξ(x;π)D_{\xi}(x):=\max_{\pi}L_{\xi}(x;\pi) (73)

where

Lξ(x;π)\displaystyle L_{\xi}(x;\pi) :=miny,uq(ξ)TyπT(ux)\displaystyle:=\min_{y,u}\,q(\xi)^{T}y-\pi^{T}(u-x) (74)
s.t.Wy+Tuh\displaystyle\text{s.t.}\quad Wy+Tu\geq h
y{0,1}n,u[0,1]n.\displaystyle\ \qquad y\in\{0,1\}^{n},\,u\in[0,1]^{n}.

Let πξ(x)\pi_{\xi}(x) denote an optimal solution of (73). The optimal values Pξ()P_{\xi}(\cdot) and Dξ()D_{\xi}(\cdot) of (71) and (73), respectively, are known to satisfy the following two properties for every ξΞ\xi\in\Xi and x{0,1}nx\in\{0,1\}^{n}:

  • (i)

    Pξ(x)Dξ(x)Dξ(x)+πξ(x),xxP_{\xi}(x^{\prime})\geq D_{\xi}(x^{\prime})\geq D_{\xi}(x)+\langle\pi_{\xi}(x),x^{\prime}-x\rangle for every x{0,1}nx^{\prime}\in\{0,1\}^{n};

  • (ii)

    Pξ(x)=Dξ(x)P_{\xi}(x)=D_{\xi}(x).

Property (i) can be found in many textbooks dealing with Lagrangian duality theory and property (ii) has been established in [28]. Defining

π(x):=𝔼[πξ(x)]\pi(x):=\mathbb{E}[\pi_{\xi}(x)]

and taking expectation of the relations in (i) and (ii), we easily see that

P(x)P(x)+π(x),xxx{0,1}n,P(x^{\prime})\geq P(x)+\langle\pi(x),x^{\prime}-x\rangle\quad\forall x^{\prime}\in\{0,1\}^{n},

and hence that 𝒜x():=P(x)+π(x),x{\cal A}_{x}(\cdot):=P(x)+\langle\pi(x),\cdot-x\rangle is a tight cut for P()P(\cdot) at xx.

Computation of P(x)P(x) assumes that the optimal value Pξ()P_{\xi}(\cdot) of (71) can be efficiently computed for every ξΞ\xi\in\Xi. Computation of π(x)\pi(x) assumes that an optimal solution of (73) can be computed for every ξΞ\xi\in\Xi. Noting that (73) is an unconstrained convex nonsmooth optimization problem in terms of variable π\pi and its optimal value Dξ(x)D_{\xi}(x) is the (already computed) optimal value Pξ(x)P_{\xi}(x) of (71), we use the Ad-GPB* variant of Ad-GPB to obtain a near optimal solution πξ(x)\approx\pi_{\xi}(x) of (73). For the purpose of this subsection, we use several instances of (73) to benchmark the methods described at the beginning of this section.

For every (ξ,x)Ξ×{0,1}n(\xi,x)\in\Xi\times\{0,1\}^{n}, recall that using Ad-GPB* to solve (73) requires the ability to evaluate Lξ(x,)L_{\xi}(x,\cdot) and compute a subgradient of Lξ(x,)-L_{\xi}(x,\cdot) at every πn\pi\in\mathbb{R}^{n}. The value Lξ(x,π)L_{\xi}(x,\pi) is evaluated by solving MILP (74). Moreover, if (uξ(x;π),yξ(x;π))(u_{\xi}(x;\pi),y_{\xi}(x;\pi)) denotes an optimal solution of (74), then uξ(x;π)xu_{\xi}(x;\pi)-x yields a subgradient of Lξ(x,)-L_{\xi}(x,\cdot) at π\pi. It is worth noting that (73) is a non-smooth convex problem that does not seem to be tractable by the methods discussed in the papers (see e.g. [3, 9, 10, 14, 22, 23, 24, 25]) for solving min-max smooth convex-concave saddle-point problems, mainly due to the integrality condition imposed on the decision variable yy in (74).
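A possible implementation of this first-order oracle is sketched below; it solves (74) with scipy.optimize.milp (our choice of solver purely for illustration; the experiments themselves were run in MATLAB) and returns the value of -L_ξ(x,·) at π together with the subgradient u_ξ(x;π) - x.

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def neg_L_oracle(pi, x, q, W, T, h):
    # Solves MILP (74) in the variables (y, u), with y binary and u in [0, 1]^n,
    # and returns -L_xi(x; pi) and a subgradient of -L_xi(x, .) at pi.
    ny, nu = len(q), len(x)
    c = np.concatenate([q, -pi])                         # objective q^T y - pi^T u (pi^T x is constant)
    cons = LinearConstraint(np.hstack([W, T]), lb=h)     # W y + T u >= h
    integrality = np.concatenate([np.ones(ny), np.zeros(nu)])
    bounds = Bounds(lb=np.zeros(ny + nu), ub=np.ones(ny + nu))
    res = milp(c=c, constraints=cons, integrality=integrality, bounds=bounds)
    y, u = res.x[:ny], res.x[ny:]
    val = q @ y - pi @ (u - x)                           # L_xi(x; pi)
    return -val, u - x                                   # value and subgradient of -L_xi(x, .)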

Next we describe how the data of (70) and (71) is generated. We generate three random instances of (70), each following the same methodology as in [1]. We set n=240n=240 and Ξ={1,,20}\Xi=\{1,\ldots,20\} with each scenario ξΞ\xi\in\Xi being equiprobable. We generate matrices A1,A250×120A_{1},A_{2}\in\mathbb{R}^{50\times 120}, T1,W5×120T_{1},W\in\mathbb{R}^{5\times 120}, and vector c240c\in\mathbb{R}^{240}, with all entries i.i.d. sampled from the uniform distribution over the integers {1,,100}\{1,\ldots,100\}. We then set A=[A1A2]A=[A_{1}\,A_{2}] and T=[T1 0]T=[T_{1}\,0] where the zero block of TT is 5×1205\times 120. Twenty vectors {q(ξ)}ξΞ\{q(\xi)\}_{\xi\in\Xi} with components i.i.d. sampled from {1,,100}\{1,\ldots,100\} are generated. Finally, we set b=3(A1𝟏+A2𝟏)/4b=3(A_{1}\mathbf{1}+A_{2}\mathbf{1})/4 and h=3(W𝟏+T1𝟏)/4h=3(W\mathbf{1}+T_{1}\mathbf{1})/4 where 𝟏\mathbf{1} denotes the vector of ones.
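A sketch of this construction is given below; the dimension of each q(ξ) (chosen to match the number of columns of W) and the helper names are our assumptions.

import numpy as np

def knapsack_instance(seed=0, n1=120, n2=120, m1=50, m2=5, num_scenarios=20):
    # Generates one random instance of (70)-(71) following the recipe above.
    rng = np.random.default_rng(seed)
    rint = lambda *shape: rng.integers(1, 101, size=shape).astype(float)  # uniform over {1,...,100}
    A1, A2 = rint(m1, n1), rint(m1, n2)
    T1, W = rint(m2, n1), rint(m2, n2)
    c = rint(n1 + n2)
    A = np.hstack([A1, A2])                        # A = [A1  A2]
    T = np.hstack([T1, np.zeros((m2, n2))])        # T = [T1  0] with a 5 x 120 zero block
    b = 3 * (A1 @ np.ones(n1) + A2 @ np.ones(n2)) / 4
    h = 3 * (W @ np.ones(n2) + T1 @ np.ones(n1)) / 4
    q = [rint(n2) for _ in range(num_scenarios)]   # twenty cost vectors q(xi)
    return A, b, c, T, W, h, q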

For each randomly generated instance of (70), we run Benders decomposition started from x0=𝟏x_{0}=\mathbf{1} to obtain three iterates x1x_{1}, x2x_{2}, and x3x_{3}. Each P(xk)P(x_{k}) and π(xk)\pi(x_{k}) for k=1,2,3k=1,2,3 is computed using the twenty randomly generated vectors {q(ξ)}ξΞ\{q(\xi)\}_{\xi\in\Xi} as described above, and hence each Benders iteration solves twenty LD subproblems as in (73). Hence, each randomly generated instance of (70) yields a total of sixty LD instances as in (73). The total time to solve these sixty LD instances is given in Table 5 for the three instances of (70) (named I1I1, I2I2 and I3I3 in the table) and all the benchmarked methods considered in this section. In this comparison, both Ad-GPB* and GPB set the initial stepsize λ1{\lambda}_{1} to λpol(π0){\lambda}_{\rm pol}(\pi_{0}), where the entries of π0\pi_{0} are i.i.d. generated from the uniform distribution over (0,1)(0,1).

GPB Ad-GPB* Pol-Subgrad Pol-Ad-GPB*
I1I1 751s 600s 1390s 98s
I2I2 721s 550s 1503s 32s
I3I3 1250s 963s 5206s 98s
Table 5: Numerical results for solving LD subproblems. A relative tolerance of ε¯=106\bar{\varepsilon}=10^{-6} is set.

Table 5 shows that Ad-GPB* consistently outperforms GPB. Additionally, it shows that Pol-Ad-GPB* once again surpasses all other methods.

6 Concluding remarks

This paper presents a parameter-free adaptive proximal bundle method featuring two key ingredients: i) an adaptive strategy for selecting variable proximal step sizes tailored to specific problem instances, and ii) an adaptive cycle-stopping criterion that enhances the effectiveness of serious steps. Computational experiments reveal that our method significantly reduces the number of consecutive null steps (i.e., shorter cycles) while maintaining a manageable number of serious steps. As a result, it requires fewer iterations than methods employing a constant proximal step size and a non-adaptive cycle termination criterion. Moreover, our approach demonstrates considerable robustness to variations in the initial step size provided by the user.

We now discuss some possible extensions of our results. Recall that the complexity analysis of Ad-GPB* assumes that domh\mathrm{dom}\,h is bounded; specifically, Lemma 4.2(c) uses this assumption to give a simple proof that tikt_{i_{k}} is bounded in terms of (M,L)(M,L) and the diameter DD of domh\mathrm{dom}\,h. Using more involved arguments, such as those used in Appendix A of [19], we can show that the complexity analysis of Ad-GPB* extends to the setting where domh\mathrm{dom}\,h is unbounded.

Finally, we mention some other possible research directions. First, establishing the iteration complexity of Ad-GPB in the case where ϕ\phi_{*} is unknown and domh\mathrm{dom}\,h is unbounded is a more challenging and interesting research topic. Second, it would be interesting to analyze the complexities of Pol-GPB and Pol-Ad-GPB* described in Section 5, as they both performed well in our computational experiments. Third, our current analysis does not apply to the one-cut bundle update scheme (see Subsection 3.1 of [19]) since it is not a special case of BUF, as already observed in the second remark following BUF. It would be interesting to extend the analysis of this paper to establish the complexity of Ad-GPB based on the one-cut bundle update scheme.

References

  • [1] G. Angulo, S. Ahmed, and S. S Dey. Improving the integer l-shaped method. INFORMS Journal on Computing, 28(3):483–499, 2016.
  • [2] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. A family of variable metric proximal methods. Mathematical Programming, 68:15–47, 1995.
  • [3] Y. Chen, G. Lan, and Y. Ouyang. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.
  • [4] W. de Oliveira and M. Solodov. A doubly stabilized bundle method for nonsmooth convex optimization. Mathematical programming, 156(1-2):125–159, 2016.
  • [5] M. Díaz and B. Grimmer. Optimal convergence rates for the proximal bundle method. SIAM Journal on Optimization, 33(2):424–454, 2023.
  • [6] Y. Du and A. Ruszczyński. Rate of convergence of the bundle method. Journal of Optimization Theory and Applications, 173:908–922, 2017.
  • [7] A. M. Geoffrion. Generalized Benders decomposition. Journal of Optimization Theory and Applications, 10:237–260, 1972.
  • [8] V. Guigues, J. Liang, and R.D.C Monteiro. Universal subgradient and proximal bundle methods for convex and strongly convex hybrid composite optimization. arXiv preprint arXiv:2407.10073, 2024.
  • [9] Y. He and R. D. C. Monteiro. Accelerating block-decomposition first-order methods for solving composite saddle-point and two-player nash equilibrium problems. SIAM Journal on Optimization, 25(4):2182–2211, 2015.
  • [10] Y. He and R. D. C. Monteiro. An accelerated hpe-type algorithm for a class of composite convex-concave saddle-point problems. SIAM Journal on Optimization, 26(1):29–56, 2016.
  • [11] N. Karmitsa and M. M. Mäkelä. Adaptive limited memory bundle method for bound constrained large-scale nonsmooth optimization. Optimization, 59(6):945–962, 2010.
  • [12] K. C. Kiwiel. Efficiency of proximal bundle methods. Journal of Optimization Theory and Applications, 104(3):589, 2000.
  • [13] K. C. Kiwiel. A proximal bundle method with approximate subgradient linearizations. SIAM Journal on Optimization, 16(4):1007–1023, 2006.
  • [14] O. Kolossoski and R. D. C. Monteiro. An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex–concave saddle-point problems. Optimization Methods and Software, 32(6):1244–1272, 2017.
  • [15] C. Lemaréchal. An extension of davidon methods to non differentiable problems. In Nondifferentiable optimization, pages 95–109. Springer, 1975.
  • [16] C. Lemaréchal. Nonsmooth optimization and descent methods. 1978.
  • [17] C. Lemaréchal and C. Sagastizábal. Variable metric bundle methods: from conceptual to implementable forms. Mathematical Programming, 76:393–410, 1997.
  • [18] J. Liang and R. D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. SIAM Journal on Optimization, 31(4):2955–2986, 2021.
  • [19] J. Liang and R. D. C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 49(2):832–855, 2024.
  • [20] J. Liang, R. D. C. Monteiro, and H. Zhang. Proximal bundle methods for hybrid weakly convex composite optimization problems. arXiv preprint arXiv:2303.14896, 2023.
  • [21] R. Mifflin. A modification and an extension of Lemarechal’s algorithm for nonsmooth minimization, pages 77–90. Springer Berlin Heidelberg, Berlin, Heidelberg, 1982.
  • [22] R. D. C. Monteiro and B. F. Svaiter. Complexity of variants of Tseng’s modified F-B splitting and Korpelevich’s methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM Journal on Optimization, 21(4):1688–1720, 2011.
  • [23] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 15(1):229–251, 2004.
  • [24] A. S. Nemirovski and D. B. Yudin. Cesari convergence of the gradient method of approximating saddle points of convex-concave functions. In Doklady Akademii Nauk, volume 239, pages 1056–1059. Russian Academy of Sciences, 1978.
  • [25] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Programming, 103(1):127–152, 2005.
  • [26] R. Rahmaniani, T. G. Crainic, M. Gendreau, and W. Rei. The Benders decomposition algorithm: A literature review. European Journal of Operational Research, 259(3):801–817, 2017.
  • [27] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. In Nondifferentiable optimization, pages 145–173. Springer, 1975.
  • [28] J. Zou, S. Ahmed, and X. A. Sun. Stochastic dual dynamic integer programming. Mathematical Programming, 175:461–502, 2019.