
Universal subgradient and proximal bundle methods for
convex and strongly convex hybrid composite optimization

Vincent Guigues   Jiaming Liang   Renato D.C. Monteiro

School of Applied Mathematics FGV/EMAp, 22250-900 Rio de Janeiro, Brazil (email: [email protected]). Goergen Institute for Data Science and Department of Computer Science, University of Rochester, Rochester, NY 14620 (email: [email protected]). School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332 (email: [email protected]). This work was partially supported by AFOSR Grant FA9550-22-1-0088.
July 14, 2024 (1st revision: August 2, 2024)
Abstract

This paper develops two parameter-free methods for solving convex and strongly convex hybrid composite optimization problems, namely, a composite subgradient type method and a proximal bundle type method. Both functional and stationary complexity bounds for the two methods are established in terms of the unknown strong convexity parameter. To the best of our knowledge, the two proposed methods are the first universal methods for solving hybrid strongly convex composite optimization problems that do not rely on any restart scheme nor require the knowledge of the optimal value.

Key words. hybrid composite optimization, iteration complexity, universal method, proximal bundle method

AMS subject classifications. 49M37, 65K05, 68Q25, 90C25, 90C30, 90C60

1 Introduction

This paper considers convex hybrid composite optimization (HCO) problem

ϕ:=min{ϕ(x):=f(x)+h(x):xn},\phi_{*}:=\min\left\{\phi(x):=f(x)+h(x):x\in\mathbb{R}^{n}\right\}, (1)

where f,h:n{+}f,h:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\} are proper lower semi-continuous convex functions such that domhdomf\mathrm{dom}\,h\subseteq\mathrm{dom}\,f and the following conditions hold: there exist scalars Mf0M_{f}\geq 0 and Lf0L_{f}\geq 0 and a first-order oracle f:domhnf^{\prime}:\mathrm{dom}\,h\to\mathbb{R}^{n} (i.e., f(x)f(x)f^{\prime}(x)\in\partial f(x) for every xdomhx\in\mathrm{dom}\,h) satisfying the (Mf,Lf)(M_{f},L_{f})-hybrid condition that f(x)f(y)2Mf+Lfxy\|f^{\prime}(x)-f^{\prime}(y)\|\leq 2M_{f}+L_{f}\|x-y\| for every x,ydomhx,y\in\mathrm{dom}\,h. Moreover, assume that μ0\mu\geq 0 is the largest scalar such that ϕ()μ2/2\phi(\cdot)-\mu\|\cdot\|^{2}/2 is convex, i.e., μ\mu is the intrinsic convex parameter of ϕ\phi.
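To make the setup concrete, the following sketch (ours, not from the paper; the instance and constants are illustrative) builds a simple HCO instance satisfying the conditions above and numerically checks the (M_{f},L_{f})-hybrid condition for its subgradient oracle.

import numpy as np

# Illustrative HCO instance (1):  f(x) = Mf*||x|| + (Lf/2)*||x||^2  and
# h(x) = (mu_h/2)*||x||^2, so that phi = f + h has intrinsic convexity
# parameter mu = Lf + mu_h.  The oracle below satisfies the hybrid condition
#   ||f'(x) - f'(y)|| <= Mf*|| x/||x|| - y/||y|| || + Lf*||x - y|| <= 2*Mf + Lf*||x - y||.
Mf, Lf, mu_h = 1.0, 2.0, 0.1

def f_prime(x):
    n = np.linalg.norm(x)
    # any vector of norm <= Mf is a valid subgradient of Mf*||.|| at x = 0
    return (Mf / n) * x + Lf * x if n > 0 else Lf * x

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lhs = np.linalg.norm(f_prime(x) - f_prime(y))
    assert lhs <= 2 * Mf + Lf * np.linalg.norm(x - y) + 1e-12
print("hybrid condition (Mf, Lf) verified on random pairs")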

This work is concerned with parameter-free (PF) methods for solving (1), i.e., ones that do not require knowledge of any of the parameters associated with the instance (f,h)(f,h), such as the parameter pair (Mf,Lf)(M_{f},L_{f}) or the intrinsic convexity parameter μ\mu of ϕ\phi. More specifically, it considers PF methods whose complexities for solving (1) are expressed in terms of μ\mu (in addition to other parameters associated with (f,h)(f,h)). We refer to them as μ\mu-universal methods. PF methods whose (provable) complexities do not depend on μ\mu are called universal ones (even if μ>0\mu>0). Moreover, PF methods whose complexities are given in terms of the intrinsic convexity parameter μf\mu_{f} of ff (resp., μh\mu_{h} of hh) are called μf\mu_{f}-universal (resp., μh\mu_{h}-universal). It is worth noting that μ\mu can be substantially larger than μf+μh\mu_{f}+\mu_{h} (e.g., for α0\alpha\gg 0, f(x)=αexp(x)f(x)=\alpha\exp(x), and h(x)=αexp(x)h(x)=\alpha\exp(-x), we have μ=2α0=μf+μh\mu=2\alpha\gg 0=\mu_{f}+\mu_{h}). Hence, complexities of μ\mu-universal methods are usually better than those of μf\mu_{f}-universal and μh\mu_{h}-universal methods, and even of (universal or non-universal) methods whose complexities depend on μf+μh\mu_{f}+\mu_{h}.
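For the example just given, the computation behind the claim is immediate:

\phi''(x)=\alpha e^{x}+\alpha e^{-x}\geq 2\alpha\quad(\text{with equality at }x=0),\qquad\inf_{x\in\mathbb{R}}f''(x)=\inf_{x\in\mathbb{R}}\alpha e^{x}=0,

so that \mu=2\alpha, whereas \mu_{f}=\mu_{h}=0 since neither f'' nor h'' is bounded away from zero.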

Related literature. We divide our discussion here into universal and μ\mu-universal methods.

Universal methods: The first universal methods for solving (1) under the condition that f\nabla f is Hölder continuous have been presented in [21] and [10]. Specifically, the first paper develops universal variants of the primal gradient, the dual gradient, and the accelerated gradient methods, while the second one shows that accelerated versions of the bundle-level and the prox-level methods are universal. Additional universal methods for solving (1) have been studied in [14, 16, 19, 25] under the condition that ff is smooth, and in [13, 15, 17] for the case where ff is either smooth or nonsmooth. The methods in [15, 17] (resp., [19, 25]) are also shown to be μh\mu_{h}-universal (resp., μf\mu_{f}-universal under the condition that h=0h=0). The papers [14, 16] present universal accelerated composite gradient methods for solving (1) under the more general condition that ff is a smooth mm-weakly convex function. Since any convex function is mm-weakly convex for any m>0m>0, the results of [14, 16] also apply to the convex case and yield complexity bounds similar to the ones of [19, 25].

μ\mu-Universal methods: Under the assumption that ff is a smooth function (i.e., Mf=0M_{f}=0), various works [1, 2, 3, 4, 5, 6, 8, 12, 20, 23] have developed μ\mu-universal (or μf\mu_{f}-universal) methods for (1) with an 𝒪~(Lf/μ)\tilde{\cal O}(\sqrt{L_{f}/\mu}) (or 𝒪~(Lf/μf)\tilde{\cal O}(\sqrt{L_{f}/\mu_{f}})) iteration complexity bound. For the sake of our discussion, we refer to a convex (resp., strongly convex) version of an accelerated gradient method as ACG (resp., S-ACG). Among papers concerned with finding an ε\varepsilon-solution of (1), [20] proposes the first μf\mu_{f}-universal method based on a restart S-ACG scheme where each iteration adaptively updates an estimate of μf\mu_{f} and calls S-FISTA (see also [8] for a μ\mu-universal variant along this vein); moreover, using restart ACG schemes motivated by previous works such as [11, 22, 24], paper [23] develops μ\mu-universal methods under the assumption that ϕ\phi_{*} is known.

Among papers concerned with finding an ε\varepsilon-stationary solution of (1), [1, 2, 3, 4, 6] (resp., [12]) develop μ\mu-universal (resp., μf\mu_{f}-universal) methods based on restart ACG (resp., S-ACG) schemes that estimate the condition number Lf/μL_{f}/\mu (resp., μf\mu_{f}), or some related quantity; [3, 4, 6] then use this estimate to determine the number of iterations of each ACG call, while the SCAR method of [12] uses a stationarity-based termination criterion to end each S-ACG call. Moreover, under the assumption that ϕ\phi_{*} and LfL_{f} are known, [5] develops a μ\mu-universal method that performs only one call to an ACG variant (for convex CO).

Under the assumption that ff is non-smooth (i.e., Mf>0M_{f}>0), [23] (see also [9] for an extension) proposes μ\mu-universal methods that require ϕ\phi_{*} to be known. Specifically, the μ\mu-universal method of [23] repeatedly invokes a universal oracle that halves the primal gap ϕ(x)ϕ\phi(x)-\phi_{*} on each call.¹

¹ Paper [23] removes the assumption that ϕ\phi_{*} is known but forces its method to make multiple parallel calls to the universal oracle.

Our contribution. The goal of this work is to present two μ\mu-universal methods for problem (1), namely: a composite subgradient (U-CS) type method and a proximal bundle (U-PB) type method. The first method is a variant of the universal primal gradient method of [21] (see also Appendix C.2 of [17]), which is still not known to be μ\mu-universal. The second one is a variant of the generic proximal bundle (GPB) method of [17] that bounds the number of consecutive null iterations and adaptively chooses the prox stepsize under this policy. Both methods are analyzed in a unified manner using a general framework for strongly convex optimization problems (1) (referred to as FSCO) which specifies sufficient conditions for its PF instances to be μ\mu-universal. Both functional and stationary complexities are established for FSCO in terms of μ\mu, which are then used to obtain complexity bounds for both U-CS and U-PB. Interestingly, in contrast to previous μ\mu-universal methods, both U-CS and U-PB do not perform any restart scheme nor require ϕ\phi_{*} to be known.

Organization of the paper. Subsection 1.1 presents basic definitions and notation used throughout the paper. Section 2 formally describes FSCO and the assumptions on the problem of interest, and provides both functional and stationarity complexity analyses of FSCO. Section 3 presents U-CS and U-PB, as two instances of FSCO, for solving problem (1) and establishes their corresponding complexity bounds. Section 4 presents some concluding remarks and possible extensions. Finally, Appendix A provides technical results of FSCO and U-PB.

1.1 Basic definitions and notation

Let \mathbb{R} denote the set of real numbers. Let +\mathbb{R}_{+} (resp., ++\mathbb{R}_{++}) denote the set of non-negative real numbers (resp., the set of positive real numbers). Let n\mathbb{R}^{n} denote the standard nn-dimensional Euclidean space equipped with inner product and norm denoted by ,\left\langle\cdot,\cdot\right\rangle and \|\cdot\|, respectively. Let log()\log(\cdot) denote the natural logarithm.

For given Φ:n(,+]\Phi:\mathbb{R}^{n}\rightarrow(-\infty,+\infty], let domΦ:={xn:Φ(x)<}\mathrm{dom}\,\Phi:=\{x\in\mathbb{R}^{n}:\Phi(x)<\infty\} denote the effective domain of Φ\Phi; Φ\Phi is said to be proper if domΦ\mathrm{dom}\,\Phi\neq\emptyset. A proper function Φ:n(,+]\Phi:\mathbb{R}^{n}\rightarrow(-\infty,+\infty] is μ\mu-convex for some μ0\mu\geq 0 if

Φ(αx+(1α)y)αΦ(x)+(1α)Φ(y)α(1α)μ2xy2\Phi(\alpha x+(1-\alpha)y)\leq\alpha\Phi(x)+(1-\alpha)\Phi(y)-\frac{\alpha(1-\alpha)\mu}{2}\|x-y\|^{2}

for every x,ydomΦx,y\in\mathrm{dom}\,\Phi and α[0,1]\alpha\in[0,1]. Let Conv¯μ(n)\overline{\mbox{\rm Conv}}_{\mu}\,(\mathbb{R}^{n}) denote the set of all proper lower semicontinuous μ\mu-convex functions. We simply denote Conv¯μ(n)\overline{\mbox{\rm Conv}}_{\mu}\,(\mathbb{R}^{n}) by Conv¯(n)\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) when μ=0\mu=0. For ε0\varepsilon\geq 0, the ε\varepsilon-subdifferential of Φ\Phi at xdomΦx\in\mathrm{dom}\,\Phi is denoted by

εΦ(x):={sn:Φ(y)Φ(x)+s,yxε,yn}.\partial_{\varepsilon}\Phi(x):=\left\{s\in\mathbb{R}^{n}:\Phi(y)\geq\Phi(x)+\left\langle s,y-x\right\rangle-\varepsilon,\forall y\in\mathbb{R}^{n}\right\}.

We denote the subdifferential of Φ\Phi at xdomΦx\in\mathrm{dom}\,\Phi by Φ(x)\partial\Phi(x), which is the set 0Φ(x)\partial_{0}\Phi(x) by definition. For a given subgradient Φ(x)Φ(x)\Phi^{\prime}(x)\in\partial\Phi(x), we denote the linearization of convex function Φ\Phi at xx by Φ(,x)\ell_{\Phi}(\cdot,x), which is defined as

Φ(,x)=Φ(x)+Φ(x),x.\ell_{\Phi}(\cdot,x)=\Phi(x)+\langle\Phi^{\prime}(x),\cdot-x\rangle. (2)
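As a small illustration (our code and naming, not the paper's), the linearization (2) is straightforward to implement, and the defining property of the subdifferential guarantees that it minorizes \Phi:

import numpy as np

def make_linearization(Phi, Phi_prime, x):
    # ell_Phi(.; x) = Phi(x) + <Phi'(x), . - x>, cf. (2)
    fx, gx = Phi(x), Phi_prime(x)
    return lambda u: fx + gx @ (u - x)

Phi = lambda u: float(u @ u)        # Phi(u) = ||u||^2 (convex and smooth)
Phi_prime = lambda u: 2.0 * u       # its gradient

x = np.array([1.0, -2.0])
ell = make_linearization(Phi, Phi_prime, x)
for u in (np.array([0.5, 0.5]), np.array([-3.0, 1.0]), x):
    assert ell(u) <= Phi(u) + 1e-12  # a linearization minorizes Phi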

2 A framework for strongly convex optimization

This section presents a general framework, namely FSCO, for convex optimization problems and establishes both functional and stationary complexity bounds for any of its instances. These results will then be used in Subsections 3.1 and 3.2 to analyze the complexities of two specific algorithms, namely: U-CS and U-PB. This section is divided into two subsections. Subsection 2.1 focuses on the functional complexity analysis, while Subsection 2.2 provides the stationary complexity analysis.

FSCO is presented in the context of the convex optimization problem

ϕ:=min{ϕ(x):xn}\phi_{*}:=\min\left\{\phi(x):x\in\mathbb{R}^{n}\right\} (3)

for which the following conditions are assumed:

  • (A1)

    the set of optimal solutions XX_{*} of problem (3) is nonempty;

  • (A2)

    ϕConv¯μ(n)\phi\in\overline{\mbox{\rm Conv}}_{\mu}\,(\mathbb{R}^{n}) for some μ0\mu\geq 0.

Clearly, the HCO problem (1) with the assumptions described underneath it is a special case of (3) where ϕ=f+h\phi=f+h.

We now describe FSCO.

 

FSCO

 

  • 0.

    Let χ[0,1)\chi\in[0,1), ε>0\varepsilon>0, and x^0domϕ\hat{x}_{0}\in\mathrm{dom}\,\phi be given, and set k=1k=1;

  • 1.

Compute λk>0\lambda_{k}>0, a model Γ^kConv¯(n)\hat{\Gamma}_{k}\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) with Γ^kϕ\hat{\Gamma}_{k}\leq\phi, and y^kdomϕ\hat{y}_{k}\in\mathrm{dom}\,\phi satisfying

    ϕ(y^k)+χ2λky^kx^k12minun{Γ^k(u)+12λkux^k12}ε,\displaystyle\phi(\hat{y}_{k})+\frac{\chi}{2\lambda_{k}}\|\hat{y}_{k}-\hat{x}_{k-1}\|^{2}-\underset{u\in\mathbb{R}^{n}}{\min}\left\{\hat{\Gamma}_{k}(u)+\frac{1}{2{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2}\right\}\leq\varepsilon, (4)

    and set

    x^k:=argminun{Γ^k(u)+12λkux^k12};\hat{x}_{k}:=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\hat{\Gamma}_{k}(u)+\frac{1}{2{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2}\right\}; (5)
  • 2.

    Check whether a termination criterion holds and if so stop; else go to step 3;

  • 3.

    Set kk+1k\leftarrow k+1 and go to step 1.

 

FSCO does not specify how sequences {x^k}\{\hat{x}_{k}\} and {y^k}\{\hat{y}_{k}\} are generated, how models {Γ^k}\{\hat{\Gamma}_{k}\} are updated, and how stepsizes {λk}\{\lambda_{k}\} are computed. Rather, it provides sufficient conditions on these sequences to ensure that its instances are μ\mu-universal.
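Purely to fix ideas, the following schematic rendering of steps 0-3 (a sketch under our own naming, not the paper's implementation) shows the shape of any FSCO instance; the model oracle and the exact prox computation in (5) are left abstract and are supplied by U-CS and U-PB in Section 3.

def fsco(phi, model_oracle, prox_of_model, x0, chi, eps, stop, max_iter=1000):
    # model_oracle(x) must return (lambda_k, Gamma_k, y_k) satisfying (4);
    # prox_of_model(Gamma, x, lam) must return the exact minimizer in (5).
    x = x0
    for k in range(1, max_iter + 1):
        lam, Gamma, y = model_oracle(x)                        # step 1
        x_new = prox_of_model(Gamma, x, lam)                   # x_k as in (5)
        gap = (phi(y) + chi / (2 * lam) * float((y - x) @ (y - x))
               - Gamma(x_new) - float((x_new - x) @ (x_new - x)) / (2 * lam))
        assert gap <= eps + 1e-12                              # condition (4)
        if stop(k, x_new, y):                                  # step 2
            return x_new, y
        x = x_new                                              # step 3
    return x, y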

The complexity analysis of FSCO requires two additional assumptions, namely:

  • (F1)

there exists ν[0,μ]\nu\in[0,\mu] such that Γ^kConv¯ν(n)\hat{\Gamma}_{k}\in\overline{\mbox{\rm Conv}}_{\nu}\,(\mathbb{R}^{n}) for every iteration kk of FSCO;

  • (F2)

there exists λ¯>0\underline{\lambda}>0 such that λkλ¯\lambda_{k}\geq{\underline{\lambda}} for every iteration kk of FSCO.

2.1 Functional complexity analysis

This subsection studies the iteration complexity for the framework FSCO to obtain an iterate y^k\hat{y}_{k} such that ϕ(y^k)ϕε¯\phi(\hat{y}_{k})-\phi_{*}\leq\bar{\varepsilon}. Its main result is stated in Theorem 2.3.

The following lemma will be useful in the sequel.

Lemma 2.1

Consider sequences {x^k}\{\hat{x}_{k}\}, {y^k}\{\hat{y}_{k}\}, {Γ^k}\{\hat{\Gamma}_{k}\}, and {λk}\{\lambda_{k}\} generated by FSCO. Define for k1k\geq 1,

s~k\displaystyle\tilde{s}_{k} :=x^k1x^kλkνx^k,\displaystyle:=\frac{\hat{x}_{k-1}-\hat{x}_{k}}{{\lambda}_{k}}-\nu\hat{x}_{k}, (6)
η~k\displaystyle\tilde{\eta}_{k} :=ϕ~(y^k)Γ~k(x^k)s~k,y^kx^k,\displaystyle:=\tilde{\phi}(\hat{y}_{k})-\tilde{\Gamma}_{k}(\hat{x}_{k})-\langle\tilde{s}_{k},\hat{y}_{k}-\hat{x}_{k}\rangle, (7)

where

ϕ~:=ϕν22,Γ~k:=Γ^kν22.\tilde{\phi}:=\phi-\frac{\nu}{2}\|\cdot\|^{2},\quad\tilde{\Gamma}_{k}:=\hat{\Gamma}_{k}-\frac{\nu}{2}\|\cdot\|^{2}. (8)

Then for k1k\geq 1, we have:

  • a)

    s~kΓ~k(x^k)\tilde{s}_{k}\in\partial\tilde{\Gamma}_{k}(\hat{x}_{k}) and for every unu\in\mathbb{R}^{n},

    Γ~k(u)Γ~k(x^k)+s~k,ux^k;\tilde{\Gamma}_{k}(u)\geq\tilde{\Gamma}_{k}(\hat{x}_{k})+\langle\tilde{s}_{k},u-\hat{x}_{k}\rangle; (9)
  • b)

    s~kη~kϕ~(y^k)\tilde{s}_{k}\in\partial_{\tilde{\eta}_{k}}\tilde{\phi}(\hat{y}_{k}) and

    02λkη~k2λkε(1+νλk)y^kx^k2+(1χ)y^kx^k12;0\leq 2{\lambda}_{k}\tilde{\eta}_{k}\leq 2{\lambda}_{k}\varepsilon-(1+\nu{\lambda}_{k})\|\hat{y}_{k}-\hat{x}_{k}\|^{2}+(1-\chi)\|\hat{y}_{k}-\hat{x}_{k-1}\|^{2}; (10)
  • c)

    for any χ[0,1)\chi\in[0,1) and every udomϕu\in\mathrm{dom}\,\phi, we have

    ϕ~(u)ϕ~(y^k)+s~k,uy^k+χ(μν)2uy^k2η~k1χ.\tilde{\phi}(u)\geq\tilde{\phi}(\hat{y}_{k})+\langle\tilde{s}_{k},u-\hat{y}_{k}\rangle+\frac{\chi(\mu-\nu)}{2}\|u-\hat{y}_{k}\|^{2}-\frac{\tilde{\eta}_{k}}{1-\chi}. (11)

Proof: (a) The optimality condition of (5) yields 0Γ^k(x^k)+(x^kx^k1)/λk0\in\partial\hat{\Gamma}_{k}(\hat{x}_{k})+(\hat{x}_{k}-\hat{x}_{k-1})/\lambda_{k}. This inclusion and the fact that Γ^k(u)=Γ~k(u)+νu\partial\hat{\Gamma}_{k}(u)=\partial\tilde{\Gamma}_{k}(u)+\nu u for every unu\in\mathbb{R}^{n} imply that

0Γ~k(x^k)+νx^k+x^kx^k1λk=(6)Γ~k(x^k)s~k0\in\partial\tilde{\Gamma}_{k}(\hat{x}_{k})+\nu\hat{x}_{k}+\frac{\hat{x}_{k}-\hat{x}_{k-1}}{{\lambda}_{k}}\stackrel{{\scriptstyle\eqref{defsb}}}{{=}}\partial\tilde{\Gamma}_{k}(\hat{x}_{k})-\tilde{s}_{k}

where the identity is due to the definition of s~k\tilde{s}_{k} in (6). Hence, the inclusion in a) holds. Relation (9) immediately follows from the inclusion in a) and the definition of the subdifferential.

(b) It follows from the relation ϕΓ^k\phi\geq\hat{\Gamma}_{k} (see step 1 of FSCO) and the definitions of ϕ~\tilde{\phi} and Γ~k\tilde{\Gamma}_{k} in (8) that ϕ~Γ~k\tilde{\phi}\geq\tilde{\Gamma}_{k}. This inequality, (9), and the definition of η~k\tilde{\eta}_{k} in (7) imply that for every udomϕu\in\mathrm{dom}\,\phi,

ϕ~(u)Γ~k(u)(9)Γ~k(x^k)+s~k,ux^k=(7)ϕ~(y^k)+s~k,uy^kη~k,\tilde{\phi}(u)\geq\tilde{\Gamma}_{k}(u)\stackrel{{\scriptstyle\eqref{eq:subdifb}}}{{\geq}}\tilde{\Gamma}_{k}(\hat{x}_{k})+\langle\tilde{s}_{k},u-\hat{x}_{k}\rangle\stackrel{{\scriptstyle\eqref{defetab}}}{{=}}\tilde{\phi}(\hat{y}_{k})+\langle\tilde{s}_{k},u-\hat{y}_{k}\rangle-\tilde{\eta}_{k},

which yields the inclusion in b). Taking u=y^ku=\hat{y}_{k} in the above inequality gives η~k0\tilde{\eta}_{k}\geq 0, and hence the first inequality in (10) holds. Using the definitions of s~k\tilde{s}_{k} and η~k\tilde{\eta}_{k} in (6) and (7), respectively, the definitions of ϕ~\tilde{\phi} and Γ~k\tilde{\Gamma}_{k} in (8), and (4) and (5), we have

η~k=(7),(8)ϕ(y^k)Γ^k(x^k)+ν2(x^k2y^k2)s~k,y^kx^k(4),(5)[εχ2λky^kx^k12+12λkx^kx^k12]+ν2(x^k2y^k2)s~k,y^kx^k=(6)εχ2λky^kx^k12+12λkx^kx^k12+ν2(x^k2y^k2)+x^kx^k1λk+νx^k,y^kx^k=ε+1χ2λky^kx^k1212λky^kx^k2ν2y^kx^k2.\begin{array}[]{lcl}\tilde{\eta}_{k}&\stackrel{{\scriptstyle\eqref{defetab},\eqref{def:t-phi}}}{{=}}&\displaystyle\phi(\hat{y}_{k})-\hat{\Gamma}_{k}(\hat{x}_{k})+\frac{\nu}{2}\left(\|\hat{x}_{k}\|^{2}-\|\hat{y}_{k}\|^{2}\right)-\langle\tilde{s}_{k},\hat{y}_{k}-\hat{x}_{k}\rangle\\ &\stackrel{{\scriptstyle\eqref{propyb},\eqref{defxhatk}}}{{\leq}}&\displaystyle\left[\varepsilon-\frac{\chi}{2{\lambda}_{k}}\|\hat{y}_{k}-\hat{x}_{k-1}\|^{2}+\frac{1}{2{\lambda}_{k}}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}\right]+\frac{\nu}{2}\left(\|\hat{x}_{k}\|^{2}-\|\hat{y}_{k}\|^{2}\right)-\langle\tilde{s}_{k},\hat{y}_{k}-\hat{x}_{k}\rangle\\ &\stackrel{{\scriptstyle\eqref{defsb}}}{{=}}&\displaystyle\varepsilon-\frac{\chi}{2{\lambda}_{k}}\|\hat{y}_{k}-\hat{x}_{k-1}\|^{2}+\frac{1}{2{\lambda}_{k}}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}+\frac{\nu}{2}\left(\|\hat{x}_{k}\|^{2}-\|\hat{y}_{k}\|^{2}\right)+\left\langle\frac{\hat{x}_{k}-\hat{x}_{k-1}}{{\lambda}_{k}}+\nu\hat{x}_{k}\,,\hat{y}_{k}-\hat{x}_{k}\right\rangle\\ &=&\displaystyle\varepsilon+\frac{1-\chi}{2{\lambda}_{k}}\|\hat{y}_{k}-\hat{x}_{k-1}\|^{2}-\frac{1}{2{\lambda}_{k}}\|\hat{y}_{k}-\hat{x}_{k}\|^{2}-\frac{\nu}{2}\|\hat{y}_{k}-\hat{x}_{k}\|^{2}.\end{array}

Hence, the second inequality in (10) holds.

(c) Using the inclusion in b), the fact that ϕ~\tilde{\phi} is (μν)(\mu-\nu)-convex, and Lemma A.1 of [18], we have for any ζ(0,]\zeta\in(0,\infty] and every udomϕu\in\mathrm{dom}\,\phi,

ϕ~(u)ϕ~(y^k)+s~k,uy^k+μν2(1+ζ)uy^k2(1+ζ1)η~k.\tilde{\phi}(u)\geq\tilde{\phi}(\hat{y}_{k})+\langle\tilde{s}_{k},u-\hat{y}_{k}\rangle+\frac{\mu-\nu}{2(1+\zeta)}\|u-\hat{y}_{k}\|^{2}-(1+\zeta^{-1})\tilde{\eta}_{k}.

Now, (11) follows from the above inequality with ζ=(1χ)/χ\zeta=(1-\chi)/\chi where χ\chi is as in step 0 of FSCO.    

Before showing Theorem 2.3 on the complexity of FSCO, we also need the following Proposition 2.2.

Proposition 2.2

Consider sequences {x^k}\{\hat{x}_{k}\}, {y^k}\{\hat{y}_{k}\}, and {λk}\{\lambda_{k}\} generated by FSCO. Then, for every k1k\geq 1 and every unu\in\mathbb{R}^{n}, we have:

χ(1+νλ¯)x^ky^k2(1χ)x^k1x2+2λkε.\chi(1+\nu{\underline{\lambda}})\|\hat{x}_{k}-\hat{y}_{k}\|^{2}\leq(1-\chi)\|\hat{x}_{k-1}-x_{*}\|^{2}+2{\lambda}_{k}\varepsilon. (12)

and

2λk[ϕ(y^k)ϕ(u)]2λkε1χ+x^k1u2(1+σ)x^ku2,2\lambda_{k}\left[\phi(\hat{y}_{k})-\phi(u)\right]\leq\frac{2{\lambda}_{k}\varepsilon}{1-\chi}+\left\|\hat{x}_{k-1}-u\right\|^{2}-(1+\sigma)\|\hat{x}_{k}-u\|^{2}, (13)

where

σ=σ(λ¯):=λ¯[ν(1+μλ¯)+χ(μν)]1+μλ¯+χλ¯(νμ).\sigma=\sigma(\underline{{\lambda}}):=\frac{\underline{{\lambda}}[\nu(1+\mu\underline{{\lambda}})+\chi(\mu-\nu)]}{1+\mu{\underline{\lambda}}+\chi{\underline{\lambda}}(\nu-\mu)}. (14)

Proof: Let A(u):=x^k1u2(1+νλk)x^ku2A(u):=\|\hat{x}_{k-1}-u\|^{2}-(1+\nu{\lambda}_{k})\|\hat{x}_{k}-u\|^{2}. Since A(u)A(u) is a quadratic function in uu, its Taylor’s expansion gives

A(u)=A(y^k)+A(y^k),uy^k+122A(y^k)(uy^k),uy^k,A(u)=A(\hat{y}_{k})+\langle\nabla A(\hat{y}_{k}),u-\hat{y}_{k}\rangle+\frac{1}{2}\langle\nabla^{2}A(\hat{y}_{k})(u-\hat{y}_{k}),u-\hat{y}_{k}\rangle,

where 2A(y^k)=2νλkI\nabla^{2}A(\hat{y}_{k})=-2\nu{\lambda}_{k}I and

A(y^k)=2(y^kx^k1)2(1+νλk)(y^kx^k)=(6)2νλky^k2λks~k.\nabla A(\hat{y}_{k})=2(\hat{y}_{k}-\hat{x}_{k-1})-2(1+\nu{\lambda}_{k})(\hat{y}_{k}-\hat{x}_{k})\stackrel{{\scriptstyle\eqref{defsb}}}{{=}}-2\nu{\lambda}_{k}\hat{y}_{k}-2{\lambda}_{k}\tilde{s}_{k}.

Using the above formulas, we have for A(u)A(y^k)A(u)-A(\hat{y}_{k}) the expression

x^k1u2(1+νλk)x^ku2(x^k1y^k2(1+νλk)x^ky^k2)\displaystyle\|\hat{x}_{k-1}-u\|^{2}-(1+\nu{\lambda}_{k})\|\hat{x}_{k}-u\|^{2}-\left(\|\hat{x}_{k-1}-\hat{y}_{k}\|^{2}-(1+\nu{\lambda}_{k})\|\hat{x}_{k}-\hat{y}_{k}\|^{2}\right)
=2λkνy^k+s~k,uy^kνλkuy^k2=2λks~k,uy^kνλk(u2y^k2)\displaystyle=-2{\lambda}_{k}\langle\nu\hat{y}_{k}+\tilde{s}_{k},u-\hat{y}_{k}\rangle-\nu{\lambda}_{k}\|u-\hat{y}_{k}\|^{2}=-2{\lambda}_{k}\langle\tilde{s}_{k},u-\hat{y}_{k}\rangle-\nu{\lambda}_{k}(\|u\|^{2}-\|\hat{y}_{k}\|^{2})
(11)2λk[ϕ~(y^k)ϕ~(u)+χ(μν)2uy^k2η~k1χ]νλk(u2y^k2)\displaystyle\stackrel{{\scriptstyle\eqref{qphi-1}}}{{\geq}}2\lambda_{k}\left[\tilde{\phi}(\hat{y}_{k})-\tilde{\phi}(u)+\frac{\chi(\mu-\nu)}{2}\|u-\hat{y}_{k}\|^{2}-\frac{\tilde{\eta}_{k}}{1-\chi}\right]-\nu{\lambda}_{k}(\|u\|^{2}-\|\hat{y}_{k}\|^{2})
=(8)2λk[ϕ(y^k)ϕ(u)]+χ(μν)λkuy^k22λkη~k1χ,\displaystyle\stackrel{{\scriptstyle\eqref{def:t-phi}}}{{=}}2\lambda_{k}\left[\phi(\hat{y}_{k})-\phi(u)\right]+\chi(\mu-\nu){\lambda}_{k}\|u-\hat{y}_{k}\|^{2}-\frac{2{\lambda}_{k}\tilde{\eta}_{k}}{1-\chi}, (15)

where the inequality is due to (11) and the last identity is due to the definition of ϕ~\tilde{\phi} in (8). Rearranging the terms in (15) and using Lemma 2.1(b), we have

x^k1u2(1+νλk)x^ku22λk[ϕ(y^k)ϕ(u)]\displaystyle\left\|\hat{x}_{k-1}-u\right\|^{2}-(1+\nu{\lambda}_{k})\|\hat{x}_{k}-u\|^{2}-2\lambda_{k}\left[\phi(\hat{y}_{k})-\phi(u)\right]
(15)x^k1y^k2(1+νλk)x^ky^k2+χ(μν)λkuy^k22λkη~k1χ\displaystyle\stackrel{{\scriptstyle\eqref{ineq:long}}}{{\geq}}\|\hat{x}_{k-1}-\hat{y}_{k}\|^{2}-(1+\nu{\lambda}_{k})\|\hat{x}_{k}-\hat{y}_{k}\|^{2}+\chi(\mu-\nu){\lambda}_{k}\|u-\hat{y}_{k}\|^{2}-\frac{2{\lambda}_{k}\tilde{\eta}_{k}}{1-\chi}
(10)2λkε1χ+χ(1+νλ¯)1χx^ky^k2+χ(μν)λ¯uy^k2,\displaystyle\stackrel{{\scriptstyle\eqref{ineqetab}}}{{\geq}}-\frac{2{\lambda}_{k}\varepsilon}{1-\chi}+\frac{\chi(1+\nu{\underline{\lambda}})}{1-\chi}\|\hat{x}_{k}-\hat{y}_{k}\|^{2}+\chi(\mu-\nu){\underline{\lambda}}\|u-\hat{y}_{k}\|^{2},

where in the last inequality we have used Assumption (F2).

Rearranging the terms and using Assumption (F2), we have

2λk[ϕ(y^k)ϕ(u)]\displaystyle 2\lambda_{k}\left[\phi(\hat{y}_{k})-\phi(u)\right]\leq 2λkε1χ+x^k1u2(1+νλ¯)x^ku2\displaystyle\frac{2{\lambda}_{k}\varepsilon}{1-\chi}+\left\|\hat{x}_{k-1}-u\right\|^{2}-(1+\nu{\underline{\lambda}})\|\hat{x}_{k}-u\|^{2}
χ(1+νλ¯1χx^ky^k2+(μν)λ¯uy^k2).\displaystyle-\chi\left(\frac{1+\nu{\underline{\lambda}}}{1-\chi}\|\hat{x}_{k}-\hat{y}_{k}\|^{2}+(\mu-\nu){\underline{\lambda}}\|u-\hat{y}_{k}\|^{2}\right). (16)

It is clear that (12) follows from (16) with u=xu=x_{*} and observing that ϕ(y^k)ϕ(x)0\phi(\hat{y}_{k})-\phi(x_{*})\geq 0. Using the triangle inequality and the fact that (a1+a2)2(b11+b21)(a12b1+a22b2)(a_{1}+a_{2})^{2}\leq(b_{1}^{-1}+b_{2}^{-1})(a_{1}^{2}b_{1}+a_{2}^{2}b_{2}) with (a1,a2)=(x^ky^k,uy^k)(a_{1},a_{2})=(\|\hat{x}_{k}-\hat{y}_{k}\|,\|u-\hat{y}_{k}\|) and (b1,b2)=((1+νλ¯)/(1χ),(μν)λ¯)(b_{1},b_{2})=((1+\nu{\underline{\lambda}})/(1-\chi),(\mu-\nu){\underline{\lambda}}) we have

x^ku2(1χ1+νλ¯+1(μν)λ¯)(1+νλ¯1χx^ky^k2+(μν)λ¯uy^k2).\|\hat{x}_{k}-u\|^{2}\leq\left(\frac{1-\chi}{1+\nu{\underline{\lambda}}}+\frac{1}{(\mu-\nu){\underline{\lambda}}}\right)\left(\frac{1+\nu{\underline{\lambda}}}{1-\chi}\|\hat{x}_{k}-\hat{y}_{k}\|^{2}+(\mu-\nu){\underline{\lambda}}\|u-\hat{y}_{k}\|^{2}\right). (17)

Plugging the above inequality into (16), we have

2λk[ϕ(y^k)ϕ(u)]2λkε1χ+x^k1u2[1+νλ¯+χ(1χ1+νλ¯+1(μν)λ¯)1]x^ku2,2\lambda_{k}\left[\phi(\hat{y}_{k})-\phi(u)\right]\leq\frac{2{\lambda}_{k}\varepsilon}{1-\chi}+\left\|\hat{x}_{k-1}-u\right\|^{2}-\left[1+\nu{\underline{\lambda}}+\chi\left(\frac{1-\chi}{1+\nu{\underline{\lambda}}}+\frac{1}{(\mu-\nu){\underline{\lambda}}}\right)^{-1}\right]\|\hat{x}_{k}-u\|^{2},

which is the same as (13) after simplification.    

Before giving the first main complexity result for FSCO, namely, a complexity bound for obtaining an ε¯\bar{\varepsilon}-solution of (3), i.e., a point x¯\bar{x} satisfying ϕ(x¯)ϕε¯\phi(\bar{x})-\phi_{*}\leq\bar{\varepsilon}, we first introduce some terminology used throughout our analysis. Let xx_{*} denote the closest solution of (3) to the initial point x^0\hat{x}_{0} of FSCO and let d0d_{0} denote its distance to x^0\hat{x}_{0}, i.e.,

x^0x=min{xx^0:xX},d0=x^0x.\|\hat{x}_{0}-x_{*}\|=\min\{\|x-\hat{x}_{0}\|:x\in X_{*}\},\quad d_{0}=\|\hat{x}_{0}-x_{*}\|. (18)
Theorem 2.3

For a given tolerance ε¯>0\bar{\varepsilon}>0, consider FSCO with ε=(1χ)ε¯/2\varepsilon=(1-\chi)\bar{\varepsilon}/2, where χ[0,1)\chi\in[0,1) is as in step 0 of FSCO. Then, the number of iterations of FSCO to generate an iterate y^k\hat{y}_{k} satisfying ϕ(y^k)ϕε¯\phi(\hat{y}_{k})-\phi_{*}\leq\bar{\varepsilon} is at most

𝒞func(ε¯):=min{min[1χ(1+1λ¯μ),1+1λ¯ν]log(1+λ0μd02λ¯ε¯),d02λ¯ε¯}.{\cal C}_{\text{func}}(\bar{\varepsilon}):=\min\left\{\min\left[\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}\mu}\right),1+\frac{1}{\underline{{\lambda}}\nu}\right]\log\left(1+\frac{\lambda_{0}\mu d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right),\frac{d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right\}. (19)

Proof: It is easy to see that (13) with u=xu=x_{*}, where xx_{*} is given by (18), satisfies (93) with σ\sigma as in (14) and

γk=2λk,ηk=ϕ(y^k)ϕ,αk=x^kx2,δ=ε1χ.\gamma_{k}=2{\lambda}_{k},\quad\eta_{k}=\phi(\hat{y}_{k})-\phi_{*},\quad\alpha_{k}=\|\hat{x}_{k}-x_{*}\|^{2},\quad\delta=\frac{\varepsilon}{1-\chi}. (20)

Also, note that

γ¯=2λ¯,2δ=2ε1χ=ε¯.{\underline{\gamma}}=2{\underline{{\lambda}}},\quad 2\delta=\frac{2\varepsilon}{1-\chi}=\bar{\varepsilon}.

It follows from Lemma A.1(c) with the above parameters and the definition of σ\sigma in (14) that the complexity to find an ε¯\bar{\varepsilon}-solution is

min{1+σσlog(1+σd02λ¯ε¯),d02λ¯ε¯}=(14)min{(1+νλ¯)(1+μλ¯)λ¯[ν(1+μλ¯)+χ(μν)]log(1+σd02λ¯ε¯),d02λ¯ε¯}.\min\left\{\frac{1+\sigma}{\sigma}\log\left(1+\frac{\sigma d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right),\frac{d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right\}\stackrel{{\scriptstyle\eqref{def:sigma}}}{{=}}\min\left\{\frac{(1+\nu\underline{{\lambda}})(1+\mu\underline{{\lambda}})}{\underline{{\lambda}}[\nu(1+\mu\underline{{\lambda}})+\chi(\mu-\nu)]}\log\left(1+\frac{\sigma d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right),\frac{d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right\}. (21)

Since χ[0,1)\chi\in[0,1), it is easy to verify that

1+νλ¯ν(1+μλ¯)+χ(μν)1χμ,\frac{1+\nu\underline{{\lambda}}}{\nu(1+\mu\underline{{\lambda}})+\chi(\mu-\nu)}\leq\frac{1}{\chi\mu},

and hence that

(1+νλ¯)(1+μλ¯)λ¯[ν(1+μλ¯)+χ(μν)]1+μλ¯χλ¯μ=1χ(1+1λ¯μ).\frac{(1+\nu\underline{{\lambda}})(1+\mu\underline{{\lambda}})}{\underline{{\lambda}}[\nu(1+\mu\underline{{\lambda}})+\chi(\mu-\nu)]}\leq\frac{1+\mu\underline{{\lambda}}}{\chi\underline{{\lambda}}\mu}=\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}\mu}\right).

Moreover, noting that χ[0,1)\chi\in[0,1) and μν\mu\geq\nu, we also have

(1+νλ¯)(1+μλ¯)λ¯[ν(1+μλ¯)+χ(μν)]1+1λ¯ν.\frac{(1+\nu\underline{{\lambda}})(1+\mu\underline{{\lambda}})}{\underline{{\lambda}}[\nu(1+\mu\underline{{\lambda}})+\chi(\mu-\nu)]}\leq 1+\frac{1}{\underline{{\lambda}}\nu}.

Combining the above two inequalities, we obtain

(1+νλ¯)(1+μλ¯)λ¯[ν(1+μλ¯)+χ(μν)]min{1χ(1+1λ¯μ),1+1λ¯ν}.\frac{(1+\nu\underline{{\lambda}})(1+\mu\underline{{\lambda}})}{\underline{{\lambda}}[\nu(1+\mu\underline{{\lambda}})+\chi(\mu-\nu)]}\leq\min\left\{\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}\mu}\right),1+\frac{1}{\underline{{\lambda}}\nu}\right\}.

The result now immediately follows from (21), the above inequality, and the fact that

\sigma\leq\frac{\underline{{\lambda}}[\nu(1+\mu\underline{{\lambda}})+\chi(\mu-\nu)]}{1+\nu\underline{{\lambda}}}\leq\frac{\underline{{\lambda}}[\nu(1+\mu\underline{{\lambda}})+(\mu-\nu)]}{1+\nu\underline{{\lambda}}}=\frac{\mu\underline{{\lambda}}(1+\nu\underline{{\lambda}})}{1+\nu\underline{{\lambda}}}=\underline{{\lambda}}\mu\leq{\lambda}_{0}\mu,

where the first inequality is due to 1+\mu\underline{{\lambda}}+\chi\underline{{\lambda}}(\nu-\mu)\geq 1+\nu\underline{{\lambda}} (which holds since \chi<1 and \nu\leq\mu), the second one to \chi\leq 1, and the last one to \underline{{\lambda}}\leq\lambda_{0}.

 

We now comment on the complexity bound obtained in Theorem 2.3. First, the bound

min{1χ(1+1λ¯μ),1+1λ¯ν}log(1+λ0μd02λ¯ε¯)\min\left\{\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}\mu}\right),1+\frac{1}{\underline{{\lambda}}\nu}\right\}\log\left(1+\frac{\lambda_{0}\mu d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right) (22)

implied by (19) is meaningful only when χ>0\chi>0 or ν>0\nu>0 (otherwise, it should be understood as being infinity). Second, the validity of the second bound in (22) is well-known and can be found for example in [7, 15]. Third, if μν\mu\gg\nu (see assumption (F1)) and χ\chi is sufficiently close to one, the smallest term in (22) is the first one, in which case (19) reduces to

1χ(1+1λ¯μ)log(1+λ0μd02λ¯ε¯)=𝒪~(1λ¯μ).\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}\mu}\right)\log\left(1+\frac{\lambda_{0}\mu d_{0}^{2}}{{\underline{\lambda}}{\bar{\varepsilon}}}\right)=\tilde{\cal O}\left(\frac{1}{\underline{{\lambda}}\mu}\right).

Fourth, since the second bound does not depend on ν\nu, μ\mu and χ\chi, it holds for any parameters μν0\mu\geq\nu\geq 0 and χ[0,1)\chi\in[0,1).
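To see which term in (19) is active in a given regime, one can simply evaluate all three; the snippet below (illustrative parameter values, ours) exhibits the case \mu\gg\nu with \chi close to one, in which the first, \mu-dependent logarithmic term is the smallest.

import numpy as np

def c_func_terms(chi, mu, nu, lam_lo, lam0, d0, eps_bar):
    # the three candidate bounds appearing inside C_func in (19)
    log_term = np.log(1 + lam0 * mu * d0**2 / (lam_lo * eps_bar))
    b_mu = (1 + 1 / (lam_lo * mu)) / chi * log_term if chi > 0 else np.inf
    b_nu = (1 + 1 / (lam_lo * nu)) * log_term if nu > 0 else np.inf
    b_sub = d0**2 / (lam_lo * eps_bar)
    return b_mu, b_nu, b_sub

# mu >> nu and chi close to one: prints roughly (1.7e2, 1.4e6, 1.0e6)
print(c_func_terms(chi=0.9, mu=1.0, nu=1e-4, lam_lo=0.1, lam0=1.0, d0=10.0, eps_bar=1e-3))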

The drawback of using the termination criterion ϕ(x¯)ϕε¯\phi(\bar{x})-\phi_{*}\leq\bar{\varepsilon} is that it requires knowledge of ϕ\phi_{*}. The next subsection presents the complexity of FSCO to obtain a point satisfying a stopping criterion that is easily checkable for any instance of (3).

2.2 Stationarity complexity analysis

This subsection studies the iteration complexity for FSCO to obtain a near-stationary solution of (3) (see the definition below). Its main result is stated in Theorem 2.8.

We start by defining the notion of near-stationary solutions considered in this subsection.

Definition 2.4

A triple (x,v,η)(x,v,\eta) is called ϕ\phi-compatible if it satisfies the inclusion vηϕ(x)v\in\partial_{\eta}\phi(x). For a given tolerance pair (ρ^,ε^)(\hat{\rho},\hat{\varepsilon}), a ϕ\phi-compatible triple (x,v,η)(x,v,\eta) is called a (ρ^,ε^)(\hat{\rho},\hat{\varepsilon})-stationary solution of (3) if it satisfies vρ^\|v\|\leq\hat{\rho} and ηε^\eta\leq\hat{\varepsilon}.

We now comment on the benefits of using near-stationary solutions as a way to terminate an algorithm. First, many algorithms, including the ones considered in this paper, naturally generate a sequence of ϕ\phi-compatible triples {(y¯k,s¯k,ε¯k)}\{(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k})\} where the sequence of residual pairs {(s¯k,ε¯k)}\{(\bar{s}_{k},\bar{\varepsilon}_{k})\} can be made arbitrarily small (see Proposition 2.7 below). As a consequence, some (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) will eventually become a (ρ^,ε^)(\hat{\rho},\hat{\varepsilon})-stationary solution of (3). Moreover, verifying this only requires checking whether the two inequalities s¯kρ^\|\bar{s}_{k}\|\leq\hat{\rho} and ε¯kε^\bar{\varepsilon}_{k}\leq\hat{\varepsilon} hold as the inclusion s¯kε¯kϕ(y¯k)\bar{s}_{k}\in\partial_{\bar{\varepsilon}_{k}}\phi(\bar{y}_{k}) is guaranteed to hold for every k1k\geq 1. Second, this notion is related to the one considered in Subsection 2.1 as follows. If (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) is a (ρ^,ε^)(\hat{\rho},\hat{\varepsilon})-stationary solution and domϕ\mathrm{dom}\,\phi has a finite diameter DD, then it follows that y¯k\bar{y}_{k} is a (ε^+Dρ^)(\hat{\varepsilon}+D\hat{\rho})-solution of (3).

The following lemma will be useful in deriving the complexity bounds for stationarity:

Lemma 2.5

For every k1k\geq 1, define

y¯k=argmin{ϕ(y):y{y^1,,y^k}}\bar{y}_{k}=\mathrm{argmin}\,\{\phi(y):y\in\{\hat{y}_{1},\dots,\hat{y}_{k}\}\} (23)

and

Sk=j=1k(1+σ)j1λjS_{k}=\sum_{j=1}^{k}(1+\sigma)^{j-1}\lambda_{j} (24)

where σ\sigma is as in (14). Then, for every udomϕu\in\mathrm{dom}\,\phi, we have:

ϕ(y¯k)ϕ(u)\displaystyle\phi(\bar{y}_{k})-\phi(u) x^0u2(1+σ)kx^ku22Sk+ε1χ,\displaystyle\leq\frac{\|\hat{x}_{0}-u\|^{2}-(1+\sigma)^{k}\|\hat{x}_{k}-u\|^{2}}{2S_{k}}+\frac{\varepsilon}{1-\chi}, (25)
x^kx2\displaystyle\|\hat{x}_{k}-x_{*}\|^{2} d02(1+σ)k+2εSk(1χ)(1+σ)k,\displaystyle\leq\frac{d_{0}^{2}}{(1+\sigma)^{k}}+\frac{2\varepsilon S_{k}}{(1-\chi)(1+\sigma)^{k}}, (26)

where xx_{*} and d0d_{0} are as in (18).

Proof: First note that (13) is a special case of (93) with

γk=2λk,ηk=ϕ(y^k)ϕ(u),αk=x^ku2,δ=ε1χ.\gamma_{k}=2{\lambda}_{k},\quad\eta_{k}=\phi(\hat{y}_{k})-\phi(u),\quad\alpha_{k}=\|\hat{x}_{k}-u\|^{2},\quad\delta=\frac{\varepsilon}{1-\chi}.

Then, (25) follows from Lemma A.1(a). It is also easy to verify that (13) with u=xu=x_{*} satisfies (93) with parameters as in (20). Then, (26) follows from Lemma A.1(b).    

Lemma 2.6

For every k1k\geq 1, we have

x^0x^k2\displaystyle\|\hat{x}_{0}-\hat{x}_{k}\|^{2} 4(d02+Skε1χ),\displaystyle\leq 4\left(d_{0}^{2}+\frac{S_{k}\varepsilon}{1-\chi}\right), (27)
x^0y¯k2\displaystyle\|\hat{x}_{0}-\bar{y}_{k}\|^{2} 1χ(5d02+8Skε1χ).\displaystyle\leq\frac{1}{\chi}\left(5d_{0}^{2}+\frac{8S_{k}\varepsilon}{1-\chi}\right). (28)

Proof: Using (26) and the fact that σ0\sigma\geq 0, we have for every i{1,2,,k}i\in\{1,2,\ldots,k\},

x^ix2\displaystyle\|\hat{x}_{i}-x_{*}\|^{2} (26)d02(1+σ)i+2εSi(1χ)(1+σ)id02+2εSi1χd02+2εSk1χ.\displaystyle\stackrel{{\scriptstyle\eqref{ineq:result-2}}}{{\leq}}\frac{d_{0}^{2}}{(1+\sigma)^{i}}+\frac{2\varepsilon S_{i}}{(1-\chi)(1+\sigma)^{i}}\leq d_{0}^{2}+\frac{2\varepsilon S_{i}}{1-\chi}\leq d_{0}^{2}+\frac{2\varepsilon S_{k}}{1-\chi}. (29)

It follows from the triangle inequality that

x^0x^k2(x^0x+x^kx)22[d02+x^kx2],\|\hat{x}_{0}-\hat{x}_{k}\|^{2}\leq(\|\hat{x}_{0}-x_{*}\|+\|\hat{x}_{k}-x_{*}\|)^{2}\leq 2\left[d_{0}^{2}+\|\hat{x}_{k}-x_{*}\|^{2}\right],

which together with (29) implies (27). Using the triangle inequality and the fact that (a1+a2)2(b11+b21)(a12b1+a22b2)(a_{1}+a_{2})^{2}\leq(b_{1}^{-1}+b_{2}^{-1})(a_{1}^{2}b_{1}+a_{2}^{2}b_{2}) with (a1,a2)=(x^0x^k,x^ky^k)(a_{1},a_{2})=\left(\|\hat{x}_{0}-\hat{x}_{k}\|,\|\hat{x}_{k}-\hat{y}_{k}\|\right) and (b1,b2)=(1,χ(1+νλ¯)/(1χ))(b_{1},b_{2})=\left(1,\chi(1+\nu{\underline{\lambda}})/(1-\chi)\right), we have

x^0y^k2\displaystyle\|\hat{x}_{0}-\hat{y}_{k}\|^{2} (x^0x^k+x^ky^k)2\displaystyle\leq(\|\hat{x}_{0}-\hat{x}_{k}\|+\|\hat{x}_{k}-\hat{y}_{k}\|)^{2}
(1+1χχ(1+νλ¯))[x^0x^k2+χ(1+νλ¯)1χx^ky^k2]\displaystyle\leq\left(1+\frac{1-\chi}{\chi(1+\nu{\underline{\lambda}})}\right)\left[\|\hat{x}_{0}-\hat{x}_{k}\|^{2}+\frac{\chi(1+\nu{\underline{\lambda}})}{1-\chi}\|\hat{x}_{k}-\hat{y}_{k}\|^{2}\right]
(12)1χ(x^0x^k2+x^k1x2+2λkε1χ),\displaystyle\stackrel{{\scriptstyle\eqref{ineq:bound}}}{{\leq}}\frac{1}{\chi}\left(\|\hat{x}_{0}-\hat{x}_{k}\|^{2}+\|\hat{x}_{k-1}-x_{*}\|^{2}+\frac{2{\lambda}_{k}\varepsilon}{1-\chi}\right), (30)

where the last inequality is due to (12). Noting that y¯k=y^i\bar{y}_{k}=\hat{y}_{i} for some i{1,2,,k}i\in\{1,2,\ldots,k\}, and using (27), (29), and (30), we have

x^0y¯k2\displaystyle\|\hat{x}_{0}-\bar{y}_{k}\|^{2} =x^0y^i2(30)1χ[x^0x^i2+x^i1x2+2λiε1χ]\displaystyle=\|\hat{x}_{0}-\hat{y}_{i}\|^{2}\stackrel{{\scriptstyle\eqref{ineq:x0-yk}}}{{\leq}}\frac{1}{\chi}\left[\|\hat{x}_{0}-\hat{x}_{i}\|^{2}+\|\hat{x}_{i-1}-x_{*}\|^{2}+\frac{2{\lambda}_{i}\varepsilon}{1-\chi}\right]
(27),(29)1χ(4d02+4εSk1χ+d02+2εSk1χ+2λiε1χ)1χ(5d02+8εSk1χ),\displaystyle\stackrel{{\scriptstyle\eqref{ineq:xk},\eqref{ineq:nl}}}{{\leq}}\frac{1}{\chi}\left(4d_{0}^{2}+\frac{4\varepsilon S_{k}}{1-\chi}+d_{0}^{2}+\frac{2\varepsilon S_{k}}{1-\chi}+\frac{2{\lambda}_{i}\varepsilon}{1-\chi}\right)\leq\frac{1}{\chi}\left(5d_{0}^{2}+\frac{8\varepsilon S_{k}}{1-\chi}\right),

where the last inequality is due to the fact that Sk(1+σ)i1λiλiS_{k}\geq(1+\sigma)^{i-1}\lambda_{i}\geq\lambda_{i}. Hence, (28) is proved.    

The following result shows that FSCO naturally generates a sequence of residual pairs {(s¯k,ε¯k)}\{(\bar{s}_{k},\bar{\varepsilon}_{k})\} such that (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) is ϕ\phi-compatible for every k1k\geq 1 and provides suitable bounds for it.

Proposition 2.7

For every k1k\geq 1, define

s¯k=x^0x^kSk,ε¯k=x^0y¯k2x^ky¯k22Sk+ε1χ\bar{s}_{k}=\frac{\hat{x}_{0}-\hat{x}_{k}}{S_{k}},\quad\bar{\varepsilon}_{k}=\frac{\|\hat{x}_{0}-\bar{y}_{k}\|^{2}-\|\hat{x}_{k}-\bar{y}_{k}\|^{2}}{2S_{k}}+\frac{\varepsilon}{1-\chi} (31)

where y¯k\bar{y}_{k} is as in (23). Then, the following statements hold for every k1k\geq 1:

  • a)

\bar{s}_{k}\in\partial_{\bar{\varepsilon}_{k}}\phi(\bar{y}_{k});

  • b)

    the residual pair (s¯k,ε¯k)(\bar{s}_{k},\bar{\varepsilon}_{k}) is bounded by

    s¯k2d0Sk+2ε(1χ)Sk,ε¯kx^0y¯k22Sk+ε1χ,\|\bar{s}_{k}\|\leq\frac{2d_{0}}{S_{k}}+\frac{\sqrt{2\varepsilon}}{\sqrt{(1-\chi)S_{k}}},\quad\bar{\varepsilon}_{k}\leq\frac{\|\hat{x}_{0}-\bar{y}_{k}\|^{2}}{2S_{k}}+\frac{\varepsilon}{1-\chi}, (32)

    where SkS_{k} is as in (24).

Proof: (a) Using (25) and the fact that σ0\sigma\geq 0, we have for every unu\in\mathbb{R}^{n},

ϕ(y¯k)ϕ(u)\displaystyle\phi(\bar{y}_{k})-\phi(u) x^0u2x^ku22Sk+ε1χ\displaystyle\leq\frac{\|\hat{x}_{0}-u\|^{2}-\|\hat{x}_{k}-u\|^{2}}{2S_{k}}+\frac{\varepsilon}{1-\chi}
=x^0y¯k2x^ky¯k2+2x^0x^k,y¯ku2Sk+ε1χ\displaystyle=\frac{\|\hat{x}_{0}-\bar{y}_{k}\|^{2}-\|\hat{x}_{k}-\bar{y}_{k}\|^{2}+2\langle\hat{x}_{0}-\hat{x}_{k},\bar{y}_{k}-u\rangle}{2S_{k}}+\frac{\varepsilon}{1-\chi}
=(31)s¯k,y¯ku+ε¯k,\displaystyle\stackrel{{\scriptstyle\eqref{def:sk-ek-1}}}{{=}}\langle\bar{s}_{k},\bar{y}_{k}-u\rangle+\bar{\varepsilon}_{k},

where the last identity is due to the definitions of s¯k\bar{s}_{k} and ε¯k\bar{\varepsilon}_{k} in (31). Hence, the statement holds.

(b) Using the triangle inequality, (26), and the fact that a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, we have

s¯kd0+x^kxSk(26)d0Sk+1Sk(d0(1+σ)k+2εSk(1χ)(1+σ)k).\|\bar{s}_{k}\|\leq\frac{d_{0}+\|\hat{x}_{k}-x_{*}\|}{S_{k}}\stackrel{{\scriptstyle\eqref{ineq:result-2}}}{{\leq}}\frac{d_{0}}{S_{k}}+\frac{1}{S_{k}}\left(\frac{d_{0}}{\sqrt{(1+\sigma)^{k}}}+\frac{\sqrt{2\varepsilon S_{k}}}{\sqrt{(1-\chi)(1+\sigma)^{k}}}\right).

Hence, the first inequality in (32) follows from the fact that σ0\sigma\geq 0. Moreover, the second inequality in (32) follows immediately from the definition of ε¯k\bar{\varepsilon}_{k} in (31).    
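Note that the quantities in (31) involve only iterates already stored by the method, so the stationarity check of Definition 2.4 can be performed online; a sketch (our naming) follows.

import numpy as np

def residual_pair(x0, xk, y_bar, S_k, eps, chi):
    # s_bar_k and eps_bar_k as in (31)
    s_bar = (x0 - xk) / S_k
    eps_bar = (float((x0 - y_bar) @ (x0 - y_bar))
               - float((xk - y_bar) @ (xk - y_bar))) / (2 * S_k) + eps / (1 - chi)
    return s_bar, eps_bar

def is_near_stationary(s_bar, eps_bar, rho_hat, eps_hat):
    # the inclusion of Proposition 2.7(a) holds automatically, so only the
    # two inequalities of Definition 2.4 need to be tested
    return np.linalg.norm(s_bar) <= rho_hat and eps_bar <= eps_hat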

The following Theorem 2.8 provides the stationarity complexity announced at the beginning of this subsection.

Theorem 2.8

For a given tolerance pair (ε^,ρ^)++2(\hat{\varepsilon},\hat{\rho})\in\mathbb{R}^{2}_{++}, FSCO with

χ(0,1),ε=χ(1χ)ε^10,\chi\in(0,1),\quad\varepsilon=\frac{\chi(1-\chi)\hat{\varepsilon}}{10}, (33)

generates a triple (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) satisfying

\bar{s}_{k}\in\partial_{\bar{\varepsilon}_{k}}\phi(\bar{y}_{k}),\quad\|\bar{s}_{k}\|\leq\hat{\rho},\quad\bar{\varepsilon}_{k}\leq\hat{\varepsilon} (34)

in at most

min{min[1χ(1+1λ¯μ),1+1λ¯ν]log[1+λ0μβ(ε^,ρ^)],β(ε^,ρ^)}\min\left\{\min\left[\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}\mu}\right),1+\frac{1}{\underline{{\lambda}}\nu}\right]\log\left[1+\lambda_{0}\mu\beta(\hat{\varepsilon},\hat{\rho})\right]\,,\,\beta(\hat{\varepsilon},\hat{\rho})\right\} (35)

iterations where

β(ε^,ρ^)=1λ¯(4χε^5ρ^2+5d02χε^).\beta(\hat{\varepsilon},\hat{\rho})=\frac{1}{\underline{\lambda}}\left(\frac{4\chi\hat{\varepsilon}}{5\hat{\rho}^{2}}+\frac{5d_{0}^{2}}{\chi\hat{\varepsilon}}\right). (36)

Proof: First, it follows from Proposition 2.7(a) that the inclusion \bar{s}_{k}\in\partial_{\bar{\varepsilon}_{k}}\phi(\bar{y}_{k}) holds for every k1k\geq 1. We next show that the number of iterations required for (34) is at most

k0:=min{1+σσlog[1+σβ(ε^,ρ^)],β(ε^,ρ^)}.k_{0}:=\min\left\{\frac{1+\sigma}{\sigma}\log\left[1+\sigma\beta(\hat{\varepsilon},\hat{\rho})\right],\beta(\hat{\varepsilon},\hat{\rho})\right\}. (37)

Hence, it suffices to show that s¯kρ^\|\bar{s}_{k}\|\leq\hat{\rho} and ε¯kε^\bar{\varepsilon}_{k}\leq\hat{\varepsilon} for every kk0k\geq k_{0}. Using the definition of SkS_{k} in (24), assumption (F2), and (98), we have

Skλ¯j=1k(1+σ)j1(98)λ¯max{eσk/(1+σ)1σ,k}.S_{k}\geq{\underline{\lambda}}\sum_{j=1}^{k}(1+\sigma)^{j-1}\stackrel{{\scriptstyle\eqref{ineq:theta}}}{{\geq}}{\underline{\lambda}}\max\left\{\frac{e^{\sigma k/(1+\sigma)}-1}{\sigma},k\right\}. (38)

It is easy to verify that for every kk0k\geq k_{0}, we have

Skλ¯β(ε^,ρ^).S_{k}\geq{\underline{\lambda}}\beta(\hat{\varepsilon},\hat{\rho}). (39)

It immediately follows from the inequality a2+b22aba^{2}+b^{2}\geq 2ab and the definition of β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) in (36) that

β(ε^,ρ^)4d0λ¯ρ^.\beta(\hat{\varepsilon},\hat{\rho})\geq\frac{4d_{0}}{{\underline{\lambda}}\hat{\rho}}. (40)

Using Lemma 2.6, Proposition 2.7(b), (33), the definition of β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) in (36), and the above observation, we have

s¯k\displaystyle\|\bar{s}_{k}\| (32),(33)2d0Sk+χε^5Sk(36),(40)β(ε^,ρ^)λ¯ρ^2Sk+β(ε^,ρ^)λ¯ρ^2Sk,\displaystyle\stackrel{{\scriptstyle\eqref{ineq:sk},\eqref{eq:eps}}}{{\leq}}\frac{2d_{0}}{S_{k}}+\frac{\sqrt{\chi\hat{\varepsilon}}}{\sqrt{5S_{k}}}\stackrel{{\scriptstyle\eqref{def:beta},\eqref{ineq:observation}}}{{\leq}}\frac{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}\hat{\rho}}{2S_{k}}+\frac{\sqrt{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}}\hat{\rho}}{2\sqrt{S_{k}}}, (41)
ε¯k\displaystyle\bar{\varepsilon}_{k} (28),(32)5d022χSk+4εχ(1χ)+ε1χ(33)5d022χSk+ε^2(36)β(ε^,ρ^)λ¯ε^2Sk+ε^2.\displaystyle\stackrel{{\scriptstyle\eqref{xoytk},\eqref{ineq:sk}}}{{\leq}}\frac{5d_{0}^{2}}{2\chi S_{k}}+\frac{4\varepsilon}{\chi(1-\chi)}+\frac{\varepsilon}{1-\chi}\stackrel{{\scriptstyle\eqref{eq:eps}}}{{\leq}}\frac{5d_{0}^{2}}{2\chi S_{k}}+\frac{\hat{\varepsilon}}{2}\stackrel{{\scriptstyle\eqref{def:beta}}}{{\leq}}\frac{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}\hat{\varepsilon}}{2S_{k}}+\frac{\hat{\varepsilon}}{2}. (42)

Finally, plugging (39) into (41) and (42), we conclude that s¯kρ^\|\bar{s}_{k}\|\leq\hat{\rho} and ε¯kε^\bar{\varepsilon}_{k}\leq\hat{\varepsilon} for every kk0k\geq k_{0}, where k0k_{0} is as in (37). It then follows from the same argument as in the proof of Theorem 2.3 that (35) is an upper bound on k0k_{0}, and hence the theorem holds.    

We now make some remarks about Theorem 2.8. In contrast to Theorem 2.3, it does not apply to the case where χ=0\chi=0 because the quantity β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) that appears in (35) depends on χ1\chi^{-1} (see (36)). The following result considers two special cases which cover the case χ=0\chi=0.

Theorem 2.9

For a given tolerance pair (ε^,ρ^)++2(\hat{\varepsilon},\hat{\rho})\in\mathbb{R}^{2}_{++}, the following statements hold for FSCO with χ[0,1)\chi\in[0,1):

  • a)

if domϕ\mathrm{dom}\,\phi is bounded with diameter D>0D>0, then FSCO with ε=(1χ)ε^/2\varepsilon=(1-\chi)\hat{\varepsilon}/2 generates a triple (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) satisfying (34) within a number of iterations bounded by (35), but with β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) now given by

    β(ε^,ρ^)=2λ¯(2ε^ρ^2+d02+D2ε^);\beta(\hat{\varepsilon},\hat{\rho})=\frac{2}{\underline{\lambda}}\left(\frac{2\hat{\varepsilon}}{\hat{\rho}^{2}}+\frac{d_{0}^{2}+D^{2}}{\hat{\varepsilon}}\right); (43)
  • b)

the special case of FSCO, where ε=(1χ)ε^/6\varepsilon=(1-\chi)\hat{\varepsilon}/6 and y^k=x^k\hat{y}_{k}=\hat{x}_{k} for every k1k\geq 1, generates a triple (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) satisfying (34) within a number of iterations bounded by (35), but with β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) now given by

    β(ε^,ρ^)=4λ¯(ε^3ρ^2+d02ε^).\beta(\hat{\varepsilon},\hat{\rho})=\frac{4}{{\underline{\lambda}}}\left(\frac{\hat{\varepsilon}}{3\hat{\rho}^{2}}+\frac{d_{0}^{2}}{\hat{\varepsilon}}\right). (44)

Proof: a) Since y^k\hat{y}_{k} and xx_{*} are in domϕ\mathrm{dom}\,\phi, it follows from the boundedness assumption that y^kxD\|\hat{y}_{k}-x_{*}\|\leq D. Using this inequality and the triangle inequality, we have

x^0y¯k2(x^0x+xy¯k)22(d02+D2).\|\hat{x}_{0}-\bar{y}_{k}\|^{2}\leq(\|\hat{x}_{0}-x_{*}\|+\|x_{*}-\bar{y}_{k}\|)^{2}\leq 2(d_{0}^{2}+D^{2}).

Thus, it follows from the second inequality in (32) and the fact that ε=(1χ)ε^/2\varepsilon=(1-\chi)\hat{\varepsilon}/2 that

ε¯k(32)d02+D2Sk+ε^2(43)β(ε^,ρ^)λ¯ε^2Sk+ε^2,\bar{\varepsilon}_{k}\stackrel{{\scriptstyle\eqref{ineq:sk}}}{{\leq}}\frac{d_{0}^{2}+D^{2}}{S_{k}}+\frac{\hat{\varepsilon}}{2}\stackrel{{\scriptstyle\eqref{def:beta-1}}}{{\leq}}\frac{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}\hat{\varepsilon}}{2S_{k}}+\frac{\hat{\varepsilon}}{2},

where the last inequality is due to the definition of β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) in (43). Similarly, using the first inequality in (32), the fact that ε=(1χ)ε^/2\varepsilon=(1-\chi)\hat{\varepsilon}/2, and (43), we have

s¯k(32)2d0Sk+ε^Sk(43)β(ε^,ρ^)λ¯ρ^2Sk+β(ε^,ρ^)λ¯ρ^2Sk,\|\bar{s}_{k}\|\stackrel{{\scriptstyle\eqref{ineq:sk}}}{{\leq}}\frac{2d_{0}}{S_{k}}+\frac{\sqrt{\hat{\varepsilon}}}{\sqrt{S_{k}}}\stackrel{{\scriptstyle\eqref{def:beta-1}}}{{\leq}}\frac{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}\hat{\rho}}{2S_{k}}+\frac{\sqrt{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}}\hat{\rho}}{2\sqrt{S_{k}}},

where the second inequality is also due to the observation that β(ε^,ρ^)4d0/(λ¯ρ^)\beta(\hat{\varepsilon},\hat{\rho})\geq 4d_{0}/(\underline{\lambda}\hat{\rho}) in view of (43). Finally, the rest of the proof follows from the same argument as in the proof of Theorem 2.8.

b) Since y^k=x^k\hat{y}_{k}=\hat{x}_{k} for every k1k\geq 1, we know y¯k=y^i=x^i\bar{y}_{k}=\hat{y}_{i}=\hat{x}_{i} for some i{1,2,,k}i\in\{1,2,\ldots,k\}. This observation, (27), and the second inequality in (32) imply that

ε¯k(32)x^0x^i22Sk+ε1χ(27)2d02Sk+2εSi(1χ)Sk+ε1χ2d02Sk+3ε1χ,\bar{\varepsilon}_{k}\stackrel{{\scriptstyle\eqref{ineq:sk}}}{{\leq}}\frac{\|\hat{x}_{0}-\hat{x}_{i}\|^{2}}{2S_{k}}+\frac{\varepsilon}{1-\chi}\stackrel{{\scriptstyle\eqref{ineq:xk}}}{{\leq}}\frac{2d_{0}^{2}}{S_{k}}+\frac{2\varepsilon S_{i}}{(1-\chi)S_{k}}+\frac{\varepsilon}{1-\chi}\leq\frac{2d_{0}^{2}}{S_{k}}+\frac{3\varepsilon}{1-\chi},

where the last inequality is due to the fact that SiSkS_{i}\leq S_{k} for iki\leq k. It follows from the fact that ε=(1χ)ε^/6\varepsilon=(1-\chi)\hat{\varepsilon}/6 and the definition of β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) in (44) that

ε¯k2d02Sk+ε^2(44)β(ε^,ρ^)λ¯ε^2Sk+ε^2.\bar{\varepsilon}_{k}\leq\frac{2d_{0}^{2}}{S_{k}}+\frac{\hat{\varepsilon}}{2}\stackrel{{\scriptstyle\eqref{def:beta-2}}}{{\leq}}\frac{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}\hat{\varepsilon}}{2S_{k}}+\frac{\hat{\varepsilon}}{2}.

Similarly, using the first inequality in (32), the fact that ε=(1χ)ε^/6\varepsilon=(1-\chi)\hat{\varepsilon}/6, and (44), we have

s¯k(32)2d0Sk+ε^3Sk(44)β(ε^,ρ^)λ¯ρ^2Sk+β(ε^,ρ^)λ¯ρ^2Sk,\|\bar{s}_{k}\|\stackrel{{\scriptstyle\eqref{ineq:sk}}}{{\leq}}\frac{2d_{0}}{S_{k}}+\frac{\sqrt{\hat{\varepsilon}}}{\sqrt{3S_{k}}}\stackrel{{\scriptstyle\eqref{def:beta-2}}}{{\leq}}\frac{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}\hat{\rho}}{2S_{k}}+\frac{\sqrt{\beta(\hat{\varepsilon},\hat{\rho}){\underline{\lambda}}}\hat{\rho}}{2\sqrt{S_{k}}},

where the second inequality is also due to the observation that β(ε^,ρ^)4d0/(λ¯ρ^)\beta(\hat{\varepsilon},\hat{\rho})\geq 4d_{0}/(\underline{\lambda}\hat{\rho}) in view of (44). Finally, the rest of the proof follows from the same argument as in the proof of Theorem 2.8.    

In contrast to the complexity bound of Theorem 2.8 (which does not apply to χ=0\chi=0), where χ\chi appears in the denominator of β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}), the bounds of Theorem 2.9 (which allow χ=0\chi=0) involve a quantity β(ε^,ρ^)\beta(\hat{\varepsilon},\hat{\rho}) in which χ\chi no longer appears in the denominator.
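To make the \chi-dependence concrete, the following snippet (illustrative parameter values, ours) evaluates (36) and (44) as \chi decreases: the former blows up while the latter is unaffected.

lam_lo, d0, eps_hat, rho_hat = 0.1, 10.0, 1e-3, 1e-3
beta_thm28 = lambda chi: (4 * chi * eps_hat / (5 * rho_hat**2)
                          + 5 * d0**2 / (chi * eps_hat)) / lam_lo          # (36)
beta_thm29b = 4 * (eps_hat / (3 * rho_hat**2) + d0**2 / eps_hat) / lam_lo  # (44)
for chi in (0.5, 0.1, 0.01, 0.001):
    print(f"chi={chi}: beta(36)={beta_thm28(chi):.3e}, beta(44)={beta_thm29b:.3e}")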

3 Universal composite subgradient and proximal bundle methods

This section presents the two μ\mu-universal methods for solving (1), namely, U-CS in Subsection 3.1 and U-PB in Subsection 3.2. We prove that both methods are instances of FSCO and establish their iteration complexities in terms of function value and stationarity based on the analysis in Section 2.

We assume that conditions (A1) and (A2) hold for (1). We also assume that

  • (A3)

    hConv¯ν(n)h\in\overline{\mbox{\rm Conv}}_{\nu}\,(\mathbb{R}^{n}) for some 0νμ0\leq\nu\leq\mu;

  • (A4)

    fConv¯(n)f\in\overline{\mbox{\rm Conv}}\,(\mathbb{R}^{n}) is such that domhdomf\mathrm{dom}\,h\subset\mathrm{dom}\,f, and a subgradient oracle, i.e., a function f:domhnf^{\prime}:\mathrm{dom}\,h\to\mathbb{R}^{n} satisfying f(x)f(x)f^{\prime}(x)\in\partial f(x) for every xdomhx\in\mathrm{dom}\,h, is available;

  • (A5)

    there exists (Mf,Lf)+2(M_{f},L_{f})\in\mathbb{R}^{2}_{+} such that for every x,ydomhx,y\in\mathrm{dom}\,h,

    f(x)f(y)2Mf+Lfxy.\|f^{\prime}(x)-f^{\prime}(y)\|\leq 2M_{f}+L_{f}\|x-y\|.

It is well known that (A5) implies that for every x,ydomhx,y\in\mathrm{dom}\,h,

f(x)-\ell_{f}(x;y)\leq 2M_{f}\|x-y\|+\frac{L_{f}}{2}\|x-y\|^{2}. (45)

Also, for a given tolerance ε>0\varepsilon>0, the fact that the set Ω:={(Mf,Lf)+2:(A5) holds with (Mf,Lf)}\Omega:=\{(M_{f},L_{f})\in\mathbb{R}^{2}_{+}:\mbox{(A5) holds with }(M_{f},L_{f})\} is a (nonempty) closed convex set implies that there exists a unique pair (M¯f,L¯f):=(M¯f(ε),L¯f(ε))(\overline{M}_{f},\overline{L}_{f}):=(\overline{M}_{f}(\varepsilon),\overline{L}_{f}(\varepsilon)) that minimizes Mf2+εLfM_{f}^{2}+\varepsilon L_{f} over Ω\Omega (referred to as the ε\varepsilon-best pair).
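As a simple illustration (ours, not from the paper) of the \varepsilon-best pair, take f(x)=M\|x\| with M>0. Choosing x=\delta e and y=-\delta e for a unit vector e and letting \delta\downarrow 0 shows that every (M_{f},L_{f})\in\Omega must have M_{f}\geq M, while any such pair clearly satisfies (A5); hence \Omega=[M,\infty)\times\mathbb{R}_{+} and

(\overline{M}_{f}(\varepsilon),\overline{L}_{f}(\varepsilon))=\mathrm{argmin}\,\left\{M_{f}^{2}+\varepsilon L_{f}:(M_{f},L_{f})\in\Omega\right\}=(M,0)\quad\text{for every }\varepsilon>0.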

3.1 A universal composite subgradient method

The U-CS method is a variant of the universal primal gradient method of [21], obtained by introducing a damping parameter χ[0,1)\chi\in[0,1). It can be shown to be an instance of FSCO, and we thus establish its complexity using the analysis of Section 2. The method is described below.

 

U-CS

 

  • 0.

    Let x^0domh\hat{x}_{0}\in\mathrm{dom}\,h, χ[0,1)\chi\in[0,1), λ0>0{\lambda}_{0}>0, and ε>0\varepsilon>0 be given, and set λ=λ0{\lambda}={\lambda}_{0} and j=1j=1;

  • 1.

    Compute

    x=argminun{f(u;x^j1)+h(u)+12λux^j12};x=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\ell_{f}(u;\hat{x}_{j-1})+h(u)+\frac{1}{2{\lambda}}\|u-\hat{x}_{j-1}\|^{2}\right\};
  • 2.

    If f(x)f(x;x^j1)(1χ)xx^j12/(2λ)εf(x)-\ell_{f}(x;\hat{x}_{j-1})-(1-\chi)\|x-\hat{x}_{j-1}\|^{2}/(2{\lambda})\leq\varepsilon does not hold, then set λ=λ/2{\lambda}={\lambda}/2 and go to step 1; else, go to step 3;

  • 3.

    Set λj=λ{\lambda}_{j}={\lambda}, x^j=x\hat{x}_{j}=x, jj+1j\leftarrow j+1, and go to step 1.

 

We now make some remarks about U-CS. First, no stopping criterion is added to it since our goal is to analyze its iteration-complexity for obtaining two types of approximate solutions, i.e., either an ε¯\bar{\varepsilon}-solution (see Theorem 3.2 below) or a (ρ^,ε^)(\hat{\rho},\hat{\varepsilon})-stationary solution (see Theorem 3.3 below). Second, U-CS with χ=0\chi=0 and ε=ε¯/2\varepsilon=\bar{\varepsilon}/2 is exactly the universal primal gradient method analyzed in [21]. Hence, with the introduction of the damping parameter χ\chi, U-CS can be viewed as a generalization of the method of [21].
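For concreteness, the following runnable sketch (our toy instance and naming, not the paper's experiments) instantiates U-CS on \phi(x)=M_{f}\|x\|+\frac{L_{f}}{2}\|x\|^{2}+\frac{\nu}{2}\|x\|^{2}, i.e., f=M_{f}\|\cdot\|+\frac{L_{f}}{2}\|\cdot\|^{2} and h=\frac{\nu}{2}\|\cdot\|^{2}, for which the prox subproblem in step 1 has the closed form x=(\hat{x}_{j-1}-\lambda f^{\prime}(\hat{x}_{j-1}))/(1+\lambda\nu).

import numpy as np

Mf, Lf, nu = 0.01, 2.0, 0.1   # f = Mf*||.|| + (Lf/2)*||.||^2, h = (nu/2)*||.||^2

def f(x):
    return Mf * np.linalg.norm(x) + 0.5 * Lf * float(x @ x)

def f_prime(x):
    n = np.linalg.norm(x)
    return (Mf / n) * x + Lf * x if n > 0 else Lf * x

def ucs(x0, chi=0.5, lam0=1.0, eps=1e-4, n_iters=300):
    x_prev, lam = np.asarray(x0, dtype=float), lam0
    for _ in range(n_iters):
        while True:   # steps 1-2: halve lam until the acceptance test holds
            x = (x_prev - lam * f_prime(x_prev)) / (1.0 + lam * nu)
            ell = f(x_prev) + f_prime(x_prev) @ (x - x_prev)   # ell_f(x; x_prev)
            if f(x) - ell - (1 - chi) * float((x - x_prev) @ (x - x_prev)) / (2 * lam) <= eps:
                break
            lam /= 2.0
        x_prev = x                                             # step 3
    return x_prev

x_final = ucs(np.array([5.0, -3.0]))
print(np.linalg.norm(x_final))  # phi is minimized at 0; iterates reach an O(eps)-ball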

The following result shows that U-CS is an instance of FSCO and that assumptions (F1) and (F2) of Section 2 are satisfied.

Proposition 3.1

The following statements hold for U-CS:

  • a)

    {λk}\{{\lambda}_{k}\} is a non-increasing sequence;

  • b)

    for every k1k\geq 1, we have

    x^k=argminun{f(u;x^k1)+h(u)+12λkux^k12},\displaystyle\hat{x}_{k}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\ell_{f}(u;\hat{x}_{k-1})+h(u)+\frac{1}{2{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2}\right\}, (46)
    f(x^k)f(x^k;x^k1)+χ12λkx^kx^k12ε,\displaystyle f(\hat{x}_{k})-\ell_{f}(\hat{x}_{k};\hat{x}_{k-1})+\frac{\chi-1}{2\lambda_{k}}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}\leq\varepsilon, (47)
    λkλ¯(ε):=min{(1χ)ε4(M¯f2+εL¯f),λ0}.\displaystyle{\lambda}_{k}\geq\underline{{\lambda}}(\varepsilon):=\min\left\{\frac{(1-\chi)\varepsilon}{4\left({\overline{M}}_{f}^{2}+\varepsilon{\overline{L}}_{f}\right)},{\lambda}_{0}\right\}. (48)
  • c)

    U-CS is a special case of FSCO where:

    • i)

      y^k=x^k\hat{y}_{k}=\hat{x}_{k} and Γ^k()=f(;x^k1)+h()\hat{\Gamma}_{k}(\cdot)=\ell_{f}(\cdot;\hat{x}_{k-1})+h(\cdot) for every k1k\geq 1;

    • ii)

      assumptions (F1) and (F2) are satisfied with λ¯=λ¯(ε)\underline{{\lambda}}={\underline{\lambda}}(\varepsilon) given by (48) and ν\nu from assumption (A3).

Proof: a) This statement directly follows from the description of U-CS.

b) Relations (46) and (47) directly follow from the description of U-CS. Using (45) with (Mf,Lf,x,y)=(M¯f,L¯f,x^k,x^k1)(M_{f},L_{f},x,y)=(\overline{M}_{f},\overline{L}_{f},\hat{x}_{k},\hat{x}_{k-1}) and the inequality a2+b22aba^{2}+b^{2}\geq 2ab for a,ba,b\in\mathbb{R}, we have

f(x^k)f(x^k;x^k1)+χ12λk1x^kx^k12\displaystyle f(\hat{x}_{k})-\ell_{f}(\hat{x}_{k};\hat{x}_{k-1})+\frac{\chi-1}{2\lambda_{k-1}}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2} (45)2M¯fx^kx^k1+L¯f2x^kx^k12+χ12λk1x^kx^k12\displaystyle\stackrel{{\scriptstyle\eqref{ineq:est}}}{{\leq}}2\overline{M}_{f}\|\hat{x}_{k}-\hat{x}_{k-1}\|+\frac{\overline{L}_{f}}{2}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}+\frac{\chi-1}{2\lambda_{k-1}}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}
=2M¯fx^kx^k11χλk1L¯f2λk1x^kx^k12\displaystyle=2\overline{M}_{f}\|\hat{x}_{k}-\hat{x}_{k-1}\|-\frac{1-\chi-{\lambda}_{k-1}\overline{L}_{f}}{2{\lambda}_{k-1}}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}
2λk1M¯f21χλk1L¯f.\displaystyle\leq\frac{2{\lambda}_{k-1}\overline{M}_{f}^{2}}{1-\chi-{\lambda}_{k-1}\overline{L}_{f}}. (49)

Observe that λk1(1χ)ε/(2(M¯f2+L¯fε)){\lambda}_{k-1}\leq(1-\chi)\varepsilon/(2(\overline{M}_{f}^{2}+\overline{L}_{f}\varepsilon)) implies that λk1(1χ)ε/(2M¯f2+L¯fε){\lambda}_{k-1}\leq(1-\chi)\varepsilon/(2\overline{M}_{f}^{2}+\overline{L}_{f}\varepsilon), and hence that

2λk1M¯f21χλk1L¯fε.\frac{2{\lambda}_{k-1}\overline{M}_{f}^{2}}{1-\chi-{\lambda}_{k-1}\overline{L}_{f}}\leq\varepsilon.

The above inequality and (49) thus imply that (47) holds with λk{\lambda}_{k} replaced by λk1\lambda_{k-1}. This indicates that once λ{\lambda} is small enough, it remains unchanged. Hence, in view of the update scheme for λ{\lambda} in step 2 of U-CS, the lower bound λ¯(ε)\underline{{\lambda}}(\varepsilon) in (48) holds.

c) Relations (46) and (47) are the analogues of relations (5) and (4), respectively, of FSCO with y^k=x^k\hat{y}_{k}=\hat{x}_{k} and Γ^k\hat{\Gamma}_{k} as in (i). Inequality (48) shows that Assumption (F2) is satisfied with λ¯=λ¯(ε)\underline{{\lambda}}={\underline{\lambda}}(\varepsilon). Finally, since in the kk-th iteration of U-CS the model Γ^k\hat{\Gamma}_{k} for ϕ=f+h\phi=f+h is simply the linearization f(,x^k1)+h()\ell_{f}(\cdot,\hat{x}_{k-1})+h(\cdot) with hConv¯ν(n)h\in\overline{\mbox{\rm Conv}}_{\nu}\,(\mathbb{R}^{n}) for some 0νμ0\leq\nu\leq\mu (by Assumption (A3)), Assumption (F1) is also satisfied.    

We are now in a position to state the main result for the functional complexity of U-CS.

Theorem 3.2

Let ε¯>0\bar{\varepsilon}>0 be given and consider U-CS with ε=(1χ)ε¯/2\varepsilon=(1-\chi)\bar{\varepsilon}/2, where χ[0,1)\chi\in[0,1) is as in step 0 of U-CS. Then, the number of iterations of U-CS to generate an iterate x^k\hat{x}_{k} satisfying ϕ(x^k)ϕε¯\phi(\hat{x}_{k})-\phi_{*}\leq\bar{\varepsilon} is at most

min{min[1χ(1+Qf(ε¯)με¯),1+Qf(ε¯)νε¯]log(1+λ0μQf(ε¯)d02ε¯2),d02Qf(ε¯)ε¯2}+2logλ0Qf(ε¯)ε¯\min\left\{\min\left[\frac{1}{\chi}\left(1+\frac{Q_{f}({\bar{\varepsilon}})}{\mu{\bar{\varepsilon}}}\right),1+\frac{Q_{f}({\bar{\varepsilon}})}{\nu{\bar{\varepsilon}}}\right]\log\left(1+\frac{\lambda_{0}\mu Q_{f}(\bar{\varepsilon})d_{0}^{2}}{\bar{\varepsilon}^{2}}\right)\,,\,\frac{d_{0}^{2}Q_{f}({\bar{\varepsilon}})}{{\bar{\varepsilon}}^{2}}\right\}+\displaystyle\left\lceil 2\log\frac{\lambda_{0}Q_{f}({\bar{\varepsilon}})}{\bar{\varepsilon}}\right\rceil (50)

where

Qf(ε¯)=8M¯f2(1χ)2+ε¯(λ01+8L¯f(1χ)2)Q_{f}({\bar{\varepsilon}})=\frac{8\overline{M}_{f}^{2}}{(1-\chi)^{2}}+{\bar{\varepsilon}}\left({\lambda}_{0}^{-1}+\frac{8\overline{L}_{f}}{(1-\chi)^{2}}\right). (51)

Proof: Define

k¯=2logmax{4λ0(M¯f2+εL¯f)(1χ)ε,1}.\bar{k}=\left\lceil 2\log\max\left\{\frac{4\lambda_{0}(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})}{(1-\chi){\varepsilon}},1\right\}\right\rceil. (52)

Observe that λ_0/2^k ≤ λ̲(ε) for k ≥ k̄, and hence, by Lemma 3.1, λ cannot be halved more than k̄ times. It follows from ε ≤ ε̄ that M̄_f² + εL̄_f ≤ M̄_f² + ε̄L̄_f, which together with ε = (1−χ)ε̄/2 implies that

k¯2logλ0Qf(ε¯)ε¯.\bar{k}\leq\displaystyle\left\lceil 2\log\frac{\lambda_{0}Q_{f}({\bar{\varepsilon}})}{\bar{\varepsilon}}\right\rceil.

Therefore, the second term on the right-hand side of (50) gives an upper bound on the number of iterations at which λ is backtracked. We now bound the number of remaining iterations needed to obtain an ε̄-solution with U-CS. Theorem 2.3 (which can be applied since we have shown that U-CS is a special case of FSCO) gives for U-CS the upper bound

min{min[1χ(1+1λ¯(ε)μ),1+1λ¯(ε)ν]log(1+λ0μd02λ¯(ε)ε¯),d02λ¯(ε)ε¯}\min\left\{\min\left[\frac{1}{\chi}\left(1+\frac{1}{\underline{\lambda}(\varepsilon)\mu}\right),1+\frac{1}{\underline{\lambda}(\varepsilon)\nu}\right]\log\left(1+\frac{\lambda_{0}\mu d_{0}^{2}}{{\underline{\lambda}(\varepsilon)}\bar{\varepsilon}}\right),\frac{d_{0}^{2}}{\underline{\lambda}(\varepsilon){\bar{\varepsilon}}}\right\} (53)

on the number of remaining iterations (those where λ is not halved) required to find an ε̄-optimal solution of (1), with λ̲(ε) given by (48). Using the inequality

1λ¯(ε)=max{4(M¯f2+εL¯f)(1χ)ε,1λ0}4(M¯f2+εL¯f)(1χ)ε+1λ0,\frac{1}{\underline{{\lambda}}(\varepsilon)}=\max\left\{\frac{4(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})}{(1-\chi)\varepsilon},\frac{1}{\lambda_{0}}\right\}\leq\frac{4(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})}{(1-\chi)\varepsilon}+\frac{1}{\lambda_{0}}, (54)

the assumption that ε = (1−χ)ε̄/2, and the definition of Q_f(ε̄) in (51), we conclude that 1/λ̲(ε) ≤ Q_f(ε̄)/ε̄. This observation and (53) thus imply that the first term in (50) is an upper bound on the number of remaining iterations (where λ is not halved). This completes the proof.

We now make some comments about Theorem 3.2. First, Theorem 3.2 applies to any χ[0,1)\chi\in[0,1), and hence to the universal primal gradient method of [21]. In this case, if λ01=𝒪(1)\lambda_{0}^{-1}={\cal O}(1), then the strong convexity part of the bound in (50) is

𝒪~(Qf(ε¯)νε¯)=𝒪~(M¯f2νε¯+L¯fν),\tilde{\cal O}\left(\frac{Q_{f}(\bar{\varepsilon})}{\nu{\bar{\varepsilon}}}\right)=\tilde{\cal O}\left(\frac{\overline{M}_{f}^{2}}{\nu{\bar{\varepsilon}}}+\frac{\overline{L}_{f}}{\nu}\right), (55)

which is identical to the one in Proposition C.3 of [17]. Second, if χ>0\chi>0 and λ01=𝒪(1)\lambda_{0}^{-1}={\cal O}(1), then the complexity bound (50) is also

𝒪~(Qf(ε¯)χμε¯)=𝒪~(M¯f2χμε¯+L¯fχμ),\tilde{\cal O}\left(\frac{Q_{f}(\bar{\varepsilon})}{\chi\mu{\bar{\varepsilon}}}\right)=\tilde{\cal O}\left(\frac{\overline{M}_{f}^{2}}{\chi\mu{\bar{\varepsilon}}}+\frac{\overline{L}_{f}}{\chi\mu}\right), (56)

which is smaller than (55) whenever χμ ≥ ν. For example, if χ = 1/2 and μ ≫ ν, then (56) is considerably smaller than (55). In summary, the performance of U-CS with damping parameter χ > 0 depends on the strong convexity parameter μ of the overall objective function ϕ, and hence U-CS is potentially more universal than the universal primal gradient method of [21], whose complexity bound depends only on the strong convexity parameter ν of the composite function h.
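
To get a feel for the magnitudes involved, here is a minimal sketch that evaluates the bound (50) with Q_f(ε̄) from (51); every numeric value below is a placeholder, not a recommendation:

import math

# Illustrative evaluation of the bound (50) with Q_f from (51); every numeric
# value below is a placeholder.
chi, nu, mu = 0.5, 1e-3, 1e-1                 # damping and convexity parameters
Mf, Lf, lam0, d0, eps_bar = 2.0, 5.0, 1.0, 10.0, 1e-3

Qf = 8 * Mf**2 / (1 - chi)**2 + eps_bar * (1 / lam0 + 8 * Lf / (1 - chi)**2)
log_term = math.log(1 + lam0 * mu * Qf * d0**2 / eps_bar**2)
linear = min((1 + Qf / (mu * eps_bar)) / chi, 1 + Qf / (nu * eps_bar)) * log_term
sublinear = d0**2 * Qf / eps_bar**2
backtracks = math.ceil(2 * math.log(lam0 * Qf / eps_bar))
print(f"bound (50): {min(linear, sublinear) + backtracks:.3e} iterations")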

The following result states the stationary complexity of U-CS.

Theorem 3.3

For a given tolerance pair (ε^,ρ^)++2(\hat{\varepsilon},\hat{\rho})\in\mathbb{R}^{2}_{++}, consider U-CS with χ[0,1)\chi\in[0,1) and ε=(1χ)ε^/6\varepsilon=(1-\chi)\hat{\varepsilon}/6. Define y¯k\bar{y}_{k} as in (23) with sequence {y^k}\{\hat{y}_{k}\} replaced by {x^k}\{\hat{x}_{k}\}, and let s¯k\bar{s}_{k} and ε¯k\bar{\varepsilon}_{k} be as in (31) where the sequence {x^k}\{\hat{x}_{k}\} is generated by U-CS. Then for every k1k\geq 1, U-CS generates a triple (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) satisfying (34) within a number of iterations bounded by

min{min[1χ(1+Qs(ε^)με^),1+Qs(ε^)νε^]logC(ε^,ρ^),4Qs(ε^)ε^(ε^3ρ^2+d02ε^)}+2logλ0Qs(ε^)ε^\min\left\{\min\left[\frac{1}{\chi}\left(1+\frac{Q_{s}({\hat{\varepsilon}})}{\mu{\hat{\varepsilon}}}\right),1+\frac{Q_{s}({\hat{\varepsilon}})}{\nu{\hat{\varepsilon}}}\right]\log C(\hat{\varepsilon},\hat{\rho})\,,\,\frac{4Q_{s}(\hat{\varepsilon})}{\hat{\varepsilon}}\left(\frac{\hat{\varepsilon}}{3{\hat{\rho}}^{2}}+\frac{d_{0}^{2}}{\hat{\varepsilon}}\right)\right\}+\displaystyle\left\lceil 2\log\frac{\lambda_{0}Q_{s}({\hat{\varepsilon}})}{\hat{\varepsilon}}\right\rceil (57)

where

C(ε^,ρ^)=1+4λ0μQs(ε^)ε^(ε^3ρ^2+d02ε^),Qs(ε^)=24M¯f2(1χ)2+ε^(1λ0+24L¯f(1χ)2).C(\hat{\varepsilon},\hat{\rho})=1+\frac{4\lambda_{0}\mu Q_{s}(\hat{\varepsilon})}{\hat{\varepsilon}}\left(\frac{\hat{\varepsilon}}{3{\hat{\rho}}^{2}}+\frac{d_{0}^{2}}{\hat{\varepsilon}}\right),\quad\displaystyle Q_{s}({\hat{\varepsilon}})=\frac{24{\overline{M}}_{f}^{2}}{(1-\chi)^{2}}+{\hat{\varepsilon}}\left(\frac{1}{{\lambda}_{0}}+\frac{24{\overline{L}}_{f}}{(1-\chi)^{2}}\right).

Proof: As in the proof of Theorem 3.2, the integer k̄ given by (52) gives an upper bound on the number of iterations where λ is halved. Using ε = (1−χ)ε̂/6 and M̄_f² + εL̄_f ≤ M̄_f² + ε̂L̄_f, we have that

k¯2logλ0Qs(ε^)ε^,\bar{k}\leq\displaystyle\left\lceil 2\log\frac{\lambda_{0}Q_{s}({\hat{\varepsilon}})}{\hat{\varepsilon}}\right\rceil,

which gives the second term on the right-hand side of (57). Next, using Theorem 2.9(b) (this theorem applies to U-CS since ŷ_k = x̂_k for U-CS and it allows χ = 0), the number of remaining iterations to satisfy (34) is upper bounded by

min{min[1χ(1+1λ¯(ε)μ),1+1λ¯(ε)ν]log[1+λ0μβ(ε^,ρ^)],β(ε^,ρ^)}\min\left\{\min\left[\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}(\varepsilon)\mu}\right),1+\frac{1}{\underline{{\lambda}}(\varepsilon)\nu}\right]\log\left[1+\lambda_{0}\mu\beta(\hat{\varepsilon},\hat{\rho})\right]\,,\,\beta(\hat{\varepsilon},\hat{\rho})\right\} (58)

where

β(ε^,ρ^)=4λ¯(ε)(ε^3ρ^2+d02ε^)\beta(\hat{\varepsilon},\hat{\rho})=\frac{4}{{\underline{\lambda}(\varepsilon)}}\left(\frac{\hat{\varepsilon}}{3\hat{\rho}^{2}}+\frac{d_{0}^{2}}{\hat{\varepsilon}}\right)

with λ̲(ε) given by (48). Now, using the inequality (54), the assumption that ε = (1−χ)ε̂/6, and the definition of Q_s(ε̂) in the statement of the theorem, we conclude that 1/λ̲(ε) ≤ Q_s(ε̂)/ε̂. This observation and (58) then imply that the first term in (57) is an upper bound on the number of remaining iterations (where λ is not halved) required to satisfy (34). This completes the proof.

It is easy to see that when λ0\lambda_{0} is large, χ\chi is close to 11, and μν\mu\gg\nu, the upper bound (57) on the number of iterations for U-CS to generate a triple (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) satisfying (34) reduces again to (56).

3.2 A universal proximal bundle method

This subsection describes the U-PB method and establishes its iteration complexities. More specifically, § 3.2.1 describes U-PB and states the main results, namely Theorems 3.6 and 3.7, and § 3.2.2 is devoted to the proof of a technical result (i.e., Proposition 3.4) that is crucial to the proof of the main results. Conditions (A1)-(A5) are assumed to hold in this subsection.

3.2.1 Description of U-PB and related complexity results

The U-PB method is an extension of the GPB method of [17]. In contrast to GPB, we use an adaptive stepsize and introduce a maximal number N¯\overline{N} (which can be as small as one) of iterations for all cycles. Similarly to U-CS, U-PB is another instance of FSCO and we establish both functional and stationary complexities for U-PB using the results of Section 2. Compared with the complexity results in [17], those obtained in this paper are sharper, since they are expressed in terms of μ=μϕ\mu=\mu_{\phi} instead of μh\mu_{h}.

U-PB is based on the following bundle update (BU) blackbox, which builds a model f_M⁺ + h for f + h on the basis of a previous model f_M of f and a new linearization ℓ_f(·; x) of f. This blackbox BU(x^c, x, f_M, λ) is given below and takes as inputs a prox-center x^c, a current approximate solution x, an initial model f_M for f, and a stepsize λ > 0.

 

BU(xc,x,fM,λ)(x^{c},x,f_{M},\lambda)

  Inputs: λ++{\lambda}\in\mathbb{R}_{++} and (xc,x,fM)n×n×Conv¯(n)(x^{c},x,f_{M})\in\mathbb{R}^{n}\times\mathbb{R}^{n}\times{\overline{\mbox{Conv}}}(\mathbb{R}^{n}) such that fMff_{M}\leq f and

x=argminun{fM(u)+h(u)+12λuxc2}.x=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{f_{M}(u)+h(u)+\frac{1}{2{\lambda}}\|u-x^{c}\|^{2}\right\}.

Find function fM+f_{M}^{+} such that

fM+Conv¯(n),max{f¯,f(;x)}fM+f,f_{M}^{+}\in{\overline{\mbox{Conv}}}(\mathbb{R}^{n}),\qquad\max\{{\overline{f}},\ell_{f}(\cdot;x)\}\leq f_{M}^{+}\leq f, (59)

where f(;)\ell_{f}(\cdot;\cdot) is as in (2) and f¯\overline{f} is such that

f¯f,f¯Conv¯(n),f¯(x)=f(x),x=argminun{f¯(u)+h(u)+12λuxc2}.\overline{f}\leq f,\quad\overline{f}\in{\overline{\mbox{Conv}}}(\mathbb{R}^{n}),\quad\overline{f}(x)=f(x),\quad x=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\overline{f}(u)+h(u)+\frac{1}{2{\lambda}}\|u-x^{c}\|^{2}\right\}.\\ (60)

Output: fM+f_{M}^{+}.
 

In the following, we give two examples of BU, namely the two-cuts and multiple-cuts schemes. Proofs that these two schemes are indeed instances of BU can be given along the lines of Appendix D of [17].

  • (E1)

    two-cuts scheme: We assume that fMf_{M} is of the form fM=max{Af,f(;x)}f_{M}=\max\{A_{f},\ell_{f}(\cdot;x^{-})\} where AfA_{f} is an affine function satisfying AffA_{f}\leq f. The scheme then sets Af+():=θAf()+(1θ)f(;x)A_{f}^{+}(\cdot):=\theta A_{f}(\cdot)+(1-\theta)\ell_{f}(\cdot;x^{-}) and updates fM+f_{M}^{+} as fM+():=max{Af+(),f(;x)}f_{M}^{+}(\cdot):=\max\{A_{f}^{+}(\cdot),\ell_{f}(\cdot;x)\}, where θ[0,1]\theta\in[0,1] satisfies

    1λ(xxc)+h(x)+θAf+(1θ)f(x)0,\displaystyle\frac{1}{{\lambda}}(x-x^{c})+\partial h(x)+\theta\nabla A_{f}+(1-\theta)f^{\prime}(x^{-})\ni 0,
    θAf(x)+(1θ)f(x;x)=max{Af(x),f(x;x)}.\displaystyle\theta A_{f}(x)+(1-\theta)\ell_{f}(x;x^{-})=\max\{A_{f}(x),\ell_{f}(x;x^{-})\}.
  • (E2)

multiple-cuts scheme: We assume that f_M has the form f_M = f_M(·; B), where B ⊂ ℝⁿ is the current bundle set and f_M(·; B) is defined as f_M(·; B) := max{ℓ_f(·; b) : b ∈ B}. This scheme selects the next bundle set B⁺ so that B(x) ∪ {x} ⊂ B⁺ ⊂ B ∪ {x}, where B(x) := {b ∈ B : ℓ_f(x; b) = f_M(x)}, and then outputs f_M⁺ = f_M(·; B⁺). A minimal code sketch of this scheme is given right after this list.
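
To make the multiple-cuts scheme concrete, the following is a minimal sketch in the special case h = 0, assuming the cvxpy package is available to solve the prox subproblem (62) exactly; the objective f, its subgradient oracle fprime, and all numeric values are placeholders:

import numpy as np
import cvxpy as cp

# Placeholder objective: f(x) = |x_1| + x_2^2/2, with one subgradient per point.
f = lambda x: np.abs(x[0]) + 0.5 * x[1] ** 2
fprime = lambda x: np.array([np.sign(x[0]), x[1]])

def prox_step(B, xc, lam):
    """Solve the prox subproblem (62) for f_M(.;B) = max_b l_f(.;b) and h = 0."""
    u = cp.Variable(2)
    cuts = cp.hstack([float(f(b)) + fprime(b) @ (u - b) for b in B])
    cp.Problem(cp.Minimize(cp.max(cuts) + cp.sum_squares(u - xc) / (2 * lam))).solve()
    return u.value

def update_bundle(B, x, tol=1e-8):
    """(E2): keep the cuts active at x (the set B(x)) and append x itself."""
    vals = [f(b) + fprime(b) @ (x - b) for b in B]
    fM = max(vals)
    return [b for b, v in zip(B, vals) if v >= fM - tol] + [x]

xc = np.array([2.0, 2.0])
B, lam = [xc], 1.0
for _ in range(5):
    x = prox_step(B, xc, lam)
    B = update_bundle(B, x)
print("bundle size:", len(B), "last iterate:", x)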

Before giving the motivation of U-PB, we briefly review the GPB method of [17]. GPB is an inexact proximal point method (PPM, with fixed stepsize) in that, given a prox-center x^k1n\hat{x}_{k-1}\in\mathbb{R}^{n} and a prox stepsize λ>0{\lambda}>0, it computes the next prox-center x^k\hat{x}_{k} as a suitable approximate solution of the prox subproblem

x^kargminun{(f+h)(u)+12λux^k12}.\hat{x}_{k}\approx\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{(f+h)(u)+\frac{1}{2{\lambda}}\|u-\hat{x}_{k-1}\|^{2}\right\}. (61)

More specifically, a sequence of prox bundle subproblems of the form

xj=argminun{(fj+h)(u)+12λux^k12},x_{j}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{(f_{j}+h)(u)+\frac{1}{2{\lambda}}\|u-\hat{x}_{k-1}\|^{2}\right\}, (62)

where fjff_{j}\leq f is a bundle approximation of ff, is solved until for the first time an iterate xjx_{j} as in (62) approximately solves (61), and such xjx_{j} is then set to be x^k\hat{x}_{k}. The bundle approximation fjf_{j} is sequentially updated, for example, according to either one of the schemes (E1) and (E2) described above.

U-PB is also an inexact PPM, but with variable prox stepsizes (i.e., with λ in (61) replaced by λ_k) instead of the constant one used in GPB. Given an iteration upper limit N̄ ≥ 1 and a prox-center x̂_{k-1}, it adaptively computes λ_k > 0 as follows: starting with λ = λ_{k-1}, it solves at most N̄ prox subproblems of the form (62) in an attempt to obtain an approximate solution of (61); if it fails, it repeats this procedure with λ divided by 2; otherwise, it sets λ_k to be the first successful λ and x̂_k as described in the previous paragraph.

U-PB is given below.

 

U-PB

 

  • 0.

Let x̂_0 ∈ dom h, λ = λ_0 > 0, χ ∈ [0,1), ε > 0, and an integer N̄ ≥ 1 be given, and set y_0 = x̂_0, N = 0, j = 1, and k = 1. Find f_1 ∈ Conv̄(ℝⁿ) such that ℓ_f(·; x̂_0) ≤ f_1 ≤ f;

  • 1.

    Compute xjx_{j} as in (62);

  • 2.

    Choose yj{xj,yj1}y_{j}\in\{x_{j},y_{j-1}\} such that

    ϕ(yj)+χ2λyjx^k12=min{ϕ(xj)+χ2λxjx^k12,ϕ(yj1)+χ2λyj1x^k12},\phi(y_{j})+\frac{\chi}{2{\lambda}}\|y_{j}-\hat{x}_{k-1}\|^{2}=\min\left\{\phi(x_{j})+\frac{\chi}{2{\lambda}}\|x_{j}-\hat{x}_{k-1}\|^{2},\phi(y_{j-1})+\frac{\chi}{2{\lambda}}\|y_{j-1}-\hat{x}_{k-1}\|^{2}\right\}, (63)

    and set N=N+1N=N+1 and

    tj=ϕ(yj)+χ2λyjx^k12((fj+h)(xj)+12λxjx^k12);t_{j}=\phi(y_{j})+\frac{\chi}{2{\lambda}}\|y_{j}-\hat{x}_{k-1}\|^{2}-\left((f_{j}+h)(x_{j})+\frac{1}{2{\lambda}}\|x_{j}-\hat{x}_{k-1}\|^{2}\right); (64)
  • 3.

If t_j > ε and N < N̄, then
        perform a null update, i.e., set f_{j+1} = BU(x̂_{k-1}, x_j, f_j, λ);
else
        if t_j > ε and N = N̄, then
                perform a reset update, i.e., set λ ← λ/2;
        else (i.e., t_j ≤ ε and N ≤ N̄)
                perform a serious update, i.e., set x̂_k = x_j, Γ̂_k = f_j + h, ŷ_k = y_j, λ_k = λ, and k ← k + 1;
        end if
        set N = 0 and find f_{j+1} ∈ Conv̄(ℝⁿ) such that ℓ_f(·; x̂_{k-1}) ≤ f_{j+1} ≤ f;
end if

  • 4.

    Set jj+1j\leftarrow j+1 and go to step 1.

 

We now give further explanation about U-PB. U-PB performs three types of iterations, namely, null, reset, and serious, corresponding to the types of updates performed at the end. A reset (resp., serious) cycle of U-PB consists of a reset (resp., serious) iteration and all the consecutive null iterations preceding it. The index jj counts the total iterations including null, reset, and serious ones. The index kk counts the serious cycles which, together with the quantities x^k\hat{x}_{k}, y^k\hat{y}_{k}, and Γ^k\hat{\Gamma}_{k} computed at the end of cycle kk, is used to cast U-PB as an instance of FSCO. All iterations within a cycle are referred to as inner iterations. The quantity NN counts the number of inner iterations performed in the current cycle. Each cycle of U-PB performs at most N¯\overline{N} iterations. A serious cycle successfully finds tjεt_{j}\leq\varepsilon within N¯\overline{N} iterations, while a reset cycle fails to do so. In both cases, U-PB resets the counter NN to 0 and starts a new cycle. The differences between the two cases are: 1) the stepsize λ{\lambda} is halved at the end of a reset cycle, while it is kept as is at the end of a serious cycle; and 2) the prox-center is kept the same at the end of a reset cycle, but it is updated to the latest xjx_{j} at the end of a serious cycle.
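
The cycle logic just described can be condensed into a short driver. The sketch below is illustrative only: it reuses f, fprime, prox_step, and update_bundle from the previous sketch (so again h = 0), and the parameter values and fixed iteration budget are placeholders standing in for a proper termination test:

import numpy as np
# Assumes f, fprime, prox_step, and update_bundle from the previous sketch (h = 0).

def upb(x0, lam0, chi=0.5, eps=1e-3, Nbar=5, total_iters=200):
    xc, y, lam = x0, x0, lam0          # prox-center x_hat, best point y, stepsize
    B, N = [x0], 0                     # bundle (model) and inner-iteration counter
    for _ in range(total_iters):
        x = prox_step(B, xc, lam)
        score = lambda z: f(z) + chi / (2 * lam) * np.sum((z - xc) ** 2)
        y = min([x, y], key=score)     # step 2: best point so far, as in (63)
        N += 1
        m = max(float(f(b)) + fprime(b) @ (x - b) for b in B) \
            + np.sum((x - xc) ** 2) / (2 * lam)
        t = score(y) - m               # the gap t_j of (64)
        if t > eps and N < Nbar:       # null update: refine the model
            B = update_bundle(B, x)
        else:
            if t > eps:                # reset update: halve the stepsize
                lam /= 2
            else:                      # serious update: move the prox-center
                xc = x
            N, B = 0, [xc]             # start a new cycle from a single cut
    return y, lam

y, lam = upb(np.array([2.0, 2.0]), lam0=1.0)
print("f(y) =", f(y), " final stepsize =", lam)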

We now make some remarks about U-PB. First, it follows from the fact that fjff_{j}\leq f and the definition of tjt_{j} in (64) that the primal gap of the prox subproblem in (61) is upper bounded by tj+(1χ)yjx^k12/(2λ)t_{j}+(1-\chi)\|y_{j}-\hat{x}_{k-1}\|^{2}/(2{\lambda}). Hence, if tjεt_{j}\leq\varepsilon, then yjy_{j} is an εj\varepsilon_{j}-solution of (61) where εj=ε+(1χ)yjx^k12/(2λ)\varepsilon_{j}=\varepsilon+(1-\chi)\|y_{j}-\hat{x}_{k-1}\|^{2}/(2{\lambda}). Second, the GPB method of [17] (resp., [15]) computes yjy_{j} using (63) with χ=0\chi=0 (resp., χ=1\chi=1). In the case where χ=1\chi=1, it can be easily seen that tjt_{j} is an upper bound on the primal gap of the prox subproblem in (61), and hence that yjy_{j} is an ε\varepsilon-solution of (61) if tjεt_{j}\leq\varepsilon. Third, the iterate yjy_{j} computed in step 2 of U-PB satisfies

yjArgmin{ϕ(x)+χ2λxx^k12:x{y^k1,x0,,xj}},y_{j}\in\mathrm{Argmin}\,\left\{\phi(x)+\frac{\chi}{2{\lambda}}\|x-\hat{x}_{k-1}\|^{2}:x\in\{\hat{y}_{k-1},x_{\ell_{0}},\ldots,x_{j}\}\right\}, (65)

where 0\ell_{0} denotes the first iteration index of the cycle containing jj. In other words, yjy_{j} is the best point in terms of ϕ()+χx^k12/(2λ)\phi(\cdot)+\chi\|\cdot-\hat{x}_{k-1}\|^{2}/(2{\lambda}) among all the points obtained in the course of solving (62) and the point y^k1\hat{y}_{k-1} obtained at the end of the previous cycle.

The next proposition shows some useful relations about the sequences {x^k}\{\hat{x}_{k}\}, {y^k}\{\hat{y}_{k}\} and {λk}\{\lambda_{k}\} generated at the end of the serious cycles of U-PB. This proposition will be proved in § 3.2.2.

Proposition 3.4

Define

U¯(ε)=ε[1λ0+40L¯f1χ]+32M¯f2(1χ)N¯(1+log(N¯)).{\overline{U}}(\varepsilon)=\displaystyle\varepsilon\left[\frac{1}{\lambda_{0}}+\frac{40{\overline{L}}_{f}}{1-\chi}\right]+\frac{32{\overline{M}}_{f}^{2}}{(1-\chi){\overline{N}}}\left(1+\log({\overline{N}})\right). (66)

The following statements hold for U-PB:

  • a)

    every cycle in U-PB has at most N¯\overline{N} inner iterations;

  • b)

    each stepsize λk\lambda_{k} generated by U-PB satisfies

    λkεU¯(ε);\displaystyle\lambda_{k}\geq\frac{\varepsilon}{{\overline{U}}(\varepsilon)}; (67)
  • c)

    the number of reset cycles is upper bounded by

    2logλ0U¯(ε)ε;\displaystyle\left\lceil 2\log\frac{\lambda_{0}{\overline{U}}(\varepsilon)}{\varepsilon}\right\rceil; (68)
  • d)

    for a serious cycle, the pair (x^k,y^k)(\hat{x}_{k},\hat{y}_{k}) of U-PB satisfies

    x^k=argminun{Γ^k(u)+12λkux^k12},\displaystyle\hat{x}_{k}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\hat{\Gamma}_{k}(u)+\frac{1}{2{\lambda}_{k}}\|u-\hat{x}_{k-1}\|^{2}\right\}, (69)
    ϕ(y^k)+χ2λky^kx^k12[Γ^k(x^k)+12λkx^kx^k12]ε.\displaystyle\phi(\hat{y}_{k})+\frac{\chi}{2\lambda_{k}}\|\hat{y}_{k}-\hat{x}_{k-1}\|^{2}-\left[\hat{\Gamma}_{k}(\hat{x}_{k})+\frac{1}{2{\lambda}_{k}}\|\hat{x}_{k}-\hat{x}_{k-1}\|^{2}\right]\leq\varepsilon. (70)

The following result shows that serious iterations of U-PB generate sequences {x^k}\{\hat{x}_{k}\}, {y^k}\{\hat{y}_{k}\}, {λk}\{\lambda_{k}\}, and {Γ^k}\{\hat{\Gamma}_{k}\} satisfying the requirements of FSCO.

Proposition 3.5

U-PB is a special case of FSCO where:

  • a)

    the pair (x^k,y^k)(\hat{x}_{k},\hat{y}_{k}) satisfies relations (4) and (5);

  • b)

    conditions (F1) and (F2) used in the analysis of FSCO are satisfied.

Proof: a) Relations (4) and (5) are given in Proposition 3.4(d).

b) By Assumption (A3) we have that Γ^kConv¯ν(n)\hat{\Gamma}_{k}\in\overline{\mbox{\rm Conv}}_{\nu}\,(\mathbb{R}^{n}) and therefore Assumption (F1) of FSCO is satisfied for U-PB. Assumption (F2) is given for U-PB in Proposition 3.4(b).    

We now state the first main result of this subsection where the functional iteration complexity of U-PB is established.

Theorem 3.6

Given a tolerance ε̄ > 0, consider U-PB with ε = (1−χ)ε̄/2, where χ ∈ [0,1) is as in step 0 of U-PB. Let {x̂_k} and {ŷ_k} be the sequences generated by U-PB. Then, the number of iterations of U-PB to generate an iterate ŷ_k satisfying ϕ(ŷ_k) − ϕ_* ≤ ε̄ is at most

min{min[1χ(N¯+Rf(ε¯)με¯),N¯+Rf(ε¯)νε¯]log(1+λ0μRf(ε¯)d02ε¯2N¯),d02Rf(ε¯)ε¯2N¯}+N¯2logλ0Rf(ε¯)ε¯N¯\min\left\{\min\left[\frac{1}{\chi}\left({\overline{N}}+\frac{R_{f}({\bar{\varepsilon}})}{\mu{\bar{\varepsilon}}}\right),{\overline{N}}+\frac{R_{f}({\bar{\varepsilon}})}{\nu{\bar{\varepsilon}}}\right]\log\left(1+\frac{\lambda_{0}\mu R_{f}({\bar{\varepsilon}})d_{0}^{2}}{\bar{\varepsilon}^{2}{\overline{N}}}\right)\,,\,\frac{d_{0}^{2}R_{f}({\bar{\varepsilon}})}{\bar{\varepsilon}^{2}{\overline{N}}}\right\}+{\overline{N}}\displaystyle\left\lceil 2\log\frac{\lambda_{0}R_{f}({\bar{\varepsilon}})}{{\bar{\varepsilon}}{\overline{N}}}\right\rceil (71)

where

Rf(ε¯)=ε¯N¯[1λ0+40L¯f1χ]+64M¯f2(1χ)2(1+log(N¯)).\displaystyle R_{f}({\bar{\varepsilon}})=\displaystyle{\bar{\varepsilon}}{\overline{N}}\left[\frac{1}{\lambda_{0}}+\frac{40{\overline{L}}_{f}}{1-\chi}\right]+\frac{64{\overline{M}}_{f}^{2}}{(1-\chi)^{2}}\left(1+\log({\overline{N}})\right).

Proof: Since every serious cycle of U-PB has at most N̄ inner iterations (by Proposition 3.4(a)), the complexity of the serious cycles of U-PB is that of FSCO, given by (19) in Theorem 2.3, multiplied by N̄ (the functional complexity of FSCO can be applied to U-PB since we have shown in Proposition 3.5 that the serious iterates of U-PB follow the FSCO framework). In this expression, by Proposition 3.4(b), we can bound 1/λ̲ from above by U̅(ε)/ε ≤ R_f(ε̄)/(ε̄N̄), which gives the first term in (71). By Proposition 3.4(c), the number of reset cycles is at most (68), which is bounded from above by

2logλ0U¯(ε)ε2logλ0Rf(ε¯)ε¯N¯,\left\lceil 2\log\frac{\lambda_{0}{\overline{U}}(\varepsilon)}{\varepsilon}\right\rceil\leq\left\lceil 2\log\frac{\lambda_{0}R_{f}({\bar{\varepsilon}})}{\bar{\varepsilon}{\overline{N}}}\right\rceil,

and each of these cycles also has at most N¯\overline{N} inner iterations (by Proposition 3.4(a)). This gives the second term in (71) and the result follows.    

In the case where λ_0 is not too small, χ is bounded away from both zero and one, and μ ≫ ν, the complexity result of Theorem 3.6 reduces to

𝒪~(Rf(ε¯)με¯)=𝒪~(M¯f2με¯)\tilde{\cal O}\left(\frac{R_{f}(\bar{\varepsilon})}{\mu{\bar{\varepsilon}}}\right)=\tilde{\cal O}\left(\frac{\overline{M}_{f}^{2}}{\mu{\bar{\varepsilon}}}\right) (72)

i.e., we recover the functional complexity (56) of U-CS. When L̄_f = 0, the above complexity is optimal up to logarithmic terms.
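
Analogously to the numeric sketch given after the discussion of Theorem 3.2, one can tabulate the U-PB bound (71); all values below are placeholders:

import math

# Illustrative evaluation of the U-PB bound (71); every numeric value is a placeholder.
chi, nu, mu = 0.5, 1e-3, 1e-1
Mf, Lf, lam0, d0, eps_bar, Nbar = 2.0, 5.0, 1.0, 10.0, 1e-3, 10
Rf = eps_bar * Nbar * (1 / lam0 + 40 * Lf / (1 - chi)) \
     + 64 * Mf**2 * (1 + math.log(Nbar)) / (1 - chi)**2
log_term = math.log(1 + lam0 * mu * Rf * d0**2 / (eps_bar**2 * Nbar))
linear = min((Nbar + Rf / (mu * eps_bar)) / chi, Nbar + Rf / (nu * eps_bar)) * log_term
sublinear = d0**2 * Rf / (eps_bar**2 * Nbar)
resets = Nbar * math.ceil(2 * math.log(lam0 * Rf / (eps_bar * Nbar)))
print(f"bound (71): {min(linear, sublinear) + resets:.3e} iterations")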

We now state the second main result of this subsection where the stationary iteration complexity of U-PB is established.

Theorem 3.7

For a given tolerance pair (ε^,ρ^)++2(\hat{\varepsilon},\hat{\rho})\in\mathbb{R}^{2}_{++}, U-PB with

χ(0,1),ε=χ(1χ)ε^10,\chi\in(0,1),\quad\varepsilon=\frac{\chi(1-\chi)\hat{\varepsilon}}{10}, (73)

generates a triple (y¯k,s¯k,ε¯k)(\bar{y}_{k},\bar{s}_{k},\bar{\varepsilon}_{k}) satisfying (34) in at most

min{min[1χ(N¯+Rs(ε^)ε^μ),N¯+Rs(ε^)ε^ν]logC(ε^,ρ^),Rs(ε^)ε^N¯(4χε^5ρ^2+5d02χε^)}+N¯2logλ0Rs(ε^)ε^N¯\min\left\{\min\left[\frac{1}{\chi}\left({\overline{N}}+\frac{R_{s}(\hat{\varepsilon})}{{\hat{\varepsilon}}\mu}\right),{\overline{N}}+\frac{R_{s}(\hat{\varepsilon})}{{\hat{\varepsilon}}\nu}\right]\log C(\hat{\varepsilon},\hat{\rho}),\,\frac{R_{s}(\hat{\varepsilon})}{\hat{\varepsilon}{\overline{N}}}\left(\frac{4\chi\hat{\varepsilon}}{5\hat{\rho}^{2}}+\frac{5d_{0}^{2}}{\chi\hat{\varepsilon}}\right)\right\}+{\overline{N}}\displaystyle\left\lceil 2\log\frac{\lambda_{0}R_{s}({\hat{\varepsilon}})}{\hat{\varepsilon}{\overline{N}}}\right\rceil (74)

iterations where

C(ε^,ρ^)=1+λ0μRs(ε^)ε^N¯(4χε^5ρ^2+5d02χε^) and Rs(ε^)=ε^N¯[1λ0+40L¯f1χ]+320M¯f2χ(1χ)2(1+log(N¯)).\begin{array}[]{l}\displaystyle C(\hat{\varepsilon},\hat{\rho})=1+\frac{\lambda_{0}\mu R_{s}(\hat{\varepsilon})}{\hat{\varepsilon}{\overline{N}}}\left(\frac{4\chi\hat{\varepsilon}}{5\hat{\rho}^{2}}+\frac{5d_{0}^{2}}{\chi\hat{\varepsilon}}\right)\;\emph{ and }\\ R_{s}(\hat{\varepsilon})=\displaystyle{\hat{\varepsilon}}{\overline{N}}\left[\frac{1}{\lambda_{0}}+\frac{40{\overline{L}}_{f}}{1-\chi}\right]+\frac{320{\overline{M}}_{f}^{2}}{\chi(1-\chi)^{2}}\left(1+\log({\overline{N}})\right).\end{array} (75)

Proof: The complexity of the serious cycles of U-PB needed to satisfy the stationarity conditions (34) is that of FSCO, given by (35) in Theorem 2.8, multiplied by N̄. In this expression,

min{min[1χ(1+1λ¯μ),1+1λ¯ν]log[1+λ0μβ(ε^,ρ^)],β(ε^,ρ^)}\min\left\{\min\left[\frac{1}{\chi}\left(1+\frac{1}{\underline{{\lambda}}\mu}\right),1+\frac{1}{\underline{{\lambda}}\nu}\right]\log\left[1+\lambda_{0}\mu\beta(\hat{\varepsilon},\hat{\rho})\right]\,,\,\beta(\hat{\varepsilon},\hat{\rho})\right\}

where

β(ε^,ρ^)=1λ¯(4χε^5ρ^2+5d02χε^),\beta(\hat{\varepsilon},\hat{\rho})=\frac{1}{\underline{\lambda}}\left(\frac{4\chi\hat{\varepsilon}}{5\hat{\rho}^{2}}+\frac{5d_{0}^{2}}{\chi\hat{\varepsilon}}\right), (76)

using the expression (73) for ε, we can bound 1/λ̲ from above by U̅(ε)/ε ≤ R_s(ε̂)/(ε̂N̄), which gives the first term in (74). By Proposition 3.4(c), the number of reset cycles is at most (68), which is bounded from above by

2logλ0Rs(ε^)ε^N¯,\left\lceil 2\log\frac{\lambda_{0}R_{s}({\hat{\varepsilon}})}{\hat{\varepsilon}{\overline{N}}}\right\rceil,

and each of these cycles also has at most N¯\overline{N} inner iterations (by Proposition 3.4(a)). This gives the second term in (74) and the result follows.    

Theorem 3.7 does not hold for χ = 0 since the definitions of C(ε̂, ρ̂) and R_s(ε̂) in (75) depend on χ⁻¹. However, if χ = 0 and the domain of ϕ is bounded with diameter D, then we can apply Theorem 2.9(b) to conclude that U-PB with ε = ε̂/2 generates a triple (ȳ_k, s̄_k, ε̄_k) satisfying (34) within a number of iterations bounded by (35), but with β(ε̂, ρ̂) now given by (43) and 1/λ̲ bounded from above by W_s(ε̂)/ε̂, where

Ws(ε^)=ε^[1λ0+40L¯f1χ]+64M¯f2(1χ)2N¯(1+log(N¯)).W_{s}(\hat{\varepsilon})=\displaystyle{\hat{\varepsilon}}\left[\frac{1}{\lambda_{0}}+\frac{40{\overline{L}}_{f}}{1-\chi}\right]+\frac{64{\overline{M}}_{f}^{2}}{(1-\chi)^{2}{\overline{N}}}\left(1+\log({\overline{N}})\right).

3.2.2 Proof of Proposition 3.4

To prove Proposition 3.4, we first state Lemmas 3.8 and 3.9. Lemma 3.8 will be used to show Lemma 3.9. Lemma 3.9 plays an important role in the analysis of the null iterates and establishes a key recursive formula for the sequence {tj}\{t_{j}\} defined in (64).

Lemma 3.8

For the jj-th iteration of a cycle with prox stepsize λ\lambda and prox-center x^k1\hat{x}_{k-1}, define

mj:=(fj+h)(xj)+12λxjx^k12.m_{j}:=(f_{j}+h)(x_{j})+\frac{1}{2\lambda}\|x_{j}-\hat{x}_{k-1}\|^{2}. (77)

If jj is not the last iteration of the cycle, then

mj+1τmj\displaystyle m_{j+1}-\tau m_{j}
(1τ)[f(xj+1;xj)+h(xj+1)+12λxj+1x^k12+(L¯f2+2M¯f2ε)xj+1xj2],\displaystyle\geq(1-\tau)\left[\ell_{f}(x_{j+1};x_{j})+h(x_{j+1})+\frac{1}{2{\lambda}}\|x_{j+1}-\hat{x}_{k-1}\|^{2}+\left(\frac{\overline{L}_{f}}{2}+\frac{2\overline{M}_{f}^{2}}{\varepsilon}\right)\|x_{j+1}-x_{j}\|^{2}\right], (78)

where

τ=1(1+4λ(M¯f2+εL¯f)ε)1.\tau=1-\left(1+\frac{4{\lambda}({\overline{M}}_{f}^{2}+\varepsilon{\overline{L}}_{f})}{\varepsilon}\right)^{-1}. (79)

Proof: It follows from the definition of τ in (79) that

τ2λ(1τ)=(79)2L¯f+2M¯f2εL¯f2+2M¯f2ε.\frac{\tau}{2{\lambda}(1-\tau)}\stackrel{{\scriptstyle\eqref{deftauk}}}{{=}}2\overline{L}_{f}+\frac{2\overline{M}_{f}^{2}}{\varepsilon}\geq\frac{\overline{L}_{f}}{2}+\frac{2\overline{M}_{f}^{2}}{\varepsilon}. (80)

Using the definition of mjm_{j} in (77), and relations (99) and (101) with u=xj+1u=x_{j+1}, we have

mj+1\displaystyle m_{j+1} =(77)(fj+1+h)(xj+1)+12λxj+1x^k12\displaystyle\overset{\eqref{defmj}}{=}(f_{j+1}+h)(x_{j+1})+\frac{1}{2{\lambda}}\|x_{j+1}-\hat{x}_{k-1}\|^{2}
(99)(1τ)[f(xj+1;xj)+h(xj+1)+12λxj+1x^k12]+τ((f¯j+h)(xj+1)+12λxj+1x^k12)\displaystyle\overset{\eqref{eq:Gamma_j}}{\geq}(1-\tau)\left[\ell_{f}(x_{j+1};x_{j})+h(x_{j+1})+\frac{1}{2{\lambda}}\|x_{j+1}-\hat{x}_{k-1}\|^{2}\right]+\tau\left((\overline{f}_{j}+h)(x_{j+1})+\frac{1}{2{\lambda}}\|x_{j+1}-\hat{x}_{k-1}\|^{2}\right)
(101)(1τ)[f(xj+1;xj)+h(xj+1)+12λxj+1x^k12]+τ(mj+12λxj+1xj2).\displaystyle\overset{\eqref{ineq:Gammaj}}{\geq}(1-\tau)\left[\ell_{f}(x_{j+1};x_{j})+h(x_{j+1})+\frac{1}{2{\lambda}}\|x_{j+1}-\hat{x}_{k-1}\|^{2}\right]+\tau\left(m_{j}+\frac{1}{2{\lambda}}\|x_{j+1}-x_{j}\|^{2}\right).

This inequality and (80) then imply (78).    

Lemma 3.9

The following statements about U-PB hold:

  • a)

    for every iteration jj that is not the last one of the cycle, we have

    tj+1ε2τ(tjε2)t_{j+1}-\frac{\varepsilon}{2}\leq\tau\left(t_{j}-\frac{\varepsilon}{2}\right) (81)

    where τ\tau is as in (79);

  • b)

    if ii is the first iteration of the cycle and λ(1χ)/(2L¯f)\lambda\leq(1-\chi)/(2{\overline{L}_{f}}), then

    ti4λM¯f21χ.t_{i}\leq\frac{4{\lambda}\overline{M}_{f}^{2}}{1-\chi}. (82)

Proof: a) In what follows, we use the notation

ψ():=ϕ()+χ2λx^k12.\psi(\cdot):=\phi(\cdot)+\frac{\chi}{2{\lambda}}\|\cdot-\hat{x}_{k-1}\|^{2}. (83)

It follows from the above notation and the definitions of tjt_{j} and mjm_{j} in (64) and (77), respectively, that

t_j = ψ(y_j) − m_j  and, by (83) and (63),  ψ(y_{j+1}) = min{ψ(x_{j+1}), ψ(y_j)}. (84)

Using (45) with (Mf,Lf,u,v)=(M¯f,L¯f,xj+1,xj)(M_{f},L_{f},u,v)=(\overline{M}_{f},\overline{L}_{f},x_{j+1},x_{j}) and the fact that ϕ=f+h\phi=f+h, we have

f(xj+1;xj)+h(xj+1)+L¯f2xj+1xj2ϕ(xj+1)2M¯fxj+1xj.\ell_{f}(x_{j+1};x_{j})+h(x_{j+1})+\frac{\overline{L}_{f}}{2}\|x_{j+1}-x_{j}\|^{2}\geq\phi(x_{j+1})-2\overline{M}_{f}\|x_{j+1}-x_{j}\|. (85)

This inequality, the definition of ψ\psi in (83), and relation (78), imply that

mj+1τmj\displaystyle m_{j+1}-\tau m_{j}
(78)(1τ)[f(xj+1;xj)+h(xj+1)+12λxj+1x^k12+(L¯f2+2M¯f2ε)xj+1xj2]\displaystyle\overset{\eqref{ineq:mj}}{\geq}(1-\tau)\left[\ell_{f}(x_{j+1};x_{j})+h(x_{j+1})+\frac{1}{2{\lambda}}\|x_{j+1}-\hat{x}_{k-1}\|^{2}+\left(\frac{\overline{L}_{f}}{2}+\frac{2\overline{M}_{f}^{2}}{\varepsilon}\right)\|x_{j+1}-x_{j}\|^{2}\right]
(85)(1τ)(ϕ(xj+1)+12λxj+1x^k12)+1τε(2M¯f2xj+1xj22M¯fεxj+1xj)\displaystyle\overset{\eqref{ineq:ellphi}}{\geq}(1-\tau)\Big{(}\phi(x_{j+1})+\frac{1}{2\lambda}\|x_{j+1}-\hat{x}_{k-1}\|^{2}\Big{)}+\frac{1-\tau}{\varepsilon}\Big{(}2\overline{M}_{f}^{2}\|x_{j+1}-x_{j}\|^{2}-2\overline{M}_{f}\varepsilon\|x_{j+1}-x_{j}\|\Big{)}
=(83)(1τ)(ψ(xj+1)+1χ2λxj+1x^k12)+1τε(2M¯f2xj+1xj22M¯fεxj+1xj)\displaystyle\overset{\eqref{def:philam}}{=}(1-\tau)\Big{(}\psi(x_{j+1})+\frac{1-\chi}{2\lambda}\|x_{j+1}-\hat{x}_{k-1}\|^{2}\Big{)}+\frac{1-\tau}{\varepsilon}\Big{(}2\overline{M}_{f}^{2}\|x_{j+1}-x_{j}\|^{2}-2\overline{M}_{f}\varepsilon\|x_{j+1}-x_{j}\|\Big{)}
(1τ)ψ(xj+1)(1τ)ε2,\displaystyle\geq(1-\tau)\psi(x_{j+1})-\frac{(1-\tau)\varepsilon}{2}, (86)

where the last inequality follows from the fact that χ<1\chi<1 and the inequality a22abb2a^{2}-2ab\geq-b^{2} with a=2M¯fxj+1xja=2\overline{M}_{f}\|x_{j+1}-x_{j}\| and b=εb=\varepsilon. Using relations (84) and (86), we conclude that

tj+1τtj\displaystyle t_{j+1}-\tau t_{j} =(84)ψ(yj+1)mj+1τtj\displaystyle\overset{\eqref{eq:tj}}{=}\psi(y_{j+1})-m_{j+1}-\tau t_{j}
(86)ψ(yj+1)τ(mj+tj)(1τ)ψ(xj+1)+(1τ)ε2\displaystyle\overset{\eqref{ineq:mj2b}}{\leq}\psi(y_{j+1})-\tau(m_{j}+t_{j})-(1-\tau)\psi(x_{j+1})+\frac{(1-\tau)\varepsilon}{2}
=(84)ψ(yj+1)τψ(yj)(1τ)ψ(xj+1)+(1τ)ε2\displaystyle\overset{\eqref{eq:tj}}{=}\psi(y_{j+1})-\tau\psi(y_{j})-(1-\tau)\psi(x_{j+1})+\frac{(1-\tau)\varepsilon}{2}
(84)(1τ)ε2,\displaystyle\overset{\eqref{eq:tj}}{\leq}\frac{(1-\tau)\varepsilon}{2},

and hence that (81) holds.

b) Using relations (77), (83), and (84), we have

ti\displaystyle t_{i} =(77),(84)ψ(yi)Γi(xi)12λxix^k12(84)ψ(xi)Γi(xi)12λxix^k12\displaystyle\stackrel{{\scriptstyle\eqref{defmj},\eqref{eq:tj}}}{{=}}\psi(y_{i})-\Gamma_{i}(x_{i})-\frac{1}{2\lambda}\|x_{i}-{\hat{x}}_{k-1}\|^{2}\stackrel{{\scriptstyle\eqref{eq:tj}}}{{\leq}}\psi(x_{i})-\Gamma_{i}(x_{i})-\frac{1}{2\lambda}\|x_{i}-{\hat{x}}_{k-1}\|^{2}
=(83)ϕ(xi)Γi(xi)+χ12λxix^k12=f(xi)fi(xi)+χ12λxix^k12,\displaystyle\stackrel{{\scriptstyle\eqref{def:philam}}}{{=}}\phi(x_{i})-\Gamma_{i}(x_{i})+\frac{\chi-1}{2\lambda}\|x_{i}-{\hat{x}}_{k-1}\|^{2}=f(x_{i})-f_{i}(x_{i})+\frac{\chi-1}{2\lambda}\|x_{i}-{\hat{x}}_{k-1}\|^{2},

where the last equality follows from the facts that ϕ=f+h\phi=f+h and Γi=fi+h\Gamma_{i}=f_{i}+h. It follows from U-PB that if ii is the first iteration of the cycle, then

fi()f(;x^k1).f_{i}(\cdot)\geq\ell_{f}(\cdot;{\hat{x}}_{k-1}). (87)

Combining the above two inequalities and using (45), we obtain

ti\displaystyle t_{i} (87)f(xi)f(xi;x^k1)+χ12λxix^k12\displaystyle\stackrel{{\scriptstyle\eqref{lbfxj}}}{{\leq}}f(x_{i})-\ell_{f}(x_{i};{\hat{x}}_{k-1})+\frac{\chi-1}{2\lambda}\|x_{i}-{\hat{x}}_{k-1}\|^{2}
(45)2M¯fxix^k1+L¯f2xix^k12+χ12λxix^k12\displaystyle\stackrel{{\scriptstyle\eqref{ineq:est}}}{{\leq}}2\overline{M}_{f}\|x_{i}-\hat{x}_{k-1}\|+\frac{\overline{L}_{f}}{2}\|x_{i}-{\hat{x}}_{k-1}\|^{2}+\frac{\chi-1}{2\lambda}\|x_{i}-{\hat{x}}_{k-1}\|^{2}
=2M¯fxix^k11χλL¯f2λxix^k12.\displaystyle=2\overline{M}_{f}\|x_{i}-{\hat{x}}_{k-1}\|-\frac{1-\chi-{\lambda}\overline{L}_{f}}{2{\lambda}}\|x_{i}-{\hat{x}}_{k-1}\|^{2}.

By maximizing the right-hand side with respect to r = ‖x_i − x̂_{k-1}‖ (the maximum of 2M̄_f r − (1−χ−λL̄_f)r²/(2λ) over r ≥ 0 equals 2λM̄_f²/(1−χ−λL̄_f)), we deduce that t_i ≤ 2λM̄_f²/(1−χ−λL̄_f). Since λ ≤ (1−χ)/(2L̄_f) implies 1−χ−λL̄_f ≥ (1−χ)/2, we obtain (82).

Before presenting the proof of Proposition 3.4, we also need the following technical lemma.

Lemma 3.10

Assume that the prox stepsize λ\lambda of some cycle of U-PB is such that λλ~(ε){\lambda}\leq{\tilde{\lambda}}(\varepsilon) where

λ~(ε)={min{(N¯1)ε8(M¯f2+εL¯f)log(N¯),(1χ)exp(1/2)N¯ε8M¯f2,1χ2L¯f}if N¯2,min{(1χ)ε4M¯f2,1χ2L¯f}if N¯=1.{\tilde{\lambda}}(\varepsilon)=\left\{\begin{array}[]{cl}\min\displaystyle\left\{\frac{({\overline{N}}-1)\varepsilon}{8({\overline{M}}_{f}^{2}+\varepsilon{\overline{L}}_{f})\log({\overline{N}})},\frac{(1-\chi)\exp(-1/2){\overline{N}}\varepsilon}{8\overline{M}_{f}^{2}},\frac{1-\chi}{2{\overline{L}}_{f}}\right\}&\mbox{if }{\overline{N}}\geq 2,\\ \displaystyle\min\left\{\frac{(1-\chi)\varepsilon}{4{\overline{M}}_{f}^{2}},\frac{1-\chi}{2{\overline{L}}_{f}}\right\}&\mbox{if }{\overline{N}}=1.\end{array}\right. (88)

Then, this cycle must be a serious one.

Proof: Let i denote the first iteration of a cycle whose prox stepsize λ satisfies (88). It suffices to prove that t_ℓ ≤ ε, where ℓ = i + N̄ − 1 is the last iteration of the cycle. We consider two cases: N̄ = 1 and N̄ ≥ 2. If N̄ = 1, using Lemma 3.9(b) (which can be applied since λ ≤ (1−χ)/(2L̄_f)) and (88), we have

t=ti4λM¯f21χ(88)εt_{\ell}=t_{i}\leq\frac{4{\lambda}\overline{M}_{f}^{2}}{1-\chi}\stackrel{{\scriptstyle\eqref{ublambdak}}}{{\leq}}\varepsilon

which shows that the cycle is a serious one (with one iteration only). Let us now consider the case N¯2{\overline{N}}\geq 2. Lemma 3.9 implies that

tε2(81)τN¯1(tiε2)τN¯1ti(82)4λM¯f21χτN¯1.t_{\ell}-\frac{\varepsilon}{2}\stackrel{{\scriptstyle\eqref{ineq:tj-recur}}}{{\leq}}\tau^{{\overline{N}}-1}\left(t_{i}-\frac{\varepsilon}{2}\right)\leq\tau^{{\overline{N}}-1}t_{i}\stackrel{{\scriptstyle\eqref{ineq:ti}}}{{\leq}}\frac{4{\lambda}\overline{M}_{f}^{2}}{1-\chi}\tau^{{\overline{N}}-1}. (89)

It then suffices to show that the right-hand side of (89) is bounded above by ε/2\varepsilon/2, or equivalently, that

log(8λM¯f2(1χ)ε)(N¯1)log(τ1).\log\left(\frac{8\lambda{\overline{M}}_{f}^{2}}{(1-\chi)\varepsilon}\right)\leq({\overline{N}}-1)\log(\tau^{-1}). (90)

Using (88), we have

λε4(M¯f2+εL¯f)(N¯12log(N¯))\displaystyle\lambda\leq\frac{\varepsilon}{4(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})}\left(\frac{{\overline{N}}-1}{2\log(\overline{N})}\right) ε4(M¯f2+εL¯f)(N¯12log(N¯)1)\displaystyle\leq\frac{\varepsilon}{4(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})}\left(\frac{{\overline{N}}-1}{2\log(\overline{N})-1}\right)
=ε4(M¯f2+εL¯f)(N¯12log(exp(1/2)N¯))\displaystyle=\frac{\varepsilon}{4(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})}\left(\frac{{\overline{N}}-1}{2\log(\exp(-1/2){\overline{N}})}\right)
ε4(M¯f2+εL¯f)(N¯1log(exp(1/2)N¯)1)\displaystyle\leq\frac{\varepsilon}{4(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})}\left(\frac{{\overline{N}}-1}{\log(\exp(-1/2){\overline{N}})}-1\right) (91)

where the last inequality above is easily checked for N¯2{\overline{N}}\geq 2. Using (88), (79), and (91), we have

τ1τ11=(1τ)1=(79)1+4(M¯f2+εL¯f)λε(91)N¯1log(exp(1/2)N¯)(88)N¯1log(8M¯f2λ(1χ)ε),\frac{\tau^{-1}}{\tau^{-1}-1}=(1-\tau)^{-1}\stackrel{{\scriptstyle\eqref{deftauk}}}{{=}}1+\frac{4(\overline{M}_{f}^{2}+\varepsilon\overline{L}_{f})\lambda}{\varepsilon}\stackrel{{\scriptstyle\eqref{ineqser3}}}{{\leq}}\frac{{\overline{N}}-1}{\log(\exp(-1/2){\overline{N}})}\stackrel{{\scriptstyle\eqref{ublambdak}}}{{\leq}}\frac{{\overline{N}}-1}{\log\left(\frac{8{\overline{M}}_{f}^{2}\lambda}{(1-\chi)\varepsilon}\right)},

which, because of the inequality log(τ1)(τ11)/τ1\log(\tau^{-1})\geq(\tau^{-1}-1)/\tau^{-1}, implies inequality (90).    
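
As a numeric illustration of Lemma 3.10 (placeholder values, case N̄ ≥ 2), the following snippet checks that a stepsize satisfying (88) indeed drives the right-hand side of (89) below ε/2:

import math

# Numeric illustration of Lemma 3.10 (placeholder values, case Nbar >= 2): a stepsize
# lam <= tilde_lambda(eps) from (88) drives the right-hand side of (89) below eps/2.
chi, eps, Mf, Lf, Nbar = 0.5, 1e-2, 3.0, 10.0, 20
lam = min((Nbar - 1) * eps / (8 * (Mf**2 + eps * Lf) * math.log(Nbar)),
          (1 - chi) * math.exp(-0.5) * Nbar * eps / (8 * Mf**2),
          (1 - chi) / (2 * Lf))
tau = 1 - 1 / (1 + 4 * lam * (Mf**2 + eps * Lf) / eps)   # as in (79)
rhs89 = 4 * lam * Mf**2 / (1 - chi) * tau ** (Nbar - 1)
assert rhs89 <= eps / 2 + 1e-12
print(f"lam = {lam:.3e}, tau = {tau:.4f}, RHS of (89) = {rhs89:.3e} <= eps/2 = {eps/2:.1e}")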

We are now ready to prove Proposition 3.4.

Proof of Proposition 3.4. a) This statement immediately follows from the description of U-PB.

b) By Lemma 3.10, if a cycle has λ ≤ λ̃(ε) with λ̃(ε) given in (88), then the cycle ends with a serious step, so λ is kept unchanged and all subsequent cycles are serious. Therefore, if λ_0 ≤ λ̃(ε)/2, we have λ_k = λ_0 = min{λ_0, λ̃(ε)/2} ≥ ε/U̅(ε) for all k (the last inequality can be easily checked using the definitions of λ̃(ε) and U̅(ε)). Otherwise, if λ_0 > λ̃(ε)/2, then λ_k can never fall below λ̃(ε)/2 (λ is only ever halved, and once it lies in (λ̃(ε)/2, λ̃(ε)] the cycle is serious and λ stops decreasing), i.e., λ_k ≥ λ̃(ε)/2 = min{λ_0, λ̃(ε)/2} ≥ ε/U̅(ε) for all k.

c) This statement immediately follows from (b) and the update rule of λ{\lambda} in U-PB.

d) Relations (69) and (70) (which are (5) and (4) in FSCO, respectively) follow from (62) and tjεt_{j}\leq\varepsilon with x^k=xj\hat{x}_{k}=x_{j}, Γ^k=fj+h\hat{\Gamma}_{k}=f_{j}+h, y^k=yj\hat{y}_{k}=y_{j}, and λk=λ\lambda_{k}=\lambda (see the serious update in step 3 of U-PB).  
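
The inequality ε/U̅(ε) ≤ min{λ_0, λ̃(ε)/2} invoked in part (b) can be spot-checked numerically; the snippet below does so for one illustrative (placeholder) choice of parameters in the case N̄ ≥ 2:

import math

# Numeric spot-check (placeholder values, Nbar >= 2) of the inequality
# eps / U_bar(eps) <= min{lam0, tilde_lambda(eps)/2} invoked in part (b).
chi, eps, Mf, Lf, Nbar, lam0 = 0.3, 1e-2, 3.0, 10.0, 8, 2.0
tilde = min((Nbar - 1) * eps / (8 * (Mf**2 + eps * Lf) * math.log(Nbar)),
            (1 - chi) * math.exp(-0.5) * Nbar * eps / (8 * Mf**2),
            (1 - chi) / (2 * Lf))
Ubar = eps * (1 / lam0 + 40 * Lf / (1 - chi)) \
       + 32 * Mf**2 * (1 + math.log(Nbar)) / ((1 - chi) * Nbar)
assert eps / Ubar <= min(lam0, tilde / 2)
print(f"eps/U_bar = {eps/Ubar:.3e} <= min(lam0, tilde/2) = {min(lam0, tilde/2):.3e}")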

4 Concluding remarks

In this paper, we present two μ\mu-universal methods, namely U-CS and U-PB, to solve HCO (1). We propose FSCO to analyze both methods in a unified manner and establish both functional and stationary complexity bounds. We then prove that both U-CS and U-PB are instances of FSCO and apply the complexity bounds for FSCO to obtain iteration complexities for the two methods. One interesting property of our proposed methods is that they do not rely on any restart scheme based on estimating μ\mu or knowing ϕ\phi_{*}.

Some papers about universal methods (see for example [10, 21]) assume that, for some α ∈ [0,1], f in (1) has α-Hölder continuous gradient, i.e., there exists L_α ≥ 0 such that ‖∇f(x) − ∇f(y)‖ ≤ L_α‖x − y‖^α for every x, y ∈ dom h. It is shown in [21] that the universal primal gradient method proposed in it (i.e., the U-CS method with χ = 0) finds an ε̄-solution of (1) in

𝒪~(d02Lα2α+1ε¯2α+1)\tilde{\cal O}\left(\frac{d_{0}^{2}L_{\alpha}^{\frac{2}{\alpha+1}}}{\bar{\varepsilon}^{\frac{2}{\alpha+1}}}\right) (92)

iterations. This result also follows as a consequence of our results in this paper. Indeed, first note that the dominant term in the iteration complexity (50) for the U-CS method is 𝒪~(d02(M¯f2+ε¯L¯f)/ε¯2)\tilde{\cal O}(d_{0}^{2}(\overline{M}_{f}^{2}+\bar{\varepsilon}\overline{L}_{f})/\bar{\varepsilon}^{2}). Second, Proposition 2.1 of [17] implies that there exists a pair (Mf,Lf)(M_{f},L_{f}) as in (A5) and that the ε¯\bar{\varepsilon}-best pair (M¯f,L¯f)(\overline{M}_{f},\overline{L}_{f}) defined below (45) satisfies

M¯f2+ε¯L¯f2ε¯2αα+1Lα2α+1.\overline{M}_{f}^{2}+\bar{\varepsilon}\overline{L}_{f}\leq 2\bar{\varepsilon}^{\frac{2\alpha}{\alpha+1}}L_{\alpha}^{\frac{2}{\alpha+1}}.

Hence, it follows from these two observations that (50) is sharper than the bound (92) obtained in [21].

We finally discuss some possible extensions of the analysis in this paper. First, it is shown in Theorems 3.2 and 3.3 (resp., Theorem 3.6) that U-CS (resp., U-PB) is μ-universal if χ > 0 and is ν-universal if χ = 0. It would be interesting to investigate whether these methods are also μ-universal for χ = 0. Note that this question is related to whether the universal primal gradient method of [21] (which is the same as U-CS with χ = 0) is μ-universal. Second, it would also be interesting to study whether the general results obtained for the FSCO framework can be used to show that other methods for solving the HCO problem (1) are μ-universal.

References

  • [1] T. Alamo, P. Krupa, and D. Limon. Gradient based restart FISTA. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3936–3941. IEEE, 2019.
  • [2] T. Alamo, P. Krupa, and D. Limon. Restart of accelerated first-order methods with linear convergence under a quadratic functional growth condition. IEEE Transactions on Automatic Control, 68(1):612–619, 2022.
  • [3] T. Alamo, D. Limon, and P. Krupa. Restart FISTA with global linear convergence. In 2019 18th European Control Conference (ECC), pages 1969–1974. IEEE, 2019.
  • [4] J-F Aujol, L. Calatroni, C. Dossal, H. Labarrière, and A. Rondepierre. Parameter-free FISTA by adaptive restart and backtracking. arXiv preprint arXiv:2307.14323, 2023.
  • [5] J-F Aujol, C. Dossal, and A. Rondepierre. FISTA is an automatic geometrically optimized algorithm for strongly convex functions. Mathematical Programming, pages 1–43, 2023.
  • [6] J-F Aujol, C. H. Dossal, H. Labarrière, and A. Rondepierre. FISTA restart using an automatic estimation of the growth parameter. Preprint, 2022.
  • [7] Y. Du and A. Ruszczyński. Rate of convergence of the bundle method. Journal of Optimization Theory and Applications, 173(3):908–922, 2017.
  • [8] O. Fercoq and Z. Qu. Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA Journal of Numerical Analysis, 39(4):2069–2095, 2019.
  • [9] B. Grimmer. On optimal universal first-order methods for minimizing heterogeneous sums. Optimization Letters, pages 1–19, 2023.
  • [10] G. Lan. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Mathematical Programming, 149(1-2):1–45, 2015.
  • [11] G. Lan and R. D. C. Monteiro. Iteration-complexity of first-order penalty methods for convex programming. Mathematical Programming, 138(1):115–139, 2013.
  • [12] G. Lan, Y. Ouyang, and Z. Zhang. Optimal and parameter-free gradient minimization methods for convex and nonconvex optimization. arXiv: 2310.12139, 2023.
  • [13] T. Li and G. Lan. A simple uniformly optimal method without line search for convex optimization. arXiv:2310.10082, 2023.
  • [14] J. Liang and R. D. C. Monteiro. An average curvature accelerated composite gradient method for nonconvex smooth composite optimization problems. SIAM Journal on Optimization, 31(1):217–243, 2021.
  • [15] J. Liang and R. D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes. SIAM Journal on Optimization, 31(4):2955–2986, 2021.
  • [16] J. Liang and R. D. C. Monteiro. Average curvature fista for nonconvex smooth composite optimization problems. Computational Optimization and Applications, 86(1):275–302, 2023.
  • [17] J. Liang and R. D. C. Monteiro. A unified analysis of a class of proximal bundle methods for solving hybrid convex composite optimization problems. Mathematics of Operations Research, 49(2):832–855, 2024.
  • [18] J. G. Melo, R. D. C. Monteiro, and H. Wang. A proximal augmented lagrangian method for linearly constrained nonconvex composite optimization problems. Journal of Optimization Theory and Applications, pages 1–33, 2023.
  • [19] K. Mishchenko and Y. Malitsky. Adaptive gradient descent without descent. In 37th International Conference on Machine Learning (ICML 2020), 2020.
  • [20] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
  • [21] Y. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381–404, 2015.
  • [22] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15:715–732, 2015.
  • [23] J. Renegar and B. Grimmer. A simple nearly optimal restart scheme for speeding up first-order methods. Foundations of Computational Mathematics, 22(1):211–256, 2022.
  • [24] V. Roulet and A. d’Aspremont. Sharpness, restart and acceleration. SIAM Journal on Optimization, 30:262–289, 2020.
  • [25] D. Zhou, S. Ma, and J. Yang. AdaBB: Adaptive Barzilai-Borwein method for convex optimization. arXiv:2401.08024, 2024.

Appendix A Technical results

A.1 Technical results for FSCO

Lemma A.1

Assume that the sequences {γ_j}, {η_j}, and {α_j} satisfy, for every j ≥ 1, γ_j ≥ γ̲ and

γjηjαj1(1+σ)αj+γjδ\gamma_{j}\eta_{j}\leq\alpha_{j-1}-(1+\sigma)\alpha_{j}+\gamma_{j}\delta (93)

for some σ0\sigma\geq 0, δ0\delta\geq 0 and γ¯>0{\underline{\gamma}}>0. Then, the following statements hold:

  • a)

    for every k1k\geq 1,

    min1jkηjα0(1+σ)kαkj=1k(1+σ)j1γj+δ;\min_{1\leq j\leq k}\eta_{j}\leq\frac{\alpha_{0}-(1+\sigma)^{k}\alpha_{k}}{\sum_{j=1}^{k}(1+\sigma)^{j-1}\gamma_{j}}+\delta; (94)
  • b)

    if the sequence {ηj}\{\eta_{j}\} is nonnegative, then for every k1k\geq 1,

    αkα0(1+σ)k+j=1k(1+σ)j1γjδ(1+σ)k;\alpha_{k}\leq\frac{\alpha_{0}}{(1+\sigma)^{k}}+\frac{\sum_{j=1}^{k}(1+\sigma)^{j-1}\gamma_{j}\delta}{(1+\sigma)^{k}}; (95)
  • c)

    if the sequence {αj}\{\alpha_{j}\} is nonnegative, then min1jkηj2δ\min_{1\leq j\leq k}\eta_{j}\leq 2\delta for every k1k\geq 1 such that

    kmin{1+σσlog(σα0γ¯δ+1),α0γ¯δ}k\geq\min\left\{\frac{1+\sigma}{\sigma}\log\left(\frac{\sigma\alpha_{0}}{{\underline{\gamma}}\delta}+1\right),\frac{\alpha_{0}}{{\underline{\gamma}}\delta}\right\}

    with the convention that the first term is equal to the second term when σ=0\sigma=0. (Note that the first term converges to the second term as σ0\sigma\downarrow 0.)

Proof: a) Multiplying (93) by (1+σ)j1(1+\sigma)^{j-1} and summing the resulting inequality from j=1j=1 to kk, we have

j=1k(1+σ)j1γj[min1jkηj]\displaystyle\sum_{j=1}^{k}(1+\sigma)^{j-1}\gamma_{j}\left[\min_{1\leq j\leq k}\eta_{j}\right] j=1k(1+σ)j1γjηjj=1k(1+σ)j1(αj1(1+σ)αj+γjδ)\displaystyle\leq\sum_{j=1}^{k}(1+\sigma)^{j-1}\gamma_{j}\eta_{j}\leq\sum_{j=1}^{k}(1+\sigma)^{j-1}\left(\alpha_{j-1}-(1+\sigma)\alpha_{j}+\gamma_{j}\delta\right)
=α0(1+σ)kαk+j=1k(1+σ)j1γjδ.\displaystyle=\alpha_{0}-(1+\sigma)^{k}\alpha_{k}+\sum_{j=1}^{k}(1+\sigma)^{j-1}\gamma_{j}\delta. (96)

Inequality (94) follows immediately from the above inequality.

b) This statement follows immediately from (96) and the fact that ηj0\eta_{j}\geq 0.

c) It follows from (94), and the facts that αk0\alpha_{k}\geq 0 and γjγ¯\gamma_{j}\geq{\underline{\gamma}} that

min1jkηjα0γ¯j=1k(1+σ)j1+δ.\min_{1\leq j\leq k}\eta_{j}\leq\frac{\alpha_{0}}{{\underline{\gamma}}\sum_{j=1}^{k}(1+\sigma)^{j-1}}+\delta. (97)

Using the fact that 1+σeσ/(1+σ)1+\sigma\geq e^{\sigma/(1+\sigma)} for every σ0\sigma\geq 0, we have

j=1k(1+σ)j1=max{(1+σ)k1σ,k}max{eσk/(1+σ)1σ,k}.\sum_{j=1}^{k}(1+\sigma)^{j-1}=\max\left\{\frac{(1+\sigma)^{k}-1}{\sigma},k\right\}\geq\max\left\{\frac{e^{\sigma k/(1+\sigma)}-1}{\sigma},k\right\}. (98)

Plugging the above inequality into (97), we have for every k1k\geq 1,

min1jkηjα0γ¯min{σeσk/(1+σ)1,1k}+δ,\min_{1\leq j\leq k}\eta_{j}\leq\frac{\alpha_{0}}{{\underline{\gamma}}}\min\left\{\frac{\sigma}{e^{\sigma k/(1+\sigma)}-1},\frac{1}{k}\right\}+\delta,

which can be easily seen to imply (c).    
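
A quick numeric verification of part (a) on synthetic sequences satisfying (93) with equality (all sequence choices below are placeholders) may be illustrative:

import numpy as np

# Numeric verification of Lemma A.1(a) on synthetic sequences that satisfy (93)
# with equality; all sequence choices below are placeholders.
rng = np.random.default_rng(0)
k, sigma, delta = 50, 0.1, 0.05
gamma = rng.uniform(1.0, 2.0, k)                   # gamma_j >= underline-gamma = 1
alpha = np.concatenate(([5.0], rng.uniform(0.0, 5.0, k)))
eta = (alpha[:-1] - (1 + sigma) * alpha[1:] + gamma * delta) / gamma
w = (1 + sigma) ** np.arange(k)                    # weights (1+sigma)^(j-1)
bound = (alpha[0] - (1 + sigma) ** k * alpha[-1]) / np.sum(w * gamma) + delta
assert eta.min() <= bound + 1e-9
print(f"min_j eta_j = {eta.min():.4f} <= bound (94) = {bound:.4f}")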

A.2 Technical results for U-PB

Lemma A.2

The following statements hold for a cycle of U-PB with stepsize λ\lambda:

  • a)

    for every iteration jj that is not the last iteration of the cycle, there exists a function f¯j()\overline{f}_{j}(\cdot) such that

    τ(f¯j+h)+(1τ)[f(;xj)+h]fj+1+hϕ,\displaystyle\tau(\overline{f}_{j}+h)+(1-\tau)[\ell_{f}(\cdot;x_{j})+h]\leq f_{j+1}+h\leq\phi, (99)
    f¯j+hConv¯ν(n),f¯j(xj)=fj(xj),xj=argminun{f¯j(u)+h(u)+12λux^k12},\displaystyle\overline{f}_{j}+h\in\overline{\mbox{\rm Conv}}_{\nu}\,(\mathbb{R}^{n}),\quad\overline{f}_{j}(x_{j})=f_{j}(x_{j}),\quad x_{j}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\overline{f}_{j}(u)+h(u)+\frac{1}{2{\lambda}}\|u-\hat{x}_{k-1}\|^{2}\right\}, (100)

    where τ\tau is as in (79);

  • b)

    for every iteration jj of the cycle and unu\in\mathbb{R}^{n}, we have

    f¯j(u)+h(u)+12λux^k12mj+12λuxj2.\overline{f}_{j}(u)+h(u)+\frac{1}{2{\lambda}}\|u-\hat{x}_{k-1}\|^{2}\geq m_{j}+\frac{1}{2{\lambda}}\|u-x_{j}\|^{2}. (101)

Proof: a) Since jj is not the last iteration of a cycle, we have fj+1=BU(x^k1,xj,fj,λ)f_{j+1}=\mbox{BU}(\hat{x}_{k-1},x_{j},f_{j},\lambda). Using the properties of the BU blackbox, it follows that there exists f¯j\overline{f}_{j} such that f¯j+hConv¯ν(n)\overline{f}_{j}+h\in\overline{\mbox{\rm Conv}}_{\nu}\,(\mathbb{R}^{n}), f¯j(xj)=fj(xj)\overline{f}_{j}(x_{j})=f_{j}(x_{j}),

max{f¯j+h,f(;xj)+h}fj+1+hϕ,\max\{{\bar{f}}_{j}+h,\ell_{f}(\cdot;x_{j})+h\}\leq f_{j+1}+h\leq\phi, (102)

and

xj=argminun{f¯j(u)+h(u)+12λux^k12}.x_{j}=\underset{u\in\mathbb{R}^{n}}{\mathrm{argmin}\,}\left\{\overline{f}_{j}(u)+h(u)+\frac{1}{2{\lambda}}\|u-\hat{x}_{k-1}\|^{2}\right\}.

We have therefore checked that (100) holds while (99) is an immediate consequence of (102).

b) Since the objective function in the last identity of (100) is λ1{\lambda}^{-1}-strongly convex, we have

f¯j(u)+h(u)+12λux^k12f¯j(xj)+h(xj)+12λxjx^k12+12λuxj2.\overline{f}_{j}(u)+h(u)+\frac{1}{2{\lambda}}\|u-\hat{x}_{k-1}\|^{2}\geq\overline{f}_{j}(x_{j})+h(x_{j})+\frac{1}{2{\lambda}}\|x_{j}-\hat{x}_{k-1}\|^{2}+\frac{1}{2{\lambda}}\|u-x_{j}\|^{2}. (103)

The statement follows from (103), the first identity in (100), and the definition of mjm_{j} in (77).