This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Complexity of Inexact Proximal Point Algorithm for minimizing convex functions with Holderian Growth

\nameAndrei Pătraşcu and Paul Irofti a Email: [email protected], [email protected] a Research Center for Logic, Optimization and Security (LOS),
Department of Computer Science, Faculty of Mathematics and Computer Science,
University of Bucharest, Academiei 14, Bucharest, Romania.
Abstract

Several decades ago the Proximal Point Algorithm (PPA) started to gain a long-lasting attraction for both abstract operator theory and numerical optimization communities. Even in modern applications, researchers still use proximal minimization theory to design scalable algorithms that overcome nonsmoothness. Remarkable works as [9, 4, 5, 51] established tight relations between the convergence behaviour of PPA and the regularity of the objective function. In this manuscript we derive nonasymptotic iteration complexity of exact and inexact PPA to minimize convex functions under γ\gamma-Holderian growth: 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right) (for γ[1,2]\gamma\in[1,2]) and 𝒪(1/ϵγ2)\mathcal{O}\left(1/\epsilon^{\gamma-2}\right) (for γ>2\gamma>2). In particular, we recover well-known results on PPA: finite convergence for sharp minima and linear convergence for quadratic growth, even under presence of deterministic noise. Moreover, when a simple Proximal Subgradient Method is recurrently called as an inner routine for computing each IPPA iterate, novel computational complexity bounds are obtained for Restarting Inexact PPA. Our numerical tests show improvements over existing restarting versions of the Subgradient Method.

keywords:
Inexact proximal point, weak sharp minima, Holderian growth, finite termination.
articletype: ARTICLE

Received: date / Accepted: date

1 Introduction

The problem of interest of our paper formulates as the following convex nonsmooth minimization:

F=minx𝐑n{F(x):=f(x)+ψ(x)}.\displaystyle F^{*}=\min_{x\in\mathbf{R}^{n}}\;\{F(x):=f(x)+\psi(x)\}. (1)

Here we assume that f:𝐑n𝐑f:\mathbf{R}^{n}\mapsto\mathbf{R} is convex and ψ:𝐑n(,]\psi:\mathbf{R}^{n}\mapsto(-\infty,\infty] is convex, lower semicontinuous and proximable. By proximable function we refer to those functions whose proximal mapping is computable in closed form or linear time. The above model finds plenty of applications from which we shortly mention compressed sensing [3], sparse risk minimization [21, 52] and graph-regularized models [54]. Dating back to ’60s, the classical Subgradient Methods (SGM) [43, 44, 34, 35, 37] established 𝒪(1/ϵ2)\mathcal{O}\left(1/\epsilon^{2}\right) iteration complexity for minimizing convex functions up to finding x^\hat{x} such that f(x^)fϵf(\hat{x})-f^{*}\leq\epsilon. Despite the fact that this complexity order is unimprovable for the class of convex functions, particular growth or error bounds properties can be exploited to obtain better lower orders. Error bounds and regularity conditions have a long history in optimization, systems of inequalities or projection methods: [1, 7, 9, 34, 35, 37, 6, 14, 23]. Particularly, in the seminal works [36, 37] SGM is proved to converge linearly towards weakly sharp minima of FF. The optimal solutions XX^{*} is a set of weak sharp minima (WSM) if there exists σF>0\sigma_{F}>0 such that

WSM:F(x)FσFdistX(x),xdomF.\displaystyle WSM:\qquad F(x)-F^{*}\geq\sigma_{F}\text{dist}_{X^{*}}(x),\quad\forall x\in\text{dom}{F}.

Acceleration of other first-order algorithms has been proved under WSM in subsequent works as [1, 7, 8, 41]. Besides acceleration, [37, Section 5.2.3] introduces the “superstability” of sharp optimal solutions XX^{*}: under small perturbations of the objective function FF a subset of the weak sharp minima XX^{*} remains optimal for the perturbed model. The superstability of WSM was used in [36, 27] to show the robustness of inexact SGM. In short, using low persistently perturbed subgradients at each iteration, the resulted perturbed SGM still converges linearly to XX^{*}. In the line of these results, we also show in our manuscript that similar robustness holds for the proximal point methods under WSM.

Other recent works as [53, 16, 26, 22, 18, 51, 19, 14, 10, 23, 11, 38, 6, 17] look at a suite of different growth regimes besides WSM and use them to improve the complexity of first-order algorithms. Particularly, in our paper we are interested in the γ\gamma-Holderian growth : let γ1\gamma\geq 1

γHG:F(x)FσFdistXγ(x),xdomF.\displaystyle\gamma-HG:\qquad F(x)-F^{*}\geq\sigma_{F}\text{dist}_{X^{*}}^{\gamma}(x),\quad\forall x\in\text{dom}{F}.

Note that γHG\gamma-HG is equivalent to the Kurdyka–Łojaziewicz (KL) inequality for convex, closed, and proper functions, as shown in [6]. It includes the class of uniformly convex functions analyzed in [17] and, obviously, it covers the sharp minima WSM, for γ=1\gamma=1. The Quadratic Growth (QG), covered by γ=2\gamma=2, was analyzed in a large suite of previous works [22, 53, 23, 26] and, although is weaker than strong convexity, it could be essentially exploited (besides Lipschitz gradient continuity) to show 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right) complexity of proximal gradient methods11. Our analysis recover similar complexity orders under the same particular assumptions.

Some recent works [53, 16, 11, 10, 38] developed restarted SGM schemes, for minimizing convex functions under γ\gamma-HG or WSM, and analyzed their theoretical convergence and their natural dependence on the growth moduli γ\gamma and σF\sigma_{F}. Restarted SubGradient (RSG) of [53] and Decaying Stepsize - SubGradient (DS-SG) of [16] present iteration complexity estimates of 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right) under WSM and 𝒪(1ϵ2(γ1))\mathcal{O}\left(\frac{1}{\epsilon^{2(\gamma-1)}}\right) bound under γ\gamma-(HG) in order to attain distX(x)ϵ\text{dist}_{X^{*}}(x)\leq\epsilon. These bounds are optimal for bounded gradients functions, as observed by [29]. Most SGM schemes are dependent up to various degrees on the knowledge of problem information. For instance, RSG and DS-SG rely on lower bounds of optimal value FF^{*} and knowledge of parameters σF,γ\sigma_{F},\gamma or other Lipschitz constants. The restartation is introduced in order to avoid the exact estimation of modulus σF\sigma_{F}. Also, our schemes allows estimations of problem moduli such as γ,σF\gamma,\sigma_{F}, covering cases when these are not known. In the best case, when estimations are close to the true parameters, similar complexity estimates 𝒪(1ϵ2(γ1))\mathcal{O}\left(\frac{1}{\epsilon^{2(\gamma-1)}}\right) are provided in terms of subgradient evaluations. Moreover, by exploiting additional smooth structure we further obtain lower estimates.

The work of [17] approach the constrained model, i.e. ψ\psi is the indicator function of a closed convex set, and assume γ\gamma-uniform convexity:

f(αx+(1α)y)αf(x)+\displaystyle f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+ (1α)f(y)\displaystyle(1-\alpha)f(y)
12σfα(1α)[αγ1+(1α)γ1]xyγ,\displaystyle-\frac{1}{2}\sigma_{f}\alpha(1-\alpha)[\alpha^{\gamma-1}+(1-\alpha)^{\gamma-1}]\lVert x-y\rVert^{\gamma},

for all feasible x,yx,y and γ2\gamma\geq 2. The authors obtain optimal complexity bounds 𝒪(σf2/γϵ2(γ1))\mathcal{O}\left(\sigma_{f}^{-2/\gamma}\epsilon^{-2(\gamma-1)}\right) when the subgradients of ff are bounded. Moreover, their restartation technique are adaptive to growth modulus γ\gamma and parameter σF\sigma_{F}, up to a fixed number of iterations.

Inherent for all SGMs, the complexity results of these works essentially requires the boundedness of the subgradients, which is often natural for nondifferentiable functions. However, plenty of convex objective functions coming from risk minimization, sparse regression or machine learning presents, besides their particular growth, a certain smoothness degree which is not compatible with the subgradient boundedness assumption. Enclosing the feasible domain in a ball is an artificial remedy used to further keep the subgradients bounded, which however might load the implementation with additional uncertain tuning heuristics. Our analysis shows how to exploit smoothness in order to improve the complexity estimates.

The analysis of [41] investigates the effect of restarting over the optimal first-order schemes under γ\gamma-HG and ν\nu-Holder smoothness, starting from results of [29]. For ψ=0\psi=0, ϵ\epsilon-suboptimality is reached after 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right) accelerated gradient iterations if F\nabla F is Lipschitz continuous and 22-Holder growth holds, or after 𝒪(1/ϵγ22)\mathcal{O}\left(1/\epsilon^{\frac{\gamma-2}{2}}\right) iterations when the growth modulus is larger than 22. In general, if F\nabla F is ν\nu-Holder continuous, they restart the Universal Gradient Method and obtain an overall complexity of 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right) if γ=ν\gamma=\nu, or 𝒪(1/ϵ2(γν1)2ν1)\mathcal{O}\left(1/\epsilon^{\frac{2(\gamma-\nu-1)}{2\nu-1}}\right) if γ>ν\gamma>\nu. Although these estimates are unimprovable and better than ours, in general the implementation of the optimal schemes requires complete knowledge of growth and smoothness parameters.

Several decades ago the Proximal Point Algorithm (PPA) started to gain much attraction for both abstract operator theory and the numerical optimization communities. Even in modern applications, where large-scale nonsmooth optimization arises recurrently, practitioners still inspire from proximal minimization theory to design scalable algorithmic techniques that overcomes nonsmoothness. The powerful PPA iteration consists mainly in the recursive evaluation of the proximal operator associated to the objective function. The proximal mapping is based on the infimal convolution with a metric function, often chosen to be the squared Euclidean norm: proxμF(x):=argminzF(z)+12μzx2.\text{prox}_{\mu}^{F}(x):=\arg\min_{z}F(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}. The Proximal Point recursion:

xk+1=proxμF(xk).\displaystyle x^{k+1}=\text{prox}_{\mu}^{F}(x^{k}).

became famous in optimization community when [40, 39] and [4, 5] revealed its connection to various multipliers methods for constrained minimization, see also [31, 32, 28, 12, 13]. There are remarkable works that shown how the growth regularity is a key factor in the iteration complexity of PPA.

Finite convergence of the exact PPA under WSM is proved by [7, 9, 1]. Furthermore, in [5, 18] can be found an extensive convergence analysis of the exact PPA and the Augmented Lagrangian algorithm under γ\gamma-(HG). Although the results and analysis are of a remarkable generality, they are of asymptotic nature (see [51]). A nonasymptotic analysis is found in [51], where the equivalence between a Dual Augmented Lagrangian algorithm and a variable stepsize PPA is established. The authors analyze sparse learning models of the form: minx𝐑nf(Ax)+ψ(x),\min_{x\in\mathbf{R}^{n}}\;f(Ax)+\psi(x), where ff is twice differentiable with Lipschitz continuous gradient, AA a linear operator and ψ\psi a convex nonsmooth regularizer. Under γ\gamma-Holderian growth, ranging with γ[1,2]\gamma\in[1,2], they show nonasymptotic superlinear convergence rate of the exact PPA with exponentially increasing stepsize. For the inexact variant they kept further a slightly weaker superlinear convergence. The progress, from the asymptotic analysis of [40, 18] to a nonasymptotic one, is remarkable due to the simplicity of the arguments. However, a convergence rate of inexact PPA (IPPA) could become irrelevant without quantifying the local computational effort spent to compute each iteration, since one inexact iteration of PPA requires the approximate solution the regularized optimization problem. Among the remarkable references on inexact versions of various proximal algorithms are [25, 48, 49, 50, 47, 46, 20, 21, 24, 45, 31, 32].

We mention that, a small portion of the results on the WSM case, contained into this manuscript, has been recently published by the authors in [33]. However, we included it in the present manuscript for the sake of completeness.

Contributions. We list further our main contributions:

Inexact PPA under γ\gamma-(HG). We provide nonasymptotic iteration complexity bounds for IPPA to solve (1) under γ\gamma-HG, when γ1\gamma\geq 1. In particular, we obtain 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right) for γ[1,2]\gamma\in[1,2] and, in the case of best parameter choice, 𝒪(1/ϵγ2)\mathcal{O}\left(1/\epsilon^{\gamma-2}\right) for γ>2\gamma>2, to attain ϵ\epsilon distance to the optimal set. All these bounds require only convexity of the objective function FF and they are independent on any bounded gradients or smoothness. We could not find these nonasymptotic estimates in the literature for general γ1\gamma\geq 1.

Restartation. We further analyze the complexity bounds of restarting IPPA, that facilitates the derivation of better computational complexity estimates than the non-restarted IPPA. The complexity estimates have similar orders in both restarted and non-restarted algorithms for all γ1\gamma\geq 1.

Total computational complexity. We derive total complexity, including the inner computational effort spent at each IPPA iteration, in terms of number of inner (proximal) gradient evaluations. If ff has ν\nu-Holder continuous gradients we obtain that, in the case of best parameter choice, there are necessary:

[γ=1+ν]\displaystyle[\gamma=1+\nu]\qquad 𝒪(log(1/ϵ))\displaystyle\mathcal{O}\left(\log(1/\epsilon)\right)
[ν=1]\displaystyle[\nu=1]\qquad 𝒪(1/ϵγ2)\displaystyle\mathcal{O}\left(1/\epsilon^{\gamma-2}\right)
[ν=0]\displaystyle[\nu=0]\qquad 𝒪(1/ϵ2(γ1))\displaystyle\mathcal{O}\left(1/\epsilon^{2(\gamma-1)}\right)

proximal (sub)gradient evaluations to approach to ϵ\epsilon distance to the optimal set. As we discuss in the section 6, the total complexity is dependent on various restartation variables.

Experiments. Our numerical experiments confirm a better behaviour of the restarted IPPA, that uses an inner subgradient method routine, in comparison with other two restarting strategies of classical Subgradient Method. We performed our tests on several polyhedral learning models that includes Graph SVM and Matrix Completion, using synthetic and real data.

1.1 Notations and preliminaries

Now we introduce the main notations of our manuscript. For x,y𝐑nx,y\in\mathbf{R}^{n} denote the scalar product x,y=xTy\langle x,y\rangle=x^{T}y and Euclidean norm by x=xTx\|x\|=\sqrt{x^{T}x}. The projection operator onto set XX is denoted by πX\pi_{X} and the distance from xx to the set XX is denoted distX(x)=minzXxz\text{dist}_{X}(x)=\min_{z\in X}\lVert x-z\rVert. The indicator function of QQ is denoted by ιQ\iota_{Q}. Given function hh, then by h(k)h^{(k)} we denote the composition h(k)(x):=(hhh)ktimes(x)h^{(k)}(x):=\underbrace{\left(h\circ h\circ\cdots h\right)}_{k\;\text{times}}(x). We use h(x)\partial h(x) for the subdifferential set and h(x)h^{\prime}(x) for a subgradient of hh at xx. In differentiable case, when h\partial h is a singleton, h\nabla h will be eventually used instead of hh^{\prime}. By XX^{*} we denote the optimal set associated to (1) and by ϵ\epsilon- suboptimal point we understand a point xx that satisfies distX(x)ϵ\text{dist}_{X^{*}}(x)\leq\epsilon.

A function ff is called σ\sigma-strongly convex if the following relation holds:

f(x)f(y)+f(y),xy+σ2xy2x,y𝐑n.f(x)\geq f(y)+\langle f^{\prime}(y),x-y\rangle+\frac{\sigma}{2}\lVert x-y\rVert^{2}\qquad\forall x,y\in\mathbf{R}^{n}.

Let ν[0,1]\nu\in[0,1], then we say that a differentiable function ff has ν\nu-Holder continuous gradient with constant L>0L>0 if :

f(x)f(y)Lxyνx,y𝐑n.\displaystyle\lVert f^{\prime}(x)-f^{\prime}(y)\rVert\leq L\lVert x-y\rVert^{\nu}\quad\forall x,y\in\mathbf{R}^{n}.

Notice that when ν=0\nu=0, the Holder continuity describes nonsmooth functions with bounded gradients, i.e. f(x)L\lVert f^{\prime}(x)\rVert\leq L for all xdom(f)x\in\text{dom}(f). The 11-Holder continuity reduces to LL-Lipschitz gradient continuity.

Given a convex function ff, we denote its Moreau envelope [40, 5, 39] with fμf_{\mu} and its proximal operator with proxμf(x)\text{prox}_{\mu}^{f}(x), defined by:

fμ(x)\displaystyle f_{\mu}(x) =minzf(z)+12μzx2\displaystyle=\min\limits_{z}\;f(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}
proxμf(x)\displaystyle\text{prox}_{\mu}^{f}(x) =argminzf(z)+12μzx2.\displaystyle=\arg\min_{z}\;f(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}.

We recall the nonexpansiveness property of the proximal mapping [40]:

proxμf(x)proxμf(y)xyx,ydom(f).\displaystyle\lVert\text{prox}_{\mu}^{f}(x)-\text{prox}_{\mu}^{f}(y)\rVert\leq\lVert x-y\rVert\quad\forall x,y\in\text{dom}(f). (2)

Basic arguments from [40, 5, 39] show that the gradient fμ\nabla f_{\mu} is Lipschitz continuous with constant 1μ\frac{1}{\mu} and satisfies:

fμ(x)=1μ(xproxμf(x))f(proxμf(x)).\displaystyle\nabla f_{\mu}(x)=\frac{1}{\mu}\left(x-\text{prox}_{\mu}^{f}(x)\right)\in\partial f(\text{prox}_{\mu}^{f}(x)). (3)

In the differentiable case, obviously fμ(x)=f(proxμf(x))\nabla f_{\mu}(x)=\nabla f(\text{prox}_{\mu}^{f}(x)).

Paper structure. In section 2 we analyze how the growth properties of FF are also inherited by FμF_{\mu}. The key relations on FμF_{\mu} will become the basis for the complexity analysis. In section 3 we define the iteration of inexact Proximal Point algorithm and discuss its stopping criterion. The iteration complexity is presented in section 4 for both the exact and inexact case. Subsequently, the restarted IPPA is defined and its complexity is presented. Finally, in section 6 we quantify the complexity of IPPA in terms of proximal (sub)gradient iterations and compare with other results. In the last section we compare our scheme with the state-of-the-art restarting subgradient algorithms.

2 Holderian growth and Moreau envelopes

As discussed in the introduction, γ\gamma-HG relates tightly with widely known regularity properties such as WSM [53, 36, 37, 7, 9, 1, 8], Quadratic Growth (QG) [53, 22, 26], Error Bound [23] and Kurdika-Lojasiewicz inequality [6, 53].

Next we show how the Moreau envelope of a given convex function inherits its growth properties over its entire domain excepting a certain neighborhood of the optimal set. Recall that minxF(x)=minxFμ(x)\min_{x}F(x)=\min_{x}F_{\mu}(x).

Lemma 2.1.

Let FF be a convex function and let γ\gamma-(HG) hold. Then the Moreau envelope FμF_{\mu} satisfies the relations presented below.

(i)(i) Let γ=1\gamma=1 WSM:

Fμ(x)FHσF2μ(σFdistX(x)),\displaystyle F_{\mu}(x)-F^{*}\geq H_{\sigma^{2}_{F}\mu}(\sigma_{F}\text{dist}_{X^{*}}(x)),

where Hτ(s)={sτ2,s>τ12τs2,sτH_{\tau}(s)=\begin{cases}s-\frac{\tau}{2},&s>\tau\\ \frac{1}{2\tau}s^{2},&s\leq\tau\end{cases} is the Huber function.

(ii)(ii) Let γ=2\gamma=2:

Fμ(x)FσF1+2σFμdistX2(x).\displaystyle F_{\mu}(x)-F^{*}\geq\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\text{dist}_{X^{*}}^{2}(x).

(iii)(iii) For all γ1\gamma\geq 1:

Fμ(x)Fφ(γ)min{σFdistXγ(x),12μdistX2(x)},\displaystyle F_{\mu}(x)-F^{*}\geq\varphi(\gamma)\min\left\{\sigma_{F}\text{dist}^{\gamma}_{X^{*}}(x),\frac{1}{2\mu}\text{dist}^{2}_{X^{*}}(x)\right\},

where φ(γ)=minλ[0,1]λγ+(1λ)2\varphi(\gamma)=\min_{\lambda\in[0,1]}\lambda^{\gamma}+(1-\lambda)^{2}.

Proof.

The proof can be found in the Appendix. ∎

It is interesting to remark that the Moreau envelope FμF_{\mu} inherits a similar growth landscape as FF outside a given neighborhood of the optimal set. For instance, under WSM, outside the tube 𝒩(σFμ)={x𝐑n:distX(x)σFμ}}\mathcal{N}(\sigma_{F}\mu)=\{x\in\mathbf{R}^{n}:\;\text{dist}_{X^{*}}(x)\leq\sigma_{F}\mu\}\}, the Moreau envelope FμF_{\mu} grows sharply. Inside of 𝒩(σFμ)\mathcal{N}(\sigma_{F}\mu) it grows quadratically which, unlike the objective function FF, allows the gradient to get small near to the optimal set. This separation of growth regimes suggests that first-order algorithms that minimize FμF_{\mu} would reach very fast the region 𝒩(σFμ)\mathcal{N}(\sigma_{F}\mu), allowing large steps in a first phase and subsequently, they would slow down in the vicinity of the optimal set.

This discussion extends to general growths when γ>1\gamma>1, where a similar separation of behaviours holds for appropriate neighborhoods. Note that when FF has quadratic growth with constant σF\sigma_{F}, also the envelope FμF_{\mu} satisfies a quadratic growth with a smaller modulus σF1+2σFμ\frac{\sigma_{F}}{1+2\sigma_{F}\mu}.

Remark 1.

It will be useful in the subsequent sections to recall the connection between Holderian growth and Holderian error bound under convexity. Observe that by a simple use of convexity into γ\gamma-HG, we obtain for all xdomFx\in\text{dom}F

σFdistXγ(x)F(x)FF(x),xπX(x)F(x)distX(x),\displaystyle\sigma_{F}\text{dist}_{X^{*}}^{\gamma}(x)\leq F(x)-F^{*}\leq\langle F^{\prime}(x),x-\pi_{X^{*}}(x^{*})\rangle\leq\lVert F^{\prime}(x)\rVert\text{dist}_{X^{*}}(x), (4)

which immediately turns into the following error bound:

σFdistXγ1(x)F(x)xdomF.\displaystyle\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x)\leq\lVert F^{\prime}(x)\rVert\qquad x\in\text{dom}{F}. (5)

Under WSM, by replacing xx with proxμF(x)\text{prox}_{\mu}^{F}(x) into (5) and by using property (3) pointing that Fμ(x)F(proxμF(x))\nabla F_{\mu}(x)\in\partial F(\text{prox}_{\mu}^{F}(x)), we obtain as well a similar bound on Fμ()\lVert\nabla F_{\mu}(\cdot)\rVert at non-optimal points:

σFFμ(x)xX.\displaystyle\sigma_{F}\leq\lVert\nabla F_{\mu}(x)\rVert\quad\forall x\notin X^{*}. (6)

This is the traditional key relation for the finite convergence of PPA. Under γ\gamma-HG, starting from Lemma 2.1(iii)(iii), by using convexity of FμF_{\mu} and, further, by following similar inequalities as in (4), another error bound is obtained:

distX(x)max{[1σFφ(γ)Fμ(x)]1γ1,2μφ(γ)Fμ(x)}x.\displaystyle\text{dist}_{X^{*}}(x)\leq\max\left\{\left[\frac{1}{\sigma_{F}\varphi(\gamma)}\lVert\nabla F_{\mu}(x)\rVert\right]^{\frac{1}{\gamma-1}},\frac{2\mu}{\varphi(\gamma)}\lVert\nabla F_{\mu}(x)\rVert\right\}\quad\forall x. (7)

In the following section we start with the analysis of exact and inexact PPA. Aligning to old results on the subgradient algorithms back to [36], we illustrate the robustness induced by the weak sharp minima regularity.

3 Inexact Proximal Point algorithm

The basic exact PPA iteration is shortly described as

xk+1=proxμF(xk).\displaystyle x^{k+1}=\text{prox}_{\mu}^{F}(x^{k}).

Recall that by (3) one can express Fμ(xk)=1μ(xkproxμF(xk))\nabla F_{\mu}(x^{k})=\frac{1}{\mu}(x^{k}-\text{prox}_{\mu}^{F}(x^{k})), which makes PPA equivalent with the constant stepsize Gradient Method (GM) iteration:

xk+1=xkμFμ(xk).\displaystyle x^{k+1}=x^{k}-\mu\nabla F_{\mu}(x^{k}). (8)

Since our reasoning from below borrow simple arguments from classical GM analysis, we will use further (8) to express PPA. It is realistic not to rely on explicit proxμF(xk)\text{prox}_{\mu}^{F}(x^{k}), but an approximated one to a fixed accuracy. By using such an approximation, one can form an approximate gradient δFμ(xk)\nabla_{\delta}F_{\mu}(x^{k}) and interpret IPPA as an inexact Gradient Method. Let xdomFx\in\text{dom}{F}, then a basic δ\delta- approximation of Fμ(x)\nabla F_{\mu}(x) is

δFμ(x):=1μ(xz~),wherez~proxμF(x)δ.\displaystyle\nabla_{\delta}F_{\mu}(x):=\frac{1}{\mu}\left(x-\tilde{z}\right),\quad\text{where}\quad\lVert\tilde{z}-\text{prox}_{\mu}^{F}(x)\rVert\leq\delta. (9)

Other works as [40, 42, 48, 49] promotes similar approximation measures for inexact first order methods. Now we present the basic IPPA scheme with constant stepsize.

1Initialize k:=0\;k:=0
2 while  δkF(xk)>ϵ\lVert\nabla_{\delta_{k}}F(x^{k})\rVert>\epsilon or δk>ϵμ\delta_{k}>\frac{\epsilon}{\mu} do
3       Given xk computexk+1such that:xk+1proxμF(xk)δk\text{Given $x^{k}$ compute}\;\;x^{k+1}\;\text{such that}:\;\lVert x^{k+1}-\text{prox}_{\mu}^{F}(x^{k})\rVert\leq\delta_{k}
4       k:=k+1k:=k+1
5      
Return xkx^{k}
Algorithm 1 Inexact Proximal Point Algorithm(x0,μ,{δk}k0,ϵx^{0},\mu,\{\delta_{k}\}_{k\geq 0},\epsilon)

There already exist a variety of relative or absolute stopping rules in the literature for the class of first-order methods [51, 42, 37, 15]. However, since bounding the norms of gradients is commonly regarded as one of the simplest optimality measures, we use the following:

δkF(xk)<ϵδkϵμ\displaystyle\lVert\nabla_{\delta_{k}}F(x^{k})\rVert<\epsilon\qquad\delta_{k}\leq\frac{\epsilon}{\mu} (10)

which is computable by the nature of the iteration. The next result shows the relation between (10) and the distance to optimal set.

Lemma 3.1.

Let μ,δ>0\mu,\delta>0 and assume xXx\notin X^{*} then:

(i)(i) Let γ=1\gamma=1, if δFμ(xk)+δμ<σF\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\frac{\delta}{\mu}<\sigma_{F}, then

distX(xk)μδFμ(xk)+δandproxμF(xk)=πX(xk).\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\mu\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\delta\quad\text{and}\quad\text{prox}_{\mu}^{F}(x^{k})=\pi_{X^{*}}(x^{k}). (11)

(ii)(ii) Let γ>1\gamma>1, then: distX(xk)max{[μδFμ(xk)+δμσFφ(γ)]1γ1,2μδFμ(xk)+δφ(γ)}.\;\;\text{dist}_{X^{*}}(x^{k})\leq\max\left\{\left[\frac{\mu\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\delta}{\mu\sigma_{F}\varphi(\gamma)}\right]^{\frac{1}{\gamma-1}},\frac{2\mu\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\delta}{\varphi(\gamma)}\right\}.

Proof.

The proof can be found in the Appendix. ∎

If XX^{*} are weakly sharp minimizers, the above lemma states that, for sufficiently small δ\delta and δFμ(xk)\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert one can directly guarantee that xkx^{k} is close to XX^{*}. From another viewpoint, Lemma 3.1 also suggests for WSM that a sufficiently large μ>distX(x0)σF\mu>\frac{\text{dist}_{X^{*}}(x^{0})}{\sigma_{F}} provides proxX(x0)=πX(x0)\text{prox}_{X^{*}}(x^{0})=\pi_{X^{*}}(x^{0}) and a δ\delta-optimal solution is obtained as the output of the first IPPA iteration. A similar result stated in [9], guarantees the existence of a sufficiently large smoothing value μ\mu which makes PPA to converge in a single (exact) iteration.

4 Iteration complexity of IPPA

For reasons that will be clear later, note that some of the results from below are given for constant inexactness noise δk=δ\delta_{k}=\delta. The Holderian growth leads naturally to recurrences on the residual distance to the optimal set, that allow us to give direct complexity bounds.

Theorem 4.1.

Let FF be convex and γ\gamma-HG hold.

(i)(i) Under sharp minima γ=1\gamma=1, let {δk}k0\{\delta_{k}\}_{k\geq 0} nonincreasing and assume distX(x0)μσF\text{dist}_{X^{*}}(x^{0})\geq\mu\sigma_{F}, then:

distX(xk)max{distX(x0)i=0k1(μσFδi),δk1}.\displaystyle\;\;\text{dist}_{X^{*}}(x^{k})\leq\max\left\{\text{dist}_{X^{*}}(x^{0})-\sum\limits_{i=0}^{k-1}(\mu\sigma_{F}-\delta_{i}),\delta_{k-1}\right\}.

(ii)(ii) Under quadratic growth γ=2\gamma=2, let i0δi<Γ<\sum\limits_{i\geq 0}\delta_{i}<\Gamma<\infty then:

distX(xk)[11+2μσF]k44(distX(x0)+Γ)+(1+11+2μσF1)δk2+1.\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\left[\frac{1}{1+2\mu\sigma_{F}}\right]^{\frac{k-4}{4}}(\text{dist}_{X^{*}}(x^{0})+\Gamma)+\left(1+\frac{1}{\sqrt{1+2\mu\sigma_{F}}-1}\right)\delta_{\lceil\frac{k}{2}\rceil+1}.

(iii)(iii) Let γ\gamma-HG hold. Then the following convergence rate holds:

distX(xk)\displaystyle\text{dist}_{X^{*}}(x^{k}) max{h(k)(distX(x0)),δ¯k}\displaystyle\leq\max\left\{h^{(k)}(\text{dist}_{X^{*}}(x^{0})),\bar{\delta}_{k}\right\}

where h(r)={max{rμφ(γ)σF2rγ1,1+1φ(γ)2r},ifγ(1,2)max{rμφ(γ)σF2rγ1,1+1φ(γ)2r,(11γ1)r},ifγ>2.h(r)=\begin{cases}\max\left\{r-\frac{\mu\varphi(\gamma)\sigma_{F}}{2}r^{\gamma-1},\frac{1+\sqrt{1-\varphi(\gamma)}}{2}r\right\},&\text{if}\;\gamma\in(1,2)\\ \max\left\{r-\frac{\mu\varphi(\gamma)\sigma_{F}}{2}r^{\gamma-1},\frac{1+\sqrt{1-\varphi(\gamma)}}{2}r,\left(1-\frac{1}{\gamma-1}\right)r\right\},&\text{if}\;\gamma>2.\end{cases} and δ¯k=max{h(δ¯k1),(2δkμφ(γ)σF)1γ1,2δk11φ(γ)}.\bar{\delta}_{k}=\max\left\{h(\bar{\delta}_{k-1}),\left(\frac{2\delta_{k}}{\mu\varphi(\gamma)\sigma_{F}}\right)^{\frac{1}{\gamma-1}},\frac{2\delta_{k}}{1-\sqrt{1-\varphi(\gamma)}}\right\}.

Proof.

The proof can be found in the appendix. ∎

Note that in general, all the abstract convergence rates of Theorem 4.1 depend on two terms: the first one illustrates the reduction of the initial distance to optimum and the second term reflects the accuracy of the approximated gradient. Therefore, after a finite number (for γ=1\gamma=1) or at most 𝒪(log(1/ϵ))\mathcal{O}(\log(1/\epsilon)) (for γ>1\gamma>1) IPPA iterations, the evolution of inner accuracy δk\delta_{k} becomes the main bottleneck of the convergence process. The above theorem provides an abstract insight about how fast should the accuracy {δk}k0\{\delta_{k}\}_{k\geq 0} decay in order that {xk}k0\{x^{k}\}_{k\geq 0} attains the best rate towards the optimal set in terms of distX(xk)\text{dist}_{X^{*}}(x^{k}). In short, (i)(i) shows that for WSM a recurrent constant decrease on distX(xk)\text{dist}_{X^{*}}(x^{k}) is established only if δk<μσF\delta_{k}<\mu\sigma_{F}, while the noise δk\delta_{k} is not necessary to vanish. This aspect will be discussed in more details below. The last parts (ii)(ii) and (iii)(iii) suggest that a linear decay of δk\delta_{k} (for γ=2\gamma=2) and, respectively, δk+1=h(δk)\delta_{k+1}=h(\delta_{k}) for general γ>1\gamma>1, ensure the fastest convergence of the residual.

Several previous works as [27, 36, 37] analyzed perturbed and incremental SGM algorithms, under WSM, that use noisy estimates GkG_{k} of subgradient F(xk)F^{\prime}(x^{k}). A surprising common conclusion of these works is that under a sufficiently low persistent noise: 0<F(xk)Gk<σF,0<\lVert F^{\prime}(x^{k})-G_{k}\rVert<\sigma_{F}, for all k0k\geq 0, SGM still converges linearly to the optimal set. Although IPPA is based on a similar approximate first order information, notice that the smoothed objective FμF_{\mu} do not satisfy the pure WSM property. However, our next result states, in a similar context of small but persistent noise of magnitude at most δμ\frac{\delta}{\mu}, that IPPA attains δ\delta-accuracy after a finite number of iterations.

Corollary 4.2.

Let δk=δ<μσF\delta_{k}=\delta<\mu\sigma_{F} and γ=1\gamma=1, then after

K=distX(x0)μσFδ\displaystyle K=\left\lceil\frac{\text{dist}_{X^{*}}(x^{0})}{\mu\sigma_{F}-\delta}\right\rceil

IPPA iterations, xKx^{K} satisfies proxμF(xK)=πX(xK)\text{prox}_{\mu}^{F}(x^{K})=\pi_{X^{*}}(x^{K}) and distX(xK)δ\text{dist}_{X^{*}}(x^{K})\leq\delta.

Proof.

The proof can be found in the Appendix. ∎

To conclude, if the noise magnitude δμ\frac{\delta}{\mu} is below the threshold σF\sigma_{F}, or equivalently 0<δ:=Fμ(xk)δFμ(xk)<σF0<\delta^{\nabla}:=\lVert\nabla F_{\mu}(x^{k})-\nabla_{\delta}F_{\mu}(x^{k})\rVert<\sigma_{F} then after a finite number of iterations IPPA reaches δ\delta distance to XX^{*}. We see that under sufficiently low persistent noise, IPPA still guarantees convergence to the optimal set assuming the existence of an inner routine that computes each iteration. In other words, ”noisy” Proximal Point algorithms share similar stability properties as the noisy Subgradient schemes from [27, 36, 37]. This discussion can be extended to general decreasing {δk}k0\{\delta_{k}\}_{k\geq 0}.

We show next that Theorem 4.1 covers well-known results on exact PPA.

Corollary 4.3.

Let {xk}k0\{x^{k}\}_{k\geq 0} be the sequence of exact PPA, i.e. δk=0\delta_{k}=0. By denoting r0=distX(x0)r_{0}=\text{dist}_{X^{*}}(x^{0}), an ϵ\epsilon-suboptimal iterate is attained, i.e. distX(xk)ϵ\text{dist}_{X^{*}}(x^{k})\leq\epsilon, after the number of iterations:

𝒦e(γ,ϵ)={r0ϵμσFifγ=1𝒪(1μσFlog(r0ϵ))ifγ=2𝒪(min{𝒯2γlog(r0ϵ),𝒯}),ifγ(1,2),ϵ(μφ(γ)σF)12γ𝒪(log(min{r0,(2μσF)12γ}/ϵ))ifγ(1,2),ϵ<(μφ(γ)σF)12γ𝒪(1ϵγ2)ifγ>2.\displaystyle\mathcal{K}_{e}(\gamma,\epsilon)=\begin{cases}\left\lceil\frac{r_{0}-\epsilon}{\mu\sigma_{F}}\right\rceil&\text{if}\;\gamma=1\\ \left\lceil\mathcal{O}\left(\frac{1}{\mu\sigma_{F}}\log\left(\frac{r_{0}}{\epsilon}\right)\right)\right\rceil&\text{if}\;\gamma=2\\ \left\lceil\mathcal{O}\left(\min\left\{\mathcal{T}^{2-\gamma}\log\left(\frac{r_{0}}{\epsilon}\right),\mathcal{T}\right\}\right)\right\rceil,&\text{if}\;\gamma\in(1,2),\;\epsilon\geq(\mu\varphi(\gamma)\sigma_{F})^{\frac{1}{2-\gamma}}\\ \left\lceil\mathcal{O}\left(\log\left(\min\left\{r_{0},(2\mu\sigma_{F})^{\frac{1}{2-\gamma}}\right\}/\epsilon\right)\right)\right\rceil&\text{if}\;\gamma\in(1,2),\epsilon<(\mu\varphi(\gamma)\sigma_{F})^{\frac{1}{2-\gamma}}\\ \left\lceil\mathcal{O}\left(\frac{1}{\epsilon^{\gamma-2}}\right)\right\rceil&\text{if}\;\gamma>2.\end{cases}
Proof.

The proof can be found in the Appendix. ∎

The finite convergence of the exact PPA, under WSM, dates back to [9, 7, 1, 4]. Since PPA is simply a gradient descent iteration, its iteration complexity under QG γ=2\gamma=2 shares the typical dependence on the conditioning number 1μσF\frac{1}{\mu\sigma_{F}}. The Holder growth γ(1,2)\gamma\in(1,2) behaves similarly with the sharp minima case: fast convergence outside the neighborhood around the optimum which expands with μ\mu.

A simple argument on the tightness of the bounds for the case γ>2\gamma>2 can be found in [2]. Indeed, Douglas-Rachford, PPA and Alternating Projections algorithms were analyzed in [2] for particular univariate functions. The authors proved that PPA requires 𝒪(1/ϵγ2)\mathcal{O}\left(1/\epsilon^{\gamma-2}\right) iterations to minimize the particular objective F(x)=1γ|x|γF(x)=\frac{1}{\gamma}|x|^{\gamma} (when γ>2\gamma>2) up to ϵ\epsilon tolerance.

Corollary 4.4.

Under the assumptions of Corrolary 4.3, recall the notation 𝒦e(γ,ϵ)\mathcal{K}_{e}(\gamma,\epsilon) for the exact case. The number of iterations performed by {xk}k0\{x^{k}\}_{k\geq 0} generated with IPPA(x0,μ,{δk})(x^{0},\mu,\{\delta_{k}\}) in order to attain an ϵ\epsilon-suboptimal point is:

{𝒪(max{𝒦e(1,ϵ)+δ0μσF,log(δ0ϵ)})γ=1,δkδ02k𝒪(𝒦e(γ,ϵ))γ(1,2],δkδ02k𝒪(𝒦e(γ,ϵ))γ>2,δk=(11γ1)k(γ1)δ0.\displaystyle\begin{cases}\left\lceil\mathcal{O}\left(\max\left\{\mathcal{K}_{e}(1,\epsilon)+\frac{\delta_{0}}{\mu\sigma_{F}},\log\left(\frac{\delta_{0}}{\epsilon}\right)\right\}\right)\right\rceil&\quad\gamma=1,\delta_{k}\leq\frac{\delta_{0}}{2^{k}}\\ \left\lceil\mathcal{O}\left(\mathcal{K}_{e}(\gamma,\epsilon)\right)\right\rceil&\quad\gamma\in(1,2],\delta_{k}\leq\frac{\delta_{0}}{2^{k}}\\ \left\lceil\mathcal{O}\left(\mathcal{K}_{e}(\gamma,\epsilon)\right)\right\rceil&\quad\gamma>2,\delta_{k}=\left(1-\frac{1}{\gamma-1}\right)^{k(\gamma-1)}\delta_{0}.\end{cases}
Proof.

The proof can be found in the Appendix. ∎

Overall, if the noise vanishes sufficiently fast (e.g. a linear decay with factor 12\frac{1}{2} in the case γ2\gamma\leq 2) then the complexity of IPPA has the same order as the one of PPA. However, a fast vanishing noise δk\delta_{k} requires an efficient inner routine that computes the iterate xkx^{k}. Particularly, when FF is nonsmooth and convex, a simple choice of the inner routine is the Subgradient Method (SM). However, the efficiency of the SM for minimizing a given σf\sigma_{f}-strongly convex function ff up to δ\delta accuracy, i.e. f(x)fδf(x)-f^{*}\leq\delta, has the order 𝒪(1δ)\mathcal{O}\left(\frac{1}{\delta}\right) subgradient calls [17]. By using the quadratic growth guaranteed by strong convexity, i.e. σf2xx2f(x)fδ\frac{\sigma_{f}}{2}\lVert x-x^{*}\rVert^{2}\leq f(x)-f^{*}\leq\delta, in order to approach a minimizer at distance δ\delta, SM requires 𝒪(1δ2)\mathcal{O}\left(\frac{1}{\delta^{2}}\right) iterations. According to this bound, the cost of computing a single iteration of IPPA results into 𝒪(1δk2)\mathcal{O}\left(\frac{1}{\delta_{k}^{2}}\right) subgradient calls and, therefore, a direct naive counting of the total number of subgradient evaluations over TT outer iterations is of order 𝒪(k=0T1δk2)\mathcal{O}\left(\sum\limits_{k=0}^{T}\frac{1}{\delta_{k}^{2}}\right). By using the previous estimates of TT necessary for distX(xT)ϵ\text{dist}_{X^{*}}(x^{T})\leq\epsilon, yields a total estimate of subgradient calls which is significantly larger than the known optimal bound of 𝒪(1ϵ2(γ1))\mathcal{O}\left(\frac{1}{\epsilon^{2(\gamma-1)}}\right) for SM algorithm. However, we further give a restarted variant of IPPA that overcomes this issue.

5 Restarted Inexact Proximal Point Algorithm

The Restarted IPPA (RIPPA) illustrates a simple recursive call of the IPPA combined with a multiplicative decrease of the parameters. Observe that RIPPA is completely independent of the problem constants.

1Initialize δ0:=δ0,t:=0\delta^{\nabla}_{0}:=\delta_{0},t:=0
2 while stopping criterion do
3       Call IPPAto compute:xt+1=IPPA(xt,μt,{δk:=δt}k0,5δt)\text{Call IPPA}\;\text{to compute:}\;\;x^{t+1}=IPPA(x^{t},\mu_{t},\{\delta_{k}:=\delta_{t}\}_{k\geq 0},5\delta_{t}^{\nabla})
4       Update:μt+1=2μt,δt+1=δt2ρ,δt+1=μt+1δt+1\text{Update:}\;\;\mu_{t+1}=2\mu_{t},\;\delta^{\nabla}_{t+1}=\frac{\delta^{\nabla}_{t}}{2^{\rho}},\;\delta_{t+1}=\mu_{t+1}\delta^{\nabla}_{t+1}
5       t:=t+1\;t:=t+1
6      
Algorithm 2 Restarted IPPA (x0,μ0,δ0,ρx^{0},\mu_{0},\delta_{0},\rho)

As in the usual context of restartation, we call any tt-th iteration an epoch. The stopping criterion can be optionally based on a fixed number of epochs or on the reduction of gradient norm (10). Denote K0=distX(x0)μ0δ0K_{0}=\lceil\frac{\text{dist}_{X^{*}}(x^{0})}{\mu_{0}\delta_{0}}\rceil.

Theorem 5.1.

Let δ0,μ0\delta_{0},\mu_{0} be positive constants and ρ>1\rho>1. Then the sequence {xt}t0\{x^{t}\}_{t\geq 0} generated by RIPPA(x0,μ0,δ0)RIPPA(x^{0},\mu_{0},\delta_{0}) attains distX(xt)ϵ\text{dist}_{X^{*}}(x^{t})\leq\epsilon after a number of 𝒯IPP(γ,ϵ)\mathcal{T}_{IPP}(\gamma,\epsilon) iterations.

Let γ=1\gamma=1 and assume ϵ<μ0σF\epsilon<\mu_{0}\sigma_{F} and distX(x0)μ0σF\text{dist}_{X^{*}}(x^{0})\geq\mu_{0}\sigma_{F}, then

𝒯IPP(1,ϵ)=1ρ1log(μ0δ0ϵ)+𝒯ct,\displaystyle\mathcal{T}_{IPP}(1,\epsilon)=\frac{1}{\rho-1}\log\left(\frac{\mu_{0}\delta_{0}}{\epsilon}\right)+\mathcal{T}_{ct},

where 𝒯ct=K01ρlog(12δ0σF).\mathcal{T}_{ct}=K_{0}\lceil\frac{1}{\rho}\log\left(\frac{12\delta_{0}}{\sigma_{F}}\right)\rceil. In particular, if δ0<μ0σF\delta_{0}<\mu_{0}\sigma_{F} then RIPPA(x0,μ0,δ0,0)RIPPA(x^{0},\mu_{0},\delta_{0},0) reaches the ϵ\epsilon-suboptimality within 𝒪(log(μ0σFϵ))\mathcal{O}\left(\log\left(\frac{\mu_{0}\sigma_{F}}{\epsilon}\right)\right) iterations.

Let γ=2\gamma=2, then

𝒯IPP(2,ϵ)=𝒪(1ρlog(δ0ϵ))+K0,\displaystyle\mathcal{T}_{IPP}(2,\epsilon)=\mathcal{O}\left(\frac{1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right)\right)+K_{0},

Let γ(1,2)\gamma\in(1,2), then:

𝒯IPP(γ,ϵ)=𝒪(max{γ1ρlog(δ0ϵ),1ρ1log(μ0δ0ϵ)})+K0.\displaystyle\mathcal{T}_{IPP}(\gamma,\epsilon)=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right),\frac{1}{\rho-1}\log\left(\frac{\mu_{0}\delta_{0}}{\epsilon}\right)\right\}\right)+K_{0}.

Otherwise, for γ>2\gamma>2

𝒯IPP(γ,ϵ)=𝒪((δ0ϵ)[(11ρ)(γ1)1]max{1,1(11/ρ)(γ1)})+K0.\displaystyle\mathcal{T}_{IPP}(\gamma,\epsilon)=\mathcal{O}\left(\left(\frac{\delta_{0}}{\epsilon}\right)^{\left[\left(1-\frac{1}{\rho}\right)(\gamma-1)-1\right]\max\left\{1,\frac{1}{(1-1/\rho)(\gamma-1)}\right\}}\right)+K_{0}.
Proof.

The proof can be found in the Appendix. ∎

Remark 2.

Notice that for any γ[1,2]\gamma\in[1,2], logarithmic complexity 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right) is obtained. When γ>2\gamma>2, the above estimate is shortened as

𝒪((1ϵ)(ζ1)max{1,1ζ}),\displaystyle\mathcal{O}\left(\left(\frac{1}{\epsilon}\right)^{(\zeta-1)\max\left\{1,\frac{1}{\zeta}\right\}}\right),

where ζ=(γ1)(11ρ)\zeta=(\gamma-1)\left(1-\frac{1}{\rho}\right), In particular, if ργ1γ2\rho\leq\frac{\gamma-1}{\gamma-2}, then all epochs reduce to length 11 and the total number of IPPA iterations reduces to the same order as in the exact case: 𝒪((1ϵ)γ2).\;\mathcal{O}\left(\left(\frac{1}{\epsilon}\right)^{\gamma-2}\right).

6 Inner Proximal Subgradient Method routine

Although the influence of growth modulus on the behaviour of IPPA is obvious, all complexity estimates derived in the previous sections assume the existence of an oracle computing an approximate proximal mapping:

xk+1argminz𝐑nF(z)+12μzxk2.\displaystyle x^{k+1}\approx\arg\min\limits_{z\in\mathbf{R}^{n}}\;F(z)+\frac{1}{2\mu}\lVert z-x^{k}\rVert^{2}. (12)

Therefore in the rest of this section we focus on the solution of the following inner minimization problem:

minz𝐑nf(z)+ψ(z)+12μzx2.\displaystyle\min\limits_{z\in\mathbf{R}^{n}}\;f(z)+\psi(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}.

In most situations, despite the regularization term, this problem is not trivial and one should select an appropriate routine that computes {xk}k0\{x^{k}\}_{k\geq 0}. For instance, variance-reduced or accelerated proximal first-order methods were employed in [20, 21, 22, 23] as inner routines and theoretical guarantees were provided. Also, Conjugate Gradients based Newton method was used in [51] under a twice differentiability assumption on ff. We limit our analysis only to gradient-type routines and let other accelerated or higher-order methods, that typically improve the performance of their classical counterparts, for future work.

In this section we evaluate the computational complexity of IPPA in terms of number of proximal gradient iterations. The basic routine for solving (12), that we analyze below, is the Proximal (sub)Gradient Method. Notice that when ff is nonsmooth with bounded subgradients, we consider only the case when ψ\psi is an indicator function for a simple, closed convex set. In this situation, PsGM becomes a simple projected subgradient scheme with constant stepsize that solves (12).

1for =0,,N1\ell=0,\cdots,N-1 do
2       z+1=proxαψ(zα(f(z)+1μ(zxk)))z^{\ell+1}=\text{prox}_{\alpha}^{\psi}\left(z^{\ell}-\alpha\left(f^{\prime}(z^{\ell})+\frac{1}{\mu}(z^{\ell}-x^{k})\right)\right)
3      
Output:zN\text{Output:}\;z^{N}
Algorithm 3 Proximal subGradient Method (PsGM) (z0,xk,α,μ,Nz^{0},x^{k},\alpha,\mu,N)

Through a natural combination of the outer guiding IPP iteration and the inner PsGM scheme, we further derive the total complexity of proximal-based restartation Subgradient Method in terms of subgradient oracle calls.

1 Initialize: t:=0,α0:=μ02,N0:=max{8log(Lf/δ0)+1,ρ1}t:=0,\alpha_{0}:=\frac{\mu_{0}}{2},N_{0}:=\max\left\{8\log(L_{f}/\delta_{0})+1,\rho-1\right\}
2 for  t:=0,,T1t:=0,\cdots,T-1 do
3       x0,t:=xt,k:=0x^{0,t}:=x^{t},k:=0
4       do
5             xk+1,t:=PsGM(xk,t,xk,t,αt,μt,Nt)x^{k+1,t}:=PsGM(x^{k,t},x^{k,t},\alpha_{t},\mu_{t},N_{t})
6             k:=k+1k:=k+1
7            
8      while  xk1,txk2,t>μtδt\lVert x^{k-1,t}-x^{k-2,t}\rVert>\mu_{t}\delta_{t} ;
9      xt+1:=xk1,tx^{t+1}:=x^{k-1,t}
10       αt+1:=αt2q,Nt+1:=Nt2(q+1)\alpha_{t+1}:=\alpha_{t}2^{-q},N_{t+1}:=N_{t}2^{-(q+1)}
11       μt+1:=2μt,δt+1:=2ρδt\mu_{t+1}:=2\mu_{t},\delta_{t+1}:=2^{-\rho}\delta_{t}
12      
13Return xT,δTx^{T},\delta_{T}
14
15
16 
17Postprocessing(x~0,β0,μ,N,K\tilde{x}^{0},\beta_{0},\mu,N,K)
18 for  k:=0,,K1k:=0,\cdots,K-1 do
19       x~k+1:=PsGM(x~k,x~k,βk,μ,N)\tilde{x}^{k+1}:=PsGM(\tilde{x}^{k},\tilde{x}^{k},\beta_{k},\mu,N)
20       βk+1:=βk/2\beta_{k+1}:=\beta_{k}/2
Return x~K\tilde{x}^{K}
Algorithm 4 Restarted Inexact Proximal Point - SubGradient Method (RIPP-PsGM) (x0,δ0,μ0,ρ,q,Lf,Tx^{0},\delta_{0},\mu_{0},\rho,q,L_{f},T)
Theorem 6.1.

Let the assumptions of Theorem 8.8 hold and ρ>1,δ02Lf\rho>1,\delta_{0}\geq 2L_{f} then the following assertions hold:

(i)(i) Let xfx^{f} be generated as follows:

{xT,δT}\displaystyle\{x^{T},\delta_{T}\} =RIPP-PsGM(x0,δ0,μ0,ρ,2ρ1,Lf,T)\displaystyle=\textbf{RIPP-PsGM}(x^{0},\delta_{0},\mu_{0},\rho,2\rho-1,L_{f},T)
xf\displaystyle x^{f} =Postprocessing(xT,δT2/Lf2,μT,2(Lf/δT)2,log(δT/ϵ)).\displaystyle=\textbf{Postprocessing}(x^{T},\delta_{T}^{2}/L_{f}^{2},\mu_{T},\lceil 2(L_{f}/\delta_{T})^{2}\rceil,\lceil\log(\delta_{T}/\epsilon)\rceil).

For T1ρlog(12δ0/σF)T\geq\left\lceil\frac{1}{\rho}\log(12\delta_{0}/\sigma_{F})\right\rceil, the final output xfx^{f} satisfies distX(xf)ϵ\text{dist}_{X^{*}}(x^{f})\leq\epsilon after 𝒪(log(1/ϵ))\mathcal{O}\left(\log\left(1/\epsilon\right)\right) PsGM iterations.

(ii)(ii) Moreover, if ff has ν\nu-Holder continuous gradients with constant LfL_{f} and νγ1\nu\leq\gamma-1, assume δ0(2Lf2)12(1ν)μ0ν1ν\delta_{0}\geq(2L_{f}^{2})^{\frac{1}{2(1-\nu)}}\mu_{0}^{\frac{\nu}{1-\nu}} and q1ρ12(1ν)\frac{q-1}{\rho-1}\geq 2(1-\nu). If T=𝒪(max{γ1ρlog(μ0δ0/ϵ),1ρ1log(μ0δ0/ϵ)})T=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log(\mu_{0}\delta_{0}/\epsilon),\frac{1}{\rho-1}\log(\mu_{0}\delta_{0}/\epsilon)\right\}\right), then the output xT:=RIPPPsGM(x0,δ0,μ0,ρ,q,Lf,T)x^{T}:=RIPP-PsGM(x^{0},\delta_{0},\mu_{0},\rho,q,L_{f},T) satisfies distX(xT)ϵ\text{dist}_{X^{*}}(x^{T})\leq\epsilon within a total cost of:

𝒪(1/ϵmax{γ2+qρ(γ1),(γ2)ρ(ρ1)(γ1)+qρ1})\displaystyle\mathcal{O}\left(1/\epsilon^{\max\left\{\gamma-2+\frac{q}{\rho}(\gamma-1),(\gamma-2)\frac{\rho}{(\rho-1)(\gamma-1)}+\frac{q}{\rho-1}\right\}}\right)

PsGM iterations.

Proof.

The proof can be found in the Appendix. ∎

Remark 3.

Although the bound on the number of epochs in RIPP-PsGM depends somehow on γ\gamma, a sufficiently high value of TT ensure the result to hold. To investigate some important particular bounds hidden into the above complexity estimates (of Theorem 6.1) we analyze different choices of the input parameters (ρ,q)(\rho,q) which will be synthesized in Table 1.

Assume ν\nu is known, q=1+2(1ν)(ρ1)q=1+2(1-\nu)(\rho-1) and denote ζ=(11ρ)(γ1)\zeta=\left(1-\frac{1}{\rho}\right)(\gamma-1)

[γ=1+ν]\displaystyle[\gamma=1+\nu] 𝒪(1/ϵ(32γ)(ζ1)max{1,1ζ})\displaystyle\quad\mathcal{O}\left(1/\epsilon^{(3-2\gamma)(\zeta-1)\max\left\{1,\frac{1}{\zeta}\right\}}\right) (13)
[γ>1+ν]\displaystyle[\gamma>1+\nu] 𝒪(1/ϵ[2(γν1)+(12ν)(ζ1)]max{1,1ζ}).\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\left[2(\gamma-\nu-1)+(1-2\nu)(\zeta-1)\right]\max\left\{1,\frac{1}{\zeta}\right\}}\right). (14)

Under knowledge of γ\gamma, by setting ρ=γ1γ2\rho=\frac{\gamma-1}{\gamma-2} then ζ=1\zeta=1 and (13) becomes 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right). When γ<2\gamma<2, a sufficiently large ρ\rho simplify (14) into 𝒪(1/ϵ32ν1γ1)\mathcal{O}\left(1/\epsilon^{3-2\nu-\frac{1}{\gamma-1}}\right). Given any ν[1/2,1]\nu\in[1/2,1] and γ>2\gamma>2, similarly for ρ\rho\to\infty the estimate (14) reduces to 𝒪(1/ϵ(32ν)(γ1)1)\mathcal{O}\left(1/\epsilon^{(3-2\nu)(\gamma-1)-1}\right).

In the particular smooth case ν=1\nu=1, bounds (13)-(14) become:

[γ=2]\displaystyle[\gamma=2] 𝒪(1/ϵmax{1ρ,1ρ1})\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\max\left\{\frac{1}{\rho},\frac{1}{\rho-1}\right\}}\right)
[γ>2]\displaystyle[\gamma>2] 𝒪(1/ϵ[γ2+γ1ρ]max{1,1ζ}).\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\left[\gamma-2+\frac{\gamma-1}{\rho}\right]\max\left\{1,\frac{1}{\zeta}\right\}}\right).

For high values of ρlog(1/ϵ)\rho\geq\log(1/\epsilon), the first one becomes 𝒪(log(1/ϵ))\mathcal{O}\left(\log(1/\epsilon)\right). Also the second one reduces to 𝒪(1/ϵγ2)\mathcal{O}\left(1/\epsilon^{\gamma-2}\right) when ρ(γ1)log(1/ϵ)\rho\geq(\gamma-1)\log(1/\epsilon).

In the bounded gradients case ν=0\nu=0 the estimates reduces to

[γ=1]\displaystyle[\gamma=1] 𝒪(log(1/ϵ))\displaystyle\quad\mathcal{O}\left(\log(1/\epsilon)\right) (15)
[γ>1]\displaystyle[\gamma>1] 𝒪(1/ϵ[2(γ1)+ζ1]max{1,1ζ}).\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\left[2(\gamma-1)+\zeta-1\right]\max\left\{1,\frac{1}{\zeta}\right\}}\right). (16)

First observe that when the main parameters σF,Lf,γ\sigma_{F},L_{f},\gamma are known then ζ=1\zeta=1 and we recover the same iteration complexity in terms of the number of subgradient evaluations as in the literature [53, 16, 41, 10]. The last estimate (16) holds when RIPP-PsGM performs a sufficiently high number of epochs, under no availability of problem parameters σF,ν,γ\sigma_{F},\nu,\gamma.

In [29, 41] are derived the optimal complexity estimates 𝒪(ϵ2(γν1)3(1+ν)2)\mathcal{O}\left(\epsilon^{-\frac{2(\gamma-\nu-1)}{3(1+\nu)-2}}\right), in terms of the number of (sub)gradient evaluations, for accelerated first-order methods under ν\nu-Holder smoothness and γ\gamma-HG. These optimal estimates require full availability of the problem information: γ,σF,ν,LF\gamma,\sigma_{F},\nu,L_{F}.

Knowledge DS-SG[16] RSG[53] Restarted UGM[29] IPPA-PsGM
σF,γ,Lf(ν=0)\sigma_{F},\gamma,L_{f}(\nu=0) 𝒪(ϵ2(γ1))\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right) 𝒪(ϵ2(γ1))\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right) 𝒪(ϵ2(γ1))\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right) 𝒪(ϵ2(γ1))\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)
σF,γ,Lf,ν>0\sigma_{F},\gamma,L_{f},\nu>0 - - 𝒪(ϵ2(γν1)3(1+ν)2)\mathcal{O}\left(\epsilon^{-\frac{2(\gamma-\nu-1)}{3(1+\nu)-2}}\right) (13)/(14)
γ,Lf\gamma,L_{f} - - 𝒪(ϵ2(γ1))\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right) 𝒪(ϵ2(γ1))\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)
ν,Lf\nu,L_{f} - - - (13)/(14)
LfL_{f} - - - (16)
Table 1: Comparison of complexity estimates under various knowledge degrees of problem information

7 Numerical simulations

In the following Section we evaluate RIPP-PsGM by applying it to real-world applications often found in machine learning tasks. The algorithm and its applications are public and available online 111https://github.com/pirofti/IPPA.

Unless stated otherwise, we perform enough epochs (restarts) until the objective is within ε0=0.5\varepsilon_{0}=0.5 proximity to the CVX computed optimum. The current objective value is computed within each inner PsGM iteration. All the models under consideration satisfy WSM property and therefore the implementation of PsGM reduces to the scheme of (projected) Subgradient Method.

We would like to thank the authors of the methods we compare with for providing the code implementation for reproducing the experiments. No modifications were performed by us on the algorithms or their specific parameters. Following their implementation and as is common in the literature, in our reports we also use the minimum error obtained in the iterations thus far.

All our experiments were implemented in Python 3.10.2 under ArchLinux (Linux version 5.16.15-arch1-1) and executed on an AMD Ryzen Threadripper PRO 3955WX with 16-Cores and 512GB of system memory.

7.1 Robust 1\ell_{1} Least Squares

We start out with the least-squares (LS) problem in the 1\ell_{1} setting. This form deviates from standard LS by imposing an 1\ell_{1}-norm on the objective and by constraining the solution sparsity through the τ\tau parameter on its 1\ell_{1}-norm:

minx𝐑n\displaystyle\min\limits_{x\in\mathbf{R}^{n}}\; Axb1\displaystyle\;\lVert Ax-b\rVert_{1}
s.t. x1τ\displaystyle\;\;\lVert x\rVert_{1}\leq\tau

Our goal is to analyze the effect of the data dimensions, the problem and IPPA parameters on the total number of iterations.

Refer to caption
Refer to caption
Refer to caption
Figure 1: Total number of inner iterations needed for various parametrizations. (a) Varying ρ\rho. (b) Varied problem dimensions where we set m=nm=n. (c) Varying τ\tau

The first experiment from Figure 1 investigates the effect of the ρ\rho parameter on the unconstrained 1\ell_{1}-LS formulation (τ=\tau=\infty) on a small 50×2050\times 20 problem. In our experiment we start with μ=0.1\mu=0.1, with 9 epochs and vary ρ\rho from 1.005 to 1.1 in 0.005 steps sizes. In Figure 1, we repeat the same experiment with fixed ρ=1.005\rho=1.005 now, but with varied problem dimensions starting from 10 up to 200 in increments of 5 where we set both dimensions equal (m=nm=n). Finally, in Figure 1, we study the effect of the problem specific parameter on the total number of iterations. Although dim effects are noticed in the beginning, we can see a sudden burst past τ=3.4\tau=3.4. Please note that this is specific to 1\ell_{1}-LS and not to IPPA in general as we will see in the following section.

7.2 Graph Support Vector Machines

Graph SVM adds a graph-guided lasso regularization to the standard SVM averaged hinge-loss objective and extends the 1\ell_{1}-SVM formulation through the factorization MxMx where MM is the weighted graph adjacency matrix, i.e.

minx𝐑n\displaystyle\min\limits_{x\in\mathbf{R}^{n}}\; 1mi=1mmax{0,1yiaiTx}+τMx1\displaystyle\;\frac{1}{m}\sum_{i=1}^{m}\max\{0,1-y_{i}a_{i}^{T}x\}+\tau\lVert Mx\rVert_{1} (17)

where ai𝐑na_{i}\in\mathbf{R}^{n}, yi{±1}y_{i}\in\{\pm 1\} are the ii-th data point and its label, respectively. When M=InM=I_{n} the Regularized Sparse 1\ell_{1}-SVM formulation is recovered.

Synthetic

Refer to caption

Refer to caption

Refer to caption

20newsgroups

Refer to caption

Refer to caption

1\ell_{1}-SVM (M=InM=I_{n})

Refer to caption
Figure 2: GraphSVM experiments on synthetic data, first row, on the 20newsgroups data-set, second row, and with M=InM=I_{n} on the third row. (a), (c), (e): Objective error evolution across inner iterations. (b), (d), (f): Objective error evolution across inner iterations measured in CPU time. The error is displayed in logarithmic scale.

In Figure 2 we present multiple experiments on different data and across different τ\tau parametrizations. Figure 2 (a) and (b) compares RIPP-PsGM with R2SG from [53] on synthetic random data {C,X,M}\{C,X,M\} from the standard normal distribution. The same initialization and starting point x0x_{0} was used for all methods. We use m=100m=100 measurements of n=512n=512 samples xx with initial parameters μ=0.1\mu=0.1, ρ=1.0005\rho=1.0005 and τ=1\tau=1 which we execute for 15 epochs.

We repeat the experiment in Figure 2 (c) and (d), but this time on real-data from the 20newsgroup data-set222https://cs.nyu.edu/ roweis/data.html following the experiment from [53], with parameters μ=0.1\mu=0.1, ρ=1.005\rho=1.005, and τ=3\tau=3. Here we find a similar behaviour for both methods as in the synthetic case. In Figure 2 (e) and (f), we repeat the experiment by setting M=ImM=I_{m} in (17) thus recovering the regularized 1\ell_{1}-SVM formulation. We notice here that RIPP-PsGM maintains its position ahead of R2SG and that the error drop is similar to the 20newsgroup experiment.

Refer to caption
Refer to caption
Refer to caption
Figure 3: GraphSVM: Total number of inner iterations needed for various parametrizations. (a) Varying ρ\rho. (b) Varied problem dimensions where we set m=nm=n. (c) Varied τ\tau.

Furthermore, Figure 3 rehashes the experiments from the 1\ell_{1}-LS Section in order to study the effect of the data dimensions and of the problem parameters on the number of total number of required inner iterations. The results for ρ\rho and the data dimensions are as expected: as they grow they almost linearly increase the iteration numbers. For the GraphSVM specific parameter τ\tau, we find the results are opposite to that of 1\ell_{1}-LS; it is harder to solve the problem when τ\tau is small.

Refer to caption

Refer to caption

Figure 4: Sparse 1\ell_{1}-SVM: (a) Objective error evolution across inner iterations. (b) Zoomed objective error evolution.

In order to compare with DS2SG [16] and R2SG, we follow the Sparse SVM experiment from [16] which requests M=ImM=I_{m} and removes the regularization τx1\tau\lVert x\rVert_{1} from (17) and instead adds it as a constraint {x:x1τ}\{x:\lVert x\rVert_{1}\leq\tau\}. We used the parameters μ=0.001\mu=0.001, ρ=1.0005\rho=1.0005, and τ=0.4\tau=0.4 and plotted the results in Figure 4. Please note that the starting point is the same, with a quick initial drop for all three methods as can be seen in Figure 4 (a). Execution times were almost identical and took around 0.15s. To investigate the differences in convergence behaviour, we zoom in after the first few iterations in the 4 panels of Figure 4 (b). In the first panel we show the curves for all methods together and in the other three we see the individual curves for each. Our experiments showed that this is the general behaviour for τ>0.4\tau>0.4: DS2SG has a slightly sharper drop, with RIPP-PsGM following in closely with a staircase behaviour while R2SG takes a few more iterations to reach convergence. For smaller values of τ\tau we found that RIPP-PsGM reaches the solution in just 1–5 iterations within 10910^{-9} precision, while the others lag behind for a few hundred iterations.

7.3 Matrix Completion for Movie Recommendation

In this last section, the problem of matrix completion is applied to the standard movie recommendation challenge which recovers a full user-rating matrix XX from the partial observations YY corresponding to the NN known user-movie ratings pairs.

minX𝐑m×n\displaystyle\min\limits_{X\in\mathbf{R}^{m\times n}}\; 1N(i,j)ΣN|XijYij|+τX\displaystyle\;\frac{1}{N}\sum_{(i,j)\in\Sigma}^{N}|X_{ij}-Y_{ij}|+\tau\lVert X\rVert_{\star}

where Σ\Sigma is the set of user-movie pairs with N=|Σ|N=|\Sigma|. Solving this will complete matrix X based on the known sparse matrix Y while maintaining a low rank.

Refer to caption
Refer to caption
Figure 5: Matrix completion: (a) Objective error evolution across inner iterations. (b) Objective error evolution across inner iterations measured in CPU time.

In Figure 5 we reproduce the experiment from [53] with parameters μ=0.1\mu=0.1, ρ=1.005\rho=1.005, τ=3\tau=3 on a synthetic database with 50 movies and 20 users filled with 250 i.i.d. randomly chosen ratings from 1 to 5. We let a few more R2SG iterations execute to show that the slight progress that is made.

References

  • [1] Antipin, A.: On finite convergence of processes to a sharp minimum and to a smooth minimum with a sharp derivative. Differential Equations 30(11), 1703–1713 (1994)
  • [2] Bauschke, H.H., Dao, M.N., Noll, D., Phan, H.M.: Proximal point algorithm, douglas-rachford algorithm and alternating projections: a case study. Journal of Convex Analysis 23(1), 237–261 (2015)
  • [3] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1), 183–202 (2009)
  • [4] Bertsekas, D.P.: Constrained optimization and Lagrange multiplier methods. Athena Scientific (1982)
  • [5] Bertsekas, D.P.: Parallel and distributed computation: numerical methods, vol. 23. Prentice hall Englewood Cliffs, NJ (1989)
  • [6] Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165(2), 471–507 (2017)
  • [7] Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM Journal on Control and Optimization 31(5), 1340–1359 (1993)
  • [8] Davis, D., Drusvyatskiy, D., MacPhee, K.J., Paquette, C.: Subgradient methods for sharp weakly convex functions. Journal of Optimization Theory and Applications 179(3), 962–982 (2018)
  • [9] Ferris, M.C.: Finite termination of the proximal point algorithm. Mathematical Programming 50(1), 359–366 (1991)
  • [10] Freund, R.M., Lu, H.: New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure. Mathematical Programming 170(2), 445–477 (2018)
  • [11] Gilpin, A., Pena, J., Sandholm, T.: First-order algorithm with 𝒪(ln(1/ϵ))\mathcal{O}(\ln(1/\epsilon)) convergence for ϵ\epsilon-equilibrium in two-person zero-sum games. Mathematical programming 133(1), 279–298 (2012)
  • [12] Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM journal on control and optimization 29(2), 403–419 (1991)
  • [13] Güler, O.: New proximal point algorithms for convex minimization. SIAM Journal on Optimization 2(4), 649–664 (1992)
  • [14] Hu, Y., Li, C., Yang, X.: On convergence rates of linearized proximal algorithms for convex composite optimization with applications. SIAM Journal on Optimization 26(2), 1207–1235 (2016)
  • [15] Humes, C., Silva, P.J.: Inexact proximal point algorithms and descent methods in optimization. Optimization and Engineering 6(2), 257–271 (2005)
  • [16] Johnstone, P.R., Moulin, P.: Faster subgradient methods for functions with hölderian growth. Mathematical Programming 180(1), 417–450 (2020)
  • [17] Juditsky, A., Nesterov, Y.: Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems 4(1), 44–80 (2014)
  • [18] Kort, B.W., Bertsekas, D.P.: Combined primal–dual and penalty methods for convex programming. SIAM Journal on Control and Optimization 14(2), 268–294 (1976)
  • [19] Li, G., Mordukhovich, B.S.: Holder metric subregularity with applications to proximal point method. SIAM Journal on Optimization 22(4), 1655–1684 (2012)
  • [20] Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. Advances in neural information processing systems 28 (2015)
  • [21] Lin, H., Mairal, J., Harchaoui, Z.: Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research 18(1), 7854–7907 (2018)
  • [22] Lu, M., Qu, Z.: An adaptive proximal point algorithm framework and application to large-scale optimization. arXiv preprint arXiv:2008.08784 (2020)
  • [23] Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research 46(1), 157–178 (1993)
  • [24] Mairal, J.: Cyanure: An open-source toolbox for empirical risk minimization for python, c++, and soon more. arXiv preprint arXiv:1912.08165 (2019)
  • [25] Monteiro, R.D., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization 23(2), 1092–1125 (2013)
  • [26] Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming 175(1), 69–107 (2019)
  • [27] Nedić, A., Bertsekas, D.P.: The effect of deterministic noise in subgradient methods. Mathematical programming 125(1), 75–99 (2010)
  • [28] Nemirovski, A.: Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1), 229–251 (2004)
  • [29] Nemirovskii, A., Nesterov, Y.: Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics 25(2), 21–30 (1985). https://doi.org/10.1016/0041-5553(85)90100-4. URL https://www.sciencedirect.com/science/article/pii/0041555385901004
  • [30] Nesterov, Y.: How to make the gradients small. Optima 88 (2012)
  • [31] Nesterov, Y.: Inexact accelerated high-order proximal-point methods. Mathematical Programming pp. 1–26 (2021)
  • [32] Nesterov, Y.: Inexact high-order proximal-point methods with auxiliary search procedure. SIAM Journal on Optimization 31(4), 2807–2828 (2021)
  • [33] Patrascu, A., Irofti, P.: On finite termination of an inexact proximal point algorithm. Applied Mathematics Letters 134, 108348 (2022). https://doi.org/10.1016/j.aml.2022.108348
  • [34] Polyak, B.: A general method of solving extremal problems. Math. Doklady 8, 593–597 (1967)
  • [35] Polyak, B.: Minimization of unsmooth functionals. U.S.S.R. Computational Mathematics and Mathematical Physics 9, 509–521 (1969)
  • [36] Polyak, B.: Nonlinear programming methods in the presence of noise. Mathematical Programming 14, 87–97 (1978)
  • [37] Polyak, B.: Introduction to optimization. Optimization Software, Inc., Publications Division, New York (1987)
  • [38] Renegar, J.: Efficient first-order methods for linear programming and semidefinite programming. arXiv preprint arXiv:1409.5832 (2014)
  • [39] Rockafellar, R.T.: Augmented lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of operations research 1(2), 97–116 (1976)
  • [40] Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM journal on control and optimization 14(5), 877–898 (1976)
  • [41] Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. SIAM Journal on Optimization 30(1), 262–289 (2020)
  • [42] Salzo, S., Villa, S.: Inexact and accelerated proximal point algorithms. Journal of Convex analysis 19(4), 1167–1192 (2012)
  • [43] Shor, N.: An application of the method of gradient descent to the solution of the network transportation problem. Materialy Naucnovo Seminara po Teoret i Priklad. Voprosam Kibernet. i Issted. Operacii, Nuc- nyi Sov. po Kibernet, Akad. Nauk Ukrain. SSSR 1, 9–17 (1962)
  • [44] Shor, N.: On the structure of algorithms for numerical solution of problems of optimal planning and design. Diss. Doctor Philos, Kiev (1964)
  • [45] Shulgin, E., Gasnikov, A., Matyukhin, V.: Adaptive catalyst for smooth convex optimization. In: Optimization and Applications: 12th International Conference, OPTIMA 2021, Petrovac, Montenegro, September 27–October 1, 2021, Proceedings, vol. 13078, p. 20. Springer Nature (2021)
  • [46] Solodov, M.V., Svaiter, B.F.: A hybrid approximate extragradient–proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis 7(4), 323–345 (1999)
  • [47] Solodov, M.V., Svaiter, B.F.: A hybrid projection-proximal point algorithm. Journal of convex analysis 6(1), 59–70 (1999)
  • [48] Solodov, M.V., Svaiter, B.F.: Error bounds for proximal point subproblems and associated inexact proximal point algorithms. Mathematical programming 88(2), 371–389 (2000)
  • [49] Solodov, M.V., Svaiter, B.F.: A unified framework for some inexact proximal point algorithms. Numerical functional analysis and optimization 22(7-8), 1013–1035 (2001)
  • [50] Solodov, M.V., Svaiter, B.F.: A unified framework for some inexact proximal point algorithms. Numerical functional analysis and optimization 22(7-8), 1013–1035 (2001)
  • [51] Tomioka, R., Suzuki, T., Sugiyama, M.: Super-linear convergence of dual augmented lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research 12(5) (2011)
  • [52] Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4), 2057–2075 (2014)
  • [53] Yang, T., Lin, Q.: Rsg: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research 19(1), 236–268 (2018)
  • [54] Yankelevsky, Y., Elad, M.: Dual graph regularized dictionary learning. IEEE Transactions on Signal and Information Processing over Networks 2(4), 611–624 (2016)

8 Appendix

Proof of Lemma 2.1.

By using γ\gamma-HG, we get:

Fμ(x)F\displaystyle F_{\mu}(x)-F^{*} =minzF(z)F+12μzx2\displaystyle=\min_{z}F(z)-F^{*}+\frac{1}{2\mu}\lVert z-x\rVert^{2}
minzσFdistXγ(z)+12μzx2\displaystyle\geq\min_{z}\sigma_{F}\text{dist}_{X^{*}}^{\gamma}(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}
=minz,yXσFzyγ+12μzx2.\displaystyle=\min_{z,y\in X^{*}}\sigma_{F}\lVert z-y\rVert^{\gamma}+\frac{1}{2\mu}\lVert z-x\rVert^{2}. (18)

The solution of (18) in zz, denoted as z(x)z(x) satisfies the following optimality condition: σFγz(x)yz(x)y2γ+1μ(z(x)x)=0\sigma_{F}\gamma\frac{z(x)-y}{\lVert z(x)-y\rVert^{2-\gamma}}+\frac{1}{\mu}\left(z(x)-x\right)=0, which simply implies that

z(x)=zy2γzy2γ+σFμγx+σFμγzy2γ+σFμγy.\displaystyle z(x)=\frac{\lVert z-y\rVert^{2-\gamma}}{\lVert z-y\rVert^{2-\gamma}+\sigma_{F}\mu\gamma}x+\frac{\sigma_{F}\mu\gamma}{\lVert z-y\rVert^{2-\gamma}+\sigma_{F}\mu\gamma}y. (19)

(i)(i) For a function with sharp minima (γ=1\gamma=1) it is easy to see that (z(x)y)[1+σFμz(x)y]=xy(z(x)-y)\left[1+\frac{\sigma_{F}\mu}{\lVert z(x)-y\rVert}\right]=x-y. By taking norm in both sides then: z(x)y=max{0,xyσFμ}\lVert z(x)-y\rVert=\max\{0,\lVert x-y\rVert-\sigma_{F}\mu\} and (19) becomes:

z(x)={y+(1σFμxy)(xy),xy>σFμy,xyσFμz(x)=\begin{cases}y+\left(1-\frac{\sigma_{F}\mu}{\lVert x-y\rVert}\right)(x-y),&\lVert x-y\rVert>\sigma_{F}\mu\\ y,&\lVert x-y\rVert\leq\sigma_{F}\mu\end{cases}

By replacing this form of z(x)z(x) into (18) we obtain our first result:

Fμ(x)F\displaystyle F_{\mu}(x)-F^{*} minyX{σFxyμσF22,xy>σFμ12μyx2,xyσFμ\displaystyle\geq\min_{y\in X^{*}}\begin{cases}\sigma_{F}\lVert x-y\rVert-\frac{\mu\sigma-F^{2}}{2},&\lVert x-y\rVert>\sigma_{F}\mu\\ \frac{1}{2\mu}\lVert y-x\rVert^{2},&\lVert x-y\rVert\leq\sigma_{F}\mu\end{cases}
{σFdistX(x)μσF22,distX(x)>σFμ12μdistX(x)2,distX(x)σFμ.\displaystyle\geq\begin{cases}\sigma_{F}\text{dist}_{X^{*}}(x)-\frac{\mu\sigma^{2}_{F}}{2},&\text{dist}_{X^{*}}(x)>\sigma_{F}\mu\\ \frac{1}{2\mu}\text{dist}_{X^{*}}(x)^{2},&\text{dist}_{X^{*}}(x)\leq\sigma_{F}\mu\end{cases}.

(ii)(ii) For quadratic growth, (19) reduces to z(x)=11+2σFμx+2σFμ1+2σFμyz(x)=\frac{1}{1+2\sigma_{F}\mu}x+\frac{2\sigma_{F}\mu}{1+2\sigma_{F}\mu}y and by (18) it leads to:

Fμ(x)F\displaystyle F_{\mu}(x)-F^{*} minyXσF1+2σFμyx2=σF1+2σFμdistX2(x).\displaystyle\geq\min_{y\in X^{*}}\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\lVert y-x\rVert^{2}=\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\text{dist}_{X^{*}}^{2}(x).

(iii)(iii) Lastly, from (19) we see that z(x)z(x) lies on the segment [x,y][x,y], i.e. z(x)=λx+(1λ)yz(x)=\lambda x+(1-\lambda)y for certain λ[0,1]\lambda\in[0,1]. Using this argument into (18) we equivalently have:

Fμ(x)F\displaystyle F_{\mu}(x)-F^{*} minyX,λ[0,1],z=λx+(1λ)yσFzyγ+12μzx2\displaystyle\geq\min_{y\in X^{*},\lambda\in[0,1],z=\lambda x+(1-\lambda)y}\sigma_{F}\lVert z-y\rVert^{\gamma}+\frac{1}{2\mu}\lVert z-x\rVert^{2}
=minyX,λ[0,1]σFλγxyγ+(1λ)22μxy2\displaystyle=\min_{y\in X^{*},\lambda\in[0,1]}\sigma_{F}\lambda^{\gamma}\lVert x-y\rVert^{\gamma}+\frac{(1-\lambda)^{2}}{2\mu}\lVert x-y\rVert^{2}
=minλ[0,1]σFλγdistXγ(x)+(1λ)22μdistX2(x)\displaystyle=\min_{\lambda\in[0,1]}\sigma_{F}\lambda^{\gamma}\text{dist}^{\gamma}_{X^{*}}(x)+\frac{(1-\lambda)^{2}}{2\mu}\text{dist}^{2}_{X^{*}}(x)
min{σFdistXγ(x),12μdistX2(x)}minλ[0,1]λγ+(1λ)2.\displaystyle\geq\min\left\{\sigma_{F}\text{dist}^{\gamma}_{X^{*}}(x),\frac{1}{2\mu}\text{dist}^{2}_{X^{*}}(x)\right\}\min_{\lambda\in[0,1]}\lambda^{\gamma}+(1-\lambda)^{2}.

Proof of Lemma 3.1.

Assume δFμ(x)ϵ\lVert\nabla_{\delta}F_{\mu}(x)\rVert\leq\epsilon. Observe that on one hand, by the triangle inequality that: Fμ(x)δFμ(x)+δμϵ+δμ=:ϵ~\lVert\nabla F_{\mu}(x)\rVert\leq\lVert\nabla_{\delta}F_{\mu}(x)\rVert+\frac{\delta}{\mu}\leq\epsilon+\frac{\delta}{\mu}=:\tilde{\epsilon}. On the other hand, one can derive:

Fμ(x)2μ[Fμ(x)F]\displaystyle\lVert\nabla F_{\mu}(x)\rVert\leq\sqrt{\frac{2}{\mu}[F_{\mu}(x)-F^{*}]} 1μdistX(x).\displaystyle\leq\frac{1}{\mu}\text{dist}_{X^{*}}(x). (20)

Now let γ=1\gamma=1. Based on Fμ(x)F(proxμF(x))\nabla F_{\mu}(x)\in\partial F(\text{prox}_{\mu}^{F}(x)), for nonoptimal proxμF(x)\text{prox}_{\mu}^{F}(x) the bound (6) guarantees Fμ(x)=F(proxμF(x))σF\lVert\nabla F_{\mu}(x)\rVert=\lVert F^{\prime}(\text{prox}_{\mu}^{F}(x))\rVert\geq\sigma_{F}. Therefore, if xx would determine ϵ~<σF\tilde{\epsilon}<\sigma_{F}, and implicitly Fμ(x)<σF\lVert\nabla F_{\mu}(x)\rVert<\sigma_{F}, the contradiction with the previous lower bound impose that proxμF(x)X\text{prox}_{\mu}^{F}(x)\in X^{*}. Moreover, in this case, since xproxμF(x)=μFμ(x)(20)distX(x)\lVert x-\text{prox}_{\mu}^{F}(x)\rVert=\mu\lVert\nabla F_{\mu}(x)\rVert\overset{\eqref{upperbound_morenvgrad}}{\leq}\text{dist}_{X^{*}}(x) then obviously proxμF(x)=πX(x)\text{prox}_{\mu}^{F}(x)=\pi_{X^{*}}(x). On summary, a sufficiently small approximate gradient norm δFμ(x)<σFδμ\lVert\nabla_{\delta}F_{\mu}(x)\rVert<\sigma_{F}-\frac{\delta}{\mu} also confirms a small distance to optimal set distX(x)μϵ~\text{dist}_{X^{*}}(x)\leq\mu\tilde{\epsilon}.

Let γ>1\gamma>1. Then our assumption Fμ(x)ϵ~\lVert\nabla F_{\mu}(x)\rVert\leq\tilde{\epsilon}, (20) and (5) implies the error bound: distX(x)max{[ϵ~σFφ(γ)]1γ1,2μϵ~φ(γ)}\;\;\text{dist}_{X^{*}}(x)\leq\max\left\{\left[\frac{\tilde{\epsilon}}{\sigma_{F}\varphi(\gamma)}\right]^{\frac{1}{\gamma-1}},\frac{2\mu\tilde{\epsilon}}{\varphi(\gamma)}\right\}, which confirms the last result. ∎

Proof of Corrolary 4.2.

From Theorem 4.1(i)(i) we obtain directly:

distX(xk)max{distX(x0)k(μσFδ),δ},\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\max\left\{\text{dist}_{X^{*}}(x^{0})-k(\mu\sigma_{F}-\delta),\delta\right\},

which means that after at most KK iterations xkx^{k} reaches distX(xK)δ<μσF\text{dist}_{X^{*}}(x^{K})\leq\delta<\mu\sigma_{F}. Lastly, the same reasoning as in Lemma 3.1, based on the relations (20) and (6), lead to proxμF(xK)=πX(xK)\text{prox}_{\mu}^{F}(x^{K})=\pi_{X^{*}}(x^{K}). ∎

Proof of Corrolary 4.3.

The proof for the first two estimates are immediately derived from Theorem 4.1 (i)(i) and (ii)(ii). For γ(1,2)\gamma\in(1,2), we considered α=μφ(γ)σF2,β=1+1σF2\alpha=\frac{\mu\varphi(\gamma)\sigma_{F}}{2},\beta=\frac{1+\sqrt{1-\sigma_{F}}}{2} into Corrolary 8.6 and obtained an estimate for our exact case. To refine the complexity order, we majorized some constants by using: (2βμσF)12γ(2μσF)12γ(2\beta\mu\sigma_{F})^{\frac{1}{2-\gamma}}\leq(2\mu\sigma_{F})^{\frac{1}{2-\gamma}} and 11φ(γ)<11-\sqrt{1-\varphi(\gamma)}<1. For γ>2\gamma>2, we replace the same α\alpha as before and β^=max{1+1σF2,11γ1}\hat{\beta}=\max\left\{\frac{1+\sqrt{1-\sigma_{F}}}{2},1-\frac{1}{\gamma-1}\right\} into Corrolary 8.6 to get the last estimate. ∎

Proof of Corrolary 4.4.

The proof for the first two estimates can be derived immediately from Theorem 4.1 (i)(i) and (ii)(ii). We provide details for the other two cases.

For γ(1,2)\gamma\in(1,2) we use the same notations as in the proof of Theorem 4.1 (given in the appendix). There, the key functions which decide the decrease rate of distX(xk)\text{dist}_{X^{*}}(x^{k}) are the nondecreasing function hh and accuracy δ^k=max{(2δkμσFφ(γ))1γ1,2δk11φ(γ)}\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\mu\sigma_{F}\varphi(\gamma)}\right)^{\frac{1}{\gamma-1}},\frac{2\delta_{k}}{1-\sqrt{1-\varphi(\gamma)}}\right\}. First recall that

12=φ(2)φ(γ)φ(1)=34.\displaystyle\frac{1}{2}=\varphi(2)\leq\varphi(\gamma)\leq\varphi(1)=\frac{3}{4}. (21)

which implies that for any δ0\delta\geq 0

δ2(21)1+1φ(γ)2δh(δ).\displaystyle\frac{\delta}{2}\overset{\eqref{varphi_bounds}}{\leq}\frac{1+\sqrt{1-\varphi(\gamma)}}{2}\delta\leq h(\delta). (22)

Recalling that δ¯k=max{δ^k,h(δ¯k1)}\bar{\delta}_{k}=\max\left\{\hat{\delta}_{k},h(\bar{\delta}_{k-1})\right\}, then by Theorem 4.1(iii)(iii) we have:

distX(xk)\displaystyle\text{dist}_{X^{*}}(x^{k}) max{h(k)(distX(x0)),δ¯k}\displaystyle\leq\max\left\{h^{(k)}(\text{dist}_{X^{*}}(x^{0})),\bar{\delta}_{k}\right\} (23)

By taking δk=δk12\delta_{k}=\frac{\delta_{k-1}}{2} then δ^k=max{121γ1(2δk1μσFφ(γ))1γ1,122δk111φ(γ)}δ^k12(22)h(δ^k1)\hat{\delta}_{k}=\max\left\{\frac{1}{2^{\frac{1}{\gamma-1}}}\left(\frac{2\delta_{k-1}}{\mu\sigma_{F}\varphi(\gamma)}\right)^{\frac{1}{\gamma-1}},\frac{1}{2}\frac{2\delta_{k-1}}{1-\sqrt{1-\varphi(\gamma)}}\right\}\leq\frac{\hat{\delta}_{k-1}}{2}\overset{\eqref{rel:low_bound_h}}{\leq}h(\hat{\delta}_{k-1}). By this recurrence, the monotonicity of hh and δ^0=δ¯0\hat{\delta}_{0}=\bar{\delta}_{0}, we derive:

δ¯k\displaystyle\bar{\delta}_{k} max{h(δ^k1),h(δ¯k1)}=h(δ¯k1)\displaystyle\leq\max\left\{h(\hat{\delta}_{k-1}),h(\bar{\delta}_{k-1})\right\}=h(\bar{\delta}_{k-1})
=h(k)(δ^0).\displaystyle=h^{(k)}(\hat{\delta}_{0}).

Finally this key bound enters into (23) and we get:

distX(xk)\displaystyle\text{dist}_{X^{*}}(x^{k}) max{h(k)(distX(x0)),h(k)(δ^0)}\displaystyle\leq\max\left\{h^{(k)}\left(\text{dist}_{X^{*}}(x^{0})\right),h^{(k)}\left(\hat{\delta}_{0}\right)\right\}
h(k)(max{distX(x0),δ^0}),\displaystyle\leq h^{(k)}\left(\max\left\{\text{dist}_{X^{*}}(x^{0}),\hat{\delta}_{0}\right\}\right),

where for the last equality we used the fact that, since hh is nondecreasing, h(k)h^{(k)} is monotonically nondecreasing. Finally, by applying Theorem 8.5 we get our result. Now let γ>2\gamma>2. By redefining hh as in Theorem 4.1, observe that

δ(11γ1)h(δ).\displaystyle\delta\left(1-\frac{1}{\gamma-1}\right)\leq h(\delta). (24)

Take δk=(11γ1)γ1δk1\delta_{k}=\left(1-\frac{1}{\gamma-1}\right)^{\gamma-1}\delta_{k-1} then

δ^k\displaystyle\hat{\delta}_{k} =max{(11γ1)(2δk1μσFφ(γ))1γ1,(11γ1)γ12δk111φ(γ)}\displaystyle=\max\left\{\left(1-\frac{1}{\gamma-1}\right)\left(\frac{2\delta_{k-1}}{\mu\sigma_{F}\varphi(\gamma)}\right)^{\frac{1}{\gamma-1}},\left(1-\frac{1}{\gamma-1}\right)^{\gamma-1}\frac{2\delta_{k-1}}{1-\sqrt{1-\varphi(\gamma)}}\right\}
(11γ1)δ^k1(24)h(δ^k1).\displaystyle\leq\left(1-\frac{1}{\gamma-1}\right)\hat{\delta}_{k-1}\overset{\eqref{rel:low_bound_h_hat}}{\leq}h(\hat{\delta}_{k-1}).

We have shown in the proof of Theorem 4.1 that also this variant of hh is nondecreasing and thus, using the same reasoning as in the case γ(1,2)\gamma\in(1,2) we obtain the above result. ∎

Proof of Theorem 5.1.

In this proof we use notation KtK_{t} for the number of iterations in the tt-th epoch, large enough to turn the stopping criterion to be satisfied. We denote xk,tx^{k,t} as the kk-th IPPA iterate during tt-th epoch. From Theorem 8.3 (from Appendix B) we have that

Kt=distX(xt)δt\displaystyle K_{t}=\left\lceil\frac{\text{dist}_{X^{*}}(x^{t})}{\delta_{t}}\right\rceil (25)

iterations are sufficient to guarantee δtFμt(x^Kt,t)5δt\lVert\nabla_{\delta_{t}}F_{\mu_{t}}(\hat{x}^{K_{t},t})\rVert\leq 5\delta_{t}^{\nabla} and thus the end of tt-th epoch. Furthermore, by the triangle inequality

Fμt(xt+1)δtδtFμt(xt+1)5δt,\displaystyle\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert-\delta_{t}^{\nabla}\leq\lVert\nabla_{\delta_{t}}F_{\mu_{t}}(x^{t+1})\rVert\leq 5\delta_{t}^{\nabla}, (26)

which implies that the end of tt-th epoch we also have Fμt(xt+1)6δt\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq 6\delta_{t}^{\nabla}.

Let WSM hold and recall assumption distX(x0)μ0σF\text{dist}_{X^{*}}(x^{0})\geq\mu_{0}\sigma_{F}. For sufficiently large tt we show that restartation loses any effect and after a single iteration the stopping criterion of epoch tt is satisfied. We separate the analysis in two stages: the first stage covers the epochs that produce xt+1x^{t+1} satisfying Fμt(xt+1)>σF\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert>\sigma_{F}. The second one covers the rest of epochs when the gradient norms decrease below the threshold σF\sigma_{F}, i.e. Fμt(xt+1)σF\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq\sigma_{F}.

In the first stage, the stopping rule Fμt(xt+1)δt\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq\delta_{t}^{\nabla} limits the first stage to maximum T1=1ρlog(12δ0σF)T_{1}=\left\lceil\frac{1}{\rho}\log\left(\frac{12\delta_{0}}{\sigma_{F}}\right)\right\rceil epochs. The total number of iterations in this stage is bounded by: t=0T1KtT1K0\sum_{t=0}^{T_{1}}K_{t}\leq T_{1}K_{0}.

For the second stage when Fμt1(xt)<σF\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert<\sigma_{F}, Lemma 3.1 states that proxμ,F(xt)=πX(xt)\text{prox}_{\mu,F}(x^{t})=\pi_{X^{*}}(x^{t}) and thus we have: distX(xt)=μt1Fμt1(xt)μt1δt1=δt1<2μt1σF=μtσF.\;\;\text{dist}_{X^{*}}(x^{t})=\mu_{t-1}\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert\leq\mu_{t-1}\delta_{t-1}^{\nabla}=\delta_{t-1}<2\mu_{t-1}\sigma_{F}=\mu_{t}\sigma_{F}. Therefore, by Theorem 4.1:

distX(xt+1)\displaystyle\text{dist}_{X^{*}}(x^{t+1}) max{distX(xt)Kt(μtσFδt),δt}\displaystyle\leq\max\left\{\text{dist}_{X^{*}}(x^{t})-K_{t}(\mu_{t}\sigma_{F}-\delta_{t}),\delta_{t}\right\}
max{μtσFKt(μtσFδt),δt}\displaystyle\leq\max\left\{\mu_{t}\sigma_{F}-K_{t}(\mu_{t}\sigma_{F}-\delta_{t}),\delta_{t}\right\}

which means that after a single iteration, i.e. Kt=1K_{t}=1, it is guaranteed that distX(xt+1)δt\text{dist}_{X^{*}}(x^{t+1})\leq\delta_{t}. In this phase, the output of IPPA is in fact the only point produced in tt-th epoch and the necessary number of epochs (or equivalently the number of IPPA iterations) is T2=𝒪(1ρ1log(δT1ϵ))T_{2}=\mathcal{O}\left(\frac{1}{\rho-1}\log\left(\frac{\delta_{T_{1}}}{\epsilon}\right)\right).

Let γ=2\gamma=2. By Lemma 2.1(ii)(ii) and convexity of FμF_{\mu} yields σF1+2σFμdistX2(x)Fμ(x),xπX(x)Fμ(x)distX(x)\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\text{dist}_{X^{*}}^{2}(x)\leq\langle\nabla F_{\mu}(x),x-\pi_{X^{*}}(x)\rangle\leq\lVert\nabla F_{\mu}(x)\rVert\text{dist}_{X^{*}}(x). By using the inequality formed by the first and last terms, together with the relation (26), then at the end of t1t-1 epoch:

distX(xt)Fμt1(xt)(1μt1σF+2)6δt1(1μt1σF+2),\displaystyle\text{dist}_{X^{*}}(x^{t})\leq\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert\left(\frac{1}{\mu_{t-1}\sigma_{F}}+2\right)\leq 6\delta_{t-1}\left(\frac{1}{\mu_{t-1}\sigma_{F}}+2\right),

suggesting that the necessary number of epochs is T=𝒪(1ρlog(δ0ϵ)).T=\mathcal{O}\left(\frac{1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right)\right). This fact allow us to refine KtK_{t} in (25) as Kt=32ρ(2+1μt1σF)t1.K_{t}=\left\lceil 3\cdot 2^{\rho}\left(2+\frac{1}{\mu_{t-1}\sigma_{F}}\right)\right\rceil\quad\forall t\geq 1. Since KtK_{t} is bounded, then the total number of IPPA iterations has the order t=0T1Kt=K0+𝒪(T).\sum_{t=0}^{T-1}K_{t}=K_{0}+\mathcal{O}\left(T\right).

Let γ>1\gamma>1. Similarly as in the previous two cases, Fμt1(xt)6δt\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert\leq 6\delta_{t}^{\nabla} guaranteed by t1t-1 epoch further implies distX(xt)(7)max{12δt1φ(γ),[6δt1φ(γ)σF]1γ1},\;\;\text{dist}_{X^{*}}(x^{t})\overset{\eqref{gamma_all_Env_Holder_error_bound}}{\leq}\max\left\{\frac{12\delta_{t-1}}{\varphi(\gamma)},\left[\frac{6\delta_{t-1}^{\nabla}}{\varphi(\gamma)\sigma_{F}}\right]^{\frac{1}{\gamma-1}}\right\}, which suggests that the maximal number of epochs is

T=𝒪(max{γ1ρlog(δ0ϵ),1ρ1log(μ0δ0ϵ)}).\displaystyle T=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right),\frac{1}{\rho-1}\log\left(\frac{\mu_{0}\delta_{0}}{\epsilon}\right)\right\}\right).

Now, KtK_{t} of (25) becomes: Kt=max{32ρ+1φ(γ),D2t[(ρ1)ργ1]+ργ1}\;K_{t}=\left\lceil\max\left\{\frac{3\cdot 2^{\rho+1}}{\varphi(\gamma)},D2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]+\frac{\rho}{\gamma-1}}\right\}\right\rceil for all t1,t\geq 1, where D=(6δ0σFφ(γ))2/γ1μ0δ0D=\left(\frac{6\delta_{0}}{\sigma_{F}\varphi(\gamma)}\right)^{2/\gamma}\frac{1}{\mu_{0}\delta_{0}}. For γ2\gamma\leq 2, KtK_{t} is bounded, thus for γ>2\gamma>2 we further estimate the total number of IPPA iterations by summing:

t=0T1Kt\displaystyle\sum_{t=0}^{T_{1}}K_{t} =K0+t=1T1max{32ρ+1φ(γ),D2t[(ρ1)ργ1]+ργ1}\displaystyle=K_{0}+\sum_{t=1}^{T_{1}}\left\lceil\max\left\{\frac{3\cdot 2^{\rho+1}}{\varphi(\gamma)},D2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]+\frac{\rho}{\gamma-1}}\right\}\right\rceil (27)
K0+T1+max{32ρ+1φ(γ)T1,D2ργ1t=1T2t[(ρ1)ργ1]}\displaystyle\leq K_{0}+T_{1}+\max\left\{\frac{3\cdot 2^{\rho+1}}{\varphi(\gamma)}T_{1},D2^{\frac{\rho}{\gamma-1}}\sum_{t=1}^{T}2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]}\right\}

Finally, t=1T2t[(ρ1)ργ1]=𝒪((δ0ϵ)max{(11ρ)(γ1)1,1ρ(ρ1)(γ1)})\sum_{t=1}^{T}2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]}=\mathcal{O}\left(\left(\frac{\delta_{0}}{\epsilon}\right)^{\max\left\{\left(1-\frac{1}{\rho}\right)(\gamma-1)-1,1-\frac{\rho}{(\rho-1)(\gamma-1)}\right\}}\right).

Proof of Theorem 6.1.

We keep the same notations as in the proof of Theorem 5.1 and redenote δt:=δt\delta_{t}^{\nabla}:=\delta_{t}. By assumption δ02Lf\delta_{0}\geq 2L_{f} we observe that (4α0μ0Lf2)122μ0Lfμ0δ0(4\alpha_{0}\mu_{0}L_{f}^{2})^{\frac{1}{2}}\leq 2\mu_{0}L_{f}\leq\mu_{0}\delta_{0}. Since αt=α02(2ρ1)t\alpha_{t}=\alpha_{0}2^{-(2\rho-1)t}, then the inequality 4αtμtLf2(δt)24\alpha_{t}\mu_{t}L_{f}^{2}\leq(\delta_{t})^{2} recursively holds for all t0t\geq 0. This last inequality allow Theorem 8.8 to establish that at tt-epoch there are enough:

[t=0]\displaystyle[t=0] N0=8log(Fμ0(x0)δ0)\displaystyle\quad N_{0}=\left\lceil 8\log\left(\frac{\lVert\nabla F_{\mu_{0}}(x^{0})\rVert}{\delta_{0}}\right)\right\rceil
[t>0]\displaystyle[t>0] Nt=422ρtlog(μt1δt1μtδt)=4(ρ1)22ρt\displaystyle\quad N_{t}=\left\lceil 4\cdot 2^{2\rho t}\log\left(\frac{\mu_{t-1}\delta_{t-1}}{\mu_{t}\delta_{t}}\right)\right\rceil=\left\lceil 4(\rho-1)2^{2\rho t}\right\rceil

PsGM iterations. Lastly, we compute the total computational cost by summing over all NtN_{t}. Recall that at the end of tt-th epoch RIPPA guarantees that Fμt(xt+1)δt\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq\delta_{t}^{\nabla}. After a number of epochs of T=1ρlog(δ0σF)T=\left\lceil\frac{1}{\rho}\log\left(\frac{\delta_{0}}{\sigma_{F}}\right)\right\rceil, measuring a total number of:

𝒯1=t=0T11NtKt\displaystyle\mathcal{T}_{1}=\sum\limits_{t=0}^{T_{1}-1}N_{t}K_{t} =N0K0+K0t=1T114(ρ1)22ρt\displaystyle=N_{0}K_{0}+K_{0}\sum\limits_{t=1}^{T_{1}-1}\left\lceil 4(\rho-1)2^{2\rho t}\right\rceil
=𝒪(K022ρT1)=𝒪(K0(δ0σF)2).\displaystyle=\mathcal{O}\left(K_{0}2^{2\rho T_{1}}\right)=\mathcal{O}\left(K_{0}\left(\frac{\delta_{0}}{\sigma_{F}}\right)^{2}\right).

PsGM iterations, then one has xTx^{T} satisfying distX(xT)μT1δTμT1σF\text{dist}_{X^{*}}(x^{T})\leq\mu_{T-1}\delta_{T}\leq\mu_{T-1}\sigma_{F}. Now we evaluate the final postprocessing loop, which aims to bring the iterate xTx^{T} into the ϵ\epsilon-suboptimality region. By Theorem 8.8, a single call of PsGM(xT,xT,β0,μT1,2(LfδT)2)PsGM\left(x^{T},x^{T},\beta_{0},\mu_{T-1},\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil\right) guarantees that distX(x~1)β0Lf22σFβ0Lf22δT=δT2\text{dist}_{X^{*}}(\tilde{x}^{1})\leq\frac{\beta_{0}L_{f}^{2}}{2\sigma_{F}}\leq\frac{\beta_{0}L_{f}^{2}}{2\delta_{T}}=\frac{\delta_{T}}{2}. In general, if distX(x~k)βk1Lf22σFβk1Lf22δT\text{dist}_{X^{*}}(\tilde{x}^{k})\leq\frac{\beta_{k-1}L_{f}^{2}}{2\sigma_{F}}\leq\frac{\beta_{k-1}L_{f}^{2}}{2\delta_{T}}, Theorem 8.8 specifies that after 2(distX(x~k)βkLf)2Theorem8.82(LfδT)2\left\lceil 2\left(\frac{\text{dist}_{X^{*}}(\tilde{x}^{k})}{\beta_{k}L_{f}}\right)^{2}\right\rceil\overset{\text{Theorem}\;\ref{th:Holder_gradients}}{\leq}\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil routine iterations, the output x~k+1\tilde{x}^{k+1} satisfies distX(x~k+1)βkLf22σFβkLf22δT\text{dist}_{X^{*}}(\tilde{x}^{k+1})\leq\frac{\beta_{k}L_{f}^{2}}{2\sigma_{F}}\leq\frac{\beta_{k}L_{f}^{2}}{2\delta_{T}}. Therefore, by setting N=2(LfδT)2,K=log(δT/ϵ)N=\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil,K=\lceil\log(\delta_{T}/\epsilon)\rceil and by running the final procedure Postprocessing(xT,δT2/Lf2,μT1,N,Kx^{T},\delta_{T}^{2}/L_{f}^{2},\mu_{T-1},N,K), the ”second phase” loop produces distX(xK)ϵ\text{dist}_{X^{*}}(x^{K})\leq\epsilon. The total cost of the ”second phase” loop can be computed by 𝒯2=KN=2(LfδT)2log(δTϵ).\mathcal{T}_{2}=KN=\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil\left\lceil\log\left(\frac{\delta_{T}}{\epsilon}\right)\right\rceil. Lastly, by taking into account that δt=𝒪(σF)\delta_{t}=\mathcal{O}\left(\sigma_{F}\right), then the total complexity has the order: 𝒯1+𝒯2=𝒪(K0(δ0σF)2+(LfσF)2log(σFϵ)),\;\mathcal{T}_{1}+\mathcal{T}_{2}=\mathcal{O}\left(K_{0}\left(\frac{\delta_{0}}{\sigma_{F}}\right)^{2}+\left(\frac{L_{f}}{\sigma_{F}}\right)^{2}\log\left(\frac{\sigma_{F}}{\epsilon}\right)\right), which confirms the first part of the above result.

Now we prove the second part of the result. By assumption δ0(2Lf2)12(1ν)μ0ν1ν\delta_{0}\geq(2L_{f}^{2})^{\frac{1}{2(1-\nu)}}\mu_{0}^{\frac{\nu}{1-\nu}} we observe that:

(4α0μ0Lf2)12(1ν)μ0δ0.\displaystyle(4\alpha_{0}\mu_{0}L_{f}^{2})^{\frac{1}{2(1-\nu)}}\leq\mu_{0}\delta_{0}. (28)

Further we show that, for appropriate stepsize choices αt\alpha_{t}, the inequality 4αtμtLf2(μtδt)2(1ν)4\alpha_{t}\mu_{t}L_{f}^{2}\leq(\mu_{t}\delta_{t})^{2(1-\nu)} recursively holds for all t0t\geq 0 under initial condition (28). Indeed let 2(ρ1)(1ν)q12(\rho-1)(1-\nu)\leq q-1, then

4αtμtLf2\displaystyle 4\alpha_{t}\mu_{t}L_{f}^{2} =2μ02Lf22(q1)t(28)(μ0δ0)2(1ν)22(ρ1)(1ν)t=(μtδt)2(1ν).\displaystyle=\frac{2\mu_{0}^{2}L_{f}^{2}}{2^{(q-1)t}}\overset{\eqref{rel:induction_initial_step}}{\leq}\frac{(\mu_{0}\delta_{0})^{2(1-\nu)}}{2^{2(\rho-1)(1-\nu)t}}=(\mu_{t}\delta_{t})^{2(1-\nu)}. (29)

The inequality (29) allow Theorem 8.8 to establish the necessary inner complexity for each IPPA iteration. By using bounds from Theorem 8.8, at tt-epoch there are enough:

[t=0]\displaystyle[t=0] N0=8log(Fμ0(x0)δ0)\displaystyle\quad N_{0}=\left\lceil 8\log\left(\frac{\lVert\nabla F_{\mu_{0}}(x^{0})\rVert}{\delta_{0}}\right)\right\rceil
[t>0]\displaystyle[t>0] Nt=42(q+1)tlog(μt1δt1μtδt)=4(ρ1)2(q+1)t\displaystyle\quad N_{t}=\left\lceil 4\cdot 2^{(q+1)t}\log\left(\frac{\mu_{t-1}\delta_{t-1}}{\mu_{t}\delta_{t}}\right)\right\rceil=\left\lceil 4(\rho-1)2^{(q+1)t}\right\rceil

PsGM iterations. We still keep the same notations from Theorem 5.1.

Let γ>1\gamma>1 and recall T=𝒪(max{γ1ρlog(μ0δ0/ϵ),1ρ1log(μ0δ0/ϵ)}).T=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log(\mu_{0}\delta_{0}/\epsilon),\frac{1}{\rho-1}\log(\mu_{0}\delta_{0}/\epsilon)\right\}\right). By following a similar reasoning, we require:

t=0T1NtKt\displaystyle\sum\limits_{t=0}^{T-1}N_{t}K_{t} =N0K0+𝒪(t=1T2t[(ρ1)ργ1+q+1])\displaystyle=N_{0}K_{0}+\mathcal{O}\left(\sum_{t=1}^{T}2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]}\right)
=N0K0+𝒪(2T[(ρ1)ργ1+q+1]).\displaystyle=N_{0}K_{0}+\mathcal{O}\left(2^{T\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]}\right).

Let ζ=ρ1ρ(γ1)1\zeta=\frac{\rho-1}{\rho}(\gamma-1)\geq 1, then the exponent of the last term becomes:

T[(ρ1)ργ1+q+1]\displaystyle T\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right] =max{γ1ρ,1ρ1}[(ρ1)ργ1+q+1]\displaystyle=\max\left\{\frac{\gamma-1}{\rho},\frac{1}{\rho-1}\right\}\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]
=γ2+qρ(γ1).\displaystyle=\gamma-2+\frac{q}{\rho}(\gamma-1).

Otherwise, if ζ<1\zeta<1 then the respective exponent turns into: T[(ρ1)ργ1+q+1]=ρρ1γ2γ1+qρ1.T\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]=\frac{\rho}{\rho-1}\frac{\gamma-2}{\gamma-1}+\frac{q}{\rho-1}.

8.1 Appendix B

Lemma 8.1.

Let the sequence {uk}k0\{u_{k}\}_{k\geq 0} satisfy: uk+1αkuk+βk,u_{k+1}\leq\alpha_{k}u_{k}+\beta_{k}, where αk[0,1),{βk}k0\alpha_{k}\in[0,1),\{\beta_{k}\}_{k\geq 0} nonincreasing and i=0βiΓ\sum\limits_{i=0}^{\infty}\beta_{i}\leq\Gamma. Then the following bound holds:

uku0j=0kαj+Γj=k/2+1kαj+maxk/2+1ikβi1αi.\displaystyle u_{k}\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\max\limits_{\lceil k/2\rceil+1\leq i\leq k}\frac{\beta_{i}}{1-\alpha_{i}}.

Moreover, if αk=α[0,1)\alpha_{k}=\alpha\in[0,1) then: ukα(k4)/2(u0+Γ)+βk/2+11α.\;\;u_{k}\leq\alpha^{(k-4)/2}(u_{0}+\Gamma)+\frac{\beta_{\lceil k/2\rceil+1}}{1-\alpha}.

Proof of Lemma 8.1.

By using a simple induction we get:

uk+1\displaystyle u_{k+1} αkuk+βku0j=0kαj+i=0kβij=i+1kαj\displaystyle\leq\alpha_{k}u_{k}+\beta_{k}\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\sum\limits_{i=0}^{k}\beta_{i}\prod\limits_{j=i+1}^{k}\alpha_{j}
=u0j=0kαj+i=0k/2βij=i+1kαj+i=k/2+1kβij=i+1kαj\displaystyle=u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\sum\limits_{i=0}^{\lceil k/2\rceil}\beta_{i}\prod\limits_{j=i+1}^{k}\alpha_{j}+\sum\limits_{i=\lceil k/2\rceil+1}^{k}\beta_{i}\prod\limits_{j=i+1}^{k}\alpha_{j}
u0j=0kαj+Γj=k/2+1kαj+i=k/2+1kβi1αi(1αi)j=i+1kαj\displaystyle\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\sum\limits_{i=\lceil k/2\rceil+1}^{k}\frac{\beta_{i}}{1-\alpha_{i}}(1-\alpha_{i})\prod\limits_{j=i+1}^{k}\alpha_{j}
u0j=0kαj+Γj=k/2+1kαj+maxk/2+1ikβi1αii=k/2+1k(1αi)j=i+1kαj\displaystyle\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\max\limits_{\lceil k/2\rceil+1\leq i\leq k}\frac{\beta_{i}}{1-\alpha_{i}}\sum\limits_{i=\lceil k/2\rceil+1}^{k}(1-\alpha_{i})\prod\limits_{j=i+1}^{k}\alpha_{j}
u0j=0kαj+Γj=k/2+1kαj+maxk/2+1ikβi1αi.\displaystyle\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\max\limits_{\lceil k/2\rceil+1\leq i\leq k}\frac{\beta_{i}}{1-\alpha_{i}}.

Lemma 8.2.

Let a1ana_{1}\geq\cdots\geq a_{n} be nn real numbers, then the following relation holds:

max{0,an,an+an1,,j=1naj}=max{0,j=1naj}.\displaystyle\max\left\{0,a_{n},a_{n}+a_{n-1},\cdots,\sum\limits_{j=1}^{n}a_{j}\right\}=\max\left\{0,\sum\limits_{j=1}^{n}a_{j}\right\}.
Proof.

Since we have:

max{0,an,,j=knaj}=max{max{0,an},,max{0,j=1naj}},\max\left\{0,a_{n},\cdots,\sum\limits_{j=k}^{n}a_{j}\right\}=\max\left\{\max\{0,a_{n}\},\cdots,\max\left\{0,\sum\limits_{j=1}^{n}a_{j}\right\}\right\},

then it is sufficient to show that for any positive kk:

max{0,j=knaj}max{0,j=k1naj}.\displaystyle\max\left\{0,\sum\limits_{j=k}^{n}a_{j}\right\}\leq\max\left\{0,\sum\limits_{j=k-1}^{n}a_{j}\right\}. (30)

Indeed, if ak10a_{k-1}\geq 0 then (30) results straightforward. Consider that ak1<0a_{k-1}<0, then by monotonicity we have: aj<0a_{j}<0 for all j>k1j>k-1 and thus j=knaj<0\sum\limits_{j=k}^{n}a_{j}<0. In this case it is obvious that max{0,j=knaj}=max{0,j=k1naj}=0\max\left\{0,\sum\limits_{j=k}^{n}a_{j}\right\}=\max\left\{0,\sum\limits_{j=k-1}^{n}a_{j}\right\}=0, which confirms the final above results. ∎

Theorem 8.3.

Let {xk}k0\{x^{k}\}_{k\geq 0} be the sequence generated by IPPA with inexactness criterion (9), then the following relation hold:

distX(xk+1)\displaystyle\text{dist}_{X^{*}}(x^{k+1}) distX(xk)μFμ(xk)FdistX(xk)+δk.\displaystyle\leq\text{dist}_{X^{*}}(x^{k})-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}(x^{k})}+\delta_{k}.

Moreover, assume constant accuracy δk=δ\delta_{k}=\delta. Then after at most:

distX(x0)δ\displaystyle\left\lceil\frac{\text{dist}_{X^{*}}(x^{0})}{\delta}\right\rceil

iterations, a point x~{x0,,xk}\tilde{x}\in\{x^{0},\cdots,x^{k}\} satisfies δFμ(x~)4δμ\lVert\nabla_{\delta}F_{\mu}(\tilde{x})\rVert\leq\frac{4\delta}{\mu} and distX(x~)distX(x0)\text{dist}_{X^{*}}(\tilde{x})\leq\text{dist}_{X^{*}}(x^{0}).

Proof.

By convexity of FF, for any zz we derive:

proxμF(xk)z2=xkz2+2proxμF(xk)xk,xkz+proxμF(xk)xk2\displaystyle\lVert\text{prox}_{\mu}^{F}(x^{k})-z\rVert^{2}=\lVert x^{k}-z\rVert^{2}+2\langle\text{prox}_{\mu}^{F}(x^{k})-x^{k},x^{k}-z\rangle+\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k}\rVert^{2}
=xkz22μF(proxμF(xk)),proxμF(xk)zproxμF(xk)xk2\displaystyle=\lVert x^{k}-z\rVert^{2}-2\mu\langle\nabla F(\text{prox}_{\mu}^{F}(x^{k})),\text{prox}_{\mu}^{F}(x^{k})-z\rangle-\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k}\rVert^{2}
xkz22μ(F(proxμF(xk))F(z)+12μproxμF(xk)xk2)\displaystyle\leq\lVert x^{k}-z\rVert^{2}-2\mu\left(F(\text{prox}_{\mu}^{F}(x^{k}))-F(z)+\frac{1}{2\mu}\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k}\rVert^{2}\right)
=xkz22μ(Fμ(xk)F(z)).\displaystyle=\lVert x^{k}-z\rVert^{2}-2\mu\left(F_{\mu}(x^{k})-F(z)\right). (31)

In order to obtain, by the triangle inequality we simply derive:

xk+1zproxμF(xk)z\displaystyle\lVert x^{k+1}-z\rVert\leq\lVert\text{prox}_{\mu}^{F}(x^{k})-z\rVert +proxμF(xk)xk+1\displaystyle+\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k+1}\rVert
proxμF(xk)z+δ\displaystyle\leq\lVert\text{prox}_{\mu}^{F}(x^{k})-z\rVert+\delta (32)

Finally, by taking z=πX(x)z=\pi_{X^{*}}(x), then:

distX(xk+1)xk+1πX(xk)(8.1)proxμF(xk)πX(xk)+δ\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\lVert x^{k+1}-\pi_{X^{*}}(x^{k})\rVert\overset{\eqref{eq:triangle}}{\leq}\lVert\text{prox}_{\mu}^{F}(x^{k})-\pi_{X^{*}}(x^{k})\rVert+\delta
(31)distX2(xk)2μ(Fμ(xk)F)+δ\displaystyle\overset{\eqref{eq:0acc_recurrence}}{\leq}\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(F_{\mu}(x^{k})-F^{*}\right)}+\delta (33)
distX(xk)12μFμ(xk)FdistX2(xk)+δ\displaystyle\leq\text{dist}_{X^{*}}(x^{k})\sqrt{1-2\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}^{2}(x^{k})}}+\delta
distX(xk)(1μFμ(xk)FdistX2(xk))+δ,\displaystyle\leq\text{dist}_{X^{*}}(x^{k})\left(1-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}^{2}(x^{k})}\right)+\delta,

where in the last inequality we used the fact 12a1a\sqrt{1-2a}\leq 1-a. The last inequality leads to the first part from our result:

distX(xk+1)\displaystyle\text{dist}_{X^{*}}(x^{k+1}) distX(xk)μFμ(xk)FdistX(xk)+δk.\displaystyle\leq\text{dist}_{X^{*}}(x^{k})-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}(x^{k})}+\delta_{k}. (34)

Assume that

Fμ(x0)FdistX(x0)δμ\displaystyle\frac{F_{\mu}(x^{0})-F^{*}}{\text{dist}_{X^{*}}(x^{0})}\geq\frac{\delta}{\mu} (35)

and denote K=min{k0:distX(xk+1)distX(xk)}K=\min\{k\geq 0\;:\;\text{dist}_{X^{*}}(x^{k+1})\geq\text{dist}_{X^{*}}(x^{k})\}. Then (34) has two consequences. First, obviously for all k<Kk<K:

Fμ(xk)F1μ(distX2(xk)distX2(xk+1))+δμdistX(x0).\displaystyle F_{\mu}(x^{k})-F^{*}\leq\frac{1}{\mu}\left(\text{dist}_{X^{*}}^{2}(x^{k})-\text{dist}_{X^{*}}^{2}(x^{k+1})\right)+\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{0}).

By further summing over the history we obtain:

Fμ(x^k)Fmin0ikFμ(xi)F\displaystyle F_{\mu}(\hat{x}^{k})-F^{*}\leq\min\limits_{0\leq i\leq k}F_{\mu}(x^{i})-F^{*} 1k+1i=0kFμ(xi)F\displaystyle\leq\frac{1}{k+1}\sum\limits_{i=0}^{k}F_{\mu}(x^{i})-F^{*}
distX2(x0)μ(k+1)+δμdistX(x0).\displaystyle\leq\frac{\text{dist}_{X^{*}}^{2}(x^{0})}{\mu(k+1)}+\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{0}). (36)

Second, since KK is the first iteration at which the residual optimal distance increases, then distX(xK)distX(xK1)distX(x0)\text{dist}_{X^{*}}(x^{K})\leq\text{dist}_{X^{*}}(x^{K-1})\leq\cdots\leq\text{dist}_{X^{*}}(x^{0}) and (34) guarantees:

Fμ(xK)FδμdistX(xK)δμdistX(x0).\displaystyle F_{\mu}(x^{K})-F^{*}\leq\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{K})\leq\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{0}).

By unifying both cases we conclude that after at most: Kδ=distX(x0)δK_{\delta}=\frac{\text{dist}_{X^{*}}(x^{0})}{\delta} iterations the threshold: Fμ(xKδ)F2δμdistX(x0)F_{\mu}(x^{K_{\delta}})-F^{*}\leq\frac{2\delta}{\mu}\text{dist}_{X^{*}}(x^{0}) is reached. Notice that if (35) do not hold, then Kδ=0K_{\delta}=0.

Now we use the same arguments from [30, Sec. I] to bound the norm of the gradients. Observe that the Lipschitz gradients property of FμF_{\mu} leads to:

Fμ(x^k+1)Fμ(x^kμδF(x^k))\displaystyle F_{\mu}(\hat{x}^{k+1})\leq F_{\mu}(\hat{x}^{k}-\mu\nabla_{\delta}F(\hat{x}^{k}))
=Fμ(x^k)μFμ(x^k),δFμ(x^k)+μ2δFμ(x^k)2\displaystyle=F_{\mu}(\hat{x}^{k})-\mu\langle\nabla F_{\mu}(\hat{x}^{k}),\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rangle+\frac{\mu}{2}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}
=Fμ(x^k)+μδFμ(x^k)Fμ(x^k),δFμ(x^k)μ2δFμ(x^k)2\displaystyle=F_{\mu}(\hat{x}^{k})+\mu\langle\nabla_{\delta}F_{\mu}(\hat{x}^{k})-\nabla F_{\mu}(\hat{x}^{k}),\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rangle-\frac{\mu}{2}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}
=Fμ(x^k)μ4δFμ(x^k)2+μδFμ(x^k)Fμ(x^k),δFμ(x^k)μ4δFμ(x^k)2\displaystyle=F_{\mu}(\hat{x}^{k})-\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}+\mu\langle\nabla_{\delta}F_{\mu}(\hat{x}^{k})-\nabla F_{\mu}(\hat{x}^{k}),\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rangle-\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}
Fμ(x^k)μ4δFμ(x^k)2+μδFμ(x^k)Fμ(x^k)2\displaystyle\leq F_{\mu}(\hat{x}^{k})-\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}+\mu\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})-\nabla F_{\mu}(\hat{x}^{k})\rVert^{2}
=Fμ(x^k)μ4δFμ(x^k)2+δ2μ=Fμ(x^k/2)kμ8δFμ(x^k/2)2+kδ22μ.\displaystyle=\!F_{\mu}(\hat{x}^{k})\!-\!\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}\!+\!\frac{\delta^{2}}{\mu}\!=\!F_{\mu}(\hat{x}^{k/2})-\frac{k\mu}{8}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k/2})\rVert^{2}+\frac{k\delta^{2}}{2\mu}. (37)

By using (8.1) into (37), then for kKδk\geq K_{\delta}

δFμ(x^k)2\displaystyle\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2} 4(Fμ(x^k)F)kμ+δ2μ8distX(x0)δkμ2+4δ2μ2\displaystyle\leq\frac{4(F_{\mu}(\hat{x}^{k})-F^{*})}{k\mu}+\frac{\delta^{2}}{\mu}\leq\frac{8\text{dist}_{X^{*}}(x^{0})\delta}{k\mu^{2}}+\frac{4\delta^{2}}{\mu^{2}}
8δ2μ2+4δ2μ2=12δ2μ2.\displaystyle\leq\frac{8\delta^{2}}{\mu^{2}}+\frac{4\delta^{2}}{\mu^{2}}=\frac{12\delta^{2}}{\mu^{2}}.

Lemma 8.4.

Let γ\gamma-HG holds for the objective function FF. Then IPPA sequence {xk}k0\{x^{k}\}_{k\geq 0} with variable accuracies δk\delta_{k}, satisfies the following reccurences:

(i)(i) Under weak sharp minima γ=1\gamma=1

distX(xk+1)max{distX(xk)μσF,0}+δk\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\max\left\{\text{dist}_{X^{*}}(x^{k})-\mu\sigma_{F},0\right\}+\delta_{k}

(ii)(ii) Under quadratic growth γ=2\gamma=2

distX(xk+1)11+2μσFdistX(xk)+δk\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\frac{1}{\sqrt{1+2\mu\sigma_{F}}}\text{dist}_{X^{*}}(x^{k})+\delta_{k}

(iii)(iii) Under general Holderian growth γ1\gamma\geq 1

distX(xk+1)max{distX(xk)μφ(γ)σFdistXγ1(xk),(1φ(γ)2)distX(xk)}+δk,\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\max\left\{\text{dist}_{X^{*}}(x^{k})-\mu\varphi(\gamma)\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x^{k}),\left(1-\frac{\varphi(\gamma)}{2}\right)\text{dist}_{X^{*}}(x^{k})\right\}+\delta_{k},
Proof.

(i)(i) Assume distX(xk)>σFμ\text{dist}_{X^{*}}(x^{k})>\sigma_{F}\mu then from (the proof of) Theorem 8.3 and Lemma 2.1:

distX(xk+1)\displaystyle\text{dist}_{X^{*}}(x^{k+1}) distX2(xk)2μ(Fμ(xk)F)+δk\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(F_{\mu}(x^{k})-F^{*}\right)}+\delta_{k}
distX2(xk)2μ(σFdistX(xk)σF2μ2)+δk\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(\sigma_{F}\text{dist}_{X^{*}}(x^{k})-\frac{\sigma^{2}_{F}\mu}{2}\right)}+\delta_{k}
=(distX(xk)μσF)2+δk=distX(xk)(μσFδk).\displaystyle=\sqrt{\left(\text{dist}_{X^{*}}(x^{k})-\mu\sigma_{F}\right)^{2}}+\delta_{k}=\text{dist}_{X^{*}}(x^{k})-\left(\mu\sigma_{F}-\delta_{k}\right).

In short,

distX(xk+1){distX(xk)(μσFδk),ifdistX(xk)>σFμδk,ifdistX(xk)σFμ\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\begin{cases}\text{dist}_{X^{*}}(x^{k})-(\mu\sigma_{F}-\delta_{k}),&\text{if}\;\text{dist}_{X^{*}}(x^{k})>\sigma_{F}\mu\\ \delta_{k},&\text{if}\;\text{dist}_{X^{*}}(x^{k})\leq\sigma_{F}\mu\end{cases}

(ii)(ii) By using the same relations in the case γ=2\gamma=2, then:

distX(xk+1)\displaystyle\text{dist}_{X^{*}}(x^{k+1}) distX2(xk)2μ(Fμ(xk)F)+δk\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(F_{\mu}(x^{k})-F^{*}\right)}+\delta_{k}
distX2(xk)2μσF1+2μσFdistX2(xk)+δk=11+2μσFdistX(xk)+δk.\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-\frac{2\mu\sigma_{F}}{1+2\mu\sigma_{F}}\text{dist}_{X^{*}}^{2}(x^{k})}+\delta_{k}=\frac{1}{\sqrt{1+2\mu\sigma_{F}}}\text{dist}_{X^{*}}(x^{k})+\delta_{k}.

(iii)(iii) Under Holderian growth, similarly Theorem 8.3 and Lemma 2.1 lead to:

distX(xk+1)distX(xk)μFμ(xk)FdistX(xk)+δk\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\text{dist}_{X^{*}}(x^{k})-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}(x^{k})}+\delta_{k}
distX(xk)μφ(γ)min{σFdistXγ1(xk),12μdistX(xk)}+δk\displaystyle\leq\text{dist}_{X^{*}}(x^{k})-\mu\varphi(\gamma)\min\left\{\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x^{k}),\frac{1}{2\mu}\text{dist}_{X^{*}}(x^{k})\right\}+\delta_{k}
=max{distX(xk)μφ(γ)σFdistXγ1(xk),(1φ(γ)/2)distX(xk)}+δk.\displaystyle=\max\left\{\text{dist}_{X^{*}}(x^{k})-\mu\varphi(\gamma)\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x^{k}),(1-\varphi(\gamma)/2)\text{dist}_{X^{*}}(x^{k})\right\}+\delta_{k}.

Theorem 8.5.

Let α,ρ>0,β(0,1)\alpha,\rho>0,\beta\in(0,1) and h(r)=max{rαrρ,βr}h(r)=\max\{r-\alpha r^{\rho},\beta r\}. Then the sequence rk+1=h(rk)r_{k+1}=h(r_{k}) satisfies:

(i)(i) For ρ(0,1)\rho\in(0,1):

rk\displaystyle r_{k} {(1α2r01ρ)k[r0kα2(α1β)ρ1ρ],ifrk>(α1β)11ρβkk01(α1β)11ρ,ifrk(α1β)11ρ.\displaystyle\leq\begin{cases}\left(1-\frac{\alpha}{2r_{0}^{1-\rho}}\right)^{k}\left[r_{0}-k\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{\rho}{1-\rho}}\right],&\text{if}\;\;r_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\\ \beta^{k-k_{0}-1}\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}},&\text{if}\;\;r_{k}\leq\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}.\end{cases} (38)

(ii)(ii) For ρ1\rho\geq 1:

rk{β^kr0,ifrk>(1β^α)1ρ1[11min{r0ρ1,1β^α}+(ρ1)(kk0)α]1ρ1,ifrk(1β^α)1ρ1,\displaystyle r_{k}\leq\begin{cases}\hat{\beta}^{k}r_{0},&\text{if}\;r_{k}>\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\\ \left[\frac{1}{\frac{1}{\min\{r_{0}^{\rho-1},\frac{1-\hat{\beta}}{\alpha}\}}+(\rho-1)(k-k_{0})\alpha}\right]^{\frac{1}{\rho-1}},&\text{if}\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}},\end{cases} (39)

where k0={mink0k:rk(1β^α)1ρ1},β^=max{β,11/ρ}k_{0}=\left\{\min\limits_{k\geq 0}\;k:\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\right\},\hat{\beta}=\max\left\{\beta,1-1/\rho\right\}.

Proof.

Denote g(r)=rαrρg(r)=r-\alpha r^{\rho}.

Consider ρ(0,1)\rho\in(0,1). In this case, note that gg is nondecreasing and thus also hh is nondecreasing, we have:

rk+1\displaystyle r_{k+1} ={rkαrkρ,ifrk>(α1β)11ρβrk,ifrk(α1β)11ρ\displaystyle=\begin{cases}r_{k}-\alpha r_{k}^{\rho},&\text{if}\;\;r_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\\ \beta r_{k},&\text{if}\;\;r_{k}\leq\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\end{cases}

Observe that if rk>(α1β)11ρr_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}} then, by using the monotonicity of rkr_{k}, we can further derive another bound:

rk+1\displaystyle r_{k+1} rkα2rkρα2rkρ(1α2rk1ρ)rkα2(α1β)11ρ\displaystyle\leq r_{k}-\frac{\alpha}{2}r_{k}^{\rho}-\frac{\alpha}{2}r_{k}^{\rho}\leq\left(1-\frac{\alpha}{2r_{k}^{1-\rho}}\right)r_{k}-\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}
(1α2r01ρ)rkα2(α1β)11ρ.\displaystyle\leq\left(1-\frac{\alpha}{2r_{0}^{1-\rho}}\right)r_{k}-\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}.

Any given sequence uku_{k} satisfying the recurrence uk+1(1ξ)ukcu_{k+1}\leq(1-\xi)u_{k}-c can be further bounded as: uk+1(1ξ)ku0ci=0k1(1ξ)i(1ξ)ku0ci=0k1(1ξ)k=(1ξ)k[u0kc].u_{k+1}\leq(1-\xi)^{k}u_{0}-c\sum_{i=0}^{k-1}(1-\xi)^{i}\leq(1-\xi)^{k}u_{0}-c\sum_{i=0}^{k-1}(1-\xi)^{k}=(1-\xi)^{k}[u_{0}-kc]. Thus, by apply similar arguments to our sequence rkr_{k} we refined the above bound as follows:

rk+1\displaystyle r_{k+1} {(1α2r01ρ)k[r0kα2(α1β)ρ1ρ],ifrk>(α1β)11ρβkk0min{r0,(α1β)11ρ},ifrk(α1β)11ρ.\displaystyle\leq\begin{cases}\left(1-\frac{\alpha}{2r_{0}^{1-\rho}}\right)^{k}\left[r_{0}-k\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{\rho}{1-\rho}}\right],&\text{if}\;\;r_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\\ \beta^{k-k_{0}}\min\left\{r_{0},\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\right\},&\text{if}\;\;r_{k}\leq\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}.\end{cases}

Now consider ρ>1\rho>1. In this case, on one hand, the function gg is nondecreasing only on (0,(1αρ)1ρ1]\left(0,\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}}\right]. On the other hand, for r(1αρ)1ρ1r\geq\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}} it is easy to see that g(r)(11ρ)rg(r)\leq\left(1-\frac{1}{\rho}\right)r. These two observations lead to:

h(r)\displaystyle h(r) =max{rαrρ,βr}max{rαrρ,(11ρ)r,βr}\displaystyle=\max\{r-\alpha r^{\rho},\beta r\}\leq\max\left\{r-\alpha r^{\rho},\left(1-\frac{1}{\rho}\right)r,\beta r\right\}
=max{rαrρ,β^r}:=h^(r),\displaystyle=\max\left\{r-\alpha r^{\rho},\hat{\beta}r\right\}:=\hat{h}(r),

where β^=max{11ρ,β}\hat{\beta}=\max\left\{1-\frac{1}{\rho},\beta\right\}. Since h^\hat{h} is nondecreasing, then rkh^(k)(r0)r_{k}\leq\hat{h}^{(k)}(r_{0}). In order to determine the clear convergence rate of rkr_{k}, based on [36, Lemma 6, Section 2.2] we make a last observation:

g(k)(r)r(1+(ρ1)rρ1kα)1ρ1[11rρ1+(ρ1)kα]1ρ1\displaystyle g^{(k)}(r)\leq\frac{r}{\left(1+(\rho-1)r^{\rho-1}k\alpha\right)^{\frac{1}{\rho-1}}}\leq\left[\frac{1}{\frac{1}{r^{\rho-1}}+(\rho-1)k\alpha}\right]^{\frac{1}{\rho-1}} (40)

Using this final bound, we are able to deduce the explicit convergence rate:

rkh^(k)(r0)\displaystyle r_{k}\leq\hat{h}^{(k)}(r_{0}) {β^kr0,ifrk>(1β^α)1ρ1g(kk0)(rk0),ifrk(1β^α)1ρ1\displaystyle\leq\begin{cases}\hat{\beta}^{k}r_{0},&\text{if}\;r_{k}>\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\\ g^{(k-k_{0})}(r_{k_{0}}),&\text{if}\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\end{cases}
(40){β^kr0,ifrk>(1β^α)1ρ1[11min{r0ρ1,1β^α}+(ρ1)(kk0)α]1ρ1,ifrk(1β^α)1ρ1.\displaystyle\overset{\eqref{rel:Poliak}}{\leq}\begin{cases}\hat{\beta}^{k}r_{0},&\text{if}\;r_{k}>\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\\ \left[\frac{1}{\frac{1}{\min\{r_{0}^{\rho-1},\frac{1-\hat{\beta}}{\alpha}\}}+(\rho-1)(k-k_{0})\alpha}\right]^{\frac{1}{\rho-1}},&\text{if}\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}.\end{cases}

Corollary 8.6.

Under the assumptions of Theorem 8.5, let rk+1=h(rk)r_{k+1}=h(r_{k}) and ϵ>0\epsilon>0. The sequence rkr_{k} attains the threshold rkϵr_{k}\leq\epsilon after the following number of iterations:

(i)(i) For ρ(0,1)\rho\in(0,1):

Kmin{2r01ραlog(r0max{ϵ,τρα/2}),2r0τρα}+1βlog(min{r0,τ}ϵ)\displaystyle K\geq\min\left\{\frac{2r_{0}^{1-\rho}}{\alpha}\log\left(\frac{r_{0}}{\max\{\epsilon,\tau^{\rho}\alpha/2\}}\right),\frac{2r_{0}}{\tau^{\rho}\alpha}\right\}+\frac{1}{\beta}\log\left(\frac{\min\left\{r_{0},\tau\right\}}{\epsilon}\right) (41)

(ii)(ii) For ρ1\rho\geq 1:

K1β^log(r0τ(β^))+1(ρ1)α(1ϵρ11min{r0,τ(β^)}ρ1),\displaystyle K\geq\frac{1}{\hat{\beta}}\log\left(\frac{r_{0}}{\tau(\hat{\beta})}\right)+\frac{1}{(\rho-1)\alpha}\left(\frac{1}{\epsilon^{\rho-1}}-\frac{1}{\min\left\{r_{0},\tau(\hat{\beta})\right\}^{\rho-1}}\right), (42)

where τ(β)=(α1β)11ρ\tau(\beta)=\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}.

Proof.

(i)(i) Let ρ(0,1)\rho\in(0,1). In the first regime of (38), when rk>τ(β)r_{k}>\tau(\beta), there are necessary at most:

K1(0,1)min{2r01ραlog(r0max{ϵ,τρα/2}),2r0τρα}\displaystyle K_{1}^{(0,1)}\geq\min\left\{\frac{2r_{0}^{1-\rho}}{\alpha}\log\left(\frac{r_{0}}{\max\{\epsilon,\tau^{\rho}\alpha/2\}}\right),\frac{2r_{0}}{\tau^{\rho}\alpha}\right\} (43)

iterations, while the second regime, i.e. rkτ(β)r_{k}\leq\tau(\beta), has a length of at most:

K2(0,1)1βlog(min{r0,τ}ϵ)\displaystyle K_{2}^{(0,1)}\geq\frac{1}{\beta}\log\left(\frac{\min\left\{r_{0},\tau\right\}}{\epsilon}\right) (44)

iterations to reach rkϵr_{k}\leq\epsilon. An upper margin on the total number of iterations is K1(0,1)+K2(0,1)K_{1}^{(0,1)}+K_{2}^{(0,1)}.

(ii)(ii) Let ρ>1\rho>1. Similarly, the first regime when rk>τ(β^)r_{k}>\tau(\hat{\beta}) has a maximal length of: K1(1,)1β^log(r0τ(β^)).K_{1}^{(1,\infty)}\geq\frac{1}{\hat{\beta}}\log\left(\frac{r_{0}}{\tau(\hat{\beta})}\right). The second regime, while rkτ(β^)r_{k}\leq\tau(\hat{\beta}), requires at most: K2(1,)1(ρ1)α(1ϵρ11min{r0,τ(β^)}ρ1)K_{2}^{(1,\infty)}\geq\frac{1}{(\rho-1)\alpha}\left(\frac{1}{\epsilon^{\rho-1}}-\frac{1}{\min\left\{r_{0},\tau(\hat{\beta})\right\}^{\rho-1}}\right) iteration to get rkϵr_{k}\leq\epsilon. ∎

Lemma 8.7.

Let α,ρ>0,β(0,1)\alpha,\rho>0,\beta\in(0,1). Let the sequence {rk,δk}k0\{r_{k},\delta_{k}\}_{k\geq 0} satisfy the recurrence:

rk+1max{rkαrkρ,βrk}+δk.\displaystyle r_{k+1}\leq\max\{r_{k}-\alpha r_{k}^{\rho},\beta r_{k}\}+\delta_{k}.

For ρ(0,1)\rho\in(0,1), let h(r)=max{rα2rρ,1+β2r}h(r)=\max\left\{r-\frac{\alpha}{2}r^{\rho},\frac{1+\beta}{2}r\right\}, then:

rkmax{h(k)(r0)),h(k1)(δ^1),,h(δ^k1),δ^k}r_{k}\leq\max\left\{h^{(k)}(r_{0})),h^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\}

For ρ1\rho\geq 1, let h^(r)=max{h(r),(11ρ)r}\hat{h}(r)=\max\left\{h(r),\left(1-\frac{1}{\rho}\right)r\right\}, then:

rkmax{h^(k)(r0)),h^(k1)(δ^1),,h^(δ^k1),δ^k},r_{k}\leq\max\left\{\hat{h}^{(k)}(r_{0})),\hat{h}^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,\hat{h}\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\},

where δ^k=max{(2δkα)1ρ,2δk1β}\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\}.

Proof.

Starting from the recurrence we get:

rk+1\displaystyle r_{k+1} max{rkαrkρ,βrk}+δk=max{rkαrkρ+δk,βrk+δk}\displaystyle\leq\max\{r_{k}-\alpha r_{k}^{\rho},\beta r_{k}\}+\delta_{k}=\max\{r_{k}-\alpha r_{k}^{\rho}+\delta_{k},\beta r_{k}+\delta_{k}\}
=max{rkα2rkρ+(δkα2rkρ),1+β2rk+(δk1β2rk)}\displaystyle=\max\left\{r_{k}-\frac{\alpha}{2}r_{k}^{\rho}+\left(\delta_{k}-\frac{\alpha}{2}r_{k}^{\rho}\right),\frac{1+\beta}{2}r_{k}+\left(\delta_{k}-\frac{1-\beta}{2}r_{k}\right)\right\}

If δkmin{α2rkρ,1β2rk}\delta_{k}\leq\min\left\{\frac{\alpha}{2}r_{k}^{\rho},\frac{1-\beta}{2}r_{k}\right\}, or equivalently rkmax{(2δkα)1ρ,2δk1β}r_{k}\geq\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\}, then we recover the recurrence:

rk+1\displaystyle r_{k+1} max{rkα2rkρ,1+β2rk}\displaystyle\leq\max\left\{r_{k}-\frac{\alpha}{2}r_{k}^{\rho},\frac{1+\beta}{2}r_{k}\right\} (45)

Otherwise, clearly

rkmax{(2δkα)1ρ,2δk1β}\displaystyle r_{k}\leq\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\} (46)

By combining both bounds (45) and (46), we obtain:

rk+1\displaystyle r_{k+1} max{rkα2rkρ,1+β2rk,(2δk+1α)1ρ,2δk+11β}.\displaystyle\leq\max\left\{r_{k}-\frac{\alpha}{2}r_{k}^{\rho},\frac{1+\beta}{2}r_{k},\left(\frac{2\delta_{k+1}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k+1}}{1-\beta}\right\}. (47)

Denote h(r)=max{rα2rρ,1+β2r}h(r)=\max\left\{r-\frac{\alpha}{2}r^{\rho},\frac{1+\beta}{2}r\right\} and δ^k=max{(2δkα)1ρ,2δk1β}\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\}. For ρ(0,1)\rho\in(0,1), since both functions rrαrρr\mapsto r-\alpha r^{\rho} and r1+β2rr\mapsto\frac{1+\beta}{2}r are nondecreasing, then hh is nondecreasing. This fact allows to apply the following induction to (47):

rk+1\displaystyle r_{k+1} max{h(rk),δ^k+1}max{h(max{h(rk1),δ^k}),δ^k+1}\displaystyle\leq\max\left\{h(r_{k}),\hat{\delta}_{k+1}\right\}\leq\max\left\{h\left(\max\left\{h(r_{k-1}),\hat{\delta}_{k}\right\}\right),\hat{\delta}_{k+1}\right\}
max{h(h(rk1)),h(δ^k),δ^k+1}\displaystyle\leq\max\left\{h(h(r_{k-1})),h\left(\hat{\delta}_{k}\right),\hat{\delta}_{k+1}\right\}
\displaystyle\cdots
max{h(k+1)(r0)),h(k)(δ^1),,h(δ^k),δ^k+1}.\displaystyle\leq\max\left\{h^{(k+1)}(r_{0})),h^{(k)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k}\right),\hat{\delta}_{k+1}\right\}. (48)

In the second case when ρ1\rho\geq 1, the corresponding recurrence function h^(r)=max{rα2rρ,(11ρ)r,1+β2r}\hat{h}(r)=\max\left\{r-\frac{\alpha}{2}r^{\rho},\left(1-\frac{1}{\rho}\right)r,\frac{1+\beta}{2}r\right\} is again nondecreasing. Indeed, here rrαrρr\mapsto r-\alpha r^{\rho} is nondecreasing only when r(1αρ)1ρ1r\leq\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}}. However, if r>(1αρ)1ρ1r>\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}}, then h^(r)=max{11ρ,1+β2}r\hat{h}(r)=\max\left\{1-\frac{1}{\rho},\frac{1+\beta}{2}\right\}r which is also nondecreasing. Thus we get our claim. The monotonicity of h^\hat{h} and majorization h^(r)h(r)\hat{h}(r)\geq h(r), allow us to obtain by a similar induction an analog relation to (48), which holds with h^\hat{h}. ∎

Proof of Theorem 4.1.

(i)(i) Denote rk=distX(xk)r_{k}=\text{dist}_{X^{*}}(x^{k}). Since δkδk1\delta_{k}\leq\delta_{k-1}, then by rolling the recurrence in Lemma 8.4 we get:

rk+1\displaystyle r_{k+1} max{rk(μσFδk),δk}\displaystyle\leq\max\left\{r_{k}-(\mu\sigma_{F}-\delta_{k}),\delta_{k}\right\}
max{rk1[2μσFδkδk1],δk+δk1μσF,δk}\displaystyle\leq\max\left\{r_{k-1}-[2\mu\sigma_{F}-\delta_{k}-\delta_{k-1}],\delta_{k}+\delta_{k-1}-\mu\sigma_{F},\delta_{k}\right\}
max{r0i=0k(μσFδi),δk+max{0,δk1μσF,,i=0k1(δiμσF)}}\displaystyle\leq\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}+\max\left\{0,\delta_{k-1}-\mu\sigma_{F},\cdots,\sum\limits_{i=0}^{k-1}(\delta_{i}-\mu\sigma_{F})\right\}\right\} (49)

By using the Lemma 8.2, then (49) can be refined as:

rk+1\displaystyle r_{k+1} max{r0i=0k(μσFδi),δk+max{0,δk1μσF,,i=0k1(δiμσF)}}\displaystyle\leq\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}+\max\left\{0,\delta_{k-1}-\mu\sigma_{F},\cdots,\sum\limits_{i=0}^{k-1}(\delta_{i}-\mu\sigma_{F})\right\}\right\}
Lemma8.2max{r0i=0k(μσFδi),δk+max{0,i=0k1δiμσF}}\displaystyle\overset{\text{Lemma}\;\ref{max_lemma}}{\leq}\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}+\max\left\{0,\sum\limits_{i=0}^{k-1}\delta_{i}-\mu\sigma_{F}\right\}\right\}
max{r0i=0k(μσFδi),max{δk,μσF+i=0kδiμσF}}\displaystyle\leq\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\max\left\{\delta_{k},\mu\sigma_{F}+\sum\limits_{i=0}^{k}\delta_{i}-\mu\sigma_{F}\right\}\right\}
=max{max{r0,μσF}i=0k(μσFδi),δk}.\displaystyle=\max\left\{\max\{r_{0},\mu\sigma_{F}\}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}\right\}.

(ii)(ii) Denote θ=1(1+2σFμ)1/2\theta=\frac{1}{(1+2\sigma_{F}\mu)^{1/2}}. From Lemmas 8.4 and 8.1 we derive that:

distX(xk)θdistX(xk1)+δk1Lemma8.1θk42(distX(x0)+Γ)+δk/2+11θ.\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\theta\text{dist}_{X^{*}}(x^{k-1})+\delta_{k-1}\overset{\text{Lemma}\;\ref{lemma:first_sequence}}{\leq}\theta^{\frac{k-4}{2}}\left(\text{dist}_{X^{*}}(x^{0})+\Gamma\right)+\frac{\delta_{\lceil k/2\rceil+1}}{1-\theta}.

(iii)(iii) First consider γ[1,2)\gamma\in[1,2) and let h(r)=max{rμφ(γ)σF2rγ1,1+1φ(γ)2r}h(r)=\max\left\{r-\frac{\mu\varphi(\gamma)\sigma_{F}}{2}r^{\gamma-1},\frac{1+\sqrt{1-\varphi(\gamma)}}{2}r\right\}. Then by Lemmas 8.4 and 8.7, we have that:

rk+1\displaystyle r_{k+1} max{h(k)(r0)),h(k1)(δ^1),,h(δ^k1),δ^k},\displaystyle\leq\max\left\{h^{(k)}(r_{0})),h^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\}, (50)

where δ^k=max{(2δkμφ(γ)σF)1ρ,2δk11φ(γ)}\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\mu\varphi(\gamma)\sigma_{F}}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\sqrt{1-\varphi(\gamma)}}\right\}. Let some uk=h(k)(u0)u_{k}=h^{(k)}(u_{0}) and δ¯k=max{δ^k,h(δ¯k1)}\bar{\delta}_{k}=\max\left\{\hat{\delta}_{k},h(\bar{\delta}_{k-1})\right\}. Then, since hh is nondecreasing, we get:

rk+1max{h(k)(r0)),h(k1)(δ^1),,h(δ^k1),δ^k}=max{uk,δ¯k},\displaystyle r_{k+1}\leq\max\left\{h^{(k)}(r_{0})),h^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\}=\max\left\{u_{k},\bar{\delta}_{k}\right\},

Finally, by using the convergence rate upper bounds from the Theorem 8.5, we can further find out an the convergence rate order of uku_{k}. We can appeal to a similar argument when γ2\gamma\geq 2, by using the nondecreasing function h^(r)=max{h(r),(11ρ)r}\hat{h}(r)=\max\left\{h(r),\left(1-\frac{1}{\rho}\right)r\right\}, instead of hh.

Theorem 8.8.

Let the function ff having ν\nu-Holder continuous gradients with constant LfL_{f} and ν[0,1]\nu\in[0,1]. Also let μ>0,αmin{μ2,δ2(1ν)4μLf2},z0dom(ψ)\mu>0,\alpha\leq\min\left\{\frac{\mu}{2},\frac{\delta^{2(1-\nu)}}{4\mu L_{f}^{2}}\right\},z^{0}\in\text{dom}(\psi) and

N4μαlog(z0proxμF(x)δ)\displaystyle N\geq\left\lceil\frac{4\mu}{\alpha}\log\left(\frac{\lVert z^{0}-\text{prox}_{\mu}^{F}(x)\rVert}{\delta}\right)\right\rceil (51)

then PsGM(z0,x,α,μ,Nz^{0},x,\alpha,\mu,N) outputs zNz^{N} such that zNproxμF(x)δ\lVert z^{N}-\text{prox}_{\mu}^{F}(x)\rVert\leq\delta.

Moreover, assume particularly that ν=0\nu=0 and FF satisfies WSM with constant σf\sigma_{f}. Also let α(0,μ/2]\alpha\in(0,\mu/2], QQ be a closed convex feasible set and ψ=ιQ\psi=\iota_{Q} its indicator function. If

distX(x)μσFandN2(distX(x)αLf)2\displaystyle\text{dist}_{X^{*}}(x)\leq\mu\sigma_{F}\qquad\text{and}\qquad N\geq\left\lceil 2\left(\frac{\text{dist}_{X^{*}}(x)}{\alpha L_{f}}\right)^{2}\right\rceil (52)

then PsGM(x,x,α,μ,Nx,x,\alpha,\mu,N) outputs zNz^{N} satisfying distX(zN)αLf22σF\text{dist}_{X^{*}}(z^{N})\leq\frac{\alpha L_{f}^{2}}{2\sigma_{F}}.

Proof.

For brevity we avoid the counter kk and denote z(x):=proxμF(x),z+:=proxμψ(zα[f(z)+1μ(zx)])z(x):=\text{prox}_{\mu}^{F}(x),z^{+}:=\text{prox}_{\mu}^{\psi}\left(z-\alpha\left[f^{\prime}(z)+\frac{1}{\mu}(z-x)\right]\right). Recall the optimality condition:

z(x)=proxμψ(z(x)α[f(z(x))+1μ(z(x)x)]).\displaystyle z(x)=\text{prox}_{\mu}^{\psi}\left(z(x)-\alpha\left[f^{\prime}(z(x))+\frac{1}{\mu}(z(x)-x)\right]\right). (53)

By using ν\nu-Holder continuity then we get:

f(z)f(z(x))Lfzz(x)νz.\displaystyle\lVert f^{\prime}(z)-f^{\prime}(z(x))\rVert\leq L_{f}\lVert z-z(x)\rVert^{\nu}\quad\forall z. (54)

Then the following recurrence holds:

z+z(x)2\displaystyle\lVert z^{+}-z(x)\rVert^{2}
=(53)proxμψ(zα[f(z)+1μ(zx)])proxμψ(z(x)α[f(z(x))+1μ(z(x)x)])2\displaystyle\!\!\overset{\eqref{rel:strong_convexity_inner}}{=}\!\!\left\|\text{prox}_{\mu}^{\psi}\!\!\left(z\!\!-\!\!\alpha\left[f^{\prime}(z)\!+\!\frac{1}{\mu}(z\!\!-\!\!x)\right]\right)\!\!-\!\!\text{prox}_{\mu}^{\psi}\left(z(x)\!\!-\!\!\alpha\left[f^{\prime}(z(x))\!+\!\frac{1}{\mu}(z(x)\!\!-\!\!x)\right]\right)\right\|^{2}
(1αμ)(zz(x))+α[f(z(x))f(z)]2\displaystyle\leq\left\|\left(1-\frac{\alpha}{\mu}\right)(z-z(x))+\alpha[f^{\prime}(z(x))-f^{\prime}(z)]\right\|^{2}
=(1αμ)2zz(x)22α(1αμ)f(z)f(z(x)),zz(x)\displaystyle=\left(1-\frac{\alpha}{\mu}\right)^{2}\lVert z-z(x)\rVert^{2}-2\alpha\left(1-\frac{\alpha}{\mu}\right)\langle f^{\prime}(z)-f^{\prime}(z(x)),z-z(x)\rangle
+α2f(z)f(z(x))2\displaystyle\hskip 170.71652pt+\alpha^{2}\lVert f^{\prime}(z)-f^{\prime}(z(x))\rVert^{2} (55)
(54)(1αμ)2zz(x)2+α2Lf2zz(x)2ν.\displaystyle\overset{\eqref{rel:bound_holdgrad}}{\leq}\left(1-\frac{\alpha}{\mu}\right)^{2}\lVert z-z(x)\rVert^{2}+\alpha^{2}L_{f}^{2}\lVert z-z(x)\rVert^{2\nu}.

Obviously, a small stepsize α<μ\alpha<\mu yields (1αμ)21αμ\left(1-\frac{\alpha}{\mu}\right)^{2}\leq 1-\frac{\alpha}{\mu}. If the squared residual is dominant, i.e.

zz(x)δ(2αμL2)12(1ν),\displaystyle\lVert z-z(x)\rVert\geq\delta\geq\left(2\alpha\mu L^{2}\right)^{\frac{1}{2(1-\nu)}}, (56)

then:

z+z(x)2(1α2μ)zz(x)2.\displaystyle\lVert z^{+}-z(x)\rVert^{2}\leq\left(1-\frac{\alpha}{2\mu}\right)\lVert z-z(x)\rVert^{2}. (57)

By (56), this linear decrease of residual stop when zz(x)δ\lVert z-z(x)\rVert\leq\delta, which occurs after at most 4μαlog(z0z(x)δ)\left\lceil\frac{4\mu}{\alpha}\log\left(\frac{\lVert z^{0}-z(x)\rVert}{\delta}\right)\right\rceil PsGM iterations.

To show the second part of our result we recall that the first assumption of (52) ensures z(x)=πX(x)z(x)=\pi_{X^{*}}(x). By using (55) with chosen subgradient f(x)=0f^{\prime}(x^{*})=0, then the following recurrence is obtained:

z+1x2\displaystyle\lVert z^{\ell+1}\!-\!x^{*}\rVert^{2} =(1αμ)2zx22α(1αμ)f(z),zx+α2f(z)2\displaystyle\!=\!\left(1\!-\!\frac{\alpha}{\mu}\right)^{2}\lVert z^{\ell}\!-\!x^{*}\rVert^{2}\!-\!2\alpha\left(1\!-\!\frac{\alpha}{\mu}\right)\langle f^{\prime}(z^{\ell}),z^{\ell}\!-\!x^{*}\rangle\!+\!\alpha^{2}\lVert f^{\prime}(z^{\ell})\rVert^{2}
zx22α(1αμ)σFdistX(z)+α2Lf2\displaystyle\leq\lVert z^{\ell}-x^{*}\rVert^{2}-2\alpha\left(1-\frac{\alpha}{\mu}\right)\sigma_{F}\text{dist}_{X^{*}}(z^{\ell})+\alpha^{2}L_{f}^{2}
zx2ασFdistX(z)+α2Lf2,\displaystyle\leq\lVert z^{\ell}-x^{*}\rVert^{2}-\alpha\sigma_{F}\text{dist}_{X^{*}}(z^{\ell})+\alpha^{2}L_{f}^{2}, (58)

where x=πX(x)x^{*}=\pi_{X^{*}}(x) and in the last inequality we used f(zt),ztxF(zt)FσFdistX(zt)\langle f^{\prime}(z^{t}),z^{t}-x^{*}\rangle\geq F(z^{t})-F^{*}\geq\sigma_{F}\text{dist}_{X^{*}}(z^{t}). If distX(z0)=distX(x)>αLf22σF\text{dist}_{X^{*}}(z^{0})=\text{dist}_{X^{*}}(x)>\frac{\alpha L_{f}^{2}}{2\sigma_{F}}, then as long as distX(z)>αLf22σF\text{dist}_{X^{*}}(z^{\ell})>\frac{\alpha L_{f}^{2}}{2\sigma_{F}}, (58) turns into:

distX2(z+1)z+1x2\displaystyle\text{dist}_{X^{*}}^{2}(z^{\ell+1})\leq\lVert z^{\ell+1}-x^{*}\rVert^{2} zx2(αLf)22\displaystyle\leq\lVert z^{\ell}-x^{*}\rVert^{2}-\frac{\left(\alpha L_{f}\right)^{2}}{2}
distX(x)2(αLf)22.\displaystyle\leq\text{dist}_{X^{*}}(x)^{2}-\ell\frac{\left(\alpha L_{f}\right)^{2}}{2}. (59)

To unify both cases, we further express the recurrence as:

distX(z+)2max{distX(x)2(αLf)22,α2Lf44σF2},\displaystyle\text{dist}_{X^{*}}(z^{+})^{2}\leq\max\left\{\text{dist}_{X^{*}}(x)^{2}-\ell\frac{\left(\alpha L_{f}\right)^{2}}{2},\frac{\alpha^{2}L_{f}^{4}}{4\sigma_{F}^{2}}\right\}, (60)

which confirms our above result.

Remark 4.

As the above theorem states, when the sequence xtx^{t} is sufficiently close to the solution set, computing xt+1x^{t+1} necessitates a number of PsGM iterations dependent on distX(z0)\text{dist}_{X^{*}}(z^{0}). In other words, the estimate from (52) can be further reduced through a good initialization or restartation technique. Such a restartation, for the neighborhood around the optimum, is exploited by the Algorithm 4 below.