Complexity of Inexact Proximal Point Algorithm for minimizing convex functions with Holderian Growth

Andrei Pătraşcu and Paul Irofti ^a Email: [email protected], [email protected] ^a Research Center for Logic, Optimization and Security (LOS),
Department of Computer Science, Faculty of Mathematics and Computer Science,
University of Bucharest, Academiei 14, Bucharest, Romania.

Abstract

Several decades ago the Proximal Point Algorithm (PPA) started to attract long-lasting interest from both the abstract operator theory and the numerical optimization communities. Even in modern applications, researchers still use proximal minimization theory to design scalable algorithms that overcome nonsmoothness. Remarkable works such as [9, 4, 5, 51] established tight relations between the convergence behaviour of PPA and the regularity of the objective function. In this manuscript we derive the nonasymptotic iteration complexity of exact and inexact PPA for minimizing convex functions under $\gamma-$ Holderian growth: $\mathcal{O}\left(\log(1/\epsilon)\right)$ (for $\gamma\in[1,2]$ ) and $\mathcal{O}\left(1/\epsilon^{\gamma-2}\right)$ (for $\gamma>2$ ). In particular, we recover well-known results on PPA: finite convergence for sharp minima and linear convergence for quadratic growth, even in the presence of deterministic noise. Moreover, when a simple Proximal Subgradient Method is recurrently called as an inner routine for computing each IPPA iterate, novel computational complexity bounds are obtained for Restarting Inexact PPA. Our numerical tests show improvements over existing restarting versions of the Subgradient Method.

keywords:

Inexact proximal point, weak sharp minima, Holderian growth, finite termination.


1 Introduction

The problem of interest in our paper is the following convex nonsmooth minimization:

\displaystyle F^{*}=\min_{x\in\mathbf{R}^{n}}\;\{F(x):=f(x)+\psi(x)\}.

(1)

Here we assume that $f:\mathbf{R}^{n}\mapsto\mathbf{R}$ is convex and $\psi:\mathbf{R}^{n}\mapsto(-\infty,\infty]$ is convex, lower semicontinuous and proximable. By proximable we refer to functions whose proximal mapping is computable in closed form or in linear time. The above model finds plenty of applications, among which we briefly mention compressed sensing [3], sparse risk minimization [21, 52] and graph-regularized models [54]. Dating back to the 1960s, the classical Subgradient Methods (SGM) [43, 44, 34, 35, 37] established $\mathcal{O}\left(1/\epsilon^{2}\right)$ iteration complexity for minimizing convex functions up to finding $\hat{x}$ such that $f(\hat{x})-f^{*}\leq\epsilon$ . Although this complexity order is unimprovable for the class of convex functions, particular growth or error bound properties can be exploited to obtain better orders. Error bounds and regularity conditions have a long history in optimization, systems of inequalities and projection methods: [1, 7, 9, 34, 35, 37, 6, 14, 23]. In particular, in the seminal works [36, 37] SGM is proved to converge linearly towards weakly sharp minima of $F$ . The optimal set $X^{*}$ is a set of weak sharp minima (WSM) if there exists $\sigma_{F}>0$ such that

\displaystyle WSM:\qquad F(x)-F^{*}\geq\sigma_{F}\text{dist}_{X^{*}}(x),\quad\forall x\in\text{dom}{F}.

Acceleration of other first-order algorithms has been proved under WSM in subsequent works such as [1, 7, 8, 41]. Besides acceleration, [37, Section 5.2.3] introduces the “superstability” of sharp optimal solutions $X^{*}$ : under small perturbations of the objective function $F$ a subset of the weak sharp minima $X^{*}$ remains optimal for the perturbed model. The superstability of WSM was used in [36, 27] to show the robustness of inexact SGM. In short, when subgradients with low persistent perturbations are used at each iteration, the resulting perturbed SGM still converges linearly to $X^{*}$ . In line with these results, we also show in our manuscript that similar robustness holds for proximal point methods under WSM.

Other recent works such as [53, 16, 26, 22, 18, 51, 19, 14, 10, 23, 11, 38, 6, 17] look at a suite of different growth regimes besides WSM and use them to improve the complexity of first-order algorithms. In particular, in our paper we are interested in the $\gamma-$ Holderian growth: let $\gamma\geq 1$ , then

\displaystyle\gamma-HG:\qquad F(x)-F^{*}\geq\sigma_{F}\text{dist}_{X^{*}}^{\gamma}(x),\quad\forall x\in\text{dom}{F}.

Note that $\gamma-HG$ is equivalent to the Kurdyka–Łojasiewicz (KL) inequality for convex, closed, and proper functions, as shown in [6]. It includes the class of uniformly convex functions analyzed in [17] and, obviously, it covers the sharp minima WSM, for $\gamma=1$ . The Quadratic Growth (QG), covered by $\gamma=2$ , was analyzed in a large suite of previous works [22, 53, 23, 26] and, although it is weaker than strong convexity, it can be exploited (together with Lipschitz gradient continuity) to show $\mathcal{O}\left(\log(1/\epsilon)\right)$ complexity of proximal gradient methods. Our analysis recovers similar complexity orders under the same particular assumptions.
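As a small numerical illustration of the two growth conditions, the following Python sketch checks WSM for $F(x)=\lVert x\rVert_{1}$ and $2$-HG for $F(x)=\lVert x\rVert^{2}$, both with $X^{*}=\{0\}$ and $\sigma_{F}=1$. The test functions, the sampling box and all helper names are assumptions made for this example only; they do not come from the paper.

```python
import math
import random

def dist_to_origin(x):
    """Euclidean distance to X* = {0}, the optimal set of both test functions."""
    return math.sqrt(sum(v * v for v in x))

def l1_norm(x):
    return sum(abs(v) for v in x)

def sq_norm(x):
    return sum(v * v for v in x)

def holds_growth(F, gamma, sigma, samples):
    """Check F(x) - F* >= sigma * dist^gamma on the given samples (F* = 0 here)."""
    return all(F(x) >= sigma * dist_to_origin(x) ** gamma - 1e-12 for x in samples)

random.seed(0)  # fixed seed so the check is reproducible
samples = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(1000)]

wsm_ok = holds_growth(l1_norm, gamma=1, sigma=1.0, samples=samples)  # sharp minimum
qg_ok = holds_growth(sq_norm, gamma=2, sigma=1.0, samples=samples)   # quadratic growth
# The squared norm is not sharp: close to the origin it lies below sigma * dist.
not_sharp = not holds_growth(sq_norm, gamma=1, sigma=1.0, samples=[[0.1, 0.0, 0.0]])
```

A sampled check of this kind can only refute a growth condition, never prove it; here both inequalities are known to hold analytically.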

Some recent works [53, 16, 11, 10, 38] developed restarted SGM schemes for minimizing convex functions under $\gamma$ -HG or WSM, and analyzed their theoretical convergence and their natural dependence on the growth moduli $\gamma$ and $\sigma_{F}$ . Restarted SubGradient (RSG) of [53] and Decaying Stepsize - SubGradient (DS-SG) of [16] present iteration complexity estimates of $\mathcal{O}\left(\log(1/\epsilon)\right)$ under WSM and an $\mathcal{O}\left(\frac{1}{\epsilon^{2(\gamma-1)}}\right)$ bound under $\gamma-$ (HG) in order to attain $\text{dist}_{X^{*}}(x)\leq\epsilon$ . These bounds are optimal for functions with bounded gradients, as observed by [29]. Most SGM schemes depend, to various degrees, on the knowledge of problem information. For instance, RSG and DS-SG rely on lower bounds of the optimal value $F^{*}$ and on knowledge of the parameters $\sigma_{F},\gamma$ or other Lipschitz constants. Restarting is introduced in order to avoid the exact estimation of the modulus $\sigma_{F}$ . Our schemes also allow estimates of problem moduli such as $\gamma,\sigma_{F}$ , covering cases when these are not known. In the best case, when the estimates are close to the true parameters, similar complexity estimates $\mathcal{O}\left(\frac{1}{\epsilon^{2(\gamma-1)}}\right)$ are provided in terms of subgradient evaluations. Moreover, by exploiting additional smooth structure we further obtain lower estimates.

The work of [17] approaches the constrained model, i.e. $\psi$ is the indicator function of a closed convex set, and assumes $\gamma-$ uniform convexity:

\displaystyle\begin{aligned}f(\alpha x+(1-\alpha)y)\leq{}&\alpha f(x)+(1-\alpha)f(y)\\ &-\frac{1}{2}\sigma_{f}\alpha(1-\alpha)[\alpha^{\gamma-1}+(1-\alpha)^{\gamma-1}]\lVert x-y\rVert^{\gamma},\end{aligned}

for all feasible $x,y$ and $\gamma\geq 2$ . The authors obtain optimal complexity bounds $\mathcal{O}\left(\sigma_{f}^{-2/\gamma}\epsilon^{-2(\gamma-1)}\right)$ when the subgradients of $f$ are bounded. Moreover, their restarting technique is adaptive to the growth modulus $\gamma$ and parameter $\sigma_{F}$ , up to a fixed number of iterations.

As is inherent to all SGMs, the complexity results of these works essentially require the boundedness of the subgradients, which is often natural for nondifferentiable functions. However, plenty of convex objective functions coming from risk minimization, sparse regression or machine learning present, besides their particular growth, a certain degree of smoothness which is not compatible with the subgradient boundedness assumption. Enclosing the feasible domain in a ball is an artificial remedy used to keep the subgradients bounded, which however might burden the implementation with additional uncertain tuning heuristics. Our analysis shows how to exploit smoothness in order to improve the complexity estimates.

The analysis of [41] investigates the effect of restarting over the optimal first-order schemes under $\gamma$ -HG and $\nu$ -Holder smoothness, starting from results of [29]. For $\psi=0$ , $\epsilon-$ suboptimality is reached after $\mathcal{O}\left(\log(1/\epsilon)\right)$ accelerated gradient iterations if $\nabla F$ is Lipschitz continuous and $2-$ Holder growth holds, or after $\mathcal{O}\left(1/\epsilon^{\frac{\gamma-2}{2}}\right)$ iterations when the growth modulus is larger than $2$ . In general, if $\nabla F$ is $\nu-$ Holder continuous, they restart the Universal Gradient Method and obtain an overall complexity of $\mathcal{O}\left(\log(1/\epsilon)\right)$ if $\gamma=\nu$ , or $\mathcal{O}\left(1/\epsilon^{\frac{2(\gamma-\nu-1)}{2\nu-1}}\right)$ if $\gamma>\nu$ . Although these estimates are unimprovable and better than ours, in general the implementation of the optimal schemes requires complete knowledge of growth and smoothness parameters.

Several decades ago the Proximal Point Algorithm (PPA) started to attract much interest from both the abstract operator theory and the numerical optimization communities. Even in modern applications, where large-scale nonsmooth optimization arises recurrently, practitioners still draw inspiration from proximal minimization theory to design scalable algorithmic techniques that overcome nonsmoothness. The PPA iteration consists mainly in the recursive evaluation of the proximal operator associated to the objective function. The proximal mapping is based on the infimal convolution with a metric function, often chosen to be the squared Euclidean norm: $\text{prox}_{\mu}^{F}(x):=\arg\min_{z}F(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}.$ The Proximal Point recursion:

\displaystyle x^{k+1}=\text{prox}_{\mu}^{F}(x^{k}).

became famous in the optimization community when [40, 39] and [4, 5] revealed its connection to various multiplier methods for constrained minimization, see also [31, 32, 28, 12, 13]. There are remarkable works that showed how the growth regularity is a key factor in the iteration complexity of PPA.
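The Proximal Point recursion can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the objective $F(x)=\lVert x\rVert_{1}$ (whose proximal mapping is the componentwise soft-thresholding), the starting point and the helper names `prox_l1` and `ppa` are chosen for this sketch and are not part of the paper. Since this $F$ has a sharp minimum at the origin, the exact iteration is expected to reach the optimum after finitely many steps.

```python
import math

def prox_l1(x, mu):
    """prox of F(z) = ||z||_1 with parameter mu: componentwise soft-thresholding."""
    return [math.copysign(max(abs(v) - mu, 0.0), v) for v in x]

def ppa(x0, mu, prox, num_iters):
    """Exact proximal point recursion x^{k+1} = prox_mu^F(x^k); returns all iterates."""
    x = list(x0)
    history = [x]
    for _ in range(num_iters):
        x = prox(x, mu)
        history.append(x)
    return history

# Each step shrinks every coordinate by mu = 0.5 until it hits zero.
history = ppa([3.0, -1.2, 0.4], mu=0.5, prox=prox_l1, num_iters=6)
final = history[-1]
```

In this run the largest coordinate starts at $3$, so six steps of size $\mu=0.5$ are enough for the whole iterate to land exactly on $X^{*}=\{0\}$.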

Finite convergence of the exact PPA under WSM is proved by [7, 9, 1]. Furthermore, an extensive convergence analysis of the exact PPA and the Augmented Lagrangian algorithm under $\gamma-$ (HG) can be found in [5, 18]. Although the results and analysis are of a remarkable generality, they are of asymptotic nature (see [51]). A nonasymptotic analysis is found in [51], where the equivalence between a Dual Augmented Lagrangian algorithm and a variable stepsize PPA is established. The authors analyze sparse learning models of the form: $\min_{x\in\mathbf{R}^{n}}\;f(Ax)+\psi(x),$ where $f$ is twice differentiable with Lipschitz continuous gradient, $A$ is a linear operator and $\psi$ is a convex nonsmooth regularizer. Under $\gamma-$ Holderian growth, with $\gamma\in[1,2]$ , they show a nonasymptotic superlinear convergence rate of the exact PPA with exponentially increasing stepsize. For the inexact variant they retain a slightly weaker superlinear convergence. The progress, from the asymptotic analysis of [40, 18] to a nonasymptotic one, is remarkable due to the simplicity of the arguments. However, a convergence rate of inexact PPA (IPPA) could become irrelevant without quantifying the local computational effort spent to compute each iteration, since one inexact iteration of PPA requires the approximate solution of the regularized optimization problem. Among the remarkable references on inexact versions of various proximal algorithms are [25, 48, 49, 50, 47, 46, 20, 21, 24, 45, 31, 32].

We mention that a small portion of the results on the WSM case contained in this manuscript has been recently published by the authors in [33]. However, we include it in the present manuscript for the sake of completeness.

Contributions. We list further our main contributions:

Inexact PPA under $\gamma-$ (HG). We provide nonasymptotic iteration complexity bounds for IPPA to solve (1) under $\gamma-$ HG, when $\gamma\geq 1$ . In particular, we obtain $\mathcal{O}\left(\log(1/\epsilon)\right)$ for $\gamma\in[1,2]$ and, in the case of best parameter choice, $\mathcal{O}\left(1/\epsilon^{\gamma-2}\right)$ for $\gamma>2$ , to attain $\epsilon$ distance to the optimal set. All these bounds require only convexity of the objective function $F$ and they are independent of any bounded gradient or smoothness assumption. We could not find these nonasymptotic estimates in the literature for general $\gamma\geq 1$ .

Restartation. We further analyze the complexity bounds of restarting IPPA, that facilitates the derivation of better computational complexity estimates than the non-restarted IPPA. The complexity estimates have similar orders in both restarted and non-restarted algorithms for all $\gamma\geq 1$ .

Total computational complexity. We derive the total complexity, including the inner computational effort spent at each IPPA iteration, in terms of the number of inner (proximal) gradient evaluations. If $f$ has $\nu-$ Holder continuous gradients we obtain that, in the case of best parameter choice,

\displaystyle\begin{aligned}[][\gamma=1+\nu]\qquad&\mathcal{O}\left(\log(1/\epsilon)\right)\\ [\nu=1]\qquad&\mathcal{O}\left(1/\epsilon^{\gamma-2}\right)\\ [\nu=0]\qquad&\mathcal{O}\left(1/\epsilon^{2(\gamma-1)}\right)\end{aligned}

proximal (sub)gradient evaluations are necessary to reach $\epsilon$ distance to the optimal set. As we discuss in Section 6, the total complexity depends on various restartation variables.

Experiments. Our numerical experiments confirm a better behaviour of the restarted IPPA, which uses an inner subgradient method routine, in comparison with two other restarting strategies of the classical Subgradient Method. We performed our tests on several polyhedral learning models that include Graph SVM and Matrix Completion, using synthetic and real data.

1.1 Notations and preliminaries

Now we introduce the main notations of our manuscript. For $x,y\in\mathbf{R}^{n}$ denote the scalar product by $\langle x,y\rangle=x^{T}y$ and the Euclidean norm by $\|x\|=\sqrt{x^{T}x}$ . The projection operator onto a set $X$ is denoted by $\pi_{X}$ and the distance from $x$ to the set $X$ is denoted by $\text{dist}_{X}(x)=\min_{z\in X}\lVert x-z\rVert$ . The indicator function of $Q$ is denoted by $\iota_{Q}$ . Given a function $h$ , by $h^{(k)}$ we denote the composition $h^{(k)}(x):=\underbrace{\left(h\circ h\circ\cdots\circ h\right)}_{k\;\text{times}}(x)$ . We use $\partial h(x)$ for the subdifferential set and $h^{\prime}(x)$ for a subgradient of $h$ at $x$ . In the differentiable case, when $\partial h$ is a singleton, $\nabla h$ will be used instead of $h^{\prime}$ . By $X^{*}$ we denote the optimal set associated to (1) and by an $\epsilon-$ suboptimal point we understand a point $x$ that satisfies $\text{dist}_{X^{*}}(x)\leq\epsilon$ .

A function $f$ is called $\sigma-$ strongly convex if the following relation holds:

f(x)\geq f(y)+\langle f^{\prime}(y),x-y\rangle+\frac{\sigma}{2}\lVert x-y\rVert^{2}\qquad\forall x,y\in\mathbf{R}^{n}.

Let $\nu\in[0,1]$ , then we say that a differentiable function $f$ has $\nu-$ Holder continuous gradient with constant $L>0$ if :

\displaystyle\lVert f^{\prime}(x)-f^{\prime}(y)\rVert\leq L\lVert x-y\rVert^{\nu}\quad\forall x,y\in\mathbf{R}^{n}.

Notice that when $\nu=0$ , the Holder continuity describes nonsmooth functions with bounded gradients, i.e. $\lVert f^{\prime}(x)\rVert\leq L$ for all $x\in\text{dom}(f)$ . The $1-$ Holder continuity reduces to $L-$ Lipschitz gradient continuity.

Given a convex function $f$ , we denote its Moreau envelope [40, 5, 39] with $f_{\mu}$ and its proximal operator with $\text{prox}_{\mu}^{f}(x)$ , defined by:

\displaystyle\begin{aligned}f_{\mu}(x)&=\min\limits_{z}\;f(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}\\ \text{prox}_{\mu}^{f}(x)&=\arg\min_{z}\;f(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}.\end{aligned}

We recall the nonexpansiveness property of the proximal mapping [40]:

\displaystyle\lVert\text{prox}_{\mu}^{f}(x)-\text{prox}_{\mu}^{f}(y)\rVert\leq\lVert x-y\rVert\quad\forall x,y\in\text{dom}(f).

(2)

Basic arguments from [40, 5, 39] show that the gradient $\nabla f_{\mu}$ is Lipschitz continuous with constant $\frac{1}{\mu}$ and satisfies:

\displaystyle\nabla f_{\mu}(x)=\frac{1}{\mu}\left(x-\text{prox}_{\mu}^{f}(x)\right)\in\partial f(\text{prox}_{\mu}^{f}(x)).

(3)

In the differentiable case, obviously $\nabla f_{\mu}(x)=\nabla f(\text{prox}_{\mu}^{f}(x))$ .
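Properties (2) and (3) can be checked numerically on a scalar example. The sketch below assumes $f(z)=|z|$, for which the proximal operator is soft-thresholding and the Moreau envelope is a Huber-type function; the sample points, the finite-difference step and all function names are choices made for this illustration only.

```python
import math

def prox_abs(x, mu):
    """prox of f(z) = |z| with parameter mu (soft-thresholding)."""
    return math.copysign(max(abs(x) - mu, 0.0), x)

def moreau_abs(x, mu):
    """Moreau envelope f_mu(x), evaluated at the minimizer prox_abs(x, mu)."""
    z = prox_abs(x, mu)
    return abs(z) + (z - x) ** 2 / (2 * mu)

def grad_moreau_abs(x, mu):
    """Gradient formula (3): (x - prox(x)) / mu."""
    return (x - prox_abs(x, mu)) / mu

def in_subdifferential_abs(g, z, tol=1e-12):
    """Is g a subgradient of |.| at z?"""
    if z > 0:
        return abs(g - 1.0) <= tol
    if z < 0:
        return abs(g + 1.0) <= tol
    return abs(g) <= 1.0 + tol

mu, h = 0.5, 1e-6
points = [-2.0, -0.3, 0.1, 0.45, 1.7]

# (3), first part: the formula agrees with a central finite difference of f_mu.
fd_errors = [
    abs((moreau_abs(x + h, mu) - moreau_abs(x - h, mu)) / (2 * h) - grad_moreau_abs(x, mu))
    for x in points
]
# (3), second part: the gradient of f_mu is a subgradient of f at prox(x).
inclusion_ok = all(
    in_subdifferential_abs(grad_moreau_abs(x, mu), prox_abs(x, mu)) for x in points
)
# (2): nonexpansiveness of the proximal mapping.
nonexpansive_ok = all(
    abs(prox_abs(x, mu) - prox_abs(y, mu)) <= abs(x - y) + 1e-12
    for x in points for y in points
)
```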

Paper structure. In section 2 we analyze how the growth properties of $F$ are also inherited by $F_{\mu}$ . The key relations on $F_{\mu}$ will become the basis for the complexity analysis. In section 3 we define the iteration of inexact Proximal Point algorithm and discuss its stopping criterion. The iteration complexity is presented in section 4 for both the exact and inexact case. Subsequently, the restarted IPPA is defined and its complexity is presented. Finally, in section 6 we quantify the complexity of IPPA in terms of proximal (sub)gradient iterations and compare with other results. In the last section we compare our scheme with the state-of-the-art restarting subgradient algorithms.

2 Holderian growth and Moreau envelopes

As discussed in the introduction, $\gamma$ -HG relates tightly with widely known regularity properties such as WSM [53, 36, 37, 7, 9, 1, 8], Quadratic Growth (QG) [53, 22, 26], Error Bound [23] and the Kurdyka–Łojasiewicz inequality [6, 53].

Next we show how the Moreau envelope of a given convex function inherits its growth properties over its entire domain, except for a certain neighborhood of the optimal set. Recall that $\min_{x}F(x)=\min_{x}F_{\mu}(x)$ .

Lemma 2.1.

Let $F$ be a convex function and let $\gamma-$ (HG) hold. Then the Moreau envelope $F_{\mu}$ satisfies the relations presented below.

$(i)$ Let $\gamma=1$ (WSM):

\displaystyle F_{\mu}(x)-F^{*}\geq H_{\sigma^{2}_{F}\mu}(\sigma_{F}\text{dist}_{X^{*}}(x)),

where $H_{\tau}(s)=\begin{cases}s-\frac{\tau}{2},&s>\tau\\ \frac{1}{2\tau}s^{2},&s\leq\tau\end{cases}$ is the Huber function.

$(ii)$ Let $\gamma=2$ :

\displaystyle F_{\mu}(x)-F^{*}\geq\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\text{dist}_{X^{*}}^{2}(x).

$(iii)$ For all $\gamma\geq 1$ :

\displaystyle F_{\mu}(x)-F^{*}\geq\varphi(\gamma)\min\left\{\sigma_{F}\text{dist}^{\gamma}_{X^{*}}(x),\frac{1}{2\mu}\text{dist}^{2}_{X^{*}}(x)\right\},

where $\varphi(\gamma)=\min_{\lambda\in[0,1]}\lambda^{\gamma}+(1-\lambda)^{2}$ .

Proof.

The proof can be found in the Appendix. ∎

It is interesting to remark that the Moreau envelope $F_{\mu}$ inherits a growth landscape similar to that of $F$ outside a given neighborhood of the optimal set. For instance, under WSM, outside the tube $\mathcal{N}(\sigma_{F}\mu)=\{x\in\mathbf{R}^{n}:\;\text{dist}_{X^{*}}(x)\leq\sigma_{F}\mu\}$ , the Moreau envelope $F_{\mu}$ grows sharply. Inside $\mathcal{N}(\sigma_{F}\mu)$ it grows quadratically which, unlike the objective function $F$ , allows the gradient to get small near the optimal set. This separation of growth regimes suggests that first-order algorithms that minimize $F_{\mu}$ would reach the region $\mathcal{N}(\sigma_{F}\mu)$ very fast, allowing large steps in a first phase, and subsequently they would slow down in the vicinity of the optimal set.

This discussion extends to general growths when $\gamma>1$ , where a similar separation of behaviours holds for appropriate neighborhoods. Note that when $F$ has quadratic growth with constant $\sigma_{F}$ , also the envelope $F_{\mu}$ satisfies a quadratic growth with a smaller modulus $\frac{\sigma_{F}}{1+2\sigma_{F}\mu}$ .
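The three relations of Lemma 2.1 can be sanity-checked on the scalar family $F(z)=\sigma_{F}|z|^{\gamma}$ with $X^{*}=\{0\}$. The following Python sketch is an illustration only: the constants, the test points, the grid-based approximations of $F_{\mu}$ and $\varphi(\gamma)$ and all function names are assumptions of this example, and a grid search replaces the exact minimization.

```python
def huber(s, tau):
    """Huber function H_tau(s) for s >= 0, as in Lemma 2.1 (i)."""
    return s - tau / 2 if s > tau else s * s / (2 * tau)

def phi(gamma, grid=10001):
    """Grid approximation of phi(gamma) = min over [0,1] of lam^gamma + (1-lam)^2."""
    return min((i / (grid - 1)) ** gamma + (1 - i / (grid - 1)) ** 2 for i in range(grid))

def moreau_env(F, x, mu, grid=20001):
    """Grid approximation (from above) of F_mu(x) for x >= 0.

    Assumes F is even, convex and minimized at 0, so the prox lies in [0, x].
    """
    return min(
        F(x * i / (grid - 1)) + (x * i / (grid - 1) - x) ** 2 / (2 * mu)
        for i in range(grid)
    )

sigma, mu = 1.5, 0.7
xs = [0.05, 0.3, 1.0, 4.0]

# (i) gamma = 1: in one dimension the bound holds with equality.
huber_gap = max(
    abs(moreau_env(lambda z: sigma * abs(z), x, mu) - huber(sigma * x, sigma ** 2 * mu))
    for x in xs
)
# (ii) gamma = 2: the envelope of sigma*z^2 is sigma/(1 + 2*sigma*mu) * x^2.
qg_gap = max(
    abs(moreau_env(lambda z: sigma * z * z, x, mu) - sigma / (1 + 2 * sigma * mu) * x * x)
    for x in xs
)

def lemma_iii_holds(gamma):
    """(iii): F_mu(x) - F* >= phi(gamma) * min{sigma*x^gamma, x^2/(2*mu)}."""
    p = phi(gamma)
    F = lambda z: sigma * abs(z) ** gamma
    return all(
        moreau_env(F, x, mu) >= p * min(sigma * x ** gamma, x * x / (2 * mu)) - 1e-6
        for x in xs
    )

iii_ok = all(lemma_iii_holds(g) for g in (1, 1.5, 2, 3))
```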

Remark 1.

It will be useful in the subsequent sections to recall the connection between Holderian growth and Holderian error bound under convexity. Observe that by a simple use of convexity into $\gamma$ -HG, we obtain for all $x\in\text{dom}F$

\displaystyle\sigma_{F}\text{dist}_{X^{*}}^{\gamma}(x)\leq F(x)-F^{*}\leq\langle F^{\prime}(x),x-\pi_{X^{*}}(x)\rangle\leq\lVert F^{\prime}(x)\rVert\text{dist}_{X^{*}}(x),

(4)

which immediately turns into the following error bound:

\displaystyle\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x)\leq\lVert F^{\prime}(x)\rVert\qquad x\in\text{dom}{F}.

(5)

Under WSM, by replacing $x$ with $\text{prox}_{\mu}^{F}(x)$ into (5) and by using property (3) pointing that $\nabla F_{\mu}(x)\in\partial F(\text{prox}_{\mu}^{F}(x))$ , we obtain as well a similar bound on $\lVert\nabla F_{\mu}(\cdot)\rVert$ at non-optimal points:

\displaystyle\sigma_{F}\leq\lVert\nabla F_{\mu}(x)\rVert\quad\forall x\notin X^{*}.

(6)

This is the traditional key relation for the finite convergence of PPA. Under $\gamma-$ HG, starting from Lemma 2.1 $(iii)$ , by using convexity of $F_{\mu}$ and, further, by following similar inequalities as in (4), another error bound is obtained:

\displaystyle\text{dist}_{X^{*}}(x)\leq\max\left\{\left[\frac{1}{\sigma_{F}\varphi(\gamma)}\lVert\nabla F_{\mu}(x)\rVert\right]^{\frac{1}{\gamma-1}},\frac{2\mu}{\varphi(\gamma)}\lVert\nabla F_{\mu}(x)\rVert\right\}\quad\forall x.

(7)

In the following section we start with the analysis of exact and inexact PPA. In line with older results on subgradient algorithms dating back to [36], we illustrate the robustness induced by the weak sharp minima regularity.

3 Inexact Proximal Point algorithm

The basic exact PPA iteration is shortly described as

\displaystyle x^{k+1}=\text{prox}_{\mu}^{F}(x^{k}).

Recall that by (3) one can express $\nabla F_{\mu}(x^{k})=\frac{1}{\mu}(x^{k}-\text{prox}_{\mu}^{F}(x^{k}))$ , which makes PPA equivalent with the constant stepsize Gradient Method (GM) iteration:

\displaystyle x^{k+1}=x^{k}-\mu\nabla F_{\mu}(x^{k}).

(8)

Since our reasoning below borrows simple arguments from the classical GM analysis, we will further use (8) to express PPA. It is realistic not to rely on the exact $\text{prox}_{\mu}^{F}(x^{k})$ , but on an approximation of it up to a fixed accuracy. By using such an approximation, one can form an approximate gradient $\nabla_{\delta}F_{\mu}(x^{k})$ and interpret IPPA as an inexact Gradient Method. Let $x\in\text{dom}{F}$ , then a basic $\delta-$ approximation of $\nabla F_{\mu}(x)$ is

\displaystyle\nabla_{\delta}F_{\mu}(x):=\frac{1}{\mu}\left(x-\tilde{z}\right),\quad\text{where}\quad\lVert\tilde{z}-\text{prox}_{\mu}^{F}(x)\rVert\leq\delta.

(9)

Other works such as [40, 42, 48, 49] promote similar approximation measures for inexact first order methods. Now we present the basic IPPA scheme with constant stepsize.

Algorithm 1 Inexact Proximal Point Algorithm( $x^{0},\mu,\{\delta_{k}\}_{k\geq 0},\epsilon$ )

Initialize $k:=0$

while $\lVert\nabla_{\delta_{k}}F_{\mu}(x^{k})\rVert>\epsilon$ or $\delta_{k}>\frac{\epsilon}{\mu}$ do

Given $x^{k}$ compute $x^{k+1}$ such that: $\lVert x^{k+1}-\text{prox}_{\mu}^{F}(x^{k})\rVert\leq\delta_{k}$

$k:=k+1$

end while

Return $x^{k}$
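A minimal Python sketch of the IPPA loop is given below. It is not the authors' implementation: the inexact proximal oracle is simulated by perturbing an exact proximal point with an error of norm $\delta_{k}$, the test objective $F(x)=\sigma\lVert x\rVert^{2}$ and all names (`ippa`, `noisy_prox_quadratic`, `SIGMA`) are assumptions of this example. The approximate gradient is formed as in (9) from the computed point, i.e. $\frac{1}{\mu}(x^{k}-x^{k+1})$.

```python
import math

def ippa(x0, mu, deltas, eps, inexact_prox, max_iter=10_000):
    """Inexact proximal point loop with a gradient-norm stopping rule.

    deltas(k) returns the inner accuracy delta_k; inexact_prox(x, mu, delta) must
    return a point within distance delta of the exact prox (assumed oracle).
    """
    x = list(x0)
    for k in range(max_iter):
        delta_k = deltas(k)
        x_next = inexact_prox(x, mu, delta_k)
        # Approximate gradient of the Moreau envelope: (x^k - x^{k+1}) / mu.
        grad_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_next))) / mu
        if grad_norm <= eps and delta_k <= eps / mu:
            return x, k
        x = x_next
    raise RuntimeError("IPPA did not meet the stopping rule within max_iter")

SIGMA = 1.0  # growth constant of the hypothetical test objective sigma * ||x||^2

def noisy_prox_quadratic(x, mu, delta):
    """Exact prox of sigma*||z||^2, then an error of norm exactly delta (assumed noise model)."""
    z = [v / (1.0 + 2.0 * SIGMA * mu) for v in x]
    z[0] += delta
    return z

x_out, iters = ippa(
    [4.0, -3.0], mu=1.0, deltas=lambda k: 0.1 / 2 ** k, eps=1e-6,
    inexact_prox=noisy_prox_quadratic,
)
dist_out = math.sqrt(sum(v * v for v in x_out))
```

With the geometrically decaying accuracy $\delta_{k}=0.1/2^{k}$ used here, the loop stops after a few dozen iterations and the returned point lies within a small multiple of $\epsilon$ from $X^{*}=\{0\}$.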

There already exist a variety of relative or absolute stopping rules in the literature for the class of first-order methods [51, 42, 37, 15]. However, since bounding the norms of gradients is commonly regarded as one of the simplest optimality measures, we use the following:

\displaystyle\lVert\nabla_{\delta_{k}}F_{\mu}(x^{k})\rVert<\epsilon\quad\text{and}\quad\delta_{k}\leq\frac{\epsilon}{\mu}

(10)

which is computable by the nature of the iteration. The next result shows the relation between (10) and the distance to optimal set.

Lemma 3.1.

Let $\mu,\delta>0$ and assume $x^{k}\notin X^{*}$ , then:

$(i)$ Let $\gamma=1$ , if $\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\frac{\delta}{\mu}<\sigma_{F}$ , then

\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\mu\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\delta\quad\text{and}\quad\text{prox}_{\mu}^{F}(x^{k})=\pi_{X^{*}}(x^{k}).

(11)

$(ii)$ Let $\gamma>1$ , then: $\;\;\text{dist}_{X^{*}}(x^{k})\leq\max\left\{\left[\frac{\mu\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\delta}{\mu\sigma_{F}\varphi(\gamma)}\right]^{\frac{1}{\gamma-1}},\frac{2\mu\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert+\delta}{\varphi(\gamma)}\right\}.$

Proof.

The proof can be found in the Appendix. ∎

If $X^{*}$ is a set of weak sharp minimizers, the above lemma states that, for sufficiently small $\delta$ and $\lVert\nabla_{\delta}F_{\mu}(x^{k})\rVert$ , one can directly guarantee that $x^{k}$ is close to $X^{*}$ . From another viewpoint, Lemma 3.1 also suggests for WSM that a sufficiently large $\mu>\frac{\text{dist}_{X^{*}}(x^{0})}{\sigma_{F}}$ provides $\text{prox}_{\mu}^{F}(x^{0})=\pi_{X^{*}}(x^{0})$ and a $\delta-$ optimal solution is obtained as the output of the first IPPA iteration. A similar result stated in [9] guarantees the existence of a sufficiently large smoothing value $\mu$ which makes PPA converge in a single (exact) iteration.

4 Iteration complexity of IPPA

For reasons that will be clear later, note that some of the results from below are given for constant inexactness noise $\delta_{k}=\delta$ . The Holderian growth leads naturally to recurrences on the residual distance to the optimal set, that allow us to give direct complexity bounds.

Theorem 4.1.

Let $F$ be convex and $\gamma-$ HG hold.

$(i)$ Under sharp minima $\gamma=1$ , let $\{\delta_{k}\}_{k\geq 0}$ be nonincreasing and assume $\text{dist}_{X^{*}}(x^{0})\geq\mu\sigma_{F}$ , then:

\displaystyle\;\;\text{dist}_{X^{*}}(x^{k})\leq\max\left\{\text{dist}_{X^{*}}(x^{0})-\sum\limits_{i=0}^{k-1}(\mu\sigma_{F}-\delta_{i}),\delta_{k-1}\right\}.

$(ii)$ Under quadratic growth $\gamma=2$ , let $\sum\limits_{i\geq 0}\delta_{i}<\Gamma<\infty$ then:

\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\left[\frac{1}{1+2\mu\sigma_{F}}\right]^{\frac{k-4}{4}}(\text{dist}_{X^{*}}(x^{0})+\Gamma)+\left(1+\frac{1}{\sqrt{1+2\mu\sigma_{F}}-1}\right)\delta_{\lceil\frac{k}{2}\rceil+1}.

$(iii)$ Let $\gamma-$ HG hold. Then the following convergence rate holds:

\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\max\left\{h^{(k)}(\text{dist}_{X^{*}}(x^{0})),\bar{\delta}_{k}\right\}

where $h(r)=\begin{cases}\max\left\{r-\frac{\mu\varphi(\gamma)\sigma_{F}}{2}r^{\gamma-1},\frac{1+\sqrt{1-\varphi(\gamma)}}{2}r\right\},&\text{if}\;\gamma\in(1,2)\\ \max\left\{r-\frac{\mu\varphi(\gamma)\sigma_{F}}{2}r^{\gamma-1},\frac{1+\sqrt{1-\varphi(\gamma)}}{2}r,\left(1-\frac{1}{\gamma-1}\right)r\right\},&\text{if}\;\gamma>2.\end{cases}$ and $\bar{\delta}_{k}=\max\left\{h(\bar{\delta}_{k-1}),\left(\frac{2\delta_{k}}{\mu\varphi(\gamma)\sigma_{F}}\right)^{\frac{1}{\gamma-1}},\frac{2\delta_{k}}{1-\sqrt{1-\varphi(\gamma)}}\right\}.$

Proof.

The proof can be found in the appendix. ∎

Note that in general, all the abstract convergence rates of Theorem 4.1 depend on two terms: the first one illustrates the reduction of the initial distance to the optimum and the second term reflects the accuracy of the approximated gradient. Therefore, after a finite number (for $\gamma=1$ ) or at most $\mathcal{O}(\log(1/\epsilon))$ (for $\gamma>1$ ) IPPA iterations, the evolution of the inner accuracy $\delta_{k}$ becomes the main bottleneck of the convergence process. The above theorem provides an abstract insight into how fast the accuracy $\{\delta_{k}\}_{k\geq 0}$ should decay so that $\{x^{k}\}_{k\geq 0}$ attains the best rate towards the optimal set in terms of $\text{dist}_{X^{*}}(x^{k})$ . In short, $(i)$ shows that for WSM a recurrent constant decrease of $\text{dist}_{X^{*}}(x^{k})$ is established only if $\delta_{k}<\mu\sigma_{F}$ , while the noise $\delta_{k}$ need not vanish. This aspect will be discussed in more detail below. The last parts $(ii)$ and $(iii)$ suggest that a linear decay of $\delta_{k}$ (for $\gamma=2$ ) and, respectively, $\delta_{k+1}=h(\delta_{k})$ for general $\gamma>1$ , ensure the fastest convergence of the residual.

Several previous works such as [27, 36, 37] analyzed perturbed and incremental SGM algorithms, under WSM, that use noisy estimates $G_{k}$ of the subgradient $F^{\prime}(x^{k})$ . A surprising common conclusion of these works is that under a sufficiently low persistent noise: $0<\lVert F^{\prime}(x^{k})-G_{k}\rVert<\sigma_{F},$ for all $k\geq 0$ , SGM still converges linearly to the optimal set. Although IPPA is based on a similar approximate first order information, notice that the smoothed objective $F_{\mu}$ does not satisfy the pure WSM property. However, our next result states, in a similar context of small but persistent noise of magnitude at most $\frac{\delta}{\mu}$ , that IPPA attains $\delta-$ accuracy after a finite number of iterations.

Corollary 4.2.

Let $\delta_{k}=\delta<\mu\sigma_{F}$ and $\gamma=1$ , then after

\displaystyle K=\left\lceil\frac{\text{dist}_{X^{*}}(x^{0})}{\mu\sigma_{F}-\delta}\right\rceil

IPPA iterations, $x^{K}$ satisfies $\text{prox}_{\mu}^{F}(x^{K})=\pi_{X^{*}}(x^{K})$ and $\text{dist}_{X^{*}}(x^{K})\leq\delta$ .

Proof.

The proof can be found in the Appendix. ∎
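The finite bound of Corollary 4.2 can be illustrated on the scalar sharp function $F(z)=\sigma_{F}|z|$ with $X^{*}=\{0\}$. In the Python sketch below, the constants and the noise model (an error of size exactly $\delta$ that always pushes the iterate away from the optimum) are assumptions made for this example; the function names are hypothetical as well.

```python
import math

def prox_sharp(x, mu, sigma):
    """Exact prox of F(z) = sigma*|z|: soft-thresholding at level mu*sigma."""
    return math.copysign(max(abs(x) - mu * sigma, 0.0), x)

def ippa_persistent_noise(x0, mu, sigma, delta, num_iters):
    """IPPA where every inexact prox is shifted by delta away from X* = {0}."""
    x = x0
    for _ in range(num_iters):
        z = prox_sharp(x, mu, sigma)
        # Assumed adversarial error of magnitude delta (direction +1 when z is zero).
        x = z + math.copysign(delta, z if z != 0.0 else 1.0)
    return x

sigma, mu, delta, x0 = 2.0, 0.25, 0.1, 3.3      # delta < mu*sigma = 0.5
K = math.ceil(abs(x0) / (mu * sigma - delta))   # iteration count of Corollary 4.2
x_K = ippa_persistent_noise(x0, mu, sigma, delta, K)
```

Each iteration outside the tube reduces the distance by at least $\mu\sigma_{F}-\delta$; once inside, the exact prox equals the projection onto $X^{*}$ and the iterate stays within $\delta$ of it, even though the noise never vanishes.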

To conclude, if the noise magnitude $\frac{\delta}{\mu}$ is below the threshold $\sigma_{F}$ , or equivalently $0<\delta^{\nabla}:=\lVert\nabla F_{\mu}(x^{k})-\nabla_{\delta}F_{\mu}(x^{k})\rVert<\sigma_{F}$ , then after a finite number of iterations IPPA reaches $\delta$ distance to $X^{*}$ . We see that under sufficiently low persistent noise, IPPA still guarantees convergence to the optimal set, assuming the existence of an inner routine that computes each iteration. In other words, “noisy” Proximal Point algorithms share stability properties similar to those of the noisy Subgradient schemes from [27, 36, 37]. This discussion can be extended to general decreasing $\{\delta_{k}\}_{k\geq 0}$ .

We show next that Theorem 4.1 covers well-known results on exact PPA.

Corollary 4.3.

Let $\{x^{k}\}_{k\geq 0}$ be the sequence of exact PPA, i.e. $\delta_{k}=0$ . By denoting $r_{0}=\text{dist}_{X^{*}}(x^{0})$ , an $\epsilon-$ suboptimal iterate is attained, i.e. $\text{dist}_{X^{*}}(x^{k})\leq\epsilon$ , after the number of iterations:

\displaystyle\mathcal{K}_{e}(\gamma,\epsilon)=\begin{cases}\left\lceil\frac{r_{0}-\epsilon}{\mu\sigma_{F}}\right\rceil&\text{if}\;\gamma=1\\ \left\lceil\mathcal{O}\left(\frac{1}{\mu\sigma_{F}}\log\left(\frac{r_{0}}{\epsilon}\right)\right)\right\rceil&\text{if}\;\gamma=2\\ \left\lceil\mathcal{O}\left(\min\left\{\mathcal{T}^{2-\gamma}\log\left(\frac{r_{0}}{\epsilon}\right),\mathcal{T}\right\}\right)\right\rceil,&\text{if}\;\gamma\in(1,2),\;\epsilon\geq(\mu\varphi(\gamma)\sigma_{F})^{\frac{1}{2-\gamma}}\\ \left\lceil\mathcal{O}\left(\log\left(\min\left\{r_{0},(2\mu\sigma_{F})^{\frac{1}{2-\gamma}}\right\}/\epsilon\right)\right)\right\rceil&\text{if}\;\gamma\in(1,2),\epsilon<(\mu\varphi(\gamma)\sigma_{F})^{\frac{1}{2-\gamma}}\\ \left\lceil\mathcal{O}\left(\frac{1}{\epsilon^{\gamma-2}}\right)\right\rceil&\text{if}\;\gamma>2.\end{cases}

Proof.

The proof can be found in the Appendix. ∎

The finite convergence of the exact PPA, under WSM, dates back to [9, 7, 1, 4]. Since PPA is simply a gradient descent iteration, its iteration complexity under QG $\gamma=2$ shares the typical dependence on the condition number $\frac{1}{\mu\sigma_{F}}$ . The Holder growth $\gamma\in(1,2)$ behaves similarly to the sharp minima case: fast convergence outside a neighborhood of the optimum which expands with $\mu$ .

A simple argument on the tightness of the bounds for the case $\gamma>2$ can be found in [2]. Indeed, Douglas-Rachford, PPA and Alternating Projections algorithms were analyzed in [2] for particular univariate functions. The authors proved that PPA requires $\mathcal{O}\left(1/\epsilon^{\gamma-2}\right)$ iterations to minimize the particular objective $F(x)=\frac{1}{\gamma}|x|^{\gamma}$ (when $\gamma>2$ ) up to $\epsilon$ tolerance.
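The sublinear regime for $\gamma>2$ can be observed on the univariate objective $F(x)=\frac{1}{\gamma}|x|^{\gamma}$ mentioned above. The sketch below fixes $\gamma=3$, for which the proximal point solves $z+\mu z|z|=x$ in closed form; the starting point, the stepsize, the tolerances and the function names are choices made for this illustration.

```python
import math

def prox_cubic(x, mu):
    """Exact prox of F(z) = |z|^3 / 3: the root of z + mu*z*|z| = x."""
    a = abs(x)
    # Numerically stable form of (-1 + sqrt(1 + 4*mu*a)) / (2*mu).
    z = 2.0 * a / (1.0 + math.sqrt(1.0 + 4.0 * mu * a))
    return math.copysign(z, x)

def ppa_iterations(x0, mu, eps, max_iter=10**7):
    """Number of exact PPA steps until dist(x, X*) = |x| <= eps."""
    x, k = x0, 0
    while abs(x) > eps and k < max_iter:
        x = prox_cubic(x, mu)
        k += 1
    return k

k_coarse = ppa_iterations(1.0, mu=1.0, eps=1e-2)
k_fine = ppa_iterations(1.0, mu=1.0, eps=1e-3)
ratio = k_fine / k_coarse  # expected to be close to 10, i.e. (1/eps)^(gamma-2)
```

Dividing the tolerance by ten multiplies the iteration count by roughly ten, which is consistent with the $\mathcal{O}\left(1/\epsilon^{\gamma-2}\right)$ order for $\gamma=3$.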

Corollary 4.4.

Under the assumptions of Corollary 4.3, recall the notation $\mathcal{K}_{e}(\gamma,\epsilon)$ for the exact case. The number of iterations performed by $\{x^{k}\}_{k\geq 0}$ generated with IPPA $(x^{0},\mu,\{\delta_{k}\})$ in order to attain an $\epsilon-$ suboptimal point is:

\displaystyle\begin{cases}\left\lceil\mathcal{O}\left(\max\left\{\mathcal{K}_{e}(1,\epsilon)+\frac{\delta_{0}}{\mu\sigma_{F}},\log\left(\frac{\delta_{0}}{\epsilon}\right)\right\}\right)\right\rceil&\quad\gamma=1,\delta_{k}\leq\frac{\delta_{0}}{2^{k}}\\ \left\lceil\mathcal{O}\left(\mathcal{K}_{e}(\gamma,\epsilon)\right)\right\rceil&\quad\gamma\in(1,2],\delta_{k}\leq\frac{\delta_{0}}{2^{k}}\\ \left\lceil\mathcal{O}\left(\mathcal{K}_{e}(\gamma,\epsilon)\right)\right\rceil&\quad\gamma>2,\delta_{k}=\left(1-\frac{1}{\gamma-1}\right)^{k(\gamma-1)}\delta_{0}.\end{cases}

Proof.

The proof can be found in the Appendix. ∎

Overall, if the noise vanishes sufficiently fast (e.g. a linear decay with factor $\frac{1}{2}$ in the case $\gamma\leq 2$ ) then the complexity of IPPA has the same order as that of PPA. However, a fast vanishing noise $\delta_{k}$ requires an efficient inner routine that computes the iterate $x^{k}$ . In particular, when $F$ is nonsmooth and convex, a simple choice of the inner routine is the Subgradient Method (SM). The SM minimizes a given $\sigma_{f}-$ strongly convex function $f$ up to $\delta$ accuracy, i.e. $f(x)-f^{*}\leq\delta$ , within $\mathcal{O}\left(\frac{1}{\delta}\right)$ subgradient calls [17]. By the quadratic growth guaranteed by strong convexity, i.e. $\frac{\sigma_{f}}{2}\lVert x-x^{*}\rVert^{2}\leq f(x)-f^{*}$ , approaching the minimizer within distance $\delta$ therefore requires $\mathcal{O}\left(\frac{1}{\delta^{2}}\right)$ SM iterations. According to this bound, computing a single iteration of IPPA costs $\mathcal{O}\left(\frac{1}{\delta_{k}^{2}}\right)$ subgradient calls and, therefore, a direct naive count of the total number of subgradient evaluations over $T$ outer iterations is of order $\mathcal{O}\left(\sum\limits_{k=0}^{T}\frac{1}{\delta_{k}^{2}}\right)$ . Combining this with the previous estimates of $T$ necessary for $\text{dist}_{X^{*}}(x^{T})\leq\epsilon$ yields a total estimate of subgradient calls which is significantly larger than the known optimal bound of $\mathcal{O}\left(\frac{1}{\epsilon^{2(\gamma-1)}}\right)$ for the SM algorithm. In the next section we give a restarted variant of IPPA that overcomes this issue.
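To make the naive count concrete, the sketch below evaluates $\sum_{k=0}^{T}\frac{1}{\delta_{k}^{2}}$ for the halving schedule $\delta_{k}=\delta_{0}/2^{k}$ , with all hidden constants set to one (an assumption made here only for illustration; the function names are hypothetical). The sum is geometric and dominated by its last term, of order $4^{T}/\delta_{0}^{2}$ , which shows how quickly the inner cost grows with the number of outer iterations.

```python
def naive_subgradient_calls(delta0, T):
    """Sum of 1/delta_k**2 for k = 0..T with delta_k = delta0 / 2**k."""
    return sum((2.0 ** k / delta0) ** 2 for k in range(T + 1))


def naive_subgradient_calls_closed_form(delta0, T):
    """Closed form of the same geometric sum: (4**(T+1) - 1) / (3 * delta0**2)."""
    return (4.0 ** (T + 1) - 1.0) / (3.0 * delta0 ** 2)


print(naive_subgradient_calls(1.0, 3))              # prints: 85.0
print(naive_subgradient_calls_closed_form(1.0, 3))  # prints: 85.0
```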

5 Restarted Inexact Proximal Point Algorithm

The Restarted IPPA (RIPPA) illustrates a simple recursive call of the IPPA combined with a multiplicative decrease of the parameters. Observe that RIPPA is completely independent of the problem constants.

Initialize: $\delta^{\nabla}_{0}:=\delta_{0},\;t:=0$

while stopping criterion do

Call IPPA to compute: $x^{t+1}=IPPA(x^{t},\mu_{t},\{\delta_{k}:=\delta_{t}\}_{k\geq 0},5\delta_{t}^{\nabla})$

Update: $\mu_{t+1}=2\mu_{t},\;\delta^{\nabla}_{t+1}=\frac{\delta^{\nabla}_{t}}{2^{\rho}},\;\delta_{t+1}=\mu_{t+1}\delta^{\nabla}_{t+1}$

$t:=t+1$

end while

Algorithm 2 Restarted IPPA $(x^{0},\mu_{0},\delta_{0},\rho)$
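The parameter updates of Algorithm 2 can be traced with a few lines of code. The sketch below only reproduces the schedule $(\mu_{t},\delta^{\nabla}_{t},\delta_{t})$ and does not call IPPA itself; the function name `rippa_schedule` and the numerical values are hypothetical choices made for illustration.

```python
def rippa_schedule(mu0, delta0, rho, epochs):
    """Return the list of (t, mu_t, delta_grad_t, delta_t) used over the epochs.

    Follows the updates mu_{t+1} = 2*mu_t, delta_grad_{t+1} = delta_grad_t / 2**rho
    and delta_{t+1} = mu_{t+1} * delta_grad_{t+1}, with delta_grad_0 = delta_0.
    """
    mu, delta_grad, delta = mu0, delta0, delta0
    schedule = []
    for t in range(epochs):
        schedule.append((t, mu, delta_grad, delta))
        mu = 2.0 * mu
        delta_grad = delta_grad / 2.0 ** rho
        delta = mu * delta_grad
    return schedule


for row in rippa_schedule(mu0=1.0, delta0=1.0, rho=2.0, epochs=3):
    print(row)
```

With these updates, $\delta_{t}$ changes by the factor $2^{1-\rho}$ from one epoch to the next, so the inner accuracy tightens only when $\rho>1$ .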

As usual in the context of restarting schemes, we call each $t-$ th iteration an epoch. The stopping criterion can be based either on a fixed number of epochs or on the reduction of the gradient norm (10). Denote $K_{0}=\lceil\frac{\text{dist}_{X^{*}}(x^{0})}{\mu_{0}\delta_{0}}\rceil$ .

Theorem 5.1.

Let $\delta_{0},\mu_{0}$ be positive constants and $\rho>1$ . Then the sequence $\{x^{t}\}_{t\geq 0}$ generated by $RIPPA(x^{0},\mu_{0},\delta_{0})$ attains $\text{dist}_{X^{*}}(x^{t})\leq\epsilon$ after a number of $\mathcal{T}_{IPP}(\gamma,\epsilon)$ iterations.

Let $\gamma=1$ and assume $\epsilon<\mu_{0}\sigma_{F}$ and $\text{dist}_{X^{*}}(x^{0})\geq\mu_{0}\sigma_{F}$ , then

\displaystyle\mathcal{T}_{IPP}(1,\epsilon)=\frac{1}{\rho-1}\log\left(\frac{\mu_{0}\delta_{0}}{\epsilon}\right)+\mathcal{T}_{ct},

where $\mathcal{T}_{ct}=K_{0}\lceil\frac{1}{\rho}\log\left(\frac{12\delta_{0}}{\sigma_{F}}\right)\rceil.$ In particular, if $\delta_{0}<\mu_{0}\sigma_{F}$ then $RIPPA(x^{0},\mu_{0},\delta_{0},0)$ reaches the $\epsilon-$ suboptimality within $\mathcal{O}\left(\log\left(\frac{\mu_{0}\sigma_{F}}{\epsilon}\right)\right)$ iterations.

Let $\gamma=2$ , then

\displaystyle\mathcal{T}_{IPP}(2,\epsilon)=\mathcal{O}\left(\frac{1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right)\right)+K_{0},

Let $\gamma\in(1,2)$ , then:

\displaystyle\mathcal{T}_{IPP}(\gamma,\epsilon)=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right),\frac{1}{\rho-1}\log\left(\frac{\mu_{0}\delta_{0}}{\epsilon}\right)\right\}\right)+K_{0}.

Otherwise, for $\gamma>2$

\displaystyle\mathcal{T}_{IPP}(\gamma,\epsilon)=\mathcal{O}\left(\left(\frac{\delta_{0}}{\epsilon}\right)^{\left[\left(1-\frac{1}{\rho}\right)(\gamma-1)-1\right]\max\left\{1,\frac{1}{(1-1/\rho)(\gamma-1)}\right\}}\right)+K_{0}.

Proof.

The proof can be found in the Appendix. ∎

Remark 2.

Notice that for any $\gamma\in[1,2]$ , logarithmic complexity $\mathcal{O}\left(\log(1/\epsilon)\right)$ is obtained. When $\gamma>2$ , the above estimate is shortened as

\displaystyle\mathcal{O}\left(\left(\frac{1}{\epsilon}\right)^{(\zeta-1)\max\left\{1,\frac{1}{\zeta}\right\}}\right),

where $\zeta=(\gamma-1)\left(1-\frac{1}{\rho}\right)$ . In particular, if $\rho\leq\frac{\gamma-1}{\gamma-2}$ , then all epochs reduce to length $1$ and the total number of IPPA iterations reduces to the same order as in the exact case: $\;\mathcal{O}\left(\left(\frac{1}{\epsilon}\right)^{\gamma-2}\right).$

6 Inner Proximal Subgradient Method routine

Although the influence of growth modulus on the behaviour of IPPA is obvious, all complexity estimates derived in the previous sections assume the existence of an oracle computing an approximate proximal mapping:

\displaystyle x^{k+1}\approx\arg\min\limits_{z\in\mathbf{R}^{n}}\;F(z)+\frac{1}{2\mu}\lVert z-x^{k}\rVert^{2}.

(12)

Therefore in the rest of this section we focus on the solution of the following inner minimization problem:

\displaystyle\min\limits_{z\in\mathbf{R}^{n}}\;f(z)+\psi(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}.

In most situations, despite the regularization term, this problem is not trivial and one should select an appropriate routine that computes $\{x^{k}\}_{k\geq 0}$ . For instance, variance-reduced or accelerated proximal first-order methods were employed in [20, 21, 22, 23] as inner routines and theoretical guarantees were provided. Also, a Conjugate Gradient based Newton method was used in [51] under a twice differentiability assumption on $f$ . We limit our analysis to gradient-type routines and leave other accelerated or higher-order methods, which typically improve the performance of their classical counterparts, for future work.

In this section we evaluate the computational complexity of IPPA in terms of number of proximal gradient iterations. The basic routine for solving (12), that we analyze below, is the Proximal (sub)Gradient Method. Notice that when $f$ is nonsmooth with bounded subgradients, we consider only the case when $\psi$ is an indicator function for a simple, closed convex set. In this situation, PsGM becomes a simple projected subgradient scheme with constant stepsize that solves (12).

for $\ell=0,\cdots,N-1$ do

$z^{\ell+1}=\text{prox}_{\alpha}^{\psi}\left(z^{\ell}-\alpha\left(f^{\prime}(z^{\ell})+\frac{1}{\mu}(z^{\ell}-x^{k})\right)\right)$

end for

Output: $z^{N}$

Algorithm 3 Proximal subGradient Method (PsGM) $(z^{0},x^{k},\alpha,\mu,N)$
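A minimal sketch of the PsGM iteration is given below in plain Python. It is an illustration under assumptions that are not part of the text: vectors are stored as lists, the subgradient oracle `subgrad_f` and the proximal operator `prox_psi` are passed in as callables, and the example uses $f(z)=\lVert z\rVert_{1}$ with $\psi$ the indicator of the box $[-1,1]^{n}$ , so that the proximal step is a projection.

```python
def psgm(z0, x_k, alpha, mu, n_iters, subgrad_f, prox_psi):
    """Proximal subgradient iterations on f(z) + psi(z) + ||z - x_k||^2 / (2*mu).

    subgrad_f(z) returns one subgradient of f at z (as a list);
    prox_psi(v, alpha) returns the proximal map of psi with step alpha at v.
    """
    z = list(z0)
    for _ in range(n_iters):
        g = subgrad_f(z)
        # Subgradient step on f plus the gradient of the quadratic regularizer.
        v = [zi - alpha * (gi + (zi - xi) / mu) for zi, gi, xi in zip(z, g, x_k)]
        z = prox_psi(v, alpha)
    return z


def l1_subgrad(z):
    """A subgradient of the l1 norm: the componentwise sign (0 at 0)."""
    return [float((zi > 0) - (zi < 0)) for zi in z]


def project_box(v, alpha, lo=-1.0, hi=1.0):
    """Prox of the indicator of [lo, hi]^n, i.e. a projection (alpha is unused)."""
    return [min(max(vi, lo), hi) for vi in v]


# One projected subgradient step started at the prox center.
print(psgm([2.0, -3.0], [2.0, -3.0], 0.5, 1.0, 1, l1_subgrad, project_box))
# prints: [1.0, -1.0]
```

When $f\equiv 0$ and $\psi\equiv 0$ the iteration contracts towards the prox center $x^{k}$ with factor $1-\alpha/\mu$ , which is a convenient sanity check for an implementation.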

Through a natural combination of the outer guiding IPP iteration and the inner PsGM scheme, we further derive the total complexity of the proximal-based restarted Subgradient Method in terms of subgradient oracle calls.

Initialize: $t:=0,\;\alpha_{0}:=\frac{\mu_{0}}{2},\;N_{0}:=\max\left\{8\log(L_{f}/\delta_{0})+1,\rho-1\right\}$

for $t:=0,\cdots,T-1$ do

$x^{0,t}:=x^{t},\;k:=0$

do

$x^{k+1,t}:=PsGM(x^{k,t},x^{k,t},\alpha_{t},\mu_{t},N_{t})$

$k:=k+1$

while $\lVert x^{k-1,t}-x^{k-2,t}\rVert>\mu_{t}\delta_{t}$

$x^{t+1}:=x^{k-1,t}$

$\alpha_{t+1}:=\alpha_{t}2^{-q},\;N_{t+1}:=N_{t}2^{-(q+1)}$

$\mu_{t+1}:=2\mu_{t},\;\delta_{t+1}:=2^{-\rho}\delta_{t}$

end for

Return $x^{T},\delta_{T}$

Postprocessing $(\tilde{x}^{0},\beta_{0},\mu,N,K)$ :

for $k:=0,\cdots,K-1$ do

$\tilde{x}^{k+1}:=PsGM(\tilde{x}^{k},\tilde{x}^{k},\beta_{k},\mu,N)$

$\beta_{k+1}:=\beta_{k}/2$

end for

Return $\tilde{x}^{K}$

Algorithm 4 Restarted Inexact Proximal Point - SubGradient Method (RIPP-PsGM) $(x^{0},\delta_{0},\mu_{0},\rho,q,L_{f},T)$

Theorem 6.1.

Let the assumptions of Theorem 8.8 hold and $\rho>1,\delta_{0}\geq 2L_{f}$ then the following assertions hold:

$(i)$ Let $x^{f}$ be generated as follows:

	$\displaystyle\{x^{T},\delta_{T}\}$	$\displaystyle=\textbf{RIPP-PsGM}(x^{0},\delta_{0},\mu_{0},\rho,2\rho-1,L_{f},T)$
	$\displaystyle x^{f}$	$\displaystyle=\textbf{Postprocessing}(x^{T},\delta_{T}^{2}/L_{f}^{2},\mu_{T},\lceil 2(L_{f}/\delta_{T})^{2}\rceil,\lceil\log(\delta_{T}/\epsilon)\rceil).$

For $T\geq\left\lceil\frac{1}{\rho}\log(12\delta_{0}/\sigma_{F})\right\rceil$ , the final output $x^{f}$ satisfies $\text{dist}_{X^{*}}(x^{f})\leq\epsilon$ after $\mathcal{O}\left(\log\left(1/\epsilon\right)\right)$ PsGM iterations.

$(ii)$ Moreover, if $f$ has $\nu-$ Holder continuous gradients with constant $L_{f}$ and $\nu\leq\gamma-1$ , assume $\delta_{0}\geq(2L_{f}^{2})^{\frac{1}{2(1-\nu)}}\mu_{0}^{\frac{\nu}{1-\nu}}$ and $\frac{q-1}{\rho-1}\geq 2(1-\nu)$ . If $T=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log(\mu_{0}\delta_{0}/\epsilon),\frac{1}{\rho-1}\log(\mu_{0}\delta_{0}/\epsilon)\right\}\right)$ , then the output $x^{T}:=RIPP-PsGM(x^{0},\delta_{0},\mu_{0},\rho,q,L_{f},T)$ satisfies $\text{dist}_{X^{*}}(x^{T})\leq\epsilon$ within a total cost of:

\displaystyle\mathcal{O}\left(1/\epsilon^{\max\left\{\gamma-2+\frac{q}{\rho}(\gamma-1),(\gamma-2)\frac{\rho}{(\rho-1)(\gamma-1)}+\frac{q}{\rho-1}\right\}}\right)

PsGM iterations.

Proof.

The proof can be found in the Appendix. ∎

Remark 3.

Although the bound on the number of epochs in RIPP-PsGM depends on $\gamma$ , a sufficiently large value of $T$ ensures that the result holds. To investigate some important particular bounds hidden in the above complexity estimates (of Theorem 6.1), we analyze different choices of the input parameters $(\rho,q)$ , which are summarized in Table 1.

Assume $\nu$ is known, $q=1+2(1-\nu)(\rho-1)$ and denote $\zeta=\left(1-\frac{1}{\rho}\right)(\gamma-1)$

	$\displaystyle[\gamma=1+\nu]$	$\displaystyle\quad\mathcal{O}\left(1/\epsilon^{(3-2\gamma)(\zeta-1)\max\left\{1,\frac{1}{\zeta}\right\}}\right)$		(13)
	$\displaystyle[\gamma>1+\nu]$	$\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\left[2(\gamma-\nu-1)+(1-2\nu)(\zeta-1)\right]\max\left\{1,\frac{1}{\zeta}\right\}}\right).$		(14)

Under knowledge of $\gamma$ , setting $\rho=\frac{\gamma-1}{\gamma-2}$ gives $\zeta=1$ and (13) becomes $\mathcal{O}\left(\log(1/\epsilon)\right)$ . When $\gamma<2$ , a sufficiently large $\rho$ simplifies (14) into $\mathcal{O}\left(1/\epsilon^{3-2\nu-\frac{1}{\gamma-1}}\right)$ . Similarly, given any $\nu\in[1/2,1]$ and $\gamma>2$ , for $\rho\to\infty$ the estimate (14) reduces to $\mathcal{O}\left(1/\epsilon^{(3-2\nu)(\gamma-1)-1}\right)$ .
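These limiting exponents can be checked numerically. The sketch below evaluates $\zeta$ and the exponent of (14) as plain functions and compares them with the two simplified expressions for a very large $\rho$ ; the function names and the sample values of $(\gamma,\nu)$ are hypothetical choices made only for this check.

```python
def zeta(gamma, rho):
    """zeta = (1 - 1/rho) * (gamma - 1)."""
    return (1.0 - 1.0 / rho) * (gamma - 1.0)


def exponent_14(gamma, nu, rho):
    """Exponent of 1/epsilon in estimate (14)."""
    z = zeta(gamma, rho)
    return (2.0 * (gamma - nu - 1.0) + (1.0 - 2.0 * nu) * (z - 1.0)) * max(1.0, 1.0 / z)


large_rho = 1e9

# rho = (gamma - 1) / (gamma - 2) gives zeta = 1 (here gamma = 3, rho = 2).
print(zeta(3.0, 2.0))

# gamma > 2: the limit should be (3 - 2*nu) * (gamma - 1) - 1.
print(exponent_14(3.0, 0.5, large_rho), (3.0 - 2.0 * 0.5) * (3.0 - 1.0) - 1.0)

# gamma < 2: the limit should be 3 - 2*nu - 1 / (gamma - 1).
print(exponent_14(1.5, 0.25, large_rho), 3.0 - 2.0 * 0.25 - 1.0 / (1.5 - 1.0))
```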

In the particular smooth case $\nu=1$ , bounds (13)-(14) become:

	$\displaystyle[\gamma=2]$	$\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\max\left\{\frac{1}{\rho},\frac{1}{\rho-1}\right\}}\right)$
	$\displaystyle[\gamma>2]$	$\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\left[\gamma-2+\frac{\gamma-1}{\rho}\right]\max\left\{1,\frac{1}{\zeta}\right\}}\right).$

For high values of $\rho\geq\log(1/\epsilon)$ , the first one becomes $\mathcal{O}\left(\log(1/\epsilon)\right)$ . Also the second one reduces to $\mathcal{O}\left(1/\epsilon^{\gamma-2}\right)$ when $\rho\geq(\gamma-1)\log(1/\epsilon)$ .

In the bounded gradients case $\nu=0$ the estimates reduce to

	$\displaystyle[\gamma=1]$	$\displaystyle\quad\mathcal{O}\left(\log(1/\epsilon)\right)$		(15)
	$\displaystyle[\gamma>1]$	$\displaystyle\quad\mathcal{O}\left(1/\epsilon^{\left[2(\gamma-1)+\zeta-1\right]\max\left\{1,\frac{1}{\zeta}\right\}}\right).$		(16)

First observe that when the main parameters $\sigma_{F},L_{f},\gamma$ are known then $\zeta=1$ and we recover the same iteration complexity in terms of the number of subgradient evaluations as in the literature [53, 16, 41, 10]. The last estimate (16) holds when RIPP-PsGM performs a sufficiently high number of epochs, under no availability of problem parameters $\sigma_{F},\nu,\gamma$ .

The optimal complexity estimates $\mathcal{O}\left(\epsilon^{-\frac{2(\gamma-\nu-1)}{3(1+\nu)-2}}\right)$ , in terms of the number of (sub)gradient evaluations, are derived in [29, 41] for accelerated first-order methods under $\nu-$ Holder smoothness and $\gamma-$ HG. These optimal estimates require full availability of the problem information: $\gamma,\sigma_{F},\nu,L_{f}$ .
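As a quick sanity check of the exponent $\frac{2(\gamma-\nu-1)}{3(1+\nu)-2}$ , the sketch below evaluates it at the two extreme smoothness levels; the function name is a hypothetical label introduced here. For $\nu=0$ it returns $2(\gamma-1)$ , the exponent appearing in the first row of Table 1, and for $\nu=1$ , $\gamma=2$ it vanishes, consistent with the absence of a polynomial dependence on $1/\epsilon$ in the smooth quadratic growth case.

```python
def optimal_exponent(gamma, nu):
    """Exponent of 1/epsilon in the estimate eps**(-2(gamma-nu-1) / (3(1+nu)-2))."""
    return 2.0 * (gamma - nu - 1.0) / (3.0 * (1.0 + nu) - 2.0)


print(optimal_exponent(2.0, 0.0))  # bounded subgradients, quadratic growth
print(optimal_exponent(2.0, 1.0))  # smooth case, quadratic growth
print(optimal_exponent(3.0, 0.0))  # bounded subgradients, gamma = 3
```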

Knowledge	DS-SG[16]	RSG[53]	Restarted UGM[29]	IPPA-PsGM
$\sigma_{F},\gamma,L_{f}(\nu=0)$	$\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)$	$\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)$	$\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)$	$\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)$
$\sigma_{F},\gamma,L_{f},\nu>0$	-	-	$\mathcal{O}\left(\epsilon^{-\frac{2(\gamma-\nu-1)}{3(1+\nu)-2}}\right)$	(13)/(14)
$\gamma,L_{f}$	-	-	$\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)$	$\mathcal{O}\left(\epsilon^{-2(\gamma-1)}\right)$
$\nu,L_{f}$	-	-	-	(13)/(14)
$L_{f}$	-	-	-	(16)

Table 1: Comparison of complexity estimates under various knowledge degrees of problem information

7 Numerical simulations

In this section we evaluate RIPP-PsGM by applying it to real-world applications often found in machine learning tasks. The algorithm and its applications are publicly available online at https://github.com/pirofti/IPPA.

Unless stated otherwise, we perform enough epochs (restarts) until the objective is within $\varepsilon_{0}=0.5$ of the CVX-computed optimum. The current objective value is computed within each inner PsGM iteration. All the models under consideration satisfy the WSM property and therefore the implementation of PsGM reduces to the scheme of the (projected) Subgradient Method.

We would like to thank the authors of the methods we compare with for providing the code implementation for reproducing the experiments. No modifications were performed by us on the algorithms or their specific parameters. Following their implementation and as is common in the literature, in our reports we also use the minimum error obtained in the iterations thus far.

All our experiments were implemented in Python 3.10.2 under ArchLinux (Linux version 5.16.15-arch1-1) and executed on an AMD Ryzen Threadripper PRO 3955WX with 16 cores and 512GB of system memory.

7.1 Robust $\ell_{1}$ Least Squares

We start out with the least-squares (LS) problem in the $\ell_{1}$ setting. This form deviates from standard LS by imposing an $\ell_{1}$ -norm on the objective and by constraining the solution sparsity through the $\tau$ parameter on its $\ell_{1}$ -norm:

	$\displaystyle\min\limits_{x\in\mathbf{R}^{n}}\;$	$\displaystyle\;\lVert Ax-b\rVert_{1}$
	s.t.	$\displaystyle\;\;\lVert x\rVert_{1}\leq\tau$
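The two ingredients a projected subgradient routine needs for this model are a subgradient of $\lVert Ax-b\rVert_{1}$ and the Euclidean projection onto the $\ell_{1}$ ball of radius $\tau$ . The sketch below is a plain-Python illustration and not the implementation used in the experiments: the subgradient is taken as $A^{T}\text{sign}(Ax-b)$ , the projection uses the standard sorting-based thresholding procedure, and all function names are hypothetical.

```python
import math


def l1_residual_subgrad(A, b, x):
    """One subgradient of ||Ax - b||_1 at x, namely A^T sign(Ax - b)."""
    m, n = len(A), len(x)
    residual = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
    signs = [float((r > 0) - (r < 0)) for r in residual]
    return [sum(A[i][j] * signs[i] for i in range(m)) for j in range(n)]


def project_l1_ball(v, tau):
    """Euclidean projection of v onto {x : ||x||_1 <= tau} via sorting."""
    if sum(abs(vi) for vi in v) <= tau:
        return list(v)  # already feasible (covers tau = infinity)
    u = sorted((abs(vi) for vi in v), reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumulative += uk
        candidate = (cumulative - tau) / k
        if uk - candidate > 0:
            theta = candidate  # keep the threshold of the largest valid k
    # Soft-threshold the magnitudes by theta and restore the signs.
    return [math.copysign(max(abs(vi) - theta, 0.0), vi) for vi in v]


print(project_l1_ball([3.0, 0.0], 1.0))   # prints: [1.0, 0.0]
print(project_l1_ball([-2.0, 2.0], 2.0))  # prints: [-1.0, 1.0]
print(l1_residual_subgrad([[1.0, 2.0]], [0.0], [1.0, 1.0]))  # prints: [1.0, 2.0]
```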

Our goal is to analyze the effect of the data dimensions, the problem and IPPA parameters on the total number of iterations.

Figure 1: Total number of inner iterations needed for various parametrizations. (a) Varying $\rho$ . (b) Varied problem dimensions where we set $m=n$ . (c) Varying $\tau$ .

The first experiment, in Figure 1 (a), investigates the effect of the $\rho$ parameter on the unconstrained $\ell_{1}$ -LS formulation ( $\tau=\infty$ ) on a small $50\times 20$ problem. We start with $\mu=0.1$ , run 9 epochs and vary $\rho$ from 1.005 to 1.1 in steps of 0.005. In Figure 1 (b), we repeat the same experiment with fixed $\rho=1.005$ , but with varied problem dimensions, from 10 up to 200 in increments of 5, where we set both dimensions equal ( $m=n$ ). Finally, in Figure 1 (c), we study the effect of the problem-specific parameter $\tau$ on the total number of iterations. Although only faint effects are noticeable in the beginning, we can see a sudden burst past $\tau=3.4$ . Please note that this is specific to $\ell_{1}$ -LS and not to IPPA in general, as we will see in the following section.

7.2 Graph Support Vector Machines

Graph SVM adds a graph-guided lasso regularization to the standard SVM averaged hinge-loss objective and extends the $\ell_{1}$ -SVM formulation through the factorization $Mx$ where $M$ is the weighted graph adjacency matrix, i.e.

\displaystyle\min\limits_{x\in\mathbf{R}^{n}}\;

\displaystyle\;\frac{1}{m}\sum_{i=1}^{m}\max\{0,1-y_{i}a_{i}^{T}x\}+\tau\lVert Mx\rVert_{1}

(17)

where $a_{i}\in\mathbf{R}^{n}$ , $y_{i}\in\{\pm 1\}$ are the $i-$ th data point and its label, respectively. When $M=I_{n}$ the Regularized Sparse $\ell_{1}$ -SVM formulation is recovered.
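For concreteness, the objective (17) and one of its subgradients can be written in a few lines. The sketch below is a plain-Python illustration with hypothetical function names, not the code used in the experiments; it takes the hinge-loss subgradient $-\frac{1}{m}\sum_{i:\,y_{i}a_{i}^{T}x<1}y_{i}a_{i}$ and the regularizer subgradient $\tau M^{T}\text{sign}(Mx)$ .

```python
def matvec(M, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def graph_svm_objective(A, y, M, tau, x):
    """Averaged hinge loss plus tau * ||Mx||_1."""
    m = len(A)
    hinge = sum(max(0.0, 1.0 - yi * ai_x) for yi, ai_x in zip(y, matvec(A, x))) / m
    return hinge + tau * sum(abs(v) for v in matvec(M, x))


def graph_svm_subgrad(A, y, M, tau, x):
    """One subgradient of the Graph SVM objective at x."""
    m, n = len(A), len(x)
    g = [0.0] * n
    # Hinge part: only points violating the margin contribute.
    for ai, yi, ai_x in zip(A, y, matvec(A, x)):
        if 1.0 - yi * ai_x > 0.0:
            for j in range(n):
                g[j] -= yi * ai[j] / m
    # Regularizer part: tau * M^T sign(Mx), with sign(0) = 0.
    for row, v in zip(M, matvec(M, x)):
        s = float((v > 0) - (v < 0))
        for j in range(n):
            g[j] += tau * row[j] * s
    return g


A, y = [[1.0, 0.0]], [1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(graph_svm_objective(A, y, identity, 1.0, [0.0, 0.0]))  # prints: 1.0
print(graph_svm_objective(A, y, identity, 1.0, [2.0, 0.0]))  # prints: 2.0
print(graph_svm_subgrad(A, y, identity, 1.0, [2.0, 0.0]))    # prints: [1.0, 0.0]
```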

In Figure 2 we present multiple experiments on different data and across different $\tau$ parametrizations. Figure 2 (a) and (b) compare RIPP-PsGM with R²SG from [53] on synthetic random data $\{C,X,M\}$ from the standard normal distribution. The same initialization and starting point $x_{0}$ were used for all methods. We use $m=100$ measurements of $n=512$ samples $x$ with initial parameters $\mu=0.1$ , $\rho=1.0005$ and $\tau=1$ , which we execute for 15 epochs.

We repeat the experiment in Figure 2 (c) and (d), but this time on real data from the 20newsgroup data set (https://cs.nyu.edu/~roweis/data.html), following the experiment from [53], with parameters $\mu=0.1$ , $\rho=1.005$ , and $\tau=3$ . Here we find a similar behaviour for both methods as in the synthetic case. In Figure 2 (e) and (f), we repeat the experiment by setting $M=I_{n}$ in (17), thus recovering the regularized $\ell_{1}$ -SVM formulation. We notice here that RIPP-PsGM maintains its position ahead of R²SG and that the error drop is similar to the 20newsgroup experiment.

Furthermore, Figure 3 repeats the experiments from the $\ell_{1}$ -LS section in order to study the effect of the data dimensions and of the problem parameters on the total number of required inner iterations. The results for $\rho$ and the data dimensions are as expected: as they grow, the number of iterations increases almost linearly. For the GraphSVM-specific parameter $\tau$ , we find the results are opposite to those of $\ell_{1}$ -LS: it is harder to solve the problem when $\tau$ is small.

In order to compare with DS²SG [16] and R²SG, we follow the Sparse SVM experiment from [16], which requires $M=I_{n}$ , removes the regularization $\tau\lVert x\rVert_{1}$ from (17) and instead adds it as a constraint $\{x:\lVert x\rVert_{1}\leq\tau\}$ . We used the parameters $\mu=0.001$ , $\rho=1.0005$ , and $\tau=0.4$ and plotted the results in Figure 4. Please note that the starting point is the same, with a quick initial drop for all three methods, as can be seen in Figure 4 (a). Execution times were almost identical, around 0.15s. To investigate the differences in convergence behaviour, we zoom in after the first few iterations in the 4 panels of Figure 4 (b). In the first panel we show the curves for all methods together and in the other three we show the individual curve of each method. Our experiments showed that this is the general behaviour for $\tau>0.4$ : DS²SG has a slightly sharper drop, with RIPP-PsGM following closely with a staircase behaviour, while R²SG takes a few more iterations to reach convergence. For smaller values of $\tau$ we found that RIPP-PsGM reaches the solution in just 1–5 iterations within $10^{-9}$ precision, while the others lag behind for a few hundred iterations.

7.3 Matrix Completion for Movie Recommendation

In this last section, the problem of matrix completion is applied to the standard movie recommendation challenge which recovers a full user-rating matrix $X$ from the partial observations $Y$ corresponding to the $N$ known user-movie ratings pairs.

\displaystyle\min\limits_{X\in\mathbf{R}^{m\times n}}\;

\displaystyle\;\frac{1}{N}\sum_{(i,j)\in\Sigma}^{N}|X_{ij}-Y_{ij}|+\tau\lVert X\rVert_{\star}

where $\Sigma$ is the set of user-movie pairs with $N=|\Sigma|$ . Solving this problem completes the matrix $X$ based on the known sparse matrix $Y$ while maintaining a low rank.

In Figure 5 we reproduce the experiment from [53] with parameters $\mu=0.1$ , $\rho=1.005$ , $\tau=3$ on a synthetic database with 50 movies and 20 users filled with 250 i.i.d. randomly chosen ratings from 1 to 5. We let R²SG execute a few more iterations to show the slight progress that is made.

References

[1] Antipin, A.: On finite convergence of processes to a sharp minimum and to a smooth minimum with a sharp derivative. Differential Equations 30(11), 1703–1713 (1994)
[2] Bauschke, H.H., Dao, M.N., Noll, D., Phan, H.M.: Proximal point algorithm, douglas-rachford algorithm and alternating projections: a case study. Journal of Convex Analysis 23(1), 237–261 (2015)
[3] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1), 183–202 (2009)
[4] Bertsekas, D.P.: Constrained optimization and Lagrange multiplier methods. Athena Scientific (1982)
[5] Bertsekas, D.P.: Parallel and distributed computation: numerical methods, vol. 23. Prentice hall Englewood Cliffs, NJ (1989)
[6] Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165(2), 471–507 (2017)
[7] Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM Journal on Control and Optimization 31(5), 1340–1359 (1993)
[8] Davis, D., Drusvyatskiy, D., MacPhee, K.J., Paquette, C.: Subgradient methods for sharp weakly convex functions. Journal of Optimization Theory and Applications 179(3), 962–982 (2018)
[9] Ferris, M.C.: Finite termination of the proximal point algorithm. Mathematical Programming 50(1), 359–366 (1991)
[10] Freund, R.M., Lu, H.: New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure. Mathematical Programming 170(2), 445–477 (2018)
[11] Gilpin, A., Pena, J., Sandholm, T.: First-order algorithm with $\mathcal{O}(\ln(1/\epsilon))$ convergence for $\epsilon$ -equilibrium in two-person zero-sum games. Mathematical programming 133(1), 279–298 (2012)
[12] Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM journal on control and optimization 29(2), 403–419 (1991)
[13] Güler, O.: New proximal point algorithms for convex minimization. SIAM Journal on Optimization 2(4), 649–664 (1992)
[14] Hu, Y., Li, C., Yang, X.: On convergence rates of linearized proximal algorithms for convex composite optimization with applications. SIAM Journal on Optimization 26(2), 1207–1235 (2016)
[15] Humes, C., Silva, P.J.: Inexact proximal point algorithms and descent methods in optimization. Optimization and Engineering 6(2), 257–271 (2005)
[16] Johnstone, P.R., Moulin, P.: Faster subgradient methods for functions with hölderian growth. Mathematical Programming 180(1), 417–450 (2020)
[17] Juditsky, A., Nesterov, Y.: Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems 4(1), 44–80 (2014)
[18] Kort, B.W., Bertsekas, D.P.: Combined primal–dual and penalty methods for convex programming. SIAM Journal on Control and Optimization 14(2), 268–294 (1976)
[19] Li, G., Mordukhovich, B.S.: Holder metric subregularity with applications to proximal point method. SIAM Journal on Optimization 22(4), 1655–1684 (2012)
[20] Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. Advances in neural information processing systems 28 (2015)
[21] Lin, H., Mairal, J., Harchaoui, Z.: Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research 18(1), 7854–7907 (2018)
[22] Lu, M., Qu, Z.: An adaptive proximal point algorithm framework and application to large-scale optimization. arXiv preprint arXiv:2008.08784 (2020)
[23] Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research 46(1), 157–178 (1993)
[24] Mairal, J.: Cyanure: An open-source toolbox for empirical risk minimization for python, c++, and soon more. arXiv preprint arXiv:1912.08165 (2019)
[25] Monteiro, R.D., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization 23(2), 1092–1125 (2013)
[26] Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming 175(1), 69–107 (2019)
[27] Nedić, A., Bertsekas, D.P.: The effect of deterministic noise in subgradient methods. Mathematical programming 125(1), 75–99 (2010)
[28] Nemirovski, A.: Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1), 229–251 (2004)
[29] Nemirovskii, A., Nesterov, Y.: Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics 25(2), 21–30 (1985). https://doi.org/10.1016/0041-5553(85)90100-4. URL https://www.sciencedirect.com/science/article/pii/0041555385901004
[30] Nesterov, Y.: How to make the gradients small. Optima 88 (2012)
[31] Nesterov, Y.: Inexact accelerated high-order proximal-point methods. Mathematical Programming pp. 1–26 (2021)
[32] Nesterov, Y.: Inexact high-order proximal-point methods with auxiliary search procedure. SIAM Journal on Optimization 31(4), 2807–2828 (2021)
[33] Patrascu, A., Irofti, P.: On finite termination of an inexact proximal point algorithm. Applied Mathematics Letters 134, 108348 (2022). https://doi.org/10.1016/j.aml.2022.108348
[34] Polyak, B.: A general method of solving extremal problems. Math. Doklady 8, 593–597 (1967)
[35] Polyak, B.: Minimization of unsmooth functionals. U.S.S.R. Computational Mathematics and Mathematical Physics 9, 509–521 (1969)
[36] Polyak, B.: Nonlinear programming methods in the presence of noise. Mathematical Programming 14, 87–97 (1978)
[37] Polyak, B.: Introduction to optimization. Optimization Software, Inc., Publications Division, New York (1987)
[38] Renegar, J.: Efficient first-order methods for linear programming and semidefinite programming. arXiv preprint arXiv:1409.5832 (2014)
[39] Rockafellar, R.T.: Augmented lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of operations research 1(2), 97–116 (1976)
[40] Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM journal on control and optimization 14(5), 877–898 (1976)
[41] Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. SIAM Journal on Optimization 30(1), 262–289 (2020)
[42] Salzo, S., Villa, S.: Inexact and accelerated proximal point algorithms. Journal of Convex analysis 19(4), 1167–1192 (2012)
[43] Shor, N.: An application of the method of gradient descent to the solution of the network transportation problem. Materialy Naucnovo Seminara po Teoret i Priklad. Voprosam Kibernet. i Issted. Operacii, Nuc- nyi Sov. po Kibernet, Akad. Nauk Ukrain. SSSR 1, 9–17 (1962)
[44] Shor, N.: On the structure of algorithms for numerical solution of problems of optimal planning and design. Diss. Doctor Philos, Kiev (1964)
[45] Shulgin, E., Gasnikov, A., Matyukhin, V.: Adaptive catalyst for smooth convex optimization. In: Optimization and Applications: 12th International Conference, OPTIMA 2021, Petrovac, Montenegro, September 27–October 1, 2021, Proceedings, vol. 13078, p. 20. Springer Nature (2021)
[46] Solodov, M.V., Svaiter, B.F.: A hybrid approximate extragradient–proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis 7(4), 323–345 (1999)
[47] Solodov, M.V., Svaiter, B.F.: A hybrid projection-proximal point algorithm. Journal of convex analysis 6(1), 59–70 (1999)
[48] Solodov, M.V., Svaiter, B.F.: Error bounds for proximal point subproblems and associated inexact proximal point algorithms. Mathematical programming 88(2), 371–389 (2000)
[49] Solodov, M.V., Svaiter, B.F.: A unified framework for some inexact proximal point algorithms. Numerical functional analysis and optimization 22(7-8), 1013–1035 (2001)
[50] Solodov, M.V., Svaiter, B.F.: A unified framework for some inexact proximal point algorithms. Numerical functional analysis and optimization 22(7-8), 1013–1035 (2001)
[51] Tomioka, R., Suzuki, T., Sugiyama, M.: Super-linear convergence of dual augmented lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research 12(5) (2011)
[52] Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4), 2057–2075 (2014)
[53] Yang, T., Lin, Q.: Rsg: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research 19(1), 236–268 (2018)
[54] Yankelevsky, Y., Elad, M.: Dual graph regularized dictionary learning. IEEE Transactions on Signal and Information Processing over Networks 2(4), 611–624 (2016)

8 Appendix

Proof of Lemma 2.1.

By using $\gamma$ -HG, we get:

$\displaystyle F_{\mu}(x)-F^{*}$	$\displaystyle=\min_{z}F(z)-F^{*}+\frac{1}{2\mu}\lVert z-x\rVert^{2}$
	$\displaystyle\geq\min_{z}\sigma_{F}\text{dist}_{X^{*}}^{\gamma}(z)+\frac{1}{2\mu}\lVert z-x\rVert^{2}$
	$\displaystyle=\min_{z,y\in X^{*}}\sigma_{F}\lVert z-y\rVert^{\gamma}+\frac{1}{2\mu}\lVert z-x\rVert^{2}.$	(18)

The solution of (18) in $z$ , denoted as $z(x)$ satisfies the following optimality condition: $\sigma_{F}\gamma\frac{z(x)-y}{\lVert z(x)-y\rVert^{2-\gamma}}+\frac{1}{\mu}\left(z(x)-x\right)=0$ , which simply implies that

\displaystyle z(x)=\frac{\lVert z-y\rVert^{2-\gamma}}{\lVert z-y\rVert^{2-\gamma}+\sigma_{F}\mu\gamma}x+\frac{\sigma_{F}\mu\gamma}{\lVert z-y\rVert^{2-\gamma}+\sigma_{F}\mu\gamma}y.

(19)

$(i)$ For a function with sharp minima ( $\gamma=1$ ) it is easy to see that $(z(x)-y)\left[1+\frac{\sigma_{F}\mu}{\lVert z(x)-y\rVert}\right]=x-y$ . Taking norms on both sides gives $\lVert z(x)-y\rVert=\max\{0,\lVert x-y\rVert-\sigma_{F}\mu\}$ and (19) becomes:

z(x)=\begin{cases}y+\left(1-\frac{\sigma_{F}\mu}{\lVert x-y\rVert}\right)(x-y),&\lVert x-y\rVert>\sigma_{F}\mu\\ y,&\lVert x-y\rVert\leq\sigma_{F}\mu\end{cases}

By replacing this form of $z(x)$ into (18) we obtain our first result:

	$\displaystyle F_{\mu}(x)-F^{*}$	$\displaystyle\geq\min_{y\in X^{*}}\begin{cases}\sigma_{F}\lVert x-y\rVert-\frac{\mu\sigma_{F}^{2}}{2},&\lVert x-y\rVert>\sigma_{F}\mu\\ \frac{1}{2\mu}\lVert y-x\rVert^{2},&\lVert x-y\rVert\leq\sigma_{F}\mu\end{cases}$
		$\displaystyle\geq\begin{cases}\sigma_{F}\text{dist}_{X^{*}}(x)-\frac{\mu\sigma^{2}_{F}}{2},&\text{dist}_{X^{*}}(x)>\sigma_{F}\mu\\ \frac{1}{2\mu}\text{dist}_{X^{*}}(x)^{2},&\text{dist}_{X^{*}}(x)\leq\sigma_{F}\mu\end{cases}.$
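The Huber-type bound of case $(i)$ can be verified on the simplest sharp function. The sketch below assumes the univariate objective $F(x)=\sigma|x|$ , for which $X^{*}=\{0\}$ and $F^{*}=0$ (a toy instance chosen here, with hypothetical function names); it evaluates the Moreau envelope through the closed-form $z(x)$ and compares it with the right-hand side of the bound, which in this special case holds with equality.

```python
import math


def moreau_env_abs(x, mu, sigma):
    """Moreau envelope of F(z) = sigma*|z| at x, using the closed-form prox z(x)."""
    z = math.copysign(max(abs(x) - mu * sigma, 0.0), x)
    return sigma * abs(z) + (z - x) ** 2 / (2.0 * mu)


def huber_lower_bound(dist, mu, sigma):
    """Right-hand side of the bound in case (i) as a function of dist_{X*}(x)."""
    if dist > sigma * mu:
        return sigma * dist - mu * sigma ** 2 / 2.0
    return dist ** 2 / (2.0 * mu)


for x in (3.0, 1.0, -4.0):
    print(x, moreau_env_abs(x, 2.0, 1.0), huber_lower_bound(abs(x), 2.0, 1.0))
```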

$(ii)$ For quadratic growth, (19) reduces to $z(x)=\frac{1}{1+2\sigma_{F}\mu}x+\frac{2\sigma_{F}\mu}{1+2\sigma_{F}\mu}y$ and by (18) it leads to:

\displaystyle F_{\mu}(x)-F^{*}

\displaystyle\geq\min_{y\in X^{*}}\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\lVert y-x\rVert^{2}=\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\text{dist}_{X^{*}}^{2}(x).
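For completeness, the substitution behind this bound can be spelled out as follows. This is only a routine verification, where $d=\lVert x-y\rVert$ is a shorthand introduced here for readability.

```latex
\begin{align*}
\lVert z(x)-y\rVert &= \frac{d}{1+2\sigma_{F}\mu}, \qquad
\lVert z(x)-x\rVert = \frac{2\sigma_{F}\mu\, d}{1+2\sigma_{F}\mu},\\
\sigma_{F}\lVert z(x)-y\rVert^{2}+\frac{1}{2\mu}\lVert z(x)-x\rVert^{2}
&= \frac{\sigma_{F}d^{2}+2\sigma_{F}^{2}\mu\, d^{2}}{(1+2\sigma_{F}\mu)^{2}}
= \frac{\sigma_{F}}{1+2\sigma_{F}\mu}\,d^{2}.
\end{align*}
```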

$(iii)$ Lastly, from (19) we see that $z(x)$ lies on the segment $[x,y]$ , i.e. $z(x)=\lambda x+(1-\lambda)y$ for certain $\lambda\in[0,1]$ . Using this argument into (18) we equivalently have:

	$\displaystyle F_{\mu}(x)-F^{*}$	$\displaystyle\geq\min_{y\in X^{*},\lambda\in[0,1],z=\lambda x+(1-\lambda)y}\sigma_{F}\lVert z-y\rVert^{\gamma}+\frac{1}{2\mu}\lVert z-x\rVert^{2}$
		$\displaystyle=\min_{y\in X^{*},\lambda\in[0,1]}\sigma_{F}\lambda^{\gamma}\lVert x-y\rVert^{\gamma}+\frac{(1-\lambda)^{2}}{2\mu}\lVert x-y\rVert^{2}$
		$\displaystyle=\min_{\lambda\in[0,1]}\sigma_{F}\lambda^{\gamma}\text{dist}^{\gamma}_{X^{*}}(x)+\frac{(1-\lambda)^{2}}{2\mu}\text{dist}^{2}_{X^{*}}(x)$
		$\displaystyle\geq\min\left\{\sigma_{F}\text{dist}^{\gamma}_{X^{*}}(x),\frac{1}{2\mu}\text{dist}^{2}_{X^{*}}(x)\right\}\min_{\lambda\in[0,1]}\left[\lambda^{\gamma}+(1-\lambda)^{2}\right].$

∎
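The last factor of the bound in case $(iii)$ can be evaluated numerically. The sketch below assumes that $\varphi(\gamma)$ denotes $\min_{\lambda\in[0,1]}\lambda^{\gamma}+(1-\lambda)^{2}$ and approximates it on a uniform grid; the helper name `phi` and the grid size are hypothetical choices.

```python
def phi(gamma, grid=20001):
    """Approximate min over lam in [0, 1] of lam**gamma + (1 - lam)**2 on a uniform grid."""
    best = float("inf")
    for i in range(grid):
        lam = i / (grid - 1)
        best = min(best, lam ** gamma + (1.0 - lam) ** 2)
    return best

for gamma in [1.0, 1.5, 2.0, 3.0]:
    print(gamma, phi(gamma))
```

For $\gamma=1$ and $\gamma=2$ the minimizer is $\lambda=1/2$, which lies on the grid, so the sketch reproduces the values $3/4$ and $1/2$ exactly.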

Proof of Lemma 3.1.

Assume $\lVert\nabla_{\delta}F_{\mu}(x)\rVert\leq\epsilon$ . On one hand, the triangle inequality gives: $\lVert\nabla F_{\mu}(x)\rVert\leq\lVert\nabla_{\delta}F_{\mu}(x)\rVert+\frac{\delta}{\mu}\leq\epsilon+\frac{\delta}{\mu}=:\tilde{\epsilon}$ . On the other hand, one can derive:

\displaystyle\lVert\nabla F_{\mu}(x)\rVert\leq\sqrt{\frac{2}{\mu}[F_{\mu}(x)-F^{*}]}

\displaystyle\leq\frac{1}{\mu}\text{dist}_{X^{*}}(x).

(20)

Now let $\gamma=1$ . Based on $\nabla F_{\mu}(x)\in\partial F(\text{prox}_{\mu}^{F}(x))$ , for nonoptimal $\text{prox}_{\mu}^{F}(x)$ the bound (6) guarantees $\lVert\nabla F_{\mu}(x)\rVert=\lVert F^{\prime}(\text{prox}_{\mu}^{F}(x))\rVert\geq\sigma_{F}$ . Therefore, if $x$ yields $\tilde{\epsilon}<\sigma_{F}$ , and implicitly $\lVert\nabla F_{\mu}(x)\rVert<\sigma_{F}$ , the contradiction with the previous lower bound imposes that $\text{prox}_{\mu}^{F}(x)\in X^{*}$ . Moreover, in this case, since $\lVert x-\text{prox}_{\mu}^{F}(x)\rVert=\mu\lVert\nabla F_{\mu}(x)\rVert\overset{(20)}{\leq}\text{dist}_{X^{*}}(x)$ , then $\text{prox}_{\mu}^{F}(x)=\pi_{X^{*}}(x)$ . In summary, a sufficiently small approximate gradient norm $\lVert\nabla_{\delta}F_{\mu}(x)\rVert<\sigma_{F}-\frac{\delta}{\mu}$ also confirms a small distance to the optimal set: $\text{dist}_{X^{*}}(x)\leq\mu\tilde{\epsilon}$ .

Let $\gamma>1$ . Then our assumption $\lVert\nabla F_{\mu}(x)\rVert\leq\tilde{\epsilon}$ , (20) and (5) imply the error bound: $\;\;\text{dist}_{X^{*}}(x)\leq\max\left\{\left[\frac{\tilde{\epsilon}}{\sigma_{F}\varphi(\gamma)}\right]^{\frac{1}{\gamma-1}},\frac{2\mu\tilde{\epsilon}}{\varphi(\gamma)}\right\}$ , which confirms the last result. ∎

Proof of Corollary 4.2.

From Theorem 4.1 $(i)$ we obtain directly:

\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\max\left\{\text{dist}_{X^{*}}(x^{0})-k(\mu\sigma_{F}-\delta),\delta\right\},

which means that after at most $K$ iterations $x^{k}$ reaches $\text{dist}_{X^{*}}(x^{K})\leq\delta<\mu\sigma_{F}$ . Lastly, the same reasoning as in Lemma 3.1, based on the relations (20) and (6), leads to $\text{prox}_{\mu}^{F}(x^{K})=\pi_{X^{*}}(x^{K})$ . ∎

Proof of Corollary 4.3.

The proofs of the first two estimates follow immediately from Theorem 4.1 $(i)$ and $(ii)$ . For $\gamma\in(1,2)$ , we substitute $\alpha=\frac{\mu\varphi(\gamma)\sigma_{F}}{2},\beta=\frac{1+\sqrt{1-\sigma_{F}}}{2}$ into Corollary 8.6 and obtain an estimate for our exact case. To refine the complexity order, we majorize some constants by using: $(2\beta\mu\sigma_{F})^{\frac{1}{2-\gamma}}\leq(2\mu\sigma_{F})^{\frac{1}{2-\gamma}}$ and $1-\sqrt{1-\varphi(\gamma)}<1$ . For $\gamma>2$ , we use the same $\alpha$ as before and $\hat{\beta}=\max\left\{\frac{1+\sqrt{1-\sigma_{F}}}{2},1-\frac{1}{\gamma-1}\right\}$ in Corollary 8.6 to get the last estimate. ∎

Proof of Corollary 4.4.

The proof for the first two estimates can be derived immediately from Theorem 4.1 $(i)$ and $(ii)$ . We provide details for the other two cases.

For $\gamma\in(1,2)$ we use the same notations as in the proof of Theorem 4.1 (given in the appendix). There, the key functions which decide the decrease rate of $\text{dist}_{X^{*}}(x^{k})$ are the nondecreasing function $h$ and accuracy $\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\mu\sigma_{F}\varphi(\gamma)}\right)^{\frac{1}{\gamma-1}},\frac{2\delta_{k}}{1-\sqrt{1-\varphi(\gamma)}}\right\}$ . First recall that

\displaystyle\frac{1}{2}=\varphi(2)\leq\varphi(\gamma)\leq\varphi(1)=\frac{3}{4}.

(21)

which implies that for any $\delta\geq 0$

\displaystyle\frac{\delta}{2}\overset{(21)}{\leq}\frac{1+\sqrt{1-\varphi(\gamma)}}{2}\delta\leq h(\delta).

(22)

Recalling that $\bar{\delta}_{k}=\max\left\{\hat{\delta}_{k},h(\bar{\delta}_{k-1})\right\}$ , then by Theorem 4.1 $(iii)$ we have:

\displaystyle\text{dist}_{X^{*}}(x^{k})

\displaystyle\leq\max\left\{h^{(k)}(\text{dist}_{X^{*}}(x^{0})),\bar{\delta}_{k}\right\}

(23)

By taking $\delta_{k}=\frac{\delta_{k-1}}{2}$ then $\hat{\delta}_{k}=\max\left\{\frac{1}{2^{\frac{1}{\gamma-1}}}\left(\frac{2\delta_{k-1}}{\mu\sigma_{F}\varphi(\gamma)}\right)^{\frac{1}{\gamma-1}},\frac{1}{2}\frac{2\delta_{k-1}}{1-\sqrt{1-\varphi(\gamma)}}\right\}\leq\frac{\hat{\delta}_{k-1}}{2}\overset{(22)}{\leq}h(\hat{\delta}_{k-1})$ . By this recurrence, the monotonicity of $h$ and $\hat{\delta}_{0}=\bar{\delta}_{0}$ , we derive:

	$\displaystyle\bar{\delta}_{k}$	$\displaystyle\leq\max\left\{h(\hat{\delta}_{k-1}),h(\bar{\delta}_{k-1})\right\}=h(\bar{\delta}_{k-1})$
		$\displaystyle=h^{(k)}(\hat{\delta}_{0}).$

Finally this key bound enters into (23) and we get:

	$\displaystyle\text{dist}_{X^{*}}(x^{k})$	$\displaystyle\leq\max\left\{h^{(k)}\left(\text{dist}_{X^{*}}(x^{0})\right),h^{(k)}\left(\hat{\delta}_{0}\right)\right\}$
		$\displaystyle\leq h^{(k)}\left(\max\left\{\text{dist}_{X^{*}}(x^{0}),\hat{\delta}_{0}\right\}\right),$

where for the last inequality we used the fact that, since $h$ is nondecreasing, $h^{(k)}$ is also nondecreasing. Finally, by applying Theorem 8.5 we get our result. Now let $\gamma>2$ . By redefining $h$ as in Theorem 4.1, observe that

\displaystyle\delta\left(1-\frac{1}{\gamma-1}\right)\leq h(\delta).

(24)

Take $\delta_{k}=\left(1-\frac{1}{\gamma-1}\right)^{\gamma-1}\delta_{k-1}$ then

	$\displaystyle\hat{\delta}_{k}$	$\displaystyle=\max\left\{\left(1-\frac{1}{\gamma-1}\right)\left(\frac{2\delta_{k-1}}{\mu\sigma_{F}\varphi(\gamma)}\right)^{\frac{1}{\gamma-1}},\left(1-\frac{1}{\gamma-1}\right)^{\gamma-1}\frac{2\delta_{k-1}}{1-\sqrt{1-\varphi(\gamma)}}\right\}$
		$\displaystyle\leq\left(1-\frac{1}{\gamma-1}\right)\hat{\delta}_{k-1}\overset{(24)}{\leq}h(\hat{\delta}_{k-1}).$

We have shown in the proof of Theorem 4.1 that this variant of $h$ is also nondecreasing and thus, using the same reasoning as in the case $\gamma\in(1,2)$ , we obtain the above result. ∎

Proof of Theorem 5.1.

In this proof we use the notation $K_{t}$ for the number of iterations in the $t-$ th epoch, large enough for the stopping criterion to be satisfied. We denote by $x^{k,t}$ the $k-$ th IPPA iterate during the $t-$ th epoch. From Theorem 8.3 (in Appendix B) we have that

\displaystyle K_{t}=\left\lceil\frac{\text{dist}_{X^{*}}(x^{t})}{\delta_{t}}\right\rceil

(25)

iterations are sufficient to guarantee $\lVert\nabla_{\delta_{t}}F_{\mu_{t}}(\hat{x}^{K_{t},t})\rVert\leq 5\delta_{t}^{\nabla}$ and thus the end of $t-$ th epoch. Furthermore, by the triangle inequality

\displaystyle\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert-\delta_{t}^{\nabla}\leq\lVert\nabla_{\delta_{t}}F_{\mu_{t}}(x^{t+1})\rVert\leq 5\delta_{t}^{\nabla},

(26)

which implies that at the end of the $t-$ th epoch we also have $\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq 6\delta_{t}^{\nabla}$ .

Let WSM hold and recall the assumption $\text{dist}_{X^{*}}(x^{0})\geq\mu_{0}\sigma_{F}$ . For sufficiently large $t$ we show that restarting no longer has any effect and that after a single iteration the stopping criterion of epoch $t$ is satisfied. We separate the analysis into two stages: the first stage covers the epochs that produce $x^{t+1}$ satisfying $\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert>\sigma_{F}$ . The second one covers the remaining epochs, when the gradient norms decrease below the threshold $\sigma_{F}$ , i.e. $\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq\sigma_{F}$ .

In the first stage, the stopping rule $\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq\delta_{t}^{\nabla}$ limits the first stage to maximum $T_{1}=\left\lceil\frac{1}{\rho}\log\left(\frac{12\delta_{0}}{\sigma_{F}}\right)\right\rceil$ epochs. The total number of iterations in this stage is bounded by: $\sum_{t=0}^{T_{1}}K_{t}\leq T_{1}K_{0}$ .

For the second stage when $\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert<\sigma_{F}$ , Lemma 3.1 states that $\text{prox}_{\mu,F}(x^{t})=\pi_{X^{*}}(x^{t})$ and thus we have: $\;\;\text{dist}_{X^{*}}(x^{t})=\mu_{t-1}\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert\leq\mu_{t-1}\delta_{t-1}^{\nabla}=\delta_{t-1}<2\mu_{t-1}\sigma_{F}=\mu_{t}\sigma_{F}.$ Therefore, by Theorem 4.1:

	$\displaystyle\text{dist}_{X^{*}}(x^{t+1})$	$\displaystyle\leq\max\left\{\text{dist}_{X^{*}}(x^{t})-K_{t}(\mu_{t}\sigma_{F}-\delta_{t}),\delta_{t}\right\}$
		$\displaystyle\leq\max\left\{\mu_{t}\sigma_{F}-K_{t}(\mu_{t}\sigma_{F}-\delta_{t}),\delta_{t}\right\}$

which means that after a single iteration, i.e. $K_{t}=1$ , it is guaranteed that $\text{dist}_{X^{*}}(x^{t+1})\leq\delta_{t}$ . In this phase, the output of IPPA is in fact the only point produced in $t$ -th epoch and the necessary number of epochs (or equivalently the number of IPPA iterations) is $T_{2}=\mathcal{O}\left(\frac{1}{\rho-1}\log\left(\frac{\delta_{T_{1}}}{\epsilon}\right)\right)$ .

Let $\gamma=2$ . Lemma 2.1 $(ii)$ and the convexity of $F_{\mu}$ yield $\frac{\sigma_{F}}{1+2\sigma_{F}\mu}\text{dist}_{X^{*}}^{2}(x)\leq\langle\nabla F_{\mu}(x),x-\pi_{X^{*}}(x)\rangle\leq\lVert\nabla F_{\mu}(x)\rVert\text{dist}_{X^{*}}(x)$ . By using the inequality formed by the first and last terms, together with the relation (26), then at the end of epoch $t-1$ :

\displaystyle\text{dist}_{X^{*}}(x^{t})\leq\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert\left(\frac{1}{\mu_{t-1}\sigma_{F}}+2\right)\leq 6\delta_{t-1}\left(\frac{1}{\mu_{t-1}\sigma_{F}}+2\right),

suggesting that the necessary number of epochs is $T=\mathcal{O}\left(\frac{1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right)\right).$ This fact allows us to refine $K_{t}$ in (25) as $K_{t}=\left\lceil 3\cdot 2^{\rho}\left(2+\frac{1}{\mu_{t-1}\sigma_{F}}\right)\right\rceil\quad\forall t\geq 1.$ Since $K_{t}$ is bounded, the total number of IPPA iterations has the order $\sum_{t=0}^{T-1}K_{t}=K_{0}+\mathcal{O}\left(T\right).$

Let $\gamma>1$ . Similarly to the previous two cases, $\lVert\nabla F_{\mu_{t-1}}(x^{t})\rVert\leq 6\delta_{t}^{\nabla}$ guaranteed by epoch $t-1$ further implies $\;\;\text{dist}_{X^{*}}(x^{t})\overset{\text{Lemma 3.1}}{\leq}\max\left\{\frac{12\delta_{t-1}}{\varphi(\gamma)},\left[\frac{6\delta_{t-1}^{\nabla}}{\varphi(\gamma)\sigma_{F}}\right]^{\frac{1}{\gamma-1}}\right\},$ which suggests that the maximal number of epochs is

\displaystyle T=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log\left(\frac{\delta_{0}}{\epsilon}\right),\frac{1}{\rho-1}\log\left(\frac{\mu_{0}\delta_{0}}{\epsilon}\right)\right\}\right).

Now, $K_{t}$ of (25) becomes: $\;K_{t}=\left\lceil\max\left\{\frac{3\cdot 2^{\rho+1}}{\varphi(\gamma)},D2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]+\frac{\rho}{\gamma-1}}\right\}\right\rceil$ for all $t\geq 1,$ where $D=\left(\frac{6\delta_{0}}{\sigma_{F}\varphi(\gamma)}\right)^{2/\gamma}\frac{1}{\mu_{0}\delta_{0}}$ . For $\gamma\leq 2$ , $K_{t}$ is bounded, thus for $\gamma>2$ we further estimate the total number of IPPA iterations by summing:

	$\displaystyle\sum_{t=0}^{T_{1}}K_{t}$	$\displaystyle=K_{0}+\sum_{t=1}^{T_{1}}\left\lceil\max\left\{\frac{3\cdot 2^{\rho+1}}{\varphi(\gamma)},D2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]+\frac{\rho}{\gamma-1}}\right\}\right\rceil$		(27)
		$\displaystyle\leq K_{0}+T_{1}+\max\left\{\frac{3\cdot 2^{\rho+1}}{\varphi(\gamma)}T_{1},D2^{\frac{\rho}{\gamma-1}}\sum_{t=1}^{T}2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]}\right\}$

Finally, $\sum_{t=1}^{T}2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}\right]}=\mathcal{O}\left(\left(\frac{\delta_{0}}{\epsilon}\right)^{\max\left\{\left(1-\frac{1}{\rho}\right)(\gamma-1)-1,1-\frac{\rho}{(\rho-1)(\gamma-1)}\right\}}\right)$ .

∎

Proof of Theorem 6.1.

We keep the same notations as in the proof of Theorem 5.1 and redenote $\delta_{t}^{\nabla}:=\delta_{t}$ . By assumption $\delta_{0}\geq 2L_{f}$ we observe that $(4\alpha_{0}\mu_{0}L_{f}^{2})^{\frac{1}{2}}\leq 2\mu_{0}L_{f}\leq\mu_{0}\delta_{0}$ . Since $\alpha_{t}=\alpha_{0}2^{-(2\rho-1)t}$ , the inequality $4\alpha_{t}\mu_{t}L_{f}^{2}\leq(\delta_{t})^{2}$ recursively holds for all $t\geq 0$ . This last inequality allows Theorem 8.8 to establish that in the $t-$ th epoch it suffices to run:

	$\displaystyle[t=0]$	$\displaystyle\quad N_{0}=\left\lceil 8\log\left(\frac{\lVert\nabla F_{\mu_{0}}(x^{0})\rVert}{\delta_{0}}\right)\right\rceil$
	$\displaystyle[t>0]$	$\displaystyle\quad N_{t}=\left\lceil 4\cdot 2^{2\rho t}\log\left(\frac{\mu_{t-1}\delta_{t-1}}{\mu_{t}\delta_{t}}\right)\right\rceil=\left\lceil 4(\rho-1)2^{2\rho t}\right\rceil$

PsGM iterations. Lastly, we compute the total computational cost by summing over all $N_{t}$ . Recall that at the end of $t-$ th epoch RIPPA guarantees that $\lVert\nabla F_{\mu_{t}}(x^{t+1})\rVert\leq\delta_{t}^{\nabla}$ . After a number of epochs of $T=\left\lceil\frac{1}{\rho}\log\left(\frac{\delta_{0}}{\sigma_{F}}\right)\right\rceil$ , measuring a total number of:

	$\displaystyle\mathcal{T}_{1}=\sum\limits_{t=0}^{T_{1}-1}N_{t}K_{t}$	$\displaystyle=N_{0}K_{0}+K_{0}\sum\limits_{t=1}^{T_{1}-1}\left\lceil 4(\rho-1)2^{2\rho t}\right\rceil$
		$\displaystyle=\mathcal{O}\left(K_{0}2^{2\rho T_{1}}\right)=\mathcal{O}\left(K_{0}\left(\frac{\delta_{0}}{\sigma_{F}}\right)^{2}\right).$

PsGM iterations, then one has $x^{T}$ satisfying $\text{dist}_{X^{*}}(x^{T})\leq\mu_{T-1}\delta_{T}\leq\mu_{T-1}\sigma_{F}$ . Now we evaluate the final postprocessing loop, which aims to bring the iterate $x^{T}$ into the $\epsilon-$ suboptimality region. By Theorem 8.8, a single call of $PsGM\left(x^{T},x^{T},\beta_{0},\mu_{T-1},\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil\right)$ guarantees that $\text{dist}_{X^{*}}(\tilde{x}^{1})\leq\frac{\beta_{0}L_{f}^{2}}{2\sigma_{F}}\leq\frac{\beta_{0}L_{f}^{2}}{2\delta_{T}}=\frac{\delta_{T}}{2}$ . In general, if $\text{dist}_{X^{*}}(\tilde{x}^{k})\leq\frac{\beta_{k-1}L_{f}^{2}}{2\sigma_{F}}\leq\frac{\beta_{k-1}L_{f}^{2}}{2\delta_{T}}$ , Theorem 8.8 specifies that after $\left\lceil 2\left(\frac{\text{dist}_{X^{*}}(\tilde{x}^{k})}{\beta_{k}L_{f}}\right)^{2}\right\rceil\overset{\text{Theorem}\;\ref{th:Holder_gradients}}{\leq}\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil$ routine iterations, the output $\tilde{x}^{k+1}$ satisfies $\text{dist}_{X^{*}}(\tilde{x}^{k+1})\leq\frac{\beta_{k}L_{f}^{2}}{2\sigma_{F}}\leq\frac{\beta_{k}L_{f}^{2}}{2\delta_{T}}$ . Therefore, by setting $N=\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil,K=\lceil\log(\delta_{T}/\epsilon)\rceil$ and by running the final procedure Postprocessing( $x^{T},\delta_{T}^{2}/L_{f}^{2},\mu_{T-1},N,K$ ), the ”second phase” loop produces $\text{dist}_{X^{*}}(x^{K})\leq\epsilon$ . 
The total cost of the ”second phase” loop can be computed by $\mathcal{T}_{2}=KN=\left\lceil 2\left(\frac{L_{f}}{\delta_{T}}\right)^{2}\right\rceil\left\lceil\log\left(\frac{\delta_{T}}{\epsilon}\right)\right\rceil.$ Lastly, by taking into account that $\delta_{t}=\mathcal{O}\left(\sigma_{F}\right)$ , then the total complexity has the order: $\;\mathcal{T}_{1}+\mathcal{T}_{2}=\mathcal{O}\left(K_{0}\left(\frac{\delta_{0}}{\sigma_{F}}\right)^{2}+\left(\frac{L_{f}}{\sigma_{F}}\right)^{2}\log\left(\frac{\sigma_{F}}{\epsilon}\right)\right),$ which confirms the first part of the above result.

Now we prove the second part of the result. By assumption $\delta_{0}\geq(2L_{f}^{2})^{\frac{1}{2(1-\nu)}}\mu_{0}^{\frac{\nu}{1-\nu}}$ we observe that:

\displaystyle(4\alpha_{0}\mu_{0}L_{f}^{2})^{\frac{1}{2(1-\nu)}}\leq\mu_{0}\delta_{0}.

(28)

Further we show that, for appropriate stepsize choices $\alpha_{t}$ , the inequality $4\alpha_{t}\mu_{t}L_{f}^{2}\leq(\mu_{t}\delta_{t})^{2(1-\nu)}$ recursively holds for all $t\geq 0$ under initial condition (28). Indeed let $2(\rho-1)(1-\nu)\leq q-1$ , then

\displaystyle 4\alpha_{t}\mu_{t}L_{f}^{2}

\displaystyle=\frac{2\mu_{0}^{2}L_{f}^{2}}{2^{(q-1)t}}\overset{(28)}{\leq}\frac{(\mu_{0}\delta_{0})^{2(1-\nu)}}{2^{2(\rho-1)(1-\nu)t}}=(\mu_{t}\delta_{t})^{2(1-\nu)}.

(29)

The inequality (29) allows Theorem 8.8 to establish the necessary inner complexity for each IPPA iteration. By using the bounds from Theorem 8.8, in the $t-$ th epoch it suffices to run:

	$\displaystyle[t=0]$	$\displaystyle\quad N_{0}=\left\lceil 8\log\left(\frac{\lVert\nabla F_{\mu_{0}}(x^{0})\rVert}{\delta_{0}}\right)\right\rceil$
	$\displaystyle[t>0]$	$\displaystyle\quad N_{t}=\left\lceil 4\cdot 2^{(q+1)t}\log\left(\frac{\mu_{t-1}\delta_{t-1}}{\mu_{t}\delta_{t}}\right)\right\rceil=\left\lceil 4(\rho-1)2^{(q+1)t}\right\rceil$

PsGM iterations. We still keep the same notations from Theorem 5.1.

Let $\gamma>1$ and recall $T=\mathcal{O}\left(\max\left\{\frac{\gamma-1}{\rho}\log(\mu_{0}\delta_{0}/\epsilon),\frac{1}{\rho-1}\log(\mu_{0}\delta_{0}/\epsilon)\right\}\right).$ By following a similar reasoning, we require:

	$\displaystyle\sum\limits_{t=0}^{T-1}N_{t}K_{t}$	$\displaystyle=N_{0}K_{0}+\mathcal{O}\left(\sum_{t=1}^{T}2^{t\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]}\right)$
		$\displaystyle=N_{0}K_{0}+\mathcal{O}\left(2^{T\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]}\right).$

Let $\zeta=\frac{\rho-1}{\rho}(\gamma-1)\geq 1$ , then the exponent of the last term becomes:

	$\displaystyle T\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]$	$\displaystyle=\max\left\{\frac{\gamma-1}{\rho},\frac{1}{\rho-1}\right\}\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]$
		$\displaystyle=\gamma-2+\frac{q}{\rho}(\gamma-1).$

Otherwise, if $\zeta<1$ then the respective exponent turns into: $T\left[(\rho-1)-\frac{\rho}{\gamma-1}+q+1\right]=\frac{\rho}{\rho-1}\frac{\gamma-2}{\gamma-1}+\frac{q}{\rho-1}.$

∎

8.1 Appendix B

Lemma 8.1.

Let the sequence $\{u_{k}\}_{k\geq 0}$ satisfy: $u_{k+1}\leq\alpha_{k}u_{k}+\beta_{k},$ where $\alpha_{k}\in[0,1),\{\beta_{k}\}_{k\geq 0}$ nonincreasing and $\sum\limits_{i=0}^{\infty}\beta_{i}\leq\Gamma$ . Then the following bound holds:

\displaystyle u_{k}\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\max\limits_{\lceil k/2\rceil+1\leq i\leq k}\frac{\beta_{i}}{1-\alpha_{i}}.

Moreover, if $\alpha_{k}=\alpha\in[0,1)$ then: $\;\;u_{k}\leq\alpha^{(k-4)/2}(u_{0}+\Gamma)+\frac{\beta_{\lceil k/2\rceil+1}}{1-\alpha}.$

Proof of Lemma 8.1.

By using a simple induction we get:

	$\displaystyle u_{k+1}$	$\displaystyle\leq\alpha_{k}u_{k}+\beta_{k}\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\sum\limits_{i=0}^{k}\beta_{i}\prod\limits_{j=i+1}^{k}\alpha_{j}$
		$\displaystyle=u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\sum\limits_{i=0}^{\lceil k/2\rceil}\beta_{i}\prod\limits_{j=i+1}^{k}\alpha_{j}+\sum\limits_{i=\lceil k/2\rceil+1}^{k}\beta_{i}\prod\limits_{j=i+1}^{k}\alpha_{j}$

	$\displaystyle\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\sum\limits_{i=\lceil k/2\rceil+1}^{k}\frac{\beta_{i}}{1-\alpha_{i}}(1-\alpha_{i})\prod\limits_{j=i+1}^{k}\alpha_{j}$
	$\displaystyle\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\max\limits_{\lceil k/2\rceil+1\leq i\leq k}\frac{\beta_{i}}{1-\alpha_{i}}\sum\limits_{i=\lceil k/2\rceil+1}^{k}(1-\alpha_{i})\prod\limits_{j=i+1}^{k}\alpha_{j}$
	$\displaystyle\leq u_{0}\prod\limits_{j=0}^{k}\alpha_{j}+\Gamma\prod\limits_{j=\lceil k/2\rceil+1}^{k}\alpha_{j}+\max\limits_{\lceil k/2\rceil+1\leq i\leq k}\frac{\beta_{i}}{1-\alpha_{i}}.$

∎
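The chain of inequalities in this proof can be checked numerically. The sketch below runs the recurrence with equality and compares $u_{k+1}$ against the final three-term bound, read (as in the proof) as a bound on $u_{k+1}$ and evaluated for $k\geq 2$ so that the index range of the maximum is nonempty. The sequences `alphas` and `betas`, the value $\Gamma=2$ and the helper names are illustrative assumptions.

```python
import math

def simulate(u0, alphas, betas):
    """Run u_{k+1} = alpha_k * u_k + beta_k (the recurrence taken with equality)."""
    u = [u0]
    for a, b in zip(alphas, betas):
        u.append(a * u[-1] + b)
    return u

def three_term_bound(u0, alphas, betas, gamma_sum, k):
    """Final bound in the proof of Lemma 8.1, used here as a bound on u_{k+1}, k >= 2."""
    m = math.ceil(k / 2) + 1
    head = u0 * math.prod(alphas[: k + 1])
    middle = gamma_sum * math.prod(alphas[m : k + 1])
    tail = max(betas[i] / (1.0 - alphas[i]) for i in range(m, k + 1))
    return head + middle + tail

n = 40
alphas = [0.3 + 0.05 * (j % 10) for j in range(n)]  # hypothetical factors in [0.3, 0.75]
betas = [0.5 ** i for i in range(n)]                # nonincreasing, sum bounded by 2
u = simulate(5.0, alphas, betas)
violations = [k for k in range(2, n)
              if u[k + 1] > three_term_bound(5.0, alphas, betas, 2.0, k) + 1e-12]
print("violations:", violations)
```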

Lemma 8.2.

Let $a_{1}\geq\cdots\geq a_{n}$ be $n$ real numbers, then the following relation holds:

\displaystyle\max\left\{0,a_{n},a_{n}+a_{n-1},\cdots,\sum\limits_{j=1}^{n}a_{j}\right\}=\max\left\{0,\sum\limits_{j=1}^{n}a_{j}\right\}.

Proof.

Since we have:

\max\left\{0,a_{n},\cdots,\sum\limits_{j=1}^{n}a_{j}\right\}=\max\left\{\max\{0,a_{n}\},\cdots,\max\left\{0,\sum\limits_{j=1}^{n}a_{j}\right\}\right\},

then it is sufficient to show that for any positive $k$ :

\displaystyle\max\left\{0,\sum\limits_{j=k}^{n}a_{j}\right\}\leq\max\left\{0,\sum\limits_{j=k-1}^{n}a_{j}\right\}.

(30)

Indeed, if $a_{k-1}\geq 0$ then (30) follows immediately. Now consider $a_{k-1}<0$ ; then by monotonicity we have $a_{j}<0$ for all $j>k-1$ and thus $\sum\limits_{j=k}^{n}a_{j}<0$ . In this case it is obvious that $\max\left\{0,\sum\limits_{j=k}^{n}a_{j}\right\}=\max\left\{0,\sum\limits_{j=k-1}^{n}a_{j}\right\}=0$ , which confirms the result. ∎
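The identity of Lemma 8.2 is easy to test on random data. The following sketch, with a hypothetical helper `max_suffix_sums`, uses integer entries so that the comparison is exact, and samples only nonincreasing sequences as the lemma requires.

```python
import random

def max_suffix_sums(a):
    """max{0, a_n, a_n + a_{n-1}, ..., a_n + ... + a_1} for a list a = [a_1, ..., a_n]."""
    best, running = 0, 0
    for value in reversed(a):
        running += value
        best = max(best, running)
    return best

random.seed(0)
for _ in range(1000):
    n = random.randint(1, 8)
    # Sort in decreasing order so that a_1 >= ... >= a_n.
    a = sorted((random.randint(-5, 5) for _ in range(n)), reverse=True)
    assert max_suffix_sums(a) == max(0, sum(a))
print("identity held on all sampled nonincreasing sequences")
```

The ordering assumption matters: for the unsorted list `[-2, 3]` the left-hand side equals 3 while $\max\{0,\sum_j a_j\}$ equals 1.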

Theorem 8.3.

Let $\{x^{k}\}_{k\geq 0}$ be the sequence generated by IPPA with inexactness criterion (9), then the following relation holds:

\displaystyle\text{dist}_{X^{*}}(x^{k+1})

\displaystyle\leq\text{dist}_{X^{*}}(x^{k})-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}(x^{k})}+\delta_{k}.

Moreover, assume constant accuracy $\delta_{k}=\delta$ . Then after at most:

\displaystyle\left\lceil\frac{\text{dist}_{X^{*}}(x^{0})}{\delta}\right\rceil

iterations, a point $\tilde{x}\in\{x^{0},\cdots,x^{k}\}$ satisfies $\lVert\nabla_{\delta}F_{\mu}(\tilde{x})\rVert\leq\frac{4\delta}{\mu}$ and $\text{dist}_{X^{*}}(\tilde{x})\leq\text{dist}_{X^{*}}(x^{0})$ .

Proof.

By convexity of $F$ , for any $z$ we derive:

	$\displaystyle\lVert\text{prox}_{\mu}^{F}(x^{k})-z\rVert^{2}=\lVert x^{k}-z\rVert^{2}+2\langle\text{prox}_{\mu}^{F}(x^{k})-x^{k},x^{k}-z\rangle+\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k}\rVert^{2}$
	$\displaystyle=\lVert x^{k}-z\rVert^{2}-2\mu\langle\nabla F(\text{prox}_{\mu}^{F}(x^{k})),\text{prox}_{\mu}^{F}(x^{k})-z\rangle-\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k}\rVert^{2}$
	$\displaystyle\leq\lVert x^{k}-z\rVert^{2}-2\mu\left(F(\text{prox}_{\mu}^{F}(x^{k}))-F(z)+\frac{1}{2\mu}\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k}\rVert^{2}\right)$
	$\displaystyle=\lVert x^{k}-z\rVert^{2}-2\mu\left(F_{\mu}(x^{k})-F(z)\right).$		(31)

By the triangle inequality we simply derive:

	$\displaystyle\lVert x^{k+1}-z\rVert\leq\lVert\text{prox}_{\mu}^{F}(x^{k})-z\rVert$	$\displaystyle+\lVert\text{prox}_{\mu}^{F}(x^{k})-x^{k+1}\rVert$
		$\displaystyle\leq\lVert\text{prox}_{\mu}^{F}(x^{k})-z\rVert+\delta$		(32)

Finally, by taking $z=\pi_{X^{*}}(x^{k})$ , then:

	$\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\lVert x^{k+1}-\pi_{X^{*}}(x^{k})\rVert\overset{(32)}{\leq}\lVert\text{prox}_{\mu}^{F}(x^{k})-\pi_{X^{*}}(x^{k})\rVert+\delta$
	$\displaystyle\overset{(31)}{\leq}\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(F_{\mu}(x^{k})-F^{*}\right)}+\delta$		(33)
	$\displaystyle\leq\text{dist}_{X^{*}}(x^{k})\sqrt{1-2\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}^{2}(x^{k})}}+\delta$
	$\displaystyle\leq\text{dist}_{X^{*}}(x^{k})\left(1-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}^{2}(x^{k})}\right)+\delta,$

where in the last inequality we used the fact $\sqrt{1-2a}\leq 1-a$ . The last inequality leads to the first part from our result:

\displaystyle\text{dist}_{X^{*}}(x^{k+1})

\displaystyle\leq\text{dist}_{X^{*}}(x^{k})-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}(x^{k})}+\delta_{k}.

(34)

Assume that

\displaystyle\frac{F_{\mu}(x^{0})-F^{*}}{\text{dist}_{X^{*}}(x^{0})}\geq\frac{\delta}{\mu}

(35)

and denote $K=\min\{k\geq 0\;:\;\text{dist}_{X^{*}}(x^{k+1})\geq\text{dist}_{X^{*}}(x^{k})\}$ . Then (34) has two consequences. First, obviously for all $k<K$ :

\displaystyle F_{\mu}(x^{k})-F^{*}\leq\frac{1}{\mu}\left(\text{dist}_{X^{*}}^{2}(x^{k})-\text{dist}_{X^{*}}^{2}(x^{k+1})\right)+\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{0}).

By further summing over the history we obtain:

	$\displaystyle F_{\mu}(\hat{x}^{k})-F^{*}\leq\min\limits_{0\leq i\leq k}F_{\mu}(x^{i})-F^{*}$	$\displaystyle\leq\frac{1}{k+1}\sum\limits_{i=0}^{k}F_{\mu}(x^{i})-F^{*}$
		$\displaystyle\leq\frac{\text{dist}_{X^{*}}^{2}(x^{0})}{\mu(k+1)}+\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{0}).$		(36)

Second, since $K$ is the first iteration at which the residual optimal distance increases, then $\text{dist}_{X^{*}}(x^{K})\leq\text{dist}_{X^{*}}(x^{K-1})\leq\cdots\leq\text{dist}_{X^{*}}(x^{0})$ and (34) guarantees:

\displaystyle F_{\mu}(x^{K})-F^{*}\leq\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{K})\leq\frac{\delta}{\mu}\text{dist}_{X^{*}}(x^{0}).

By unifying both cases we conclude that after at most: $K_{\delta}=\frac{\text{dist}_{X^{*}}(x^{0})}{\delta}$ iterations the threshold: $F_{\mu}(x^{K_{\delta}})-F^{*}\leq\frac{2\delta}{\mu}\text{dist}_{X^{*}}(x^{0})$ is reached. Notice that if (35) does not hold, then $K_{\delta}=0$ .

Now we use the same arguments from [30, Sec. I] to bound the norm of the gradients. Observe that the Lipschitz gradients property of $F_{\mu}$ leads to:

	$\displaystyle F_{\mu}(\hat{x}^{k+1})\leq F_{\mu}(\hat{x}^{k}-\mu\nabla_{\delta}F_{\mu}(\hat{x}^{k}))$
	$\displaystyle\leq F_{\mu}(\hat{x}^{k})-\mu\langle\nabla F_{\mu}(\hat{x}^{k}),\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rangle+\frac{\mu}{2}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}$
	$\displaystyle=F_{\mu}(\hat{x}^{k})+\mu\langle\nabla_{\delta}F_{\mu}(\hat{x}^{k})-\nabla F_{\mu}(\hat{x}^{k}),\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rangle-\frac{\mu}{2}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}$
	$\displaystyle=F_{\mu}(\hat{x}^{k})-\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}+\mu\langle\nabla_{\delta}F_{\mu}(\hat{x}^{k})-\nabla F_{\mu}(\hat{x}^{k}),\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rangle-\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}$
	$\displaystyle\leq F_{\mu}(\hat{x}^{k})-\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}+\mu\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})-\nabla F_{\mu}(\hat{x}^{k})\rVert^{2}$
	$\displaystyle\leq F_{\mu}(\hat{x}^{k})-\frac{\mu}{4}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}+\frac{\delta^{2}}{\mu}\leq F_{\mu}(\hat{x}^{k/2})-\frac{k\mu}{8}\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k/2})\rVert^{2}+\frac{k\delta^{2}}{2\mu}.$		(37)

By using the threshold bound on $F_{\mu}(x^{K_{\delta}})-F^{*}$ derived above in (37), then for $k\geq K_{\delta}$

	$\displaystyle\lVert\nabla_{\delta}F_{\mu}(\hat{x}^{k})\rVert^{2}$	$\displaystyle\leq\frac{4(F_{\mu}(\hat{x}^{k})-F^{*})}{k\mu}+\frac{\delta^{2}}{\mu}\leq\frac{8\text{dist}_{X^{*}}(x^{0})\delta}{k\mu^{2}}+\frac{4\delta^{2}}{\mu^{2}}$
		$\displaystyle\leq\frac{8\delta^{2}}{\mu^{2}}+\frac{4\delta^{2}}{\mu^{2}}=\frac{12\delta^{2}}{\mu^{2}}.$

∎

Lemma 8.4.

Let $\gamma-$ HG hold for the objective function $F$ . Then the IPPA sequence $\{x^{k}\}_{k\geq 0}$ with variable accuracies $\delta_{k}$ satisfies the following recurrences:

$(i)$ Under weak sharp minima $\gamma=1$

\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\max\left\{\text{dist}_{X^{*}}(x^{k})-\mu\sigma_{F},0\right\}+\delta_{k}

$(ii)$ Under quadratic growth $\gamma=2$

\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\frac{1}{\sqrt{1+2\mu\sigma_{F}}}\text{dist}_{X^{*}}(x^{k})+\delta_{k}

$(iii)$ Under general Holderian growth $\gamma\geq 1$

\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\max\left\{\text{dist}_{X^{*}}(x^{k})-\mu\varphi(\gamma)\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x^{k}),\left(1-\frac{\varphi(\gamma)}{2}\right)\text{dist}_{X^{*}}(x^{k})\right\}+\delta_{k},

Proof.

$(i)$ Assume $\text{dist}_{X^{*}}(x^{k})>\sigma_{F}\mu$ then from (the proof of) Theorem 8.3 and Lemma 2.1:

	$\displaystyle\text{dist}_{X^{*}}(x^{k+1})$	$\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(F_{\mu}(x^{k})-F^{*}\right)}+\delta_{k}$
		$\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(\sigma_{F}\text{dist}_{X^{*}}(x^{k})-\frac{\sigma^{2}_{F}\mu}{2}\right)}+\delta_{k}$
		$\displaystyle=\sqrt{\left(\text{dist}_{X^{*}}(x^{k})-\mu\sigma_{F}\right)^{2}}+\delta_{k}=\text{dist}_{X^{*}}(x^{k})-\left(\mu\sigma_{F}-\delta_{k}\right).$

In short,

\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\begin{cases}\text{dist}_{X^{*}}(x^{k})-(\mu\sigma_{F}-\delta_{k}),&\text{if}\;\text{dist}_{X^{*}}(x^{k})>\sigma_{F}\mu\\ \delta_{k},&\text{if}\;\text{dist}_{X^{*}}(x^{k})\leq\sigma_{F}\mu\end{cases}

$(ii)$ By using the same relations in the case $\gamma=2$ , then:

	$\displaystyle\text{dist}_{X^{*}}(x^{k+1})$	$\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-2\mu\left(F_{\mu}(x^{k})-F^{*}\right)}+\delta_{k}$
		$\displaystyle\leq\sqrt{\text{dist}_{X^{*}}^{2}(x^{k})-\frac{2\mu\sigma_{F}}{1+2\mu\sigma_{F}}\text{dist}_{X^{*}}^{2}(x^{k})}+\delta_{k}=\frac{1}{\sqrt{1+2\mu\sigma_{F}}}\text{dist}_{X^{*}}(x^{k})+\delta_{k}.$

$(iii)$ Under Holderian growth, similarly Theorem 8.3 and Lemma 2.1 lead to:

	$\displaystyle\text{dist}_{X^{*}}(x^{k+1})\leq\text{dist}_{X^{*}}(x^{k})-\mu\frac{F_{\mu}(x^{k})-F^{*}}{\text{dist}_{X^{*}}(x^{k})}+\delta_{k}$
	$\displaystyle\leq\text{dist}_{X^{*}}(x^{k})-\mu\varphi(\gamma)\min\left\{\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x^{k}),\frac{1}{2\mu}\text{dist}_{X^{*}}(x^{k})\right\}+\delta_{k}$
	$\displaystyle=\max\left\{\text{dist}_{X^{*}}(x^{k})-\mu\varphi(\gamma)\sigma_{F}\text{dist}_{X^{*}}^{\gamma-1}(x^{k}),(1-\varphi(\gamma)/2)\text{dist}_{X^{*}}(x^{k})\right\}+\delta_{k}.$

∎
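The recurrences $(i)$ and $(ii)$ can be observed on two one-dimensional examples with $X^{*}=\{0\}$: $F(z)=\sigma\lvert z\rvert$ (sharp minimum) and $F(z)=\sigma z^{2}$ (quadratic growth), both with $\sigma_{F}=\sigma$. The sketch below assumes that the inexactness criterion amounts to $\lVert x^{k+1}-\text{prox}_{\mu}^{F}(x^{k})\rVert\leq\delta_{k}$, as used in (32), and models the error as a perturbation of alternating sign; all constants and helper names are hypothetical.

```python
import math

sigma, mu = 1.0, 0.3                            # illustrative constants (assumption)
deltas = [0.05 * 0.5 ** k for k in range(30)]   # hypothetical accuracies delta_k

def prox_abs(x):
    """Exact prox of F(z) = sigma*|z| (sharp minimum at 0)."""
    return math.copysign(max(abs(x) - sigma * mu, 0.0), x)

def prox_sq(x):
    """Exact prox of F(z) = sigma*z**2 (quadratic growth around 0)."""
    return x / (1.0 + 2.0 * sigma * mu)

def run_ippa(prox, bound, x0=2.0):
    """Perturb each exact prox step by +/- delta_k and check
    dist(x^{k+1}) <= bound(dist(x^k)) + delta_k along the way."""
    x, holds = x0, True
    for k, delta in enumerate(deltas):
        x_next = prox(x) + (-1) ** k * delta
        holds = holds and abs(x_next) <= bound(abs(x)) + delta + 1e-12
        x = x_next
    return holds, abs(x)

ok_i, dist_i = run_ippa(prox_abs, lambda d: max(d - mu * sigma, 0.0))
ok_ii, dist_ii = run_ippa(prox_sq, lambda d: d / math.sqrt(1.0 + 2.0 * mu * sigma))
print(ok_i, dist_i, ok_ii, dist_ii)
```

In the sharp case the distance drops by $\mu\sigma_{F}$ per step until the exact prox lands on $X^{*}$, after which only the perturbation $\delta_{k}$ remains; in the quadratic case the distance contracts linearly up to the injected noise.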

Theorem 8.5.

Let $\alpha,\rho>0,\beta\in(0,1)$ and $h(r)=\max\{r-\alpha r^{\rho},\beta r\}$ . Then the sequence $r_{k+1}=h(r_{k})$ satisfies:

$(i)$ For $\rho\in(0,1)$ :

\displaystyle r_{k}

\displaystyle\leq\begin{cases}\left(1-\frac{\alpha}{2r_{0}^{1-\rho}}\right)^{k}\left[r_{0}-k\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{\rho}{1-\rho}}\right],&\text{if}\;\;r_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\\ \beta^{k-k_{0}-1}\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}},&\text{if}\;\;r_{k}\leq\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}.\end{cases}

(38)

$(ii)$ For $\rho\geq 1$ :

\displaystyle r_{k}\leq\begin{cases}\hat{\beta}^{k}r_{0},&\text{if}\;r_{k}>\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\\ \left[\frac{1}{\frac{1}{\min\{r_{0}^{\rho-1},\frac{1-\hat{\beta}}{\alpha}\}}+(\rho-1)(k-k_{0})\alpha}\right]^{\frac{1}{\rho-1}},&\text{if}\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}},\end{cases}

(39)

where $k_{0}=\left\{\min\limits_{k\geq 0}\;k:\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\right\},\hat{\beta}=\max\left\{\beta,1-1/\rho\right\}$ .

Proof.

Denote $g(r)=r-\alpha r^{\rho}$ .

Consider $\rho\in(0,1)$ . In this case, note that $g$ is nondecreasing and thus also $h$ is nondecreasing, we have:

\displaystyle r_{k+1}

\displaystyle=\begin{cases}r_{k}-\alpha r_{k}^{\rho},&\text{if}\;\;r_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\\ \beta r_{k},&\text{if}\;\;r_{k}\leq\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\end{cases}

Observe that if $r_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}$ then, by using the monotonicity of $r_{k}$ , we can further derive another bound:

	$\displaystyle r_{k+1}$	$\displaystyle\leq r_{k}-\frac{\alpha}{2}r_{k}^{\rho}-\frac{\alpha}{2}r_{k}^{\rho}\leq\left(1-\frac{\alpha}{2r_{k}^{1-\rho}}\right)r_{k}-\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{\rho}{1-\rho}}$
		$\displaystyle\leq\left(1-\frac{\alpha}{2r_{0}^{1-\rho}}\right)r_{k}-\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{\rho}{1-\rho}}.$

Any given sequence $u_{k}$ satisfying the recurrence $u_{k+1}\leq(1-\xi)u_{k}-c$ can be further bounded as: $u_{k+1}\leq(1-\xi)^{k}u_{0}-c\sum_{i=0}^{k-1}(1-\xi)^{i}\leq(1-\xi)^{k}u_{0}-c\sum_{i=0}^{k-1}(1-\xi)^{k}=(1-\xi)^{k}[u_{0}-kc].$ Thus, by applying similar arguments to our sequence $r_{k}$ we refine the above bound as follows:

\displaystyle r_{k+1}

\displaystyle\leq\begin{cases}\left(1-\frac{\alpha}{2r_{0}^{1-\rho}}\right)^{k}\left[r_{0}-k\frac{\alpha}{2}\left(\frac{\alpha}{1-\beta}\right)^{\frac{\rho}{1-\rho}}\right],&\text{if}\;\;r_{k}>\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\\ \beta^{k-k_{0}}\min\left\{r_{0},\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}\right\},&\text{if}\;\;r_{k}\leq\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}.\end{cases}

Now consider $\rho>1$ . In this case, on one hand, the function $g$ is nondecreasing only on $\left(0,\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}}\right]$ . On the other hand, for $r\geq\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}}$ it is easy to see that $g(r)\leq\left(1-\frac{1}{\rho}\right)r$ . These two observations lead to:

	$\displaystyle h(r)$	$\displaystyle=\max\{r-\alpha r^{\rho},\beta r\}\leq\max\left\{r-\alpha r^{\rho},\left(1-\frac{1}{\rho}\right)r,\beta r\right\}$
		$\displaystyle=\max\left\{r-\alpha r^{\rho},\hat{\beta}r\right\}:=\hat{h}(r),$

where $\hat{\beta}=\max\left\{1-\frac{1}{\rho},\beta\right\}$ . Since $\hat{h}$ is nondecreasing, we have $r_{k}\leq\hat{h}^{(k)}(r_{0})$ . In order to determine an explicit convergence rate of $r_{k}$ , based on [36, Lemma 6, Section 2.2] we make a last observation:

\displaystyle g^{(k)}(r)\leq\frac{r}{\left(1+(\rho-1)r^{\rho-1}k\alpha\right)^{\frac{1}{\rho-1}}}\leq\left[\frac{1}{\frac{1}{r^{\rho-1}}+(\rho-1)k\alpha}\right]^{\frac{1}{\rho-1}}

(40)

Using this final bound, we are able to deduce the explicit convergence rate:

	$\displaystyle r_{k}\leq\hat{h}^{(k)}(r_{0})$	$\displaystyle\leq\begin{cases}\hat{\beta}^{k}r_{0},&\text{if}\;r_{k}>\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\\ g^{(k-k_{0})}(r_{k_{0}}),&\text{if}\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\end{cases}$
		$\displaystyle\overset{\eqref{rel:Poliak}}{\leq}\begin{cases}\hat{\beta}^{k}r_{0},&\text{if}\;r_{k}>\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}\\ \left[\frac{1}{\frac{1}{\min\{r_{0}^{\rho-1},\frac{1-\hat{\beta}}{\alpha}\}}+(\rho-1)(k-k_{0})\alpha}\right]^{\frac{1}{\rho-1}},&\text{if}\;r_{k}\leq\left(\frac{1-\hat{\beta}}{\alpha}\right)^{\frac{1}{\rho-1}}.\end{cases}$

∎
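The sublinear estimate (40) used in the proof above can be illustrated numerically. The sketch below iterates $g(r)=r-\alpha r^{\rho}$ and compares the result with the first bound in (40). The constants are arbitrary illustrative choices (with $\alpha r_{0}^{\rho-1}\leq 1$ , so that the iterates remain nonnegative), and the function names are hypothetical.

```python
def g_iterate(r0, alpha, rho, k):
    # k-fold composition g^(k)(r0) with g(r) = r - alpha * r**rho.
    r = r0
    for _ in range(k):
        r = r - alpha * r ** rho
    return r


def sublinear_bound(r0, alpha, rho, k):
    # First upper bound in (40): r / (1 + (rho-1) r^(rho-1) k alpha)^(1/(rho-1)).
    p = rho - 1.0
    return r0 / (1.0 + p * r0 ** p * k * alpha) ** (1.0 / p)


# Illustrative constants (assumptions, not taken from the paper).
r0, alpha, rho = 1.0, 0.1, 3.0
checks = [
    (g_iterate(r0, alpha, rho, k), sublinear_bound(r0, alpha, rho, k))
    for k in range(0, 200, 10)
]
```

Each pair in `checks` holds the iterate and its bound for the same $k$ ; the iterate is expected to stay below the bound, with both decaying at the order $k^{-1/(\rho-1)}$ .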

Corollary 8.6.

Under the assumptions of Theorem 8.5, let $r_{k+1}=h(r_{k})$ and $\epsilon>0$ . The sequence $r_{k}$ attains the threshold $r_{k}\leq\epsilon$ after the following number of iterations:

$(i)$ For $\rho\in(0,1)$ :

\displaystyle K\geq\min\left\{\frac{2r_{0}^{1-\rho}}{\alpha}\log\left(\frac{r_{0}}{\max\{\epsilon,\tau^{\rho}\alpha/2\}}\right),\frac{2r_{0}}{\tau^{\rho}\alpha}\right\}+\frac{1}{\beta}\log\left(\frac{\min\left\{r_{0},\tau\right\}}{\epsilon}\right)

(41)

$(ii)$ For $\rho\geq 1$ :

\displaystyle K\geq\frac{1}{\hat{\beta}}\log\left(\frac{r_{0}}{\tau(\hat{\beta})}\right)+\frac{1}{(\rho-1)\alpha}\left(\frac{1}{\epsilon^{\rho-1}}-\frac{1}{\min\left\{r_{0},\tau(\hat{\beta})\right\}^{\rho-1}}\right),

(42)

where $\tau(\beta)=\left(\frac{\alpha}{1-\beta}\right)^{\frac{1}{1-\rho}}$ .

Proof.

$(i)$ Let $\rho\in(0,1)$ . The first regime of (38), when $r_{k}>\tau(\beta)$ , lasts at most:

\displaystyle K_{1}^{(0,1)}\geq\min\left\{\frac{2r_{0}^{1-\rho}}{\alpha}\log\left(\frac{r_{0}}{\max\{\epsilon,\tau^{\rho}\alpha/2\}}\right),\frac{2r_{0}}{\tau^{\rho}\alpha}\right\}

(43)

iterations, while the second regime, i.e. $r_{k}\leq\tau(\beta)$ , has a length of at most:

\displaystyle K_{2}^{(0,1)}\geq\frac{1}{\beta}\log\left(\frac{\min\left\{r_{0},\tau\right\}}{\epsilon}\right)

(44)

iterations to reach $r_{k}\leq\epsilon$ . An upper bound on the total number of iterations is $K_{1}^{(0,1)}+K_{2}^{(0,1)}$ .

$(ii)$ Let $\rho>1$ . Similarly, the first regime, when $r_{k}>\tau(\hat{\beta})$ , has a maximal length of: $K_{1}^{(1,\infty)}\geq\frac{1}{\hat{\beta}}\log\left(\frac{r_{0}}{\tau(\hat{\beta})}\right).$ The second regime, when $r_{k}\leq\tau(\hat{\beta})$ , requires at most: $K_{2}^{(1,\infty)}\geq\frac{1}{(\rho-1)\alpha}\left(\frac{1}{\epsilon^{\rho-1}}-\frac{1}{\min\left\{r_{0},\tau(\hat{\beta})\right\}^{\rho-1}}\right)$ iterations to reach $r_{k}\leq\epsilon$ . ∎

Lemma 8.7.

Let $\alpha,\rho>0,\beta\in(0,1)$ . Let the sequence $\{r_{k},\delta_{k}\}_{k\geq 0}$ satisfy the recurrence:

\displaystyle r_{k+1}\leq\max\{r_{k}-\alpha r_{k}^{\rho},\beta r_{k}\}+\delta_{k}.

For $\rho\in(0,1)$ , let $h(r)=\max\left\{r-\frac{\alpha}{2}r^{\rho},\frac{1+\beta}{2}r\right\}$ , then:

r_{k}\leq\max\left\{h^{(k)}(r_{0}),h^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\}.

For $\rho\geq 1$ , let $\hat{h}(r)=\max\left\{h(r),\left(1-\frac{1}{\rho}\right)r\right\}$ , then:

r_{k}\leq\max\left\{\hat{h}^{(k)}(r_{0}),\hat{h}^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,\hat{h}\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\},

where $\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\}$ .

Proof.

Starting from the recurrence we get:

	$\displaystyle r_{k+1}$	$\displaystyle\leq\max\{r_{k}-\alpha r_{k}^{\rho},\beta r_{k}\}+\delta_{k}=\max\{r_{k}-\alpha r_{k}^{\rho}+\delta_{k},\beta r_{k}+\delta_{k}\}$
		$\displaystyle=\max\left\{r_{k}-\frac{\alpha}{2}r_{k}^{\rho}+\left(\delta_{k}-\frac{\alpha}{2}r_{k}^{\rho}\right),\frac{1+\beta}{2}r_{k}+\left(\delta_{k}-\frac{1-\beta}{2}r_{k}\right)\right\}$

If $\delta_{k}\leq\min\left\{\frac{\alpha}{2}r_{k}^{\rho},\frac{1-\beta}{2}r_{k}\right\}$ , or equivalently $r_{k}\geq\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\}$ , then we recover the recurrence:

\displaystyle r_{k+1}

\displaystyle\leq\max\left\{r_{k}-\frac{\alpha}{2}r_{k}^{\rho},\frac{1+\beta}{2}r_{k}\right\}

(45)

Otherwise, clearly

\displaystyle r_{k}\leq\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\}

(46)

By combining both bounds (45) and (46), we obtain:

\displaystyle r_{k+1}

\displaystyle\leq\max\left\{r_{k}-\frac{\alpha}{2}r_{k}^{\rho},\frac{1+\beta}{2}r_{k},\left(\frac{2\delta_{k+1}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k+1}}{1-\beta}\right\}.

(47)

Denote $h(r)=\max\left\{r-\frac{\alpha}{2}r^{\rho},\frac{1+\beta}{2}r\right\}$ and $\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\alpha}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\beta}\right\}$ . For $\rho\in(0,1)$ , since both functions $r\mapsto r-\alpha r^{\rho}$ and $r\mapsto\frac{1+\beta}{2}r$ are nondecreasing, $h$ is nondecreasing as well. This fact allows us to apply the following induction to (47):

$\displaystyle r_{k+1}$	$\displaystyle\leq\max\left\{h(r_{k}),\hat{\delta}_{k+1}\right\}\leq\max\left\{h\left(\max\left\{h(r_{k-1}),\hat{\delta}_{k}\right\}\right),\hat{\delta}_{k+1}\right\}$
	$\displaystyle\leq\max\left\{h(h(r_{k-1})),h\left(\hat{\delta}_{k}\right),\hat{\delta}_{k+1}\right\}$
	$\displaystyle\cdots$
	$\displaystyle\leq\max\left\{h^{(k+1)}(r_{0}),h^{(k)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k}\right),\hat{\delta}_{k+1}\right\}.$	(48)

In the second case, when $\rho\geq 1$ , the corresponding recurrence function $\hat{h}(r)=\max\left\{r-\frac{\alpha}{2}r^{\rho},\left(1-\frac{1}{\rho}\right)r,\frac{1+\beta}{2}r\right\}$ is again nondecreasing. Indeed, here $r\mapsto r-\alpha r^{\rho}$ is nondecreasing only when $r\leq\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}}$ . However, if $r>\left(\frac{1}{\alpha\rho}\right)^{\frac{1}{\rho-1}}$ , then $\hat{h}(r)=\max\left\{1-\frac{1}{\rho},\frac{1+\beta}{2}\right\}r$ , which is also nondecreasing. Thus we get our claim. The monotonicity of $\hat{h}$ and the majorization $\hat{h}(r)\geq h(r)$ allow us to obtain, by a similar induction, a relation analogous to (48), which holds with $\hat{h}$ . ∎
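The threshold step in the proof of Lemma 8.7 (absorbing $\delta_{k}$ into half of the decrease whenever $r_{k}\geq\hat{\delta}_{k}$ ) can be illustrated with a short sketch. The constants below are arbitrary illustrative choices and the helper names are hypothetical. The code only checks the one-step implication that leads to (45), not the full induction.

```python
def delta_hat(delta, alpha, rho, beta):
    # Threshold max{(2*delta/alpha)^(1/rho), 2*delta/(1-beta)} from Lemma 8.7.
    return max((2.0 * delta / alpha) ** (1.0 / rho), 2.0 * delta / (1.0 - beta))


def perturbed_step(r, delta, alpha, rho, beta):
    # Right-hand side of the noisy recurrence: max{r - alpha r^rho, beta r} + delta.
    return max(r - alpha * r ** rho, beta * r) + delta


def halved_step(r, alpha, rho, beta):
    # Noise-free majorant h(r) = max{r - (alpha/2) r^rho, ((1+beta)/2) r}.
    return max(r - 0.5 * alpha * r ** rho, 0.5 * (1.0 + beta) * r)


# Illustrative constants (assumptions, not taken from the paper).
alpha, rho, beta, delta = 0.3, 0.5, 0.6, 1e-3
threshold = delta_hat(delta, alpha, rho, beta)

# Sample points at and above the threshold and test the one-step bound.
samples = [threshold * (1.0 + 0.25 * j) for j in range(20)]
holds = [
    perturbed_step(r, delta, alpha, rho, beta)
    <= halved_step(r, alpha, rho, beta) + 1e-12
    for r in samples
]
```

For these constants the threshold is attained by the term $\frac{2\delta}{1-\beta}$ , and at that point the one-step bound holds with equality up to rounding.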

Proof of Theorem 4.1.

$(i)$ Denote $r_{k}=\text{dist}_{X^{*}}(x^{k})$ . Since $\delta_{k}\leq\delta_{k-1}$ , by unrolling the recurrence in Lemma 8.4 we get:

$\displaystyle r_{k+1}$	$\displaystyle\leq\max\left\{r_{k}-(\mu\sigma_{F}-\delta_{k}),\delta_{k}\right\}$
	$\displaystyle\leq\max\left\{r_{k-1}-[2\mu\sigma_{F}-\delta_{k}-\delta_{k-1}],\delta_{k}+\delta_{k-1}-\mu\sigma_{F},\delta_{k}\right\}$
	$\displaystyle\leq\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}+\max\left\{0,\delta_{k-1}-\mu\sigma_{F},\cdots,\sum\limits_{i=0}^{k-1}(\delta_{i}-\mu\sigma_{F})\right\}\right\}$	(49)

Using Lemma 8.2, (49) can be refined as:

	$\displaystyle r_{k+1}$	$\displaystyle\leq\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}+\max\left\{0,\delta_{k-1}-\mu\sigma_{F},\cdots,\sum\limits_{i=0}^{k-1}(\delta_{i}-\mu\sigma_{F})\right\}\right\}$
		$\displaystyle\overset{\text{Lemma}\;\ref{max_lemma}}{\leq}\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}+\max\left\{0,\sum\limits_{i=0}^{k-1}\delta_{i}-\mu\sigma_{F}\right\}\right\}$
		$\displaystyle\leq\max\left\{r_{0}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\max\left\{\delta_{k},\mu\sigma_{F}+\sum\limits_{i=0}^{k}\delta_{i}-\mu\sigma_{F}\right\}\right\}$
		$\displaystyle=\max\left\{\max\{r_{0},\mu\sigma_{F}\}-\sum\limits_{i=0}^{k}(\mu\sigma_{F}-\delta_{i}),\delta_{k}\right\}.$

$(ii)$ Denote $\theta=\frac{1}{(1+2\sigma_{F}\mu)^{1/2}}$ . From Lemmas 8.4 and 8.1 we derive that:

\displaystyle\text{dist}_{X^{*}}(x^{k})\leq\theta\text{dist}_{X^{*}}(x^{k-1})+\delta_{k-1}\overset{\text{Lemma}\;\ref{lemma:first_sequence}}{\leq}\theta^{\frac{k-4}{2}}\left(\text{dist}_{X^{*}}(x^{0})+\Gamma\right)+\frac{\delta_{\lceil k/2\rceil+1}}{1-\theta}.

$(iii)$ First consider $\gamma\in[1,2)$ and let $h(r)=\max\left\{r-\frac{\mu\varphi(\gamma)\sigma_{F}}{2}r^{\gamma-1},\frac{1+\sqrt{1-\varphi(\gamma)}}{2}r\right\}$ . Then by Lemmas 8.4 and 8.7, we have that:

\displaystyle r_{k+1}

\displaystyle\leq\max\left\{h^{(k)}(r_{0}),h^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\},

(50)

where $\hat{\delta}_{k}=\max\left\{\left(\frac{2\delta_{k}}{\mu\varphi(\gamma)\sigma_{F}}\right)^{\frac{1}{\rho}},\frac{2\delta_{k}}{1-\sqrt{1-\varphi(\gamma)}}\right\}$ . Let $u_{k}=h^{(k)}(u_{0})$ with $u_{0}=r_{0}$ , and $\bar{\delta}_{k}=\max\left\{\hat{\delta}_{k},h(\bar{\delta}_{k-1})\right\}$ . Then, since $h$ is nondecreasing, we get:

\displaystyle r_{k+1}\leq\max\left\{h^{(k)}(r_{0}),h^{(k-1)}\left(\hat{\delta}_{1}\right),\cdots,h\left(\hat{\delta}_{k-1}\right),\hat{\delta}_{k}\right\}=\max\left\{u_{k},\bar{\delta}_{k}\right\}.

Finally, by using the convergence rate upper bounds from Theorem 8.5, we can further determine the order of the convergence rate of $u_{k}$ . A similar argument applies when $\gamma\geq 2$ , by using the nondecreasing function $\hat{h}(r)=\max\left\{h(r),\left(1-\frac{1}{\rho}\right)r\right\}$ instead of $h$ .

∎

Theorem 8.8.

Let the function $f$ have $\nu-$ Holder continuous gradients with constant $L_{f}$ and $\nu\in[0,1]$ . Also let $\mu>0,\alpha\leq\min\left\{\frac{\mu}{2},\frac{\delta^{2(1-\nu)}}{4\mu L_{f}^{2}}\right\},z^{0}\in\text{dom}(\psi)$ and

\displaystyle N\geq\left\lceil\frac{4\mu}{\alpha}\log\left(\frac{\lVert z^{0}-\text{prox}_{\mu}^{F}(x)\rVert}{\delta}\right)\right\rceil

(51)

then PsGM( $z^{0},x,\alpha,\mu,N$ ) outputs $z^{N}$ such that $\lVert z^{N}-\text{prox}_{\mu}^{F}(x)\rVert\leq\delta$ .

Moreover, assume in particular that $\nu=0$ and $F$ satisfies WSM with constant $\sigma_{F}$ . Also let $\alpha\in(0,\mu/2]$ , $Q$ be a closed convex feasible set and $\psi=\iota_{Q}$ its indicator function. If

\displaystyle\text{dist}_{X^{*}}(x)\leq\mu\sigma_{F}\qquad\text{and}\qquad N\geq\left\lceil 2\left(\frac{\text{dist}_{X^{*}}(x)}{\alpha L_{f}}\right)^{2}\right\rceil

(52)

then PsGM( $x,x,\alpha,\mu,N$ ) outputs $z^{N}$ satisfying $\text{dist}_{X^{*}}(z^{N})\leq\frac{\alpha L_{f}^{2}}{2\sigma_{F}}$ .

Proof.

For brevity we omit the counter $k$ and denote $z(x):=\text{prox}_{\mu}^{F}(x),z^{+}:=\text{prox}_{\mu}^{\psi}\left(z-\alpha\left[f^{\prime}(z)+\frac{1}{\mu}(z-x)\right]\right)$ . Recall the optimality condition:

\displaystyle z(x)=\text{prox}_{\mu}^{\psi}\left(z(x)-\alpha\left[f^{\prime}(z(x))+\frac{1}{\mu}(z(x)-x)\right]\right).

(53)

By $\nu-$ Holder continuity we get:

\displaystyle\lVert f^{\prime}(z)-f^{\prime}(z(x))\rVert\leq L_{f}\lVert z-z(x)\rVert^{\nu}\quad\forall z.

(54)

Then the following recurrence holds:

	$\displaystyle\lVert z^{+}-z(x)\rVert^{2}$
	$\displaystyle\!\!\overset{\eqref{rel:strong_convexity_inner}}{=}\!\!\left\|\text{prox}_{\mu}^{\psi}\!\!\left(z\!\!-\!\!\alpha\left[f^{\prime}(z)\!+\!\frac{1}{\mu}(z\!\!-\!\!x)\right]\right)\!\!-\!\!\text{prox}_{\mu}^{\psi}\left(z(x)\!\!-\!\!\alpha\left[f^{\prime}(z(x))\!+\!\frac{1}{\mu}(z(x)\!\!-\!\!x)\right]\right)\right\|^{2}$
	$\displaystyle\leq\left\|\left(1-\frac{\alpha}{\mu}\right)(z-z(x))+\alpha[f^{\prime}(z(x))-f^{\prime}(z)]\right\|^{2}$
	$\displaystyle=\left(1-\frac{\alpha}{\mu}\right)^{2}\lVert z-z(x)\rVert^{2}-2\alpha\left(1-\frac{\alpha}{\mu}\right)\langle f^{\prime}(z)-f^{\prime}(z(x)),z-z(x)\rangle$
	$\displaystyle\hskip 170.71652pt+\alpha^{2}\lVert f^{\prime}(z)-f^{\prime}(z(x))\rVert^{2}$		(55)
	$\displaystyle\overset{\eqref{rel:bound_holdgrad}}{\leq}\left(1-\frac{\alpha}{\mu}\right)^{2}\lVert z-z(x)\rVert^{2}+\alpha^{2}L_{f}^{2}\lVert z-z(x)\rVert^{2\nu}.$

Obviously, a small stepsize $\alpha<\mu$ yields $\left(1-\frac{\alpha}{\mu}\right)^{2}\leq 1-\frac{\alpha}{\mu}$ . If the squared residual is dominant, i.e.

\displaystyle\lVert z-z(x)\rVert\geq\delta\geq\left(2\alpha\mu L_{f}^{2}\right)^{\frac{1}{2(1-\nu)}},

(56)

then:

\displaystyle\lVert z^{+}-z(x)\rVert^{2}\leq\left(1-\frac{\alpha}{2\mu}\right)\lVert z-z(x)\rVert^{2}.

(57)

By (56), this linear decrease of the residual stops when $\lVert z-z(x)\rVert\leq\delta$ , which occurs after at most $\left\lceil\frac{4\mu}{\alpha}\log\left(\frac{\lVert z^{0}-z(x)\rVert}{\delta}\right)\right\rceil$ PsGM iterations.

To show the second part of our result, we recall that the first assumption of (52) ensures $z(x)=\pi_{X^{*}}(x)$ . Using (55) with the chosen subgradient $f^{\prime}(x^{*})=0$ , we obtain the following recurrence:

$\displaystyle\lVert z^{\ell+1}\!-\!x^{*}\rVert^{2}$	$\displaystyle\!=\!\left(1\!-\!\frac{\alpha}{\mu}\right)^{2}\lVert z^{\ell}\!-\!x^{*}\rVert^{2}\!-\!2\alpha\left(1\!-\!\frac{\alpha}{\mu}\right)\langle f^{\prime}(z^{\ell}),z^{\ell}\!-\!x^{*}\rangle\!+\!\alpha^{2}\lVert f^{\prime}(z^{\ell})\rVert^{2}$
	$\displaystyle\leq\lVert z^{\ell}-x^{*}\rVert^{2}-2\alpha\left(1-\frac{\alpha}{\mu}\right)\sigma_{F}\text{dist}_{X^{*}}(z^{\ell})+\alpha^{2}L_{f}^{2}$
	$\displaystyle\leq\lVert z^{\ell}-x^{*}\rVert^{2}-\alpha\sigma_{F}\text{dist}_{X^{*}}(z^{\ell})+\alpha^{2}L_{f}^{2},$	(58)

where $x^{*}=\pi_{X^{*}}(x)$ and in the last inequality we used $\langle f^{\prime}(z^{\ell}),z^{\ell}-x^{*}\rangle\geq F(z^{\ell})-F^{*}\geq\sigma_{F}\text{dist}_{X^{*}}(z^{\ell})$ . If $\text{dist}_{X^{*}}(z^{0})=\text{dist}_{X^{*}}(x)>\frac{\alpha L_{f}^{2}}{2\sigma_{F}}$ , then as long as $\text{dist}_{X^{*}}(z^{\ell})>\frac{\alpha L_{f}^{2}}{2\sigma_{F}}$ , (58) turns into:

	$\displaystyle\text{dist}_{X^{*}}^{2}(z^{\ell+1})\leq\lVert z^{\ell+1}-x^{*}\rVert^{2}$	$\displaystyle\leq\lVert z^{\ell}-x^{*}\rVert^{2}-\frac{\left(\alpha L_{f}\right)^{2}}{2}$
		$\displaystyle\leq\text{dist}_{X^{*}}(x)^{2}-\ell\frac{\left(\alpha L_{f}\right)^{2}}{2}.$		(59)

To unify both cases, we further express the recurrence as:

\displaystyle\text{dist}_{X^{*}}(z^{+})^{2}\leq\max\left\{\text{dist}_{X^{*}}(x)^{2}-\ell\frac{\left(\alpha L_{f}\right)^{2}}{2},\frac{\alpha^{2}L_{f}^{4}}{4\sigma_{F}^{2}}\right\},

(60)

which confirms our above result.

∎
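The first part of Theorem 8.8 can be illustrated on a toy instance. The sketch below is not the authors' PsGM implementation: it assumes $f(z)=\frac{1}{2}\lVert z\rVert^{2}$ (so $\nu=1$ and $L_{f}=1$ ) and $\psi$ the indicator of the nonnegative orthant, for which the proximal step reduces to a projection and $\text{prox}_{\mu}^{F}(x)$ has a closed form. The names `psgm_sketch`, `exact_prox` and `dist` are hypothetical.

```python
import math


def psgm_sketch(z0, x, alpha, mu, n_iters):
    # Toy inner loop: z+ = proj_{z >= 0}(z - alpha * [f'(z) + (z - x)/mu]),
    # with f(z) = 0.5*||z||^2, hence f'(z) = z (assumption of this sketch).
    z = list(z0)
    for _ in range(n_iters):
        grad = [zi + (zi - xi) / mu for zi, xi in zip(z, x)]
        z = [max(zi - alpha * gi, 0.0) for zi, gi in zip(z, grad)]
    return z


def exact_prox(x, mu):
    # Closed form of argmin_{z >= 0} 0.5*z^2 + (z - x)^2 / (2*mu), coordinatewise.
    return [max(xi, 0.0) / (1.0 + mu) for xi in x]


def dist(a, b):
    # Euclidean distance between two vectors given as lists.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Illustrative constants (assumptions): nu = 1, so delta^(2(1-nu)) = 1.
mu, L_f, delta = 1.0, 1.0, 1e-6
alpha = min(mu / 2.0, 1.0 / (4.0 * mu * L_f ** 2))
x = [2.0, -1.0, 0.5]
z0 = [0.0, 0.0, 0.0]

target = exact_prox(x, mu)
# Iteration count suggested by (51).
N = math.ceil(4.0 * mu / alpha * math.log(dist(z0, target) / delta))
zN = psgm_sketch(z0, x, alpha, mu, N)
```

On this instance the inner iteration contracts the residual by a factor of $1/2$ per step, so the count from (51) is conservative and `zN` lands well within `delta` of `target`.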

Remark 4.

As the above theorem states, when the sequence $x^{t}$ is sufficiently close to the solution set, computing $x^{t+1}$ requires a number of PsGM iterations that depends on $\text{dist}_{X^{*}}(z^{0})$ . In other words, the estimate from (52) can be further reduced through a good initialization or a restart technique. Such a restart, in the neighborhood of the optimum, is exploited by Algorithm 4 below.


