Complexity of Inexact Proximal Point Algorithm for minimizing convex functions with Holderian Growth
Abstract
Several decades ago the Proximal Point Algorithm (PPA) started to gain a long-lasting attraction in both the abstract operator theory and the numerical optimization communities. Even in modern applications, researchers still use proximal minimization theory to design scalable algorithms that overcome nonsmoothness. Remarkable works such as [9, 4, 5, 51] established tight relations between the convergence behaviour of PPA and the regularity of the objective function. In this manuscript we derive the nonasymptotic iteration complexity of exact and inexact PPA for minimizing convex functions under $\gamma$-Holderian growth: $\mathcal{O}(\log(1/\epsilon))$ (for $\gamma \in [1,2]$) and $\mathcal{O}(1/\epsilon^{\gamma-2})$ (for $\gamma > 2$). In particular, we recover well-known results on PPA: finite convergence for sharp minima and linear convergence for quadratic growth, even in the presence of deterministic noise. Moreover, when a simple Proximal Subgradient Method is recurrently called as an inner routine for computing each IPPA iterate, novel computational complexity bounds are obtained for Restarting Inexact PPA. Our numerical tests show improvements over existing restarting versions of the Subgradient Method.
keywords:
Inexact proximal point, weak sharp minima, Holderian growth, finite termination.
1 Introduction
The problem of interest in this paper is the following convex nonsmooth minimization problem:
$$\min_{x \in \mathbb{R}^n} F(x) := f(x) + \psi(x). \qquad (1)$$
Here we assume that $f$ is convex and $\psi$ is convex, lower semicontinuous and proximable. By a proximable function we refer to one whose proximal mapping is computable in closed form or in linear time. The above model finds plenty of applications, of which we shortly mention compressed sensing [3], sparse risk minimization [21, 52] and graph-regularized models [54]. Dating back to the '60s, the classical Subgradient Methods (SGM) [43, 44, 34, 35, 37] established an $\mathcal{O}(1/\epsilon^2)$ iteration complexity for minimizing convex functions, i.e. for finding $\hat{x}$ such that $F(\hat{x}) - F^* \leq \epsilon$. Despite the fact that this complexity order is unimprovable for the class of convex functions, particular growth or error bound properties can be exploited to obtain better orders. Error bounds and regularity conditions have a long history in optimization, systems of inequalities and projection methods: [1, 7, 9, 34, 35, 37, 6, 14, 23]. Particularly, in the seminal works [36, 37] SGM is proved to converge linearly towards weakly sharp minima of $F$. The optimal set $X^*$ is a set of weak sharp minima (WSM) if there exists $\sigma_F > 0$ such that
$$F(x) - F^* \geq \sigma_F \, \mathrm{dist}(x, X^*) \quad \text{for all } x. \qquad \text{(WSM)}$$
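To fix ideas about the proximable part $\psi$ in (1), a classical proximal mapping computable in closed form is the soft-thresholding operator of the $\ell_1$-norm used in the sparse models mentioned above. A minimal sketch (the function name is ours, for illustration):

```python
import numpy as np

def prox_l1(x, mu):
    """Closed-form prox of psi = ||.||_1, i.e. the minimizer of
    ||z||_1 + ||z - x||^2 / (2*mu): componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

print(prox_l1(np.array([1.5, -0.3, 0.7]), mu=0.5))  # [ 1. -0.  0.2]
```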
Acceleration of other first-order algorithms under WSM has been proved in subsequent works such as [1, 7, 8, 41]. Besides acceleration, [37, Section 5.2.3] introduces the "superstability" of sharp optimal solutions: under small perturbations of the objective function, a subset of the weak sharp minima remains optimal for the perturbed model. The superstability of WSM was used in [36, 27] to show the robustness of inexact SGM. In short, using persistently but mildly perturbed subgradients at each iteration, the resulting perturbed SGM still converges linearly to $X^*$. In the line of these results, we also show in our manuscript that a similar robustness holds for proximal point methods under WSM.
Other recent works such as [53, 16, 26, 22, 18, 51, 19, 14, 10, 23, 11, 38, 6, 17] look at a suite of growth regimes beyond WSM and use them to improve the complexity of first-order algorithms. Particularly, in our paper we are interested in the Holderian growth with exponent $\gamma \geq 1$ and modulus $\sigma_\gamma > 0$:
$$F(x) - F^* \geq \sigma_\gamma \, \mathrm{dist}(x, X^*)^\gamma \quad \text{for all } x. \qquad \text{(HG)}$$
Note that (HG) is equivalent to the Kurdyka–Łojasiewicz (KL) inequality for convex, closed, and proper functions, as shown in [6]. It includes the class of uniformly convex functions analyzed in [17] and, obviously, it covers the sharp minima case (WSM) for $\gamma = 1$. The Quadratic Growth (QG) condition, covered by $\gamma = 2$, was analyzed in a large suite of previous works [22, 53, 23, 26] and, although weaker than strong convexity, it can be essentially exploited (besides Lipschitz gradient continuity) to show the linear convergence of proximal gradient methods. Our analysis recovers similar complexity orders under the same particular assumptions.
Some recent works [53, 16, 11, 10, 38] developed restarted SGM schemes for minimizing convex functions under $\gamma$-HG or WSM, and analyzed their theoretical convergence and its natural dependence on the growth moduli $\sigma_\gamma$ and $\gamma$. The Restarted SubGradient (RSG) method of [53] and the Decaying Stepsize SubGradient (DS-SG) method of [16] present iteration complexity estimates of order $\mathcal{O}(\log(1/\epsilon))$ under WSM and of order $\mathcal{O}(1/\epsilon^{2(\gamma-1)})$ under (HG) in order to attain $\mathrm{dist}(x, X^*) \leq \epsilon$. These bounds are optimal for functions with bounded gradients, as observed in [29]. Most SGM schemes depend, up to various degrees, on knowledge of problem information. For instance, RSG and DS-SG rely on lower bounds of the optimal value and on knowledge of the growth parameters or other Lipschitz constants. The restart is introduced in order to avoid the exact estimation of the modulus $\sigma_\gamma$. Our schemes also allow estimates of problem moduli such as $\sigma_\gamma$, covering the cases when these are not known. In the best case, when the estimates are close to the true parameters, similar complexity estimates are obtained in terms of subgradient evaluations. Moreover, by exploiting additional smooth structure we further obtain lower estimates.
The work of [17] approaches the constrained model, i.e. $\psi$ is the indicator function of a closed convex set, and assumes uniform convexity of degree $\rho \geq 2$:
$$f(y) \geq f(x) + \langle g, y - x \rangle + \frac{\sigma_\rho}{\rho}\|y - x\|^\rho$$
for all feasible $x, y$ and $g \in \partial f(x)$. The authors obtain optimal complexity bounds when the subgradients of $f$ are bounded. Moreover, their restarting technique is adaptive to the growth modulus $\sigma_\rho$ and to the parameter $\rho$, up to a fixed number of iterations.
Inherent to all SGMs, the complexity results of these works essentially require the boundedness of the subgradients, which is often natural for nondifferentiable functions. However, plenty of convex objective functions coming from risk minimization, sparse regression or machine learning present, besides their particular growth, a certain degree of smoothness which is not compatible with the bounded-subgradients assumption. Enclosing the feasible domain in a ball is an artificial remedy used to keep the subgradients bounded, which however might load the implementation with additional uncertain tuning heuristics. Our analysis shows how to exploit smoothness in order to improve the complexity estimates.
The analysis of [41] investigates the effect of restarting on optimal first-order schemes under $\gamma$-HG and Holder smoothness, starting from the results of [29]. When the gradient of the objective is Lipschitz continuous and Holderian growth holds, $\epsilon$-suboptimality is reached after a number of restarted accelerated gradient iterations ranging from logarithmic to sublinear in $1/\epsilon$, depending on the growth exponent. In general, if the gradient is only Holder continuous, they restart the Universal Gradient Method and obtain corresponding overall complexities. Although these estimates are unimprovable and better than ours, in general the implementation of the optimal schemes requires complete knowledge of the growth and smoothness parameters.
Several decades ago the Proximal Point Algorithm (PPA) started to gain much attraction in both the abstract operator theory and the numerical optimization communities. Even in modern applications, where large-scale nonsmooth optimization arises recurrently, practitioners still draw inspiration from proximal minimization theory to design scalable algorithmic techniques that overcome nonsmoothness. The powerful PPA iteration consists mainly in the recursive evaluation of the proximal operator associated to the objective function. The proximal mapping is based on the infimal convolution with a metric function, often chosen to be the squared Euclidean norm. The Proximal Point recursion:
$$x_{k+1} = \mathrm{prox}_{\mu F}(x_k) := \arg\min_{z} \; F(z) + \frac{1}{2\mu}\|z - x_k\|^2$$
became famous in the optimization community when [40, 39] and [4, 5] revealed its connection to various multiplier methods for constrained minimization; see also [31, 32, 28, 12, 13]. Remarkable works have shown how growth regularity is a key factor in the iteration complexity of PPA.
Finite convergence of the exact PPA under WSM is proved in [7, 9, 1]. Furthermore, an extensive convergence analysis of the exact PPA and of the Augmented Lagrangian algorithm under (HG) can be found in [5, 18]. Although these results and analyses are of remarkable generality, they are of an asymptotic nature (see [51]). A nonasymptotic analysis is found in [51], where the equivalence between a Dual Augmented Lagrangian algorithm and a variable stepsize PPA is established. The authors analyze sparse learning models in which a twice differentiable loss with Lipschitz continuous gradient is composed with a linear operator and summed with a convex nonsmooth regularizer. Under Holderian growth, for a range of exponents, they show a nonasymptotic superlinear convergence rate of the exact PPA with exponentially increasing stepsize. For the inexact variant they further retain a slightly weaker superlinear convergence. The progress from the asymptotic analysis of [40, 18] to a nonasymptotic one is remarkable due to the simplicity of the arguments. However, a convergence rate of inexact PPA (IPPA) could become irrelevant without quantifying the local computational effort spent on each iteration, since one inexact PPA iteration requires the approximate solution of a regularized optimization problem. Among the remarkable references on inexact versions of various proximal algorithms are [25, 48, 49, 50, 47, 46, 20, 21, 24, 45, 31, 32].
We mention that a small portion of the results on the WSM case contained in this manuscript has recently been published by the authors in [33]. However, we include it in the present manuscript for the sake of completeness.
Contributions. We list further our main contributions:
Inexact PPA under (HG). We provide nonasymptotic iteration complexity bounds for IPPA to solve (1) under (HG). In particular, we obtain $\mathcal{O}(\log(1/\epsilon))$ for $\gamma \in [1,2]$ and, in the case of the best parameter choice, $\mathcal{O}(1/\epsilon^{\gamma-2})$ for $\gamma > 2$, to attain an $\epsilon$ distance to the optimal set. All these bounds require only convexity of the objective function and are independent of any bounded gradients or smoothness assumptions. We could not find these nonasymptotic estimates in the literature for general $\gamma$.
Restarts. We further analyze the complexity bounds of restarted IPPA, which facilitates the derivation of better computational complexity estimates than the non-restarted IPPA. The complexity estimates have similar orders in both the restarted and non-restarted algorithms for all $\gamma$.
Total computational complexity. We derive the total complexity, including the inner computational effort spent at each IPPA iteration, in terms of the number of inner (proximal) gradient evaluations. If $f$ has Holder continuous gradients we obtain, in the case of the best parameter choice, explicit bounds on the number of proximal (sub)gradient evaluations necessary to approach within an $\epsilon$ distance of the optimal set. As we discuss in section 6, the total complexity depends on various restart parameters.
Experiments. Our numerical experiments confirm a better behaviour of the restarted IPPA, which uses an inner subgradient method routine, in comparison with two other restarting strategies for the classical Subgradient Method. We performed our tests on several polyhedral learning models, including Graph SVM and Matrix Completion, using synthetic and real data.
1.1 Notations and preliminaries
Now we introduce the main notations of our manuscript. For $x, y \in \mathbb{R}^n$ we denote the scalar product by $\langle x, y \rangle$ and the Euclidean norm by $\|x\|$. The projection operator onto a set $Q$ is denoted by $\pi_Q$ and the distance from $x$ to the set $Q$ is denoted by $\mathrm{dist}(x, Q)$. The indicator function of $Q$ is denoted by $\iota_Q$. We use $\partial F(x)$ for the subdifferential set and $g_F(x)$ for a subgradient of $F$ at $x$. In the differentiable case, when $\partial F(x)$ is a singleton, $\nabla F(x)$ will eventually be used instead of $g_F(x)$. By $X^*$ we denote the optimal set associated to (1) and by an $\epsilon$-suboptimal point we understand a point $x$ that satisfies $\mathrm{dist}(x, X^*) \leq \epsilon$.
A function $F$ is called $\sigma$-strongly convex if the following relation holds:
$$F(y) \geq F(x) + \langle g_F(x), y - x \rangle + \frac{\sigma}{2}\|y - x\|^2 \quad \text{for all } x, y.$$
Let $\nu \in [0, 1]$; then we say that a differentiable function $f$ has a $\nu$-Holder continuous gradient with constant $L_\nu > 0$ if for all $x, y$:
$$\|\nabla f(x) - \nabla f(y)\| \leq L_\nu \|x - y\|^{\nu}.$$
Notice that when $\nu = 0$, Holder continuity describes nonsmooth functions with bounded gradients, i.e. $\|\nabla f(x)\| \leq L_0$ for all $x$. For $\nu = 1$, Holder continuity reduces to Lipschitz gradient continuity.
Given a convex function $F$, we denote its Moreau envelope [40, 5, 39] by $F_\mu$ and its proximal operator by $\mathrm{prox}_{\mu F}$, defined by:
$$F_\mu(x) = \min_{z} \; F(z) + \frac{1}{2\mu}\|z - x\|^2, \qquad \mathrm{prox}_{\mu F}(x) = \arg\min_{z} \; F(z) + \frac{1}{2\mu}\|z - x\|^2.$$
We recall the nonexpansiveness property of the proximal mapping [40]:
$$\|\mathrm{prox}_{\mu F}(x) - \mathrm{prox}_{\mu F}(y)\| \leq \|x - y\|. \qquad (2)$$
Basic arguments from [40, 5, 39] show that the gradient $\nabla F_\mu$ is Lipschitz continuous with constant $1/\mu$ and satisfies:
$$\nabla F_\mu(x) = \frac{1}{\mu}\left(x - \mathrm{prox}_{\mu F}(x)\right) \in \partial F(\mathrm{prox}_{\mu F}(x)). \qquad (3)$$
In the differentiable case, obviously $\nabla F_\mu(x) = \nabla F(\mathrm{prox}_{\mu F}(x))$.
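A quick numerical sanity check of identity (3) for $F = |\cdot|$, whose Moreau envelope is the Huber function; the names and the finite-difference test below are ours:

```python
def prox_abs(x, mu):
    # exact prox of F(x) = |x|: soft-thresholding
    return 0.0 if abs(x) <= mu else x - mu * (1 if x > 0 else -1)

def moreau_env(x, mu):
    z = prox_abs(x, mu)                      # F_mu(x) = |z| + (z - x)^2 / (2 mu)
    return abs(z) + (z - x) ** 2 / (2 * mu)  # equals the Huber function of x

mu, x, h = 0.5, 2.0, 1e-6
grad = (x - prox_abs(x, mu)) / mu            # formula (3)
num_grad = (moreau_env(x + h, mu) - moreau_env(x - h, mu)) / (2 * h)
print(grad, num_grad)                        # both approximately 1.0
```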
Paper structure. In section 2 we analyze how the growth properties of the objective are inherited by its Moreau envelope. The key relations on the envelope will become the basis of the complexity analysis. In section 3 we define the iteration of the inexact Proximal Point algorithm and discuss its stopping criterion. The iteration complexity is presented in section 4 for both the exact and inexact cases. Subsequently, the restarted IPPA is defined and its complexity is presented in section 5. Finally, in section 6 we quantify the complexity of IPPA in terms of proximal (sub)gradient iterations and compare it with other results. In the last section we compare our scheme with state-of-the-art restarting subgradient algorithms.
2 Holderian growth and Moreau envelopes
As discussed in the introduction, $\gamma$-HG relates tightly to widely known regularity properties such as WSM [53, 36, 37, 7, 9, 1, 8], Quadratic Growth (QG) [53, 22, 26], Error Bounds [23] and the Kurdyka–Łojasiewicz inequality [6, 53].
Next we show how the Moreau envelope of a given convex function inherits its growth properties over its entire domain, except on a certain neighborhood of the optimal set. Recall that $\min_x F_\mu(x) = \min_x F(x) = F^*$ and that $F_\mu$ and $F$ share the same optimal set $X^*$.
Lemma 2.1.
Let $F$ be a convex function and let (HG) hold. Then the Moreau envelope $F_\mu$ satisfies the relations presented below.
Let $\gamma = 1$ (WSM):
$$F_\mu(x) - F^* \geq \sigma_F \, h_{\mu\sigma_F}\big(\mathrm{dist}(x, X^*)\big),$$
where $h_\delta(t) = t^2/(2\delta)$ for $t \leq \delta$ and $h_\delta(t) = t - \delta/2$ otherwise is the Huber function.
Let $\gamma = 2$:
$$F_\mu(x) - F^* \geq \frac{\sigma_2}{1 + 2\mu\sigma_2}\,\mathrm{dist}^2(x, X^*).$$
For general $\gamma$: outside a certain neighborhood of $X^*$, the envelope satisfies an analogous $\gamma$-Holderian lower bound, with a modulus depending on $\mu$, $\sigma_\gamma$ and $\gamma$.
Proof.
The proof can be found in the Appendix. ∎
It is interesting to remark that the Moreau envelope inherits a growth landscape similar to that of $F$ outside a given neighborhood of the optimal set. For instance, under WSM, outside the tube $\{x : \mathrm{dist}(x, X^*) \geq \mu\sigma_F\}$ the Moreau envelope grows sharply. Inside of it, the envelope grows quadratically, which, unlike the objective function $F$, allows the gradient to become small near the optimal set. This separation of growth regimes suggests that first-order algorithms minimizing $F_\mu$ would reach the quadratic-growth region very fast, allowing large steps in a first phase, and subsequently would slow down in the vicinity of the optimal set.
This discussion extends to general growths with $\gamma > 1$, where a similar separation of behaviours holds for appropriate neighborhoods. Note that when $F$ has quadratic growth with constant $\sigma$, the envelope also satisfies a quadratic growth with the smaller modulus $\sigma/(1 + 2\mu\sigma)$.
Remark 1.
It will be useful in the subsequent sections to recall the connection between Holderian growth and the Holderian error bound under convexity. Observe that by a simple use of convexity in $\gamma$-HG, we obtain for all $x$ and $g_F(x) \in \partial F(x)$:
$$\sigma_\gamma \, \mathrm{dist}(x, X^*)^\gamma \leq F(x) - F^* \leq \langle g_F(x), x - \pi_{X^*}(x) \rangle \leq \|g_F(x)\| \, \mathrm{dist}(x, X^*), \qquad (4)$$
which immediately turns into the following error bound:
$$\mathrm{dist}(x, X^*)^{\gamma - 1} \leq \frac{\|g_F(x)\|}{\sigma_\gamma}. \qquad (5)$$
Under WSM, by replacing $x$ with $\mathrm{prox}_{\mu F}(x)$ in (5) and by using property (3), pointing out that $\nabla F_\mu(x) \in \partial F(\mathrm{prox}_{\mu F}(x))$, we obtain as well a similar bound on $\nabla F_\mu$ at points with nonoptimal proximal map:
$$\sigma_F \leq \|\nabla F_\mu(x)\| \quad \text{whenever } \mathrm{prox}_{\mu F}(x) \notin X^*. \qquad (6)$$
This is the traditional key relation behind the finite convergence of PPA. Under (HG), starting from Lemma 2.1, by using the convexity of $F_\mu$ and, further, by following inequalities similar to (4), another error bound is obtained, of the same form as (5) but stated for $\|\nabla F_\mu(x)\|$; we refer to it as (7).
In the following section we start the analysis of exact and inexact PPA. Aligning with old results on subgradient algorithms dating back to [36], we illustrate the robustness induced by the weak sharp minima regularity.
3 Inexact Proximal Point algorithm
The basic exact PPA iteration is shortly described as $x_{k+1} = \mathrm{prox}_{\mu F}(x_k)$.
Recall that by (3) one can express $\mathrm{prox}_{\mu F}(x_k) = x_k - \mu \nabla F_\mu(x_k)$, which makes PPA equivalent to the constant stepsize Gradient Method (GM) iteration:
$$x_{k+1} = x_k - \mu \nabla F_\mu(x_k). \qquad (8)$$
Since our reasoning below borrows simple arguments from classical GM analysis, we will further use (8) to express PPA. It is realistic not to rely on the explicit $\mathrm{prox}_{\mu F}(x_k)$, but on an approximation of it up to a fixed accuracy. By using such an approximation, one can form an approximate gradient of $F_\mu$ and interpret IPPA as an inexact Gradient Method. Let $\delta_k > 0$; then a basic approximation of $\mathrm{prox}_{\mu F}(x_k)$ is a point $z_k$ satisfying:
$$\|z_k - \mathrm{prox}_{\mu F}(x_k)\| \leq \delta_k. \qquad (9)$$
Other works such as [40, 42, 48, 49] promote similar approximation measures for inexact first-order methods. Now we present the basic IPPA scheme with constant stepsize.
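A minimal sketch of the IPPA recursion in Python; the helper inexact_prox is a hypothetical oracle returning any point satisfying (9), and the stopping test anticipates criterion (10) below:

```python
import numpy as np

def ippa(x0, inexact_prox, mu, deltas, max_iter=1000, tol=1e-8):
    """Inexact PPA sketch: x_{k+1} = z_k with ||z_k - prox_{mu F}(x_k)|| <= deltas(k).
    Stops when the computable approximate gradient norm ||x_k - z_k|| / mu
    drops below tol."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        z = inexact_prox(x, mu, deltas(k))   # oracle implementing (9)
        if np.linalg.norm(x - z) / mu <= tol:
            return z
        x = z
    return x
```

A linearly vanishing inner accuracy corresponds, for instance, to deltas=lambda k: delta0 * 0.5**k.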
There already exists a variety of relative or absolute stopping rules in the literature for the class of first-order methods [51, 42, 37, 15]. However, since bounding the norm of the gradient is commonly regarded as one of the simplest optimality measures, we use the following:
$$\frac{1}{\mu}\|x_k - z_k\| \leq \epsilon, \qquad (10)$$
which is computable by the nature of the iteration. The next result shows the relation between (10) and the distance to the optimal set.
Lemma 3.1.
Let and assume then:
Let , if , then
(11) |
Let , then:
Proof.
The proof can be found in the Appendix. ∎
If $X^*$ is a set of weakly sharp minimizers, the above lemma states that, for sufficiently small $\epsilon$ and $\delta$, one can directly guarantee that the iterate is close to $X^*$. From another viewpoint, for WSM, Lemma 3.1 also suggests that a sufficiently large $\mu$ provides an optimal solution as the output of the first IPPA iteration. A similar result stated in [9] guarantees the existence of a sufficiently large smoothing value which makes PPA converge in a single (exact) iteration.
4 Iteration complexity of IPPA
For reasons that will become clear later, note that some of the results below are given for a constant inexactness noise. The Holderian growth leads naturally to recurrences on the residual distance to the optimal set, which allow us to give direct complexity bounds.
Theorem 4.1.
Let be convex and HG hold.
Under sharp minima , let nonincreasing and assume , then:
Under quadratic growth , let then:
Let HG hold. Then the following convergence rate holds:
where and
Proof.
The proof can be found in the appendix. ∎
Note that, in general, all the abstract convergence rates of Theorem 4.1 depend on two terms: the first illustrates the reduction of the initial distance to the optimum and the second reflects the accuracy of the approximated gradient. Therefore, after a finite number (for $\gamma = 1$) or at most a logarithmic number (for $\gamma \in (1, 2]$) of IPPA iterations, the evolution of the inner accuracy becomes the main bottleneck of the convergence process. The above theorem provides an abstract insight into how fast the accuracy should decay in order for the residual to attain the best rate towards the optimal set. In short, the first part shows that for WSM a recurrent constant decrease of the residual is established as long as the noise stays below a certain threshold, while the noise need not vanish. This aspect will be discussed in more detail below. The last parts suggest that a sufficiently fast (e.g. linear) decay of the inner accuracy ensures the fastest convergence of the residual.
Several previous works such as [27, 36, 37] analyzed perturbed and incremental SGM algorithms under WSM that use noisy estimates of the subgradient. A surprising common conclusion of these works is that, under a sufficiently low persistent noise at every iteration, SGM still converges linearly to the optimal set. Although IPPA is based on similar approximate first-order information, notice that the smoothed objective $F_\mu$ does not satisfy the pure WSM property. However, our next result states, in a similar context of a small but persistent noise of bounded magnitude, that IPPA attains the target accuracy after a finite number of iterations.
Corollary 4.2.
Let and , then after
IPPA iterations, satisfies and .
Proof.
The proof can be found in the Appendix. ∎
To conclude, if the noise magnitude stays below the stated threshold, then after a finite number of iterations IPPA reaches the target distance to $X^*$. We see that under a sufficiently low persistent noise, IPPA still guarantees convergence to the optimal set, assuming the existence of an inner routine that computes each iteration. In other words, "noisy" Proximal Point algorithms share stability properties similar to the noisy Subgradient schemes from [27, 36, 37]. This discussion can be extended to a general decreasing inner accuracy.
We show next that Theorem 4.1 covers well-known results on exact PPA.
Corollary 4.3.
Let the sequence be generated by exact PPA, i.e. with zero inexactness. An $\epsilon$-suboptimal iterate, i.e. one satisfying $\mathrm{dist}(x_k, X^*) \leq \epsilon$, is attained after the following number of iterations:
Proof.
The proof can be found in the Appendix. ∎
The finite convergence of the exact PPA under WSM dates back to [9, 7, 1, 4]. Since PPA is simply a gradient descent iteration, its iteration complexity under QG shares the typical dependence on the conditioning number. The Holder growth case behaves similarly to the sharp minima case: fast convergence outside a certain neighborhood around the optimum.
A simple argument on the tightness of the bounds for this case can be found in [2]. Indeed, the Douglas–Rachford, PPA and Alternating Projections algorithms were analyzed in [2] for particular univariate functions. The authors proved that PPA requires a matching number of iterations to minimize a particular objective up to an $\epsilon$ tolerance.
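For intuition on the sharp end of the growth spectrum, exact PPA applied to the univariate sharp function $F(x) = |x|$ terminates finitely, since each prox step moves a constant distance $\mu$ towards the minimizer. A toy sketch:

```python
def prox_abs(x, mu):
    # exact prox of F(x) = |x|: soft-thresholding
    return 0.0 if abs(x) <= mu else x - mu * (1 if x > 0 else -1)

x, mu, traj = 5.0, 1.0, [5.0]
while x != 0.0:
    x = prox_abs(x, mu)
    traj.append(x)
print(traj)  # [5.0, 4.0, 3.0, 2.0, 1.0, 0.0] -- finite termination
```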
Corollary 4.4.
Under the assumptions of Corollary 4.3, recall the notation for the exact case. The number of iterations performed by IPPA in order to attain an $\epsilon$-suboptimal point is:
Proof.
The proof can be found in the Appendix. ∎
Overall, if the noise vanishes sufficiently fast (e.g. a linear decay in the case $\gamma \leq 2$) then the complexity of IPPA has the same order as the one of PPA. However, a fast vanishing noise requires an efficient inner routine that computes the iterate $z_k$. Particularly, when the objective is nonsmooth and convex, a simple choice of the inner routine is the Subgradient Method (SM). The efficiency of SM for minimizing a strongly convex function up to a given accuracy in function value is inversely proportional to that accuracy [17]. By using the quadratic growth guaranteed by strong convexity, in order to approach a minimizer up to a given distance, SM requires a number of iterations inversely proportional to the squared distance. According to these bounds, the cost of computing a single IPPA iteration translates into a corresponding number of subgradient calls and, therefore, a direct naive counting of the total number of subgradient evaluations over the outer iterations yields a total estimate significantly larger than the known optimal bound for the SM algorithm. However, we further give a restarted variant of IPPA that overcomes this issue.
5 Restarted Inexact Proximal Point Algorithm
The Restarted IPPA (RIPPA) consists of a simple recursive call of IPPA combined with a multiplicative decrease of its parameters. Observe that RIPPA is completely independent of the problem constants.
As usual in the context of restarting, we call each run of IPPA an epoch. The stopping criterion can optionally be based on a fixed number of epochs or on the reduction of the gradient norm (10).
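A minimal sketch of the restart logic, reusing the ippa sketch from section 3; the multiplicative factor rho and the choice of which parameters shrink between epochs are illustrative, not the paper's prescribed schedule:

```python
def rippa(x0, inexact_prox, mu0, delta0, eps0, rho=0.5, epochs=10):
    """Restarted IPPA sketch: epoch t runs IPPA with inner accuracy delta_t
    until the test ||x - z|| / mu <= eps_t holds; parameters then shrink by rho."""
    x, mu, delta, eps = x0, mu0, delta0, eps0
    for t in range(epochs):
        x = ippa(x, inexact_prox, mu, deltas=lambda k: delta, tol=eps)
        delta *= rho   # tighter inner accuracy for the next epoch
        eps *= rho     # tighter stopping threshold for the next epoch
    return x
```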
Theorem 5.1.
Let be positive constants and . Then the sequence generated by attains after a number of iterations.
Let and assume and , then
where In particular, if then reaches the suboptimality within iterations.
Let , then
Let , then:
Otherwise, for
Proof.
The proof can be found in the Appendix. ∎
Remark 2.
Notice that for any , logarithmic complexity is obtained. When , the above estimate is shortened as
where the constants are as above. In particular, under a suitable parameter choice, all epochs reduce to length one and the total number of IPPA iterations reduces to the same order as in the exact case:
6 Inner Proximal Subgradient Method routine
Although the influence of growth modulus on the behaviour of IPPA is obvious, all complexity estimates derived in the previous sections assume the existence of an oracle computing an approximate proximal mapping:
$$\text{find } z \ \text{ such that } \ \|z - \mathrm{prox}_{\mu F}(x)\| \leq \delta. \qquad (12)$$
Therefore, in the rest of this section we focus on the solution of the following inner minimization problem:
$$\min_{z} \; f(z) + \psi(z) + \frac{1}{2\mu}\|z - x\|^2.$$
In most situations, despite the regularization term, this problem is not trivial and one should select an appropriate routine that computes an approximate solution. For instance, variance-reduced or accelerated proximal first-order methods were employed as inner routines in [20, 21, 22, 23] and theoretical guarantees were provided. Also, a Conjugate Gradient based Newton method was used in [51] under a twice-differentiability assumption. We limit our analysis to gradient-type routines and leave other accelerated or higher-order methods, which typically improve the performance of their classical counterparts, for future work.
In this section we evaluate the computational complexity of IPPA in terms of the number of proximal gradient iterations. The basic routine for solving (12), which we analyze below, is the Proximal (sub)Gradient Method (PsGM). Notice that when $f$ is nonsmooth with bounded subgradients, we consider only the case when $\psi$ is the indicator function of a simple, closed convex set. In this situation, PsGM becomes a simple projected subgradient scheme with constant stepsize that solves (12).
Through a natural combination of the outer guiding IPPA iteration and the inner PsGM scheme, we further derive the total complexity of the proximal-based restarted Subgradient Method in terms of subgradient oracle calls.
Theorem 6.1.
Let the assumptions of Theorem 8.8 hold; then the following assertions hold:
Let be generated as follows:
For , the final output satisfies after PsGM iterations.
Moreover, if has Holder continuous gradients with constant and , assume and . If , then the output satisfies within a total cost of:
PsGM iterations.
Proof.
The proof can be found in the Appendix. ∎
Remark 3.
Although the bound on the number of epochs in RIPP-PsGM depends somewhat on the problem constants, a sufficiently high value ensures that the result holds. To investigate some important particular bounds hidden in the above complexity estimates (of Theorem 6.1), we analyze different choices of the input parameters, which are synthesized in Table 1.
Assume is known, and denote
(13) | ||||
(14) |
Under knowledge of , by setting then and (13) becomes . When , a sufficiently large simplifies (14) into . Given any and , similarly the estimate (14) reduces to .
In the particular smooth case , bounds (13)-(14) become:
For high values of , the first one becomes . Also the second one reduces to when .
In the bounded gradients case the estimates reduce to
(15) | ||||
(16) |
First observe that when the main parameters are known, we recover the same iteration complexity, in terms of the number of subgradient evaluations, as in the literature [53, 16, 41, 10]. The last estimate (16) holds when RIPP-PsGM performs a sufficiently high number of epochs, under no availability of the problem parameters.
7 Numerical simulations
In the following section we evaluate RIPP-PsGM by applying it to real-world applications often found in machine learning tasks. The algorithm and its applications are public and available online at https://github.com/pirofti/IPPA.
Unless stated otherwise, we perform enough epochs (restarts) until the objective is within a prescribed proximity of the CVX-computed optimum. The current objective value is computed within each inner PsGM iteration. All the models under consideration satisfy the WSM property and therefore the implementation of PsGM reduces to the scheme of the (projected) Subgradient Method.
We would like to thank the authors of the methods we compare with for providing their code for reproducing the experiments. No modifications were performed by us on the algorithms or their specific parameters. Following their implementation, and as is common in the literature, in our reports we also use the minimum error obtained over the iterations thus far.
All our experiments were implemented in Python 3.10.2 under ArchLinux (Linux version 5.16.15-arch1-1) and executed on an AMD Ryzen Threadripper PRO 3955WX with 16-Cores and 512GB of system memory.
7.1 Robust Least Squares
We start out with the least-squares (LS) problem in the $\ell_1$ setting. This form deviates from standard LS by imposing an $\ell_1$-norm on the objective and by constraining the solution sparsity through the parameter $\tau$ on its $\ell_1$-norm:
$$\min_{x} \; \|Ax - b\|_1 \quad \text{s.t.} \quad \|x\|_1 \leq \tau.$$
Our goal is to analyze the effect of the data dimensions and of the problem and IPPA parameters on the total number of iterations.
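For concreteness, a small sketch of the pieces PsGM needs for this model: the objective value and a subgradient of the $\ell_1$ residual. The data and names are ours; the $\ell_1$-ball projection used in the constrained case is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)

def obj(x):
    return np.abs(A @ x - b).sum()     # ||Ax - b||_1

def subgrad(x):
    return A.T @ np.sign(A @ x - b)    # a subgradient of ||Ax - b||_1
```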
[Figure 1: total iteration counts for the $\ell_1$-LS experiments described below, across three panels.]
The first experiment from Figure 1 investigates the effect of one IPPA parameter on the unconstrained $\ell_1$-LS formulation on a small problem. In our experiment we run 9 epochs and vary this parameter from 1.005 to 1.1 in steps of 0.005. In the second panel of Figure 1, we repeat the same experiment with this parameter now fixed, but with varied problem dimensions starting from 10 up to 200 in increments of 5, where we set both dimensions equal. Finally, in the last panel of Figure 1, we study the effect of the problem-specific parameter $\tau$ on the total number of iterations. Although the effects are mild in the beginning, we can see a sudden burst past a certain threshold. Please note that this is specific to $\ell_1$-LS and not to IPPA in general, as we will see in the following section.
7.2 Graph Support Vector Machines
Graph SVM adds a graph-guided lasso regularization to the standard SVM averaged hinge-loss objective and extends the $\ell_1$-SVM formulation through a factorization involving the weighted graph adjacency matrix, i.e.
(17)
where $x_i$ and $y_i$ are the $i$-th data point and its label, respectively. When the graph factor reduces to the identity, the Regularized Sparse $\ell_1$-SVM formulation is recovered.
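Since the body of (17) did not survive here, the sketch below uses a generic graph-guided hinge-loss model of the same flavour: an averaged hinge loss plus an $\ell_1$ penalty on graph-factored weights. The matrix G stands in for the graph-derived factor in (17) and all names are ours:

```python
import numpy as np

def graph_svm_obj(w, X, y, G, lam):
    """Averaged hinge loss plus graph-guided lasso term lam * ||G w||_1."""
    margins = 1.0 - y * (X @ w)
    return np.maximum(margins, 0.0).mean() + lam * np.abs(G @ w).sum()

def graph_svm_subgrad(w, X, y, G, lam):
    margins = 1.0 - y * (X @ w)
    active = (margins > 0).astype(float)   # points where the hinge is active
    return -(X.T @ (active * y)) / len(y) + lam * (G.T @ np.sign(G @ w))
```

With G set to the identity this reduces to the $\ell_1$-SVM case mentioned above.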
[Figure 2: RIPP-PsGM versus R2SG on synthetic data (panels a-b), on the 20newsgroups data set (panels c-d), and in the $\ell_1$-SVM setting (panels e-f).]
In Figure 2 we present multiple experiments on different data and across different parametrizations. Figures 2 (a) and (b) compare RIPP-PsGM with R2SG from [53] on synthetic random data from the standard normal distribution. The same initialization and starting point were used for all methods. We use synthetic measurements with the stated initial parameters, which we execute for 15 epochs.
We repeat the experiment in Figures 2 (c) and (d), but this time on real data from the 20newsgroups data set (https://cs.nyu.edu/~roweis/data.html), following the experiment from [53]. Here we find a similar behaviour for both methods as in the synthetic case. In Figures 2 (e) and (f) we repeat the experiment with the graph factor set to the identity in (17), thus recovering the regularized $\ell_1$-SVM formulation. We notice that RIPP-PsGM maintains its position ahead of R2SG and that the error drop is similar to the 20newsgroups experiment.
[Figure 3: effect of the data dimensions and problem parameters on the total number of inner iterations for Graph SVM.]
Furthermore, Figure 3 rehashes the experiments from the $\ell_1$-LS section in order to study the effect of the data dimensions and of the problem parameters on the total number of required inner iterations. The results for the regularization parameter and the data dimensions are as expected: as they grow, they almost linearly increase the iteration numbers. For the Graph SVM specific parameter, we find the results opposite to those of $\ell_1$-LS; it is harder to solve the problem when this parameter is small.
[Figure 4: comparison of RIPP-PsGM with DS2SG and R2SG on the Sparse SVM experiment.]
In order to compare with DS2SG [16] and R2SG, we follow the Sparse SVM experiment from [16], which removes the regularization from (17) and instead adds it as the constraint $\|x\|_1 \leq \tau$. We used the stated parameters and plotted the results in Figure 4. Please note that the starting point is the same, with a quick initial drop for all three methods, as can be seen in Figure 4 (a). Execution times were almost identical and took around 0.15s. To investigate the differences in convergence behaviour, we zoom in after the first few iterations in the four panels of Figure 4 (b). In the first panel we show the curves for all methods together and in the other three we show the individual curves for each. Our experiments showed that this is the general behaviour: DS2SG has a slightly sharper drop, with RIPP-PsGM following closely with a staircase behaviour, while R2SG takes a few more iterations to reach convergence. For smaller values of $\tau$ we found that RIPP-PsGM reaches the solution within precision in just 1–5 iterations, while the others lag behind for a few hundred iterations.
7.3 Matrix Completion for Movie Recommendation
In this last section, the problem of matrix completion is applied to the standard movie recommendation challenge, which recovers a full user-rating matrix $X$ from the partial observations $Y_{ij}$, $(i,j) \in \Omega$, corresponding to the known user-movie rating pairs, where $\Omega$ is the set of user-movie pairs with known ratings. Solving this completes the matrix $X$ based on the known sparse matrix $Y$ while maintaining a low rank.
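The low-rank requirement is commonly enforced through a nuclear-norm term, whose proximal mapping is singular-value soft-thresholding. Whether exactly this term appears in the formulation is not recoverable here, so treat the sketch below as purely illustrative:

```python
import numpy as np

def prox_nuclear(X, mu):
    """Prox of mu * ||.||_* : soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - mu, 0.0)) @ Vt
```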
References
- [1] Antipin, A.: On finite convergence of processes to a sharp minimum and to a smooth minimum with a sharp derivative. Differential Equations 30(11), 1703–1713 (1994)
- [2] Bauschke, H.H., Dao, M.N., Noll, D., Phan, H.M.: Proximal point algorithm, Douglas–Rachford algorithm and alternating projections: a case study. Journal of Convex Analysis 23(1), 237–261 (2015)
- [3] Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1), 183–202 (2009)
- [4] Bertsekas, D.P.: Constrained optimization and Lagrange multiplier methods. Athena Scientific (1982)
- [5] Bertsekas, D.P.: Parallel and distributed computation: numerical methods, vol. 23. Prentice hall Englewood Cliffs, NJ (1989)
- [6] Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming 165(2), 471–507 (2017)
- [7] Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM Journal on Control and Optimization 31(5), 1340–1359 (1993)
- [8] Davis, D., Drusvyatskiy, D., MacPhee, K.J., Paquette, C.: Subgradient methods for sharp weakly convex functions. Journal of Optimization Theory and Applications 179(3), 962–982 (2018)
- [9] Ferris, M.C.: Finite termination of the proximal point algorithm. Mathematical Programming 50(1), 359–366 (1991)
- [10] Freund, R.M., Lu, H.: New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure. Mathematical Programming 170(2), 445–477 (2018)
- [11] Gilpin, A., Pena, J., Sandholm, T.: First-order algorithm with $\mathcal{O}(\ln(1/\epsilon))$ convergence for $\epsilon$-equilibrium in two-person zero-sum games. Mathematical Programming 133(1), 279–298 (2012)
- [12] Güler, O.: On the convergence of the proximal point algorithm for convex minimization. SIAM journal on control and optimization 29(2), 403–419 (1991)
- [13] Güler, O.: New proximal point algorithms for convex minimization. SIAM Journal on Optimization 2(4), 649–664 (1992)
- [14] Hu, Y., Li, C., Yang, X.: On convergence rates of linearized proximal algorithms for convex composite optimization with applications. SIAM Journal on Optimization 26(2), 1207–1235 (2016)
- [15] Humes, C., Silva, P.J.: Inexact proximal point algorithms and descent methods in optimization. Optimization and Engineering 6(2), 257–271 (2005)
- [16] Johnstone, P.R., Moulin, P.: Faster subgradient methods for functions with hölderian growth. Mathematical Programming 180(1), 417–450 (2020)
- [17] Juditsky, A., Nesterov, Y.: Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems 4(1), 44–80 (2014)
- [18] Kort, B.W., Bertsekas, D.P.: Combined primal–dual and penalty methods for convex programming. SIAM Journal on Control and Optimization 14(2), 268–294 (1976)
- [19] Li, G., Mordukhovich, B.S.: Holder metric subregularity with applications to proximal point method. SIAM Journal on Optimization 22(4), 1655–1684 (2012)
- [20] Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. Advances in neural information processing systems 28 (2015)
- [21] Lin, H., Mairal, J., Harchaoui, Z.: Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research 18(1), 7854–7907 (2018)
- [22] Lu, M., Qu, Z.: An adaptive proximal point algorithm framework and application to large-scale optimization. arXiv preprint arXiv:2008.08784 (2020)
- [23] Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research 46(1), 157–178 (1993)
- [24] Mairal, J.: Cyanure: An open-source toolbox for empirical risk minimization for python, c++, and soon more. arXiv preprint arXiv:1912.08165 (2019)
- [25] Monteiro, R.D., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization 23(2), 1092–1125 (2013)
- [26] Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming 175(1), 69–107 (2019)
- [27] Nedić, A., Bertsekas, D.P.: The effect of deterministic noise in subgradient methods. Mathematical programming 125(1), 75–99 (2010)
- [28] Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1), 229–251 (2004)
- [29] Nemirovskii, A., Nesterov, Y.: Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics 25(2), 21–30 (1985). https://doi.org/10.1016/0041-5553(85)90100-4. URL https://www.sciencedirect.com/science/article/pii/0041555385901004
- [30] Nesterov, Y.: How to make the gradients small. Optima 88 (2012)
- [31] Nesterov, Y.: Inexact accelerated high-order proximal-point methods. Mathematical Programming pp. 1–26 (2021)
- [32] Nesterov, Y.: Inexact high-order proximal-point methods with auxiliary search procedure. SIAM Journal on Optimization 31(4), 2807–2828 (2021)
- [33] Patrascu, A., Irofti, P.: On finite termination of an inexact proximal point algorithm. Applied Mathematics Letters 134, 108348 (2022). https://doi.org/10.1016/j.aml.2022.108348
- [34] Polyak, B.: A general method of solving extremal problems. Math. Doklady 8, 593–597 (1967)
- [35] Polyak, B.: Minimization of unsmooth functionals. U.S.S.R. Computational Mathematics and Mathematical Physics 9, 509–521 (1969)
- [36] Polyak, B.: Nonlinear programming methods in the presence of noise. Mathematical Programming 14, 87–97 (1978)
- [37] Polyak, B.: Introduction to optimization. Optimization Software, Inc., Publications Division, New York (1987)
- [38] Renegar, J.: Efficient first-order methods for linear programming and semidefinite programming. arXiv preprint arXiv:1409.5832 (2014)
- [39] Rockafellar, R.T.: Augmented lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of operations research 1(2), 97–116 (1976)
- [40] Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM journal on control and optimization 14(5), 877–898 (1976)
- [41] Roulet, V., d’Aspremont, A.: Sharpness, restart, and acceleration. SIAM Journal on Optimization 30(1), 262–289 (2020)
- [42] Salzo, S., Villa, S.: Inexact and accelerated proximal point algorithms. Journal of Convex analysis 19(4), 1167–1192 (2012)
- [43] Shor, N.: An application of the method of gradient descent to the solution of the network transportation problem. Materialy Naucnovo Seminara po Teoret i Priklad. Voprosam Kibernet. i Issted. Operacii, Naucnyi Sov. po Kibernet, Akad. Nauk Ukrain. SSSR 1, 9–17 (1962)
- [44] Shor, N.: On the structure of algorithms for numerical solution of problems of optimal planning and design. Diss. Doctor Philos, Kiev (1964)
- [45] Shulgin, E., Gasnikov, A., Matyukhin, V.: Adaptive catalyst for smooth convex optimization. In: Optimization and Applications: 12th International Conference, OPTIMA 2021, Petrovac, Montenegro, September 27–October 1, 2021, Proceedings, vol. 13078, p. 20. Springer Nature (2021)
- [46] Solodov, M.V., Svaiter, B.F.: A hybrid approximate extragradient–proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis 7(4), 323–345 (1999)
- [47] Solodov, M.V., Svaiter, B.F.: A hybrid projection-proximal point algorithm. Journal of convex analysis 6(1), 59–70 (1999)
- [48] Solodov, M.V., Svaiter, B.F.: Error bounds for proximal point subproblems and associated inexact proximal point algorithms. Mathematical programming 88(2), 371–389 (2000)
- [49] Solodov, M.V., Svaiter, B.F.: A unified framework for some inexact proximal point algorithms. Numerical functional analysis and optimization 22(7-8), 1013–1035 (2001)
- [50] Solodov, M.V., Svaiter, B.F.: A unified framework for some inexact proximal point algorithms. Numerical functional analysis and optimization 22(7-8), 1013–1035 (2001)
- [51] Tomioka, R., Suzuki, T., Sugiyama, M.: Super-linear convergence of dual augmented lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research 12(5) (2011)
- [52] Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24(4), 2057–2075 (2014)
- [53] Yang, T., Lin, Q.: Rsg: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research 19(1), 236–268 (2018)
- [54] Yankelevsky, Y., Elad, M.: Dual graph regularized dictionary learning. IEEE Transactions on Signal and Information Processing over Networks 2(4), 611–624 (2016)
8 Appendix
Proof of Lemma 2.1.
By using -HG, we get:
(18) |
The solution of (18) in , denoted as satisfies the following optimality condition: , which simply implies that
(19) |
For a function with sharp minima () it is easy to see that . By taking norm in both sides then: and (19) becomes:
By replacing this form of into (18) we obtain our first result:
Proof of Lemma 3.1.
Assume the stated condition. Observe that, on one hand, the triangle inequality gives the first bound. On the other hand, one can derive:
(20) |
Now let the proximal point be nonoptimal. Based on the bound (6), its nonoptimality guarantees the lower bound $\sigma_F \leq \|\nabla F_\mu(x)\|$. Therefore, if the approximate gradient norm were small enough, the contradiction with this lower bound imposes optimality of the proximal point; moreover, in this case the distance to the optimal set is small as well. In summary, a sufficiently small approximate gradient norm also confirms a small distance to the optimal set $X^*$.
Proof of Corollary 4.2.
Proof of Corollary 4.3.
The proofs of the first two estimates are immediately derived from Theorem 4.1. For the third case, we considered the exact setting into Corollary 8.6 and obtained an estimate for our exact case. To refine the complexity order, we majorized some constants. For the last case, we replaced the same quantities as before into Corollary 8.6 to get the last estimate. ∎
Proof of Corollary 4.4.
The proofs of the first two estimates can be derived immediately from Theorem 4.1. We provide details for the other two cases.
For we use the same notations as in the proof of Theorem 4.1 (given in the appendix). There, the key functions which decide the decrease rate of are the nondecreasing function and accuracy . First recall that
(21) |
which implies that for any
(22) |
Recalling that , then by Theorem 4.1 we have:
(23) |
By taking then . By this recurrence, the monotonicity of and , we derive:
Finally this key bound enters into (23) and we get:
where for the last equality we used the fact that, since is nondecreasing, is monotonically nondecreasing. Finally, by applying Theorem 8.5 we get our result. Now let . By redefining as in Theorem 4.1, observe that
(24) |
Take then
We have shown in the proof of Theorem 4.1 that also this variant of is nondecreasing and thus, using the same reasoning as in the case we obtain the above result. ∎
Proof of Theorem 5.1.
In this proof we denote the number of iterations in each epoch, taken large enough for the stopping criterion to be satisfied, and we index the IPPA iterates within each epoch accordingly. From Theorem 8.3 (from Appendix B) we have that
(25) |
iterations are sufficient to guarantee and thus the end of th epoch. Furthermore, by the triangle inequality
(26) |
which implies that at the end of the epoch we also have the corresponding distance bound.
Let WSM hold and recall the standing assumption. For sufficiently large epoch counters we show that the restart loses any effect and after a single iteration the stopping criterion of the epoch is satisfied. We separate the analysis into two stages: the first stage covers the epochs that produce iterates with gradient norms above a certain threshold; the second one covers the rest of the epochs, when the gradient norms decrease below that threshold.
In the first stage, the stopping rule limits the stage to a maximal number of epochs. The total number of iterations in this stage is bounded by:
For the second stage when , Lemma 3.1 states that and thus we have: Therefore, by Theorem 4.1:
which means that after a single iteration the stopping criterion is guaranteed. In this phase, the output of IPPA is in fact the only point produced in the epoch and the necessary number of epochs (or, equivalently, of IPPA iterations) follows.
For the next growth regime, Lemma 2.1 and the convexity of the envelope yield the corresponding error bound. By using the inequality formed by the first and last terms, together with relation (26), at the end of each epoch:
we obtain a bound suggesting the necessary number of epochs. This fact allows us to refine the epoch length in (25). Since this quantity is bounded, the total number of IPPA iterations has the stated order.
For the remaining growth regime, similarly to the previous two cases, the guarantee obtained at the end of each epoch further implies a distance bound, which suggests the maximal number of epochs.
Now, of (25) becomes: for all where . For , is bounded, thus for we further estimate the total number of IPPA iterations by summing:
(27) | ||||
Finally, .
∎
Proof of Theorem 6.1.
We keep the same notations as in the proof of Theorem 5.1. By assumption, the initial distance bound holds. Since the inner accuracies decrease, the corresponding inequality recursively holds for all epochs. This last inequality allows Theorem 8.8 to establish that at each epoch it suffices to perform:
PsGM iterations. Lastly, we compute the total computational cost by summing over all epochs. Recall that at the end of each epoch RIPPA guarantees the corresponding gradient-norm bound. After the stated number of epochs, measuring a total number of:
PsGM iterations, one obtains an iterate satisfying the first-stage guarantee. Now we evaluate the final postprocessing loop, which aims to bring the iterate into the $\epsilon$-suboptimality region. By Theorem 8.8, a single call of the routine guarantees a constant decrease; in general, Theorem 8.8 specifies that after sufficiently many routine iterations the output satisfies the required accuracy. Therefore, by running the final Postprocessing procedure, the "second phase" loop produces the desired point, and its total cost can be computed by summing the per-call costs. Lastly, by taking into account the previous estimates, the total complexity has the stated order, which confirms the first part of the above result.
Now we prove the second part of the result. By assumption we observe that:
(28) |
Further we show that, for appropriate stepsize choices , the inequality recursively holds for all under initial condition (28). Indeed let , then
(29) |
The inequality (29) allows Theorem 8.8 to establish the necessary inner complexity for each IPPA iteration. By using the bounds from Theorem 8.8, at each epoch it suffices to perform:
PsGM iterations. We still keep the same notations from Theorem 5.1.
Let and recall By following a similar reasoning, we require:
Let , then the exponent of the last term becomes:
Otherwise, if then the respective exponent turns into:
∎
8.1 Appendix B
Lemma 8.1.
Let the sequence satisfy: where nonincreasing and . Then the following bound holds:
Moreover, if then:
Proof of Lemma 8.1.
By using a simple induction we get:
∎
Lemma 8.2.
Let be real numbers, then the following relation holds:
Proof.
Since we have:
then it is sufficient to show that for any positive :
(30) |
Indeed, in the first case (30) follows straightforwardly. In the opposite case, by monotonicity we have the stated inequality for all indices, and the bound is then obvious, which confirms the final result. ∎
Theorem 8.3.
Let the sequence be generated by IPPA with inexactness criterion (9); then the following relations hold:
Moreover, assume constant accuracy . Then after at most:
iterations, a point satisfies and .
Proof.
By convexity of , for any we derive:
(31) |
In order to obtain the desired bound, by the triangle inequality we simply derive:
(32) |
Finally, by taking , then:
(33) | |||
where in the last inequality we used the previous fact. The last inequality leads to the first part of our result:
(34) |
Assume that
(35) |
and denote . Then (34) has two consequences. First, obviously for all :
By further summing over the history we obtain:
(36) |
Second, since is the first iteration at which the residual optimal distance increases, then and (34) guarantees:
By unifying both cases we conclude that after at most the stated number of iterations the threshold is reached. Notice that if (35) does not hold, then the claim is immediate.
Now we use the same arguments as in [30, Sec. I] to bound the norm of the gradients. Observe that the Lipschitz gradient property of the envelope leads to:
(37) |
By using (8.1) into (37), then for
∎
Lemma 8.4.
Let (HG) hold for the objective function. Then the IPPA sequence with variable accuracies satisfies the following recurrences:
Under weak sharp minima
Under quadratic growth
Under general Holderian growth
Proof.
By using the same relations in the case , then:
Theorem 8.5.
Let and . Then the sequence satisfies:
For :
(38) |
For :
(39) |
where .
Proof.
Denote .
Consider the first case. Noting that the corresponding function is nondecreasing, and thus the composed quantity is also nondecreasing, we have:
Observe that if then, by using the monotonicity of , we can further derive another bound:
Any given sequence satisfying the recurrence can be further bounded as: Thus, by applying similar arguments to our sequence we refine the above bound as follows:
Now consider . In this case, on one hand, the function is nondecreasing only on . On the other hand, for it is easy to see that . These two observations lead to:
where . Since is nondecreasing, then . In order to determine the clear convergence rate of , based on [36, Lemma 6, Section 2.2] we make a last observation:
(40) |
Using this final bound, we are able to deduce the explicit convergence rate:
∎
Corollary 8.6.
Under the assumptions of Theorem 8.5, let and . The sequence attains the threshold after the following number of iterations:
For :
(41) |
For :
(42) |
where .
Proof.
In the first case, during the first regime of (38), at most the following number of iterations is necessary:
(43) |
iterations, while the second regime, i.e. , has a length of at most:
(44) |
iterations to reach the threshold. An upper bound on the total number of iterations is the sum of the two.
In the second case, similarly, the first regime has a maximal length of: The second regime requires at most: iterations to reach the threshold. ∎
Lemma 8.7.
Let . Let the sequence satisfy the recurrence:
For , let , then:
For , let , then:
where .
Proof.
Starting from the recurrence we get:
If , or equivalently , then we recover the recurrence:
(45) |
Otherwise, clearly
(46) |
By combining both bounds (45) and (46), we obtain:
(47) |
Denote the corresponding quantities. In the first case, since both functions are nondecreasing, their composition is nondecreasing. This fact allows us to apply the following induction to (47):
(48) |
In the second case, the corresponding recurrence function is again nondecreasing. Indeed, here the function is nondecreasing only on part of the domain; beyond it, however, the majorant is also nondecreasing. Thus we get our claim. The monotonicity and the majorization allow us to obtain, by a similar induction, an analogous relation to (48). ∎
Proof of Theorem 4.1.
Denote . Since , then by rolling the recurrence in Lemma 8.4 we get:
(49) |
First consider and let . Then by Lemmas 8.4 and 8.7, we have that:
(50) |
where . Let some and . Then, since is nondecreasing, we get:
Finally, by using the convergence rate upper bounds from Theorem 8.5, we can further derive the convergence rate order of the residual. We can appeal to a similar argument in the other case, by using the corresponding nondecreasing function instead.
∎
Theorem 8.8.
Let the function have Holder continuous gradients with the stated constant and exponent. Also let the parameters satisfy
(51) |
then PsGM() outputs such that .
Moreover, assume particularly that and satisfies WSM with constant . Also let , be a closed convex feasible set and its indicator function. If
(52) |
then PsGM() outputs satisfying .
Proof.
For brevity we omit the iteration counter and use simplified notation. Recall the optimality condition:
(53) |
By using Holder continuity we get:
(54) |
Then the following recurrence holds:
(55) | |||
Obviously, a small stepsize yields . If the squared residual is dominant, i.e.
(56) |
then:
(57) |
By (56), this linear decrease of the residual stops when the domination condition fails, which occurs after at most the stated number of PsGM iterations.
To show the second part of our result, we recall that the first assumption of (52) ensures the required initialization. By using (55) with the chosen subgradient, the following recurrence is obtained:
(58) |
where and in the last inequality we used . If , then as long as , (58) turns into:
(59) |
To unify both cases, we further express the recurrence as:
(60) |
which confirms our above result.
∎
Remark 4.
As the above theorem states, when the sequence is sufficiently close to the solution set, computing the next iterate necessitates a number of PsGM iterations depending on the current distance. In other words, the estimate from (52) can be further reduced through a good initialization or restarting technique. Such a restart, within the neighborhood around the optimum, is exploited by Algorithm 4 below.