Email: hyk@hebut.edu.cn

Hou-Duo Qi
Department of Data Science and Artificial Intelligence and Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
Email: houduo.qi@polyu.edu.hk
Analytic analysis of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize
Abstract
We give a novel analytic analysis of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize, respectively, which previously could only be established by computer-assisted proofs. Our analysis is based on studying the linear convergence of a family of gradient methods whose stepsizes include the one determined by exact line search and the Polyak stepsize as special instances. The asymptotic behavior of the considered family is also investigated, which shows that the gradient method with the Polyak stepsize will zigzag in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian.
Keywords: Gradient methods · Exact line search · Polyak stepsize · Worst-case complexity

MSC: 90C25 · 90C30

1 Introduction
We focus on the gradient method for solving the unconstrained optimization problem
\min_{x \in \mathbb{R}^n} f(x), \qquad (1)
where $f:\mathbb{R}^n \to \mathbb{R}$ is a real-valued, convex, and continuously differentiable function. As is well known, the gradient method updates iterates by
x_{k+1} = x_k - \alpha_k g_k, \qquad (2)
where $g_k = \nabla f(x_k)$ and $\alpha_k > 0$ is the stepsize. The classic choice of stepsize is due to Cauchy cauchy1847methode , who suggested computing it by exact line search, i.e.,
\alpha_k = \arg\min_{\alpha > 0} f(x_k - \alpha g_k), \qquad (3)
which, together with (2), is often referred to as the steepest descent (SD) method. Many stepsizes have been designed to improve the efficiency of gradient methods, such as the Barzilai–Borwein stepsize barzilai1988two , the Polyak stepsize polyak1969minimization ; polyak1987introduction , and the Yuan stepsize yuan2006new . Due to its good performance on machine learning problems, there has been a surge of interest in the Polyak stepsize jiang2024adaptive ; loizou2021stochastic ; wang2023generalized , which minimizes an upper bound on the distance from the new iterate to the optimal solution and has the form
\alpha_k = \gamma\,\frac{f(x_k) - f^*}{\|g_k\|^2}, \qquad (4)
where $\gamma > 0$ and $f^*$ is the optimal function value. Convergence of the gradient method using the Polyak stepsize has been studied in different works for the case $\gamma = 2$ or $\gamma = 1$ barre2020complexity ; hazan2019revisiting .
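To make these stepsize rules concrete, the following minimal Python sketch (illustrative only and not taken from the paper; the quadratic test problem, the starting point, the iteration budget, and the choice $\gamma = 1$ are assumptions) runs the iteration (2) on a small strongly convex quadratic with the exact line search stepsize (3) and the Polyak stepsize (4), where the known optimal value is used to form the Polyak step.

```python
import numpy as np

def gradient_method(A, b, x0, stepsize, iters=100):
    """Iteration (2), x_{k+1} = x_k - alpha_k g_k, for f(x) = 0.5 x^T A x - b^T x."""
    x = x0.copy()
    for _ in range(iters):
        g = A @ x - b                       # gradient of the quadratic
        if np.linalg.norm(g) < 1e-12:
            break
        x = x - stepsize(x, g) * g
    return x

# Small strongly convex quadratic; all data below are illustrative assumptions.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)
f = lambda x: 0.5 * x @ A @ x - b @ x
f_star = f(x_star)

exact_ls = lambda x, g: (g @ g) / (g @ A @ g)      # stepsize (3) on a quadratic
polyak = lambda x, g: (f(x) - f_star) / (g @ g)    # stepsize (4) with gamma = 1

for name, rule in [("exact line search", exact_ls), ("Polyak (gamma = 1)", polyak)]:
    x_final = gradient_method(A, b, np.array([3.0, -1.0]), rule)
    print(name, "final f(x) - f* =", f(x_final) - f_star)
```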
In recent years, analyzing the worst-case complexities of gradient-based methods has attracted much attention cartis2010complexity ; das2024branch ; de2020worst ; drori2014performance ; gu2020tight ; lessard2016analysis ; taylor2017exact ; teboulle2023elementary ; yuan2010short . In particular, Drori and Teboulle drori2014performance formulate the worst-case complexity of a given algorithm as the optimal value of the so-called performance estimation problem (PEP). By making some relaxations to PEP, a bound on the worst-case complexity can be obtained by solving a convex semidefinite program (SDP). Following drori2014performance , the worst-case complexities of various popular methods have been analyzed, including the gradient method, Nesterov's fast gradient method, the inexact gradient method, and the proximal point algorithm; see de2020worst ; gu2020tight ; taylor2017exact and references therein.
Although the PEP framework is a powerful tool for studying algorithm complexity, as pointed out by Teboulle and Vaisbourd teboulle2023elementary , it lacks an intuitive explanation of how the worst-case complexity is derived: (i) one often has to employ a computer-assisted proof, since the aforementioned SDP usually has no closed-form solution and must be solved numerically; (ii) even when a closed-form solution is available, it is generally unclear how it is deduced. Moreover, it is not easy to apply the PEP framework to gradient methods with adaptive stepsizes goujaud2023fundamental . Consequently, most existing complexity results for gradient methods focus on fixed stepsizes. Recently, by resorting to PEP, the worst-case complexities of the gradient method with exact line search and with the Polyak stepsize (4) were presented through computer-assisted proofs in de2017worst and barre2020complexity , respectively.
The purpose of this paper is to give a novel analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, for smooth strongly convex functions with Lipschitz continuous gradients. To this end, we first study a family of gradient methods whose stepsizes include the one determined by exact line search and the Polyak stepsize as special instances. We show that this family converges linearly when applied to a strongly convex quadratic function. Then, based on the convergence results for quadratics, we establish the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize for general smooth strongly convex objective functions in an analytic way, recovering the computer-assisted results in de2017worst and barre2020complexity . To our knowledge, this is the first analytic proof of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize. Furthermore, we show that the considered family will asymptotically zigzag in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian, a behavior that has not been clarified in the literature for the gradient method using the Polyak stepsize.
The paper is organized as follows. In Section 2, we investigate the linear convergence of a family of gradient methods on strongly convex quadratics. In Section 3, we present our analytic proof of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize, respectively. Finally, concluding remarks are drawn and the asymptotic behavior of the considered family is discussed in Section 4.
2 Convergence analysis for quadratics
In this section, we consider the unconstrained quadratic optimization problem
\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^\top A x - b^\top x, \qquad (5)
where $b \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $A$, and $v_1, v_2, \ldots, v_n$ be the associated orthonormal eigenvectors. Without loss of generality, we assume that $A$ is diagonal, i.e., $A = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$.
For a more general and unified analysis, we investigate a family of gradient methods with stepsize given by
\alpha_k = \frac{\gamma}{2} \cdot \frac{g_k^\top \Psi(A)\, g_k}{g_k^\top A\, \Psi(A)\, g_k}, \qquad (6)
where $\Psi(\cdot)$ is a real analytic function on $[\lambda_1, \lambda_n]$ that can be expressed by a Laurent series $\Psi(z) = \sum_{k=-\infty}^{\infty} c_k z^k$ with $c_k \in \mathbb{R}$ such that $0 < \Psi(z) < +\infty$ for all $z \in [\lambda_1, \lambda_n]$. Apparently, when $\Psi(A) = I$, we get the generalized SD stepsize, namely
\alpha_k = \frac{\gamma}{2} \cdot \frac{g_k^\top g_k}{g_k^\top A\, g_k}, \qquad (7)
which is exactly the stepsize obtained by exact line search when $\gamma = 2$. See dai2003altermin for gradient methods with shortened SD steps, i.e., $\gamma \in (0, 2)$. Interestingly, the Polyak stepsize in (4) corresponds to the case $\Psi(A) = A^{-1}$. In fact, since $g_k = A(x_k - x^*)$, where $x^*$ is the optimal solution, we have
f(x_k) - f^* = \frac{1}{2} (x_k - x^*)^\top A (x_k - x^*) = \frac{1}{2}\, g_k^\top A^{-1} g_k,
which gives
\alpha_k = \frac{\gamma}{2} \cdot \frac{g_k^\top A^{-1} g_k}{g_k^\top g_k}. \qquad (8)
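The identity for quadratics used above, $f(x_k) - f^* = \frac{1}{2} g_k^\top A^{-1} g_k$, is what makes the Polyak stepsize proportional to $g_k^\top A^{-1} g_k / \|g_k\|^2$ on quadratics. The short Python sketch below (a numerical sanity check only; the random problem data are assumptions, not from the paper) verifies the identity on a random positive definite quadratic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                 # symmetric positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)              # optimal solution of (5)
f = lambda x: 0.5 * x @ A @ x - b @ x

x = rng.standard_normal(n)
g = A @ x - b                               # gradient at x
lhs = f(x) - f(x_star)                      # f(x_k) - f^*
rhs = 0.5 * g @ np.linalg.solve(A, g)       # (1/2) g_k^T A^{-1} g_k
print(abs(lhs - rhs))                       # agrees up to rounding error
```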
For a symmetric positive definite matrix $M$, denote the weighted norm by $\|x\|_M = \sqrt{x^\top M x}$. We establish linear convergence of the family of gradient methods (6) in the next theorem.
Theorem 2.1
Proof
Using $g_k = A(x_k - x^*)$ and the definition of $\alpha_k$ in (6), we have
(10)
where the last inequality follows from the Kantorovich inequality.
If , we have
(11) |
which gives
Thus,
We complete the proof.
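The key inequality invoked in the proof is the Kantorovich inequality: for a symmetric positive definite matrix $A$ with smallest and largest eigenvalues $\lambda_1$ and $\lambda_n$ and any nonzero vector $g$, one has $(g^\top g)^2 \ge \frac{4 \lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2}\, (g^\top A g)(g^\top A^{-1} g)$. The Python sketch below (a random sanity check, not part of the proof) illustrates the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = np.sort(rng.uniform(0.5, 20.0, size=6))    # eigenvalues lambda_1 <= ... <= lambda_n
A = np.diag(lam)                                 # a diagonal SPD matrix suffices here
A_inv = np.diag(1.0 / lam)
bound = 4 * lam[0] * lam[-1] / (lam[0] + lam[-1]) ** 2

worst = 1.0
for _ in range(10000):
    g = rng.standard_normal(lam.size)
    ratio = (g @ g) ** 2 / ((g @ (A @ g)) * (g @ (A_inv @ g)))
    worst = min(worst, ratio)

print(worst, ">=", bound, worst >= bound)        # the Kantorovich bound is never violated
```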
Remark 1
By setting $\Psi(A) = I$ and $\Psi(A) = A^{-1}$ in (9), respectively, we get the following results for the gradient method with the generalized SD stepsize (7) and the Polyak stepsize (8).
3 Worst-case complexity for smooth strongly convex functions
In this section, by making use of the convergence results of the previous section, we present an analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, on the class of $L$-smooth and $\mu$-strongly convex functions, denoted by $\mathcal{F}_{\mu, L}$.
Definition 1
Let $f$ be a continuously differentiable function and $0 < \mu \le L$. We say that $f$ is $L$-smooth and $\mu$-strongly convex if
(i) $f$ is $L$-smooth, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^n$;
(ii) $f$ is $\mu$-strongly convex, i.e., $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2} \|y - x\|^2$ for all $x, y \in \mathbb{R}^n$.
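For example, a quadratic $f(x) = \frac{1}{2} x^\top A x$ with symmetric positive definite $A$ belongs to this class with $\mu = \lambda_{\min}(A)$ and $L = \lambda_{\max}(A)$. The following Python sketch (an illustrative check on random points, not part of the paper) verifies conditions (i) and (ii) for such a quadratic.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                      # Hessian of the test quadratic
mu, L = np.linalg.eigvalsh(A)[[0, -1]]       # smallest and largest eigenvalues
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

ok_i, ok_ii = True, True
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    # (i) L-smoothness: the gradient is L-Lipschitz.
    ok_i &= np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # (ii) mu-strong convexity: the quadratic lower bound holds.
    ok_ii &= f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * mu * np.linalg.norm(y - x) ** 2 - 1e-9
print(ok_i, ok_ii)                           # both True
```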
3.1 Gradient method with exact line search
Now we show analytically the worst-case complexity of the gradient method with exact line search.
Theorem 3.1
Suppose that $f \in \mathcal{F}_{\mu, L}$ and consider the gradient method with stepsize determined by exact line search. Then,
f(x_{k+1}) - f^* \le \left( \frac{L - \mu}{L + \mu} \right)^2 \bigl( f(x_k) - f^* \bigr). \qquad (15)
Proof
It follows from (3) and $f \in \mathcal{F}_{\mu, L}$ that
(16)
(17)
and
(18)
Suppose that are nonnegative scalars and . Then, a weighted sum of the above three inequalities yields
To proceed, we complete the square in all terms containing to get
where is some scalar and
As we will see later, this quantity is important for deriving the worst-case rate. It follows that
(19)
With the aim of determining , and , we consider applying the gradient method with exact line search to a two-dimensional quadratic function with
(20) |
By Theorem 2.1 and (12), in this case, the worst-case rate of the gradient method with exact line search matches the one in (15). Since the inequality (19) applies to the above quadratic case and the worst-case rate in (12) does not involve the last four terms in (19), we require
(21) |
(22) |
and
(23) |
From Remark 1, we must have when achieving the worst-case rate. Thus, for the above two-dimensional quadratic function, we get
Suppose that the two components of are nonzero. The above relations together with (22) and (23) imply that
(24) |
(25) |
(26) |
(27) |
Then, yields
which indicates
(28) |
From the first equation in (21), we get
Using the definition of and rearranging terms, we obtain
(29) |
(30) |
It is not difficult to check that the choices , and in (28), (29) and (30) are such that the two equations in (21), (26) and (27) hold. Thus, we only need to find nonnegative , and satisfying (28) and (29). Letting , we get
(31) |
Remark 3
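As a numerical illustration related to the two-dimensional worst-case instance used in the proof of Theorem 3.1, the Python sketch below (with assumed values $\mu = 1$ and $L = 10$; any $0 < \mu < L$ would do) takes a diagonal quadratic with eigenvalues $\mu$ and $L$ and a starting point whose gradient has equal components along both eigenvectors. Exact line search then contracts the function value by exactly the factor $\left(\frac{L-\mu}{L+\mu}\right)^2$ at every iteration, matching the bound (15).

```python
import numpy as np

mu, L = 1.0, 10.0                       # assumed curvature bounds; any 0 < mu < L works
A = np.diag([mu, L])
f = lambda x: 0.5 * x @ A @ x           # minimizer x* = 0 and f* = 0
rho = ((L - mu) / (L + mu)) ** 2        # worst-case per-step ratio in (15)

x = np.array([1.0 / mu, 1.0 / L])       # gradient (1, 1): equal weight on both eigenvectors
for k in range(5):
    g = A @ x
    alpha = (g @ g) / (g @ A @ g)       # exact line search stepsize on a quadratic
    x_next = x - alpha * g
    print(k, f(x_next) / f(x), "vs", rho)   # the two numbers coincide
    x = x_next
```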
3.2 Gradient method with the Polyak stepsize
The next theorem gives an analytic analysis of the worst-case complexity of the gradient method using the Polyak stepsize (4).
Theorem 3.2
Suppose that $f \in \mathcal{F}_{\mu, L}$ and consider the gradient method with stepsize given by (4). Then,
(32)
Proof
By (3) and , we have
(33) |
and
(34) |
It follows from the definition of $\alpha_k$ in (4) that
(35) |
Let be two nonnegative scalars, and . The following weighted sum is valid:
where . Substituting
into the above inequality, we obtain
(36) |
where .
To get the desired convergence result, we assume that
(37) |
which together with the definition of gives
and hence
(38) |
Since , we have
Thus, . Now we consider two cases:
Case 1: . In this case, we require
(39) |
which together with (3.2) indicates
and hence
Let
(40) |
We obtain and
(41) |
Clearly, are nonnegative. By (3.2), (37), (39), (40) and (41), we get
(42) |
Notice that the function achieves its minimum at . It follows from (3.2) that
Case 2: . Recall that are nonnegative and . In this case, we cannot expect (39) to hold. It follows from (3.2), (37) and (3.2) that
(43)
To simplify our analysis, we set . By the strong convexity of $f$, the definition of , (3.2) and (43), we have
We further suppose that is a scalar independent of . The function
attains its minimum at , which implies
(44) |
Notice that the rate in (44) holds for any . So, for the quadratic case, the rate in (44) should coincide with the one in (13) and the corresponding stepsize must be equal to . This gives . We get (32) by substituting this into (44).
We complete the proof by considering the above two cases.
Remark 4
When $\gamma = 2$, the inequality (32) reduces to the bound established in Proposition 1 of barre2020complexity .
4 Concluding remarks
Based on the convergence results of the family of gradient methods (6) on strongly convex quadratics, we have established, in an analytic way, the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, for general smooth strongly convex objective functions. It is shown by Corollary 1 and Theorem 3.2 that, from the worst-case complexity point of view, the gradient method using the Polyak stepsize (8) achieves the fastest convergence rate when $\gamma = 2$, which is also true for the generalized SD stepsize (7). However, for this case, we can show that the family of gradient methods (6) will zigzag in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian. Denote the components of $g_k$ along the eigenvectors by $g_k^{(i)}$, $i = 1, \ldots, n$, i.e., $g_k = \sum_{i=1}^{n} g_k^{(i)} v_i$. The next theorem gives the asymptotic convergence behavior of each gradient method in the family (6); see Theorem 1 of huang2022asymptotic for details.
Theorem 4.1
Due to the zigzag behavior shown in Theorem 4.1, the gradient method using the Polyak stepsize (8) with $\gamma = 2$ may perform as poorly as the method with exact line search. To see this, we apply the family of gradient methods (6) with $\gamma = 2$, taking $\Psi(A) = I$ (exact line search) and $\Psi(A) = A^{-1}$ (the Polyak stepsize), respectively, to minimize the quadratic problem (5) with
We use as the starting point and stop the iteration if . It can be seen from Figure 1 that the gradient method with the Polyak stepsize may be slower than the SD method even for a strongly convex quadratic function. Techniques to break the zigzag pattern of the family of gradient methods (6) have been developed in huang2022asymptotic , which show promising performance for the gradient method with exact line search. It would be interesting to investigate whether the techniques in huang2022asymptotic can also lead to efficient gradient methods based on the Polyak stepsize.
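The experiment can be reproduced in spirit with the Python sketch below; the matrix, right-hand side, starting point, tolerance, and the choice $\gamma = 1$ in the Polyak step are illustrative assumptions rather than the data used for Figure 1. On an ill-conditioned quadratic, both stepsize rules eventually zigzag and require a large number of iterations.

```python
import numpy as np

# Illustrative data only: the matrix, right-hand side, starting point and
# tolerance used for Figure 1 in the paper are not reproduced here.
n = 10
lam = np.logspace(0, 2, n)                   # eigenvalues from 1 to 100
A = np.diag(lam)
b = np.ones(n)
x_star = np.linalg.solve(A, b)
f = lambda x: 0.5 * x @ A @ x - b @ x
f_star = f(x_star)

def run(stepsize, tol=1e-8, max_iter=10000):
    """Gradient method (2); returns the number of iterations until ||g_k|| <= tol."""
    x = np.zeros(n)
    for k in range(max_iter):
        g = A @ x - b
        if np.linalg.norm(g) <= tol:
            return k
        x = x - stepsize(x, g) * g
    return max_iter

exact_ls = lambda x, g: (g @ g) / (g @ A @ g)       # exact line search on a quadratic
polyak = lambda x, g: (f(x) - f_star) / (g @ g)     # Polyak stepsize (4) with gamma = 1

print("exact line search:", run(exact_ls), "iterations")
print("Polyak stepsize:  ", run(polyak), "iterations")
```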
Acknowledgements.
The work of the first author was supported by the Natural Science Foundation of Hebei Province (A2021202010). The work of the second author was supported by the Hong Kong RGC General Research Fund (15309223) and the PolyU AMA Project (230413007).

References
- (1) M. Barré, A. Taylor, and A. d’Aspremont, Complexity guarantees for Polyak steps with momentum, in Conference on Learning Theory, PMLR, 2020, pp. 452–478.
- (2) J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), pp. 141–148.
- (3) C. Cartis, N. Gould, and P. Toint, On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems, SIAM J. Optim., 20 (2010), pp. 2833–2852.
- (4) A. Cauchy, Méthode générale pour la résolution des systèmes d'équations simultanées, Comp. Rend. Sci. Paris, 25 (1847), pp. 536–538.
- (5) Y.-H. Dai and Y.-X. Yuan, Alternate minimization gradient method, IMA J. Numer. Anal., 23 (2003), pp. 377–393.
- (6) S. Das Gupta, B. P. Van Parys, and E. K. Ryu, Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods, Math. Program., 204 (2024), pp. 567–639.
- (7) E. De Klerk, F. Glineur, and A. B. Taylor, On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions, Optim. Lett., 11 (2017), pp. 1185–1199.
- (8) E. De Klerk, F. Glineur, and A. B. Taylor, Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation, SIAM J. Optim., 30 (2020), pp. 2053–2082.
- (9) Y. Drori and M. Teboulle, Performance of first-order methods for smooth convex minimization: a novel approach, Math. Program., 145 (2014), pp. 451–482.
- (10) B. Goujaud, A. Dieuleveut, and A. Taylor, On fundamental proof structures in first-order optimization, in 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 3023–3030.
- (11) G. Gu and J. Yang, Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems, SIAM J. Optim., 30 (2020), pp. 1905–1921.
- (12) E. Hazan and S. Kakade, Revisiting the Polyak step size, arXiv preprint, arXiv:1905.00313, (2019).
- (13) Y. Huang, Y.-H. Dai, X.-W. Liu, and H. Zhang, On the asymptotic convergence and acceleration of gradient methods, J. Sci. Comput., 90 (2022), pp. 1–29.
- (14) X. Jiang and S. U. Stich, Adaptive SGD with Polyak stepsize and line-search: Robust convergence and variance reduction, in Advances in Neural Information Processing Systems, 2023, pp. 26396–26424.
- (15) L. Lessard, B. Recht, and A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints, SIAM J. Optim., 26 (2016), pp. 57–95.
- (16) N. Loizou, S. Vaswani, I. H. Laradji, and S. Lacoste-Julien, Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1306–1314.
- (17) B. T. Polyak, Minimization of unsmooth functionals, Comput. Math. Math. Phys., 9 (1969), pp. 14–29.
- (18) B. T. Polyak, Introduction to Optimization, Optimization Software, Inc., New York, 1987.
- (19) A. B. Taylor, J. M. Hendrickx, and F. Glineur, Exact worst-case performance of first-order methods for composite convex optimization, SIAM J. Optim., 27 (2017), pp. 1283–1313.
- (20) A. B. Taylor, J. M. Hendrickx, and F. Glineur, Smooth strongly convex interpolation and exact worst-case performance of first-order methods, Math. Program., 161 (2017), pp. 307–345.
- (21) M. Teboulle and Y. Vaisbourd, An elementary approach to tight worst case complexity analysis of gradient based methods, Math. Program., 201 (2023), pp. 63–96.
- (22) X. Wang, M. Johansson, and T. Zhang, Generalized Polyak step size for first order optimization with momentum, in International Conference on Machine Learning, PMLR, 2023, pp. 35836–35863.
- (23) Y.-X. Yuan, A new stepsize for the steepest descent method, J. Comput. Math., 24 (2006), pp. 149–156.
- (24) Y.-X. Yuan, A short note on the Q-linear convergence of the steepest descent method, Math. Program., 123 (2010), pp. 339–343.