A note on R-linear convergence of nonmonotone gradient methods
Abstract
Nonmonotone gradient methods generally perform better than their monotone counterparts, especially on unconstrained quadratic optimization. However, the known convergence rate of a monotone method is often much better than that of its nonmonotone variant. With the aim of shrinking the gap between theory and practice of nonmonotone gradient methods, we introduce a property for the convergence analysis of a large collection of gradient methods. We prove that any gradient method using stepsizes satisfying the property will converge R-linearly at a rate of $1 - \lambda_1/M_1$, where $\lambda_1$ is the smallest eigenvalue of the Hessian matrix and $M_1$ is the upper bound of the inverse stepsize. Our results indicate that the existing convergence rates of many nonmonotone methods can be improved to $1 - 1/\kappa$, with $\kappa$ being the associated condition number.
Keywords: Gradient methods; R-linear convergence; Nonmonotone; Quadratic optimization
MSC: 90C25; 90C30
1 Introduction
We concentrate on analyzing the convergence rate of gradient methods for unconstrained quadratic optimization:
\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^{T} A x - b^{T} x,   (1)
where $A \in \mathbb{R}^{n \times n}$ is a real symmetric positive definite matrix and $b \in \mathbb{R}^n$. The gradient method solves (1) by
x_{k+1} = x_k - \alpha_k g_k,   (2)
where $g_k = \nabla f(x_k) = A x_k - b$ and $\alpha_k > 0$ is the stepsize. By considering the generated function values, we can roughly divide gradient methods into two categories: monotone and nonmonotone.
The classic steepest descent (SD) method developed by Cauchy cauchy1847methode is monotone; it determines the stepsize by exact line search:
\alpha_k^{SD} = \arg\min_{\alpha > 0} f(x_k - \alpha g_k) = \frac{g_k^{T} g_k}{g_k^{T} A g_k}.   (3)
It is known that the SD method converges Q-linearly at a rate of $\frac{\kappa - 1}{\kappa + 1}$ akaike1959successive ; forsythe1968asymptotic , where $\kappa = \lambda_n/\lambda_1$ is the condition number of $A$ with $\lambda_1$ and $\lambda_n$ being the smallest and largest eigenvalues, respectively. In a similar spirit to the SD method, the minimal gradient (MG) method dai2003altermin calculates the stepsize by minimizing the gradient norm, i.e.
\alpha_k^{MG} = \arg\min_{\alpha > 0} \|\nabla f(x_k - \alpha g_k)\| = \frac{g_k^{T} A g_k}{g_k^{T} A^{2} g_k}.   (4)
This method is also monotone and has the same convergence speed as the SD method huang2019asymptotic . Another monotone method is due to Dai and Yang dai2006new2 whose stepsize is given by
\alpha_k^{DY} = \frac{\|g_k\|}{\|A g_k\|}.   (5)
It is proved that $\alpha_k^{DY}$ asymptotically converges to $\frac{2}{\lambda_1 + \lambda_n}$, which is in some sense an optimal stepsize since it minimizes $\|I - \alpha A\|_2$ over $\alpha$, see dai2006new2 ; elman1994inexact . The Dai–Yang method (5) also converges Q-linearly at the same rate as the SD method dai2006new2 . See dai2003altermin ; dai2005analysis ; yuan2006new ; yuan2008step and references therein for other monotone gradient methods.
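To make the above stepsize rules concrete, the following Python sketch (ours, not part of the cited works) evaluates the SD, MG, and Dai–Yang stepsizes (3)–(5) on a small diagonal quadratic and runs a few steepest descent iterations; the eigenvalues, right-hand side, and iteration count are illustrative choices only.

# Illustrative sketch: the monotone stepsizes (3)-(5) on a small SPD quadratic.
import numpy as np

def sd_step(g, A):   # exact line search, eq. (3)
    return (g @ g) / (g @ A @ g)

def mg_step(g, A):   # minimal gradient, eq. (4)
    Ag = A @ g
    return (g @ Ag) / (Ag @ Ag)

def dy_step(g, A):   # Dai-Yang stepsize, eq. (5)
    return np.linalg.norm(g) / np.linalg.norm(A @ g)

lam = np.array([1.0, 3.0, 10.0, 50.0])       # eigenvalues, condition number 50
A = np.diag(lam)
b = np.array([1.0, -1.0, 2.0, 0.5])
x = np.zeros(4)
for k in range(30):
    g = A @ x - b
    x = x - sd_step(g, A) * g                # try mg_step or dy_step as well
print(np.linalg.norm(A @ x - b))             # gradient norm after 30 iterations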
The Barzilai–Borwein (BB) methods barzilai1988two can be viewed as nonmonotone counterparts of the SD and MG methods since
\alpha_k^{BB1} = \frac{s_{k-1}^{T} s_{k-1}}{s_{k-1}^{T} y_{k-1}} = \frac{g_{k-1}^{T} g_{k-1}}{g_{k-1}^{T} A g_{k-1}} = \alpha_{k-1}^{SD}   (6)
and
\alpha_k^{BB2} = \frac{s_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}} = \frac{g_{k-1}^{T} A g_{k-1}}{g_{k-1}^{T} A^{2} g_{k-1}} = \alpha_{k-1}^{MG},   (7)
where $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = g_k - g_{k-1}$. Clearly, $y_{k-1} = A s_{k-1}$ for the quadratic function (1). An interesting property of the above two BB stepsizes is that they satisfy the quasi-Newton secant equation in the sense of least squares:
\alpha_k^{BB1} = \arg\min_{\alpha \in \mathbb{R}} \|\alpha^{-1} s_{k-1} - y_{k-1}\| \quad \text{and} \quad \alpha_k^{BB2} = \arg\min_{\alpha \in \mathbb{R}} \|s_{k-1} - \alpha y_{k-1}\|.
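For comparison with the monotone methods above, a minimal sketch of the nonmonotone BB iteration with the stepsizes (6)–(7) is given below; the function name, starting stepsize, tolerance, and test problem are our own illustrative choices.

# Illustrative sketch: the BB1/BB2 methods on the quadratic (1).
import numpy as np

def bb_gradient(A, b, x0, variant="BB1", alpha=1.0, tol=1e-10, maxit=1000):
    x = x0.copy()
    g = A @ x - b
    for k in range(maxit):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g           # for the quadratic (1), y = A s
        alpha = (s @ s) / (s @ y) if variant == "BB1" else (s @ y) / (y @ y)
        x, g = x_new, g_new
    return x, k

A = np.diag([1.0, 3.0, 10.0, 50.0])
b = np.ones(4)
x, iters = bb_gradient(A, b, np.zeros(4))
print(iters, np.linalg.norm(A @ x - b))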
For the BB2 method, a surprising R-superlinear convergence result on the two dimensional strictly convex quadratic function is presented in barzilai1988two , which is improved by Dai dai2013new for the BB1 method. Dai and Fletcher dai2005asymptotic showed similar R-superlinear convergence for the BB1 method in the three dimensional case. As for the $n$-dimensional case, Raydan raydan1993barzilai established global convergence of the BB1 method while Dai and Liao dai2002r proved that it converges R-linearly. Recently, Li and Sun li2021on improved the convergence rate to $1 - 1/\kappa$. A nonmonotone version of the Dai–Yang method using $\alpha_k = \frac{\|s_{k-1}\|}{\|y_{k-1}\|}$ can be found in dai2015positive , which is also R-superlinearly convergent on two dimensional strictly convex quadratics.
Interestingly, the above mentioned nonmonotone methods are generally much faster than the corresponding monotone ones, especially for quadratics. Moreover, those nonmonotone methods have the great advantage of being easily extended to general unconstrained problems BDH2019stabilized ; dai2019family ; di2018steplength ; fletcher2005barzilai ; huang2020equipping ; raydan1997barzilai , to constrained optimization birgin2000nonmonotone ; crisci2020spectral ; dai2005asymptotic ; dai2005projected ; huang2020equipping ; huang2016smoothing and to various applications dai2015positive ; huang2020gradient ; huang2015quadratic ; jiang2013feasible ; tan2016barzilai ; wang2007projected . However, theoretical analysis of a nonmonotone method is generally much more difficult due to the nonmonotonicity. In recent years, to achieve good performance, long and short stepsizes are often employed in an alternate, cyclic, adaptive, or hybrid way by nonmonotone gradient methods dai2005projected ; de2014efficient ; frassoldati2008new ; huang2020equipping ; sun2020new ; zhou2006gradient , which makes the analysis even more difficult. Consequently, the known convergence rate of a nonmonotone method is often much worse than that of its monotone counterpart.
An important tool for analyzing the convergence of nonmonotone methods is the following Property A suggested by Dai dai2003alternate .
Property A: Assume that $A$ is diagonal with $0 < \lambda_1 < \lambda_2 < \cdots < \lambda_n$ and $g_k \neq 0$ for all $k$. Let $g_k^{(i)}$ be the $i$-th component of $g_k$ and
G(k, l) = \sum_{i=1}^{l} \big(g_k^{(i)}\big)^2.
Suppose that there exist an integer $m$ and positive constants $M_1$ and $M_2$ such that
(i) $\lambda_1 \le \alpha_k^{-1} \le M_1$;
(ii) for any integer $l \in [1, n-1]$ and $\epsilon > 0$, if $G(k-j, l) \le \epsilon$ and $\big(g_{k-j}^{(l+1)}\big)^2 \ge M_2\,\epsilon$ hold for $j \in [0, \min\{k, m\} - 1]$, then $\alpha_k^{-1} \ge \frac{2}{3}\lambda_{l+1}$.
Dai dai2003alternate proved that if the stepsizes of the gradient method have Property A, then either $g_k = 0$ for some finite $k$ or the sequence $\{\|g_k\|\}$ converges to zero R-linearly. The corresponding rate has recently been improved in huang2021On , and it can be further improved for the gradient method with the stepsize
\alpha_k = \frac{g_{\nu(k)}^{T} A^{\rho} g_{\nu(k)}}{g_{\nu(k)}^{T} A^{\rho+1} g_{\nu(k)}},   (8)
where $\nu(k) \in \{k, k-1, \ldots, \max\{0, k-m\}\}$ and $\rho$ is some real number, see friedlander1998gradient . Nevertheless, it is not very straightforward to check Property A since it requires that, for any integer $l \in [1, n-1]$, the inverse stepsize is large enough whenever the first $l$ eigencomponents of the recent gradients are much smaller than the $(l+1)$-th eigencomponent. Moreover, as we will see in the next section, some existing stepsizes do not satisfy Property A.
In this paper, we introduce a new property, named Property B, for stepsizes of the gradient method, which is straightforward and easy to check. We prove that any gradient method with stepsizes satisfying Property B will converge R-linearly at a rate of $1 - \lambda_1/M_1$, with $M_1$ being the upper bound of the inverse stepsize as above. Our results indicate that gradient methods using retarded, alternate, cyclic, adaptive, or hybrid stepsize schemes such as the SDC method de2014efficient , the periodic gradient method in huang2019asymptotic and the methods in huang2020acceleration ; sun2020new ; zou2021fast converge R-linearly at the above rate. Consequently, when $M_1 = \lambda_n$ the convergence rate can be refined to $1 - 1/\kappa$, which implies that the BB1, BB2, alternate BB dai2005projected , adaptive BB zhou2006gradient , cyclic BB dai2006cyclic , cyclic SD dai2005asymptotic , ABBmin frassoldati2008new , BB family dai2019family , and the methods in dai2015positive ; huang2020equipping ; huang2019asymptotic ; huang2020gradient ; huang2020acceleration converge at least R-linearly with a rate of $1 - 1/\kappa$. It is worth mentioning that the methods in huang2019asymptotic ; huang2020acceleration satisfy Property B but do not necessarily satisfy Property A. In addition, our results generalize the one in li2021on for the BB1 method, and improve those in huang2021On for the method (8) in the sense that the associated quantities in our convergence rate are smaller than or equal to those in huang2021On . We also discuss convergence rates of the iterations and function values for gradient methods satisfying the proposed Property B.
2 Main results
Since the gradient method (2) is invariant under translations and rotations when applied to the quadratic problem (1), for the convenience of analysis, we assume without loss of generality that the matrix $A$ is diagonal with distinct eigenvalues, i.e.
A = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_n\}, \qquad 0 < \lambda_1 < \lambda_2 < \cdots < \lambda_n.   (9)
We shall analyze the R-linear convergence rate of gradient methods whose stepsizes satisfy the following Property B.
Property B: We say that the stepsize $\alpha_k$ has Property B if there exist an integer $m$ and a positive constant $M_1$ such that
(i) $\lambda_1 \le \alpha_k^{-1} \le M_1$;
(ii) for some integer $\nu(k) \in \{k, k-1, \ldots, \max\{0, k-m+1\}\}$,
\alpha_k^{-1} \ge \frac{g_{\nu(k)}^{T} \psi(A) A g_{\nu(k)}}{g_{\nu(k)}^{T} \psi(A) g_{\nu(k)}},   (10)
where $\psi(\cdot)$ is a real analytic function on $[\lambda_1, \lambda_n]$ and can be expressed by the Laurent series
\psi(z) = \sum_{i=-\infty}^{+\infty} c_i z^{i}
such that $\psi(z) > 0$ for all $z \in [\lambda_1, \lambda_n]$.
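As a quick numerical illustration of how condition (ii) can be checked, the sketch below assumes the form of (10) written above and verifies that the BB1 stepsize attains it with equality for $\psi(z) = 1$ and $\nu(k) = k - 1$, while its inverse stays in $[\lambda_1, \lambda_n]$; the test problem and all names are ours.

# Numerical check (illustration only): BB1 and condition (10) as written above.
import numpy as np

lam = np.array([1.0, 4.0, 9.0, 25.0])
A, b = np.diag(lam), np.ones(4)
x, x_old, g_old = np.ones(4), None, None
g = A @ x - b
for k in range(50):
    if np.linalg.norm(g) < 1e-10:
        break
    if g_old is None:
        alpha = 1.0 / lam[-1]                        # conservative first stepsize
    else:
        s, y = x - x_old, g - g_old
        alpha = (s @ s) / (s @ y)                    # BB1 stepsize, eq. (6)
        rq = (g_old @ A @ g_old) / (g_old @ g_old)   # Rayleigh quotient at g_{k-1}
        assert abs(1.0 / alpha - rq) <= 1e-8 * rq    # (10) with psi(z) = 1, equality
        assert lam[0] - 1e-12 <= 1.0 / alpha <= lam[-1] + 1e-12   # condition (i)
    x_old, g_old = x, g
    x = x - alpha * g
    g = A @ x - b
print(k, np.linalg.norm(g))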
The only difference between Property B here and Property A lies in condition (ii). For the case $\nu(k) = k$, we can prove that a stepsize satisfying (10) also meets condition (ii) of Property A for a suitably large constant $M_2$. In fact, for any integer $l \in [1, n-1]$, suppose that $G(k, l) \le \epsilon$ and $\big(g_k^{(l+1)}\big)^2 \ge M_2\,\epsilon$ hold; then the quotient on the right-hand side of (10) is a weighted average of the eigenvalues in which the weight carried by $\lambda_{l+1}, \ldots, \lambda_n$ dominates, so that $\alpha_k^{-1} \ge \frac{2}{3}\lambda_{l+1}$ once $M_2$ is sufficiently large relative to $\max_{z \in [\lambda_1, \lambda_n]} \psi(z) / \min_{z \in [\lambda_1, \lambda_n]} \psi(z)$.
However, when $\nu(k) < k$, the following example shows that a stepsize satisfying (10) need not meet condition (ii) of Property A, even when the premises of that condition hold.
Example 1. Consider the -dimensional quadratic with
We apply the gradient method in huang2020acceleration which uses the stepsize
Here we choose and . For any given , set . We get
It follows that and
which gives and
Then we have and
Clearly, for with , holds for . However, for any such that , we have and .
Now we present our first convergence result for gradient methods satisfying Property B. See li2021on for analysis of the BB1 method and huang2021On for methods satisfying Property A.
Theorem 2.1
Proof
Clearly, we only need to show that (11) holds for the case . In what follows, we show (11) by induction on . Since and , we have
(12) |
which gives
That is, (11) holds for .
Assume that (11) holds for all with . We have to prove that (11) holds for . By the definition of , we know that (11) holds for . Assume by contradiction that (11) does not hold for when . Let be the minimal index such that . Then, we must have ; otherwise,
which is impossible. It follows from (12), and the inductive assumption that
which gives
(13) |
where . Since , by (10) and (13) we have that
(14) |
Notice that
where . It follows from (i) of Property B that . Thus,
which together with (2) gives . This contradicts our assumption. Thus (11) holds for all . We complete the proof.
Theorem 2.1 allows us to use various choices of stepsizes for the gradient method. For example, one can use a stepsize satisfying (10) with $\nu(k) = k - j$ for some integer $j \ge 1$ and all $k$, which includes BB1, BB2, and the method in dai2015positive as special instances. One may also use alternate, cyclic, adaptive or hybrid stepsize schemes such as the cyclic BB dai2006cyclic , ABBmin frassoldati2008new and SDC de2014efficient methods, the periodic gradient method in huang2019asymptotic and the methods in huang2020gradient ; sun2020new ; zou2021fast . Convergence results for those cases are specified in the next corollary.
Corollary 1
Many efficient gradient methods use stepsizes satisfying $\lambda_1 \le \alpha_k^{-1} \le \lambda_n$, for example, the BB1, BB2, alternate BB dai2005projected , adaptive BB zhou2006gradient , cyclic BB dai2006cyclic , and cyclic SD dai2005asymptotic methods, the BB family dai2019family , and the method in dai2015positive . For such methods, the next theorem refines the results in Theorem 2.1 and Corollary 1.
Theorem 2.2
Remark 1
Notice that $\alpha_k^{BB1}$ satisfies Property B with $m = 2$, $\psi(z) = 1$ and $M_1 = \lambda_n$. So, the result in Theorem 1 of li2021on is a special case of (i) in Theorem 2.2. The stepsize in (8) satisfies Property B with $\psi(z) = z^{\rho}$ for some real number $\rho$. Hence, the results in Lemma 2.4 of huang2021On can be recovered from (ii) of Theorem 2.2 by choosing the corresponding parameters, though the last term in the rate constant is slightly different. It is worth mentioning that the associated quantities here are smaller than or equal to those in huang2021On .
Remark 2
From the above analysis, it seems that, for two gradient methods corresponding to the same $m$ and $M_1$, the one with the smaller constant in the above bound will have a faster convergence rate. However, this is not necessarily true because the components of the gradients obtained by the two methods may be quite different. For example, both the BB1 and BB2 methods correspond to the same $m$ and $M_1$ but to $\psi(z) = 1$ and $\psi(z) = z$, respectively, which implies a smaller constant for the BB2 method. Nevertheless, there is no theoretical evidence in support of a faster convergence rate for the BB2 method, and practical experience often prefers the BB1 method; see dai2005asymptotic for example.
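Schemes that mix the two BB stepsizes, in the spirit of the alternate and adaptive BB methods discussed above, can be sketched as follows; the switching rule, the threshold tau, and all other details are schematic choices of ours and are not meant to reproduce any of the cited methods exactly.

# Schematic adaptive alternation between the two BB stepsizes (6)-(7).
import numpy as np

def adaptive_bb(A, b, x0, tau=0.8, tol=1e-10, maxit=2000):
    x = x0.copy()
    g = A @ x - b
    alpha = 1.0 / np.max(np.diag(A))          # conservative first stepsize
    for k in range(maxit):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        bb1, bb2 = (s @ s) / (s @ y), (s @ y) / (y @ y)
        # short stepsize when the two BB values differ a lot, long stepsize otherwise
        alpha = bb2 if bb2 / bb1 < tau else bb1
        x, g = x_new, g_new
    return x, k

A = np.diag([1.0, 2.0, 5.0, 20.0, 100.0])
b = np.ones(5)
x, iters = adaptive_bb(A, b, np.zeros(5))
print(iters, np.linalg.norm(A @ x - b))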
3 Some discussions
We introduced Property B for gradient methods and proved that any gradient method with stepsizes satisfying this property has an R-linear convergence rate of $1 - \lambda_1/M_1$, where $M_1$ is the upper bound of the inverse stepsize. For a large collection of gradient methods, we refined the rate to $1 - 1/\kappa$. An important advantage of Property B is that it can easily be checked. Our results extend the one in li2021on and partly improve those in huang2021On .
With the above notation, we can generalize Property A in dai2003alternate as the following Property GA.
Property GA: If there exist an integer $m$ and positive constants $M_1$ and $M_2$ such that
(i) $\lambda_1 \le \alpha_k^{-1} \le M_1$;
(ii) for any integer and real number , if and hold for , then .
Clearly, Property GA reduces to Property A when . Moreover, it is not difficult to check that Property B is a special case of Property GA. By arguments similar to the proof of Theorem 2.1, we can establish an R-linear convergence rate as in (11) for any gradient method using stepsizes satisfying Property GA but with
and which is bounded by with . This extends the results in Lemma 2.2 and Theorem 2.1 of huang2021On . Other results in the previous section could be obtained for Property GA as well.
All the above results on gradients can be applied to get R-linear convergence for the iterations and function values. In fact, denote by $x^*$ the solution of (1) and $f^* = f(x^*)$. Then we have
\|x_k - x^*\| = \|A^{-1} g_k\| \le \frac{\|g_k\|}{\lambda_1}
and
f(x_k) - f^* = \frac{1}{2}(x_k - x^*)^{T} A (x_k - x^*) = \frac{1}{2} g_k^{T} A^{-1} g_k \le \frac{\|g_k\|^{2}}{2\lambda_1}.
By applying (11) we obtain
which yield
where . In the same way, we are able to establish R-linear convergence for the iterations and function values from (15).
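The two bounds above are easy to confirm numerically; the following sketch (setup entirely ours) runs a few exact line search steps and checks that the iterate error and the function value gap are controlled by $\|g_k\|/\lambda_1$ and $\|g_k\|^2/(2\lambda_1)$, respectively.

# Illustration: gradient smallness controls iterate error and function value gap.
import numpy as np

lam = np.array([0.5, 2.0, 8.0])
A, b = np.diag(lam), np.array([1.0, -2.0, 3.0])
x_star = np.linalg.solve(A, b)
f = lambda v: 0.5 * v @ A @ v - b @ v

x = np.array([3.0, 3.0, 3.0])
for k in range(200):                          # exact line search (3) as a simple driver
    g = A @ x - b
    if np.linalg.norm(g) < 1e-14:
        break
    x = x - ((g @ g) / (g @ A @ g)) * g
g = A @ x - b
assert np.linalg.norm(x - x_star) <= np.linalg.norm(g) / lam[0] + 1e-12
assert f(x) - f(x_star) <= np.linalg.norm(g) ** 2 / (2 * lam[0]) + 1e-12
print(np.linalg.norm(g), np.linalg.norm(x - x_star), f(x) - f(x_star))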
Notice that the upper bound on the right hand side of (2) can be refined as . From the proof of Theorem 2.1 we know that the convergence rate in Theorem 2.2 will be improved to if holds. Nevertheless, the above inequality requires , i.e. , which always yields the rate . The authors of huang2021On ; li2021on have constructed examples for the method (8) and the BB1 method, respectively, such that the stated rate is attained. Recall that both BB1 and (8) are special cases of Property B. Actually, for a given method satisfying Property B, we can construct examples as in huang2021On ; li2021on such that the rate is attained. It is worth mentioning that the convergence results obtained here and in huang2021On ; li2021on are componentwise. One interesting question is whether we could get a better convergence rate for the gradient norm or objective values under reasonable conditions. This will be our future work.
Acknowledgements.
This work was supported by the National Natural Science Foundation of China (Grant No. 11701137) and the Natural Science Foundation of Hebei Province (Grant No. A2021202010).
References
- (1) Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
- (2) Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
- (3) Forsythe, G.E.: On the asymptotic directions of the s-dimensional optimum gradient method. Numer. Math. 11(1), 57–76 (1968)
- (4) Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
- (5) Huang, Y.K., Dai, Y.H., Liu, X.W. et al.: On the asymptotic convergence and acceleration of gradient methods. J. Sci. Comput. 90, 7, (2022).
- (6) Dai, Y.H., Yang, X.: A new gradient method with an optimal stepsize property. Comp. Optim. Appl. 33(1), 73–88 (2006)
- (7) Elman, H.C., Golub, G.H.: Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal. 31(6), 1645–1661 (1994)
- (8) Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
- (9) Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. 24(2), 149–156 (2006)
- (10) Yuan, Y.X.: Step-sizes for the gradient method. AMS/IP Stud. Adv. Math. 42(2), 785–796 (2008)
- (11) Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
- (12) Dai, Y.H.: A new analysis on the Barzilai–Borwein gradient method. J. Oper. Res. Soc. China 2(1), 187–198 (2013)
- (13) Dai, Y.H., Fletcher, R.: On the asymptotic behaviour of some new gradient methods. Math. Program. 103(3), 541–559 (2005)
- (14) Raydan, M.: On the Barzilai–Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
- (15) Dai, Y.H., Liao, L.Z.: R-linear convergence of the Barzilai–Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
- (16) Li, D.W., Sun, R.Y.: On a faster R-linear convergence rate of the Barzilai–Borwein method. arXiv preprint arXiv:2101.00205v2 (2021)
- (17) Dai, Y.H., Al-Baali, M., Yang, X.: A positive Barzilai–Borwein-like stepsize and an extension for symmetric linear systems. In: Numerical Analysis and Optimization, pp. 59–75. Springer (2015)
- (18) Burdakov, O., Dai, Y. H., Huang, N.: Stabilized Barzilai-Borwein Method, J. Comp. Math., 37(6), 916–936 (2019)
- (19) Dai, Y.H., Huang, Y.K., Liu, X.W.: A family of spectral gradient methods for optimization. Comp. Optim. Appl. 74(1), 43–65 (2019)
- (20) Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
- (21) Fletcher, R.: On the Barzilai–Borwein method. In: Optimization and Control with Applications, pp. 235–256. Springer, New York (2005)
- (22) Huang, Y.K., Dai, Y.H., Liu, X.W.: Equipping the Barzilai–Borwein method with the two dimensional quadratic termination property. SIAM J. Optim. 31(4), 3068–3096 (2021)
- (23) Raydan, M.: The Barzilai–Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
- (24) Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
- (25) Crisci, S., Porta, F., Ruggiero, V., Zanni, L.: Spectral properties of Barzilai–Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds. SIAM J. Optim. 30(2), 1300–1326 (2020)
- (26) Dai, Y.H., Fletcher, R.: Projected Barzilai–Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005)
- (27) Huang, Y.K., Liu, H.: Smoothing projected Barzilai–Borwein method for constrained non-Lipschitz optimization. Comp. Optim. Appl. 65(3), 671–698 (2016)
- (28) Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: Gradient methods exploiting spectral properties. Optim. Methods Softw. 35(4), 681–705 (2020)
- (29) Huang, Y.K., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai–Borwein method for nonnegative matrix factorization. Data Min. Knowl. Disc. 29(6), 1665–1684 (2015)
- (30) Jiang, B., Dai, Y.H.: Feasible Barzilai–Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Methods Softw. 28(4), 756–784 (2013)
- (31) Tan, C., Ma, S., Dai, Y.H., Qian, Y.: Barzilai–Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 685–693 (2016)
- (32) Wang, Y., Ma, S.: Projected Barzilai–Borwein method for large-scale nonnegative image restoration. Inverse Probl. Sci. En. 15(6), 559–583 (2007)
- (33) De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comp. Optim. Appl. 59(3), 541–563 (2014)
- (34) Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299–312 (2008)
- (35) Sun, C., Liu, J.P.: New stepsizes for the gradient method. Optim. Lett. 14(7), 1943–1955 (2020)
- (36) Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comp. Optim. Appl. 35(1), 69–86 (2006)
- (37) Dai, Y.H.: Alternate step gradient method. Optimization 52(4-5), 395–415 (2003)
- (38) Huang, N.: On R-linear convergence analysis for a class of gradient methods. Comput. Optim. Appl. 81(1), 161–177 (2022)
- (39) Friedlander, A., Martínez, J.M., Molina, B., and Raydan, M.: Gradient method with retards and generalizations. SIAM J. Numer. Anal. 36(1), 275–289 (1998)
- (40) Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: On the acceleration of the Barzilai–Borwein method. Comp. Optim. Appl. 81(3), 717–740 (2022)
- (41) Zou, Q., Magoulès, F.: Fast gradient methods with alignment for symmetric linear systems without using Cauchy step. J. Comput. Appl. Math. 381, 113033 (2021)
- (42) Dai, Y.H., Hager, W.W., Schittkowski, K., Zhang, H.: The cyclic Barzilai–Borwein method for unconstrained optimization. IMA J. Numer. Anal. 26(3), 604–627 (2006)