
Xinrui Li · Yakui Huang
Institute of Mathematics, Hebei University of Technology, Tianjin 300401, China
E-mail: [email protected]

A note on $R$-linear convergence of nonmonotone gradient methods

Xinrui Li    Yakui Huang
(Received: date / Accepted: date)
Abstract

Nonmonotone gradient methods generally perform better than their monotone counterparts, especially on unconstrained quadratic optimization. However, the known convergence rate of a monotone method is often much better than that of its nonmonotone variant. With the aim of shrinking the gap between theory and practice of nonmonotone gradient methods, we introduce a property for the convergence analysis of a large collection of gradient methods. We prove that any gradient method using stepsizes satisfying the property will converge $R$-linearly at a rate of $1-\lambda_{1}/M_{1}$, where $\lambda_{1}$ is the smallest eigenvalue of the Hessian matrix and $M_{1}$ is the upper bound of the inverse stepsize. Our results indicate that the existing convergence rates of many nonmonotone methods can be improved to $1-1/\kappa$, with $\kappa$ being the associated condition number.

Keywords:
Gradient methods · $R$-linear convergence · Nonmonotone · Quadratic optimization
MSC:
90C25 · 90C30

1 Introduction

We concentrate on analyzing the convergence rate of gradient methods for unconstrained quadratic optimization:

\min_{x\in\mathbb{R}^{n}}f(x)=\frac{1}{2}x^{\top}Ax-b^{\top}x,   (1)

where $A\in\mathbb{R}^{n\times n}$ is a real symmetric positive definite matrix and $b\in\mathbb{R}^{n}$. The gradient method solves (1) by

x_{k+1}=x_{k}-\alpha_{k}g_{k},   (2)

where $g_{k}=\nabla f(x_{k})$ and $\alpha_{k}>0$ is the stepsize. Based on the behavior of the generated function values, gradient methods can be roughly divided into two categories: monotone and nonmonotone.

The classic steepest descent (SD) method developed by Cauchy cauchy1847methode is monotone; it determines the stepsize by exact line search:

\alpha_{k}^{SD}=\arg\min_{\alpha>0}~f(x_{k}-\alpha g_{k})=\frac{g_{k}^{\top}g_{k}}{g_{k}^{\top}Ag_{k}}.   (3)

It is known that the SD method converges $Q$-linearly at a rate of $\frac{\kappa-1}{\kappa+1}$ akaike1959successive ; forsythe1968asymptotic , where $\kappa=\lambda_{n}/\lambda_{1}$ is the condition number of $A$, with $\lambda_{1}$ and $\lambda_{n}$ being the smallest and largest eigenvalues, respectively. In a similar spirit to the SD method, the minimal gradient (MG) method dai2003altermin calculates the stepsize by minimizing the gradient norm, i.e.

\alpha_{k}^{MG}=\arg\min_{\alpha>0}~\|g(x_{k}-\alpha g_{k})\|=\frac{g_{k}^{\top}Ag_{k}}{g_{k}^{\top}A^{2}g_{k}}.   (4)

This method is also monotone and has the same convergence speed as the SD method huang2019asymptotic . Another monotone method is due to Dai and Yang dai2006new2 , whose stepsize is given by

\alpha_{k}^{AOPT}=\frac{\|g_{k}\|}{\|Ag_{k}\|}=\sqrt{\alpha_{k}^{SD}\alpha_{k}^{MG}}.   (5)

It is proved that $\alpha_{k}^{AOPT}$ asymptotically converges to $\frac{2}{\lambda_{1}+\lambda_{n}}$, which is in some sense an optimal stepsize since it minimizes $\|I-\alpha A\|$ over $\alpha$; see dai2006new2 ; elman1994inexact . The Dai–Yang method (5) also converges $Q$-linearly at the same rate as the SD method dai2006new2 . See dai2003altermin ; dai2005analysis ; yuan2006new ; yuan2008step and references therein for other monotone gradient methods.
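
For concreteness, the stepsizes (3)–(5) can be evaluated directly from the current gradient. The following Python sketch is our own illustration (the function names and the test matrix are not taken from the cited works):

import numpy as np

def sd_step(g, A):
    # steepest descent stepsize (3): exact line search along -g
    return (g @ g) / (g @ (A @ g))

def mg_step(g, A):
    # minimal gradient stepsize (4): minimizes the norm of g(x - alpha*g)
    Ag = A @ g
    return (g @ Ag) / (Ag @ Ag)

def aopt_step(g, A):
    # Dai-Yang stepsize (5): equals sqrt(sd_step * mg_step)
    return np.linalg.norm(g) / np.linalg.norm(A @ g)

A = np.diag([1.0, 8.0, 16.0])   # illustrative SPD matrix
g = np.array([1.0, 1.0, 1.0])
print(sd_step(g, A), mg_step(g, A), aopt_step(g, A))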

The Barzilai–Borwein (BB) methods barzilai1988two can be viewed as nonmonotone counterparts of the SD and MG methods since

\alpha_{k}^{BB1}=\frac{s_{k-1}^{\top}s_{k-1}}{s_{k-1}^{\top}y_{k-1}}=\frac{g_{k-1}^{\top}g_{k-1}}{g_{k-1}^{\top}Ag_{k-1}}=\alpha_{k-1}^{SD}   (6)

and

\alpha_{k}^{BB2}=\frac{s_{k-1}^{\top}y_{k-1}}{y_{k-1}^{\top}y_{k-1}}=\frac{g_{k-1}^{\top}Ag_{k-1}}{g_{k-1}^{\top}A^{2}g_{k-1}}=\alpha_{k-1}^{MG},   (7)

where $s_{k-1}=x_{k}-x_{k-1}$ and $y_{k-1}=g_{k}-g_{k-1}$. Clearly, $\alpha_{k}^{BB1}\geqslant\alpha_{k}^{BB2}$ for the quadratic function (1). An interesting property of the above two BB stepsizes is that they satisfy the quasi-Newton secant equation in the sense of least squares:

\alpha_{k}^{BB1}=\arg\min_{\alpha>0}\left\|\alpha^{-1}s_{k-1}-y_{k-1}\right\|,\quad\alpha_{k}^{BB2}=\arg\min_{\alpha>0}\left\|s_{k-1}-\alpha y_{k-1}\right\|.

For the BB2 method, a surprising $R$-superlinear convergence result on two dimensional strictly convex quadratic functions is presented in barzilai1988two , which was improved by Dai dai2013new for the BB1 method. Dai and Fletcher dai2005asymptotic showed similar $R$-superlinear convergence for the BB1 method in the three dimensional case. As for the $n$ dimensional case, Raydan raydan1993barzilai established global convergence of the BB1 method, while Dai and Liao dai2002r proved that it converges $R$-linearly with a rate of roughly $(1-1/\kappa)^{1/n}$. Recently, Li and Sun li2021on improved this rate to $1-1/\kappa$. A nonmonotone version of the Dai–Yang method using $\alpha_{k-1}^{AOPT}$ can be found in dai2015positive , which is also $R$-superlinearly convergent on two dimensional strictly convex quadratics.
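
To make the nonmonotone behavior concrete, the following is a minimal Python sketch of the BB iteration (2) with stepsizes (6) or (7); the SD stepsize at the first iteration and the stopping rule are our own choices, not prescribed by barzilai1988two :

import numpy as np

def bb_gradient(A, b, x0, variant="BB1", tol=1e-8, max_iter=500):
    # Barzilai-Borwein method for min 0.5*x'Ax - b'x (illustrative sketch)
    x = x0.copy()
    g = A @ x - b
    alpha = (g @ g) / (g @ (A @ g))              # first iteration: SD stepsize
    norms = [np.linalg.norm(g)]
    for _ in range(max_iter):
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        alpha = (s @ s) / (s @ y) if variant == "BB1" else (s @ y) / (y @ y)
        x, g = x_new, g_new
        norms.append(np.linalg.norm(g))
        if norms[-1] <= tol * norms[0]:
            break
    return x, norms   # the gradient norms typically decrease nonmonotonically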

Interestingly, the above mentioned nonmonotone methods are generally much faster than the corresponding monotone ones, especially on quadratics. Moreover, those nonmonotone methods have the great advantage of being easily extended to general unconstrained problems BDH2019stabilized ; dai2019family ; di2018steplength ; fletcher2005barzilai ; huang2020equipping ; raydan1997barzilai , to constrained optimization birgin2000nonmonotone ; crisci2020spectral ; dai2005asymptotic ; dai2005projected ; huang2020equipping ; huang2016smoothing , and to various applications dai2015positive ; huang2020gradient ; huang2015quadratic ; jiang2013feasible ; tan2016barzilai ; wang2007projected . However, the theoretical analysis of a nonmonotone method is generally much more difficult due to the nonmonotonicity. In recent years, to achieve good performance, long and short stepsizes are often employed in an alternate, cyclic, adaptive, or hybrid way by nonmonotone gradient methods dai2005projected ; de2014efficient ; frassoldati2008new ; huang2020equipping ; sun2020new ; zhou2006gradient , which makes the analysis even more difficult. Consequently, the known convergence rate of a nonmonotone method is often much worse than that of its monotone counterpart.

An important tool for analyzing the convergence of nonmonotone methods is the following Property A suggested by Dai dai2003alternate .

Property A: Assume that $A=\operatorname{diag}(\lambda_{1},\cdots,\lambda_{n})$ and $1=\lambda_{1}<\lambda_{2}<\cdots<\lambda_{n}$. Let $g_{k}^{(i)}$ be the $i$-th component of $g_{k}$ and

P(k,l)=\sum_{i=1}^{l}\left(g_{k}^{(i)}\right)^{2}.

Suppose that there exist an integer $m$ and positive constants $M_{1}\geqslant\lambda_{1}$ and $M_{2}$ such that
(i) $\lambda_{1}\leqslant\alpha_{k}^{-1}\leqslant M_{1}$;
(ii) for any integer $l\in[1,n-1]$ and $\varepsilon>0$, if $P(k-j,l)\leqslant\varepsilon$ and $\left(g_{k-j}^{(l+1)}\right)^{2}\geqslant M_{2}\varepsilon$ hold for $j\in[0,\min\{k,m\}-1]$, then $\alpha_{k}^{-1}\geqslant\frac{2}{3}\lambda_{l+1}$.

Dai dai2003alternate proved that if the stepsizes of the gradient method have Property A, then either $g_{k}=0$ for some finite $k$ or the sequence $\{\|g_{k}\|\}$ converges to zero $R$-linearly with a rate of roughly $\big(\max\{\frac{1}{2},1-\frac{\lambda_{1}}{M_{1}}\}\big)^{1/n}$. Such a rate has recently been improved to $\max\{\frac{1}{2},1-\frac{\lambda_{1}}{M_{1}}\}$ huang2021On , and the rate can be further improved to $1-1/\kappa$ for the gradient method with

\alpha_{k}^{GMR}=\frac{g_{v(k)}^{\top}A^{\rho(k)}g_{v(k)}}{g_{v(k)}^{\top}A^{\rho(k)+1}g_{v(k)}},   (8)

where $v(k)\in\{k,k-1,\dots,\max\{k-m+1,1\}\}$ and $\rho(k)\geqslant 0$ is some real number; see friedlander1998gradient . Nevertheless, it is not very straightforward to check Property A since it requires that, for any integer $l$, the inverse stepsize is large enough whenever the first $l$ eigencomponents of $g_{k}$ are much smaller than the $(l+1)$-th eigencomponent. Moreover, as we will see in the next section, some existing stepsizes do not satisfy Property A.
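
As an illustration of (8), the sketch below (our own naming; the particular retard rule is just one admissible choice of $v(k)$) evaluates the stepsize from a stored gradient and a real power of $A$:

import numpy as np

def gmr_step(grad_hist, A, k, m=3, rho=1.0):
    # stepsize (8) with a retarded gradient g_{v(k)}; here v(k) = max(k-m+1, 1)
    v = max(k - m + 1, 1)
    g = grad_hist[v]
    w, V = np.linalg.eigh(A)            # A is SPD, so A^rho = V diag(w^rho) V^T
    Arho_g = V @ (w**rho * (V.T @ g))
    return (g @ Arho_g) / (g @ (A @ Arho_g))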

In this paper, we introduce a new property, named Property B, for stepsizes of the gradient method, which is straightforward and easy to check. We prove that any gradient method with stepsizes satisfying Property B converges $R$-linearly at a rate of $1-\lambda_{1}/M_{1}$, with $M_{1}$ being the upper bound of the inverse stepsize as above. Our results indicate that gradient methods using retarded, alternate, cyclic, adaptive, or hybrid stepsize schemes, such as the SDC method de2014efficient , the periodic gradient method in huang2019asymptotic , and the methods in huang2020acceleration ; sun2020new ; zou2021fast , converge $R$-linearly at the above rate. Consequently, when $M_{1}\leqslant\lambda_{n}$ the convergence rate can be refined to $1-1/\kappa$, which implies that the BB1, BB2, alternate BB dai2005projected , adaptive BB zhou2006gradient , cyclic BB dai2006cyclic , cyclic SD dai2005asymptotic , ABBmin frassoldati2008new , and BB family dai2019family methods, and the methods in dai2015positive ; huang2020equipping ; huang2019asymptotic ; huang2020gradient ; huang2020acceleration converge at least $R$-linearly with a rate of $1-1/\kappa$. It is worth mentioning that the methods in huang2019asymptotic ; huang2020acceleration satisfy Property B but do not necessarily satisfy Property A. In addition, our results generalize the one in li2021on for the BB1 method and improve those in huang2021On for the method (8), in the sense that the associated quantities in our convergence rate are smaller than or equal to those in huang2021On . We also discuss convergence rates of the iterates and function values for gradient methods satisfying the proposed Property B.

The paper is organized as follows. In Section 2, we present Property B and our $R$-linear convergence results. In Section 3 we give some concluding remarks and discuss generalizations of our results.

2 Main results

Since the gradient method (2) is invariant under translations and rotations when applied to the quadratic problem (1), for the convenience of analysis we assume without loss of generality that the matrix $A$ is diagonal with distinct eigenvalues, i.e.

A=\mathrm{diag}\{\lambda_{1},\lambda_{2},\cdots,\lambda_{n}\},\quad 0<\lambda_{1}<\lambda_{2}<\cdots<\lambda_{n}.   (9)

We shall analyze the $R$-linear convergence rate of gradient methods whose stepsizes satisfy the following Property B.

Property B: We say that the stepsize $\alpha_{k}$ has Property B if there exist an integer $m$ and a positive constant $M_{1}\geqslant\lambda_{1}$ such that
(i) $\lambda_{1}\leqslant\alpha_{k}^{-1}\leqslant M_{1}$;
(ii) for some integer $v(k)\in\{k,k-1,\cdots,\max\{k-m+1,0\}\}$,

\alpha_{k}\leqslant\frac{g_{v(k)}^{\top}\psi(A)g_{v(k)}}{g_{v(k)}^{\top}A\psi(A)g_{v(k)}},   (10)

where $\psi$ is a real analytic function on $[\lambda_{1},\lambda_{n}]$ that can be expressed by the Laurent series

\psi(z)=\sum_{k=-\infty}^{\infty}d_{k}z^{k},\quad d_{k}\in\mathbb{R},

such that $0<\sum_{k=-\infty}^{\infty}d_{k}z^{k}<+\infty$ for all $z\in[\lambda_{1},\lambda_{n}]$.
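
In code, condition (10) is a single inequality once $\psi$ and $v(k)$ are fixed. The sketch below (our own helper functions, assuming $A$ has already been diagonalized as in (9)) evaluates the right hand side of (10); with $\psi(z)=1$ it reduces to the retarded SD stepsize, so BB1 with $v(k)=k-1$ satisfies (10) with equality, and with $\psi(z)=z$ to the retarded MG stepsize, so BB2 does as well:

import numpy as np

def property_b_bound(g_v, lam, psi):
    # right hand side of (10) for the retarded gradient g_{v(k)},
    # with psi evaluated on the eigenvalues lam of the diagonal matrix A
    w = psi(lam) * g_v**2                 # weights psi(lambda_i) * (g_{v(k)}^(i))^2
    return w.sum() / (lam * w).sum()

def satisfies_condition_ii(alpha_k, g_v, lam, psi):
    return alpha_k <= property_b_bound(g_v, lam, psi)

lam = np.array([1.0, 8.0, 16.0])
g_v = np.array([1.0, -2.0, 0.5])
bb1_bound = property_b_bound(g_v, lam, np.ones_like)   # psi(z) = 1: retarded SD stepsize
bb2_bound = property_b_bound(g_v, lam, lambda z: z)    # psi(z) = z: retarded MG stepsize
print(bb1_bound, bb2_bound, satisfies_condition_ii(bb2_bound, g_v, lam, lambda z: z))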

The only difference between Property B here and Property A is condition (ii). For the case $\psi(A)=I$, we can prove that any $\alpha_{k}$ satisfying (10) also meets condition (ii) of Property A with $M_{2}=2$. In fact, for any integer $l\in[1,n-1]$, suppose that $P(k-j,l)\leqslant\varepsilon$ and $\left(g_{k-j}^{(l+1)}\right)^{2}\geqslant 2\varepsilon$ hold for $\varepsilon>0$ and $j\in[0,\min\{k,m\}-1]$. Since $v(k)=k-j$ for some such $j$, we obtain that

\alpha_{k}^{-1}\geqslant\frac{g_{v(k)}^{\top}Ag_{v(k)}}{g_{v(k)}^{\top}g_{v(k)}}=\frac{\sum_{i=1}^{n}\lambda_{i}\left(g_{v(k)}^{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(g_{v(k)}^{(i)}\right)^{2}}\geqslant\frac{\lambda_{l+1}\sum_{i=l+1}^{n}\left(g_{v(k)}^{(i)}\right)^{2}}{P(v(k),l)+\sum_{i=l+1}^{n}\left(g_{v(k)}^{(i)}\right)^{2}}\geqslant\frac{\lambda_{l+1}}{\varepsilon/(2\varepsilon)+1}\geqslant\frac{2}{3}\lambda_{l+1}.

However, if $\psi(A)\neq I$, the following example shows that a stepsize satisfying (10) can have $\alpha_{k}^{-1}<\frac{2}{3}\lambda_{l+1}$ even if $P(k-j,l)\leqslant\varepsilon$ and $\left(g_{k-j}^{(l+1)}\right)^{2}\geqslant M_{2}\varepsilon$.

Example 1. Consider the $3$-dimensional quadratic with

A=\mathrm{diag}\{1,8,16\},\quad b=0.

We apply the gradient method in huang2020acceleration , which uses the stepsize

\alpha_{k}=\frac{g_{k-1}^{\top}\psi(A)g_{k-1}}{g_{k-1}^{\top}A\psi(A)g_{k-1}},\quad k\geqslant 1.

Here we choose $\psi(z)=\left(\frac{1+2z}{z^{2}}\right)^{2}$ and $\alpha_{0}=\alpha_{0}^{SD}$. For any given $\varepsilon>0$, set $x_{0}=\left(\sqrt{\varepsilon},\frac{\sqrt{40\varepsilon}}{8},\frac{\sqrt{40\varepsilon}}{16}\right)^{\top}$. We get

g_{0}=Ax_{0}=\left(\sqrt{\varepsilon},\sqrt{40\varepsilon},\sqrt{40\varepsilon}\right)^{\top}.

It follows that $\alpha_{0}=\frac{g_{0}^{\top}g_{0}}{g_{0}^{\top}Ag_{0}}\approx 0.0843$ and

g_{1}=(I-\alpha_{0}A)g_{0}\approx\left(0.9157\sqrt{\varepsilon},\;2.0599\sqrt{\varepsilon},\;-2.2047\sqrt{\varepsilon}\right)^{\top},

which gives $\alpha_{1}=\frac{g_{0}^{\top}\psi(A)g_{0}}{g_{0}^{\top}A\psi(A)g_{0}}\approx 0.2958$ and

g_{2}=(I-\alpha_{1}A)g_{1}\approx\left(0.6448\sqrt{\varepsilon},\;-2.8148\sqrt{\varepsilon},\;8.2300\sqrt{\varepsilon}\right)^{\top}.

Then we have $\alpha_{2}\approx 0.7056$ and

g_{3}=(I-\alpha_{2}A)g_{2}\approx\left(0.1898\sqrt{\varepsilon},\;13.0744\sqrt{\varepsilon},\;-84.6846\sqrt{\varepsilon}\right)^{\top}.

Clearly, for $j\in[0,\min\{k,m\}-1]$ with $m=2$, $P(k-j,1)\leqslant\varepsilon$ holds for $k=2,3$. However, for any $M_{2}$ such that $\left(g_{k-j}^{(2)}\right)^{2}\geqslant M_{2}\varepsilon$, we have $\alpha_{2}^{-1}\approx 1.4172<16/3$ and $\alpha_{3}^{-1}\approx 4.8320<16/3$.
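
The iterates above can be reproduced with a few lines of code; the following script is our own verification of Example 1 (working with $g_k/\sqrt{\varepsilon}$, since every quantity scales with $\sqrt{\varepsilon}$) and is not taken from huang2020acceleration :

import numpy as np

lam = np.array([1.0, 8.0, 16.0])                   # A = diag(1, 8, 16), b = 0
psi = lambda z: ((1 + 2 * z) / z**2) ** 2          # psi(z) = ((1+2z)/z^2)^2
g = np.array([1.0, np.sqrt(40.0), np.sqrt(40.0)])  # g_0 / sqrt(eps)
grads, alphas = [g], []
for k in range(4):
    if k == 0:
        alpha = (g @ g) / (g @ (lam * g))          # alpha_0 = SD stepsize
    else:
        gp = grads[k - 1]
        w = psi(lam) * gp
        alpha = (gp @ w) / (gp @ (lam * w))        # stepsize of the method above
    alphas.append(alpha)
    g = (1 - alpha * lam) * g                      # g_{k+1} = (I - alpha_k A) g_k
    grads.append(g)
for k in range(4):
    print(f"alpha_{k} = {alphas[k]:.4f},  g_{k+1} = {np.round(grads[k+1], 4)}")
# prints alpha_0 ~ 0.0843, alpha_1 ~ 0.2958, alpha_2 ~ 0.7056 and the components
# of g_1, g_2, g_3 reported above (up to the common factor sqrt(eps))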

Now we present our first convergence result for gradient methods satisfying Property B. See li2021on for analysis of the BB1 method and huang2021On for methods satisfying Property A.

Theorem 2.1

Suppose that the sequence $\{\|g_{k}\|\}$ is generated by applying the gradient method (2) to the quadratic problem (1). If the stepsize $\alpha_{k}$ satisfies Property B, then either $g_{k}=0$ for some finite $k$ or the sequence $\{\|g_{k}\|\}$ converges to zero $R$-linearly in the sense that

\lvert g_{k}^{(i)}\rvert\leqslant C_{i}\theta^{k},\qquad i=1,2,\cdots,n,   (11)

where $\theta=1-\lambda_{1}/M_{1}$ and

\left\{\begin{array}{l}C_{1}=\lvert g_{0}^{(1)}\rvert;\\ C_{i}=\displaystyle\max\left\{\lvert g_{0}^{(i)}\rvert,\frac{\lvert g_{1}^{(i)}\rvert}{\theta},\cdots,\frac{\lvert g_{m-1}^{(i)}\rvert}{\theta^{m-1}},\frac{\max\{\sigma_{i},\sigma_{i}^{m}\}}{\theta^{m}\psi(\lambda_{i})}\sqrt{\sum_{j=1}^{i-1}\psi^{2}(\lambda_{j})C_{j}^{2}}\right\},\quad i=2,3,\cdots,n,\end{array}\right.

with $\sigma_{i}=\max\left\{\frac{\lambda_{i}}{\lambda_{1}}-1,\,1-\frac{\lambda_{i}}{M_{1}}\right\}$.

Proof

Clearly, we only need to show that (11) holds for the case $g_{k}^{(i)}\neq 0$. In what follows, we show (11) by induction on $i$. Since $g_{k+1}^{(i)}=(1-\alpha_{k}\lambda_{i})g_{k}^{(i)}$ and $\lambda_{1}\leqslant\alpha_{k}^{-1}\leqslant M_{1}$, we have

\lvert g_{k+1}^{(i)}\rvert\leqslant\sigma_{i}\lvert g_{k}^{(i)}\rvert,\qquad i=1,2,\cdots,n,   (12)

which gives

\lvert g_{k}^{(1)}\rvert\leqslant\theta\lvert g_{k-1}^{(1)}\rvert\leqslant C_{1}\theta^{k}.

That is, (11) holds for $i=1$.

Assume that (11) holds for all $1\leqslant i\leqslant L-1$ with $L\in\{2,\cdots,n\}$. We now prove that (11) holds for $i=L$. By the definition of $C_{L}$, we know that (11) holds for $k=0,1,\cdots,m-1$. Assume by contradiction that (11) does not hold for some $k\geqslant m$ when $i=L$. Let $\hat{k}~(\geqslant m)$ be the minimal index such that $\lvert g_{\hat{k}}^{(L)}\rvert>C_{L}\theta^{\hat{k}}$. Then we must have $\alpha_{\hat{k}-1}\lambda_{L}>1$; otherwise,

\lvert g_{\hat{k}}^{(L)}\rvert=(1-\alpha_{\hat{k}-1}\lambda_{L})\lvert g_{\hat{k}-1}^{(L)}\rvert\leqslant\theta\lvert g_{\hat{k}-1}^{(L)}\rvert\leqslant C_{L}\theta^{\hat{k}},

which is impossible. It follows from (12), $\theta<1$, and the inductive assumption that

\psi(\lambda_{L})\lvert g_{\hat{k}-j}^{(L)}\rvert\geqslant\frac{\psi(\lambda_{L})\lvert g_{\hat{k}}^{(L)}\rvert}{\sigma_{L}^{j}}>\frac{\psi(\lambda_{L})C_{L}\theta^{\hat{k}}}{\sigma_{L}^{j}}\geqslant\frac{\psi(\lambda_{L})C_{L}\theta^{\hat{k}}}{\max\{\sigma_{L},\sigma_{L}^{m}\}}
\geqslant\theta^{\hat{k}-m}\sqrt{\sum_{i=1}^{L-1}\psi^{2}(\lambda_{i})C_{i}^{2}}\geqslant\theta^{\hat{k}-m}\sqrt{\sum_{i=1}^{L-1}\left(\psi(\lambda_{i})g_{\hat{k}-j}^{(i)}\right)^{2}\theta^{2(j-\hat{k})}}
=\theta^{j-m}\sqrt{G(\hat{k}-j,L-1)}\geqslant\sqrt{G(\hat{k}-j,L-1)},\quad j\in[1,m],

which gives

\left(\psi(\lambda_{L})g_{v(\hat{k}-1)}^{(L)}\right)^{2}>G(v(\hat{k}-1),L-1),   (13)

where $G(k,l)=\sum_{i=1}^{l}\left(\psi(\lambda_{i})g_{k}^{(i)}\right)^{2}$. Since $\alpha_{\hat{k}-1}\lambda_{L}>1$, by (10) and (13) we have that

\left\lvert 1-\alpha_{\hat{k}-1}\lambda_{L}\right\rvert=\alpha_{\hat{k}-1}\lambda_{L}-1\leqslant\frac{\left(\psi(A)g_{v(\hat{k}-1)}\right)^{\top}\left(\psi(A)g_{v(\hat{k}-1)}\right)}{\left(\psi(A)g_{v(\hat{k}-1)}\right)^{\top}A\left(\psi(A)g_{v(\hat{k}-1)}\right)}\lambda_{L}-1
=\frac{\sum_{i=1}^{n}\left(\lambda_{L}-\lambda_{i}\right)\left(\psi(\lambda_{i})g_{v(\hat{k}-1)}^{(i)}\right)^{2}}{\sum_{i=1}^{n}\lambda_{i}\left(\psi(\lambda_{i})g_{v(\hat{k}-1)}^{(i)}\right)^{2}}
\leqslant\frac{\left(\lambda_{L}-\lambda_{1}\right)G(v(\hat{k}-1),L-1)}{\lambda_{1}G(v(\hat{k}-1),L-1)+\lambda_{L}\left(\psi(\lambda_{L})g_{v(\hat{k}-1)}^{(L)}\right)^{2}}
\leqslant\frac{\lambda_{L}-\lambda_{1}}{\lambda_{1}+\lambda_{L}\frac{\left(\psi(\lambda_{L})g_{v(\hat{k}-1)}^{(L)}\right)^{2}}{G(v(\hat{k}-1),L-1)}}\leqslant\frac{\lambda_{L}-\lambda_{1}}{\lambda_{L}+\lambda_{1}}.   (14)

Notice that

\alpha_{\hat{k}-1}^{-1}\geqslant\frac{\sum_{i=1}^{n}\lambda_{i}\left(\psi(\lambda_{i})g_{v(\hat{k}-1)}^{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(\psi(\lambda_{i})g_{v(\hat{k}-1)}^{(i)}\right)^{2}}
\geqslant\frac{\lambda_{1}G(v(\hat{k}-1),L-1)+\lambda_{L}\sum_{i=L}^{n}\left(\psi(\lambda_{i})g_{v(\hat{k}-1)}^{(i)}\right)^{2}}{G(v(\hat{k}-1),L-1)+\sum_{i=L}^{n}\left(\psi(\lambda_{i})g_{v(\hat{k}-1)}^{(i)}\right)^{2}}
=\frac{\lambda_{1}+\lambda_{L}\delta}{1+\delta}>\frac{\lambda_{1}+\lambda_{L}}{2},

where $\delta=\frac{\sum_{i=L}^{n}\left(\psi(\lambda_{i})g_{v(\hat{k}-1)}^{(i)}\right)^{2}}{G(v(\hat{k}-1),L-1)}\geqslant\frac{\left(\psi(\lambda_{L})g_{v(\hat{k}-1)}^{(L)}\right)^{2}}{G(v(\hat{k}-1),L-1)}>1$. It follows from (i) of Property B that $M_{1}>\frac{\lambda_{1}+\lambda_{L}}{2}$. Thus,

\frac{\lambda_{L}-\lambda_{1}}{\lambda_{L}+\lambda_{1}}=1-\frac{2\lambda_{1}}{\lambda_{L}+\lambda_{1}}<1-\frac{\lambda_{1}}{M_{1}}=\theta,

which together with (2) and (14) gives $\lvert g_{\hat{k}}^{(L)}\rvert\leqslant\theta\lvert g_{\hat{k}-1}^{(L)}\rvert\leqslant C_{L}\theta^{\hat{k}}$. This contradicts our assumption. Thus (11) holds for all $i$, which completes the proof.

Theorem 2.1 allows us to use various choices of stepsizes for the gradient method. For example, one can use a stepsize such that $v(k)=k-r$ for some integer $r\in[1,m-1]$ and all $k\geqslant r$, which includes BB1, BB2, and the method in dai2015positive as special instances. One may also use alternate, cyclic, adaptive, or hybrid stepsize schemes such as the cyclic BB dai2006cyclic , ABBmin frassoldati2008new , and SDC de2014efficient methods, the periodic gradient method in huang2019asymptotic , and the methods in huang2020gradient ; sun2020new ; zou2021fast . Convergence results for those cases are specified in the next corollary.

Corollary 1

Suppose that the sequence $\{\|g_{k}\|\}$ is generated by applying the gradient method (2) to the quadratic problem (1).

(i) If Property B is satisfied with $v(k)=k-r$ for some integer $r\in[1,m-1]$ and all $k\geqslant r$, then either $g_{k}=0$ for some finite $k$ or the sequence $\{\|g_{k}\|\}$ converges to zero $R$-linearly in the sense of (11) with

\left\{\begin{array}{l}C_{1}=\lvert g_{0}^{(1)}\rvert;\\ C_{i}=\displaystyle\max\left\{\lvert g_{0}^{(i)}\rvert,\frac{\lvert g_{1}^{(i)}\rvert}{\theta},\cdots,\frac{\lvert g_{r}^{(i)}\rvert}{\theta^{r}},\frac{\sigma_{i}^{r+1}}{\theta^{r+1}\psi(\lambda_{i})}\sqrt{\sum_{j=1}^{i-1}\psi^{2}(\lambda_{j})C_{j}^{2}}\right\},\quad i=2,3,\cdots,n.\end{array}\right.

(ii) If Property B is satisfied with $v(k)$ in (10) replaced by $v_{i}(k)\in\{k,k-1,\cdots,\max\{k-m_{i}+1,0\}\}$ for all $i=1,\cdots,s$, where $m_{i}$ and $s$ are positive integers, then either $g_{k}=0$ for some finite $k$ or the sequence $\{\|g_{k}\|\}$ converges to zero $R$-linearly in the sense of (11) with $m=\max\{m_{1},\cdots,m_{s}\}$.

Many efficient gradient methods use stepsizes satisfying $\lambda_{1}\leqslant\alpha_{k}^{-1}\leqslant\lambda_{n}$, for example, the BB1, BB2, alternate BB dai2005projected , adaptive BB zhou2006gradient , cyclic BB dai2006cyclic , and cyclic SD dai2005asymptotic methods, the BB family dai2019family , and the method in dai2015positive . For such methods, the next theorem refines the results in Theorem 2.1 and Corollary 1.

Theorem 2.2

Suppose that the sequence $\{\|g_{k}\|\}$ is generated by applying the gradient method (2) with $\lambda_{1}\leqslant\alpha_{k}^{-1}\leqslant\lambda_{n}$ to the quadratic problem (1).

(i) If Property B is satisfied with $v(k)=k-r$ for some integer $r\in[1,m-1]$ and all $k\geqslant r$, then either $g_{k}=0$ for some finite $k$ or the sequence $\{\|g_{k}\|\}$ converges to zero $R$-linearly in the sense that

\lvert g_{k}^{(i)}\rvert\leqslant\widetilde{C}_{i}\tilde{\theta}^{k},\qquad i=1,2,\cdots,n,   (15)

where $\tilde{\theta}=1-1/\kappa$ and

\left\{\begin{array}{l}\widetilde{C}_{1}=\lvert g_{0}^{(1)}\rvert;\\ \widetilde{C}_{i}=\displaystyle\max\left\{\lvert g_{0}^{(i)}\rvert,\frac{\lvert g_{1}^{(i)}\rvert}{\tilde{\theta}},\cdots,\frac{\lvert g_{r}^{(i)}\rvert}{\tilde{\theta}^{r}},\frac{\tilde{\sigma}_{i}^{r+1}}{\tilde{\theta}^{r+1}\psi(\lambda_{i})}\sqrt{\sum_{j=1}^{i-1}\psi^{2}(\lambda_{j})\widetilde{C}_{j}^{2}}\right\},\quad i=2,3,\cdots,n,\end{array}\right.

with $\tilde{\sigma}_{i}=\max\left\{\frac{\lambda_{i}}{\lambda_{1}}-1,\,1-\frac{\lambda_{i}}{\lambda_{n}}\right\}$.

(ii) If Property B is satisfied with $v(k)$ in (10) replaced by $v_{i}(k)\in\{k,k-1,\cdots,\max\{k-m_{i}+1,0\}\}$ for all $i=1,\cdots,s$, where $m_{i}$ and $s$ are positive integers, then either $g_{k}=0$ for some finite $k$ or the sequence $\{\|g_{k}\|\}$ converges to zero $R$-linearly in the sense of (15) with $m=\max\{m_{1},\cdots,m_{s}\}$ and

\left\{\begin{array}{l}\widetilde{C}_{1}=\lvert g_{0}^{(1)}\rvert;\\ \widetilde{C}_{i}=\displaystyle\max\left\{\lvert g_{0}^{(i)}\rvert,\frac{\lvert g_{1}^{(i)}\rvert}{\tilde{\theta}},\cdots,\frac{\lvert g_{m-1}^{(i)}\rvert}{\tilde{\theta}^{m-1}},\frac{\max\{\tilde{\sigma}_{i},\tilde{\sigma}_{i}^{m}\}}{\tilde{\theta}^{m}\psi(\lambda_{i})}\sqrt{\sum_{j=1}^{i-1}\psi^{2}(\lambda_{j})\widetilde{C}_{j}^{2}}\right\},\quad i=2,3,\cdots,n.\end{array}\right.

Remark 1

Notice that $\alpha_{k}^{BB1}$ satisfies Property B with $v(k)=k-1$, $\psi(A)=I$, and $M_{1}=\lambda_{n}$. So the result in Theorem 1 of li2021on is a special case of (i) in Theorem 2.2. The stepsize $\alpha_{k}^{GMR}$ in (8) satisfies Property B with $\psi(A)=A^{\rho(k)/2}$ for some real number $\rho(k)\geqslant 0$. Hence, the results in Lemma 2.4 of huang2021On can be recovered from (ii) of Theorem 2.2 by setting $s=1$ and $m_{1}=m$, although the last term in $\widetilde{C}_{i}$ is slightly different. It is worth mentioning that, since $\lambda_{j}^{\rho(k)/2}\leqslant\lambda_{i}^{\rho(k)/2}$ when $\lambda_{j}\leqslant\lambda_{i}$, the quantities $\widetilde{C}_{i}$, $i=2,3,\cdots,n$, here are smaller than or equal to those in huang2021On .
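
As a quick numerical sanity check of the BB1 case in Remark 1, the script below (our own; it takes $b=0$ and uses an SD step at $k=0$ so that Property B holds from the first iteration) runs BB1 on a small diagonal quadratic and tests the componentwise bound (15) with the constants of Theorem 2.2 (i) for $r=1$ and $\psi(A)=I$:

import numpy as np

lam = np.array([1.0, 3.0, 7.0, 10.0])            # eigenvalues of A; kappa = 10, b = 0
theta = 1.0 - lam[0] / lam[-1]                   # tilde(theta) = 1 - 1/kappa
sigma = np.maximum(lam / lam[0] - 1.0, 1.0 - lam / lam[-1])
g = np.array([1.0, 1.0, 1.0, 1.0])               # g_0
grads = [g]
alpha = (g @ g) / (g @ (lam * g))                # alpha_0: SD step, so (10) holds at k = 0
for k in range(40):
    g = (1.0 - alpha * lam) * g                  # componentwise update; A is diagonal
    grads.append(g)
    gp = grads[-2]
    alpha = (gp @ gp) / (gp @ (lam * gp))        # BB1: alpha_{k+1} = SD stepsize at g_k
G = np.abs(np.array(grads))                      # |g_k^(i)|
C = np.zeros(lam.size)                           # constants of Theorem 2.2 (i), r = 1, psi = I
C[0] = G[0, 0]
for i in range(1, lam.size):
    C[i] = max(G[0, i], G[1, i] / theta,
               sigma[i]**2 / theta**2 * np.sqrt(np.sum(C[:i]**2)))
k_idx = np.arange(G.shape[0])[:, None]
print(np.all(G <= C[None, :] * theta**k_idx + 1e-12))   # expected: True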

Remark 2

From the above analysis, it may seem that, for two gradient methods corresponding to the same $v(k)$ and $M_{1}$, the one with a smaller $\psi(\lambda_{j})/\psi(\lambda_{i})$ in $\widetilde{C}_{i}$ will have a faster convergence rate. However, this is not necessarily true because the components of $g_{k}$ obtained by the two methods may be quite different. For example, both the BB1 and BB2 methods correspond to $v(k)=k-1$ and $M_{1}=\lambda_{n}$, but to $\psi(A)=I$ and $\psi(A)=A$, respectively, which implies a smaller $\psi(\lambda_{j})/\psi(\lambda_{i})$ for the BB2 method. Nevertheless, there is no theoretical evidence in support of a faster convergence rate for the BB2 method, and practical experience often favors the BB1 method; see dai2005asymptotic for example.

3 Some discussions

We introduced Property B for gradient methods and proved that any gradient method with stepsizes satisfying this property has an $R$-linear convergence rate of $1-\lambda_{1}/M_{1}$, where $M_{1}$ is the upper bound of the inverse stepsize. For a large collection of gradient methods, we refined the rate to $1-1/\kappa$. An important advantage of Property B is that it can easily be checked. Our results extend the one in li2021on and partly improve those in huang2021On .

With the notation $G(k,l)=\sum_{i=1}^{l}\left(\psi(\lambda_{i})g_{k}^{(i)}\right)^{2}$, we can generalize Property A in dai2003alternate as the following Property GA.

Property GA: We say that the stepsize $\alpha_{k}$ has Property GA if there exist an integer $m$ and positive constants $M_{1}\geqslant\lambda_{1}$ and $M_{2}$ such that
(i) $\lambda_{1}\leqslant\alpha_{k}^{-1}\leqslant M_{1}$;
(ii) for any integer $l\in[1,n-1]$ and real number $\varepsilon>0$, if $G(k-j,l)\leqslant\varepsilon$ and $\left(\psi(\lambda_{l+1})g_{k-j}^{(l+1)}\right)^{2}\geqslant M_{2}\varepsilon$ hold for $j\in[0,\min\{k,m\}-1]$, then $\alpha_{k}^{-1}\geqslant\frac{2}{3}\lambda_{l+1}$.

Clearly, Property GA reduces to Property A when $\psi(A)=I$. Moreover, it is not difficult to check that Property B is a special case of Property GA. By arguments similar to the proof of Theorem 2.1, we can establish an $R$-linear convergence rate as in (11) for any gradient method using stepsizes satisfying Property GA, but with

\left\{\begin{array}{l}C_{1}=\lvert g_{0}^{(1)}\rvert;\\ C_{i}=\displaystyle\max\left\{\lvert g_{0}^{(i)}\rvert,\frac{\lvert g_{1}^{(i)}\rvert}{\theta},\cdots,\frac{\lvert g_{m-1}^{(i)}\rvert}{\theta^{m-1}},\frac{\max\{\sigma_{i},\sigma_{i}^{m}\}}{\theta^{m}\psi(\lambda_{i})}\sqrt{\sum_{j=1}^{i-1}M_{2}\psi^{2}(\lambda_{j})C_{j}^{2}}\right\},\quad i=2,3,\cdots,n,\end{array}\right.

and $\theta=\max\{\frac{1}{2},1-\frac{\lambda_{1}}{M_{1}}\}$, which is bounded by $\max\{\frac{1}{2},1-\frac{\lambda_{1}}{M}\}$ with $M=\limsup_{k\rightarrow\infty}\alpha_{k}^{-1}$. This extends the results in Lemma 2.2 and Theorem 2.1 of huang2021On . Other results in the previous section could be obtained for Property GA as well.

All the above results on gradients can be applied to obtain $R$-linear convergence of the iterates and function values. In fact, denote by $x^{*}=A^{-1}b$ the solution of (1) and let $e_{k}=x_{k}-x^{*}$. Then we have

e_{k}=x_{k}-x^{*}=A^{-1}(Ax_{k}-b)=A^{-1}g_{k}

and

f(x_{k})=f(x^{*})+\frac{1}{2}e_{k}^{\top}Ae_{k}.

By applying (11) we obtain

\lvert e_{k}^{(i)}\rvert=\lvert\lambda_{i}^{-1}g_{k}^{(i)}\rvert\leqslant\lambda_{i}^{-1}C_{i}\theta^{k},\qquad i=1,2,\cdots,n,

which yields

f(x_{k})-f(x^{*})=\frac{1}{2}e_{k}^{\top}Ae_{k}=\frac{1}{2}\sum_{i=1}^{n}\lambda_{i}\left(e_{k}^{(i)}\right)^{2}\leqslant\frac{1}{2}\sum_{i=1}^{n}\lambda_{i}^{-1}C_{i}^{2}\theta^{2k}=\frac{1}{2}\|C\|^{2}\theta^{2k},

where $C=(\lambda_{1}^{-1/2}C_{1},\cdots,\lambda_{n}^{-1/2}C_{n})^{\top}$. In the same way, we are able to establish $R$-linear convergence of the iterates and function values from (15).
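
In code, the translation from the gradient bound (11) to bounds on the iterates and on the optimality gap is a simple reweighting; the helpers below are our own illustration of the two displays above:

import numpy as np

def iterate_error_bound(C, lam, theta, k):
    # componentwise bound lam_i^{-1} * C_i * theta^k on |e_k^(i)|
    return (C / lam) * theta**k

def fval_gap_bound(C, lam, theta, k):
    # bound 0.5 * ||C_w||^2 * theta^(2k) on f(x_k) - f(x*),
    # where C_w = (lam_1^{-1/2} C_1, ..., lam_n^{-1/2} C_n)
    Cw = C / np.sqrt(lam)
    return 0.5 * np.sum(Cw**2) * theta**(2 * k)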

Notice that the upper bound on the right hand side of (14) can be refined as $\hat{\theta}=\frac{\kappa-1}{\kappa+1}$. From the proof of Theorem 2.1, we know that the convergence rate in Theorem 2.2 will be improved to $\hat{\theta}$ if $\lvert g_{k}^{(1)}\rvert\leqslant\hat{\theta}\lvert g_{k-1}^{(1)}\rvert$ holds. Nevertheless, the above inequality requires $1-\alpha_{k}\lambda_{1}\leqslant\hat{\theta}$, i.e. $\alpha_{k}\geqslant\frac{2}{\lambda_{1}+\lambda_{n}}$, which always yields the rate $\hat{\theta}$. The authors of huang2021On ; li2021on have constructed examples for the method (8) and the BB1 method, respectively, such that $\alpha_{k}\equiv\frac{2}{\lambda_{1}+\lambda_{n}}$ to achieve the rate $\hat{\theta}$. Recall that both BB1 and (8) are special cases of Property B. Actually, for a given method satisfying Property B, we can construct examples as in huang2021On ; li2021on such that $\alpha_{k}\equiv\frac{2}{\lambda_{1}+\lambda_{n}}$. It is worth mentioning that the convergence results obtained here and in huang2021On ; li2021on are componentwise. One interesting question is whether we could obtain a better convergence rate for the gradient norm or the objective values under reasonable conditions. This will be our future work.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China (Grant No. 11701137) and Natural Science Foundation of Hebei Province (Grant No. A2021202010).

References

  • (1) Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
  • (2) Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
  • (3) Forsythe, G.E.: On the asymptotic directions of the $s$-dimensional optimum gradient method. Numer. Math. 11(1), 57–76 (1968)
  • (4) Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
  • (5) Huang, Y.K., Dai, Y.H., Liu, X.W., et al.: On the asymptotic convergence and acceleration of gradient methods. J. Sci. Comput. 90, 7 (2022)
  • (6) Dai, Y.H., Yang, X.: A new gradient method with an optimal stepsize property. Comp. Optim. Appl. 33(1), 73–88 (2006)
  • (7) Elman, H.C., Golub, G.H.: Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal. 31(6), 1645–1661 (1994)
  • (8) Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
  • (9) Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. 24(2), 149–156 (2006)
  • (10) Yuan, Y.X.: Step-sizes for the gradient method. AMS/IP Stud. Adv. Math. 42(2), 785–796 (2008)
  • (11) Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
  • (12) Dai, Y.H.: A new analysis on the Barzilai–Borwein gradient method. J. Oper. Res. Soc. China 2(1), 187–198 (2013)
  • (13) Dai, Y.H., Fletcher, R.: On the asymptotic behaviour of some new gradient methods. Math. Program. 103(3), 541–559 (2005)
  • (14) Raydan, M.: On the Barzilai–Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
  • (15) Dai, Y.H., Liao, L.Z.: $R$-linear convergence of the Barzilai–Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
  • (16) Li, D.W., Sun, R.Y.: On a faster $R$-linear convergence rate of the Barzilai–Borwein method. arXiv preprint arXiv:2101.00205v2 (2021)
  • (17) Dai, Y.H., Al-Baali, M., Yang, X.: A positive Barzilai–Borwein-like stepsize and an extension for symmetric linear systems. In: Numerical Analysis and Optimization, pp. 59–75. Springer (2015)
  • (18) Burdakov, O., Dai, Y.H., Huang, N.: Stabilized Barzilai–Borwein method. J. Comput. Math. 37(6), 916–936 (2019)
  • (19) Dai, Y.H., Huang, Y.K., Liu, X.W.: A family of spectral gradient methods for optimization. Comp. Optim. Appl. 74(1), 43–65 (2019)
  • (20) Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
  • (21) Fletcher, R.: On the Barzilai–Borwein method. In: Optimization and Control with Applications, pp. 235–256. Springer, New York (2005)
  • (22) Huang, Y.K., Dai, Y.H., Liu, X.W.: Equipping the Barzilai–Borwein method with the two dimensional quadratic termination property. SIAM J. Optim. 31(4), 3068–3096 (2021)
  • (23) Raydan, M.: The Barzilai–Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
  • (24) Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
  • (25) Crisci, S., Porta, F., Ruggiero, V., Zanni, L.: Spectral properties of Barzilai–Borwein rules in solving singly linearly constrained optimization problems subject to lower and upper bounds. SIAM J. Optim. 30(2), 1300–1326 (2020)
  • (26) Dai, Y.H., Fletcher, R.: Projected Barzilai–Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005)
  • (27) Huang, Y.K., Liu, H.: Smoothing projected Barzilai–Borwein method for constrained non-lipschitz optimization. Comp. Optim. Appl. 65(3), 671–698 (2016)
  • (28) Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: Gradient methods exploiting spectral properties. Optim. Methods Softw. 35(4), 681–705 (2020)
  • (29) Huang, Y.K., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai–Borwein method for nonnegative matrix factorization. Data Min. Knowl. Disc. 29(6), 1665–1684 (2015)
  • (30) Jiang, B., Dai, Y.H.: Feasible Barzilai–Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Method Softw. 28(4), 756–784 (2013)
  • (31) Tan, C., Ma, S., Dai, Y.H., Qian, Y.: Barzilai–Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 685–693 (2016)
  • (32) Wang, Y., Ma, S.: Projected Barzilai–Borwein method for large-scale nonnegative image restoration. Inverse Probl. Sci. En. 15(6), 559–583 (2007)
  • (33) De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comp. Optim. Appl. 59(3), 541–563 (2014)
  • (34) Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299–312 (2008)
  • (35) Sun, C., Liu, J.P.: New stepsizes for the gradient method. Optim. Lett. 14(7), 1943–1955 (2020)
  • (36) Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comp. Optim. Appl. 35(1), 69–86 (2006)
  • (37) Dai, Y.H.: Alternate step gradient method. Optimization 52(4-5), 395–415 (2003)
  • (38) Huang, N.: On $R$-linear convergence analysis for a class of gradient methods. Comput. Optim. Appl. 81(1), 161–177 (2022)
  • (39) Friedlander, A., Martínez, J.M., Molina, B., and Raydan, M.: Gradient method with retards and generalizations. SIAM J. Numer. Anal. 36(1), 275–289 (1998)
  • (40) Huang, Y.K., Dai, Y.H., Liu, X.W., Zhang, H.: On the acceleration of the Barzilai–Borwein method. Comp. Optim. Appl. 81(3), 717–740 (2022)
  • (41) Zou, Q., Magoulès, F.: Fast gradient methods with alignment for symmetric linear systems without using Cauchy step. J. Comput. Appl. Math. 381, 113033 (2021)
  • (42) Dai, Y.H., Hager, W.W., Schittkowski, K., Zhang, H.: The cyclic Barzilai–Borwein method for unconstrained optimization. IMA J. Numer. Anal. 26(3), 604–627 (2006)