
Yakui Huang, Institute of Mathematics, Hebei University of Technology, Tianjin 300401, China. E-mail: [email protected]
Yu-Hong Dai, LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. E-mail: [email protected]
Xin-Wei Liu, Institute of Mathematics, Hebei University of Technology, Tianjin 300401, China. E-mail: [email protected]
Hongchao Zhang, Department of Mathematics, Louisiana State University, Baton Rouge, LA 70803-4918, USA. E-mail: [email protected]

On the acceleration of the Barzilai-Borwein method

Yakui Huang    Yu-Hong Dai    Xin-Wei Liu    Hongchao Zhang
(Received: date / Accepted: date)
Abstract

The Barzilai-Borwein (BB) gradient method is efficient for solving large-scale unconstrained problems to modest accuracy and has the great advantage of being easily extended to solve a wide class of constrained optimization problems. In this paper, we propose a new stepsize to accelerate the BB method by requiring finite termination for minimizing two-dimensional strongly convex quadratic functions. Combined with this new stepsize, we develop gradient methods that adaptively take nonmonotone BB stepsizes and certain monotone stepsizes for minimizing general strongly convex quadratic functions. Furthermore, by incorporating nonmonotone line searches and gradient projection techniques, we extend these new gradient methods to solve general smooth unconstrained and bound constrained optimization. Extensive numerical experiments show that our strategies of properly inserting monotone gradient steps into the nonmonotone BB method can significantly improve its performance, and that the resulting new methods can outperform the most successful gradient descent methods developed in the recent literature.

MSC:
90C20 90C25 90C30

1 Introduction

Gradient descent methods have been widely used for solving smooth unconstrained nonlinear optimization

\min_{x\in\mathbb{R}^{n}} f(x)  (1)

by generating a sequence of iterates

x_{k+1}=x_{k}-\alpha_{k}g_{k},  (2)

where $f:\mathbb{R}^{n}\to\mathbb{R}$ is continuously differentiable, $g_{k}=\nabla f(x_{k})$ and $\alpha_{k}>0$ is the stepsize along the negative gradient. Different gradient descent methods would have different rules for determining the stepsize $\alpha_{k}$. The classic steepest descent (SD) method proposed by Cauchy cauchy1847methode determines its stepsize by the so-called exact line search

\alpha_{k}^{SD}=\arg\min_{\alpha\in\mathbb{R}} f(x_{k}-\alpha g_{k}).  (3)

Although the SD method locally has the most function value reduction along the negative gradient direction, it often performs poorly in practice. Theoretically, when $f$ is a strongly convex quadratic function, i.e.,

f(x)=\frac{1}{2}x^{T}Ax-b^{T}x,  (4)

where $b\in\mathbb{R}^{n}$ and $A\in\mathbb{R}^{n\times n}$ is symmetric and positive definite, the SD method converges $Q$-linearly akaike1959successive and will asymptotically zigzag between two orthogonal directions forsythe1968asymptotic ; huang2019asymptotic .

In 1988, Barzilai and Borwein barzilai1988two proposed the following two novel stepsizes that significantly improve the performance of gradient descent methods:

\alpha_{k}^{BB1}=\frac{s_{k-1}^{T}s_{k-1}}{s_{k-1}^{T}y_{k-1}}\qquad\mbox{and}\qquad\alpha_{k}^{BB2}=\frac{s_{k-1}^{T}y_{k-1}}{y_{k-1}^{T}y_{k-1}},  (5)

where $s_{k-1}=x_{k}-x_{k-1}$ and $y_{k-1}=g_{k}-g_{k-1}$. Clearly, when $s_{k-1}^{T}y_{k-1}>0$, one has $\alpha_{k}^{BB1}\geq\alpha_{k}^{BB2}$. Hence, $\alpha_{k}^{BB1}$ is often called the long BB stepsize while $\alpha_{k}^{BB2}$ is called the short BB stepsize. When the objective function is the quadratic (4), the BB stepsize $\alpha_{k}^{BB1}$ is exactly the steepest descent stepsize, but with one step retard, while $\alpha_{k}^{BB2}$ is the stepsize of the minimal gradient (MG) method dai2003altermin , that is

\alpha_{k}^{BB1}=\frac{g_{k-1}^{T}g_{k-1}}{g_{k-1}^{T}Ag_{k-1}}=\alpha_{k-1}^{SD}\quad\mbox{and}\quad\alpha_{k}^{BB2}=\frac{g_{k-1}^{T}Ag_{k-1}}{g_{k-1}^{T}A^{2}g_{k-1}}=\alpha_{k-1}^{MG}.

It is proved that the Barzilai-Borwein (BB) method converges $R$-superlinearly for minimizing two-dimensional strongly convex quadratic functions barzilai1988two and $R$-linearly for the general $n$-dimensional case dai2002r . Although the BB method does not decrease the objective function value monotonically, extensive numerical experiments show that it performs much better than the SD method fletcher2005barzilai ; raydan1997barzilai ; yuan2008step . It is commonly accepted that, when high accuracy is not required, BB-type methods can even be competitive with nonlinear conjugate gradient (CG) methods for solving smooth unconstrained optimization fletcher2005barzilai ; raydan1997barzilai . Furthermore, by combining with gradient projection techniques, BB-type methods have the great advantage of being easily extended to solve a wide class of constrained optimization problems, for example bound or simplex constrained optimization dai2005projected . Hence, BB-type methods enjoy many important applications, such as image restoration wang2007projected , signal processing liu2011coordinated , eigenvalue problems jiang2013feasible , nonnegative matrix factorization huang2015quadratic , sparse reconstruction wright2009sparse , machine learning tan2016barzilai , etc.
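
For illustration, the BB iteration on the quadratic (4) can be sketched in Python/NumPy as follows. This is only a sketch, not the authors' Matlab implementation; the choice of the first stepsize and the fallback used when $s_{k-1}^{T}y_{k-1}\leq 0$ are illustrative assumptions.

import numpy as np

def bb_stepsizes(s, y):
    # Long and short BB stepsizes (5) from s_{k-1} and y_{k-1}.
    sy = s @ y
    if sy <= 0:
        return None, None          # safeguard; (5) assumes s^T y > 0
    return (s @ s) / sy, sy / (y @ y)

def bb1_method(A, b, x0, tol=1e-6, max_iter=10000):
    # Plain BB1 method on f(x) = 0.5 x^T A x - b^T x.
    x, g = x0.copy(), A @ x0 - b
    alpha = 1.0 / np.linalg.norm(g, np.inf)    # illustrative first stepsize
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        x_new = x - alpha * g
        g_new = A @ x_new - b
        long_bb, _ = bb_stepsizes(x_new - x, g_new - g)
        alpha = long_bb if long_bb is not None else 1.0 / np.linalg.norm(g_new, np.inf)
        x, g = x_new, g_new
    return x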

Recently, Yuan yuan2006new ; yuan2008step proposed a gradient descent method which combines a new stepsize

\alpha_{k}^{Y}=\frac{2}{\frac{1}{\alpha_{k-1}^{SD}}+\frac{1}{\alpha_{k}^{SD}}+\sqrt{\left(\frac{1}{\alpha_{k-1}^{SD}}-\frac{1}{\alpha_{k}^{SD}}\right)^{2}+\frac{4\|g_{k}\|^{2}}{(\alpha_{k-1}^{SD}\|g_{k-1}\|)^{2}}}},  (6)

in the SD method so that the new method enjoys finite termination for minimizing two-dimensional strongly convex quadratic functions. Based on this new stepsize $\alpha_{k}^{Y}$, Dai and Yuan dai2005analysis further developed the DY method, which alternately employs the $\alpha_{k}^{SD}$ and $\alpha_{k}^{Y}$ stepsizes as follows:

\alpha_{k}^{DY}=\left\{\begin{array}{ll}\alpha_{k}^{SD},&\hbox{if mod$(k,4)<2$;}\\ \alpha_{k}^{Y},&\hbox{otherwise.}\end{array}\right.  (7)

It is easy to see that $\alpha_{k}^{Y}\leq\alpha_{k}^{SD}$. Hence, the DY method (7) is a monotone method. Moreover, it is shown that the DY method (7) performs better than the nonmonotone BB method dai2005analysis .
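
As an illustration of (6) and (7), the following Python sketch computes Yuan's stepsize from the current and previous exact SD stepsizes and gradient norms, and applies the DY alternation rule; the function names are illustrative only.

import numpy as np

def yuan_stepsize(alpha_sd_prev, alpha_sd, g_norm, g_prev_norm):
    # Yuan's stepsize (6).
    a, b = 1.0 / alpha_sd_prev, 1.0 / alpha_sd
    root = np.sqrt((a - b) ** 2 + 4.0 * g_norm ** 2 / (alpha_sd_prev * g_prev_norm) ** 2)
    return 2.0 / (a + b + root)

def dy_stepsize(k, alpha_sd_prev, alpha_sd, g_norm, g_prev_norm):
    # DY rule (7): two SD steps followed by two Yuan steps.
    if k % 4 < 2:
        return alpha_sd
    return yuan_stepsize(alpha_sd_prev, alpha_sd, g_norm, g_prev_norm)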

The property of nonmonotonically reducing objective function values is an intrinsic feature underlying the efficiency of the BB method. However, it is also pointed out by Fletcher fletcher2012limited that retaining monotonicity is important for a gradient method, especially for minimizing general objective functions. On the other hand, although the monotone DY method performs well, using $\alpha_{k}^{SD}$ and $\alpha_{k}^{Y}$ in a nonmonotone fashion may yield better performance, see de2014efficient for example. Moreover, it is usually difficult to compute the exact monotone stepsize $\alpha_{k}^{SD}$ in general optimization. Hence, in this paper, motivated by the great success of the BB method and the previous considerations, we aim to further improve and accelerate the nonmonotone BB method by incorporating some monotone steps. For a more general and uniform analysis, we first consider accelerating the class of gradient descent methods (2) for quadratic optimization (4) using the following stepsize

\alpha_{k}(\Psi(A))=\frac{g_{k-1}^{T}\Psi(A)g_{k-1}}{g_{k-1}^{T}\Psi(A)Ag_{k-1}},  (8)

where $\Psi(\cdot)$ is a real analytic function on $[\lambda_{1},\lambda_{n}]$ that can be expressed by a Laurent series

\Psi(z)=\sum_{k=-\infty}^{\infty}c_{k}z^{k},\quad c_{k}\in\mathbb{R},

such that $0<\sum_{k=-\infty}^{\infty}c_{k}z^{k}<+\infty$ for all $z\in[\lambda_{1},\lambda_{n}]$. Here, $\lambda_{1}$ and $\lambda_{n}$ are the smallest and largest eigenvalues of $A$, respectively. Clearly, the method (8) is generally nonmonotone, and the two BB stepsizes $\alpha_{k}^{BB1}$ and $\alpha_{k}^{BB2}$ can be obtained by setting $\Psi(A)=I$ and $\Psi(A)=A$ in (8), respectively.

More particularly, we will derive a new stepsize, say $\tilde{\alpha}_{k}(\Psi(A))$, which together with the stepsize $\alpha_{k}(\Psi(A))$ in (8) can minimize a two-dimensional convex quadratic function in no more than 5 iterations. To the best of our knowledge, this is the first nonmonotone gradient method with the finite termination property. We will see that $\tilde{\alpha}_{k}(I)\leq\alpha_{k}^{SD}$ and $\tilde{\alpha}_{k}(A)\leq\alpha_{k}^{MG}$. Hence, this finite termination property is essentially obtained by inserting monotone stepsizes into the generally nonmonotone stepsizes (8). In fact, we show that this finite termination property can be maintained even if the algorithm uses different functions $\Psi$ during its iterations. Based on this observation, to achieve good numerical performance, we propose an adaptive nonmonotone gradient method (ANGM), which adaptively takes some nonmonotone steps involving the long and short BB stepsizes (5), and some monotone steps using $\tilde{\alpha}_{k}(A)$. Moreover, to efficiently minimize more general nonlinear objective functions, we propose two variants of ANGM, called ANGR1 and ANGR2, using certain retard stepsizes. By combining nonmonotone line search and gradient projection techniques, these two variants are further extended to solve bound constrained optimization. Our numerical experiments show that the newly proposed methods significantly accelerate the BB method and perform much better on minimizing the quadratic function (4) than the most successful gradient descent methods developed in the recent literature, such as the DY method dai2005analysis , the ABBmin2 method frassoldati2008new and the SDC method de2014efficient . In addition, we also compare ANGR1 and ANGR2 with the spectral projected gradient (SPG) method birgin2000nonmonotone ; birgin2014spectral and the BB method using the Dai-Zhang nonmonotone line search (BB1-DZ) dai2001adaptive for solving general unconstrained problems from andrei2008unconstrained and bound constrained problems from the CUTEst collection gould2015cutest . The numerical results highly suggest the potential benefits of our newly proposed methods for solving more general unconstrained and bound constrained optimization.

The paper is organized as follows. In Section 2, we derive the new stepsize $\tilde{\alpha}_{k}(\Psi(A))$ by requiring finite termination for minimizing two-dimensional strongly convex quadratic functions. In Section 3, we first derive the ANGM, ANGR1 and ANGR2 methods for minimizing general strongly convex quadratic functions and then generalize ANGR1 and ANGR2 to solve bound constrained optimization. Our extensive numerical experiments on minimizing strongly convex quadratic functions and solving general unconstrained and bound constrained optimization are presented in Section 4. We finally draw some concluding remarks in Section 5.

2 Derivation of new stepsize

In this section, we derive a new monotone stepsize based on the nonmonotone gradient method (8) to minimize the quadratic function (4). This new stepsize is motivated by requiring finite termination for minimizing two-dimensional strongly convex quadratic functions. Such an idea was originally proposed by Yuan yuan2006new to accelerate the SD method. However, new techniques need to be developed for accelerating the class of nonmonotone gradient descent methods (8), since the key orthogonality property of two successive gradients generated by the SD method no longer holds for the methods (8).

Observe that the method (8) is invariant under translations and rotations when minimizing quadratics. Hence, for the theoretical analysis of minimizing (4), without loss of generality, we may simply assume that the matrix $A$ is diagonal, i.e.,

A=\textrm{diag}\{\lambda_{1},\ldots,\lambda_{n}\},  (9)

where $0<\lambda_{1}\leq\ldots\leq\lambda_{n}$.

2.1 Motivation

First, let us investigate the behavior of the gradient method (8) with $\Psi(A)=I$ (i.e., the BB1 method). In particular, we apply it to the non-random quadratic minimization problem proposed in de2014efficient , which has the form (4) with a diagonal matrix $A$ given by

A_{jj}=10^{\frac{ncond}{n-1}(n-j)},\quad j=1,\ldots,n,  (10)

and $b$ being a null vector. Here, $ncond=\log_{10}\kappa$ and $\kappa>0$ is the condition number of $A$. We set $n=10$, $\kappa=10^{3}$ and use $(10,10,\ldots,10)^{T}$ as the initial point. The iteration is stopped once the gradient norm is reduced by a factor of $10^{-6}$. Denote the $i$-th component of $g_{k}$ by $g_{k}^{(i)}$ and the indices of the components of $g_{k}$ with the two largest magnitudes by $i_{1}$ and $i_{2}$, respectively. Then, the fraction of the magnitudes of the two largest components relative to the whole gradient can be computed by

\Upsilon(g_{k})=\frac{|g_{k}^{(i_{1})}|+|g_{k}^{(i_{2})}|}{\sum_{i=1}^{n}|g_{k}^{(i)}|}.

This $\Upsilon(g_{k})$ is plotted in Fig. 1 (left), where we can see that $\Upsilon(g_{k})\geq 0.8$ holds for more than half of the iterations (145 out of 224 total iterations). Hence, roughly speaking, the searches of the BB1 method are often dominated in some two-dimensional subspaces. The history of the index $i_{1}$ against the iteration number is also plotted in Fig. 1 (right), where we can see that $|g_{k}^{(i_{1})}|$ corresponds more frequently to the largest eigenvalues $\lambda_{10}$ or $\lambda_{9}$. Since

g_{k+1}^{(j)}=(1-\alpha_{k}\lambda_{j})g_{k}^{(j)},\quad j=1,\ldots,n,  (11)

and $1/\lambda_{n}\leq\alpha_{k}\leq 1/\lambda_{1}$, the history of $i_{1}$ in Fig. 1 (right) in fact indicates that most stepsizes generated by the BB1 method are often much larger than $1/\lambda_{10}$ or $1/\lambda_{9}$. As a result, the BB1 method may need many iterations to reduce the corresponding components of the gradient, $g_{k}^{(9)}$ or $g_{k}^{(10)}$.
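
The quantity $\Upsilon(g_{k})$ and the index $i_{1}$ used above can be computed with a few lines of NumPy; this is a small illustrative helper, not part of the original experiments.

import numpy as np

def two_largest_share(g):
    # Upsilon(g_k): share of the gradient's 1-norm carried by its two
    # largest components in magnitude; also returns their indices i1, i2.
    mags = np.abs(g)
    order = np.argsort(mags)
    i1, i2 = order[-1], order[-2]
    return (mags[i1] + mags[i2]) / mags.sum(), i1, i2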

Figure 1: Problem (10) with $n=10$: history of $\Upsilon(g_{k})$ (left) and the index $i_{1}$ (right) generated by the BB1 method

In huang2019asymptotic , we showed that a family of gradient methods including SD and MG will asymptotically reduce their searches in a two-dimensional subspace and could be accelerated by exploiting certain orthogonal properties in this two-dimensional subspace. In a similar spirit, we could also accelerate the convergence of the class of gradient methods (8) in a lower dimensional subspace if certain orthogonal properties hold.

Suppose that, for a given $k>0$, there exists a $q_{k}$ satisfying

(I-\alpha_{k-1}A)q_{k}=g_{k-1}.  (12)

Since this $q_{k}$ is also invariant under translations and rotations, for later analysis we may still assume $A$ in (12) is diagonal as in (9). The following lemma shows a generalized orthogonality property for $q_{k}$ and $g_{k+1}$, which is a key property for deriving our new stepsize in the next subsection.

Lemma 1 (Orthogonal property)

Suppose that the sequence $\{g_{k}\}$ is obtained by applying the gradient method (2) with stepsizes (8) to minimize the quadratic function (4) and $q_{k}$ satisfies (12). Then, we have

q_{k}^{T}\Psi(A)g_{k+1}=0.  (13)
Proof

By (2), (8) and (12) we get

q_{k}^{T}\Psi(A)g_{k+1} = q_{k}^{T}\Psi(A)(I-\alpha_{k}A)(I-\alpha_{k-1}A)g_{k-1}
= g_{k-1}^{T}\Psi(A)(I-\alpha_{k}A)g_{k-1}
= g_{k-1}^{T}\Psi(A)g_{k-1}-\alpha_{k}g_{k-1}^{T}\Psi(A)Ag_{k-1} = 0,

where the second equality uses (12) together with the symmetry of $A$ and the fact that $\Psi(A)$, $I-\alpha_{k}A$ and $I-\alpha_{k-1}A$ commute.

This completes the proof. ∎

2.2 A new stepsize

In this subsection, we derive a new stepsize based on the iterations of the gradient method (8). We show that, by combining the new stepsize with the gradient method (8), finite termination can be achieved for minimizing two-dimensional strongly convex quadratic functions.

By Lemma 1, we have that $g_{k}^{T}\Psi(A)q_{k-1}=0$ for $k>0$. Now, suppose both $\Psi^{r}(A)q_{k-1}$ and $\Psi^{1-r}(A)g_{k}$ are nonzero vectors, where $r\in\mathbb{R}$. Let us consider minimizing the function $f$ in the two-dimensional subspace spanned by $\frac{\Psi^{r}(A)q_{k-1}}{\|\Psi^{r}(A)q_{k-1}\|}$ and $\frac{\Psi^{1-r}(A)g_{k}}{\|\Psi^{1-r}(A)g_{k}\|}$, and let

\varphi(t,l) := f\left(x_{k}+t\frac{\Psi^{r}(A)q_{k-1}}{\|\Psi^{r}(A)q_{k-1}\|}+l\frac{\Psi^{1-r}(A)g_{k}}{\|\Psi^{1-r}(A)g_{k}\|}\right)
= f(x_{k})+\vartheta_{k}^{T}\begin{pmatrix}t\\ l\end{pmatrix}+\frac{1}{2}\begin{pmatrix}t\\ l\end{pmatrix}^{T}H_{k}\begin{pmatrix}t\\ l\end{pmatrix},  (14)

where

\vartheta_{k}=B_{k}g_{k}=\begin{pmatrix}\frac{g_{k}^{T}\Psi^{r}(A)q_{k-1}}{\|\Psi^{r}(A)q_{k-1}\|}\\ \frac{g_{k}^{T}\Psi^{1-r}(A)g_{k}}{\|\Psi^{1-r}(A)g_{k}\|}\end{pmatrix}\mbox{ with }B_{k}=\begin{pmatrix}\frac{\Psi^{r}(A)q_{k-1}}{\|\Psi^{r}(A)q_{k-1}\|},\frac{\Psi^{1-r}(A)g_{k}}{\|\Psi^{1-r}(A)g_{k}\|}\end{pmatrix}^{T}  (15)

and

H_{k}=B_{k}AB_{k}^{T}=\begin{pmatrix}\frac{q_{k-1}^{T}\Psi^{2r}(A)Aq_{k-1}}{\|\Psi^{r}(A)q_{k-1}\|^{2}}&\frac{q_{k-1}^{T}\Psi(A)Ag_{k}}{\|\Psi^{r}(A)q_{k-1}\|\|\Psi^{1-r}(A)g_{k}\|}\\ \frac{q_{k-1}^{T}\Psi(A)Ag_{k}}{\|\Psi^{r}(A)q_{k-1}\|\|\Psi^{1-r}(A)g_{k}\|}&\frac{g_{k}^{T}\Psi^{2(1-r)}(A)Ag_{k}}{\|\Psi^{1-r}(A)g_{k}\|^{2}}\end{pmatrix}.  (16)

Denote the components of $H_{k}$ by $H_{k}^{(ij)}$, $i,j=1,2$, and notice that $B_{k}B_{k}^{T}=I$ by $g_{k}^{T}\Psi(A)q_{k-1}=0$. Then, we have the following finite termination theorem.

Theorem 2.1 (Finite termination)

Suppose that a gradient method (2) is applied to minimize a two-dimensional quadratic function (4) with $\alpha_{k}$ given by (8) for all $k\neq k_{0}$ and uses the stepsize

\tilde{\alpha}_{k_{0}} = \frac{2}{(H_{k_{0}}^{(11)}+H_{k_{0}}^{(22)})+\sqrt{(H_{k_{0}}^{(11)}-H_{k_{0}}^{(22)})^{2}+4(H_{k_{0}}^{(12)})^{2}}}  (17)

at the $k_{0}$-th iteration, where $k_{0}\geq 2$. Then, the method will find the minimizer in at most $k_{0}+3$ iterations.

Proof

Let us suppose $x_{k}$ is not a minimizer for all $k=1,\ldots,k_{0}+2$. We then show that $x_{k_{0}+3}$ must be the minimizer, i.e., $g_{k_{0}+3}=0$. For notational convenience, in the following proof of this theorem, let us simply use $k$ to denote $k_{0}$. First, we show that using the stepsize (17) at the $k$-th iteration implies

g_{k+1}\quad\mbox{is parallel to}\quad -B_{k}^{T}H_{k}^{-1}\vartheta_{k}+\tilde{\alpha}_{k}g_{k},  (18)

where $\vartheta_{k}$, $B_{k}$ and $H_{k}$ are given by (15) and (16). In fact, $\tilde{\alpha}_{k}$ given by (17) satisfies the following quadratic equation

\tilde{\alpha}_{k}^{2}\Delta-\tilde{\alpha}_{k}(H_{k}^{(11)}+H_{k}^{(22)})+1=0,  (19)

where $\Delta=\textrm{det}(H_{k})=\textrm{det}(A)>0$. Let

\Theta=(H_{k}^{(12)}\vartheta_{k}^{(1)}+H_{k}^{(22)}\vartheta_{k}^{(2)})\vartheta_{k}^{(1)}-(H_{k}^{(11)}\vartheta_{k}^{(1)}+H_{k}^{(12)}\vartheta_{k}^{(2)})\vartheta_{k}^{(2)},

where $\vartheta_{k}^{(i)}$, $i=1,2$, are the components of $\vartheta_{k}$. Then, multiplying (19) by $\Theta$, we have

\tilde{\alpha}_{k}^{2}\Delta\Theta-\tilde{\alpha}_{k}(H_{k}^{(11)}+H_{k}^{(22)})\Theta+\Theta=0,  (20)

which is exactly

(H_{k}^{(22)}\vartheta_{k}^{(1)}-H_{k}^{(12)}\vartheta_{k}^{(2)}-\tilde{\alpha}_{k}\Delta\vartheta_{k}^{(1)})[\vartheta_{k}^{(2)}-\tilde{\alpha}_{k}(H_{k}^{(12)}\vartheta_{k}^{(1)}+H_{k}^{(22)}\vartheta_{k}^{(2)})]
= (H_{k}^{(11)}\vartheta_{k}^{(2)}-H_{k}^{(12)}\vartheta_{k}^{(1)}-\tilde{\alpha}_{k}\Delta\vartheta_{k}^{(2)})[\vartheta_{k}^{(1)}-\tilde{\alpha}_{k}(H_{k}^{(11)}\vartheta_{k}^{(1)}+H_{k}^{(12)}\vartheta_{k}^{(2)})].  (21)

The above identity (21) implies that the vector

\begin{pmatrix}\vartheta_{k}^{(1)}-\tilde{\alpha}_{k}(H_{k}^{(11)}\vartheta_{k}^{(1)}+H_{k}^{(12)}\vartheta_{k}^{(2)})\\ \vartheta_{k}^{(2)}-\tilde{\alpha}_{k}(H_{k}^{(12)}\vartheta_{k}^{(1)}+H_{k}^{(22)}\vartheta_{k}^{(2)})\end{pmatrix}

is parallel to

\begin{pmatrix}H_{k}^{(22)}\vartheta_{k}^{(1)}-H_{k}^{(12)}\vartheta_{k}^{(2)}-\tilde{\alpha}_{k}\Delta\vartheta_{k}^{(1)}\\ H_{k}^{(11)}\vartheta_{k}^{(2)}-H_{k}^{(12)}\vartheta_{k}^{(1)}-\tilde{\alpha}_{k}\Delta\vartheta_{k}^{(2)}\end{pmatrix},

which, written in matrix format, just means

\vartheta_{k}+H_{k}(-\tilde{\alpha}_{k}\vartheta_{k})\quad\mbox{is parallel to}\quad H_{k}^{-1}\vartheta_{k}-\tilde{\alpha}_{k}\vartheta_{k}.  (22)

Since $n=2$, we have $B_{k}B_{k}^{T}=B_{k}^{T}B_{k}=I$. Then, it follows from $g_{k}=B_{k}^{T}\vartheta_{k}$, $\vartheta_{k}=B_{k}g_{k}$, $H_{k}=B_{k}AB_{k}^{T}$ and $g_{k+1}=g_{k}-\tilde{\alpha}_{k}Ag_{k}$ that $g_{k+1}=B_{k}^{T}\vartheta_{k}+B_{k}^{T}H_{k}(-\tilde{\alpha}_{k}\vartheta_{k})$. So, we have from (22) that (18) holds. Therefore, (17) implies (18).

Now, it follows from (15) and $H_{k}^{-1}=B_{k}A^{-1}B_{k}^{T}$ that

-B_{k}^{T}H_{k}^{-1}\vartheta_{k}+\tilde{\alpha}_{k}g_{k}=-A^{-1}g_{k}+\tilde{\alpha}_{k}g_{k}=-A^{-1}(g_{k}-\tilde{\alpha}_{k}Ag_{k})=-A^{-1}g_{k+1}.

Hence, (18) implies that $g_{k+1}$ is parallel to $A^{-1}g_{k+1}$. So, if $\tilde{\alpha}_{k}$ given by (17) is used at the $k$-th iteration, then $g_{k+1}$ is parallel to $A^{-1}g_{k+1}$. Since $x_{k+1}$ is not the minimizer, we have $g_{k+1}\neq 0$. So, $g_{k+1}$ is an eigenvector of $A$, i.e., $Ag_{k+1}=\lambda g_{k+1}$ for some $\lambda>0$. Since $x_{k+2}$ is not the minimizer, we have $g_{k+2}\neq 0$ and the algorithm will not stop at the $(k+2)$-th iteration. So, by (8), we have $\alpha_{k+2}=\frac{g_{k+1}^{T}\Psi(A)g_{k+1}}{g_{k+1}^{T}\Psi(A)Ag_{k+1}}=1/\lambda$. Hence, we have $g_{k+3}=(1-\alpha_{k+2}\lambda)g_{k+2}=0$, which implies $x_{k+3}$ must be the minimizer. This completes the proof. ∎

Notice that by setting $k_{0}=2$ in Theorem 2.1, the new gradient method in Theorem 2.1 will find the exact minimizer in at most 5 iterations when minimizing a two-dimensional strongly convex quadratic function. In fact, since $\Delta=\lambda_{1}\lambda_{2}$ and $H_{k}^{(11)}+H_{k}^{(22)}=\lambda_{1}+\lambda_{2}$, the equation (19) has the two positive roots $1/\lambda_{1}$ and $1/\lambda_{2}$. This observation allows us to use the stepsize $\tilde{\alpha}_{k_{0}}$ with some retards, as stated in the following corollary, which leads to a more convenient way of choosing stepsizes when the objective function is not quadratic.

Corollary 1

Suppose that a gradient method is applied to a two-dimensional quadratic function (4) with $\alpha_{k_{0}+m}=\tilde{\alpha}_{k_{0}}$ for $k_{0}\geq 2$ and some positive integer $m$, and with $\alpha_{k}$ given by (8) for all $k\neq k_{0}+m$. Then, the method stops in at most $k_{0}+m+3$ iterations.

By setting $\Psi(A)=I$, $\Psi(A)=A$ and $r=1/2$ in (16), and setting $k_{0}=k$ in (17), we can derive the following two stepsizes:

\tilde{\alpha}_{k}^{BB1}=\frac{2}{\frac{q_{k-1}^{T}Aq_{k-1}}{\|q_{k-1}\|^{2}}+\frac{1}{\alpha_{k}^{SD}}+\sqrt{\left(\frac{q_{k-1}^{T}Aq_{k-1}}{\|q_{k-1}\|^{2}}-\frac{1}{\alpha_{k}^{SD}}\right)^{2}+\frac{4(q_{k-1}^{T}Ag_{k})^{2}}{\|q_{k-1}\|^{2}\|g_{k}\|^{2}}}}  (23)

and

\tilde{\alpha}_{k}^{BB2}=\frac{2}{\frac{1}{\hat{\alpha}_{k-1}}+\frac{1}{\alpha_{k}^{MG}}+\sqrt{\left(\frac{1}{\hat{\alpha}_{k-1}}-\frac{1}{\alpha_{k}^{MG}}\right)^{2}+\Gamma_{k}}},  (24)

respectively, where

\hat{\alpha}_{k}=\frac{q_{k}^{T}Aq_{k}}{q_{k}^{T}A^{2}q_{k}}\quad\textrm{and}\quad\Gamma_{k}=\frac{4(q_{k-1}^{T}A^{2}g_{k})^{2}}{q_{k-1}^{T}Aq_{k-1}\cdot g_{k}^{T}Ag_{k}}.  (25)

By (23) and (24), we have

\tilde{\alpha}_{k}^{BB1}\leq\min\left\{\alpha_{k}^{SD},\frac{\|q_{k-1}\|^{2}}{q_{k-1}^{T}Aq_{k-1}}\right\}\quad\textrm{and}\quad\tilde{\alpha}_{k}^{BB2}\leq\min\{\alpha_{k}^{MG},\hat{\alpha}_{k-1}\}.  (26)

Hence, $\tilde{\alpha}_{k}^{BB1}$ and $\tilde{\alpha}_{k}^{BB2}$ are short monotone steps for reducing the function value and the gradient norm of the objective, respectively. It follows from Theorem 2.1 that, by inserting the monotone steps $\tilde{\alpha}_{k}^{BB1}$ and $\tilde{\alpha}_{k}^{BB2}$ into the BB1 and BB2 methods, respectively, the resulting gradient methods have finite termination for minimizing two-dimensional strongly convex quadratic functions.
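
In the quadratic case, the two monotone stepsizes (23) and (24)-(25) can be evaluated directly from $A$, $q_{k-1}$ and $g_{k}$, as in the following Python sketch (for general functions the matrix-free formulas of Remark 1 below are used instead; the function names here are illustrative).

import numpy as np

def alpha_tilde_bb1(A, q_prev, g):
    # Stepsize (23): q_prev is q_{k-1}, g is the current gradient g_k.
    Aq, Ag = A @ q_prev, A @ g
    a = (q_prev @ Aq) / (q_prev @ q_prev)      # Rayleigh quotient of q_{k-1}
    b = (g @ Ag) / (g @ g)                     # 1 / alpha_k^{SD}
    gamma = 4.0 * (q_prev @ Ag) ** 2 / ((q_prev @ q_prev) * (g @ g))
    return 2.0 / (a + b + np.sqrt((a - b) ** 2 + gamma))

def alpha_tilde_bb2(A, q_prev, g):
    # Stepsize (24) with hat{alpha}_{k-1} and Gamma_k from (25).
    Aq, Ag = A @ q_prev, A @ g
    inv_hat = (Aq @ Aq) / (q_prev @ Aq)        # 1 / hat{alpha}_{k-1}
    inv_mg = (Ag @ Ag) / (g @ Ag)              # 1 / alpha_k^{MG}
    gamma = 4.0 * (Aq @ Ag) ** 2 / ((q_prev @ Aq) * (g @ Ag))
    return 2.0 / (inv_hat + inv_mg + np.sqrt((inv_hat - inv_mg) ** 2 + gamma))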

To numerically verify this finite termination property, we apply the method (8) with $\Psi(A)=I$ (i.e., the BB1 method) and $\tilde{\alpha}_{2}^{BB1}$ given by (23) to minimize a two-dimensional quadratic function (4) with

A=\textrm{diag}\{1,\lambda\}\quad\mbox{and}\quad b=0.  (27)

We run the algorithm for five iterations using ten random starting points. The averaged values of $\|g_{5}\|$ and $f(x_{5})$ are presented in Table 1. Moreover, we also run the BB1 method for comparison. We can observe that, for different values of $\lambda$, the values of $\|g_{5}\|$ and $f(x_{5})$ obtained by the BB1 method with $\tilde{\alpha}_{2}^{BB1}$ given by (23) are numerically very close to zero. However, even for the case $\lambda=10$, the values of $\|g_{5}\|$ and $f(x_{5})$ obtained by the pure BB1 method are far from zero. These numerical results coincide with our analysis and show that the nonmonotone method (8) can be significantly accelerated by incorporating proper monotone steps.

Table 1: Averaged results on problem (27) with different condition numbers
$\lambda$   BB1   BB1 with $\tilde{\alpha}_{2}^{BB1}$ given by (23)
   $\|g_{5}\|$   $f(x_{5})$   $\|g_{5}\|$   $f(x_{5})$
10 6.6873e+00 4.8701e+00 1.1457e-16 5.8735e-31
100 8.6772e+01 1.9969e+02 2.1916e-16 2.8047e-30
1000 2.0925e+02 9.7075e+01 3.4053e-19 2.3730e-29
10000 1.7943e+02 9.6935e+00 5.1688e-19 6.7870e-28

3 New methods

In this section, based on the above analysis, we propose an adaptive nonmonotone gradient method (ANGM) and its two variants, ANGR1 and ANGR2, for solving both unconstrained and box constrained optimization. These new gradient methods adaptively take some nonmonotone steps involving the long and short BB stepsizes (5), and some monotone steps using the new stepsize developed in the previous section.

3.1 Quadratic case

As mentioned in Section 2.1, the stepsizes $\alpha_{k}^{BB1}$ generated by the BB1 method may be far from the reciprocals of the largest eigenvalues of the Hessian matrix $A$ of the quadratic function (4). In other words, the stepsize $\alpha_{k}^{BB1}$ may be too large to effectively decrease the components of the gradient $g_{k}$ corresponding to the first several largest eigenvalues, which, by (11), can be greatly reduced when small stepsizes are employed. In addition, it has been observed in many recent works that gradient methods using long and short stepsizes adaptively generally perform much better than those using only one type of stepsize, see for example dai2003altermin ; de2014efficient ; di2018steplength ; gonzaga2016steepest ; huang2019gradient ; huang2019asymptotic . So, we would like to develop gradient methods that combine the two nonmonotone BB stepsizes with the short monotone stepsize given by (17).

We first extend the orthogonal property developed in Lemma 1 and the finite termination result given in Theorem 2.1.

Lemma 2 (Generalized orthogonal property)

Suppose that a gradient method (2) with stepsizes in the form of (8) is applied to minimize a quadratic function (4). In particular, at the $(k-1)$-th and $k$-th iterations, the two stepsizes $\alpha_{k-1}(\Psi(A))$ and $\alpha_{k}(\Psi_{1}(A))$ are used, respectively, where $\Psi$ and $\Psi_{1}$ may be two different analytic functions used in (8). If $q_{k}\in\mathbb{R}^{n}$ satisfies

(I-\alpha_{k-1}(\Psi(A))A)q_{k}=g_{k-1},  (28)

then we have

q_{k}^{T}\Psi_{1}(A)g_{k+1}=0.  (29)
Proof

Notice that by (2), we have

g_{k}=g_{k-1}-\alpha_{k-1}(\Psi(A))Ag_{k-1}\quad\mbox{and}\quad g_{k+1}=g_{k}-\alpha_{k}(\Psi_{1}(A))Ag_{k}.

Then, the proof is essentially the same as that of Lemma 1. ∎

Based on Lemma 2 and using the same arguments as in the proof of Theorem 2.1, we can obtain the following finite termination result even when different functions $\Psi$ are used in (8) to obtain the stepsizes.

Theorem 3.1 (Generalized finite termination)

Suppose that a gradient method (2) is applied to minimize a two-dimensional quadratic function (4) with $\alpha_{k}$ given by (8) for all $k\neq k_{0}$ and $k\neq k_{0}-1$, and uses the stepsize $\alpha_{k_{0}-1}(\Psi_{1}(A))$ at the $(k_{0}-1)$-th iteration and the stepsize $\tilde{\alpha}_{k_{0}}$ given by (17) with $\Psi$ replaced by $\Psi_{1}$ at the $k_{0}$-th iteration, where $k_{0}\geq 2$. Then, the method will find the minimizer in at most $k_{0}+3$ iterations.

Theorem 3.1 allows us to incorporate the nonmonotone BB stepsizes $\alpha_{k}^{BB1}$ and $\alpha_{k}^{BB2}$ and the short monotone stepsize $\tilde{\alpha}_{k}^{BB2}$ in one gradient method. Alternate and adaptive schemes have been employed for choosing long and short stepsizes in BB-type methods dai2005projected ; zhou2006gradient , and recent studies show that adaptive strategies are preferable to alternate schemes dhl2018 ; zhou2006gradient . Hence, we would like to develop adaptive strategies to choose proper stepsizes for our new gradient methods. In particular, our adaptive nonmonotone gradient method (ANGM) takes the long BB stepsize $\alpha_{k}^{BB1}$ when $\alpha_{k}^{BB2}/\alpha_{k}^{BB1}\geq\tau_{1}$ for some $\tau_{1}\in(0,1)$. Otherwise, a short stepsize $\alpha_{k}^{BB2}$ or $\tilde{\alpha}_{k}^{BB2}$ is taken, depending on the ratio $\|g_{k-1}\|/\|g_{k}\|$. Notice that $\alpha_{k}^{BB2}$ minimizes the gradient in the sense that

\alpha_{k}^{BB2}=\alpha_{k-1}^{MG}=\arg\min_{\alpha\in\mathbb{R}}\|g_{k-1}-\alpha Ag_{k-1}\|.

So, when $\|g_{k-1}\|/\|g_{k}\|>\tau_{2}$ for some $\tau_{2}>1$, i.e. the gradient norm decreases, the previous stepsize $\alpha_{k-1}$ is often a reasonable approximation of $\alpha_{k}^{BB2}$. Our numerical experiments show that, when the BB method is applied, the searches are often dominated in some two-dimensional subspaces, and the new gradient method in Theorem 3.1 has finite termination for minimizing two-dimensional quadratic functions when the new stepsize $\tilde{\alpha}_{k}^{BB2}$ is applied after some BB2 steps. Hence, ANGM employs the new monotone stepsize $\tilde{\alpha}_{k}^{BB2}$ when $\|g_{k-1}\|\geq\tau_{2}\|g_{k}\|$; otherwise, certain BB2 steps are taken. In practice, we find that when $\|g_{k-1}\|<\tau_{2}\|g_{k}\|$, ANGM often has good performance by taking the stepsize $\min\{\alpha_{k}^{BB2},\alpha_{k-1}^{BB2}\}$. To summarize, ANGM applies the following adaptive strategy for choosing stepsizes:

\alpha_{k}=\begin{cases}\min\{\alpha_{k}^{BB2},\alpha_{k-1}^{BB2}\},&\text{if $\alpha_{k}^{BB2}<\tau_{1}\alpha_{k}^{BB1}$ and $\|g_{k-1}\|<\tau_{2}\|g_{k}\|$};\\ \tilde{\alpha}_{k}^{BB2},&\text{if $\alpha_{k}^{BB2}<\tau_{1}\alpha_{k}^{BB1}$ and $\|g_{k-1}\|\geq\tau_{2}\|g_{k}\|$};\\ \alpha_{k}^{BB1},&\text{otherwise}.\end{cases}  (30)

Notice that the calculation of $\tilde{\alpha}_{k}^{BB2}$ needs $\alpha_{k}^{MG}$, which is not easy to obtain when the objective function is not quadratic. Instead, the calculation of $\tilde{\alpha}_{k-1}^{BB2}$ only requires $\alpha_{k}^{BB2}$, which is readily available even for general objective functions. Moreover, recent research has found that gradient methods using retard stepsizes can often lead to better performance friedlander1998gradient . Hence, in the first variant of ANGM, we simply replace $\tilde{\alpha}_{k}^{BB2}$ in (30) by $\tilde{\alpha}_{k-1}^{BB2}$, i.e., the stepsizes are chosen as

\alpha_{k}=\begin{cases}\min\{\alpha_{k}^{BB2},\alpha_{k-1}^{BB2}\},&\text{if $\alpha_{k}^{BB2}<\tau_{1}\alpha_{k}^{BB1}$ and $\|g_{k-1}\|<\tau_{2}\|g_{k}\|$};\\ \tilde{\alpha}_{k-1}^{BB2},&\text{if $\alpha_{k}^{BB2}<\tau_{1}\alpha_{k}^{BB1}$ and $\|g_{k-1}\|\geq\tau_{2}\|g_{k}\|$};\\ \alpha_{k}^{BB1},&\text{otherwise}.\end{cases}  (31)

We call the gradient method using stepsize (31) ANGR1. On the other hand, since the calculation of $\tilde{\alpha}_{k-1}^{BB2}$ also needs $\hat{\alpha}_{k-2}$ and $\Gamma_{k-1}$, and since by (26)

\tilde{\alpha}_{k-1}^{BB2}\leq\min\{\alpha_{k}^{BB2},\hat{\alpha}_{k-2}\},  (32)

to simplify ANGR1 we may further replace $\tilde{\alpha}_{k-1}^{BB2}$ in (31) by its upper bound in (32). As a result, we have the second variant of ANGM, which chooses stepsizes as

\alpha_{k}=\begin{cases}\min\{\alpha_{k}^{BB2},\alpha_{k-1}^{BB2}\},&\text{if $\alpha_{k}^{BB2}<\tau_{1}\alpha_{k}^{BB1}$ and $\|g_{k-1}\|<\tau_{2}\|g_{k}\|$};\\ \min\{\alpha_{k}^{BB2},\hat{\alpha}_{k-2}\},&\text{if $\alpha_{k}^{BB2}<\tau_{1}\alpha_{k}^{BB1}$ and $\|g_{k-1}\|\geq\tau_{2}\|g_{k}\|$};\\ \alpha_{k}^{BB1},&\text{otherwise}.\end{cases}  (33)

We call the gradient method using stepsize (33) ANGR2.
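
The selection rules (30) and (33) amount to a few comparisons per iteration; a minimal Python sketch is given below. The argument names and default parameter values are illustrative, not the tuned values used in Section 4.

def angm_stepsize(abb1, abb2, abb2_prev, atilde_bb2, gnorm_prev, gnorm,
                  tau1=0.4, tau2=1.0):
    # Rule (30) of ANGM for the quadratic case.
    if abb2 < tau1 * abb1:
        if gnorm_prev < tau2 * gnorm:
            return min(abb2, abb2_prev)        # nonmonotone short step
        return atilde_bb2                      # monotone step (24)
    return abb1                                # long BB1 step

def angr2_stepsize(abb1, abb2, abb2_prev, ahat_km2, gnorm_prev, gnorm,
                   tau1=0.4, tau2=1.0):
    # Rule (33) of ANGR2, where (24) is replaced by its upper bound (32).
    if abb2 < tau1 * abb1:
        if gnorm_prev < tau2 * gnorm:
            return min(abb2, abb2_prev)
        return min(abb2, ahat_km2)
    return abb1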

In terms of global convergence for minimizing the quadratic function (4), by (26) we can easily show the $R$-linear global convergence of ANGM, since it satisfies the property in dai2003alternate . Similarly, the $R$-linear convergence of ANGR1 and ANGR2 can also be established; see the proof of Theorem 3 in dhl2018 for example.

Remark 1

Compared with other gradient methods, ANGM, ANGR1 and ANGR2 do not need additional matrix-vector products. In fact, it follows from (12) that $Aq_{k}=\frac{1}{\alpha_{k-1}}(q_{k}-g_{k-1})$, which gives

\hat{\alpha}_{k}=\frac{q_{k}^{T}Aq_{k}}{q_{k}^{T}A^{2}q_{k}}=\frac{\alpha_{k-1}q_{k}^{T}(q_{k}-g_{k-1})}{(q_{k}-g_{k-1})^{T}(q_{k}-g_{k-1})}.  (34)

Hence, no additional matrix-vector products are needed for the calculation of $\hat{\alpha}_{k-1}$ in $\tilde{\alpha}_{k}^{BB2}$, of $\hat{\alpha}_{k-2}$ in $\tilde{\alpha}_{k-1}^{BB2}$, or of the stepsize used in ANGR2. Since the calculation of $Ag_{k}$ is necessary for the calculation of $g_{k+1}$, $\Gamma_{k}$ in $\tilde{\alpha}_{k}^{BB2}$ requires no additional matrix-vector products either. As for $\tilde{\alpha}_{k-1}^{BB2}$, it follows from $g_{k-1}^{T}A^{2}q_{k-2}=\frac{1}{\alpha_{k-3}}(q_{k-2}-g_{k-3})^{T}Ag_{k-1}$ and $Ag_{k-1}=\frac{1}{\alpha_{k-1}}(g_{k-1}-g_{k})$ that

\Gamma_{k-1} =\frac{4((q_{k-2}-g_{k-3})^{T}Ag_{k-1})^{2}}{\alpha_{k-3}((q_{k-2}-g_{k-3})^{T}q_{k-2})\cdot g_{k-1}^{T}Ag_{k-1}}
=\frac{4((q_{k-2}-g_{k-3})^{T}(g_{k-1}-g_{k}))^{2}}{\alpha_{k-3}\alpha_{k-1}((q_{k-2}-g_{k-3})^{T}q_{k-2})\cdot g_{k-1}^{T}(g_{k-1}-g_{k})}.  (35)

Thus, no additional matrix-vector products are required for the calculation of $\Gamma_{k-1}$ in $\tilde{\alpha}_{k-1}^{BB2}$.
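
The following sketch implements (34) and (35) using only stored vectors and stepsizes, reflecting the observation that no extra matrix-vector products are needed; inputs are NumPy vectors and the variable names are illustrative.

def alpha_hat(alpha_prev, q, g_prev):
    # Formula (34): hat{alpha}_k from q_k, g_{k-1} and alpha_{k-1}.
    d = q - g_prev                    # equals alpha_{k-1} * A q_k by (12)
    return alpha_prev * (q @ d) / (d @ d)

def gamma_retard(alpha_km3, alpha_km1, q_km2, g_km3, g_km1, g_k):
    # Formula (35): Gamma_{k-1} assembled from previously stored vectors.
    d = q_km2 - g_km3                 # equals alpha_{k-3} * A q_{k-2}
    e = g_km1 - g_k                   # equals alpha_{k-1} * A g_{k-1}
    num = 4.0 * (d @ e) ** 2
    den = alpha_km3 * alpha_km1 * (d @ q_km2) * (g_km1 @ e)
    return num / den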

Remark 2

Notice that all the new methods, ANGM, ANGR1 and ANGR2, require the vector $q_{k}$ for the calculation of their stepsizes. However, computing $q_{k}$ exactly from (12) may be as difficult as minimizing the quadratic function itself. Notice that the $q_{k}$ satisfying (12) also satisfies the secant equation

q_{k}^{T}g_{k}=\|g_{k-1}\|^{2}.  (36)

Hence, we may find an approximation of $q_{k}$ by requiring the above secant condition to hold. One efficient way to find such a $q_{k}$ satisfying the secant equation (36) is to simply treat the Hessian matrix $A$ as the diagonal matrix (9) and derive $q_{k}$ from (12), that is, when $g_{k}^{(i)}\neq 0$,

q_{k}^{(i)}=\frac{g_{k-1}^{(i)}}{1-\alpha_{k-1}\lambda_{i}}=\frac{(g_{k-1}^{(i)})^{2}}{g_{k}^{(i)}},\quad i=1,\ldots,n,  (37)

and we can simply let $q_{k}^{(i)}=0$ if $g_{k}^{(i)}=0$. To summarize, the approximate $q_{k}$ can be computed by

q_{k}^{(i)}=\left\{\begin{array}{ll}\frac{(g_{k-1}^{(i)})^{2}}{g_{k}^{(i)}},&\hbox{if $g_{k}^{(i)}\neq 0$;}\\ 0,&\hbox{if $g_{k}^{(i)}=0$.}\end{array}\right.  (38)

As we will see in Section 4, this simple way of calculating $q_{k}$ leads to a very efficient algorithm.
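
Formula (38) is cheap to apply componentwise; a minimal NumPy sketch:

import numpy as np

def approx_q(g_prev, g):
    # Approximate q_k by (38): (g_{k-1}^{(i)})^2 / g_k^{(i)}, and 0 where g_k^{(i)} = 0.
    q = np.zeros_like(g, dtype=float)
    nz = g != 0
    q[nz] = g_prev[nz] ** 2 / g[nz]
    return q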

For a simple illustration of the numerical behavior of ANGR1, we again applied ANGR1 with $\tau_{1}=0.85$ and $\tau_{2}=1.3$ to solve problem (10) with $n=10$. Fig. 2 shows the largest component $|g_{k}^{(i_{1})}|$ of the gradient generated by the BB1 and ANGR1 methods against the iteration number, where a circle means the ANGR1 method takes the new stepsize $\tilde{\alpha}_{k}^{BB2}$ at that iteration. It can be seen that $|g_{k}^{(i_{1})}|$ generated by the BB1 method often increases significantly and reaches much larger values. On the other hand, $|g_{k}^{(i_{1})}|$ generated by the ANGR1 method is often reduced and kept small after the new stepsize $\tilde{\alpha}_{k}^{BB2}$ is applied. A detailed correspondence of $i_{1}$ and $\lambda_{j}$ is presented in Table 2, where $n_{j}$ is the total number of iterations for which $i_{1}=j$, $j=1,\ldots,10$. We can see from the last three columns in Table 2 that ANGR1 is much more efficient than BB1 for decreasing those components of $g_{k}$ corresponding to large eigenvalues. Hence, the undesired behavior of BB1 discussed in the motivating Section 2.1 is greatly eliminated by ANGR1.

Figure 2: Problem (10) with $n=10$: history of $|g_{k}^{(i_{1})}|$ generated by the BB1 and ANGR1 methods
Table 2: The correspondence of $i_{1}$ and $\lambda_{j}$ by the BB1 and ANGR1 methods
$n_{j}$   1 2 3 4 5 6 7 8 9 10
BB1 40 5 6 4 6 2 10 21 42 87
ANGR1 51 4 22 18 16 4 1 8 12 17

3.2 Bound constrained case

In this subsection, we extend the ANGR1 and ANGR2 methods to solve the bound constrained optimization problem

\min_{x\in\Omega} f(x),  (39)

where $f$ is Lipschitz continuously differentiable on the feasible set $\Omega=\{x\in\mathbb{R}^{n}~|~l\leq x\leq u\}$. Here, $l\leq x\leq u$ means componentwise $l_{i}\leq x_{i}\leq u_{i}$ for all $i=1,\ldots,n$. Clearly, when $l_{i}=-\infty$ and $u_{i}=+\infty$ for all $i$, problem (39) reduces to smooth unconstrained optimization.

Our methods will incorporate the gradient projection strategy and update the iterates as

x_{k+1}=x_{k}+\lambda_{k}d_{k},

with $\lambda_{k}$ being a step length determined by some line search and $d_{k}$ being the search direction given by

d_{k}=P_{\Omega}(x_{k}-\alpha_{k}g_{k})-x_{k},  (40)

where $P_{\Omega}(\cdot)$ is the Euclidean projection onto $\Omega$ and $\alpha_{k}$ is our proposed stepsize.

It is well known that, when the problem is nondegenerate, the components of the iterates generated by gradient descent methods that correspond to optimal solution components at the boundary will eventually remain unchanged. Hence, in huang2019gradient , the authors suggest using the following modified BB stepsizes for bound constrained optimization

\bar{\alpha}_{k}^{BB1}=\frac{s_{k-1}^{T}s_{k-1}}{s_{k-1}^{T}\bar{y}_{k-1}}\quad\mbox{and}\quad\bar{\alpha}_{k}^{BB2}=\frac{s_{k-1}^{T}\bar{y}_{k-1}}{\bar{y}_{k-1}^{T}\bar{y}_{k-1}},  (41)

where $\bar{y}_{k-1}$ is given by

\bar{y}_{k-1}^{(i)}=\left\{\begin{array}{ll}0,&\hbox{ if $s_{k-1}^{(i)}=0$;}\\ g_{k}^{(i)}-g_{k-1}^{(i)},&\hbox{ otherwise.}\end{array}\right.  (42)

Notice that $\alpha_{k}^{BB1}=\bar{\alpha}_{k}^{BB1}$. We also adopt this modification for solving bound constrained optimization and replace the two BB stepsizes in our new methods by $\bar{\alpha}_{k}^{BB1}$ and $\bar{\alpha}_{k}^{BB2}$.
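
A sketch of the modified BB stepsizes (41) with $\bar{y}_{k-1}$ from (42); the treatment of the case $s_{k-1}^{T}\bar{y}_{k-1}\leq 0$ is left to the caller, as in Algorithm 1 below.

import numpy as np

def modified_bb_stepsizes(s, g_prev, g):
    # Stepsizes (41): components of y are zeroed where the corresponding
    # component of s is zero, cf. (42).
    y_bar = np.where(s == 0, 0.0, g - g_prev)
    sy = s @ y_bar
    if sy <= 0:
        return None, None
    return (s @ s) / sy, sy / (y_bar @ y_bar)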

As mentioned before, we expect to take short steps when using our new stepsizes. Since (32) may not hold for general functions, we impose $\bar{\alpha}_{k}^{BB2}$ as a safeguard. As a result, our ANGR1 and ANGR2 methods for solving bound constrained optimization employ the following stepsizes:

\bar{\alpha}_{k}=\begin{cases}\min\{\bar{\alpha}_{k-1}^{BB2},\bar{\alpha}_{k}^{BB2}\},&\text{if $\bar{\alpha}_{k}^{BB2}<\tau_{1}\bar{\alpha}_{k}^{BB1}$ and $\|\bar{g}_{k-1}\|<\tau_{2}\|\bar{g}_{k}\|$};\\ \min\{\bar{\alpha}_{k}^{BB2},\tilde{\alpha}_{k-1}^{BB2}\},&\text{if $\bar{\alpha}_{k}^{BB2}<\tau_{1}\bar{\alpha}_{k}^{BB1}$ and $\|\bar{g}_{k-1}\|\geq\tau_{2}\|\bar{g}_{k}\|$};\\ \bar{\alpha}_{k}^{BB1},&\text{otherwise},\end{cases}  (43)

and

\bar{\alpha}_{k}=\begin{cases}\min\{\bar{\alpha}_{k-1}^{BB2},\bar{\alpha}_{k}^{BB2}\},&\text{if $\bar{\alpha}_{k}^{BB2}<\tau_{1}\bar{\alpha}_{k}^{BB1}$ and $\|\bar{g}_{k-1}\|<\tau_{2}\|\bar{g}_{k}\|$};\\ \min\{\bar{\alpha}_{k}^{BB2},\hat{\alpha}_{k-2}\},&\text{if $\bar{\alpha}_{k}^{BB2}<\tau_{1}\bar{\alpha}_{k}^{BB1}$ and $\|\bar{g}_{k-1}\|\geq\tau_{2}\|\bar{g}_{k}\|$};\\ \bar{\alpha}_{k}^{BB1},&\text{otherwise},\end{cases}  (44)

respectively, where $\tau_{1}\in(0,1)$, $\tau_{2}\geq 1$, and $\bar{g}_{k}=P_{\Omega}(x_{k}-g_{k})-x_{k}$.

The overall algorithm of ANGR1 and ANGR2 for solving the bound constrained optimization problem (39) is given in Algorithm 1, where the adaptive nonmonotone line search of Dai and Zhang dai2001adaptive is employed to ensure global convergence and achieve better numerical performance. In particular, the step length $\lambda_{k}=1$ is accepted if

f(x_{k}+d_{k})\leq f_{r}+\sigma g_{k}^{T}d_{k},  (45)

where $f_{r}$ is the so-called reference function value, which is adaptively updated by the rules given in dai2001adaptive , and $\sigma\in(0,1)$ is a line search parameter. If (45) is not satisfied, an Armijo-type backtracking line search is performed to find a step length $\lambda_{k}$ such that

f(x_{k}+\lambda_{k}d_{k})\leq\min\{f_{\max},f_{r}\}+\sigma\lambda_{k}g_{k}^{T}d_{k},  (46)

where $f_{\max}$ is the maximal function value over the most recent $M$ iterations, i.e.,

f_{\max}=\max_{0\leq i\leq\min\{k,M-1\}}f(x_{k-i}).

This nonmonotone line search has been observed to be especially suitable for BB-type methods dai2001adaptive . Moreover, under standard assumptions, Algorithm 1 ensures convergence in the sense that $\lim\inf_{k\rightarrow\infty}\|\bar{g}_{k}\|=0$; see hager2006new .
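
Only the acceptance tests (45) and (46) are sketched below; the update of the reference value $f_{r}$ follows the rules of Dai and Zhang dai2001adaptive and is assumed to be handled by the caller, and the shrink factor and $\sigma$ shown are illustrative defaults.

def nonmonotone_step(f, x, d, gTd, f_r, f_hist, sigma=1e-4, shrink=0.5, M=10):
    # Test (45): accept the unit step length if possible.
    if f(x + d) <= f_r + sigma * gTd:
        return 1.0
    # Otherwise backtrack until (46) holds, with f_max over the last M values.
    f_max = max(f_hist[-M:])
    lam = shrink
    while f(x + lam * d) > min(f_max, f_r) + sigma * lam * gTd:
        lam *= shrink
    return lam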

Algorithm 1 Adaptive nonmonotone gradient method with retard steps
Input: $x_{0}\in\mathbb{R}^{n}$, $\epsilon,\sigma\in(0,1)$, $\tau>0$, $M\in\mathbb{N}$, $0<\alpha_{\min}\leq\alpha_{\max}$, $\alpha_{0}\in[\alpha_{\min},\alpha_{\max}]$, $k:=0$.
while $\|\bar{g}_{k}\|>\epsilon$ do
    Compute the search direction $d_{k}$ by (40);
    Determine $\lambda_{k}$ by the Dai-Zhang nonmonotone line search;
    $x_{k+1}=x_{k}+\lambda_{k}d_{k}$;
    Compute $s_{k}=x_{k+1}-x_{k}$ and $\bar{y}_{k}$ by (42);
    if $s_{k}^{T}\bar{y}_{k}>0$ then
        Compute the stepsize $\bar{\alpha}_{k+1}$ by (43) or (44);
        $\alpha_{k+1}=\max\{\alpha_{\min},\min\{\bar{\alpha}_{k+1},\alpha_{\max}\}\}$;
    else
        $\alpha_{k+1}=1/\|\bar{g}_{k+1}\|_{\infty}$;
    end if
    $k:=k+1$;
end while

4 Numerical results

In this section, we present numerical comparisons of ANGM, ANGR1, ANGR2 with some recent very successful gradient descent methods on solving quadratic, general unconstrained and bound constrained problems. All the comparison methods were implemented in Matlab (v.9.0-R2016a) and run on a laptop with an Intel Core i7, 2.9 GHz processor and 8 GB of RAM running Windows 10 system.

4.1 Quadratic problems

In this subsection, we compare ANGM, ANGR1 and ANGR2 with the BB1 barzilai1988two , DY dai2005analysis , ABBmin2 frassoldati2008new , and SDC de2014efficient methods on solving quadratic optimization problems.

We first solve some randomly generated quadratic problems from yuan2006new . Particularly, we solve

\min_{x\in\mathbb{R}^{n}} f(x)=(x-x^{*})^{T}V(x-x^{*}),  (47)

where $x^{*}$ is randomly generated with components between $-10$ and $10$, $V=\textrm{diag}\{v_{1},\ldots,v_{n}\}$ is a diagonal matrix with $v_{1}=1$ and $v_{n}=\kappa$, and $v_{j}$, $j=2,\ldots,n-1$, are generated by the rand function between 1 and $\kappa$.

Table 3: Distributions of $v_{j}$
Problem   Spectrum
1   $\{v_{2},\ldots,v_{n-1}\}\subset(1,\kappa)$
2   $\{v_{2},\ldots,v_{n/5}\}\subset(1,100)$
    $\{v_{n/5+1},\ldots,v_{n-1}\}\subset(\frac{\kappa}{2},\kappa)$
3   $\{v_{2},\ldots,v_{n/2}\}\subset(1,100)$
    $\{v_{n/2+1},\ldots,v_{n-1}\}\subset(\frac{\kappa}{2},\kappa)$
4   $\{v_{2},\ldots,v_{4n/5}\}\subset(1,100)$
    $\{v_{4n/5+1},\ldots,v_{n-1}\}\subset(\frac{\kappa}{2},\kappa)$
5   $\{v_{2},\ldots,v_{n/5}\}\subset(1,100)$
    $\{v_{n/5+1},\ldots,v_{4n/5}\}\subset(100,\frac{\kappa}{2})$
    $\{v_{4n/5+1},\ldots,v_{n-1}\}\subset(\frac{\kappa}{2},\kappa)$

We tested five sets of problems (47) with $n=1{,}000$ using the different spectral distributions of the Hessian listed in Table 3. The algorithm is stopped once the number of iterations exceeds 20,000 or the gradient norm is reduced by a factor of $\epsilon$, which is set to $10^{-6}$, $10^{-9}$ and $10^{-12}$, respectively. Three different condition numbers, $\kappa=10^{4},10^{5}$ and $10^{6}$, are tested. For each value of $\kappa$ or $\epsilon$, 10 instances of the problem are randomly generated and the averaged results obtained with the starting point $x_{0}=(0,\ldots,0)^{T}$ are presented. For the ABBmin2 method, $\tau$ is set to 0.9 as suggested in frassoldati2008new . The parameter pair $(h,s)$ of the SDC method is set to $(8,6)$, which is more efficient than other choices for this test. The vector $q_{k}$ is calculated by (38) for our methods.
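
For reference, problems of the form (47) with the first spectrum of Table 3 can be generated as in the following sketch; the random number generator, seed, and the handling of the other spectra are illustrative assumptions.

import numpy as np

def random_quadratic(n=1000, kappa=1e4, seed=0):
    # Problem (47) with the spectrum of problem set 1 in Table 3:
    # v_1 = 1, v_n = kappa, v_2,...,v_{n-1} drawn from (1, kappa).
    rng = np.random.default_rng(seed)
    v = np.empty(n)
    v[0], v[-1] = 1.0, kappa
    v[1:-1] = rng.uniform(1.0, kappa, n - 2)
    x_star = rng.uniform(-10.0, 10.0, n)
    f = lambda x: (x - x_star) @ (v * (x - x_star))
    grad = lambda x: 2.0 * v * (x - x_star)
    return f, grad, x_star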

Figure 3: Performance profiles of compared methods on solving random quadratic problems (47) with spectral distributions in Table 3, iteration metric
Table 4: The numbers of averaged iterations of the ANGM, BB1, DY, ABBmin2 and SDC methods on solving quadratic problems (47) with spectral distributions in Table 3
set   $\epsilon$   $\tau_{1}$ for ANGM   BB1 DY ABBmin2 SDC
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1e-06 200.6 198.6 213.3 193.6 204.1 196.5 211.1 219.0 230.3 221.0 197.1 177.1 183.2
1e-09 665.0 678.1 677.5 937.3 718.7 768.5 1259.4 1198.1 1195.3 2590.0 2672.7 428.8 2029.6
1e-12 939.7 941.6 901.5 1159.2 951.2 1043.7 1464.3 1339.0 1411.9 6032.9 6353.9 560.3 4087.2
2 1e-06 140.3 140.2 143.1 161.2 172.8 171.1 217.3 265.8 310.3 311.2 261.5 302.5 160.6
1e-09 546.1 590.5 641.1 779.5 850.5 891.9 1055.7 1161.3 1299.0 1665.4 1340.0 1321.1 735.1
1e-12 895.1 1025.2 1098.1 1328.6 1446.6 1598.2 1739.6 1908.7 2107.0 2820.8 2434.6 2267.5 1346.8
3 1e-06 163.2 170.1 177.3 193.3 216.7 231.1 283.3 318.4 363.9 388.4 329.4 356.3 235.5
1e-09 566.5 640.0 680.7 811.0 970.8 986.0 1188.3 1218.7 1364.4 1783.1 1511.8 1470.8 818.1
1e-12 928.4 1030.0 1163.7 1412.0 1575.2 1733.2 1973.6 2075.2 2191.1 2977.7 2780.2 2288.4 1310.6
4 1e-06 212.3 213.7 237.1 259.0 254.7 291.8 365.3 431.4 475.1 500.5 431.3 519.0 262.8
1e-09 616.1 655.7 759.7 885.5 956.4 1107.9 1232.4 1405.3 1533.3 1859.4 1659.9 1489.5 805.5
1e-12 996.0 1078.4 1250.9 1452.1 1629.5 1786.8 2041.6 2179.0 2427.6 3051.5 2785.4 2383.9 1469.3
5 1e-06 623.1 654.8 663.4 671.0 683.4 697.2 761.3 813.7 931.0 832.5 650.8 816.0 668.3
1e-09 2603.0 2654.7 2847.6 2901.4 2936.9 3161.7 3228.6 3306.4 3807.2 4497.2 3185.5 2929.7 3274.5
1e-12 4622.7 4675.2 4905.9 4634.2 4818.7 4944.7 5224.0 5480.7 5972.1 7446.7 7024.1 4808.7 5816.6
total 1e-06 1339.5 1377.5 1434.2 1478.1 1531.7 1587.7 1838.3 2048.4 2310.6 2253.7 1870.1 2170.9 1510.4
1e-09 4996.7 5219.0 5606.6 6314.7 6433.3 6916.0 7964.4 8289.7 9199.2 12395.0 10370.0 7640.0 7662.7
1e-12 8382.0 8750.3 9320.1 9986.1 10421.3 11106.5 12443.0 12982.6 14109.7 22329.5 21378.3 12308.9 14030.4
Table 5: The numbers of averaged iterations of the ANGR1, BB1, DY, ABBmin2 and SDC methods on problems in Table 3
set   $\epsilon$   $\tau_{1}$ for ANGR1   BB1 DY ABBmin2 SDC
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1e-06 198.7 185.8 188.1 182.5 189.2 188.2 184.5 200.1 214.1 221.0 197.1 177.1 183.2
1e-09 655.5 641.6 625.4 667.9 621.7 680.9 762.7 811.5 1012.2 2590.0 2672.7 428.8 2029.6
1e-12 897.3 854.7 844.5 838.4 775.2 813.1 917.4 984.7 1172.7 6032.9 6353.9 560.3 4087.2
2 1e-06 137.9 119.0 119.2 118.7 115.3 124.5 152.4 157.5 178.9 311.2 261.5 302.5 160.6
1e-09 466.5 470.9 487.2 512.0 540.2 589.9 672.6 691.0 722.3 1665.4 1340.0 1321.1 735.1
1e-12 788.1 783.3 802.1 835.3 908.1 1004.5 1074.2 1110.3 1215.2 2820.8 2434.6 2267.5 1346.8
3 1e-06 164.5 149.6 146.3 153.0 157.8 166.8 183.3 208.9 229.1 388.4 329.4 356.3 235.5
1e-09 507.5 473.0 516.6 541.2 563.3 609.0 682.5 747.2 850.5 1783.1 1511.8 1470.8 818.1
1e-12 818.3 801.5 865.4 894.4 965.4 1046.0 1108.8 1200.2 1419.8 2977.7 2780.2 2288.4 1310.6
4 1e-06 189.7 172.2 168.8 179.6 195.2 212.7 229.4 262.9 282.1 500.5 431.3 519.0 262.8
1e-09 539.0 519.2 558.7 570.2 617.1 689.6 786.3 853.1 901.8 1859.4 1659.9 1489.5 805.5
1e-12 849.0 851.1 894.2 934.8 1011.8 1065.5 1197.3 1291.8 1408.0 3051.5 2785.4 2383.9 1469.3
5 1e-06 625.7 587.6 600.1 583.1 614.2 632.2 675.8 726.9 769.0 832.5 650.8 816.0 668.3
1e-09 2539.9 2518.1 2612.7 2530.4 2636.0 2581.2 2773.6 2875.2 3026.2 4497.2 3185.5 2929.7 3274.5
1e-12 4207.3 4238.2 4292.4 4256.2 4240.7 4285.1 4397.2 4534.8 4729.0 7446.7 7024.1 4808.7 5816.6
total 1e-06 1316.5 1214.1 1222.6 1217.0 1271.8 1324.5 1425.3 1556.4 1673.1 2253.7 1870.1 2170.9 1510.4
1e-09 4708.5 4622.7 4800.7 4821.7 4978.4 5150.6 5677.7 5977.9 6512.9 12395.0 10370.0 7640.0 7662.7
1e-12 7560.1 7528.8 7698.7 7759.1 7901.3 8214.1 8694.9 9121.9 9944.7 22329.5 21378.3 12308.9 14030.4

We compared the algorithms using the performance profiles of Dolan and Moré dolan2002 on the iteration metric. In these performance profiles, the vertical axis shows the percentage of problems that a method solves within a factor $\rho$ of the metric attained by the most effective method in the comparison. Fig. 3 shows the performance profiles of ANGM, ANGR1 and ANGR2, obtained by setting $\tau_{1}=0.1$ and $\tau_{2}=1$, and of the other four compared methods. Clearly, ANGR2 outperforms all other methods. In general, we can see that ANGM, ANGR1 and ANGR2 are much better than the BB1, DY and SDC methods.

To further analyze the performance of our methods, we present their results with different values of $\tau_{1}$ from 0.1 to 0.9 in Tables 4, 5 and 6. From Table 4, we can see that, for the first problem set, ANGM is much faster than the BB1, DY and SDC methods and competitive with the ABBmin2 method. As for the other four problem sets, ANGM with a small value of $\tau_{1}$ outperforms the other compared methods, though its performance seems to deteriorate as $\tau_{1}$ increases. The results in Tables 5 and 6 are slightly different from those in Table 4. In particular, ANGR1 and ANGR2 outperform the other four compared methods for most of the problem sets and values of $\tau_{1}$. For each given $\tau_{1}$ and tolerance level, ANGR1 and ANGR2 always perform better than the other methods in terms of the total number of iterations.

Table 6: The numbers of averaged iterations of the ANGR2, BB1, DY, ABBmin2 and SDC methods on problems in Table 3
set   $\epsilon$   $\tau_{1}$ for ANGR2   BB1 DY ABBmin2 SDC
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
1 1e-06 196.9 191.8 184.8 186.8 183.3 189.0 175.6 182.2 189.5 221.0 197.1 177.1 183.2
1e-09 591.0 607.8 599.4 613.3 669.5 699.6 704.0 763.6 1005.1 2590.0 2672.7 428.8 2029.6
1e-12 870.4 811.3 763.1 790.7 800.0 881.4 844.8 971.0 1146.5 6032.9 6353.9 560.3 4087.2
2 1e-06 129.7 117.6 115.0 111.9 114.6 116.3 140.6 146.9 164.4 311.2 261.5 302.5 160.6
1e-09 455.4 474.8 470.8 491.4 530.6 566.0 595.9 601.8 691.5 1665.4 1340.0 1321.1 735.1
1e-12 786.0 750.5 811.7 819.3 938.1 908.4 1000.3 1008.5 1089.0 2820.8 2434.6 2267.5 1346.8
3 1e-06 156.2 141.2 136.3 149.7 146.7 167.9 179.9 188.0 206.8 388.4 329.4 356.3 235.5
1e-09 495.8 465.6 490.4 529.0 530.8 606.4 667.6 718.5 726.6 1783.1 1511.8 1470.8 818.1
1e-12 809.0 775.1 796.4 876.6 918.8 997.0 1080.8 1130.1 1157.8 2977.7 2780.2 2288.4 1310.6
4 1e-06 183.2 168.0 164.9 166.9 180.9 195.7 219.8 234.2 225.3 500.5 431.3 519.0 262.8
1e-09 516.2 521.9 523.2 541.6 603.9 637.3 700.0 709.8 739.2 1859.4 1659.9 1489.5 805.5
1e-12 842.6 830.1 845.8 899.5 959.4 1039.2 1099.3 1133.9 1192.2 3051.5 2785.4 2383.9 1469.3
5 1e-06 611.7 580.5 605.4 594.8 586.2 605.1 659.3 644.1 653.1 832.5 650.8 816.0 668.3
1e-09 2472.7 2394.4 2510.8 2504.4 2447.2 2502.1 2551.9 2648.6 2763.9 4497.2 3185.5 2929.7 3274.5
1e-12 4122.4 4108.7 4103.2 4107.2 4161.8 4227.9 4363.6 4324.0 4483.7 7446.7 7024.1 4808.7 5816.6
total 1e-06 1277.8 1199.1 1206.3 1210.1 1211.6 1274.0 1375.2 1395.4 1439.1 2253.7 1870.1 2170.9 1510.4
1e-09 4531.1 4464.5 4594.7 4679.7 4781.9 5011.5 5219.4 5442.3 5926.3 12395.0 10370.0 7640.0 7662.7
1e-12 7430.3 7275.7 7320.2 7493.4 7778.1 8053.8 8388.8 8567.6 9069.2 22329.5 21378.3 12308.9 14030.4

Secondly, we compared the methods on solving the non-random quadratic problem (10) with $n=10{,}000$. For ANGM, ANGR1 and ANGR2, $\tau_{1}$ and $\tau_{2}$ were set to 0.4 and 1, respectively. The parameter pair $(h,s)$ used for the SDC method was set to $(30,2)$. Other settings are the same as above. Table 7 presents the averaged number of iterations over 10 different starting points with entries randomly generated in $[-10,10]$. It can be seen that ANGM, ANGR1 and ANGR2 are significantly better than the BB1 and DY methods. In addition, ANGR1 and ANGR2 often outperform the ABBmin2 and SDC methods, while ANGM is very competitive with them.

Table 7: The numbers of iterations of the compared methods on problem (10) with $n=10{,}000$
$\kappa$   $\epsilon$   BB1 DY ABBmin2 SDC ANGM ANGR1 ANGR2
$10^{4}$ 1e-06 626.8 527.7 513.0 597.1 539.1 500.6 512.1
1e-09 1267.0 972.1 894.5 1000.6 976.7 893.7 890.2
1e-12 1741.9 1396.8 1277.8 1409.2 1399.1 1298.0 1257.4
$10^5$ 1e-06 1597.8 1326.7 1266.3 1254.3 1209.7 1046.0 1127.9
1e-09 3687.5 3168.3 2559.8 2647.4 2605.2 2424.3 2399.8
1e-12 5564.8 4892.4 3895.0 4156.4 4139.0 3858.5 3663.3
$10^6$ 1e-06 4060.9 2159.4 3130.2 1986.2 2112.9 1992.0 1936.0
1e-09 10720.4 10134.3 7560.8 7178.5 7381.1 6495.1 6550.1
1e-12 17805.5 18015.6 12193.6 11646.7 11922.9 10364.9 10280.2
total 1e-06 6285.5 4013.8 4909.5 3837.6 3861.7 3538.6 3576.0
1e-09 15674.9 14274.7 11015.1 10826.5 10963.0 9813.1 9840.1
1e-12 25112.2 24304.8 17366.4 17212.3 17461.0 15521.4 15200.9

Finally, we compared the methods on solving two large-scale real problems, Laplace1(a) and Laplace1(b), described in fletcher2005barzilai. The two problems require the solution of a system of linear equations derived from a 3D Laplacian on a box, discretized by a standard 7-point finite difference stencil. Each problem has $n=N^3$ variables, where $N$ is the number of interior nodes taken in each coordinate direction. The solution is fixed as a Gaussian centered at $(\alpha,\beta,\gamma)$ and multiplied by $x(x-1)y(y-1)z(z-1)$. The parameter $\sigma$ controls the rate of decay of the Gaussian. See fletcher2005barzilai for more details on these problems. Here, we set the parameters as follows:

(a) $\sigma=20$, $\alpha=\beta=\gamma=0.5$;
(b) $\sigma=50$, $\alpha=0.4$, $\beta=0.7$, $\gamma=0.5$.

The null vector was used as the starting point. We again stop the iteration when $\|g_k\|\leq\epsilon\|g_0\|$ for different values of $\epsilon$.
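For readers who wish to reproduce a problem of this type, the Python sketch below assembles a 7-point 3D Laplacian with scipy's Kronecker-sum utility, fixes the solution as a damped Gaussian, and forms the right-hand side accordingly. The grid scaling, the exact form of the Gaussian, and the sign conventions are our assumptions; the precise definition is given in fletcher2005barzilai.

```python
import numpy as np
import scipy.sparse as sp

def laplace1_problem(N, sigma, center):
    """Sketch of a Laplace1-type test problem: a 7-point 3D Laplacian on the
    unit cube with N interior nodes per direction, with the right-hand side
    chosen so that the exact solution is a Gaussian centered at `center`,
    damped by x(x-1)y(y-1)z(z-1).  Scaling and sign conventions here are
    assumptions, not taken from fletcher2005barzilai."""
    h = 1.0 / (N + 1)
    # 1D second-difference matrix; the 3D operator is its Kronecker sum.
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N))
    A = sp.kronsum(sp.kronsum(T, T), T).tocsr()     # n = N^3 unknowns
    t = h * np.arange(1, N + 1)
    X, Y, Z = np.meshgrid(t, t, t, indexing="ij")
    a, b_, c = center
    u = np.exp(-sigma * ((X - a)**2 + (Y - b_)**2 + (Z - c)**2))
    u *= X * (X - 1) * Y * (Y - 1) * Z * (Z - 1)    # damping factor
    u = u.ravel()
    return A, A @ u                                 # rhs fixes the solution

# Laplace1(a)-like setting: sigma = 20, Gaussian centered at (0.5, 0.5, 0.5).
A, rhs = laplace1_problem(N=60, sigma=20.0, center=(0.5, 0.5, 0.5))
x = np.zeros(A.shape[0])                            # null starting vector
g = A @ x - rhs
print("n =", A.shape[0], "  ||g_0|| =", np.linalg.norm(g))
```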

For ANGM, ANGR1 and ANGR2, $\tau_1$ and $\tau_2$ were set to 0.7 and 1.2, respectively. The parameter pair $(h,s)$ used for the SDC method was chosen for the best performance in our test, i.e., $(2,6)$ and $(8,6)$ for Laplace1(a) and Laplace1(b), respectively. Other settings are the same as above. The numbers of iterations required by the compared methods for solving the two problems are listed in Table 8. It can be seen that our methods are significantly better than the BB1, DY and SDC methods and are often faster than ABBmin2, especially when a tight tolerance is used.

Table 8: The numbers of iterations of BB1, DY, ABBmin2, SDC, ANGM, ANGR1 and ANGR2 on solving the 3D Laplacian problem
$n$  $\epsilon$  BB1  DY  ABBmin2  SDC  ANGM  ANGR1  ANGR2
Laplace1(a)
$60^3$ 1e-06 259 249 192 213 245 195 233
1e-09 441 373 329 393 313 322 308
1e-12 680 546 401 529 367 373 364
$80^3$ 1e-06 359 383 289 297 291 332 288
1e-09 591 570 430 553 408 446 396
1e-12 882 789 608 705 620 516 591
$100^3$ 1e-06 950 427 351 513 450 303 416
1e-09 1088 651 485 609 584 519 503
1e-12 1241 918 687 825 694 604 597
total 1e-06 1568 1059 832 1023 986 830 937
1e-09 2120 1594 1244 1555 1305 1287 1207
1e-12 2803 2253 1696 2059 1681 1493 1552
Laplace1(b)
$60^3$ 1e-06 246 236 217 213 242 217 214
1e-09 473 399 365 437 333 338 409
1e-12 651 532 502 555 451 478 573
$80^3$ 1e-06 288 454 294 309 296 290 324
1e-09 607 567 433 485 517 499 495
1e-12 739 794 634 766 686 590 645
$100^3$ 1e-06 544 371 369 379 381 406 358
1e-09 646 700 585 653 638 558 648
1e-12 937 1038 880 965 854 785 810
total 1e-06 1078 1061 880 901 919 913 896
1e-09 1726 1666 1383 1575 1488 1395 1552
1e-12 2327 2364 2016 2286 1991 1853 2028

4.2 Unconstrained problems

Now we present comparisons of ANGR1 and ANGR2 with the SPG method in birgin2000nonmonotone; birgin2014spectral (codes available at https://www.ime.usp.br/~egbirgin/tango/codes.php) and the BB1 method using the Dai-Zhang nonmonotone line search dai2001adaptive (BB1-DZ) on solving general unconstrained problems.

For our methods, the parameter values are set as follows:

\alpha_{\min}=10^{-30},\quad\alpha_{\max}=10^{30},\quad M=8,\quad\sigma=10^{-4},\quad\alpha_{0}=1/\|g_{0}\|_{\infty}.

In addition, the parameters $\tau_1$ and $\tau_2$ for ANGR1 and ANGR2 are set to 0.8 and 1.2, respectively. Default parameters were used for SPG. Each method was stopped when the number of iterations exceeded 200,000 or $\|g_k\|_{\infty}\leq 10^{-6}$.
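To show how these parameter values enter the algorithm, here is a minimal Python sketch of a BB1 gradient method with a max-type (GLL-style) nonmonotone backtracking line search using $M$, $\sigma$, $\alpha_{\min}$, $\alpha_{\max}$ and $\alpha_0=1/\|g_0\|_\infty$ as above. It is a simplified stand-in, not the Dai-Zhang rule of dai2001adaptive or the actual ANGR1/ANGR2 implementation, and the chained Rosenbrock function in the usage part is a hypothetical test problem.

```python
import numpy as np

def bb1_nonmonotone(f, grad, x0, alpha_min=1e-30, alpha_max=1e30,
                    M=8, sigma=1e-4, tol=1e-6, max_iter=200000):
    """BB1 method with a max-type (GLL-style) nonmonotone line search (sketch)."""
    x, g = x0.copy(), grad(x0)
    alpha = 1.0 / np.linalg.norm(g, np.inf)           # alpha_0 = 1 / ||g_0||_inf
    f_hist = [f(x)]
    for k in range(max_iter):
        if np.linalg.norm(g, np.inf) <= tol:
            return x, k
        f_ref = max(f_hist[-M:])                      # largest of the last M f-values
        t = alpha
        while f(x - t * g) > f_ref - sigma * t * (g @ g):
            t *= 0.5                                  # backtrack on the trial BB step
        x_new = x - t * g
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 0:
            alpha = (s @ s) / (s @ y)                 # BB1 stepsize
        alpha = min(max(alpha, alpha_min), alpha_max) # safeguard into [alpha_min, alpha_max]
        x, g = x_new, g_new
        f_hist.append(f(x))
    return x, max_iter

# Hypothetical usage: chained Rosenbrock with n = 1000 variables.
f = lambda x: np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
def grad(x):
    g = np.zeros_like(x)
    d = x[1:] - x[:-1]**2
    g[:-1] = -400 * x[:-1] * d - 2 * (1 - x[:-1])
    g[1:] += 200 * d
    return g
x_best, iters = bb1_nonmonotone(f, grad, np.full(1000, -1.2))
print("iterations:", iters, "  f(x) =", f(x_best))
```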

Our test problems were taken from andrei2008unconstrained. We tested the 59 problems listed in Table 9 with $n=1,000$; the performance profiles are shown in Fig. 4, which shows that ANGR1 and ANGR2 outperform SPG and BB1-DZ in terms of the number of iterations, and that ANGR2 is faster than ANGR1. Moreover, BB1-DZ is slightly better than SPG. Detailed numerical results are presented in Table 10. Since the only difference among BB1-DZ, ANGR1 and ANGR2 lies in the choice of stepsizes, these results show that our adaptive stepsize choices in ANGR1 and ANGR2 are very effective and can indeed greatly accelerate the convergence of BB-type methods.
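The performance profiles in Fig. 4 (and Fig. 5 below) are of the Dolan-Moré type: for each solver one plots the fraction of problems solved within a factor $\tau$ of the best solver. A short Python sketch of this computation is given below; the sample data are the iteration counts of problems 1-4 and 6 from Table 10, used only for illustration, and the figures in the paper were not necessarily produced with this script.

```python
import numpy as np
import matplotlib.pyplot as plt

def performance_profile(costs, labels, tau_max=10.0):
    """Dolan-More performance profiles: costs[p, s] is the iteration count of
    solver s on problem p (use np.inf for failures)."""
    ratios = costs / costs.min(axis=1, keepdims=True)     # performance ratios r_{p,s}
    taus = np.linspace(1.0, tau_max, 500)
    for s, name in enumerate(labels):
        rho = [(ratios[:, s] <= t).mean() for t in taus]  # rho_s(tau)
        plt.step(taus, rho, where="post", label=name)
    plt.xlabel(r"$\tau$"); plt.ylabel(r"$\rho_s(\tau)$"); plt.legend(); plt.show()

# Iteration counts of problems 1-4 and 6 from Table 10 (rows = problems,
# columns = SPG, BB1-DZ, ANGR1, ANGR2), used only to illustrate the plot.
costs = np.array([[ 87.,  41.,  26.,  25.],
                  [ 45.,  87., 100., 100.],
                  [110., 120.,  91.,  88.],
                  [ 46.,  65.,  31.,  31.],
                  [597., 457., 300., 287.]])
performance_profile(costs, ["SPG", "BB1-DZ", "ANGR1", "ANGR2"])
```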

Table 9: Test problems from andrei2008unconstrained
No.  Problem name  No.  Problem name
1 Extended Freudenstein & Roth function 31 NONDIA
2 Extended Trigonometric 32 DQDRTIC
3 Extended White & Holst 33 Partial Perturbed Quadratic
4 Extended Beale 34 Broyden Tridiagonal
5 Extended Penalty 35 Almost Perturbed Quadratic
6 Perturbed Quadratic 36 Perturbed Tridiagonal Quadratic
7 Raydan 1 37 Staircase 1
8 Raydan 2 38 Staircase 2
9 Diagonal 1 39 LIARWHD
10 Diagonal 2 40 POWER
11 Diagonal 3 41 ENGVAL1
12 Hager 42 EDENSCH
13 Generalized Tridiagonal 1 43 BDEXP
14 Extended Tridiagonal 1 44 GENHUMPS
15 Extended TET 45 NONSCOMP
16 Generalized Tridiagonal 2 46 VARDIM
17 Diagonal 4 47 QUARTC
18 Diagonal 5 48 Extended DENSCHNB
19 Extended Himmelblau 49 Extended DENSCHNF
20 Extended PSC1 50 LIARWHD
21 Generalized PSC1 51 BIGGSB1
22 Extended Powell 52 Generalized Quartic
23 Extended Cliff 53 Diagonal 7
24 Perturbed quadratic diagonal 54 Diagonal 8
25 Quadratic QF1 55 Full Hessian FH3
26 Extended quadratic exponential EP1 56 SINCOS
27 Extended Tridiagonal 2 57 Diagonal 9
28 BDQRTIC 58 HIMMELBG
29 TRIDIA 59 HIMMELH
30 ARWHEAD
Figure 4: Performance profiles of compared methods on 59 unconstrained problems in Table 9, iteration metric
Table 10: Results of compared methods on unconstrained problems in Table 9
Problem SPG BB1-DZ ANGR1 ANGR2
iter  $f_k$  $\|g_k\|_\infty$  iter  $f_k$  $\|g_k\|_\infty$  iter  $f_k$  $\|g_k\|_\infty$  iter  $f_k$  $\|g_k\|_\infty$
1 87 2.45e+04 8.95e-07 41 2.45e+04 4.06e-08 26 2.45e+04 1.46e-07 25 2.45e+04 1.46e-07
2 45 5.72e-13 3.83e-07 87 5.26e-07 3.97e-07 100 4.89e-07 9.32e-07 100 4.73e-07 2.20e-07
3 110 5.80e-14 3.05e-08 120 6.30e-20 4.95e-11 91 9.95e-12 2.93e-07 88 1.05e-11 2.90e-07
4 46 3.92e-10 6.57e-07 65 8.28e-18 3.74e-10 31 9.52e-15 4.15e-08 31 1.94e-12 6.12e-08
5 41 8.83e+02 9.02e-07 107 8.83e+02 1.24e-08 107 8.83e+02 1.24e-08 107 8.83e+02 1.24e-08
6 597 2.39e-13 9.83e-07 457 2.16e-13 9.24e-07 300 7.98e-16 5.66e-08 287 1.69e-13 8.03e-07
7 465 5.01e+04 9.64e-07 339 5.01e+04 9.04e-07 330 5.01e+04 5.56e-08 301 5.01e+04 8.39e-07
8 1 1.00e+03 0.00e+00 1 1.00e+03 0.00e+00 1 1.00e+03 0.00e+00 1 1.00e+03 0.00e+00
9 415 -2.71e+06 8.50e-07 361 -2.71e+06 5.43e-07 294 -2.71e+06 9.05e-07 276 -2.71e+06 7.04e-07
10 168 3.13e+01 6.05e-07 173 3.13e+01 5.68e-07 135 3.13e+01 5.93e-07 131 3.13e+01 8.61e-07
11 668 -4.96e+05 8.82e-07 404 -4.96e+05 8.63e-07 336 -4.96e+05 8.85e-07 286 -4.96e+05 9.93e-07
12 63 -4.47e+04 2.91e-07 58 -4.47e+04 5.70e-07 52 -4.47e+04 5.14e-07 52 -4.47e+04 5.59e-07
13 29 9.97e+02 6.53e-07 27 9.97e+02 9.67e-07 26 9.97e+02 4.21e-07 25 9.97e+02 9.79e-07
14 30 2.13e-07 3.75e-07 18 2.68e-07 4.46e-07 20 5.20e-07 7.33e-07 20 5.20e-07 7.33e-07
15 10 1.28e+03 4.34e-10 5 1.28e+03 1.33e-08 5 1.28e+03 1.33e-08 5 1.28e+03 1.33e-08
16 69 2.38e+00 8.58e-07 58 9.58e-01 7.14e-07 56 9.58e-01 8.40e-07 57 9.58e-01 6.13e-07
17 3 0.00e+00 0.00e+00 3 0.00e+00 0.00e+00 3 0.00e+00 0.00e+00 3 0.00e+00 0.00e+00
18 4 6.93e+02 4.81e-08 1 6.93e+02 0.00e+00 1 6.93e+02 0.00e+00 1 6.93e+02 0.00e+00
19 15 4.21e-19 1.92e-10 15 4.21e-19 1.92e-10 15 4.21e-19 1.92e-10 15 4.21e-19 1.92e-10
20 23 9.99e+02 8.27e-07 22 9.99e+02 8.65e-07 22 9.99e+02 8.65e-07 22 9.99e+02 8.65e-07
21 15 3.87e+02 5.87e-07 13 3.87e+02 3.78e-07 13 3.87e+02 3.78e-07 13 3.87e+02 3.78e-07
22 246 3.19e-07 6.80e-07 266 3.67e-07 5.97e-07 164 7.08e-07 9.62e-07 205 5.55e-07 9.82e-07
23 154 9.99e+01 7.79e-07 126 9.99e+01 1.57e-07 541 9.99e+01 2.61e-07 539 9.99e+01 6.10e-08
24 373 2.41e-11 8.95e-07 246 1.41e-11 6.83e-07 249 2.05e-12 2.98e-07 314 4.05e-12 4.14e-07
25 612 -5.00e-04 9.45e-07 575 -5.00e-04 9.95e-07 319 -5.00e-04 8.92e-08 282 -5.00e-04 7.45e-07
26 4 7.93e+03 4.19e-08 4 7.93e+03 3.71e-09 4 7.93e+03 3.71e-09 4 7.93e+03 3.71e-09
27 34 3.89e+02 7.50e-07 39 3.89e+02 9.74e-07 36 3.89e+02 8.49e-07 36 3.89e+02 5.15e-07
28 77 3.98e+03 7.64e-07 98 3.98e+03 8.91e-07 69 3.98e+03 1.49e-07 64 3.98e+03 9.57e-07
29 3910 2.47e-13 6.97e-07 2403 9.93e-15 6.71e-07 816 9.62e-16 4.67e-07 611 6.71e-14 5.86e-07
30 4 0.00e+00 1.50e-09 4 0.00e+00 1.50e-09 4 0.00e+00 1.50e-09 4 0.00e+00 1.50e-09
31 15 1.35e-14 9.86e-09 15 1.35e-14 9.86e-09 16 1.10e-12 8.20e-08 16 1.10e-12 8.20e-08
32 41 4.20e-16 2.86e-07 26 1.80e-13 8.47e-07 17 2.99e-16 2.14e-07 17 2.99e-16 2.14e-07
33 312 2.51e-14 5.96e-07 229 1.42e-13 9.63e-07 257 6.00e-15 2.16e-07 294 7.03e-14 8.24e-07
34 47 5.31e-15 3.20e-07 51 7.94e-14 8.28e-07 41 7.67e-14 7.35e-07 47 2.62e-14 7.66e-07
35 621 2.36e-13 9.76e-07 363 2.47e-13 9.97e-07 358 1.53e-13 7.77e-07 272 4.53e-15 6.22e-07
36 595 2.49e-13 9.97e-07 606 2.48e-13 9.95e-07 368 3.46e-13 1.00e-06 306 1.82e-13 8.61e-07
37 15936 1.81e-12 7.80e-07 10239 1.54e-12 7.68e-07 7234 1.39e-13 1.89e-07 6543 1.07e-12 9.13e-07
38 28528 1.97e-14 2.43e-07 25640 2.74e-12 9.97e-07 60283 2.86e-12 9.93e-07 27783 2.98e-12 9.74e-07
39 61 2.64e-14 2.06e-08 59 2.62e-13 4.93e-07 47 9.15e-14 6.15e-08 47 1.30e-12 2.36e-07
40 5435 2.84e-13 9.98e-07 5472 2.94e-13 9.98e-07 327 1.89e-13 6.99e-07 366 5.63e-13 9.98e-07
41 33 1.11e+03 8.58e-07 31 1.11e+03 8.09e-07 29 1.11e+03 1.70e-07 29 1.11e+03 1.70e-07
42 32 6.00e+03 7.33e-07 32 6.00e+03 7.33e-07 31 6.00e+03 4.46e-07 31 6.00e+03 9.35e-07
43 15 6.52e-05 9.93e-07 15 6.52e-05 9.93e-07 15 6.52e-05 9.93e-07 15 6.52e-05 9.93e-07
44 747 1.39e-17 1.50e-09 3 4.49e-23 3.00e-12 3 4.49e-23 3.00e-12 3 4.49e-23 3.00e-12
45 48 9.98e-12 8.75e-07 89 1.97e-14 2.57e-07 59 2.30e-13 5.85e-07 63 2.18e-13 7.89e-07
46 1 4.62e-30 4.44e-16 30 2.17e-27 5.77e-15 30 2.17e-27 5.77e-15 30 2.17e-27 5.77e-15
47 1 0.00e+00 0.00e+00 1 0.00e+00 0.00e+00 1 0.00e+00 0.00e+00 1 0.00e+00 0.00e+00
48 8 1.37e-11 4.68e-07 8 1.37e-11 4.68e-07 8 1.37e-11 4.68e-07 8 1.37e-11 4.68e-07
49 12 1.78e-19 4.76e-10 17 6.46e-21 5.91e-11 17 6.46e-21 5.91e-11 17 6.46e-21 5.91e-11
50 61 2.64e-14 2.06e-08 59 2.62e-13 4.93e-07 47 9.15e-14 6.15e-08 47 1.30e-12 2.36e-07
51 30459 1.24e-05 9.95e-07 14877 3.73e-07 6.46e-07 7558 6.51e-07 8.62e-07 8969 6.08e-12 6.45e-07
52 9 4.98e-15 1.99e-07 9 4.98e-15 1.99e-07 9 4.98e-15 1.99e-07 9 4.98e-15 1.99e-07
53 7 -8.17e+02 1.97e-09 7 -8.17e+02 1.97e-09 7 -8.17e+02 1.97e-09 7 -8.17e+02 1.97e-09
54 6 -4.80e+02 5.85e-08 6 -4.80e+02 4.36e-10 6 -4.80e+02 4.36e-10 6 -4.80e+02 4.36e-10
55 3 -2.50e-01 4.56e-10 3 -2.50e-01 4.56e-10 3 -2.50e-01 4.56e-10 3 -2.50e-01 4.56e-10
56 15 3.87e+02 5.87e-07 13 3.87e+02 3.78e-07 13 3.87e+02 3.78e-07 13 3.87e+02 3.78e-07
57 553 -2.70e+06 1.91e-07 762 -2.70e+06 6.06e-07 305 -2.70e+06 9.96e-07 346 -2.70e+06 5.60e-07
58 16 4.79e-04 8.77e-07 17 3.78e-04 6.91e-07 17 3.78e-04 6.91e-07 17 3.78e-04 6.91e-07
59 12 -5.00e+02 1.09e-07 12 -5.00e+02 7.41e-11 12 -5.00e+02 7.41e-11 12 -5.00e+02 7.41e-11

4.3 Bound constrained problems

We further compare ANGR1 and ANGR2 with SPG birgin2000nonmonotone; birgin2014spectral and the BB1-DZ method dai2001adaptive, combined with gradient projection techniques, on solving bound constrained problems from the CUTEst collection gould2015cutest with dimension more than 50. We deleted 3 problems from this list since none of the compared algorithms can solve them; hence, in total 47 problems are left in our test. The iteration was stopped when the number of iterations exceeded 200,000 or $\|P_{\Omega}(x_k-g_k)-x_k\|_{\infty}\leq 10^{-6}$. The parameters $\tau_1$ and $\tau_2$ for ANGR1 and ANGR2 are set to 0.4 and 1.5, respectively. Other settings are the same as before. Fig. 5 shows the performance profiles of all the compared methods on the iteration metric. As in the unconstrained case, we again see from Fig. 5 that both ANGR1 and ANGR2 perform significantly better than SPG and BB1-DZ. Hence, our new gradient methods also have great potential benefits for solving constrained optimization.
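As a minimal illustration of the gradient projection machinery referred to above, the Python sketch below implements the projection onto a box and the stationarity measure $\|P_\Omega(x_k-g_k)-x_k\|_\infty$, and runs a bare projected gradient loop with BB1 steps on a small hypothetical bound-constrained quadratic. The compared methods additionally use nonmonotone line searches, which are omitted here.

```python
import numpy as np

def project_box(x, lower, upper):
    """Projection onto the box Omega = {x : lower <= x <= upper}."""
    return np.minimum(np.maximum(x, lower), upper)

def stationarity(x, g, lower, upper):
    """Stopping measure used above: ||P_Omega(x - g) - x||_inf."""
    return np.linalg.norm(project_box(x - g, lower, upper) - x, np.inf)

# Bare projected gradient loop with BB1 steps on a small hypothetical
# bound-constrained quadratic 0.5*x'Ax - b'x subject to 0 <= x <= 0.8.
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([4.0, 3.0])
lower, upper = np.zeros(2), np.full(2, 0.8)
x = project_box(np.array([2.0, -1.0]), lower, upper)
g = A @ x - b
alpha = 1.0 / np.linalg.norm(g, np.inf)
for k in range(200):
    if stationarity(x, g, lower, upper) <= 1e-6:
        break
    x_new = project_box(x - alpha * g, lower, upper)  # projected gradient step
    g_new = A @ x_new - b
    s, y = x_new - x, g_new - g
    if s @ y > 0:
        alpha = (s @ s) / (s @ y)                     # BB1 stepsize
    x, g = x_new, g_new
print("iterations:", k, "  solution:", x)
```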

Figure 5: Performance profiles of compared methods on solving 47 bound constrained problems from CUTEst, iteration metric

5 Conclusions

We have developed techniques to accelerate the Barzilai-Borwein (BB) method, motivated by finite termination for minimizing two-dimensional strongly convex quadratic functions. More specifically, by exploiting certain orthogonality properties of the gradients, we derive a new monotone stepsize that can be combined with the BB stepsizes to significantly improve their performance for minimizing general strongly convex quadratic functions. By adaptively using this new stepsize and the two BB stepsizes, we develop a new gradient method called ANGM and its two variants ANGR1 and ANGR2, which are further extended for solving general smooth unconstrained and bound constrained optimization. Our extensive numerical experiments show that all the newly developed methods are significantly better than the BB method and faster than some very successful gradient methods developed in the recent literature for solving quadratic, general smooth unconstrained and bound constrained optimization problems.

Acknowledgements.
This work was supported by the National Natural Science Foundation of China (Grant Nos. 11631013, 11701137, 11671116), the National 973 Program of China (Grant No. 2015CB856002), the China Scholarship Council (No. 201806705007), and the USA National Science Foundation (Grant Nos. 1522654, 1819161).

References

  • (1) Akaike, H.: On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Ann. Inst. Stat. Math. 11(1), 1–16 (1959)
  • (2) Andrei, N.: An unconstrained optimization test functions collection. Adv. Model. Optim. 10(1), 147–161 (2008)
  • (3) Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
  • (4) Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim. 10(4), 1196–1211 (2000)
  • (5) Birgin, E.G., Martínez, J.M., Raydan, M.: Spectral projected gradient methods: review and perspectives. J. Stat. Softw. 60(3), 539–559 (2014)
  • (6) Cauchy, A.: Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris 25, 536–538 (1847)
  • (7) Dai, Y.H.: Alternate step gradient method. Optimization 52(4-5), 395–415 (2003)
  • (8) Dai, Y.H., Fletcher, R.: Projected Barzilai–Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 100(1), 21–47 (2005)
  • (9) Dai, Y.H., Huang, Y., Liu, X.W.: A family of spectral gradient methods for optimization. Comp. Optim. Appl. 74(1), 43–65 (2019)
  • (10) Dai, Y.H., Liao, L.Z.: R-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
  • (11) Dai, Y.H., Yuan, Y.X.: Alternate minimization gradient method. IMA J. Numer. Anal. 23(3), 377–393 (2003)
  • (12) Dai, Y.H., Yuan, Y.X.: Analysis of monotone gradient methods. J. Ind. Manag. Optim. 1(2), 181 (2005)
  • (13) Dai, Y.H., Zhang, H.: Adaptive two-point stepsize gradient algorithm. Numer. Algorithms 27(4), 377–385 (2001)
  • (14) De Asmundis, R., Di Serafino, D., Hager, W.W., Toraldo, G., Zhang, H.: An efficient gradient method using the Yuan steplength. Comp. Optim. Appl. 59(3), 541–563 (2014)
  • (15) Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
  • (16) Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
  • (17) Fletcher, R.: On the Barzilai–Borwein method. In: Qi, L., Teo, K., Yang, X. (eds.) Optimization and Control with Applications, pp. 235–256. Springer, Boston (2005)
  • (18) Fletcher, R.: A limited memory steepest descent method. Math. Program. 135(1-2), 413–436 (2012)
  • (19) Forsythe, G.E.: On the asymptotic directions of the s-dimensional optimum gradient method. Numer. Math. 11(1), 57–76 (1968)
  • (20) Frassoldati, G., Zanni, L., Zanghirati, G.: New adaptive stepsize selections in gradient methods. J. Ind. Manag. Optim. 4(2), 299 (2008)
  • (21) Friedlander, A., Martínez, J.M., Molina, B., Raydan, M.: Gradient method with retards and generalizations. SIAM J. Numer. Anal. 36(1), 275–289 (1998)
  • (22) Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comp. Optim. Appl. 63(2), 523–542 (2016)
  • (23) Gould, N.I., Orban, D., Toint, P.L.: CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comp. Optim. Appl. 60(3), 545–557 (2015)
  • (24) Hager, W.W., Zhang, H.: A new active set algorithm for box constrained optimization. SIAM J. Optim. 17(2), 526–557 (2006)
  • (25) Huang, Y., Dai, Y.H., Liu, X.W., Zhang, H.: Gradient methods exploiting spectral properties. arXiv preprint arXiv:1905.03870 (2019)
  • (26) Huang, Y., Dai, Y.H., Liu, X.W., Zhang, H.: On the asymptotic convergence and acceleration of gradient methods. arXiv preprint arXiv:1908.07111 (2019)
  • (27) Huang, Y., Liu, H., Zhou, S.: Quadratic regularization projected Barzilai–Borwein method for nonnegative matrix factorization. Data Min. Knowl. Disc. 29(6), 1665–1684 (2015)
  • (28) Jiang, B., Dai, Y.H.: Feasible Barzilai–Borwein-like methods for extreme symmetric eigenvalue problems. Optim. Methods Softw. 28(4), 756–784 (2013)
  • (29) Liu, Y.F., Dai, Y.H., Luo, Z.Q.: Coordinated beamforming for MISO interference channel: complexity analysis and efficient algorithms. IEEE Trans. Signal Process. 59(3), 1142–1157 (2011)
  • (30) Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
  • (31) Tan, C., Ma, S., Dai, Y.H., Qian, Y.: Barzilai–Borwein step size for stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 685–693 (2016)
  • (32) Wang, Y., Ma, S.: Projected Barzilai–Borwein method for large-scale nonnegative image restoration. Inverse Probl. Sci. Eng. 15(6), 559–583 (2007)
  • (33) Wright, S.J., Nowak, R.D., Figueiredo, M.A.: Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009)
  • (34) Yuan, Y.X.: A new stepsize for the steepest descent method. J. Comput. Math. 24(2), 149–156 (2006)
  • (35) Yuan, Y.X.: Step-sizes for the gradient method. AMS/IP Studies in Advanced Mathematics 42(2), 785–796 (2008)
  • (36) Zhou, B., Gao, L., Dai, Y.H.: Gradient methods with adaptive step-sizes. Comp. Optim. Appl. 35(1), 69–86 (2006)