
Ya-Kui Huang, Institute of Mathematics, Hebei University of Technology, Tianjin 300401, China. E-mail: hyk@hebut.edu.cn
Hou-Duo Qi, Department of Data Science and Artificial Intelligence and Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. E-mail: houduo.qi@polyu.edu.hk

Analytic analysis of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize

Ya-Kui Huang    Hou-Duo Qi
(Received: date / Accepted: date)
Abstract

We give a novel analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, which previously could only be established by computer-assisted proofs. Our analysis is based on studying the linear convergence of a family of gradient methods whose stepsizes include the one determined by exact line search and the Polyak stepsize as special instances. The asymptotic behavior of this family is also investigated; it shows that the gradient method with the Polyak stepsize zigzags asymptotically in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian.

Keywords:
Gradient methods · Exact line search · Polyak stepsize · Worst-case complexity
MSC:
90C25 · 90C30

1 Introduction

We focus on the gradient method for solving the unconstrained optimization problem

\min_{x\in\mathbb{R}^{n}}\ f(x), (1)

where f:\mathbb{R}^{n}\to\mathbb{R} is a real-valued, convex, and continuously differentiable function. As is known, the gradient method updates iterates by

x_{k+1}=x_{k}-\alpha_{k}g_{k}, (2)

where g_{k}=\nabla f(x_{k}) and \alpha_{k}>0 is the stepsize. The classic choice of the stepsize is due to Cauchy cauchy1847methode, who suggested computing it by exact line search, i.e.,

\alpha_{k}^{SD}=\arg\min_{\alpha>0}\ f(x_{k}-\alpha g_{k}), (3)

which together with (2) is often referred to as the steepest descent (SD) method. Many stepsizes have been designed to improve the efficiency of gradient methods, such as the Barzilai–Borwein stepsize barzilai1988two, the Polyak stepsize polyak1969minimization; polyak1987introduction, and the Yuan stepsize yuan2006new. Owing to its good performance on machine learning problems, the Polyak stepsize has recently attracted a surge of interest jiang2024adaptive; loizou2021stochastic; wang2023generalized. It minimizes an upper bound on the distance from the new iterate to the optimal solution and has the form

\alpha_{k}^{P}=2\gamma\frac{f_{k}-f_{*}}{\|g_{k}\|^{2}},\quad\gamma\in(0,1], (4)

where f_{k}=f(x_{k}) and f_{*} is the optimal function value. Convergence of the gradient method using the Polyak stepsize has been studied in different works for the case \gamma=\frac{1}{2} or \gamma=1 barre2020complexity; hazan2019revisiting.
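To make the update concrete, the following minimal Python sketch implements the gradient method (2) with the Polyak stepsize (4). It assumes that the optimal value f_{*} is available, as (4) requires; the function name polyak_gradient_descent and the toy quadratic are illustrative choices, not taken from any reference implementation.

    import numpy as np

    def polyak_gradient_descent(f, grad, f_star, x0, gamma=1.0, tol=1e-8, max_iter=1000):
        # Gradient method (2) with the Polyak stepsize (4); f_star must be known.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= tol:
                break
            alpha = 2.0 * gamma * (f(x) - f_star) / np.dot(g, g)  # stepsize (4)
            x = x - alpha * g                                     # update (2)
        return x

    # Illustration on a strongly convex quadratic with known minimum value f_* = 0.
    A = np.diag([1.0, 100.0])
    x_sol = polyak_gradient_descent(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                                    f_star=0.0, x0=[30.0, 1.0])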

In recent years, analyzing the worst-case complexities of gradient-based methods has attracted much attention cartis2010complexity; das2024branch; de2020worst; drori2014performance; gu2020tight; lessard2016analysis; taylor2017exact; teboulle2023elementary; yuan2010short. In particular, Drori and Teboulle drori2014performance formulate the worst-case complexity of a given algorithm as the optimal value of the so-called performance estimation problem (PEP). By applying suitable relaxations to PEP, a bound on the worst-case complexity can be obtained by solving a convex semidefinite program (SDP). Following drori2014performance, the worst-case complexities of various popular methods have been analyzed, including the gradient method, Nesterov's fast gradient method, the inexact gradient method, and the proximal point algorithm; see de2020worst; gu2020tight; taylor2017exact and the references therein.

Although the PEP framework is a powerful tool for studying algorithm complexity, as pointed out by Teboulle and Vaisbourd teboulle2023elementary, it lacks an intuitive explanation of how the worst-case complexity is obtained: (i) one often has to employ a computer-assisted proof, since the aforementioned SDP usually has no closed-form solution and must be solved numerically; (ii) even when a closed-form solution is available, it is generally unclear how it is deduced. Moreover, it is not easy to apply the PEP framework to gradient methods with adaptive stepsizes goujaud2023fundamental. Consequently, most existing complexity results for gradient methods focus on fixed stepsizes. Recently, by resorting to PEP, the worst-case complexities of the gradient method with exact line search and with the Polyak stepsize (4) with \gamma=1 were established through computer-assisted proofs in de2017worst and barre2020complexity, respectively.

The purpose of this paper is to give a novel analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, for smooth strongly convex functions with Lipschitz continuous gradient. To this end, we first study a family of gradient methods whose stepsizes include the one determined by exact line search and the Polyak stepsize as special instances. We show that this family converges linearly when applied to a strongly convex quadratic function. Then, based on the convergence results for quadratics, we establish in an analytic way the worst-case complexity of the gradient method with exact line search and the Polyak stepsize for general smooth strongly convex objective functions, which recovers the computer-assisted results in de2017worst and barre2020complexity. To our knowledge, this is the first analytic proof of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize. Furthermore, we show that the considered family asymptotically zigzags in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian, a behavior that has not been clarified in the literature for the gradient method using the Polyak stepsize.

The paper is organized as follows. In Section 2, we investigate the linear convergence of a family of gradient methods on strongly convex quadratics. In Section 3, we present our analytic proof of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively. Finally, concluding remarks are drawn and the asymptotic behavior of the considered family is discussed in Section 4.

2 Convergence analysis for quadratics

In this section, we consider the unconstrained quadratic optimization

\min_{x\in\mathbb{R}^{n}}\ f(x)=\frac{1}{2}x^{\sf T}Ax-b^{\sf T}x, (5)

where b\in\mathbb{R}^{n} and A\in\mathbb{R}^{n\times n} is symmetric positive definite. Let \{\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\} be the eigenvalues of A, and \{\xi_{1},\xi_{2},\ldots,\xi_{n}\} be the associated orthonormal eigenvectors. Without loss of generality, we assume that

A=\mathrm{diag}\{\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\},\quad 0<\mu=\lambda_{1}<\lambda_{2}<\cdots<\lambda_{n}=L.

For a more general and uniform analysis, we investigate a family of gradient methods with stepsize given by

\alpha_{k}=\gamma\frac{g_{k}^{\sf T}\psi(A)g_{k}}{g_{k}^{\sf T}\psi(A)Ag_{k}},\quad\gamma\in(0,1], (6)

where \psi is a real analytic function on [\mu,L] that can be expressed by the Laurent series \psi(z)=\sum_{k=-\infty}^{\infty}c_{k}z^{k} with c_{k}\in\mathbb{R} such that 0<\sum_{k=-\infty}^{\infty}c_{k}z^{k}<+\infty for all z\in[\mu,L]. Clearly, when \psi(A)=I, we get the generalized SD stepsize, namely

\alpha_{k}^{GSD}=\gamma\arg\min_{\alpha>0}\ f(x_{k}-\alpha g_{k})=\gamma\frac{g_{k}^{\sf T}g_{k}}{g_{k}^{\sf T}Ag_{k}}, (7)

which is exactly the stepsize \alpha_{k}^{SD} obtained by exact line search when \gamma=1. See dai2003altermin for gradient methods with shortened SD steps, i.e., \gamma\in(0,1). Interestingly, the Polyak stepsize \alpha_{k}^{P} in (4) corresponds to the case \psi(A)=A^{-1}. In fact, since f_{*}=f(x_{*}), where x_{*}=A^{-1}b is the optimal solution, we have

f_{k}=f_{*}+\frac{1}{2}(x_{k}-x_{*})^{\sf T}A(x_{k}-x_{*})=f_{*}+\frac{1}{2}g_{k}^{\sf T}A^{-1}g_{k},

which gives

\alpha_{k}^{P}=2\gamma\frac{f_{k}-f_{*}}{\|g_{k}\|^{2}}=\gamma\frac{g_{k}^{\sf T}A^{-1}g_{k}}{g_{k}^{\sf T}g_{k}}. (8)
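As a quick numerical sanity check (not part of the paper's argument), the following Python snippet verifies on a random quadratic of the form (5) that the family stepsize (6) reduces to the exact line search stepsize (7) for \psi(A)=I and to the Polyak stepsize (4), i.e. (8), for \psi(A)=A^{-1}; all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma = 5, 1.0
    A = np.diag(np.sort(rng.uniform(1.0, 10.0, n)))          # Hessian of (5)
    b = rng.standard_normal(n)
    x_star = np.linalg.solve(A, b)
    f = lambda x: 0.5 * x @ A @ x - b @ x
    x = rng.standard_normal(n)
    g = A @ x - b

    def family_step(psi_of_A):                                # stepsize (6)
        return gamma * (g @ psi_of_A @ g) / (g @ psi_of_A @ A @ g)

    sd_step = gamma * (g @ g) / (g @ A @ g)                   # (7)
    polyak_step = 2 * gamma * (f(x) - f(x_star)) / (g @ g)    # (4)
    assert np.isclose(family_step(np.eye(n)), sd_step)            # psi(A) = I
    assert np.isclose(family_step(np.linalg.inv(A)), polyak_step) # psi(A) = A^{-1}, cf. (8)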

For a symmetric positive definite matrix D, denote the weighted norm by \|x\|_{D}=\sqrt{x^{\sf T}Dx}. We establish linear convergence of the family of gradient methods (6) in the next theorem.

Theorem 2.1

Consider applying the gradient method (2) with the stepsize \alpha_{k} given by (6) to solve the unconstrained quadratic optimization (5). It holds that

\|x_{k+1}-x_{*}\|_{\psi(A)A}^{2}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}. (9)

Moreover, if x_{k}=A^{-1}\left(\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right)+b\right), then the two sides of (9) are equal.

Proof

Using x_{k}-x_{*}=A^{-1}g_{k} and the definition of \alpha_{k} in (6), we have

\|x_{k+1}-x_{*}\|_{\psi(A)A}^{2} = \|x_{k}-\alpha_{k}g_{k}-x_{*}\|_{\psi(A)A}^{2}
= \|x_{k}-x_{*}\|_{\psi(A)A}^{2}-2\alpha_{k}g_{k}^{\sf T}\psi(A)A(x_{k}-x_{*})+\alpha_{k}^{2}g_{k}^{\sf T}\psi(A)Ag_{k}
= g_{k}^{\sf T}\psi(A)A^{-1}g_{k}-2\alpha_{k}g_{k}^{\sf T}\psi(A)g_{k}+\alpha_{k}^{2}g_{k}^{\sf T}\psi(A)Ag_{k}
= g_{k}^{\sf T}\psi(A)A^{-1}g_{k}-\gamma(2-\gamma)\frac{(g_{k}^{\sf T}\psi(A)g_{k})^{2}}{g_{k}^{\sf T}\psi(A)Ag_{k}}
= \left(1-\frac{\gamma(2-\gamma)(g_{k}^{\sf T}\psi(A)g_{k})^{2}}{g_{k}^{\sf T}\psi(A)Ag_{k}\cdot g_{k}^{\sf T}\psi(A)A^{-1}g_{k}}\right)g_{k}^{\sf T}\psi(A)A^{-1}g_{k}
= \left(1-\frac{\gamma(2-\gamma)(g_{k}^{\sf T}\psi(A)g_{k})^{2}}{g_{k}^{\sf T}\psi(A)Ag_{k}\cdot g_{k}^{\sf T}\psi(A)A^{-1}g_{k}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}
\leq \left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}, (10)

where the last inequality follows from the Kantorovich inequality.

If x_{k}=A^{-1}\left(\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right)+b\right), we have

g_{k}=\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right), (11)

which gives

g_{k}^{\sf T}\psi(A)A^{j}g_{k}=\frac{1}{2}\left(\mu^{j}+L^{j}\right),\quad j=-1,0,1.

Thus,

\|x_{k+1}-x_{*}\|_{\psi(A)A}^{2} = \left(1-\frac{\gamma(2-\gamma)}{\frac{1}{2}(L+\mu)\cdot\frac{1}{2}\left(\mu^{-1}+L^{-1}\right)}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}
= \left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}.

We complete the proof.
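The tightness statement of Theorem 2.1 can be checked numerically. The sketch below is an illustration only, with an arbitrarily chosen admissible \psi(z)=z^{2} and a diagonal Hessian; it confirms that (9) holds with equality at the special point x_{k} given in the theorem.

    import numpy as np

    mu, L, gamma = 1.0, 10.0, 1.0
    A = np.diag([mu, 4.0, L])                      # lambda_1 = mu, lambda_n = L
    b = np.array([1.0, -2.0, 0.5])
    x_star = np.linalg.solve(A, b)
    P = A @ A                                      # psi(A) with psi(z) = z^2
    xi_1, xi_n = np.eye(3)[:, 0], np.eye(3)[:, 2]

    # psi^{-1/2}(A) is computed entrywise, which is valid because A (hence P) is diagonal.
    psi_inv_sqrt = np.linalg.inv(np.sqrt(P))
    x_k = np.linalg.solve(A, psi_inv_sqrt @ (xi_1 + xi_n) / np.sqrt(2) + b)
    g = A @ x_k - b
    alpha = gamma * (g @ P @ g) / (g @ P @ A @ g)  # stepsize (6)
    x_next = x_k - alpha * g

    wnorm2 = lambda v: v @ (P @ A) @ v             # squared psi(A)A-weighted norm
    rate = 1 - 4 * gamma * (2 - gamma) * mu * L / (L + mu) ** 2
    assert np.isclose(wnorm2(x_next - x_star), rate * wnorm2(x_k - x_star))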

Remark 1

From the proof of Theorem 2.1, if the worst-case convergence rate of the gradient method with the stepsize \alpha_{k} given by (6) is achieved, then we must have \alpha_{k}=\frac{2\gamma}{L+\mu}. Moreover, when \gamma=1 and x_{t}=A^{-1}\left(\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right)+b\right) for some t, the two sides of (9) are equal for all k\geq t.

By setting \psi(A)=I and \psi(A)=A^{-1} in (9), respectively, we get the following results for the gradient method with the generalized SD stepsize (7) and with the Polyak stepsize (8).

Corollary 1

Consider applying the gradient method (2) to solve the unconstrained quadratic optimization (5).

(i) If \alpha_{k} is given by the generalized SD stepsize (7), then

f_{k+1}-f_{*}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)(f_{k}-f_{*}). (12)

(ii) If \alpha_{k} is given by the Polyak stepsize (8), then

\|x_{k+1}-x_{*}\|^{2}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|^{2}. (13)
Remark 2

From Corollary 1 and the fact that \gamma(2-\gamma) attains its maximum 1 at \gamma=1, the generalized SD stepsize (7) with \gamma=1, i.e., \alpha_{k}^{SD}=\frac{g_{k}^{\sf T}g_{k}}{g_{k}^{\sf T}Ag_{k}}, yields a faster convergence rate than any other value of \gamma, while for the Polyak stepsize (8) the choice \alpha_{k}^{P}=2\frac{f_{k}-f_{*}}{\|g_{k}\|^{2}} gives the fastest gradient method in the sense of (13).

3 Worst-case complexity for smooth strongly convex functions

In this section, making use of the convergence results of the previous section, we present an analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, on the class of L-smooth and \mu-strongly convex functions, denoted by \mathcal{F}_{\mu,L}.

Definition 1

Let f:\mathbb{R}^{n}\to\mathbb{R} be a continuously differentiable function and \mu,L>0. We say that f is L-smooth and \mu-strongly convex if

(i) f is L-smooth, i.e.,

f(y)\leq f(x)+\nabla f(x)^{\sf T}(y-x)+\frac{L}{2}\|x-y\|^{2},\quad\forall\,x,y\in\mathbb{R}^{n},

(ii) f is \mu-strongly convex, i.e.,

f(y)\geq f(x)+\nabla f(x)^{\sf T}(y-x)+\frac{\mu}{2}\|x-y\|^{2},\quad\forall\,x,y\in\mathbb{R}^{n}.

We will use the following important inequality for f\in\mathcal{F}_{\mu,L},

f(x)-f(y)+\nabla f(x)^{\sf T}(y-x)+\frac{1}{2(1-\mu/L)}\left(\mu\|x-y\|^{2}-\frac{2\mu}{L}(\nabla f(x)-\nabla f(y))^{\sf T}(x-y)+\frac{1}{L}\|\nabla f(x)-\nabla f(y)\|^{2}\right)\leq 0, (14)

see Theorem 4 of taylor2017smooth.
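For readers who wish to see (14) in action, the following sketch checks the inequality numerically on a quadratic instance of \mathcal{F}_{\mu,L} at randomly sampled pairs of points; it is a sanity check under the stated assumptions, not a proof, and the chosen values of \mu, L and the dimension are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, L, n = 1.0, 10.0, 4
    Q = np.diag(rng.uniform(mu, L, n))    # spectrum in [mu, L], so f below lies in F_{mu,L}
    f = lambda x: 0.5 * x @ Q @ x
    grad = lambda x: Q @ x

    def lhs_of_14(x, y):
        gx, gy = grad(x), grad(y)
        quad = (mu * np.dot(x - y, x - y)
                - (2 * mu / L) * np.dot(gx - gy, x - y)
                + np.dot(gx - gy, gx - gy) / L)
        return f(x) - f(y) + np.dot(gx, y - x) + quad / (2 * (1 - mu / L))

    for _ in range(1000):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        assert lhs_of_14(x, y) <= 1e-10   # (14) holds up to rounding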

3.1 Gradient method with exact line search

Now we show analytically the worst-case complexity of the gradient method with exact line search.

Theorem 3.1

Suppose that f\in\mathcal{F}_{\mu,L} and consider the gradient method with stepsize \alpha_{k} determined by exact line search. Then,

f_{k+1}-f_{*}\leq\left(\frac{L-\mu}{L+\mu}\right)^{2}(f_{k}-f_{*}). (15)
Proof

It follows from (14), g_{*}=0 and g_{k+1}^{\sf T}g_{k}=0 (a consequence of the exact line search (3)) that

f_{k+1}-f_{k}+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k+1}-x_{k}\|^{2}+\frac{2\mu}{L}g_{k}^{\sf T}(x_{k+1}-x_{k})+\frac{1}{L}\|g_{k+1}-g_{k}\|^{2}\right)\leq 0, (16)

f_{k}-f_{*}+g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k}^{\sf T}(x_{k}-x_{*})+\frac{1}{L}\|g_{k}\|^{2}\right)\leq 0, (17)

and

f_{k+1}-f_{*}+g_{k+1}^{\sf T}(x_{*}-x_{k+1})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k+1}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k+1}^{\sf T}(x_{k+1}-x_{*})+\frac{1}{L}\|g_{k+1}-g_{*}\|^{2}\right)\leq 0. (18)

Suppose that \zeta_{1},\zeta_{2},\zeta_{3} are nonnegative scalars with \zeta_{1}+\zeta_{2}+\zeta_{3}>0. Then a weighted sum of the above three inequalities yields

0 \geq \zeta_{1}\times(16)+\zeta_{2}\times(17)+\zeta_{3}\times(18)
= (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu\zeta_{1}\|(x_{k+1}-x_{*})-(x_{k}-x_{*})\|^{2}+\mu\zeta_{2}\|x_{k}-x_{*}\|^{2}+\mu\zeta_{3}\|x_{k+1}-x_{*}\|^{2}
\quad+\frac{2\mu\zeta_{1}}{L}g_{k}^{\sf T}((x_{k+1}-x_{*})-(x_{k}-x_{*}))+2\zeta_{2}g_{k}^{\sf T}(x_{*}-x_{k})+2\zeta_{3}g_{k+1}^{\sf T}(x_{*}-x_{k+1})+\frac{\zeta_{1}+\zeta_{3}}{L}\|g_{k+1}\|^{2}+\frac{\zeta_{1}+\zeta_{2}}{L}\|g_{k}\|^{2}\Big]
= (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu(\zeta_{1}+\zeta_{2})\|x_{k}-x_{*}\|^{2}-\left(\frac{2\mu\zeta_{1}}{L}+2\zeta_{2}\right)g_{k}^{\sf T}(x_{k}-x_{*})
\quad-2\mu\zeta_{1}(x_{k+1}-x_{*})^{\sf T}(x_{k}-x_{*})+\left(\frac{2\mu\zeta_{1}}{L}g_{k}-2\zeta_{3}g_{k+1}\right)^{\sf T}(x_{k+1}-x_{*})+\mu(\zeta_{1}+\zeta_{3})\|x_{k+1}-x_{*}\|^{2}+\frac{\zeta_{1}+\zeta_{3}}{L}\|g_{k+1}\|^{2}+\frac{\zeta_{1}+\zeta_{2}}{L}\|g_{k}\|^{2}\Big].

To proceed, we complete the square in all terms containing x_{k}-x_{*} to get

0 \geq (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu(\zeta_{1}+\zeta_{2})\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}
\quad+\beta_{1}\|x_{k+1}-x_{*}\|^{2}+\beta_{2}\|g_{k}\|^{2}+\beta_{3}\|g_{k+1}\|^{2}-2(x_{k+1}-x_{*})^{\sf T}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\Big]
= (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu(\zeta_{1}+\zeta_{2})\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}
\quad+\beta_{1}\left\|x_{k+1}-x_{*}-\frac{1}{\beta_{1}}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\right\|^{2}+\left(\beta_{2}-\frac{\beta_{4}^{2}}{\beta_{1}}\right)\|g_{k}\|^{2}+\left(\beta_{3}-\frac{\beta_{5}^{2}}{\beta_{1}}\right)\|g_{k+1}\|^{2}\Big],

where \delta is some scalar and

\beta_{1}=\mu(\zeta_{1}+\zeta_{3})-\frac{\mu\zeta_{1}^{2}}{\zeta_{1}+\zeta_{2}},\quad\beta_{2}=\frac{(L-\mu)(\mu\zeta_{1}^{2}-L\zeta_{2}^{2})}{\mu L^{2}(\zeta_{1}+\zeta_{2})},\quad\beta_{3}=\frac{\zeta_{1}+\zeta_{3}}{L}-\mu(\zeta_{1}+\zeta_{2})\delta^{2},\quad\beta_{4}=\frac{(L-\mu)\zeta_{1}\zeta_{2}}{L(\zeta_{1}+\zeta_{2})},\quad\beta_{5}=\mu\delta\zeta_{2}-\zeta_{3}.

As we will see later, choosing \delta\neq 0 is important for deriving the worst-case rate. It follows that

f_{k+1}-f_{*} \leq \frac{\zeta_{2}-\zeta_{1}}{\zeta_{1}+\zeta_{3}}(f_{k}-f_{*})-\frac{1}{2(1-\mu/L)}\Big[\frac{\mu(\zeta_{1}+\zeta_{2})}{\zeta_{1}+\zeta_{3}}\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}
\quad+\frac{\beta_{1}}{\zeta_{1}+\zeta_{3}}\left\|x_{k+1}-x_{*}-\frac{1}{\beta_{1}}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\right\|^{2}+\frac{1}{\zeta_{1}+\zeta_{3}}\left(\beta_{2}-\frac{\beta_{4}^{2}}{\beta_{1}}\right)\|g_{k}\|^{2}+\frac{1}{\zeta_{1}+\zeta_{3}}\left(\beta_{3}-\frac{\beta_{5}^{2}}{\beta_{1}}\right)\|g_{k+1}\|^{2}\Big]. (19)

With the aim of determining \zeta_{1}, \zeta_{2} and \zeta_{3}, we consider applying the gradient method with exact line search to a two-dimensional quadratic function with

A=\begin{pmatrix}\mu&0\\ 0&L\end{pmatrix}. (20)

By Theorem 2.1 and (12), in this case, the worst-case rate of the gradient method with exact line search matches the one in (15). Since the inequality (19) applies to the above quadratic case and the worst-case rate in (12) does not involve the last four terms in (19), we require

\beta_{2}-\frac{\beta_{4}^{2}}{\beta_{1}}=0,\quad\beta_{3}-\frac{\beta_{5}^{2}}{\beta_{1}}=0, (21)

\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}=0, (22)

and

\left\|x_{k+1}-x_{*}-\frac{1}{\beta_{1}}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\right\|^{2}=0. (23)

From Remark 1, we must have \alpha_{k}=\frac{2}{L+\mu} when the worst-case rate is achieved. Thus, for the above two-dimensional quadratic function, we get

x_{k}-x_{*}=A^{-1}g_{k},\quad g_{k+1}=(I-\alpha_{k}A)g_{k}=\begin{pmatrix}1-\frac{2\mu}{L+\mu}&0\\ 0&1-\frac{2L}{L+\mu}\end{pmatrix}g_{k}.

Suppose that the two components of g_{k} are nonzero. The above relations together with (22) and (23) imply that

\frac{1}{\mu}-\left(\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}\frac{1}{\mu}+\delta\right)\left(1-\frac{2\mu}{L+\mu}\right)-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}=0, (24)

\frac{1}{L}-\left(\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}\frac{1}{L}+\delta\right)\left(1-\frac{2L}{L+\mu}\right)-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}=0, (25)

\left(\frac{\beta_{1}}{\mu}+\beta_{5}\right)\left(1-\frac{2\mu}{L+\mu}\right)-\beta_{4}=0, (26)

\left(\frac{\beta_{1}}{L}+\beta_{5}\right)\left(1-\frac{2L}{L+\mu}\right)-\beta_{4}=0. (27)

Then, \mu L\times((24)+(25)) yields

L+\mu-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}\frac{(L-\mu)^{2}}{L+\mu}-2\,\frac{\mu\zeta_{1}+L\zeta_{2}}{\zeta_{1}+\zeta_{2}}=0,

which indicates

\zeta_{2}=\frac{2\mu}{L+\mu}\zeta_{1}. (28)

From the first equation in (21), we get

\beta_{1}=\frac{\beta_{4}^{2}}{\beta_{2}}=\frac{(L-\mu)^{2}\zeta_{1}^{2}\zeta_{2}^{2}}{L^{2}(\zeta_{1}+\zeta_{2})^{2}}\cdot\frac{\mu L^{2}(\zeta_{1}+\zeta_{2})}{(L-\mu)(\mu\zeta_{1}^{2}-L\zeta_{2}^{2})}=\frac{\mu(L-\mu)\zeta_{1}^{2}\zeta_{2}^{2}}{(\zeta_{1}+\zeta_{2})(\mu\zeta_{1}^{2}-L\zeta_{2}^{2})}.

Using the definition of \beta_{1} and rearranging terms, we obtain

\zeta_{3}=\frac{\mu\zeta_{1}^{2}(\zeta_{1}-\zeta_{2})}{\mu\zeta_{1}^{2}-L\zeta_{2}^{2}}-\zeta_{1}=\frac{2\mu}{L-\mu}\zeta_{1}. (29)

By (24) or (25), we have

\delta=\frac{\zeta_{1}}{L(\zeta_{1}+\zeta_{2})}=\frac{L+\mu}{L(L+3\mu)}. (30)

It is not difficult to check that the choices of \zeta_{2}, \zeta_{3} and \delta in (28), (29) and (30) are such that the two equations in (21), as well as (26) and (27), hold. Thus, we only need to find nonnegative \zeta_{1}, \zeta_{2} and \zeta_{3} satisfying (28) and (29). Letting \zeta_{3}=1, we get

\zeta_{1}=\frac{L-\mu}{2\mu},\quad\zeta_{2}=\frac{L-\mu}{L+\mu},\quad\zeta_{3}=1. (31)

Substituting \zeta_{1},\zeta_{2},\zeta_{3} given in (31) into (19), we get

f_{k+1}-f_{*} \leq \left(\frac{L-\mu}{L+\mu}\right)^{2}(f_{k}-f_{*})-\frac{\mu L(L+3\mu)}{2(L+\mu)^{2}}\left\|x_{k}-x_{*}-\frac{L+\mu}{L+3\mu}(x_{k+1}-x_{*})-\frac{3L+\mu}{L(L+3\mu)}g_{k}-\frac{L+\mu}{L(L+3\mu)}g_{k+1}\right\|^{2}
\quad-\frac{2\mu L^{2}}{(L-\mu)(L+3\mu)}\left\|x_{k+1}-x_{*}-\frac{(L-\mu)^{2}}{2\mu L(L+\mu)}g_{k}-\frac{L+\mu}{2\mu L}g_{k+1}\right\|^{2},

which implies (15). In addition, the rate in (15) is achieved by applying the gradient method with exact line search to a two-dimensional quadratic function with Hessian given by (20). We complete the proof.
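The attainment claim at the end of the proof can be illustrated numerically: starting the gradient method with exact line search from the worst-case point of Theorem 2.1 (with \psi(A)=I) on the two-dimensional quadratic with Hessian (20) and b=0, the ratio (f_{k+1}-f_{*})/(f_{k}-f_{*}) equals ((L-\mu)/(L+\mu))^{2} at every iteration. The sketch below is an illustration only; the chosen values of \mu and L are arbitrary.

    import numpy as np

    mu, L = 1.0, 10.0
    A = np.diag([mu, L])                              # Hessian (20), b = 0, so f_* = 0
    f = lambda x: 0.5 * x @ A @ x
    x = np.array([1.0, 1.0]) / np.sqrt(2) / np.array([mu, L])   # worst-case point (psi = I)
    rate = ((L - mu) / (L + mu)) ** 2                 # rate in (15)

    for _ in range(5):
        g = A @ x
        alpha = (g @ g) / (g @ A @ g)                 # exact line search on a quadratic
        x_next = x - alpha * g
        assert np.isclose(f(x_next), rate * f(x))     # (15) holds with equality
        x = x_next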

Remark 3

If we let \zeta_{1}+\zeta_{3}=1, it follows from (28) and (29) that

\zeta_{1}=\frac{L-\mu}{L+\mu},\quad\zeta_{2}=\frac{2\mu(L-\mu)}{(L+\mu)^{2}},\quad\zeta_{3}=\frac{2\mu}{L+\mu},

which are the parameters in de2017worst .
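The algebraic identities used in the proof, namely that the choices (28)-(30) satisfy (21), (26) and (27), can be verified symbolically. The following sympy sketch is a check of that algebra, not part of the original argument.

    import sympy as sp

    L, mu, z1 = sp.symbols('L mu zeta1', positive=True)
    z2 = 2 * mu / (L + mu) * z1                      # (28)
    z3 = 2 * mu / (L - mu) * z1                      # (29)
    delta = (L + mu) / (L * (L + 3 * mu))            # (30)

    b1 = mu * (z1 + z3) - mu * z1**2 / (z1 + z2)
    b2 = (L - mu) * (mu * z1**2 - L * z2**2) / (mu * L**2 * (z1 + z2))
    b3 = (z1 + z3) / L - mu * (z1 + z2) * delta**2
    b4 = (L - mu) * z1 * z2 / (L * (z1 + z2))
    b5 = mu * delta * z2 - z3

    checks = [b2 - b4**2 / b1,                                  # first equation in (21)
              b3 - b5**2 / b1,                                  # second equation in (21)
              (b1 / mu + b5) * (1 - 2 * mu / (L + mu)) - b4,    # (26)
              (b1 / L + b5) * (1 - 2 * L / (L + mu)) - b4]      # (27)
    assert all(sp.simplify(c) == 0 for c in checks)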

3.2 Gradient method with the Polyak stepsize

The next theorem gives an analytic analysis of the worst-case complexity of the gradient method using the Polyak stepsize (4).

Theorem 3.2

Suppose that f\in\mathcal{F}_{\mu,L} and consider the gradient method with stepsize \alpha_{k} given by (4). Then,

\|x_{k+1}-x_{*}\|^{2}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|^{2}. (32)
Proof

By (14) and g_{*}=0, we have

f_{k}-f_{*}+g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k}^{\sf T}(x_{k}-x_{*})+\frac{1}{L}\|g_{k}\|^{2}\right)\leq 0 (33)

and

f_{*}-f_{k}+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}+\frac{2\mu}{L}g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{L}\|g_{k}\|^{2}\right)\leq 0. (34)

It follows from the definition of \alpha_{k} in (4) that

2\gamma(f_{k}-f_{*})-\alpha_{k}\|g_{k}\|^{2}=0. (35)

Let \zeta_{1},\zeta_{2} be two nonnegative scalars with \zeta_{1}+\zeta_{2}>0, and let \zeta_{3}\in\mathbb{R}. The following weighted sum is valid:

0 \geq \zeta_{1}\times(33)+\zeta_{2}\times(34)+\zeta_{3}\times(35)
= \zeta_{1}\left[f_{k}-f_{*}+g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k}^{\sf T}(x_{k}-x_{*})+\frac{1}{L}\|g_{k}\|^{2}\right)\right]
\quad+\zeta_{2}\left[f_{*}-f_{k}+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}+\frac{2\mu}{L}g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{L}\|g_{k}\|^{2}\right)\right]+\zeta_{3}\left[2\gamma(f_{k}-f_{*})-\alpha_{k}\|g_{k}\|^{2}\right]
= (\zeta_{1}-\zeta_{2}+2\gamma\zeta_{3})(f_{k}-f_{*})+\mu L\beta\|x_{k}-x_{*}\|^{2}-\left(2\mu\beta+\zeta_{1}\right)g_{k}^{\sf T}(x_{k}-x_{*})+\left(\beta-\zeta_{3}\alpha_{k}\right)\|g_{k}\|^{2},

where \beta=\frac{\zeta_{1}+\zeta_{2}}{2(L-\mu)}. Substituting

\|x_{k+1}-x_{*}\|^{2}-\|x_{k}-x_{*}\|^{2}-\alpha_{k}^{2}\|g_{k}\|^{2}=-2\alpha_{k}g_{k}^{\sf T}(x_{k}-x_{*})

into the above inequality, we obtain

\sigma(f_{k}-f_{*})+\frac{2\mu\beta+\zeta_{1}}{2\alpha_{k}}\|x_{k+1}-x_{*}\|^{2}+\left(\beta-\zeta_{3}\alpha_{k}-\frac{\alpha_{k}(2\mu\beta+\zeta_{1})}{2}\right)\|g_{k}\|^{2}\leq\left(\frac{2\mu\beta+\zeta_{1}}{2\alpha_{k}}-\mu L\beta\right)\|x_{k}-x_{*}\|^{2}, (36)

where \sigma=\zeta_{1}-\zeta_{2}+2\gamma\zeta_{3}.

To get the desired convergence result, we assume that

\beta-\zeta_{3}\alpha_{k}-\frac{\alpha_{k}(2\mu\beta+\zeta_{1})}{2}=0, (37)

which together with the definition of \beta gives

\zeta_{3}=\frac{(1-\mu\alpha_{k})(\zeta_{1}+\zeta_{2})}{2(L-\mu)\alpha_{k}}-\frac{\zeta_{1}}{2}

and hence

\sigma = \zeta_{1}-\zeta_{2}+\frac{\gamma(1-\mu\alpha_{k})}{(L-\mu)\alpha_{k}}(\zeta_{1}+\zeta_{2})-\gamma\zeta_{1}
= \frac{(1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})}{(L-\mu)\alpha_{k}}\zeta_{1}+\frac{\gamma(1-\mu\alpha_{k})-(L-\mu)\alpha_{k}}{(L-\mu)\alpha_{k}}\zeta_{2}. (38)

Since f\in\mathcal{F}_{\mu,L}, we have \frac{1}{2L}\|g_{k}\|^{2}\leq f_{k}-f_{*}\leq\frac{1}{2\mu}\|g_{k}\|^{2}, and hence

\frac{\gamma}{L}\leq\alpha_{k}\leq\frac{\gamma}{\mu}.

Thus, 1-\mu\alpha_{k}\geq 1-\gamma\geq 0. Now we consider two cases:

Case 1: \gamma(1-\mu\alpha_{k})-(L-\mu)\alpha_{k}<0. In this case, we require

\sigma=\zeta_{1}-\zeta_{2}+2\gamma\zeta_{3}=0, (39)

which together with (38) indicates

\zeta_{2}=\frac{(1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})}{(L-\mu)\alpha_{k}-\gamma(1-\mu\alpha_{k})}\zeta_{1}

and hence

\beta=\frac{(2-\gamma)\alpha_{k}}{2\left((L-\mu)\alpha_{k}-\gamma(1-\mu\alpha_{k})\right)}\zeta_{1}.

Let

\zeta_{1}=\frac{2\left((L-\mu)\alpha_{k}-\gamma(1-\mu\alpha_{k})\right)}{(2-\gamma)\alpha_{k}}. (40)

We obtain \beta=1 and

\zeta_{2}=\frac{2\left((1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})\right)}{(2-\gamma)\alpha_{k}},\quad\zeta_{3}=\frac{2-(L+\mu)\alpha_{k}}{(2-\gamma)\alpha_{k}}. (41)
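A short symbolic check (a sketch, independent of the proof) confirms that the choices (40) and (41) indeed give \beta=1 and satisfy (37) and (39).

    import sympy as sp

    L, mu, gamma, a = sp.symbols('L mu gamma alpha_k', positive=True)
    z1 = 2 * ((L - mu) * a - gamma * (1 - mu * a)) / ((2 - gamma) * a)        # (40)
    z2 = 2 * ((1 - gamma) * (L - mu) * a + gamma * (1 - mu * a)) / ((2 - gamma) * a)
    z3 = (2 - (L + mu) * a) / ((2 - gamma) * a)                               # (41)
    beta = (z1 + z2) / (2 * (L - mu))

    assert sp.simplify(beta - 1) == 0                                         # beta = 1
    assert sp.simplify(beta - z3 * a - a * (2 * mu * beta + z1) / 2) == 0     # (37)
    assert sp.simplify(z1 - z2 + 2 * gamma * z3) == 0                         # (39)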

Clearly, \zeta_{1},\zeta_{2} are nonnegative. By (36), (37), (39), (40) and (41), we get

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\frac{2\mu L\beta\alpha_{k}}{2\mu\beta+\zeta_{1}}\right)\|x_{k}-x_{*}\|^{2}
\leq \left(1-\frac{\mu L(2-\gamma)\alpha_{k}^{2}}{(L+\mu)\alpha_{k}-\gamma}\right)\|x_{k}-x_{*}\|^{2}. (42)

Notice that the function h(\alpha)=\frac{\alpha^{2}}{(L+\mu)\alpha-\gamma} achieves its minimum at \alpha=\frac{2\gamma}{L+\mu}. It follows from (42) that

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-(2-\gamma)\mu L\,h\left(\frac{2\gamma}{L+\mu}\right)\right)\|x_{k}-x_{*}\|^{2}
= \left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|^{2}.

Case 2: \gamma(1-\mu\alpha_{k})-(L-\mu)\alpha_{k}\geq 0. Recall that \zeta_{1},\zeta_{2} are nonnegative and \zeta_{1}+\zeta_{2}>0. In this case, we cannot expect (39) to hold. It follows from (36) and (37) that

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\frac{2\mu L\beta\alpha_{k}}{2\mu\beta+\zeta_{1}}\right)\|x_{k}-x_{*}\|^{2}-\sigma(f_{k}-f_{*}). (43)

To simplify our analysis, we set \zeta_{2}=0. By the strong convexity of f, the definition of \beta, (38) and (43), we have

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\mu\alpha_{k}-\frac{\sigma\mu}{2}\right)\|x_{k}-x_{*}\|^{2}
= \left(1-\mu\alpha_{k}-\mu\,\frac{(1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})}{2(L-\mu)\alpha_{k}}\zeta_{1}\right)\|x_{k}-x_{*}\|^{2}.

We further suppose that \zeta_{1} is a scalar independent of \alpha_{k}. The function

h(\alpha)=\alpha+\frac{(1-\gamma)(L-\mu)\alpha+\gamma(1-\mu\alpha)}{2(L-\mu)\alpha}\zeta_{1}

attains its minimum at \alpha=\sqrt{\frac{\gamma\zeta_{1}}{2(L-\mu)}}, which implies

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\mu h\left(\sqrt{\frac{\gamma\zeta_{1}}{2(L-\mu)}}\right)\right)\|x_{k}-x_{*}\|^{2}. (44)

Notice that the rate in (44) holds for any f\in\mathcal{F}_{\mu,L}. Hence, for the quadratic case, the rate in (44) should coincide with the one in (13), and the corresponding stepsize must be equal to \frac{2\gamma}{L+\mu}. This gives \zeta_{1}=\frac{8\gamma(L-\mu)}{(L+\mu)^{2}}. We get (32) by substituting this \zeta_{1} into (44).

We complete the proof by considering the above two cases.
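The closing step of Case 2 can also be verified symbolically: with \zeta_{1}=8\gamma(L-\mu)/(L+\mu)^{2}, the minimizer of h is 2\gamma/(L+\mu) and the resulting factor 1-\mu h(\cdot) equals the rate in (32). The sympy sketch below checks these two claims; it is a sanity check, not part of the proof.

    import sympy as sp

    L, mu, gamma = sp.symbols('L mu gamma', positive=True)
    z1 = 8 * gamma * (L - mu) / (L + mu) ** 2
    alpha_min = sp.sqrt(gamma * z1 / (2 * (L - mu)))
    assert sp.simplify(alpha_min - 2 * gamma / (L + mu)) == 0     # minimizer of h

    a = sp.Symbol('alpha', positive=True)
    h = a + ((1 - gamma) * (L - mu) * a + gamma * (1 - mu * a)) / (2 * (L - mu) * a) * z1
    rate = 1 - 4 * gamma * (2 - gamma) * mu * L / (L + mu) ** 2   # rate in (32)
    assert sp.simplify(1 - mu * h.subs(a, 2 * gamma / (L + mu)) - rate) == 0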

Remark 4

When \gamma=1, the inequality (32) reduces to

\|x_{k+1}-x_{*}\|^{2}\leq\left(\frac{L-\mu}{L+\mu}\right)^{2}\|x_{k}-x_{*}\|^{2},

which recovers the one in Proposition 1 of barre2020complexity .

4 Concluding remarks

Based on the convergence results for the family of gradient methods (6) on strongly convex quadratics, we have established, in an analytic way, the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize for general smooth strongly convex objective functions. Corollary 1 and Theorem 3.2 show that, from the worst-case complexity point of view, the gradient method using the Polyak stepsize (8) achieves the fastest convergence rate when \gamma=1, which is also true for the generalized SD stepsize (7). However, for the case \gamma=1, we can show that the family of gradient methods (6) will zigzag in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian. Denote the components of g_{k} along the eigenvectors \xi_{i} by \nu_{k}^{(i)}, i=1,\ldots,n, i.e., g_{k}=\sum_{i=1}^{n}\nu_{k}^{(i)}\xi_{i}. The next theorem gives the asymptotic convergence behavior of each gradient method in the family (6); see Theorem 1 of huang2022asymptotic for details.

Theorem 4.1

Assume that the starting point x_{0} satisfies

g_{0}^{\sf T}\xi_{1}\neq 0\quad\textrm{and}\quad g_{0}^{\sf T}\xi_{n}\neq 0.

Let \{x_{k}\} be the iterates generated by applying a method in the family (6) with \gamma=1 to solve the unconstrained quadratic optimization (5). Then

\lim_{k\rightarrow\infty}\frac{(\nu_{2k}^{(i)})^{2}}{\sum_{j=1}^{n}(\nu_{2k}^{(j)})^{2}}=\left\{\begin{array}{ll}\frac{1}{1+c^{2}},&\text{if }i=1,\\ 0,&\text{if }i=2,\ldots,n-1,\\ \frac{c^{2}}{1+c^{2}},&\text{if }i=n,\end{array}\right.

and

\lim_{k\rightarrow\infty}\frac{(\nu_{2k+1}^{(i)})^{2}}{\sum_{j=1}^{n}(\nu_{2k+1}^{(j)})^{2}}=\left\{\begin{array}{ll}\frac{c^{2}\psi^{2}(L)}{\psi^{2}(\mu)+c^{2}\psi^{2}(L)},&\text{if }i=1,\\ 0,&\text{if }i=2,\ldots,n-1,\\ \frac{\psi^{2}(\mu)}{\psi^{2}(\mu)+c^{2}\psi^{2}(L)},&\text{if }i=n,\end{array}\right.

where c is a nonzero constant that can be determined by

c=\lim_{k\rightarrow\infty}\frac{\nu_{2k}^{(n)}}{\nu_{2k}^{(1)}}=-\frac{\psi(\mu)}{\psi(L)}\lim_{k\rightarrow\infty}\frac{\nu_{2k+1}^{(1)}}{\nu_{2k+1}^{(n)}}.
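A numerical illustration of Theorem 4.1 (a sketch with an arbitrary three-dimensional example, not from the paper): for the Polyak stepsize, i.e. \psi(z)=z^{-1}, the relations above imply that the product of consecutive ratios \nu_{k}^{(n)}/\nu_{k}^{(1)} converges to -\psi(\mu)/\psi(L)=-L/\mu, while the middle gradient component becomes negligible.

    import numpy as np

    mu, L = 1.0, 10.0
    A = np.diag([mu, 3.0, L])
    psi_of_A = np.linalg.inv(A)                  # psi(z) = 1/z, the Polyak stepsize
    x = np.array([1.0, 1.0, 1.0])                # g_0 has nonzero components along xi_1 and xi_n
    ratios = []
    for _ in range(200):
        g = A @ x
        ratios.append(g[2] / g[0])               # nu_k^{(n)} / nu_k^{(1)}
        alpha = (g @ psi_of_A @ g) / (g @ psi_of_A @ A @ g)
        x_next = x - alpha * g
        x = x_next / np.linalg.norm(x_next)      # renormalize; the stepsize is scale invariant

    g = A @ x
    print(ratios[-2] * ratios[-1], -L / mu)      # product of consecutive ratios tends to -L/mu
    print((g / np.linalg.norm(g)) ** 2)          # squared components: mass only on xi_1 and xi_n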
Figure 1: The gradient method with exact line search (SD) vs. that with the Polyak stepsize (Polyak).

Due to the zigzag behavior described in Theorem 4.1, the gradient method using the Polyak stepsize (8) with \gamma=1 may perform as poorly as the one with exact line search. To see this, we apply the family of gradient methods (6) with \gamma=1 and \psi(A)=I (exact line search) or \psi(A)=A^{-1} (the Polyak stepsize) to minimize the quadratic problem (5) with

A=\mathrm{diag}\{1,100\}\quad\mbox{and}\quad b=0.

We use x_{0}=(30,1)^{\sf T} as the starting point and stop the iteration once \|g_{k}\|\leq 10^{-8}\|g_{0}\|. Figure 1 shows that the gradient method with the Polyak stepsize may be slower than the SD method even on a strongly convex quadratic function. Techniques to break the zigzag pattern of the family of gradient methods (6) with \gamma=1 have been developed in huang2022asymptotic and show promising performance for the gradient method with exact line search. It would be interesting to investigate whether those techniques can lead to efficient gradient methods based on the Polyak stepsize.
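The experiment behind Figure 1 is easy to reproduce. The following sketch runs the two methods under the stated setting and reports the number of iterations needed to reach the stopping criterion; exact counts may differ slightly from the figure depending on floating-point details.

    import numpy as np

    A = np.diag([1.0, 100.0])
    b = np.zeros(2)
    x0 = np.array([30.0, 1.0])

    def run(psi_of_A, max_iter=10000):
        # Family (6) with gamma = 1: psi(A) = I is exact line search, psi(A) = A^{-1} is Polyak.
        x = x0.copy()
        g0_norm = np.linalg.norm(A @ x - b)
        for k in range(max_iter):
            g = A @ x - b
            if np.linalg.norm(g) <= 1e-8 * g0_norm:
                return k
            alpha = (g @ psi_of_A @ g) / (g @ psi_of_A @ A @ g)
            x = x - alpha * g
        return max_iter

    print('SD iterations:', run(np.eye(2)))
    print('Polyak iterations:', run(np.linalg.inv(A)))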

Acknowledgements.
The work of the first author was supported by Natural Science Foundation of Hebei Province (A2021202010). The work of the second author was supported by Hong Kong RGC General Research Fund (15309223) and PolyU AMA Project (230413007).

References

  • (1) M. Barré, A. Taylor, and A. d’Aspremont, Complexity guarantees for Polyak steps with momentum, in Conference on Learning Theory, PMLR, 2020, pp. 452–478.
  • (2) J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), pp. 141–148.
  • (3) C. Cartis, N. Gould, and P. Toint, On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems, SIAM J. Optim., 20 (2010), pp. 2833–2852.
  • (4) A. Cauchy, Méthode générale pour la résolution des systèmes d'équations simultanées, Comp. Rend. Sci. Paris, 25 (1847), pp. 536–538.
  • (5) Y.-H. Dai and Y.-X. Yuan, Alternate minimization gradient method, IMA J. Numer. Anal., 23 (2003), pp. 377–393.
  • (6) S. Das Gupta, B. P. Van Parys, and E. K. Ryu, Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods, Math. Program., 204 (2024), pp. 567–639.
  • (7) E. De Klerk, F. Glineur, and A. B. Taylor, On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions, Optim. Lett., 11 (2017), pp. 1185–1199.
  • (8) E. De Klerk, F. Glineur, and A. B. Taylor, Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation, SIAM J. Optim., 30 (2020), pp. 2053–2082.
  • (9) Y. Drori and M. Teboulle, Performance of first-order methods for smooth convex minimization: a novel approach, Math. Program., 145 (2014), pp. 451–482.
  • (10) B. Goujaud, A. Dieuleveut, and A. Taylor, On fundamental proof structures in first-order optimization, in 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 3023–3030.
  • (11) G. Gu and J. Yang, Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems, SIAM J. Optim., 30 (2020), pp. 1905–1921.
  • (12) E. Hazan and S. Kakade, Revisiting the Polyak step size, arXiv preprint, arXiv:1905.00313, (2019).
  • (13) Y. Huang, Y.-H. Dai, X.-W. Liu, and H. Zhang, On the asymptotic convergence and acceleration of gradient methods, J. Sci. Comput., 90 (2022), pp. 1–29.
  • (14) X. Jiang and S. U. Stich, Adaptive SGD with Polyak stepsize and line-search: Robust convergence and variance reduction, in Advances in Neural Information Processing Systems, 2023, pp. 26396–26424.
  • (15) L. Lessard, B. Recht, and A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints, SIAM J. Optim., 26 (2016), pp. 57–95.
  • (16) N. Loizou, S. Vaswani, I. H. Laradji, and S. Lacoste-Julien, Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1306–1314.
  • (17) B. T. Polyak, Minimization of unsmooth functionals, Comput. Math. Math. Phys., 9 (1969), pp. 14–29.
  • (18) B. T. Polyak, Introduction to Optimization, Optimization Software, Inc., New York, 1987.
  • (19) A. B. Taylor, J. M. Hendrickx, and F. Glineur, Exact worst-case performance of first-order methods for composite convex optimization, SIAM J. Optim., 27 (2017), pp. 1283–1313.
  • (20) A. B. Taylor, J. M. Hendrickx, and F. Glineur, Smooth strongly convex interpolation and exact worst-case performance of first-order methods, Math. Program., 161 (2017), pp. 307–345.
  • (21) M. Teboulle and Y. Vaisbourd, An elementary approach to tight worst case complexity analysis of gradient based methods, Math. Program., 201 (2023), pp. 63–96.
  • (22) X. Wang, M. Johansson, and T. Zhang, Generalized Polyak step size for first order optimization with momentum, in International Conference on Machine Learning, PMLR, 2023, pp. 35836–35863.
  • (23) Y.-X. Yuan, A new stepsize for the steepest descent method, J. Comput. Math., 24 (2006), pp. 149–156.
  • (24) Y.-X. Yuan, A short note on the Q-linear convergence of the steepest descent method, Math. Program., 123 (2010), pp. 339–343.