
Ya-Kui Huang, Institute of Mathematics, Hebei University of Technology, Tianjin 300401, China. E-mail: hyk@hebut.edu.cn
Hou-Duo Qi, Department of Data Science and Artificial Intelligence and Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. E-mail: houduo.qi@polyu.edu.hk

Analytic analysis of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize

Ya-Kui Huang    Hou-Duo Qi
(Received: date / Accepted: date)
Abstract

We give a novel analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, which previously could only be established by computer-assisted proofs. Our analysis is based on studying the linear convergence of a family of gradient methods whose stepsizes include the one determined by exact line search and the Polyak stepsize as special instances. The asymptotic behavior of this family is also investigated; it shows that the gradient method with the Polyak stepsize zigzags asymptotically in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian.

Keywords:
Gradient methods · Exact line search · Polyak stepsize · Worst-case complexity
MSC:
90C25 · 90C30

1 Introduction

We focus on the gradient method for solving the unconstrained optimization problem

\min_{x\in\mathbb{R}^{n}}\ f(x), (1)

where f:\mathbb{R}^{n}\to\mathbb{R} is a real-valued, convex, and continuously differentiable function. As is known, the gradient method updates iterates by

x_{k+1}=x_{k}-\alpha_{k}g_{k}, (2)

where g_{k}=\nabla f(x_{k}) and \alpha_{k}>0 is the stepsize. The classic choice of the stepsize is due to Cauchy cauchy1847methode, who suggested computing it by exact line search, i.e.,

\alpha_{k}^{SD}=\arg\min_{\alpha>0}\ f(x_{k}-\alpha g_{k}), (3)

which together with (2) is often referred to as the steepest descent (SD) method. Many stepsizes have been designed to improve the efficiency of gradient methods, such as the Barzilai–Borwein stepsize barzilai1988two, the Polyak stepsize polyak1969minimization; polyak1987introduction, and the Yuan stepsize yuan2006new. Owing to its good performance on machine learning problems, the Polyak stepsize has recently attracted a surge of interest jiang2024adaptive; loizou2021stochastic; wang2023generalized. It minimizes an upper bound on the distance from the new iterate to the optimal solution and has the form

\alpha_{k}^{P}=2\gamma\frac{f_{k}-f_{*}}{\|g_{k}\|^{2}},\quad\gamma\in(0,1], (4)

where f_{k}=f(x_{k}) and f_{*} is the optimal function value. Convergence of the gradient method using the Polyak stepsize has been studied in different works for the case \gamma=\frac{1}{2} or \gamma=1 barre2020complexity; hazan2019revisiting.
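To make the update concrete, the following minimal Python sketch implements the gradient method (2) with the Polyak stepsize (4). It assumes that the optimal value f_{*} is available, as (4) requires; the function name polyak_gradient_descent and the toy quadratic are illustrative choices, not taken from any reference implementation.

    import numpy as np

    def polyak_gradient_descent(f, grad, f_star, x0, gamma=1.0, tol=1e-8, max_iter=1000):
        # Gradient method (2) with the Polyak stepsize (4); f_star must be known.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) <= tol:
                break
            alpha = 2.0 * gamma * (f(x) - f_star) / np.dot(g, g)  # stepsize (4)
            x = x - alpha * g                                     # update (2)
        return x

    # Illustration on a strongly convex quadratic with known minimum value f_* = 0.
    A = np.diag([1.0, 100.0])
    x_sol = polyak_gradient_descent(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                                    f_star=0.0, x0=[30.0, 1.0])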

In recent years, analyzing the worst-case complexities of gradient-based methods has attracted much attention cartis2010complexity; das2024branch; de2020worst; drori2014performance; gu2020tight; lessard2016analysis; taylor2017exact; teboulle2023elementary; yuan2010short. In particular, Drori and Teboulle drori2014performance formulate the worst-case complexity of a given algorithm as the optimal value of the so-called performance estimation problem (PEP). By applying suitable relaxations to PEP, a bound on the worst-case complexity can be obtained by solving a convex semidefinite program (SDP). Following drori2014performance, the worst-case complexities of various popular methods have been analyzed, including the gradient method, Nesterov's fast gradient method, the inexact gradient method, and the proximal point algorithm; see de2020worst; gu2020tight; taylor2017exact and the references therein.

Although the PEP framework is a powerful tool for studying algorithm complexity, as pointed out by Teboulle and Vaisbourd teboulle2023elementary, it lacks an intuitive explanation of how the worst-case complexity is obtained: (i) one often has to employ a computer-assisted proof, since the aforementioned SDP usually has no closed-form solution and must be solved numerically; (ii) even when a closed-form solution is available, it is generally unclear how it is deduced. Moreover, it is not easy to apply the PEP framework to gradient methods with adaptive stepsizes goujaud2023fundamental. Consequently, most existing complexity results for gradient methods focus on fixed stepsizes. Recently, by resorting to PEP, the worst-case complexities of the gradient method with exact line search and with the Polyak stepsize (4) with \gamma=1 were established through computer-assisted proofs in de2017worst and barre2020complexity, respectively.

The purpose of this paper is to give a novel analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, for smooth strongly convex functions with Lipschitz continuous gradient. To this end, we first study a family of gradient methods whose stepsizes include the one determined by exact line search and the Polyak stepsize as special instances. We show that this family converges linearly when applied to a strongly convex quadratic function. Then, based on the convergence results for quadratics, we establish in an analytic way the worst-case complexity of the gradient method with exact line search and the Polyak stepsize for general smooth strongly convex objective functions, which recovers the computer-assisted results in de2017worst and barre2020complexity. To our knowledge, this is the first analytic proof of the worst-case complexity of the gradient method with exact line search and the Polyak stepsize. Furthermore, we show that the considered family asymptotically zigzags in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian, a behavior that has not been clarified in the literature for the gradient method using the Polyak stepsize.

The paper is organized as follows. In Section 2, we investigate the linear convergence of a family of gradient methods on strongly convex quadratics. In Section 3, we present our analytic proof of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively. Finally, concluding remarks are drawn and the asymptotic behavior of the considered family is discussed in Section 4.

2 Convergence analysis for quadratics

In this section, we consider the unconstrained quadratic optimization

\min_{x\in\mathbb{R}^{n}}\ f(x)=\frac{1}{2}x^{\sf T}Ax-b^{\sf T}x, (5)

where b\in\mathbb{R}^{n} and A\in\mathbb{R}^{n\times n} is symmetric positive definite. Let \{\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\} be the eigenvalues of A, and \{\xi_{1},\xi_{2},\ldots,\xi_{n}\} be the associated orthonormal eigenvectors. Without loss of generality, we assume that

A=\mathrm{diag}\{\lambda_{1},\lambda_{2},\ldots,\lambda_{n}\},\quad 0<\mu=\lambda_{1}<\lambda_{2}<\cdots<\lambda_{n}=L.

For a more general and uniform analysis, we investigate a family of gradient methods with stepsize given by

\alpha_{k}=\gamma\frac{g_{k}^{\sf T}\psi(A)g_{k}}{g_{k}^{\sf T}\psi(A)Ag_{k}},\quad\gamma\in(0,1], (6)

where \psi is a real analytic function on [\mu,L] that can be expressed by the Laurent series \psi(z)=\sum_{k=-\infty}^{\infty}c_{k}z^{k} with c_{k}\in\mathbb{R} such that 0<\sum_{k=-\infty}^{\infty}c_{k}z^{k}<+\infty for all z\in[\mu,L]. Clearly, when \psi(A)=I, we get the generalized SD stepsize, namely

\alpha_{k}^{GSD}=\gamma\arg\min_{\alpha>0}\ f(x_{k}-\alpha g_{k})=\gamma\frac{g_{k}^{\sf T}g_{k}}{g_{k}^{\sf T}Ag_{k}}, (7)

which is exactly the stepsize \alpha_{k}^{SD} obtained by exact line search when \gamma=1. See dai2003altermin for gradient methods with shortened SD steps, i.e., \gamma\in(0,1). Interestingly, the Polyak stepsize \alpha_{k}^{P} in (4) corresponds to the case \psi(A)=A^{-1}. In fact, since f_{*}=f(x_{*}), where x_{*}=A^{-1}b is the optimal solution, we have

f_{k}=f_{*}+\frac{1}{2}(x_{k}-x_{*})^{\sf T}A(x_{k}-x_{*})=f_{*}+\frac{1}{2}g_{k}^{\sf T}A^{-1}g_{k},

which gives

\alpha_{k}^{P}=2\gamma\frac{f_{k}-f_{*}}{\|g_{k}\|^{2}}=\gamma\frac{g_{k}^{\sf T}A^{-1}g_{k}}{g_{k}^{\sf T}g_{k}}. (8)
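As a quick numerical sanity check (not part of the paper's argument), the following Python snippet verifies on a random quadratic of the form (5) that the family stepsize (6) reduces to the exact line search stepsize (7) for \psi(A)=I and to the Polyak stepsize (4), i.e. (8), for \psi(A)=A^{-1}; all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma = 5, 1.0
    A = np.diag(np.sort(rng.uniform(1.0, 10.0, n)))          # Hessian of (5)
    b = rng.standard_normal(n)
    x_star = np.linalg.solve(A, b)
    f = lambda x: 0.5 * x @ A @ x - b @ x
    x = rng.standard_normal(n)
    g = A @ x - b

    def family_step(psi_of_A):                                # stepsize (6)
        return gamma * (g @ psi_of_A @ g) / (g @ psi_of_A @ A @ g)

    sd_step = gamma * (g @ g) / (g @ A @ g)                   # (7)
    polyak_step = 2 * gamma * (f(x) - f(x_star)) / (g @ g)    # (4)
    assert np.isclose(family_step(np.eye(n)), sd_step)            # psi(A) = I
    assert np.isclose(family_step(np.linalg.inv(A)), polyak_step) # psi(A) = A^{-1}, cf. (8)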

For a symmetric positive definite matrix D, denote the weighted norm by \|x\|_{D}=\sqrt{x^{\sf T}Dx}. We establish linear convergence of the family of gradient methods (6) in the next theorem.

Theorem 2.1

Consider applying the gradient method (2) with the stepsize \alpha_{k} given by (6) to solve the unconstrained quadratic optimization (5). It holds that

\|x_{k+1}-x_{*}\|_{\psi(A)A}^{2}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}. (9)

Moreover, if x_{k}=A^{-1}\left(\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right)+b\right), then the two sides of (9) are equal.

Proof

Using x_{k}-x_{*}=A^{-1}g_{k} and the definition of \alpha_{k} in (6), we have

\|x_{k+1}-x_{*}\|_{\psi(A)A}^{2} = \|x_{k}-\alpha_{k}g_{k}-x_{*}\|_{\psi(A)A}^{2}
= \|x_{k}-x_{*}\|_{\psi(A)A}^{2}-2\alpha_{k}g_{k}^{\sf T}\psi(A)A(x_{k}-x_{*})+\alpha_{k}^{2}g_{k}^{\sf T}\psi(A)Ag_{k}
= g_{k}^{\sf T}\psi(A)A^{-1}g_{k}-2\alpha_{k}g_{k}^{\sf T}\psi(A)g_{k}+\alpha_{k}^{2}g_{k}^{\sf T}\psi(A)Ag_{k}
= g_{k}^{\sf T}\psi(A)A^{-1}g_{k}-\gamma(2-\gamma)\frac{(g_{k}^{\sf T}\psi(A)g_{k})^{2}}{g_{k}^{\sf T}\psi(A)Ag_{k}}
= \left(1-\frac{\gamma(2-\gamma)(g_{k}^{\sf T}\psi(A)g_{k})^{2}}{g_{k}^{\sf T}\psi(A)Ag_{k}\cdot g_{k}^{\sf T}\psi(A)A^{-1}g_{k}}\right)g_{k}^{\sf T}\psi(A)A^{-1}g_{k}
= \left(1-\frac{\gamma(2-\gamma)(g_{k}^{\sf T}\psi(A)g_{k})^{2}}{g_{k}^{\sf T}\psi(A)Ag_{k}\cdot g_{k}^{\sf T}\psi(A)A^{-1}g_{k}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}
\leq \left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}, (10)

where the last inequality follows from the Kantorovich inequality.

If x_{k}=A^{-1}\left(\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right)+b\right), we have

g_{k}=\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right), (11)

which gives

g_{k}^{\sf T}\psi(A)A^{j}g_{k}=\frac{1}{2}\left(\mu^{j}+L^{j}\right),\quad j=-1,0,1.

Thus,

\|x_{k+1}-x_{*}\|_{\psi(A)A}^{2} = \left(1-\frac{\gamma(2-\gamma)}{\frac{1}{2}(L+\mu)\cdot\frac{1}{2}\left(\mu^{-1}+L^{-1}\right)}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}
= \left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|_{\psi(A)A}^{2}.

We complete the proof.
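The tightness statement of Theorem 2.1 can be checked numerically. The sketch below is an illustration only, with an arbitrarily chosen admissible \psi(z)=z^{2} and a diagonal Hessian; it confirms that (9) holds with equality at the special point x_{k} given in the theorem.

    import numpy as np

    mu, L, gamma = 1.0, 10.0, 1.0
    A = np.diag([mu, 4.0, L])                      # lambda_1 = mu, lambda_n = L
    b = np.array([1.0, -2.0, 0.5])
    x_star = np.linalg.solve(A, b)
    P = A @ A                                      # psi(A) with psi(z) = z^2
    xi_1, xi_n = np.eye(3)[:, 0], np.eye(3)[:, 2]

    # psi^{-1/2}(A) is computed entrywise, which is valid because A (hence P) is diagonal.
    psi_inv_sqrt = np.linalg.inv(np.sqrt(P))
    x_k = np.linalg.solve(A, psi_inv_sqrt @ (xi_1 + xi_n) / np.sqrt(2) + b)
    g = A @ x_k - b
    alpha = gamma * (g @ P @ g) / (g @ P @ A @ g)  # stepsize (6)
    x_next = x_k - alpha * g

    wnorm2 = lambda v: v @ (P @ A) @ v             # squared psi(A)A-weighted norm
    rate = 1 - 4 * gamma * (2 - gamma) * mu * L / (L + mu) ** 2
    assert np.isclose(wnorm2(x_next - x_star), rate * wnorm2(x_k - x_star))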

Remark 1

From the proof of Theorem 2.1, if the worst-case convergence rate of the gradient method with the stepsize \alpha_{k} given by (6) is achieved, then we must have \alpha_{k}=\frac{2\gamma}{L+\mu}. Moreover, when \gamma=1 and x_{t}=A^{-1}\left(\frac{1}{\sqrt{2}}\psi^{-\frac{1}{2}}(A)\left(\xi_{1}\pm\xi_{n}\right)+b\right) for some t, the two sides of (9) are equal for all k\geq t.

By setting \psi(A)=I and \psi(A)=A^{-1} in (9), respectively, we get the following results for the gradient method with the generalized SD stepsize (7) and with the Polyak stepsize (8).

Corollary 1

Consider applying the gradient method (2) to solve the unconstrained quadratic optimization (5).

(i) If \alpha_{k} is given by the generalized SD stepsize (7), then

f_{k+1}-f_{*}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)(f_{k}-f_{*}). (12)

(ii) If \alpha_{k} is given by the Polyak stepsize (8), then

\|x_{k+1}-x_{*}\|^{2}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|^{2}. (13)
Remark 2

From Corollary 1 and the fact that \gamma(2-\gamma) attains its maximum 1 at \gamma=1, the generalized SD stepsize (7) with \gamma=1, i.e., \alpha_{k}^{SD}=\frac{g_{k}^{\sf T}g_{k}}{g_{k}^{\sf T}Ag_{k}}, yields a faster convergence rate than any other value of \gamma, while for the Polyak stepsize (8) the choice \alpha_{k}^{P}=2\frac{f_{k}-f_{*}}{\|g_{k}\|^{2}} gives the fastest gradient method in the sense of (13).

3 Worst-case complexity for smooth strongly convex functions

In this section, making use of the convergence results of the previous section, we present an analytic analysis of the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize, respectively, on the class of L-smooth and \mu-strongly convex functions, denoted by \mathcal{F}_{\mu,L}.

Definition 1

Let f:\mathbb{R}^{n}\to\mathbb{R} be a continuously differentiable function and \mu,L>0. We say that f is L-smooth and \mu-strongly convex if

(i) f is L-smooth, i.e.,

f(y)\leq f(x)+\nabla f(x)^{\sf T}(y-x)+\frac{L}{2}\|x-y\|^{2},\quad\forall\,x,y\in\mathbb{R}^{n},

(ii) f is \mu-strongly convex, i.e.,

f(y)\geq f(x)+\nabla f(x)^{\sf T}(y-x)+\frac{\mu}{2}\|x-y\|^{2},\quad\forall\,x,y\in\mathbb{R}^{n}.

We will use the following important inequality for f\in\mathcal{F}_{\mu,L},

f(x)-f(y)+\nabla f(x)^{\sf T}(y-x)+\frac{1}{2(1-\mu/L)}\left(\mu\|x-y\|^{2}-\frac{2\mu}{L}(\nabla f(x)-\nabla f(y))^{\sf T}(x-y)+\frac{1}{L}\|\nabla f(x)-\nabla f(y)\|^{2}\right)\leq 0, (14)

see Theorem 4 of taylor2017smooth.
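For readers who wish to see (14) in action, the following sketch checks the inequality numerically on a quadratic instance of \mathcal{F}_{\mu,L} at randomly sampled pairs of points; it is a sanity check under the stated assumptions, not a proof, and the chosen values of \mu, L and the dimension are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, L, n = 1.0, 10.0, 4
    Q = np.diag(rng.uniform(mu, L, n))    # spectrum in [mu, L], so f below lies in F_{mu,L}
    f = lambda x: 0.5 * x @ Q @ x
    grad = lambda x: Q @ x

    def lhs_of_14(x, y):
        gx, gy = grad(x), grad(y)
        quad = (mu * np.dot(x - y, x - y)
                - (2 * mu / L) * np.dot(gx - gy, x - y)
                + np.dot(gx - gy, gx - gy) / L)
        return f(x) - f(y) + np.dot(gx, y - x) + quad / (2 * (1 - mu / L))

    for _ in range(1000):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        assert lhs_of_14(x, y) <= 1e-10   # (14) holds up to rounding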

3.1 Gradient method with exact line search

Now we show analytically the worst-case complexity of the gradient method with exact line search.

Theorem 3.1

Suppose that f\in\mathcal{F}_{\mu,L} and consider the gradient method with stepsize \alpha_{k} determined by exact line search. Then,

f_{k+1}-f_{*}\leq\left(\frac{L-\mu}{L+\mu}\right)^{2}(f_{k}-f_{*}). (15)
Proof

It follows from (14), g_{*}=0 and g_{k+1}^{\sf T}g_{k}=0 (a consequence of the exact line search (3)) that

f_{k+1}-f_{k}+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k+1}-x_{k}\|^{2}+\frac{2\mu}{L}g_{k}^{\sf T}(x_{k+1}-x_{k})+\frac{1}{L}\|g_{k+1}-g_{k}\|^{2}\right)\leq 0, (16)

f_{k}-f_{*}+g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k}^{\sf T}(x_{k}-x_{*})+\frac{1}{L}\|g_{k}\|^{2}\right)\leq 0, (17)

and

f_{k+1}-f_{*}+g_{k+1}^{\sf T}(x_{*}-x_{k+1})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k+1}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k+1}^{\sf T}(x_{k+1}-x_{*})+\frac{1}{L}\|g_{k+1}-g_{*}\|^{2}\right)\leq 0. (18)

Suppose that \zeta_{1},\zeta_{2},\zeta_{3} are nonnegative scalars with \zeta_{1}+\zeta_{2}+\zeta_{3}>0. Then a weighted sum of the above three inequalities yields

0 \geq \zeta_{1}\times(16)+\zeta_{2}\times(17)+\zeta_{3}\times(18)
= (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu\zeta_{1}\|(x_{k+1}-x_{*})-(x_{k}-x_{*})\|^{2}+\mu\zeta_{2}\|x_{k}-x_{*}\|^{2}+\mu\zeta_{3}\|x_{k+1}-x_{*}\|^{2}
\quad+\frac{2\mu\zeta_{1}}{L}g_{k}^{\sf T}((x_{k+1}-x_{*})-(x_{k}-x_{*}))+2\zeta_{2}g_{k}^{\sf T}(x_{*}-x_{k})+2\zeta_{3}g_{k+1}^{\sf T}(x_{*}-x_{k+1})+\frac{\zeta_{1}+\zeta_{3}}{L}\|g_{k+1}\|^{2}+\frac{\zeta_{1}+\zeta_{2}}{L}\|g_{k}\|^{2}\Big]
= (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu(\zeta_{1}+\zeta_{2})\|x_{k}-x_{*}\|^{2}-\left(\frac{2\mu\zeta_{1}}{L}+2\zeta_{2}\right)g_{k}^{\sf T}(x_{k}-x_{*})
\quad-2\mu\zeta_{1}(x_{k+1}-x_{*})^{\sf T}(x_{k}-x_{*})+\left(\frac{2\mu\zeta_{1}}{L}g_{k}-2\zeta_{3}g_{k+1}\right)^{\sf T}(x_{k+1}-x_{*})+\mu(\zeta_{1}+\zeta_{3})\|x_{k+1}-x_{*}\|^{2}+\frac{\zeta_{1}+\zeta_{3}}{L}\|g_{k+1}\|^{2}+\frac{\zeta_{1}+\zeta_{2}}{L}\|g_{k}\|^{2}\Big].

To proceed, we complete the square in all terms containing x_{k}-x_{*} to get

0 \geq (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu(\zeta_{1}+\zeta_{2})\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}
\quad+\beta_{1}\|x_{k+1}-x_{*}\|^{2}+\beta_{2}\|g_{k}\|^{2}+\beta_{3}\|g_{k+1}\|^{2}-2(x_{k+1}-x_{*})^{\sf T}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\Big]
= (\zeta_{1}+\zeta_{3})(f_{k+1}-f_{*})+(\zeta_{2}-\zeta_{1})(f_{k}-f_{*})+\frac{1}{2(1-\mu/L)}\Big[\mu(\zeta_{1}+\zeta_{2})\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}
\quad+\beta_{1}\left\|x_{k+1}-x_{*}-\frac{1}{\beta_{1}}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\right\|^{2}+\left(\beta_{2}-\frac{\beta_{4}^{2}}{\beta_{1}}\right)\|g_{k}\|^{2}+\left(\beta_{3}-\frac{\beta_{5}^{2}}{\beta_{1}}\right)\|g_{k+1}\|^{2}\Big],

where \delta is some scalar and

\beta_{1}=\mu(\zeta_{1}+\zeta_{3})-\frac{\mu\zeta_{1}^{2}}{\zeta_{1}+\zeta_{2}},\quad\beta_{2}=\frac{(L-\mu)(\mu\zeta_{1}^{2}-L\zeta_{2}^{2})}{\mu L^{2}(\zeta_{1}+\zeta_{2})},\quad\beta_{3}=\frac{\zeta_{1}+\zeta_{3}}{L}-\mu(\zeta_{1}+\zeta_{2})\delta^{2},\quad\beta_{4}=\frac{(L-\mu)\zeta_{1}\zeta_{2}}{L(\zeta_{1}+\zeta_{2})},\quad\beta_{5}=\mu\delta\zeta_{2}-\zeta_{3}.

As we will see later, choosing \delta\neq 0 is important for deriving the worst-case rate. It follows that

f_{k+1}-f_{*} \leq \frac{\zeta_{2}-\zeta_{1}}{\zeta_{1}+\zeta_{3}}(f_{k}-f_{*})-\frac{1}{2(1-\mu/L)}\Big[\frac{\mu(\zeta_{1}+\zeta_{2})}{\zeta_{1}+\zeta_{3}}\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}
\quad+\frac{\beta_{1}}{\zeta_{1}+\zeta_{3}}\left\|x_{k+1}-x_{*}-\frac{1}{\beta_{1}}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\right\|^{2}+\frac{1}{\zeta_{1}+\zeta_{3}}\left(\beta_{2}-\frac{\beta_{4}^{2}}{\beta_{1}}\right)\|g_{k}\|^{2}+\frac{1}{\zeta_{1}+\zeta_{3}}\left(\beta_{3}-\frac{\beta_{5}^{2}}{\beta_{1}}\right)\|g_{k+1}\|^{2}\Big]. (19)

With the aim of determining \zeta_{1}, \zeta_{2} and \zeta_{3}, we consider applying the gradient method with exact line search to a two-dimensional quadratic function with

A=\begin{pmatrix}\mu&0\\ 0&L\end{pmatrix}. (20)

By Theorem 2.1 and (12), in this case, the worst-case rate of the gradient method with exact line search matches the one in (15). Since the inequality (19) applies to the above quadratic case and the worst-case rate in (12) does not involve the last four terms in (19), we require

\beta_{2}-\frac{\beta_{4}^{2}}{\beta_{1}}=0,\quad\beta_{3}-\frac{\beta_{5}^{2}}{\beta_{1}}=0, (21)

\left\|x_{k}-x_{*}-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}(x_{k+1}-x_{*})-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}g_{k}-\delta g_{k+1}\right\|^{2}=0, (22)

and

\left\|x_{k+1}-x_{*}-\frac{1}{\beta_{1}}\left(\beta_{4}g_{k}-\beta_{5}g_{k+1}\right)\right\|^{2}=0. (23)

From Remark 1, we must have \alpha_{k}=\frac{2}{L+\mu} when the worst-case rate is achieved. Thus, for the above two-dimensional quadratic function, we get

x_{k}-x_{*}=A^{-1}g_{k},\quad g_{k+1}=(I-\alpha_{k}A)g_{k}=\begin{pmatrix}1-\frac{2\mu}{L+\mu}&0\\ 0&1-\frac{2L}{L+\mu}\end{pmatrix}g_{k}.

Suppose that the two components of g_{k} are nonzero. The above relations together with (22) and (23) imply that

\frac{1}{\mu}-\left(\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}\frac{1}{\mu}+\delta\right)\left(1-\frac{2\mu}{L+\mu}\right)-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}=0, (24)

\frac{1}{L}-\left(\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}\frac{1}{L}+\delta\right)\left(1-\frac{2L}{L+\mu}\right)-\frac{\mu\zeta_{1}+L\zeta_{2}}{\mu L(\zeta_{1}+\zeta_{2})}=0, (25)

\left(\frac{\beta_{1}}{\mu}+\beta_{5}\right)\left(1-\frac{2\mu}{L+\mu}\right)-\beta_{4}=0, (26)

\left(\frac{\beta_{1}}{L}+\beta_{5}\right)\left(1-\frac{2L}{L+\mu}\right)-\beta_{4}=0. (27)

Then, \mu L\times((24)+(25)) yields

L+\mu-\frac{\zeta_{1}}{\zeta_{1}+\zeta_{2}}\frac{(L-\mu)^{2}}{L+\mu}-2\,\frac{\mu\zeta_{1}+L\zeta_{2}}{\zeta_{1}+\zeta_{2}}=0,

which indicates

\zeta_{2}=\frac{2\mu}{L+\mu}\zeta_{1}. (28)

From the first equation in (21), we get

\beta_{1}=\frac{\beta_{4}^{2}}{\beta_{2}}=\frac{(L-\mu)^{2}\zeta_{1}^{2}\zeta_{2}^{2}}{L^{2}(\zeta_{1}+\zeta_{2})^{2}}\cdot\frac{\mu L^{2}(\zeta_{1}+\zeta_{2})}{(L-\mu)(\mu\zeta_{1}^{2}-L\zeta_{2}^{2})}=\frac{\mu(L-\mu)\zeta_{1}^{2}\zeta_{2}^{2}}{(\zeta_{1}+\zeta_{2})(\mu\zeta_{1}^{2}-L\zeta_{2}^{2})}.

Using the definition of \beta_{1} and rearranging terms, we obtain

\zeta_{3}=\frac{\mu\zeta_{1}^{2}(\zeta_{1}-\zeta_{2})}{\mu\zeta_{1}^{2}-L\zeta_{2}^{2}}-\zeta_{1}=\frac{2\mu}{L-\mu}\zeta_{1}. (29)

By (24) or (25), we have

\delta=\frac{\zeta_{1}}{L(\zeta_{1}+\zeta_{2})}=\frac{L+\mu}{L(L+3\mu)}. (30)

It is not difficult to check that the choices of \zeta_{2}, \zeta_{3} and \delta in (28), (29) and (30) are such that the two equations in (21), as well as (26) and (27), hold. Thus, we only need to find nonnegative \zeta_{1}, \zeta_{2} and \zeta_{3} satisfying (28) and (29). Letting \zeta_{3}=1, we get

\zeta_{1}=\frac{L-\mu}{2\mu},\quad\zeta_{2}=\frac{L-\mu}{L+\mu},\quad\zeta_{3}=1. (31)

Substituting \zeta_{1},\zeta_{2},\zeta_{3} given in (31) into (19), we get

f_{k+1}-f_{*} \leq \left(\frac{L-\mu}{L+\mu}\right)^{2}(f_{k}-f_{*})-\frac{\mu L(L+3\mu)}{2(L+\mu)^{2}}\left\|x_{k}-x_{*}-\frac{L+\mu}{L+3\mu}(x_{k+1}-x_{*})-\frac{3L+\mu}{L(L+3\mu)}g_{k}-\frac{L+\mu}{L(L+3\mu)}g_{k+1}\right\|^{2}
\quad-\frac{2\mu L^{2}}{(L-\mu)(L+3\mu)}\left\|x_{k+1}-x_{*}-\frac{(L-\mu)^{2}}{2\mu L(L+\mu)}g_{k}-\frac{L+\mu}{2\mu L}g_{k+1}\right\|^{2},

which implies (15). In addition, the rate in (15) is achieved by applying the gradient method with exact line search to a two-dimensional quadratic function with Hessian given by (20). We complete the proof.
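The attainment claim at the end of the proof can be illustrated numerically: starting the gradient method with exact line search from the worst-case point of Theorem 2.1 (with \psi(A)=I) on the two-dimensional quadratic with Hessian (20) and b=0, the ratio (f_{k+1}-f_{*})/(f_{k}-f_{*}) equals ((L-\mu)/(L+\mu))^{2} at every iteration. The sketch below is an illustration only; the chosen values of \mu and L are arbitrary.

    import numpy as np

    mu, L = 1.0, 10.0
    A = np.diag([mu, L])                              # Hessian (20), b = 0, so f_* = 0
    f = lambda x: 0.5 * x @ A @ x
    x = np.array([1.0, 1.0]) / np.sqrt(2) / np.array([mu, L])   # worst-case point (psi = I)
    rate = ((L - mu) / (L + mu)) ** 2                 # rate in (15)

    for _ in range(5):
        g = A @ x
        alpha = (g @ g) / (g @ A @ g)                 # exact line search on a quadratic
        x_next = x - alpha * g
        assert np.isclose(f(x_next), rate * f(x))     # (15) holds with equality
        x = x_next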

Remark 3

If we let \zeta_{1}+\zeta_{3}=1, it follows from (28) and (29) that

\zeta_{1}=\frac{L-\mu}{L+\mu},\quad\zeta_{2}=\frac{2\mu(L-\mu)}{(L+\mu)^{2}},\quad\zeta_{3}=\frac{2\mu}{L+\mu},

which are the parameters in de2017worst .
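The algebraic identities used in the proof, namely that the choices (28)-(30) satisfy (21), (26) and (27), can be verified symbolically. The following sympy sketch is a check of that algebra, not part of the original argument.

    import sympy as sp

    L, mu, z1 = sp.symbols('L mu zeta1', positive=True)
    z2 = 2 * mu / (L + mu) * z1                      # (28)
    z3 = 2 * mu / (L - mu) * z1                      # (29)
    delta = (L + mu) / (L * (L + 3 * mu))            # (30)

    b1 = mu * (z1 + z3) - mu * z1**2 / (z1 + z2)
    b2 = (L - mu) * (mu * z1**2 - L * z2**2) / (mu * L**2 * (z1 + z2))
    b3 = (z1 + z3) / L - mu * (z1 + z2) * delta**2
    b4 = (L - mu) * z1 * z2 / (L * (z1 + z2))
    b5 = mu * delta * z2 - z3

    checks = [b2 - b4**2 / b1,                                  # first equation in (21)
              b3 - b5**2 / b1,                                  # second equation in (21)
              (b1 / mu + b5) * (1 - 2 * mu / (L + mu)) - b4,    # (26)
              (b1 / L + b5) * (1 - 2 * L / (L + mu)) - b4]      # (27)
    assert all(sp.simplify(c) == 0 for c in checks)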

3.2 Gradient method with the Polyak stepsize

The next theorem gives an analytic analysis of the worst-case complexity of the gradient method using the Polyak stepsize (4).

Theorem 3.2

Suppose that f\in\mathcal{F}_{\mu,L} and consider the gradient method with stepsize \alpha_{k} given by (4). Then,

\|x_{k+1}-x_{*}\|^{2}\leq\left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|^{2}. (32)
Proof

By (14) and g_{*}=0, we have

f_{k}-f_{*}+g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k}^{\sf T}(x_{k}-x_{*})+\frac{1}{L}\|g_{k}\|^{2}\right)\leq 0 (33)

and

f_{*}-f_{k}+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}+\frac{2\mu}{L}g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{L}\|g_{k}\|^{2}\right)\leq 0. (34)

It follows from the definition of \alpha_{k} in (4) that

2\gamma(f_{k}-f_{*})-\alpha_{k}\|g_{k}\|^{2}=0. (35)

Let \zeta_{1},\zeta_{2} be two nonnegative scalars with \zeta_{1}+\zeta_{2}>0, and let \zeta_{3}\in\mathbb{R}. The following weighted sum is valid:

0 \geq \zeta_{1}\times(33)+\zeta_{2}\times(34)+\zeta_{3}\times(35)
= \zeta_{1}\left[f_{k}-f_{*}+g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}-\frac{2\mu}{L}g_{k}^{\sf T}(x_{k}-x_{*})+\frac{1}{L}\|g_{k}\|^{2}\right)\right]
\quad+\zeta_{2}\left[f_{*}-f_{k}+\frac{1}{2(1-\mu/L)}\left(\mu\|x_{k}-x_{*}\|^{2}+\frac{2\mu}{L}g_{k}^{\sf T}(x_{*}-x_{k})+\frac{1}{L}\|g_{k}\|^{2}\right)\right]+\zeta_{3}\left[2\gamma(f_{k}-f_{*})-\alpha_{k}\|g_{k}\|^{2}\right]
= (\zeta_{1}-\zeta_{2}+2\gamma\zeta_{3})(f_{k}-f_{*})+\mu L\beta\|x_{k}-x_{*}\|^{2}-\left(2\mu\beta+\zeta_{1}\right)g_{k}^{\sf T}(x_{k}-x_{*})+\left(\beta-\zeta_{3}\alpha_{k}\right)\|g_{k}\|^{2},

where \beta=\frac{\zeta_{1}+\zeta_{2}}{2(L-\mu)}. Substituting

\|x_{k+1}-x_{*}\|^{2}-\|x_{k}-x_{*}\|^{2}-\alpha_{k}^{2}\|g_{k}\|^{2}=-2\alpha_{k}g_{k}^{\sf T}(x_{k}-x_{*})

into the above inequality, we obtain

\sigma(f_{k}-f_{*})+\frac{2\mu\beta+\zeta_{1}}{2\alpha_{k}}\|x_{k+1}-x_{*}\|^{2}+\left(\beta-\zeta_{3}\alpha_{k}-\frac{\alpha_{k}(2\mu\beta+\zeta_{1})}{2}\right)\|g_{k}\|^{2}\leq\left(\frac{2\mu\beta+\zeta_{1}}{2\alpha_{k}}-\mu L\beta\right)\|x_{k}-x_{*}\|^{2}, (36)

where \sigma=\zeta_{1}-\zeta_{2}+2\gamma\zeta_{3}.

To get the desired convergence result, we assume that

\beta-\zeta_{3}\alpha_{k}-\frac{\alpha_{k}(2\mu\beta+\zeta_{1})}{2}=0, (37)

which together with the definition of \beta gives

\zeta_{3}=\frac{(1-\mu\alpha_{k})(\zeta_{1}+\zeta_{2})}{2(L-\mu)\alpha_{k}}-\frac{\zeta_{1}}{2}

and hence

\sigma = \zeta_{1}-\zeta_{2}+\frac{\gamma(1-\mu\alpha_{k})}{(L-\mu)\alpha_{k}}(\zeta_{1}+\zeta_{2})-\gamma\zeta_{1}
= \frac{(1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})}{(L-\mu)\alpha_{k}}\zeta_{1}+\frac{\gamma(1-\mu\alpha_{k})-(L-\mu)\alpha_{k}}{(L-\mu)\alpha_{k}}\zeta_{2}. (38)

Since f\in\mathcal{F}_{\mu,L}, we have \frac{1}{2L}\|g_{k}\|^{2}\leq f_{k}-f_{*}\leq\frac{1}{2\mu}\|g_{k}\|^{2}, and hence

\frac{\gamma}{L}\leq\alpha_{k}\leq\frac{\gamma}{\mu}.

Thus, 1-\mu\alpha_{k}\geq 1-\gamma\geq 0. Now we consider two cases:

Case 1: \gamma(1-\mu\alpha_{k})-(L-\mu)\alpha_{k}<0. In this case, we require

\sigma=\zeta_{1}-\zeta_{2}+2\gamma\zeta_{3}=0, (39)

which together with (38) indicates

\zeta_{2}=\frac{(1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})}{(L-\mu)\alpha_{k}-\gamma(1-\mu\alpha_{k})}\zeta_{1}

and hence

\beta=\frac{(2-\gamma)\alpha_{k}}{2\left((L-\mu)\alpha_{k}-\gamma(1-\mu\alpha_{k})\right)}\zeta_{1}.

Let

\zeta_{1}=\frac{2\left((L-\mu)\alpha_{k}-\gamma(1-\mu\alpha_{k})\right)}{(2-\gamma)\alpha_{k}}. (40)

We obtain \beta=1 and

\zeta_{2}=\frac{2\left((1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})\right)}{(2-\gamma)\alpha_{k}},\quad\zeta_{3}=\frac{2-(L+\mu)\alpha_{k}}{(2-\gamma)\alpha_{k}}. (41)
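A short symbolic check (a sketch, independent of the proof) confirms that the choices (40) and (41) indeed give \beta=1 and satisfy (37) and (39).

    import sympy as sp

    L, mu, gamma, a = sp.symbols('L mu gamma alpha_k', positive=True)
    z1 = 2 * ((L - mu) * a - gamma * (1 - mu * a)) / ((2 - gamma) * a)        # (40)
    z2 = 2 * ((1 - gamma) * (L - mu) * a + gamma * (1 - mu * a)) / ((2 - gamma) * a)
    z3 = (2 - (L + mu) * a) / ((2 - gamma) * a)                               # (41)
    beta = (z1 + z2) / (2 * (L - mu))

    assert sp.simplify(beta - 1) == 0                                         # beta = 1
    assert sp.simplify(beta - z3 * a - a * (2 * mu * beta + z1) / 2) == 0     # (37)
    assert sp.simplify(z1 - z2 + 2 * gamma * z3) == 0                         # (39)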

Clearly, \zeta_{1},\zeta_{2} are nonnegative. By (36), (37), (39), (40) and (41), we get

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\frac{2\mu L\beta\alpha_{k}}{2\mu\beta+\zeta_{1}}\right)\|x_{k}-x_{*}\|^{2}
\leq \left(1-\frac{\mu L(2-\gamma)\alpha_{k}^{2}}{(L+\mu)\alpha_{k}-\gamma}\right)\|x_{k}-x_{*}\|^{2}. (42)

Notice that the function h(\alpha)=\frac{\alpha^{2}}{(L+\mu)\alpha-\gamma} achieves its minimum at \alpha=\frac{2\gamma}{L+\mu}. It follows from (42) that

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-(2-\gamma)\mu L\,h\left(\frac{2\gamma}{L+\mu}\right)\right)\|x_{k}-x_{*}\|^{2}
= \left(1-\frac{4\gamma(2-\gamma)\mu L}{(L+\mu)^{2}}\right)\|x_{k}-x_{*}\|^{2}.

Case 2: \gamma(1-\mu\alpha_{k})-(L-\mu)\alpha_{k}\geq 0. Recall that \zeta_{1},\zeta_{2} are nonnegative and \zeta_{1}+\zeta_{2}>0. In this case, we cannot expect (39) to hold. It follows from (36) and (37) that

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\frac{2\mu L\beta\alpha_{k}}{2\mu\beta+\zeta_{1}}\right)\|x_{k}-x_{*}\|^{2}-\sigma(f_{k}-f_{*}). (43)

To simplify our analysis, we set \zeta_{2}=0. By the strong convexity of f, the definition of \beta, (38) and (43), we have

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\mu\alpha_{k}-\frac{\sigma\mu}{2}\right)\|x_{k}-x_{*}\|^{2}
= \left(1-\mu\alpha_{k}-\mu\,\frac{(1-\gamma)(L-\mu)\alpha_{k}+\gamma(1-\mu\alpha_{k})}{2(L-\mu)\alpha_{k}}\zeta_{1}\right)\|x_{k}-x_{*}\|^{2}.

We further suppose that \zeta_{1} is a scalar independent of \alpha_{k}. The function

h(\alpha)=\alpha+\frac{(1-\gamma)(L-\mu)\alpha+\gamma(1-\mu\alpha)}{2(L-\mu)\alpha}\zeta_{1}

attains its minimum at \alpha=\sqrt{\frac{\gamma\zeta_{1}}{2(L-\mu)}}, which implies

\|x_{k+1}-x_{*}\|^{2} \leq \left(1-\mu h\left(\sqrt{\frac{\gamma\zeta_{1}}{2(L-\mu)}}\right)\right)\|x_{k}-x_{*}\|^{2}. (44)

Notice that the rate in (44) holds for any f\in\mathcal{F}_{\mu,L}. Hence, for the quadratic case, the rate in (44) should coincide with the one in (13), and the corresponding stepsize must be equal to \frac{2\gamma}{L+\mu}. This gives \zeta_{1}=\frac{8\gamma(L-\mu)}{(L+\mu)^{2}}. We get (32) by substituting this \zeta_{1} into (44).

We complete the proof by considering the above two cases.
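The closing step of Case 2 can also be verified symbolically: with \zeta_{1}=8\gamma(L-\mu)/(L+\mu)^{2}, the minimizer of h is 2\gamma/(L+\mu) and the resulting factor 1-\mu h(\cdot) equals the rate in (32). The sympy sketch below checks these two claims; it is a sanity check, not part of the proof.

    import sympy as sp

    L, mu, gamma = sp.symbols('L mu gamma', positive=True)
    z1 = 8 * gamma * (L - mu) / (L + mu) ** 2
    alpha_min = sp.sqrt(gamma * z1 / (2 * (L - mu)))
    assert sp.simplify(alpha_min - 2 * gamma / (L + mu)) == 0     # minimizer of h

    a = sp.Symbol('alpha', positive=True)
    h = a + ((1 - gamma) * (L - mu) * a + gamma * (1 - mu * a)) / (2 * (L - mu) * a) * z1
    rate = 1 - 4 * gamma * (2 - gamma) * mu * L / (L + mu) ** 2   # rate in (32)
    assert sp.simplify(1 - mu * h.subs(a, 2 * gamma / (L + mu)) - rate) == 0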

Remark 4

When \gamma=1, the inequality (32) reduces to

\|x_{k+1}-x_{*}\|^{2}\leq\left(\frac{L-\mu}{L+\mu}\right)^{2}\|x_{k}-x_{*}\|^{2},

which recovers the one in Proposition 1 of barre2020complexity .

4 Concluding remarks

Based on the convergence results for the family of gradient methods (6) on strongly convex quadratics, we have established, in an analytic way, the worst-case complexity of the gradient method with exact line search and with the Polyak stepsize for general smooth strongly convex objective functions. Corollary 1 and Theorem 3.2 show that, from the worst-case complexity point of view, the gradient method using the Polyak stepsize (8) achieves the fastest convergence rate when \gamma=1, which is also true for the generalized SD stepsize (7). However, for the case \gamma=1, we can show that the family of gradient methods (6) will zigzag in a two-dimensional subspace spanned by the two eigenvectors corresponding to the largest and smallest eigenvalues of the Hessian. Denote the components of g_{k} along the eigenvectors \xi_{i} by \nu_{k}^{(i)}, i=1,\ldots,n, i.e., g_{k}=\sum_{i=1}^{n}\nu_{k}^{(i)}\xi_{i}. The next theorem gives the asymptotic convergence behavior of each gradient method in the family (6); see Theorem 1 of huang2022asymptotic for details.

Theorem 4.1

Assume that the starting point x_{0} satisfies

g_{0}^{\sf T}\xi_{1}\neq 0\quad\textrm{and}\quad g_{0}^{\sf T}\xi_{n}\neq 0.

Let \{x_{k}\} be the iterates generated by applying a method in the family (6) with \gamma=1 to solve the unconstrained quadratic optimization (5). Then

\lim_{k\rightarrow\infty}\frac{(\nu_{2k}^{(i)})^{2}}{\sum_{j=1}^{n}(\nu_{2k}^{(j)})^{2}}=\left\{\begin{array}{ll}\frac{1}{1+c^{2}},&\text{if }i=1,\\ 0,&\text{if }i=2,\ldots,n-1,\\ \frac{c^{2}}{1+c^{2}},&\text{if }i=n,\end{array}\right.

and

\lim_{k\rightarrow\infty}\frac{(\nu_{2k+1}^{(i)})^{2}}{\sum_{j=1}^{n}(\nu_{2k+1}^{(j)})^{2}}=\left\{\begin{array}{ll}\frac{c^{2}\psi^{2}(L)}{\psi^{2}(\mu)+c^{2}\psi^{2}(L)},&\text{if }i=1,\\ 0,&\text{if }i=2,\ldots,n-1,\\ \frac{\psi^{2}(\mu)}{\psi^{2}(\mu)+c^{2}\psi^{2}(L)},&\text{if }i=n,\end{array}\right.

where c is a nonzero constant that can be determined by

c=\lim_{k\rightarrow\infty}\frac{\nu_{2k}^{(n)}}{\nu_{2k}^{(1)}}=-\frac{\psi(\mu)}{\psi(L)}\lim_{k\rightarrow\infty}\frac{\nu_{2k+1}^{(1)}}{\nu_{2k+1}^{(n)}}.
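A numerical illustration of Theorem 4.1 (a sketch with an arbitrary three-dimensional example, not from the paper): for the Polyak stepsize, i.e. \psi(z)=z^{-1}, the relations above imply that the product of consecutive ratios \nu_{k}^{(n)}/\nu_{k}^{(1)} converges to -\psi(\mu)/\psi(L)=-L/\mu, while the middle gradient component becomes negligible.

    import numpy as np

    mu, L = 1.0, 10.0
    A = np.diag([mu, 3.0, L])
    psi_of_A = np.linalg.inv(A)                  # psi(z) = 1/z, the Polyak stepsize
    x = np.array([1.0, 1.0, 1.0])                # g_0 has nonzero components along xi_1 and xi_n
    ratios = []
    for _ in range(200):
        g = A @ x
        ratios.append(g[2] / g[0])               # nu_k^{(n)} / nu_k^{(1)}
        alpha = (g @ psi_of_A @ g) / (g @ psi_of_A @ A @ g)
        x_next = x - alpha * g
        x = x_next / np.linalg.norm(x_next)      # renormalize; the stepsize is scale invariant

    g = A @ x
    print(ratios[-2] * ratios[-1], -L / mu)      # product of consecutive ratios tends to -L/mu
    print((g / np.linalg.norm(g)) ** 2)          # squared components: mass only on xi_1 and xi_n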
Figure 1: The gradient method with exact line search (SD) vs. that with the Polyak stepsize (Polyak).

Due to the zigzag behavior described in Theorem 4.1, the gradient method using the Polyak stepsize (8) with \gamma=1 may perform as poorly as the one with exact line search. To see this, we apply the family of gradient methods (6) with \gamma=1 and \psi(A)=I (exact line search) or \psi(A)=A^{-1} (the Polyak stepsize) to minimize the quadratic problem (5) with

A=\mathrm{diag}\{1,100\}\quad\mbox{and}\quad b=0.

We use x_{0}=(30,1)^{\sf T} as the starting point and stop the iteration once \|g_{k}\|\leq 10^{-8}\|g_{0}\|. Figure 1 shows that the gradient method with the Polyak stepsize may be slower than the SD method even on a strongly convex quadratic function. Techniques to break the zigzag pattern of the family of gradient methods (6) with \gamma=1 have been developed in huang2022asymptotic and show promising performance for the gradient method with exact line search. It would be interesting to investigate whether those techniques can lead to efficient gradient methods based on the Polyak stepsize.
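The experiment behind Figure 1 is easy to reproduce. The following sketch runs the two methods under the stated setting and reports the number of iterations needed to reach the stopping criterion; exact counts may differ slightly from the figure depending on floating-point details.

    import numpy as np

    A = np.diag([1.0, 100.0])
    b = np.zeros(2)
    x0 = np.array([30.0, 1.0])

    def run(psi_of_A, max_iter=10000):
        # Family (6) with gamma = 1: psi(A) = I is exact line search, psi(A) = A^{-1} is Polyak.
        x = x0.copy()
        g0_norm = np.linalg.norm(A @ x - b)
        for k in range(max_iter):
            g = A @ x - b
            if np.linalg.norm(g) <= 1e-8 * g0_norm:
                return k
            alpha = (g @ psi_of_A @ g) / (g @ psi_of_A @ A @ g)
            x = x - alpha * g
        return max_iter

    print('SD iterations:', run(np.eye(2)))
    print('Polyak iterations:', run(np.linalg.inv(A)))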

Acknowledgements.
The work of the first author was supported by Natural Science Foundation of Hebei Province (A2021202010). The work of the second author was supported by Hong Kong RGC General Research Fund (15309223) and PolyU AMA Project (230413007).

References

  • (1) M. Barré, A. Taylor, and A. d’Aspremont, Complexity guarantees for Polyak steps with momentum, in Conference on Learning Theory, PMLR, 2020, pp. 452–478.
  • (2) J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), pp. 141–148.
  • (3) C. Cartis, N. Gould, and P. Toint, On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems, SIAM J. Optim., 20 (2010), pp. 2833–2852.
  • (4) A. Cauchy, Méthode générale pour la résolution des systèmes d'équations simultanées, Comp. Rend. Sci. Paris, 25 (1847), pp. 536–538.
  • (5) Y.-H. Dai and Y.-X. Yuan, Alternate minimization gradient method, IMA J. Numer. Anal., 23 (2003), pp. 377–393.
  • (6) S. Das Gupta, B. P. Van Parys, and E. K. Ryu, Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods, Math. Program., 204 (2024), pp. 567–639.
  • (7) E. De Klerk, F. Glineur, and A. B. Taylor, On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions, Optim. Lett., 11 (2017), pp. 1185–1199.
  • (8) E. De Klerk, F. Glineur, and A. B. Taylor, Worst-case convergence analysis of inexact gradient and Newton methods through semidefinite programming performance estimation, SIAM J. Optim., 30 (2020), pp. 2053–2082.
  • (9) Y. Drori and M. Teboulle, Performance of first-order methods for smooth convex minimization: a novel approach, Math. Program., 145 (2014), pp. 451–482.
  • (10) B. Goujaud, A. Dieuleveut, and A. Taylor, On fundamental proof structures in first-order optimization, in 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 3023–3030.
  • (11) G. Gu and J. Yang, Tight sublinear convergence rate of the proximal point algorithm for maximal monotone inclusion problems, SIAM J. Optim., 30 (2020), pp. 1905–1921.
  • (12) E. Hazan and S. Kakade, Revisiting the Polyak step size, arXiv preprint, arXiv:1905.00313, (2019).
  • (13) Y. Huang, Y.-H. Dai, X.-W. Liu, and H. Zhang, On the asymptotic convergence and acceleration of gradient methods, J. Sci. Comput., 90 (2022), pp. 1–29.
  • (14) X. Jiang and S. U. Stich, Adaptive SGD with Polyak stepsize and line-search: Robust convergence and variance reduction, in Advances in Neural Information Processing Systems, 2023, pp. 26396–26424.
  • (15) L. Lessard, B. Recht, and A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints, SIAM J. Optim., 26 (2016), pp. 57–95.
  • (16) N. Loizou, S. Vaswani, I. H. Laradji, and S. Lacoste-Julien, Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence, in International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1306–1314.
  • (17) B. T. Polyak, Minimization of unsmooth functionals, Comput. Math. Math. Phys., 9 (1969), pp. 14–29.
  • (18) B. T. Polyak, Introduction to Optimization, Optimization Software, Inc., New York, 1987.
  • (19) A. B. Taylor, J. M. Hendrickx, and F. Glineur, Exact worst-case performance of first-order methods for composite convex optimization, SIAM J. Optim., 27 (2017), pp. 1283–1313.
  • (20) A. B. Taylor, J. M. Hendrickx, and F. Glineur, Smooth strongly convex interpolation and exact worst-case performance of first-order methods, Math. Program., 161 (2017), pp. 307–345.
  • (21) M. Teboulle and Y. Vaisbourd, An elementary approach to tight worst case complexity analysis of gradient based methods, Math. Program., 201 (2023), pp. 63–96.
  • (22) X. Wang, M. Johansson, and T. Zhang, Generalized Polyak step size for first order optimization with momentum, in International Conference on Machine Learning, PMLR, 2023, pp. 35836–35863.
  • (23) Y.-X. Yuan, A new stepsize for the steepest descent method, J. Comput. Math., 24 (2006), pp. 149–156.
  • (24) Y.-X. Yuan, A short note on the Q-linear convergence of the steepest descent method, Math. Program., 123 (2010), pp. 339–343.