
The Average Rate of Convergence of the Exact Line Search Gradient Descent Method

Thomas Yu, Department of Mathematics, Drexel University. Email: yut@drexel.edu. He is supported in part by the National Science Foundation grants DMS 1522337 and DMS 1913038.
(April 28, 2023
Revised March 5, 2024)

Abstract:

It is very well known that when the exact line search gradient descent method is applied to a convex quadratic objective, the worst-case rate of convergence (ROC), among all seed vectors, deteriorates as the condition number of the Hessian of the objective grows. By an elegant analysis due to H. Akaike, it is generally believed – but not proved – that in the ill-conditioned regime the ROC for almost all initial vectors, and hence also the average ROC, is close to the worst case ROC. We complete Akaike’s analysis using the theorem of center and stable manifolds. Our analysis also makes apparent the effect of an intermediate eigenvalue in the Hessian by establishing the following somewhat amusing result: In the absence of an intermediate eigenvalue, the average ROC gets arbitrarily fast – not slow – as the Hessian gets increasingly ill-conditioned.

We discuss in passing some contemporary applications of exact line search GD to polynomial optimization problems arising from imaging and data sciences.

Keywords: Gradient descent, exact line search, worst case versus average case rate of convergence, center and stable manifolds theorem, polynomial optimization problem

1 Introduction

Exact line search is usually deemed impractical in the optimization literature, but when the objective function has a specific global structure then its use can be beneficial. A notable case is when the objective is a polynomial. Polynomial optimization problems (POPs) abound in diverse applications; see, for example, [15, 13, 6, 4, 8, 14] and Section 1.2.

For the gradient descent (GD) method, a popular choice for the step size is $s=1/L$, where $L$ is a Lipschitz constant of the gradient, i.e. $\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|$. The rate of convergence of GD would then partly depend on how well one can estimate $L$. Exact line search refers to the ‘optimum’ choice of step size, namely $s=\mathop{\rm argmin}_{t}f(x+td)$, where $d$ is the search direction, hence the nomenclature optimum gradient descent in the case of $d=-\nabla f(x)$. (We use the terms ‘optimum GD’ and ‘GD with exact line search’ interchangeably in this article.) When $f$ is, say, a degree 4 polynomial, it amounts to determining the univariate quartic polynomial $p(t)=f(x-td)$ followed by finding its minimizer, which is a very manageable computational task.
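To make the last point concrete, here is a minimal sketch in Python/NumPy of the univariate subproblem (the helper name exact_step_quartic, the root tolerance, and the sample polynomial are our own illustrative choices): the minimizer of a quartic over $t\geq 0$ lies among $t=0$ and the positive real roots of the cubic $p'(t)=0$.

```python
import numpy as np

def exact_step_quartic(coeffs):
    """Minimize p(t) = c4 t^4 + c3 t^3 + c2 t^2 + c1 t + c0 over t >= 0.

    coeffs = [c4, c3, c2, c1, c0] in numpy.polyval ordering; the minimizer is
    among t = 0 and the positive real roots of the cubic p'(t) = 0.
    """
    roots = np.roots(np.polyder(coeffs))
    cands = [0.0] + [r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 0]
    return min(cands, key=lambda t: np.polyval(coeffs, t))

# e.g. p(t) = t^4 - 2 t^2 + 0.5 t attains its minimum over t >= 0 near t = 0.93
print(exact_step_quartic([1.0, 0.0, -2.0, 0.5, 0.0]))
```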

Let us recall the key benefit of exact line search. We focus on the case when the objective is a strictly convex quadratic, which locally approximates any smooth objective function in the vicinity of a local minimizer with a positive definite Hessian. By the invariance of GD (constant step size or exact line search) under rigid transformations, there is no loss of generality, as far as the study of ROC is concerned, to consider only quadratics of the form

\[
f(x)=\frac{1}{2}x^{T}Ax,\;\mbox{ with }A={\rm diag}(\lambda),\;\;\lambda=(\lambda_{1},\ldots,\lambda_{n}),\;\;\lambda_{1}\geq\cdots\geq\lambda_{n}>0. \tag{1.1}
\]

Its gradient is Lipschitz continuous with Lipschitz constant $L=\lambda_{1}$. Also, it is strongly convex with strong convexity parameter $\sigma=\lambda_{n}$. In the case of constant step size GD, we have $x^{(k+1)}=(I-sA)x^{(k)}$, so the rate of convergence is given by the spectral radius of $I-sA$, which equals $\max_{1\leq i\leq n}|1-s\lambda_{i}|$. From this, it is easy to check that the step size

\[
s=2/(\lambda_{1}+\lambda_{n}) \tag{1.2}
\]

minimizes the spectral radius of $I-sA$, and hence offers the optimal ROC

\[
\|x^{(k)}\|=O(\rho^{k})\;\mbox{ with }\;\rho=\frac{\lambda_{1}-\lambda_{n}}{\lambda_{1}+\lambda_{n}}. \tag{1.3}
\]

Gradient descent with exact line search involves non-constant step sizes: $x^{(k+1)}=(I-s_{k}A)x^{(k)}$, with $s_{k}=(x^{(k)})^{T}A^{2}x^{(k)}/(x^{(k)})^{T}A^{3}x^{(k)}$. For convenience, denote the iteration operator by OGD, i.e.

\[
x^{(k+1)}=\textsf{OGD}(x^{(k)}),\quad\textsf{OGD}(x):=\textsf{OGD}(x;\lambda):=x-\frac{x^{T}A^{2}x}{x^{T}A^{3}x}Ax. \tag{1.4}
\]

We set $\textsf{OGD}(0)=0$ so that OGD is a well-defined self-map on $\mathbb{R}^{n}$. By norm equivalence, one is free to choose any norm in the study of ROC; and the first trick is to notice that, by choosing the $A$-norm, defined by $\|x\|_{A}:=\sqrt{x^{T}Ax}$, we have the following convenient relation:

\[
\|\textsf{OGD}(x)\|_{A}^{2}=\left[1-\frac{(x^{T}A^{2}x)^{2}}{(x^{T}Ax)(x^{T}A^{3}x)}\right]\|x\|_{A}^{2}. \tag{1.5}
\]
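For concreteness, the following sketch (the helper names and the test spectrum are our own choices) implements one OGD step (1.4) and checks the identity (1.5) numerically.

```python
import numpy as np

def ogd_step(x, lam):
    """One exact-line-search GD step (1.4) for f(x) = 0.5 * x^T diag(lam) x."""
    s = (x @ (lam**2 * x)) / (x @ (lam**3 * x))    # exact step size
    return x - s * (lam * x)

def A_norm_sq(x, lam):
    return x @ (lam * x)

# numerical check of (1.5) on a random vector
rng = np.random.default_rng(0)
lam = np.array([10.0, 3.0, 1.0])
x = rng.standard_normal(3)
lhs = A_norm_sq(ogd_step(x, lam), lam)
rhs = (1 - (x @ (lam**2 * x))**2 / ((x @ (lam * x)) * (x @ (lam**3 * x)))) * A_norm_sq(x, lam)
assert np.isclose(lhs, rhs)
```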

Write $d=Ax$ ($=\nabla f(x)$). By the Kantorovich inequality (for a proof of the Kantorovich inequality $\frac{(d^{T}d)^{2}}{(d^{T}A^{-1}d)(d^{T}Ad)}\geq\frac{4\lambda_{1}\lambda_{n}}{(\lambda_{1}+\lambda_{n})^{2}}=\frac{4\,{\rm cond}(A)}{(1+{\rm cond}(A))^{2}}$, see, for example, [12, 7]; a good way to appreciate this result is to compare it with the following bound obtained from the Rayleigh quotient: $\frac{(d^{T}d)^{2}}{(d^{T}A^{-1}d)(d^{T}Ad)}\geq\frac{\lambda_{n}}{\lambda_{1}}={\rm cond}(A)^{-1}$; unless $\lambda_{1}=\lambda_{n}$, Kantorovich's bound is always sharper, and it is the sharpest possible in the sense that there exists a vector $d$ for which equality holds),

\[
\frac{(x^{T}A^{2}x)^{2}}{(x^{T}Ax)(x^{T}A^{3}x)}=\frac{(d^{T}d)^{2}}{(d^{T}A^{-1}d)(d^{T}Ad)}\geq\frac{4\lambda_{1}\lambda_{n}}{(\lambda_{1}+\lambda_{n})^{2}}, \tag{1.6}
\]

which yields the well-known error bound for the optimum GD method:

\[
\|x^{(k)}\|_{A}\leq\Big(\frac{\lambda_{1}-\lambda_{n}}{\lambda_{1}+\lambda_{n}}\Big)^{k}\|x^{(0)}\|_{A}. \tag{1.7}
\]

So optimum GD satisfies the same upper bound on its ROC as in (1.3). The constant step size GD method with the optimal choice of step size (1.2), however, should not be confused with the optimum GD method; they have the following fundamental differences:

  • The optimal step size (1.2) requires the knowledge of the two extremal eigenvalues, whose determination is no easier than the original minimization problem. In contrast, the optimum GD method is blind to the values of $\lambda_{1}$ and $\lambda_{n}$.

  • Due to the linearity of the iteration process, GD with the optimal constant step size (1.2) achieves the ROC $\|x^{(k)}\|\asymp C\rho^{k}$, with $\rho$ as in (1.3), for any initial vector with a non-zero component in the dominant eigenvector of $A$, and hence for almost all initial vectors $x^{(0)}$. So for this method the worst case ROC is the same as the average ROC. (As the ROC is invariant under scaling of $x^{(0)}$, the average and worst case ROC can be defined by taking, respectively, the average and the maximum of the ROC over all seed vectors of unit length.) In contrast, OGD is nonlinear and the worst case ROC (1.3) is attained only for specific initial vectors $x^{(0)}$; see Proposition 3.1. It is much less obvious how the average ROC, defined by (1.10) below, compares to the worst case ROC.

Due to (1.5), we define the (one-step) shrinking factor by

\[
\rho(x,\lambda)=\sqrt{1-\frac{(x^{T}A^{2}x)^{2}}{(x^{T}Ax)(x^{T}A^{3}x)}}=\sqrt{1-\frac{(\sum_{i}\lambda_{i}^{2}x_{i}^{2})^{2}}{(\sum_{i}\lambda_{i}x_{i}^{2})(\sum_{i}\lambda_{i}^{3}x_{i}^{2})}}. \tag{1.8}
\]

Then, for any initial vector $x^{(0)}\neq 0$, the rate of convergence of the optimum gradient descent method applied to the minimization of (1.1) is given by

\[
\rho^{\ast}(x^{(0)},\lambda):=\limsup_{k\rightarrow\infty}\Bigg[\prod_{j=0}^{k-1}\rho(\textsf{OGD}^{j}(x^{(0)}),\lambda)\Bigg]^{1/k}. \tag{1.9}
\]
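In practice, (1.9) can be estimated by running OGD and taking the geometric mean of the one-step shrinking factors; a hedged sketch follows (the iteration count and the per-step normalization, which is harmless by the scale invariance discussed in Section 2, are our own choices).

```python
import numpy as np

def rho(x, lam):
    """One-step shrinking factor (1.8)."""
    m1, m2, m3 = x @ (lam * x), x @ (lam**2 * x), x @ (lam**3 * x)
    return np.sqrt(1.0 - m2**2 / (m1 * m3))

def rho_star(x0, lam, k=200):
    """Geometric-mean estimate of the ROC (1.9) over k OGD steps."""
    x, log_prod = np.asarray(x0, dtype=float), 0.0
    for _ in range(k):
        log_prod += np.log(rho(x, lam))
        x = x - (x @ (lam**2 * x)) / (x @ (lam**3 * x)) * (lam * x)
        x /= np.linalg.norm(x)       # rescaling does not change the ROC
    return np.exp(log_prod / k)

# e.g. rho_star(np.array([1.0, 1.0, 1.0]), np.array([10.0, 5.0, 1.0]))
```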

As $\rho^{\ast}(x^{(0)},\lambda)$ depends only on the direction of $x^{(0)}$, and is insensitive to sign changes in the components of $x^{(0)}$ (see (2.1)), the average ROC can be defined based on averaging over all $x^{(0)}$ on the unit sphere, or just over the positive octant of the unit sphere. Formally,

Definition 1.1

The average ROC of the optimum GD method applied to (1.1) is

\[
\mbox{Average ROC}:=\int_{\mathbb{S}^{n-1}}\rho^{\ast}(x,\lambda)\,d\mu(x)=2^{n}\int_{\mathbb{S}^{n-1}_{+}}\rho^{\ast}(x,\lambda)\,d\mu(x), \tag{1.10}
\]

where $\mu$ is the uniform probability measure on $\mathbb{S}^{n-1}$, and $\mathbb{S}^{n-1}_{+}:=\{x\in\mathbb{S}^{n-1}:x\geq 0\}$.

We have

\[
\mbox{Average ROC}\leq\mbox{Worst case ROC}\stackrel{(\ast)}{=}\frac{1-a}{1+a},\;\;\;\mbox{ where }a=\frac{\lambda_{n}}{\lambda_{1}}={\rm cond}(A)^{-1}. \tag{1.11}
\]

Note that (1.7) only shows that the worst case ROC is upper bounded by $(1-a)/(1+a)$; for a proof of the equality $(\ast)$ in (1.11), see Proposition 3.1.

1.1 Main results

In this paper, we establish the following result:

Theorem 1.2

(i) If $A$ has only two distinct eigenvalues, then the average ROC approaches 0 when ${\rm cond}(A)\rightarrow\infty$. (ii) If $A$ has an intermediate eigenvalue $\lambda_{i}$ uniformly bounded away from the two extremal eigenvalues, then the average ROC approaches the worst case ROC in (1.11), which approaches 1, when ${\rm cond}(A)\rightarrow\infty$.

The second part of Theorem 1.2 is an immediate corollary of the following result:

Theorem 1.3

If $A$ has an intermediate eigenvalue, i.e. $n>2$ and there exists $i\in\{2,\ldots,n-1\}$ such that $\lambda_{1}>\lambda_{i}>\lambda_{n}$, then

\[
\operatorname*{ess\,inf}_{x^{(0)}\in\mathbb{S}^{n-1}}\rho^{\ast}(x^{(0)},\lambda)=\frac{1-a}{\sqrt{(1+a)^{2}+Ba}}, \tag{1.12}
\]

where $a={\rm cond}(A)^{-1}$, $B=\frac{4(1+\delta^{2})}{1-\delta^{2}}$, and $\delta=\min_{i:\lambda_{1}>\lambda_{i}>\lambda_{n}}\frac{\lambda_{i}-(\lambda_{1}+\lambda_{n})/2}{(\lambda_{1}-\lambda_{n})/2}$.

Remark 1.4

It is shown in [5, §2] that $\rho^{\ast}(x^{(0)},\lambda)$ is lower-bounded by the right-hand side of (1.12) under a difficult-to-verify condition on $x^{(0)}$. The undesirable condition seems to be an artifact of the subtle argument in [5, §2], which also makes it hard to see whether the bound (1.12) is tight. Our proof of Theorem 1.3 uses Akaike's results in [5, §1], but replaces his arguments in [5, §2] by a more natural dynamical systems approach. The proof shows that the bound is tight and holds for a set of $x^{(0)}$ of full measure, which also allows us to conclude the second part of Theorem 1.2. It uses the center and stable manifolds theorem, a result that was not available at the time [5] was written.

Remark 1.5

For constant step size GD, ill-conditioning alone is enough to cause slow convergence for almost all initial vectors. For exact line search, however, it is ill-conditioning in cahoots with an intermediate eigenvalue that causes the slowdown. This is already apparent from Akaike's analysis; the first part of Theorem 1.2 intends to bring this point home by showing that the exact opposite happens in the absence of an intermediate eigenvalue.

Organization of the proofs. In Section 2, we recall the results of Akaike, along with a few observations not directly found in [5]. These results and observations are necessary for the final proofs of Theorem 1.2(i) and Theorem 1.3, to be presented in Section 3 and Section 4, respectively. The key idea of Akaike is an interesting discrete dynamical system, with a probabilistic interpretation, underlying the optimum GD method. We give an exposition of this dynamical system in Section 2. A key property of this dynamical system is recorded in Theorem 2.2. In a nutshell, the theorem tells us that part of the behavior of the dynamical system in the high-dimensional case is captured by the 2-dimensional case; so – while the final results say that there is a drastic difference between the $n=2$ and $n>2$ cases – the analysis of the latter case (in Section 4) uses some of the results for the former (in Section 3). Appendix A records two auxiliary technical lemmas. Appendix B recalls a version of the theorem of center and stable manifolds stated in the monograph [17], and discusses a refinement of the result needed to prove the full version of Theorem 1.3.

Section 3 and Section 4 also present computations, in Figures 2 and 3, that serve to illustrate some key ideas behind the proofs.

Before proceeding to the proofs, we consider some contemporary applications of exact line search methods to POPs.

1.2 Applications of exact line search methods to POPs

In its abstract form, the phase retrieval problem seeks to recover a signal $x\in\mathbb{K}^{n}$ ($\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$) from its noisy ‘phaseless measurements’ $y_{i}\approx|\langle x,a_{i}\rangle|^{2}$, with enough random ‘sensors’ $a_{i}\in\mathbb{K}^{n}$. A plausible approach is to choose $x$ that solves

\[
\min_{x\in\mathbb{K}^{n}}\sum_{j=1}^{m}\Big[y_{j}-|\langle x,a_{j}\rangle|^{2}\Big]^{2}. \tag{1.13}
\]

The two squares make it a degree 4 POP.

We also consider another data science problem: matrix completion. In this problem, we want to exploit the a priori low rank property of a data matrix $M$ in order to estimate it from just a small fraction of its entries $M_{i,j}$, $(i,j)\in\Omega$. If we know a priori that $M\in\mathbb{R}^{m\times n}$ has rank $r\ll\min(m,n)$, then, similar to (1.13), we may hope to recover $M$ by solving

\[
\min_{X\in\mathbb{R}^{m\times r},\,Y\in\mathbb{R}^{n\times r}}\sum_{(i,j)\in\Omega}\Big[(XY^{T})_{i,j}-M_{i,j}\Big]^{2}. \tag{1.14}
\]

It is again a degree 4 POP.

Extensive theories have been developed for addressing the following questions: (i) Under what conditions – in particular, how big must the sample size $m$ be for phase retrieval, and $|\Omega|$ for matrix completion – does the global minimizer of (1.13) or (1.14) recover the underlying object of interest? (ii) What optimization algorithms are able to compute the global minimizer?

It is shown in [13] that constant step size GD, with an appropriate choice of initial vector and step size, applied to the optimization problems above provably succeeds in recovering the object of interest under suitable statistical models. For the phase retrieval problem, assume for simplicity that $x^{\ast}\in\mathbb{R}^{n}$ and $y_{j}=(a_{j}^{T}x^{\ast})^{2}$, and write

\[
f(x):=\frac{1}{4m}\sum_{j=1}^{m}\Big[y_{j}-(a_{j}^{T}x)^{2}\Big]^{2}. \tag{1.15}
\]

Then

\[
\nabla f(x)=-\frac{1}{m}\sum_{j=1}^{m}\Big[y_{j}-(a_{j}^{T}x)^{2}\Big](a_{j}^{T}x)a_{j}\;\;\mbox{ and }\;\;\nabla^{2}f(x)=\frac{1}{m}\sum_{j=1}^{m}\Big[3(a_{j}^{T}x)^{2}-y_{j}\Big]a_{j}a_{j}^{T}. \tag{1.16}
\]

Under the Gaussian design of sensors $a_{j}\stackrel{\rm i.i.d.}{\sim}N(0,I_{n})$, $1\leq j\leq m$, considered in, e.g., [10, 9, 13], we have

\[
\mathbb{E}\big[\nabla^{2}f(x)\big]=3\big[\|x\|_{2}^{2}I_{n}+2xx^{T}\big]-\big[\|x^{\ast}\|_{2}^{2}I_{n}+2x^{\ast}(x^{\ast})^{T}\big].
\]

At the global minimizer $x=x^{\ast}$, $\mathbb{E}\big[\nabla^{2}f(x^{\ast})\big]=2\big[\|x^{\ast}\|_{2}^{2}I_{n}+2x^{\ast}(x^{\ast})^{T}\big]$, so

\[
{\rm cond}\big(\mathbb{E}\big[\nabla^{2}f(x^{\ast})\big]\big)=3.
\]

This suggests that when the sample size $m$ is large enough, we may expect that the Hessian of the objective is well-conditioned for $x\approx x^{\ast}$. Indeed, when $m\asymp n\log n$, the discussion in [13, Section 2.3] implies that ${\rm cond}(\nabla^{2}f(x^{\ast}))$ grows slowly with $n$:

\[
{\rm cond}(\nabla^{2}f(x^{\ast}))=O(\log n)
\]

with high probability. However, unlike (1.1), the objective (1.15) is a quartic instead of a quadratic polynomial, so the Hessian $\nabla^{2}f(x)$ is not constant in $x$. We have the following phenomena:

  • (i) On the one hand, in the directions given by $a_{j}$, the Hessian $\nabla^{2}f(x^{\ast}+\delta a_{j}/\|a_{j}\|)$ has a condition number that grows (up to logarithmic factors) as $O(n)$, meaning that the objective can get increasingly ill-conditioned as the dimension $n$ grows, even within a small ball around $x^{\ast}$ of a fixed radius $\delta$.

  • (ii) On the other hand, most directions $v$ are not too close to parallel to any $a_{j}$, and ${\rm cond}\big(\nabla^{2}f(x^{\ast}+\delta v/\|v\|)\big)=O(\log n)$ with high probability.

  • (iii) Constant step-size GD, with a step size that can be chosen nearly constant in $n$, has the property of staying away from the ill-conditioned directions, hence no pessimistically small step sizes or explicit regularization steps avoiding the bad directions are needed. Such an ‘implicit regularization’ property of constant step size GD is the main theme of the article [13].

To illustrate (i) and (ii) numerically, we compute the condition numbers of the Hessians $\nabla^{2}f(x)$ at $x=x^{\ast}=\bar{x}/\|\bar{x}\|$, $x=x^{\ast}+.5a_{1}/\|a_{1}\|$ and $x=x^{\ast}+.5z/\|z\|$ for $a_{j},\bar{x},z\stackrel{\rm i.i.d.}{\sim}N(0,I_{n})$, with $n=1000k$, $k=1,\ldots,5$, and $m=n\log_{2}(n)$:

             ${\rm cond}(\nabla^{2}f(x^{\ast}))$   ${\rm cond}(\nabla^{2}f(x^{\ast}+.5a_{1}/\|a_{1}\|))$   ${\rm cond}(\nabla^{2}f(x^{\ast}+.5z/\|z\|))$
  $n=1000$   14.1043                               147.6565                                                17.2912, 16.0391, 16.7982
  $n=2000$   12.1743                               193.6791                                                15.3551, 14.9715, 14.5999
  $n=3000$   11.4561                               251.2571                                                14.3947, 13.8015, 14.0738
  $n=4000$   11.5022                               310.8092                                                13.9249, 13.6388, 13.5541
  $n=5000$   10.8728                               338.3008                                                13.2793, 12.9796, 13.3100

(The last column is based on three independent samples of $z\sim N(0,I_{n})$.) Evidently, the condition numbers do not increase with the dimension $n$ in the first and third columns, while a near linear growth is observed in the second column.
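The experiment behind the table can be reproduced along the following lines; the sketch below uses a much smaller $n$ than in the table (and our own names and random seed) so that it runs quickly, while exhibiting the same qualitative behaviour.

```python
import numpy as np

def phase_retrieval_hessian(x, A, y):
    """Hessian (1.16) of the objective (1.15); the sensors a_j are the columns of A."""
    Ax = A.T @ x
    w = 3.0 * Ax**2 - y
    return (A * w) @ A.T / A.shape[1]      # (1/m) sum_j [3(a_j^T x)^2 - y_j] a_j a_j^T

rng = np.random.default_rng(1)
n = 200                                    # smaller than in the table, for a quick run
m = int(n * np.log2(n))
A = rng.standard_normal((n, m))
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
y = (A.T @ x_star)**2
z = rng.standard_normal(n)

for label, x in [("x*",                x_star),
                 ("x* + .5 a1/||a1||", x_star + 0.5 * A[:, 0] / np.linalg.norm(A[:, 0])),
                 ("x* + .5 z/||z||",   x_star + 0.5 * z / np.linalg.norm(z))]:
    print(label, np.linalg.cond(phase_retrieval_hessian(x, A, y)))
```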

We now illustrate (iii); moreover, we present experimental results suggesting that exact line search GD performs favorably compared to constant step size GD. For the former, we first show how to efficiently compute the line search function $p(t):=f(x+td)$ by combining (1.15) and (1.16). Write $A=[a_{1},\cdots,a_{m}]\in\mathbb{R}^{n\times m}$. We can compute the gradient descent direction together with the line search polynomial $p(t)$ by computing the following sequence of vectors and scalars, with the computational complexity of each listed in parentheses:

\[
(1)\;A^{T}x\;\;(\approx 2mn),\quad(2)\;\alpha=-y+(A^{T}x)^{2}\;\;(O(m)),\quad(3)\;d=-\nabla f(x)=-\tfrac{1}{m}A(\alpha\cdot A^{T}x)\;\;(\approx 2mn),
\]
\[
(4)\;A^{T}d\;\;(\approx 2mn),\quad(5)\;\beta=2(A^{T}x)\cdot(A^{T}d)\;\;(O(m)),\quad(6)\;\gamma=(A^{T}d)^{2}\;\;(O(m)),
\]
\[
(7)\;\gamma^{T}\gamma,\;2\beta^{T}\gamma,\;\beta^{T}\beta+2\alpha^{T}\gamma,\;2\alpha^{T}\beta,\;\alpha^{T}\alpha\;\;(O(m)),
\]
\[
(8)\;s^{\ast}=\mathop{\rm argmin}_{t\geq 0}p(t),\quad p(t)=f(x+td)=(\gamma^{T}\gamma)t^{4}+(2\beta^{T}\gamma)t^{3}+(\beta^{T}\beta+2\alpha^{T}\gamma)t^{2}+(2\alpha^{T}\beta)t+\alpha^{T}\alpha\;\;(O(1)).
\]

In the above, $u\cdot v$, for two vectors $u$, $v$ of the same length, stands for componentwise multiplication, and $v^{2}:=v\cdot v$. As we can see, the dominant steps are (1), (3) and (4). As only steps (1)-(3) are necessary for constant step size GD, we conclude that, for the phase retrieval problem, exact line search GD is about 50% more expensive per iteration than constant step size GD.
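A sketch of one exact line search GD iteration for (1.15), implementing steps (1)-(8) above; the function name, the root tolerance, and the use of numpy.roots on $p'(t)$ are our own choices.

```python
import numpy as np

def ogd_phase_retrieval_step(x, A, y):
    """One exact-line-search GD step for (1.15); sensors a_j are the columns of A."""
    m = A.shape[1]
    Ax = A.T @ x                                   # (1)
    alpha = -y + Ax**2                             # (2)
    d = -(A @ (alpha * Ax)) / m                    # (3)  d = -grad f(x)
    Ad = A.T @ d                                   # (4)
    beta = 2.0 * Ax * Ad                           # (5)
    gamma = Ad**2                                  # (6)
    coeffs = [gamma @ gamma, 2.0 * beta @ gamma,   # (7)  coefficients of 4m * p(t),
              beta @ beta + 2.0 * alpha @ gamma,   #      which has the same minimizer
              2.0 * alpha @ beta, alpha @ alpha]
    roots = np.roots(np.polyder(coeffs))           # (8)  minimize the quartic over t >= 0
    cands = [0.0] + [r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 0]
    s = min(cands, key=lambda t: np.polyval(coeffs, t))
    return x + s * d
```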

Figure 1 shows the rates of convergence for gradient descent with constant step size $s=0.1$ (suggested in [13, Section 1.4]) and exact line search for $n=10,100,200,1000,5000,10000$, $m=10n$, with the initial guess chosen by a so-called spectral initialization. (Under the Gaussian design $a_{1},\ldots,a_{m}\sim_{\rm i.i.d.}N(0,I_{n})$, it is interesting to see that $E[(a_{i}^{T}x)^{2}a_{i}a_{i}^{T}]=I+2xx^{T}$. As the Rayleigh quotient $u^{T}(I+2xx^{T})u/u^{T}u=1+2(u^{T}x)^{2}/u^{T}u$ is maximized when $u$ is parallel to $x$, an educated initial guess for $x$ is a suitably scaled dominant eigenvector of $\frac{1}{m}\sum_{r=1}^{m}y_{r}a_{r}a_{r}^{T}$, which can be computed from the data $(y_{r})_{r=1}^{m}$ and sensors $(a_{r})_{r=1}^{m}$ using the power method. See, for example, [13, 9] for more details and references therein.) As the plots show, for each signal size $n$ the ROC of exact line search GD is more than twice as fast as that of constant step size GD.

Figure 1: ROC of constant step size GD vs optimum GD for the phase retrieval problem

Not only does the speedup in ROC from exact line search outweigh the 50% increase in per-iteration cost, the determination of the step size is also automatic and requires no tuning. For each $n$, the ROC of exact line search GD in Figure 1 is slightly faster than $O([(1-a)/(1+a)]^{k})$ for $a={\rm cond}(\nabla^{2}f(x^{\ast}))^{-1}$ – the ROC attained by optimum GD as if the degree 4 objective had the constant Hessian $\nabla^{2}f(x^{\ast})$ (Theorem 1.3) – suggesting also that the GD method implicitly avoids the ill-conditioned directions (recall the table above), akin to what is established in [13]. Unsurprisingly, our experiments also suggest that exact line search GD is more robust than its constant step size counterpart with respect to the choice of initial vector. Similar advantages of exact line search GD were observed for the matrix completion problem.

2 Properties of OGD and Akaike's $T$

Let $\mathbb{R}^{n}_{\ast}$ be $\mathbb{R}^{n}$ with the origin removed. For $x\in\mathbb{R}^{n}$, define $|x|\in\mathbb{R}^{n}$ by $|x|_{i}=|x_{i}|$. Notice from (1.8) that, for a fixed $\lambda$, $\rho(\cdot,\lambda)$ is invariant under both scaling and sign-changes of the components, i.e.

\[
\rho(\alpha\mathcal{E}x,\lambda)=\rho(x,\lambda),\quad\forall x\in\mathbb{R}^{n}_{\ast},\;\alpha\neq 0,\;\mathcal{E}={\rm diag}(\varepsilon_{1},\ldots,\varepsilon_{n}),\;\varepsilon_{i}\in\{1,-1\}. \tag{2.1}
\]

In other words, $\rho(x,\lambda)$ depends only on the equivalence class $[x]_{\sim}$, where $x\sim y$ if $|x|/\|x\|=|y|/\|y\|$.

By inspecting (1.4) one sees that

\[
\textsf{OGD}(\alpha\mathcal{E}x,\lambda)=\alpha\mathcal{E}\!\cdot\!\textsf{OGD}(x,\lambda),\quad\forall x\in\mathbb{R}^{n}_{\ast},\;\alpha\neq 0,\;\mathcal{E}={\rm diag}(\varepsilon_{1},\ldots,\varepsilon_{n}),\;\varepsilon_{i}\in\{1,-1\}. \tag{2.2}
\]

This means that $[\textsf{OGD}(x)]_{\sim}$, when well-defined, depends only on $[x]_{\sim}$. In other words, OGD descends to a map

\[
[\textsf{OGD}]:{\rm dom}([\textsf{OGD}])\subset\mathbb{R}^{n}_{\ast}/\!\!\sim\;\rightarrow\;\mathbb{R}^{n}_{\ast}/\!\!\sim. \tag{2.3}
\]

It can be shown that $[\textsf{OGD}]$ is well-defined on

\[
{\rm dom}([\textsf{OGD}])=\big\{[x]_{\sim}\in\mathbb{R}^{n}_{\ast}/\!\!\sim\;:\;x\mbox{ is not an eigenvector of }A\big\}; \tag{2.4}
\]

also $[\textsf{OGD}]({\rm dom}([\textsf{OGD}]))\subset{\rm dom}([\textsf{OGD}])$. (We omit the proof here because it also follows from one of Akaike's results; see (2.10) below.) Except when $n=2$ with $\lambda_{1}>\lambda_{2}$, $[\textsf{OGD}]$ does not extend continuously to the whole of $\mathbb{R}^{n}_{\ast}/\!\!\sim$; see below.

Akaike's map $T$. While Akaike, a statistician, did not use jargon such as ‘invariance’ or ‘parametrization’ in his paper, the map $T$ introduced in [5] is the representation of $[\textsf{OGD}]$ under the identification of $[x]_{\sim}$ with

\[
\sigma([x]_{\sim}):=\Big[\lambda_{1}^{2}x_{1}^{2},\ldots,\lambda_{n}^{2}x_{n}^{2}\Big]^{T}\Big/\sum_{j}\lambda_{j}^{2}x_{j}^{2}\;\in\;\triangle_{n}:=\Big\{p\in\mathbb{R}^{n}:\sum_{j}p_{j}=1,\;p_{j}\geq 0\Big\}. \tag{2.5}
\]

In the above, $\triangle_{n}$, or simply $\triangle$, is usually called the standard simplex, or the probability simplex as Akaike would prefer. One can verify that $\sigma:\mathbb{R}^{n}_{\ast}/\!\!\sim\;\rightarrow\triangle$ is a well-defined bijection and hence

\[
\sigma^{-1}:\triangle\rightarrow\mathbb{R}^{n}_{\ast}/\!\!\sim,\quad p\mapsto[x]_{\sim},\;\;x_{j}=\frac{\sqrt{p_{j}}}{\lambda_{j}}, \tag{2.6}
\]

may be viewed as a parametrization of the quotient space $\mathbb{R}^{n}_{\ast}/\!\!\sim$. (Strictly speaking, the map $\sigma^{-1}$ is not a parametrization. As a manifold, $\mathbb{R}^{n}_{\ast}/\!\!\sim$ is $(n-1)$-dimensional, which means it deserves a parametrization with $n-1$ parameters. But, of course, we can identify any $p\in\triangle$ with $[s_{1},\ldots,s_{n-1}]^{T}$ via $p=[s_{1},\ldots,s_{n-1},1-\sum_{i=1}^{n-1}s_{i}]^{T}$.)

We now derive a formula for $T:=\sigma\circ[\textsf{OGD}]\circ\sigma^{-1}$: by (1.4) and (2.6), $[\textsf{OGD}](\sigma^{-1}(p))$ has a representative $y\in\mathbb{R}^{n}_{\ast}$ with

\[
y_{i}=\frac{\sqrt{p_{i}}}{\lambda_{i}}-\frac{\sum_{j}\lambda_{j}^{2}\,p_{j}/\lambda_{j}^{2}}{\sum_{j}\lambda_{j}^{3}\,p_{j}/\lambda_{j}^{2}}\,\lambda_{i}\,\frac{\sqrt{p_{i}}}{\lambda_{i}}=\sqrt{p_{i}}\Big[\frac{1}{\lambda_{i}}-\frac{1}{\sum_{j}\lambda_{j}p_{j}}\Big]=\sqrt{p_{i}}\,\frac{\overline{\lambda}(p)-\lambda_{i}}{\lambda_{i}\overline{\lambda}(p)},
\]

where $\overline{\lambda}(p):=\sum_{j}\lambda_{j}p_{j}$. Consequently,

\[
T(p)_{i}=\lambda_{i}^{2}\Big(\sqrt{p_{i}}\,\frac{\overline{\lambda}(p)-\lambda_{i}}{\lambda_{i}\overline{\lambda}(p)}\Big)^{2}\Big/\sum_{j}\lambda_{j}^{2}\Big(\sqrt{p_{j}}\,\frac{\overline{\lambda}(p)-\lambda_{j}}{\lambda_{j}\overline{\lambda}(p)}\Big)^{2}=\frac{p_{i}(\overline{\lambda}(p)-\lambda_{i})^{2}}{\sum_{j}p_{j}(\overline{\lambda}(p)-\lambda_{j})^{2}}. \tag{2.7}
\]
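As a sanity check of this derivation, the following sketch (variable names and the test spectrum are ours) verifies numerically that (2.7) is indeed the representation of OGD under the identification (2.5).

```python
import numpy as np

def sigma(x, lam):
    """The identification (2.5): x (up to scaling and signs) -> p in the simplex."""
    q = lam**2 * x**2
    return q / q.sum()

def T(p, lam):
    """Akaike's map (2.7)."""
    mean = lam @ p
    q = p * (mean - lam)**2
    return q / q.sum()

def ogd(x, lam):
    """The exact-line-search iteration (1.4) for f(x) = 0.5 * x^T diag(lam) x."""
    return x - (x @ (lam**2 * x)) / (x @ (lam**3 * x)) * (lam * x)

# T(sigma(x)) should coincide with sigma(OGD(x))
rng = np.random.default_rng(0)
lam = np.array([9.0, 5.0, 2.0, 1.0])
x = rng.standard_normal(4)
assert np.allclose(T(sigma(x, lam), lam), sigma(ogd(x, lam), lam))
```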

The last expression is Akaike's map $T$ defined in [5, §1]. Under the distinct eigenvalues assumption (see below), $T(p)$ is well-defined for any $p$ in

\[
{\rm dom}(T)=\triangle_{n}\backslash\{e_{1},\ldots,e_{n}\}, \tag{2.8}
\]

i.e. the standard simplex with the standard basis vectors of $\mathbb{R}^{n}$ removed. Also, $T$ is continuous on its domain. By (2.11) below, when $n=2$, $T$ extends continuously to $\triangle_{2}$. But for $n\geq 3$ it does not extend continuously to any $e_{i}$; for example, if $n=3$ and $i=2$, then (assuming $\lambda_{1}>\lambda_{2}>\lambda_{3}$)

\[
T([\epsilon,1-\epsilon,0]^{T})=[1-\epsilon,\epsilon,0]^{T}\quad\mbox{and}\quad T([0,1-\epsilon,\epsilon]^{T})=[0,\epsilon,1-\epsilon]^{T}. \tag{2.9}
\]

This follows from the $n=2$ case of Proposition 2.1 and the following diagonal property of $T$.

Diagonal property. Thanks to the matrix $A$ being diagonal, $\triangle_{J}:=\{p\in\triangle_{n}:p_{i}=0,\;\forall i\notin J\}$ is invariant under $T$ for any $J\subset\{1,\ldots,n\}$. Notice the correspondence between $\triangle_{J}$ and $\triangle_{|J|}$ via the projection $\lambda\mapsto\lambda_{J}:=(\lambda_{i})_{i\in J}$. If we write $T_{\lambda}$ to signify the dependence of $T$ on $\lambda$, then under this correspondence $T|_{\triangle_{J}}$ is simply $T_{\lambda_{J}}$ acting on $\triangle_{|J|}$.

This obvious property will be useful for our proof later; for now, see (2.9) above for an example of the property in action.

Distinct eigenvalues assumption. It causes no loss of generality to assume that the eigenvalues $\lambda_{i}$ are distinct: if $\hat{\lambda}=[\hat{\lambda}_{1},\ldots,\hat{\lambda}_{m}]^{T}$ consists of the distinct eigenvalues of $A$, then $A={\rm diag}(\hat{\lambda}_{1}I,\ldots,\hat{\lambda}_{m}I)$, where each $I$ stands for an identity matrix of the appropriate size. Accordingly, each initial vector $x^{(0)}$ can be written in block form $[\mathbf{x}^{(0)}_{1},\ldots,\mathbf{x}^{(0)}_{m}]^{T}$. It is easy to check that if we apply the optimum GD method to $\hat{f}(\hat{x})=\frac{1}{2}\hat{x}^{T}\hat{A}\hat{x}$ with $\hat{A}={\rm diag}(\hat{\lambda}_{1},\ldots,\hat{\lambda}_{m})$ and initial vector $\hat{x}^{(0)}=\big[\|\mathbf{x}^{(0)}_{1}\|_{2},\ldots,\|\mathbf{x}^{(0)}_{m}\|_{2}\big]^{T}$, then the ROC of the reduced system is identical to that of the original. Moreover, the map

\[
P:\mathbb{S}^{n-1}_{+}\rightarrow\mathbb{S}^{m-1}_{+},\quad[\mathbf{x}_{1},\ldots,\mathbf{x}_{m}]^{T}\mapsto\big[\|\mathbf{x}_{1}\|_{2},\ldots,\|\mathbf{x}_{m}\|_{2}\big]^{T},
\]

is a submersion and hence has the property that $P^{-1}(N)$ is a null set in $\mathbb{S}^{n-1}$ for any null set $N$ in $\mathbb{S}^{m-1}_{+}$; see Lemma A.1. Therefore, it suffices to prove Theorem 1.2 and Theorem 1.3 under the distinct eigenvalues assumption.

So from now on we make the blanket assumption that $\lambda_{1}>\cdots>\lambda_{n}>0$.

Connection to variance. Akaike's $\lambda$-dependent parametrization (2.5)-(2.6) does not only give $[\textsf{OGD}]$ the simple representation $T$ in (2.7); the map $T$ also has an interesting probabilistic interpretation. If we think of $p$ as a probability distribution over the values in $\lambda$, then $\overline{\lambda}(p)$ is the mean of the resulting random variable, and the expression in the denominator of the definition of $T$, i.e. $\sum_{j}p_{j}(\overline{\lambda}(p)-\lambda_{j})^{2}$, is its variance. What, then, does the map $T$ do to $p$? It assigns a new probability distribution, namely $T(p)$, to $\lambda$. The definition of $T$ in (2.7) suggests that $T(p)_{i}$ will be bigger if $\lambda_{i}$ is far from the mean $\overline{\lambda}(p)$, so the map polarizes the probabilities towards the extremal values $\lambda_{1}$ and $\lambda_{n}$. This also suggests that the map $T$ tends to increase variance. Akaike proved in [5, Lemma 2] that this is indeed the case: using the notation $\overline{f(\lambda)}(p)$ for $\sum_{i=1}^{n}f(\lambda_{i})p_{i}$, the result can be expressed as

\[
\overline{(\lambda-\overline{\lambda}(T(p)))^{2}}(T(p))\;\geq\;\overline{(\lambda-\overline{\lambda}(p))^{2}}(p). \tag{2.10}
\]

This monotonicity result is a key to Akaike's proof of (2.12) below. As an immediate application, notice that by (2.8), $p\in{\rm dom}(T)$ is equivalent to saying that the random variable with probability $p_{i}$ attached to the value $\lambda_{i}$ has a positive variance. Therefore (2.10) implies that if $p\in{\rm dom}(T)$, then $T(p)\in{\rm dom}(T)$ as well, and so $T^{k}(p)$ is well-defined for all $k\geq 0$.

The following fact is instrumental for our proof of the main result; it is not explicitly stated in [5].

Proposition 2.1 (Independence of $\lambda_{1}$ and $\lambda_{n}$)

The map $T$ depends only on $\alpha_{i}\in(0,1)$, $i=2,\ldots,n-1$, defined by

\[
\lambda_{i}=\alpha_{i}\lambda_{1}+(1-\alpha_{i})\lambda_{n}.
\]

In particular, when $n=2$, $T$ is independent of $\lambda$; in fact,

\[
T([s,1-s]^{T})=[1-s,s]^{T}. \tag{2.11}
\]

While elementary to check, it is not clear whether Akaike was aware of the first part of the above proposition. It tells us that the condition number of $A$ plays no role in the dynamics of $T$. However, he must have been aware of the second part of the proposition, as he proved the following nontrivial generalization of (2.11) to higher dimensions. When the dimension is higher than 2, $T$ is no longer an involution, but it still resembles (2.11) in the following way:

Theorem 2.2

[5, Theorem 2] For any $p^{(0)}\in{\rm dom}(T)$, there exists $s\in[0,1]$ such that

\[
p^{(\infty)}:=\lim_{k\rightarrow\infty}T^{2k}(p^{(0)})=[1-s,0,\ldots,0,s]^{T}\;\;\mbox{ and }\;\;p^{\ast(\infty)}:=\lim_{k\rightarrow\infty}T^{2k+1}(p^{(0)})=[s,0,\ldots,0,1-s]^{T}. \tag{2.12}
\]

Our proof of Theorem 1.3 shall rely on this result, which says that the dynamical system defined by $T$ polarizes the probabilities to the two extremal eigenvalues. This makes part of the problem essentially two-dimensional. So we begin by analyzing the ROC in the 2-D case.
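A quick numerical illustration of (2.10) and (2.12), with an arbitrary spectrum and starting distribution of our own choosing: the variance never decreases along the orbit, and the even iterates converge to a vector supported on the two extremal eigenvalues.

```python
import numpy as np

def T(p, lam):
    """Akaike's map (2.7)."""
    mean = lam @ p
    q = p * (mean - lam)**2
    return q / q.sum()

def var(p, lam):
    return p @ (lam - lam @ p)**2

lam = np.array([10.0, 7.0, 4.0, 1.0])
p = np.array([0.3, 0.3, 0.2, 0.2])
for _ in range(200):
    q = T(p, lam)
    assert var(q, lam) >= var(p, lam) - 1e-12    # monotonicity (2.10)
    p = T(q, lam)                                # advance two steps at a time
print(np.round(p, 6))                            # middle entries vanish, cf. (2.12)
```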

3 Analysis in 2-D and Proof of Theorem 1.2(i)

When $n=2$, we may represent a vector in $\triangle$ by $[1-s,s]^{T}$ with $s\in[0,1]$. Recall (2.11) and note that $\rho$ depends on $\lambda$ only through $a:=\lambda_{2}/\lambda_{1}={\rm cond}(A)^{-1}$. So we may represent $T$ and $\rho$ in terms of the parameter $s$ and the quantity $a$ as

\[
t(s)=1-s,\qquad\overline{\rho}^{2}(s,a)=1-(1-s+sa)^{-1}(1-s+sa^{-1})^{-1}. \tag{3.1}
\]

So $\overline{\rho}(t(s),a)=\overline{\rho}(s,a)$, and the otherwise difficult-to-analyze ROC $\rho^{\ast}$ in (1.9) is determined simply by

\[
\rho^{\ast}(x^{(0)},\lambda)=\rho(x^{(0)},\lambda). \tag{3.2}
\]

By (3.1), the value $s\in[0,1]$ that maximizes $\overline{\rho}$ is $s_{\rm max}=1/2$, with maximum value $(1-a)/(1+a)$ ($=(\lambda_{1}-\lambda_{n})/(\lambda_{1}+\lambda_{n})$). It may be instructive to see that the same can be concluded from Henrici's proof of Kantorovich's inequality in [12]: the one-step shrinking factor $\rho(x,\lambda)$ (in any dimension) attains its maximum value $\frac{\lambda_{1}-\lambda_{n}}{\lambda_{1}+\lambda_{n}}$ when and only when (i) for every $i$ such that $x_{i}\neq 0$, $\lambda_{i}$ is either $\lambda_{1}$ or $\lambda_{n}$, and (ii) $\sum_{i:\lambda_{i}=\lambda_{1}}\lambda_{i}^{2}x_{i}^{2}=\sum_{i:\lambda_{i}=\lambda_{n}}\lambda_{i}^{2}x_{i}^{2}$. When $n=2$, condition (i) is automatically satisfied, while condition (ii) means

\[
\lambda_{1}^{2}x_{1}^{2}=\lambda_{2}^{2}x_{2}^{2},\;\mbox{ or }\;|x_{1}|=(\lambda_{2}/\lambda_{1})|x_{2}|. \tag{3.3}
\]

This is equivalent to setting $s$ to $s_{\rm max}=1/2$.

These observations show that the worst case bound (1.7) is tight in any dimension $n$:

Proposition 3.1

There exists an initial vector $x^{(0)}\in\mathbb{R}^{n}$ such that equality holds in (1.7) for every iteration $k$.

Proof:  We have proved the claim for $n=2$. For a general dimension $n\geq 2$, observe that if the initial vector lies on the $x_{1}$-$x_{n}$ plane, then – thanks to diagonalization – GD behaves exactly as in 2-D, with $A={\rm diag}(\lambda_{1},\lambda_{2},\ldots,\lambda_{n})$ replaced by $A={\rm diag}(\lambda_{1},\lambda_{n})$. That is, if we choose $x^{(0)}\in\mathbb{R}^{n}$ so that $|x^{(0)}_{1}|=\frac{\lambda_{n}}{\lambda_{1}}|x^{(0)}_{n}|$ and $x^{(0)}_{i}=0$ for $2\leq i\leq n-1$, then equality holds in (1.7) for every $k$. $\square$

For any non-zero vector $x$ in $\mathbb{R}^{2}$, $[x]_{\sim}$ can be identified with $[\cos(\theta),\sin(\theta)]^{T}$ for some $\theta\in[0,\pi/2]$, which, by (2.5), is identified with $[1-s,s]^{T}\in\Delta_{2}$ where

\[
s=\frac{a^{2}\sin^{2}(\theta)}{\cos^{2}(\theta)+a^{2}\sin^{2}(\theta)}. \tag{3.4}
\]

Proof of the first part of Theorem 1.2. As the ROC $\rho^{\ast}(x^{(0)},\lambda)$ depends only on the direction of $|x^{(0)}|$, in 2-D the average ROC is given by

\[
\mbox{Average ROC}=(\pi/2)^{-1}\int_{0}^{\pi/2}\rho([\cos(\theta),\sin(\theta)]^{T},[1,a])\,d\theta. \tag{3.5}
\]

Note that

\[
\rho([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})=\overline{\rho}\Big(\frac{a^{2}\sin^{2}(\theta)}{\cos^{2}(\theta)+a^{2}\sin^{2}(\theta)},\,a\Big). \tag{3.6}
\]

On the one hand, we have

\[
\max_{\theta\in[0,\pi/2]}\rho([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})=\overline{\rho}(1/2,a)=(1-a)/(1+a), \tag{3.7}
\]

so $\lim_{a\rightarrow 0^{+}}\max_{s\in[0,1]}\overline{\rho}(s,a)=1$. On the other hand, by (3.6) and (3.1) one can verify that

\[
\lim_{a\rightarrow 0^{+}}\rho([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})=0,\;\;\forall\,\theta\in[0,\pi/2]. \tag{3.8}
\]

See Figure 2 (left panel) for an illustration of the non-uniform convergence. Since $\rho([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})\leq 1$, by the dominated convergence theorem,

\[
\lim_{a\rightarrow 0^{+}}\int_{0}^{\pi/2}\rho([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})\,d\theta=\int_{0}^{\pi/2}\lim_{a\rightarrow 0^{+}}\rho([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})\,d\theta=0.
\]

This proves the first part of Theorem 1.2. $\square$

An alternate proof. While the average rate of convergence $(\pi/2)^{-1}\int_{0}^{\pi/2}\rho([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})\,d\theta$ does not seem to have a closed-form expression, the average squared rate of convergence can be expressed in closed form:

\[
(\pi/2)^{-1}\int_{0}^{\pi/2}\overline{\rho}^{2}\Big(\frac{a^{2}\sin^{2}(\theta)}{\cos^{2}(\theta)+a^{2}\sin^{2}(\theta)},\,a\Big)\,d\theta=\frac{\sqrt{a}(1-\sqrt{a})^{2}}{(1+a)(1-\sqrt{a}+a)}. \tag{3.9}
\]

By Jensen's inequality,

\[
\Big[(\pi/2)^{-1}\int_{0}^{\pi/2}\rho([\cos(\theta),\sin(\theta)]^{T},\lambda)\,d\theta\Big]^{2}\leq(\pi/2)^{-1}\int_{0}^{\pi/2}\rho^{2}([\cos(\theta),\sin(\theta)]^{T},\lambda)\,d\theta. \tag{3.10}
\]

Since the right-hand side of (3.9) ($=$ the right-hand side of (3.10)) goes to zero as $a$ approaches 0, so does the left-hand side of (3.10), and hence so does the average ROC. $\square$
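The closed form (3.9) is easy to check numerically; the value of $a$ and the grid size below are arbitrary choices.

```python
import numpy as np

def rho_bar_sq(s, a):
    """Squared one-step shrinking factor (3.1) in 2-D."""
    return 1.0 - 1.0 / ((1 - s + s * a) * (1 - s + s / a))

a = 0.05
theta = np.linspace(0.0, np.pi / 2, 200001)
s = a**2 * np.sin(theta)**2 / (np.cos(theta)**2 + a**2 * np.sin(theta)**2)
lhs = np.mean(rho_bar_sq(s, a))    # Riemann estimate of the left-hand side of (3.9)
rhs = np.sqrt(a) * (1 - np.sqrt(a))**2 / ((1 + a) * (1 - np.sqrt(a) + a))
print(lhs, rhs)                    # the two agree to several digits
```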

See Figure 2 (right panel, yellow curve) for a plot of how the average ROC varies with ${\rm cond}(A)$: as ${\rm cond}(A)$ increases from 1, the average ROC does deteriorate – as most textbooks would suggest – but only up to a certain point; after that the average ROC not only improves, but gets arbitrarily fast as $A$ gets more and more ill-conditioned – quite the opposite of what most textbooks may suggest. See also Appendix C.

Figure 2: Left: plots of $\theta$ versus $\rho^{\ast}([\cos(\theta),\sin(\theta)]^{T},[1,a]^{T})$ for various values of $a$; observe that the convergence in (3.8) is non-uniform in $\theta$. Right: the worst case, the average, and the square root of the average squared rate of convergence as functions of $a$. The average rate of convergence is computed using numerical integration, while the other two curves are given by the closed-form expressions (3.7) and (3.9).

4 Proof of Theorem 1.3 and Theorem 1.2(ii)

Let $n\geq 3$. Denote by $\Theta$ the map

\[
[p_{i}]_{1\leq i<n}\mapsto\big[T_{i}(p_{1},\ldots,p_{n-1},1-\textstyle\sum_{j=1}^{n-1}p_{j})\big]_{1\leq i<n}.
\]

Its domain, denoted by ${\rm dom}(\Theta)$, is the simplex $\Lambda:=\big\{[p_{1},\ldots,p_{n-1}]^{T}:p_{j}\geq 0,\;0\leq\sum p_{j}\leq 1\big\}$ with its vertices removed. Notwithstanding, $\Theta$ can be smoothly extended to some open set of $\mathbb{R}^{n-1}$ containing ${\rm dom}(\Theta)$.

The 2-periodic points $[s,0,\ldots,0,1-s]^{T}$ of $T$ given by (2.12) correspond to the fixed points $[s,0,\ldots,0]^{T}$ of $\Theta^{2}=\Theta\circ\Theta$, which we denote more compactly by $\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$.

The map $\sigma$ defined by (2.5)-(2.6) induces smooth one-to-one correspondences between $\mathbb{S}^{n-1}_{+}$, $\triangle$ and $\Lambda$. For any $x^{(0)}\in\mathbb{S}^{n-1}_{+}$, let $p^{(0)}\in\triangle$ be the corresponding probability vector and denote by $s(x^{(0)})\in(0,1)$ the corresponding limit probability according to (2.12). Theorem 2.2, together with (3.1), implies that

\begin{align*}
\rho^{\ast}(x^{(0)},\lambda)&=\sqrt{1-\big(1-s+sa\big)^{-1}\big(1-s+sa^{-1}\big)^{-1}},\quad s=s(x^{(0)})\\
&=\frac{1-a}{\sqrt{(1+a)^{2}+a(c-c^{-1})^{2}}}\quad\mbox{ if we write }s(x^{(0)})=1/(1+c^{2}). \tag{4.1}
\end{align*}

A computation shows that for any $s\in(0,1)$, the Jacobian matrix of $\Theta$ at $\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$ is

\[
D\Theta\Big|_{\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]}:=\frac{\partial(\Theta_{1},\ldots,\Theta_{n-1})}{\partial(p_{1},\ldots,p_{n-1})}\Bigg|_{\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]}=\begin{bmatrix}-1&-\frac{\alpha_{2}^{2}}{s}&\cdots&\cdots&-\frac{\alpha_{n-1}^{2}}{s}\\ 0&\frac{(\alpha_{2}-s)^{2}}{s(1-s)}&0&\cdots&0\\ \vdots&\ddots&\ddots&\ddots&\vdots\\ \vdots&&\ddots&\ddots&0\\ 0&\cdots&\cdots&0&\frac{(\alpha_{n-1}-s)^{2}}{s(1-s)}\end{bmatrix}, \tag{4.2}
\]

where $\alpha_{i}$ is defined as in Proposition 2.1. Since $\Theta(\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right])=\left[\begin{smallmatrix}1-s\\ 0\end{smallmatrix}\right]$, the Jacobian matrix of $\Theta^{2}$ at its fixed point $\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$ is given by

\[
D\Theta^{2}\Big|_{\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]}=D\Theta\Big|_{\left[\begin{smallmatrix}1-s\\ 0\end{smallmatrix}\right]}\cdot D\Theta\Big|_{\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]};
\]

its eigenvalues are 1 and

\[
\mu_{i}(s):=\frac{(\alpha_{i}-s)^{2}(\alpha_{i}-(1-s))^{2}}{s^{2}(1-s)^{2}}=\left(\frac{s(1-s)-\alpha_{i}(1-\alpha_{i})}{s(1-s)}\right)^{2},\;\;i=2,\ldots,n-1. \tag{4.3}
\]

Each of the last $n-2$ eigenvalues, namely the $\mu_{i}(s)$, is less than or equal to 1 if and only if $s(1-s)\geq\frac{1}{2}\alpha_{i}(1-\alpha_{i})$. Consequently, we have the following:
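The eigenvalue formula (4.3) can also be checked numerically by differencing $\Theta^{2}$ at one of its fixed points; the spectrum, the value of $s$, and the step size $h$ below are illustrative choices of ours.

```python
import numpy as np

lam = np.array([10.0, 7.0, 3.0, 1.0])      # lambda_1 > ... > lambda_n, here n = 4

def T(p):
    """Akaike's map (2.7)."""
    mean = lam @ p
    q = p * (mean - lam)**2
    return q / q.sum()

def Theta(q):
    """The map of this section: last coordinate of p eliminated."""
    p = np.append(q, 1.0 - q.sum())
    return T(p)[:-1]

def jac(F, q, h=1e-6):
    """Central-difference Jacobian of F at q."""
    J = np.zeros((q.size, q.size))
    for j in range(q.size):
        e = np.zeros(q.size); e[j] = h
        J[:, j] = (F(q + e) - F(q - e)) / (2 * h)
    return J

s = 0.3
J2 = jac(lambda q: Theta(Theta(q)), np.array([s, 0.0, 0.0]))
alpha = (lam[1:-1] - lam[-1]) / (lam[0] - lam[-1])
mu = ((s * (1 - s) - alpha * (1 - alpha)) / (s * (1 - s)))**2    # predicted by (4.3)
print(np.sort(np.linalg.eigvals(J2).real))                       # should match ...
print(np.sort(np.append(mu, 1.0)))                               # ... 1 and the mu_i(s)
```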

Lemma 4.1

The spectrum of $D\Theta^{2}\big|_{\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]}$ has at least one eigenvalue larger than one iff $s\in(-1,1)\backslash I$, where

\[
I:=\Big\{s:|s-1/2|\leq\frac{1}{2}\sqrt{1-2\alpha_{i^{\ast}}(1-\alpha_{i^{\ast}})}\Big\},\;\;\;i^{\ast}\in\mathop{\rm argmin}_{1<i<n}|\alpha_{i}-1/2|\;\Big(=\mathop{\rm argmin}_{1<i<n}|\lambda_{i}-(\lambda_{1}+\lambda_{n})/2|\Big). \tag{4.4}
\]

Next, observe that:

  • If $x^{(0)}\in\mathbb{S}^{n-1}_{+}$ satisfies the upper bound on $|s(x^{(0)})-1/2|$ in the definition of $I$, then the ROC $\rho^{\ast}(x^{(0)},\lambda)$ satisfies the lower bound in (1.12). This can be checked using the expression for $\rho^{\ast}(x^{(0)},\lambda)$ in (4.1).

  • The correspondence between $\mathbb{S}^{n-1}_{+}$ and $\Lambda$, induced by (2.5)-(2.6), maps null sets to null sets. This can be verified by applying Lemma A.1.

Theorem 1.3 then follows if we can establish the following:

Claim I: For almost all $s\in I$, there exists an open set $U_{s}$ around $\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$ such that $\Theta^{2}(U_{s})\subset U_{s}$.

Claim II: For almost all $p^{(0)}\in{\rm dom}(\Theta)$, $\lim_{k\rightarrow\infty}\Theta^{2k}(p^{(0)})=\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$ with $s\in I$.

Our proofs of these claims are based on (essentially part 2 of) Theorem B.1.

By (4.3), $\Theta^{2}$ is a local diffeomorphism at $\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$ for every $s\in(-1,1)\backslash\{\alpha_{i},1-\alpha_{i}:i=2,\ldots,n-1\}$. The interval $I$ defined by (4.4) covers at least 70% of $(0,1)$: $I\supseteq\big[\frac{1}{2}-\frac{1}{2\sqrt{2}},\frac{1}{2}+\frac{1}{2\sqrt{2}}\big]$; it is easy to check that $\alpha_{i^{\ast}},1-\alpha_{i^{\ast}}\in I$, while for $i\neq i^{\ast}$, $\alpha_{i}$ and $1-\alpha_{i}$ may or may not fall into $I$. If we make the assumption that

\[
\alpha_{i},\;1-\alpha_{i}\in I,\;\;\forall i\neq i^{\ast}, \tag{4.5}
\]

then Theorem B.1 can be applied verbatim to $\Theta^{2}$ at $\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$ for every $s\in(-1,1)\backslash I$, and the argument below will prove the claims under the assumption (4.5), and hence a weaker version of Theorem 1.3.

Note that Claim I is local in nature, and follows immediately from Theorem B.1 – also a local result – for any $s$ in the interior of $I$ excluding $\{\alpha_{i},1-\alpha_{i}:i=2,\ldots,n-1\}$. (If we invoke the refined version of Theorem B.1, there is no need to exclude the singularities. Either way suffices for proving Claim I.) Claim II, however, is global in nature. Its proof combines the refined version of Theorem B.1 with arguments exploiting the diagonal and polynomial properties of $\Theta$.

Proof of Claim II. By (2.12), it suffices to show that the set

\[
\bigcup_{s\in(-1,1)\backslash I}\Big\{p\in{\rm dom}(\Theta):\lim_{k\rightarrow\infty}\Theta^{2k}(p)=\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]\Big\}
\]

has measure zero in $\mathbb{R}^{n-1}$. By (the refined version of) Theorem B.1 and Lemma 4.1, every fixed point $\left[\begin{smallmatrix}s\\ 0\end{smallmatrix}\right]$, $s\in(-1,1)\backslash I$, of $\Theta^{2}$ has a center-stable manifold, denoted by $W^{\rm cs}_{\rm loc}(s)$, with co-dimension at least 1. The diagonal property of $T$, and hence of $\Theta$, ensures that $W^{\rm cs}_{\rm loc}(s)$ can be chosen to lie on the plane $\{x_{i}=0:\mu_{i}(s)>1\}$, which is contained in the hyperplane $\mathcal{P}_{\ast}:=e_{i^{\ast}}^{\perp}$. Therefore,

\[
\bigcup_{s\in(-1,1)\backslash I}W^{\rm cs}_{\rm loc}(s)\subset\mathcal{P}_{\ast}.
\]

Of course, we also have $\left[\begin{smallmatrix}\alpha_{i}\\ 0\end{smallmatrix}\right],\left[\begin{smallmatrix}1-\alpha_{i}\\ 0\end{smallmatrix}\right]\in\mathcal{P}_{\ast}$. So to complete the proof, it suffices to show that the set of points attracted to the hyperplane $\mathcal{P}_{\ast}$ by $\Theta^{2}$ has measure 0, i.e. it suffices to show that

\[
\bigcup_{n\geq 0}\Theta^{-2n}(\mathcal{P}_{\ast}\cap{\rm dom}(\Theta)) \tag{4.6}
\]

is a null set.

We now argue that $D\Theta^{2}|_{p}$ is non-singular for almost all $p\in{\rm dom}(\Theta)$. By the chain rule, it suffices to show that $D\Theta|_{p}$ is non-singular for almost all $p\in{\rm dom}(\Theta)$. Note that the entries of $\Theta(p)$ are rational functions; in fact $\Theta_{i}(p)$ is of the form $t_{i}(p)/v(p)$, where $t_{i}(p)$ and $v(p)$ are degree 2 polynomials in $p$ and $v(p)>0$ for $p\in{\rm dom}(\Theta)$. So $\det(D\Theta|_{p})$ is of the form $w(p)/v(p)^{2(n-1)}$ where $w(p)$ is some polynomial in $p$. It is clear that $w(p)$ is not identically zero, as that would violate the invertibility of $D\Theta|_{p}$ at many $p$, as shown by (4.2). It then follows from Lemma A.2 that $w(p)$ is non-zero almost everywhere, hence $D\Theta|_{p}$ is invertible almost everywhere.

As $\mathcal{P}_{\ast}\cap{\rm dom}(\Theta)$ is null, we can then use Lemma A.1 inductively to conclude that $\Theta^{-2n}(\mathcal{P}_{\ast}\cap{\rm dom}(\Theta))$ is null for any $n\geq 0$. So the set (4.6) is a countable union of null sets, hence is itself null.

We have completed the proof of Theorem 1.3, and Theorem 1.2(ii) follows. $\square$

Computational examples in 3-D. Corresponding to the limit probability $s(x^{(0)})$ for $x^{(0)}\in\mathbb{S}^{n-1}$, defined by (2.12), is the limit angle $\theta(x^{(0)})\in[0,\pi/2]$ defined by $|\hat{x}^{(\infty)}|=[\cos(\theta),0,\ldots,0,\sin(\theta)]^{T}$, where $\hat{x}^{(\infty)}:=\lim_{k\rightarrow\infty}\textsf{OGD}^{2k}(x^{(0)})/\|\textsf{OGD}^{2k}(x^{(0)})\|$. The limit probability and the limit angle are related by

\[
\theta=\tan^{-1}\big(a^{-1}\sqrt{s/(1-s)}\big). \tag{4.7}
\]

This is just the inverse of the bijection between $\theta$ and $s$ in (3.4) already seen in the 2-D case. Unlike the 2-D case, initial vectors $x^{(0)}$ uniformly sampled on the unit sphere $\mathbb{S}^{n-1}$ will not result in a limit angle uniformly distributed on the interval $[0,\pi/2]$.

We consider various choices of $A$ with $n=3$, and for each we estimate the probability distribution of the limit angle $\theta$ with $x^{(0)}$ uniformly distributed on the unit sphere of $\mathbb{R}^{3}$. As we see from Figure 3, the computations suggest that when the intermediate eigenvalue $\lambda_{2}$ equals $(\lambda_{1}+\lambda_{3})/2$, the distribution of the limit angles peaks at $\tan^{-1}(a^{-1})$, the angle that corresponds to the slowest ROC $(1-a)/(1+a)$. The mere presence of an intermediate eigenvalue, as long as it is not too close to $\lambda_{1}$ or $\lambda_{3}$, concentrates the limit angle $\theta$ near $\tan^{-1}(a^{-1})$. Moreover, the effect gets more prominent when $a^{-1}={\rm cond}(A)$ is large. The horizontal lines in Figure 3 correspond to Akaike's lower bound on the ROC in (1.12); the computations illustrate that the bound is tight.

Figure 3: Distribution of the limit angle $\theta$, estimated from $10^{7}$ initial vectors $x^{(0)}$ sampled from the uniform distribution on the unit 2-sphere in 3-D. The horizontal lines show Akaike's lower bound on the ROC in (1.12). The left vertical axis is for the ROC, while the right vertical axis is for the probability density. The black dot corresponds to the angle $\tan^{-1}(a^{-1})$ yielding the slowest ROC $(1-a)/(1+a)$.

5 Open Question

The computational results in Appendix C suggest that our main results, Theorems 1.2 and 1.3, should extend to objectives beyond strongly convex quadratics. The article [11] shows how to extend the worst case ROC (1.7) for strongly convex quadratics to general strongly convex functions:

Theorem 5.1

[11, Theorem 1.2] Suppose $f\in C^{2}(\mathbb{R}^{n},\mathbb{R})$ satisfies $L\cdot I\succeq\nabla^{2}f(x)\succeq\mu\cdot I$ for all $x\in\mathbb{R}^{n}$, let $x^{\ast}$ be a global minimizer of $f$ on $\mathbb{R}^{n}$, and let $f^{\ast}=f(x^{\ast})$. Then each iteration of the GD method with exact line search satisfies

\[
f\big(x^{(k+1)}\big)-f^{\ast}\leq\Big(\frac{L-\mu}{L+\mu}\Big)^{2}\Big(f\big(x^{(k)}\big)-f^{\ast}\Big),\;\;k=0,1,\ldots.
\]

The result is proved for the slightly bigger family of strongly convex functions $\mathcal{F}_{\mu,L}(\mathbb{R}^{n})$, which requires only $C^{1}$ regularity; see [11, Definition 1.1] or [7, Chapter 7] for the definition of $\mathcal{F}_{\mu,L}(\mathbb{R}^{n})$.

The above result was anticipated by most optimization experts; compare it with, for example, [16, Theorem 3.4]. But the rigorous proof in [11] requires a tricky semi-definite programming formulation, which is tailored to the analysis of the worst case ROC. At this point, it is not clear how to generalize the average case ROC result in this article to objectives in $\mathcal{F}_{\mu,L}(\mathbb{R}^{n})$.

Appendix A Two Auxiliary Measure Theoretic Lemmas.

The proof of Theorem 1.3 uses the following two lemmas.

Lemma A.1

Let $U\subset\mathbb{R}^{m}$ and $V\subset\mathbb{R}^{n}$ be open sets, $m\geq n$, and let $f:U\rightarrow V$ be $C^{1}$ with ${\rm rank}(Df(x))=n$ for almost all $x$. Then whenever $A$ is of measure zero, so is $f^{-1}(A)$.

Lemma A.2

The zero set of any non-zero polynomial has measure zero.

For the proofs of these two lemmas, see [1] and [2].

Appendix B The Theorem of Center and Stable Manifolds

Theorem B.1 (Center and Stable Manifolds)

Let 0 be a fixed point of the $C^{r}$ local diffeomorphism $f:U\rightarrow\mathbb{R}^{n}$, where $U$ is a neighborhood of zero in $\mathbb{R}^{n}$ and $\infty>r\geq 1$. Let $E^{\rm s}\oplus E^{\rm c}\oplus E^{\rm u}$ be the invariant splitting of $\mathbb{R}^{n}$ into the generalized eigenspaces of $Df(0)$ corresponding to eigenvalues of absolute value less than one, equal to one, and greater than one. To each of the five $Df(0)$-invariant subspaces $E^{\rm s}$, $E^{\rm s}\oplus E^{\rm c}$, $E^{\rm c}$, $E^{\rm c}\oplus E^{\rm u}$, and $E^{\rm u}$ there is associated a local $f$-invariant $C^{r}$ embedded disc $W^{\rm s}_{\rm loc}$, $W^{\rm cs}_{\rm loc}$, $W^{\rm c}_{\rm loc}$, $W^{\rm cu}_{\rm loc}$, $W^{\rm u}_{\rm loc}$ tangent to the corresponding linear subspace at 0, and a ball $B$ around zero in a (suitably defined) norm, such that:

  1. $W^{\rm s}_{\rm loc}=\{x\in B\,|\,f^{n}(x)\in B$ for all $n\geq 0$ and $d(f^{n}(x),0)$ tends to zero exponentially$\}$. $f:W^{\rm s}_{\rm loc}\rightarrow W^{\rm s}_{\rm loc}$ is a contraction mapping.

  2. $f(W^{\rm cs}_{\rm loc})\cap B\subset W^{\rm cs}_{\rm loc}$. If $f^{n}(x)\in B$ for all $n\geq 0$, then $x\in W^{\rm cs}_{\rm loc}$.

  3. $f(W^{\rm c}_{\rm loc})\cap B\subset W^{\rm c}_{\rm loc}$. If $f^{n}(x)\in B$ for all $n\in\mathbb{Z}$, then $x\in W^{\rm c}_{\rm loc}$.

  4. $f(W^{\rm cu}_{\rm loc})\cap B\subset W^{\rm cu}_{\rm loc}$. If $f^{n}(x)\in B$ for all $n\leq 0$, then $x\in W^{\rm cu}_{\rm loc}$.

  5. $W^{\rm u}_{\rm loc}=\{x\in B\,|\,f^{n}(x)\in B$ for all $n\leq 0$ and $d(f^{n}(x),0)$ tends to zero exponentially$\}$. $f^{-1}:W^{\rm u}_{\rm loc}\rightarrow W^{\rm u}_{\rm loc}$ is a contraction mapping.

The assumption in Theorem B.1 that $f$ is invertible happens to be unnecessary. The proof of the existence of $W^{\rm cu}_{\rm loc}$ and $W^{\rm u}_{\rm loc}$, based on [17, Theorem III.2], clearly does not rely on the invertibility of $f$. The proof for $W^{\rm cs}_{\rm loc}$ and $W^{\rm s}_{\rm loc}$, however, is based on applying [17, Theorem III.2] to $f^{-1}$; and it is the existence of $W^{\rm cs}_{\rm loc}$ that is needed in our proof of Theorem 1.3. Fortunately, by a finer argument outlined in [17, Exercise III.2, Page 68] and [3], the existence of $W^{\rm cs}_{\rm loc}$ can be established without assuming the invertibility of $f$. Thanks to this refinement, we can proceed with the proof without the extra assumption (4.5).

Appendix C The 2-D Rosenbrock function

While Akaike did not bother to explore what happens to the optimum GD method in 2-D, the 2-D Rosenbrock function $f(x)=100(x_{2}-x_{1}^{2})^{2}+(1-x_{1})^{2}$, again a degree 4 polynomial, is often used in optimization textbooks to exemplify different optimization methods. In particular, due to the ill-conditioning near its global minimizer $x^{\ast}=[1,1]^{T}$ (${\rm cond}(\nabla^{2}f(x^{\ast}))\approx 2500$), it is often used to illustrate the slow convergence of GD methods.

A student will likely be confused if he applies the exact line search GD method to this objective function. Figure 4 (leftmost panel) shows the ROC of optimum GD applied to the 2-D Rosenbrock function. The black line illustrates the worst case ROC given by (1.11) under the pretense that the degree 4 Rosenbrock function is a quadratic with the constant Hessian $\nabla^{2}f(x^{\ast})$; the other lines show the ROC with 500 different initial guesses sampled from $x^{\ast}+z$, $z\sim N(0,I_{2})$. As predicted by the first part of Theorem 1.2, the average ROC is much faster than the worst case ROC shown by the black line, not in spite of, but because of, the ill-conditioning of the Hessian. (To the best of the author's knowledge, the only textbook that briefly mentions this difference between dimension 2 and above is [16, Page 62]. The same reference points to a specialized analysis of the 2-D case in an older optimization textbook, but the latter does not contain a result to the effect of the first part of Theorem 1.2.)

Figure 4: ROC of the optimum GD method applied to the Rosenbrock function in $n$ variables ($n=2,3,4$, left to right), with 500 initial guesses sampled from $x^{\ast}+z$, $z\sim N(0,I_{n})$, $x^{\ast}=[1,\ldots,1]^{T}$. The black line illustrates the worst case ROC assuming that the objective were a quadratic with Hessian $\nabla^{2}f(x^{\ast})$.

The student may be less confused if he tests the method on the higher dimensional Rosenbrock function:

\[
f(x)=\sum_{i=2}^{n}100(x_{i}-x_{i-1}^{2})^{2}+(1-x_{i-1})^{2}.
\]

See the next two panels of the same figure for n=3n=3 and n=4n=4, which are consistent with what the second part of Theorem 1.2 predicts, namely, ill-conditioning leads to slow convergence for essentially all initial vectors.
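For readers who wish to repeat the experiment, here is a minimal sketch of exact line search GD for the $n$-variable Rosenbrock function; the function names, iteration count, and the interpolation-based recovery of the quartic $p(t)=f(x+td)$ are our own choices (since $f$ is a quartic, interpolating it at five points along the line recovers $p$ exactly, up to roundoff).

```python
import numpy as np

def rosenbrock(x):
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def rosenbrock_grad(x):
    g = np.zeros_like(x)
    g[:-1] = -400.0 * (x[1:] - x[:-1]**2) * x[:-1] - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return g

def exact_line_search_gd(x, iters=500):
    """GD with exact line search; along a line, f is a quartic in t."""
    x = np.asarray(x, dtype=float)
    for _ in range(iters):
        d = -rosenbrock_grad(x)
        ts = np.linspace(0.0, 1.0, 5)                  # 5 samples pin down the quartic
        coeffs = np.polyfit(ts, [rosenbrock(x + t * d) for t in ts], 4)
        roots = np.roots(np.polyder(coeffs))
        cands = [0.0] + [r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 0]
        x = x + min(cands, key=lambda t: rosenbrock(x + t * d)) * d
    return x

x = exact_line_search_gd(np.array([1.0, 1.0]) + np.random.default_rng(0).standard_normal(2))
print(np.linalg.norm(x - 1.0))    # distance to the minimizer [1, 1] after 500 steps
```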

References

  • [1] Mathematics Stack Exchange: https://math.stackexchange.com/q/3215996 (version: 2019-05-06).
  • [2] Mathematics Stack Exchange: https://math.stackexchange.com/q/1920302 (version: 2022-02-07).
  • [3] MathOverflow: https://mathoverflow.net/q/443156 (version: 2023-03-21).
  • [4] A. A. Ahmadi and A. Majumdar. Some applications of polynomial optimization in operations research and real-time decision making. Optimization Letters, 10(4):709–729, 2016.
  • [5] H. Akaike. On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Annals of the Institute of Statistical Mathematics, 11:1–16, 1959.
  • [6] A. Aubry, A. De Maio, B. Jiang, and S. Zhang. Ambiguity function shaping for cognitive radar via complex quartic optimization. IEEE Transactions on Signal Processing, 61(22):5603–5619, 2013.
  • [7] A. Beck. Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB. Society for Industrial and Applied Mathematics, USA, 2014.
  • [8] J. J. Benedetto and M. Fickus. Finite normalized tight frames. Adv. Comput. Math., 18(2):357–385, 2003.
  • [9] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inform. Theory, 61(4):1985–2007, 2015.
  • [10] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013.
  • [11] E. de Klerk, F. Glineur, and A. B. Taylor. On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optimization Letters, 11(7):1185–1199, Oct 2017.
  • [12] P. Henrici. Two remarks on the Kantorovich inequality. Amer. Math. Monthly, 68:904–906, 1961.
  • [13] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation: gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Found. Comput. Math., 20(3):451–632, 2020.
  • [14] P. Mohan, N. K. Yip, and T. P.-Y. Yu. Minimal energy configurations of bilayer plates as a polynomial optimization problem. Nonlinear Analysis, June 2022.
  • [15] J. Nie. Sum of squares method for sensor network localization. Computational Optimization and Applications, 43(2):151–179, 2009.
  • [16] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
  • [17] M. Shub. Global stability of dynamical systems. Springer-Verlag, New York, 1987. With the collaboration of Albert Fathi and Rémi Langevin, Translated from the French by Joseph Christy.