
Accelerated Algorithms for Smooth Convex-Concave Minimax Problems
with $\mathcal{O}(1/k^{2})$ Rate on Squared Gradient Norm

TaeHo Yoon    Ernest K. Ryu
Abstract

In this work, we study the computational complexity of reducing the squared gradient magnitude for smooth minimax optimization problems. First, we present algorithms with accelerated $\mathcal{O}(1/k^{2})$ last-iterate rates, faster than the existing $\mathcal{O}(1/k)$ or slower rates for extragradient, Popov, and gradient descent with anchoring. The acceleration mechanism combines extragradient steps with anchoring and is distinct from Nesterov's acceleration. We then establish optimality of the $\mathcal{O}(1/k^{2})$ rate through a matching lower bound.

Convex optimization, Minimax optimization, Convex-Concave problems, Acceleration, Generative models, GANs

1 Introduction

Minimax optimization problems, or minimax games, of the form

\displaystyle\underset{\mathbf{x}\in\mathbb{R}^{n}}{\mbox{minimize}}\,\,\underset{\mathbf{y}\in\mathbb{R}^{m}}{\mbox{maximize}}\,\,\mathbf{L}(\mathbf{x},\mathbf{y})   (1)

have recently gained significant interest in the optimization and machine learning communities due to their application in adversarial training (Goodfellow et al., 2015; Madry et al., 2018) and generative adversarial networks (GANs) (Goodfellow et al., 2014).

Prior works on minimax optimization often consider compact domains $X,Y$ for $\mathbf{x},\mathbf{y}$ and use the duality gap

\displaystyle\mathrm{Err}_{\mathrm{gap}}(\mathbf{x},\mathbf{y}):=\sup_{\tilde{\mathbf{y}}\in Y}\,\mathbf{L}(\mathbf{x},\tilde{\mathbf{y}})-\inf_{\tilde{\mathbf{x}}\in X}\,\mathbf{L}(\tilde{\mathbf{x}},\mathbf{y})

to quantify suboptimality of algorithms’ iterates in solving (1). However, while it is a natural analog of minimization error for minimax problems, the duality gap can be difficult to measure directly in practice, and it is unclear how to generalize the notion to non-convex-concave problems.

In contrast, the squared gradient magnitude $\|\nabla\mathbf{L}(\mathbf{x},\mathbf{y})\|^{2}$, when $\mathbf{L}$ is differentiable, is a more directly observable value for quantifying suboptimality. Moreover, the notion is meaningful for differentiable non-convex-concave minimax games. Interestingly, very few prior works have analyzed convergence rates on the gradient norm for minimax problems, and the optimal convergence rate or corresponding algorithms were hitherto unknown.

Contributions.

In this work, we introduce the extra anchored gradient (EAG) algorithms for smooth convex-concave minimax problems and establish an accelerated $\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\leq\mathcal{O}(R^{2}/k^{2})$ rate, where $R$ is the Lipschitz constant of $\nabla\mathbf{L}$. The rate improves upon the $\mathcal{O}(R^{2}/k)$ rates of prior algorithms and is the first $\mathcal{O}(R^{2}/k^{2})$ rate in this setup. We then provide a matching $\Omega(R^{2}/k^{2})$ complexity lower bound for gradient-based algorithms and thereby establish optimality of EAG.

Beyond establishing the optimal complexity, our results provide the following observations. First, different suboptimality measures lead to materially different acceleration mechanisms, since reducing the duality gap is done optimally by the extragradient algorithm (Nemirovski, 2004; Nemirovsky, 1992). Also, since our optimal accelerated convergence rate is on the non-ergodic last iterate, neither averaging nor keeping track of the best iterate is necessary for optimally reducing the gradient magnitude in the deterministic setup.

1.1 Preliminaries and notation

We say a saddle function $\mathbf{L}\colon\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$ is convex-concave if $\mathbf{L}(\mathbf{x},\mathbf{y})$ is convex in $\mathbf{x}\in\mathbb{R}^{n}$ for all fixed $\mathbf{y}\in\mathbb{R}^{m}$ and $\mathbf{L}(\mathbf{x},\mathbf{y})$ is concave in $\mathbf{y}\in\mathbb{R}^{m}$ for all fixed $\mathbf{x}\in\mathbb{R}^{n}$. We say $(\mathbf{x}^{\star},\mathbf{y}^{\star})$ is a saddle point of $\mathbf{L}$ if $\mathbf{L}(\mathbf{x}^{\star},\mathbf{y})\leq\mathbf{L}(\mathbf{x}^{\star},\mathbf{y}^{\star})\leq\mathbf{L}(\mathbf{x},\mathbf{y}^{\star})$ for all $\mathbf{x}\in\mathbb{R}^{n}$ and $\mathbf{y}\in\mathbb{R}^{m}$. Solutions to the minimax problem (1) are defined to be saddle points of $\mathbf{L}$. For notational conciseness, write $\mathbf{z}=(\mathbf{x},\mathbf{y})$. When $\mathbf{L}$ is differentiable, define the saddle operator of $\mathbf{L}$ at $\mathbf{z}=(\mathbf{x},\mathbf{y})$ by

\displaystyle\mathbf{G}_{\mathbf{L}}(\mathbf{z})=\begin{bmatrix}\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x},\mathbf{y})\\ -\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x},\mathbf{y})\end{bmatrix}.   (2)

(When clear from the context, we drop the subscript $\mathbf{L}$.) The saddle operator is monotone (Rockafellar, 1970), i.e., $\langle\mathbf{G}(\mathbf{z}_{1})-\mathbf{G}(\mathbf{z}_{2}),\mathbf{z}_{1}-\mathbf{z}_{2}\rangle\geq 0$ for all $\mathbf{z}_{1},\mathbf{z}_{2}\in\mathbb{R}^{n}\times\mathbb{R}^{m}$. We say $\mathbf{L}$ is $R$-smooth if $\mathbf{G}_{\mathbf{L}}$ is $R$-Lipschitz continuous. Note that $\nabla\mathbf{L}\neq\mathbf{G}_{\mathbf{L}}$ due to the sign change in the $\mathbf{y}$-gradient, but $\|\nabla\mathbf{L}\|=\|\mathbf{G}_{\mathbf{L}}\|$, and we use the two forms interchangeably. Because $\mathbf{z}^{\star}=(\mathbf{x}^{\star},\mathbf{y}^{\star})$ is a saddle point of $\mathbf{L}$ if and only if $0=\mathbf{G}_{\mathbf{L}}(\mathbf{z}^{\star})$, the squared gradient magnitude is a natural measure of suboptimality at a given point for smooth convex-concave problems.
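As a concrete illustration (ours, not from the paper), consider the convex-concave quadratic $\mathbf{L}(x,y)=\tfrac{1}{2}x^{2}+xy-\tfrac{1}{2}y^{2}$, whose saddle operator is $\mathbf{G}(x,y)=(x+y,\,y-x)$; the short sketch below numerically checks the monotonicity inequality at random pairs of points.

```python
import numpy as np

# Saddle operator of L(x, y) = x^2/2 + x*y - y^2/2:
#   grad_x L = x + y,  grad_y L = x - y,  so G(x, y) = (x + y, -(x - y)).
G = lambda z: np.array([z[0] + z[1], z[1] - z[0]])

rng = np.random.default_rng(0)
for _ in range(5):
    z1, z2 = rng.standard_normal(2), rng.standard_normal(2)
    # Monotonicity: <G(z1) - G(z2), z1 - z2> >= 0 (here it equals ||z1 - z2||^2).
    print(np.dot(G(z1) - G(z2), z1 - z2) >= 0)
```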

1.2 Prior work

Extragradient-type algorithms.

The first main component of our proposed algorithm is the extragradient (EG) algorithm of Korpelevich (1977). EG and its variants, including the algorithm of Popov (1980), have been studied in the context of saddle point and variational inequality problems and have appeared in the mathematical programming literature (Solodov & Svaiter, 1999; Tseng, 2000; Noor, 2003; Censor et al., 2011; Lyashko et al., 2011; Malitsky & Semenov, 2014; Malitsky, 2015, 2020). More recently in the machine learning literature, similar ideas such as optimism (Chiang et al., 2012; Rakhlin & Sridharan, 2013a), prediction (Yadav et al., 2018), and negative momentum (Gidel et al., 2019; Zhang et al., 2020) have been presented and used in the context of multi-player games (Daskalakis et al., 2011; Rakhlin & Sridharan, 2013b; Syrgkanis et al., 2015; Antonakopoulos et al., 2021) and GANs (Gidel et al., 2018; Mertikopoulos et al., 2019; Liang & Stokes, 2019; Peng et al., 2020).

$\boldsymbol{\mathcal{O}(R/k)}$ rates on duality gap.

For minimax problems with an $R$-smooth $\mathbf{L}$ and bounded domains for $\mathbf{x}$ and $\mathbf{y}$, Nemirovski (2004) presented the mirror-prox algorithm generalizing EG and established ergodic $\mathcal{O}(R/k)$ convergence rates on $\mathrm{Err}_{\mathrm{gap}}$. Nesterov (2007) and Monteiro & Svaiter (2010, 2011) extended the $\mathcal{O}(R/k)$ complexity analysis to the case of unbounded domains. Mokhtari et al. (2020b) showed that optimistic descent converges at an $\mathcal{O}(R/k)$ rate with respect to $\mathrm{Err}_{\mathrm{gap}}$. Since there exists an $\Omega(R/k)$ complexity lower bound on $\mathrm{Err}_{\mathrm{gap}}$ for black-box gradient-based minimax optimization algorithms (Nemirovsky, 1992; Nemirovski, 2004), these algorithms are order-optimal in terms of the duality gap.

Convergence rates on squared gradient norm.

Using standard arguments (e.g. (Solodov & Svaiter, 1999, Lemma 2.3)), one can show the $\min_{i=0,\dots,k}\|\mathbf{G}(\mathbf{z}^{i})\|^{2}\leq\mathcal{O}(R^{2}/k)$ convergence rate of EG, provided that $\mathbf{L}$ is $R$-smooth. Ryu et al. (2019) showed that optimistic descent algorithms also attain $\mathcal{O}(R^{2}/k)$ convergence in terms of the best iterate, proposed simultaneous gradient descent with anchoring, which pulls iterates toward the initial point $\mathbf{z}^{0}$, and established $\mathcal{O}(R^{2}/k^{2-2p})$ convergence rates in terms of the squared gradient norm of the last iterate (where $p>\frac{1}{2}$ is an algorithm parameter; see Section A). Notably, anchoring resembles the Halpern iteration (Halpern, 1967; Lieder, 2020), which was used in Diakonikolas (2020) to develop a regularization-based algorithm with near-optimal (optimal up to logarithmic factors) complexity with respect to the gradient norm of the last iterate. Anchoring turns out to be the second main component of the acceleration; combining EG steps with anchoring, we obtain the optimal last-iterate convergence rate of $\mathcal{O}(R^{2}/k^{2})$.

Structured minimax problems.

For structured minimax problems of the form

\displaystyle\mathbf{L}(\mathbf{x},\mathbf{y})=f(\mathbf{x})+\langle\mathbf{A}\mathbf{x},\mathbf{y}\rangle-g(\mathbf{y}),

where $f,g$ are convex and $\mathbf{A}$ is a linear operator, primal-dual splitting algorithms (Chambolle & Pock, 2011; Condat, 2013; Vũ, 2013; Yan, 2018; Ryu & Yin, 2021) and Nesterov's smoothing technique (Nesterov, 2005a, b) have also been extensively studied (Chen et al., 2014; He & Monteiro, 2016). Notably, when $g$ is of "simple" form, Nesterov's smoothing framework achieves an accelerated rate $\mathcal{O}\left(\frac{\|\mathbf{A}\|}{k}+\frac{L_{f}}{k^{2}}\right)$ on the duality gap. Additionally, Chambolle & Pock (2016) have shown that splitting algorithms can achieve $\mathcal{O}(1/k^{2})$ or linear convergence rates under appropriate strong convexity and smoothness assumptions on $f$ and $g$, although they rely on proximal operations. Kolossoski & Monteiro (2017); Hamedani & Aybat (2018); Zhao (2019); Alkousa et al. (2020) generalized these accelerated algorithms to the setting where the coupling term $\langle\mathbf{A}\mathbf{x},\mathbf{y}\rangle$ is replaced by a non-bilinear convex-concave function $\Phi(\mathbf{x},\mathbf{y})$.

Complexity lower bounds.

Ouyang & Xu (2021) presented an $\Omega\left(\frac{\|\mathbf{A}\|}{k}+\frac{L_{f}}{k^{2}}\right)$ complexity lower bound on the duality gap for gradient-based algorithms solving bilinear minimax problems with proximable $g$, establishing optimality of Nesterov's smoothing. Zhang et al. (2019) presented lower bounds for strongly-convex-strongly-concave problems. Golowich et al. (2020) proved that within the narrower class of $1$-SCLI algorithms, which includes EG but not EAG, the squared gradient norm of the last iterate cannot be reduced beyond $\mathcal{O}(R^{2}/k)$ in $R$-smooth minimax problems. These approaches are aligned with the information-based complexity analysis introduced in (Nemirovsky & Yudin, 1983) and thoroughly studied in (Nemirovsky, 1991, 1992) for the special case of linear equations.

Other problem setups.

Nesterov (2009) and Nedić & Ozdaglar (2009) proposed subgradient algorithms for non-smooth minimax problems. Stochastic minimax and variational inequality problems were studied in (Nemirovski et al., 2009; Juditsky et al., 2011; Lan, 2012; Ghadimi & Lan, 2012, 2013; Chen et al., 2014, 2017; Hsieh et al., 2019). Strongly monotone variational inequality problems or strongly-convex-strongly-concave minimax problems were studied in (Tseng, 1995; Nesterov & Scrimali, 2011; Gidel et al., 2018; Mokhtari et al., 2020a; Lin et al., 2020b; Wang & Li, 2020; Zhang et al., 2020; Azizian et al., 2020). Recently, minimax problems with objectives that are either strongly convex or nonconvex in one variable were studied in (Rafique et al., 2018; Thekumparampil et al., 2019; Jin et al., 2019; Nouiehed et al., 2019; Ostrovskii et al., 2020; Lin et al., 2020a, b; Lu et al., 2020; Wang & Li, 2020; Yang et al., 2020; Chen et al., 2021). Minimax optimization of composite objectives with smooth and nonsmooth-but-proximable convex-concave functions were studied in (Tseng, 2000; Csetnek et al., 2019; Malitsky & Tam, 2020; Bùi & Combettes, 2021).

2 Accelerated algorithms: Extra anchored gradient

We now present two accelerated EAG algorithms that are qualitatively very similar but differ in the choice of step-sizes. The two algorithms present a tradeoff between the simplicity of the step-size and the simplicity of the convergence proof; one algorithm has a varying step-size but a simpler convergence proof, while the other algorithm has a simpler constant step-size but has a more complicated proof.

2.1 Description of the algorithms

The proposed extra anchored gradient (EAG) algorithms have the following general form:

\displaystyle\mathbf{z}^{k+1/2} =\mathbf{z}^{k}+\beta_{k}(\mathbf{z}^{0}-\mathbf{z}^{k})-\alpha_{k}\mathbf{G}(\mathbf{z}^{k})   (3)
\displaystyle\mathbf{z}^{k+1} =\mathbf{z}^{k}+\beta_{k}(\mathbf{z}^{0}-\mathbf{z}^{k})-\alpha_{k}\mathbf{G}(\mathbf{z}^{k+1/2})

for $k\geq 0$, where $\mathbf{z}^{0}\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ is the starting point. We use $\mathbf{G}$ defined in (2) rather than describing the $\mathbf{x}$- and $\mathbf{y}$-updates separately to keep the notation concise. We call $\alpha_{k}>0$ step-sizes and $\beta_{k}\in[0,1)$ anchoring coefficients. Note that when $\beta_{k}=0$, EAG coincides with the unconstrained extragradient algorithm.

The simplest choice of $\{\alpha_{k}\}_{k\geq 0}$ is the constant one. Together with the choice $\beta_{k}=\frac{1}{k+2}$ (which we clarify later), we get the following simpler algorithm.

EAG with constant step-size (EAG-C)

\displaystyle\mathbf{z}^{k+1/2} =\mathbf{z}^{k}+\frac{1}{k+2}(\mathbf{z}^{0}-\mathbf{z}^{k})-\alpha\mathbf{G}(\mathbf{z}^{k})
\displaystyle\mathbf{z}^{k+1} =\mathbf{z}^{k}+\frac{1}{k+2}(\mathbf{z}^{0}-\mathbf{z}^{k})-\alpha\mathbf{G}(\mathbf{z}^{k+1/2})

where $\alpha>0$ is fixed.
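To make the updates concrete, here is a minimal Python sketch (ours, for illustration) of EAG-C for a generic saddle operator $\mathbf{G}$; the bilinear toy problem $\mathbf{L}(x,y)=xy$, for which $R=1$, is used as input.

```python
import numpy as np

def eag_c(G, z0, alpha, num_iters):
    """EAG with constant step-size: follows the two-step update above."""
    z0 = np.asarray(z0, dtype=float)
    z = z0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)                          # anchoring coefficient
        z_half = z + beta * (z0 - z) - alpha * G(z)   # extrapolation step
        z = z + beta * (z0 - z) - alpha * G(z_half)   # update step
    return z

# Toy problem L(x, y) = x * y with saddle operator G(x, y) = (y, -x) and R = 1.
G = lambda z: np.array([z[1], -z[0]])
z = eag_c(G, z0=[1.0, 1.0], alpha=1 / 8, num_iters=1000)
print(np.linalg.norm(G(z)) ** 2)   # squared gradient norm; decays like O(1/k^2)
```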

Theorem 1.

Assume $\mathbf{L}\colon\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$ is an $R$-smooth convex-concave function with a saddle point $\mathbf{z}^{\star}$. Assume $\alpha>0$ satisfies

\displaystyle 1-3\alpha R-\alpha^{2}R^{2}-\alpha^{3}R^{3}\geq 0   (4)
\displaystyle 1-8\alpha R+\alpha^{2}R^{2}-2\alpha^{3}R^{3}\geq 0.

Then EAG-C converges with rate

\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\leq\frac{4(1+\alpha R+\alpha^{2}R^{2})}{\alpha^{2}(1+\alpha R)}\frac{\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(k+1)^{2}}

for $k\geq 0$.

Corollary 1.

In the setup of Theorem 1, $\alpha\in\left(0,\frac{1}{8R}\right]$ satisfies (4), and the particular choice $\alpha=\frac{1}{8R}$ yields

\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\leq\frac{260R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(k+1)^{2}}

for $k\geq 0$.

While EAG-C is simple in its form, its convergence proof (presented in the appendix) is complicated. Furthermore, the constant $260$ in Corollary 1 seems large and raises the question of whether it could be reduced. These issues, to some extent, are addressed by the following alternative version of EAG.

EAG with varying step-size (EAG-V)

\displaystyle\mathbf{z}^{k+1/2} =\mathbf{z}^{k}+\frac{1}{k+2}(\mathbf{z}^{0}-\mathbf{z}^{k})-\alpha_{k}\mathbf{G}(\mathbf{z}^{k})
\displaystyle\mathbf{z}^{k+1} =\mathbf{z}^{k}+\frac{1}{k+2}(\mathbf{z}^{0}-\mathbf{z}^{k})-\alpha_{k}\mathbf{G}(\mathbf{z}^{k+1/2}),

where $\alpha_{0}\in\left(0,\frac{1}{R}\right)$ and

\displaystyle\alpha_{k+1} =\frac{\alpha_{k}}{1-\alpha_{k}^{2}R^{2}}\left(1-\frac{(k+2)^{2}}{(k+1)(k+3)}\alpha_{k}^{2}R^{2}\right)
=\alpha_{k}\left(1-\frac{1}{(k+1)(k+3)}\frac{\alpha_{k}^{2}R^{2}}{1-\alpha_{k}^{2}R^{2}}\right)   (5)

for $k\geq 0$.

As the recurrence relation (5) may seem unfamiliar, we provide the following lemma describing the behavior of the resulting sequence.

Lemma 1.

If $\alpha_{0}\in\left(0,\frac{3}{4R}\right)$, then the sequence $\{\alpha_{k}\}_{k\geq 0}$ of (5) monotonically decreases to a positive limit. In particular, when $\alpha_{0}=\frac{0.618}{R}$, we have $\lim_{k\rightarrow\infty}\alpha_{k}\approx\frac{0.437}{R}$.

We now state the convergence results for EAG-V.

Theorem 2.

Assume $\mathbf{L}\colon\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$ is an $R$-smooth convex-concave function with a saddle point $\mathbf{z}^{\star}$. Assume $\alpha_{0}\in\left(0,\frac{3}{4R}\right)$, and define $\alpha_{\infty}=\lim_{k\to\infty}\alpha_{k}$. Then EAG-V converges with rate

\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\leq\frac{4\left(1+\alpha_{0}\alpha_{\infty}R^{2}\right)}{\alpha_{\infty}^{2}}\frac{\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(k+1)(k+2)}

for $k\geq 0$.

Corollary 2.

EAG-V with $\alpha_{0}=\frac{0.618}{R}$ satisfies

\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\leq\frac{27R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(k+1)(k+2)}

for $k\geq 0$.
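As a quick numerical sanity check (ours, not part of the paper), the sketch below iterates the recurrence (5) with $R=1$ and $\alpha_{0}=0.618/R$, reporting the limit of Lemma 1 together with the constant $\frac{4}{\alpha_{\infty}^{2}}(1+\alpha_{0}\alpha_{\infty}R^{2})$ of Theorem 2, which evaluates to roughly $27$ as in Corollary 2.

```python
R = 1.0
alpha0 = 0.618 / R
alpha = alpha0
for k in range(100000):
    # recurrence (5): alpha_{k+1} = alpha_k * (1 - alpha_k^2 R^2 / ((k+1)(k+3)(1 - alpha_k^2 R^2)))
    alpha *= 1 - alpha**2 * R**2 / ((k + 1) * (k + 3) * (1 - alpha**2 * R**2))

print(alpha)                                          # approx 0.437 / R (Lemma 1)
print(4 * (1 + alpha0 * alpha * R**2) / alpha**2)     # approx 27 (Theorem 2 / Corollary 2)
```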

2.2 Proof outline

We now outline the convergence analysis for EAG-V, whose proof is simpler than that of EAG-C. The key ingredient of the proof is a Lyapunov analysis with a nonincreasing Lyapunov function, the $V_{k}$ of the following lemma.

Lemma 2.

Let $\{\beta_{k}\}_{k\geq 0}\subseteq(0,1)$ and $\alpha_{0}\in\left(0,\frac{1}{R}\right)$ be given. Define the sequences $\{A_{k}\}_{k\geq 0}$, $\{B_{k}\}_{k\geq 0}$, and $\{\alpha_{k}\}_{k\geq 0}$ by the recurrence relations

\displaystyle A_{k}=\frac{\alpha_{k}}{2\beta_{k}}B_{k}   (6)
\displaystyle B_{k+1}=\frac{B_{k}}{1-\beta_{k}}   (7)
\displaystyle\alpha_{k+1}=\frac{\alpha_{k}\beta_{k+1}(1-\alpha_{k}^{2}R^{2}-\beta_{k}^{2})}{\beta_{k}(1-\beta_{k})(1-\alpha_{k}^{2}R^{2})}   (8)

for $k\geq 0$, where $B_{0}=1$. Suppose that $\alpha_{k}\in\left(0,\frac{1}{R}\right)$ holds for all $k\geq 0$. Assume $\mathbf{L}$ is $R$-smooth and convex-concave. Then the sequence $\{V_{k}\}_{k\geq 0}$ defined as

\displaystyle V_{k}:=A_{k}\|\mathbf{G}(\mathbf{z}^{k})\|^{2}+B_{k}\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{z}^{k}-\mathbf{z}^{0}\rangle   (9)

for EAG iterations in (3) is nonincreasing.

In Lemma 2, the choice of $\beta_{k}=\frac{1}{k+2}$ leads to $B_{k}=k+1$, $A_{k}=\frac{\alpha_{k}(k+1)(k+2)}{2}$, and (5). Why the Lyapunov function of Lemma 2 leads to the convergence guarantee of Theorem 2 may not be immediately obvious. The following proof provides the analysis.

Proof of Theorem 2.

Let $\beta_{k}=\frac{1}{k+2}$ as specified by the definition of EAG-V. By Lemma 2, the quantity $V_{k}$ defined by (9) is nonincreasing in $k$. Therefore,

\displaystyle V_{k}\leq\cdots\leq V_{0}=\alpha_{0}\|\mathbf{G}(\mathbf{z}^{0})\|^{2}\leq\alpha_{0}R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}.

Next, we have

\displaystyle V_{k} =A_{k}\|\mathbf{G}(\mathbf{z}^{k})\|^{2}+B_{k}\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{z}^{k}-\mathbf{z}^{0}\rangle
\stackrel{\textnormal{(a)}}{\geq}A_{k}\|\mathbf{G}(\mathbf{z}^{k})\|^{2}+B_{k}\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{z}^{\star}-\mathbf{z}^{0}\rangle
\stackrel{\textnormal{(b)}}{\geq}A_{k}\|\mathbf{G}(\mathbf{z}^{k})\|^{2}-\frac{A_{k}}{2}\|\mathbf{G}(\mathbf{z}^{k})\|^{2}-\frac{B_{k}^{2}}{2A_{k}}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}
\stackrel{\textnormal{(c)}}{=}\frac{\alpha_{k}}{4}(k+1)(k+2)\|\mathbf{G}(\mathbf{z}^{k})\|^{2}-\frac{k+1}{\alpha_{k}(k+2)}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}
\stackrel{\textnormal{(d)}}{\geq}\frac{\alpha_{\infty}}{4}(k+1)(k+2)\|\mathbf{G}(\mathbf{z}^{k})\|^{2}-\frac{1}{\alpha_{\infty}}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2},

where (a) follows from the monotonicity inequality $\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{z}^{k}-\mathbf{z}^{\star}\rangle\geq 0$, (b) follows from Young's inequality, (c) follows from plugging in $A_{k}=\frac{\alpha_{k}(k+1)(k+2)}{2}$ and $B_{k}=k+1$, and (d) follows from Lemma 1 ($\alpha_{k}\downarrow\alpha_{\infty}$). Reorganize to get

\displaystyle\frac{\alpha_{\infty}}{4}(k+1)(k+2)\|\mathbf{G}(\mathbf{z}^{k})\|^{2}\leq V_{k}+\frac{1}{\alpha_{\infty}}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}\leq\left(\alpha_{0}R^{2}+\frac{1}{\alpha_{\infty}}\right)\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2},

and divide both sides by $\frac{\alpha_{\infty}}{4}(k+1)(k+2)$. ∎
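The following sketch (an illustration of ours with a random test problem, not from the paper) checks Lemma 2 numerically: it runs EAG-V on a random bilinear saddle function and verifies that $V_{k}$ is nonincreasing along the iterates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
R = np.linalg.norm(A, 2)                              # Lipschitz constant of G
# Saddle operator of L(x, y) = <A x, y>: G(x, y) = (A^T y, -A x).
G = lambda z: np.concatenate([A.T @ z[n:], -A @ z[:n]])

z0 = rng.standard_normal(2 * n)
z, alpha = z0.copy(), 0.618 / R
V_prev = np.inf
for k in range(200):
    A_k = alpha * (k + 1) * (k + 2) / 2               # A_k for beta_k = 1/(k+2)
    B_k = k + 1.0
    g = G(z)
    V_k = A_k * np.dot(g, g) + B_k * np.dot(g, z - z0)   # Lyapunov function (9)
    assert V_k <= V_prev + 1e-9                       # nonincreasing, per Lemma 2
    V_prev = V_k
    beta = 1.0 / (k + 2)                              # EAG-V step
    z_half = z + beta * (z0 - z) - alpha * g
    z = z + beta * (z0 - z) - alpha * G(z_half)
    alpha *= 1 - alpha**2 * R**2 / ((k + 1) * (k + 3) * (1 - alpha**2 * R**2))
print("V_k nonincreasing for", k + 1, "iterations")
```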

2.3 Discussion of further generalizations

The algorithms and results of Sections 2.1 and 2.2 remain valid when we replace $\mathbf{G}$ with an $R$-Lipschitz continuous monotone operator; neither the definition of the EAG algorithms nor any part of the proofs of Theorems 1 and 2 utilizes properties of saddle functions beyond the monotonicity of their subdifferentials.

For EAG-C, the step-size conditions (4) in Theorem 1 can be relaxed to accommodate larger values of $\alpha$. However, we do not pursue such generalizations to keep the already complicated and arduous analysis of EAG-C manageable. Also, larger step-sizes are more naturally allowed in EAG-V and Theorem 2. Finally, although (4) holds for values of $\alpha$ up to $\frac{0.1265}{R}$, we present a slightly smaller range $\left(0,\frac{1}{8R}\right]$ in Corollary 1 for simplicity.

For EAG-V, the choice $\beta_{k}=\frac{1}{k+2}$ was obtained by roughly, but not fully, optimizing the bound on EAG-V originating from Lemma 2. If one chooses $\beta_{k}=\frac{1}{k+\delta}$ with $\delta>1$, then (6) and (7) become

\displaystyle A_{k}=\frac{\alpha_{k}(k+\delta)(k+\delta-1)}{2(\delta-1)},\quad B_{k}=\frac{k+\delta-1}{\delta-1}.

As the proof of Theorem 2 illustrates, linear growth of $B_{k}$ and quadratic growth of $A_{k}$ lead to $\mathcal{O}(1/k^{2})$ convergence of $\|\mathbf{G}(\mathbf{z}^{k})\|^{2}$. The value $\alpha_{0}=\frac{0.618}{R}$ in Lemma 1 and Corollary 2 was obtained by numerically minimizing the constant $\frac{4}{\alpha_{\infty}^{2}}\left(1+\alpha_{0}\alpha_{\infty}R^{2}\right)$ of Theorem 2 in the case $\delta=2$. The choice $\delta=2$, however, is not optimal. Indeed, the constant $27$ of Corollary 2 can be reduced to $24.44$ with $(\delta^{\star},\alpha_{0}^{\star})\approx(2.697,0.690/R)$, obtained by numerically optimizing over $\delta$ and $\alpha_{0}$. Finally, there is a possibility that a choice of $\beta_{k}$ not of the form $\beta_{k}=\frac{1}{k+\delta}$ leads to an improved constant.

In the end, we choose to present EAG-C and EAG-V with the simple choice $\beta_{k}=\frac{1}{k+2}$. As we establish in Section 3, the EAG algorithms are optimal up to a constant.

3 Optimality of EAG via a matching complexity lower bound

Upon seeing an accelerated algorithm, it is natural to ask whether the algorithm is optimal. In this section, we present an $\Omega(R^{2}/k^{2})$ complexity lower bound for the class of deterministic gradient-based algorithms for smooth convex-concave minimax problems. This result establishes that EAG is indeed optimal.

For the class of smooth minimax optimization problems, a deterministic algorithm $\mathcal{A}$ produces iterates $(\mathbf{x}^{k},\mathbf{y}^{k})=\mathbf{z}^{k}$ for $k\geq 1$ given a starting point $(\mathbf{x}^{0},\mathbf{y}^{0})=\mathbf{z}^{0}$ and a saddle function $\mathbf{L}$, and we write $\mathbf{z}^{k}=\mathcal{A}(\mathbf{z}^{0},\dots,\mathbf{z}^{k-1};\mathbf{L})$ for $k\geq 1$. Define $\mathfrak{A}_{\textrm{sim}}$ as the class of algorithms satisfying

\mathbf{z}^{k}\in\mathbf{z}^{0}+\mathrm{span}\{\mathbf{G}_{\mathbf{L}}(\mathbf{z}^{0}),\dots,\mathbf{G}_{\mathbf{L}}(\mathbf{z}^{k-1})\},   (10)

and $\mathfrak{A}_{\textrm{sep}}$ as the class of algorithms satisfying

\displaystyle\mathbf{x}^{k}\in\mathbf{x}^{0}+\mathrm{span}\left\{\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x}^{0},\mathbf{y}^{0}),\dots,\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x}^{k-1},\mathbf{y}^{k-1})\right\}
\displaystyle\mathbf{y}^{k}\in\mathbf{y}^{0}+\mathrm{span}\left\{\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x}^{0},\mathbf{y}^{0}),\dots,\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x}^{k-1},\mathbf{y}^{k-1})\right\}.   (11)

To clarify, algorithms in $\mathfrak{A}_{\textrm{sim}}$ access and utilize the $\mathbf{x}$- and $\mathbf{y}$-subgradients simultaneously. So $\mathfrak{A}_{\textrm{sim}}$ contains simultaneous gradient descent, extragradient, Popov, and EAG (if we also count intermediate sequences $\mathbf{z}^{k+1/2}$ as algorithms' iterates). On the other hand, algorithms in $\mathfrak{A}_{\textrm{sep}}$ can access and utilize the $\mathbf{x}$- and $\mathbf{y}$-subgradients separately. So $\mathfrak{A}_{\textrm{sim}}\subset\mathfrak{A}_{\textrm{sep}}$, and alternating gradient descent-ascent belongs to $\mathfrak{A}_{\textrm{sep}}$ but not to $\mathfrak{A}_{\textrm{sim}}$.

In this section, we present a complexity lower bound that applies to all algorithms in $\mathfrak{A}_{\textrm{sep}}$, not just the algorithms in $\mathfrak{A}_{\textrm{sim}}$. Although EAG-C and EAG-V are in $\mathfrak{A}_{\textrm{sim}}$, we consider the broader class $\mathfrak{A}_{\textrm{sep}}$ to rule out the possibility that separately updating the $\mathbf{x}$- and $\mathbf{y}$-variables provides an improvement beyond a constant factor.

We say $\mathbf{L}(\mathbf{x},\mathbf{y})$ is biaffine if it is an affine function of $\mathbf{x}$ for any fixed $\mathbf{y}$ and an affine function of $\mathbf{y}$ for any fixed $\mathbf{x}$. Biaffine functions are, of course, convex-concave. We first establish a complexity lower bound on minimax optimization problems with biaffine loss functions.

Theorem 3.

Let $k\geq 0$ be fixed. For any $n\geq k+2$, there exists an $R$-smooth biaffine function $\mathbf{L}$ on $\mathbb{R}^{n}\times\mathbb{R}^{n}$ for which

\displaystyle\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\geq\frac{R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}}   (12)

holds for any algorithm in $\mathfrak{A}_{\textrm{sep}}$, where $\lfloor\cdot\rfloor$ is the floor function and $\mathbf{z}^{\star}$ is the saddle point of $\mathbf{L}$ closest to $\mathbf{z}^{0}$. Moreover, this lower bound is optimal in the sense that it cannot be improved with biaffine functions.

Since smooth biaffine functions are special cases of smooth convex-concave functions, Theorem 3 implies the optimality of EAG applied to smooth convex-concave minimax optimization problems.

Corollary 3.

For $R$-smooth convex-concave minimax problems, an algorithm in $\mathfrak{A}_{\textrm{sep}}$ cannot attain a worst-case convergence rate better than

\frac{R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}}

with respect to $\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}$. Since EAG-C and EAG-V have rates $\mathcal{O}(R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}/k^{2})$, they are optimal, up to a constant factor, in $\mathfrak{A}_{\textrm{sep}}$.

3.1 Outline of the worst-case biaffine construction

Consider biaffine functions of the form

\displaystyle\mathbf{L}(\mathbf{x},\mathbf{y})=\langle\mathbf{A}\mathbf{x}-\mathbf{b},\mathbf{y}-\mathbf{c}\rangle,

where $\mathbf{A}\in\mathbb{R}^{n\times n}$ and $\mathbf{b},\mathbf{c}\in\mathbb{R}^{n}$. Then, $\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x},\mathbf{y})=\mathbf{A}^{\intercal}(\mathbf{y}-\mathbf{c})$, $\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x},\mathbf{y})=\mathbf{A}\mathbf{x}-\mathbf{b}$, $\mathbf{G}$ is $\|\mathbf{A}\|$-Lipschitz, and solutions to

\displaystyle\underset{\mathbf{x}\in\mathbb{R}^{n}}{\mbox{minimize}}\,\,\underset{\mathbf{y}\in\mathbb{R}^{n}}{\mbox{maximize}}\,\,\langle\mathbf{A}\mathbf{x}-\mathbf{b},\mathbf{y}-\mathbf{c}\rangle

are characterized by $\mathbf{A}\mathbf{x}-\mathbf{b}=0$ and $\mathbf{A}^{\intercal}(\mathbf{y}-\mathbf{c})=0$.

Through translation, we may assume without loss of generality that $\mathbf{x}^{0}=0$ and $\mathbf{y}^{0}=0$. In this case, (11) becomes

\displaystyle\mathbf{x}^{k}\in\mathrm{span}\{\mathbf{A}^{\intercal}\mathbf{c},\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})\mathbf{c},\dots,\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k-1}{2}\rfloor}\mathbf{c}\}+\mathrm{span}\{\mathbf{A}^{\intercal}\mathbf{b},\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})\mathbf{b},\dots,\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k}{2}\rfloor-1}\mathbf{b}\}
\displaystyle\mathbf{y}^{k}\in\mathrm{span}\{\mathbf{b},(\mathbf{A}\mathbf{A}^{\intercal})\mathbf{b},\dots,(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k-1}{2}\rfloor}\mathbf{b}\}+\mathrm{span}\{\mathbf{A}\mathbf{A}^{\intercal}\mathbf{c},\dots,(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k}{2}\rfloor}\mathbf{c}\}   (13)

for $k\geq 2$. (We detail these arguments in the appendix.) Furthermore, let $\mathbf{A}=\mathbf{A}^{\intercal}$ and $\mathbf{b}=\mathbf{A}^{\intercal}\mathbf{c}=\mathbf{A}\mathbf{c}$. Then the characterization of $\mathfrak{A}_{\textrm{sep}}$ further simplifies to

\displaystyle\mathbf{x}^{k},\mathbf{y}^{k}\in\mathcal{K}_{k-1}(\mathbf{A};\mathbf{b}):=\mathrm{span}\{\mathbf{b},\mathbf{A}\mathbf{b},\mathbf{A}^{2}\mathbf{b},\dots,\mathbf{A}^{k-1}\mathbf{b}\}.

Note that $\mathcal{K}_{k-1}(\mathbf{A};\mathbf{b})$ is the order-$(k-1)$ Krylov subspace.
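For intuition (our own illustration with a random symmetric $\mathbf{A}$, not the worst-case construction of Lemma 3 below), the following sketch computes the smallest residual $\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^{2}$ attainable over the Krylov subspace $\mathcal{K}_{k-1}(\mathbf{A};\mathbf{b})$, which is exactly the quantity the lower bound controls.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 6
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                                  # symmetric, as in the construction
b = A @ rng.standard_normal(n)                     # guarantees b is in the range of A

basis = [b]
for _ in range(k - 1):
    basis.append(A @ basis[-1])                    # b, A b, ..., A^{k-1} b
K = np.stack(basis, axis=1)                        # columns span K_{k-1}(A; b)

# Best residual over x in K_{k-1}(A; b): least squares in the coefficients.
coef, *_ = np.linalg.lstsq(A @ K, b, rcond=None)
print(np.linalg.norm(A @ (K @ coef) - b) ** 2)
```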

Consider the following lemma. Its proof, deferred to the appendix, combines arguments from Nemirovsky (1991, 1992).

Lemma 3.

Let $R>0$, $k\geq 0$, and $n\geq k+2$. Then there exist $\mathbf{A}=\mathbf{A}^{\intercal}\in\mathbb{R}^{n\times n}$ with $\|\mathbf{A}\|\leq R$ and $\mathbf{b}\in\mathcal{R}(\mathbf{A})$ satisfying

\displaystyle\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^{2}\geq\frac{R^{2}\|\mathbf{x}^{\star}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}}   (14)

for any $\mathbf{x}\in\mathcal{K}_{k-1}(\mathbf{A};\mathbf{b})$, where $\mathbf{x}^{\star}$ is the minimum norm solution to the equation $\mathbf{A}\mathbf{x}=\mathbf{b}$.

Take $\mathbf{A}$ and $\mathbf{b}$ as in Lemma 3 and set $\mathbf{c}=\mathbf{x}^{\star}$. Then $\mathbf{z}^{\star}=(\mathbf{x}^{\star},\mathbf{x}^{\star})$ is the saddle point of $\mathbf{L}(\mathbf{x},\mathbf{y})=\langle\mathbf{A}\mathbf{x}-\mathbf{b},\mathbf{y}-\mathbf{c}\rangle$ with minimum norm. Finally,

\displaystyle\|\nabla\mathbf{L}(\mathbf{x}^{k},\mathbf{y}^{k})\|^{2} =\|\mathbf{A}^{\intercal}(\mathbf{y}^{k}-\mathbf{c})\|^{2}+\|\mathbf{A}\mathbf{x}^{k}-\mathbf{b}\|^{2}
=\|\mathbf{A}\mathbf{y}^{k}-\mathbf{b}\|^{2}+\|\mathbf{A}\mathbf{x}^{k}-\mathbf{b}\|^{2}
\geq\frac{R^{2}\|\mathbf{x}^{\star}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}}+\frac{R^{2}\|\mathbf{x}^{\star}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}}
=\frac{R^{2}\|\mathbf{z}^{\star}-\mathbf{z}^{0}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}},

for any $\mathbf{x}^{k},\mathbf{y}^{k}\in\mathcal{K}_{k-1}(\mathbf{A};\mathbf{b})$. This completes the construction of the biaffine $\mathbf{L}$ of Theorem 3.

3.2 Optimal complexity lower bound

We now formalize the notion of complexity lower bounds. This formulation will allow us to precisely state and prove the second statement of Theorem 3 regarding the optimality of the lower bound.

Let $\mathcal{F}$ be a function class, $\mathcal{P}_{\mathcal{F}}=\{\mathcal{P}_{f}\}_{f\in\mathcal{F}}$ a class of optimization problems (with some common form), and $\mathcal{E}(\cdot;\mathcal{P}_{f})$ a suboptimality measure for the problem $\mathcal{P}_{f}$. Define the worst-case complexity of an algorithm $\mathcal{A}$ for $\mathcal{P}_{\mathcal{F}}$ at the $k$-th iteration, given the initial condition $\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|\leq D$, as

\mathcal{C}\left(\mathcal{A};\mathcal{P}_{\mathcal{F}},D,k\right):=\sup_{\begin{subarray}{c}\mathbf{z}^{0}\in B(\mathbf{z}^{\star};D)\\ f\in\mathcal{F}\end{subarray}}\mathcal{E}\left(\mathbf{z}^{k};\mathcal{P}_{f}\right),

where $\mathbf{z}^{j}=\mathcal{A}(\mathbf{z}^{0},\dots,\mathbf{z}^{j-1};f)$ for $j=1,\dots,k$ and $B(\mathbf{z};D)$ denotes the closed ball of radius $D$ centered at $\mathbf{z}$. The optimal complexity lower bound with respect to an algorithm class $\mathfrak{A}$ is

\displaystyle\mathcal{C}\left(\mathfrak{A};\mathcal{P}_{\mathcal{F}},D,k\right):=\inf_{\mathcal{A}\in\mathfrak{A}}\mathcal{C}\left(\mathcal{A};\mathcal{P}_{\mathcal{F}},D,k\right)=\inf_{\mathcal{A}\in\mathfrak{A}}\sup_{\begin{subarray}{c}\mathbf{z}^{0}\in B(\mathbf{z}^{\star};D)\\ f\in\mathcal{F}\end{subarray}}\mathcal{E}\left(\mathbf{z}^{k};\mathcal{P}_{f}\right).

A complexity lower bound is a lower bound on the optimal complexity lower bound.

Let $\mathcal{L}_{R}(\mathbb{R}^{n}\times\mathbb{R}^{m})$ be the class of $R$-smooth convex-concave functions on $\mathbb{R}^{n}\times\mathbb{R}^{m}$, $\mathcal{P}_{\mathbf{L}}$ the minimax problem (1), and $\mathcal{E}(\mathbf{z};\mathcal{P}_{\mathbf{L}})=\|\nabla\mathbf{L}(\mathbf{z})\|^{2}$. With this notation, the results of Section 2 can be expressed as

\displaystyle\mathcal{C}\left(\textrm{EAG};\mathcal{P}_{\mathcal{L}_{R}(\mathbb{R}^{n}\times\mathbb{R}^{m})},D,k\right)=\mathcal{O}\left(\frac{R^{2}D^{2}}{k^{2}}\right).

Let $\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{m})$ be the class of $R$-smooth biaffine functions on $\mathbb{R}^{n}\times\mathbb{R}^{m}$. Then the first statement of Theorem 3, the existence of $\mathbf{L}$, can be expressed as

\displaystyle\mathcal{C}\left(\mathfrak{A}_{\textrm{sep}};\mathcal{P}_{\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n})},D,k\right)\geq\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}}   (15)

for $n\geq k+2$.

As an aside, the argument of Corollary 3 can be expressed as: for any $\mathcal{A}\in\mathfrak{A}_{\textrm{sep}}$, we have

\displaystyle\mathcal{C}\left(\mathcal{A};\mathcal{P}_{\mathcal{L}_{R}(\mathbb{R}^{n}\times\mathbb{R}^{n})},D,k\right) \geq\mathcal{C}\left(\mathfrak{A}_{\textrm{sep}};\mathcal{P}_{\mathcal{L}_{R}(\mathbb{R}^{n}\times\mathbb{R}^{n})},D,k\right)
\geq\mathcal{C}\left(\mathfrak{A}_{\textrm{sep}};\mathcal{P}_{\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n})},D,k\right)
\geq\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}}.

The first inequality follows from $\mathcal{A}\in\mathfrak{A}_{\textrm{sep}}$, the second from $\mathcal{L}_{R}^{\textrm{biaff}}\subset\mathcal{L}_{R}$, and the third from Theorem 3.

Optimality of lower bound of Theorem 3.

Using the above notation, our goal is to prove that, for $n\geq k+2$,

\displaystyle\mathcal{C}\left(\mathfrak{A}_{\textrm{sep}};\mathcal{P}_{\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n})},D,k\right)=\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}}.   (16)

We establish this claim with the chain of inequalities:

\displaystyle\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}} \leq\mathcal{C}\left(\mathfrak{A}_{\textrm{sep}};\mathcal{P}_{\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n})},D,k\right)   (17)
\leq\mathcal{C}\left(\mathfrak{A}_{\textrm{sim}};\mathcal{P}_{\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n})},D,k\right)   (18)
\leq\mathcal{C}\left(\mathfrak{A}_{\textrm{lin}};\mathcal{P}^{2n,\textrm{skew}}_{R,D},k\right)   (19)
\leq\mathcal{C}\left(\mathfrak{A}_{\textrm{lin}};\mathcal{P}^{2n}_{R,D},k\right)   (20)
\leq\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}}.   (21)

Inequality (17) is what we established in Section 3.1. Inequality (18) follows from $\mathfrak{A}_{\textrm{sim}}\subset\mathfrak{A}_{\textrm{sep}}$ and the fact that the infimum over a larger class is smaller. Roughly speaking, the quantities in lines (19) and (20) are the complexity lower bounds for solving linear equations using only matrix-vector products, which were studied thoroughly in (Nemirovsky, 1991, 1992). We will show inequalities (19), (20), and (21) by establishing the connection between Nemirovsky's work and our setup of biaffine saddle problems. Once this is done, equality holds throughout and (16) is proved.

We first provide the definitions. Let $\mathcal{P}^{2n}_{R,D}$ be the collection of linear equations with $2n\times 2n$ matrices $\mathbf{B}$ satisfying $\|\mathbf{B}\|\leq R$ and $\mathbf{v}=\mathbf{B}\mathbf{z}^{\star}$ for some $\mathbf{z}^{\star}\in B(0;D)$. Let $\mathcal{P}^{2n,\textrm{skew}}_{R,D}\subset\mathcal{P}^{2n}_{R,D}$ be the subclass of equations with skew-symmetric $\mathbf{B}$. Let $\mathfrak{A}_{\textrm{lin}}$ be the class of iterative algorithms solving linear equations $\mathbf{B}\mathbf{z}=\mathbf{v}$ using only matrix multiplication by $\mathbf{B}$ and $\mathbf{B}^{\intercal}$, in the sense that

\displaystyle\mathbf{z}^{k}\in\mathrm{span}\{\mathbf{v}^{0},\dots,\mathbf{v}^{k}\},   (22)

where $\mathbf{v}^{0}=0$, $\mathbf{v}^{1}=\mathbf{v}$, and for $k\geq 2$,

\displaystyle\mathbf{v}^{k}=\mathbf{B}\mathbf{v}^{j}\text{ or }\mathbf{B}^{\intercal}\mathbf{v}^{j}\text{ for some }j=0,\dots,k-1.

The optimal complexity lower bound for a class of linear equation instances is defined as

\displaystyle\mathcal{C}\left(\mathfrak{A}_{\textrm{lin}};\mathcal{P}^{2n}_{R,D},k\right):=\inf_{\mathcal{A}\in\mathfrak{A}_{\textrm{lin}}}\sup_{\begin{subarray}{c}\|\mathbf{B}\|\leq R\\ \mathbf{v}=\mathbf{B}\mathbf{z}^{\star},\,\|\mathbf{z}^{\star}\|\leq D\end{subarray}}\left\|\mathbf{B}\mathbf{z}^{k}-\mathbf{v}\right\|^{2}.

Define $\mathcal{C}\left(\mathfrak{A}_{\textrm{lin}};\mathcal{P}^{2n,\textrm{skew}}_{R,D},k\right)$ analogously.

Now we relate the optimal complexity lower bounds for biaffine minimax problems to those for linear equations. For $\mathbf{L}(\mathbf{x},\mathbf{y})=\mathbf{b}^{\intercal}\mathbf{x}+\mathbf{x}^{\intercal}\mathbf{A}\mathbf{y}-\mathbf{c}^{\intercal}\mathbf{y}$, we have

\displaystyle\mathbf{G}_{\mathbf{L}}(\mathbf{x},\mathbf{y})=\begin{bmatrix}\mathbf{O}&\mathbf{A}\\ -\mathbf{A}^{\intercal}&\mathbf{O}\end{bmatrix}\begin{bmatrix}\mathbf{x}\\ \mathbf{y}\end{bmatrix}+\begin{bmatrix}\mathbf{b}\\ \mathbf{c}\end{bmatrix}.

Therefore, the minimax problem $\mathcal{P}_{\mathbf{L}}$ for $\mathbf{L}\in\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n})$ is equivalent to solving the linear equation $\mathbf{B}\mathbf{z}=\mathbf{v}$ with $\mathbf{B}=\begin{bmatrix}\mathbf{O}&-\mathbf{A}\\ \mathbf{A}^{\intercal}&\mathbf{O}\end{bmatrix}$ and $\mathbf{v}=\begin{bmatrix}\mathbf{b}\\ \mathbf{c}\end{bmatrix}\in\mathbb{R}^{2n}$, which belongs to $\mathcal{P}^{2n,\textrm{skew}}_{R,D}$ with $D=\|\mathbf{z}^{\star}\|$.
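A small sketch (ours, with arbitrary random data) of this identification: it builds $\mathbf{B}$ and $\mathbf{v}$ from $(\mathbf{A},\mathbf{b},\mathbf{c})$ and confirms that $\|\mathbf{B}\mathbf{z}-\mathbf{v}\|$ coincides with $\|\mathbf{G}_{\mathbf{L}}(\mathbf{z})\|$, so driving the saddle operator to zero is the same as solving the skew-symmetric linear system.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
b, c = rng.standard_normal(n), rng.standard_normal(n)

# B is skew-symmetric and v stacks (b, c), as in the identification above.
B = np.block([[np.zeros((n, n)), -A], [A.T, np.zeros((n, n))]])
v = np.concatenate([b, c])

z = rng.standard_normal(2 * n)
x, y = z[:n], z[n:]
G_L = np.concatenate([A @ y + b, -(A.T @ x - c)])   # saddle operator of L
print(np.allclose(np.linalg.norm(B @ z - v), np.linalg.norm(G_L)))   # True
print(np.allclose(B.T, -B))                                          # skew-symmetry
```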

For both algorithm classes $\mathfrak{A}_{\textrm{sim}}$ and $\mathfrak{A}_{\textrm{lin}}$, we may assume without loss of generality that $\mathbf{z}^{0}=0$ through translation. Then, the span condition (10) for $\mathfrak{A}_{\textrm{sim}}$ becomes

\displaystyle\mathbf{z}^{k}\in\mathcal{K}_{k-1}(\mathbf{B};\mathbf{v}).   (23)

Note that (22) reduces to (23) as $\mathbf{B}$ is skew-symmetric, so $\mathfrak{A}_{\textrm{sim}}$ and $\mathfrak{A}_{\textrm{lin}}$ are effectively the same class of algorithms under the identification $\mathcal{P}_{\mathcal{L}_{R}^{\textrm{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n})}\subset\mathcal{P}_{R,D}^{2n,\textrm{skew}}$.

Since the supremum over a larger class of problems is larger, inequality (19) holds. Similarly, inequality (20) follows from $\mathcal{P}^{2n,\textrm{skew}}_{R,D}\subset\mathcal{P}^{2n}_{R,D}$.

Figure 1: Plots of $\|\mathbf{G}(\mathbf{z}^{k})\|^{2}$ versus iteration count. Dashed lines indicate corresponding theoretical upper bounds. (a) Two-dimensional example $\mathbf{L}_{\delta,\epsilon}$ of (24). (b) Lagrangian of the linearly constrained QP of (25).

Finally, (21) follows from the following lemma, using arguments based on Chebyshev-type matrix polynomials from Nemirovsky (1992). Its proof is deferred to the appendix.

Lemma 4.

Let $R>0$ and $k\geq 0$. Then there exists $\mathcal{A}\in\mathfrak{A}_{\textnormal{lin}}$ such that for any $m\geq 1$, $\mathbf{B}\in\mathbb{R}^{m\times m}$, and $\mathbf{v}=\mathbf{B}\mathbf{z}^{\star}$ satisfying $\|\mathbf{B}\|\leq R$ and $\|\mathbf{z}^{\star}\|\leq D$, the $\mathbf{z}^{k}$-iterate produced by $\mathcal{A}$ satisfies

\displaystyle\left\|\mathbf{B}\mathbf{z}^{k}-\mathbf{v}\right\|^{2}\leq\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}}.

3.3 Broader algorithm classes via resisting oracles

In (10) and (11), we assumed that the subgradient queries are made within the span of the gradients at the previous iterates. This requirement (the linear span assumption) can be removed; that is, a similar analysis can be carried out for general deterministic black-box gradient-based algorithms (formally defined in the appendix, Section C.5) using the resisting oracle technique (Nemirovsky & Yudin, 1983), at the cost of slightly enlarging the required problem dimension. We informally state the generalized result below and provide details in the appendix.

Theorem 4 (Informal).

Let $n\geq 3k+2$. For any gradient-based deterministic algorithm, there exists an $R$-smooth biaffine function $\mathbf{L}$ on $\mathbb{R}^{n}\times\mathbb{R}^{n}$ such that (12) holds.

Although we do not formally pursue this, the requirement that the algorithm is not randomized can also be removed using the techniques of Woodworth & Srebro (2016), which exploit near-orthogonality of random vectors in high dimensions.

3.4 Discussion

We established that one cannot improve the lower bound of Theorem 3 using biaffine functions, arguably the simplest family of convex-concave functions. Furthermore, this optimality statement holds for both algorithm classes $\mathfrak{A}_{\textrm{sep}}$ and $\mathfrak{A}_{\textrm{sim}}$, as established through the chain of inequalities in Section 3.2. However, as demonstrated by Drori (2017), who introduced a non-quadratic lower bound for smooth convex minimization that improves upon the classical quadratic lower bounds of Nemirovsky (1992) and Nesterov (2013), a non-biaffine construction may improve the constant. In our setup, there is a factor-near-$100$ difference between the upper and lower bounds. (Note that each EAG iteration requires $2$ evaluations of the saddle subdifferential oracle.) We suspect that both the algorithm and the lower bound can be improved upon, but we leave this to future work.

Golowich et al. (2020) establishes that for the class of 1-SCLI algorithms (S is for stationary), a subclass of $\mathfrak{A}_{\textrm{sim}}$ for biaffine objectives, one cannot achieve a rate faster than $\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\leq\mathcal{O}(1/k)$. This lower bound applies to EG but not EAG; EAG is not 1-SCLI, as its anchoring coefficients $\frac{1}{k+2}$ vary over iterations, and its convergence rate breaks the 1-SCLI lower bound. On the other hand, we can view EAG as a non-stationary CLI algorithm (Arjevani & Shamir, 2016, Definition 2). We further discuss these connections in the appendix, Section E.

Figure 2: Comparison of the discrete trajectories and their corresponding continuous-time flows. (a) Discrete trajectories with $\mathbf{L}_{\delta,\epsilon}$. (b) Moreau–Yosida regularized flow with $\lambda=0.01$ and the anchored flow with $\mathbf{L}(x,y)=xy$. Trajectories from EAG-C and SimGD-A virtually coincide and resemble the anchored flow. However, SimGD-A progresses more slowly due to its diminishing step-sizes.

4 Experiments

We now present experiments illustrating the accelerated rate of EAG. We compare EAG-C and EAG-V against the prior algorithms with convergence guarantees: EG, Popov’s algorithm (or optimistic descent) and simultaneous gradient descent with anchoring (SimGD-A). The precise forms of the algorithms are restated in the appendix.

Figure 1(a) presents experiments on our first example, constructed as follows. For $\epsilon>0$, define

\displaystyle f_{\epsilon}(u)=\begin{cases}\epsilon|u|-\frac{1}{2}\epsilon^{2}&\text{if }|u|\geq\epsilon,\\ \frac{1}{2}u^{2}&\text{if }|u|<\epsilon.\end{cases}

Next, for $0<\epsilon\ll\delta\ll 1$, define

\displaystyle\mathbf{L}_{\delta,\epsilon}(x,y)=(1-\delta)f_{\epsilon}(x)+\delta xy-(1-\delta)f_{\epsilon}(y),   (24)

where $x,y\in\mathbb{R}$. Since $f_{\epsilon}$ is a $1$-smooth convex function, $\mathbf{L}_{\delta,\epsilon}$ has smoothness parameter $1$, which is almost tight due to the quadratic behavior of $\mathbf{L}_{\delta,\epsilon}$ within the region $|x|,|y|\leq\epsilon$. This construction was inspired by Drori & Teboulle (2014), who presented $f_{\epsilon}$ as the worst-case instance for gradient descent. We choose the step-size $\alpha=0.1$ as this value is comfortably within the theoretical range of convergent parameters for EG, EAG-C, and Popov. For EAG-V, we set $\alpha_{0}=0.1$. We use $N=10^{5}$, $\delta=10^{-2}$, and $\epsilon=5\times 10^{-5}$, and the initial point $\mathbf{z}^{0}$ has norm $1$.
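For reference, a short sketch (ours) of the saddle operator of (24) with the parameter values above; note that $f_{\epsilon}'(u)$ equals $u$ clipped to $[-\epsilon,\epsilon]$, which covers both branches of the definition. The resulting $\mathbf{G}$ can be plugged directly into the EAG sketches given earlier.

```python
import numpy as np

delta, eps = 1e-2, 5e-5
df = lambda u: np.clip(u, -eps, eps)     # derivative of the Huber-like f_eps

def G(z):
    """Saddle operator of L_{delta,eps}(x, y) = (1-delta) f_eps(x) + delta x y - (1-delta) f_eps(y)."""
    x, y = z
    return np.array([(1 - delta) * df(x) + delta * y,        #  grad_x L
                     -(delta * x - (1 - delta) * df(y))])     # -grad_y L

z0 = np.array([1.0, 0.0])                # an initial point with norm 1
print(np.linalg.norm(G(z0)) ** 2)        # squared gradient norm at z0
```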

Figure 1(b) presents experiments on our second example

\displaystyle\mathbf{L}(\mathbf{x},\mathbf{y})=\frac{1}{2}\mathbf{x}^{\intercal}\mathbf{H}\mathbf{x}-\mathbf{h}^{\intercal}\mathbf{x}-\langle\mathbf{A}\mathbf{x}-\mathbf{b},\mathbf{y}\rangle,   (25)

where $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$, $\mathbf{A}\in\mathbb{R}^{n\times n}$, $\mathbf{b}\in\mathbb{R}^{n}$, $\mathbf{H}\in\mathbb{R}^{n\times n}$ is positive semidefinite, and $\mathbf{h}\in\mathbb{R}^{n}$. Note that this is the Lagrangian of a linearly constrained quadratic minimization problem. We adopted this saddle function from Ouyang & Xu (2021), where the authors constructed $\mathbf{H},\mathbf{h},\mathbf{A}$, and $\mathbf{b}$ to provide a lower bound on the duality gap. The exact forms of $\mathbf{H}$, $\mathbf{h}$, $\mathbf{A}$, and $\mathbf{b}$ are restated in the appendix. We use $n=200$, $N=10^{6}$, $\alpha=0.5$ for EG and Popov, $\alpha=0.1265$ for EAG-C, and $\alpha_{0}=0.618$ for EAG-V. Finally, we use the initial point $\mathbf{z}^{0}=0$.
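The sketch below (ours, with randomly generated placeholder data; the specific $\mathbf{H},\mathbf{h},\mathbf{A},\mathbf{b}$ of Ouyang & Xu (2021) are given in the appendix) writes out the saddle operator of (25).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
M = rng.standard_normal((n, n))
H = M @ M.T / n                          # placeholder positive semidefinite H
h = rng.standard_normal(n)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def G(z):
    """Saddle operator of L(x, y) = x^T H x / 2 - h^T x - <A x - b, y>."""
    x, y = z[:n], z[n:]
    return np.concatenate([H @ x - h - A.T @ y,   #  grad_x L
                           A @ x - b])            # -grad_y L

print(np.linalg.norm(G(np.zeros(2 * n))) ** 2)    # squared gradient norm at z0 = 0
```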

ODE Interpretation

Figure 2(a) illustrates the algorithms applied to (24). For $|x|,|y|\gg\epsilon$,

\mathbf{G}_{\mathbf{L}_{\delta,\epsilon}}(x,y)=\begin{bmatrix}(1-\delta)\epsilon+\delta y\\ (1-\delta)\epsilon-\delta x\end{bmatrix}\approx\delta\begin{bmatrix}y\\ -x\end{bmatrix},

so the algorithms roughly behave as if the objective is the bilinear function $\delta xy$. When $\delta$ is sufficiently small, trajectories of the algorithms closely resemble the corresponding continuous-time flows with $\mathbf{L}(x,y)=xy$.

Csetnek et al. (2019) demonstrated that Popov's algorithm can be viewed as a discretization of the Moreau–Yosida regularized flow $\dot{\mathbf{z}}(t)=-\frac{1}{\lambda}\left(\mathrm{Id}-(\mathrm{Id}+\lambda\mathbf{G})^{-1}\right)(\mathbf{z}(t))$ for some $\lambda>0$, and a similar analysis can be performed with EG. This connection explains why EG's trajectory in Figure 2(a) and the regularized flow depicted in Figure 2(b) are similar.

On the other hand, EAG and SimGD-A can be viewed as a discretization of the anchored flow ODE

\dot{\mathbf{z}}(t)=-\mathbf{G}(\mathbf{z}(t))+\frac{1}{t}(\mathbf{z}^{0}-\mathbf{z}(t)).

The anchored flow depicted in Figure 2(b) approaches the solution much more quickly because the anchoring term damps the cycling behavior. The trajectories of the EAG and SimGD-A iterates in Figure 2(a) are very similar to the anchored flow. However, SimGD-A requires diminishing step-sizes $\frac{1-p}{(k+1)^{p}}$ (both theoretically and experimentally) and therefore progresses much more slowly.
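A minimal forward-Euler sketch (ours; step size and horizon chosen only for illustration) of the anchored flow for $\mathbf{L}(x,y)=xy$, whose saddle operator is $\mathbf{G}(x,y)=(y,-x)$; the anchoring term damps the cycling and drives the iterate toward the saddle point at the origin.

```python
import numpy as np

G = lambda z: np.array([z[1], -z[0]])    # saddle operator of L(x, y) = x * y
z0 = np.array([1.0, 1.0])

z, t, h = z0.copy(), 1e-3, 1e-3          # start the flow at a small positive time
while t < 50.0:
    z = z + h * (-G(z) + (z0 - z) / t)   # anchored flow: zdot = -G(z) + (z0 - z)/t
    t += h

print(np.linalg.norm(z), np.linalg.norm(G(z)))   # both shrink toward 0
```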

5 Conclusion

This work presents the extra anchored gradient (EAG) algorithms, which exhibit accelerated $\mathcal{O}(1/k^{2})$ rates on the squared gradient magnitude for smooth convex-concave minimax problems. The acceleration combines the extragradient and anchoring mechanisms, which separately achieve $\mathcal{O}(1/k)$ or slower rates. We complement the $\mathcal{O}(1/k^{2})$ rate with a matching $\Omega(1/k^{2})$ complexity lower bound, thereby establishing optimality of EAG.

At a superficial level, the acceleration mechanism of EAG seems to be distinct from that of Nesterov: anchoring dampens oscillations, whereas momentum provides the opposite of a dampening effect. However, are the two acceleration phenomena entirely unrelated? Finding a common structure, a connection, between the two acceleration phenomena would be an interesting direction of future work.

Acknowledgements

TY and EKR were supported by the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [No. 2020R1F1A1A01072877], the National Research Foundation of Korea (NRF) Grant funded by the Korean Government (MSIP) [No. 2017R1A5A1015626], by the New Faculty Startup Fund from Seoul National University, and by the AI Institute of Seoul National University (AIIS) through its AI Frontier Research Grant (No. 0670-20200015) in 2020. We thank Jaewook Suh and Jongmin Lee for reviewing the manuscript and providing valuable feedback. We thank Jelena Diakonikolas for the discussion on the prior work on parameter-free near-optimal methods for the smooth minimax setup. Finally, we thank the anonymous referees for bringing to our attention the recent complexity lower bound on the class of $1$-SCLI algorithms by Golowich et al. (2020).

References

  • Alkousa et al. (2020) Alkousa, M., Gasnikov, A., Dvinskikh, D., Kovalev, D., and Stonyakin, F. Accelerated methods for saddle-point problem. Computational Mathematics and Mathematical Physics, 60(11):1787–1809, 2020.
  • Antonakopoulos et al. (2021) Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. Adaptive extra-gradient methods for min-max optimization and games. ICLR, 2021.
  • Arjevani & Shamir (2016) Arjevani, Y. and Shamir, O. On the iteration complexity of oblivious first-order optimization algorithms. ICML, 2016.
  • Arjevani et al. (2016) Arjevani, Y., Shalev-Shwartz, S., and Shamir, O. On lower and upper bounds in smooth and strongly convex optimization. Journal of Machine Learning Research, 17(1):4303–4353, 2016.
  • Azizian et al. (2020) Azizian, W., Mitliagkas, I., Lacoste-Julien, S., and Gidel, G. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. AISTATS, 2020.
  • Bùi & Combettes (2021) Bùi, M. N. and Combettes, P. L. A warped resolvent algorithm to construct Nash equilibria. arXiv preprint arXiv:2101.00532, 2021.
  • Censor et al. (2011) Censor, Y., Gibali, A., and Reich, S. The subgradient extragradient method for solving variational inequalities in Hilbert space. Journal of Optimization Theory and Applications, 148(2):318–335, 2011.
  • Chambolle & Pock (2011) Chambolle, A. and Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
  • Chambolle & Pock (2016) Chambolle, A. and Pock, T. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, 159(1-2):253–287, 2016.
  • Chen et al. (2014) Chen, Y., Lan, G., and Ouyang, Y. Optimal primal-dual methods for a class of saddle point problems. SIAM Journal on Optimization, 24(4):1779–1814, 2014.
  • Chen et al. (2017) Chen, Y., Lan, G., and Ouyang, Y. Accelerated schemes for a class of variational inequalities. Mathematical Programming, 165(1):113–149, 2017.
  • Chen et al. (2021) Chen, Z., Zhou, Y., Xu, T., and Liang, Y. Proximal gradient descent-ascent: Variable convergence under KŁ geometry. ICLR, 2021.
  • Chiang et al. (2012) Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. COLT, 2012.
  • Condat (2013) Condat, L. A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, 2013.
  • Csetnek et al. (2019) Csetnek, E. R., Malitsky, Y., and Tam, M. K. Shadow Douglas–Rachford splitting for monotone inclusions. Applied Mathematics & Optimization, 80(3):665–678, 2019.
  • Daskalakis et al. (2011) Daskalakis, C., Deckelbaum, A., and Kim, A. Near-optimal no-regret algorithms for zero-sum games. SODA, 2011.
  • Diakonikolas (2020) Diakonikolas, J. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. COLT, 2020.
  • Drori (2017) Drori, Y. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1–16, 2017.
  • Drori & Teboulle (2014) Drori, Y. and Teboulle, M. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming, 145(1-2):451–482, 2014.
  • Ghadimi & Lan (2012) Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
  • Ghadimi & Lan (2013) Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
  • Gidel et al. (2018) Gidel, G., Berard, H., Vignoud, G., Vincent, P., and Lacoste-Julien, S. A variational inequality perspective on generative adversarial networks. ICLR, 2018.
  • Gidel et al. (2019) Gidel, G., Hemmat, R. A., Pezeshki, M., Le Priol, R., Huang, G., Lacoste-Julien, S., and Mitliagkas, I. Negative momentum for improved game dynamics. AISTATS, 2019.
  • Golowich et al. (2020) Golowich, N., Pattathil, S., Daskalakis, C., and Ozdaglar, A. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. COLT, 2020.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. NeurIPS, 2014.
  • Goodfellow et al. (2015) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ICLR, 2015.
  • Halpern (1967) Halpern, B. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957–961, 1967.
  • Hamedani & Aybat (2018) Hamedani, E. Y. and Aybat, N. S. A primal-dual algorithm for general convex-concave saddle point problems. arXiv preprint arXiv:1803.01401, 2018.
  • He & Monteiro (2016) He, Y. and Monteiro, R. D. An accelerated HPE-type algorithm for a class of composite convex-concave saddle-point problems. SIAM Journal on Optimization, 26(1):29–56, 2016.
  • Hsieh et al. (2019) Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. On the convergence of single-call stochastic extra-gradient methods. NeurIPS, 2019.
  • Jin et al. (2019) Jin, C., Netrapalli, P., and Jordan, M. I. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
  • Juditsky et al. (2011) Juditsky, A., Nemirovski, A., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
  • Kolossoski & Monteiro (2017) Kolossoski, O. and Monteiro, R. D. An accelerated non-Euclidean hybrid proximal extragradient-type algorithm for convex–concave saddle-point problems. Optimization Methods and Software, 32(6):1244–1272, 2017.
  • Korpelevich (1977) Korpelevich, G. Extragradient method for finding saddle points and other problems. Matekon, 13(4):35–49, 1977.
  • Lan (2012) Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
  • Liang & Stokes (2019) Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. AISTATS, 2019.
  • Lieder (2020) Lieder, F. On the convergence rate of the Halpern-iteration. Optimization Letters, pp. 1–14, 2020.
  • Lin et al. (2020a) Lin, T., Jin, C., and Jordan, M. On gradient descent ascent for nonconvex-concave minimax problems. ICML, 2020a.
  • Lin et al. (2020b) Lin, T., Jin, C., Jordan, M., et al. Near-optimal algorithms for minimax optimization. COLT, 2020b.
  • Lu et al. (2020) Lu, S., Tsaknakis, I., Hong, M., and Chen, Y. Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. IEEE Transactions on Signal Processing, 68:3676–3691, 2020.
  • Lyashko et al. (2011) Lyashko, S., Semenov, V., and Voitova, T. Low-cost modification of Korpelevich’s methods for monotone equilibrium problems. Cybernetics and Systems Analysis, 47(4):631, 2011.
  • Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
  • Malitsky (2015) Malitsky, Y. Projected reflected gradient methods for monotone variational inequalities. SIAM Journal on Optimization, 25(1):502–520, 2015.
  • Malitsky (2020) Malitsky, Y. Golden ratio algorithms for variational inequalities. Mathematical Programming, 184:383–410, 2020.
  • Malitsky & Tam (2020) Malitsky, Y. and Tam, M. K. A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM Journal on Optimization, 30(2):1451–1472, 2020.
  • Malitsky & Semenov (2014) Malitsky, Y. V. and Semenov, V. An extragradient algorithm for monotone variational inequalities. Cybernetics and Systems Analysis, 50(2):271–277, 2014.
  • Mason & Handscomb (2002) Mason, J. C. and Handscomb, D. C. Chebyshev Polynomials. 2002.
  • Mertikopoulos et al. (2019) Mertikopoulos, P., Zenati, H., Lecouat, B., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. ICLR, 2019.
  • Mokhtari et al. (2020a) Mokhtari, A., Ozdaglar, A., and Pattathil, S. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. AISTATS, 2020a.
  • Mokhtari et al. (2020b) Mokhtari, A., Ozdaglar, A. E., and Pattathil, S. Convergence rate of 𝒪(1/k)\mathcal{O}(1/k) for optimistic gradient and extragradient methods in smooth convex-concave saddle point problems. SIAM Journal on Optimization, 30(4):3230–3251, 2020b.
  • Monteiro & Svaiter (2010) Monteiro, R. D. and Svaiter, B. F. On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM Journal on Optimization, 20(6):2755–2787, 2010.
  • Monteiro & Svaiter (2011) Monteiro, R. D. and Svaiter, B. F. Complexity of variants of Tseng’s modified FB splitting and Korpelevich’s methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM Journal on Optimization, 21(4):1688–1720, 2011.
  • Nedić & Ozdaglar (2009) Nedić, A. and Ozdaglar, A. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.
  • Nemirovski (2004) Nemirovski, A. Prox-method with rate of convergence 𝒪(1/t)\mathcal{O}(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
  • Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Nemirovsky (1991) Nemirovsky, A. S. On optimality of Krylov’s information when solving linear operator equations. Journal of Complexity, 7(2):121–130, 1991.
  • Nemirovsky (1992) Nemirovsky, A. S. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
  • Nemirovsky & Yudin (1983) Nemirovsky, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. 1983.
  • Nesterov (2005a) Nesterov, Y. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005a.
  • Nesterov (2005b) Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005b.
  • Nesterov (2007) Nesterov, Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2-3):319–344, 2007.
  • Nesterov (2009) Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
  • Nesterov (2013) Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. 2013.
  • Nesterov & Scrimali (2011) Nesterov, Y. and Scrimali, L. Solving strongly monotone variational and quasi-variational inequalities. Discrete & Continuous Dynamical Systems – A, 31(4):1383–1396, 2011.
  • Noor (2003) Noor, M. A. New extragradient-type methods for general variational inequalities. Journal of Mathematical Analysis and Applications, 277(2):379–394, 2003.
  • Nouiehed et al. (2019) Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Razaviyayn, M. Solving a class of non-convex min-max games using iterative first order methods. NeurIPS, 2019.
  • Ostrovskii et al. (2020) Ostrovskii, D. M., Lowy, A., and Razaviyayn, M. Efficient search of first-order Nash equilibria in nonconvex-concave smooth min-max problems. arXiv preprint arXiv:2002.07919, 2020.
  • Ouyang & Xu (2021) Ouyang, Y. and Xu, Y. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 185:1–35, 2021.
  • Peng et al. (2020) Peng, W., Dai, Y.-H., Zhang, H., and Cheng, L. Training GANs with centripetal acceleration. Optimization Methods and Software, 35(5):955–973, 2020.
  • Popov (1980) Popov, L. D. A modification of the Arrow–Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845–848, 1980.
  • Rafique et al. (2018) Rafique, H., Liu, M., Lin, Q., and Yang, T. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.
  • Rakhlin & Sridharan (2013a) Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. COLT, 2013a.
  • Rakhlin & Sridharan (2013b) Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. NeurIPS, 2013b.
  • Rockafellar (1970) Rockafellar, R. T. Monotone operators associated with saddle-functions and minimax problems. Nonlinear Functional Analysis, 18(part 1):397–407, 1970.
  • Ryu & Yin (2021) Ryu, E. K. and Yin, W. Large-Scale Convex Optimization via Monotone Operators. Draft, 2021.
  • Ryu et al. (2019) Ryu, E. K., Yuan, K., and Yin, W. ODE analysis of stochastic gradient methods with optimism and anchoring for minimax problems and GANs. arXiv preprint arXiv:1905.10899, 2019.
  • Solodov & Svaiter (1999) Solodov, M. V. and Svaiter, B. F. A hybrid approximate extragradient–proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis, 7(4):323–345, 1999.
  • Syrgkanis et al. (2015) Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. Fast convergence of regularized learning in games. NeurIPS, 2015.
  • Taylor & Bach (2019) Taylor, A. and Bach, F. Stochastic first-order methods: Non-asymptotic and computer-aided analyses via potential functions. COLT, 2019.
  • Taylor et al. (2017) Taylor, A. B., Hendrickx, J. M., and Glineur, F. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307–345, 2017.
  • Thekumparampil et al. (2019) Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Efficient algorithms for smooth minimax optimization. NeurIPS, 2019.
  • Tseng (1995) Tseng, P. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1-2):237–252, 1995.
  • Tseng (2000) Tseng, P. A modified forward-backward splitting method for maximal monotone mappings. SIAM Journal on Control and Optimization, 38(2):431–446, 2000.
  • Vũ (2013) Vũ, B. C. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013.
  • Wang & Li (2020) Wang, Y. and Li, J. Improved algorithms for convex-concave minimax optimization. NeurIPS, 2020.
  • Woodworth & Srebro (2016) Woodworth, B. and Srebro, N. Tight complexity bounds for optimizing composite objectives. NeurIPS, 2016.
  • Yadav et al. (2018) Yadav, A., Shah, S., Xu, Z., Jacobs, D., and Goldstein, T. Stabilizing adversarial nets with prediction methods. ICLR, 2018.
  • Yan (2018) Yan, M. A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. Journal of Scientific Computing, 76(3):1698–1717, 2018.
  • Yang et al. (2020) Yang, J., Zhang, S., Kiyavash, N., and He, N. A catalyst framework for minimax optimization. NeurIPS, 2020.
  • Zhang et al. (2020) Zhang, G., Bao, X., Lessard, L., and Grosse, R. A unified analysis of first-order methods for smooth games via integral quadratic constraints. arXiv preprint arXiv:2009.11359, 2020.
  • Zhang et al. (2019) Zhang, J., Hong, M., and Zhang, S. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.
  • Zhao (2019) Zhao, R. Optimal stochastic algorithms for convex-concave saddle-point problems. arXiv preprint arXiv:1903.01687, 2019.

Appendix A Algorithm specifications

For the sake of clarity, we precisely specify all the algorithms discussed in this work.

Simultaneous gradient descent for smooth minimax optimization is defined as

𝐱k+1=𝐱kα𝐱𝐋(𝐱k,𝐲k)\displaystyle\mathbf{x}^{k+1}=\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x}^{k},\mathbf{y}^{k})
𝐲k+1=𝐲k+α𝐲𝐋(𝐱k,𝐲k).\displaystyle\mathbf{y}^{k+1}=\mathbf{y}^{k}+\alpha\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x}^{k},\mathbf{y}^{k}).

The updates can be written more concisely with the joint variable 𝐳k=(𝐱k,𝐲k)\mathbf{z}^{k}=(\mathbf{x}^{k},\mathbf{y}^{k}) and the saddle operator (2), in which the sign change of the 𝐲\mathbf{y}-gradient is already incorporated:

𝐳k+1=𝐳kα𝐆(𝐳k).\displaystyle\mathbf{z}^{k+1}=\mathbf{z}^{k}-\alpha\mathbf{G}(\mathbf{z}^{k}).

Alternating gradient descent-ascent is defined as

𝐱k+1=𝐱kα𝐱𝐋(𝐱k,𝐲k)\displaystyle\mathbf{x}^{k+1}=\mathbf{x}^{k}-\alpha\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x}^{k},\mathbf{y}^{k})
𝐲k+1=𝐲k+α𝐲𝐋(𝐱k+1,𝐲k).\displaystyle\mathbf{y}^{k+1}=\mathbf{y}^{k}+\alpha\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x}^{k+1},\mathbf{y}^{k}).

Note that we update the 𝐱\mathbf{x} variable first and then use it to update the 𝐲\mathbf{y}-iterate.

The extragradient (EG) algorithm is defined as

𝐳k+1/2\displaystyle\mathbf{z}^{k+1/2} =𝐳kα𝐆(𝐳k),\displaystyle=\mathbf{z}^{k}-\alpha\mathbf{G}(\mathbf{z}^{k}),
𝐳k+1\displaystyle\mathbf{z}^{k+1} =𝐳kα𝐆(𝐳k+1/2).\displaystyle=\mathbf{z}^{k}-\alpha\mathbf{G}(\mathbf{z}^{k+1/2}).

Popov’s algorithm, or optimistic descent, is defined as

𝐳k+1=𝐳kα𝐆(𝐳k)α(𝐆(𝐳k)𝐆(𝐳k1)).\displaystyle\mathbf{z}^{k+1}=\mathbf{z}^{k}-\alpha\mathbf{G}(\mathbf{z}^{k})-\alpha\left(\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k-1})\right).

Simultaneous gradient descent with anchoring (SimGD-A) (Ryu et al., 2019) is defined as

𝐳k+1=𝐳k1p(k+1)p𝐆(𝐳k)+(1p)γk+1(𝐳0𝐳k),\displaystyle\mathbf{z}^{k+1}=\mathbf{z}^{k}-\frac{1-p}{(k+1)^{p}}\mathbf{G}(\mathbf{z}^{k})+\frac{(1-p)\gamma}{k+1}(\mathbf{z}^{0}-\mathbf{z}^{k}),

where p(1/2,1)p\in(1/2,1) and γ>0\gamma>0. It was proved in Ryu et al. (2019) that SimGD-A converges at an 𝒪(1/k22p)\mathcal{O}(1/k^{2-2p}) rate. In our experiments, we always used γ=1\gamma=1 and p=12+102p=\frac{1}{2}+10^{-2}.
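
For concreteness, the following is a minimal Python sketch of the update rules listed in this appendix, applied to an illustrative bilinear problem 𝐋(𝐱,𝐲)=⟨𝐀𝐱−𝐛,𝐲−𝐜⟩. The specific matrix, vectors, step-size, and iteration count below are arbitrary choices for illustration and are not the settings used in our experiments.

```python
import numpy as np

# Illustrative bilinear saddle function L(x, y) = <A x - b, y - c>; its saddle
# operator (with the sign change of the y-gradient included) is
# G(z) = (grad_x L, -grad_y L) = (A^T (y - c), -(A x - b)).
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, 2)                        # normalize so that R = ||A|| = 1
b, c = rng.standard_normal(n), rng.standard_normal(n)

def G(z):
    x, y = z[:n], z[n:]
    return np.concatenate([A.T @ (y - c), -(A @ x - b)])

def simgd_step(z, alpha):                        # simultaneous gradient descent
    return z - alpha * G(z)

def alt_gda_step(z, alpha):                      # alternating gradient descent-ascent
    x, y = z[:n].copy(), z[n:].copy()
    x -= alpha * (A.T @ (y - c))                 # x-update first ...
    y += alpha * (A @ x - b)                     # ... then the y-update uses the new x
    return np.concatenate([x, y])

def eg_step(z, alpha):                           # extragradient (EG)
    z_half = z - alpha * G(z)
    return z - alpha * G(z_half)

def popov_step(z, z_prev, alpha):                # Popov's algorithm / optimistic descent
    return z - alpha * G(z) - alpha * (G(z) - G(z_prev))

def simgd_a_step(z, z0, k, p=0.51, gamma=1.0):   # SimGD with anchoring (SimGD-A)
    return (z - (1 - p) / (k + 1) ** p * G(z)
            + (1 - p) * gamma / (k + 1) * (z0 - z))

# Example: a few EG iterations; the squared gradient norm decreases.
z = rng.standard_normal(2 * n)
print(np.linalg.norm(G(z)) ** 2)                 # initial squared gradient norm
for k in range(1000):
    z = eg_step(z, alpha=0.3)
print(np.linalg.norm(G(z)) ** 2)                 # strictly smaller after 1000 EG steps
```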

Appendix B Omitted proofs of Section 2

The following identities follow directly from the definition of the EAG iterates:

𝐳k𝐳k+1=βk(𝐳k𝐳0)+αk𝐆(𝐳k+1/2)\displaystyle\mathbf{z}^{k}-\mathbf{z}^{k+1}=\beta_{k}(\mathbf{z}^{k}-\mathbf{z}^{0})+\alpha_{k}\,\mathbf{G}(\mathbf{z}^{k+1/2}) (26)
𝐳k+1/2𝐳k+1=αk(𝐆(𝐳k+1/2)𝐆(𝐳k))\displaystyle\mathbf{z}^{k+1/2}-\mathbf{z}^{k+1}=\alpha_{k}\left(\mathbf{G}(\mathbf{z}^{k+1/2})-\mathbf{G}(\mathbf{z}^{k})\right) (27)
𝐳0𝐳k+1=(1βk)(𝐳0𝐳k)+αk𝐆(𝐳k+1/2).\displaystyle\mathbf{z}^{0}-\mathbf{z}^{k+1}=(1-\beta_{k})(\mathbf{z}^{0}-\mathbf{z}^{k})+\alpha_{k}\,\mathbf{G}(\mathbf{z}^{k+1/2}). (28)

B.1 Proof of Lemma 2

Recall that 𝐆\mathbf{G} is a monotone operator, so that

0\displaystyle 0 𝐳k𝐳k+1,𝐆(𝐳k)𝐆(𝐳k+1).\displaystyle\leq\left\langle\mathbf{z}^{k}-\mathbf{z}^{k+1},\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle.

Therefore,

VkVk+1\displaystyle V_{k}-V_{k+1}
VkVk+1Bkβk𝐳k𝐳k+1,𝐆(𝐳k)𝐆(𝐳k+1)\displaystyle\geq V_{k}-V_{k+1}-\frac{B_{k}}{\beta_{k}}\left\langle\mathbf{z}^{k}-\mathbf{z}^{k+1},\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle
=Ak𝐆(𝐳k)2+Bk𝐆(𝐳k),𝐳k𝐳0\displaystyle=A_{k}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}+B_{k}\left\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{z}^{k}-\mathbf{z}^{0}\right\rangle
Ak+1𝐆(𝐳k+1)2Bk+1𝐆(𝐳k+1),𝐳k+1𝐳0Bkβk𝐳k𝐳k+1,𝐆(𝐳k)𝐆(𝐳k+1)\displaystyle\quad-A_{k+1}\left\|\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}-B_{k+1}\left\langle\mathbf{G}(\mathbf{z}^{k+1}),\mathbf{z}^{k+1}-\mathbf{z}^{0}\right\rangle-\frac{B_{k}}{\beta_{k}}\left\langle\mathbf{z}^{k}-\mathbf{z}^{k+1},\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle
=(a)Ak𝐆(𝐳k)2+Bk𝐆(𝐳k),𝐳k𝐳0\displaystyle\stackrel{{\scriptstyle\textnormal{(a)}}}{{\mathstrut{=}}}A_{k}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}+B_{k}\left\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{z}^{k}-\mathbf{z}^{0}\right\rangle
Ak+1𝐆(𝐳k+1)2+Bk+1𝐆(𝐳k+1),(1βk)(𝐳0𝐳k)+αk𝐆(𝐳k+1/2)\displaystyle\quad-A_{k+1}\left\|\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}+B_{k+1}\left\langle\mathbf{G}(\mathbf{z}^{k+1}),(1-\beta_{k})(\mathbf{z}^{0}-\mathbf{z}^{k})+\alpha_{k}\,\mathbf{G}(\mathbf{z}^{k+1/2})\right\rangle
Bk𝐳k𝐳0,𝐆(𝐳k)𝐆(𝐳k+1)αkBkβk𝐆(𝐳k+1/2),𝐆(𝐳k)𝐆(𝐳k+1)\displaystyle\quad-B_{k}\left\langle\mathbf{z}^{k}-\mathbf{z}^{0},\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle-\frac{\alpha_{k}B_{k}}{\beta_{k}}\left\langle\mathbf{G}(\mathbf{z}^{k+1/2}),\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle
=(b)Ak𝐆(𝐳k)2Ak+1𝐆(𝐳k+1)2+αkBk+1𝐆(𝐳k+1),𝐆(𝐳k+1/2)αkBkβk𝐆(𝐳k+1/2),𝐆(𝐳k)𝐆(𝐳k+1),\displaystyle\begin{aligned} &\stackrel{{\scriptstyle\textnormal{(b)}}}{{\mathstrut{=}}}A_{k}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}-A_{k+1}\left\|\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}+\alpha_{k}B_{k+1}\left\langle\mathbf{G}(\mathbf{z}^{k+1}),\mathbf{G}(\mathbf{z}^{k+1/2})\right\rangle\\ &\quad-\frac{\alpha_{k}B_{k}}{\beta_{k}}\left\langle\mathbf{G}(\mathbf{z}^{k+1/2}),\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle,\end{aligned} (29)

where (a) follows from (26) and (28), and (b) results from cancellation and collection of terms using (7). Next, we have

0R2𝐳k+1/2𝐳k+12𝐆(𝐳k+1/2)𝐆(𝐳k+1)2=αk2R2𝐆(𝐳k)𝐆(𝐳k+1/2)2𝐆(𝐳k+1/2)𝐆(𝐳k+1)2\displaystyle\begin{aligned} 0&\leq R^{2}\big{\|}\mathbf{z}^{k+1/2}-\mathbf{z}^{k+1}\big{\|}^{2}-\big{\|}\mathbf{G}(\mathbf{z}^{k+1/2})-\mathbf{G}(\mathbf{z}^{k+1})\big{\|}^{2}\\ &=\alpha_{k}^{2}R^{2}\,\big{\|}\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1/2})\big{\|}^{2}-\big{\|}\mathbf{G}(\mathbf{z}^{k+1/2})-\mathbf{G}(\mathbf{z}^{k+1})\big{\|}^{2}\end{aligned} (30)

from RR-Lipschitzness of 𝐆\mathbf{G} and (27). Now multiplying (30) by the factor Akαk2R2\frac{A_{k}}{\alpha_{k}^{2}R^{2}} and subtracting it from (29) gives

VkVk+1\displaystyle V_{k}-V_{k+1}
Ak𝐆(𝐳k)2Ak+1𝐆(𝐳k+1)2+αkBk+1𝐆(𝐳k+1),𝐆(𝐳k+1/2)\displaystyle\geq A_{k}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}-A_{k+1}\left\|\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}+\alpha_{k}B_{k+1}\big{\langle}\mathbf{G}(\mathbf{z}^{k+1}),\mathbf{G}(\mathbf{z}^{k+1/2})\big{\rangle}
αkBkβk𝐆(𝐳k+1/2),𝐆(𝐳k)𝐆(𝐳k+1)\displaystyle\quad-\frac{\alpha_{k}B_{k}}{\beta_{k}}\left\langle\mathbf{G}(\mathbf{z}^{k+1/2}),\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle
Ak𝐆(𝐳k)𝐆(𝐳k+1/2)2+Akαk2R2𝐆(𝐳k+1/2)𝐆(𝐳k+1)2\displaystyle\quad-A_{k}\,\left\|\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1/2})\right\|^{2}+\frac{A_{k}}{\alpha_{k}^{2}R^{2}}\left\|\mathbf{G}(\mathbf{z}^{k+1/2})-\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}
=Ak(1αk2R2)αk2R2𝐆(𝐳k+1/2)2+(Akαk2R2Ak+1)𝐆(𝐳k+1)2+(2AkαkBkβk)𝐆(𝐳k),𝐆(𝐳k+1/2)+(αkBk+1+αkBkβk2Akαk2R2)𝐆(𝐳k+1/2),𝐆(𝐳k+1).\displaystyle\begin{aligned} &=\frac{A_{k}(1-\alpha_{k}^{2}R^{2})}{\alpha_{k}^{2}R^{2}}\left\|\mathbf{G}(\mathbf{z}^{k+1/2})\right\|^{2}+\left(\frac{A_{k}}{\alpha_{k}^{2}R^{2}}-A_{k+1}\right)\left\|\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}\\ &\quad+\left(2A_{k}-\frac{\alpha_{k}B_{k}}{\beta_{k}}\right)\left\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{G}(\mathbf{z}^{k+1/2})\right\rangle\\ &\quad+\left(\alpha_{k}B_{k+1}+\frac{\alpha_{k}B_{k}}{\beta_{k}}-\frac{2A_{k}}{\alpha_{k}^{2}R^{2}}\right)\left\langle\mathbf{G}(\mathbf{z}^{k+1/2}),\mathbf{G}(\mathbf{z}^{k+1})\right\rangle.\end{aligned} (31)

Observe that the 𝐆(𝐳k),𝐆(𝐳k+1/2)\left\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{G}(\mathbf{z}^{k+1/2})\right\rangle term vanishes because of (6), and that

αkBk+1+αkBkβk=αk(Bk1βk+Bkβk)=αkBkβk(1βk)=2Ak1βk.\displaystyle\alpha_{k}B_{k+1}+\frac{\alpha_{k}B_{k}}{\beta_{k}}=\alpha_{k}\left(\frac{B_{k}}{1-\beta_{k}}+\frac{B_{k}}{\beta_{k}}\right)=\frac{\alpha_{k}B_{k}}{\beta_{k}(1-\beta_{k})}=\frac{2A_{k}}{1-\beta_{k}}.

Furthermore, by (8), we have

Ak+1=αk+1Bk+12βk+1=αkβk+1(1αk2R2βk2)(1αk2R2)βk(1βk)Bk2βk+1(1βk)=Ak(1αk2R2βk2)(1αk2R2)(1βk)2.\displaystyle A_{k+1}=\alpha_{k+1}\frac{B_{k+1}}{2\beta_{k+1}}=\frac{\alpha_{k}\beta_{k+1}(1-\alpha_{k}^{2}R^{2}-\beta_{k}^{2})}{(1-\alpha_{k}^{2}R^{2})\beta_{k}(1-\beta_{k})}\frac{B_{k}}{2\beta_{k+1}(1-\beta_{k})}=\frac{A_{k}(1-\alpha_{k}^{2}R^{2}-\beta_{k}^{2})}{(1-\alpha_{k}^{2}R^{2})(1-\beta_{k})^{2}}.

Plugging these identities into (B.1) and simplifying, we get

VkVk+1\displaystyle V_{k}-V_{k+1}
Ak(1αk2R2)αk2R2𝐆(𝐳k+1/2)2+Ak(1αk2R2βk)2αk2R2(1αk2R2)(1βk)2𝐆(𝐳k+1)2\displaystyle\geq\frac{A_{k}(1-\alpha_{k}^{2}R^{2})}{\alpha_{k}^{2}R^{2}}\left\|\mathbf{G}(\mathbf{z}^{k+1/2})\right\|^{2}+\frac{A_{k}(1-\alpha_{k}^{2}R^{2}-\beta_{k})^{2}}{\alpha_{k}^{2}R^{2}(1-\alpha_{k}^{2}R^{2})(1-\beta_{k})^{2}}\left\|\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}
2Ak(1αk2R2βk)αk2R2(1βk)𝐆(𝐳k+1/2),𝐆(𝐳k+1)\displaystyle\quad-\frac{2A_{k}(1-\alpha_{k}^{2}R^{2}-\beta_{k})}{\alpha_{k}^{2}R^{2}(1-\beta_{k})}\left\langle\mathbf{G}(\mathbf{z}^{k+1/2}),\mathbf{G}(\mathbf{z}^{k+1})\right\rangle
0,\displaystyle\geq 0,

where the last inequality is an application of Young’s inequality.

B.2 Proof of Lemma 1

We may assume R=1R=1 without loss of generality because we can recover the general case by replacing αk\alpha_{k} with αkR\alpha_{k}R. Rewrite (2.1) as

αkαk+1=αk3(k+1)(k+3)(1αk2).\displaystyle\alpha_{k}-\alpha_{k+1}=\frac{\alpha_{k}^{3}}{(k+1)(k+3)(1-\alpha_{k}^{2})}. (32)

Suppose that we have already established 0<αN<ρ0<\alpha_{N}<\rho for some N0N\geq 0 and ρ(0,1)\rho\in(0,1), where ρ\rho satisfies

γ:=12(1N+1+1N+2)ρ21ρ2<1.\displaystyle\gamma:=\frac{1}{2}\left(\frac{1}{N+1}+\frac{1}{N+2}\right)\frac{\rho^{2}}{1-\rho^{2}}<1. (33)

Note that (33) holds true for all N0N\geq 0 if ρ<34\rho<\frac{3}{4}. Now we will show that given (33),

αN>αN+1>>αN+k>(1γ)αNfor all k0,\displaystyle\alpha_{N}>\alpha_{N+1}>\cdots>\alpha_{N+k}>(1-\gamma)\alpha_{N}\quad\text{for all }k\geq 0,

so that αkα\alpha_{k}\downarrow\alpha for some α(1γ)αN\alpha\geq(1-\gamma)\alpha_{N}. It suffices to prove that (1γ)αN<αN+k<ρ(1-\gamma)\alpha_{N}<\alpha_{N+k}<\rho for all k0k\geq 0, because it is clear from (32) that {αk}k0\{\alpha_{k}\}_{k\geq 0} is decreasing.

We use induction on kk to prove that αN+k((1γ)αN,ρ)\alpha_{N+k}\in((1-\gamma)\alpha_{N},\rho). The case k=0k=0 is trivial. Now suppose that (1γ)αN<αN+j<ρ(1-\gamma)\alpha_{N}<\alpha_{N+j}<\rho holds true for all j=0,,kj=0,\dots,k. Then by (32), for each 0jk0\leq j\leq k we have

0<αN+jαN+j+1\displaystyle 0<\alpha_{N+j}-\alpha_{N+j+1} =1(N+j+1)(N+j+3)αN+j31αN+j2\displaystyle=\frac{1}{(N+j+1)(N+j+3)}\frac{\alpha_{N+j}^{3}}{1-\alpha_{N+j}^{2}}
<1(N+j+1)(N+j+3)ρ2αN1ρ2.\displaystyle<\frac{1}{(N+j+1)(N+j+3)}\frac{\rho^{2}\alpha_{N}}{1-\rho^{2}}.

Summing up the inequalities for j=0,,kj=0,\dots,k, we obtain

0<αNαN+k+1\displaystyle 0<\alpha_{N}-\alpha_{N+k+1} <j=0k1(N+j+1)(N+j+3)ρ2αN1ρ2\displaystyle<\sum_{j=0}^{k}\frac{1}{(N+j+1)(N+j+3)}\frac{\rho^{2}\alpha_{N}}{1-\rho^{2}}
<ρ2αN1ρ2j=01(N+j+1)(N+j+3)\displaystyle<\frac{\rho^{2}\alpha_{N}}{1-\rho^{2}}\sum_{j=0}^{\infty}\frac{1}{(N+j+1)(N+j+3)}
=ρ2αN1ρ212(1N+1+1N+2)=γαN,\displaystyle=\frac{\rho^{2}\alpha_{N}}{1-\rho^{2}}\frac{1}{2}\left(\frac{1}{N+1}+\frac{1}{N+2}\right)=\gamma\alpha_{N},

which gives (1γ)αN<αN+k+1<αN<ρ(1-\gamma)\alpha_{N}<\alpha_{N+k+1}<\alpha_{N}<\rho, completing the induction.

In particular, when α0=0.618\alpha_{0}=0.618, direct calculation gives 0.437>αN>0.43660.437>\alpha_{N}>0.4366 when N=1000N=1000. With ρ=0.437\rho=0.437 and N=1000N=1000, we have γ=12(1N+1+1N+2)ρ21ρ2<2.5×104\gamma=\frac{1}{2}\left(\frac{1}{N+1}+\frac{1}{N+2}\right)\frac{\rho^{2}}{1-\rho^{2}}<2.5\times 10^{-4}, which gives α(1γ)αN0.4365\alpha\geq(1-\gamma)\alpha_{N}\approx 0.4365.
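
The recursion (32) is straightforward to evaluate numerically; the following minimal Python sketch (with R=1 and α_0 = 0.618) reproduces the values quoted above.

```python
# Numerical illustration of the recursion (32) with R = 1 and alpha_0 = 0.618.
alpha = 0.618
for k in range(1000):
    alpha -= alpha ** 3 / ((k + 1) * (k + 3) * (1 - alpha ** 2))
print(alpha)   # alpha_1000, which lies between 0.4366 and 0.437, as stated above
```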

B.3 Proof of Theorem 1

As in the proof of Theorem 2, assume without loss of generality that R=1R=1. The strategy of the proof is basically the same as in Theorem 2; we construct a nonincreasing Lyapunov function by combining the same set of inequalities, but with different (more intricate) coefficients. For k0k\geq 0, let

Vk=Ak𝐆(𝐳k)2+Bk𝐆(𝐳k),𝐳k𝐳0.\displaystyle V_{k}=A_{k}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}+B_{k}\left\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{z}^{k}-\mathbf{z}^{0}\right\rangle.

As in Lemma 2, we will use βk=1k+2\beta_{k}=\frac{1}{k+2} and Bk=k+1B_{k}=k+1 (so that Bk+1=Bk/(1βk)B_{k+1}=B_{k}/(1-\beta_{k})), and Ak0A_{k}\geq 0 will be specified later. Because we now have the fixed step-size α\alpha, the identities (26), (27), and (28) become

𝐳k+1/2𝐳k+1=α(𝐆(𝐳k+1/2)𝐆(𝐳k))\displaystyle\mathbf{z}^{k+1/2}-\mathbf{z}^{k+1}=\alpha\left(\mathbf{G}(\mathbf{z}^{k+1/2})-\mathbf{G}(\mathbf{z}^{k})\right)
𝐳k𝐳k+1=1k+2(𝐳k𝐳0)+α𝐆(𝐳k+1/2)\displaystyle\mathbf{z}^{k}-\mathbf{z}^{k+1}=\frac{1}{k+2}(\mathbf{z}^{k}-\mathbf{z}^{0})+\alpha\,\mathbf{G}(\mathbf{z}^{k+1/2})
𝐳k+1𝐳0=k+1k+2(𝐳k𝐳0)α𝐆(𝐳k+1/2).\displaystyle\mathbf{z}^{k+1}-\mathbf{z}^{0}=\frac{k+1}{k+2}(\mathbf{z}^{k}-\mathbf{z}^{0})-\alpha\,\mathbf{G}(\mathbf{z}^{k+1/2}).

Now, as in Lemma 2, we subtract from VkVk+1V_{k}-V_{k+1} the same monotonicity and Lipschitzness inequalities, multiplied by the coefficients (k+1)(k+2)(k+1)(k+2) and τk0\tau_{k}\geq 0 (to be specified later), respectively, and obtain

VkVk+1\displaystyle V_{k}-V_{k+1}
VkVk+1(k+1)(k+2)𝐳k𝐳k+1,𝐆(𝐳k)𝐆(𝐳k+1)\displaystyle\geq V_{k}-V_{k+1}-(k+1)(k+2)\left\langle\mathbf{z}^{k}-\mathbf{z}^{k+1},\mathbf{G}(\mathbf{z}^{k})-\mathbf{G}(\mathbf{z}^{k+1})\right\rangle
τk(𝐳k+1/2𝐳k+12𝐆(𝐳k+1/2)𝐆(𝐳k+1)2)\displaystyle\quad-\tau_{k}\,\left(\left\|\mathbf{z}^{k+1/2}-\mathbf{z}^{k+1}\right\|^{2}-\left\|\mathbf{G}(\mathbf{z}^{k+1/2})-\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}\right)
=(Akα2τk)𝐆(𝐳k)2+τk(1α2)𝐆(𝐳k+1/2)2+(τkAk+1)𝐆(𝐳k+1)2\displaystyle=(A_{k}-\alpha^{2}\tau_{k})\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}+\tau_{k}(1-\alpha^{2})\left\|\mathbf{G}(\mathbf{z}^{k+1/2})\right\|^{2}+(\tau_{k}-A_{k+1})\left\|\mathbf{G}(\mathbf{z}^{k+1})\right\|^{2}
+(2α2τkα(k+1)(k+2))𝐆(𝐳k),𝐆(𝐳k+1/2)+(α(k+2)22τk)𝐆(𝐳k+1/2),𝐆(𝐳k+1)\displaystyle\quad+\left(2\alpha^{2}\tau_{k}-\alpha(k+1)(k+2)\right)\,\left\langle\mathbf{G}(\mathbf{z}^{k}),\mathbf{G}(\mathbf{z}^{k+1/2})\right\rangle+\left(\alpha(k+2)^{2}-2\tau_{k}\right)\,\left\langle\mathbf{G}(\mathbf{z}^{k+1/2}),\mathbf{G}(\mathbf{z}^{k+1})\right\rangle
=Tr(𝐌k𝐒k𝐌k),\displaystyle=\operatorname{\mathrm{Tr}}\left(\mathbf{M}_{k}\mathbf{S}_{k}\mathbf{M}_{k}^{\intercal}\right),

where we define 𝐌k:=[𝐆(𝐳k)𝐆(𝐳k+1/2)𝐆(𝐳k+1)]\mathbf{M}_{k}:=\begin{bmatrix}\mathbf{G}(\mathbf{z}^{k})&\mathbf{G}(\mathbf{z}^{k+1/2})&\mathbf{G}(\mathbf{z}^{k+1})\end{bmatrix} and

𝐒k:=[Akα2τkα2τkα2(k+1)(k+2)0α2τkα2(k+1)(k+2)τk(1α2)α2(k+2)2τk0α2(k+2)2τkτkAk+1].\displaystyle\mathbf{S}_{k}:=\begin{bmatrix}A_{k}-\alpha^{2}\tau_{k}&\alpha^{2}\tau_{k}-\frac{\alpha}{2}(k+1)(k+2)&0\\ \alpha^{2}\tau_{k}-\frac{\alpha}{2}(k+1)(k+2)&\tau_{k}(1-\alpha^{2})&\frac{\alpha}{2}(k+2)^{2}-\tau_{k}\\ 0&\frac{\alpha}{2}(k+2)^{2}-\tau_{k}&\tau_{k}-A_{k+1}\end{bmatrix}. (34)

If 𝐒k𝐎\mathbf{S}_{k}\succeq\mathbf{O}, then Tr(𝐌k𝐒k𝐌k)=Tr(𝐒k𝐌k𝐌k)0\operatorname{\mathrm{Tr}}\left(\mathbf{M}_{k}\mathbf{S}_{k}\mathbf{M}_{k}^{\intercal}\right)=\operatorname{\mathrm{Tr}}\left(\mathbf{S}_{k}\mathbf{M}_{k}^{\intercal}\mathbf{M}_{k}\right)\geq 0 because the positive semidefinite cone is self-dual with respect to the matrix inner product 𝐀,𝐁=Tr(𝐀𝐁)\langle\mathbf{A},\mathbf{B}\rangle=\operatorname{\mathrm{Tr}}(\mathbf{A}^{\intercal}\mathbf{B}). Because Bk=k+1B_{k}=k+1 grows linearly, provided that the sequence {Ak}\{A_{k}\} grows quadratically, we can derive 𝒪(1/k2)\mathcal{O}(1/k^{2}) convergence by a similar line of argument as in the proof of Theorem 2. This reduction of the proof to a search for appropriate parameters (i.e., τk\tau_{k}) that satisfy the semidefiniteness constraints (𝐒k𝐎\mathbf{S}_{k}\succeq\mathbf{O} in our case) while allowing the desired growth rate of the Lyapunov function coefficients (AkA_{k} in our case) was inspired by the works of Taylor et al. (2017) and Taylor & Bach (2019). In what follows, we demonstrate that careful choices of A0A_{0} and τk\tau_{k} make AkA_{k} asymptotically close to α(k+1)(k+2)2\frac{\alpha(k+1)(k+2)}{2}, so quadratic growth is guaranteed. We begin with the following lemma, which will be used throughout the proof.

Lemma 5.

Let k0k\in\mathbb{N}_{\geq 0} and α(0,12]\alpha\in\left(0,\frac{1}{2}\right] be fixed, and define

k:=α(k+2)(k+1+kα)2(1+α),uk:=α(k+2)(k+1kα)2(1α).\displaystyle\ell_{k}:=\frac{\alpha(k+2)(k+1+k\alpha)}{2(1+\alpha)},\quad u_{k}:=\frac{\alpha(k+2)(k+1-k\alpha)}{2(1-\alpha)}.

Then,

uk>α(k+1)(k+2)2\displaystyle u_{k}>\frac{\alpha(k+1)(k+2)}{2} >k\displaystyle>\ell_{k} (35)
α(k+1)(k+1+α(k+2))2(1+α)\displaystyle\geq\frac{\alpha(k+1)(k+1+\alpha(k+2))}{2(1+\alpha)} (36)
α(k+1)2α3k(k+2)2(1α2)\displaystyle\geq\frac{\alpha(k+1)^{2}-\alpha^{3}k(k+2)}{2(1-\alpha^{2})} (37)
max{α(k+1)(k+1α(k+2))2(1α),α2(k+1)(k+2)1+α}\displaystyle\geq\max\left\{\frac{\alpha(k+1)(k+1-\alpha(k+2))}{2(1-\alpha)},\frac{\alpha^{2}(k+1)(k+2)}{1+\alpha}\right\} (38)
α2(k+1)(k+2)+α3(k+2)22(1+α).\displaystyle\geq\frac{\alpha^{2}(k+1)(k+2)+\alpha^{3}(k+2)^{2}}{2(1+\alpha)}. (39)

We shall prove Lemma 5 after the proof of the main theorem; for now, we focus on why these results are needed. Observe that all the quantities in lines (35) through (37) are asymptotically close to αk22\frac{\alpha k^{2}}{2}. We show that AkIk:=[k,uk]A_{k}\in I_{k}:=[\ell_{k},u_{k}] for all k0k\geq 0, which implies the quadratic growth. The quantities in Lemma 5 are used for choosing the right τk\tau_{k} and for establishing the positive semidefiniteness of 𝐒k\mathbf{S}_{k}.

Subdivide the interval IkI_{k} into two parts:

Ik=[k,α(k+1)(k+2)2],Ik+=[α(k+1)(k+2)2,uk].\displaystyle I_{k}^{-}=\left[\ell_{k},\frac{\alpha(k+1)(k+2)}{2}\right],\quad I_{k}^{+}=\left[\frac{\alpha(k+1)(k+2)}{2},u_{k}\right].

We consider two cases: AkIkA_{k}\in I_{k}^{-} and AkIk+A_{k}\in I_{k}^{+}. The latter case is in fact not needed unless one wishes to extend the proof to step-sizes α\alpha beyond 0.1265R\frac{0.1265}{R}; readers not interested in such extensions may refer to Case 1 only. Nevertheless, we present the analysis of both cases, because Case 2 may provide useful information for enlarging, or even completely determining, the range of convergent step-sizes for EAG-C.

Case 1. Suppose that AkIkA_{k}\in I_{k}^{-}. In this case, we choose

τk=(k+2)2(2(1α)Akα(k+1)(k+1α(k+2)))2(α(k+2)(k+1kα)2(1α)Ak).\displaystyle\tau_{k}=\frac{(k+2)^{2}\left(2(1-\alpha)A_{k}-\alpha(k+1)(k+1-\alpha(k+2))\right)}{2\left(\alpha(k+2)(k+1-k\alpha)-2(1-\alpha)A_{k}\right)}. (40)

The denominator and numerator of (40) are both positive because uk>Ak>α(k+1)(k+1α(k+2))2(1α)u_{k}>A_{k}>\frac{\alpha(k+1)(k+1-\alpha(k+2))}{2(1-\alpha)} (see (38)). Thus, τk>0\tau_{k}>0. Next, define Ak+1A_{k+1} as

Ak+1\displaystyle A_{k+1} =α(k+2)2(4(1α)Akα(k+1α(k+2))2)4(1α)((1α)Ak+α2(k+1)(k+2))\displaystyle=\frac{\alpha(k+2)^{2}\left(4(1-\alpha)A_{k}-\alpha(k+1-\alpha(k+2))^{2}\right)}{4(1-\alpha)\left((1-\alpha)A_{k}+\alpha^{2}(k+1)(k+2)\right)}
=α(k+2)21α(1α(k+1+α(k+2))24((1α)Ak+α2(k+1)(k+2))).\displaystyle=\frac{\alpha(k+2)^{2}}{1-\alpha}\left(1-\frac{\alpha(k+1+\alpha(k+2))^{2}}{4((1-\alpha)A_{k}+\alpha^{2}(k+1)(k+2))}\right). (41)

Then (34) can be rewritten as

𝐒k=[s11s120s12s22s230s23s33],\displaystyle\mathbf{S}_{k}=\begin{bmatrix}s_{11}&s_{12}&0\\ s_{12}&s_{22}&s_{23}\\ 0&s_{23}&s_{33}\end{bmatrix},

where

s11=(α(k+1)(k+2)2Ak)(2(1α)Ak+α2(k+1)(k+2)α3(k+2)2)2(α(k+2)(k+1kα)2(1α)Ak)\displaystyle s_{11}=\frac{\left(\alpha(k+1)(k+2)-2A_{k}\right)\left(2(1-\alpha)A_{k}+\alpha^{2}(k+1)(k+2)-\alpha^{3}(k+2)^{2}\right)}{2\left(\alpha(k+2)(k+1-k\alpha)-2(1-\alpha)A_{k}\right)} (42)
s12=α(1α)(k+2)(k+1+α(k+2))(α(k+1)(k+2)2Ak)2(α(k+2)(k+1kα)2(1α)Ak)\displaystyle s_{12}=-\frac{\alpha(1-\alpha)(k+2)(k+1+\alpha(k+2))(\alpha(k+1)(k+2)-2A_{k})}{2\left(\alpha(k+2)(k+1-k\alpha)-2(1-\alpha)A_{k}\right)} (43)
s22=(1α2)(k+2)2(2(1α)Akα(k+1)(k+1α(k+2)))2(α(k+2)(k+1kα)2(1α)Ak)\displaystyle s_{22}=\frac{(1-\alpha^{2})(k+2)^{2}\left(2(1-\alpha)A_{k}-\alpha(k+1)(k+1-\alpha(k+2))\right)}{2\left(\alpha(k+2)(k+1-k\alpha)-2(1-\alpha)A_{k}\right)} (44)
s23=(k+2)2(2(1α2)Akα(k+1)2+α3k(k+2))2(α(k+2)(k+1kα)2(1α)Ak)\displaystyle s_{23}=-\frac{(k+2)^{2}\left(2(1-\alpha^{2})A_{k}-\alpha(k+1)^{2}+\alpha^{3}k(k+2)\right)}{2\left(\alpha(k+2)(k+1-k\alpha)-2(1-\alpha)A_{k}\right)} (45)
s33=(k+2)2(2(1α2)Akα(k+1)2+α3k(k+2))(2(1α)Ak+α2(k+1)(k+2)α3(k+2)2)4(1α)(α(k+2)(k+1kα)2(1α)Ak)((1α)Ak+α2(k+1)(k+2)).\displaystyle s_{33}=\frac{(k+2)^{2}\left(2(1-\alpha^{2})A_{k}-\alpha(k+1)^{2}+\alpha^{3}k(k+2)\right)\left(2(1-\alpha)A_{k}+\alpha^{2}(k+1)(k+2)-\alpha^{3}(k+2)^{2}\right)}{4(1-\alpha)\left(\alpha(k+2)(k+1-k\alpha)-2(1-\alpha)A_{k}\right)\left((1-\alpha)A_{k}+\alpha^{2}(k+1)(k+2)\right)}. (46)

The expressions are complicated, but they contain a number of repeated factors. Let

E1=α(k+2)(k+1kα)2(1α)Ak\displaystyle E_{1}=\alpha(k+2)(k+1-k\alpha)-2(1-\alpha)A_{k}
E2=α(k+1)(k+2)2Ak.\displaystyle E_{2}=\alpha(k+1)(k+2)-2A_{k}.

Because Akα(k+1)(k+2)2<ukA_{k}\leq\frac{\alpha(k+1)(k+2)}{2}<u_{k} (see (35)), we have E1>0,E20E_{1}>0,E_{2}\geq 0. (Note that E2=0E_{2}=0 only in the boundary case Ak=supIkA_{k}=\sup I_{k}^{-}.) Next, put

E3=2(1α)Akα(k+1)(k+1α(k+2)),\displaystyle E_{3}=2(1-\alpha)A_{k}-\alpha(k+1)(k+1-\alpha(k+2)),

which is a factor that appears within the definition of τk\tau_{k} (40); we have already seen that E3>0E_{3}>0. Further, let

E4=2(1α)Ak+α2(k+1)(k+2)α3(k+2)2\displaystyle E_{4}=2(1-\alpha)A_{k}+\alpha^{2}(k+1)(k+2)-\alpha^{3}(k+2)^{2}
E5=(1α)Ak+α2(k+1)(k+2)\displaystyle E_{5}=(1-\alpha)A_{k}+\alpha^{2}(k+1)(k+2)
E6=2(1α2)Akα(k+1)2+α3k(k+2)\displaystyle E_{6}=2(1-\alpha^{2})A_{k}-\alpha(k+1)^{2}+\alpha^{3}k(k+2)
E7=k+1+α(k+2).\displaystyle E_{7}=k+1+\alpha(k+2).

It is obvious that E5,E7>0E_{5},E_{7}>0, and E6>0E_{6}>0 follows directly from (37). To see that E4>0E_{4}>0, observe that k+1α(k+2)=(k+2)(k+1k+2α)(k+2)(12α)0k+1-\alpha(k+2)=(k+2)\left(\frac{k+1}{k+2}-\alpha\right)\geq(k+2)\left(\frac{1}{2}-\alpha\right)\geq 0, provided that α12\alpha\leq\frac{1}{2}. This implies

E4=2(1α)Ak+α2(k+2)(k+1(k+2)α)>0.\displaystyle E_{4}=2(1-\alpha)A_{k}+\alpha^{2}(k+2)\left(k+1-(k+2)\alpha\right)>0.

Now we can rewrite (42) through (46) as

s11=E2E42E1\displaystyle s_{11}=\frac{E_{2}E_{4}}{2E_{1}}
s12=α(1α)(k+2)E2E72E1\displaystyle s_{12}=-\frac{\alpha(1-\alpha)(k+2)E_{2}E_{7}}{2E_{1}}
s22=(1α2)(k+2)2E32E1\displaystyle s_{22}=\frac{(1-\alpha^{2})(k+2)^{2}E_{3}}{2E_{1}}
s23=(k+2)2E62E1\displaystyle s_{23}=-\frac{(k+2)^{2}E_{6}}{2E_{1}}
s33=(k+2)2E4E64(1α)E1E5.\displaystyle s_{33}=\frac{(k+2)^{2}E_{4}E_{6}}{4(1-\alpha)E_{1}E_{5}}.

This immediately shows that the diagonal entries siis_{ii} are nonnegative for i=1,2,3i=1,2,3. By brute-force calculation, it is not difficult to verify the identity

(1+α)E3E4=α2(1α)E2E72+2E5E6.\displaystyle(1+\alpha)E_{3}E_{4}=\alpha^{2}(1-\alpha)E_{2}E_{7}^{2}+2E_{5}E_{6}.

Using this, we see that 𝐯:=[α(k+2)E72E5E42(1α)E51]\mathbf{v}:=\begin{bmatrix}\frac{\alpha(k+2)E_{7}}{2E_{5}}&\frac{E_{4}}{2(1-\alpha)E_{5}}&1\end{bmatrix}^{\intercal} satisfies 𝐒k𝐯=0\mathbf{S}_{k}\mathbf{v}=0, and this implies det𝐒k=0\det\mathbf{S}_{k}=0. The cofactor-expansion of det𝐒k\det\mathbf{S}_{k} along the first row gives

0=det𝐒k=s11|s22s23s23s33|s12|s12s230s33||s22s23s23s33|=s122s33s11>0\displaystyle 0=\det\mathbf{S}_{k}=s_{11}\begin{vmatrix}s_{22}&s_{23}\\ s_{23}&s_{33}\end{vmatrix}-s_{12}\begin{vmatrix}s_{12}&s_{23}\\ 0&s_{33}\end{vmatrix}\iff\begin{vmatrix}s_{22}&s_{23}\\ s_{23}&s_{33}\end{vmatrix}=\frac{s_{12}^{2}s_{33}}{s_{11}}>0

when s11>0s_{11}>0, and by a continuity argument we conclude that |s22s23s23s33|0\begin{vmatrix}s_{22}&s_{23}\\ s_{23}&s_{33}\end{vmatrix}\geq 0 even in the boundary case s11=0s_{11}=0. Similarly, one can show that |s11s12s12s22|0\begin{vmatrix}s_{11}&s_{12}\\ s_{12}&s_{22}\end{vmatrix}\geq 0. Therefore, we have shown that all principal submatrices of 𝐒k\mathbf{S}_{k} (including the trivial case |s1100s33|=s11s330\begin{vmatrix}s_{11}&0\\ 0&s_{33}\end{vmatrix}=s_{11}s_{33}\geq 0) have nonnegative determinants, that is, 𝐒k𝐎\mathbf{S}_{k}\succeq\mathbf{O}.
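
The brute-force verification of the key identity above can be delegated to a computer algebra system; the following is a small sketch using sympy, with α and A_k treated as free symbols.

```python
import sympy as sp

# Symbolic check of the identity (1+a) E3 E4 = a^2 (1-a) E2 E7^2 + 2 E5 E6,
# where a stands for alpha and A stands for A_k.
a, k, A = sp.symbols('a k A')
E2 = a * (k + 1) * (k + 2) - 2 * A
E3 = 2 * (1 - a) * A - a * (k + 1) * (k + 1 - a * (k + 2))
E4 = 2 * (1 - a) * A + a ** 2 * (k + 1) * (k + 2) - a ** 3 * (k + 2) ** 2
E5 = (1 - a) * A + a ** 2 * (k + 1) * (k + 2)
E6 = 2 * (1 - a ** 2) * A - a * (k + 1) ** 2 + a ** 3 * k * (k + 2)
E7 = k + 1 + a * (k + 2)
print(sp.expand((1 + a) * E3 * E4 - a ** 2 * (1 - a) * E2 * E7 ** 2 - 2 * E5 * E6))  # 0
```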

Finally, (41) shows that Ak+1A_{k+1} is increasing with respect to AkA_{k}. We see that

Ak+1|Ak=α(k+1)(k+2)2=α(k+2)((k+1)(k+3)α2(k+2)2)2(1α2)(k+1)<α(k+2)(k+3)2\displaystyle A_{k+1}\Big{|}_{A_{k}=\frac{\alpha(k+1)(k+2)}{2}}=\frac{\alpha(k+2)((k+1)(k+3)-\alpha^{2}(k+2)^{2})}{2(1-\alpha^{2})(k+1)}<\frac{\alpha(k+2)(k+3)}{2} (47)

and

Ak+1|Ak=kk+1=α2((13αα2α3)k+18α+α22α3)2(1α2)((1+α)2k+1+α+2α2),\displaystyle A_{k+1}|_{A_{k}=\ell_{k}}-\ell_{k+1}=\frac{\alpha^{2}\left((1-3\alpha-\alpha^{2}-\alpha^{3})k+1-8\alpha+\alpha^{2}-2\alpha^{3}\right)}{2(1-\alpha^{2})\left((1+\alpha)^{2}k+1+\alpha+2\alpha^{2}\right)},

and the last expression is nonnegative because of the assumption (4), which we restate here for the case R=1R=1 for convenience: 13αα2α301-3\alpha-\alpha^{2}-\alpha^{3}\geq 0 and 18α+α22α301-8\alpha+\alpha^{2}-2\alpha^{3}\geq 0. This proves that Ak+1Ik+1Ik+1A_{k+1}\in I_{k+1}^{-}\subset I_{k+1}, as desired.
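
As a numerical sanity check of Case 1 (with R=1 and a step-size satisfying the assumption above, e.g. α = 0.1), one can iterate (40) and (41) from A_0 = ℓ_0 = α/(1+α), assemble 𝐒_k as in (34), and confirm that its smallest eigenvalue is zero up to floating-point error. A minimal sketch:

```python
import numpy as np

# Iterate tau_k via (40) and A_{k+1} via (41), assemble S_k as in (34), and track the
# smallest eigenvalue of S_k relative to ||S_k||; it stays at ~0, i.e., S_k is PSD.
alpha = 0.1                      # satisfies 1-3a-a^2-a^3 >= 0 and 1-8a+a^2-2a^3 >= 0
A_k = alpha / (1 + alpha)        # A_0 = ell_0
worst = 0.0
for k in range(200):
    tau = ((k + 2) ** 2 * (2 * (1 - alpha) * A_k
                           - alpha * (k + 1) * (k + 1 - alpha * (k + 2)))
           / (2 * (alpha * (k + 2) * (k + 1 - k * alpha) - 2 * (1 - alpha) * A_k)))
    A_next = (alpha * (k + 2) ** 2
              * (4 * (1 - alpha) * A_k - alpha * (k + 1 - alpha * (k + 2)) ** 2)
              / (4 * (1 - alpha) * ((1 - alpha) * A_k + alpha ** 2 * (k + 1) * (k + 2))))
    S = np.array([
        [A_k - alpha ** 2 * tau, alpha ** 2 * tau - alpha * (k + 1) * (k + 2) / 2, 0.0],
        [alpha ** 2 * tau - alpha * (k + 1) * (k + 2) / 2, tau * (1 - alpha ** 2),
         alpha * (k + 2) ** 2 / 2 - tau],
        [0.0, alpha * (k + 2) ** 2 / 2 - tau, tau - A_next],
    ])
    worst = min(worst, np.linalg.eigvalsh(S).min() / np.linalg.norm(S))
    A_k = A_next
print(worst)                            # ~0 up to rounding, so every S_k checked is PSD
print(A_k / (alpha * 201 * 202 / 2))    # close to 1: A_k grows like alpha*(k+1)(k+2)/2
```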

Case 2. Suppose that AkIk+A_{k}\in I_{k}^{+}. The proof is similar to Case 1, but the choices of τk\tau_{k} and Ak+1A_{k+1} are different. We let

τk=(k+2)2(2(1+α)Akα(k+1)(k+1+α(k+2)))4(1+α)Ak2α(k+2)(k+1+kα).\displaystyle\tau_{k}=\frac{(k+2)^{2}\left(2(1+\alpha)A_{k}-\alpha(k+1)(k+1+\alpha(k+2))\right)}{4(1+\alpha)A_{k}-2\alpha(k+2)(k+1+k\alpha)}. (48)

Since Ak>k>α(k+1)(k+1+α(k+2))2(1+α)A_{k}>\ell_{k}>\frac{\alpha(k+1)(k+1+\alpha(k+2))}{2(1+\alpha)}, the denominator and numerator of (48) are both positive and thus τk>0\tau_{k}>0. Next, let

Ak+1\displaystyle A_{k+1} =α(k+2)2(4(1+α)Akα(k+1+α(k+2))2)4(1+α)((1+α)Akα2(k+1)(k+2))\displaystyle=\frac{\alpha(k+2)^{2}\left(4(1+\alpha)A_{k}-\alpha(k+1+\alpha(k+2))^{2}\right)}{4(1+\alpha)\left((1+\alpha)A_{k}-\alpha^{2}(k+1)(k+2)\right)}
=α(k+2)21+α(1α(k+1α(k+2))24((1+α)Akα2(k+1)(k+2))).\displaystyle=\frac{\alpha(k+2)^{2}}{1+\alpha}\left(1-\frac{\alpha(k+1-\alpha(k+2))^{2}}{4((1+\alpha)A_{k}-\alpha^{2}(k+1)(k+2))}\right). (49)

Then we can check that

s11=(2Akα(k+1)(k+2))(2(1+α)Akα2(k+1)(k+2)α3(k+2)2)4(1+α)Ak2α(k+2)(k+1+kα)\displaystyle s_{11}=\frac{(2A_{k}-\alpha(k+1)(k+2))(2(1+\alpha)A_{k}-\alpha^{2}(k+1)(k+2)-\alpha^{3}(k+2)^{2})}{4(1+\alpha)A_{k}-2\alpha(k+2)(k+1+k\alpha)}
s33=(k+2)2(2(1+α)Akα2(k+1)(k+2)α3(k+2)2)(2(1α2)Akα(k+1)2+α3k(k+2))4(1+α)(2(1+α)Akα(k+2)(k+1+kα))(2(1+α)Akα2(k+1)(k+2)),\displaystyle s_{33}=\frac{(k+2)^{2}\left(2(1+\alpha)A_{k}-\alpha^{2}(k+1)(k+2)-\alpha^{3}(k+2)^{2}\right)\left(2(1-\alpha^{2})A_{k}-\alpha(k+1)^{2}+\alpha^{3}k(k+2)\right)}{4(1+\alpha)\left(2(1+\alpha)A_{k}-\alpha(k+2)(k+1+k\alpha)\right)\left(2(1+\alpha)A_{k}-\alpha^{2}(k+1)(k+2)\right)},

and so on. (Note that 2Akα(k+1)(k+2)02A_{k}-\alpha(k+1)(k+2)\geq 0 because we are now assuming that AkIk+A_{k}\in I_{k}^{+}.) We omit the details of the remaining calculations, but with the above choices of τk\tau_{k} and Ak+1A_{k+1} it can be shown that det𝐒k=0\det\mathbf{S}_{k}=0 and s11,s330s_{11},s_{33}\geq 0, using (36) through (39). As in Case 1, this implies 𝐒k𝐎\mathbf{S}_{k}\succeq\mathbf{O}.

The identity (49) shows that Ak+1A_{k+1} is increasing with respect to AkA_{k}. Interestingly, although (41) and (49) have distinct forms, for the boundary value Ak=α(k+1)(k+2)2A_{k}=\frac{\alpha(k+1)(k+2)}{2}, they evaluate to the same expression (47) and thus arguments from Case 1 readily show that Ak+1>k+1A_{k+1}>\ell_{k+1}. On the other hand, we have

uk+1Ak+1|Ak=uk=α2((1+3αα2+α3)k+1+8α+α2+2α3)2(1α2)((1α)2k+1α+2α2)\displaystyle u_{k+1}-A_{k+1}|_{A_{k}=u_{k}}=\frac{\alpha^{2}\left(\left(1+3\alpha-\alpha^{2}+\alpha^{3}\right)k+1+8\alpha+\alpha^{2}+2\alpha^{3}\right)}{2(1-\alpha^{2})\left((1-\alpha)^{2}k+1-\alpha+2\alpha^{2}\right)}

and the last term is positive for any α(0,1)\alpha\in(0,1), i.e., Ak+1<uk+1A_{k+1}<u_{k+1}. This completes Case 2.

Proof of the theorem statement. Given that AkIkA_{k}\in I_{k}^{-} implies Ak+1Ik+1A_{k+1}\in I_{k+1}^{-} (which has been proved in Case 1), the rest is easy. If we take A0=0=α1+αA_{0}=\ell_{0}=\frac{\alpha}{1+\alpha}, then because 𝐒k𝐎\mathbf{S}_{k}\succeq\mathbf{O} for all k0k\geq 0, we see that VkV_{k} is nonincreasing:

α1+α𝐳0𝐳2\displaystyle\frac{\alpha}{1+\alpha}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2} α1+α𝐆(𝐳0)2=V0Vk=Ak𝐆(𝐳k)2+(k+1)𝐳k𝐳0,𝐆(𝐳k),\displaystyle\geq\frac{\alpha}{1+\alpha}\left\|\mathbf{G}(\mathbf{z}^{0})\right\|^{2}=V_{0}\geq\cdots\geq V_{k}=A_{k}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}+(k+1)\left\langle\mathbf{z}^{k}-\mathbf{z}^{0},\mathbf{G}(\mathbf{z}^{k})\right\rangle,

where the first inequality follows from Lipschitzness of 𝐆\mathbf{G} (recall that we are assuming that R=1R=1). Also by (35) and (36),

Akk>α(k+1)(k+1+α(k+2))2(1+α)=α(k+1)2(1+α)(k+1)+α1+α>α(k+1)22.\displaystyle A_{k}\geq\ell_{k}>\frac{\alpha(k+1)(k+1+\alpha(k+2))}{2(1+\alpha)}=\frac{\alpha(k+1)}{2}\frac{(1+\alpha)(k+1)+\alpha}{1+\alpha}>\frac{\alpha(k+1)^{2}}{2}. (50)

Hence, we obtain

α1+α𝐳0𝐳2Vk\displaystyle\frac{\alpha}{1+\alpha}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}\geq V_{k} k𝐆(𝐳k)2+(k+1)𝐳k𝐳0,𝐆(𝐳k)\displaystyle\geq\ell_{k}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}+(k+1)\left\langle\mathbf{z}^{k}-\mathbf{z}^{0},\mathbf{G}(\mathbf{z}^{k})\right\rangle
(c)α(k+1)22𝐆(𝐳k)2+(k+1)𝐳𝐳0,𝐆(𝐳k)\displaystyle\stackrel{{\scriptstyle\textnormal{(c)}}}{{\mathstrut{\geq}}}\frac{\alpha(k+1)^{2}}{2}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}+(k+1)\left\langle\mathbf{z}^{\star}-\mathbf{z}^{0},\mathbf{G}(\mathbf{z}^{k})\right\rangle
(d)α(k+1)22𝐆(𝐳k)2(k+1)(1α(k+1)𝐳𝐳02+α(k+1)4𝐆(𝐳k)2),\displaystyle\stackrel{{\scriptstyle\textnormal{(d)}}}{{\mathstrut{\geq}}}\frac{\alpha(k+1)^{2}}{2}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}-(k+1)\left(\frac{1}{\alpha(k+1)}\|\mathbf{z}^{\star}-\mathbf{z}^{0}\|^{2}+\frac{\alpha(k+1)}{4}\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}\right),

where (c) follows from (50) and the monotonicity inequality 𝐳k𝐳,𝐆(𝐳k)0\langle\mathbf{z}^{k}-\mathbf{z}^{\star},\mathbf{G}(\mathbf{z}^{k})\rangle\geq 0, and (d) follows from Young’s inequality. Rearranging terms, we conclude that

𝐆(𝐳k)24α(k+1)2(α1+α+1α)𝐳0𝐳2=C𝐳0𝐳2(k+1)2,\displaystyle\left\|\mathbf{G}(\mathbf{z}^{k})\right\|^{2}\leq\frac{4}{\alpha(k+1)^{2}}\left(\frac{\alpha}{1+\alpha}+\frac{1}{\alpha}\right)\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}=\frac{C\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(k+1)^{2}},

where C=4(1+α+α2)α2(1+α)C=\frac{4(1+\alpha+\alpha^{2})}{\alpha^{2}(1+\alpha)}.
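
To illustrate the resulting bound, the following sketch runs EAG-C with R=1 and α = 0.1 (the update is reconstructed from the identities at the beginning of this subsection, with anchoring coefficient 1/(k+2)) on a bilinear example and monitors ‖𝐆(𝐳^k)‖²(k+1)²/(C‖𝐳^0−𝐳^⋆‖²), which stays below 1; the problem data are arbitrary illustrative choices.

```python
import numpy as np

# Bilinear L(x, y) = <A x - b, y - c> with ||A|| = R = 1; (x_star, c) is a saddle point.
rng = np.random.default_rng(1)
n = 10
A = rng.standard_normal((n, n))
A /= np.linalg.norm(A, 2)
x_star, c = rng.standard_normal(n), rng.standard_normal(n)
b = A @ x_star
z_star = np.concatenate([x_star, c])

def G(z):
    x, y = z[:n], z[n:]
    return np.concatenate([A.T @ (y - c), -(A @ x - b)])

alpha = 0.1
C = 4 * (1 + alpha + alpha ** 2) / (alpha ** 2 * (1 + alpha))
z0 = rng.standard_normal(2 * n)
z = z0.copy()
denom = C * np.linalg.norm(z0 - z_star) ** 2
worst_ratio = np.linalg.norm(G(z0)) ** 2 / denom            # the k = 0 case of the bound
for k in range(2000):
    # EAG-C step: both the half-step and the full step use the anchor z^0 with weight 1/(k+2).
    z_half = z + (z0 - z) / (k + 2) - alpha * G(z)
    z = z + (z0 - z) / (k + 2) - alpha * G(z_half)
    worst_ratio = max(worst_ratio,
                      np.linalg.norm(G(z)) ** 2 * (k + 2) ** 2 / denom)
print(worst_ratio)   # stays below 1, consistent with the O(1/k^2) bound above
```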

Proof of Lemma 5. Direct calculation gives

ukα(k+1)(k+2)2=α2(k+2)2(1α)>0\displaystyle u_{k}-\frac{\alpha(k+1)(k+2)}{2}=\frac{\alpha^{2}(k+2)}{2(1-\alpha)}>0
α(k+1)(k+2)2k=α2(k+2)2(1+α)>0,\displaystyle\frac{\alpha(k+1)(k+2)}{2}-\ell_{k}=\frac{\alpha^{2}(k+2)}{2(1+\alpha)}>0,

showing (35). Next,

kα(k+1)(k+1+α(k+2))2(1+α)=α(k+1α(k+2))2(1+α)0\displaystyle\ell_{k}-\frac{\alpha(k+1)(k+1+\alpha(k+2))}{2(1+\alpha)}=\frac{\alpha(k+1-\alpha(k+2))}{2(1+\alpha)}\geq 0

because k+1α(k+2)=(k+2)(k+1k+2α)(k+2)(12α)0k+1-\alpha(k+2)=(k+2)(\frac{k+1}{k+2}-\alpha)\geq(k+2)(\frac{1}{2}-\alpha)\geq 0, which shows (36). Similarly, we observe that

α(k+1)(k+1+α(k+2))2(1+α)α(k+1)2α3k(k+2)2(1α2)=α2(k+1α(k+2))2(1α2)\displaystyle\frac{\alpha(k+1)(k+1+\alpha(k+2))}{2(1+\alpha)}-\frac{\alpha(k+1)^{2}-\alpha^{3}k(k+2)}{2(1-\alpha^{2})}=\frac{\alpha^{2}(k+1-\alpha(k+2))}{2(1-\alpha^{2})} 0\displaystyle\geq 0
α(k+1)2α3k(k+2)2(1α2)α(k+1)(k+1α(k+2))2(1α)=α2(k+1+α(k+2))2(1α2)\displaystyle\frac{\alpha(k+1)^{2}-\alpha^{3}k(k+2)}{2(1-\alpha^{2})}-\frac{\alpha(k+1)(k+1-\alpha(k+2))}{2(1-\alpha)}=\frac{\alpha^{2}(k+1+\alpha(k+2))}{2(1-\alpha^{2})} >0\displaystyle>0
α(k+1)2α3k(k+2)2(1α2)α2(k+1)(k+2)1+α=α(k+1α(k+2))22(1α2)\displaystyle\frac{\alpha(k+1)^{2}-\alpha^{3}k(k+2)}{2(1-\alpha^{2})}-\frac{\alpha^{2}(k+1)(k+2)}{1+\alpha}=\frac{\alpha(k+1-\alpha(k+2))^{2}}{2(1-\alpha^{2})} 0\displaystyle\geq 0
α2(k+1)(k+2)1+αα2(k+1)(k+2)+α3(k+2)22(1+α)=α2(k+2)(k+1α(k+2))2(1+α)\displaystyle\frac{\alpha^{2}(k+1)(k+2)}{1+\alpha}-\frac{\alpha^{2}(k+1)(k+2)+\alpha^{3}(k+2)^{2}}{2(1+\alpha)}=\frac{\alpha^{2}(k+2)(k+1-\alpha(k+2))}{2(1+\alpha)} 0,\displaystyle\geq 0,

and each line corresponds to an inequality within (37), (38) and (39).
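
Each of the displayed differences is an elementary rational identity; for completeness, they can all be verified symbolically, e.g. with the following sympy sketch.

```python
import sympy as sp

# Symbolic verification of the differences computed in the proof of Lemma 5
# (a stands for alpha); every entry of the list below simplifies to 0.
a, k = sp.symbols('a k')
l_k = a * (k + 2) * (k + 1 + k * a) / (2 * (1 + a))
u_k = a * (k + 2) * (k + 1 - k * a) / (2 * (1 - a))
mid = a * (k + 1) * (k + 2) / 2
diffs = [
    u_k - mid - a ** 2 * (k + 2) / (2 * (1 - a)),
    mid - l_k - a ** 2 * (k + 2) / (2 * (1 + a)),
    l_k - a * (k + 1) * (k + 1 + a * (k + 2)) / (2 * (1 + a))
        - a * (k + 1 - a * (k + 2)) / (2 * (1 + a)),
    a * (k + 1) * (k + 1 + a * (k + 2)) / (2 * (1 + a))
        - (a * (k + 1) ** 2 - a ** 3 * k * (k + 2)) / (2 * (1 - a ** 2))
        - a ** 2 * (k + 1 - a * (k + 2)) / (2 * (1 - a ** 2)),
    (a * (k + 1) ** 2 - a ** 3 * k * (k + 2)) / (2 * (1 - a ** 2))
        - a * (k + 1) * (k + 1 - a * (k + 2)) / (2 * (1 - a))
        - a ** 2 * (k + 1 + a * (k + 2)) / (2 * (1 - a ** 2)),
    (a * (k + 1) ** 2 - a ** 3 * k * (k + 2)) / (2 * (1 - a ** 2))
        - a ** 2 * (k + 1) * (k + 2) / (1 + a)
        - a * (k + 1 - a * (k + 2)) ** 2 / (2 * (1 - a ** 2)),
    a ** 2 * (k + 1) * (k + 2) / (1 + a)
        - (a ** 2 * (k + 1) * (k + 2) + a ** 3 * (k + 2) ** 2) / (2 * (1 + a))
        - a ** 2 * (k + 2) * (k + 1 - a * (k + 2)) / (2 * (1 + a)),
]
print([sp.simplify(d) for d in diffs])   # [0, 0, 0, 0, 0, 0, 0]
```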

Appendix C Omitted proofs of Section 3

In this section, we provide a self-contained discussion on the complexity lower bound results for linear operator equations from Nemirovsky (1991, 1992).

C.1 Proof of Theorem 3

The proof of Theorem 3 was essentially completed in the main body of the paper, except for the translation argument leading to (13) and the proof of Lemma 3.

We first provide the precise meaning of the translation invariance that we are to prove. Given a saddle function 𝐋\mathbf{L} and 𝐳n×n\mathbf{z}\in\mathbb{R}^{n}\times\mathbb{R}^{n}, let 𝐳𝐋(𝐳)\mathbf{z}_{\mathbf{L}}^{\star}(\mathbf{z}) be the saddle point of 𝐋\mathbf{L} nearest to 𝐳\mathbf{z}. For any 𝐳0n×n\mathbf{z}^{0}\in\mathbb{R}^{n}\times\mathbb{R}^{n}, k0k\geq 0 and D>0D>0, define

𝔗(𝐳0;k,D):={𝐳k|𝐋(𝐱,𝐲)=𝐀𝐱𝐛,𝐲𝐜,𝐀n×n,𝐛,𝐜n,𝐳𝐋(𝐳0)𝐳0D,𝐳j=𝒜(𝐳0,,𝐳j1;𝐋),j=1,,k,𝒜𝔄sep}.\displaystyle\mathfrak{T}\left(\mathbf{z}^{0};k,D\right):=\left\{\mathbf{z}^{k}\,\,\Bigg{|}\,\,\begin{aligned} &\mathbf{L}(\mathbf{x},\mathbf{y})=\langle\mathbf{A}\mathbf{x}-\mathbf{b},\mathbf{y}-\mathbf{c}\rangle,\,\,\mathbf{A}\in\mathbb{R}^{n\times n},\,\,\mathbf{b},\mathbf{c}\in\mathbb{R}^{n},\,\,\left\|\mathbf{z}_{\mathbf{L}}^{\star}\left(\mathbf{z}^{0}\right)-\mathbf{z}^{0}\right\|\leq D,\\ &\mathbf{z}^{j}=\mathcal{A}(\mathbf{z}^{0},\dots,\mathbf{z}^{j-1};\mathbf{L}),\,\,j=1,\dots,k,\,\,\mathcal{A}\in\mathfrak{A}_{\textrm{sep}}\end{aligned}\right\}.

We will show that

𝔗(𝐳0;k,D)=𝐳0+𝔗(0;k,D)\displaystyle\mathfrak{T}\left(\mathbf{z}^{0};k,D\right)=\mathbf{z}^{0}+\mathfrak{T}\left(0;k,D\right)

holds for any 𝐳0n×n\mathbf{z}^{0}\in\mathbb{R}^{n}\times\mathbb{R}^{n}.

Let 𝐳0=(𝐱0,𝐲0)\mathbf{z}^{0}=(\mathbf{x}^{0},\mathbf{y}^{0}) and 𝐋(𝐱,𝐲)=𝐀𝐱𝐛,𝐲𝐜\mathbf{L}(\mathbf{x},\mathbf{y})=\langle\mathbf{A}\mathbf{x}-\mathbf{b},\mathbf{y}-\mathbf{c}\rangle be given, and assume that 𝐳𝐋(𝐳0)𝐳0D\|\mathbf{z}_{\mathbf{L}}^{\star}(\mathbf{z}^{0})-\mathbf{z}^{0}\|\leq D. Let 𝐛0=𝐛𝐀𝐱0\mathbf{b}_{0}=\mathbf{b}-\mathbf{A}\mathbf{x}^{0} and 𝐜0=𝐜𝐲0\mathbf{c}_{0}=\mathbf{c}-\mathbf{y}^{0}. Then

𝐱𝐋(𝐱0,𝐲0)=𝐀(𝐲0𝐜)=𝐀𝐜0\displaystyle\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x}^{0},\mathbf{y}^{0})=\mathbf{A}^{\intercal}(\mathbf{y}^{0}-\mathbf{c})=-\mathbf{A}^{\intercal}\mathbf{c}_{0}
𝐲𝐋(𝐱0,𝐲0)=𝐀𝐱0𝐛=𝐛0.\displaystyle\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x}^{0},\mathbf{y}^{0})=\mathbf{A}\mathbf{x}^{0}-\mathbf{b}=-\mathbf{b}_{0}.

Hence, (11) with k=1k=1 reads as

𝐱1𝐱0span{𝐀𝐜0}=Δ𝒳1(𝐀;𝐛0,𝐜0)\displaystyle\mathbf{x}^{1}-\mathbf{x}^{0}\in\mathrm{span}\{\mathbf{A}^{\intercal}\mathbf{c}_{0}\}\stackrel{{\scriptstyle\Delta}}{{=}}\mathcal{X}_{1}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})
𝐲1𝐲0span{𝐛0}=Δ𝒴1(𝐀;𝐛0,𝐜0).\displaystyle\mathbf{y}^{1}-\mathbf{y}^{0}\in\mathrm{span}\{\mathbf{b}_{0}\}\stackrel{{\scriptstyle\Delta}}{{=}}\mathcal{Y}_{1}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0}).

This further shows that

𝐱𝐋(𝐱1,𝐲1)=𝐀(𝐲1𝐜)=𝐀(𝐲1𝐲0)𝐀𝐜0span{𝐀𝐛0,𝐀𝐜0}\displaystyle\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x}^{1},\mathbf{y}^{1})=\mathbf{A}^{\intercal}(\mathbf{y}^{1}-\mathbf{c})=\mathbf{A}^{\intercal}(\mathbf{y}^{1}-\mathbf{y}^{0})-\mathbf{A}^{\intercal}\mathbf{c}_{0}\in\mathrm{span}\{\mathbf{A}^{\intercal}\mathbf{b}_{0},\mathbf{A}^{\intercal}\mathbf{c}_{0}\}
𝐲𝐋(𝐱1,𝐲1)=𝐀𝐱1𝐛=𝐀(𝐱1𝐱0)𝐛0span{𝐀(𝐀𝐜0),𝐛0},\displaystyle\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x}^{1},\mathbf{y}^{1})=\mathbf{A}\mathbf{x}^{1}-\mathbf{b}=\mathbf{A}(\mathbf{x}^{1}-\mathbf{x}^{0})-\mathbf{b}_{0}\in\mathrm{span}\{\mathbf{A}(\mathbf{A}^{\intercal}\mathbf{c}_{0}),\mathbf{b}_{0}\},

and (11) with k=2k=2 becomes

𝐱2𝐱0span{𝐀𝐜0,𝐀𝐛0}=Δ𝒳2(𝐀;𝐛0,𝐜0)\displaystyle\mathbf{x}^{2}-\mathbf{x}^{0}\in\mathrm{span}\{\mathbf{A}^{\intercal}\mathbf{c}_{0},\mathbf{A}^{\intercal}\mathbf{b}_{0}\}\stackrel{{\scriptstyle\Delta}}{{=}}\mathcal{X}_{2}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})
𝐲2𝐲0span{𝐛0,𝐀𝐀𝐜0}=Δ𝒴2(𝐀;𝐛0,𝐜0).\displaystyle\mathbf{y}^{2}-\mathbf{y}^{0}\in\mathrm{span}\{\mathbf{b}_{0},\mathbf{A}\mathbf{A}^{\intercal}\mathbf{c}_{0}\}\stackrel{{\scriptstyle\Delta}}{{=}}\mathcal{Y}_{2}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0}).

Continuing this argument inductively, we have 𝐱k𝐱0𝒳k(𝐀;𝐛0,𝐜0)\mathbf{x}^{k}-\mathbf{x}^{0}\in\mathcal{X}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0}) and 𝐲k𝐲0𝒴k(𝐀;𝐛0,𝐜0)\mathbf{y}^{k}-\mathbf{y}^{0}\in\mathcal{Y}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0}), where we inductively define

𝒳k+1(𝐀;𝐛0,𝐜0)=span{𝐀𝐜0}+𝐀𝒴k(𝐀;𝐛0,𝐜0)\displaystyle\mathcal{X}_{k+1}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})=\mathrm{span}\{\mathbf{A}^{\intercal}\mathbf{c}_{0}\}+\mathbf{A}^{\intercal}\mathcal{Y}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})
𝒴k+1(𝐀;𝐛0,𝐜0)=span{𝐛0}+𝐀𝒳k(𝐀;𝐛0,𝐜0).\displaystyle\mathcal{Y}_{k+1}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})=\mathrm{span}\{\mathbf{b}_{0}\}+\mathbf{A}\mathcal{X}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0}).

Then it is not difficult to see that for k2k\geq 2,

𝒳k(𝐀;𝐛0,𝐜0)=span{𝐀𝐜0,𝐀(𝐀𝐀)𝐜0,,𝐀(𝐀𝐀)k12𝐜0}+span{𝐀𝐛0,𝐀(𝐀𝐀)𝐛0,,𝐀(𝐀𝐀)k21𝐛0}\displaystyle\mathcal{X}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})=\mathrm{span}\left\{\mathbf{A}^{\intercal}\mathbf{c}_{0},\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})\mathbf{c}_{0},\dots,\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k-1}{2}\rfloor}\mathbf{c}_{0}\right\}+\mathrm{span}\left\{\mathbf{A}^{\intercal}\mathbf{b}_{0},\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})\mathbf{b}_{0},\dots,\mathbf{A}^{\intercal}(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k}{2}\rfloor-1}\mathbf{b}_{0}\right\}
𝒴k(𝐀;𝐛0,𝐜0)=span{𝐛0,(𝐀𝐀)𝐛0,,(𝐀𝐀)k12𝐛0}+span{𝐀𝐀𝐜0,,(𝐀𝐀)k2𝐜0}.\displaystyle\mathcal{Y}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})=\mathrm{span}\left\{\mathbf{b}_{0},(\mathbf{A}\mathbf{A}^{\intercal})\mathbf{b}_{0},\dots,(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k-1}{2}\rfloor}\mathbf{b}_{0}\right\}+\mathrm{span}\left\{\mathbf{A}\mathbf{A}^{\intercal}\mathbf{c}_{0},\dots,(\mathbf{A}\mathbf{A}^{\intercal})^{\lfloor\frac{k}{2}\rfloor}\mathbf{c}_{0}\right\}.

Now consider 𝐋0(𝐱,𝐲):=𝐀𝐱𝐛0,𝐲𝐜0=𝐀(𝐱+𝐱0)𝐛,𝐲+𝐲0𝐜\mathbf{L}_{0}(\mathbf{x},\mathbf{y}):=\langle\mathbf{A}\mathbf{x}-\mathbf{b}_{0},\mathbf{y}-\mathbf{c}_{0}\rangle=\left\langle\mathbf{A}(\mathbf{x}+\mathbf{x}^{0})-\mathbf{b},\mathbf{y}+\mathbf{y}^{0}-\mathbf{c}\right\rangle. Because 𝐳𝐋0\mathbf{z}^{\star}_{\mathbf{L}_{0}} is a saddle point of 𝐋0\mathbf{L}_{0} if and only if 𝐳𝐋0+𝐳0\mathbf{z}^{\star}_{\mathbf{L}_{0}}+\mathbf{z}^{0} is a saddle point of 𝐋\mathbf{L}, we have 𝐳𝐋0(0)=𝐳𝐋(𝐳0)𝐳0\mathbf{z}^{\star}_{\mathbf{L}_{0}}(0)=\mathbf{z}^{\star}_{\mathbf{L}}(\mathbf{z}^{0})-\mathbf{z}^{0}, and thus 𝐳𝐋0(0)D\|\mathbf{z}^{\star}_{\mathbf{L}_{0}}(0)\|\leq D. Therefore, if we let

𝒮(𝐀;D)=Δ{(𝐛~,𝐜~)n×n|𝐳𝐋~(0)D, where 𝐋~(𝐱,𝐲)=𝐀𝐱𝐛~,𝐲𝐜~},\displaystyle\mathcal{S}(\mathbf{A};D)\stackrel{{\scriptstyle\Delta}}{{=}}\left\{(\tilde{\mathbf{b}},\tilde{\mathbf{c}})\in\mathbb{R}^{n}\times\mathbb{R}^{n}\,\bigg{|}\,\left\|\mathbf{z}^{\star}_{\tilde{\mathbf{L}}}(0)\right\|\leq D,\textnormal{ where }\tilde{\mathbf{L}}(\mathbf{x},\mathbf{y})=\langle\mathbf{A}\mathbf{x}-\tilde{\mathbf{b}},\mathbf{y}-\tilde{\mathbf{c}}\rangle\right\},

then

𝔗(𝐳0;k,D)=𝐀n×n(𝐛0,𝐜0)𝒮(𝐀;D)𝐳0+(𝒳k(𝐀;𝐛0,𝐜0)×𝒴k(𝐀;𝐛0,𝐜0)).\displaystyle\mathfrak{T}\left(\mathbf{z}^{0};k,D\right)=\bigcup_{\begin{subarray}{c}\mathbf{A}\in\mathbb{R}^{n\times n}\\ (\mathbf{b}_{0},\mathbf{c}_{0})\in\mathcal{S}(\mathbf{A};D)\end{subarray}}\mathbf{z}^{0}+\left(\mathcal{X}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})\times\mathcal{Y}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})\right).

This proves that the translation invariance holds with 𝔗(0;k,D)=𝐀n×n(𝐛0,𝐜0)𝒮(𝐀;D)(𝒳k(𝐀;𝐛0,𝐜0)×𝒴k(𝐀;𝐛0,𝐜0))\mathfrak{T}(0;k,D)=\bigcup_{\begin{subarray}{c}\mathbf{A}\in\mathbb{R}^{n\times n}\\ (\mathbf{b}_{0},\mathbf{c}_{0})\in\mathcal{S}(\mathbf{A};D)\end{subarray}}\left(\mathcal{X}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})\times\mathcal{Y}_{k}(\mathbf{A};\mathbf{b}_{0},\mathbf{c}_{0})\right) and in particular, shows (13).

C.2 Complexity of solving linear operator equations and minimax polynomials

We first make some general observations. Suppose that we are given a symmetric matrix 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n}, 𝐛n\mathbf{b}\in\mathbb{R}^{n}, and an integer k1k\geq 1. Then any 𝐱𝒦k1(𝐀;𝐛)=span{𝐛,𝐀𝐛,,𝐀k1𝐛}\mathbf{x}\in\mathcal{K}_{k-1}(\mathbf{A};\mathbf{b})=\mathrm{span}\{\mathbf{b},\mathbf{A}\mathbf{b},\dots,\mathbf{A}^{k-1}\mathbf{b}\} can be expressed in the form

𝐱=q(𝐀)𝐛,where q(t)=q0+q1t++qk1tk1,\displaystyle\mathbf{x}=q(\mathbf{A})\mathbf{b},\quad\text{where }q(t)=q_{0}+q_{1}t+\cdots+q_{k-1}t^{k-1},

for some q0,,qk1q_{0},\dots,q_{k-1}\in\mathbb{R}. Then we can write

𝐛𝐀𝐱=𝐛𝐀q(𝐀)𝐛=(𝐈𝐀q(𝐀))𝐛=p(𝐀)𝐛,\displaystyle\mathbf{b}-\mathbf{A}\mathbf{x}=\mathbf{b}-\mathbf{A}q(\mathbf{A})\mathbf{b}=(\mathbf{I}-\mathbf{A}q(\mathbf{A}))\mathbf{b}=p(\mathbf{A})\mathbf{b}, (51)

where p(t)=1tq(t)p(t)=1-tq(t) is a polynomial of degree at most kk satisfying p(0)=1p(0)=1. Note that conversely, given any polynomial p~(t)\tilde{p}(t) with degree k\leq k and constant term 11, one can decompose it as p~(t)=1tq~(t)\tilde{p}(t)=1-t\tilde{q}(t) and recover a polynomial q~\tilde{q} of degree k1\leq k-1 corresponding to 𝐱\mathbf{x}.

Now suppose further there exists 𝐱n\mathbf{x}^{\star}\in\mathbb{R}^{n} such that 𝐛=𝐀𝐱\mathbf{b}=\mathbf{A}\mathbf{x}^{\star} and 𝐱D\|\mathbf{x}^{\star}\|\leq D. The symmetric matrix 𝐀\mathbf{A} has an orthonormal eigenbasis 𝐯1,,𝐯n\mathbf{v}_{1},\dots,\mathbf{v}_{n}, corresponding to eigenvalues λ1,,λn\lambda_{1},\dots,\lambda_{n}, so we can write 𝐱=c1𝐯1++cn𝐯n\mathbf{x}^{\star}=c_{1}\mathbf{v}_{1}+\cdots+c_{n}\mathbf{v}_{n} for some c1,,cnc_{1},\dots,c_{n}\in\mathbb{R}. Using (51), we obtain

𝐀𝐱𝐛2=p(𝐀)𝐀𝐱2=j=1ncj𝐀p(𝐀)𝐯j2\displaystyle\left\|\mathbf{A}\mathbf{x}-\mathbf{b}\right\|^{2}=\left\|p(\mathbf{A})\mathbf{A}\mathbf{x}^{\star}\right\|^{2}=\left\|\sum_{j=1}^{n}c_{j}\mathbf{A}p(\mathbf{A})\mathbf{v}_{j}\right\|^{2} =j=1ncjλjp(λj)𝐯j2\displaystyle=\left\|\sum_{j=1}^{n}c_{j}\lambda_{j}p(\lambda_{j})\mathbf{v}_{j}\right\|^{2} (52)
=j=1ncj2λj2p(λj)2D2(maxj=1,,nλj2p(λj)2).\displaystyle=\sum_{j=1}^{n}c_{j}^{2}\lambda_{j}^{2}p(\lambda_{j})^{2}\leq D^{2}\left(\max_{j=1,\dots,n}\lambda_{j}^{2}p(\lambda_{j})^{2}\right).

We define the problem class by 𝐀R\|\mathbf{A}\|\leq R, which is equivalent to λj[R,R]\lambda_{j}\in[-R,R] for all j=1,,nj=1,\dots,n. Therefore, we consider a method corresponding to a polynomial q(t)q(t) such that p(t)=1tq(t)p(t)=1-tq(t) minimizes

maxλ[R,R]λ2p(λ)2=(maxλ[R,R]|λp(λ)|)2.\displaystyle\max_{\lambda\in[-R,R]}\lambda^{2}p(\lambda)^{2}=\left(\max_{\lambda\in[-R,R]}\left|\lambda p(\lambda)\right|\right)^{2}.

More precisely, if pk(t)=1tqk(t)p_{k}^{\star}(t)=1-tq_{k}^{\star}(t) minimizes the last quantity among all p(t)p(t) such that degpk\deg p\leq k and p(0)=1p(0)=1, and if we put 𝐱k=qk(𝐀)𝐛\mathbf{x}^{k}=q_{k}^{\star}(\mathbf{A})\mathbf{b}, then (52) implies

𝐀𝐱k𝐛2=j=1ncj2λj2(pk(λj))2D2M(k,R)2\displaystyle\left\|\mathbf{A}\mathbf{x}^{k}-\mathbf{b}\right\|^{2}=\sum_{j=1}^{n}c_{j}^{2}\lambda_{j}^{2}\left(p_{k}^{\star}(\lambda_{j})\right)^{2}\leq D^{2}M^{\star}(k,R)^{2}
M(k,R)=Δmindegpkp(0)=1maxλ[R,R]|λp(λ)|,\displaystyle M^{\star}(k,R)\stackrel{{\scriptstyle\Delta}}{{=}}\min_{\begin{subarray}{c}\deg p\leq k\\ p(0)=1\end{subarray}}\max_{\lambda\in[-R,R]}|\lambda p(\lambda)|, (53)

for all 𝐀\mathbf{A} whose spectrum belongs to [R,R][-R,R] and 𝐛=𝐀𝐱\mathbf{b}=\mathbf{A}\mathbf{x}^{\star} with 𝐱D\|\mathbf{x}^{\star}\|\leq D. As pkp_{k}^{\star} solves (53), it is called a minimax polynomial.

In order to establish Lemma 3, we present a two-part analysis. First, we compute the quantity (53) by explicitly identifying pkp_{k}^{\star} for each k1k\geq 1. (This was given by Nemirovsky (1992), but without a proof.) Then, following the exposition from (Nemirovsky, 1991), we show that there exists an instance of (𝐀,𝐛)(\mathbf{A},\mathbf{b}) such that

𝐀q(𝐀)𝐛𝐛2D2M(k,R)2\displaystyle\|\mathbf{A}q(\mathbf{A})\mathbf{b}-\mathbf{b}\|^{2}\geq D^{2}M^{\star}(k,R)^{2}

holds for any polynomial qq of degree k1\leq k-1.

C.3 Proof of Lemma 3

The solutions to (53) are characterized using the Chebyshev polynomials of the first kind, defined by

TN(cosθ)=cos(Nθ),N1,\displaystyle T_{N}(\cos\theta)=\cos(N\theta),\quad N\geq 1,

or equivalently by TN(t)=cos(Narccost)T_{N}(t)=\cos(N\arccos t). If N=2dN=2d for some nonnegative integer dd, then TNT_{N} is an even polynomial satisfying TN(0)=cos(dπ)=(1)dT_{N}(0)=\cos(d\pi)=(-1)^{d}. On the other hand, if N=2d+1N=2d+1, then TNT_{N} is an odd polynomial of the form

T2d+1(t)=(1)d(2d+1)t+,\displaystyle T_{2d+1}(t)=(-1)^{d}(2d+1)t+\cdots, (54)

which can be shown via induction using the recurrence relation TN+1(t)=2tTN(t)TN1(t)T_{N+1}(t)=2tT_{N}(t)-T_{N-1}(t), which follows from the trigonometric identity

cos((N+1)θ)+cos((N1)θ)=2cos(Nθ)cosθ.\displaystyle\cos((N+1)\theta)+\cos((N-1)\theta)=2\cos(N\theta)\cos\theta.

Based on arguments from (Nemirovsky, 1992; Mason & Handscomb, 2002), we will show that given k1k\geq 1 and m:=k2m:=\lfloor\frac{k}{2}\rfloor,

pk(t):=(1)m2m+1(Rt)T2m+1(tR)\displaystyle p_{k}^{\star}(t):=\frac{(-1)^{m}}{2m+1}\left(\frac{R}{t}\right)T_{2m+1}\left(\frac{t}{R}\right)

solves (53).

The Chebyshev polynomials satisfy the equioscillation property, which makes them particularly useful here: the extrema of TNT_{N} within [1,1][-1,1] occur at tj=cos(Nj)πNt_{j}=\cos\frac{(N-j)\pi}{N} for j=0,,Nj=0,\dots,N, and the signs of the extremal values alternate. Indeed, we have |TN(t)|=|cos(Narccost)|1|T_{N}(t)|=|\cos(N\arccos t)|\leq 1 for all t[1,1]t\in[-1,1], and for each j=0,,Nj=0,\dots,N,

TN(tj)=cos(N(Nj)πN)=cos(Nj)π=(1)Nj.\displaystyle T_{N}(t_{j})=\cos\left(N\frac{(N-j)\pi}{N}\right)=\cos(N-j)\pi=(-1)^{N-j}.

Also, we have TN(tj)=TN(tj1)T_{N}(t_{j})=-T_{N}(t_{j-1}) for each j=1,,Nj=1,\dots,N.

Given k1k\geq 1, we denote by 𝒫k\mathcal{P}_{k} the collection of all polynomials pp of degree k\leq k with p(0)=1p(0)=1. Recall that we are to minimize

M(p,R):=maxλ[R,R]|λp(λ)|\displaystyle M(p,R):=\max_{\lambda\in[-R,R]}|\lambda\,p(\lambda)| (55)

over p𝒫kp\in\mathcal{P}_{k}. If p𝒫kp\in\mathcal{P}_{k} minimizes (55), then so does pev(t):=p(t)+p(t)2p_{\mathrm{ev}}(t):=\frac{p(t)+p(-t)}{2}, since for all λ[R,R]\lambda\in[-R,R]

|λpev(λ)|=|λ||p(λ)+p(λ)2||λp(λ)|2+|(λ)p(λ)|2M(p,R)2+M(p,R)2=M(p,R)\displaystyle|\lambda p_{\textrm{ev}}(\lambda)|=|\lambda|\cdot\left|\frac{p(\lambda)+p(-\lambda)}{2}\right|\leq\frac{|\lambda p(\lambda)|}{2}+\frac{|(-\lambda)p(-\lambda)|}{2}\leq\frac{M(p,R)}{2}+\frac{M(p,R)}{2}=M(p,R) (56)

holds, which implies that M(pev,R)M(p,R)M(p_{\mathrm{ev}},R)\leq M(p,R).

Observe that pk𝒫kp_{k}^{\star}\in\mathcal{P}_{k} due to (54). Next, note that λpk(λ)=(1)mR2m+1T2m+1(λR)\lambda p_{k}^{\star}(\lambda)=\frac{(-1)^{m}R}{2m+1}T_{2m+1}(\frac{\lambda}{R}) has extrema of alternating signs and equal magnitude within [R,R][-R,R], which occur precisely at λj:=Rcos(2m+1j)π2m+1\lambda_{j}:=R\cos\frac{(2m+1-j)\pi}{2m+1}, where j=0,,2m+1j=0,\dots,2m+1. Suppose that pkp_{k}^{\star} is not a minimizer of M(p,R)M(p,R) over 𝒫k\mathcal{P}_{k}, so that there exists p𝒫kp\in\mathcal{P}_{k} such that

|λjp(λj)|M(p,R)<M(pk,R)=|λjpk(λj)|(j=0,,2m+1).\displaystyle|\lambda_{j}p(\lambda_{j})|\leq M(p,R)<M(p_{k}^{\star},R)=|\lambda_{j}p_{k}^{\star}(\lambda_{j})|\quad(j=0,\dots,2m+1). (57)

Due to (56), by replacing pp with pevp_{\mathrm{ev}} if necessary, we may assume that pp is even and has degree 2m\leq 2m. Since λj0\lambda_{j}\neq 0 for all j=0,,2m+1j=0,\dots,2m+1, the condition (57) reduces to |p(λj)|<|pk(λj)||p(\lambda_{j})|<|p_{k}^{\star}(\lambda_{j})|.

As pp and pkp_{k}^{\star} are both polynomials of degree 2m\leq 2m with constant term 1, we can write

pk(λ)p(λ)=λq(λ)\displaystyle p_{k}^{\star}(\lambda)-p(\lambda)=\lambda q(\lambda)

for some polynomial qq of degree 2m1\leq 2m-1. But then |p(λj)|=|pk(λj)λjq(λj)|<|pk(λj)||p(\lambda_{j})|=|p_{k}^{\star}(\lambda_{j})-\lambda_{j}q(\lambda_{j})|<|p_{k}^{\star}(\lambda_{j})|, which implies that pk(λj)p_{k}^{\star}(\lambda_{j}) and λjq(λj)\lambda_{j}q(\lambda_{j}) have same signs for j=0,,2m+1j=0,\dots,2m+1. Now, because pk(λj)p_{k}^{\star}(\lambda_{j}) have alternating signs and

λ0<<λm<0<λm+1<<λ2m+1,\lambda_{0}<\cdots<\lambda_{m}<0<\lambda_{m+1}<\cdots<\lambda_{2m+1},

we see that the signs of q(λj)q(\lambda_{j}) alternate over j=0,,mj=0,\dots,m and over j=m+1,,2m+1j=m+1,\dots,2m+1, respectively. Therefore, qq must have at least one zero in each of the 2m2m open intervals (λj,λj+1)(\lambda_{j},\lambda_{j+1}) for j=0,,m1,m+1,,2mj=0,\dots,m-1,m+1,\dots,2m. Since degq2m1\deg q\leq 2m-1 while qq has at least 2m2m zeros, it follows that q(t)0q(t)\equiv 0. But then pk=pp_{k}^{\star}=p, which contradicts (57).

We have established that

M(k,R)=M(pk,R)=|λjpk(λj)|=R2m+1=R2k/2+1(j=0,,2m+1).\displaystyle M^{\star}(k,R)=M(p_{k}^{\star},R)=\left|\lambda_{j}p_{k}^{\star}(\lambda_{j})\right|=\frac{R}{2m+1}=\frac{R}{2\lfloor k/2\rfloor+1}\,\quad(j=0,\dots,2m+1). (58)
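The value in (58) is easy to check numerically; the following sketch (assuming NumPy, and not part of the proof) evaluates the product of the argument with the minimax polynomial on a fine grid of the interval and compares the maximum absolute value with the right-hand side of (58):

```python
import numpy as np

# lambda * p_k*(lambda) = ((-1)^m R / (2m+1)) * T_{2m+1}(lambda / R) with m = floor(k/2),
# so its maximum absolute value over [-R, R] should equal R / (2m + 1).
R = 2.0
lam = np.linspace(-R, R, 200001)
for k in range(1, 8):
    m = k // 2
    T = np.polynomial.chebyshev.Chebyshev.basis(2 * m + 1)
    vals = ((-1) ** m) * R / (2 * m + 1) * T(lam / R)
    print(k, np.abs(vals).max(), R / (2 * m + 1))  # the last two columns should match
```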

Furthermore, the above arguments show that the minimization of (55) over p𝒫kp\in\mathcal{P}_{k} is in fact the same as the minimization of

maxj=0,,2m+1|λjp(λj)|=maxλΛ|λp(λ)|,Λ:={λ0,λ1,,λ2m+1}.\displaystyle\max_{j=0,\dots,2m+1}\left|\lambda_{j}p(\lambda_{j})\right|=\max_{\lambda\in\Lambda}|\lambda p(\lambda)|,\quad\Lambda:=\{\lambda_{0},\lambda_{1},\dots,\lambda_{2m+1}\}. (59)

Note that the trick of replacing pp by pevp_{\mathrm{ev}} is still applicable to (59), but only because the set Λ\Lambda is symmetric with respect to the origin. Now we can write

M(k,R)2=(minp𝒫kmaxλ[R,R]|λp(λ)|)2=(minp𝒫kmaxλΛ|λp(λ)|)2=minp𝒫kmaxλΛλ2p(λ)2,\displaystyle M^{\star}(k,R)^{2}=\left(\min_{p\in\mathcal{P}_{k}}\max_{\lambda\in[-R,R]}|\lambda p(\lambda)|\right)^{2}=\left(\min_{p\in\mathcal{P}_{k}}\max_{\lambda\in\Lambda}|\lambda p(\lambda)|\right)^{2}=\min_{p\in\mathcal{P}_{k}}\max_{\lambda\in\Lambda}\lambda^{2}p(\lambda)^{2}, (60)

and the final problem in (60) is equivalent to

minimizeν,p𝒫kνsubject toλj2p(λj)2ν,j=0,,2m+1.\displaystyle\begin{array}[]{ll}\underset{\nu\in\mathbb{R},\,p\in\mathcal{P}_{k}}{\mbox{minimize}}&\nu\\ \mbox{subject to}&\lambda_{j}^{2}p(\lambda_{j})^{2}\leq\nu,\quad j=0,\dots,2m+1.\end{array} (63)

We can identify any p(t)=1+p1t++pktk𝒫kp(t)=1+p_{1}t+\cdots+p_{k}t^{k}\in\mathcal{P}_{k} as the vector (p1,,pk)k(p_{1},\dots,p_{k})\in\mathbb{R}^{k}. Under this identification, (63) is a second order cone program (as the constraints are convex quadratic in p1,,pkp_{1},\dots,p_{k}), and Slater’s constraint qualification is clearly satisfied. Hence M(k,R)2M^{\star}(k,R)^{2} equals the optimal value of the dual problem

maximize𝝁2m+2minimizep𝒫kj=02m+1μjλj2p(λj)2subject toj=02m+1μj=1,𝝁0.\displaystyle\begin{array}[]{ll}\underset{\boldsymbol{\mu}\in\mathbb{R}^{2m+2}}{\mbox{maximize}}\,\,\underset{p\in\mathcal{P}_{k}}{\mbox{minimize}}&\sum_{j=0}^{2m+1}\mu_{j}\lambda_{j}^{2}p(\lambda_{j})^{2}\\ \mbox{subject to}&\sum_{j=0}^{2m+1}\mu_{j}=1,\\ &\boldsymbol{\mu}\geq 0.\end{array} (67)
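As a numerical sanity check (an illustrative sketch that assumes the cvxpy package and is not part of the argument), one can also solve the primal problem (63) directly and compare its optimal value with the square of the value in (58):

```python
import numpy as np
import cvxpy as cp

# Solve (63): minimize nu over p in P_k subject to lambda_j^2 p(lambda_j)^2 <= nu,
# where p(t) = 1 + p_1 t + ... + p_k t^k and lambda_j are the scaled Chebyshev points.
R, k = 1.0, 5
m = k // 2
j = np.arange(2 * m + 2)
lam = R * np.cos((2 * m + 1 - j) * np.pi / (2 * m + 1))
V = np.vander(lam, k + 1, increasing=True)      # row j: [1, lam_j, lam_j^2, ..., lam_j^k]
coef = cp.Variable(k)                           # free coefficients p_1, ..., p_k
nu = cp.Variable()
p_vals = V[:, 0] + V[:, 1:] @ coef              # p(lambda_j) for all j
constraints = [cp.square(cp.multiply(lam, p_vals)) <= nu]
cp.Problem(cp.Minimize(nu), constraints).solve()
print(nu.value, (R / (2 * m + 1)) ** 2)         # the two values should agree closely
```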

Let 𝝁=(μ0,,μ2m+1)\boldsymbol{\mu}^{\star}=(\mu_{0}^{\star},\dots,\mu_{2m+1}^{\star}) be the dual optimal solution to (67). Provided that nk+22m+2n\geq k+2\geq 2m+2, we can take standard basis vectors (with 0-indexing) 𝐞0,,𝐞2m+1n\mathbf{e}_{0},\dots,\mathbf{e}_{2m+1}\in\mathbb{R}^{n}. Define 𝐀\mathbf{A} by

𝐀𝐞j=λj𝐞j(j=0,,2m+1),𝐀𝐯=0(𝐯span{𝐞0,,𝐞2m+1})\displaystyle\mathbf{A}\mathbf{e}_{j}=\lambda_{j}\mathbf{e}_{j}\quad(j=0,\dots,2m+1),\quad\mathbf{A}\mathbf{v}=0\quad(\mathbf{v}\perp\mathrm{span}\{\mathbf{e}_{0},\dots,\mathbf{e}_{2m+1}\})

and let

𝐛=𝐀𝐱,𝐱=Dj=02m+1(μj)1/2𝐞j\displaystyle\mathbf{b}=\mathbf{A}\mathbf{x}^{\star},\quad\mathbf{x}^{\star}=D\sum_{j=0}^{2m+1}\left(\mu_{j}^{\star}\right)^{1/2}\mathbf{e}_{j}

so that 𝐱=D\|\mathbf{x}^{\star}\|=D. For any given 𝐱=q(𝐀)𝐛\mathbf{x}=q(\mathbf{A})\mathbf{b} with degqk1\deg q\leq k-1, we use (52) to rewrite 𝐀𝐱𝐛2\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^{2} as

𝐀𝐱𝐛2=D2j=02m+1μjλj2(1λjq(λj))2=D2j=02m+1μjλj2p(λj)2,\displaystyle\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^{2}=D^{2}\sum_{j=0}^{2m+1}\mu_{j}^{\star}\lambda_{j}^{2}\left(1-\lambda_{j}q(\lambda_{j})\right)^{2}=D^{2}\sum_{j=0}^{2m+1}\mu_{j}^{\star}\lambda_{j}^{2}p(\lambda_{j})^{2},

where p(t)=1tq(t)𝒫kp(t)=1-tq(t)\in\mathcal{P}_{k}. But since (pk,𝝁)(p_{k}^{\star},\boldsymbol{\mu}^{\star}) is the primal-dual solution pair to the problems (63) and (67), pkp_{k}^{\star} minimizes j=02m+1μjλj2p(λj)2\sum_{j=0}^{2m+1}\mu_{j}^{\star}\lambda_{j}^{2}p(\lambda_{j})^{2} within 𝒫k\mathcal{P}_{k}. Therefore,

𝐀𝐱𝐛2=D2j=02m+1μjλj2p(λj)2D2j=02m+1μjλj2pk(λj)2=D2M(k,R)2=R2D2(2k/2+1)2,\displaystyle\|\mathbf{A}\mathbf{x}-\mathbf{b}\|^{2}=D^{2}\sum_{j=0}^{2m+1}\mu_{j}^{\star}\lambda_{j}^{2}p(\lambda_{j})^{2}\geq D^{2}\sum_{j=0}^{2m+1}\mu_{j}^{\star}\lambda_{j}^{2}p_{k}^{\star}(\lambda_{j})^{2}=D^{2}M^{\star}(k,R)^{2}=\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}},

which establishes (14).

C.4 Proof of Lemma 4

Let k0k\geq 0 be a given (fixed) integer. Consider the polynomial pkp_{k}^{\star} we defined in the previous section. It is an even polynomial of degree 2k22\lfloor\frac{k}{2}\rfloor, and thus pk(t)p_{k}^{\star}\left(\sqrt{t}\right) is a polynomial in tt of degree k2\lfloor\frac{k}{2}\rfloor, whose constant term is pk(0)=1p_{k}^{\star}(0)=1. Therefore, we can write pk(t)=1tqk(t)p_{k}^{\star}\left(\sqrt{t}\right)=1-tq_{k}(t) for some polynomial qkq_{k}. We will show that

𝐳k=qk(𝐁𝐁)𝐁𝐯\displaystyle\mathbf{z}^{k}=q_{k}\left(\mathbf{B}^{\intercal}\mathbf{B}\right)\mathbf{B}^{\intercal}\mathbf{v} (68)

satisfies 𝐁𝐳k𝐯2R2D2(2k/2+1)2\|\mathbf{B}\mathbf{z}^{k}-\mathbf{v}\|^{2}\leq\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}} for any (possibly non-symmetric) 𝐁m×m\mathbf{B}\in\mathbb{R}^{m\times m} and 𝐯=𝐁𝐳\mathbf{v}=\mathbf{B}\mathbf{z}^{\star} satisfying 𝐁R\|\mathbf{B}\|\leq R and 𝐳D\left\|\mathbf{z}^{\star}\right\|\leq D. The equation (68) defines an algorithm within the class 𝔄lin\mathfrak{A}_{\textrm{lin}}, as qkq_{k} is of degree k21\lfloor\frac{k}{2}\rfloor-1, so that 𝐳k\mathbf{z}^{k} is determined by 2k21k12\lfloor\frac{k}{2}\rfloor-1\leq k-1 queries to the matrix multiplication oracle.
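The claim is straightforward to probe numerically before proving it. The following sketch (assuming NumPy; it is an illustration, not the proof) constructs the coefficients of q_k from the definition of p_k^* and checks the bound on a random non-symmetric instance:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Illustrative check: build q_k from p_k*(sqrt(t)) = 1 - t q_k(t) and verify
# ||B z^k - v||^2 <= R^2 D^2 / (2*floor(k/2) + 1)^2 for a random non-symmetric B
# with ||B|| <= R and v = B z* with ||z*|| <= D.
rng = np.random.default_rng(0)
R, D, n, k = 1.0, 1.0, 30, 9
m = k // 2

c = C.cheb2poly(C.Chebyshev.basis(2 * m + 1).coef)     # monomial coefficients of T_{2m+1}
# p_k*(t) = ((-1)^m/(2m+1)) (R/t) T_{2m+1}(t/R) is even in t; reading off its even
# coefficients gives p_k*(sqrt(s)) as a degree-m polynomial in s with constant term 1.
p_sqrt = np.array([((-1) ** m / (2 * m + 1)) * c[2 * i + 1] / R ** (2 * i)
                   for i in range(m + 1)])
assert np.isclose(p_sqrt[0], 1.0)
q = -p_sqrt[1:]                                        # q_k(s) = (1 - p_k*(sqrt(s))) / s

B = rng.standard_normal((n, n))
B *= R / np.linalg.norm(B, 2)                          # enforce ||B|| <= R
z_star = rng.standard_normal(n)
z_star *= D / np.linalg.norm(z_star)                   # enforce ||z*|| <= D
v = B @ z_star

M = B.T @ B                                            # z^k = q_k(B^T B) B^T v via Horner's rule
Z = q[-1] * np.eye(n)
for coef in q[-2::-1]:
    Z = Z @ M + coef * np.eye(n)
z_k = Z @ (B.T @ v)

print(np.linalg.norm(B @ z_k - v) ** 2, (R * D / (2 * m + 1)) ** 2)   # lhs <= rhs
```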

We proceed via arguments similar to derivations in C.2. First, observe that

𝐁𝐳k𝐯2=𝐁𝐳k𝐁𝐳2=(𝐳k𝐳)𝐁𝐁(𝐳k𝐳)=(𝐳k𝐳)|𝐁|2(𝐳k𝐳)=|𝐁|𝐳k|𝐁|𝐳2,\displaystyle\left\|\mathbf{B}\mathbf{z}^{k}-\mathbf{v}\right\|^{2}=\left\|\mathbf{B}\mathbf{z}^{k}-\mathbf{B}\mathbf{z}^{\star}\right\|^{2}=(\mathbf{z}^{k}-\mathbf{z}^{\star})^{\intercal}\mathbf{B}^{\intercal}\mathbf{B}(\mathbf{z}^{k}-\mathbf{z}^{\star})=(\mathbf{z}^{k}-\mathbf{z}^{\star})^{\intercal}|\mathbf{B}|^{2}(\mathbf{z}^{k}-\mathbf{z}^{\star})=\left\||\mathbf{B}|\mathbf{z}^{k}-|\mathbf{B}|\mathbf{z}^{\star}\right\|^{2}, (69)

where |𝐁||\mathbf{B}| is the matrix square root of the positive semidefinite matrix 𝐁𝐁\mathbf{B}^{\intercal}\mathbf{B}. Rewriting (68) in terms of |𝐁||\mathbf{B}|, we obtain

𝐳k=qk(𝐁𝐁)𝐁𝐁𝐳=qk(|𝐁|2)|𝐁|2𝐳.\displaystyle\mathbf{z}^{k}=q_{k}\left(\mathbf{B}^{\intercal}\mathbf{B}\right)\mathbf{B}^{\intercal}\mathbf{B}\mathbf{z}^{\star}=q_{k}\left(|\mathbf{B}|^{2}\right)|\mathbf{B}|^{2}\mathbf{z}^{\star}.

Plugging the last equation into (69) gives

|𝐁|𝐳|𝐁|𝐳k2=(𝐈|𝐁|2qk(|𝐁|2))|𝐁|𝐳2=pk(|𝐁|)|𝐁|𝐳2.\displaystyle\left\||\mathbf{B}|\mathbf{z}^{\star}-|\mathbf{B}|\mathbf{z}^{k}\right\|^{2}=\left\|\left(\mathbf{I}-|\mathbf{B}|^{2}q_{k}\left(|\mathbf{B}|^{2}\right)\right)|\mathbf{B}|\mathbf{z}^{\star}\right\|^{2}=\left\|p_{k}^{\star}\left(|\mathbf{B}|\right)|\mathbf{B}|\mathbf{z}^{\star}\right\|^{2}.

Finally, because |𝐁||\mathbf{B}| is a symmetric matrix whose eigenvalues are within [0,R][0,R], we can apply (52) with |𝐁|,𝐳|\mathbf{B}|,\mathbf{z}^{\star} in place of 𝐀,𝐱\mathbf{A},\mathbf{x}^{\star}, and use (58) to conclude that

|𝐁|𝐳|𝐁|𝐳k2D2(maxλ[0,R]λ2pk(λ)2)D2(maxλ[R,R]λ2pk(λ)2)=R2D2(2k/2+1)2.\displaystyle\left\||\mathbf{B}|\mathbf{z}^{\star}-|\mathbf{B}|\mathbf{z}^{k}\right\|^{2}\leq D^{2}\left(\max_{\lambda\in[0,R]}\lambda^{2}p_{k}^{\star}(\lambda)^{2}\right)\leq D^{2}\left(\max_{\lambda\in[-R,R]}\lambda^{2}p_{k}^{\star}(\lambda)^{2}\right)=\frac{R^{2}D^{2}}{(2\lfloor k/2\rfloor+1)^{2}}.

C.5 Proof of Theorem 4

We first describe the general class 𝔄\mathfrak{A} of algorithms without the linear span assumption. An algorithm 𝒜\mathcal{A} within 𝔄\mathfrak{A} is a sequence of deterministic functions 𝒜1,𝒜2,\mathcal{A}_{1},\mathcal{A}_{2},\dots, each of which has the form

(𝐳i,𝐳¯i)=𝒜i(𝐳0,𝒪(𝐳0;𝐋),,𝒪(𝐳i1;𝐋);𝐋)\displaystyle(\mathbf{z}^{i},\overline{\mathbf{z}}^{i})=\mathcal{A}_{i}\left(\mathbf{z}^{0},\mathcal{O}(\mathbf{z}^{0};\mathbf{L}),\dots,\mathcal{O}(\mathbf{z}^{i-1};\mathbf{L});\mathbf{L}\right)

for i1i\geq 1, where 𝐳0=(𝐱0,𝐲0)n×m\mathbf{z}^{0}=(\mathbf{x}^{0},\mathbf{y}^{0})\in\mathbb{R}^{n}\times\mathbb{R}^{m} is an initial point and 𝒪:(n×m)×R(n×m)n×m\mathcal{O}\colon(\mathbb{R}^{n}\times\mathbb{R}^{m})\times\mathcal{L}_{R}(\mathbb{R}^{n}\times\mathbb{R}^{m})\to\mathbb{R}^{n}\times\mathbb{R}^{m} is the gradient oracle defined as

𝒪((𝐱,𝐲);𝐋)=(𝐱𝐋(𝐱,𝐲),𝐲𝐋(𝐱,𝐲)).\mathcal{O}((\mathbf{x},\mathbf{y});\mathbf{L})=\left(\nabla_{\mathbf{x}}\mathbf{L}(\mathbf{x},\mathbf{y}),\nabla_{\mathbf{y}}\mathbf{L}(\mathbf{x},\mathbf{y})\right).

The sequences {𝐳i}i0\{\mathbf{z}^{i}\}_{i\geq 0} and {𝐳¯i}i0\{\overline{\mathbf{z}}^{i}\}_{i\geq 0} are, respectively, the inquiry points and the approximate solutions produced by 𝒜\mathcal{A}. When k1k\geq 1 is the predefined maximum number of iterations, we assume 𝐳¯k=𝐳k\overline{\mathbf{z}}^{k}=\mathbf{z}^{k} without loss of generality. Similar definitions of deterministic algorithms have been considered in (Nemirovsky, 1991; Ouyang & Xu, 2021).

To clarify, given 𝐋R(n×m)\mathbf{L}\in\mathcal{L}_{R}(\mathbb{R}^{n}\times\mathbb{R}^{m}), an algorithm 𝒜\mathcal{A} uses only the previous oracle information to choose the next inquiry point and approximate solution. Therefore, if 𝒪(𝐳i;𝐋1)=𝒪(𝐳i;𝐋2)\mathcal{O}(\mathbf{z}^{i};\mathbf{L}_{1})=\mathcal{O}(\mathbf{z}^{i};\mathbf{L}_{2}) for all i=0,,k1i=0,\dots,k-1, then the algorithm output (𝐳k,𝐳¯k)(\mathbf{z}^{k},\overline{\mathbf{z}}^{k}) for the two functions will coincide, even if 𝐋1𝐋2\mathbf{L}_{1}\neq\mathbf{L}_{2}. In that sense, 𝒜\mathcal{A} is deterministic, black-box, and gradient-based.

Now we precisely restate Theorem 4.

Theorem 4.

Let k1k\geq 1 and n3k+2n\geq 3k+2. Let 𝒜𝔄\mathcal{A}\in\mathfrak{A} be a deterministic black-box gradient-based algorithm for solving convex-concave minimax problems on n×n\mathbb{R}^{n}\times\mathbb{R}^{n}. Then for any initial point 𝐳0n×n\mathbf{z}^{0}\in\mathbb{R}^{n}\times\mathbb{R}^{n}, there exists 𝐋Rbiaff(n×n)\mathbf{L}\in\mathcal{L}_{R}^{\textnormal{biaff}}(\mathbb{R}^{n}\times\mathbb{R}^{n}) with a saddle point 𝐳\mathbf{z}^{\star}, for which 𝐳k\mathbf{z}^{k}, the kk-th iterate produced by 𝒜\mathcal{A}, satisfies

𝐋(𝐳k)2R2𝐳0𝐳2(2k/2+1)2.\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\geq\frac{R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}}.
Proof.

Let 𝐳0=(𝐱0,𝐲0)n×n\mathbf{z}^{0}=(\mathbf{x}^{0},\mathbf{y}^{0})\in\mathbb{R}^{n}\times\mathbb{R}^{n} be given. Take 𝐀\mathbf{A} and 𝐛\mathbf{b} as in Lemma 3. Denote by 𝐱min\mathbf{x}^{\textrm{min}} the minimum norm solution to 𝐀𝐱=𝐛\mathbf{A}\mathbf{x}=\mathbf{b}. Recall the construction of 𝐀\mathbf{A} and 𝐛\mathbf{b}, where (𝐀)=span{𝐞0,,𝐞2m+1}ker(𝐀)\mathcal{R}(\mathbf{A})=\mathrm{span}\{\mathbf{e}_{0},\dots,\mathbf{e}_{2m+1}\}\perp\ker(\mathbf{A}). Define

𝐋0(𝐱,𝐲)=𝐛(𝐱𝐱0)+(𝐱𝐱0)𝐀(𝐲𝐲0)𝐛(𝐲𝐲0).\mathbf{L}_{0}(\mathbf{x},\mathbf{y})=-\mathbf{b}^{\intercal}(\mathbf{x}-\mathbf{x}^{0})+(\mathbf{x}-\mathbf{x}^{0})^{\intercal}\mathbf{A}(\mathbf{y}-\mathbf{y}^{0})-\mathbf{b}^{\intercal}(\mathbf{y}-\mathbf{y}^{0}).

Then (𝐱𝐋0(𝐱,𝐲),𝐲𝐋0(𝐱,𝐲))=(𝐀(𝐲𝐲0)𝐛,𝐀(𝐱𝐱0)𝐛)\left(\nabla_{\mathbf{x}}\mathbf{L}_{0}(\mathbf{x},\mathbf{y}),\nabla_{\mathbf{y}}\mathbf{L}_{0}(\mathbf{x},\mathbf{y})\right)=\left(\mathbf{A}(\mathbf{y}-\mathbf{y}^{0})-\mathbf{b},\mathbf{A}(\mathbf{x}-\mathbf{x}^{0})-\mathbf{b}\right), and 𝐳0+(𝐱min,𝐱min)\mathbf{z}^{0}+\left(\mathbf{x}^{\textrm{min}},\mathbf{x}^{\textrm{min}}\right) is a saddle point of 𝐋0\mathbf{L}_{0}.

We follow the oracle-resisting proof strategy of Nemirovsky (1991), described as follows. For each i=1,,ki=1,\dots,k, we inductively define a rotated biaffine function

𝐋i(𝐱,𝐲)=𝐛(𝐱𝐱0)+(𝐱𝐱0)𝐀i(𝐲𝐲0)𝐛(𝐲𝐲0),\mathbf{L}_{i}(\mathbf{x},\mathbf{y})=-\mathbf{b}^{\intercal}(\mathbf{x}-\mathbf{x}^{0})+(\mathbf{x}-\mathbf{x}^{0})^{\intercal}\mathbf{A}_{i}(\mathbf{y}-\mathbf{y}^{0})-\mathbf{b}^{\intercal}(\mathbf{y}-\mathbf{y}^{0}),

where 𝐀i=𝐔i𝐀𝐔i\mathbf{A}_{i}=\mathbf{U}_{i}\mathbf{A}\mathbf{U}_{i}^{\intercal} for an orthogonal matrix 𝐔in×n\mathbf{U}_{i}\in\mathbb{R}^{n\times n}. We will show that 𝐔i\mathbf{U}_{i} can be chosen to satisfy 𝐔i𝐛=𝐛\mathbf{U}_{i}\mathbf{b}=\mathbf{b},

𝒪(𝐳j;𝐋i)=𝒪(𝐳j;𝐋i1)\displaystyle\mathcal{O}(\mathbf{z}^{j};\mathbf{L}_{i})=\mathcal{O}(\mathbf{z}^{j};\mathbf{L}_{i-1}) (70)

for j=0,,i1j=0,\dots,i-1, and

𝐱j𝐱0,𝐲j𝐲0𝒦j1(𝐀i;𝐛)𝐔i𝒩i=𝐔i𝒦j1(𝐀;𝐛)𝐔i𝒩i\displaystyle\mathbf{x}^{j}-\mathbf{x}^{0},\mathbf{y}^{j}-\mathbf{y}^{0}\in\mathcal{K}_{j-1}(\mathbf{A}_{i};\mathbf{b})\oplus\mathbf{U}_{i}\mathcal{N}_{i}=\mathbf{U}_{i}\mathcal{K}_{j-1}(\mathbf{A};\mathbf{b})\oplus\mathbf{U}_{i}\mathcal{N}_{i} (71)

for j=0,,ij=0,\dots,i, where 𝒩i\mathcal{N}_{i} is a subspace of ker(𝐀)\ker(\mathbf{A}) such that dim(𝒩i)2i\dim(\mathcal{N}_{i})\leq 2i. Note that (70) implies that the algorithm iterates (𝐳j,𝐳¯j)(\mathbf{z}^{j},\overline{\mathbf{z}}^{j}) for j=1,,ij=1,\dots,i do not change when 𝐋i1\mathbf{L}_{i-1} is replaced by 𝐋i\mathbf{L}_{i}. Hence, this process sequentially adjusts the objective function 𝐋\mathbf{L} upon observing an iterate 𝐳i\mathbf{z}^{i}, preventing the algorithm from optimizing it efficiently. Indeed, if (71) holds with i=j=ki=j=k, then

𝐱k𝐱0=𝐔kq𝐱(𝐀)𝐛+𝐔k𝐯𝐱k\displaystyle\mathbf{x}^{k}-\mathbf{x}^{0}=\mathbf{U}_{k}q_{\mathbf{x}}(\mathbf{A})\mathbf{b}+\mathbf{U}_{k}\mathbf{v}_{\mathbf{x}}^{k}
𝐲k𝐲0=𝐔kq𝐲(𝐀)𝐛+𝐔k𝐯𝐲k\displaystyle\mathbf{y}^{k}-\mathbf{y}^{0}=\mathbf{U}_{k}q_{\mathbf{y}}(\mathbf{A})\mathbf{b}+\mathbf{U}_{k}\mathbf{v}_{\mathbf{y}}^{k}

for some polynomials q𝐱,q𝐲q_{\mathbf{x}},q_{\mathbf{y}} of degree k1\leq k-1 and 𝐯𝐱k,𝐯𝐲k𝒩kker(𝐀)\mathbf{v}_{\mathbf{x}}^{k},\mathbf{v}_{\mathbf{y}}^{k}\in\mathcal{N}_{k}\subseteq\ker(\mathbf{A}). Thus

𝐱𝐋k(𝐱k,𝐲k)=𝐀k(𝐲k𝐲0)𝐛=𝐔k𝐀𝐔k(𝐔kqy(𝐀)𝐛+𝐔k𝐯𝐲k)𝐛=𝐔k(𝐀qy(𝐀)𝐈)𝐛\displaystyle\nabla_{\mathbf{x}}\mathbf{L}_{k}(\mathbf{x}^{k},\mathbf{y}^{k})=\mathbf{A}_{k}(\mathbf{y}^{k}-\mathbf{y}^{0})-\mathbf{b}=\mathbf{U}_{k}\mathbf{A}\mathbf{U}_{k}^{\intercal}\left(\mathbf{U}_{k}q_{y}(\mathbf{A})\mathbf{b}+\mathbf{U}_{k}\mathbf{v}_{\mathbf{y}}^{k}\right)-\mathbf{b}=\mathbf{U}_{k}\left(\mathbf{A}q_{y}(\mathbf{A})-\mathbf{I}\right)\mathbf{b}

and similarly

𝐲𝐋k(𝐱k,𝐲k)=𝐔k(𝐀qx(𝐀)𝐈)𝐛,\nabla_{\mathbf{y}}\mathbf{L}_{k}(\mathbf{x}^{k},\mathbf{y}^{k})=\mathbf{U}_{k}\left(\mathbf{A}q_{x}(\mathbf{A})-\mathbf{I}\right)\mathbf{b},

showing that

𝐋k(𝐳k)2=𝐔k(𝐀qy(𝐀)𝐈)𝐛2+𝐔k(𝐀qx(𝐀)𝐈)𝐛22R2𝐱min2(2k/2+1)2.\|\nabla\mathbf{L}_{k}(\mathbf{z}^{k})\|^{2}=\|\mathbf{U}_{k}\left(\mathbf{A}q_{y}(\mathbf{A})-\mathbf{I}\right)\mathbf{b}\|^{2}+\|\mathbf{U}_{k}\left(\mathbf{A}q_{x}(\mathbf{A})-\mathbf{I}\right)\mathbf{b}\|^{2}\geq\frac{2R^{2}\|\mathbf{x}^{\textrm{min}}\|^{2}}{(2\lfloor k/2\rfloor+1)^{2}}.

Then the theorem statement follows from the fact that 𝐳=𝐳0+(𝐔k𝐱min,𝐔k𝐱min)\mathbf{z}^{\star}=\mathbf{z}^{0}+(\mathbf{U}_{k}\mathbf{x}^{\textrm{min}},\mathbf{U}_{k}\mathbf{x}^{\textrm{min}}) is a saddle point of 𝐋k\mathbf{L}_{k}, which satisfies 𝐳0𝐳2=2𝐱min2\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}=2\|\mathbf{x}^{\textrm{min}}\|^{2}.

It remains to provide an inductive scheme for choosing 𝐔i\mathbf{U}_{i}. We set 𝐔0=𝐈\mathbf{U}_{0}=\mathbf{I} (so that 𝐀0=𝐀\mathbf{A}_{0}=\mathbf{A}), 𝒩0={0}\mathcal{N}_{0}=\{0\}, and define 𝒦1(𝐀;𝐛)={0}\mathcal{K}_{-1}(\mathbf{A};\mathbf{b})=\{0\} for convenience. Let 1ik1\leq i\leq k, and suppose that we already have an orthogonal matrix 𝐔i1\mathbf{U}_{i-1} and 𝒩i1ker(𝐀)\mathcal{N}_{i-1}\subseteq\ker(\mathbf{A}) for which 𝐔i1𝐛=𝐛\mathbf{U}_{i-1}\mathbf{b}=\mathbf{b}, dim(𝒩i1)2i2\dim(\mathcal{N}_{i-1})\leq 2i-2, and (71) holds with i1i-1 (which is vacuously true when i=1i=1). Let

(𝐳i,𝐳¯i)=𝒜i(𝐳0,𝒪(𝐳0;𝐋i1),,𝒪(𝐳i1;𝐋i1)).(\mathbf{z}^{i},\overline{\mathbf{z}}^{i})=\mathcal{A}_{i}\left(\mathbf{z}^{0},\mathcal{O}(\mathbf{z}^{0};\mathbf{L}_{i-1}),\dots,\mathcal{O}(\mathbf{z}^{i-1};\mathbf{L}_{i-1})\right).

We want the yet-to-be-defined 𝐔i\mathbf{U}_{i} to satisfy 𝐬𝐱i,𝐬𝐲i𝐔iker(𝐀)\mathbf{s}_{\mathbf{x}}^{i},\mathbf{s}_{\mathbf{y}}^{i}\in\mathbf{U}_{i}\ker(\mathbf{A}), where 𝐬𝐱i,𝐬𝐲i\mathbf{s}_{\mathbf{x}}^{i},\mathbf{s}_{\mathbf{y}}^{i} are defined in the decomposition below, while 𝒦i1(𝐀i1;𝐛)=𝒦i1(𝐀i;𝐛)\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b})=\mathcal{K}_{i-1}(\mathbf{A}_{i};\mathbf{b}). The latter condition holds if 𝐔i=𝐐i𝐔i1\mathbf{U}_{i}=\mathbf{Q}_{i}\mathbf{U}_{i-1} for some orthogonal matrix 𝐐i\mathbf{Q}_{i} which fixes every element of

𝒥i1=𝒦i1(𝐀i1;𝐛)𝐔i1𝒩i1,\mathcal{J}_{i-1}=\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b})\oplus\mathbf{U}_{i-1}\mathcal{N}_{i-1},

because then it follows that 𝐔i𝐛=𝐐i𝐔i1𝐛=𝐐i𝐛=𝐛\mathbf{U}_{i}\mathbf{b}=\mathbf{Q}_{i}\mathbf{U}_{i-1}\mathbf{b}=\mathbf{Q}_{i}\mathbf{b}=\mathbf{b} and

𝒦i1(𝐀i;𝐛)=𝐔i𝒦i1(𝐀;𝐛)=𝐐i𝐔i1𝒦i1(𝐀;𝐛)=𝐐i𝒦i1(𝐀i1;𝐛)=𝒦i1(𝐀i1;𝐛).\mathcal{K}_{i-1}(\mathbf{A}_{i};\mathbf{b})=\mathbf{U}_{i}\mathcal{K}_{i-1}(\mathbf{A};\mathbf{b})=\mathbf{Q}_{i}\mathbf{U}_{i-1}\mathcal{K}_{i-1}(\mathbf{A};\mathbf{b})=\mathbf{Q}_{i}\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b})=\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b}).

Consider the decomposition

𝐱i𝐱0=Π𝒦i1(𝐀i1;𝐛)(𝐱i𝐱0)+𝐔i1𝐫𝐱i+𝐬𝐱i\displaystyle\mathbf{x}^{i}-\mathbf{x}^{0}=\Pi_{\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b})}(\mathbf{x}^{i}-\mathbf{x}^{0})+\mathbf{U}_{i-1}\mathbf{r}^{i}_{\mathbf{x}}+\mathbf{s}^{i}_{\mathbf{x}}
𝐲i𝐲0=Π𝒦i1(𝐀i1;𝐛)(𝐲i𝐲0)+𝐔i1𝐫𝐲i+𝐬𝐲i\displaystyle\mathbf{y}^{i}-\mathbf{y}^{0}=\Pi_{\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b})}(\mathbf{y}^{i}-\mathbf{y}^{0})+\mathbf{U}_{i-1}\mathbf{r}^{i}_{\mathbf{y}}+\mathbf{s}^{i}_{\mathbf{y}}

where Π\Pi denotes the orthogonal projection, 𝐫𝐱i,𝐫𝐲i𝒩i1\mathbf{r}^{i}_{\mathbf{x}},\mathbf{r}^{i}_{\mathbf{y}}\in\mathcal{N}_{i-1} and 𝐬𝐱i,𝐬𝐲i𝒥i1\mathbf{s}^{i}_{\mathbf{x}},\mathbf{s}^{i}_{\mathbf{y}}\in\mathcal{J}_{i-1}^{\perp}. Since dimker(𝐀)=n2m2nk2\dim\ker(\mathbf{A})=n-2m-2\geq n-k-2 and dim(𝒩i1)n(2i2)n2k+2\dim\left(\mathcal{N}_{i-1}\right)^{\perp}\geq n-(2i-2)\geq n-2k+2, we have

dim(ker(𝐀)(𝒩i1))n3k2,\dim\left(\ker(\mathbf{A})\cap\left(\mathcal{N}_{i-1}\right)^{\perp}\right)\geq n-3k\geq 2,

so there exist 𝐬~𝐱i,𝐬~𝐲iker(𝐀)(𝒩i1)\tilde{\mathbf{s}}^{i}_{\mathbf{x}},\tilde{\mathbf{s}}^{i}_{\mathbf{y}}\in\ker(\mathbf{A})\cap\left(\mathcal{N}_{i-1}\right)^{\perp} such that 𝐬~𝐱i=𝐬𝐱i\|\tilde{\mathbf{s}}^{i}_{\mathbf{x}}\|=\|\mathbf{s}^{i}_{\mathbf{x}}\|, 𝐬~𝐲i=𝐬𝐲i\|\tilde{\mathbf{s}}^{i}_{\mathbf{y}}\|=\|\mathbf{s}^{i}_{\mathbf{y}}\|, and 𝐬~𝐱i,𝐬~𝐲i=𝐬𝐱i,𝐬𝐲i\langle\tilde{\mathbf{s}}^{i}_{\mathbf{x}},\tilde{\mathbf{s}}^{i}_{\mathbf{y}}\rangle=\langle\mathbf{s}^{i}_{\mathbf{x}},\mathbf{s}^{i}_{\mathbf{y}}\rangle. Also, because ker(𝐀)𝒦i1(𝐀;𝐛)\ker(\mathbf{A})\perp\mathcal{K}_{i-1}(\mathbf{A};\mathbf{b}),

𝒥i1=𝐔i1(𝒦i1(𝐀;𝐛)+𝒩i1)𝐔i1(ker(𝐀)(𝒩i1)).\mathcal{J}_{i-1}=\mathbf{U}_{i-1}\left(\mathcal{K}_{i-1}(\mathbf{A};\mathbf{b})+\mathcal{N}_{i-1}\right)\perp\mathbf{U}_{i-1}\left(\ker(\mathbf{A})\cap\left(\mathcal{N}_{i-1}\right)^{\perp}\right).

This implies that there exists an orthogonal 𝐐in×n\mathbf{Q}_{i}\in\mathbb{R}^{n\times n} satisfying

𝐐i|𝒥i1=Id𝒥i1\displaystyle\mathbf{Q}_{i}\big{|}_{\mathcal{J}_{i-1}}=\mathrm{Id}_{\mathcal{J}_{i-1}}
𝐐i(𝐔i1𝐬~𝐱i)=𝐬𝐱i\displaystyle\mathbf{Q}_{i}\left(\mathbf{U}_{i-1}\tilde{\mathbf{s}}^{i}_{\mathbf{x}}\right)=\mathbf{s}^{i}_{\mathbf{x}}
𝐐i(𝐔i1𝐬~𝐲i)=𝐬𝐲i.\displaystyle\mathbf{Q}_{i}\left(\mathbf{U}_{i-1}\tilde{\mathbf{s}}^{i}_{\mathbf{y}}\right)=\mathbf{s}^{i}_{\mathbf{y}}.

Now let 𝐯𝐱i=𝐫𝐱i+𝐬~𝐱iker(𝐀),𝐯𝐲i=𝐫𝐲i+𝐬~𝐲iker(𝐀)\mathbf{v}^{i}_{\mathbf{x}}=\mathbf{r}^{i}_{\mathbf{x}}+\tilde{\mathbf{s}}^{i}_{\mathbf{x}}\in\ker(\mathbf{A}),\mathbf{v}^{i}_{\mathbf{y}}=\mathbf{r}^{i}_{\mathbf{y}}+\tilde{\mathbf{s}}^{i}_{\mathbf{y}}\in\ker(\mathbf{A}), and

𝐔i=Δ𝐐i𝐔i1\displaystyle\mathbf{U}_{i}\stackrel{{\scriptstyle\Delta}}{{=}}\mathbf{Q}_{i}\mathbf{U}_{i-1}
𝒩i=Δ𝒩i1+span{𝐯𝐱i,𝐯𝐲i}.\displaystyle\mathcal{N}_{i}\stackrel{{\scriptstyle\Delta}}{{=}}\mathcal{N}_{i-1}+\mathrm{span}\{\mathbf{v}^{i}_{\mathbf{x}},\mathbf{v}^{i}_{\mathbf{y}}\}.

Then clearly 𝐔i𝐛=𝐛\mathbf{U}_{i}\mathbf{b}=\mathbf{b}, 𝒩iker(𝐀)\mathcal{N}_{i}\subseteq\ker(\mathbf{A}), and dim𝒩i2i\dim\mathcal{N}_{i}\leq 2i. Next, for each j=0,,i1j=0,\dots,i-1, we have

𝐱j𝐱0,𝐲j𝐲0𝒦j1(𝐀i1;𝐛)𝐔i1𝒩i1𝒦j1(𝐀i;𝐛)𝐔i𝒩i\mathbf{x}^{j}-\mathbf{x}^{0},\mathbf{y}^{j}-\mathbf{y}^{0}\in\mathcal{K}_{j-1}(\mathbf{A}_{i-1};\mathbf{b})\oplus\mathbf{U}_{i-1}\mathcal{N}_{i-1}\subseteq\mathcal{K}_{j-1}(\mathbf{A}_{i};\mathbf{b})\oplus\mathbf{U}_{i}\mathcal{N}_{i}

since 𝐐i\mathbf{Q}_{i} preserves 𝒥i1\mathcal{J}_{i-1} and 𝒩i1𝒩i\mathcal{N}_{i-1}\subseteq\mathcal{N}_{i}. Moreover, because 𝐔i1𝐫𝐱i=𝐐i𝐔i1𝐫𝐱i=𝐔i𝐫𝐱i\mathbf{U}_{i-1}\mathbf{r}^{i}_{\mathbf{x}}=\mathbf{Q}_{i}\mathbf{U}_{i-1}\mathbf{r}^{i}_{\mathbf{x}}=\mathbf{U}_{i}\mathbf{r}^{i}_{\mathbf{x}} and 𝐬𝐱i=𝐐i𝐔i1𝐬~𝐱i=𝐔i𝐬~𝐱i\mathbf{s}^{i}_{\mathbf{x}}=\mathbf{Q}_{i}\mathbf{U}_{i-1}\tilde{\mathbf{s}}^{i}_{\mathbf{x}}=\mathbf{U}_{i}\tilde{\mathbf{s}}^{i}_{\mathbf{x}},

𝐱i𝐱0=Π𝒦i1(𝐀i1;𝐛)(𝐱i𝐱0)+𝐔i(𝐫𝐱i+𝐬~𝐱i)𝒦i1(𝐀i1;𝐛)𝐔i𝒩i=𝒦i1(𝐀i;𝐛)𝐔i𝒩i\mathbf{x}^{i}-\mathbf{x}^{0}=\Pi_{\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b})}(\mathbf{x}^{i}-\mathbf{x}^{0})+\mathbf{U}_{i}(\mathbf{r}^{i}_{\mathbf{x}}+\tilde{\mathbf{s}}^{i}_{\mathbf{x}})\in\mathcal{K}_{i-1}(\mathbf{A}_{i-1};\mathbf{b})\oplus\mathbf{U}_{i}\mathcal{N}_{i}=\mathcal{K}_{i-1}(\mathbf{A}_{i};\mathbf{b})\oplus\mathbf{U}_{i}\mathcal{N}_{i}

and similarly 𝐲i𝐲0𝒦i1(𝐀i;𝐛)𝐔i𝒩i\mathbf{y}^{i}-\mathbf{y}^{0}\in\mathcal{K}_{i-1}(\mathbf{A}_{i};\mathbf{b})\oplus\mathbf{U}_{i}\mathcal{N}_{i}. This proves (71).

Finally, for j=0,,i1j=0,\dots,i-1,

𝐱𝐋i(𝐱j,𝐲j)=𝐀i(𝐲j𝐲0)𝐛=𝐐i𝐀i1𝐐i(𝐲j𝐲0)𝐛.\displaystyle\nabla_{\mathbf{x}}\mathbf{L}_{i}(\mathbf{x}^{j},\mathbf{y}^{j})=\mathbf{A}_{i}(\mathbf{y}^{j}-\mathbf{y}^{0})-\mathbf{b}=\mathbf{Q}_{i}\mathbf{A}_{i-1}\mathbf{Q}_{i}^{\intercal}(\mathbf{y}^{j}-\mathbf{y}^{0})-\mathbf{b}.

But 𝐐i(𝐲j𝐲0)=𝐲j𝐲0\mathbf{Q}_{i}^{\intercal}(\mathbf{y}^{j}-\mathbf{y}^{0})=\mathbf{y}^{j}-\mathbf{y}^{0} because 𝐲j𝐲0𝒦j1(𝐀i1;𝐛)𝐔i1𝒩i1𝒥i1\mathbf{y}^{j}-\mathbf{y}^{0}\in\mathcal{K}_{j-1}(\mathbf{A}_{i-1};\mathbf{b})\oplus\mathbf{U}_{i-1}\mathcal{N}_{i-1}\subseteq\mathcal{J}_{i-1}, and

𝐀i1(𝐲j𝐲0)𝐀i1𝒦j1(𝐀i1;𝐛)𝐀i1𝐔i1𝒩i1𝒦j(𝐀i1;𝐛)𝒥i1,\displaystyle\mathbf{A}_{i-1}(\mathbf{y}^{j}-\mathbf{y}^{0})\in\mathbf{A}_{i-1}\mathcal{K}_{j-1}(\mathbf{A}_{i-1};\mathbf{b})\oplus\mathbf{A}_{i-1}\mathbf{U}_{i-1}\mathcal{N}_{i-1}\subseteq\mathcal{K}_{j}(\mathbf{A}_{i-1};\mathbf{b})\subseteq\mathcal{J}_{i-1},

which shows that 𝐱𝐋i(𝐱j,𝐲j)=𝐐i𝐀i1𝐐i(𝐲j𝐲0)𝐛=𝐀i1(𝐲j𝐲0)𝐛=𝐱𝐋i1(𝐱j,𝐲j)\nabla_{\mathbf{x}}\mathbf{L}_{i}(\mathbf{x}^{j},\mathbf{y}^{j})=\mathbf{Q}_{i}\mathbf{A}_{i-1}\mathbf{Q}_{i}^{\intercal}(\mathbf{y}^{j}-\mathbf{y}^{0})-\mathbf{b}=\mathbf{A}_{i-1}(\mathbf{y}^{j}-\mathbf{y}^{0})-\mathbf{b}=\nabla_{\mathbf{x}}\mathbf{L}_{i-1}(\mathbf{x}^{j},\mathbf{y}^{j}). Arguing analogously for the 𝐲\mathbf{y}-variable gives 𝐲𝐋i(𝐱j,𝐲j)=𝐲𝐋i1(𝐱j,𝐲j)\nabla_{\mathbf{y}}\mathbf{L}_{i}(\mathbf{x}^{j},\mathbf{y}^{j})=\nabla_{\mathbf{y}}\mathbf{L}_{i-1}(\mathbf{x}^{j},\mathbf{y}^{j}), proving (70). This completes the induction step, and hence the proof. ∎

Appendix D Experimental details

D.1 Exact forms of the construction from Ouyang & Xu (2021)

Following Ouyang & Xu (2021), we use

𝐀=14[1111111]n×n,𝐛=14[1111]n,𝐡=14[0001]n,\mathbf{A}=\frac{1}{4}\begin{bmatrix}&&&-1&1\\ &&\iddots&\iddots&\\ &-1&1\\ -1&1\\ 1\end{bmatrix}\in\mathbb{R}^{n\times n},\quad\mathbf{b}=\frac{1}{4}\begin{bmatrix}1\\ 1\\ \vdots\\ 1\\ 1\end{bmatrix}\in\mathbb{R}^{n},\quad\mathbf{h}=\frac{1}{4}\begin{bmatrix}0\\ 0\\ \vdots\\ 0\\ 1\end{bmatrix}\in\mathbb{R}^{n},

and 𝐇=2𝐀𝐀\mathbf{H}=2\mathbf{A}^{\intercal}\mathbf{A}. Ouyang & Xu (2021) shows that 𝐀12\|\mathbf{A}\|\leq\frac{1}{2}, which implies 𝐇12\|\mathbf{H}\|\leq\frac{1}{2}. Therefore (25) is a 11-smooth saddle function.
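For concreteness, the following sketch (assuming NumPy; the explicit index pattern is our reading of the display above) builds the matrices and vectors for a given n and checks the norm bounds numerically:

```python
import numpy as np

# Build the n x n matrix A (1s on the anti-diagonal, -1s immediately to their left),
# the vectors b and h, and H = 2 A^T A; then check the operator norms.
n = 200
A = np.zeros((n, n))
for i in range(n):
    A[i, n - 1 - i] = 1.0
    if n - 2 - i >= 0:
        A[i, n - 2 - i] = -1.0
A *= 0.25
b = np.full(n, 0.25)
h = np.zeros(n)
h[-1] = 0.25
H = 2 * A.T @ A
print(np.linalg.norm(A, 2), np.linalg.norm(H, 2))  # both should be at most 0.5
```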

D.2 Best-iterate gradient norm bound for EG

In Figure 1, we indicated theoretical upper bounds for EG. To clarify, there is no known last-iterate convergence result for EG with respect to 𝐆()2\|\mathbf{G}(\cdot)\|^{2}. However, it is straightforward to derive 𝒪(R2/k)\mathcal{O}(R^{2}/k) best-iterate convergence via the standard summability arguments used in weak convergence proofs for EG. Although there is no theoretical guarantee that 𝐆(𝐳k)2\|\mathbf{G}(\mathbf{z}^{k})\|^{2} decreases monotonically with EG, it did decrease monotonically in our experiments on both examples (see Figures 1(a) and 1(b)). Therefore, we could safely use the best-iterate bounds to visualize the upper bound for EG in Figure 1. For the sake of completeness, we derive the best-iterate bound below.

Lemma 6.

Let 𝐋:n×m\mathbf{L}\colon\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R} be an RR-smooth convex-concave saddle function with a saddle point 𝐳\mathbf{z}^{\star}. Let 𝐳n×m\mathbf{z}\in\mathbb{R}^{n}\times\mathbb{R}^{m} and α(0,1R)\alpha\in\left(0,\frac{1}{R}\right). Then 𝐰=𝐳α𝐆(𝐳)\mathbf{w}=\mathbf{z}-\alpha\mathbf{G}(\mathbf{z}) and 𝐳+=𝐳α𝐆(𝐰)\mathbf{z}^{+}=\mathbf{z}-\alpha\mathbf{G}(\mathbf{w}) satisfy

𝐳𝐳2𝐳+𝐳2(1α2R2)𝐳𝐰2.\|\mathbf{z}-\mathbf{z}^{\star}\|^{2}-\|\mathbf{z}^{+}-\mathbf{z}^{\star}\|^{2}\geq(1-\alpha^{2}R^{2})\|\mathbf{z}-\mathbf{w}\|^{2}.
Proof.
𝐳𝐳2𝐳+𝐳2\displaystyle\|\mathbf{z}-\mathbf{z}^{\star}\|^{2}-\|\mathbf{z}^{+}-\mathbf{z}^{\star}\|^{2} =(𝐳𝐰2+2𝐳𝐰,𝐰𝐳+𝐰𝐳2)\displaystyle=\left(\|\mathbf{z}-\mathbf{w}\|^{2}+2\langle\mathbf{z}-\mathbf{w},\mathbf{w}-\mathbf{z}^{\star}\rangle+\|\mathbf{w}-\mathbf{z}^{\star}\|^{2}\right)
(𝐳+𝐰2+2𝐳+𝐰,𝐰𝐳+𝐰𝐳2)\displaystyle\quad\quad-\left(\|\mathbf{z}^{+}-\mathbf{w}\|^{2}+2\langle\mathbf{z}^{+}-\mathbf{w},\mathbf{w}-\mathbf{z}^{\star}\rangle+\|\mathbf{w}-\mathbf{z}^{\star}\|^{2}\right)
=𝐳𝐰2𝐳+𝐰2+2𝐳𝐳+,𝐰𝐳\displaystyle=\|\mathbf{z}-\mathbf{w}\|^{2}-\|\mathbf{z}^{+}-\mathbf{w}\|^{2}+2\langle\mathbf{z}-\mathbf{z}^{+},\mathbf{w}-\mathbf{z}^{\star}\rangle
𝐳𝐰2𝐳+𝐰2.\displaystyle\geq\|\mathbf{z}-\mathbf{w}\|^{2}-\|\mathbf{z}^{+}-\mathbf{w}\|^{2}.

The last inequality is just monotonicity: 𝐳𝐳+,𝐰𝐳=α𝐆(𝐰),𝐰𝐳0\langle\mathbf{z}-\mathbf{z}^{+},\mathbf{w}-\mathbf{z}^{\star}\rangle=\alpha\langle\mathbf{G}(\mathbf{w}),\mathbf{w}-\mathbf{z}^{\star}\rangle\geq 0. Now the conclusion follows from

𝐳+𝐰2=(𝐳α𝐆(𝐰))(𝐳α𝐆(𝐳))2=α2𝐆(𝐳)𝐆(𝐰)2α2R2𝐳𝐰2,\displaystyle\|\mathbf{z}^{+}-\mathbf{w}\|^{2}=\left\|(\mathbf{z}-\alpha\mathbf{G}(\mathbf{w}))-(\mathbf{z}-\alpha\mathbf{G}(\mathbf{z}))\right\|^{2}=\alpha^{2}\|\mathbf{G}(\mathbf{z})-\mathbf{G}(\mathbf{w})\|^{2}\leq\alpha^{2}R^{2}\|\mathbf{z}-\mathbf{w}\|^{2},

where the last inequality follows from RR-Lipschitzness of 𝐆\mathbf{G}. ∎

Now fix an integer k0k\geq 0, and consider the EG iterations

𝐳i+1/2\displaystyle\mathbf{z}^{i+1/2} =𝐳iα𝐆(𝐳i)\displaystyle=\mathbf{z}^{i}-\alpha\mathbf{G}(\mathbf{z}^{i})
𝐳i+1\displaystyle\mathbf{z}^{i+1} =𝐳iα𝐆(𝐳i+1/2)\displaystyle=\mathbf{z}^{i}-\alpha\mathbf{G}(\mathbf{z}^{i+1/2})

for i=0,,ki=0,\dots,k. Applying Lemma 6 with 𝐳=𝐳i\mathbf{z}=\mathbf{z}^{i}, 𝐰=𝐳i+1/2\mathbf{w}=\mathbf{z}^{i+1/2} and 𝐳+=𝐳i+1\mathbf{z}^{+}=\mathbf{z}^{i+1}, we have

𝐳i𝐳2𝐳i+1𝐳2(1α2R2)𝐳i𝐳i+1/22=(1α2R2)α2𝐆(𝐳i)2\displaystyle\|\mathbf{z}^{i}-\mathbf{z}^{\star}\|^{2}-\|\mathbf{z}^{i+1}-\mathbf{z}^{\star}\|^{2}\geq(1-\alpha^{2}R^{2})\|\mathbf{z}^{i}-\mathbf{z}^{i+1/2}\|^{2}=(1-\alpha^{2}R^{2})\alpha^{2}\|\mathbf{G}(\mathbf{z}^{i})\|^{2} (72)

for i=0,,ki=0,\dots,k. Summing up the inequalities (72) for all i=0,,ki=0,\dots,k, we obtain

𝐳0𝐳2𝐳k+1𝐳2(1α2R2)α2i=0k𝐆(𝐳i)2.\displaystyle\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}-\|\mathbf{z}^{k+1}-\mathbf{z}^{\star}\|^{2}\geq(1-\alpha^{2}R^{2})\alpha^{2}\sum_{i=0}^{k}\|\mathbf{G}(\mathbf{z}^{i})\|^{2}.

The left hand side is at most 𝐳0𝐳2\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}, while the right hand side is lower bounded by

(1α2R2)α2(k+1)mini=0,,k𝐆(𝐳i)2.\displaystyle(1-\alpha^{2}R^{2})\alpha^{2}\,(k+1)\min_{i=0,\dots,k}\|\mathbf{G}(\mathbf{z}^{i})\|^{2}.

Therefore we conclude that

mini=0,,k𝐆(𝐳i)2C𝐳0𝐳2k+1\min_{i=0,\dots,k}\|\mathbf{G}(\mathbf{z}^{i})\|^{2}\leq\frac{C\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{k+1}

where C=1α2(1α2R2)C=\frac{1}{\alpha^{2}(1-\alpha^{2}R^{2})}.
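The bound above is easy to verify numerically; here is an illustrative sketch (assuming NumPy; the bilinear test problem and all constants below are our own choices, not the paper's experimental setup):

```python
import numpy as np

# Run EG on the bilinear problem L(x, y) = x^T A y, where G(z) = (A y, -A^T x) is
# R-Lipschitz with R = ||A|| and z* = 0, and compare the best-iterate squared gradient
# norm against the bound C ||z^0 - z*||^2 / (k + 1).
rng = np.random.default_rng(1)
n, k, R = 10, 500, 1.0
A = rng.standard_normal((n, n))
A *= R / np.linalg.norm(A, 2)
G = lambda x, y: (A @ y, -A.T @ x)

alpha = 0.5 / R                                             # any alpha in (0, 1/R)
x, y = rng.standard_normal(n), rng.standard_normal(n)
z0_dist2 = np.linalg.norm(x) ** 2 + np.linalg.norm(y) ** 2  # ||z^0 - z*||^2 since z* = 0
best = np.inf
for i in range(k + 1):
    gx, gy = G(x, y)
    best = min(best, np.linalg.norm(gx) ** 2 + np.linalg.norm(gy) ** 2)
    xh, yh = x - alpha * gx, y - alpha * gy                 # z^{i+1/2}
    ghx, ghy = G(xh, yh)
    x, y = x - alpha * ghx, y - alpha * ghy                 # z^{i+1}

C = 1 / (alpha ** 2 * (1 - alpha ** 2 * R ** 2))
print(best, C * z0_dist2 / (k + 1))                         # best <= bound
```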

D.3 ODE flows for 𝐋(𝒙,𝒚)=𝒙𝒚\boldsymbol{\mathbf{L}(x,y)=xy}

Interestingly, the continuous-time flows with 𝐋(x,y)=xy\mathbf{L}(x,y)=xy have exact closed-form solutions.

Note that 𝐆(x,y)=[0110][xy]\mathbf{G}(x,y)=\begin{bmatrix}0&1\\ -1&0\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix}. Therefore,

𝐆λ(x,y)=1λ([1001][1λλ1]1)[xy]=[λ1+λ211+λ211+λ2λ1+λ2][xy].\displaystyle\mathbf{G}_{\lambda}(x,y)=\frac{1}{\lambda}\left(\begin{bmatrix}1&0\\ 0&1\end{bmatrix}-\begin{bmatrix}1&\lambda\\ -\lambda&1\end{bmatrix}^{-1}\right)\begin{bmatrix}x\\ y\end{bmatrix}=\begin{bmatrix}\frac{\lambda}{1+\lambda^{2}}&\frac{1}{1+\lambda^{2}}\\ -\frac{1}{1+\lambda^{2}}&\frac{\lambda}{1+\lambda^{2}}\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix}.
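As a quick check of this simplification (an illustrative sketch assuming NumPy), one can compare the resolvent-based expression with the closed-form matrix on the right-hand side:

```python
import numpy as np

# G_lambda = (1/lambda) * (I - (I + lambda*J)^{-1}) with J = [[0, 1], [-1, 0]];
# this should equal [[lambda, 1], [-1, lambda]] / (1 + lambda^2).
lam = 0.7
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
G_lam = (np.eye(2) - np.linalg.inv(np.eye(2) + lam * J)) / lam
print(G_lam)
print(np.array([[lam, 1.0], [-1.0, lam]]) / (1 + lam ** 2))
```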

The solution to the Moreau–Yosida regularized flow

[x˙y˙]=[λ1+λ211+λ211+λ2λ1+λ2][xy]\displaystyle\begin{bmatrix}\dot{x}\\ \dot{y}\end{bmatrix}=\begin{bmatrix}-\frac{\lambda}{1+\lambda^{2}}&-\frac{1}{1+\lambda^{2}}\\ \frac{1}{1+\lambda^{2}}&-\frac{\lambda}{1+\lambda^{2}}\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix}

can be obtained with the matrix exponential. The result is

x(t)=exp(λ1+λ2t)(x0cost1+λ2y0sint1+λ2)\displaystyle x(t)=\exp\left(-\frac{\lambda}{1+\lambda^{2}}t\right)\left(x^{0}\cos\frac{t}{1+\lambda^{2}}-y^{0}\sin\frac{t}{1+\lambda^{2}}\right)
y(t)=exp(λ1+λ2t)(y0cost1+λ2+x0sint1+λ2).\displaystyle y(t)=\exp\left(-\frac{\lambda}{1+\lambda^{2}}t\right)\left(y^{0}\cos\frac{t}{1+\lambda^{2}}+x^{0}\sin\frac{t}{1+\lambda^{2}}\right).

The anchored flow ODE for 𝐋(x,y)=xy\mathbf{L}(x,y)=xy is given by

x˙(t)=y(t)+1t(x0x(t))\displaystyle\dot{x}(t)=-y(t)+\frac{1}{t}\left(x^{0}-x(t)\right)
y˙(t)=x(t)+1t(y0y(t)).\displaystyle\dot{y}(t)=x(t)+\frac{1}{t}\left(y^{0}-y(t)\right).

From the first equation, we have ddt(tx(t))=tx˙(t)+x(t)=ty(t)+x0\frac{d}{dt}(tx(t))=t\dot{x}(t)+x(t)=-ty(t)+x^{0}, while similar manipulation of the second equation gives ddt(ty(t))=tx(t)+y0\frac{d}{dt}(ty(t))=tx(t)+y^{0}. Therefore,

d2dt2(tx(t))=ddt(ty(t))=tx(t)y0\displaystyle\frac{d^{2}}{dt^{2}}(tx(t))=-\frac{d}{dt}(ty(t))=-tx(t)-y^{0}
d2dt2(ty(t))=ddt(tx(t))=ty(t)+x0,\displaystyle\frac{d^{2}}{dt^{2}}(ty(t))=\frac{d}{dt}(tx(t))=-ty(t)+x^{0},

which gives

tx(t)=c1costc2sinty0\displaystyle tx(t)=c_{1}\cos t-c_{2}\sin t-y^{0}
ty(t)=c1sint+c2cost+x0.\displaystyle ty(t)=c_{1}\sin t+c_{2}\cos t+x^{0}.

Using the initial conditions to determine the coefficients c1,c2c_{1},c_{2}, we obtain

x(t)=y0cost+x0sinty0t\displaystyle x(t)=\frac{y^{0}\cos t+x^{0}\sin t-y^{0}}{t}
y(t)=y0sintx0cost+x0t.\displaystyle y(t)=\frac{y^{0}\sin t-x^{0}\cos t+x^{0}}{t}.
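As a sanity check (a sketch assuming NumPy; the initial point, horizon, and the simple RK4 integrator are our own choices for illustration), one can integrate the anchored flow numerically and compare against this closed form:

```python
import numpy as np

# Compare the closed form above with a direct numerical integration of the anchored
# flow; we start at a small t0 > 0 to avoid the 1/t coefficient at t = 0.
x0, y0 = 1.0, 0.5

def closed_form(t):
    return np.array([(y0 * np.cos(t) + x0 * np.sin(t) - y0) / t,
                     (y0 * np.sin(t) - x0 * np.cos(t) + x0) / t])

def f(t, z):
    x, y = z
    return np.array([-y + (x0 - x) / t, x + (y0 - y) / t])

t, T, h = 1e-3, 10.0, 1e-3
z = closed_form(t)                    # initialize on the exact trajectory
while t < T:
    k1 = f(t, z)
    k2 = f(t + h / 2, z + h / 2 * k1)
    k3 = f(t + h / 2, z + h / 2 * k2)
    k4 = f(t + h, z + h * k3)
    z = z + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    t += h

print(z, closed_form(t))              # the two should closely agree
```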

Appendix E Connection to CLI lower bounds

In this section, we discuss how EAG relates to the prior work on complexity lower bounds on the class of CLI and SCLI algorithms, introduced and studied in (Arjevani et al., 2016; Arjevani & Shamir, 2016; Azizian et al., 2020; Golowich et al., 2020). Specifically, we show that EAG is not SCLI, so it can break the Ω(R2/k)\Omega(R^{2}/k) lower bound on squared gradient norm for the 1-SCLI class derived by Golowich et al. (2020). On the other hand, we show that EAG is 2-CLI in the sense of Golowich et al. (2020), and that EAG belongs to an extended class of 1-CLI algorithms.

E.1 Lower bounds for 1-SCLI and non-stationarity of EAG

We start with the notion of 1-SCLI algorithms by Golowich et al. (2020). Consider an algorithm 𝒜\mathcal{A} for finding saddle points of biaffine functions of the form

𝐋(𝐱,𝐲)=𝐛𝐱+𝐱𝐀𝐲𝐜𝐲,\mathbf{L}(\mathbf{x},\mathbf{y})=\mathbf{b}^{\intercal}\mathbf{x}+\mathbf{x}^{\intercal}\mathbf{A}\mathbf{y}-\mathbf{c}^{\intercal}\mathbf{y},

where (𝐱,𝐲)n×n(\mathbf{x},\mathbf{y})\in\mathbb{R}^{n}\times\mathbb{R}^{n}. We say 𝒜\mathcal{A} is 1-stationary canonical linear iterative (1-SCLI) if there exist some fixed matrix mappings 𝐂,𝐍:2n×2n2n×2n\mathbf{C},\mathbf{N}:\mathbb{R}^{2n\times 2n}\to\mathbb{R}^{2n\times 2n} such that

𝐳k+1=𝐂([𝐎𝐀𝐀𝐎])𝐳k+𝐍([𝐎𝐀𝐀𝐎])[𝐛𝐜]=𝐂(𝐁)𝐳k+𝐍(𝐁)𝐯\displaystyle\mathbf{z}^{k+1}=\mathbf{C}\left(\begin{bmatrix}\mathbf{O}&\mathbf{A}\\ -\mathbf{A}^{\intercal}&\mathbf{O}\end{bmatrix}\right)\mathbf{z}^{k}+\mathbf{N}\left(\begin{bmatrix}\mathbf{O}&\mathbf{A}\\ -\mathbf{A}^{\intercal}&\mathbf{O}\end{bmatrix}\right)\begin{bmatrix}\mathbf{b}\\ \mathbf{c}\end{bmatrix}=\mathbf{C}(\mathbf{B})\mathbf{z}^{k}+\mathbf{N}(\mathbf{B})\mathbf{v} (73)

for k0k\geq 0, where

𝐁=[𝐎𝐀𝐀𝐎]2n×2n,𝐯=[𝐛𝐜]2n.\displaystyle\mathbf{B}=\begin{bmatrix}\mathbf{O}&\mathbf{A}\\ -\mathbf{A}^{\intercal}&\mathbf{O}\end{bmatrix}\in\mathbb{R}^{2n\times 2n},\quad\mathbf{v}=\begin{bmatrix}\mathbf{b}\\ \mathbf{c}\end{bmatrix}\in\mathbb{R}^{2n}.

Following the convention of Azizian et al. (2020) and Golowich et al. (2020), we also require that 𝐂,𝐍\mathbf{C},\mathbf{N} are matrix polynomials. The classical extragradient method (EG) is a 1-SCLI algorithm: with 𝐆(𝐳)=𝐁𝐳+𝐯\mathbf{G}(\mathbf{z})=\mathbf{B}\mathbf{z}+\mathbf{v}, we can express EG as

𝐳k+1\displaystyle\mathbf{z}^{k+1} =𝐳kα𝐆(𝐳kα𝐆(𝐳k))\displaystyle=\mathbf{z}^{k}-\alpha\mathbf{G}\left(\mathbf{z}^{k}-\alpha\mathbf{G}(\mathbf{z}^{k})\right)
=𝐳kα(𝐁(𝐳kα𝐁𝐳kα𝐯)+𝐯)\displaystyle=\mathbf{z}^{k}-\alpha\left(\mathbf{B}\left(\mathbf{z}^{k}-\alpha\mathbf{B}\mathbf{z}^{k}-\alpha\mathbf{v}\right)+\mathbf{v}\right)
=(𝐈α𝐁+α2𝐁2)𝐳kα(𝐈α𝐁)𝐯,\displaystyle=\left(\mathbf{I}-\alpha\mathbf{B}+\alpha^{2}\mathbf{B}^{2}\right)\mathbf{z}^{k}-\alpha(\mathbf{I}-\alpha\mathbf{B})\mathbf{v},

which is of the 1-SCLI form.

A 11-SCLI algorithm 𝒜\mathcal{A} is consistent with respect to an invertible matrix 𝐁\mathbf{B} if for any 𝐯2n\mathbf{v}\in\mathbb{R}^{2n}, iterates {𝐳k}k0\{\mathbf{z}^{k}\}_{k\geq 0} produced by 𝒜\mathcal{A} satisfy

𝐳k𝐳=𝐁1𝐯.\mathbf{z}^{k}\to\mathbf{z}^{\star}=-\mathbf{B}^{-1}\mathbf{v}.

If 𝒜\mathcal{A} is consistent with respect to 𝐁\mathbf{B}, then for any 𝐰=𝐁1𝐯2n\mathbf{w}=\mathbf{B}^{-1}\mathbf{v}\in\mathbb{R}^{2n}, we have

𝐰=𝐁1𝐯\displaystyle-\mathbf{w}=-\mathbf{B}^{-1}\mathbf{v} =limk𝐳k+1\displaystyle=\lim_{k\to\infty}\mathbf{z}^{k+1}
=limk𝐂(𝐁)𝐳k+𝐍(𝐁)𝐯\displaystyle=\lim_{k\to\infty}\mathbf{C}(\mathbf{B})\mathbf{z}^{k}+\mathbf{N}(\mathbf{B})\mathbf{v}
=𝐂(𝐁)(𝐁1𝐯)+𝐍(𝐁)𝐯\displaystyle=\mathbf{C}(\mathbf{B})(-\mathbf{B}^{-1}\mathbf{v})+\mathbf{N}(\mathbf{B})\mathbf{v}
=(𝐂(𝐁)+𝐍(𝐁)𝐁)𝐰.\displaystyle=\left(-\mathbf{C}(\mathbf{B})+\mathbf{N}(\mathbf{B})\mathbf{B}\right)\mathbf{w}.

As this holds for all 𝐰2n\mathbf{w}\in\mathbb{R}^{2n}, we have the following result.

Lemma 7 (Arjevani et al. (2016)).

If a 1-SCLI algorithm 𝒜\mathcal{A} described by (73) is consistent with respect to 𝐁\mathbf{B}, then

𝐈+𝐍(𝐁)𝐁=𝐂(𝐁).\displaystyle\mathbf{I}+\mathbf{N}(\mathbf{B})\mathbf{B}=\mathbf{C}(\mathbf{B}). (74)

Indeed, the 1-SCLI formulation of EG satisfies (74).

For the class of consistent 1-SCLI algorithms, Golowich et al. (2020) established an Ω(1/k)\Omega(1/k) complexity lower bound on the squared gradient norm.

Theorem 5 (Golowich et al. (2020)).

Let k0k\geq 0 and n1n\geq 1. Then for any consistent 1-SCLI algorithm of the form (73) with deg𝐍=d𝐍\deg\mathbf{N}=d_{\mathbf{N}}, there exists a biaffine function 𝐋(𝐱,𝐲)=𝐛𝐱+𝐱𝐀𝐲𝐜𝐲\mathbf{L}(\mathbf{x},\mathbf{y})=\mathbf{b}^{\intercal}\mathbf{x}+\mathbf{x}^{\intercal}\mathbf{A}\mathbf{y}-\mathbf{c}^{\intercal}\mathbf{y} on n×n\mathbb{R}^{n}\times\mathbb{R}^{n} with invertible 𝐀\mathbf{A}, for which

𝐋(𝐳k)2R2𝐳0𝐳220(d𝐍+1)2k=Ω(R2𝐳0𝐳2k),\displaystyle\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\geq\frac{R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{20(d_{\mathbf{N}}+1)^{2}k}=\Omega\left(\frac{R^{2}\|\mathbf{z}^{0}-\mathbf{z}^{\star}\|^{2}}{k}\right),

where 𝐳\mathbf{z}^{\star} is the unique saddle point of 𝐋\mathbf{L}.

To clarify, deg𝐍\deg\mathbf{N} refers to the degree of the matrix polynomial defining 𝐍\mathbf{N}. 1-SCLI algorithms with d𝐂=deg𝐂=1d_{\mathbf{C}}=\deg\mathbf{C}=1 form a subclass of 𝔄sim\mathfrak{A}_{\textrm{sim}} and 𝔄sep\mathfrak{A}_{\textrm{sep}}. (Even if d𝐂>1d_{\mathbf{C}}>1, one can still view 1-SCLI algorithms as instances of 𝔄sim\mathfrak{A}_{\textrm{sim}} or 𝔄sep\mathfrak{A}_{\textrm{sep}} by introducing d𝐂1d_{\mathbf{C}}-1 dummy iterates for each 1-SCLI iteration.) However, EAG is an algorithm that belongs to 𝔄sim\mathfrak{A}_{\textrm{sim}} but is not 1-SCLI; if it were, Theorem 5 would contradict the 𝐋(𝐳k)2𝒪(1/k2)\|\nabla\mathbf{L}(\mathbf{z}^{k})\|^{2}\leq\mathcal{O}(1/k^{2}) rate of EAG. In fact, it is intuitively clear that EAG is not 1-SCLI; the S in 1-SCLI stands for stationary, but EAG has anchoring coefficients 1k+2\frac{1}{k+2} that vary over iterations.

E.2 Understanding EAG as a CLI algorithm

In this section, we show that EAG algorithms are (non-stationary) 2-CLI, and that we can expand the definition of 1-CLI algorithms to accommodate EAG.

First, we state the definition of mm-CLI algorithms introduced by Arjevani & Shamir (2016) adapted to the case of biaffine saddle functions. For m1m\geq 1, an mm-CLI algorithm 𝒜\mathcal{A} takes mm initial points 𝐳10,,𝐳m0\mathbf{z}^{0}_{1},\dots,\mathbf{z}^{0}_{m} and at each iteration k0k\geq 0, outputs

𝐳ik+1=j=1m𝐂ij(k)(𝐁)𝐳jk+𝐍i(k)(𝐁)𝐯\displaystyle\mathbf{z}^{k+1}_{i}=\sum_{j=1}^{m}\mathbf{C}_{ij}^{(k)}(\mathbf{B})\,\mathbf{z}^{k}_{j}+\mathbf{N}_{i}^{(k)}(\mathbf{B})\,\mathbf{v} (75)

for i=1,,mi=1,\dots,m, where 𝐂ij(k),𝐍i(k):2n×2n2n×2n\mathbf{C}_{ij}^{(k)},\mathbf{N}_{i}^{(k)}:\mathbb{R}^{2n\times 2n}\to\mathbb{R}^{2n\times 2n} for i,j=1,,mi,j=1,\dots,m are matrix polynomials that depend on kk but not on {𝐳1k,,𝐳mk}k0\{\mathbf{z}^{k}_{1},\dots,\mathbf{z}^{k}_{m}\}_{k\geq 0}. In the case where 𝐂ij(k)𝐂ij\mathbf{C}_{ij}^{(k)}\equiv\mathbf{C}_{ij} and 𝐍i(k)𝐍i\mathbf{N}_{i}^{(k)}\equiv\mathbf{N}_{i} for all i,j=1,,mi,j=1,\dots,m and k0k\geq 0, we say 𝒜\mathcal{A} is stationary. Indeed, when m=1m=1, this definition of stationary 1-CLI coincides with that of 1-SCLI given in Section E.1. Also note that the definition (75) includes algorithms that obtain 𝐳k+1\mathbf{z}^{k+1} with mm previous iterates 𝐳k,𝐳k1,,𝐳km+1\mathbf{z}^{k},\mathbf{z}^{k-1},\dots,\mathbf{z}^{k-m+1}, by letting 𝐳ik=𝐳k+1i\mathbf{z}^{k}_{i}=\mathbf{z}^{k+1-i} for i=1,,mi=1,\dots,m.

Performance measure | Algorithm class | Lower bound | Best known rate | Order-optimality
Duality gap (Last iterate) | 1-SCLI | Ω(Rk)\Omega\left(\frac{R}{\sqrt{k}}\right) (Golowich et al., 2020) | 𝒪(Rk)\mathcal{O}\left(\frac{R}{\sqrt{k}}\right) (Golowich et al., 2020)* | Established*
Duality gap (Last iterate) | 1-CLI | Ω(Rk)\Omega\left(\frac{R}{k}\right) (Nemirovsky, 1992; Nemirovski, 2004) | 𝒪(Rk)\mathcal{O}\left(\frac{R}{\sqrt{k}}\right) (Golowich et al., 2020)* | Unknown
Duality gap (Last iterate) | mm-CLI (m2m\geq 2) | Ω(Rk)\Omega\left(\frac{R}{k}\right) (Nemirovsky, 1992; Nemirovski, 2004) | 𝒪(Rk)\mathcal{O}\left(\frac{R}{k}\right) (Nemirovski, 2004; Golowich et al., 2020) | Established
Squared gradient norm (Last iterate) | 1-SCLI | Ω(R2k)\Omega\left(\frac{R^{2}}{k}\right) (Golowich et al., 2020) | 𝒪(R2k)\mathcal{O}\left(\frac{R^{2}}{k}\right) (Golowich et al., 2020)* | Established*
Squared gradient norm (Last iterate) | 1-CLI | Ω(R2k2)\Omega\left(\frac{R^{2}}{k^{2}}\right) (Nemirovsky, 1992) | 𝒪(R2k)\mathcal{O}\left(\frac{R^{2}}{k}\right) (Golowich et al., 2020)* | Unknown
Squared gradient norm (Last iterate) | Translated 1-CLI | Ω(R2k2)\Omega\left(\frac{R^{2}}{k^{2}}\right) (Nemirovsky, 1992) | 𝒪(R2k2)\mathcal{O}\left(\frac{R^{2}}{k^{2}}\right) (This paper) | Established
Squared gradient norm (Last iterate) | mm-CLI (m2m\geq 2) | Ω(R2k2)\Omega\left(\frac{R^{2}}{k^{2}}\right) (Nemirovsky, 1992) | 𝒪(R2k2)\mathcal{O}\left(\frac{R^{2}}{k^{2}}\right) (This paper) | Established
Table 1: Lower bounds and best known rates for CLI algorithm classes (* means that the result holds with the additional assumption that the derivative of 𝐆\mathbf{G} is Lipschitz continuous).

Golowich et al. (2020) showed that the averaged EG iterates, which have rate 𝒪(1/k)\mathcal{O}(1/k) on duality gap, can be written in 2-CLI form; hence, the Ω(1/k)\Omega(1/\sqrt{k}) 1-SCLI lower bound on duality gap therein cannot be generalized to mm-CLI algorithms for m2m\geq 2. They then posed the open problem of whether the Ω(1/k)\Omega(1/\sqrt{k}) 1-SCLI lower bound on duality gap can be generalized to 1-CLI algorithms. Below, we provide a similar discussion on rates on squared gradient norm.

It is straightforward to see that EAG is 2-CLI; define 𝐳2k+1=𝐳2k==𝐳20=𝐳0=𝐳10\mathbf{z}^{k+1}_{2}=\mathbf{z}^{k}_{2}=\cdots=\mathbf{z}^{0}_{2}=\mathbf{z}^{0}=\mathbf{z}^{0}_{1} for all k0k\geq 0, and

𝐳1k+1\displaystyle\mathbf{z}^{k+1}_{1} =𝐳1kαk𝐆(𝐳1kαk𝐆(𝐳1k)+1k+2(𝐳0𝐳1k))+1k+2(𝐳0𝐳1k)\displaystyle=\mathbf{z}^{k}_{1}-\alpha_{k}\mathbf{G}\left(\mathbf{z}^{k}_{1}-\alpha_{k}\mathbf{G}(\mathbf{z}^{k}_{1})+\frac{1}{k+2}(\mathbf{z}^{0}-\mathbf{z}^{k}_{1})\right)+\frac{1}{k+2}(\mathbf{z}^{0}-\mathbf{z}^{k}_{1})
=(k+1k+2𝐈k+1k+2αk𝐁+αk2𝐁2)𝐳1k+1k+2(𝐈αk𝐁)𝐳2kαk(𝐈αk𝐁)𝐯.\displaystyle=\left(\frac{k+1}{k+2}\mathbf{I}-\frac{k+1}{k+2}\alpha_{k}\mathbf{B}+\alpha_{k}^{2}\mathbf{B}^{2}\right)\mathbf{z}^{k}_{1}+\frac{1}{k+2}(\mathbf{I}-\alpha_{k}\mathbf{B})\mathbf{z}^{k}_{2}-\alpha_{k}(\mathbf{I}-\alpha_{k}\mathbf{B})\mathbf{v}. (76)
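The following sketch (assuming NumPy; the problem data and step size are arbitrary choices for illustration) checks that the matrix form (76), with constant step size, reproduces the direct EAG recursion on a random biaffine instance:

```python
import numpy as np

# Check that the matrix form (76) with constant alpha_k = alpha (EAG-C) reproduces the
# direct EAG recursion on a random biaffine problem with G(z) = B z + v.
rng = np.random.default_rng(2)
n, K, alpha = 5, 20, 0.1
A = rng.standard_normal((n, n))
b, c = rng.standard_normal(n), rng.standard_normal(n)
B = np.block([[np.zeros((n, n)), A], [-A.T, np.zeros((n, n))]])
v = np.concatenate([b, c])
I = np.eye(2 * n)
G = lambda z: B @ z + v
z0 = rng.standard_normal(2 * n)

# Direct EAG recursion with anchoring coefficients 1/(k+2).
z = z0.copy()
for k in range(K):
    zh = z - alpha * G(z) + (z0 - z) / (k + 2)
    z = z - alpha * G(zh) + (z0 - z) / (k + 2)

# Matrix form (76), keeping z_2^k = z^0 throughout.
z1, z2 = z0.copy(), z0.copy()
for k in range(K):
    Ck = ((k + 1) / (k + 2)) * I - ((k + 1) / (k + 2)) * alpha * B + alpha ** 2 * (B @ B)
    z1 = Ck @ z1 + ((I - alpha * B) @ z2) / (k + 2) - alpha * ((I - alpha * B) @ v)

print(np.linalg.norm(z - z1))   # should be on the order of machine precision
```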

For EAG-C, one can alternatively eliminate the dependency on 𝐳0\mathbf{z}^{0} and define 𝐳k+1\mathbf{z}^{k+1} in terms of 𝐳k\mathbf{z}^{k}, 𝐳k1\mathbf{z}^{k-1}, and 𝐯\mathbf{v}: multiply the following identities by (k+2)(k+2) and (k+1)(k+1), respectively,

𝐳k+1=(k+1k+2𝐈k+1k+2α𝐁+α2𝐁2)𝐳k+1k+2(𝐈α𝐁)𝐳0α(𝐈α𝐁)𝐯\displaystyle\mathbf{z}^{k+1}=\left(\frac{k+1}{k+2}\mathbf{I}-\frac{k+1}{k+2}\alpha\mathbf{B}+\alpha^{2}\mathbf{B}^{2}\right)\mathbf{z}^{k}+\frac{1}{k+2}(\mathbf{I}-\alpha\mathbf{B})\mathbf{z}^{0}-\alpha(\mathbf{I}-\alpha\mathbf{B})\mathbf{v}
𝐳k=(kk+1𝐈kk+1α𝐁+α2𝐁2)𝐳k1+1k+1(𝐈α𝐁)𝐳0α(𝐈α𝐁)𝐯\displaystyle\mathbf{z}^{k}=\left(\frac{k}{k+1}\mathbf{I}-\frac{k}{k+1}\alpha\mathbf{B}+\alpha^{2}\mathbf{B}^{2}\right)\mathbf{z}^{k-1}+\frac{1}{k+1}(\mathbf{I}-\alpha\mathbf{B})\mathbf{z}^{0}-\alpha(\mathbf{I}-\alpha\mathbf{B})\mathbf{v}

and subtract to eliminate 𝐳0\mathbf{z}^{0}. Since EAG has 𝒪(1/k2)\mathcal{O}(1/k^{2}) rate, this reformulation shows that the Θ(1/k)\Theta(1/k) 1-SCLI lower bound on the squared gradient norm cannot be generalized to 2-CLI algorithms.

Furthermore, EAG also provides a partial resolution, in the negative, of the open problem of whether the Θ(1/k)\Theta(1/k) 1-SCLI lower bound on the squared gradient norm can be generalized to 1-CLI algorithms. Observe that if we translate the given problem to set 𝐳0=0\mathbf{z}^{0}=0, keeping the sequence 𝐳2k\mathbf{z}^{k}_{2} is no longer necessary, and (76) reduces to 1-CLI form. Such translation is not allowed in the definition (75), but it is reasonable to consider an expanded class of algorithms that are 11-CLI up to translation. Precisely, define an algorithm 𝒜\mathcal{A} to be translated 1-CLI if it takes the form

𝐳k+1=𝐂(k)(𝐁)(𝐳k)+𝐍(k)(𝐁)(𝐯)\displaystyle\mathbf{z}^{k+1}=\mathbf{C}^{(k)}(\mathbf{B})(\mathbf{z}^{k})+\mathbf{N}^{(k)}(\mathbf{B})(\mathbf{v})

when 𝐳0=0\mathbf{z}^{0}=0, and is translation invariant in the sense that

𝐳k=𝒜(𝐳0,𝐳1,,𝐳k1;𝐋)=𝐳0+𝒜(0,𝐳1𝐳0,,𝐳k1𝐳0;𝐋𝐳0)\displaystyle\mathbf{z}^{k}=\mathcal{A}(\mathbf{z}^{0},\mathbf{z}^{1},\dots,\mathbf{z}^{k-1};\mathbf{L})=\mathbf{z}^{0}+\mathcal{A}(0,\mathbf{z}^{1}-\mathbf{z}^{0},\dots,\mathbf{z}^{k-1}-\mathbf{z}^{0};\mathbf{L}_{\mathbf{z}^{0}})

when 𝐳00\mathbf{z}^{0}\neq 0, where 𝐋𝐳0(𝐱,𝐲)=𝐋(𝐱+𝐱0,𝐲+𝐲0)\mathbf{L}_{\mathbf{z}^{0}}(\mathbf{x},\mathbf{y})=\mathbf{L}(\mathbf{x}+\mathbf{x}^{0},\mathbf{y}+\mathbf{y}^{0}). That is, the iterates of 𝒜\mathcal{A} are generated equivalently by starting with 𝐳0=0\mathbf{z}^{0}=0 and applying 𝒜\mathcal{A} to the translated objective 𝐋𝐳0\mathbf{L}_{\mathbf{z}^{0}}. The concept of translated 1-CLI can be viewed as a generalization of consistent 1-SCLI algorithms; observe that we can rewrite (73) as

𝐳k+1𝐳0=𝐂(𝐁)(𝐳k𝐳0)+𝐍(𝐁)(𝐁𝐳0+𝐯)(𝐈+𝐍(𝐁)𝐁𝐂(𝐁))𝐳0,\displaystyle\mathbf{z}^{k+1}-\mathbf{z}^{0}=\mathbf{C}(\mathbf{B})(\mathbf{z}^{k}-\mathbf{z}^{0})+\mathbf{N}(\mathbf{B})(\mathbf{B}\mathbf{z}^{0}+\mathbf{v})-(\mathbf{I}+\mathbf{N}(\mathbf{B})\mathbf{B}-\mathbf{C}(\mathbf{B}))\mathbf{z}^{0},

which shows that a 1-SCLI algorithm is translation invariant if and only if it satisfies the consistency formula (74). Since EAG has 𝒪(1/k2)\mathcal{O}(1/k^{2}) rate and is a translated 1-CLI algorithm, our results prove that the Θ(1/k)\Theta(1/k) 1-SCLI lower bound on the squared gradient norm cannot be generalized to translated 1-CLI algorithms.