
Jaeyeon Kim, Seoul National University ([email protected])
Chanwoo Park, MIT ([email protected])
Asuman Ozdaglar, MIT ([email protected])
Jelena Diakonikolas, University of Wisconsin–Madison ([email protected])
Ernest K. Ryu, Seoul National University ([email protected])

Mirror Duality in Convex Optimization

Jaeyeon Kim · Chanwoo Park · Asuman Ozdaglar · Jelena Diakonikolas · Ernest K. Ryu
Abstract

While first-order optimization methods are usually designed to efficiently reduce the function value $f(x)$, there has been recent interest in methods efficiently reducing the magnitude of $\nabla f(x)$, and the findings show that the two types of methods exhibit a certain symmetry. In this work, we present mirror duality, a one-to-one correspondence between mirror-descent-type methods reducing function value and reducing gradient magnitude. Using mirror duality, we obtain the dual accelerated mirror descent (dual-AMD) method that efficiently reduces $\psi^{*}(\nabla f(x))$, where $\psi$ is a distance-generating function and $\psi^{*}$ quantifies the magnitude of $\nabla f(x)$. We then apply dual-AMD to efficiently reduce $\|\nabla f(\cdot)\|_{q}$ for $q\in[2,\infty)$ and to efficiently compute $\varepsilon$-approximate solutions of the optimal transport problem.

1 Introduction

Consider the problem

\underset{x\in X}{\text{minimize}}\quad f(x),

where $f\colon X\to\mathbb{R}$ is a differentiable convex function and $X$ is a reflexive Banach space. One of the most fundamental facts in convex optimization is that every stationary point is also a global minimum, so the minimization problem is equivalent to

\underset{x\in X}{\text{find}}\quad 0=\nabla f(x).

However, the problem of efficiently approximating function minima, i.e., making $f(x)-\inf_{x\in X}f(x)$ small, is quite different from the problem of efficiently computing approximate stationary points, i.e., making $\nabla f(x)$ small. Methods that exhibit optimal convergence rates under one of the criteria do not, in general, exhibit optimal convergence rates under the other. In particular, Nesterov’s accelerated gradient method is an optimal method for the former but not the latter criterion. Recently, methods like OGM-G kim2021optimizing have been proposed as accelerated methods for reducing $\|\nabla f(x)\|_{2}$ when $X$ is the Euclidean space. However, the problem of finding approximate stationary points is still much less understood than the problem of finding approximate function minima in more general normed spaces.

On the other hand, the framework of mirror descent has been a cornerstone for designing efficient methods for optimization problems with structures that are well-behaved in non-Euclidean geometries. However, compared to the large body of research on mirror-descent-type methods for efficiently reducing function value, which includes accelerated mirror descent, mirror-descent-type methods for reducing gradient magnitude have been much less explored.

In this work, we start from the perspective that reducing gradient magnitude is reducing a function value in an appropriate mirror-descent setup. More specifically, this observation, due to doi:10.1137/19M130858X and restated in Section 2.1, is that exchanging the roles of the objective function and the conjugate of the distance-generating function (DGF) in (unaccelerated) mirror descent yields a method that reduces the gradient magnitude as measured by the DGF.

We then build upon this insight and present mirror duality, a one-to-one correspondence between mirror-descent-type methods reducing function value and those reducing gradient magnitude. Using mirror duality, we obtain dual accelerated mirror descent (dual-AMD) as the mirror dual of accelerated mirror descent (AMD). With mirror duality, the prior analysis showing that AMD efficiently reduces function value implies that dual-AMD efficiently reduces $\psi^{*}(\nabla f(x))$, where $\psi$ is a distance-generating function and $\psi^{*}$ quantifies the magnitude of $\nabla f(x)$. Finally, we show how to apply dual-AMD to efficiently reduce $\|\nabla f(\cdot)\|_{q}$ for $q\in[2,\infty)$ and to efficiently compute approximate solutions of the optimal transport problem.

1.1 Preliminaries

We quickly review relevant basic concepts and set up the notation.

Banach spaces.

We follow the standard definitions of convex analysis in Banach spaces barbu2012. Let $(X,\left\|\cdot\right\|)$ be a Banach space, $(X^{*},\left\|\cdot\right\|_{*})$ be its (continuous) dual space, and $\left\langle\cdot,\cdot\right\rangle\colon X^{*}\times X\rightarrow\mathbb{R}$ be the canonical pairing between $X^{*}$ and $X$. In finite-dimensional setups, $X=X^{*}=\mathbb{R}^{n}$, $\left\|\cdot\right\|$ and $\left\|\cdot\right\|_{*}$ are dual norms, and $\langle u,x\rangle=u^{\intercal}x$ is the standard Euclidean inner product. We assume $X$ is reflexive: $X^{**}=X$ and the dual norm of $\left\|\cdot\right\|_{*}$ is $\left\|\cdot\right\|$, which always holds when $X$ is finite-dimensional. We denote the Euclidean norm by $\left\|\cdot\right\|_{2}$.

Basic convex analysis.

A set $C\subset X$ is convex if $\lambda x+(1-\lambda)y\in C$ for all $x,y\in C$, $\lambda\in(0,1)$. The interior of $C\subseteq X$ is $\operatorname{int}C=\{x\in C\,|\,\exists\,\epsilon>0,\,B(x,\epsilon)\subset C\}$, where $B(x,\epsilon)$ denotes the open ball of radius $\epsilon$ centered at $x$. Let $\overline{\mathbb{R}}=\mathbb{R}\cup\{\pm\infty\}$. A function $f\colon X\rightarrow\overline{\mathbb{R}}$ is convex if $f(\lambda x+(1-\lambda)y)\leq\lambda f(x)+(1-\lambda)f(y)$ for any $x,y\in X$ and $\lambda\in(0,1)$. If the inequality holds strictly for all $x,y\in X$ such that $f(x)<\infty$, $f(y)<\infty$, and all $\lambda\in(0,1)$, then $f$ is strictly convex. A function $f$ is proper if $f(x)>-\infty$ for all $x\in X$ and $f(x)<\infty$ for some $x\in X$. We say that $f$ is closed if $\operatorname{epi}(f)=\{(x,t)\,|\,f(x)\leq t\}\subseteq X\times\mathbb{R}$ is closed (in the standard product topology). (See (barbu2012, Proposition 2.5) for equivalent definitions of closedness.) We say a function $f$ is CCP if it is convex, closed, and proper. Write $\operatorname{dom}f=\{x\in X\,|\,f(x)<\infty\}$ to denote the (effective) domain of $f$.

Differentiability and subdifferentiability.

Given $f\colon X\rightarrow\overline{\mathbb{R}}$, we define its convex conjugate $f^{*}\colon X^{*}\rightarrow\overline{\mathbb{R}}$ as $f^{*}(u)=\sup_{x\in X}\left\{\left\langle x,u\right\rangle-f(x)\right\}$. Define the biconjugate $f^{**}\colon X\rightarrow\overline{\mathbb{R}}$ (remember, $X^{**}=X$) as $f^{**}(x)=\sup_{u\in X^{*}}\left\{\left\langle u,x\right\rangle-f^{*}(u)\right\}$. If $f$ is CCP, then $f^{**}=f$ (barbu2012, Theorem 2.22). We say that $f\colon X\rightarrow\overline{\mathbb{R}}$ is (Fréchet) differentiable at $x\in X$ if $|f(\cdot)|<\infty$ on an open neighborhood of $x$ and there exists $\nabla f(x)\in X^{*}$ such that

\langle\nabla f(x),h\rangle=\lim_{\varepsilon\rightarrow 0}\frac{f(x+\varepsilon h)-f(x)}{\varepsilon},\qquad\forall\,h\in X.

If the above holds, we define $\nabla f(x)\in X^{*}$ to be the gradient of $f$ at $x$. Write $\operatorname{dom}\nabla f$ for the set of points at which $f$ is Fréchet differentiable. Finally, we say $f$ is differentiable (everywhere) if $f$ is differentiable at all $x\in X$. Let $f\colon X\rightarrow\overline{\mathbb{R}}$ be a CCP function. We say $g\in X^{*}$ is a subgradient of $f$ at $x$ if

f(y)\geq f(x)+\langle g,y-x\rangle,\quad\forall\,y\in X.

We write $\partial f(x)\subseteq X^{*}$ for the set of subgradients of $f$ at $x$.

Smoothness.

For $L>0$, we say $f\colon X\rightarrow\overline{\mathbb{R}}$ is $L$-smooth (with respect to $\left\|\cdot\right\|$) if $|f(x)|<\infty$ everywhere (so $f\colon X\rightarrow\mathbb{R}$), $f$ is differentiable everywhere, and $\left\|\nabla f(x)-\nabla f(y)\right\|_{*}\leq L\left\|x-y\right\|$ for all $x,y\in X$. The following lemma establishes equivalent conditions defining $L$-smoothness.

Lemma 1

(zalinescu2002convex, Section 3.5) [Banach–Baillon–Haddad theorem] Let $(X,\|\cdot\|)$ be a Banach space, $f\colon X\rightarrow\mathbb{R}$ be CCP and differentiable, and $L>0$. The following are equivalent:

  • (i)

    $\|\nabla f(x)-\nabla f(y)\|_{*}\leq L\|x-y\|,\quad\forall\,x,y\in X$.

  • (ii)

    $\langle\nabla f(x)-\nabla f(y),x-y\rangle\geq\frac{1}{L}\|\nabla f(x)-\nabla f(y)\|^{2}_{*},\quad\forall\,x,y\in X$.

  • (iii)

    $f(x)-f(y)+\left\langle\nabla f(x),y-x\right\rangle+\frac{1}{2L}\left\|\nabla f(x)-\nabla f(y)\right\|^{2}_{*}\leq 0,\quad\forall\,x,y\in X$.

This result can be viewed as a generalization of the Baillon–Haddad theorem (Baillon1977QuelquesPD, Corollary 10) to Banach spaces, so we call it the Banach–Baillon–Haddad theorem. We call the inequality of (iii) the cocoercivity inequality.

Strong convexity.

For $\sigma>0$, we say $h\colon X\rightarrow\overline{\mathbb{R}}$ is $\sigma$-strongly convex (with respect to $\left\|\cdot\right\|$) if

h(y)\geq h(x)+\langle\partial h(x),y-x\rangle+\frac{\sigma}{2}\|y-x\|^{2},\quad\forall\,x,y\in X,

where the inequality is understood to hold for all elements of $\partial h(x)$. The following lemma establishes the equivalence between strong convexity and smoothness of the convex conjugate.

Lemma 2

(zalinescu2002convex, Section 3.5) Let $(X,\|\cdot\|)$ be a reflexive Banach space, $h\colon X\rightarrow\overline{\mathbb{R}}$ a CCP function, and $\sigma>0$. Then, $h$ is $\sigma$-strongly convex with respect to $\left\|\cdot\right\|$ if and only if $h^{*}$ is $1/\sigma$-smooth with respect to $\left\|\cdot\right\|_{*}$.

Bregman divergence.

Let $(X,\|\cdot\|)$ be a Banach space and let $h\colon X\rightarrow\overline{\mathbb{R}}$. The Bregman divergence with respect to $h$ is $D_{h}\colon X\times X\rightarrow\overline{\mathbb{R}}$ defined as

D_{h}(x,x^{\prime})=\left\{\begin{array}{ll}h(x)-h(x^{\prime})-\left\langle\nabla h(x^{\prime}),x-x^{\prime}\right\rangle&\text{ for }x\in\operatorname{dom}h,\,x^{\prime}\in\operatorname{dom}\nabla h\\ \infty&\text{ otherwise.}\end{array}\right.

The standard inequalities for differentiable convex $f$ can be written using the Bregman divergence as follows: If $f$ is convex,

D_{f}(x,x^{\prime})\geq 0,\qquad\forall\,x,x^{\prime}\in X.  (cvx-ineq)

If $f$ is convex and $L$-smooth, by Lemma 1,

D_{f}(x,x^{\prime})-\frac{1}{2L}\left\|\nabla f(x)-\nabla f(x^{\prime})\right\|^{2}_{*}\geq 0,\qquad\forall\,x,x^{\prime}\in X.  (coco-ineq)

If $y\in\operatorname{dom}\nabla h^{*}$, the Fenchel inequality fenchel1949conjugate is

D_{h}(x,\nabla h^{*}(y))=h(x)+h^{*}(y)-\left\langle y,x\right\rangle\geq 0,\qquad\forall\,x\in X,\,y\in X^{*}.
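As a concrete illustration (ours, not from the paper), here is a minimal numerical sketch of Bregman divergences in the finite-dimensional setting $X=\mathbb{R}^{n}$; the two choices of $h$ (the squared Euclidean norm and the negative entropy) are standard examples used only for illustration.

```python
import numpy as np

def bregman(h, grad_h, x, xp):
    """Bregman divergence D_h(x, x') = h(x) - h(x') - <grad h(x'), x - x'>."""
    return h(x) - h(xp) - grad_h(xp) @ (x - xp)

# Example 1: h = (1/2)||.||_2^2 gives D_h(x, x') = (1/2)||x - x'||_2^2.
h_sq = lambda x: 0.5 * x @ x
grad_h_sq = lambda x: x

# Example 2: negative entropy on the positive orthant gives the KL divergence
# (up to the difference of the coordinate sums).
h_ent = lambda x: np.sum(x * np.log(x))
grad_h_ent = lambda x: np.log(x) + 1.0

x, xp = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])
print(bregman(h_sq, grad_h_sq, x, xp))   # equals 0.5*||x - x'||_2^2
print(bregman(h_ent, grad_h_ent, x, xp)) # equals KL(x || x') when both sum to 1
```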

Legendre convexity.

We follow the definitions from Bauschke2001ESSENTIALSE. Let $h\colon X\to\overline{\mathbb{R}}$ be a CCP function. We say $h$ is essentially smooth if $\partial h$ is locally bounded and single-valued on its domain. Additionally, if $(\partial h)^{-1}$ is locally bounded on its domain and $h$ is strictly convex on every convex subset of $\operatorname{dom}\partial h$, we say $h$ is Legendre convex. To clarify, a set-valued function is locally bounded on $Y$ if, for any $y\in Y$, there is a neighborhood of $y$ in $Y$ on which the output values of the set-valued function are bounded.

1.2 Prior works

First-order methods reducing gradient norm.

Consider minimizing a differentiable $L$-smooth function $f\colon X\rightarrow\mathbb{R}$ with $N$ iterations of a first-order method. Assume $x_{\star}\in\operatorname{arg\,min}_{x\in\mathbb{R}^{d}}f(x)$ exists. Write $f(x_{\star})=f_{\star}$. Classically, plain gradient descent achieves an $\mathcal{O}(L(f(x_{0})-f_{\star})/N)$ rate on $\|\nabla f(\cdot)\|_{2}^{2}$ for $L$-smooth functions, both convex and non-convex. Nemirovsky’s classical work nemirovsky1991optimality; nemirovsky1992information provides lower bounds for reducing $\left\|\nabla f(x_{N})\right\|^{2}$ via first-order deterministic methods. They proved there exists a convex and $L$-smooth function $f$ such that no $N$-step first-order deterministic method can reduce $\|\nabla f(x_{N})\|^{2}$ faster than $\Omega\left(L(f(x_{0})-f_{\star})/N^{2}\right)$ in the worst case. The same argument yields the lower bound $\Omega\left(L^{2}\|x_{0}-x_{\star}\|_{2}^{2}/N^{4}\right)$.

The formal study of first-order methods for efficiently reducing the gradient magnitude $\left\|\nabla f(x)\right\|_{2}^{2}$ was initiated by Nesterov nesterov2012make, who proposed the regularization technique to obtain a $\tilde{\mathcal{O}}(L^{2}\|x_{0}-x_{\star}\|_{2}^{2}/N^{4})$ rate, where $\tilde{\mathcal{O}}$ ignores logarithmic factors. This regularization technique was also used in the stochastic setup zhu2018how. The logarithmic gap of Nesterov was closed by Kim and Fessler, who proposed OGM-G and established a $\mathcal{O}(L(f(x_{0})-f_{\star})/N^{2})$ rate kim2021optimizing; it was pointed out in (nesterov2020primal, Remark 2.1) that OGM-G’s rate can be combined with the classical Nesterov acceleration to imply the optimal $\mathcal{O}(L^{2}\|x_{0}-x_{\star}\|_{2}^{2}/N^{4})$ rate. Kim and Fessler’s discovery, which was made with the computer-assisted proof methodology called the performance estimation problem drori2014performance; taylor2017smooth; taylor2017exact; kim2016optimized, kickstarted a series of follow-up works: OBL-G, a variant of OGM-G that allows a backtracking linesearch park2021optimal; M-OGM-G, which attains the same $\mathcal{O}(L(f(x_{0})-f_{\star})/N^{2})$ rate with simpler rational coefficients ZhouTianSoCheng2021_practical; and better human-understandable analyses using potential functions lee2021geometric; diakonikolas2021potential. On the other hand, lan2023optimal utilizes the regularization technique of Nesterov in a clever aggregate analysis to remove the logarithmic terms and achieves the optimal rate without using the concatenation technique that we describe in Section 3.4. Furthermore, lan2023optimal accomplishes this adaptively in the sense that the method does not require a priori knowledge of $L$.

There are also follow-up works in other setups: continuous-time analysis of OGM-G Suh2022continuous and the proximal variants FISTA-G lee2021geometric and SFG kim2023timereversed. However, despite this active line of work, the acceleration mechanisms behind reducing gradient magnitude are much less understood than the classical techniques for reducing function value.

H-duality.

Interestingly, OGM-G exhibits a certain symmetry against OGM drori2014performance; kim2016optimized, an accelerated first-order method that improves upon Nesterov’s acceleration by a factor of 2 for reducing function values of smooth convex functions. The recent work kim2023timereversed showed that this symmetry is fundamentally due to H-duality, a one-to-one correspondence between first-order methods that reduce function value and those that reduce gradient norm.

We quickly review H-duality. Let $X$ be a Euclidean space equipped with the standard Euclidean norm. Let $f\colon X\rightarrow\mathbb{R}$ be convex and $L$-smooth. Consider the problem

\underset{x\in X}{\text{minimize}}\quad f(x).

Consider an $N$-step fixed-step first-order method (FSFOM) with step sizes $\{h_{k,i}\}_{0\leq i<k\leq N}\subset\mathbb{R}$, defined by

x_{k+1}=x_{k}-\frac{1}{L}\sum_{i=0}^{k}h_{k+1,i}\nabla f(x_{i}),\quad k=0,1,\dots,N-1,

where $x_{0}\in X$ is a starting point. We say this FSFOM is defined by the lower triangular matrix $H\in\mathbb{R}^{N\times N}$ containing $\{h_{k,i}\}_{0\leq i<k\leq N}$, i.e., $H_{k+1,i+1}=h_{k+1,i}$ if $0\leq i\leq k\leq N-1$, and $H_{k+1,i+1}=0$ otherwise. For $H\in\mathbb{R}^{N\times N}$, define its anti-diagonal transpose $H^{A}\in\mathbb{R}^{N\times N}$ with $H^{A}_{i,j}=H_{N-j+1,N-i+1}$ for $i,j=1,\dots,N$. We refer to the FSFOM defined by $H^{A}$ as the H-dual of the FSFOM defined by $H$.
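Since the anti-diagonal transpose is a purely mechanical operation on the step-size matrix, here is a small sketch (ours, not from the paper) of how one might form $H^{A}$ in NumPy and check basic properties; the example matrices are placeholders.

```python
import numpy as np

def anti_transpose(H):
    """Anti-diagonal transpose: (H^A)_{i,j} = H_{N-j+1, N-i+1} (1-indexed)."""
    return H[::-1, ::-1].T

N = 4
H = np.tril(np.random.rand(N, N))       # a generic lower triangular step-size matrix
HA = anti_transpose(H)                  # the H-dual's step-size matrix
assert np.allclose(anti_transpose(HA), H)   # the operation is an involution

# Gradient descent with constant step size has H = I, which is self-H-dual.
I = np.eye(N)
assert np.allclose(anti_transpose(I), I)
```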

To clarify, the H-dual is a dual of a first-order method, an algorithm. This contrasts with the usual notions of duality between primal and dual spaces, functions, or optimization problems. The H-dual operation is defined mechanically, but the H-duality theorem, which we soon state as Fact 1, establishes a non-trivial property between the H-dual FSFOM pair.

Fact 1

[H-duality theorem, Informal (kim2023timereversed, Theorem 1)] Let $\mathcal{A}$ and $\mathcal{B}$ be FSFOMs that are H-duals of each other and let $\alpha>0$. Then the following are equivalent.

  • $f(x_{N})-f_{\star}\leq\frac{\alpha L}{2}\left\|x_{0}-x_{\star}\right\|^{2}$ for $\mathcal{A}$ can be proved with primal energy function structure.

  • $\frac{1}{2L}\left\|\nabla f(x_{N})\right\|^{2}\leq\alpha\left(f(x_{0})-f_{\star}\right)$ for $\mathcal{B}$ can be proved with dual energy function structure.

We will clarify the meaning of ‘primal and dual energy function structures’ in Section 3.2 along with our main results. The H-duality theorem provides a sufficient condition that ensures an FSFOM defined by $H\in\mathbb{R}^{N\times N}$ with a convergence guarantee on $f(x_{N})-f_{\star}$ can be “H-dualized” to obtain an FSFOM defined by $H^{A}\in\mathbb{R}^{N\times N}$ with a convergence guarantee on $\|\nabla f(x_{N})\|^{2}$, and vice versa.

Instances of H-duality.

OGM kim2016optimized is an FSFOM with guarantee

f(x_{N})-f_{\star}\leq\frac{1}{\zeta_{N}^{2}}\frac{L\left\|x_{0}-x_{\star}\right\|_{2}^{2}}{2},

and OGM-G kim2021optimizing is an FSFOM with guarantee

\frac{1}{2L}\|\nabla f(x_{N})\|^{2}_{2}\leq\frac{1}{\zeta_{N}^{2}}\left(f(x_{0})-f_{\star}\right),

where $\{\zeta_{i}\}_{i=0}^{N}$ is defined by $\zeta_{0}=1$, $\zeta_{i+1}^{2}-\zeta_{i+1}=\zeta_{i}^{2}$ for $i=0,\dots,N-2$, and $\zeta_{N}^{2}-\zeta_{N}=2\zeta_{N-1}^{2}$. It turns out that OGM and OGM-G are H-duals of each other, and their guarantees imply each other through the H-duality theorem (with $\alpha=1/\zeta_{N}^{2}$).
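For reference, a short sketch (ours) of computing the $\zeta$-sequence by taking the positive roots of the defining quadratic recursions; the helper name `zeta_sequence` is ours.

```python
import math

def zeta_sequence(N):
    """zeta_0 = 1; zeta_{i+1}^2 - zeta_{i+1} = zeta_i^2 for i = 0,...,N-2;
    zeta_N^2 - zeta_N = 2*zeta_{N-1}^2 (taking positive roots)."""
    zeta = [1.0]
    for _ in range(N - 1):
        zeta.append((1 + math.sqrt(1 + 4 * zeta[-1] ** 2)) / 2)
    zeta.append((1 + math.sqrt(1 + 8 * zeta[-1] ** 2)) / 2)
    return zeta  # length N + 1

# zeta_N grows linearly in N, so 1/zeta_N^2 = O(1/N^2) in both guarantees above.
print(zeta_sequence(10)[-1])
```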

As another example, standard gradient descent, $x_{k+1}=x_{k}-\frac{1}{L}\nabla f(x_{k})$ for $k=0,\dots,N-1$, has guarantees

f(x_{N})-f_{\star}\leq\frac{1}{2N+1}\frac{L\left\|x_{0}-x_{\star}\right\|^{2}_{2}}{2}

and

\frac{1}{2L}\|\nabla f(x_{N})\|^{2}_{2}\leq\frac{1}{2N+1}\left(f(x_{0})-f_{\star}\right).

Gradient descent (with constant stepsize) is “self-H-dual”, i.e., the H-dual of gradient descent is gradient descent itself, and these two guarantees imply each other through the H-duality theorem (with $\alpha=1/(2N+1)$).

Recently, DasGupta2024 numerically observed that gradient-descent-type methods ($x_{i+1}=x_{i}-\frac{h_{i}}{L}\nabla f(x_{i})$ for $i=0,\dots,N-1$) with appropriately chosen step sizes can exhibit a faster convergence rate than $\mathcal{O}(1/N)$. In a series of works grimmer2023accelerated; grimmer2024accelerated; grimmer2024provably; altschuler2023acceleration1; altschuler2023acceleration2, the rate $\mathcal{O}(1/N^{1.2716})$ was theoretically established. The latest of this line of work grimmer2024accelerated, which presents the guarantee with the best constant, also presents an H-duality phenomenon: If $N=2^{n}-1$ for some $n\in\mathbb{N}$, the method $x_{i+1}=x_{i}-\tfrac{h_{i}^{\mathrm{(left)}}}{L}\nabla f(x_{i})$ with step sizes $\{h_{i}^{\mathrm{(left)}}\}_{i=0}^{N-1}$ has a guarantee

f(x_{N})-f_{\star}\leq\frac{1}{r_{N}}\frac{L\left\|x_{0}-x_{\star}\right\|^{2}_{2}}{2},

where $r_{N}\approx N^{\log_{2}(1+\sqrt{2})}$, and its H-dual method, $x_{i+1}=x_{i}-\tfrac{h_{N-1-i}^{\mathrm{(left)}}}{L}\nabla f(x_{i})$, has a guarantee

\frac{1}{2L}\|\nabla f(x_{N})\|^{2}_{2}\leq\frac{1}{r_{N}}\left(f(x_{0})-f_{\star}\right).

Although this result is not a direct instance of the existing H-duality theorem, the symmetry points to the existence of a broader H-duality phenomenon beyond what is elucidated in (kim2023timereversed, Theorem 1).

Mirror-descent-type methods.

The framework of mirror descent has been a cornerstone for designing efficient methods for optimization problems with structures that are well-behaved in non-Euclidean geometries nemirovsky1983problem; beck2003mirror; ben2001ordered; Juditsky20105FO; Censor1992proximal; Teboulle1992proximal; Jonathan1993nonlinear. A mirror-descent-type method uses a distance-generating function $\phi\colon X\rightarrow\overline{\mathbb{R}}$, which gives rise to the Bregman divergence $D_{\phi}(\cdot,\cdot)$, for minimizing a convex function $f\colon X\rightarrow\overline{\mathbb{R}}$. Consider $N$ iterations of such a method.

The standard mirror descent method attains a $\mathcal{O}(D_{\phi}(x_{\star},x_{0})/\sqrt{N})$ rate on the best iterate for nonsmooth convex $f$ with bounded subgradients and a standard set of assumptions. For differentiable convex $f$, doi:10.1287/moor.2016.0817; doi:10.1137/16M1099546; VanNguyen2017 presented the relative smoothness condition, which relaxes the strong convexity of the distance-generating function, and achieved a $\mathcal{O}(D_{\phi}(x_{\star},x_{0})/N)$ rate on the last iterate under relative smoothness. Under a set of stronger assumptions, accelerating the rate on function value is possible: the Improved Gradient Algorithm achieves a $\mathcal{O}(D_{\phi}(x_{\star},x_{0})/N^{2})$ rate for constrained smooth optimization problems within Euclidean spaces auslender2006interior; with a uniformly convex distance-generating function, a rate faster than $\mathcal{O}(D_{\phi}(x_{\star},x_{0})/N)$ can be achieved for convex functions that are smooth with respect to the $\ell_{p}$-norm alexandre2018optimal; other accelerated variants include a method using an auxiliary regularizer function tseng2008 and a method proposed from an ODE perspective on linear coupling krichene2015amd. The prior accelerated mirror-descent-type method most relevant to our work is the accelerated Bregman proximal gradient method, which attains a $\mathcal{O}(D_{\phi}(x_{\star},x_{0})/N^{2})$ rate on function values under the setup of smooth $f$ and strongly convex $\phi$ with respect to any norm $\|\cdot\|$ d2021acceleration.

Recently, connections between various sampling algorithms and entropic mirror descent, which uses negative entropy as its distance-generating function, have been established: chopin2023connection showed the equivalence with tempering, and karimi2023sinkhorn demonstrated the connection with generalized Sinkhorn iterations.

Mirror-descent-type methods reducing gradient magnitude.

There has been some recent attention toward mirror-descent-type methods that efficiently reduce gradient magnitude. Dual preconditioned gradient descent doi:10.1137/19M130858X efficiently reduces a measure of the gradient magnitude under the relative smoothness condition in the dual space, a condition later characterized in doi:10.1137/21M1465913. This method was extended to the non-convex and composite setups laude2023anisotropic. On the other hand, diakonikolas2023complementary used a regularization technique to develop methods that reduce the magnitude of the gradient, measured by the dual norm. In Section 2.1, we review dual preconditioned gradient descent in further detail, as it serves as the foundation for the development of dual-AMD.

1.3 Contributions and organization

Contributions.

This work presents two main contributions, one conceptual and one algorithmic. The first main contribution is mirror duality, presented as Theorem 3.1. The H-duality presented in kim2023timereversed made progress by establishing a duality principle between the task of reducing function value and the task of reducing gradient magnitude. Mirror duality extends H-duality to the mirror descent setup and provides a richer understanding.

Our second main contribution is the method dual-AMD and its convergence analysis, stated as Corollary 1. To the best of our knowledge, dual-AMD is the first mirror-descent-type method for reducing gradient magnitude at an accelerated rate. The generality of dual-AMD accommodates optimization problems with structures that are well-behaved in non-Euclidean geometries and allows us to measure the gradient magnitude with a smooth convex function of our choice, not necessarily a norm. The applications of Section 4 illustrate the practical utility of dual-AMD.

Organization.

Section 2.1 reviews MD and dual-MD and points out a symmetry between these two mirror-descent-type methods. Section 2.2 reviews AMD, an accelerated mirror-descent-type method that reduces the function value. The specific structure of the convergence analysis of AMD is essential for obtaining our main result.

Section 3.1 quickly establishes some definitions. Section 3.2 formally presents the mirror duality theorem, our first main contribution. Section 3.3 presents the dual-AMD method, our second main contribution, obtained by applying the mirror duality theorem to AMD. Section 4 presents applications illustrating the practical utility of dual-AMD. Section 5 compares the prior results on H-duality with our generalized framework of mirror duality. Finally, Section 6 concludes the paper and provides some discussion on future directions.

2 Review of dual mirror descent and accelerated mirror descent

2.1 MD and dual-MD

We provide a brief overview of mirror descent (MD) and dual mirror descent (dual-MD). The relationship between these two methods presents the following insight: Let $f$ be the objective function and $\phi$ be the distance-generating function. Then, MD is a method finding a $y\in X^{*}$ minimizing $f(\nabla\phi^{*}(y))$, and by swapping the roles of $f$ and $\phi^{*}$, we get dual-MD, which finds an $x\in X$ minimizing $\phi^{*}(\nabla f(x))$.

Mirror descent.

Let $X$ be a reflexive Banach space, $C\subseteq X$ be a nonempty closed convex set, and $f\colon X\to\overline{\mathbb{R}}$ be a Legendre convex function. To state MD, consider the problem

\underset{x\in C}{\text{minimize}}\quad f(x).

Assume that $x_{\star}\in\operatorname{arg\,min}_{x\in C}f(x)$ exists. Given an approximate solution $x$, we measure its suboptimality via

f(x)-f(x_{\star}).

To obtain $x\in C$ with small $f(x)-f(x_{\star})$, consider a Legendre convex function $\phi\colon X\to\overline{\mathbb{R}}$ such that $\overline{\operatorname{dom}\phi}=C$. For $\alpha>0$ and any starting point $y_{0}\in\operatorname{int}\operatorname{dom}\phi^{*}$, the Mirror Descent (MD) method is

y_{k+1} = y_{k}-\alpha\nabla f(x_{k})  (MD)
x_{k+1} = \nabla\phi^{*}(y_{k+1}),

for $k=0,1,\dots,N-1$, and $x_{0}=\nabla\phi^{*}(y_{0})$. We view $x_{N}$ as the output of MD. Consider the following additional assumption regarding $f$ and $\phi$.

  • (i)

    For some $\lambda>0$, $\lambda\phi-f$ is convex on $\operatorname{int}\operatorname{dom}\phi$.

  • (ii)

    $\operatorname{int}\operatorname{dom}\phi\subseteq\operatorname{int}\operatorname{dom}f$.

Under the above assumption, we have the following convergence guarantee for $x_{N}$.

Fact 2

(Dragomir2021, Theorem 1) Let $X$ be a reflexive Banach space and $f,\phi\colon X\to\overline{\mathbb{R}}$ Legendre convex functions. Then the iterates of MD are well defined. Assume (i) and (ii). If we further assume that $x_{\star}\in\operatorname{dom}\phi$, then for $\alpha\in(0,\lambda]$,

f(x_{N})-f(x_{\star})\leq\frac{1}{\alpha N}D_{\phi}(x_{\star},x_{0}).
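To make the updates concrete, here is a minimal sketch (ours, not from the paper) of MD in the finite-dimensional setting $X=\mathbb{R}^{n}$ with $C$ the probability simplex and $\phi$ the negative entropy, so that $\nabla\phi^{*}$ is the softmax map; the objective, step size $\alpha$, and iteration count are illustrative placeholders.

```python
import numpy as np

def softmax(y):
    """nabla phi^* for phi = negative entropy restricted to the simplex."""
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

def mirror_descent(grad_f, y0, alpha, N):
    """MD: y_{k+1} = y_k - alpha * grad f(x_k), x_{k+1} = nabla phi^*(y_{k+1})."""
    y = y0.copy()
    x = softmax(y)
    for _ in range(N):
        y = y - alpha * grad_f(x)
        x = softmax(y)
    return x  # x_N, the output of MD

# toy example: minimize a linear function <c, x> over the simplex
c = np.array([0.3, 1.0, 0.7])
x_N = mirror_descent(lambda x: c, np.zeros(3), alpha=0.5, N=100)
print(x_N)  # concentrates on the coordinate with the smallest c_i
```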

Dual mirror descent.

Consider the problem

\underset{x\in\operatorname{dom}\nabla f}{\text{find}}\quad 0=\nabla f(x).

This problem is equivalent to the previous one, $\min_{x\in C}f(x)$, if $C=X$ and $f$ is differentiable ($\operatorname{dom}\nabla f=X$). Given an approximate solution $x$ for this problem, we measure its suboptimality via

\psi^{*}(\nabla f(x)),

where $\psi^{*}\colon X^{*}\rightarrow\overline{\mathbb{R}}$ is a Legendre convex function such that $\psi^{*}(0)=0$ and $0\in X^{*}$ is the unique minimizer of $\psi^{*}$. So, $\psi^{*}$ quantifies the magnitude of $\nabla f(x)$. (Every stationary point is a global function minimum in convex optimization, so $\psi^{*}(\nabla f(x))=0$ is equivalent to $f(x)-\inf_{x\in X}f(x)=0$. However, $\psi^{*}(\nabla f(x))$ being small is not equivalent to $f(x)-\inf_{x\in X}f(x)$ being small, in the sense that for any $\varepsilon>0$ and $\psi^{*}$, there is an $f$ such that $\psi^{*}(\nabla f(x))<\varepsilon$ but $f(x)-\inf_{x\in X}f(x)>1$.) Even though the solution set is unchanged, changing the suboptimality measure leads to different efficient methods for finding approximate solutions. Dual preconditioned gradient descent doi:10.1137/19M130858X, formulated as follows, efficiently reduces $\psi^{*}(\nabla f(x))$. For $\alpha>0$, starting point $q_{0}\in\operatorname{int}\operatorname{dom}f$, and $r_{0}=\nabla f(q_{0})$,

q_{k+1} = q_{k}-\alpha\nabla\psi^{*}(r_{k})  (dual-MD)
r_{k+1} = \nabla f(q_{k+1}),

for $k=0,\dots,N-1$. We will refer to this method as dual-MD. The reason for this alternative naming will be made clear in Section 3.

The output of dual-MD is $q_{N}$. Consider the following additional assumption regarding $f$ and $\psi$.

  • (i)’

    For some $\tau>0$, $\tau f^{*}-\psi^{*}$ is convex on $\operatorname{int}\operatorname{dom}f^{*}$.

  • (ii)’

    $\operatorname{int}\operatorname{dom}f^{*}\subseteq\operatorname{int}\operatorname{dom}\psi^{*}$.

Under this assumption, we have the following convergence guarantee for $q_{N}$.

Fact 3

(doi:10.1137/19M130858X, Theorem 3.9) Let $X$ be a reflexive Banach space and $f,\psi\colon X\to\overline{\mathbb{R}}$ be Legendre convex functions. Assume (i)’ and (ii)’. Then, the iterates of dual-MD are well defined. If we further assume that $\psi^{*}(0)=0$ and $0\in X^{*}$ is the unique minimizer of $\psi^{*}$, then for $\alpha\in(0,\tau]$,

\psi^{*}(\nabla f(q_{N}))=\psi^{*}(r_{N})\leq\frac{1}{\alpha N}\big(f(q_{0})-\inf_{x\in X}f(x)\big).

(If $\inf_{x\in X}f(x)=-\infty$, then the right-hand side is $+\infty$ and the bound vacuously holds.)
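Analogously, here is a minimal sketch (ours) of dual-MD on $X=\mathbb{R}^{n}$ with the illustrative choice $\psi^{*}=\frac{1}{2}\|\cdot\|_{2}^{2}$, for which $\nabla\psi^{*}$ is the identity and dual-MD reduces to plain gradient descent; the quadratic objective and step size are placeholders, and the function name `dual_md` is ours.

```python
import numpy as np

def dual_md(grad_f, q0, alpha, N, grad_psi_star=lambda r: r):
    """dual-MD: q_{k+1} = q_k - alpha * grad psi^*(r_k), r_{k+1} = grad f(q_{k+1})."""
    q = q0.copy()
    r = grad_f(q)
    for _ in range(N):
        q = q - alpha * grad_psi_star(r)
        r = grad_f(q)
    return q, r  # output q_N and r_N = grad f(q_N)

# toy example: f(x) = (1/2) x^T A x with A positive definite, L = lambda_max(A)
A = np.diag([1.0, 2.0, 4.0])
q_N, r_N = dual_md(lambda x: A @ x, np.ones(3), alpha=1.0 / 4.0, N=100)
print(0.5 * r_N @ r_N)  # psi^*(grad f(q_N)), driven toward 0
```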

Interestingly, dual-MD can be obtained by swapping $f$ with $\phi^{*}=\psi^{*}$ in MD, and this symmetry between them leads to a crucial observation. In dual-MD, $\psi^{*}=\phi^{*}$ serves as the objective function, so the reduction of $\psi^{*}(\nabla f(q_{N}))$ in dual-MD as guaranteed by Fact 3 is really an immediate consequence of MD reducing $f(x_{N})=f(\nabla\phi^{*}(y_{N}))$ as guaranteed by Fact 2. At this point, MD and dual-MD exhibit a certain symmetry: one method reduces $f(\cdot)-f(x_{\star})$ while the other “dual” method reduces $\psi^{*}(\nabla f(\cdot))$. In Section 2.2, we state AMD, which reduces $f(\cdot)-f(x_{\star})$ at an $\mathcal{O}(1/N^{2})$ rate under stronger assumptions than needed for MD. This leads to a natural question: Can we “dualize” AMD to obtain a mirror-descent-type method with an $\mathcal{O}(1/N^{2})$ rate on $\psi^{*}(\nabla f(\cdot))$? For reasons that we explain later in Section 3, merely interchanging the roles of $f$ and $\phi^{*}=\psi^{*}$ does not work for obtaining the dual counterpart of AMD. The answer is obtained through the mirror duality that we present in Section 3.

The versions of Facts 2 and 3 presented in Dragomir2021 and doi:10.1137/19M130858X rely on slightly weaker assumptions than we state. We make this simplification to focus on our goal of introducing the concept of swapping $f$ and $\phi^{*}=\psi^{*}$ in mirror-descent-type methods.

Connection with dual averaging.

Our formulation of MD is closely related to dual averaging nesterov2009primal. (In dual averaging, dual refers to the duality between the primal $X$ and dual $X^{*}$ spaces, while in our dual-MD, dual refers to mirror duality.) Dual averaging and mirror descent are equivalent in the so-called unconstrained setup, which is what we consider. In the constrained setup, the optimization problem is constrained to a further subset of $C=\overline{\operatorname{dom}\phi}$, and in this constrained setup, mirror descent and dual averaging differ. The dual algorithms of our work and the notion of mirror duality do not seem to immediately generalize to the constrained setup, and studying this extension would be an interesting direction of future work.

2.2 Accelerated Mirror Descent

Again, consider the problem

\underset{x\in C}{\text{minimize}}\quad f(x),

where $X$ is a reflexive Banach space, $C\subseteq X$ is a nonempty closed convex set, and $f\colon X\to\mathbb{R}$ is a CCP function. Assume $x_{\star}\in\operatorname{arg\,min}_{x\in C}f(x)$ exists. Given an approximate solution $x$, we measure its suboptimality via $f(x)-f(x_{\star})$. We use a CCP function $\phi\colon X\to\overline{\mathbb{R}}$ such that $\overline{\operatorname{dom}\phi}=C$ for the distance-generating function. Assume that

  • (A1)

    $f$ is $L$-smooth; $\phi$ is $\sigma$-strongly convex; $L>0$, $\sigma>0$.

Let $N$ be a positive integer. Let $\{\theta_{i}\}_{i=-1}^{N}\subset\mathbb{R}$ be a non-decreasing sequence satisfying

\theta_{-1}=0,\qquad\theta_{0}=1,\qquad\theta_{i}^{2}-\theta_{i}\leq\theta_{i-1}^{2}\,\,\text{for}\,\,1\leq i\leq N-1,\qquad\theta_{N}=\theta_{N-1}.

When $\theta_{i}^{2}-\theta_{i}=\theta_{i-1}^{2}$ for $i=1,\dots,N-1$, the sequence becomes exactly the same as the $\theta$-sequence of Nesterov’s fast gradient method 10029946121, except for the final $\theta_{N}$. The variant of accelerated mirror descent (AMD) we consider is

y_{k+1} = y_{k}-\frac{\sigma}{L}\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)\nabla f(x_{k})  (AMD)
x_{k+1} = \frac{\theta_{k}^{2}}{\theta_{k+1}^{2}}x_{k}+\frac{\theta_{k+1}^{2}-\theta_{k}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k+1})+\frac{\theta_{k}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}}\left(\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right)

for $k=0,1,\dots,N-1$, where $y_{0}\in X^{*}$ is any starting point and $x_{0}=\nabla\phi^{*}(y_{0})$. We view $x_{N}$ as the output of AMD.
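For concreteness, here is a minimal sketch (ours, not from the paper) of AMD on $X=\mathbb{R}^{n}$ with the Euclidean distance-generating function $\phi=\frac{1}{2}\|\cdot\|_{2}^{2}$ (so $\sigma=1$, $C=X$, and $\nabla\phi^{*}$ is the identity), using the $\theta$-sequence with equality $\theta_{i}^{2}-\theta_{i}=\theta_{i-1}^{2}$; the quadratic objective is a placeholder and the function names are ours.

```python
import numpy as np

def theta_sequence(N):
    """theta_{-1}=0, theta_0=1, theta_i^2 - theta_i = theta_{i-1}^2, theta_N = theta_{N-1}."""
    theta = [0.0, 1.0]  # stored with an index shift: theta[k + 1] is theta_k
    for _ in range(N - 1):
        theta.append((1 + np.sqrt(1 + 4 * theta[-1] ** 2)) / 2)
    theta.append(theta[-1])  # theta_N = theta_{N-1}
    return theta

def amd(grad_f, L, y0, N, grad_phi_star=lambda y: y, sigma=1.0):
    th = theta_sequence(N)
    y = y0.copy()
    x = grad_phi_star(y)
    for k in range(N):
        tkm1, tk, tkp1 = th[k], th[k + 1], th[k + 2]  # theta_{k-1}, theta_k, theta_{k+1}
        y_new = y - (sigma / L) * (tk**2 - tkm1**2) * grad_f(x)
        x = (tk**2 / tkp1**2) * x \
            + ((tkp1**2 - tk**2) / tkp1**2) * grad_phi_star(y_new) \
            + ((tk**2 - tkm1**2) / tkp1**2) * (grad_phi_star(y_new) - grad_phi_star(y))
        y = y_new
    return x  # x_N

# toy example: f(x) = (1/2) x^T A x, L = lambda_max(A), f_star = 0
A = np.diag([1.0, 2.0, 4.0])
x_N = amd(lambda x: A @ x, L=4.0, y0=np.ones(3), N=50)
print(0.5 * x_N @ A @ x_N)  # f(x_N), driven toward 0
```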

Fact 4

Let $X$ be a reflexive Banach space and let $f\colon X\to\mathbb{R}$ and $\phi\colon X\to\mathbb{R}$ be CCP. (Our analysis does not require the distance-generating function $\phi$ to be differentiable. Prior work such as xiao2010dual has also carried out mirror-descent-type analyses under non-differentiability of $\phi$.) Assume (A1). Then, the iterates of AMD are well defined and $x_{k}\in C$ for $k=0,\dots,N$. Furthermore, if $x\in\operatorname{dom}\phi$,

f(x_{N})-f(x)\leq\frac{LD_{\phi}(x,x_{0})}{\sigma\theta_{N}^{2}}\stackrel{(\bullet)}{=}\mathcal{O}\left(\frac{LD_{\phi}(x,x_{0})}{\sigma N^{2}}\right)

where $(\bullet)$ holds if $\theta_{i}^{2}-\theta_{i}=\theta_{i-1}^{2}$ for $0\leq i\leq N-1$.

Proof

First, we show $x_{k}$ is a convex combination of $\{\nabla\phi^{*}(y_{0}),\dots,\nabla\phi^{*}(y_{k})\}$. Then the fact $\nabla\phi^{*}\colon X^{*}\rightarrow C$ yields $x_{k}\in C$. For $k=0,\dots,N-1$,

x_{k+1} = \frac{\theta_{k}^{2}}{\theta_{k+1}^{2}}x_{k}+\frac{\theta_{k+1}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k+1})-\frac{\theta_{k}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k})
= \frac{\theta_{k-1}^{2}}{\theta_{k+1}^{2}}x_{k-1}+\frac{\theta_{k+1}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k+1})+\frac{\theta_{k-1}^{2}-\theta_{k-2}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k})-\frac{\theta_{k-1}^{2}-\theta_{k-2}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k-1})
\qquad\vdots
= \frac{\theta_{k+1}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k+1})+\sum_{s=1}^{k}\frac{\theta_{s-1}^{2}-\theta_{s-2}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{s}).

Now we provide the convergence analysis of AMD.

Convergence analysis.

First, define $\{u_{i}\}_{i=0}^{N}$ as $u_{i}=\frac{\sigma}{L}\theta_{i}^{2}$. Also define an energy function $\{\mathcal{U}_{k}\}_{k=0}^{N}$ recursively as $\mathcal{U}_{0}=D_{\phi}(x,x_{0})-\frac{\sigma}{L}\theta_{0}^{2}D_{f}(x,x_{0})$ and

\mathcal{U}_{k+1} = \mathcal{U}_{k}-(u_{k+1}-u_{k})D_{f}\left(x,x_{k+1}\right)
-u_{k}\Big(\underbrace{D_{f}\left(x_{k},x_{k+1}\right)-\frac{1}{2L}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|_{*}^{2}}_{\geq 0\text{ by (coco-ineq)}}\Big)
-\Big(\underbrace{D_{\phi^{*}}\left(y_{k},y_{k+1}\right)-\frac{\sigma}{2}\left\|\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\|^{2}}_{\geq 0\text{ by (coco-ineq)}}\Big)

for $k=0,\dots,N-1$. Due to (cvx-ineq) and (coco-ineq), $\{\mathcal{U}_{k}\}_{k=0}^{N}$ is non-increasing. If we can prove $f(x_{N})-f(x)\leq\frac{L}{\sigma\theta_{N}^{2}}\mathcal{U}_{N}$, the monotonicity of $\{\mathcal{U}_{k}\}_{k=0}^{N}$ provides the convergence rate as follows:

f(x_{N})-f(x)\leq\frac{L}{\sigma\theta_{N}^{2}}\mathcal{U}_{N}\leq\dots\leq\frac{L}{\sigma\theta_{N}^{2}}\mathcal{U}_{0}\leq\frac{L}{\sigma\theta_{N}^{2}}D_{\phi}(x,x_{0}).

Now, the only part left is proving the inequality $f(x_{N})-f(x)\leq\frac{L}{\sigma\theta_{N}^{2}}\mathcal{U}_{N}$. Define $\mathbf{U}_{\text{AMD}}$, a function of the algorithm parameters and iterates of AMD, via

\mathcal{U}_{N} = \frac{\sigma}{L}\theta_{N}^{2}\left(f(x_{N})-f(x)\right)+\underbrace{\phi(x)+\phi^{*}(y_{N})-\left\langle y_{N},x\right\rangle}_{\geq 0\text{ by the Fenchel inequality}}
+\left\langle y_{N}-y_{0}+\sum_{i=0}^{N}\left(\frac{\sigma}{L}\theta_{i}^{2}-\frac{\sigma}{L}\theta_{i-1}^{2}\right)\nabla f(x_{i}),x\right\rangle+\mathbf{U}_{\text{AMD}}.

Here we used $D_{\phi}(x,x_{0})=\phi(x)+\phi^{*}(y_{0})-\left\langle y_{0},x\right\rangle$. Observe that

y_{N}-y_{0}+\sum_{i=0}^{N}\left(\frac{\sigma}{L}\theta_{i}^{2}-\frac{\sigma}{L}\theta_{i-1}^{2}\right)\nabla f(x_{i})
= -\sum_{k=0}^{N-1}(y_{k}-y_{k+1})+\sum_{i=0}^{N}\left(\frac{\sigma}{L}\theta_{i}^{2}-\frac{\sigma}{L}\theta_{i-1}^{2}\right)\nabla f(x_{i})
= -\sum_{k=0}^{N-1}\frac{\sigma}{L}\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)\nabla f(x_{k})+\sum_{i=0}^{N}\left(\frac{\sigma}{L}\theta_{i}^{2}-\frac{\sigma}{L}\theta_{i-1}^{2}\right)\nabla f(x_{i})
= 0.\quad\left(\because\theta_{N}=\theta_{N-1}\right)

Therefore, it is enough to show $\mathbf{U}_{\text{AMD}}\geq 0$. We simplify $\mathbf{U}_{\text{AMD}}$ as follows:

\mathbf{U}_{\text{AMD}}
= \frac{\sigma}{2L^{2}}\sum_{k=0}^{N-1}\theta_{k}^{2}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|^{2}_{*}+\frac{\sigma}{2}\sum_{k=0}^{N-1}\left\|\nabla\phi^{*}(y_{k})-\nabla\phi^{*}(y_{k+1})\right\|^{2}
\quad+\sum_{k=0}^{N-1}\left\langle y_{k}-y_{k+1},\nabla\phi^{*}(y_{k+1})\right\rangle-\frac{\sigma}{L}\sum_{k=0}^{N}\theta_{k}^{2}\left\langle\nabla f(x_{k})-\nabla f(x_{k+1}),x_{k}\right\rangle
= \frac{\sigma}{2L^{2}}\sum_{k=0}^{N-1}\theta_{k}^{2}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|^{2}_{*}+\frac{\sigma}{2}\sum_{k=0}^{N-1}\left\|\nabla\phi^{*}(y_{k})-\nabla\phi^{*}(y_{k+1})\right\|^{2}
\quad+\frac{\sigma}{L}\sum_{k=0}^{N-1}\left\langle\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)\nabla f(x_{k}),\nabla\phi^{*}(y_{k+1})\right\rangle-\frac{\sigma}{L}\sum_{k=0}^{N}\left\langle\nabla f(x_{k}),\theta_{k}^{2}x_{k}-\theta_{k-1}^{2}x_{k-1}\right\rangle
= \frac{\sigma}{2L^{2}}\sum_{k=0}^{N-1}\theta_{k}^{2}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|^{2}_{*}+\frac{\sigma}{2}\sum_{k=0}^{N-1}\left\|\nabla\phi^{*}(y_{k})-\nabla\phi^{*}(y_{k+1})\right\|^{2}
\quad+\frac{\sigma}{L}\sum_{k=0}^{N-1}\left\langle\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)\nabla f(x_{k}),\nabla\phi^{*}(y_{k+1})\right\rangle-\frac{\sigma}{L}\sum_{k=0}^{N}\left\langle\nabla f(x_{k}),(\theta_{k}^{2}-\theta_{k-1}^{2})\nabla\phi^{*}(y_{k})\right\rangle
\quad-\frac{\sigma}{L}\sum_{k=1}^{N}\left\langle\nabla f(x_{k}),(\theta_{k-1}^{2}-\theta_{k-2}^{2})\left(\nabla\phi^{*}(y_{k})-\nabla\phi^{*}(y_{k-1})\right)\right\rangle.

Grouping the terms in the last two lines, we obtain

\frac{\sigma}{L}\sum_{k=0}^{N-1}\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)\big[\left\langle\nabla f(x_{k}),\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\rangle-\left\langle\nabla f(x_{k+1}),\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\rangle\big]
= \frac{\sigma}{L}\sum_{k=0}^{N-1}\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)\left\langle\nabla f(x_{k})-\nabla f(x_{k+1}),\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\rangle.

Thus

\mathbf{U}_{\text{AMD}} = \frac{\sigma}{2L^{2}}\sum_{k=0}^{N-1}\theta_{k}^{2}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|^{2}_{*}+\frac{\sigma}{2}\sum_{k=0}^{N-1}\left\|\nabla\phi^{*}(y_{k})-\nabla\phi^{*}(y_{k+1})\right\|^{2}
\quad+\frac{\sigma}{L}\sum_{k=0}^{N-1}\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)\left\langle\nabla f(x_{k})-\nabla f(x_{k+1}),\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\rangle
= \sum_{k=0}^{N-1}\bigg(\frac{\sigma\theta_{k}^{2}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|^{2}_{*}}{2L^{2}}+\frac{\sigma\left\|\nabla\phi^{*}(y_{k})-\nabla\phi^{*}(y_{k+1})\right\|^{2}}{2}
\quad+\frac{\sigma\left(\theta_{k}^{2}-\theta_{k-1}^{2}\right)}{L}\left\langle\nabla f(x_{k})-\nabla f(x_{k+1}),\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\rangle\bigg)
\geq 0,

where we used $\theta_{k}^{2}-\theta_{k}\leq\theta_{k-1}^{2}$ for $k=0,\dots,N-1$ and $\frac{\sigma}{2L^{2}}\left\|a\right\|_{*}^{2}+\frac{\sigma}{2}\left\|b\right\|^{2}\geq\frac{\sigma}{L}\left\|a\right\|_{*}\left\|b\right\|\geq\frac{\sigma}{L}\left\langle a,b\right\rangle$ for any $(a,b)\in X^{*}\times X$. The first inequality is AM–GM and the second one comes from the definition of a dual norm. ∎

Comparison with existing methods.

Our stated AMD and its analysis represent a minor variation of prior work and are not part of our main contributions. More specifically, when $C=X$ and $\theta_{i}^{2}-\theta_{i-1}^{2}=\theta_{i}$ for $0\leq i\leq N-1$, AMD is equivalent to (d2021acceleration, Algorithm 22), except for the last iterate. In the Euclidean setup, AMD is also equivalent to IGA (auslender2006interior, Section 5), again excluding the final iterate.

In terms of the analysis, the prior work d2021acceleration achieves the same convergence rate as in Fact 4 under the slightly weaker assumption that $f$ is $L$-smooth on $C$, i.e., that $\left\|\nabla f(x)-\nabla f(y)\right\|_{*}\leq L\left\|x-y\right\|$ for all $x,y\in C$. In contrast, we require that $f$ is $L$-smooth on all of $X$, i.e., that $\left\|\nabla f(x)-\nabla f(y)\right\|_{*}\leq L\left\|x-y\right\|$ for all $x,y\in X$. In our proof, we use (coco-ineq), which, to the best of our knowledge, requires that $f$ is $L$-smooth on all of $X$, even though our AMD never evaluates $\nabla f$ outside of $C\subset X$. In this sense, our analysis of AMD is slightly less general than the prior analysis of (d2021acceleration, Algorithm 22).

Nevertheless, we provide our proof of Fact 4 because the structure of our analysis, namely the reliance on (coco-ineq), is crucial for the application of our mirror duality result and the construction of our novel method dual-AMD in Section 3.

Lower bound.

The classical $\Omega\left(\frac{L\|x_{0}-x\|^{2}}{N^{2}}\right)$ lower bound in the Euclidean setup by nemirovsky1992information; nesterov2018 applies and establishes optimality of the rate of Fact 4 under all choices of $(f,\phi)$ that satisfy (A1).

3 Mirror Duality and dual-AMD

In Section 2.1, we simply interchanged the roles of $f$ and $\phi^{*}=\psi^{*}$ in MD, which has a rate on $f(x_{N})$, to obtain dual-MD, which has a rate on $\psi^{*}(\nabla f(q_{N}))$. A natural question that follows is: Can we “dualize” AMD, which has an accelerated rate on $f(x_{N})$, to obtain a dual method with an accelerated rate on $\psi^{*}(\nabla f(\cdot))$? Merely interchanging the roles of $f$ and $\phi^{*}=\psi^{*}$, however, does not work, as it produces a method finding an $r_{N}\in X^{*}$ reducing $\psi^{*}(r_{N})$, but it does not find a $q_{N}\in X$ that reduces $\psi^{*}(\nabla f(q_{N}))$.

In this section, we obtain a dual method reducing $\psi^{*}(\nabla f(\cdot))$ at an accelerated $\mathcal{O}(1/N^{2})$ rate. The key insight is to interchange the roles of $f$ and $\phi^{*}=\psi^{*}$ and also perform an operation analogous to the anti-diagonal transpose of Section 1.2. We call the resulting operation the mirror dual, and we present a duality correspondence as Theorem 3.1.

3.1 Coupled first-order method (CFOM) and its mirror dual

Let $X$ be a reflexive Banach space and $f\colon X\rightarrow\mathbb{R}$ and $\phi^{*}\colon X^{*}\rightarrow\mathbb{R}$ be differentiable. An $N$-step fixed-step coupled first-order method (CFOM) with step sizes $\{a_{k,i}\}_{0\leq i<k\leq N}$ and $\{b_{k,i}\}_{0\leq i\leq k\leq N}$ is:

y_{k+1}=y_{k}-\sum_{i=0}^{k}a_{k+1,i}\nabla f(x_{i}),\quad x_{k+1}=x_{k}-\sum_{i=0}^{k+1}b_{k+1,i}\nabla\phi^{*}(y_{i})  (1)

for $k=0,\dots,N-1$, where $y_{0}\in X^{*}$ is a starting point and $x_{0}=\nabla\phi^{*}(y_{0})$. Note that the upper limits of the two summations are $k$ and $k+1$, which means $y_{k+1}$ is computed before $x_{k+1}$. Although $b_{0,0}$ is not used in (1), define $b_{0,0}=-1$ for the sake of notational simplicity in later arguments. The class of CFOMs is very general and, in particular, includes MD and AMD.

Let $\psi^{*}\colon X^{*}\rightarrow\mathbb{R}$ be differentiable. Throughout this section, we assume $\psi^{*}(0)=0$ and $0\in X^{*}$ is the unique minimizer of $\psi^{*}$. For a given CFOM (1), we define its mirror dual as

q_{k+1}=q_{k}-\sum_{i=0}^{k}a_{N-i,N-1-k}\nabla\psi^{*}(r_{i}),\quad r_{k+1}=r_{k}-\sum_{i=0}^{k+1}b_{N-i,N-1-k}\nabla f(q_{i})  (2)

for $k=0,1,\dots,N-1$, where $q_{0}\in X$ is a starting point and $r_{0}=-b_{N,N}\nabla f(q_{0})$. The mirror dual uses the coefficients $\{a_{k,i}\}_{0\leq i<k\leq N}$ and $\{b_{k,i}\}_{0\leq i\leq k\leq N}$ in an anti-diagonal transposed manner, as in the H-dual, and furthermore has the roles of $f$ and $\phi^{*}$ swapped when $\psi=\phi$. In Section 5, we more precisely discuss how the mirror dual and mirror duality generalize the prior notions of the H-dual and H-duality.
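To make the index bookkeeping of (2) explicit, here is a small sketch (ours) that forms the mirror-dual coefficients from $(a,b)$ stored as dense $(N+1)\times(N+1)$ arrays; the helper name `mirror_dual_coeffs` is ours, and the example coefficients are random placeholders.

```python
import numpy as np

def mirror_dual_coeffs(a, b):
    """Given CFOM coefficients a[k][i] (0 <= i < k <= N) and b[k][i] (0 <= i <= k <= N),
    return the coefficients appearing in the mirror dual (2):
      a_dual[k+1][i] = a[N-i][N-1-k],  b_dual[k+1][i] = b[N-i][N-1-k]."""
    N = a.shape[0] - 1
    a_dual = np.zeros_like(a)
    b_dual = np.zeros_like(b)
    for k in range(N):
        for i in range(k + 1):      # q_{k+1} uses psi^*-gradients at r_0, ..., r_k
            a_dual[k + 1, i] = a[N - i, N - 1 - k]
        for i in range(k + 2):      # r_{k+1} uses f-gradients at q_0, ..., q_{k+1}
            b_dual[k + 1, i] = b[N - i, N - 1 - k]
    return a_dual, b_dual

# Usage: applying the construction twice recovers the original a-coefficients.
N = 3
a = np.tril(np.random.rand(N + 1, N + 1), k=-1)  # a[k][i] nonzero only for i < k
b = np.tril(np.random.rand(N + 1, N + 1))        # b[k][i] nonzero only for i <= k
b[0, 0] = -1.0                                   # convention b_{0,0} = -1
a_dd, _ = mirror_dual_coeffs(*mirror_dual_coeffs(a, b))
assert np.allclose(a_dd, a)
```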

Proposition 1

For a CFOM (1), assume $1-\sum_{i=0}^{k}\sum_{j=0}^{i+1}b_{i+1,j}=1$ for $k=0,\dots,N-1$, which is a sufficient condition for $x_{k}\in\mathrm{conv}\{\nabla\phi^{*}(y_{0}),\dots,\nabla\phi^{*}(y_{k})\}$ for $k=0,\dots,N$. Then, the mirror dual (2) satisfies $r_{N}=\nabla f(q_{N})$.

Under the condition of Proposition 1, a convergence guarantee on $\psi^{*}(r_{N})$ can be translated to a bound on $\psi^{*}(r_{N})=\psi^{*}(\nabla f(q_{N}))$. As we showed in the proof of Fact 4, the CFOM AMD satisfies the property $x_{k}\in\mathrm{conv}\{\nabla\phi^{*}(y_{0}),\dots,\nabla\phi^{*}(y_{k})\}$ for $k=0,\dots,N$. Therefore, Proposition 1 applies to the mirror dual of AMD that we present in Section 3.3.

Proof

Observe first that

x_{k+1}=x_{0}-\sum_{i=0}^{k}(x_{i}-x_{i+1})=\nabla\phi^{*}(y_{0})-\sum_{i=0}^{k}\sum_{j=0}^{i+1}b_{i+1,j}\nabla\phi^{*}(y_{j})

for $k=0,\dots,N-1$. Therefore, if $1-\sum_{i=0}^{k}\sum_{j=0}^{i+1}b_{i+1,j}=1$ for $k=0,\dots,N-1$, then $x_{k+1}\in\mathrm{conv}\{\nabla\phi^{*}(y_{0}),\dots,\nabla\phi^{*}(y_{k+1})\}$. Now we prove that the mirror dual satisfies $r_{N}=\nabla f(q_{N})$. Note that

\sum_{i=0}^{k+1}b_{k+1,i}=0,\quad k=0,\dots,N-1.  (3)

Thus, we get that the following holds for the mirror dual:

r_{N} = r_{0}-\sum_{i=0}^{N-1}\sum_{j=0}^{i+1}b_{N-j,N-1-i}\nabla f(q_{j})
= -b_{N,N}\nabla f(q_{0})-\sum_{i=0}^{N-1}b_{N,N-1-i}\nabla f(q_{0})-\sum_{j=1}^{N}\left(\sum_{i=j-1}^{N-1}b_{N-j,N-1-i}\right)\nabla f(q_{j})
= -b_{N,N}\nabla f(q_{0})-\sum_{i=0}^{N-1}b_{N,N-1-i}\nabla f(q_{0})-\sum_{j=1}^{N}\left(\sum_{l=0}^{N-j}b_{N-j,l}\right)\nabla f(q_{j})
= -b_{N,N}\nabla f(q_{0})-\sum_{i=0}^{N-1}b_{N,N-1-i}\nabla f(q_{0})-b_{0,0}\nabla f(q_{N})\qquad(\because\text{(3)})
= -\sum_{i=0}^{N}b_{N,N-i}\nabla f(q_{0})+\nabla f(q_{N})\qquad(\because b_{0,0}=-1)
= \nabla f(q_{N}).\qquad(\because\text{(3)})
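As a numerical sanity check (ours, not part of the paper's proof), the conclusion $r_{N}=\nabla f(q_{N})$ can be verified by running the mirror dual (2) with randomly generated coefficients whose rows satisfy (3); the quadratic $f$ and quadratic $\psi^{*}$ below are illustrative placeholders.

```python
import numpy as np

N, n = 5, 3
rng = np.random.default_rng(0)

# Random CFOM b-coefficients with rows 1,...,N summing to zero, so that (3) holds,
# plus the convention b_{0,0} = -1; the a-coefficients are arbitrary.
a = np.tril(rng.standard_normal((N + 1, N + 1)), k=-1)
b = np.tril(rng.standard_normal((N + 1, N + 1)))
for k in range(1, N + 1):
    b[k, k] = -np.sum(b[k, :k])      # enforce sum_{i=0}^{k} b_{k,i} = 0
b[0, 0] = -1.0

# Mirror dual (2) with f(x) = (1/2)||x||^2 and psi^* = (1/2)||.||^2,
# so both gradients are the identity map.
grad_f = lambda x: x
grad_psi_star = lambda r: r

q = [rng.standard_normal(n)]
r = [-b[N, N] * grad_f(q[0])]
for k in range(N):
    q.append(q[k] - sum(a[N - i, N - 1 - k] * grad_psi_star(r[i]) for i in range(k + 1)))
    r.append(r[k] - sum(b[N - i, N - 1 - k] * grad_f(q[i]) for i in range(k + 2)))

assert np.allclose(r[N], grad_f(q[N]))   # conclusion of Proposition 1
```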

3.2 Mirror Duality

For an $N$-step CFOM (1), which we denote as $\mathcal{A}$, consider establishing a convergence guarantee on $f(x_{N})-f(x)$ for $x\in\operatorname{dom}\phi$ using the following proof strategy. Assume (A1) for $(f,\phi)$, i.e., assume $f$ is $L$-smooth and $\phi$ is $\sigma$-strongly convex. For a positive nondecreasing sequence $\{u_{i}\}_{i=0}^{N}\subset\mathbb{R}$, define $\{\mathcal{U}_{k}\}_{k=0}^{N}$ as

\mathcal{U}_{0}=\phi(x)+\phi^{*}(y_{0})-\left\langle y_{0},x\right\rangle-u_{0}D_{f}(x,x_{0})

and

\mathcal{U}_{k+1} = \mathcal{U}_{k}-(u_{k+1}-u_{k})D_{f}\left(x,x_{k+1}\right)
-u_{k}\Big(\underbrace{D_{f}\left(x_{k},x_{k+1}\right)-\frac{1}{2L}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|_{*}^{2}}_{\geq 0\text{ by (coco-ineq)}}\Big)
-\Big(\underbrace{D_{\phi^{*}}\left(y_{k},y_{k+1}\right)-\frac{\sigma}{2}\left\|\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\|^{2}}_{\geq 0\text{ by (coco-ineq)}}\Big)

for $k=0,\dots,N-1$. The sequence $\{\mathcal{U}_{k}\}_{k=0}^{N}$ is nonincreasing by construction. Define $\mathbf{U}_{\mathcal{A}}$ via

\mathcal{U}_{N} = u_{N}\left(f(x_{N})-f(x)\right)+\underbrace{\phi(x)+\phi^{*}(y_{N})-\left\langle y_{N},x\right\rangle}_{\geq 0\text{ by the Fenchel inequality}}
+\left\langle y_{N}-y_{0}+\sum_{i=0}^{N}(u_{i}-u_{i-1})\nabla f(x_{i}),x\right\rangle+\mathbf{U}_{\mathcal{A}}.

If $y_{N}-y_{0}+\sum_{i=0}^{N}(u_{i}-u_{i-1})\nabla f(x_{i})=0$ and $\mathbf{U}_{\mathcal{A}}\geq 0$, then we conclude

u_{N}(f(x_{N})-f(x))\leq\mathcal{U}_{N}\leq\dots\leq\mathcal{U}_{0}\leq\phi(x)+\phi^{*}(y_{0})-\left\langle y_{0},x\right\rangle.

By a telescoping sum argument, the $f$- and $\phi^{*}$-function values cancel out, and $\mathbf{U}_{\mathcal{A}}$ is a function of only $\{\nabla f(x_{i})\}_{i=0}^{N}\subset X^{*}$ and $\{\nabla\phi^{*}(y_{i})\}_{i=0}^{N}\subset X$. (We state the exact form of $\mathbf{U}_{\mathcal{A}}$ later in (6).) If we show $\inf\mathbf{U}_{\mathcal{A}}\geq 0$, where the infimum is taken over all possible values of $\{\nabla f(x_{i})\}_{i=0}^{N}$ and $\{\nabla\phi^{*}(y_{i})\}_{i=0}^{N}$, then we conclude $\mathbf{U}_{\mathcal{A}}\geq 0$ and $u_{N}(f(x_{N})-f(x))\leq\phi(x)+\phi^{*}(y_{0})-\left\langle y_{0},x\right\rangle$. Recall that this is the exact strategy we employed in our proof of Fact 4. With a slight abuse of notation, define $\mathbf{U}_{\mathcal{A}}\colon(X^{*})^{N+1}\times(X)^{N+1}\rightarrow\mathbb{R}$ so that

\mathbf{U}_{\mathcal{A}}=\mathbf{U}_{\mathcal{A}}(\nabla f(x_{0}),\dots,\nabla f(x_{N}),\nabla\phi^{*}(y_{0}),\dots,\nabla\phi^{*}(y_{N})).  (4)

On the other hand, consider establishing a convergence rate of ψ(rN)\psi^{*}(r_{N}) for an NN-step dual CFOM (2), which we denote as \mathcal{B}. Assume (A1) for (f,ψ)(f,\psi), i.e., ff is LL-smooth and ψ\psi is σ\sigma-strongly convex. For a positive increasing sequence {vi}i=0N\{v_{i}\}_{i=0}^{N}\subset\mathbb{R}, define {𝒱k}k=0N\{\mathcal{V}_{k}\}_{k=0}^{N} as

𝒱0=v0(f(q0)f(qN))\mathcal{V}_{0}=v_{0}\left(f(q_{0})-f(q_{N})\right)

and

𝒱k+1=\displaystyle\mathcal{V}_{k+1}= 𝒱k(vk+1vk)Df(qN,qk)\displaystyle\;\mathcal{V}_{k}-(v_{k+1}-v_{k})D_{f}\left(q_{N},q_{k}\right)
vk+1(Df(qk,qk+1)12Lf(qk)f(qk+1)20 by (coco-ineq))\displaystyle\quad-v_{k+1}\Big{(}\underbrace{D_{f}(q_{k},q_{k+1})-\frac{1}{2L}\left\|\nabla f(q_{k})-\nabla f(q_{k+1})\right\|^{2}_{*}}_{\geq 0\text{ by \eqref{eqn::coco_with_bregman}}}\Big{)}
(Dψ(rk,rk+1)σ2ψ(rk+1)ψ(rk)20 by (coco-ineq))\displaystyle\quad-\Big{(}\underbrace{D_{\psi^{*}}\left(r_{k},r_{k+1}\right)-\frac{\sigma}{2}\left\|\nabla\psi^{*}(r_{k+1})-\nabla\psi^{*}(r_{k})\right\|^{2}}_{\geq 0\text{ by \eqref{eqn::coco_with_bregman}}}\Big{)}

for k=0,1,,N1k=0,1,\dots,N-1. The sequence {𝒱k}k=0N\{\mathcal{V}_{k}\}_{k=0}^{N} is nonincreasing by construction. Define 𝐕\mathbf{V}_{\mathcal{B}} via

𝒱N=ψ(rN)+Dψ(0,r0)0+𝐕.\displaystyle\mathcal{V}_{N}=\psi^{*}(r_{N})+\underbrace{D_{\psi^{*}}\left(0,r_{0}\right)}_{\geq 0}+\mathbf{V}_{\mathcal{B}}.

Here, we used ψ(0)=0\psi^{*}(0)=0. If 𝐕0\mathbf{V}_{\mathcal{B}}\geq 0, then we conclude

ψ(rN)𝒱N𝒱0=v0(f(q0)f(qN)).\displaystyle\psi^{*}(r_{N})\leq\mathcal{V}_{N}\leq\dots\leq\mathcal{V}_{0}=v_{0}(f(q_{0})-f(q_{N})).

Again, by a telescoping sum argument, the f- and \psi^{*}-function values cancel out and \mathbf{V}_{\mathcal{B}} is a function of only \{\nabla f(q_{i})\}_{i=0}^{N}\subset X^{*} and \{\nabla\psi^{*}(r_{i})\}_{i=0}^{N}\subset X. (We state the exact form of \mathbf{V}_{\mathcal{B}} later in (7).) If we show \inf\mathbf{V}_{\mathcal{B}}\geq 0, where the infimum is taken over all possible values of \{\nabla f(q_{i})\}_{i=0}^{N} and \{\nabla\psi^{*}(r_{i})\}_{i=0}^{N}, then we conclude \mathbf{V}_{\mathcal{B}}\geq 0 and

ψ(rN)v0(f(q0)f(qN))v0(f(q0)infxXf(x)).\displaystyle\psi^{*}(r_{N})\leq v_{0}(f(q_{0})-f(q_{N}))\leq v_{0}(f(q_{0})-\inf_{x\in X}f(x)).

With a slight abuse of notation, define 𝐕:(X)N+1×(X)N+1\mathbf{V}_{\mathcal{B}}\colon(X^{*})^{N+1}\times(X)^{N+1}\rightarrow\mathbb{R} so that

\mathbf{V}_{\mathcal{B}}=\mathbf{V}_{\mathcal{B}}(\nabla f(q_{0}),\dots,\nabla f(q_{N}),\nabla\psi^{*}(r_{0}),\dots,\nabla\psi^{*}(r_{N})). (5)

The following result establishes a one-to-one correspondence between these two proof strategies.

Theorem 3.1 (Mirror Duality)

Let X be a reflexive Banach space and f\colon X\rightarrow\mathbb{R} and \phi\colon X\rightarrow\mathbb{R} be CCP. Assume (A1). Let \mathcal{A} be a CFOM (1) and \mathcal{B} be its mirror dual CFOM (2). Let \{u_{i}\}_{i=0}^{N}\subset\mathbb{R} be a positive nondecreasing sequence and v_{i}=\frac{1}{u_{N-i}} for i=0,\dots,N. Let

𝐔𝒜:(X)N+1×(X)N+1\displaystyle\mathbf{U}_{\mathcal{A}}\colon(X^{*})^{N+1}\times(X)^{N+1}\rightarrow\mathbb{R}
𝐕:(X)N+1×(X)N+1\displaystyle\mathbf{V}_{\mathcal{B}}\colon(X^{*})^{N+1}\times(X)^{N+1}\rightarrow\mathbb{R}

respectively be defined as in (4) and (5). Then,

\inf\{\mathbf{U}_{\mathcal{A}}(x_{0}^{*},\dots,x_{N}^{*},x_{0},\dots,x_{N})\,|\,x_{0}^{*},\dots,x_{N}^{*}\in X^{*},\,x_{0},\dots,x_{N}\in X\}
=\inf\{\mathbf{V}_{\mathcal{B}}(x_{0}^{*},\dots,x_{N}^{*},x_{0},\dots,x_{N})\,|\,x_{0}^{*},\dots,x_{N}^{*}\in X^{*},\,x_{0},\dots,x_{N}\in X\}.

Proposition 1 and Theorem 3.1 together provide a sufficient condition that ensures a CFOM 𝒜\mathcal{A} with a convergence guarantee on f(xN)f(x)f(x_{N})-f(x) can be “mirror-dualized” to obtain a mirror dual CFOM \mathcal{B} with a convergence guarantee on ψ(rN)=ψ(f(qN))\psi^{*}(r_{N})=\psi^{*}(\nabla f(q_{N})), and vice versa.

Proof

First, we calculate 𝐔𝒜\mathbf{U}_{\mathcal{A}}. Denote f(xN+1)=0\nabla f(x_{N+1})=0 for notational convenience.

𝐔𝒜=k=0N(ukuk1)f(xk),xk+k=0N1uk(f(xk+1),xkxk+1+12Lf(xk)f(xk+1)2)+k=0N1(ykyk+1,ϕ(yk+1)+σ2ϕ(yk+1)ϕ(yk)2)=k=0N1uk2Lf(xk)f(xk+1)2+k=0N1σ2ϕ(yk)ϕ(yk+1)2+k=0N1ykyk+1,ϕ(yk+1)k=0Nukf(xk)f(xk+1),xk.\displaystyle\begin{split}\mathbf{U}_{\mathcal{A}}=&\sum_{k=0}^{N}(u_{k}-u_{k-1})\left\langle\nabla f(x_{k}),-x_{k}\right\rangle\\ &+\sum_{k=0}^{N-1}u_{k}\left(\left\langle\nabla f(x_{k+1}),x_{k}-x_{k+1}\right\rangle+\frac{1}{2L}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|^{2}_{*}\right)\\ &\quad+\sum_{k=0}^{N-1}\left(\left\langle y_{k}-y_{k+1},\nabla\phi^{*}(y_{k+1})\right\rangle+\frac{\sigma}{2}\left\|\nabla\phi^{*}(y_{k+1})-\nabla\phi^{*}(y_{k})\right\|^{2}\right)\\ =&\sum_{k=0}^{N-1}\frac{u_{k}}{2L}\left\|\nabla f(x_{k})-\nabla f(x_{k+1})\right\|^{2}_{*}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|\nabla\phi^{*}(y_{k})-\nabla\phi^{*}(y_{k+1})\right\|^{2}\\ &\quad+\sum_{k=0}^{N-1}\left\langle y_{k}-y_{k+1},\nabla\phi^{*}(y_{k+1})\right\rangle-\sum_{k=0}^{N}u_{k}\left\langle\nabla f(x_{k})-\nabla f(x_{k+1}),x_{k}\right\rangle.\end{split} (6)

By the definition of CFOM (1), y_{k-1}-y_{k}\in\mathrm{span}\{\nabla f(x_{0}),\dots,\nabla f(x_{k-1})\} and x_{k}\in x_{0}+\mathrm{span}\{\nabla\phi^{*}(y_{0}),\dots,\nabla\phi^{*}(y_{k})\}. Therefore, \mathbf{U}_{\mathcal{A}}\colon\left(X^{*}\right)^{N+1}\times X^{N+1}\rightarrow\mathbb{R} is determined by the coefficients \{a_{k,i}\}_{0\leq i<k\leq N} and \{b_{k,i}\}_{0\leq i\leq k\leq N}. Next, we calculate \mathbf{V}_{\mathcal{B}}.

𝐕=𝒱Nψ(rN)Dψ(0,r0)ψ(rN)+ψ(0)=k=0N1(vk+1vk)f(qk),qNqk+k=0N1vk+1(f(qk+1),qkqk+1+12Lf(qk)f(qk+1)2)+k=0N1(rkrk+1,ψ(rk+1)+σ2ψ(rk)ψ(rk+1)2)+ψ(r0),r0=k=0N1vk+12Lf(qk)f(qk+1)2+k=0N1σ2ψ(rk)ψ(rk+1)2+k=0Nrk1rk,ψ(rk)+k=0N1vk+1f(qk+1)(vk+1vk)f(qk)(v1v0)f(q0),qkqk+1\displaystyle\begin{split}\mathbf{V}_{\mathcal{B}}=&\;\mathcal{V}_{N}-\psi^{*}(r_{N})-D_{\psi^{*}}(0,r_{0})-\psi^{*}(r_{N})+\psi^{*}(0)\\ =&\sum_{k=0}^{N-1}(v_{k+1}-v_{k})\left\langle\nabla f(q_{k}),q_{N}-q_{k}\right\rangle\\ &+\sum_{k=0}^{N-1}v_{k+1}\left(\left\langle\nabla f(q_{k+1}),q_{k}-q_{k+1}\right\rangle+\frac{1}{2L}\left\|\nabla f(q_{k})-\nabla f(q_{k+1})\right\|^{2}_{*}\right)\\ &+\sum_{k=0}^{N-1}\left(\left\langle r_{k}-r_{k+1},\nabla\psi^{*}(r_{k+1})\right\rangle+\frac{\sigma}{2}\left\|\nabla\psi^{*}(r_{k})-\nabla\psi^{*}(r_{k+1})\right\|^{2}\right)+\left\langle\nabla\psi^{*}(r_{0}),-r_{0}\right\rangle\\ =&\sum_{k=0}^{N-1}\frac{v_{k+1}}{2L}\left\|\nabla f(q_{k})-\nabla f(q_{k+1})\right\|^{2}_{*}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|\nabla\psi^{*}(r_{k})-\nabla\psi^{*}(r_{k+1})\right\|^{2}\\ &+\sum_{k=0}^{N}\left\langle r_{k-1}-r_{k},\nabla\psi^{*}(r_{k})\right\rangle\\ &+\sum_{k=0}^{N-1}\left\langle v_{k+1}\nabla f(q_{k+1})-(v_{k+1}-v_{k})\nabla f(q_{k})-\dots-(v_{1}-v_{0})\nabla f(q_{0}),q_{k}-q_{k+1}\right\rangle\end{split} (7)

where we have used r_{-1}=0. By the definition of the mirror dual, r_{k-1}-r_{k}\in\mathrm{span}\{\nabla f(q_{0}),\dots,\nabla f(q_{k})\} and q_{k}-q_{k+1}\in\mathrm{span}\{\nabla\psi^{*}(r_{0}),\dots,\nabla\psi^{*}(r_{k})\}. Therefore, \mathbf{V}_{\mathcal{B}}\colon\left(X^{*}\right)^{N+1}\times X^{N+1}\rightarrow\mathbb{R} is determined by the same coefficients \{a_{k,i}\}_{0\leq i<k\leq N} and \{b_{k,i}\}_{0\leq i\leq k\leq N}. To show \inf\mathbf{U}_{\mathcal{A}}=\inf\mathbf{V}_{\mathcal{B}} under v_{i}=\frac{1}{u_{N-i}} for i=0,\dots,N, we show the following statement: \forall\,\left(\{A_{i}\}_{i=0}^{N},\{B_{i}\}_{i=0}^{N}\right)\in(X^{*})^{N+1}\times\left(X\right)^{N+1},

𝐔𝒜({Ai}i=0N,{Bi}i=0N)=𝐕({Ci}i=0N,{Di}i=0N),\displaystyle\mathbf{U}_{\mathcal{A}}\left(\{A_{i}\}_{i=0}^{N},\{B_{i}\}_{i=0}^{N}\right)=\mathbf{V}_{\mathcal{B}}\left(\{C_{i}\}_{i=0}^{N},\{D_{i}\}_{i=0}^{N}\right),

where {Ci}i=0N\{C_{i}\}_{i=0}^{N} and {Di}i=0N\{D_{i}\}_{i=0}^{N} are defined as

C0:=uNAN,CNiCNi1:=ui(AiAi+1),i=0,1,,N1Di:=BNi,i=0,1,,N.\displaystyle\begin{split}C_{0}\colon=u_{N}A_{N},\quad&C_{N-i}-C_{N-i-1}\colon=u_{i}\left(A_{i}-A_{i+1}\right),\quad i=0,1,\dots,N-1\\ &D_{i}\colon=B_{N-i},\quad i=0,1,\dots,N.\end{split} (8)

Note that the above transformation is a bijection since the u_{i} are positive; thus, if we can show \mathbf{U}_{\mathcal{A}}\left(\{A_{i}\}_{i=0}^{N},\{B_{i}\}_{i=0}^{N}\right)=\mathbf{V}_{\mathcal{B}}\left(\{C_{i}\}_{i=0}^{N},\{D_{i}\}_{i=0}^{N}\right), then \inf\mathbf{U}_{\mathcal{A}}=\inf\mathbf{V}_{\mathcal{B}} follows. First, we write \mathbf{U}_{\mathcal{A}} and \mathbf{V}_{\mathcal{B}} in terms of \left(\{A_{i}\}_{i=0}^{N},\{B_{i}\}_{i=0}^{N},\{C_{i}\}_{i=0}^{N},\{D_{i}\}_{i=0}^{N}\right).

𝐔𝒜=\displaystyle\mathbf{U}_{\mathcal{A}}= k=0N1uk2LAkAk+12+k=0N1σ2BkBk+12\displaystyle\sum_{k=0}^{N-1}\frac{u_{k}}{2L}\left\|A_{k}-A_{k+1}\right\|^{2}_{*}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|B_{k}-B_{k+1}\right\|^{2}
+k=0N1i=0kak+1,iAi,Bk+1k=0NukAkAk+1,xk\displaystyle\quad+\sum_{k=0}^{N-1}\left\langle\sum_{i=0}^{k}a_{k+1,i}A_{i},B_{k+1}\right\rangle-\sum_{k=0}^{N}u_{k}\left\langle A_{k}-A_{k+1},x_{k}\right\rangle

where x_{k+1}\colon=x_{k}-\sum_{i=0}^{k+1}b_{k+1,i}B_{i} and x_{0}=B_{0}. For simplicity, define A_{N+1}=C_{-1}=0. Using (8) gives us

k=0N1i=0kak+1,iAi,Bk+1=k=0N1i=0kak+1,iAi,DNk1\displaystyle\sum_{k=0}^{N-1}\left\langle\sum_{i=0}^{k}a_{k+1,i}A_{i},B_{k+1}\right\rangle=\sum_{k=0}^{N-1}\sum_{i=0}^{k}a_{k+1,i}\left\langle A_{i},D_{N-k-1}\right\rangle

and

k=0NukAkAk+1,xk=\displaystyle-\sum_{k=0}^{N}u_{k}\left\langle A_{k}-A_{k+1},x_{k}\right\rangle= k=0NCNkCNk1,xk\displaystyle-\sum_{k=0}^{N}\left\langle C_{N-k}-C_{N-k-1},x_{k}\right\rangle
=\displaystyle= CN,x0+CN1,x0x1++C0,xN1xN\displaystyle\left\langle C_{N},-x_{0}\right\rangle+\left\langle C_{N-1},x_{0}-x_{1}\right\rangle+\dots+\left\langle C_{0},x_{N-1}-x_{N}\right\rangle
=\displaystyle= k=1N1CNk1,i=0k+1bk+1,iBi\displaystyle\sum_{k=-1}^{N-1}\left\langle C_{N-k-1},\sum_{i=0}^{k+1}b_{k+1,i}B_{i}\right\rangle
=\displaystyle= k=0Ni=0kbk,iCNk,Bi.\displaystyle\sum_{k=0}^{N}\sum_{i=0}^{k}b_{k,i}\left\langle C_{N-k},B_{i}\right\rangle.

Combining the above equations, we conclude

𝐔𝒜=k=0N1uk2LAkAk+12+k=0N1σ2BkBk+12+k=0N1i=0kak+1,iAi,DNk1+k=0Ni=0kbk,iCNk,Bi.\displaystyle\begin{split}\mathbf{U}_{\mathcal{A}}=&\sum_{k=0}^{N-1}\frac{u_{k}}{2L}\left\|A_{k}-A_{k+1}\right\|^{2}_{*}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|B_{k}-B_{k+1}\right\|^{2}\\ &\quad+\sum_{k=0}^{N-1}\sum_{i=0}^{k}a_{k+1,i}\left\langle A_{i},D_{N-k-1}\right\rangle+\sum_{k=0}^{N}\sum_{i=0}^{k}b_{k,i}\left\langle C_{N-k},B_{i}\right\rangle.\end{split} (9)

In the same way, 𝐕\mathbf{V}_{\mathcal{B}} is written as follows.

𝐕=\displaystyle\mathbf{V}_{\mathcal{B}}= k=0N1vk+12LCkCk+12+k=0N1σ2DkDk+12\displaystyle\sum_{k=0}^{N-1}\frac{v_{k+1}}{2L}\left\|C_{k}-C_{k+1}\right\|^{2}_{*}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|D_{k}-D_{k+1}\right\|^{2}
+k=0Ni=0kbNi,NkCi,Dk\displaystyle\quad+\sum_{k=0}^{N}\left\langle\sum_{i=0}^{k}b_{N-i,N-k}C_{i},D_{k}\right\rangle
+k=0N1vk+1Ck+1(vk+1vk)Ck(v1v0)C0,i=0kaNi,N1kDi\displaystyle\quad+\sum_{k=0}^{N-1}\left\langle v_{k+1}C_{k+1}-(v_{k+1}-v_{k})C_{k}-\dots-(v_{1}-v_{0})C_{0},\sum_{i=0}^{k}a_{N-i,N-1-k}D_{i}\right\rangle

By (8),

k=0Ni=0kbNi,NkCi,Dk=k=0Ni=0kbNi,NkCi,Dk=k=0Ni=0kbNi,NkCi,BNk,\displaystyle\sum_{k=0}^{N}\left\langle\sum_{i=0}^{k}b_{N-i,N-k}C_{i},D_{k}\right\rangle=\sum_{k=0}^{N}\sum_{i=0}^{k}b_{N-i,N-k}\left\langle C_{i},D_{k}\right\rangle=\sum_{k=0}^{N}\sum_{i=0}^{k}b_{N-i,N-k}\left\langle C_{i},B_{N-k}\right\rangle,
k=0N1vk+1Ck+1(vk+1vk)Ck(v1v0)C0,i=0kaNi,N1kDi\displaystyle\sum_{k=0}^{N-1}\left\langle v_{k+1}C_{k+1}-(v_{k+1}-v_{k})C_{k}-\dots-(v_{1}-v_{0})C_{0},\sum_{i=0}^{k}a_{N-i,N-1-k}D_{i}\right\rangle
=()\displaystyle\stackrel{{\scriptstyle(\star)}}{{=}} k=0N11uNk1(Ck+1Ck)+1uNk(CkCk1)++1uNC0,i=0kaNi,N1kDi\displaystyle\sum_{k=0}^{N-1}\left\langle\frac{1}{u_{N-k-1}}\left(C_{k+1}-C_{k}\right)+\frac{1}{u_{N-k}}\left(C_{k}-C_{k-1}\right)+\dots+\frac{1}{u_{N}}C_{0},\sum_{i=0}^{k}a_{N-i,N-1-k}D_{i}\right\rangle
=\displaystyle= k=0N1(ANk1ANk)+(ANkANk+1)++AN,i=0kaNi,N1kDi\displaystyle\sum_{k=0}^{N-1}\left\langle(A_{N-k-1}-A_{N-k})+(A_{N-k}-A_{N-k+1})+\dots+A_{N},\sum_{i=0}^{k}a_{N-i,N-1-k}D_{i}\right\rangle
=\displaystyle= k=0N1ANk1,i=0kaNi,N1kDi\displaystyle\sum_{k=0}^{N-1}\left\langle A_{N-k-1},\sum_{i=0}^{k}a_{N-i,N-1-k}D_{i}\right\rangle
=\displaystyle= k=0N1i=0kaNi,Nk1ANk1,Di.\displaystyle\sum_{k=0}^{N-1}\sum_{i=0}^{k}a_{N-i,N-k-1}\left\langle A_{N-k-1},D_{i}\right\rangle.

vi=1uNiv_{i}=\frac{1}{u_{N-i}} is used at ()(\star).

\displaystyle\begin{split}\mathbf{V}_{\mathcal{B}}=&\sum_{k=0}^{N-1}\frac{v_{k+1}}{2L}\left\|C_{k}-C_{k+1}\right\|^{2}_{*}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|D_{k}-D_{k+1}\right\|^{2}\\ &\quad+\sum_{k=0}^{N}\sum_{i=0}^{k}b_{N-i,N-k}\left\langle C_{i},B_{N-k}\right\rangle+\sum_{k=0}^{N-1}\sum_{i=0}^{k}a_{N-i,N-k-1}\left\langle A_{N-k-1},D_{i}\right\rangle\\ \stackrel{(\circ)}{=}&\sum_{k=0}^{N-1}\frac{u_{k}}{2L}\left\|A_{k}-A_{k+1}\right\|_{*}^{2}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|B_{k}-B_{k+1}\right\|^{2}\\ &\quad+\sum_{k=0}^{N}\sum_{i=0}^{k}b_{N-i,N-k}\left\langle C_{i},B_{N-k}\right\rangle+\sum_{k=0}^{N-1}\sum_{i=0}^{k}a_{N-i,N-k-1}\left\langle A_{N-k-1},D_{i}\right\rangle\\ \stackrel{(\bullet)}{=}&\sum_{k=0}^{N-1}\frac{u_{k}}{2L}\left\|A_{k}-A_{k+1}\right\|_{*}^{2}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|B_{k}-B_{k+1}\right\|^{2}\\ &\quad+\sum_{l=0}^{N}\sum_{j=0}^{l}b_{l,j}\left\langle C_{N-l},B_{j}\right\rangle+\sum_{l=0}^{N-1}\sum_{j=0}^{l}a_{l+1,j}\left\langle A_{j},D_{N-l-1}\right\rangle\\ \stackrel{(\bullet)}{=}&\sum_{k=0}^{N-1}\frac{u_{k}}{2L}\left\|A_{k}-A_{k+1}\right\|_{*}^{2}+\sum_{k=0}^{N-1}\frac{\sigma}{2}\left\|B_{k}-B_{k+1}\right\|^{2}\\ &\quad+\sum_{k=0}^{N}\sum_{i=0}^{k}b_{k,i}\left\langle C_{N-k},B_{i}\right\rangle+\sum_{k=0}^{N-1}\sum_{i=0}^{k}a_{k+1,i}\left\langle A_{i},D_{N-k-1}\right\rangle\end{split} (10)

Equation (8), together with the relation v_{i}=\frac{1}{u_{N-i}}, is used at (\circ), and the equalities marked (\bullet) are changes of the summation indices. The final expression in (10) is identical to (9), which completes the proof. ∎
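To make the index bookkeeping concrete, the following short Python sketch (an illustration of ours, set in the Euclidean case so that \|\cdot\| and \|\cdot\|_{*} are both the \ell_{2} norm) numerically confirms that the expression (9) for \mathbf{U}_{\mathcal{A}} agrees with the first expression for \mathbf{V}_{\mathcal{B}} in (10) once \{C_{i}\},\{D_{i}\} are built from \{A_{i}\},\{B_{i}\} via (8) and v_{i}=1/u_{N-i}; the coefficients a_{k,i}, b_{k,i} and the vectors are random.

import numpy as np

rng = np.random.default_rng(0)
N, n, L, sigma = 4, 3, 2.0, 0.7

u = np.sort(rng.uniform(0.5, 2.0, N + 1))   # positive nondecreasing u_0, ..., u_N
v = 1.0 / u[::-1]                           # v_i = 1 / u_{N-i}
A = rng.standard_normal((N + 1, n))         # stand-ins for nabla f(x_0), ..., nabla f(x_N)
B = rng.standard_normal((N + 1, n))         # stand-ins for nabla phi^*(y_0), ..., nabla phi^*(y_N)
a = rng.standard_normal((N + 1, N + 1))     # a[k+1, i] for 0 <= i <= k <= N-1 (other entries never accessed)
b = rng.standard_normal((N + 1, N + 1))     # b[k, i]   for 0 <= i <= k <= N   (other entries never accessed)

# Transformation (8): C_0 = u_N A_N, C_j - C_{j-1} = u_{N-j} (A_{N-j} - A_{N-j+1}), D_i = B_{N-i}.
C = np.zeros((N + 1, n))
C[0] = u[N] * A[N]
for j in range(1, N + 1):
    C[j] = C[j - 1] + u[N - j] * (A[N - j] - A[N - j + 1])
D = B[::-1].copy()

# U_A as in (9).
U = sum(u[k] / (2 * L) * (A[k] - A[k + 1]) @ (A[k] - A[k + 1]) for k in range(N))
U += sum(sigma / 2 * (B[k] - B[k + 1]) @ (B[k] - B[k + 1]) for k in range(N))
U += sum(a[k + 1, i] * A[i] @ D[N - k - 1] for k in range(N) for i in range(k + 1))
U += sum(b[k, i] * C[N - k] @ B[i] for k in range(N + 1) for i in range(k + 1))

# V_B as in the first expression of (10).
V = sum(v[k + 1] / (2 * L) * (C[k] - C[k + 1]) @ (C[k] - C[k + 1]) for k in range(N))
V += sum(sigma / 2 * (D[k] - D[k + 1]) @ (D[k] - D[k + 1]) for k in range(N))
V += sum(b[N - i, N - k] * C[i] @ D[k] for k in range(N + 1) for i in range(k + 1))
for k in range(N):
    lhs = v[k + 1] * C[k + 1] - sum((v[j + 1] - v[j]) * C[j] for j in range(k + 1))
    rhs = sum(a[N - i, N - 1 - k] * D[i] for i in range(k + 1))
    V += lhs @ rhs

print(np.isclose(U, V))  # True: U_A and V_B coincide, as asserted in the proof of Theorem 3.1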

3.3 dual-AMD

Again, consider the problem

findxX0=f(x),\displaystyle\begin{array}[]{ll}\underset{x\in X}{\mbox{find}}&\quad 0=\nabla f(x),\end{array}

where XX is a reflexive Banach space and f:Xf\colon X\to\mathbb{R} is a differentiable CCP function. Given an approximate solution, we measure its suboptimality via

ψ(f(x)).\psi^{*}(\nabla f(x)).

Recall that we assumed ψ(0)=0\psi^{*}(0)=0 and 0X0\in X^{*} is the unique minimizer of ψ\psi^{*}. Consider (A1), which we restate as

  • (A1)

    ff is LL-smooth; ψ\psi is σ\sigma-strongly convex; L>0L>0, σ>0\sigma>0.

We propose dual accelerated mirror descent (dual-AMD)

qk+1=qkσL(θNk12θNk22)ψ(rk)\displaystyle q_{k+1}=q_{k}-\frac{\sigma}{L}(\theta_{N-k-1}^{2}-\theta_{N-k-2}^{2})\nabla\psi^{*}(r_{k})
gk+1=gk+1θNk12(f(qk+1)f(qk))\displaystyle g_{k+1}=g_{k}+\frac{1}{\theta_{N-k-1}^{2}}\left(\nabla f(q_{k+1})-\nabla f(q_{k})\right) (dual-AMD)
rk+1=rk+(θNk12θNk22)(gk+1gk)+(θNk22θNk32)gk+1\displaystyle r_{k+1}=r_{k}+(\theta_{N-k-1}^{2}-\theta_{N-k-2}^{2})\left(g_{k+1}-g_{k}\right)+(\theta_{N-k-2}^{2}-\theta_{N-k-3}^{2})g_{k+1}
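The recursion above can be transcribed directly. The following is a minimal sketch (not the paper's reference implementation): it assumes a routine theta(i) returning \theta_{i}, with the convention \theta_{i}=0 for i<0 adopted here for the boundary terms, and it takes the initialization (q_{0},g_{0},r_{0}) of dual-AMD as given inputs rather than restating it.

def dual_amd(q0, g0, r0, N, theta, grad_f, grad_psi_star, L, sigma):
    """Run the dual-AMD recursion above for N steps.
    theta(i) returns theta_i (assumed 0 for i < 0 and positive for i >= 0);
    (q0, g0, r0) is the method's initialization, supplied by the caller."""
    q, g, r = q0, g0, r0
    for k in range(N):
        t1 = theta(N - k - 1) ** 2
        t2 = theta(N - k - 2) ** 2
        t3 = theta(N - k - 3) ** 2
        q_new = q - (sigma / L) * (t1 - t2) * grad_psi_star(r)
        g_new = g + (grad_f(q_new) - grad_f(q)) / t1
        r = r + (t1 - t2) * (g_new - g) + (t2 - t3) * g_new
        q, g = q_new, g_new
    return q, r  # by Corollary 1 below, r = r_N equals grad f(q_N)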
Lemma 3

dual-AMD is the mirror dual of AMD.

We defer the proof, which follows from a direct calculation of the coefficients, to Appendix A. Our mirror duality theorem (Theorem 3.1) and Lemma 3 now immediately imply a convergence guarantee for dual-AMD.

Corollary 1 (dual-AMD)

Let XX be a reflexive Banach space and f:Xf\colon X\to\mathbb{R} and ψ:X\psi\colon X\to\mathbb{R} be CCP. Assume (A1). Then, the iterates of dual-AMD are well defined and rN=f(qN)r_{N}=\nabla f(q_{N}). If we further assume that ψ(0)=0\psi^{*}(0)=0 and 0X0\in X^{*} is the unique minimizer of ψ\psi^{*}, then

ψ(f(qN))=ψ(rN)LσθN2(f(q0)infxXf(x))=()𝒪(L(f(q0)infxXf(x))σN2),\displaystyle\psi^{*}(\nabla f(q_{N}))=\psi^{*}(r_{N})\leq\frac{L}{\sigma\theta_{N}^{2}}\Big{(}f(q_{0})-\inf_{x\in X}f(x)\Big{)}\overset{(\bullet)}{=}\mathcal{O}\bigg{(}\frac{L(f(q_{0})-\inf_{x\in X}f(x))}{\sigma N^{2}}\bigg{)},

where ()(\bullet) holds if θi2θi=θi12\theta_{i}^{2}-\theta_{i}=\theta_{i-1}^{2} for 0iN10\leq i\leq N-1. (If infxXf(x)=\inf_{x\in X}f(x)=-\infty, then the right-hand-side is ++\infty and the bound vacuously holds.)

Proof

As shown in Fact 4, the convergence rate of AMD is proved by {𝒰k}k=0N\{\mathcal{U}_{k}\}_{k=0}^{N} with ui=σLθi2u_{i}=\frac{\sigma}{L}\theta_{i}^{2}, inf𝐔AMD0\inf\mathbf{U}_{\text{AMD}}\geq 0, and x=xx=x_{\star}. Since ui=σLθi2u_{i}=\frac{\sigma}{L}\theta_{i}^{2} are positive and non-decreasing, vi=1uNiv_{i}=\frac{1}{u_{N-i}} are also positive and non-decreasing.

Now, we use Theorem 3.1 and Proposition 1. Since dual-AMD is the mirror dual of AMD (Lemma 3), Theorem 3.1 gives \inf\mathbf{V}_{\text{dual-AMD}}=\inf\mathbf{U}_{\text{AMD}}\geq 0. Therefore, \psi^{*}(r_{N})\leq\frac{L}{\sigma}\frac{1}{\theta_{N}^{2}}\left(f(q_{0})-f_{\star}\right) holds for dual-AMD, and Proposition 1 gives r_{N}=\nabla f(q_{N}), which leads to

ψ(f(qN))=ψ(rN)LσθN2(f(q0)f).\displaystyle\psi^{*}(\nabla f(q_{N}))=\psi^{*}(r_{N})\leq\frac{L}{\sigma\theta_{N}^{2}}\left(f(q_{0})-f_{\star}\right).

3.4 Concatenation of AMD and dual-AMD

Concatenating AMD and dual-AMD, we obtain an \mathcal{O}\left(1/N^{4}\right) rate on \psi^{*}(\nabla f(\cdot)). More specifically, assume C=X, so that f_{\star}=\inf_{x\in C}f(x)=\inf_{x\in X}f(x). Also assume f is L-smooth convex, \phi is \sigma_{1}-strongly convex, and \psi is \sigma_{2}-strongly convex. Starting from x_{0}=\nabla\phi^{*}(y_{0}), execute N iterations of AMD to obtain x_{N}. Then, starting from q_{0}=x_{N}, execute N iterations of dual-AMD and denote the output as x_{2N}=q_{N}.
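Schematically, the concatenation is just two calls in sequence. In the sketch below, run_amd and run_dual_amd are assumed to implement AMD (Fact 4) and dual-AMD, respectively; neither is specified here.

def amd_then_dual_amd(x0, y0, N, run_amd, run_dual_amd):
    """Phase 1: N steps of AMD from (x0, y0) reduce the function value.
    Phase 2: N steps of dual-AMD, warm-started at q0 = x_N, reduce psi^*(grad f)."""
    x_N = run_amd(x0, y0, N)
    x_2N = run_dual_amd(x_N, N)   # q_0 = x_N
    return x_2N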

Corollary 2

Assume (A1) and x_{\star}\in\operatorname*{dom}\phi\cap\mathrm{argmin}f exists. Then the concatenation of AMD and dual-AMD exhibits the rate

\psi^{*}(\nabla f(x_{2N}))\underbrace{\leq}_{\text{dual-AMD}}\frac{L}{\sigma_{2}\theta_{N}^{2}}\left(f(x_{N})-f(x_{\star})\right)
AMDL2Dϕ(x,x0)σ1σ2θN4=()𝒪(L2Dϕ(x,x0)σ1σ2N4),\displaystyle\qquad\underbrace{\leq}_{\text{\ref{eqn::AMD}}}\frac{L^{2}D_{\phi}\left(x_{\star},x_{0}\right)}{\sigma_{1}\sigma_{2}\theta_{N}^{4}}\stackrel{{\scriptstyle(\bullet)}}{{=}}\mathcal{O}\left(\frac{L^{2}D_{\phi}\left(x_{\star},x_{0}\right)}{\sigma_{1}\sigma_{2}N^{4}}\right),

where ()(\bullet) holds if θi2θi=θi12\theta_{i}^{2}-\theta_{i}=\theta_{i-1}^{2} for 0iN10\leq i\leq N-1.

Lower bound.

Under (A1), the classical \Omega\Big(\frac{L^{2}\|x_{0}-x_{\star}\|^{2}}{N^{4}}\Big) lower bound of nemirovsky1991optimality ; nemirovsky1992information applies and establishes the optimality of this concatenated rate. This also implies the optimality of dual-AMD over all choices of (f,\psi) satisfying (A1): if the rate of dual-AMD could be improved, then the concatenated rate of AMD and dual-AMD would improve as well and violate the aforementioned lower bound.

4 Applications

In this section, we provide two applications of dual-AMD and AMD. Our discussion of these applications very closely follows diakonikolas2023complementary .

4.1 Small gradients with respect to q\left\|\cdot\right\|_{q}-norm

For p(1,2]p\in(1,2] and q=pp1q=\frac{p}{p-1}, let f:nf\colon\mathbb{R}^{n}\rightarrow\mathbb{R} be a differentiable convex function that is LL-smooth with respect to p\left\|\cdot\right\|_{p}. Our goal is to find an xnx\in\mathbb{R}^{n} such that f(x)q\|\nabla f(x)\|_{q} is small, and the concatenation of AMD and dual-AMD yields an optimal method for this task, improving upon the prior state-of-the-art rate of diakonikolas2023complementary .

For a starting point x0Xx_{0}\in X, we choose ϕ()=12x0p2\phi(\cdot)=\frac{1}{2}\left\|\cdot-x_{0}\right\|_{p}^{2} and ψ()=12p2\psi(\cdot)=\frac{1}{2}\left\|\cdot\right\|^{2}_{p}, which are (p1)(p-1)-strongly convex with respect to p\left\|\cdot\right\|_{p} (juditsky2008large, , Proposition 3.5). It is straightforward to show that ϕ()=12q2+,x0\phi^{*}(\cdot)=\frac{1}{2}\left\|\cdot\right\|_{q}^{2}+\left\langle\cdot,x_{0}\right\rangle and ψ()=12q2\psi^{*}(\cdot)=\frac{1}{2}\left\|\cdot\right\|^{2}_{q} and that ϕ\phi^{*} and ψ\psi^{*} are differentiable on n\mathbb{R}^{n} with

\nabla\phi^{*}(u)=\left\|u\right\|_{q}^{2-q}\left(\lvert u_{1}\rvert^{q-1}\operatorname{sign}(u_{1}),\dots,\lvert u_{n}\rvert^{q-1}\operatorname{sign}(u_{n})\right)+x_{0},\quad\nabla\psi^{*}(u)=\left\|u\right\|_{q}^{2-q}\left(\lvert u_{1}\rvert^{q-1}\operatorname{sign}(u_{1}),\dots,\lvert u_{n}\rvert^{q-1}\operatorname{sign}(u_{n})\right).
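As a quick numerical check of this gradient formula (an illustration of ours; grad_psi_star is a hypothetical helper name), the sketch below compares \nabla\psi^{*}(u) against central finite differences of \psi^{*}(u)=\frac{1}{2}\|u\|_{q}^{2} at a nonzero u.

import numpy as np

def grad_psi_star(u, q):
    """Gradient of psi^*(u) = 0.5 * ||u||_q^2 at u != 0."""
    return np.linalg.norm(u, ord=q) ** (2 - q) * np.abs(u) ** (q - 1) * np.sign(u)

rng = np.random.default_rng(1)
u, q, eps = rng.standard_normal(5), 3.0, 1e-6      # q = 3 corresponds to p = 3/2 in (1, 2]
psi_star = lambda w: 0.5 * np.linalg.norm(w, ord=q) ** 2
fd = np.array([(psi_star(u + eps * e) - psi_star(u - eps * e)) / (2 * eps) for e in np.eye(5)])
print(np.max(np.abs(fd - grad_psi_star(u, q))))    # prints a value near 0 (finite-difference error only)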
Corollary 3

Let p\in(1,2] and f be an L-smooth convex function with respect to \left\|\cdot\right\|_{p}. Let x_{0}\in X be a given starting point. The concatenation of AMD and dual-AMD with \sigma_{1}=\sigma_{2}=p-1, distance-generating functions \phi(\cdot)=\frac{1}{2}\left\|\cdot-x_{0}\right\|^{2}_{p} and \psi(\cdot)=\frac{1}{2}\left\|\cdot\right\|^{2}_{p}, starting point y_{0}=\nabla\phi(x_{0})=0, and \theta_{i}^{2}-\theta_{i}=\theta_{i-1}^{2} for i=0,\dots,N-1 exhibits the rate

f(x2N)qLx0xp(p1)θN2=𝒪(Lx0xpN2).\displaystyle\left\|\nabla f(x_{2N})\right\|_{q}\leq\frac{{L}\left\|x_{0}-x_{\star}\right\|_{p}}{(p-1)\theta_{N}^{2}}=\mathcal{O}\left(\frac{L\left\|x_{0}-x_{\star}\right\|_{p}}{N^{2}}\right).
Proof

Direct consequence of Corollary 2. ∎

Lower bound.

Our upper bound of Corollary 3 matches the Ω(Lx0xpN2)\Omega\left(\frac{L\left\|x_{0}-x_{\star}\right\|_{p}}{N^{2}}\right) lower bound of (diakonikolas2023complementary, , Corollary 2) up to a constant factor and is therefore optimal. Our upper bound improves upon the prior state-of-the-art result for this setup (diakonikolas2023complementary, , Theorem 3), which we now know is loose by a logarithmic factor. For p(2,)p\in(2,\infty), however, ϕ()=12p2\phi(\cdot)=\frac{1}{2}\left\|\cdot\right\|^{2}_{p} is not strongly convex with respect to p\left\|\cdot\right\|_{p} with a dimension-independent strong convexity parameter, as discussed in (diakonikolas2023complementary, , Section 1), so the concatenation of AMD and dual-AMD cannot yield a dimension-independent upper bound. We leave the investigation of the p(2,)p\in(2,\infty) case to future work.

4.2 Entropy-regularized optimal transport

Consider the discrete optimal transport problem

minimizeX+m×nC,Xsubject toX𝟏=μ,X𝟏=ν,\displaystyle\begin{array}[]{ll}\underset{X\in{\mathbb{R}_{+}^{m\times n}}}{\mbox{minimize}}&\quad\left\langle C,X\right\rangle\\ \mbox{subject to}&\quad X\mathbf{1}=\mu,\,\,X^{\intercal}\mathbf{1}=\nu,\end{array}

where C\in\mathbb{R}^{m\times n} is the matrix of transport costs with nonnegative entries, and \mu\in\mathbb{R}^{m} and \nu\in\mathbb{R}^{n} represent probability mass functions with full support, i.e.,

μ1++μm=1,μi>0 for i=1,,m\mu_{1}+\dots+\mu_{m}=1,\quad\mu_{i}>0\text{ for }i=1,\dots,m

and likewise for ν\nu, C,X=i=1mj=1nCijXij\left\langle C,X\right\rangle=\sum_{i=1}^{m}\sum_{j=1}^{n}C_{ij}X_{ij}, and 𝟏\mathbf{1} denotes the vectors in m\mathbb{R}^{m} or n\mathbb{R}^{n} with all components 11. Write XX_{\star} to denote a solution to this (non-regularized) problem.

The entropy-regularized discrete optimal transport problem with parameter r>0r>0 marco2013sinkhorn ; fang1992unconstraineed ; lin2022efficiency is

minimizeX+m×nC,X+rX,log(X)Usubject toX𝟏=μ,X𝟏=ν,\begin{array}[]{ll}\underset{X\in{\mathbb{R}_{+}^{m\times n}}}{\mbox{minimize}}&\quad\left\langle C,X\right\rangle+r\left\langle X,\log(X)-U\right\rangle\\ \mbox{subject to}&\quad X\mathbf{1}=\mu,\,\,X^{\intercal}\mathbf{1}=\nu,\end{array}

where log(X)\log(X) denotes the element-wise logarithm, we use the convention 0log(0)=00\cdot\log(0)=0, and Um×nU\in\mathbb{R}^{m\times n} is the matrix with all entries 11. The corresponding regularized dual problem is

minimizeum,vnh(u,v)=rlog(i,jexp(ui+vjcijr))μ,uν,v.\begin{array}[]{ll}\underset{u\in\mathbb{R}^{m},\,v\in\mathbb{R}^{n}}{\mbox{minimize}}&\quad h(u,v)=r\log\Big{(}\sum_{i,j}\exp\big{(}\frac{u_{i}+v_{j}-c_{ij}}{r}\big{)}\Big{)}-\left\langle\mu,u\right\rangle-\left\langle\nu,v\right\rangle.\end{array}

The objective h(u,v) of the regularized dual problem is (1/r)-smooth with respect to \|\cdot\|_{\infty}.

If we solve the regularized dual problem approximately in the sense of making h(u,v)1\left\|\nabla h(u,v)\right\|_{1} small, then we can obtain an approximate solution to the (non-regularized) primal problem using the following lemma.

Lemma 4

(altschuler2018nearlinear, , Theorem 1, Lemma 7) For (u,v)m×n(u,v)\in\mathbb{R}^{m}\times\mathbb{R}^{n}, define the m×nm\times n matrix B(u,v)ij=exp(ui+vjcijr)B(u,v)_{ij}=\exp\left(\frac{u_{i}+v_{j}-c_{ij}}{r}\right) for (i,j)[1,m]×[1,n](i,j)\in[1,m]\times[1,n]. Define X(u,v)=B(u,v)/U,B(u,v)X(u,v)=B(u,v)/\left\langle U,B(u,v)\right\rangle. Then,

C,X(u,v)Xrlog(mn)+2Ch(u,v)1.\displaystyle\left\langle C,X(u,v)-X_{\star}\right\rangle\leq r\log(mn)+2\left\|C\right\|_{\infty}\left\|\nabla h(u,v)\right\|_{1}.

Moreover, there exists an algorithm with running time 𝒪(mn)\mathcal{O}\left(mn\right) that takes an input X(u,v)X(u,v) and outputs X^(u,v)\hat{X}(u,v), which satisfies the constraints X^(u,v)𝟏=μ\hat{X}(u,v)\mathbf{1}=\mu and X^(u,v)𝟏=ν\hat{X}(u,v)^{\intercal}\mathbf{1}=\nu and satisfies X^(u,v)X(u,v)12h(u,v)1\|\hat{X}(u,v)-X(u,v)\|_{1}\leq 2\left\|\nabla h(u,v)\right\|_{1}.
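To connect the dual problem with Lemma 4 computationally, here is a minimal sketch (dual_value_grad is a hypothetical helper of ours; the gradient formula \nabla_{u}h=X(u,v)\mathbf{1}-\mu, \nabla_{v}h=X(u,v)^{\intercal}\mathbf{1}-\nu follows by direct differentiation of h, with X(u,v) as in Lemma 4).

import numpy as np

def dual_value_grad(u, v, C, mu, nu, r):
    """h(u, v) from the regularized dual, its gradient, and the matrix X(u, v) of Lemma 4."""
    Z = (u[:, None] + v[None, :] - C) / r
    Zmax = Z.max()                      # log-sum-exp stabilization
    B = np.exp(Z - Zmax)
    S = B.sum()
    h = r * (np.log(S) + Zmax) - mu @ u - nu @ v
    X = B / S                           # X(u, v) = B(u, v) / <U, B(u, v)>; the factor exp(Zmax) cancels
    # ||grad||_1 measures how far X(u, v) is from satisfying the two marginal constraints.
    grad = np.concatenate([X.sum(axis=1) - mu, X.sum(axis=0) - nu])
    return h, grad, X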

Specifically, set r=ε2log(mn)r=\frac{\varepsilon}{2\log(mn)} and find (u0,v0)(u_{0},v_{0}) such that h(u0,v0)1ε8C\left\|\nabla h(u_{0},v_{0})\right\|_{1}\leq\frac{\varepsilon}{8\left\|C\right\|_{\infty}}. By Lemma 4, this is sufficient, since

\displaystyle\langle C,\hat{X}(u_{0},v_{0})-X_{\star}\rangle \leq r\log(mn)+2\left\|C\right\|_{\infty}\left\|\nabla h(u_{0},v_{0})\right\|_{1}+\left\langle C,\hat{X}(u_{0},v_{0})-X(u_{0},v_{0})\right\rangle
rlog(mn)+4Ch(u0,v0)1\displaystyle\leq r\log(mn)+4\left\|C\right\|_{\infty}\left\|\nabla h(u_{0},v_{0})\right\|_{1}
ε.\displaystyle\leq\varepsilon.

Take ϕ(x)=ψ(x)=12x22\phi(x)=\psi(x)=\frac{1}{2}\left\|x\right\|^{2}_{2}, which is 11-strongly convex with respect to \|\cdot\|_{\infty}. With a starting point (0,0)(0,0), implement the concatenation of AMD and dual-AMD and denote the output as (u2N,v2N)(u_{2N},v_{2N}). By Corollary 2,

h(u2N,v2N)22=𝒪((u,v)22r2N4),\displaystyle\left\|\nabla h(u_{2N},v_{2N})\right\|_{2}^{2}=\mathcal{O}\Big{(}\frac{\left\|(u_{\star},v_{\star})\right\|_{2}^{2}}{r^{2}N^{4}}\Big{)},

where (u,v)argminu,vh(u,v)(u_{\star},v_{\star})\in\operatorname*{arg\,min}_{u,v}h(u,v).

Lemma 5

(lin2022efficiency, , Lemma 3.2) If μ\mu and ν\nu have full support, then there exists an optimal point (u,v)(u_{\star},v_{\star}) of h(u,v)h(u,v) such that (u,v)=𝒪(C)\left\|(u_{\star},v_{\star})\right\|_{\infty}=\mathcal{O}\left(\left\|C\right\|_{\infty}\right).

Using Lemma 5 and (u,v)1(m+n)1/2(u,v)2(m+n)(u,v)\left\|(u,v)\right\|_{1}\leq(m+n)^{1/2}\left\|(u,v)\right\|_{2}\leq(m+n)\left\|(u,v)\right\|_{\infty} for (u,v)m×n(u,v)\in\mathbb{R}^{m}\times\mathbb{R}^{n}, we get

h(u2N,v2N)1(n+m)1/2h(u2N,v2N)2\displaystyle\left\|\nabla h(u_{2N},v_{2N})\right\|_{1}\leq(n+m)^{1/2}\left\|\nabla h(u_{2N},v_{2N})\right\|_{2}
=𝒪((m+n)1/2(u,v)2rN2)=𝒪((m+n)log(mn)CεN2).\displaystyle\qquad=\mathcal{O}\left(\frac{(m+n)^{1/2}\left\|(u_{\star},v_{\star})\right\|_{2}}{rN^{2}}\right)=\mathcal{O}\left(\frac{(m+n)\log(mn)\left\|C\right\|_{\infty}}{\varepsilon N^{2}}\right).

To ensure h(u2N,v2N)1𝒪(εC)\left\|\nabla h(u_{2N},v_{2N})\right\|_{1}\leq\mathcal{O}\big{(}\frac{\varepsilon}{\left\|C\right\|_{\infty}}\big{)}, set

(m+n)log(mn)CεN2εC\frac{(m+n)\log(mn)\left\|C\right\|_{\infty}}{\varepsilon N^{2}}\leq\frac{\varepsilon}{\left\|C\right\|_{\infty}}

and solve for N to conclude that \mathcal{O}\left((m+n)^{1/2}\log(mn)\left\|C\right\|_{\infty}/\varepsilon\right) iterations suffice. Finally, since each step of AMD and dual-AMD requires \mathcal{O}(mn) arithmetic operations, the total arithmetic complexity needed to obtain an \varepsilon-approximate solution of the (non-regularized) discrete optimal transport problem is

𝒪((m+n)5/2log(mn)Cε).\mathcal{O}\Big{(}\frac{(m+n)^{5/2}\log(mn)\left\|C\right\|_{\infty}}{\varepsilon}\Big{)}.

This rate improves upon the rate of (diakonikolas2023complementary, , Section 5.6) by logarithmic factors and matches the state-of-the-art 𝒪((m+n)5/2log(mn)C/ε){\mathcal{O}}((m+n)^{5/2}\log(mn)\left\|C\right\|_{\infty}/\varepsilon) rate of antonin2022accelerated , among non-parallel methods.

5 Comparison with H-duality

In this section, we discuss the relationship between our proposed mirror duality and the H-duality of prior work kim2023timereversed . First, we show that the mirror dual operation reduces to the H-dual operation in the Euclidean case.

Proposition 2

Let X be a Euclidean space equipped with the Euclidean norm (or, more generally, a Hilbert space). Let \phi=\psi=(1/2)\|\cdot\|_{2}^{2} and \sigma=1. For a CFOM (1), assume 1-\sum_{i=0}^{k}\sum_{j=0}^{i+1}b_{i+1,j}=1 for k=0,\dots,N-1, which is a sufficient condition for x_{k}\in\mathrm{conv}\{\nabla\phi^{*}(y_{0}),\dots,\nabla\phi^{*}(y_{k})\} for k=0,\dots,N. (If the CFOM (1) does not satisfy this condition, then it reduces to an N-step FSFOM of the form x_{k+1}=\gamma_{k}x_{k}-\frac{1}{L}\sum_{i=0}^{k}h_{k+1,i}\nabla f(x_{i}), k=0,1,\dots,N-1, where \gamma_{k}\neq 1 for some k.) Then the CFOM (1) reduces to an N-step FSFOM of the form

xk+1=xk1Li=0khk+1,if(xi),k=0,1,,N1,x_{k+1}=x_{k}-\frac{1}{L}\sum_{i=0}^{k}h_{k+1,i}\nabla f(x_{i}),\quad k=0,1,\dots,N-1,

and the mirror dual CFOM (2) reduces to an NN-step FSFOM of the form

qk+1=qk1Li=0khNi,N1kf(qi),k=0,1,,N1,q_{k+1}=q_{k}-\frac{1}{L}\sum_{i=0}^{k}h_{N-i,N-1-k}\nabla f(q_{i}),\quad k=0,1,\dots,N-1,

so the FSFOMs are H-duals of each other.

Proof

For any matrix M\in\mathbb{R}^{N\times N}, we can describe the relationship between M and M^{A} by introducing J\in\mathbb{R}^{N\times N} with J_{i,j}=1 if i+j=N+1 and J_{i,j}=0 otherwise, so that

MA=JMJ.\displaystyle M^{A}=JM^{\intercal}J. (11)

Moreover, for N×NN\times N matrices XX and YY,

(XY)A=J(XY)J=JYXJ=JYJJXJ=YAXA.\displaystyle\left(XY\right)^{A}=J\left(XY\right)^{\intercal}J=JY^{\intercal}X^{\intercal}J=JY^{\intercal}J\cdot JX^{\intercal}J=Y^{A}X^{A}. (12)

This derivation employs (11) and J2=IJ^{2}=I.
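The anti-transpose M^{A}=JM^{\intercal}J and property (12) are easy to sanity-check numerically; a minimal sketch of ours:

import numpy as np

rng = np.random.default_rng(2)
N = 5
J = np.fliplr(np.eye(N))                 # J_{i,j} = 1 iff i + j = N + 1 (in 1-indexed notation)
anti = lambda M: J @ M.T @ J             # M^A, i.e., transpose across the anti-diagonal
X, Y = rng.standard_normal((N, N)), rng.standard_normal((N, N))
print(np.allclose(J @ J, np.eye(N)))                 # J^2 = I
print(np.allclose(anti(X @ Y), anti(Y) @ anti(X)))   # property (12)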

Define matrices A,B\in\mathbb{R}^{N\times N} as A_{k+1,i+1}=a_{k+1,i}, B_{k+1,i+1}=b_{k+1,i}, and otherwise 0. Since \nabla\psi^{*}(y_{i})=y_{i},

yk+1=\displaystyle y_{k+1}= yk1Li=0kak+1,if(xi),\displaystyle y_{k}-\frac{1}{L}\sum_{i=0}^{k}a_{k+1,i}\nabla f(x_{i}),
xk+1=\displaystyle x_{k+1}= xki=0k+1bk+1,iyi\displaystyle x_{k}-\sum_{i=0}^{k+1}b_{k+1,i}y_{i}
=\displaystyle= xk+1Li=0k+1bk+1,i(y0+j=0i1l=0jaj+1,lf(xl))\displaystyle x_{k}+\frac{1}{L}\sum_{i=0}^{k+1}b_{k+1,i}\left(-y_{0}+\sum_{j=0}^{i-1}\sum_{l=0}^{j}a_{j+1,l}\nabla f(x_{l})\right)
=\displaystyle= xk+1Li=0k+1bk+1,i(x0+j=0i1l=0jaj+1,lf(xl))\displaystyle x_{k}+\frac{1}{L}\sum_{i=0}^{k+1}b_{k+1,i}\left(-x_{0}+\sum_{j=0}^{i-1}\sum_{l=0}^{j}a_{j+1,l}\nabla f(x_{l})\right)
=\displaystyle= xk+1Li=0k+1bk+1,i(j=0i1l=0jaj+1,lf(xl))1Li=0k+1bk+1,ix0\displaystyle x_{k}+\frac{1}{L}\sum_{i=0}^{k+1}b_{k+1,i}\left(\sum_{j=0}^{i-1}\sum_{l=0}^{j}a_{j+1,l}\nabla f(x_{l})\right)-\frac{1}{L}\sum_{i=0}^{k+1}b_{k+1,i}x_{0}
=\displaystyle= xk+1Li=0k+1bk+1,i(j=0i1l=0jaj+1,lf(xl)).\displaystyle x_{k}+\frac{1}{L}\sum_{i=0}^{k+1}b_{k+1,i}\left(\sum_{j=0}^{i-1}\sum_{l=0}^{j}a_{j+1,l}\nabla f(x_{l})\right).

for k=0,\dots,N-1. The term \sum_{i=0}^{k+1}b_{k+1,i}x_{0} vanishes due to (3). The coefficient of \nabla f(x_{l}) is h_{k+1,l}=\sum_{i=0}^{k+1}\Big(b_{k+1,i}\sum_{j=0}^{i-1}a_{j+1,l}\Big). To simplify the above equation, define H,C\in\mathbb{R}^{N\times N} as H_{k,i}\colon=h_{k+1,i} (and otherwise 0) and C_{k,i}\colon=1_{i\leq k-1}. Then H_{k,l}=\sum_{i=0}^{k}\Big(B_{k,i}\sum_{j=0}^{i-1}A_{j,l}\Big)=\sum_{0\leq i,j\leq N}B_{k,i}C_{i,j}A_{j,l}, and thus H=BCA. For the mirror dual, we obtain

q_{k+1}=q_{k}-\frac{1}{L}\sum_{i=0}^{k}a_{N-i,N-1-k}r_{i},\quad r_{k+1}=r_{k}-\sum_{i=0}^{k+1}b_{N-i,N-1-k}\nabla f(q_{i}).

In the same way, and further using b_{N,N}=-1, we obtain H^{\prime}=A^{A}CB^{A}, where H^{\prime}_{k,i}=h^{\prime}_{k+1,i}. Finally, \left(H^{\prime}\right)^{A}=\left(B^{A}\right)^{A}C^{A}\left(A^{A}\right)^{A}=BC^{A}A=BCA=H, where we used C^{A}=C. ∎

Mirror duality and H-duality.

Consider the case in which the distance-generating function is the squared Euclidean norm, \phi=\psi=\frac{1}{2}\|\cdot\|^{2}. Proposition 2 indicates that the mirror dual operation generalizes the H-dual operation. Furthermore, since the Bregman divergence generated by \frac{1}{2}\|\cdot\|^{2} is half the squared Euclidean distance, our mirror duality theorem generalizes the H-duality theorem (kim2023timereversed, , Theorem 1). Strictly speaking, however, the two theorems differ. The energy functions \{\mathcal{U}_{k}\}_{k=0}^{N} and \{\mathcal{V}_{k}\}_{k=0}^{N} used in our mirror duality theorem and in the H-duality theorem of kim2023timereversed are similar but not identical, primarily due to a technical constraint: in a general Banach space, the norm cannot be expanded into an inner product.

6 Conclusion

In this work, we presented mirror duality, a one-to-one correspondence between mirror-descent-type methods reducing function value and those reducing gradient magnitude, thereby advancing our understanding of the task of reducing gradient magnitude. Furthermore, we used mirror duality to obtain dual-AMD, the first method that reduces gradient magnitude at an optimal accelerated rate in the mirror-descent setup.

The results of this work inspire some interesting directions for future work. One direction would be to consider the relaxed assumption of a weakly smooth objective function f and a uniformly convex distance-generating function \phi. For the primal setup of reducing function value, alexandre2018optimal provides an accelerated mirror-descent-type method. For a related composite minimization setup, diakonikolas2023complementary provides an accelerated first-order method. Extending mirror duality and dual-AMD to the weakly smooth and uniformly convex setup would also have interesting applications. Another potential direction is to extend dual-MD and dual-AMD to the constrained setup, in which the notions of mirror descent and dual averaging differ. Yet another direction of future work is to consider the stochastic setup. For the setup of using stochastic gradients of convex functions, zhu2018how provides a rate for reducing the Euclidean norm of gradients. Investigating whether similar rates are attainable in the mirror descent setup, or whether there is an analog of mirror duality in the stochastic setup, would be interesting.

Acknowledgement

We thank TaeHo Yoon for reviewing the manuscript and providing valuable feedback. We thank Ya-Ping Hsieh, Anna Korba, Emanuel Laude, and Francesco Orabona for helpful discussions that improved this work. JD acknowledges funding from the NSF grant 2007757.

References

  • [1] Z. Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. NeurIPS, 2018.
  • [2] J. M. Altschuler and P. A. Parrilo. Acceleration by stepsize hedging I: Multi-step descent and the silver stepsize schedule. arXiv 2309.07879, 2023.
  • [3] J. M. Altschuler and P. A. Parrilo. Acceleration by stepsize hedging II: Silver stepsize schedule for smooth convex optimization. arXiv 2309.16530, 2023.
  • [4] J. M. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. NeurIPS, 2018.
  • [5] A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697–725, 2006.
  • [6] J. B. Baillon and G. Haddad. Quelques propriétés des opérateurs angle-bornés et n-cycliquement monotones. Israel Journal of Mathematics, 26:137–150, 1977.
  • [7] V. Barbu and T. Precupanu. Convexity and Optimization in Banach Spaces. 4th edition, 2012.
  • [8] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2017.
  • [9] H. H. Bauschke, J. M. Borwein, and P. L. Combettes. Essential smoothness, essential strict convexity, and Legendre functions in Banach spaces. Communications in Contemporary Mathematics, 03:615–647, 2001.
  • [10] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • [11] A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12(1):79–108, 2001.
  • [12] Y. Censor and S. Zenios. On the proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73:451–464, 1992.
  • [13] A. Chambolle and J. P. Contreras. Accelerated Bregman primal-dual methods applied to optimal transport and Wasserstein barycenter problems. SIAM Journal on Mathematics of Data Science, 4(4):1369–1395, 2022.
  • [14] N. Chopin, F. R. Crucinio, and A. Korba. A connection between tempering and entropic mirror descent. arXiv:2310.11914, 2023.
  • [15] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013.
  • [16] S. Das Gupta, B. P. G. Van Parys, and E. K. Ryu. Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods. Mathematical Programming, 204(1):567–639, 2024.
  • [17] A. d’Aspremont, C. Guzmán, and M. Jaggi. Optimal affine-invariant smooth minimization algorithms. SIAM Journal on Optimization, 28(3):2384–2405, 2018.
  • [18] A. d’Aspremont, D. Scieur, and A. B. Taylor. Acceleration methods. Foundations and Trends in Optimization, 5(1-2):1–245, 2021.
  • [19] J. Diakonikolas and C. Guzmán. Complementary composite minimization, small gradients in general norms, and applications. arXiv: 2101.11041, 2023.
  • [20] J. Diakonikolas and P. Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022.
  • [21] R.-A. Dragomir, A. B. Taylor, A. d’Aspremont, and J. Bolte. Optimal complexity and certification of Bregman first-order methods. Mathematical Programming, 194(1):41–83, 2021.
  • [22] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1):451–482, 2014.
  • [23] J. Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202–226, 1993.
  • [24] S. Fang. An unconstrained convex programming view of linear programming. Zeitschrift für Operations Research, 36(2):149–162, 1992.
  • [25] W. Fenchel. On conjugate convex functions. Canadian Journal of Mathematics, 1(1):73–77, 1949.
  • [26] B. Grimmer. Provably faster gradient descent via long steps. arXiv 2307.06324, 2024.
  • [27] B. Grimmer, K. Shu, and A. Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. arXiv 2403.14045, 2024.
  • [28] B. Grimmer, K. Shu, and A. L. Wang. Accelerated gradient descent via long steps. arXiv 2309.09961, 2023.
  • [29] A. B. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv 0809.0813, 2008.
  • [30] A. B. Juditsky and A. S. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 121–147. 2011.
  • [31] M. R. Karimi, Y.-P. Hsieh, and A. Krause. Sinkhorn flow: A continuous-time framework for understanding and generalizing the Sinkhorn algorithm. arXiv:2311.16706, 2023.
  • [32] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1):81–107, 2016.
  • [33] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, 2021.
  • [34] J. Kim, A. Ozdaglar, C. Park, and E. K. Ryu. Time-reversed dissipation induces duality between minimizing gradient norm and function value. NeurIPS, 2023.
  • [35] W. Krichene, A. Bayen, and P. L. Bartlett. Accelerated mirror descent in continuous and discrete time. NeurIPS, 2015.
  • [36] G. Lan, Y. Ouyang, and Z. Zhang. Optimal and parameter-free gradient minimization methods for smooth optimization. arXiv: 2310.12139, 2023.
  • [37] E. Laude and P. Patrinos. Anisotropic proximal gradient. arXiv 2210.15531, 2023.
  • [38] E. Laude, A. Themelis, and P. Patrinos. Dualities for non-euclidean smoothness and strong convexity under the light of generalized conjugacy. SIAM Journal on Optimization, 33(4):2721–2749, 2023.
  • [39] J. Lee, C. Park, and E. K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. NeurIPS, 2021.
  • [40] T. Lin, N. Ho, and M. I. Jordan. On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research, 23(137):1–42, 2022.
  • [41] H. Lu, R. M. Freund, and Y. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
  • [42] C. J. Maddison, D. Paulin, Y. W. Teh, and A. Doucet. Dual space preconditioning for gradient descent. SIAM Journal on Optimization, 31(1):991–1016, 2021.
  • [43] A. S. Nemirovsky. On optimality of Krylov’s information when solving linear operator equations. Journal of Complexity, 7(2):121–130, 1991.
  • [44] A. S. Nemirovsky. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
  • [45] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.
  • [46] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence 𝒪(1/k2)\mathcal{O}(1/k^{2}). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983.
  • [47] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
  • [48] Y. Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012.
  • [49] Y. Nesterov. Lectures on Convex Optimization. 2nd edition, 2018.
  • [50] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):1–38, 2020.
  • [51] C. Park and E. K. Ryu. Optimal first-order algorithms as a function of inequalities. Journal of Machine Learning Research, 25(51):1–66, 2024.
  • [52] J. J. Suh, G. Roh, and E. K. Ryu. Continuous-time analysis of accelerated gradient methods via conservation laws in dilated coordinate systems. ICML, 2022.
  • [53] A. B. Taylor, J. M. Hendrickx, and F. Glineur. Exact worst-case performance of first-order methods for composite convex optimization. SIAM Journal on Optimization, 27(3):1283–1313, 2017.
  • [54] A. B. Taylor, J. M. Hendrickx, and F. Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307–345, 2017.
  • [55] M. Teboulle. Entropic proximal mappings with applications to nonlinear programming. Mathematics of Operations Research, 17(3):670–690, 1992.
  • [56] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical report, 2008.
  • [57] Q. Van Nguyen. Forward-backward splitting with Bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.
  • [58] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(88):2543–2596, 2010.
  • [59] K. Zhou, L. Tian, A. M.-C. So, and J. Cheng. Practical schemes for finding near-stationary points of convex finite-sums. AISTATS, 2022.
  • [60] C. Zălinescu. Convex Analysis in General Vector Spaces. World Scientific Publishing Company, 2002.

Appendix A Omitted proofs

Proof of Lemma 3.

Using the expression for x_{k} from Fact 4, we obtain

xkxk+1\displaystyle x_{k}-x_{k+1} =(θk2θk22θk2ϕ(yk)+s=1k1θs12θs22θk2ϕ(ys))\displaystyle=\left(\frac{\theta_{k}^{2}-\theta_{k-2}^{2}}{\theta_{k}^{2}}\nabla\phi^{*}(y_{k})+\sum_{s=1}^{k-1}\frac{\theta_{s-1}^{2}-\theta_{s-2}^{2}}{\theta_{k}^{2}}\nabla\phi^{*}(y_{s})\right)
(θk+12θk12θk+12ϕ(yk+1)+s=1kθs12θs22θk+12ϕ(ys))\displaystyle\qquad\qquad\qquad-\left(\frac{\theta_{k+1}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k+1})+\sum_{s=1}^{k}\frac{\theta_{s-1}^{2}-\theta_{s-2}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{s})\right)
=s=1k1(θs12θs22θk2θs12θs22θk+12)ϕ(ys)\displaystyle=\sum_{s=1}^{k-1}\left(\frac{\theta_{s-1}^{2}-\theta_{s-2}^{2}}{\theta_{k}^{2}}-\frac{\theta_{s-1}^{2}-\theta_{s-2}^{2}}{\theta_{k+1}^{2}}\right)\nabla\phi^{*}(y_{s})
+(θk2θk22θk2θk12θk22θk+12)ϕ(yk)θk+12θk12θk+12ϕ(yk+1).\displaystyle\quad+\left(\frac{\theta_{k}^{2}-\theta_{k-2}^{2}}{\theta_{k}^{2}}-\frac{\theta_{k-1}^{2}-\theta_{k-2}^{2}}{\theta_{k+1}^{2}}\right)\nabla\phi^{*}(y_{k})-\frac{\theta_{k+1}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}}\nabla\phi^{*}(y_{k+1}).

Thus bk+1,0(AMD)=0b_{k+1,0}^{(\text{AMD})}=0, bk+1,s(AMD)=θs12θs22θk2θs12θs22θk+12b_{k+1,s}^{(\text{AMD})}=\frac{\theta_{s-1}^{2}-\theta_{s-2}^{2}}{\theta_{k}^{2}}-\frac{\theta_{s-1}^{2}-\theta_{s-2}^{2}}{\theta_{k+1}^{2}} for 1sk11\leq s\leq k-1, bk+1,k(AMD)=θk2θk22θk2θk12θk22θk+12b_{k+1,k}^{(\text{AMD})}=\frac{\theta_{k}^{2}-\theta_{k-2}^{2}}{\theta_{k}^{2}}-\frac{\theta_{k-1}^{2}-\theta_{k-2}^{2}}{\theta_{k+1}^{2}}, and bk+1,k+1(AMD)=θk+12θk12θk+12b_{k+1,k+1}^{(\text{AMD})}=-\frac{\theta_{k+1}^{2}-\theta_{k-1}^{2}}{\theta_{k+1}^{2}} holds. Moreover, we have

ak+1,s(AMD)=0 for s=0,1,,k1,ak+1,k(AMD)=σL(θk2θk12).\displaystyle a_{k+1,s}^{(\text{AMD})}=0\text{ for }s=0,1,\dots,k-1,\qquad a_{k+1,k}^{(\text{AMD})}=\frac{\sigma}{L}(\theta_{k}^{2}-\theta_{k-1}^{2}).

Now, we calculate the step sizes of dual-AMD. For k=0,1,,N1k=0,1,\dots,N-1,

rk+1\displaystyle r_{k+1} =s=0k((θNs12θNs22)(gs+1gs)+(θNs22θNs32)gs+1)+r0\displaystyle=\sum_{s=0}^{k}\left((\theta_{N-s-1}^{2}-\theta_{N-s-2}^{2})\left(g_{s+1}-g_{s}\right)+(\theta_{N-s-2}^{2}-\theta_{N-s-3}^{2})g_{s+1}\right)+r_{0}
=(\theta_{N-k-2}^{2}-\theta_{N-k-3}^{2})g_{k+1}+\sum_{s=1}^{k+1}(\theta_{N-s}^{2}-\theta_{N-s-1}^{2})g_{s}-(\theta_{N-1}^{2}-\theta_{N-2}^{2})g_{0}+r_{0}. (13)

In addition, for s=0,1,,Ns=0,1,\dots,N, the update of dual-AMD provides

gs\displaystyle g_{s} =g0+l=0s11θNl12(f(ql+1)f(ql))\displaystyle=g_{0}+\sum_{l=0}^{s-1}\frac{1}{\theta_{N-l-1}^{2}}\left(\nabla f(q_{l+1})-\nabla f(q_{l})\right)
=g0+l=1s1(1θNl21θNl12)f(ql)+1θNs2f(qs)1θN12f(q0).\displaystyle=g_{0}+\sum_{l=1}^{s-1}\left(\frac{1}{\theta_{N-l}^{2}}-\frac{1}{\theta_{N-l-1}^{2}}\right)\nabla f(q_{l})+\frac{1}{\theta_{N-s}^{2}}\nabla f(q_{s})-\frac{1}{\theta_{N-1}^{2}}\nabla f(q_{0}). (14)

Now, using (13) and (14), we write r_{k}-r_{k+1} as a linear combination of \{\nabla f(q_{l})\}_{l=0}^{k+1} for k=0,1,\dots,N-1, with the following coefficients:

bk+1,0(AMD-G)\displaystyle b_{k+1,0}^{(\text{AMD-G})} =(θNk22θNk32)1θN12(θNk+12θNk2)1θN2\displaystyle=(\theta_{N-k-2}^{2}-\theta_{N-k-3}^{2})\frac{1}{\theta_{N-1}^{2}}-(\theta_{N-k+1}^{2}-\theta_{N-k}^{2})\frac{1}{\theta_{N}^{2}}
bk+1,l(AMD-G)\displaystyle b_{k+1,l}^{(\text{AMD-G})} =(θNk22θNk32)(1θNl21θNl12)\displaystyle=-(\theta_{N-k-2}^{2}-\theta_{N-k-3}^{2})\left(\frac{1}{\theta_{N-l}^{2}}-\frac{1}{\theta_{N-l-1}^{2}}\right)
bk+1,k(AMD-G)\displaystyle b_{k+1,k}^{(\text{AMD-G})} =(θNk2θNk22)1θNk2(θNk12θNk32)(1θNk21θNk12)\displaystyle=(\theta_{N-k}^{2}-\theta_{N-k-2}^{2})\frac{1}{\theta_{N-k}^{2}}-(\theta_{N-k-1}^{2}-\theta_{N-k-3}^{2})\left(\frac{1}{\theta_{N-k}^{2}}-\frac{1}{\theta_{N-k-1}^{2}}\right)
(θNk2θNk12)1θNk2\displaystyle\qquad-(\theta_{N-k}^{2}-\theta_{N-k-1}^{2})\frac{1}{\theta_{N-k}^{2}}
=(θNk32θNk22)1θNk2+(θNk12θNk32)1θNk12\displaystyle=(\theta_{N-k-3}^{2}-\theta_{N-k-2}^{2})\frac{1}{\theta_{N-k}^{2}}+(\theta_{N-k-1}^{2}-\theta_{N-k-3}^{2})\frac{1}{\theta_{N-k-1}^{2}}
bk+1,k+1(AMD-G)\displaystyle b_{k+1,k+1}^{(\text{AMD-G})} =(θNk12θNk32)1θNk12.\displaystyle=-(\theta_{N-k-1}^{2}-\theta_{N-k-3}^{2})\frac{1}{\theta_{N-k-1}^{2}}.

Also, by the update rule of dual-AMD, we have

qk+1=qkσL(θNk12θNk22)ψ(rk),\displaystyle q_{k+1}=q_{k}-\frac{\sigma}{L}(\theta_{N-k-1}^{2}-\theta_{N-k-2}^{2})\nabla\psi^{*}(r_{k}),

so we can specify ak+1,i(AMD-G)a_{k+1,i}^{(\text{AMD-G})} as follows:

ak+1,s(AMD-G)=0 for s=0,1,,k1,ak+1,k(AMD-G)=σL(θNk12θNk22).\displaystyle a_{k+1,s}^{(\text{AMD-G})}=0\text{ for }s=0,1,\dots,k-1,\qquad a_{k+1,k}^{(\text{AMD-G})}=\frac{\sigma}{L}(\theta_{N-k-1}^{2}-\theta_{N-k-2}^{2}).

Comparing these coefficients with those of AMD computed above verifies the mirror-dual relation, and we conclude that the mirror dual of AMD is dual-AMD. ∎