email: [email protected]
Chanwoo Park, MIT. email: [email protected]
Asuman Ozdaglar, MIT. email: [email protected]
Jelena Diakonikolas, University of Wisconsin–Madison. email: [email protected]
Ernest K. Ryu, Seoul National University. email: [email protected]
Mirror Duality in Convex Optimization
Abstract
While first-order optimization methods are usually designed to efficiently reduce the function value, there has been recent interest in methods that efficiently reduce the magnitude of the gradient, and the findings show that the two types of methods exhibit a certain symmetry. In this work, we present mirror duality, a one-to-one correspondence between mirror-descent-type methods reducing function value and those reducing gradient magnitude. Using mirror duality, we obtain the dual accelerated mirror descent (dual-AMD) method, which efficiently reduces the gradient magnitude as measured by a distance-generating function of our choice. We then apply dual-AMD to efficiently reduce the gradient magnitude measured in -norms and to efficiently compute approximate solutions of the optimal transport problem.
1 Introduction
Consider the problem
where is a differentiable convex function and is a reflexive Banach space. One of the most fundamental facts in convex optimization is that every stationary point is also a global minimum, so the minimization problem is equivalent to
However, the problem of efficiently approximating function minima, i.e., making small, is quite different from the problem of efficiently computing approximate stationary points, i.e., making small. Methods that exhibit optimal convergence rates under one of the criteria do not, in general, exhibit optimal convergence rates under the other. In particular, Nesterov’s accelerated gradient method is an optimal method for the former but not the latter criterion. Recently, methods like OGM-G kim2021optimizing have been proposed as accelerated methods for reducing the gradient magnitude when the underlying space is Euclidean. However, the problem of finding approximate stationary points is still much less understood than the problem of finding approximate function minima in more general normed spaces.
On the other hand, the framework of mirror descent has been a cornerstone for designing efficient methods for optimization problems with structures that are well-behaved in non-Euclidean geometries. However, compared to the large body of research on mirror-descent-type methods for efficiently reducing function value, which includes accelerated mirror descent, mirror-descent-type methods for reducing gradient magnitude have been much less explored.
In this work, we start from the perspective that reducing gradient magnitude is reducing a function value in an appropriate mirror-descent setup. This observation, due to doi:10.1137/19M130858X and restated in Section 2.1, is, more specifically, that exchanging the roles of the objective function and the conjugate of the distance-generating function (DGF) in (unaccelerated) mirror descent yields a method that reduces the gradient magnitude as measured by the DGF.
We then build upon this insight and present mirror duality, a one-to-one correspondence between mirror-descent-type methods reducing function value and those reducing gradient magnitude. Using mirror duality, we obtain dual accelerated mirror descent (dual-AMD) as the mirror dual of accelerated mirror descent (AMD). With mirror duality, the prior analysis showing that AMD efficiently reduces function value implies that dual-AMD efficiently reduces the gradient magnitude as measured by a distance-generating function of our choice. Finally, we show how to apply dual-AMD to efficiently reduce the gradient magnitude measured in -norms and to efficiently compute approximate solutions of the optimal transport problem.
1.1 Preliminaries
We quickly review relevant basic concepts and set up the notation.
Banach spaces.
We follow the standard definitions of convex analysis in Banach spaces barbu2012 . Let be a Banach space, be its (continuous) dual space, and be the canonical pairing between and . In finite-dimensional setups, , and are dual norms, and is the standard Euclidean inner product. We assume is reflexive: and the dual norm of is , which always holds when is finite-dimensional. We denote the Euclidean norm by .
Basic convex analysis.
A set is convex if for all . The interior of is , where denotes the open ball of radius centered at . Let . A function is convex if for any and . If the inequality holds strictly for all such that and and , then is strictly convex. A function is proper if for all and for some . We say that is closed if is closed (in the standard product topology). (See (barbu2012, Proposition 2.5) for equivalent definitions of closedness.) We say a function is CCP if it is convex, closed, and proper. Write to denote the (effective) domain of .
Differentiability and subdifferentiability.
Given , we define its convex conjugate as . Define the biconjugate (recall that, by reflexivity, the bidual space is identified with the original space) as the convex conjugate of the convex conjugate. If is CCP, then (barbu2012, Theorem 2.22). We say that is (Fréchet) differentiable at if on an open neighborhood of and there exists a such that
If the above holds, we define to be the gradient of at . Write for the set of points on which is Fréchet differentiable. Finally, we say is differentiable (everywhere) if is differentiable for all . Let be a CCP function. We say is a subgradient of at if
We write for the set of subgradients of at .
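For concreteness, the standard forms of these objects read as follows. This is a restatement in the usual notation (with a CCP function f on X, a primal point x, and a dual point x*), not a verbatim reproduction of the displays above.

```latex
% Convex conjugate and biconjugate:
\[
  f^*(x^*) = \sup_{x \in X}\big\{\langle x^*, x\rangle - f(x)\big\},
  \qquad
  f^{**}(x) = \sup_{x^* \in X^*}\big\{\langle x^*, x\rangle - f^*(x^*)\big\}.
\]
% Frechet differentiability of f at x with gradient \nabla f(x) \in X^*:
\[
  \lim_{\|u\| \to 0}
  \frac{\big|f(x+u) - f(x) - \langle \nabla f(x), u\rangle\big|}{\|u\|} = 0.
\]
% Subgradient inequality:
\[
  g \in \partial f(x)
  \quad\Longleftrightarrow\quad
  f(y) \ge f(x) + \langle g,\, y - x\rangle \ \text{ for all } y \in X.
\]
```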
Smoothness.
For , we say is -smooth (with respect to ) if everywhere (so ), is differentiable everywhere, and for all . The following lemma establishes equivalent conditions defining -smoothness.
Lemma 1
(zalinescu2002convex, , Section 3.5) [Banach–Baillon–Haddad theorem] Let be a Banach space, be CCP and differentiable, and . The following are equivalent:
-
(i)
.
-
(ii)
.
-
(iii)
.
This result can be viewed as a generalization of the Baillon–Haddad theorem (Baillon1977QuelquesPD, , Corollary 10) to Banach spaces, so we call it the Banach–Baillon–Haddad theorem. We call the inequality of (iii) the cocoercivity inequality.
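For reference, the standard characterization of smoothness behind Lemma 1 reads as follows; the ordering and exact phrasing of the items below are our own and need not match the lemma's items verbatim, but (iii) is the cocoercivity inequality referred to in the text.

```latex
\[
\text{(i)}\quad \|\nabla f(x) - \nabla f(y)\|_* \;\le\; L\,\|x - y\|
  \quad\text{for all } x, y \in X;
\]
\[
\text{(ii)}\quad f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x\rangle
  + \tfrac{L}{2}\,\|y - x\|^2
  \quad\text{for all } x, y \in X;
\]
\[
\text{(iii)}\quad f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x\rangle
  + \tfrac{1}{2L}\,\|\nabla f(y) - \nabla f(x)\|_*^2
  \quad\text{for all } x, y \in X.
\]
```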
Strong convexity.
For , we say is -strongly convex (with respect to ) if
where the inequality is understood to hold for all elements of . The following lemma establishes the equivalence between the strong convexity and the smoothness of the convex conjugate.
Lemma 2
(zalinescu2002convex, , Section 3.5) Let be a reflexive Banach space, a CCP function, and . Then, is -strongly convex with respect to if and only if is -smooth with respect to .
Bregman divergence.
Let be a Banach space and let . The Bregman divergence with respect to is defined as
The standard inequalities for differentiable convex can be written using the Bregman divergence as follows: If is convex,
(cvx-ineq)
If is convex and -smooth, by Lemma 1,
(coco-ineq)
If , the Fenchel inequality fenchel1949conjugate is
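In the standard notation (with the distance-generating function differentiable where needed), these objects take the following form; this is a restatement of the usual definitions, and the tags match the ones used in the text.

```latex
% Bregman divergence generated by h:
\[
  D_h(x, y) \;=\; h(x) - h(y) - \langle \nabla h(y),\, x - y\rangle .
\]
% (cvx-ineq): convexity of f written with the Bregman divergence:
\[
  D_f(x, y) \;\ge\; 0
  \quad\Longleftrightarrow\quad
  f(x) \;\ge\; f(y) + \langle \nabla f(y),\, x - y\rangle .
\]
% (coco-ineq): cocoercivity of an L-smooth convex f in Bregman form:
\[
  D_f(x, y) \;\ge\; \tfrac{1}{2L}\,\|\nabla f(x) - \nabla f(y)\|_*^2 .
\]
% Fenchel inequality:
\[
  f(x) + f^*(x^*) \;\ge\; \langle x^*, x\rangle
  \quad\text{for all } x \in X,\ x^* \in X^* .
\]
```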
Legendre convexity.
We follow the definitions from Bauschke2001ESSENTIALSE . Let be a CCP function. is essentially smooth if is locally bounded and single-valued on its domain. Additionally, if is locally bounded on its domain and is strictly convex on every convex subset of , we say is Legendre convex. To clarify, a set-valued function is locally bounded on if, for any , there is a neighborhood of in on which the output values of the set-valued function are bounded.
1.2 Prior works
First-order methods reducing gradient norm.
Consider minimizing a differentiable -smooth function with iterations of a first-order method. Assume exists. Write . Classically, plain gradient descent achieves a rate on for -smooth functions, both convex and non-convex. Nemirovsky’s classical works nemirovsky1991optimality ; nemirovsky1992information provide lower bounds for reducing via first-order deterministic methods: there exists a convex and -smooth function such that no -step first-order deterministic method can reduce faster than in the worst case. The same argument holds for .
The formal study of first-order methods for efficiently reducing the gradient magnitude was initiated by Nesterov nesterov2012make , where he proposed the regularization technique to obtain a rate, where ignores logarithmic factors. This regularization technique was also used in the stochastic setup zhu2018how . The logarithmic gap in Nesterov’s rate was closed by Kim and Fessler, who proposed OGM-G and established a rate kim2021optimizing ; it was pointed out in (nesterov2020primal, Remark 2.1) that OGM-G’s rate can be combined with the classical Nesterov acceleration to imply the optimal rate. Kim and Fessler’s discovery, which was made with the computer-assisted proof methodology called the performance estimation problem drori2014performance ; taylor2017smooth ; taylor2017exact ; kim2016optimized , kickstarted a series of follow-up works: OBL-G, a variant of OGM-G that allows a backtracking linesearch park2021optimal ; M-OGM-G, which attains the same rate with simpler rational coefficients ZhouTianSoCheng2021_practical ; and better human-understandable analyses using potential functions lee2021geometric ; diakonikolas2021potential . On the other hand, lan2023optimal utilizes the regularization technique of Nesterov within a clever aggregate analysis to remove the logarithmic terms and achieves the optimal rate without using the concatenation technique that we describe in Section 3.4. Furthermore, lan2023optimal accomplishes this adaptively in the sense that the method does not require a priori knowledge of .
There are also follow-up works in other setups: a continuous-time analysis of OGM-G Suh2022continuous and the proximal variants FISTA-G lee2021geometric and SFG kim2023timereversed . However, despite this active line of work, the acceleration mechanisms behind reducing gradient magnitude are understood much less well than the classical techniques for reducing the function value.
H-duality.
Interestingly, OGM-G exhibits a certain symmetry with OGM drori2014performance ; kim2016optimized , an accelerated first-order method that improves upon Nesterov’s acceleration by a factor of 2 for reducing function values of smooth convex functions. The recent work kim2023timereversed showed that this symmetry is fundamentally due to H-duality, a one-to-one correspondence between first-order methods that reduce function value and those that reduce gradient norm.
We quickly review H-duality. Let be a Euclidean space equipped with the standard Euclidean norm. Let be convex and -smooth. Consider the problem
Consider an -step fixed-step first-order method (FSFOM) with step sizes , defined by:
where is a starting point. We say this FSFOM is defined by the lower triangular matrix containing , i.e., if , and otherwise. For define its anti-diagonal transpose with for . We refer to the FSFOM defined by as the H-dual of the FSFOM defined by .
To clarify, the H-dual is a dual of a first-order method, an algorithm. This contrasts with the usual notions of duality between primal and dual spaces, functions, or optimization problems. The H-dual operation is defined mechanically, but the H-duality theorem, which we soon state as Fact 1, establishes a non-trivial relationship between an H-dual pair of FSFOMs.
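To make the anti-diagonal transpose concrete, the following Python sketch builds the H-dual matrix from a lower-triangular step-size matrix by reading entries along the anti-diagonal, i.e., the (i, j) entry of the output is the (N+1-j, N+1-i) entry of the input. The normalization of gradient descent's step-size matrix at the end is an assumption of the demo, not a statement about the paper's exact convention.

```python
import numpy as np

def h_dual(H: np.ndarray) -> np.ndarray:
    """Anti-diagonal transpose of an N x N matrix:
    output entry (i, j) equals input entry (N+1-j, N+1-i) in 1-indexed notation."""
    N = H.shape[0]
    HA = np.zeros_like(H)
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            HA[i - 1, j - 1] = H[N - j, N - i]  # 0-indexed lookup of entry (N+1-j, N+1-i)
    return HA

# Constant-stepsize gradient descent corresponds (under one common normalization,
# an assumption of this demo) to a diagonal step-size matrix; it is self-H-dual.
N = 5
H_gd = np.eye(N)
assert np.allclose(h_dual(H_gd), H_gd)
```

Note that the operation preserves lower-triangularity: strictly upper-triangular entries of the input map to strictly upper-triangular entries of the output.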
Fact 1
[H-duality theorem, Informal (kim2023timereversed, , Theorem 1)] Let and be FSFOMs that are H-duals of each other and . Then the following are equivalent.
-
•
for can be proved with primal energy function structure.
-
•
for can be proved with dual energy function structure.
We will clarify the meaning of ‘primal and dual energy function structures’ in Section 3.2 along with our main results. The H-duality theorem provides a sufficient condition that ensures an FSFOM defined by with a convergence guarantee on can be “H-dualized” to obtain an FSFOM defined by with a convergence guarantee on , and vice versa.
Instances of H-duality.
OGM kim2016optimized is an FSFOM with guarantee
and OGM-G kim2021optimizing is an FSFOM with guarantee
where is defined by , for and . It turns out that OGM and OGM-G are H-duals of each other, and their guarantees imply each other through the H-duality theorem (with ).
As another example, the standard gradient descent, for , has guarantees
and
Gradient descent (with constant stepsize) is “self-H-dual”, i.e., the H-dual of gradient descent is gradient descent itself, and these two guarantees imply each other through the H-duality theorem (with ).
Recently, DasGupta2024 numerically observed that gradient-descent-type methods ( for ) with appropriately chosen step sizes can exhibit a faster convergence rate than . In a series of works grimmer2023accelerated ; grimmer2024accelerated ; grimmer2024provably ; altschuler2023acceleration1 ; altschuler2023acceleration2 , the rate was theoretically established. The latest work in this line, grimmer2024accelerated , which presents the guarantee with the best constant, also exhibits an H-duality phenomenon: If for some , the method with has a guarantee
where , and its H-dual method, , has a guarantee
Although this result is not a direct instance of the existing H-duality theorem, the symmetry points to the existence of a broader H-duality phenomenon beyond what is elucidated in (kim2023timereversed, , Theorem 1).
Mirror-descent-type methods.
The framework of mirror descent has been a cornerstone for designing efficient methods for optimization problems with structures that are well-behaved in non-Euclidean geometries nemirovsky1983problem ; beck2003mirror ; ben2001ordered ; Juditsky20105FO ; Censor1992proximal ; Teboulle1992proximal ; Jonathan1993nonlinear . A mirror-descent-type method uses a distance-generating function , which gives rise to the Bregman divergence , for minimizing a convex function . Consider iterations of such a method.
The standard mirror descent method attains a rate on the best iterate for nonsmooth convex with bounded subgradients and a standard set of assumptions. For differentiable convex , doi:10.1287/moor.2016.0817 ; doi:10.1137/16M1099546 ; VanNguyen2017 presented the relative smoothness condition, which relaxes the strong convexity of the distance-generating function, and achieved a rate on the last iterate under relative smoothness. Under a set of stronger assumptions, accelerating the rate on the function value is possible: the Improved Gradient Algorithm achieves an accelerated rate for constrained smooth optimization problems within Euclidean spaces auslender2006interior ; with a uniformly convex distance-generating function, a faster rate can be achieved for convex functions that are smooth with respect to the -norm alexandre2018optimal ; tseng2008 proposed a method using an auxiliary regularizer function; and krichene2015amd proposed a method from an ODE perspective on linear coupling. The prior accelerated mirror-descent-type method most relevant to our work is the accelerated Bregman proximal gradient method, which attains a rate on function values when the objective is smooth and the distance-generating function is strongly convex with respect to an arbitrary norm d2021acceleration .
Recently, connections between various sampling algorithms and entropic mirror descent, which utilizes the negative entropy as its distance-generating function, have been established: chopin2023connection showed an equivalence with tempering, and karimi2023sinkhorn demonstrated a connection with generalized Sinkhorn iterations.
Mirror-descent-type methods reducing gradient magnitude.
There has been some recent attention toward mirror-descent-type methods that efficiently reduce gradient magnitude. Dual preconditioned gradient descent doi:10.1137/19M130858X efficiently reduces the gradient magnitude, as measured by a dual-space distance-generating function, under a relative smoothness condition in the dual space, a condition later characterized in doi:10.1137/21M1465913 . This method was extended to the non-convex and composite setups in laude2023anisotropic . On the other hand, diakonikolas2023complementary used a regularization technique to develop methods that reduce the magnitude of the gradient, measured by the dual norm. In Section 2.1, we review dual preconditioned gradient descent in further detail, as it serves as the foundation for the development of dual-AMD.
1.3 Contributions and organization
Contributions.
This work presents two main contributions, one conceptual and one algorithmic. The first main contribution is mirror duality, presented as Theorem 3.1. The H-duality presented in kim2023timereversed made progress by establishing a duality principle between the task of reducing function value and the task of reducing gradient magnitude. Mirror duality extends H-duality to the mirror-descent setup and provides a richer understanding.
Our second main contribution is the method dual-AMD and its convergence analysis, stated as Corollary 1. To the best of our knowledge, dual-AMD is the first mirror-descent-type method for reducing gradient magnitude at an accelerated rate. The generality of dual-AMD accommodates optimization problems with structures that are well-behaved in non-Euclidean geometries and allows us to measure the gradient magnitude with a smooth convex function of our choice, not necessarily a norm. The applications of Section 4 illustrate the practical utility of dual-AMD.
Organization.
Section 2.1 reviews MD and dual-MD and points out a symmetry between these two mirror-descent-type methods. Section 2.2 reviews AMD, an accelerated mirror-descent-type method that reduces the function value. The specific structure of the convergence analysis of AMD is essential for obtaining our main result.
Section 3.1 quickly establishes some definitions. Section 3.2 formally presents the mirror duality theorem, our first main contribution. Section 3.3 presents the dual-AMD method, our second main contribution, obtained by applying the mirror duality theorem to AMD. Section 4 presents applications illustrating the practical utility of dual-AMD. Section 5 compares the prior results on H-duality with our generalized framework of mirror duality. Finally, Section 6 concludes the paper and provides some discussion on future directions.
2 Review of dual mirror descent and accelerated mirror descent
2.1 MD and dual-MD
We provide a brief overview of mirror descent (MD) and dual mirror descent (dual-MD). The relationship between these two methods presents the following insight: Let be the objective function and be the distance-generating function. Then, MD is a method finding a minimizing , and by swapping the roles of and , we get dual-MD, which finds an minimizing .
Mirror descent.
Let be a reflexive Banach space, be a nonempty closed convex set, and be a Legendre convex function. To state MD, consider the problem
Assume that exists. Given an approximate solution , we measure its suboptimality via
To obtain with small , consider a Legendre convex function such that . For and any starting point , the Mirror Descent (MD) method is
(MD)
for , and . We view as the output of MD. Consider the following additional assumption regarding and .
-
(i)
For some , is convex on .
-
(ii)
.
Under the above assumption, we have the following convergence guarantee for .
Fact 2
(Dragomir2021, , Theorem 1) Let be a reflexive Banach space and Legendre convex functions. Then the iterates of MD are well defined. Assume (i) and (ii). If we further assume that , then for ,
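To make a mirror-descent step concrete, here is a minimal Python sketch of the textbook unconstrained update, in which the next iterate is obtained by taking a gradient step in the mirror (dual) space and mapping back through the conjugate of the distance-generating function. The constant step size, the negative-entropy distance-generating function, and the toy quadratic objective are assumptions chosen for the demonstration; they are not the exact parameters of (MD) above.

```python
import numpy as np

def md_step(x, grad_f, alpha, grad_h, grad_h_conj):
    """One textbook unconstrained mirror-descent step:
    map x to the dual space with grad_h, take a gradient step there,
    and map back with grad_h_conj (the gradient of the conjugate DGF)."""
    return grad_h_conj(grad_h(x) - alpha * grad_f(x))

# Negative-entropy DGF on the positive orthant: h(x) = sum_i x_i*log(x_i) - x_i,
# so grad_h(x) = log(x) and grad_h_conj(z) = exp(z).  The resulting step is the
# multiplicative update x_+ = x * exp(-alpha * grad_f(x)).
grad_h, grad_h_conj = np.log, np.exp

b = np.array([1.0, 2.0, 3.0])          # toy objective f(x) = 0.5*||x - b||^2
grad_f = lambda x: x - b

x = np.ones(3)
for _ in range(200):
    x = md_step(x, grad_f, alpha=0.1, grad_h=grad_h, grad_h_conj=grad_h_conj)
print(x)  # approaches b while every iterate stays strictly positive
```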
Dual mirror descent.
Consider the problem
This problem is equivalent to the previous one, , if and is differentiable (). Given an approximate solution for this problem, we measure its suboptimality via
where is a Legendre convex function such that and is the unique minimizer of . So, quantifies the magnitude of . (Every stationary point is a global function minimum in convex optimization, so is equivalent to . However, [ being small] is not equivalent to [ being small] in the sense that for any and , there is an such that but . Even though the solution set is unchanged, changing the suboptimality measure leads to different efficient methods for finding approximate solutions.) Dual preconditioned gradient descent doi:10.1137/19M130858X , formulated as follows, efficiently reduces . For , starting point and ,
(dual-MD)
for . We will refer to this method as dual-MD. The reason for this alternative naming will be made clear in Section 3.
The output of dual-MD is . Consider the following additional assumption regarding and .
-
(i)’
For some , is convex on .
-
(ii)’
.
Under this assumption, we have the following convergence guarantee for .
Fact 3
(doi:10.1137/19M130858X, , Theorem 3.9) Let be a reflexive Banach space and be Legendre convex functions. Assume (i)’ and (ii)’. Then, the iterates of dual-MD are well defined. If we further assume that and is the unique minimizer of , then for ,
(If , then the right-hand-side is and the bound vacuously holds.)
Interestingly, dual-MD can be obtained by swapping with in MD, and this symmetry between them leads to a crucial observation. In dual-MD, serves as the objective function, so the reduction of in dual-MD as guaranteed by Fact 3 is really an immediate consequence of MD reducing as guaranteed by Fact 2. At this point, MD and dual-MD exhibit a certain symmetry, and one method reduces while the other “dual” method reduces . In Section 2.2, we state AMD, which reduces at an rate under stronger assumptions than needed for MD. This leads to a natural question: Can we “dualize” AMD to obtain a mirror-descent-type method with an rate on ? For reasons that we explain later in Section 3, merely interchanging the roles of and does not work for obtaining the dual counterpart of AMD. The answer is obtained through mirror duality that we present in Section 3.
The versions of Facts 2 and 3 presented in Dragomir2021 and doi:10.1137/19M130858X rely on slightly weaker assumptions than we state. We make this simplification to focus on our goal of introducing the concept of swapping and in mirror-descent-type methods.
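For intuition, dual preconditioned gradient descent is commonly written as a plain gradient step whose gradient is first passed through the gradient map of the dual-space distance-generating function, which sends the dual space back to the primal space. The sketch below implements that commonly cited form; it reflects our reading of doi:10.1137/19M130858X rather than a verbatim restatement of (dual-MD), and the particular distance-generating function and objective are assumptions for the demonstration.

```python
import numpy as np

def dual_md_step(x, grad_f, alpha, grad_h):
    """One step of dual preconditioned gradient descent in its commonly cited
    form: x_+ = x - alpha * grad_h(grad_f(x)), where h is a dual-space
    distance-generating function minimized at 0 with h(0) = 0."""
    return x - alpha * grad_h(grad_f(x))

# Demo choices (assumptions, not the paper's exact setup):
#   h(g) = sum_i (sqrt(1 + g_i^2) - 1),  so grad_h(g) = g / sqrt(1 + g^2);
#   f(x) = 0.5*||x - b||^2,              so grad_f(x) = x - b.
grad_h = lambda g: g / np.sqrt(1.0 + g**2)
b = np.array([1.0, -2.0, 3.0])
grad_f = lambda x: x - b

x = np.zeros(3)
for _ in range(500):
    x = dual_md_step(x, grad_f, alpha=0.5, grad_h=grad_h)
print(np.linalg.norm(grad_f(x)))  # the gradient magnitude is driven toward 0
```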
Connection with dual averaging.
Our formulation of MD is closely related to dual averaging nesterov2009primal . (In dual averaging, dual refers to the duality between the primal and dual spaces, while in our dual-MD, dual refers to mirror duality.) Dual averaging and mirror descent are equivalent in the so-called unconstrained setup, which is what we consider. In the constrained setup, the optimization problem is constrained to a further subset of , and in this constrained setup, mirror duality and dual averaging differ. The dual algorithms of our work and the notion of mirror duality do not seem to immediately generalize to the constrained setup, and studying this extension would be an interesting direction of future work.
2.2 Accelerated Mirror Descent
Again, consider the problem
where is a reflexive Banach space, is a nonempty closed convex set and is a CCP function. Assume exists. Given an approximate solution , we measure its suboptimality via . We use a CCP function such that for the distance-generating function. Assume that
Let be a positive integer. Let be a non-decreasing sequence satisfying
When for , the sequence becomes exactly the same as the -sequence of Nesterov’s fast gradient method 10029946121 , except for the final . The variant of accelerated mirror descent (AMD) we consider is
(AMD)
for , where is any starting point and . We view as the output of AMD.
Fact 4
Let be a reflexive Banach space and let and be CCP. (Our analysis does not require the distance-generating function to be differentiable. Prior work such as xiao2010dual has also carried out mirror-descent-type analyses under non-differentiability of the distance-generating function.) Assume (A1). Then, the iterates of AMD are well defined and for . Furthermore, if ,
where holds if for .
Proof
First, we show is a convex combination of . Then the fact yields . For ,
Now we provide the convergence analysis of AMD.
Convergence analysis.
First, define as . Also define an energy function recursively as and
for . Due to (cvx-ineq) and (coco-ineq), is non-increasing . If we can prove , the monotonicity of provides the convergence rate as follows:
Now, the only part left is proving the inequality . Define , which is a function of the algorithm parameters and the iterates of AMD, via
Here we used . Observe that
Therefore, it is enough to show . We simplify as follows:
Grouping the terms in the last two lines, we obtain
Thus
where we used for and for any . The first inequality is the AM–GM inequality, and the second follows from the definition of the dual norm. ∎
Comparison with existing methods.
Our stated AMD and its analysis represent a minor variation of prior work and are not part of our main contributions. More specifically, when and for , AMD is equivalent to (d2021acceleration, Algorithm 22), except for the last iterate. In the Euclidean setup, AMD is also equivalent to IGA (auslender2006interior, Section 5), again excluding the final iterate.
In terms of the analysis, the prior work d2021acceleration achieves the same convergence rate as in Fact 4 under the slightly weaker assumption that is -smooth on , i.e., that for all . In contrast, we require that is -smooth on all of , i.e., that for all . In our proof, we use (coco-ineq), which, to the best of our knowledge, requires that is -smooth on all of , even though our AMD never evaluates outside of . In this sense, our analysis of AMD is slightly less general than the prior analysis of (d2021acceleration, , Algorithm 22).
Lower bound.
The classical lower bound in the Euclidean setup by nemirovsky1992information ; nesterov2018 applies and establishes optimality of the rate of Fact 4 under all choices of that satisfy (A1).
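To give a concrete feel for accelerated mirror-descent-type methods, the following Python sketch implements a standard accelerated Bregman method in similar-triangles form for a smooth convex objective and a 1-strongly convex distance-generating function: it couples a primal average with a mirror point updated through the gradient maps of the distance-generating function and its conjugate. The weight choice below is a standard textbook one satisfying the usual condition L*a_k**2 <= A_k; it is an assumption and is not claimed to coincide with the sequence used by AMD above, and the Euclidean instance at the end is only a sanity check.

```python
import numpy as np

def accelerated_mirror_descent(grad_f, L, grad_h, grad_h_conj, x0, N):
    """A standard accelerated Bregman/mirror method (similar-triangles form),
    assuming f is L-smooth and h is 1-strongly convex w.r.t. the same norm.
    This is a generic template, not the exact coefficient choice of AMD."""
    y, z = x0.copy(), x0.copy()
    A = 0.0
    for k in range(1, N + 1):
        a = (k + 1) / (2.0 * L)                      # weights with L*a^2 <= A + a
        A_next = A + a
        x = (A * y + a * z) / A_next                 # coupling step
        z = grad_h_conj(grad_h(z) - a * grad_f(x))   # mirror (dual-averaging) step
        y = (A * y + a * z) / A_next                 # primal averaging step
        A = A_next
    return y   # f(y) - f_star = O(L * D_h(x_star, x0) / N^2)

# Euclidean sanity check: h = 0.5*||.||^2, f(x) = 0.5*||x - b||^2 (assumptions).
b = np.array([1.0, 2.0, 3.0])
identity = lambda u: u
y = accelerated_mirror_descent(lambda x: x - b, L=1.0, grad_h=identity,
                               grad_h_conj=identity, x0=np.zeros(3), N=100)
print(np.linalg.norm(y - b))
```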
3 Mirror Duality and dual-AMD
In Section 2.1, we simply interchanged the roles of and in MD, which has a rate on , to obtain dual-MD, which has a rate on . A natural question that follows is: Can we “dualize” AMD, which has an accelerated rate on , to obtain a dual method with an accelerated rate on ? Merely interchanging the roles of and , however, does not work, as it produces a method that finds an reducing , but it does not find a that reduces .
In this section, we obtain a dual method reducing at an accelerated rate. The key insight is to interchange the roles of and and also perform an operation analogous to the anti-diagonal transpose as in Section 1.2. We call the resulting operation the mirror dual, and we present a duality correspondence as Theorem 3.1.
3.1 Coupled first-order method (CFOM) and its mirror dual
Let be a reflexive Banach space and and be differentiable. An -step fixed-step coupled first-order method (CFOM) with step sizes and is:
(1)
for , where is a starting point and . Note that the upper limits of the two summations are and , which means is computed before . Although is not used in (1), define for the sake of notational simplicity in later arguments. The class of CFOMs is very general, and it, in particular, includes MD and AMD.
Let be differentiable. Throughout this section, we assume and is the unique minimizer of . For a given CFOM (1), we define its mirror dual as
(2)
for , where is a starting point and . The mirror dual has the coefficients and used in an anti-diagonal transposed manner as in the H-dual and, furthermore, has the roles of and swapped when . In Section 5, we more precisely discuss how the mirror dual and mirror duality generalize the prior notions of the H-dual and H-duality.
Proposition 1
Under the condition of Proposition 1, a convergence guarantee on can be translated to a bound on . As we showed in the proof of Fact 4, the CFOM AMD satisfies the property for . Therefore, Proposition 1 applies to the mirror dual of AMD that we present in Section 3.3.
Proof
Observe first that
for . Therefore, if for , . Now we prove the mirror dual satisfies . Note that
(3)
Thus, we get that the following holds for the mirror dual:
∎
3.2 Mirror Duality
For an -step CFOM (1), which we denote as , consider establishing a convergence guarantee on for using the following proof strategy. Assume (A1) for , i.e., assume is -smooth and is -strongly convex. For a positive nondecreasing sequence , define as
and
for . The sequence is nonincreasing by construction. Define via
If and , then we conclude
By a telescoping sum argument, the - and -function values cancel out and is a function of only and . (We state the exact form of later in (6).) If we show , where the infimum is taken over all possible values of and , then we conclude and . Recall that this is the exact strategy we employed in our proof of Fact 4. With a slight abuse of notation, define so that
(4)
On the other hand, consider establishing a convergence rate of for an -step dual CFOM (2), which we denote as . Assume (A1) for , i.e., is -smooth and is -strongly convex. For a positive increasing sequence , define as
and
for . The sequence is nonincreasing by construction. Define via
Here, we used . If , then we conclude
Again, by a telescoping sum argument, the - and -function values cancel out and is a function of only and . (We state the exact form of later in (7).) If we show , where the infimum is taken over all possible values of and , then we conclude and
With a slight abuse of notation, define so that
(5)
The following result establishes a one-to-one correspondence between these two proof strategies.
Theorem 3.1 (Mirror Duality)
Proposition 1 and Theorem 3.1 together provide a sufficient condition that ensures a CFOM with a convergence guarantee on can be “mirror-dualized” to obtain a mirror dual CFOM with a convergence guarantee on , and vice versa.
Proof
First, we calculate . Denote for notational convenience.
(6)
By the definition of CFOM (1), and . Therefore, is a function of and . Next we calculate .
(7)
where we have used . By the definition of the mirror dual, and . Therefore, is a function of and . To show under for , we show the following statement:
where and are defined as
(8)
Note that the above transformation is a bijection since the ’s are positive; thus, if we can show , then . First, we write and with .
where and . For simplicity, define . Using (8) gives us
and
Combining the above equations, we conclude
(9)
In the same way, is written as follows.
By (8),
is used at .
(10)
(8) is used at , and the remaining steps are index changes. Now it becomes clear that (10) is the same as (9). ∎
3.3 dual-AMD
Again, consider the problem
where is a reflexive Banach space and is a differentiable CCP function. Given an approximate solution, we measure its suboptimality via
Recall that we assumed and is the unique minimizer of . Consider (A1), which we restate as
We propose dual accelerated mirror descent (dual-AMD)
(dual-AMD)
Lemma 3 states that dual-AMD is the mirror dual of AMD; its proof, which follows from direct calculations of the coefficients, is deferred to Appendix A. Now, our Mirror Duality theorem and Lemma 3 immediately imply a convergence guarantee on dual-AMD.
Corollary 1 (dual-AMD)
3.4 Concatenation of AMD and dual-AMD
Concatenating AMD and dual-AMD, we obtain a rate on . More specifically, assume , so that . Also assume is -smooth convex, is -strongly convex, and is -strongly convex. Starting from , execute iterations of AMD to obtain . Then, starting from , execute iterations of dual-AMD and denote the output as .
Corollary 2
Lower bound.
Under (A1), the classical lower bound by nemirovsky1991optimality ; nemirovsky1992information applies and establishes optimality of this concatenated rate. This also implies optimality of dual-AMD under all choices of that satisfy (A1), since if the rate of dual-AMD could be improved, then the concatenated rate of AMD and dual-AMD would be improved and would violate the aforementioned lower bound.
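To make the concatenation concrete, a minimal driver might look as follows; amd and dual_amd are hypothetical placeholder callables (for instance, implementations along the lines of the accelerated template sketched in Section 2.2 and its mirror dual), and the even split of the iteration budget mirrors the scheme described above.

```python
def concatenated_run(amd, dual_amd, x0, total_iters):
    """Run half the budget with AMD to reduce the function value, then warm-start
    dual-AMD at AMD's output to reduce the gradient magnitude.
    `amd` and `dual_amd` are placeholder callables with signature
    (starting_point, num_iterations) -> last_iterate."""
    N = total_iters // 2
    y_mid = amd(x0, N)           # phase 1: function-value reduction
    y_out = dual_amd(y_mid, N)   # phase 2: gradient-magnitude reduction
    return y_out
```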
4 Applications
In this section, we provide two applications of dual-AMD and AMD. Our discussion of these applications very closely follows diakonikolas2023complementary .
4.1 Small gradients with respect to -norm
For and , let be a differentiable convex function that is -smooth with respect to . Our goal is to find an such that is small, and the concatenation of AMD and dual-AMD yields an optimal method for this task, improving upon the prior state-of-the-art rate of diakonikolas2023complementary .
For a starting point , we choose and , which are -strongly convex with respect to (juditsky2008large, , Proposition 3.5). It is straightforward to show that and and that and are differentiable on with
Corollary 3
Proof
Direct consequence of Corollary 2. ∎
Lower bound.
Our upper bound of Corollary 3 matches the lower bound of (diakonikolas2023complementary, , Corollary 2) up to a constant factor and is therefore optimal. Our upper bound improves upon the prior state-of-the-art result for this setup (diakonikolas2023complementary, , Theorem 3), which we now know is loose by a logarithmic factor. For , however, is not strongly convex with respect to with a dimension-independent strong convexity parameter, as discussed in (diakonikolas2023complementary, , Section 1), so the concatenation of AMD and dual-AMD cannot yield a dimension-independent upper bound. We leave the investigation of the case to future work.
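For reference, the strong-convexity fact from juditsky2008large invoked in the construction above is, in its standard form, the following; it is stated for the unshifted function, and shifted versions such as the squared distance to a fixed point inherit the same modulus.

```latex
\[
  \psi_p(x) \;=\; \tfrac{1}{2}\,\|x\|_p^2
  \quad\text{is } (p-1)\text{-strongly convex with respect to } \|\cdot\|_p
  \quad\text{for every } p \in (1, 2] .
\]
```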
4.2 Entropy-regularized optimal transport
Consider the discrete optimal transport problem
where is the matrix of transport costs with nonnegative entries, and represent probability mass functions with full support, i.e.,
and likewise for , , and denotes the vectors in or with all components . Write to denote a solution to this (non-regularized) problem.
The entropy-regularized discrete optimal transport problem with parameter marco2013sinkhorn ; fang1992unconstraineed ; lin2022efficiency is
where denotes the element-wise logarithm, we use the convention , and is the matrix with all entries . The corresponding regularized dual problem is
The function of the regularized dual problem is -smooth with respect to .
If we solve the regularized dual problem approximately in the sense of making small, then we can obtain an approximate solution to the (non-regularized) primal problem using the following lemma.
Lemma 4
(altschuler2018nearlinear, , Theorem 1, Lemma 7) For , define the matrix for . Define . Then,
Moreover, there exists an algorithm with running time that takes an input and outputs , which satisfies the constraints and and satisfies .
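For concreteness, here is a Python sketch of the rounding procedure from altschuler2018nearlinear as we understand it: rows and columns of a positive matrix are scaled down so as not to exceed the target marginals, and the leftover mass is distributed by a rank-one correction. This is our reading of the cited algorithm, included for illustration; the random input at the end is only a usage example.

```python
import numpy as np

def round_to_marginals(F, r, c):
    """Round a positive matrix F onto the transportation polytope with row
    marginals r and column marginals c, following (our reading of) the
    rounding algorithm of Altschuler, Weed, and Rigollet."""
    F = F.copy()
    # Scale down rows whose mass exceeds the target marginal r.
    row_scale = np.minimum(r / F.sum(axis=1), 1.0)
    F = row_scale[:, None] * F
    # Scale down columns whose mass exceeds the target marginal c.
    col_scale = np.minimum(c / F.sum(axis=0), 1.0)
    F = F * col_scale[None, :]
    # Distribute the remaining mass with a rank-one correction.
    err_r = r - F.sum(axis=1)
    err_c = c - F.sum(axis=0)
    if err_r.sum() > 0:
        F = F + np.outer(err_r, err_c) / err_r.sum()
    return F  # rows now sum to r and columns sum to c

# Tiny usage example with random data (assumption: F has positive entries).
rng = np.random.default_rng(0)
F = rng.random((4, 5))
r = np.full(4, 1 / 4); c = np.full(5, 1 / 5)
X = round_to_marginals(F, r, c)
print(np.abs(X.sum(axis=1) - r).max(), np.abs(X.sum(axis=0) - c).max())
```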
Specifically, set and find such that . By Lemma 4, this is sufficient, since
Take , which is -strongly convex with respect to . With a starting point , implement the concatenation of AMD and dual-AMD and denote the output as . By Corollary 2,
where .
Lemma 5
(lin2022efficiency, , Lemma 3.2) If and have full support, then there exists an optimal point of such that .
Using Lemma 5 and for , we get
To ensure , set
and solve for to conclude that iterations are required. Finally, since each step of dual-AMD and AMD requires arithmetic operations, the total arithmetic complexity needed to obtain an -approximate solution for the (non-regularized) discrete optimal transport problem is
This rate improves upon the rate of (diakonikolas2023complementary, , Section 5.6) by logarithmic factors and matches the state-of-the-art rate of antonin2022accelerated , among non-parallel methods.
5 Comparison with H-duality
In this section, we discuss the relationship between our proposed mirror duality and the H-duality of prior work kim2023timereversed . First, we show that the mirror dual operation reduces to the H-dual operation in the Euclidean case.
Proposition 2
Let be a Euclidean space equipped with the Euclidean norm (or a Hilbert space). Let and . For a CFOM (1), assume for , which is a sufficient condition for for . (If the CFOM (1) does not satisfy for , then it reduces to an -step FSFOM of the form where for some .) Then the CFOM (1) reduces to an -step FSFOM of the form
and the mirror dual CFOM (2) reduces to an -step FSFOM of the form
so the FSFOMs are H-duals of each other.
Proof
For any matrix , we can describe the relationship between and by introducing as if and otherwise, so that
(11)
Moreover, for matrices and ,
(12)
This derivation employs (11) and .
Define matrices as , and otherwise . Since ,
for . vanishes to , due to (3). The coefficient of is . To simplify the above equation, define as . Then thus . For the mirror dual, we obtain
In the same way, and further using gives where . Finally . ∎
Mirror duality and H-duality.
Consider when the distance-generating function is the Euclidean norm, . Proposition 2 indicates that the mirror dual operation generalizes the H-dual operation. Furthermore, since the Bregman divergence with respect to the Euclidean norm equals the squared Euclidean distance, our mirror duality theorem generalizes the H-duality theorem (kim2023timereversed, , Theorem 1). However, strictly speaking, the two theorems technically differ. The energy functions and used in our mirror duality theorem and the H-duality theorem of kim2023timereversed are similar, but not the same, primarily due to a technical constraint arising from the inability to expand the norm into an inner product in general Banach spaces.
6 Conclusion
In this work, we presented mirror duality, a one-to-one correspondence between mirror-descent-type methods reducing function value and those reducing gradient magnitude, thereby advancing our understanding of the task of reducing gradient magnitude. Furthermore, we used mirror duality to obtain dual-AMD, the first method that reduces gradient magnitude at an optimal accelerated rate in the mirror-descent setup.
The results of this work inspire some interesting directions for future work. One direction would be to consider the relaxed assumption of a weakly smooth objective function and a uniformly convex distance-generating function . For the primal setup of reducing function value, alexandre2018optimal provides an accelerated mirror-descent-type method. For a related composite minimization setup, diakonikolas2023complementary provides an accelerated first-order method. Extending mirror duality and dual-AMD to the weakly smooth and uniformly convex setup will also have some interesting applications. Another potential direction is to extend dual-MD and dual-AMD to the constrained setup in which the notion of mirror descent and dual averaging differ. Yet another direction of future work is to consider the stochastic setup. For the setup of using stochastic gradients of convex functions, zhu2018how provides a rate for reducing the Euclidean norm of gradients. Investigating whether similar rates are attainable in the mirror descent setup or whether there is an analog of mirror duality in the stochastic setup would be interesting.
Acknowledgement
We thank TaeHo Yoon for reviewing the manuscript and providing valuable feedback. We thank Ya-Ping Hsieh, Anna Korba, Emanuel Laude, and Francesco Orabona for helpful discussions that improved this work. JD acknowledges funding from the NSF grant 2007757.
References
- [1] Z. Allen-Zhu. How to make the gradients small stochastically: Even faster convex and nonconvex SGD. NeurIPS, 2018.
- [2] J. M. Altschuler and P. A. Parrilo. Acceleration by stepsize hedging I: Multi-step descent and the silver stepsize schedule. arXiv:2309.07879, 2023.
- [3] J. M. Altschuler and P. A. Parrilo. Acceleration by stepsize hedging II: Silver stepsize schedule for smooth convex optimization. arXiv:2309.16530, 2023.
- [4] J. M. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. NeurIPS, 2018.
- [5] A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697–725, 2006.
- [6] J. B. Baillon and G. Haddad. Quelques propriétés des opérateurs angle-bornés et n-cycliquement monotones. Israel Journal of Mathematics, 26:137–150, 1977.
- [7] V. Barbu and T. Precupanu. Convexity and Optimization in Banach Spaces. 4th edition, 2012.
- [8] H. H. Bauschke, J. Bolte, and M. Teboulle. A descent lemma beyond Lipschitz gradient continuity: First-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2017.
- [9] H. H. Bauschke, J. M. Borwein, and P. L. Combettes. Essential smoothness, essential strict convexity, and Legendre functions in Banach spaces. Communications in Contemporary Mathematics, 3:615–647, 2001.
- [10] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- [11] A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12(1):79–108, 2001.
- [12] Y. Censor and S. Zenios. On the proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73:451–464, 1992.
- [13] A. Chambolle and J. P. Contreras. Accelerated Bregman primal-dual methods applied to optimal transport and Wasserstein barycenter problems. SIAM Journal on Mathematics of Data Science, 4(4):1369–1395, 2022.
- [14] N. Chopin, F. R. Crucinio, and A. Korba. A connection between tempering and entropic mirror descent. arXiv:2310.11914, 2023.
- [15] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NeurIPS, 2013.
- [16] S. Das Gupta, B. P. G. Van Parys, and E. K. Ryu. Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods. Mathematical Programming, 204(1):567–639, 2024.
- [17] A. d’Aspremont, C. Guzmán, and M. Jaggi. Optimal affine-invariant smooth minimization algorithms. SIAM Journal on Optimization, 28(3):2384–2405, 2018.
- [18] A. d’Aspremont, D. Scieur, and A. B. Taylor. Acceleration methods. Foundations and Trends in Optimization, 5(1-2):1–245, 2021.
- [19] J. Diakonikolas and C. Guzmán. Complementary composite minimization, small gradients in general norms, and applications. arXiv:2101.11041, 2023.
- [20] J. Diakonikolas and P. Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022.
- [21] R.-A. Dragomir, A. B. Taylor, A. d’Aspremont, and J. Bolte. Optimal complexity and certification of bregman first-order methods. Mathematical Programming, 194(1):41–83, 2021.
- [22] Y. Drori and M. Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach. Mathematical Programming, 145(1):451–482, 2014.
- [23] J. Eckstein. Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Mathematics of Operations Research, 18(1):202–226, 1993.
- [24] S. Fang. An unconstrained convex programming view of linear programming. Zeitschrift für Operations Research, 36(2):149–162, 1992.
- [25] W. Fenchel. On conjugate convex functions. Canadian Journal of Mathematics, 1(1):73–77, 1949.
- [26] B. Grimmer. Provably faster gradient descent via long steps. arXiv:2307.06324, 2024.
- [27] B. Grimmer, K. Shu, and A. Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. arXiv:2403.14045, 2024.
- [28] B. Grimmer, K. Shu, and A. L. Wang. Accelerated gradient descent via long steps. arXiv:2309.09961, 2023.
- [29] A. B. Juditsky and A. S. Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv:0809.0813, 2008.
- [30] A. B. Juditsky and A. S. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 121–147. 2011.
- [31] M. R. Karimi, Y.-P. Hsieh, and A. Krause. Sinkhorn flow: A continuous-time framework for understanding and generalizing the Sinkhorn algorithm. arXiv:2311.16706, 2023.
- [32] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1):81–107, 2016.
- [33] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, 2021.
- [34] J. Kim, A. Ozdaglar, C. Park, and E. K. Ryu. Time-reversed dissipation induces duality between minimizing gradient norm and function value. NeurIPS, 2023.
- [35] W. Krichene, A. Bayen, and P. L. Bartlett. Accelerated mirror descent in continuous and discrete time. NeurIPS, 2015.
- [36] G. Lan, Y. Ouyang, and Z. Zhang. Optimal and parameter-free gradient minimization methods for smooth optimization. arXiv:2310.12139, 2023.
- [37] E. Laude and P. Patrinos. Anisotropic proximal gradient. arXiv:2210.15531, 2023.
- [38] E. Laude, A. Themelis, and P. Patrinos. Dualities for non-Euclidean smoothness and strong convexity under the light of generalized conjugacy. SIAM Journal on Optimization, 33(4):2721–2749, 2023.
- [39] J. Lee, C. Park, and E. K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. NeurIPS, 2021.
- [40] T. Lin, N. Ho, and M. I. Jordan. On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research, 23(137):1–42, 2022.
- [41] H. Lu, R. M. Freund, and Y. Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
- [42] C. J. Maddison, D. Paulin, Y. W. Teh, and A. Doucet. Dual space preconditioning for gradient descent. SIAM Journal on Optimization, 31(1):991–1016, 2021.
- [43] A. S. Nemirovsky. On optimality of Krylov’s information when solving linear operator equations. Journal of Complexity, 7(2):121–130, 1991.
- [44] A. S. Nemirovsky. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
- [45] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.
- [46] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983.
- [47] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
- [48] Y. Nesterov. How to make the gradients small. Optima. Mathematical Optimization Society Newsletter, (88):10–11, 2012.
- [49] Y. Nesterov. Lectures on Convex Optimization. 2nd edition, 2018.
- [50] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):1–38, 2020.
- [51] C. Park and E. K. Ryu. Optimal first-order algorithms as a function of inequalities. Journal of Machine Learning Research, 25(51):1–66, 2024.
- [52] J. J. Suh, G. Roh, and E. K. Ryu. Continuous-time analysis of accelerated gradient methods via conservation laws in dilated coordinate systems. ICML, 2022.
- [53] A. B. Taylor, J. M. Hendrickx, and F. Glineur. Exact worst-case performance of first-order methods for composite convex optimization. SIAM Journal on Optimization, 27(3):1283–1313, 2017.
- [54] A. B. Taylor, J. M. Hendrickx, and F. Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming, 161(1-2):307–345, 2017.
- [55] M. Teboulle. Entropic proximal mappings with applications to nonlinear programming. Mathematics of Operations Research, 17(3):670–690, 1992.
- [56] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical report, 2008.
- [57] Q. Van Nguyen. Forward-backward splitting with bregman distances. Vietnam Journal of Mathematics, 45(3):519–539, 2017.
- [58] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(88):2543–2596, 2010.
- [59] K. Zhou, L. Tian, A. M.-C. So, and J. Cheng. Practical schemes for finding near-stationary points of convex finite-sums. AISTATS, 2022.
- [60] C. Zălinescu. Convex Analysis in General Vector Spaces. World Scientific Publishing Company, 2002.
Appendix A Omitted proofs
Proof of Lemma 3.
Using the expression of from the proof of Fact 4, we obtain
Thus , for , , and holds. Moreover, we have
Now, we calculate the step sizes of dual-AMD. For ,
(13)
In addition, for , the update of dual-AMD provides
(14)
Now, using (13), we rewrite as a summation of for as follows:
Also, by the update rule of dual-AMD, we have
so we can specify as follows:
Therefore, we conclude that the mirror dual of AMD is dual-AMD.