A Universal Black-Box Optimization Method with
Almost Dimension-Free Convergence Rate Guarantees
Abstract.
Universal methods for optimization are designed to achieve theoretically optimal convergence rates without any prior knowledge of the problem’s regularity parameters or the accuracy of the gradient oracle employed by the optimizer. In this regard, existing state-of-the-art algorithms achieve an $\mathcal{O}(1/T^2)$ value convergence rate in Lipschitz smooth problems with a perfect gradient oracle, and an $\mathcal{O}(1/\sqrt{T})$ convergence rate when the underlying problem is non-smooth and/or the gradient oracle is stochastic. On the downside, these methods do not take into account the problem’s dimensionality, and this can have a catastrophic impact on the achieved convergence rate, in both theory and practice. Our paper aims to bridge this gap by providing a scalable universal gradient method – dubbed UnderGrad – whose oracle complexity is almost dimension-free in problems with a favorable geometry (like the simplex, linearly constrained semidefinite programs and combinatorial bandits), while retaining the order-optimal dependence on $T$ described above. These “best-of-both-worlds” results are achieved via a primal-dual update scheme inspired by the dual extrapolation method for variational inequalities.
Key words and phrases:
Universal methods; dimension-freeness; dual extrapolation; rate interpolation.
2020 Mathematics Subject Classification:
Primary 90C25, 90C15; secondary 68Q32, 68T05.
1. Introduction
The analysis of first-order methods for convex minimization typically revolves around the following regularity conditions: (a) Lipschitz continuity of a problem’s objective function; and/or (b) Lipschitz continuity of the objective’s gradients. Depending on these conditions and the quality of the gradient oracle available to the optimizer, the optimal convergence rates that can be obtained by an iterative first-order algorithm after $T$ oracle queries are:
(1) $\mathcal{O}\big((G + \sigma) D / \sqrt{T}\big)$ if the problem’s objective is $G$-Lipschitz continuous and the oracle’s variance is $\sigma^2$.
(2) $\mathcal{O}\big(L D^2 / T^2 + \sigma D / \sqrt{T}\big)$ if the objective is $L$-Lipschitz smooth.
[In both cases, $D$ denotes the diameter of the problem’s domain $\mathcal{X}$; for an in-depth treatment, see [38, 11] and references therein.]
This stark separation of black-box guarantees has led to an intense search for universal methods that are capable of interpolating smoothly between these rates without any prior knowledge of the problem’s regularity properties or the oracle’s noise profile. As far as we are aware, the first algorithm with order-optimal rate guarantees for unconstrained problems and no knowledge of the problem’s smoothness parameters was the AcceleGrad proposal of Levy et al. [28]. Subsequently, in the context of constrained convex problems (the focus of our work), Kavis et al. [24] combined the extra-gradient / mirror-prox algorithmic template of Korpelevich [25] and Nemirovski [36] with an “iterate averaging” scheme introduced by Cutkosky [16] to change the query structure of the base algorithm and make it more amenable to acceleration. In this way, Kavis et al. [24] obtained a universal extra-gradient algorithm – dubbed UnixGrad – which interpolates between the optimal rates mentioned above, without requiring any tuning.
Our contributions.
The starting point of our paper is the observation that, even though the rates in question are optimal in $T$, they may be highly suboptimal in $d$, the problem’s dimensionality. For example, if the noise in each coordinate of the oracle has unit variance, $\sigma$ would scale as $\mathcal{O}(\sqrt{d})$; this represents a hidden dependence on $d$ which could have a catastrophic impact on the method’s actual convergence rate. Likewise, in problems with a favorable geometry (like the $\ell_1$-ball, trace-constrained semidefinite programs, combinatorial bandits, etc.), methods based on the mirror descent [37] and mirror-prox [36] templates can achieve rates with a logarithmic (instead of polynomial) dependence on $d$.
Importantly, the UnixGrad algorithm of Kavis et al. [24] is itself based on the mirror-prox blueprint, so it would seem ideally suited to achieve convergence rates that are simultaneously optimal in $T$ and $d$. However, the method’s guarantees depend crucially on the Bregman diameter of the problem’s domain, a quantity which becomes infinite when the method is used with a regularization setup that can lead to almost dimension-free guarantees. This would seem to suggest that universality comes at the cost of scalability, leading to the following open question:
Is it possible to achieve almost dimension-free convergence rates
while retaining an order-optimal dependence on $T$?
In this paper, we develop a novel adaptive algorithm, which we call universal dual extrapolation with reweighted gradients (UnderGrad), and which provides a positive answer to this question. Specifically, the value convergence rate of UnderGrad scales in terms of $T$, $d$ and $\sigma$ as:
(1) $\mathcal{O}\big(\sqrt{H}\,(G + \sigma)/\sqrt{T}\big)$ in non-smooth problems.
(2) $\mathcal{O}\big(H L / T^2 + \sqrt{H}\,\sigma/\sqrt{T}\big)$ in smooth problems.
In the above, the method’s “range parameter” $H$ scales at most polynomially in $d$ in the worst case and as $\mathcal{O}(\log d)$ in problems with a favorable geometry – that is, in problems where it is possible to attain almost dimension-free convergence rates [38, 11]. In this regard, UnderGrad seems to be the first method in the literature that concurrently achieves order-optimal rates in both $T$ and $d$, without any prior knowledge of the problem’s level of smoothness.
To achieve this result, the UnderGrad algorithm combines the following basic ingredients:
(1) A modified version of the dual extrapolation method of Nesterov [39] for solving variational inequalities.
(2) A gradient “reweighting” scheme that allows gradients to enter the algorithm with increasing weights.
(3) An iterate averaging scheme in the spirit of Cutkosky [16] which allows us to obtain an accelerated rate of convergence by means of an online-to-batch conversion.
The glue that holds these elements together is an adaptive learning rate inspired by Rakhlin & Sridharan [41, 42] which automatically rescales aggregated gradients by (a) a small, constant amount when the method approaches a solution where gradient differences vanish (as in the smooth, deterministic case); and (b) a factor of order $1/\sqrt{t}$ otherwise (thus providing the desired interpolation between smooth and non-smooth problems). In so doing, the proposed policy attains the correct step-size scaling and, with it, the desired optimal rates.
Related work.
The term “universality” was coined by Nesterov [40] whose universal primal gradient descent (UPGD) algorithm interpolates between the $\mathcal{O}(1/T)$ and $\mathcal{O}(1/\sqrt{T})$ rates for smooth and non-smooth problems respectively (assuming access to noiseless gradients in both cases). On the downside, UPGD relies on an Armijo-like line search to interpolate between smooth and non-smooth objectives, so it is not applicable to stochastic environments.
A partial work-around to this issue was achieved by the accelerated stochastic approximation (AC-SA) algorithm of Lan [26] which uses a mirror descent template and guarantees order-optimal rates for both noisy and noiseless oracles. However, to attain these rates, the AC-SA algorithm requires a precise estimate of the smoothness modulus of the problem’s objective, so it is not universal in this respect. Subsequent works on the topic have focused on attaining universal guarantees for composite problems [21], non-convex objectives [29, 46], preconditioned methods [21, 17], non-Lipschitz settings [3, 2, 4], specific applications [45], or variational inequalities / min-max problems [7, 4, 20, 5].
Of the generalist works above, some employ a Bregman regularization setup [2, 7], but the guarantees obtained therein either fall short of an accelerated convergence rate for Lipschitz smooth problems, or they depend on the problem’s Bregman diameter – so they cannot be associated with a Bregman setup leading to almost dimension-free convergence rate guarantees. To the best of our knowledge, UnderGrad is the first method that manages to combine the “best of both worlds” in terms of universality with respect to $T$ and scalability with respect to $d$.
2. Preliminaries
2.1. Notation and basic definitions
Let $\mathcal{V}$ be a $d$-dimensional real vector space with norm $\|\cdot\|$. In what follows, we will write $\mathcal{V}^*$ for the dual of $\mathcal{V}$, $\langle y, x \rangle$ for the pairing between $y \in \mathcal{V}^*$ and $x \in \mathcal{V}$, and $\|y\|_* \equiv \sup\{\langle y, x \rangle : \|x\| \le 1\}$ for the dual norm on $\mathcal{V}^*$. Given an extended-real-valued convex function $f \colon \mathcal{V} \to \mathbb{R} \cup \{\infty\}$, we will write $\operatorname{dom} f \equiv \{x \in \mathcal{V} : f(x) < \infty\}$ for its effective domain and $\partial f(x)$ for the subdifferential of $f$ at $x$. Any element $v \in \partial f(x)$ will be called a subgradient of $f$ at $x$, and we will write $\operatorname{dom} \partial f \equiv \{x \in \mathcal{V} : \partial f(x) \neq \varnothing\}$ for the domain of subdifferentiability of $f$.
2.2. Problem setup and blanket assumptions
The main focus of our paper is the solution of convex minimization problems of the form
minimize $f(x)$ (Opt)
subject to $x \in \mathcal{X}$
where $\mathcal{X}$ is a closed convex subset of $\mathcal{V}$ and $f \colon \mathcal{V} \to \mathbb{R} \cup \{\infty\}$ is a convex function with $\mathcal{X} \subseteq \operatorname{dom} f$. To avoid trivialities, we will assume throughout that the solution set $\mathcal{X}^* \equiv \arg\min f$ of (Opt) is non-empty, and we will write $x^*$ for a generic minimizer of $f$.
Other than this blanket assumption, our main regularity requirements for $f$ will be as follows:
(1) Lipschitz continuity:
(LC) $|f(x') - f(x)| \le G \|x' - x\|$ for some $G > 0$ and for all $x, x' \in \mathcal{X}$.
(2) Lipschitz smoothness:
(LS) $f(x') \le f(x) + \langle v, x' - x \rangle + \tfrac{L}{2} \|x' - x\|^2$ for some $L > 0$, some $v \in \partial f(x)$, and for all $x, x' \in \mathcal{X}$.
Since $f$ is convex, the above requirements are respectively equivalent to assuming that $f$ admits a selection of subgradients $\nabla f(x) \in \partial f(x)$ with the properties below:
(1) Bounded (sub)gradient selection:
(BG) $\|\nabla f(x)\|_* \le G$ for some $G > 0$ and for all $x \in \mathcal{X}$.
(2) Lipschitz (sub)gradient selection:
(LG) $\|\nabla f(x') - \nabla f(x)\|_* \le L \|x' - x\|$ for some $L > 0$ and for all $x, x' \in \mathcal{X}$.
In the rest of our paper, we will assume that $f$ satisfies at least one of (BG) or (LG).
Remark 1.
For posterity, we note here that the requirement (LG) does not imply that $\partial f(x)$ is a singleton.¹ In any case, the one-sided directional derivative of $f$ at $x$ along $x' - x$ exists for all $x, x' \in \mathcal{X}$. We will use this fact freely in the sequel. ¶

¹Consider for example the case of $f(x) = x^2/2$ for $x \in [0, 1]$ and $f(x) = \infty$ otherwise: $f$ clearly satisfies (BG)/(LS) on $[0, 1]$, even though $\partial f(0)$ and $\partial f(1)$ are infinite sets.
2.3. The oracle model
To solve (Opt), we will consider iterative methods and algorithms with access to a stochastic first-order oracle (SFO), i.e., a black-box device that returns a (possibly random) estimate of a subgradient of $f$ at the point at which it was queried. Formally, following Nesterov [38], an SFO for $f$ is a measurable function $\mathsf{G} \colon \mathcal{X} \times \Omega \to \mathcal{V}^*$ such that
(SFO) $\mathbb{E}[\mathsf{G}(x; \omega)] = \nabla f(x)$ for all $x \in \mathcal{X}$,
where $(\Omega, \mathcal{F}, \mathbb{P})$ is a complete probability space and $\nabla f(x) \in \partial f(x)$ is a selection of subgradients of $f$ as per (BG)/(LG). The oracle’s statistical precision will then be measured by the associated noise level $\sigma^2 \equiv \sup_{x \in \mathcal{X}} \mathbb{E}\big[\|\mathsf{G}(x; \omega) - \nabla f(x)\|_*^2\big]$ (assumed finite). In particular, if $\sigma = 0$, the oracle will be called perfect (or deterministic); otherwise, it will be called noisy.
In practice, the oracle is called repeatedly at a sequence of query points $X_t$ with a different random seed $\omega_t$ drawn according to $\mathbb{P}$ at each time.² In this way, at the $t$-th query to (SFO), the oracle returns the gradient signal

²In the sequel, $t$ may take both integer and half-integer values.
(1) $g_t = \mathsf{G}(X_t; \omega_t) = \nabla f(X_t) + U_t$
where $U_t$ denotes the “gradient noise” of the oracle (obviously, $U_t = 0$ if the oracle is perfect). For measurability purposes, we will write $\mathcal{F}_t$ for the history (adapted filtration) of $X_t$, so $X_t$ is $\mathcal{F}_t$-measurable (by definition) but $\omega_t$, $g_t$ and $U_t$ are not. In particular, conditioning on $\mathcal{F}_t$, we have $\mathbb{E}[U_t \mid \mathcal{F}_t] = 0$ and $\mathbb{E}[\|U_t\|_*^2 \mid \mathcal{F}_t] \le \sigma^2$, justifying in this way the terminology “gradient noise” for $U_t$.
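To fix ideas, the toy snippet below shows one way such an oracle could be implemented; the helper `make_sfo`, its calling convention, and the per-coordinate noise calibration are illustrative assumptions of ours, not part of the formal model:

```python
import numpy as np

def make_sfo(grad_f, sigma, seed=0):
    """A toy SFO: returns grad_f(x) plus zero-mean Gaussian noise, calibrated
    so that the expected squared Euclidean noise norm is about sigma**2."""
    rng = np.random.default_rng(seed)
    def oracle(x):
        noise = rng.normal(scale=sigma / np.sqrt(x.size), size=x.size)
        return grad_f(x) + noise   # unbiased estimate of the chosen subgradient
    return oracle

# Example: a linear objective f(x) = <c, x> queried with noise level sigma = 0.1.
c = np.array([0.3, 0.7, 0.1])
oracle = make_sfo(lambda x: c, sigma=0.1)
g1 = oracle(np.full(3, 1/3))       # noisy but unbiased estimate of c
```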
Remark 2.
We close this section by noting that the best convergence rates that can be achieved by an iterative algorithm that outputs a candidate solution $\hat X_T$ after $T$ queries to (SFO) are:³

³In general, the query and output points – $X_t$ and $\hat X_T$ respectively – need not coincide, hence the different notation. The only assumption for the rates provided below is that the output point $\hat X_T$ is an affine combination of $X_1, \dots, X_T$ [38, 11].
(1) $\mathcal{O}\big(G D / \sqrt{T}\big)$ if $f$ satisfies (BG) and the oracle is deterministic.
(2) $\mathcal{O}\big(L D^2 / T^2\big)$ if $f$ satisfies (LG) and the oracle is deterministic.
(3) an additional $\mathcal{O}\big(\sigma D / \sqrt{T}\big)$ term if the oracle is stochastic.
In general, without finer assumptions on $f$ or $\mathcal{X}$, the dependence of these rates on $T$ cannot be improved [38, 11]; we will revisit this issue several times in the sequel.
3. Regularization, universality, and the curse of dimensionality
To set the stage for the analysis to come, we discuss below the properties of two algorithmic frameworks – one non-adaptive, the other adaptive – based on the mirror-prox template [36]. Our aim in doing this will be to set a baseline for the sequel as well as to explore the impact of the problem’s dimensionality on the attained rates of convergence.
3.1. Motivating examples
As a first step, we present three archetypal problems to motivate and illustrate the general setup that follows.
Example 1 (Resource allocation).
Consider a set of computing resources (GPUs in a cluster, servers in a computing grid, …) indexed by $i = 1, \dots, d$. Each resource is capable of serving a stream of computing demands that arrive at a rate of $\rho$ units per time: if the optimizer assigns a load of $x_i$ to the $i$-th resource, the marginal cost incurred is $c_i(x_i)$ per unit served, where $c_i$ is the cost function of the $i$-th resource (assumed convex, differentiable, and increasing in $x_i$). Taking $\rho = 1$ for simplicity, the goal of the optimizer is to minimize the aggregate cost $f(x) = \sum_{i=1}^d x_i c_i(x_i)$, leading to a convex minimization problem over the unit $d$-dimensional simplex $\Delta \equiv \{x \in \mathbb{R}_+^d : \sum_i x_i = 1\}$. ¶
Example 2 (Input covariance matrix optimization).
Consider a Gaussian vector channel in the spirit of [44, 47]: the encoder controls the covariance matrix $\mathbf{X} = \mathbb{E}[\mathbf{x}\mathbf{x}^\dagger]$ of the Gaussian input signal $\mathbf{x}$ and seeks to maximize the transfer rate of the output signal $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{z}$, where $\mathbf{z}$ is the ambient noise in the channel and $\mathbf{H}$ is the channel’s transfer matrix. By the Shannon–Telatar formula [44], this boils down to maximizing the capacity function
(2) $F(\mathbf{X}) = \mathbb{E}\big[\log\det(\mathbf{I} + \mathbf{H}\mathbf{X}\mathbf{H}^\dagger)\big]$
subject to the constraint $\operatorname{tr}(\mathbf{X}) \le P$, where $P$ denotes the encoder’s maximum input power and the expectation in (2) is taken over the statistics of the (possibly deterministic) matrix $\mathbf{H}$. Since $F$ is concave in $\mathbf{X}$ [10, 47], we obtain a minimization problem of the form (Opt) over the spectrahedron $\mathcal{X} \equiv \{\mathbf{X} \succcurlyeq 0 : \operatorname{tr}(\mathbf{X}) \le P\}$. Since $\mathbf{X}$ is Hermitian, $\mathcal{X}$ can be seen as a convex body of $\mathbb{R}^d$ where $d = m^2$ for an $m \times m$ input covariance matrix; in the optimization literature, this is sometimes referred to as the “spectrahedron setup” [22]. ¶
Example 3 (Combinatorial bandits).
In bandit linear optimization problems, the optimizer is given a finite set of possible actions $\mathcal{A} \subseteq \{0, 1\}^d$, i.e., each action $a \in \mathcal{A}$ is a $d$-dimensional binary vector indicating whether the $i$-th component is “on” or “off”. The optimizer then chooses an action based on a mixed strategy $p \in \Delta(\mathcal{A})$ and incurs the mean loss
(3) $\ell(p) = \mathbb{E}_{a \sim p}\big[\langle v, a \rangle\big]$
where $v$ is a random vector with values in $[0, 1]^d$ (but otherwise unknown distribution). In many cases of interest – such as slate recommendation and shortest-path problems – the cardinality of $\mathcal{A}$ is exponential in $d$, so it is computationally prohibitive to state the resulting minimization problem in terms of $p$. Instead, writing $x_i = \sum_{a : a_i = 1} p_a$ for the probability of the $i$-th component being “on” under $p$, the optimizer’s objective can be rewritten more compactly as $f(x) = \langle \mathbb{E}[v], x \rangle$ with $x$ constrained to lie on the $d$-dimensional convex hull $\mathcal{X} = \operatorname{conv}(\mathcal{A})$ of $\mathcal{A}$ in $\mathbb{R}^d$. In the literature on multi-armed bandits, this setup is known as a combinatorial bandit; for an in-depth treatment, see [27, 14, 19, 13] and the many references cited therein. ¶
Examples 1, 2 and 3 all suffer from the “curse of dimensionality”: for instance, the dimensionality of a vector Gaussian channel with $m$ input entries is $d = m^2$, while a combinatorial bandit for recommendation systems may have upwards of several million arms. Nonetheless, these examples also share a number of geometric properties that make it possible to design scalable optimization algorithms with (almost) dimension-free convergence rate guarantees. We elaborate on this in the next section.
3.2. The mirror-prox template
We begin by considering the well-known mirror-prox (MP) method of Nemirovski [36]. Following [22, 35], this is defined via the recursion
(MP) $X_{t+1/2} = P_{X_t}(-\gamma_t g_t), \qquad X_{t+1} = P_{X_t}(-\gamma_t g_{t+1/2})$
where
(1) $t = 1, 2, \dots$ denotes the method’s iteration counter (for the origins of the half-integer notation, see Facchinei & Pang [18] and references therein).
(2) $\gamma_t > 0$ is the algorithm’s step-size sequence.
(3) $g_t$ and $g_{t+1/2}$ are stochastic gradients of $f$ obtained by querying the oracle at $X_t$ and $X_{t+1/2}$ respectively.
(4) $P_x(y)$ is a generalized projection operator known as the method’s “prox-mapping” (more on this later).
The most elementary instance of (MP) is the extra-gradient (EG) algorithm of Korpelevich [25], in which case the method’s prox-mapping is the Euclidean projector
(4) $P_x(y) = \Pi_{\mathcal{X}}(x + y) \equiv \arg\min_{x' \in \mathcal{X}} \|x + y - x'\|_2$
for all $x \in \mathcal{X}$, $y \in \mathcal{V}^*$. More generally, the prox-mapping in (MP) is defined in terms of a Bregman regularizer as follows:
| Regularizer | Domain ($\mathcal{X}$) | Breg. diam. ($D_h$) | Range ($H$) | Shape ($\kappa$) |
|---|---|---|---|---|
| Euclidean | any below | finite | $\mathcal{O}(D^2)$ | $\operatorname{poly}(d)$ |
| Entropic | simplex | $\infty$ | $\log d$ | $\mathcal{O}(\log d)$ |
| Quantum | spectrahedron | $\infty$ | $\mathcal{O}(\log d)$ | $\mathcal{O}(\log d)$ |
| ComBand | $\operatorname{conv}(\mathcal{A})$ | $\infty$ | $\mathcal{O}(m \log(d/m))$ | $\mathcal{O}(\log d)$ |

Table 1. The Bregman setups of Section 3 and the associated range and shape factors; the corresponding rates (9) then scale as described in the text.
Definition 1.
A Bregman regularizer on $\mathcal{X}$ is a convex function $h \colon \mathcal{V} \to \mathbb{R} \cup \{\infty\}$ such that
(1) $\operatorname{dom} h = \mathcal{X}$ and $h$ is continuous on $\mathcal{X}$.
(2) The subdifferential of $h$ admits a continuous selection, i.e., there exists a continuous mapping $\nabla h \colon \operatorname{dom} \partial h \to \mathcal{V}^*$ with $\nabla h(x) \in \partial h(x)$ for all $x \in \operatorname{dom} \partial h$.
(3) $h$ is strongly convex on $\mathcal{X}$, i.e.,
(5) $h(x') \ge h(x) + \langle \nabla h(x), x' - x \rangle + \tfrac{K}{2} \|x' - x\|^2$
for some $K > 0$ and all $x \in \operatorname{dom} \partial h$, $x' \in \mathcal{X}$.
We also define the Bregman divergence of $h$ as
(6) $D(p, x) = h(p) - h(x) - \langle \nabla h(x), p - x \rangle$
and the induced prox-mapping as
(7) $P_x(y) = \arg\min_{x' \in \mathcal{X}} \big\{ \langle y, x - x' \rangle + D(x', x) \big\}$
for all $p \in \mathcal{X}$, $x \in \operatorname{dom} \partial h$, and all $y \in \mathcal{V}^*$.
Remark.
The set $\mathcal{X}_h \equiv \operatorname{dom} \partial h$ is often referred to as the prox-domain of $h$; by standard results in convex analysis, we have $\operatorname{ri}(\mathcal{X}) \subseteq \mathcal{X}_h \subseteq \mathcal{X}$ [43, Chap. 26].
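For concreteness, here is a minimal sketch of the entropic instance of this machinery on the simplex; the closed-form multiplicative-weights expression for the prox-mapping is the standard one, and boundary points with vanishing coordinates are ignored for simplicity:

```python
import numpy as np

def h_ent(x):                    # entropic regularizer h(x) = sum_i x_i log x_i
    return np.sum(x * np.log(x))

def breg(p, x):                  # Bregman divergence (6); here D(p, x) = KL(p || x)
    return np.sum(p * np.log(p / x))

def prox(x, y):                  # prox-mapping (7): multiplicative-weights update
    z = x * np.exp(y)
    return z / z.sum()

x_c = np.full(4, 0.25)           # prox-center of the 4-dimensional simplex
print(prox(x_c, np.array([0.1, -0.2, 0.0, 0.3])))
```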
In terms of output, the candidate solution returned by (MP) after $T$ iterations is the so-called “ergodic average”
(8) $\bar X_T = \dfrac{\sum_{t=1}^T \gamma_t X_{t+1/2}}{\sum_{t=1}^T \gamma_t}$
Then, assuming the method’s step-size is chosen appropriately (more on this below), $\bar X_T$ enjoys the following guarantees [22, 42]:
(a) If $f$ satisfies (BG), then
(9a) $\mathbb{E}\big[f(\bar X_T)\big] - \min f = \mathcal{O}\big(\sqrt{F (G^2 + \sigma^2)/(K T)}\big)$
(b) If $f$ satisfies (LG), then
(9b) $\mathbb{E}\big[f(\bar X_T)\big] - \min f = \mathcal{O}\big(F L/(K T) + \sigma \sqrt{F/(K T)}\big)$
In the above, $F \equiv \min_{x^* \in \mathcal{X}^*} D(x^*, X_1)$ is the minimum Bregman divergence between a solution of (Opt) and the initial state of (MP). In particular, if (MP) is initialized at the prox-center $x_c = \arg\min_{x \in \mathcal{X}} h(x)$ of $\mathcal{X}$, we have
(10) $F \le \max_{\mathcal{X}} h - \min_{\mathcal{X}} h \equiv H.$
We will refer to $H$ as the range of $h$.
To quantify the interplay between the problem’s dimensionality and the rate guarantees (9) for (MP), it will be convenient to introduce the normalized regularity parameters
(11) $\bar G = G/\sqrt{K}, \qquad \bar L = L/K, \qquad \bar \sigma = \sigma/\sqrt{K},$
and the associated shape factor
(12) $\kappa = \dfrac{2H}{K D^2}.$
Since at least one of the terms $\bar G$, $\bar L$ and $\bar\sigma$ appears in (9), it follows that the leading term in (9) scales as $\sqrt{H}(\bar G + \bar\sigma)/\sqrt{T}$ in non-smooth / stochastic environments, and as $H \bar L/T$ in smooth, deterministic problems.
The importance of the normalized parameters $\bar G$, $\bar L$, $\bar\sigma$ and the shape factor $\kappa$ lies in the fact that they do not depend on the ambient norm (a choice which, to a certain extent, is arbitrary). Indeed, if $\|\cdot\|$ and $\|\cdot\|'$ are two norms on $\mathcal{V}$ that are related as $\|x\|' = \beta \|x\|$ for some $\beta > 0$, it is straightforward to verify that $h$ is $(K/\beta^2)$-strongly convex relative to $\|\cdot\|'$. Likewise, in terms of dual norms we have $\|\cdot\|'_* = \beta^{-1} \|\cdot\|_*$, so the constants $G$, $L$ and $\sigma$ would respectively become $G/\beta$, $L/\beta^2$ and $\sigma/\beta$ when computed under $\|\cdot\|'$. In general, the corresponding inequalities for equivalent norms are all tight, so a change in norm does not affect the shape factor $\kappa$; accordingly, any dependence of $\kappa$ on $d$ will be propagated verbatim to the guarantees (9).
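As a quick sanity check of this invariance under the normalization (11)/(12) as reconstructed above, take $\|x\|' = \beta \|x\|$, so that $K' = K/\beta^2$, $G' = G/\beta$ and $D' = \beta D$; then
$$\bar G' = \frac{G'}{\sqrt{K'}} = \frac{G/\beta}{\sqrt{K}/\beta} = \bar G, \qquad \kappa' = \frac{2H}{K' D'^2} = \frac{2H}{(K/\beta^2)\,\beta^2 D^2} = \kappa,$$
and similarly for $\bar L$ and $\bar\sigma$ (the range $H$ does not involve the norm at all).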
In Table 1, we provide the values of $H$ and $\kappa$ for the following cases:
(1) The Euclidean regularizer $h(x) = \tfrac{1}{2}\|x\|_2^2$ that gives rise to the extra-gradient algorithm (4).
(2) The entropic regularizer $h(x) = \sum_{i=1}^d x_i \log x_i$ for the simplex setup of Example 1.
(3) The von Neumann regularizer $h(\mathbf{X}) = \operatorname{tr}(\mathbf{X} \log \mathbf{X})$ for the spectrahedron setup of Example 2.
(4) The unnormalized entropy $h(x) = \sum_{i=1}^d (x_i \log x_i - x_i)$ for the combinatorial setup of Example 3.
These derivations are standard, so we omit the details. For posterity, we only note that the logarithmic dependence of $H$ on $d$ is asymptotically optimal, cf. [13, 12] and references therein.
3.3. The UnixGrad algorithm
As can be seen from Table 1, the mirror-prox algorithm achieves an almost dimension-free rate of convergence when used with a suitable regularizer. However, this comes with two important caveats: First, the algorithm’s $\mathcal{O}(1/T)$ rate in the smooth case falls short of the optimal $\mathcal{O}(1/T^2)$ dependence on $T$, so (MP) is suboptimal in this regard. Second, to achieve the rates presented in Eq. 9, the algorithm’s step-size must be tuned with prior knowledge of the problem’s parameters: in particular, under (BG), the algorithm must be run with a step-size proportional to $1/(G\sqrt{T})$ while, under (LG), the algorithm requires a step-size proportional to $1/L$ if $\sigma = 0$ and to $1/(\sigma\sqrt{T})$ otherwise. This creates an undesirable state of affairs because the parameters $G$, $L$ and $\sigma$ are usually not known in advance, and (MP) can – and does – fail to converge if run with an untuned step-size.
In the rest of this section, we briefly discuss the UnixGrad algorithm of Kavis et al. [24] which expands on the mirror-prox template in the following two crucial ways: (a) it introduces an iterate-averaging mechanism in the spirit of Cutkosky [16] to enable acceleration; and (b) it employs an adaptive step-size policy that does not require any tuning by the optimizer. In so doing, UnixGrad interpolates smoothly between the optimal convergence rates described in Section 2 without requiring any prior knowledge of $G$, $L$ or $\sigma$.
Concretely, UnixGrad proceeds as (MP), but instead of querying the oracle at $X_t$ and $X_{t+1/2}$, it introduces the weighted query states
(13) $\tilde X_t = \dfrac{\alpha_t X_t + \sum_{s=1}^{t-1} \alpha_s \tilde X_{s+1/2}}{\sum_{s=1}^{t} \alpha_s}, \qquad \tilde X_{t+1/2} = \dfrac{\alpha_t X_{t+1/2} + \sum_{s=1}^{t-1} \alpha_s \tilde X_{s+1/2}}{\sum_{s=1}^{t} \alpha_s}$
where $\alpha_t > 0$ is a “gradient weighting” parameter. Then, building on an idea by Rakhlin & Sridharan [41, 42], the oracle queries $g_t = \mathsf{G}(\tilde X_t; \omega_t)$ and $g_{t+1/2} = \mathsf{G}(\tilde X_{t+1/2}; \omega_{t+1/2})$ are used to update the method’s step-size as
(14) $\gamma_t = \dfrac{D_h}{\sqrt{1 + \sum_{s=1}^{t-1} \alpha_s^2 \|g_{s+1/2} - g_s\|_*^2}}$
where
(15) $D_h \equiv \sup\big\{ \sqrt{2 D(x', x)} : x' \in \mathcal{X},\ x \in \mathcal{X}_h \big\}$
is the so-called Bregman diameter of $\mathcal{X}$.
With all this in hand, Kavis et al. [24] provide the following bounds if UnixGrad is run with $\alpha_t = t$:
(a) If $f$ satisfies (BG), then
(16a) $\mathbb{E}\big[f(\tilde X_{T+1/2})\big] - \min f = \mathcal{O}\big(D_h (G + \sigma)/\sqrt{T}\big)$
(b) If $f$ satisfies (LG), then
(16b) $\mathbb{E}\big[f(\tilde X_{T+1/2})\big] - \min f = \mathcal{O}\big(D_h^2 L/T^2 + D_h \sigma/\sqrt{T}\big)$
As we mentioned in Section 2, the bounds (16) cannot be improved in terms of $T$ without further assumptions, so UnixGrad is universally optimal in this regard.
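In code, the adaptive rule (14) – as reconstructed above – amounts to the following sketch; the helper name and calling convention are ours:

```python
import numpy as np

def unixgrad_stepsize(D, alphas, gaps):
    """Sketch of the adaptive step-size (14): the Bregman diameter D divided
    by the root of the accumulated weighted squared gradient gaps
    ||g_{s+1/2} - g_s||_* (passed here as the list `gaps`)."""
    s = sum(a**2 * g**2 for a, g in zip(alphas, gaps))
    return D / np.sqrt(1.0 + s)

# With alpha_t = t and persistent gradient gaps, gamma_t decays like t^(-3/2):
print([unixgrad_stepsize(1.0, range(1, t), [1.0] * (t - 1))
       for t in (10, 100, 1000)])
```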
That being said, these guarantees also uncover an important limitation of UnixGrad, namely that the bounds (16) become void when the method is used in conjunction with one of the non-Euclidean frameworks of Examples 1, 2 and 3. For example, the Bregman diameter of the simplex under the entropic regularizer is $D_h = \infty$, so the multiplicative constants in (16) become infinite (and the bounds themselves become meaningless). However, since the use of these regularizers is crucial to obtain the scalable, dimension-free convergence rates reported in Table 1,⁴ we are led to the open question we stated before:

⁴In particular, since the shape factor of the Euclidean regularizer grows polynomially in $d$ in the setups of Examples 1–3, employing UnixGrad with ordinary Euclidean projections would not lead to scalable guarantees.
Is it possible to achieve almost dimension-free convergence rates
while retaining an order-optimal dependence on $T$?
We address this question in the next section.
4. Universal dual extrapolation
The point of departure of our analysis is the observation that gradient queries enter (MP) with decreasing weights. Specifically, if UnixGrad is run with $\alpha_t = t$ (a choice which is necessary to have a shot at acceleration), the denominator of (14) may grow as fast as $\Theta(T^{3/2})$ in the non-smooth/stochastic case, leading to an $\mathcal{O}(1/T^{3/2})$ asymptotic worst-case behavior for $\gamma_t$. In fact, even under the ansatz that the algorithm’s query points converge to a minimizer of $f$ at an accelerated rate, the denominator of (14) may still grow by a non-vanishing amount per iteration, indicating that $\gamma_t$ will, at best, stabilize to a positive value as $t \to \infty$. This feature of the step-size rule (14) is somewhat counterintuitive because conventional wisdom would suggest that (a) recent queries are more useful than older, potentially obsolete ones; and (b) gradients should be “inflated” as the method’s query points approach a zero-gradient solution in order to maintain a fast rate of convergence.
The problem with a vanishing step-size becomes especially pronounced if the method is used with a non-Euclidean regularizer (which is what one would wish to do in order to obtain scalable convergence guarantees). To see this, consider the iterates of the mirror-prox template generated by the unnormalized entropic regularizer $h(x) = \sum_i (x_i \log x_i - x_i)$ on $\mathbb{R}_+^d$.⁵ In this case, the induced prox-mapping is $P_x(y)_i = x_i e^{y_i}$, leading to the recursion
(17) $X_{t+1, i} = X_{t, i} \exp(-\gamma_t g_{t, i})$

⁵Strictly speaking, this regularizer is not strongly convex over all of $\mathbb{R}_+^d$, but this detail is not relevant for the question at hand.
Therefore, if the problem’s objective function attains its minimum at a point $x^*$ with $x_i^* = 0$ for some coordinate $i$, the actual steps of the method scale as $\mathcal{O}(X_{t,i})$ for small $X_{t,i}$, so it is imperative to maintain a large step-size to avoid stalling the algorithm.
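To make the scaling claim explicit, a first-order expansion of (17) gives
$$X_{t+1,i} - X_{t,i} = X_{t,i}\big(e^{-\gamma_t g_{t,i}} - 1\big) = -\gamma_t X_{t,i}\, g_{t,i} + \mathcal{O}(\gamma_t^2),$$
so the per-coordinate displacement is proportional to $X_{t,i}$ itself: once a coordinate is close to zero, only a large $\gamma_t$ can still move it by a non-negligible amount.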
This scaling issue is at the heart of the dual extrapolation (DE) method of Nesterov [39]. Originally designed to solve variational inequalities and related problems, the method proceeds by (i) using a prox-step to generate the method’s leading state and get a “look-ahead” gradient query; (ii) aggregating gradient information with a constant weight; and, finally, (iii) using a “primal-dual” mirror map to update the method’s base state. Formally, the algorithm follows the iterative update rule
(DE) $X_{t+1/2} = P_{X_t}(-\gamma g_t), \qquad Y_{t+1} = Y_t - \gamma g_{t+1/2}, \qquad X_{t+1} = Q(Y_{t+1})$
where the so-called “mirror map” $Q \colon \mathcal{V}^* \to \mathcal{X}$ is defined as
(18) $Q(y) = \arg\max_{x \in \mathcal{X}} \big\{ \langle y, x \rangle - h(x) \big\}$
Unfortunately, the template (DE) is not sufficient for our purposes, for two main reasons: First, the method still couples a prox-step with a variable (decreasing) step-size update; this is not problematic for the application of the method to VIs (where the achievable rates are different), but it is not otherwise favorable for acceleration.
In addition to the above, the method’s gradient pre-multiplier is the same as its post-multiplier ($\gamma$ in both cases), and it is not possible to differentiate these parameters while maintaining optimal rates [39]. However, this differentiation is essential for acceleration, especially when $\gamma$ cannot be tuned with prior knowledge of the problem’s parameters.
Our approach to overcome this issue consists of: (a) eliminating the prox-step altogether in favor of a mirror step; and (b) separating the weights used for introducing new gradients to the algorithm versus those used to generate the base and leading states. To formalize this, we introduce below the “universal” dual extrapolation template:
(UDE) $X_t = Q(\eta_t Y_t), \qquad X_{t+1/2} = Q\big(\eta_t (Y_t - \alpha_t g_t)\big), \qquad Y_{t+1} = Y_t - \alpha_t g_{t+1/2}$
In the above, the gradient signals $g_t$ and $g_{t+1/2}$ are considered generic and the query points are not specified. To get a concrete algorithm, we will use the weighting scheme of Kavis et al. [24] and query the oracle at the averaged states $\tilde X_t$ and $\tilde X_{t+1/2}$ introduced previously in (13). Finally, regarding the algorithm’s gradient weighting and averaging parameters ($\alpha_t$ and $\eta_t$ respectively), we will use an increasing weight $\alpha_t = t$ for the method’s step-size and the adaptive rule
(19) $\eta_t = \sqrt{\dfrac{H}{K \big(1 + \sum_{s=1}^{t-1} \alpha_s^2 \|g_{s+1/2} - g_s\|_*^2\big)}}$
for the method’s learning rate (the parameters $H$ and $K$ are discussed below, and we are using the standard convention that empty sums are taken equal to zero).
The resulting method – which we call universal dual extrapolation with reweighted gradients (UnderGrad) – is encoded in pseudocode form in Algorithm 1 and represented schematically in Fig. 1. Its main guarantees are as follows:
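For illustration, the following sketch instantiates the above loop on the simplex with the entropic mirror map; it follows our reconstruction of (UDE), (13) and (19) with $K = 1$, and should not be read as a verbatim transcription of Algorithm 1:

```python
import numpy as np

def undergrad(oracle, d, T):
    """Minimal sketch of UnderGrad on the simplex with the entropic mirror
    map Q(y) = softmax(y); based on our reconstruction of (UDE)/(13)/(19)."""
    H = np.log(d)                          # range of the entropic regularizer
    def Q(y):                              # mirror map: softmax onto the simplex
        z = np.exp(y - y.max())
        return z / z.sum()
    Y = np.zeros(d)                        # dual aggregate, Y_1 = 0
    Xbar = np.full(d, 1.0 / d)             # running average of leading states
    S, sq = 0.0, 0.0                       # sum of alphas; gradient-gap sum
    for t in range(1, T + 1):
        alpha = t                          # increasing gradient weights
        eta = np.sqrt(H / (1.0 + sq))      # adaptive learning rate (19), K = 1
        Xbase = Q(eta * Y)                 # base state
        g = oracle((S * Xbar + alpha * Xbase) / (S + alpha))  # query at X~_t
        Xlead = Q(eta * (Y - alpha * g))   # leading ("extrapolated") state
        Xbar = (S * Xbar + alpha * Xlead) / (S + alpha)       # X~_{t+1/2}
        g_half = oracle(Xbar)              # look-ahead query at the average
        Y = Y - alpha * g_half             # reweighted dual aggregation
        sq += alpha**2 * np.max(np.abs(g_half - g))**2  # ell_inf dual-norm gap
        S += alpha
    return Xbar                            # candidate solution X~_{T+1/2}
```

For a linear objective $f(x) = \langle c, x \rangle$, one would simply pass `oracle = lambda x: c` (or a noisy version thereof, as in Section 2.3).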
Theorem 1.
Suppose that UnderGrad (Algorithm 1) is run for $T$ iterations with $\eta_t$ given by (19), $\alpha_t = t$ for all $t = 1, 2, \dots$, and $Y_1 = 0$, with $H$ given by (10). Then the algorithm’s output state $\tilde X_{T+1/2}$ simultaneously enjoys the following guarantees:
(20a) $\mathbb{E}\big[f(\tilde X_{T+1/2})\big] - \min f = \mathcal{O}\big(\sqrt{H/K}\,(G + \sigma)/\sqrt{T}\big)$ if $f$ satisfies (BG);
(20b) $\mathbb{E}\big[f(\tilde X_{T+1/2})\big] - \min f = \mathcal{O}\big((H/K)\, L/T^2 + \sqrt{H/K}\,\sigma/\sqrt{T}\big)$ if $f$ satisfies (LG).
Theorem 1 is our main result so, before discussing its proof (which we carry out in detail in the appendix), some remarks are in order.
Figure 1. Schematic representation of UnderGrad (cf. Algorithm 1).
The first point of note concerns the dependence of the anytime bounds (20) on the problem’s dimensionality. To that end, let $\bar G = G/\sqrt{K}$, $\bar L = L/K$ and $\bar\sigma = \sigma/\sqrt{K}$ as in (11), so UnderGrad’s rate of convergence scales as $H \bar L/T^2$ in smooth, deterministic problems, and as $\sqrt{H}(\bar G + \bar\sigma)/\sqrt{T}$ in non-smooth and/or stochastic environments. Thus, to compare the algorithm’s rate of convergence to that of mirror-prox and UnixGrad (and up to universal constants), we have to compare the range $H$ to the corresponding factors in (9) and (16) – namely $H$ itself and the Bregman diameter $D_h$ – respectively.
To that end, we calculate below the values of $H$ and $K$ in the three archetypal examples of Section 3:
(1) In the simplex setup of Example 1, we have $H = \log d$ and $K = 1$ relative to the $\ell_1$-norm, so $\sqrt{H/K} = \sqrt{\log d}$ and $H/K = \log d$.
(2) In the spectrahedron setup of Example 2, the von Neumann regularizer likewise gives $H = \mathcal{O}(\log d)$ and $K = \mathcal{O}(1)$ relative to the nuclear norm.
(3) In the combinatorial setup of Example 3, the unnormalized entropy gives $H = \mathcal{O}(m \log(d/m))$ for action sets with $\|a\|_1 = m$.
The above shows that UnderGrad achieves the desired almost dimension-free rates of the non-adaptive mirror-prox algorithm, as well as the universal order-optimal guarantees of UnixGrad. The only discrepancy with the rates presented in Table 1 is the additive constant that appears in the numerator of (20a): this constant is an artifact of the analysis, and it only becomes relevant for small values of $T$. Since the scaling of the algorithm’s convergence rate concerns the large-$T$ regime, this difference is not relevant for our purposes.
An additional difference between UnderGrad and UnixGrad is that the latter involves the prox-mapping (7), whereas the former involves the mirror map (18). To compare the two in terms of their per-iteration complexity, note that if we apply the prox-mapping (7) to the prox-center $x_c = \arg\min_{x \in \mathcal{X}} h(x)$ of $\mathcal{X}$, we get
(21) $P_{x_c}(y) = \arg\max_{x \in \mathcal{X}} \big\{ \langle y + \nabla h(x_c), x \rangle - h(x) \big\} = Q\big(y + \nabla h(x_c)\big)$
so, in particular, $P_{x_c}(y) = Q(y)$ whenever $\nabla h(x_c) = 0$ (which is the case, possibly up to an additive constant vector, for most regularizers used in practice, including the Legendre regularizers used in Examples 1–3 above). This shows that the calculation of the mirror map is at least as simple as the calculation of the prox-mapping for a general base point – and, typically, calculating the mirror map is strictly lighter because there is no need to vary the base point over different iterations of the algorithm. In this regard, the per-iteration overhead of (UDE) is actually lighter compared to (MP) or (DE).
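As an illustration of (21) in the entropic setup of Example 1: there, $x_c = (1/d, \dots, 1/d)$ and $\nabla h(x)_i = 1 + \log x_i$, so $\nabla h(x_c) = (1 - \log d)\mathbf{1}$ is a constant vector; since the mirror map $Q(y)_i \propto e^{y_i}$ is invariant to shifts of $y$ along $\mathbf{1}$, we get
$$P_{x_c}(y) = Q\big(y + (1 - \log d)\mathbf{1}\big) = Q(y),$$
so the centered prox-mapping and the mirror map coincide here as well.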
Finally, we should note that all our results above implicitly assume that the problem’s domain is bounded (otherwise the range parameter $H$ of the problem becomes infinite). Thus, in addition to the above convergence properties of UnderGrad, we also provide below an asymptotic guarantee for problems with an unbounded domain:
Theorem 2.
Suppose that $f$ satisfies (LG) and that UnderGrad (Algorithm 1) is run with a perfect oracle ($\sigma = 0$) and parameters as in Theorem 1. If (Opt) admits a solution, the algorithm’s learning rate $\eta_t$ stabilizes to a strictly positive value and $f(\tilde X_{T+1/2}) - \min f = \mathcal{O}(1/T^2)$, even if $\mathcal{X}$ is unbounded.
This result provides an important extension of Theorem 1 to problems with unbounded domains. It remains an open question for the future to derive the precise constants in the convergence rate presented in Theorem 2.
Main ideas of the proof.
The detailed proof of Theorem 1 is fairly long so we defer it to the appendix and only present here the main ideas.
The main ingredient of our proof is a specific template inequality used to derive an “appropriate” upper bound of the weighted gap term $\sum_{t=1}^{T} \alpha_t \langle g_{t+1/2}, X_{t+1/2} - x^* \rangle$. Importantly, to prove the dimension-free properties of UnderGrad, such an upper bound cannot involve Bregman divergences: even though this is common practice in previous papers [24, 1], these terms would ultimately lead to the Bregman diameter that we seek to avoid. This is a principal part of the reason for switching gears to the dual extrapolation (DE) template for UnderGrad: in so doing, we are able to employ the notion of the Fenchel coupling [31, 32], which is a “primal-dual distance”, as opposed to the Bregman divergence, which is a “primal-primal distance” (cf. Section A.1). This poses another challenge in connecting the Fenchel coupling of targeted points before and after a mirror step, for which we need to employ a primal-dual version of the “three-point identity” (Lemma A.3). These elements lead to the following proposition:
Proposition 1.
For all $x^* \in \mathcal{X}^*$ and all $T = 1, 2, \dots$, we have
(22) $\displaystyle \sum_{t=1}^{T} \alpha_t \langle g_{t+1/2}, X_{t+1/2} - x^* \rangle \le \frac{H}{\eta_{T+1}} + \sum_{t=1}^{T} \alpha_t \langle g_{t+1/2} - g_t, X_{t+1/2} - X_{t+1} \rangle - \frac{K}{2} \sum_{t=1}^{T} \frac{1}{\eta_t} \|X_{t+1} - X_{t+1/2}\|^2$
With (22) in hand, (20a) comes from the application of the Fenchel–Young inequality to upper-bound the right-hand side of (22) by a term of the form $\sum_{t=1}^{T} \eta_t \alpha_t^2 \|g_{t+1/2} - g_t\|_*^2$ (plus a constant term involving $H$). The challenge here is to notice, and successfully prove, that this summation is in fact upper-bounded by $\mathcal{O}(1/\eta_{T+1})$ (due to our special choice of the learning rate update). Finally, by its definition, $1/\eta_{T+1}$ can be bounded in terms of $G$, $\sigma$ and $H$ as described in the statement of Theorem 1.
The proof of (20b) is far more complex. The main challenge is to manipulate the terms in (22) to derive an upper bound of the form $\sum_{t=1}^{T} \lambda_t \alpha_t^2 \|g_{t+1/2} - g_t\|_*^2$ (plus a term involving the noise level $\sigma$), where $\lambda_t$ is a function of the learning rate chosen such that only finitely many initial elements of this summation are positive. Once this has been achieved, the quantity $\|g_{t+1/2} - g_t\|_*$ is connected to the distance between the corresponding query points via (LG), and our claim is obtained.
5. Numerical Experiments
For the experimental validation of our results, we focus on the simplex setup of Example 1 with linear losses. Our first experiment concerns the perfect SFO case and tracks the convergence properties of UnderGrad run with the entropic regularizer adapted to the simplex. As a baseline, we ran UnixGrad, also with the entropic regularizer. A first challenge here is that the Bregman diameter of the simplex is infinite, so the step-size rule (14) of UnixGrad is not well-defined. On that account, we choose the step-size update rule of UnixGrad such that its initial step-size $\gamma_1$ equals the initial learning rate of UnderGrad. We also ran UnixGrad with update rules such that $\gamma_1$ is smaller or larger than this value. Finally, for comparison purposes, we also present a non-adaptive accelerated entropic gradient (AEG) algorithm, and we report the results in Fig. 2.
Fig. 2 confirms that UnderGrad successfully converges at an accelerated $\mathcal{O}(1/T^2)$ rate. Perhaps surprisingly, it also shows that when UnixGrad’s initial step-size is small ($10^{-3}$ or smaller), UnixGrad also achieves an $\mathcal{O}(1/T^2)$ rate, but at a much more conservative pace, trailing UnderGrad by one or two orders of magnitude. On the other hand, when UnixGrad’s initial step-size is of the same magnitude as UnderGrad’s learning rate (or larger), UnixGrad eventually destabilizes and its rate drops from $\mathcal{O}(1/T^2)$ to approximately $\mathcal{O}(1/T)$. We also conducted experiments in the setup with a noisy SFO; these are reported in Appendix C.
Appendix A Bregman regularizers and preliminary results
A.1. Bregman regularizers and their properties
We begin by clarifying and recalling some of the notational conventions used throughout the paper. We also give the formal definition of the Fenchel coupling (a key notion for the proof of our main results) and we present some preliminary results to prepare the ground for the sequel.
The convex conjugate of $h$ is then defined as
(A.1) $h^*(y) = \max_{x \in \mathcal{X}} \big\{ \langle y, x \rangle - h(x) \big\}$
Since $h$ is strongly convex and lower semicontinuous, the supremum in (A.1) is always attained, and $h^*(y)$ is finite for all $y \in \mathcal{V}^*$ [8]. Moreover, by standard results in convex analysis [43, Chap. 26], $h^*$ is differentiable on $\mathcal{V}^*$ and its gradient satisfies the identity
(A.2) $\nabla h^*(y) = \arg\max_{x \in \mathcal{X}} \big\{ \langle y, x \rangle - h(x) \big\}$
Thus, recalling the definition of the mirror map $Q$, we readily get
(A.3) $Q(y) = \nabla h^*(y)$ for all $y \in \mathcal{V}^*$.
Lemma A.1.
Let $h$ be a Bregman regularizer on $\mathcal{X}$. Then, for all $x \in \mathcal{X}_h$ and all $y \in \mathcal{V}^*$, we have:
(A.4a) $x = Q(y)$ if and only if $y \in \partial h(x)$.
(A.4b) $h^*(y) = \langle y, Q(y) \rangle - h(Q(y))$.
Finally, if $x = Q(y)$ and $p \in \mathcal{X}$, we have
(A.5) $\langle \nabla h(x), x - p \rangle \le \langle y, x - p \rangle$.
Proof of Lemma A.1.
To prove (A.4a), note that $x$ solves the maximization problem defining (A.2) if and only if $0 \in y - \partial h(x)$, i.e., if and only if $y \in \partial h(x)$. Eq. A.4b is then obtained by substituting $x = Q(y)$ in (A.1).
For the inequality (A.5), it suffices to show that it holds for all $p \in \operatorname{ri}(\mathcal{X})$ (the general case then follows by continuity). To do so, let
(A.6) $\phi(t) = h(x + t(p - x)) - \big[h(x) + t \langle y, p - x \rangle\big]$ for $t \in [0, 1]$.
Since $h$ is strongly convex relative to $\|\cdot\|$ and $y \in \partial h(x)$ by (A.4a), it follows that $\phi(t) \ge 0$ with equality if and only if $t = 0$. Moreover, note that $\nabla h$ is a continuous selection of subgradients of $h$. Given that $h$ and $\nabla h$ are both continuous on the segment $[x, p]$, it follows that $\phi$ is continuously differentiable with $\phi'(t) = \langle \nabla h(x + t(p - x)) - y, p - x \rangle$ on $[0, 1]$. Thus, with $\phi$ convex and $\phi(t) \ge \phi(0) = 0$ for all $t \in [0, 1]$, we conclude that $\phi'(0) \ge 0$, which is (A.5). ∎
As we mentioned earlier, much of our analysis revolves around a “primal-dual” divergence between a target point $p \in \mathcal{X}$ and a dual vector $y \in \mathcal{V}^*$, called the Fenchel coupling. Following [33], this is defined as follows for all $p \in \mathcal{X}$, $y \in \mathcal{V}^*$:
(A.7) $F(p, y) = h(p) + h^*(y) - \langle y, p \rangle$
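For a concrete instance, in the entropic setup of Example 1 we have $h^*(y) = \log \sum_i e^{y_i}$ and $Q(y)_i = e^{y_i}/\sum_j e^{y_j}$, so the Fenchel coupling becomes
$$F(p, y) = \sum_i p_i \log p_i + \log \sum_i e^{y_i} - \langle y, p \rangle = \sum_i p_i \log \frac{p_i}{Q(y)_i} = D_{\mathrm{KL}}\big(p \,\|\, Q(y)\big),$$
i.e., the Kullback–Leibler divergence between the target point and the mirror image of the dual vector.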
The following lemma illustrates some basic properties of the Fenchel coupling:
Lemma A.2.
Let $h$ be a Bregman regularizer on $\mathcal{X}$ with convexity modulus $K$. Then, for all $p \in \mathcal{X}$ and all $y \in \mathcal{V}^*$, we have:
(1) $F(p, y) = D(p, Q(y))$ if $\nabla h(Q(y)) = y$ (but not necessarily otherwise).
(2) $F(p, y) \ge \tfrac{K}{2} \|Q(y) - p\|^2$.
Proof.
For our first claim, let $x = Q(y)$. Then, by definition, we have:
(A.8) $F(p, y) = h(p) + \langle y, x \rangle - h(x) - \langle y, p \rangle = h(p) - h(x) - \langle y, p - x \rangle$
Since $x = Q(y)$, we have $F(p, y) = D(p, x)$ whenever $\nabla h(x) = y$, thus proving our first claim. For our second claim, working in the previous spirit, we get:
(A.9) $F(p, y) = h(p) - h(x) - \langle y, p - x \rangle \ge h(p) - h(x) - \langle \nabla h(x), p - x \rangle$
where the inequality follows from (A.5). Thus, we obtain the result by recalling the strong convexity assumption for $h$ with respect to the ambient norm $\|\cdot\|$. ∎
We continue with some basic relations connecting the Fenchel coupling relative to a target point before and after a gradient step. The basic ingredient for this is a primal-dual analogue of the so-called “three-point identity” for Bregman functions [15]:
Lemma A.3.
Let $h$ be a Bregman regularizer on $\mathcal{X}$. Fix some $p \in \mathcal{X}$ and let $y, y' \in \mathcal{V}^*$. Then, letting $x = Q(y)$, we have
(A.10) $F(p, y') = F(p, y) + F(x, y') + \langle y' - y, x - p \rangle$
Proof.
By definition, we get:
(A.11) $F(p, y') = h(p) + h^*(y') - \langle y', p \rangle$ and $F(p, y) = h(p) + h^*(y) - \langle y, p \rangle$
Then, by subtracting the above and using (A.4b) to write $h^*(y) = \langle y, x \rangle - h(x)$, we get:
(A.12) $F(p, y') - F(p, y) = h^*(y') + h(x) - \langle y, x \rangle - \langle y' - y, p \rangle = F(x, y') + \langle y' - y, x - p \rangle$
and our proof is complete. ∎
A.2. Numerical sequence inequalities
In this section, we provide some necessary inequalities on numerical sequences that we require for the convergence rate analysis of the previous sections. Most of the lemmas presented below already exist in the literature, and go as far back as Auer et al. [6] and McMahan & Streeter [30]; when appropriate, we note next to each lemma the references with the statement closest to the precise version we are using in our analysis.
Appendix B Analysis and proofs of the main results
The proof of the template inequality.
We first prove the template inequality of UnderGrad (Proposition 1); this is the primary element of our proof of Theorem 1.
Proof.
First, we set $p = x^*$. For all $t = 1, 2, \dots$, we have, by Lemma A.3:
(B.1)
Here, the last inequality comes from the definitions of the iterates, together with the fact that the learning rate $\eta_t$ is decreasing.
As a consequence of (B.1), we have:
(B.2)
Now, recalling the definitions of the base and leading states in Algorithm 1, applying Lemma A.2 to the resulting Fenchel couplings, and combining with (B.4), we get:
(B.5)
Hence, after telescoping and recalling the notation $H = \max_{\mathcal{X}} h - \min_{\mathcal{X}} h$, we get:
(B.6)
Finally, by our initial choice of $Y_1 = 0$, we have $X_1 = Q(0) = \arg\min_{x \in \mathcal{X}} h(x)$, and (22) follows from (B.6). This concludes the proof of Proposition 1. ∎
Regret-to-rate conversion lemma.
The next element in our proof is the following lemma, which will be used to connect the term $\sum_{t=1}^{T} \alpha_t \langle g_{t+1/2}, X_{t+1/2} - x^* \rangle$ (which, intuitively, is similar to a regret term) to the term $f(\tilde X_{T+1/2}) - \min f$ whose bounds characterize the convergence rate of UnderGrad.
Lemma B.1.
For any $T = 1, 2, \dots$ and any $x^* \in \mathcal{X}^*$, we have:
(B.7)
Remark 1.
Proof of (20a): convergence of UnderGrad under (LC)/(BG).
Our starting point is Eq. 22, which we established in Proposition 1 and which leads to the following inequality:
(B.12)
We now focus on the second term on the right-hand side of (B.12). From the Cauchy–Schwarz inequality and the fact that
(B.13)
holds for any choice of the relevant parameters,⁶ we have:

⁶This can be proved trivially by noting that the point in question is a minimizer of the corresponding function.
(B.14)
Moreover, from the definition of $\eta_t$ and by applying Lemma A.4, we have:
(B.15)
Combining (B.14) and (B.15) with (B.12), and invoking the compactness of the feasible region $\mathcal{X}$, we get:
(B.16)
Proof of (20b): convergence of UnderGrad under (LG)/(LS).
We analyze the terms on the right-hand side of (B.23). First, we have:
(B.24)
Second, we have:
(B.25)
Hence,
(B.26)
We will analyze the terms on the right-hand side of (B.27). To do this, we first introduce the quantities
(B.28a)
and
(B.28b)
We also define
(B.29)
By these definitions, we obtain that
(B.30)
Here, the last inequality is obtained from the fact that, whenever the relevant condition holds, we have:
(B.31)
On the other hand, by the update rule in Algorithm 1 and our choice of $\alpha_t = t$ as in Theorem 1, we also have the corresponding bound. Using this, and recalling that $f$ is $L$-smooth over $\mathcal{X}$, we have:
(B.33)
(B.35)
Now, define the critical index $t_0$ as follows:
(B.36)
In other words, for any $t \ge t_0$, the corresponding summands are non-positive. As a consequence,
(B.37)
(B.38)
Recalling the choice $\alpha_t = t$ and applying Lemma B.1, we have:
Convergence of UnderGrad in unbounded domains.
Finally, we give the proof of Theorem 2 concerning the deterministic SFO in the (LG) case with a possibly unbounded domain $\mathcal{X}$.
Proof.
Since the learning rate $\eta_t$ is non-increasing and non-negative, its limit exists. In particular,
(B.42)
Let us now assume that $\lim_{t \to \infty} \eta_t = 0$. Then, by applying Proposition 1, we have:
(B.43)
(B.44)
Now, for the term in question, we have:
(B.45)
which readily yields:
(B.46)
Hence, putting everything together, we get:
(B.47)
Moreover, since $f$ is smooth, we have:
(B.48)
Combining this with the fact that $\eta_t$ is a decreasing sequence, we can rewrite (B.47) as follows:
(B.49)
In the sequel, we look for appropriate bounds on the two terms on the right-hand side of (B.49). We start with the second term. From (B.9), we also have the corresponding estimate; combining it with (B.49), we get:
(B.50)
Hence, by rearranging, we have:
(B.51)
Now, since we assumed that $\eta_t$ converges to $0$, there exists some $t_1$ such that:
(B.52)
which directly yields the stated bound for all $t \ge t_1$. Hence, we have:
(B.53)
On the other hand, we have:
(B.54)
and hence
(B.55)
So, summarizing, we have:
(B.56)
We now focus on the first term on the right-hand side of (B.49). Letting $T$ go to infinity and recalling that we assumed that $\eta_t$ converges to $0$ (and so $1/\eta_t \to \infty$), we get:
(B.57)
a contradiction. This shows that $\eta_\infty = \lim_{t \to \infty} \eta_t > 0$, and hence
(B.58)
so our proof is complete. ∎
Appendix C Additional Numerical Experiments
In this last section, we report another numerical experiment highlighting the universality of UnderGrad. In this experiment, we again focus on the simplex setup presented in Section 5. However, this time, we work with a noisy SFO that returns first-order feedback perturbed by noise generated from a pre-determined zero-mean normal distribution. We compare the performance of UnderGrad and UnixGrad, both run with the entropic regularizer. The results of this experiment are reported in Fig. 3.
Fig. 3 shows that UnderGrad attains the optimal $\mathcal{O}(1/\sqrt{T})$ rate in this setup. UnixGrad can also attain the same rate, but only when its step-size update rule is chosen appropriately (note again that, with the entropic regularizer, the update rule (14) of UnixGrad is not available because $D_h = \infty$): when $\gamma_1$ is chosen with the same or larger magnitude than UnderGrad’s initial learning rate, UnixGrad converges at an $\mathcal{O}(1/\sqrt{T})$ rate; but if $\gamma_1$ is too small, UnixGrad can have a very long warm-up phase. This experiment reasserts that, in cases where the step-size update rule (14) is unavailable, it is non-trivial to choose an appropriate step-size update rule for UnixGrad: a small $\gamma_1$ might lead to better performance under a perfect SFO (cf. Section 5) but might create unwanted behavior in noisy SFO setups. By contrast, UnderGrad does not encounter such issues in our experiments.
Finally, we conducted another experiment to confirm the dependence of the convergence rate of UnderGrad on the noise level $\sigma$. The results are reported in Fig. 4.
Acknowledgments
KA and VC were supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement № - time-data) and from the Swiss National Science Foundation (SNSF) under grant number . KYL is grateful for financial support by Israel Science Foundation (grant No. 447/20), by Israel PBC-VATAT, and by the Technion Center for Machine Learning and Intelligent Systems (MLIS). PM is grateful for financial support by the French National Research Agency (ANR) in the framework of the “Investissements d’avenir” program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), MIAI@Grenoble Alpes (ANR-19-P3IA-0003), and the bilateral ANR-NRF grant ALIAS (ANR-19-CE48-0018-01).
References
- Allen-Zhu & Orecchia [2017] Allen-Zhu, Z. and Orecchia, L. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS ’17, 2017. Full version available at http://arxiv.org/abs/1407.1537.
- Antonakopoulos & Mertikopoulos [2021] Antonakopoulos, K. and Mertikopoulos, P. Adaptive first-order methods revisited: Convex optimization without Lipschitz requirements. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
- Antonakopoulos et al. [2019] Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. An adaptive mirror-prox algorithm for variational inequalities with singular operators. In NeurIPS ’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
- Antonakopoulos et al. [2021a] Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. Adaptive extra-gradient methods for min-max optimization and games. In ICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021a.
- Antonakopoulos et al. [2021b] Antonakopoulos, K., Pethick, T., Kavis, A., Mertikopoulos, P., and Cevher, V. Sifting through the noise: Universal first-order methods for stochastic variational inequalities. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021b.
- Auer et al. [2002] Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
- Bach & Levy [2019] Bach, F. and Levy, K. Y. A universal algorithm for variational inequalities adaptive to smoothness and noise. In COLT ’19: Proceedings of the 32nd Annual Conference on Learning Theory, 2019.
- Bauschke & Combettes [2017] Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, NY, USA, 2 edition, 2017.
- Bilenne et al. [2020] Bilenne, O., Mertikopoulos, P., and Belmega, E. V. Fast optimization with zeroth-order feedback in distributed multi-user MIMO systems. IEEE Trans. Signal Process., 68:6085–6100, October 2020.
- Boyd & Vandenberghe [2004] Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
- Bubeck [2015] Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–358, 2015.
- Bubeck & Cesa-Bianchi [2012] Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- Cesa-Bianchi & Lugosi [2006] Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
- Cesa-Bianchi & Lugosi [2012] Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422, 2012.
- Chen & Teboulle [1993] Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, August 1993.
- Cutkosky [2019] Cutkosky, A. Anytime online-to-batch, optimism and acceleration. In ICML ’19: Proceedings of the 36th International Conference on Machine Learning, 2019.
- Ene et al. [2021] Ene, A., Nguyen, H. L., and Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. In AAAI ’21: Proceedings of the 35th Conference on Artificial Intelligence, 2021.
- Facchinei & Pang [2003] Facchinei, F. and Pang, J.-S. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research. Springer, 2003.
- György et al. [2007] György, A., Linder, T., Lugosi, G., and Ottucsák, G. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007.
- Hsieh et al. [2021] Hsieh, Y.-G., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In COLT ’21: Proceedings of the 34th Annual Conference on Learning Theory, 2021.
- Joulani et al. [2020] Joulani, P., Raj, A., György, A., and Szepesvári, C. A simpler approach to accelerated stochastic optimization: Iterative averaging meets optimism. In ICML ’20: Proceedings of the 37th International Conference on Machine Learning, 2020.
- Juditsky et al. [2011] Juditsky, A., Nemirovski, A. S., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
- Kakade et al. [2012] Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13:1865–1890, 2012.
- Kavis et al. [2019] Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. UnixGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In NeurIPS ’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
- Korpelevich [1976] Korpelevich, G. M. The extragradient method for finding saddle points and other problems. Èkonom. i Mat. Metody, 12:747–756, 1976.
- Lan [2012] Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, June 2012.
- Lattimore & Szepesvári [2020] Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020.
- Levy et al. [2018] Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. In NeurIPS ’18: Proceedings of the 32nd International Conference of Neural Information Processing Systems, 2018.
- Li & Orabona [2019] Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In AISTATS ’19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
- McMahan & Streeter [2010] McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In COLT ’10: Proceedings of the 23rd Annual Conference on Learning Theory, 2010.
- Mertikopoulos & Sandholm [2016] Mertikopoulos, P. and Sandholm, W. H. Learning in games via reinforcement and regularization. Mathematics of Operations Research, 41(4):1297–1324, November 2016.
- Mertikopoulos & Staudigl [2018] Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, January 2018.
- Mertikopoulos & Zhou [2019] Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507, January 2019.
- Mertikopoulos et al. [2017] Mertikopoulos, P., Belmega, E. V., Negrel, R., and Sanguinetti, L. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process., 65(9):2277–2290, May 2017.
- Mertikopoulos et al. [2019] Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR ’19: Proceedings of the 2019 International Conference on Learning Representations, 2019.
- Nemirovski [2004] Nemirovski, A. S. Prox-method with rate of convergence $O(1/t)$ for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
- Nemirovski & Yudin [1983] Nemirovski, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, NY, 1983.
- Nesterov [2004] Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Number 87 in Applied Optimization. Kluwer Academic Publishers, 2004.
- Nesterov [2007] Nesterov, Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2):319–344, 2007.
- Nesterov [2015] Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
- Rakhlin & Sridharan [2013a] Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT ’13: Proceedings of the 26th Annual Conference on Learning Theory, 2013a.
- Rakhlin & Sridharan [2013b] Rakhlin, A. and Sridharan, K. Optimization, learning, and games with predictable sequences. In NIPS ’13: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2013b.
- Rockafellar [1970] Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
- Telatar [1999] Telatar, I. E. Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications and Related Technologies, 10(6):585–596, 1999.
- Vu et al. [2021] Vu, D. Q., Antonakopoulos, K., and Mertikopoulos, P. Fast routing under uncertainty: Adaptive learning in congestion games with exponential weights. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
- Ward et al. [2019] Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In ICML ’19: Proceedings of the 36th International Conference on Machine Learning, 2019.
- Yu et al. [2004] Yu, W., Rhee, W., Boyd, S. P., and Cioffi, J. M. Iterative water-filling for Gaussian vector multiple-access channels. IEEE Trans. Inf. Theory, 50(1):145–152, 2004.