
A Universal Black-Box Optimization Method with
Almost Dimension-Free Convergence Rate Guarantees

Kimon Antonakopoulos∗,⋄ — Laboratory for Information and Inference Systems, IEL STI, EPFL, Lausanne, Switzerland. [email protected]
Dong Quan Vu♯,⋄ — Safran Tech, Magny-Les-Hameaux, France. [email protected]
Volkan Cevher — [email protected]
Kfir Yehuda Levy — Technion, Haifa, Israel. [email protected]
Panayotis Mertikopoulos‡,§ — Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France; § Criteo AI Lab. [email protected]
Abstract.

Universal methods for optimization are designed to achieve theoretically optimal convergence rates without any prior knowledge of the problem's regularity parameters or the accuracy of the gradient oracle employed by the optimizer. In this regard, existing state-of-the-art algorithms achieve an $\mathcal{O}(1/T^2)$ value convergence rate in Lipschitz smooth problems with a perfect gradient oracle, and an $\mathcal{O}(1/\sqrt{T})$ convergence rate when the underlying problem is non-smooth and/or the gradient oracle is stochastic. On the downside, these methods do not take into account the problem's dimensionality, and this can have a catastrophic impact on the achieved convergence rate, in both theory and practice. Our paper aims to bridge this gap by providing a scalable universal gradient method – dubbed UnderGrad – whose oracle complexity is almost dimension-free in problems with a favorable geometry (like the simplex, linearly constrained semidefinite programs, and combinatorial bandits), while retaining the order-optimal dependence on $T$ described above. These “best-of-both-worlds” results are achieved via a primal-dual update scheme inspired by the dual extrapolation method for variational inequalities.

Key words and phrases:
Universal methods; dimension-freeness; dual extrapolation; rate interpolation.
2020 Mathematics Subject Classification:
Primary 90C25, 90C15; secondary 68Q32, 68T05.
This work was done when KA and DQV were affiliated with Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France.

1. Introduction

The analysis of first-order methods for convex minimization typically revolves around the following regularity conditions: (a) Lipschitz continuity of a problem's objective function and/or (b) Lipschitz continuity of the objective's gradients. Depending on these conditions and the quality of the gradient oracle available to the optimizer, the optimal convergence rates that can be obtained by an iterative first-order algorithm after $T$ oracle queries are:

  (1) $\mathcal{O}\big(\lVert\mathcal{X}\rVert\sqrt{(G^2+\sigma^2)/T}\big)$ if the problem's objective is $G$-Lipschitz continuous and the oracle's variance is $\sigma$.

  (2) $\mathcal{O}\big(L\lVert\mathcal{X}\rVert^2/T^2 + \sigma\lVert\mathcal{X}\rVert/\sqrt{T}\big)$ if the objective is $L$-Lipschitz smooth.

[In both cases, $\lVert\mathcal{X}\rVert \coloneqq \sup_{x,x'\in\mathcal{X}}\lVert x'-x\rVert$ denotes the diameter of the problem's domain $\mathcal{X}\subseteq\mathbb{R}^d$; for an in-depth treatment, see [38, 11] and references therein.]

This stark separation of black-box guarantees has led to an intense search for universal methods that are capable of interpolating smoothly between these rates without any prior knowledge of the problem’s regularity properties or the oracle’s noise profile. As far as we are aware, the first algorithm with order-optimal rate guarantees for unconstrained problems and no knowledge of the problem’s smoothness parameters was the AcceleGrad proposal of Levy et al. [28]. Subsequently, in the context of constrained convex problems (the focus of our work), Kavis et al. [24] combined the extra-gradient / mirror-prox algorithmic template of Korpelevich [25] and Nemirovski [36] with an “iterate averaging” scheme introduced by Cutkosky [16] to change the query structure of the base algorithm and make it more amenable to acceleration. In this way, Kavis et al. [24] obtained a universal extra-gradient algorithm – dubbed UnixGrad – which interpolates between the optimal rates mentioned above, without requiring any tuning.

Our contributions.

The starting point of our paper is the observation that, even though the rates in question are optimal in $T$, they may be highly suboptimal in $d$, the problem's dimensionality. For example, if the noise in the oracle has unit variance, $\sigma$ would scale as $\mathcal{O}(\sqrt{d})$; this represents a hidden dependence on $d$ which could have a catastrophic impact on the method's actual convergence rate. Likewise, in problems with a favorable geometry (like the $L^1$-ball, trace-constrained semidefinite programs, combinatorial bandits, etc.), methods based on the mirror descent [37] and mirror-prox [36] templates can achieve rates with a logarithmic (instead of polynomial) dependence on $d$.

Importantly, the UnixGrad algorithm of Kavis et al. [24] is itself based on the mirror-prox blueprint, so it would seem ideally suited to achieve convergence rates that are simultaneously optimal in $T$ and $d$. However, the method's guarantees depend crucially on the Bregman diameter of the problem's domain, a quantity which becomes infinite when the method is used with a regularization setup that can lead to almost dimension-free guarantees. This would seem to suggest that universality comes at the cost of scalability, leading to the following open question:

Is it possible to achieve almost dimension-free convergence rates
while retaining an order-optimal dependence on TT?

In this paper, we develop a novel adaptive algorithm, which we call universal dual extrapolation with reweighted gradients (UnderGrad), and which provides a positive answer to this question. Specifically, the value convergence rate of UnderGrad scales in terms of $G$, $\sigma$, $L$ and $T$ as:

  (1) $\mathcal{O}\big(\sqrt{R_h(G^2+\sigma^2)/T}\big)$ in non-smooth problems.

  (2) $\mathcal{O}\big(R_h L/T^2 + \sigma\sqrt{R_h/T}\big)$ in smooth problems.

In the above, the method's “range parameter” $R_h$ scales as $\mathcal{O}(\lVert\mathcal{X}\rVert^2)$ in the worst case and as $\mathcal{O}(\log d)$ in problems with a favorable geometry – that is, in problems where it is possible to attain almost dimension-free convergence rates [38, 11]. In this regard, UnderGrad seems to be the first method in the literature that concurrently achieves order-optimal rates in both $T$ and $d$, without any prior knowledge of the problem's level of smoothness.

To achieve this result, the UnderGrad algorithm combines the following basic ingredients:

  1. (1)

    A modified version of the dual extrapolation method of Nesterov [39] for solving variational inequalities.

  2. (2)

    A gradient “reweighting” scheme that allows gradients to enter the algorithm with increasing weights.

  3. (3)

    An iterate averaging scheme in the spirit of Cutkosky [16] which allows us to obtain an accelerated rate of convergence by means of an online-to-batch conversion.

The glue that holds these elements together is an adaptive learning rate inspired by Rakhlin & Sridharan [41, 42] which automatically rescales aggregated gradients by (a) a small, constant amount when the method approaches a solution where gradient differences vanish (as in the smooth, deterministic case); and (b) a factor of $\mathcal{O}(\sqrt{T})$ otherwise (thus providing the desired interpolation between smooth and non-smooth problems). In so doing, the proposed policy attains the correct step-size scaling and, with it, the desired optimal rates.

Related work.

The term “universality” was coined by Nesterov [40] whose universal primal gradient descent (UPGD) algorithm interpolates between the $\mathcal{O}(1/T^2)$ and $\mathcal{O}(1/\sqrt{T})$ rates for smooth and non-smooth problems respectively (assuming access to noiseless gradients in both cases). On the downside, UPGD relies on an Armijo-like line search to interpolate between smooth and non-smooth objectives, so it is not applicable to stochastic environments.

A partial work-around to this issue was achieved by the accelerated stochastic approximation (AC-SA) algorithm of Lan [26] which uses a mirror descent template and guarantees order-optimal rates for both noisy and noiseless oracles. However, to attain these rates, the AC-SA algorithm requires a precise estimate of the smoothness modulus of the problem’s objective, so it is not universal in this respect. Subsequent works on the topic have focused on attaining universal guarantees for composite problems [21], non-convex objectives [29, 46], preconditioned methods [21, 17], non-Lipschitz settings [3, 2, 4], specific applications [45], or variational inequalities / min-max problems [7, 4, 20, 5].

Of the generalist works above, some employ a Bregman regularization setup [2, 7], but the guarantees obtained therein either fall short of an accelerated $\mathcal{O}(1/T^2)$ convergence rate for Lipschitz smooth problems, or they depend on the problem's Bregman diameter – so they cannot be associated with a Bregman setup leading to almost dimension-free convergence rate guarantees. To the best of our knowledge, UnderGrad is the first method that manages to combine the “best of both worlds” in terms of universality with respect to $T$ and scalability with respect to $d$.

2. Preliminaries

2.1. Notation and basic definitions

Let $\mathcal{V}$ be a $d$-dimensional space with norm $\lVert\cdot\rVert$. In what follows, we will write $\mathcal{Y}\equiv\mathcal{V}^{\ast}$ for the dual of $\mathcal{V}$, $\langle y, x\rangle$ for the pairing between $y\in\mathcal{Y}$ and $x\in\mathcal{V}$, and $\lVert y\rVert_{\ast}\equiv\sup\{\langle y, x\rangle : \lVert x\rVert\leq 1\}$ for the dual norm on $\mathcal{Y}$. Given an extended-real-valued convex function $f\colon\mathcal{V}\to\mathbb{R}\cup\{\infty\}$, we will write $\operatorname{dom} f\equiv\{x\in\mathcal{V} : f(x)<\infty\}$ for its effective domain and $\partial f(x)\equiv\{y\in\mathcal{Y} : f(x')-f(x)-\langle y, x'-x\rangle\geq 0\ \text{for all } x'\in\mathcal{V}\}$ for the subdifferential of $f$ at $x\in\operatorname{dom} f$. Any element $g\in\partial f(x)$ will be called a subgradient of $f$ at $x$, and we will write $\operatorname{dom}\partial f\equiv\{x\in\operatorname{dom} f : \partial f(x)\neq\varnothing\}$ for the domain of subdifferentiability of $f$.

2.2. Problem setup and blanket assumptions

The main focus of our paper is the solution of convex minimization problems of the form

minimize   $f(x)$
subject to   $x \in \mathcal{X}$     (Opt)

where $\mathcal{X}$ is a closed convex subset of $\mathcal{V}$ and $f\colon\mathcal{V}\to\mathbb{R}\cup\{\infty\}$ is a convex function with $\operatorname{dom} f = \operatorname{dom}\partial f = \mathcal{X}$. To avoid trivialities, we will assume throughout that the solution set $\mathcal{X}^{\ast}\coloneqq\operatorname*{arg\,min} f$ of (Opt) is non-empty, and we will write $x^{\ast}$ for a generic minimizer of $f$.

Other than this blanket assumption, our main regularity requirements for $f$ will be as follows:

  (1) Lipschitz continuity:

      $\lvert f(x') - f(x)\rvert \leq G\lVert x'-x\rVert$     (LC)

      for some $G\geq 0$ and for all $x, x'\in\mathcal{X}$.

  (2) Lipschitz smoothness:

      $f(x') \leq f(x) + \langle\nabla f(x), x'-x\rangle + \frac{L}{2}\lVert x'-x\rVert^{2}$     (LS)

      for some $L\geq 0$, for all $x, x'\in\mathcal{X}$, and for any selection $\nabla f(x)\in\partial f(x)$.

Since $\operatorname{dom}\partial f = \mathcal{X}$, the above requirements are respectively equivalent to assuming that $f$ admits a selection of subgradients $\nabla f(x)\in\partial f(x)$ with the properties below:

  (1) Bounded (sub)gradient selection:

      $\lVert\nabla f(x)\rVert_{\ast}\leq G$     (BG)

      for some $G\geq 0$ and for all $x\in\mathcal{X}$.

  (2) Lipschitz (sub)gradient selection:

      $\lVert\nabla f(x')-\nabla f(x)\rVert_{\ast}\leq L\lVert x'-x\rVert$     (LG)

      for some $L\geq 0$ and for all $x, x'\in\mathcal{X}$.

In the rest of our paper, we will assume that $f$ satisfies at least one of (BG) or (LG).

Remark 1.

For posterity, we note here that the requirement (LG) does not imply that $\partial f(x)$ is a singleton.¹ [Footnote 1: Consider for example the case of $f(x)=x$ for $x\in[0,1]$ and $f(x)=\infty$ otherwise: $f$ clearly satisfies (BG)/(LS), even though $\partial f(0)$ and $\partial f(1)$ are infinite sets.] In any case, the directional derivative $f'(x;z) = \frac{d}{dt}\big|_{t=0} f(x+tz)$ of $f$ at $x\in\mathcal{X}$ along $z\in\mathcal{V}$ exists and is equal to $\langle\nabla f(x), z\rangle$ for all vectors of the form $z = x'-x$, $x'\in\mathcal{X}$. We will use this fact freely in the sequel.

2.3. The oracle model

To solve (Opt), we will consider iterative methods and algorithms with access to a stochastic first-order oracle (SFO), i.e., a black-box device that returns a (possibly random) estimate of a subgradient of $f$ at the point at which it was queried. Formally, following Nesterov [38], an SFO for $f$ is a measurable function $\operatorname{\mathsf{G}}\colon\mathcal{X}\times\Omega\to\mathcal{Y}$ such that

$\operatorname{\mathbb{E}}[\operatorname{\mathsf{G}}(x;\omega)] = \nabla f(x)$   for all $x\in\mathcal{X}$     (SFO)

where $(\Omega,\mathcal{F},\operatorname{\mathbb{P}})$ is a complete probability space and $\nabla f(x)$ is a selection of subgradients of $f$ as per (BG)/(LG). The oracle's statistical precision will then be measured by the associated noise level $\sigma \coloneqq \operatorname{ess}\sup_{\omega,x}\lVert\operatorname{\mathsf{G}}(x;\omega)-\nabla f(x)\rVert_{\ast}$ (assumed finite). In particular, if $\sigma=0$, $\operatorname{\mathsf{G}}$ will be called perfect (or deterministic); otherwise, $\operatorname{\mathsf{G}}$ will be called noisy.

In practice, the oracle is called repeatedly at a sequence of query points $x_t$ with a different random seed $\omega_t$ drawn according to $\operatorname{\mathbb{P}}$ at each time.² [Footnote 2: In the sequel, $t$ may take both integer and half-integer values.] In this way, at the $t$-th query to (SFO), the oracle $\operatorname{\mathsf{G}}$ returns the gradient signal

$g_t = \operatorname{\mathsf{G}}(x_t;\omega_t) = \nabla f(x_t) + U_t$     (1)

where $U_t$ denotes the “gradient noise” of the oracle (obviously, $U_t\equiv 0$ if the oracle is perfect). For measurability purposes, we will write $\mathcal{F}_t$ for the history (adapted filtration) of $x_t$, so $x_t$ is $\mathcal{F}_t$-measurable (by definition) but $\omega_t$, $g_t$ and $U_t$ are not. In particular, conditioning on $\mathcal{F}_t$, we have $\operatorname{\mathbb{E}}[g_t \,|\, \mathcal{F}_t] = \nabla f(x_t)$ and $\operatorname{\mathbb{E}}[U_t \,|\, \mathcal{F}_t] = 0$, justifying in this way the terminology “gradient noise” for $U_t$.
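To fix ideas, the snippet below sketches one way such an oracle could be implemented in code; the quadratic objective, the uniform noise model, and the use of the Euclidean norm are illustrative assumptions, not part of the setup above.

```python
import numpy as np

def make_oracle(grad_fn, sigma, rng=None):
    """Return a stochastic first-order oracle G(x; omega) = grad_fn(x) + U,
    with U zero-mean (by symmetry of the noise) and ||U|| <= sigma."""
    rng = np.random.default_rng() if rng is None else rng

    def oracle(x):
        u = rng.uniform(-1.0, 1.0, size=x.shape)      # symmetric, zero-mean noise
        u *= sigma / max(np.linalg.norm(u), 1e-12)    # rescale so ||U|| <= sigma (Euclidean norm here)
        return grad_fn(x) + u

    return oracle

# Illustrative objective: f(x) = 0.5 * ||x - c||^2 with gradient x - c.
c = np.array([0.2, 0.5, 0.3])
G = make_oracle(lambda x: x - c, sigma=0.1)
g_t = G(np.ones(3) / 3)   # one oracle query: g_t = grad f(x_t) + U_t
```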

Remark 2.

The oracle model detailed above is not the only one possible, but it is very widely used in the analysis of parameter-agnostic and adaptive methods, cf. [28, 24, 46, 2] and references therein. In view of this, we will not examine either finer or coarser assumptions for (SFO).

We close this section by noting that the best convergence rates that can be achieved by an iterative algorithm that outputs a candidate solution $\bar{x}_T\in\mathcal{X}$ after $T$ queries to (SFO) are:³ [Footnote 3: In general, the query and output points – $x_T$ and $\bar{x}_T$ respectively – need not coincide, hence the different notation. The only assumption for the rates provided below is that the output point $\bar{x}_T$ is an affine combination of $x_1, g_1, \dotsc, x_T, g_T$ [38, 11].]

  (1) $f(\bar{x}_T)-\min f = \mathcal{O}(1/\sqrt{T})$ if $f$ satisfies (BG) and $\operatorname{\mathsf{G}}$ is deterministic.

  (2) $f(\bar{x}_T)-\min f = \mathcal{O}(1/T^{2})$ if $f$ satisfies (LG) and $\operatorname{\mathsf{G}}$ is deterministic.

  (3) $\operatorname{\mathbb{E}}[f(\bar{x}_T)-\min f] = \mathcal{O}(1/\sqrt{T})$ if $\operatorname{\mathsf{G}}$ is stochastic.

In general, without finer assumptions on $f$ or $\operatorname{\mathsf{G}}$, the dependence of these rates on $T$ cannot be improved [38, 11]; we will revisit this issue several times in the sequel.

3. Regularization, universality, and the curse of dimensionality

To set the stage for the analysis to come, we discuss below the properties of two algorithmic frameworks – one non-adaptive, the other adaptive – based on the mirror-prox template [36]. Our aim in doing this will be to set a baseline for the sequel as well as to explore the impact of the problem’s dimensionality on the attained rates of convergence.

3.1. Motivating examples

As a first step, we present three archetypal problems to motivate and illustrate the general setup that follows.

Example 1 (Resource allocation).

Consider a set of computing resources (GPUs in a cluster, servers in a computing grid, …) indexed by $s\in\mathcal{S}=\{1,\dotsc,d\}$. Each resource is capable of serving a stream of computing demands that arrive at a rate of $\rho$ units per time: if the optimizer assigns a load of $x_s\geq 0$ to the $s$-th resource, the marginal cost incurred is $c_s(x_s)$ per unit served, where $c_s\colon[0,\rho]\to\mathbb{R}_{+}$ is the cost function of the $s$-th resource (assumed convex, differentiable, and increasing in $x_s$). Taking $\rho=1$ for simplicity, the goal of the optimizer is to minimize the aggregate cost $f(x)=\sum_{s=1}^{d} x_s c_s(x_s)$, leading to a convex minimization problem over the unit $d$-dimensional simplex $\mathcal{X}=\operatorname{\Delta}(\mathcal{S})=\{x\in\mathbb{R}_{+}^{d} : \sum_s x_s = 1\}$.
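As a minimal illustration of this setup (with hypothetical linear marginal costs $c_s(x_s) = a_s x_s$, so that $f$ becomes a convex quadratic on the simplex):

```python
import numpy as np

d = 4
a = np.array([1.0, 2.0, 0.5, 3.0])        # hypothetical cost slopes: c_s(x_s) = a_s * x_s

def f(x):
    """Aggregate cost f(x) = sum_s x_s * c_s(x_s) over the simplex."""
    return np.sum(x * (a * x))

def grad_f(x):
    """Gradient of f: d/dx_s [a_s * x_s^2] = 2 * a_s * x_s."""
    return 2.0 * a * x

x = np.full(d, 1.0 / d)                   # uniform load over the d resources
print(f(x), grad_f(x))
```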

Example 2 (Input covariance matrix optimization).

Consider a Gaussian vector channel in the spirit of [44, 47]: the encoder controls the covariance matrix $\mathbf{X}=\operatorname{\mathbb{E}}[\mathbf{x}\mathbf{x}^{\dagger}]$ of the Gaussian input signal $\mathbf{x}\in\mathbb{C}^{M}$ and seeks to maximize the transfer rate of the output signal $\mathbf{y}=\mathbf{H}\mathbf{x}+\mathbf{z}$, where $\mathbf{z}\in\mathbb{C}^{N}$ is the ambient noise in the channel and $\mathbf{H}\in\mathbb{C}^{N\times M}$ is the channel's transfer matrix. By the Shannon–Telatar formula [44], this boils down to maximizing the capacity function

$R(\mathbf{X}) = \operatorname{\mathbb{E}}\left[\log\det\left(\mathbf{I}+\mathbf{H}\mathbf{X}\mathbf{H}^{\dagger}\right)\right]$     (2)

subject to the constraint $\operatorname{tr}(\mathbf{X})\leq P$, where $P$ denotes the encoder's maximum input power and the expectation in (2) is taken over the statistics of the (possibly deterministic) matrix $\mathbf{H}$. Since $R$ is concave in $\mathbf{X}$ [10, 47], we obtain a minimization problem of the form (Opt) over the spectrahedron $\mathcal{D}=\{\mathbf{X}\succcurlyeq 0 : \operatorname{tr}(\mathbf{X})\leq P\}$. Since $\mathbf{X}$ is Hermitian, $\mathcal{D}$ can be seen as a convex body of $\mathbb{R}^{d}$ where $d=M^{2}$; in the optimization literature, this is sometimes referred to as the “spectrahedron setup” [22].

Example 3 (Combinatorial bandits).

In bandit linear optimization problems, the optimizer is given a finite set of $n$ possible actions $\mathcal{A}\subseteq\{0,1\}^{d}$, i.e., each action $\alpha\in\mathcal{A}$ is a $d$-dimensional binary vector indicating whether the $i$-th component is “on” or “off”. The optimizer then chooses an action $\alpha\in\mathcal{A}$ based on a mixed strategy $p\in\operatorname{\Delta}(\mathcal{A})$ and incurs the mean loss

$\ell(p;\omega) = \operatorname{\mathbb{E}}\left[\sum_{\alpha\in\mathcal{A}} p_{\alpha}\langle\alpha, \omega\rangle\right]$     (3)

where $\omega$ is a random vector with values in $[0,1]^{d}$ (but otherwise unknown distribution). In many cases of interest – such as slate recommendation and shortest-path problems – the cardinality of $\mathcal{A}$ is exponential in $d$, so it is computationally prohibitive to state the resulting minimization problem in terms of $p$. Instead, writing $x_i = \sum_{\alpha\in\mathcal{A}} p_{\alpha}\alpha_i$ for the probability of the $i$-th component being “on” under $p$, the optimizer's objective can be rewritten more compactly as $f(x) = \operatorname{\mathbb{E}}[\langle x, \omega\rangle]$ with $x$ constrained to lie on the $d$-dimensional convex hull $\mathcal{X}=\operatorname{conv}(\mathcal{A})$ of $\mathcal{A}$ in $\mathbb{R}^{d}$. In the literature on multi-armed bandits, this setup is known as a combinatorial bandit; for an in-depth treatment, see [27, 14, 19, 13] and the many references cited therein.
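To illustrate the reformulation, the quick numerical check below (with a toy action set of all $m$-subsets of $\{1,\dotsc,d\}$ and a hypothetical loss distribution, both of which are assumptions made for the example) verifies that the mean loss of a mixed strategy $p$ coincides with $\langle x, \operatorname{\mathbb{E}}\omega\rangle$ for $x = \sum_{\alpha} p_{\alpha}\alpha$.

```python
import itertools
import numpy as np

d, m = 5, 2
# Toy action set: all binary vectors with exactly m components "on".
actions = np.array([a for a in itertools.product([0, 1], repeat=d) if sum(a) == m])

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(len(actions)))      # a mixed strategy over the actions
mean_omega = rng.uniform(0.0, 1.0, size=d)    # E[omega] of some loss distribution

x = p @ actions                               # x_i = sum_alpha p_alpha * alpha_i
loss_via_p = p @ (actions @ mean_omega)       # sum_alpha p_alpha * <alpha, E omega>
loss_via_x = x @ mean_omega                   # f(x) = <x, E omega>
print(np.isclose(loss_via_p, loss_via_x))     # True: the two formulations agree
```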

Examples 1, 2 and 3 all suffer from the “curse of dimensionality”: for instance, the dimensionality of a vector Gaussian channel with $M=256$ input entries is $d\approx 6.5\times 10^{4}$, while a combinatorial bandit for recommendation systems may have upwards of several million arms. Nonetheless, these examples also share a number of geometric properties that make it possible to design scalable optimization algorithms with (almost) dimension-free convergence rate guarantees. We elaborate on this in the next section.

3.2. The mirror-prox template

We begin by considering the well-known mirror-prox (MP) method of Nemirovski [36]. Following [22, 35], this is defined via the recursion

$X_{t+1/2} = P_{X_t}(-\gamma_t g_t)$     (MP)
$X_{t+1} = P_{X_t}(-\gamma_t g_{t+1/2})$

where

  (1) $t = 1, 2, \dotsc$ denotes the method's iteration counter (for the origins of the half-integer notation, see Facchinei & Pang [18] and references therein).

  (2) $\gamma_t > 0$ is the algorithm's step-size sequence.

  (3) $g_t$ and $g_{t+1/2}$ are stochastic gradients of $f$ obtained by querying the oracle $\operatorname{\mathsf{G}}$ at $X_t$ and $X_{t+1/2}$ respectively.

  (4) $P_x(\cdot)$ is a generalized projection operator known as the method's “prox-mapping” (more on this later).

The most elementary instance of (MP) is the extra-gradient (EG) algorithm of Korpelevich [25], in which case the method’s prox-mapping is the Euclidean projector

$P_x(y) = \operatorname{\Pi}_{\mathcal{X}}(x+y) \coloneqq \operatorname*{arg\,min}_{x'\in\mathcal{X}} \lVert x+y-x'\rVert_2$     (4)

for all $x\in\mathcal{X}$, $y\in\mathcal{Y}$. More generally, the prox-mapping in (MP) is defined in terms of a Bregman regularizer as follows:

Regularizer | Domain ($\mathcal{X}$) | Breg. diam. ($B_h$) | Range ($R_h$) | Shape ($\chi$) | Rate ($L=\infty$) | Rate ($L<\infty$, $\sigma=0$)
Euclidean | any of the below | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ | $\sqrt{d}$ | $\mathcal{O}\big(\sqrt{d/T}\big)$ | $\mathcal{O}(d/T)$
Entropic | simplex | $\infty$ | $\log d$ | $1$ | $\mathcal{O}\big(\sqrt{\log d/T}\big)$ | $\mathcal{O}(\log d/T)$
Quantum | spectrahedron | $\infty$ | $\log d$ | $1$ | $\mathcal{O}\big(\sqrt{\log d/T}\big)$ | $\mathcal{O}(\log d/T)$
ComBand | $\operatorname{conv}(\mathcal{A})$ | $\infty$ | $\mathcal{O}(\log d)$ | $1$ | $\mathcal{O}\big(\sqrt{\log d/T}\big)$ | $\mathcal{O}(\log d/T)$
Table 1. The convergence rate of (MP) in terms of $d$ and $T$ for different regularizers. In the combinatorial setup of Example 3, the unnormalized entropy has $R_h = m(1+\log(d/m))$, where $m = \max_{\alpha\in\mathcal{A}}\lVert\alpha\rVert_1$ is the maximum number of elements of $\{1,\dotsc,d\}$ that can be simultaneously “on” [27, Chap. 30]. In many applications, $m$ does not scale with $d$, so it has been absorbed in the $\mathcal{O}(\cdot)$ notation; other than that, $\mathcal{O}(\cdot)$ contains only universal constants.
Definition 1.

A Bregman regularizer on $\mathcal{X}$ is a convex function $h\colon\mathcal{V}\to\mathbb{R}\cup\{\infty\}$ such that

  (1) $\operatorname{dom} h = \mathcal{X}$ and $h$ is continuous on $\mathcal{X}$.

  (2) The subdifferential of $h$ admits a continuous selection, i.e., there exists a continuous mapping $\nabla h\colon\operatorname{dom}\partial h\to\mathcal{Y}$ with $\nabla h(x)\in\partial h(x)$ for all $x\in\operatorname{dom}\partial h$.

  (3) $h$ is strongly convex on $\mathcal{X}$, i.e.,

      $h(x') \geq h(x) + \langle\nabla h(x), x'-x\rangle + \tfrac{1}{2} K_h\lVert x'-x\rVert^2$     (5)

      for some $K_h > 0$ and all $x\in\operatorname{dom}\partial h$, $x'\in\mathcal{X}$.

We also define the Bregman divergence of $h$ as

$D(x', x) = h(x') - h(x) - \langle\nabla h(x), x'-x\rangle$     (6)

and the induced prox-mapping as

$P_x(y) = \operatorname*{arg\,min}_{x'\in\mathcal{X}} \{\langle y, x-x'\rangle + D(x', x)\}$     (7)

for all $x\in\mathcal{X}_h$, $x'\in\mathcal{X}$, and all $y\in\mathcal{Y}$.
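As an illustration, for the entropic regularizer of Example 1 the prox-mapping (7) admits the familiar multiplicative-weights closed form $P_x(y)_i = x_i e^{y_i}/\sum_j x_j e^{y_j}$; the snippet below is a minimal sketch of this special case (the shift by $\max_i y_i$ is a numerical-stability detail, not part of the definition).

```python
import numpy as np

def entropic_prox(x, y):
    """Prox-mapping (7) for h(x) = sum_i x_i log x_i on the simplex:
    P_x(y)_i = x_i * exp(y_i) / sum_j x_j * exp(y_j)  (multiplicative weights)."""
    z = x * np.exp(y - y.max())          # subtract max(y) for numerical stability
    return z / z.sum()

x = np.full(3, 1.0 / 3)
print(entropic_prox(x, np.array([0.1, -0.2, 0.05])))
```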

Remark.

The set $\mathcal{X}_h \coloneqq \operatorname{dom}\partial h$ is often referred to as the prox-domain of $h$; by standard results in convex analysis, we have $\operatorname{ri}\mathcal{X} \subseteq \mathcal{X}_h \subseteq \mathcal{X}$ [43, Chap. 26].

In terms of output, the candidate solution returned by (MP) after $T$ iterations is the so-called “ergodic average”

$\bar{X}_T = \dfrac{\sum_{t=1}^{T}\gamma_t X_{t+1/2}}{\sum_{t=1}^{T}\gamma_t}.$     (8)

Then, assuming the method's step-size $\gamma_t$ is chosen appropriately (more on this below), $\bar{X}_T$ enjoys the following guarantees [22, 42]:

  (a) If $f$ satisfies (BG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_T) - \min f] = \mathcal{O}\left(\sqrt{\dfrac{G^2+\sigma^2}{K_h}\dfrac{D_1}{T}}\right)$     (9a)

  (b) If $f$ satisfies (LG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_T) - \min f] = \mathcal{O}\left(\dfrac{L D_1}{K_h T} + \sigma\sqrt{\dfrac{D_1}{K_h T}}\right)$     (9b)

In the above, $D_1 = D(x^{\ast}, X_1)$ is the minimum Bregman divergence between a solution $x^{\ast}$ of (Opt) and the initial state $X_1$ of (MP). In particular, if (MP) is initialized at the prox-center $x_c = \operatorname*{arg\,min} h$ of $\mathcal{X}$, we have

$D_1 \leq h(x^{\ast}) - \min h \leq \max h - \min h \eqqcolon R_h.$     (10)

We will refer to $R_h = \max h - \min h$ as the range of $h$.

To quantify the interplay between the problem's dimensionality and the rate guarantees (9) for (MP), it will be convenient to introduce the normalized regularity parameters

$G_h = \dfrac{G}{\sqrt{K_h}}, \quad L_h = \dfrac{L}{K_h}, \quad\text{and}\quad \sigma_h = \dfrac{\sigma}{\sqrt{K_h}}$     (11)

and the associated shape factor

$\chi = \begin{cases}\sqrt{G_h^2+\sigma_h^2} & \text{if } L=\infty,\\ \sqrt{L_h} & \text{if } L<\infty \text{ and } \sigma=0,\\ \sigma_h & \text{if } L<\infty \text{ and } \sigma>0.\end{cases}$     (12)

Since at least one of the terms $G/\sqrt{K_h}$, $L/K_h$ and $\sigma/\sqrt{K_h}$ appears in (9), it follows that the leading term in $T$ scales as $\mathcal{O}(\chi\sqrt{D_1/T})$ in non-smooth / stochastic environments, and as $\mathcal{O}(\chi^2 D_1/T)$ in smooth, deterministic problems.

The importance of the normalized parameters $G_h$, $L_h$, $\sigma_h$ and the shape factor $\chi$ lies in the fact that they do not depend on the ambient norm $\lVert\cdot\rVert$ (a choice which, to a certain extent, is arbitrary). Indeed, if $\lVert\cdot\rVert$ and $\lVert\cdot\rVert'$ are two norms on $\mathcal{V}$ that are related as $\lVert\cdot\rVert \leq \mu\lVert\cdot\rVert'$ for some $\mu>0$, it is straightforward to verify that $h$ is $(\mu^2 K_h)$-strongly convex relative to $\lVert\cdot\rVert'$. Likewise, in terms of dual norms we have $\lVert\cdot\rVert_{\ast} \geq (1/\mu)\lVert\cdot\rVert'_{\ast}$, so the constants $G$, $\sigma$ and $L$ would respectively become $\mu G$, $\mu\sigma$ and $\mu^2 L$ when computed under $\lVert\cdot\rVert'$. In general, these inequalities are all tight, so a change in norm does not affect the shape factor $\chi$; accordingly, any dependence of $\chi$ on $d$ will be propagated verbatim to the guarantees (9).

In Table 1, we provide the values of $R_h$ and $\chi$ for the following cases:

  (1) The Euclidean regularizer $h(x) = \lVert x\rVert_2^2/2$ that gives rise to the extra-gradient algorithm (4).

  (2) The entropic regularizer $h(x) = \sum_{i=1}^{d} x_i\log x_i$ for the simplex setup of Example 1.

  (3) The von Neumann regularizer $h(\mathbf{X}) = \operatorname{tr}(\mathbf{X}\log\mathbf{X}) + (1-\operatorname{tr}\mathbf{X})\log(1-\operatorname{tr}\mathbf{X})$ for the spectrahedron setup of Example 2.

  (4) The unnormalized entropy $h(x) = \sum_{i=1}^{d}(x_i\log x_i - x_i)$ for the combinatorial setup of Example 3.

These derivations are standard, so we omit the details. For posterity, we only note that the logarithmic dependence on $d$ is asymptotically optimal, cf. [13, 12] and references therein.
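As an illustration of the entries of Table 1, the range and strong-convexity modulus of the entropic regularizer on the simplex can be read off directly (a standard calculation, sketched here for completeness):

```latex
% For h(x) = \sum_i x_i \log x_i on the simplex \Delta(\mathcal{S}):
% the maximum 0 is attained at the vertices, the minimum -\log d at the uniform point, so
\[
R_h = \max_{x\in\Delta} h(x) - \min_{x\in\Delta} h(x)
    = 0 - \sum_{i=1}^{d} \tfrac{1}{d}\log\tfrac{1}{d}
    = \log d .
\]
% Moreover, Pinsker's inequality gives 1-strong convexity with respect to \|\cdot\|_1,
% i.e., K_h = 1; hence, with gradients that are O(1) in the dual \ell^\infty norm,
% the shape factor of the entropic setup is \chi = O(1), as reported in Table 1.
```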

3.3. The UnixGrad algorithm

As can be seen from Table 1, the mirror-prox algorithm achieves an almost dimension-free rate of convergence when used with a suitable regularizer. However, this comes with two important caveats: First, the algorithm's rate in the smooth case falls short of the optimal $\mathcal{O}(1/T^2)$ dependence in $T$, so (MP) is suboptimal in this regard. Second, to achieve the rates presented in Eq. (9), the algorithm's step-size $\gamma_t$ must be tuned with prior knowledge of the problem's parameters: in particular, under (BG), the algorithm must be run with step-size $\gamma_t \propto 1/\sqrt{(G^2+\sigma^2)T}$ while, under (LG), the algorithm requires $\gamma_t = K_h/L$ if $\sigma=0$ and $\gamma_t \propto 1/(\sigma\sqrt{T})$ otherwise. This creates an undesirable state of affairs because the parameters $G$, $L$ and $\sigma$ are usually not known in advance, and (MP) can – and does – fail to converge if run with an untuned step-size.

In the rest of this section, we briefly discuss the UnixGrad algorithm of Kavis et al. [24] which expands on the mirror-prox template in the following two crucial ways: (a) it introduces an iterate-averaging mechanism in the spirit of Cutkosky [16] to enable acceleration; and (b) it employs an adaptive step-size policy that does not require any tuning by the optimizer. In so doing, UnixGrad interpolates smoothly between the optimal convergence rates described in Section 2 without requiring any prior knowledge of $G$, $L$ or $\sigma$.

Concretely, UnixGrad proceeds as (MP), but instead of querying $\operatorname{\mathsf{G}}$ at $X_t$ and $X_{t+1/2}$, it introduces the weighted query states

$\bar{X}_t = \dfrac{\alpha_t X_t + \sum_{s=1}^{t-1}\alpha_s X_{s+1/2}}{\sum_{s=1}^{t}\alpha_s}$     (13)
$\bar{X}_{t+1/2} = \dfrac{\alpha_t X_{t+1/2} + \sum_{s=1}^{t-1}\alpha_s X_{s+1/2}}{\sum_{s=1}^{t}\alpha_s}$

where $\alpha_t$ is a “gradient weighting” parameter. Then, building on an idea by Rakhlin & Sridharan [41, 42], the oracle queries $g_t \leftarrow \operatorname{\mathsf{G}}(\bar{X}_t;\omega_t)$ and $g_{t+1/2} \leftarrow \operatorname{\mathsf{G}}(\bar{X}_{t+1/2};\omega_{t+1/2})$ are used to update the method's step-size as

$\gamma_t = \dfrac{B_h\alpha_t}{\sqrt{1+\sum_{s=1}^{t-1}\alpha_s^2\lVert g_{s+1/2}-g_s\rVert_{\ast}^2}}$     (14)

where

$B_h = \sup_{x\in\mathcal{X},\, x'\in\mathcal{X}_h}\sqrt{2D(x,x')}$     (15)

is the so-called Bregman diameter of $\mathcal{X}$.

With all this in hand, Kavis et al. [24] provide the following bounds if UnixGrad is run with $\alpha_t = t$:

  (a) If $f$ satisfies (BG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_{T+1/2}) - \min f] = \mathcal{O}\bigg(\dfrac{B_h\sqrt{G^2+\sigma^2}}{\sqrt{K_h T}}\bigg)$     (16a)

  (b) If $f$ satisfies (LG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_{T+1/2}) - \min f] = \mathcal{O}\left(\dfrac{B_h^2 L}{K_h T^2} + \dfrac{B_h\sigma}{\sqrt{K_h T}}\right)$     (16b)

As we mentioned in Section 2, the bounds (16) cannot be improved in terms of $T$ without further assumptions, so UnixGrad is universally optimal in this regard.

That being said, these guarantees also uncover an important limitation of UnixGrad, namely that the bounds (16) become void when the method is used in conjunction with one of the non-Euclidean frameworks of Examples 1, 2 and 3. For example, the Bregman diameter of the simplex under the entropic regularizer is $B_h = \sup_{x,x'}\sum_i x_i\log(x_i/x'_i) = \infty$, so the multiplicative constants in (16) become infinite (and the bounds themselves become meaningless). However, since the use of these regularizers is crucial to obtain the scalable, dimension-free convergence rates reported in Table 1,⁴ [Footnote 4: In particular, since the shape factor of the Euclidean regularizer is $\chi=\sqrt{d}$, employing UnixGrad with ordinary Euclidean projections would not lead to scalable guarantees.] we are led to the open question we stated before:

Is it possible to achieve almost dimension-free convergence rates
while retaining an order-optimal dependence on TT?

We address this question in the next section.

4. Universal dual extrapolation

The point of departure of our analysis is the observation that gradient queries enter (MP) with decreasing weights. Specifically, if UnixGrad is run with $\alpha_t = t$ (a choice which is necessary to have a shot at acceleration), the denominator of (14) may grow as fast as $\Theta(t^{3/2})$ in the non-smooth/stochastic case, leading to an asymptotic $\mathcal{O}(1/\sqrt{t})$ worst-case behavior for $\gamma_t$. In fact, even under the ansatz that the algorithm's query points converge to a minimizer of $f$ at an accelerated rate, the denominator of (14) may still grow as $\Theta(t)$, indicating that $\gamma_t$ will, at best, stabilize to a positive value as $t\to\infty$. This feature of the step-size rule (14) is somewhat counterintuitive because conventional wisdom would suggest that (a) recent queries are more useful than older, potentially obsolete ones; and (b) gradients should be “inflated” as the method's query points approach a zero-gradient solution in order to maintain a fast rate of convergence.

The problem with a vanishing step-size becomes especially pronounced if the method is used with a non-Euclidean regularizer (which is what one would wish to do in order to obtain scalable convergence guarantees). To see this, consider the iterates of the mirror-prox template generated by the regularizer $h(x) = x\log x$ on $\mathcal{X} = [0,\infty)$.⁵ [Footnote 5: Strictly speaking, this regularizer is not strongly convex over $[0,\infty)$, but this detail is not relevant for the question at hand.] In this case, the induced prox-mapping is $P_x(y) = x\exp(y)$, leading to the recursion

$x^{+} = P_x(-\gamma v) = x\exp(-\gamma v).$     (17)

Therefore, if the problem's objective function attains its minimum at $0$, the actual steps of the method scale as $x^{+} - x = \mathcal{O}(x)$ for small $x$, so it is imperative to maintain a large step-size to avoid stalling the algorithm.
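A quick numerical illustration of this scaling (assuming a fixed step-size $\gamma = 0.1$ and a constant unit gradient, both chosen purely for the example):

```python
import numpy as np

gamma, v = 0.1, 1.0                      # fixed step-size and a constant gradient
for x in [1e-1, 1e-2, 1e-3, 1e-4]:
    x_plus = x * np.exp(-gamma * v)      # the prox step (17): x+ = x * exp(-gamma * v)
    print(f"x = {x:.0e}   actual movement |x+ - x| = {abs(x_plus - x):.2e}")
# The movement shrinks proportionally to x, so a vanishing step-size would
# effectively stall the iterates near the boundary point x = 0.
```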

This scaling issue is at the heart of the dual extrapolation (DE) method of Nesterov [39]. Originally designed to solve variational inequalities and related problems, the method proceeds by (i) using a prox-step to generate the method's leading state and get a “look-ahead” gradient query; (ii) aggregating gradient information with a constant weight; and, finally, (iii) using a “primal-dual” mirror map to update the method's base state. Formally, the algorithm follows the iterative update rule

$X_{t+1/2} = P_{X_t}(-\gamma_t g_t)$     (DE)
$Y_{t+1} = Y_t - g_{t+1/2}$
$X_{t+1} = Q(\gamma_{t+1} Y_{t+1})$

where the so-called “mirror map” $Q\colon\mathcal{Y}\to\mathcal{X}$ is defined as

$Q(y) = \operatorname*{arg\,max}_{x\in\mathcal{X}}\{\langle y, x\rangle - h(x)\}.$     (18)

Unfortunately, the template (DE) is not sufficient for our purposes, for two main reasons: First, the method still couples a prox-step with a variable (decreasing) step-size update; this is not problematic for the application of the method to VIs (where the achievable rates are different), but it is not otherwise favorable for acceleration.

In addition to the above, the method's gradient pre-multiplier is the same as its post-multiplier ($\gamma_t$ in both cases), and it is not possible to differentiate these parameters while maintaining optimal rates [39]. However, this differentiation is essential for acceleration, especially when $\gamma_t$ cannot be tuned with prior knowledge of the problem's parameters.

Our approach to overcome this issue consists of: (a) eliminating the prox-step altogether in favor of a mirror step; and (b) separating the weights used for introducing new gradients to the algorithm from those used to generate the base and leading states. To formalize this, we introduce below the “universal” dual extrapolation template:

$Y_{t+1/2} = Y_t - \alpha_t g_t$,   $X_{t+1/2} = Q(\eta_t Y_{t+1/2})$     (UDE)
$Y_{t+1} = Y_t - \alpha_t g_{t+1/2}$,   $X_{t+1} = Q(\eta_{t+1} Y_{t+1})$

In the above, the gradient signals $g_t$ and $g_{t+1/2}$ are considered generic and the query points are not specified. To get a concrete algorithm, we will use the weighting scheme of Kavis et al. [24] and query the oracle at the averaged states $\bar{X}_t$ and $\bar{X}_{t+1/2}$ introduced previously in (13). Finally, regarding the algorithm's gradient weighting and averaging parameters ($\alpha_t$ and $\eta_t$ respectively), we will use an increasing weight for the method's step-size, $\alpha_t = t$, and the adaptive rule

$\eta_t = \dfrac{b}{\sqrt{\tau^2 + \sum_{s=1}^{t-1}\alpha_s^2\lVert g_{s+1/2}-g_s\rVert_{\ast}^2}}$     (19)

for the method's learning rate (the parameters $\tau$ and $b$ are discussed below, and we are using the standard convention that empty sums are taken equal to zero).

Algorithm 1: UnderGrad (universal dual extrapolation with reweighted gradients)

Parameters: $\tau \leftarrow \sqrt{K_h}$;  $b \leftarrow \sqrt{K_h(R_h + K_h\lVert\mathcal{X}\rVert^2)}$
Initialize: $Y_1 \leftarrow 0$;  $Z_1 \leftarrow 0$;  $S_1 \leftarrow \tau^2$
for $t = 1, 2, \dotsc, T$ do
    $\eta_t \leftarrow b/\sqrt{S_t}$     // set learning rate
    $X_t \leftarrow Q(\eta_t Y_t)$     // mirror step
    $\bar{X}_t \leftarrow (\alpha_t X_t + Z_t)\big/\sum_{s=1}^{t}\alpha_s$     // mixing
    $g_t \leftarrow \operatorname{\mathsf{G}}(\bar{X}_t;\omega_t)$     // oracle query
    $Y_{t+1/2} \leftarrow Y_t - \alpha_t g_t$     // dual step
    $X_{t+1/2} \leftarrow Q(\eta_t Y_{t+1/2})$     // mirror step
    $\bar{X}_{t+1/2} \leftarrow (\alpha_t X_{t+1/2} + Z_t)\big/\sum_{s=1}^{t}\alpha_s$     // mixing
    $g_{t+1/2} \leftarrow \operatorname{\mathsf{G}}(\bar{X}_{t+1/2};\omega_{t+1/2})$     // oracle query
    $Y_{t+1} \leftarrow Y_t - \alpha_t g_{t+1/2}$     // dual step
    $S_{t+1} \leftarrow S_t + \alpha_t^2\lVert g_{t+1/2} - g_t\rVert_{\ast}^2$     // precondition
    $Z_{t+1} \leftarrow Z_t + \alpha_t X_{t+1/2}$     // update mixing state
end for
return $\bar{x}_T \leftarrow \bar{X}_{T+1/2}$
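For concreteness, the following is a minimal Python sketch of Algorithm 1 in the simplex setup of Example 1, using the entropic mirror map $Q(y) = \operatorname{softmax}(y)$ and the $\ell^1$/$\ell^\infty$ norm pair; the quadratic test objective, the perfect oracle, and the constant $\lVert\mathcal{X}\rVert = 2$ (the $\ell^1$ diameter of the simplex) are illustrative assumptions rather than part of the paper's specification.

```python
import numpy as np

def softmax(y):
    """Entropic mirror map Q(y) = argmax_{x in simplex} {<y, x> - sum_i x_i log x_i}."""
    z = np.exp(y - y.max())
    return z / z.sum()

def undergrad_simplex(oracle, d, T):
    """Sketch of Algorithm 1 (UnderGrad) on the unit simplex with the entropic
    regularizer (K_h = 1, R_h = log d); parameter choices follow Theorem 1."""
    tau = 1.0                                   # tau = sqrt(K_h)
    diam = 2.0                                  # assumed l1-diameter of the simplex
    b = np.sqrt(np.log(d) + diam ** 2)          # b = sqrt(K_h * (R_h + K_h * ||X||^2))

    Y = np.zeros(d)                             # dual (base) state
    Z = np.zeros(d)                             # weighted sum of leading states
    S = tau ** 2                                # preconditioner S_1 = tau^2
    A = 0.0                                     # running sum of the weights alpha_s

    for t in range(1, T + 1):
        alpha = float(t)                        # alpha_t = t
        A += alpha
        eta = b / np.sqrt(S)                    # learning rate (19)

        X = softmax(eta * Y)                    # mirror step
        X_bar = (alpha * X + Z) / A             # mixing: query state  \bar X_t
        g = oracle(X_bar)                       # oracle query

        Y_half = Y - alpha * g                  # dual step
        X_half = softmax(eta * Y_half)          # mirror step (leading state)
        X_bar_half = (alpha * X_half + Z) / A   # mixing: query state  \bar X_{t+1/2}
        g_half = oracle(X_bar_half)             # oracle query

        Y = Y - alpha * g_half                  # dual step
        S += alpha ** 2 * np.max(np.abs(g_half - g)) ** 2   # dual (l_inf) norm squared
        Z += alpha * X_half                     # update mixing state

    return X_bar_half                           # output \bar X_{T+1/2}

# Illustrative test: f(x) = 0.5 * ||x - c||^2 with a perfect oracle (gradient x - c).
d = 100
c = softmax(np.arange(d) / d)                   # a target point on the simplex
x_out = undergrad_simplex(lambda x: x - c, d, T=500)
print(0.5 * np.sum((x_out - c) ** 2))           # optimality gap (should be small)
```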

The resulting method – which we call universal dual extrapolation with reweighted gradients (UnderGrad) – is encoded in pseudocode form in Algorithm 1 and represented schematically in Fig. 1. Its main guarantees are as follows:

Theorem 1.

Suppose that UnderGrad (Algorithm 1) is run for $T$ iterations with $\eta_t$ given by (19), $\alpha_t = t$ for all $t = 1, 2, \dotsc$, and $\tau = \sqrt{K_h}$, $b = C_h\sqrt{K_h}$ with $C_h = \sqrt{R_h + K_h\lVert\mathcal{X}\rVert^2}$. Then the algorithm's output state $\bar{x}_T \equiv \bar{X}_{T+1/2}$ simultaneously enjoys the following guarantees:

  (a) If $f$ satisfies (BG), then

      $\operatorname{\mathbb{E}}[f(\bar{x}_T)] \leq \min f + 2C_h\sqrt{\dfrac{K_h + 8(G^2+\sigma^2)}{K_h T}}$     (20a)

  (b) If $f$ satisfies (LG), then

      $\operatorname{\mathbb{E}}[f(\bar{x}_T)] \leq \min f + \dfrac{32\sqrt{2}\, C_h^2 L}{K_h T^2} + \dfrac{8\sqrt{2}\, C_h\sigma}{\sqrt{K_h T}}$     (20b)

Theorem 1 is our main result so, before discussing its proof (which we carry out in detail in the appendix), some remarks are in order.

Figure 1. Schematic representation of the UnderGrad algorithm (Algorithm 1). The light blue area represents the problem's domain ($\mathcal{X}$), whereas the grey area represents the dual space ($\mathcal{Y}$).

The first point of note concerns the dependence of the anytime bounds (20) on the problem's dimensionality. To that end, let $C_{\textrm{fast}} = C_h^2$ and $C_{\textrm{slow}} = C_h$, so UnderGrad's rate of convergence scales as $\mathcal{O}(C_{\textrm{fast}}\chi^2/T^2)$ in smooth, deterministic problems, and as $\mathcal{O}(C_{\textrm{slow}}\chi/\sqrt{T})$ in non-smooth and/or stochastic environments. Thus, to compare the algorithm's rate of convergence to that of mirror-prox and UnixGrad (and up to universal constants), we have to compare $C_h$ to $R_h$ and $B_h$ respectively.

To that end, we calculate below the values of $C_{\textrm{fast}}$ and $C_{\textrm{slow}}$ in the three archetypal examples of Section 3:

  (1) In the simplex setup of Example 1, we have $R_h = \log d$, $\lVert\mathcal{X}\rVert = 1$ and $K_h = 1$, so $C_{\textrm{slow}} = \mathcal{O}(\sqrt{\log d})$ and $C_{\textrm{fast}} = \mathcal{O}(\log d)$.

  (2) In the spectrahedron setup of Example 2, we have again $R_h = \log d$, $\lVert\mathcal{X}\rVert = 1$ and $K_h = 1$, so $C_{\textrm{slow}} = \mathcal{O}(\sqrt{\log d})$ and $C_{\textrm{fast}} = \mathcal{O}(\log d)$. [For a detailed discussion, see [34, 9, 23] and references therein.]

  (3) Finally, in the combinatorial setup of Example 3, we have $R_h = m(1+\log(d/m))$, $\lVert\mathcal{X}\rVert = m$ and $K_h = 1$ [27]. Thus, if $m = \mathcal{O}(1)$ in $d$, we get again $C_{\textrm{slow}} = \mathcal{O}(\sqrt{\log d})$ and $C_{\textrm{fast}} = \mathcal{O}(\log d)$.

The above shows that UnderGrad achieves the desired almost dimension-free rates of the non-adaptive mirror-prox algorithm, as well as the universal order-optimal guarantees of UnixGrad. The only discrepancy with the rates presented in Table 1 is the additive constant $K_h$ that appears in the numerator of (20a): this constant is an artifact of the analysis and it only becomes relevant when $G\to 0$ and $\sigma\to 0$. Since the scaling of the algorithm's convergence rate concerns the large-$G$ regime, this difference is not relevant for our purposes.

An additional difference between UnderGrad and UnixGrad is that the latter involves the prox-mapping (7), whereas the former involves the mirror map (18). To compare the two in terms of their per-iteration complexity, note that if we apply the prox-mapping (7) to the prox-center $x_c \leftarrow \operatorname*{arg\,min} h$ of $\mathcal{X}$, we get

$P_{x_c}(y) = \operatorname*{arg\,min}_{x\in\mathcal{X}}\{\langle y, x_c - x\rangle + D(x, x_c)\}$
$\qquad\;\; = \operatorname*{arg\,min}_{x\in\mathcal{X}}\{h(x) - \langle\nabla h(x_c) + y, x\rangle\}$
$\qquad\;\; = Q(\nabla h(x_c) + y)$     (21)

so, in particular, $Q(y) = P_{x_c}(y)$ whenever $x_c\in\operatorname{ri}\mathcal{X}$ (which is the case for most regularizers used in practice, including the Legendre regularizers used in Examples 1–3 above). This shows that the calculation of the mirror map $Q(y) = P_{x_c}(y - \nabla h(x_c))$ is at least as simple as the calculation of the prox-mapping $P_x(y)$ for a general base point $x\in\mathcal{X}$ – and, typically, calculating the mirror map is strictly lighter because there is no need to vary the base point over different iterations of the algorithm. In this regard, the per-iteration overhead of (UDE) is actually lighter compared to (MP) or (DE).

Finally, we should note that all our results above implicitly assume that the problem's domain is bounded (otherwise the range parameter $R_h$ of the problem becomes infinite). Thus, in addition to these convergence properties of UnderGrad, we also provide below an asymptotic guarantee for problems with an unbounded domain:

Theorem 2.

Suppose that UnderGrad is run with perfect oracle feedback with $\eta_t$ given by (19) and $\alpha_t = t$. If $f$ satisfies (LG), the algorithm's output state $\bar{x}_T = \bar{X}_{T+1/2}$ enjoys the rate $f(\bar{x}_T) - \min f = \mathcal{O}(1/T^2)$.

This result provides an important extension of Theorem 1 to problems with unbounded domains. It remains an open question for the future to derive the precise constants in the convergence rate presented in Theorem 2.

Main ideas of the proof.

The detailed proof of Theorem 1 is fairly long so we defer it to the appendix and only present here the main ideas.

The main ingredient of our proof is a specific template inequality used to derive an “appropriate” upper bound of the term $\operatorname{\tilde{\mathcal{R}}}_T(x) \coloneqq \sum_{t=1}^{T}\alpha_t\langle g_{t+1/2}, X_{t+1/2}-x\rangle$. Importantly, to prove the dimension-free properties of UnderGrad, such an upper bound cannot involve Bregman divergences: even though this is common practice in previous papers [24, 1], these terms would ultimately lead to the Bregman diameter $B_h$ that we seek to avoid. This is a principal part of the reason for switching gears to the dual extrapolation (DE) template for UnderGrad: in so doing, we are able to employ the notion of the Fenchel coupling [31, 32], which is a “primal-dual distance”, as opposed to the Bregman divergence, which is a “primal-primal distance” (cf. Section A.1). This poses another challenge in connecting the Fenchel coupling of targeted points before and after a mirror step, for which we need to employ a primal-dual version of the “three-point identity” (Lemma A.3). These elements lead to the following proposition:

Proposition 1.

For all $x\in\mathcal{X}$, we have

$\operatorname{\tilde{\mathcal{R}}}_T(x) \leq \dfrac{R_h}{\eta_{T+1}} + \sum_{t=1}^{T}\alpha_t\langle g_{t+1/2}-g_t, X_{t+1/2}-X_{t+1}\rangle - K_h\sum_{t=1}^{T}\dfrac{\lVert X_{t+1}-X_{t+1/2}\rVert^2 + \lVert X_{t+1/2}-X_t\rVert^2}{2\eta_t}$     (22)

With (22) in hand, (20a) comes from the application of the Fenchel–Young inequality to upper-bound the right-hand side of (22) as $\sum_{t=1}^{T}\alpha_t^2\eta_{t+1}\lVert g_{t+1/2}-g_t\rVert_{\ast}^2$ (plus a constant term involving $\lVert\mathcal{X}\rVert$). The challenge here is to notice and successfully prove that this summation is actually upper-bounded by $\eta_{T+1}^{-1}$ (due to our special choice of the learning-rate update). Finally, by its definition, $\eta_{T+1}^{-1}$ can be bounded by $G$, $\sigma$ and $K_h$ as described in the statement of Theorem 1.

The proof of (20b) is far more complex. The main challenge is to manipulate the terms in (22) to derive an upper bound of the form $\sum_{t=1}^{T}\alpha_t^2\,\textsl{g}(\eta_{t+1})\lVert\nabla f(\bar{X}_{t+1/2})-\nabla f(\bar{X}_t)\rVert_{\ast}^2$ (plus a term involving the noise level $\sigma$), where $\textsl{g}(\eta_{t+1})$ is a function of the learning rate chosen such that only the first $T_0 \ll T$ elements of this summation are positive. Once this has been achieved, the quantity $\lVert\nabla f(\bar{X}_{t+1/2})-\nabla f(\bar{X}_t)\rVert_{\ast}$ is connected to $\lVert\mathcal{X}\rVert$ via (LG) and our claim is obtained.

5. Numerical Experiments

For the experimental validation of our results, we focus on the simplex setup of Example 1 with linear losses and $d = 100$. Our first experiment concerns the perfect SFO case and tracks the convergence properties of UnderGrad run with the entropic regularizer adapted to the simplex. As a baseline, we ran UnixGrad, also with the entropic regularizer. A first challenge here is that the Bregman diameter $B_h$ of the simplex is infinite, so UnixGrad is not well-defined. On that account, we chose the step-size update rule of UnixGrad so that its initial step-size $\gamma_1$ equals the initial learning rate $\eta_1$ of UnderGrad. We also ran UnixGrad with update rules such that $\gamma_1$ is smaller or larger than $\eta_1$. Finally, for comparison purposes, we also present a non-adaptive accelerated entropic gradient (AEG) algorithm, and we report the results in Fig. 2.

Figure 2. Convergence of UnderGrad and UnixGrad in the simplex setup with a perfect SFO. The $y$-axis corresponds to the difference $\Delta(T)$ between the $f$-value of the relevant point of each algorithm and $\min f$; the $x$-axis is the iteration counter $T$. The code is available at https://github.com/dongquan-vu/UnDerGrad_Universal_CnvOpt_ICML2022.

Fig. 2 confirms that UnderGrad successfully converges at an accelerated rate of $\mathcal{O}(1/T^2)$. Perhaps surprisingly, it also shows that when UnixGrad's initial step-size is small ($10^{-3}$ or smaller), UnixGrad also achieves an $\mathcal{O}(1/T^2)$ rate, but at a much more conservative pace, trailing UnderGrad by one or two orders of magnitude. On the other hand, when UnixGrad's initial step-size is of the same magnitude as UnderGrad's learning rate (or larger), UnixGrad eventually destabilizes and its rate drops from $\mathcal{O}(1/T^2)$ to approximately $\mathcal{O}(1/T)$. We also conducted experiments in the setup with a noisy SFO; these are reported in Appendix C.

Appendix A Bregman regularizers and preliminary results

A.1. Bregman regularizers and their properties

We begin by clarifying and recalling some of the notational conventions used throughout the paper. We also give the formal definition of the Fenchel coupling (a key notion for the proof of our main results) and we present some preliminary results to prepare the ground for the sequel.

The convex conjugate $h^{\ast}\colon\mathcal{Y}\to\mathbb{R}$ of $h$ is defined as

$h^{\ast}(y) = \sup_{x\in\mathcal{X}}\{\langle y, x\rangle - h(x)\}.$     (A.1)

Since $h$ is strongly convex and lower semicontinuous with $\operatorname{dom} h = \mathcal{X}$, the supremum in (A.1) is always attained, and $h^{\ast}(y)$ is finite for all $y\in\mathcal{Y}$ [8]. Moreover, by standard results in convex analysis [43, Chap. 26], $h^{\ast}$ is differentiable on $\mathcal{Y}$ and its gradient satisfies the identity

$\nabla h^{\ast}(y) = \operatorname*{arg\,max}_{x\in\mathcal{X}}\{\langle y, x\rangle - h(x)\}.$     (A.2)

Thus, recalling the definition of the mirror map $Q\colon\mathcal{Y}\to\mathcal{X}$, we readily get

$Q(y) = \nabla h^{\ast}(y).$     (A.3)
Lemma A.1.

Let $h$ be a Bregman regularizer on $\mathcal{X}$. Then, for all $x, x^{+}\in\operatorname{dom}\partial h$ and all $y, v\in\mathcal{Y}$, we have:

a)   $x = Q(y) \iff y\in\partial h(x)$.     (A.4a)
b)   $x^{+} = Q(\nabla h(x) + v) \iff \nabla h(x) + v\in\partial h(x^{+})$     (A.4b)

Finally, if $x = Q(y)$ and $x^{\ast}\in\mathcal{X}$, we have

$\langle\nabla h(x), x - x^{\ast}\rangle \leq \langle y, x - x^{\ast}\rangle.$     (A.5)
Proof of Lemma A.1.

To prove (A.4a), note that $x$ solves (A.2) if and only if $y - \partial h(x)\ni 0$, i.e., if and only if $y\in\partial h(x)$. Eq. (A.4b) is then obtained in the same manner.

For the inequality (A.5), it suffices to show that it holds for all $x^{\ast}\in\mathcal{X}_h\equiv\operatorname{dom}\partial h$ (by continuity). To do so, let

$\phi(t) = h(x + t(x^{\ast}-x)) - [h(x) + t\langle y, x^{\ast}-x\rangle].$     (A.6)

Since $h$ is strongly convex and $y\in\partial h(x)$ by (A.4a), it follows that $\phi(t)\geq 0$ with equality if and only if $t = 0$. Moreover, note that $\psi(t) = \langle\nabla h(x + t(x^{\ast}-x)) - y, x^{\ast}-x\rangle$ is a continuous selection of subgradients of $\phi$. Given that $\phi$ and $\psi$ are both continuous on $[0,1]$, it follows that $\phi$ is continuously differentiable and $\phi' = \psi$ on $[0,1]$. Thus, with $\phi$ convex and $\phi(t)\geq 0 = \phi(0)$ for all $t\in[0,1]$, we conclude that $\phi'(0) = \langle\nabla h(x) - y, x^{\ast}-x\rangle \geq 0$. ∎

As we mentioned earlier, much of our analysis revolves around a “primal-dual” divergence between a target point $x^{\ast}\in\mathcal{X}$ and a dual vector $y\in\mathcal{Y}$, called the Fenchel coupling. Following [33], this is defined as follows for all $x^{\ast}\in\mathcal{X}$, $y\in\mathcal{Y}$:

$F(x^{\ast}, y) = h(x^{\ast}) + h^{\ast}(y) - \langle y, x^{\ast}\rangle.$     (A.7)

The following lemma illustrates some basic properties of the Fenchel coupling:

Lemma A.2.

Let hh be a Bregman regularizer on 𝒳\mathcal{X} with convexity modulus KhK_{h}. Then, for all x𝒳x^{\ast}\in\mathcal{X} and all y𝒴y\in\mathcal{Y}, we have:

(1) $F(x^{\ast},y)=D(x^{\ast},Q(y))$ if $Q(y)\in\mathcal{X}_{h}$ (but not necessarily otherwise).

(2) $F(x^{\ast},y)\geq\frac{K_{h}}{2}\lVert Q(y)-x^{\ast}\rVert^{2}$.

Proof.

For our first claim, let x=Q(y)x=Q(y). Then, by definition we have:

F(x^{\ast},y)=h(x^{\ast})+\langle y,Q(y)\rangle-h(Q(y))-\langle y,x^{\ast}\rangle=h(x^{\ast})-h(x)-\langle y,x^{\ast}-x\rangle. (A.8)

Since $y\in\partial h(x)$, we have $h^{\prime}(x;x^{\ast}-x)=\langle y,x^{\ast}-x\rangle$ whenever $x\in\mathcal{X}_{h}$, so the right-hand side of (A.8) coincides with the Bregman divergence $D(x^{\ast},x)$, proving our first claim. For our second claim, arguing in the same spirit we get:

F(x,y)=h(x)h(x)y,xxF(x^{\ast},y)=h(x^{\ast})-h(x)-\langle y\mathopen{},\mathopen{}x^{\ast}-x\rangle (A.9)

Thus, we obtain the result by recalling that $h$ is $K_{h}$-strongly convex with respect to the norm $\lVert\cdot\rVert$. ∎

We continue with some basic relations that connect the Fenchel coupling of a target point before and after a gradient step. The basic ingredient for this is a primal-dual analogue of the so-called "three-point identity" for Bregman functions [15]:

Lemma A.3.

Let hh be a Bregman regularizer on 𝒳\mathcal{X}. Fix some x𝒳x^{\ast}\in\mathcal{X} and let y,y+𝒴y,y^{+}\in\mathcal{Y}. Then, letting x=Q(y)x=Q(y), we have

F(x,y+)=F(x,y)+F(x,y+)+y+y,xx.F(x^{\ast},y^{+})=F(x^{\ast},y)+F(x,y^{+})+\langle y^{+}-y\mathopen{},\mathopen{}x-x^{\ast}\rangle. (A.10)
Proof.

By definition, we get:

F(x,y+)\displaystyle F(x^{\ast},y^{+}) =h(x)+h(y+)y+,x\displaystyle=h(x^{\ast})+h^{\ast}(y^{+})-\langle y^{+}\mathopen{},\mathopen{}x^{\ast}\rangle (A.11)
F(x,y)\displaystyle F(x^{\ast},y) =h(x)+h(y)y,x.\displaystyle=h(x^{\ast})+h^{\ast}(y)-\langle y\mathopen{},\mathopen{}x^{\ast}\rangle.

Then, by subtracting the above we get:

F(x,y+)F(x,y)\displaystyle F(x^{\ast},y^{+})-F(x^{\ast},y) =h(x)+h(y+)y+,xh(x)h(y)+y,x\displaystyle=h(x^{\ast})+h^{\ast}(y^{+})-\langle y^{+}\mathopen{},\mathopen{}x^{\ast}\rangle-h(x^{\ast})-h^{\ast}(y)+\langle y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)h(y)y+y,x\displaystyle=h^{\ast}(y^{+})-h^{\ast}(y)-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)y,Q(y)+h(Q(y))y+y,x\displaystyle=h^{\ast}(y^{+})-\langle y\mathopen{},\mathopen{}Q(y)\rangle+h(Q(y))-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)y,x+h(x)y+y,x\displaystyle=h^{\ast}(y^{+})-\langle y\mathopen{},\mathopen{}x\rangle+h(x)-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)+y+y,xy+,x+h(x)y+y,x\displaystyle=h^{\ast}(y^{+})+\langle y^{+}-y\mathopen{},\mathopen{}x\rangle-\langle y^{+}\mathopen{},\mathopen{}x\rangle+h(x)-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=F(x,y+)+y+y,xx\displaystyle=F(x,y^{+})+\langle y^{+}-y\mathopen{},\mathopen{}x-x^{\ast}\rangle (A.12)

and our proof is complete. ∎
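As a quick numerical illustration of the two lemmas above (not needed for the formal analysis), the sketch below evaluates the Fenchel coupling (A.7) for the entropic regularizer on the simplex, for which $h^{\ast}$ is the log-sum-exp function and $Q$ is the softmax map, and checks the lower bound of Lemma A.2 (with $K_{h}=1$ relative to the $\ell^{1}$ norm, by Pinsker's inequality) together with the three-point identity (A.10) on random data.

```python
import numpy as np

rng = np.random.default_rng(1)

def Q(y):                               # mirror map: softmax (cf. (A.2)-(A.3))
    w = np.exp(y - np.max(y))
    return w / np.sum(w)

def h(x):                               # entropic regularizer on the simplex
    return np.sum(x * np.log(np.clip(x, 1e-300, None)))

def h_conj(y):                          # convex conjugate: log-sum-exp
    m = np.max(y)
    return m + np.log(np.sum(np.exp(y - m)))

def F(x_target, y):                     # Fenchel coupling (A.7)
    return h(x_target) + h_conj(y) - y @ x_target

x_star = rng.dirichlet(np.ones(4))
y, y_plus = rng.normal(size=4), rng.normal(size=4)
x = Q(y)

# Lemma A.2(2): F(x*, y) >= (K_h / 2) ||Q(y) - x*||^2 with K_h = 1 and the L1 norm
assert F(x_star, y) >= 0.5 * np.sum(np.abs(x - x_star)) ** 2 - 1e-12

# Lemma A.3: F(x*, y+) = F(x*, y) + F(x, y+) + <y+ - y, x - x*>
lhs = F(x_star, y_plus)
rhs = F(x_star, y) + F(x, y_plus) + (y_plus - y) @ (x - x_star)
assert np.isclose(lhs, rhs)
```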

A.2. Numerical sequence inequalities

In this section, we provide some necessary inequalities on numerical sequences that we require for the convergence rate analysis of the previous sections. Most of the lemmas presented below already exist in the literature, and go as far back as Auer et al. [6] and McMahan & Streeter [30]; when appropriate, we note next to each lemma the references with the statement closest to the precise version we are using in our analysis.

Lemma A.4 ([30, 28]).

For all non-negative numbers $a_{0},a_{1},\dotsc,a_{T}$, the following inequality holds:

\sqrt{a_{0}^{2}+\sum_{t=1}^{T}a_{t}}\;\leq\;a_{0}+\sum_{t=1}^{T}\dfrac{a_{t}}{\sqrt{a_{0}^{2}+\sum_{s=1}^{t}a_{s}}}\;\leq\;2\sqrt{a_{0}^{2}+\sum_{t=1}^{T}a_{t}} (A.13)
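The inequality can also be checked directly on random data; the short sketch below does so for the statement above, with $a_{0}$ playing the role of the offset term.

```python
import numpy as np

rng = np.random.default_rng(2)
a0 = rng.uniform(0.1, 2.0)                # offset term a_0
a = rng.uniform(0.0, 5.0, size=1000)      # non-negative a_1, ..., a_T

total = np.sqrt(a0 ** 2 + np.sum(a))
middle = a0 + np.sum(a / np.sqrt(a0 ** 2 + np.cumsum(a)))

assert total <= middle + 1e-9             # left inequality of (A.13)
assert middle <= 2 * total + 1e-9         # right inequality of (A.13)
```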

Appendix B Analysis and proofs of the main results

The proof of the template inequality.

We first prove the template inequality of UnderGrad (Proposition 1); this is the primary element of our proof of Theorem 1.

Proof.

First, we set Y~t=ηtYt\tilde{Y}_{t}=\eta_{t}{Y}_{t}. For all x𝒳x\in\mathcal{X} we have:

αtgt+1/2,Xt+1x\displaystyle\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1}-x\rangle
=\displaystyle= 1ηtY~t1ηt+1Y~t+1,Xt+1x\displaystyle\langle\frac{1}{\eta_{t}}\tilde{Y}_{t}-\frac{1}{\eta_{t+1}}\tilde{Y}_{t+1}\mathopen{},\mathopen{}X_{t+1}-x\rangle
=\displaystyle= 1ηtY~tY~t+1,Xt+1x+[1ηt+11ηt]0Y~t+1,Xt+1x\displaystyle\frac{1}{\eta_{t}}\langle\tilde{Y}_{t}-\tilde{Y}_{t+1}\mathopen{},\mathopen{}X_{t+1}-x\rangle+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\langle 0-\tilde{Y}_{t+1}\mathopen{},\mathopen{}X_{t+1}-x\rangle
=\displaystyle= 1ηt[F(x,Y~t)F(x,Y~t+1)F(Xt+1,Y~t)]\displaystyle\frac{1}{\eta_{t}}\left[F(x,\tilde{Y}_{t})-F(x,\tilde{Y}_{t+1})-F(X_{t+1},\tilde{Y}_{t})\right]
+[1ηt+11ηt](F(x,0)F(x,Y~t+1)F(Xt+1,0))\displaystyle\qquad+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\bigg{(}F(x,0)-F(x,\tilde{Y}_{t+1})-F(X_{t+1},0)\bigg{)} #  from Lemma A.3
\displaystyle\leq 1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)+[1ηt+11ηt]Rh1ηtF(Xt+1,Y~t).\displaystyle\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]R_{h}-\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t}). (B.1)

Here, the last inequality follows from the facts that $F(x,0)=h(x)-h(Q(0))=h(x)-\min_{x\in\mathcal{X}}h\leq R_{h}$, that $F(\cdot,\cdot)\geq 0$, and that $\eta_{t}$ is non-increasing.

As a consequence of (B.1), we have:

αtgt+1/2,Xt+1/2x\displaystyle\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-x\rangle
=αtgt+1/2,Xt+1/2Xt+1+αtgt+1/2,Xt+1x\displaystyle=\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle+\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1}-x\rangle
αtgt+1/2,Xt+1/2Xt+1+1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)\displaystyle\leq\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle+\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})
+[1ηt+11ηt]Rh1ηtF(Xt+1,Y~t)\displaystyle\qquad+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]R_{h}-\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t}) (B.2)

On the other hand, if we let Y~t+1/2ηtYt+1/2\tilde{Y}_{t+1/2}\coloneqq\eta_{t}Y_{t+1/2}, we have

αtgt,Xt+1/2Xt+1\displaystyle\alpha_{t}\langle g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle =1ηtY~tY~t+1/2,Xt+1/2Xt+1\displaystyle=\frac{1}{\eta_{t}}\langle\tilde{Y}_{t}-\tilde{Y}_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle
=1ηt[F(Xt+1,Y~t)F(Xt+1,Y~t+1/2)F(Xt+1/2,Y~t)]\displaystyle=\frac{1}{\eta_{t}}\left[F(X_{t+1},\tilde{Y}_{t})-F(X_{t+1},\tilde{Y}_{t+1/2})-F(X_{t+1/2},\tilde{Y}_{t})\right]
so that

\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t})=\alpha_{t}\langle g_{t},X_{t+1/2}-X_{t+1}\rangle+\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t+1/2})+\frac{1}{\eta_{t}}F(X_{t+1/2},\tilde{Y}_{t}). (B.3)

Thus, replacing (B.3) into (B), we get:

αtgt+1/2,Xt+1/2x\displaystyle\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-x\rangle αtgt+1/2gt,Xt+1/2Xt+1\displaystyle\leq\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle
+1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)1ηtF(Xt+1,Y~t+1/2)\displaystyle+\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})-\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t+1/2})
1ηtF(Xt+1/2,Y~t)+[1ηt+11ηt]Rh.\displaystyle-\frac{1}{\eta_{t}}F(X_{t+1/2},\tilde{Y}_{t})+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]R_{h}. (B.4)

Now, recalling the definitions $X_{t+1/2}=Q(\tilde{Y}_{t+1/2})$ and $X_{t+1}=Q(Y_{t+1})$ from Algorithm 1, applying Lemma A.2 to $F(X_{t+1},\tilde{Y}_{t+1/2})$ and $F(X_{t+1/2},\tilde{Y}_{t})$, and combining with (B.4), we get:

αtgt+1/2,Xt+1/2xαtgt+1/2gt,Xt+1/2Xt+1+1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)Kh2ηtXt+1Xt+1/22Kh2ηtXt+1/2Xt2+(1ηt+11ηt)Rh\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-x\rangle\leq\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle+\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})\\ -\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2}+\bigg{(}\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\bigg{)}R_{h} (B.5)

Hence, after telescoping and recalling the notation ~T(x)t=1Tαtgt+1/2,Xt+1/2x\operatorname{\tilde{\mathcal{R}}}_{T}(x)\coloneqq\sum_{t=1}^{T}{\alpha_{t}}\left\langle g_{t+1/2},X_{t+1/2}-x\right\rangle, we get:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x) 1η1F(x,Y~1)+(1ηT+11η1)Rh\displaystyle\leq\frac{1}{\eta_{1}}F(x,\tilde{Y}_{1})+\bigg{(}\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\bigg{)}R_{h}
+t=1Tαtgt+1/2gt,Xt+1/2Xt+1\displaystyle+\sum_{t=1}^{T}\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle
t=1TKh2ηtXt+1Xt+1/22t=1TKh2ηtXt+1/2Xt2\displaystyle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.6)

Finally, by our initial choice $Y_{1}=0$, we have $F(x,\tilde{Y}_{1})=h(x)-\min_{x\in\mathcal{X}}h(x)\leq R_{h}$, and (22) follows from (B.6). This concludes the proof of Proposition 1. ∎

Regret-to-rate conversion lemma.

The next element of our proof is the following lemma, which connects the term $\operatorname{\tilde{\mathcal{R}}}_{T}(x)$ (intuitively, a regret-like quantity) to the term $\operatorname{\mathbb{E}}\left[f(\bar{X}_{T+1/2})-\min f\right]$, whose bounds characterize the convergence rate of UnderGrad.

Lemma B.1.

For any x𝒳x^{\ast}\in\mathcal{X}^{\ast}, for any TT, we have:

𝔼[f(X¯T+1/2)minf]\displaystyle\operatorname{\mathbb{E}}\left[f(\bar{X}_{T+1/2})-\min f\right] 𝔼[2T2t=1Tαtf(X¯t+1/2),Xt+1/2x]\displaystyle\leq\operatorname{\mathbb{E}}\left[\frac{2}{{T^{2}}}{\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f\left(\bar{X}_{t+1/2}\right),X_{t+1/2}-x^{\ast}\rangle}\right]
=2T2𝔼[~T(x)].\displaystyle=\frac{2}{T^{2}}\operatorname{\mathbb{E}}\left[\operatorname{\tilde{\mathcal{R}}}_{T}(x^{\ast})\right]. (B.7)
Remark 1.

A variation of Lemma B.1 appears in [16, 24]; for the sake of completeness, we provide its proof below.

Proof.

Let us denote Ht:=s=1tαsH_{t}:=\sum_{s=1}^{t}\alpha_{s}. From the update rule of Algorithm 1, we can rewrite Xt+1/2=HtαtX¯t+1/2Ht1αtX¯t1/2X_{t+1/2}=\frac{H_{t}}{\alpha_{t}}\bar{X}_{t+1/2}-\frac{H_{t-1}}{\alpha_{t}}{\bar{X}}_{t-1/2}. Therefore,

Xt+1/2x\displaystyle X_{t+1/2}-x^{\ast} =Htαt(X¯t+1/2x)Ht1αt(X¯t1/2x)\displaystyle=\frac{H_{t}}{\alpha_{t}}(\bar{X}_{t+1/2}-x^{\ast})-\frac{H_{t-1}}{\alpha_{t}}({\bar{X}}_{t-1/2}-x^{\ast})
=1αt[αt(X¯t+1/2x)+Ht1(X¯t+1/2X¯t1/2)].\displaystyle=\frac{1}{\alpha_{t}}\left[\alpha_{t}(\bar{X}_{t+1/2}-x^{\ast})+H_{t-1}(\bar{X}_{t+1/2}-{\bar{X}}_{t-1/2})\right]. (B.8)

As a consequence, we have:

t=1Tαtf(X¯t+1/2),Xt+1/2x\displaystyle\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle =t=1Tαtf(X¯t+1/2),X¯t+1/2x\displaystyle=\sum_{t=1}^{T}\alpha_{t}\left\langle\nabla f(\bar{X}_{t+1/2}),\bar{X}_{t+1/2}-x^{\ast}\right\rangle
\qquad+\sum_{t=1}^{T}H_{t-1}\left\langle\nabla f(\bar{X}_{t+1/2}),\bar{X}_{t+1/2}-{\bar{X}}_{t-1/2}\right\rangle
\displaystyle\geq t=1Tαt[f(X¯t+1/2)f(x)]\displaystyle\sum_{t=1}^{T}\alpha_{t}\left[f(\bar{X}_{t+1/2})-f(x^{\ast})\right]
+t=1THt1[f(X¯t+1/2)f(X¯t1/2)]\displaystyle\qquad+\sum_{t=1}^{T}H_{t-1}\left[f(\bar{X}_{t+1/2})-f({\bar{X}}_{t-1/2})\right]
=t=1Tαt[f(X¯t+1/2)f(x)]\displaystyle=\sum_{t=1}^{T}\alpha_{t}\left[f(\bar{X}_{t+1/2})-f(x^{\ast})\right]
+t=1T1αt[f(X¯T+1/2)f(X¯t+1/2)]\displaystyle\qquad+\sum_{t=1}^{T-1}\alpha_{t}\left[f({\bar{X}}_{T+1/2})-f(\bar{X}_{t+1/2})\right]
=\displaystyle= [f(X¯T+1/2)f(x)]t=1Tαt.\displaystyle\left[f({\bar{X}}_{T+1/2})-f(x^{\ast})\right]\sum_{t=1}^{T}\alpha_{t}. (B.9)

Dividing both sides of (B.9) by $H_{T}>0$ and choosing $\alpha_{t}$ such that $H_{T}\geq\frac{T^{2}}{2}$ (e.g., $\alpha_{t}=t$, for which $H_{T}=T(T+1)/2$), we obtain:

2T2t=1Tαtf(X¯t+1/2),Xt+1/2xf(X¯T+1/2)f(x)=f(X¯T+1/2)minf.\displaystyle\frac{2}{T^{2}}\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\geq f({\bar{X}}_{T+1/2})-f(x^{\ast})=f({\bar{X}}_{T+1/2})-\min f. (B.10)

Finally, we recall that by definition, gt+1/2=𝖦(X¯t+1/2;ωt+1/2)=f(X¯t+1/2)+Ut+1/2g_{t+1/2}=\operatorname{\mathsf{G}}(\bar{X}_{t+1/2};\omega_{t+1/2})=\nabla f(\bar{X}_{t+1/2})+U_{t+1/2} where 𝔼[Ut+1/2\nonscript|\nonscriptt+1/2]=0\operatorname{\mathbb{E}}\left[U_{t+1/2}\nonscript\,\middle|\nonscript\,\mathopen{}\mathcal{F}_{t+1/2}\right]=0. Therefore, by the law of total expectation, we have:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x^{\ast}) =𝔼[t=1Tαtf(X¯t+1/2),Xt+1/2x]+𝔼[t=1TαtUt+1/2,Xt+1/2x]\displaystyle=\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\right]+\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle U_{t+1/2},X_{t+1/2}-x^{\ast}\rangle\right]
=𝔼[t=1Tαtf(X¯t+1/2),Xt+1/2x]\displaystyle=\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\right]
+𝔼[t=1Tαt𝔼[Ut+1/2,Xt+1/2x\nonscript|\nonscriptt+1/2]]\displaystyle+\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\operatorname{\mathbb{E}}\left[\langle U_{t+1/2},X_{t+1/2}-x^{\ast}\rangle\nonscript\,|\nonscript\,\mathopen{}\mathcal{F}_{t+1/2}\right]\right]
=\displaystyle= 𝔼[t=1Tαtf(X¯t+1/2),Xt+1/2x].\displaystyle\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\right]. (B.11)

Then, taking expectations on both sides of (B.10) and invoking (B.11) concludes the proof. ∎
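In practice, the weighted average $\bar{X}_{t+1/2}$ appearing in this lemma can be maintained online: rearranging the relation $X_{t+1/2}=\frac{H_{t}}{\alpha_{t}}\bar{X}_{t+1/2}-\frac{H_{t-1}}{\alpha_{t}}\bar{X}_{t-1/2}$ used above gives $\bar{X}_{t+1/2}=(H_{t-1}\bar{X}_{t-1/2}+\alpha_{t}X_{t+1/2})/H_{t}$. The sketch below is pure bookkeeping (the points $X_{t+1/2}$ are random stand-ins for whatever the optimizer produces) and also checks that $H_{T}\geq T^{2}/2$ for the weights $\alpha_{t}=t$ used in Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 50, 3
X_half = rng.normal(size=(T, d))       # stand-ins for the query points X_{t+1/2}

H, X_bar = 0.0, np.zeros(d)
for t in range(1, T + 1):
    alpha_t = float(t)                 # weights alpha_t = t, as in Theorem 1
    H_prev, H = H, H + alpha_t
    X_bar = (H_prev * X_bar + alpha_t * X_half[t - 1]) / H   # online average

weights = np.arange(1, T + 1)
assert np.allclose(X_bar, weights @ X_half / weights.sum())
assert H >= T ** 2 / 2                 # hence 1/H_T <= 2/T^2, as used in Lemma B.1
```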

Proof of (20a): convergence of UnderGrad under (LC)/(BG).

Our starting point is the inequality (22) that we established in Proposition 1, which leads to the following estimate:

~T(x)RhηT+1+t=1Tαtgt+1/2gt,Xt+1/2Xt+1t=1TKh2ηtXt+1Xt+1/22\operatorname{\tilde{\mathcal{R}}}_{T}(x)\leq\frac{R_{h}}{\eta_{T+1}}+\sum_{t=1}^{T}\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2} (B.12)

We now focus on the second term on the right-hand side of (B.12). From the Cauchy-Schwarz inequality and the fact that

\left\lVert Y-Y^{\prime}\right\rVert_{\ast}\left\lVert X-X^{\prime}\right\rVert=\min_{\lambda>0}\left\{\frac{1}{2\lambda}\left\lVert Y-Y^{\prime}\right\rVert_{\ast}^{2}+\frac{\lambda}{2}\left\lVert X-X^{\prime}\right\rVert^{2}\right\} (B.13)

for any $X,X^{\prime},Y,Y^{\prime}\in\mathbb{R}^{d}$ (this can be checked directly: $\lambda^{\ast}=\lVert Y-Y^{\prime}\rVert_{\ast}/\lVert X-X^{\prime}\rVert$ minimizes $\psi(\lambda)=\frac{1}{2\lambda}\lVert Y-Y^{\prime}\rVert_{\ast}^{2}+\frac{\lambda}{2}\lVert X-X^{\prime}\rVert^{2}$, and the minimum value is $\lVert Y-Y^{\prime}\rVert_{\ast}\lVert X-X^{\prime}\rVert$), we have:

t=1Tαtgt+1/2gt,Xt+1/2Xt+1\displaystyle\sum_{t=1}^{T}\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle t=1Tαtgt+1/2gtXt+1/2Xt+1\displaystyle\leq\sum_{t=1}^{T}\alpha_{t}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}\lVert X_{t+1/2}-X_{t+1}\rVert
12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}
+Kh2t=1T1ηt+1Xt+1Xt+1/22.\displaystyle+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}. (B.14)

Moreover, from the definition of ηt+1\eta_{t+1} and by applying Lemma A.4, we have:

12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2} =b2Kht=1Tαt2gt+1/2gt2+2s=1tgs+1/2gs2\displaystyle=\frac{b}{2K_{h}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\lVert g_{s+1/2}-g_{s}\rVert_{\ast}^{2}}}
bKh+2t=1Tαt2gt+1/2gt2b22Kh\displaystyle\leq\frac{b}{K_{h}}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}
=b2KhηT+1b22Kh.\displaystyle=\frac{b^{2}}{K_{h}\cdot\eta_{T+1}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}. (B.15)

Combining (B.14) and (B.15) with (B.12), and using the compactness of the feasible region $\mathcal{X}$, we get:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x)\leq RhηT+1+12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\frac{R_{h}}{\eta_{T+1}}+\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}
+Kh2t=1T[1ηt+11ηt]Xt+1Xt+1/22\displaystyle\hphantom{\frac{R_{h}}{\eta_{T+1}}}+\frac{K_{h}}{2}\sum_{t=1}^{T}\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\lVert X_{t+1}-X_{t+1/2}\rVert^{2}
\displaystyle\leq RhηT+1+b2KhηT+1b22Kh+Kh𝒳22t=1T[1ηt+11ηt]\displaystyle\frac{R_{h}}{\eta_{T+1}}+\frac{b^{2}}{K_{h}\cdot\eta_{T+1}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\sum_{t=1}^{T}\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]
=\displaystyle= 1ηT+1(Rh+b2Kh+Kh𝒳22)b2(12Kh+Kh𝒳22).\displaystyle\frac{1}{\eta_{T+1}}\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)-b\sqrt{{}^{2}}\left(\frac{1}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right). (B.16)

Hence, by invoking Lemma B.1, we have:

𝔼[f(X¯T+1/2)minx𝒳f(x)]\displaystyle\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right] 2𝔼[~T(x)]T2\displaystyle\leq\frac{2\operatorname{\mathbb{E}}\left[\operatorname{\tilde{\mathcal{R}}}_{T}(x^{\ast})\right]}{T^{2}}
\leq\frac{2}{T^{2}}\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)\operatorname{\mathbb{E}}\left[\frac{1}{\eta_{T+1}}\right]
2b2T2(12Kh+Kh𝒳22).\displaystyle-\frac{2b\sqrt{{}^{2}}}{T^{2}}\left(\frac{1}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right). (B.17)

On the other hand, by the definition of ηT+1\eta_{T+1}, we get:

𝔼[1ηT+1]\displaystyle\operatorname{\mathbb{E}}\left[\frac{1}{\eta_{T+1}}\right] 1b𝔼[+2t=1Tαt2gt+1/2gt2]1b+2t=1Tαt2𝔼[gt+1/2gt2].\displaystyle\leq\frac{1}{b}\operatorname{\mathbb{E}}\left[\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}\right]\leq\frac{1}{b}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\operatorname{\mathbb{E}}\left[\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\right]}. (B.18)

Moreover, we have that:

𝔼[gt+1/2gt2]\displaystyle\operatorname{\mathbb{E}}\left[\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\right] =𝔼[f(X¯t+1/2)f(X¯t)(Ut+1/2Ut)2]\displaystyle=\operatorname{\mathbb{E}}\left[\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})-(U_{t+1/2}-U_{t})\rVert_{\ast}^{2}\right]
𝔼[2f(X¯t+1/2)f(X¯t)2+2Ut+1/2Ut2]\displaystyle\leq\operatorname{\mathbb{E}}\left[2\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}+2\lVert U_{t+1/2}-U_{t}\rVert_{\ast}^{2}\right]
𝔼[4(f(X¯t+1/2)2+f(X¯t)2)+4(Ut+1/22+Ut2)]\displaystyle\leq\operatorname{\mathbb{E}}\left[4(\lVert\nabla f(\overline{X}_{t+1/2})\rVert_{\ast}^{2}+\lVert\nabla f(\overline{X}_{t})\rVert_{\ast}^{2})+4(\lVert U_{t+1/2}\rVert_{\ast}^{2}+\lVert U_{t}\rVert_{\ast}^{2})\right]
𝔼[8G2+4(Ut+1/22+Ut2)]\displaystyle\leq\operatorname{\mathbb{E}}\left[8G^{2}+4(\lVert U_{t+1/2}\rVert_{\ast}^{2}+\lVert U_{t}\rVert_{\ast}^{2})\right] #  from (BG)
\leq 8\left[G^{2}+\sigma^{2}\right]. (B.19)

Therefore, when we choose the step-size parameters αt=t,t\alpha_{t}=t,\forall t as indicated in Theorem 1, we have:

(Rh+b2Kh+Kh𝒳22)𝔼[1ηT+1]\displaystyle\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)\operatorname{\mathbb{E}}\left[\frac{1}{\eta_{T+1}}\right]\leq (Rh+b2Kh+Kh𝒳22)1b+28(G2+σ2)t=1Tαt2\displaystyle\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)\frac{1}{b}\sqrt{{}^{2}+8\left(G^{2}+\sigma^{2}\right)\sum_{t=1}^{T}\alpha_{t}^{2}}
\displaystyle\leq (Rhb+bKh+Kh𝒳22b)+28(G2+σ2)T3.\displaystyle\left(\frac{R_{h}}{b}+\frac{b}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2b}\right)\sqrt{{}^{2}+8\left(G^{2}+\sigma^{2}\right)T^{3}}. (B.20)

Finally, substituting (B.20) into (B.17), we get:

𝔼[f(X¯T+1/2)minx𝒳f(x)]\displaystyle\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right] 2(Rhb+bKh+Kh𝒳22b)+28(G2+σ2)T3T2\displaystyle\leq 2\left(\frac{R_{h}}{b}+\frac{b}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2b}\right)\frac{\sqrt{{}^{2}+8\left(G^{2}+\sigma^{2}\right)T^{3}}}{T^{2}}
b2T2(12Kh+Kh𝒳22).\displaystyle-\frac{b\sqrt{{}^{2}}}{T^{2}}\left(\frac{1}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right). (B.21)

Then, from our choices of $b$ and of the squared offset parameter in Theorem 1, we obtain:

𝔼[f(X¯T+1/2)minx𝒳f(x)]2ChKhKh+8(G2+σ2)T.\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right]\leq 2\frac{C_{h}}{\sqrt{K_{h}}}\frac{\sqrt{K_{h}+8\left(G^{2}+\sigma^{2}\right)}}{\sqrt{T}}. (B.22)
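For concreteness, the adaptive learning rate used throughout this proof can be computed online as in the sketch below. This is only a schematic illustration: we write `offset_sq` for the squared offset parameter of Theorem 1 (set to $K_{h}$ there) and `b` for the corresponding scale factor, both treated here as placeholders, and we take the gradient differences $\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}$ as given.

```python
import numpy as np

def adaptive_rates(grad_diff_sq, b=1.0, offset_sq=1.0):
    """Schematic rate: eta_{t+1} = b / sqrt(offset_sq + sum_{s<=t} alpha_s^2 ||g_{s+1/2} - g_s||^2).

    grad_diff_sq[t-1] stands for ||g_{t+1/2} - g_t||_*^2 and alpha_t = t, matching
    the weights of Theorem 1; b and offset_sq are placeholder constants.
    """
    T = len(grad_diff_sq)
    alphas = np.arange(1, T + 1, dtype=float)
    running = offset_sq + np.cumsum(alphas ** 2 * np.asarray(grad_diff_sq))
    return b / np.sqrt(running)        # eta_2, ..., eta_{T+1}

# The resulting sequence is non-increasing, which the telescoping steps above rely on.
etas = adaptive_rates(np.random.default_rng(4).uniform(size=100))
assert np.all(np.diff(etas) <= 0)
```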

Proof of (20b): convergence of UnderGrad under (LG)/(LS).

From (22) and (B.14), we have:

~T(x)RhηT+1\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x)\leq\frac{R_{h}}{\eta_{T+1}} +12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle+\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}
+Kh2t=1T(1ηt+11ηt)Xt+1Xt+1/22t=1TKh2ηtXt+1/2Xt2.\displaystyle+\frac{K_{h}}{2}\sum_{t=1}^{T}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right)\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}{\lVert X_{t+1/2}-X_{t}\rVert^{2}}. (B.23)

We now bound the terms on the right-hand side of (B.23). First, we have:

Kh2t=1T(1ηt+11ηt)Xt+1Xt+1/22Kh𝒳22(1ηT+11η1).\displaystyle\frac{K_{h}}{2}\sum_{t=1}^{T}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right)\lVert X_{t+1}-X_{t+1/2}\rVert^{2}\leq\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\left(\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\right). (B.24)

Second, we have:

Kh2t=1T1ηt+1Xt+1/2Xt2\displaystyle\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1/2}-X_{t}\rVert^{2} =Kh2t=1T[1ηt+11ηt]Xt+1/2Xt2\displaystyle=\frac{K_{h}}{2}\sum_{t=1}^{T}\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\lVert X_{t+1/2}-X_{t}\rVert^{2}
+Kh2t=1T1ηtXt+1/2Xt2\displaystyle\qquad+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2}
Kh𝒳22(1ηT+11η1)+Kh2t=1T1ηtXt+1/2Xt2\displaystyle\leq\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\left(\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\right)+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.25)

Hence,

Kh2t=1T1ηtXt+1/2Xt2Kh𝒳22(1ηT+11η1)Kh2t=1T1ηt+1Xt+1/2Xt2.-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2}\leq\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\left(\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\right)-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1/2}-X_{t}\rVert^{2}. (B.26)

Combining (B.24) and (B.26) with (B.23), we have:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x) Rh+Kh𝒳2ηT+1Kh𝒳2η1\displaystyle\leq\frac{R_{h}+K_{h}\lVert\mathcal{X}\rVert^{2}}{\eta_{T+1}}-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{\eta_{1}}
+12Kht=1Tαt2ηt+1gt+1/2gt2Kh2t=1T1ηt+1Xt+1/2Xt2.\displaystyle+\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1/2}-X_{t}\rVert^{2}. (B.27)

We now analyze the terms on the right-hand side of (B.27). To do so, we first introduce the quantities

Bt2\displaystyle B_{t}^{2} =min{f(X¯t+1/2)f(X¯t)2,gt+1/2gt2}\displaystyle=\min\{\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2},\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\} (B.28a)
and
ξt\displaystyle\xi_{t} =[gt+1/2gt][f(X¯t+1/2)f(X¯t)].\displaystyle=\left[g_{t+1/2}-g_{t}\right]-\left[\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\right]. (B.28b)

We also define

η~t=b+2s=1t1αs2Bs2.\tilde{\eta}_{t}=\frac{b}{\sqrt{{}^{2}+\sum_{s=1}^{t-1}\alpha_{s}^{2}B_{s}^{2}}}. (B.29)

By these definitions, we obtain that

gt+1/2gt2\displaystyle\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2} Bt2+[gt+1/2gt2min{f(X¯t+1/2)f(X¯t)2,gt+1/2gt2}]\displaystyle\leq B_{t}^{2}+\left[\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\min\{\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2},\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\}\right]
Bt2+max{0,gt+1/2gt2f(X¯t+1/2)f(X¯t)2}\displaystyle\leq B_{t}^{2}+\max\{0,\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\}
Bt2+Bt2+2ξt2\displaystyle\leq B_{t}^{2}+B_{t}^{2}+2\lVert\xi_{t}\rVert_{\ast}^{2}
=2Bt2+2ξt2.\displaystyle=2B_{t}^{2}+2\lVert\xi_{t}\rVert_{\ast}^{2}. (B.30)

Here, the last inequality follows from the fact that if $\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\geq\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}$, then:

gt+1/2gt2f(X¯t+1/2)f(X¯t)2Bt2+2ξt2.\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\leq B_{t}^{2}+2\lVert\xi_{t}\rVert_{\ast}^{2}. (B.31)

Therefore, we have:

12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2} =b2Kht=1Tαt2gt+1/2gt2+2s=1tαs2gs+1/2gs2\displaystyle=\frac{b}{2K_{h}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}\left\lVert g_{s+{1/2}}-g_{s}\right\rVert^{2}}}
bKh+2t=1Tαt2gt+1/2gt2b22Kh\displaystyle\leq\frac{b}{K_{h}}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}
bKh+22t=1Tαt2Bt2+2t=1Tαt2ξt2b22Kh\displaystyle\leq\frac{b}{K_{h}}\sqrt{{}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}} #  from (B.30)
b2Kh+2t=1Tαt2Bt2+b2Kht=1Tαt2ξt2b22Kh\displaystyle\leq\frac{b\sqrt{2}}{K_{h}}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}B^{2}_{t}}+\frac{b\sqrt{2}}{K_{h}}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}
b2Kh(2+t=1Tαt2Bt2+2s=1tαs2Bs2)\displaystyle\leq\frac{b\sqrt{2}}{K_{h}}\left(\sqrt{{}^{2}}+\sum_{t=1}^{T}\frac{\alpha_{t}^{2}B_{t}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}B_{s}^{2}}}\right)
+b2Kht=1Tαt2ξt2b22Kh\displaystyle\qquad+\frac{b\sqrt{2}}{K_{h}}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}} #  from Lemma A.4
=2Kht=1Tαt2η~t+1Bt2+b2Kht=1Tαt2ξt2+b2Kh(212).\displaystyle=\frac{\sqrt{2}}{K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\tilde{\eta}_{t+1}B_{t}^{2}+\frac{b\sqrt{2}}{K_{h}}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}+\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right). (B.32)

On the other hand, by the update rule in Algorithm 1 and our choice of $\alpha_{t}=t$ for all $t$ as in Theorem 1, we also have $X_{t}-X_{t+1/2}=\frac{\sum_{s=1}^{t}\alpha_{s}}{\alpha_{t}}\left(\bar{X}_{t}-\bar{X}_{t+1/2}\right)=\frac{\alpha_{t+1}}{2}\left(\bar{X}_{t}-\bar{X}_{t+1/2}\right)$. Using this, together with the fact that $\frac{1}{\tilde{\eta}_{t}}\leq\frac{1}{\eta_{t}}$ for any $t$ and that $f$ is $L$-smooth over $\mathcal{X}$, we have:

Kh2t=1T1ηt+1XtXt+1/22\displaystyle-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t}-X_{t+1/2}\rVert^{2} Kh2t=1T1η~t+1XtXt+1/22\displaystyle\leq-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\tilde{\eta}_{t+1}}\lVert X_{t}-X_{t+1/2}\rVert^{2}
=Kh8t=1Tαt+12η~t+1X¯tX¯t+1/22\displaystyle=-\frac{K_{h}}{8}\sum_{t=1}^{T}\frac{\alpha_{t+1}^{2}}{\tilde{\eta}_{t+1}}\lVert\bar{X}_{t}-\bar{X}_{t+1/2}\rVert^{2}
\leq-\frac{K_{h}}{8}\sum_{t=1}^{T}\frac{\alpha_{t+1}^{2}}{\tilde{\eta}_{t+1}}\frac{1}{L^{2}}\lVert\nabla f(\overline{X}_{t})-\nabla f(\overline{X}_{t+1/2})\rVert_{\ast}^{2}
Kh8L2t=1Tαt2Bt2η~t+1.\displaystyle\leq-\frac{K_{h}}{8L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}B_{t}^{2}}{\tilde{\eta}_{t+1}}. (B.33)

Finally, letting Ch=Rh+Kh𝒳2C_{h}=\sqrt{R_{h}+K_{h}\lVert\mathcal{X}\rVert^{2}}, (B.30) yields:

Ch2ηT+1\displaystyle\frac{C_{h}^{2}}{\eta_{T+1}} =Ch2b+2t=1Tαt2gt+1/2gt2\displaystyle=\frac{C_{h}^{2}}{b}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}
Ch2b+22t=1Tαt2Bt2+2t=1Tαt2ξt2\displaystyle\leq\frac{C_{h}^{2}}{b}\sqrt{{}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
2Ch2b+2t=1Tαt2Bt2+2Ch2bt=1Tαt2ξt2\displaystyle\leq\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}}+\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
2Ch2b(2+t=1Tαt2Bt2+2s=1tαs2Bs2)+2Ch2bt=1Tαt2ξt2\displaystyle\leq\frac{\sqrt{2}C_{h}^{2}}{b}\left(\sqrt{{}^{2}}+\sum_{t=1}^{T}\frac{\alpha_{t}^{2}B_{t}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}B_{s}^{2}}}\right)+\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
2Ch2b2t=1Tαt2η~t+1Bt2+2Ch22b+2Ch2bt=1Tαt2ξt2.\displaystyle\leq\frac{\sqrt{2}C_{h}^{2}}{b^{2}}\sum_{t=1}^{T}\alpha_{t}^{2}\tilde{\eta}_{t+1}B_{t}^{2}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}. (B.34)

Combining (B.32), (B.33) and (B.34) with (B.27), we have:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x) 2t=1Tαt2Bt2[(Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1]\displaystyle\leq\sqrt{2}\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}\left[\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}\right]
+2(Rhb+Kh𝒳2b+bKh)t=1Tαt2ξt2\displaystyle+\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
+[b2Kh(212)Kh𝒳22b+2Ch22b].\displaystyle\qquad+\left[\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right)-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}\right]. (B.35)

Now, define T0T_{0} as follows:

T0=max{1tT:η~t+1Kh8L2(Rhb2+Kh𝒳2b2+1Kh)}T_{0}=\max\left\{1\leq t\leq T:\tilde{\eta}_{t+1}\geq\frac{\sqrt{K_{h}}}{\sqrt{8L^{2}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)}}\right\} (B.36)

In other words, for any t>T0t>T_{0}, (Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1<0\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}<0. As a consequence,

2t=1Tαt2Bt2[(Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1]\displaystyle\sqrt{2}\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}\left[\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}\right]
=\displaystyle= 2t=1T0αt2Bt2[(Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1]\displaystyle\sqrt{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}B_{t}^{2}\left[\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}\right]
\displaystyle\leq 2(Rhb2+Kh𝒳2b2+1Kh)t=1T0αt2Bt2η~t+1\displaystyle\sqrt{2}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\sum_{t=1}^{T_{0}}\alpha_{t}^{2}B_{t}^{2}\tilde{\eta}_{t+1}
=\displaystyle= 2(Rhb2+Kh𝒳2b2+1Kh)t=1T0bαt2Bt2+2s=1tαs2Bs2\displaystyle\sqrt{2}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\sum_{t=1}^{T_{0}}b\frac{\alpha_{t}^{2}B_{t}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}B_{s}^{2}}}
\displaystyle\leq 2(Rhb+Kh𝒳2b+bKh)(2+2t=1T0αt2Bt22)\displaystyle\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\left(2\sqrt{{}^{2}+\sum_{t=1}^{T_{0}}\alpha_{t}^{2}B_{t}^{2}}-\sqrt{{}^{2}}\right)
=\displaystyle= 22(Rhb+Kh𝒳2b+bKh)bη~T0+122(Rhb+Kh𝒳2b+bKh)\displaystyle 2\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\frac{b}{\tilde{\eta}_{T_{0}+1}}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)
\displaystyle\leq 8(Rhb2+Kh𝒳2b2+1Kh)3/2b2LKh22(Rhb+Kh𝒳2b+bKh).\displaystyle 8\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)^{3/2}\frac{{b}^{2}L}{\sqrt{K_{h}}}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right). (B.37)

Combining (B.35) with (B.37) and using the fact that $\operatorname{\mathbb{E}}\left[\left\lVert\xi_{t}\right\rVert_{\ast}^{2}\right]\leq 4\sigma^{2}$, we have:

𝔼[~T(x)]\displaystyle\operatorname{\mathbb{E}}\left[\operatorname{\tilde{\mathcal{R}}}_{T}(x)\right] 8(Rhb2+Kh𝒳2b2+1Kh)3/2b2LKh22(Rhb+Kh𝒳2b+bKh)\displaystyle\leq 8\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)^{3/2}\frac{{b}^{2}L}{\sqrt{K_{h}}}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)
+22(Rhb+Kh𝒳2b+bKh)σt=1Tαt2\displaystyle+2\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\sigma\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}}
+[b2Kh(212)Kh𝒳22b+2Ch22b]\displaystyle\qquad+\left[\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right)-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}\right] (B.38)

Recalling the choice $\alpha_{t}=t$ for all $t$ and applying Lemma B.1, we have:

𝔼[f(X¯T+1/2)minx𝒳f(x)]\displaystyle\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right] 16T2(Rhb2+Kh𝒳2b2+1Kh)3/2b2LKh\displaystyle\leq\frac{16}{T^{2}}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)^{3/2}\frac{{b}^{2}L}{\sqrt{K_{h}}}
+42T(Rhb+Kh𝒳2b+bKh)σ+2C(b,2)T2.\displaystyle+\frac{4\sqrt{2}}{\sqrt{T}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\sigma+\frac{2C(b,^{2})}{T^{2}}. (B.39)

where we set

C(b,2)b2Kh(212)Kh𝒳22b+2Ch22b22(Rhb+Kh𝒳2b+bKh).C(b,^{2})\coloneqq\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right)-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right). (B.40)

Finally, substituting $b=\sqrt{K_{h}C_{h}^{2}}$ and the value of the squared offset parameter (namely $K_{h}$) chosen in Theorem 1 into (B.39), and noting that with these choices the constant defined in (B.40) is at most $-\frac{1}{2}C_{h}\sqrt{K_{h}}\leq 0$, we can rewrite (B.39) as follows:

𝔼[f(X¯T+1/2)minx𝒳f(x)]322LT2(Ch2Kh)+82σTChKh.\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right]\leq\frac{32\sqrt{2}L}{T^{2}}\left(\frac{C_{h}^{2}}{{K_{h}}}\right)+\frac{8\sqrt{2}\sigma}{\sqrt{T}}\frac{C_{h}}{\sqrt{K_{h}}}. (B.41)
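To see the interpolation encoded in this bound at a glance, the snippet below evaluates the two terms of (B.41) for some hypothetical values of $L$, $C_{h}$, $K_{h}$ and two noise levels: with $\sigma=0$ the $\operatorname{\mathcal{O}}(1/T^{2})$ term is the whole bound, while for $\sigma>0$ the $\operatorname{\mathcal{O}}(1/\sqrt{T})$ term eventually dominates.

```python
import numpy as np

L, C_h, K_h = 10.0, 1.0, 1.0          # hypothetical problem constants
for sigma in (0.0, 0.1):
    for T in (10**2, 10**4, 10**6):
        smooth_term = 32 * np.sqrt(2) * L * C_h**2 / (K_h * T**2)
        noise_term = 8 * np.sqrt(2) * sigma * C_h / (np.sqrt(K_h) * np.sqrt(T))
        print(f"sigma={sigma}, T={T:>8}: bound = {smooth_term + noise_term:.3e}")
```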

Convergence of UnderGrad in unbounded domains.

Finally, we give the proof of Theorem 2 concerning the deterministic SFO in the (LG) case with a possibly unbounded domain 𝒳\mathcal{X}.

Proof.

Since the learning rate $\eta_{t}$ is non-increasing and non-negative, its limit exists; in particular,

limtηt=inftηt0\lim_{t\to\infty}\eta_{t}=\inf\nolimits_{t\in\mathbb{N}}\eta_{t}\geq 0 (B.42)

Assume now, to the contrary, that $\inf_{t\in\mathbb{N}}\eta_{t}=0$. Then, by applying Proposition 1, we have:

t=1Tαtf(X¯t+1/2),Xt+1/2x\displaystyle\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})\mathopen{},\mathopen{}X_{t+1/2}-x\rangle h(x)minx𝒳h(x)ηT+1\displaystyle\leq\frac{h(x)-\min_{x\in\mathcal{X}}h(x)}{\eta_{T+1}} (B.43)
+\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t}),X_{t+1/2}-X_{t+1}\rangle
t=1TKh2ηtXt+1Xt+1/22t=1TKh2ηtXt+1/2Xt2\displaystyle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.44)

Now for the term t=1Tαtf(X¯t+1/2)f(X¯t),Xt+1/2Xt+1t=1TKh2ηtXt+1Xt+1/22\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2} we have:

-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}+\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t}),X_{t+1/2}-X_{t+1}\rangle
\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2} (B.45)

which readily yields:

t=1Tαtf(X¯t+1/2)f(X¯t),Xt+1/2Xt+1\displaystyle\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle t=1TKh2ηtXt+1Xt+1/22\displaystyle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}
\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} (B.46)

Hence, putting everything together, we get:

\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2}),X_{t+1/2}-x\rangle\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
Kh2t=1T1ηtXtXt+1/22\displaystyle-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t}-X_{t+1/2}\rVert^{2} (B.47)

Moreover, since ff is smooth we have:

f(X¯t+1/2)f(X¯t)2\displaystyle\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} L2X¯t+1/2X¯t2\displaystyle\leq L^{2}\lVert\overline{X}_{t+1/2}-\overline{X}_{t}\rVert^{2}
\leq L^{2}\frac{\alpha_{t}^{2}}{\bigl(\sum_{s=1}^{t}\alpha_{s}\bigr)^{2}}\lVert X_{t+1/2}-X_{t}\rVert^{2}
=L24t2t2(t+1)2Xt+1/2Xt2\displaystyle=L^{2}\frac{4t^{2}}{t^{2}(t+1)^{2}}\lVert X_{t+1/2}-X_{t}\rVert^{2}
4L2αt2Xt+1/2Xt2\displaystyle\leq\frac{4L^{2}}{\alpha_{t}^{2}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.48)

Combining this with the fact that ηt\eta_{t} is a decreasing sequence, we can rewrite (B.47) as follows:

\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2}),X_{t+1/2}-x\rangle\leq\frac{\eta_{1}}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
-\frac{K_{h}}{8L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} (B.49)

In what follows, we derive appropriate bounds for the two terms on the right-hand side of (B.49), starting with the second one. From (B.9), we also have $\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2}),X_{t+1/2}-x\rangle\geq 0$. Combining this with (B.49), we have:

0\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}-\frac{K_{h}}{8L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} (B.50)

Hence by rearranging we have:

\sum_{t=1}^{T}\frac{K_{h}\alpha_{t}^{2}}{16\eta_{t}L^{2}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
Kh16L2t=1Tαt2ηtf(X¯t+1/2)f(X¯t)2\displaystyle-\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
=12t=1Tαt2f(X¯t+1/2)f(X¯t)2[ηtKhKh8ηtL2]\displaystyle=\frac{1}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right] (B.51)

Now, since we assumed that $\eta_{t}$ converges to $0$, there exists some $T_{0}\in\mathbb{N}$ such that:

\eta_{t}\leq\frac{K_{h}}{\sqrt{8}L}\;\;\text{for all}\;\;t>T_{0} (B.52)

which directly yields that [ηtKhKh8ηtL2]0\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right]\leq 0 for all t>T0t>T_{0}. Hence, we have:

Kh16L2t=1Tαt2ηtf(X¯t+1/2)f(X¯t)212t=1T0αt2f(X¯t+1/2)f(X¯t)2[ηtKhKh8ηtL2]\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\leq\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right] (B.53)

On the other hand, we have:

1ηt\displaystyle\frac{1}{\eta_{t}} =1KhKh+s=1t1αs2f(X¯s+1/2)f(X¯s)21KhKh=1\displaystyle=\frac{1}{\sqrt{K_{h}}}\sqrt{K_{h}+\sum_{s=1}^{t-1}\alpha_{s}^{2}\lVert\nabla f(\overline{X}_{s+1/2})-\nabla f(\overline{X}_{s})\rVert_{\ast}^{2}}\geq\frac{1}{\sqrt{K_{h}}}\sqrt{K_{h}}=1 (B.54)

and hence

Kh16L2t=1Tαt2ηtf(X¯t+1/2)f(X¯t)2\displaystyle\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} Kh16L2t=1Tαt2f(X¯t+1/2)f(X¯t)2\displaystyle\geq\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
=Kh216L2Kht=1Tαt2f(X¯t+1/2)f(X¯t)2\displaystyle=\frac{K_{h}^{2}}{16L^{2}K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
=\frac{K_{h}^{2}}{16L^{2}}\left(\frac{1}{\eta_{T+1}^{2}}-1\right) (B.55)

So, summarizing we have:

\frac{K_{h}^{2}}{16L^{2}}\left(\frac{1}{\eta_{T+1}^{2}}-1\right)\leq\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right] (B.56)

Now, letting $T\to\infty$ and recalling that we assumed $\eta_{t}\to 0$ (so that $1/\eta_{T+1}^{2}\to\infty$), the left-hand side of (B.56) diverges while the right-hand side remains finite, i.e.,

12t=1T0αt2f(X¯t+1/2)f(X¯t)2[ηtKhKh8ηtL2]<\infty\leq\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right]<\infty (B.57)

a contradiction. This shows that inftηt>0\inf_{t\in\mathbb{N}}\eta_{t}>0 and hence

\sum_{t=1}^{\infty}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}=\lim_{T\to\infty}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}=\lim_{T\to\infty}\left(\frac{K_{h}}{\eta_{T+1}^{2}}-K_{h}\right)=\frac{K_{h}}{(\inf_{t}\eta_{t})^{2}}-K_{h}<\infty (B.58)

so our proof is complete. ∎

Appendix C Additional Numerical Experiments

Figure 3. Convergence of UnderGrad and UnixGrad in the simplex setup with a noisy SFO. The plot is drawn in log-log scale; the $y$-axis shows the difference between the $f$-value of each algorithm's relevant point and $\min f$.

In this last section, we report another numerical experiment highlighting the universality of UnderGrad. We again focus on the simplex setup presented in Section 5; this time, however, we work with a noisy SFO whose first-order feedback is perturbed by zero-mean Gaussian noise with a pre-determined variance. We compare the performance of UnderGrad and UnixGrad, both run with the entropic regularizer. The results of this experiment are reported in Fig. 3.

Fig. 3 shows that UnderGrad achieves the optimal rate $\operatorname{\mathcal{O}}(1/\sqrt{T})$ in this setup. UnixGrad can also attain the same rate, but only when its step-size update rule is tuned appropriately (recall that, with the entropic regularizer, the update rule (14) of UnixGrad is unavailable because $B_{h}=\infty$): when $\gamma_{1}$ is of the same order as (or larger than) UnderGrad's initial learning rate, UnixGrad converges at the rate $\operatorname{\mathcal{O}}(1/\sqrt{T})$; but if $\gamma_{1}$ is too small (e.g., $\gamma_{1}=10^{-3}\cdot\eta_{1}$), UnixGrad can exhibit a very long warm-up phase. This experiment reasserts that, in cases where the step-size update rule (14) is unavailable, it is non-trivial to choose an appropriate step-size schedule for UnixGrad: a small $\gamma_{1}$ may lead to better performance under a perfect SFO (cf. Section 5) but may cause unwanted behavior in noisy SFO setups. By contrast, UnderGrad did not encounter such issues in any of our experiments.
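For readers who wish to experiment with a toy version of this setup, the sketch below assembles its main ingredients: the entropic mirror map on the simplex and a noisy SFO obtained by adding zero-mean Gaussian noise to the true gradient. The quadratic test objective and all constants are hypothetical stand-ins (the actual problem instance is described in Section 5), and the noisy oracle is plugged into a simple dual-averaging (lazy mirror descent) loop purely for illustration; the updates of UnderGrad and UnixGrad themselves are given in the main text.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 20
A = rng.normal(size=(d, d)); A = A.T @ A / d            # hypothetical quadratic objective
x_target = rng.dirichlet(np.ones(d))                    # minimizer placed inside the simplex
f = lambda x: 0.5 * (x - x_target) @ A @ (x - x_target)
grad_f = lambda x: A @ (x - x_target)

sigma = 0.1                                             # noise level of the SFO
sfo = lambda x: grad_f(x) + rng.normal(scale=sigma, size=d)

def Q(y):                                               # entropic mirror map (softmax)
    w = np.exp(y - np.max(y))
    return w / np.sum(w)

# Simple dual-averaging loop driven by the noisy SFO, as an illustrative stand-in.
y = np.zeros(d)
x_avg, weight = np.zeros(d), 0.0
for t in range(1, 2001):
    x = Q(y)
    y -= (1.0 / np.sqrt(t)) * sfo(x)
    x_avg, weight = (weight * x_avg + t * x) / (weight + t), weight + t

print(f(x_avg))   # value gap f(x_avg) - min f, since min f = 0 here
```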

Finally, we conducted another experiment to examine how UnderGrad's convergence rate depends on the noise level $\sigma$; the results are reported in Fig. 4.

Figure 4. Convergence of UnderGrad in the simplex setup with different noise levels of the SFO.

Acknowledgments

KA and VC were supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement № 725594 - time-data) and by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. KYL is grateful for financial support by the Israel Science Foundation (grant No. 447/20), by Israel PBC-VATAT, and by the Technion Center for Machine Learning and Intelligent Systems (MLIS). PM is grateful for financial support by the French National Research Agency (ANR) in the framework of the "Investissements d'avenir" program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), MIAI@Grenoble Alpes (ANR-19-P3IA-0003), and the bilateral ANR-NRF grant ALIAS (ANR-19-CE48-0018-01).

References

  • Allen-Zhu & Orecchia [2017] Allen-Zhu, Z. and Orecchia, L. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS ’17, 2017. Full version available at http://arxiv.org/abs/1407.1537.
  • Antonakopoulos & Mertikopoulos [2021] Antonakopoulos, K. and Mertikopoulos, P. Adaptive first-order methods revisited: Convex optimization without Lipschitz requirements. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
  • Antonakopoulos et al. [2019] Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. An adaptive mirror-prox algorithm for variational inequalities with singular operators. In NeurIPS ’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
  • Antonakopoulos et al. [2021a] Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. Adaptive extra-gradient methods for min-max optimization and games. In ICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021a.
  • Antonakopoulos et al. [2021b] Antonakopoulos, K., Pethick, T., Kavis, A., Mertikopoulos, P., and Cevher, V. Sifting through the noise: Universal first-order methods for stochastic variational inequalities. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021b.
  • Auer et al. [2002] Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
  • Bach & Levy [2019] Bach, F. and Levy, K. Y. A universal algorithm for variational inequalities adaptive to smoothness and noise. In COLT ’19: Proceedings of the 32nd Annual Conference on Learning Theory, 2019.
  • Bauschke & Combettes [2017] Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, NY, USA, 2 edition, 2017.
  • Bilenne et al. [2020] Bilenne, O., Mertikopoulos, P., and Belmega, E. V. Fast optimization with zeroth-order feedback in distributed multi-user MIMO systems. IEEE Trans. Signal Process., 68:6085–6100, October 2020.
  • Boyd & Vandenberghe [2004] Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
  • Bubeck [2015] Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–358, 2015.
  • Bubeck & Cesa-Bianchi [2012] Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • Cesa-Bianchi & Lugosi [2006] Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Cesa-Bianchi & Lugosi [2012] Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422, 2012.
  • Chen & Teboulle [1993] Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, August 1993.
  • Cutkosky [2019] Cutkosky, A. Anytime online-to-batch, optimism and acceleration. In ICML ’19: Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Ene et al. [2021] Ene, A., Nguyen, H. L., and Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. In AAAI ’21: Proceedings of the 35th Conference on Artificial Intelligence, 2021.
  • Facchinei & Pang [2003] Facchinei, F. and Pang, J.-S. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research. Springer, 2003.
  • György et al. [2007] György, A., Linder, T., Lugosi, G., and Ottucsák, G. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007.
  • Hsieh et al. [2021] Hsieh, Y.-G., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In COLT ’21: Proceedings of the 34th Annual Conference on Learning Theory, 2021.
  • Joulani et al. [2020] Joulani, P., Raj, A., György, A., and Szepesvári, C. A simpler approach to accelerated stochastic optimization: Iterative averaging meets optimism. In ICML ’20: Proceedings of the 37th International Conference on Machine Learning, 2020.
  • Juditsky et al. [2011] Juditsky, A., Nemirovski, A. S., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
  • Kakade et al. [2012] Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13:1865–1890, 2012.
  • Kavis et al. [2019] Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. UnixGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In NeurIPS ’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
  • Korpelevich [1976] Korpelevich, G. M. The extragradient method for finding saddle points and other problems. Èkonom. i Mat. Metody, 12:747–756, 1976.
  • Lan [2012] Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, June 2012.
  • Lattimore & Szepesvári [2020] Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020.
  • Levy et al. [2018] Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. In NeurIPS ’18: Proceedings of the 32nd International Conference of Neural Information Processing Systems, 2018.
  • Li & Orabona [2019] Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In AISTATS ’19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
  • McMahan & Streeter [2010] McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In COLT ’10: Proceedings of the 23rd Annual Conference on Learning Theory, 2010.
  • Mertikopoulos & Sandholm [2016] Mertikopoulos, P. and Sandholm, W. H. Learning in games via reinforcement and regularization. Mathematics of Operations Research, 41(4):1297–1324, November 2016.
  • Mertikopoulos & Staudigl [2018] Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, January 2018.
  • Mertikopoulos & Zhou [2019] Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507, January 2019.
  • Mertikopoulos et al. [2017] Mertikopoulos, P., Belmega, E. V., Negrel, R., and Sanguinetti, L. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process., 65(9):2277–2290, May 2017.
  • Mertikopoulos et al. [2019] Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR ’19: Proceedings of the 2019 International Conference on Learning Representations, 2019.
  • Nemirovski [2004] Nemirovski, A. S. Prox-method with rate of convergence O(1/t){O}(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
  • Nemirovski & Yudin [1983] Nemirovski, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, NY, 1983.
  • Nesterov [2004] Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Number 87 in Applied Optimization. Kluwer Academic Publishers, 2004.
  • Nesterov [2007] Nesterov, Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2):319–344, 2007.
  • Nesterov [2015] Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
  • Rakhlin & Sridharan [2013a] Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT ’13: Proceedings of the 26th Annual Conference on Learning Theory, 2013a.
  • Rakhlin & Sridharan [2013b] Rakhlin, A. and Sridharan, K. Optimization, learning, and games with predictable sequences. In NIPS ’13: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2013b.
  • Rockafellar [1970] Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
  • Telatar [1999] Telatar, I. E. Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications and Related Technologies, 10(6):585–596, 1999.
  • Vu et al. [2021] Vu, D. Q., Antonakopoulos, K., and Mertikopoulos, P. Fast routing under uncertainty: Adaptive learning in congestion games with exponential weights. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
  • Ward et al. [2019] Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In ICML ’19: Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Yu et al. [2004] Yu, W., Rhee, W., Boyd, S. P., and Cioffi, J. M. Iterative water-filling for Gaussian vector multiple-access channels. IEEE Trans. Inf. Theory, 50(1):145–152, 2004.