
A Universal Black-Box Optimization Method with
Almost Dimension-Free Convergence Rate Guarantees

Kimon Antonakopoulos∗,⋄ — Laboratory for Information and Inference Systems, IEL STI, EPFL, Lausanne, Switzerland. [email protected]
Dong Quan Vu♯,⋄ — Safran Tech, Magny-Les-Hameaux, France. [email protected]
Volkan Cevher — [email protected]
Kfir Yehuda Levy — Technion, Haifa, Israel. [email protected]
Panayotis Mertikopoulos‡,§ — Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France; § Criteo AI Lab. [email protected]
Abstract.

Universal methods for optimization are designed to achieve theoretically optimal convergence rates without any prior knowledge of the problem's regularity parameters or the accuracy of the gradient oracle employed by the optimizer. In this regard, existing state-of-the-art algorithms achieve an $\mathcal{O}(1/T^2)$ value convergence rate in Lipschitz smooth problems with a perfect gradient oracle, and an $\mathcal{O}(1/\sqrt{T})$ convergence rate when the underlying problem is non-smooth and/or the gradient oracle is stochastic. On the downside, these methods do not take into account the problem's dimensionality, and this can have a catastrophic impact on the achieved convergence rate, in both theory and practice. Our paper aims to bridge this gap by providing a scalable universal gradient method – dubbed UnderGrad – whose oracle complexity is almost dimension-free in problems with a favorable geometry (like the simplex, linearly constrained semidefinite programs, and combinatorial bandits), while retaining the order-optimal dependence on $T$ described above. These “best-of-both-worlds” results are achieved via a primal-dual update scheme inspired by the dual extrapolation method for variational inequalities.

Key words and phrases:
Universal methods; dimension-freeness; dual extrapolation; rate interpolation.
2020 Mathematics Subject Classification:
Primary 90C25, 90C15; secondary 68Q32, 68T05.
This work was done when KA and DQV were affiliated with Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France.

1. Introduction

The analysis of first-order methods for convex minimization typically revolves around the following regularity conditions: (a) Lipschitz continuity of a problem's objective function and/or (b) Lipschitz continuity of the objective's gradients. Depending on these conditions and the quality of the gradient oracle available to the optimizer, the optimal convergence rates that can be obtained by an iterative first-order algorithm after $T$ oracle queries are:

  (1) $\mathcal{O}\big(\lVert\mathcal{X}\rVert\sqrt{(G^2+\sigma^2)/T}\big)$ if the problem's objective is $G$-Lipschitz continuous and the oracle's variance is $\sigma$.

  (2) $\mathcal{O}\big(L\lVert\mathcal{X}\rVert^2/T^2 + \sigma\lVert\mathcal{X}\rVert/\sqrt{T}\big)$ if the objective is $L$-Lipschitz smooth.

[In both cases, $\lVert\mathcal{X}\rVert \coloneqq \sup_{x,x'\in\mathcal{X}}\lVert x'-x\rVert$ denotes the diameter of the problem's domain $\mathcal{X}\subseteq\mathbb{R}^d$; for an in-depth treatment, see [38, 11] and references therein.]

This stark separation of black-box guarantees has led to an intense search for universal methods that are capable of interpolating smoothly between these rates without any prior knowledge of the problem’s regularity properties or the oracle’s noise profile. As far as we are aware, the first algorithm with order-optimal rate guarantees for unconstrained problems and no knowledge of the problem’s smoothness parameters was the AcceleGrad proposal of Levy et al. [28]. Subsequently, in the context of constrained convex problems (the focus of our work), Kavis et al. [24] combined the extra-gradient / mirror-prox algorithmic template of Korpelevich [25] and Nemirovski [36] with an “iterate averaging” scheme introduced by Cutkosky [16] to change the query structure of the base algorithm and make it more amenable to acceleration. In this way, Kavis et al. [24] obtained a universal extra-gradient algorithm – dubbed UnixGrad – which interpolates between the optimal rates mentioned above, without requiring any tuning.

Our contributions.

The starting point of our paper is the observation that, even though the rates in question are optimal in $T$, they may be highly suboptimal in $d$, the problem's dimensionality. For example, if the noise in the oracle has unit variance, $\sigma$ would scale as $\mathcal{O}(\sqrt{d})$; this represents a hidden dependence on $d$ which could have a catastrophic impact on the method's actual convergence rate. Likewise, in problems with a favorable geometry (like the $L^1$-ball, trace-constrained semidefinite programs, combinatorial bandits, etc.), methods based on the mirror descent [37] and mirror-prox [36] templates can achieve rates with a logarithmic (instead of polynomial) dependence on $d$.

Importantly, the UnixGrad algorithm of Kavis et al. [24] is itself based on the mirror-prox blueprint, so it would seem ideally suited to achieve convergence rates that are simultaneously optimal in $T$ and $d$. However, the method's guarantees depend crucially on the Bregman diameter of the problem's domain, a quantity which becomes infinite when the method is used with a regularization setup that can lead to almost dimension-free guarantees. This would seem to suggest that universality comes at the cost of scalability, leading to the following open question:

Is it possible to achieve almost dimension-free convergence rates
while retaining an order-optimal dependence on TT?

In this paper, we develop a novel adaptive algorithm, which we call universal dual extrapolation with reweighted gradients (UnderGrad), and which provides a positive answer to this question. Specifically, the value convergence rate of UnderGrad scales in terms of $G$, $\sigma$, $L$ and $T$ as:

  (1) $\mathcal{O}\big(\sqrt{R_h(G^2+\sigma^2)/T}\big)$ in non-smooth problems.

  (2) $\mathcal{O}\big(R_h L/T^2 + \sigma\sqrt{R_h/T}\big)$ in smooth problems.

In the above, the method's “range parameter” $R_h$ scales as $\mathcal{O}(\lVert\mathcal{X}\rVert^2)$ in the worst case and as $\mathcal{O}(\log d)$ in problems with a favorable geometry – that is, in problems where it is possible to attain almost dimension-free convergence rates [38, 11]. In this regard, UnderGrad seems to be the first method in the literature that concurrently achieves order-optimal rates in both $T$ and $d$, without any prior knowledge of the problem's level of smoothness.

To achieve this result, the UnderGrad algorithm combines the following basic ingredients:

  1. (1)

    A modified version of the dual extrapolation method of Nesterov [39] for solving variational inequalities.

  2. (2)

    A gradient “reweighting” scheme that allows gradients to enter the algorithm with increasing weights.

  3. (3)

    An iterate averaging scheme in the spirit of Cutkosky [16] which allows us to obtain an accelerated rate of convergence by means of an online-to-batch conversion.

The glue that holds these elements together is an adaptive learning rate inspired by Rakhlin & Sridharan [41, 42] which automatically rescales aggregated gradients by (a) a small, constant amount when the method approaches a solution where gradient differences vanish (as in the smooth, deterministic case); and (b) a factor of $\mathcal{O}(\sqrt{T})$ otherwise (thus providing the desired interpolation between smooth and non-smooth problems). In so doing, the proposed policy attains the correct step-size scaling and, with it, the desired optimal rates.

Related work.

The term “universality” was coined by Nesterov [40] whose universal primal gradient descent (UPGD) algorithm interpolates between the $\mathcal{O}(1/T^2)$ and $\mathcal{O}(1/\sqrt{T})$ rates for smooth and non-smooth problems respectively (assuming access to noiseless gradients in both cases). On the downside, UPGD relies on an Armijo-like line search to interpolate between smooth and non-smooth objectives, so it is not applicable to stochastic environments.

A partial work-around to this issue was achieved by the accelerated stochastic approximation (AC-SA) algorithm of Lan [26] which uses a mirror descent template and guarantees order-optimal rates for both noisy and noiseless oracles. However, to attain these rates, the AC-SA algorithm requires a precise estimate of the smoothness modulus of the problem’s objective, so it is not universal in this respect. Subsequent works on the topic have focused on attaining universal guarantees for composite problems [21], non-convex objectives [29, 46], preconditioned methods [21, 17], non-Lipschitz settings [3, 2, 4], specific applications [45], or variational inequalities / min-max problems [7, 4, 20, 5].

Of the generalist works above, some employ a Bregman regularization setup [2, 7], but the guarantees obtained therein either fall short of an accelerated $\mathcal{O}(1/T^2)$ convergence rate for Lipschitz smooth problems, or they depend on the problem's Bregman diameter – so they cannot be associated with a Bregman setup leading to almost dimension-free convergence rate guarantees. To the best of our knowledge, UnderGrad is the first method that manages to combine the “best of both worlds” in terms of universality with respect to $T$ and scalability with respect to $d$.

2. Preliminaries

2.1. Notation and basic definitions

Let $\mathcal{V}$ be a $d$-dimensional space with norm $\lVert\cdot\rVert$. In what follows, we will write $\mathcal{Y}\equiv\mathcal{V}^{\ast}$ for the dual of $\mathcal{V}$, $\langle y, x\rangle$ for the pairing between $y\in\mathcal{Y}$ and $x\in\mathcal{V}$, and $\lVert y\rVert_{\ast}\equiv\sup\{\langle y, x\rangle : \lVert x\rVert\leq 1\}$ for the dual norm on $\mathcal{Y}$. Given an extended-real-valued convex function $f\colon\mathcal{V}\to\mathbb{R}\cup\{\infty\}$, we will write $\operatorname{dom} f\equiv\{x\in\mathcal{V} : f(x)<\infty\}$ for its effective domain and $\partial f(x)\equiv\{y\in\mathcal{Y} : f(x')-f(x)-\langle y, x'-x\rangle\geq 0\ \text{for all } x'\in\mathcal{V}\}$ for the subdifferential of $f$ at $x\in\operatorname{dom} f$. Any element $g\in\partial f(x)$ will be called a subgradient of $f$ at $x$, and we will write $\operatorname{dom}\partial f\equiv\{x\in\operatorname{dom} f : \partial f(x)\neq\varnothing\}$ for the domain of subdifferentiability of $f$.

2.2. Problem setup and blanket assumptions

The main focus of our paper is the solution of convex minimization problems of the form

minimize   $f(x)$
subject to   $x \in \mathcal{X}$     (Opt)

where $\mathcal{X}$ is a closed convex subset of $\mathcal{V}$ and $f\colon\mathcal{V}\to\mathbb{R}\cup\{\infty\}$ is a convex function with $\operatorname{dom} f = \operatorname{dom}\partial f = \mathcal{X}$. To avoid trivialities, we will assume throughout that the solution set $\mathcal{X}^{\ast}\coloneqq\operatorname*{arg\,min} f$ of (Opt) is non-empty, and we will write $x^{\ast}$ for a generic minimizer of $f$.

Other than this blanket assumption, our main regularity requirements for $f$ will be as follows:

  (1) Lipschitz continuity:

      $\lvert f(x') - f(x)\rvert \leq G\lVert x'-x\rVert$     (LC)

      for some $G\geq 0$ and for all $x, x'\in\mathcal{X}$.

  (2) Lipschitz smoothness:

      $f(x') \leq f(x) + \langle\nabla f(x), x'-x\rangle + \frac{L}{2}\lVert x'-x\rVert^{2}$     (LS)

      for some $L\geq 0$, for all $x, x'\in\mathcal{X}$, and for any selection $\nabla f(x)\in\partial f(x)$.

Since $\operatorname{dom}\partial f = \mathcal{X}$, the above requirements are respectively equivalent to assuming that $f$ admits a selection of subgradients $\nabla f(x)\in\partial f(x)$ with the properties below:

  (1) Bounded (sub)gradient selection:

      $\lVert\nabla f(x)\rVert_{\ast}\leq G$     (BG)

      for some $G\geq 0$ and for all $x\in\mathcal{X}$.

  (2) Lipschitz (sub)gradient selection:

      $\lVert\nabla f(x')-\nabla f(x)\rVert_{\ast}\leq L\lVert x'-x\rVert$     (LG)

      for some $L\geq 0$ and for all $x, x'\in\mathcal{X}$.

In the rest of our paper, we will assume that $f$ satisfies at least one of (BG) or (LG).

Remark 1.

For posterity, we note here that the requirement (LG) does not imply that $\partial f(x)$ is a singleton.¹ [Footnote 1: Consider for example the case of $f(x)=x$ for $x\in[0,1]$ and $f(x)=\infty$ otherwise: $f$ clearly satisfies (BG)/(LS), even though $\partial f(0)$ and $\partial f(1)$ are infinite sets.] In any case, the directional derivative $f'(x;z) = \frac{d}{dt}\big|_{t=0} f(x+tz)$ of $f$ at $x\in\mathcal{X}$ along $z\in\mathcal{V}$ exists and is equal to $\langle\nabla f(x), z\rangle$ for all vectors of the form $z = x'-x$, $x'\in\mathcal{X}$. We will use this fact freely in the sequel.

2.3. The oracle model

To solve (Opt), we will consider iterative methods and algorithms with access to a stochastic first-order oracle (SFO), i.e., a black-box device that returns a (possibly random) estimate of a subgradient of $f$ at the point at which it was queried. Formally, following Nesterov [38], an SFO for $f$ is a measurable function $\operatorname{\mathsf{G}}\colon\mathcal{X}\times\Omega\to\mathcal{Y}$ such that

$\operatorname{\mathbb{E}}[\operatorname{\mathsf{G}}(x;\omega)] = \nabla f(x)$   for all $x\in\mathcal{X}$     (SFO)

where $(\Omega,\mathcal{F},\operatorname{\mathbb{P}})$ is a complete probability space and $\nabla f(x)$ is a selection of subgradients of $f$ as per (BG)/(LG). The oracle's statistical precision will then be measured by the associated noise level $\sigma \coloneqq \operatorname{ess}\sup_{\omega,x}\lVert\operatorname{\mathsf{G}}(x;\omega)-\nabla f(x)\rVert_{\ast}$ (assumed finite). In particular, if $\sigma=0$, $\operatorname{\mathsf{G}}$ will be called perfect (or deterministic); otherwise, $\operatorname{\mathsf{G}}$ will be called noisy.

In practice, the oracle is called repeatedly at a sequence of query points $x_t$ with a different random seed $\omega_t$ drawn according to $\operatorname{\mathbb{P}}$ at each time.² [Footnote 2: In the sequel, $t$ may take both integer and half-integer values.] In this way, at the $t$-th query to (SFO), the oracle $\operatorname{\mathsf{G}}$ returns the gradient signal

$g_t = \operatorname{\mathsf{G}}(x_t;\omega_t) = \nabla f(x_t) + U_t$     (1)

where $U_t$ denotes the “gradient noise” of the oracle (obviously, $U_t\equiv 0$ if the oracle is perfect). For measurability purposes, we will write $\mathcal{F}_t$ for the history (adapted filtration) of $x_t$, so $x_t$ is $\mathcal{F}_t$-measurable (by definition) but $\omega_t$, $g_t$ and $U_t$ are not. In particular, conditioning on $\mathcal{F}_t$, we have $\operatorname{\mathbb{E}}[g_t \,|\, \mathcal{F}_t] = \nabla f(x_t)$ and $\operatorname{\mathbb{E}}[U_t \,|\, \mathcal{F}_t] = 0$, justifying in this way the terminology “gradient noise” for $U_t$.
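To fix ideas, the snippet below sketches one way such an oracle could be implemented in code; the quadratic objective, the uniform noise model, and the use of the Euclidean norm are illustrative assumptions, not part of the setup above.

```python
import numpy as np

def make_oracle(grad_fn, sigma, rng=None):
    """Return a stochastic first-order oracle G(x; omega) = grad_fn(x) + U,
    with U zero-mean (by symmetry of the noise) and ||U|| <= sigma."""
    rng = np.random.default_rng() if rng is None else rng

    def oracle(x):
        u = rng.uniform(-1.0, 1.0, size=x.shape)      # symmetric, zero-mean noise
        u *= sigma / max(np.linalg.norm(u), 1e-12)    # rescale so ||U|| <= sigma (Euclidean norm here)
        return grad_fn(x) + u

    return oracle

# Illustrative objective: f(x) = 0.5 * ||x - c||^2 with gradient x - c.
c = np.array([0.2, 0.5, 0.3])
G = make_oracle(lambda x: x - c, sigma=0.1)
g_t = G(np.ones(3) / 3)   # one oracle query: g_t = grad f(x_t) + U_t
```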

Remark 2.

The oracle model detailed above is not the only one possible, but it is very widely used in the analysis of parameter-agnostic and adaptive methods, cf. [28, 24, 46, 2] and references therein. In view of this, we will not examine either finer or coarser assumptions for (SFO).

We close this section by noting that the best convergence rates that can be achieved by an iterative algorithm that outputs a candidate solution $\bar{x}_T\in\mathcal{X}$ after $T$ queries to (SFO) are:³ [Footnote 3: In general, the query and output points – $x_T$ and $\bar{x}_T$ respectively – need not coincide, hence the different notation. The only assumption for the rates provided below is that the output point $\bar{x}_T$ is an affine combination of $x_1, g_1, \dotsc, x_T, g_T$ [38, 11].]

  (1) $f(\bar{x}_T)-\min f = \mathcal{O}(1/\sqrt{T})$ if $f$ satisfies (BG) and $\operatorname{\mathsf{G}}$ is deterministic.

  (2) $f(\bar{x}_T)-\min f = \mathcal{O}(1/T^{2})$ if $f$ satisfies (LG) and $\operatorname{\mathsf{G}}$ is deterministic.

  (3) $\operatorname{\mathbb{E}}[f(\bar{x}_T)-\min f] = \mathcal{O}(1/\sqrt{T})$ if $\operatorname{\mathsf{G}}$ is stochastic.

In general, without finer assumptions on $f$ or $\operatorname{\mathsf{G}}$, the dependence of these rates on $T$ cannot be improved [38, 11]; we will revisit this issue several times in the sequel.

3. Regularization, universality, and the curse of dimensionality

To set the stage for the analysis to come, we discuss below the properties of two algorithmic frameworks – one non-adaptive, the other adaptive – based on the mirror-prox template [36]. Our aim in doing this will be to set a baseline for the sequel as well as to explore the impact of the problem’s dimensionality on the attained rates of convergence.

3.1. Motivating examples

As a first step, we present three archetypal problems to motivate and illustrate the general setup that follows.

Example 1 (Resource allocation).

Consider a set of computing resources (GPUs in a cluster, servers in a computing grid, …) indexed by $s\in\mathcal{S}=\{1,\dotsc,d\}$. Each resource is capable of serving a stream of computing demands that arrive at a rate of $\rho$ units per time: if the optimizer assigns a load of $x_s\geq 0$ to the $s$-th resource, the marginal cost incurred is $c_s(x_s)$ per unit served, where $c_s\colon[0,\rho]\to\mathbb{R}_{+}$ is the cost function of the $s$-th resource (assumed convex, differentiable, and increasing in $x_s$). Taking $\rho=1$ for simplicity, the goal of the optimizer is to minimize the aggregate cost $f(x)=\sum_{s=1}^{d} x_s c_s(x_s)$, leading to a convex minimization problem over the unit $d$-dimensional simplex $\mathcal{X}=\operatorname{\Delta}(\mathcal{S})=\{x\in\mathbb{R}_{+}^{d} : \sum_s x_s = 1\}$.
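As a minimal illustration of this setup (with hypothetical linear marginal costs $c_s(x_s) = a_s x_s$, so that $f$ becomes a convex quadratic on the simplex):

```python
import numpy as np

d = 4
a = np.array([1.0, 2.0, 0.5, 3.0])        # hypothetical cost slopes: c_s(x_s) = a_s * x_s

def f(x):
    """Aggregate cost f(x) = sum_s x_s * c_s(x_s) over the simplex."""
    return np.sum(x * (a * x))

def grad_f(x):
    """Gradient of f: d/dx_s [a_s * x_s^2] = 2 * a_s * x_s."""
    return 2.0 * a * x

x = np.full(d, 1.0 / d)                   # uniform load over the d resources
print(f(x), grad_f(x))
```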

Example 2 (Input covariance matrix optimization).

Consider a Gaussian vector channel in the spirit of [44, 47]: the encoder controls the covariance matrix $\mathbf{X}=\operatorname{\mathbb{E}}[\mathbf{x}\mathbf{x}^{\dagger}]$ of the Gaussian input signal $\mathbf{x}\in\mathbb{C}^{M}$ and seeks to maximize the transfer rate of the output signal $\mathbf{y}=\mathbf{H}\mathbf{x}+\mathbf{z}$, where $\mathbf{z}\in\mathbb{C}^{N}$ is the ambient noise in the channel and $\mathbf{H}\in\mathbb{C}^{N\times M}$ is the channel's transfer matrix. By the Shannon–Telatar formula [44], this boils down to maximizing the capacity function

$R(\mathbf{X}) = \operatorname{\mathbb{E}}\left[\log\det\left(\mathbf{I}+\mathbf{H}\mathbf{X}\mathbf{H}^{\dagger}\right)\right]$     (2)

subject to the constraint $\operatorname{tr}(\mathbf{X})\leq P$, where $P$ denotes the encoder's maximum input power and the expectation in (2) is taken over the statistics of the (possibly deterministic) matrix $\mathbf{H}$. Since $R$ is concave in $\mathbf{X}$ [10, 47], we obtain a minimization problem of the form (Opt) over the spectrahedron $\mathcal{D}=\{\mathbf{X}\succcurlyeq 0 : \operatorname{tr}(\mathbf{X})\leq P\}$. Since $\mathbf{X}$ is Hermitian, $\mathcal{D}$ can be seen as a convex body of $\mathbb{R}^{d}$ where $d=M^{2}$; in the optimization literature, this is sometimes referred to as the “spectrahedron setup” [22].

Example 3 (Combinatorial bandits).

In bandit linear optimization problems, the optimizer is given a finite set of $n$ possible actions $\mathcal{A}\subseteq\{0,1\}^{d}$, i.e., each action $\alpha\in\mathcal{A}$ is a $d$-dimensional binary vector indicating whether the $i$-th component is “on” or “off”. The optimizer then chooses an action $\alpha\in\mathcal{A}$ based on a mixed strategy $p\in\operatorname{\Delta}(\mathcal{A})$ and incurs the mean loss

$\ell(p;\omega) = \operatorname{\mathbb{E}}\left[\sum_{\alpha\in\mathcal{A}} p_{\alpha}\langle\alpha, \omega\rangle\right]$     (3)

where $\omega$ is a random vector with values in $[0,1]^{d}$ (but otherwise unknown distribution). In many cases of interest – such as slate recommendation and shortest-path problems – the cardinality of $\mathcal{A}$ is exponential in $d$, so it is computationally prohibitive to state the resulting minimization problem in terms of $p$. Instead, writing $x_i = \sum_{\alpha\in\mathcal{A}} p_{\alpha}\alpha_i$ for the probability of the $i$-th component being “on” under $p$, the optimizer's objective can be rewritten more compactly as $f(x) = \operatorname{\mathbb{E}}[\langle x, \omega\rangle]$ with $x$ constrained to lie on the $d$-dimensional convex hull $\mathcal{X}=\operatorname{conv}(\mathcal{A})$ of $\mathcal{A}$ in $\mathbb{R}^{d}$. In the literature on multi-armed bandits, this setup is known as a combinatorial bandit; for an in-depth treatment, see [27, 14, 19, 13] and the many references cited therein.
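To illustrate the reformulation, the quick numerical check below (with a toy action set of all $m$-subsets of $\{1,\dotsc,d\}$ and a hypothetical loss distribution, both of which are assumptions made for the example) verifies that the mean loss of a mixed strategy $p$ coincides with $\langle x, \operatorname{\mathbb{E}}\omega\rangle$ for $x = \sum_{\alpha} p_{\alpha}\alpha$.

```python
import itertools
import numpy as np

d, m = 5, 2
# Toy action set: all binary vectors with exactly m components "on".
actions = np.array([a for a in itertools.product([0, 1], repeat=d) if sum(a) == m])

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(len(actions)))      # a mixed strategy over the actions
mean_omega = rng.uniform(0.0, 1.0, size=d)    # E[omega] of some loss distribution

x = p @ actions                               # x_i = sum_alpha p_alpha * alpha_i
loss_via_p = p @ (actions @ mean_omega)       # sum_alpha p_alpha * <alpha, E omega>
loss_via_x = x @ mean_omega                   # f(x) = <x, E omega>
print(np.isclose(loss_via_p, loss_via_x))     # True: the two formulations agree
```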

Examples 1, 2 and 3 all suffer from the “curse of dimensionality”: for instance, the dimensionality of a vector Gaussian channel with $M=256$ input entries is $d\approx 6.5\times 10^{4}$, while a combinatorial bandit for recommendation systems may have upwards of several million arms. Nonetheless, these examples also share a number of geometric properties that make it possible to design scalable optimization algorithms with (almost) dimension-free convergence rate guarantees. We elaborate on this in the next section.

3.2. The mirror-prox template

We begin by considering the well-known mirror-prox (MP) method of Nemirovski [36]. Following [22, 35], this is defined via the recursion

$X_{t+1/2} = P_{X_t}(-\gamma_t g_t)$     (MP)
$X_{t+1} = P_{X_t}(-\gamma_t g_{t+1/2})$

where

  (1) $t = 1, 2, \dotsc$ denotes the method's iteration counter (for the origins of the half-integer notation, see Facchinei & Pang [18] and references therein).

  (2) $\gamma_t > 0$ is the algorithm's step-size sequence.

  (3) $g_t$ and $g_{t+1/2}$ are stochastic gradients of $f$ obtained by querying the oracle $\operatorname{\mathsf{G}}$ at $X_t$ and $X_{t+1/2}$ respectively.

  (4) $P_x(\cdot)$ is a generalized projection operator known as the method's “prox-mapping” (more on this later).

The most elementary instance of (MP) is the extra-gradient (EG) algorithm of Korpelevich [25], in which case the method’s prox-mapping is the Euclidean projector

$P_x(y) = \operatorname{\Pi}_{\mathcal{X}}(x+y) \coloneqq \operatorname*{arg\,min}_{x'\in\mathcal{X}} \lVert x+y-x'\rVert_2$     (4)

for all $x\in\mathcal{X}$, $y\in\mathcal{Y}$. More generally, the prox-mapping in (MP) is defined in terms of a Bregman regularizer as follows:

Regularizer | Domain ($\mathcal{X}$) | Breg. diam. ($B_h$) | Range ($R_h$) | Shape ($\chi$) | Rate ($L=\infty$) | Rate ($L<\infty$, $\sigma=0$)
Euclidean | any of the below | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ | $\sqrt{d}$ | $\mathcal{O}\big(\sqrt{d/T}\big)$ | $\mathcal{O}(d/T)$
Entropic | simplex | $\infty$ | $\log d$ | $1$ | $\mathcal{O}\big(\sqrt{\log d/T}\big)$ | $\mathcal{O}(\log d/T)$
Quantum | spectrahedron | $\infty$ | $\log d$ | $1$ | $\mathcal{O}\big(\sqrt{\log d/T}\big)$ | $\mathcal{O}(\log d/T)$
ComBand | $\operatorname{conv}(\mathcal{A})$ | $\infty$ | $\mathcal{O}(\log d)$ | $1$ | $\mathcal{O}\big(\sqrt{\log d/T}\big)$ | $\mathcal{O}(\log d/T)$
Table 1. The convergence rate of (MP) in terms of $d$ and $T$ for different regularizers. In the combinatorial setup of Example 3, the unnormalized entropy has $R_h = m(1+\log(d/m))$, where $m = \max_{\alpha\in\mathcal{A}}\lVert\alpha\rVert_1$ is the maximum number of elements of $\{1,\dotsc,d\}$ that can be simultaneously “on” [27, Chap. 30]. In many applications, $m$ does not scale with $d$, so it has been absorbed in the $\mathcal{O}(\cdot)$ notation; other than that, $\mathcal{O}(\cdot)$ contains only universal constants.
Definition 1.

A Bregman regularizer on $\mathcal{X}$ is a convex function $h\colon\mathcal{V}\to\mathbb{R}\cup\{\infty\}$ such that

  (1) $\operatorname{dom} h = \mathcal{X}$ and $h$ is continuous on $\mathcal{X}$.

  (2) The subdifferential of $h$ admits a continuous selection, i.e., there exists a continuous mapping $\nabla h\colon\operatorname{dom}\partial h\to\mathcal{Y}$ with $\nabla h(x)\in\partial h(x)$ for all $x\in\operatorname{dom}\partial h$.

  (3) $h$ is strongly convex on $\mathcal{X}$, i.e.,

      $h(x') \geq h(x) + \langle\nabla h(x), x'-x\rangle + \tfrac{1}{2} K_h\lVert x'-x\rVert^2$     (5)

      for some $K_h > 0$ and all $x\in\operatorname{dom}\partial h$, $x'\in\mathcal{X}$.

We also define the Bregman divergence of $h$ as

$D(x', x) = h(x') - h(x) - \langle\nabla h(x), x'-x\rangle$     (6)

and the induced prox-mapping as

$P_x(y) = \operatorname*{arg\,min}_{x'\in\mathcal{X}} \{\langle y, x-x'\rangle + D(x', x)\}$     (7)

for all $x\in\mathcal{X}_h$, $x'\in\mathcal{X}$, and all $y\in\mathcal{Y}$.
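As an illustration, for the entropic regularizer of Example 1 the prox-mapping (7) admits the familiar multiplicative-weights closed form $P_x(y)_i = x_i e^{y_i}/\sum_j x_j e^{y_j}$; the snippet below is a minimal sketch of this special case (the shift by $\max_i y_i$ is a numerical-stability detail, not part of the definition).

```python
import numpy as np

def entropic_prox(x, y):
    """Prox-mapping (7) for h(x) = sum_i x_i log x_i on the simplex:
    P_x(y)_i = x_i * exp(y_i) / sum_j x_j * exp(y_j)  (multiplicative weights)."""
    z = x * np.exp(y - y.max())          # subtract max(y) for numerical stability
    return z / z.sum()

x = np.full(3, 1.0 / 3)
print(entropic_prox(x, np.array([0.1, -0.2, 0.05])))
```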

Remark.

The set $\mathcal{X}_h \coloneqq \operatorname{dom}\partial h$ is often referred to as the prox-domain of $h$; by standard results in convex analysis, we have $\operatorname{ri}\mathcal{X} \subseteq \mathcal{X}_h \subseteq \mathcal{X}$ [43, Chap. 26].

In terms of output, the candidate solution returned by (MP) after $T$ iterations is the so-called “ergodic average”

$\bar{X}_T = \dfrac{\sum_{t=1}^{T}\gamma_t X_{t+1/2}}{\sum_{t=1}^{T}\gamma_t}.$     (8)

Then, assuming the method's step-size $\gamma_t$ is chosen appropriately (more on this below), $\bar{X}_T$ enjoys the following guarantees [22, 42]:

  (a) If $f$ satisfies (BG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_T) - \min f] = \mathcal{O}\left(\sqrt{\dfrac{G^2+\sigma^2}{K_h}\dfrac{D_1}{T}}\right)$     (9a)

  (b) If $f$ satisfies (LG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_T) - \min f] = \mathcal{O}\left(\dfrac{L D_1}{K_h T} + \sigma\sqrt{\dfrac{D_1}{K_h T}}\right)$     (9b)

In the above, $D_1 = D(x^{\ast}, X_1)$ is the minimum Bregman divergence between a solution $x^{\ast}$ of (Opt) and the initial state $X_1$ of (MP). In particular, if (MP) is initialized at the prox-center $x_c = \operatorname*{arg\,min} h$ of $\mathcal{X}$, we have

$D_1 \leq h(x^{\ast}) - \min h \leq \max h - \min h \eqqcolon R_h.$     (10)

We will refer to $R_h = \max h - \min h$ as the range of $h$.

To quantify the interplay between the problem's dimensionality and the rate guarantees (9) for (MP), it will be convenient to introduce the normalized regularity parameters

$G_h = \dfrac{G}{\sqrt{K_h}}, \quad L_h = \dfrac{L}{K_h}, \quad\text{and}\quad \sigma_h = \dfrac{\sigma}{\sqrt{K_h}}$     (11)

and the associated shape factor

$\chi = \begin{cases}\sqrt{G_h^2+\sigma_h^2} & \text{if } L=\infty,\\ \sqrt{L_h} & \text{if } L<\infty \text{ and } \sigma=0,\\ \sigma_h & \text{if } L<\infty \text{ and } \sigma>0.\end{cases}$     (12)

Since at least one of the terms $G/\sqrt{K_h}$, $L/K_h$ and $\sigma/\sqrt{K_h}$ appears in (9), it follows that the leading term in $T$ scales as $\mathcal{O}(\chi\sqrt{D_1/T})$ in non-smooth / stochastic environments, and as $\mathcal{O}(\chi^2 D_1/T)$ in smooth, deterministic problems.

The importance of the normalized parameters $G_h$, $L_h$, $\sigma_h$ and the shape factor $\chi$ lies in the fact that they do not depend on the ambient norm $\lVert\cdot\rVert$ (a choice which, to a certain extent, is arbitrary). Indeed, if $\lVert\cdot\rVert$ and $\lVert\cdot\rVert'$ are two norms on $\mathcal{V}$ that are related as $\lVert\cdot\rVert \leq \mu\lVert\cdot\rVert'$ for some $\mu>0$, it is straightforward to verify that $h$ is $(\mu^2 K_h)$-strongly convex relative to $\lVert\cdot\rVert'$. Likewise, in terms of dual norms we have $\lVert\cdot\rVert_{\ast} \geq (1/\mu)\lVert\cdot\rVert'_{\ast}$, so the constants $G$, $\sigma$ and $L$ would respectively become $\mu G$, $\mu\sigma$ and $\mu^2 L$ when computed under $\lVert\cdot\rVert'$. In general, these inequalities are all tight, so a change in norm does not affect the shape factor $\chi$; accordingly, any dependence of $\chi$ on $d$ will be propagated verbatim to the guarantees (9).

In Table 1, we provide the values of $R_h$ and $\chi$ for the following cases:

  (1) The Euclidean regularizer $h(x) = \lVert x\rVert_2^2/2$ that gives rise to the extra-gradient algorithm (4).

  (2) The entropic regularizer $h(x) = \sum_{i=1}^{d} x_i\log x_i$ for the simplex setup of Example 1.

  (3) The von Neumann regularizer $h(\mathbf{X}) = \operatorname{tr}(\mathbf{X}\log\mathbf{X}) + (1-\operatorname{tr}\mathbf{X})\log(1-\operatorname{tr}\mathbf{X})$ for the spectrahedron setup of Example 2.

  (4) The unnormalized entropy $h(x) = \sum_{i=1}^{d}(x_i\log x_i - x_i)$ for the combinatorial setup of Example 3.

These derivations are standard, so we omit the details. For posterity, we only note that the logarithmic dependence on $d$ is asymptotically optimal, cf. [13, 12] and references therein.
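As an illustration of the entries of Table 1, the range and strong-convexity modulus of the entropic regularizer on the simplex can be read off directly (a standard calculation, sketched here for completeness):

```latex
% For h(x) = \sum_i x_i \log x_i on the simplex \Delta(\mathcal{S}):
% the maximum 0 is attained at the vertices, the minimum -\log d at the uniform point, so
\[
R_h = \max_{x\in\Delta} h(x) - \min_{x\in\Delta} h(x)
    = 0 - \sum_{i=1}^{d} \tfrac{1}{d}\log\tfrac{1}{d}
    = \log d .
\]
% Moreover, Pinsker's inequality gives 1-strong convexity with respect to \|\cdot\|_1,
% i.e., K_h = 1; hence, with gradients that are O(1) in the dual \ell^\infty norm,
% the shape factor of the entropic setup is \chi = O(1), as reported in Table 1.
```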

3.3. The UnixGrad algorithm

As can be seen from Table 1, the mirror-prox algorithm achieves an almost dimension-free rate of convergence when used with a suitable regularizer. However, this comes with two important caveats: First, the algorithm's rate in the smooth case falls short of the optimal $\mathcal{O}(1/T^2)$ dependence in $T$, so (MP) is suboptimal in this regard. Second, to achieve the rates presented in Eq. (9), the algorithm's step-size $\gamma_t$ must be tuned with prior knowledge of the problem's parameters: in particular, under (BG), the algorithm must be run with step-size $\gamma_t \propto 1/\sqrt{(G^2+\sigma^2)T}$ while, under (LG), the algorithm requires $\gamma_t = K_h/L$ if $\sigma=0$ and $\gamma_t \propto 1/(\sigma\sqrt{T})$ otherwise. This creates an undesirable state of affairs because the parameters $G$, $L$ and $\sigma$ are usually not known in advance, and (MP) can – and does – fail to converge if run with an untuned step-size.

In the rest of this section, we briefly discuss the UnixGrad algorithm of Kavis et al. [24] which expands on the mirror-prox template in the following two crucial ways: (a) it introduces an iterate-averaging mechanism in the spirit of Cutkosky [16] to enable acceleration; and (b) it employs an adaptive step-size policy that does not require any tuning by the optimizer. In so doing, UnixGrad interpolates smoothly between the optimal convergence rates described in Section 2 without requiring any prior knowledge of $G$, $L$ or $\sigma$.

Concretely, UnixGrad proceeds as (MP), but instead of querying $\operatorname{\mathsf{G}}$ at $X_t$ and $X_{t+1/2}$, it introduces the weighted query states

$\bar{X}_t = \dfrac{\alpha_t X_t + \sum_{s=1}^{t-1}\alpha_s X_{s+1/2}}{\sum_{s=1}^{t}\alpha_s}$     (13)
$\bar{X}_{t+1/2} = \dfrac{\alpha_t X_{t+1/2} + \sum_{s=1}^{t-1}\alpha_s X_{s+1/2}}{\sum_{s=1}^{t}\alpha_s}$

where $\alpha_t$ is a “gradient weighting” parameter. Then, building on an idea by Rakhlin & Sridharan [41, 42], the oracle queries $g_t \leftarrow \operatorname{\mathsf{G}}(\bar{X}_t;\omega_t)$ and $g_{t+1/2} \leftarrow \operatorname{\mathsf{G}}(\bar{X}_{t+1/2};\omega_{t+1/2})$ are used to update the method's step-size as

$\gamma_t = \dfrac{B_h\alpha_t}{\sqrt{1+\sum_{s=1}^{t-1}\alpha_s^2\lVert g_{s+1/2}-g_s\rVert_{\ast}^2}}$     (14)

where

$B_h = \sup_{x\in\mathcal{X},\, x'\in\mathcal{X}_h}\sqrt{2D(x,x')}$     (15)

is the so-called Bregman diameter of $\mathcal{X}$.

With all this in hand, Kavis et al. [24] provide the following bounds if UnixGrad is run with $\alpha_t = t$:

  (a) If $f$ satisfies (BG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_{T+1/2}) - \min f] = \mathcal{O}\bigg(\dfrac{B_h\sqrt{G^2+\sigma^2}}{\sqrt{K_h T}}\bigg)$     (16a)

  (b) If $f$ satisfies (LG), then

      $\operatorname{\mathbb{E}}[f(\bar{X}_{T+1/2}) - \min f] = \mathcal{O}\left(\dfrac{B_h^2 L}{K_h T^2} + \dfrac{B_h\sigma}{\sqrt{K_h T}}\right)$     (16b)

As we mentioned in Section 2, the bounds (16) cannot be improved in terms of $T$ without further assumptions, so UnixGrad is universally optimal in this regard.

That being said, these guarantees also uncover an important limitation of UnixGrad, namely that the bounds (16) become void when the method is used in conjunction with one of the non-Euclidean frameworks of Examples 1, 2 and 3. For example, the Bregman diameter of the simplex under the entropic regularizer is $B_h = \sup_{x,x'}\sum_i x_i\log(x_i/x'_i) = \infty$, so the multiplicative constants in (16) become infinite (and the bounds themselves become meaningless). However, since the use of these regularizers is crucial to obtain the scalable, dimension-free convergence rates reported in Table 1,⁴ [Footnote 4: In particular, since the shape factor of the Euclidean regularizer is $\chi=\sqrt{d}$, employing UnixGrad with ordinary Euclidean projections would not lead to scalable guarantees.] we are led to the open question we stated before:

Is it possible to achieve almost dimension-free convergence rates
while retaining an order-optimal dependence on TT?

We address this question in the next section.

4. Universal dual extrapolation

The point of departure of our analysis is the observation that gradient queries enter (MP) with decreasing weights. Specifically, if UnixGrad is run with $\alpha_t = t$ (a choice which is necessary to have a shot at acceleration), the denominator of (14) may grow as fast as $\Theta(t^{3/2})$ in the non-smooth/stochastic case, leading to an asymptotic $\mathcal{O}(1/\sqrt{t})$ worst-case behavior for $\gamma_t$. In fact, even under the ansatz that the algorithm's query points converge to a minimizer of $f$ at an accelerated rate, the denominator of (14) may still grow as $\Theta(t)$, indicating that $\gamma_t$ will, at best, stabilize to a positive value as $t\to\infty$. This feature of the step-size rule (14) is somewhat counterintuitive because conventional wisdom would suggest that (a) recent queries are more useful than older, potentially obsolete ones; and (b) gradients should be “inflated” as the method's query points approach a zero-gradient solution in order to maintain a fast rate of convergence.

The problem with a vanishing step-size becomes especially pronounced if the method is used with a non-Euclidean regularizer (which is what one would wish to do in order to obtain scalable convergence guarantees). To see this, consider the iterates of the mirror-prox template generated by the regularizer $h(x) = x\log x$ on $\mathcal{X} = [0,\infty)$.⁵ [Footnote 5: Strictly speaking, this regularizer is not strongly convex over $[0,\infty)$, but this detail is not relevant for the question at hand.] In this case, the induced prox-mapping is $P_x(y) = x\exp(y)$, leading to the recursion

$x^{+} = P_x(-\gamma v) = x\exp(-\gamma v).$     (17)

Therefore, if the problem's objective function attains its minimum at $0$, the actual steps of the method scale as $x^{+} - x = \mathcal{O}(x)$ for small $x$, so it is imperative to maintain a large step-size to avoid stalling the algorithm.
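A quick numerical illustration of this scaling (assuming a fixed step-size $\gamma = 0.1$ and a constant unit gradient, both chosen purely for the example):

```python
import numpy as np

gamma, v = 0.1, 1.0                      # fixed step-size and a constant gradient
for x in [1e-1, 1e-2, 1e-3, 1e-4]:
    x_plus = x * np.exp(-gamma * v)      # the prox step (17): x+ = x * exp(-gamma * v)
    print(f"x = {x:.0e}   actual movement |x+ - x| = {abs(x_plus - x):.2e}")
# The movement shrinks proportionally to x, so a vanishing step-size would
# effectively stall the iterates near the boundary point x = 0.
```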

This scaling issue is at the heart of the dual extrapolation (DE) method of Nesterov [39]. Originally designed to solve variational inequalities and related problems, the method proceeds by (i) using a prox-step to generate the method's leading state and get a “look-ahead” gradient query; (ii) aggregating gradient information with a constant weight; and, finally, (iii) using a “primal-dual” mirror map to update the method's base state. Formally, the algorithm follows the iterative update rule

$X_{t+1/2} = P_{X_t}(-\gamma_t g_t)$     (DE)
$Y_{t+1} = Y_t - g_{t+1/2}$
$X_{t+1} = Q(\gamma_{t+1} Y_{t+1})$

where the so-called “mirror map” $Q\colon\mathcal{Y}\to\mathcal{X}$ is defined as

$Q(y) = \operatorname*{arg\,max}_{x\in\mathcal{X}}\{\langle y, x\rangle - h(x)\}.$     (18)

Unfortunately, the template (DE) is not sufficient for our purposes, for two main reasons: First, the method still couples a prox-step with a variable (decreasing) step-size update; this is not problematic for the application of the method to VIs (where the achievable rates are different), but it is not otherwise favorable for acceleration.

In addition to the above, the method's gradient pre-multiplier is the same as its post-multiplier ($\gamma_t$ in both cases), and it is not possible to differentiate these parameters while maintaining optimal rates [39]. However, this differentiation is essential for acceleration, especially when $\gamma_t$ cannot be tuned with prior knowledge of the problem's parameters.

Our approach to overcome this issue consists of: (a) eliminating the prox-step altogether in favor of a mirror step; and (b) separating the weights used for introducing new gradients to the algorithm from those used to generate the base and leading states. To formalize this, we introduce below the “universal” dual extrapolation template:

$Y_{t+1/2} = Y_t - \alpha_t g_t$,   $X_{t+1/2} = Q(\eta_t Y_{t+1/2})$     (UDE)
$Y_{t+1} = Y_t - \alpha_t g_{t+1/2}$,   $X_{t+1} = Q(\eta_{t+1} Y_{t+1})$

In the above, the gradient signals $g_t$ and $g_{t+1/2}$ are considered generic and the query points are not specified. To get a concrete algorithm, we will use the weighting scheme of Kavis et al. [24] and query the oracle at the averaged states $\bar{X}_t$ and $\bar{X}_{t+1/2}$ introduced previously in (13). Finally, regarding the algorithm's gradient weighting and averaging parameters ($\alpha_t$ and $\eta_t$ respectively), we will use an increasing weight for the method's step-size, $\alpha_t = t$, and the adaptive rule

$\eta_t = \dfrac{b}{\sqrt{\tau^2 + \sum_{s=1}^{t-1}\alpha_s^2\lVert g_{s+1/2}-g_s\rVert_{\ast}^2}}$     (19)

for the method's learning rate (the parameters $\tau$ and $b$ are discussed below, and we are using the standard convention that empty sums are taken equal to zero).

Algorithm 1: UnderGrad (universal dual extrapolation with reweighted gradients)

Parameters: $\tau \leftarrow \sqrt{K_h}$;  $b \leftarrow \sqrt{K_h(R_h + K_h\lVert\mathcal{X}\rVert^2)}$
Initialize: $Y_1 \leftarrow 0$;  $Z_1 \leftarrow 0$;  $S_1 \leftarrow \tau^2$
for $t = 1, 2, \dotsc, T$ do
    $\eta_t \leftarrow b/\sqrt{S_t}$     // set learning rate
    $X_t \leftarrow Q(\eta_t Y_t)$     // mirror step
    $\bar{X}_t \leftarrow (\alpha_t X_t + Z_t)\big/\sum_{s=1}^{t}\alpha_s$     // mixing
    $g_t \leftarrow \operatorname{\mathsf{G}}(\bar{X}_t;\omega_t)$     // oracle query
    $Y_{t+1/2} \leftarrow Y_t - \alpha_t g_t$     // dual step
    $X_{t+1/2} \leftarrow Q(\eta_t Y_{t+1/2})$     // mirror step
    $\bar{X}_{t+1/2} \leftarrow (\alpha_t X_{t+1/2} + Z_t)\big/\sum_{s=1}^{t}\alpha_s$     // mixing
    $g_{t+1/2} \leftarrow \operatorname{\mathsf{G}}(\bar{X}_{t+1/2};\omega_{t+1/2})$     // oracle query
    $Y_{t+1} \leftarrow Y_t - \alpha_t g_{t+1/2}$     // dual step
    $S_{t+1} \leftarrow S_t + \alpha_t^2\lVert g_{t+1/2} - g_t\rVert_{\ast}^2$     // precondition
    $Z_{t+1} \leftarrow Z_t + \alpha_t X_{t+1/2}$     // update mixing state
end for
return $\bar{x}_T \leftarrow \bar{X}_{T+1/2}$
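For concreteness, the following is a minimal Python sketch of Algorithm 1 in the simplex setup of Example 1, using the entropic mirror map $Q(y) = \operatorname{softmax}(y)$ and the $\ell^1$/$\ell^\infty$ norm pair; the quadratic test objective, the perfect oracle, and the constant $\lVert\mathcal{X}\rVert = 2$ (the $\ell^1$ diameter of the simplex) are illustrative assumptions rather than part of the paper's specification.

```python
import numpy as np

def softmax(y):
    """Entropic mirror map Q(y) = argmax_{x in simplex} {<y, x> - sum_i x_i log x_i}."""
    z = np.exp(y - y.max())
    return z / z.sum()

def undergrad_simplex(oracle, d, T):
    """Sketch of Algorithm 1 (UnderGrad) on the unit simplex with the entropic
    regularizer (K_h = 1, R_h = log d); parameter choices follow Theorem 1."""
    tau = 1.0                                   # tau = sqrt(K_h)
    diam = 2.0                                  # assumed l1-diameter of the simplex
    b = np.sqrt(np.log(d) + diam ** 2)          # b = sqrt(K_h * (R_h + K_h * ||X||^2))

    Y = np.zeros(d)                             # dual (base) state
    Z = np.zeros(d)                             # weighted sum of leading states
    S = tau ** 2                                # preconditioner S_1 = tau^2
    A = 0.0                                     # running sum of the weights alpha_s

    for t in range(1, T + 1):
        alpha = float(t)                        # alpha_t = t
        A += alpha
        eta = b / np.sqrt(S)                    # learning rate (19)

        X = softmax(eta * Y)                    # mirror step
        X_bar = (alpha * X + Z) / A             # mixing: query state  \bar X_t
        g = oracle(X_bar)                       # oracle query

        Y_half = Y - alpha * g                  # dual step
        X_half = softmax(eta * Y_half)          # mirror step (leading state)
        X_bar_half = (alpha * X_half + Z) / A   # mixing: query state  \bar X_{t+1/2}
        g_half = oracle(X_bar_half)             # oracle query

        Y = Y - alpha * g_half                  # dual step
        S += alpha ** 2 * np.max(np.abs(g_half - g)) ** 2   # dual (l_inf) norm squared
        Z += alpha * X_half                     # update mixing state

    return X_bar_half                           # output \bar X_{T+1/2}

# Illustrative test: f(x) = 0.5 * ||x - c||^2 with a perfect oracle (gradient x - c).
d = 100
c = softmax(np.arange(d) / d)                   # a target point on the simplex
x_out = undergrad_simplex(lambda x: x - c, d, T=500)
print(0.5 * np.sum((x_out - c) ** 2))           # optimality gap (should be small)
```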

The resulting method – which we call universal dual extrapolation with reweighted gradients (UnderGrad) – is encoded in pseudocode form in Algorithm 1 and represented schematically in Fig. 1. Its main guarantees are as follows:

Theorem 1.

Suppose that UnderGrad (Algorithm 1) is run for $T$ iterations with $\eta_t$ given by (19), $\alpha_t = t$ for all $t = 1, 2, \dotsc$, and $\tau = \sqrt{K_h}$, $b = C_h\sqrt{K_h}$ with $C_h = \sqrt{R_h + K_h\lVert\mathcal{X}\rVert^2}$. Then the algorithm's output state $\bar{x}_T \equiv \bar{X}_{T+1/2}$ simultaneously enjoys the following guarantees:

  (a) If $f$ satisfies (BG), then

      $\operatorname{\mathbb{E}}[f(\bar{x}_T)] \leq \min f + 2C_h\sqrt{\dfrac{K_h + 8(G^2+\sigma^2)}{K_h T}}$     (20a)

  (b) If $f$ satisfies (LG), then

      $\operatorname{\mathbb{E}}[f(\bar{x}_T)] \leq \min f + \dfrac{32\sqrt{2}\, C_h^2 L}{K_h T^2} + \dfrac{8\sqrt{2}\, C_h\sigma}{\sqrt{K_h T}}$     (20b)

Theorem 1 is our main result so, before discussing its proof (which we carry out in detail in the appendix), some remarks are in order.

Figure 1. Schematic representation of the UnderGrad algorithm (Algorithm 1). The light blue area represents the problem's domain ($\mathcal{X}$), whereas the grey area represents the dual space ($\mathcal{Y}$).

The first point of note concerns the dependence of the anytime bounds (20) on the problem's dimensionality. To that end, let $C_{\textrm{fast}} = C_h^2$ and $C_{\textrm{slow}} = C_h$, so UnderGrad's rate of convergence scales as $\mathcal{O}(C_{\textrm{fast}}\chi^2/T^2)$ in smooth, deterministic problems, and as $\mathcal{O}(C_{\textrm{slow}}\chi/\sqrt{T})$ in non-smooth and/or stochastic environments. Thus, to compare the algorithm's rate of convergence to that of mirror-prox and UnixGrad (and up to universal constants), we have to compare $C_h$ to $R_h$ and $B_h$ respectively.

To that end, we calculate below the values of $C_{\textrm{fast}}$ and $C_{\textrm{slow}}$ in the three archetypal examples of Section 3:

  (1) In the simplex setup of Example 1, we have $R_h = \log d$, $\lVert\mathcal{X}\rVert = 1$ and $K_h = 1$, so $C_{\textrm{slow}} = \mathcal{O}(\sqrt{\log d})$ and $C_{\textrm{fast}} = \mathcal{O}(\log d)$.

  (2) In the spectrahedron setup of Example 2, we have again $R_h = \log d$, $\lVert\mathcal{X}\rVert = 1$ and $K_h = 1$, so $C_{\textrm{slow}} = \mathcal{O}(\sqrt{\log d})$ and $C_{\textrm{fast}} = \mathcal{O}(\log d)$. [For a detailed discussion, see [34, 9, 23] and references therein.]

  (3) Finally, in the combinatorial setup of Example 3, we have $R_h = m(1+\log(d/m))$, $\lVert\mathcal{X}\rVert = m$ and $K_h = 1$ [27]. Thus, if $m = \mathcal{O}(1)$ in $d$, we get again $C_{\textrm{slow}} = \mathcal{O}(\sqrt{\log d})$ and $C_{\textrm{fast}} = \mathcal{O}(\log d)$.

The above shows that UnderGrad achieves the desired almost dimension-free rates of the non-adaptive mirror-prox algorithm, as well as the universal order-optimal guarantees of UnixGrad. The only discrepancy with the rates presented in Table 1 is the additive constant $K_h$ that appears in the numerator of (20a): this constant is an artifact of the analysis and it only becomes relevant when $G\to 0$ and $\sigma\to 0$. Since the scaling of the algorithm's convergence rate concerns the large-$G$ regime, this difference is not relevant for our purposes.

An additional difference between UnderGrad and UnixGrad is that the latter involves the prox-mapping (7), whereas the former involves the mirror map (18). To compare the two in terms of their per-iteration complexity, note that if we apply the prox-mapping (7) to the prox-center $x_c \leftarrow \operatorname*{arg\,min} h$ of $\mathcal{X}$, we get

$P_{x_c}(y) = \operatorname*{arg\,min}_{x\in\mathcal{X}}\{\langle y, x_c - x\rangle + D(x, x_c)\}$
$\qquad\;\; = \operatorname*{arg\,min}_{x\in\mathcal{X}}\{h(x) - \langle\nabla h(x_c) + y, x\rangle\}$
$\qquad\;\; = Q(\nabla h(x_c) + y)$     (21)

so, in particular, $Q(y) = P_{x_c}(y)$ whenever $x_c\in\operatorname{ri}\mathcal{X}$ (which is the case for most regularizers used in practice, including the Legendre regularizers used in Examples 1–3 above). This shows that the calculation of the mirror map $Q(y) = P_{x_c}(y - \nabla h(x_c))$ is at least as simple as the calculation of the prox-mapping $P_x(y)$ for a general base point $x\in\mathcal{X}$ – and, typically, calculating the mirror map is strictly lighter because there is no need to vary the base point over different iterations of the algorithm. In this regard, the per-iteration overhead of (UDE) is actually lighter compared to (MP) or (DE).

Finally, we should note that all our results above implicitly assume that the problem's domain is bounded (otherwise the range parameter $R_h$ of the problem becomes infinite). Thus, in addition to these convergence properties of UnderGrad, we also provide below an asymptotic guarantee for problems with an unbounded domain:

Theorem 2.

Suppose that UnderGrad is run with perfect oracle feedback with $\eta_t$ given by (19) and $\alpha_t = t$. If $f$ satisfies (LG), the algorithm's output state $\bar{x}_T = \bar{X}_{T+1/2}$ enjoys the rate $f(\bar{x}_T) - \min f = \mathcal{O}(1/T^2)$.

This result provides an important extension of Theorem 1 to problems with unbounded domains. It remains an open question for the future to derive the precise constants in the convergence rate presented in Theorem 2.

Main ideas of the proof.

The detailed proof of Theorem 1 is fairly long so we defer it to the appendix and only present here the main ideas.

The main ingredient of our proof is a specific template inequality used to derive an “appropriate” upper bound of the term $\operatorname{\tilde{\mathcal{R}}}_T(x) \coloneqq \sum_{t=1}^{T}\alpha_t\langle g_{t+1/2}, X_{t+1/2}-x\rangle$. Importantly, to prove the dimension-free properties of UnderGrad, such an upper bound cannot involve Bregman divergences: even though this is common practice in previous papers [24, 1], these terms would ultimately lead to the Bregman diameter $B_h$ that we seek to avoid. This is a principal part of the reason for switching gears to the dual extrapolation (DE) template for UnderGrad: in so doing, we are able to employ the notion of the Fenchel coupling [31, 32], which is a “primal-dual distance”, as opposed to the Bregman divergence, which is a “primal-primal distance” (cf. Section A.1). This poses another challenge in connecting the Fenchel coupling of targeted points before and after a mirror step, for which we need to employ a primal-dual version of the “three-point identity” (Lemma A.3). These elements lead to the following proposition:

Proposition 1.

For all $x\in\mathcal{X}$, we have

$\operatorname{\tilde{\mathcal{R}}}_T(x) \leq \dfrac{R_h}{\eta_{T+1}} + \sum_{t=1}^{T}\alpha_t\langle g_{t+1/2}-g_t, X_{t+1/2}-X_{t+1}\rangle - K_h\sum_{t=1}^{T}\dfrac{\lVert X_{t+1}-X_{t+1/2}\rVert^2 + \lVert X_{t+1/2}-X_t\rVert^2}{2\eta_t}$     (22)

With (22) in hand, (20a) comes from the application of the Fenchel–Young inequality to upper-bound the right-hand side of (22) as $\sum_{t=1}^{T}\alpha_t^2\eta_{t+1}\lVert g_{t+1/2}-g_t\rVert_{\ast}^2$ (plus a constant term involving $\lVert\mathcal{X}\rVert$). The challenge here is to notice and successfully prove that this summation is actually upper-bounded by $\eta_{T+1}^{-1}$ (due to our special choice of the learning-rate update). Finally, by its definition, $\eta_{T+1}^{-1}$ can be bounded by $G$, $\sigma$ and $K_h$ as described in the statement of Theorem 1.

The proof of (20b) is far more complex. The main challenge is to manipulate the terms in (22) to derive an upper bound of the form $\sum_{t=1}^{T}\alpha_t^2\,\textsl{g}(\eta_{t+1})\lVert\nabla f(\bar{X}_{t+1/2})-\nabla f(\bar{X}_t)\rVert_{\ast}^2$ (plus a term involving the noise level $\sigma$), where $\textsl{g}(\eta_{t+1})$ is a function of the learning rate chosen such that only the first $T_0 \ll T$ elements of this summation are positive. Once this has been achieved, the quantity $\lVert\nabla f(\bar{X}_{t+1/2})-\nabla f(\bar{X}_t)\rVert_{\ast}$ is connected to $\lVert\mathcal{X}\rVert$ via (LG) and our claim is obtained.

5. Numerical Experiments

For the experimental validation of our results, we focus on the simplex setup of Example 1 with linear losses and $d = 100$. Our first experiment concerns the perfect SFO case and tracks the convergence properties of UnderGrad run with the entropic regularizer adapted to the simplex. As a baseline, we ran UnixGrad, also with the entropic regularizer. A first challenge here is that the Bregman diameter $B_h$ of the simplex is infinite, so UnixGrad is not well-defined. On that account, we chose the step-size update rule of UnixGrad so that its initial step-size $\gamma_1$ equals the initial learning rate $\eta_1$ of UnderGrad. We also ran UnixGrad with update rules such that $\gamma_1$ is smaller or larger than $\eta_1$. Finally, for comparison purposes, we also present a non-adaptive accelerated entropic gradient (AEG) algorithm, and we report the results in Fig. 2.

Figure 2. Convergence of UnderGrad and UnixGrad in the simplex setup with a perfect SFO. The $y$-axis corresponds to the difference $\Delta(T)$ between the $f$-value of the relevant point of each algorithm and $\min f$; the $x$-axis is the iteration counter $T$. The code is available at https://github.com/dongquan-vu/UnDerGrad_Universal_CnvOpt_ICML2022.

Fig. 2 confirms that UnderGrad successfully converges at an accelerated rate of $\mathcal{O}(1/T^2)$. Perhaps surprisingly, it also shows that when UnixGrad's initial step-size is small ($10^{-3}$ or smaller), UnixGrad also achieves an $\mathcal{O}(1/T^2)$ rate, but at a much more conservative pace, trailing UnderGrad by one or two orders of magnitude. On the other hand, when UnixGrad's initial step-size is of the same magnitude as UnderGrad's learning rate (or larger), UnixGrad eventually destabilizes and its rate drops from $\mathcal{O}(1/T^2)$ to approximately $\mathcal{O}(1/T)$. We also conducted experiments in the setup with a noisy SFO; these are reported in Appendix C.

Appendix A Bregman regularizers and preliminary results

A.1. Bregman regularizers and their properties

We begin by clarifying and recalling some of the notational conventions used throughout the paper. We also give the formal definition of the Fenchel coupling (a key notion for the proof of our main results) and we present some preliminary results to prepare the ground for the sequel.

The convex conjugate $h^{\ast}\colon\mathcal{Y}\to\mathbb{R}$ of $h$ is defined as

$h^{\ast}(y) = \sup_{x\in\mathcal{X}}\{\langle y, x\rangle - h(x)\}.$     (A.1)

Since $h$ is strongly convex and lower semicontinuous with $\operatorname{dom} h = \mathcal{X}$, the supremum in (A.1) is always attained, and $h^{\ast}(y)$ is finite for all $y\in\mathcal{Y}$ [8]. Moreover, by standard results in convex analysis [43, Chap. 26], $h^{\ast}$ is differentiable on $\mathcal{Y}$ and its gradient satisfies the identity

$\nabla h^{\ast}(y) = \operatorname*{arg\,max}_{x\in\mathcal{X}}\{\langle y, x\rangle - h(x)\}.$     (A.2)

Thus, recalling the definition of the mirror map $Q\colon\mathcal{Y}\to\mathcal{X}$, we readily get

$Q(y) = \nabla h^{\ast}(y).$     (A.3)
Lemma A.1.

Let $h$ be a Bregman regularizer on $\mathcal{X}$. Then, for all $x, x^{+}\in\operatorname{dom}\partial h$ and all $y, v\in\mathcal{Y}$, we have:

a)   $x = Q(y) \iff y\in\partial h(x)$.     (A.4a)
b)   $x^{+} = Q(\nabla h(x) + v) \iff \nabla h(x) + v\in\partial h(x^{+})$     (A.4b)

Finally, if $x = Q(y)$ and $x^{\ast}\in\mathcal{X}$, we have

$\langle\nabla h(x), x - x^{\ast}\rangle \leq \langle y, x - x^{\ast}\rangle.$     (A.5)
Proof of Lemma A.1.

To prove (A.4a), note that $x$ solves (A.2) if and only if $y - \partial h(x)\ni 0$, i.e., if and only if $y\in\partial h(x)$. Eq. (A.4b) is then obtained in the same manner.

For the inequality (A.5), it suffices to show that it holds for all $x^{\ast}\in\mathcal{X}_h\equiv\operatorname{dom}\partial h$ (by continuity). To do so, let

$\phi(t) = h(x + t(x^{\ast}-x)) - [h(x) + t\langle y, x^{\ast}-x\rangle].$     (A.6)

Since $h$ is strongly convex and $y\in\partial h(x)$ by (A.4a), it follows that $\phi(t)\geq 0$ with equality if and only if $t = 0$. Moreover, note that $\psi(t) = \langle\nabla h(x + t(x^{\ast}-x)) - y, x^{\ast}-x\rangle$ is a continuous selection of subgradients of $\phi$. Given that $\phi$ and $\psi$ are both continuous on $[0,1]$, it follows that $\phi$ is continuously differentiable and $\phi' = \psi$ on $[0,1]$. Thus, with $\phi$ convex and $\phi(t)\geq 0 = \phi(0)$ for all $t\in[0,1]$, we conclude that $\phi'(0) = \langle\nabla h(x) - y, x^{\ast}-x\rangle \geq 0$. ∎

As we mentioned earlier, much of our analysis revolves around a “primal-dual” divergence between a target point $x^{\ast}\in\mathcal{X}$ and a dual vector $y\in\mathcal{Y}$, called the Fenchel coupling. Following [33], this is defined as follows for all $x^{\ast}\in\mathcal{X}$, $y\in\mathcal{Y}$:

$F(x^{\ast}, y) = h(x^{\ast}) + h^{\ast}(y) - \langle y, x^{\ast}\rangle.$     (A.7)

The following lemma illustrates some basic properties of the Fenchel coupling:

Lemma A.2.

Let hh be a Bregman regularizer on 𝒳\mathcal{X} with convexity modulus KhK_{h}. Then, for all x𝒳x^{\ast}\in\mathcal{X} and all y𝒴y\in\mathcal{Y}, we have:

(1) $F(x^{\ast},y)=D(x^{\ast},Q(y))$ if $Q(y)\in\mathcal{X}_{h}$ (but not necessarily otherwise).

(2) $F(x^{\ast},y)\geq\frac{K_{h}}{2}\lVert Q(y)-x^{\ast}\rVert^{2}$.

Proof.

For our first claim, let x=Q(y)x=Q(y). Then, by definition we have:

F(x^{\ast},y)=h(x^{\ast})+\langle y,Q(y)\rangle-h(Q(y))-\langle y,x^{\ast}\rangle=h(x^{\ast})-h(x)-\langle y,x^{\ast}-x\rangle. (A.8)

Since $y\in\partial h(x)$, we have $h^{\prime}(x;x^{\ast}-x)=\langle y,x^{\ast}-x\rangle$ whenever $x\in\mathcal{X}_{h}$, so the right-hand side of (A.8) coincides with the Bregman divergence $D(x^{\ast},x)$, proving our first claim. For our second claim, arguing in the same spirit we get:

F(x,y)=h(x)h(x)y,xxF(x^{\ast},y)=h(x^{\ast})-h(x)-\langle y\mathopen{},\mathopen{}x^{\ast}-x\rangle (A.9)

Thus, we obtain the result by recalling that $h$ is $K_{h}$-strongly convex with respect to the norm $\lVert\cdot\rVert$. ∎

We continue with some basic relations that connect the Fenchel coupling of a target point before and after a gradient step. The basic ingredient for this is a primal-dual analogue of the so-called "three-point identity" for Bregman functions [15]:

Lemma A.3.

Let hh be a Bregman regularizer on 𝒳\mathcal{X}. Fix some x𝒳x^{\ast}\in\mathcal{X} and let y,y+𝒴y,y^{+}\in\mathcal{Y}. Then, letting x=Q(y)x=Q(y), we have

F(x,y+)=F(x,y)+F(x,y+)+y+y,xx.F(x^{\ast},y^{+})=F(x^{\ast},y)+F(x,y^{+})+\langle y^{+}-y\mathopen{},\mathopen{}x-x^{\ast}\rangle. (A.10)
Proof.

By definition, we get:

F(x,y+)\displaystyle F(x^{\ast},y^{+}) =h(x)+h(y+)y+,x\displaystyle=h(x^{\ast})+h^{\ast}(y^{+})-\langle y^{+}\mathopen{},\mathopen{}x^{\ast}\rangle (A.11)
F(x,y)\displaystyle F(x^{\ast},y) =h(x)+h(y)y,x.\displaystyle=h(x^{\ast})+h^{\ast}(y)-\langle y\mathopen{},\mathopen{}x^{\ast}\rangle.

Then, by subtracting the above we get:

F(x,y+)F(x,y)\displaystyle F(x^{\ast},y^{+})-F(x^{\ast},y) =h(x)+h(y+)y+,xh(x)h(y)+y,x\displaystyle=h(x^{\ast})+h^{\ast}(y^{+})-\langle y^{+}\mathopen{},\mathopen{}x^{\ast}\rangle-h(x^{\ast})-h^{\ast}(y)+\langle y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)h(y)y+y,x\displaystyle=h^{\ast}(y^{+})-h^{\ast}(y)-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)y,Q(y)+h(Q(y))y+y,x\displaystyle=h^{\ast}(y^{+})-\langle y\mathopen{},\mathopen{}Q(y)\rangle+h(Q(y))-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)y,x+h(x)y+y,x\displaystyle=h^{\ast}(y^{+})-\langle y\mathopen{},\mathopen{}x\rangle+h(x)-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=h(y+)+y+y,xy+,x+h(x)y+y,x\displaystyle=h^{\ast}(y^{+})+\langle y^{+}-y\mathopen{},\mathopen{}x\rangle-\langle y^{+}\mathopen{},\mathopen{}x\rangle+h(x)-\langle y^{+}-y\mathopen{},\mathopen{}x^{\ast}\rangle
=F(x,y+)+y+y,xx\displaystyle=F(x,y^{+})+\langle y^{+}-y\mathopen{},\mathopen{}x-x^{\ast}\rangle (A.12)

and our proof is complete. ∎
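As a quick numerical illustration of the two lemmas above (not needed for the formal analysis), the sketch below evaluates the Fenchel coupling (A.7) for the entropic regularizer on the simplex, for which $h^{\ast}$ is the log-sum-exp function and $Q$ is the softmax map, and checks the lower bound of Lemma A.2 (with $K_{h}=1$ relative to the $\ell^{1}$ norm, by Pinsker's inequality) together with the three-point identity (A.10) on random data.

```python
import numpy as np

rng = np.random.default_rng(1)

def Q(y):                               # mirror map: softmax (cf. (A.2)-(A.3))
    w = np.exp(y - np.max(y))
    return w / np.sum(w)

def h(x):                               # entropic regularizer on the simplex
    return np.sum(x * np.log(np.clip(x, 1e-300, None)))

def h_conj(y):                          # convex conjugate: log-sum-exp
    m = np.max(y)
    return m + np.log(np.sum(np.exp(y - m)))

def F(x_target, y):                     # Fenchel coupling (A.7)
    return h(x_target) + h_conj(y) - y @ x_target

x_star = rng.dirichlet(np.ones(4))
y, y_plus = rng.normal(size=4), rng.normal(size=4)
x = Q(y)

# Lemma A.2(2): F(x*, y) >= (K_h / 2) ||Q(y) - x*||^2 with K_h = 1 and the L1 norm
assert F(x_star, y) >= 0.5 * np.sum(np.abs(x - x_star)) ** 2 - 1e-12

# Lemma A.3: F(x*, y+) = F(x*, y) + F(x, y+) + <y+ - y, x - x*>
lhs = F(x_star, y_plus)
rhs = F(x_star, y) + F(x, y_plus) + (y_plus - y) @ (x - x_star)
assert np.isclose(lhs, rhs)
```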

A.2. Numerical sequence inequalities

In this section, we provide some necessary inequalities on numerical sequences that we require for the convergence rate analysis of the previous sections. Most of the lemmas presented below already exist in the literature, and go as far back as Auer et al. [6] and McMahan & Streeter [30]; when appropriate, we note next to each lemma the references with the statement closest to the precise version we are using in our analysis.

Lemma A.4 ([30, 28]).

For all non-negative numbers $a_{0},a_{1},\dotsc,a_{T}$, the following inequality holds:

\sqrt{a_{0}^{2}+\sum_{t=1}^{T}a_{t}}\;\leq\;a_{0}+\sum_{t=1}^{T}\dfrac{a_{t}}{\sqrt{a_{0}^{2}+\sum_{s=1}^{t}a_{s}}}\;\leq\;2\sqrt{a_{0}^{2}+\sum_{t=1}^{T}a_{t}} (A.13)
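The inequality can also be checked directly on random data; the short sketch below does so for the statement above, with $a_{0}$ playing the role of the offset term.

```python
import numpy as np

rng = np.random.default_rng(2)
a0 = rng.uniform(0.1, 2.0)                # offset term a_0
a = rng.uniform(0.0, 5.0, size=1000)      # non-negative a_1, ..., a_T

total = np.sqrt(a0 ** 2 + np.sum(a))
middle = a0 + np.sum(a / np.sqrt(a0 ** 2 + np.cumsum(a)))

assert total <= middle + 1e-9             # left inequality of (A.13)
assert middle <= 2 * total + 1e-9         # right inequality of (A.13)
```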

Appendix B Analysis and proofs of the main results

The proof of the template inequality.

We first prove the template inequality of UnderGrad (Proposition 1); this is the primary element of our proof of Theorem 1.

Proof.

First, we set Y~t=ηtYt\tilde{Y}_{t}=\eta_{t}{Y}_{t}. For all x𝒳x\in\mathcal{X} we have:

αtgt+1/2,Xt+1x\displaystyle\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1}-x\rangle
=\displaystyle= 1ηtY~t1ηt+1Y~t+1,Xt+1x\displaystyle\langle\frac{1}{\eta_{t}}\tilde{Y}_{t}-\frac{1}{\eta_{t+1}}\tilde{Y}_{t+1}\mathopen{},\mathopen{}X_{t+1}-x\rangle
=\displaystyle= 1ηtY~tY~t+1,Xt+1x+[1ηt+11ηt]0Y~t+1,Xt+1x\displaystyle\frac{1}{\eta_{t}}\langle\tilde{Y}_{t}-\tilde{Y}_{t+1}\mathopen{},\mathopen{}X_{t+1}-x\rangle+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\langle 0-\tilde{Y}_{t+1}\mathopen{},\mathopen{}X_{t+1}-x\rangle
=\displaystyle= 1ηt[F(x,Y~t)F(x,Y~t+1)F(Xt+1,Y~t)]\displaystyle\frac{1}{\eta_{t}}\left[F(x,\tilde{Y}_{t})-F(x,\tilde{Y}_{t+1})-F(X_{t+1},\tilde{Y}_{t})\right]
+[1ηt+11ηt](F(x,0)F(x,Y~t+1)F(Xt+1,0))\displaystyle\qquad+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\bigg{(}F(x,0)-F(x,\tilde{Y}_{t+1})-F(X_{t+1},0)\bigg{)} #  from Lemma A.3
\displaystyle\leq 1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)+[1ηt+11ηt]Rh1ηtF(Xt+1,Y~t).\displaystyle\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]R_{h}-\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t}). (B.1)

Here, the last inequality follows from the facts that $F(x,0)=h(x)-h(Q(0))=h(x)-\min_{x\in\mathcal{X}}h\leq R_{h}$, that $F(\cdot,\cdot)\geq 0$, and that $\eta_{t}$ is non-increasing.

As a consequence of (B.1), we have:

αtgt+1/2,Xt+1/2x\displaystyle\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-x\rangle
=αtgt+1/2,Xt+1/2Xt+1+αtgt+1/2,Xt+1x\displaystyle=\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle+\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1}-x\rangle
αtgt+1/2,Xt+1/2Xt+1+1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)\displaystyle\leq\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle+\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})
+[1ηt+11ηt]Rh1ηtF(Xt+1,Y~t)\displaystyle\qquad+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]R_{h}-\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t}) (B.2)

On the other hand, if we let Y~t+1/2ηtYt+1/2\tilde{Y}_{t+1/2}\coloneqq\eta_{t}Y_{t+1/2}, we have

αtgt,Xt+1/2Xt+1\displaystyle\alpha_{t}\langle g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle =1ηtY~tY~t+1/2,Xt+1/2Xt+1\displaystyle=\frac{1}{\eta_{t}}\langle\tilde{Y}_{t}-\tilde{Y}_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle
=1ηt[F(Xt+1,Y~t)F(Xt+1,Y~t+1/2)F(Xt+1/2,Y~t)]\displaystyle=\frac{1}{\eta_{t}}\left[F(X_{t+1},\tilde{Y}_{t})-F(X_{t+1},\tilde{Y}_{t+1/2})-F(X_{t+1/2},\tilde{Y}_{t})\right]
so that

\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t})=\alpha_{t}\langle g_{t},X_{t+1/2}-X_{t+1}\rangle+\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t+1/2})+\frac{1}{\eta_{t}}F(X_{t+1/2},\tilde{Y}_{t}). (B.3)

Thus, replacing (B.3) into (B), we get:

αtgt+1/2,Xt+1/2x\displaystyle\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-x\rangle αtgt+1/2gt,Xt+1/2Xt+1\displaystyle\leq\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle
+1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)1ηtF(Xt+1,Y~t+1/2)\displaystyle+\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})-\frac{1}{\eta_{t}}F(X_{t+1},\tilde{Y}_{t+1/2})
1ηtF(Xt+1/2,Y~t)+[1ηt+11ηt]Rh.\displaystyle-\frac{1}{\eta_{t}}F(X_{t+1/2},\tilde{Y}_{t})+\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]R_{h}. (B.4)

Now, recalling the definitions $X_{t+1/2}=Q(\tilde{Y}_{t+1/2})$ and $X_{t+1}=Q(Y_{t+1})$ from Algorithm 1, applying Lemma A.2 to $F(X_{t+1},\tilde{Y}_{t+1/2})$ and $F(X_{t+1/2},\tilde{Y}_{t})$, and combining with (B.4), we get:

αtgt+1/2,Xt+1/2xαtgt+1/2gt,Xt+1/2Xt+1+1ηtF(x,Y~t)1ηt+1F(x,Y~t+1)Kh2ηtXt+1Xt+1/22Kh2ηtXt+1/2Xt2+(1ηt+11ηt)Rh\alpha_{t}\langle g_{t+1/2}\mathopen{},\mathopen{}X_{t+1/2}-x\rangle\leq\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle+\frac{1}{\eta_{t}}F(x,\tilde{Y}_{t})-\frac{1}{\eta_{t+1}}F(x,\tilde{Y}_{t+1})\\ -\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2}+\bigg{(}\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\bigg{)}R_{h} (B.5)

Hence, after telescoping and recalling the notation ~T(x)t=1Tαtgt+1/2,Xt+1/2x\operatorname{\tilde{\mathcal{R}}}_{T}(x)\coloneqq\sum_{t=1}^{T}{\alpha_{t}}\left\langle g_{t+1/2},X_{t+1/2}-x\right\rangle, we get:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x) 1η1F(x,Y~1)+(1ηT+11η1)Rh\displaystyle\leq\frac{1}{\eta_{1}}F(x,\tilde{Y}_{1})+\bigg{(}\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\bigg{)}R_{h}
+t=1Tαtgt+1/2gt,Xt+1/2Xt+1\displaystyle+\sum_{t=1}^{T}\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle
t=1TKh2ηtXt+1Xt+1/22t=1TKh2ηtXt+1/2Xt2\displaystyle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.6)

Finally, by our initial choice $Y_{1}=0$, we have $F(x,\tilde{Y}_{1})=h(x)-\min_{x\in\mathcal{X}}h(x)\leq R_{h}$, and (22) follows from (B.6). This concludes the proof of Proposition 1. ∎

Regret-to-rate conversion lemma.

The next element of our proof is the following lemma, which connects the term $\operatorname{\tilde{\mathcal{R}}}_{T}(x)$ (intuitively, a regret-like quantity) to the term $\operatorname{\mathbb{E}}\left[f(\bar{X}_{T+1/2})-\min f\right]$, whose bounds characterize the convergence rate of UnderGrad.

Lemma B.1.

For any x𝒳x^{\ast}\in\mathcal{X}^{\ast}, for any TT, we have:

𝔼[f(X¯T+1/2)minf]\displaystyle\operatorname{\mathbb{E}}\left[f(\bar{X}_{T+1/2})-\min f\right] 𝔼[2T2t=1Tαtf(X¯t+1/2),Xt+1/2x]\displaystyle\leq\operatorname{\mathbb{E}}\left[\frac{2}{{T^{2}}}{\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f\left(\bar{X}_{t+1/2}\right),X_{t+1/2}-x^{\ast}\rangle}\right]
=2T2𝔼[~T(x)].\displaystyle=\frac{2}{T^{2}}\operatorname{\mathbb{E}}\left[\operatorname{\tilde{\mathcal{R}}}_{T}(x^{\ast})\right]. (B.7)
Remark 1.

A variation of Lemma B.1 appears in [16, 24]; for the sake of completeness, we provide its proof below.

Proof.

Let us denote Ht:=s=1tαsH_{t}:=\sum_{s=1}^{t}\alpha_{s}. From the update rule of Algorithm 1, we can rewrite Xt+1/2=HtαtX¯t+1/2Ht1αtX¯t1/2X_{t+1/2}=\frac{H_{t}}{\alpha_{t}}\bar{X}_{t+1/2}-\frac{H_{t-1}}{\alpha_{t}}{\bar{X}}_{t-1/2}. Therefore,

Xt+1/2x\displaystyle X_{t+1/2}-x^{\ast} =Htαt(X¯t+1/2x)Ht1αt(X¯t1/2x)\displaystyle=\frac{H_{t}}{\alpha_{t}}(\bar{X}_{t+1/2}-x^{\ast})-\frac{H_{t-1}}{\alpha_{t}}({\bar{X}}_{t-1/2}-x^{\ast})
=1αt[αt(X¯t+1/2x)+Ht1(X¯t+1/2X¯t1/2)].\displaystyle=\frac{1}{\alpha_{t}}\left[\alpha_{t}(\bar{X}_{t+1/2}-x^{\ast})+H_{t-1}(\bar{X}_{t+1/2}-{\bar{X}}_{t-1/2})\right]. (B.8)

As a consequence, we have:

t=1Tαtf(X¯t+1/2),Xt+1/2x\displaystyle\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle =t=1Tαtf(X¯t+1/2),X¯t+1/2x\displaystyle=\sum_{t=1}^{T}\alpha_{t}\left\langle\nabla f(\bar{X}_{t+1/2}),\bar{X}_{t+1/2}-x^{\ast}\right\rangle
\qquad+\sum_{t=1}^{T}H_{t-1}\left\langle\nabla f(\bar{X}_{t+1/2}),\bar{X}_{t+1/2}-{\bar{X}}_{t-1/2}\right\rangle
\displaystyle\geq t=1Tαt[f(X¯t+1/2)f(x)]\displaystyle\sum_{t=1}^{T}\alpha_{t}\left[f(\bar{X}_{t+1/2})-f(x^{\ast})\right]
+t=1THt1[f(X¯t+1/2)f(X¯t1/2)]\displaystyle\qquad+\sum_{t=1}^{T}H_{t-1}\left[f(\bar{X}_{t+1/2})-f({\bar{X}}_{t-1/2})\right]
=t=1Tαt[f(X¯t+1/2)f(x)]\displaystyle=\sum_{t=1}^{T}\alpha_{t}\left[f(\bar{X}_{t+1/2})-f(x^{\ast})\right]
+t=1T1αt[f(X¯T+1/2)f(X¯t+1/2)]\displaystyle\qquad+\sum_{t=1}^{T-1}\alpha_{t}\left[f({\bar{X}}_{T+1/2})-f(\bar{X}_{t+1/2})\right]
=\displaystyle= [f(X¯T+1/2)f(x)]t=1Tαt.\displaystyle\left[f({\bar{X}}_{T+1/2})-f(x^{\ast})\right]\sum_{t=1}^{T}\alpha_{t}. (B.9)

Dividing both sides of (B.9) by $H_{T}>0$ and choosing $\alpha_{t}$ such that $H_{T}\geq\frac{T^{2}}{2}$ (e.g., $\alpha_{t}=t$, for which $H_{T}=T(T+1)/2$), we obtain:

2T2t=1Tαtf(X¯t+1/2),Xt+1/2xf(X¯T+1/2)f(x)=f(X¯T+1/2)minf.\displaystyle\frac{2}{T^{2}}\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\geq f({\bar{X}}_{T+1/2})-f(x^{\ast})=f({\bar{X}}_{T+1/2})-\min f. (B.10)

Finally, we recall that by definition, gt+1/2=𝖦(X¯t+1/2;ωt+1/2)=f(X¯t+1/2)+Ut+1/2g_{t+1/2}=\operatorname{\mathsf{G}}(\bar{X}_{t+1/2};\omega_{t+1/2})=\nabla f(\bar{X}_{t+1/2})+U_{t+1/2} where 𝔼[Ut+1/2\nonscript|\nonscriptt+1/2]=0\operatorname{\mathbb{E}}\left[U_{t+1/2}\nonscript\,\middle|\nonscript\,\mathopen{}\mathcal{F}_{t+1/2}\right]=0. Therefore, by the law of total expectation, we have:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x^{\ast}) =𝔼[t=1Tαtf(X¯t+1/2),Xt+1/2x]+𝔼[t=1TαtUt+1/2,Xt+1/2x]\displaystyle=\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\right]+\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle U_{t+1/2},X_{t+1/2}-x^{\ast}\rangle\right]
=𝔼[t=1Tαtf(X¯t+1/2),Xt+1/2x]\displaystyle=\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\right]
+𝔼[t=1Tαt𝔼[Ut+1/2,Xt+1/2x\nonscript|\nonscriptt+1/2]]\displaystyle+\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\operatorname{\mathbb{E}}\left[\langle U_{t+1/2},X_{t+1/2}-x^{\ast}\rangle\nonscript\,|\nonscript\,\mathopen{}\mathcal{F}_{t+1/2}\right]\right]
=\displaystyle= 𝔼[t=1Tαtf(X¯t+1/2),Xt+1/2x].\displaystyle\operatorname{\mathbb{E}}\left[\sum_{t=1}^{T}{\alpha_{t}}\langle\nabla f(\bar{X}_{t+1/2}),X_{t+1/2}-x^{\ast}\rangle\right]. (B.11)

Then, taking expectations on both sides of (B.10) and invoking (B.11) concludes the proof. ∎
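In practice, the weighted average $\bar{X}_{t+1/2}$ appearing in this lemma can be maintained online: rearranging the relation $X_{t+1/2}=\frac{H_{t}}{\alpha_{t}}\bar{X}_{t+1/2}-\frac{H_{t-1}}{\alpha_{t}}\bar{X}_{t-1/2}$ used above gives $\bar{X}_{t+1/2}=(H_{t-1}\bar{X}_{t-1/2}+\alpha_{t}X_{t+1/2})/H_{t}$. The sketch below is pure bookkeeping (the points $X_{t+1/2}$ are random stand-ins for whatever the optimizer produces) and also checks that $H_{T}\geq T^{2}/2$ for the weights $\alpha_{t}=t$ used in Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 50, 3
X_half = rng.normal(size=(T, d))       # stand-ins for the query points X_{t+1/2}

H, X_bar = 0.0, np.zeros(d)
for t in range(1, T + 1):
    alpha_t = float(t)                 # weights alpha_t = t, as in Theorem 1
    H_prev, H = H, H + alpha_t
    X_bar = (H_prev * X_bar + alpha_t * X_half[t - 1]) / H   # online average

weights = np.arange(1, T + 1)
assert np.allclose(X_bar, weights @ X_half / weights.sum())
assert H >= T ** 2 / 2                 # hence 1/H_T <= 2/T^2, as used in Lemma B.1
```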

Proof of (20a): convergence of UnderGrad under (LC)/(BG).

Our starting point is the inequality (22) that we established in Proposition 1, which leads to the following estimate:

~T(x)RhηT+1+t=1Tαtgt+1/2gt,Xt+1/2Xt+1t=1TKh2ηtXt+1Xt+1/22\operatorname{\tilde{\mathcal{R}}}_{T}(x)\leq\frac{R_{h}}{\eta_{T+1}}+\sum_{t=1}^{T}\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2} (B.12)

We now focus on the second term on the right-hand side of (B.12). From the Cauchy-Schwarz inequality and the fact that

\left\lVert Y-Y^{\prime}\right\rVert_{\ast}\left\lVert X-X^{\prime}\right\rVert=\min_{\lambda>0}\left\{\frac{1}{2\lambda}\left\lVert Y-Y^{\prime}\right\rVert_{\ast}^{2}+\frac{\lambda}{2}\left\lVert X-X^{\prime}\right\rVert^{2}\right\} (B.13)

for any $X,X^{\prime},Y,Y^{\prime}\in\mathbb{R}^{d}$ (this can be checked directly: $\lambda^{\ast}=\lVert Y-Y^{\prime}\rVert_{\ast}/\lVert X-X^{\prime}\rVert$ minimizes $\psi(\lambda)=\frac{1}{2\lambda}\lVert Y-Y^{\prime}\rVert_{\ast}^{2}+\frac{\lambda}{2}\lVert X-X^{\prime}\rVert^{2}$, and the minimum value is $\lVert Y-Y^{\prime}\rVert_{\ast}\lVert X-X^{\prime}\rVert$), we have:

t=1Tαtgt+1/2gt,Xt+1/2Xt+1\displaystyle\sum_{t=1}^{T}\alpha_{t}\langle g_{t+1/2}-g_{t}\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle t=1Tαtgt+1/2gtXt+1/2Xt+1\displaystyle\leq\sum_{t=1}^{T}\alpha_{t}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}\lVert X_{t+1/2}-X_{t+1}\rVert
12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}
+Kh2t=1T1ηt+1Xt+1Xt+1/22.\displaystyle+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}. (B.14)

Moreover, from the definition of ηt+1\eta_{t+1} and by applying Lemma A.4, we have:

12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2} =b2Kht=1Tαt2gt+1/2gt2+2s=1tgs+1/2gs2\displaystyle=\frac{b}{2K_{h}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\lVert g_{s+1/2}-g_{s}\rVert_{\ast}^{2}}}
bKh+2t=1Tαt2gt+1/2gt2b22Kh\displaystyle\leq\frac{b}{K_{h}}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}
=b2KhηT+1b22Kh.\displaystyle=\frac{b^{2}}{K_{h}\cdot\eta_{T+1}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}. (B.15)

Combining (B.14) and (B.15) with (B.12), and using the compactness of the feasible region $\mathcal{X}$, we get:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x)\leq RhηT+1+12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\frac{R_{h}}{\eta_{T+1}}+\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}
+Kh2t=1T[1ηt+11ηt]Xt+1Xt+1/22\displaystyle\hphantom{\frac{R_{h}}{\eta_{T+1}}}+\frac{K_{h}}{2}\sum_{t=1}^{T}\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\lVert X_{t+1}-X_{t+1/2}\rVert^{2}
\displaystyle\leq RhηT+1+b2KhηT+1b22Kh+Kh𝒳22t=1T[1ηt+11ηt]\displaystyle\frac{R_{h}}{\eta_{T+1}}+\frac{b^{2}}{K_{h}\cdot\eta_{T+1}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\sum_{t=1}^{T}\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]
=\displaystyle= 1ηT+1(Rh+b2Kh+Kh𝒳22)b2(12Kh+Kh𝒳22).\displaystyle\frac{1}{\eta_{T+1}}\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)-b\sqrt{{}^{2}}\left(\frac{1}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right). (B.16)

Hence, by invoking Lemma B.1, we have:

𝔼[f(X¯T+1/2)minx𝒳f(x)]\displaystyle\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right] 2𝔼[~T(x)]T2\displaystyle\leq\frac{2\operatorname{\mathbb{E}}\left[\operatorname{\tilde{\mathcal{R}}}_{T}(x^{\ast})\right]}{T^{2}}
\leq\frac{2}{T^{2}}\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)\operatorname{\mathbb{E}}\left[\frac{1}{\eta_{T+1}}\right]
2b2T2(12Kh+Kh𝒳22).\displaystyle-\frac{2b\sqrt{{}^{2}}}{T^{2}}\left(\frac{1}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right). (B.17)

On the other hand, by the definition of ηT+1\eta_{T+1}, we get:

𝔼[1ηT+1]\displaystyle\operatorname{\mathbb{E}}\left[\frac{1}{\eta_{T+1}}\right] 1b𝔼[+2t=1Tαt2gt+1/2gt2]1b+2t=1Tαt2𝔼[gt+1/2gt2].\displaystyle\leq\frac{1}{b}\operatorname{\mathbb{E}}\left[\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}\right]\leq\frac{1}{b}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\operatorname{\mathbb{E}}\left[\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\right]}. (B.18)

Moreover, we have that:

𝔼[gt+1/2gt2]\displaystyle\operatorname{\mathbb{E}}\left[\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\right] =𝔼[f(X¯t+1/2)f(X¯t)(Ut+1/2Ut)2]\displaystyle=\operatorname{\mathbb{E}}\left[\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})-(U_{t+1/2}-U_{t})\rVert_{\ast}^{2}\right]
𝔼[2f(X¯t+1/2)f(X¯t)2+2Ut+1/2Ut2]\displaystyle\leq\operatorname{\mathbb{E}}\left[2\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}+2\lVert U_{t+1/2}-U_{t}\rVert_{\ast}^{2}\right]
𝔼[4(f(X¯t+1/2)2+f(X¯t)2)+4(Ut+1/22+Ut2)]\displaystyle\leq\operatorname{\mathbb{E}}\left[4(\lVert\nabla f(\overline{X}_{t+1/2})\rVert_{\ast}^{2}+\lVert\nabla f(\overline{X}_{t})\rVert_{\ast}^{2})+4(\lVert U_{t+1/2}\rVert_{\ast}^{2}+\lVert U_{t}\rVert_{\ast}^{2})\right]
𝔼[8G2+4(Ut+1/22+Ut2)]\displaystyle\leq\operatorname{\mathbb{E}}\left[8G^{2}+4(\lVert U_{t+1/2}\rVert_{\ast}^{2}+\lVert U_{t}\rVert_{\ast}^{2})\right] #  from (BG)
\leq 8\left[G^{2}+\sigma^{2}\right]. (B.19)

Therefore, when we choose the step-size parameters αt=t,t\alpha_{t}=t,\forall t as indicated in Theorem 1, we have:

(Rh+b2Kh+Kh𝒳22)𝔼[1ηT+1]\displaystyle\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)\operatorname{\mathbb{E}}\left[\frac{1}{\eta_{T+1}}\right]\leq (Rh+b2Kh+Kh𝒳22)1b+28(G2+σ2)t=1Tαt2\displaystyle\left(R_{h}+\frac{b^{2}}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right)\frac{1}{b}\sqrt{{}^{2}+8\left(G^{2}+\sigma^{2}\right)\sum_{t=1}^{T}\alpha_{t}^{2}}
\displaystyle\leq (Rhb+bKh+Kh𝒳22b)+28(G2+σ2)T3.\displaystyle\left(\frac{R_{h}}{b}+\frac{b}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2b}\right)\sqrt{{}^{2}+8\left(G^{2}+\sigma^{2}\right)T^{3}}. (B.20)

Finally, substituting (B.20) into (B.17), we get:

𝔼[f(X¯T+1/2)minx𝒳f(x)]\displaystyle\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right] 2(Rhb+bKh+Kh𝒳22b)+28(G2+σ2)T3T2\displaystyle\leq 2\left(\frac{R_{h}}{b}+\frac{b}{K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2b}\right)\frac{\sqrt{{}^{2}+8\left(G^{2}+\sigma^{2}\right)T^{3}}}{T^{2}}
b2T2(12Kh+Kh𝒳22).\displaystyle-\frac{b\sqrt{{}^{2}}}{T^{2}}\left(\frac{1}{2K_{h}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\right). (B.21)

Then, from our choices of $b$ and of the squared offset parameter in Theorem 1, we obtain:

𝔼[f(X¯T+1/2)minx𝒳f(x)]2ChKhKh+8(G2+σ2)T.\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right]\leq 2\frac{C_{h}}{\sqrt{K_{h}}}\frac{\sqrt{K_{h}+8\left(G^{2}+\sigma^{2}\right)}}{\sqrt{T}}. (B.22)
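For concreteness, the adaptive learning rate used throughout this proof can be computed online as in the sketch below. This is only a schematic illustration: we write `offset_sq` for the squared offset parameter of Theorem 1 (set to $K_{h}$ there) and `b` for the corresponding scale factor, both treated here as placeholders, and we take the gradient differences $\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}$ as given.

```python
import numpy as np

def adaptive_rates(grad_diff_sq, b=1.0, offset_sq=1.0):
    """Schematic rate: eta_{t+1} = b / sqrt(offset_sq + sum_{s<=t} alpha_s^2 ||g_{s+1/2} - g_s||^2).

    grad_diff_sq[t-1] stands for ||g_{t+1/2} - g_t||_*^2 and alpha_t = t, matching
    the weights of Theorem 1; b and offset_sq are placeholder constants.
    """
    T = len(grad_diff_sq)
    alphas = np.arange(1, T + 1, dtype=float)
    running = offset_sq + np.cumsum(alphas ** 2 * np.asarray(grad_diff_sq))
    return b / np.sqrt(running)        # eta_2, ..., eta_{T+1}

# The resulting sequence is non-increasing, which the telescoping steps above rely on.
etas = adaptive_rates(np.random.default_rng(4).uniform(size=100))
assert np.all(np.diff(etas) <= 0)
```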

Proof of (20b): convergence of UnderGrad under (LG)/(LS).

From (22) and (B.14), we have:

~T(x)RhηT+1\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x)\leq\frac{R_{h}}{\eta_{T+1}} +12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle+\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}
+Kh2t=1T(1ηt+11ηt)Xt+1Xt+1/22t=1TKh2ηtXt+1/2Xt2.\displaystyle+\frac{K_{h}}{2}\sum_{t=1}^{T}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right)\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}{\lVert X_{t+1/2}-X_{t}\rVert^{2}}. (B.23)

We now bound the terms on the right-hand side of (B.23). First, we have:

Kh2t=1T(1ηt+11ηt)Xt+1Xt+1/22Kh𝒳22(1ηT+11η1).\displaystyle\frac{K_{h}}{2}\sum_{t=1}^{T}\left(\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right)\lVert X_{t+1}-X_{t+1/2}\rVert^{2}\leq\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\left(\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\right). (B.24)

Second, we have:

Kh2t=1T1ηt+1Xt+1/2Xt2\displaystyle\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1/2}-X_{t}\rVert^{2} =Kh2t=1T[1ηt+11ηt]Xt+1/2Xt2\displaystyle=\frac{K_{h}}{2}\sum_{t=1}^{T}\left[\frac{1}{\eta_{t+1}}-\frac{1}{\eta_{t}}\right]\lVert X_{t+1/2}-X_{t}\rVert^{2}
+Kh2t=1T1ηtXt+1/2Xt2\displaystyle\qquad+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2}
Kh𝒳22(1ηT+11η1)+Kh2t=1T1ηtXt+1/2Xt2\displaystyle\leq\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\left(\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\right)+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.25)

Hence,

Kh2t=1T1ηtXt+1/2Xt2Kh𝒳22(1ηT+11η1)Kh2t=1T1ηt+1Xt+1/2Xt2.-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2}\leq\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{2}\left(\frac{1}{\eta_{T+1}}-\frac{1}{\eta_{1}}\right)-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1/2}-X_{t}\rVert^{2}. (B.26)

Combining (B.24) and (B.26) with (B.23), we have:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x) Rh+Kh𝒳2ηT+1Kh𝒳2η1\displaystyle\leq\frac{R_{h}+K_{h}\lVert\mathcal{X}\rVert^{2}}{\eta_{T+1}}-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{\eta_{1}}
+12Kht=1Tαt2ηt+1gt+1/2gt2Kh2t=1T1ηt+1Xt+1/2Xt2.\displaystyle+\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t+1/2}-X_{t}\rVert^{2}. (B.27)

We now analyze the terms on the right-hand side of (B.27). To do so, we first introduce the quantities

Bt2\displaystyle B_{t}^{2} =min{f(X¯t+1/2)f(X¯t)2,gt+1/2gt2}\displaystyle=\min\{\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2},\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\} (B.28a)
and
ξt\displaystyle\xi_{t} =[gt+1/2gt][f(X¯t+1/2)f(X¯t)].\displaystyle=\left[g_{t+1/2}-g_{t}\right]-\left[\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\right]. (B.28b)

We also define

η~t=b+2s=1t1αs2Bs2.\tilde{\eta}_{t}=\frac{b}{\sqrt{{}^{2}+\sum_{s=1}^{t-1}\alpha_{s}^{2}B_{s}^{2}}}. (B.29)

By these definitions, we obtain that

gt+1/2gt2\displaystyle\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2} Bt2+[gt+1/2gt2min{f(X¯t+1/2)f(X¯t)2,gt+1/2gt2}]\displaystyle\leq B_{t}^{2}+\left[\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\min\{\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2},\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\}\right]
Bt2+max{0,gt+1/2gt2f(X¯t+1/2)f(X¯t)2}\displaystyle\leq B_{t}^{2}+\max\{0,\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\}
Bt2+Bt2+2ξt2\displaystyle\leq B_{t}^{2}+B_{t}^{2}+2\lVert\xi_{t}\rVert_{\ast}^{2}
=2Bt2+2ξt2.\displaystyle=2B_{t}^{2}+2\lVert\xi_{t}\rVert_{\ast}^{2}. (B.30)

Here, the last inequality follows from the fact that if $\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}\geq\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}$, then:

gt+1/2gt2f(X¯t+1/2)f(X¯t)2Bt2+2ξt2.\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}-\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\leq B_{t}^{2}+2\lVert\xi_{t}\rVert_{\ast}^{2}. (B.31)

Therefore, we have:

12Kht=1Tαt2ηt+1gt+1/2gt2\displaystyle\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t+1}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2} =b2Kht=1Tαt2gt+1/2gt2+2s=1tαs2gs+1/2gs2\displaystyle=\frac{b}{2K_{h}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}\left\lVert g_{s+{1/2}}-g_{s}\right\rVert^{2}}}
bKh+2t=1Tαt2gt+1/2gt2b22Kh\displaystyle\leq\frac{b}{K_{h}}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}
bKh+22t=1Tαt2Bt2+2t=1Tαt2ξt2b22Kh\displaystyle\leq\frac{b}{K_{h}}\sqrt{{}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}} #  from (B.30)
b2Kh+2t=1Tαt2Bt2+b2Kht=1Tαt2ξt2b22Kh\displaystyle\leq\frac{b\sqrt{2}}{K_{h}}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}B^{2}_{t}}+\frac{b\sqrt{2}}{K_{h}}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}}
b2Kh(2+t=1Tαt2Bt2+2s=1tαs2Bs2)\displaystyle\leq\frac{b\sqrt{2}}{K_{h}}\left(\sqrt{{}^{2}}+\sum_{t=1}^{T}\frac{\alpha_{t}^{2}B_{t}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}B_{s}^{2}}}\right)
+b2Kht=1Tαt2ξt2b22Kh\displaystyle\qquad+\frac{b\sqrt{2}}{K_{h}}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}-\frac{b\sqrt{{}^{2}}}{2K_{h}} #  from Lemma A.4
=2Kht=1Tαt2η~t+1Bt2+b2Kht=1Tαt2ξt2+b2Kh(212).\displaystyle=\frac{\sqrt{2}}{K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\tilde{\eta}_{t+1}B_{t}^{2}+\frac{b\sqrt{2}}{K_{h}}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}+\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right). (B.32)

On the other hand, by the update rule in Algorithm 1 and our choice of $\alpha_{t}=t$ for all $t$ as in Theorem 1, we also have $X_{t}-X_{t+1/2}=\frac{\sum_{s=1}^{t}\alpha_{s}}{\alpha_{t}}\left(\bar{X}_{t}-\bar{X}_{t+1/2}\right)=\frac{\alpha_{t+1}}{2}\left(\bar{X}_{t}-\bar{X}_{t+1/2}\right)$. Using this, together with the fact that $\frac{1}{\tilde{\eta}_{t}}\leq\frac{1}{\eta_{t}}$ for any $t$ and that $f$ is $L$-smooth over $\mathcal{X}$, we have:

Kh2t=1T1ηt+1XtXt+1/22\displaystyle-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t+1}}\lVert X_{t}-X_{t+1/2}\rVert^{2} Kh2t=1T1η~t+1XtXt+1/22\displaystyle\leq-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\tilde{\eta}_{t+1}}\lVert X_{t}-X_{t+1/2}\rVert^{2}
=Kh8t=1Tαt+12η~t+1X¯tX¯t+1/22\displaystyle=-\frac{K_{h}}{8}\sum_{t=1}^{T}\frac{\alpha_{t+1}^{2}}{\tilde{\eta}_{t+1}}\lVert\bar{X}_{t}-\bar{X}_{t+1/2}\rVert^{2}
\leq-\frac{K_{h}}{8}\sum_{t=1}^{T}\frac{\alpha_{t+1}^{2}}{\tilde{\eta}_{t+1}}\frac{1}{L^{2}}\lVert\nabla f(\overline{X}_{t})-\nabla f(\overline{X}_{t+1/2})\rVert_{\ast}^{2}
Kh8L2t=1Tαt2Bt2η~t+1.\displaystyle\leq-\frac{K_{h}}{8L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}B_{t}^{2}}{\tilde{\eta}_{t+1}}. (B.33)

Finally, letting Ch=Rh+Kh𝒳2C_{h}=\sqrt{R_{h}+K_{h}\lVert\mathcal{X}\rVert^{2}}, (B.30) yields:

Ch2ηT+1\displaystyle\frac{C_{h}^{2}}{\eta_{T+1}} =Ch2b+2t=1Tαt2gt+1/2gt2\displaystyle=\frac{C_{h}^{2}}{b}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}\lVert g_{t+1/2}-g_{t}\rVert_{\ast}^{2}}
Ch2b+22t=1Tαt2Bt2+2t=1Tαt2ξt2\displaystyle\leq\frac{C_{h}^{2}}{b}\sqrt{{}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}+2\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
2Ch2b+2t=1Tαt2Bt2+2Ch2bt=1Tαt2ξt2\displaystyle\leq\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{{}^{2}+\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}}+\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
2Ch2b(2+t=1Tαt2Bt2+2s=1tαs2Bs2)+2Ch2bt=1Tαt2ξt2\displaystyle\leq\frac{\sqrt{2}C_{h}^{2}}{b}\left(\sqrt{{}^{2}}+\sum_{t=1}^{T}\frac{\alpha_{t}^{2}B_{t}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}B_{s}^{2}}}\right)+\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
2Ch2b2t=1Tαt2η~t+1Bt2+2Ch22b+2Ch2bt=1Tαt2ξt2.\displaystyle\leq\frac{\sqrt{2}C_{h}^{2}}{b^{2}}\sum_{t=1}^{T}\alpha_{t}^{2}\tilde{\eta}_{t+1}B_{t}^{2}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}}{b}\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}. (B.34)

Combining (B.32), (B.33) and (B.34) with (B.27), we have:

~T(x)\displaystyle\operatorname{\tilde{\mathcal{R}}}_{T}(x) 2t=1Tαt2Bt2[(Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1]\displaystyle\leq\sqrt{2}\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}\left[\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}\right]
+2(Rhb+Kh𝒳2b+bKh)t=1Tαt2ξt2\displaystyle+\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\xi_{t}\rVert_{\ast}^{2}}
+[b2Kh(212)Kh𝒳22b+2Ch22b].\displaystyle\qquad+\left[\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right)-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}\right]. (B.35)

Now, define T0T_{0} as follows:

T0=max{1tT:η~t+1Kh8L2(Rhb2+Kh𝒳2b2+1Kh)}T_{0}=\max\left\{1\leq t\leq T:\tilde{\eta}_{t+1}\geq\frac{\sqrt{K_{h}}}{\sqrt{8L^{2}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)}}\right\} (B.36)

In other words, for any t>T0t>T_{0}, (Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1<0\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}<0. As a consequence,

2t=1Tαt2Bt2[(Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1]\displaystyle\sqrt{2}\sum_{t=1}^{T}\alpha_{t}^{2}B_{t}^{2}\left[\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}\right]
=\displaystyle= 2t=1T0αt2Bt2[(Rhb2+Kh𝒳2b2+1Kh)η~t+1Kh8L2η~t+1]\displaystyle\sqrt{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}B_{t}^{2}\left[\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\tilde{\eta}_{t+1}-\frac{K_{h}}{8L^{2}\tilde{\eta}_{t+1}}\right]
\displaystyle\leq 2(Rhb2+Kh𝒳2b2+1Kh)t=1T0αt2Bt2η~t+1\displaystyle\sqrt{2}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\sum_{t=1}^{T_{0}}\alpha_{t}^{2}B_{t}^{2}\tilde{\eta}_{t+1}
=\displaystyle= 2(Rhb2+Kh𝒳2b2+1Kh)t=1T0bαt2Bt2+2s=1tαs2Bs2\displaystyle\sqrt{2}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)\sum_{t=1}^{T_{0}}b\frac{\alpha_{t}^{2}B_{t}^{2}}{\sqrt{{}^{2}+\sum_{s=1}^{t}\alpha_{s}^{2}B_{s}^{2}}}
\displaystyle\leq 2(Rhb+Kh𝒳2b+bKh)(2+2t=1T0αt2Bt22)\displaystyle\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\left(2\sqrt{{}^{2}+\sum_{t=1}^{T_{0}}\alpha_{t}^{2}B_{t}^{2}}-\sqrt{{}^{2}}\right)
=\displaystyle= 22(Rhb+Kh𝒳2b+bKh)bη~T0+122(Rhb+Kh𝒳2b+bKh)\displaystyle 2\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\frac{b}{\tilde{\eta}_{T_{0}+1}}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)
\displaystyle\leq 8(Rhb2+Kh𝒳2b2+1Kh)3/2b2LKh22(Rhb+Kh𝒳2b+bKh).\displaystyle 8\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)^{3/2}\frac{{b}^{2}L}{\sqrt{K_{h}}}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right). (B.37)

Combining (B.35) with (B.37) and using the fact that $\operatorname{\mathbb{E}}\left[\left\lVert\xi_{t}\right\rVert_{\ast}^{2}\right]\leq 4\sigma^{2}$, we have:

𝔼[~T(x)]\displaystyle\operatorname{\mathbb{E}}\left[\operatorname{\tilde{\mathcal{R}}}_{T}(x)\right] 8(Rhb2+Kh𝒳2b2+1Kh)3/2b2LKh22(Rhb+Kh𝒳2b+bKh)\displaystyle\leq 8\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)^{3/2}\frac{{b}^{2}L}{\sqrt{K_{h}}}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)
+22(Rhb+Kh𝒳2b+bKh)σt=1Tαt2\displaystyle+2\sqrt{2}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\sigma\sqrt{\sum_{t=1}^{T}\alpha_{t}^{2}}
+[b2Kh(212)Kh𝒳22b+2Ch22b]\displaystyle\qquad+\left[\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right)-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}\right] (B.38)

Recalling the choice $\alpha_{t}=t$ for all $t$ and applying Lemma B.1, we have:

𝔼[f(X¯T+1/2)minx𝒳f(x)]\displaystyle\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right] 16T2(Rhb2+Kh𝒳2b2+1Kh)3/2b2LKh\displaystyle\leq\frac{16}{T^{2}}\left(\frac{R_{h}}{b^{2}}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b^{2}}+\frac{1}{K_{h}}\right)^{3/2}\frac{{b}^{2}L}{\sqrt{K_{h}}}
+42T(Rhb+Kh𝒳2b+bKh)σ+2C(b,2)T2.\displaystyle+\frac{4\sqrt{2}}{\sqrt{T}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right)\sigma+\frac{2C(b,^{2})}{T^{2}}. (B.39)

where we set

C(b,2)b2Kh(212)Kh𝒳22b+2Ch22b22(Rhb+Kh𝒳2b+bKh).C(b,^{2})\coloneqq\frac{b\sqrt{{}^{2}}}{K_{h}}\left(\sqrt{2}-\frac{1}{2}\right)-\frac{K_{h}\lVert\mathcal{X}\rVert^{2}\sqrt{{}^{2}}}{b}+\frac{\sqrt{2}C_{h}^{2}\sqrt{{}^{2}}}{b}-\sqrt{2^{2}}\left(\frac{R_{h}}{b}+\frac{K_{h}\lVert\mathcal{X}\rVert^{2}}{b}+\frac{b}{K_{h}}\right). (B.40)

Finally, substituting $b=\sqrt{K_{h}C_{h}^{2}}$ and the value of the squared offset parameter (namely $K_{h}$) chosen in Theorem 1 into (B.39), and noting that with these choices the constant defined in (B.40) is at most $-\frac{1}{2}C_{h}\sqrt{K_{h}}\leq 0$, we can rewrite (B.39) as follows:

𝔼[f(X¯T+1/2)minx𝒳f(x)]322LT2(Ch2Kh)+82σTChKh.\operatorname{\mathbb{E}}\left[f(\overline{X}_{T+1/2})-\min_{x\in\mathcal{X}}f(x)\right]\leq\frac{32\sqrt{2}L}{T^{2}}\left(\frac{C_{h}^{2}}{{K_{h}}}\right)+\frac{8\sqrt{2}\sigma}{\sqrt{T}}\frac{C_{h}}{\sqrt{K_{h}}}. (B.41)
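To see the interpolation encoded in this bound at a glance, the snippet below evaluates the two terms of (B.41) for some hypothetical values of $L$, $C_{h}$, $K_{h}$ and two noise levels: with $\sigma=0$ the $\operatorname{\mathcal{O}}(1/T^{2})$ term is the whole bound, while for $\sigma>0$ the $\operatorname{\mathcal{O}}(1/\sqrt{T})$ term eventually dominates.

```python
import numpy as np

L, C_h, K_h = 10.0, 1.0, 1.0          # hypothetical problem constants
for sigma in (0.0, 0.1):
    for T in (10**2, 10**4, 10**6):
        smooth_term = 32 * np.sqrt(2) * L * C_h**2 / (K_h * T**2)
        noise_term = 8 * np.sqrt(2) * sigma * C_h / (np.sqrt(K_h) * np.sqrt(T))
        print(f"sigma={sigma}, T={T:>8}: bound = {smooth_term + noise_term:.3e}")
```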

Convergence of UnderGrad in unbounded domains.

Finally, we give the proof of Theorem 2 concerning the deterministic SFO in the (LG) case with a possibly unbounded domain 𝒳\mathcal{X}.

Proof.

Since the learning rate $\eta_{t}$ is non-increasing and non-negative, its limit exists; in particular,

limtηt=inftηt0\lim_{t\to\infty}\eta_{t}=\inf\nolimits_{t\in\mathbb{N}}\eta_{t}\geq 0 (B.42)

Assume now, to the contrary, that $\inf_{t\in\mathbb{N}}\eta_{t}=0$. Then, by applying Proposition 1, we have:

t=1Tαtf(X¯t+1/2),Xt+1/2x\displaystyle\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})\mathopen{},\mathopen{}X_{t+1/2}-x\rangle h(x)minx𝒳h(x)ηT+1\displaystyle\leq\frac{h(x)-\min_{x\in\mathcal{X}}h(x)}{\eta_{T+1}} (B.43)
+\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t}),X_{t+1/2}-X_{t+1}\rangle
t=1TKh2ηtXt+1Xt+1/22t=1TKh2ηtXt+1/2Xt2\displaystyle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.44)

Now for the term t=1Tαtf(X¯t+1/2)f(X¯t),Xt+1/2Xt+1t=1TKh2ηtXt+1Xt+1/22\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2} we have:

-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}+\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t}),X_{t+1/2}-X_{t+1}\rangle
\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
+\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2} (B.45)

which readily yields:

t=1Tαtf(X¯t+1/2)f(X¯t),Xt+1/2Xt+1\displaystyle\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\mathopen{},\mathopen{}X_{t+1/2}-X_{t+1}\rangle t=1TKh2ηtXt+1Xt+1/22\displaystyle-\sum_{t=1}^{T}\frac{K_{h}}{2\eta_{t}}\lVert X_{t+1}-X_{t+1/2}\rVert^{2}
\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} (B.46)

Hence, putting everything together, we get:

\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2}),X_{t+1/2}-x\rangle\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
Kh2t=1T1ηtXtXt+1/22\displaystyle-\frac{K_{h}}{2}\sum_{t=1}^{T}\frac{1}{\eta_{t}}\lVert X_{t}-X_{t+1/2}\rVert^{2} (B.47)

Moreover, since ff is smooth we have:

f(X¯t+1/2)f(X¯t)2\displaystyle\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} L2X¯t+1/2X¯t2\displaystyle\leq L^{2}\lVert\overline{X}_{t+1/2}-\overline{X}_{t}\rVert^{2}
\leq L^{2}\frac{\alpha_{t}^{2}}{\bigl(\sum_{s=1}^{t}\alpha_{s}\bigr)^{2}}\lVert X_{t+1/2}-X_{t}\rVert^{2}
=L24t2t2(t+1)2Xt+1/2Xt2\displaystyle=L^{2}\frac{4t^{2}}{t^{2}(t+1)^{2}}\lVert X_{t+1/2}-X_{t}\rVert^{2}
4L2αt2Xt+1/2Xt2\displaystyle\leq\frac{4L^{2}}{\alpha_{t}^{2}}\lVert X_{t+1/2}-X_{t}\rVert^{2} (B.48)

Combining this with the fact that ηt\eta_{t} is a decreasing sequence, we can rewrite (B.47) as follows:

\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2}),X_{t+1/2}-x\rangle\leq\frac{\eta_{1}}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
-\frac{K_{h}}{8L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} (B.49)

In what follows, we derive appropriate bounds for the two terms on the right-hand side of (B.49), starting with the second one. From (B.9), we also have $\sum_{t=1}^{T}\alpha_{t}\langle\nabla f(\overline{X}_{t+1/2}),X_{t+1/2}-x\rangle\geq 0$. Combining this with (B.49), we have:

0\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}-\frac{K_{h}}{8L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} (B.50)

Hence by rearranging we have:

\sum_{t=1}^{T}\frac{K_{h}\alpha_{t}^{2}}{16\eta_{t}L^{2}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\leq\frac{1}{2K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\eta_{t}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
Kh16L2t=1Tαt2ηtf(X¯t+1/2)f(X¯t)2\displaystyle-\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
=12t=1Tαt2f(X¯t+1/2)f(X¯t)2[ηtKhKh8ηtL2]\displaystyle=\frac{1}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right] (B.51)

Now, since we assumed that $\eta_{t}$ converges to $0$, there exists some $T_{0}\in\mathbb{N}$ such that:

\eta_{t}\leq\frac{K_{h}}{\sqrt{8}L}\;\;\text{for all}\;\;t>T_{0} (B.52)

which directly yields that [ηtKhKh8ηtL2]0\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right]\leq 0 for all t>T0t>T_{0}. Hence, we have:

Kh16L2t=1Tαt2ηtf(X¯t+1/2)f(X¯t)212t=1T0αt2f(X¯t+1/2)f(X¯t)2[ηtKhKh8ηtL2]\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\leq\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right] (B.53)

On the other hand, we have:

1ηt\displaystyle\frac{1}{\eta_{t}} =1KhKh+s=1t1αs2f(X¯s+1/2)f(X¯s)21KhKh=1\displaystyle=\frac{1}{\sqrt{K_{h}}}\sqrt{K_{h}+\sum_{s=1}^{t-1}\alpha_{s}^{2}\lVert\nabla f(\overline{X}_{s+1/2})-\nabla f(\overline{X}_{s})\rVert_{\ast}^{2}}\geq\frac{1}{\sqrt{K_{h}}}\sqrt{K_{h}}=1 (B.54)

and hence

Kh16L2t=1Tαt2ηtf(X¯t+1/2)f(X¯t)2\displaystyle\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\frac{\alpha_{t}^{2}}{\eta_{t}}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2} Kh16L2t=1Tαt2f(X¯t+1/2)f(X¯t)2\displaystyle\geq\frac{K_{h}}{16L^{2}}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
=Kh216L2Kht=1Tαt2f(X¯t+1/2)f(X¯t)2\displaystyle=\frac{K_{h}^{2}}{16L^{2}K_{h}}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}
=\frac{K_{h}^{2}}{16L^{2}}\left(\frac{1}{\eta_{T+1}^{2}}-1\right) (B.55)

So, summarizing we have:

\frac{K_{h}^{2}}{16L^{2}}\left(\frac{1}{\eta_{T+1}^{2}}-1\right)\leq\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right] (B.56)

Now, letting $T\to\infty$ and recalling that we assumed $\eta_{t}\to 0$ (so that $1/\eta_{T+1}^{2}\to\infty$), the left-hand side of (B.56) diverges while the right-hand side remains finite, i.e.,

12t=1T0αt2f(X¯t+1/2)f(X¯t)2[ηtKhKh8ηtL2]<\infty\leq\frac{1}{2}\sum_{t=1}^{T_{0}}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}\left[\frac{\eta_{t}}{K_{h}}-\frac{K_{h}}{8\eta_{t}L^{2}}\right]<\infty (B.57)

a contradiction. This shows that inftηt>0\inf_{t\in\mathbb{N}}\eta_{t}>0 and hence

\sum_{t=1}^{\infty}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}=\lim_{T\to\infty}\sum_{t=1}^{T}\alpha_{t}^{2}\lVert\nabla f(\overline{X}_{t+1/2})-\nabla f(\overline{X}_{t})\rVert_{\ast}^{2}=\lim_{T\to\infty}\left(\frac{K_{h}}{\eta_{T+1}^{2}}-K_{h}\right)=\frac{K_{h}}{(\inf_{t}\eta_{t})^{2}}-K_{h}<\infty (B.58)

so our proof is complete. ∎

Appendix C Additional Numerical Experiments

Figure 3. Convergence of UnderGrad and UnixGrad in the simplex setup with a noisy SFO. The plot is drawn in log-log scale; the $y$-axis shows the difference between the $f$-value of each algorithm's relevant point and $\min f$.

In this last section, we report another numerical experiment highlighting the universality of UnderGrad. We again focus on the simplex setup presented in Section 5; this time, however, we work with a noisy SFO whose first-order feedback is perturbed by zero-mean Gaussian noise with a pre-determined variance. We compare the performance of UnderGrad and UnixGrad, both run with the entropic regularizer. The results of this experiment are reported in Fig. 3.

Fig. 3 shows that UnderGrad achieves the optimal rate $\operatorname{\mathcal{O}}(1/\sqrt{T})$ in this setup. UnixGrad can also attain the same rate, but only when its step-size update rule is tuned appropriately (recall that, with the entropic regularizer, the update rule (14) of UnixGrad is unavailable because $B_{h}=\infty$): when $\gamma_{1}$ is of the same order as (or larger than) UnderGrad's initial learning rate, UnixGrad converges at the rate $\operatorname{\mathcal{O}}(1/\sqrt{T})$; but if $\gamma_{1}$ is too small (e.g., $\gamma_{1}=10^{-3}\cdot\eta_{1}$), UnixGrad can exhibit a very long warm-up phase. This experiment reasserts that, in cases where the step-size update rule (14) is unavailable, it is non-trivial to choose an appropriate step-size schedule for UnixGrad: a small $\gamma_{1}$ may lead to better performance under a perfect SFO (cf. Section 5) but may cause unwanted behavior in noisy SFO setups. By contrast, UnderGrad did not encounter such issues in any of our experiments.
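For readers who wish to experiment with a toy version of this setup, the sketch below assembles its main ingredients: the entropic mirror map on the simplex and a noisy SFO obtained by adding zero-mean Gaussian noise to the true gradient. The quadratic test objective and all constants are hypothetical stand-ins (the actual problem instance is described in Section 5), and the noisy oracle is plugged into a simple dual-averaging (lazy mirror descent) loop purely for illustration; the updates of UnderGrad and UnixGrad themselves are given in the main text.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 20
A = rng.normal(size=(d, d)); A = A.T @ A / d            # hypothetical quadratic objective
x_target = rng.dirichlet(np.ones(d))                    # minimizer placed inside the simplex
f = lambda x: 0.5 * (x - x_target) @ A @ (x - x_target)
grad_f = lambda x: A @ (x - x_target)

sigma = 0.1                                             # noise level of the SFO
sfo = lambda x: grad_f(x) + rng.normal(scale=sigma, size=d)

def Q(y):                                               # entropic mirror map (softmax)
    w = np.exp(y - np.max(y))
    return w / np.sum(w)

# Simple dual-averaging loop driven by the noisy SFO, as an illustrative stand-in.
y = np.zeros(d)
x_avg, weight = np.zeros(d), 0.0
for t in range(1, 2001):
    x = Q(y)
    y -= (1.0 / np.sqrt(t)) * sfo(x)
    x_avg, weight = (weight * x_avg + t * x) / (weight + t), weight + t

print(f(x_avg))   # value gap f(x_avg) - min f, since min f = 0 here
```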

Finally, we conducted another experiment to examine how UnderGrad's convergence rate depends on the noise level $\sigma$; the results are reported in Fig. 4.

Figure 4. Convergence of UnderGrad in the simplex setup with different noise levels of the SFO.

Acknowledgments

KA and VC were supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement № 725594 - time-data) and by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. KYL is grateful for financial support by the Israel Science Foundation (grant No. 447/20), by Israel PBC-VATAT, and by the Technion Center for Machine Learning and Intelligent Systems (MLIS). PM is grateful for financial support by the French National Research Agency (ANR) in the framework of the "Investissements d'avenir" program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), MIAI@Grenoble Alpes (ANR-19-P3IA-0003), and the bilateral ANR-NRF grant ALIAS (ANR-19-CE48-0018-01).

References

  • Allen-Zhu & Orecchia [2017] Allen-Zhu, Z. and Orecchia, L. Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS ’17, 2017. Full version available at http://arxiv.org/abs/1407.1537.
  • Antonakopoulos & Mertikopoulos [2021] Antonakopoulos, K. and Mertikopoulos, P. Adaptive first-order methods revisited: Convex optimization without Lipschitz requirements. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
  • Antonakopoulos et al. [2019] Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. An adaptive mirror-prox algorithm for variational inequalities with singular operators. In NeurIPS ’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
  • Antonakopoulos et al. [2021a] Antonakopoulos, K., Belmega, E. V., and Mertikopoulos, P. Adaptive extra-gradient methods for min-max optimization and games. In ICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021a.
  • Antonakopoulos et al. [2021b] Antonakopoulos, K., Pethick, T., Kavis, A., Mertikopoulos, P., and Cevher, V. Sifting through the noise: Universal first-order methods for stochastic variational inequalities. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021b.
  • Auer et al. [2002] Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
  • Bach & Levy [2019] Bach, F. and Levy, K. Y. A universal algorithm for variational inequalities adaptive to smoothness and noise. In COLT ’19: Proceedings of the 32nd Annual Conference on Learning Theory, 2019.
  • Bauschke & Combettes [2017] Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, NY, USA, 2 edition, 2017.
  • Bilenne et al. [2020] Bilenne, O., Mertikopoulos, P., and Belmega, E. V. Fast optimization with zeroth-order feedback in distributed multi-user MIMO systems. IEEE Trans. Signal Process., 68:6085–6100, October 2020.
  • Boyd & Vandenberghe [2004] Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
  • Bubeck [2015] Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–358, 2015.
  • Bubeck & Cesa-Bianchi [2012] Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • Cesa-Bianchi & Lugosi [2006] Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Cesa-Bianchi & Lugosi [2012] Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422, 2012.
  • Chen & Teboulle [1993] Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, August 1993.
  • Cutkosky [2019] Cutkosky, A. Anytime online-to-batch, optimism and acceleration. In ICML ’19: Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Ene et al. [2021] Ene, A., Nguyen, H. L., and Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. In AAAI ’21: Proceedings of the 35th Conference on Artificial Intelligence, 2021.
  • Facchinei & Pang [2003] Facchinei, F. and Pang, J.-S. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research. Springer, 2003.
  • György et al. [2007] György, A., Linder, T., Lugosi, G., and Ottucsák, G. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007.
  • Hsieh et al. [2021] Hsieh, Y.-G., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In COLT ’21: Proceedings of the 34th Annual Conference on Learning Theory, 2021.
  • Joulani et al. [2020] Joulani, P., Raj, A., György, A., and Szepesvári, C. A simpler approach to accelerated stochastic optimization: Iterative averaging meets optimism. In ICML ’20: Proceedings of the 37th International Conference on Machine Learning, 2020.
  • Juditsky et al. [2011] Juditsky, A., Nemirovski, A. S., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
  • Kakade et al. [2012] Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Regularization techniques for learning with matrices. The Journal of Machine Learning Research, 13:1865–1890, 2012.
  • Kavis et al. [2019] Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. UnixGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In NeurIPS ’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
  • Korpelevich [1976] Korpelevich, G. M. The extragradient method for finding saddle points and other problems. Èkonom. i Mat. Metody, 12:747–756, 1976.
  • Lan [2012] Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, June 2012.
  • Lattimore & Szepesvári [2020] Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020.
  • Levy et al. [2018] Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. In NeurIPS ’18: Proceedings of the 32nd International Conference of Neural Information Processing Systems, 2018.
  • Li & Orabona [2019] Li, X. and Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In AISTATS ’19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
  • McMahan & Streeter [2010] McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In COLT ’10: Proceedings of the 23rd Annual Conference on Learning Theory, 2010.
  • Mertikopoulos & Sandholm [2016] Mertikopoulos, P. and Sandholm, W. H. Learning in games via reinforcement and regularization. Mathematics of Operations Research, 41(4):1297–1324, November 2016.
  • Mertikopoulos & Staudigl [2018] Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, January 2018.
  • Mertikopoulos & Zhou [2019] Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1-2):465–507, January 2019.
  • Mertikopoulos et al. [2017] Mertikopoulos, P., Belmega, E. V., Negrel, R., and Sanguinetti, L. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process., 65(9):2277–2290, May 2017.
  • Mertikopoulos et al. [2019] Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR ’19: Proceedings of the 2019 International Conference on Learning Representations, 2019.
  • Nemirovski [2004] Nemirovski, A. S. Prox-method with rate of convergence O(1/t){O}(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
  • Nemirovski & Yudin [1983] Nemirovski, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, NY, 1983.
  • Nesterov [2004] Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Number 87 in Applied Optimization. Kluwer Academic Publishers, 2004.
  • Nesterov [2007] Nesterov, Y. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2):319–344, 2007.
  • Nesterov [2015] Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1-2):381–404, 2015.
  • Rakhlin & Sridharan [2013a] Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT ’13: Proceedings of the 26th Annual Conference on Learning Theory, 2013a.
  • Rakhlin & Sridharan [2013b] Rakhlin, A. and Sridharan, K. Optimization, learning, and games with predictable sequences. In NIPS ’13: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2013b.
  • Rockafellar [1970] Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.
  • Telatar [1999] Telatar, I. E. Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications and Related Technologies, 10(6):585–596, 1999.
  • Vu et al. [2021] Vu, D. Q., Antonakopoulos, K., and Mertikopoulos, P. Fast routing under uncertainty: Adaptive learning in congestion games with exponential weights. In NeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
  • Ward et al. [2019] Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In ICML ’19: Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Yu et al. [2004] Yu, W., Rhee, W., Boyd, S. P., and Cioffi, J. M. Iterative water-filling for Gaussian vector multiple-access channels. IEEE Trans. Inf. Theory, 50(1):145–152, 2004.