
$\mu^2$-SGD: Stable Stochastic Optimization via a Double Momentum Mechanism

Kfir Y. Levy
Department of Electrical and Computer Engineering, Technion, Israel. Email: [email protected].
Abstract

We consider stochastic convex optimization problems where the objective is an expectation over smooth functions. For this setting we suggest a novel gradient estimate that combines two recent mechanisms related to the notion of momentum. We then design an SGD-style algorithm, as well as an accelerated version, that make use of this new estimator, and demonstrate the robustness of these new approaches to the choice of the learning rate. Concretely, we show that these approaches obtain the optimal convergence rates for both the noiseless and the noisy case with the same choice of fixed learning rate. Moreover, for the noisy case we show that these approaches achieve the same optimal bound for a very wide range of learning rates.

1 Introduction

Stochastic Convex Optimization (SCO) is a fundamental framework that captures several classical Machine Learning (ML) problems, such as linear regression, logistic regression and SVMs (Support Vector Machines), amongst others. In the past two decades, SCO has been extensively explored and has highly influenced the field of ML: it has popularized the use of Stochastic Gradient Descent (SGD) as the standard workhorse for training ML models, see e.g. (Shalev-Shwartz et al., 2007; Welling and Teh, 2011; Mairal et al., 2009; Recht et al., 2011), and has led to the design of sophisticated SGD variants that play a central role in training modern large scale models (Duchi et al., 2011; Kingma and Ba, 2015).

One practical difficulty in applying SGD-type methods is the need to tune the learning rate among other hyperparameters, and it is well known that the performance of such algorithms crucially relies on the right choice of the learning rate. As a remedy, several adaptive SGD variants have been designed throughout the years (Duchi et al., 2011; Kingma and Ba, 2015; Levy et al., 2018; Kavis et al., 2019; Jacobsen and Cutkosky, 2022); however, in practice such methods still require hyperparameter tuning, which might be very costly in huge-scale scenarios.

In this paper we focus on the prevalent SCO setting where the objective (expected loss) is an Expectation Over Smooth losses (SCO-EOS); this applies e.g. to linear and logistic regression problems (but not to SVMs). In this case, it is well known that SGD requires a careful tuning of the learning rate to obtain the optimal performance. For example, in the noiseless case, SGD (or GD in this case) should employ a learning rate of $\eta^{\rm Offline}=1/L$, where $L$ is the smoothness parameter of the objective. Nevertheless, if we apply this $\eta^{\rm Offline}$ in the noisy setting, the guarantees of SGD become vacuous. To obtain the optimal SGD guarantees, we should roughly decrease the learning rate by a factor of $\sigma\sqrt{T}$, where $T$ is the total number of SGD iterates (and samples), and $\sigma^{2}$ bounds the variance of the noise in the gradient estimates. This illustrates the sensitivity of SGD to the choice of $\eta$. The same applies to stochastic accelerated methods such as (Lan, 2012; Hu et al., 2009; Xiao, 2010).

Contributions.

We introduce a novel gradient estimate for SCO-EOS problems, and show that its square error, $\|\varepsilon_{t}\|^{2}$, shrinks with the number of updates, $\|\varepsilon_{t}\|^{2}\propto 1/t$, where $t$ is the iterate. This is in contrast to the standard SGD estimator, where usually $\|\varepsilon_{t}\|^{2}=\text{Variance}_{t}=O(1)$. Our new estimate blends two recent mechanisms that are related to the notion of momentum: Anytime averaging, which is due to (Cutkosky, 2019); and a corrected momentum technique (Cutkosky and Orabona, 2019). We therefore denote our estimate by $\mu^2$, which stands for $\textbf{Momentum}^{2}$.

We further design an SGD variant called $\mu^2$-SGD, as well as an accelerated version called $\mu^2$-ExtraSGD, that employ our new estimate, and demonstrate their stability with respect to the choice of the learning rate $\eta$. Concretely,

  • Upon using the exact same learning rate of $\eta^{\rm Offline}=1/8LT$ (where $T$ is the total number of iterates/data-samples), $\mu^2$-SGD enjoys a convergence rate of $O(L/T)$ in the noiseless case, and a rate of $O(L/T+\tilde{\sigma}/\sqrt{T})$ in the noisy case.

    Moreover, in the noisy case, $\mu^2$-SGD enjoys the same convergence rate as optimal SGD, $O(L/T+\tilde{\sigma}/\sqrt{T})$, for a wide range of learning rate choices, i.e. $\eta\in[\eta^{\rm Noisy},\eta^{\rm Offline}]$, such that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)\sqrt{T}$.

  • Upon using the exact same learning rate of $\eta^{\rm Offline}=1/2L$, $\mu^2$-ExtraSGD enjoys an optimal convergence rate of $O(L/T^{2})$ in the noiseless case, and an optimal rate of $O(L/T^{2}+\tilde{\sigma}/\sqrt{T})$ in the noisy case.

    Moreover, in the noisy case, $\mu^2$-ExtraSGD enjoys the same optimal convergence of $O(L/T^{2}+\tilde{\sigma}/\sqrt{T})$ for an extremely wide range of learning rate choices, i.e. $\eta\in[\eta^{\rm Noisy},\eta^{\rm Offline}]$, such that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)T^{3/2}$.

Related Work:

The Gradient Descent (GD) algorithm and its stochastic counterpart SGD (Robbins and Monro, 1951) are cornerstones of ML and Optimization. Their wide adoption in ML, as well as in other fields, has led to the development of numerous elegant and useful variants (Duchi et al., 2011; Kingma and Ba, 2015; Ge et al., 2015). Curiously, many of the variants that serve in practical training of non-convex learning models were originally designed under the framework of SCO.

As we mention in the Introduction, the performance of SGD crucially relies on the choice of the learning rate. There is a plethora of work on designing methods that implicitly and optimally adapt the learning rate throughout the learning process (Duchi et al., 2011; Kingma and Ba, 2015; Kavis et al., 2019; Antonakopoulos et al., 2022; Jacobsen and Cutkosky, 2022); and some of these methods have been widely adopted among practitioners.

Momentum (Polyak, 1964) is another widely used practical technique (Sutskever et al., 2013), and it is interestingly related to the accelerated method of Nesterov (Nesterov, 1983) – a seminal approach that obtains faster convergence rates compared to GD for smooth and convex objectives. While Nesterov's accelerated method is fragile to noise, Lan (2012); Hu et al. (2009); Xiao (2010) have designed stochastic accelerated variants whose convergence rate interpolates between the fast rate in the noiseless case and the standard SGD rate in the noisy case (depending on the noise magnitude).

Our work builds on two recent mechanisms that are related to the notion of momentum: (i) an Anytime averaging mechanism (Cutkosky, 2019), which relies on averaging the query points of the gradient oracle; (ii) a corrected momentum technique (Cutkosky and Orabona, 2019), which relies on averaging the gradients themselves throughout the learning process (while introducing a correction). It is interesting to note that the Anytime mechanism has proven to be extremely useful in designing adaptive and accelerated methods (Cutkosky, 2019; Kavis et al., 2019; Antonakopoulos et al., 2022). The corrected momentum mechanism has mainly found use in designing optimal and adaptive algorithms for stochastic non-convex problems (Cutkosky and Orabona, 2019; Levy et al., 2021).

2 Setting

We consider stochastic optimization problems where the objective $f:\mathcal{K}\mapsto\mathbb{R}$ is convex and is of the following form,

f(x) := \mathbf{E}_{z\sim\mathcal{D}} f(x;z),

where $\mathcal{K}\subseteq\mathbb{R}^{d}$ is a compact convex set, and $\mathcal{D}$ is an unknown distribution from which we may draw i.i.d. samples $\{z_{t}\sim\mathcal{D}\}_{t}$. We consider first order optimization methods that iteratively employ such samples in order to generate a sequence of query points and eventually output a solution $x_{\rm output}\in\mathcal{K}$. Our performance measure is the expected excess loss,

\text{ExcessLoss} := \mathbf{E}[f(x_{\rm output})] - \min_{x\in\mathcal{K}} f(x),

where the expectation is w.r.t. the randomization of the samples.

More concretely, at every iteration $t$ such methods maintain a query point $x_{t}\in\mathbb{R}^{d}$, which is computed based on the past query points and past samples $\{z_{1},\ldots,z_{t-1}\}$. Then, the next query point $x_{t+1}$ is computed based on $x_{t}$ and on a gradient estimate $g_{t}$ that is derived by drawing a fresh sample $z_{t}\sim\mathcal{D}$ independently of the past samples, and computing,

g_{t} := \nabla f(x_{t};z_{t}),

note that the above derivative is w.r.t. $x$. The independence between samples implies that $g_{t}$ is an unbiased estimate of $\nabla f(x_{t})$ in the following sense: $\mathbf{E}[g_{t}|x_{t}]=\nabla f(x_{t})$. It is often convenient to think of the computation of $g_{t}=\nabla f(x_{t};z_{t})$ as a (noisy) gradient oracle that, upon receiving a query point $x_{t}\in\mathcal{K}$, outputs a vector $g_{t}\in\mathbb{R}^{d}$ which is an unbiased estimate of $\nabla f(x_{t})$.
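To make the oracle model concrete, here is a minimal Python sketch of such a noisy gradient oracle for a toy least-squares instance, $f(x;z)=\frac{1}{2}(a\cdot x-b)^{2}$ with $z=(a,b)$; the problem instance, dimension and noise level are illustrative assumptions, not taken from the paper. Later sketches in this manuscript reuse these helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_star = rng.normal(size=d)  # ground-truth minimizer of the expected loss

def sample_z():
    """Draw a fresh i.i.d. sample z = (a, b) from the (synthetic) distribution D."""
    a = rng.normal(size=d)
    b = a @ x_star + 0.1 * rng.normal()  # linear label plus additive noise
    return a, b

def grad_oracle(x, z):
    """Noisy oracle: returns g = grad_x f(x; z), an unbiased estimate of grad f(x)."""
    a, b = z
    return (a @ x - b) * a
```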

Assumptions.

We will make the following assumptions,
Bounded Diameter: we assume there exists $D>0$ such that $\max_{x,y\in\mathcal{K}}\|x-y\|\leq D$.
Bounded variance: there exists $\sigma>0$ such that,

\mathbf{E}\|\nabla f(x;z)-\nabla f(x)\|^{2} \leq \sigma^{2}, \quad \forall x\in\mathcal{K},\ z\in\textbf{Support}\{\mathcal{D}\}   (1)

Expectation over smooth functions: we assume that $f(\cdot)$ is an expectation of smooth functions, i.e. there exists $L>0$ such that,

\|\nabla f(x;z)-\nabla f(y;z)\| \leq L\|x-y\|, \quad \forall x,y\in\mathcal{K},\ z\in\textbf{Support}\{\mathcal{D}\}   (2)

The above assumption also implies that the expected loss $f(\cdot)$ is $L$-smooth.
Bounded Smoothness Variance: note that the assumption that we make in Eq. (2) implies that there exists $\sigma_{L}^{2}\in[0,L^{2}]$ such that,

\mathbf{E}\left\|(\nabla f(x;z)-\nabla f(x))-(\nabla f(y;z)-\nabla f(y))\right\|^{2} \leq \sigma_{L}^{2}\|x-y\|^{2}, \quad \forall x,y\in\mathcal{K}.   (3)

Clearly, in the deterministic setting where $f(x;z)=f(x),\ \forall z\in\textbf{Support}\{\mathcal{D}\}$, we have $\sigma_{L}=0$. In App. A, we show how Eq. (2) implies Eq. (3).
Notation: In the rest of this manuscript, $\nabla f(x;z)$ relates to gradients with respect to $x$, i.e. $\nabla:=\nabla_{x}$. We use $\|\cdot\|$ to denote the Euclidean norm. Given a sequence $\{y_{t}\}_{t}$ we denote $y_{t_{1}:t_{2}}:=\sum_{\tau=t_{1}}^{t_{2}}y_{\tau}$. For a positive integer $N$ we denote $[N]:=\{1,\ldots,N\}$. We also let $\Pi_{\mathcal{K}}:\mathbb{R}^{d}\mapsto\mathcal{K}$ denote the orthogonal projection onto $\mathcal{K}$, i.e. $\Pi_{\mathcal{K}}(x)=\operatorname*{arg\,min}_{y\in\mathcal{K}}\|y-x\|^{2},\ \forall x\in\mathbb{R}^{d}$.
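For concreteness, the following sketch instantiates the projection $\Pi_{\mathcal{K}}$ for the simple case where $\mathcal{K}$ is a Euclidean ball of radius $R$ (so that the diameter is $D=2R$); the choice of $\mathcal{K}$ and the value of $R$ are assumptions made purely for illustration.

```python
import numpy as np

R = 10.0  # radius of K (hypothetical); the diameter is then D = 2R

def project(x, radius=R):
    """Orthogonal projection onto the Euclidean ball {y : ||y|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x
```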

3 Momentum Mechanisms

In this section we provide background regarding two mechanisms that can be related to the notion of momentum. Curiously, both of these approaches involve averaging different elements of the learning algorithm. The approach that we present in the following section builds on a combination of these two mechanisms.

3.1 Mechanism I: Anytime-GD

The mechanism that we discuss here is related to averaging the query points for the noisy gradient oracle. Basically, while in standard SGD we query the gradients at the iterates that we compute, in Anytime-SGD (Cutkosky, 2019), we query the gradients at weighted averages of the iterates that we compute.

More concretely, the Anytime-SGD algorithm (Cutkosky, 2019; Kavis et al., 2019) that we describe in Equations (4) and (5) employs a learning rate $\eta>0$ and a sequence of non-negative weights $\{\alpha_{t}\}_{t}$. The algorithm maintains two sequences $\{w_{t}\}_{t},\{x_{t}\}_{t}$. At initialization $x_{1}=w_{1}$, and then we update as follows $\forall t$,

w_{t+1} = w_{t} - \eta\alpha_{t}g_{t},\ \forall t\in[T], \quad \text{where } g_{t}=\nabla f(x_{t};z_{t}),   (4)

and $z_{t}$ is a fresh sample from $\mathcal{D}$. Then Anytime-SGD updates,

x_{t+1} = \frac{\alpha_{1:t}}{\alpha_{1:t+1}}x_{t} + \frac{\alpha_{t+1}}{\alpha_{1:t+1}}w_{t+1}.   (5)

The above implies that the $x_{t}$'s are weighted averages of the $w_{t}$'s, i.e. that $x_{t+1}=\frac{1}{\alpha_{1:t+1}}\sum_{\tau=1}^{t+1}\alpha_{\tau}w_{\tau}$. Thus, at every iterate, the gradient $g_{t}$ is queried at $x_{t}$, which is a weighted average of past iterates, and then $w_{t+1}$ is updated similarly to GD with a weight $\alpha_{t}$ on the gradient $g_{t}$.
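The following is a minimal sketch of a projected variant of Anytime-SGD, Eqs. (4)-(5), reusing the toy oracle and ball projection sketched above; the weights $\alpha_{t}=t$, the projection step, and the inputs are illustrative assumptions.

```python
def anytime_sgd(x1, T, eta):
    """Sketch of Anytime-SGD: gradients are queried at weighted averages x_t."""
    w = x1.copy()
    x = x1.copy()
    alpha_sum = 0.0                       # running alpha_{1:t}
    for t in range(1, T + 1):
        alpha = float(t)                  # weights alpha_t = t (illustrative)
        g = grad_oracle(x, sample_z())    # query the gradient at the average x_t
        w = project(w - eta * alpha * g)  # Eq. (4), projected to keep w_t in K
        alpha_sum += alpha
        # Eq. (5): x_{t+1} = (alpha_{1:t} x_t + alpha_{t+1} w_{t+1}) / alpha_{1:t+1}
        x = (alpha_sum * x + (alpha + 1.0) * w) / (alpha_sum + alpha + 1.0)
    return x
```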

Curiously, it was shown in Wang et al. (2021) that an algorithm very similar to Anytime-GD can be related to the classical Heavy-Ball method (Polyak, 1964), and the latter incorporates momentum in its iterates (see Alg. 13 in Section 4.5.1 of Wang et al. (2021)).

It was shown in Cutkosky (2019) that Anytime-SGD obtains the same convergence rates as SGD for convex loss functions (both smooth and non-smooth), and this technique was found to be extremely useful in designing universal accelerated methods (Cutkosky, 2019; Kavis et al., 2019).

The next theorem is crucial in analyzing Anytime-SGD, and actually applies more broadly,

Theorem 3.1 (Rephrased from Theorem 1 in Cutkosky (2019)).

Let $f:\mathcal{K}\mapsto\mathbb{R}$ be a convex function with a minimum $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$. Also let $\{\alpha_{t}\geq 0\}_{t}$, and $\{w_{t}\in\mathcal{K}\}_{t},\{x_{t}\in\mathcal{K}\}_{t}$, such that $\{x_{t}\}_{t}$ is an $\{\alpha_{t}\}_{t}$ weighted average of $\{w_{t}\}_{t}$, i.e. such that $x_{1}=w_{1}$, and for any $t\geq 1$,

x_{t+1} = \frac{1}{\alpha_{1:t+1}}\sum_{\tau=1}^{t+1}\alpha_{\tau}w_{\tau}.

Then the following holds for any $t\geq 1$,

\alpha_{1:t}\left(f(x_{t})-f(w^{*})\right) \leq \sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*}).

Note that the above theorem holds for any sequence $\{w_{t}\in\mathcal{K}\}_{t}$, and as a private case it holds for the Anytime-GD algorithm that we have described. Thus, the above theorem relates the excess loss of a given algorithm that computes the sequences $\{w_{t}\in\mathcal{K}\}_{t},\{x_{t}\in\mathcal{K}\}_{t}$ to its weighted regret, $\mathcal{R}_{t}:=\sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})$.

3.2 Mechanism II: Recursive Corrected Averaging

The mechanism that we discuss here is related to averaging the gradient estimates that we compute throughout the learning process, which is a common and often crucial technique in practical applications. While such straightforward averaging might introduce bias into the gradient estimates, it was suggested in (Cutkosky and Orabona, 2019) to add a bias correction mechanism named $\rm{STORM}$ (STochastic Recursive Momentum), and it was shown that this mechanism leads to a powerful variance reduction effect.

Concretely, $\rm{STORM}$ maintains an estimate $d_{t}$ which is a corrected weighted average of past stochastic gradients, and then it performs an SGD-style update step,

w_{t+1} = w_{t} - \eta d_{t}.   (6)

The corrected momentum estimates are updated as follows,

d_{t} = \nabla f(w_{t};z_{t}) + (1-\beta_{t})(d_{t-1}-\nabla f(w_{t-1};z_{t})), \quad \text{where } \beta_{t}\in[0,1].   (7)

Note that the correction term implies that $\mathbf{E}[d_{t}]=\mathbf{E}\nabla f(w_{t})$. Nevertheless, in general $\mathbf{E}[d_{t}|w_{t}]\neq\nabla f(w_{t})$, so $d_{t}$ is conditionally biased (in contrast to standard SGD gradient estimates). Moreover, the choice of the same sample $z_{t}$ in the two terms of the above expression is crucial for the variance reduction effect.
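A minimal sketch of the $\rm{STORM}$ recursion, Eqs. (6)-(7), is given below; the choice $\beta_{t}=1/t$ and the reuse of the toy oracle sketched in Section 2 are illustrative assumptions.

```python
def storm(w1, T, eta):
    """Sketch of STORM: an SGD step on a corrected momentum estimate d_t."""
    w = w1.copy()
    w_prev, d = None, None
    for t in range(1, T + 1):
        z = sample_z()                    # one fresh sample z_t per round
        if d is None:
            d = grad_oracle(w, z)         # d_1 is a plain stochastic gradient
        else:
            beta = 1.0 / t                # illustrative momentum weight
            # Eq. (7): the *same* z_t is used at both w_t and w_{t-1};
            # this coupling is what yields the variance reduction.
            d = grad_oracle(w, z) + (1.0 - beta) * (d - grad_oracle(w_prev, z))
        w_prev, w = w, w - eta * d        # Eq. (6): SGD-style step on d_t
    return w
```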

4 Our Approach: Double Momentum Mechanism

Our approach is to combine the two momentum mechanisms described above. Our algorithm is therefore named $\mu^2$-SGD ($\text{Momentum}^{2}$-Stochastic Gradient Descent), and we describe it in Alg. 1. Intuitively, each of these momentum (averaging) techniques stabilizes the algorithm, and their combination leads to a method that is almost as stable as offline GD.

We first describe Alg. 1, and then present our main result in Theorem 4.1, showing that the error of the gradient estimates of our approach shrinks as we progress. We demonstrate the benefit of this result in Thm. 4.2, which shows that $\mu^2$-SGD obtains the same convergence rate as standard SGD for a very wide range of learning rates (in contrast to SGD).

Next we elaborate on the ingredients of Alg. 1:
Update rule: Note that for generality we allow a broad family of update rules in Eq. (9) of Alg. 1. The only constraint on the update rule is that its iterates $\{w_{t}\}_{t}$ always belong to $\mathcal{K}$. Later, we will specifically analyze the natural SGD-style update rule,

w_{t+1} = \Pi_{\mathcal{K}}(w_{t} - \eta\alpha_{t}d_{t}).   (8)

Momentum Computation: From Eq. (11) in Alg. 1 we can see that the momentum $d_{t}$ is updated similarly to the $\rm{STORM}$ update in Eq. (7), albeit with two main differences: (i) first, we incorporate importance weights $\{\alpha_{t}\}_{t}$ into $\rm{STORM}$ and recursively update the weighted momentum $\alpha_{t}d_{t}$; more importantly, (ii) we query the noisy gradients at the averages $x_{t}$ rather than at the iterates themselves, and the averages (Eq. (10)) are computed in the spirit of Anytime-SGD. These can be seen in the computation of $g_{t+1}$ and $\tilde{g}_{t}$, which query the gradients at the averages rather than the iterates.

Algorithm 1 $\mu^2$-SGD
  Input: #Iterations $T$, initial point $x_{1}$, learning rate $\eta>0$, importance weights $\{\alpha_{t}\}_{t}$, corrected momentum weights $\{\beta_{t}\}_{t}$

  Initialize: set $w_{1}=x_{1}$, draw $z_{1}\sim\mathcal{D}$ and set $d_{1}=\nabla f(x_{1};z_{1})$
  for $t=1,\ldots,T$ do
     Iterate Update:
Update $w_{t+1}\in\mathcal{K}$ based on past information/computations up to time $t$   (9)
     Query Update (Anytime-SGD style):
x_{t+1} = \frac{\alpha_{1:t}}{\alpha_{1:t+1}}x_{t} + \frac{\alpha_{t+1}}{\alpha_{1:t+1}}w_{t+1},   (10)
     Update Corrected Momentum ($\rm{STORM}$ style): draw $z_{t+1}\sim\mathcal{D}$, compute $g_{t+1}:=\nabla f(x_{t+1};z_{t+1})$ and $\tilde{g}_{t}:=\nabla f(x_{t};z_{t+1})$, and update,
\alpha_{t+1}d_{t+1} = \alpha_{t+1}g_{t+1} + (1-\beta_{t+1})\alpha_{t+1}(d_{t}-\tilde{g}_{t})   (11)
  end for
  output: $x_{T}$

Thus, as promised, our algorithm combines two different momentum mechanisms. Next we present our main result, which shows that Alg. 1 yields estimates with a very small error. Note that the only limitation on the update rule in Eq. (9) is that its iterates $\{w_{t}\}_{t}$ always belong to $\mathcal{K}$.
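The following is a minimal end-to-end sketch of $\mu^2$-SGD with the projected update of Eq. (8) and the parameters of Thm. 4.2 ($\alpha_{t}=t+1$, $\beta_{t}=1/\alpha_{t}$, $\eta=1/8LT$); it reuses the toy oracle and ball projection sketched earlier, and the smoothness constant passed in is a hypothetical bound.

```python
def mu2_sgd(x1, T, L):
    """Sketch of Alg. 1 with the SGD-style update rule of Eq. (8)."""
    eta = 1.0 / (8.0 * L * T)                # eta_offline from Thm. 4.2
    w = x1.copy()
    x = x1.copy()
    d = grad_oracle(x, sample_z())           # d_1 = grad f(x_1; z_1)
    alpha_sum = 2.0                          # alpha_{1:1}, with alpha_t = t + 1
    for t in range(1, T + 1):
        w = project(w - eta * (t + 1) * d)   # Eq. (8): iterate update w_{t+1}
        # Eq. (10): x_{t+1} = (alpha_{1:t} x_t + alpha_{t+1} w_{t+1}) / alpha_{1:t+1}
        x_prev, x = x, (alpha_sum * x + (t + 2.0) * w) / (alpha_sum + t + 2.0)
        alpha_sum += t + 2.0
        # Eq. (11): STORM-style correction with beta_{t+1} = 1/(t+2); note that
        # both gradients are queried at the *averages*, with the same fresh sample.
        z = sample_z()
        beta = 1.0 / (t + 2.0)
        d = grad_oracle(x, z) + (1.0 - beta) * (d - grad_oracle(x_prev, z))
    return x
```

For instance, `mu2_sgd(np.zeros(d), T=1000, L=1.0)` runs the sketch on the toy least-squares instance above.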

Theorem 4.1.

Let $f:\mathcal{K}\mapsto\mathbb{R}$, where $\mathcal{K}$ is a convex set with bounded diameter $D$, and presume that the assumptions in Equations (1), (2), (3) hold. Then invoking Alg. 1 with $\{\alpha_{t}=t+1\}_{t}$ and $\{\beta_{t}=1/\alpha_{t}\}_{t}$ ensures the following for any $t\in[T]$,

\mathbf{E}\|\varepsilon_{t}\|^{2} := \mathbf{E}\|d_{t}-\nabla f(x_{t})\|^{2} \leq \tilde{\sigma}^{2}/t,

where $\varepsilon_{t}:=d_{t}-\nabla f(x_{t})$, and $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

So according to the above theorem, the overall error of $d_{t}$ compared to the exact gradient $\nabla f(x_{t})$ shrinks as we progress. In contrast, in standard SGD (as well as in Anytime-SGD) the expected square error is fixed, namely $\mathbf{E}\|g_{t}-\nabla f(w_{t})\|^{2}\leq O(\sigma^{2})$.

Based on the above theorem, we are now ready to analyze Alg. 1 with the specific SGD-type update rule that we present in Eq. (8).

Theorem 4.2 ($\mu^2$-SGD Guarantees).

Let $f:\mathbb{R}^{d}\mapsto\mathbb{R}$ be a convex function, and assume that $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$ is also its global minimum in $\mathbb{R}^{d}$. Also, let us make the same assumptions as in Thm. 4.1. Then invoking Alg. 1 with $\{\alpha_{t}=t+1\}_{t}$ and $\{\beta_{t}=1/\alpha_{t}\}_{t}$, and using the SGD-type update rule (8) with a learning rate $\eta\leq 1/8LT$ inside Eq. (9) of Alg. 1, guarantees,

\mathbf{E}(f(x_{T})-f(w^{*})) = \mathbf{E}\Delta_{T} \leq O\left(\frac{D^{2}}{\eta T^{2}} + 2\eta\tilde{\sigma}^{2} + \frac{4D\tilde{\sigma}}{\sqrt{T}}\right),

where $\Delta_{t}:=f(x_{t})-f(w^{*})$, and $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

The Stability of $\mu^2$-SGD.

The above theorem shows that $\mu^2$-SGD obtains the optimal SGD convergence rates for both the offline (noiseless) and the noisy case with the same choice of fixed learning rate $\eta^{\rm Offline}=\frac{1}{8LT}$, which does not depend on the noise $\tilde{\sigma}$. This is in contrast to SGD, which requires either reducing the offline learning rate by a factor of $\sigma\sqrt{T}$, or using sophisticated adaptive learning rates (Duchi et al., 2011; Levy et al., 2018).

Moreover, letting $\eta^{\rm Noisy}=1/(8LT+\tilde{\sigma}T^{3/2}/D)$, it can be seen that in the noisy case our approach enables employing learning rates in the extremely wide range $[\eta^{\rm Noisy},\eta^{\rm Offline}]$, while still obtaining the same optimal SGD convergence rate of $O(LD^{2}/T + \tilde{\sigma}D/\sqrt{T})$. Indeed, note that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)T^{1/2}$.
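As a quick numeric illustration of this range (with hypothetical values of $L,\tilde{\sigma},D,T$, not taken from the paper):

```python
L_s, sigma_t, D_s, T_s = 1.0, 1.0, 1.0, 10_000            # hypothetical constants
eta_offline = 1.0 / (8 * L_s * T_s)
eta_noisy = 1.0 / (8 * L_s * T_s + sigma_t * T_s**1.5 / D_s)
# the ratio grows like (sigma_tilde / L) * sqrt(T), up to constants:
print(eta_offline, eta_noisy, eta_offline / eta_noisy)
```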

4.1 Proof Sketch of Thm. 4.1

Proof.

First note that the $x_{t}$'s always belong to $\mathcal{K}$, since they are weighted averages of the $\{w_{t}\in\mathcal{K}\}_{t}$'s. Our first step is to bound the difference between consecutive queries. The definition of $x_{t}$ implies,

\alpha_{1:t-1}(x_{t}-x_{t-1}) = \alpha_{t}(w_{t}-x_{t}),

yielding,

\|x_{t}-x_{t-1}\|^{2} = \left(\alpha_{t}/\alpha_{1:t-1}\right)^{2}\|w_{t}-x_{t}\|^{2} \leq (16/t^{2})D^{2} = (16/\alpha_{t-1}^{2})D^{2},   (12)

where we have used $\alpha_{t}=t+1$, which implies $\alpha_{t}/\alpha_{1:t-1}\leq 4/t$ for any $t\geq 2$; we also used $\|w_{t}-x_{t}\|\leq D$, which holds since $w_{t},x_{t}\in\mathcal{K}$; and finally we used $\alpha_{t-1}=t$.
Notation: Prior to going on with the proof we shall require some notation. We will denote $\bar{g}_{t}:=\nabla f(x_{t})$, and recall the following notation from Alg. 1: $g_{t}:=\nabla f(x_{t};z_{t})$, $\tilde{g}_{t-1}:=\nabla f(x_{t-1};z_{t})$. We will also denote,

\varepsilon_{t} := d_{t} - \bar{g}_{t}.

Now, recalling Eq. (11) and combining it with the above definition of $\varepsilon_{t}$ enables us to derive the following recursive relation,

\alpha_{t}\varepsilon_{t} = \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}Z_{t} + (1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}\alpha_{t-1}\varepsilon_{t-1},

where we denote $Z_{t}:=(g_{t}-\bar{g}_{t})-(\tilde{g}_{t-1}-\bar{g}_{t-1})$. Now, using $\alpha_{t}=t+1$ and $\beta_{t}=1/(t+1)$, it can be verified that $\alpha_{t}\beta_{t}=1$ and $\alpha_{t}(1-\beta_{t})=\alpha_{t}-1=\alpha_{t-1}$. Moreover, $(1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}=\frac{\alpha_{t}-1}{\alpha_{t-1}}=1$. Using these relations in the equation above gives,

\alpha_{t}\varepsilon_{t} = \alpha_{t-1}Z_{t} + \alpha_{t-1}\varepsilon_{t-1} + (g_{t}-\bar{g}_{t}) = M_{t} + \alpha_{t-1}\varepsilon_{t-1},   (13)

where for any $t>1$ we denote $M_{t}:=\alpha_{t-1}Z_{t}+(g_{t}-\bar{g}_{t})$, as well as $M_{1}:=g_{1}-\bar{g}_{1}$. Unrolling the above equation yields an explicit expression for any $t\in[T]$: $\alpha_{t}\varepsilon_{t}=\sum_{\tau=1}^{t}M_{\tau}$.

Noticing that the sequence $\{M_{t}\}_{t}$ is a martingale difference sequence with respect to the natural filtration $\{\mathcal{F}_{t}\}_{t}$ induced by the history of the samples up to time $t$, we can bound the expected square error as follows,

\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}
= \mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|M_{\tau}\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|\alpha_{\tau-1}Z_{\tau}+(g_{\tau}-\bar{g}_{\tau})\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|Z_{\tau}\|^{2} + 2\sum_{\tau=1}^{t}\mathbf{E}\|g_{\tau}-\bar{g}_{\tau}\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|(g_{\tau}-\bar{g}_{\tau})-(\tilde{g}_{\tau-1}-\bar{g}_{\tau-1})\|^{2} + 2t\sigma^{2}.   (14)

Now, using Eq. (3) together with Eq. (12) allows us to bound, for any $\tau$,

\mathbf{E}\|(g_{\tau}-\bar{g}_{\tau})-(\tilde{g}_{\tau-1}-\bar{g}_{\tau-1})\|^{2} \leq \sigma_{L}^{2}\mathbf{E}\|x_{\tau}-x_{\tau-1}\|^{2} \leq 16\sigma_{L}^{2}D^{2}/\alpha_{\tau-1}^{2}.

Plugging the above back into Eq. (14) and summing establishes the theorem. ∎

5 Accelerated Version: $\mu^2$-ExtraSGD

In this section we present an accelerated version that makes use of a double momentum mechanism, as we do for $\mu^2$-SGD. Our approach here relies on an algorithmic template named ExtraGradient (Korpelevich, 1976; Nemirovski, 2004; Juditsky et al., 2011). The latter technique has already been combined with the Anytime-SGD mechanism in Cutkosky (2019); Kavis et al. (2019), where it was shown that it leads to acceleration. Here, we further blend in an appropriate $\rm{STORM}$ mechanism, leading to a new method that we name $\mu^2$-ExtraSGD. Our main result for this section is presented in Thm. 5.2.

Optimistic OGD, Extragradient and UnixGrad:

The extragradient technique is related to an algorithmic template named Optimistic Online GD (Optimistic OGD) (Rakhlin and Sridharan, 2013). In this algorithm we receive a sequence of (possibly arbitrary) loss vectors $\{d_{t}\in\mathbb{R}^{d}\}_{t\in[T]}$ in an online manner, and our goal is to sequentially compute a sequence of iterates (or decision points) $\{w_{t}\in\mathcal{K}\}_{t}$, where $\mathcal{K}$ is a given convex set. Note that we may pick $w_{t}$ only based on past information $\{d_{1},\ldots,d_{t-1}\}$. Our goal is to ensure a low weighted regret for any $w\in\mathcal{K}$, where the latter is defined as follows,

\mathcal{R}_{T}(w) := \sum_{t=1}^{T}\alpha_{t}d_{t}\cdot(w_{t}-w),

and $\{\alpha_{t}>0\}$ is a sequence of predefined weights. Now, in the optimistic setting we assume that we may access another sequence of "hint vectors" $\{\hat{d}_{t}\in\mathbb{R}^{d}\}_{t}$, and that prior to picking $w_{t}$ we may also access $\{\hat{d}_{1},\ldots,\hat{d}_{t}\}$. It was shown in Rakhlin and Sridharan (2013) that if the hints are useful, in the sense that $\hat{d}_{t}\approx d_{t}$, then one can substantially reduce the regret by properly incorporating the hints. Thus, in Optimistic-OGD we maintain two sequences: a decision point sequence $\{w_{t}\}_{t}$ and an auxiliary sequence $\{y_{t}\}_{t}$ that are updated as follows,

Optimistic OGD:
w_{t} = \operatorname*{arg\,min}_{w\in\mathcal{K}}\alpha_{t}\hat{d}_{t}\cdot w + \frac{1}{2\eta}\|w-y_{t-1}\|^{2} \quad \& \quad y_{t} = \operatorname*{arg\,min}_{y\in\mathcal{K}}\alpha_{t}d_{t}\cdot y + \frac{1}{2\eta}\|y-y_{t-1}\|^{2}   (15)
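Over a Euclidean ball, each of the two arg-min steps in Eq. (15) reduces to a projected gradient step from $y_{t-1}$; the sketch below (reusing the ball projection from Section 2) makes one such round explicit, where the ball constraint is an illustrative assumption.

```python
def optimistic_ogd_step(y_prev, d_hint, d_loss, alpha, eta):
    """One round of weighted Optimistic OGD (Eq. (15)) over a Euclidean ball."""
    # w_t = argmin_w  alpha * <d_hat_t, w> + ||w - y_{t-1}||^2 / (2 eta)
    w = project(y_prev - eta * alpha * d_hint)
    # y_t = argmin_y  alpha * <d_t, y> + ||y - y_{t-1}||^2 / (2 eta)
    y = project(y_prev - eta * alpha * d_loss)
    return w, y
```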

It was shown in (Rakhlin and Sridharan, 2013; Kavis et al., 2019) that the above algorithm enjoys the following regret bound for any $w\in\mathcal{K}$.

Theorem 5.1 (See e.g. the proof of Thm. 1 in (Kavis et al., 2019)).

Let $\eta>0$, $\{\alpha_{t}\geq 0\}_{t\in[T]}$, and let $\mathcal{K}$ be a convex set with bounded diameter $D$. Then the Optimistic-OGD algorithm above enjoys the following regret bound for any $w\in\mathcal{K}$,

\sum_{t=1}^{T}\alpha_{t}d_{t}\cdot(w_{t}-w) \leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|d_{t}-\hat{d}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2}.

The Extragradient algorithm (Nemirovski, 2004) aims to minimize a convex function $f:\mathcal{K}\mapsto\mathbb{R}$. To do so, it applies the Optimistic-OGD template with the following choices of loss and hint vectors: $\hat{d}_{t}=\nabla f(y_{t-1})$, and $d_{t}:=\nabla f(w_{t})$.
The UnixGrad algorithm (Kavis et al., 2019) can be seen as an Anytime version of Extragradient, where again we aim to minimize a convex function $f:\mathcal{K}\mapsto\mathbb{R}$. In the spirit of Anytime-GD, UnixGrad maintains two sequences of weighted averages based on the $\{w_{t},y_{t}\}_{t}$ sequences as follows,

\hat{x}_{t} = \frac{\alpha_{1:t-1}}{\alpha_{1:t}}x_{t-1} + \frac{\alpha_{t}}{\alpha_{1:t}}y_{t-1}, \quad \& \quad x_{t} = \frac{\alpha_{1:t-1}}{\alpha_{1:t}}x_{t-1} + \frac{\alpha_{t}}{\alpha_{1:t}}w_{t}   (16)

Then, based on the above averages, UnixGrad sets the loss and hint vectors as follows: $\hat{d}_{t}=\nabla f(\hat{x}_{t})$, and $d_{t}:=\nabla f(x_{t})$. Note that the above averaging rule implies that the $x_{t}$'s are weighted averages of the $w_{t}$'s, i.e. $x_{t}=\frac{1}{\alpha_{1:t}}\sum_{\tau=1}^{t}\alpha_{\tau}w_{\tau}$. The latter enables us to utilize the Anytime guarantees of Thm. 3.1.

Note that there also exist stochastic versions of the above approaches where we may only query noisy rather than exact gradient estimates.

Our Approach.

We suggest employing the Optimistic-OGD template together with a $\rm{STORM}$ mechanism on top of the Anytime mechanism employed by UnixGrad. Specifically, we maintain the same weighted averages as in Eq. (16), and define momentum estimates as follows:
At round $t$ draw a fresh sample $z_{t}\sim\mathcal{D}$, and based on $\hat{x}_{t},x_{t}$ compute

\tilde{g}_{t-1} = \nabla f(x_{t-1};z_{t}), \quad \hat{g}_{t} = \nabla f(\hat{x}_{t};z_{t}), \quad g_{t} = \nabla f(x_{t};z_{t}).   (17)

Based on the above compute the (corrected momentum) loss and hint vectors as follows,

\alpha_{t}\hat{d}_{t} = \alpha_{t}\hat{g}_{t} + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1}) \quad \& \quad \alpha_{t}d_{t} = \alpha_{t}g_{t} + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1})   (18)

We then update according to Optimistic OGD in Eq. (15). Notice that the update rule for $\alpha_{t}d_{t}$ is the exact same update that we use in Alg. 1; additionally, as we have already noted, the $x_{t}$ sequence in Eq. (16) is an $\{\alpha_{t}\}_{t}$ weighted average of the $\{w_{t}\}_{t}$ sequence. Therefore, if we pick $\alpha_{t}=t+1$ and $\beta_{t}=1/\alpha_{t}$, we can invoke Thm. 4.1, implying that,

\mathbf{E}\|\varepsilon_{t}\|^{2} := \mathbf{E}\|d_{t}-\nabla f(x_{t})\|^{2} \leq \tilde{\sigma}^{2}/t,

where $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

The pseudo-code in Alg. 2 depicts our $\mu^2$-ExtraSGD algorithm with the appropriate computational order. It can be seen that it combines Optimistic-OGD updates (Eq. (15)), together with appropriate Anytime averaging (Eq. (16)), and $\rm{STORM}$ updates for $d_{t},\hat{d}_{t}$ (Eq. (18)).

Algorithm 2 $\mu^2$-ExtraSGD
  Input: #Iterations $T$, initial point $y_{0}$, learning rate $\eta>0$, importance weights $\{\alpha_{t}=t+1\}_{t}$, corrected momentum weights $\{\beta_{t}=1/\alpha_{t}\}_{t}$

  Initialize: set $x_{0}=0$ and $\hat{x}_{1}=y_{0}$, draw $z_{1}\sim\mathcal{D}$ and set $d_{0}=\tilde{g}_{0}=\hat{d}_{1}=\nabla f(\hat{x}_{1};z_{1})$
  for $t=1,\ldots,T$ do
     Compute: $w_{t}=\operatorname*{arg\,min}_{w\in\mathcal{K}}\alpha_{t}\hat{d}_{t}\cdot w+\frac{1}{2\eta}\|w-y_{t-1}\|^{2}$, & Update: $x_{t}=\frac{\alpha_{1:t-1}}{\alpha_{1:t}}x_{t-1}+\frac{\alpha_{t}}{\alpha_{1:t}}w_{t}$
     Compute: $g_{t}=\nabla f(x_{t};z_{t})$, & Update: $\alpha_{t}d_{t}=\alpha_{t}g_{t}+(1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1})$
     Compute: $y_{t}=\operatorname*{arg\,min}_{y\in\mathcal{K}}\alpha_{t}d_{t}\cdot y+\frac{1}{2\eta}\|y-y_{t-1}\|^{2}$, & Update: $\hat{x}_{t+1}=\frac{\alpha_{1:t}}{\alpha_{1:t+1}}x_{t}+\frac{\alpha_{t+1}}{\alpha_{1:t+1}}y_{t}$
     Draw a fresh sample $z_{t+1}\sim\mathcal{D}$ and compute $\tilde{g}_{t}=\nabla f(x_{t};z_{t+1})$, $\hat{g}_{t+1}=\nabla f(\hat{x}_{t+1};z_{t+1})$
     Update: $\alpha_{t+1}\hat{d}_{t+1}=\alpha_{t+1}\hat{g}_{t+1}+(1-\beta_{t+1})\alpha_{t+1}(d_{t}-\tilde{g}_{t})$
  end for
  output: $x_{T}$
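Below is a minimal sketch of Alg. 2 over a Euclidean ball, reusing the toy oracle and projection from Section 2; as before, the concrete problem instance is an illustrative assumption, and $\eta=1/2L$ follows Thm. 5.2 with a hypothetical smoothness bound passed in.

```python
def mu2_extra_sgd(y0, T, L):
    """Sketch of mu^2-ExtraSGD (Alg. 2) with alpha_t = t+1 and beta_t = 1/alpha_t."""
    eta = 1.0 / (2.0 * L)                     # eta_offline from Thm. 5.2
    x = np.zeros_like(y0)                     # x_0 = 0
    x_hat, y = y0.copy(), y0.copy()           # hat x_1 = y_0
    z = sample_z()                            # z_1
    d = d_hat = g_tilde = grad_oracle(x_hat, z)   # d_0 = tilde g_0 = hat d_1
    alpha_sum = 0.0                           # running alpha_{1:t-1}
    for t in range(1, T + 1):
        alpha, beta = t + 1.0, 1.0 / (t + 1.0)
        w = project(y - eta * alpha * d_hat)                   # optimistic step w_t
        x = (alpha_sum * x + alpha * w) / (alpha_sum + alpha)  # average x_t
        g = grad_oracle(x, z)                                  # g_t = grad f(x_t; z_t)
        d = g + (1.0 - beta) * (d - g_tilde)                   # STORM loss vector d_t
        y = project(y - eta * alpha * d)                       # OGD step y_t
        alpha_sum += alpha                                     # now alpha_{1:t}
        # hat x_{t+1} and the hint vector hat d_{t+1}, using a fresh sample z_{t+1}
        x_hat = (alpha_sum * x + (alpha + 1.0) * y) / (alpha_sum + alpha + 1.0)
        z = sample_z()
        g_tilde = grad_oracle(x, z)                            # tilde g_t
        d_hat = grad_oracle(x_hat, z) + (1.0 - 1.0 / (t + 2.0)) * (d - g_tilde)
    return x
```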

We are now ready to state the guarantees of $\mu^2$-ExtraSGD.

Theorem 5.2 ($\mu^2$-ExtraSGD Guarantees).

Let $f:\mathcal{K}\mapsto\mathbb{R}$ be a convex function, let $\mathcal{K}$ be a convex set with diameter $D$, and denote $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$. Also presume that the assumptions in Equations (1), (2), (3) hold. Then invoking Alg. 2 with $\{\alpha_{t}=t+1\}_{t}$ and $\{\beta_{t}=1/\alpha_{t}\}_{t}$, and using $\eta\leq 1/2L$, guarantees,

\mathbf{E}(f(x_{T})-f(w^{*})) = \mathbf{E}\Delta_{T} \leq O\left(\frac{D^{2}}{\eta T^{2}} + \frac{\tilde{\sigma}D}{\sqrt{T}}\right),

where $\Delta_{t}:=f(x_{t})-f(w^{*})$, and $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

The Stability of $\mu^2$-ExtraSGD.

Similarly to $\mu^2$-SGD, the above theorem shows that $\mu^2$-ExtraSGD obtains the optimal rates for both the offline (noiseless) and the noisy case with the same choice of fixed learning rate $\eta^{\rm Offline}=1/2L$, which only depends on the smoothness of the objective. This is in contrast to existing accelerated methods, which require either reducing the offline learning rate by a factor of $\sigma\sqrt{T}$ (Xiao, 2010), or employing sophisticated adaptive learning rates (Cutkosky, 2019; Kavis et al., 2019).

Moreover, letting $\eta^{\rm Noisy}:=1/(2L+\tilde{\sigma}T^{3/2}/D)$, it can be seen that in the noisy case our approach enables employing learning rates in the extremely wide range $[\eta^{\rm Noisy},\eta^{\rm Offline}]$, while still obtaining the same optimal convergence rate of $O(LD^{2}/T^{2} + \tilde{\sigma}D/\sqrt{T})$. Indeed, note that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)T^{3/2}$.

Finally, in contrast to Thm. 4.2, which requires $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$ to also be the global minimum of $f(\cdot)$ in $\mathbb{R}^{d}$, Thm. 5.2 does not require this assumption.

5.1 Proof of Thm. 5.2

Proof of Thm. 5.2.

The proof decomposes according to the techniques that $\mu^2$-ExtraSGD employs.
Part I: Anytime Guarantees. Since the $x_{t}$'s are $\{\alpha_{t}\}_{t}$ weighted averages of the $\{w_{t}\}_{t}$'s, we can invoke Thm. 3.1, which implies,

\alpha_{1:T}(f(x_{T})-f(w^{*})) \leq \sum_{t=1}^{T}\alpha_{t}\nabla f(x_{t})\cdot(w_{t}-w^{*}) = \sum_{t=1}^{T}\alpha_{t}d_{t}\cdot(w_{t}-w^{*}) - \sum_{t=1}^{T}\alpha_{t}\varepsilon_{t}\cdot(w_{t}-w^{*}),   (19)

where we denote $\varepsilon_{t}:=d_{t}-\nabla f(x_{t})$.
Part II: Optimistic OGD Guarantees. Since the update rule for $\{w_{t},y_{t}\}_{t}$ satisfies the Optimistic-OGD template w.r.t. the sequences of loss and hint vectors $\{d_{t},\hat{d}_{t}\}_{t}$, we can apply Thm. 5.1 to bound the weighted regret in Eq. (19) as follows,

\alpha_{1:T}(f(x_{T})-f(w^{*}))
\leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|d_{t}-\hat{d}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} - \sum_{t=1}^{T}\alpha_{t}\varepsilon_{t}\cdot(w_{t}-w^{*})
\leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|g_{t}-\hat{g}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + \sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|\cdot\|w_{t}-w^{*}\|
\leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|\nabla f(x_{t};z_{t})-\nabla f(\hat{x}_{t};z_{t})\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + \frac{\eta L^{2}}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|x_{t}-\hat{x}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + \frac{\eta L^{2}}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\left(\frac{\alpha_{t}}{\alpha_{1:t}}\right)^{2}\|w_{t}-y_{t-1}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + \left(2\eta L^{2}-\frac{1}{2\eta}\right)\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|,   (20)

where the first inequality uses Eq. (19) together with Thm. 5.1; the second inequality uses $d_{t}-\hat{d}_{t}=g_{t}-\hat{g}_{t}$, which follows from Eq. (18), together with Cauchy-Schwarz; the third inequality follows from the definitions of $g_{t},\hat{g}_{t}$, as well as from $\|w_{t}-w^{*}\|\leq D$, which holds since $w_{t},w^{*}\in\mathcal{K}$; the fourth inequality follows from our assumption in Eq. (2); the fifth inequality holds since $x_{t}-\hat{x}_{t}=(\alpha_{t}/\alpha_{1:t})(w_{t}-y_{t-1})$, which follows from Eq. (16); the sixth inequality holds since $\alpha_{t}^{4}/(\alpha_{1:t})^{2}\leq 4,\ \forall t\geq 1$; and the last inequality holds since $2\eta L^{2}-1/(2\eta)\leq 0$, which follows from our assumption $\eta\leq 1/2L$.
Part III: $\mu^2$ Guarantees. Notice that our definitions of $w_{t},x_{t},\alpha_{t},\beta_{t}$ and $d_{t}$ satisfy the exact same conditions of Thm. 4.1. This immediately implies that $\mathbf{E}\|\varepsilon_{t}\|^{2}\leq\tilde{\sigma}^{2}/t,\ \forall t$. Using this and taking the expectation of Eq. (20) yields,

\alpha_{1:T}\mathbf{E}(f(x_{T})-f(w^{*}))
\leq \frac{4D^{2}}{\eta} + D\sum_{t=1}^{T}\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + D\sum_{t=1}^{T}\sqrt{\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}}
\leq \frac{4D^{2}}{\eta} + D\tilde{\sigma}\sum_{t=1}^{T}\sqrt{\alpha_{t}^{2}/t}
\leq \frac{4D^{2}}{\eta} + 2T^{3/2}D\tilde{\sigma}.   (21)

where the second inequality uses Jensen's inequality, $\mathbf{E}X\leq\sqrt{\mathbf{E}X^{2}}$, which holds for any random variable $X$; the last inequality follows from $\alpha_{t}^{2}/t=(t+1)^{2}/t\leq 4t$, implying that $\sum_{t=1}^{T}\sqrt{\alpha_{t}^{2}/t}\leq 2\sum_{t=1}^{T}\sqrt{t}\leq 2T^{3/2}$.

Dividing the above equation by $\alpha_{1:T}$ and recalling that $\alpha_{1:T}=\Theta(T^{2})$ concludes the proof. ∎

6 Conclusion

By blending two recent momentum techniques, we have designed a new shrinking-error gradient estimate for the SCO-EOS setting. Based on it, we have presented two algorithms that rely on the SGD and Extragradient templates, and have shown that the latter are extremely stable w.r.t. the choice of the learning rate. For future work, it will be interesting to study our approach in the more general case of expectations over $\nu$-Hölder smooth functions (with $\nu\in[0,1]$), as well as to extend our approach to the strongly-convex case.

Acknowledgments

This work has received funding from the Israeli Science Foundation (grant No. 447/20).

References

  • Antonakopoulos et al. (2022) K. Antonakopoulos, D. Q. Vu, V. Cevher, K. Levy, and P. Mertikopoulos. Undergrad: A universal black-box optimization method with almost dimension-free convergence rate guarantees. In International Conference on Machine Learning, pages 772–795. PMLR, 2022.
  • Cutkosky (2019) A. Cutkosky. Anytime online-to-batch, optimism and acceleration. In International Conference on Machine Learning, pages 1446–1454. PMLR, 2019.
  • Cutkosky and Orabona (2019) A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019.
  • Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • Ge et al. (2015) R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
  • Hazan et al. (2016) E. Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Hu et al. (2009) C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781–789, 2009.
  • Jacobsen and Cutkosky (2022) A. Jacobsen and A. Cutkosky. Parameter-free mirror descent. arXiv preprint arXiv:2203.00444, 2022.
  • Juditsky et al. (2011) A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
  • Kavis et al. (2019) A. Kavis, K. Y. Levy, F. Bach, and V. Cevher. Unixgrad: a universal, adaptive algorithm with optimal guarantees for constrained optimization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 6260–6269, 2019.
  • Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Korpelevich (1976) G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
  • Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
  • Levy et al. (2018) K. Y. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. In NeurIPS, 2018.
  • Levy et al. (2021) K. Y. Levy, A. Kavis, and V. Cevher. Storm+: Fully adaptive sgd with recursive momentum for nonconvex optimization. In NeurIPS, 2021.
  • Mairal et al. (2009) J. Mairal, F. R. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In A. P. Danyluk, L. Bottou, and M. L. Littman, editors, Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM International Conference Proceeding Series, pages 689–696. ACM, 2009. doi: 10.1145/1553374.1553463. URL https://doi.org/10.1145/1553374.1553463.
  • Nemirovski (2004) A. Nemirovski. Prox-method with rate of convergence $O(1/t)$ for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
  • Nesterov (1983) Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^{2})$. In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
  • Polyak (1964) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964.
  • Rakhlin and Sridharan (2013) S. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
  • Recht et al. (2011) B. Recht, C. Ré, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 693–701, 2011. URL https://proceedings.neurips.cc/paper/2011/hash/218a0aefd1d1a4be65601cc6ddc1520e-Abstract.html.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Shalev-Shwartz et al. (2007) S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In Proceedings of the 24th international conference on Machine learning, pages 807–814, 2007.
  • Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
  • Wang et al. (2021) J.-K. Wang, J. Abernethy, and K. Y. Levy. No-regret dynamics in the fenchel game: A unified framework for algorithmic convex optimization. arXiv preprint arXiv:2111.11309, 2021.
  • Welling and Teh (2011) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688. Citeseer, 2011.
  • Xiao (2010) L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

Appendix A Explaining the Bounded Smoothness Variance Assumption

Here we show that Eq. (2) implies that Eq. (3) holds for some $\sigma_{L}^{2}\in[0,L^{2}]$.

Fixing $x,y\in\mathcal{K}$, Eq. (2) implies that for any $z\in\textbf{Support}\{\mathcal{D}\}$ there exists $L_{x,y;z}\in[0,L]$ such that,

\|\nabla f(x;z)-\nabla f(y;z)\| = L_{x,y;z}\|x-y\|.

Similarly, there exists $L_{x,y}\in[0,L]$ such that,

\|\nabla f(x)-\nabla f(y)\| = L_{x,y}\|x-y\|.

Clearly, in the deterministic case we have $L_{x,y;z}=L_{x,y},\ \forall z\in\textbf{Support}\{\mathcal{D}\}$. Therefore,

\mathbf{E}\|(\nabla f(x;z)-\nabla f(x))-(\nabla f(y;z)-\nabla f(y))\|^{2}
= \mathbf{E}\|\nabla f(x;z)-\nabla f(y;z)\|^{2} - \|\nabla f(x)-\nabla f(y)\|^{2}
= \mathbf{E}(L_{x,y;z}^{2}-L_{x,y}^{2})\cdot\|x-y\|^{2} = \sigma_{L}^{2}\{x,y\}\cdot\|x-y\|^{2},

where the first equality uses $\mathbf{E}(\nabla f(x;z)-\nabla f(y;z))=\nabla f(x)-\nabla f(y)$ (i.e. it is a variance decomposition), and we denote $\sigma_{L}^{2}\{x,y\}:=\mathbf{E}(L_{x,y;z}^{2}-L_{x,y}^{2})$. This notation implies that $\sigma_{L}^{2}\{x,y\}\in[0,L^{2}]$, and clearly $\sigma_{L}^{2}\{x,y\}=0$ in the deterministic case for all $x,y\in\mathcal{K}$. Thus, if we denote,

\sigma_{L}^{2} := \sup_{x,y\in\mathcal{K}}\sigma_{L}^{2}\{x,y\},

then $\sigma_{L}^{2}\in[0,L^{2}]$ satisfies Eq. (3) and is equal to $0$ in the deterministic (noiseless) case.

Appendix B Proof of Thm. 4.1

Proof of Thm. 4.1.

First note that the $x_{t}$'s always belong to $\mathcal{K}$, since they are weighted averages of the $\{w_{t}\in\mathcal{K}\}_{t}$'s. Next, we bound the difference between consecutive queries. By definition,

\alpha_{1:t-1}(x_{t}-x_{t-1}) = \alpha_{t}(w_{t}-x_{t}),

implying,

\|x_{t}-x_{t-1}\|^{2} = \left(\alpha_{t}/\alpha_{1:t-1}\right)^{2}\|w_{t}-x_{t}\|^{2} \leq (16/t^{2})D^{2} = (16/\alpha_{t-1}^{2})D^{2},   (22)

where we have used $\alpha_{t}=t+1$, which implies $\alpha_{t}/\alpha_{1:t-1}\leq 4/t$ for any $t\geq 2$; we also used $\|w_{t}-x_{t}\|\leq D$, which holds since $w_{t},x_{t}\in\mathcal{K}$; and finally we used $\alpha_{t-1}=t$.

Notation: Prior to going on with the proof we shall require some notation. We will denote $\bar{g}_{t}:=\nabla f(x_{t})$, and recall the following notation from Alg. 1: $g_{t}:=\nabla f(x_{t};z_{t})$, $\tilde{g}_{t-1}:=\nabla f(x_{t-1};z_{t})$. We will also denote,

\varepsilon_{t} := d_{t} - \bar{g}_{t}.

Now, recalling Eq. (11),

\alpha_{t}d_{t} = \alpha_{t}g_{t} + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1}).

Combining the above with the definition of $\varepsilon_{t}$ yields the following recursive relation,

\alpha_{t}\varepsilon_{t}
:= \alpha_{t}d_{t} - \alpha_{t}\bar{g}_{t}
= \alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1}+g_{t}-\bar{g}_{t})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(\bar{g}_{t-1}-\tilde{g}_{t-1}+g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\bar{g}_{t-1})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}((g_{t}-\bar{g}_{t})-(\tilde{g}_{t-1}-\bar{g}_{t-1})) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\bar{g}_{t-1})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}Z_{t} + (1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}\alpha_{t-1}\varepsilon_{t-1},

where we denoted $Z_{t}:=(g_{t}-\bar{g}_{t})-(\tilde{g}_{t-1}-\bar{g}_{t-1})$. Now, using $\alpha_{t}=t+1$ and $\beta_{t}=1/(t+1)$, it can be verified that $\alpha_{t}\beta_{t}=1$ and $\alpha_{t}(1-\beta_{t})=\alpha_{t}-1=\alpha_{t-1}$. Moreover, $(1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}=\frac{\alpha_{t}-1}{\alpha_{t-1}}=1$. Using these relations in the equation above gives,

\alpha_{t}\varepsilon_{t} = \alpha_{t-1}Z_{t} + \alpha_{t-1}\varepsilon_{t-1} + (g_{t}-\bar{g}_{t}) = M_{t} + \alpha_{t-1}\varepsilon_{t-1},   (23)

where for any $t>1$ we denote $M_{t}:=\alpha_{t-1}Z_{t}+(g_{t}-\bar{g}_{t})$, as well as $M_{1}:=g_{1}-\bar{g}_{1}$. Unrolling the above equation yields an explicit expression for any $t\in[T]$,

\alpha_{t}\varepsilon_{t} = \sum_{\tau=1}^{t}M_{\tau}.

Now, notice that the sequence $\{M_{t}\}_{t}$ is a martingale difference sequence with respect to the natural filtration $\{\mathcal{F}_{t}\}_{t}$ induced by the history of the samples up to time $t$. Indeed,

\mathbf{E}[M_{t}|\mathcal{F}_{t-1}] = \mathbf{E}[(g_{t}-\bar{g}_{t})|\mathcal{F}_{t-1}] + \alpha_{t-1}\mathbf{E}[Z_{t}|\mathcal{F}_{t-1}] = \mathbf{E}[(g_{t}-\bar{g}_{t})|x_{t}] + \alpha_{t-1}\mathbf{E}[Z_{t}|x_{t-1},x_{t}] = 0.

Thus, using Lemma B.1 below gives,

\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}
= \mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|M_{\tau}\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|\alpha_{\tau-1}Z_{\tau}+(g_{\tau}-\bar{g}_{\tau})\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|Z_{\tau}\|^{2} + 2\sum_{\tau=1}^{t}\mathbf{E}\|g_{\tau}-\bar{g}_{\tau}\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|(g_{\tau}-\bar{g}_{\tau})-(\tilde{g}_{\tau-1}-\bar{g}_{\tau-1})\|^{2} + 2t\sigma^{2}
= 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|(\nabla f(x_{\tau};z_{\tau})-\nabla f(x_{\tau}))-(\nabla f(x_{\tau-1};z_{\tau})-\nabla f(x_{\tau-1}))\|^{2} + 2t\sigma^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\sigma_{L}^{2}\mathbf{E}\|x_{\tau}-x_{\tau-1}\|^{2} + 2t\sigma^{2}
\leq 32D^{2}\sigma_{L}^{2}\sum_{\tau=1}^{t}(\alpha_{\tau-1}^{2}/\alpha_{\tau-1}^{2}) + 2t\sigma^{2}
= (32D^{2}\sigma_{L}^{2}+2\sigma^{2})\cdot t
= \tilde{\sigma}^{2}\cdot t,

Here, the first inequality uses $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$, which holds for any $a,b\in\mathbb{R}^{d}$; the second inequality uses the bounded variance assumption; the third inequality uses Eq. (3); and the last inequality uses Eq. (22).

Dividing the above inequality by $\alpha_{t}^{2}=(t+1)^{2}$, the theorem follows,

\mathbf{E}\|d_{t}-\nabla f(x_{t})\|^{2} = \mathbf{E}\|\varepsilon_{t}\|^{2} = \mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}/\alpha_{t}^{2} \leq \tilde{\sigma}^{2}t/(t+1)^{2} \leq \tilde{\sigma}^{2}/(t+1). ∎
Lemma B.1.

Let $\{M_{t}\}_{t}$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_{t}\}_{t}$. Then the following holds for any $t$,

\mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2} = \sum_{\tau=1}^{t}\mathbf{E}\left\|M_{\tau}\right\|^{2}.

B.0.1 Proof of Lemma B.1

Proof of Lemma B.1.

We shall prove the lemma by induction over $t$. The base case $t=1$ clearly holds.

Now, for the induction step, let us assume that the equality holds for $t\geq 1$ and prove that it holds for $t+1$. Indeed,

\mathbf{E}\left\|\sum_{\tau=1}^{t+1}M_{\tau}\right\|^{2}
= \mathbf{E}\left\|M_{t+1}+\sum_{\tau=1}^{t}M_{\tau}\right\|^{2}
= \mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2} + \mathbf{E}\|M_{t+1}\|^{2} + 2\mathbf{E}\left(\sum_{\tau=1}^{t}M_{\tau}\right)\cdot M_{t+1}
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2} + 2\mathbf{E}\left[\mathbf{E}\left[\left(\sum_{\tau=1}^{t}M_{\tau}\right)\cdot M_{t+1}\Big|\mathcal{F}_{t}\right]\right]
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2} + 2\mathbf{E}\left[\left(\sum_{\tau=1}^{t}M_{\tau}\right)\cdot\mathbf{E}\left[M_{t+1}|\mathcal{F}_{t}\right]\right]
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2} + 0
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2},

where the third line follows from the induction hypothesis, as well as from the law of total expectation; the fourth line follows since $\{M_{\tau}\}_{\tau=1}^{t}$ are measurable w.r.t. $\mathcal{F}_{t}$; and the fifth line follows since $\mbox{\bf E}[M_{t+1}|\mathcal{F}_{t}]=0$. Thus, we have established the induction step, and therefore the lemma holds. ∎
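As a quick sanity check, the following small Python snippet (illustrative only) verifies the identity of Lemma B.1 empirically for the simplest martingale difference sequence, namely i.i.d. zero-mean vectors:

```python
# Sketch: Monte Carlo check of Lemma B.1 for i.i.d. zero-mean Gaussian
# vectors M_1, ..., M_t, which trivially satisfy E[M_tau | F_{tau-1}] = 0.
import numpy as np

rng = np.random.default_rng(1)
n_runs, t, d = 200_000, 10, 3
M = rng.standard_normal((n_runs, t, d))

lhs = ((M.sum(axis=1)) ** 2).sum(axis=1).mean()   # E || sum_tau M_tau ||^2
rhs = (M ** 2).sum(axis=(1, 2)).mean()            # sum_tau E || M_tau ||^2
print(f"E||sum M||^2 ~= {lhs:.3f}, sum E||M||^2 ~= {rhs:.3f}")  # both ~= t*d = 30
```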

Appendix C Proof of Thm. 4.2

Proof of Thm. 4.2.

The proof is a direct combination of Thm. 4.1 with the standard regret bound of OGD (Online Gradient Descent), which in turn enables us to utilize the Anytime guarantees of Thm. 3.1.
Part 1: Regret Bound. Standard regret analysis of the update rule in Eq. (8) implies the following for any $t$ (Hazan et al., 2016),

\displaystyle\sum_{\tau=1}^{t}\alpha_{\tau}d_{\tau}\cdot(w_{\tau}-w^{*})\leq\frac{D^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|d_{\tau}\|^{2}~{}. (24)
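For concreteness, here is a minimal sketch of the weighted OGD step behind this bound; the projected update $w_{t+1}=\Pi_{\mathcal{K}}(w_{t}-\eta\alpha_{t}d_{t})$ is our reading of Eq. (8), and the Euclidean-ball domain is an illustrative assumption:

```python
# Sketch: one step of weighted Online Gradient Descent, assumed to match
# Eq. (8): w_{t+1} = Proj_K(w_t - eta * alpha_t * d_t). For illustration,
# K is taken to be a Euclidean ball of radius R (so its diameter is D = 2R).
import numpy as np

def project_ball(v, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def ogd_step(w, d_t, alpha_t, eta, radius):
    return project_ball(w - eta * alpha_t * d_t, radius)

# Example usage with arbitrary values:
w_next = ogd_step(np.zeros(3), np.array([1.0, -2.0, 0.5]),
                  alpha_t=2.0, eta=0.01, radius=1.0)
```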

Part 2: Anytime Guarantees.

Since the $x_{t}$'s are weighted averages of the $w_{t}$'s, we may invoke Thm. 3.1, which ensures for any $t\in[T]$,

\alpha_{1:t}\Delta_{t}=\alpha_{1:t}(f(x_{t})-f(w^{*}))\leq\sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})~{},

where we denote $\Delta_{t}:=f(x_{t})-f(w^{*})$.
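The weighted averages can be maintained incrementally via $x_{t}=x_{t-1}+\frac{\alpha_{t}}{\alpha_{1:t}}(w_{t}-x_{t-1})$; the following sketch (with arbitrary illustrative values) checks this recursion against the direct definition:

```python
# Sketch: the Anytime average x_t = (sum_{tau<=t} alpha_tau w_tau) / alpha_{1:t}
# maintained incrementally, checked against the direct weighted average.
import numpy as np

def anytime_average(ws, alphas):
    x, cum = ws[0].copy(), alphas[0]
    for w, a in zip(ws[1:], alphas[1:]):
        cum += a
        x += (a / cum) * (w - x)    # x_t = x_{t-1} + (alpha_t/alpha_{1:t})(w_t - x_{t-1})
    return x

ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
alphas = [2.0, 3.0, 4.0]            # alpha_t = t + 1
direct = sum(a * w for a, w in zip(alphas, ws)) / sum(alphas)
print(np.allclose(anytime_average(ws, alphas), direct))  # True
```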

Part 3: Combining Guarantees.

Combining the above Anytime guarantees together with the bound in Eq. (24) yields,

\displaystyle\alpha_{1:t}\Delta_{t} \leq\sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})
\displaystyle=\sum_{\tau=1}^{t}\alpha_{\tau}d_{\tau}\cdot(w_{\tau}-w^{*})+\sum_{\tau=1}^{t}\alpha_{\tau}(\nabla f(x_{\tau})-d_{\tau})\cdot(w_{\tau}-w^{*})
\displaystyle\leq\frac{D^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|d_{\tau}\|^{2}-\sum_{\tau=1}^{t}\alpha_{\tau}\varepsilon_{\tau}\cdot(w_{\tau}-w^{*})
\displaystyle\leq\frac{D^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\nabla f(x_{\tau})+\varepsilon_{\tau}\|^{2}+\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|\cdot\|w_{\tau}-w^{*}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\nabla f(x_{\tau})\|^{2}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+2\eta L\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|~{}, (25)

where the first inequality restates the Anytime guarantee above; the second inequality uses the regret bound of Eq. (24); the third inequality follows from Cauchy-Schwarz; the fourth inequality holds since $\|w_{t}-w^{*}\|\leq D$, as well as from using $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$, which holds for any $a,b\in{\mathbb{R}}^{d}$; the fifth inequality follows by the self-bounding property for smooth functions (see Lemma C.1 below), implying that $\|\nabla f(x_{\tau})\|^{2}\leq 2L(f(x_{\tau})-f(w^{*}))=2L\Delta_{\tau}$; and the last inequality follows due to $\alpha_{\tau}^{2}\leq 2\alpha_{1:\tau}$, which holds since $\alpha_{\tau}=\tau+1$.

Lemma C.1.

(See e.g. (Levy et al., 2018; Cutkosky, 2019)) Let $F:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}$ be an $L$-smooth function with a global minimum $x^{*}$; then for any $x\in{\mathbb{R}}^{d}$ we have,

\|\nabla F(x)\|^{2}\leq 2L(F(x)-F(x^{*}))~{}.
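This property is easy to verify numerically; as a sketch, the snippet below checks it for a random convex quadratic $F(x)=\frac{1}{2}x^{\top}Ax$ (an illustrative choice, for which $x^{*}=0$ and $L=\lambda_{\max}(A)$):

```python
# Sketch: numerical check of ||grad F(x)||^2 <= 2L (F(x) - F(x*)) for the
# convex quadratic F(x) = 0.5 * x^T A x, where x* = 0 and L = lambda_max(A).
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((4, 4))
A = G @ G.T                                  # random symmetric PSD matrix
L = np.linalg.eigvalsh(A).max()              # smoothness constant of F

for _ in range(1000):
    x = rng.standard_normal(4)
    lhs = np.sum((A @ x) ** 2)               # ||grad F(x)||^2 = ||A x||^2
    rhs = 2 * L * (0.5 * x @ A @ x)          # 2L (F(x) - F(x*)), with F(x*) = 0
    assert lhs <= rhs + 1e-9
print("self-bounding inequality verified on 1000 random points")
```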

Next, taking expectation over Eq. (25) yields,

\displaystyle\alpha_{1:t}\mbox{\bf E}\Delta_{t} \leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\mbox{\bf E}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\mbox{\bf E}\|\alpha_{\tau}\varepsilon_{\tau}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\mbox{\bf E}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\sqrt{\alpha_{\tau}^{2}\mbox{\bf E}\|\varepsilon_{\tau}\|^{2}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\cdot\tilde{\sigma}^{2}/\alpha_{\tau}+D\sum_{\tau=1}^{t}\sqrt{\alpha_{\tau}^{2}\cdot\tilde{\sigma}^{2}/\alpha_{\tau}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\sum_{\tau=1}^{t}\alpha_{\tau}+D\tilde{\sigma}\sum_{\tau=1}^{t}\sqrt{\alpha_{\tau}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{T}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\sum_{\tau=1}^{T}\alpha_{\tau}+D\tilde{\sigma}\sum_{\tau=1}^{T}\sqrt{\alpha_{\tau}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{T}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\alpha_{1:T}+2D\tilde{\sigma}T^{3/2}
\displaystyle\leq\frac{D^{2}}{2\eta}+\frac{1}{2T}\sum_{\tau=1}^{T}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\alpha_{1:T}+2D\tilde{\sigma}T^{3/2}~{}, (26)

where the second line is due to Jensen's inequality, implying that $\mbox{\bf E}X\leq\sqrt{\mbox{\bf E}X^{2}}$ for any random variable $X$; the third line follows from $\mbox{\bf E}\|\varepsilon_{t}\|^{2}\leq\tilde{\sigma}^{2}/\alpha_{t}$, which holds by Thm. 4.1; the fifth line holds since $t\leq T$; the sixth line follows since $\sum_{t=1}^{T}\sqrt{\alpha_{t}}\leq 2T^{3/2}$; and the last line follows since we pick $\eta\leq 1/(8LT)$, whence $4\eta L\leq\frac{1}{2T}$.

To obtain the final bound, we will apply the following lemma to Eq. (26).

Lemma C.2.

Let $\{A_{t}\}_{t\in[T]}$ be a sequence of non-negative elements and $\mathcal{B}\in{\mathbb{R}}$, and assume that for any $t\leq T$,

A_{t}\leq\mathcal{B}+\frac{1}{2T}\sum_{\tau=1}^{T}A_{\tau}~{}.

Then the following bound holds,

A_{T}\leq 2\mathcal{B}~{}.

Taking $A_{t}\leftarrow\alpha_{1:t}\mbox{\bf E}\Delta_{t}$ and $\mathcal{B}\leftarrow\frac{D^{2}}{2\eta}+\eta\tilde{\sigma}^{2}\alpha_{1:T}+2D\tilde{\sigma}T^{3/2}$ provides the following explicit bound,

\displaystyle\alpha_{1:T}\mbox{\bf E}\Delta_{T}\leq\frac{D^{2}}{\eta}+2\eta\tilde{\sigma}^{2}\alpha_{1:T}+4D\tilde{\sigma}T^{3/2}~{}.

Dividing by $\alpha_{1:T}$ and recalling $\alpha_{1:T}=\Theta(T^{2})$ gives,

\displaystyle\mbox{\bf E}(f(x_{T})-f(w^{*}))=\mbox{\bf E}\Delta_{T}\leq O\left(\frac{D^{2}}{\eta T^{2}}+2\eta\tilde{\sigma}^{2}+\frac{4D\tilde{\sigma}}{\sqrt{T}}\right)~{},

which concludes the proof. ∎

C.1 Proof of Lemma C.2

Proof of Lemma C.2.

Summing the inequality $A_{t}\leq\mathcal{B}+\frac{1}{2T}\sum_{\tau=1}^{T}A_{\tau}$ over $t=1,\ldots,T$ gives,

\displaystyle A_{1:T}\leq T\mathcal{B}+T\cdot\frac{1}{2T}A_{1:T}=T\mathcal{B}+\frac{1}{2}A_{1:T}~{}.

Re-arranging, we obtain,

\displaystyle A_{1:T}\leq 2T\mathcal{B}~{}.

Plugging this back into the original inequality and taking $t=T$ gives,

A_{T}\leq\mathcal{B}+\frac{1}{2T}A_{1:T}\leq 2\mathcal{B}~{},

which concludes the proof. ∎
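As a final sanity check, the sketch below samples random non-negative sequences, keeps those satisfying the assumption of Lemma C.2, and confirms the conclusion $A_{T}\leq 2\mathcal{B}$; all values are arbitrary illustrative choices:

```python
# Sketch: numerical check of Lemma C.2 with arbitrary illustrative values.
# Sample non-negative sequences, keep those satisfying the assumption
# A_t <= B + (1/2T) * sum_tau A_tau, and confirm the conclusion A_T <= 2B.
import numpy as np

rng = np.random.default_rng(3)
T, B = 5, 1.0
A = rng.uniform(0.0, 3.0 * B, size=(100_000, T))      # candidate sequences
thresh = B + A.sum(axis=1, keepdims=True) / (2 * T)   # assumption threshold
valid = np.all(A <= thresh, axis=1)                   # sequences satisfying it
assert np.all(A[valid, -1] <= 2 * B)                  # conclusion of the lemma
print(f"conclusion verified on {valid.sum()} valid sampled sequences")
```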