
$\mu^2$-SGD: Stable Stochastic Optimization via a Double Momentum Mechanism

Kfir Y. Levy
Department of Electrical and Computer Engineering, Technion, Israel. Email: [email protected].
Abstract

We consider stochastic convex optimization problems where the objective is an expectation over smooth functions. For this setting we suggest a novel gradient estimate that combines two recent mechanisms related to the notion of momentum. We then design an SGD-style algorithm, as well as an accelerated version, that make use of this new estimator, and demonstrate the robustness of these new approaches to the choice of the learning rate. Concretely, we show that these approaches obtain the optimal convergence rates for both the noiseless and the noisy case with the same choice of fixed learning rate. Moreover, for the noisy case we show that these approaches achieve the same optimal bound for a very wide range of learning rates.

1 Introduction

Stochastic Convex Optimization (SCO) is a fundamental framework that captures several classical Machine Learning (ML) problems, such as linear regression, logistic regression and SVMs (Support Vector Machines), amongst others. In the past two decades, SCO has been extensively explored and has highly influenced the field of ML: it has popularized the use of Stochastic Gradient Descent (SGD) as the standard workhorse for training ML models, see e.g. (Shalev-Shwartz et al., 2007; Welling and Teh, 2011; Mairal et al., 2009; Recht et al., 2011), and has led to the design of sophisticated SGD variants that play a central role in training modern large scale models (Duchi et al., 2011; Kingma and Ba, 2015).

One practical difficulty in applying SGD-type methods is the need to tune the learning rate among other hyperparameters, and it is well known that the performance of such algorithms crucially relies on the right choice of the learning rate. As a remedy, several adaptive SGD variants have been designed throughout the years (Duchi et al., 2011; Kingma and Ba, 2015; Levy et al., 2018; Kavis et al., 2019; Jacobsen and Cutkosky, 2022); however, in practice such methods still require hyperparameter tuning, which might be very costly in huge-scale scenarios.

In this paper we focus on the prevalent SCO setting where the objective (expected loss) is an Expectation Over Smooth losses (SCO-EOS); this applies e.g. to linear and logistic regression problems (but not to SVMs). In this case, it is well known that SGD requires a careful tuning of the learning rate to obtain the optimal performance. For example, in the noiseless case, SGD (or GD in this case) should employ a learning rate of $\eta^{\rm Offline}=1/L$, where $L$ is the smoothness parameter of the objective. Nevertheless, if we apply this $\eta^{\rm Offline}$ in the noisy setting, the guarantees of SGD become vacuous. To obtain the optimal SGD guarantees, we should roughly decrease the learning rate by a factor of $\sigma\sqrt{T}$, where $T$ is the total number of SGD iterates (and samples), and $\sigma^{2}$ bounds the variance of the noise in the gradient estimates. This illustrates the sensitivity of SGD to the choice of $\eta$. The same applies to stochastic accelerated methods such as (Lan, 2012; Hu et al., 2009; Xiao, 2010).

Contributions.

We introduce a novel gradient estimate for SCO-EOS problems, and show that its square error, $\|\varepsilon_{t}\|^{2}$, shrinks with the number of updates, $\|\varepsilon_{t}\|^{2}\propto 1/t$, where $t$ is the iterate. This is in contrast to the standard SGD estimator, where usually $\|\varepsilon_{t}\|^{2}=\text{Variance}_{t}=O(1)$. Our new estimate blends two recent mechanisms that are related to the notion of momentum: Anytime averaging, which is due to (Cutkosky, 2019); and a corrected momentum technique (Cutkosky and Orabona, 2019). We therefore denote our estimate by $\mu^2$, which stands for $\textbf{Momentum}^{2}$.

We further design an SGD variant called $\mu^2$-SGD, as well as an accelerated version called $\mu^2$-ExtraSGD, that employ our new estimate, and demonstrate their stability with respect to the choice of the learning rate $\eta$. Concretely,

  • Upon using the exact same learning rate of $\eta^{\rm Offline}=1/8LT$ (where $T$ is the total number of iterates/data-samples), $\mu^2$-SGD enjoys a convergence rate of $O(L/T)$ in the noiseless case, and a rate of $O(L/T+\tilde{\sigma}/\sqrt{T})$ in the noisy case.

    Moreover, in the noisy case, $\mu^2$-SGD enjoys the same convergence rate as optimal SGD, $O(L/T+\tilde{\sigma}/\sqrt{T})$, for a wide range of learning rate choices, i.e. $\eta\in[\eta^{\rm Noisy},\eta^{\rm Offline}]$, such that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)\sqrt{T}$.

  • Upon using the exact same learning rate of $\eta^{\rm Offline}=1/2L$, $\mu^2$-ExtraSGD enjoys an optimal convergence rate of $O(L/T^{2})$ in the noiseless case, and an optimal rate of $O(L/T^{2}+\tilde{\sigma}/\sqrt{T})$ in the noisy case.

    Moreover, in the noisy case, $\mu^2$-ExtraSGD enjoys the same optimal convergence of $O(L/T^{2}+\tilde{\sigma}/\sqrt{T})$ for an extremely wide range of learning rate choices, i.e. $\eta\in[\eta^{\rm Noisy},\eta^{\rm Offline}]$, such that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)T^{3/2}$.

Related Work:

The Gradient Descent (GD) algorithm and its stochastic counterpart SGD (Robbins and Monro, 1951) are cornerstones of ML and Optimization. Their wide adoption in ML, as well as in other fields, has led to the development of numerous elegant and useful variants (Duchi et al., 2011; Kingma and Ba, 2015; Ge et al., 2015). Curiously, many of the variants that serve in practical training of non-convex learning models were originally designed under the framework of SCO.

As we mention in the Introduction, the performance of SGD crucially relies on the choice of the learning rate. There is a plethora of work on designing methods that implicitly and optimally adapt the learning rate throughout the learning process (Duchi et al., 2011; Kingma and Ba, 2015; Kavis et al., 2019; Antonakopoulos et al., 2022; Jacobsen and Cutkosky, 2022); and some of these methods have been widely adopted among practitioners.

Momentum (Polyak, 1964) is another widely used practical technique (Sutskever et al., 2013), and it is interestingly related to the accelerated method of Nesterov (Nesterov, 1983) – a seminal approach that obtains faster convergence rates compared to GD for smooth and convex objectives. While Nesterov's accelerated method is fragile to noise, Lan (2012); Hu et al. (2009); Xiao (2010) have designed stochastic accelerated variants whose convergence rate interpolates between the fast rate in the noiseless case and the standard SGD rate in the noisy case (depending on the noise magnitude).

Our work builds on two recent mechanisms that are related to the notion of momentum: (i) an Anytime averaging mechanism (Cutkosky, 2019), which relies on averaging the query points of the gradient oracle; (ii) a corrected momentum technique (Cutkosky and Orabona, 2019), which relies on averaging the gradients themselves throughout the learning process (while introducing a correction). It is interesting to note that the Anytime mechanism has proven to be extremely useful in designing adaptive and accelerated methods (Cutkosky, 2019; Kavis et al., 2019; Antonakopoulos et al., 2022). The corrected momentum mechanism has mainly found use in designing optimal and adaptive algorithms for stochastic non-convex problems (Cutkosky and Orabona, 2019; Levy et al., 2021).

2 Setting

We consider stochastic optimization problems where the objective $f:\mathcal{K}\mapsto\mathbb{R}$ is convex and is of the following form,

f(x) := \mathbf{E}_{z\sim\mathcal{D}} f(x;z),

where $\mathcal{K}\subseteq\mathbb{R}^{d}$ is a compact convex set, and $\mathcal{D}$ is an unknown distribution from which we may draw i.i.d. samples $\{z_{t}\sim\mathcal{D}\}_{t}$. We consider first order optimization methods that iteratively employ such samples in order to generate a sequence of query points and eventually output a solution $x_{\rm output}\in\mathcal{K}$. Our performance measure is the expected excess loss,

\text{ExcessLoss} := \mathbf{E}[f(x_{\rm output})] - \min_{x\in\mathcal{K}} f(x),

where the expectation is w.r.t. the randomization of the samples.

More concretely, at every iteration $t$ such methods maintain a query point $x_{t}\in\mathbb{R}^{d}$, which is computed based on the past query points and past samples $\{z_{1},\ldots,z_{t-1}\}$. Then, the next query point $x_{t+1}$ is computed based on $x_{t}$ and on a gradient estimate $g_{t}$ that is derived by drawing a fresh sample $z_{t}\sim\mathcal{D}$ independently of the past samples, and computing,

g_{t} := \nabla f(x_{t};z_{t}),

note that the above derivative is w.r.t. $x$. The independence between samples implies that $g_{t}$ is an unbiased estimate of $\nabla f(x_{t})$ in the following sense: $\mathbf{E}[g_{t}|x_{t}]=\nabla f(x_{t})$. It is often convenient to think of the computation of $g_{t}=\nabla f(x_{t};z_{t})$ as a (noisy) gradient oracle that, upon receiving a query point $x_{t}\in\mathcal{K}$, outputs a vector $g_{t}\in\mathbb{R}^{d}$ which is an unbiased estimate of $\nabla f(x_{t})$.
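To make the oracle model concrete, here is a minimal Python sketch of such a noisy gradient oracle for a toy least-squares instance, $f(x;z)=\frac{1}{2}(a\cdot x-b)^{2}$ with $z=(a,b)$; the problem instance, dimension and noise level are illustrative assumptions, not taken from the paper. Later sketches in this manuscript reuse these helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_star = rng.normal(size=d)  # ground-truth minimizer of the expected loss

def sample_z():
    """Draw a fresh i.i.d. sample z = (a, b) from the (synthetic) distribution D."""
    a = rng.normal(size=d)
    b = a @ x_star + 0.1 * rng.normal()  # linear label plus additive noise
    return a, b

def grad_oracle(x, z):
    """Noisy oracle: returns g = grad_x f(x; z), an unbiased estimate of grad f(x)."""
    a, b = z
    return (a @ x - b) * a
```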

Assumptions.

We will make the following assumptions,
Bounded Diameter: we assume there exists $D>0$ such that $\max_{x,y\in\mathcal{K}}\|x-y\|\leq D$.
Bounded variance: there exists $\sigma>0$ such that,

\mathbf{E}\|\nabla f(x;z)-\nabla f(x)\|^{2} \leq \sigma^{2}, \quad \forall x\in\mathcal{K},\ z\in\textbf{Support}\{\mathcal{D}\}   (1)

Expectation over smooth functions: we assume that $f(\cdot)$ is an expectation of smooth functions, i.e. there exists $L>0$ such that,

\|\nabla f(x;z)-\nabla f(y;z)\| \leq L\|x-y\|, \quad \forall x,y\in\mathcal{K},\ z\in\textbf{Support}\{\mathcal{D}\}   (2)

The above assumption also implies that the expected loss $f(\cdot)$ is $L$-smooth.
Bounded Smoothness Variance: note that the assumption that we make in Eq. (2) implies that there exists $\sigma_{L}^{2}\in[0,L^{2}]$ such that,

\mathbf{E}\left\|(\nabla f(x;z)-\nabla f(x))-(\nabla f(y;z)-\nabla f(y))\right\|^{2} \leq \sigma_{L}^{2}\|x-y\|^{2}, \quad \forall x,y\in\mathcal{K}.   (3)

Clearly, in the deterministic setting where $f(x;z)=f(x),\ \forall z\in\textbf{Support}\{\mathcal{D}\}$, we have $\sigma_{L}=0$. In App. A, we show how Eq. (2) implies Eq. (3).
Notation: In the rest of this manuscript, $\nabla f(x;z)$ relates to gradients with respect to $x$, i.e. $\nabla:=\nabla_{x}$. We use $\|\cdot\|$ to denote the Euclidean norm. Given a sequence $\{y_{t}\}_{t}$ we denote $y_{t_{1}:t_{2}}:=\sum_{\tau=t_{1}}^{t_{2}}y_{\tau}$. For a positive integer $N$ we denote $[N]:=\{1,\ldots,N\}$. We also let $\Pi_{\mathcal{K}}:\mathbb{R}^{d}\mapsto\mathcal{K}$ denote the orthogonal projection onto $\mathcal{K}$, i.e. $\Pi_{\mathcal{K}}(x)=\operatorname*{arg\,min}_{y\in\mathcal{K}}\|y-x\|^{2},\ \forall x\in\mathbb{R}^{d}$.
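For concreteness, the following sketch instantiates the projection $\Pi_{\mathcal{K}}$ for the simple case where $\mathcal{K}$ is a Euclidean ball of radius $R$ (so that the diameter is $D=2R$); the choice of $\mathcal{K}$ and the value of $R$ are assumptions made purely for illustration.

```python
import numpy as np

R = 10.0  # radius of K (hypothetical); the diameter is then D = 2R

def project(x, radius=R):
    """Orthogonal projection onto the Euclidean ball {y : ||y|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x
```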

3 Momentum Mechanisms

In this section we provide background regarding two mechanisms that can be related to the notion of momentum. Curiously, both of these approaches involve averaging different elements of the learning algorithm. The approach that we present in the following section builds on a combination of these two mechanisms.

3.1 Mechanism I: Anytime-GD

The mechanism that we discuss here is related to averaging the query points for the noisy gradient oracle. Basically, while in standard SGD we query the gradients at the iterates that we compute, in Anytime-SGD (Cutkosky, 2019), we query the gradients at weighted averages of the iterates that we compute.

More concretely, the Anytime-SGD algorithm (Cutkosky, 2019; Kavis et al., 2019) that we describe in Equations (4) and (5) employs a learning rate $\eta>0$ and a sequence of non-negative weights $\{\alpha_{t}\}_{t}$. The algorithm maintains two sequences $\{w_{t}\}_{t},\{x_{t}\}_{t}$. At initialization $x_{1}=w_{1}$, and then we update as follows $\forall t$,

w_{t+1} = w_{t} - \eta\alpha_{t}g_{t},\ \forall t\in[T], \quad \text{where } g_{t}=\nabla f(x_{t};z_{t}),   (4)

and $z_{t}$ is a fresh sample from $\mathcal{D}$. Then Anytime-SGD updates,

x_{t+1} = \frac{\alpha_{1:t}}{\alpha_{1:t+1}}x_{t} + \frac{\alpha_{t+1}}{\alpha_{1:t+1}}w_{t+1}.   (5)

The above implies that the $x_{t}$'s are weighted averages of the $w_{t}$'s, i.e. that $x_{t+1}=\frac{1}{\alpha_{1:t+1}}\sum_{\tau=1}^{t+1}\alpha_{\tau}w_{\tau}$. Thus, at every iterate, the gradient $g_{t}$ is queried at $x_{t}$, which is a weighted average of past iterates, and then $w_{t+1}$ is updated similarly to GD with a weight $\alpha_{t}$ on the gradient $g_{t}$.
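The following is a minimal sketch of a projected variant of Anytime-SGD, Eqs. (4)-(5), reusing the toy oracle and ball projection sketched above; the weights $\alpha_{t}=t$, the projection step, and the inputs are illustrative assumptions.

```python
def anytime_sgd(x1, T, eta):
    """Sketch of Anytime-SGD: gradients are queried at weighted averages x_t."""
    w = x1.copy()
    x = x1.copy()
    alpha_sum = 0.0                       # running alpha_{1:t}
    for t in range(1, T + 1):
        alpha = float(t)                  # weights alpha_t = t (illustrative)
        g = grad_oracle(x, sample_z())    # query the gradient at the average x_t
        w = project(w - eta * alpha * g)  # Eq. (4), projected to keep w_t in K
        alpha_sum += alpha
        # Eq. (5): x_{t+1} = (alpha_{1:t} x_t + alpha_{t+1} w_{t+1}) / alpha_{1:t+1}
        x = (alpha_sum * x + (alpha + 1.0) * w) / (alpha_sum + alpha + 1.0)
    return x
```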

Curiously, it was shown in Wang et al. (2021) that an algorithm very similar to Anytime-GD can be related to the classical Heavy-Ball method (Polyak, 1964), and the latter incorporates momentum in its iterates (see Alg. 13 in Section 4.5.1 of Wang et al. (2021)).

It was shown in Cutkosky (2019) that Anytime-SGD obtains the same convergence rates as SGD for convex loss functions (both smooth and non-smooth), and this technique was found to be extremely useful in designing universal accelerated methods (Cutkosky, 2019; Kavis et al., 2019).

The next theorem is crucial in analyzing Anytime-SGD, and actually applies more broadly,

Theorem 3.1 (Rephrased from Theorem 1 in Cutkosky (2019)).

Let $f:\mathcal{K}\mapsto\mathbb{R}$ be a convex function with a minimum $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$. Also let $\{\alpha_{t}\geq 0\}_{t}$, and $\{w_{t}\in\mathcal{K}\}_{t},\{x_{t}\in\mathcal{K}\}_{t}$, such that $\{x_{t}\}_{t}$ is an $\{\alpha_{t}\}_{t}$ weighted average of $\{w_{t}\}_{t}$, i.e. such that $x_{1}=w_{1}$, and for any $t\geq 1$,

x_{t+1} = \frac{1}{\alpha_{1:t+1}}\sum_{\tau=1}^{t+1}\alpha_{\tau}w_{\tau}.

Then the following holds for any $t\geq 1$,

\alpha_{1:t}\left(f(x_{t})-f(w^{*})\right) \leq \sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*}).

Note that the above theorem holds for any sequence $\{w_{t}\in\mathcal{K}\}_{t}$, and as a private case it holds for the Anytime-GD algorithm that we have described. Thus, the above theorem relates the excess loss of a given algorithm that computes the sequences $\{w_{t}\in\mathcal{K}\}_{t},\{x_{t}\in\mathcal{K}\}_{t}$ to its weighted regret, $\mathcal{R}_{t}:=\sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})$.

3.2 Mechanism II: Recursive Corrected Averaging

The mechanism that we discuss here is related to averaging the gradient estimates that we compute throughout the learning process, which is a common and often crucial technique in practical applications. While such straightforward averaging might introduce bias into the gradient estimates, it was suggested in (Cutkosky and Orabona, 2019) to add a bias correction mechanism named $\rm{STORM}$ (STochastic Recursive Momentum), and it was shown that this mechanism leads to a powerful variance reduction effect.

Concretely, $\rm{STORM}$ maintains an estimate $d_{t}$ which is a corrected weighted average of past stochastic gradients, and then it performs an SGD-style update step,

w_{t+1} = w_{t} - \eta d_{t}.   (6)

The corrected momentum estimates are updated as follows,

d_{t} = \nabla f(w_{t};z_{t}) + (1-\beta_{t})(d_{t-1}-\nabla f(w_{t-1};z_{t})), \quad \text{where } \beta_{t}\in[0,1].   (7)

Note that the correction term implies that $\mathbf{E}[d_{t}]=\mathbf{E}\nabla f(w_{t})$. Nevertheless, in general $\mathbf{E}[d_{t}|w_{t}]\neq\nabla f(w_{t})$, so $d_{t}$ is conditionally biased (in contrast to standard SGD gradient estimates). Moreover, the choice of the same sample $z_{t}$ in the two terms of the above expression is crucial for the variance reduction effect.
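A minimal sketch of the $\rm{STORM}$ recursion, Eqs. (6)-(7), is given below; the choice $\beta_{t}=1/t$ and the reuse of the toy oracle sketched in Section 2 are illustrative assumptions.

```python
def storm(w1, T, eta):
    """Sketch of STORM: an SGD step on a corrected momentum estimate d_t."""
    w = w1.copy()
    w_prev, d = None, None
    for t in range(1, T + 1):
        z = sample_z()                    # one fresh sample z_t per round
        if d is None:
            d = grad_oracle(w, z)         # d_1 is a plain stochastic gradient
        else:
            beta = 1.0 / t                # illustrative momentum weight
            # Eq. (7): the *same* z_t is used at both w_t and w_{t-1};
            # this coupling is what yields the variance reduction.
            d = grad_oracle(w, z) + (1.0 - beta) * (d - grad_oracle(w_prev, z))
        w_prev, w = w, w - eta * d        # Eq. (6): SGD-style step on d_t
    return w
```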

4 Our Approach: Double Momentum Mechanism

Our approach is to combine the two momentum mechanisms described above. Our algorithm is therefore named $\mu^2$-SGD ($\text{Momentum}^{2}$-Stochastic Gradient Descent), and we describe it in Alg. 1. Intuitively, each of these momentum (averaging) techniques stabilizes the algorithm, and their combination leads to a method that is almost as stable as offline GD.

We first describe Alg. 1, and then present our main result in Theorem 4.1, showing that the error of the gradient estimates of our approach shrinks as we progress. We demonstrate the benefit of this result in Thm. 4.2, which shows that $\mu^2$-SGD obtains the same convergence rate as standard SGD for a very wide range of learning rates (in contrast to SGD).

Next we elaborate on the ingredients of Alg. 1:
Update rule: Note that for generality we allow a broad family of update rules in Eq. (9) of Alg. 1. The only constraint on the update rule is that its iterates $\{w_{t}\}_{t}$ always belong to $\mathcal{K}$. Later, we will specifically analyze the natural SGD-style update rule,

w_{t+1} = \Pi_{\mathcal{K}}(w_{t} - \eta\alpha_{t}d_{t}).   (8)

Momentum Computation: From Eq. (11) in Alg. 1 we can see that the momentum $d_{t}$ is updated similarly to the $\rm{STORM}$ update in Eq. (7), albeit with two main differences: (i) first, we incorporate importance weights $\{\alpha_{t}\}_{t}$ into $\rm{STORM}$ and recursively update the weighted momentum $\alpha_{t}d_{t}$; more importantly, (ii) we query the noisy gradients at the averages $x_{t}$ rather than at the iterates themselves, and the averages (Eq. (10)) are computed in the spirit of Anytime-SGD. These can be seen in the computation of $g_{t+1}$ and $\tilde{g}_{t}$, which query the gradients at the averages rather than the iterates.

Algorithm 1 $\mu^2$-SGD
  Input: #Iterations $T$, initial point $x_{1}$, learning rate $\eta>0$, importance weights $\{\alpha_{t}\}_{t}$, corrected momentum weights $\{\beta_{t}\}_{t}$

  Initialize: set $w_{1}=x_{1}$, draw $z_{1}\sim\mathcal{D}$ and set $d_{1}=\nabla f(x_{1};z_{1})$
  for $t=1,\ldots,T$ do
     Iterate Update:
Update $w_{t+1}\in\mathcal{K}$ based on past information/computations up to time $t$   (9)
     Query Update (Anytime-SGD style):
x_{t+1} = \frac{\alpha_{1:t}}{\alpha_{1:t+1}}x_{t} + \frac{\alpha_{t+1}}{\alpha_{1:t+1}}w_{t+1},   (10)
     Update Corrected Momentum ($\rm{STORM}$ style): draw $z_{t+1}\sim\mathcal{D}$, compute $g_{t+1}:=\nabla f(x_{t+1};z_{t+1})$ and $\tilde{g}_{t}:=\nabla f(x_{t};z_{t+1})$, and update,
\alpha_{t+1}d_{t+1} = \alpha_{t+1}g_{t+1} + (1-\beta_{t+1})\alpha_{t+1}(d_{t}-\tilde{g}_{t})   (11)
  end for
  output: $x_{T}$

Thus, as promised, our algorithm combines two different momentum mechanisms. Next we present our main result, which shows that Alg. 1 yields estimates with a very small error. Note that the only limitation on the update rule in Eq. (9) is that its iterates $\{w_{t}\}_{t}$ always belong to $\mathcal{K}$.
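The following is a minimal end-to-end sketch of $\mu^2$-SGD with the projected update of Eq. (8) and the parameters of Thm. 4.2 ($\alpha_{t}=t+1$, $\beta_{t}=1/\alpha_{t}$, $\eta=1/8LT$); it reuses the toy oracle and ball projection sketched earlier, and the smoothness constant passed in is a hypothetical bound.

```python
def mu2_sgd(x1, T, L):
    """Sketch of Alg. 1 with the SGD-style update rule of Eq. (8)."""
    eta = 1.0 / (8.0 * L * T)                # eta_offline from Thm. 4.2
    w = x1.copy()
    x = x1.copy()
    d = grad_oracle(x, sample_z())           # d_1 = grad f(x_1; z_1)
    alpha_sum = 2.0                          # alpha_{1:1}, with alpha_t = t + 1
    for t in range(1, T + 1):
        w = project(w - eta * (t + 1) * d)   # Eq. (8): iterate update w_{t+1}
        # Eq. (10): x_{t+1} = (alpha_{1:t} x_t + alpha_{t+1} w_{t+1}) / alpha_{1:t+1}
        x_prev, x = x, (alpha_sum * x + (t + 2.0) * w) / (alpha_sum + t + 2.0)
        alpha_sum += t + 2.0
        # Eq. (11): STORM-style correction with beta_{t+1} = 1/(t+2); note that
        # both gradients are queried at the *averages*, with the same fresh sample.
        z = sample_z()
        beta = 1.0 / (t + 2.0)
        d = grad_oracle(x, z) + (1.0 - beta) * (d - grad_oracle(x_prev, z))
    return x
```

For instance, `mu2_sgd(np.zeros(d), T=1000, L=1.0)` runs the sketch on the toy least-squares instance above.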

Theorem 4.1.

Let $f:\mathcal{K}\mapsto\mathbb{R}$, where $\mathcal{K}$ is a convex set with bounded diameter $D$, and presume that the assumptions in Equations (1), (2), (3) hold. Then invoking Alg. 1 with $\{\alpha_{t}=t+1\}_{t}$ and $\{\beta_{t}=1/\alpha_{t}\}_{t}$ ensures the following for any $t\in[T]$,

\mathbf{E}\|\varepsilon_{t}\|^{2} := \mathbf{E}\|d_{t}-\nabla f(x_{t})\|^{2} \leq \tilde{\sigma}^{2}/t,

where $\varepsilon_{t}:=d_{t}-\nabla f(x_{t})$, and $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

So according to the above theorem, the overall error of $d_{t}$ compared to the exact gradient $\nabla f(x_{t})$ shrinks as we progress. In contrast, in standard SGD (as well as in Anytime-SGD) the expected square error is fixed, namely $\mathbf{E}\|g_{t}-\nabla f(w_{t})\|^{2}\leq O(\sigma^{2})$.

Based on the above theorem, we are now ready to analyze Alg. 1 with the specific SGD-type update rule that we present in Eq. (8).

Theorem 4.2 ($\mu^2$-SGD Guarantees).

Let $f:\mathbb{R}^{d}\mapsto\mathbb{R}$ be a convex function, and assume that $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$ is also its global minimum in $\mathbb{R}^{d}$. Also, let us make the same assumptions as in Thm. 4.1. Then invoking Alg. 1 with $\{\alpha_{t}=t+1\}_{t}$ and $\{\beta_{t}=1/\alpha_{t}\}_{t}$, and using the SGD-type update rule (8) with a learning rate $\eta\leq 1/8LT$ inside Eq. (9) of Alg. 1, guarantees,

\mathbf{E}(f(x_{T})-f(w^{*})) = \mathbf{E}\Delta_{T} \leq O\left(\frac{D^{2}}{\eta T^{2}} + 2\eta\tilde{\sigma}^{2} + \frac{4D\tilde{\sigma}}{\sqrt{T}}\right),

where $\Delta_{t}:=f(x_{t})-f(w^{*})$, and $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

The Stability of $\mu^2$-SGD.

The above theorem shows that $\mu^2$-SGD obtains the optimal SGD convergence rates for both the offline (noiseless) and the noisy case with the same choice of fixed learning rate $\eta^{\rm Offline}=\frac{1}{8LT}$, which does not depend on the noise $\tilde{\sigma}$. This is in contrast to SGD, which requires either reducing the offline learning rate by a factor of $\sigma\sqrt{T}$, or using sophisticated adaptive learning rates (Duchi et al., 2011; Levy et al., 2018).

Moreover, letting $\eta^{\rm Noisy}=1/(8LT+\tilde{\sigma}T^{3/2}/D)$, it can be seen that in the noisy case our approach enables employing learning rates in the extremely wide range $[\eta^{\rm Noisy},\eta^{\rm Offline}]$, while still obtaining the same optimal SGD convergence rate of $O(LD^{2}/T + \tilde{\sigma}D/\sqrt{T})$. Indeed, note that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)T^{1/2}$.
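As a quick numeric illustration of this range (with hypothetical values of $L,\tilde{\sigma},D,T$, not taken from the paper):

```python
L_s, sigma_t, D_s, T_s = 1.0, 1.0, 1.0, 10_000            # hypothetical constants
eta_offline = 1.0 / (8 * L_s * T_s)
eta_noisy = 1.0 / (8 * L_s * T_s + sigma_t * T_s**1.5 / D_s)
# the ratio grows like (sigma_tilde / L) * sqrt(T), up to constants:
print(eta_offline, eta_noisy, eta_offline / eta_noisy)
```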

4.1 Proof Sketch of Thm. 4.1

Proof.

First note that the $x_{t}$'s always belong to $\mathcal{K}$, since they are weighted averages of the $\{w_{t}\in\mathcal{K}\}_{t}$'s. Our first step is to bound the difference between consecutive queries. The definition of $x_{t}$ implies,

\alpha_{1:t-1}(x_{t}-x_{t-1}) = \alpha_{t}(w_{t}-x_{t}),

yielding,

\|x_{t}-x_{t-1}\|^{2} = \left(\alpha_{t}/\alpha_{1:t-1}\right)^{2}\|w_{t}-x_{t}\|^{2} \leq (16/t^{2})D^{2} = (16/\alpha_{t-1}^{2})D^{2},   (12)

where we have used $\alpha_{t}=t+1$, which implies $\alpha_{t}/\alpha_{1:t-1}\leq 4/t$ for any $t\geq 2$; we also used $\|w_{t}-x_{t}\|\leq D$, which holds since $w_{t},x_{t}\in\mathcal{K}$; and finally we used $\alpha_{t-1}=t$.
Notation: Prior to going on with the proof we shall require some notation. We will denote $\bar{g}_{t}:=\nabla f(x_{t})$, and recall the following notation from Alg. 1: $g_{t}:=\nabla f(x_{t};z_{t})$, $\tilde{g}_{t-1}:=\nabla f(x_{t-1};z_{t})$. We will also denote,

\varepsilon_{t} := d_{t} - \bar{g}_{t}.

Now, recalling Eq. (11) and combining it with the above definition of $\varepsilon_{t}$ enables us to derive the following recursive relation,

\alpha_{t}\varepsilon_{t} = \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}Z_{t} + (1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}\alpha_{t-1}\varepsilon_{t-1},

where we denote $Z_{t}:=(g_{t}-\bar{g}_{t})-(\tilde{g}_{t-1}-\bar{g}_{t-1})$. Now, using $\alpha_{t}=t+1$ and $\beta_{t}=1/(t+1)$, it can be verified that $\alpha_{t}\beta_{t}=1$ and $\alpha_{t}(1-\beta_{t})=\alpha_{t}-1=\alpha_{t-1}$. Moreover, $(1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}=\frac{\alpha_{t}-1}{\alpha_{t-1}}=1$. Using these relations in the equation above gives,

\alpha_{t}\varepsilon_{t} = \alpha_{t-1}Z_{t} + \alpha_{t-1}\varepsilon_{t-1} + (g_{t}-\bar{g}_{t}) = M_{t} + \alpha_{t-1}\varepsilon_{t-1},   (13)

where for any $t>1$ we denote $M_{t}:=\alpha_{t-1}Z_{t}+(g_{t}-\bar{g}_{t})$, as well as $M_{1}:=g_{1}-\bar{g}_{1}$. Unrolling the above equation yields an explicit expression for any $t\in[T]$: $\alpha_{t}\varepsilon_{t}=\sum_{\tau=1}^{t}M_{\tau}$.

Noticing that the sequence $\{M_{t}\}_{t}$ is a martingale difference sequence with respect to the natural filtration $\{\mathcal{F}_{t}\}_{t}$ induced by the history of the samples up to time $t$, we can bound the expected square error as follows,

\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}
= \mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|M_{\tau}\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|\alpha_{\tau-1}Z_{\tau}+(g_{\tau}-\bar{g}_{\tau})\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|Z_{\tau}\|^{2} + 2\sum_{\tau=1}^{t}\mathbf{E}\|g_{\tau}-\bar{g}_{\tau}\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|(g_{\tau}-\bar{g}_{\tau})-(\tilde{g}_{\tau-1}-\bar{g}_{\tau-1})\|^{2} + 2t\sigma^{2}.   (14)

Now, using Eq. (3) together with Eq. (12) allows us to bound, for any $\tau$,

\mathbf{E}\|(g_{\tau}-\bar{g}_{\tau})-(\tilde{g}_{\tau-1}-\bar{g}_{\tau-1})\|^{2} \leq \sigma_{L}^{2}\mathbf{E}\|x_{\tau}-x_{\tau-1}\|^{2} \leq 16\sigma_{L}^{2}D^{2}/\alpha_{\tau-1}^{2}.

Plugging the above back into Eq. (14) and summing establishes the theorem. ∎

5 Accelerated Version: $\mu^2$-ExtraSGD

In this section we present an accelerated version that makes use of a double momentum mechanism, as we do for $\mu^2$-SGD. Our approach here relies on an algorithmic template named ExtraGradient (Korpelevich, 1976; Nemirovski, 2004; Juditsky et al., 2011). The latter technique has already been combined with the Anytime-SGD mechanism in Cutkosky (2019); Kavis et al. (2019), where it was shown that it leads to acceleration. Here, we further blend in an appropriate $\rm{STORM}$ mechanism, leading to a new method that we name $\mu^2$-ExtraSGD. Our main result for this section is presented in Thm. 5.2.

Optimistic OGD, Extragradient and UnixGrad:

The extragradient technique is related to an algorithmic template named Optimistic Online GD (Optimistic OGD) (Rakhlin and Sridharan, 2013). In this algorithm we receive a sequence of (possibly arbitrary) loss vectors $\{d_{t}\in\mathbb{R}^{d}\}_{t\in[T]}$ in an online manner, and our goal is to sequentially compute a sequence of iterates (or decision points) $\{w_{t}\in\mathcal{K}\}_{t}$, where $\mathcal{K}$ is a given convex set. Note that we may pick $w_{t}$ only based on past information $\{d_{1},\ldots,d_{t-1}\}$. Our goal is to ensure a low weighted regret for any $w\in\mathcal{K}$, where the latter is defined as follows,

\mathcal{R}_{T}(w) := \sum_{t=1}^{T}\alpha_{t}d_{t}\cdot(w_{t}-w),

and $\{\alpha_{t}>0\}$ is a sequence of predefined weights. Now, in the optimistic setting we assume that we may access another sequence of "hint vectors" $\{\hat{d}_{t}\in\mathbb{R}^{d}\}_{t}$, and that prior to picking $w_{t}$ we may also access $\{\hat{d}_{1},\ldots,\hat{d}_{t}\}$. It was shown in Rakhlin and Sridharan (2013) that if the hints are useful, in the sense that $\hat{d}_{t}\approx d_{t}$, then one can substantially reduce the regret by properly incorporating the hints. Thus, in Optimistic-OGD we maintain two sequences: a decision point sequence $\{w_{t}\}_{t}$ and an auxiliary sequence $\{y_{t}\}_{t}$ that are updated as follows,

Optimistic OGD:
w_{t} = \operatorname*{arg\,min}_{w\in\mathcal{K}}\alpha_{t}\hat{d}_{t}\cdot w + \frac{1}{2\eta}\|w-y_{t-1}\|^{2} \quad \& \quad y_{t} = \operatorname*{arg\,min}_{y\in\mathcal{K}}\alpha_{t}d_{t}\cdot y + \frac{1}{2\eta}\|y-y_{t-1}\|^{2}   (15)
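Over a Euclidean ball, each of the two arg-min steps in Eq. (15) reduces to a projected gradient step from $y_{t-1}$; the sketch below (reusing the ball projection from Section 2) makes one such round explicit, where the ball constraint is an illustrative assumption.

```python
def optimistic_ogd_step(y_prev, d_hint, d_loss, alpha, eta):
    """One round of weighted Optimistic OGD (Eq. (15)) over a Euclidean ball."""
    # w_t = argmin_w  alpha * <d_hat_t, w> + ||w - y_{t-1}||^2 / (2 eta)
    w = project(y_prev - eta * alpha * d_hint)
    # y_t = argmin_y  alpha * <d_t, y> + ||y - y_{t-1}||^2 / (2 eta)
    y = project(y_prev - eta * alpha * d_loss)
    return w, y
```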

It was shown in (Rakhlin and Sridharan, 2013; Kavis et al., 2019) that the above algorithm enjoys the following regret bound for any $w\in\mathcal{K}$.

Theorem 5.1 (See e.g. the proof of Thm. 1 in (Kavis et al., 2019)).

Let $\eta>0$, $\{\alpha_{t}\geq 0\}_{t\in[T]}$, and let $\mathcal{K}$ be a convex set with bounded diameter $D$. Then the Optimistic-OGD algorithm above enjoys the following regret bound for any $w\in\mathcal{K}$,

\sum_{t=1}^{T}\alpha_{t}d_{t}\cdot(w_{t}-w) \leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|d_{t}-\hat{d}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2}.

The Extragradient algorithm (Nemirovski, 2004) aims to minimize a convex function $f:\mathcal{K}\mapsto\mathbb{R}$. To do so, it applies the Optimistic-OGD template with the following choices of loss and hint vectors: $\hat{d}_{t}=\nabla f(y_{t-1})$, and $d_{t}:=\nabla f(w_{t})$.
The UnixGrad algorithm (Kavis et al., 2019) can be seen as an Anytime version of Extragradient, where again we aim to minimize a convex function $f:\mathcal{K}\mapsto\mathbb{R}$. In the spirit of Anytime-GD, UnixGrad maintains two sequences of weighted averages based on the $\{w_{t},y_{t}\}_{t}$ sequences as follows,

\hat{x}_{t} = \frac{\alpha_{1:t-1}}{\alpha_{1:t}}x_{t-1} + \frac{\alpha_{t}}{\alpha_{1:t}}y_{t-1}, \quad \& \quad x_{t} = \frac{\alpha_{1:t-1}}{\alpha_{1:t}}x_{t-1} + \frac{\alpha_{t}}{\alpha_{1:t}}w_{t}   (16)

Then, based on the above averages, UnixGrad sets the loss and hint vectors as follows: $\hat{d}_{t}=\nabla f(\hat{x}_{t})$, and $d_{t}:=\nabla f(x_{t})$. Note that the above averaging rule implies that the $x_{t}$'s are weighted averages of the $w_{t}$'s, i.e. $x_{t}=\frac{1}{\alpha_{1:t}}\sum_{\tau=1}^{t}\alpha_{\tau}w_{\tau}$. The latter enables us to utilize the Anytime guarantees of Thm. 3.1.

Note that there also exist stochastic versions of the above approaches where we may only query noisy rather than exact gradient estimates.

Our Approach.

We suggest employing the Optimistic-OGD template together with a $\rm{STORM}$ mechanism on top of the Anytime mechanism employed by UnixGrad. Specifically, we maintain the same weighted averages as in Eq. (16), and define momentum estimates as follows:
At round $t$ draw a fresh sample $z_{t}\sim\mathcal{D}$, and based on $\hat{x}_{t},x_{t}$ compute

\tilde{g}_{t-1} = \nabla f(x_{t-1};z_{t}), \quad \hat{g}_{t} = \nabla f(\hat{x}_{t};z_{t}), \quad g_{t} = \nabla f(x_{t};z_{t}).   (17)

Based on the above compute the (corrected momentum) loss and hint vectors as follows,

\alpha_{t}\hat{d}_{t} = \alpha_{t}\hat{g}_{t} + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1}) \quad \& \quad \alpha_{t}d_{t} = \alpha_{t}g_{t} + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1})   (18)

We then update according to Optimistic OGD in Eq. (15). Notice that the update rule for $\alpha_{t}d_{t}$ is the exact same update that we use in Alg. 1; additionally, as we have already noted, the $x_{t}$ sequence in Eq. (16) is an $\{\alpha_{t}\}_{t}$ weighted average of the $\{w_{t}\}_{t}$ sequence. Therefore, if we pick $\alpha_{t}=t+1$ and $\beta_{t}=1/\alpha_{t}$, we can invoke Thm. 4.1, implying that,

\mathbf{E}\|\varepsilon_{t}\|^{2} := \mathbf{E}\|d_{t}-\nabla f(x_{t})\|^{2} \leq \tilde{\sigma}^{2}/t,

where $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

The pseudo-code in Alg. 2 depicts our $\mu^2$-ExtraSGD algorithm with the appropriate computational order. It can be seen that it combines Optimistic-OGD updates (Eq. (15)), together with appropriate Anytime averaging (Eq. (16)), and $\rm{STORM}$ updates for $d_{t},\hat{d}_{t}$ (Eq. (18)).

Algorithm 2 $\mu^2$-ExtraSGD
  Input: #Iterations $T$, initial point $y_{0}$, learning rate $\eta>0$, importance weights $\{\alpha_{t}=t+1\}_{t}$, corrected momentum weights $\{\beta_{t}=1/\alpha_{t}\}_{t}$

  Initialize: set $x_{0}=0$ and $\hat{x}_{1}=y_{0}$, draw $z_{1}\sim\mathcal{D}$ and set $d_{0}=\tilde{g}_{0}=\hat{d}_{1}=\nabla f(\hat{x}_{1};z_{1})$
  for $t=1,\ldots,T$ do
     Compute: $w_{t}=\operatorname*{arg\,min}_{w\in\mathcal{K}}\alpha_{t}\hat{d}_{t}\cdot w+\frac{1}{2\eta}\|w-y_{t-1}\|^{2}$, & Update: $x_{t}=\frac{\alpha_{1:t-1}}{\alpha_{1:t}}x_{t-1}+\frac{\alpha_{t}}{\alpha_{1:t}}w_{t}$
     Compute: $g_{t}=\nabla f(x_{t};z_{t})$, & Update: $\alpha_{t}d_{t}=\alpha_{t}g_{t}+(1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1})$
     Compute: $y_{t}=\operatorname*{arg\,min}_{y\in\mathcal{K}}\alpha_{t}d_{t}\cdot y+\frac{1}{2\eta}\|y-y_{t-1}\|^{2}$, & Update: $\hat{x}_{t+1}=\frac{\alpha_{1:t}}{\alpha_{1:t+1}}x_{t}+\frac{\alpha_{t+1}}{\alpha_{1:t+1}}y_{t}$
     Draw a fresh sample $z_{t+1}\sim\mathcal{D}$ and compute $\tilde{g}_{t}=\nabla f(x_{t};z_{t+1})$, $\hat{g}_{t+1}=\nabla f(\hat{x}_{t+1};z_{t+1})$
     Update: $\alpha_{t+1}\hat{d}_{t+1}=\alpha_{t+1}\hat{g}_{t+1}+(1-\beta_{t+1})\alpha_{t+1}(d_{t}-\tilde{g}_{t})$
  end for
  output: $x_{T}$
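Below is a minimal sketch of Alg. 2 over a Euclidean ball, reusing the toy oracle and projection from Section 2; as before, the concrete problem instance is an illustrative assumption, and $\eta=1/2L$ follows Thm. 5.2 with a hypothetical smoothness bound passed in.

```python
def mu2_extra_sgd(y0, T, L):
    """Sketch of mu^2-ExtraSGD (Alg. 2) with alpha_t = t+1 and beta_t = 1/alpha_t."""
    eta = 1.0 / (2.0 * L)                     # eta_offline from Thm. 5.2
    x = np.zeros_like(y0)                     # x_0 = 0
    x_hat, y = y0.copy(), y0.copy()           # hat x_1 = y_0
    z = sample_z()                            # z_1
    d = d_hat = g_tilde = grad_oracle(x_hat, z)   # d_0 = tilde g_0 = hat d_1
    alpha_sum = 0.0                           # running alpha_{1:t-1}
    for t in range(1, T + 1):
        alpha, beta = t + 1.0, 1.0 / (t + 1.0)
        w = project(y - eta * alpha * d_hat)                   # optimistic step w_t
        x = (alpha_sum * x + alpha * w) / (alpha_sum + alpha)  # average x_t
        g = grad_oracle(x, z)                                  # g_t = grad f(x_t; z_t)
        d = g + (1.0 - beta) * (d - g_tilde)                   # STORM loss vector d_t
        y = project(y - eta * alpha * d)                       # OGD step y_t
        alpha_sum += alpha                                     # now alpha_{1:t}
        # hat x_{t+1} and the hint vector hat d_{t+1}, using a fresh sample z_{t+1}
        x_hat = (alpha_sum * x + (alpha + 1.0) * y) / (alpha_sum + alpha + 1.0)
        z = sample_z()
        g_tilde = grad_oracle(x, z)                            # tilde g_t
        d_hat = grad_oracle(x_hat, z) + (1.0 - 1.0 / (t + 2.0)) * (d - g_tilde)
    return x
```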

We are now ready to state the guarantees of $\mu^2$-ExtraSGD.

Theorem 5.2 ($\mu^2$-ExtraSGD Guarantees).

Let $f:\mathcal{K}\mapsto\mathbb{R}$ be a convex function, let $\mathcal{K}$ be a convex set with diameter $D$, and denote $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$. Also presume that the assumptions in Equations (1), (2), (3) hold. Then invoking Alg. 2 with $\{\alpha_{t}=t+1\}_{t}$ and $\{\beta_{t}=1/\alpha_{t}\}_{t}$, and using $\eta\leq 1/2L$, guarantees,

\mathbf{E}(f(x_{T})-f(w^{*})) = \mathbf{E}\Delta_{T} \leq O\left(\frac{D^{2}}{\eta T^{2}} + \frac{\tilde{\sigma}D}{\sqrt{T}}\right),

where $\Delta_{t}:=f(x_{t})-f(w^{*})$, and $\tilde{\sigma}^{2}:=32D^{2}\sigma_{L}^{2}+2\sigma^{2}$.

The Stability of $\mu^2$-ExtraSGD.

Similarly to $\mu^2$-SGD, the above theorem shows that $\mu^2$-ExtraSGD obtains the optimal rates for both the offline (noiseless) and the noisy case with the same choice of fixed learning rate $\eta^{\rm Offline}=1/2L$, which only depends on the smoothness of the objective. This is in contrast to existing accelerated methods, which require either reducing the offline learning rate by a factor of $\sigma\sqrt{T}$ (Xiao, 2010), or employing sophisticated adaptive learning rates (Cutkosky, 2019; Kavis et al., 2019).

Moreover, letting $\eta^{\rm Noisy}:=1/(2L+\tilde{\sigma}T^{3/2}/D)$, it can be seen that in the noisy case our approach enables employing learning rates in the extremely wide range $[\eta^{\rm Noisy},\eta^{\rm Offline}]$, while still obtaining the same optimal convergence rate of $O(LD^{2}/T^{2} + \tilde{\sigma}D/\sqrt{T})$. Indeed, note that the ratio $\eta^{\rm Offline}/\eta^{\rm Noisy}\approx(\tilde{\sigma}/L)T^{3/2}$.

Finally, in contrast to Thm. 4.2, which requires $w^{*}\in\operatorname*{arg\,min}_{w\in\mathcal{K}}f(w)$ to also be the global minimum of $f(\cdot)$ in $\mathbb{R}^{d}$, Thm. 5.2 does not require this assumption.

5.1 Proof of Thm. 5.2

Proof of Thm. 5.2.

The proof decomposes according to the techniques that $\mu^2$-ExtraSGD employs.
Part I: Anytime Guarantees. Since the $x_{t}$'s are $\{\alpha_{t}\}_{t}$ weighted averages of the $\{w_{t}\}_{t}$'s, we can invoke Thm. 3.1, which implies,

\alpha_{1:T}(f(x_{T})-f(w^{*})) \leq \sum_{t=1}^{T}\alpha_{t}\nabla f(x_{t})\cdot(w_{t}-w^{*}) = \sum_{t=1}^{T}\alpha_{t}d_{t}\cdot(w_{t}-w^{*}) - \sum_{t=1}^{T}\alpha_{t}\varepsilon_{t}\cdot(w_{t}-w^{*}),   (19)

where we denote $\varepsilon_{t}:=d_{t}-\nabla f(x_{t})$.
Part II: Optimistic OGD Guarantees. Since the update rule for $\{w_{t},y_{t}\}_{t}$ satisfies the Optimistic-OGD template w.r.t. the sequences of loss and hint vectors $\{d_{t},\hat{d}_{t}\}_{t}$, we can apply Thm. 5.1 to bound the weighted regret in Eq. (19) as follows,

\alpha_{1:T}(f(x_{T})-f(w^{*}))
\leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|d_{t}-\hat{d}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} - \sum_{t=1}^{T}\alpha_{t}\varepsilon_{t}\cdot(w_{t}-w^{*})
\leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|g_{t}-\hat{g}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + \sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|\cdot\|w_{t}-w^{*}\|
\leq \frac{4D^{2}}{\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|\nabla f(x_{t};z_{t})-\nabla f(\hat{x}_{t};z_{t})\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + \frac{\eta L^{2}}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\|x_{t}-\hat{x}_{t}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + \frac{\eta L^{2}}{2}\sum_{t=1}^{T}\alpha_{t}^{2}\left(\frac{\alpha_{t}}{\alpha_{1:t}}\right)^{2}\|w_{t}-y_{t-1}\|^{2} - \frac{1}{2\eta}\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + \left(2\eta L^{2}-\frac{1}{2\eta}\right)\sum_{t=1}^{T}\|w_{t}-y_{t-1}\|^{2} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + D\sum_{t=1}^{T}\|\alpha_{t}\varepsilon_{t}\|,   (20)

where the first inequality uses Eq. (19) together with Thm. 5.1; the second inequality uses $d_{t}-\hat{d}_{t}=g_{t}-\hat{g}_{t}$, which follows from Eq. (18), together with Cauchy-Schwarz; the third inequality follows from the definitions of $g_{t},\hat{g}_{t}$, as well as from $\|w_{t}-w^{*}\|\leq D$, which holds since $w_{t},w^{*}\in\mathcal{K}$; the fourth inequality follows from our assumption in Eq. (2); the fifth inequality holds since $x_{t}-\hat{x}_{t}=(\alpha_{t}/\alpha_{1:t})(w_{t}-y_{t-1})$, which follows from Eq. (16); the sixth inequality holds since $\alpha_{t}^{4}/(\alpha_{1:t})^{2}\leq 4,\ \forall t\geq 1$; and the last inequality holds since $2\eta L^{2}-1/(2\eta)\leq 0$, which follows from our assumption $\eta\leq 1/2L$.
Part III: $\mu^2$ Guarantees. Notice that our definitions of $w_{t},x_{t},\alpha_{t},\beta_{t}$ and $d_{t}$ satisfy the exact same conditions of Thm. 4.1. This immediately implies that $\mathbf{E}\|\varepsilon_{t}\|^{2}\leq\tilde{\sigma}^{2}/t,\ \forall t$. Using this and taking the expectation of Eq. (20) yields,

\alpha_{1:T}\mathbf{E}(f(x_{T})-f(w^{*}))
\leq \frac{4D^{2}}{\eta} + D\sum_{t=1}^{T}\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|
\leq \frac{4D^{2}}{\eta} + D\sum_{t=1}^{T}\sqrt{\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}}
\leq \frac{4D^{2}}{\eta} + D\tilde{\sigma}\sum_{t=1}^{T}\sqrt{\alpha_{t}^{2}/t}
\leq \frac{4D^{2}}{\eta} + 2T^{3/2}D\tilde{\sigma}.   (21)

where the second inequality uses Jensen's inequality, $\mathbf{E}X\leq\sqrt{\mathbf{E}X^{2}}$, which holds for any random variable $X$; the last inequality follows from $\alpha_{t}^{2}/t=(t+1)^{2}/t\leq 4t$, implying that $\sum_{t=1}^{T}\sqrt{\alpha_{t}^{2}/t}\leq 2\sum_{t=1}^{T}\sqrt{t}\leq 2T^{3/2}$.

Dividing the above equation by $\alpha_{1:T}$ and recalling that $\alpha_{1:T}=\Theta(T^{2})$ concludes the proof. ∎

6 Conclusion

By blending two recent momentum techniques, we have designed a new shrinking-error gradient estimate for the SCO-EOS setting. Based on it, we have presented two algorithms that rely on the SGD and Extragradient templates, and have shown that the latter are extremely stable w.r.t. the choice of the learning rate. For future work, it will be interesting to study our approach in the more general case of expectations over $\nu$-Hölder smooth functions (with $\nu\in[0,1]$), as well as to extend our approach to the strongly-convex case.

Acknowledgments

This work has received funding from the Israeli Science Foundation (grant No. 447/20).

References

  • Antonakopoulos et al. (2022) K. Antonakopoulos, D. Q. Vu, V. Cevher, K. Levy, and P. Mertikopoulos. Undergrad: A universal black-box optimization method with almost dimension-free convergence rate guarantees. In International Conference on Machine Learning, pages 772–795. PMLR, 2022.
  • Cutkosky (2019) A. Cutkosky. Anytime online-to-batch, optimism and acceleration. In International Conference on Machine Learning, pages 1446–1454. PMLR, 2019.
  • Cutkosky and Orabona (2019) A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019.
  • Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • Ge et al. (2015) R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, pages 797–842, 2015.
  • Hazan et al. (2016) E. Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • Hu et al. (2009) C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781–789, 2009.
  • Jacobsen and Cutkosky (2022) A. Jacobsen and A. Cutkosky. Parameter-free mirror descent. arXiv preprint arXiv:2203.00444, 2022.
  • Juditsky et al. (2011) A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.
  • Kavis et al. (2019) A. Kavis, K. Y. Levy, F. Bach, and V. Cevher. Unixgrad: a universal, adaptive algorithm with optimal guarantees for constrained optimization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 6260–6269, 2019.
  • Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Korpelevich (1976) G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
  • Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
  • Levy et al. (2018) K. Y. Levy, A. Yurtsever, and V. Cevher. Online adaptive methods, universality and acceleration. In NeurIPS, 2018.
  • Levy et al. (2021) K. Y. Levy, A. Kavis, and V. Cevher. Storm+: Fully adaptive sgd with recursive momentum for nonconvex optimization. In NeurIPS, 2021.
  • Mairal et al. (2009) J. Mairal, F. R. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In A. P. Danyluk, L. Bottou, and M. L. Littman, editors, Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM International Conference Proceeding Series, pages 689–696. ACM, 2009. doi: 10.1145/1553374.1553463. URL https://doi.org/10.1145/1553374.1553463.
  • Nemirovski (2004) A. Nemirovski. Prox-method with rate of convergence $O(1/t)$ for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
  • Nesterov (1983) Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^{2})$. In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
  • Polyak (1964) B. T. Polyak. Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964.
  • Rakhlin and Sridharan (2013) S. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
  • Recht et al. (2011) B. Recht, C. Ré, S. J. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 693–701, 2011. URL https://proceedings.neurips.cc/paper/2011/hash/218a0aefd1d1a4be65601cc6ddc1520e-Abstract.html.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Shalev-Shwartz et al. (2007) S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In Proceedings of the 24th international conference on Machine learning, pages 807–814, 2007.
  • Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
  • Wang et al. (2021) J.-K. Wang, J. Abernethy, and K. Y. Levy. No-regret dynamics in the fenchel game: A unified framework for algorithmic convex optimization. arXiv preprint arXiv:2111.11309, 2021.
  • Welling and Teh (2011) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688. Citeseer, 2011.
  • Xiao (2010) L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

Appendix A Explaining the Bounded Smoothness Variance Assumption

Here we show that Eq. (2) implies that Eq. (3) holds for some $\sigma_{L}^{2}\in[0,L^{2}]$.

Fixing $x,y\in\mathcal{K}$, Eq. (2) implies that for any $z\in\textbf{Support}\{\mathcal{D}\}$ there exists $L_{x,y;z}\in[0,L]$ such that,

\|\nabla f(x;z)-\nabla f(y;z)\| = L_{x,y;z}\|x-y\|.

Similarly, there exists $L_{x,y}\in[0,L]$ such that,

\|\nabla f(x)-\nabla f(y)\| = L_{x,y}\|x-y\|.

Clearly, in the deterministic case we have $L_{x,y;z}=L_{x,y},\ \forall z\in\textbf{Support}\{\mathcal{D}\}$. Therefore,

\mathbf{E}\|(\nabla f(x;z)-\nabla f(x))-(\nabla f(y;z)-\nabla f(y))\|^{2}
= \mathbf{E}\|\nabla f(x;z)-\nabla f(y;z)\|^{2} - \|\nabla f(x)-\nabla f(y)\|^{2}
= \mathbf{E}(L_{x,y;z}^{2}-L_{x,y}^{2})\cdot\|x-y\|^{2} = \sigma_{L}^{2}\{x,y\}\cdot\|x-y\|^{2},

where the first equality uses $\mathbf{E}(\nabla f(x;z)-\nabla f(y;z))=\nabla f(x)-\nabla f(y)$ (i.e. it is a variance decomposition), and we denote $\sigma_{L}^{2}\{x,y\}:=\mathbf{E}(L_{x,y;z}^{2}-L_{x,y}^{2})$. This notation implies that $\sigma_{L}^{2}\{x,y\}\in[0,L^{2}]$, and clearly $\sigma_{L}^{2}\{x,y\}=0$ in the deterministic case for all $x,y\in\mathcal{K}$. Thus, if we denote,

\sigma_{L}^{2} := \sup_{x,y\in\mathcal{K}}\sigma_{L}^{2}\{x,y\},

then $\sigma_{L}^{2}\in[0,L^{2}]$ satisfies Eq. (3) and is equal to $0$ in the deterministic (noiseless) case.

Appendix B Proof of Thm. 4.1

Proof of Thm. 4.1.

First note that the $x_{t}$'s always belong to $\mathcal{K}$, since they are weighted averages of the $\{w_{t}\in\mathcal{K}\}_{t}$'s. Next, we bound the difference between consecutive queries. By definition,

\alpha_{1:t-1}(x_{t}-x_{t-1}) = \alpha_{t}(w_{t}-x_{t}),

implying,

\|x_{t}-x_{t-1}\|^{2} = \left(\alpha_{t}/\alpha_{1:t-1}\right)^{2}\|w_{t}-x_{t}\|^{2} \leq (16/t^{2})D^{2} = (16/\alpha_{t-1}^{2})D^{2},   (22)

where we have used $\alpha_{t}=t+1$, which implies $\alpha_{t}/\alpha_{1:t-1}\leq 4/t$ for any $t\geq 2$; we also used $\|w_{t}-x_{t}\|\leq D$, which holds since $w_{t},x_{t}\in\mathcal{K}$; and finally we used $\alpha_{t-1}=t$.

Notation: Prior to going on with the proof we shall require some notation. We will denote $\bar{g}_{t}:=\nabla f(x_{t})$, and recall the following notation from Alg. 1: $g_{t}:=\nabla f(x_{t};z_{t})$, $\tilde{g}_{t-1}:=\nabla f(x_{t-1};z_{t})$. We will also denote,

\varepsilon_{t} := d_{t} - \bar{g}_{t}.

Now, recalling Eq. (11),

\alpha_{t}d_{t} = \alpha_{t}g_{t} + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1}).

Combining the above with the definition of $\varepsilon_{t}$ yields the following recursive relation,

\alpha_{t}\varepsilon_{t}
:= \alpha_{t}d_{t} - \alpha_{t}\bar{g}_{t}
= \alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\tilde{g}_{t-1}+g_{t}-\bar{g}_{t})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(\bar{g}_{t-1}-\tilde{g}_{t-1}+g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\bar{g}_{t-1})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}((g_{t}-\bar{g}_{t})-(\tilde{g}_{t-1}-\bar{g}_{t-1})) + (1-\beta_{t})\alpha_{t}(d_{t-1}-\bar{g}_{t-1})
= \beta_{t}\alpha_{t}(g_{t}-\bar{g}_{t}) + (1-\beta_{t})\alpha_{t}Z_{t} + (1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}\alpha_{t-1}\varepsilon_{t-1},

where we denoted $Z_{t}:=(g_{t}-\bar{g}_{t})-(\tilde{g}_{t-1}-\bar{g}_{t-1})$. Now, using $\alpha_{t}=t+1$ and $\beta_{t}=1/(t+1)$, it can be verified that $\alpha_{t}\beta_{t}=1$ and $\alpha_{t}(1-\beta_{t})=\alpha_{t}-1=\alpha_{t-1}$. Moreover, $(1-\beta_{t})\frac{\alpha_{t}}{\alpha_{t-1}}=\frac{\alpha_{t}-1}{\alpha_{t-1}}=1$. Using these relations in the equation above gives,

\alpha_{t}\varepsilon_{t} = \alpha_{t-1}Z_{t} + \alpha_{t-1}\varepsilon_{t-1} + (g_{t}-\bar{g}_{t}) = M_{t} + \alpha_{t-1}\varepsilon_{t-1},   (23)

where for any $t>1$ we denote $M_{t}:=\alpha_{t-1}Z_{t}+(g_{t}-\bar{g}_{t})$, as well as $M_{1}:=g_{1}-\bar{g}_{1}$. Unrolling the above equation yields an explicit expression for any $t\in[T]$,

\alpha_{t}\varepsilon_{t} = \sum_{\tau=1}^{t}M_{\tau}.

Now, notice that the sequence $\{M_{t}\}_{t}$ is a martingale difference sequence with respect to the natural filtration $\{\mathcal{F}_{t}\}_{t}$ induced by the history of the samples up to time $t$. Indeed,

\mathbf{E}[M_{t}|\mathcal{F}_{t-1}] = \mathbf{E}[(g_{t}-\bar{g}_{t})|\mathcal{F}_{t-1}] + \alpha_{t-1}\mathbf{E}[Z_{t}|\mathcal{F}_{t-1}] = \mathbf{E}[(g_{t}-\bar{g}_{t})|x_{t}] + \alpha_{t-1}\mathbf{E}[Z_{t}|x_{t-1},x_{t}] = 0.

Thus, using Lemma B.1 below gives,

\mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}
= \mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|M_{\tau}\|^{2}
= \sum_{\tau=1}^{t}\mathbf{E}\|\alpha_{\tau-1}Z_{\tau}+(g_{\tau}-\bar{g}_{\tau})\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|Z_{\tau}\|^{2} + 2\sum_{\tau=1}^{t}\mathbf{E}\|g_{\tau}-\bar{g}_{\tau}\|^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|(g_{\tau}-\bar{g}_{\tau})-(\tilde{g}_{\tau-1}-\bar{g}_{\tau-1})\|^{2} + 2t\sigma^{2}
= 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\mathbf{E}\|(\nabla f(x_{\tau};z_{\tau})-\nabla f(x_{\tau}))-(\nabla f(x_{\tau-1};z_{\tau})-\nabla f(x_{\tau-1}))\|^{2} + 2t\sigma^{2}
\leq 2\sum_{\tau=1}^{t}\alpha_{\tau-1}^{2}\sigma_{L}^{2}\mathbf{E}\|x_{\tau}-x_{\tau-1}\|^{2} + 2t\sigma^{2}
\leq 32D^{2}\sigma_{L}^{2}\sum_{\tau=1}^{t}(\alpha_{\tau-1}^{2}/\alpha_{\tau-1}^{2}) + 2t\sigma^{2}
= (32D^{2}\sigma_{L}^{2}+2\sigma^{2})\cdot t
= \tilde{\sigma}^{2}\cdot t,

Here, the first inequality uses $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$, which holds for any $a,b\in\mathbb{R}^{d}$; the second inequality uses the bounded variance assumption; the third inequality uses Eq. (3); and the last inequality uses Eq. (22).

Dividing the above inequality by $\alpha_{t}^{2}=(t+1)^{2}$, the theorem follows,

\mathbf{E}\|d_{t}-\nabla f(x_{t})\|^{2} = \mathbf{E}\|\varepsilon_{t}\|^{2} = \mathbf{E}\|\alpha_{t}\varepsilon_{t}\|^{2}/\alpha_{t}^{2} \leq \tilde{\sigma}^{2}t/(t+1)^{2} \leq \tilde{\sigma}^{2}/(t+1). ∎
Lemma B.1.

Let $\{M_{t}\}_{t}$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_{t}\}_{t}$. Then the following holds for any $t$,

\mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2} = \sum_{\tau=1}^{t}\mathbf{E}\left\|M_{\tau}\right\|^{2}.

B.0.1 Proof of Lemma B.1

Proof of Lemma B.1.

We shall prove the lemma by induction over $t$. The base case $t=1$ clearly holds.

Now, for the induction step, let us assume that the equality holds for $t\geq 1$ and prove that it holds for $t+1$. Indeed,

\mathbf{E}\left\|\sum_{\tau=1}^{t+1}M_{\tau}\right\|^{2}
= \mathbf{E}\left\|M_{t+1}+\sum_{\tau=1}^{t}M_{\tau}\right\|^{2}
= \mathbf{E}\left\|\sum_{\tau=1}^{t}M_{\tau}\right\|^{2} + \mathbf{E}\|M_{t+1}\|^{2} + 2\mathbf{E}\left(\sum_{\tau=1}^{t}M_{\tau}\right)\cdot M_{t+1}
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2} + 2\mathbf{E}\left[\mathbf{E}\left[\left(\sum_{\tau=1}^{t}M_{\tau}\right)\cdot M_{t+1}\Big|\mathcal{F}_{t}\right]\right]
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2} + 2\mathbf{E}\left[\left(\sum_{\tau=1}^{t}M_{\tau}\right)\cdot\mathbf{E}\left[M_{t+1}|\mathcal{F}_{t}\right]\right]
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2} + 0
= \sum_{\tau=1}^{t+1}\mathbf{E}\left\|M_{\tau}\right\|^{2},

where the third line follows from the induction hypothesis, as well as from the law of total expectation; the fourth line follows since $\{M_{\tau}\}_{\tau=1}^{t}$ are measurable w.r.t. $\mathcal{F}_{t}$; and the fifth line follows since $\mbox{\bf E}[M_{t+1}|\mathcal{F}_{t}]=0$. Thus, we have established the induction step, and therefore the lemma holds. ∎
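As a quick sanity check, the following small Python snippet (illustrative only) verifies the identity of Lemma B.1 empirically for the simplest martingale difference sequence, namely i.i.d. zero-mean vectors:

```python
# Sketch: Monte Carlo check of Lemma B.1 for i.i.d. zero-mean Gaussian
# vectors M_1, ..., M_t, which trivially satisfy E[M_tau | F_{tau-1}] = 0.
import numpy as np

rng = np.random.default_rng(1)
n_runs, t, d = 200_000, 10, 3
M = rng.standard_normal((n_runs, t, d))

lhs = ((M.sum(axis=1)) ** 2).sum(axis=1).mean()   # E || sum_tau M_tau ||^2
rhs = (M ** 2).sum(axis=(1, 2)).mean()            # sum_tau E || M_tau ||^2
print(f"E||sum M||^2 ~= {lhs:.3f}, sum E||M||^2 ~= {rhs:.3f}")  # both ~= t*d = 30
```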

Appendix C Proof of Thm. 4.2

Proof of Thm. 4.2.

The proof is a direct combination of Thm. 4.1 with the standard regret bound of OGD (Online Gradient Descent), which in turn enables us to utilize the Anytime guarantees of Thm. 3.1.
Part 1: Regret Bound. Standard regret analysis of the update rule in Eq. (8) implies the following for any $t$ (Hazan et al., 2016),

\displaystyle\sum_{\tau=1}^{t}\alpha_{\tau}d_{\tau}\cdot(w_{\tau}-w^{*})\leq\frac{D^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|d_{\tau}\|^{2}~{}. (24)
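For concreteness, here is a minimal sketch of the weighted OGD step behind this bound; the projected update $w_{t+1}=\Pi_{\mathcal{K}}(w_{t}-\eta\alpha_{t}d_{t})$ is our reading of Eq. (8), and the Euclidean-ball domain is an illustrative assumption:

```python
# Sketch: one step of weighted Online Gradient Descent, assumed to match
# Eq. (8): w_{t+1} = Proj_K(w_t - eta * alpha_t * d_t). For illustration,
# K is taken to be a Euclidean ball of radius R (so its diameter is D = 2R).
import numpy as np

def project_ball(v, radius):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def ogd_step(w, d_t, alpha_t, eta, radius):
    return project_ball(w - eta * alpha_t * d_t, radius)

# Example usage with arbitrary values:
w_next = ogd_step(np.zeros(3), np.array([1.0, -2.0, 0.5]),
                  alpha_t=2.0, eta=0.01, radius=1.0)
```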

Part 2: Anytime Guarantees.

Since the $x_{t}$'s are weighted averages of the $w_{t}$'s, we may invoke Thm. 3.1, which ensures for any $t\in[T]$,

\alpha_{1:t}\Delta_{t}=\alpha_{1:t}(f(x_{t})-f(w^{*}))\leq\sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})~{},

where we denote $\Delta_{t}:=f(x_{t})-f(w^{*})$.
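The weighted averages can be maintained incrementally via $x_{t}=x_{t-1}+\frac{\alpha_{t}}{\alpha_{1:t}}(w_{t}-x_{t-1})$; the following sketch (with arbitrary illustrative values) checks this recursion against the direct definition:

```python
# Sketch: the Anytime average x_t = (sum_{tau<=t} alpha_tau w_tau) / alpha_{1:t}
# maintained incrementally, checked against the direct weighted average.
import numpy as np

def anytime_average(ws, alphas):
    x, cum = ws[0].copy(), alphas[0]
    for w, a in zip(ws[1:], alphas[1:]):
        cum += a
        x += (a / cum) * (w - x)    # x_t = x_{t-1} + (alpha_t/alpha_{1:t})(w_t - x_{t-1})
    return x

ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
alphas = [2.0, 3.0, 4.0]            # alpha_t = t + 1
direct = sum(a * w for a, w in zip(alphas, ws)) / sum(alphas)
print(np.allclose(anytime_average(ws, alphas), direct))  # True
```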

Part 3: Combining Guarantees.

Combining the above Anytime guarantees together with the bound in Eq. (24) yields,

\displaystyle\alpha_{1:t}\Delta_{t} \leq\sum_{\tau=1}^{t}\alpha_{\tau}\nabla f(x_{\tau})\cdot(w_{\tau}-w^{*})
\displaystyle=\sum_{\tau=1}^{t}\alpha_{\tau}d_{\tau}\cdot(w_{\tau}-w^{*})+\sum_{\tau=1}^{t}\alpha_{\tau}(\nabla f(x_{\tau})-d_{\tau})\cdot(w_{\tau}-w^{*})
\displaystyle\leq\frac{D^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|d_{\tau}\|^{2}-\sum_{\tau=1}^{t}\alpha_{\tau}\varepsilon_{\tau}\cdot(w_{\tau}-w^{*})
\displaystyle\leq\frac{D^{2}}{2\eta}+\frac{\eta}{2}\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\nabla f(x_{\tau})+\varepsilon_{\tau}\|^{2}+\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|\cdot\|w_{\tau}-w^{*}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\nabla f(x_{\tau})\|^{2}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+2\eta L\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\|\alpha_{\tau}\varepsilon_{\tau}\|~{}, (25)

where the first inequality restates the Anytime guarantee above; the second inequality uses the regret bound of Eq. (24); the third inequality follows from Cauchy-Schwarz; the fourth inequality holds since $\|w_{t}-w^{*}\|\leq D$, as well as from using $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$, which holds for any $a,b\in{\mathbb{R}}^{d}$; the fifth inequality follows by the self-bounding property for smooth functions (see Lemma C.1 below), implying that $\|\nabla f(x_{\tau})\|^{2}\leq 2L(f(x_{\tau})-f(w^{*}))=2L\Delta_{\tau}$; and the last inequality follows due to $\alpha_{\tau}^{2}\leq 2\alpha_{1:\tau}$, which holds since $\alpha_{\tau}=\tau+1$.

Lemma C.1.

(See e.g. (Levy et al., 2018; Cutkosky, 2019)) Let $F:{\mathbb{R}}^{d}\mapsto{\mathbb{R}}$ be an $L$-smooth function with a global minimum $x^{*}$; then for any $x\in{\mathbb{R}}^{d}$ we have,

\|\nabla F(x)\|^{2}\leq 2L(F(x)-F(x^{*}))~{}.
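This property is easy to verify numerically; as a sketch, the snippet below checks it for a random convex quadratic $F(x)=\frac{1}{2}x^{\top}Ax$ (an illustrative choice, for which $x^{*}=0$ and $L=\lambda_{\max}(A)$):

```python
# Sketch: numerical check of ||grad F(x)||^2 <= 2L (F(x) - F(x*)) for the
# convex quadratic F(x) = 0.5 * x^T A x, where x* = 0 and L = lambda_max(A).
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((4, 4))
A = G @ G.T                                  # random symmetric PSD matrix
L = np.linalg.eigvalsh(A).max()              # smoothness constant of F

for _ in range(1000):
    x = rng.standard_normal(4)
    lhs = np.sum((A @ x) ** 2)               # ||grad F(x)||^2 = ||A x||^2
    rhs = 2 * L * (0.5 * x @ A @ x)          # 2L (F(x) - F(x*)), with F(x*) = 0
    assert lhs <= rhs + 1e-9
print("self-bounding inequality verified on 1000 random points")
```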

Next, taking expectation over Eq. (25) yields,

\displaystyle\alpha_{1:t}\mbox{\bf E}\Delta_{t} \leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\mbox{\bf E}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\mbox{\bf E}\|\alpha_{\tau}\varepsilon_{\tau}\|
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\mbox{\bf E}\|\varepsilon_{\tau}\|^{2}+D\sum_{\tau=1}^{t}\sqrt{\alpha_{\tau}^{2}\mbox{\bf E}\|\varepsilon_{\tau}\|^{2}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\sum_{\tau=1}^{t}\alpha_{\tau}^{2}\cdot\tilde{\sigma}^{2}/\alpha_{\tau}+D\sum_{\tau=1}^{t}\sqrt{\alpha_{\tau}^{2}\cdot\tilde{\sigma}^{2}/\alpha_{\tau}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{t}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\sum_{\tau=1}^{t}\alpha_{\tau}+D\tilde{\sigma}\sum_{\tau=1}^{t}\sqrt{\alpha_{\tau}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{T}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\sum_{\tau=1}^{T}\alpha_{\tau}+D\tilde{\sigma}\sum_{\tau=1}^{T}\sqrt{\alpha_{\tau}}
\displaystyle\leq\frac{D^{2}}{2\eta}+4\eta L\sum_{\tau=1}^{T}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\alpha_{1:T}+2D\tilde{\sigma}T^{3/2}
\displaystyle\leq\frac{D^{2}}{2\eta}+\frac{1}{2T}\sum_{\tau=1}^{T}\alpha_{1:\tau}\mbox{\bf E}\Delta_{\tau}+\eta\tilde{\sigma}^{2}\alpha_{1:T}+2D\tilde{\sigma}T^{3/2}~{}, (26)

where the second line is due to Jensen's inequality, implying that $\mbox{\bf E}X\leq\sqrt{\mbox{\bf E}X^{2}}$ for any random variable $X$; the third line follows from $\mbox{\bf E}\|\varepsilon_{t}\|^{2}\leq\tilde{\sigma}^{2}/\alpha_{t}$, which holds by Thm. 4.1; the fifth line holds since $t\leq T$; the sixth line follows since $\sum_{t=1}^{T}\sqrt{\alpha_{t}}\leq 2T^{3/2}$; and the last line follows since we pick $\eta\leq 1/(8LT)$, whence $4\eta L\leq\frac{1}{2T}$.

To obtain the final bound, we will apply the following lemma to Eq. (26).

Lemma C.2.

Let $\{A_{t}\}_{t\in[T]}$ be a sequence of non-negative elements and $\mathcal{B}\in{\mathbb{R}}$, and assume that for any $t\leq T$,

A_{t}\leq\mathcal{B}+\frac{1}{2T}\sum_{\tau=1}^{T}A_{\tau}~{}.

Then the following bound holds,

A_{T}\leq 2\mathcal{B}~{}.

Taking $A_{t}\leftarrow\alpha_{1:t}\mbox{\bf E}\Delta_{t}$ and $\mathcal{B}\leftarrow\frac{D^{2}}{2\eta}+\eta\tilde{\sigma}^{2}\alpha_{1:T}+2D\tilde{\sigma}T^{3/2}$ provides the following explicit bound,

\displaystyle\alpha_{1:T}\mbox{\bf E}\Delta_{T}\leq\frac{D^{2}}{\eta}+2\eta\tilde{\sigma}^{2}\alpha_{1:T}+4D\tilde{\sigma}T^{3/2}~{}.

Dividing by $\alpha_{1:T}$ and recalling $\alpha_{1:T}=\Theta(T^{2})$ gives,

\displaystyle\mbox{\bf E}(f(x_{T})-f(w^{*}))=\mbox{\bf E}\Delta_{T}\leq O\left(\frac{D^{2}}{\eta T^{2}}+2\eta\tilde{\sigma}^{2}+\frac{4D\tilde{\sigma}}{\sqrt{T}}\right)~{},

which concludes the proof. ∎

C.1 Proof of Lemma C.2

Proof of Lemma C.2.

Summing the inequality $A_{t}\leq\mathcal{B}+\frac{1}{2T}\sum_{\tau=1}^{T}A_{\tau}$ over $t=1,\ldots,T$ gives,

\displaystyle A_{1:T}\leq T\mathcal{B}+T\cdot\frac{1}{2T}A_{1:T}=T\mathcal{B}+\frac{1}{2}A_{1:T}~{}.

Re-arranging, we obtain,

\displaystyle A_{1:T}\leq 2T\mathcal{B}~{}.

Plugging this back into the original inequality and taking $t=T$ gives,

A_{T}\leq\mathcal{B}+\frac{1}{2T}A_{1:T}\leq 2\mathcal{B}~{},

which concludes the proof. ∎
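As a final sanity check, the sketch below samples random non-negative sequences, keeps those satisfying the assumption of Lemma C.2, and confirms the conclusion $A_{T}\leq 2\mathcal{B}$; all values are arbitrary illustrative choices:

```python
# Sketch: numerical check of Lemma C.2 with arbitrary illustrative values.
# Sample non-negative sequences, keep those satisfying the assumption
# A_t <= B + (1/2T) * sum_tau A_tau, and confirm the conclusion A_T <= 2B.
import numpy as np

rng = np.random.default_rng(3)
T, B = 5, 1.0
A = rng.uniform(0.0, 3.0 * B, size=(100_000, T))      # candidate sequences
thresh = B + A.sum(axis=1, keepdims=True) / (2 * T)   # assumption threshold
valid = np.all(A <= thresh, axis=1)                   # sequences satisfying it
assert np.all(A[valid, -1] <= 2 * B)                  # conclusion of the lemma
print(f"conclusion verified on {valid.sum()} valid sampled sequences")
```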