
On the Convergence of SGD with Biased Gradients

Ahmad Ajalloeian    Sebastian U. Stich
Abstract

We analyze the complexity of biased stochastic gradient methods (SGD), where individual updates are corrupted by deterministic, i.e., biased, error terms. We derive convergence results for smooth (non-convex) functions and give improved rates under the Polyak-Łojasiewicz condition. We quantify how the magnitude of the bias impacts the attainable accuracy and the convergence rates (sometimes leading to divergence).

Our framework covers many applications where biased gradient updates are either the only ones available, or are preferred over unbiased ones for performance reasons. For instance, in the domain of distributed learning, biased gradient compression techniques such as top-$k$ compression have been proposed as a tool to alleviate the communication bottleneck, and in derivative-free optimization only biased gradient estimators can be queried. We discuss a few guiding examples that show the broad applicability of our analysis.


1 Introduction

The stochastic gradient descent (SGD) algorithm has proven to be effective for many machine learning applications. This first-order method has been intensively studied in theory and practice in recent years (cf. Bottou et al., 2018). Whilst vanilla SGD crucially depends on unbiased gradient oracles, variations with biased gradient updates have been considered in a few application domains recently.

For instance, in the context of distributed parallel optimization where the data is split among several compute nodes, the standard mini-batch SGD updates can yield a bottleneck in large-scale systems and techniques such as structured sparsity (Alistarh et al., 2018; Wangni et al., 2018; Stich et al., 2018) or asynchronous updates (Niu et al., 2011; Li et al., 2014) have been proposed to reduce communication costs. However, sparsified or delayed SGD updates are no longer unbiased and the methods become more difficult to analyze (Stich and Karimireddy, 2019; Beznosikov et al., 2020).

Another class of methods that do not have access to unbiased gradients are zeroth-order methods which find application for optimization of black-box functions (Nesterov and Spokoiny, 2017), or for instance in deep learning for finding adversarial examples (Moosavi-Dezfooli et al., 2016; Chen et al., 2017). Some theoretical works that analyze zeroth-order training methods argue that the standard methods often operate with a biased estimator of the true gradient (Nesterov and Spokoiny, 2017; Liu et al., 2018), in contrast to SGD that relies on unbiased oracles.

Approximate gradients naturally also appear in many other applications, such as in the context of smoothing techniques, proximal updates or preconditioning (d’Aspremont, 2008; Schmidt et al., 2011; Devolder et al., 2014; Tappenden et al., 2016; Karimireddy et al., 2018).

Many standard textbook analyses of SGD require unbiased gradient information to prove convergence (Lacoste-Julien et al., 2012; Bottou et al., 2018). Yet the manifold applications mentioned above witness that SGD can converge even when it only has access to inexact or biased gradient oracles. However, all these methods and works required different analyses that had to be developed separately for every application. In this work, we reconcile insights that have been accumulated previously, explicitly or implicitly, and develop novel results, resulting in a clean framework for the analysis of a general class of biased SGD methods.

We study SGD under a biased gradient oracle model. We go beyond the standard modeling assumptions that require unbiased gradient oracles with bounded noise (Bottou et al., 2018) and allow gradient estimators to be biased, thereby covering the important applications mentioned above. We find that SGD with biased updates can converge to the same accuracy as SGD with unbiased oracles when the bias is not larger than the norm of the expected gradient. Such a multiplicative bound on the bias holds, for instance, for quantized gradient methods (Alistarh et al., 2017). Without such a relative bound on the bias, biased SGD in general only converges towards a neighborhood of the solution, where the size of the neighborhood depends on the magnitude of the bias.

Although it is a widespread view in parts of the literature that SGD only converges with unbiased oracles, our findings show that SGD does not need unbiased gradient estimates to converge.

Algorithm and Setting.

Specifically, we consider unconstrained optimization problems of the form:

f^{\star} := \min_{\mathbf{x}\in\mathbb{R}^{d}} f(\mathbf{x}) \qquad (1)

where $f\colon\mathbb{R}^{d}\to\mathbb{R}$ is a smooth function and where we assume that we only have access to a biased and noisy gradient estimator. We analyze the convergence of (stochastic) gradient descent:

\mathbf{x}_{t+1}=\mathbf{x}_{t}-\gamma_{t}\mathbf{g}_{t},\qquad \mathbf{g}_{t}=\nabla f(\mathbf{x}_{t})+\mathbf{b}_{t}+\mathbf{n}_{t}, \qquad (2)

where $\gamma_{t}$ is a sequence of step sizes and $\mathbf{g}_{t}$ is a (potentially biased) gradient oracle with zero-mean noise $\mathbf{n}_{t}$ ($\mathbb{E}\,\mathbf{n}_{t}=\mathbf{0}$) and bias terms $\mathbf{b}_{t}$. In the case $\mathbf{b}_{t}=\mathbf{0}$, the gradient oracle $\mathbf{g}_{t}$ becomes unbiased and we recover the setting of SGD. If in addition $\mathbf{n}_{t}=\mathbf{0}$ almost surely, we get back the classic Gradient Descent (GD) algorithm.

Contributions.

We show that biased SGD methods can in general only converge to a neighborhood of the solution, but we also indicate interesting special cases where the optimum can still be reached. Our framework covers smooth optimization problems (general non-convex problems and non-convex problems under the Polyak-Łojasiewicz condition) and unifies the rates for both deterministic and stochastic optimization scenarios. We discuss a host of examples that are covered by our analysis (for instance top-$k$ compression and randomized smoothing) and compare the convergence rates to prior work.

2 Related work

Optimization methods with deterministic errors have previously been studied mainly in the context of deterministic gradient methods. For instance, Bertsekas (2002) analyzes gradient descent under the same assumptions on the errors as we consider here, but does not provide concise rates. Similar, but slightly different, assumptions have been made in (Hu et al., 2020a), which considers finite-sum objectives.

Most analyses of biased gradient methods have been carried out with specific applications in mind. For instance, computing approximate (i.e., corrupted) gradient updates is for many applications computationally more efficient than computing exact gradients, and therefore there has been a natural interest in such methods (Schmidt et al., 2011; Tappenden et al., 2016; Karimireddy et al., 2018). d’Aspremont (2008); Baes (2009); Devolder et al. (2014) specifically also consider the impact of approximate updates on accelerated gradient methods and show that these schemes suffer from error accumulation, whilst non-accelerated methods often still converge under mild assumptions. Devolder et al. (2014); Devolder (2011) consider a notion of inexact gradient oracles that generalizes the notion of inexact subgradient oracles (Polyak, 1987). We will show later that their notion of inexact oracles is stronger than the oracles we consider in this work.

In distributed learning, both unbiased (Alistarh et al., 2017; Zhang et al., 2017) and biased (Dryden et al., 2016; Aji and Heafield, 2017; Alistarh et al., 2018) compression techniques have been introduced to address scalability issues and communication bottlenecks. Alistarh et al. (2018) analyze the top-$k$ compression operator, which sparsifies the gradient updates by keeping only the top $k$ components. They prove a sublinear rate that suffers a slowdown compared to vanilla SGD (this gap could be closed with additional assumptions). Biased updates that are corrected with error-feedback (Stich et al., 2018; Stich and Karimireddy, 2019) do not suffer from such a slowdown. However, these methods are not the focus of this paper. In recent work, Beznosikov et al. (2020) analyze top-$k$ compression for deterministic gradient descent, often denoted as greedy coordinate descent (Nutini et al., 2015; Karimireddy et al., 2019). We recover greedy coordinate descent convergence rates as a special case (though we do not consider coordinate-wise Lipschitzness as in those specialized works). Asynchronous gradient methods have sometimes been studied through the lens of viewing updates as biased gradients (Li et al., 2014), but the additional structure of delayed gradients admits more fine-grained analyses (Niu et al., 2011; Mania et al., 2017).

Another interesting class of biased stochastic methods has been studied in (Nesterov and Spokoiny, 2017) in the context of gradient-free optimization. They show that the finite-difference gradient estimator, which evaluates the function values (but not the gradients) at two nearby points, provides a biased gradient estimator. As a key observation, they show that randomized smoothing yields an unbiased gradient estimate of a different smooth function that is close to the original function. This observation is leveraged in their proofs. In this work, we consider general bias terms (without inherent structure to be exploited), and hence our convergence rates are slightly weaker for this special case. Randomized smoothing was further studied in (Duchi et al., 2012; Bach and Perchet, 2016).

Hu et al. (2016) study bandit convex optimization with biased noisy gradient oracles. They assume that the bias can be bounded by an absolute constant. Our notion is more general and covers a uniformly bounded bias as a special case. Other classes of biased oracles have been studied, for instance, by Hu et al. (2020b), who consider conditional stochastic optimization, Chen and Luss (2018), who consider consistent biased estimators, and Wang and Giannakis (2020), who study the statistical efficiency of Q-learning algorithms by deriving finite-time convergence guarantees for a specific class of biased stochastic approximation scenarios.

3 Formal setting and assumptions

In this section we discuss our main setting and assumptions. All our results depend on the standard smoothness assumption, and some in addition on the PL condition.

3.1 Regularity Assumptions

Assumption 1 ($L$-smoothness).

The function $f\colon\mathbb{R}^{d}\to\mathbb{R}$ is differentiable and there exists a constant $L>0$ such that for all $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$:

f(\mathbf{y}) \leq f(\mathbf{x})+\langle\nabla f(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle+\frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^{2}. \qquad (3)

Sometimes we will assume the Polyak-Łojasiewicz (PL) condition (which is implied by standard $\mu$-strong convexity, but is much weaker in general):

Assumption 2 ($\mu$-PL).

The function $f\colon\mathbb{R}^{d}\to\mathbb{R}$ is differentiable and there exists a constant $\mu>0$ such that

\|\nabla f(\mathbf{x})\|^{2} \geq 2\mu(f(\mathbf{x})-f^{\star}),\quad\forall\mathbf{x}\in\mathbb{R}^{d}. \qquad (4)

3.2 Biased Gradient Estimators

Finally, whilst the above assumptions are standard, we now introduce biased gradient oracles.

Definition 1 (Biased Gradient Oracle).

A map $\mathbf{g}\colon\mathbb{R}^{d}\times\mathcal{D}\to\mathbb{R}^{d}$ s.t.

\mathbf{g}(\mathbf{x},\xi)=\nabla f(\mathbf{x})+\mathbf{b}(\mathbf{x})+\mathbf{n}(\mathbf{x},\xi)

for a bias $\mathbf{b}\colon\mathbb{R}^{d}\to\mathbb{R}^{d}$ and zero-mean noise $\mathbf{n}\colon\mathbb{R}^{d}\times\mathcal{D}\to\mathbb{R}^{d}$, that is, $\mathbb{E}_{\xi}\,\mathbf{n}(\mathbf{x},\xi)=\mathbf{0}$, $\forall\mathbf{x}\in\mathbb{R}^{d}$.

We assume that the noise and bias terms are bounded:

Assumption 3 ($(M,\sigma^{2})$-bounded noise).

There exist constants $M,\sigma^{2}\geq 0$ such that

\mathbb{E}_{\xi}\|\mathbf{n}(\mathbf{x},\xi)\|^{2} \leq M\|\nabla f(\mathbf{x})+\mathbf{b}(\mathbf{x})\|^{2}+\sigma^{2},\quad\forall\mathbf{x}\in\mathbb{R}^{d}.
Assumption 4 ($(m,\zeta^{2})$-bounded bias).

There exist constants $0\leq m<1$ and $\zeta^{2}\geq 0$ such that

\|\mathbf{b}(\mathbf{x})\|^{2} \leq m\|\nabla f(\mathbf{x})\|^{2}+\zeta^{2},\quad\forall\mathbf{x}\in\mathbb{R}^{d}.
Remark 1.

Assumptions 3–4 and the inequality $\|\mathbf{a}+\mathbf{b}\|^{2}\leq 2\|\mathbf{a}\|^{2}+2\|\mathbf{b}\|^{2}$ imply that it further holds

\mathbb{E}_{\xi}\|\mathbf{n}(\mathbf{x},\xi)\|^{2} \leq \bar{M}\|\nabla f(\mathbf{x})\|^{2}+\bar{\sigma}^{2},\quad\forall\mathbf{x}\in\mathbb{R}^{d},

for $\bar{M}\leq 2M(1+m)\leq 4M$ and $\bar{\sigma}^{2}\leq\sigma^{2}+2M\zeta^{2}$.

Our assumption on the noise is similar to (Bottou, 2010; Stich, 2019) and generalizes the standard bounded noise assumption (which only admits $M=0$). By allowing the noise to grow with the gradient norm and bias, an admissible $\sigma^{2}$ parameter can often be chosen much smaller than under the constraint $M=0$.

Our assumption on the bias is similar to (Bertsekas, 2002). For the special case $\zeta^{2}=0$ one might expect (and we prove later) that gradient methods can still converge for any bias strength $0\leq m<1$, as in expectation the corrupted gradients are still aligned with the true gradient, $\mathbb{E}_{\xi}\langle\nabla f(\mathbf{x}),\mathbf{g}(\mathbf{x},\xi)\rangle>0$ (see e.g. Bertsekas, 2002). In the case $\zeta^{2}>0$, gradient based methods can only converge to a region where $\|\nabla f(\mathbf{x})\|^{2}=\mathcal{O}(\zeta^{2})$. We make these claims precise in Section 4 below.

A slightly different condition than Assumption 4 was considered in (Hu et al., 2020a), but they measure the relative error with respect to the stochastic gradient $\nabla f(\mathbf{x})+\mathbf{n}(\mathbf{x},\xi)$, and not with respect to $\nabla f(\mathbf{x})$ as we do here. Whilst we show that biased gradient methods converge for any $m<1$, Hu et al. (2020a), for instance, require a stronger condition $m\leq\frac{\mu}{L}$ on $\mu$-strongly convex functions (see also Remark 7 below).

4 Biased SGD Framework

In this section we study the convergence of stochastic gradient descent (SGD) with biased gradient oracles as introduced in Definition 1, see also Algorithm 1. For simplicity, we will assume constant stepsize $\gamma_{t}=\gamma$, $\forall t$, in this section.

  Input: step size sequence $(\gamma_{t})_{t\geq 0}$, $\mathbf{x}_{0}\in\mathbb{R}^{d}$
  for $t=0$ to $T-1$ do
     compute biased stochastic gradient $\mathbf{g}_{t}=\mathbf{g}(\mathbf{x}_{t},\xi)$
     $\mathbf{x}_{t+1}\leftarrow\mathbf{x}_{t}-\gamma_{t}\mathbf{g}_{t}$
  end for
  Return: $\mathbf{x}_{T}$
Algorithm 1 Stochastic Gradient Descent (SGD)
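For concreteness, the following is a minimal NumPy sketch of Algorithm 1; the helper names (`grad_f`, `bias`) and the Gaussian noise model are illustrative assumptions, not part of the analysis.

```python
# A minimal sketch of Algorithm 1 with a biased gradient oracle (NumPy).
# grad_f and bias are user-supplied callables (hypothetical names); the
# noise is assumed Gaussian purely for illustration.
import numpy as np

def biased_sgd(grad_f, bias, x0, gamma, T, sigma=0.0, rng=None):
    """Run T steps of x_{t+1} = x_t - gamma * (grad_f(x) + bias(x) + n_t)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        n = sigma * rng.standard_normal(x.shape)  # zero-mean noise term n_t
        g = grad_f(x) + bias(x) + n               # biased stochastic gradient g_t
        x = x - gamma * g                         # update (2)
    return x
```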

4.1 Modified Descent Lemma

A key step in our proof is to derive a modified version of the descent lemma for smooth functions (cf. Nesterov, 2004). We give all missing proofs in the appendix.

Lemma 2.

Let $f$ be $L$-smooth and $\mathbf{x}_{t+1}$, $\mathbf{x}_{t}$ as in (2), with gradient oracle as in Assumptions 3–4. Then for any stepsize $\gamma\leq\frac{1}{(M+1)L}$ it holds

\mathbb{E}_{\xi}\left[f(\mathbf{x}_{t+1})-f(\mathbf{x}_{t})\mid\mathbf{x}_{t}\right] \leq \frac{\gamma(m-1)}{2}\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{\gamma}{2}\zeta^{2}+\frac{\gamma^{2}L}{2}\sigma^{2}. \qquad (5)

When $M=m=\zeta^{2}=0$ we recover (even with the same constants) the standard descent lemma.

4.2 Gradient Norm Convergence

We first show that Lemma 2 allows us to derive convergence on all smooth (including non-convex) functions, but only with respect to the gradient norm, as mentioned in Section 3.2 above. By rearranging (5) and with the notation $F_{t}:=\mathbb{E}\,f(\mathbf{x}_{t})-f^{\star}$, $F=F_{0}$, we obtain

\frac{(1-m)}{2}\,\mathbb{E}\|\nabla f(\mathbf{x}_{t})\|^{2} \leq \frac{F_{t}-F_{t+1}}{\gamma}+\frac{\zeta^{2}}{2}+\frac{\gamma L\sigma^{2}}{2}

and by summing and averaging over $t$,

\frac{1-m}{2}\Psi_{T} \leq \frac{F_{0}}{T\gamma}+\frac{\zeta^{2}}{2}+\frac{\gamma L\sigma^{2}}{2}

where $\Psi_{T}:=\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f(\mathbf{x}_{t})\|^{2}$. We summarize this observation in the next lemma:

Lemma 3.

Under Assumptions 1, 3, 4, and for any stepsize $\gamma\leq\frac{1}{(M+1)L}$, it holds after $T$ steps of SGD:

\Psi_{T} \leq \left(\frac{2F}{T\gamma(1-m)}+\frac{\gamma L\sigma^{2}}{1-m}\right)+\frac{\zeta^{2}}{1-m},

where $\Psi_{T}$ is defined as above.

The quantity on the left hand side is equal to the expected gradient norm of a uniformly at random selected iterate, $\Psi_{T}=\mathbb{E}\|\nabla f(\mathbf{x}_{\rm out})\|^{2}$ for $\mathbf{x}_{\rm out}\in_{\rm u.a.r.}\{\mathbf{x}_{0},\dots,\mathbf{x}_{T-1}\}$, and is a standard convergence measure. Alternatively one could consider $\min_{t\in[T]}\mathbb{E}\|\nabla f(\mathbf{x}_{t})\|^{2}\leq\Psi_{T}$.

If there is no additive bias, $\zeta^{2}=0$, then Lemma 3 can be used to show convergence of SGD by carefully choosing the stepsize to minimize the term in the round bracket. However, with a bias $\zeta^{2}>0$, SGD can only converge to a $\mathcal{O}(\zeta^{2})$ neighborhood of the solution.

Theorem 4.

Under Assumptions 1, 3, 4, and choosing the stepsize $\gamma=\min\left\{\frac{1}{(M+1)L},\frac{\epsilon(1-m)+\zeta^{2}}{2L\sigma^{2}}\right\}$ for $\epsilon\geq 0$,

T=\mathcal{O}\left(\frac{M+1}{\epsilon(1-m)+\zeta^{2}}+\frac{\sigma^{2}}{\epsilon^{2}(1-m)^{2}+\zeta^{4}}\right)\cdot LF

iterations are sufficient to obtain $\Psi_{T}=\mathcal{O}\bigl(\epsilon+\frac{\zeta^{2}}{1-m}\bigr)$.

If there is no bias, $\zeta^{2}=m=0$, then this result recovers the standard convergence rate of gradient descent (when also $\sigma^{2}=M=0$) or of stochastic gradient descent in general. When the biased gradients are correlated with the gradient (i.e., $\zeta^{2}=0$, $m>0$), then the exact solution can still be found, though the number of iterations increases by a factor of $(1-m)^{-1}$ in the deterministic case, or by $(1-m)^{-2}$ if the noise term dominates the rate. When there is an uncorrelated bias, $\zeta^{2}>0$, then SGD can only converge to a neighborhood of a stationary point, and a finite number of iterations suffices to do so. This result is tight in general, as we show in Remark 5 below.

Remark 5 (Tightness of Error-floor).

Consider an $L$-smooth function $f\colon\mathbb{R}^{d}\to\mathbb{R}$ with gradient oracle $\mathbf{g}(\mathbf{x})=\nabla f(\mathbf{x})+\rho(\mathbf{x})\mathbf{b}$, for $0\leq m<1$ and $\mathbf{b}\in\mathbb{R}^{d}$ with $\|\mathbf{b}\|^{2}=\zeta^{2}$ and $\rho(\mathbf{x})^{2}:=1+\frac{m}{\zeta^{2}}\|\nabla f(\mathbf{x})\|^{2}$. This oracle has $(m,\zeta^{2})$-bounded bias. For any stationary point $\mathbf{x}^{\star}$ of the iteration, i.e., with $\mathbf{g}(\mathbf{x}^{\star})=\mathbf{0}$, it holds $\nabla f(\mathbf{x}^{\star})=-\rho(\mathbf{x}^{\star})\mathbf{b}$, i.e., $\|\nabla f(\mathbf{x}^{\star})\|^{2}=\frac{\zeta^{2}}{1-m}$.

4.3 Convergence under PL condition

To show convergence under the PL condition, we observe that (5) and (4) imply

F_{t+1} \leq \left(1-\gamma\mu(1-m)\right)F_{t}+\frac{\gamma\zeta^{2}}{2}+\frac{\gamma^{2}L\sigma^{2}}{2}. \qquad (6)

By unrolling this recursion we obtain the next theorem.

Theorem 6.

Under Assumptions 1–4, it holds for any stepsize $\gamma\leq\frac{1}{(M+1)L}$ that

F_{T} \leq (1-\gamma\mu(1-m))^{T}F_{0}+\frac{1}{2}\Xi,

where

\Xi := \frac{\zeta^{2}}{\mu(1-m)}+\frac{\gamma L\sigma^{2}}{\mu(1-m)}. \qquad (7)

Consequently, by choosing the stepsize $\gamma=\min\bigl\{\frac{1}{(M+1)L},\frac{\epsilon\mu(1-m)+\zeta^{2}}{L\sigma^{2}}\bigr\}$, for $\epsilon\geq 0$,

T=\tilde{\mathcal{O}}\left((M+1)\log\frac{1}{\epsilon}+\frac{\sigma^{2}}{\epsilon\mu(1-m)+\zeta^{2}}\right)\cdot\frac{\kappa}{1-m}

iterations suffice for $F_{T}=\mathcal{O}\bigl(\epsilon+\frac{\zeta^{2}}{\mu(1-m)}\bigr)$. Here $\kappa:=\frac{L}{\mu}$ and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic factors.

This result shows that SGD with constant stepsize converges up to the error floor $\mathcal{O}(\zeta^{2}+\gamma L\sigma^{2})$ given in (7). The second term in this bound can be decreased by choosing a small stepsize $\gamma$; however, the first term remains constant, regardless of the choice of the stepsize. We investigate this error floor later in the experiments.

Without bias terms, we recover the best known rates under the PL condition (Karimi et al., 2016). Compared to the SGD convergence rate for $\mu$-strongly convex functions, our bound is worse by a factor of $\kappa$ compared to (Stich, 2019), but we consider convergence of the function value of the last iterate $F_{T}$ here.

Remark 7 (Convergence on Strongly Convex Functions).

Theorem 6 applies also to $\mu$-strongly convex functions (which are $\mu$-PŁ) and shows convergence in function value for any $0\leq m<1$ (if $\zeta^{2}=0$). By $\mu$-strong convexity this implies iterate convergence, $\|\mathbf{x}_{t}-\mathbf{x}^{\star}\|^{2}\leq\frac{2}{\mu}F_{t}$, however at the expense of an additional $\frac{1}{\mu}$ factor. For proving a stronger result, for instance a one-step progress bound for $\|\mathbf{x}_{t}-\mathbf{x}^{\star}\|^{2}$ (as is standard in the unbiased case (Lacoste-Julien et al., 2012; Stich, 2019)), it is necessary to impose a stronger condition $m\leq\frac{1}{\kappa}$, similar to (Hu et al., 2020a).

4.4 Divergence on Convex Functions

Although Theorem 4 shows convergence of the gradient norms, this does not imply convergence in function value in general (unless the gradient norm is related to the function values, as is for instance guaranteed by (4)). We now give an example of a weakly-convex (non-strongly convex) smooth function on which Algorithm 1 can diverge.

Example 8.

Consider the Huber-loss function $h\colon\mathbb{R}\to\mathbb{R}$,

h(x)=\begin{cases}|x|, & |x|>1\\ \frac{1}{2}x^{2}+\frac{1}{2}, & |x|\leq 1\end{cases}

and define the gradient oracle $g(x)=h'(x)-2$. This is a biased oracle with $\zeta^{2}\leq 4$, and it is easy to observe that the iterates in (2) diverge for any stepsize $\gamma>0$, given $x_{0}>1$.

This result shows that in general it is not possible to converge when $\zeta^{2}\neq 0$, and that the approximate solutions found by SGD with unbiased gradients (in Example 8 converging to 0) and by SGD with biased gradients can be arbitrarily far apart. It is an interesting open question whether SGD can be made robust to biased updates, i.e., whether this diverging behaviour can be prevented through modified updates (or stronger assumptions on the bias). We leave this for future work.
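The divergence in Example 8 is easy to reproduce numerically; the following small sketch (ours, under the example's setup) shows the iterates drifting to infinity, since $g(x)=h'(x)-2=-1$ for all $x>1$.

```python
# Numerical illustration of Example 8: SGD with the biased oracle
# g(x) = h'(x) - 2 diverges on the Huber loss whenever x0 > 1.
import numpy as np

def h_prime(x):
    # derivative of the Huber loss: sign(x) for |x| > 1, x otherwise
    return np.sign(x) if abs(x) > 1 else x

x, gamma = 2.0, 0.1
for t in range(5):
    x -= gamma * (h_prime(x) - 2.0)  # for x > 1 this is x <- x + gamma
    print(t, x)                      # x grows by gamma at every step
```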

5 Discussion of Examples

In this section we discuss some examples that can be considered as special cases of our biased gradient framework. Table 1 shows a summary of these examples together with the values of the respective parameters.

Table 1: Special cases of our biased gradient framework. Each column represents the value of our framework parameters in an example. Here $\mathbf{n}(\mathbf{x})$ is a $(M,\sigma^{2})$-bounded noise term and $\mathbf{u}$ is an isotropic Gaussian with variance 1.

| | def | $M$ | $\sigma^{2}$ | $1-m$ | $\zeta^{2}$ |
| --- | --- | --- | --- | --- | --- |
| top-$k$ compression | $\mathbf{g}(\mathbf{x})=\text{top-}k(\nabla f(\mathbf{x}))$ | 0 | 0 | $\frac{k}{d}$ | 0 |
| random-$k$ compression | $\mathbf{g}(\mathbf{x})=\text{rand-}k(\nabla f(\mathbf{x}))$ | 0 | 0 | $\frac{k}{d}$ | 0 |
| random-$k$ compression (of stochastic g.) | $\mathbf{g}(\mathbf{x})=\text{rand-}k(\nabla f(\mathbf{x})+\mathbf{n}(\mathbf{x}))$ | $(1+M)\frac{d}{k}-1$ | $\frac{k}{d}\sigma^{2}$ | $\frac{k}{d}$ | 0 |
| Gaussian smoothing | $\mathbf{g}_{\rm GS}(\mathbf{x})=\frac{f(\mathbf{x}+\tau\mathbf{u})-f(\mathbf{x})}{\tau}\mathbf{u}$ | $4(d+4)$ | $3\tau^{2}L^{2}(d+4)^{3}$ | 1 | $\frac{\tau^{2}}{4}L^{2}(d+3)^{3}$ |
| $(\delta,L)$-oracle (see text) | | 0 | 0 | 1 | $2\delta L$ |
| stochastic $(\delta,L)$-oracle (see appendix) | | 0 | $\sigma^{2}$ | 1 | $2\delta L$ |
| compressed gradient | $\mathbf{g}(\mathbf{x})=\mathcal{C}(\nabla f(\mathbf{x}))$, $\mathcal{C}\in C(\delta)$ | 0 | 0 | $\delta$ | 0 |

5.1 Top-$k$ and Random-$k$ sparsification

The well-known top-$k$ sparsification operator (Dryden et al., 2016; Alistarh et al., 2018) that selects the $k$ largest coordinates of the gradient vector $\nabla f(\mathbf{x})$ is a biased gradient oracle with $m\leq\frac{d-k}{d}$ and $\zeta^{2}=0$. This means that GD with top-$k$ compression converges to a stationary point on smooth functions, and under the PL condition the convergence rate is $\mathcal{O}\bigl(\frac{d}{k}\kappa\log\frac{1}{\epsilon}\bigr)$ when there is no additional noise. This recovers the rate of greedy coordinate descent when analyzed in the Euclidean norm (Nutini et al., 2015).

The biased random-$k$ sparsification operator randomly selects $k$ out of the $d$ coordinates of the stochastic gradient and sets the other entries to zero. As $m=1-\frac{k}{d}$, this implies a $\frac{d}{k}$ slowdown over the corresponding rates of GD (recovering the rate of random coordinate descent (Nesterov and Spokoiny, 2017; Richtárik and Takác, 2016)) and SGD, with the expected asymptotically dominant term $\mathcal{O}\bigl(\frac{d\sigma^{2}}{k}\bigr)$ (cf. Chaturapruek et al., 2015). In the appendix we further discuss differences to unbiased random-$k$ sparsification.¹

¹ By rescaling the obtained sparse vector by the factor $\frac{d}{k}$ one obtains an unbiased estimator (with higher variance). With these sparsification operators the dominant term in the rate remains $\mathcal{O}\bigl(\frac{d\sigma^{2}}{k}\bigr)$, but the optimization terms can be improved as a benefit of choosing larger stepsizes. However, this observation does not show a limitation of the biased rand-$k$ oracle, but is merely a consequence of applying our general result to this special case. Both algorithms are equivalent when the stepsize is properly tuned.
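For concreteness, a minimal NumPy sketch of the two sparsification operators discussed above (the function names are ours):

```python
# Sketches of the top-k and (biased) rand-k sparsification operators (NumPy).
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng=None):
    """Keep k uniformly random entries of x; note E[rand_k(x)] = (k/d) x."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out
```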

5.2 Zeroth-Order Gradient

Next, we discuss the zeroth-order gradient oracle obtained by Gaussian smoothing (cf. Nesterov and Spokoiny, 2017). This oracle is defined as (Polyak, 1987):

\mathbf{g}_{\rm GS}(\mathbf{x})=\frac{f(\mathbf{x}+\tau\mathbf{u})-f(\mathbf{x})}{\tau}\cdot\mathbf{u} \qquad (8)

where $\tau>0$ is a smoothing parameter and $\mathbf{u}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ a random Gaussian vector. Nesterov and Spokoiny (2017, Lemma 3 & Theorem 4) provide estimates for the bias:

\|\mathbb{E}\,\mathbf{g}_{\rm GS}(\mathbf{x})-\nabla f(\mathbf{x})\|^{2} \leq \frac{\tau^{2}}{4}L^{2}(d+3)^{3},

and the second moment:

\mathbb{E}\|\mathbf{g}_{\rm GS}(\mathbf{x})\|^{2} \leq 3\tau^{2}L^{2}(d+4)^{3}+4(d+4)\|\mathbb{E}\,\mathbf{g}_{\rm GS}(\mathbf{x})\|^{2}.

We conclude that Assumptions 3 and 4 hold with:

\zeta^{2}=\frac{\tau^{2}}{4}L^{2}(d+3)^{3},\quad m=0,
\sigma^{2}=3\tau^{2}L^{2}(d+4)^{3},\quad M=4(d+4).

Nesterov and Spokoiny (2017) show that on smooth functions, gradient-free oracles in general slow down the rates by a factor of $\mathcal{O}(d)$, similar to what we observe here with $M=\mathcal{O}(d)$. However, their oracle is stronger; for instance, they can show convergence (up to $\mathcal{O}(\zeta^{2})$) on weakly-convex functions, which is not possible with our more general oracle (see Example 8).
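A minimal sketch of the estimator (8), assuming only zeroth-order (function value) access to $f$:

```python
# Sketch of the two-point Gaussian smoothing estimator (8) (NumPy).
import numpy as np

def g_gs(f, x, tau, rng=None):
    """Return (f(x + tau*u) - f(x)) / tau * u with u ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape)
    return (f(x + tau * u) - f(x)) / tau * u

# example usage on a quadratic: g = g_gs(lambda z: 0.5 * z @ z, np.ones(10), 1e-3)
```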

5.3 Inexact gradient oracles

Devolder et al. (2014) introduced the notion of inexact gradient oracles. A $(\delta,L)$-gradient oracle for $\mathbf{y}\in\mathbb{R}^{d}$ is a pair $(\tilde{f}(\mathbf{y}),\tilde{\mathbf{g}}(\mathbf{y}))$ that satisfies $\forall\mathbf{x}\in\mathbb{R}^{d}$:

0 \leq f(\mathbf{x})-\left(\tilde{f}(\mathbf{y})+\langle\tilde{\mathbf{g}}(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle\right) \leq \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^{2}+\delta.

We have by (Devolder et al., 2014):

\|\mathbf{b}(\mathbf{x})\|^{2} = \|\nabla f(\mathbf{x})-\tilde{\mathbf{g}}(\mathbf{x})\|^{2} \leq 2\delta L,

hence we can conclude: $\zeta^{2}=2\delta L$ and $m=0$.

It is important to observe that the notion of a $(\delta,L)$-oracle is stronger than what we consider in Assumption 4. To see this, consider again the Huber loss from Example 8 and a $(\delta,L)$ gradient estimator $\tilde{\mathbf{g}}(x)$. For $x\to\infty$, it must hold that $\|\tilde{\mathbf{g}}(x)-\nabla h(x)\|\to 0$, otherwise the condition of the $(\delta,L)$-oracle is violated. In contrast, the bias term in Assumption 4 can be constant, regardless of $x$. Devolder (2011, preprint) generalized the notion of $(\delta,L)$-oracles to the stochastic case, which we discuss in the appendix.

5.4 Biased compression operators

The notion of top-$k$ compressors has been generalized to arbitrary $\delta$-compressors, defined as

Definition 2 ($\delta$-compressor).

$\mathcal{C}\in C(\delta)$ if $\exists\,\delta>0$ s.t.:

\mathbb{E}\left[\|\mathcal{C}(\mathbf{g})-\mathbf{g}\|^{2}\right] \leq (1-\delta)\|\mathbf{g}\|^{2},\quad\forall\mathbf{g}\in\mathbb{R}^{d}.

This notion has for instance been used in (Beznosikov et al., 2020; Stich et al., 2018). For this class of operators it holds

m \leq 1-\delta,\qquad \zeta^{2}=0.

In the noiseless case, our results show convergence for arbitrarily small $\delta>0$ (i.e., $m<1$), and we recover the rates given in (Beznosikov et al., 2020, Theorem 1).
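As a quick sanity check (our own sketch), Definition 2 can be verified empirically for the top-$k$ operator, which is a $\delta$-compressor with $\delta=\frac{k}{d}$:

```python
# Empirical check that top-k satisfies Definition 2 with delta = k/d:
# the relative error ||C(g) - g||^2 / ||g||^2 should never exceed 1 - k/d.
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

d, k = 10, 3
rng = np.random.default_rng(0)
worst = max(
    np.linalg.norm(top_k(g, k) - g) ** 2 / np.linalg.norm(g) ** 2
    for g in rng.standard_normal((1000, d))
)
print(worst, "<=", 1 - k / d)  # worst observed ratio vs. 1 - delta
```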

5.5 Delayed Gradients

Another example of updates that can be viewed as biased gradients are delayed gradients, arising for instance in asynchronous implementations of SGD (Niu et al., 2011). For instance, Li et al. (2014, Proof of Theorem 2) provide a bound on the bias. See also (Stich and Karimireddy, 2019) for a more refined analysis in the stochastic case. However, these bounds are not of the form of Assumption 4, as they depend on past gradients and not only on the current gradient at $\mathbf{x}_{t}$. We leave generalizations of Assumption 4 for future work (moreover, delayed gradient methods have already been analyzed tightly with other tools).

6 Experiments

In this section we verify numerically whether our theoretical bounds are aligned with the numerical performance of SGD with biased gradient estimators. In particular, we study whether the predicted error floor (7) of the form $\mathcal{O}(\zeta^{2}+\gamma L\sigma^{2})$ can be observed in practice.

Synthetic Problem Instance.

We consider a simple least squares objective function $f(\mathbf{x})=\frac{1}{2}\|A\mathbf{x}\|_{2}^{2}$, in dimension $d=10$, where the matrix $A$ is chosen according to the Nesterov worst function (Nesterov, 2004). We choose this function since its Hessian $A^{\top}A$ has a very large condition number, and hence it is a hard function to optimize. By the choice of $A$ this objective function is strongly convex. We control the stochastic noise by adding Gaussian noise $\mathbf{n}_{t}\sim\mathcal{N}(0,\sigma^{2})$ to every gradient.
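The sketch below shows one plausible construction of this instance; the exact matrix $A$ used in our experiments is not reproduced here, so the bidiagonal choice (whose Gram matrix $A^{\top}A$ is the ill-conditioned tridiagonal matrix of Nesterov's worst-case function) is an illustrative assumption.

```python
# One plausible construction of the synthetic least-squares instance (NumPy);
# the bidiagonal A below is an assumption, chosen so that A^T A is the
# ill-conditioned tridiagonal matrix from Nesterov's worst-case function.
import numpy as np

d = 10
A = np.eye(d) - np.eye(d, k=-1)  # 1 on the diagonal, -1 on the subdiagonal

def f(x):
    return 0.5 * np.linalg.norm(A @ x) ** 2

def grad_f(x):
    return A.T @ (A @ x)

print(np.linalg.cond(A.T @ A))  # large condition number of the Hessian
```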

6.1 Experiment with a synthetic additive bias

In our first setup (Figure 1), we consider a synthetic case in which we add a constant value $\zeta\in\{0,0.1,0.001\}$ to the gradient to control the bias, and vary the noise $\sigma^{2}\in\{0,1\}$. This corresponds to the setting where we have $m=0$, $\zeta^{2}>0$, $M=0$, $\sigma^{2}\geq 0$.

Discussion of Results. Figure 1 highlights the effect of the parameters $\sigma^{2}$ and $\zeta^{2}$ on the rate of convergence and on the neighborhood to which the method converges. Notice that a small bias value only affects the convergence neighborhood when there is no stochastic noise. This matches the error floor we derived in (7). When $\zeta^{2}$ is small compared to $\sigma^{2}$, i.e., the bias is insignificant compared to the noise, the second term in (7) determines the neighborhood of the optimal solution to which SGD with constant stepsize converges. In contrast, when $\zeta$ is large, the first term in (7) dominates the second and determines the convergence neighborhood. As mentioned earlier in Section 4.3, the second term in (7) also depends on the stepsize, meaning that by increasing (decreasing) the stepsize the effect of $\sigma^{2}$ on the convergence neighborhood can be increased (decreased).

Figure 1: Effect of the parameters $\zeta$ and $\sigma$ on the error floor when optimizing $f(\mathbf{x})$, $\mathbf{x}\in\mathbb{R}^{10}$. Here we use a fixed stepsize $\gamma=0.01$ and we set $M=m=0$.
Figure 2: Effect of the parameters $\sigma^{2}$ and $k$, which change the noise and bias level respectively, for random-$k$ sparsification. Here we optimize $f(\mathbf{x})$, $\mathbf{x}\in\mathbb{R}^{10}$, and use a fixed stepsize $\gamma=0.01$.
Figure 3: Effect of the parameters $\sigma^{2}$ and $k$, which change the noise and bias level respectively, for top-$k$ sparsification. Here we optimize $f(\mathbf{x})$, $\mathbf{x}\in\mathbb{R}^{10}$, and use a fixed stepsize $\gamma=0.01$.

6.2 Experiment with Top-$k$ and Random-$k$ sparsification

In contrast to the previous section, we leave the completely synthetic setup and consider the structured bias that arises from compressing the gradients with either the random-$k$ or top-$k$ method (Alistarh et al., 2018); the results are displayed in Figures 2 and 3. We use the same objective function $f$ as introduced earlier. In each iteration we compress the stochastic gradient, $\sigma^{2}\in\{0,1,100\}$, by the top-$k$ or random-$k$ method, i.e.,

\mathbf{g}(\mathbf{x})=\mathcal{C}(\nabla f(\mathbf{x})+n(\mathbf{x})),\quad n(\mathbf{x})\sim\mathcal{N}(0,\sigma^{2})

where $\mathcal{C}$ is either the top-$k$ or rand-$k$ operator, for $\frac{k}{d}\in\{0.1,1\}$. As we showed in Section 5, this corresponds to the setting $\zeta^{2}=0$ for rand-$k$, and $\zeta^{2}>0$ for top-$k$, with $m=1-\frac{k}{d}$.
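A compact sketch of this compressed stochastic gradient oracle (the per-coordinate Gaussian noise is our illustrative reading of $n(\mathbf{x})\sim\mathcal{N}(0,\sigma^{2})$):

```python
# Sketch of the oracle g(x) = C(grad f(x) + n(x)) with a top-k compressor.
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def compressed_oracle(grad_f, x, k, sigma, rng):
    noise = sigma * rng.standard_normal(x.shape)  # zero-mean noise n(x)
    return top_k(grad_f(x) + noise, k)            # compress the noisy gradient
```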

Error floor. Figures 2 and 3 show the effect of the parameters $\sigma^{2}$ and $k$ (which determine the level of noise and bias respectively) on the rate of convergence as well as on the error floor, when using random-$k$ and top-$k$ compressors and a constant stepsize $\gamma$.

Convergence rate. In Figure 5 we show the convergence for different noise levels $\sigma^{2}$ and compression parameters $k$. For each configuration and algorithm, we tuned the stepsize $\gamma$ to reach the target accuracy (set by $\epsilon$) in as few iterations as possible.

Discussion of Results.

We can see that in this case, even in the presence of stochastic noise, random-$k$ and top-$k$ still converge to the same level as without compression, albeit at a slower rate. We note that top-$k$ consistently converges faster than rand-$k$ for the same values of $k$. The error floor for both schemes is mainly determined by the strength of the noise $\sigma^{2}$, which dominates in (7). In Figure 5 we see that top-$k$ and random-$k$ can still converge to the same error level as the scheme without compression, but at a slower rate. The convergence rate of the compression methods becomes slower when decreasing the compression parameter $k$, matching the rate we derived in Theorem 6.

Figure 4: Effect of the parameter $\tau$ (the smoothing parameter) on the error floor when the zeroth-order gradient is used. Here we optimize $f(\mathbf{x})$, $\mathbf{x}\in\mathbb{R}^{10}$, and use a fixed stepsize $\gamma=0.01$.
Figure 5: Convergence of $f(\mathbf{x})=\frac{1}{2}\|A\mathbf{x}\|^{2}$, $\mathbf{x}\in\mathbb{R}^{10}$, to target accuracy $\epsilon=5\times 10^{-4}$ for different problem settings ($\sigma^{2}$ increasing to the bottom and $k$ decreasing to the left) and different compression methods. Stepsizes were tuned for each setting individually to reach the target accuracy in as few iterations as possible.

6.3 Experiment with zeroth-order gradient oracle

We optimize the function $f$ using the zeroth-order gradient oracle defined in Section 5.2. The smoothing parameter $\tau$ affects both the values of $\sigma^{2}$ and $\zeta^{2}$ and hence the error floor.

Discussion of Results. Figure 4 shows the error floor for the zeroth-order gradient method with different values of $\tau$. We can observe that the difference between the error floors matches the theory, i.e., $\sigma^{2}+\zeta^{2}=\mathcal{O}(\tau^{2})$.

Figure 6: Training error when training ResNet18 on the CIFAR-10 dataset. For the compression methods, only $1\%$ of the coordinates were kept, i.e., $k=0.01\cdot d$.

6.4 DL Experiment on CIFAR-10 with ResNet18

We use the PyTorch framework (Paszke et al., 2019) to empirically compare top-$k$ and random-$k$ with SGD without compression for training Resnet+BN+Dropout (ResNet18) (He et al., 2016) on the CIFAR-10 dataset (Krizhevsky, 2012). Following the standard practice in compression schemes (Lin et al., 2018; Wang et al., 2018), we apply the compression layer-wise. We use SGD with momentum and tune the momentum parameter $\beta$ and the learning rate separately for each setting (see Appendix C). Although SGD with momentum is not covered by our analysis, we use this method since it is the state-of-the-art training scheme in deep learning and is required to reach a good accuracy. In each setting we trained the neural network for 180 epochs and report the cross entropy error on the training data.
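A hedged PyTorch sketch of how such layer-wise top-$k$ compression can be applied between the backward pass and the optimizer step (the helper below is our own illustration, not the exact training code):

```python
# Layer-wise top-k gradient compression, to be called after loss.backward()
# and before optimizer.step(); illustrative sketch only.
import torch

def compress_gradients_topk(model, ratio=0.01):
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.flatten()
        k = max(1, int(ratio * g.numel()))
        out = torch.zeros_like(g)
        _, idx = torch.topk(g.abs(), k)    # indices of the k largest |g_i|
        out[idx] = g[idx]
        p.grad.copy_(out.view_as(p.grad))  # overwrite the gradient in place
```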

Discussion of results. Figure 6 shows the convergence. We can observe that, similar to Figures 2 and 3, top-$k$ can still converge to the same error floor as SGD, but at a slower rate. Rand-$k$ is more severely impacted by the $\mathcal{O}(\frac{d}{k})$ slowdown predicted by theory and would require more iterations to converge to higher accuracy.

7 Conclusion

We provide a convergence analysis of SGD with biased gradient oracles. We reconcile insights that have been accumulated previously, explicitly or implicitly, by presenting a clean framework for the analysis of a general class of biased SGD methods. We show concisely the influence of the bias and highlight important cases where SGD converges with biased oracles. The simplicity of the framework will allow these findings to be broadly integrated, and the limitations we have identified may inspire targeted research efforts that can increase the robustness of SGD.


Appendix A Deferred Proofs for Section 4

A.1 Proof of Lemma 2

By the quadratic upper bound (3) and Assumption 3:

\mathbb{E}\,f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_{t})-\gamma\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}\,\mathbf{g}_{t}\rangle+\frac{\gamma^{2}L}{2}\left(\mathbb{E}\|\mathbf{g}_{t}-\mathbb{E}\,\mathbf{g}_{t}\|^{2}+\|\mathbb{E}\,\mathbf{g}_{t}\|^{2}\right)
= f(\mathbf{x}_{t})-\gamma\langle\nabla f(\mathbf{x}_{t}),\nabla f(\mathbf{x}_{t})+\mathbf{b}_{t}\rangle+\frac{\gamma^{2}L}{2}\left(\mathbb{E}\|\mathbf{n}_{t}\|^{2}+\|\nabla f(\mathbf{x}_{t})+\mathbf{b}_{t}\|^{2}\right)
\leq f(\mathbf{x}_{t})-\gamma\langle\nabla f(\mathbf{x}_{t}),\nabla f(\mathbf{x}_{t})+\mathbf{b}_{t}\rangle+\frac{\gamma^{2}L}{2}\left((M+1)\|\nabla f(\mathbf{x}_{t})+\mathbf{b}_{t}\|^{2}+\sigma^{2}\right)

By the choice of the stepsize, $\gamma\leq\frac{1}{(M+1)L}$, and Assumption 4:

\mathbb{E}\,f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_{t})+\frac{\gamma}{2}\left(-2\langle\nabla f(\mathbf{x}_{t}),\nabla f(\mathbf{x}_{t})+\mathbf{b}_{t}\rangle+\|\nabla f(\mathbf{x}_{t})+\mathbf{b}_{t}\|^{2}\right)+\frac{\gamma^{2}L}{2}\sigma^{2}
= f(\mathbf{x}_{t})+\frac{\gamma}{2}\left(-\|\nabla f(\mathbf{x}_{t})\|^{2}+\|\mathbf{b}_{t}\|^{2}\right)+\frac{\gamma^{2}L}{2}\sigma^{2}
\leq f(\mathbf{x}_{t})+\frac{\gamma}{2}(m-1)\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{\gamma}{2}\zeta^{2}+\frac{\gamma^{2}L}{2}\sigma^{2}.

This concludes the proof. \Box

A.2 Proof of Theorem 4

By Lemma 3 we have for any $T\geq 1$ and $\gamma\leq\frac{1}{(M+1)L}$ that

\Psi_{T} \leq \left(\frac{2F}{T\gamma(1-m)}+\frac{\gamma L\sigma^{2}}{1-m}\right)+\frac{\zeta^{2}}{1-m} =: \Theta(T,\gamma),

where we denote the right hand side by Θ(T,γ)\Theta(T,\gamma).

By choosing $T=\max\left\{\frac{4(M+1)FL}{\epsilon(1-m)},\frac{8FL\sigma^{2}}{\epsilon^{2}(1-m)^{2}}\right\}$ and $\gamma=\min\left\{\frac{1}{(M+1)L},\frac{\epsilon(1-m)}{2L\sigma^{2}}\right\}$, we observe $\Theta(T,\gamma)\leq\epsilon+\frac{\zeta^{2}}{1-m}$. To see this, assume first that the minimum is attained for $\gamma=\frac{\epsilon(1-m)}{2L\sigma^{2}}$. Then we verify

\Theta(T,\gamma)=\frac{4FL\sigma^{2}}{T\epsilon(1-m)^{2}}+\frac{\epsilon}{2}+\frac{\zeta^{2}}{1-m} \leq \epsilon+\frac{\zeta^{2}}{1-m}

by the choice of $T$. Similarly, if the minimum is attained for $\gamma=\frac{1}{(M+1)L}$, then $\frac{\sigma^{2}}{(M+1)(1-m)}\leq\frac{\epsilon}{2}$ and

\Theta(T,\gamma)=\frac{2(M+1)FL}{T(1-m)}+\frac{\sigma^{2}}{(M+1)(1-m)}+\frac{\zeta^{2}}{1-m} \leq \frac{\epsilon}{2}+\frac{\epsilon}{2}+\frac{\zeta^{2}}{1-m}.

Finally, it remains to note that we do not need to choose $\epsilon$ smaller than $\frac{\zeta^{2}}{1-m}$, that is, we cannot get closer than $\frac{\zeta^{2}}{1-m}$ to the optimal solution. □

A.3 Proof of Theorem 6

Starting from Equation (6), we proceed by unrolling the recursion and using that $\sum_{i=0}^{T}(1-a)^{i}\leq\sum_{i=0}^{\infty}(1-a)^{i}=\frac{1}{a}$ for $0<a<1$:

F_{T} \leq \left(1-\gamma\mu(1-m)\right)^{T}F_{0}+\frac{\zeta^{2}}{2\mu(1-m)}+\frac{\gamma L\sigma^{2}}{2\mu(1-m)}.

Choose the stepsize $\gamma=\min\bigl\{\frac{1}{(M+1)L},\frac{\epsilon\mu(1-m)}{L\sigma^{2}}\bigr\}$ and $T=\max\left\{\frac{(M+1)L}{\mu(1-m)}\log\frac{2F_{0}}{\epsilon},\frac{L\sigma^{2}}{\epsilon\mu^{2}(1-m)^{2}}\log\frac{2F_{0}}{\epsilon}\right\}$. Then we verify:

If the minimum is attained for $\gamma=\frac{1}{(M+1)L}$, then $\frac{\sigma^{2}}{\mu(1-m)(M+1)}\leq\epsilon$, and it holds

F_{T} \leq \left(1-\frac{\mu(1-m)}{(M+1)L}\right)^{T}F_{0}+\frac{\sigma^{2}}{2\mu(1-m)(M+1)}+\frac{\zeta^{2}}{2\mu(1-m)} \leq \frac{\epsilon}{2}+\frac{\epsilon}{2}+\frac{\zeta^{2}}{2\mu(1-m)}

by the choice of $T$. Otherwise, if $\gamma=\frac{\epsilon\mu(1-m)}{L\sigma^{2}}$, then

F_{T} \leq \left(1-\frac{\epsilon\mu^{2}(1-m)^{2}}{L\sigma^{2}}\right)^{T}F_{0}+\frac{\epsilon}{2}+\frac{\zeta^{2}}{2\mu(1-m)} \leq \epsilon+\frac{\zeta^{2}}{2\mu(1-m)}

The claim follows again by observing that it suffices to consider $\epsilon\geq\frac{\zeta^{2}}{2\mu(1-m)}$ only. □

Appendix B Deferred Proofs for Section 5

B.1 Top-$k$ sparsification

Definition 3.

For a parameter $1\leq k\leq d$, the operator $\text{top-}k\colon\mathbb{R}^{d}\to\mathbb{R}^{d}$ is defined for $\mathbf{x}\in\mathbb{R}^{d}$ as:

(\text{top-}k(\mathbf{x}))_{\pi(i)} := \begin{cases}(\mathbf{x})_{\pi(i)}, & \text{if } i\leq k\\ 0, & \text{otherwise}\end{cases}

where $\pi$ is a permutation of $[d]$ such that $(|\mathbf{x}|)_{\pi(i)}\geq(|\mathbf{x}|)_{\pi(i+1)}$ for $i=1,\ldots,d-1$. In other words, the top-$k$ operator keeps only the $k$ largest elements of a vector and truncates the other ones to zero.

We observe that for the top-$k$ operator, we have

0 \leq \|\text{top-}k(\mathbf{x})-\mathbf{x}\|^{2} \leq \frac{d-k}{d}\|\mathbf{x}\|^{2},\qquad\forall\mathbf{x}\in\mathbb{R}^{d}, \qquad (9)

and both inequalities can be tight in general.

Lemma 9.

Let $f\colon\mathbb{R}^{d}\to\mathbb{R}$ be differentiable; then $\text{top-}k(\nabla f(\mathbf{x}))$ is a $(m,\zeta^{2})$-biased gradient oracle for

m = \left(1-\frac{k}{d}\right), \qquad \zeta^{2} = 0.
Proof.

This follows directly from (9). ∎

B.2 Random-$k$ sparsification

Definition 4.

For a parameter $1\leq k\leq d$, the operator $\text{rand-}k\colon\mathbb{R}^{d}\times\Omega_{k}\to\mathbb{R}^{d}$, where $\Omega_{k}$ denotes the set of all $k$-element subsets of $[d]$, is defined for $\mathbf{x}\in\mathbb{R}^{d}$ as:

(\text{rand-}k(\mathbf{x},\omega))_{i} := \begin{cases}(\mathbf{x})_{i}, & \text{if } i\in\omega\\ 0, & \text{otherwise}\end{cases}

where $\omega$ is a set of $k$ elements chosen uniformly at random from $\Omega_{k}$, i.e., $\omega\sim_{\text{u.a.r.}}\Omega_{k}$.

We observe that for the rand-$k$ operator, we have

\mathbb{E}\|\text{rand-}k(\mathbf{x})-\mathbf{x}\|^{2} = \frac{d-k}{d}\|\mathbf{x}\|^{2}, \qquad \mathbb{E}\,\text{rand-}k(\mathbf{x}) = \frac{k}{d}\mathbf{x}, \qquad \forall\mathbf{x}\in\mathbb{R}^{d}, \qquad (10)

and further

\mathbb{E}\,\text{rand-}k(\mathbf{a}+\mathbf{b}) = \mathbb{E}\,\text{rand-}k(\mathbf{a})+\mathbb{E}\,\text{rand-}k(\mathbf{b}) \qquad (11)

for any vectors $\mathbf{a},\mathbf{b}\in\mathbb{R}^{d}$. Moreover,

\|\mathbb{E}\,\text{rand-}k(\mathbf{x})\|^{2} = \frac{k^{2}}{d^{2}}\|\mathbf{x}\|^{2}, \qquad \mathbb{E}\|\text{rand-}k(\mathbf{x})\|^{2} = \frac{k}{d}\|\mathbf{x}\|^{2}. \qquad (12)
Lemma 10.

Let $\mathbf{g}(\mathbf{x})$ be an unbiased gradient oracle with $(M_{b},\sigma_{b}^{2})$-bounded noise. Then $\text{rand-}k(\mathbf{g}(\mathbf{x}))$ is a $(m,\zeta^{2})$-biased gradient oracle with $(M,\sigma^{2})$-bounded noise, for

m = \left(1-\frac{k}{d}\right), \qquad \zeta^{2} = 0, \qquad M = \left((1+M_{b})\frac{d}{k}-1\right), \qquad \sigma^{2} = \frac{k}{d}\sigma_{b}^{2}.
Proof.

The gradient oracle can be written as $\mathbf{g}(\mathbf{x})=\nabla f(\mathbf{x})+\mathbf{n}(\mathbf{x})$, for $(M_{b},\sigma_{b}^{2})$-bounded noise. We first estimate the bias:

\|\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))-\nabla f(\mathbf{x})\|^{2} = \|\mathbb{E}\,\text{rand-}k(\nabla f(\mathbf{x})+\mathbf{n}(\mathbf{x}))-\nabla f(\mathbf{x})\|^{2}
\stackrel{(11)}{=} \|\mathbb{E}\,\text{rand-}k(\nabla f(\mathbf{x}))+\mathbb{E}\,\text{rand-}k(\mathbf{n}(\mathbf{x}))-\nabla f(\mathbf{x})\|^{2}
= \|\mathbb{E}\,\text{rand-}k(\nabla f(\mathbf{x}))-\nabla f(\mathbf{x})\|^{2}
\leq \mathbb{E}\|\text{rand-}k(\nabla f(\mathbf{x}))-\nabla f(\mathbf{x})\|^{2} \stackrel{(10)}{=} \left(1-\frac{k}{d}\right)\|\nabla f(\mathbf{x})\|^{2},

where we used independence to conclude $\mathbb{E}[\text{rand-}k(\mathbf{n}(\mathbf{x}))]=\mathbf{0}$, and Jensen's inequality in the last step.

For the noise, we observe $\mathbb{E}\|\text{rand-}k(\nabla f(\mathbf{x}))\|^{2} = \frac{d}{k}\|\mathbb{E}\,\text{rand-}k(\nabla f(\mathbf{x}))\|^{2} = \frac{d}{k}\|\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))\|^{2}$, and hence

\mathbb{E}\|\text{rand-}k(\mathbf{g}(\mathbf{x}))-\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))\|^{2} = \mathbb{E}\|\text{rand-}k(\nabla f(\mathbf{x})+\mathbf{n}(\mathbf{x}))\|^{2}-\|\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))\|^{2}
= \mathbb{E}\|\text{rand-}k(\nabla f(\mathbf{x}))\|^{2}+\mathbb{E}\|\text{rand-}k(\mathbf{n}(\mathbf{x}))\|^{2}-\|\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))\|^{2}
= \left(\frac{d}{k}-1\right)\|\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))\|^{2}+\frac{k}{d}\mathbb{E}\|\mathbf{n}(\mathbf{x})\|^{2}
\leq \left(\frac{d}{k}-1\right)\|\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))\|^{2}+\frac{k}{d}M_{b}\|\nabla f(\mathbf{x})\|^{2}+\frac{k}{d}\sigma_{b}^{2}
\leq \left(\frac{d}{k}-1+\frac{d}{k}M_{b}\right)\|\mathbb{E}\,\text{rand-}k(\mathbf{g}(\mathbf{x}))\|^{2}+\frac{k}{d}\sigma_{b}^{2};

the statement of the lemma follows. ∎

Remark 11.

These bounds are tight in general. When $k=d$, we just recover $M=M_{b}$, $\sigma^{2}=\sigma_{b}^{2}$. For the special case $M_{b}=\sigma_{b}^{2}=0$, we recover $M=\frac{d}{k}-1$ and $\sigma^{2}=0$, i.e., the randomness of the oracle is only due to the random-$k$ sparsification.

Biased versus unbiased sparsification.

In the literature, often an unbiased version of random sparsification is considered, namely the scaled operator $\text{rand-}k' := \frac{d}{k}\,\text{rand-}k$.

Up to the scaling by $\frac{d}{k}$, the $\text{rand-}k'$ operator is therefore identical to the $\text{rand-}k$ operator, and we would expect the same convergence guarantees, in particular when tuning for the optimal stepsize $\gamma'$ (as in our theorems). However, due to the constraint $\gamma\leq\frac{1}{L}$, a consequence of applying smoothness in Lemma 2, this rescaling might not be possible if $\gamma'\in[\frac{k}{dL},\frac{1}{L}]$ (only if it is smaller). This is the reason why the lemma below gives a slightly better estimate for the convergence with $\text{rand-}k'$ than with $\text{rand-}k$ (the optimization term, the term not depending on $\sigma^{2}$, is improved by a factor $\mathcal{O}(\frac{d}{k})$).

However, we argue that this is mostly a technical issue and not a limitation of $\text{rand-}k$ vs. $\text{rand-}k'$. Our Lemma 2 does not assume any particular structure on the bias and holds for arbitrary biased oracles. For the particular choice of $\text{rand-}k$, we know that scaling up the update by a factor $\frac{d}{k}$ is possible in any case, and hence the condition on the stepsize could be relaxed to $\gamma\leq\frac{d}{kL}$ in this special case, recovering exactly the same convergence rate as with $\text{rand-}k'$.

Lemma 12.

Let $\mathbf{g}(\mathbf{x})$ be an unbiased gradient oracle with $(M_{b},\sigma_{b}^{2})$-bounded noise. Then $\text{rand-}k'(\mathbf{g}(\mathbf{x}))$ is a $(m,\zeta^{2})$-biased gradient oracle with $(M,\sigma^{2})$-bounded noise, for

m = 0, \qquad \zeta^{2} = 0, \qquad M = \left((1+M_{b})\frac{d}{k}-1\right), \qquad \sigma^{2} = \frac{d}{k}\sigma_{b}^{2}.
Proof.

The unbiasedness is easily verified. For the variance, observe

\mathbb{E}\|\text{rand-}k'(\mathbf{g}(\mathbf{x}))-\nabla f(\mathbf{x})\|^{2} = \mathbb{E}\|\text{rand-}k'(\nabla f(\mathbf{x})+\mathbf{n}(\mathbf{x}))\|^{2}-\|\nabla f(\mathbf{x})\|^{2}
= \mathbb{E}\|\text{rand-}k'(\nabla f(\mathbf{x}))\|^{2}+\mathbb{E}\|\text{rand-}k'(\mathbf{n}(\mathbf{x}))\|^{2}-\|\nabla f(\mathbf{x})\|^{2}
= \frac{d}{k}\|\nabla f(\mathbf{x})\|^{2}+\frac{d}{k}\mathbb{E}\|\mathbf{n}(\mathbf{x})\|^{2}-\|\nabla f(\mathbf{x})\|^{2}
\leq \left(\frac{d}{k}-1+\frac{d}{k}M_{b}\right)\|\nabla f(\mathbf{x})\|^{2}+\frac{d}{k}\sigma_{b}^{2} \qquad ∎

B.3 Zeroth-Order Gradient

Lemma 13.

Let $\mathbf{g}_{\rm GS}$ be a zeroth-order gradient oracle as defined in (8). Then this oracle can be described as a biased gradient oracle as in Definition 1 with parameters:

m=0, \qquad \zeta^{2}\leq\frac{\tau^{2}}{4}L^{2}(d+3)^{3}, \qquad M\leq 4(d+4), \qquad \sigma^{2}\leq 3\tau^{2}L^{2}(d+4)^{3}.
Proof.

From Nesterov and Spokoiny (2017, Lemma 3) we can bound the bias for $\mathbf{g}_{\rm GS}$ as follows:

\|\mathbf{b}(\mathbf{x})\|^{2} = \|\mathbb{E}\,\mathbf{g}_{\rm GS}(\mathbf{x})-\nabla f(\mathbf{x})\|^{2} \leq \frac{\tau^{2}}{4}L^{2}(d+3)^{3}.

Hence we can conclude that Assumption 4 holds with the choice $m=0$ and $\zeta^{2}\leq\frac{\tau^{2}}{4}L^{2}(d+3)^{3}$. Moreover, Nesterov and Spokoiny (2017, Theorem 4) provide an upper bound on the second moment of the zeroth-order gradient oracle $\mathbf{g}_{\rm GS}$:

\mathbb{E}\|\mathbf{g}_{\rm GS}(\mathbf{x})\|^{2} \leq 3\tau^{2}L^{2}(d+4)^{3}+4(d+4)\|\mathbb{E}\,\mathbf{g}_{\rm GS}(\mathbf{x})\|^{2} = 3\tau^{2}L^{2}(d+4)^{3}+4(d+4)\|\nabla f(\mathbf{x})+\mathbf{b}(\mathbf{x})\|^{2}.

The second moment is an upper bound on the variance. Hence we can conclude that Assumption 3 holds with the choice

\sigma^{2}=3\tau^{2}L^{2}(d+4)^{3}, \qquad M=4(d+4). \qquad ∎

Convergence rate for $\mathbf{g}_{\rm GS}$ under the PL condition.

Using the result from Theorem 6, we can conclude that under the PL condition,

\mathcal{O}\left(\frac{dL}{\mu}\log\frac{1}{\epsilon}+\frac{\tau^{2}L^{3}(d+1)^{3}}{\mu^{2}\epsilon+\mu\tau^{2}L^{2}(d+1)^{3}}\right)

iterations are sufficient for zeroth-order gradient descent to reach $F_{T}\leq\epsilon+\frac{\zeta^{2}}{\mu}$. Observe that for any reasonable choice of $\epsilon=\Omega\bigl(\frac{\zeta^{2}}{\mu}\bigr)$, the second term is of order $\mathcal{O}(\kappa)$ only, and hence the rate is dominated by the first term. We conclude that zeroth-order gradient descent requires $\mathcal{O}\left(d\kappa\log\frac{1}{\epsilon}\right)$ iterations to converge to the accuracy level of the bias term, $\frac{\zeta^{2}}{\mu}=\mathcal{O}\left(\frac{\tau^{2}L^{2}d^{3}}{\mu}\right)$, which cannot be surpassed.

This convergence rate matches the rate established in (Nesterov and Spokoiny, 2017, Theorem 8), but in our case the convergence radius (bias term) is $\mathcal{O}(d)$ times worse than theirs. This shows that in general the rates given in Theorem 6 can be improved if stronger assumptions are imposed on the bias term, or if it is known how the gradient oracle (and bias) are generated, for instance by Gaussian smoothing.

B.4 Inexact gradient oracle

First, we give the definition of a $(\delta,L)$-oracle as introduced in (Devolder et al., 2014).

Definition 5 ($(\delta,L)$-oracle (Devolder et al., 2014)).

Let the function $f$ be convex on $Q$, where $Q$ is a closed convex set in $\mathbb{R}^{d}$. We say that $f$ is equipped with a first-order $(\delta,L)$-oracle if for any $\mathbf{y}\in Q$ we can compute a pair $(\tilde{f}(\mathbf{y}),\tilde{\mathbf{g}}(\mathbf{y}))$ such that:

0 \leq f(\mathbf{x})-\left(\tilde{f}(\mathbf{y})+\langle\tilde{\mathbf{g}}(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle\right) \leq \frac{L}{2}\|\mathbf{x}-\mathbf{y}\|^{2}+\delta,\qquad\forall\mathbf{x}\in Q.

The constant $\delta$ is called the accuracy of the oracle. A function $f$ is $L$-smooth if and only if it admits a $(0,L)$-oracle. However, the class of functions admitting a $(\delta,L)$-oracle is strictly larger and includes non-smooth functions as well.

Stochastic inexact gradient oracle.

Devolder (2011) generalized the notion of the $(\delta,L)$-gradient oracle to the stochastic case. Instead of using the pair $(\tilde{f}(\mathbf{x}),\tilde{\mathbf{g}}(\mathbf{x}))$, they use stochastic estimates $(\tilde{F}(\mathbf{x},\xi),\tilde{G}(\mathbf{x},\xi))$ such that:

\mathbb{E}_{\xi}[\tilde{F}(\mathbf{x},\xi)]=\tilde{f}(\mathbf{x}),
\mathbb{E}_{\xi}[\tilde{G}(\mathbf{x},\xi)]=\tilde{\mathbf{g}}(\mathbf{x}),
\mathbb{E}_{\xi}\left[\|\tilde{G}(\mathbf{x},\xi)-\tilde{\mathbf{g}}(\mathbf{x})\|^{2}\right]\leq\sigma^{2},

where $(\tilde{f}(\mathbf{x}),\tilde{\mathbf{g}}(\mathbf{x}))$ is a $(\delta,L)$-oracle as defined above. From the third inequality we can conclude that $M=0$. Moreover, the bias term can be upper bounded by:

\|\mathbf{b}(\mathbf{x})\|^{2} = \|\mathbb{E}_{\xi}[\tilde{G}(\mathbf{x},\xi)]-\nabla f(\mathbf{x})\|^{2} = \|\tilde{\mathbf{g}}(\mathbf{x})-\nabla f(\mathbf{x})\|^{2} \leq 2\delta L.

Therefore we can conclude that $m=0$ and $\zeta^{2}=2\delta L$.

B.5 Biased compression operators

Deriving the bounds for arbitrary compressors follows similarly as in Section B.1, observing that a $\delta$-compressor satisfies (9) with the factor $(1-\delta)$ instead of $(1-\frac{k}{d})$ as for the top-$k$ compressor (the top-$k$ compressor is a $\delta$-compressor for $\delta=\frac{k}{d}$).

Appendix C Details on the deep learning experiment

The hyper-parameter choices used in our deep learning experiments (Section 6.4) are summarized in Table 2. We tuned the hyper-parameters over the range $[0.1,0.2,0.3,0.4,0.5]$ for the learning rate and $[0.1,0.25,0.5,0.75,0.9]$ for the momentum, and chose the combination of hyper-parameters that gave the lowest test error. In all settings we reduced the learning rate at epoch 150 by a factor of 10, and we trained the network with a batch size of 256 and an $L2$ penalty regularizer equal to $10^{-4}$.

Table 2: Hyper-parameters used in the DL experiment

| method | lr | momentum $\beta$ |
| --- | --- | --- |
| SGD | 0.1 | 0.9 |
| top-$k$ | 0.1 | 0.75 |
| rand-$k$ | 0.1 | 0.75 |
Figure 7: Test error when training ResNet18 on the CIFAR-10 dataset. For the compression methods, only $1\%$ of the coordinates were kept, i.e., $k=0.01\cdot d$.