
High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise

\nameLiam Madden \email[email protected]
\addrDepartment of Electrical and Computer Engineering
University of British Columbia
\AND\nameEmiliano Dall’Anese \email[email protected]
\addrDepartment of Electrical and Computer Engineering
Boston University \AND\nameStephen Becker \email[email protected]
\addrDepartment of Applied Mathematics
University of Colorado Boulder
Abstract

Stochastic gradient descent is one of the most common iterative algorithms used in machine learning and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform what modifications of it to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other hand, proves convergence bounds in high probability without assuming convexity. Assuming strong smoothness, we prove high probability convergence bounds in two settings: (1) assuming the Polyak-Łojasiewicz inequality and norm sub-Gaussian gradient noise and (2) assuming norm sub-Weibull gradient noise. In the second setting, as an intermediate step to proving convergence, we prove a sub-Weibull martingale difference sequence self-normalized concentration inequality of independent interest. It extends Freedman-type concentration beyond the sub-exponential threshold to heavier-tailed martingale difference sequences. We also provide a post-processing method that picks a single iterate with a provable convergence guarantee as opposed to the usual bound for the unknown best iterate. Our convergence result for sub-Weibull noise extends the regime where stochastic gradient descent has equal or better convergence guarantees than stochastic gradient descent with modifications such as clipping, momentum, and normalization.

Keywords: stochastic gradient descent, convergence bounds, sub-Weibull distributions, Polyak-Łojasiewicz inequality, Freedman inequality

1 Introduction

Stochastic gradient descent (SGD) and its variants are some of the most commonly used algorithms in machine learning. In particular, they are used for training neural network and transformer models, models that have achieved considerable success on image classification and language processing tasks in recent years. The training in this case is non-convex and smooth for many activation/attention functions, such as sigmoid, GELU, and softmax. Even for ReLU, which is not differentiable, the training is smooth over most of the parameter space and avoidance of the non-smooth part is dealt with separately. Thus, the smooth non-convex convergence analysis of SGD has far-reaching influence on the field of machine learning.

There is a large literature on the almost sure and mean convergence of SGD assuming strong smoothness. Bertsekas and Tsitsiklis (2000) prove that SGD almost surely converges to a first-order stationary point assuming strong smoothness and the relaxed growth noise condition; more recently, Patel (2021) showed the same result with a different proof technique. Ghadimi and Lan (2013) prove that the mean of the squared gradient norm of SGD converges to zero at a rate of O(1/\sqrt{T}) assuming strong smoothness and the bounded variance noise condition, where T is the number of SGD iterations. Khaled and Richtárik (2020) prove the same but with the expected smoothness noise condition. Sebbouh et al. (2021) strengthen this to an almost sure convergence rate. Assuming strong smoothness and convexity, the tight mean convergence rate of the squared gradient norm of SGD is O(1/\sqrt{T}) (Nemirovsky and Yudin, 1983, Thm. 5.3.1). In fact, this is the optimal rate for all stochastic first-order methods assuming only strong smoothness (Arjevani et al., 2019). Thus, the O(1/\sqrt{T}) mean convergence rate of Ghadimi and Lan (2013) is tight and shows that SGD is optimal in the smooth non-convex setting.

However, mean convergence is not the end of the story. A convergence guarantee is generally required with some arbitrary probability, 1-\delta. For a single run of SGD, using Markov’s inequality gives a 1/\delta scaling to the bound. If one can re-run SGD many times (say, r times), then by choosing the best run, \delta can be relatively large since the overall probability of success is at least 1-\delta^{r}. On the other hand, if just a single run of SGD is allowed, then the 1/\delta factor leads to nearly vacuous results, and hence a direct high probability convergence analysis is necessary to understand the behavior of unique runs of SGD.

Li and Orabona (2020) prove an O(\log(T)\log(T/\delta)/\sqrt{T}) high probability convergence rate for a weighted average of the squared gradient norms of SGD assuming strong smoothness and norm sub-Gaussian noise. However, a series of recent papers (Gürbüzbalaban et al., 2021; Şimşekli et al., 2019; Panigrahi et al., 2019) suggest that norm sub-Gaussian noise is often not satisfied. While the central limit theorem can be used to heuristically justify norm sub-Gaussian noise for mini-batch SGD with large batch-sizes, it cannot for small batch-sizes. In particular, Panigrahi et al. (2019) provide examples where the noise is Gaussian for a batch-size of 4096, is not Gaussian for a batch-size of 32, and starts out Gaussian then becomes non-Gaussian for a batch-size of 256. Gürbüzbalaban et al. (2021) and Şimşekli et al. (2019) suggest instead assuming that the pth moment of the norm of the noise is bounded, where p\in(1,2). Scaman and Malherbe (2020) prove an O(1/(T^{(p-1)/p}\delta^{(2+p)/(2p)})) high probability convergence rate for SGD assuming strong smoothness and bounded pth moment noise for p\in(1,2). By allowing noise with possibly infinite variance, SGD converges at a slower rate in terms of T. Moreover, the dependence on 1/\delta is polynomial rather than logarithmic. Cutkosky and Mehta (2021) prove an O(\log(T/\delta)/T^{(p-1)/(3p-2)}) high probability convergence rate for clipped SGD (which uses clipping, momentum, and normalization) assuming strong smoothness and bounded pth moment noise for p\in(1,2), thus improving the dependence on 1/\delta to logarithmic, but clipped SGD will still have a slower convergence rate in terms of T even when the noise is norm sub-Gaussian. So, knowledge about the type of noise should actually affect whether we use SGD or clipped SGD. This motivates the question: How heavy can the noise be for SGD to have a high probability convergence rate that depends logarithmically on 1/\delta?

Towards this end, we consider norm sub-Weibull noise. While it is much lighter tailed than bounded pth moment noise for p\in(1,2), it is a natural extension of norm sub-Gaussian noise and it turns out that it admits a convergence rate for SGD with logarithmic dependence on 1/\delta. This leads us to the first contribution of our paper. Assuming strong smoothness and norm sub-Weibull noise, we prove an O((\log(T)\log(1/\delta)^{2\theta}+\log(T/\delta)^{\min\{0,\theta-1\}}\log(1/\delta))/\sqrt{T}) high probability convergence rate for a weighted average of the squared gradient norms of SGD, where \theta is the sub-Weibull tail weight. This is given in Theorem 15. In the special case of norm sub-Gaussian noise, which is \theta=1/2, this becomes O(\log(T)\log(1/\delta)/\sqrt{T}), which slightly improves the result of Li and Orabona (2020). For more general \theta>1/2, we make the additional assumption that the objective function is Lipschitz continuous. At its most basic, the Lipschitz continuity assumption says that the norm of the true gradient is bounded for all of the iterates. Li and Liu (2022), building off of our pre-print, relax this to only assuming that the step-size times the true gradient is bounded for all of the iterates, which is a weaker assumption since the step-size goes to zero.

However, hidden within these convergence rates is a subtle issue that requires a post-processing algorithm. All of these high probability convergence rates are for a weighted average of the squared gradient norms of the iterates rather than being for the squared gradient norm of a single iterate. While this implies a high probability convergence rate for the iterate with the smallest squared gradient norm, to actually determine which iterate has the smallest squared gradient norm, we would need to estimate the true gradient for each iterate! This leads us to another question: Is there a post-processing strategy for a single run of SGD that is as efficient as 2-RSG?

2-RSG is the algorithm of Ghadimi and Lan (2013) that takes multiple runs of SGD and picks the best one to get a high probability convergence rate. First, it goes from a mean convergence rate for the iterate with the smallest squared gradient norm to a mean convergence rate for a particular iterate by randomly choosing the iterate. It does this for \Theta(\log(1/\delta)) runs of SGD. Then, it uses \Theta(\log(1/\delta)\sigma^{2}/\delta\epsilon) samples to estimate the \Theta(\log(1/\delta)) gradients and pick the run with the smallest squared estimated gradient norm, where \sigma^{2} is the variance of the noise and \epsilon is the convergence tolerance. To compete with this, we introduce a post-processing algorithm that randomly chooses \Theta(\log(1/\delta)) iterates of a single run of SGD and then picks the one with the smallest squared estimated gradient norm using \Theta(\log(1/\delta)\sigma^{2}/\delta\epsilon) samples, and we prove that our high probability convergence rate applies to the squared gradient norm of this iterate. The algorithm and the bound on the squared gradient norm of its output are in Theorem 21. The result can be combined with any high probability convergence rate on a weighted average of the squared gradient norm of the iterates of SGD, such as the results of Li and Orabona (2020), Scaman and Malherbe (2020), and Cutkosky and Mehta (2021). Moreover, the first part of it, going from a high probability bound on a weighted average of random variables to a high probability bound on the smallest of a small subset of those random variables, applies to more general stochastic sequences as shown in Theorem 19, which is a probability result of independent interest.
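A schematic sketch of this post-processing step is below (illustrative only: the precise number of candidates, the sample sizes, and the sampling distribution over iterates are specified in Theorem 21; uniform sampling is used here as a simplification, and the function names are hypothetical):

```python
import numpy as np

def pick_iterate(iterates, stoch_grad, num_candidates, samples_per_candidate, rng):
    """Post-processing sketch: sample a few candidate iterates, estimate each candidate's
    gradient by averaging fresh stochastic gradients, and return the candidate whose
    estimated gradient norm is smallest."""
    candidate_ids = rng.choice(len(iterates), size=num_candidates, replace=True)
    best_x, best_norm = None, np.inf
    for t in candidate_ids:
        x = iterates[t]
        g_hat = np.mean([stoch_grad(x, rng) for _ in range(samples_per_candidate)], axis=0)
        norm = np.linalg.norm(g_hat)
        if norm < best_norm:
            best_x, best_norm = x, norm
    return best_x
```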

Underlying our convergence analysis are concentration inequalities. Vladimirova et al. (2020), Wong et al. (2020), and Bakhshizadeh et al. (2023) prove concentration inequalities for sub-Weibull random variables with deterministic scale parameters. However, we need a concentration inequality for a sub-Weibull martingale difference sequence (MDS) where the scale parameters are themselves random variables making up an auxiliary sequence. MDS concentration inequalities upper bound the partial sum of the main sequence and lower bound the partial sum of the auxiliary sequence simultaneously. When the lower bound depends on the partial sum of the main sequence, we call the concentration inequality self-normalized. Freedman (1975) proves a sub-Gaussian MDS concentration inequality. Fan et al. (2015) prove a sub-exponential MDS concentration inequality. Harvey et al. (2019) prove a sub-Gaussian MDS self-normalized concentration inequality. The proofs for these results all rely on the moment generating function (MGF), which does not exist for sub-Weibull random variables with \theta>1, i.e., heavier than sub-exponential. Nevertheless, we are able to prove a sub-Weibull martingale difference sequence self-normalized concentration inequality. This is given in Theorem 11. The proof uses the MGF truncation technique of Bakhshizadeh et al. (2023) but applied to an MDS rather than i.i.d. random variables. The truncation level is determined by the tail decay rate and shows up as the \log(T/\delta)^{\theta-1} factor in the SGD convergence rate.

We also consider convergence under the Polyak-Łojasiewicz (PŁ) inequality. Some examples of problems with PŁ objectives are logistic regression over compact sets (Karimi et al., 2016) and matrix factorization (Sun and Luo, 2016). Deep linear neural networks satisfy the PŁ inequality in large regions of the parameter space (Charles and Papailiopoulos, 2018, Thm. 4.5), as do two-layer neural networks with an extra identity mapping (Li and Yuan, 2017). Furthermore, sufficiently wide neural networks satisfy the PŁ inequality locally around random initialization (Allen-Zhu et al., 2019a, b; Du et al., 2018, 2019; Liu et al., 2022). While strong convexity implies the PŁ inequality, a PŁ function need not even be convex, hence the PŁ condition is considerably more applicable than strong convexity in the context of neural networks. Moreover, as pointed out by Karimi et al. (2016), the convergence proof for gradient descent assuming strong smoothness and strong convexity is actually simplified if we use the PŁ inequality instead of strong convexity. Thus, we would like to find a simple, elegant proof for the high probability convergence of SGD assuming strong smoothness and the PŁ inequality.

Assuming strong smoothness and strong convexity, the tight mean convergence rate of SGD is O(1/T) (Nemirovski et al., 2009). The same mean convergence rate can be shown assuming strong smoothness and the PŁ inequality (Karimi et al., 2016; Orvieto and Lucchi, 2019). But neither of these papers addresses high probability convergence. To fill this gap, we prove an O(\log(1/\delta)/T) high probability convergence rate for the objective value of SGD assuming strong smoothness, the PŁ inequality, and norm sub-Gaussian noise. This is given in Theorem 8. The proof relies on a novel probability result, Theorem 9, that essentially shows that adding two kinds of noise to a contracting sequence—sub-Gaussian noise with variance depending on the sequence itself and sub-exponential noise—results in a sub-exponential sequence. Unfortunately, the proof does not extend to the case of norm sub-Weibull noise. It is unclear if this is just an artifact of the proof or if the faster convergence of SGD under the PŁ inequality really cannot be maintained under norm sub-Weibull noise.

Applicability of assumptions. Given the motivation from neural networks, it is necessary to say a few words regarding the applicability of the Lipschitz continuity and strong smoothness assumptions. First, we show in Lemmas 1 and 2 that these assumptions are satisfied by simple neural network models that are, nevertheless, powerful enough to interpolate generic data sets using only twice the minimal number of parameters required for any model to do so. Second, they are satisfied if the iterates remain in a bounded set, which is precisely what happens in the NTK setting. Third, while it is true that Lipschitz continuity and strong smoothness may not be satisfied in practice, our paper is guided by the principle of “how far can we relax assumptions while still being able to prove high probability convergence rates?” An alternative principle is “what can we prove given assumptions that are completely justified in practice?” This is the principle guiding Patel et al. (2022). They provide a neural network counterexample that does not even satisfy (L_{0},L_{1})-smoothness (Zhang et al., 2020), an alternative assumption to strong smoothness. Then, they prove—assuming only (1) a lower bound on the objective function, (2) local \alpha-Hölder continuity of the gradient, (3) equality of the true and mean gradients, and (4) that the pth moment of the norm of the stochastic gradient, for some p\in(1,2], is bounded by an upper semi-continuous function—that, with probability 1, either (1) the norm of the iterates of SGD goes to infinity, or (2) SGD almost surely converges to a first-order stationary point. These assumptions are much weaker than ours, but so is the conclusion. We leave it as a future research direction to further relax our assumptions.

Organization. Section 2 establishes the optimization framework and reviews optimization, probability, and sub-Weibull terminology and results. Section 3 proves the PŁ convergence rate. Section 4 proves the MDS concentration inequality. Section 5 proves the non-convex convergence rate. Section 6 establishes the post-processing algorithm. Section 7 trains a two layer neural network model on synthetic data with Weibull noise injected into the gradient, showing the dependence of the tail of the convergence error on the tail of the noise.

Notation. We use x, y, z for vectors and X, Y, Z, \xi, \psi for random variables. We use e_{t} for the error vector and distinguish it from Euler’s number with the subscript (the subscript will denote the iteration index). We use \sigma^{2} for the variance and so spell out sigma for sigma algebra. We use O and \Theta for big-O notation. When comparing two sequences of real numbers, (a_{t}) and (b_{t}): a_{t}=o(b_{t}) if \lim a_{t}/b_{t}=0, a_{t}=O(b_{t}) if \limsup|a_{t}|/b_{t}<\infty, and a_{t}=\Theta(b_{t}) if a_{t}=O(b_{t}) and b_{t}=O(a_{t}). We use [n] to denote the set \{1,\ldots,n\} and \Gamma to denote the gamma function.

2 Preliminaries

We are interested in the unconstrained optimization problem

\min_{x\in\mathbb{R}^{d}}f(x), \qquad (1)

where f:\mathbb{R}^{d}\rightarrow\mathbb{R} is a differentiable function, and the SGD iteration

x_{t+1}=x_{t}-\eta_{t}g_{t}\quad\forall t\in\mathbb{N}\cup\{0\},

where g_{t}\in\mathbb{R}^{d} is an estimate of \nabla f(x_{t}), \eta_{t} is the step-size, and x_{0}\in\mathbb{R}^{d} is the initial point. We restrict our attention to the setting where x_{0} is deterministic, but the results easily extend to the setting where x_{0} is a random vector. We define the noise as the vector e_{t}=\nabla f(x_{t})-g_{t} and assume that, conditioned on the previous iterates, it is unbiased and its norm is sub-Weibull (which, as we will see in Lemma 6, implies that the variance of the noise is bounded).
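For concreteness, a minimal sketch of this iteration (assuming NumPy and user-supplied callables for the true gradient and for the noise; the names are illustrative, not from the paper):

```python
import numpy as np

def sgd(grad_f, noise, x0, step_size, T):
    """Run T iterations of x_{t+1} = x_t - eta_t * g_t with g_t = grad_f(x_t) - e_t.

    grad_f    : callable x -> true gradient at x
    noise     : callable (t, x) -> e_t, centered noise whose norm is sub-Weibull by assumption
    step_size : callable t -> eta_t
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for t in range(T):
        g_t = grad_f(x) - noise(t, x)   # stochastic gradient estimate
        x = x - step_size(t) * g_t
        iterates.append(x.copy())
    return iterates
```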

One of the main examples of Eq. (1) is the stochastic approximation problem: f=\mathbb{E}\left[F(\cdot,\xi)\right] where (\Omega,\mathcal{F},P) is a probability space and F:\mathbb{R}^{d}\times\Omega\rightarrow\mathbb{R} (Nemirovski et al., 2009; Ghadimi and Lan, 2013; Bottou and Bousquet, 2008). In this case, we independently sample \xi_{0},\xi_{1},\ldots and set g_{t}=\nabla F(x_{t},\xi_{t}).

Another example is the sample average approximation problem, which is the special case of the stochastic approximation problem where there is a finite set \{\xi_{1},\ldots,\xi_{n}\} such that \xi=\xi_{i} with probability 1/n. In this setting, we define F_{i}=F(\cdot,\xi_{i}) so that

f(x)=\frac{1}{n}\sum_{i=1}^{n}F_{i}(x).
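In this finite-sum setting, the stochastic gradient g_{t} is typically a mini-batch average; a minimal sketch follows (the batch size and sampling with replacement are illustrative choices, not prescribed by the paper):

```python
import numpy as np

def minibatch_gradient(grad_F, x, n, batch_size, rng):
    """Unbiased estimate of grad f(x) = (1/n) * sum_i grad F_i(x) from a random batch.

    grad_F : callable (i, x) -> gradient of F_i at x
    """
    batch = rng.integers(0, n, size=batch_size)     # uniform sampling with replacement
    return np.mean([grad_F(i, x) for i in batch], axis=0)
```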

2.1 Optimization

We highlight a few key facts we use; see, e.g., Nesterov (2018), for more details. All definitions are with respect to the Euclidean norm \|\cdot\|. Let f:\mathbb{R}^{d}\to\mathbb{R} be differentiable, and assume \text{argmin}_{x\in\mathbb{R}^{d}}f(x) is non-empty and denote its minimum by f^{\star}.

If f is continuously differentiable, then f is \rho-Lipschitz if and only if \|\nabla f(x)\|\leq\rho for all x\in\mathbb{R}^{d}. We say f is L-strongly smooth (or L-smooth for short) if its gradient is L-Lipschitz continuous. If f is L-smooth, then a standard result using the Taylor expansion is

f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|x-y\|^{2}\quad\forall x,y\in\mathbb{R}^{d}.

Applying this result with y=x-\frac{1}{L}\nabla f(x) and using f(y)\geq f^{\star}, we get

\|\nabla f(x)\|^{2}\leq 2L(f(x)-f^{\star}), \qquad (2)

or taking x\in\text{argmin}_{x^{\prime}}\,f(x^{\prime}), we get

f(y)-f^{\star}\leq\frac{L}{2}\|y-x^{\star}\|^{2}.

Thus a convergence rate in terms of the iterates is stronger than one in terms of the objective, which is stronger than one in terms of the norm of the gradient. We say f is \mu-Polyak-Łojasiewicz (or \mu-PŁ for short) if \|\nabla f(x)\|^{2}\geq 2\mu(f(x)-f^{\star})~\forall x\in\mathbb{R}^{d}. Combining this with Eq. (2) shows that L\geq\mu.
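A standard example that is PŁ but not strongly convex is the least squares objective f(x)=\frac{1}{2}\|Ax-b\|^{2} with a rank-deficient A; there \mu can be taken to be the smallest nonzero eigenvalue of A^{\top}A. A short numerical check of the inequality (illustrative code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
A[:, 5:] = 0.0                          # rank-deficient: f is convex but not strongly convex
b = rng.standard_normal(20)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]                  # a minimizer
f_star = f(x_star)
mu = min(e for e in np.linalg.eigvalsh(A.T @ A) if e > 1e-8)   # smallest nonzero eigenvalue

# PL inequality: ||grad f(x)||^2 >= 2*mu*(f(x) - f_star); the margin should be >= 0 up to rounding
margins = []
for _ in range(1000):
    x = 10 * rng.standard_normal(10)
    margins.append(np.linalg.norm(grad(x)) ** 2 - 2 * mu * (f(x) - f_star))
print(min(margins))
```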

To justify the Lipschitz continuity and strong smoothness assumptions in the context of neural networks, we include the following lemmas, proved in Appendix A, which show that both properties are satisfied by the least squares loss applied to the hidden layer of a two layer neural network with, e.g., sigmoid functions such as tanh, arctan, and the logistic function (applying the first lemma) or GELU activation (applying the second lemma). Note that such a model (with fixed outer layer), while a simple example of a neural network, is already able to interpolate generic data sets with only twice the necessary number of parameters for any model, including deeper ones, to do so (Madden and Thrampoulidis, 2024).

Lemma 1

Let m,n,d\in\mathbb{N} and a\in\mathbb{R}. Let \phi:\mathbb{R}\to\mathbb{R} be twice differentiable and assume |\phi(x)|,|\phi^{\prime}(x)|,|\phi^{\prime\prime}(x)|\leq a~\forall x\in\mathbb{R}. Let X\in\mathbb{R}^{d\times n}, v\in\mathbb{R}^{m}, and y\in\mathbb{R}^{n}. Define

f:\mathbb{R}^{d\times m}\to\mathbb{R}:W\mapsto\frac{1}{2}\|\phi(X^{\top}W)v-y\|^{2}.

Then f is Lipschitz continuous and strongly smooth.

Lemma 2

Let m,n,d\in\mathbb{N} and a,b\in\mathbb{R}. Let \phi:\mathbb{R}\to\mathbb{R} be twice differentiable. Assume |\phi^{\prime}(x)|\leq a and |\phi^{\prime\prime}(x)|\leq b for all x\in\mathbb{R}. Let X\in\mathbb{R}^{d\times n}, v\in\mathbb{R}^{m}, and y\in\mathbb{R}^{n}. Define

f:\mathbb{R}^{d\times m}\to\mathbb{R}:W\mapsto\frac{1}{2}\|\phi(X^{\top}W)v-y\|^{2}.

Let \alpha\geq 1 and define the sublevel set \mathcal{W}=\{W\in\mathbb{R}^{d\times m}\mid f(W)\leq\alpha\}. Then, on \mathcal{W}, f is

a\|v\|_{\infty}\|X\|_{2}\sqrt{2\alpha m}\text{-Lipschitz continuous and}
\left(a^{2}\|v\|_{\infty}^{2}\|X\|_{2}^{2}m+b\|v\|_{\infty}\|X\|_{2}\|X\|_{1,2}\sqrt{2\alpha}\right)\text{-strongly smooth}.

2.2 Probability

See Section 20 of Billingsley (1995) for details on the convergence of random variables. A sequence of random variables (X_{t}) converges to a random variable X:

  • in probability if, for all \epsilon>0, \lim_{t}P(|X_{t}-X|>\epsilon)=0, denoted X_{t}\overset{p}{\to}X;

  • in mean if \lim_{t}\mathbb{E}|X_{t}-X|=0, denoted X_{t}\overset{L^{1}}{\to}X;

  • almost surely if P(\lim_{t}X_{t}=X)=1, denoted X_{t}\overset{a.s.}{\to}X.

When the rate of convergence is of interest, we say a sequence of random variables (X_{t}) converges to X:

  • with mean convergence rate r(t) if, for all t, \mathbb{E}|X_{t}-X|\leq r(t);

  • with high probability convergence rate \tilde{r}(t,\delta) if, for all t and \delta, P\left(|X_{t}-X|\leq\tilde{r}(t,\delta)\right)\geq 1-\delta.

All five kinds of convergence are interrelated:

  • Convergence in mean and convergence almost surely both imply convergence in probability.

  • Mean convergence with rate r(t) such that r(t)\to 0 as t\to\infty implies convergence in mean.

  • By Markov’s inequality, mean convergence with rate r(t) implies high probability convergence with rate \tilde{r}(t,\delta)=r(t)/\delta.

  • By the Borel-Cantelli lemma (Billingsley, 1995, Thm. 4.3), high probability convergence with rate \tilde{r}(t,\delta)=r(t)p(\delta) implies almost sure convergence if \sum_{t=0}^{\infty}\min\{1,p^{-1}(a/r(t))\}<\infty for all a>0. If p(\delta)=1/\delta^{c} for some c>0, then r(t)=o(1/t^{c}) is required. If p(\delta)=\log(1/\delta), then only r(t)=O(1/t^{c}) for some c>0 is required.

See Section 35 of Billingsley (1995) for details on martingale difference sequences. Let (\Omega,\mathcal{F},P) be a probability space. A sequence, (\mathcal{F}_{i}), of nested sigma algebras in \mathcal{F} (i.e., \mathcal{F}_{i}\subset\mathcal{F}_{i+1}\subset\mathcal{F}) is called a filtration, in which case (\Omega,\mathcal{F},(\mathcal{F}_{i}),P) is called a filtered probability space. A sequence of random variables (\xi_{i}) is said to be adapted to (\mathcal{F}_{i}) if each \xi_{i} is \mathcal{F}_{i}-measurable. Furthermore, if \mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=\xi_{i-1}~\forall i, then (\xi_{i}) is called a martingale. On the other hand, if \mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0~\forall i, then (\xi_{i}) is called a martingale difference sequence.

For the noise sequence (e_{t}), we define a corresponding filtration (\mathcal{F}_{t}) by letting \mathcal{F}_{t} be the sigma algebra generated by e_{0},\ldots,e_{t} for all t\geq 0 and setting \mathcal{F}_{-1}=\{\emptyset,\Omega\}. Note that x_{t} is \mathcal{F}_{t-1}-measurable while e_{t} is \mathcal{F}_{t}-measurable.

2.3 Sub-Weibull Random Variables

We consider sub-Gaussian, sub-exponential, and sub-Weibull random variables.

Definition 3

A random variable X is K-sub-Gaussian if \mathbb{E}\left[\exp\left(X^{2}/K^{2}\right)\right]\leq 2.

Definition 4

A random variable X is K-sub-exponential if \mathbb{E}\left[\exp\left(|X|/K\right)\right]\leq 2.

Definition 5

A random variable X is K-sub-Weibull(\theta) if \mathbb{E}\left[\exp\left((|X|/K)^{1/\theta}\right)\right]\leq 2.

See Proposition 2.5.2 of Vershynin (2018) for equivalent definitions of sub-Gaussian, Proposition 2.7.1 of Vershynin (2018) for equivalent definitions of sub-exponential, and Theorem 2.1 of Vladimirova et al. (2020) for equivalent definitions of sub-Weibull. Note that sub-Gaussian and sub-exponential are special cases of sub-Weibull using \theta=\frac{1}{2} and \theta=1, respectively. The tail parameter \theta measures the heaviness of the tail—higher values correspond to heavier tails—and the scale parameter K gives us the following bound on the second moment.

Lemma 6

If X is K-sub-Weibull(\theta) then \mathbb{E}\left[|X|^{p}\right]\leq 2\Gamma(\theta p+1)K^{p}~\forall p>0. In particular, \mathbb{E}\left[X^{2}\right]\leq 2\Gamma(2\theta+1)K^{2}.

Proof  First, for all t\geq 0,

P\left(|X|\geq t\right) = P\left(\exp\left((|X|/K)^{1/\theta}\right)\geq\exp\left((t/K)^{1/\theta}\right)\right)
\leq 2\exp\left(-(t/K)^{1/\theta}\right).

Second,

\mathbb{E}\left[|X|^{p}\right] = \int_{0}^{\infty}P\left(|X|^{p}\geq x\right)dx
\leq 2\int_{0}^{\infty}\exp\left(-\left(x^{1/p}/K\right)^{1/\theta}\right)dx
= 2\theta pK^{p}\int_{0}^{\infty}\exp(-u)u^{\theta p-1}du
= 2\theta p\Gamma(\theta p)K^{p}
= 2\Gamma(\theta p+1)K^{p}.

 

The proof demonstrates the general techniques for going between probability and expectation in this type of analysis. To go from probability to expectation, the CDF formula is used:

\mathbb{E}\left[|Y|\right]=\int_{0}^{\infty}P\left(|Y|>t\right)dt.

To go from expectation to probability, Markov’s inequality is used:

P\left(|Y|>t\right)\leq\frac{1}{t}\mathbb{E}\left[|Y|\right]~\forall t>0.

In both cases, the trick is choosing what Y should be, e.g. Y=|X|^{p} or Y=\exp((\lambda|X|)^{1/\theta}).
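As a quick numerical illustration of Definition 5 and Lemma 6 (assuming NumPy; the scale choices are illustrative): if X is Weibull with shape 1/\theta and scale \lambda, then (X/\lambda)^{1/\theta} is Exp(1), so \mathbb{E}[\exp((X/K)^{1/\theta})]=1/(1-(\lambda/K)^{1/\theta}) whenever (\lambda/K)^{1/\theta}<1; taking K=4^{\theta}\lambda gives the value 4/3\leq 2, so X is K-sub-Weibull(\theta).

```python
import numpy as np
from math import gamma

theta, lam = 2.0, 1.0                            # tail heavier than sub-exponential
K = 4**theta * lam                               # then (lam/K)^(1/theta) = 1/4
rng = np.random.default_rng(0)
X = lam * rng.weibull(1.0 / theta, size=10**6)   # (X/lam)^(1/theta) ~ Exp(1)

# Definition 5: E[exp((|X|/K)^(1/theta))] = 1/(1 - 1/4) = 4/3 <= 2 for this K
print(np.mean(np.exp((X / K) ** (1.0 / theta))))

# Lemma 6 moment bound with p = 2: E[X^2] <= 2 * Gamma(2*theta + 1) * K^2
print(np.mean(X**2), "<=", 2 * gamma(2 * theta + 1) * K**2)
```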

Remark 7

Note that our definitions of sub-Gaussian, sub-exponential, and sub-Weibull do not require the random variable to be centered. Thus, we do not center the constant random variable X\equiv x to think of it as having scale parameter zero. Instead, for each \theta>0, X\equiv x is |x|/\log(2)^{\theta}-sub-Weibull(\theta).

3 PŁ Convergence

In this section, we prove the following convergence bound, proving Theorem 9, a probability result, along the way. The convergence bound matches the optimal convergence rate in mean of O(1/T) and has \log(1/\delta) dependence on \delta.

Theorem 8

Assume f is L-smooth and \mu-PŁ and that, conditioned on the previous iterates, e_{t} is centered and \|e_{t}\| is \sigma-sub-Gaussian. Then, SGD with step-size

\eta_{t}=\frac{2t+4\kappa-1}{\mu(t+2\kappa)^{2}},

where \kappa=L/\mu, constructs a sequence (x_{t}) such that, w.p. \geq 1-\delta for all \delta\in(0,1),

f(x_{T})-f^{\star}=O\left(\frac{L\sigma^{2}\log(e/\delta)}{\mu^{2}T}\right).
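Before the proof, a minimal sketch of SGD with this step-size schedule (assuming L and \mu are known; stoch_grad is a user-supplied stochastic gradient oracle and its signature is illustrative):

```python
import numpy as np

def pl_step_size(t, mu, L):
    """Theorem 8 step-size: eta_t = (2t + 4*kappa - 1) / (mu * (t + 2*kappa)^2), kappa = L/mu."""
    kappa = L / mu
    return (2 * t + 4 * kappa - 1) / (mu * (t + 2 * kappa) ** 2)

def sgd_pl(stoch_grad, x0, mu, L, T, seed=0):
    """SGD with the Theorem 8 step-size schedule; stoch_grad(x, rng) returns g_t."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(T):
        x = x - pl_step_size(t, mu, L) * stoch_grad(x, rng)
    return x
```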

Proof  Assume f is L-smooth, hence

f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|x-y\|^{2} \qquad (3)

for all x,y\in\mathbb{R}^{d}. Set y=x_{t+1}=x_{t}-\eta_{t}g_{t}=x_{t}-\eta_{t}(\nabla f(x_{t})-e_{t}) and x=x_{t} and subtract f^{\star}. Additionally assume f is \mu-PŁ and \eta_{t}\leq 1/L. Then

f(x_{t+1})-f^{\star} \leq f(x_{t})-f^{\star}-\eta_{t}\left(1-\frac{L\eta_{t}}{2}\right)\|\nabla f(x_{t})\|^{2}+\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle+\frac{L\eta_{t}^{2}}{2}\|e_{t}\|^{2}
\leq f(x_{t})-f^{\star}-\frac{\eta_{t}}{2}\|\nabla f(x_{t})\|^{2}+\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle+\frac{L\eta_{t}^{2}}{2}\|e_{t}\|^{2}
\leq(1-\mu\eta_{t})(f(x_{t})-f^{\star})+\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle+\frac{L\eta_{t}^{2}}{2}\|e_{t}\|^{2} \qquad (4)

where the second inequality used \eta_{t}\leq 1/L and the third used the PŁ inequality.

Note that

\eta_{t}=\frac{2t+4\kappa-1}{\mu(t+2\kappa)^{2}}=\Theta\left(\frac{1}{\mu t}\right)

and, since \eta_{t} is decreasing in t for t\geq 0,

\eta_{t}\leq\frac{4\kappa-1}{\mu(2\kappa)^{2}}\leq\frac{4\kappa}{4\mu\kappa^{2}}=\frac{1}{L},

so we can apply Eq. (4).

Define

X_{t}=(t+2\kappa-1)^{2}(f(x_{t})-f^{\star})
Y_{t}=\eta_{t}(1-L\eta_{t})(t+2\kappa)^{2}\langle\nabla f(x_{t}),e_{t}\rangle
Z_{t}=\frac{L\eta_{t}^{2}(t+2\kappa)^{2}}{2}\|e_{t}\|^{2}.

Then multiplying both sides of Eq. (4) by (t+2\kappa)^{2} and noting 1-\mu\eta_{t}=\frac{(t+2\kappa-1)^{2}}{(t+2\kappa)^{2}} we get the recursion

X_{t+1}\leq X_{t}+Y_{t}+Z_{t}.

Recall that we defined \mathcal{F}_{t} as the sigma algebra generated by e_{0},\ldots,e_{t} for all t\geq 0 and \mathcal{F}_{-1}=\{\emptyset,\Omega\}. So, X_{t} is \mathcal{F}_{t-1}-measurable, Y_{t} is \mathcal{F}_{t}-measurable, and Z_{t} is \mathcal{F}_{t}-measurable.

For all t\geq 0, assume, conditioned on \mathcal{F}_{t-1}, that e_{t} is centered and \|e_{t}\| is \sigma-sub-Gaussian. Then, Y_{t} is centered conditioned on \mathcal{F}_{t-1}. Note that we can bound Y_{t} using Cauchy-Schwarz and Eq. (2) in the following way:

\langle\nabla f(x_{t}),e_{t}\rangle\leq\|\nabla f(x_{t})\|\cdot\|e_{t}\|\leq\sqrt{2L(f(x_{t})-f^{\star})}\|e_{t}\|.

But, if we use this to get a new recursion, then we get a sub-optimal convergence rate. Instead, we need to keep the same recursion but use the bound on Y_{t} in its MGF:

\mathbb{E}\left[\exp\left(\frac{Y_{t}^{2}}{\frac{18L}{\mu^{2}}X_{t}\sigma^{2}}\right)\bigg|~\mathcal{F}_{t-1}\right] \leq\mathbb{E}\left[\exp\left(\frac{\eta_{t}^{2}(1-L\eta_{t})^{2}(t+2\kappa)^{4}\|\nabla f(x_{t})\|^{2}\|e_{t}\|^{2}}{\frac{18L}{\mu^{2}}X_{t}\sigma^{2}}\right)\bigg|~\mathcal{F}_{t-1}\right]
\leq\mathbb{E}\left[\exp\left(\frac{\eta_{t}^{2}(1-L\eta_{t})^{2}\frac{(t+2\kappa)^{4}}{(t+2\kappa-1)^{2}}2LX_{t}\|e_{t}\|^{2}}{\frac{18L}{\mu^{2}}X_{t}\sigma^{2}}\right)\bigg|~\mathcal{F}_{t-1}\right]\quad\text{via Eq. (2)}
\leq\mathbb{E}\left[\exp\left(\frac{\frac{18L}{\mu^{2}}X_{t}\|e_{t}\|^{2}}{\frac{18L}{\mu^{2}}X_{t}\sigma^{2}}\right)\bigg|~\mathcal{F}_{t-1}\right]
\leq 2.

Thus, Y_{t} is \frac{18L}{\mu^{2}}X_{t}\sigma^{2}-sub-Gaussian conditioned on \mathcal{F}_{t-1}. Similarly,

\mathbb{E}\left[\exp\left(\frac{Z_{t}}{\frac{2L}{\mu^{2}}\sigma^{2}}\right)\bigg|~\mathcal{F}_{t-1}\right] \leq\mathbb{E}\left[\exp\left(\frac{\frac{2L}{\mu^{2}}\|e_{t}\|^{2}}{\frac{2L}{\mu^{2}}\sigma^{2}}\right)\bigg|~\mathcal{F}_{t-1}\right]\leq 2,

and so Z_{t} is \frac{2L}{\mu^{2}}\sigma^{2}-sub-exponential conditioned on \mathcal{F}_{t-1}.

So, we have a recursion for a stochastic sequence with two types of additive noise—sub-Gaussian noise, with variance depending on the sequence itself, and sub-exponential noise. We prove that this makes the main sequence sub-exponential in the following theorem.

Theorem 9

Let (\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (X_{i+1}), (Y_{i}), and (Z_{i}) be adapted to (\mathcal{F}_{i}), and X_{0} be deterministic. Let (\alpha_{i}) and (\beta_{i}) be sequences of non-negative reals and let (\gamma_{i}) be a sequence of positive reals. Assume X_{i} and Z_{i} are non-negative almost surely. Assume \mathbb{E}[\exp(\lambda Y_{i})\mid\mathcal{F}_{i-1}]\leq\exp(\frac{\lambda^{2}}{2}\beta_{i}^{2}X_{i}) for all \lambda\in\mathbb{R} and \mathbb{E}[\exp(\lambda Z_{i})\mid\mathcal{F}_{i-1}]\leq\exp(\lambda\gamma_{i}) for all \lambda\in[0,\gamma_{i}^{-1}]. Assume

X_{i+1}\leq\alpha_{i}X_{i}+Y_{i}+Z_{i}.

Then, for any sequence of positive reals (K_{i}) such that K_{0}\geq X_{0} and, for all i\geq 0, K_{i+1}^{2}\geq(\alpha_{i}K_{i}+2\gamma_{i})K_{i+1}+\beta_{i}^{2}K_{i}, and for any n\geq 0, we have, w.p. \geq 1-\delta for all \delta\in(0,1),

X_{n}\leq K_{n}\log(e/\delta).

Proof  We want to find requirements on a sequence (K_{i}) such that \mathbb{E}\left[\exp(\lambda X_{i})\right]\leq\exp(\lambda K_{i})~\forall\lambda\in[0,K_{i}^{-1}]. Then, by Markov’s inequality and taking \lambda=K_{n}^{-1}, P\left(X_{n}\geq\log(e/\delta)K_{n}\right)=P\left(\exp(X_{n}/K_{n})\geq e/\delta\right)\leq\delta. Our proof is inductive. For the base case, we need

(1)~K_{0}\geq X_{0}.

For the induction step, assume \mathbb{E}\left[\exp(\tilde{\lambda}X_{i})\right]\leq\exp(\tilde{\lambda}K_{i})~\forall\tilde{\lambda}\in[0,K_{i}^{-1}]. Let \lambda\in[0,K_{i+1}^{-1}]. Then

\mathbb{E}\left[\exp(\lambda X_{i+1})\right] \leq\mathbb{E}\left[\exp(\lambda\alpha_{i}X_{i}+\lambda Y_{i}+\lambda Z_{i})\right]\quad\text{by non-negativity}
\overset{\text{TE}}{=}\mathbb{E}\left[\exp(\lambda\alpha_{i}X_{i})\mathbb{E}\left[\exp(\lambda Y_{i})\exp(\lambda Z_{i})\mid\mathcal{F}_{i-1}\right]\right]
\overset{\text{CS}}{\leq}\mathbb{E}\left[\exp(\lambda\alpha_{i}X_{i})\mathbb{E}\left[\exp(2\lambda Y_{i})\mid\mathcal{F}_{i-1}\right]^{1/2}\mathbb{E}\left[\exp(2\lambda Z_{i})\mid\mathcal{F}_{i-1}\right]^{1/2}\right]
\overset{(2)}{\leq}\mathbb{E}\left[\exp(\lambda\alpha_{i}X_{i})\exp\left(\frac{(2\lambda)^{2}}{2}\beta_{i}^{2}X_{i}\right)^{1/2}\exp(2\lambda\gamma_{i})^{1/2}\right]\quad\text{if}\quad 2\lambda\in[0,\gamma_{i}^{-1}]
=\mathbb{E}\left[\exp(\lambda\alpha_{i}X_{i}+\lambda^{2}\beta_{i}^{2}X_{i}+\lambda\gamma_{i})\right]
=\mathbb{E}\left[\exp(\lambda(\alpha_{i}+\lambda\beta_{i}^{2})X_{i})\right]\exp(\lambda\gamma_{i})
\overset{(3)}{\leq}\exp(\lambda((\alpha_{i}+\lambda\beta_{i}^{2})K_{i}+\gamma_{i}))\quad\text{via induction if}\;\tilde{\lambda}:=\lambda(\alpha_{i}+\lambda\beta_{i}^{2})\leq K_{i}^{-1}
\overset{(4)}{\leq}\exp(\lambda K_{i+1})

where TE denotes the law of total expectation, CS denotes the Cauchy-Schwarz inequality, and (2)-(4) are the requirements

(2)~2\lambda\leq\gamma_{i}^{-1}\impliedby 2K_{i+1}^{-1}\leq\gamma_{i}^{-1}\iff K_{i+1}\geq 2\gamma_{i}
(3)~\lambda(\alpha_{i}+\lambda\beta_{i}^{2})\leq K_{i}^{-1}\impliedby K_{i+1}^{-1}(\alpha_{i}+K_{i+1}^{-1}\beta_{i}^{2})\leq K_{i}^{-1}\iff K_{i+1}^{2}\geq\alpha_{i}K_{i}K_{i+1}+\beta_{i}^{2}K_{i}
(4)~K_{i+1}\geq(\alpha_{i}+\lambda\beta_{i}^{2})K_{i}+\gamma_{i}\impliedby K_{i+1}\geq(\alpha_{i}+K_{i+1}^{-1}\beta_{i}^{2})K_{i}+\gamma_{i}\iff K_{i+1}^{2}\geq(\alpha_{i}K_{i}+\gamma_{i})K_{i+1}+\beta_{i}^{2}K_{i}.

The assumptions of the theorem imply requirements (2)-(4), completing the proof.

The proof is similar to the proof of Theorem 4.1 of Harvey et al. (2019) but with some key differences. There, the recursion is X_{i+1}\leq\alpha_{i}X_{i}+Y_{i}\sqrt{X_{i}}+\gamma_{i} where \mathbb{E}\left[\exp(\lambda Y_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}\beta_{i}^{2}\right) and \gamma_{i} is deterministic. We had to move the implicit dependence of the sub-Gaussian term inside of the MGF. We also had to allow \gamma_{i} to be a sub-exponential random variable Z_{i}, and so applied Cauchy-Schwarz, which contributed the 2 in the recursion for (K_{i}).

Note that if \alpha_{i}=1,\gamma_{i}=\gamma,\beta_{i}=\beta~\forall i, then the assumptions on (K_{i}) are satisfied by, for example, K_{0}=X_{0} and K_{i+1}=K_{i}+2\gamma+\beta^{2}~\forall i. Returning to the proof of Theorem 8, for absolute constants a and b, by Proposition 2.5.2 and Proposition 2.7.1 of Vershynin (2018), we can set \beta_{t}^{2}=\frac{18aL\sigma^{2}}{\mu^{2}} and \gamma_{t}=\frac{2bL\sigma^{2}}{\mu^{2}} and apply Theorem 9 with

K_{T}\coloneqq X_{0}+\sum_{t=0}^{T-1}(2\gamma_{t}+\beta_{t}^{2})=X_{0}+\frac{(18a+4b)L\sigma^{2}T}{\mu^{2}}.

Dividing by (T+2\kappa)^{2} completes the proof.
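For completeness, the choice K_{i+1}=K_{i}+2\gamma+\beta^{2} noted above does satisfy the quadratic condition of Theorem 9 when \alpha_{i}=1, \gamma_{i}=\gamma, and \beta_{i}=\beta, since K_{i+1}\geq K_{i}:

K_{i+1}^{2}=K_{i+1}(K_{i}+2\gamma+\beta^{2})=(K_{i}+2\gamma)K_{i+1}+\beta^{2}K_{i+1}\geq(K_{i}+2\gamma)K_{i+1}+\beta^{2}K_{i}.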

Remark 10

It is natural to ask whether we can relax the sub-Gaussian assumption to a sub-Weibull assumption. In Theorem 9, we need a bound on the MGFs of both Y_{i} and Z_{i}. But, if \|e_{t}\| is \sigma-sub-Weibull(\theta) with \theta>1/2, then \|e_{t}\|^{2} is \sigma^{2}-sub-Weibull(2\theta). Thus, Z_{i} would be sub-Weibull with tail parameter greater than 1, and so may have an infinite moment generating function for all \lambda>0.

A big take-away from the PŁ analysis is its simplicity, matching the simplicity of gradient descent’s convergence analysis in the same setting. It also serves as a good warm-up for the non-convex analysis that follows. Induction on the convergence recursion does not work for the non-convex analysis. Instead it requires an MDS self-normalized concentration inequality, which we state and prove in the next section.

4 Concentration Inequality

In this section, we prove the following MDS concentration inequality. For \theta=1/2 without \alpha (i.e., \alpha=0), Eq. (6), we recover the classical Freedman inequality (Freedman, 1975). For \theta\in(1/2,1] without \alpha, Eq. (8), we recover Theorem 2.6 of Fan et al. (2015). For \theta=1/2 with \alpha, Eq. (5), we recover Theorem 3.3 of Harvey et al. (2019), called the “Generalized Freedman” inequality.

Theorem 11

Let (\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (\xi_{i}) and (K_{i}) be adapted to (\mathcal{F}_{i}). Let n\in\mathbb{N}. For all i\in[n], assume K_{i-1}\geq 0 almost surely, \mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0, and

\mathbb{E}\left[\exp\left((|\xi_{i}|/K_{i-1})^{1/\theta}\right)\mid\mathcal{F}_{i-1}\right]\leq 2

where \theta\geq 1/2. If \theta>1/2, assume there exist constants (m_{i}) such that K_{i-1}\leq m_{i} almost surely for all i\in[n].

If \theta=1/2, then for all x,\beta\geq 0, \alpha>0, and \lambda\in\left[0,\frac{1}{2\alpha}\right],

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}2K_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg\}\right)\leq\exp(-\lambda x+2\lambda^{2}\beta), \qquad (5)

and for all x,\beta,\lambda\geq 0,

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}2K_{i-1}^{2}\leq\beta\bigg\}\right)\leq\exp\left(-\lambda x+\frac{\lambda^{2}}{2}\beta\right). \qquad (6)

If \theta\in\left(\frac{1}{2},1\right], define

a=(4\theta)^{2\theta}e^{2}
b=(4\theta)^{\theta}e.

For all x,\beta\geq 0, \alpha\geq b\max_{i\in[n]}m_{i}, and \lambda\in\left[0,\frac{1}{2\alpha}\right],

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg\}\right)\leq\exp(-\lambda x+2\lambda^{2}\beta), \qquad (7)

and for all x,\beta\geq 0 and \lambda\in\left[0,\frac{1}{b\max_{i\in[n]}m_{i}}\right],

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\beta\bigg\}\right)\leq\exp\left(-\lambda x+\frac{\lambda^{2}}{2}\beta\right). \qquad (8)

If \theta>1, let \delta\in(0,1). Define

a=(2^{2\theta+1}+2)\Gamma(2\theta+1)+\frac{2^{3\theta}\Gamma(3\theta+1)}{3\log(n/\delta)^{\theta-1}}
b=2\log(n/\delta)^{\theta-1}.

For all x,\beta\geq 0, \alpha\geq b\max_{i\in[n]}m_{i}, and \lambda\in\left[0,\frac{1}{2\alpha}\right],

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg\}\right)\leq\exp(-\lambda x+2\lambda^{2}\beta)+2\delta,

and for all x,\beta\geq 0 and \lambda\in\left[0,\frac{1}{b\max_{i\in[n]}m_{i}}\right],

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\beta\bigg\}\right)\leq\exp\left(-\lambda x+\frac{\lambda^{2}}{2}\beta\right)+2\delta. \qquad (9)

Proof  Let (\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (\xi_{i}) and (K_{i}) be adapted to (\mathcal{F}_{i}). Let n\in\mathbb{N}. For all i\in[n], assume 0\leq K_{i-1}\leq m_{i} almost surely, \mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0, and

\mathbb{E}\left[\exp\left((|\xi_{i}|/K_{i-1})^{1/\theta}\right)\mid\mathcal{F}_{i-1}\right]\leq 2.

What we want to upper bound in this setting is

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}K_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg\}\right)

for constants x,\alpha>0 and \beta\geq 0. To understand how to use this, set \beta=0 and observe

P\left(\sum_{i=1}^{n}\xi_{i}\geq x+\frac{1}{\alpha}\sum_{i=1}^{n}K_{i-1}^{2}\right)
\leq P\left(\sum_{i=1}^{n}\xi_{i}\geq x\text{ and }\sum_{i=1}^{n}K_{i-1}^{2}\leq\alpha\sum_{i=1}^{n}\xi_{i}\right)
\leq P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}K_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}\bigg\}\right)

by monotonicity (A\subset B\implies P(A)\leq P(B)). The stronger bound over the union of partial sum bounds does not help us since the constants cannot depend on k. The union of partial sum bounds arises in the proof from a stopping time defined as the first k such that the corresponding set occurs.

Proving such a bound would typically involve using an MGF bound for \xi_{i}, but the MGF is infinite in our setting. To get around this, we truncate \xi_{i} to an appropriate level. By accounting for the probability of exceeding truncation, applying an MGF bound for truncated random variables, and applying a concentration inequality for martingale difference sequences with bounded MGF, we are able to prove a concentration inequality for sub-Weibull martingale difference sequences.

Probability of exceeding truncation.

Let \delta\in(0,1). We want to define the truncation level so that the probability of any of the \xi_{i} exceeding it is smaller than O(\delta). This ends up being \widetilde{\xi}_{i}=\xi_{i}\mathbb{I}_{\{\xi_{i}\leq cK_{i-1}\}} with c=\log(n/\delta)^{\theta} as we will see.

First, for any c>0,

P\left(|\xi_{i}|\geq cK_{i-1}\right) =P\left(\exp\left((|\xi_{i}|/K_{i-1})^{1/\theta}\right)\geq\exp\left(c^{1/\theta}\right)\right)
\leq\exp\left(-c^{1/\theta}\right)\mathbb{E}\left[\exp\left((|\xi_{i}|/K_{i-1})^{1/\theta}\right)\right]
=\exp\left(-c^{1/\theta}\right)\mathbb{E}\left[\mathbb{E}\left[\exp\left((|\xi_{i}|/K_{i-1})^{1/\theta}\right)\mid\mathcal{F}_{i-1}\right]\right]
\leq 2\exp\left(-c^{1/\theta}\right)

where we use Markov’s inequality, the law of total expectation, and the sub-Weibull assumption. In particular, P\left(\xi_{i}>\log(n/\delta)^{\theta}K_{i-1}\right)\leq 2\delta/n.

Then, we can bound

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}K_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg\}\right)

by the sum of

P\left(\bigcup_{k\in[n]}\bigg\{\sum_{i=1}^{k}\widetilde{\xi}_{i}\geq x\text{ and }\sum_{i=1}^{k}K_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\widetilde{\xi}_{i}+\beta\bigg\}\right)

and

P\left(\bigcup_{i\in[n]}\bigg\{\xi_{i}>\log(n/\delta)^{\theta}K_{i-1}\bigg\}\right) \leq\sum_{i=1}^{n}P\left(\xi_{i}>\log(n/\delta)^{\theta}K_{i-1}\right)\leq 2\delta,

and so proceed using \widetilde{\xi}_{i} while carrying around an additional 2\delta probability.
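A small numerical illustration of the tail bound and truncation used above (assuming NumPy; the Weibull noise and parameter values are illustrative, not from the paper). The sampled X is exactly K-sub-Weibull(\theta) for K=2^{\theta}\lambda, and the empirical exceedance probabilities sit below the bound 2\exp(-c^{1/\theta}):

```python
import numpy as np

theta, lam = 1.5, 1.0
K = 2**theta * lam                      # X below is exactly K-sub-Weibull(theta)
rng = np.random.default_rng(0)
X = lam * rng.weibull(1.0 / theta, size=10**6)

# tail bound from the proof: P(|xi| >= c*K) <= 2*exp(-c^(1/theta))
for c in [1.0, 2.0, 4.0, 8.0]:
    print(f"c={c}: empirical {np.mean(X >= c * K):.2e} <= bound {2 * np.exp(-c**(1.0/theta)):.2e}")

# truncation as in the proof, with c = log(n/delta)^theta
n, delta = 1000, 0.01
c = np.log(n / delta) ** theta
X_trunc = X * (X <= c * K)              # xi_tilde = xi * 1{xi <= c*K}
```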

MGF bound of truncated random variable.

In order to bound the MGF of our truncated random variables, we prove the following lemma, which slightly modifies Corollary 2 of Bakhshizadeh et al. (2023) using the law of total expectation. The main subtlety in the extension is where we use the following bound

P\left(|X|\geq t\mid\mathcal{G}\right) =P\left(\exp\left((|X|/K_{0})^{1/\theta}\right)\geq\exp\left((t/K_{0})^{1/\theta}\right)\mid\mathcal{G}\right)
\leq\exp\left(-(t/K_{0})^{1/\theta}\right)\mathbb{E}\left[\exp\left((|X|/K_{0})^{1/\theta}\right)\mid\mathcal{G}\right]
\leq 2\exp\left(-(t/K_{0})^{1/\theta}\right),

which we denote by *. Otherwise, the proof exactly follows theirs.

Lemma 12

Let (\Omega,\mathcal{F},P) be a probability space, \mathcal{G}\subset\mathcal{F} be a sigma algebra, and X and K_{0} be random variables. Assume K_{0} is \mathcal{G}-measurable. Assume, conditioned on \mathcal{G}, that X is centered and K_{0}-sub-Weibull(\theta) with \theta>1. Define \widetilde{X}=X\mathbb{I}_{\{X\leq cK_{0}\}}. Then

\mathbb{E}\left[\exp(\lambda\widetilde{X})\mid\mathcal{G}\right]\leq\exp\left(\frac{\lambda^{2}}{2}aK_{0}^{2}\right)~\forall\lambda\in\left[0,\frac{1}{bK_{0}}\right]

where

a=(2^{2\theta+1}+2)\Gamma(2\theta+1)+\frac{2^{3\theta}\Gamma(3\theta+1)}{3c^{1-1/\theta}}
b=2c^{1-1/\theta}.

Proof  For convenience, define L_{0}=cK_{0}. Let \lambda\in\left[0,\frac{1}{bK_{0}}\right]. That is, \lambda\geq 0 and \lambda L_{0}^{1-1/\theta}K_{0}^{1/\theta}\leq\frac{1}{2}. Since \lambda\geq 0, we have, by Lemma 4 of Bakhshizadeh et al. (2023),

\mathbb{E}\left[\exp(\lambda\widetilde{X})\mid\mathcal{G}\right]\leq\exp\bigg(\frac{\lambda^{2}}{2}\bigg(\mathbb{E}\left[X^{2}\mathbb{I}_{\{X<0\}}\mid\mathcal{G}\right]+\mathbb{E}\left[X^{2}\exp(\lambda X)\mathbb{I}_{\{0\leq X\leq L_{0}\}}\mid\mathcal{G}\right]\bigg)\bigg).

Observe

\mathbb{E}\left[X^{2}\mathbb{I}_{\{X<0\}}\mid\mathcal{G}\right] =\int_{0}^{\infty}P\left(X^{2}\mathbb{I}_{\{X<0\}}>x\mid\mathcal{G}\right)dx
=\int_{0}^{\infty}P\left(X^{2}>t^{2},X<0\mid\mathcal{G}\right)2t\,dt
\leq\int_{0}^{\infty}P\left(|X|>t\mid\mathcal{G}\right)2t\,dt
\overset{*}{\leq}2\int_{0}^{\infty}\exp\left(-(t/K_{0})^{1/\theta}\right)2t\,dt
=2\Gamma(2\theta+1)K_{0}^{2}

and

\mathbb{E}\left[X^{2}\exp(\lambda X)\mathbb{I}_{\{0\leq X\leq L_{0}\}}\mid\mathcal{G}\right]
=\int_{0}^{\infty}P\left(X^{2}\exp(\lambda X)\mathbb{I}_{\{0\leq X\leq L_{0}\}}>x\mid\mathcal{G}\right)dx
=\int_{0}^{\infty}P\left(X^{2}\exp(\lambda X)>t^{2}\exp(\lambda t),0\leq X\leq L_{0}\mid\mathcal{G}\right)(2t+\lambda t^{2})\exp(\lambda t)dt
=\int_{0}^{\infty}P\left(|X|>t,0\leq X\leq L_{0}\mid\mathcal{G}\right)(2t+\lambda t^{2})\exp(\lambda t)dt
\leq\int_{0}^{L_{0}}P\left(|X|>t\mid\mathcal{G}\right)(2t+\lambda t^{2})\exp(\lambda t)dt
\overset{*}{\leq}2\int_{0}^{L_{0}}\exp\left(-\left(1-\lambda t^{1-1/\theta}K_{0}^{1/\theta}\right)(t/K_{0})^{1/\theta}\right)(2t+\lambda t^{2})dt
\leq 2\int_{0}^{L_{0}}\exp\left(-\left(1-\lambda L_{0}^{1-1/\theta}K_{0}^{1/\theta}\right)(t/K_{0})^{1/\theta}\right)(2t+\lambda t^{2})dt
=2\int_{0}^{L_{0}}\exp(-u)\left(\frac{2K_{0}^{2}\theta u^{2\theta-1}}{\left(1-\lambda L_{0}^{1-1/\theta}K_{0}^{1/\theta}\right)^{2\theta}}+\frac{\lambda K_{0}^{3}\theta u^{3\theta-1}}{\left(1-\lambda L_{0}^{1-1/\theta}K_{0}^{1/\theta}\right)^{3\theta}}\right)du
=\frac{2K_{0}^{2}\Gamma(2\theta+1)}{\left(1-\lambda L_{0}^{1-1/\theta}K_{0}^{1/\theta}\right)^{2\theta}}+\frac{2\lambda K_{0}^{3}\Gamma(3\theta+1)}{3\left(1-\lambda L_{0}^{1-1/\theta}K_{0}^{1/\theta}\right)^{3\theta}}
\leq\left(2^{2\theta+1}\Gamma(2\theta+1)+\frac{2^{3\theta}K_{0}\Gamma(3\theta+1)}{3L_{0}^{1-1/\theta}K_{0}^{1/\theta}}\right)K_{0}^{2}
=\left(2^{2\theta+1}\Gamma(2\theta+1)+\frac{2^{3\theta}\Gamma(3\theta+1)}{3\log(n/\delta)^{\theta-1}}\right)K_{0}^{2}.

 

Applying this to our setting, we get the MGF bound

\mathbb{E}\left[\exp(\lambda\widetilde{\xi}_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}aK_{i-1}^{2}\right)~\forall\lambda\in\left[0,\frac{1}{bK_{i-1}}\right] \qquad (10)

where

a=(2^{2\theta+1}+2)\Gamma(2\theta+1)+\frac{2^{3\theta}\Gamma(3\theta+1)}{3\log(n/\delta)^{\theta-1}}
b=2\log(n/\delta)^{\theta-1}.

Concentration inequality for MGF bound.

Now we just need a self-normalized concentration inequality in the following setting:

Let (\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (\xi_{i}) and (K_{i}) be adapted to (\mathcal{F}_{i}). Let n\in\mathbb{N}. For all i\in[n], assume 0\leq K_{i-1}\leq m_{i} almost surely, \mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0, and

\mathbb{E}\left[\exp(\lambda\xi_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}aK_{i-1}^{2}\right)~\forall\lambda\in\left[0,\frac{1}{bK_{i-1}}\right]

for constants a,b>0.

We can recognize this as a centered sub-exponential type MGF bound from Proposition 2.7.1 of Vershynin (2018). Theorem 2.6 of Fan et al. (2015) proves a concentration inequality in this setting, but not a self-normalized one. On the other hand, Theorem 3.3 of Harvey et al. (2019) proves a self-normalized concentration inequality in the centered sub-Gaussian setting, thus extending the original result of Freedman (1975) to a self-normalized result. We use the same trick as Harvey et al. (2019) to extend Theorem 2.6 of Fan et al. (2015) to a self-normalized result.

We distill the proof technique of Fan et al. (2015) into the following lemma so that we can apply it in our setting.

Lemma 13

(Fan et al., 2015, Pf. of Thm. 2.1) Let (\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. If (\psi_{i}) is adapted to (\mathcal{F}_{i}) and (A_{k}) is a sequence of events, then

P\left(\bigcup_{k\in[n]}A_{k}\right)\leq\sup_{k\in[n]}\mathbbm{I}_{A_{k}}\prod_{i=1}^{k}\frac{\mathbb{E}\left[\psi_{i}\mid\mathcal{F}_{i-1}\right]}{\psi_{i}}.

Proof  Defining Zk=i=1kψi𝔼[ψii1]Z_{k}=\prod_{i=1}^{k}\frac{\psi_{i}}{\mathbb{E}\left[\psi_{i}\mid\mathcal{F}_{i-1}\right]}, then (Zk)(Z_{k}) is a martingale. Let TT be a stopping time. Then the stopped process (ZkT)(Z_{k\land T}) is a martingale (where aba\land b denotes min{a,b}\min\{a,b\}). Moreover, ZkTZ_{k\land T} is a probability density so define the conjugate probability measure dP=ZnTdPdP^{\prime}=Z_{n\land T}dP.

Define the stopping time T(ω)=min{k[n]ωAk}T(\omega)=\min\{k\in[n]\mid\omega\in A_{k}\}, with T(ω)=T(\omega)=\infty if no such kk exists. Then 𝕀k[n]Ak=k=1n𝕀{T=k}\mathbbm{I}_{\cup_{k\in[n]}A_{k}}=\sum_{k=1}^{n}\mathbbm{I}_{\{T=k\}}.

Observe,

P(k[n]Ak)\displaystyle P\left(\bigcup_{k\in[n]}A_{k}\right) =𝔼[ZnT1k=1n𝕀{T=k}]\displaystyle=\mathbb{E}^{\prime}\left[Z_{n\land T}^{-1}\sum_{k=1}^{n}\mathbbm{I}_{\{T=k\}}\right]
=k=1n𝔼[i=1k𝔼[ψii1]ψi𝕀{T=k}]\displaystyle=\sum_{k=1}^{n}\mathbb{E}^{\prime}\left[\prod_{i=1}^{k}\frac{\mathbb{E}\left[\psi_{i}\mid\mathcal{F}_{i-1}\right]}{\psi_{i}}\mathbbm{I}_{\{T=k\}}\right]
(supk[n]𝕀Aki=1k𝔼[ψii1]ψi)k=1n𝔼[𝕀{T=k}]\displaystyle\leq\left(\sup_{k\in[n]}\mathbbm{I}_{A_{k}}\prod_{i=1}^{k}\frac{\mathbb{E}\left[\psi_{i}\mid\mathcal{F}_{i-1}\right]}{\psi_{i}}\right)\sum_{k=1}^{n}\mathbb{E}^{\prime}\left[\mathbbm{I}_{\{T=k\}}\right]
=supk[n]𝕀Aki=1k𝔼[ψii1]ψi.\displaystyle=\sup_{k\in[n]}\mathbbm{I}_{A_{k}}\prod_{i=1}^{k}\frac{\mathbb{E}\left[\psi_{i}\mid\mathcal{F}_{i-1}\right]}{\psi_{i}}.

 

We apply Lemma 13 to our setting to get the following self-normalized concentration inequality.

Lemma 14

Let (Ω,,(i),P)(\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (ξi)(\xi_{i}) and (Ki)(K_{i}) be adapted to (i)(\mathcal{F}_{i}). Let nn\in\mathbb{N}. For all i[n]i\in[n], assume 0Ki1mi0\leq K_{i-1}\leq m_{i} almost surely, 𝔼[ξii1]=0\mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0, and

𝔼[exp(λξi)i1]exp(λ22aKi12)λ[0,1bKi1]\displaystyle\mathbb{E}\left[\exp(\lambda\xi_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}aK_{i-1}^{2}\right)~{}\forall\lambda\in\left[0,\frac{1}{bK_{i-1}}\right]

for constants a,b>0a,b>0. Then, for all x,β0x,\beta\geq 0, and αbmaxi[n]mi\alpha\geq b\max_{i\in[n]}m_{i}, and λ[0,12α]\lambda\in\left[0,\frac{1}{2\alpha}\right],

P(k[n]{i=1kξix and i=1kaKi12αi=1kξi+β})\displaystyle P\left(\bigcup_{k\in[n]}\bigg{\{}\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg{\}}\right) exp(λx+2λ2β).\displaystyle\leq\exp(-\lambda x+2\lambda^{2}\beta).

Proof  By Claim C.2 of Harvey et al. (2019), if 0λ12α0\leq\lambda\leq\frac{1}{2\alpha}, then c[0,2]\exists~{}c\in[0,2] such that 12(λ+αcλ2)2=cλ2\frac{1}{2}(\lambda+\alpha c\lambda^{2})^{2}=c\lambda^{2}. Define

ψi=exp((λ+αcλ2)ξi).\displaystyle\psi_{i}=\exp\left((\lambda+\alpha c\lambda^{2})\xi_{i}\right).

With 0λ12α0\leq\lambda\leq\frac{1}{2\alpha}, we need λ+αcλ21bKi1\lambda+\alpha c\lambda^{2}\leq\frac{1}{bK_{i-1}} so that the MGF bound applies to ψi\psi_{i}. This is ensured by αbmaxi[n]mi\alpha\geq b\max_{i\in[n]}m_{i}: since c2c\leq 2 and 2αλ12\alpha\lambda\leq 1, we have λ+αcλ22λ1α1bmaxi[n]mi1bKi1\lambda+\alpha c\lambda^{2}\leq 2\lambda\leq\frac{1}{\alpha}\leq\frac{1}{b\max_{i\in[n]}m_{i}}\leq\frac{1}{bK_{i-1}}.

Define

Ak={i=1kξix and i=1kaKi12αi=1kξi+β}.\displaystyle A_{k}=\bigg{\{}\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg{\}}.

Then ωAk\omega\in A_{k} implies

i=1k𝔼[ψii1]ψi\displaystyle\prod_{i=1}^{k}\frac{\mathbb{E}\left[\psi_{i}\mid\mathcal{F}_{i-1}\right]}{\psi_{i}} exp((λ+αcλ2)i=1kξi+(λ+αcλ2)22i=1kaKi12)\displaystyle\leq\exp\left(-(\lambda+\alpha c\lambda^{2})\sum_{i=1}^{k}\xi_{i}+\frac{(\lambda+\alpha c\lambda^{2})^{2}}{2}\sum_{i=1}^{k}aK_{i-1}^{2}\right)
exp(λx+cλ2β)\displaystyle\leq\exp(-\lambda x+c\lambda^{2}\beta)
exp(λx+2λ2β).\displaystyle\leq\exp(-\lambda x+2\lambda^{2}\beta).

 

Final step.

Putting everything together proves the θ>1\theta>1 and α>0\alpha>0 case. The rest of the work for the other cases is included in Appendix B.  

5 Non-convex Convergence

In this section, we prove the following convergence bound.

Theorem 15

Assume ff is LL-smooth and that, conditioned on the previous iterates, ete_{t} is centered and et\|e_{t}\| is K-sub-Weibull(θ)K\text{-sub-Weibull}(\theta) with θ1/2\theta\geq 1/2. If θ>1/2\theta>1/2, assume ff is ρ\rho-Lipschitz. Let δ1,δ2,δ3(0,1)\delta_{1},\delta_{2},\delta_{3}\in(0,1) and define δ=max{δ1,δ2,δ3}\delta=\max\{\delta_{1},\delta_{2},\delta_{3}\}. Then, for TT iterations of SGD with ηt=c/t+1\eta_{t}=c/\sqrt{t+1} where c1/Lc\leq 1/L, w.p. 1δ1δ2δ3\geq 1-\delta_{1}-\delta_{2}-\delta_{3},

1Tt=0T11t+1f(xt)2\displaystyle\frac{1}{\sqrt{T}}\sum_{t=0}^{T-1}\frac{1}{\sqrt{t+1}}\|\nabla f(x_{t})\|^{2}
4(f(x0)f)cT+γ(θ)log(1/δ3)T+4LK2c(4eθlog(2/δ1))2θlog(T+1)T\displaystyle\hskip 28.45274pt\leq\frac{4(f(x_{0})-f^{\star})}{c\sqrt{T}}+\frac{\gamma(\theta)\log(1/\delta_{3})}{\sqrt{T}}+\frac{4LK^{2}c(4e\theta\log(2/\delta_{1}))^{2\theta}\log(T+1)}{\sqrt{T}}
=O(log(T)log(1/δ)2θ+γ(θ)log(1/δ)T)\displaystyle\hskip 28.45274pt=O\left(\frac{\log(T)\log(1/\delta)^{2\theta}+\gamma(\theta)\log(1/\delta)}{\sqrt{T}}\right)

where

γ(θ)={64K2θ=1/28max{(4θ)θeKρ, 4(4θ)2θe2K2}θ(1/2,1]8max{2log(2T/δ2)θ1Kρ,4[(22θ+1+2)Γ(2θ+1)+23θΓ(3θ+1)3log(2T/δ2)θ1]K2}θ>1\gamma(\theta)=\begin{cases}64K^{2}&\theta=1/2\\ 8\max\left\{(4\theta)^{\theta}eK\rho,\;4(4\theta)^{2\theta}e^{2}K^{2}\right\}&\theta\in(1/2,1]\\ \begin{split}8\max\bigg{\{}&2\log(2T/\delta_{2})^{\theta-1}K\rho,\\ &4\left[(2^{2\theta+1}+2)\Gamma(2\theta+1)+\frac{2^{3\theta}\Gamma(3\theta+1)}{3\log(2T/\delta_{2})^{\theta-1}}\right]K^{2}\bigg{\}}\end{split}&\theta>1\end{cases}

and observe γ(θ)=O(log(T/δ)max{0,θ1})\gamma(\theta)=O\left(\log(T/\delta)^{\max\{0,\theta-1\}}\right) for any θ12\theta\geq\frac{1}{2}.

Proof  As with the PŁ analysis, we start the non-convex analysis with Eq. (3). From this, we get a master bound on a weighted sum of the f(xt)2\|\nabla f(x_{t})\|^{2}. Our goal is a convergence rate for a weighted average of the f(xt)2\|\nabla f(x_{t})\|^{2}, since this would imply convergence to a first-order stationary point, which is the best one can hope for without further assumptions. The master bound is in terms of two sums, an inner product sum and a norm sum. We bound the norm sum using an established sub-Weibull concentration inequality. The inner product sum, on the other hand, is the part that requires the MDS concentration inequality of Theorem 11.

Master bound.

As in Section 3, again assume ff is LL-smooth, hence

f(y)\displaystyle f(y) f(x)+f(x),yx+L2xy2\displaystyle\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|x-y\|^{2}

for all x,ydx,y\in\mathbb{R}^{d}. Set y=xt+1=xtηtgt=xtηt(f(xt)et)y=x_{t+1}=x_{t}-\eta_{t}g_{t}=x_{t}-\eta_{t}(\nabla f(x_{t})-e_{t}) and x=xtx=x_{t}. Then we get

f(xt+1)\displaystyle f(x_{t+1}) f(xt)ηt(1Lηt2)f(xt)2+ηt(1Lηt)f(xt),et+Lηt22et2.\displaystyle\leq f(x_{t})-\eta_{t}\left(1-\frac{L\eta_{t}}{2}\right)\|\nabla f(x_{t})\|^{2}+\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle+\frac{L\eta_{t}^{2}}{2}\|e_{t}\|^{2}.

Summing this and using f(xT)ff(x_{T})\geq f^{\star}, we get

t=0T1ηt(1Lηt2)f(xt)2\displaystyle\sum_{t=0}^{T-1}\eta_{t}\left(1-\frac{L\eta_{t}}{2}\right)\|\nabla f(x_{t})\|^{2} f(x0)f+t=0T1ηt(1Lηt)f(xt),etinner product sum+L2t=0T1ηt2et2norm sum.\displaystyle\leq f(x_{0})-f^{\star}+\underset{\text{inner product sum}}{\underbrace{\sum_{t=0}^{T-1}\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle}}+\underset{\text{norm sum}}{\underbrace{\frac{L}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\|e_{t}\|^{2}}}. (11)

We would like to bound

t=0T1ηt(1Lηt)f(xt),et\displaystyle\sum_{t=0}^{T-1}\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle O(t=0T1ηt2f(xt)2)\displaystyle\leq O\left(\sum_{t=0}^{T-1}\eta_{t}^{2}\|\nabla f(x_{t})\|^{2}\right)
and L2t=0T1ηt2et2\displaystyle\text{and }\frac{L}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\|e_{t}\|^{2} O(t=0T1ηt2)\displaystyle\leq O\left(\sum_{t=0}^{T-1}\eta_{t}^{2}\right)

with high probability so that if ηt=Θ(1/t+1)\eta_{t}=\Theta(1/\sqrt{t+1}), we get

min0tT1f(xt)21Tt=0T11t+1f(xt)2O(log(T+1)T)\displaystyle\min_{0\leq t\leq T-1}\|\nabla f(x_{t})\|^{2}\leq\frac{1}{\sqrt{T}}\sum_{t=0}^{T-1}\frac{1}{\sqrt{t+1}}\|\nabla f(x_{t})\|^{2}\leq O\left(\frac{\log(T+1)}{\sqrt{T}}\right)

with high probability. While these target bounds are stated in big-OO notation, the bounds we actually prove come with explicit constants.

Norm sum bound.

Assume that, conditioned on the previous iterates, ete_{t} is centered and et\|e_{t}\| is KK-sub-Weibull(θ\theta) with θ1/2\theta\geq 1/2. Set ηt=c/t+1\eta_{t}=c/\sqrt{t+1} with c1/Lc\leq 1/L. Let δ1(0,1)\delta_{1}\in(0,1). Using the law of total expectation,

𝔼[exp((ηt2et2ηt2K2)1/2θ)]2.\displaystyle\mathbb{E}\left[\exp\left(\left(\frac{\eta_{t}^{2}\|e_{t}\|^{2}}{\eta_{t}^{2}K^{2}}\right)^{1/2\theta}\right)\right]\leq 2.

Thus, ηt2et2\eta_{t}^{2}\|e_{t}\|^{2} is ηt2K2\eta_{t}^{2}K^{2}-sub-Weibull(2θ2\theta), so we can apply the following sub-Weibull concentration inequality.

Lemma 16

(Vladimirova et al., 2020, Thm. 1) (Wong et al., 2020, Lma. 5) Suppose X1,,XnX_{1},\ldots,X_{n} are sub-Weibull(θ)(\theta) with respective parameters K1,,KnK_{1},\ldots,K_{n}. Then, for all γ0\gamma\geq 0,

P(|i=1nXi|γ)2exp((γv(θ)i=1nKi)1/θ),\displaystyle P\left(\bigg{|}\sum_{i=1}^{n}X_{i}\bigg{|}\geq\gamma\right)\leq 2\exp\left(-\left(\frac{\gamma}{v(\theta)\sum_{i=1}^{n}K_{i}}\right)^{1/\theta}\right),

where v(θ)=(4e)θv(\theta)=(4e)^{\theta} for θ1\theta\leq 1 and v(θ)=2(2eθ)θv(\theta)=2(2e\theta)^{\theta} for θ1\theta\geq 1.

Applying Lemma 16, we get, w.p. 1δ1\geq 1-\delta_{1},

L2t=0T1ηt2et2\displaystyle\frac{L}{2}\sum_{t=0}^{T-1}\eta_{t}^{2}\|e_{t}\|^{2} LK2(4eθlog(2/δ1))2θt=0T1ηt2\displaystyle\leq LK^{2}(4e\theta\log(2/\delta_{1}))^{2\theta}\sum_{t=0}^{T-1}\eta_{t}^{2}
LK2c2(4eθlog(2/δ1))2θlog(T+1).\displaystyle\leq LK^{2}c^{2}(4e\theta\log(2/\delta_{1}))^{2\theta}\log(T+1).
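Spelled out, the first inequality is Lemma 16 applied to the T summands \eta_{t}^{2}\|e_{t}\|^{2}, which are sub-Weibull(2\theta) with scale parameters \eta_{t}^{2}K^{2}: setting the tail bound of Lemma 16 equal to \delta_{1} and solving for the threshold gives

\displaystyle 2\exp\left(-\left(\frac{\gamma}{v(2\theta)K^{2}\sum_{t=0}^{T-1}\eta_{t}^{2}}\right)^{1/(2\theta)}\right)=\delta_{1}\quad\Longleftrightarrow\quad\gamma=v(2\theta)\log(2/\delta_{1})^{2\theta}K^{2}\sum_{t=0}^{T-1}\eta_{t}^{2},

and v(2\theta)=2(4e\theta)^{2\theta} since 2\theta\geq 1, so multiplying by L/2 gives the stated constant.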

Inner product sum bound.

Assume that θ>1\theta>1. We will prove the easier cases of θ=1/2\theta=1/2 and θ(1/2,1]\theta\in(1/2,1] in the appendix. Assume ff is ρ\rho-Lipschitz continuous. Define

ξt\displaystyle\xi_{t} =ηt(1Lηt)f(xt),et\displaystyle=\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle
Kt1\displaystyle K_{t-1} =ηt(1Lηt)Kf(xt)\displaystyle=\eta_{t}(1-L\eta_{t})K\|\nabla f(x_{t})\|
and mt\displaystyle\text{and }m_{t} =ηt(1Lηt)Kρ.\displaystyle=\eta_{t}(1-L\eta_{t})K\rho.

Recall that we defined t\mathcal{F}_{t} as the sigma algebra generated by e0,,ete_{0},\ldots,e_{t} for all t0t\geq 0 and 1={,Ω}\mathcal{F}_{-1}=\{\emptyset,\Omega\}. So, ξt\xi_{t} is t\mathcal{F}_{t}-measurable and Kt1K_{t-1} is t1\mathcal{F}_{t-1}-measurable; hence (ξt)(\xi_{t}) and (Kt)(K_{t}) are adapted to (t)(\mathcal{F}_{t}).

𝔼[exp((|ξt|/Kt1)1/θ)t1]2.\displaystyle\mathbb{E}\left[\exp\left((|\xi_{t}|/K_{t-1})^{1/\theta}\right)\mid\mathcal{F}_{t-1}\right]\leq 2.

In other words, (ξt)(\xi_{t}) is a sub-Weibull MDS and (Kt)(K_{t}) captures the scale parameters. Let δ2,δ3(0,1)\delta_{2},\delta_{3}\in(0,1) and define

δ\displaystyle\delta =δ2\displaystyle=\delta_{2}
β\displaystyle\beta =0\displaystyle=0
λ\displaystyle\lambda =12α\displaystyle=\frac{1}{2\alpha}
and x\displaystyle\text{and }x =2αlog(1/δ3).\displaystyle=2\alpha\log(1/\delta_{3}).

Applying Theorem 11, we get, for all αbKρc\alpha\geq bK\rho c, w.p. 12δ2δ3\geq 1-2\delta_{2}-\delta_{3},

t=0T1ηt(1Lηt)f(xt),et\displaystyle\sum_{t=0}^{T-1}\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle 2αlog(1/δ3)+aK2αt=0T1ηt2(1Lηt)2f(xt)2.\displaystyle\leq 2\alpha\log(1/\delta_{3})+\frac{aK^{2}}{\alpha}\sum_{t=0}^{T-1}\eta_{t}^{2}(1-L\eta_{t})^{2}\|\nabla f(x_{t})\|^{2}.
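To spell out the step behind this display: with the choices above, \beta=0 and \exp(-\lambda x+2\lambda^{2}\beta)=\exp(-\log(1/\delta_{3}))=\delta_{3}, so on the complement of the controlled event (which has the same form as the event in Lemma 14), taking k=T, either \sum_{t}\xi_{t}<x or \sum_{t}aK_{t-1}^{2}>\alpha\sum_{t}\xi_{t}, and in either case

\displaystyle\sum_{t=0}^{T-1}\xi_{t}\leq x+\frac{1}{\alpha}\sum_{t=0}^{T-1}aK_{t-1}^{2},

which is the display above after substituting x=2\alpha\log(1/\delta_{3}), \xi_{t}=\eta_{t}(1-L\eta_{t})\langle\nabla f(x_{t}),e_{t}\rangle, and K_{t-1}=\eta_{t}(1-L\eta_{t})K\|\nabla f(x_{t})\|.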

Combining this with the norm sum bound and master bound, we get, for all αbKρc\alpha\geq bK\rho c, w.p. 1δ12δ2δ3\geq 1-\delta_{1}-2\delta_{2}-\delta_{3},

t=0T1ηtνtf(xt)2f(x0)f+2αlog(1/δ3)+LK2c2(4eθlog(2/δ1))2θlog(T+1)\displaystyle\sum_{t=0}^{T-1}\eta_{t}\nu_{t}\|\nabla f(x_{t})\|^{2}\leq f(x_{0})-f^{\star}+2\alpha\log(1/\delta_{3})+LK^{2}c^{2}(4e\theta\log(2/\delta_{1}))^{2\theta}\log(T+1)

where

νt=1Lηt2aK2αηt(1Lηt)2.\displaystyle\nu_{t}=1-\frac{L\eta_{t}}{2}-\frac{aK^{2}}{\alpha}\eta_{t}(1-L\eta_{t})^{2}.

We want to bound νt\nu_{t} away from zero. To do so, assume c1Lc\leq\frac{1}{L} and α4aK2c\alpha\geq 4aK^{2}c. Then Lηt1L\eta_{t}\leq 1 and aK2αηt(1Lηt)2aK2αc14\frac{aK^{2}}{\alpha}\eta_{t}(1-L\eta_{t})^{2}\leq\frac{aK^{2}}{\alpha}c\leq\frac{1}{4}, so νt11214=14\nu_{t}\geq 1-\frac{1}{2}-\frac{1}{4}=\frac{1}{4}. Setting

α=max{bKρ,4aK2}c\displaystyle\alpha=\max\{bK\rho,4aK^{2}\}c

and plugging in aa and bb completes the proof.  

Remark 17

Note that Lipschitz continuity follows immediately if the iterates are bounded. This might lead one to consider using projected SGD, but there are certain issues preventing us from analyzing it, which we discuss in Appendices C and D. But is Lipschitz continuity even a necessary assumption? Li and Liu (2022), building on our preprint, were able to relax the Lipschitz continuity assumption to the assumption that 1t+1f(xt)ρt0\frac{1}{\sqrt{t+1}}\|\nabla f(x_{t})\|\leq\rho~{}\forall t\geq 0. To see that this works, note that we can change the definition of mtm_{t} to c(1Lηt)Kρc(1-L\eta_{t})K\rho and the rest of the analysis, including the result, still holds.

Remark 18

Can we extend the analysis beyond sub-Weibull? Yes, but then we would not get logarithmic dependence on 1/δ1/\delta. For example, if we assume 𝔼[etpt1]\mathbb{E}\left[\|e_{t}\|^{p}\mid\mathcal{F}_{t-1}\right] is bounded for some p>2p>2, then we could use Corollary 3 instead of Corollary 2 of Bakhshizadeh et al. (2023). This would give us an O(log(1/δ)/T+1/(δaTb))O(\log(1/\delta)/\sqrt{T}+1/(\delta^{a}T^{b})) convergence rate for some b>1/2b>1/2. Thus, we would not have a logarithmic dependence on 1/δ1/\delta in general, but would approach such a dependence as the number of iterations increases.
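Before turning to post-processing, the following minimal sketch makes the quantity guaranteed by Theorem 15 concrete. The stochastic gradient oracle stochastic_grad and the exact gradient full_grad (used only to record the tracked quantity) are hypothetical placeholders for whatever the user's problem supplies.

    import numpy as np

    def sgd_weighted_average(stochastic_grad, full_grad, x0, T, c):
        """Run SGD with step size c/sqrt(t+1) and return the weighted average
        (1/sqrt(T)) * sum_t ||grad f(x_t)||^2 / sqrt(t+1) bounded in Theorem 15.

        stochastic_grad(x) returns a noisy gradient g_t; full_grad(x) returns
        grad f(x) and is used only to record the tracked quantity.
        """
        x = np.array(x0, dtype=float)
        weighted_sum = 0.0
        for t in range(T):
            eta = c / np.sqrt(t + 1)                          # step size eta_t
            weighted_sum += float(np.sum(full_grad(x) ** 2)) / np.sqrt(t + 1)
            x = x - eta * stochastic_grad(x)                  # SGD update
        return weighted_sum / np.sqrt(T)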

6 Post-processing

Note that the results of Theorem 15 are in terms of 1Tt=0T11t+1f(xt)2\frac{1}{\sqrt{T}}\sum_{t=0}^{T-1}\frac{1}{\sqrt{t+1}}\|\nabla f(x_{t})\|^{2} which is not a particularly useful quantity by itself. To get a bound in terms of a single iterate, we prove the following probability result, introduce a novel post-processing strategy which outputs a single iterate xx, and apply the probability result to bound f(x)2\|\nabla f(x)\|^{2} with high probability.

Theorem 19

Let TT\in\mathbb{N}. For all t[T]t\in[T], let Z=tZ=t with probability ptp_{t}, where t=1Tpt=1\sum_{t=1}^{T}p_{t}=1. Let Z1,,ZnZ_{1},\ldots,Z_{n} be independent copies of ZZ. Let Y={Z1,,Zn}Y=\{Z_{1},\ldots,Z_{n}\}. Let XX be an +T\mathbb{R}_{+}^{T}-valued random variable independent of (Z1,,Zn)(Z_{1},\ldots,Z_{n}). Then

P(mintYXt>eγ)exp(n)+P(t=1TptXt>γ)γ>0.\displaystyle P\left(\min_{t\in Y}X_{t}>e\gamma\right)\leq\exp(-n)+P\left(\sum_{t=1}^{T}p_{t}X_{t}>\gamma\right)~{}\forall\gamma>0.

Proof  First, letting γ>0\gamma>0 and x+Tx\in\mathbb{R}_{+}^{T},

P(mintYxt>γ)\displaystyle P\left(\min_{t\in Y}x_{t}>\gamma\right) =P(i=1n{xZi>γ})\displaystyle=P\left(\bigcap_{i=1}^{n}\{x_{Z_{i}}>\gamma\}\right)
=(i)i=1nP(xZi>γ)\displaystyle\overset{(i)}{=}\prod_{i=1}^{n}P\left(x_{Z_{i}}>\gamma\right)
=P(xZ>γ)n\displaystyle=P\left(x_{Z}>\gamma\right)^{n}
(ii)(1γ𝔼[xZ])n\displaystyle\overset{(ii)}{\leq}\left(\frac{1}{\gamma}\mathbb{E}\left[x_{Z}\right]\right)^{n}
=(1γt=1Tptxt)n,\displaystyle=\left(\frac{1}{\gamma}\sum_{t=1}^{T}p_{t}x_{t}\right)^{n},

where (i)(i) follows by the independence of the ZiZ_{i} and (ii)(ii) follows by Markov’s inequality since xZx_{Z} is non-negative almost surely. Next, define

A={x+T|t=1Tptxtγ}\displaystyle A=\bigg{\{}x\in\mathbb{R}_{+}^{T}~{}\bigg{|}~{}\sum_{t=1}^{T}p_{t}x_{t}\leq\gamma\bigg{\}}
B={(x,y)+T×[T]n|xyi>eγi[n]}.\displaystyle B=\bigg{\{}(x,y)\in\mathbb{R}_{+}^{T}\times[T]^{n}~{}\bigg{|}~{}x_{y_{i}}>e\gamma~{}\forall i\in[n]\bigg{\}}.

Observe,

P((X,Y)B)\displaystyle P\left((X,Y)\in B\right) =(i)P(XA,(X,Y)B)+P(XAc,(X,Y)B)\displaystyle\overset{(i)}{=}P\left(X\in A,(X,Y)\in B\right)+P\left(X\in A^{c},(X,Y)\in B\right)
=(ii)AP((x,Y)B)μ(dx)+AcP((x,Y)B)μ(dx)\displaystyle\overset{(ii)}{=}\int_{A}P\left((x,Y)\in B\right)\mu(dx)+\int_{A^{c}}P\left((x,Y)\in B\right)\mu(dx)
A(1eγt=1Tptxt)nμ(dx)+Acμ(dx)\displaystyle\leq\int_{A}\left(\frac{1}{e\gamma}\sum_{t=1}^{T}p_{t}x_{t}\right)^{n}\mu(dx)+\int_{A^{c}}\mu(dx)
exp(n)Aμ(dx)+P(XAc)\displaystyle\leq\exp(-n)\int_{A}\mu(dx)+P\left(X\in A^{c}\right)
exp(n)+P(t=1TptXt>γ)\displaystyle\leq\exp(-n)+P\left(\sum_{t=1}^{T}p_{t}X_{t}>\gamma\right)

where (i) follows from the law of total probability and (ii) follows from Theorem 20.3 of Billingsley (1995), with μ\mu denoting the distribution of XX, since XX and YY are independent.  

Corollary 20

Assume ff is differentiable. Let δiter(0,1)\delta_{\text{iter}}\in(0,1) and TT\in\mathbb{N}. Set niter=log(1/δiter)n_{\text{iter}}=\lceil\log(1/\delta_{\text{iter}})\rceil. Sample nitern_{\text{iter}} indices with replacement from {0,,T1}\{0,\ldots,T-1\} with probabilities p0,,pT1p_{0},\ldots,p_{T-1} to form the set S={s1,,sniter}S=\{s_{1},\ldots,s_{n_{\text{iter}}}\}. Then, for TT iterations of SGD with any step-size sequence (ηt)(\eta_{t}),

P(mintSf(xt)2>eγ)P(t=0T1ptf(xt)2>γ)+δiterγ>0.\displaystyle P\left(\min_{t\in S}\|\nabla f(x_{t})\|^{2}>e\gamma\right)\leq P\left(\sum_{t=0}^{T-1}p_{t}\|\nabla f(x_{t})\|^{2}>\gamma\right)+\delta_{\text{iter}}~{}\forall\gamma>0.

To combine the corollary with Theorem 15, we can set

pt=1/t+1t=0T11/t+1 and use that 1t=0T11/t+11T.\displaystyle p_{t}=\frac{1/\sqrt{t+1}}{\sum_{t=0}^{T-1}1/\sqrt{t+1}}\text{ and use that }\frac{1}{\sum_{t=0}^{T-1}1/\sqrt{t+1}}\leq\frac{1}{\sqrt{T}}.

To see the merit of this post-processing strategy, let’s compare it to a naive approach one might take to apply the result of Theorem 15. The standard trick is to observe

min0tT1f(xt)21Tt=0T11t+1f(xt)2.\displaystyle\min_{0\leq t\leq T-1}\|\nabla f(x_{t})\|^{2}\leq\frac{1}{\sqrt{T}}\sum_{t=0}^{T-1}\frac{1}{\sqrt{t+1}}\|\nabla f(x_{t})\|^{2}. (12)

So, we could keep track of f(xt)\|\nabla f(x_{t})\| at every iteration and record the iterate where it is lowest. However, this requires exact gradient information, which may be more costly than the stochastic gradient used in the algorithm. Ghadimi and Lan (2013) instead pick an index ss with probability proportional to 1/s+11/\sqrt{s+1} so that 𝔼[f(xs)2]\mathbb{E}\left[\|\nabla f(x_{s})\|^{2}\right] is proportional to the right-hand side of Eq. (12). They do this for Θ(log(1/δ))\Theta(\log(1/\delta)) independent runs and pick the best of the runs. Corollary 20, on the other hand, allows us to sample a set SS of niter=Θ(log(1/δ))n_{\text{iter}}=\Theta(\log(1/\delta)) indices from a single run and pick the best iterate from among these samples. Hence, we call δiter\delta_{\text{iter}} the iterate sampling failure probability.

But Corollary 20 is not the end of the story, since even computing argmintSf(xt)2\text{argmin}_{t\in S}\|\nabla f(x_{t})\|^{2} requires full gradient information. In the sample average approximation setting, this can be obtained by running on the full batch of data (rather than a mini-batch). However, if this is computationally infeasible or if we are in the stochastic approximation setting, then we instead have to use empirical gradients over a test or validation set. This is what we do for the full post-processing strategy presented in the following theorem.

Theorem 21

Let (Ω,,P)(\Omega,\mathcal{F},P) be a probability space. Let F:d×ΩF:\mathbb{R}^{d}\times\Omega\to\mathbb{R} and assume F(;ξ)F(\cdot;\xi) is differentiable for all ξΩ\xi\in\Omega. Let f=𝔼[F(;ξ)]f=\mathbb{E}\left[F(\cdot;\xi)\right]. Assume f=𝔼[F(;ξ)]\nabla f=\mathbb{E}\left[\nabla F(\cdot;\xi)\right] and 𝔼[f(x)F(x;ξ)2]σ2xd\mathbb{E}\left[\|\nabla f(x)-\nabla F(x;\xi)\|^{2}\right]\leq\sigma^{2}~{}\forall x\in\mathbb{R}^{d}. Let δiter,δemp(0,1)\delta_{\text{iter}},\delta_{\text{emp}}\in(0,1) and TT\in\mathbb{N}. Set niter=log(1/δiter)n_{\text{iter}}=\lceil\log(1/\delta_{\text{iter}})\rceil and nemp=6(niter+1)σ2/(eγδemp)n_{\text{emp}}=\lceil 6(n_{\text{iter}}+1)\sigma^{2}/(e\gamma\delta_{\text{emp}})\rceil. Apply the following procedure:

  1. 1.

    Sample nitern_{\text{iter}} indices with replacement from {0,,T1}\{0,\ldots,T-1\} with probabilities p0,,pT1p_{0},\ldots,p_{T-1} to form the set S={s1,,sniter}S=\{s_{1},\ldots,s_{n_{\text{iter}}}\}.

  2. 2.

    Sample ξ1,,ξnemp\xi_{1},\ldots,\xi_{n_{\text{emp}}} independently.

  3. 3.

    Run TT iterations of SGD with any step-size sequence (ηt)(\eta_{t}) to form (xt)(x_{t}).

  4. 4.

    Compute

    x=argmintS1nempj=1nempF(xt;ξj)2.\displaystyle x=\underset{t\in S}{\text{argmin}}\bigg{\|}\frac{1}{n_{\text{emp}}}\sum_{j=1}^{n_{\text{emp}}}\nabla F(x_{t};\xi_{j})\bigg{\|}^{2}.

Then,

P(f(x)2>5eγ)P(t=0T1ptf(xt)2>γ)+δiter+δempγ>0.\displaystyle P\left(\|\nabla f(x)\|^{2}>5e\gamma\right)\leq P\left(\sum_{t=0}^{T-1}p_{t}\|\nabla f(x_{t})\|^{2}>\gamma\right)+\delta_{\text{iter}}+\delta_{\text{emp}}~{}\forall\gamma>0.

Proof  Apply Theorem 2.4 of Ghadimi and Lan (2013) to get

P(f(x)2>5eγ)P(mintSf(xt)2>eγ)+6(niter+1)σ2eγnempγ>0.\displaystyle P\left(\|\nabla f(x)\|^{2}>5e\gamma\right)\leq P\left(\min_{t\in S}\|\nabla f(x_{t})\|^{2}>e\gamma\right)+\frac{6(n_{\text{iter}}+1)\sigma^{2}}{e\gamma n_{\text{emp}}}~{}\forall\gamma>0.

Using the definitions of nitern_{\text{iter}} and nempn_{\text{emp}}, and applying Corollary 20, proves the result.  

We call δemp\delta_{\text{emp}} the empirical gradient failure probability. Note that if f(x)G(x,ξ)\|\nabla f(x)-G(x,\xi)\| is KK-sub-Weibull(θ\theta), then ete_{t} is centered and et\|e_{t}\| is KK-sub-Weibull(θ\theta), in which case both Theorem 15 and Theorem 21 apply, with σ2=2Γ(2θ+1)K2\sigma^{2}=2\Gamma(2\theta+1)K^{2} (by Lemma 6) for the latter. Also, note that for both Corollary 20 and Theorem 21, while we have to specify TT in advance, we only have to take maxS\max S iterations to apply the bound from Theorem 15 to the post-processing output.
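As an illustration of the procedure in Theorem 21, here is a minimal sketch; the iterate list xs, the stochastic gradient function grad_F, the sampler draw_xi, and the counts n_iter and n_emp are placeholders supplied by the user (with n_iter and n_emp set as in the theorem), and step 3 (running SGD) is assumed to have produced xs already.

    import numpy as np

    def post_process(xs, grad_F, draw_xi, n_iter, n_emp, seed=0):
        """Post-processing sketch for Theorem 21.

        xs      : list of SGD iterates x_0, ..., x_{T-1} (step 3 already done)
        grad_F  : grad_F(x, xi) returns the stochastic gradient of F(.; xi) at x
        draw_xi : callable returning a fresh independent sample xi
        """
        rng = np.random.default_rng(seed)
        T = len(xs)
        # Step 1: sample n_iter indices with p_t proportional to 1/sqrt(t+1).
        p = 1.0 / np.sqrt(np.arange(1, T + 1))
        p /= p.sum()
        S = rng.choice(T, size=n_iter, replace=True, p=p)
        # Step 2: draw n_emp independent samples for the empirical gradients.
        xis = [draw_xi() for _ in range(n_emp)]
        # Step 4: return the sampled iterate with the smallest empirical gradient norm.
        def emp_grad_norm_sq(x):
            g = np.mean([grad_F(x, xi) for xi in xis], axis=0)
            return float(np.sum(g * g))
        best_index = min(S, key=lambda t: emp_grad_norm_sq(xs[t]))
        return xs[best_index]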

7 Neural Network Example

Consider the two layer neural network model

xϕ(xTW)a\displaystyle x\mapsto\phi(x^{T}W)a

where ϕ\phi is a differentiable activation function applied coordinate-wise, xdx\in\mathbb{R}^{d} is a data point (feature vector), Wd×mW\in\mathbb{R}^{d\times m} is the matrix of first-layer weights, and a{±1}ma\in\{\pm 1\}^{m} is the vector of second-layer weights. If we are given a data set Xd×nX\in\mathbb{R}^{d\times n} and labels yny\in\mathbb{R}^{n}, then we can train WW (leaving aa fixed for simplicity) using the squared loss:

f(W)=12ϕ(XTW)ay2.\displaystyle f(W)=\frac{1}{2}\|\phi(X^{T}W)a-y\|^{2}.

In this case,

vec(f(W))=(diag(a)ϕ(WTX)X)(ϕ(XTW)ay)\displaystyle\text{vec}(\nabla f(W))=\left(\text{diag}(a)\phi^{\prime}(W^{T}X)*X\right)\left(\phi(X^{T}W)a-y\right)

where * is the Khatri-Rao product (Oymak and Soltanolkotabi, 2020). For our example, we use the GELU activation, ϕ(x)=x[1+erf(x/2)]/2\phi(x)=x[1+\text{erf}(x/\sqrt{2})]/2 (Hendrycks and Gimpel, 2016), which satisfies |ϕ(x)|1.13|\phi^{\prime}(x)|\leq 1.13 and |ϕ′′(x)|.11|\phi^{\prime\prime}(x)|\leq.11 for all xx\in\mathbb{R}. Thus, we can apply Lemma 2. Since limxϕ′′(x)=0=limxϕ′′(x)\lim_{x\to\infty}\phi^{\prime\prime}(x)=0=\lim_{x\to-\infty}\phi^{\prime\prime}(x), we heuristically set b=0b=0 in the lemma and estimate the strong smoothness of ff as mX22\approx m\|X\|_{2}^{2}, setting our step-size accordingly.
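As an illustration of the loss and the gradient formula above (a sketch, not the code from the repository linked in the next paragraph): the GELU is x times the standard normal CDF, its derivative is the CDF plus x times the PDF, and the gradient computed below is the matrix form of the Khatri-Rao expression.

    import numpy as np
    from scipy.special import erf

    def gelu(z):
        return z * (1.0 + erf(z / np.sqrt(2.0))) / 2.0

    def gelu_prime(z):
        Phi = (1.0 + erf(z / np.sqrt(2.0))) / 2.0          # standard normal CDF
        phi = np.exp(-z**2 / 2.0) / np.sqrt(2.0 * np.pi)   # standard normal PDF
        return Phi + z * phi

    def loss_and_grad(W, X, a, y):
        """f(W) = 0.5*||gelu(X^T W) a - y||^2 and its d x m gradient.

        X is d x n, W is d x m, a has length m, y has length n.
        """
        Z = X.T @ W                     # n x m pre-activations
        r = gelu(Z) @ a - y             # n residuals
        f = 0.5 * float(r @ r)
        grad = X @ (gelu_prime(Z) * r[:, None] * a[None, :])   # matrix form of the Khatri-Rao formula
        return f, grad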

In order to demonstrate the effect of gradient noise on convergence error, we train WW on a fixed synthetic data set while injecting noise into the gradient. The labels come from a neural network model with width mm^{\prime} larger than the width of the training model. We also make sure the total number of trainable parameters is less than nn so that f>0f^{\star}>0. The noise we inject has uniformly random direction and Weibull norm with scale parameter K=1K=1 and shape parameter 1/θ1/\theta, with θ{2,3,4}\theta\in\{2,3,4\}. We keep the same initialization for all trials. We run 100100 iterations of SGD and then define the convergence error to be the best gradient norm squared divided by the initial gradient norm squared. We compute the empirical CDF of the convergence error over 10000 trials and then consider δ\delta in the ranges [0.1,0.2][0.1,0.2], [0.01,0.1][0.01,0.1], and [0.001,0.01][0.001,0.01]. We care about the dependence of the convergence error on δ\delta for small δ\delta, but for δ\delta too small, the empirical CDF is not a good approximation to the true CDF (see Fig. 1) due to the nature of order statistics. Our code can be found at https://github.com/liammadden/sgd.
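A sketch of the noise injection and the empirical quantile computation just described (the exact-gradient function grad, the starting point w0, the step size, and the trial counts are placeholders; the actual experiments use the code linked above):

    import numpy as np

    def run_trial(grad, w0, theta, n_iters, step, rng):
        """One trial: SGD with injected noise of uniformly random direction and
        Weibull(shape 1/theta, scale K=1) norm; returns the convergence error
        min_t ||grad f(w_t)||^2 / ||grad f(w_0)||^2."""
        w = np.array(w0, dtype=float)
        g0_sq = float(np.sum(grad(w) ** 2))
        best = np.inf
        for _ in range(n_iters):
            g = grad(w)
            best = min(best, float(np.sum(g ** 2)))
            direction = rng.standard_normal(g.shape)
            direction /= np.linalg.norm(direction)
            noise = rng.weibull(1.0 / theta) * direction     # norm is sub-Weibull(theta)
            w = w - step * (g - noise)                       # noisy gradient g_t
        return best / g0_sq

    # Empirical 1 - delta quantile of the convergence error over many trials:
    # errors = np.sort([run_trial(grad, w0, theta, 100, step, rng) for _ in range(10000)])
    # quantile = errors[int(np.ceil((1 - delta) * len(errors))) - 1]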

Figure 1: Empirical 1δ1-\delta convergence error, averaged over 10000 runs. The dashed lines show the mean ±\pm one standard deviation (computed over 5 blocks of 2000 runs each). The data are less reliable for small δ\delta.

From Theorem 15, for a particular range of δ\delta (and for fixed TT), the upper bound has dependence either log(1/δ)θ\log(1/\delta)^{\theta}, log(1/δ)\log(1/\delta), or log(1/δ)2θ\log(1/\delta)^{2\theta}. For sufficiently small δ\delta, the log(1/δ)2θ\log(1/\delta)^{2\theta} will dominate, but the upper limit of this range of δ\delta may be smaller than the lower limit of the range of δ\delta for which the empirical CDF is a good approximation to the true CDF. In other words, if we showed that the true CDF for this particular example has log(1/δ)2θ\log(1/\delta)^{2\theta} dependence for δ\delta sufficiently small, then we would have shown that the δ\delta-dependence of the upper bound in Theorem 15 is tight. However, with the empirical CDF, we are only able to show that the exponent increases as δ\delta shrinks. Figure 1 shows the dependence for the three different ranges. We assume the convergence error has dependence blog(1/δ)ab\log(1/\delta)^{a} and find the line of best fit as log(b)+aloglog(1/δ)\log(b)+a\log\log(1/\delta). The different values of aa are given in Table 1.
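The exponents in Table 1 come from an ordinary least-squares fit in these transformed coordinates; a sketch (deltas and errors are placeholders for the delta values in one range and the corresponding empirical 1-delta convergence errors):

    import numpy as np

    def fit_exponent(deltas, errors):
        """Fit errors ~ b * log(1/delta)^a, i.e. log(error) = log(b) + a*log(log(1/delta))."""
        u = np.log(np.log(1.0 / np.asarray(deltas)))
        v = np.log(np.asarray(errors))
        a, log_b = np.polyfit(u, v, 1)    # slope a is the reported exponent
        return a, np.exp(log_b)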

Our experiments suggest that injected Weibull noise results in convergence error with dependence Ω(log(1/δ)c(δ)θ+d(δ))\Omega(\log(1/\delta)^{c(\delta)\theta+d(\delta)}) where c(δ)c(\delta) increases as δ\delta decreases, thus roughly corroborating our upper bound. In particular, by computing a line of best fit for the aa values in each of the three δ\delta ranges, we can estimate that c(δ)c(\delta) increases from 0.43, to 0.77, to 1.34 and d(δ)d(\delta) decreases from -0.28, to -0.7, to -1.42, suggesting that we are in the log(1/δ)θ1\log(1/\delta)^{\theta-1} regime.

aa δ[0.1,0.2]\delta\in[0.1,0.2] δ[0.01,0.1]\delta\in[0.01,0.1] δ[0.001,0.01]\delta\in[0.001,0.01]
θ=2\theta=2 0.64 0.85 1.40
θ=3\theta=3 0.93 1.56 2.33
θ=4\theta=4 1.51 2.38 4.08
Table 1: Empirically estimated exponents of log(1/δ)\log(1/\delta)
Figure 2: Same data as Fig. 1 but each line series is normalized, and the x-axis is log(1/δ)log(1/\delta) and plotted on a logarithmic scale, so log(1/δ)alog(1/\delta)^{a} dependence shows up as a straight line with slope aa. The δ\delta range is from 0.20.2 (left side) to 0.010.01 (right side), since any smaller δ\delta has unreliable statistics. Lines of best fit using the exponents from Table 1 are shown (with arbitrary shifts for clarity).

8 Conclusions

This paper analyzed the convergence of SGD for objective functions that satisfy the PŁ condition and for generic non-convex objectives. Under a sub-Gaussian assumption on the gradient error, we showed a high probability convergence rate matching the mean convergence rate for PŁ objectives. Under a sub-Weibull assumption on the noise, we showed a high probability convergence rate matching the mean convergence rate for non-convex objectives. We also provided and analyzed a post-processing method for choosing a single iterate. To prove the convergence rate, we first proved a Freedman-type inequality for martingale difference sequences that extends previous Freedman-type inequalities beyond the sub-exponential threshold to allow for sub-Weibull tail-decay. Finally, we considered a synthetic neural net problem and showed that the heaviness of the tail of the gradient noise has a direct effect on the heaviness of the tail of the convergence error.


Acknowledgments

All three authors gratefully acknowledge support from the National Science Foundation (NSF), Division of Mathematical Sciences, through the award # 1923298.


Appendix A Proof of Lemma 1

Here we prove Lemma 1.

Proof  First,

f(W)=i=1n12[ϕ(xiW)vyi]2\displaystyle f(W)=\sum_{i=1}^{n}\frac{1}{2}\left[\phi(x_{i}^{\top}W)v-y_{i}\right]^{2}

and

ws,t[ϕ(xiW)vyi]\displaystyle\frac{\partial}{\partial w_{s,t}}\left[\phi(x_{i}^{\top}W)v-y_{i}\right] =ws,tj=1mvjϕ(xiwj)=vtϕ(xiTwt)xs,i.\displaystyle=\frac{\partial}{\partial w_{s,t}}\sum_{j=1}^{m}v_{j}\phi(x_{i}^{\top}w_{j})=v_{t}\phi^{\prime}(x_{i}^{T}w_{t})x_{s,i}.

So,

ws,tf(W)=i=1n[ϕ(xiW)vyi]vtϕ(xiTwt)xs,i.\displaystyle\frac{\partial}{\partial w_{s,t}}f(W)=\sum_{i=1}^{n}\left[\phi(x_{i}^{\top}W)v-y_{i}\right]v_{t}\phi^{\prime}(x_{i}^{T}w_{t})x_{s,i}.

Thus,

2w,rws,tf(W)\displaystyle\frac{\partial^{2}}{\partial w_{\ell,r}\partial w_{s,t}}f(W) =i=1nvrϕ(xiwr)x,ivtϕ(xiwt)xs,i\displaystyle=\sum_{i=1}^{n}v_{r}\phi^{\prime}(x_{i}^{\top}w_{r})x_{\ell,i}v_{t}\phi^{\prime}(x_{i}^{\top}w_{t})x_{s,i}
+δr,ti=1n[ϕ(xiW)vyi]vtϕ′′(xiwt)x,ixs,i\displaystyle\quad+\delta_{r,t}\sum_{i=1}^{n}\left[\phi(x_{i}^{\top}W)v-y_{i}\right]v_{t}\phi^{\prime\prime}(x_{i}^{\top}w_{t})x_{\ell,i}x_{s,i}

where δ\delta is the Kronecker delta. Let bb be the largest absolute element of XX. Then, viewing WW as a vector,

f(W)2mdf(W)mdnabv(amv2+y)\displaystyle\|\nabla f(W)\|_{2}\leq\sqrt{md}\|\nabla f(W)\|_{\infty}\leq\sqrt{md}nab\|v\|_{\infty}(a\sqrt{m}\|v\|_{2}+\|y\|_{\infty})

and

2f(W)2md2f(W)maxmdn(a2b2v2+ab2v(amv2+y)),\displaystyle\|\nabla^{2}f(W)\|_{2}\leq md\|\nabla^{2}f(W)\|_{\max}\leq mdn\left(a^{2}b^{2}\|v\|_{\infty}^{2}+ab^{2}\|v\|_{\infty}(a\sqrt{m}\|v\|_{2}+\|y\|_{\infty})\right),

proving the result.  

And here we prove Lemma 2.

Proof  Define F:wϕ(Xvec1(w))vF:w\mapsto\phi(X^{\top}\text{vec}^{-1}(w))v and :wF(w)y22/2\mathcal{L}:w\mapsto\|F(w)-y\|_{2}^{2}/2. Then DF(w)2mavX2\|DF(w)\|_{2}\leq\sqrt{m}a\|v\|_{\infty}\|X\|_{2} and DF(w)DF(u)2bvX2X1,2wu2\|DF(w)-DF(u)\|_{2}\leq b\|v\|_{\infty}\|X\|_{2}\|X\|_{1,2}\|w-u\|_{2} for all w,umdw,u\in\mathbb{R}^{md} by Lemmas 3 and 5 of Oymak and Soltanolkotabi (2020). Let ρ=mavX2\rho=\sqrt{m}a\|v\|_{\infty}\|X\|_{2} and L=bvX2X1,2L=b\|v\|_{\infty}\|X\|_{2}\|X\|_{1,2}. Observe

(w)2\displaystyle\|\nabla\mathcal{L}(w)\|_{2} =DF(w)(F(w)y)2DF(w)2F(w)y2ρ2α\displaystyle=\|DF(w)^{\top}(F(w)-y)\|_{2}\leq\|DF(w)\|_{2}\|F(w)-y\|_{2}\leq\rho\sqrt{2\alpha}

and

(w)(v)2\displaystyle\|\nabla\mathcal{L}(w)-\nabla\mathcal{L}(v)\|_{2} DF(w)(F(w)F(v))2+(DF(w)DF(v))(F(v)y)2\displaystyle\leq\|DF(w)^{\top}(F(w)-F(v))\|_{2}+\|(DF(w)-DF(v))^{\top}(F(v)-y)\|_{2}
DF(w)2F(w)F(v)2+DF(w)DF(v)2F(v)y2\displaystyle\leq\|DF(w)\|_{2}\|F(w)-F(v)\|_{2}+\|DF(w)-DF(v)\|_{2}\|F(v)-y\|_{2}
ρ2wv2+Lwv22α=(ρ2+L2α)wv2,\displaystyle\leq\rho^{2}\|w-v\|_{2}+L\|w-v\|_{2}\sqrt{2\alpha}=(\rho^{2}+L\sqrt{2\alpha})\|w-v\|_{2},

proving the result.  

Appendix B Remaining work for proof of Theorem 11

First we need the following two lemmas. Lemma 23 extends Proposition 2.7.1 of Vershynin (2018) to interpolate between the sub-Gaussian and sub-exponential regimes.

Lemma 22

(Vershynin, 2018, Prop. 2.5.2(e)) If XX is centered and KK-sub-Gaussian, then 𝔼[exp(λX)]exp(λ2K2)λ\mathbb{E}\left[\exp(\lambda X)\right]\leq\exp(\lambda^{2}K^{2})~{}\forall\lambda\in\mathbb{R}.

Lemma 23

If XX is centered and KK-sub-Weibull(θ\theta) with θ(1/2,1]\theta\in(1/2,1], then
𝔼[exp(λX)]exp(λ22(4θ)2θe2K2)\mathbb{E}\left[\exp(\lambda X)\right]\leq\exp\left(\frac{\lambda^{2}}{2}(4\theta)^{2\theta}e^{2}K^{2}\right) for all λ[0,1(4θ)θeK]\lambda\in\left[0,\frac{1}{(4\theta)^{\theta}eK}\right].

Proof  First, using Lemma 6 and Γ(x+1)xxx1\Gamma(x+1)\leq x^{x}~{}\forall x\geq 1, we can get Xp(2θ)θKpθ\|X\|_{p}\leq(2\theta)^{\theta}Kp^{\theta} for all p1/θp\geq 1/\theta, and so, in particular, for all p2p\geq 2. Thus, if λ[0,1(4θ)θeK]\lambda\in\left[0,\frac{1}{(4\theta)^{\theta}eK}\right], then

𝔼[exp(λX)]\displaystyle\mathbb{E}\left[\exp(\lambda X)\right] =𝔼[1+λX+p=2(λX)pp!]\displaystyle=\mathbb{E}\left[1+\lambda X+\sum_{p=2}^{\infty}\frac{(\lambda X)^{p}}{p!}\right]
1+p=2λpXppp!\displaystyle\leq 1+\sum_{p=2}^{\infty}\frac{\lambda^{p}\|X\|_{p}^{p}}{p!}
1+p=2λp(2θ)θpKppθpp!\displaystyle\leq 1+\sum_{p=2}^{\infty}\frac{\lambda^{p}(2\theta)^{\theta p}K^{p}p^{\theta p}}{p!}
1+p=2(λ(2θ)θeKp1θ)p\displaystyle\leq 1+\sum_{p=2}^{\infty}\left(\frac{\lambda(2\theta)^{\theta}eK}{p^{1-\theta}}\right)^{p}
1+p=2(λ(4θ)θ(e/2)K)p\displaystyle\leq 1+\sum_{p=2}^{\infty}\left(\lambda(4\theta)^{\theta}(e/2)K\right)^{p}
1+(λ(4θ)θ(e/2)K)21λ(4θ)θ(e/2)K\displaystyle\leq 1+\frac{\left(\lambda(4\theta)^{\theta}(e/2)K\right)^{2}}{1-\lambda(4\theta)^{\theta}(e/2)K}
1+2(λ(4θ)θ(e/2)K)2\displaystyle\leq 1+2\left(\lambda(4\theta)^{\theta}(e/2)K\right)^{2}
exp(λ22(4θ)2θe2K2),\displaystyle\leq\exp\left(\frac{\lambda^{2}}{2}(4\theta)^{2\theta}e^{2}K^{2}\right),

completing the proof.  
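As a quick numerical sanity check of Lemma 23 (an illustration, not part of the proof): if U is uniform on (0,1) and S is a random sign, then X = S(-\ln U)^{\theta} is centered and K-sub-Weibull(\theta) with K = 2^{\theta}, so its empirical MGF should sit below the bound of the lemma on the stated range of \lambda.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, N = 0.75, 10**6                      # theta in (1/2, 1]
    K = 2.0**theta                              # sub-Weibull(theta) parameter of X below
    V = rng.weibull(1.0 / theta, size=N)        # |X| = (-ln U)^theta
    X = rng.choice([-1.0, 1.0], size=N) * V     # random sign makes X centered

    lam_max = 1.0 / ((4 * theta)**theta * np.e * K)
    for lam in np.linspace(0.0, lam_max, 6):
        empirical_mgf = float(np.mean(np.exp(lam * X)))
        bound = float(np.exp(lam**2 / 2 * (4 * theta)**(2 * theta) * np.e**2 * K**2))
        assert empirical_mgf <= bound + 1e-3    # Lemma 23's MGF bound, checked by Monte Carlo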

Then, the next three lemmas allow us to include previous results as special cases of the theorem.

Lemma 24

(Fan et al., 2015, Thm. 2.6) Let (Ω,,(i),P)(\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (ξi)(\xi_{i}) and (Ki)(K_{i}) be adapted to (i)(\mathcal{F}_{i}). Let nn\in\mathbb{N}. For all i[n]i\in[n], assume 0Ki1mi0\leq K_{i-1}\leq m_{i} almost surely, 𝔼[ξii1]=0\mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0, and

𝔼[exp(λξi)i1]exp(λ22aKi12)λ[0,1bKi1].\displaystyle\mathbb{E}\left[\exp(\lambda\xi_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}aK_{i-1}^{2}\right)~{}\forall\lambda\in\left[0,\frac{1}{bK_{i-1}}\right].

Then, for all x,β0x,\beta\geq 0, and λ[0,1bmaxi[n]mi]\lambda\in\left[0,\frac{1}{b\max_{i\in[n]}m_{i}}\right],

P(k[n]{i=1kξix and i=1kaKi12β})\displaystyle P\left(\bigcup_{k\in[n]}\bigg{\{}\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\beta\bigg{\}}\right) exp(λx+λ22β).\displaystyle\leq\exp\left(-\lambda x+\frac{\lambda^{2}}{2}\beta\right).

Proof  Define

ψi=exp(λξi)\displaystyle\psi_{i}=\exp\left(\lambda\xi_{i}\right)

and

Ak={i=1kξix and i=1kaKi12β}.\displaystyle A_{k}=\bigg{\{}\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\beta\bigg{\}}.

Then ωAk\omega\in A_{k} implies

i=1k𝔼[ψii1]ψi\displaystyle\prod_{i=1}^{k}\frac{\mathbb{E}\left[\psi_{i}\mid\mathcal{F}_{i-1}\right]}{\psi_{i}} exp(λi=1kξi+λ22i=1kaKi12)\displaystyle\leq\exp\left(-\lambda\sum_{i=1}^{k}\xi_{i}+\frac{\lambda^{2}}{2}\sum_{i=1}^{k}aK_{i-1}^{2}\right)
exp(λx+λ22β).\displaystyle\leq\exp\left(-\lambda x+\frac{\lambda^{2}}{2}\beta\right).

 

Lemma 25

(Harvey et al., 2019, Thm. 3.3) Let (Ω,,(i),P)(\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (ξi)(\xi_{i}) and (Ki)(K_{i}) be adapted to (i)(\mathcal{F}_{i}). Let nn\in\mathbb{N}. For all i[n]i\in[n], assume Ki10K_{i-1}\geq 0 almost surely, 𝔼[ξii1]=0\mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0, and

𝔼[exp(λξi)i1]exp(λ22aKi12)λ0.\displaystyle\mathbb{E}\left[\exp(\lambda\xi_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}aK_{i-1}^{2}\right)~{}\forall\lambda\geq 0.

Then, for all x,β,α0x,\beta,\alpha\geq 0, and λ[0,12α]\lambda\in\left[0,\frac{1}{2\alpha}\right],

P(k[n]{i=1kξix and i=1kaKi12αi=1kξi+β})\displaystyle P\left(\bigcup_{k\in[n]}\bigg{\{}\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\alpha\sum_{i=1}^{k}\xi_{i}+\beta\bigg{\}}\right) exp(λx+2λ2β).\displaystyle\leq\exp(-\lambda x+2\lambda^{2}\beta).
Lemma 26

(Freedman, 1975) Let (Ω,,(i),P)(\Omega,\mathcal{F},(\mathcal{F}_{i}),P) be a filtered probability space. Let (ξi)(\xi_{i}) and (Ki)(K_{i}) be adapted to (i)(\mathcal{F}_{i}). Let nn\in\mathbb{N}. For all i[n]i\in[n], assume Ki10K_{i-1}\geq 0 almost surely, 𝔼[ξii1]=0\mathbb{E}\left[\xi_{i}\mid\mathcal{F}_{i-1}\right]=0, and

𝔼[exp(λξi)i1]exp(λ22aKi12)λ0.\displaystyle\mathbb{E}\left[\exp(\lambda\xi_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}aK_{i-1}^{2}\right)~{}\forall\lambda\geq 0.

Then, for all x,β0x,\beta\geq 0, and λ0\lambda\geq 0,

P(k[n]{i=1kξix and i=1kaKi12β})\displaystyle P\left(\bigcup_{k\in[n]}\bigg{\{}\sum_{i=1}^{k}\xi_{i}\geq x\text{ and }\sum_{i=1}^{k}aK_{i-1}^{2}\leq\beta\bigg{\}}\right) exp(λx+λ22β).\displaystyle\leq\exp\left(-\lambda x+\frac{\lambda^{2}}{2}\beta\right).

If the ξi\xi_{i}’s are sub-Gaussian, that is, if θ=1/2\theta=1/2, then, from Lemma 22,

𝔼[exp(λξi)i1]exp(λ222Ki12)λ,\displaystyle\mathbb{E}\left[\exp(\lambda\xi_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}2K_{i-1}^{2}\right)~{}\forall\lambda\in\mathbb{R},

so we can apply Lemma 25 if α>0\alpha>0 or Lemma 26 if α=0\alpha=0.

If the ξi\xi_{i}’s are at most sub-exponential, that is, if 1/2<θ11/2<\theta\leq 1, then, from Lemma 23,

𝔼[exp(λξi)i1]exp(λ22(4θ)2θe2Ki12)λ[0,1(4θ)θeKi1],\displaystyle\mathbb{E}\left[\exp(\lambda\xi_{i})\mid\mathcal{F}_{i-1}\right]\leq\exp\left(\frac{\lambda^{2}}{2}(4\theta)^{2\theta}e^{2}K_{i-1}^{2}\right)~{}\forall\lambda\in\left[0,\frac{1}{(4\theta)^{\theta}eK_{i-1}}\right],

so we can apply Lemma 14 if α>0\alpha>0 or Lemma 24 if α=0\alpha=0.

Appendix C PŁ Projected SGD

When optimizing over a constraint set, XdX\subsetneq\mathbb{R}^{d}, if ff is strongly convex, then so is ff plus the indicator function of XX, and results for gradient descent methods easily extend to projected gradient descent. On the other hand, if ff is PŁ, then ff plus the indicator function need not be KL (Kurdyka-Łojasiewicz). This has real impacts on gradient descent algorithms, where gradient descent might converge while projected gradient descent does not. For example, there is a smooth function, a mollified version of f(x,y)=(a(x)+2b(|y|c)+)+f(x,y)=\left(a(x)_{+}^{2}-b\left(|y|-c\right)_{+}\right)_{+}, such that the PŁ inequality is satisfied but projected gradient descent does not converge to a minimizer; we formalize this in the remark below.

Remark 27

Consider f(x,y)=(a(x)+2b(|y|c)+)+f(x,y)=\left(a(x)_{+}^{2}-b\left(|y|-c\right)_{+}\right)_{+} where ()+(\cdot)_{+} denotes max(,0)\max(\cdot,0) and a,b>0,c0a,b>0,c\geq 0. The minimum of ff is 0 and X={(x,y)|x0 or |y|abx2+c}X^{\star}=\{(x,y)~{}|~{}x\leq 0\text{ or }|y|\geq\frac{a}{b}x^{2}+c\}. If we use φ(x)=𝒳B1(0)exp(1/(1x2))/Φ\varphi(x)=\mathcal{X}_{B_{1}(0)}\cdot\exp\left(-1/(1-\|x\|^{2})\right)/\Phi—where 𝒳\mathcal{X} denotes the indicator function, B1(0)B_{1}(0) denotes the ball of radius 1 centered at 0, and Φ\Phi is the normalization constant—to mollify ff, then, for ϵ<c\epsilon<c, fϵf_{\epsilon} has PŁ constant 2a2a and smoothness constant 2a2a. Consider the starting point (d,0)(d,0). For a,b,c,da,b,c,d chosen appropriately, the distance from (d,0)(d,0) to its projection onto XX^{\star} is strictly less than dd. Thus, if we let XX be the ball centered at (d,0)(d,0) with radius equal to exactly that distance, then the constrained problem and the unconstrained problem have the same minimum. However, projected gradient flow, starting from (d,0)(d,0), ends up stuck at a non-minimizer: the point of XX closest to (0,0)(0,0). See Figure 3 for the contour plot when a=1/10a=1/10, b=1=cb=1=c, and d=10d=10.

Figure 3: Contour plot of the PŁ function counter-example to projected gradient flow
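A sketch of the phenomenon in Remark 27, using the unmollified f, the values a=1/10, b=c=1, d=10 of Figure 3, and projection onto the Euclidean ball whose radius is the distance from (d,0) to the minimizer set (computed here by a grid search): projected gradient descent from (d,0) stalls at the boundary point closest to the origin, where f is still positive even though the minimum of f over the ball is 0.

    import numpy as np

    a, b, c, d = 0.1, 1.0, 1.0, 10.0             # the values used for Figure 3

    def f(z):
        x, y = z
        return max(a * max(x, 0.0)**2 - b * max(abs(y) - c, 0.0), 0.0)

    def grad_f(z):
        # gradient away from the kinks; enough here since the trajectory stays on y = 0, x > 0
        x, y = z
        if f(z) == 0.0:
            return np.zeros(2)
        gy = -b * np.sign(y) if abs(y) > c else 0.0
        return np.array([2 * a * max(x, 0.0), gy])

    # Distance from (d, 0) to X* = {x <= 0} union {|y| >= (a/b) x^2 + c}, via a grid search.
    xs = np.linspace(0.0, d, 100001)
    radius = min(d, float(np.min(np.hypot(xs - d, (a / b) * xs**2 + c))))

    center = np.array([d, 0.0])
    def proj(z):
        v = z - center
        n = np.linalg.norm(v)
        return center + v * (radius / n) if n > radius else z

    z = center.copy()
    for _ in range(2000):
        z = proj(z - 0.5 * grad_f(z))            # projected gradient descent
    print(z, f(z))                               # stalls near (d - radius, 0) with f(z) > 0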

In order to extend gradient methods to projected gradient methods under a PŁ-like assumption, the proper generalization is to require that the function satisfy a proximal PŁ inequality (Karimi et al., 2016). However, such an assumption is quite restrictive compared to the PŁ inequality. We leave the problem of proving convergence under just the PŁ inequality, via added noise or a Frank-Wolfe type construction, as a future research direction.

Appendix D Non-convex Projected SGD

Consider xt+1=proj(xtηtgt)=proj(xtηt(f(xt)et))x_{t+1}=\text{proj}(x_{t}-\eta_{t}g_{t})=\text{proj}(x_{t}-\eta_{t}(\nabla f(x_{t})-e_{t})). Define

Gt=xtproj(xtηtf(xt))ηt\displaystyle G_{t}=\frac{x_{t}-\text{proj}(x_{t}-\eta_{t}\nabla f(x_{t}))}{\eta_{t}}
Et=xt+1proj(xtηtf(xt))ηt.\displaystyle E_{t}=\frac{x_{t+1}-\text{proj}(x_{t}-\eta_{t}\nabla f(x_{t}))}{\eta_{t}}.

Note that if proj=I\text{proj}=I, then Gt=f(xt)G_{t}=\nabla f(x_{t}) and Et=etE_{t}=e_{t}. Moreover, xt=proj(xt)x_{t}=\text{proj}(x_{t}) and xt+1=proj(xt+1)x_{t+1}=\text{proj}(x_{t+1}) so, by the non-expansiveness of proj, Gtf(xt)\|G_{t}\|\leq\|\nabla f(x_{t})\| and Etet\|E_{t}\|\leq\|e_{t}\|. We can get even tighter bounds using the second prox theorem (Beck, 2017, Thm. 6.39): Gt2f(xt),Gt\|G_{t}\|^{2}\leq\langle\nabla f(x_{t}),G_{t}\rangle and Gt,Etf(xt),Et\langle G_{t},E_{t}\rangle\leq\langle\nabla f(x_{t}),E_{t}\rangle.
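A small numeric illustration of these definitions and the non-expansiveness bounds, using projection onto a Euclidean ball and arbitrary placeholder vectors in place of the true gradient and noise:

    import numpy as np

    rng = np.random.default_rng(1)

    def proj(z, center=np.zeros(3), radius=1.0):     # placeholder constraint set: unit ball
        v = z - center
        n = np.linalg.norm(v)
        return center + v * (radius / n) if n > radius else z

    x_t = proj(rng.standard_normal(3))               # feasible iterate
    eta = 0.1
    grad = rng.standard_normal(3)                    # stand-in for grad f(x_t)
    e = rng.standard_normal(3)                       # stand-in for e_t
    x_next = proj(x_t - eta * (grad - e))

    G = (x_t - proj(x_t - eta * grad)) / eta
    E = (x_next - proj(x_t - eta * grad)) / eta
    assert np.linalg.norm(G) <= np.linalg.norm(grad) + 1e-12
    assert np.linalg.norm(E) <= np.linalg.norm(e) + 1e-12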

It is easy to come up with an example where f(xt)\|\nabla f(x_{t})\| does not go to zero, so we would like to bound Gt\|G_{t}\| instead. We start as usual:

f(xt+1)f(xt)+f(xt),xt+1xt+L2xt+1xt2.\displaystyle f(x_{t+1})\leq f(x_{t})+\langle\nabla f(x_{t}),x_{t+1}-x_{t}\rangle+\frac{L}{2}\|x_{t+1}-x_{t}\|^{2}.

Focusing on the norm term,

L2xt+1xt2\displaystyle\frac{L}{2}\|x_{t+1}-x_{t}\|^{2} =Lηt22Gt2Lηt2Gt,Et+Lηt22Et2.\displaystyle=\frac{L\eta_{t}^{2}}{2}\|G_{t}\|^{2}-L\eta_{t}^{2}\langle G_{t},E_{t}\rangle+\frac{L\eta_{t}^{2}}{2}\|E_{t}\|^{2}.

Focusing on the inner product term,

f(xt),xt+1xt\displaystyle\langle\nabla f(x_{t}),x_{t+1}-x_{t}\rangle =ηtf(xt),EtGt\displaystyle=\eta_{t}\langle\nabla f(x_{t}),E_{t}-G_{t}\rangle
=ηtf(xt),Etηtf(xt),Gt\displaystyle=\eta_{t}\langle\nabla f(x_{t}),E_{t}\rangle-\eta_{t}\langle\nabla f(x_{t}),G_{t}\rangle
ηtGt,EtηtGt2.\displaystyle\nleq\eta_{t}\langle G_{t},E_{t}\rangle-\eta_{t}\|G_{t}\|^{2}.

Unfortunately, we cannot proceed any further. Ghadimi et al. (2016) are able to get around this, but at the cost of a term t=0T1ηtet2=O(T)\sum_{t=0}^{T-1}\eta_{t}\|e_{t}\|^{2}=O(\sqrt{T}) instead of t=0T1ηt2et2=O(log(T))\sum_{t=0}^{T-1}\eta_{t}^{2}\|e_{t}\|^{2}=O(\log(T)). To mitigate this, they require an increasing batch size. Reddi et al. (2016) were able to remove this requirement, but only for non-convex projected SVRG, not non-convex projected SGD. Thus, we leave the analysis of non-convex projected SGD as an open problem.


References

  • Allen-Zhu et al. (2019a) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In Neural Information Processing Systems (NeurIPS), volume 32, 2019a.
  • Allen-Zhu et al. (2019b) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), volume 97, pages 242–252, 2019b.
  • Arjevani et al. (2019) Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
  • Bakhshizadeh et al. (2023) Milad Bakhshizadeh, Arian Maleki, and Victor H de la Pena. Sharp concentration results for heavy-tailed distributions. Information and Inference: A Journal of the IMA, 12(3):1655–1685, 2023.
  • Beck (2017) A. Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization, 2017.
  • Bertsekas and Tsitsiklis (2000) D. Bertsekas and J. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
  • Billingsley (1995) Patrick Billingsley. Probability and Measure. John Wiley & Sons, 3 edition, 1995.
  • Bottou and Bousquet (2008) Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Neural Information Processing Systems (NeurIPS), volume 20, 2008.
  • Charles and Papailiopoulos (2018) Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning (ICML), volume 80, pages 745–754, 2018.
  • Şimşekli et al. (2019) Umut Şimşekli, Levent Sagun, and Mert Gürbüzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning (ICML), volume 97, pages 5827–5837, 2019.
  • Cutkosky and Mehta (2021) Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. In Neural Information Processing Systems (NeurIPS), volume 34, pages 4883–4895, 2021.
  • Du et al. (2018) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations (ICLR), 2018.
  • Du et al. (2019) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), volume 97, pages 1675–1685, 2019.
  • Fan et al. (2015) Xiequan Fan, Ion Grama, Quansheng Liu, et al. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20, 2015.
  • Freedman (1975) David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
  • Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Ghadimi et al. (2016) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  • Gürbüzbalaban et al. (2021) Mert Gürbüzbalaban, Umut Şimşekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. In International Conference on Machine Learning (ICML), volume 139, pages 3964–3975, 2021.
  • Harvey et al. (2019) Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory (COLT), volume 99, pages 1579–1613, 2019.
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
  • Karimi et al. (2016) Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
  • Khaled and Richtárik (2020) Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329, 2020.
  • Li and Liu (2022) Shaojie Li and Yong Liu. High probability guarantees for nonconvex stochastic gradient descent with heavy tails. In International Conference on Machine Learning (ICML), volume 162, pages 12931–12963, 2022.
  • Li and Orabona (2020) Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive SGD with momentum. arXiv preprint arXiv:2007.14294, 2020.
  • Li and Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Neural Information Processing Systems (NeurIPS), volume 30, 2017.
  • Liu et al. (2022) Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 2022.
  • Madden and Thrampoulidis (2024) Liam Madden and Christos Thrampoulidis. Memory capacity of two layer neural networks with smooth activation. SIAM Journal on Mathematics of Data Science, 6, 2024.
  • Nemirovski et al. (2009) Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Nemirovsky and Yudin (1983) Arkadi Nemirovsky and David Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
  • Nesterov (2018) Y. Nesterov. Lectures on Convex Optimization, volume 137 of Springer Optimization and Its Applications. Springer, Switzerland, 2 edition, 2018.
  • Orvieto and Lucchi (2019) Antonio Orvieto and Aurelien Lucchi. Continuous-time models for stochastic optimization algorithms. In Neural Information Processing Systems (NeurIPS), pages 12589–12601, 2019.
  • Oymak and Soltanolkotabi (2020) Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
  • Panigrahi et al. (2019) Abhishek Panigrahi, Raghav Somani, Navin Goyal, and Praneeth Netrapalli. Non-gaussianity of stochastic gradient noise. In Science meets Engineering of Deep Learning (SEDL) workshop, 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Patel (2021) Vivak Patel. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions. Mathematical Programming, 164, 2021.
  • Patel et al. (2022) Vivak Patel, Shushu Zhang, and Bowen Tian. Global convergence and stability of stochastic gradient descent. In Neural Information Processing Systems (NeurIPS), volume 35, pages 36014–36025, 2022.
  • Reddi et al. (2016) Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Neural Information Processing Systems (NeurIPS), pages 1153–1161, 2016.
  • Scaman and Malherbe (2020) Kevin Scaman and Cedric Malherbe. Robustness analysis of non-convex stochastic gradient descent using biased expectations. In Neural Information Processing Systems (NeurIPS), volume 33, pages 16377–16387, 2020.
  • Sebbouh et al. (2021) Othmane Sebbouh, Robert M Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory (COLT), pages 3935–3971, 2021.
  • Sun and Luo (2016) Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.
  • Vershynin (2018) Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  • Vladimirova et al. (2020) Mariia Vladimirova, Stéphane Girard, Hien Nguyen, and Julyan Arbel. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier tailed distributions. Stat, 9(1):e318, 2020.
  • Wong et al. (2020) Kam Chung Wong, Zifan Li, and Ambuj Tewari. Lasso guarantees for β\beta-mixing heavy-tailed time series. The Annals of Statistics, 48(2):1124–1142, 2020.
  • Zhang et al. (2020) Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations (ICLR), 2020.