High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise
Abstract
Stochastic gradient descent is one of the most common iterative algorithms used in machine learning and its convergence analysis is a rich area of research. Understanding its convergence properties can help inform what modifications of it to use in different settings. However, most theoretical results either assume convexity or only provide convergence results in mean. This paper, on the other hand, proves convergence bounds in high probability without assuming convexity. Assuming strong smoothness, we prove high probability convergence bounds in two settings: (1) assuming the Polyak-Łojasiewicz inequality and norm sub-Gaussian gradient noise and (2) assuming norm sub-Weibull gradient noise. In the second setting, as an intermediate step to proving convergence, we prove a sub-Weibull martingale difference sequence self-normalized concentration inequality of independent interest. It extends Freedman-type concentration beyond the sub-exponential threshold to heavier-tailed martingale difference sequences. We also provide a post-processing method that picks a single iterate with a provable convergence guarantee as opposed to the usual bound for the unknown best iterate. Our convergence result for sub-Weibull noise extends the regime where stochastic gradient descent has equal or better convergence guarantees than stochastic gradient descent with modifications such as clipping, momentum, and normalization.
Keywords: stochastic gradient descent, convergence bounds, sub-Weibull distributions, Polyak-Łojasiewicz inequality, Freedman inequality
1 Introduction
Stochastic gradient descent (SGD) and its variants are some of the most commonly used algorithms in machine learning. In particular, they are used for training neural network and transformer models, models that have achieved considerable success on image classification and language processing tasks in recent years. The training in this case is non-convex and smooth for many activation/attention functions, such as sigmoid, GELU, and softmax. Even for ReLU, which is not differentiable, the training is smooth over most of the parameter space and avoidance of the non-smooth part is dealt with separately. Thus, the smooth non-convex convergence analysis of SGD has far-reaching influence on the field of machine learning.
There is a large literature on the almost sure and mean convergence of SGD assuming strong smoothness. Bertsekas and Tsitsiklis (2000) prove that SGD almost surely converges to a first-order stationary point assuming strong smoothness and the relaxed growth noise condition; more recently, Patel (2021) showed the same result with a different proof technique. Ghadimi and Lan (2013) prove that the mean of the squared gradient norm of SGD converges to zero at a rate of $O(1/\sqrt{T})$ assuming strong smoothness and the bounded variance noise condition, where $T$ is the number of SGD iterations. Khaled and Richtárik (2020) prove the same but with the expected smoothness noise condition. Sebbouh et al. (2021) strengthen this to an almost sure convergence rate. Assuming strong smoothness and convexity, the tight mean convergence rate of the squared gradient norm of SGD is $1/\sqrt{T}$ (Nemirovsky and Yudin, 1983, Thm. 5.3.1). In fact, this is the optimal rate for all stochastic first-order methods assuming only strong smoothness (Arjevani et al., 2019). Thus, the mean convergence rate of Ghadimi and Lan (2013) is tight and shows that SGD is optimal in the smooth non-convex setting.
However, mean convergence is not the end of the story. A convergence guarantee is generally required to hold with some arbitrary probability $1 - \delta$. For a single run of SGD, using Markov's inequality introduces a $1/\delta$ scaling in the bound. If one can re-run SGD many times (say, $m$ times), then, by choosing the best run, $\delta$ can be relatively large since the overall success probability is at least $1 - \delta^m$. On the other hand, if just a single run of SGD is allowed, then the $1/\delta$ factor leads to nearly vacuous results, and hence a direct high probability convergence analysis is necessary to understand the behavior of individual runs of SGD.
Li and Orabona (2020) prove a high probability convergence rate for a weighted average of the squared gradient norms of SGD assuming strong smoothness and norm sub-Gaussian noise. However, a series of recent papers (Gürbüzbalaban et al., 2021; Şimşekli et al., 2019; Panigrahi et al., 2019) suggest that norm sub-Gaussian noise is often not satisfied. While the central limit theorem can be used to heuristically justify norm sub-Gaussian noise for mini-batch SGD with large batch-sizes, it cannot for small batch-sizes. In particular, Panigrahi et al. (2019) provide examples where the noise is Gaussian for a batch-size of 4096, is not Gaussian for a batch-size of 32, and starts out Gaussian then becomes non-Gaussian for a batch-size of 256. Gürbüzbalaban et al. (2021) and Şimşekli et al. (2019) suggest instead assuming that the $p$th moment of the norm of the noise is bounded, where $p \in (1, 2]$. Scaman and Malherbe (2020) prove a high probability convergence rate for SGD assuming strong smoothness and bounded $p$th moment noise for $p \in (1, 2]$. By allowing noise with possibly infinite variance, SGD converges at a slower rate in terms of $T$. Moreover, the dependence on $1/\delta$ is polynomial rather than logarithmic. Cutkosky and Mehta (2021) prove a high probability convergence rate for clipped SGD (which uses clipping, momentum, and normalization) assuming strong smoothness and bounded $p$th moment noise for $p \in (1, 2]$, thus improving the dependence on $1/\delta$ to logarithmic, but clipped SGD will still have a slower convergence rate in terms of $T$ even when the noise is norm sub-Gaussian. So, knowledge about the type of noise should actually affect whether we use SGD or clipped SGD. This motivates the question: How heavy can the noise be for SGD to have a high probability convergence rate that depends logarithmically on $1/\delta$?
Towards this end, we consider norm sub-Weibull noise. While it is much lighter tailed than bounded $p$th moment noise for $p \in (1, 2]$, it is a natural extension of norm sub-Gaussian noise and it turns out that it admits a convergence rate for SGD with logarithmic dependence on $1/\delta$. This leads us to the first contribution of our paper. Assuming strong smoothness and norm sub-Weibull noise, we prove a high probability convergence rate, whose dependence on $1/\delta$ is polylogarithmic with an exponent governed by the sub-Weibull tail weight $\theta$, for a weighted average of the squared gradient norms of SGD. This is given in Theorem 15. In the special case of norm sub-Gaussian noise, which is $\theta = 1/2$, this slightly improves the result of Li and Orabona (2020). For more general $\theta$, we make the additional assumption that the objective function is Lipschitz continuous. At its most basic, the Lipschitz continuity assumption says that the norm of the true gradient is bounded for all of the iterates. Li and Liu (2022), building off of our pre-print, relax this to only assuming that the step-size times the true gradient is bounded for all of the iterates, which is a weaker assumption since the step-size goes to zero.
However, hidden within these convergence rates is a subtle issue that requires a post-processing algorithm. All of these high probability convergence rates are for a weighted average of the squared gradient norms of the iterates rather than being for the squared gradient norm of a single iterate. While this implies a high probability convergence rate for the iterate with the smallest squared gradient norm, to actually determine which iterate has the smallest squared gradient norm, we would need to estimate the true gradient for each iterate! This leads us to another question: Is there a post-processing strategy for a single run of SGD that is as efficient as 2-RSG?
2-RSG is the algorithm of Ghadimi and Lan (2013) that takes multiple runs of SGD and picks the best one to get a high probability convergence rate. First, it goes from a mean convergence rate for the iterate with the smallest squared gradient norm to a mean convergence rate for a particular iterate by randomly choosing the iterate. It does this for $O(\log(1/\delta))$ runs of SGD. Then, it uses $O(\sigma^2 \log(1/\delta)/\epsilon)$ samples to estimate the gradients and pick the run with the smallest squared estimated gradient norm, where $\sigma^2$ is the variance of the noise and $\epsilon$ is the convergence tolerance. To compete with this, we introduce a post-processing algorithm that randomly chooses $O(\log(1/\delta))$ iterates of a single run of SGD and then picks the one with the smallest squared estimated gradient norm using $O(\sigma^2 \log(1/\delta)/\epsilon)$ samples, and we prove that our high probability convergence rate applies to the squared gradient norm of this iterate. The algorithm and the bound on the squared gradient norm of its output are in Theorem 21. The result can be combined with any high probability convergence rate on a weighted average of the squared gradient norm of the iterates of SGD, such as the results of Li and Orabona (2020), Scaman and Malherbe (2020), and Cutkosky and Mehta (2021). Moreover, the first part of it, going from a high probability bound on a weighted average of random variables to a high probability bound on the smallest of a small subset of those random variables, applies to more general stochastic sequences as shown in Theorem 19, which is a probability result of independent interest.
Underlying our convergence analysis are concentration inequalities. Vladimirova et al. (2020), Wong et al. (2020), and Bakhshizadeh et al. (2023) prove concentration inequalities for sub-Weibull random variables with deterministic scale parameters. However, we need a concentration inequality for a sub-Weibull martingale difference sequence (MDS) where the scale parameters are themselves random variables making up an auxiliary sequence. MDS concentration inequalities upper bound the partial sum of the main sequence and lower bound the partial sum of the auxiliary sequence simultaneously. When the lower bound depends on the partial sum of the main sequence, we call the concentration inequality self-normalized. Freedman (1975) proves a sub-Gaussian MDS concentration inequality. Fan et al. (2015) proves a sub-exponential MDS concentration inequality. Harvey et al. (2019) proves a sub-Gaussian MDS self-normalized concentration inequality. The proofs for these results all rely on the moment generating function (MGF), which does not exist for sub-Weibull random variables with $\theta > 1$, i.e., heavier than sub-exponential. Nevertheless, we are able to prove a sub-Weibull martingale difference sequence self-normalized concentration inequality. This is given in Theorem 11. The proof uses the MGF truncation technique of Bakhshizadeh et al. (2023) but applied to an MDS rather than i.i.d. random variables. The truncation level is determined by the tail decay rate and shows up as a $\theta$-dependent logarithmic factor in the SGD convergence rate.
We also consider convergence under the Polyak-Łojasiewicz (PŁ) inequality. Some examples of problems with PŁ objectives are logistic regression over compact sets (Karimi et al., 2016) and matrix factorization (Sun and Luo, 2016). Deep linear neural networks satisfy the PŁ inequality in large regions of the parameter space (Charles and Papailiopoulos, 2018, Thm. 4.5), as do two-layer neural networks with an extra identity mapping (Li and Yuan, 2017). Furthermore, sufficiently wide neural networks satisfy the PŁ inequality locally around random initialization (Allen-Zhu et al., 2019a, b; Du et al., 2018, 2019; Liu et al., 2022). While strong convexity implies the PŁ inequality, a PŁ function need not even be convex, hence the PŁ condition is considerably more applicable than strong convexity in the context of neural networks. Moreover, as pointed out by Karimi et al. (2016), the convergence proof for gradient descent assuming strong smoothness and strong convexity is actually simplified if we use the PŁ inequality instead of strong convexity. Thus, we would like to find a simple, elegant proof for the high probability convergence of SGD assuming strong smoothness and the PŁ inequality.
Assuming strong smoothness and strong convexity, the tight mean convergence rate of SGD is $1/T$ (Nemirovski et al., 2009). The same mean convergence rate can be shown assuming strong smoothness and the PŁ inequality (Karimi et al., 2016; Orvieto and Lucchi, 2019). But neither of these papers addresses high probability convergence. To fill this gap, we prove a high probability convergence rate for the objective value of SGD assuming strong smoothness, the PŁ inequality, and norm sub-Gaussian noise. This is given in Theorem 8. The proof relies on a novel probability result, Theorem 9, that essentially shows that adding two kinds of noise to a contracting sequence—sub-Gaussian noise with variance depending on the sequence itself and sub-exponential noise—results in a sub-exponential sequence. Unfortunately, the proof does not extend to the case of norm sub-Weibull noise. It is unclear if this is just an artifact of the proof or if the faster convergence of SGD under the PŁ inequality really cannot be maintained under norm sub-Weibull noise.
Applicability of assumptions. Given the motivation from neural networks, it is necessary to say a few words regarding the applicability of the Lipschitz continuity and strong smoothness assumptions. First, we show in Lemmas 1 and 2 that these assumptions are satisfied by simple neural network models that are, nevertheless, powerful enough to interpolate generic data sets using only twice the minimal number of parameters required for any model to do so. Second, they are satisfied if the iterates remain in a bounded set, which is precisely what happens in the NTK setting. Third, while it is true that Lipschitz continuity and strong smoothness may not be satisfied in practice, our paper is guided by the principle of "how far can we relax assumptions while still being able to prove high probability convergence rates?" An alternative principle is "what can we prove given assumptions that are completely justified in practice?" This is the principle guiding Patel et al. (2022). They provide a neural network counterexample that does not even satisfy $(L_0, L_1)$-smoothness (Zhang et al., 2020), an alternative assumption to strong smoothness. Then, they prove—assuming only (1) a lower bound on the objective function, (2) local Hölder continuity of the gradient, (3) equality of the true and mean gradients, and (4) that a moment of the norm of the stochastic gradient is bounded by an upper semi-continuous function—that, with probability 1, either (1) the norm of the iterates of SGD goes to infinity, or (2) SGD converges to a first-order stationary point. These assumptions are much weaker than ours, but so is the conclusion. We leave it as a future research direction to further relax our assumptions.
Organization. Section 2 establishes the optimization framework and reviews optimization, probability, and sub-Weibull terminology and results. Section 3 proves the PŁ convergence rate. Section 4 proves the MDS concentration inequality. Section 5 proves the non-convex convergence rate. Section 6 establishes the post-processing algorithm. Section 7 trains a two layer neural network model on synthetic data with Weibull noise injected into the gradient, showing the dependence of the tail of the convergence error on the tail of the noise.
Notation. We use lower-case letters such as $x$, $y$, and $z$ for vectors and capital letters such as $X$, $Y$, and $Z$ for random variables. We use $e_t$ for the error vector and distinguish it from Euler's number $e$ with the subscript (the subscript $t$ will denote the iteration index). We use $\sigma^2$ for the variance and so spell out sigma for sigma algebra. We use $O$ and $\Omega$ for big-$O$ notation. When comparing two sequences of real numbers, $a_T$ and $b_T$: $a_T = O(b_T)$ if $\limsup_{T \to \infty} a_T/b_T < \infty$, $a_T = \Omega(b_T)$ if $\liminf_{T \to \infty} a_T/b_T > 0$, and $a_T = \Theta(b_T)$ if $a_T = O(b_T)$ and $a_T = \Omega(b_T)$. We use $[n]$ to denote the set $\{1, \ldots, n\}$ and $\Gamma$ to denote the gamma function.
2 Preliminaries
We are interested in the unconstrained optimization problem
$$\min_{x \in \mathbb{R}^d} f(x), \tag{1}$$
where $f : \mathbb{R}^d \to \mathbb{R}$ is a differentiable function, and the SGD iteration
$$x_{t+1} = x_t - \alpha_t g_t,$$
where $g_t$ is an estimate of $\nabla f(x_t)$, $\alpha_t > 0$ is the step-size, and $x_1$ is the initial point. We restrict our attention to the setting where $x_1$ is deterministic, but the results easily extend to the setting where $x_1$ is a random vector. We define the noise as the vector $e_t := g_t - \nabla f(x_t)$ and assume that, conditioned on the previous iterates, it is unbiased and its norm is sub-Weibull (which, as we will see in Lemma 6, implies that the variance of the noise is bounded).
One of the main examples of Eq. (1) is the stochastic approximation problem, $f(x) = \mathbb{E}_{\xi \sim P}[F(x, \xi)]$, where $(\Xi, \mathcal{G}, P)$ is a probability space and $F(\cdot, \xi)$ is differentiable for all $\xi \in \Xi$ (Nemirovski et al., 2009; Ghadimi and Lan, 2013; Bottou and Bousquet, 2008). In this case, we independently sample $\xi_t \sim P$ and set $g_t = \nabla F(x_t, \xi_t)$.
Another example is the sample average approximation problem, which is the special case of the stochastic approximation problem where there is a finite set $\{\xi^{(1)}, \ldots, \xi^{(n)}\}$ such that $\xi = \xi^{(i)}$ with probability $1/n$. In this setting, we define $f_i(x) := F(x, \xi^{(i)})$ so that $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$.
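To make the sample average approximation setting concrete, the following minimal sketch runs SGD on a toy least-squares instance (illustrative only; the constant step-size and problem sizes are arbitrary choices, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample average approximation: f(x) = (1/n) sum_i f_i(x) with
# f_i(x) = 0.5 * (a_i^T x - b_i)^2.
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def stochastic_grad(x):
    """g_t = grad f_i(x_t) for a uniformly sampled index i (unbiased for grad f)."""
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

def sgd(T, alpha=0.01):
    x = np.zeros(d)                          # deterministic initial point x_1
    for _ in range(T):
        x = x - alpha * stochastic_grad(x)   # x_{t+1} = x_t - alpha_t * g_t
    return x

x_T = sgd(T=5000)
full_grad = A.T @ (A @ x_T - b) / n          # exact gradient of f at the output
print("||grad f(x_T)|| =", np.linalg.norm(full_grad))
```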
2.1 Optimization
We highlight a few key facts we use; see, e.g., Nesterov (2018), for more details. All definitions are with respect to the Euclidean norm $\|\cdot\|$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable, and assume $\arg\min f$ is non-empty and denote the minimum of $f$ by $f^\star$.
If $f$ is continuously differentiable, then $f$ is $\rho$-Lipschitz if and only if $\|\nabla f(x)\| \le \rho$ for all $x$. We say $f$ is $L$-strongly smooth (or $L$-smooth for short) if its gradient is $L$-Lipschitz continuous. If $f$ is $L$-smooth, then a standard result using the Taylor expansion is
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 \quad \text{for all } x, y \in \mathbb{R}^d.$$
Applying this result with $y = x - \frac{1}{L}\nabla f(x)$ and using $f^\star \le f(y)$, we get
$$\|\nabla f(x)\|^2 \le 2L\,(f(x) - f^\star), \tag{2}$$
or taking $x$ to be a minimizer $x^\star$ (so that $\nabla f(x^\star) = 0$), we get
$$f(y) - f^\star \le \frac{L}{2}\|y - x^\star\|^2.$$
Thus a convergence rate in terms of the iterates is stronger than one in terms of the objective, which is stronger than one in terms of the norm of the gradient. We say $f$ is $\mu$-Polyak-Łojasiewicz (or $\mu$-PŁ for short) if $\|\nabla f(x)\|^2 \ge 2\mu\,(f(x) - f^\star)$ for all $x$. Combining this with Eq. (2) shows that $\mu \le L$.
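For example (a standard illustration, not specific to this paper), $f(x) = x^2 + 3\sin^2(x)$ satisfies the PŁ inequality but is not convex (Karimi et al., 2016), so the PŁ class is strictly larger than the strongly convex class.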
To justify the Lipschitz continuity and strong smoothness assumptions in the context of neural networks, we include the following lemmas, proved in Appendix A, which show that both properties are satisfied by the least squares loss applied to the hidden layer of a two layer neural network with, e.g., sigmoid functions such as tanh, arctan, and the logistic function (applying the first lemma) or GELU activation (applying the second lemma). Note that such a model (with fixed outer layer), while a simple example of a neural network, is already able to interpolate generic data sets with only twice the necessary number of parameters for any model, including deeper ones, to do so (Madden and Thrampoulidis, 2024).
Lemma 1
Let and . Let be twice differentiable and assume . Let , , and . Define
Then is Lipschitz continuous and strongly smooth.
Lemma 2
Let and . Let be twice differentiable. Assume and for all . Let , , and . Define
Let and define the sublevel set . Then, on the sublevel set, it is Lipschitz continuous and strongly smooth.
2.2 Probability
See Section 20 of Billingsley (1995) for details on the convergence of random variables. A sequence of random variables $(X_T)$ converges to a random variable $X$:
- in probability if, for all $\epsilon > 0$, $\lim_{T \to \infty} P(|X_T - X| > \epsilon) = 0$;
- in mean if $\lim_{T \to \infty} \mathbb{E}|X_T - X| = 0$;
- almost surely if $P(\lim_{T \to \infty} X_T = X) = 1$.
When the rate of convergence is of interest, we say a sequence of random variables $(X_T)$ converges to $X$:
- with mean convergence rate $\epsilon_T$ if, for all $T$, $\mathbb{E}|X_T - X| \le \epsilon_T$;
- with high probability convergence rate $\epsilon_T(\delta)$ if, for all $\delta \in (0, 1)$ and all $T$, $P(|X_T - X| \le \epsilon_T(\delta)) \ge 1 - \delta$.
All five kinds of convergence are interrelated:
- Convergence in mean and convergence almost surely both imply convergence in probability.
- Mean convergence with rate $\epsilon_T$ such that $\epsilon_T \to 0$ as $T \to \infty$ implies convergence in mean.
- By Markov's inequality, mean convergence with rate $\epsilon_T$ implies high probability convergence with rate $\epsilon_T/\delta$.
- By the Borel-Cantelli lemma (Billingsley, 1995, Thm. 4.3), high probability convergence with rate implies almost sure convergence if for all . If for some , then is required. If , then only for some is required.
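For a concrete instance of the last item (a worked special case, assuming a rate with logarithmic dependence on $1/\delta$): if $P(|X_T - X| > \epsilon_T \log(e/\delta)) \le \delta$ for every $\delta \in (0, 1)$ and every $T$, then choosing $\delta_T = T^{-2}$ gives $\sum_T \delta_T < \infty$, so the Borel-Cantelli lemma implies that, almost surely, $|X_T - X| \le \epsilon_T \log(e T^2)$ for all sufficiently large $T$; hence $X_T \to X$ almost surely whenever $\epsilon_T \log T \to 0$. With the same choice of $\delta_T$, a rate with $1/\delta$ dependence would instead require the much more restrictive condition $\epsilon_T T^2 \to 0$.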
See Section 35 of Billingsley (1995) for details on martingale difference sequences. Let $(\Omega, \mathcal{F}, P)$ be a probability space. A sequence, $(\mathcal{F}_t)_{t \ge 1}$, of nested sigma algebras in $\mathcal{F}$ (i.e., $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}$) is called a filtration, in which case $(\Omega, \mathcal{F}, (\mathcal{F}_t), P)$ is called a filtered probability space. A sequence of random variables $(X_t)$ is said to be adapted to $(\mathcal{F}_t)$ if each $X_t$ is $\mathcal{F}_t$-measurable. Furthermore, if $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] = X_t$ for all $t$, then $(X_t)$ is called a martingale. On the other hand, if $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] = 0$ for all $t$, then $(X_t)$ is called a martingale difference sequence.
For the noise sequence $(e_t)$, we define a corresponding filtration by letting $\mathcal{F}_t$ be the sigma algebra generated by $e_1, \ldots, e_{t-1}$ for all $t \ge 2$ and setting $\mathcal{F}_1 = \{\emptyset, \Omega\}$. Note that $x_t$ is $\mathcal{F}_t$-measurable while $e_t$ is $\mathcal{F}_{t+1}$-measurable.
2.3 Sub-Weibull Random Variables
We consider sub-Gaussian, sub-exponential, and sub-Weibull random variables.
Definition 3
A random variable $X$ is $K$-sub-Gaussian if $\mathbb{E}[\exp(X^2/K^2)] \le 2$.
Definition 4
A random variable $X$ is $K$-sub-exponential if $\mathbb{E}[\exp(|X|/K)] \le 2$.
Definition 5
A random variable $X$ is $K$-sub-Weibull($\theta$) if $\mathbb{E}[\exp((|X|/K)^{1/\theta})] \le 2$.
See Proposition 2.5.2 of Vershynin (2018) for equivalent definitions of sub-Gaussian, Proposition 2.7.1 of Vershynin (2018) for equivalent definitions of sub-exponential, and Theorem 2.1 of Vladimirova et al. (2020) for equivalent definitions of sub-Weibull. Note that sub-Gaussian and sub-exponential are special cases of sub-Weibull using $\theta = 1/2$ and $\theta = 1$, respectively. The tail parameter $\theta$ measures the heaviness of the tail—higher values correspond to heavier tails—and the scale parameter $K$ gives us the following bound on the second moment.
Lemma 6
If $X$ is $K$-sub-Weibull($\theta$) then $\mathbb{E}[X^2] \le 2\Gamma(2\theta + 1)K^2$. In particular, $\mathrm{Var}(X) \le 2\Gamma(2\theta + 1)K^2$.
Proof  First, for all $t \ge 0$, Markov's inequality gives
$$P(|X| > t) = P\big(\exp((|X|/K)^{1/\theta}) \ge \exp((t/K)^{1/\theta})\big) \le 2\exp\big(-(t/K)^{1/\theta}\big).$$
Second,
$$\mathbb{E}[X^2] = \int_0^\infty P(X^2 > t)\,dt \le \int_0^\infty 2\exp\big(-(\sqrt{t}/K)^{1/\theta}\big)\,dt = 2\Gamma(2\theta + 1)K^2.$$
The proof demonstrates the general techniques for going between probability and expectation in this type of analysis. To go from probability to expectation, the CDF formula is used: $\mathbb{E}[Y] = \int_0^\infty P(Y > t)\,dt$ for a non-negative random variable $Y$. To go from expectation to probability, Markov's inequality is used: $P(Y \ge a) \le \mathbb{E}[\phi(Y)]/\phi(a)$ for any non-negative, non-decreasing function $\phi$. In both cases, the trick is choosing what $Y$ and $\phi$ should be, e.g. $Y = X^2$ or $\phi(y) = \exp((y/K)^{1/\theta})$.
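For intuition, the following sketch numerically checks Definition 5 and the second-moment bound on simulated Weibull data (illustrative only; the value of $\theta$ and the scales are arbitrary choices):

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(0)
theta, K = 2.0, 1.0   # tail weight and raw Weibull scale (arbitrary choices)

# Samples with tail P(X > t) = exp(-(t/K)^(1/theta)); numpy's shape parameter is 1/theta.
x = K * rng.weibull(1.0 / theta, size=1_000_000)

# Definition 5 requires E[exp((|X|/K')^(1/theta))] <= 2 for the scale K'.
# For this distribution the expectation equals 1/(1 - (K/K')^(1/theta)),
# so the smallest valid scale is K' = 2^theta * K.
for c in [1.5, 2.0, 4.0]:
    est = np.mean(np.exp((x / (c * K)) ** (1.0 / theta)))
    print(f"K' = {c} K: estimated Orlicz expectation = {est:.3f} (need <= 2)")

# Second-moment bound in the spirit of Lemma 6: E[X^2] <= 2 * Gamma(2*theta + 1) * K'^2.
K_prime = 2.0 ** theta * K
print("E[X^2] =", np.mean(x ** 2), "  bound =", 2 * gamma(2 * theta + 1) * K_prime ** 2)
```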
Remark 7
Note that our definitions of sub-Gaussian, sub-exponential, and sub-Weibull do not require the random variable to be centered. Thus, we do not center the constant random variable $c \ge 0$ to think of it as having scale parameter zero. Instead, for each $\theta > 0$, $c$ is $(c/(\ln 2)^{\theta})$-sub-Weibull($\theta$).
3 PŁ Convergence
In this section, we prove the following convergence bound, proving Theorem 9, a probability result, along the way. The convergence bound matches the optimal mean convergence rate of $O(1/T)$ and has logarithmic dependence on $1/\delta$.
Theorem 8
Assume is -smooth and -PŁ and that, conditioned on the previous iterates, is centered and is -sub-Gaussian. Then, SGD with step-size
where , constructs a sequence such that, for all ,
Proof  Assume $f$ is $L$-smooth, hence
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 \tag{3}$$
for all $x, y \in \mathbb{R}^d$. Set $x = x_t$ and $y = x_{t+1} = x_t - \alpha_t g_t$ and subtract $f^\star$. Additionally assume $f$ is $\mu$-PŁ and $\alpha_t \le 1/L$. Then
$$\begin{aligned} f(x_{t+1}) - f^\star &\le f(x_t) - f^\star - \alpha_t\Big(1 - \tfrac{L\alpha_t}{2}\Big)\|\nabla f(x_t)\|^2 - \alpha_t(1 - L\alpha_t)\langle \nabla f(x_t), e_t \rangle + \tfrac{L\alpha_t^2}{2}\|e_t\|^2 \\ &\le f(x_t) - f^\star - \tfrac{\alpha_t}{2}\|\nabla f(x_t)\|^2 - \alpha_t(1 - L\alpha_t)\langle \nabla f(x_t), e_t \rangle + \tfrac{L\alpha_t^2}{2}\|e_t\|^2 \\ &\le (1 - \alpha_t \mu)(f(x_t) - f^\star) - \alpha_t(1 - L\alpha_t)\langle \nabla f(x_t), e_t \rangle + \tfrac{L\alpha_t^2}{2}\|e_t\|^2, \end{aligned} \tag{4}$$
where the second inequality used $\alpha_t \le 1/L$ and the third used the PŁ inequality.
Recall that we defined $\mathcal{F}_t$ as the sigma algebra generated by $e_1, \ldots, e_{t-1}$ for all $t \ge 2$ and $\mathcal{F}_1 = \{\emptyset, \Omega\}$. So, $x_t$ is $\mathcal{F}_t$-measurable, $\nabla f(x_t)$ is $\mathcal{F}_t$-measurable, and $e_t$ is $\mathcal{F}_{t+1}$-measurable.
For all $t$, assume, conditioned on $\mathcal{F}_t$, that $e_t$ is centered and $\|e_t\|$ is $K$-sub-Gaussian. Then, $\langle \nabla f(x_t), e_t \rangle$ is centered conditioned on $\mathcal{F}_t$. Note that we can bound $\langle \nabla f(x_t), e_t \rangle$ using Cauchy-Schwarz and Eq. (2) in the following way:
$$|\langle \nabla f(x_t), e_t \rangle| \le \|\nabla f(x_t)\|\,\|e_t\| \le \sqrt{2L(f(x_t) - f^\star)}\,\|e_t\|.$$
But, if we use this to get a new recursion, then we get a sub-optimal convergence rate. Instead, we need to keep the same recursion but use the bound on $\|\nabla f(x_t)\|$ in its MGF: conditioned on $\mathcal{F}_t$, $\langle \nabla f(x_t), e_t \rangle$ is $\sqrt{2L(f(x_t) - f^\star)}\,K$-sub-Gaussian. Similarly, $\|e_t\|^2$ is $K^2$-sub-exponential conditioned on $\mathcal{F}_t$.
So, we have a recursion for a stochastic sequence with two types of additive noise—sub-Gaussian noise, with variance depending on the sequence itself, and sub-exponential noise. We prove that this makes the main sequence sub-exponential in the following theorem.
Theorem 9
Let be a filtered probability space. Let , , and be adapted to , and be deterministic. Let and be sequences of non-negative reals and let be a sequence of positive reals. Assume and are non-negative almost surely. Assume for all and for all . Assume
Then, for any sequence of positive reals such that and, for all , , and for any , we have, for all ,
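In schematic form (a reading based on the description in the preceding paragraph; the constants and exact conditions in the statement above may differ), the recursion being controlled is
$$Y_{t+1} \le (1 - \gamma_t)\,Y_t + \sqrt{Y_t}\,W_t + V_t,$$
where, conditioned on $\mathcal{F}_t$, $W_t$ is centered and sub-Gaussian and $V_t$ is sub-exponential; the conclusion is a sub-exponential-type tail bound on each $Y_T$.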
Proof
We want to find requirements on a sequence such that . Then, by Markov’s inequality and taking ,
. Our proof is inductive. For the base case, we need
For the induction step, assume . Let . Then
where denotes the law of total expectation, denotes the Cauchy-Schwarz inequality, and are the requirements
The assumptions of the theorem imply requirements , completing the proof.
The proof is similar to the proof of Theorem 4.1 of Harvey et al. (2019) but with some key differences. There, the recursion is where and is deterministic. We had to move the implicit dependence of the sub-Gaussian term inside of the MGF. We also had to allow to be a sub-exponential random variable , and so applied Cauchy-Schwarz, which contributed the 2 in the recursion for .
Note that if , then the assumptions on are satisfied by, for example, and . Returning to the proof of Theorem 8, for absolute constants and , by Proposition 2.5.2 and Proposition 2.7.1 of Vershynin (2018), we can set and and apply Theorem 9 with
Dividing by completes the proof.
Remark 10
It is natural to ask whether we can relax the sub-Gaussian assumption to a sub-Weibull assumption. In Theorem 9, we need a bound on the MGFs of both the sub-Gaussian-type term and the sub-exponential-type term. But, if $\|e_t\|$ is $K$-sub-Weibull($\theta$) with $\theta > 1/2$, then $\|e_t\|^2$ is $K^2$-sub-Weibull($2\theta$). Thus, $\|e_t\|^2$ would be sub-Weibull with tail parameter greater than 1, and so may have an infinite moment generating function at every nonzero argument.
A big take-away from the PŁ analysis is its simplicity, matching the simplicity of gradient descent’s convergence analysis in the same setting. It also serves as a good warm-up for the non-convex analysis that follows. Induction on the convergence recursion does not work for the non-convex analysis. Instead it requires an MDS self-normalized concentration inequality, which we state and prove in the next section.
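As a quick numerical sanity check of this kind of rate, the following sketch runs SGD with Gaussian (hence norm sub-Gaussian) gradient noise on a least-squares objective, which is both smooth and PŁ (illustrative only; the $O(1/t)$ step-size schedule is a standard choice and need not match the schedule in Theorem 8):

```python
import numpy as np

rng = np.random.default_rng(0)

# Least squares f(x) = 0.5 * ||A x - b||^2 is L-smooth and mu-PL with
# L = lambda_max(A^T A) and mu = lambda_min(A^T A) (A has full column rank here).
n, d = 100, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
f_star = f(x_star)
evals = np.linalg.eigvalsh(A.T @ A)
L, mu = evals[-1], evals[0]

def sgd_pl(T, sigma=1.0, c=2.0, seed=1):
    """SGD with Gaussian gradient noise and an O(1/t) step-size schedule."""
    noise_rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for t in range(1, T + 1):
        eta = min(1.0 / L, c / (mu * t))
        g = grad(x) + sigma * noise_rng.standard_normal(d)
        x = x - eta * g
    return f(x) - f_star

for T in [100, 1000, 10000]:
    errs = [sgd_pl(T, seed=s) for s in range(20)]
    print(f"T = {T:6d}: average suboptimality = {np.mean(errs):.5f}  (roughly O(1/T))")
```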
4 Concentration Inequality
In this section, we prove the following MDS concentration inequality. For $\theta = 1/2$ without self-normalization, Eq. (6), we recover the classical Freedman's inequality (Freedman, 1975). For $\theta = 1$ without self-normalization, Eq. (8), we recover Theorem 2.6 of Fan et al. (2015). For $\theta = 1/2$ with self-normalization, Eq. (5), we recover Theorem 3.3 of Harvey et al. (2019), called the "Generalized Freedman" inequality.
Theorem 11
Let be a filtered probability space. Let and be adapted to . Let . For all , assume almost surely, , and
where . If , assume there exist constants such that almost surely for all .
If , then for all , and , and ,
(5) |
and for all ,
(6) |
If , define
For all , and , and ,
(7) |
and for all , and ,
(8) |
If , let . Define
For all , and , and ,
and for all , and ,
(9) |
Proof Let be a filtered probability space. Let and be adapted to . Let . For all , assume almost surely, , and
What we want to upper bound in this setting is
for constants and . To understand how to use this, set and observe
by monotonicity (). The stronger bound over the union of partial sum bounds does not help us since the constants cannot depend on . The union of partial sum bounds arises in the proof from a stopping time defined as the first such that the corresponding set occurs.
Proving such a bound would typically involve using an MGF bound for the martingale difference terms, but the MGF is infinite in our setting. To get around this, we truncate to an appropriate level. By accounting for the probability of exceeding truncation, applying an MGF bound for truncated random variables, and applying a concentration inequality for martingale difference sequences with bounded MGFs, we are able to prove a concentration inequality for sub-Weibull martingale difference sequences.
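In schematic form (with $X_t$ denoting the martingale difference terms, $m$ the truncation level, and $\lambda$ the deviation; this notation is illustrative), the decomposition behind this strategy is
$$P\Big(\sum_{t=1}^{T} X_t > \lambda\Big) \;\le\; \sum_{t=1}^{T} P(X_t > m) \;+\; P\Big(\sum_{t=1}^{T} X_t\,\mathbf{1}\{X_t \le m\} > \lambda\Big),$$
where the first term is controlled by the sub-Weibull tail bounds and the choice of $m$, and the second term is controlled by an MGF argument for the truncated terms (which are no longer centered, a point Lemma 12 below accounts for).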
Probability of exceeding truncation.
Let . We want to define the truncation level so that the probability of any of the exceeding it is smaller than . This ends up being with as we will see.
First, for any ,
where we use Markov’s inequality, the law of total expectation, and the sub-Weibull assumption. In particular, .
Then, we can bound
by
and
and so proceed using while carrying around an additional probability.
MGF bound of truncated random variable.
In order to bound the MGF of our truncated random variables, we prove the following lemma which slightly modifies Corollary 2 of Bakhshizadeh et al. (2023) using the law of total expectation. The main subtlety in the extension is where we use the following bound
which we denote by . Otherwise, the proof exactly follows theirs.
Lemma 12
Let be a probability space, be a sigma algebra, and and be random variables. Assume is -measurable. Assume, conditioned on , that is centered and -sub-Weibull() with . Define . Then
where
Proof For convenience, define . Let . That is, and . Since , we have, by Lemma 4 of Bakhshizadeh et al. (2023),
Observe
and
Applying this to our setting, we get the MGF bound
(10) |
where
Concentration inequality for MGF bound.
Now we just need a self-normalized concentration inequality in the following setting:
Let be a filtered probability space. Let and be adapted to . Let . For all , assume almost surely, , and
for constants .
We can recognize this as a centered sub-exponential type MGF bound from Proposition 2.7.1 of Vershynin (2018). Theorem 2.6 of Fan et al. (2015) proves a concentration inequality in this setting, but not a self-normalized one. On the other hand, Theorem 3.3 of Harvey et al. (2019) proves a self-normalized concentration inequality in the centered sub-Gaussian setting, thus extending the original result of Freedman (1975) to a self-normalized result. We use the same trick of Harvey et al. (2019) to extend Theorem 2.6 of Fan et al. (2015) to a self-normalized result.
We distill the proof technique of Fan et al. (2015) into the following lemma so that we can apply it in our setting.
Lemma 13
(Fan et al., 2015, Pf. of Thm. 2.1) Let be a filtered probability space. If is adapted to and is a sequence of events; then
Proof Defining , then is a martingale. Let be a stopping time. Then the stopped process is a martingale (where denotes ). Moreover, is a probability density so define the conjugate probability measure .
Define the stopping time . Then .
Observe,
We apply Lemma 13 to our setting to get the following self-normalized concentration inequality.
Lemma 14
Let be a filtered probability space. Let and be adapted to . Let . For all , assume almost surely, , and
for constants . Then, for all , and , and ,
Proof By Claim C.2 of Harvey et al. (2019), if , then such that . Define
With , we want . This is ensured by .
Define
Then implies
Final step.
Putting everything together proves the and case. The rest of the work for the other cases is included in Appendix B.
5 Non-convex Convergence
In this section, we prove the following convergence bound.
Theorem 15
Assume is -smooth and that, conditioned on the previous iterates, is centered and is with . If , assume is -Lipschitz. Let and define . Then, for iterations of SGD with where , w.p. ,
where
and observe for any
Proof  As with the PŁ analysis, we start the non-convex analysis with Eq. (3). From this, we get a master bound on a weighted sum of the squared gradient norms $\|\nabla f(x_t)\|^2$. Our goal is a convergence rate for a weighted average of the $\|\nabla f(x_t)\|^2$ since this would imply convergence to a first-order stationary point, which is the best one can hope for without further assumptions. The master bound is in terms of two sums, an inner product sum and a norm sum. We bound the norm sum using an established sub-Weibull concentration inequality. The inner product sum, on the other hand, is the part that requires the MDS concentration inequality of Theorem 11.
Master bound.
As in Section 3, again assume $f$ is $L$-smooth, hence
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$$
for all $x, y \in \mathbb{R}^d$. Set $x = x_t$ and $y = x_{t+1} = x_t - \alpha_t g_t$. Then we get
$$f(x_{t+1}) \le f(x_t) - \alpha_t \langle \nabla f(x_t), g_t \rangle + \frac{L\alpha_t^2}{2}\|g_t\|^2.$$
Summing this and using $f^\star \le f(x_{T+1})$, we get
$$\sum_{t=1}^{T}\Big(\alpha_t \langle \nabla f(x_t), g_t \rangle - \frac{L\alpha_t^2}{2}\|g_t\|^2\Big) \le f(x_1) - f^\star. \tag{11}$$
We would like to bound
with high probability so that if , we get
with high probability. While these bounds are in big-$O$ notation, the bounds we prove will be precise.
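To make the two sums explicit (in the reconstructed notation above, writing $g_t = \nabla f(x_t) + e_t$), expanding Eq. (11) gives
$$\sum_{t=1}^{T} \alpha_t\Big(1 - \frac{L\alpha_t}{2}\Big)\|\nabla f(x_t)\|^2 \;\le\; f(x_1) - f^\star \;-\; \sum_{t=1}^{T} \alpha_t(1 - L\alpha_t)\langle \nabla f(x_t), e_t \rangle \;+\; \frac{L}{2}\sum_{t=1}^{T} \alpha_t^2 \|e_t\|^2,$$
so the inner product sum is a martingale difference sequence (each term is centered conditioned on $\mathcal{F}_t$) and the norm sum only involves the squared noise norms.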
Norm sum bound.
Assume that, conditioned on the previous iterates, is centered and is -sub-Weibull() with . Set with . Let . Using the law of total expectation,
Thus, is -sub-Weibull() so we can apply the following sub-Weibull concentration inequality.
Lemma 16
Applying Lemma 16, we get, w.p. ,
Inner product sum bound.
Assume that . We will prove the easier cases of and in the appendix. Assume is -Lipschitz continuous. Define
Recall that we defined as the sigma algebra generated by for all and . So, is -measurable and is -measurable; hence and are adapted to . We also have, for all , almost surely, , and
In other words, is a sub-Weibull MDS and captures the scale parameters. Let and define
Applying Theorem 11, we get, for all , w.p. ,
Combining this with the norm sum bound and master bound, we get, for all , w.p. ,
where
We want to bound away from zero. To do so, assume and . Then . Setting
and plugging in and completes the proof.
Remark 17
Note that Lipschitz continuity follows immediately if the iterates are bounded. This might lead one to consider using projected SGD, but there are certain issues preventing us from analyzing it, which we discuss in Appendices C and D. But, is Lipschitz continuity even a necessary assumption? Li and Liu (2022), building off of our pre-print, were actually able to relax the Lipschitz continuity assumption to the assumption that the step-size times the norm of the true gradient, $\alpha_t\|\nabla f(x_t)\|$, is bounded for all of the iterates. To see that this works, note that we can change the definition of the scale parameters in the analysis accordingly and the rest of the analysis, including the result, still holds.
Remark 18
Can we extend the analysis beyond sub-Weibull? Yes, but then we would not get logarithmic dependence on $1/\delta$. For example, under a weaker tail assumption we could use Corollary 3 instead of Corollary 2 of Bakhshizadeh et al. (2023). This would give us a convergence rate whose dependence on $1/\delta$ is worse than logarithmic for any fixed number of iterations. Thus, we would not have a logarithmic dependence on $1/\delta$ in general, but would approach such a dependence as the number of iterations increases.
6 Post-processing
Note that the results of Theorem 15 are in terms of a weighted average of the squared gradient norms, which is not a particularly useful quantity by itself. To get a bound in terms of a single iterate, we prove the following probability result, introduce a novel post-processing strategy which outputs a single iterate, and apply the probability result to bound the squared gradient norm of that iterate with high probability.
Theorem 19
Let . For all , let with probability , where . Let be independent copies of . Let . Let be an -valued random variable independent of . Then
Proof First, letting and ,
where follows by the independence of the and follows by Markov’s inequality since is non-negative almost surely. Next, define
Observe,
where (i) follows from the law of total probability and (ii) follows from Theorem 20.3 of Billingsley (1995) since and are independent.
Corollary 20
Assume is differentiable. Let and . Set . Sample indices with replacement from with probabilities to form the set . Then, for iterations of SGD with any step-size sequence ,
To combine the corollary with Theorem 15, we can set
To see the merit of this post-processing strategy, let's compare it to a naive approach one might take to apply the result of Theorem 15. The standard trick is to observe
$$\min_{t \in [T]} \|\nabla f(x_t)\|^2 \le \frac{\sum_{t=1}^{T} \alpha_t \|\nabla f(x_t)\|^2}{\sum_{t=1}^{T} \alpha_t}. \tag{12}$$
So, we could keep track of $\|\nabla f(x_t)\|^2$ at every iteration and record the iterate where it is lowest. However, this requires exact gradient information, which may be more costly than the stochastic gradient used in the algorithm. In Ghadimi and Lan (2013), they pick an index with probability proportional to its weight so that the expected squared gradient norm of the chosen iterate is proportional to the right-hand side of Eq. (12). They do this for each of several runs and pick the best of the runs. Corollary 20, on the other hand, allows us to sample a set of indices from a single run and pick the best iterate from among these samples. Hence, we call the associated failure probability the iterate sampling failure probability.
But, Corollary 20 is not the end of the story since computing the true gradient norm of even a few iterates still requires full gradient information. In the sample average approximation setting, this can be obtained by evaluating the gradient on the full batch of data (rather than a mini-batch). However, if this is computationally infeasible or if we are in the stochastic approximation setting, then we instead have to use empirical gradients over a test or validation set. This is what we do for the full post-processing strategy presented in the following theorem.
Theorem 21
Let be a probability space. Let and assume is differentiable for all . Let . Assume and . Let and . Set and . Apply the following procedure:
1. Sample indices with replacement from with probabilities to form the set .
2. Sample independently.
3. Run iterations of SGD with any step-size sequence to form .
4. Compute
Then,
Proof Apply Theorem 2.4 of Ghadimi and Lan (2013) to get
Using the definitions of and , and applying Corollary 20, proves the result.
We call the failure probability of this gradient-estimation step the empirical gradient failure probability. Note that if is -sub-Weibull(), then is centered and is -sub-Weibull(), in which case both Theorem 15 and Theorem 21 apply, with (by Lemma 6) for the latter. Also, note that for both Corollary 20 and Theorem 21, while we have to specify in advance, we only have to take iterations to apply the bound from Theorem 15 to the post-processing output.
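As a concrete illustration of the procedure in Theorem 21, the following sketch picks a single iterate from one run of SGD (illustrative only; the exact sampling weights, sample counts, and failure-probability splits may differ from the theorem):

```python
import numpy as np

def post_process(iterates, weights, grad_estimator, k, m, rng):
    """Pick a single iterate from one SGD run, in the spirit of Theorem 21.

    iterates:       list of SGD iterates x_1, ..., x_T
    weights:        non-negative sampling weights (e.g., the step-sizes alpha_t)
    grad_estimator: function x -> one stochastic gradient sample at x
    k:              number of candidate iterates to sample (iterate sampling)
    m:              number of gradient samples per candidate (empirical gradients)
    """
    probs = np.asarray(weights, dtype=float)
    probs /= probs.sum()                       # sample index t with prob. prop. to its weight
    candidates = rng.choice(len(iterates), size=k, replace=True, p=probs)

    best_x, best_sq_norm = None, np.inf
    for t in candidates:
        x = iterates[t]
        # average m independent stochastic gradients to estimate grad f(x)
        g_hat = np.mean([grad_estimator(x) for _ in range(m)], axis=0)
        sq_norm = float(g_hat @ g_hat)
        if sq_norm < best_sq_norm:
            best_x, best_sq_norm = x, sq_norm
    return best_x
```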
7 Neural Network Example
Consider the two layer neural network model
$$x \mapsto v^\top \sigma(W x),$$
where $\sigma$ is a differentiable activation function applied coordinate-wise, $x \in \mathbb{R}^d$ is a data point (feature vector), $W \in \mathbb{R}^{k \times d}$ is the first-layer weights, and $v \in \mathbb{R}^k$ is the second-layer weights. If we are given a data set $X \in \mathbb{R}^{n \times d}$ and labels $y \in \mathbb{R}^n$, then we can train $W$ (leaving $v$ fixed for simplicity) using the squared loss:
$$\mathcal{L}(W) = \frac{1}{2}\big\|\sigma(X W^\top) v - y\big\|^2.$$
In this case,
where is the Khatri-Rao product (Oymak and Soltanolkotabi, 2020). For our example, we use the GELU activation, $\sigma(z) = z\,\Phi(z)$ with $\Phi$ the standard normal CDF (Hendrycks and Gimpel, 2016), which has bounded first and second derivatives. Thus, we can apply Lemma 2. Since , we heuristically set in the lemma and estimate the strong smoothness of as , setting our step-size accordingly.
In order to demonstrate the effect of gradient noise on convergence error, we train on a fixed synthetic data set while injecting noise into the gradient. The labels come from a neural network model with width larger than the width of the training model. We also make sure the total number of trainable parameters is less than so that . The noise we inject has uniformly random direction and Weibull norm with scale parameter and shape parameter , with . We keep the same initialization for all trials. We run iterations of SGD and then define the convergence error to be the best gradient norm squared divided by the initial gradient norm squared. We compute the empirical CDF of the convergence error over 10000 trials and then consider $\delta$ in three ranges. We care about the dependence of the convergence error on $\delta$ for small $\delta$, but for too small of a $\delta$, the empirical CDF is not a good approximation to the true CDF (see Fig. 1) due to the nature of order statistics. Our code can be found at https://github.com/liammadden/sgd.
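For concreteness, injected noise of this form can be generated as in the following minimal sketch (illustrative only; the scale is arbitrary, and taking the Weibull shape parameter to be $1/\theta$ is an assumption about the parameterization):

```python
import numpy as np

def weibull_noise(like, scale, theta, rng):
    """Noise with uniformly random direction and Weibull-distributed norm."""
    direction = rng.standard_normal(like.size)
    direction /= np.linalg.norm(direction)      # uniform direction on the unit sphere
    norm = scale * rng.weibull(1.0 / theta)     # heavier tails for larger theta
    return (norm * direction).reshape(like.shape)

# Usage inside an SGD step, with g the gradient of the loss w.r.t. the hidden-layer weights:
# g_noisy = g + weibull_noise(g, scale=0.1, theta=2.0, rng=np.random.default_rng(0))
```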
[Figure 1: dependence of the convergence error on δ for the three ranges of δ.]
From Theorem 15, for a particular range of $\delta$ (and for fixed $T$), the upper bound exhibits one of three kinds of dependence on $\delta$. For sufficiently small $\delta$, the heaviest of the three terms will dominate, but the upper limit of this range of $\delta$ may be smaller than the lower limit of the range of $\delta$ for which the empirical CDF is a good approximation to the true CDF. In other words, if we showed that the true CDF for this particular example has the dominant dependence for $\delta$ sufficiently small, then we would have shown that the $\delta$-dependence of the upper bound in Theorem 15 is tight. However, with the empirical CDF, we are only able to show that the exponent increases as $\delta$ shrinks. Figure 1 shows the dependence for the three different ranges. We assume a parametric form for the dependence of the convergence error on $\delta$ and find the line of best fit in each case. The different fitted exponents are given in Table 1.
Our experiments suggest that injected Weibull noise results in convergence error whose dependence on $\delta$ steepens as $\delta$ decreases, thus roughly corroborating our upper bound. In particular, by computing a line of best fit for the values in each of the three ranges, we can estimate that the slope increases from 0.43, to 0.77, to 1.34 and the intercept decreases from -0.28, to -0.7, to -1.42, suggesting that we are in the corresponding regime.
Table 1: Fitted exponents from the lines of best fit (see Section 7).
0.64 | 0.85 | 1.40
0.93 | 1.56 | 2.33
1.51 | 2.38 | 4.08
8 Conclusions
This paper analyzed the convergence of SGD for objective functions that satisfy the PŁ condition and for generic non-convex objectives. Under a sub-Gaussian assumption on the gradient error, we showed a high probability convergence rate matching the mean convergence rate for PŁ objectives. Under a sub-Weibull assumption on the noise, we showed a high probability convergence rate matching the mean convergence rate for non-convex objectives. We also provided and analyzed a post-processing method for choosing a single iterate. To prove the convergence rate, we first proved a Freedman-type inequality for martingale difference sequences that extends previous Freedman-type inequalities beyond the sub-exponential threshold to allow for sub-Weibull tail-decay. Finally, we considered a synthetic neural net problem and showed that the heaviness of the tail of the gradient noise has a direct effect on the heaviness of the tail of the convergence error.
Acknowledgments
All three authors gratefully acknowledge support from the National Science Foundation (NSF), Division of Mathematical Sciences, through the award # 1923298.
Appendix A Proofs of Lemmas 1 and 2
Here we prove Lemma 1.
Proof First,
and
So,
Thus,
where is the Kronecker delta. Let be the largest absolute element of . Then, viewing as a vector,
and
proving the result.
And here we prove Lemma 2.
Proof Define and . Then and for all by Lemmas 3 and 5 of Oymak and Soltanolkotabi (2020). Let and . Observe
and
proving the result.
Appendix B Remaining work for proof of Theorem 11
First we need the following two lemmas. Lemma 23 extends Proposition 2.7.1 of Vershynin (2018) to interpolate between the sub-Gaussian and sub-exponential regimes.
Lemma 22
(Vershynin, 2018, Prop. 2.5.2(e)) If $X$ is centered and $K$-sub-Gaussian, then $\mathbb{E}[\exp(\lambda X)] \le \exp(c K^2 \lambda^2)$ for all $\lambda \in \mathbb{R}$, where $c$ is an absolute constant.
Lemma 23
If is centered and -sub-Weibull() with , then
for all .
Proof First, using Lemma 6 and , we can get for all , and so, in particular, for all . Thus, if , then
completing the proof.
Then, the next three lemmas allow us to include previous results as special cases of the theorem.
Lemma 24
(Fan et al., 2015, Thm. 2.6) Let be a filtered probability space. Let and be adapted to . Let . For all , assume almost surely, , and
Then, for all , and ,
Proof Define
and
Then implies
Lemma 25
(Harvey et al., 2019, Thm. 3.3) Let be a filtered probability space. Let and be adapted to . Let . For all , assume almost surely, , and
Then, for all , and ,
Lemma 26
(Freedman, 1975) Let be a filtered probability space. Let and be adapted to . For all , assume almost surely, , and
Then, for all , and ,
If the ’s are sub-Gaussian, that is, if , then, from Lemma 22,
Appendix C PŁ Projected SGD
When optimizing over a constraint set, , if is strongly convex, then so is plus the indicator function of , and results for gradient descent methods easily extend to projected gradient descent. On the other hand, if is PŁ, then plus the indicator function is not KL (Kurdyka-Łojasiewicz). This has real impacts on gradient descent algorithms, where gradient descent might converge while projected gradient descent does not. For example, there is a smooth function, a mollified version of , such that the PŁ inequality is satisfied but projected gradient descent does not converge to a minimizer; we formalize this in the remark below.
Remark 27
Consider where denotes and . The minimum of is 0 and . If we use —where denotes the indicator function, denotes the ball of radius 1 centered at 0, and is the normalization constant—to mollify , then, for , has PŁ constant and smoothness constant . Consider the starting point . For chosen appropriately, the distance from to its projection onto is strictly less than . Thus, if we let be the ball centered at with radius equal to exactly that distance, then the constrained problem and the unconstrained problem have the same minimum. However, projected gradient flow, starting from , ends up stuck at a non-minimizer: the point of closest to . See Figure 3 for the contour plot when , , and .
[Figure 3: contour plot for the example in Remark 27.]
In order to generalize gradient methods to projected gradient methods under PŁ-like assumptions, the proper generalization is that the function should satisfy a proximal PŁ inequality (Karimi et al., 2016). However, such an assumption is quite restrictive compared to the PŁ inequality. We leave the problem of convergence with just the PŁ inequality, via added noise or a Frank-Wolfe type construction, as a future research direction.
Appendix D Non-convex Projected SGD
Consider . Define
Note that if , then and . Moreover, and so, by the non-expansiveness of proj, and . We can get even tighter bounds using the second prox theorem (Beck, 2017, Thm. 6.39): and .
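For reference, the standard gradient mapping and its stochastic counterpart for a closed convex set $C$ and step-size $\alpha$ are (standard definitions in the style of Ghadimi et al. (2016); the notation here is illustrative and may differ from the paper's)
$$G_{\alpha}(x) := \frac{1}{\alpha}\Big(x - \mathrm{proj}_{C}\big(x - \alpha \nabla f(x)\big)\Big), \qquad \widetilde{G}_{\alpha}(x) := \frac{1}{\alpha}\Big(x - \mathrm{proj}_{C}\big(x - \alpha g\big)\Big),$$
where $g$ is a stochastic gradient at $x$; when $C = \mathbb{R}^d$, these reduce to $\nabla f(x)$ and $g$, respectively.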
It is easy to come up with an example where does not go to zero, so we would like to bound instead. We start as usual:
Focusing on the norm term,
Focusing on the inner product term,
Unfortunately, we cannot proceed any further. Ghadimi et al. (2016) are able to get around this but at the cost of getting instead of . To mitigate this, they require an increasing batch-size. Reddi et al. (2016) were able to remove this requirement, but only for non-convex projected SVRG not non-convex projected SGD. Thus, we leave the analysis of non-convex projected SGD as an open problem.
References
- Allen-Zhu et al. (2019a) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In Neural Information Processing Systems (NeurIPS), volume 32, 2019a.
- Allen-Zhu et al. (2019b) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), volume 97, pages 242–252, 2019b.
- Arjevani et al. (2019) Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.
- Bakhshizadeh et al. (2023) Milad Bakhshizadeh, Arian Maleki, and Victor H de la Pena. Sharp concentration results for heavy-tailed distributions. Information and Inference: A Journal of the IMA, 12(3):1655–1685, 2023.
- Beck (2017) A. Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization, 2017.
- Bertsekas and Tsitsiklis (2000) D. Bertsekas and J. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
- Billingsley (1995) Patrick Billingsley. Probability and Measure. John Wiley & Sons, 3 edition, 1995.
- Bottou and Bousquet (2008) Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Neural Information Processing Systems (NeurIPS), volume 20, 2008.
- Charles and Papailiopoulos (2018) Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning (ICML), volume 80, pages 745–754, 2018.
- Şimşekli et al. (2019) Umut Şimşekli, Levent Sagun, and Mert Gürbüzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning (ICML), volume 97, pages 5827–5837, 2019.
- Cutkosky and Mehta (2021) Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. In Neural Information Processing Systems (NeurIPS), volume 34, pages 4883–4895, 2021.
- Du et al. (2018) Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations (ICLR), 2018.
- Du et al. (2019) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), volume 97, pages 1675–1685, 2019.
- Fan et al. (2015) Xiequan Fan, Ion Grama, Quansheng Liu, et al. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 20, 2015.
- Freedman (1975) David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
- Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- Ghadimi et al. (2016) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
- Gürbüzbalaban et al. (2021) Mert Gürbüzbalaban, Umut Şimşekli, and Lingjiong Zhu. The heavy-tail phenomenon in SGD. In International Conference on Machine Learning (ICML), volume 139, pages 3964–3975, 2021.
- Harvey et al. (2019) Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory (COLT), volume 99, pages 1579–1613, 2019.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Karimi et al. (2016) Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
- Khaled and Richtárik (2020) Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329, 2020.
- Li and Liu (2022) Shaojie Li and Yong Liu. High probability guarantees for nonconvex stochastic gradient descent with heavy tails. In International Conference on Machine Learning (ICML), volume 162, pages 12931–12963, 2022.
- Li and Orabona (2020) Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive sgd with momentum. arXiv preprint arXiv:2007.14294, 2020.
- Li and Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Neural Information Processing Systems (NeurIPS), volume 30, 2017.
- Liu et al. (2022) Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 2022.
- Madden and Thrampoulidis (2024) Liam Madden and Christos Thrampoulidis. Memory capacity of two layer neural networks with smooth activation. SIAM Journal on Mathematics of Data Science, 6, 2024.
- Nemirovski et al. (2009) Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
- Nemirovsky and Yudin (1983) Arkadi Nemirovsky and David Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
- Nesterov (2018) Y. Nesterov. Lectures on Convex Optimization, volume 137 of Springer Optimization and Its Applications. Springer, Switzerland, 2 edition, 2018.
- Orvieto and Lucchi (2019) Antonio Orvieto and Aurelien Lucchi. Continuous-time models for stochastic optimization algorithms. In Neural Information Processing Systems (NeurIPS), pages 12589–12601, 2019.
- Oymak and Soltanolkotabi (2020) Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
- Panigrahi et al. (2019) Abhishek Panigrahi, Raghav Somani, Navin Goyal, and Praneeth Netrapalli. Non-gaussianity of stochastic gradient noise. In Science meets Engineering of Deep Learning (SEDL) workshop, 33rd Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Patel (2021) Vivak Patel. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions. Mathematical Programming, 164, 2021.
- Patel et al. (2022) Vivak Patel, Shushu Zhang, and Bowen Tian. Global convergence and stability of stochastic gradient descent. In Neural Information Processing Systems (NeurIPS), volume 35, pages 36014–36025, 2022.
- Reddi et al. (2016) Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Neural Information Processing Systems (NeurIPS), pages 1153–1161, 2016.
- Scaman and Malherbe (2020) Kevin Scaman and Cedric Malherbe. Robustness analysis of non-convex stochastic gradient descent using biased expectations. In Neural Information Processing Systems (NeurIPS), volume 33, pages 16377–16387, 2020.
- Sebbouh et al. (2021) Othmane Sebbouh, Robert M Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory (COLT), pages 3935–3971, 2021.
- Sun and Luo (2016) Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.
- Vershynin (2018) Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
- Vladimirova et al. (2020) Mariia Vladimirova, Stéphane Girard, Hien Nguyen, and Julyan Arbel. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier tailed distributions. Stat, 9(1):e318, 2020.
- Wong et al. (2020) Kam Chung Wong, Zifan Li, and Ambuj Tewari. Lasso guarantees for -mixing heavy-tailed time series. The Annals of Statistics, 48(2):1124–1142, 2020.
- Zhang et al. (2020) Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations (ICLR), 2020.