Uniform-in-Time Wasserstein Stability Bounds
for (Noisy) Stochastic Gradient Descent
Abstract
Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and we obtain time-uniform stability bounds (i.e., the bound does not increase with the number of iterations) for strongly convex losses and non-convex losses with additive noise, where we recover similar results to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds, which might not be achieved for convex or non-convex losses unless additional noise is injected into the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, is novel.
Keywords: Stochastic gradient descent, algorithmic stability, Wasserstein perturbation.
1 Introduction
With the development of modern machine learning applications, understanding the generalization properties of stochastic gradient descent (SGD) has become a major challenge in statistical learning theory. In this context, the main goal is to obtain computable upper-bounds on the population risk associated with the output of the SGD algorithm, which is given as follows: $F(x) := \mathbb{E}_{z \sim \mathcal{D}}[f(x, z)]$, where $z$ denotes a random data point, $\mathcal{D}$ is the (unknown) data distribution defined on the data space $\mathcal{Z}$, $x \in \mathbb{R}^d$ denotes the parameter vector, and $f : \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}$ is an instantaneous loss function.
In a practical setting, directly minimizing $F$ is typically not possible as $\mathcal{D}$ is unknown; yet one typically has access to a finite data set $X_n := (z_1, \dots, z_n) \in \mathcal{Z}^n$, where we assume each $z_i$ is independent and identically distributed (i.i.d.) with the common distribution $\mathcal{D}$. Hence, given $X_n$, one can then attempt to minimize the empirical risk $\hat{F}(x) := \frac{1}{n} \sum_{i=1}^{n} f(x, z_i)$ as a proxy for $F$. In this setting, SGD has been one of the most popular optimization algorithms for minimizing $\hat{F}$ and is based on the following recursion:

\[ x_k = x_{k-1} - \frac{\eta}{b} \sum_{i \in \Omega_k} \nabla f(x_{k-1}, z_i), \tag{1} \]

where $\eta > 0$ is the step-size, $b$ is the batch-size, $\Omega_k$ is the minibatch that is chosen randomly from the set $\{1, \dots, n\}$, and its cardinality satisfies $|\Omega_k| = b$.
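To make the recursion concrete, here is a minimal sketch of (1) in Python; the function name, data layout, and defaults are our own illustrative choices and are not part of the paper's formal setup:

```python
import numpy as np

def sgd(grad_f, data, x0, eta=0.01, b=8, n_iters=1000, rng=None):
    """Constant step-size SGD following recursion (1).

    grad_f(x, z) should return the gradient of the instantaneous
    loss f(x, z) with respect to the parameter vector x.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        # Draw the minibatch Omega_k uniformly at random, with |Omega_k| = b.
        idx = rng.choice(n, size=b, replace=False)
        # x_k = x_{k-1} - (eta / b) * sum of minibatch gradients.
        x = x - (eta / b) * sum(grad_f(x, data[i]) for i in idx)
    return x
```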
One fruitful approach for estimating the population risk attained by SGD, i.e., $F(x_k)$, is based on the following simple decomposition:

\[ F(x_k) = \hat{F}(x_k) + \left[ F(x_k) - \hat{F}(x_k) \right], \tag{2} \]
where the last term is called the generalization error. Once a computable upper-bound for the generalization error can be obtained, this decomposition directly leads to a computable upper bound for the population risk $F(x_k)$, since $\hat{F}(x_k)$ can be computed thanks to the availability of $X_n$. Hence, the challenge here reduces to deriving upper-bounds on $F(x_k) - \hat{F}(x_k)$, typically referred to as generalization bounds.
Among many approaches for deriving generalization bounds, algorithmic stability (Bousquet and Elisseeff, 2002) has been one of the most fruitful notions that have paved the way to numerous generalization bounds for stochastic optimization algorithms (Hardt et al., 2016; Chen et al., 2018; Mou et al., 2018; Feldman and Vondrak, 2019; Lei and Ying, 2020; Zhang et al., 2022). In a nutshell, algorithmic stability measures how much the algorithm output differs if we replace one data point in $X_n$ with a new sample. More precisely, in the context of SGD, given another data set $\hat{X}_n = (\hat{z}_1, \dots, \hat{z}_n)$ that differs from $X_n$ by at most one element, we (theoretically) consider running SGD on $\hat{X}_n$, i.e.,

\[ \hat{x}_k = \hat{x}_{k-1} - \frac{\eta}{b} \sum_{i \in \Omega_k} \nabla f(\hat{x}_{k-1}, \hat{z}_i), \tag{3} \]

and we are interested in the discrepancy between $x_k$ and $\hat{x}_k$ in some precise sense (to be formally defined in the next section). The wisdom of algorithmic stability indicates that a smaller discrepancy between $x_k$ and $\hat{x}_k$ implies a smaller generalization error.
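One can visualize this notion by running the two recursions (1) and (3) with shared minibatch indices (a synchronous coupling) and tracking the parameter discrepancy; the sketch below follows the conventions of the `sgd` routine above and is purely illustrative:

```python
import numpy as np

def coupled_sgd_gap(grad_f, data, data_hat, x0, eta=0.01, b=8,
                    n_iters=1000, seed=0):
    """Run recursions (1) and (3) with the same minibatches Omega_k
    and return the trajectory of ||x_k - x_hat_k||."""
    rng = np.random.default_rng(seed)
    n = len(data)
    x = np.asarray(x0, dtype=float)
    x_hat = x.copy()
    gaps = []
    for _ in range(n_iters):
        idx = rng.choice(n, size=b, replace=False)  # shared Omega_k
        x = x - (eta / b) * sum(grad_f(x, data[i]) for i in idx)
        x_hat = x_hat - (eta / b) * sum(grad_f(x_hat, data_hat[i]) for i in idx)
        gaps.append(np.linalg.norm(x - x_hat))
    return gaps
```

On a well-conditioned strongly convex problem, one would expect the gap trajectory to stabilize rather than grow, in line with the time-uniform bounds discussed below.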
The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. In a pioneering study, Hardt et al. (2016) proved a variety of stability bounds for SGD, for strongly convex, convex, and non-convex problems. Their analysis showed that, under strong convexity and bounded gradient assumptions, the generalization error of SGD with constant step-size is of order $O(1/n)$; whereas for general convex and non-convex problems, their bounds diverged with the number of iterations (even with a projection step), unless a decreasing step-size is used. Subsequent studies (Lei and Ying, 2020; Kozachkov et al., 2023) extended the results of Hardt et al. (2016), by either relaxing the assumptions or generalizing the setting to more general algorithms. However, their bounds still diverged for constant step-sizes, unless strong convexity is assumed. In a recent study, Bassily et al. (2020) proved stability lower-bounds for projected SGD when the loss is convex and non-smooth. Their results showed that, for general non-smooth loss functions, we cannot expect to prove time-uniform (i.e., non-divergent with the number of iterations) stability bounds for SGD, even when a projection step is appended.
In a related line of research, several studies investigated the algorithmic stability of the stochastic gradient Langevin dynamics (SGLD) algorithm (Welling and Teh, 2011), which is essentially a 'noisy' version of SGD that uses the following recursion: $x_k = x_{k-1} - \frac{\eta}{b} \sum_{i \in \Omega_k} \nabla f(x_{k-1}, z_i) + \xi_k$, where $(\xi_k)_{k \ge 1}$ is a sequence of i.i.d. Gaussian vectors, independent of $x_{k-1}$ and $\Omega_k$. Raginsky et al. (2017) and Mou et al. (2018) proved stability bounds for SGLD for non-convex losses, which were then extended to more general (non-Gaussian) noise settings in Li et al. (2019). While these bounds hinted at the benefits of additional noise in terms of stability, they still increased with the number of iterations, which limited the impact of their results. More recently, Farghly and Rebeschini (2021) proved the first time-uniform stability bounds for SGLD under non-convexity, indicating that, in the presence of additive Gaussian noise, better stability bounds can be achieved. Their time-uniform results were then extended to non-Gaussian, heavy-tailed perturbations in Raj et al. (2023a, b) for quadratic and a class of non-convex problems.
While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. Hence, it is not straightforward to extend the existing techniques to different algorithms with different classes of loss functions. Moreover, currently, it is not clear how the noisy perturbations affect algorithmic stability so that time-uniform bounds can be achieved, and more generally, it is not clear in which circumstances one might hope for time-uniform stability bounds.
In this study, we contribute to this line of research and prove novel time-uniform algorithmic stability bounds for SGD and its noisy versions. Our main contributions are as follows:
• We make a novel connection between learning theory and applied probability, and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms with a constant step-size. Our approach is based on Markov chain perturbation theory (Rudolf and Schweizer, 2018), which offers a three-step proof technique for deriving stability bounds: (i) showing the optimizer is geometrically ergodic, (ii) obtaining a Lyapunov function for the optimizer and the loss, and (iii) bounding the discrepancy between the Markov transition kernels associated with the chains $(x_k)$ and $(\hat{x}_k)$. We illustrate this approach on SGD and show that time-uniform stability bounds can be obtained under a pseudo-Lipschitz-like condition for smooth strongly-convex losses (we recover similar results to the ones of Hardt et al. (2016)) and a class of non-convex losses (that satisfy a dissipativity condition) when a noisy perturbation with finite variance (not necessarily Gaussian, hence more general than Farghly and Rebeschini (2021)) is introduced. Our results shed more light on the role of the additional noise in terms of obtaining time-uniform bounds: in the non-convex case the optimizer might not be geometrically ergodic unless additional noise is introduced, hence the bound cannot be obtained. Moreover, our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature (Aybat et al., 2020; Lessard et al., 2016; Fallah et al., 2022; Gürbüzbalaban et al., 2022; Liu et al., 2020; Aybat et al., 2019).
• We then investigate the case where no additional noise is introduced to the SGD recursion and the geometric ergodicity condition does not hold. First, for non-convex losses, we prove a time-uniform stability bound, where the bound converges to a positive number (instead of zero) as $n \to \infty$, and this limit depends on the 'level of non-convexity'. Then, we consider a class of (non-strongly) convex functions and prove stability bounds for the stationary distribution of the SGD iterates, which vanish as $n$ increases. To the best of our knowledge, these results are novel, and indicate that the stability bounds do not need to increase with time even under non-convexity and without additional perturbations; yet, they might have a different nature depending on the problem class.
One limitation of our analysis is that it requires Lipschitz surrogate loss functions and does not directly handle the original loss function, due to the use of the Wasserstein distance (Raginsky et al., 2016). Yet, surrogate losses have been readily utilized in the recent stability literature (e.g., Raj et al. (2023a, b)) and we believe that our analysis might illuminate uncovered aspects of SGD even with this requirement. All the proofs are provided in the Appendix.
2 Technical Background
2.1 The Wasserstein distance and Wasserstein algorithmic stability
Wasserstein distance. For $p \ge 1$, the $p$-Wasserstein distance between two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ is defined as (Villani, 2009):

\[ \mathcal{W}_p(\mu, \nu) := \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, \pi(dx, dy) \right)^{1/p}, \tag{4} \]

where the infimum is taken over the set $\Pi(\mu, \nu)$ of all couplings of $\mu$ and $\nu$. In particular, the dual representation for the 1-Wasserstein distance is given as (Villani, 2009):

\[ \mathcal{W}_1(\mu, \nu) = \sup_{h \in \mathrm{Lip}(1)} \left\{ \int_{\mathbb{R}^d} h(x) \, \mu(dx) - \int_{\mathbb{R}^d} h(x) \, \nu(dx) \right\}, \tag{5} \]

where $\mathrm{Lip}(1)$ consists of the functions $h : \mathbb{R}^d \to \mathbb{R}$ that are 1-Lipschitz.
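As a quick illustration of (4)–(5): in one dimension, the 1-Wasserstein distance between empirical measures can be computed exactly from order statistics, which SciPy implements directly; the Gaussian samples below are placeholders of our own choosing:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# Samples standing in for the laws of x_k and x_hat_k (d = 1 for simplicity).
xs = rng.normal(loc=0.0, scale=1.0, size=100_000)
ys = rng.normal(loc=0.1, scale=1.0, size=100_000)

# For two Gaussians with equal variance, W_1 equals the mean shift (0.1 here),
# so the empirical estimate should be close to that value.
print(wasserstein_distance(xs, ys))
```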
Wasserstein algorithmic stability. Algorithmic stability is a crucial concept in learning theory that has led to numerous significant theoretical breakthroughs (Bousquet and Elisseeff, 2002; Hardt et al., 2016). To begin, we will present the definition of algorithmic stability as stated in Hardt et al. (2016):
Definition 1 (Hardt et al. (2016), Definition 2.1)
Let $\mathcal{X}$ denote the set of $\mathbb{R}^d$-valued random vectors. For a (surrogate) loss function $\ell : \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}$, an algorithm $\mathcal{A} : \mathcal{Z}^n \to \mathcal{X}$ is $\varepsilon$-uniformly stable if

\[ \sup_{X \cong \hat{X}} \, \sup_{z \in \mathcal{Z}} \; \mathbb{E} \left[ \ell(\mathcal{A}(X), z) - \ell(\mathcal{A}(\hat{X}), z) \right] \le \varepsilon, \tag{6} \]

where the first supremum is taken over data sets $X, \hat{X} \in \mathcal{Z}^n$ that differ by one element, denoted by $X \cong \hat{X}$.
In this context, we purposefully employ a distinct notation for the loss function $\ell$ (in contrast to $f$) since our theoretical framework necessitates measuring algorithmic stability through a surrogate loss function, which may differ from the original loss $f$. More precisely, our bounds will be based on the 1-Wasserstein distance; hence, we will need the surrogate loss $\ell$ to be a Lipschitz continuous function, as we will detail in (5). On the other hand, for the original loss $f$ we will need some form of convexity (e.g., strong convexity, convexity, or dissipativity), and we will need the gradient of $f$ to be Lipschitz continuous, in order to derive Wasserstein bounds. Unfortunately, under these assumptions, we cannot further impose $f$ itself to be Lipschitz, hence the need for surrogate losses. Nevertheless, the usage of surrogate losses is common in learning theory, see e.g., Farghly and Rebeschini (2021); Raj et al. (2023b), and we present concrete practical examples in the Appendix.
Now, we present a result from Hardt et al. (2016) that establishes a connection between algorithmic stability and the generalization performance of a randomized algorithm. Prior to presenting the result, we define the empirical and population risks with respect to the loss function $\ell$ as follows: $\hat{R}(x) := \frac{1}{n} \sum_{i=1}^{n} \ell(x, z_i)$ and $R(x) := \mathbb{E}_{z \sim \mathcal{D}}[\ell(x, z)]$.
Theorem 2 (Hardt et al. (2016), Theorem 2.2)
Suppose that $\mathcal{A}$ is an $\varepsilon$-uniformly stable algorithm. Then, the expected generalization error is bounded by

\[ \left| \mathbb{E}_{\mathcal{A}, X_n} \left[ \hat{R}(\mathcal{A}(X_n)) - R(\mathcal{A}(X_n)) \right] \right| \le \varepsilon. \tag{7} \]

For a randomized algorithm $\mathcal{A}$, if $\nu$ and $\hat{\nu}$ denote the laws of $\mathcal{A}(X_n)$ and $\mathcal{A}(\hat{X}_n)$ respectively, then for an $L$-Lipschitz surrogate loss function $\ell$, we have the following generalization error guarantee:

\[ \left| \mathbb{E}_{\mathcal{A}, X_n} \left[ \hat{R}(\mathcal{A}(X_n)) - R(\mathcal{A}(X_n)) \right] \right| \le L \, \sup_{X_n \cong \hat{X}_n} \mathcal{W}_1(\nu, \hat{\nu}). \tag{8} \]
The above result can be directly obtained from the combination of the results given in (5), Definition 1, and Theorem 2 (see also Raginsky et al. (2016)).
2.2 Perturbation theory for Markov chains
Next, we recall the Wasserstein perturbation bound for Markov chains from Rudolf and Schweizer (2018). Let $(x_k)_{k \in \mathbb{N}}$ be a Markov chain with transition kernel $P$ and initial distribution $p_0$, i.e., we have almost surely

\[ \mathbb{P}(x_{k+1} \in A \mid x_0, \dots, x_k) = \mathbb{P}(x_{k+1} \in A \mid x_k) = P(x_k, A), \tag{9} \]

for any measurable set $A \subset \mathbb{R}^d$ and $k \in \mathbb{N}$, with $x_0 \sim p_0$. We assume that $(\hat{x}_k)_{k \in \mathbb{N}}$ is another Markov chain with transition kernel $\hat{P}$ and initial distribution $\hat{p}_0$. We denote by $p_k$ the distribution of $x_k$ and by $\hat{p}_k$ the distribution of $\hat{x}_k$. By $\delta_x$, we denote the Dirac delta distribution at $x$, i.e., the probability measure concentrated at $x$. For a measurable set $A$, we also use the notation $\delta_x P(A) := P(x, A)$.
Lemma 3 (Rudolf and Schweizer (2018), Theorem 3.1)
Assume that there exist some $C > 0$ and $\rho \in [0, 1)$ such that

\[ \mathcal{W}_1(\delta_x P^n, \delta_y P^n) \le C \rho^n \|x - y\|, \tag{10} \]

for any $x, y \in \mathbb{R}^d$ and $n \in \mathbb{N}$. Further assume that there exist some $\delta \in [0, 1)$ and $L < \infty$ and a measurable Lyapunov function $V : \mathbb{R}^d \to [1, \infty)$ of $\hat{P}$ such that for any $x \in \mathbb{R}^d$:

\[ (\hat{P} V)(x) \le \delta V(x) + L, \tag{11} \]

where $(\hat{P} V)(x) := \int_{\mathbb{R}^d} V(y) \, \hat{P}(x, dy)$. Then, we have

\[ \mathcal{W}_1(p_n, \hat{p}_n) \le C \rho^n \, \mathcal{W}_1(p_0, \hat{p}_0) + C \gamma \kappa \, \frac{1 - \rho^n}{1 - \rho}, \tag{12} \]

where $\gamma := \sup_{x \in \mathbb{R}^d} \frac{\mathcal{W}_1(\delta_x P, \delta_x \hat{P})}{V(x)}$ and $\kappa := \max \left\{ \int_{\mathbb{R}^d} V(x) \, \hat{p}_0(dx), \; \frac{L}{1 - \delta} \right\}$.
Lemma 3 provides a sufficient condition for the distributions $p_n$ and $\hat{p}_n$ after $n$ iterations to stay close to each other, given the initial distributions $p_0$ and $\hat{p}_0$. Lemma 3 will play a key role in helping us derive the main results in Section 3.1 and Section 3.2. Later, in the Appendix, we will state and prove a modification of Lemma 3 (see Lemma 23 in the Appendix) that will be crucial for obtaining the main result in Section 3.3.
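Since the right-hand side of (12) is a geometric sum, the bound saturates instead of growing with $n$; the small numerical sketch below (with hypothetical constants of our own choosing) makes this explicit:

```python
def perturbation_bound(n, rho, gamma, kappa, C=1.0, w0=0.0):
    """Right-hand side of (12): C * rho^n * W1(p_0, p_hat_0)
    plus C * gamma * kappa * (1 - rho^n) / (1 - rho).
    As n grows the bound tends to C * gamma * kappa / (1 - rho),
    i.e. it is uniform in time."""
    return C * rho**n * w0 + C * gamma * kappa * (1.0 - rho**n) / (1.0 - rho)

# Example: with rho = 0.99 and gamma * kappa = 1e-3, the time-uniform
# limit is 0.1, approached monotonically.
print([perturbation_bound(n, 0.99, 1e-3, 1.0) for n in (1, 10, 100, 10**4)])
```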
3 Wasserstein Stability of SGD via Markov Chain Perturbations
In this section, we will derive time-uniform Wasserstein stability bounds for SGD by using the perturbation theory presented in Rudolf and Schweizer (2018). Before considering general losses that can be non-convex, we first consider the simpler case of quadratic losses to illustrate our key ideas.
3.1 Warm up: quadratic case
To illustrate the proof technique, we start by considering a quadratic loss of the form $f(x, z_i) := \frac{1}{2} (a_i^\top x - y_i)^2$, where $z_i = (a_i, y_i)$, $a_i \in \mathbb{R}^d$, and $y_i \in \mathbb{R}$. In this setting, the SGD recursion takes the following form:

\[ x_k = \left( I - \frac{\eta}{b} H_k \right) x_{k-1} + \frac{\eta}{b} q_k, \qquad H_k := \sum_{i \in \Omega_k} a_i a_i^\top, \quad q_k := \sum_{i \in \Omega_k} a_i y_i. \tag{13} \]

The sequence $(H_k, q_k)_{k \ge 1}$ is i.i.d., and for every $k$, $(H_k, q_k)$ is independent of $x_{k-1}$.
Similarly, we can write down the iterates of SGD with a different data set $\hat{X}_n = (\hat{z}_1, \dots, \hat{z}_n)$ with $\hat{z}_i = (\hat{a}_i, \hat{y}_i)$, where $\hat{X}_n$ differs from $X_n$ by at most one element:

\[ \hat{x}_k = \left( I - \frac{\eta}{b} \hat{H}_k \right) \hat{x}_{k-1} + \frac{\eta}{b} \hat{q}_k, \qquad \hat{H}_k := \sum_{i \in \Omega_k} \hat{a}_i \hat{a}_i^\top, \quad \hat{q}_k := \sum_{i \in \Omega_k} \hat{a}_i \hat{y}_i. \tag{14} \]
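In this quadratic setting, one step of (13) is simply an affine map; the sketch below implements it directly (the array shapes and names are our own conventions, assuming `A` stacks the rows $a_i^\top$ and `y` the responses):

```python
def quadratic_sgd_step(x, A, y, idx, eta):
    """One step of recursion (13):
    x_k = (I - (eta/b) H_k) x_{k-1} + (eta/b) q_k."""
    a = A[idx]              # rows a_i, i in Omega_k  (shape: b x d)
    H = a.T @ a             # H_k = sum_i a_i a_i^T
    q = a.T @ y[idx]        # q_k = sum_i a_i y_i
    b = len(idx)
    return x - (eta / b) * (H @ x - q)
```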
Our goal is to obtain an algorithmic stability bound by estimating the 1-Wasserstein distance between the distributions of $x_k$ and $\hat{x}_k$, and we will now illustrate the three-step proof technique that we described in Section 1. To be able to apply the perturbation theory (Rudolf and Schweizer, 2018), we start by establishing the geometric ergodicity of the Markov process $(x_k)_{k \ge 0}$, given in the following lemma.
Lemma 4
Assume that . Then, for any , we have the following inequality:
We note that since , the assumption in Lemma 4 can be satisfied under mild assumptions, for example when with a positive probability, which is satisfied for small enough.
In the second step, we construct a Lyapunov function that satisfies the conditions of Lemma 3.
Lemma 5
Let . Assume that . Then, we have
(15) |
In our third and last step, we estimate the perturbation gap based on the Lyapunov function in the form of (10), assuming that the data is bounded. Such bounded data assumptions have been commonly made in the literature (Bach, 2014; Bach and Moulines, 2013).
Lemma 6
If for some , then we have .
Note that Lemma 3 relies on three conditions: the Wasserstein contraction in (10), which is obtained through Lemma 4; the drift condition for the Lyapunov function in (11), which is obtained in Lemma 5; and finally the estimate on $\gamma$ in (12), which concerns the one-step 1-Wasserstein distance between two semi-groups that, in our context, are associated with two datasets that differ by at most one element, and which is obtained in Lemma 6. The only place the neighborhood assumption ($X_n \cong \hat{X}_n$) is used is in the expression of $\gamma$ in equation (12). Now, having all the ingredients, we can invoke Lemma 3 and we obtain the following result, which provides a 1-Wasserstein bound between the distributions of the iterates when applied to datasets that differ by one point.
For $k \ge 0$, let $\nu_k$ and $\hat{\nu}_k$ denote the laws of the $k$-th SGD iterates $x_k$ and $\hat{x}_k$, obtained by the recursions (13) and (14) with the datasets $X_n$ and $\hat{X}_n$, respectively.
Theorem 7
Assume . We also assume that and and for some . Then, we have
(16) |
Proof  The result directly follows from Lemma 4, Lemma 5, Lemma 6, and Lemma 3.
By a direct application of (8), we can obtain a generalization bound for an -Lipschitz surrogate loss function, as follows:
where , and is a random set of $n$ data points from the data-generating distribution. The generalization bound obtained above does not include the mean squared error in the unbounded case, but it covers a larger class of surrogate loss functions. Because of this incompatibility, a direct comparison is not possible; however, the rate obtained in the equation above has the same dependence on the number of samples as the rates obtained in previous works (Lei and Ying, 2020). For least squares, there are other works using integral operators that develop generalization bounds for SGD under a capacity condition (Lin and Rosasco, 2017; Pillaud-Vivien et al., 2018). However, these bounds only hold for the least-squares loss.
3.2 Strongly convex case
Next, we consider strongly convex losses. In the remainder of the paper, we will always assume that for every $z \in \mathcal{Z}$, the map $x \mapsto f(x, z)$ is differentiable.
Before proceeding to the stability bound, we first introduce the following assumptions.
Assumption 1
There exist constants such that for any and every ,
(17) |
This assumption is a pseudo-Lipschitz-like condition on $\nabla f$ and is satisfied for various problems such as GLMs (Bach, 2014). Next, we assume that the loss function is strongly convex.
Assumption 2
There exists a universal constant $\mu > 0$ such that for any $x, y \in \mathbb{R}^d$ and $z \in \mathcal{Z}$,

\[ \langle \nabla f(x, z) - \nabla f(y, z), \, x - y \rangle \ge \mu \|x - y\|^2. \]
By using the same recipe as we used for quadratic losses, we obtain the following stability result.
Theorem 8
Similarly to the quadratic case, we can now directly obtain a bound on expected generalization error using (8). More precisely, for an -Lipschitz surrogate loss function , we have
The bound above has the same dependence on the number of samples as the ones of the previous stability analysis of (projected) SGD for strongly convex functions (Hardt et al., 2016; Liu et al., 2017; Lei and Ying, 2020). However, we have a worse dependence on the strong convexity parameter $\mu$.
3.3 Non-convex case with additive noise
Finally, we consider a class of non-convex loss functions. We assume that the loss function satisfies the following dissipativity condition.
Assumption 3
There exist constants $m > 0$ and $K \ge 0$ such that for any $x \in \mathbb{R}^d$ and $z \in \mathcal{Z}$,

\[ \langle x, \nabla f(x, z) \rangle \ge m \|x\|^2 - K. \]
The class of dissipative functions satisfying this assumption comprises functions that admit some gradient growth in radial directions outside a compact set; inside the compact set, they can have quite general non-convexity patterns. As concrete examples, they include certain one-hidden-layer neural networks (Akiyama and Suzuki, 2023); they arise in non-convex formulations of classification problems (e.g., in logistic regression with a sigmoid/non-convex link function); and they can also arise in robust regression problems, see e.g. Gao et al. (2022). Also, any function that is strongly convex outside of a ball of radius for some will satisfy this assumption. Consequently, regularized regression problems where the loss is a strongly convex quadratic plus a smooth penalty that grows slower than a quadratic will belong to this class; a concrete example would be smoothed Lasso regression, and many other examples are given in Erdogdu et al. (2022). Dissipative functions also arise frequently in the sampling and Bayesian learning literature, as well as in the study of global convergence in non-convex optimization (Raginsky et al., 2017; Gao et al., 2022).
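As a one-dimensional toy illustration of this assumption (using the dissipativity condition in the form stated above, and with an instantaneous loss that does not depend on $z$), consider

\[
f(x) = \tfrac{1}{2} x^2 + 2 \sin(x), \qquad
x f'(x) = x^2 + 2 x \cos(x) \ge x^2 - 2|x| \ge \tfrac{1}{2} x^2 - 2,
\]

so $f$ is dissipative with $m = 1/2$ and $K = 2$; at the same time, $f''(x) = 1 - 2\sin(x) < 0$ whenever $\sin(x) > 1/2$, so $f$ is non-convex.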
Unlike the strongly-convex case, we can no longer obtain a Wasserstein contraction bound using the synchronous coupling technique as we did in the proof of Theorem 8. To circumvent this problem, in this setting, we consider a noisy version of SGD, with the following recursion:
\[ x_k = x_{k-1} - \frac{\eta}{b} \sum_{i \in \Omega_k} \nabla f(x_{k-1}, z_i) + \xi_k, \tag{19} \]

where $(\xi_k)_{k \ge 1}$ are additional i.i.d. random vectors in $\mathbb{R}^d$, independent of $x_{k-1}$ and $\Omega_k$, satisfying the following assumption.
Assumption 4
$\xi_1$ is a random vector on $\mathbb{R}^d$ with a continuous probability density function $p$ that is positive everywhere, i.e., $p(x) > 0$ for any $x \in \mathbb{R}^d$, and $\mathbb{E}\|\xi_1\|^2 < \infty$.
Note that the SGLD algorithm (Welling and Teh, 2011) is a special case of this recursion, whilst our noise model can accommodate non-Gaussian distributions with finite second-order moment.
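A minimal sketch of one step of the noisy recursion (19) follows; the Laplace perturbation is one concrete non-Gaussian choice with a continuous, everywhere-positive density and finite variance, as Assumption 4 permits, and the scale parameter is a hypothetical choice of ours:

```python
import numpy as np

def noisy_sgd_step(x, minibatch_grad, eta, rng, scale=0.1):
    """One step of recursion (19): an SGD step plus an additive i.i.d.
    perturbation xi_k, drawn here from a Laplace distribution."""
    xi = rng.laplace(loc=0.0, scale=scale, size=x.shape)
    return x - eta * minibatch_grad(x) + xi

# Example usage with a hypothetical minibatch gradient oracle grad_hat:
# rng = np.random.default_rng(0)
# x = noisy_sgd_step(x, grad_hat, eta=0.01, rng=rng)
```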
Analogously, let us define the (noisy) SGD recursion with the data set $\hat{X}_n$ as

\[ \hat{x}_k = \hat{x}_{k-1} - \frac{\eta}{b} \sum_{i \in \Omega_k} \nabla f(\hat{x}_{k-1}, \hat{z}_i) + \xi_k, \]

and let $p$ denote the probability density function of $\xi_1$. Further, let $x_\star$ be a minimizer of $\hat{F}$. Then, by following the same three-step recipe, we obtain the following stability bound. Here, we do not provide all the constants explicitly for the sake of clarity; the complete theorem statement is given in Theorem 25 (Appendix E.1).
Theorem 9
Let . Assume that Assumption 1, Assumption 3 and Assumption 4 hold. We also assume that and for some and for some . For any , define so that and for any where is defined in (67) so that
(20) |
Let and denote the distributions of and respectively. Then, we have
(21) |
where for any and , and . The constant is explicitly stated in the proof.
Contrary to our previous results, the proof technique for showing Wasserstein contraction (as in Lemma 4) for this theorem relies on verifying the drift condition (Assumption 7) and the minorization condition (Assumption 8) as given in Hairer and Mattingly (2011). Once these conditions are satisfied, we invoke the explicitly computable bounds on the convergence of Markov chains developed in Hairer and Mattingly (2011).
From equation (8), we directly obtain the following generalization error bound for an -Lipschitz surrogate loss function,
where the constants are defined in Theorem 9.¹ The above result can be directly compared with the result in (Farghly and Rebeschini, 2021, Theorem 4.1), which has the same dependence on and . However, our result is more general in the sense that we do not assume the noise to be Gaussian. Note that Li et al. (2019) can also accommodate non-Gaussian noise; however, their bounds increase with the number of iterations.

¹By using the decomposition (2), we can obtain excess risk bounds for SGLD by combining our results with Xu and Raginsky (2017): it was shown that gradient Langevin dynamics attains an optimization error of after iterations, where is the mini-batch size and is the uniform spectral gap of the continuous-time Langevin dynamics. Similar results are given for SGLD in (Xu and Raginsky, 2017, Theorem 3.6).
Remark 10
In Theorem 9, we can take for some fixed so we can take
(22) |
Moreover, we can take , , and , so that
(23) |
provided that . Note that the parameter in (22) appears in the upper bound in equation (21) that controls the 1-Wasserstein algorithmic stability of SGD. It is easy to see from equation (21) that the smaller , the smaller the 1-Wasserstein bound. By the definition of , the larger , the smaller the 1-Wasserstein bound. As a result, we would like to choose to be as large as possible, and equation (22) provides an explicit value that can take, which is the largest possible.
Next, let us provide some explicitly computable lower bounds for in (22). This is achievable if we further specify the noise assumption. Under the assumption that are i.i.d. Gaussian distributed, we have the following corollary.
Corollary 11
Under the assumptions in Theorem 9, we further assume that the noise vectors are i.i.d. Gaussian so that . We also assume that . Then, we have
(24) |
4 Wasserstein Stability of SGD without Geometric Ergodicity
While the Markov chain perturbation theory enabled us to develop stability bounds for the case where we can ensure geometric ergodicity in the Wasserstein sense (i.e., proving contraction bounds), we have observed that such a strong ergodicity notion might not hold for non-strongly convex losses. In this section, we will prove two more stability bounds for SGD, without relying on Rudolf and Schweizer (2018), hence without requiring geometric ergodicity. To the best of our knowledge, these are the first time-uniform stability bounds for the considered classes of convex and non-convex problems.
4.1 Non-convex case without additive noise
The stability result we obtained in Theorem 9 required us to introduce additional noise (Assumption 4) to be able to invoke Lemma 3. We will now show that it is possible to use a more direct approach to obtain 2-Wasserstein algorithmic stability in the non-convex case under Assumption 3, without relying on Rudolf and Schweizer (2018). However, we will observe that, without geometric ergodicity, the bound will contain a non-vanishing bias term. Note that, since $\mathcal{W}_1(\mu, \nu) \le \mathcal{W}_2(\mu, \nu)$ for all $\mu, \nu$, the following bound still yields a generalization bound by (8).
Theorem 12
While the bound (25) does not increase with the number of iterations, it is easy to see that it does not vanish as $n \to \infty$, and it is small only when the constant $K$ from the dissipativity condition (Assumption 3) is small. In other words, if we consider $K$ to be the level of non-convexity (e.g., $K = 0$ corresponds to strong convexity), as the function becomes 'more non-convex' the persistent term in the bound will get larger. While this persistent term might make the bound vacuous when $K$ is large, for moderate $K$ the bound can still be informative, as the persistent term might be dominated by the first two terms.
Moreover, discarding the persistent bias term, this bound leads to a generalization bound with rate $1/\sqrt{n}$, rather than $1/n$ as before. This indicates that it is beneficial to add additional noise in SGD, as in Theorem 9, in order for the dynamics to be geometrically ergodic, which can lead to a sharper bound as $n \to \infty$. Finally, we note that as Theorem 12 involves the 2-Wasserstein distance, it can pave the way for generalization bounds without requiring a surrogate loss. Yet, this is not immediate and would require deriving uniform bounds for the iterates, e.g., as in Raginsky et al. (2017).
4.2 Convex case with additional geometric structure
We now present our final stability bound, where we consider relaxing the strong convexity assumption (Assumption 2) to the following milder assumption.
Assumption 5
There exist universal constants and such that for any and , .
Note that as , the function class can be seen as an intermediate class between convex and strongly convex functions, and such a class of functions has been studied in the optimization literature (Dunn, 1981; Bertsekas, 2015).
We analogously modify Assumption 1 and consider the following assumption.
Assumption 6
There exist constants and such that for any and every , .
The next theorem establishes a stability bound for the considered class of convex losses in the stationary regime of SGD.
Theorem 13
While we have relaxed the geometric ergodicity condition for this case, in the proof of Theorem 13 we show that the Markov chain is simply ergodic, i.e., . Hence, even though we still obtain a time-uniform bound, our bound holds asymptotically in $k$, due to the lack of an explicit rate for the convergence to stationarity. On the other hand, the lack of strong convexity here results in a generalization bound with rate $1/\sqrt{n}$, whereas for the strongly convex case, i.e., , we previously obtained a rate of $1/n$. This might be an indicator that there is still room for improvement in terms of the rate, at least for this class of loss functions.
5 Conclusion
We proved time-uniform Wasserstein-stability bounds for SGD and its noisy versions under different strongly convex, convex, and non-convex classes of functions. By making a connection to Markov chain perturbation results (Rudolf and Schweizer, 2018), we introduced a three-step guideline for proving stability bounds for stochastic optimizers. As this approach required geometric ergodicity, we finally relaxed this condition and proved two other stability bounds for a large class of loss functions.
The main limitation of our approach is that it requires Lipschitz surrogate loss functions, as it is based on the Wasserstein distance. Hence, our natural next step will be to extend our analysis without such a requirement. Finally, due to the theoretical nature of this study, it does not contain any direct potential societal impacts.
Acknowledgments
Lingjiong Zhu is partially supported by the grants NSF DMS-2053454, NSF DMS-2208303, and a Simons Foundation Collaboration Grant. Mert Gürbüzbalaban's research is supported in part by the grants Office of Naval Research Award Number N00014-21-1-2244, National Science Foundation (NSF) CCF-1814888, and NSF DMS-2053485. Anant Raj is supported by a Marie Skłodowska-Curie Fellowship (project NN-OVEROPT 101030817). Umut Şimşekli's research is supported by the French government, under the management of Agence Nationale de la Recherche, as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and by the European Research Council Starting Grant DYNASTY – 101039676.
References
- Akiyama and Suzuki (2023) S. Akiyama and T. Suzuki. Excess risk of two-layer ReLU neural networks in teacher-student settings and its superiority to kernel methods. In International Conference on Learning Representations, 2023.
- Aybat et al. (2019) N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. In Advances in Neural Information Processing Systems, volume 32, 2019.
- Aybat et al. (2020) N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. Robust accelerated gradient methods for smooth strongly convex functions. SIAM Journal on Optimization, 30(1):717–751, 2020.
- Bach (2014) F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.
- Bach and Moulines (2013) F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$. In Advances in Neural Information Processing Systems, volume 26, 2013.
- Bassily et al. (2020) R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, volume 33, pages 4381–4391, 2020.
- Bertsekas (2015) D. Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.
- Bousquet and Elisseeff (2002) O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
- Can et al. (2019) B. Can, M. Gürbüzbalaban, and L. Zhu. Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In International Conference on Machine Learning, pages 891–901. PMLR, 2019.
- Chen et al. (2018) Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
- Dunn (1981) J. C. Dunn. Global and asymptotic convergence rate estimates for a class of projected gradient processes. SIAM Journal on Control and Optimization, 19(3):368–400, 1981.
- Erdogdu et al. (2022) M. A. Erdogdu, R. Hosseinzadeh, and M. S. Zhang. Convergence of Langevin Monte Carlo in Chi-squared and Rényi divergence. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 151. PMLR, 2022.
- Fallah et al. (2022) A. Fallah, M. Gürbüzbalaban, A. Ozdaglar, U. Şimşekli, and L. Zhu. Robust distributed accelerated stochastic gradient methods for multi-agent networks. Journal of Machine Learning Research, 23(1):9893–9988, 2022.
- Farghly and Rebeschini (2021) T. Farghly and P. Rebeschini. Time-independent generalization bounds for SGLD in non-convex settings. In Advances in Neural Information Processing Systems, volume 34, pages 19836–19846, 2021.
- Feldman and Vondrak (2019) V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pages 1270–1279. PMLR, 2019.
- Gao et al. (2022) X. Gao, M. Gürbüzbalaban, and L. Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for nonconvex stochastic optimization: Nonasymptotic performance bounds and momentum-based acceleration. Operations Research, 70(5):2931–2947, 2022.
- Garrigos and Gower (2023) G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.
- Gürbüzbalaban et al. (2022) M. Gürbüzbalaban, A. Ruszczyński, and L. Zhu. A stochastic subgradient method for distributionally robust non-convex and non-smooth learning. Journal of Optimization Theory and Applications, 194(3):1014–1041, 2022.
- Hairer and Mattingly (2011) M. Hairer and J. C. Mattingly. Yet another look at Harris’ ergodic theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI, pages 109–118, Basel, 2011.
- Hardt et al. (2016) M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
- Kozachkov et al. (2023) L. Kozachkov, P. M. Wensing, and J.-J. Slotine. Generalization as dynamical robustness–The role of Riemannian contraction in supervised learning. Transactions on Machine Learning Research, 4:1–25, 2023.
- Lei and Ying (2020) Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, volume 119, pages 5809–5819. PMLR, 2020.
- Lessard et al. (2016) L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
- Li et al. (2019) J. Li, X. Luo, and M. Qiao. On generalization error bounds of noisy gradient methods for non-convex learning. arXiv preprint arXiv:1902.00621, 2019.
- Lin and Rosasco (2017) J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(1):3375–3421, 2017.
- Liu et al. (2017) T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, pages 2159–2167. PMLR, 2017.
- Liu et al. (2020) Y. Liu, Y. Gao, and W. Yin. An improved analysis of stochastic gradient descent with momentum. In Advances in Neural Information Processing Systems, volume 33, pages 18261–18271, 2020.
- Meyn and Tweedie (1993) S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer-Verlag, London, 1993.
- Meyn and Tweedie (1994) S. P. Meyn and R. L. Tweedie. Computable bounds for geometric convergence rates of Markov chains. Annals of Applied Probability, 4(4):981–1011, 1994.
- Mou et al. (2018) W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory, pages 605–638. PMLR, 2018.
- Pillaud-Vivien et al. (2018) L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, volume 31, 2018.
- Raginsky et al. (2016) M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016.
- Raginsky et al. (2017) M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
- Raj et al. (2023a) A. Raj, M. Barsbey, M. Gürbüzbalaban, L. Zhu, and U. Şimşekli. Algorithmic stability of heavy-tailed stochastic gradient descent on least squares. In International Conference on Algorithmic Learning Theory, volume 201, pages 1292–1342. PMLR, 2023a.
- Raj et al. (2023b) A. Raj, L. Zhu, M. Gürbüzbalaban, and U. Şimşekli. Algorithmic stability of heavy-tailed SGD with general loss functions. In International Conference on Machine Learning, volume 202, pages 28578–28597. PMLR, 2023b.
- Rudolf and Schweizer (2018) D. Rudolf and N. Schweizer. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli, 24(4A):2610–2639, 2018.
- Villani (2009) C. Villani. Optimal Transport: Old and New. Springer, Berlin, 2009.
- Welling and Teh (2011) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
- Xu and Raginsky (2017) A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, volume 30, 2017.
- Zhang et al. (2022) Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of SGD: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pages 2364–2373. PMLR, 2022.
APPENDIX
The Appendix is organized as follows:
• In Section A, we provide further details and examples about the usage of surrogate losses.
• In Section B, we provide additional technical background on computable convergence bounds for Markov chains.
• In Section C, we prove the Wasserstein perturbation results for the quadratic case.
• In Section D, we prove the Wasserstein perturbation results for the strongly convex case.
• In Section E, we prove the Wasserstein perturbation bounds for the non-convex case with additive noise.
• In Section F, we prove the results for the non-convex case without additive noise.
• In Section G, we prove the results for the convex case with additional geometric structure.
A On the Usage of Surrogate Losses
While the requirement of surrogate losses is a drawback of our framework, our setup can nevertheless cover several practical settings. In this section, we will provide two such examples.
Example 1.
We can choose the surrogate loss as the truncated loss, such that:

\[ \ell(x, z) := \min\{ f(x, z), C \}, \]

where $C > 0$ is a chosen constant. This can be seen as a "robust" version of the original loss, which has been widely used in robust optimization and is conceptually similar to adding a projection step to the optimizer.
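A sketch of this construction in code, assuming the truncation takes the min-form written above:

```python
def truncated_loss(f, C):
    """Surrogate ell(x, z) = min(f(x, z), C); the truncation level C
    bounds the surrogate, mimicking the 'robust' behaviour of Example 1."""
    return lambda x, z: min(f(x, z), C)
```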
Example 2.
Another natural setup for our framework is the -regularized Lipschitz loss that was also used in Farghly and Rebeschini (2021). As opposed to the previous case, for the sake of this example, let us consider as the true loss and as the surrogate loss. Then, we can choose the pair and as follows:
where . Intuitively, this setting means that we have a true loss which can be Lipschitz, but in the optimization framework we consider a regularized version of the loss. In other words, we run the algorithm on the regularized loss to have better convergence properties, and we would then like to understand whether the algorithm generalizes with respect to the original loss; we are typically not interested in whether the algorithm generalizes well on the regularized loss.
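A sketch of this setup, under the assumption that the regularizer is the quadratic penalty $(\lambda/2)\|x\|^2$; the name `lam` is our own notation:

```python
def regularized_grad(loss_grad, lam):
    """Gradient of the optimized loss ell(x, z) + (lam / 2) * ||x||^2.
    The optimizer runs on this regularized objective, while stability
    and generalization are measured on the Lipschitz loss ell itself."""
    return lambda x, z: loss_grad(x, z) + lam * x
```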
Next, we illustrate how a generalization bound can be obtained for the original loss. For this example, a bound on the relevant quantity can be obtained by building on our analysis. To obtain such a bound, in addition to the bounds that we developed, we would need to estimate the following quantity:
For illustration purposes, assume that is convex and Lipschitz in the first parameter. Then, is -strongly convex. Further consider that we initialize SGD from , i.e., and set the batch size to 1. Denote as the -th iterate of SGD when applied on , i.e., (1). Further define the minimum:
We can now analyze the error induced by the surrogate loss as follows:
Here, the second inequality follows from standard convergence analysis for SGD (Garrigos and Gower, 2023, Theorem 5.7) and we define as the stochastic gradient noise variance:
where for a random vector we define . Hence, we can see that the error induced by the surrogate loss depends on the following factors:
• The regularization parameter,
• The expected norm of the minimizers,
• The step-size $\eta$,
• The expected stochastic gradient noise variance.

These terms can be controlled by adjusting the regularization parameter and the step-size.
B Technical Background
B.1 Computable bounds for the convergence of Markov chains
Geometric ergodicity and convergence rates of Markov chains have been well studied in the literature (Meyn and Tweedie, 1993, 1994; Hairer and Mattingly, 2011). In this section, we state a result from Hairer and Mattingly (2011) that provides an explicitly computable bound on the Wasserstein contraction for Markov chains satisfying a drift condition, which relies on the construction of an appropriate Lyapunov function, and a minorization condition.
Let be a Markov transition kernel for a Markov chain on . For any measurable function , we define:
Assumption 7 (Drift Condition)
There exists a function and some constants and so that
for all .
Assumption 8 (Minorization Condition)
There exists some constant and a probability measure so that
for some .
We define the weighted total variation distance:
where and is the Lyapunov function that satisfies the drift condition (Assumption 7). It is known that has the following alternative expression (Hairer and Mattingly, 2011):
where is the weighted supremum norm such that for any :
It is also noted in Hairer and Mattingly (2011) that has yet another equivalent expression:
where
C Proofs of Wasserstein Perturbation Results: Quadratic Case
C.1 Proof of Lemma 4
Proof Let denote the law of starting with and the law of :
(27) |
with . Note that
(28) | |||
(29) |
which implies that
(30) |
By iterating over , we conclude that
(31) |
This completes the proof.
C.2 Proof of Lemma 5
Proof First, we recall that
(32) |
where and . Therefore, starting with , we have
(33) |
which implies that
(34) |
This completes the proof.
C.3 Proof of Lemma 6
Proof Let us recall that
(35) | |||
(36) |
which implies that
(37) |
Since and differ by at most one element and for some , we have with probability and with probability and moreover
(38) |
and
(39) |
Hence, we conclude that
(40) |
This completes the proof.
D Proofs of Wasserstein Perturbation Results: Strongly Convex Case
In order to obtain the algorithmic stability bound, that is, a Wasserstein distance bound between the distributions of $x_k$ and $\hat{x}_k$, we need to establish a sequence of technical lemmas. First, we show a Wasserstein contraction rate in the following lemma.
Lemma 15
Proof Let denote the law of starting with :
(42) |
and the law of :
(43) |
with . Note that
(44) | |||
(45) |
Therefore, we have
(46) |
By applying Assumption 1 and Assumption 2, we get
(47) |
provided that . By iterating over , we conclude that
(48) |
This completes the proof.
Next, we construct a Lyapunov function and obtain a drift condition for the SGD .
Lemma 16
Proof First, we recall that
(50) |
Therefore, starting with , we have
(51) |
Moreover, we have
(52) |
This implies that
(53) |
We can compute that
(54) |
Moreover, we can compute that
(55) |
Hence, we conclude that
(56) |
provided that .
This completes the proof.
Next, we estimate the perturbation gap based on the Lyapunov function .
Lemma 17
Assume that Assumption 1 holds. Assume that for some . Then, we have
(57) |
where is the minimizer of .
Proof Let us recall that
(58) | |||
(59) |
which implies that
(60) |
Since and differ by at most one element and for some , we have for any with probability and for exactly one with probability and therefore
(61) |
Hence, we conclude that
(62) |
This completes the proof.
Next, let us provide a technical lemma that upper bounds the norm of and , which are the minimizers of and respectively.
Lemma 18
Proof Since is -strongly convex in for every , we have
(63) |
which implies that
(64) |
which yields that .
Similarly, one can show that
.
This completes the proof.
D.1 Proof of Theorem 8
E Proofs of Wasserstein Perturbation Bounds: Non-Convex Case
Lemma 19
Proof Our proof relies on a computable bound on the Wasserstein contraction, due to Hairer and Mattingly (2011), for Markov chains satisfying a drift condition (Assumption 7), which relies on the construction of an appropriate Lyapunov function, and a minorization condition (Assumption 8).
By applying Lemma 21, we can immediately show that the following drift condition holds. Let , where is the minimizer of . Assume that and for some . Then, we have
(69) |
where
(70) |
Thus, the drift condition (Assumption 7) holds.
Next, let us show that the minorization condition (Assumption 8) also holds. Let us recall that
(71) |
We denote the probability density function of with the emphasis on the dependence on the initial point . Then, to check that the minorization condition (Assumption 8) holds, it suffices to show that there exists some constant
(72) |
for some , where is the density function of a probability distribution function on , and (72) follows from Lemma 20. Hence, by Lemma 14, we have
for any in , where for any and one can choose and .
Finally, by the Kantorovich-Rubinstein duality for the Wasserstein metric, we get for any two probability measures on :
(73) |
Hence, we conclude that
This completes the proof.
The proof of Lemma 19 relies on the following technical lemma, which is a reformulation of Lemma 35 in Can et al. (2019) that helps establish the minorization condition (Assumption 8).
Lemma 20
For any and so that and any so that
(74) |
Then, we have
(75) |
where
(76) |
Proof The proof is an adaptation of the proof of Lemma 35 in Can et al. (2019). Let us take:
(77) |
Then, it is clear that is a probability density function on . It follows that (72) automatically holds for . Thus, we only need to show that (72) holds for . Since has a continuous density, is continuous in both and . Fix , by continuity of in both and , there exists some such that uniformly in ,
(78) |
where we can take
(79) |
In particular, for any fixed , we can take such that
(80) |
and with fixed and , we take such that uniformly in ,
(81) |
This completes the proof.
Next, we construct a Lyapunov function and obtain a drift condition for the SGD .
Lemma 21
Proof First, we recall that
(83) |
Therefore, starting with , we have
(84) |
Moreover, we have
(85) |
This implies that
(86) |
We can compute that
(87) |
Moreover, by following the same arguments as in the proof of Lemma 16, we have
(88) |
Hence, we conclude that
(89) |
provided that .
This completes the proof.
Next, we estimate the perturbation gap based on the Lyapunov function .
Lemma 22
Proof Let us recall that
(91) | |||
(92) |
which implies that
(93) |
where denotes expectations w.r.t. only, and we can further compute that
(94) |
Therefore, we have
(95) |
Similarly, we have
(96) |
Since and differ by at most one element and for some , we have for any with probability and for exactly one with probability . Therefore, we have
(97) |
Hence, we conclude that
(98) |
This completes the proof.
It is worth noting that the Wasserstein contraction bound we obtained in Lemma 19 in the non-convex case differs from the one we obtained in the strongly-convex case (Lemma 15), in the sense that the right-hand side of (68) is no longer of the form required by Lemma 3, so that Lemma 3 is not directly applicable. Instead, in the following, we provide a modification of Lemma 3, which will be used in proving Theorem 9. The definitions of the notations used in the following lemma can be found in Section 2.2.
Lemma 23
Assume that there exist some and such that
(99) |
for any . Further assume that there exist some and and a measurable Lyapunov function of such that for any :
(100) |
Then, we have
(101) |
where
(102) |
Proof The proof is based on a modification of the proof of Lemma 3 (Theorem 3.1 in Rudolf and Schweizer (2018)). By induction, we have
(103) |
We have
(104) |
Moreover, for any ,
(105) |
so that we obtain . Therefore, we have
(106) |
By the triangle inequality of the Wasserstein distance, we have
(107) |
This completes the proof.
Next, let us provide a technical lemma that upper bounds the norm of and , which are the minimizers of and respectively.
Lemma 24
Under Assumption 3, we have
(108) | |||
(109) |
Proof By Assumption 3, we have
(110) |
which implies that
(111) |
which yields that
(112) |
Similarly, one can show that
(113) |
This completes the proof.
E.1 Proof of Theorem 9
Before going to the proof, let us restate the full version of Theorem 9 that we provide below.
Theorem 25 (Complete Theorem 3.3)
Let . Assume that Assumption 1, Assumption 3 and Assumption 4 hold. We also assume that and for some and for some . For any . Define so that and any where is defined in (67) so that
(114) |
Let and denote the distributions of and respectively. Then, we have
(115) |
where for any and one can choose and and is the weighted total variation distance defined in Section B.1.
E.2 Proof of Corollary 11
Proof Under our assumptions, the noise vectors are i.i.d. Gaussian so that . Moreover, we have . Then is the probability density function of
(118) |
Therefore,
(119) |
Notice that for any ,
(120) |
Thus, we have
(121) |
Since and , for any , we have
By Chebychev’s inequality, letting , for any , we get
Next, for any such that and , we have
(122) |
where denotes the expectation w.r.t. , and denotes the probability density function conditional on . For any given , we can compute that
(123) |
where
(124) |
Therefore, for any such that and , we have
(125) |
We can further compute that
(126) |
and
(127) |
Hence, we have
(128) |
Since it holds for every , we have
(129) |
Hence, we conclude that
(130) |
This completes the proof.
F Proofs of Non-Convex Case without Additive Noise
F.1 Proof of Theorem 12
Proof Let us recall that and for any ,
(131) | |||
(132) |
Thus it follows that
(133) |
where
(134) |
This implies that
(135) |
By Assumption 1 and Assumption 3, we have
(136) |
provided that .
Since and differ by at most one element and for some , we have for any with probability and for exactly one with probability and therefore
(137) |
and moreover,
(138) |
where we used the inequality that
(139) |
for any . Therefore, we have
(140) |
G Proofs of Convex Case with Additional Geometric Structure
In the following technical lemma, we show that the -th moment of and can be bounded in the following sense.
Lemma 26
Proof First, we recall that
(148) |
Therefore, we have
(149) |
Moreover, we have
(150) |
This implies that
(151) |
We can compute that
(152) |
Moreover, we can compute that
(153) |
Hence, by applying (152) and (153) to (151), we conclude that
(154) |
provided that . This implies that
(155) |
and hence
(156) |
Similarly, we can show that
(157) |
This completes the proof.
Next, let us provide a technical lemma that upper bounds the norm of and , which are the minimizers of and respectively.
Lemma 27
Proof Under Assumption 5, we have
(158) |
where , which implies that
(159) |
which yields that
Similarly, one can show that
This completes the proof.
Now, we are able to state the main result for the Wasserstein algorithmic stability.
Theorem 28
Proof Let us recall that and for any ,
(161) | |||
(162) |
Thus it follows that
(163) |
where
(164) |
This implies that
(165) |
By Assumption 6 and Assumption 5, we have
(166) |
provided that .
Since and differ by at most one element and for some , we have for any with probability and for exactly one with probability and therefore
(167) |
and moreover,
(168) |
Notice that for any and , we have , , and , which implies that
(169) |
Therefore, by applying (169) to (168), we have
(170) |
Hence, by applying (166), (167), (170) into (165), we conclude that
(171) |
provided that . This, together with , implies that
(172) |
In Lemma 26, we showed that under the assumption , we have
(173) | |||
(174) |
Hence, by plugging (173) and (174) into (172), we conclude that
(175) |
Moreover, , and by applying Lemma 27, we obtain
(176) |
Finally, by the definition of -Wasserstein distance, we have
(177) |
This completes the proof.
G.1 Proof of Theorem 13
Proof First, let us show that the sequences and are ergodic.
First, let us show that the limit , if it exists, is unique. Consider two sequences and starting at and respectively with two limits and . For any ,
(178) | |||
(179) |
By Assumption 6 and Assumption 5, we have
(180) |
provided that . This implies that
(181) |
provided that . Thus, for any . Suppose for every , then we have
(182) |
such that
(183) |
Since ,
(184) |
for any , which implies that as so that .
Next, let us show that for any sequence , it converges to a limit. It follows from (184) that
(185) |
for any . Let be a fixed initial value in , and let which is random yet takes only finitely many values given so that is bounded. Therefore, it follows from (185) that
(186) |
where denotes the distribution of , which implies that
(187) |
Thus, is a Cauchy sequence in equipped with metric and hence there exists some such that as .
Hence, we showed that the sequence is ergodic. Similarly, we can show that the sequence is ergodic.