Shashank Rajput, Kangwook Lee, Dimitris Papailiopoulos
University of Wisconsin-Madison
Correspondence to: Shashank Rajput <[email protected]>
Abstract
A recent line of ground-breaking results for permutation-based SGD has corroborated a widely observed phenomenon: random permutations offer faster convergence than with-replacement sampling.
However, is random optimal?
We show that this depends heavily on what functions we are optimizing, and the convergence gap between optimal and random permutations can vary from exponential to nonexistent.
We first show that for 1-dimensional strongly convex functions with smooth second derivatives, there exist
permutations that offer exponentially faster convergence compared to random.
However, for general strongly convex functions, random permutations are optimal.
Finally, we show that for quadratic, strongly-convex functions, there are easy-to-construct permutations that lead to accelerated convergence compared to random.
Our results suggest that a general convergence characterization of optimal permutations cannot capture the nuances of individual function classes, and can mistakenly indicate that one cannot do much better than random.
1 Introduction
Finite sum optimization seeks to solve the following:
$$\min_{\mathbf{x} \in \mathbb{R}^d} \; F(\mathbf{x}) := \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}). \qquad (1)$$
Stochastic gradient descent (SGD) approximately solves finite sum problems, by iteratively updating the optimization variables according to the following rule:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f_{i_t}(\mathbf{x}_t), \qquad (2)$$
where $\alpha$ is the step size and $i_t$ is the index of the function sampled at iteration $t$.
There exist various ways of sampling $i_t$, with the most common being with- and without-replacement sampling. In the former, $i_t$ is chosen uniformly at random from $\{1, \dots, n\}$, and in the latter, $i_t$ represents the $t$-th element of a random permutation of $\{1, \dots, n\}$. We henceforth refer to these two SGD variants as vanilla and permutation-based, respectively.
Although permutation-based SGD has been widely observed to perform better in practice (Bottou, 2009; Recht & Ré, 2012; 2013), the vanilla version has attracted the vast majority of theoretical analysis. This is because, at each iteration, the update is in expectation a scaled version of the true gradient, allowing for simple performance analyses of the algorithm, e.g., see (Bubeck et al., 2015).
Permutation-based SGD has resisted a tight analysis for a long time.
However, a recent line of breakthrough results provides the first tight convergence guarantees for several classes of convex functions (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020; Mishchenko et al., 2020; Ahn et al., 2020; Nguyen et al., 2020). These recent studies mainly focus on two variants of permutation-based SGD where (1) a new random permutation is sampled at each epoch (also known as Random Reshuffle) (Nagaraj et al., 2019; Safran & Shamir, 2019; Rajput et al., 2020), and (2) a random permutation is sampled once and is reused throughout all SGD epochs (Single Shuffle) (Safran & Shamir, 2019; Mishchenko et al., 2020; Ahn et al., 2020).
Perhaps interestingly, Random Reshuffle and Single Shuffle exhibit different convergence rates, and a performance gap that varies across function classes. In particular, when run for $K$ epochs, the convergence rate for strongly convex functions is $\tilde{O}\big(1/(nK^2)\big)$ for both Random Reshuffle and Single Shuffle (Nagaraj et al., 2019; Ahn et al., 2020; Mishchenko et al., 2020).
However, when run specifically on strongly convex quadratics, Random Reshuffle enjoys an accelerated rate, whereas Single Shuffle does not (Safran & Shamir, 2019; Rajput et al., 2020; Ahn et al., 2020; Mishchenko et al., 2020). All the above rates are accompanied by matching lower bounds, at least up to constants and sometimes log factors (Safran & Shamir, 2019; Rajput et al., 2020).
Table 1: Convergence rates of Random Reshuffle (RR), Single Shuffle (SS), and Incremental Gradient Descent (IGD) on strongly convex quadratics: plain vs. FlipFlop. Lower bounds for the "plain" versions are taken from (Safran & Shamir, 2019). When the training set is much larger than the number of epochs, which arguably is the case in practice, the convergence rates of Random Reshuffle, Single Shuffle, and Incremental Gradient Descent are , , and respectively.
On the other hand, by combining these methods with FlipFlop the convergence rates become faster, i.e., , , and , respectively.
From the above we observe that reshuffling at the beginning of every epoch may not always help.
But then there are cases where Random Reshuffle is faster than Single Shuffle, implying that certain ways of generating permutations are more suited for certain subfamilies of functions.
The goal of our paper is to take a first step into exploring the relationship between convergence rates and the particular choice of permutations. We are particularly interested in understanding whether random permutations are as good as optimal ones, or whether SGD can experience faster rates with carefully crafted permutations. As we see in the following, the answer to the above is not straightforward, and depends heavily on the function class at hand.
Our Contributions:
We define permutation-based SGD to be any variant of the iterates in (2) where the permutation of the functions used in each epoch can be generated, at the start of that epoch, deterministically, randomly, or with a combination of the two. For example, Single Shuffle, Random Reshuffle, and Incremental Gradient Descent (IGD) are all permutation-based SGD variants (see Algorithm 1).
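For concreteness, the following is a minimal Python sketch of this family (our own illustration of the structure of Algorithm 1, not the paper's pseudocode verbatim); the three classical variants differ only in the `permute` argument:

```python
import numpy as np

def permutation_based_sgd(grads, x0, lr, num_epochs, permute):
    """`grads[i]` returns the gradient of f_i; `permute(k, prev)` returns the
    permutation for epoch k (deterministic, random, or a mix of the two)."""
    x, perm = np.asarray(x0, dtype=float), None
    for k in range(num_epochs):
        perm = permute(k, perm)
        for i in perm:                  # one pass over all n components
            x = x - lr * grads[i](x)
    return x

# The variants differ only in how each epoch's permutation is produced:
n, rng = 10, np.random.default_rng(0)
igd              = lambda k, prev: np.arange(n)                         # fixed order
single_shuffle   = lambda k, prev: rng.permutation(n) if k == 0 else prev
random_reshuffle = lambda k, prev: rng.permutation(n)                   # fresh shuffle
```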
We first want to understand—even in the absence of computational constraints in picking the optimal permutations—what is the fastest rate one can get for permutation-based SGD? In other words, are there permutations that are better than random in the eyes of SGD?
Perhaps surprisingly, we show that there exist
permutations that may offer up to exponentially faster convergence than random permutations, but for a limited set of functions.
Specifically, we show this for 1-dimensional functions (Theorem 1).
However, such exponential improvement is no longer possible in higher dimensions (Theorem 2), or for general strongly convex objectives (Theorem 3), where random is optimal.
The above highlights that an analysis of how permutations affect convergence rates needs to be nuanced enough to account for the structure of the functions at hand. Otherwise, in the absence of further assumptions, random permutations may just appear to be as good as optimal.
In this work, we further identify a subfamily of convex functions for which there exist easy-to-generate permutations that lead to accelerated convergence.
We specifically introduce a new technique, FlipFlop, which can be used in conjunction with existing permutation-based methods, e.g., Random Reshuffle, Single Shuffle, or Incremental Gradient Descent, to provably improve their convergence on quadratic functions (Theorems 4, 5, and 6). The way that FlipFlop works is rather simple: every even epoch uses the flipped (or reversed) version of the previous epoch’s permutation. The intuition behind why FlipFlop leads to faster convergence is as follows.
Towards the end of an epoch, the contribution of earlier gradients gets attenuated.
To counter this, we flip the permutation for the next epoch so that every function’s contribution is diluted (approximately) equally over the course of two consecutive epochs.
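Concretely, the permutation schedule can be generated as below (a minimal sketch of our own; the function name and the `base` argument are illustrative, not from the paper):

```python
import numpy as np

def flipflop_permutations(n, num_epochs, base="single_shuffle", seed=0):
    """Yield the permutation used in each epoch. Odd-numbered epochs come from
    the base scheme (SS / RR / IGD); each even-numbered epoch replays the
    previous epoch's permutation in reverse."""
    rng = np.random.default_rng(seed)
    perm = np.arange(n) if base == "igd" else rng.permutation(n)
    for epoch in range(num_epochs):
        if epoch % 2 == 0:                       # a "fresh" epoch
            if base == "random_reshuffle" and epoch > 0:
                perm = rng.permutation(n)        # RR: reshuffle before each pair
            yield perm
        else:
            yield perm[::-1]                     # the flipped epoch
```

For Single Shuffle and Incremental Gradient Descent the base permutation is fixed, so the schedule simply alternates between one order and its reverse; for Random Reshuffle, a fresh permutation is drawn in every odd epoch and flipped in the following one.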
FlipFlop demonstrates that finding better permutations for specific classes of functions might be computationally easy.
We summarize FlipFlop’s convergence rates in Table 1 and report the results of numerical verification in Section 6.2.
Note that in this work, we focus on the dependence of the error on the number of iterations, and in particular, the number of epochs.
However, we acknowledge that its dependence on other parameters like the condition number is also very important.
We leave such analysis for future work.
Notation:
We use lowercase for scalars (e.g., $x$), lower boldface for vectors (e.g., $\mathbf{x}$), and upper boldface for matrices (e.g., $\mathbf{X}$).
2 Related Work
Gürbüzbalaban et al. (2019a; b) provided the first theoretical results establishing that Random Reshuffle and Incremental Gradient Descent (and hence Single Shuffle) were indeed faster than vanilla SGD, as they offered an asymptotic rate of for strongly convex functions, which beats the convergence rate of for vanilla SGD when . Shamir (2016) used techniques from online learning and transductive learning theory to prove an optimal convergence rate of for the first epoch of Random Reshuffle (and hence Single Shuffle). Later, Haochen & Sra (2019) also established a non-asymptotic convergence rate of , when the objective function is quadratic, or has smooth Hessian.
Nagaraj et al. (2019) used a very interesting iterate coupling based approach to give a new upper bound on the error rate of Random Reshuffle, thus proving for the first time that for general strongly convex smooth functions,
it converges faster than vanilla SGD in all regimes of $n$ and $K$.
This was followed by (Safran & Shamir, 2019), where the authors were able to establish the first lower bounds, in terms of both $n$ and $K$, for Random Reshuffle.
However, there was a gap in these upper and lower bounds. The gap in the convergence rates was closed by Rajput et al. (2020), who showed that the upper bound given by Nagaraj et al. (2019) and the lower bound given by Safran & Shamir (2019) were both tight up to logarithmic terms.
For Single Shuffle, Mishchenko et al. (2020) and Ahn et al. (2020) showed an upper bound of , which matched the lower bound given earlier by (Safran & Shamir, 2019), up to logarithmic terms. Ahn et al. (2020) and Mishchenko et al. (2020) also proved tight upper bounds for Random Reshuffle, with a simpler analysis and using more relaxed assumptions than (Nagaraj et al., 2019) and (Rajput et al., 2020). In particular, the results by Ahn et al. (2020) work under the PŁ condition and do not require individual component convexity.
Incremental Gradient Descent on strongly convex functions has also been studied well in literature (Nedić & Bertsekas, 2001; Bertsekas, 2011; Gürbüzbalaban et al., 2019a).
More recently, Nguyen et al. (2020) provide a unified analysis for all permutation-based algorithms. The dependence of their convergence rates on the number of epochs is also optimal for Incremental Gradient Descent, Single Shuffle and Random Reshuffle.
There has also been some recent work on the analysis of Random Reshuffle on non-strongly convex functions and non-convex functions. Specifically, Nguyen et al. (2020) and Mishchenko et al. (2020) show that even there, Random Reshuffle outperforms SGD under certain conditions. Mishchenko et al. (2020) show that Random Reshuffle and Single Shuffle beat vanilla SGD on non-strongly convex functions after epochs, and that Random Reshuffle is faster than vanilla SGD on non-convex objectives if the desired error is .
Speeding up convergence by combining without replacement sampling with other techniques like variance reduction (Shamir, 2016; Ying et al., 2020) and momentum (Tran et al., 2020) has also received some attention. In this work, we solely focus on the power of “good permutations” to achieve fast convergence.
3 Preliminaries
We will use combinations of the following assumptions:
Assumption 1 (Component convexity).
$f_i$'s are convex.
Assumption 2 (Component smoothness).
$f_i$'s are $L$-smooth, i.e., $\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$ for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$.
Note that Assumption 2 immediately implies that $F$ also has $L$-Lipschitz gradients: $\|\nabla F(\mathbf{x}) - \nabla F(\mathbf{y})\| \le L \|\mathbf{x} - \mathbf{y}\|$.
Assumption 3 (Strong convexity). $F$ is $\mu$-strongly convex, i.e., $F(\mathbf{x}) \ge F(\mathbf{y}) + \langle \nabla F(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2$ for all $\mathbf{x}, \mathbf{y}$.
We denote the condition number by $\kappa$, which is defined as $\kappa := L/\mu$.
It can easily be seen that $\kappa \ge 1$ always.
Let $\mathbf{x}^*$ denote the minimizer of Eq. (1), that is, $\mathbf{x}^* := \arg\min_{\mathbf{x} \in \mathbb{R}^d} F(\mathbf{x})$.
We will study permutation-based algorithms in the constant step size regime, that is, the step size is chosen at the beginning of the algorithm, and then remains fixed throughout.
We denote the iterate after the $i$-th iteration of the $k$-th epoch by $\mathbf{x}^k_i$. Hence, the initialization point is $\mathbf{x}^1_0$. Similarly, the permutation of $(1, \dots, n)$ used in the $k$-th epoch is denoted by $\sigma_k$, and its $i$-th ordered element is denoted by $\sigma_k(i)$. Note that if the ambient space is 1-dimensional, then we represent the iterates and the minimizer using non-bold characters, i.e., $x^k_i$ and $x^*$, to remain consistent with the notation.
In the following, due to lack of space, we only provide sketches of the full proofs, when possible.
The full proofs of the lemmas and theorems are provided in the Appendix.
4 Exponential Convergence in 1-Dimension
In this section, we show that there exist
permutations for Hessian-smooth 1-dimensional functions that lead to exponentially faster convergence compared to random.
Assumption 4 (Component Hessian-smoothness).
$f_i$'s have $L_H$-smooth second derivatives, that is, $|f_i''(x) - f_i''(y)| \le L_H |x - y|$ for all $x, y \in \mathbb{R}$.
We also define the following instance dependent constants: , , and .
Theorem 1.
Let Assumptions 1, 2, 3, and 4 hold.
Let and be as defined above.
If , then there exists a sequence of permutations such that using those permutations from any initialization point gives the error
where .
Figure 1: (A graphical depiction of Theorem 1's proof sketch.) Assume that the minimizer is at the origin. The proof of Theorem 1 first shows that there exists an initialization and a sequence of permutations such that, using those, we get to the exact minimizer. Let the sequence of iterates for this run be . Consider a parallel run, which uses the same sequence of permutations, but an arbitrary initialization point. Let this sequence be . The figure shows how converges to the exact optimum, and the distance between and decreases exponentially, leading to an exponential convergence for .
An important thing to notice in the theorem statement is that the sequence of permutations only depends on the function, and not on the initialization point . This implies that for any such function, there exists a sequence of permutations , which gives exponentially fast convergence regardless of the initialization. Note that the convergence rate is slower than Gradient Descent, for which the constant '' would be larger. However, here we are purely interested in the convergence rates of the best permutations and their (asymptotic) dependence on .
Proof sketch
The core idea is to establish that there exists an initialization point (close to the minimizer $x^*$) and a sequence of permutations such that starting from that point and using that sequence of permutations leads us exactly to the minimizer. Once we have proved this, we show that if two parallel runs of the optimization process are initialized from two different iterates, and they are coupled so that they use the exact same permutations, then they approach each other at an exponential rate. Thus, starting from any initialization point and using the same permutation sequence, SGD converges to the minimizer at an exponential rate. See Figure 1 for a graphical depiction of this sketch. We note that the figure is not the result of an actual run, but only serves to explain the proof sketch.
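The coupling step is easy to see numerically for quadratics, where each SGD step is affine, so the gap between two coupled runs contracts by exactly $(1 - \alpha a_i)$ per step, independently of which permutations are chosen. Below is a toy sketch (our own arbitrary two-component example, not the construction from the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = np.array([1.0, 3.0]), np.array([2.0, -2.0])  # f_i = a_i/2 x^2 + b_i x, sum(b)=0
lr = 0.05

def epoch(x, perm):
    for i in perm:
        x -= lr * (a[i] * x + b[i])
    return x

x, y = 5.0, -7.0                         # two arbitrary initializations
for _ in range(20):
    perm = rng.permutation(2)            # the SAME permutation for both runs
    x, y = epoch(x, perm), epoch(y, perm)
print(abs(x - y))                        # = 12 * (0.95 * 0.85)**20 ≈ 0.17
```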
5 Lower Bounds for Permutation-based SGD
The result in the previous section leads us to wonder if exponentially fast convergence can be also achieved in higher dimensions. Unfortunately, for higher dimensions, there exist strongly convex quadratic functions for which there does not exist any sequence of permutations that lead to an exponential convergence rate. This is formalized in the following theorem.
Theorem 2.
For any ( must be even), there exists a -dimensional strongly convex function which is the mean of convex quadratic functions, such that for every permutation-based algorithm with any step size,
This theorem shows that we cannot hope to develop constant step size algorithms that exhibit exponentially fast convergence in multiple dimensions.
Proof sketch
In this sketch, we explain a simpler version of the theorem, which works for $n = 2$ and two dimensions. Consider $f_1, f_2$ such that
Hence, $F = \frac{1}{2}(f_1 + f_2)$ has its minimizer at the origin.
Each epoch has two possible permutations, $(f_1, f_2)$ or $(f_2, f_1)$. Working out the details manually, it can be seen that regardless of the permutation, the sum of the coordinates keeps increasing. This can be used to get a lower bound on the error term; a hypothetical instance of this behavior is sketched below.
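To make this tangible, here is a hypothetical instance consistent with the sketch (our own choice of components; the paper's exact construction is in Appendix C). Both components are convex quadratics and $F$ is strongly convex with minimizer at the origin, yet the coordinate sum $s = x + y$ obeys $s \leftarrow (1 - \alpha)s + \alpha^2$ under either per-epoch order, so it increases toward $\alpha$ instead of staying at $0$:

```python
import numpy as np

# Hypothetical 2-D instance (our own, not the paper's):
#   f_1(x, y) = y^2/2 + (x - y),   f_2(x, y) = x^2/2 - (x - y),
# so F = (f_1 + f_2)/2 = (x^2 + y^2)/4 has its minimizer at the origin.
A = [np.diag([0.0, 1.0]), np.diag([1.0, 0.0])]
b = [np.array([1.0, -1.0]), np.array([-1.0, 1.0])]
lr = 0.1

def epoch(z, order):
    for i in order:
        z = z - lr * (A[i] @ z + b[i])
    return z

z = np.zeros(2)
for k in range(100):
    order = [0, 1] if k % 2 == 0 else [1, 0]   # either order adds lr^2 to the sum
    z = epoch(z, order)
print(z.sum())   # ≈ lr = 0.1, so the error ||z||^2 stalls at Ω(lr^2)
```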
Next, we show that even in one dimension, individual function convexity might be necessary to obtain faster rates than Random Reshuffle.
Theorem 3.
There exists a 1-dimensional strongly convex function $F$ which is the mean of two quadratic functions $f_1$ and $f_2$, such that one of the component functions is non-convex, and every permutation-based algorithm with constant step size gives an error of at least
Proof sketch
The idea behind the construction is to make one of the two component functions strongly concave. The farther the iterate is from that component's maximum, the more strongly the component pushes the iterate away. Hence, it essentially keeps increasing the deviation in each epoch, which leads to a slow convergence rate.
In the setting where the individual $f_i$ may be non-convex, Nguyen et al. (2020) and Ahn et al. (2020) show that Single Shuffle, Random Reshuffle, and Incremental Gradient Descent achieve the error rate of , for a fixed . In particular, their results only need the component functions to be smooth, and hence they apply to the function from Theorem 3. The theorem above essentially shows that when , this is the best possible error rate for any permutation-based algorithm, deterministic or random. Hence, at least for , the three algorithms are optimal when the component functions can possibly be non-convex. However, note that here we are only considering the dependence of the convergence rate on $K$. It is possible that these are not optimal if we further take into account the dependence of the convergence rate on both $n$ and $K$. Indeed, if we consider the dependence on $n$ as well, Incremental Gradient Descent has a convergence rate of (Safran & Shamir, 2019), whereas the other two have a convergence rate of (Ahn et al., 2020).
6 Flipping Permutations for Faster Convergence in Quadratics
In this section, we introduce a new algorithm, FlipFlop, which can improve the convergence rate of Single Shuffle, Random Reshuffle, and Incremental Gradient Descent on strongly convex quadratic functions.
The following theorem gives the convergence rate of FlipFlop with Single Shuffle:
Assumption 5 (Component quadratics).
$f_i$'s are quadratic.
Theorem 4.
If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Single Shuffle
for $K$ epochs, where $K$ is an even integer, with an appropriately chosen constant step size, gives the error
(4)
For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Single Shuffle:
(5)
Note that both the terms in Eq. (4) are smaller than the term in Eq. (5). In particular, when and is fixed as we vary , the RHS of Eq. (5) decays as , whereas the RHS of Eq. (4) decays as . Otherwise, when and is fixed as we vary , the RHS of Eq. (5) decays as , whereas the RHS of Eq. (4) decays as . Hence, in both the cases, FlipFlop with Single Shuffle outperforms Single Shuffle.
The next theorem shows that FlipFlop improves the convergence rate of Random Reshuffle:
Theorem 5.
If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Random Reshuffle
for $K$ epochs, where $K$ is an even integer, with an appropriately chosen constant step size, gives the error
For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Random Reshuffle:
Hence, we see that in the regime when , which happens when the number of components $n$ in the finite sum $F$ is much larger than the number of epochs $K$, FlipFlop with Random Reshuffle is much faster than vanilla Random Reshuffle.
Note that the theorems above do not contradict Theorem 2, because for a fixed , both the theorems above give a convergence rate of .
We also note here that the theorems above need the number of epochs to be much larger than , in which regime (full) Gradient Descent performs better than with- or without-replacement SGD, and hence GD should be preferred over SGD in that case. However, we think that this requirement on the number of epochs is a limitation of our analysis, rather than of the algorithms themselves.
Figure 2: Dependence of convergence rates on the number of epochs for quadratic functions. The figures show the median and inter-quartile range after 10 runs of each algorithm, with random initializations and random permutation seeds (note that SS and IGD exhibit extremely small variance). We set , so that and hence the higher order terms of dominate the convergence rates. Note that both the axes are in logarithmic scale.
Finally, the next theorem shows that FlipFlop improves the convergence rate of Incremental Gradient Descent.
Theorem 6.
If Assumptions 1, 2, 3, and 5 hold, then running FlipFlop with Incremental GD
for $K$ epochs, where $K$ is an even integer, with an appropriately chosen constant step size, gives the error
For comparison, Safran & Shamir (2019) give the following lower bound on the convergence rate of vanilla Incremental Gradient Descent:
In the next subsection, we give a short sketch of the proof of these theorems.
6.1 Proof sketch
In the proof sketch, we consider scalar quadratic functions; the same intuition carries over to multi-dimensional quadratics, but requires a more involved analysis. Let $f_i(x) = \frac{a_i}{2}x^2 + b_i x$. Assume that $F$ has its minimizer at $0$. This assumption is valid because it can be achieved by a simple translation of the origin (see (Safran & Shamir, 2019) for a more detailed explanation). This implies that $F'(0) = \frac{1}{n}\sum_{i=1}^n b_i = 0$.
For the sake of this sketch, assume $x^1_0 = 0$, that is, we are starting at the minimizer itself. Further, without loss of generality, assume that $\sigma_1 = (1, 2, \dots, n)$. Then, for the last iteration of the first epoch,
$$x^1_n = x^1_{n-1} - \alpha f_n'(x^1_{n-1}) = (1 - \alpha a_n)\, x^1_{n-1} - \alpha b_n.$$
Applying this to all iterations of the first epoch, we get
$$x^1_n = \prod_{i=1}^{n} (1 - \alpha a_i)\, x^1_0 - \alpha \sum_{i=1}^{n} \Big( \prod_{j=i+1}^{n} (1 - \alpha a_j) \Big) b_i. \qquad (6)$$
Substituting $x^1_0 = 0$, we get
$$x^1_n = -\alpha \sum_{i=1}^{n} \Big( \prod_{j=i+1}^{n} (1 - \alpha a_j) \Big) b_i. \qquad (7)$$
Note that the sum above is not weighted uniformly: $b_1$ is multiplied by $\prod_{j=2}^{n}(1 - \alpha a_j)$, whereas $b_n$ is multiplied by $1$. Because each factor $(1 - \alpha a_j) < 1$, we see that $b_1$'s weight is much smaller than $b_n$'s. If the weights were all $1$, then we would get $x^1_n = -\alpha \sum_{i=1}^n b_i = 0$, i.e., we would not move away from the minimizer. Since we want to stay close to the minimizer, we want the weights of all the $b_i$ to be roughly equal.
The idea behind FlipFlop is to add the complementary bias, roughly $-\alpha \sum_{i=1}^{n} \big( \prod_{j=1}^{i-1}(1 - \alpha a_j) \big) b_i$, in the next epoch, to counteract the bias in Eq. (7). To achieve this, we simply take the permutation that the algorithm used in the previous epoch and flip it for the next epoch. Roughly speaking, in the next epoch $b_1$ will get multiplied by $1$, whereas $b_n$ will get multiplied by $\prod_{j=1}^{n-1}(1 - \alpha a_j)$. Thus, over two epochs, both get scaled approximately the same.
To see this more concretely, we look at the first-order approximation of Eq. (7):
$$x^1_n \approx -\alpha \sum_{i=1}^{n} \Big( 1 - \alpha \sum_{j=i+1}^{n} a_j \Big) b_i = \alpha^2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_j b_i, \qquad (8)$$
where in the last step above we used the fact that $\sum_{i=1}^{n} b_i = 0$.
Now, let us see what happens in the second epoch if we use FlipFlop. Doing a recursion analogous to how we got Eq. (6), but reversing the order of the functions, we get:
$$x^2_n = \prod_{i=1}^{n} (1 - \alpha a_i)\, x^2_0 - \alpha \sum_{i=1}^{n} \Big( \prod_{j=1}^{i-1} (1 - \alpha a_j) \Big) b_i. \qquad (9)$$
Note that the product in the second term in the equation above is almost complementary to the product in Eq. (7): there, $b_i$ was weighted by the factors with indices $j > i$, whereas here it is weighted by the factors with indices $j < i$. This is because we flipped the order in the second epoch. This will play an important part in cancelling out much of the bias in Eq. (7).
Continuing on from Eq. (9), we again do a linear approximation similar to before, substitute Eq. (8) for $x^2_0 = x^1_n$, and use the fact that $\sum_{i=1}^{n} b_i = 0$:
$$x^2_n \approx \Big( 1 - \alpha \sum_{i=1}^{n} a_i \Big) x^1_n + \alpha^2 \sum_{i=1}^{n} \sum_{j=1}^{i-1} a_j b_i.$$
We assume that $\alpha$ is small and hence we only focus on the quadratic terms:
$$x^2_n \approx \alpha^2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_j b_i + \alpha^2 \sum_{i=1}^{n} \sum_{j=1}^{i-1} a_j b_i = \alpha^2 \sum_{i \neq j} a_j b_i.$$
Now, since $\sum_{i=1}^{n} b_i = 0$, we get
$$x^2_n \approx \alpha^2 \Big( \sum_{j=1}^{n} a_j \Big) \Big( \sum_{i=1}^{n} b_i \Big) - \alpha^2 \sum_{i=1}^{n} a_i b_i = -\alpha^2 \sum_{i=1}^{n} a_i b_i. \qquad (10)$$
Now, comparing the coefficients of the $\alpha^2$ terms in Eq. (8) and Eq. (10), we see that the former has $\Theta(n^2)$ terms whereas the latter has only $n$ terms. This correction of the error is exactly what eventually manifests into the faster convergence rate of FlipFlop.
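The cancellation above is easy to verify numerically. The following minimal sketch (our own; $n$, $\alpha$, and the random $a_i, b_i$ are arbitrary choices) runs two epochs from the minimizer with and without flipping, and compares the result against the prediction of Eq. (10):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lr = 50, 1e-4
a = rng.uniform(0.5, 1.5, n)            # f_i(x) = a_i/2 * x^2 + b_i * x
b = rng.normal(size=n)
b -= b.mean()                           # sum(b) = 0, so F has its minimizer at 0

def epoch(x, order):
    for i in order:
        x -= lr * (a[i] * x + b[i])     # one SGD step on f_i
    return x

perm = rng.permutation(n)
same = epoch(epoch(0.0, perm), perm)        # same order twice (Single Shuffle)
flip = epoch(epoch(0.0, perm), perm[::-1])  # FlipFlop: second epoch reversed
print(abs(same), abs(flip))                 # flipping leaves a much smaller bias
print(flip, -lr**2 * np.sum(a * b))         # ≈ Eq. (10), up to O(lr^3) terms
```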
The main reason that the analysis for multidimensional quadratics is not as simple as the 1-dimensional analysis done above is that, unlike scalar multiplication, matrix multiplication is not commutative, and the AM-GM inequality is not true in higher dimensions (Lai & Lim, 2020; De Sa, 2020). One way to bypass this inequality is by using the following inequality for small enough $\alpha$:
$$\Big\| \prod_{i=1}^{n} (\mathbf{I} - \alpha \mathbf{A}_i) \Big\| \le 1 - \frac{\alpha \mu n}{2}.$$
Ahn et al. (2020) proved a stochastic version of this (see Lemma 6 in their paper). We prove a deterministic version in Lemma 3 (in the Appendix).
6.2 Numerical Verification
We verify the theorems numerically by running Random Reshuffle, Single Shuffle, and their FlipFlop versions on the task of mean computation. We randomly sample $n$ points from a $d$-dimensional sphere. Let the points be $\mathbf{a}_i$ for $i \in \{1, \dots, n\}$. Then, their mean is the solution to the following quadratic problem: $\min_{\mathbf{x}} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \|\mathbf{x} - \mathbf{a}_i\|^2$.
We solve this problem by using the given algorithms.
The results are reported in Figure 2. The results are plotted on a log-log graph, so that we can read off the dependence of the error on the power of $K$.
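For reference, a minimal sketch of this experiment is given below (our own reimplementation of the setup; the actual $n$, $d$, $K$, and step size used for Figure 2 are not recoverable from the text, so the values here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 200, 10, 50                          # placeholder sizes; K even
lr = 0.5 / n                                   # assumed small constant step size
a = rng.normal(size=(n, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)  # n points on the unit sphere
x_star = a.mean(axis=0)                        # minimizer of (1/n) sum ||x - a_i||^2 / 2

def run(flip):
    x = np.zeros(d)
    perm = np.random.default_rng(1).permutation(n)   # same SS base for both runs
    for k in range(K):
        order = perm[::-1] if (flip and k % 2 == 1) else perm
        for i in order:
            x -= lr * (x - a[i])               # gradient of f_i(x) = ||x - a_i||^2 / 2
    return np.linalg.norm(x - x_star) ** 2

print("SS:", run(False), "  FlipFlop+SS:", run(True))
```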
Note that since the points are sampled randomly, Incremental Gradient Descent essentially becomes Single Shuffle. Hence, to verify Theorem 6, we need 'hard' instances for Incremental Gradient Descent, and in particular we use the ones from Theorem 3 in (Safran & Shamir, 2019). These results are also reported on a log-log graph in Figure 2.
We also tried FlipFlop in the training of deep neural networks, but unfortunately we did not see a big speedup there. We also ran experiments on logistic regression for 1-dimensional artificial data, the results for which are in Appendix H. The code for all the experiments can be found at https://github.com/shashankrajput/flipflop .
7 Conclusion and Future Work
In this paper, we explore the theoretical limits of permutation-based SGD for solving finite sum optimization problems.
We focus on the power of good, carefully designed permutations and whether they can lead to a much better convergence rate than random.
We prove that for 1-dimensional strongly convex functions, good sequences of permutations indeed exist, and they lead to convergence that is exponentially faster than with random permutations.
We also show that unfortunately, this is not true for higher dimensions, and that for general strongly convex functions, random permutations might be optimal.
However, we think that for some subfamilies of strongly convex functions, good permutations might exist and may be easy to generate.
Towards that end, we introduce a very simple technique, FlipFlop, to generate permutations that lead to faster convergence on strongly convex quadratics. This is a black-box technique, that is, it does not look at the optimization problem to come up with the permutations, and it can be implemented easily. It serves as an example that for other classes of functions, there may exist other techniques for coming up with good permutations.
Finally, note that we only consider constant step sizes in this work for both upper and lower bounds.
Exploring regimes in which the step size changes, e.g., diminishing step sizes, is a very interesting open problem, which we leave for future work.
We think that the upper and lower bounds in this paper give some important insights and can help in the development of better algorithms or heuristics.
We strongly believe that under nice distributional assumptions on the component functions, there can exist good heuristics to generate good permutations, and this should also be investigated in future work.
References
Ahn et al. (2020)
Kwangjun Ahn, Chulhee Yun, and Suvrit Sra.
SGD with shuffling: Optimal rates without component convexity and
large epoch requirements, 2020.
Bertsekas (2011)
Dimitri P Bertsekas.
Incremental gradient, subgradient, and proximal methods for convex
optimization: A survey.
Optimization for Machine Learning, 2010(1-38):3, 2011.
Bottou (2009)
Léon Bottou.
Curiously fast convergence of some stochastic gradient descent
algorithms.
Unpublished open problem offered to the attendance of the SLDS 2009
conference, 2009.
URL
http://leon.bottou.org/papers/bottou-slds-open-problem-2009.
Bubeck et al. (2015)
Sébastien Bubeck et al.
Convex optimization: Algorithms and complexity.
Foundations and Trends® in Machine Learning,
8(3-4):231–357, 2015.
De Sa (2020)
Christopher M De Sa.
Random reshuffling is not always better.
Advances in Neural Information Processing Systems, 33, 2020.
Gürbüzbalaban et al. (2019a)
M Gürbüzbalaban, Asuman Ozdaglar, and Pablo A Parrilo.
Convergence rate of incremental gradient and incremental newton
methods.
SIAM Journal on Optimization, 29(4):2542–2565, 2019a.
Gürbüzbalaban et al. (2019b)
Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo A Parrilo.
Why random reshuffling beats stochastic gradient descent.
Mathematical Programming, pp. 1–36, 2019b.
Haochen & Sra (2019)
Jeff Haochen and Suvrit Sra.
Random shuffling beats SGD after finite epochs.
In International Conference on Machine Learning, pp. 2624–2633, 2019.
Lai & Lim (2020)
Zehua Lai and Lek-Heng Lim.
Recht-Ré noncommutative arithmetic-geometric mean conjecture is
false.
In International Conference on Machine Learning, pp. 5608–5617. PMLR, 2020.
Mishchenko et al. (2020)
Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik.
Random reshuffling: Simple analysis with vast improvements.
ArXiv, abs/2006.05988, 2020.
Nagaraj et al. (2019)
Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli.
SGD without replacement: Sharper rates for general smooth convex
functions.
In International Conference on Machine Learning, pp. 4703–4711, 2019.
Nedić & Bertsekas (2001)
Angelia Nedić and Dimitri Bertsekas.
Convergence rate of incremental subgradient algorithms.
In Stochastic optimization: algorithms and applications, pp. 223–264. Springer, 2001.
Nesterov (2004)
Yurii Nesterov.
Introductory Lectures on Convex Optimization - A Basic
Course, volume 87 of Applied Optimization.
Springer, 2004.
ISBN 978-1-4613-4691-3.
doi: 10.1007/978-1-4419-8853-9.
URL https://doi.org/10.1007/978-1-4419-8853-9.
Nguyen et al. (2020)
Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten van
Dijk.
A unified convergence analysis for shuffling-type gradient methods.
arXiv preprint arXiv:2002.08246, 2020.
Nocedal & Wright (2006)
Jorge Nocedal and Stephen Wright.
Numerical optimization.
Springer Science & Business Media, 2006.
Rajput et al. (2020)
Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos.
Closing the convergence gap of sgd without replacement.
In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.
Recht & Ré (2012)
Benjamin Recht and Christopher Ré.
Beneath the valley of the noncommutative arithmetic-geometric mean
inequality: conjectures, case-studies, and consequences.
arXiv preprint arXiv:1202.4184, 2012.
Recht & Ré (2013)
Benjamin Recht and Christopher Ré.
Parallel stochastic gradient algorithms for large-scale matrix
completion.
Mathematical Programming Computation, 5(2):201–226, 2013.
Safran & Shamir (2019)
Itay Safran and Ohad Shamir.
How good is SGD with random shuffling?, 2019.
Safran & Shamir (2021)
Itay Safran and Ohad Shamir.
Random shuffling beats sgd only after many epochs on ill-conditioned
problems.
arXiv preprint arXiv:2106.06880, 2021.
Schneider (2016)
M. Schneider.
Probability inequalities for kernel embeddings in sampling without
replacement.
In AISTATS, 2016.
Shamir (2016)
Ohad Shamir.
Without-replacement sampling for stochastic gradient methods.
In Proceedings of the 30th International Conference on Neural
Information Processing Systems, pp. 46–54, 2016.
Tran et al. (2020)
Trang H Tran, Lam M Nguyen, and Quoc Tran-Dinh.
Shuffling gradient-based methods with momentum.
arXiv preprint arXiv:2011.11884, 2020.
Ying et al. (2020)
Bicheng Ying, Kun Yuan, and Ali H Sayed.
Variance-reduced stochastic learning under random reshuffling.
IEEE Transactions on Signal Processing, 68:1390–1408, 2020.
Appendix A Discussion and Future Work
As the first paper (to the best of our knowledge) to theoretically analyze the optimality of random permutations, we limited our scope to a specific but common theoretical setting: strongly convex functions with constant step size. We think that future work can generalize the results of this paper to settings with non-convexity and variable step sizes, as well as to techniques like variance reduction, momentum, etc.
A.1 Lower bounds for variable step sizes
All the existing lower bounds (to the best of our knowledge) work in the constant step size regime (Safran & Shamir (2019); Rajput et al. (2020); Safran & Shamir (2021)). Thus, generalizing the lower bounds to variable step size algorithms would be a very important direction for future research.
However, proving lower bounds can be tricky when the step sizes are not constant, since the step size could potentially depend on the permutation and the current iterate. A more tractable setting for proving lower bounds could be the case when the step sizes follow a schedule over epochs, similar to what happens in practice.
A.2 Faster permutations for non-quadratic objectives
The analysis of FlipFlop leverages the fact that the Hessians of quadratic functions are constant. We think that the analysis of FlipFlop might be extended to strongly convex functions or even PŁ functions (which are non-convex in general), under some assumptions on the Lipschitz continuity of the Hessians, similar to how Haochen & Sra (2019) extended their analysis of quadratic functions to more general classes. A key take-away from FlipFlop is that we had to understand how permutation based SGD works specifically for quadratic functions, that is, we did a white-box analysis. In general, we feel that depending on the specific class of non-convex functions (say deep neural networks), practitioners would have to think about permutation-based SGD in a white-box fashion, to come up with better heuristics for shuffling.
In concurrent work (https://openreview.net/pdf?id=7gWSJrP3opB), it is shown that by greedily sorting stale gradients, one can find a permutation order that converges faster on some deep learning tasks. Hence, there do exist permutations better than random, even for deep learning tasks.
A.3 FlipFlop on Random Coordinate Descent
A shuffling scheme similar to FlipFlop has been used in random coordinate descent for faster practical convergence (see page 231 in Nocedal & Wright (2006)). This should be further investigated empirically and theoretically in future work.
Even though the current analysis of FlipFlop does not directly go through for random coordinate descent, we think the analysis can be adapted to work.
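As a toy illustration of such a scheme (our own sketch on a synthetic positive-definite quadratic, using exact coordinate minimization; this is not an analysis from the references above):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)                 # SPD quadratic: minimize x'Ax/2 - b'x
b = rng.normal(size=d)
x = np.zeros(d)
for sweep in range(100):
    # alternate forward and backward coordinate orders, FlipFlop-style
    order = range(d) if sweep % 2 == 0 else range(d - 1, -1, -1)
    for j in order:
        # exact minimization over coordinate j (Gauss-Seidel step)
        x[j] = (b[j] - A[j] @ x + A[j, j] * x[j]) / A[j, j]
print(np.linalg.norm(A @ x - b))        # residual after the alternating sweeps
```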
Appendix B Proof of Theorem 1
In certain places in the proof, we would need that . To see how this is satisfied, note that we have assumed that in the theorem statement. Using the inequality in gives that .
In this proof, we assume that the minimizer of is at to make the analysis simpler. This assumption can be satisfied by simply translating the origin to the minimizer (Safran & Shamir, 2019).
There are three main components in the proof:
1.
Say that an epoch starts off at the minimizer. We prove that there exists at least one pair of permutations such that if we do two separate runs of the epoch, the first using the first permutation and the second using the second, then at the end of that epoch, the iterates corresponding to the two runs end up on opposite sides of the minimizer.
2.
There exists a sequence of permutations and a point in the neighborhood of the minimizer, such that intializing at that point and using these sequence of permutations, we converge exactly to the minimizer.
3.
Starting from any other point, we can couple the iterates with the iterates which were shown in the previous component, to get that these two sequences of iterates come close to each other exponentially fast.
We prove the first and second components in the Subsections B.1 and B.2; and conclude the proof in Subsection B.3 where we also prove the third component.
B.1 Permutations in one epoch
In this subsection, we prove that if are the iterates in an epoch such that , then there exists a permutation of functions such that . By the same logic, we show that
there exists a permutation of functions such that .
These will give us control over movement of iterates across epochs.
Order the gradients at the minimizer, , in decreasing order. WLOG assume that it is just . We claim that this permutation leads to .
We will need the following intermediate result. Let be such that . Assume and . Then,
(11)
that is, .
Because is the minimizer, we know that . Also, recall that .
There can be two cases:
1.
. This cannot be true because if , then
where we used the fact that if and is convex.
2.
Thus, . Now, consider the sequence such that and for , . Then because and , we get that for all (Using Ineq. (11)).
Hence, there is an such that , and further for all . Next, we repeat the process above for . That is, there can be two cases:
1.
. This cannot be true because if , then
2.
Thus, . Now, consider the sequence such that and for , . Then because and , we get that for all (Using Ineq. (11)).
Hence, there exists an integer such that . We have already proved that . Thus, we have that . We can continue repeating this process (apply the same two cases above for , and so on), to get that . We define to be this non-negative value of . Note that the following lemma gives us that the gradients are bounded by
Lemma 1.
Define and . If Assumptions 2 and 3 hold, and , then for any permutation-based algorithm (deterministic or random), we have
Because the gradients are bounded by , we get that .
Similarly, we can show that the reverse permutation leads to . We define to be this non-positive value of .
Because we have assumed that the gradients are bounded by , we get that .
B.2 Exact convergence to the minimizer
In this section, we show that there exists a point such that if we initialize there and follow a specific permutation sequence, we land exactly at the minimizer.
In particular, we will show the following: There exists a point in such that if we initialize there and follow a specific permutation sequence, then we land exactly at the minimizer.
We show this recursively. We will prove that there exists a point such that if the last epoch begins there, that is , then we land at the minimizer at the end of the last epoch, that is, . Then, we will show that there exists a point such that if the , then . Repeating this for , we get that there exists a point such that if we initialize the first epoch there, that is , then there is a permutation sequence such that ultimately .
We prove that any point can be reached at the end of an epoch by beginning the epoch at some point , that is if , then .
•
Case: . In this case, we show that . We have proved in the previous subsection that there exists a permutation such that if then .
Next, we have the following helpful lemma that we will also use later.
Lemma 2.
Let be a sequence of iterates in an epoch and be another sequence of iterates in an epoch such that both use the same permutation of functions. If , then
If we set in Lemma 2 and we follow the permutation , then we get that
(Since using and results in .)
where we used the fact that is the last step.
Thus, if and we follow the permutation , then
.
Next, note that
Looking above, we see that is a continuous function of ; is a continuous function of ; and so on.
Thus, using the fact that composition of continuous functions is continuous, we get that is also a continuous function of .
We have shown that if , then and if , then . Thus, using the fact that that is a continuous function of , we get that for any point , there is at least one point , such that leads to .
•
Case: . We can apply the same logic as above to show that there is at least one point , such that .
•
Case: . WLOG assume that . Let be the permutation such that if and the epoch uses this permutation, then we end up at .
If we set in Lemma 2 and we follow the permutation , then we get that
(Since using and results in .)
where we used the fact that is the last step.
Thus, if and we follow the permutation , then
.
Thus, using similar argument of continuity as the first case, we know that there is a point , such that leads to when we use the permutation .
B.3 Same sequence permutations get closer
In the previous subsection, we have shown that there exists a point and a sequence of permutations such that if and epoch uses permutation , then . In this subsection, we will show that if is initialized at any other point such that , then using the same permutations gives us that . For this we will repeatedly apply Lemma 2 on all the epochs. Assume that with .
Let be the sequence of iterates such that and uses the permutation sequence . Then, we know that . Let be the sequence of iterates such that and uses the same permutation sequence .
Then, using Lemma 2 gives us that . Thus, we get that
. Again applying Lemma 2 gives us that . Therefore, after applying it times, we get
Note that , and , with .
We showed earlier in Subsection B.1 that and . Therefore,
Further, we know that . Thus,
Substituting the value of completes the proof. Next we prove the lemmas used in this proof.
Define and . If Assumptions 2 and 3 hold, and , then for any permutation-based algorithm (deterministic or random), we have
This lemma says that for any permutation-based algorithm, the domain of iterates and the norm of the gradient stays bounded during the optimization. This means that we can assume bounds on norm of iterates and gradients, which is not true in general for unconstrained SGD. This makes analyzing such algorithms much easier, and hence this lemma can be of independent interest for proving future results for permutation-based algorithms.
Remark 1.
This lemma does not hold in general for vanilla SGD where sampling is done with replacement. Consider the example with two functions , and ; and . This satisfies Assumptions 2 and 3, but one may choose consecutively for arbitrary many iterations, which will lead the iterates a proportional distance away from the minimizer. This kind of situation can never happen for permutation-based SGD because we see every function exactly once in every epoch and hence no particular function can attract the iterates too much towards its minimizer, and by the end of the epochs most of the noise gets cancelled out.
Note that the requirement is stricter than the usual requirement , but we believe the lemma should also hold in that regime. For the current paper, however, this stronger requirement suffices.
Proof.
To prove this lemma, we show two facts: If for some epoch , , then a) and b) .
To see how a) and b) are sufficient to prove the lemma, assume that they are true.
Then, since the first epoch begins inside the bounded region , we see using b) that every subsequent epoch begins inside the same bounded region, that is as well.
Hence using a) we get that during these epochs, the iterates satisfy , which is the first part of the lemma. Further, this bound together with the gradient Lipshitzness directly gives the upper bound on the gradients.
Thus, all we need to do to prove this lemma is to prove a) and b), which we do next.
We will prove a) and b) for . Once we do this, using will give us the exact statement of the lemma.
Let for some epoch . Then, we try to find the minimum number of iterations needed so that . Within this region, the gradient is bounded by . Thus, the minimum number of iterations needed are . However,
(Using the fact that )
Thus, the minimum number of iterations needed to go outside the bound is more than the length of the epoch. This implies that within the epoch, , which proves a).
We prove b) next:
Note that is just the sum of all component gradients at , that is . Using this, we get
(12)
where we used gradient Lipschitzness (Assumption 2) in the last step.
To bound the first term above, we use the standard analysis of gradient descent on smooth, strongly convex functions as follows
where in the last step we used that since .
Substituting this inequality in Ineq. (12), we get
We have already proven a) that says that the iterates satisfy . Using gradient Lipschitzness, this implies that the gradient norms stay bounded by . Hence, . Using this,
Appendix C Proof of Theorem 2
To prove this theorem, we consider three step-size ranges and do a case-by-case analysis for each of them. For each range, we construct functions such that the convergence of any permutation-based algorithm is "slow" on the corresponding step-size range. The final lower bound is the minimum among the lower bounds obtained for the three regimes.
In this proof, we will use different notation from the rest of the paper because we work with scalars, and hence the superscript will denote the scalar power, and not the epoch number. We will use to denote the -th iterate of the -th epoch.
We will construct three functions , , and , each the mean of component functions, such that
•
Any permutation-based algorithm on with and initialization results in
will be an -dimensional function, that is . This function will have minimizer at . NOTE: This is the ‘key’ step-size range, and the proof sketch explained in the main paper corresponds to this function’s construction.
•
Any permutation-based algorithm on with and initialization results in
will be an -dimensional function, that is . This function will have minimizer at . The construction for this is also inspired by the construction for , but this is constructed significantly differently due to the different step-size range.
•
Any permutation-based algorithm on with and initialization results in
will be an -dimensional function, that is . This function will have minimizer at .
Then, the -dimensional function will show bad convergence in any step-size regime. This function will have minimizer at . Furthermore,
that is, , , , and are all -strongly convex and -smooth.
In the subsequent subsections, we prove the lower bounds for , , and separately.
NOTE: In Appendix C.4, we have discussed how the lower bound is actually invariant to initialization.
C.1 Lower bound for ,
We will work in -dimensions ( is even) and represent a vector in the space as . These and are not related to the vectors used by and later, we only use and to make the proof for this part easier to understand.
We start off by defining the component functions:
For , define
Thus, . This function has minimizer at .
Let denote at the -th iteration in -th epoch. We initialize at .
For the majority of this proof, we will work inside a given epoch, so we will skip the subscript denoting the epoch. We denote the -th iterate within the epoch as and the coordinates as
In the current epoch, let be the permutation of used.
For any , let and be indices such that and . Let us consider the case that (the case when will be analogous). Then, it can be seen that
(14)
Hence,
In the other case, when , we will get the same inequality. Let and denote the value of and at the -th epoch. Then, recalling that was initialized to , we use the inequality above to get
(15)
Since this inequality is valid for all , we get that
(16)
Note that if and , then .
Using this in (16), we get
We consider two cases:
1.
Case A, : It can be verified that is an increasing function of when . Noting that we are working in the range when , then
2.
Case B, : In this case,
C.2 Lower bound for ,
We will work in -dimensions and represent a vector in the space as .
We start off by defining the component functions:
For , define
Thus, . This function has minimizer at .
Let denote at the -th iteration in -th epoch. We initialize at . For the majority of this proof, we will work inside a given epoch. We denote the -th iterate within the epoch as and the coordinates as
In the current epoch, let be the permutation of used. Let be the index such that , that is, is the last element of the permutation . Then at the end of the epoch,
(17)
For some , let be the integer such that , that is is the -th element in the permutation .
Then for any and any epoch,
Then,
Note that the above is independent of , and hence applicable for all epochs. Applying it recursively and noting that we initialized , we get
Note that is just from the previous epoch. Hence we can substitute the inequality above for in (17). Thus,
This gives us that for any .
C.3 Lower bound for ,
Consider the function , where for all . Note that the gradient of any function at is just . Hence, regardless of the permutation, if we start the shuffling based gradient descent method at , we get that
In the case when , we see that
for . Finally, in the case when , we see that
C.4 Discussion about initialization
The lower bound partitions the step size into three ranges:
•
In the step size ranges and , the initializations are done at the minimizer and it is shown that any permutation-based algorithm will still move away from the minimizer. The choice of initializing at the minimizer was solely for the convenience of analysis and calculations, and the proof should work for any other initialization as well.
Furthermore, the effect of initializing at any arbitrary (non-zero) point will decay exponentially fast with epochs anyway. To see how, note that every epoch can be treated as steps of full gradient descent and some noise, and hence the full gradient descent part will essentially keep decreasing the effect of initialization exponentially, and what we would be left with is the noise in each epoch. Thus, it was more convenient for us to just assume initialization at the minimizer and only focus on the noise in each epoch.
•
The step size range can be divided into two parts, and .
For the range , we essentially show that the step size is too small to make any meaningful progress towards the minimizer. Hence, instead of initializing at , initializing at any other arbitrary (non-zero) point will also give the same slow convergence rate.
For the range , we show that the optimization algorithm will in fact diverge, since the step size is too large. Hence, even here, any other arbitrary (non-zero) initialization point will also lead to divergence.
Appendix D Proof of Theorem 3
Define and . Thus, . This function has minimizer at . For this proof, we will use the convention that is the iterate after the -th iteration of the -th epoch. Further, the number in the superscript will be the scalar power. For example .
Initialize at . Then at epoch , there are two possible permutations: and . In the first case , , we get that after the first iteration of the epoch,
Continuing on, in the second iteration, we get
Note that . Thus, .
Similarly, for the other possible permutation, , we get . Thus, regardless of what permutation we use, we get that
Hence, recalling that we initialized at , we get
(18)
Now, if , then
and hence, . Otherwise, if , then continuing on from (18),
Appendix E Proof of Theorem 4
Let be such that its minimizer is at the origin.
This can be assumed without loss of generality because we can shift the coordinates appropriately, similar to (Safran & Shamir, 2019).
Since the are convex quadratics, we can write them as , where are symmetric, positive-semidefinite matrices.
We can omit the constants because they do not affect the minimizer or the gradients.
Because we assume that the minimizer of is at the origin, we get that
(19)
Let be the random permutation of generated at the beginning of the algorithm.
Then for , epoch sees the functions in the following sequence:
whereas epoch sees the functions in the reverse sequence:
We define and for convenience of notation.
We start off by computing the progress made during an even indexed epoch. Since the even epochs use the reverse permutation, we get
( is used at the last iteration of even epochs.)
We recursively apply the same procedure as above to the whole epoch to get the following
(20)
where the product of matrices is defined as if and otherwise. Similar to Eq. (20), we can compute the progress made during an odd indexed epoch. Recall that the only difference will be that the odd indexed epochs see the permutations in the order instead of . After doing the computation, we get the following equation:
Combining the results above, we can get the total progress made after the pair of epoch and :
(21)
In the sum above, the first term will have an exponential decay, hence we need to control the next two terms.
We denote the sum of the terms as (see the definition below) and we will control its norm later in this proof.
To see where the iterates end up after epochs, we simply set in Eq. 21 and then keep applying the equation recursively to preceding epochs. Then, we get
Taking squared norms and expectations on both sides, we get
(Since )
We assumed that the functions have -Lipschitz gradients (Assumption 2). This translates to having maximum eigenvalue less than . Hence, if , we get that is positive semi-definite with maximum eigenvalue bounded by 1. Hence, . Using this and the fact that for matrices and , , we get that
We handle the two terms above separately. For the first term, we have the following bound:
Lemma 3.
If , then
Note that .
We also have the following lemma that bounds the expected squared norm of .
Lemma 4.
If , then
where , and the expectation is taken over the randomness of .
Appendix F Proof of Theorem 5
The proof is similar to that of Theorem 4, except that here we leverage the independence of the random permutations in every other epoch. The setup is also the same as in Theorem 4, but we explain it again here for the completeness of this proof.
Let be such that its minimizer is at the origin.
This can be assumed without loss of generality because we can shift the coordinates appropriately, similar to (Safran & Shamir, 2019).
Since the are convex quadratics, we can write them as , where are symmetric, positive-semidefinite matrices.
We can omit the constants because they do not affect the minimizer or the gradients.
Because we assume that the minimizer of is at the origin, we get that
(28)
Let be the random permutation of sampled in epoch .
Then epoch sees the functions in the following sequence:
whereas epoch sees the functions in the reverse sequence:
We define and for convenience of notation. We start off by computing the progress made during an even indexed epoch. Since the even epochs use the reverse permutation of , we get
( used at the last iteration of epoch .)
We recursively apply the same procedure as above to the whole epoch to get the following
(29)
where the product of matrices is defined as if and otherwise. Similar to Eq. (29), we can compute the progress made during an odd indexed epoch. Recall that the only difference will be that the odd indexed epochs see the permutations in the order instead of . After doing the computation, we get the following equation:
Combining the results above, we can get the total progress made after the pair of epoch and :
(30)
In the sum above, the first term will have an exponential decay, hence we need to control the next two terms.
Similar to Theorem 4, we denote the sum of the terms as :
To see where the iterates end up after epochs, we simply set in Eq. 30 and then keep applying the equation recursively to preceding epochs. Then, we get
Taking squared norms and expectations on both sides, we get
(31)
where we used the fact that . Next, we expand the second term above to get
(32)
We handle each of the three terms separately. Let .
The first term can be written as:
where, in the last line, we used the fact that the permutations in every epoch are independent of the permutations in other epochs.
Next, we have the following lemma that bounds the spectral norm of for any :
To avoid confusion, we remark here that the term '' in the paper by Ahn et al. (2020) is the same as the term '' in our paper, and hence the original lemma statement in their paper looks different from what is written above.
We begin by decomposing the product into a product of independent terms, similar to the proof of Lemma 8 in Ahn et al. (2020).
However, after that we diverge from their proof, since we use FlipFlop-specific analysis.
Since , we get that , and
are independent. Hence, we can write the expectation as product of expectations:
Applying the Cauchy-Schwarz inequality on the decomposition above, we get
where in the last step we used that for all . To see why this is true, recall that . Further by Assumption 2, and hence as long as , we have .
For the two terms in the product above, we have the following lemma:
Since we are dealing with just a single epoch, we will skip the superscript. Using Lemma 5, we get
(35)
Define . Then and hence
(36)
Next we bound the other two terms.
Using Eq. (26), we get that for any ,
Since , and we use uniform random permutations, . Similarly, . Hence,
(37)
where we used Lemma 6 and the assumption that in the last step.
The third term in Ineq. (35) can be handled similarly. For any :
Now, it is easy to see that and . Then, if or we use for that case the fact that . For all other , we can again use that if or otherwise. Similarly, we can bound . Proceeding in a way similar to how we derived Ineq. (37), we get
(38)
Substituting Ineq. (36), (37) and (38) into (35), we get
where we used the assumption that in the last step.
Appendix G Proof of Theorem 6
where .
This captures the difference between the true gradients that the algorithm observes and the gradients that a full step of gradient descent would have seen.
For two consecutive epochs of FlipFlop with Incremental GD, we have the following inequality
(39)
where the first inequality is due to Theorem 2.1.11 in Nesterov (2004) and the second one is simply .
What remains to be done is to bound the two terms with dependence. Firstly, we give a bound on the norm of :
Next, we will use the smoothness assumption and bounded gradients property (Lemma 1).
Hence,
(40)
For the term, we need a more careful bound. Since the Hessian is constant for quadratic functions, we use to denote the Hessian matrix of function . We start off by using the definition of :
where we used the fact that for a quadratic function with Hessian , we have that . After that, we express and as sum of gradient descent steps:
Next, we use the fact that . We will also again use the fact that for a quadratic function with Hessian , we have that :
where we define the random variables as
Again, using smoothness assumption and bounded gradients property (Lemma 1) we get,
(41)
Next, we decompose the inner product of and in Eq. (39):
Substituting (40)-(45) back into (39), we finally get a recursion bound for one epoch:
Next, we use the fact that (for any ) on the term to get that
Substituting this back we get,
(Since )
Now, substituting the values of and the bound on , we get that and hence,
Now, iterating this for epoch pairs, we get
(Since )
Substituting gives us the desired result.
∎
Appendix H Additional Experiments
Figure 3: Dependence of convergence rates on the number of epochs for logistic regression. The figures show the median and inter-quartile range after 10 runs of each algorithm, with random initializations and random permutation seeds (note that IGD exhibits extremely small variance). We set , so that and hence the higher order terms of dominate the convergence rates. Note that both the axes are in logarithmic scale.
Although our theoretical guarantees for FlipFlop only hold for quadratic objectives, we conjecture that FlipFlop might also improve convergence for other classes of functions whose Hessians are smooth near their minimizers.
To explore this, we also ran some experiments on 1-dimensional logistic regression.
As we can see in Figure 3, the convergence rates are very similar to those on quadratic functions.
The data was synthetically generated such that the objective function becomes strongly convex and well conditioned near the minimizer.
Note that logistic loss is not strongly convex on linearly separable data.
Therefore, to make the loss strongly convex, we ensured that the data was not linearly separable.
Essentially, the dataset was the following: all the datapoints were , and their labels were with probability and with probability .
Framing this as an optimization problem, we have
where . Note that is the minimizer of this function, which is helpful because we can use it to compute the exact error. Similar to the experiment on quadratic functions, was set to
and the step size was set in the same regime as in Theorems 4, 5, and 6.
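For reference, the following is a minimal sketch of this experiment under assumed specifics (the elided dataset values above are not recoverable, so the choice $x_i = 1$ for all datapoints and the label probability $p = 0.8$ are our guesses; the empirical minimizer $w^* = \log(n_+/n_-)$ then gives the exact error):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, p = 500, 40, 0.8                         # placeholder sizes; K even
y = np.where(rng.random(n) < p, 1.0, -1.0)     # labels +1 w.p. p, -1 w.p. 1-p; x_i = 1
n_pos = int((y > 0).sum())
w_star = np.log(n_pos / (n - n_pos))           # exact empirical minimizer
lr = 2.0 / n                                   # assumed small constant step size

def run(flip):
    w = 0.0
    perm = np.random.default_rng(1).permutation(n)   # same SS base for both runs
    for k in range(K):
        order = perm[::-1] if (flip and k % 2 == 1) else perm
        for i in order:
            w += lr * y[i] / (1.0 + np.exp(y[i] * w))  # -grad of log(1 + e^{-y_i w})
    return (w - w_star) ** 2

print("SS:", run(False), "  FlipFlop+SS:", run(True))
```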