SMG: A Shuffling Gradient-Based Method with Momentum
Abstract
We combine two advanced ideas widely used in optimization for machine learning, the shuffling strategy and the momentum technique, to develop a novel shuffling gradient-based method with momentum, coined Shuffling Momentum Gradient (SMG), for non-convex finite-sum optimization problems. While our method is inspired by momentum techniques, its update is fundamentally different from existing momentum-based methods. We establish state-of-the-art convergence rates of SMG for any shuffling strategy using either a constant or a diminishing learning rate under standard assumptions (i.e., $L$-smoothness and bounded variance). When the shuffling strategy is fixed, we develop another new algorithm that is similar to existing momentum methods, and prove the same convergence rates for this algorithm under the $L$-smoothness and bounded gradient assumptions. We demonstrate our algorithms via numerical simulations on standard datasets and compare them with existing shuffling methods. Our tests show encouraging performance of the new algorithms.
1 Introduction
Most training tasks in supervised learning boil down to solving the following finite-sum minimization problem:
$$\min_{w \in \mathbb{R}^d}\Big\{ F(w) := \frac{1}{n}\sum_{i=1}^{n} f(w; i) \Big\}, \qquad (1)$$
where $f(\cdot; i): \mathbb{R}^d \to \mathbb{R}$ is a given smooth and possibly nonconvex function for $i \in [n] := \{1, \dots, n\}$.
Problem (1) looks simple, but it covers various convex and nonconvex applications in machine learning and statistical learning, including, but not limited to, logistic regression, multi-kernel learning, conditional random fields, and neural networks. In particular, (1) covers empirical risk minimization as a special case. Solution methods for approximately solving (1) have been widely studied in the literature under different sets of assumptions. The most common approach is perhaps stochastic gradient descent (SGD)-type methods (Robbins & Monro, 1951; Ghadimi & Lan, 2013; Bottou et al., 2018; Nguyen et al., 2018, 2019) and their variants.
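As a concrete instance (a standard textbook example rather than one specific to this paper), binary classification by logistic regression over training samples $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$ fits the template (1) with
$$f(w; i) := \log\big(1 + \exp(-y_i x_i^\top w)\big),$$
and the nonconvex examples in Section 5 are obtained by adding a nonconvex regularizer to this objective.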
Motivation. While SGD and its variants rely on randomized sampling strategies with replacement, gradient-based methods using without-replacement strategies are often easier and faster to implement. Moreover, practical evidence (Bottou, 2009) has shown that they usually produce a faster decrease of the training loss. Randomized shuffling strategies (also viewed as sampling without replacement) guarantee that every function component is used exactly once per epoch, in contrast to SGD, which has only statistical convergence guarantees (e.g., in expectation or with high probability). However, very often, the analysis of shuffling methods is more challenging than that of SGD due to the lack of statistical independence.
In the deterministic case, single permutation (also called shuffle once, or single shuffling) and incremental gradient methods can be considered as special cases of the shuffling gradient-based methods we study in this paper. One special shuffling strategy is randomized reshuffling, which is broadly used in practice, where a different random permutation is drawn at each epoch. Alternatively, in recent years, it has been shown that many gradient-based methods with a momentum update can notably boost the convergence speed both in theory and in practice (Nesterov, 2004; Dozat, 2016; Wang et al., 2020). These methods have been widely used in both convex and nonconvex settings, especially in the deep learning community. Remarkably, Nesterov's accelerated method (Nesterov, 1983) revolutionized large-scale convex optimization in the last two decades and has been largely exploited in nonconvex problems. The developments we have discussed here motivate us to raise the following research question:
Can we combine both shuffling strategy and momentum scheme to develop new provable gradient-based algorithms for handling (1)?
In this paper, we answer this question affirmatively by proposing a novel algorithm called Shuffling Momentum Gradient (SMG). We establish its convergence guarantees for different shuffling strategies, and in particular, randomized reshuffling strategy. We also investigate different variants of our method.
Our contribution. Our contributions in this paper can be summarized as follows.

(a) We develop a novel shuffling gradient-based method with momentum (Algorithm 1 in Section 2) for approximating a stationary point of the nonconvex minimization problem (1). Our algorithm covers any shuffling strategy, ranging from deterministic to randomized, including incremental, single shuffling, and randomized reshuffling variants.

(b) We establish the convergence of our method in the nonconvex setting and achieve the state-of-the-art $\mathcal{O}(T^{-2/3})$ convergence rate under standard assumptions (i.e., the $L$-smoothness and bounded variance conditions), where $T$ is the number of epochs. For the randomized reshuffling strategy, we can improve our convergence rate up to $\mathcal{O}(n^{-1/3}T^{-2/3})$.

(c) We study different strategies for selecting learning rates (LR), including constant, diminishing, exponential, and cosine scheduled learning rates. In all cases, we prove the same convergence rate of the corresponding variants without any additional assumption.

(d) When a single shuffling strategy is used, we show that a momentum strategy can be incorporated directly at each iteration of the shuffling gradient method to obtain a different variant, as presented in Algorithm 2. We analyze the convergence of this algorithm and achieve the same epoch-wise convergence rate, but under a bounded gradient assumption instead of the bounded variance assumption used for the SMG algorithm.
Our $\mathcal{O}(T^{-2/3})$ convergence rate is the best known so far for shuffling gradient-type methods in nonconvex optimization (Nguyen et al., 2020; Mishchenko et al., 2020). However, like (Mishchenko et al., 2020), our SMG method only requires a generalized bounded variance assumption (Assumption 1(c)), which is weaker and more standard than the bounded component gradient assumption used in existing works. Algorithm 2 uses the same set of assumptions as in (Nguyen et al., 2020) to achieve the same rate, but has a momentum update. For the randomized reshuffling strategy, our $\mathcal{O}(n^{-1/3}T^{-2/3})$ convergence rate also matches the rate of the without-momentum algorithm in (Mishchenko et al., 2020). It leads to a total of $nT = \mathcal{O}(\sqrt{n}\,\epsilon^{-3})$ individual gradient evaluations to reach an $\epsilon$-stationary point.
We emphasize that, in many existing momentum variants, the momentum is updated recursively at each iteration as $m_{t+1} := \beta m_t + (1-\beta) g_t$ for a given weight $\beta \in (0, 1)$, where $g_t$ is the current (stochastic) gradient. This update shows that the momentum incorporates all past gradient terms $g_j$, $j = 0, \dots, t$, with exponentially decaying weights $(1-\beta)\beta^{t-j}$, respectively. However, in shuffling methods, the convergence guarantee is often obtained per epoch. Based on this observation, we modify the classical momentum update in the shuffling method as shown in Algorithm 1. More specifically, the momentum term is fixed at the beginning of each epoch, and an auxiliary sequence is introduced to keep track of the gradient average, which is then used to update the momentum term in the next epoch. This modification makes Algorithm 1 fundamentally different from existing momentum-based methods. The new algorithm still achieves an epoch-wise $\mathcal{O}(T^{-2/3})$ convergence rate under standard assumptions. To the best of our knowledge, our work is the first to analyze convergence rate guarantees of shuffling-type gradient methods with momentum under standard assumptions.
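To make this contrast concrete, the two updates within epoch $t$ can be written side by side (the inner-iteration notation here mirrors the update (2) introduced in Section 2):
$$\text{classical momentum: } m_{i+1}^{(t)} := \beta\, m_i^{(t)} + (1-\beta)\, g_i^{(t)}, \qquad \text{SMG: } m_{i+1}^{(t)} := \beta\, \tilde m_{t-1} + (1-\beta)\, g_i^{(t)},$$
where $g_i^{(t)}$ is the component gradient at inner iteration $i$ of epoch $t$, and $\tilde m_{t-1}$, the average of the component gradients collected during epoch $t-1$, remains fixed throughout epoch $t$.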
Besides Algorithm 1, we also exploit recent advanced strategies for selecting learning rates, including exponential and cosine scheduled learning rates. These two strategies have shown state-of-the-art performance in practice (Smith, 2017; Loshchilov & Hutter, 2017; Li et al., 2020). Therefore, it is worth incorporating them in shuffling methods.
Related work. Let us briefly review the most related works to our methods studied in this paper.
Shuffling gradient-based methods. Shuffling gradient-type methods for solving (1) have been widely studied in the literature in recent years (Bottou, 2009; Gürbüzbalaban et al., 2019; Shamir, 2016; Haochen & Sra, 2019; Nguyen et al., 2020) for both convex and nonconvex settings. It was empirically investigated in a short note (Bottou, 2009) and also discussed in (Bottou, 2012). These methods have also been implemented in several software packages such as TensorFlow and PyTorch, broadly used in machine learning (Abadi et al., 2015; Paszke et al., 2019).
In the strongly convex case, shuffling methods have been extensively studied in (Ahn et al., 2020; Gürbüzbalaban et al., 2019; Haochen & Sra, 2019; Safran & Shamir, 2020; Nagaraj et al., 2019; Rajput et al., 2020; Nguyen et al., 2020; Mishchenko et al., 2020) under different assumptions. The best known convergence rate in this case is , which matches the lower bound rate studied in (Safran & Shamir, 2020) up to some constant factor. Most results in the convex case are for the incremental gradient variant, which are studied in (Nedic & Bertsekas, 2001; Nedić & Bertsekas, 2001). Convergence results of shuffling methods on the general convex case are investigated in (Shamir, 2016; Mishchenko et al., 2020), where (Mishchenko et al., 2020) provides a unified approach to cover different settings. The authors in (Ying et al., 2017) combine a randomized shuffling and a variance reduction technique (e.g., SAGA (Defazio et al., 2014) and SVRG (Johnson & Zhang, 2013)) to develop a new variant. They show a linear convergence rate for strongly convex problems but using an energy function, which is unclear how to convert it into known convergence criteria.
In the nonconvex case, (Nguyen et al., 2020) first shows convergence rate for a general class of shuffling gradient methods under the -smoothness and bounded gradient assumptions on (1). This analysis is then extended in (Mishchenko et al., 2020) to a more relaxed assumption. The authors in (Meng et al., 2019) study different distributed SGD variants with shuffling for strongly convex, general convex, and nonconvex problems. An incremental gradient method for weakly convex problems is investigated in (Li et al., 2019), where the authors show convergence rate as in standard SGD. To the best of our knowledge, the best known rate of shuffling gradient methods for the nonconvex case under standard assumptions is as shown in (Nguyen et al., 2020; Mishchenko et al., 2020).
Our Algorithm 1 developed in this paper is a nontrivial momentum variant of the general shuffling method in (Nguyen et al., 2020), but our analysis uses a standard bounded variance assumption instead of the bounded gradient one.
Momentum-based methods. Gradient methods with momentum were studied in early works for convex problems, such as the heavy-ball, inertial, and Nesterov's accelerated gradient methods (Polyak, 1964; Nesterov, 2004). Nesterov's accelerated method is the most influential scheme and achieves the optimal $\mathcal{O}(1/k^2)$ convergence rate for convex problems. While momentum-based methods are not yet known to improve theoretical convergence rates in the nonconvex setting, they show significantly encouraging performance in practice (Dozat, 2016; Wang et al., 2020), especially in the deep learning community. However, the momentum strategy had not previously been exploited in shuffling methods.
Adaptive learning rate schemes. Gradient-type methods with adaptive learning rates such as AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014) have shown state-of-the-art performance in several optimization applications. Recently, many adaptive schemes for learning rates have been proposed such as diminishing (Nguyen et al., 2020), exponential scheduled (Li et al., 2020), and cosine scheduled (Smith, 2017; Loshchilov & Hutter, 2017). In (Li et al., 2020), the authors analyze convergence guarantees for the exponential and cosine learning rates in SGD. These adaptive learning rates have also been empirically studied in the literature, especially in machine learning tasks. Their convergence guarantees have also been investigated accordingly under certain assumptions.
Although the analyses for adaptive learning rates are generally non-trivial, our theoretical results in this paper are flexible enough to cover various learning rates. However, we only exploit the diminishing, exponential scheduled, and cosine scheduled schemes for our shuffling methods with momentum. We establish that in the last two cases, our algorithm still achieves the state-of-the-art $\mathcal{O}(T^{-2/3})$ epoch-wise rate (and possibly up to $\mathcal{O}(n^{-1/3}T^{-2/3})$ under randomized reshuffling).
Content. The rest of this paper is organized as follows. Section 2 describes our novel method, Shuffling Momentum Gradient (Algorithm 1). Section 3 investigates its convergence rate under different shuffling-type strategies and different learning rates. Section 4 proposes an algorithm with traditional momentum update for single shuffling strategy. Section 5 presents our numerical simulations. Due to space limit, the convergence analysis of our methods, all technical proofs, and additional experiments are deferred to the Supplementary Document (Supp. Doc.).
2 Shuffling Gradient-Based Method with Momentum
In this section, we describe our new shuffling gradient algorithm with momentum in Algorithm 1. We also compare our algorithm with existing methods and discuss its per-iteration complexity and possible modifications, e.g., mini-batch.
$$\begin{cases}
g_i^{(t)} := \nabla f\big(w_i^{(t)}; \pi^{(t)}(i+1)\big), \\[2pt]
m_{i+1}^{(t)} := \beta\, \tilde m_{t-1} + (1-\beta)\, g_i^{(t)}, \\[2pt]
w_{i+1}^{(t)} := w_i^{(t)} - \eta_i^{(t)}\, m_{i+1}^{(t)}, \\[2pt]
v_{i+1}^{(t)} := v_i^{(t)} + \tfrac{1}{n}\, g_i^{(t)},
\end{cases} \qquad (2)$$
with $v_0^{(t)} := 0$, and with $\tilde w_t := w_n^{(t)}$ and $\tilde m_t := v_n^{(t)}$ set at the end of each epoch $t$.
Clearly, if $\beta = 0$, then Algorithm 1 reduces to the standard shuffling gradient method as in (Nguyen et al., 2020; Mishchenko et al., 2020). Since each inner iteration of our method uses only one component gradient $\nabla f(\cdot; i)$, we use the scaled learning rate $\eta_i^{(t)} := \frac{\eta_t}{n}$ in Algorithm 1 to make one epoch consistent with one full gradient step, which consists of $n$ gradient components. This form looks different from the learning rates used in previous work on SGD (Shamir, 2016; Mishchenko et al., 2020); however, it does not necessarily make our learning rate smaller. Up to the same order in the number of training samples $n$, our learning rate matches the one in (Mishchenko et al., 2020) as well as the state-of-the-art complexity results. The detailed learning rates used in Algorithm 1 will be specified in Section 3.
Comparison. Unlike existing momentum methods, where the momentum term in (2) would be updated recursively as $m_{i+1}^{(t)} := \beta m_i^{(t)} + (1-\beta) g_i^{(t)}$, we instead fix the anchor $\tilde m_{t-1}$ in (2) during each epoch. It is only updated at the end of each epoch by averaging all $n$ gradient components evaluated in that epoch. To avoid storing these gradients, we introduce an auxiliary variable $v_i^{(t)}$ that keeps track of the gradient average. Here, we fix the weight $\beta \in (0, 1)$ for the sake of our analysis, but it is possible to make $\beta$ adaptive.
Our new method is based on the following observations. First, while SGD generally uses an unbiased estimator of the full gradient, shuffling gradient methods often do not have such a nice property, making them more difficult to analyze. Due to this fact, updating the momentum at each inner iteration does not seem preferable, since it could make the estimator deviate further from the true gradient. Therefore, we consider updating the momentum only after each epoch. Second, unlike the traditional momentum with exponentially decaying weights, our momentum term is an equally weighted average of all the past gradients in an epoch, albeit evaluated at different points in the inner loop. Based on these observations, the momentum term should approximate the full gradient $\nabla F$ rather than a single component $\nabla f(\cdot; i)$, which leads to the update $\tilde m_t := v_n^{(t)}$ at the end of each epoch.
It is also worth noting that our algorithm is fundamentally different from variance reduction methods. For instance, SVRG and SARAH variants require evaluating the full gradient at each snapshot point, while our method always uses a single component gradient. Hence, our algorithm does not require a full gradient evaluation at each outer iteration, and our momentum term does not require the full gradient of $F$.
When the learning rate is fixed at each epoch as $\eta_i^{(t)} := \frac{\eta_t}{n}$, where $\eta_t > 0$, we can derive from (2) that
$$w_{i+1}^{(t)} := w_i^{(t)} - \frac{(1-\beta)\,\eta_t}{n}\, g_i^{(t)} - \frac{\beta\,\eta_t}{n}\, \tilde m_{t-1},$$
where $\tilde m_{t-1}$ plays the role of a momentum or inertial term, though it is different from the usual momentum term. We nevertheless name Algorithm 1 the Shuffling Momentum Gradient method since it is inspired by momentum methods.
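For readers who prefer code, the following is a minimal Python sketch of the epoch loop described by (2); the gradient oracle grad(w, i) and the schedule eta_t are placeholders introduced for illustration, not part of the original algorithm listing.

```python
import numpy as np

def smg(w0, grad, n, epochs, eta_t, beta=0.5, seed=0):
    """Sketch of update (2): the momentum anchor m_tilde is frozen within
    an epoch and refreshed from the epoch's gradient average at its end."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    m_tilde = np.zeros_like(w0)           # momentum anchor, fixed per epoch
    for t in range(epochs):
        perm = rng.permutation(n)         # randomized reshuffling
        v = np.zeros_like(w0)             # tracks the epoch gradient average
        lr = eta_t(t) / n                 # per-iteration step size eta_t / n
        for i in perm:
            g = grad(w, i)                # component gradient
            m = beta * m_tilde + (1.0 - beta) * g
            w = w - lr * m
            v = v + g / n
        m_tilde = v                       # epoch-end momentum update
    return w
```

Setting beta = 0 in this sketch recovers the plain shuffling gradient scheme, consistent with the remark above that $\beta = 0$ reduces Algorithm 1 to the standard shuffling gradient method.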
Per-iteration complexity. The per-iteration complexity of Algorithm 1 is almost the same as in standard shuffling gradient schemes; see, e.g., (Shamir, 2016). It needs to store two additional vectors, the momentum term and the gradient-average tracker $v_i^{(t)}$, and it performs two vector additions and three scalar-vector multiplications per iteration. These additional costs are very mild.
Shuffling strategies. Our convergence guarantee in Section 3 holds for any permutation $\pi^{(t)}$ of $\{1, \dots, n\}$, including deterministic and randomized ones. Therefore, our method covers any shuffling strategy, including incremental, single shuffling, and randomized reshuffling variants as special cases. We highlight that our convergence result for the randomized reshuffling variant is significantly better than the general one, as we will explain in detail in Section 3.
Mini-batch. Algorithm 1 also works with mini-batches, and our theory can be slightly adapted to establish the same convergence rate for mini-batch variants. However, it remains unclear to us how mini-batches can improve the convergence rate guarantee of Algorithm 1 in the general case where the shuffling strategy is not specified.
3 Convergence Analysis
We analyze the convergence of Algorithm 1 under standard assumptions. The analysis is organized as follows.
3.1 Technical Assumptions
Our analysis relies on the following assumptions:
Assumption 1.

Problem (1) satisfies:

(a) (Boundedness from below) $\mathrm{dom}(F) \neq \emptyset$ and $F$ is bounded from below, i.e., $F^{\star} := \inf_{w \in \mathbb{R}^d} F(w) > -\infty$.

(b) ($L$-Smoothness) $f(\cdot; i)$ is $L$-smooth for all $i \in [n]$, i.e., there exists a universal constant $L > 0$ such that, for all $w, w' \in \mathbb{R}^d$, it holds that
$$\|\nabla f(w; i) - \nabla f(w'; i)\| \leq L\, \|w - w'\|. \qquad (3)$$

(c) (Generalized bounded variance) There exist two non-negative and finite constants $\Theta$ and $\sigma$ such that for any $w \in \mathbb{R}^d$ we have
$$\frac{1}{n}\sum_{i=1}^{n} \|\nabla f(w; i) - \nabla F(w)\|^2 \leq \Theta\, \|\nabla F(w)\|^2 + \sigma^2. \qquad (4)$$
Assumption 1(a) is required in any algorithm to guarantee the well-definedness of (1). The $L$-smoothness (3) is standard in gradient-type methods for both stochastic and deterministic algorithms. From this assumption, we have for any $w, w' \in \mathbb{R}^d$ (see (Nesterov, 2004)):
$$F(w') \leq F(w) + \langle \nabla F(w), w' - w \rangle + \frac{L}{2}\, \|w' - w\|^2. \qquad (5)$$
The condition (4) in Assumption 1(c) reduces to the standard bounded variance condition when $\Theta = 0$. Therefore, (4) is more general than the bounded variance assumption often used in stochastic optimization. Unlike recent existing works on momentum SGD and shuffling (Chen et al., 2018; Nguyen et al., 2020), we do not impose a bounded gradient assumption on each $f(\cdot; i)$ in Algorithm 1 (see Assumption 2). That condition is stronger than (4).
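To see why the bounded gradient condition is stronger, note the following one-line argument (included here for completeness): if $\|\nabla f(w; i)\| \leq G$ for all $w$ and $i$, then
$$\frac{1}{n}\sum_{i=1}^{n} \|\nabla f(w; i) - \nabla F(w)\|^2 = \frac{1}{n}\sum_{i=1}^{n} \|\nabla f(w; i)\|^2 - \|\nabla F(w)\|^2 \leq G^2,$$
so (4) holds with $\Theta = 0$ and $\sigma^2 = G^2$, while the converse clearly fails (e.g., for a strongly convex quadratic, whose gradients are unbounded).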
3.2 Main Result 1 and Its Consequences
Our first main result is the following convergence theorem for Algorithm 1 under any shuffling strategy.
Theorem 1.
Remark 1 (Convergence guarantee).
When a deterministic permutation is used, our convergence rate holds in a deterministic sense. However, to unify our analysis, we express our convergence guarantees in expectation, where the expectation is taken over all the randomness generated by the random permutations up to epoch $T$. Since the permutations can be chosen either deterministically or randomly, our bounds in the sequel hold either deterministically or with probability 1 (w.p.1), respectively. Without loss of generality, we state these results in expectation.
Next, we derive two direct consequences of Theorem 1 by choosing constant and diminishing learning rates.
Corollary 1 (Constant learning rate).
With a constant LR as in Corollary 1, the convergence rate of Algorithm 1 is $\mathcal{O}\big(T^{-2/3}\big)$, which matches the best known rate in the literature (Mishchenko et al., 2020; Nguyen et al., 2020) in terms of the number of epochs $T$ for general shuffling-type strategies.
Corollary 2 (Diminishing learning rate).
The diminishing LR allows us to use larger learning rate values in early epochs compared to the constant case. However, we lose a factor in the second term of our worst-case convergence rate bound.
We also derive the following two consequences of Theorem 1 for exponential and cosine scheduled learning rates.
Exponential scheduled learning rate. Given an epoch budget $T \geq 1$ and two positive constants $\gamma$ and $\rho$, we consider the following exponential LR, see (Li et al., 2020):
$$\eta_t := \frac{\gamma\, \alpha^{t}}{T^{1/3}}, \quad \text{where } \alpha := \rho^{1/T} \in (0, 1). \qquad (7)$$
The following corollary shows the convergence of Algorithm 1 using this LR without any additional assumption.
Corollary 3.
Cosine annealing learning rate: Alternatively, given an epoch budget $T \geq 1$ and a positive constant $\gamma$, we consider the following cosine LR for Algorithm 1:
$$\eta_t := \frac{\gamma}{T^{1/3}}\Big(1 + \cos\frac{t\pi}{T}\Big), \quad t = 1, \dots, T. \qquad (8)$$
This LR is adopted from (Loshchilov & Hutter, 2017; Smith, 2017). However, different from these works, we fix our learning rate at each epoch instead of updating it at every iteration as in (Loshchilov & Hutter, 2017; Smith, 2017).
Corollary 4.
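For concreteness, the small Python helper below evaluates both epoch-wise schedules; the scaling constants follow (7) and (8) as reconstructed above, so treat this as a sketch under those assumed forms rather than a reference implementation.

```python
import math

def exponential_lr(t, T, gamma=0.1, rho=0.1):
    # eta_t = gamma * alpha^t / T^(1/3) with alpha = rho^(1/T) in (0, 1),
    # so alpha^t stays in [rho, 1] and eta_t remains of order T^(-1/3).
    alpha = rho ** (1.0 / T)
    return gamma * alpha ** t / T ** (1.0 / 3.0)

def cosine_lr(t, T, gamma=0.1):
    # eta_t = gamma * (1 + cos(t*pi/T)) / T^(1/3): large early, near 0 at t = T.
    return gamma * (1.0 + math.cos(t * math.pi / T)) / T ** (1.0 / 3.0)

if __name__ == "__main__":
    T = 10
    print([round(exponential_lr(t, T), 4) for t in range(1, T + 1)])
    print([round(cosine_lr(t, T), 4) for t in range(1, T + 1)])
```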
3.3 Main Result 2: Randomized Reshuffling Variant
A variant of Algorithm 1 is called a randomized reshuffling variant if, at each epoch $t$, the permutation $\pi^{(t)}$ is sampled uniformly at random from the set of all permutations of $\{1, \dots, n\}$ (i.e., the indices are sampled without replacement). Since the randomized reshuffling strategy is extremely popular in practice, we analyze Algorithm 1 under this strategy.
Theorem 2.
We can derive the following two consequences.
Corollary 5 (Constant learning rate).
Let us fix the number of epochs $T \geq 1$ and choose a constant learning rate $\eta_t := \frac{\gamma\, n^{1/3}}{T^{1/3}}$ for some $\gamma > 0$ satisfying the step-size condition of Theorem 2 in Algorithm 1. Then, under the conditions of Theorem 2, the convergence rate of Algorithm 1 is $\mathcal{O}\big(n^{-1/3} T^{-2/3}\big)$ per epoch.
With a constant LR as in Corollary 5, the convergence rate of Algorithm 1 is improved to $\mathcal{O}\big(n^{-1/3} T^{-2/3}\big)$, which matches the best known rate for the randomized reshuffling scheme; see, e.g., (Mishchenko et al., 2020).

In this case, the total number of individual gradient evaluations needed to reach an $\epsilon$-stationary point is $nT = \mathcal{O}\big(\sqrt{n}\, \epsilon^{-3}\big)$. Compared to the $\mathcal{O}(\epsilon^{-4})$ complexity of SGD, our randomized reshuffling variant is better than SGD when $n \leq \mathcal{O}(\epsilon^{-2})$.
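This complexity claim follows from a short calculation, spelled out here for clarity:
$$\frac{1}{n^{1/3}\, T^{2/3}} \leq \epsilon^{2} \;\Longleftrightarrow\; T \geq \frac{1}{\sqrt{n}\, \epsilon^{3}}, \qquad \text{so} \qquad nT = \mathcal{O}\Big(\frac{\sqrt{n}}{\epsilon^{3}}\Big),$$
which is smaller than the $\mathcal{O}(\epsilon^{-4})$ complexity of SGD precisely when $\sqrt{n} \leq \epsilon^{-1}$, i.e., $n \leq \epsilon^{-2}$.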
Corollary 6 (Diminishing learning rate).
Remark 2.
Algorithm 1 under randomized reshuffling still works with exponential and cosine scheduled learning rates, and the analysis is similar to that for general shuffling schemes. We therefore omit the details here.
4 Single Shuffling Variant
In this section, we modify the single-shuffling gradient method by directly incorporating a momentum term at each iteration. We prove for the first time that this variant still achieves the state-of-the-art $\mathcal{O}(T^{-2/3})$ convergence rate guarantee under the $L$-smoothness and bounded gradient (i.e., Assumption 2) assumptions, as in existing shuffling methods. This variant, though somewhat special, also covers an incremental gradient method with momentum as a special case.
Our new momentum algorithm using single shuffling strategy for solving (1) is presented in Algorithm 2.
Besides fixing the permutation $\pi$, Algorithm 2 differs from Algorithm 1 in how the momentum term is updated. Here, the momentum is updated recursively from the previous inner iteration instead of being anchored at $\tilde m_{t-1}$ as in Algorithm 1. In addition, we update the momentum at every inner iteration instead of using the epoch gradient average. Note that we can write the main update of Algorithm 2 as
$$m_{i+1}^{(t)} := \beta\, m_i^{(t)} + (1-\beta)\, g_i^{(t)}, \qquad w_{i+1}^{(t)} := w_i^{(t)} - \eta_i^{(t)}\, m_{i+1}^{(t)},$$
which exactly reduces to existing momentum updates.
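A minimal Python sketch of this single-shuffling variant is given below, using the same placeholder gradient oracle as in the SMG sketch of Section 2; here the permutation is drawn once and reused across all epochs.

```python
import numpy as np

def ssmg(w0, grad, n, epochs, eta_t, beta=0.5, seed=0):
    """Sketch of Algorithm 2: classical recursive momentum combined with
    a single permutation that is drawn once and reused in every epoch."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)             # shuffle once
    w = w0.copy()
    m = np.zeros_like(w0)                 # recursive momentum term
    for t in range(epochs):
        lr = eta_t(t) / n                 # per-iteration step size eta_t / n
        for i in perm:
            g = grad(w, i)
            m = beta * m + (1.0 - beta) * g   # updated at every inner step
            w = w - lr * m
    return w
```

Replacing the random permutation by perm = np.arange(n) yields the incremental gradient variant with momentum discussed next.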
Incremental gradient method with momentum. If we choose $\pi := (1, 2, \dots, n)$, then we obtain the well-known incremental gradient variant, but with momentum. Hence, Algorithm 2 is still new compared to the standard incremental gradient algorithm (Bertsekas, 2011). To prove convergence of Algorithm 2, we replace Assumption 1(c) by the following:

Assumption 2 (Bounded gradient).

There exists $G > 0$ such that $\|\nabla f(w; i)\| \leq G$ for all $w \in \mathbb{R}^d$ and $i \in [n]$.
Now, we state the convergence of Algorithm 2 in the following theorem as our third main result.
Theorem 3.
Compared to (9) in Theorem 2, (10) strongly depends on the weight $\beta$, which appears on the right-hand side of (10). Hence, $\beta$ must be chosen in a specific form to obtain the desired convergence rate.
Theorem 3 provides a key bound to derive concrete convergence rates in the following two corollaries.
Corollary 7 (Constant learning rate).
With a constant learning rate as in Corollary 7, the convergence rate of Algorithm 2 is again $\mathcal{O}\big(T^{-2/3}\big)$, with constants depending on $L$, $G$, and the weight $\beta$; this is slightly different from Corollary 1.
Corollary 8 (Diminishing learning rate).
5 Numerical Experiments
In order to examine our algorithms, we present two numerical experiments for different nonconvex problems and compare them with some state-of-the-art SGD-type and shuffling gradient methods.
5.1 Models and Datasets
Neural Networks. We test our Shuffling Momentum Gradient (SMG) algorithm using two standard network architectures: a fully connected network (FCN) and a convolutional neural network (CNN). For the fully connected setting, we train the classic LeNet-300-100 model (LeCun et al., 1998) on the Fashion-MNIST dataset (Xiao et al., 2017) with 60,000 training images.

We also use the convolutional LeNet-5 architecture (LeCun et al., 1998) to train on the well-known CIFAR-10 dataset (Krizhevsky & Hinton, 2009) with 50,000 training samples. We repeat each experiment for 10 random seeds and report the average results. All algorithms are implemented and run in Python using the PyTorch package (Paszke et al., 2019).
Nonconvex Logistic Regression. Nonconvex regularizers are widely used in statistical learning, e.g., to promote sparsity or to gain robustness. We consider the following nonconvex binary classification problem:
$$\min_{w \in \mathbb{R}^d}\Big\{ F(w) := \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-y_i x_i^\top w)\big) + \lambda\, r(w) \Big\},$$
where $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$ is a set of training samples, $r(w) := \sum_{j=1}^{d} \frac{w_j^2}{1 + w_j^2}$ is a nonconvex regularizer, and $\lambda > 0$ is a regularization parameter. This example was also previously used in (Wang et al., 2019; Tran-Dinh et al., 2019; Nguyen et al., 2020).
We have conducted the experiments on two classification datasets from LIBSVM (Chang & Lin, 2011): w8a (49,749 samples) and ijcnn1 (91,701 samples). Each experiment is repeated 10 times with different random seeds, and the average result is reported.
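As an illustration, the objective above can be evaluated in a few lines of NumPy; the regularizer matches the form given above, and the default value of lam is only an illustrative placeholder.

```python
import numpy as np

def nonconvex_logreg_loss(w, X, y, lam=0.01):
    """F(w) = (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + lam * r(w),
    with the nonconvex regularizer r(w) = sum_j w_j^2 / (1 + w_j^2)."""
    margins = -y * (X @ w)                          # y_i in {-1, +1}
    loss = np.mean(np.logaddexp(0.0, margins))      # stable log(1 + e^m)
    reg = np.sum(w**2 / (1.0 + w**2))               # bounded, nonconvex
    return loss + lam * reg
```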
5.2 Comparing SMG with Other Methods
We compare our SMG algorithm with Stochastic Gradient Descent (SGD) and two other methods: SGD with Momentum (SGD-M) (Polyak, 1964) and Adam (Kingma & Ba, 2014). For the latter two algorithms, we use the hyper-parameter settings recommended and widely used in practice (i.e., momentum 0.9 for SGD-M, and the two hyper-parameters $\beta_1 := 0.9$ and $\beta_2 := 0.999$ for Adam). For our new SMG algorithm, we fixed the parameter $\beta := 0.5$ since it usually performs best in our experiments.
To have a fair comparison, we apply the randomized reshuffling scheme to all methods. Note that shuffling strategies are broadly used in practice and have been implemented in TensorFlow, PyTorch, and Keras (Chollet et al., 2015). We tune each algorithm using a constant learning rate and report the best result obtained.
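For reference, epoch-wise randomized reshuffling is exactly what a standard PyTorch data loader with shuffle=True performs; a self-contained toy sketch (the data, model, and optimizer below are stand-ins, not our experimental setup) looks as follows.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the paper's datasets and models.
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
model = nn.Linear(20, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# shuffle=True draws a fresh random permutation of the dataset at the
# start of every epoch, i.e., randomized reshuffling.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

for epoch in range(3):
    for xb, yb in loader:                 # one pass over the data = one epoch
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```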
Results. Our first experiment is presented in Figure 1, where we depict the training loss (i.e., $F(w)$ in (1)) on the $y$-axis against the number of effective passes (i.e., the number of epochs) on the $x$-axis. It has been observed that SGD-M and Adam work well for machine learning tasks (see, e.g., (Ruder, 2017)). In line with this, we also observe from Figure 1 that our SMG algorithm and SGD are slightly worse than SGD-M and Adam at the initial stage of training a neural network. However, SMG quickly catches up to Adam and demonstrates good performance at the end of the training process.
[Figure 1: Training loss of SMG, SGD, SGD-M, and Adam versus the number of effective passes on FashionMNIST and CIFAR-10.]
For the nonconvex logistic regression problem, our results are reported in Figure 2. For the two datasets tested in our experiments, our algorithm performs significantly better than SGD and Adam, and slightly better than SGD with momentum.
[Figure 2: Training loss of SMG, SGD, SGD-M, and Adam on the nonconvex logistic regression problem for w8a and ijcnn1.]
5.3 The Choice of Hyper-parameter
Since the hyper-parameter $\beta$ plays a critical role in the proposed SMG method, our next experiment investigates how the algorithm depends on $\beta$ while using the same constant learning rate.
Results. Our results are presented in Figure 3. We can observe from this figure that in the early stage of the training process, the choice $\beta = 0.5$ gives comparably good performance relative to smaller values. This choice also yields the best training loss at the end of the training process. However, the differences are not really significant, suggesting that SMG is fairly robust to the choice of the momentum weight $\beta$ in the range $[0.1, 0.5]$.
[Figure 3: Training loss of SMG for different values of $\beta$ on FashionMNIST and CIFAR-10.]
For the nonconvex logistic regression problem, the same choice of $\beta$ also yields similar outcomes on the two datasets w8a and ijcnn1, as shown in Figure 4. We have also experimented with larger values of $\beta$; however, these choices do not lead to good performance, and we therefore omit them here. This finding raises an open question on the optimal choice of $\beta$. Our empirical study shows that values of $\beta$ in $[0.1, 0.5]$ work reasonably well, and $\beta = 0.5$ seems to be the best in our tests.
[Figure 4: Training loss of SMG for different values of $\beta$ on w8a and ijcnn1.]
5.4 Different Learning Rate Schemes
Our last experiment is to examine the effect of different learning rate variants on the performance of our SMG method, i.e. Algorithm 1.
Results. We conduct this test using four different learning rate variants: constant, diminishing, exponential decay, and cosine annealing learning rates. Our results are reported in Figure 5 and Figure 6. From Figure 5, we can observe that the cosine scheduled and the diminishing learning rates converge relatively fast at the early stage. However, the exponential decay and the constant learning rates make faster progress in the last epochs and tend to give the best result at the end of the training process in our neural network training experiments.
[Figure 5: Training loss of SMG under constant, diminishing, exponential, and cosine learning rates on FashionMNIST and CIFAR-10.]
For the nonconvex logistic regression problem, Figure 6 shows that the cosine learning rate has certain advantages over the other choices after a few dozen epochs.
[Figure 6: Training loss of SMG under constant, diminishing, exponential, and cosine learning rates on w8a and ijcnn1.]
We note that the detailed settings and additional experiments in this section can be found in the Supp. Doc.
6 Conclusions and Future Work
We have developed two new shuffling gradient algorithms with momentum for solving nonconvex finite-sum minimization problems. Our Algorithm 1 is novel and can work with any shuffling strategy, while Algorithm 2 is similar to existing momentum methods and uses a single shuffling strategy. Our methods achieve the state-of-the-art $\mathcal{O}(T^{-2/3})$ convergence rate under standard assumptions using different learning rates and shuffling strategies. When a randomized reshuffling strategy is exploited in Algorithm 1, we can further improve our rate by a factor of $n^{1/3}$ in the data size $n$, matching the best known results for non-momentum shuffling methods. Our numerical results also show encouraging performance of the new methods.
We believe that our analysis framework can be extended to study non-asymptotic rates for some recent adaptive SGD schemes such as Adam (Kingma & Ba, 2014) and AdaGrad (Duchi et al., 2011) as well as variance-reduced methods such as SARAH (Nguyen et al., 2017) under shuffling strategies. It is also interesting to extend our framework to other problems such as minimax and federated learning. We will further investigate these opportunities in the future.
Acknowledgements
The authors would like to thank the reviewers for their useful comments and suggestions which helped to improve the exposition in this paper. The work of Q. Tran-Dinh has partly been supported by the Office of Naval Research under Grant No. ONR-N00014-20-1-2088 (2020-2023).
References
- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
- Ahn et al. (2020) Ahn, K., Yun, C., and Sra, S. Sgd with shuffling: optimal rates without component convexity and large epoch requirements. arXiv preprint arXiv:2006.06946, 2020.
- Bertsekas (2011) Bertsekas, D. Incremental proximal methods for large scale convex optimization. Math. Program., 129(2):163–195, 2011.
- Bottou (2009) Bottou, L. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, Paris, volume 8, pp. 2624–2633, 2009.
- Bottou (2012) Bottou, L. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421–436. Springer, 2012.
- Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev., 60(2):223–311, 2018.
- Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Chen et al. (2018) Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. ICLR, 2018.
- Chollet et al. (2015) Chollet, F. et al. Keras, 2015. URL https://github.com/fchollet/keras.
- Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
- Dozat (2016) Dozat, T. Incorporating Nesterov momentum into ADAM. ICLR Workshop, 1:2013–2016, 2016.
- Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- Ghadimi & Lan (2013) Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
- Gürbüzbalaban et al. (2019) Gürbüzbalaban, M., Ozdaglar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, Oct 2019. ISSN 1436-4646. doi: 10.1007/s10107-019-01440-w. URL http://dx.doi.org/10.1007/s10107-019-01440-w.
- Haochen & Sra (2019) Haochen, J. and Sra, S. Random shuffling beats sgd after finite epochs. In International Conference on Machine Learning, pp. 2624–2633. PMLR, 2019.
- Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. ADAM: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), abs/1412.6980, 2014.
- Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Li et al. (2020) Li, L., Zhuang, Z., and Orabona, F. Exponential step sizes for non-convex optimization. arXiv preprint arXiv:2002.05273, 2020.
- Li et al. (2019) Li, X., Zhu, Z., So, A., and Lee, J. D. Incremental methods for weakly convex optimization. arXiv preprint arXiv:1907.11687, 2019.
- Loshchilov & Hutter (2017) Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts, 2017.
- Meng et al. (2019) Meng, Q., Chen, W., Wang, Y., Ma, Z.-M., and Liu, T.-Y. Convergence analysis of distributed stochastic gradient descent with shuffling. Neurocomputing, 337:46–57, 2019.
- Mishchenko et al. (2020) Mishchenko, K., Khaled Ragab Bayoumi, A., and Richtárik, P. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33, 2020.
- Nagaraj et al. (2019) Nagaraj, D., Jain, P., and Netrapalli, P. Sgd without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pp. 4703–4711, 2019.
- Nedić & Bertsekas (2001) Nedić, A. and Bertsekas, D. Convergence rate of incremental subgradient algorithms. In Stochastic optimization: algorithms and applications, pp. 223–264. Springer, 2001.
- Nedic & Bertsekas (2001) Nedic, A. and Bertsekas, D. P. Incremental subgradient methods for nondifferentiable optimization. SIAM J. on Optim., 12(1):109–138, 2001.
- Nesterov (1983) Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence $\mathcal{O}(1/k^2)$. Doklady AN SSSR, 269:543–547, 1983. Translated as Soviet Math. Dokl.
- Nesterov (2004) Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
- Nguyen et al. (2018) Nguyen, L., Nguyen, P. H., van Dijk, M., Richtarik, P., Scheinberg, K., and Takac, M. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the 35th International Conference on Machine Learning-Volume 80, pp. 3747–3755, 2018.
- Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2613–2621. JMLR. org, 2017.
- Nguyen et al. (2019) Nguyen, L. M., Nguyen, P. H., Richtárik, P., Scheinberg, K., Takáč, M., and van Dijk, M. New convergence aspects of stochastic gradient algorithms. Journal of Machine Learning Research, 20(176):1–49, 2019. URL http://jmlr.org/papers/v20/18-759.html.
- Nguyen et al. (2020) Nguyen, L. M., Tran-Dinh, Q., Phan, D. T., Nguyen, P. H., and van Dijk, M. A unified convergence analysis for shuffling-type gradient methods. arXiv preprint arXiv:2002.08246, 2020.
- Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
- Polyak (1964) Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
- Rajput et al. (2020) Rajput, S., Gupta, A., and Papailiopoulos, D. Closing the convergence gap of sgd without replacement. In International Conference on Machine Learning, pp. 7964–7973. PMLR, 2020.
- Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
- Ruder (2017) Ruder, S. An overview of gradient descent optimization algorithms, 2017.
- Safran & Shamir (2020) Safran, I. and Shamir, O. How good is sgd with random shuffling? In Conference on Learning Theory, pp. 3250–3284. PMLR, 2020.
- Shamir (2016) Shamir, O. Without-replacement sampling for stochastic gradient methods. In Advances in neural information processing systems, pp. 46–54, 2016.
- Smith (2017) Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. IEEE, 2017.
- Tran-Dinh et al. (2019) Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920, 2019.
- Wang et al. (2020) Wang, B., Nguyen, T. M., Bertozzi, A. L., Baraniuk, R. G., and Osher, S. J. Scheduled restart momentum for accelerated stochastic gradient descent. arXiv preprint arXiv:2002.10583, 2020.
- Wang et al. (2019) Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. Spiderboost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems, 32:2406–2416, 2019.
- Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- Ying et al. (2017) Ying, B., Yuan, K., and Sayed, A. H. Convergence of variance-reduced stochastic learning under random reshuffling. arXiv preprint arXiv:1708.01383, 2(3):6, 2017.
Supplementary Document:
SMG: A Shuffling Gradient-Based Method with Momentum
7 Technical Proofs of Theorem 1 and Its Consequences in Section 3
7.1 Auxiliary Notation and Common Expressions
Remark 4.
We use the superscript "$(t)$" for the epoch counter, and the subscript "$i$" for the counter of the inner loop within the $t$-th epoch. In addition, at each epoch $t$, we fix the learning rate $\eta_i^{(t)} := \frac{\eta_t}{n}$ for a given $\eta_t > 0$.

In this Supp. Doc., we repeatedly use some common notation to present the analysis for three different shuffling strategies: randomized reshuffling, shuffle once, and incremental gradient. We introduce this notation here.
For any , , generated by Algorithm 1, and two permutations and of generated at epochs and , we denote
(11) | |||
(12) | |||
(13) | |||
(14) | |||
(15) | |||
(16) |
and
(17) |
Note that we adopt the convention , and in the definitions of and .
For each epoch $t$, we denote by $\mathcal{F}_t$ the $\sigma$-algebra generated by the iterates of Algorithm 1 up to the beginning of epoch $t$. From Step 4 of Algorithm 1, we observe that the permutation $\pi^{(t)}$ used at epoch $t$ is independent of the $\sigma$-algebra $\mathcal{F}_t$. This is the key fact in our analysis of randomized reshuffling methods. We also denote by $\mathbb{E}_t[\cdot] := \mathbb{E}[\cdot \mid \mathcal{F}_t]$ the conditional expectation with respect to the $\sigma$-algebra $\mathcal{F}_t$.
7.2 The Proof Sketch of Theorem 1
Due to the technicality of our proofs, we first provide here a proof sketch of our first main result, Theorem 1, while the full proof is given in the next subsections.
The proof of Theorem 1 is divided into the following steps.
- First, from the $L$-smoothness of $F$ and the update of $\tilde w_t$, we can bound
where
Hence, the key step here is to upper bound by .
- •
- •
- We further upper bound the sum of the right-hand sides of the last estimate via Lemma 5.
Combining these steps together, and using some simplification, we obtain (6) in Theorem 1.
7.3 Technical Lemmas
The proof of Theorem 1 requires the following lemmas as intermediate steps of its proof. Let us first upper bound and in the following lemma.
Lemma 1.
Let be generated by Algorithm 1 with and for every . Then, for , we have
(21) | ||||
(22) |
Next, we upper bound the quantities , , and defined above in terms of and .
One of our key estimates is to upper bound the quantity , which is derived in the following lemma.
Lemma 4.
Under the same conditions as in Lemma 3, for any , we have
(25) |
Lemma 6.
Under the same setting as in Theorem 1, we have
(26) |
7.4 The Proof of Theorem 1: Key Estimate for Algorithm 1
First, from the assumption , , we have . Next, from (17), we have for and , which lead to for . Moreover, from the definition of in Theorem 1, we have and for
Now, letting in Equation (19), for all , we have
Since for (due to (18)), we have the following update
(27) |
From the $L$-smoothness of $F$ in Assumption 1(b), (27), and the fact that , for every epoch , we can derive
(28) |
where (a) follows from the elementary equality and the last equality comes from the fact that , due to the choice of our learning rate .
Next, we use an interesting fact of the permutations and to rewrite the full gradient as
Let us upper bound the last term of (28) as follows:
where (b) is from the Cauchy-Schwarz inequality, (c) follows from the convexity of for .
Applying this into (28), we get the following result:
(29) |
Remark 5.
Applying the result of Lemma 4 for , then we obtain
which leads to
Since satisfies as proved above, we can deduce from the last estimate that
Rearranging this inequality and noting that , for , we obtain the following estimate:
For , since as previously defined in (17), from the result of Lemma 6 we have
Summing the previous estimate for to , and then using the last one, we obtain
7.5 The Proof of Corollaries 1 and 2: Constant and Diminishing Learning Rates
7.6 The Proof of Corollaries 3 and 4: Scheduled Learning Rates
The proof of Corollary 3: Exponential LR.
For , since where , we have . We also have for all and
Substituting these expressions into (6), we obtain
which is our desired result. ∎
The proof of Corollary 4: Cosine LR.
First, we would like to show that . Let us denote . Multiplying this sum by , we get
where (a) comes from the identity . Since is nonzero, we obtain
For , since , we have . We also have for all and
where follows since as shown above. Substituting these expressions into (6), we obtain
which is our desired estimate. ∎
7.7 The Proof of Technical Lemmas
We now provide the proof of six lemmas that serve for the proof of Theorem 1 above.
7.7.1 Proof of Lemma 1: Upper Bounding Two Terms and
Remark 6.
(a) From Equation (19), for and , we have
Moreover, by (17), we have for all . Therefore, for and , we can derive that
where (a) follows from the convexity of and .
(b) Using a similar argument for the second equation (20), for and , we can derive that
Note again that , for and ; we can show that
where in (b) we use again the convexity of and .
7.7.2 Proof of Lemma 2: Upper Bounding The Terms , , and
(a) Let us first upper bound the term defined by (11). Indeed, for and , we have
where we use the Cauchy-Schwarz inequality in (a), and (b) follows from Assumption 1(c).
7.7.3 Proof of Lemma 3: Upper Bounding The Terms , , and
Applying Lemma 1 to (31), for , we have
where (a) follows from the result (23) in Lemma 2 and the fact that .
(b) Using a similar argument as above, we can derive that
Applying Lemma 1 to the above estimate, we get
where (b) follows from the results (23), (24) in Lemma 2 and the fact that .
Now applying Lemma 1 for the special case , we obtain
where we use the result (24) from Lemma 2 and the fact that .
(c) First, from the result of Part (a), for , we have
Since (due to the choice of our learning rate) and , we further get
which is equivalent to
Adding to both sides of this inequality, for , we get
(33) |
Finally, for , since and , we have
which is equivalent to
(34) |
This leads to our desired result for as
(35) |
Combining two cases, we obtain the desired result in Part (c).
7.7.4 Proof of Lemma 4: Upper Bounding The Key Quantity
First, we analyze the case using the results of Lemma 3 as follows:
(36) |
where the last two lines follow since and for .
Note that this bound is also true for the case :
apply (35) |
Substituting this inequality into , for , we get
Now we analyze similarly for the case as follows:
where the last line follows since and .
Combining these two cases, we have the following estimates:
where the last line follows since for every . Hence, we have proved (25).
7.7.5 The Proof of Lemma 5
Using the assumption that , we can derive the following sum as
where the last inequality follows since for .
7.7.6 The Proof of Lemma 6
We analyze the special case as follows. First, letting in Equation (19), for , we have
For , since , we have
(37) |
Using the -smoothness of in Assumption 1(b), , and (37), we have
(38) |
where (a) follows from the equality and the last equality comes from the fact that .
Applying this into (38) we get the following estimate
(39) |
Remark 7.
Applying Lemma 3, we further have
Noting that and , we get
Rearranging the terms and dividing both sides by , we have
Since satisfies as above, we can deduce from the last estimate that
which proves (26).
8 Convergence Analysis of Algorithm 1 for Randomized Reshuffling Strategy
In this section, we present the convergence results of Algorithm 1 for the randomized reshuffling strategy. Since this analysis is similar to that of Theorem 1, we will highlight the similarities and discuss the differences between the two analyses.
8.1 The Proof Sketch of Theorem 2
The proof of Theorem 2 is similar to that of Theorem 1, except that the key quantity is upper bounded by a different term. The proof of Theorem 2 is divided into the following steps.
- •
- •
- We further upper bound the sum of the right-hand sides of the last estimate via Lemma 5.
Combining these steps together, and using some simplification, we obtain (9) in Theorem 2.
8.2 Technical Lemmas
In this subsection, we introduce some intermediate results used in the proof of Theorem 2. Lemmas 7 and 8 play a similar role to Lemmas 2 and 3 used in the proof of Theorem 1.
Lemma 7.
We will present below Lemmas 9 and 10, which play a similar role as previous Lemmas 4 and 6 in Theorem 1.
Lemma 9.
Under the same conditions as in Lemma 8, for any , we have
(42) |
Lemma 10.
Under the same setting as in Theorem 2, we have
(43) |
8.3 The Proof of Theorem 2: Key Estimate for Algorithm 1
First, from the assumption , , we have . Next, from (17), we have for and , which lead to for . Moreover, from the definition of in Theorem 2, we have and for
Taking expectation of both sides and applying Lemma 9, we get:
Since satisfies as proved above, we can deduce from the last estimate that
Rearranging this inequality while noting that and we obtain that for :
Summing the previous estimate for to , and then using the last one, we obtain
Applying the result of Lemma 5 to the last estimate, we get
which is equivalent to
8.4 The Proof of Corollaries 5 and 6: Constant and Diminishing Learning Rates
8.5 The Proof of Technical Lemmas
We now provide the proof of additional Lemmas that serve for the proof of Theorem 2 above.
8.5.1 Proof of Lemma 7: Upper Bounding The Terms , , and
In this proof, we will use Lemma 1 in (Mishchenko et al., 2020) for sampling without replacement. For ease of reference, we recall it here.

Lemma 11 (Lemma 1 in (Mishchenko et al., 2020)).

Let $x_1, \dots, x_n \in \mathbb{R}^d$ be fixed vectors, $\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i$ be their average, and $\sigma^2 := \frac{1}{n}\sum_{i=1}^{n} \|x_i - \bar{x}\|^2$ be the population variance. Fix any $k \in \{1, \dots, n\}$, let $x_{\pi_1}, \dots, x_{\pi_k}$ be sampled uniformly without replacement from $\{x_1, \dots, x_n\}$ and $\bar{x}_\pi$ be their average. Then, the sample average and its variance are given, respectively, by
$$\mathbb{E}[\bar{x}_\pi] = \bar{x}, \qquad \mathbb{E}\big[\|\bar{x}_\pi - \bar{x}\|^2\big] = \frac{n-k}{k(n-1)}\, \sigma^2.$$
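As a quick numerical sanity check of this lemma (not part of the original proof), one can verify the variance formula by Monte-Carlo simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d, trials = 10, 4, 3, 100_000
x = rng.normal(size=(n, d))                          # fixed vectors
x_bar = x.mean(axis=0)
sigma2 = np.mean(np.sum((x - x_bar) ** 2, axis=1))   # population variance

# Monte-Carlo estimate of E|| mean of k without-replacement samples - x_bar ||^2
est = np.mean([
    np.sum((x[rng.choice(n, size=k, replace=False)].mean(axis=0) - x_bar) ** 2)
    for _ in range(trials)
])
print(est, (n - k) / (k * (n - 1)) * sigma2)         # the two values agree
```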
Using this result, we now prove Lemma 7 as follows.
(a) Let us first upper bound the term defined by (11). Indeed, for and , we have
where we have used the Cauchy-Schwarz inequality in (a).
Now taking expectation conditioned on , we get
Note that for every index . Hence, we have
where we apply the sample variance Lemma from (Mishchenko et al., 2020) and (b) follows from Assumption 1(c).
Letting in the above estimate and taking total expectation, we obtain the first estimate of Lemma 7, i.e.:
Now we are ready to calculate the sum as
where we use the facts that and .
Taking total expectation, we have the second estimate of Lemma 7.
(b) Using a similar argument as above, for and , we can derive
where we use again the Cauchy-Schwarz inequality in (a).
Taking expectation conditioned on , we get
Note that for every index . Hence
where we apply again the sample variance Lemma from (Mishchenko et al., 2020) and (b) follows from Assumption 1(c). Finally, for all , we calculate the sum similarly as follows
Taking total expectation, we get (41).
8.5.2 Proof of Lemma 8: Upper Bounding The Terms , , and
Applying Lemma 1 to (44), for , we have
where (a) follows from the fact that . Taking expectation and applying result (40) in Lemma 7 we get
(b) Using a similar argument as above, we can derive that
Applying Lemma 1 for to the above estimate, we get
where the last line follows from the fact that . Similar to the previous part, taking total expectation we get
Now applying Lemma 1 for the special case , we obtain
Therefore, we have
where we have used the result (41) from Lemma 7 and the fact that .
(c) First, from the result of Part (a), for , we have
Since (due to the choice of our learning rate) and , we further get
which is equivalent to
Adding to both sides of this inequality, for , we get
(46) |
Finally, for , since and , we have
which is equivalent to
(47) |
This leads to our desired result for as
(48) |
Combining two cases, we obtain the desired result in Part (c).
8.5.3 Proof of Lemma 9: Upper Bounding The Key Quantity
First, we analyze the case using the results of Lemma 8 as follows:
(49) |
where the last line follows since and for .
Note that this bound is also true for the case :
apply (48) |
Substituting this inequality into , for , we get
Now we analyze similarly for the case as follows:
where the last line follows since and .
Combining these two cases, we have the following estimate
where the last line follows since for every . Hence, we have proved (42).
8.5.4 The Proof of Lemma 10
First, from the assumption , , we have . Next, from (17), we have for and , which lead to for . Moreover, from the definition of in Theorem 2, we have and for
Similar to the estimate (39) in Theorem 2, we get the following result:
Taking total expectation and applying Lemma 8 we further have:
Noting that and , we get
Rearranging the terms and dividing both sides by , we have
Since satisfies as above, we can deduce from the last estimate that
which proves (43).
9 The Proof of Technical Results in Section 4
This section provides the full proofs of the results in Section 4. Before proving Theorem 3, we introduce some common quantities and provide four technical lemmas.
9.1 Technical Lemmas
For the sake of our analysis, we introduce the following quantities:
(50) | |||
(51) | |||
(52) | |||
(53) |
The proof of Theorem 3 requires the following lemmas as intermediate steps of its proof.
Lemma 12.
Lemma 13.
Lemma 14.
Lemma 15.
Under the same setting as in Theorem 3, the initial objective value is upper bounded by
(58) |
9.2 The Proof of Theorem 3: Key Bound for Algorithm 2
From the update at Step 7 and Step 9 of Algorithm 2 with , for , we have
(59) |
Now, letting in the estimate and noting that for all , we obtain
(60) |
From this update and the $L$-smoothness of $F$ in Assumption 1(b), for , we can derive
where (a) comes from the convexity of squared norm: for , (b) is from the inequalities and noted in Lemma 12, and (c) follows from the fact that .
Next, we focus on processing the term . We can further upper bound the above expression as
(61) |
where . Here, (d) follows from the equality , and (e) comes from the fact that .
Note that and , we can rewrite
(62) |
Recalling the definitions of , , and from (53), (50), and (51), respectively, we can bound as
where the last line comes from the Cauchy-Schwarz inequality. We normalize the last squared norms as follows:
where the last estimate follows from Lemma 13 with for . Substituting the last estimate into (61) for , we obtain
Since and , this inequality leads to
Now, using the fact that , we can derive the following statement for :
Summing the previous statement for , and using the last one, we can deduce that
Finally, dividing both sides of this estimate by , we obtain (10).
9.3 The Proof of Corollaries 7 and 8: Constant and Diminishing Learning Rates
The proof of Corollary 7.
9.4 The Proof of Technical Lemmas
We now provide the full proof of lemmas that serve for the proof of Theorem 3 above.
9.4.1 The Proof of Lemma 12
9.4.2 The Proof of Lemma 13
Using the -smoothness assumption of , we derive the following for :
Now from the update in Algorithm 2 with , for and , we have
Therefore, for and , using these expressions and (54), we can bound
Substituting these inequalities into the first two expressions, for and , we get
which proves (56), where the last line follows from the facts that and .
9.4.3 The Proof of Lemma 14
The first statement follows directly from the momentum update rule, i.e.:
where (a) follows from the rule .
Next, summing up this expression from to and averaging, we get
(63) |
where
Now we consider the term in the last expression:
9.4.4 The Proof of Lemma 15: Upper Bounding The Initial Objective Value
Substituting and into this estimate, we obtain (58).
10 Detailed Implementation and Additional Experiments
In this Supp. Doc., we explain the detailed hyper-parameter tuning strategy used in Section 5 and provide additional experiments for our proposed methods.
10.1 Comparing SMG with Other Methods
We compare our SMG algorithm with Stochastic Gradient Descent (SGD) and two other methods: SGD with Momentum (SGD-M) (Polyak, 1964) and Adam (Kingma & Ba, 2014). To have a fair comparison, the randomized reshuffling strategy is applied to all methods. We tune each algorithm with a constant learning rate using a grid search and select the hyper-parameters that perform best. We report additional results on the squared norm of the gradients for this experiment in Figure 7. The hyper-parameter tuning strategy for each method is given below:
- For SGD, we use a coarse searching grid of learning rate values in the first stage, followed by a fine grid centered at the best value obtained from the first stage.
- For our new SMG algorithm, we fixed the parameter $\beta := 0.5$ since it usually performs best in our experiments. We use a coarse searching grid in the first stage, followed by a fine grid centered at the best value from the first stage. Note that our SMG algorithm may work with a larger learning rate than the traditional SGD algorithm.
- For SGD-M, we update the weights using the following rule:
$$m_{i+1}^{(t)} := \gamma\, m_i^{(t)} + g_i^{(t)}, \qquad w_{i+1}^{(t)} := w_i^{(t)} - \eta\, m_{i+1}^{(t)},$$
where $g_i^{(t)}$ is the $i$-th gradient at epoch $t$. Note that this momentum update is the one implemented in PyTorch, with the widely used value $\gamma := 0.9$. Hence, we choose this setting for SGD-M, and we tune the learning rate using two coarse-then-fine searching grids as in the SGD algorithm (a PyTorch usage sketch is given after this list).
- For Adam, we fixed the two hyper-parameters $\beta_1 := 0.9$ and $\beta_2 := 0.999$ as in the original paper. Since the default learning rate for Adam is 0.001, we let our coarse searching grid contain values around this default, with a fine grid centered at the best value from the first stage. We note that since the best learning rate for Adam is usually close to the default, its hyper-parameter tuning process requires less effort than for the other algorithms in our experiments.
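For completeness, the standard PyTorch constructions corresponding to these baselines are shown below; the learning-rate values are placeholders for the tuned ones.

```python
import torch
from torch import nn

model = nn.Linear(20, 2)  # placeholder model

# SGD, SGD with momentum (PyTorch's m <- mu*m + g convention), and Adam.
sgd   = torch.optim.SGD(model.parameters(), lr=0.1)
sgd_m = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
adam  = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```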
[Figure 7: Squared norm of the gradients of SMG, SGD, SGD-M, and Adam on FashionMNIST, CIFAR-10, w8a, and ijcnn1.]
10.2 The Choice of Hyper-parameter
As presented in Section 5, in this experiment we investigate the sensitivity of our proposed SMG algorithm to the hyper-parameter $\beta$. We choose a constant learning rate for each dataset and run the algorithm for different values of $\beta$ in a linear grid. Values of $\beta$ outside the reported range do not lead to good performance, and we therefore omit them from our results. Figure 8 compares the performance of SMG for different values of $\beta$ on the FashionMNIST, CIFAR-10, w8a, and ijcnn1 datasets. Our empirical study shows that values of $\beta$ in $[0.1, 0.5]$ work reasonably well, and $\beta = 0.5$ seems to be the best in our tests.
[Figure 8: Training loss of SMG for different values of $\beta$ on FashionMNIST, CIFAR-10, w8a, and ijcnn1.]
10.3 Different Learning Rate Schemes
As presented in Section 5, in the last experiment we examine the performance of our SMG method under four different learning rate schemes. We choose the hyper-parameter $\beta := 0.5$ since it tends to give the best results in our tests. The additional results on the squared gradient norm are reported in Figure 9.
- Constant learning rate: similarly to the first subsection, we use a coarse searching grid in the first stage and a fine grid centered at the best value from the first stage.
- Cosine annealing learning rate: for a fixed number of epochs $T$, the cosine scheme needs one hyper-parameter. We choose a coarse grid and then a fine grid centered at the best value found in the first stage.
- Diminishing learning rate: the diminishing scheme requires two hyper-parameters. We first fix a searching grid for one of them; for the other, we set its searching grid so that the initial learning rate lies in the coarse grid used for the constant case, and then refine with a fine grid centered at the best value of the coarse grid.
- Exponential learning rate: we choose two hyper-parameters for the exponential scheme. First, we pick a searching grid for the decay rate. Then we set the searching grid for the initial value similarly to the diminishing case (such that the initial learning rate lies in the coarse grid, followed by a second grid centered at the best value of the first).
[Figure 9: Squared norm of the gradients of SMG under the four learning rate schemes on FashionMNIST, CIFAR-10, w8a, and ijcnn1.]
10.4 Additional Experiment with Single Shuffling Scheme
Algorithm 2 (Single Shuffling Momentum Gradient, SSMG) uses the single shuffling strategy, which covers the incremental gradient method as a special case. Therefore, in this experiment we compare our proposed methods (SMG and SSMG) with SGD, SGD with Momentum (SGD-M), and Adam using the single shuffling scheme.
[Figures: training loss and squared gradient norm of SMG, SSMG, SGD, SGD-M, and Adam under the single shuffling scheme on FashionMNIST, CIFAR-10, w8a, and ijcnn1.]
For the SSMG algorithm, the hyper-parameter $\beta$ is chosen from a grid of values. For the learning rate, we use a coarse searching grid followed by a fine grid centered at the best value from the first stage. For the other methods, the hyper-parameter tuning strategy is similar to the settings in Subsection 10.1.