SLowcal-SGD: Slow Query Points Improve Local-SGD for Stochastic Convex Optimization
Abstract
We consider distributed learning scenarios where machines interact with a parameter server along several communication rounds in order to minimize a joint objective function. Focusing on the heterogeneous case, where different machines may draw samples from different data distributions, we design the first local update method that provably benefits over the two most prominent distributed baselines: namely Minibatch-SGD and Local-SGD. Key to our approach is a slow querying technique that we customize to the distributed setting, which in turn enables a better mitigation of the bias caused by local updates.
1 Introduction
Federated Learning (FL) is a framework that enables huge-scale collaborative learning among a large number of heterogeneous clients (or machines); heterogeneous here refers to the data of each client, whose statistical properties may vary between different clients. FL may potentially promote fairness among participants, by allowing clients with small-scale datasets to participate in the learning process and affect the resulting model. Additionally, participants are not required to directly share data, which may improve privacy. For these reasons, FL has gained popularity in the past years, and found use in applications like voice recognition (APP, 2019; GGL, 2021), fraud detection (INT, 2020), drug discovery (MEL, 2020), and more (Wang et al., 2021a).
The two most prominent algorithmic approaches towards federated learning are Minibatch-SGD
(Dekel et al., 2012) and Local-SGD (a.k.a. Federated-Averaging) (Mangasarian and Solodov, 1993; McMahan et al., 2017; Stich, 2019).
In Minibatch-SGD all machines (or clients) always compute unbiased gradient estimates of the same query points, while using large
batch sizes; and it is well known that this approach is not degraded due to data heterogeneity (Woodworth et al., 2020b). On the downside, the number of model updates made by Minibatch-SGD may be considerably smaller than the number of gradient queries made by each machine, which is due to the use of minibatches. This suggests that there may be room to improve over this approach by employing local update methods like
Local-SGD, where the number of model updates and the number of gradient queries are the same.
And indeed, in the past years, local update methods have been extensively investigated, see e.g. (Kairouz et al., 2021) and references therein.
We can roughly divide the research on FL into two scenarios: the homogeneous case, where it is assumed that the data on each machine is drawn from the same distribution; and the more realistic heterogeneous case, where data distributions may vary between machines.
[Table 1 (Method | Rate): comparison of Minibatch-SGD, Local-SGD, SLowcal-SGD (this work), and accelerated Minibatch-SGD for the heterogeneous SCO setting; see Appendix A for the computations behind the table.]
For the homogeneous case it was shown in (Woodworth et al., 2020a; Glasgow et al., 2022) that the standard Local-SGD method is not superior to Minibatch-SGD. Nevertheless, (Yuan and Ma, 2020) have designed an accelerated variant of Local-SGD that provably benefits over the Minibatch baseline. These results are established for the fundamental Stochastic Convex Optimization (SCO) setting, which assumes that the learning objective is convex.
Similarly to the homogeneous case, it was shown in (Woodworth et al., 2020b; Glasgow et al., 2022) that Local-SGD is not superior to Minibatch-SGD in heterogeneous scenarios. Nevertheless, several local approaches that compare with the Minibatch baseline were designed in (Karimireddy et al., 2020b; Gorbunov et al., 2021). Unfortunately, we have so far been missing a local method that provably benefits over the Minibatch baseline in the heterogeneous SCO setting.
Our work focuses on the latter heterogeneous SCO setting, and provides a new Local-SGD-style algorithm that provably benefits over the minibatch baseline. Our algorithm, named SLowcal-SGD, builds on customizing a recent technique for incorporating a slowly-changing sequence of query points (Cutkosky, 2019; Kavis et al., 2019), which in turn enables us to better mitigate the bias induced by the local updates. Curiously, we also found importance weighting to be crucial in order to surpass the minibatch baseline.
In Table 1 we compare our results to the state-of-the-art methods for the heterogeneous SCO setting. We denote by $M$ the number of machines, by $K$ the number of local updates per round, and by $R$ the number of communication rounds; additionally, $\zeta_*$ (or $\zeta$) measures the dissimilarity between machines (see Section 2). Our table shows that Local-SGD requires many more communication rounds than Minibatch-SGD, and that the dissimilarity substantially degrades its performance. Conversely, one can see that even in the presence of dissimilarity, our approach SLowcal-SGD still requires fewer communication rounds than Minibatch-SGD.
Similarly to the homogeneous case, accelerated-Minibatch-SGD (Dekel et al., 2012; Lan, 2012) obtains the best performance among all current methods, and it remains open whether one can outperform this accelerated minibatch baseline. In App. A we elaborate on the computations behind Table 1.
Related Work. We focus here on centralized learning problems, where we aim to employ $M$ machines in order to minimize a joint learning objective. We allow the machines to synchronize during communication rounds through a central machine called the Parameter Server (PS), and allow each machine to draw samples and perform local gradient computations in every such communication round. We assume that each machine $m$ may draw i.i.d. samples from a distribution $\mathcal{D}_m$, which may vary between machines.
The most natural approach in this context is Minibatch-SGD and its accelerated variant (Dekel et al., 2012), which have been widely adopted both in academia and in industry, see e.g. (Hoffer et al., 2017; Smith et al., 2017; You et al., 2019). Local update methods like Local-SGD (McMahan et al., 2017) have recently gained much popularity due to the rise of FL, and have been extensively explored in the past years.
Focusing on the SCO setting, it is well known that the standard Local-SGD is not superior (actually in most regimes it is inferior) to Minibatch-SGD (Woodworth et al., 2020a, b; Glasgow et al., 2022). Nevertheless, (Yuan and Ma, 2020) devised a novel accelerated local approach that provably surpasses the Minibatch baseline in the homogeneous case.
The heterogeneous SCO case has also been extensively investigated, with several original and elegant approaches (Koloskova et al., 2020; Khaled et al., 2020; Karimireddy et al., 2020b; Woodworth et al., 2020b; Mitra et al., 2021; Gorbunov et al., 2021; Mishchenko et al., 2022; Patel et al.). Nevertheless, so far we have been missing a local approach that provably benefits over Minibatch-SGD. Note that Mitra et al. (2021); Mishchenko et al. (2022) improve the communication complexity with respect to the condition number of the objective; however, their performance does not improve as we increase the number of machines (implying that such methods do not obtain a wall-clock speedup from adding machines), which is inferior to the minibatch baseline.
The heterogeneous non-convex setting was also extensively explored (Karimireddy et al., 2020b, a; Gorbunov et al., 2021); and the recent work of (Patel et al., ) has developed a novel algorithm that provably benefits over the minibatch baseline in this case. The latter work also provides a lower bound which demonstrates that their upper bound is almost tight. Finally, for the special case of quadratic loss functions, it was shown in (Woodworth et al., 2020a) and in (Karimireddy et al., 2020b) that it is possible to surpass the minibatch baseline.
It is important to note that, excluding the special case of quadratic losses, there does not exist a local update algorithm that provably benefits over accelerated-Minibatch-SGD (Dekel et al., 2012); this applies to both homogeneous and heterogeneous SCO problems.
Our local update algorithm utilizes a recent technique of employing slowly-changing query points in SCO problems (Cutkosky, 2019). The latter has been shown to be useful in designing universal accelerated methods (Kavis et al., 2019; Ene et al., 2021; Antonakopoulos et al., 2022), as well as in improving asynchronous training methods (Aviv et al., 2021).
2 Setting: Parallel Stochastic Optimization
We consider parallel stochastic optimization problems where the objective $f: \mathbb{R}^d \to \mathbb{R}$ is convex and of the following form,
$$f(x) := \frac{1}{M}\sum_{m=1}^{M} f_m(x), \quad \text{where } f_m(x) := \mathbb{E}_{z \sim \mathcal{D}_m}\left[f_m(x; z)\right].$$
Thus, the objective is an average of $M$ functions $\{f_m\}_{m=1}^{M}$, and each such $f_m$ can be written as an expectation over losses $f_m(\cdot; z)$, where the samples $z$ are drawn from a distribution $\mathcal{D}_m$ which is unknown to the learner. For ease of notation, in what follows we will not explicitly denote $z \sim \mathcal{D}_m$, but rather use $\mathbb{E}$ to denote the expectation w.r.t. all randomization.
We assume that there exist $M$ machines (computation units), and that each machine $m$ may independently draw samples from the distribution $\mathcal{D}_m$, and can therefore compute unbiased estimates of the gradients of $f_m$. Most commonly, we allow the machines to synchronize during communication rounds through a central machine called the Parameter Server (PS), and allow each machine to perform local computations in every such communication round.
We consider first-order optimization methods that iteratively employ samples and generate a sequence of query points, eventually outputting a solution. Our performance measure is the expected excess loss of the output solution relative to a global minimum $w^*$ of $f$ in $\mathbb{R}^d$, where the expectation is w.r.t. the randomization of the samples.
More concretely, at every computation step, each machine $m$ may draw a fresh sample $z \sim \mathcal{D}_m$ and compute a gradient estimate at a given point $x$ as follows, $g = \nabla f_m(x; z)$; note that $\mathbb{E}[g] = \nabla f_m(x)$, i.e., $g$ is an unbiased estimate of $\nabla f_m(x)$.
General Parallelization Scheme.
A general scheme for parallel stochastic optimization is described in Alg. 1, and a minimal code sketch appears below. The PS communicates with the machines along $R$ communication rounds. In every round $r$ the PS distributes an anchor point, which is a starting point for the local computations in that round. Based on the anchor, each machine performs $K$ local gradient computations based on i.i.d. draws from its distribution, and yields a message. At the end of round $r$ the PS aggregates the messages from all machines and updates the anchor point. Finally, after the last round, the PS outputs a solution, which is computed based on the anchor points.
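The following Python sketch illustrates this round-based scheme; the helpers `local_computation` and `aggregate` are an assumed interface standing in for the method-specific steps:

```python
# A minimal sketch of the generic round-based scheme of Alg. 1 (an assumed
# interface, not the paper's exact pseudocode): in each of R rounds the PS
# broadcasts an anchor point, every machine runs its local computation on its
# own data, and the PS aggregates the resulting messages into the next anchor.
import numpy as np

def parallel_scheme(x0, R, M, local_computation, aggregate):
    x = x0                   # anchor point held by the parameter server (PS)
    anchors = [x0]
    for r in range(R):
        # each machine m receives the anchor and returns a message
        messages = [local_computation(x, machine=m) for m in range(M)]
        x = aggregate(x, messages)      # PS update of the anchor point
        anchors.append(x)
    return anchors           # the output solution is computed from the anchors
```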
Ideally, one would hope that using $M$ machines in parallel will enable us to accelerate the learning process by a factor of $M$. A rich line of works has shown that this is indeed possible to some extent, depending on the problem parameters and on the parallelization algorithm.
Next, we describe the two most prominent approaches to first-order parallel optimization.
(i) Minibatch SGD:
In terms of Alg. 1, one can describe Minibatch-SGD as an algorithm in which the PS sends a weight vector $x_r$ in every round $r$ as the anchor point. Based on that anchor, each machine $m$ computes an unbiased gradient estimate based on $K$ independent samples from $\mathcal{D}_m$, i.e., $g_r^m := \frac{1}{K}\sum_{k=1}^{K} \nabla f_m(x_r; z_{r,k}^m)$, and communicates it as the message to the PS. The latter aggregates the messages and computes the next anchor point,
$$x_{r+1} = x_r - \eta \cdot \frac{1}{M}\sum_{m=1}^{M} g_r^m,$$
where $\eta$ is the learning rate of the algorithm. The benefit of this approach is that all machines always compute gradient estimates at the same anchor points $\{x_r\}_r$, which highly simplifies its analysis. On the downside, in this approach the number of gradient updates, $R$, is smaller than the number of stochastic gradient computations made by each machine, which is $KR$. This gives the hope that there is room to improve upon Minibatch-SGD by mending this issue. A minimal code sketch follows.
(ii) Local SGD:
In terms of Alg. 1, one can describe Local-SGD as an algorithm in which the PS sends a weight vector $x_r$ in every round $r$ as the anchor information. Based on the anchor, each machine $m$ performs a sequence of $K$ local gradient updates based on independent samples from $\mathcal{D}_m$ as follows,
$$w_{k+1}^m = w_k^m - \eta \nabla f_m(w_k^m; z_k^m), \quad \forall k \in \{0, \dots, K-1\},$$
where for all machines we initialize $w_0^m = x_r$, and $\eta$ is the learning rate of the algorithm. At the end of round $r$, each machine communicates $w_K^m$ as the message to the PS, and the latter computes the next anchor by averaging, $x_{r+1} = \frac{1}{M}\sum_{m=1}^{M} w_K^m$.
In Local-SGD the number of gradient steps is equal to the number of stochastic gradient computations made by each machine, which is $KR$. The latter suggests that such an approach may potentially surpass Minibatch-SGD. Nevertheless, this potential benefit is hindered by the bias that is introduced between different machines during the local updates. And indeed, as we show in Table 1, this approach is inferior to Minibatch-SGD in prevalent regimes. A minimal code sketch follows.
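The sketch below instantiates the scheme for Local-SGD, with the same assumed oracle `grad(x, m)` as above:

```python
# A minimal sketch of Local-SGD (the oracle `grad` is an assumed interface):
# each machine performs K local updates from the shared anchor, so its query
# points drift apart from the other machines'; the PS averages the iterates.
import numpy as np

def local_sgd(x0, R, K, M, grad, eta):
    x = x0
    for r in range(R):
        local_iterates = []
        for m in range(M):
            w = x.copy()                  # initialize from the round's anchor
            for _ in range(K):
                w = w - eta * grad(w, m)  # KR updates in total per machine
            local_iterates.append(w)
        x = np.mean(local_iterates, axis=0)   # PS averages the local models
    return x
```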
Assumptions. We assume that $f$ is convex, and that the $f_m$'s are $L$-smooth, i.e., for all $x, y \in \mathbb{R}^d$,
$$\|\nabla f_m(x) - \nabla f_m(y)\| \le L \|x - y\|.$$
We also assume that the variance of the gradient estimates is bounded, i.e., that there exists $\sigma > 0$ such that for all $x$ and $m$,
$$\mathbb{E}\|\nabla f_m(x; z) - \nabla f_m(x)\|^2 \le \sigma^2.$$
Letting $w^*$ be a global minimum of $f$, we assume there exists $\zeta_* \ge 0$ such that,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(w^*)\|^2 \le \zeta_*^2. \tag{1}$$
The above assumption, together with the smoothness and convexity, implies (see App. B),
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x)\|^2 \le 4L\left(f(x) - f(w^*)\right) + 2\zeta_*^2, \quad \forall x \in \mathbb{R}^d. \tag{2}$$
A stronger dissimilarity assumption that is often used in the literature is the following,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f(x)\|^2 \le \zeta^2, \quad \forall x \in \mathbb{R}^d. \tag{3}$$
Notation: For $n \in \mathbb{N}$ we denote $[n] := \{1, \dots, n\}$. For a sequence of non-negative weights $\{\alpha_t\}_t$ we denote $\alpha_{1:t} := \sum_{s=1}^{t} \alpha_s$.
3 Our Approach
Section 3.1 describes a basic (single machine) algorithmic template called Anytime-GD. Section 3.2 describes our SLowcal-SGD algorithm, which is a Local-SGD style algorithm in the spirit of Anytime GD. We describe our method in Alg. 2, and state its guarantees in Thm. 2.
3.1 Anytime GD
The standard GD algorithm computes a sequence of iterates and queries the gradients at these iterates. It was recently shown that one can design a GD-style scheme that computes a sequence of iterates $\{w_t\}_t$ yet queries the gradients at a different sequence $\{x_t\}_t$, which may be slowly-changing, in the sense that $\|x_{t+1} - x_t\|$ may be considerably smaller than $\|w_{t+1} - w_t\|$.
Concretely, the Anytime-GD algorithm (Cutkosky, 2019; Kavis et al., 2019), which we describe in Equations (4) and (5), employs a learning rate $\eta > 0$ and a sequence of non-negative weights $\{\alpha_t\}_t$. The algorithm maintains two sequences $\{w_t\}_t, \{x_t\}_t$ that are updated as follows,
$$w_{t+1} = w_t - \eta\, \alpha_t \nabla f(x_t), \tag{4}$$
and then,
$$x_{t+1} = \frac{\alpha_{1:t}\, x_t + \alpha_{t+1}\, w_{t+1}}{\alpha_{1:t+1}}. \tag{5}$$
It can be shown that the above implies $x_t = \frac{1}{\alpha_{1:t}}\sum_{s=1}^{t} \alpha_s w_s$, i.e., the $x_t$'s are weighted averages of the $w_t$'s. Thus, at every iterate $t$ the gradient is queried at $x_t$, which is a weighted average of past iterates, and then $w_{t+1}$ is updated similarly to GD with a weight $\alpha_t$ on the gradient $\nabla f(x_t)$. Moreover, at initialization $x_1 = w_1$.
Curiously, it was shown in Cutkosky (2019) that Anytime-GD obtains the same convergence rates as GD for convex loss functions (both smooth and non-smooth). It was further shown that one can employ a stochastic version (Anytime-SGD), where we query noisy gradients at $x_t$ instead of the exact ones, and that this approach performs similarly to SGD. A minimal sketch follows.
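The sketch below implements single-machine Anytime-SGD per Equations (4)-(5), assuming a stochastic gradient oracle `grad(x)` and a weight function `alpha(t)` (both assumed interfaces):

```python
# A minimal sketch of Anytime-SGD: gradients are queried at the running
# weighted average x_t of the iterates w_t, not at the iterates themselves.
import numpy as np

def anytime_sgd(w1, T, grad, eta, alpha):
    w, x = w1.copy(), w1.copy()      # at initialization x_1 = w_1
    cum = alpha(1)                   # running sum alpha_{1:t}
    for t in range(1, T):
        g = grad(x)                           # query at the averaged point x_t
        w = w - eta * alpha(t) * g            # Eq. (4): weighted gradient step
        cum_next = cum + alpha(t + 1)
        x = (cum * x + alpha(t + 1) * w) / cum_next   # Eq. (5): update average
        cum = cum_next
    return x
```

With `alpha = lambda t: 1` this queries at the uniform average of past iterates; with `alpha = lambda t: t` more mass is put on recent iterates.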
Slowly changing query points. A recent work (Aviv et al., 2021) demonstrates that if we use projected Anytime-SGD, i.e., project the sequence $\{w_t\}_t$ to a given bounded convex domain, then one can immediately show that for both $\alpha_t = 1$ and $\alpha_t = t$ we obtain $\|x_{t+1} - x_t\| \le O(D/t)$, where $D$ is the diameter of the convex domain. Conversely, for standard SGD we have $\|x_{t+1} - x_t\| = \eta \|g_t\|$, where here $g_t$ is a (possibly noisy) unbiased estimate of $\nabla f(x_t)$. Thus, while the change between consecutive SGD queries is controlled by the learning rate $\eta$ and by the magnitude of the stochastic gradients, for Anytime-SGD the change decays with time, irrespective of the learning rate $\eta$. In (Aviv et al., 2021), this is used to design better and more robust asynchronous training methods.
Relation to Momentum. In the appendix we show that Anytime-SGD can be explicitly written as a momentum method, and is therefore quite different from standard SGD. Concretely, for $\alpha_t = 1$ we show that $x_{t+1} = x_t + \frac{t-1}{t+1}(x_t - x_{t-1}) - \frac{\eta}{t+1}\, g_t$, and for $\alpha_t = t$ we show that $x_{t+1} = x_t + \frac{t-1}{t+2}(x_t - x_{t-1}) - \frac{2\eta t}{t+2}\, g_t$; where here $g_t$ is a (possibly noisy) unbiased estimate of $\nabla f(x_t)$.
This momentum interpretation provides a complementary intuition regarding the benefit of Anytime-SGD in the context of local update methods. Momentum brings more stability to the optimization process which in turn reduces the bias between different machines.
For the sake of this paper we will require a theorem that does not necessarily regard Anytime-GD, but is rather more general. We will require the following definition.
Definition. Let $\{\alpha_t\}_t$ be a sequence of non-negative weights, and let $\{w_t\}_t$ be an arbitrary sequence. We say that a sequence $\{x_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{w_t\}_t$ if $x_1 = w_1$, and for any $t \ge 1$ Eq. (5) is satisfied.
Next, we state the main theorem for this section, which applies to any sequence $\{w_t\}_t$.
Theorem 1 (Rephrased from Theorem 1 in Cutkosky (2019)).
Let $f$ be a convex function with a global minimum $w^*$. Also let $\{\alpha_t\}_t$ be non-negative weights, and let $\{w_t\}_t, \{x_t\}_t$ be sequences such that $\{x_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{w_t\}_t$. Then the following holds for any $T$,
$$\alpha_{1:T}\left(f(x_T) - f(w^*)\right) \le \sum_{t=1}^{T} \alpha_t \nabla f(x_t) \cdot (w_t - w^*).$$
3.2 SLowcal-SGD
Our approach is to employ an Anytime version of Local-SGD, which we name SLowcal-SGD.
Notation:
Prior to describing our algorithm, we define $t := rK + k$ to be the total number of per-machine local updates up to step $k$ of round $r$, resulting in $t \in \{0, 1, \dots, KR\}$. In what follows, we will often find it useful to denote the iterates and samples using the index $t$, rather than explicitly denoting $(r, k)$.
Additionally, we use $\{\alpha_t\}_t$ to denote a pre-defined sequence of non-negative weights. Finally, we denote $T := KR$, the total number of local updates per machine.
In the spirit of Anytime-SGD, our approach is to maintain two sequences per machine $m$: iterates $\{w_t^m\}_t$ and query points $\{x_t^m\}_t$. Our approach is depicted explicitly in Alg. 2. Next, we describe our algorithm in terms of the scheme depicted in Alg. 1:
(i) Distributing anchor. At the beginning of round $r$ the PS distributes the anchor (the current aggregated iterate and query point) to all machines.
(ii) Local Computations. Every machine $m$ initializes its local copies at the anchor, and for the next $K$ steps, i.e., for any $k \in \{0, \dots, K-1\}$, performs a sequence of local Anytime-SGD steps as follows,
$$w_{t+1}^m = w_t^m - \eta\, \alpha_t\, g_t^m, \tag{6}$$
where, similarly to Anytime-SGD, we query the gradients at the averages $x_t^m$, meaning $g_t^m := \nabla f_m(x_t^m; z_t^m)$. The query points are updated as weighted averages of past iterates,
$$x_{t+1}^m = \frac{\alpha_{1:t}\, x_t^m + \alpha_{t+1}\, w_{t+1}^m}{\alpha_{1:t+1}}. \tag{7}$$
At the end of round $r$, each machine communicates its local iterate and query point as a message to the PS.
(iii) Aggregation. The PS aggregates the messages and computes the next anchor by averaging across machines, i.e., $w_t = \frac{1}{M}\sum_{m=1}^{M} w_t^m$ and $x_t = \frac{1}{M}\sum_{m=1}^{M} x_t^m$.
Remark: Note that our notation for the local copies $w_t^m, x_t^m$ is inconsistent at round boundaries: at the end of round $r$ these values may vary between different machines, while at the beginning of round $r+1$ these values are all equal to the aggregated anchor.
Nevertheless, for simplicity we will abuse notation, and explicitly state the right definition when needed.
Importantly, in most of our analysis we will mainly need to refer to the machine averages, and note that the latter are consistent at the end and beginning of consecutive rounds due to the definition of the aggregation step. A minimal code sketch of one round follows.
3.2.1 Guarantees & Intuition
Below we state our main result for SLowcal-SGD (Alg. 2),
Theorem 2.
As Table 1 shows, Thm. 2 implies that SLowcal-SGD improves over all existing upper bounds for Minibatch and Local-SGD, by requiring fewer communication rounds to obtain a linear speedup of $M$.
Intuition. The degradation in Local-SGD schemes (both standard and Anytime) is due to the bias introduced between different machines during each round, which leads to a bias in their gradients. Intuitively, this bias is small if the machines query the gradients at a sequence of slowly-changing query points. This is exactly the benefit of SLowcal-SGD, which queries the gradients at the averaged iterates $x_t^m$. Intuitively, these averages are slowly-changing compared to the iterates $w_t^m$ themselves; and recall that the latter are the query points used by standard Local-SGD. A complementary intuition for the benefit of our approach is the interpretation of Anytime-SGD as a momentum method (see Sec. 3.1 and the appendix), which leads to decreased bias between machines.
To further simplify the more technical discussion here, we will assume the homogeneous case, i.e., that for any $m$ we have $\mathcal{D}_m = \mathcal{D}$ and $f_m = f$.
So, a bit more formally, let us discuss the bias between the query points of two machines in a given round $r$. The following holds for standard Local-SGD,
$$w_{t+1}^m = w_t^m - \eta\, \hat{g}_t^m, \tag{9}$$
where $\hat{g}_t^m$ is the noisy gradient that machine $m$ computes at $w_t^m$, and we can write $\hat{g}_t^m = \nabla f(w_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Thus, for two machines $m, m'$ we can write the difference $w_t^m - w_t^{m'}$ as a sum of a gradient-difference term and an accumulated-noise term. And it was shown in Woodworth et al. (2020a) that the noisy term is dominant, and we can therefore bound, for any step $t$ within the round,
$$\mathbb{E}\left\|w_t^m - w_t^{m'}\right\| \le O\left(\eta \sigma \sqrt{K}\right). \tag{10}$$
Similarly, for SLowcal-SGD we would like to bound $\|x_t^m - x_t^{m'}\|$ for two machines $m, m'$; and in order to simplify the discussion we will assume uniform weights, i.e., $\alpha_t = 1$. Now the update rule for the iterates $w_t^m$ is of the same form as in Eq. (9), only now $\hat{g}_t^m = \nabla f(x_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Consequently, the difference $x_t^m - x_t^{m'}$ is an average of the differences of the past iterates, where we take the crude approximation $\alpha_{1:t} \approx t$. Now, by the definition of $x_t^m$ and $w_t^m$, the accumulated noise is averaged over the $t$ past iterates; thus, for two machines $m, m'$, as we show in our analysis, the noisy term is dominant, so we can therefore bound,
$$\mathbb{E}\left\|x_t^m - x_t^{m'}\right\| \le O\left(\eta \sigma K^{3/2}/t\right). \tag{11}$$
Taking $t = \Theta(KR)$ above yields a bound of $O(\eta\sigma\sqrt{K}/R)$. Thus, Equations (10) and (11) illustrate that the bias of SLowcal-SGD is smaller by a factor of $t/K$ (which can be as large as $R$) compared to the bias of standard Local-SGD. In the appendix we demonstrate the same benefit of Anytime-SGD over SGD when both use $\alpha_t = t$.
Finally, note that the biases introduced by the local updates come into play in a slightly different manner in Local-SGD compared to SLowcal-SGD (a major challenge in our analysis is that for a given machine $m$, the sequence $\{x_t^m\}_t$ is not necessarily an $\{\alpha_t\}_t$-weighted average of the $\{w_t^m\}_t$). Consequently, the above discussion does not enable us to demonstrate the exact rates that we derive. Nevertheless, it provides some intuition regarding the benefit of our approach; the full and exact derivations appear in the appendix. The toy simulation below illustrates the smaller inter-machine bias of averaged query points.
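The following toy simulation (ours, for illustration only; all constants are arbitrary) compares the gap between the query points of two machines after K noisy local steps on the same quadratic, under Local-SGD-style updates (query at iterates) versus Anytime-style updates (query at running averages with uniform weights, whose weight history spans the earlier rounds):

```python
# Toy illustration of the smaller inter-machine bias of averaged query points:
# in some round r, two machines run K noisy local steps on f(w) = 0.5*||w||^2
# from a shared anchor; we compare the final gap between their query points.
import numpy as np

rng = np.random.default_rng(0)
K, eta, d, r, trials = 64, 0.1, 10, 9, 200
t0 = r * K                                  # global steps before this round
gap_local, gap_anytime = [], []
for _ in range(trials):
    w = [np.zeros(d), np.zeros(d)]          # Local-SGD iterates, machines 0, 1
    wa = [np.zeros(d), np.zeros(d)]         # Anytime iterates
    xa = [np.zeros(d), np.zeros(d)]         # Anytime query points (averages)
    for k in range(K):
        t = t0 + k + 1
        for m in range(2):
            w[m] = w[m] - eta * (w[m] + rng.normal(size=d))  # query at iterate
            wa[m] = wa[m] - eta * (xa[m] + rng.normal(size=d))
            xa[m] = (t * xa[m] + wa[m]) / (t + 1)   # uniform running average
    gap_local.append(np.linalg.norm(w[0] - w[1]))
    gap_anytime.append(np.linalg.norm(xa[0] - xa[1]))
print(f"Local-SGD query-point gap: {np.mean(gap_local):.3f}")
print(f"Anytime query-point gap:   {np.mean(gap_anytime):.3f}")   # smaller
```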
Importance Weights. One may wonder whether it is necessary to employ increasing weights, e.g., $\alpha_t = t$, rather than standard uniform weights $\alpha_t = 1$. Surprisingly, in our analysis we have found that increasing weights are crucial in order to obtain a benefit over Minibatch-SGD, and that upon using uniform weights, SLowcal-SGD performs worse compared to Minibatch-SGD! We elaborate on this in Appendix L. Below we provide an intuitive explanation.
Intuitive Explanation.
The intuition behind the importance of using increasing weights is the following: increasing weights are a technical tool to put more emphasis on the last rounds. Now, in the context of local update methods, the iterates of the last rounds are more attractive, since the bias between different machines shrinks as we progress. Intuitively, this happens because, as we progress with the optimization process, the expected value of the gradients that we compute goes to zero (since we converge); consequently, the bias between different machines shrinks as we progress. A quick numeric illustration follows.
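As a quick numeric illustration of this emphasis (with illustrative values only), the fraction of the total weight carried by the last of the R rounds roughly doubles when moving from uniform to linear weights:

```python
# Weight share of the last round under uniform vs linear weights alpha_t = t,
# for illustrative values K = 16 local steps and R = 32 rounds.
K, R = 16, 32
T = K * R
uniform_share = K / T                                           # = 1/R ~ 0.031
linear_share = sum(range(T - K + 1, T + 1)) / (T * (T + 1) / 2) # ~ 0.061
print(uniform_share, linear_share)
```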
3.3 Proof Sketch for Theorem 2
Proof Sketch for Theorem 2.
As a starting point for the analysis, for every iteration $t$ we define the averages of the local sequences across all machines as follows,
$$\bar{w}_t := \frac{1}{M}\sum_{m=1}^{M} w_t^m, \qquad \bar{x}_t := \frac{1}{M}\sum_{m=1}^{M} x_t^m.$$
Note that Alg. 2 explicitly computes these averages only once every $K$ local updates, and that these are identical to the local copies of every machine at the beginning of every round. Combining the above definitions with Eq. (6) yields,
(12) |
Further combining these definitions with Eq. (7) yields,
(13) |
The above implies that the sequence $\{\bar{x}_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{\bar{w}_t\}_t$. This enables us to employ Thm. 1, which yields a bound on the weighted excess loss, where we denote $\bar{g}_t := \frac{1}{M}\sum_{m=1}^{M} g_t^m$. This bound highlights the challenge in the analysis: our algorithm does not directly compute unbiased estimates of $\nabla f(\bar{x}_t)$, except for the first iterate of each round. Concretely, Eq. (12) implies that our algorithm effectively updates $\bar{w}_t$ using $\bar{g}_t$, which is a biased estimate of $\nabla f(\bar{x}_t)$.
It is therefore natural to decompose $\bar{g}_t$ into $\nabla f(\bar{x}_t)$ plus a bias term in the above bound, yielding,
$$\alpha_{1:T}\,\mathbb{E}\left[f(\bar{x}_T) - f(w^*)\right] \le \underbrace{\mathbb{E}\sum_{t=1}^{T} \alpha_t\, \bar{g}_t \cdot (\bar{w}_t - w^*)}_{(A)} + \underbrace{\mathbb{E}\sum_{t=1}^{T} \alpha_t \left(\nabla f(\bar{x}_t) - \bar{g}_t\right) \cdot (\bar{w}_t - w^*)}_{(B)}. \tag{14}$$
Thus, we intend to bound the weighted error by bounding two terms: (A), which is related to the update rule of the algorithm, and (B), which accounts for the bias between $\bar{g}_t$ and $\nabla f(\bar{x}_t)$.
Notation:
In what follows we will make use of several shorthand notations for the weighted sums of gradients, distances, and bias terms that appear below; throughout, $w^*$ denotes a global minimum of $f$.
Bounding (A):
Due to the update rule of Eq. (12), one can show by standard regret analysis that:
Bounding (B):
We can bound (B) in expectation as follows, for any $\rho > 0$:
Combining (A) and (B):
Combining the above bounds on (A) and (B) into Eq. (14), we obtain the following bound, which holds for any $\eta$ and $\rho > 0$,
(15) |
Now, to simplify the proof sketch, we shall make a simplifying assumption on the weighted bias term. Plugging this into the above equation and choosing $\rho$ appropriately gives,
(16) |
Next we will bound these two terms, and plug them back into Eq. (16).
Bounding the first term:
To bound it, it is natural to decompose the gradient estimate into its mean and noise components. Using this decomposition we show that,
Bounding the bias term:
The definition of this term shows that it encompasses the bias introduced by the local updates, which in turn relates to the distances between the local query points and their averages; it is therefore directly related to the dissimilarity between the machines. Our analysis shows the following: plugging the above into Eq. (16), and using our choice of $\eta$, gives an almost explicit bound,
The theorem follows by plugging the choices of $\eta$ and $\{\alpha_t\}_t$ into the above, and using a technical lemma. ∎
4 Experiments
To assess the effectiveness of our proposed approach, we conducted experiments on the MNIST (LeCun et al., 2010) dataset, a well-established benchmark in image classification comprising 70,000 grayscale images of handwritten digits (0-9), with 60,000 images designated for training and 10,000 for testing. The dataset was accessed via torchvision (version 0.16.2). We implemented a logistic regression model (Bishop and Nasrabadi, 2006) using the PyTorch framework and executed all computations on an NVIDIA L40S GPU. To ensure robustness, results were averaged over three different random seeds. The complete codebase for these experiments is publicly available on our GitHub repository: https://github.com/dahan198/slowcal-sgd.
We evaluated our approach, using parameters derived from our theoretical framework, in comparison to Local-SGD and Minibatch-SGD under various configurations. Specifically, experiments were conducted with 16, 32, and 64 workers to examine the scalability and robustness of the proposed method. We also varied the number of local updates (or minibatch sizes for Minibatch-SGD) among 4, 8, 16, 32, and 64 to investigate how different local iteration counts impact performance. Data subsets for each worker were generated using a Dirichlet distribution (Hsu et al., 2019) with a small concentration parameter to simulate real-world non-IID data scenarios characterized by high heterogeneity. For fairness, the learning rate was selected through grid search, with a value of 0.01 for SLowcal-SGD and Local-SGD, and 0.1 for Minibatch-SGD. More details about the data distribution across workers and complete experimental results are provided in Appendix M. A sketch of the setup follows.
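The following sketch summarizes the experimental configuration (a minimal reconstruction from the description above, not the released code):

```python
# A minimal sketch of the experimental setup: logistic regression on flattened
# 28x28 MNIST images, evaluated over a grid of worker counts and local-step
# counts; learning rates are the grid-searched values reported above.
import torch

model = torch.nn.Linear(28 * 28, 10)      # logistic regression = linear layer
loss_fn = torch.nn.CrossEntropyLoss()     # + softmax cross-entropy loss

workers_grid = [16, 32, 64]
local_steps_grid = [4, 8, 16, 32, 64]     # = minibatch size for Minibatch-SGD
lr = {"slowcal_sgd": 0.01, "local_sgd": 0.01, "minibatch_sgd": 0.1}
seeds = [0, 1, 2]                         # results averaged over three seeds
```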
[Figure 1: Test accuracy and test loss on MNIST for SLowcal-SGD, Local-SGD, and Minibatch-SGD, across different numbers of workers and local steps.]
Our results on the MNIST dataset, presented in Figure 1 and detailed in Appendix M.2, demonstrate the effectiveness of our approach, showing consistent performance improvements over Local-SGD and Minibatch-SGD as the number of local steps increases. Notably, this improvement becomes even more significant as the number of workers increases, underscoring the scalability of our method and aligning with the theoretical guarantees outlined in our framework. These results highlight the robustness of our approach in handling highly heterogeneous, distributed environments.
Upon closer inspection, when a small number of local steps are performed, the differences between the approaches are negligible, with a slight advantage for Minibatch-SGD. However, as the number of local steps increases, the minibatch size grows, and the need for significant variance reduction diminishes. In this regime, making more frequent optimization updates becomes more impactful, as demonstrated by the superior performance of the local approaches compared to Minibatch-SGD. Importantly, with SLowcal-SGD, which keeps local updates closely aligned among workers throughout the training process, we can achieve significantly better and more stable performance compared to both Minibatch-SGD and Local-SGD as the number of local steps and the number of workers increase.
5 Conclusion
We have presented the first local approach for the heterogeneous distributed Stochastic Convex Optimization (SCO) setting that provably benefits over the two most prominent baselines, namely Minibatch-SGD and Local-SGD. There are several interesting avenues for future exploration:
(a) developing an adaptive variant that does not require knowledge of problem parameters such as the smoothness $L$ and the noise $\sigma$; (b) allowing a per-dimension step-size, which could be beneficial in (the prevalent) scenarios where the scale of the gradients changes considerably between different dimensions, in the spirit of the well-known AdaGrad method (Duchi et al., 2011). Finally, (c) it would be interesting to understand whether we can find an algorithm that provably dominates the accelerated Minibatch-SGD baseline, which is an open question also in the homogeneous SCO setting.
Acknowledgement
This research was partially supported by Israel PBC-VATAT, the Technion Artificial Intelligent Hub (Tech.AI), and the Israel Science Foundation (grant No. 3109/24).
References
- APP [2019] Apple. Designing for Privacy (video and slide deck). Apple WWDC, 2019. URL https://developer.apple.com/videos/play/wwdc2019/708.
- INT [2020] Intel and Consilient. Intel and Consilient Join Forces to Fight Financial Fraud with AI, 2020. URL https://newsroom.intel.com/news/intel-consilient-join-forces-fight-financial-fraud-ai/.
- MEL [2020] MELLODDY. MELLODDY Project Meets its Year One Objective: Deployment of the World's First Secure Platform for Multi-Task Federated Learning in Drug Discovery Among 10 Pharmaceutical Companies, 2020. URL https://www.melloddy.eu/y1announcement.
- GGL [2021] Google. Your Voice and Audio Data Stays Private While Google Assistant Improves, 2021. URL https://support.google.com/assistant/answer/10176224.
- Antonakopoulos et al. [2022] Kimon Antonakopoulos, Dong Quan Vu, Volkan Cevher, Kfir Levy, and Panayotis Mertikopoulos. Undergrad: A universal black-box optimization method with almost dimension-free convergence rate guarantees. In International Conference on Machine Learning, pages 772–795. PMLR, 2022.
- Aviv et al. [2021] Rotem Zamir Aviv, Ido Hakimi, Assaf Schuster, and Kfir Yehuda Levy. Asynchronous distributed learning: Adapting to gradient delays without prior knowledge. In International Conference on Machine Learning, pages 436–445. PMLR, 2021.
- Bishop and Nasrabadi [2006] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
- [8] Ashok Cutkosky. Lecture notes for EC525: Optimization for Machine Learning.
- Cutkosky [2019] Ashok Cutkosky. Anytime online-to-batch, optimism and acceleration. In International conference on machine learning, pages 1446–1454. PMLR, 2019.
- Dekel et al. [2012] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012.
- Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
- Ene et al. [2021] Alina Ene, Huy L Nguyen, and Adrian Vladu. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7314–7321, 2021.
- Glasgow et al. [2022] Margalit R Glasgow, Honglin Yuan, and Tengyu Ma. Sharp bounds for federated averaging (local sgd) and continuous perspective. In International Conference on Artificial Intelligence and Statistics, pages 9050–9090. PMLR, 2022.
- Gorbunov et al. [2021] Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local sgd: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pages 3556–3564. PMLR, 2021.
- Hazan et al. [2016] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
- Hoffer et al. [2017] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017.
- Hsu et al. [2019] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
- Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- Kairouz et al. [2021] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
- Karimireddy et al. [2020a] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
- Karimireddy et al. [2020b] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020b.
- Kavis et al. [2019] Ali Kavis, Kfir Y Levy, Francis Bach, and Volkan Cevher. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.
- Khaled et al. [2020] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529. PMLR, 2020.
- Koloskova et al. [2020] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.
- Lan [2012] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
- LeCun et al. [2010] Yann LeCun, Corinna Cortes, Chris Burges, et al. Mnist handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/. Licensed under CC BY-SA 3.0, available at https://creativecommons.org/licenses/by-sa/3.0/.
- Mangasarian and Solodov [1993] Olvi L Mangasarian and Mikhail V Solodov. Backpropagation convergence via deterministic nonmonotone perturbed minimization. Advances in Neural Information Processing Systems, 6, 1993.
- McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
- Mishchenko et al. [2022] Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, and Peter Richtárik. Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally! In International Conference on Machine Learning, pages 15750–15769. PMLR, 2022.
- Mitra et al. [2021] Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606–14619, 2021.
- [31] Kumar Kshitij Patel, Lingxiao Wang, Blake Woodworth, Brian Bullins, and Nathan Srebro. Towards optimal communication complexity in distributed non-convex optimization. In Advances in Neural Information Processing Systems.
- Smith et al. [2017] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
- Stich [2019] Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1g2JnRcFX.
- Wang et al. [2021a] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Agüera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas N. Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horváth, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, Satyen Kale, Sai Praneeth Karimireddy, Jakub Konečný, Sanmi Koyejo, Tian Li, Luyang Liu, Mehryar Mohri, Hang Qi, Sashank J. Reddi, Peter Richtárik, Karan Singhal, Virginia Smith, Mahdi Soltanolkotabi, Weikang Song, Ananda Theertha Suresh, Sebastian U. Stich, Ameet Talwalkar, Hongyi Wang, Blake E. Woodworth, Shanshan Wu, Felix X. Yu, Honglin Yuan, Manzil Zaheer, Mi Zhang, Tong Zhang, Chunxiang Zheng, Chen Zhu, and Wennan Zhu. A field guide to federated optimization. CoRR, abs/2107.06917, 2021a. URL https://arxiv.org/abs/2107.06917.
- Wang et al. [2021b] Jun-Kun Wang, Jacob Abernethy, and Kfir Y Levy. No-regret dynamics in the fenchel game: A unified framework for algorithmic convex optimization. arXiv e-prints, pages arXiv–2111, 2021b.
- Woodworth et al. [2020a] Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pages 10334–10343. PMLR, 2020a.
- Woodworth et al. [2020b] Blake E Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch vs local sgd for heterogeneous distributed learning. Advances in Neural Information Processing Systems, 33:6281–6292, 2020b.
- You et al. [2019] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
- Yuan and Ma [2020] Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. Advances in Neural Information Processing Systems, 33:5332–5344, 2020.
Appendix A Explanations Regarding the Linear Speedup and Table 1
Here we elaborate on the computations done in Table 1. First, we will explain why the dominance of the noise term implies a linear speedup by a factor of $M$.
Explanation.
Recall that using SGD with a single machine yields a convergence rate whose dominant term is $\sigma/\sqrt{T}$ after $T$ steps. Thus, in order to obtain an excess loss smaller than some $\epsilon > 0$, SGD requires $T \ge \sigma^2/\epsilon^2$, where $T$ is also the wall-clock time required to compute the solution.
Now, when we use parallel optimization with $R$ communication rounds, $K$ local computations, and $M$ machines, the wall-clock time to compute a solution is still $KR$. Now, if the dominant term in the convergence rate of this algorithm is $\sigma/\sqrt{MKR}$, then the wall-clock time to obtain an $\epsilon$-optimal solution should be $KR \ge \sigma^2/(M\epsilon^2)$; and the latter is smaller by a factor of $M$ compared to a single machine. A worked instance follows.
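As a worked numeric instance (assuming the dominant terms above; the constants are illustrative):

```latex
\frac{\sigma}{\sqrt{T}} \le \epsilon \;\Rightarrow\; T \ge \frac{\sigma^2}{\epsilon^2},
\qquad
\frac{\sigma}{\sqrt{MKR}} \le \epsilon \;\Rightarrow\; KR \ge \frac{\sigma^2}{M\,\epsilon^2}.
```

For example, with $\sigma = 1$ and $\epsilon = 10^{-2}$, a single machine needs $T = 10^4$ steps, while $M = 100$ machines need only $KR = 10^2$ sequential steps each, i.e., a wall-clock speedup of $M$.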
Computation of the threshold values in Table 1.
The leading term appears in the bounds of all of the parallel optimization methods that we describe. Nevertheless, it is dominant only as long as the number of communication rounds is larger than some threshold value, which depends on the specific convergence rate. Clearly, smaller threshold values imply less communication. Thus, in the corresponding column of the table we compute the threshold for each method based on the term in the bound that compares least favourably against the leading term. These terms are bolded in the Rate column of the table.
Concretely, the threshold is the lowest number of rounds for which this less favourable term (which may also depend on the problem parameters; for simplicity of exposition we hide these dependencies in Table 1) is dominated by the leading term, i.e., the lowest $R$ which satisfies,
Appendix B On Heterogeneity Assumption
Let us assume that the following holds at the optimum $w^*$ (Eq. (1)),
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(w^*)\|^2 \le \zeta_*^2.$$
Then we can show the following relation for any $x$,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x)\|^2 \le \frac{2}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f_m(w^*)\|^2 + \frac{2}{M}\sum_{m=1}^{M} \|\nabla f_m(w^*)\|^2 \le 4L\left(f(x) - f(w^*)\right) + 2\zeta_*^2,$$
where we used $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b \in \mathbb{R}^d$, and the last inequality follows by the lemma below, which we borrow from [Johnson and Zhang, 2013, Cutkosky].
Lemma 1.
Let $f := \frac{1}{M}\sum_{m=1}^{M} f_m$ be a convex function with a global minimum $w^*$, and assume that every $f_m$ is $L$-smooth. Then the following holds,
$$\frac{1}{M}\sum_{m=1}^{M} \|\nabla f_m(x) - \nabla f_m(w^*)\|^2 \le 2L\left(f(x) - f(w^*)\right), \quad \forall x \in \mathbb{R}^d.$$
Appendix C Interpreting Anytime-SGD as Momentum
Here we show how to interpret the Anytime-SGD algorithm that we present in Equations (4) and (5) as a momentum method. For completeness we rewrite the update equations below,
$$w_{t+1} = w_t - \eta\,\alpha_t\, g_t, \tag{17}$$
and then,
$$x_{t+1} = \frac{\alpha_{1:t}\, x_t + \alpha_{t+1}\, w_{t+1}}{\alpha_{1:t+1}}, \tag{18}$$
where $g_t$ is an unbiased gradient estimate at $x_t$, and $\{\alpha_t\}_t$ is a sequence of non-negative scalars. At initialization, $x_1 = w_1$.
First note that Eq. (18) directly implies that,
$$x_{t+1} - x_t = \frac{\alpha_{t+1}}{\alpha_{1:t+1}}\left(w_{t+1} - x_t\right).$$
Next, note that we can directly write,
$$w_t - x_t = \frac{\alpha_{1:t-1}}{\alpha_t}\left(x_t - x_{t-1}\right),$$
which follows by re-arranging Eq. (18) at the previous step. Plugging the above into the formula for $x_{t+1} - x_t$, together with Eq. (17), yields,
$$x_{t+1} - x_t = \frac{\alpha_{t+1}}{\alpha_{1:t+1}}\left(\frac{\alpha_{1:t-1}}{\alpha_t}\left(x_t - x_{t-1}\right) - \eta\,\alpha_t\, g_t\right). \tag{19}$$
Thus we can write,
$$x_{t+1} = x_t + \frac{\alpha_{t+1}\,\alpha_{1:t-1}}{\alpha_t\,\alpha_{1:t+1}}\left(x_t - x_{t-1}\right) - \frac{\eta\,\alpha_t\,\alpha_{t+1}}{\alpha_{1:t+1}}\, g_t.$$
Uniform Weights. Thus, taking uniform weights $\alpha_t = 1$ yields,
$$x_{t+1} = x_t + \frac{t-1}{t+1}\left(x_t - x_{t-1}\right) - \frac{\eta}{t+1}\, g_t.$$
Linear Weights. Similarly, taking linear weights $\alpha_t = t$ yields,
$$x_{t+1} = x_t + \frac{t-1}{t+2}\left(x_t - x_{t-1}\right) - \frac{2\eta\, t}{t+2}\, g_t.$$
A numeric sanity check follows.
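The following snippet numerically verifies the uniform-weights momentum recursion above (our own verification sketch; the quadratic objective and all constants are arbitrary):

```python
# A numeric sanity check that uniform-weight Anytime-SGD is a momentum method:
# d_t = x_{t+1} - x_t obeys d_t = ((t-1)*d_{t-1} - eta*g_t)/(t+1), with g_t
# the (noisy) gradient queried at x_t.
import numpy as np

rng = np.random.default_rng(1)
eta, T, dim = 0.1, 50, 5
D = rng.uniform(0.5, 2.0, size=dim)          # a diagonal convex quadratic
grad = lambda x: D * x - 1.0                 # gradient of 0.5*x'Dx - sum(x)

w = rng.normal(size=dim)
x = w.copy()                                 # x_1 = w_1 at initialization
d_prev = np.zeros(dim)
for t in range(1, T):
    g = grad(x) + rng.normal(size=dim)       # noisy gradient at x_t
    w = w - eta * g                          # uniform weights: alpha_t = 1
    x_next = (t * x + w) / (t + 1)           # running average of the iterates
    d_t = ((t - 1) * d_prev - eta * g) / (t + 1)
    assert np.allclose(x_next - x, d_t)      # the momentum recursion holds
    x, d_prev = x_next, d_t
print("momentum recursion verified")
```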
Appendix D Proof of Theorem 1
Proof of Theorem 1.
We rehearse the proof of Theorem 1 from Cutkosky [2019].
First, since $w^*$ is a global minimum and the weights are non-negative, then clearly,
Now, notice that the following holds,
Using the gradient inequality for $f$ gives,
where we have used the gradient inequality again, which implies that $f(w^*) \ge f(x_t) + \nabla f(x_t)\cdot(w^* - x_t)$.
Now, re-ordering, we obtain,
Telescoping the sum on the LHS, we conclude the proof,
∎
Appendix E More Intuition and Discussion Regarding the Benefit of SLowcal-SGD
More Elaborate Intuitive Explanation.
The intuition is the following: We have two extreme baselines. (1) Minibatch-SGD, where queries do not change at all during a round's updates, implying that there is no bias between different machines. However, Minibatch-SGD is "lazy", since among its $KR$ gradient queries it only performs $R$ gradient updates. Conversely, (2) Local-SGD is not "lazy", since each machine performs $KR$ gradient updates. Nevertheless, the queries of different machines change substantially during each round, which translates to bias between machines, which in turn degrades the convergence.
Ideally, we would like to have a "non-lazy" method where each machine performs KR gradient updates (like Local-SGD), but where the queries of each machine do not change at all during rounds (like Minibatch-SGD), and therefore no bias is introduced between machines. Of course, this is too good to exist, but our method is a step in this direction: it is "non-lazy", and the query points of different machines change slowly, therefore introducing less bias between machines. This translates to a better convergence rate.
Additional Technical Intuition for $\alpha_t = t$.
Here we extend the technical explanation that we provide in Sec. 3.2.1 to the case of linear weights $\alpha_t = t$, and show again that SLowcal-SGD yields a smaller bias between different machines compared to Local-SGD.
As in the intuition for the case of uniform weights, to simplify the more technical discussion we will assume the homogeneous case, i.e., that for any $m$ we have $\mathcal{D}_m = \mathcal{D}$ and $f_m = f$.
Note that upon employing linear weights, the normalization factor $\alpha_{1:T}$ that plays a major role in the convergence guarantees of Anytime-SGD (see Thm. 1) grows as $\Theta(T^2)$. Thus, in order to make a proper comparison, we should compare the bias of weighted Anytime-SGD to the appropriate weighted version of SGD, where a similar normalization factor plays a similar role in the guarantees (see e.g. Wang et al. [2021b]). This weighted SGD is as follows (Wang et al. [2021b]),
$$w_{t+1} = w_t - \eta\,\alpha_t\, \hat{g}_t, \tag{20}$$
and after $T$ iterations it outputs $\bar{w}_T := \frac{1}{\alpha_{1:T}}\sum_{t=1}^{T} \alpha_t w_t$. For $\alpha_t = t$, this version enjoys the same guarantees as standard SGD.
Next, we compare the Local-SGD version of the above weighted SGD (Eq. (20)) to our SLowcal-SGD when both employ $\alpha_t = t$. So, a bit more formally, let us discuss the bias between the query points of two machines in a given round $r$. The following holds for weighted Local-SGD,
$$w_{t+1}^m = w_t^m - \eta\,\alpha_t\, \hat{g}_t^m, \tag{21}$$
where $\hat{g}_t^m$ is the noisy gradient that machine $m$ computes at $w_t^m$, and we can write $\hat{g}_t^m = \nabla f(w_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Thus, for two machines $m, m'$ we can write the difference as a gradient-difference term plus an accumulated-noise term.
Now, we can be generous with respect to weighted SGD, taking only the second, noisy term into account and neglecting the first term (it was shown in Woodworth et al. [2020a] that the noisy term is dominant for standard SGD). Thus we obtain,
$$\mathbb{E}\left\|w_t^m - w_t^{m'}\right\| \le O\left(\eta\,\alpha_t\,\sigma\sqrt{K}\right), \tag{22}$$
where we used the fact that the weights of steps within the same round are of the same order, i.e., $\alpha_s = \Theta(\alpha_t)$ whenever $s$ and $t$ belong to the same round.
Similarly, for SLowcal-SGD we would like to bound $\|x_t^m - x_t^{m'}\|$ for two machines $m, m'$, while assuming linear weights, i.e., $\alpha_t = t$. Now the update rule for the iterates $w_t^m$ is of the same form as in Eq. (21), only now $\hat{g}_t^m = \nabla f(x_t^m) + \xi_t^m$, where $\xi_t^m$ is the noisy component of the gradient. Consequently, we can show the following,
where we take the crude approximation $\alpha_{1:t} \approx t^2/2$ (in the last step we also rename the summation variable).
Now, by the definition of $x_t^m$, for two machines $m, m'$ the difference is a weighted average of the differences of the past iterates,
where we have again used the fact that the weights of steps within the same round are of the same order.
As we show in our analysis, the noisy term is dominant, so we can therefore bound,
$$\mathbb{E}\left\|x_t^m - x_t^{m'}\right\| \le O\left(\eta\,\alpha_t\,\sigma\, K^{3/2}/t\right). \tag{23}$$
Thus, Equations (22) and (23) illustrate that for $\alpha_t = t$, the bias of SLowcal-SGD is smaller by a factor of $t/K$ compared to the bias of weighted Local-SGD.
Since $t$ can be as big as $KR$, this coincides with the benefit of SLowcal-SGD over standard SGD in the case where $\alpha_t = 1$, which we demonstrate in the main text.
Finally, note that upon dividing by the normalization factor $\alpha_{1:t}$, we have that for SLowcal-SGD, with either $\alpha_t = 1$ or $\alpha_t = t$,
$$\frac{1}{\alpha_{1:t}}\,\mathbb{E}\left\|x_t^m - x_t^{m'}\right\| \le O\left(\eta\,\sigma\, K^{3/2}/t^2\right). \tag{24}$$
Comparably, upon dividing by the normalization factor, we have that for Local-SGD, with either $\alpha_t = 1$ or $\alpha_t = t$,
$$\frac{1}{\alpha_{1:t}}\,\mathbb{E}\left\|w_t^m - w_t^{m'}\right\| \le O\left(\eta\,\sigma\sqrt{K}/t\right). \tag{25}$$
Thus, with respect to the approximate and intuitive analysis that we make here, SLowcal-SGD maintains a similar benefit over Local-SGD for both $\alpha_t = 1$ and $\alpha_t = t$.
As we explain in Appendix L, taking $\alpha_t = 1$ in SLowcal-SGD does not actually enable us to provide a benefit over Local-SGD. The reason is that for $\alpha_t = 1$, the condition under which the dominant term in the bias (between different machines) is the noisy term (which enables the approximate analysis that we make here and in the body of the paper) leads to a limitation on the learning rate, which in turn degrades the performance of SLowcal-SGD with $\alpha_t = 1$. Conversely, for $\alpha_t = t$ there is no such degradation due to the limitation of the learning rate. For more details and intuition please see Appendix L.
Appendix F Proof of Thm. 2
Proof of Thm. 2.
As a starting point for the analysis, for every iteration $t$ we define the averages of the local sequences across all machines as follows,
$$\bar{w}_t := \frac{1}{M}\sum_{m=1}^{M} w_t^m, \qquad \bar{x}_t := \frac{1}{M}\sum_{m=1}^{M} x_t^m.$$
Note that Alg. 2 explicitly computes these averages only once every $K$ local updates, and that these are identical to the local copies of every machine at the beginning of every round. Combining the above definitions with Eq. (6) yields,
(26) |
Further combining these definitions with Eq. (7) yields,
(27) |
And the above implies that the sequence $\{\bar{x}_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{\bar{w}_t\}_t$. This enables us to employ Thm. 1, which yields,
where we denote $\bar{g}_t := \frac{1}{M}\sum_{m=1}^{M} g_t^m$. The above bound highlights the challenge in the analysis: our algorithm does not directly compute unbiased estimates of $\nabla f(\bar{x}_t)$, except for the first iterate of each round. Concretely, Eq. (26) demonstrates that our algorithm effectively updates $\bar{w}_t$ using $\bar{g}_t$, which might be a biased estimate of $\nabla f(\bar{x}_t)$.
It is therefore natural to apply this decomposition in the above bound, leading to,
(28) |
Thus, we intend to bound the weighted error by bounding two terms: (A), which is directly related to the update rule of the algorithm (Eq. (26)), and (B), which accounts for the bias between $\bar{g}_t$ and $\nabla f(\bar{x}_t)$.
Notation:
In what follows we will find the following notation useful,
(29)
together with several shorthand notations for the weighted sums of distances and bias terms that appear below; throughout, $w^*$ denotes a global minimum of $f$.
Bounding (A):
Due to the update rule of Eq. (26), one can show by standard regret analysis (see Lemma 2 below) that,
(30) |
Lemma 2.
(OGD Regret Lemma; see e.g. Hazan et al. [2016]) Let $\eta > 0$ and $w_1 \in \mathbb{R}^d$. Also assume a sequence of non-negative weights $\{\alpha_t\}_t$ and vectors $\{g_t\}_t$, and assume an update rule of the form $w_{t+1} = w_t - \eta\,\alpha_t\, g_t$.
Then the following bound holds for any $T$ and any $w^* \in \mathbb{R}^d$,
$$\sum_{t=1}^{T} \alpha_t\, g_t \cdot (w_t - w^*) \le \frac{\|w_1 - w^*\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T} \alpha_t^2 \|g_t\|^2.$$
For completeness, we provide a proof in Appendix G.
Bounding (B):
Since our goal is to bound the expected excess loss, we will bound the expected value of (B); thus,
(31) |
where the second line follows by the definition in Eq. (29) and the fact that $\bar{w}_t$ is measurable with respect to the relevant filtration, implying that the cross term vanishes in expectation; the third line uses Young's inequality, which holds for any $\rho > 0$; and the last two lines use the shorthand notations defined above.
Combining (A) and (B):
Combining Equations (30) and (31) into Eq. (28), we obtain the following bound, which holds for any $\eta$ and $\rho > 0$,
(32) |
where we have used $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$, as well as a basic property of the weights, which holds since they are non-negative.
Next, we shall bound each of the above terms. The following lemma bounds the first of them,
Lemma 3.
The following bound holds for any $t$,
Combining the above lemma into Eq. (32) gives,
Since the above holds for any $\rho > 0$, let us pick a specific value of $\rho$; by doing so we obtain,
(33) |
Next, we would like to bound the remaining term; to do so, it is natural to decompose the gradient estimate into its mean and noise components. The next lemma provides a bound, and its proof goes directly through this decomposition,
Lemma 4.
The following holds,
Combining the above Lemma into Eq. (33) yields,
(34) |
where we have used a basic property of the weights, which holds since they are non-negative. The next lemma provides a bound for the remaining term,
Plugging the above bound back into Eq. (34) gives an almost explicit bound,
(35) |
and we used two inequalities which follow from the choice of the learning rate (see Eq. (8)). Next we use the above bound and invoke the following lemma,
Lemma 6.
Let $\{a_t\}_t$ be a sequence of non-negative elements, and assume that the following holds for any $T$,
Then the following bound holds,
Taking $\alpha_t = t$ and our choice of $\eta$ provides the following explicit bound,
(36) |
where we have used $\alpha_{1:T} \ge T^2/2$, as well as $T = KR$,
Recalling that $T = KR$, the above bound translates into,
(37)
(38)
Noting the definitions above and using the choice of $\eta$ gives the final bound,
which establishes the theorem. ∎
Appendix G Proof of Lemma 2
Proof of Lemma 2.
The update rule implies that for all $t$,
$$\|w_{t+1} - w^*\|^2 = \|w_t - w^*\|^2 - 2\eta\,\alpha_t\, g_t \cdot (w_t - w^*) + \eta^2 \alpha_t^2 \|g_t\|^2.$$
Re-ordering gives,
$$2\eta\,\alpha_t\, g_t \cdot (w_t - w^*) = \|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 + \eta^2 \alpha_t^2 \|g_t\|^2.$$
Summing over $t$ and telescoping, we obtain,
$$2\eta \sum_{t=1}^{T} \alpha_t\, g_t \cdot (w_t - w^*) \le \|w_1 - w^*\|^2 + \eta^2 \sum_{t=1}^{T} \alpha_t^2 \|g_t\|^2.$$
Dividing the above by $2\eta$ establishes the lemma. ∎
Appendix H Proof of Lemma 3
Proof of Lemma 3.
Recalling the notations above, our goal is to bound the weighted sum of squared distances. To do so, we will derive a recursive formula. Indeed, the update rule of Alg. 2 implies Eq. (26), which in turn leads to the following for any $t$,
Unrolling the above equation and taking expectation gives,
(39) |
The next lemma provides a bound on the relevant term,
Lemma 7.
The following holds for any $t$, where we recall the notations defined above.
Plugging the bound of Lemma 7 into Eq. (39), and using the notation above, we conclude that for any $t$,
where we used the condition on the learning rate. Summing the above equation over $t$ gives,
(40) |
We shall now require the following lemma,
Lemma 8.
Let $\eta > 0$, and assume that $\eta$ satisfies the condition of the theorem; then the following holds,
H.1 Proof of Lemma 7
Proof of Lemma 7.
Recall the notations above; we shall now focus on bounding the following term,
(42) |
where the first line is due to the definitions appearing in Eq. (29) (this is formalized in Lemma 9 below and in its proof); the third line follows by observing that the sequence $\{\bar{x}_t\}_t$ is an $\{\alpha_t\}_t$-weighted average of $\{\bar{w}_t\}_t$, so that Theorem 1 applies for any $t$, as well as from Cauchy-Schwarz; the fourth line follows from the Cauchy-Schwarz inequality for random variables, which asserts that for any random variables $X, Y$ we have $\mathbb{E}[XY] \le \sqrt{\mathbb{E}[X^2]\cdot\mathbb{E}[Y^2]}$; and the fifth line is an application of an inequality
which holds for any two sequences, and also follows from the standard Cauchy-Schwarz inequality. Thus, Eq. (42) establishes the lemma.
We are left to show the identity used in the first line, which is established in the lemma below,
Lemma 9.
The following holds for any $t$,
∎
H.1.1 Proof of Lemma 9
Proof of Lemma 9.
Let $\{\mathcal{F}_t\}_t$ be the natural filtration induced by the history of samples up to every time step $t$. Then, according to the definitions above, we have,
where the first line follows by the law of total expectation; the second line follows since $\bar{w}_t$ is measurable w.r.t. the filtration; the third line follows by the definition of the gradient estimates; and the last line uses the law of total expectation. ∎
H.2 Proof of Lemma 8
Proof of Lemma 8.
We will divide the proof into two cases.
Case 1: . In this case,
Case 2: . In this case,
Dividing and taking the square implies,
And therefore the lemma holds. ∎
Appendix I Proof of Lemma 4
Proof of Lemma 4.
Recalling the definition of the bias term, we will decompose it into its mean and noise components, which gives,
(43) |
where the second line uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$; the third line uses the definitions above, as well as the smoothness of the $f_m$'s (see Lemma 10 below); the fourth line invokes Lemma 11; and the fifth line uses the notation defined above.
Lemma 10.
Let $f$ be an $L$-smooth function with a global minimum $w^*$; then for any $x$ we have,
$$\|\nabla f(x)\|^2 \le 2L\left(f(x) - f(w^*)\right).$$
Lemma 11.
The following bound holds for any $t$,
∎
I.1 Proof of Lemma 10
Proof of Lemma 10.
The smoothness of $f$ implies that the following holds for any $x, y$,
$$f(y) \le f(x) + \nabla f(x) \cdot (y - x) + \frac{L}{2}\|y - x\|^2.$$
Taking $y = x - \frac{1}{L}\nabla f(x)$ we get,
$$f(y) \le f(x) - \frac{1}{2L}\|\nabla f(x)\|^2.$$
Thus:
$$\|\nabla f(x)\|^2 \le 2L\left(f(x) - f(y)\right) \le 2L\left(f(x) - f(w^*)\right),$$
where in the last inequality we used $f(w^*) \le f(y)$, which holds since $w^*$ is the global minimum. ∎
I.2 Proof of Lemma 11
Proof of Lemma 11.
Recall that we can write,
where the per-machine noise terms are independent of each other. Thus, conditioning on the history, they are independent and zero-mean. Consequently,
Using the law of total expectation establishes the lemma. ∎
Appendix J Proof of Lemma 5
Proof of Lemma 5.
To bound the bias term, we will first employ its definition together with the smoothness of the $f_m$'s,
(44) |
where the first line uses the definition of the bias term, the second line uses Jensen's inequality, and the third line uses the smoothness of the $f_m$'s. The last line follows from Jensen's inequality.
We use the following notation for any $t$,
(45) |
and notice several basic relations between these quantities, which we will use below.
Thus, according to Eq. (44), it is enough to bound the following term as,
(46) |
Next we will bound the above term.
Bounding :
Let $t$ belong to round $r$. If $t$ is the first step of the round, then according to Alg. 2 the local copies of all machines coincide, and thus the relevant difference vanishes for any two machines.
More generally, if $t$ belongs to round $r$ but is not the first step of the round, then we can express the difference through the accumulated local updates. Using this notation, the update rule implies the following for any such $t$,
where we used the round's initialization. Thus, for any such $t$ we can write,
(47) |
So our next goal is to derive an expression for the accumulated updates. The update rule of Eq. (6) implies that for any $t$,
(48) |
where the second equality is due to the initialization of each round, implying that the local copies coincide at the round's start.
Next, we will require some additional notation for the accumulated sums; it is immediate to verify the basic identities below. Using this notation together with Eq. (48) implies that for any $t$ and any machine $m$ we have,
(49) |
and in the last line we use a shorthand notation (a more precise notation would carry an additional index, but to ease notation we absorb it).
Summing Eq. (49) over the steps of the round, we obtain,
(50) |
where in the last equation we renamed the summation index (only done to ease notation).
Plugging the above equation back into Eq. (47), we obtain for any $t$,
(51) |
where the first inequality uses the triangle inequality; the second uses $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$; the third inequality uses the definitions above, together with basic properties of the weights, which hold since the weights are non-negative and non-decreasing; and the last inequality uses the fact that weights within the same round are of the same order, implying that the following holds,
Lemma 12.
The following holds for any ,
Using the above lemma inside Eq. (51) yields,
(52) |
where we have denoted the corresponding weighted sum (formally we should carry the round index in this definition; we absorb it to simplify the notation).
Summing Eq. (52) over the round and using the definitions above gives,
(53) |
where we used,
Now, dividing Eq. (53) by the appropriate normalization gives,
(54) |
where the second line follows from the dissimilarity bound of Eq. (2), and the last line is due to the notation defined above.
Thus, we can re-write the above equation as follows, for all $t$,
(55) |
where the quantities above are as defined earlier. Now, notice that the relevant coefficient satisfies the condition of Lemma 13, since we assume the learning-rate condition of the theorem (see Eq. (8)). This enables us to make use of Lemma 13 below to conclude,
(56) |
Lemma 13.
Let the constants be as stated, and assume a sequence of non-negative terms and another sequence that satisfy the following inequality for any $t$,
Then the following holds,
Recalling that we would like to bound the expectation of the LHS of Eq. (56), we will next bound the remaining term, which is done in the following lemma (recall that for simplicity of notation we drop an extra index),
Lemma 14.
The following bound holds for any ,
Since the above lemma holds for any $t$ and $m$, we can now bound the term as follows,
(57) |
where the last lines hold for any iteration $t$ that belongs to round $r$.
Final Bound on the Bias Term.
Finally, using the above bound together with Eq. (46) enables us to bound the bias term as follows,
where we have used the bounds above, and the last line uses the learning-rate condition. Consequently, we can bound,
which establishes the lemma. ∎
J.1 Proof of Lemma 12
Proof of Lemma 12.
First note that, by definition, we have for any $t$,
(59) |
where we have used Jensen's inequality and the definitions above.
Using the above inequality, we obtain,
where the second and third lines use $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, which holds for any $a, b$; we also used the smoothness of the $f_m$'s; and the last line uses Eq. (59). ∎
J.2 Proof of Lemma 13
Proof of Lemma 13.
Since all the terms involved are non-negative, we can further bound, for all $t$, as follows,
Summing the above over gives,
where we used the assumed bound on the coefficient. Re-ordering the above equation immediately establishes the lemma. ∎
J.3 Proof of Lemma 14
Proof of Lemma 14.
Let $\{\mathcal{F}_t\}_t$ be the natural filtration induced by the random draws up to time $t$. By definition, each term of the sequence is measurable with respect to $\mathcal{F}_t$, and,
implying that the sequence is a martingale difference sequence with respect to the filtration $\{\mathcal{F}_t\}_t$. This implies the following,
where the first line follows by Lemma 15 below; the second line holds by the definitions above together with the learning-rate condition; the fourth line follows from the bounded-variance assumption; the fifth line uses a basic inequality that holds for any $t$; the sixth line follows from the properties of the weights; and the last line uses the notation defined above.
Lemma 15.
Let $\{X_t\}_t$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_t\}_t$; then the following holds for all time indexes $t$,
$$\mathbb{E}\left\|\sum_{s=1}^{t} X_s\right\|^2 = \sum_{s=1}^{t} \mathbb{E}\left\|X_s\right\|^2.$$
∎
J.3.1 Proof of Lemma 15
Proof of Lemma 15.
We shall prove the lemma by induction over $t$. The base case $t = 1$ clearly holds, since it boils down to the following identity,
Now, for the induction step, let us assume that the equality holds for $t$, and let us prove that it holds for $t+1$. Indeed,
where the third line follows from the induction hypothesis, as well as from the law of total expectation; the fourth line follows since $X_1, \dots, X_t$ are measurable w.r.t. $\mathcal{F}_t$; and the fifth line follows since $\mathbb{E}[X_{t+1}\,|\,\mathcal{F}_t] = 0$. Thus, we have established the induction step, and therefore the lemma holds. ∎
Appendix K Proof of Lemma 6
Proof of Lemma 6.
Summing the assumed inequality over $t$ gives,
Re-ordering we obtain,
Plugging this back into the original inequality gives,
which concludes the proof. ∎
Appendix L The Necessity of Non-uniform Weights
One may wonder why we should employ increasing weights $\alpha_t = t$ rather than standard uniform weights $\alpha_t = 1$. Here we explain why uniform weights are insufficient, and why increasing weights, e.g., $\alpha_t = t$, are crucial to obtain our result.
Intuitive Explanation.
Prior to providing an elaborate technical explanation, we provide some intuition. The intuition behind the importance of using increasing weights is the following: increasing weights are a technical tool to put more emphasis on the last rounds. Now, in the context of local update methods, the iterates of the last rounds are more attractive, since the bias between different machines shrinks as we progress. Intuitively, this happens since, as we progress with the optimization process, the bias in the stochastic gradients that we compute goes to zero (in expectation); consequently, the bias between different machines shrinks as we progress.
Technical Explanation.
Assume general weights $\{\alpha_t\}_t$, and let us go back to the proof of Lemma 5 (see Section J). Recall that in this proof we derive a bound of the following form (see Eq. (55)),
(60) |
where the quantities are as in that proof, and importantly the relevant coefficient takes a specific form (indeed, in the case where $\alpha_t = t$ we can take the value used in the proof of Lemma 5),
Now, a crucial part of the proof is the fact that this coefficient is suitably bounded, which in turn enables us to employ Lemma 13 in order to bound the error.
Now let us inspect the constraint for polynomial weights of the form $\alpha_t = t^p$, where $p \ge 0$. This condition boils down to,
implying that the following bound should apply to the learning rate $\eta$ for any $t$,
Now, since the bound should hold for any $t$, we can divide into two cases:
Case 1: $p \le 1$. In this case the bound is monotonically increasing with $t$, so the above condition should be satisfied for the smallest $t$, namely $t = 1$, implying,
The effect of this term on the overall error stems from the first term in the error analysis (see Appendix F), namely,
Now, for the extreme values $p = 0$ (uniform weights) and $p = 1$ (linear weights), the above expression results in an error term of the following form,
(61) |
Thus, for $p = 0$ the error term is considerably worse even compared to Minibatch-SGD, while $p = 1$ is clearly an improvement.
Case 2: $p > 1$. In this case the bound is monotonically decreasing with $t$, so the above condition should be satisfied for the largest $t$, namely $t = T$, implying,
Now, the effect of this term on the overall error stems from the first term in the error analysis (see Appendix F), namely,
Thus, for any $p > 1$ we obtain,
(62) |
Conclusion: As can be seen from Equations (61) and (62), using uniform weights, or even polynomial weights with $p \neq 1$, yields strictly worse guarantees compared to taking $\alpha_t = t$.
Appendix M Experiments
M.1 Data Distribution Across Workers
In federated learning or distributed machine learning settings, data heterogeneity among workers often arises due to non-identical data distributions. To simulate such scenarios, we split the MNIST dataset using a Dirichlet distribution. The Dirichlet distribution allows control over the degree of heterogeneity in the data assigned to each worker by adjusting the dirichlet-alpha parameter. Lower values of dirichlet-alpha result in more uneven class distributions across workers, simulating highly non-IID data.
Given $C$ classes and $M$ workers, the proportions for assigning data from each class $c$ to the workers are sampled from a Dirichlet distribution:
$$p_c \sim \mathrm{Dir}\left(\alpha \cdot \mathbf{1}_M\right),$$
where $\alpha$ is the concentration parameter controlling the level of heterogeneity. A smaller $\alpha$ value results in a more imbalanced distribution, meaning that each worker primarily receives data from a limited subset of classes. In this experiment, we set $\alpha$ to a small value to induce high heterogeneity. A sketch of this split appears below.
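The following is a minimal sketch of such a split (the function name and the default `alpha` are illustrative, not the released implementation):

```python
# Dirichlet-based non-IID split: for each class, sample per-worker proportions
# from Dir(alpha) and hand out contiguous chunks of that class accordingly.
import numpy as np

def dirichlet_split(labels, num_workers, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    worker_indices = [[] for _ in range(num_workers)]
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(num_workers))   # shares of class c
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for m, part in enumerate(np.split(idx, cuts)):
            worker_indices[m].extend(part.tolist())
    return worker_indices

# e.g.: parts = dirichlet_split(train_labels, num_workers=16, alpha=0.1)
```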
The following figures present scatter plots illustrating the class frequencies assigned to individual workers for different worker configurations—16, 32, and 64—using a specific random seed.
[Figures: per-class sample frequencies assigned to individual workers, for 16, 32, and 64 workers.]
M.2 Complete Experimental Results
This section presents the complete experimental results for 16, 32, and 64 workers, showing test accuracy and test loss as functions of the number of local iterations. The plots illustrate the method's scalability and performance, with higher accuracy and lower loss indicating better outcomes.
[Figures: test accuracy and test loss vs. number of local iterations, for 16, 32, and 64 workers.]