FedChain: Chained Algorithms for
Near-Optimal Communication Cost
in Federated Learning
Abstract
Federated learning (FL) aims to minimize the communication complexity of training a model over heterogeneous data distributed across many clients. A common approach is local update methods, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg). Local update methods can exploit similarity between clients’ data. However, in existing analyses, this comes at the cost of slow convergence in terms of the number of communication rounds. On the other hand, global update methods, where clients simply return a gradient vector in each round (e.g., SGD), converge faster in terms of communication rounds but fail to exploit the similarity between clients even when clients are homogeneous. We propose FedChain, an algorithmic framework that combines the strengths of local update methods and global update methods to achieve fast convergence in the number of rounds while leveraging the similarity between clients. Using FedChain, we instantiate algorithms that improve upon previously known rates in the general convex and PL settings, and are near-optimal (via an algorithm-independent lower bound that we show) for problems that satisfy strong convexity. Empirical results support this theoretical gain over existing methods.
1 Introduction
In federated learning (FL) (McMahan et al., 2017; Kairouz et al., 2019; Li et al., 2020), distributed clients interact with a central server to jointly train a single model without directly sharing their data with the server. The training objective is to solve the following minimization problem:
$$\min_{x \in \mathbb{R}^d} \Big[ F(x) := \frac{1}{N}\sum_{i=1}^{N} F_i(x) \Big], \qquad F_i(x) := \mathbb{E}_{z \sim \mathcal{D}_i}\big[f(x; z)\big] \tag{1}$$
where $x$ is the model parameter, $i$ indexes the clients (or devices), $N$ is the number of clients, and $F_i$ is a loss that depends only on client $i$'s data. Typical FL deployments have two properties that make optimizing Eq. 1 challenging: (i) Data heterogeneity: We want to minimize the average of the expected losses $F_i(x) = \mathbb{E}_{z \sim \mathcal{D}_i}[f(x;z)]$ for some loss function $f$ evaluated on client data $z$ drawn from a client-specific distribution $\mathcal{D}_i$. The convergence rate of the optimization depends on the heterogeneity of the local data distributions $\mathcal{D}_i$, and this dependence is captured by a popular notion of client heterogeneity, $\zeta$, defined as follows:
$$\zeta := \max_{i \in [N]} \sup_{x} \big\|\nabla F_i(x) - \nabla F(x)\big\| \tag{2}$$
This captures the maximum difference between a local gradient and the global gradient, and is a standard measure of heterogeneity used in the literature (Woodworth et al., 2020a; Gorbunov et al., 2020; Deng et al., 2020; Yuan et al., 2020; Deng et al., 2021; Deng & Mahdavi, 2021). (ii) Communication cost: In many FL deployments, communication is costly because clients have limited bandwidth. Due to these two challenges, most federated optimization algorithms alternate between local rounds of computation, where each client locally processes only its own data to save communication, and global rounds, where clients synchronize with the central server to resolve disagreements, due to heterogeneity, in the locally updated models.
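To make Eqs. (1) and (2) concrete, the following is a small numerical sketch on a hypothetical two-client problem (the client losses and the evaluation grid are illustrative choices made for this sketch, not taken from the paper): with $F_1(x) = \tfrac{1}{2}(x-1)^2$ and $F_2(x) = \tfrac{1}{2}(x+1)^2$, the global gradient is $\nabla F(x) = x$ and the heterogeneity works out to $\zeta = 1$.

```python
# Illustrative two-client toy problem (not from the paper): computing the
# global gradient of Eq. (1) and the heterogeneity measure of Eq. (2).
import numpy as np

def grad_f1(x):  # client 1: F_1(x) = (x - 1)^2 / 2
    return x - 1.0

def grad_f2(x):  # client 2: F_2(x) = (x + 1)^2 / 2
    return x + 1.0

def grad_f(x):   # global objective F = (F_1 + F_2) / 2
    return 0.5 * (grad_f1(x) + grad_f2(x))

# Approximate the supremum in Eq. (2) over a finite grid; here the gradient
# gaps are constant in x, so the grid value equals the true heterogeneity.
xs = np.linspace(-5.0, 5.0, 1001)
zeta = max(np.max(np.abs(grad_f1(xs) - grad_f(xs))),
           np.max(np.abs(grad_f2(xs) - grad_f(xs))))
print(f"heterogeneity zeta ~= {zeta:.2f}")   # prints 1.00
```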
Several federated algorithms navigate this trade-off between reducing communication and resolving data heterogeneity by modifying the amount and nature of local and global computation (McMahan et al., 2017; Li et al., 2018; Wang et al., 2019b; a; Li et al., 2019; Karimireddy et al., 2020b; a; Al-Shedivat et al., 2020; Reddi et al., 2020; Charles & Konečnỳ, 2020; Mitra et al., 2021; Woodworth et al., 2020a). These first-order federated optimization algorithms largely fall into one of two camps: (i) clients in local update methods send updated models after performing multiple steps of local model updates, and (ii) clients in global update methods send gradients and do not perform any model updates locally. Examples of (i) include FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), and FedProx (Li et al., 2018). Examples of (ii) include SGD and Accelerated SGD (ASG), where in each round the server collects from the clients gradients evaluated on local data at the current iterate and then performs a (possibly Nesterov-accelerated) model update.
As an illustrative example, consider the scenario where the client objectives $F_i$ are $\mu$-strongly convex and $\beta$-smooth with condition number $\kappa = \beta/\mu$, as summarized in Table 1, and assume for simplicity that all clients participate in each communication round (i.e., full participation). Existing convergence analyses show that local update methods have a favorable dependence on the heterogeneity $\zeta$. For example, Woodworth et al. (2020a) show that FedAvg achieves an error bound, after $R$ rounds of communication, that vanishes as the heterogeneity $\zeta$ shrinks; hence FedAvg achieves theoretically faster convergence for smaller heterogeneity levels by using local updates. However, this favorable dependence on $\zeta$ comes at the cost of slow convergence in $R$: the error decays only polynomially in $R$. On the other hand, global update methods converge faster in $R$, with error decaying roughly as $\Delta\exp(-R/\sqrt{\kappa})$ in the case of ASG, where $\Delta$ is the initial function-value gap to the (unique) optimal solution. This is an exponentially faster rate in $R$, but it does not take advantage of small heterogeneity even when client data is fully homogeneous.
Our main contribution is to design a novel family of algorithms that combines the strengths of local and global methods, achieving fast convergence in $R$ while maintaining the favorable dependence on the heterogeneity in the strongly convex scenario. By adaptively switching between this novel algorithm and the best existing global method, we can achieve the smaller of the two error rates. We further show that this is near-optimal by providing a matching lower bound in Thm. 5.4. This lower bound tightens an existing one from Woodworth et al. (2020a) by considering a smaller class of algorithms that includes those presented in this paper.
We propose FedChain, a unifying chaining framework for federated optimization that enjoys the benefits of both local update methods and global update methods. For a given total number of rounds $R$, FedChain first runs a local-update method $\mathcal{A}_{\mathrm{local}}$ for a constant fraction of the rounds and then switches to a global-update method $\mathcal{A}_{\mathrm{global}}$ for the remaining rounds. The first phase exploits client homogeneity when possible, providing fast convergence when heterogeneity is small. The second phase uses the unbiased stochastic gradients of global update methods to achieve a faster convergence rate in the heterogeneous setting, and it inherits the benefits of the first phase through the iterate output by the local-update method. An instantiation of FedChain is a particular choice of $\mathcal{A}_{\mathrm{local}}$ and $\mathcal{A}_{\mathrm{global}}$ (e.g., FedAvg → SGD).
Contributions. We propose the FedChain framework and analyze various instantiations under strongly convex, general convex, and nonconvex objectives that satisfy the PL condition (a function $F$ satisfies the $\mu$-PL condition if $\frac{1}{2}\|\nabla F(x)\|^2 \ge \mu\,(F(x) - F(x^*))$ for all $x$). Achievable rates are summarized in Tables 1, 2, and 4. In the strongly convex setting, these rates nearly match the algorithm-independent lower bounds we introduce in Theorem 5.4, which shows the near-optimality of FedChain. For strongly convex functions, chaining is optimal up to a factor that decays exponentially in the condition number $\kappa$. For convex functions, it is optimal in the high-heterogeneity regime. For nonconvex PL functions, it is optimal for constant condition number $\kappa$. In all three settings, FedChain instantiations improve upon previously known worst-case rates in certain regimes of heterogeneity. We further demonstrate the empirical gains of our chaining framework in the convex case (logistic regression on MNIST) and in the nonconvex case (ConvNet classification on EMNIST (Cohen et al., 2017) and ResNet-18 classification on CIFAR-100 (Krizhevsky, 2009)).
1.1 Related Work
The convergence of FedAvg for convex minimization has been the subject of much interest in the machine learning community, particularly as federated learning (Kairouz et al., 2019) has grown in popularity. It was first studied in the homogeneous client setting (Stich, 2018; Wang & Joshi, 2018; Woodworth et al., 2020b). These rates were later extended to the heterogeneous client setting (Khaled et al., 2020; Karimireddy et al., 2020b; Woodworth et al., 2020a; Koloskova et al., 2020), including a lower bound (Woodworth et al., 2020a). The only currently known analysis showing that FedAvg can improve on SGD/ASG in the heterogeneous data case is that of Woodworth et al. (2020a), which has been used to prove slightly tighter bounds in subsequent work (Gorbunov et al., 2020).
Many new federated optimization algorithms have been proposed to improve on FedAvg and SGD/ASG, such as SCAFFOLD (Karimireddy et al., 2020b), S-Local-SVRG (Gorbunov et al., 2020), FedAdam (Reddi et al., 2020), FedLin (Mitra et al., 2021), and FedProx (Li et al., 2018). However, under full participation (where all clients participate in each communication round; otherwise the setting is partial participation), existing analyses of these algorithms have not demonstrated any improvement over SGD (or, under partial participation, over SAGA (Defazio et al., 2014)). In the smooth nonconvex full-participation setting, MimeMVR (Karimireddy et al., 2020a) was recently proposed and improves over SGD in the low-heterogeneity setting. Currently MimeMVR and SGD are the two best algorithms for worst-case smooth nonconvex optimization with respect to communication efficiency, depending on client heterogeneity. Under partial participation, MimeMVR and SAGA are the two best algorithms for worst-case smooth nonconvex optimization, depending on client heterogeneity.
Lin et al. (2018) proposed post-local SGD, a scheme opposite to ours: it switches from a global-update method to a local-update method (rather than from a local-update method to a global-update method), and is evaluated on shuffled (effectively homogeneous) data. The authors found that while post-local SGD has worse training accuracy than SGD, its test accuracy can be better (an improvement of 1% test accuracy on ResNet-20 CIFAR-10 classification), though no theoretical analysis was provided. Wang & Joshi (2019) decrease the number $K$ of local updates over time to transition from FedAvg to SGD in the homogeneous setting. This is conceptually similar to FedChain, but they do not show order-wise convergence rate improvements. Indeed, in the homogeneous setting, a simple variant of ASG already achieves the optimal worst-case convergence rate (Woodworth et al., 2021).
2 Setting
Federated optimization proceeds in rounds. We consider the partial participation setting, where at the beginning of each round, $S$ out of $N$ total clients are sampled uniformly at random without replacement. Between consecutive communication rounds, each sampled client either (i) accesses its own stochastic gradient oracle $K$ times, updates its local model, and returns the updated model, or (ii) accesses its own stochastic function-value oracle $K$ times and returns the average value to the server. The server aggregates the received information and performs a model update. For each baseline optimization algorithm, we analyze the sub-optimality error after $R$ rounds of communication between the clients and the server. Suboptimality is measured in terms of the function value $F(\hat{x}) - F(x^*)$, where $\hat{x}$ is the solution estimate after $R$ rounds and $x^*$ is a (possibly non-unique) optimum of $F$. We let $\Delta$ denote an upper bound on the initial suboptimality gap (Assumption B.9) and $D$ an upper bound on the initial distance to a (not necessarily unique) optimum (Assumption B.10). Where applicable, $\epsilon$ is the target expected function-value suboptimality.
We study three settings: strongly convex $F_i$'s, general convex $F_i$'s, and a $\mu$-PL global objective $F$; for formal definitions, refer to App. B. Throughout this paper, we assume that the $F_i$'s are $\beta$-smooth (Assumption B.4). If the $F_i$'s are $\mu$-strongly convex (Assumption B.1) or $F$ is $\mu$-PL (Assumption B.3), we denote $\kappa := \beta/\mu$ as the condition number. $\mathcal{D}_i$ is the data distribution of client $i$. We define the heterogeneity of the problem as $\zeta$ in Assumption B.5, and assume unless otherwise specified that $\zeta > 0$, i.e., that the problem is heterogeneous. We assume that the client gradient variance is upper bounded by $\sigma^2$ (Assumption B.6). We also define the analogous quantities for function-value oracle queries in Assumption B.7 and Assumption B.8. We use $\tilde{O}(\cdot)$ to hide polylogarithmic factors, and $O(\cdot)$ if we are only hiding constant factors.
3 Federated Chaining (FedChain) Framework
We start with a toy example (Fig. 1) to illustrate FedChain. Consider two strongly convex client objectives, $F_1$ and $F_2$, whose average is the global objective. Fig. 1 (top) plots the objectives, and Fig. 1 (bottom) displays their gradients. Far from the optimum, due to strong convexity, all client gradients point towards the optimal solution. In this regime, clients can use local steps (e.g., FedAvg) to reduce communication without sacrificing the consistency of per-client local updates. On the other hand, close to the optimum, some client gradients may point away from the optimum. This suggests that clients should not take local steps near the optimum, to avoid driving the global estimate away from it. Therefore, when we are close to the optimum, we use an algorithm without local steps (e.g., SGD), which is less affected by client gradient disagreement.
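A small numerical sketch of this effect, using two illustrative quadratics as stand-ins for the paper's toy objectives (the functions and grid below are assumptions made for this sketch):

```python
# Illustrative quadratics F_1(x) = (x-1)^2/2 and F_2(x) = (x+1)^2/2, whose
# average F is minimized at x* = 0. We locate the region where some client
# gradient points away from x*, i.e., where naive local steps can drift.
import numpy as np

x_star = 0.0
xs = np.linspace(-3.0, 3.0, 601)
to_opt = x_star - xs                       # direction from x toward the optimum

def points_toward_opt(grad):
    # A descent step -eta * grad moves toward x* iff <-grad, x* - x> > 0.
    return (-grad) * to_opt > 0

agree = points_toward_opt(xs - 1.0) & points_toward_opt(xs + 1.0)
bad = xs[~agree]
print(f"some client gradient points away from x* roughly on [{bad.min():.2f}, {bad.max():.2f}]")
# Expected output: roughly [-1.00, 1.00], i.e., only close to the optimum.
```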
This intuition seems to carry over to the nonconvex setting: Charles et al. (2021) show that over the course of FedAvg training of a neural network for StackOverflow next-word prediction, client gradients become increasingly orthogonal.
FedChain: To exploit the strengths of both local and global update methods, we propose the federated chaining (FedChain) framework in Alg. 1. There are three steps: (1) run a local-update method $\mathcal{A}_{\mathrm{local}}$, like FedAvg; (2) choose the better point between the output of $\mathcal{A}_{\mathrm{local}}$ (which we denote $x_{\mathrm{local}}$) and the initial point $x_0$; and use it to initialize (3) a global-update method $\mathcal{A}_{\mathrm{global}}$, like SGD. Note that when heterogeneity is large, $\mathcal{A}_{\mathrm{local}}$ can actually output an iterate with a higher suboptimality gap than $x_0$. Hence, selecting whichever of $x_{\mathrm{local}}$ and $x_0$ has the smaller estimated objective value allows us to adapt to the problem's heterogeneity and initialize $\mathcal{A}_{\mathrm{global}}$ appropriately to achieve good convergence rates. To compare $x_{\mathrm{local}}$ and $x_0$, we approximate $F(x_{\mathrm{local}})$ and $F(x_0)$ by averaging $K$ stochastic function-value queries on each client in a fresh sample of clients, where $K$ is also the number of local steps per client per round in $\mathcal{A}_{\mathrm{local}}$. In practice, one might consider adaptively selecting how many rounds of $\mathcal{A}_{\mathrm{local}}$ to run, which can potentially improve the convergence by a constant factor. Our experiments in § J.1 show significant improvement when using only one round of $\mathcal{A}_{\mathrm{local}}$ with a large enough $K$ for convex losses.
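A minimal sketch of the chaining procedure, written against a hypothetical client interface (the function names, the half-and-half round split, and the toy demo are illustrative assumptions, not the paper's implementation):

```python
# Sketch of the FedChain idea: local-update phase, a comparison of the two
# candidate initializations by estimated F values, then a global-update phase.
import numpy as np

def fedchain(x0, R, K, local_round, global_round, approx_value, sample_clients):
    """local_round(x, clients, K)  -> server model after one round of K local steps
    global_round(x, clients)       -> server model after one gradient round
    approx_value(x, clients, K)    -> estimate of F(x) from K value queries/client"""
    # Phase 1: local-update method (e.g., FedAvg) for a constant fraction of rounds.
    x_local = x0
    for _ in range(R // 2):
        x_local = local_round(x_local, sample_clients(), K)

    # Pick the better of the phase-1 output and the initial point.
    clients = sample_clients()
    x_mid = x_local if approx_value(x_local, clients, K) <= approx_value(x0, clients, K) else x0

    # Phase 2: global-update method (e.g., SGD/ASG) initialized at the chosen point.
    x = x_mid
    for _ in range(R - R // 2 - 1):
        x = global_round(x, sample_clients())
    return x

if __name__ == "__main__":
    # Toy demo: two quadratic clients with minimizers at +1 and -1.
    opts = np.array([1.0, -1.0])

    def sample_clients():                    # full participation
        return [0, 1]

    def local_round(x, clients, K, lr=0.1):  # FedAvg-style: K local GD steps, then average
        finals = []
        for c in clients:
            xc = x
            for _ in range(K):
                xc = xc - lr * (xc - opts[c])
            finals.append(xc)
        return float(np.mean(finals))

    def global_round(x, clients, lr=0.1):    # SGD-style: one averaged gradient step
        return x - lr * float(np.mean([x - opts[c] for c in clients]))

    def approx_value(x, clients, K):         # averaged function-value estimate of F(x)
        return float(np.mean([(x - opts[c]) ** 2 / 2 for c in clients]))

    print(fedchain(5.0, R=20, K=10, local_round=local_round,
                   global_round=global_round, approx_value=approx_value,
                   sample_clients=sample_clients))
```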
[Table: convergence-rate comparison — Centralized Algorithms / Federated Algorithms / This paper (strongly convex setting); see the discussion around Table 1.]
4 Convergence of FedChain
We first analyze FedChain (Algo. 1) when $\mathcal{A}_{\mathrm{local}}$ is FedAvg and $\mathcal{A}_{\mathrm{global}}$ is (Nesterov-accelerated) SGD. The first theorem is without Nesterov acceleration and the second is with it.
Theorem 4.1 (FedAvg → SGD).
Suppose that the client objectives $F_i$ and their gradient queries satisfy Assumptions B.4, B.5, B.6, B.7, and B.8. Then running FedChain (Algo. 1) with $\mathcal{A}_{\mathrm{local}}$ as FedAvg (Algo. 4) with the parameter choices of Thm. E.1, and $\mathcal{A}_{\mathrm{global}}$ as SGD (Algo. 2) with the parameter choices of Thm. D.1, we get the following rates. (We ignore two variance terms: one can be made negligible by increasing the number of per-client queries, and the other can be made zero by running the better of FedAvg → SGD and SGD instead of choosing the better of $x_{\mathrm{local}}$ and $x_0$ as in Algo. 1. The remaining theorems in the main paper are stated the same way; to see the formal statements, see App. F.)
• Strongly convex: If the $F_i$'s satisfy Assumption B.1 for some $\mu > 0$, then there exists a finite $R$ above which the corresponding rate stated in Thm. F.1 holds.
• General convex: If the $F_i$'s satisfy Assumption B.2, then there exists a finite $R$ above which the corresponding rate in Thm. F.1 holds.
• PL condition: If Assumption B.3 holds for some $\mu > 0$, then there exists a finite $R$ above which the corresponding rate in Thm. F.1 holds.
For the formal statement, see Thm. F.1.
Theorem 4.2 (FedAvg → ASG).
Under the hypotheses of Thm. 4.1, with $\mathcal{A}_{\mathrm{global}}$ chosen as ASG (Algo. 3) with the parameter choices of Thm. D.3, we get the following guarantees:
• Strongly convex: If the $F_i$'s satisfy Assumption B.1 for some $\mu > 0$, then there exists a finite $R$ above which the corresponding rate stated in Thm. F.2 holds.
• General convex: If the $F_i$'s satisfy Assumption B.2, then there exists a finite $R$ above which the corresponding rate in Thm. F.2 holds.
For the formal statement, see Thm. F.2. We show and discuss the near-optimality of FedAvg → ASG under strongly convex and PL conditions (and under full participation) in Section 5, where we introduce matching lower bounds. Under strong convexity (Table 1), FedAvg → ASG, compared to ASG, replaces the dependence on the initial suboptimality with a dependence on the heterogeneity, improving over ASG when the heterogeneity is moderately small. It also significantly improves over FedAvg, since its exponential decay in $R$ is far faster than FedAvg's polynomial decay. Under the PL condition (which does not require convexity) the story is similar (Table 4), except that we use FedAvg → SGD as our representative algorithm, as Nesterov acceleration is not known to help in nonconvex settings.
[Table: convergence-rate comparison — Centralized Algorithms / Federated Algorithms / This paper (general convex setting); see the discussion around Table 2.]
In the general convex case (Table 2), the comparison depends on the relative size of the heterogeneity and the remaining problem parameters. When the heterogeneity is sufficiently small, the FedAvg → ASG convergence rate is better than that of ASG. The rate of Karimireddy et al. (2020b) for FedAvg (which does not require the same heterogeneity bound) is strictly worse than that of ASG, and so is also worse than that of FedAvg → ASG in this regime. Altogether, for sufficiently small heterogeneity, FedAvg → ASG achieves the best known worst-case rate. In the remaining high-heterogeneity regime, FedAvg → ASG does not have a regime in which it improves over both ASG and FedAvg (under the analysis of Woodworth et al. (2020a)) at the same time. It is unclear whether this is due to looseness in the analysis.
Next, we analyze variance reduced methods that improve convergence when a random subset of the clients participate in each round (i.e., partial participation).
Theorem 4.3 (FedAvg → SAGA).
Suppose that the client objectives $F_i$ and their gradient queries satisfy Assumptions B.4, B.5, B.6, B.7, and B.8. Then running FedChain (Algo. 1) with $\mathcal{A}_{\mathrm{local}}$ as FedAvg (Algo. 4) with the parameter choices of Thm. E.1, and $\mathcal{A}_{\mathrm{global}}$ as SAGA (Algo. 5) with the parameter choices of Thm. D.4, we get the following guarantees provided the number of rounds is large enough (the precise condition is in Thm. F.3):
• Strongly convex: If the $F_i$'s satisfy Assumption B.1 for some $\mu > 0$, then there exists a finite $R$ above which the corresponding rate in Thm. F.3 holds.
• PL condition: If Assumption B.3 holds for some $\mu > 0$, then there exists a finite $R$ above which the corresponding rate in Thm. F.3 holds.
For the formal statement, see Thm. F.3.
Theorem 4.4 (FedAvg → SSNM).
Suppose that the client objectives $F_i$ and their gradient queries satisfy Assumptions B.4, B.5, B.6, and B.8. Then running FedChain (Algo. 1) with $\mathcal{A}_{\mathrm{local}}$ as FedAvg (Algo. 4) with the parameter choices of Thm. E.1, and $\mathcal{A}_{\mathrm{global}}$ as SSNM (Algo. 6) with the parameter choices of Thm. D.5, we get the following guarantee provided the number of rounds is large enough (the precise condition is in Thm. F.4):
• Strongly convex: If the $F_i$'s satisfy Assumption B.1 for some $\mu > 0$, then there exists a finite $R$ (same threshold as in FedAvg → SGD) above which the corresponding rate in Thm. F.4 holds.
For the formal statement, see Thm. F.4. SSNM (Zhou et al., 2019) is the Nesterov accelerated version of SAGA (Defazio et al., 2014).
The main benefit of using variance-reduced methods as $\mathcal{A}_{\mathrm{global}}$ is the removal of the client-sampling error (the terms depending on the sampling heterogeneity) from the convergence rates, in exchange for requiring a minimum round complexity, roughly enough rounds for every client to be visited. To illustrate this, in the strongly convex case FedAvg → SAGA (the variance-reduced counterpart of FedAvg → SGD) drops the sampling-heterogeneity error term present in the FedAvg → SGD rate, in exchange for a slower rate of linear convergence.
This same tradeoff occurs in finite-sum optimization (the problem of minimizing an average of component functions, where the components typically represent losses on individual data points and the main concern is computation cost), which is what variance reduction was designed for. In finite-sum optimization, variance-reduction methods such as SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014; Reddi et al., 2016) achieve linear rates of convergence (given strong convexity of the components) in exchange for requiring a minimum number of updates proportional to the number of components. Because we can treat FL as an instance of finite-sum optimization (viewing Eq. 1 as a finite sum over client objectives, with client sampling playing the role of component sampling), these results from variance-reduced finite-sum optimization can be extended to federated learning. This is the idea behind SCAFFOLD (Karimireddy et al., 2020b).
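The following is a minimal sketch (not the paper's Algorithm 5) of SAGA-style variance reduction applied across clients; the quadratic client losses, step size, and round budget are illustrative assumptions.

```python
# SAGA over clients: the server stores one gradient per client and corrects
# each freshly sampled client gradient by the stored-table average.
import numpy as np

rng = np.random.default_rng(0)
N, d, S, lr, rounds = 20, 5, 4, 0.1, 400
client_opts = rng.normal(size=(N, d))        # client i minimizes 0.5 * ||x - a_i||^2

def client_grad(i, x):
    return x - client_opts[i]

x = np.zeros(d)
table = np.array([client_grad(i, x) for i in range(N)])   # warm-started stored gradients
table_avg = table.mean(axis=0)

for _ in range(rounds):
    sampled = rng.choice(N, size=S, replace=False)
    grads = {i: client_grad(i, x) for i in sampled}
    # SAGA estimator: fresh gradient minus stored gradient plus table average.
    direction = np.mean([grads[i] - table[i] + table_avg for i in sampled], axis=0)
    x = x - lr * direction
    for i in sampled:                                      # refresh the table
        table_avg = table_avg + (grads[i] - table[i]) / N
        table[i] = grads[i]

print("distance to optimum:", np.linalg.norm(x - client_opts.mean(axis=0)))
```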
It is not always the case that variance reduction in $\mathcal{A}_{\mathrm{global}}$ achieves better rates: in the strongly convex case, variance reduction yields a gain only when the sampling-heterogeneity term would otherwise dominate the convergence rate.
5 Lower bounds
Our lower bound allows full participation. It assumes deterministic gradients and the following class from (Woodworth et al., 2020a; Woodworth, 2021; Carmon et al., 2020):
Definition 5.1.
For a vector $v$, let $\mathrm{supp}(v) := \{j : v_j \neq 0\}$. An algorithm is distributed zero-respecting if, for every local step index $k$, client $m$, and round $r$, the $k$-th iterate on the $m$-th client in the $r$-th round, $x^{k}_{m,r}$, satisfies
$$\mathrm{supp}\big(x^{k}_{m,r}\big) \;\subseteq\; \bigcup_{k' < k} \mathrm{supp}\big(\nabla F_m(x^{k'}_{m,r})\big) \;\cup\; \bigcup_{r' < r,\; m',\; k'} \mathrm{supp}\big(\nabla F_{m'}(x^{k'}_{m',r'})\big). \tag{3}$$
Distributed zero-respecting algorithms' iterates thus have nonzero components only in coordinates about which they have gathered some information. As discussed in (Woodworth et al., 2020a), this means that algorithms which are not distributed zero-respecting are just “wild guessing”. Algorithms that are distributed zero-respecting include SGD, ASG, and FedAvg. We assume the following class of algorithms in order to bound the heterogeneity of our construction in the lower bound proof.
Definition 5.2.
We say that an algorithm is distributed distance-conserving if, for every local step index $k$, client $m$, and round $r$, the $k$-th iterate on the $m$-th client in the $r$-th round satisfies $\|x^{k}_{m,r} - x^*\|^2 \le c\big[\|x_0 - x^*\|^2 + \frac{1}{N}\sum_{i=1}^{N}\|x_i^* - x^*\|^2\big]$, where $x^* \in \arg\min_x F(x)$, $x_i^* \in \arg\min_x F_i(x)$, $x_0$ is the initial iterate, and $c$ is some scalar parameter.
Algorithms which do not satisfy Definition 5.2 with $c$ at most logarithmic in the problem parameters (see § 2) are those that move substantially far away from $x^*$, even farther than the client optima $x_i^*$ are from $x^*$. With this definition in mind, we slightly overload the usual definition of heterogeneity for the lower bound:
Definition 5.3.
A distributed optimization problem is $\zeta$-heterogeneous if $\max_{i} \sup_{x \in \mathcal{X}} \|\nabla F_i(x) - \nabla F(x)\| \le \zeta$, where the supremum is over the set of points permitted by Definition 5.2, i.e., $\mathcal{X} := \{x : \|x - x^*\|^2 \le c\,[\|x_0 - x^*\|^2 + \frac{1}{N}\sum_i \|x_i^* - x^*\|^2]\}$ for some scalar parameter $c$.
Those achievable convergence rates in FL that assume Eq. 2 can be readily extended to account for Definition 5.3 as long as the algorithm satisfies Definition 5.2. We show that the chaining algorithms we propose satisfy Definition 5.2 in Thm. D.1, Thm. D.3, and Thm. E.1, with $c$ at most polylogarithmic in the problem parameters defined in § 2. (We do not formally show this for SAGA and SSNM, as under full participation these algorithms are functionally the same as SGD and ASG.) Other FL algorithms also satisfy Definition 5.2, notably FedAvg as analyzed in Woodworth et al. (2020a) with $c$ an absolute constant, which is the current tightest known analysis of FedAvg in the full participation setting.
Theorem 5.4.
For any number of rounds $R$, number of local steps per round $K$, and heterogeneity $\zeta$ (Definition 5.3), there exists a global objective $F$ which is the average of two $\beta$-smooth (Assumption B.4) and strongly convex (Assumption B.1) quadratic client objectives $F_1$ and $F_2$, with an initial sub-optimality gap of $\Delta$, such that the output of any distributed zero-respecting (Definition 5.1) and distance-conserving (Definition 5.2) algorithm satisfies the following (the explicit bounds are derived in App. G):
• Strongly convex: a lower bound that holds when the client objectives are strongly convex, and
• General convex: a corresponding lower bound when they are merely convex.
Corollary 5.5.
Under the hypotheses of Theorem 5.4, there exists a global objective which is $\mu$-PL and for which the corresponding lower bound holds.
A proof of the lower bound is in App. G, and the corollary follows immediately from the fact that $\mu$-strong convexity implies $\mu$-PL. This result tightens the lower bound in Woodworth et al. (2020a), which proves a similar lower bound over a much larger class of functions in which heterogeneity need only be bounded in a weaker sense. To the best of our knowledge, all achievable rates that can take advantage of heterogeneity require $\zeta$-heterogeneity in the sense of Eq. 2 (a smaller function class), and are therefore incomparable with the existing lower bound from (Woodworth et al., 2020a), which only requires the weaker notion. By introducing the class of distributed distance-conserving algorithms (which includes most of the algorithms we are interested in), Thm. 5.4 allows us, for the first time, to identify the optimality of the achievable rates shown in Tables 1, 2, and 4.
Note that this lower bound allows full participation and should be compared to the achievable rates with $S = N$; comparisons with variance-reduced methods like FedAvg → SAGA and FedAvg → SSNM are unnecessary. Our lower bound proves that FedAvg → ASG is optimal, up to condition-number factors that shrink exponentially in the number of rounds, among algorithms satisfying Definition 5.2 on problems satisfying Definition 5.3. Under the PL condition the situation is similar, except that FedAvg → SGD loses a condition-number factor in the exponent relative to the lower bound. On the other hand, there remains a substantial gap between FedAvg → ASG and the lower bound in the general convex case. Meaningful progress in closing the gap in the general convex case is an important future direction.
6 Experiments
We evaluate the utility of our framework in strongly convex and nonconvex settings. We compare the communication round complexities of four baselines (FedAvg, SGD, ASG, and SCAFFOLD), the stepsize-decaying variants of these algorithms (prefixed by M- in plots), and various instantiations of FedChain (Algo. 1).
Convex Optimization (Logistic Regression)
We first study federated regularized logistic regression, which is strongly convex (objective function in § I.1). In this experiment, we use the MNIST dataset of handwritten digits (LeCun et al., 2010). We (roughly) control client heterogeneity by creating “clients” through shuffling samples across different digit classes. The details of this shuffling process are described in § I.1; if a dataset is more shuffled (more homogeneous), it roughly corresponds to a lower heterogeneity dataset. All clients participate in each round.
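The following is a hedged sketch of the heterogeneity-via-shuffling idea described above; the paper's exact procedure is in Appendix I.1, and the sort-by-label partitioning and swap fraction used here are illustrative guesses.

```python
# Control heterogeneity by partially shuffling a label-sorted partition:
# shuffle_frac = 0 gives highly heterogeneous clients (few classes each),
# shuffle_frac = 1 gives nearly i.i.d. (homogeneous) clients.
import numpy as np

def make_clients(labels, n_clients, shuffle_frac, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                             # group examples by class
    n_swap = int(shuffle_frac * len(order))
    swap_pos = rng.choice(len(order), size=n_swap, replace=False)
    order[swap_pos] = rng.permutation(order[swap_pos])     # shuffle a fraction of slots
    return np.array_split(order, n_clients)                # index sets, one per client

labels = np.repeat(np.arange(10), 100)                     # stand-in for MNIST labels
clients = make_clients(labels, n_clients=10, shuffle_frac=0.3)
print([len(np.unique(labels[idx])) for idx in clients])    # distinct classes per client
```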
Fig. 2 compares the convergence of chained and non-chained algorithms in the stochastic-gradient setting (minibatches are 1% of a client's data) over a fixed budget of communication rounds (tuning details in § I.1). We observe that at all heterogeneity levels, FedChain instantiations (whether two or more stages) outperform the other baselines. We also observe that SCAFFOLD → SGD outperforms all other curves, including SCAFFOLD → ASG. This can be explained by the fact that acceleration amplifies the effect of noise, which can contribute to error if one does not take $K$ large enough. The effect of large $K$ is elaborated upon in § J.1.
[Table 3: test accuracies of baseline algorithms and FedChain instantiations, with constant stepsizes and with stepsize decay (left: EMNIST; right: CIFAR-100).]
Nonconvex Optimization (Neural Networks)
We also evaluate nonconvex image classification tasks with convolutional neural networks. In all experiments, we use partial client participation. We start with digit classification over EMNIST (Cohen et al., 2017) (tuning details in § I.2.1), where handwritten characters are partitioned by author. Table 3 (left) displays test accuracies on this task. Overall, FedChain instantiations perform better than the baselines.
Acknowledgments
This work is supported by Google faculty research award, JP Morgan Chase, Siemens, the Sloan Foundation, Intel, NSF grants CNS-2002664, CA-2040675, IIS-1929955, DMS-2134012, CCF-2019844 as a part of NSF Institute for Foundations of Machine Learning (IFML), and CNS-2112471 as a part of NSF AI Institute for Future Edge Networks and Distributed Intelligence (AI-EDGE). Most of this work was done prior to the second author joining Amazon, and it does not relate to his current position there.
References
- Al-Shedivat et al. (2020) Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, and Afshin Rostamizadeh. Federated learning via posterior averaging: A new perspective and practical algorithms. arXiv preprint arXiv:2010.05273, 2020.
- Arjevani & Shamir (2015) Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. arXiv preprint arXiv:1506.01900, 2015.
- Aybat et al. (2019) Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, and Asuman Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022, 2019.
- Carmon et al. (2020) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020.
- Charles & Konečnỳ (2020) Zachary Charles and Jakub Konečnỳ. On the outsized importance of learning rates in local update methods. arXiv preprint arXiv:2007.00878, 2020.
- Charles et al. (2021) Zachary Charles, Zachary Garrett, Zhouyuan Huo, Sergei Shmulyian, and Virginia Smith. On large-cohort training for federated learning. arXiv preprint arXiv:2106.07820, 2021.
- Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.
- Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014.
- Deng & Mahdavi (2021) Yuyang Deng and Mehrdad Mahdavi. Local stochastic gradient descent ascent: Convergence analysis and communication efficiency. In International Conference on Artificial Intelligence and Statistics, pp. 1387–1395. PMLR, 2021.
- Deng et al. (2020) Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
- Deng et al. (2021) Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Distributionally robust federated averaging. arXiv preprint arXiv:2102.12660, 2021.
- Fallah et al. (2020) Alireza Fallah, Asuman Ozdaglar, and Sarath Pattathil. An optimal multistage stochastic gradient method for minimax problems. In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 3573–3579. IEEE, 2020.
- Ghadimi & Lan (2012) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
- Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
- Gorbunov et al. (2020) Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local sgd: Unified theory and new efficient methods. arXiv preprint arXiv:2011.02828, 2020.
- Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26:315–323, 2013.
- Kairouz et al. (2019) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
- Karimireddy et al. (2020a) Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
- Karimireddy et al. (2020b) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020b.
- Khaled et al. (2020) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp. 4519–4529. PMLR, 2020.
- Koloskova et al. (2020) Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pp. 5381–5393. PMLR, 2020.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
- Li et al. (2018) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
- Li et al. (2019) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smithy. Feddane: A federated newton-type method. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 1227–1231. IEEE, 2019.
- Li et al. (2020) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
- Lin et al. (2018) Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.
- McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
- Mitra et al. (2021) Aritra Mitra, Rayana Jaafar, George J. Pappas, and Hamed Hassani. Linear Convergence in Federated Learning: Tackling Client Heterogeneity and Sparse Gradients. arXiv e-prints, art. arXiv:2102.07053, February 2021.
- Nesterov (2003) Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
- Reddi et al. (2020) Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
- Reddi et al. (2016) Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159, 2016.
- Stich (2018) Sebastian U Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
- Wang & Joshi (2018) Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
- Wang & Joshi (2019) Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd. In SysML, 2019.
- Wang et al. (2019a) Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. Matcha: Speeding up decentralized sgd via matching decomposition sampling. In 2019 Sixth Indian Control Conference (ICC), pp. 299–300. IEEE, 2019a.
- Wang et al. (2019b) Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. Slowmo: Improving communication-efficient distributed sgd with slow momentum. arXiv preprint arXiv:1910.00643, 2019b.
- Woodworth (2021) Blake Woodworth. The minimax complexity of distributed optimization. arXiv preprint arXiv:2109.00534, 2021.
- Woodworth et al. (2020a) Blake Woodworth, Kumar Kshitij Patel, and Nathan Srebro. Minibatch vs local sgd for heterogeneous distributed learning. arXiv preprint arXiv:2006.04735, 2020a.
- Woodworth et al. (2020b) Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pp. 10334–10343. PMLR, 2020b.
- Woodworth et al. (2021) Blake Woodworth, Brian Bullins, Ohad Shamir, and Nathan Srebro. The min-max complexity of distributed stochastic convex optimization with intermittent communication. arXiv preprint arXiv:2102.01583, 2021.
- Yuan & Ma (2020) Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. arXiv preprint arXiv:2006.08950, 2020.
- Yuan et al. (2020) Honglin Yuan, Manzil Zaheer, and Sashank Reddi. Federated composite optimization. arXiv preprint arXiv:2011.08474, 2020.
- Zhou et al. (2019) Kaiwen Zhou, Qinghua Ding, Fanhua Shang, James Cheng, Danli Li, and Zhi-Quan Luo. Direct acceleration of saga using sampled negative momentum. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1602–1610. PMLR, 2019.
[Table: convergence-rate comparison — Centralized Algorithms / Federated Algorithms / This paper (PL setting); see the discussion around Table 4.]
Appendix A Additional Related Work
Multistage algorithms. Our chaining framework has similarities in analysis to multistage algorithms, which have been employed in the classical convex optimization setting (Aybat et al., 2019; Fallah et al., 2020; Ghadimi & Lan, 2012; Woodworth et al., 2020a). The idea behind multistage algorithms is to manage stepsize decay so as to make fast progress on the “bias” error terms while controlling the variance error. However, chaining differs from multistage algorithms in a few ways: (1) we do not necessarily decay the stepsize; (2) in the heterogeneous FL setting, we have to accommodate error due to client drift (Reddi et al., 2020; Karimireddy et al., 2020b), which is not analyzed in the prior multistage literature; and (3) our goal is to minimize communication rounds rather than iteration complexity.
Appendix B Definitions
Assumption B.1.
The $F_i$'s are $\mu$-strongly convex for some $\mu > 0$, i.e.,
$$F_i(y) \ge F_i(x) + \langle \nabla F_i(x),\, y - x\rangle + \frac{\mu}{2}\|y - x\|^2 \tag{4}$$
for any $x, y \in \mathbb{R}^d$ and $i \in [N]$.
Assumption B.2.
The $F_i$'s are general convex, i.e.,
$$F_i(y) \ge F_i(x) + \langle \nabla F_i(x),\, y - x\rangle \tag{5}$$
for any $x, y \in \mathbb{R}^d$ and $i \in [N]$.
Assumption B.3.
$F$ is $\mu$-PL for some $\mu > 0$, i.e.,
$$\frac{1}{2}\|\nabla F(x)\|^2 \ge \mu\,\big(F(x) - F(x^*)\big) \tag{6}$$
for any $x$.
Assumption B.4.
The $F_i$'s are $\beta$-smooth, i.e.,
$$\|\nabla F_i(x) - \nabla F_i(y)\| \le \beta\,\|x - y\| \tag{7}$$
for any $x, y \in \mathbb{R}^d$ and $i \in [N]$.
Assumption B.5.
The $F_i$'s are $\zeta$-heterogeneous, defined as
$$\max_{i \in [N]} \sup_{x} \big\|\nabla F_i(x) - \nabla F(x)\big\| \le \zeta. \tag{8}$$
Assumption B.6.
Gradient queries within a client are unbiased and have a uniform upper bound on their variance:
$$\mathbb{E}_{z \sim \mathcal{D}_i}\big[\nabla f(x; z)\big] = \nabla F_i(x), \qquad \mathbb{E}_{z \sim \mathcal{D}_i}\big\|\nabla f(x; z) - \nabla F_i(x)\big\|^2 \le \sigma^2. \tag{9}$$
Assumption B.7.
Function-value (cost) queries within a client are unbiased and have a uniform upper bound on their variance:
$$\mathbb{E}_{z \sim \mathcal{D}_i}\big[f(x; z)\big] = F_i(x), \qquad \mathbb{E}_{z \sim \mathcal{D}_i}\big|f(x; z) - F_i(x)\big|^2 \le \sigma_f^2. \tag{10}$$
Assumption B.8.
Function-value (cost) queries within a client have a uniformly bounded absolute deviation:
$$\mathbb{E}_{z \sim \mathcal{D}_i}\big|f(x; z) - F_i(x)\big| \le \delta_f. \tag{11}$$
Assumption B.9.
The initial suboptimality gap is upper bounded by $\Delta$, where $x_0$ is the initial feasible iterate and $x^*$ is the global minimizer:
$$F(x_0) - F(x^*) \le \Delta. \tag{12}$$
Assumption B.10.
The expected initial squared distance from the optimum is upper bounded by $D^2$, where $x_0$ is the initial feasible iterate and $x^*$ is one of the global minimizers:
$$\mathbb{E}\,\|x_0 - x^*\|^2 \le D^2. \tag{13}$$
Appendix C Omitted Algorithm Definitions
Appendix D Proofs for global update methods
D.1 SGD
Theorem D.1.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 2 gives the following for returned iterate :
- •
-
•
General convex (grad norm): ’s satisfy Assumption B.2. If we return the average iterate , and where is the update variance
-
•
PL condition: ’s satisfy Assumption B.3. If we return the final iterate , set , then
Lemma D.2 (Update variance).
For SGD, we have that
(14) |
Proof.
D.1.1 Convergence of SGD for Strongly Convex functions
Proof.
Using the update of SGD,
(17) |
Taking expectation up to the r-th round
(18) |
By Assumption B.1,
(19) |
Using Lemma D.2, Assumption B.4, and setting ,
(20) | |||
(21) |
Rearranging and taking full expectation,
(22) |
letting , , , then
(23) |
From (Karimireddy et al., 2020b) Lemma 1, if and where ,
(24) |
Which finishes our work on the convergence rate. Next we consider the distance bound. Returning to the recurrence
(25) | |||
(26) |
Taking full expectation and unrolling,
(27) | |||
(28) |
Note that Karimireddy et al. (2020b) chose so that And so
(29) |
∎
D.1.2 Convergence of SGD for General Convex functions
Proof.
By smoothness (Assumption B.4), we have that
(30) |
Taking expectation conditioned up to and using Lemma D.2,
(31) |
If ,
(32) |
Taking full expectation and rearranging,
(33) |
Summing both sides over and averaging,
(34) |
Letting be an average of all the ’s, noting that , , and choosing where ,
(35) |
So we have our convergence rate. Next, for the distance bound,
(36) |
Taking expectation up to , we know from Lemma D.2,
(37) | |||
(38) |
By Assumption B.2 and Assumption B.4,
(39) | |||
(40) |
And with ,
(41) |
Unrolling and taking full expectation,
(42) |
We chose so that . Therefore
(43) |
∎
D.1.3 Convergence of SGD under the PL-condition
Proof.
Using Assumption B.4 we have that
(44) |
Taking expectations of both sides conditioned on the -th step,
(45) |
Using Lemma D.2,
(46) |
Using Assumption B.3, we have that
(47) |
Rearranging,
(48) |
Letting and subtracting from both sides,
(49) |
Which, upon unrolling the recursion, gives us
(50) |
Now, if we choose stepsize
(51) |
Then the final convergence rate is
(52) |
∎
D.2 ASG
The precise form of accelerated stochastic gradient we run is AC-SA(Ghadimi & Lan, 2012; 2013). The following specification is taken from Woodworth et al. (2020a):
We run Algo. 3 for iterations using , and , with definitions
-
•
-
•
, ,
-
•
Where . Set where is the solution obtained in the previous step.
Theorem D.3.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 3 gives the following for returned iterate :
-
•
Strongly convex: ’s satisfy Assumption B.1 for . If we return after rounds of the multistage AC-SA specified above,
And given no randomness, for any
and for any
-
•
General convex (grad norm): ’s satisfy Assumption B.2. If we return after rounds of the multistage AC-SA specified above on the regularized objective with , we have
D.2.1 Convergence of ASG for Strongly Convex Functions
Proof.
The convergence rate proof comes from Woodworth et al. (2020a) Lemma 5. What remains is to show the distance bound.
From Proposition 5, (Eq. 3.25, 4.25) of Ghadimi & Lan (2012), we have that the generic (non-multistage) AC-SA satisfies (together with Lemma H.1)
(53) | |||
(54) |
Where Notice that
(55) |
So
(56) |
Which implies
(57) |
Furthermore, so
(58) |
If there is no gradient variance or sampling,
(59) |
Next, we show the above two conclusions hold for and .
For , we show by induction:
Base case: , so it is true
Inductive case: is a convex combination of and , so the above statements hold for as well.
For , note it is a convex combination of and , so the above statements on distance also hold for .
For the distance bound on the returned solution, use strong convexity on the convergence rate:
(60) |
∎
D.2.2 Convergence of ASG for General Convex Functions
For the general convex case, we use Nesterov smoothing. Concretely, we will run Algo. 3 assuming strong convexity by optimizing instead a modified objective
(61) |
Define and . We will choose carefully to balance the error introduced by the regularization term and the better convergence properties of having larger .
Proof.
Observe that
(62) |
We evaluate the second term first. We know that from the theorem statement on strongly convex functions and letting and because is now -smooth,
(63) |
Which implies
(64) | ||||
(65) |
So it is true that
(66) | ||||
(67) |
If we rearrange,
(68) | ||||
(69) | ||||
(70) |
Next we evaluate . Observe that the smoothness of is . Returning to the convergence rate,
(71) |
We have that
(72) |
So . By Assumption B.4,
(73) | ||||
(74) |
So altogether,
(75) |
Choose . By Theorem E.1’s proof in Yuan & Ma (2020),
(76) |
Next we evaluate the distance bound for the returned iterate. Observe that from the distance bound in the strongly convex case,
(77) |
Given that ,
(78) | ||||
(79) |
Now we look at the actual distance we want to bound:
(80) |
Observe that
(82) |
which implies that
(83) |
So altogether
(84) |
∎
D.3 SAGA
Theorem D.4.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 5 gives the following for returned iterate :
-
•
Strongly convex: ’s satisfy Assumption B.1 for . If we return with and , , , and we use Option I,
-
•
PL condition: ’s satisfy Assumption B.3. If we set , use Option II in a multistage manner (details specified in proof), and return a uniformly sampled iterate from the final stage,
(85)
D.3.1 Convergence of SAGA for Strongly Convex functions
The proof of this is similar to that of Karimireddy et al. (2020b, Theorem VII).
Proof.
Following the standard analysis,
(86) | |||
(87) |
We treat each of these terms separately.
Second term: Observe first that
(89) | ||||
(90) |
And so the middle term has, by (strong) convexity,
(91) | ||||
(92) |
Third term:
(93) | |||
(94) | |||
(95) | |||
(96) | |||
(97) |
Where the last inequality comes from the relaxed triangle inequality. Then, by using Assumption B.6, Assumption B.4, and Assumption B.2,
(99) | |||
(100) |
Taking full expectation and separating out the variance of the control variates,
(101) | |||
(102) |
Now observe that because ,
(103) | ||||
(104) |
Bounding the control lag:
Recall that
(105) |
Therefore
(106) |
Returning to the definition of ,
(107) | ||||
(108) | ||||
(109) |
where in the last inequality we applied Jensen’s inequality twice. By smoothness of ’s (Assumption B.4),
(110) |
Putting it together:
So putting it all together, we have
(111) | |||
(112) |
From our bound on the control lag,
(113) | ||||
(114) |
Adding both inequalities, we have
(115) | |||
(116) | |||
(117) |
Let , , then
(118) | ||||
(119) |
Then by using Lemma 1 in (Karimireddy et al., 2020b), setting , , and choosing where and , we have that by outputting with and , we have
(120) | ||||
(121) | ||||
(122) |
We use a warm-start strategy to initialize all control variates in the first rounds such that
By smoothness of ’s (Assumption B.4),
(123) | ||||
(124) | ||||
(125) |
And recalling that , we know that
(126) |
So altogether,
(127) |
∎
D.3.2 Convergence of SAGA under the PL condition
This proof follows Reddi et al. (2016).
Proof.
We start with Assumption B.4
(128) |
Using the fact that is unbiased,
(129) |
Now we consider the Lyapunov function
(130) |
We bound using
(131) | |||
(132) |
Where the equality comes from how with probability and is otherwise. Observe that we can bound
(133) | |||
(134) | |||
(135) |
Where we used unbiasedness of and Fenchel-Young inequality. Plugging this into ,
(136) | ||||
(137) | ||||
(138) | ||||
(139) | ||||
(140) |
Now we must bound .
(141) | |||
(142) | |||
(143) | |||
(144) |
Where the second to last inequality is an application of , and separating out the variance (taking advantage of the fact that ’s and ’s are independent), and is some constant.
We use Assumption B.4 and take expectation over sampled clients to get
(145) |
Returning to our bound on ,
(146) | ||||
(147) | ||||
(148) |
We set , which results in
(149) | |||
(150) |
Let . Then by rearranging
(151) |
Letting ,
(152) | ||||
(153) |
Implying
(154) |
Therefore, if we take a uniform sample from all the , denoted ,
(155) |
We start by bounding . Take and . Let . Observe that and . Then , which implies
So we can conclude that
(156) |
D.4 SSNM
Note that our usual assumption that the objective is $\mu$-strongly convex can be straightforwardly converted into the assumption that each loss is the sum of a merely convex part and a $\mu$-strongly convex part (see (Zhou et al., 2019), Section 4.2).
Theorem D.5.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 6 gives the following for returned iterate :
-
•
Strongly convex: ’s satisfy Assumption B.1 for . If we return the final iterate and set if and if ,
D.4.1 Convergence of SSNM on Strongly Convex functions
Most of this proof follows that of (Zhou et al., 2019) Theorem 1.
First, we compute the variance of the update .
(163) | |||
(164) | |||
(165) | |||
(166) | |||
(167) |
For some constant . In the first inequality we separated out the gradient variance, second inequality we use the fact that , third inequality used Assumption B.4, and fourth we took expectation with respect to the sampled clients. From convexity we have that
(168) | |||
(169) | |||
(170) | |||
(171) | |||
(172) |
where the second to last inequality comes from the definition that . Taking expectation with respect to the sampled clients, we have
(173) | ||||
(174) |
For , we can employ smoothness at , which holds for any :
(175) |
using and (though the second is never explicitly computed and only implicitly exists)
(176) |
Taking expectation over and using and we see that
(177) | ||||
(178) |
and rearranging,
(179) | |||
(180) | |||
(181) |
Substituting this into Eq. 173 after taking expectation over , and observing that from (Zhou et al., 2019, Lemma 2) we have identity
(182) | ||||
(183) |
so with , we get
(184) | |||
(185) | |||
(186) | |||
(187) | |||
(188) | |||
(189) |
We use the constraint that plus Young’s inequality with on to get
(190) | |||
(191) | |||
(192) | |||
(193) | |||
(194) |
using the bound on variance, we get
(195) | |||
(196) | |||
(197) | |||
(198) |
combining terms,
(199) | |||
(200) | |||
(201) |
Using convexity of and for ,
(202) |
After taking expectation with respect to and ,
(203) |
and plugging this back in, multiplying by on both sides, and adding both sides by
(204) | |||
(205) | |||
(206) |
Observe that the probability of choosing any client index happens with probability , so
(207) |
which implies
(208) | |||
(209) | |||
(210) |
To complete our Lyapunov function so the potential is always positive, we need another term:
(211) | |||
(212) | |||
(213) | |||
(214) |
Taking expectation with respect to ,
(215) |
Let and , then we can write
(216) |
Case 1: Suppose that , then choosing , , we evaluate the parameter constraint :
(217) |
which shows our constraint is satisfied. We also know that
(218) |
So we can write
(219) | |||
(220) |
Telescoping the contraction and taking expectation with respect to all randomness, we have
(221) | |||
(222) |
and based on convexity. Next, we calculate the coefficient of the variance term.
(223) | ||||
(224) | ||||
(225) | ||||
(226) |
Substituting the parameter choices and using strong convexity,
(227) |
Case 2: On the other hand, we can have and choose , . One can check that the constraint is satisfied.
After telescoping and taking expectation,
(228) | |||
(229) |
Next, we calculate the coefficient of the variance term.
(230) | ||||
(231) | ||||
(232) |
which gives us
(233) |
Altogether, supposing we choose our parameters as stated in the two cases, we have:
(234) |
Appendix E Proofs for local update methods
E.1 FedAvg
Theorem E.1.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 4 gives the following for returned iterate :
-
•
Strongly convex: ’s satisfy Assumption B.1 for . If we return the final iterate, set , and sample clients per round arbitrarily,
Further, if there is no gradient variance and only one client is sampled the entire algorithm,
-
•
General convex: ’s satisfy Assumption B.2. If we return the last iterate after running the algorithm on with , and ,
and for any ,
-
•
PL condition: ’s satisfy Assumption B.3 for . If we return the final iterate, set , and sample one client per round,
E.1.1 Convergence of FedAvg for Strongly Convex functions
Proof.
By smoothness of F (Assumption B.4), we have for
(235) |
Using the fact that for any we have ,
(236) |
Letting ,
(237) |
Conditioning on everything up to the -th step of the -th round,
(238) | ||||
(239) |
Where the last step used the fact that , the assumption on heterogeneity (Assumption B.5), and the assumption on gradient variance (Assumption B.6) along with the fact that is an average over client gradient queries. Next, using -strong convexity of (Assumption B.1),
(240) |
which after taking full expectation gives
(241) |
Unrolling the recursion over we get
(242) | |||
(243) |
Also note that . So by convexity of ,
(244) | ||||
(245) | ||||
(246) |
Unrolling the recursion over , we get
(247) | |||
(248) |
Which can be upper bounded as
(249) |
The final statement comes from the fact that because of and applying triangle inequality. ∎
E.1.2 Convergence of FedAvg for General Convex functions
For the general convex case, we use Nesterov smoothing. Concretely, we will run Algo. 3 assuming strong convexity by optimizing instead a modified objective
(250) |
Define and . We will choose carefully to balance the error introduced by the regularization term and the better convergence properties of having larger .
Proof.
We know that running Algo. 4 with gives (from the previous proof)
(251) |
We have that by (Yuan & Ma, 2020, Proposition E.7)
(252) |
So we have (because as shown in Thm. D.3),
(253) |
Then if we choose , , and ,
(254) |
Now we show the distance bound. Recall that
(255) |
By smoothness, strong convexity of , and the choice of (as we chose each term above divided by to match up to log factors), we have that
(256) |
So,
(257) |
Where the last inequality follows because
(258) |
∎
E.1.3 Convergence of FedAvg under the PL condition
Proof.
The same proof follows as in the strongly convex case, except Eq. 244, where we avoid having to use convexity by only sampling one client at a time. This follows previous work such as Karimireddy et al. (2020a). Averaging when the functions are not convex can cause the error to blow up, though in practice this is not seen (as mentioned in Karimireddy et al. (2020a)). ∎
Appendix F Proofs for FedChain
F.1 FedAvg SGD
F.1.1 Convergence of FedAvg SGD on Strongly Convex Functions
Theorem F.1.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.7, Assumption B.5, Assumption B.8. Then running Algo. 1 where is Algo. 4 in the setting of Thm. E.1 and is Algo. 2 in the setting Thm. D.1:
-
•
Strongly convex: ’s satisfy Assumption B.1 for . Then there exists a lower bound of such that we have the rate
-
•
General convex: ’s satisfy Assumption B.2. Then there exists a lower bound of such that we have the rate
-
•
PL condition: ’s satisfy Assumption B.3 for . Then there exists a lower bound of such that we have the rate
F.1.2 Convergence of FedAvg SGD on General Convex Functions
Proof.
From Thm. E.1, we know that with
From Lemma H.2, if
(263) |
And so from Thm. D.1, we know that
(264) |
Next, using that from Thm. E.1 and Thm. D.1 as well as convexity,
(265) | |||
(266) | |||
(267) | |||
(268) |
One can pre-run SGD before all of this for a constant fraction of rounds so that (Woodworth et al., 2020a, Section 7):
(269) |
This is likely not necessary in practice (see § 6). Altogether,
(270) | |||
(271) | |||
(272) | |||
(273) |
∎
F.1.3 Convergence of FedAvg SGD under the PL-condition
Proof.
The proof is the same as in the strongly convex case. ∎
F.2 FedAvg ASG
Theorem F.2.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.7, Assumption B.5, Assumption B.8. Then running Algo. 1 where is Algo. 4 in the setting of Thm. E.1 and is Algo. 3 in the setting Thm. D.1:
-
•
Strongly convex: ’s satisfy Assumption B.1 for . Then there exists a lower bound of (same as in FedAvg SGD) such that we have the rate
-
•
General convex: ’s satisfy Assumption B.2. Then there exists a lower bound of (same as in FedAvg SGD) such that we have the rate
Proof follows those of FedAvg SGD (Thm. F.1).
F.3 FedAvg SAGA
Theorem F.3.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.7, Assumption B.5, Assumption B.8. Then running Algo. 1 where is Algo. 4 in the setting of Thm. E.1 and is Algo. 5 in the setting Thm. D.4:
-
•
Strongly convex: ’s satisfy Assumption B.1 for . Then there exists a lower bound of (same as in FedAvg SGD) such that we have the rate
-
•
PL condition: ’s satisfy Assumption B.3 for . Then there exists a lower bound of such that we have the rate
F.4 FedAvg SSNM
Theorem F.4.
Suppose that client objectives ’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.5, Assumption B.8. Then running Algo. 1 where is Algo. 4 in the setting of Thm. E.1 and is Algo. 6 in the setting Thm. D.5:
-
•
Strongly convex: ’s satisfy Assumption B.1 for . Then there exists a lower bound of (same as in FedAvg SGD) such that we have the rate
Appendix G Lower Bound
In this section we prove the lower bound. All gradients will be noiseless, and we will allow full communication to all clients every round. There are only two functions in this lower bound: and . If , then is assigned to the first clients and to the next clients. If there is an odd number of machines, we let the last machine be ; this reduces the lower bound by a factor of at most . So we consider the case .
Let be values to be chosen later. Let be even. At a high level, essentially controls the smoothness of and , is basically a constant, and is a quantity that affects the heterogeneity, initial suboptimality gap, and initial distance. We give the instance:
(274) |
(275) |
where .
These functions are the same as those used in (Woodworth et al., 2020a; Woodworth, 2021) for lower bounds in distributed optimization. They are also similar to the instances used to prove convex optimization lower bounds (Nesterov, 2003) and distributed optimization lower bounds (Arjevani & Shamir, 2015).
These two functions have the following property:
(276) | |||
(277) |
Roughly, the property says the following. Suppose that so far an even number of coordinates have been made nonzero. Then only gradient queries of can unlock the next coordinate. After client 1 queries the gradient of , it holds an odd number of unlocked coordinates and cannot unlock any more until a communication occurs. The analogous statement holds when the number of unlocked coordinates is odd. Hence each round of communication can unlock at most one new coordinate.
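The following toy Python simulation illustrates this unlocking argument. It is schematic rather than the exact instance in Eqs. 274-275: we only model the parity structure, in which the client holding the first function can extend progress (the count of leading nonzero coordinates) only when that count is even, and the client holding the second function only when it is odd.

```python
def max_progress(num_rounds, local_steps):
    """Toy illustration of the unlocking argument (schematic, not the exact
    lower-bound instance): between communications, each client works in
    parallel from the synchronized iterate and can extend progress at most
    once before stalling, so each round unlocks at most one coordinate."""
    progress = 0  # number of leading nonzero coordinates of the shared iterate
    for _ in range(num_rounds):
        new_progress = progress
        for parity in (0, 1):  # parity 0: F1-type client, parity 1: F2-type client
            local = progress   # each client starts the round from the shared iterate
            for _ in range(local_steps):
                if local % 2 == parity:
                    local += 1  # unlocks one coordinate, then gets stuck locally
            new_progress = max(new_progress, local)
        progress = new_progress  # communication merges the clients' progress
    return progress  # equals num_rounds whenever local_steps >= 1
```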
We start by writing out the gradients of and . Assume that is even. Then:
(278) | ||||
(279) | ||||
(280) |
and
(281) | ||||
(282) | ||||
(283) |
We now define the class of algorithms to which our lower bound applies. This definition follows (Woodworth et al., 2020a; Woodworth, 2021; Carmon et al., 2020):
Definition G.1 (Distributed zero-respecting algorithm).
For a vector , let . We say that an optimization algorithm is distributed zero-respecting if for any , the -th iterate on the -th client in the -th round satisfies
(284) |
Broadly speaking, distributed zero-respecting algorithms are those whose iterates have nonzero components only in dimensions about which they could have gathered information. As discussed in (Woodworth et al., 2020a), this means that algorithms which are not distributed zero-respecting are simply "wild guessing". Algorithms that are distributed zero-respecting include SGD, ASG, FedAvg, SCAFFOLD, SAGA, and SSNM.
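For concreteness, the condition in Eq. 284 can be written out as follows, under an assumed indexing convention following Woodworth et al. (2020a): here x^r_{m,k} denotes the k-th local iterate of client m in round r. This restatement is ours and may differ superficially from the notation used in the rest of the appendix.

```latex
% Assumed restatement of the zero-respecting condition (cf. Eq. 284),
% following the convention of Woodworth et al. (2020a).
\operatorname{supp}\big(x^{r}_{m,k}\big) \subseteq
  \bigcup_{k' < k} \operatorname{supp}\!\big(\nabla F_m(x^{r}_{m,k'})\big)
  \;\cup\;
  \bigcup_{r' < r} \bigcup_{m',\,k'} \operatorname{supp}\!\big(\nabla F_{m'}(x^{r'}_{m',k'})\big),
\qquad
\operatorname{supp}(v) := \{\, i : v_i \neq 0 \,\}.
```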
Next, we define the second condition we require of algorithms for our lower bound.
Definition G.2.
We say that an algorithm is distributed distance-conserving if for any , the -th iterate on the -th client in the -th round satisfies , where and and is the initial iterate, for some scalar parameter .
Algorithms which do not satisfy Definition G.2 for polylogarithmic in the problem parameters (see § 2) are those that move substantially far away from , even farther than the ’s are from . With this definition in mind, we slightly overload the usual definition of heterogeneity for the lower bound:
Definition G.3.
A distributed optimization problem is -heterogeneous if , where we define for some scalar parameter .
While convergence rates in FL are usually proven under Assumption B.5, the proofs can be converted to work under Definition G.3 as well, so long as one can prove that all iterates stay within the set defined in Definition G.3. We show that our algorithms satisfy this requirement as a by-product of our analysis (Thm. E.1, Thm. D.3, Thm. D.1); we do not formally show it for SAGA and SSNM, as those algorithms are functionally the same as SGD and ASG under full participation. Other proofs of FL algorithms also satisfy this requirement, most notably the proof of convergence of FedAvg in Woodworth et al. (2020a).
Following (Woodworth et al., 2020a), we start by arguing that a distributed zero-respecting algorithm can unlock only one coordinate per round of communication. Let , and be the null span. Then
Lemma G.4.
Let be the output of a distributed zero-respecting algorithm optimizing after rounds of communication. Then we have
(285) |
Proof.
A proof is in Woodworth et al. (2020a, Lemma 9). ∎
We now compute various properties of this distributed optimization problem.
G.1 Strong convexity and smoothness of
From Woodworth (2021, Lemma 25), as long as , we have that are -smooth and -strongly convex.
G.2 The solutions of
First, observe that from the gradient of computed in Eq. 281, . Next observe that from the gradient of computed in Eq. 278, . Thus and .
From Woodworth (2021, Lemma 25), if we let , , , then .
G.3 Initial suboptimality gap
From Woodworth (2021, Lemma 25), .
G.4 Suboptimality gap after rounds of communication
From (Woodworth, 2021, Lemma 25), , as long as .
G.5 Computation of
We start by writing out the difference between the gradient of and , using Eq. 278 and Eq. 281.
(286) |
We upper bound each of the terms:
(287) |
(288) | |||
(289) |
(290) | ||||
(291) |
So altogether, using the fact that ,
(292) | |||
(293) | |||
(294) | |||
(295) | |||
(296) |
Next we use Definition G.2, which says that . For ease of calculation, we assume . Recall that we calculated (because ), , .
(297) | ||||
(298) |
Altogether,
(299) |
G.6 Theorem
With all of the computations earlier, we are now prepared to prove the theorem.
Theorem G.5.
For any number of rounds , number of local steps per round , and -heterogeneity (Definition G.3), there exists a global objective which is the average of two -smooth (Assumption B.4) and ()-strongly convex (Assumption B.1) quadratic client objectives and , with an initial sub-optimality gap of , such that the output of any distributed zero-respecting (Definition G.1) and distance-conserving (Definition G.2) algorithm satisfies
• Strongly convex: when , and
• General convex: when .
Proof.
The convex case: By the previous computations, we know that after rounds,
(300) | ||||
(301) | ||||
(302) |
To maintain the property that and Definition G.3, choose s.t.
(303) |
for an absolute constant .
This leads to (throughout we use , (Woodworth, 2021, Eq 769))
(304) | |||
(305) | |||
(306) | |||
(307) | |||
(308) | |||
(309) |
We choose , which ensures . So noting that , , :
(311) | |||
(312) | |||
(313) | |||
(314) |
So,
(315) | |||
(316) | |||
(317) | |||
(318) |
where we used . Setting (which ensures that with appropriate constants chosen), we altogether obtain
(320) |
The strongly convex case:
By the previous computations, we know that after rounds,
(321) | ||||
(322) | ||||
(323) |
To maintain the property that and that for all encountered during the execution of the algorithm, choose s.t.
(324) |
for some constant .
Again using and following the same calculations as in the convex case, we use the fact that , and that it is possible to choose s.t. , which ensures .
(325) | |||
(326) | |||
(327) |
We use and because it is possible to choose such that (Woodworth, 2021, Eq. 800), which ensures and
(329) | |||
(330) | |||
(331) |
Next we note that and (Woodworth, 2021, Eq 801), which gives us
(332) |
∎
Appendix H Technical Lemmas
Lemma H.1.
Let
(333) |
Given Assumption B.6, Assumption B.5, and uniformly sampled clients per round,
(334) |
If instead the clients are arbitrarily (possibly randomly) sampled and there is no gradient variance,
(335) |
Proof.
(336) | ||||
(337) | ||||
(338) |
We get Eq. 337 by using over the randomness of the gradient variance. We get Eq. 338 using the fact that the gradient randomness is i.i.d. across and (so the cross terms in the expansion vanish). Note that by Assumption B.6. On the other hand, by Assumption B.5,
(339) |
So altogether
(340) |
The second conclusion follows by noting
(341) |
∎
Lemma H.2.
Let be arbitrary (possibly random) points. Define where . Let
Then
(342) | |||
(343) |
and
(344) | |||
(345) |
Proof.
Suppose that , where . Then,
(346) |
Substituting ,
(347) | ||||
(348) |
Observe that by Assumption B.7 and Assumption B.8. Therefore by Chebyshev’s inequality,
(349) |
Observe that
(350) | ||||
(351) |
And so it is also true that
(352) | ||||
(353) | ||||
(354) |
Therefore altogether we have that
(355) |
So by maximizing for
(356) |
which gives us
(357) |
The proof also holds if we switch the roles of and , and so altogether we have that: if ,
(358) |
if ,
(359) |
implying the first part of the lemma.
(360) |
For the second part,
(361) | ||||
(362) |
and then
(363) | ||||
(364) |
∎
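The selection step that Lemma H.2 analyzes can be pictured as follows: given two candidate points, estimate the global objective at each using the same sampled clients and keep the candidate with the smaller empirical value; the Chebyshev argument above guarantees the kept point is near-optimal among the two with constant probability. The sketch below is illustrative only: the `clients[i].loss(x)` interface and all names are our own assumptions, not the paper's code.

```python
import numpy as np

def pick_better_point(x, y, clients, num_sampled, rng=None):
    """Sketch of the best-of-two selection analyzed in Lemma H.2 (illustrative
    interface): estimate the global objective at both candidates with the same
    sampled clients and keep the candidate with the smaller empirical value."""
    rng = np.random.default_rng() if rng is None else rng
    sampled = rng.choice(len(clients), size=num_sampled, replace=False)
    # clients[i].loss(point) is an assumed method returning a stochastic
    # estimate of that client's objective at the given point.
    est_x = np.mean([clients[i].loss(x) for i in sampled])
    est_y = np.mean([clients[i].loss(y) for i in sampled])
    return x if est_x <= est_y else y
```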
Appendix I Experimental Setup Details
I.1 Convex Optimization
We empirically evaluate FedChain on federated regularized logistic regression. Let be the -th datapoint of the -th client, and let be the number of datapoints of the -th client. We minimize Eq. 1 where
Dataset. We use the MNIST dataset of handwritten digits (LeCun et al., 2010). We model a federated setting with five clients by partitioning the data into five groups. First, we take 500 images from each digit class (5,000 total). Each client’s local data is a mixture of data drawn from two digit classes (inducing heterogeneity) and data sampled uniformly from all classes.
We call a federated dataset homogeneous if the first of each class’s 500 images is shuffled and evenly partitioned across the clients. The remaining is partitioned as follows: client receives the remaining non-shuffled data from classes and . For example, in a 50% homogeneous setup, client 3 has 250 samples from digit 4, 250 samples from digit 5, and 500 samples drawn uniformly from all classes. Note that 100% homogeneity is not the same as setting the heterogeneity to zero, due to sampling randomness; we use this technique for lack of better control over . To model binary classification, we let even classes represent 0’s and odd classes represent 1’s, and set . All clients participate in every round.
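The following is a minimal Python sketch of the partitioning just described, under stated assumptions: `images_by_class[c]` holds the 500 images of digit class c, the homogeneity level is a fraction `p` in [0, 1], and we assume client i (0-indexed) keeps classes 2i and 2i+1, which is consistent with the example of client 3 holding digits 4 and 5. Variable names are ours.

```python
import numpy as np

def partition_mnist(images_by_class, p, num_clients=5, per_class=500,
                    rng=np.random.default_rng(0)):
    """Sketch of the p-homogeneous split described above (names are ours).
    images_by_class: dict mapping digit class -> array of `per_class` images."""
    n_shuffled = int(p * per_class)  # first p fraction of each class
    clients = [[] for _ in range(num_clients)]

    # Homogeneous part: pool the first p fraction of every class, shuffle,
    # and split it evenly across all clients.
    pooled = np.concatenate([images_by_class[c][:n_shuffled] for c in range(10)])
    rng.shuffle(pooled)
    for client_id, shard in enumerate(np.array_split(pooled, num_clients)):
        clients[client_id].extend(shard)

    # Heterogeneous part (assumed mapping): client i keeps the remaining
    # images of classes 2i and 2i+1, so the third client holds digits 4 and 5.
    for client_id in range(num_clients):
        for c in (2 * client_id, 2 * client_id + 1):
            clients[client_id].extend(images_by_class[c][n_shuffled:])
    return clients
```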
Hyperparameters. All experiments initialize iterates at 0 with regularization . We fix the total number of rounds (differs across experiments). For algorithms without stepsize decay, the tuning process is as follows.
We tune all stepsizes in the range below:
(365) |
We tune the percentage of rounds before switching from to (if applicable) as
(366) |
For algorithms with acceleration, we run experiments using the more easily implementable (but less mathematically tractable for our analysis) version in Aybat et al. (2019) as opposed to Algo. 3.
For algorithms with stepsize decay, we instead use the range in Eq. 366 to decide the percentage of rounds to wait before halving the stepsize; denote this number of rounds as . Subsequently, we halve the stepsize at every factor-of-2 multiple of .
We use the heuristic that when the stepsize decreases to , we switch to and restart the stepsize decay process. All algorithms are tuned for the best final gradient norm, averaged over 1000 runs.
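A minimal sketch of this decay-and-switch heuristic is given below, under assumed names and constants: the stepsize is halved at every factor-of-2 multiple of `r_decay`, and once it falls past an assumed threshold (`switch_ratio * eta0`; the paper's exact threshold is not reproduced here) the run switches to the second phase and restarts the decay.

```python
def stepsize_schedule(num_rounds, eta0, r_decay, switch_ratio=1 / 8):
    """Sketch of the decay-and-switch heuristic (illustrative constants):
    halve the stepsize at rounds r_decay, 2*r_decay, 4*r_decay, ...;
    once it falls below switch_ratio * eta0, switch phases and restart decay."""
    schedule = []
    eta, phase, next_decay = eta0, "local", r_decay
    for r in range(num_rounds):
        schedule.append((r, phase, eta))
        if r + 1 == next_decay:
            eta /= 2
            next_decay *= 2
            if phase == "local" and eta <= switch_ratio * eta0:
                # Switch to the global-update phase and restart the decay there.
                phase, eta, next_decay = "global", eta0, (r + 1) + r_decay
    return schedule
```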
I.2 Nonconvex Optimization
I.2.1 EMNIST
In this section we detail how we use the EMNIST (Cohen et al., 2017) dataset for our nonconvex experiments, where the handwritten characters are partitioned by their author. In this dataset, there are 3400 clients, with 671,585 images in the train set and 77,483 in the test set. The task is to train a convolutional network with two convolutional layers, max-pooling, dropout, and two dense layers as in the Federated Learning task suite from (Reddi et al., 2020).
Hyperparameters. We set , and rather than fixing the number of local steps, we fix the number of local client epochs (the number of times a client performs gradient descent over its own dataset) at 20. The number of clients sampled per round is 10. These changes were made in light of the imbalanced dataset sizes across clients and are standard for this dataset-task pair (Reddi et al., 2020; Charles & Konečnỳ, 2020; Charles et al., 2021). For all algorithms aside from M-SGD, we tune in the range . We tune the percentage of rounds before switching from to in , where applicable.
We next detail how algorithms with stepsize decay are run. We tune the stepsize in . Let be the number of rounds before the first decay event; we tune . Whenever the number of elapsed rounds reaches a power-of-2 multiple of , we halve the stepsize. After rounds have passed, we enter , where we tune . We restart the decay process upon entering .
For M-SGD we tune
and
For stepsize-decaying SCAFFOLD and FedAvg, we tune and . When evaluating a particular metric, the algorithms are tuned according to the mean of that metric over the last 5 evaluated iterates. Reported accuracies are also the mean over the last 5 evaluated iterates, as the values mostly stabilized during training.
I.2.2 CIFAR-100
For the CIFAR-100 experiments, we use the CIFAR-100 dataset (Krizhevsky, 2009), where the images are partitioned across clients via the method proposed in (Reddi et al., 2020). In this dataset, there are 500 train clients, with a total of 50,000 images in the train set. There are 100 test clients and 10,000 images in the test set. The task is to train a ResNet-18, replacing batch normalization layers with group normalization, as in the Federated Learning task suite from (Reddi et al., 2020). We leave out SCAFFOLD in all experiments because of out-of-memory issues.
Hyperparameters. We set , and rather than fixing the number of local steps, we fix the number of local client epochs (the number of times a client performs gradient descent over its own dataset) at 20. The number of clients sampled per round is 10. These changes were made in light of the imbalanced dataset sizes across clients and are standard for this dataset-task pair (Reddi et al., 2020; Charles & Konečnỳ, 2020; Charles et al., 2021). For all algorithms we tune in . Algorithms without stepsize decay tune the percentage of rounds before switching to in . Algorithms with stepsize decay (aside from M-SGD) tune the number of rounds before the first decay event in (the stepsize is halved at every power-of-2 multiple of that number), and, if applicable, the number of rounds before swapping to in (the stepsize decay process restarts after swapping). M-SGD tunes the number of rounds before the first decay event in . We tune for and report the test accuracy of the last iterate, as accuracy was still improving at the end of training.
Appendix J Additional Convex Experiments
In this section, we give supplemental experimental results to verify the following claims:
1. Higher allows (1) accelerated algorithms to perform better than non-accelerated algorithms, and (2) running for only one round while still obtaining satisfactory results.
2. Our FedChain instantiations outperform stepsize-decaying baselines.
J.1 Verifying the effect of increased and
Our proofs for FedChain all continue to hold if we run for only one communication round, so long as is sufficiently large (the exact requirement is given in the formal theorem proofs). In this section we verify the effect of large in conjunction with running for only one round. The setup is the same as in § I.1, except that we tune the stepsize over a larger grid:
(367) |
The plots are in Fig. 3. For algorithms labeled “1-XY”, we run algorithm X for one round and algorithm Y for the remaining rounds. These algorithms are tuned as follows: algorithm X has its stepsize tuned over Eq. 367, and algorithm Y independently has its stepsize tuned over Eq. 367. We tune these separately because local-update algorithms and centralized algorithms have stepsizes that depend on differently when is large (see, for example, Thm. D.1 and Thm. E.1). The baselines have a single stepsize tuned over Eq. 367. All methods are tuned for the lowest final gradient norm, averaged over 1000 runs.
Overall, we see that in the large regime, algorithms that start with a local-update algorithm and finish with an accelerated centralized algorithm have the best communication complexity. This is because larger decreases the error due to variance, as seen for example in Thm. F.2. Acceleration increases instability (unless one performs a carefully calibrated stepsize decay, which we do not in these experiments), so decreasing variance helps accelerated algorithms the most. Furthermore, a single round of suffices to see a large improvement, as seen in Fig. 3. This is expected, because both the “bias” term (the term in Thm. E.1) and the variance term become negligible, leaving just the heterogeneity term as desired.
[Figure 3]
J.2 Including Stepsize Decay Baselines
In this section, we compare the performance of the algorithms against stepsize-decaying SGD, FedAvg, ASG, and SCAFFOLD. We use the same setting as § I.1 for the decay process and the stepsize. In this case, we still see that the chained methods outperform their non-chained counterparts.