
Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning

Haibo Yang,   Minghong Fang,   and Jia Liu
Department of Electrical and Computer Engineering
The Ohio State University
Columbus, OH 43210 USA
{yang.5952, fang.841, liu.1736}@osu.edu
Abstract

Federated learning (FL) is a distributed machine learning architecture that leverages a large number of workers to jointly learn a model with decentralized data. FL has received increasing attention in recent years thanks to its data privacy protection, communication efficiency and a linear speedup for convergence in training (i.e., convergence performance increases linearly with respect to the number of workers). However, existing studies on linear speedup for convergence are limited to the assumptions of i.i.d. datasets across workers and/or full worker participation, both of which rarely hold in practice. So far, it remains an open question whether or not the linear speedup for convergence is achievable under non-i.i.d. datasets with partial worker participation in FL. In this paper, we show that the answer is affirmative. Specifically, we show that the federated averaging (FedAvg) algorithm (with two-sided learning rates) on non-i.i.d. datasets in non-convex settings achieves a convergence rate $\mathcal{O}(\frac{1}{\sqrt{mKT}}+\frac{1}{T})$ for full worker participation and a convergence rate $\mathcal{O}(\frac{\sqrt{K}}{\sqrt{nT}}+\frac{1}{T})$ for partial worker participation, where $K$ is the number of local steps, $T$ is the number of total communication rounds, $m$ is the total number of workers, and $n$ is the number of workers in one communication round under partial worker participation. Our results also reveal that the local steps in FL could help the convergence and show that the maximum number of local steps can be improved to $T/m$ under full worker participation. We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results.

1 Introduction

Federated Learning (FL) is a distributed machine learning paradigm that leverages a large number of workers to collaboratively learn a model with decentralized data under the coordination of a centralized server. Formally, the goal of FL is to solve an optimization problem, which can be decomposed as:

$\min_{x\in\mathbb{R}^{d}}f(x):=\frac{1}{m}\sum_{i=1}^{m}F_{i}(x),$

where $F_{i}(x)\triangleq\mathbb{E}_{\xi_{i}\sim D_{i}}[F_{i}(x,\xi_{i})]$ is the local (non-convex) loss function associated with a local data distribution $D_{i}$ and $m$ is the number of workers. FL allows a large number of workers (such as edge devices) to participate flexibly without sharing data, which helps protect data privacy. However, it also introduces two unique challenges unseen in traditional distributed learning algorithms that are used typically for large data centers:

  • Non-independent-identically-distributed (non-i.i.d.) datasets across workers (data heterogeneity): In conventional distributed learning in data centers, the distribution for each worker's local dataset can usually be assumed to be i.i.d., i.e., $D_{i}=D,\forall i\in\{1,\ldots,m\}$. Unfortunately, this assumption rarely holds for FL since data are generated locally at the workers based on their circumstances, i.e., $D_{i}\neq D_{j}$ for $i\neq j$. It will be seen later that the non-i.i.d. assumption imposes significant challenges on algorithm design for FL and the associated performance analysis.

  • Time-varying partial worker participation (systems non-stationarity): With the flexibility for workers’ participation in many scenarios (particularly in mobile edge computing), workers may randomly join or leave the FL system at will, thus rendering the active worker set stochastic and time-varying across communication rounds. Hence, it is often infeasible to wait for all workers’ responses as in traditional distributed learning, since inactive workers or stragglers will significantly slow down the whole training process. As a result, only a subset of the workers may be chosen by the server in each communication round, i.e., partial worker participation.

In recent years, the Federated Averaging method (FedAvg) and its variants (McMahan et al., 2016; Li et al., 2018; Hsu et al., 2019; Karimireddy et al., 2019; Wang et al., 2019a) have emerged as a prevailing approach for FL. Similar to traditional distributed learning, FedAvg leverages local computation at each worker and employs a centralized parameter server to aggregate and update the model parameters. The unique feature of FedAvg is that each worker runs multiple local stochastic gradient descent (SGD) steps rather than just one step as in traditional distributed learning between two consecutive communication rounds. For i.i.d. datasets and the full worker participation setting, Stich (2018) and Yu et al. (2019b) proposed two variants of FedAvg that achieve a convergence rate of $\mathcal{O}(\frac{mK}{T}+\frac{1}{\sqrt{mKT}})$ with a bounded gradient assumption for both strongly convex and non-convex problems, where $m$ is the number of workers, $K$ is the number of local update steps, and $T$ is the number of total communication rounds. Wang & Joshi (2018) and Stich & Karimireddy (2019) further proposed improved FedAvg algorithms to achieve an $\mathcal{O}(\frac{m}{T}+\frac{1}{\sqrt{mKT}})$ convergence rate without the bounded gradient assumption. Notably, for a sufficiently large $T$, the above rates become $\mathcal{O}(\frac{1}{\sqrt{mKT}})$, which matches the convergence rate order of parallel SGD in conventional distributed learning and implies a linear speedup with respect to the number of workers: to attain $\epsilon$ accuracy, an algorithm with convergence rate $\mathcal{O}(\frac{1}{\sqrt{T}})$ needs $\mathcal{O}(\frac{1}{\epsilon^{2}})$ steps, whereas an algorithm with convergence rate $\mathcal{O}(\frac{1}{\sqrt{mT}})$ needs only $\mathcal{O}(\frac{1}{m\epsilon^{2}})$ steps (with the same hidden constant in the Big-O notation). This linear speedup is highly desirable for an FL algorithm because the algorithm is able to effectively leverage the massive parallelism in a large FL system. However, with non-i.i.d. datasets and partial worker participation in FL, a fundamental open question arises: Can we still achieve the same linear speedup for convergence, i.e., $\mathcal{O}(\frac{1}{\sqrt{mKT}})$, with non-i.i.d. datasets and under either full or partial worker participation?
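
For concreteness, the step-count arithmetic behind this speedup claim can be written out explicitly (a short restatement of the argument above, where $C$ denotes the hidden constant in the Big-O notation):

```latex
\frac{C}{\sqrt{mKT}} \leq \epsilon
\quad\Longleftrightarrow\quad
T \geq \frac{C^{2}}{mK\epsilon^{2}} .
```

Hence, with $C$ fixed, doubling the number of workers $m$ (or the number of local steps $K$) halves the number of communication rounds $T$ needed to reach accuracy $\epsilon$, which is exactly the linear speedup discussed above.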

In this paper, we show the answer to the above question is affirmative. Specifically, we show that a generalized FedAvg with two-sided learning rates achieves linear convergence speedup with non-i.i.d. datasets and under full/partial worker participation. We highlight our contributions as follows:

  • For non-convex problems, we show that the convergence rates of the FedAvg algorithm on non-i.i.d. datasets are $\mathcal{O}(\frac{1}{\sqrt{mKT}}+\frac{1}{T})$ and $\mathcal{O}(\frac{\sqrt{K}}{\sqrt{nT}}+\frac{1}{T})$ for full and partial worker participation, respectively, where $n$ is the size of the partially participating worker set. This indicates that our proposed algorithm achieves a linear speedup in convergence rate for a sufficiently large $T$. When reduced to the i.i.d. case, our convergence rate is $\mathcal{O}(\frac{1}{TK}+\frac{1}{\sqrt{mKT}})$, which is also better than previous works. We summarize the convergence rate comparisons for both i.i.d. and non-i.i.d. cases in Table 1. It is worth noting that our proof does not require the bounded gradient assumption. We note that the SCAFFOLD algorithm (Karimireddy et al., 2019) also achieves the linear speedup, but extra variance reduction operations are required, which lead to higher communication costs and implementation complexity. By contrast, we do not have such extra requirements in this paper.

  • In order to achieve a linear speedup, i.e., a convergence rate $\mathcal{O}(\frac{1}{\sqrt{mKT}})$, we show that the number of local updates $K$ can be as large as $T/m$, which improves the $T^{1/3}/m$ result previously shown in Yu et al. (2019a) and Karimireddy et al. (2019). As shown later in the communication complexity comparison in Table 1, a larger number of local steps implies relatively fewer communication rounds, thus less communication overhead. Interestingly, our results also indicate that the number of local updates $K$ does not hurt but rather helps the convergence with a proper learning rate choice under full worker participation. This overcomes the limitation suggested in Li et al. (2019b) that local SGD steps might slow down the convergence ($\mathcal{O}(\frac{K}{T})$ for the strongly convex case). This result also reveals new insights on the relationship between the number of local steps and the learning rate.

Notation. In this paper, we let $m$ be the total number of workers and $S_{t}$ be the set of active workers in the $t$-th communication round, with size $|S_{t}|=n$ for some $n\in(0,m]$. (For simplicity and ease of presentation, we let $|S_{t}|=n$; this is not a restrictive condition, and our proofs and results still hold for $|S_{t}|\geq n$, which can be easily satisfied in practice.) We use $K$ to denote the number of local steps per communication round at each worker. We let $T$ be the number of total communication rounds. In addition, we use boldface to denote matrices/vectors. We let $[\cdot]_{t,k}^{i}$ represent the parameter of the $k$-th local step at the $i$-th worker after the $t$-th communication round. We use $\|\cdot\|_{2}$ to denote the $\ell^{2}$-norm. For a natural number $m$, we use $[m]$ to represent the set $\{1,\cdots,m\}$.

The rest of the paper is organized as follows. In Section 2, we review the literature to put our work in comparative perspectives. Section 3 presents the convergence analysis for our proposed algorithm. Section 4 discusses the implication of the convergence rate analysis. Section 5 presents numerical results and Section 6 concludes this paper. Due to space limitation, the details of all proofs and some experiments are provided in the supplementary material.

2 Related work

Table 1: Convergence rates of optimization methods for FL.

Dataset | Algorithm [6] | Convexity [7] | Partial Worker | Convergence Rate | Communication Complexity
IID | Stich1 | SC | × | $\mathcal{O}(\frac{mK}{T}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{mK}{\epsilon}+\frac{1}{mK\epsilon^{2}})$
IID | Yu1 | NC | × | $\mathcal{O}(\frac{mK}{T}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{mK}{\epsilon}+\frac{1}{mK\epsilon^{2}})$
IID | Wang | NC | × | $\mathcal{O}(\frac{m}{T}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{m}{\epsilon}+\frac{1}{mK\epsilon^{2}})$
IID | Stich2 | NC | × | $\mathcal{O}(\frac{m}{T}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{m}{\epsilon}+\frac{1}{mK\epsilon^{2}})$
IID | This paper | NC | ✓ | $\mathcal{O}(\frac{1}{TK}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{1}{K\epsilon}+\frac{1}{mK\epsilon^{2}})$
NON-IID | Khaled [1] | C | × | $\mathcal{O}(\frac{m}{T}+\frac{1}{\sqrt{mT}})$ | $\mathcal{O}(\frac{m}{\epsilon}+\frac{1}{mK\epsilon^{2}})$
NON-IID | Yu2 [2] | NC | × | $\mathcal{O}(\frac{m}{TK}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{m}{K\epsilon}+\frac{1}{mK\epsilon^{2}})$
NON-IID | Li | SC | ✓ | $\mathcal{O}(\frac{K}{T})$ | $\mathcal{O}(\frac{K}{\epsilon})$
NON-IID | Karimireddy [3] | NC | ✓ | $\mathcal{O}(\frac{1}{T^{2/3}}+\frac{M}{\sqrt{SKT}})$ | $\mathcal{O}(\frac{1}{\epsilon^{3/2}}+\frac{M}{SK\epsilon^{2}})$
NON-IID | Karimireddy [4] | NC | ✓ | $\mathcal{O}(\frac{1}{T}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{1}{\epsilon}+\frac{1}{mK\epsilon^{2}})$
NON-IID | This paper [5] | NC | ✓ | $\mathcal{O}(\frac{1}{T}+\frac{1}{\sqrt{mKT}})$ | $\mathcal{O}(\frac{1}{\epsilon}+\frac{1}{mK\epsilon^{2}})$

  • [1] Full gradients are used for each worker.

  • [2] Local momentum is used at each worker.

  • [3] A FedAvg algorithm with two-sided learning rates, where $M^{2}=\mathcal{O}(1)+\mathcal{O}(KS(1-\frac{S}{m}))$ and $S=m$ ($S=n$) for full (partial) worker participation.

  • [4] The SCAFFOLD algorithm in Karimireddy et al. (2019) for the non-convex case.

  • [5] The convergence rate becomes $\mathcal{O}(\frac{1}{T}+\frac{\sqrt{K}}{\sqrt{nT}})$ under partial worker participation.

  • [6] Shorthand notation for references: Stich1 := Stich (2018), Yu1 := Yu et al. (2019b), Wang := Wang & Joshi (2018), Stich2 := Stich & Karimireddy (2019), Khaled := Khaled et al. (2019b), Yu2 := Yu et al. (2019a), Li := Li et al. (2019b), and Karimireddy := Karimireddy et al. (2019).

  • [7] Shorthand notation for convexity: SC: Strongly Convex, C: Convex, and NC: Non-Convex.

The federated averaging (FedAvg) algorithm was first proposed by McMahan et al. (2016) for FL as a heuristic to improve communication efficiency and data privacy. Since then, this work has sparked many follow-ups that focus on FL with i.i.d. datasets and full worker participation (also known as LocalSGD (Stich, 2018; Yu et al., 2019b; Wang & Joshi, 2018; Stich & Karimireddy, 2019; Lin et al., 2018; Khaled et al., 2019a; Zhou & Cong, 2017)). Under these two assumptions, most of the theoretical works can achieve a linear speedup for convergence, i.e., $\mathcal{O}(\frac{1}{\sqrt{mKT}})$ for a sufficiently large $T$, matching the rate of parallel SGD. In addition, LocalSGD is empirically shown to be communication-efficient and enjoys better generalization performance (Lin et al., 2018). For a comprehensive introduction to FL, we refer readers to Li et al. (2019a) and Kairouz et al. (2019).

For non-i.i.d. datasets, many works (Sattler et al., 2019; Zhao et al., 2018; Li et al., 2018; Wang et al., 2019a; Karimireddy et al., 2019; Huang et al., 2018; Jeong et al., 2018) heuristically demonstrated the performance of FedAvg and its variants. On convergence rates with full worker participation, many works (Stich et al., 2018; Yu et al., 2019a; Wang & Joshi, 2018; Karimireddy et al., 2019; Reddi et al., 2020) can achieve linear speedup, but their convergence rate bounds could be improved as shown in this paper. On convergence rates with partial worker participation, Li et al. (2019b) showed that the original FedAvg can achieve $\mathcal{O}(K/T)$ for strongly convex functions, which suggests that local SGD steps slow down the convergence in the original FedAvg. Karimireddy et al. (2019) analyzed a generalized FedAvg with two-sided learning rates under strongly convex, convex and non-convex cases. However, as shown in Table 1, none of them indicates that linear speedup is achievable with non-i.i.d. datasets under partial worker participation. Note that the SCAFFOLD algorithm (Karimireddy et al., 2019) can achieve linear speedup but extra variance reduction operations are required, which lead to higher communication costs and implementation complexity. In this paper, we show that this linear speedup can be achieved without any extra requirements. For more detailed comparisons and other algorithmic variants in FL and decentralized settings, we refer readers to Kairouz et al. (2019).

3 Linear Speedup of the Generalized FedAvg with Two-Sided Learning Rates for Non-IID Datasets

In this paper, we consider a FedAvg algorithm with two-sided learning rates as shown in Algorithm 1, which is generalized from previous works (Karimireddy et al., 2019; Reddi et al., 2020). Here, each worker performs multiple SGD steps using a worker optimizer to minimize the local loss on its own dataset, while the server aggregates and updates the global model using another gradient-based server optimizer based on the returned parameters. Specifically, between two consecutive communication rounds, each worker performs $K$ SGD steps with the worker's local learning rate $\eta_{L}$. We assume an unbiased gradient estimator in each step, denoted by $\mathbf{g}_{t,k}^{i}=\nabla F_{i}(\mathbf{x}_{t,k}^{i},\xi_{t,k}^{i})$, where $\xi_{t,k}^{i}$ is a random local data sample for the $k$-th step after the $t$-th communication round at worker $i$. Then, each worker sends the accumulated parameter difference $\Delta_{t}^{i}$ to the server. On the server side, the server aggregates all available $\Delta_{t}^{i}$-values and updates the model parameters with a global learning rate $\eta$. The FedAvg algorithm with two-sided learning rates provides a natural way to decouple the learning of the workers and the server, thus allowing different learning rate schedules for workers and the server. The original FedAvg can be viewed as a special case of this framework with the server-side learning rate being one.

Algorithm 1 A Generalized FedAvg Algorithm with Two-Sided Learning Rates.
  Initialize $\mathbf{x}_{0}$
  for $t=0,\cdots,T-1$ do
     The server samples a subset $S_{t}$ of workers with $|S_{t}|=n$.
     for each worker $i\in S_{t}$ in parallel do
        $\mathbf{x}_{t,0}^{i}=\mathbf{x}_{t}$
        for $k=0,\cdots,K-1$ do
           Compute an unbiased estimate $\mathbf{g}_{t,k}^{i}=\nabla F_{i}(\mathbf{x}_{t,k}^{i},\xi_{t,k}^{i})$ of $\nabla F_{i}(\mathbf{x}_{t,k}^{i})$.
           Local worker update: $\mathbf{x}_{t,k+1}^{i}=\mathbf{x}_{t,k}^{i}-\eta_{L}\mathbf{g}_{t,k}^{i}$.
        end for
        Let $\Delta_{t}^{i}=\mathbf{x}_{t,K}^{i}-\mathbf{x}_{t,0}^{i}=-\eta_{L}\sum_{k=0}^{K-1}\mathbf{g}_{t,k}^{i}$. Send $\Delta_{t}^{i}$ to the server.
     end for
     At Server: Receive $\Delta_{t}^{i},i\in S_{t}$. Let $\Delta_{t}=\frac{1}{n}\sum_{i\in S_{t}}\Delta_{t}^{i}$. Server update: $\mathbf{x}_{t+1}=\mathbf{x}_{t}+\eta\Delta_{t}$. Broadcast $\mathbf{x}_{t+1}$ to the workers.
  end for
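
To make the control flow of Algorithm 1 concrete, the following is a minimal NumPy simulation of one run (an illustrative sketch only, not the implementation used in our experiments; the quadratic local losses, the `local_grad` helper, and all constants are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K, T, d = 100, 10, 5, 50, 20      # workers, sampled workers per round, local steps, rounds, dimension
eta_L, eta = 0.1, 1.0                   # local (worker) and global (server) learning rates

# Hypothetical non-i.i.d. local objectives: F_i(x) = 0.5 * ||x - b_i||^2 with a distinct b_i per worker.
b = rng.normal(size=(m, d))

def local_grad(x, i):
    """Unbiased stochastic gradient g_{t,k}^i of F_i at x (additive noise plays the role of xi_{t,k}^i)."""
    return (x - b[i]) + 0.1 * rng.normal(size=d)

x = np.zeros(d)                                        # global model x_0
for t in range(T):
    S_t = rng.choice(m, size=n, replace=False)         # server samples a subset S_t with |S_t| = n
    deltas = []
    for i in S_t:
        x_local = x.copy()                             # x_{t,0}^i = x_t
        for k in range(K):
            x_local -= eta_L * local_grad(x_local, i)  # local worker update
        deltas.append(x_local - x)                     # Delta_t^i = x_{t,K}^i - x_{t,0}^i
    x = x + eta * np.mean(deltas, axis=0)              # server update with global learning rate eta

print("final distance to the global minimizer:", np.linalg.norm(x - b.mean(axis=0)))
```

Setting `n = m` in this sketch corresponds to full worker participation, and `eta = 1.0` with plain averaging recovers the original FedAvg server step as a special case.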

In what follows, we show that a linear speedup for convergence is achievable by the generalized FedAvg for non-convex functions on non-i.i.d. datasets. We first state our assumptions as follows.

Assumption 1.

($L$-Lipschitz Continuous Gradient) There exists a constant $L>0$ such that $\|\nabla F_{i}(\mathbf{x})-\nabla F_{i}(\mathbf{y})\|\leq L\|\mathbf{x}-\mathbf{y}\|$, $\forall\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$ and $\forall i\in[m]$.

Assumption 2.

(Unbiased Local Gradient Estimator) Let $\xi_{t}^{i}$ be a random local data sample in the $t$-th step at the $i$-th worker. The local gradient estimator is unbiased, i.e., $\mathbb{E}[\nabla F_{i}(\mathbf{x}_{t},\xi_{t}^{i})]=\nabla F_{i}(\mathbf{x}_{t})$, $\forall i\in[m]$, where the expectation is over all local data samples.

Assumption 3.

(Bounded Local and Global Variance) There exist two constants $\sigma_{L}>0$ and $\sigma_{G}>0$ such that the variance of each local gradient estimator is bounded by $\mathbb{E}[\|\nabla F_{i}(\mathbf{x}_{t},\xi_{t}^{i})-\nabla F_{i}(\mathbf{x}_{t})\|^{2}]\leq\sigma_{L}^{2}$, $\forall i\in[m]$, and the global variability of the local gradients of the cost function is bounded by $\|\nabla F_{i}(\mathbf{x}_{t})-\nabla f(\mathbf{x}_{t})\|^{2}\leq\sigma_{G}^{2}$, $\forall i\in[m],\forall t$.

The first two assumptions are standard in non-convex optimization (Ghadimi & Lan, 2013; Bottou et al., 2018). For Assumption 3, the bounded local variance is also a standard assumption. We use a universal bound $\sigma_{G}$ to quantify the heterogeneity of the non-i.i.d. datasets among different workers. In particular, $\sigma_{G}=0$ corresponds to i.i.d. datasets. This assumption is also used in other works for FL under non-i.i.d. datasets (Reddi et al., 2020; Yu et al., 2019b; Wang et al., 2019b) as well as in decentralized optimization (Kairouz et al., 2019). It is worth noting that we do not require a bounded gradient assumption, which is often assumed in FL optimization analysis.
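
As an illustration of how Assumption 3 quantifies data heterogeneity, the following small NumPy sketch (purely synthetic gradients, for intuition only) estimates the local-noise level and the worker-to-worker variability that play the roles of $\sigma_{L}^{2}$ and $\sigma_{G}^{2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, samples = 20, 10, 200

# Hypothetical setup: worker i's true local gradient at x_t is mu_i; its stochastic gradients
# are mu_i plus zero-mean noise. Heterogeneity corresponds to the spread of the mu_i.
mu = rng.normal(scale=2.0, size=(m, d))                               # nabla F_i(x_t), i = 1..m
stoch = mu[:, None, :] + rng.normal(scale=0.5, size=(m, samples, d))  # nabla F_i(x_t, xi_t^i)

global_grad = mu.mean(axis=0)                                         # nabla f(x_t)
sigma_L2 = ((stoch - mu[:, None, :]) ** 2).sum(axis=2).mean()         # local variance estimate
sigma_G2 = ((mu - global_grad) ** 2).sum(axis=1).max()                # worst-case global variability

print(f"estimated sigma_L^2 ~ {sigma_L2:.2f}, sigma_G^2 bound ~ {sigma_G2:.2f}")
```

Setting all the per-worker gradients equal in this sketch drives the second quantity to zero, matching the remark that $\sigma_{G}=0$ corresponds to i.i.d. datasets.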

3.1 Convergence analysis for full worker participation

In this subsection, we first analyze the convergence rate of the generalized FedAvg with two-sided learning rates under full worker participation, for which we have the following result:

Theorem 1.

Let constant local and global learning rates $\eta_{L}$ and $\eta$ be chosen such that $\eta_{L}\leq\frac{1}{8LK}$ and $\eta\eta_{L}\leq\frac{1}{KL}$. Under Assumptions 1–3 and with full worker participation, the sequence of outputs $\{\mathbf{x}_{t}\}$ generated by Algorithm 1 satisfies:

$\min_{t\in[T]}\mathbb{E}[\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}]\leq\frac{f_{0}-f_{*}}{c\eta\eta_{L}KT}+\Phi,$

where $\Phi\triangleq\frac{1}{c}[\frac{L\eta\eta_{L}}{2m}\sigma_{L}^{2}+\frac{5K\eta_{L}^{2}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})]$, $c$ is a constant, $f_{0}\triangleq f(\mathbf{x}_{0})$, $f_{*}\triangleq f(\mathbf{x}_{*})$, and the expectation is over the local dataset samples among workers.

Remark 1.

The convergence bound contains two parts: a vanishing term $\frac{f_{0}-f_{*}}{c\eta\eta_{L}KT}$ that decays as $T$ increases and a constant term $\Phi$ whose size depends on the problem instance parameters and is independent of $T$. The decay rate of the vanishing term matches that of typical SGD methods.

Remark 2.

The first part of $\Phi$ (i.e., $\frac{L\eta\eta_{L}}{2m}\sigma_{L}^{2}$) is due to the local stochastic gradients at each worker, and it shrinks at rate $\frac{1}{m}$ as $m$ increases. The cumulative variance of the $K$ local steps contributes to the second term in $\Phi$ (i.e., $\frac{5K\eta_{L}^{2}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})$), which is independent of $m$ and largely affected by the data heterogeneity. To make the second part small, an inverse relationship between the local learning rate and the number of local steps should be satisfied, i.e., $\eta_{L}=\mathcal{O}(\frac{1}{K})$. Specifically, note that the global and local variances are amplified quadratically and linearly by $K$, respectively. This requires a sufficiently small $\eta_{L}$ to offset the variance accumulated between two successive communication rounds and thus keep the second term in $\Phi$ small. This is consistent with the observation in strongly convex FL that a decaying learning rate is needed for FL to converge under non-i.i.d. datasets even if full gradients are used at each worker (Li et al., 2019b). However, we note that the explicit inverse relationship between $\eta_{L}$ and $K$ given above is new. Intuitively, the $K$ local steps with a sufficiently small $\eta_{L}$ can be viewed as one SGD step with a large learning rate.

With Theorem 1, we immediately have the following convergence rate for the generalized FedAvg algorithm with a proper choice of two-sided learning rates:

Corollary 1.

Let $\eta_{L}=\frac{1}{\sqrt{T}KL}$ and $\eta=\sqrt{Km}$. The convergence rate of the generalized FedAvg algorithm under full worker participation is $\min_{t\in[T]}\mathbb{E}[\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}]=\mathcal{O}\big(\frac{1}{\sqrt{mKT}}+\frac{1}{T}\big)$.
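
The following short calculation is included as a sanity check; it simply substitutes $\eta_{L}=\frac{1}{\sqrt{T}KL}$ and $\eta=\sqrt{Km}$ into the bound of Theorem 1:

```latex
\frac{f_{0}-f_{*}}{c\,\eta\eta_{L}KT}=\frac{(f_{0}-f_{*})\,L}{c\,\sqrt{mKT}},
\qquad
\frac{L\eta\eta_{L}}{2cm}\sigma_{L}^{2}=\frac{\sigma_{L}^{2}}{2c\,\sqrt{mKT}},
\qquad
\frac{5K\eta_{L}^{2}L^{2}}{2c}\bigl(\sigma_{L}^{2}+6K\sigma_{G}^{2}\bigr)=\frac{5\bigl(\sigma_{L}^{2}+6K\sigma_{G}^{2}\bigr)}{2cKT}.
```

The first two terms are $\mathcal{O}(\frac{1}{\sqrt{mKT}})$ and the last term is $\mathcal{O}(\frac{1}{TK}+\frac{1}{T})=\mathcal{O}(\frac{1}{T})$, which together yield the rate stated in Corollary 1.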

Remark 3.

The generalized FedAvg algorithm with two-sided learning rates can achieve a linear speedup for non-i.i.d. datasets, i.e., an $\mathcal{O}(\frac{1}{\sqrt{mKT}})$ convergence rate, as long as $T\geq mK$. Although many works have achieved this convergence rate asymptotically, we improve the maximum number of local steps $K$ to $T/m$, which is significantly better than state-of-the-art bounds such as $T^{1/3}/m$ shown in (Karimireddy et al., 2019; Yu et al., 2019a; Kairouz et al., 2019). Note that a larger number of local steps implies relatively fewer communication rounds, thus less communication overhead. See also the communication complexity comparison in Table 1. For example, when $T=10^{6}$ and $m=100$ (as used in (Kairouz et al., 2019)), the number of local steps in our algorithm can be as large as $K\leq T/m=10^{4}$, whereas $K\leq\frac{T^{1/3}}{m}=1$ means that no extra local steps can be taken to reduce communication costs.

Remark 4.

When degenerated to the i.i.d. case ($\sigma_{G}=0$), the convergence rate becomes $\mathcal{O}(\frac{1}{TK}+\frac{1}{\sqrt{mKT}})$, which has a better first term in the bound compared with previous works as shown in Table 1.

3.2 Convergence analysis for partial worker participation

Partial worker participation in each communication round may be more practical than full worker participation due to many physical limitations of FL in practice (e.g., excessive delays because of too many devices to poll, malfunctioning devices, etc.). Partial worker participation can also accelerate the training by neglecting stragglers. We consider two sampling strategies proposed by Li et al. (2018) and Li et al. (2019b). Let $S_{t}$ be the participating worker index set at communication round $t$ with $|S_{t}|=n$, $\forall t$, for some $n\in(0,m]$. $S_{t}$ is randomly and independently constructed either with replacement (Strategy 1) or without replacement (Strategy 2), sequentially according to the sampling probabilities $p_{i},\forall i\in[m]$. For each member of $S_{t}$, we pick a worker from the entire set $[m]$ uniformly at random with probability $p_{i}=\frac{1}{m},\forall i\in[m]$; that is, the selection likelihood for any worker $i\in S_{t}$ is $p=\frac{n}{m}$. Then we have the following results:
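
For concreteness, a minimal NumPy illustration of the two uniform sampling strategies described above (an illustrative sketch only; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 100, 10   # total workers, participating workers per round

# Strategy 1: uniform sampling WITH replacement (S_t is a multiset; a worker may appear
# multiple times, and its update is then counted multiple times in the server average).
S_t_with = rng.choice(m, size=n, replace=True, p=np.full(m, 1.0 / m))

# Strategy 2: uniform sampling WITHOUT replacement (n distinct workers).
S_t_without = rng.choice(m, size=n, replace=False)

print("Strategy 1 (with replacement):   ", sorted(S_t_with.tolist()))
print("Strategy 2 (without replacement):", sorted(S_t_without.tolist()))
```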

Theorem 2.

Under Assumptions 1–3 with partial worker participation, the sequence of outputs $\{\mathbf{x}_{t}\}$ generated by Algorithm 1 with constant learning rates $\eta$ and $\eta_{L}$ satisfies:

$\min_{t\in[T]}\mathbb{E}[\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}]\leq\frac{f_{0}-f_{*}}{c\eta\eta_{L}KT}+\Phi,$

where $f_{0}=f(\mathbf{x}_{0})$, $f_{*}=f(\mathbf{x}_{*})$, and the expectation is over the local dataset samples among workers.

For sampling Strategy 1, let $\eta$ and $\eta_{L}$ be chosen such that $\eta_{L}\leq\frac{1}{8LK}$, $\eta\eta_{L}KL<\frac{n-1}{n}$ and $30K^{2}\eta_{L}^{2}L^{2}-\frac{L\eta\eta_{L}}{n}(90K^{3}L^{2}\eta_{L}^{2}+3K)<1$. It then holds that:

$\Phi\triangleq\frac{1}{c}\bigg[\frac{L\eta\eta_{L}}{2n}\sigma_{L}^{2}+\frac{3LK\eta\eta_{L}}{2n}\sigma_{G}^{2}+\Big(\frac{5K\eta_{L}^{2}L^{2}}{2}+\frac{15K^{2}\eta\eta_{L}^{3}L^{3}}{2n}\Big)(\sigma_{L}^{2}+6K\sigma_{G}^{2})\bigg].$

For sampling Strategy 2, let $\eta$ and $\eta_{L}$ be chosen such that $\eta_{L}\leq\frac{1}{8LK}$, $\eta\eta_{L}KL\leq\frac{n(m-1)}{m(n-1)}$ and $10K^{2}\eta_{L}^{2}L^{2}-L\eta\eta_{L}\frac{m-n}{n(m-1)}(90K^{3}\eta_{L}^{2}L^{2}+3K)<1$. It then holds that:

$\Phi\triangleq\frac{1}{c}\bigg[\frac{L\eta\eta_{L}}{2n}\sigma_{L}^{2}+3LK\eta\eta_{L}\frac{m-n}{2n(m-1)}\sigma_{G}^{2}+\Big(\frac{5K\eta_{L}^{2}L^{2}}{2}+15K^{2}\eta\eta_{L}^{3}L^{3}\frac{m-n}{2n(m-1)}\Big)(\sigma_{L}^{2}+6K\sigma_{G}^{2})\bigg].$

From Theorem 2, we immediately have the following convergence rate for the generalized FedAvg algorithm with a proper choice of two-sided learning rates:

Corollary 2.

Let $\eta_{L}=\frac{1}{\sqrt{T}KL}$ and $\eta=\sqrt{Kn}$. The convergence rate of the generalized FedAvg algorithm under partial worker participation, for both sampling strategies, is:

$\min_{t\in[T]}\mathbb{E}\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}\leq\mathcal{O}\big(\frac{\sqrt{K}}{\sqrt{nT}}+\frac{1}{T}\big).$
Remark 5.

Compared with full worker participation, the convergence rate bound for partial worker participation has the same structure but a larger variance term. This implies that partial worker participation through uniform sampling does not fundamentally change the convergence (in the order sense), except for an amplified variance due to fewer participating workers and the random sampling. The intuition is that uniform sampling (with or without replacement) for worker selection yields a good approximation of the entire worker distribution in expectation, which reduces the risk of distribution deviation due to partial worker participation. As shown in Section 5, the distribution deviation caused by fewer participating workers could render the training unstable, especially in highly non-i.i.d. cases.

Remark 6.

The generalized FedAvg with partial worker participation under non-i.i.d. datasets can still achieve a linear speedup $\mathcal{O}(\frac{\sqrt{K}}{\sqrt{nT}})$ with proper learning rate settings as shown in Corollary 2. In addition, when degenerated to the i.i.d. case ($\sigma_{G}=0$), the convergence rate becomes $\mathcal{O}(\frac{1}{TK}+\frac{1}{\sqrt{nKT}})$.

Remark 7.

Here, we let $|S_{t}|=n$ only for ease of presentation and better readability. We note that this is not a restrictive condition: $|S_{t}|=n$ can be relaxed to $|S_{t}|\geq n,\forall t\in[T]$, and the same convergence rate still holds. In fact, our full proof in Appendix A.2 is for $|S_{t}|\geq n$.

4 Discussion

In light of the above results, in what follows we discuss several insights from the convergence analysis:

Convergence Rate: We show that the generalized FedAvg algorithm with two-sided learning rates can achieve a linear speedup, i.e., an $\mathcal{O}(\frac{1}{\sqrt{mKT}})$ convergence rate with a proper choice of hyper-parameters. Thus, it works well in large FL systems, where massive parallelism can be leveraged to accelerate training. The key challenge in the convergence analysis stems from the different local loss functions (also called "model drift" in the literature) among workers due to the non-i.i.d. datasets and local steps. As shown above, the convergence bound for the generalized FedAvg method contains a vanishing term and a constant term, similar to the bound for SGD. In contrast, however, the constant term in SGD is only due to the local variance. Note that, similar to SGD, running more iterations does not diminish the constant term. The local variance $\sigma_{L}^{2}$ (randomness of stochastic gradients), the global variability $\sigma_{G}^{2}$ (non-i.i.d. datasets), and the number of local steps $K$ (amplification factor) all contribute to the constant term, but the total global variability over the $K$ local steps dominates the term. When the local learning rate $\eta_{L}$ is set in an inverse relationship to the number of local steps $K$, the constant term becomes controllable. An intuitive explanation is that the $K$ small local steps can be approximately viewed as one large step in conventional SGD. So this speedup and the larger number of allowed local steps can be largely attributed to the two-sided learning rate setting.

Number of Local Steps: Besides the result that the maximum number of local steps is improved to $K\leq T/m$, we also show that the local steps could help the convergence with proper hyper-parameter choices, which supports previous numerical results (McMahan et al., 2016; Stich, 2018; Lin et al., 2018) and is verified on different models with datasets of different non-i.i.d. degrees in Section 5. However, there are other results showing that local steps slow down the convergence (Li et al., 2019b). We believe that whether local steps help or hurt the convergence in FL warrants further investigation.

Number of Workers: We show that the convergence rate improves substantially as the number of workers in each communication round increases. This is consistent with the results for i.i.d. cases in Stich (2018). For i.i.d. datasets, more workers means more data samples, hence less variance and better performance. For non-i.i.d. datasets, having more workers implies that the distribution of the sampled workers is a better approximation of the distribution of all workers. This is also empirically observed in Section 5. On the other hand, the sampling strategy plays an important role in the non-i.i.d. case as well. Here, we adopt uniform sampling (with/without replacement) to enlist workers to participate in FL. Intuitively, the distribution of the sampled workers' collective datasets under uniform sampling yields a good approximation of the overall data distribution in expectation.

Note that, in this paper, we assume that every worker is available to participate once enlisted. However, this may not always be feasible. In practice, the workers need to be in certain states in order to be able to participate in FL (e.g., charging or idle states, etc. (Eichner et al., 2019)). Therefore, care must be taken in sampling and enlisting workers in practice. We believe that the joint design of sampling schemes and the generalized FedAvg algorithm will have a significant impact on the convergence, which needs further investigation.

5 Numerical Results

We perform extensive experiments to verify our theoretical results. We use three models: logistic regression (LR), a fully-connected neural network with two hidden layers (2NN), and a convolutional neural network (CNN) with a non-i.i.d. version of MNIST (LeCun et al., 1998), as well as one ResNet model with CIFAR-10 (Krizhevsky et al., 2009). Due to space limitations, we relegate some experimental results to the supplementary material.

[Figure 1: Training loss (top) and test accuracy (bottom) for the 2NN model with hyper-parameter settings: local learning rate 0.1, global learning rate 1.0. (a) Impact of non-i.i.d. datasets (worker number 100, local steps 5 epochs); (b) impact of worker number (local steps 5 epochs); (c) impact of local steps (5 digits in each worker's dataset).]

In this section, we elaborate on the results for the 2NN under non-i.i.d. MNIST datasets. We distribute the MNIST dataset among $m=100$ workers randomly and evenly in a digit-based manner such that the local dataset of each worker contains only a certain number of digit classes. The number of digits in each worker's dataset represents the non-i.i.d. degree. For $digits\_10$, each worker has training/testing samples with all ten digits from 0 to 9, which is essentially an i.i.d. case. For $digits\_1$, each worker has samples associated with only one digit, which leads to highly non-i.i.d. datasets among workers. For partial worker participation, we set the number of workers to $n=10$ in each communication round.
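
A minimal sketch of this digit-based non-i.i.d. partition is given below (an illustration of the splitting logic only; the helper name and the use of synthetic labels in place of the actual MNIST loader are hypothetical choices of ours):

```python
import numpy as np

def digit_based_partition(labels, num_workers=100, digits_per_worker=5, seed=0):
    """Split sample indices so that each worker holds samples from `digits_per_worker` digit classes."""
    rng = np.random.default_rng(seed)
    # Assign digit classes to workers cyclically, e.g. worker 0 -> {0,...,4}, worker 1 -> {5,...,9}, ...
    worker_digits = [[(w * digits_per_worker + j) % 10 for j in range(digits_per_worker)]
                     for w in range(num_workers)]
    holders = {d: [w for w in range(num_workers) if d in worker_digits[w]] for d in range(10)}
    worker_idx = [[] for _ in range(num_workers)]
    # For each digit, split its (shuffled) sample indices evenly among the workers holding that digit.
    for d in range(10):
        idx = rng.permutation(np.where(labels == d)[0])
        for holder, chunk in zip(holders[d], np.array_split(idx, len(holders[d]))):
            worker_idx[holder].append(chunk)
    return [np.concatenate(chunks) for chunks in worker_idx]

# Example with synthetic labels standing in for the 60,000 MNIST training labels.
fake_labels = np.random.default_rng(1).integers(0, 10, size=60000)
partition = digit_based_partition(fake_labels, num_workers=100, digits_per_worker=5)
print(len(partition), "workers;", len(partition[0]), "samples on worker 0;",
      "digits on worker 0:", np.unique(fake_labels[partition[0]]))
```

Setting digits_per_worker to 10 recovers the (essentially i.i.d.) $digits\_10$ case, while setting it to 1 gives the highly non-i.i.d. $digits\_1$ case.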

Impact of non-i.i.d. datasets: As shown in Figure 1(a) for the 2NN model, the top-row figures show training loss versus communication round and the bottom-row figures show test accuracy versus communication round. We can see that the generalized FedAvg algorithm converges under non-i.i.d. datasets with a proper learning rate choice in both cases. For five digits ($digits\_5$) in each worker's dataset with full (partial) worker participation in Figure 1(a), the generalized FedAvg algorithm achieves a convergence speed comparable to that of the i.i.d. case ($digits\_10$). Another key observation is that non-i.i.d. datasets slow down the convergence under the same learning rate settings in both cases. The higher the non-i.i.d. degree, the slower the convergence speed. As the non-i.i.d. degree increases (from case $digits\_10$ to case $digits\_1$), it is obvious that the training loss increases and the test accuracy decreases. This trend is more obvious from the zigzagging curves for partial worker participation. These two observations can also be verified for other models as shown in the supplementary material, which confirms our theoretical analysis.

Impact of worker number: As shown in Figure 1(b), we compare the training loss and test accuracy between full worker participation ($n=100$) and partial worker participation ($n=10$) with the same hyper-parameters. Compared with full worker participation, partial worker participation introduces another source of randomness, which leads to zigzagging convergence curves and slower convergence. This problem is more prominent for highly non-i.i.d. datasets. Full worker participation can neutralize the system heterogeneity in each communication round, whereas partial worker participation might not be able to neutralize the gaps among different workers. That is, the distribution of the active workers' datasets does not approximate the overall distribution well. Specifically, it is not unlikely that the digits in the datasets of all active workers cover only a proper subset of the total 10 digits in the original MNIST dataset, especially with highly non-i.i.d. datasets. This trend is also obvious for complex models and complicated datasets as shown in the supplementary material. The sampling strategy here is random sampling with equal probability without replacement. In practice, however, the actual sampling of workers in FL could be more complex, which requires further investigation.

Impact of local steps: One open question in FL is whether the local steps help the convergence or not. In Figure 1(c), we show that the local steps could help the convergence for both full and partial worker participation. These results verify our theoretical analysis. However, Li et al. (2019b) showed that the local steps may hurt the convergence, which was demonstrated under unbalanced non-i.i.d. MNIST datasets. We believe that this may be due to the combined effect of unbalanced datasets and local steps rather than the use of local steps alone.

Table 2: Comparison with SCAFFOLD.
Dataset | IID or Non-IID | Workers selected | Model | SCAFFOLD: rounds / comm. cost (MB) / wall-clock time (s) [1] | This paper: rounds / comm. cost (MB) / wall-clock time (s) [1]
MNIST | IID | n = 10 | Logistic | 3 / 0.36 / 0.32 | 3 / 0.18 / 0.22
MNIST | IID | n = 10 | 2NN | 3 / 9.12 / 0.88 | 3 / 4.56 / 0.56
MNIST | IID | n = 10 | CNN | 3 / 26.64 / 2.23 | 3 / 13.32 / 1.57
MNIST | IID | n = 100 | Logistic | 5 / 0.60 / 0.53 | 5 / 0.30 / 0.42
MNIST | IID | n = 100 | 2NN | 5 / 15.20 / 1.51 | 8 / 12.16 / 1.49
MNIST | IID | n = 100 | CNN | 1 / 8.88 / 0.79 | 1 / 4.44 / 0.50
MNIST | Non-IID | n = 10 | Logistic | 14 / 1.68 / 1.48 | 14 / 0.84 / 1.16
MNIST | Non-IID | n = 10 | 2NN | 14 / 42.55 / 4.23 | 14 / 21.28 / 2.46
MNIST | Non-IID | n = 10 | CNN | 14 / 124.34 / 11.12 | 10 / 44.41 / 4.92
MNIST | Non-IID | n = 100 | Logistic | 7 / 0.84 / 0.72 | 11 / 0.66 / 0.91
MNIST | Non-IID | n = 100 | 2NN | 7 / 21.28 / 2.11 | 17 / 25.84 / 3.16
MNIST | Non-IID | n = 100 | CNN | 17 / 150.98 / 13.50 | 7 / 31.08 / 3.51
CIFAR-10 | IID | n = 10 | ResNet18 | 56 / 9548.07 / 583.24 | 44 / 3751.03 / 256.63
CIFAR-10 | Non-IID | n = 10 | ResNet18 | 52 / 8866.06 / 539.50 | 61 / 5200.29 / 358.22

  • [1] Bandwidth = 20 MB/s.

Comparison with SCAFFOLD: Lastly, we compare with the SCAFFOLD algorithm (Karimireddy et al., 2019) since it also achieves the same linear speedup effect under non-i.i.d. datasets. We compare communication rounds, total communication load, and estimated wall-clock time under the same settings to achieve a certain test accuracy, and the results are reported in Table 2. The non-i.i.d. dataset is $digits\_2$ and the i.i.d. dataset is $digits\_10$. The learning rates are $\eta_{L}=0.1,\eta=1.0$, and the number of local steps $K$ is 5 epochs. We set the target accuracy to $\epsilon=95\%$ for MNIST and $\epsilon=75\%$ for CIFAR-10. Note that the total training time contains two parts: i) the computation time for training the local model at each worker and ii) the communication time for information exchanges between the workers and the server. We assume a bandwidth of 20 MB/s for both uplink and downlink connections. For the MNIST datasets, we can see that our algorithm is similar to or outperforms SCAFFOLD. This is because the numbers of communication rounds of both algorithms are relatively small for such simple tasks. For non-i.i.d. CIFAR-10, the SCAFFOLD algorithm takes slightly fewer communication rounds than our FedAvg algorithm to achieve $\epsilon=75\%$ thanks to its variance reduction. However, it takes more than 1.5 times the communication cost and wall-clock time compared to those of our FedAvg algorithm. Due to space limitations, we relegate the results of the time proportions for computation and communication to Appendix B (see Figure 7).
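
The communication-time portion of the wall-clock estimates in Table 2 can be reproduced with a back-of-the-envelope calculation such as the one below (a hedged sketch: the per-round payload is inferred from the table itself, and the 2x factor reflects that SCAFFOLD additionally transmits control variates, which is consistent with its per-round communication cost in Table 2 being roughly double ours):

```python
# Estimated communication time from communication load and bandwidth (20 MB/s, as assumed in the text).
BANDWIDTH_MB_S = 20.0

def comm_time(rounds, payload_per_round_mb, doubles_payload=False):
    """Total communication time in seconds over `rounds` communication rounds."""
    per_round = payload_per_round_mb * (2.0 if doubles_payload else 1.0)
    return rounds * per_round / BANDWIDTH_MB_S

# Example: the non-i.i.d. CIFAR-10 row of Table 2. The per-round payload of the ResNet18 model
# is inferred as 5200.29 MB / 61 rounds (~85.3 MB per round) for our algorithm.
ours = comm_time(rounds=61, payload_per_round_mb=5200.29 / 61)
scaffold = comm_time(rounds=52, payload_per_round_mb=5200.29 / 61, doubles_payload=True)
print(f"communication time -- ours: {ours:.1f}s, SCAFFOLD: {scaffold:.1f}s")
```

The remaining gap to the wall-clock times reported in Table 2 corresponds to the local computation part of the total training time described above.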

6 Conclusions and future work

In this paper, we analyzed the convergence of a generalized FedAvg algorithm with two-sided learning rates on non-i.i.d. datasets for general non-convex optimization. We proved that the generalized FedAvg algorithm achieves a linear speedup for convergence under full and partial worker participation. We showed that the local steps in FL could help the convergence and we improved the maximum number of local steps to $T/m$. While our work sheds light on the theoretical understanding of FL, it also opens the door to many new interesting questions in FL, such as how to sample optimally in partial worker participation, and how to deal with active participant sets that are both time-varying and size-varying across communication rounds. We hope that the insights and proof techniques in this paper can pave the way for many new research directions in the aforementioned areas.

Acknowledgements

This work is supported in part by NSF grants CAREER CNS-1943226, CIF-2110252, ECCS-1818791, CCF-1934884, ONR grant ONR N00014-17-1-2417, and a Google Faculty Research Award.

References

  • Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311, 2018.
  • Eichner et al. (2019) Hubert Eichner, Tomer Koren, H Brendan McMahan, Nathan Srebro, and Kunal Talwar. Semi-cyclic stochastic gradient descent. arXiv preprint arXiv:1904.10120, 2019.
  • Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • Huang et al. (2018) Li Huang, Yifeng Yin, Zeng Fu, Shifa Zhang, Hao Deng, and Dianbo Liu. Loadaboost: Loss-based adaboost federated machine learning on medical data. arXiv preprint arXiv:1811.12629, 2018.
  • Jeong et al. (2018) Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018.
  • Kairouz et al. (2019) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • Karimireddy et al. (2019) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378, 2019.
  • Khaled et al. (2019a) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Better communication complexity for local sgd. arXiv preprint arXiv:1909.04746, 2019a.
  • Khaled et al. (2019b) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. First analysis of local gd on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019b.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2018) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
  • Li et al. (2019a) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019a.
  • Li et al. (2019b) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019b.
  • Lin et al. (2018) Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.
  • McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
  • Reddi et al. (2020) Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konecny, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • Sattler et al. (2019) Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems, 2019.
  • Stich (2018) Sebastian U Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
  • Stich & Karimireddy (2019) Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for sgd with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
  • Stich et al. (2018) Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.
  • Wang & Joshi (2018) Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
  • Wang et al. (2019a) Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. Slowmo: Improving communication-efficient distributed sgd with slow momentum. arXiv preprint arXiv:1910.00643, 2019a.
  • Wang et al. (2019b) Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6):1205–1221, 2019b.
  • Yu et al. (2019a) Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. arXiv preprint arXiv:1905.03817, 2019a.
  • Yu et al. (2019b) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  5693–5700, 2019b.
  • Zhao et al. (2018) Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
  • Zhou & Cong (2017) Fan Zhou and Guojing Cong. On the convergence properties of a kk-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.

Appendix A Appendix I: Proofs

In this section, we give the proofs in detail for full and partial worker participation in Section A.1 and Section A.2, respectively.

A.1 Proof of Theorem 1

Theorem 1 (Restated).

Proof.

For convenience, we define $\bar{\Delta}_{t}\triangleq\frac{1}{m}\sum_{i=1}^{m}\Delta_{t}^{i}$. Under full device participation (i.e., $S_{t}=[m]$), it is clear that $\Delta_{t}=\frac{1}{m}\sum_{i=1}^{m}\Delta_{t}^{i}=\bar{\Delta}_{t}$.

Due to the smoothness in Assumption 1, taking the expectation of $f(\mathbf{x}_{t+1})$ over the randomness at communication round $t$, we have:

$\mathbb{E}_{t}[f(\mathbf{x}_{t+1})]\leq f(\mathbf{x}_{t})+\big\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\mathbf{x}_{t+1}-\mathbf{x}_{t}]\big\rangle+\frac{L}{2}\mathbb{E}_{t}[\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}]$
$=f(\mathbf{x}_{t})+\big\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\eta\bar{\Delta}_{t}+\eta\eta_{L}K\nabla f(\mathbf{x}_{t})-\eta\eta_{L}K\nabla f(\mathbf{x}_{t})]\big\rangle+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[\|\bar{\Delta}_{t}\|^{2}]$
$=f(\mathbf{x}_{t})-\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\eta\underbrace{\big\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\bar{\Delta}_{t}+\eta_{L}K\nabla f(\mathbf{x}_{t})]\big\rangle}_{A_{1}}+\frac{L}{2}\eta^{2}\underbrace{\mathbb{E}_{t}[\|\bar{\Delta}_{t}\|^{2}]}_{A_{2}}. \quad (1)$

Note that the term $A_{1}$ in (1) can be bounded as follows:

$A_{1}=\big\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\bar{\Delta}_{t}+\eta_{L}K\nabla f(\mathbf{x}_{t})]\big\rangle$
$=\Big\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}\Big[-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\eta_{L}\mathbf{g}_{t,k}^{i}+\eta_{L}K\nabla f(\mathbf{x}_{t})\Big]\Big\rangle$
$=\Big\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}\Big[-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\eta_{L}\nabla F_{i}(\mathbf{x}_{t,k}^{i})+\eta_{L}K\frac{1}{m}\sum_{i=1}^{m}\nabla F_{i}(\mathbf{x}_{t})\Big]\Big\rangle$
$=\Big\langle\sqrt{\eta_{L}K}\nabla f(\mathbf{x}_{t}),-\frac{\sqrt{\eta_{L}}}{m\sqrt{K}}\mathbb{E}_{t}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(\nabla F_{i}(\mathbf{x}_{t,k}^{i})-\nabla F_{i}(\mathbf{x}_{t})\big)\Big\rangle$
$\overset{(a1)}{=}\frac{\eta_{L}K}{2}\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{\eta_{L}}{2Km^{2}}\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(\nabla F_{i}(\mathbf{x}_{t,k}^{i})-\nabla F_{i}(\mathbf{x}_{t})\big)\Big\|^{2}-\frac{\eta_{L}}{2Km^{2}}\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\Big\|^{2}$
$\overset{(a2)}{\leq}\frac{\eta_{L}K}{2}\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{\eta_{L}}{2m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}_{t}\|\nabla F_{i}(\mathbf{x}_{t,k}^{i})-\nabla F_{i}(\mathbf{x}_{t})\|^{2}-\frac{\eta_{L}}{2Km^{2}}\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\Big\|^{2}$
$\overset{(a3)}{\leq}\frac{\eta_{L}K}{2}\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{\eta_{L}L^{2}}{2m}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbb{E}_{t}\|\mathbf{x}_{t,k}^{i}-\mathbf{x}_{t}\|^{2}-\frac{\eta_{L}}{2Km^{2}}\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\Big\|^{2}$
$\overset{(a4)}{\leq}\eta_{L}K\big(\tfrac{1}{2}+15K^{2}\eta_{L}^{2}L^{2}\big)\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{5K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})-\frac{\eta_{L}}{2Km^{2}}\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\Big\|^{2}, \quad (2)$

where $(a1)$ follows from the fact that $\langle\mathbf{x},\mathbf{y}\rangle=\frac{1}{2}[\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}-\|\mathbf{x}-\mathbf{y}\|^{2}]$ with $\mathbf{x}=\sqrt{\eta_{L}K}\nabla f(\mathbf{x}_{t})$ and $\mathbf{y}=-\frac{\sqrt{\eta_{L}}}{m\sqrt{K}}\sum_{i=1}^{m}\sum_{k=0}^{K-1}(\nabla F_{i}(\mathbf{x}_{t,k}^{i})-\nabla F_{i}(\mathbf{x}_{t}))$, $(a2)$ is due to $\mathbb{E}[\|x_{1}+\cdots+x_{n}\|^{2}]\leq n\mathbb{E}[\|x_{1}\|^{2}+\cdots+\|x_{n}\|^{2}]$, $(a3)$ is due to Assumption 1, and $(a4)$ follows from Lemma 2.

The term $A_{2}$ in (1) can be bounded as:

$A_{2}=\mathbb{E}_{t}[\|\bar{\Delta}_{t}\|^{2}]$
$=\mathbb{E}_{t}\Big[\Big\|\frac{1}{m}\sum_{i=1}^{m}\Delta_{t}^{i}\Big\|^{2}\Big]$
$\leq\frac{1}{m^{2}}\mathbb{E}_{t}\Big[\Big\|\sum_{i=1}^{m}\Delta_{t}^{i}\Big\|^{2}\Big]$
$=\frac{\eta_{L}^{2}}{m^{2}}\mathbb{E}_{t}\Big[\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\mathbf{g}_{t,k}^{i}\Big\|^{2}\Big]$
$\overset{(a5)}{=}\frac{\eta_{L}^{2}}{m^{2}}\mathbb{E}_{t}\Big[\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\big(\mathbf{g}_{t,k}^{i}-\nabla F_{i}(\mathbf{x}_{t,k}^{i})\big)\Big\|^{2}\Big]+\frac{\eta_{L}^{2}}{m^{2}}\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\Big\|^{2}$
$\overset{(a6)}{\leq}\frac{K\eta_{L}^{2}}{m}\sigma_{L}^{2}+\frac{\eta_{L}^{2}}{m^{2}}\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\Big\|^{2}, \quad (3)$

where $(a5)$ follows from the fact that $\mathbb{E}[\|\mathbf{x}\|^{2}]=\mathbb{E}[\|\mathbf{x}-\mathbb{E}[\mathbf{x}]\|^{2}]+\|\mathbb{E}[\mathbf{x}]\|^{2}$, and $(a6)$ is due to the bounded variance assumption in Assumption 3 and the fact that $\mathbb{E}[\|x_{1}+\cdots+x_{n}\|^{2}]=\mathbb{E}[\|x_{1}\|^{2}+\cdots+\|x_{n}\|^{2}]$ if the $x_{i}$'s are independent with zero mean, together with $\mathbb{E}[\mathbf{g}_{t,j}^{i}]=\nabla F_{i}(\mathbf{x}_{t,j}^{i})$.

Substituting the inequalities in (2) of $A_{1}$ and (3) of $A_{2}$ into inequality (1), we have:

$\mathbb{E}_{t}[f(\mathbf{x}_{t+1})]\leq f(\mathbf{x}_{t})-\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\eta\underbrace{\big\langle\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\bar{\Delta}_{t}+\eta_{L}K\nabla f(\mathbf{x}_{t})]\big\rangle}_{A_{1}}+\frac{L}{2}\eta^{2}\underbrace{\mathbb{E}_{t}[\|\bar{\Delta}_{t}\|^{2}]}_{A_{2}}$
$\leq f(\mathbf{x}_{t})-\eta\eta_{L}K\big(\tfrac{1}{2}-15K^{2}\eta_{L}^{2}L^{2}\big)\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2m}\sigma_{L}^{2}+\frac{5\eta K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})-\Big(\frac{\eta\eta_{L}}{2Km^{2}}-\frac{L\eta^{2}\eta_{L}^{2}}{2m^{2}}\Big)\mathbb{E}_{t}\Big\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\Big\|^{2}$
$\overset{(a7)}{\leq}f(\mathbf{x}_{t})-\eta\eta_{L}K\big(\tfrac{1}{2}-15K^{2}\eta_{L}^{2}L^{2}\big)\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2m}\sigma_{L}^{2}+\frac{5\eta K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})$
$\overset{(a8)}{\leq}f(\mathbf{x}_{t})-c\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2m}\sigma_{L}^{2}+\frac{5\eta K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2}),$

where $(a7)$ follows from $\big(\frac{\eta\eta_{L}}{2Km^{2}}-\frac{L\eta^{2}\eta_{L}^{2}}{2m^{2}}\big)\geq 0$ if $\eta\eta_{L}\leq\frac{1}{KL}$, and $(a8)$ holds because there exists a constant $c>0$ satisfying $(\frac{1}{2}-15K^{2}\eta_{L}^{2}L^{2})>c>0$ if $\eta_{L}<\frac{1}{\sqrt{30}KL}$.

Rearranging and summing from $t=0,\cdots,T-1$, we have:

$\sum_{t=0}^{T-1}c\eta\eta_{L}K\mathbb{E}[\|\nabla f(\mathbf{x}_{t})\|^{2}]\leq f(\mathbf{x}_{0})-f(\mathbf{x}_{T})+T\eta\eta_{L}K\Big[\frac{L\eta\eta_{L}}{2m}\sigma_{L}^{2}+\frac{5K\eta_{L}^{2}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})\Big],$

which implies

$\min_{t\in[T]}\mathbb{E}\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}\leq\frac{f_{0}-f_{*}}{c\eta\eta_{L}KT}+\Phi,$

where $\Phi=\frac{1}{c}\big[\frac{L\eta\eta_{L}}{2m}\sigma_{L}^{2}+\frac{5K\eta_{L}^{2}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})\big]$. This completes the proof. ∎

A.2 Proof of Theorem 2

Theorem 2 (Restated).

Proof.

Let $\bar{\Delta}_{t}$ be defined the same as in the proof of Theorem 1. Under partial device participation, note that $\bar{\Delta}_{t}\neq\Delta_{t}$ (recall that $\bar{\Delta}_{t}\triangleq\frac{1}{m}\sum_{i=1}^{m}\Delta_{t}^{i}$, $\Delta_{t}=\frac{1}{n}\sum_{i\in S_{t}}\Delta_{t}^{i}$, and $|S_{t}|=n$). The randomness for partial worker participation contains two parts: the random sampling and the stochastic gradients. We still use $\mathbb{E}_{t}[\cdot]$ to denote the expectation with respect to both types of randomness.

Due to the smoothness assumption in Assumption 1, taking expectation of f(𝐱t+1)f(\mathbf{x}_{t+1}) over the randomness at communication round t:

𝔼t[f(𝐱t+1)]\displaystyle\mathbb{E}_{t}[f(\mathbf{x}_{t+1})] f(𝐱t)+<f(𝐱t),𝔼t[𝐱t+1𝐱t]>+L2𝔼t[𝐱t+1𝐱t2]\displaystyle\leq f(\mathbf{x}_{t})+\big{<}\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\mathbf{x}_{t+1}-\mathbf{x}_{t}]\big{>}+\frac{L}{2}\mathbb{E}_{t}[\|\mathbf{x}_{t+1}-\mathbf{x}_{t}\|^{2}]
=f(𝐱t)+<f(𝐱t),𝔼t[ηΔt+ηηLKf(𝐱t)ηηLKf(𝐱t)]>+L2η2𝔼t[Δt2]\displaystyle=f(\mathbf{x}_{t})+\big{<}\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\eta\Delta_{t}+\eta\eta_{L}K\nabla f(\mathbf{x}_{t})-\eta\eta_{L}K\nabla f(\mathbf{x}_{t})]\big{>}+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[\|\Delta_{t}\|^{2}]
=f(𝐱t)ηηLKf(𝐱t)2+η<f(𝐱t),𝔼t[Δt+ηLKf(𝐱t)]>A1+L2η2𝔼t[Δt2]A2\displaystyle=f(\mathbf{x}_{t})\!-\!\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}\!+\!\eta\underbrace{\big{<}\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\Delta_{t}\!+\!\eta_{L}K\nabla f(\mathbf{x}_{t})]\big{>}}_{A^{{}^{\prime}}_{1}}+\frac{L}{2}\eta^{2}\underbrace{\mathbb{E}_{t}[\|\Delta_{t}\|^{2}]}_{A^{{}^{\prime}}_{2}} (4)

The term A1A^{{}^{\prime}}_{1} in (4) can be bounded as follows: Since 𝔼St[A1]=A1\mathbb{E}_{S_{t}}[A^{{}^{\prime}}_{1}]=A_{1} due to Lemma 1 for both sampling strategies, we have the same bound as in inequality 2 for A1A^{{}^{\prime}}_{1}:

A1ηLK(12+15K2ηL2L2)f(xt)2+5K2ηL3L22(σL2+6KσG2)ηL2Km2𝔼ti=1mk=0K1Fi(𝐱t,ki)2,A_{1}^{{}^{\prime}}\leq\eta_{L}K(\frac{1}{2}+15K^{2}\eta_{L}^{2}L^{2})\|\nabla f(x_{t})\|^{2}+\frac{5K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})\\ -\frac{\eta_{L}}{2Km^{2}}\mathbb{E}_{t}\bigg{\|}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\bigg{\|}^{2}, (5)

For strategy 1: We can bound A2A^{{}^{\prime}}_{2} in (4) as follows.

Note StS_{t} is an index set (multiset) for independent sampling (equal probability) with replacement in which some elements may have the same value. Suppose St={l1,,ln}S_{t}=\{l_{1},\dots,l_{n}\}.

A2\displaystyle A_{2}^{{}^{\prime}} =𝔼t[Δt2]\displaystyle=\mathbb{E}_{t}[\|\Delta_{t}\|^{2}]
=𝔼t[1niStΔti2]\displaystyle=\mathbb{E}_{t}\bigg{[}\bigg{\|}\frac{1}{n}\sum_{i\in S_{t}}\Delta_{t}^{i}\bigg{\|}^{2}\bigg{]}
=1n2𝔼t[iStΔti2]\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{i\in S_{t}}\Delta_{t}^{i}\bigg{\|}^{2}\bigg{]}
=1n2𝔼t[z=1nΔtlz2]\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{z=1}^{n}\Delta_{t}^{l_{z}}\bigg{\|}^{2}\bigg{]}
=(b1)ηL2n2𝔼t[z=1nj=0K1[𝐠t,jlzFlz(𝐱t,jlz)]2]+ηL2n2𝔼t[z=1nj=0K1Flz(𝐱t,jlz)2]\displaystyle\overset{(b1)}{=}\frac{\eta_{L}^{2}}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{z=1}^{n}\sum_{j=0}^{K-1}[\mathbf{g}_{t,j}^{l_{z}}-\nabla F_{l_{z}}(\mathbf{x}_{t,j}^{l_{z}})]\bigg{\|}^{2}\bigg{]}+\frac{\eta_{L}^{2}}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{z=1}^{n}\sum_{j=0}^{K-1}\nabla F_{l_{z}}(\mathbf{x}_{t,j}^{l_{z}})\bigg{\|}^{2}\bigg{]}
(b2)KηL2nσL2+ηL2n2𝔼t[z=1nj=0K1Flz(𝐱t,jlz)2],\displaystyle\overset{(b2)}{\leq}\frac{K\eta_{L}^{2}}{n}\sigma_{L}^{2}+\frac{\eta_{L}^{2}}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{z=1}^{n}\sum_{j=0}^{K-1}\nabla F_{l_{z}}(\mathbf{x}_{t,j}^{l_{z}})\bigg{\|}^{2}\bigg{]},

where (b1)(b1) follows from the fact that 𝔼[𝐱2]=𝔼[𝐱𝔼[𝐱]2]+𝔼[𝐱]2\mathbb{E}[\|\mathbf{x}\|^{2}]=\mathbb{E}[\|\mathbf{x}-\mathbb{E}[\mathbf{x}]\|^{2}]+\|\mathbb{E}[\mathbf{x}]\|^{2} and (b2)(b2) is due to the bounded variance assumption in Assumption 3 and 𝔼[x1++xn2]n𝔼[x12++xn2]\mathbb{E}[\|x_{1}+\cdots+x_{n}\|^{2}]\leq n\mathbb{E}[\|x_{1}\|^{2}+\cdots+\|x_{n}\|^{2}].

By letting 𝐭i=j=0K1Fi(𝐱t,ji)\mathbf{t}_{i}=\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i}), we have:

𝔼t[z=1nj=0K1Flz(𝐱t,jlz)2\displaystyle\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{z=1}^{n}\sum_{j=0}^{K-1}\nabla F_{l_{z}}(\mathbf{x}_{t,j}^{l_{z}})\bigg{\|}^{2} =𝔼t[z=1n𝐭lz2]\displaystyle=\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{z=1}^{n}\mathbf{t}_{l_{z}}\bigg{\|}^{2}\bigg{]}
=𝔼t[z=1n𝐭lz2+ij;li,ljSt<𝐭li,𝐭lj>]\displaystyle=\mathbb{E}_{t}\bigg{[}\sum_{z=1}^{n}\|\mathbf{t}_{l_{z}}\|^{2}+\sum_{i\neq j;l_{i},l_{j}\in S_{t}}\big{<}\mathbf{t}_{l_{i}},\mathbf{t}_{l_{j}}\big{>}\bigg{]}
=(b3)𝔼t[n𝐭l12+n(n1)<𝐭l1,𝐭l2>]\displaystyle\overset{(b3)}{=}\mathbb{E}_{t}\bigg{[}n\|\mathbf{t}_{l_{1}}\|^{2}+n(n-1)\big{<}\mathbf{t}_{l_{1}},\mathbf{t}_{l_{2}}\big{>}\bigg{]}
=nmi=1m𝐭i2+n(n1)m2i,j[m]<𝐭i,𝐭j>\displaystyle=\frac{n}{m}\sum_{i=1}^{m}\|\mathbf{t}_{i}\|^{2}+\frac{n(n-1)}{m^{2}}\sum_{i,j\in[m]}\big{<}\mathbf{t}_{i},\mathbf{t}_{j}\big{>}
=nmi=1m𝐭i2+n(n1)m2i=1m𝐭i2,\displaystyle=\frac{n}{m}\sum_{i=1}^{m}\|\mathbf{t}_{i}\|^{2}+\frac{n(n-1)}{m^{2}}\|\sum_{i=1}^{m}\mathbf{t}_{i}\|^{2},

where (b3)(b3) is due to the independent sampling with replacement.

So we can bound A2A^{{}^{\prime}}_{2} as follows.

A2\displaystyle A_{2}^{{}^{\prime}} =𝔼t[Δt2]\displaystyle=\mathbb{E}_{t}[\|\Delta_{t}\|^{2}]
KηL2nσL2+ηL2mni=1m𝔼t𝐭i2+(n1)ηL2m2n𝔼ti=1m𝐭i2,\displaystyle\leq\frac{K\eta_{L}^{2}}{n}\sigma_{L}^{2}+\frac{\eta_{L}^{2}}{mn}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{t}_{i}\|^{2}+\frac{(n-1)\eta_{L}^{2}}{m^{2}n}\mathbb{E}_{t}\bigg{\|}\sum_{i=1}^{m}\mathbf{t}_{i}\bigg{\|}^{2}, (6)

For 𝐭i\mathbf{t}_{i}, we have:

i=1m𝔼t𝐭i2\displaystyle\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{t}_{i}\|^{2} =i=1m𝔼tj=0K1Fi(𝐱t,ji)Fi(𝐱t)+Fi(𝐱t)f(𝐱t)+f(𝐱t)2\displaystyle=\sum_{i=1}^{m}\mathbb{E}_{t}\bigg{\|}\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i})-\nabla F_{i}(\mathbf{x}_{t})+\nabla F_{i}(\mathbf{x}_{t})-\nabla f(\mathbf{x}_{t})+\nabla f(\mathbf{x}_{t})\bigg{\|}^{2}
(b4)3KL2i=1mj=0K1𝔼t𝐱t,ji𝐱t2+3mK2σG2+3mK2f(𝐱t)2\displaystyle\overset{(b4)}{\leq}3KL^{2}\sum_{i=1}^{m}\sum_{j=0}^{K-1}\mathbb{E}_{t}\|\mathbf{x}_{t,j}^{i}-\mathbf{x}_{t}\|^{2}+3mK^{2}\sigma_{G}^{2}+3mK^{2}\|\nabla f(\mathbf{x}_{t})\|^{2}
(b5)15mK3L2ηL2(σL2+6KσG2)+(90mK4L2ηL2+3mK2)f(𝐱t)2+3mK2σG2,\displaystyle\overset{(b5)}{\leq}15mK^{3}L^{2}\eta_{L}^{2}(\sigma_{L}^{2}\!+\!6K\sigma_{G}^{2})\!+\!(90mK^{4}L^{2}\eta_{L}^{2}+3mK^{2})\|\nabla f(\mathbf{x}_{t})\|^{2}\!+\!3mK^{2}\sigma_{G}^{2}, (7)

where (b4)(b4) is due to the fact that 𝔼[x1++xn2]n𝔼[x12++xn2]\mathbb{E}[\|x_{1}+\cdots+x_{n}\|^{2}]\leq n\mathbb{E}[\|x_{1}\|^{2}+\cdots+\|x_{n}\|^{2}] , Assumptions 3 and 1, and (b5)(b5) follows from Lemma 2.

Substituting the inequalities in (5) of A1A_{1}^{{}^{\prime}} and (6) of A2A_{2}^{{}^{\prime}} into inequality (4), we have:

𝔼t[f(𝐱t+1)]\displaystyle\mathbb{E}_{t}[f(\mathbf{x}_{t+1})] f(𝐱t)ηηLKf(𝐱t)2+η<f(𝐱t),𝔼t[Δt+ηLKf(𝐱t)]>A1+L2η2𝔼t[Δt2]A2\displaystyle\leq f(\mathbf{x}_{t})\!-\!\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}\!+\!\eta\underbrace{\big{<}\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\Delta_{t}\!+\!\eta_{L}K\nabla f(\mathbf{x}_{t})]\big{>}}_{A^{{}^{\prime}}_{1}}\!+\!\frac{L}{2}\eta^{2}\underbrace{\mathbb{E}_{t}[\|\Delta_{t}\|^{2}]}_{A^{{}^{\prime}}_{2}}
f(𝐱t)ηηLK(1215K2ηL2L2)f(xt)2+5ηK2ηL3L22(σL2+6KσG2)\displaystyle\leq f(\mathbf{x}_{t})-\eta\eta_{L}K(\frac{1}{2}-15K^{2}\eta_{L}^{2}L^{2})\|\nabla f(x_{t})\|^{2}+\frac{5\eta K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})
+[(n1)Lη2ηL22m2nηηL2Km2]𝔼ti=1m𝐭i2+LKη2ηL22nσL2+Lη2ηL22mni=1m𝔼t𝐭i2\displaystyle\quad+\bigg{[}\frac{(n-1)L\eta^{2}\eta_{L}^{2}}{2m^{2}n}-\frac{\eta\eta_{L}}{2Km^{2}}\bigg{]}\mathbb{E}_{t}\bigg{\|}\sum_{i=1}^{m}\mathbf{t}_{i}\bigg{\|}^{2}\!+\!\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}\!+\!\frac{L\eta^{2}\eta_{L}^{2}}{2mn}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{t}_{i}\|^{2}
(b6)f(𝐱t)ηηLK(1215K2ηL2L2)f(xt)2+5ηK2ηL3L22(σL2+6KσG2)\displaystyle\overset{(b6)}{\leq}f(\mathbf{x}_{t})-\eta\eta_{L}K(\frac{1}{2}-15K^{2}\eta_{L}^{2}L^{2})\|\nabla f(x_{t})\|^{2}+\frac{5\eta K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})
+LKη2ηL22nσL2+Lη2ηL22mni=1m𝔼t𝐭i2\displaystyle\qquad+\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}+\frac{L\eta^{2}\eta_{L}^{2}}{2mn}\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{t}_{i}\|^{2}
(b7)f(𝐱t)ηηLK(1215K2ηL2L2LηηL2n(90K3L2ηL2+3K))f(xt)2\displaystyle\overset{(b7)}{\leq}f(\mathbf{x}_{t})-\eta\eta_{L}K(\frac{1}{2}-15K^{2}\eta_{L}^{2}L^{2}-\frac{L\eta\eta_{L}}{2n}(90K^{3}L^{2}\eta_{L}^{2}+3K))\|\nabla f(x_{t})\|^{2}
+[5ηK2ηL3L22+15K3L3η2ηL42n](σL2+6KσG2)+LKη2ηL22nσL2+3K2Lη2ηL22nσG2\displaystyle\qquad\!+\!\bigg{[}\frac{5\eta K^{2}\eta_{L}^{3}L^{2}}{2}\!+\!\frac{15K^{3}L^{3}\eta^{2}\eta_{L}^{4}}{2n}\bigg{]}(\sigma_{L}^{2}\!+\!6K\sigma_{G}^{2})\!+\!\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}\!+\!\frac{3K^{2}L\eta^{2}\eta_{L}^{2}}{2n}\sigma_{G}^{2}
(b8)f(𝐱t)cηηLKf(𝐱t)2+LKη2ηL22nσL2+3K2Lη2ηL22nσG2\displaystyle\overset{(b8)}{\leq}f(\mathbf{x}_{t})-c\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}+\frac{3K^{2}L\eta^{2}\eta_{L}^{2}}{2n}\sigma_{G}^{2}
+ηηLK[5KηL2L22+15K2ηL3ηL32n](σL2+6KσG2),\displaystyle\qquad+\eta\eta_{L}K\bigg{[}\frac{5K\eta_{L}^{2}L^{2}}{2}+\frac{15K^{2}\eta_{L}^{3}\eta L^{3}}{2n}\bigg{]}(\sigma_{L}^{2}+6K\sigma_{G}^{2}), (8)

where (b6)(b6) follows from (n1)Lη2ηL22m2nηηL2Km20\frac{(n-1)L\eta^{2}\eta_{L}^{2}}{2m^{2}n}-\frac{\eta\eta_{L}}{2Km^{2}}\leq 0 if ηηLKLn1n\eta\eta_{L}KL\leq\frac{n-1}{n}, (b7)(b7) is due to inequality (7) and (b8)(b8) holds since there exists a constant c>0c>0 such that [1215K2ηL2L2LηηL2n(90K3L2ηL2+3K)]>c>0[\frac{1}{2}-15K^{2}\eta_{L}^{2}L^{2}-\frac{L\eta\eta_{L}}{2n}(90K^{3}L^{2}\eta_{L}^{2}+3K)]>c>0 if 30K2ηL2L2+LηηLn(90K3L2ηL2+3K)<130K^{2}\eta_{L}^{2}L^{2}+\frac{L\eta\eta_{L}}{n}(90K^{3}L^{2}\eta_{L}^{2}+3K)<1.
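For completeness, the calculation behind the first condition in (b6)(b6) is

\frac{(n-1)L\eta^{2}\eta_{L}^{2}}{2m^{2}n}\leq\frac{\eta\eta_{L}}{2Km^{2}}\iff\eta\eta_{L}KL\leq\frac{n}{n-1},

and since \frac{n-1}{n}\leq 1\leq\frac{n}{n-1}, the stated requirement \eta\eta_{L}KL\leq\frac{n-1}{n} is sufficient (if slightly stronger than necessary).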

Note that the requirement of |St|=n|S_{t}|=n can be relaxed to |St|n|S_{t}|\geq n. With ptnp_{t}\geq n workers in the tt-th communication round, inequality (8) becomes

𝔼t[f(𝐱t+1)]\displaystyle\mathbb{E}_{t}[f(\mathbf{x}_{t+1})] f(𝐱t)cηηLKf(𝐱t)2+LKη2ηL22ptσL2+3K2Lη2ηL22ptσG2\displaystyle\leq f(\mathbf{x}_{t})-c\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2p_{t}}\sigma_{L}^{2}+\frac{3K^{2}L\eta^{2}\eta_{L}^{2}}{2p_{t}}\sigma_{G}^{2}
+ηηLK[5KηL2L22+15K2ηL3ηL32pt](σL2+6KσG2)\displaystyle\quad+\eta\eta_{L}K\bigg{[}\frac{5K\eta_{L}^{2}L^{2}}{2}+\frac{15K^{2}\eta_{L}^{3}\eta L^{3}}{2p_{t}}\bigg{]}(\sigma_{L}^{2}+6K\sigma_{G}^{2})
f(𝐱t)cηηLKf(𝐱t)2+LKη2ηL22nσL2+3K2Lη2ηL22nσG2\displaystyle\leq f(\mathbf{x}_{t})-c\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}+\frac{3K^{2}L\eta^{2}\eta_{L}^{2}}{2n}\sigma_{G}^{2}
+ηηLK[5KηL2L22+15K2ηL3ηL32n](σL2+6KσG2).\displaystyle\quad+\eta\eta_{L}K\bigg{[}\frac{5K\eta_{L}^{2}L^{2}}{2}+\frac{15K^{2}\eta_{L}^{3}\eta L^{3}}{2n}\bigg{]}(\sigma_{L}^{2}+6K\sigma_{G}^{2}).

That is, the same convergence rate is guaranteed as long as at least nn workers participate in each communication round (the number need not be exactly nn).

Rearranging and summing from t=0,,T1t=0,\cdots,T-1, we have the convergence for partial device participation with sampling strategy 1 as follows:

mint[T]𝔼[f(𝐱t)22]f0fcηηLKT+Φ,\min_{t\in[T]}\mathbb{E}[\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}]\leq\frac{f_{0}-f_{*}}{c\eta\eta_{L}KT}+\Phi,

where Φ=1c[LηηL2nσL2+3KLηηL2nσG2+(5KηL2L22+15K2ηηL3L32n)(σL2+6KσG2)]\Phi=\frac{1}{c}\big{[}\frac{L\eta\eta_{L}}{2n}\sigma_{L}^{2}+\frac{3KL\eta\eta_{L}}{2n}\sigma_{G}^{2}+(\frac{5K\eta_{L}^{2}L^{2}}{2}+\frac{15K^{2}\eta\eta_{L}^{3}L^{3}}{2n})(\sigma_{L}^{2}+6K\sigma_{G}^{2})\big{]} and cc is a constant.
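As an illustration (again our own remark, under an assumed learning-rate choice \eta_{L}=\frac{1}{\sqrt{T}KL} and \eta=\sqrt{Kn}), the leading terms of this bound behave as

\frac{f_{0}-f_{*}}{c\eta\eta_{L}KT}=\frac{(f_{0}-f_{*})L}{c\sqrt{nKT}},\qquad\frac{L\eta\eta_{L}}{2n}\sigma_{L}^{2}=\frac{\sigma_{L}^{2}}{2\sqrt{nKT}},\qquad\frac{3KL\eta\eta_{L}}{2n}\sigma_{G}^{2}=\frac{3\sqrt{K}}{2\sqrt{nT}}\sigma_{G}^{2},

so the partial-participation bound is \mathcal{O}\big(\frac{\sqrt{K}}{\sqrt{nT}}+\frac{1}{T}\big); the \sigma_{G}^{2} term dominates and is the source of the extra \sqrt{K} factor compared with full participation.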

For strategy 2: Under the strategy of independent sampling with equal probability and without replacement, we bound A2A^{{}^{\prime}}_{2} as follows.

A2\displaystyle A_{2}^{{}^{\prime}} =𝔼t[Δt2]\displaystyle=\mathbb{E}_{t}[\|\Delta_{t}\|^{2}]
=𝔼t[1niStΔti2]\displaystyle=\mathbb{E}_{t}\bigg{[}\bigg{\|}\frac{1}{n}\sum_{i\in S_{t}}\Delta_{t}^{i}\bigg{\|}^{2}\bigg{]}
=1n2𝔼t[iStΔti2]\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{i\in S_{t}}\Delta_{t}^{i}\bigg{\|}^{2}\bigg{]}
=1n2𝔼t[i=1m𝕀{iSt}Δti2]\displaystyle=\frac{1}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{i=1}^{m}\mathbb{I}\{i\in S_{t}\}\Delta_{t}^{i}\bigg{\|}^{2}\bigg{]}
=ηL2n2𝔼t[i=1m𝕀{iSt}j=0K1[𝐠t,jiFi(𝐱t,ji)]2]+ηL2n2𝔼t[i=1m𝕀{iSt}j=0K1Fi(𝐱t,ji)]2]\displaystyle=\!\frac{\eta_{L}^{2}}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\!\sum_{i=1}^{m}\mathbb{I}\{i\in S_{t}\}\!\sum_{j=0}^{K-1}[\mathbf{g}_{t,j}^{i}\!-\!\nabla F_{i}(\mathbf{x}_{t,j}^{i})]\bigg{\|}^{2}\bigg{]}\!+\!\frac{\eta_{L}^{2}}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\!\sum_{i=1}^{m}\mathbb{I}\{i\in S_{t}\}\!\!\sum_{j=0}^{K-1}\!\nabla F_{i}(\mathbf{x}_{t,j}^{i})]\bigg{\|}^{2}\bigg{]}
=ηL2n2𝔼t[i=1m{iSt}j=0K1[𝐠t,jiFi(𝐱t,ji)]2+ηL2n2i=1m𝕀{iSt}j=0K1Fi(𝐱t,ji)2]\displaystyle=\frac{\eta_{L}^{2}}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{i=1}^{m}\mathbb{P}\{i\in S_{t}\}\sum_{j=0}^{K-1}[\mathbf{g}_{t,j}^{i}-\nabla F_{i}(\mathbf{x}_{t,j}^{i})]\bigg{\|}^{2}+\frac{\eta_{L}^{2}}{n^{2}}\bigg{\|}\sum_{i=1}^{m}\mathbb{I}\{i\in S_{t}\}\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i})\bigg{\|}^{2}\bigg{]}
=(b9)ηL2nm𝔼t[i=1mj=0K1𝐠t,jiFi(𝐱t,ji)2]+ηL2n2𝔼t[i=1m𝕀{iSt}j=0K1Fi(𝐱t,ji)2]\displaystyle\overset{(b9)}{=}\frac{\eta_{L}^{2}}{nm}\mathbb{E}_{t}\bigg{[}\sum_{i=1}^{m}\sum_{j=0}^{K-1}\bigg{\|}\mathbf{g}_{t,j}^{i}-\nabla F_{i}(\mathbf{x}_{t,j}^{i})\bigg{\|}^{2}\bigg{]}+\frac{\eta_{L}^{2}}{n^{2}}\mathbb{E}_{t}\bigg{[}\bigg{\|}\sum_{i=1}^{m}\mathbb{I}\{i\in S_{t}\}\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i})\bigg{\|}^{2}\bigg{]}
(b10)KηL2nσL2+ηL2n2i=1m{iSt}j=0K1Fi(𝐱t,ji)2,\displaystyle\overset{(b10)}{\leq}\frac{K\eta_{L}^{2}}{n}\sigma_{L}^{2}+\frac{\eta_{L}^{2}}{n^{2}}\bigg{\|}\sum_{i=1}^{m}\mathbb{P}\{i\in S_{t}\}\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i})\bigg{\|}^{2}, (9)

where (b9)(b9) is due to the fact that 𝔼[x1++xn2]=𝔼[x12++xn2]\mathbb{E}[\|x_{1}+\cdots+x_{n}\|^{2}]=\mathbb{E}[\|x_{1}\|^{2}+\cdots+\|x_{n}\|^{2}] if xix_{i}^{{}^{\prime}}s are independent with zero mean, 𝐱i=𝐠t,jiFi(𝐱t,ji)\mathbf{x}_{i}=\mathbf{g}_{t,j}^{i}-\nabla F_{i}(\mathbf{x}_{t,j}^{i}) is an independent random variable with zero mean, and {iSt}=nm\mathbb{P}\{i\in S_{t}\}=\frac{n}{m}. (b10)(b10) is due to the bounded variance assumption in Assumption 3.

Substituting the inequalities in (5) of A1A_{1}^{{}^{\prime}} and (9) of A2A_{2}^{{}^{\prime}} into inequality (4), we have:

𝔼t[f(𝐱t+1)]f(𝐱t)ηηLKf(𝐱t)2+η<f(𝐱t),𝔼t[Δt+ηLKf(𝐱t)]>A1+L2η2𝔼t[Δt2]A2\displaystyle\mathbb{E}_{t}[f(\mathbf{x}_{t+1})]\!\leq\!f(\mathbf{x}_{t})\!-\!\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}\!\!+\!\eta\underbrace{\big{<}\nabla f(\mathbf{x}_{t}),\mathbb{E}_{t}[\Delta_{t}\!+\!\eta_{L}K\nabla f(\mathbf{x}_{t})]\big{>}}_{A^{{}^{\prime}}_{1}}\!+\frac{L}{2}\eta^{2}\underbrace{\mathbb{E}_{t}[\|\Delta_{t}\|^{2}]}_{A^{{}^{\prime}}_{2}}
f(𝐱t)ηηLK(1215K2ηL2L2)f(𝐱t)2+LKη2ηL22nσL2+5ηK2ηL3L22(σL2+6KσG2)\displaystyle\leq f(\mathbf{x}_{t})-\eta\eta_{L}K(\frac{1}{2}-15K^{2}\eta_{L}^{2}L^{2})\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}+\frac{5\eta K^{2}\eta_{L}^{3}L^{2}}{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})
+Lη2ηL22n2𝔼ti=1m{iSt}j=0K1Fi(𝐱t,ji)2ηηL2Km2𝔼ti=1mk=0K1Fi(𝐱t,ki)2A3.\displaystyle\quad+\underbrace{\frac{L\eta^{2}\eta_{L}^{2}}{2n^{2}}\mathbb{E}_{t}\bigg{\|}\sum_{i=1}^{m}\mathbb{P}\{i\in S_{t}\}\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i})\bigg{\|}^{2}-\frac{\eta\eta_{L}}{2Km^{2}}\mathbb{E}_{t}\bigg{\|}\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\bigg{\|}^{2}}_{A_{3}^{{}^{\prime}}}.

Then we bound A3A_{3}^{{}^{\prime}} as follows.

By letting 𝐭i=j=0K1Fi(𝐱t,ji)\mathbf{t}_{i}=\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i}), we have:

i=1m𝔼t𝐭i2\displaystyle\sum_{i=1}^{m}\mathbb{E}_{t}\|\mathbf{t}_{i}\|^{2} 15mK3L2ηL2(σL2+6KσG2)+(90mK4L2ηL2+3mK2)f(𝐱t)2+3mK2σG2.\displaystyle\leq 15mK^{3}L^{2}\eta_{L}^{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})+(90mK^{4}L^{2}\eta_{L}^{2}+3mK^{2})\|\nabla f(\mathbf{x}_{t})\|^{2}+3mK^{2}\sigma_{G}^{2}.

It then follows that

i=1m𝐭i2\displaystyle\|\sum_{i=1}^{m}\mathbf{t}_{i}\|^{2} =i[m]𝐭i2+ij<𝐭i,𝐭j>\displaystyle=\sum_{i\in[m]}\|\mathbf{t}_{i}\|^{2}+\sum_{i\neq j}<\mathbf{t}_{i},\mathbf{t}_{j}>
=(b11)i[m]m𝐭i212ij𝐭i𝐭j2\displaystyle\overset{(b11)}{=}\sum_{i\in[m]}m\|\mathbf{t}_{i}\|^{2}-\frac{1}{2}\sum_{i\neq j}\|\mathbf{t}_{i}-\mathbf{t}_{j}\|^{2}
i=1m{iSt}𝐭i2\displaystyle\|\sum_{i=1}^{m}\mathbb{P}\{i\in S_{t}\}\mathbf{t}_{i}\|^{2} =i[m]{iSt}𝐭i2+ij{i,jSt}<𝐭i,𝐭j>\displaystyle=\sum_{i\in[m]}\mathbb{P}\{i\in S_{t}\}\|\mathbf{t}_{i}\|^{2}+\sum_{i\neq j}\mathbb{P}\{i,j\in S_{t}\}<\mathbf{t}_{i},\mathbf{t}_{j}>
=(b12)nmi[m]𝐭i2+n(n1)m(m1)ij<𝐭i,𝐭j>\displaystyle\overset{(b12)}{=}\frac{n}{m}\sum_{i\in[m]}\|\mathbf{t}_{i}\|^{2}+\frac{n(n-1)}{m(m-1)}\sum_{i\neq j}<\mathbf{t}_{i},\mathbf{t}_{j}>
=(b13)n2mi[m]𝐭i2n(n1)2m(m1)ij𝐭i𝐭j2,\displaystyle\overset{(b13)}{=}\frac{n^{2}}{m}\sum_{i\in[m]}\|\mathbf{t}_{i}\|^{2}-\frac{n(n-1)}{2m(m-1)}\sum_{i\neq j}\|\mathbf{t}_{i}-\mathbf{t}_{j}\|^{2},

where (b11)(b11) and (b13)(b13) are due to the fact that <𝐱,𝐲>=12[𝐱2+𝐲2𝐱𝐲2]12[𝐱2+𝐲2],\big{<}\mathbf{x},\mathbf{y}\big{>}=\frac{1}{2}[\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}-\|\mathbf{x}-\mathbf{y}\|^{2}]\leq\frac{1}{2}[\|\mathbf{x}\|^{2}+\|\mathbf{y}\|^{2}], (b12)(b12) follows from the fact that {iSt}=nm\mathbb{P}\{i\in S_{t}\}=\frac{n}{m} and {i,jSt}=n(n1)m(m1)\mathbb{P}\{i,j\in S_{t}\}=\frac{n(n-1)}{m(m-1)}. Therefore, we have

A3\displaystyle A_{3}^{{}^{\prime}} =Lη2ηL22n2i=1m{iSt}j=0K1Fi(𝐱t,ji)]2ηηL2Km2i=1mk=0K1Fi(𝐱t,ki)2\displaystyle=\frac{L\eta^{2}\eta_{L}^{2}}{2n^{2}}\|\sum_{i=1}^{m}\mathbb{P}\{i\in S_{t}\}\sum_{j=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,j}^{i})]\|^{2}-\frac{\eta\eta_{L}}{2Km^{2}}\|\sum_{i=1}^{m}\sum_{k=0}^{K-1}\nabla F_{i}(\mathbf{x}_{t,k}^{i})\|^{2}
=(Lη2ηL22mηηL2Km)i=1m𝐭i2+(ηηL4Km2Lη2ηL2(n1)4mn(m1))ij𝐭i𝐭j2\displaystyle=(\frac{L\eta^{2}\eta_{L}^{2}}{2m}-\frac{\eta\eta_{L}}{2Km})\sum_{i=1}^{m}\|\mathbf{t}_{i}\|^{2}+(\frac{\eta\eta_{L}}{4Km^{2}}-\frac{L\eta^{2}\eta_{L}^{2}(n-1)}{4mn(m-1)})\sum_{i\neq j}\|\mathbf{t}_{i}-\mathbf{t}_{j}\|^{2}
=(b14)(Lη2ηL22mLη2ηL2(n1)2n(m1))i=1m𝐭i2(ηηL2Km2Lη2ηL2(n1)2mn(m1))i[m]𝐭i2\displaystyle\overset{(b14)}{=}(\frac{L\eta^{2}\eta_{L}^{2}}{2m}-\frac{L\eta^{2}\eta_{L}^{2}(n-1)}{2n(m-1)})\sum_{i=1}^{m}\|\mathbf{t}_{i}\|^{2}-(\frac{\eta\eta_{L}}{2Km^{2}}-\frac{L\eta^{2}\eta_{L}^{2}(n-1)}{2mn(m-1)})\|\sum_{i\in[m]}\mathbf{t}_{i}\|^{2}
(b15)(Lη2ηL22mLη2ηL2(n1)2n(m1))i=1m𝐭i2\displaystyle\overset{(b15)}{\leq}(\frac{L\eta^{2}\eta_{L}^{2}}{2m}-\frac{L\eta^{2}\eta_{L}^{2}(n-1)}{2n(m-1)})\sum_{i=1}^{m}\|\mathbf{t}_{i}\|^{2}
=Lη2ηL2mn2mn(m1)i=1m𝐭i2,\displaystyle=L\eta^{2}\eta_{L}^{2}\frac{m-n}{2mn(m-1)}\sum_{i=1}^{m}\|\mathbf{t}_{i}\|^{2},

where (b14)(b14) follows from the fact that i[m]𝐭i2=i[m]m𝐭i212ij𝐭i𝐭j2\|\sum_{i\in[m]}\mathbf{t}_{i}\|^{2}=\sum_{i\in[m]}m\|\mathbf{t}_{i}\|^{2}-\frac{1}{2}\sum_{i\neq j}\|\mathbf{t}_{i}-\mathbf{t}_{j}\|^{2}, and (b15)(b15) is due to the fact that (ηηL2Km2Lη2ηL2(n1)2mn(m1))0(\frac{\eta\eta_{L}}{2Km^{2}}-\frac{L\eta^{2}\eta_{L}^{2}(n-1)}{2mn(m-1)})\geq 0 if ηηLKLn(m1)m(n1)\eta\eta_{L}KL\leq\frac{n(m-1)}{m(n-1)}.
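The sampling probabilities used in (b12)(b12) follow from a standard counting argument for sampling without replacement (included here for completeness):

\mathbb{P}\{i\in S_{t}\}=\frac{\binom{m-1}{n-1}}{\binom{m}{n}}=\frac{n}{m},\qquad\mathbb{P}\{i,j\in S_{t}\}=\frac{\binom{m-2}{n-2}}{\binom{m}{n}}=\frac{n(n-1)}{m(m-1)}\quad(i\neq j).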

Then we have

𝔼t[f(𝐱t+1)]\displaystyle\mathbb{E}_{t}[f(\mathbf{x}_{t+1})] f(𝐱t)ηηLK(1215K2ηL2L2LηηLmn2n(m1)(90K3ηL2L2+3K))f(𝐱t)2\displaystyle\leq f(\mathbf{x}_{t})\!-\!\eta\eta_{L}K(\frac{1}{2}\!-\!15K^{2}\eta_{L}^{2}L^{2}\!-\!L\eta\eta_{L}\frac{m-n}{2n(m-1)}(90K^{3}\eta_{L}^{2}L^{2}\!+\!3K))\|\nabla f(\mathbf{x}_{t})\|^{2}
+LKη2ηL22nσL2+3K2Lη2ηL2mn2n(m1)σG2\displaystyle\quad+\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}+3K^{2}L\eta^{2}\eta_{L}^{2}\frac{m-n}{2n(m-1)}\sigma_{G}^{2}
+ηηLK(5KηL2L22+15K2ηηL3L3mn2n(m1))(σL2+6KσG2)\displaystyle\quad+\eta\eta_{L}K(\frac{5K\eta_{L}^{2}L^{2}}{2}+15K^{2}\eta\eta_{L}^{3}L^{3}\frac{m-n}{2n(m-1)})(\sigma_{L}^{2}+6K\sigma_{G}^{2})
(b16)f(𝐱t)cηηLKf(𝐱t)2+LKη2ηL22nσL2+3K2Lη2ηL2mn2n(m1)σG2\displaystyle\overset{(b16)}{\leq}f(\mathbf{x}_{t})-c\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}+3K^{2}L\eta^{2}\eta_{L}^{2}\frac{m-n}{2n(m-1)}\sigma_{G}^{2}
+ηηLK(5KηL2L22+15K2ηηL3L3mn2n(m1))(σL2+6KσG2),\displaystyle\quad+\eta\eta_{L}K(\frac{5K\eta_{L}^{2}L^{2}}{2}+15K^{2}\eta\eta_{L}^{3}L^{3}\frac{m-n}{2n(m-1)})(\sigma_{L}^{2}+6K\sigma_{G}^{2}), (10)

where (b16)(b16) holds because there exists a constant c>0c>0 satisfying (1215K2ηL2L2LηηLmn2n(m1)(90K3ηL2L2+3K))>c>0(\frac{1}{2}-15K^{2}\eta_{L}^{2}L^{2}-L\eta\eta_{L}\frac{m-n}{2n(m-1)}(90K^{3}\eta_{L}^{2}L^{2}+3K))>c>0 if 30K2ηL2L2+LηηLmnn(m1)(90K3ηL2L2+3K)<130K^{2}\eta_{L}^{2}L^{2}+L\eta\eta_{L}\frac{m-n}{n(m-1)}(90K^{3}\eta_{L}^{2}L^{2}+3K)<1.

Note that the requirement of |St|=n|S_{t}|=n can be relaxed to |St|n|S_{t}|\geq n. With ptnp_{t}\geq n workers in the tt-th communication round, inequality (10) becomes

𝔼t[f(𝐱t+1)]\displaystyle\mathbb{E}_{t}[f(\mathbf{x}_{t+1})] f(𝐱t)cηηLKf(𝐱t)2+LKη2ηL22ptσL2+3K2Lη2ηL2mpt2pt(m1)σG2\displaystyle\leq f(\mathbf{x}_{t})-c\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2p_{t}}\sigma_{L}^{2}+3K^{2}L\eta^{2}\eta_{L}^{2}\frac{m-p_{t}}{2p_{t}(m-1)}\sigma_{G}^{2}
+ηηLK(5KηL2L22+15K2ηηL3L3mpt2pt(m1))(σL2+6KσG2)\displaystyle\quad+\eta\eta_{L}K(\frac{5K\eta_{L}^{2}L^{2}}{2}+15K^{2}\eta\eta_{L}^{3}L^{3}\frac{m-p_{t}}{2p_{t}(m-1)})(\sigma_{L}^{2}+6K\sigma_{G}^{2})
f(𝐱t)cηηLKf(𝐱t)2+LKη2ηL22nσL2+3K2Lη2ηL2mn2n(m1)σG2\displaystyle\leq f(\mathbf{x}_{t})-c\eta\eta_{L}K\|\nabla f(\mathbf{x}_{t})\|^{2}+\frac{LK\eta^{2}\eta_{L}^{2}}{2n}\sigma_{L}^{2}+3K^{2}L\eta^{2}\eta_{L}^{2}\frac{m-n}{2n(m-1)}\sigma_{G}^{2}
+ηηLK(5KηL2L22+15K2ηηL3L3mn2n(m1))(σL2+6KσG2)\displaystyle\quad+\eta\eta_{L}K(\frac{5K\eta_{L}^{2}L^{2}}{2}+15K^{2}\eta\eta_{L}^{3}L^{3}\frac{m-n}{2n(m-1)})(\sigma_{L}^{2}+6K\sigma_{G}^{2})

That is, the same convergence rate is guaranteed as long as at least nn workers participate in each communication round (the number need not be exactly nn).

Rearranging and summing from t=0,,T1t=0,\cdots,T-1, we have the convergence for partial device participation with sampling strategy 2 as follows:

mint[T]𝔼[f(𝐱t)22]f0fcηηLKT+Φ,\min_{t\in[T]}\mathbb{E}[\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}]\leq\frac{f_{0}-f_{*}}{c\eta\eta_{L}KT}+\Phi,

where Φ=1c[LηηL2nσL2+3KLηηLmn2n(m1)σG2+(5KηL2L22+15K2ηηL3L3mn2n(m1))(σL2+6KσG2)]\Phi=\frac{1}{c}\big{[}\frac{L\eta\eta_{L}}{2n}\sigma_{L}^{2}+3KL\eta\eta_{L}\frac{m-n}{2n(m-1)}\sigma_{G}^{2}+(\frac{5K\eta_{L}^{2}L^{2}}{2}+15K^{2}\eta\eta_{L}^{3}L^{3}\frac{m-n}{2n(m-1)})(\sigma_{L}^{2}+6K\sigma_{G}^{2})\big{]} and cc is a constant. This completes the proof. ∎

A.2.1 Key Lemmas

Lemma 1 (Unbiased Sampling).

For strategies 1 and 2, the estimator Δt\Delta_{t} is unbiased, i.e.,

𝔼St[Δt]=Δ¯t.\mathbb{E}_{S_{t}}[\Delta_{t}]=\bar{\Delta}_{t}.

Proof of Lemma 1.
Let St={t1,,tn}S_{t}=\{t_{1},\cdots,t_{n}\} with size nn. For both sampling strategies 1 and 2, each sampled index tit_{i} follows the same (uniform) distribution over the mm workers. Then we have:

𝔼St[Δt]=1n𝔼St[tiStΔtti]=1n𝔼St[i=1nΔtti]=𝔼St[Δtt1]=1mi=1mΔti=Δ¯t.\displaystyle\mathbb{E}_{S_{t}}[\Delta_{t}]=\frac{1}{n}\mathbb{E}_{S_{t}}[\sum_{t_{i}\in S_{t}}\Delta_{t}^{t_{i}}]=\frac{1}{n}\mathbb{E}_{S_{t}}[\sum_{i=1}^{n}\Delta_{t}^{t_{i}}]=\mathbb{E}_{S_{t}}[\Delta_{t}^{t_{1}}]=\frac{1}{m}\sum_{i=1}^{m}\Delta_{t}^{i}=\bar{\Delta}_{t}.
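As a quick numerical illustration of Lemma 1 (a sketch of our own, not part of the proof; the update vectors below are synthetic stand-ins), both sampling strategies yield an unbiased estimate of Δ¯t\bar{\Delta}_{t}:

import numpy as np

rng = np.random.default_rng(0)
m, n, d = 100, 10, 5                       # total workers, sampled workers, update dimension
deltas = rng.normal(size=(m, d))           # synthetic stand-ins for the local updates Delta_t^i
bar_delta = deltas.mean(axis=0)            # full average bar{Delta}_t

def sampled_average(with_replacement):
    idx = rng.choice(m, size=n, replace=with_replacement)
    return deltas[idx].mean(axis=0)        # Delta_t for one draw of S_t

for replace in (True, False):              # strategy 1 (with) / strategy 2 (without replacement)
    draws = np.array([sampled_average(replace) for _ in range(100000)])
    print(replace, np.abs(draws.mean(axis=0) - bar_delta).max())   # both gaps shrink toward 0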

A.3 Auxiliary Lemmas

Lemma 2 (Lemma 4 in Reddi et al. (2020)).

For any step-size satisfying ηL18LK\eta_{L}\leq\frac{1}{8LK}, we have the following result:

1mi=1m𝔼[𝐱t,ki𝐱t2]5KηL2(σL2+6KσG2)+30K2ηL2f(𝐱t)2.\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[\|\mathbf{x}_{t,k}^{i}-\mathbf{x}_{t}\|^{2}]\leq 5K\eta_{L}^{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})+30K^{2}\eta_{L}^{2}\|\nabla f(\mathbf{x}_{t})\|^{2}.
Proof.

In order for this paper to be self-contained, we restate the proof of Lemma 4 in (Reddi et al., 2020) here.

For any worker i[m]i\in[m] and k[K]k\in[K], we have:

𝔼[𝐱t,ki𝐱t2]=𝔼[𝐱t,k1i𝐱tηLgt,k1t2]\displaystyle\mathbb{E}[\|\mathbf{x}_{t,k}^{i}-\mathbf{x}_{t}\|^{2}]=\mathbb{E}[\|\mathbf{x}_{t,k-1}^{i}-\mathbf{x}_{t}-\eta_{L}g_{t,k-1}^{t}\|^{2}]
𝔼[𝐱t,k1i𝐱tηL(gt,k1tFi(𝐱t,k1i)+Fi(𝐱t,k1i)Fi(𝐱t)+Fi(𝐱t)f(𝐱t)+f(𝐱t))2]\displaystyle\!\leq\!\!\mathbb{E}[\|\mathbf{x}_{t,k-1}^{i}\!\!-\!\!\mathbf{x}_{t}\!\!-\!\!\eta_{L}(g_{t,k-1}^{t}\!\!-\!\!\nabla F_{i}(\mathbf{x}_{t,k-1}^{i})\!\!+\!\!\nabla F_{i}(\mathbf{x}_{t,k-1}^{i})\!\!-\!\!\nabla F_{i}(\mathbf{x}_{t})\!\!+\!\!\nabla F_{i}(\mathbf{x}_{t})\!\!-\!\!\nabla f(\mathbf{x}_{t})\!\!+\!\!\nabla f(\mathbf{x}_{t}))\|^{2}]
(1+12K1)𝔼[𝐱t,k1i𝐱t2]+𝔼[ηL(gt,k1tFi(𝐱t,k1i))2]\displaystyle\leq(1+\frac{1}{2K-1})\mathbb{E}[\|\mathbf{x}_{t,k-1}^{i}-\mathbf{x}_{t}\|^{2}]+\mathbb{E}[\|\eta_{L}(g_{t,k-1}^{t}-\nabla F_{i}(\mathbf{x}_{t,k-1}^{i}))\|^{2}]
+6K𝔼[ηL(Fi(𝐱t,k1i)Fi(𝐱t))2]+6K𝔼[ηL(Fi(𝐱t)f(𝐱t)))2]+6KηLf(𝐱t)2\displaystyle\quad\!\!\!\!+\!6K\mathbb{E}[\|\eta_{L}(\nabla F_{i}(\mathbf{x}_{t,k-1}^{i})\!-\!\nabla F_{i}(\mathbf{x}_{t}))\|^{2}]\!+\!6K\mathbb{E}[\|\eta_{L}(\nabla F_{i}(\mathbf{x}_{t})\!-\!\nabla f(\mathbf{x}_{t})))\|^{2}]\!+\!6K\|\eta_{L}\nabla f(\mathbf{x}_{t})\|^{2}
(1+12K1)𝔼[𝐱t,k1i𝐱t2]+ηL2σL2+6KηL2L2𝔼[𝐱t,k1i𝐱t2]+6KηL2σG2+6KηLf(𝐱t)2\displaystyle\!\leq\!\!(1\!+\!\frac{1}{2K\!-\!1})\mathbb{E}[\|\mathbf{x}_{t,k-1}^{i}\!\!\!-\!\mathbf{x}_{t}\|^{2}]\!+\!\eta_{L}^{2}\sigma_{L}^{2}\!\!+\!6K\eta_{L}^{2}L^{2}\mathbb{E}[\|\mathbf{x}_{t,k-1}^{i}\!\!\!-\!\mathbf{x}_{t}\|^{2}]\!+\!6K\eta_{L}^{2}\sigma_{G}^{2}\!\!+\!\!6K\|\eta_{L}\!\nabla f(\mathbf{x}_{t})\|^{2}
=(1+12K1+6KηL2L2)𝔼[𝐱t,k1i𝐱t2]+ηL2σL2+6KηL2σG2+6KηLf(𝐱t)2\displaystyle=(1+\frac{1}{2K-1}+6K\eta_{L}^{2}L^{2})\mathbb{E}[\|\mathbf{x}_{t,k-1}^{i}-\mathbf{x}_{t}\|^{2}]+\eta_{L}^{2}\sigma_{L}^{2}+6K\eta_{L}^{2}\sigma_{G}^{2}+6K\|\eta_{L}\nabla f(\mathbf{x}_{t})\|^{2}
(1+1K1)𝔼[𝐱t,k1i𝐱t2]+ηL2σL2+6KηL2σG2+6KηLf(𝐱t)2\displaystyle\leq(1+\frac{1}{K-1})\mathbb{E}[\|\mathbf{x}_{t,k-1}^{i}-\mathbf{x}_{t}\|^{2}]+\eta_{L}^{2}\sigma_{L}^{2}+6K\eta_{L}^{2}\sigma_{G}^{2}+6K\|\eta_{L}\nabla f(\mathbf{x}_{t})\|^{2}

Unrolling the recursion, we get:

1mi=1m𝔼[𝐱t,ki𝐱t2]\displaystyle\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[\|\mathbf{x}_{t,k}^{i}-\mathbf{x}_{t}\|^{2}] p=0k1(1+1K1)p[ηL2σL2+6KηL2σG2+6KηLf(𝐱t)2]\displaystyle\leq\sum_{p=0}^{k-1}(1+\frac{1}{K-1})^{p}[\eta_{L}^{2}\sigma_{L}^{2}+6K\eta_{L}^{2}\sigma_{G}^{2}+6K\|\eta_{L}\nabla f(\mathbf{x}_{t})\|^{2}]
(K1)[(1+1K1)K1][ηL2σL2+6KηL2σG2+6KηLf(𝐱t)2]\displaystyle\leq(K-1)[(1+\frac{1}{K-1})^{K}-1][\eta_{L}^{2}\sigma_{L}^{2}+6K\eta_{L}^{2}\sigma_{G}^{2}+6K\|\eta_{L}\nabla f(\mathbf{x}_{t})\|^{2}]
5KηL2(σL2+6KσG2)+30K2ηL2f(𝐱t)2\displaystyle\leq 5K\eta_{L}^{2}(\sigma_{L}^{2}+6K\sigma_{G}^{2})+30K^{2}\eta_{L}^{2}\|\nabla f(\mathbf{x}_{t})\|^{2}
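The last two inequalities use the following elementary bounds for K\geq 2 (spelled out here for completeness):

\sum_{p=0}^{k-1}\Big(1+\frac{1}{K-1}\Big)^{p}=(K-1)\Big[\Big(1+\frac{1}{K-1}\Big)^{k}-1\Big]\leq(K-1)\Big[\Big(1+\frac{1}{K-1}\Big)^{K}-1\Big]\leq 3(K-1),

since \big(1+\frac{1}{K-1}\big)^{K}\leq 4 for K\geq 2; multiplying 3(K-1) into the bracket and using 3(K-1)\leq 5K and 18K(K-1)\leq 30K^{2} yields the stated bound.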

This completes the proof. ∎

Appendix B Appendix II: Experiments

We provide the full details of the experiments. We use non-i.i.d. versions of MNIST and CIFAR-10, which are described as follows:

B.1 MNIST

We study image classification of handwritten digits 0-9 in MNIST and modify the MNIST dataset to a non-i.i.d. version.

To impose statistical heterogeneity, we split the data based on the number of digit classes (pp) contained in each worker's local dataset. We distribute the data to m=100m=100 workers such that each worker contains only a certain set of digits, with the same number of training/test samples per worker. For example, for p=1p=1, each worker only has training/testing samples for a single digit, which causes heterogeneity among different workers. For p=10p=10, each worker has samples from all 10 digits, which is essentially the i.i.d. case. In this way, we can use the number of digits in a worker's local dataset to qualitatively represent the degree of non-i.i.d.-ness. In each communication round, 100 workers run KK epochs locally in parallel and then the server samples nn workers for aggregation and update. We perform a grid search over the hyper-parameters as shown in Table 3; a partitioning sketch in Python is given after the table.

Table 3: Hyper-parameters Tuning.
Server Learning Rate η{1,10}\eta\in\{1,10\}
Client Learning Rate ηL{0.001,0.01,0.1}\eta_{L}\in\{0.001,0.01,0.1\}
Local Epochs K{1,5,10}K\in\{1,5,10\}
Clients Partition Number n{10,50,100}n\in\{10,50,100\}
Non-i.i.d. Degree p{1,2,5,10}p\in\{1,2,5,10\}
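For concreteness, a minimal sketch of the digit-based split described above is given below (a hypothetical NumPy re-implementation, not the code used for the reported experiments; the helper name and the chunking details are our own):

import numpy as np
from collections import defaultdict

def partition_mnist_noniid(labels, m=100, p=2, seed=0):
    # Hypothetical helper: worker w receives samples from exactly p digit classes,
    # and every worker gets the same number of chunks.
    # labels: array of MNIST digit labels in {0, ..., 9}.
    # Assumes m * p is divisible by 10 (true for the settings in Table 3).
    rng = np.random.default_rng(seed)
    by_class = {c: rng.permutation(np.flatnonzero(labels == c)) for c in range(10)}
    chunks_per_class = m * p // 10
    chunks = {c: np.array_split(by_class[c], chunks_per_class) for c in range(10)}
    next_chunk = defaultdict(int)
    workers = []
    for w in range(m):
        digits = [(w + j) % 10 for j in range(p)]      # p distinct digits per worker
        parts = [chunks[c][next_chunk[c]] for c in digits]
        for c in digits:
            next_chunk[c] += 1
        workers.append(np.concatenate(parts))
    return workers                                     # list of index arrays, one per worker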

We run three models: multinomial logistic regression (LR), a fully-connected network with two hidden layers (2NN) (two 200-neuron hidden layers with ReLU activations followed by an output layer), and a convolutional neural network (CNN), as shown in Table 4; a PyTorch sketch of this architecture is given after the table. The results are shown in Figures 2, 3 and 4.

Table 4: CNN Architecture for MNIST.
Layer Type Size
Convolution + ReLU 5×5×325\times 5\times 32
Max Pooling 2×22\times 2
Convolution + ReLU 5×5×645\times 5\times 64
Max Pooling 2×22\times 2
Fully Connected + ReLU 1024×5121024\times 512
Fully Connected 512×10512\times 10
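Below is a PyTorch sketch of the CNN in Table 4 (our own reconstruction, assuming 28×28 inputs and unpadded 5×5 convolutions, which makes the flattened feature size 4·4·64 = 1024 and matches the table; the class name is ours):

import torch.nn as nn

class MnistCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5), nn.ReLU(),   # Convolution + ReLU, 5x5x32
            nn.MaxPool2d(2),                              # Max Pooling, 2x2
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(),  # Convolution + ReLU, 5x5x64
            nn.MaxPool2d(2),                              # Max Pooling, 2x2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(4 * 4 * 64, 512), nn.ReLU(),        # Fully Connected + ReLU, 1024x512
            nn.Linear(512, num_classes),                  # Fully Connected, 512x10
        )

    def forward(self, x):
        return self.classifier(self.features(x))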
Figure 2 ((a) LR, (b) 2NN, (c) CNN): Training loss (top) and test accuracy (bottom) for three models on MNIST with hyperparameters setting: local learning rate 0.1, global learning rate 1.0, local steps 5 epochs.
Figure 3 ((a) LR, (b) 2NN, (c) CNN): Training loss (top) and test accuracy (bottom) for three models on MNIST with hyperparameters setting: local learning rate 0.1, global learning rate 1.0, worker number 100.
Figure 4 ((a) LR, (b) 2NN, (c) CNN): Training loss (top) and test accuracy (bottom) for three models on MNIST with hyperparameters setting: local learning rate 0.1, global learning rate 1.0, worker number 10.

B.2 CIFAR-10

Unless stated otherwise, we use the following default parameter setting: the server learning rate and client learning rate are set to η=1.0\eta=1.0 and ηL=0.1\eta_{L}=0.1, respectively. The number of local epochs is set to K=10K=10. The total number of clients is set to 100, and the number of clients sampled in each round is set to n=10n=10. We use the same strategy to distribute the data over clients as suggested in McMahan et al. (2016). For the i.i.d. setting, we evenly partition all the training data among all clients, i.e., each client observes 500 data samples; for the non-i.i.d. setting, we first sort the training data by label, then divide all the training data into 200 shards of size 250, and randomly assign two shards to each client (a partitioning sketch is given below). For the CIFAR-10 dataset, we train our classifier with the ResNet model. The results are shown in Figure 5 and Figure 6.
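A minimal sketch of this shard-based split is given below (a hypothetical re-implementation in NumPy, not the authors' released code; the function name is ours):

import numpy as np

def partition_cifar10_noniid(labels, num_clients=100, num_shards=200, seed=0):
    # Sort the training indices by label, cut them into 200 shards of size 250,
    # and assign two randomly chosen shards to each of the 100 clients.
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)                       # indices sorted by label
    shards = np.array_split(order, num_shards)       # 50,000 / 200 = 250 samples per shard
    shard_ids = rng.permutation(num_shards)
    per_client = num_shards // num_clients           # = 2 shards per client
    return [
        np.concatenate([shards[s] for s in shard_ids[c * per_client:(c + 1) * per_client]])
        for c in range(num_clients)
    ]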

Figure 5 ((a) IID, (b) Non-IID): Test accuracy with respect to worker number on CIFAR-10 dataset.
Figure 6 ((a) IID, (b) Non-IID): Test accuracy with respect to different local steps on CIFAR-10 dataset.

B.3 Discussion

Impact of non-i.i.d. datasets: Figure 2 shows the training loss (top) and test accuracy (bottom) for three models under different non-i.i.d. datasets with full and partial worker participation on MNIST. We can see that the FedAvg algorithm converges under non-i.i.d. datasets with a proper learning rate choice in these cases. We believe that the major challenge in FL is the non-i.i.d. datasets. For datasets with a lower degree of non-i.i.d.-ness, the FedAvg algorithm can achieve results close to those of the i.i.d. case. For example, when the local dataset in each worker has five digits (p=5p=5) with full (partial) worker participation, the FedAvg algorithm achieves a convergence speed comparable with that of the i.i.d. case (p=10p=10). This can be observed in Figure 2 for all three models. As the degree of non-i.i.d.-ness increases, its negative impact on the convergence becomes more obvious: the higher the degree of non-i.i.d.-ness, the slower the convergence. As the non-i.i.d. degree increases (from case p=10p=10 to case p=1p=1), the training loss increases and the test accuracy decreases. For datasets with a high degree of non-i.i.d.-ness, the convergence curves oscillate and are highly unstable. This trend is more obvious for complex models, such as the CNN in Figure 2(c).

Impact of worker number: For full worker participation, the server can obtain an accurate estimate of the system heterogeneity after receiving the updates from all workers and neutralize this heterogeneity in each communication round. However, partial worker participation introduces another source of randomness, which leads to zigzagging convergence curves and slower convergence. In each communication round, the server only receives updates from a subset of workers determined by the sampling strategy. So the server only has a coarse estimate of the system heterogeneity and might not be able to neutralize the heterogeneity among different workers under partial worker participation. This problem is more prominent for highly non-i.i.d. datasets: the digits held by the active workers may form only a proper subset of the 1010 digits in the original MNIST dataset. For example, for p=1p=1 with 1010 workers in each communication round, it is highly likely that the datasets formed by these ten workers include only a small number of digits (say, 44 or 55) rather than all 1010 digits. But for p=5p=5, the opposite holds, that is, these 1010 workers are highly likely to cover all 1010 digits. So in each communication round, the server can mitigate the system heterogeneity since the received updates cover training samples from all 1010 digits. This trend is more obvious for complex models and datasets, given the dramatic drop in test accuracy on CIFAR-10 in Figure 5.

The sampling strategy here is random sampling with equal probability without replacement. In practice, the workers need to be in certain states in order to be able to participate in FL (e.g., charging or idle states (Eichner et al., 2019)). Therefore, care must be taken when sampling and enlisting workers in practice. We believe that the joint design of sampling schemes, the number of workers and the FedAvg algorithm will have a significant impact on the convergence, which needs further investigation.

Impact of local steps: Figure 3 and Figure 4 show the training loss (top) and test accuracy (bottom) for three models under different numbers of local steps with full and partial worker participation, respectively. Figure 6 shows the impact of local steps on CIFAR-10. One open question in FL is whether the local steps help the convergence or not. Li et al. (2019b) showed a convergence rate 𝒪(KT)\mathcal{O}(\frac{K}{T}), i.e., the local steps may hurt the convergence for full and partial worker participation. In these two figures, we can see that local steps could help the convergence for both full and partial worker participation. However, they have only a slight effect on the convergence compared to the effects of non-i.i.d. datasets and the number of workers.

Comparison with SCAFFOLD: We compare SCAFFOLD (Karimireddy et al., 2019) with the generalized FedAvg algorithm in this paper in terms of communication rounds, total communication load and estimated wall-clock time to achieve a given test accuracy in Table 2. We run the experiments using the same GPU (NVIDIA V100) to ensure identical conditions. Here, we give a specific comparison of the two algorithms under the same conditions. Note that we divide the total training time into two parts: the computation time, when a worker trains its local model, and the communication time, when information is exchanged between the workers and the server. We only compare the computation time and communication time with a fixed bandwidth of 20MB/s20MB/s for both uploading and downloading connections. As shown in Figure 7, to achieve ϵ=75%\epsilon=75\%, SCAFFOLD requires fewer communication rounds thanks to its variance reduction technique, and thus spends less time on computation. However, it needs to communicate twice as much as FedAvg per round, since the control variate used for variance reduction at each worker must also be updated in each round. In this way, the communication time is largely prolonged. A back-of-the-envelope sketch of this wall-clock accounting is given below.
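The following sketch illustrates the wall-clock accounting (a sketch of our own; all concrete numbers below are placeholders for illustration, not measurements from the paper):

def wall_clock_seconds(rounds, compute_per_round_s, exchanged_mb, bandwidth_mb_s=20.0):
    # Per round: local computation plus uploading and downloading the exchanged state.
    comm_per_round_s = 2 * exchanged_mb / bandwidth_mb_s   # upload + download
    return rounds * (compute_per_round_s + comm_per_round_s)

# Placeholder values only: SCAFFOLD needs fewer rounds, but exchanges roughly twice
# the bytes per round because the control variates travel alongside the model.
model_mb = 40.0
print("FedAvg  :", wall_clock_seconds(rounds=100, compute_per_round_s=5.0, exchanged_mb=model_mb))
print("SCAFFOLD:", wall_clock_seconds(rounds=60, compute_per_round_s=5.0, exchanged_mb=2 * model_mb))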

Figure 7: Wall-clock time to achieve test accuracy ϵ=75%\epsilon=75\% on CIFAR-10 dataset.