
FedChain: Chained Algorithms for
Near-Optimal Communication Cost
in Federated Learning

Charlie Hou
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
[email protected]
Kiran K. Thekumparampil
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
Champaign, Illinois, USA
[email protected]
Giulia Fanti
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
[email protected]
Sewoong Oh
Allen School of Computer Science and Engineering
University of Washington
Seattle, Washington, USA
[email protected]
Abstract

Federated learning (FL) aims to minimize the communication complexity of training a model over heterogeneous data distributed across many clients. A common approach is local update methods, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg). Local update methods can exploit similarity between clients’ data. However, in existing analyses, this comes at the cost of slow convergence in terms of the dependence on the number of communication rounds $R$. On the other hand, global update methods, where clients simply return a gradient vector in each round (e.g., SGD), converge faster in terms of $R$ but fail to exploit the similarity between clients even when clients are homogeneous. We propose FedChain, an algorithmic framework that combines the strengths of local update methods and global update methods to achieve fast convergence in terms of $R$ while leveraging the similarity between clients. Using FedChain, we instantiate algorithms that improve upon previously known rates in the general convex and PL settings, and are near-optimal (via an algorithm-independent lower bound that we show) for problems that satisfy strong convexity. Empirical results support this theoretical gain over existing methods.

1 Introduction

In federated learning (FL) (McMahan et al., 2017; Kairouz et al., 2019; Li et al., 2020), distributed clients interact with a central server to jointly train a single model without directly sharing their data with the server. The training objective is to solve the following minimization problem:

\min_{x}\Big[F(x)=\frac{1}{N}\sum_{i=1}^{N}F_{i}(x)\Big] \qquad (1)

where the variable $x$ is the model parameter, $i$ indexes the clients (or devices), $N$ is the number of clients, and $F_{i}(x)$ is a loss that depends only on that client’s data. Typical FL deployments have two properties that make optimizing Eq. 1 challenging: (i) Data heterogeneity: We want to minimize the average of the expected losses $F_{i}(x)=\mathbb{E}_{z_{i}\sim\mathcal{D}_{i}}[f(x;z_{i})]$ for some loss function $f$ evaluated on client data $z_{i}$ drawn from a client-specific distribution $\mathcal{D}_{i}$. The convergence rate of the optimization depends on the heterogeneity of the local data distributions $\{\mathcal{D}_{i}\}_{i=1}^{N}$, and this dependence is captured by a popular notion of client heterogeneity, $\zeta^{2}$, defined as follows:

\zeta^{2}:=\max_{i\in[N]}\sup_{x}\|\nabla F(x)-\nabla F_{i}(x)\|^{2}\;. \qquad (2)

This captures the maximum difference between a local gradient and the global gradient, and is a standard measure of heterogeneity used in the literature (Woodworth et al., 2020a; Gorbunov et al., 2020; Deng et al., 2020; Yuan et al., 2020; Deng et al., 2021; Deng & Mahdavi, 2021). (ii) Communication cost: In many FL deployments, communication is costly because clients have limited bandwidths. Due to these two challenges, most federated optimization algorithms alternate between local rounds of computation, where each client locally processes only their own data to save communication, and global rounds, where clients synchronize with the central server to resolve disagreements due to heterogeneity in the locally updated models.
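For intuition, the heterogeneity in Eq. 2 can be estimated empirically by probing client gradients at a set of test points. The following is a minimal sketch in Python/NumPy; the toy quadratic problem, the probe points, and the helper names are our own assumptions for illustration, not part of the paper.

import numpy as np

# Minimal sketch (assumed toy problem): estimate zeta^2 of Eq. 2 for
# F_i(x) = 0.5 * ||x - b_i||^2, whose gradient is x - b_i, so that the
# global gradient is x - mean(b).
rng = np.random.default_rng(0)
N, d = 5, 3
b = rng.normal(size=(N, d))          # per-client optima (toy data)

def client_grad(i, x):
    return x - b[i]

def global_grad(x):
    return x - b.mean(axis=0)

# The sup over x in Eq. 2 is approximated by a max over a finite set of probes.
probes = rng.normal(size=(100, d))
zeta_sq = max(
    np.sum((global_grad(x) - client_grad(i, x)) ** 2)
    for x in probes
    for i in range(N)
)
print(f"estimated zeta^2 = {zeta_sq:.4f}")

For this particular quadratic toy the gradient gap is constant in $x$, so the finite-probe maximum is exact; in general it only lower-bounds the supremum.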

Several federated algorithms navigate this trade-off between reducing communication and resolving data heterogeneity by modifying the amount and nature of local and global computation (McMahan et al., 2017; Li et al., 2018; Wang et al., 2019b; a; Li et al., 2019; Karimireddy et al., 2020b; a; Al-Shedivat et al., 2020; Reddi et al., 2020; Charles & Konečnỳ, 2020; Mitra et al., 2021; Woodworth et al., 2020a). These first-order federated optimization algorithms largely fall into one of two camps: (i) clients in local update methods send updated models after performing multiple steps of model updates, and (ii) clients in global update methods send gradients and do not perform any model updates locally. Examples of (i) include FedAvg (McMahan et al., 2017), SCAFFOLD (Karimireddy et al., 2020b), and FedProx (Li et al., 2018). Examples of (ii) include SGD and Accelerated SGD (ASG), where in each round $r$ the server collects from the clients gradients evaluated on local data at the current iterate $x^{(r)}$ and then performs a (possibly Nesterov-accelerated) model update.

As an illustrative example, consider the scenario where the $F_{i}$'s are $\mu$-strongly convex and $\beta$-smooth with condition number $\kappa=\beta/\mu$, as summarized in Table 1, and assume for simplicity that all clients participate in each communication round (i.e., full participation). Existing convergence analyses show that local update methods have a favorable dependence on the heterogeneity $\zeta$. For example, Woodworth et al. (2020a) show that FedAvg achieves an error bound of $\tilde{\mathcal{O}}((\kappa\zeta^{2}/\mu)R^{-2})$ after $R$ rounds of communication. Hence FedAvg achieves theoretically faster convergence for smaller heterogeneity levels $\zeta$ by using local updates. However, this favorable dependence on $\zeta$ comes at the cost of slow convergence in $R$. On the other hand, global update methods converge faster in $R$, with error decaying as $\tilde{\mathcal{O}}(\Delta\exp(-R/\sqrt{\kappa}))$ (for the case of ASG), where $\Delta$ is the initial function value gap to the (unique) optimal solution. This is an exponentially faster rate in $R$, but it does not take advantage of small heterogeneity even when client data is fully homogeneous.
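To see the crossover concretely, the short script below plugs illustrative values (our own choices, not from the paper) into the two bounds above, the FedAvg-style bound $\tilde{\mathcal{O}}((\kappa\zeta^{2}/\mu)R^{-2})$ and the ASG-style bound $\tilde{\mathcal{O}}(\Delta\exp(-R/\sqrt{\kappa}))$, ignoring constants and log factors.

import math

# Illustrative constants (assumed, not from the paper); logs/constants ignored.
kappa, mu, zeta_sq, Delta = 100.0, 1.0, 0.01, 10.0

def fedavg_bound(R):   # ~ kappa * zeta^2 / mu * R^{-2}
    return kappa * zeta_sq / mu / R**2

def asg_bound(R):      # ~ Delta * exp(-R / sqrt(kappa))
    return Delta * math.exp(-R / math.sqrt(kappa))

for R in (50, 100, 150, 200):
    print(R, f"FedAvg ~ {fedavg_bound(R):.2e}", f"ASG ~ {asg_bound(R):.2e}")
# With these numbers the FedAvg bound is smaller for small R (low heterogeneity
# helps), while ASG's exponential rate eventually dominates as R grows.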

Our main contribution is to design a novel family of algorithms that combines the strengths of local and global methods to achieve faster convergence while maintaining the favorable dependence on heterogeneity, thus achieving an error rate of $\tilde{\mathcal{O}}((\zeta^{2}/\mu)\exp(-R/\sqrt{\kappa}))$ in the strongly convex scenario. By adaptively switching between this novel algorithm and the existing best global method, we can achieve an error rate of $\tilde{\mathcal{O}}(\min\{\Delta,\zeta^{2}/\mu\}\exp(-R/\sqrt{\kappa}))$. We further show that this is near-optimal by providing a matching lower bound in Thm. 5.4. This lower bound tightens an existing one from Woodworth et al. (2020a) by considering a smaller class of algorithms that includes those presented in this paper.

We propose FedChain, a unifying chaining framework for federated optimization that enjoys the benefits of both local update methods and global update methods. For a given total number of rounds $R$, FedChain first uses a local-update method for a constant fraction of rounds and then switches to a global-update method for the remaining rounds. The first phase exploits client homogeneity when possible, providing fast convergence when heterogeneity is small. The second phase uses the unbiased stochastic gradients of global update methods to achieve a faster convergence rate in the heterogeneous setting. The second phase inherits the benefits of the first phase through the iterate output by the local-update method. An instantiation of FedChain consists of a combination of a specific local-update method and global-update method (e.g., FedAvg \to SGD).

Contributions. We propose the FedChain framework and analyze various instantiations under strongly convex, general convex, and nonconvex objectives that satisfy the PL condition. ($F$ satisfies the $\mu$-PL condition if $2\mu(F(x)-F(x^{*}))\leq\|\nabla F(x)\|^{2}$.) Achievable rates are summarized in Tables 1, 2, and 4. In the strongly convex setting, these rates nearly match the algorithm-independent lower bound we introduce in Theorem 5.4, which shows the near-optimality of FedChain. For strongly convex functions, chaining is optimal up to a factor that decays exponentially in the condition number $\kappa$. For convex functions, it is optimal in the high-heterogeneity regime, when $\zeta>\beta DR^{1/2}$. For nonconvex PL functions, it is optimal for constant condition number $\kappa$. In all three settings, FedChain instantiations improve upon previously known worst-case rates in certain regimes of $\zeta$. We further demonstrate the empirical gains of our chaining framework in the convex case (logistic regression on MNIST) and in the nonconvex case (ConvNet classification on EMNIST (Cohen et al., 2017) and ResNet-18 classification on CIFAR-100 (Krizhevsky, 2009)).

1.1 Related Work

The convergence of FedAvg for convex optimization has been the subject of much interest in the machine learning community, particularly as federated learning (Kairouz et al., 2019) has increased in popularity. The convergence of FedAvg for convex minimization was first studied in the homogeneous client setting (Stich, 2018; Wang & Joshi, 2018; Woodworth et al., 2020b). These rates were later extended to the heterogeneous client setting (Khaled et al., 2020; Karimireddy et al., 2020b; Woodworth et al., 2020a; Koloskova et al., 2020), including a lower bound (Woodworth et al., 2020a). The only known analysis showing that FedAvg can improve on SGD/ASG in the heterogeneous data case is that of Woodworth et al. (2020a), which has been used to prove slightly tighter bounds in subsequent work (Gorbunov et al., 2020).

Many new federated optimization algorithms have been proposed to improve on FedAvg and SGD/ASG, such as SCAFFOLD (Karimireddy et al., 2020b), S-Local-SVRG (Gorbunov et al., 2020), FedAdam (Reddi et al., 2020), FedLin (Mitra et al., 2021), and FedProx (Li et al., 2018). However, existing analyses for these algorithms have not demonstrated any improvement over SGD under full participation (where all clients participate in each communication round; otherwise the setting is partial participation), or over SAGA (Defazio et al., 2014) under partial participation. In the smooth nonconvex full-participation setting, MimeMVR (Karimireddy et al., 2020a) was recently proposed and improves over SGD in the low-heterogeneity setting. Currently, MimeMVR and SGD are the two best algorithms for worst-case smooth nonconvex optimization with respect to communication efficiency, depending on client heterogeneity; under partial participation, MimeMVR and SAGA are the two best, again depending on client heterogeneity.

Lin et al. (2018) proposed post-local SGD, a scheme opposite to ours: it switches from a global-update method to a local-update method (whereas we switch from a local-update method to a global-update method). They evaluate the setting where $\zeta=0$. The authors found that while post-local SGD has worse training accuracy than SGD, its test accuracy can be better (an improvement of 1% test accuracy on ResNet-20 CIFAR-10 classification), though no theoretical analysis was provided. Wang & Joshi (2019) decrease the number $K$ of local updates to transition from FedAvg to SGD in the homogeneous setting. This is conceptually similar to FedChain, but they do not show order-wise convergence rate improvements. Indeed, in the homogeneous setting, a simple variant of ASG achieves the optimal worst-case convergence rate (Woodworth et al., 2021).

Algorithm 1 Federated Chaining (FedChain)
  Input: local update algorithm $\mathcal{A}_{\text{local}}$, centralized algorithm $\mathcal{A}_{\text{global}}$, initial point $\hat{x}_{0}$, $K$, rounds $R$
  $\triangleright$ Run $\mathcal{A}_{\text{local}}$ for $R/2$ rounds
  $\hat{x}_{1/2}\leftarrow\mathcal{A}_{\text{local}}(\hat{x}_{0})$
  $\triangleright$ Choose the better point between $\hat{x}_{0}$ and $\hat{x}_{1/2}$
  Sample $S$ clients $\mathcal{S}\subseteq[N]$
  Draw $\hat{z}_{i,k}\sim\mathcal{D}_{i}$ for $i\in\mathcal{S}$, $k\in\{0,\dots,K-1\}$
  $\hat{x}_{1}\leftarrow\operatorname*{arg\,min}_{x\in\{\hat{x}_{0},\hat{x}_{1/2}\}}\frac{1}{SK}\sum_{i\in\mathcal{S}}\sum_{k=0}^{K-1}f(x;\hat{z}_{i,k})$
  $\triangleright$ Finish convergence with $\mathcal{A}_{\text{global}}$ for $R/2$ rounds
  $\hat{x}_{2}\leftarrow\mathcal{A}_{\text{global}}(\hat{x}_{1})$
  Return $\hat{x}_{2}$
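As a complement to Algo. 1, the following is a minimal Python sketch of the chaining wrapper; the local_alg, global_alg, and sample_loss callables are placeholders of our own for illustration and are not part of the paper's code.

def fedchain(x0, local_alg, global_alg, sample_loss, rounds):
    """Sketch of Algorithm 1 (FedChain).

    local_alg(x, rounds)  -> iterate after running the local-update method
    global_alg(x, rounds) -> iterate after running the global-update method
    sample_loss(x)        -> Monte Carlo estimate of F(x) from S clients, K samples each
    """
    # (1) Run the local-update method (e.g., FedAvg) for R/2 rounds.
    x_half = local_alg(x0, rounds // 2)
    # (2) Keep whichever of x0, x_half has the smaller estimated loss;
    #     this guards against high heterogeneity hurting the local phase.
    x1 = x0 if sample_loss(x0) <= sample_loss(x_half) else x_half
    # (3) Finish with the global-update method (e.g., SGD) for the remaining rounds.
    return global_alg(x1, rounds - rounds // 2)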

2 Setting

Federated optimization proceeds in rounds. We consider the partial participation setting, where at the beginning of each round, $S\in\mathbb{Z}_{+}$ out of $N$ total clients are sampled uniformly at random without replacement. Between global communication rounds, each sampled client either (i) accesses its own stochastic gradient oracle $K$ times, updates its local model, and returns the updated model, or (ii) accesses its own stochastic function value oracle $K$ times and returns the average value to the server. The server aggregates the received information and performs a model update. For each baseline optimization algorithm, we analyze the sub-optimality error after $R$ rounds of communication between the clients and the server. Suboptimality is measured in terms of the function value $\mathbb{E}F(\hat{x})-F(x^{*})$, where $\hat{x}$ is the solution estimate after $R$ rounds and $x^{*}=\operatorname*{arg\,min}_{x}F(x)$ is a (possibly non-unique) optimum of $F$. We let the estimate of the initial suboptimality gap be $\Delta$ (Assumption B.9) and the initial distance to a (not necessarily unique) optimum be $D$ (Assumption B.10). If applicable, $\epsilon$ is the target expected function value suboptimality.

We study three settings: strongly convex $F_{i}$'s, convex $F_{i}$'s, and $\mu$-PL $F$; for formal definitions, refer to App. B. Throughout this paper, we assume that the $F_{i}$'s are $\beta$-smooth (Assumption B.4). If the $F_{i}$'s are $\mu$-strongly convex (Assumption B.1) or $F$ is $\mu$-PL (Assumption B.3), we denote the condition number by $\kappa=\beta/\mu$. $\mathcal{D}_{i}$ is the data distribution of client $i$. We define the heterogeneity of the problem as $\zeta^{2}:=\max_{i\in[N]}\sup_{x}\|\nabla F(x)-\nabla F_{i}(x)\|^{2}$ in Assumption B.5. Unless otherwise specified, we assume $\zeta^{2}>0$, i.e., that the problem is heterogeneous. We assume that the client gradient variance is upper bounded by $\sigma^{2}$ (Assumption B.6). We also define the analogous quantities for function value oracle queries: $\zeta^{2}_{F}$ (Assumption B.8) and $\sigma^{2}_{F}$ (Assumption B.7). We use the notation $\tilde{\mathcal{O}},\tilde{\Omega}$ to hide polylogarithmic factors, and $\mathcal{O},\Omega$ when only hiding constant factors.

3 Federated Chaining (FedChain) Framework

Figure 1: A toy example illustrating FedChain. We have two client objectives, $F_{1}(x)$ and $F_{2}(x)$; $F(x)$ is their average. The top plot displays the objectives and the bottom plot displays the gradients. In regions where the client gradients agree in direction, i.e., $(-\infty,-1]\cup[1,\infty)$, it may be better to use an algorithm with local steps (like FedAvg), and in the region where the gradients disagree in direction, i.e., $(-1,1)$, it may be better to use an algorithm without local steps (like SGD).

We start with a toy example (Fig. 1) to illustrate FedChain. Consider two strongly convex client objectives, $F_{1}(x)=(1/2)(x-1)^{2}$ and $F_{2}(x)=(x+1)^{2}$, whose average is the global objective $F(x)=(F_{1}(x)+F_{2}(x))/2$. Fig. 1 (top) plots the objectives, and Fig. 1 (bottom) plots their gradients. Far from the optimum, specifically when $x\in(-\infty,-1]\cup[1,\infty)$, strong convexity ensures that all client gradients point towards the optimal solution. In this regime, clients can use local steps (e.g., FedAvg) to reduce communication without sacrificing the consistency of per-client local updates. On the other hand, close to the optimum, i.e., when $x\in(-1,1)$, some client gradients may point away from the optimum. This suggests that clients should not take local steps, to avoid driving the global estimate away from the optimum. Therefore, when we are close to the optimum, we use an algorithm without local steps (e.g., SGD), which is less affected by client gradient disagreement.
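The sketch below numerically checks the sign pattern described above for this toy example; it is only an illustration of the figure's claim, with the grid of test points chosen by us.

import numpy as np

# Toy example from Fig. 1: F1(x) = 0.5*(x-1)^2, F2(x) = (x+1)^2.
grad_f1 = lambda x: x - 1.0        # gradient of F1
grad_f2 = lambda x: 2.0 * (x + 1)  # gradient of F2

xs = np.linspace(-3, 3, 601)
agree = np.sign(grad_f1(xs)) * np.sign(grad_f2(xs)) >= 0
# Client gradients agree in direction outside (-1, 1) and disagree inside,
# matching the regions described in Fig. 1.
print("gradients disagree on approximately:", (xs[~agree].min(), xs[~agree].max()))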

This intuition seems to carry over to the nonconvex setting: Charles et al. (2021) show that over the course of FedAvg execution on a neural network for StackOverflow next-word prediction, client gradients become increasingly orthogonal.

FedChain: To exploit the strengths of both local and global update methods, we propose the federated chaining (FedChain) framework in Alg. 1. There are three steps: (1) run a local-update method $\mathcal{A}_{\text{local}}$, like FedAvg; (2) choose the better point between the output of $\mathcal{A}_{\text{local}}$ (which we denote $\hat{x}_{1/2}$) and the initial point $\hat{x}_{0}$ to initialize (3) a global-update method $\mathcal{A}_{\text{global}}$, like SGD. Note that when heterogeneity is large, $\mathcal{A}_{\text{local}}$ can actually output an iterate with a higher suboptimality gap than $\hat{x}_{0}$. Hence, selecting the point (between $\hat{x}_{0}$ and $\hat{x}_{1/2}$) with smaller $F$ allows us to adapt to the problem’s heterogeneity and initialize $\mathcal{A}_{\text{global}}$ appropriately to achieve good convergence rates. To compare $\hat{x}_{1/2}$ and the initial point $\hat{x}_{0}$, we approximate $F(\hat{x}_{0})$ and $F(\hat{x}_{1/2})$ by averaging: we compute $\frac{1}{SK}\sum_{i\in\mathcal{S},k\in[K]}f(x;\hat{z}_{i,k})$ for $x\in\{\hat{x}_{1/2},\hat{x}_{0}\}$, where $K$ is also the number of local steps per client per round in $\mathcal{A}_{\text{local}}$, and $\mathcal{S}$ is a sample of $S$ clients. In practice, one might consider adaptively selecting how many rounds of $\mathcal{A}_{\text{local}}$ to run, which can potentially improve convergence by a constant factor. Our experiments in § J.1 show significant improvement when using only one round of $\mathcal{A}_{\text{local}}$ with a large enough $K$ for convex losses.

Table 1: Rates for the strongly convex case. Rate requires $R\geq N/S$. Rate requires $S=N$.
Method/Analysis: $\mathbb{E}F(\hat{x})-F(x^{*})\leq\tilde{\mathcal{O}}(\cdot)$
Centralized Algorithms
  SGD: $\Delta\exp(-\kappa^{-1}R)+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}$
  ASG: $\Delta\exp(-\kappa^{-\frac{1}{2}}R)+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}$
Federated Algorithms
  FedAvg (Karimireddy et al., 2020b): $\Delta\exp(-\kappa^{-1}R)+\kappa(\frac{\zeta^{2}}{\mu})R^{-2}$
  FedAvg (Woodworth et al., 2020a): $\kappa(\frac{\zeta^{2}}{\mu})R^{-2}$
  SCAFFOLD (Karimireddy et al., 2020b): $\Delta\exp(-\min\{\kappa^{-1},\frac{S}{N}\}R)$
This paper
  FedAvg \to SGD (Thm. 4.1): $\min\{\Delta,\frac{\zeta^{2}}{\mu}\}\exp(-\kappa^{-1}R)+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}$
  FedAvg \to SAGA (Thm. 4.3): $\min\{\Delta,\frac{\zeta^{2}}{\mu}\}\exp(-\min\{\kappa^{-1},\frac{S}{N}\}R)$
  FedAvg \to ASG (Thm. 4.2): $\min\{\Delta,\frac{\zeta^{2}}{\mu}\}\exp(-\kappa^{-\frac{1}{2}}R)+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}$
  FedAvg \to SSNM (Thm. 4.4): $\kappa\min\{\Delta,\frac{\zeta^{2}}{\mu}\}\exp(-\min\{\sqrt{\frac{S}{N\kappa}},\frac{S}{N}\}R)$
  Algo.-independent LB (Thm. 5.4): $\min\{\Delta,\kappa^{-\frac{3}{2}}(\frac{\zeta^{2}}{\beta})\}\exp(-\kappa^{-\frac{1}{2}}R)$

4 Convergence of FedChain

We first analyze FedChain (Algo. 1) when $\mathcal{A}_{\text{local}}$ is FedAvg and $\mathcal{A}_{\text{global}}$ is (Nesterov-accelerated) SGD. The first theorem is without Nesterov acceleration and the second is with it.

Theorem 4.1 (FedAvg \to SGD).

Suppose that the client objectives $F_{i}$ and their gradient queries satisfy Assumptions B.4, B.5, B.6, B.7, and B.8. Then running FedChain (Algo. 1) with $\mathcal{A}_{\text{local}}$ as FedAvg (Algo. 4) with the parameter choices of Thm. E.1, and $\mathcal{A}_{\text{global}}$ as SGD (Algo. 2) with the parameter choices of Thm. D.1, we get the following rates. (We ignore variance terms and $\zeta_{F}^{2}=\max_{i\in[N]}\sup_{x}(F_{i}(x)-F(x))^{2}$: the former can be made negligible by increasing $K$, and the latter can be made zero by running the better of FedAvg \to SGD and SGD instead of choosing the better of $\hat{x}_{1/2}$ and $\hat{x}_{0}$ as in Algo. 1; moreover, the $\zeta_{F}^{2}$ terms are similar to the $\zeta^{2}$ terms. The remaining theorems in the main paper are stated the same way; for the formal statements, see App. F.)

  • Strongly convex: If the $F_{i}$ satisfy Assumption B.1 for some $\mu>0$, then there exists a finite $K$ above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\Delta,\zeta^{2}/\mu\}\exp(-R/\kappa)+(1-S/N)\,\zeta^{2}/(\mu SR)\big)$.

  • General convex: If the $F_{i}$ satisfy Assumption B.2, then there exists a finite $K$ above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\beta D^{2}/R,\sqrt{\beta\zeta D^{3}}/\sqrt{R}\}+\sqrt[4]{1-S/N}\,\sqrt{\beta\zeta D^{3}}/\sqrt[4]{SR}\big)$.

  • PL condition: If the $F_{i}$ satisfy Assumption B.3 for $\mu>0$, then there exists a finite $K$ above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\Delta,\zeta^{2}/\mu\}\exp(-R/\kappa)+(1-S/N)\,\kappa\zeta^{2}/(\mu SR)\big)$.

For the formal statement, see Thm. F.1.

Theorem 4.2 (FedAvg \to ASG).

Under the hypotheses of Thm. 4.1, with $\mathcal{A}_{\text{global}}$ chosen as ASG (Algo. 3) with the parameter choices of Thm. D.3, we get the following guarantees:

  • Strongly convex: If the $F_{i}$ satisfy Assumption B.1 for $\mu>0$, then there exists a finite $K$ above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\Delta,\zeta^{2}/\mu\}\exp(-R/\sqrt{\kappa})+(1-S/N)\,\zeta^{2}/(\mu SR)\big)$.

  • General convex: If the $F_{i}$ satisfy Assumption B.2, then there exists a finite $K$ above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\beta D^{2}/R^{2},\sqrt{\beta\zeta D^{3}}/R\}+\sqrt{1-S/N}\,\zeta D/\sqrt{SR}+\sqrt[4]{1-S/N}\,\sqrt{\beta\zeta D^{3}}/\sqrt[4]{SR}\big)$.

For the formal statement, see Thm. F.2. We show and discuss the near-optimality of FedAvg \to ASG under strongly convex and PL conditions (and under full participation) in Section 5, where we introduce matching lower bounds. Under strong convexity (Table 1), FedAvg \to ASG, compared to ASG, converts the $\Delta\exp(-R/\sqrt{\kappa})$ term into a $\min\{\Delta,\zeta^{2}/\mu\}\exp(-R/\sqrt{\kappa})$ term, improving over ASG when heterogeneity is moderately small: $\zeta^{2}/\mu<\Delta$. It also significantly improves over FedAvg, as $\min\{\Delta,\zeta^{2}/\mu\}\exp(-R/\sqrt{\kappa})$ is exponentially faster than $\kappa(\zeta^{2}/\mu)R^{-2}$. Under the PL condition (which does not imply convexity), the story is similar (Table 4), except we use FedAvg \to SGD as our representative algorithm, since Nesterov acceleration is not known to help in nonconvex settings.

Table 2: Rates for the general convex case. Rate requires $R\geq\frac{N}{S}$. Rate requires $S=N$. Analysis from Karimireddy et al. (2020b).
Method/Analysis: $\mathbb{E}F(\hat{x})-F(x^{*})\leq\tilde{\mathcal{O}}(\cdot)$
Centralized Algorithms
  SGD: $\frac{\beta D^{2}}{R}+\sqrt{1-\frac{S}{N}}\frac{\zeta D}{\sqrt{SR}}$
  ASG: $\frac{\beta D^{2}}{R^{2}}+\sqrt{1-\frac{S}{N}}\frac{\zeta D}{\sqrt{SR}}$
Federated Algorithms
  FedAvg: $\frac{\beta D^{2}}{R}+\sqrt[3]{\frac{\beta\zeta^{2}D^{4}}{R^{2}}}+\sqrt{1-\frac{S}{N}}\frac{\zeta D}{\sqrt{SR}}$
  FedAvg (Woodworth et al., 2020a): $\sqrt[3]{\frac{\beta\zeta^{2}D^{4}}{R^{2}}}$
  SCAFFOLD: $\sqrt{\frac{N}{S}}\frac{\beta D^{2}}{R}$
This paper
  FedAvg \to SGD (Thm. 4.1): $\min\{\frac{\beta D^{2}}{R},\frac{\sqrt{\beta\zeta D^{3}}}{\sqrt{R}}\}+\sqrt[4]{1-\frac{S}{N}}\frac{\sqrt{\beta\zeta D^{3}}}{\sqrt[4]{SR}}$
  FedAvg \to ASG (Thm. 4.2): $\min\{\frac{\beta D^{2}}{R^{2}},\frac{\sqrt{\beta\zeta D^{3}}}{R}\}+\sqrt[4]{1-\frac{S}{N}}\frac{\sqrt{\beta\zeta D^{3}}}{\sqrt[4]{SR}}+\sqrt{1-\frac{S}{N}}\frac{\zeta D}{\sqrt{SR}}$
  Algo.-independent LB (Thm. 5.4): $\min\{\frac{\beta D^{2}}{R^{2}},\frac{\zeta D}{\sqrt{R^{5}}}\}$

In the general convex case (Table 2), let $\beta=D=1$ for the purpose of comparison. Then FedAvg \to ASG's convergence rate is $\min\{1/R^{2},\zeta^{1/2}/R\}+\sqrt{1-S/N}\,(\zeta/\sqrt{SR})+\sqrt[4]{1-S/N}\,\sqrt{\zeta}/\sqrt[4]{SR}$. If $\zeta<1/R^{2}$, then $\zeta^{1/2}/R<1/R^{2}$, and if $\zeta<\sqrt{S/R^{7}}$, then $\sqrt[4]{1-S/N}\,\sqrt{\zeta}/\sqrt[4]{SR}<1/R^{2}$, so the FedAvg \to ASG convergence rate is better than the convergence rate of ASG in the regime $\zeta<\min\{1/R^{2},\sqrt{S/R^{7}}\}$. The rate of Karimireddy et al. (2020b) for FedAvg (which does not require $S=N$) is strictly worse than ASG's, and so is also worse than FedAvg \to ASG's if $\zeta<\min\{1/R^{2},\sqrt{S/R^{7}}\}$. Altogether, if $\zeta<\min\{1/R^{2},\sqrt{S/R^{7}}\}$, FedAvg \to ASG achieves the best known worst-case rate. Finally, in the $S=N$ case, FedAvg \to ASG does not have a regime of $\zeta$ in which it improves over both ASG and FedAvg (under the analysis of Woodworth et al. (2020a), which requires $S=N$) at the same time. It is unclear whether this is due to looseness in the analysis.
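As a quick sanity check of this regime (with $\beta=D=1$ as above), the snippet below evaluates the threshold $\min\{1/R^{2},\sqrt{S/R^{7}}\}$ for a few illustrative values of $R$ and $S$; the specific values are our own, chosen only for illustration, and constants and log factors are ignored.

import math

def heterogeneity_threshold(R, S):
    # FedAvg -> ASG beats ASG's worst-case rate when zeta is below this value
    # (with beta = D = 1, constants and log factors ignored).
    return min(1.0 / R**2, math.sqrt(S / R**7))

for R, S in [(10, 10), (100, 10), (100, 100)]:
    print(f"R={R:4d}, S={S:3d}: zeta < {heterogeneity_threshold(R, S):.2e}")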

Next, we analyze variance reduced methods that improve convergence when a random subset of the clients participate in each round (i.e., partial participation).

Theorem 4.3 (FedAvg \to SAGA).

Suppose that the client objectives $F_{i}$ and their gradient queries satisfy Assumptions B.4, B.5, B.6, B.7, and B.8. Then running FedChain (Algo. 1) with $\mathcal{A}_{\text{local}}$ as FedAvg (Algo. 4) with the parameter choices of Thm. E.1, and $\mathcal{A}_{\text{global}}$ as SAGA (Algo. 5) with the parameter choices of Thm. D.4, we get the following guarantees as long as $R\geq\Omega(N/S)$:

  • Strongly convex: If the $F_{i}$ satisfy Assumption B.1 for $\mu>0$, then there exists a finite $K$ above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\Delta,\zeta^{2}/\mu\}\exp(-\min\{S/N,1/\kappa\}R)\big)$.

  • PL condition: If the $F_{i}$ satisfy Assumption B.3 for $\mu>0$, then there exists a finite $K$ above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\Delta,\zeta^{2}/\mu\}\exp(-(S/N)^{\frac{2}{3}}R/\kappa)\big)$.

For the formal statement, see Thm. F.3.

Theorem 4.4 (FedAvg \to SSNM).

Suppose that the client objectives $F_{i}$ and their gradient queries satisfy Assumptions B.4, B.5, B.6, and B.8. Then running FedChain (Algo. 1) with $\mathcal{A}_{\text{local}}$ as FedAvg (Algo. 4) with the parameter choices of Thm. E.1, and $\mathcal{A}_{\text{global}}$ as SSNM (Algo. 6) with the parameter choices of Thm. D.5, we get the following guarantee as long as $R\geq\Omega(N/S)$:

  • Strongly convex: If the $F_{i}$ satisfy Assumption B.1 for $\mu>0$, then there exists a finite $K$ (the same as in FedAvg \to SGD) above which $\mathbb{E}F(\hat{x}_{2})-F(x^{*})\leq\tilde{\mathcal{O}}\big(\min\{\Delta,\zeta^{2}/\mu\}\exp(-\min\{S/N,\sqrt{S/(N\kappa)}\}R)\big)$.

For the formal statement, see Thm. F.4. SSNM (Zhou et al., 2019) is the Nesterov accelerated version of SAGA (Defazio et al., 2014).

The main benefit of using variance-reduced methods for $\mathcal{A}_{\text{global}}$ is the removal of the sampling error (the terms depending on the sampling heterogeneity error $(1-S/N)(\zeta^{2}/S)$) from the convergence rates, in exchange for requiring a round complexity of at least $N/S$. To illustrate this, observe that in the strongly convex case, FedAvg \to SGD has convergence rate $\min\{\Delta,\zeta^{2}/\mu\}\exp(-R/\kappa)+(1-S/N)(\zeta^{2}/(\mu SR))$, while FedAvg \to SAGA (its variance-reduced counterpart) has convergence rate $\min\{\Delta,\zeta^{2}/\mu\}\exp(-\min\{1/\kappa,S/N\}R)$, dropping the $(1-S/N)(\zeta^{2}/(\mu SR))$ sampling heterogeneity error term in exchange for slowing the rate of linear convergence from $1/\kappa$ to $\min\{1/\kappa,S/N\}$.

This same tradeoff occurs in finite-sum optimization (the problem $\min_{x}(1/n)\sum_{i=1}^{n}\psi_{i}(x)$, where the $\psi_{i}$'s typically represent losses on data points and the main concern is computation cost), which is what variance reduction was designed for. In finite-sum optimization, variance reduction methods such as SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014; Reddi et al., 2016) achieve linear rates of convergence (given strong convexity of the $\psi_{i}$'s) in exchange for requiring at least $N/S$ updates. Because we can treat FL as an instance of finite-sum optimization (by viewing Eq. 1 as a finite sum of objectives and $\zeta^{2}$ as the variance between the $F_{i}$'s), these results from variance-reduced finite-sum optimization can be extended to federated learning. This is the idea behind SCAFFOLD (Karimireddy et al., 2020b).

Variance reduction in $\mathcal{A}_{\text{global}}$ does not always achieve better rates. In the strongly convex case, variance reduction gives a gain if $(1-S/N)(\zeta^{2}/(\mu S\epsilon))>N/S$, and otherwise it does not.
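The condition above can be checked numerically; the snippet below plugs in illustrative values (our own assumptions, not from the paper) to decide whether the variance-reduced chain is preferable at a target accuracy $\epsilon$.

def variance_reduction_helps(N, S, mu, zeta_sq, eps):
    # Gain condition from the text: (1 - S/N) * zeta^2 / (mu * S * eps) > N / S.
    sampling_error_rounds = (1.0 - S / N) * zeta_sq / (mu * S * eps)
    return sampling_error_rounds > N / S

# Illustrative numbers (assumed): 100 clients, 10 sampled per round.
print(variance_reduction_helps(N=100, S=10, mu=1.0, zeta_sq=1.0, eps=1e-4))  # True
print(variance_reduction_helps(N=100, S=10, mu=1.0, zeta_sq=1.0, eps=1e-1))  # False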

Figure 2: Plot titles denote data homogeneity (§ 6). "X\toY" denotes a FedChain instantiation with X as $\mathcal{A}_{\text{local}}$ and Y as $\mathcal{A}_{\text{global}}$; circle markers denote stepsize decay events and plusses denote switching from $\mathcal{A}_{\text{local}}$ to $\mathcal{A}_{\text{global}}$. Across all heterogeneity levels, the multistage algorithms perform the best. Stepsize-decayed baselines are left out to simplify the plots; we display them in § J.2.

5 Lower bounds

Our lower bound allows full participation. It assumes deterministic gradients and considers the following class of algorithms from (Woodworth et al., 2020a; Woodworth, 2021; Carmon et al., 2020):

Definition 5.1.

For $v\in\mathbb{R}^{d}$, let $\text{supp}(v)=\{i\in[d]:v_{i}\neq 0\}$. An algorithm is distributed zero-respecting if for any $i,k,r$, the $k$-th iterate on the $i$-th client in the $r$-th round, $x^{(r)}_{i,k}$, satisfies

\text{supp}(x^{(r)}_{i,k})\subseteq\Big(\bigcup_{0\leq k^{\prime}<k}\text{supp}(\nabla F_{i}(x_{i,k^{\prime}}^{(r)}))\Big)\cup\Big(\bigcup_{i^{\prime}\in[N],\,0\leq k^{\prime}\leq K-1,\,0\leq r^{\prime}<r}\text{supp}(\nabla F_{i^{\prime}}(x_{i^{\prime},k^{\prime}}^{(r^{\prime})}))\Big) \qquad (3)

The iterates of a distributed zero-respecting algorithm have nonzero components only in coordinates on which the algorithm has gathered some gradient information. As discussed in (Woodworth et al., 2020a), this means that algorithms which are not distributed zero-respecting are just “wild guessing”. Algorithms that are distributed zero-respecting include SGD, ASG, and FedAvg. We assume the following class of algorithms in order to bound the heterogeneity of our construction in the lower bound proof.

Definition 5.2.

We say that an algorithm is distributed distance-conserving if for any $i,k,r$, the $k$-th iterate on the $i$-th client in the $r$-th round, $x^{(r)}_{i,k}$, satisfies $\|x^{(r)}_{i,k}-x^{*}\|^{2}\leq(c/2)[\|x_{\text{init}}-x^{*}\|^{2}+\sum_{i=1}^{N}\|x_{\text{init}}-x_{i}^{*}\|^{2}]$, where $x^{*}_{j}:=\operatorname*{arg\,min}_{x}F_{j}(x)$, $x^{*}:=\operatorname*{arg\,min}_{x}F(x)$, $x_{\text{init}}$ is the initial iterate, and $c$ is some scalar parameter.

Algorithms which do not satisfy Definition 5.2 with $c$ at most logarithmic in the problem parameters (see § 2) are those that move substantially far away from $x^{*}$, even farther than the $x_{i}^{*}$'s are from $x^{*}$. With this definition in mind, we slightly overload the usual definition of heterogeneity for the lower bound:

Definition 5.3.

A distributed optimization problem is $(\zeta,c)$-heterogeneous if $\max_{i\in[N]}\sup_{x\in A}\|\nabla F_{i}(x)-\nabla F(x)\|^{2}\leq\zeta^{2}$, where we define $A:=\{x:\|x-x^{*}\|^{2}\leq(c/2)(\|x_{\text{init}}-x^{*}\|^{2}+\sum_{i=1}^{N}\|x_{\text{init}}-x^{*}_{i}\|^{2})\}$ for some scalar parameter $c$.

Achievable convergence rates in FL that assume Eq. 2 can be readily extended to account for Definition 5.3 as long as the algorithm satisfies Definition 5.2. We show that the chaining algorithms we propose satisfy Definition 5.2 in Thm. D.1, Thm. D.3, and Thm. E.1, with $c$ at most polylogarithmic in the problem parameters defined in § 2. (We do not formally show this for SAGA and SSNM, as those algorithms are functionally the same as SGD and ASG under full participation.) Other FL algorithms also satisfy Definition 5.2, notably FedAvg as analyzed in Woodworth et al. (2020a) with $c$ an absolute constant, which is the currently tightest known analysis of FedAvg in the full-participation setting.

Theorem 5.4.

For any number of rounds $R$, number of local steps per round $K$, and $(\zeta,c)$-heterogeneity (Definition 5.3), there exists a global objective $F$ which is the average of two $\beta$-smooth (Assumption B.4) and $\mu$-strongly convex ($\mu\geq 0$; Assumption B.1) quadratic client objectives $F_{1}$ and $F_{2}$ with an initial sub-optimality gap of $\Delta$, such that the output $\hat{x}$ of any distributed zero-respecting (Definition 5.1) and distance-conserving (Definition 5.2) algorithm satisfies

  • Strongly convex: $F(\hat{x})-F(x^{*})\geq\Omega\big(\min\{\Delta,(1/(c\kappa^{3/2}))(\zeta^{2}/\beta)\}\exp(-R/\sqrt{\kappa})\big)$ when $\mu>0$, and

  • General convex: $F(\hat{x})-F(x^{*})\geq\Omega\big(\min\{\beta D^{2}/R^{2},\zeta D/(c^{1/2}\sqrt{R^{5}})\}\big)$ when $\mu=0$.

Corollary 5.5.

Under the hypotheses of Theorem 5.4, there exists a global objective $F$ which is $\mu$-PL and satisfies $F(\hat{x})-F(x^{*})\geq\Omega\big(\min\{\Delta,(1/(c\kappa^{3/2}))(\zeta^{2}/\beta)\}\exp(-R/\sqrt{\kappa})\big)$.

A proof of the lower bound is in App. G, and the corollary follows immediately from the fact that $\mu$-strong convexity implies the $\mu$-PL condition. This result tightens the lower bound of Woodworth et al. (2020a), which proves a similar bound but over a much larger class of functions whose heterogeneity is bounded in terms of $\zeta_{*}^{2}=(1/N)\sum_{i=1}^{N}\|\nabla F_{i}(x^{*})\|^{2}$. To the best of our knowledge, all achievable rates that can take advantage of heterogeneity require $(\zeta,c)$-heterogeneity (which defines a smaller class than $\zeta_{*}$-heterogeneity), and so they are incomparable with the existing lower bound from (Woodworth et al., 2020a), which requires $\zeta_{*}$-heterogeneity. By introducing the class of distributed distance-conserving algorithms (which includes most of the algorithms we are interested in), Thm. 5.4 allows us, for the first time, to identify the optimality of the achievable rates shown in Tables 1, 2, and 4.

Note that this lower bound allows full participation and should be compared to the achievable rates with $S=N$; comparisons with variance-reduced methods like FedAvg \to SAGA and FedAvg \to SSNM are unnecessary. Our lower bound proves that FedAvg \to ASG is optimal up to condition number factors shrinking exponentially in $\sqrt{\kappa}$ among algorithms satisfying Definition 5.2 and Definition 5.3. Under the PL condition, the situation is similar, except FedAvg \to SGD loses $\sqrt{\kappa}$ in the exponential versus the lower bound. On the other hand, there remains a substantial gap between FedAvg \to ASG and the lower bound in the general convex case. Meaningful progress in closing the gap in the general convex case is an important future direction.

6 Experiments

We evaluate the utility of our framework in strongly convex and nonconvex settings. We compare the communication round complexities of four baselines (FedAvg, SGD, ASG, and SCAFFOLD), the stepsize-decaying variants of these algorithms (prefixed by M- in plots), and various instantiations of FedChain (Algo. 1).

Convex Optimization (Logistic Regression)

We first study federated regularized logistic regression, which is strongly convex (objective function in § I.1). In this experiment, we use the MNIST dataset of handwritten digits (LeCun et al., 2010). We (roughly) control client heterogeneity by creating “clients” through shuffling samples across different digit classes. The details of this shuffling process are described in § I.1; if a dataset is more shuffled (more homogeneous), it roughly corresponds to a lower heterogeneity dataset. All clients participate in each round.

Fig. 2 compares the convergence of chained and non-chained algorithms in the stochastic gradient setting (minibatches are 1% of a client's data) over $R=100$ rounds (tuning details in § I.1). We observe that across all heterogeneity levels, FedChain instantiations (whether two or more stages) outperform the other baselines. We also observe that SCAFFOLD\toSGD outperforms all other curves, including SCAFFOLD\toASG. This can be explained by the fact that acceleration amplifies the effect of noise, which contributes error if $K$ is not taken large enough (we set $K=20$ in the convex experiments). The effect of large $K$ is elaborated upon in § J.1.

Table 3: Test accuracies ($\uparrow$). "Constant" means the stepsize does not change during optimization, while "w/ Decay" means the stepsize decays during optimization. Left: Comparison among algorithms in the EMNIST task. Right: Comparison among algorithms in the CIFAR-100 task. "SCA." abbreviates SCAFFOLD.
Left (EMNIST), Algorithm: Constant / w/ Decay
Baselines
  SGD: 0.7842 / 0.7998
  FedAvg: 0.8314 / 0.8224
  SCAFFOLD: 0.8157 / 0.8174
FedChain
  FedAvg \to SGD: 0.8501 / 0.8355
  SCA. \to SGD: 0.8508 / 0.8392
Right (CIFAR-100), Algorithm: Constant / w/ Decay
Baselines
  SGD: 0.1987 / 0.1968
  FedAvg: 0.4944 / 0.5059
  SCAFFOLD: n/a
FedChain
  FedAvg \to SGD: 0.5134 / n/a
  SCA. \to SGD: 0.5167 / n/a
Nonconvex Optimization (Neural Networks)

We also evaluated nonconvex image classification tasks with convolutional neural networks. In all experiments, we consider client sampling with $S=10$. We started with digit classification over EMNIST (Cohen et al., 2017) (tuning details in § I.2.1), where handwritten characters are partitioned by author. Table 3 (Left) displays test accuracies on the task. Overall, FedChain instantiations perform better than the baselines.

We also considered image classification with ResNet-18 on CIFAR-100 (Krizhevsky, 2009) (tuning details in § I.2.2). Table 3 (Right) displays the test accuracies on CIFAR-100; SCAFFOLD is not included due to memory constraints. FedChain instantiations again perform the best.

Acknowledgments

This work is supported by Google faculty research award, JP Morgan Chase, Siemens, the Sloan Foundation, Intel, NSF grants CNS-2002664, CA-2040675, IIS-1929955, DMS-2134012, CCF-2019844 as a part of NSF Institute for Foundations of Machine Learning (IFML), and CNS-2112471 as a part of NSF AI Institute for Future Edge Networks and Distributed Intelligence (AI-EDGE). Most of this work was done prior to the second author joining Amazon, and it does not relate to his current position there.

References

  • Al-Shedivat et al. (2020) Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, and Afshin Rostamizadeh. Federated learning via posterior averaging: A new perspective and practical algorithms. arXiv preprint arXiv:2010.05273, 2020.
  • Arjevani & Shamir (2015) Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. arXiv preprint arXiv:1506.01900, 2015.
  • Aybat et al. (2019) Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, and Asuman Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022, 2019.
  • Carmon et al. (2020) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020.
  • Charles & Konečnỳ (2020) Zachary Charles and Jakub Konečnỳ. On the outsized importance of learning rates in local update methods. arXiv preprint arXiv:2007.00878, 2020.
  • Charles et al. (2021) Zachary Charles, Zachary Garrett, Zhouyuan Huo, Sergei Shmulyian, and Virginia Smith. On large-cohort training for federated learning. arXiv preprint arXiv:2106.07820, 2021.
  • Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp.  2921–2926. IEEE, 2017.
  • Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014.
  • Deng & Mahdavi (2021) Yuyang Deng and Mehrdad Mahdavi. Local stochastic gradient descent ascent: Convergence analysis and communication efficiency. In International Conference on Artificial Intelligence and Statistics, pp.  1387–1395. PMLR, 2021.
  • Deng et al. (2020) Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
  • Deng et al. (2021) Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Distributionally robust federated averaging. arXiv preprint arXiv:2102.12660, 2021.
  • Fallah et al. (2020) Alireza Fallah, Asuman Ozdaglar, and Sarath Pattathil. An optimal multistage stochastic gradient method for minimax problems. In 2020 59th IEEE Conference on Decision and Control (CDC), pp.  3573–3579. IEEE, 2020.
  • Ghadimi & Lan (2012) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
  • Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
  • Gorbunov et al. (2020) Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local sgd: Unified theory and new efficient methods. arXiv preprint arXiv:2011.02828, 2020.
  • Johnson & Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26:315–323, 2013.
  • Kairouz et al. (2019) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • Karimireddy et al. (2020a) Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning. arXiv preprint arXiv:2008.03606, 2020a.
  • Karimireddy et al. (2020b) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020b.
  • Khaled et al. (2020) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local sgd on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pp.  4519–4529. PMLR, 2020.
  • Koloskova et al. (2020) Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International Conference on Machine Learning, pp. 5381–5393. PMLR, 2020.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • Li et al. (2018) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
  • Li et al. (2019) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smithy. Feddane: A federated newton-type method. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp.  1227–1231. IEEE, 2019.
  • Li et al. (2020) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
  • Lin et al. (2018) Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp.  1273–1282. PMLR, 2017.
  • Mitra et al. (2021) Aritra Mitra, Rayana Jaafar, George J. Pappas, and Hamed Hassani. Linear Convergence in Federated Learning: Tackling Client Heterogeneity and Sparse Gradients. arXiv e-prints, art. arXiv:2102.07053, February 2021.
  • Nesterov (2003) Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • Reddi et al. (2020) Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • Reddi et al. (2016) Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for nonconvex optimization. arXiv preprint arXiv:1603.06159, 2016.
  • Stich (2018) Sebastian U Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
  • Wang & Joshi (2018) Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
  • Wang & Joshi (2019) Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd. In SysML, 2019.
  • Wang et al. (2019a) Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. Matcha: Speeding up decentralized sgd via matching decomposition sampling. In 2019 Sixth Indian Control Conference (ICC), pp.  299–300. IEEE, 2019a.
  • Wang et al. (2019b) Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael Rabbat. Slowmo: Improving communication-efficient distributed sgd with slow momentum. arXiv preprint arXiv:1910.00643, 2019b.
  • Woodworth (2021) Blake Woodworth. The minimax complexity of distributed optimization. arXiv preprint arXiv:2109.00534, 2021.
  • Woodworth et al. (2020a) Blake Woodworth, Kumar Kshitij Patel, and Nathan Srebro. Minibatch vs local sgd for heterogeneous distributed learning. arXiv preprint arXiv:2006.04735, 2020a.
  • Woodworth et al. (2020b) Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pp. 10334–10343. PMLR, 2020b.
  • Woodworth et al. (2021) Blake Woodworth, Brian Bullins, Ohad Shamir, and Nathan Srebro. The min-max complexity of distributed stochastic convex optimization with intermittent communication. arXiv preprint arXiv:2102.01583, 2021.
  • Yuan & Ma (2020) Honglin Yuan and Tengyu Ma. Federated accelerated stochastic gradient descent. arXiv preprint arXiv:2006.08950, 2020.
  • Yuan et al. (2020) Honglin Yuan, Manzil Zaheer, and Sashank Reddi. Federated composite optimization. arXiv preprint arXiv:2011.08474, 2020.
  • Zhou et al. (2019) Kaiwen Zhou, Qinghua Ding, Fanhua Shang, James Cheng, Danli Li, and Zhi-Quan Luo. Direct acceleration of saga using sampled negative momentum. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  1602–1610. PMLR, 2019.

Table 4: Rates under the PL condition. Rate requires $R\geq\frac{N}{S}$.
Method/Analysis: $\mathbb{E}F(\hat{x})-F(x^{*})\leq\tilde{\mathcal{O}}(\cdot)$
Centralized Algorithms
  SGD: $\Delta\exp(-\kappa^{-1}R)+(1-\frac{S}{N})\frac{\kappa\zeta^{2}}{\mu SR}$
Federated Algorithms
  FedAvg (Karimireddy et al., 2020a): $\kappa\Delta\exp(-\kappa^{-1}R)+\frac{\kappa^{2}\zeta^{2}}{\mu R^{2}}$
This paper
  FedAvg \to SGD (Thm. 4.1): $\min\{\Delta,\frac{\zeta^{2}}{\mu}\}\exp(-\kappa^{-1}R)+(1-\frac{S}{N})\frac{\kappa\zeta^{2}}{\mu SR}$
  FedAvg \to SAGA (Thm. 4.3): $\min\{\Delta,\frac{\zeta^{2}}{\mu}\}\exp(-((\frac{N}{S})^{\frac{2}{3}}\kappa)^{-1}R)$
  Algo.-independent LB (Corollary 5.5): $\min\{\Delta,\kappa^{-\frac{3}{2}}(\frac{\zeta^{2}}{\beta})\}\exp(-\kappa^{-\frac{1}{2}}R)$

Appendix A Additional Related Work

Multistage algorithms. Our chaining framework has similarities in analysis to multistage algorithms, which have been employed in the classical convex optimization setting (Aybat et al., 2019; Fallah et al., 2020; Ghadimi & Lan, 2012; Woodworth et al., 2020a). The idea behind multistage algorithms is to manage stepsize decay so as to balance fast progress on “bias” error terms while controlling variance error. However, chaining differs from multistage algorithms in a few ways: (1) we do not necessarily decay the stepsize; (2) in the heterogeneous FL setting, we must accommodate error due to client drift (Reddi et al., 2020; Karimireddy et al., 2020b), which is not analyzed in the prior multistage literature; and (3) our goal is to minimize communication rounds rather than iteration complexity.

Appendix B Definitions

Assumption B.1.

The $F_{i}$'s are $\mu$-strongly convex for $\mu>0$, i.e.,

\langle\nabla F_{i}(x),y-x\rangle\leq-\Big(F_{i}(x)-F_{i}(y)+\frac{\mu}{2}\|x-y\|^{2}\Big) \qquad (4)

for any $i$, $x$, $y$.

Assumption B.2.

The $F_{i}$'s are general convex, i.e.,

\langle\nabla F_{i}(x),y-x\rangle\leq-(F_{i}(x)-F_{i}(y)) \qquad (5)

for any $i$, $x$, $y$.

Assumption B.3.

$F$ is $\mu$-PL for $\mu>0$, i.e.,

2\mu(F(x)-F(x^{*}))\leq\|\nabla F(x)\|^{2} \qquad (6)

for any $x$.

Assumption B.4.

The $F_{i}$'s are $\beta$-smooth, i.e.,

\|\nabla F_{i}(x)-\nabla F_{i}(y)\|\leq\beta\|x-y\| \qquad (7)

for any $i$, $x$, $y$.

Assumption B.5.

The $F_{i}$'s are $\zeta^{2}$-heterogeneous, where

\zeta^{2}:=\sup_{x}\max_{i\in[N]}\|\nabla F(x)-\nabla F_{i}(x)\|^{2} \qquad (8)
Assumption B.6.

Gradient queries within a client are unbiased and have a uniformly bounded variance:

\mathbb{E}_{z_{i}\sim\mathcal{D}_{i}}\|\nabla f(x;z_{i})-\nabla F_{i}(x)\|^{2}\leq\sigma^{2},\qquad\mathbb{E}_{z_{i}\sim\mathcal{D}_{i}}[\nabla f(x;z_{i})]=\nabla F_{i}(x) \qquad (9)
Assumption B.7.

Cost (function value) queries within a client are unbiased and have a uniformly bounded variance:

\mathbb{E}_{z_{i}\sim\mathcal{D}_{i}}(f(x;z_{i})-F_{i}(x))^{2}\leq\sigma_{F}^{2},\qquad\mathbb{E}_{z_{i}\sim\mathcal{D}_{i}}f(x;z_{i})=F_{i}(x) \qquad (10)
Assumption B.8.

The client objectives have a uniformly bounded absolute function-value deviation from the global objective:

\zeta^{2}_{F}=\sup_{x}\max_{i\in[N]}(F(x)-F_{i}(x))^{2} \qquad (11)
Assumption B.9.

The initial suboptimality gap is upper bounded by $\Delta$, where $x_{0}$ is the initial feasible iterate and $x^{*}$ is the global minimizer:

\mathbb{E}F(x_{0})-F(x^{*})\leq\Delta \qquad (12)
Assumption B.10.

The expected initial distance from the optimum is upper bounded by $D$, where $x_{0}$ is the initial feasible iterate and $x^{*}$ is one of the global minimizers:

\mathbb{E}\|x_{0}-x^{*}\|^{2}\leq D^{2} \qquad (13)

Appendix C Omitted Algorithm Definitions

Algorithm 2 SGD
  Input: initial point $x^{(0)}$, stepsize $\eta$, loss $f$
  for $r=0,\dots,R-1$ do
     $\triangleright$ Run a step of SGD
     Sample $S$ clients $\mathcal{S}_{r}\subseteq[1,\dots,N]$
     Send $x^{(r)}$ to all clients in $\mathcal{S}_{r}$
     Receive $\{g_{i}^{(r)}\}_{i\in\mathcal{S}_{r}}=\text{Grad}(x^{(r)},\mathcal{S}_{r},z^{(r)})$ (Algo. 7)
     $g^{(r)}=\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}g_{i}^{(r)}$
     $x^{(r+1)}=x^{(r)}-\eta\,g^{(r)}$
  end for
Algorithm 3 ASG
  Input: initial point $x^{(0)}=x^{(0)}_{\text{ag}}$, $\{\alpha_{r}\}_{1\leq r\leq R}$, $\{\gamma_{r}\}_{1\leq r\leq R}$, loss $f$
  for $r=1,\dots,R$ do
     $\triangleright$ Run a step of ASG
     Sample $S$ clients $\mathcal{S}_{r}\subseteq[1,\dots,N]$
     Send $x^{(r)}$ to all clients in $\mathcal{S}_{r}$
     $x^{(r)}_{\text{md}}=\frac{(1-\alpha_{r})(\mu+\gamma_{r})}{\gamma_{r}+(1-\alpha_{r}^{2})\mu}x^{(r-1)}_{\text{ag}}+\frac{\alpha_{r}[(1-\alpha_{r})\mu+\gamma_{r}]}{\gamma_{r}+(1-\alpha_{r}^{2})\mu}x^{(r-1)}$
     Receive $\{g_{i}^{(r)}\}_{i\in\mathcal{S}_{r}}=\text{Grad}(x^{(r)}_{\text{md}},\mathcal{S}_{r},z^{(r)})$ (Algo. 7)
     $g^{(r)}=\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}g_{i}^{(r)}$
     $x^{(r)}=\operatorname*{arg\,min}_{x}\big\{\alpha_{r}\big[\langle g^{(r)},x\rangle+\frac{\mu}{2}\|x^{(r)}_{\text{md}}-x\|^{2}\big]+\big[(1-\alpha_{r})\frac{\mu}{2}+\frac{\gamma_{r}}{2}\big]\|x^{(r-1)}-x\|^{2}\big\}$
     $x^{(r)}_{\text{ag}}=\alpha_{r}x^{(r)}+(1-\alpha_{r})x^{(r-1)}_{\text{ag}}$
  end for
Algorithm 4 FedAvg
  Input: Stepsize η\eta, initial point x(0){x}^{(0)}
  for r=0,,R1r=0,\dots,R-1 do
     Sample SS clients 𝒮r[1,,N]\mathcal{S}_{r}\subseteq[1,\dots,N]
     Send x(r)x^{(r)} to all clients in 𝒮r\mathcal{S}_{r}, all clients set xi,0(r)=x(r)x_{i,0}^{(r)}=x^{(r)}
     for client i𝒮ri\in\mathcal{S}_{r} in parallel do
        for k=0k=0, …, K1\sqrt{K}-1 do
           Client samples gi,k(r)=1Kk=0K1f(xi,k(r);zi,kK+k(r))g_{i,k}^{(r)}=\frac{1}{\sqrt{K}}\sum_{k^{\prime}=0}^{\sqrt{K}-1}\nabla f(x_{i,k}^{(r)};z_{i,k\sqrt{K}+k^{\prime}}^{(r)}) where zi,kK+k(r)𝒟iz_{i,k\sqrt{K}+k^{\prime}}^{(r)}\sim\mathcal{D}_{i}
           xi,k+1(r)=xi,k(r)ηgi,k(r)x_{i,k+1}^{(r)}=x_{i,k}^{(r)}-\eta\cdot g_{i,k}^{(r)}
        end for
        Receive gi(r)=k=0K1gi,k(r)g_{i}^{(r)}=\sum_{k=0}^{\sqrt{K}-1}g_{i,k}^{(r)} from client
     end for
     g(r)=1Si𝒮rgi(r)g^{(r)}=\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}g_{i}^{(r)}
     x(r+1)x(r)ηg(r)x^{(r+1)}\leftarrow x^{(r)}-\eta\cdot g^{(r)}
  end for
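
The following Python sketch mirrors Algo. 4: each sampled client runs sqrt(K) local steps, each on a minibatch of sqrt(K) per-sample gradients, and returns the sum of its local minibatch gradients, which the server averages and applies with the same stepsize. Here grad_oracle is an assumed unbiased per-sample oracle and all names are illustrative.

import numpy as np

def run_fedavg(x0, grad_oracle, N, S, K, R, eta, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    k_local = int(round(np.sqrt(K)))  # number of local steps and per-step minibatch size
    for _ in range(R):
        clients = rng.choice(N, size=S, replace=False)
        deltas = []
        for i in clients:
            xi, gi_sum = x.copy(), np.zeros_like(x)
            for _ in range(k_local):
                # minibatch of sqrt(K) per-sample gradients at the current local iterate
                g = np.mean([grad_oracle(i, xi, rng) for _ in range(k_local)], axis=0)
                xi = xi - eta * g
                gi_sum += g
            deltas.append(gi_sum)              # client returns the sum of its local gradients
        x = x - eta * np.mean(deltas, axis=0)  # server step, as in Algo. 4
    return x

if __name__ == "__main__":
    N, d = 10, 5
    B = np.random.default_rng(1).normal(size=(N, d))
    oracle = lambda i, x, rng: (x - B[i]) + 0.1 * rng.normal(size=d)
    print(np.linalg.norm(run_fedavg(np.zeros(d), oracle, N=N, S=4, K=16, R=100, eta=0.05) - B.mean(axis=0)))
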
Algorithm 5 SAGA
  Input: stepsize η\eta, loss ff, initial point x(0)x^{(0)}
  Send ϕi(0)=x(0)\phi_{i}^{(0)}=x^{(0)} to all clients ii
  Receive ci(0)=1Kk=0K1f(ϕi(0);zi,k(1))c_{i}^{(0)}=\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(\phi_{i}^{(0)};z_{i,k}^{(-1)}), zi,k(1)𝒟iz_{i,k}^{(-1)}\sim\mathcal{D}_{i} from all clients, c(0)=1Ni=1Nci(0)c^{(0)}=\frac{1}{N}\sum_{i=1}^{N}c_{i}^{(0)}
  for r=0,,R1r=0,\dots,R-1 do
     Sample SS clients 𝒮r[1,,N]\mathcal{S}_{r}\subseteq[1,\dots,N]
     Send x(r)x^{(r)} to all clients in 𝒮r\mathcal{S}_{r}, Receive {gi(r)}i𝒮r=Grad(x(r),𝒮r,z(r))\{g_{i}^{(r)}\}_{i\in\mathcal{S}_{r}}=\text{Grad}(x^{(r)},\mathcal{S}_{r},z^{(r)}) (Algo. 7)
     g(r)=1Si𝒮rgi(r)1Si𝒮rci(r)+c(r)g^{(r)}=\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}g_{i}^{(r)}-\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}c_{i}^{(r)}+c^{(r)}
     x(r+1)=x(r)ηg(r)x^{(r+1)}=x^{(r)}-\eta\cdot g^{(r)}
     Option I: Set ci(r+1)=gi(r)c_{i}^{(r+1)}=g_{i}^{(r)}, ϕi(r+1)=x(r)\phi_{i}^{(r+1)}=x^{(r)} for i𝒮ri\in\mathcal{S}_{r}
     Option II: New independent sample of clients 𝒮r\mathcal{S}_{r}^{\prime}, send x(r)x^{(r)} to all clients in 𝒮r\mathcal{S}_{r}^{\prime}, receive {g~i(r)}i𝒮r=Grad(x(r),𝒮r,z~(r))\{\tilde{g}_{i}^{(r)}\}_{i\in\mathcal{S}_{r}^{\prime}}=\text{Grad}(x^{(r)},\mathcal{S}_{r}^{\prime},\tilde{z}^{(r)}), set ci(r+1)=g~i(r)c_{i}^{(r+1)}=\tilde{g}_{i}^{(r)}, ϕi(r+1)=x(r)\phi_{i}^{(r+1)}=x^{(r)} for i𝒮ri\in\mathcal{S}_{r}^{\prime}
     
     ci(r+1)=ci(r)c_{i}^{(r+1)}=c_{i}^{(r)} and ϕi(r+1)=ϕi(r)\phi_{i}^{(r+1)}=\phi_{i}^{(r)} for any ii not yet updated
     c(r+1)=1Ni=1Nci(r+1)c^{(r+1)}=\frac{1}{N}\sum_{i=1}^{N}c_{i}^{(r+1)}
  end for
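A minimal Python sketch of Algo. 5 with Option I is given below; c[i] stores client i's control variate, warm-started at x^(0) as in the listing, and grad_oracle is an assumed unbiased per-sample oracle. All names are illustrative.

import numpy as np

def run_saga(x0, grad_oracle, N, S, K, R, eta, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    minibatch = lambda i, y: np.mean([grad_oracle(i, y, rng) for _ in range(K)], axis=0)
    c = np.stack([minibatch(i, x) for i in range(N)])  # warm-start control variates at x^(0)
    c_bar = c.mean(axis=0)
    for _ in range(R):
        clients = rng.choice(N, size=S, replace=False)
        g_new = {i: minibatch(i, x) for i in clients}
        # variance-reduced update: fresh gradients minus stale control variates plus their mean
        g = np.mean([g_new[i] - c[i] for i in clients], axis=0) + c_bar
        x = x - eta * g
        for i in clients:   # Option I: reuse the same gradients as the new control variates
            c[i] = g_new[i]
        c_bar = c.mean(axis=0)
    return x

if __name__ == "__main__":
    N, d = 10, 5
    B = np.random.default_rng(1).normal(size=(N, d))
    oracle = lambda i, x, rng: (x - B[i]) + 0.1 * rng.normal(size=d)
    print(np.linalg.norm(run_saga(np.zeros(d), oracle, N, S=4, K=8, R=200, eta=0.1) - B.mean(axis=0)))
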

Algorithm 6 SSNM
  Input: stepsize η\eta, momentum τ\tau, client losses are Fi(x)=𝔼zi𝒟i[f(x;zi)]+h(x)=F~i(x)+h(x)F_{i}(x)=\mathbb{E}_{z_{i}\sim\mathcal{D}_{i}}[f(x;z_{i})]+h(x)=\tilde{F}_{i}(x)+h(x) where F~i\tilde{F}_{i} can be general convex and h(x)h(x) is μ\mu-strongly convex, initial point x(0)x^{(0)}
  Send ϕi(0)=x(0)\phi_{i}^{(0)}=x^{(0)} to all clients ii
  Receive ci(0)=1Kk=0K1f(ϕi(0);zi,k(1))c_{i}^{(0)}=\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(\phi_{i}^{(0)};z_{i,k}^{(-1)}), zi,k(1)𝒟iz_{i,k}^{(-1)}\sim\mathcal{D}_{i} from all clients c(0)=1Ni=1Nci(0)c^{(0)}=\frac{1}{N}\sum_{i=1}^{N}c_{i}^{(0)}
  for r=0,,R1r=0,\dots,R-1 do
     Sample SS clients 𝒮r[1,,N]\mathcal{S}_{r}\subseteq[1,\dots,N]
     Send x(r)x^{(r)} to all clients in 𝒮r\mathcal{S}_{r}, yir(r)=τx(r)+(1τ)ϕir(r)y_{i_{r}}^{(r)}=\tau x^{(r)}+(1-\tau)\phi^{(r)}_{i_{r}} for ir𝒮ri_{r}\in\mathcal{S}_{r}
     Receive gir(r)=1Kk=0K1f(yir(r);zir,k(r))g_{i_{r}}^{(r)}=\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(y_{i_{r}}^{(r)};z_{i_{r},k}^{(r)}) for ir𝒮ri_{r}\in\mathcal{S}_{r}, zir,k(r)𝒟iz_{i_{r},k}^{(r)}\sim\mathcal{D}_{i}
     g(r)=1Sir𝒮rgir(r)1Sir𝒮rcir(r)+c(r)g^{(r)}=\frac{1}{S}\sum_{i_{r}\in\mathcal{S}_{r}}g_{i_{r}}^{(r)}-\frac{1}{S}\sum_{i_{r}\in\mathcal{S}_{r}}c_{i_{r}}^{(r)}+c^{(r)}
     x(r+1)=argminx{h(x)+g(r),x+12ηx(r)x2}x^{(r+1)}=\operatorname*{arg\,min}_{x}\{h(x)+\langle g^{(r)},x\rangle+\frac{1}{2\eta}\|x^{(r)}-x\|^{2}\}
     Sample new independent sample of SS clients 𝒮r[1,,N]\mathcal{S}_{r}^{\prime}\subseteq[1,\dots,N]
     Send x(r)x^{(r)} to all clients in 𝒮r\mathcal{S}_{r}^{\prime}, ϕIr(r+1)=τx(r+1)+(1τ)ϕIr(r)\phi_{I_{r}}^{(r+1)}=\tau x^{(r+1)}+(1-\tau)\phi_{I_{r}}^{(r)} for Ir𝒮rI_{r}\in\mathcal{S}_{r}^{\prime}
     Receive gIr(r)=1Kk=0K1f(ϕIr(r+1);z~Ir,k(r))g_{I_{r}}^{(r)}=\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(\phi_{I_{r}}^{(r+1)};\tilde{z}_{I_{r},k}^{(r)}) for Ir𝒮rI_{r}\in\mathcal{S}_{r}^{\prime}, z~Ir,k(r)𝒟i\tilde{z}_{I_{r},k}^{(r)}\sim\mathcal{D}_{i}
     c_{I_{r}}^{(r+1)}=g_{I_{r}}^{(r)} for I_{r}\in\mathcal{S}_{r}^{\prime}
     ci(r+1)=ci(r)c_{i}^{(r+1)}=c_{i}^{(r)} and ϕi(r+1)=ϕi(r)\phi_{i}^{(r+1)}=\phi_{i}^{(r)} for any ii not yet updated
     c(r+1)=1Ni=1Nci(r+1)c^{(r+1)}=\frac{1}{N}\sum_{i=1}^{N}c_{i}^{(r+1)}
  end for
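A rough Python sketch of Algo. 6 is given below under the simplifying assumption h(x) = (mu/2)||x||^2, for which the proximal step has the closed form x <- (x - eta*g)/(1 + eta*mu). Here grad_oracle is an assumed unbiased oracle for the gradient of \tilde{F}_i; all names and toy parameter choices are illustrative.

import numpy as np

def run_ssnm(x0, grad_oracle, N, S, K, R, eta, tau, mu, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    minibatch = lambda i, y: np.mean([grad_oracle(i, y, rng) for _ in range(K)], axis=0)
    phi = np.tile(x, (N, 1))                     # snapshot points phi_i^(0) = x^(0)
    c = np.stack([minibatch(i, phi[i]) for i in range(N)])
    c_bar = c.mean(axis=0)
    for _ in range(R):
        clients = rng.choice(N, size=S, replace=False)
        y = {i: tau * x + (1 - tau) * phi[i] for i in clients}  # negative-momentum coupling
        g = np.mean([minibatch(i, y[i]) - c[i] for i in clients], axis=0) + c_bar
        x = (x - eta * g) / (1 + eta * mu)       # prox step for h(x) = (mu/2)||x||^2
        fresh = rng.choice(N, size=S, replace=False)
        for i in fresh:                          # update snapshots and control variates
            phi[i] = tau * x + (1 - tau) * phi[i]
            c[i] = minibatch(i, phi[i])
        c_bar = c.mean(axis=0)
    return x

if __name__ == "__main__":
    N, d, mu, S = 10, 5, 0.5, 4
    B = np.random.default_rng(1).normal(size=(N, d))
    oracle = lambda i, x, rng: (x - B[i]) + 0.1 * rng.normal(size=d)   # grad of tilde{F}_i
    eta = 1.0 / (2 * mu * (N / S))               # stepsize from Theorem D.5 when (N/S)/kappa > 3/4
    tau = (N / S) * eta * mu / (1 + eta * mu)
    x_hat = run_ssnm(np.zeros(d), oracle, N, S=S, K=8, R=100, eta=eta, tau=tau, mu=mu)
    print(np.linalg.norm(x_hat - B.mean(axis=0) / (1 + mu)))  # minimizer of the composite toy objective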
Algorithm 7 Grad(x,𝒮,z)\text{Grad}(x,\mathcal{S},z)
  for client i𝒮i\in\mathcal{S} in parallel do
     Client samples gi,k(r)=f(x(r);zi,k(r))g_{i,k}^{(r)}=\nabla f(x^{(r)};z_{i,k}^{(r)}) where zi,k(r)𝒟iz_{i,k}^{(r)}\sim\mathcal{D}_{i}, k{0,,K1}k\in\{0,\dots,K-1\}
     gi(r)=1Kk=0K1gi,k(r)g_{i}^{(r)}=\frac{1}{K}\sum_{k=0}^{K-1}g_{i,k}^{(r)}
  end for
  return  \{g_{i}^{(r)}\}_{i\in\mathcal{S}}
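
As a standalone helper, Grad can be sketched in a few lines of Python; grad_oracle is an assumed unbiased per-sample oracle and the names are illustrative.

import numpy as np

def grad_query(x, clients, grad_oracle, K, rng):
    # Each client in the sampled set averages K per-sample stochastic gradients at x.
    return {i: np.mean([grad_oracle(i, x, rng) for _ in range(K)], axis=0) for i in clients}

# Example: grads = grad_query(x, sampled_clients, oracle, K=8, rng=np.random.default_rng(0))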

Appendix D Proofs for global update methods

D.1 SGD

Theorem D.1.

Suppose that client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 2 gives the following for returned iterate x^\hat{x}:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. If we return x^=1WRr=0Rwrx(r)\hat{x}=\frac{1}{W_{R}}\sum_{r=0}^{R}w_{r}x^{(r)} with wr=(1ημ)1(r+1)w_{r}=(1-\eta\mu)^{1-(r+1)} and WR=r=0RwrW_{R}=\sum_{r=0}^{R}w_{r}, η=𝒪~(min{1β,1μR})\eta=\tilde{\mathcal{O}}(\min\{\frac{1}{\beta},\frac{1}{\mu R}\}), and R>κR>\kappa,

    𝔼F(x^)F(x)𝒪~(Δexp(Rκ)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\tilde{\mathcal{O}}(\Delta\exp(-\frac{R}{\kappa})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR})
    𝔼x(r)x2𝒪~(D2) for any 0rR\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\tilde{\mathcal{O}}(D^{2})\textrm{ for any }0\leq r\leq R
  • General convex (grad norm): F_{i}'s satisfy Assumption B.2. If we return the average iterate \hat{x}=\frac{1}{R}\sum_{r=1}^{R}x^{(r)} and set \eta=\min\{\frac{1}{\beta},(\frac{\Delta}{\beta cR})^{1/2}\}, where c=\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S} is the update variance from Lemma D.2, then

    𝔼F(x^)2𝒪~(βΔR+βσDSKR+1SNβζDSR)\displaystyle\mathbb{E}\|\nabla F(\hat{x})\|^{2}\leq\tilde{\mathcal{O}}(\frac{\beta\Delta}{R}+\frac{\beta\sigma D}{\sqrt{SKR}}+\sqrt{1-\frac{S}{N}}\frac{\beta\zeta D}{\sqrt{SR}})
    𝔼x(r)x2𝒪(D2) for any 0rR\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\mathcal{O}(D^{2})\text{ for any }0\leq r\leq R
  • PL condition: FiF_{i}’s satisfy Assumption B.3. If we return the final iterate x^=x(R)\hat{x}=x^{(R)}, set η=𝒪~(min{1β,1μR})\eta=\tilde{\mathcal{O}}(\min\{\frac{1}{\beta},\frac{1}{\mu R}\}), then

    𝔼F(x(R))F(x)𝒪~(Δexp(Rκ)+κσ2μSKR+(1SN)κζ2μSR)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\tilde{\mathcal{O}}(\Delta\exp(-\frac{R}{\kappa})+\frac{\kappa\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\kappa\zeta^{2}}{\mu SR})
Lemma D.2 (Update variance).

For SGD, we have that

𝔼rg(r)2F(x(r))2+σ2SK+(1S1N1)ζ2S\displaystyle\mathbb{E}_{r}\|g^{(r)}\|^{2}\leq\|\nabla F(x^{(r)})\|^{2}+\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S} (14)
Proof.

Observing that

𝔼rg(r)2=𝔼rg(r)F(x(r))2+F(x(r))2\displaystyle\mathbb{E}_{r}\|g^{(r)}\|^{2}=\mathbb{E}_{r}\|g^{(r)}-\nabla F(x^{(r)})\|^{2}+\|\nabla F(x^{(r)})\|^{2} (15)

and using Lemma H.1, we have

𝔼rg(r)F(x(r))2σ2SK+(1S1N1)ζ2S\displaystyle\mathbb{E}_{r}\|g^{(r)}-\nabla F(x^{(r)})\|^{2}\leq\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S} (16)

D.1.1 Convergence of SGD for Strongly Convex functions

Proof.

Using the update of SGD,

x(r+1)x2x(r)x22ηg(r),x(r)x+η2g(r)2\displaystyle\|x^{(r+1)}-x^{*}\|^{2}\leq\|x^{(r)}-x^{*}\|^{2}-2\eta\langle g^{(r)},x^{(r)}-x^{*}\rangle+\eta^{2}\|g^{(r)}\|^{2} (17)

Taking expectation up to the r-th round

𝔼rx(r+1)x2x(r)x22ηF(x(r)),x(r)x+η2𝔼rg(r)2\displaystyle\mathbb{E}_{r}\|x^{(r+1)}-x^{*}\|^{2}\leq\|x^{(r)}-x^{*}\|^{2}-2\eta\langle\nabla F(x^{(r)}),x^{(r)}-x^{*}\rangle+\eta^{2}\mathbb{E}_{r}\|g^{(r)}\|^{2} (18)

By Assumption B.1,

2ηF(x(r)),x(r)x2η(F(x(r))F(x))ημx(r)x2\displaystyle-2\eta\langle\nabla F(x^{(r)}),x^{(r)}-x^{*}\rangle\leq-2\eta(F(x^{(r)})-F(x^{*}))-\eta\mu\|x^{(r)}-x^{*}\|^{2} (19)

Using Lemma D.2, Assumption B.4, and setting η12β\eta\leq\frac{1}{2\beta},

𝔼rx(r+1)x2\displaystyle\mathbb{E}_{r}\|x^{(r+1)}-x^{*}\|^{2} (20)
(1ημ)x(r)x2η(F(x(r))F(x))+η2(σ2SK+(1S1N1)ζ2S)\displaystyle\leq(1-\eta\mu)\|x^{(r)}-x^{*}\|^{2}-\eta(F(x^{(r)})-F(x^{*}))+\eta^{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (21)

Rearranging and taking full expectation,

𝔼F(x(r))F(x)(1ημ)𝔼x(r)x2𝔼x(r+1)x2η+η(σ2SK+(1S1N1)ζ2S)\displaystyle\mathbb{E}F(x^{(r)})-F(x^{*})\leq\frac{(1-\eta\mu)\mathbb{E}\|x^{(r)}-x^{*}\|^{2}-\mathbb{E}\|x^{(r+1)}-x^{*}\|^{2}}{\eta}+\eta(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (22)

Letting w_{r}=(1-\mu\eta)^{1-(r+1)}, W_{R}=\sum_{r=0}^{R}w_{r}, and \hat{x}=\frac{1}{W_{R}}\sum_{r=0}^{R}w_{r}x^{(r)}, we have

𝔼F(x^)F(x)𝔼x(0)x2ηWR+η(σ2SK+(1S1N1)ζ2S)\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\frac{\mathbb{E}\|x^{(0)}-x^{*}\|^{2}}{\eta W_{R}}+\eta(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (23)

From (Karimireddy et al., 2020b) Lemma 1, if R>\kappa and \eta=\min\{\frac{1}{2\beta},\frac{\log(\max(1,\mu^{2}RD^{2}/c))}{\mu R}\} where c=\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S},

𝔼F(x^)F(x)𝒪~(Δexp(Rκ)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\tilde{\mathcal{O}}(\Delta\exp(-\frac{R}{\kappa})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (24)

This establishes the convergence rate. Next we consider the distance bound, returning to the recurrence

𝔼rx(r+1)x2\displaystyle\mathbb{E}_{r}\|x^{(r+1)}-x^{*}\|^{2} (25)
(1ημ)x(r)x2η(F(x(r))F(x))+η2(σ2SK+(1S1N1)ζ2S)\displaystyle\leq(1-\eta\mu)\|x^{(r)}-x^{*}\|^{2}-\eta(F(x^{(r)})-F(x^{*}))+\eta^{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (26)

Taking full expectation and unrolling,

𝔼x(r)x2\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2} (27)
𝔼x(0)x2exp(ημr)+ημ(σ2SK+(1S1N1)ζ2S)\displaystyle\leq\mathbb{E}\|x^{(0)}-x^{*}\|^{2}\exp(-\eta\mu r)+\frac{\eta}{\mu}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (28)

Note that Karimireddy et al. (2020b) chose \eta so that \eta(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S})\leq\tilde{\mathcal{O}}(\mu D^{2}\exp(-\mu\eta R)), and so

𝔼x(r)x2𝒪~(D2)\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\tilde{\mathcal{O}}(D^{2}) (29)

D.1.2 Convergence of SGD for General Convex functions

Proof.

By smoothness (Assumption B.4), we have that

F(x(r+1))F(x(r))ηF(x(r)),g(r)+βη22g(r)2\displaystyle F(x^{(r+1)})-F(x^{(r)})\leq-\eta\langle\nabla F(x^{(r)}),g^{(r)}\rangle+\frac{\beta\eta^{2}}{2}\|g^{(r)}\|^{2} (30)

Taking expectation conditioned up to rr and using Lemma D.2,

𝔼rF(x(r+1))F(x(r))(ηβη22)F(x(r))2+βη22(σ2SK+(1S1N1)ζ2S)\displaystyle\mathbb{E}_{r}F(x^{(r+1)})-F(x^{(r)})\leq-(\eta-\frac{\beta\eta^{2}}{2})\|\nabla F(x^{(r)})\|^{2}+\frac{\beta\eta^{2}}{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (31)

If η1β\eta\leq\frac{1}{\beta},

𝔼rF(x(r+1))F(x(r))η2F(x(r))2+βη22(σ2SK+(1S1N1)ζ2S)\displaystyle\mathbb{E}_{r}F(x^{(r+1)})-F(x^{(r)})\leq-\frac{\eta}{2}\|\nabla F(x^{(r)})\|^{2}+\frac{\beta\eta^{2}}{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (32)

Taking full expectation and rearranging,

12𝔼F(x(r))2𝔼F(x(r+1))𝔼F(x(r))η+βη2(σ2SK+(1S1N1)ζ2S)\displaystyle\frac{1}{2}\mathbb{E}\|\nabla F(x^{(r)})\|^{2}\leq\frac{\mathbb{E}F(x^{(r+1)})-\mathbb{E}F(x^{(r)})}{\eta}+\frac{\beta\eta}{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (33)

Summing both sides over rr and averaging,

121Rr=0R1𝔼F(x(r))2𝔼F(x(0))𝔼F(x(R))ηR+βη2(σ2SK+(1S1N1)ζ2S)\displaystyle\frac{1}{2}\frac{1}{R}\sum_{r=0}^{R-1}\mathbb{E}\|\nabla F(x^{(r)})\|^{2}\leq\frac{\mathbb{E}F(x^{(0)})-\mathbb{E}F(x^{(R)})}{\eta R}+\frac{\beta\eta}{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (34)

Letting x^\hat{x} be an average of all the x(r)x^{(r)}’s, noting that 𝔼F(x(0))𝔼F(x(R))Δ\mathbb{E}F(x^{(0)})-\mathbb{E}F(x^{(R)})\leq\Delta, ΔβD2\Delta\leq\beta D^{2}, and choosing η=min{1β,(ΔβcR)1/2}\eta=\min\{\frac{1}{\beta},(\frac{\Delta}{\beta cR})^{1/2}\} where c=σ2SK+(1S1N1)ζ2Sc=\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S},

𝔼F(x^)2𝒪~(βΔR+βσDSKR+1SNβζDSR)\displaystyle\mathbb{E}\|\nabla F(\hat{x})\|^{2}\leq\tilde{\mathcal{O}}(\frac{\beta\Delta}{R}+\frac{\beta\sigma D}{\sqrt{SKR}}+\sqrt{1-\frac{S}{N}}\frac{\beta\zeta D}{\sqrt{SR}}) (35)

So we have our convergence rate. Next, for the distance bound,

x(r+1)x2=x(r)x22ηg(r),x(r)x+η2g(r)2\displaystyle\|x^{(r+1)}-x^{*}\|^{2}=\|x^{(r)}-x^{*}\|^{2}-2\eta\langle g^{(r)},x^{(r)}-x^{*}\rangle+\eta^{2}\|g^{(r)}\|^{2} (36)

Taking expectation up to rr, we know from Lemma D.2,

\displaystyle\mathbb{E}_{r}\|x^{(r+1)}-x^{*}\|^{2} (37)
\displaystyle\leq\|x^{(r)}-x^{*}\|^{2}-2\eta\langle\nabla F(x^{(r)}),x^{(r)}-x^{*}\rangle+\eta^{2}\|\nabla F(x^{(r)})\|^{2}+\eta^{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (38)

By Assumption B.2 and Assumption B.4,

𝔼rx(r+1)x2\displaystyle\mathbb{E}_{r}\|x^{(r+1)}-x^{*}\|^{2} (39)
x(r)x22η(1βη)(F(x(r))F(x))+η2(σ2SK+(1SN)ζ2S)\displaystyle\leq\|x^{(r)}-x^{*}\|^{2}-2\eta(1-\beta\eta)(F(x^{(r)})-F(x^{*}))+\eta^{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (40)

And with η1β\eta\leq\frac{1}{\beta},

𝔼rx(r+1)x2x(r)x2+η2(σ2SK+(1SN)ζ2S)\displaystyle\mathbb{E}_{r}\|x^{(r+1)}-x^{*}\|^{2}\leq\|x^{(r)}-x^{*}\|^{2}+\eta^{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (41)

Unrolling and taking full expectation,

𝔼x(r)x2𝔼x(0)x2+η2R(σ2SK+(1SN)ζ2S)\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\mathbb{E}\|x^{(0)}-x^{*}\|^{2}+\eta^{2}R(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (42)

We chose η\eta so that βη(σ2SK+(1SN)ζ2S)ΔηR\beta\eta(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S})\leq\frac{\Delta}{\eta R}. Therefore

𝔼x(r)x2𝔼x(0)x2+Δβ2D2\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\mathbb{E}\|x^{(0)}-x^{*}\|^{2}+\frac{\Delta}{\beta}\leq 2D^{2} (43)

D.1.3 Convergence of SGD under the PL-condition

Proof.

Using Assumption B.4 we have that

F(x(r+1))F(x(r))ηF(x(r)),g(r)+βη22g(r)2\displaystyle F(x^{(r+1)})-F(x^{(r)})\leq-\eta\langle\nabla F(x^{(r)}),g^{(r)}\rangle+\frac{\beta\eta^{2}}{2}\|g^{(r)}\|^{2} (44)

Taking expectations of both sides conditioned on the rr-th step,

\displaystyle\mathbb{E}_{r}F(x^{(r+1)})-F(x^{(r)})\leq-\eta\|\nabla F(x^{(r)})\|^{2}+\frac{\beta\eta^{2}}{2}\mathbb{E}_{r}\|g^{(r)}\|^{2} (45)

Using Lemma D.2 and \eta\leq\frac{1}{\beta},

\displaystyle\mathbb{E}_{r}F(x^{(r+1)})-F(x^{(r)})\leq-\frac{\eta}{2}\|\nabla F(x^{(r)})\|^{2}+\frac{\beta\eta^{2}}{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (46)

Using Assumption B.3, we have that

𝔼rF(x(r+1))F(x(r))ημ(F(x(r))F(x))+βη22(σ2SK+(1S1N1)ζ2S)\displaystyle\mathbb{E}_{r}F(x^{(r+1)})-F(x^{(r)})\leq-\eta\mu(F(x^{(r)})-F(x^{*}))+\frac{\beta\eta^{2}}{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (47)

Rearranging,

𝔼rF(x(r+1))F(x)(1ημ)(F(x(r))F(x))+βη22(σ2SK+(1S1N1)ζ2S)\displaystyle\mathbb{E}_{r}F(x^{(r+1)})-F(x^{*})\leq(1-\eta\mu)(F(x^{(r)})-F(x^{*}))+\frac{\beta\eta^{2}}{2}(\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}) (48)

Letting c=σ2SK+(1S1N1)ζ2Sc=\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S} and subtracting βcη2μ\frac{\beta c\eta}{2\mu} from both sides,

𝔼rF(x(r+1))F(x)βcη2μ(1ημ)(F(x(r))F(x)βcη2μ)\displaystyle\mathbb{E}_{r}F(x^{(r+1)})-F(x^{*})-\frac{\beta c\eta}{2\mu}\leq(1-\eta\mu)(F(x^{(r)})-F(x^{*})-\frac{\beta c\eta}{2\mu}) (49)

Unrolling the recursion gives

𝔼F(x(R))F(x)Δexp(ημR)+βcη2μ\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\Delta\exp(-\eta\mu R)+\frac{\beta c\eta}{2\mu} (50)

Now, if we choose stepsize

η=min{1β,log(max{e,μ2ΔRβc})μR}\displaystyle\eta=\min\{\frac{1}{\beta},\frac{\log(\max\{e,\frac{\mu^{2}\Delta R}{\beta c}\})}{\mu R}\} (51)

Then the final convergence rate is

𝔼F(x(R))F(x)𝒪~(Δexp(Rκ)+κσ2μSKR+(1SN)κζ2μSR)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\tilde{\mathcal{O}}(\Delta\exp(-\frac{R}{\kappa})+\frac{\kappa\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\kappa\zeta^{2}}{\mu SR}) (52)

D.2 ASG

The precise form of accelerated stochastic gradient we run is AC-SA (Ghadimi & Lan, 2012; 2013). The following specification is taken from Woodworth et al. (2020a):

We run Algo. 3 for R_{s} iterations using x^{(0)}=p_{s-1}, \{\alpha_{r}\}_{r\geq 1} and \{\gamma_{r}\}_{r\geq 1}, with the definitions

  • Rs=max{44βμ,128c3μΔ2(s+1)}R_{s}=\lceil\max\{4\sqrt{\frac{4\beta}{\mu}},\frac{128c}{3\mu\Delta 2^{-(s+1)}}\}\rceil

  • αr=2r+1\alpha_{r}=\frac{2}{r+1}, γr=4ϕsr(r+1)\gamma_{r}=\frac{4\phi_{s}}{r(r+1)},

  • ϕs=max{2β,[μc3Δ2(s1)Rs(Rs+1)(Rs+2)]1/2}\phi_{s}=\max\{2\beta,[\frac{\mu c}{3\Delta 2^{-(s-1)}R_{s}(R_{s}+1)(R_{s}+2)}]^{1/2}\}

Here c=\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S}. At the end of stage s, set p_{s}=x^{(R_{s})}_{\text{ag}}, the final aggregate iterate produced by that stage.
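
For reference, the stage schedule above can be computed as in the following small sketch; beta, mu, Delta are the problem constants, c is the update variance \sigma^2/(SK)+(1-(S-1)/(N-1))\zeta^2/S, and the function name acsa_stage_schedule is illustrative, not from the paper.

import math

def acsa_stage_schedule(s, beta, mu, Delta, c):
    # R_s, phi_s, alpha_r, gamma_r as specified above
    R_s = math.ceil(max(4 * math.sqrt(4 * beta / mu),
                        128 * c / (3 * mu * Delta * 2 ** (-(s + 1)))))
    phi_s = max(2 * beta,
                math.sqrt(mu * c / (3 * Delta * 2 ** (-(s - 1)) * R_s * (R_s + 1) * (R_s + 2))))
    alphas = [2 / (r + 1) for r in range(1, R_s + 1)]
    gammas = [4 * phi_s / (r * (r + 1)) for r in range(1, R_s + 1)]
    return R_s, alphas, gammas

# Example: R_1, alphas, gammas = acsa_stage_schedule(1, beta=1.0, mu=0.1, Delta=1.0, c=0.01)
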

Theorem D.3.

Suppose that client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 3 gives the following for returned iterate x^\hat{x}:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. If we return x^=xag(Rs)\hat{x}=x_{\text{ag}}^{(R_{s})} after RR rounds of the multistage AC-SA specified above,

    𝔼F(x^)F(x)𝒪(Δexp(Rκ)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\mathcal{O}(\Delta\exp(-\frac{R}{\sqrt{\kappa}})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR})
    𝔼x^x2𝒪(κ𝔼x(0)x2exp(Rκ)+σ2μ2SKR+(1SN)ζ2μ2SR)\displaystyle\mathbb{E}\|\hat{x}-x^{*}\|^{2}\leq\mathcal{O}(\kappa\mathbb{E}\|x^{(0)}-x^{*}\|^{2}\exp(-\frac{R}{\sqrt{\kappa}})+\frac{\sigma^{2}}{\mu^{2}SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu^{2}SR})

And in the absence of gradient noise and client sampling, for any 0\leq r\leq R

    x(r)x2x(0)x2\displaystyle\|x^{(r)}-x^{*}\|^{2}\leq\|x^{(0)}-x^{*}\|^{2}
    xag(r)x2x(0)x2\displaystyle\|x^{(r)}_{\text{ag}}-x^{*}\|^{2}\leq\|x^{(0)}-x^{*}\|^{2}

    and for any 1rR1\leq r\leq R

    xmd(r)x2x(0)x2\displaystyle\|x^{(r)}_{\text{md}}-x^{*}\|^{2}\leq\|x^{(0)}-x^{*}\|^{2}
  • General convex (grad norm): FiF_{i}’s satisfy Assumption B.2. If we return x^=xag(Rs)\hat{x}=x_{\text{ag}}^{(R_{s})} after RR rounds of the multistage AC-SA specified above on the regularized objective Fμ(x)=F(x)+μ2x(0)x2F_{\mu}(x)=F(x)+\frac{\mu}{2}\|x^{(0)}-x\|^{2} with μ=max{2βR2log2(e2+R2),βσ2ΔSKR,(1SN)βζ2ΔSR}\mu=\max\{\frac{2\beta}{R^{2}}\log^{2}(e^{2}+R^{2}),\sqrt{\frac{\beta\sigma^{2}}{\Delta SKR}},\sqrt{\frac{(1-\frac{S}{N})\beta\zeta^{2}}{\Delta SR}}\}, we have

    𝔼F(x^)2\displaystyle\mathbb{E}\|\nabla F(\hat{x})\|^{2} 𝒪~(βΔR2+σ2SKR+(1SN)ζ2SR+βσDSKR+1SNζDSR)\displaystyle\leq\tilde{\mathcal{O}}(\frac{\beta\Delta}{R^{2}}+\frac{\sigma^{2}}{SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{SR}+\frac{\beta\sigma D}{\sqrt{SKR}}+\sqrt{1-\frac{S}{N}}\frac{\zeta D}{\sqrt{SR}})
    𝔼x^x2𝒪~(D2)\displaystyle\mathbb{E}\|\hat{x}-x^{*}\|^{2}\leq\tilde{\mathcal{O}}(D^{2})

D.2.1 Convergence of ASG for Strongly Convex Functions

Proof.

The convergence rate proof comes from Woodworth et al. (2020a) Lemma 5. What remains is to show the distance bound.

From Proposition 5 (Eqs. 3.25 and 4.25) of Ghadimi & Lan (2012), we have that the generic (non-multistage) AC-SA satisfies (together with Lemma H.1)

\displaystyle\mathbb{E}F(x_{\text{ag}}^{(r)})+\mu\mathbb{E}\|x^{(r)}-x^{*}\|^{2}-F(x^{*}) (53)
Γrτ=1rγτΓτ[𝔼x(τ1)x2𝔼x(τ)x2]+Γr4rμ(σ2SK+(1SN)ζ2S)\displaystyle\leq\Gamma_{r}\sum_{\tau=1}^{r}\frac{\gamma_{\tau}}{\Gamma_{\tau}}[\mathbb{E}\|x^{(\tau-1)}-x^{*}\|^{2}-\mathbb{E}\|x^{(\tau)}-x^{*}\|^{2}]+\Gamma_{r}\frac{4r}{\mu}(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (54)

where \Gamma_{r}=1 for r=1 and \Gamma_{r}=(1-\alpha_{r})\Gamma_{r-1} for r\geq 2. Notice that

\displaystyle\Gamma_{r}\sum_{\tau=1}^{r}\frac{\gamma_{\tau}}{\Gamma_{\tau}}[\mathbb{E}\|x^{(\tau-1)}-x^{*}\|^{2}-\mathbb{E}\|x^{(\tau)}-x^{*}\|^{2}]=\Gamma_{r}\gamma_{1}[\mathbb{E}\|x^{(0)}-x^{*}\|^{2}-\mathbb{E}\|x^{(r)}-x^{*}\|^{2}] (55)

So

\displaystyle(\mu+\Gamma_{r}\gamma_{1})\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\Gamma_{r}\gamma_{1}\mathbb{E}\|x^{(0)}-x^{*}\|^{2}+\Gamma_{r}\frac{4r}{\mu}(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (56)

Which implies

𝔼x(r)x2𝔼x(0)x2+Γr4rμ2(σ2SK+(1SN)ζ2S)\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\mathbb{E}\|x^{(0)}-x^{*}\|^{2}+\Gamma_{r}\frac{4r}{\mu^{2}}(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (57)

Furthermore, Γr=2r(r+1)\Gamma_{r}=\frac{2}{r(r+1)} so

𝔼x(r)x2𝔼x(0)x2+8μ2(r+1)(σ2SK+(1SN)ζ2S)\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\mathbb{E}\|x^{(0)}-x^{*}\|^{2}+\frac{8}{\mu^{2}(r+1)}(\frac{\sigma^{2}}{SK}+(1-\frac{S}{N})\frac{\zeta^{2}}{S}) (58)

If there is no gradient variance or sampling,

x(r)x2x(0)x2\displaystyle\|x^{(r)}-x^{*}\|^{2}\leq\|x^{(0)}-x^{*}\|^{2} (59)

Next, we show the above two conclusions hold for xmd(r)x^{(r)}_{\text{md}} and xag(r)x^{(r)}_{\text{ag}}.

For xag(r)x^{(r)}_{\text{ag}}, we show by induction:

Base case: x^{(0)}_{\text{ag}}=x^{(0)}, so the claim holds.

Inductive case: xag(r)x^{(r)}_{\text{ag}} is a convex combination of x(r)x^{(r)} and xag(r1)x^{(r-1)}_{\text{ag}}, so the above statements hold for xag(r)x^{(r)}_{\text{ag}} as well.

For xmd(r)x^{(r)}_{\text{md}}, note it is a convex combination of xag(r1)x^{(r-1)}_{\text{ag}} and x(r1)x^{(r-1)}, so the above statements on distance also hold for xmd(r)x^{(r)}_{\text{md}}.

For the distance bound on the returned solution, use strong convexity on the convergence rate:

𝔼x^x2𝒪(κ𝔼x(0)x2exp(Rκ)+cμ2R)\displaystyle\mathbb{E}\|\hat{x}-x^{*}\|^{2}\leq\mathcal{O}(\kappa\mathbb{E}\|x^{(0)}-x^{*}\|^{2}\exp(-\frac{R}{\sqrt{\kappa}})+\frac{c}{\mu^{2}R}) (60)

D.2.2 Convergence of ASG for General Convex Functions

For the general convex case, we use Nesterov smoothing. Concretely, we will run Algo. 3 assuming strong convexity by optimizing instead a modified objective

Fμ(x)=F(x)+μ2x(0)x2\displaystyle F_{\mu}(x)=F(x)+\frac{\mu}{2}\|x^{(0)}-x\|^{2} (61)

Define xμ=argminxFμ(x)x^{*}_{\mu}=\operatorname*{arg\,min}_{x}F_{\mu}(x) and Δμ=𝔼Fμ(x(0))F(xμ)\Delta_{\mu}=\mathbb{E}F_{\mu}(x^{(0)})-F(x^{*}_{\mu}). We will choose μ\mu carefully to balance the error introduced by the regularization term and the better convergence properties of having larger μ\mu.

Proof.

Observe that

𝔼F(x^)2=𝔼Fμ(x^)μ(x^x(0))22𝔼Fμ(x^)2+2μ2𝔼x^x(0)2\displaystyle\mathbb{E}\|\nabla F(\hat{x})\|^{2}=\mathbb{E}\|\nabla F_{\mu}(\hat{x})-\mu(\hat{x}-x^{(0)})\|^{2}\leq 2\mathbb{E}\|\nabla F_{\mu}(\hat{x})\|^{2}+2\mu^{2}\mathbb{E}\|\hat{x}-x^{(0)}\|^{2} (62)

We bound the second term first. From the strongly convex case of the theorem statement, with \kappa^{\prime}=\frac{\beta+\mu}{\mu} and \kappa=\frac{\beta}{\mu} (F_{\mu} is now (\beta+\mu)-smooth and \mu-strongly convex), we know that

𝔼Fμ(x^)Fμ(xμ)𝒪((𝔼Fμ(x(0))Fμ(xμ))exp(Rκ)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\mathbb{E}F_{\mu}(\hat{x})-F_{\mu}(x^{*}_{\mu})\leq\mathcal{O}((\mathbb{E}F_{\mu}(x^{(0)})-F_{\mu}(x^{*}_{\mu}))\exp(-\frac{R}{\sqrt{\kappa^{\prime}}})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (63)

Which implies

𝔼Fμ(x^)\displaystyle\mathbb{E}F_{\mu}(\hat{x}) 𝒪(𝔼Fμ(x(0))+σ2μSKR+(1SN)ζ2μSR)\displaystyle\leq\mathcal{O}(\mathbb{E}F_{\mu}(x^{(0)})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (64)
=𝒪(𝔼F(x(0))+σ2μSKR+(1SN)ζ2μSR)\displaystyle=\mathcal{O}(\mathbb{E}F(x^{(0)})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (65)

So it is true that

𝔼Fμ(x^)\displaystyle\mathbb{E}F_{\mu}(\hat{x}) =𝔼[F(x^)+μ2x^x(0)2]\displaystyle=\mathbb{E}[F(\hat{x})+\frac{\mu}{2}\|\hat{x}-x^{(0)}\|^{2}] (66)
=𝒪(𝔼F(x(0))+σ2μSKR+(1SN)ζ2μSR)\displaystyle=\mathcal{O}(\mathbb{E}F(x^{(0)})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (67)

If we rearrange,

μ2𝔼x^x(0)2\displaystyle\frac{\mu}{2}\mathbb{E}\|\hat{x}-x^{(0)}\|^{2} 𝒪(𝔼F(x(0))F(x^)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\leq\mathcal{O}(\mathbb{E}F(x^{(0)})-F(\hat{x})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (68)
𝒪(𝔼F(x(0))F(x)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\leq\mathcal{O}(\mathbb{E}F(x^{(0)})-F(x^{*})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (69)
𝒪(Δ+σ2μSKR+(1SN)ζ2μSR)\displaystyle\leq\mathcal{O}(\Delta+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (70)

Next we evaluate 2𝔼Fμ(x^)22\mathbb{E}\|\nabla F_{\mu}(\hat{x})\|^{2}. Observe that the smoothness of FμF_{\mu} is β+μ\beta+\mu. Returning to the convergence rate,

𝔼Fμ(x^)Fμ(xμ)𝒪(Δμexp(Rκ)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\mathbb{E}F_{\mu}(\hat{x})-F_{\mu}(x^{*}_{\mu})\leq\mathcal{O}(\Delta_{\mu}\exp(-\frac{R}{\sqrt{\kappa^{\prime}}})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (71)

We have that

Fμ(x(0))Fμ(xμ)=F(x(0))F(xμ)μ2x(0)xμ2F(x(0))F(x)\displaystyle F_{\mu}(x^{(0)})-F_{\mu}(x^{*}_{\mu})=F(x^{(0)})-F(x^{*}_{\mu})-\frac{\mu}{2}\|x^{(0)}-x_{\mu}^{*}\|^{2}\leq F(x^{(0)})-F(x^{*}) (72)

So ΔμΔ\Delta_{\mu}\leq\Delta. By Assumption B.4,

𝔼Fμ(x^)2\displaystyle\mathbb{E}\|\nabla F_{\mu}(\hat{x})\|^{2} (β+μ)𝒪(Δexp(Rκ)+σ2μSKR+(1SN)ζ2μSR)\displaystyle\leq(\beta+\mu)\mathcal{O}(\Delta\exp(-\frac{R}{\sqrt{\kappa^{\prime}}})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}) (73)
𝒪(βΔexp(Rκ)+μΔ+βσ2μSKR+σ2SKR+(1SN)ζ2SR+(1SN)βζ2μSR)\displaystyle\leq\mathcal{O}(\beta\Delta\exp(-\frac{R}{\sqrt{\kappa^{\prime}}})+\mu\Delta+\frac{\beta\sigma^{2}}{\mu SKR}+\frac{\sigma^{2}}{SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{SR}+(1-\frac{S}{N})\frac{\beta\zeta^{2}}{\mu SR}) (74)

So altogether,

𝔼F(x^)2\displaystyle\mathbb{E}\|\nabla F(\hat{x})\|^{2} 𝒪(βΔexp(Rκ)+(1+κ)σ2SKR+(1+κ)(1SN)ζ2SR+μΔ)\displaystyle\leq\mathcal{O}(\beta\Delta\exp(-\frac{R}{\sqrt{\kappa^{\prime}}})+(1+\kappa)\frac{\sigma^{2}}{SKR}+(1+\kappa)(1-\frac{S}{N})\frac{\zeta^{2}}{SR}+\mu\Delta) (75)

Choose μ=max{2βR2log2(e2+R2),βσ2ΔSKR,(1SN)βζ2ΔSR}\mu=\max\{\frac{2\beta}{R^{2}}\log^{2}(e^{2}+R^{2}),\sqrt{\frac{\beta\sigma^{2}}{\Delta SKR}},\sqrt{\frac{(1-\frac{S}{N})\beta\zeta^{2}}{\Delta SR}}\}. By Theorem E.1’s proof in Yuan & Ma (2020),

𝔼F(x^)2\displaystyle\mathbb{E}\|\nabla F(\hat{x})\|^{2} 𝒪~(βΔR2+σ2SKR+(1SN)ζ2SR+βσDSKR+1SNζDSR)\displaystyle\leq\tilde{\mathcal{O}}(\frac{\beta\Delta}{R^{2}}+\frac{\sigma^{2}}{SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{SR}+\frac{\beta\sigma D}{\sqrt{SKR}}+\sqrt{1-\frac{S}{N}}\frac{\zeta D}{\sqrt{SR}}) (76)

Next we evaluate the distance bound for the returned iterate. Observe that from the distance bound in the strongly convex case,

𝔼x^xμ2𝒪(β+μμ𝔼x(0)xμ2exp(Rκ)+σ2μ2SKR+(1SN)ζ2μ2SR)\displaystyle\mathbb{E}\|\hat{x}-x^{*}_{\mu}\|^{2}\leq\mathcal{O}(\frac{\beta+\mu}{\mu}\mathbb{E}\|x^{(0)}-x^{*}_{\mu}\|^{2}\exp(-\frac{R}{\sqrt{\kappa}})+\frac{\sigma^{2}}{\mu^{2}SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu^{2}SR}) (77)

Given that \mu=\max\{\frac{2\beta}{R^{2}}\log^{2}(e^{2}+R^{2}),\sqrt{\frac{\beta\sigma^{2}}{\Delta SKR}},\sqrt{\frac{(1-\frac{S}{N})\beta\zeta^{2}}{\Delta SR}}\}, the exponential term above is \tilde{\mathcal{O}}(\mathbb{E}\|x^{(0)}-x^{*}_{\mu}\|^{2}) and the noise terms are at most \frac{\Delta}{\beta}\leq D^{2}, so

\displaystyle\mathbb{E}\|\hat{x}-x^{*}_{\mu}\|^{2}\leq\tilde{\mathcal{O}}(\mathbb{E}\|x^{(0)}-x^{*}_{\mu}\|^{2}+D^{2}) (78)

Now we look at the actual distance we want to bound:

𝔼x^x2\displaystyle\mathbb{E}\|\hat{x}-x^{*}\|^{2} 𝒪~(𝔼x^xμ2+𝔼xμx(0)2+𝔼xx(0)2)\displaystyle\leq\tilde{\mathcal{O}}(\mathbb{E}\|\hat{x}-x^{*}_{\mu}\|^{2}+\mathbb{E}\|x^{*}_{\mu}-x^{(0)}\|^{2}+\mathbb{E}\|x^{*}-x^{(0)}\|^{2}) (80)

Observe that

\displaystyle F_{\mu}(x_{\mu}^{*})=F(x_{\mu}^{*})+\frac{\mu}{2}\|x^{(0)}-x_{\mu}^{*}\|^{2}\leq F(x^{*})+\frac{\mu}{2}\|x^{(0)}-x^{*}\|^{2}=F_{\mu}(x^{*}) (82)

which implies that

x(0)xμ2x(0)x2\displaystyle\|x^{(0)}-x_{\mu}^{*}\|^{2}\leq\|x^{(0)}-x^{*}\|^{2} (83)

So altogether

𝔼x^x2𝒪~(D2)\displaystyle\mathbb{E}\|\hat{x}-x^{*}\|^{2}\leq\tilde{\mathcal{O}}(D^{2}) (84)

D.3 SAGA

Theorem D.4.

Suppose that client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 5 gives the following for returned iterate x^\hat{x}:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. If we return x^=1WRr=0Rwrx(r)\hat{x}=\frac{1}{W_{R}}\sum_{r=0}^{R}w_{r}x^{(r)} with wr=(1ημ)1(r+1)w_{r}=(1-\eta\mu)^{1-(r+1)} and WR=r=0RwrW_{R}=\sum_{r=0}^{R}w_{r}, η=𝒪~(min{1β,1μR})\eta=\tilde{\mathcal{O}}(\min\{\frac{1}{\beta},\frac{1}{\mu R}\}), R>κR>\kappa, and we use Option I,

    𝔼F(x^)F(x)𝒪~(Δexp(min{μβ,SN}R)+σ2μRKS)\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\tilde{\mathcal{O}}(\Delta\exp(-\min\{\frac{\mu}{\beta},\frac{S}{N}\}R)+\frac{\sigma^{2}}{\mu RKS})
  • PL condition: FiF_{i}’s satisfy Assumption B.3. If we set η=13β(N/S)2/3\eta=\frac{1}{3\beta(N/S)^{2/3}}, use Option II in a multistage manner (details specified in proof), and return a uniformly sampled iterate from the final stage,

    𝔼F(x^)F(x)𝒪(Δexp(Rκ(N/S)2/3)+σ2μSK)\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\mathcal{O}(\Delta\exp(-\frac{R}{\kappa(N/S)^{2/3}})+\frac{\sigma^{2}}{\mu SK}) (85)

D.3.1 Convergence of SAGA for Strongly Convex functions

The proof of this is similar to that of Karimireddy et al. (2020b, Theorem VII).

Proof.

Following the standard analysis,

𝔼rx(r+1)x2\displaystyle\mathbb{E}_{r}\|x^{(r+1)}-x^{*}\|^{2} (86)
=x(r)x22η𝔼rg(r),x(r)x+𝔼rηg(r)2\displaystyle=\|x^{(r)}-x^{*}\|^{2}-2\eta\mathbb{E}_{r}\langle g^{(r)},x^{(r)}-x^{*}\rangle+\mathbb{E}_{r}\|\eta g^{(r)}\|^{2} (87)

We treat each of these terms separately.

Second term: Observe first that

𝔼rg(r)\displaystyle\mathbb{E}_{r}g^{(r)} =𝔼r(1Si𝒮r[1Kk=0K1f(x(r);zi,k(r))ci(r)]+c(r))\displaystyle=\mathbb{E}_{r}(\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}[\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(x^{(r)};z_{i,k}^{(r)})-c_{i}^{(r)}]+c^{(r)}) (89)
=F(x(r))\displaystyle=\nabla F(x^{(r)}) (90)

And so, by (strong) convexity, the middle term satisfies

2η𝔼rg(r),x(r)x\displaystyle-2\eta\mathbb{E}_{r}\langle g^{(r)},x^{(r)}-x^{*}\rangle =2ηF(x(r)),x(r)x\displaystyle=-2\eta\langle\nabla F(x^{(r)}),x^{(r)}-x^{*}\rangle (91)
2η(F(x(r))F(x)+μ2x(r)x2)\displaystyle\leq-2\eta(F(x^{(r)})-F(x^{*})+\frac{\mu}{2}\|x^{(r)}-x^{*}\|^{2}) (92)

Third term:

𝔼rηg(r)2\displaystyle\mathbb{E}_{r}\|\eta g^{(r)}\|^{2} (93)
=𝔼rη(1Si𝒮r[1Kk=0K1f(x(r);zi,k(r))ci(r)]+c(r))2\displaystyle=\mathbb{E}_{r}\|\eta(\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}[\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(x^{(r)};z_{i,k}^{(r)})-c_{i}^{(r)}]+c^{(r)})\|^{2} (94)
=η2𝔼r1KSk,i𝒮rf(x(r);zi,k(r))ci(r)+c(r)2\displaystyle=\eta^{2}\mathbb{E}_{r}\|\frac{1}{KS}\sum_{k,i\in\mathcal{S}_{r}}\nabla f(x^{(r)};z_{i,k}^{(r)})-c_{i}^{(r)}+c^{(r)}\|^{2} (95)
4η2𝔼r1KSk,i𝒮rf(x(r);zi,k(r))Fi(x(r))2+4η2𝔼rc(r)2\displaystyle\leq 4\eta^{2}\mathbb{E}_{r}\|\frac{1}{KS}\sum_{k,i\in\mathcal{S}_{r}}\nabla f(x^{(r)};z_{i,k}^{(r)})-\nabla F_{i}(x^{(r)})\|^{2}+4\eta^{2}\mathbb{E}_{r}\|c^{(r)}\|^{2} (96)
+4η2𝔼r1KSk,i𝒮rFi(x)ci(r)2+4η2𝔼r1KSk,i𝒮rFi(x(r))Fi(x)2\displaystyle\ \ \ \ +4\eta^{2}\mathbb{E}_{r}\|\frac{1}{KS}\sum_{k,i\in\mathcal{S}_{r}}\nabla F_{i}(x^{*})-c_{i}^{(r)}\|^{2}+4\eta^{2}\mathbb{E}_{r}\|\frac{1}{KS}\sum_{k,i\in\mathcal{S}_{r}}\nabla F_{i}(x^{(r)})-\nabla F_{i}(x^{*})\|^{2} (97)

Where the last inequality comes from the relaxed triangle inequality. Then, by using Assumption B.6, Assumption B.4, and Assumption B.2,

𝔼rηg(r)2\displaystyle\mathbb{E}_{r}\|\eta g^{(r)}\|^{2} (99)
4η2𝔼rc(r)2+4η2𝔼r1KSk,i𝒮rFi(x)ci(r)2+8βη2[F(x(r))F(x)]+4η2σ2KS\displaystyle\leq 4\eta^{2}\mathbb{E}_{r}\|c^{(r)}\|^{2}+4\eta^{2}\mathbb{E}_{r}\|\frac{1}{KS}\sum_{k,i\in\mathcal{S}_{r}}\nabla F_{i}(x^{*})-c_{i}^{(r)}\|^{2}+8\beta\eta^{2}[F(x^{(r)})-F(x^{*})]+\frac{4\eta^{2}\sigma^{2}}{KS} (100)

Taking full expectation and separating out the variance of the control variates,

𝔼ηg(r)2\displaystyle\mathbb{E}\|\eta g^{(r)}\|^{2} (101)
4η2𝔼c(r)2+4η21KSk,i𝒮rFi(x)𝔼ci(r)2+8βη2𝔼[F(x(r))F(x)]+12η2σ2KS\displaystyle\leq 4\eta^{2}\|\mathbb{E}c^{(r)}\|^{2}+4\eta^{2}\|\frac{1}{KS}\sum_{k,i\in\mathcal{S}_{r}}\nabla F_{i}(x^{*})-\mathbb{E}c_{i}^{(r)}\|^{2}+8\beta\eta^{2}\mathbb{E}[F(x^{(r)})-F(x^{*})]+\frac{12\eta^{2}\sigma^{2}}{KS} (102)

Now observe that because c(r)=1Ni=1Nci(r)c^{(r)}=\frac{1}{N}\sum_{i=1}^{N}c_{i}^{(r)},

𝔼ηg(r)2\displaystyle\mathbb{E}\|\eta g^{(r)}\|^{2} 8η2Ni=1NFi(x)𝔼ci(r)2+8βη2𝔼[F(x(r))F(x)]+12η2σ2KS\displaystyle\leq\frac{8\eta^{2}}{N}\sum_{i=1}^{N}\|\nabla F_{i}(x^{*})-\mathbb{E}c_{i}^{(r)}\|^{2}+8\beta\eta^{2}\mathbb{E}[F(x^{(r)})-F(x^{*})]+\frac{12\eta^{2}\sigma^{2}}{KS} (103)
\displaystyle\leq 8\eta^{2}\mathcal{C}_{r}+8\beta\eta^{2}\mathbb{E}[F(x^{(r)})-F(x^{*})]+\frac{12\eta^{2}\sigma^{2}}{KS} (104)

where \mathcal{C}_{r}:=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\mathbb{E}[c_{i}^{(r)}]-\nabla F_{i}(x^{*})\|^{2} denotes the control lag.

Bounding the control lag:

Recall that

ci(r+1)={ci(r) w.p. 1SN1Kk=0K1f(x(r);zi,k(r)) w.p. SN\displaystyle c_{i}^{(r+1)}=\begin{cases}c_{i}^{(r)}\text{ w.p. }1-\frac{S}{N}\\ \frac{1}{K}\sum_{k=0}^{K-1}\nabla f(x^{(r)};z_{i,k}^{(r)})\text{ w.p. }\frac{S}{N}\end{cases} (105)

Therefore

𝔼ci(r+1)=(1SN)𝔼[ci(r)]+SN𝔼Fi(x(r))\displaystyle\mathbb{E}c_{i}^{(r+1)}=(1-\frac{S}{N})\mathbb{E}[c_{i}^{(r)}]+\frac{S}{N}\mathbb{E}\nabla F_{i}(x^{(r)}) (106)

Returning to the definition of 𝒞r+1\mathcal{C}_{r+1},

𝒞r+1\displaystyle\mathcal{C}_{r+1} =1Ni=1N𝔼𝔼[ci(r+1)]Fi(x)2\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\mathbb{E}[c_{i}^{(r+1)}]-\nabla F_{i}(x^{*})\|^{2} (107)
=1Ni=1N𝔼(1SN)(𝔼[ci(r)]Fi(x))+SN(𝔼Fi(x(r))Fi(x))2\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|(1-\frac{S}{N})(\mathbb{E}[c_{i}^{(r)}]-\nabla F_{i}(x^{*}))+\frac{S}{N}(\mathbb{E}\nabla F_{i}(x^{(r)})-\nabla F_{i}(x^{*}))\|^{2} (108)
(1SN)𝒞r+SN2i=1N𝔼Fi(x(r))Fi(x)2\displaystyle\leq(1-\frac{S}{N})\mathcal{C}_{r}+\frac{S}{N^{2}}\sum_{i=1}^{N}\mathbb{E}\|\nabla F_{i}(x^{(r)})-\nabla F_{i}(x^{*})\|^{2} (109)

where in the last inequality we applied Jensen’s inequality twice. By smoothness of FiF_{i}’s (Assumption B.4),

𝒞r+1(1SN)𝒞r+2βSN[𝔼F(x(r))F(x)]\displaystyle\mathcal{C}_{r+1}\leq(1-\frac{S}{N})\mathcal{C}_{r}+\frac{2\beta S}{N}[\mathbb{E}F(x^{(r)})-F(x^{*})] (110)

Putting it together:

So putting it all together, we have

𝔼x(r+1)x2\displaystyle\mathbb{E}\|x^{(r+1)}-x^{*}\|^{2} (111)
(1ημ)𝔼x(r)x2η(28βη)(𝔼F(x(r))F(x))+8η2𝒞r+12η2σ2KS\displaystyle\leq(1-\eta\mu)\mathbb{E}\|x^{(r)}-x^{*}\|^{2}-\eta(2-8\beta\eta)(\mathbb{E}F(x^{(r)})-F(x^{*}))+8\eta^{2}\mathcal{C}_{r}+\frac{12\eta^{2}\sigma^{2}}{KS} (112)

From our bound on the control lag,

9η2NS𝒞r+1\displaystyle\frac{9\eta^{2}N}{S}\mathcal{C}_{r+1} 9η2NS(1SN)𝒞r+18βη2[𝔼F(x(r))F(x)]\displaystyle\leq\frac{9\eta^{2}N}{S}(1-\frac{S}{N})\mathcal{C}_{r}+18\beta\eta^{2}[\mathbb{E}F(x^{(r)})-F(x^{*})] (113)
=(1ημ)9η2NS𝒞r+9η2(ημNS1)𝒞r+18βη2[𝔼F(x(r))F(x)]\displaystyle=(1-\eta\mu)\frac{9\eta^{2}N}{S}\mathcal{C}_{r}+9\eta^{2}(\frac{\eta\mu N}{S}-1)\mathcal{C}_{r}+18\beta\eta^{2}[\mathbb{E}F(x^{(r)})-F(x^{*})] (114)

Adding both inequalities, we have

𝔼x(r+1)x2+9η2NS𝒞r+1\displaystyle\mathbb{E}\|x^{(r+1)}-x^{*}\|^{2}+\frac{9\eta^{2}N}{S}\mathcal{C}_{r+1} (115)
(1ημ)[𝔼x(r)x2+9η2NS𝒞r]η(226βη)(𝔼F(x(r))F(x))\displaystyle\leq(1-\eta\mu)[\mathbb{E}\|x^{(r)}-x^{*}\|^{2}+\frac{9\eta^{2}N}{S}\mathcal{C}_{r}]-\eta(2-26\beta\eta)(\mathbb{E}F(x^{(r)})-F(x^{*})) (116)
+η2(9ημNS1)𝒞r+12η2σ2KS\displaystyle+\eta^{2}(\frac{9\eta\mu N}{S}-1)\mathcal{C}_{r}+\frac{12\eta^{2}\sigma^{2}}{KS} (117)

Let η126β\eta\leq\frac{1}{26\beta}, ηS9μN\eta\leq\frac{S}{9\mu N}, then

𝔼x(r+1)x2+9η2NS𝒞r+1\displaystyle\mathbb{E}\|x^{(r+1)}-x^{*}\|^{2}+\frac{9\eta^{2}N}{S}\mathcal{C}_{r+1} (118)
(1ημ)[𝔼x(r)x2+9η2NS𝒞r]η(𝔼F(x(r))F(x))+12η2σ2KS\displaystyle\leq(1-\eta\mu)[\mathbb{E}\|x^{(r)}-x^{*}\|^{2}+\frac{9\eta^{2}N}{S}\mathcal{C}_{r}]-\eta(\mathbb{E}F(x^{(r)})-F(x^{*}))+\frac{12\eta^{2}\sigma^{2}}{KS} (119)

Then by using Lemma 1 in (Karimireddy et al., 2020b), setting \eta_{\text{max}}=\min\{\frac{1}{26\beta},\frac{S}{9\mu N}\}, R\geq\frac{1}{2\eta_{\text{max}}\mu}, and choosing \eta=\min\{\frac{\log(\max(1,\mu^{2}Rd_{0}/c))}{\mu R},\eta_{\max}\}, where d_{0}=\mathbb{E}\|x^{(0)}-x^{*}\|^{2}+\frac{9N\eta^{2}}{S}\mathcal{C}_{0} and c=\frac{12\sigma^{2}}{KS}, outputting \hat{x}=\frac{1}{W_{R}}\sum_{r=0}^{R}w_{r}x^{(r)} with W_{R}=\sum_{r=0}^{R}w_{r} and w_{r}=(1-\mu\eta)^{1-r} gives

\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\tilde{\mathcal{O}}(\mu d_{0}\exp(-\mu\eta_{\max}R)+\frac{c}{\mu R}) (120)
\displaystyle=\tilde{\mathcal{O}}(\mu[\mathbb{E}\|x^{(0)}-x^{*}\|^{2}+\frac{9N\eta^{2}}{S}\mathcal{C}_{0}]\exp(-\mu\eta_{\max}R)+\frac{\sigma^{2}}{\mu RKS}) (121)

We use a warm-start strategy to initialize all control variates in the first N/SN/S rounds such that

ci(0)=1Kk=0K1f(x(0);zi,k(1))c_{i}^{(0)}=\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(x^{(0)};z_{i,k}^{(-1)})

By smoothness of FiF_{i}’s (Assumption B.4),

𝒞0\displaystyle\mathcal{C}_{0} =1Ni=1N𝔼𝔼ci(0)Fi(x)2\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\mathbb{E}c_{i}^{(0)}-\nabla F_{i}(x^{*})\|^{2} (123)
=1Ni=1N𝔼Fi(x(0))Fi(x)2\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\nabla F_{i}(x^{(0)})-\nabla F_{i}(x^{*})\|^{2} (124)
β𝔼(F(x(0))F(x))\displaystyle\leq\beta\mathbb{E}(F(x^{(0)})-F(x^{*})) (125)

And recalling that ηmin{126β,S9μN}\eta\leq\min\{\frac{1}{26\beta},\frac{S}{9\mu N}\}, we know that

9Nη2S𝒞0\displaystyle\frac{9N\eta^{2}}{S}\mathcal{C}_{0} 9Nη2Sβ𝔼(F(x(0))F(x))ηβμ(F(x(0))F(x))1μΔ\displaystyle\leq\frac{9N\eta^{2}}{S}\beta\mathbb{E}(F(x^{(0)})-F(x^{*}))\leq\frac{\eta\beta}{\mu}(F(x^{(0)})-F(x^{*}))\leq\frac{1}{\mu}\Delta (126)

So altogether,

𝔼F(x^)F(x)\displaystyle\mathbb{E}F(\hat{x})-F(x^{*}) 𝒪~(Δexp(μηmaxR)+σ2μRKS)\displaystyle\leq\tilde{\mathcal{O}}(\Delta\exp(-\mu\eta_{\max}R)+\frac{\sigma^{2}}{\mu RKS}) (127)

D.3.2 Convergence of SAGA under the PL condition

This proof follows Reddi et al. (2016).

Proof.

We start with Assumption B.4

\displaystyle\mathbb{E}F(x^{(r+1)})\leq\mathbb{E}[F(x^{(r)})+\langle\nabla F(x^{(r)}),x^{(r+1)}-x^{(r)}\rangle+\frac{\beta}{2}\|x^{(r+1)}-x^{(r)}\|^{2}] (128)

Using the fact that g(r)g^{(r)} is unbiased,

𝔼F(x(r+1))𝔼[F(x(r))ηF(x(r))2+βη22g(r)2]\displaystyle\mathbb{E}F(x^{(r+1)})\leq\mathbb{E}[F(x^{(r)})-\eta\|\nabla F(x^{(r)})\|^{2}+\frac{\beta\eta^{2}}{2}\|g^{(r)}\|^{2}] (129)

Now we consider the Lyapunov function

Lr=𝔼[F(x(r))+crNi=1Nx(r)ϕi(r)2]\displaystyle L_{r}=\mathbb{E}[F(x^{(r)})+\frac{c_{r}}{N}\sum_{i=1}^{N}\|x^{(r)}-\phi_{i}^{(r)}\|^{2}] (130)

We bound Lr+1L_{r+1} using

1Ni=1N𝔼x(r+1)ϕi(r+1)2\displaystyle\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|x^{(r+1)}-\phi_{i}^{(r+1)}\|^{2} (131)
=1Ni=1N[SN𝔼x(r+1)x(r)2+NSN𝔼x(r+1)ϕi(r)2]\displaystyle=\frac{1}{N}\sum_{i=1}^{N}[\frac{S}{N}\mathbb{E}\|x^{(r+1)}-x^{(r)}\|^{2}+\frac{N-S}{N}\mathbb{E}\|x^{(r+1)}-\phi_{i}^{(r)}\|^{2}] (132)

The equality holds because \phi_{i}^{(r+1)}=x^{(r)} with probability S/N and \phi_{i}^{(r+1)}=\phi_{i}^{(r)} otherwise. Observe that we can bound

𝔼x(r+1)ϕi(r)2\displaystyle\mathbb{E}\|x^{(r+1)}-\phi_{i}^{(r)}\|^{2} (133)
=𝔼[x(r+1)x(r)2+x(r)ϕi(r)2+2x(r+1)x(r),x(r)ϕi(r)]\displaystyle=\mathbb{E}[\|x^{(r+1)}-x^{(r)}\|^{2}+\|x^{(r)}-\phi_{i}^{(r)}\|^{2}+2\langle x^{(r+1)}-x^{(r)},x^{(r)}-\phi_{i}^{(r)}\rangle] (134)
𝔼[x(r+1)x(r)2+x(r)ϕi(r)2]+2η𝔼[12bF(x(r))2+12bx(r)ϕi(r)2]\displaystyle\leq\mathbb{E}[\|x^{(r+1)}-x^{(r)}\|^{2}+\|x^{(r)}-\phi_{i}^{(r)}\|^{2}]+2\eta\mathbb{E}[\frac{1}{2b}\|\nabla F(x^{(r)})\|^{2}+\frac{1}{2}b\|x^{(r)}-\phi_{i}^{(r)}\|^{2}] (135)

Where we used unbiasedness of g(r)g^{(r)} and Fenchel-Young inequality. Plugging this into Lr+1L_{r+1},

Lr+1\displaystyle L_{r+1} 𝔼[F(x(r))ηF(x(r))2+βη22g(r)2]\displaystyle\leq\mathbb{E}[F(x^{(r)})-\eta\|\nabla F(x^{(r)})\|^{2}+\frac{\beta\eta^{2}}{2}\|g^{(r)}\|^{2}] (136)
+𝔼[cr+1x(r+1)x(r)2+cr+1NSN2i=1Nx(r)ϕi(r)2]\displaystyle+\mathbb{E}[c_{r+1}\|x^{(r+1)}-x^{(r)}\|^{2}+c_{r+1}\frac{N-S}{N^{2}}\sum_{i=1}^{N}\|x^{(r)}-\phi_{i}^{(r)}\|^{2}] (137)
+2(NS)cr+1ηN2i=1N𝔼[12bF(x(r))2+12bx(r)ϕi(r)2]\displaystyle+\frac{2(N-S)c_{r+1}\eta}{N^{2}}\sum_{i=1}^{N}\mathbb{E}[\frac{1}{2b}\|\nabla F(x^{(r)})\|^{2}+\frac{1}{2}b\|x^{(r)}-\phi_{i}^{(r)}\|^{2}] (138)
𝔼[F(x(r))(ηcr+1η(NS)bN)F(x(r))2+(βη22+cr+1η2)𝔼g(r)2]\displaystyle\leq\mathbb{E}[F(x^{(r)})-(\eta-\frac{c_{r+1}\eta(N-S)}{bN})\|\nabla F(x^{(r)})\|^{2}+(\frac{\beta\eta^{2}}{2}+c_{r+1}\eta^{2})\mathbb{E}\|g^{(r)}\|^{2}] (139)
+(NSNcr+1+cr+1ηb(NS)N)1Ni=1N𝔼x(r)ϕi(r)2\displaystyle+(\frac{N-S}{N}c_{r+1}+\frac{c_{r+1}\eta b(N-S)}{N})\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|x^{(r)}-\phi_{i}^{(r)}\|^{2} (140)

Now we must bound 𝔼g(r)2\mathbb{E}\|g^{(r)}\|^{2}.

𝔼(1Si𝒮r[1Kk=0K1f(x(r);zi,k(r))f(ϕi(r);z~i,k(r))]+1NKi=1Nf(ϕi(r);z~i,k(r)))2\displaystyle\mathbb{E}\|(\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}[\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(x^{(r)};z_{i,k}^{(r)})-\nabla f(\phi_{i}^{(r)};\tilde{z}_{i,k}^{(r)})]+\frac{1}{NK}\sum_{i=1}^{N}\nabla f(\phi_{i}^{(r)};\tilde{z}_{i,k}^{(r)}))\|^{2} (141)
2𝔼(1Si𝒮r[1Kk=0K1f(x(r);zi,k(r))f(ϕi(r);z~i,k(r))]\displaystyle\leq 2\mathbb{E}\|(\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}[\frac{1}{K}\sum_{k=0}^{K-1}\nabla f(x^{(r)};z_{i,k}^{(r)})-\nabla f(\phi_{i}^{(r)};\tilde{z}_{i,k}^{(r)})] (142)
1NKi=1N[f(x(r);zi,k(r))f(ϕi(r);z~i,k(r)))]2+2𝔼1NKi=1Nf(x(r);zi,k(r))2\displaystyle-\frac{1}{NK}\sum_{i=1}^{N}[\nabla f(x^{(r)};z_{i,k}^{(r)})-\nabla f(\phi_{i}^{(r)};\tilde{z}_{i,k}^{(r)}))]\|^{2}+2\mathbb{E}\|\frac{1}{NK}\sum_{i=1}^{N}\nabla f(x^{(r)};z_{i,k}^{(r)})\|^{2} (143)
2𝔼1Si𝒮rFi(x(r))Fi(ϕi(r))2+2𝔼F(x(r))2+νσ2SK\displaystyle\leq 2\mathbb{E}\|\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}\nabla F_{i}(x^{(r)})-\nabla F_{i}(\phi_{i}^{(r)})\|^{2}+2\mathbb{E}\|\nabla F(x^{(r)})\|^{2}+\frac{\nu\sigma^{2}}{SK} (144)

where the second-to-last inequality applies \text{Var}(X)\leq\mathbb{E}[X^{2}] and separates out the gradient variance (using that the \tilde{z}'s and z's are independent), and \nu is an absolute constant.

We use Assumption B.4 and take expectation over sampled clients to get

\displaystyle\mathbb{E}\|g^{(r)}\|^{2}\leq\frac{2\beta^{2}}{N}\sum_{i=1}^{N}\mathbb{E}\|x^{(r)}-\phi_{i}^{(r)}\|^{2}+2\mathbb{E}\|\nabla F(x^{(r)})\|^{2}+\frac{\nu\sigma^{2}}{SK} (145)

Returning to our bound on Lr+1L_{r+1},

Lr+1\displaystyle L_{r+1} 𝔼F(x(r))(ηcr+1η(NS)bNη2β2cr+1η2)𝔼F(x(r))2\displaystyle\leq\mathbb{E}F(x^{(r)})-(\eta-\frac{c_{r+1}\eta(N-S)}{bN}-\eta^{2}\beta-2c_{r+1}\eta^{2})\mathbb{E}\|\nabla F(x^{(r)})\|^{2} (146)
+(cr+1(NSN+ηb(NS)N+2η2β2)+η2β3)1Ni=1N𝔼x(r)ϕi(r)2\displaystyle+(c_{r+1}(\frac{N-S}{N}+\frac{\eta b(N-S)}{N}+2\eta^{2}\beta^{2})+\eta^{2}\beta^{3})\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|x^{(r)}-\phi_{i}^{(r)}\|^{2} (147)
+νη2σ2SK(cr+1+β2)\displaystyle+\frac{\nu\eta^{2}\sigma^{2}}{SK}(c_{r+1}+\frac{\beta}{2}) (148)

We set cr=cr+1(NSN+ηb(NS)N+2η2β2)+η2β3c_{r}=c_{r+1}(\frac{N-S}{N}+\frac{\eta b(N-S)}{N}+2\eta^{2}\beta^{2})+\eta^{2}\beta^{3}, which results in

Lr+1\displaystyle L_{r+1} (149)
Lr(ηcr+1η(NS)bNη2β2cr+1η2)𝔼F(x(r))2+νη2σ2SK(cr+1+β2)\displaystyle\leq L_{r}-(\eta-\frac{c_{r+1}\eta(N-S)}{bN}-\eta^{2}\beta-2c_{r+1}\eta^{2})\mathbb{E}\|\nabla F(x^{(r)})\|^{2}+\frac{\nu\eta^{2}\sigma^{2}}{SK}(c_{r+1}+\frac{\beta}{2}) (150)

Let Γr=ηcr+1η(NS)bNη2β2cr+1η2\Gamma_{r}=\eta-\frac{c_{r+1}\eta(N-S)}{bN}-\eta^{2}\beta-2c_{r+1}\eta^{2}. Then by rearranging

Γr𝔼F(x(r))2LrLr+1+νη2σ2SK(cr+1+β2)\displaystyle\Gamma_{r}\mathbb{E}\|\nabla F(x^{(r)})\|^{2}\leq L_{r}-L_{r+1}+\frac{\nu\eta^{2}\sigma^{2}}{SK}(c_{r+1}+\frac{\beta}{2}) (151)

Letting γn=min0rR1Γr\gamma_{n}=\min_{0\leq r\leq R-1}\Gamma_{r},

γnr=0R1𝔼F(x(r))2\displaystyle\gamma_{n}\sum_{r=0}^{R-1}\mathbb{E}\|\nabla F(x^{(r)})\|^{2} r=0R1Γr𝔼F(x(r))2\displaystyle\leq\sum_{r=0}^{R-1}\Gamma_{r}\mathbb{E}\|\nabla F(x^{(r)})\|^{2} (152)
L0LR+r=0R1νη2σ2SK(cr+1+β2)\displaystyle\leq L_{0}-L_{R}+\sum_{r=0}^{R-1}\frac{\nu\eta^{2}\sigma^{2}}{SK}(c_{r+1}+\frac{\beta}{2}) (153)

Implying

r=0R1𝔼F(x(r))2Δγn+1γnr=0R1νη2σ2SK(cr+1+β2)\displaystyle\sum_{r=0}^{R-1}\mathbb{E}\|\nabla F(x^{(r)})\|^{2}\leq\frac{\Delta}{\gamma_{n}}+\frac{1}{\gamma_{n}}\sum_{r=0}^{R-1}\frac{\nu\eta^{2}\sigma^{2}}{SK}(c_{r+1}+\frac{\beta}{2}) (154)

Therefore, if we take a uniform sample from all the x^{(r)}'s, denoted \bar{x}^{(R)},

𝔼F(x¯(R))2ΔγnR+1γnRr=0R1νη2σ2SK(cr+1+β2)\displaystyle\mathbb{E}\|\nabla F(\bar{x}^{(R)})\|^{2}\leq\frac{\Delta}{\gamma_{n}R}+\frac{1}{\gamma_{n}R}\sum_{r=0}^{R-1}\frac{\nu\eta^{2}\sigma^{2}}{SK}(c_{r+1}+\frac{\beta}{2}) (155)

We start by bounding crc_{r}. Take η=13β(SN)2/3\eta=\frac{1}{3\beta}(\frac{S}{N})^{2/3} and b=β(SN)1/3b=\beta(\frac{S}{N})^{1/3}. Let θ=SNηb(NS)N2η2β2\theta=\frac{S}{N}-\frac{\eta b(N-S)}{N}-2\eta^{2}\beta^{2}. Observe that θ<1\theta<1 and θ>49SN\theta>\frac{4}{9}\frac{S}{N}. Then cr=cr+1(1θ)+η2β3c_{r}=c_{r+1}(1-\theta)+\eta^{2}\beta^{3}, which implies cr=η2β31(1θ)Rrθη2β3θβ4(SN)1/3c_{r}=\eta^{2}\beta^{3}\frac{1-(1-\theta)^{R-r}}{\theta}\leq\frac{\eta^{2}\beta^{3}}{\theta}\leq\frac{\beta}{4}(\frac{S}{N})^{1/3}

So we can conclude that

γn=minr(ηcr+1ηβη2β2cr+1η2)112β(SN)2/3\displaystyle\gamma_{n}=\min_{r}(\eta-\frac{c_{r+1}\eta}{\beta}-\eta^{2}\beta-2c_{r+1}\eta^{2})\geq\frac{1}{12\beta}(\frac{S}{N})^{2/3} (156)

So altogether,

\displaystyle\mathbb{E}\|\nabla F(\bar{x}^{(R)})\|^{2}\leq\frac{12\beta\Delta}{R}(\frac{N}{S})^{2/3}+\frac{\nu\eta^{2}\beta\sigma^{2}}{SK}\cdot 12\beta(N/S)^{2/3} (157)
=12βΔR(NS)2/3+νβσ2SK(12β(N/S)2/3)(19β2(N/S)4/3)\displaystyle=\frac{12\beta\Delta}{R}(\frac{N}{S})^{2/3}+\frac{\nu\beta\sigma^{2}}{SK}(12\beta(N/S)^{2/3})(\frac{1}{9\beta^{2}(N/S)^{4/3}}) (158)
12βΔR(NS)2/3+2νσ2SK\displaystyle\leq\frac{12\beta\Delta}{R}(\frac{N}{S})^{2/3}+\frac{2\nu\sigma^{2}}{SK} (159)

Now we run Algo. 5 in a repeated fashion, as follows (a code sketch of this restart loop appears at the end of this subsection):

  1. Set x^{(0)}=p_{s-1}

  2. Run Algo. 5 for R_{s} iterations

  3. Set p_{s} to the result of Algo. 5

Repeat for s stages, and let p_{0} be the initial point. Letting R_{s}=\lceil 24\kappa(\frac{N}{S})^{2/3}\rceil, this implies that

2μ(𝔼F(ps)F(x))𝔼F(ps)2μ(𝔼F(ps1)F(x))2+2νσ2SK\displaystyle 2\mu(\mathbb{E}F(p_{s})-F(x^{*}))\leq\mathbb{E}\|\nabla F(p_{s})\|^{2}\leq\frac{\mu(\mathbb{E}F(p_{s-1})-F(x^{*}))}{2}+\frac{2\nu\sigma^{2}}{SK} (160)

Which gives

𝔼F(ps)F(x)𝒪(Δexp(s)+σ2μSK)\displaystyle\mathbb{E}F(p_{s})-F(x^{*})\leq\mathcal{O}(\Delta\exp(-s)+\frac{\sigma^{2}}{\mu SK}) (161)

If the total number of rounds is RR, then

\displaystyle\mathbb{E}F(\hat{x})-F(x^{*})\leq\mathcal{O}(\Delta\exp(-\frac{R}{\kappa(N/S)^{2/3}})+\frac{\sigma^{2}}{\mu SK}) (162)
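
The restart loop above can be sketched as follows, where run_saga_stage is a hypothetical callable wrapping one run of Algo. 5 (Option II, returning a uniformly sampled iterate from the stage); it is not an implementation provided by the paper.

import math

def multistage_saga(x0, run_saga_stage, stages, kappa, N, S):
    # kappa = beta/mu; each stage warm-starts at the previous stage's output
    R_s = math.ceil(24 * kappa * (N / S) ** (2.0 / 3.0))  # rounds per stage
    p = x0
    for _ in range(stages):
        p = run_saga_stage(p, R_s)
    return p
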

D.4 SSNM

Note that our usual assumption that F_{i}(x) is \mu-strongly convex can be straightforwardly converted into the assumption that our losses are F_{i}(x)=\tilde{F}_{i}(x)+h(x), where \tilde{F}_{i}(x) is merely convex and h(x) is \mu-strongly convex (see (Zhou et al., 2019), Section 4.2).

Theorem D.5.

Suppose that client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4 and Assumption B.6. Then running Algo. 6 gives the following for returned iterate x^\hat{x}:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. If we return the final iterate and set η=12μ(N/S),τ=(N/S)ημ1+ημ\eta=\frac{1}{2\mu(N/S)},\tau=\frac{(N/S)\eta\mu}{1+\eta\mu} if (N/S)κ>34\frac{(N/S)}{\kappa}>\frac{3}{4} and η=13μ(N/S)β,τ=(N/S)ημ1+ημ\eta=\sqrt{\frac{1}{3\mu(N/S)\beta}},\tau=\frac{(N/S)\eta\mu}{1+\eta\mu} if (N/S)κ34\frac{(N/S)}{\kappa}\leq\frac{3}{4},

    𝔼F(x(R))F(x)𝒪(κΔexp(min{SN,SNκ}R)+κσ2μKS)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\mathcal{O}(\kappa\Delta\exp(-\min\{\frac{S}{N},\sqrt{\frac{S}{N\kappa}}\}R)+\frac{\kappa\sigma^{2}}{\mu KS})

D.4.1 Convergence of SSNM on Strongly Convex functions

Most of this proof follows that of (Zhou et al., 2019) Theorem 1.

First, we compute the variance of the update g(r)g^{(r)}.

𝔼[g(r)1Ni=1NF~i(yir(r))2]\displaystyle\mathbb{E}[\|g^{(r)}-\frac{1}{N}\sum_{i=1}^{N}\nabla\tilde{F}_{i}(y_{i_{r}}^{(r)})\|^{2}] (163)
𝔼1Sir𝒮rF~ir(yir(r))F~ir(ϕir(r))1Ni=1N(F~i(yi(r))F~i(ϕi(r)))2+νσ2KS\displaystyle\leq\mathbb{E}\|\frac{1}{S}\sum_{i_{r}\in\mathcal{S}_{r}}\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)})-\nabla\tilde{F}_{i_{r}}(\phi_{i_{r}}^{(r)})-\frac{1}{N}\sum_{i=1}^{N}(\nabla\tilde{F}_{i}(y_{i}^{(r)})-\nabla\tilde{F}_{i}(\phi_{i}^{(r)}))\|^{2}+\frac{\nu\sigma^{2}}{KS} (164)
𝔼1Sir𝒮rF~ir(yir(r))F~ir(ϕir(r))2+νσ2KS\displaystyle\leq\mathbb{E}\|\frac{1}{S}\sum_{i_{r}\in\mathcal{S}_{r}}\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)})-\nabla\tilde{F}_{i_{r}}(\phi_{i_{r}}^{(r)})\|^{2}+\frac{\nu\sigma^{2}}{KS} (165)
2β𝔼[1Sir𝒮rF~ir(ϕir(r))F~ir(yir(r))F~ir(yir(r)),ϕir(r)yir(r)]+νσ2KS\displaystyle\leq 2\beta\mathbb{E}[\frac{1}{S}\sum_{i_{r}\in\mathcal{S}_{r}}\tilde{F}_{i_{r}}(\phi_{i_{r}}^{(r)})-\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)})-\langle\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)}),\phi_{i_{r}}^{(r)}-y_{i_{r}}^{(r)}\rangle]+\frac{\nu\sigma^{2}}{KS} (166)
=2β[1Ni=1NF~i(ϕi(r))F~i(yi(r))1Ni=1NF~i(yi(r)),ϕi(r)yi(r)]+νσ2KS\displaystyle=2\beta[\frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{i}(\phi_{i}^{(r)})-\tilde{F}_{i}(y_{i}^{(r)})-\frac{1}{N}\sum_{i=1}^{N}\langle\nabla\tilde{F}_{i}(y_{i}^{(r)}),\phi_{i}^{(r)}-y_{i}^{(r)}\rangle]+\frac{\nu\sigma^{2}}{KS} (167)

Here \nu is an absolute constant. In the first inequality we separated out the gradient variance, in the second we used \text{Var}(X)\leq\mathbb{E}[X^{2}], in the third we applied Assumption B.4, and in the fourth we took expectation with respect to the sampled clients. From convexity we have that

F~ir(yir(r))F~ir(x)\displaystyle\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)})-\tilde{F}_{i_{r}}(x^{*}) (168)
F~ir(yir(r)),yir(r)x\displaystyle\leq\langle\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)}),y_{i_{r}}^{(r)}-x^{*}\rangle (169)
=1ττF~ir(yir(r)),ϕir(r)yir(r)+F~ir(yir(r)),x(r)x\displaystyle=\frac{1-\tau}{\tau}\langle\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)}),\phi_{i_{r}}^{(r)}-y_{i_{r}}^{(r)}\rangle+\langle\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)}),x^{(r)}-x^{*}\rangle (170)
=1ττF~ir(yir(r)),ϕir(r)yir(r)+F~ir(yir(r))g(r),x(r)x\displaystyle=\frac{1-\tau}{\tau}\langle\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)}),\phi_{i_{r}}^{(r)}-y_{i_{r}}^{(r)}\rangle+\langle\nabla\tilde{F}_{i_{r}}(y_{i_{r}}^{(r)})-g^{(r)},x^{(r)}-x^{*}\rangle (171)
+g(r),x(r)x(r+1)+g(r),x(r+1)x\displaystyle+\langle g^{(r)},x^{(r)}-x^{(r+1)}\rangle+\langle g^{(r)},x^{(r+1)}-x^{*}\rangle (172)

where the first equality comes from the definition y_{i_{r}}^{(r)}=\tau x^{(r)}+(1-\tau)\phi_{i_{r}}^{(r)}. Taking expectation with respect to the sampled clients, we have

1Ni=1NF~i(yi(r))F~(x)\displaystyle\frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)})-\tilde{F}(x^{*}) 1ττNi=1NF~i(yi(r)),ϕi(r)yi(r)+𝔼𝒮rg(r),x(r)x(r+1)\displaystyle\leq\frac{1-\tau}{\tau N}\sum_{i=1}^{N}\langle\nabla\tilde{F}_{i}(y_{i}^{(r)}),\phi_{i}^{(r)}-y_{i}^{(r)}\rangle+\mathbb{E}_{\mathcal{S}_{r}}\langle g^{(r)},x^{(r)}-x^{(r+1)}\rangle (173)
+𝔼𝒮rg(r),x(r+1)x\displaystyle+\mathbb{E}_{\mathcal{S}_{r}}\langle g^{(r)},x^{(r+1)}-x^{*}\rangle (174)

For 𝔼𝒮rg(r),x(r)x(r+1)\mathbb{E}_{\mathcal{S}_{r}}\langle g^{(r)},x^{(r)}-x^{(r+1)}\rangle, we can employ smoothness at (ϕIr(r+1),yIr(r))(\phi_{I_{r}}^{(r+1)},y_{I_{r}}^{(r)}), which holds for any Ir𝒮rI_{r}\in\mathcal{S}_{r}^{\prime}:

F~Ir(ϕIr(r+1))F~Ir(yIr(r))F~Ir(yIr(r)),ϕIr(r+1)yIr(r)+β2ϕIr(r+1)yIr(r)2\displaystyle\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})-\tilde{F}_{I_{r}}(y_{I_{r}}^{(r)})\leq\langle\nabla\tilde{F}_{I_{r}}(y_{I_{r}}^{(r)}),\phi_{I_{r}}^{(r+1)}-y_{I_{r}}^{(r)}\rangle+\frac{\beta}{2}\|\phi_{I_{r}}^{(r+1)}-y_{I_{r}}^{(r)}\|^{2} (175)

using ϕIr(r+1)=τx(r+1)+(1τ)ϕIr(r)\phi^{(r+1)}_{I_{r}}=\tau x^{(r+1)}+(1-\tau)\phi^{(r)}_{I_{r}} and yIr(r)=τx(r)+(1τ)ϕIr(r)y_{I_{r}}^{(r)}=\tau x^{(r)}+(1-\tau)\phi_{I_{r}}^{(r)} (though the second is never explicitly computed and only implicitly exists)

F~Ir(ϕIr(r+1))F~Ir(yIr(r))τF~Ir(yIr(r)),x(r+1)x(r)+βτ22x(r+1)x(r)2\displaystyle\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})-\tilde{F}_{I_{r}}(y_{I_{r}}^{(r)})\leq\tau\langle\nabla\tilde{F}_{I_{r}}(y_{I_{r}}^{(r)}),x^{(r+1)}-x^{(r)}\rangle+\frac{\beta\tau^{2}}{2}\|x^{(r+1)}-x^{(r)}\|^{2} (176)

Taking expectation over 𝒮r\mathcal{S}_{r}^{\prime} and using ϕIr(r+1)=τx(r+1)+(1τ)ϕIr(r)\phi^{(r+1)}_{I_{r}}=\tau x^{(r+1)}+(1-\tau)\phi^{(r)}_{I_{r}} and yIr(r)=τx(r)+(1τ)ϕIr(r)y_{I_{r}}^{(r)}=\tau x^{(r)}+(1-\tau)\phi_{I_{r}}^{(r)} we see that

𝔼𝒮r[1SIr𝒮rF~Ir(ϕIr(r+1))]1Ni=1NF~i(yi(r))\displaystyle\mathbb{E}_{\mathcal{S}_{r}^{\prime}}[\frac{1}{S}\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})]-\frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)}) τ1Ni=1NF~i(yi(r)),x(r+1)x(r)\displaystyle\leq\tau\langle\frac{1}{N}\sum_{i=1}^{N}\nabla\tilde{F}_{i}(y_{i}^{(r)}),x^{(r+1)}-x^{(r)}\rangle (177)
+βτ22x(r+1)x(r)2\displaystyle+\frac{\beta\tau^{2}}{2}\|x^{(r+1)}-x^{(r)}\|^{2} (178)

and rearranging,

g(r),x(r)x(r+1)\displaystyle\langle g^{(r)},x^{(r)}-x^{(r+1)}\rangle (179)
1τNi=1NF~i(yi(r))1τS𝔼𝒮r[Ir𝒮rF~Ir(ϕIr(r+1))]\displaystyle\leq\frac{1}{\tau N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)})-\frac{1}{\tau S}\mathbb{E}_{\mathcal{S}_{r}^{\prime}}[\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})] (180)
+1Ni=1NF~i(yi(r))g(r),x(r+1)x(r)+βτ2x(r+1)x(r)2\displaystyle+\langle\frac{1}{N}\sum_{i=1}^{N}\nabla\tilde{F}_{i}(y_{i}^{(r)})-g^{(r)},x^{(r+1)}-x^{(r)}\rangle+\frac{\beta\tau}{2}\|x^{(r+1)}-x^{(r)}\|^{2} (181)

Substituting this into Eq. 173 after taking expectation over 𝒮r\mathcal{S}_{r}, and observing that from (Zhou et al., 2019, Lemma 2) we have identity

g(r),x(r+1)u\displaystyle\langle g^{(r)},x^{(r+1)}-u\rangle 12ηx(r+1)x(r)2+12ηx(r)u2\displaystyle\leq-\frac{1}{2\eta}\|x^{(r+1)}-x^{(r)}\|^{2}+\frac{1}{2\eta}\|x^{(r)}-u\|^{2} (182)
1+ημ2ηx(r+1)u2+h(u)h(x(r+1))\displaystyle-\frac{1+\eta\mu}{2\eta}\|x^{(r+1)}-u\|^{2}+h(u)-h(x^{(r+1)}) (183)

so with u=xu=x^{*}, we get

1Ni=1NF~i(yi(r))F~(x)\displaystyle\frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)})-\tilde{F}(x^{*}) (184)
1ττNi=1NF~i(yi(r)),ϕi(r)yi(r)+1τNi=1NF~i(yi(r))\displaystyle\leq\frac{1-\tau}{\tau N}\sum_{i=1}^{N}\langle\nabla\tilde{F}_{i}(y_{i}^{(r)}),\phi_{i}^{(r)}-y_{i}^{(r)}\rangle+\frac{1}{\tau N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)}) (185)
1τS𝔼𝒮r,𝒮r[Ir𝒮rF~Ir(ϕIr(r+1))]\displaystyle-\frac{1}{\tau S}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})] (186)
+𝔼𝒮r1Ni=1NF~i(yi(r))g(r),x(r+1)x(r)+βτ2𝔼𝒮rx(r+1)x(r)2\displaystyle\ \ \ +\mathbb{E}_{\mathcal{S}_{r}}\langle\frac{1}{N}\sum_{i=1}^{N}\nabla\tilde{F}_{i}(y_{i}^{(r)})-g^{(r)},x^{(r+1)}-x^{(r)}\rangle+\frac{\beta\tau}{2}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{(r)}\|^{2} (187)
12η𝔼𝒮rx(r+1)x(r)2+12ηx(r)x21+ημ2η𝔼𝒮rx(r+1)x2\displaystyle\ \ \ -\frac{1}{2\eta}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{(r)}\|^{2}+\frac{1}{2\eta}\|x^{(r)}-x^{*}\|^{2}-\frac{1+\eta\mu}{2\eta}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{*}\|^{2} (188)
+h(x)𝔼𝒮rh(x(r+1))\displaystyle\ \ \ +h(x^{*})-\mathbb{E}_{\mathcal{S}_{r}}h(x^{(r+1)}) (189)

We use the constraint that βτ1ηβτ1τ\beta\tau\leq\frac{1}{\eta}-\frac{\beta\tau}{1-\tau} plus Young’s inequality a,b12ca2+c2b2\langle a,b\rangle\leq\frac{1}{2c}\|a\|^{2}+\frac{c}{2}\|b\|^{2} with c=βτ1τc=\frac{\beta\tau}{1-\tau} on 𝔼𝒮r1Ni=1NF~i(yi(r))g(r),x(r+1)x(r)\mathbb{E}_{\mathcal{S}_{r}}\langle\frac{1}{N}\sum_{i=1}^{N}\nabla\tilde{F}_{i}(y_{i}^{(r)})-g^{(r)},x^{(r+1)}-x^{(r)}\rangle to get

1Ni=1NF~i(yi(r))F~(x)\displaystyle\frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)})-\tilde{F}(x^{*}) (190)
1ττNi=1NF~i(yi(r)),ϕi(r)yi(r)\displaystyle\leq\frac{1-\tau}{\tau N}\sum_{i=1}^{N}\langle\nabla\tilde{F}_{i}(y_{i}^{(r)}),\phi_{i}^{(r)}-y_{i}^{(r)}\rangle (191)
+1τNi=1NF~i(yi(r))1τS𝔼𝒮r,𝒮r[Ir𝒮rF~Ir(ϕIr(r+1))]\displaystyle\ \ \ +\frac{1}{\tau N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)})-\frac{1}{\tau S}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})] (192)
+1τ2βτ𝔼𝒮r1Ni=1NF~i(yi(r))g(r)2+12ηx(r)x2\displaystyle\ \ \ +\frac{1-\tau}{2\beta\tau}\mathbb{E}_{\mathcal{S}_{r}}\|\frac{1}{N}\sum_{i=1}^{N}\nabla\tilde{F}_{i}(y_{i}^{(r)})-g^{(r)}\|^{2}+\frac{1}{2\eta}\|x^{(r)}-x^{*}\|^{2} (193)
1+ημ2η𝔼𝒮rx(r+1)x2+h(x)𝔼𝒮rh(x(r+1))\displaystyle\ \ \ -\frac{1+\eta\mu}{2\eta}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{*}\|^{2}+h(x^{*})-\mathbb{E}_{\mathcal{S}_{r}}h(x^{(r+1)}) (194)

using the bound on variance, we get

1Ni=1NF~i(yi(r))F~(x)\displaystyle\frac{1}{N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)})-\tilde{F}(x^{*}) (195)
1τNi=1NF~i(yi(r))1τS𝔼𝒮r,𝒮r[Ir𝒮rF~Ir(ϕIr(r+1))]+1ττNi=1NF~i(ϕi(r))F~i(yi(r))\displaystyle\leq\frac{1}{\tau N}\sum_{i=1}^{N}\tilde{F}_{i}(y_{i}^{(r)})-\frac{1}{\tau S}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})]+\frac{1-\tau}{\tau N}\sum_{i=1}^{N}\tilde{F}_{i}(\phi_{i}^{(r)})-\tilde{F}_{i}(y_{i}^{(r)}) (196)
+12ηx(r)x21+ημ2η𝔼𝒮rx(r+1)x2\displaystyle\ \ \ +\frac{1}{2\eta}\|x^{(r)}-x^{*}\|^{2}-\frac{1+\eta\mu}{2\eta}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{*}\|^{2} (197)
+h(x)𝔼𝒮rh(x(r+1))+(1τ)νσ22βτKS\displaystyle\ \ \ +h(x^{*})-\mathbb{E}_{\mathcal{S}_{r}}h(x^{(r+1)})+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KS} (198)

combining terms,

1τS𝔼𝒮r,𝒮r[Ir𝒮rF~Ir(ϕIr(r+1))]F(x)\displaystyle\frac{1}{\tau S}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}\tilde{F}_{I_{r}}(\phi_{I_{r}}^{(r+1)})]-F(x^{*}) (199)
1ττNi=1NF~i(ϕi(r))+12ηx(r)x21+ημ2η𝔼𝒮rx(r+1)x2\displaystyle\leq\frac{1-\tau}{\tau N}\sum_{i=1}^{N}\tilde{F}_{i}(\phi_{i}^{(r)})+\frac{1}{2\eta}\|x^{(r)}-x^{*}\|^{2}-\frac{1+\eta\mu}{2\eta}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{*}\|^{2} (200)
𝔼𝒮rh(x(r+1))+(1τ)νσ22βτKS\displaystyle\ \ \ -\mathbb{E}_{\mathcal{S}_{r}}h(x^{(r+1)})+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KS} (201)

Using convexity of hh and ϕIk(r+1)=τx(r+1)+(1τ)ϕIk(r)\phi_{I_{k}}^{(r+1)}=\tau x^{(r+1)}+(1-\tau)\phi_{I_{k}}^{(r)} for Ik𝒮rI_{k}\in\mathcal{S}_{r}^{\prime},

h(ϕIk(r+1))τh(x(r+1))+(1τ)h(ϕIk(r))\displaystyle h(\phi_{I_{k}}^{(r+1)})\leq\tau h(x^{(r+1)})+(1-\tau)h(\phi_{I_{k}}^{(r)}) (202)

After taking expectation with respect to 𝒮r\mathcal{S}_{r} and 𝒮r\mathcal{S}_{r}^{\prime},

𝔼𝒮r[h(x(r+1))]1ττNi=1Nh(ϕi(r))1τ𝔼𝒮r,𝒮r[1SIr𝒮rh(ϕIk(r+1))]\displaystyle-\mathbb{E}_{\mathcal{S}_{r}}[h(x^{(r+1)})]\leq\frac{1-\tau}{\tau N}\sum_{i=1}^{N}h(\phi_{i}^{(r)})-\frac{1}{\tau}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[\frac{1}{S}\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}h(\phi_{I_{k}}^{(r+1)})] (203)

and plugging this back in, multiplying both sides by S/NS/N, and adding 1τN𝔼𝒮r[i𝒮r(Fi(ϕi(r))Fi(x))]\frac{1}{\tau N}\mathbb{E}_{\mathcal{S}_{r}^{\prime}}[\sum_{i\notin\mathcal{S}_{r}^{\prime}}(F_{i}(\phi_{i}^{(r)})-F_{i}(x^{*}))] to both sides,

1τ𝔼𝒮r,𝒮r[1Ni=1NFi(ϕi(r+1))Fi(x)]\displaystyle\frac{1}{\tau}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[\frac{1}{N}\sum_{i=1}^{N}F_{i}(\phi_{i}^{(r+1)})-F_{i}(x^{*})] (204)
(1τ)SτN(1Ni=1NFi(ϕi(r))Fi(x))+1τN𝔼𝒮r[i𝒮r(Fi(ϕi(r))Fi(x))]\displaystyle\leq\frac{(1-\tau)S}{\tau N}(\frac{1}{N}\sum_{i=1}^{N}F_{i}(\phi_{i}^{(r)})-F_{i}(x^{*}))+\frac{1}{\tau N}\mathbb{E}_{\mathcal{S}_{r}^{\prime}}[\sum_{i\notin\mathcal{S}_{r}^{\prime}}(F_{i}(\phi_{i}^{(r)})-F_{i}(x^{*}))] (205)
+S2ηNx(r)x2(1+ημ)S2ηN𝔼𝒮rx(r+1)x2+(1τ)νσ22βτKN\displaystyle\ \ \ +\frac{S}{2\eta N}\|x^{(r)}-x^{*}\|^{2}-\frac{(1+\eta\mu)S}{2\eta N}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{*}\|^{2}+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KN} (206)

Observe that any given client index is chosen with probability S/NS/N, so

1τN𝔼𝒮r[i𝒮r(Fi(ϕi(r))Fi(x))]=NSτN(1Ni=1NFi(ϕi(r))Fi(x))\displaystyle\frac{1}{\tau N}\mathbb{E}_{\mathcal{S}_{r}^{\prime}}[\sum_{i\notin\mathcal{S}_{r}^{\prime}}(F_{i}(\phi_{i}^{(r)})-F_{i}(x^{*}))]=\frac{N-S}{\tau N}(\frac{1}{N}\sum_{i=1}^{N}F_{i}(\phi_{i}^{(r)})-F_{i}(x^{*})) (207)

which implies

1τ𝔼𝒮r,𝒮r[1Ni=1NFi(ϕi(r+1))Fi(x)]\displaystyle\frac{1}{\tau}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[\frac{1}{N}\sum_{i=1}^{N}F_{i}(\phi_{i}^{(r+1)})-F_{i}(x^{*})] (208)
1τSNτ(1Ni=1NFi(ϕi(r))Fi(x))\displaystyle\leq\frac{1-\frac{\tau S}{N}}{\tau}(\frac{1}{N}\sum_{i=1}^{N}F_{i}(\phi_{i}^{(r)})-F_{i}(x^{*})) (209)
+S2ηNx(r)x2(1+ημ)S2ηN𝔼𝒮rx(r+1)x2+(1τ)νσ22βτKN\displaystyle\ \ \ +\frac{S}{2\eta N}\|x^{(r)}-x^{*}\|^{2}-\frac{(1+\eta\mu)S}{2\eta N}\mathbb{E}_{\mathcal{S}_{r}}\|x^{(r+1)}-x^{*}\|^{2}+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KN} (210)

To complete our Lyapunov function so the potential is always positive, we need another term:

1Ni=1NFi(x),ϕi(r+1)x\displaystyle-\frac{1}{N}\sum_{i=1}^{N}\langle\nabla F_{i}(x^{*}),\phi_{i}^{(r+1)}-x^{*}\rangle (211)
=1NIr𝒮rFIr(x),ϕIr(r+1)x1Nj𝒮rFj(x),ϕj(r)x\displaystyle=-\frac{1}{N}\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}\langle\nabla F_{I_{r}}(x^{*}),\phi^{(r+1)}_{I_{r}}-x^{*}\rangle-\frac{1}{N}\sum_{j\notin\mathcal{S}_{r}^{\prime}}\langle\nabla F_{j}(x^{*}),\phi^{(r)}_{j}-x^{*}\rangle (212)
=Ir𝒮rτNFIr(x),x(r+1)x+τNFIr(x),ϕIr(r)x\displaystyle=\sum_{I_{r}\in\mathcal{S}_{r}^{\prime}}-\frac{\tau}{N}\langle\nabla F_{I_{r}}(x^{*}),x^{(r+1)}-x^{*}\rangle+\frac{\tau}{N}\langle\nabla F_{I_{r}}(x^{*}),\phi_{I_{r}}^{(r)}-x^{*}\rangle (213)
1Ni=1NFi(x),ϕi(r)x\displaystyle\ \ \ -\frac{1}{N}\sum_{i=1}^{N}\langle\nabla F_{i}(x^{*}),\phi_{i}^{(r)}-x^{*}\rangle (214)

Taking expectation with respect to 𝒮r,𝒮r\mathcal{S}_{r},\mathcal{S}_{r}^{\prime},

𝔼𝒮r,𝒮r[1Ni=1NFi(x),ϕi(r+1)x]=(1τSN)(1Ni=1NFi(x),ϕi(r)x)\displaystyle\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[-\frac{1}{N}\sum_{i=1}^{N}\langle\nabla F_{i}(x^{*}),\phi_{i}^{(r+1)}-x^{*}\rangle]=-(1-\frac{\tau S}{N})(\frac{1}{N}\sum_{i=1}^{N}\langle\nabla F_{i}(x^{*}),\phi_{i}^{(r)}-x^{*}\rangle) (215)

Let Br:=1Ni=1NFi(ϕi(r))F(x)1Ni=1NFi(x),ϕi(r)xB_{r}:=\frac{1}{N}\sum_{i=1}^{N}F_{i}(\phi_{i}^{(r)})-F(x^{*})-\frac{1}{N}\sum_{i=1}^{N}\langle\nabla F_{i}(x^{*}),\phi_{i}^{(r)}-x^{*}\rangle and Pr:=x(r)x2P_{r}:=\|x^{(r)}-x^{*}\|^{2}, then we can write

1τ𝔼𝒮r,𝒮r[Br+1]+(1+ημ)S2ηN𝔼𝒮r[Pr+1]1τSNτBr+S2ηNPr+(1τ)νσ22βτKN\displaystyle\frac{1}{\tau}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[B_{r+1}]+\frac{(1+\eta\mu)S}{2\eta N}\mathbb{E}_{\mathcal{S}_{r}}[P_{r+1}]\leq\frac{1-\frac{\tau S}{N}}{\tau}B_{r}+\frac{S}{2\eta N}P_{r}+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KN} (216)

Case 1: Suppose that (N/S)κ34\frac{(N/S)}{\kappa}\leq\frac{3}{4}, then choosing η=13μ(N/S)β\eta=\sqrt{\frac{1}{3\mu(N/S)\beta}}, τ=(N/S)ημ1+ημ=(N/S)3κ1+13(N/S)κ<12\tau=\frac{(N/S)\eta\mu}{1+\eta\mu}=\frac{\sqrt{\frac{(N/S)}{3\kappa}}}{1+\sqrt{\frac{1}{3(N/S)\kappa}}}<\frac{1}{2}, we evaluate the parameter constraint βτ1ηβτ1τ\beta\tau\leq\frac{1}{\eta}-\frac{\beta\tau}{1-\tau}:

βτ1ηβτ1τ(1+11τ)τ1βη2τ1τ(N/S)3κ1+13(N/S)κ3(N/S)κ\displaystyle\beta\tau\leq\frac{1}{\eta}-\frac{\beta\tau}{1-\tau}\implies(1+\frac{1}{1-\tau})\tau\leq\frac{1}{\beta\eta}\implies\frac{2-\tau}{1-\tau}\frac{\sqrt{\frac{(N/S)}{3\kappa}}}{1+\sqrt{\frac{1}{3(N/S)\kappa}}}\leq\sqrt{\frac{3(N/S)}{\kappa}} (217)

which shows our constraint is satisfied. We also know that

1τ(1+ημ)=1τSNτ=1(N/S)ημ\displaystyle\frac{1}{\tau(1+\eta\mu)}=\frac{1-\frac{\tau S}{N}}{\tau}=\frac{1}{(N/S)\eta\mu} (218)

So we can write

1(N/S)ημ𝔼𝒮r,𝒮r[Br+1]+12η(N/S)𝔼𝒮r[Pr+1]\displaystyle\frac{1}{(N/S)\eta\mu}\mathbb{E}_{\mathcal{S}_{r},\mathcal{S}_{r}^{\prime}}[B_{r+1}]+\frac{1}{2\eta(N/S)}\mathbb{E}_{\mathcal{S}_{r}}[P_{r+1}] (219)
(1+ημ)1(1(N/S)ημBr+12η(N/S)Pr)+(1τ)νσ22βτKN\displaystyle\leq(1+\eta\mu)^{-1}(\frac{1}{(N/S)\eta\mu}B_{r}+\frac{1}{2\eta(N/S)}P_{r})+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KN} (220)

Telescoping the contraction and taking expectation with respect to all randomness, we have

1(N/S)ημ𝔼[BR]+12η(N/S)𝔼[PR]\displaystyle\frac{1}{(N/S)\eta\mu}\mathbb{E}[B_{R}]+\frac{1}{2\eta(N/S)}\mathbb{E}[P_{R}] (221)
(1+ημ)R(1(N/S)ημ𝔼B0+12η(N/S)𝔼P0)+(1τ)νσ22βτKN(1+ημημ)\displaystyle\leq(1+\eta\mu)^{-R}(\frac{1}{(N/S)\eta\mu}\mathbb{E}B_{0}+\frac{1}{2\eta(N/S)}\mathbb{E}P_{0})+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KN}(\frac{1+\eta\mu}{\eta\mu}) (222)
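
For completeness, the coefficient multiplying the variance term above comes from bounding the geometric series that arises when telescoping (a small worked step, not an additional assumption):

\displaystyle\sum_{j=0}^{R-1}(1+\eta\mu)^{-j}\;\leq\;\sum_{j=0}^{\infty}(1+\eta\mu)^{-j}\;=\;\frac{1+\eta\mu}{\eta\mu}.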

Note that B0=F(x(0))F(x)B_{0}=F(x^{(0)})-F(x^{*}) and 𝔼BR0\mathbb{E}B_{R}\geq 0 by convexity. Next, we calculate the coefficient of the variance term.

1ττ1+ημημ\displaystyle\frac{1-\tau}{\tau}\frac{1+\eta\mu}{\eta\mu} =(1τ1)(1+1ημ)=(SN(1+1ημ)1)(1+1ημ)\displaystyle=(\frac{1}{\tau}-1)(1+\frac{1}{\eta\mu})=(\frac{S}{N}(1+\frac{1}{\eta\mu})-1)(1+\frac{1}{\eta\mu}) (223)
=𝒪((SN(1+κ(NS))1)(1+κ(NS)))\displaystyle=\mathcal{O}((\frac{S}{N}(1+\sqrt{\kappa}(\sqrt{\frac{N}{S}}))-1)(1+\sqrt{\kappa}(\sqrt{\frac{N}{S}}))) (224)
=𝒪((SN+κ(SN)1/21)(1+κ(NS)1/2))\displaystyle=\mathcal{O}((\frac{S}{N}+\sqrt{\kappa}(\frac{S}{N})^{1/2}-1)(1+\sqrt{\kappa}(\frac{N}{S})^{1/2})) (225)
=𝒪(κ)\displaystyle=\mathcal{O}(\kappa) (226)

Substituting the parameter choices and using strong convexity,

𝔼x(R)x2(1+13(N/S)κ)R(2μΔ+D2)+𝒪(σ2μ2KN)\displaystyle\mathbb{E}\|x^{(R)}-x^{*}\|^{2}\leq(1+\sqrt{\frac{1}{3(N/S)\kappa}})^{-R}(\frac{2}{\mu}\Delta+D^{2})+\mathcal{O}(\frac{\sigma^{2}}{\mu^{2}KN}) (227)

Case 2: On the other hand, we can have (N/S)κ>34\frac{(N/S)}{\kappa}>\frac{3}{4} and choose η=12μ(N/S)\eta=\frac{1}{2\mu(N/S)}, τ=(N/S)ημ1+ημ=(1/2)1+12(N/S)<12\tau=\frac{(N/S)\eta\mu}{1+\eta\mu}=\frac{(1/2)}{1+\frac{1}{2(N/S)}}<\frac{1}{2}. One can check that the constraint βτ1ηβτ1τ\beta\tau\leq\frac{1}{\eta}-\frac{\beta\tau}{1-\tau} is satisfied.

After telescoping and taking expectation,

2𝔼[BR]+12η(N/S)𝔼[PR]\displaystyle 2\mathbb{E}[B_{R}]+\frac{1}{2\eta(N/S)}\mathbb{E}[P_{R}] (228)
(1+ημ)R(2𝔼B0+12η(N/S)𝔼P0)+(1τ)νσ22βτKN(1+ημημ)\displaystyle\leq(1+\eta\mu)^{-R}(2\mathbb{E}B_{0}+\frac{1}{2\eta(N/S)}\mathbb{E}P_{0})+\frac{(1-\tau)\nu\sigma^{2}}{2\beta\tau KN}(\frac{1+\eta\mu}{\eta\mu}) (229)

Next, we calculate the coefficient of the variance term.

1ττ1+ημημ\displaystyle\frac{1-\tau}{\tau}\frac{1+\eta\mu}{\eta\mu} =(1τ1)(1+1ημ)=(SN(1+1ημ)1)(1+1ημ)\displaystyle=(\frac{1}{\tau}-1)(1+\frac{1}{\eta\mu})=(\frac{S}{N}(1+\frac{1}{\eta\mu})-1)(1+\frac{1}{\eta\mu}) (230)
=𝒪((SN(1+NS)1)(1+NS))\displaystyle=\mathcal{O}((\frac{S}{N}(1+\frac{N}{S})-1)(1+\frac{N}{S})) (231)
=𝒪(NS)\displaystyle=\mathcal{O}(\frac{N}{S}) (232)

which gives us

𝔼x(R)x2(1+12(N/S))R(2μΔ+D2)+𝒪(σ2βμKS)\displaystyle\mathbb{E}\|x^{(R)}-x^{*}\|^{2}\leq(1+\frac{1}{2(N/S)})^{-R}(\frac{2}{\mu}\Delta+D^{2})+\mathcal{O}(\frac{\sigma^{2}}{\beta\mu KS}) (233)

Altogether, choosing the parameters as stated in the two cases and converting the distance bounds into function-value bounds via smoothness, we have:

𝔼F(x(R))F(x)𝒪(κΔexp(min{SN,SNκ}R)+κσ2μKS)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\mathcal{O}(\kappa\Delta\exp(-\min\{\frac{S}{N},\sqrt{\frac{S}{N\kappa}}\}R)+\frac{\kappa\sigma^{2}}{\mu KS}) (234)

Appendix E Proofs for local update methods

E.1 FedAvg

Theorem E.1.

Suppose that the client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4, Assumption B.5, and Assumption B.6. Then running Algo. 4 gives the following guarantees for the returned iterate x^\hat{x}:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. If we return the final iterate, set η=1β\eta=\frac{1}{\beta}, and sample SS clients per round arbitrarily,

    𝔼F(x(R))F(x)𝒪(Δexp(RKκ)+ζ2μ+σ2μK)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\mathcal{O}(\Delta\exp(-\frac{R\sqrt{K}}{\kappa})+\frac{\zeta^{2}}{\mu}+\frac{\sigma^{2}}{\mu\sqrt{K}})

    Further, if there is no gradient variance and only one client ii is sampled for the entire algorithm,

    x(r)x2𝒪(x(0)x2+x(0)xi2) for any 0rR\displaystyle\|x^{(r)}-x^{*}\|^{2}\leq\mathcal{O}(\|x^{(0)}-x^{*}\|^{2}+\|x^{(0)}-x^{*}_{i}\|^{2})\text{ for any }0\leq r\leq R
  • General convex: FiF_{i}’s satisfy Assumption B.2. If we return the last iterate after running the algorithm on Fμ(x)=F(x)+μ2x(0)x2F_{\mu}(x)=F(x)+\frac{\mu}{2}\|x^{(0)}-x\|^{2} with μ=Θ(max{βR2log2(e2+R2),ζD,σDK1/4})\mu=\Theta(\max\{\frac{\beta}{R^{2}}\log^{2}(e^{2}+R^{2}),\frac{\zeta}{D},\frac{\sigma}{DK^{1/4}}\}), and η=1β+μ\eta=\frac{1}{\beta+\mu},

    𝔼F(x(R))F(x)𝒪~(βD2KR+σDK1/4+ζD)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\tilde{\mathcal{O}}(\frac{\beta D^{2}}{\sqrt{K}R}+\frac{\sigma D}{K^{1/4}}+\zeta D)

    and for any 0rR0\leq r\leq R,

    𝔼x(r)x2𝒪~(D2)\displaystyle\mathbb{E}\|x^{(r)}-x^{*}\|^{2}\leq\tilde{\mathcal{O}}(D^{2})
  • PL condition: FiF_{i}’s satisfy Assumption B.3 for μ>0\mu>0. If we return the final iterate, set η=1β\eta=\frac{1}{\beta}, and sample one client per round,

    𝔼F(x(R))F(x)𝒪(Δexp(RKκ)+ζ22μ+σ22μK)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\mathcal{O}(\Delta\exp(-\frac{R\sqrt{K}}{\kappa})+\frac{\zeta^{2}}{2\mu}+\frac{\sigma^{2}}{2\mu\sqrt{K}})

E.1.1 Convergence of FedAvg for Strongly Convex functions

Proof.

By smoothness of F (Assumption B.4), we have for k{0,,K1}k\in\{0,\dots,\sqrt{K}-1\}

F(xi,k+1(r))F(xi,k(r))ηF(xi,k(r)),gi,k(r)+βη22gi,k(r)2\displaystyle F(x^{(r)}_{i,k+1})-F(x^{(r)}_{i,k})\leq-\eta\langle\nabla F(x^{(r)}_{i,k}),g_{i,k}^{(r)}\rangle+\frac{\beta\eta^{2}}{2}\|g_{i,k}^{(r)}\|^{2} (235)

Using the fact that for any a,ba,b we have 2ab=(ab)2a2b2-2ab=(a-b)^{2}-a^{2}-b^{2},

F(xi,k+1(r))F(xi,k(r))η2F(xi,k(r))2+βη2η2gi,k(r)2+η2gi,k(r)F(xi,k(r))2\displaystyle F(x^{(r)}_{i,k+1})-F(x^{(r)}_{i,k})\leq-\frac{\eta}{2}\|\nabla F(x^{(r)}_{i,k})\|^{2}+\frac{\beta\eta^{2}-\eta}{2}\|g_{i,k}^{(r)}\|^{2}+\frac{\eta}{2}\|g_{i,k}^{(r)}-\nabla F(x^{(r)}_{i,k})\|^{2} (236)

Letting η1β\eta\leq\frac{1}{\beta},

F(xi,k+1(r))F(xi,k(r))η2F(xi,k(r))2+η2gi,k(r)F(xi,k(r))2\displaystyle F(x^{(r)}_{i,k+1})-F(x^{(r)}_{i,k})\leq-\frac{\eta}{2}\|\nabla F(x^{(r)}_{i,k})\|^{2}+\frac{\eta}{2}\|g_{i,k}^{(r)}-\nabla F(x^{(r)}_{i,k})\|^{2} (237)

Conditioning on everything up to the kk-th step of the rr-th round,

𝔼r,kF(xi,k+1(r))F(xi,k(r))\displaystyle\mathbb{E}_{r,k}F(x^{(r)}_{i,k+1})-F(x^{(r)}_{i,k}) η2F(xi,k(r))2+η2𝔼r,kgi,k(r)F(xi,k(r))2\displaystyle\leq-\frac{\eta}{2}\|\nabla F(x^{(r)}_{i,k})\|^{2}+\frac{\eta}{2}\mathbb{E}_{r,k}\|g_{i,k}^{(r)}-\nabla F(x^{(r)}_{i,k})\|^{2} (238)
η2F(xi,k(r))2+ηζ22+ησ22K\displaystyle\leq-\frac{\eta}{2}\|\nabla F(x^{(r)}_{i,k})\|^{2}+\frac{\eta\zeta^{2}}{2}+\frac{\eta\sigma^{2}}{2\sqrt{K}} (239)

where the last step used the fact that 𝔼[X2]=Var(X)+𝔼[X]2\mathbb{E}[X^{2}]=\text{Var}(X)+\mathbb{E}[X]^{2}, the assumption on heterogeneity (Assumption B.5), and the assumption on gradient variance (Assumption B.6), along with the fact that gi,k(r)g_{i,k}^{(r)} is an average over K\sqrt{K} client gradient queries. Next, using μ\mu-strong convexity of FF (Assumption B.1),

𝔼r,kF(xi,k+1(r))F(xi,k(r))ημ(F(xi,k(r))F(x))+ηζ22+ησ22K\displaystyle\mathbb{E}_{r,k}F(x^{(r)}_{i,k+1})-F(x^{(r)}_{i,k})\leq-\eta\mu(F(x^{(r)}_{i,k})-F(x^{*}))+\frac{\eta\zeta^{2}}{2}+\frac{\eta\sigma^{2}}{2\sqrt{K}} (240)

which after taking full expectation gives

𝔼F(xi,k+1(r))F(x)(1ημ)(𝔼F(xi,k(r))F(x))+ηζ22+ησ22K\displaystyle\mathbb{E}F(x^{(r)}_{i,k+1})-F(x^{*})\leq(1-\eta\mu)(\mathbb{E}F(x^{(r)}_{i,k})-F(x^{*}))+\frac{\eta\zeta^{2}}{2}+\frac{\eta\sigma^{2}}{2\sqrt{K}} (241)

Unrolling the recursion over kk we get

𝔼F(xi,K(r))F(x)\displaystyle\mathbb{E}F(x^{(r)}_{i,\sqrt{K}})-F(x^{*}) (242)
(1ημ)K(𝔼F(xi,0(r))F(x))+(ηζ22+ησ22K)k=0K1(1ημ)k\displaystyle\leq(1-\eta\mu)^{\sqrt{K}}(\mathbb{E}F(x^{(r)}_{i,0})-F(x^{*}))+(\frac{\eta\zeta^{2}}{2}+\frac{\eta\sigma^{2}}{2\sqrt{K}})\sum_{k=0}^{\sqrt{K}-1}(1-\eta\mu)^{k} (243)

Also note that x(r+1)=1Si𝒮rxi,K(r)x^{(r+1)}=\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}x^{(r)}_{i,\sqrt{K}}. So by convexity of FF,

𝔼F(x(r+1))F(x)\displaystyle\mathbb{E}F(x^{(r+1)})-F(x^{*}) (244)
1Si𝒮r𝔼F(xi,K(r))F(x)\displaystyle\leq\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}\mathbb{E}F(x^{(r)}_{i,\sqrt{K}})-F(x^{*}) (245)
(1ημ)K(𝔼F(x(r))F(x))+(ηζ22+ησ22K)k=0K1(1ημ)k\displaystyle\leq(1-\eta\mu)^{\sqrt{K}}(\mathbb{E}F(x^{(r)})-F(x^{*}))+(\frac{\eta\zeta^{2}}{2}+\frac{\eta\sigma^{2}}{2\sqrt{K}})\sum_{k=0}^{\sqrt{K}-1}(1-\eta\mu)^{k} (246)

Unrolling the recursion over the rounds, we get

𝔼F(x(r))F(x)\displaystyle\mathbb{E}F(x^{(r)})-F(x^{*}) (247)
(1ημ)rK(F(x(0))F(x))+(ηζ22+ησ22K)τ=0r1k=0K1(1ημ)τK+k\displaystyle\leq(1-\eta\mu)^{r\sqrt{K}}(F(x^{(0)})-F(x^{*}))+(\frac{\eta\zeta^{2}}{2}+\frac{\eta\sigma^{2}}{2\sqrt{K}})\sum_{\tau=0}^{r-1}\sum_{k=0}^{\sqrt{K}-1}(1-\eta\mu)^{\tau\sqrt{K}+k} (248)

which, after bounding the geometric series, can be upper bounded as

𝔼F(x(r))F(x)(1ημ)rK(F(x(0))F(x))+ζ22μ+σ22μK\displaystyle\mathbb{E}F(x^{(r)})-F(x^{*})\leq(1-\eta\mu)^{r\sqrt{K}}(F(x^{(0)})-F(x^{*}))+\frac{\zeta^{2}}{2\mu}+\frac{\sigma^{2}}{2\mu\sqrt{K}} (249)

Setting η=1β\eta=\frac{1}{\beta} and r=Rr=R, and using (1ημ)RKexp(RK/κ)(1-\eta\mu)^{R\sqrt{K}}\leq\exp(-R\sqrt{K}/\kappa), gives the claimed rate. The final statement comes from the fact that xηFi(x)xixxi\|x-\eta\nabla F_{i}(x)-x^{*}_{i}\|\leq\|x-x^{*}_{i}\| whenever η1β\eta\leq\frac{1}{\beta}, together with the triangle inequality. ∎
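
For concreteness, the local phase analyzed above can be sketched as follows. This is a minimal illustration, assuming each sampled client runs sqrt(K) local steps and each step averages sqrt(K) stochastic gradients (K gradient queries per client per round); stochastic_grad is a hypothetical placeholder for a client's gradient oracle, and the sketch is not a verbatim transcription of Algo. 4.

import numpy as np

# Minimal sketch of one round of the local-update phase analyzed above.
# Assumptions (not taken verbatim from Algo. 4): each client runs sqrt(K) local
# steps, each step averages sqrt(K) stochastic gradients, and the server then
# averages the clients' final local iterates. stochastic_grad(i, x) is a
# hypothetical placeholder for client i's stochastic gradient oracle.
def fedavg_round(x, sampled_clients, stochastic_grad, eta, K):
    m = int(np.sqrt(K))
    local_iterates = []
    for i in sampled_clients:
        xi = x.copy()
        for _ in range(m):                                     # sqrt(K) local steps
            g = np.mean([stochastic_grad(i, xi) for _ in range(m)], axis=0)
            xi = xi - eta * g                                  # local SGD step, eta <= 1/beta
        local_iterates.append(xi)
    return np.mean(local_iterates, axis=0)                     # server-side averaging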

E.1.2 Convergence of FedAvg for General Convex functions

For the general convex case, we use Nesterov smoothing: concretely, we run Algo. 4 as in the strongly convex case, but on a modified objective

Fμ(x)=F(x)+μ2x(0)x2\displaystyle F_{\mu}(x)=F(x)+\frac{\mu}{2}\|x^{(0)}-x\|^{2} (250)

Define xμ=argminxFμ(x)x^{*}_{\mu}=\operatorname*{arg\,min}_{x}F_{\mu}(x) and Δμ=𝔼Fμ(x(0))F(xμ)\Delta_{\mu}=\mathbb{E}F_{\mu}(x^{(0)})-F(x^{*}_{\mu}). We will choose μ\mu carefully to balance the error introduced by the regularization term and the better convergence properties of having larger μ\mu.

Proof.

We know that running Algo. 4 with η=1β+μ\eta=\frac{1}{\beta+\mu} gives (from the previous proof)

𝔼Fμ(x(R))Fμ(xμ)Δμexp(RKβ+μμ)+ζ22μ+σ22μK\displaystyle\mathbb{E}F_{\mu}(x^{(R)})-F_{\mu}(x^{*}_{\mu})\leq\Delta_{\mu}\exp(-\frac{R\sqrt{K}}{\frac{\beta+\mu}{\mu}})+\frac{\zeta^{2}}{2\mu}+\frac{\sigma^{2}}{2\mu\sqrt{K}} (251)

We have that by (Yuan & Ma, 2020, Proposition E.7)

𝔼F(x(R))F(x)𝔼Fμ(x(R))Fμ(xμ)+μ2D2\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\mathbb{E}F_{\mu}(x^{(R)})-F_{\mu}(x^{*}_{\mu})+\frac{\mu}{2}D^{2} (252)

So we have (because ΔμΔ\Delta_{\mu}\leq\Delta as shown in Thm. D.3),

𝔼F(x(R))F(x)Δexp(RKβ+μμ)+ζ22μ+σ22μK+μ2D2\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\Delta\exp(-\frac{R\sqrt{K}}{\frac{\beta+\mu}{\mu}})+\frac{\zeta^{2}}{2\mu}+\frac{\sigma^{2}}{2\mu\sqrt{K}}+\frac{\mu}{2}D^{2} (253)

Then if we choose μΘ(βKRlog2(e2+KR))\mu\geq\Theta(\frac{\beta}{\sqrt{K}R}\log^{2}(e^{2}+\sqrt{K}R)), μΘ(ζD)\mu\geq\Theta(\frac{\zeta}{D}), and μΘ(σDK1/4)\mu\geq\Theta(\frac{\sigma}{DK^{1/4}}),

𝔼F(x(R))F(x)𝒪~(βD2KR+σDK1/4+ζD)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\tilde{\mathcal{O}}(\frac{\beta D^{2}}{\sqrt{K}R}+\frac{\sigma D}{K^{1/4}}+\zeta D) (254)

Now we show the distance bound. Recall that

𝔼Fμ(x(R))Fμ(xμ)Δμexp(RKβ+μμ)+ζ22μ+σ22μK\displaystyle\mathbb{E}F_{\mu}(x^{(R)})-F_{\mu}(x^{*}_{\mu})\leq\Delta_{\mu}\exp(-\frac{R\sqrt{K}}{\frac{\beta+\mu}{\mu}})+\frac{\zeta^{2}}{2\mu}+\frac{\sigma^{2}}{2\mu\sqrt{K}} (255)

By smoothness and strong convexity of FμF_{\mu}, and by the choice of μ\mu (each error term above, divided by μ\mu, is at most D2D^{2} up to log factors), we have that

𝔼x(R)xμ2𝒪~(D2)\displaystyle\mathbb{E}\|x^{(R)}-x^{*}_{\mu}\|^{2}\leq\tilde{\mathcal{O}}(D^{2}) (256)

So,

𝔼x(R)x23𝔼x(R)xμ2+3𝔼x(0)xμ2+3𝔼xx(0)2𝒪~(D2)\displaystyle\mathbb{E}\|x^{(R)}-x^{*}\|^{2}\leq 3\mathbb{E}\|x^{(R)}-x^{*}_{\mu}\|^{2}+3\mathbb{E}\|x^{(0)}-x^{*}_{\mu}\|^{2}+3\mathbb{E}\|x^{*}-x^{(0)}\|^{2}\leq\tilde{\mathcal{O}}(D^{2}) (257)

where the last inequality uses F(xμ)F(x)F(x^{*}_{\mu})\geq F(x^{*}) together with the optimality of xμx^{*}_{\mu}, which gives

F(xμ)+μ2x(0)xμ2F(x)+μ2x(0)x2\displaystyle F(x^{*}_{\mu})+\frac{\mu}{2}\|x^{(0)}-x^{*}_{\mu}\|^{2}\leq F(x^{*})+\frac{\mu}{2}\|x^{(0)}-x^{*}\|^{2} (258)

E.1.3 Convergence of FedAvg under the PL condition

Proof.

The proof is the same as in the strongly convex case (with the PL inequality playing the role of strong convexity), except at Eq. 244, where we avoid the use of convexity by sampling only one client at a time. This follows previous work such as Karimireddy et al. (2020a). Averaging iterates when the client functions are not convex can, in principle, cause the error to blow up, although this blow-up is not observed in practice (as mentioned in Karimireddy et al. (2020a)). ∎

Appendix F Proofs for FedChain

F.1 FedAvg \to SGD

F.1.1 Convergence of FedAvg \to SGD on Strongly Convex Functions

Theorem F.1.

Suppose that the client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.7, Assumption B.5, and Assumption B.8. Then running Algo. 1, where 𝒜local\mathcal{A}_{\text{local}} is Algo. 4 in the setting of Thm. E.1 and 𝒜global\mathcal{A}_{\text{global}} is Algo. 2 in the setting of Thm. D.1, gives the following:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. Then there exists a lower bound on KK above which we have the rate

    𝒪~(min{ζ2μ,Δ}exp(Rκ)+σ2μSKR+(1SN)ζ2μSR+1SNζFS)\displaystyle\tilde{\mathcal{O}}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}\exp(-\frac{R}{\kappa})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}+\sqrt{1-\frac{S}{N}}\frac{\zeta_{F}}{\sqrt{S}})
  • General convex: FiF_{i}’s satisfy Assumption B.2. Then there exists a lower bound on KK above which we have the rate

    𝒪~(\displaystyle\tilde{\mathcal{O}}( min{β1/2ζ1/2D3/2R1/2,βD2R}+β1/2σ1/2D3/2(SKR)1/4\displaystyle\min\{\frac{\beta^{1/2}\zeta^{1/2}D^{3/2}}{R^{1/2}},\frac{\beta D^{2}}{R}\}+\frac{\beta^{1/2}\sigma^{1/2}D^{3/2}}{(SKR)^{1/4}}
    +(1SN)1/4β1/2ζF1/2DS1/4R1/2+(1SN)1/4β1/2ζ1/2D3/2(SR)1/4)\displaystyle+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta_{F}^{1/2}D}{S^{1/4}R^{1/2}}+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta^{1/2}D^{3/2}}{(SR)^{1/4}})
  • PL condition: FiF_{i}’s satisfy Assumption B.3 for μ>0\mu>0. Then there exists a lower bound on KK above which we have the rate

    𝒪~(min{ζ2μ,Δ}exp(Rκ)+κσ2μNKR+(1SN)κζ2μSR+1SNζFS)\displaystyle\tilde{\mathcal{O}}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}\exp(-\frac{R}{\kappa})+\frac{\kappa\sigma^{2}}{\mu NKR}+(1-\frac{S}{N})\frac{\kappa\zeta^{2}}{\mu SR}+\sqrt{1-\frac{S}{N}}\frac{\zeta_{F}}{\sqrt{S}})
Proof.

From Thm. E.1, we know that with K>max{σ4ζ4,κ2R2log2(Δμζ2)}K>\max\{\frac{\sigma^{4}}{\zeta^{4}},\frac{\kappa^{2}}{R^{2}}\log^{2}(\frac{\Delta\mu}{\zeta^{2}})\}

𝔼F(x^1/2)F(x)𝒪(ζ2μ)\displaystyle\mathbb{E}F(\hat{x}_{1/2})-F(x^{*})\leq\mathcal{O}(\frac{\zeta^{2}}{\mu}) (259)

From Lemma H.2, if K>σF2Smin{Δ,ζ2μ}K>\frac{\sigma^{2}_{F}}{S\min\{\Delta,\frac{\zeta^{2}}{\mu}\}}

𝔼F(x^1)F(x)𝒪(min{ζ2μ,Δ}+1S1N1ζFS)\displaystyle\mathbb{E}F(\hat{x}_{1})-F(x^{*})\leq\mathcal{O}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}+\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}}) (260)

And so from Thm. D.1, we know that

𝔼F(x^2)F(x)\displaystyle\mathbb{E}F(\hat{x}_{2})-F(x^{*}) (261)
𝒪~(min{ζ2μ,Δ}exp(Rκ)+σ2μNKR+(1SN)ζ2μSR+1SNζFS)\displaystyle\leq\tilde{\mathcal{O}}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}\exp(-\frac{R}{\kappa})+\frac{\sigma^{2}}{\mu NKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}+\sqrt{1-\frac{S}{N}}\frac{\zeta_{F}}{\sqrt{S}}) (262)
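
Schematically, the three steps of this argument (local phase, Lemma H.2 selection, global phase) chain together as sketched below. The helpers run_fedavg, run_sgd, and estimate_F are hypothetical placeholders standing in for Algo. 4, Algo. 2, and the sampled objective estimate of Lemma H.2, and the split of rounds between the two phases is an assumption for illustration.

# Schematic sketch of the FedAvg -> SGD chaining analyzed above. run_fedavg,
# run_sgd, and estimate_F are hypothetical placeholders (standing in for
# Algo. 4, Algo. 2, and the sampled objective estimate F-hat of Lemma H.2).
def fedchain_fedavg_to_sgd(x0, clients, R, frac_local, run_fedavg, run_sgd, estimate_F):
    R_local = int(frac_local * R)
    x_half = run_fedavg(x0, clients, rounds=R_local)                  # local-update phase
    x_one = min((x0, x_half), key=lambda x: estimate_F(x, clients))   # Lemma H.2 selection
    return run_sgd(x_one, clients, rounds=R - R_local)                # global-update phase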

F.1.2 Convergence of FedAvg \to SGD on General Convex Functions

Proof.

From Thm. E.1, we know that with K>max{σ4ζ4,β2D2ζR}K>\max\{\frac{\sigma^{4}}{\zeta^{4}},\frac{\beta^{2}D^{2}}{\zeta R}\}

𝔼F(x(R))F(x)𝒪~(ζD)\displaystyle\mathbb{E}F(x^{(R)})-F(x^{*})\leq\tilde{\mathcal{O}}(\zeta D)

From Lemma H.2, if K>σF2Smin{Δ,ζ2μ}K>\frac{\sigma^{2}_{F}}{S\min\{\Delta,\frac{\zeta^{2}}{\mu}\}}

𝔼F(x^1)F(x)𝒪~(min{ζD,Δ}+1S1N1ζFS)\displaystyle\mathbb{E}F(\hat{x}_{1})-F(x^{*})\leq\tilde{\mathcal{O}}(\min\{\zeta D,\Delta\}+\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}}) (263)

And so from Thm. D.1, we know that

𝔼F(x^2)2𝒪~(βmin{ζD,Δ}R+1SNβζFSR+βσDSKR+1SNβζDSR)\displaystyle\mathbb{E}\|\nabla F(\hat{x}_{2})\|^{2}\leq\tilde{\mathcal{O}}(\frac{\beta\min\{\zeta D,\Delta\}}{R}+\sqrt{1-\frac{S}{N}}\frac{\beta\zeta_{F}}{\sqrt{S}R}+\frac{\beta\sigma D}{\sqrt{SKR}}+\sqrt{1-\frac{S}{N}}\frac{\beta\zeta D}{\sqrt{SR}}) (264)

Next, using that 𝔼x^2x2𝒪~(D2)\mathbb{E}\|\hat{x}_{2}-x^{*}\|^{2}\leq\tilde{\mathcal{O}}(D^{2}) from Thm. E.1 and Thm. D.1, as well as convexity,

𝔼F(x^2)F(x)\displaystyle\mathbb{E}F(\hat{x}_{2})-F(x^{*}) (265)
𝔼F(x^2)2𝔼x^2x2\displaystyle\leq\sqrt{\mathbb{E}\|\nabla F(\hat{x}_{2})\|^{2}}\sqrt{\mathbb{E}\|\hat{x}_{2}-x^{*}\|^{2}} (266)
𝒪~(β1/2min{ζ1/2D3/2,Δ1/2D}R1/2\displaystyle\leq\tilde{\mathcal{O}}(\frac{\beta^{1/2}\min\{\zeta^{1/2}D^{3/2},\Delta^{1/2}D\}}{R^{1/2}} (267)
+β1/2σ1/2D3/2(SKR)1/4+(1SN)1/4β1/2ζF1/2DS1/4R1/2+(1SN)1/4β1/2ζ1/2D3/2(SR)1/4)\displaystyle\ \ \ +\frac{\beta^{1/2}\sigma^{1/2}D^{3/2}}{(SKR)^{1/4}}+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta_{F}^{1/2}D}{S^{1/4}R^{1/2}}+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta^{1/2}D^{3/2}}{(SR)^{1/4}}) (268)

One can pre-run SGD before all of this for a constant fraction of rounds so that (Woodworth et al., 2020a, Section 7):

Δ𝒪~(βD2R+σDSKR+1SNζDSR)\displaystyle\Delta\leq\tilde{\mathcal{O}}(\frac{\beta D^{2}}{R}+\frac{\sigma D}{\sqrt{SKR}}+\sqrt{1-\frac{S}{N}}\frac{\zeta D}{\sqrt{SR}}) (269)

This is likely not necessary in practice (see § 6). Altogether,

𝔼F(x^2)F(x)\displaystyle\mathbb{E}F(\hat{x}_{2})-F(x^{*}) (270)
𝔼F(x^2)2𝔼x^2x2\displaystyle\leq\sqrt{\mathbb{E}\|\nabla F(\hat{x}_{2})\|^{2}}\sqrt{\mathbb{E}\|\hat{x}_{2}-x^{*}\|^{2}} (271)
𝒪~(min{β1/2ζ1/2D3/2R1/2,βD2R}\displaystyle\leq\tilde{\mathcal{O}}(\min\{\frac{\beta^{1/2}\zeta^{1/2}D^{3/2}}{R^{1/2}},\frac{\beta D^{2}}{R}\} (272)
+β1/2σ1/2D3/2(SKR)1/4+(1SN)1/4β1/2ζF1/2DS1/4R1/2+(1SN)1/4β1/2ζ1/2D3/2(SR)1/4)\displaystyle\ \ \ +\frac{\beta^{1/2}\sigma^{1/2}D^{3/2}}{(SKR)^{1/4}}+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta_{F}^{1/2}D}{S^{1/4}R^{1/2}}+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta^{1/2}D^{3/2}}{(SR)^{1/4}}) (273)

F.1.3 Convergence of FedAvg \to SGD under the PL-condition

Proof.

The proof is the same as in the strongly convex case. ∎

F.2 FedAvg \to ASG

Theorem F.2.

Suppose that the client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.7, Assumption B.5, and Assumption B.8. Then running Algo. 1, where 𝒜local\mathcal{A}_{\text{local}} is Algo. 4 in the setting of Thm. E.1 and 𝒜global\mathcal{A}_{\text{global}} is Algo. 3 in the setting of Thm. D.1, gives the following:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. Then there exists a lower bound on KK (same as in FedAvg \to SGD) above which we have the rate

    𝒪~(min{ζ2μ,Δ}exp(Rκ)+σ2μSKR+(1SN)ζ2μSR+1SNζFS)\displaystyle\tilde{\mathcal{O}}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}\exp(-\frac{R}{\sqrt{\kappa}})+\frac{\sigma^{2}}{\mu SKR}+(1-\frac{S}{N})\frac{\zeta^{2}}{\mu SR}+\sqrt{1-\frac{S}{N}}\frac{\zeta_{F}}{\sqrt{S}})
  • General convex: FiF_{i}’s satisfy Assumption B.2. Then there exists a lower bound on KK (same as in FedAvg \to SGD) above which we have the rate

    𝒪~(\displaystyle\tilde{\mathcal{O}}( min{β1/2ζ1/2D3/2R,βD2R2}+β1/2σ1/2D3/2(SKR)1/4+σD(SKR)1/2+(1SN)1/2ζD(SR)1/2\displaystyle\min\{\frac{\beta^{1/2}\zeta^{1/2}D^{3/2}}{R},\frac{\beta D^{2}}{R^{2}}\}+\frac{\beta^{1/2}\sigma^{1/2}D^{3/2}}{(SKR)^{1/4}}+\frac{\sigma D}{(SKR)^{1/2}}+(1-\frac{S}{N})^{1/2}\frac{\zeta D}{(SR)^{1/2}}
    +(1SN)1/4β1/2ζF1/2DS1/4R1/2+(1SN)1/4β1/2ζ1/2D3/2(SR)1/4)\displaystyle+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta_{F}^{1/2}D}{S^{1/4}R^{1/2}}+(1-\frac{S}{N})^{1/4}\frac{\beta^{1/2}\zeta^{1/2}D^{3/2}}{(SR)^{1/4}})

The proof follows that of FedAvg \to SGD (Thm. F.1).

F.3 FedAvg \to SAGA

Theorem F.3.

Suppose that the client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.7, Assumption B.5, and Assumption B.8. Then running Algo. 1, where 𝒜local\mathcal{A}_{\text{local}} is Algo. 4 in the setting of Thm. E.1 and 𝒜global\mathcal{A}_{\text{global}} is Algo. 5 in the setting of Thm. D.4, gives the following:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. Then there exists a lower bound on KK (same as in FedAvg \to SGD) above which we have the rate

    𝒪~(min{ζ2μ,Δ}exp(max{NS,κ}1R)+σ2μSKR)\displaystyle\tilde{\mathcal{O}}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}\exp(-\max\{\frac{N}{S},\kappa\}^{-1}R)+\frac{\sigma^{2}}{\mu SKR})
  • PL condition: FiF_{i}’s satisfy Assumption B.3 for μ>0\mu>0. Then there exists a lower bound on KK above which we have the rate

    𝒪~(min{ζ2μ,Δ}exp((κ(NS)2/3)1R)+σ2μSK)\displaystyle\tilde{\mathcal{O}}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}\exp(-(\kappa(\frac{N}{S})^{2/3})^{-1}R)+\frac{\sigma^{2}}{\mu SK})

The proof follows that of FedAvg \to SGD (Thm. F.1), using Lemma H.2 with S=NS=N, since the algorithm already requires R>NSR>\frac{N}{S}.

F.4 FedAvg \to SSNM

Theorem F.4.

Suppose that the client objectives FiF_{i}’s and their gradient queries satisfy Assumption B.4, Assumption B.6, Assumption B.5, and Assumption B.8. Then running Algo. 1, where 𝒜local\mathcal{A}_{\text{local}} is Algo. 4 in the setting of Thm. E.1 and 𝒜global\mathcal{A}_{\text{global}} is Algo. 6 in the setting of Thm. D.5, gives the following:

  • Strongly convex: FiF_{i}’s satisfy Assumption B.1 for μ>0\mu>0. Then there exists a lower bound on KK (same as in FedAvg \to SGD) above which we have the rate

    𝒪~(min{ζ2μ,Δ}exp(max{NS,κ(NS)}1R)+κσ2μKS)\displaystyle\tilde{\mathcal{O}}(\min\{\frac{\zeta^{2}}{\mu},\Delta\}\exp(-\max\{\frac{N}{S},\sqrt{\kappa(\frac{N}{S})}\}^{-1}R)+\frac{\kappa\sigma^{2}}{\mu KS})

The proof follows that of FedAvg \to SGD (Thm. F.1), using Lemma H.2 with S=NS=N, since the algorithm already requires R>NSR>\frac{N}{S}.

Appendix G Lower Bound

In this section we prove the lower bound. All gradients will be noiseless, and we will allow full communication to all clients every round. There will be only two functions in this lower bound: F1F_{1} and F2F_{2}. If N>2N>2, then F1F_{1} is assigned to the first N/2\lfloor N/2\rfloor clients and F2F_{2} to the next N/2\lfloor N/2\rfloor clients. If there is an odd number of machines we let the last machine be F3(x)=μ2x2F_{3}(x)=\frac{\mu}{2}\|x\|^{2}. This only reduces the lower bound by a factor of at most N1N\frac{N-1}{N}. So we look at the case N=2N=2.

Let 2,C,ζ^\ell_{2},C,\hat{\zeta} be values to be chosen later, and let dd be even. At a high level, 2\ell_{2} controls the smoothness of F1F_{1} and F2F_{2}, CC is a constant of order one (it will later be set to 1q1-q), and ζ^\hat{\zeta} is a quantity that affects the heterogeneity, the initial suboptimality gap, and the initial distance. We give the instance:

F1(x)=2ζ^x1+C22xd2+22i=1d21(x2i+1x2i)2+μ2x2\displaystyle F_{1}(x)=-\ell_{2}\hat{\zeta}x_{1}+\frac{C\ell_{2}}{2}x_{d}^{2}+\frac{\ell_{2}}{2}\sum_{i=1}^{\frac{d}{2}-1}(x_{2i+1}-x_{2i})^{2}+\frac{\mu}{2}\|x\|^{2} (274)
F2(x)=22i=1d/2(x2ix2i1)2+μ2x2\displaystyle F_{2}(x)=\frac{\ell_{2}}{2}\sum_{i=1}^{d/2}(x_{2i}-x_{2i-1})^{2}+\frac{\mu}{2}\|x\|^{2} (275)

Where F=F1+F22F=\frac{F_{1}+F_{2}}{2}.

These functions are the same as those used in (Woodworth et al., 2020a; Woodworth, 2021) for lower bounds in distributed optimization. They are also similar to the instances used to prove convex optimization lower bounds (Nesterov, 2003) and distributed optimization lower bounds (Arjevani & Shamir, 2015).

These two functions have the following property

xevenspan{e1,e2,,e2i}{F1(xeven)span{e1,e2,,e2i+1}F2(xeven)span{e1,e2,,e2i}\displaystyle x_{\text{even}}\in\text{span}\{e_{1},e_{2},\dots,e_{2i}\}\implies\begin{cases}\nabla F_{1}(x_{\text{even}})\in\text{span}\{e_{1},e_{2},\dots,e_{2i+1}\}\\ \nabla F_{2}(x_{\text{even}})\in\text{span}\{e_{1},e_{2},\dots,e_{2i}\}\\ \end{cases} (276)
xoddspan{e1,e2,,e2i1}{F1(xodd)span{e1,e2,,e2i1}F2(xodd)span{e1,e2,,e2i}\displaystyle x_{\text{odd}}\in\text{span}\{e_{1},e_{2},\dots,e_{2i-1}\}\implies\begin{cases}\nabla F_{1}(x_{\text{odd}})\in\text{span}\{e_{1},e_{2},\dots,e_{2i-1}\}\\ \nabla F_{2}(x_{\text{odd}})\in\text{span}\{e_{1},e_{2},\dots,e_{2i}\}\\ \end{cases} (277)

What the property roughly says is the following. Suppose that so far an even number of coordinates have been made nonzero. Then only gradient queries of F1F_{1} can access the next coordinate. After client 1 queries the gradient of F1F_{1}, an odd number of coordinates are unlocked, and no further coordinates can be unlocked until a communication occurs. The same holds, with the roles of F1F_{1} and F2F_{2} swapped, when the number of unlocked coordinates is odd. Therefore each round of communication can unlock only a single new coordinate.
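
As a sanity check, the following minimal numpy sketch implements the gradients of F1F_{1} and F2F_{2} (written out explicitly below) for an arbitrary small dimension and illustrative constants (these particular numbers are not the values chosen later in the proof), and verifies the unlocking behavior just described.

import numpy as np

# Minimal numerical sketch of the coordinate-unlocking property. The dimension
# d = 6 and the constants below are illustrative only; the gradients implement
# the formulas written out below.
d, ell2, zeta_hat, C, mu = 6, 1.0, 1.0, 0.5, 0.1

def grad_F1(x):
    g = mu * x.copy()
    g[0] += -ell2 * zeta_hat                 # from the -ell2 * zeta_hat * x_1 term
    g[d - 1] += C * ell2 * x[d - 1]          # from the (C * ell2 / 2) * x_d^2 term
    for i in range(1, d // 2):               # pairs (x_{2i+1} - x_{2i}), 1-indexed coordinates
        hi, lo = 2 * i, 2 * i - 1            # 0-indexed positions
        g[hi] += ell2 * (x[hi] - x[lo])
        g[lo] -= ell2 * (x[hi] - x[lo])
    return g

def grad_F2(x):
    g = mu * x.copy()
    for i in range(1, d // 2 + 1):           # pairs (x_{2i} - x_{2i-1}), 1-indexed coordinates
        hi, lo = 2 * i - 1, 2 * i - 2        # 0-indexed positions
        g[hi] += ell2 * (x[hi] - x[lo])
        g[lo] -= ell2 * (x[hi] - x[lo])
    return g

def num_unlocked(v):
    nz = np.flatnonzero(np.abs(v) > 1e-12)
    return 0 if nz.size == 0 else int(nz.max()) + 1   # largest nonzero coordinate (1-indexed)

x = np.zeros(d)
x[:2] = [0.3, -0.7]                 # pretend the first two coordinates are unlocked
print(num_unlocked(grad_F1(x)))     # 3: querying F1 exposes one new (odd) coordinate
print(num_unlocked(grad_F2(x)))     # 2: querying F2 exposes nothing new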

We start by writing out the gradients of F1F_{1} and F2F_{2}. Assume that dd is even. Then:

[F1(x)]1=2ζ^+μx1\displaystyle[\nabla F_{1}(x)]_{1}=-\ell_{2}\hat{\zeta}+\mu x_{1} (278)
[F1(x)]d=C2xd+μxd\displaystyle[\nabla F_{1}(x)]_{d}=C\ell_{2}x_{d}+\mu x_{d} (279)
{[F1(x)]i=2(xixi1)+μxii odd, 2id1[F1(x)]i=2(xi+1xi)+μxii even, 2id1\displaystyle\begin{cases}[\nabla F_{1}(x)]_{i}=\ell_{2}(x_{i}-x_{i-1})+\mu x_{i}\ \ i\text{ odd, }2\leq i\leq d-1\\ [\nabla F_{1}(x)]_{i}=-\ell_{2}(x_{i+1}-x_{i})+\mu x_{i}\ \ i\text{ even, }2\leq i\leq d-1\end{cases} (280)

and

[F2(x)]1=2(x2x1)+μx1\displaystyle[\nabla F_{2}(x)]_{1}=-\ell_{2}(x_{2}-x_{1})+\mu x_{1} (281)
[F2(x)]d=2(xdxd1)+μxd\displaystyle[\nabla F_{2}(x)]_{d}=\ell_{2}(x_{d}-x_{d-1})+\mu x_{d} (282)
{[F2(x)]i=2(xi+1xi)+μxii odd, 2id1[F2(x)]i=2(xixi1)+μxii even, 2id1\displaystyle\begin{cases}[\nabla F_{2}(x)]_{i}=-\ell_{2}(x_{i+1}-x_{i})+\mu x_{i}\ \ i\text{ odd, }2\leq i\leq d-1\\ [\nabla F_{2}(x)]_{i}=\ell_{2}(x_{i}-x_{i-1})+\mu x_{i}\ \ i\text{ even, }2\leq i\leq d-1\end{cases} (283)

We define the class of algorithms that our lower bound will apply to. This definition follows (Woodworth et al., 2020a; Woodworth, 2021; Carmon et al., 2020):

Definition G.1 (Distributed zero-respecting algorithm).

For a vector vv, let supp(v)={i{1,,d}:vi0}\text{supp}(v)=\{i\in\{1,\dots,d\}:v_{i}\neq 0\}. We say that an optimization algorithm is distributed zero-respecting if, for any i,k,ri,k,r, the kk-th iterate on the ii-th client in the rr-th round xi,k(r)x^{(r)}_{i,k} satisfies

supp(xi,k(r))0k<ksupp(Fi(xi,k(r)))i[N],0kK1,0r<rsupp(Fi(xi,k(r)))\displaystyle\text{supp}(x^{(r)}_{i,k})\subseteq\bigcup_{0\leq k^{\prime}<k}\text{supp}(\nabla F_{i}(x_{i,k^{\prime}}^{(r)}))\bigcup_{\begin{subarray}{c}i^{\prime}\in[N],0\leq k^{\prime}\leq K-1,0\leq r^{\prime}<r\end{subarray}}\text{supp}(\nabla F_{i^{\prime}}(x_{i^{\prime},k^{\prime}}^{(r^{\prime})})) (284)

Broadly speaking, distributed zero-respecting algorithms are those whose iterates have nonzero components only in dimensions about which they could possibly have information. As discussed in (Woodworth et al., 2020a), this means that algorithms which are not distributed zero-respecting are essentially “wild guessing”. Algorithms that are distributed zero-respecting include SGD, ASG, FedAvg, SCAFFOLD, SAGA, and SSNM.

Next, we define the second condition we require on algorithms for our lower bound.

Definition G.2.

We say that an algorithm is distributed distance-conserving if, for any i,k,ri,k,r, the kk-th iterate on the ii-th client in the rr-th round xi,k(r)x^{(r)}_{i,k} satisfies xi,k(r)x2(c/2)[xinitx2+i=1Nxinitxi2]\|x^{(r)}_{i,k}-x^{*}\|^{2}\leq(c/2)[\|x_{\text{init}}-x^{*}\|^{2}+\sum_{i=1}^{N}\|x_{\text{init}}-x_{i}^{*}\|^{2}], where xj:=argminxFj(x)x^{*}_{j}:=\operatorname*{arg\,min}_{x}F_{j}(x), x:=argminxF(x)x^{*}:=\operatorname*{arg\,min}_{x}F(x), and xinitx_{\text{init}} is the initial iterate, for some scalar parameter cc.

Algorithms which do not satisfy Definition 5.2 for polylogarithmic cc in problem parameters (see § 2) are those that move substantially far away from xx^{*}, even farther than the xix_{i}^{*}’s are from xx^{*}. With this definition in mind, we slightly overload the usual definition of heterogeneity for the lower bound:

Definition G.3.

A distributed optimization problem is (ζ,c)(\zeta,c)-heterogeneous if maxi[N]supxAFi(x)F(x)2ζ2\max_{i\in[N]}\sup_{x\in A}\|\nabla F_{i}(x)-\nabla F(x)\|^{2}\leq\zeta^{2}, where we define A:={x:xx2(c/2)(xinitx2+i=1Nxinitxi2)}A:=\{x:\|x-x^{*}\|^{2}\leq({c}/{2})(\|x_{\text{init}}-x^{*}\|^{2}+\sum_{i=1}^{N}\|x_{\text{init}}-x_{i}^{*}\|^{2})\} for some scalar parameter cc.

While convergence rates in FL are usually proven under Assumption B.5, the proofs can be converted to work under Definition G.3 as well, so long as one can prove that all iterates stay within AA as defined in Definition G.3. We show that our algorithms satisfy this requirement as well, as a consequence of Thm. E.1, Thm. D.3, and Thm. D.1; we do not formally show it for SAGA and SSNM, as those algorithms are functionally the same as SGD and ASG under full participation. Other proofs of FL algorithms also satisfy this requirement, most notably the proof of the convergence of FedAvg in Woodworth et al. (2020a).

Following (Woodworth et al., 2020a), we start by arguing that, if the algorithm is distributed zero-respecting, then only one coordinate can be unlocked per round of communication. Let Ei=span{e1,,ei}E_{i}=\text{span}\{e_{1},\dots,e_{i}\}, and let E0E_{0} be the zero subspace. Then

Lemma G.4.

Let x^\hat{x} be the output of a distributed zero-respecting algorithm optimizing F=12(F1+F2)F=\frac{1}{2}(F_{1}+F_{2}) after RR rounds of communication. Then we have

x^ER\displaystyle\hat{x}\in E_{R} (285)
Proof.

A proof is in Woodworth et al. (2020a, Lemma 9). ∎

We now compute various properties of this distributed optimization problem.

G.1 Strong convexity and smoothness of F,F1,F2F,F_{1},F_{2}

From Woodworth (2021, Lemma 25), as long as 2βμ4\ell_{2}\leq\frac{\beta-\mu}{4}, we have that F,F1,F2F,F_{1},F_{2} are β\beta-smooth and μ\mu-strongly convex.

G.2 The solutions of F,F1,F2F,F_{1},F_{2}

First, observe that from the gradient of F2F_{2} computed in Eq. 281, x2=argminxF2(x)=0x^{*}_{2}=\operatorname*{arg\,min}_{x}F_{2}(x)=\vec{0}. Next observe that from the gradient of F1F_{1} computed in Eq. 278, x1=argminxF1(x)=2ζ^μe1x^{*}_{1}=\operatorname*{arg\,min}_{x}F_{1}(x)=\frac{\ell_{2}\hat{\zeta}}{\mu}e_{1}. Thus x22=0\|x_{2}^{*}\|^{2}=0 and x12=22ζ^2μ2\|x_{1}^{*}\|^{2}=\frac{\ell_{2}^{2}\hat{\zeta}^{2}}{\mu^{2}}.

From Woodworth (2021, Lemma 25), if we let α=1+22μ\alpha=\sqrt{1+\frac{2\ell_{2}}{\mu}}, q=α1α+1q=\frac{\alpha-1}{\alpha+1}, C=1qC=1-q, then x2=ζ^2(1q)2i=1dq2i=ζ^2q2(1q2d)(1q)2(1q2)\|x^{*}\|^{2}=\frac{\hat{\zeta}^{2}}{(1-q)^{2}}\sum_{i=1}^{d}q^{2i}=\frac{\hat{\zeta}^{2}q^{2}(1-q^{2d})}{(1-q)^{2}(1-q^{2})}.

G.3 Initial suboptimality gap

From Woodworth (2021, Lemma 25), F(0)F(x)q2ζ^24(1q)F(0)-F(x^{*})\leq\frac{q\ell_{2}\hat{\zeta}^{2}}{4(1-q)}.

G.4 Suboptimality gap after RR rounds of communication

From (Woodworth, 2021, Lemma 25), F(x^)F(x)ζ^2μq216(1q)2(1q2)q2RF(\hat{x})-F(x^{*})\geq\frac{\hat{\zeta}^{2}\mu q^{2}}{16(1-q)^{2}(1-q^{2})}q^{2R}, as long as dR+log22log(1/q)d\geq R+\frac{\log 2}{2\log(1/q)}.

G.5 Computation of ζ2\zeta^{2}

We start by writing out the difference between the gradient of F1F_{1} and F2F_{2}, using Eq. 278 and Eq. 281.

F1(x)F2(x)2=22(ζ^+x2x1)2+22(Cxdxd+xd1)2+i=2d122(xi+1xi1)2\displaystyle\|\nabla F_{1}(x)-\nabla F_{2}(x)\|^{2}=\ell_{2}^{2}(-\hat{\zeta}+x_{2}-x_{1})^{2}+\ell_{2}^{2}(Cx_{d}-x_{d}+x_{d-1})^{2}+\sum_{i=2}^{d-1}\ell_{2}^{2}(x_{i+1}-x_{i-1})^{2} (286)

We upper bound each of the terms:

(xi+1xi1)2\displaystyle(x_{i+1}-x_{i-1})^{2} =xi+122xi+1xi1+xi122xi+12+2xi12\displaystyle=x_{i+1}^{2}-2x_{i+1}x_{i-1}+x_{i-1}^{2}\leq 2x_{i+1}^{2}+2x_{i-1}^{2} (287)
(ζ^+x2x1)2=ζ^2+x22+x122ζ^x2+2ζ^x12x2x1\displaystyle(-\hat{\zeta}+x_{2}-x_{1})^{2}=\hat{\zeta}^{2}+x_{2}^{2}+x_{1}^{2}-2\hat{\zeta}x_{2}+2\hat{\zeta}x_{1}-2x_{2}x_{1} (288)
3ζ^2+3x22+3x12\displaystyle\leq 3\hat{\zeta}^{2}+3x_{2}^{2}+3x_{1}^{2} (289)
(Cxdxd+xd1)2\displaystyle(Cx_{d}-x_{d}+x_{d-1})^{2} =C2xd2+xd2+xd122Cxd2+2Cxdxd12xdxd1\displaystyle=C^{2}x_{d}^{2}+x_{d}^{2}+x_{d-1}^{2}-2Cx_{d}^{2}+2Cx_{d}x_{d-1}-2x_{d}x_{d-1} (290)
C2xd2+2xd2+2xd12+2Cxd12\displaystyle\leq C^{2}x_{d}^{2}+2x_{d}^{2}+2x_{d-1}^{2}+2Cx_{d-1}^{2} (291)

So altogether, using the fact that C1C\leq 1,

122F1(x)F2(x)2\displaystyle\frac{1}{\ell_{2}^{2}}\|\nabla F_{1}(x)-\nabla F_{2}(x)\|^{2} (292)
3ζ^2+3x22+3x12+3xd2+3xd12+i=2d12xi+12+2xi12\displaystyle\leq 3\hat{\zeta}^{2}+3x_{2}^{2}+3x_{1}^{2}+3x_{d}^{2}+3x_{d-1}^{2}+\sum_{i=2}^{d-1}2x_{i+1}^{2}+2x_{i-1}^{2} (293)
3ζ^2+7x2\displaystyle\leq 3\hat{\zeta}^{2}+7\|x\|^{2} (294)
3ζ^2+14xx2+14x2\displaystyle\leq 3\hat{\zeta}^{2}+14\|x-x^{*}\|^{2}+14\|x^{*}\|^{2} (295)
5ζ^2+28cζ^2q2(1q)2(1q2)+28c22ζ^2μ2\displaystyle\leq 5\hat{\zeta}^{2}+28c\frac{\hat{\zeta}^{2}q^{2}}{(1-q)^{2}(1-q^{2})}+28c\frac{\ell_{2}^{2}\hat{\zeta}^{2}}{\mu^{2}} (296)

Next we use Definition G.2, which says that xx2c2[xinitx2+i=1Nxinitxi2]\|x-x^{*}\|^{2}\leq\frac{c}{2}[\|x_{\text{init}}-x^{*}\|^{2}+\sum_{i=1}^{N}\|x_{\text{init}}-x_{i}^{*}\|^{2}]. For ease of calculation, we assume c1c\geq 1, and recall that xinit=0x_{\text{init}}=0 in this construction. Recall also that we calculated x2ζ^2q2(1q)2(1q2)\|x^{*}\|^{2}\leq\frac{\hat{\zeta}^{2}q^{2}}{(1-q)^{2}(1-q^{2})} (because 0<q<10<q<1), x22=0\|x_{2}^{*}\|^{2}=0, and x12=22ζ^2μ2\|x_{1}^{*}\|^{2}=\frac{\ell_{2}^{2}\hat{\zeta}^{2}}{\mu^{2}}.

122F1(x)F2(x)2\displaystyle\frac{1}{\ell_{2}^{2}}\|\nabla F_{1}(x)-\nabla F_{2}(x)\|^{2} 3ζ^2+24cx2+7cx12+7cx22\displaystyle\leq 3\hat{\zeta}^{2}+24c\|x^{*}\|^{2}+7c\|x_{1}^{*}\|^{2}+7c\|x_{2}^{*}\|^{2} (297)
3ζ^2+24cζ^2q2(1q)2(1q2)+7c22ζ^2μ2\displaystyle\leq 3\hat{\zeta}^{2}+\frac{24c\hat{\zeta}^{2}q^{2}}{(1-q)^{2}(1-q^{2})}+\frac{7c\ell_{2}^{2}\hat{\zeta}^{2}}{\mu^{2}} (298)

Altogether,

F1(x)F2(x)2322ζ^2+24c22ζ^2q2(1q)2(1q2)+7c24ζ^2μ2\displaystyle\|\nabla F_{1}(x)-\nabla F_{2}(x)\|^{2}\leq 3\ell_{2}^{2}\hat{\zeta}^{2}+\frac{24c\ell_{2}^{2}\hat{\zeta}^{2}q^{2}}{(1-q)^{2}(1-q^{2})}+\frac{7c\ell_{2}^{4}\hat{\zeta}^{2}}{\mu^{2}} (299)

G.6 Theorem

With all of the computations earlier, we are now prepared to prove the theorem.

Theorem G.5.

For any number of rounds RR, number of local steps per round KK, and (ζ,c)(\zeta,c)-heterogeneity (Definition 5.3), there exists a global objective FF which is the average of two β\beta-smooth (Assumption B.4) and μ\mu-strongly convex (Assumption B.1), with μ0\mu\geq 0, quadratic client objectives F1F_{1} and F2F_{2} with an initial sub-optimality gap of Δ\Delta, such that the output x^\hat{x} of any distributed zero-respecting (Definition 5.1) and distance-conserving algorithm (Definition 5.2) satisfies

  • Strongly convex: F(x^)F(x)Ω(min{Δ,1/(cκ3/2)(ζ2/β)}exp(R/κ))F(\hat{x})-F(x^{*})\geq\Omega(\min\{\Delta,{1}/({c\kappa^{3/2}})({\zeta^{2}}/{\beta})\}\exp(-{R}/{\sqrt{\kappa}})) when μ>0\mu>0, and

  • General convex: F(x^)F(x)Ω(min{βD2/R2,ζD/(c1/2R5)})F(\hat{x})-F(x^{*})\geq\Omega(\min\{{\beta D^{2}}/{R^{2}},{\zeta D}/({c^{1/2}\sqrt{R^{5}}})\}) when μ=0\mu=0.

Proof.

The convex case: By the previous computations, we know that after RR rounds,

F(x^)F(x)\displaystyle F(\hat{x})-F(x^{*}) μζ^2q216(1q)2(1q2)q2R\displaystyle\geq\frac{\mu\hat{\zeta}^{2}q^{2}}{16(1-q)^{2}(1-q^{2})}q^{2R} (300)
x2\displaystyle\|x^{*}\|^{2} ζ^2q2(1q)2(1q2)\displaystyle\leq\frac{\hat{\zeta}^{2}q^{2}}{(1-q)^{2}(1-q^{2})} (301)
F1(x)F2(x)2\displaystyle\|\nabla F_{1}(x)-\nabla F_{2}(x)\|^{2} 322ζ^2+24c22ζ^2q2(1q)2(1q2)+7c24ζ^2μ2\displaystyle\leq 3\ell_{2}^{2}\hat{\zeta}^{2}+\frac{24c\ell_{2}^{2}\hat{\zeta}^{2}q^{2}}{(1-q)^{2}(1-q^{2})}+\frac{7c\ell_{2}^{4}\hat{\zeta}^{2}}{\mu^{2}} (302)

To maintain the property that x(0)x=xD\|x^{(0)}-x^{*}\|=\|x^{*}\|\leq D and to satisfy Definition G.3, choose ζ^\hat{\zeta} such that

ζ^2=νmin{ζ222,(1q)2(1q2)ζ2c22q2,μ2ζ2c24,(1q)2(1q2)D2q2}\displaystyle\hat{\zeta}^{2}=\nu\min\{\frac{\zeta^{2}}{\ell_{2}^{2}},\frac{(1-q)^{2}(1-q^{2})\zeta^{2}}{c\ell_{2}^{2}q^{2}},\frac{\mu^{2}\zeta^{2}}{c\ell_{2}^{4}},\frac{(1-q)^{2}(1-q^{2})D^{2}}{q^{2}}\} (303)

For an absolute constant ν\nu.

This leads to (throughout we use μ2=(1q)22q\frac{\mu}{\ell_{2}}=\frac{(1-q)^{2}}{2q}, (Woodworth, 2021, Eq 769))

F(x^)F(x)\displaystyle F(\hat{x})-F(x^{*}) (304)
μζ^2q216(1q)2(1q2)q2R\displaystyle\geq\frac{\mu\hat{\zeta}^{2}q^{2}}{16(1-q)^{2}(1-q^{2})}q^{2R} (305)
νmin{μζ222,μ(1q)2(1q2)ζ2c22q2,μ3ζ2c24,μ(1q)2(1q2)D2q2}q216(1q)2(1q2)q2R\displaystyle\geq\nu\min\{\frac{\mu\zeta^{2}}{\ell_{2}^{2}},\frac{\mu(1-q)^{2}(1-q^{2})\zeta^{2}}{c\ell_{2}^{2}q^{2}},\frac{\mu^{3}\zeta^{2}}{c\ell_{2}^{4}},\frac{\mu(1-q)^{2}(1-q^{2})D^{2}}{q^{2}}\}\frac{q^{2}}{16(1-q)^{2}(1-q^{2})}q^{2R} (306)
νmin{(1q)2ζ222q,μ(1q)2(1q2)ζ2c22q2,\displaystyle\geq\nu\min\{\frac{(1-q)^{2}\zeta^{2}}{2\ell_{2}q},\frac{\mu(1-q)^{2}(1-q^{2})\zeta^{2}}{c\ell_{2}^{2}q^{2}}, (307)
μζ2(1q)44c22q2,μ(1q)2(1q2)D2q2}q216(1q)2(1q2)q2R\displaystyle\ \ \ \frac{\mu\zeta^{2}(1-q)^{4}}{4c\ell_{2}^{2}q^{2}},\frac{\mu(1-q)^{2}(1-q^{2})D^{2}}{q^{2}}\}\frac{q^{2}}{16(1-q)^{2}(1-q^{2})}q^{2R} (308)
νmin{ζ2q322(1q2),μζ216c22,μζ2(1q)64c22(1+q),μD216}q2R\displaystyle\geq\nu\min\{\frac{\zeta^{2}q}{32\ell_{2}(1-q^{2})},\frac{\mu\zeta^{2}}{16c\ell_{2}^{2}},\frac{\mu\zeta^{2}(1-q)}{64c\ell_{2}^{2}(1+q)},\frac{\mu D^{2}}{16}\}q^{2R} (309)

We choose μ=264R2\mu=\frac{\ell_{2}}{64R^{2}}, which ensures α2\alpha\geq 2. So noting that (1q2)=(1q)(1+q)(1+q)(1-q^{2})=(1-q)(1+q)\leq(1+q), 1+q=2αα+11+q=\frac{2\alpha}{\alpha+1}, 1q=2α+11-q=\frac{2}{\alpha+1}:

F(x^)F(x)\displaystyle F(\hat{x})-F(x^{*}) (311)
νmin{ζ2322(α1α+1α+12α),μζ216c22,μζ264c22(2α+1α+12α),μD216}q2R\displaystyle\geq\nu\min\{\frac{\zeta^{2}}{32\ell_{2}}(\frac{\alpha-1}{\alpha+1}\frac{\alpha+1}{2\alpha}),\frac{\mu\zeta^{2}}{16c\ell_{2}^{2}},\frac{\mu\zeta^{2}}{64c\ell_{2}^{2}}(\frac{2}{\alpha+1}\frac{\alpha+1}{2\alpha}),\frac{\mu D^{2}}{16}\}q^{2R} (312)
νmin{ζ2322(α1α+1α+12α),μζ216c22,μζ264c22(2α+1α+12α),μD216}exp(2Rlog(α+1α1))\displaystyle\geq\nu\min\{\frac{\zeta^{2}}{32\ell_{2}}(\frac{\alpha-1}{\alpha+1}\frac{\alpha+1}{2\alpha}),\frac{\mu\zeta^{2}}{16c\ell_{2}^{2}},\frac{\mu\zeta^{2}}{64c\ell_{2}^{2}}(\frac{2}{\alpha+1}\frac{\alpha+1}{2\alpha}),\frac{\mu D^{2}}{16}\}\exp(-2R\log(\frac{\alpha+1}{\alpha-1})) (313)
νmin{ζ2642,μζ216c22,μζ264c22(1α),μD216}exp(4Rα1)\displaystyle\geq\nu\min\{\frac{\zeta^{2}}{64\ell_{2}},\frac{\mu\zeta^{2}}{16c\ell_{2}^{2}},\frac{\mu\zeta^{2}}{64c\ell_{2}^{2}}(\frac{1}{\alpha}),\frac{\mu D^{2}}{16}\}\exp(-\frac{4R}{\alpha-1}) (314)

So,

F(x^)F(x)\displaystyle F(\hat{x})-F(x^{*}) (315)
νmin{ζ2642,μζ216c22,μζ264c22(μ32)1/2,μD216}exp(8Rμ2)\displaystyle\geq\nu\min\{\frac{\zeta^{2}}{64\ell_{2}},\frac{\mu\zeta^{2}}{16c\ell_{2}^{2}},\frac{\mu\zeta^{2}}{64c\ell_{2}^{2}}(\frac{\mu}{3\ell_{2}})^{1/2},\frac{\mu D^{2}}{16}\}\exp(-\frac{8R\sqrt{\mu}}{\sqrt{\ell_{2}}}) (316)
Ω(min{ζ22,ζ2c2R2,ζ2c2R3,2D2R2})\displaystyle\geq\Omega(\min\{\frac{\zeta^{2}}{\ell_{2}},\frac{\zeta^{2}}{c\ell_{2}R^{2}},\frac{\zeta^{2}}{c\ell_{2}R^{3}},\frac{\ell_{2}D^{2}}{R^{2}}\}) (317)
Ω(min{ζ2c2R3,2D2R2})\displaystyle\geq\Omega(\min\{\frac{\zeta^{2}}{c\ell_{2}R^{3}},\frac{\ell_{2}D^{2}}{R^{2}}\}) (318)

where we used c>1c>1. Setting 2=Θ(min{β,ζc1/2DR1/2})\ell_{2}=\Theta(\min\{\beta,\frac{\zeta}{c^{1/2}DR^{1/2}}\}) (which will ensure that 2βμ4\ell_{2}\leq\frac{\beta-\mu}{4} with appropriate constants chosen), which altogether gives us

F(x^)F(x)Ω(min{ζDc1/2R5/2,βD2R2})\displaystyle F(\hat{x})-F(x^{*})\geq\Omega(\min\{\frac{\zeta D}{c^{1/2}R^{5/2}},\frac{\beta D^{2}}{R^{2}}\}) (320)

Strongly Convex Case:

By the previous computations, we know that after RR rounds,

F(x^)F(x)\displaystyle F(\hat{x})-F(x^{*}) μζ^2q216(1q)2(1q2)q2R\displaystyle\geq\frac{\mu\hat{\zeta}^{2}q^{2}}{16(1-q)^{2}(1-q^{2})}q^{2R} (321)
F(0)F(x)\displaystyle F(0)-F(x^{*}) q2ζ^24(1q)\displaystyle\leq\frac{q\ell_{2}\hat{\zeta}^{2}}{4(1-q)} (322)
F1(x)F2(x)2\displaystyle\|\nabla F_{1}(x)-\nabla F_{2}(x)\|^{2} 322ζ^2+24c22ζ^2q2(1q)2(1q2)+7c24ζ^2μ2\displaystyle\leq 3\ell_{2}^{2}\hat{\zeta}^{2}+\frac{24c\ell_{2}^{2}\hat{\zeta}^{2}q^{2}}{(1-q)^{2}(1-q^{2})}+\frac{7c\ell_{2}^{4}\hat{\zeta}^{2}}{\mu^{2}} (323)

To maintain the property that the initial suboptimality gap satisfies F(0)F(x)ΔF(0)-F(x^{*})\leq\Delta and that F1(x)F2(x)2ζ2\|\nabla F_{1}(x)-\nabla F_{2}(x)\|^{2}\leq\zeta^{2} for all xx encountered during the execution of the algorithm, choose ζ^\hat{\zeta} such that

ζ^2=νmin{ζ222,(1q)2(1q2)ζ2c22q2,μ2ζ2c24,(1q)Δq2}\displaystyle\hat{\zeta}^{2}=\nu\min\{\frac{\zeta^{2}}{\ell_{2}^{2}},\frac{(1-q)^{2}(1-q^{2})\zeta^{2}}{c\ell_{2}^{2}q^{2}},\frac{\mu^{2}\zeta^{2}}{c\ell_{2}^{4}},\frac{(1-q)\Delta}{q\ell_{2}}\} (324)

For some constant ν\nu.

Again using μ2=(1q)22q\frac{\mu}{\ell_{2}}=\frac{(1-q)^{2}}{2q} and following the same calculations as in the convex case, we use the fact that 9μβ9\mu\leq\beta, and that it is possible to choose 2\ell_{2} such that 2μ2βμ42\mu\leq\ell_{2}\leq\frac{\beta-\mu}{4}, which ensures α2\alpha\geq 2.

F(x^)F(x)\displaystyle F(\hat{x})-F(x^{*}) (325)
μζ^2q216(1q)2(1q2)q2R\displaystyle\geq\frac{\mu\hat{\zeta}^{2}q^{2}}{16(1-q)^{2}(1-q^{2})}q^{2R} (326)
νmin{ζ2642,μζ216c22,μζ264c22(1α),Δ4}exp(4Rα1)\displaystyle\geq\nu\min\{\frac{\zeta^{2}}{64\ell_{2}},\frac{\mu\zeta^{2}}{16c\ell_{2}^{2}},\frac{\mu\zeta^{2}}{64c\ell_{2}^{2}}(\frac{1}{\alpha}),\frac{\Delta}{4}\}\exp(-\frac{4R}{\alpha-1}) (327)

We use 9μβ9\mu\leq\beta and the fact that it is possible to choose 2\ell_{2} such that 2μ2βμ42\mu\leq\ell_{2}\leq\frac{\beta-\mu}{4} (Woodworth, 2021, Eq. 800), which ensures α2\alpha\geq 2 and 1αμ32\frac{1}{\alpha}\geq\sqrt{\frac{\mu}{3\ell_{2}}}, to obtain

F(x^)F(x)\displaystyle F(\hat{x})-F(x^{*}) (329)
Ω(min{ζ22,μζ2c22,μζ2c22(μ2)1/2,Δ}exp(8Rμ2))\displaystyle\geq\Omega(\min\{\frac{\zeta^{2}}{\ell_{2}},\frac{\mu\zeta^{2}}{c\ell_{2}^{2}},\frac{\mu\zeta^{2}}{c\ell_{2}^{2}}(\frac{\mu}{\ell_{2}})^{1/2},\Delta\}\exp(-\frac{8R\sqrt{\mu}}{\sqrt{\ell_{2}}})) (330)
Ω(min{μζ2c22(μ2)1/2,Δ}exp(8Rμ2))\displaystyle\geq\Omega(\min\{\frac{\mu\zeta^{2}}{c\ell_{2}^{2}}(\frac{\mu}{\ell_{2}})^{1/2},\Delta\}\exp(-\frac{8R\sqrt{\mu}}{\sqrt{\ell_{2}}})) (331)

Next we note that 2β\ell_{2}\leq\beta and 2β5\ell_{2}\geq\frac{\beta}{5} (Woodworth, 2021, Eq 801), which gives us

F(x^)F(x)Ω(min{μ3/2ζ2cβ5/2,Δ}exp(18Rκ))\displaystyle F(\hat{x})-F(x^{*})\geq\Omega(\min\{\frac{\mu^{3/2}\zeta^{2}}{c\beta^{5/2}},\Delta\}\exp(-\frac{18R}{\sqrt{\kappa}})) (332)

Appendix H Technical Lemmas

Lemma H.1.

Let

g(r):=1SKi𝒮rk=0K1gi,k(r)\displaystyle g^{(r)}:=\frac{1}{SK}\sum_{i\in\mathcal{S}_{r}}\sum_{k=0}^{K-1}g_{i,k}^{(r)} (333)

Given Assumption B.6, Assumption B.5, and uniformly sampled clients per round,

𝔼1SKi𝒮rk=0K1gi,k(r)F(x(r))2σ2SK+(1S1N1)ζ2S\displaystyle\mathbb{E}\|\frac{1}{SK}\sum_{i\in\mathcal{S}_{r}}\sum_{k=0}^{K-1}g_{i,k}^{(r)}-\nabla F(x^{(r)})\|^{2}\leq\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S} (334)

If instead the clients are arbitrarily (possibly randomly) sampled and there is no gradient variance,

1SKi𝒮rk=0K1gi,k(r)F(x(r))2ζ2\displaystyle\|\frac{1}{SK}\sum_{i\in\mathcal{S}_{r}}\sum_{k=0}^{K-1}g_{i,k}^{(r)}-\nabla F(x^{(r)})\|^{2}\leq\zeta^{2} (335)
Proof.
𝔼1SKi𝒮rk=0K1f(x;zi)F(x)2\displaystyle\mathbb{E}\|\frac{1}{SK}\sum_{i\in\mathcal{S}_{r}}\sum_{k=0}^{K-1}\nabla f(x;z_{i})-\nabla F(x)\|^{2} (336)
=𝔼1Si𝒮rFi(x)F(x)2+𝔼1SKi𝒮rk=0K1f(x;zi)Fi(x)2\displaystyle=\mathbb{E}\|\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}\nabla F_{i}(x)-\nabla F(x)\|^{2}+\mathbb{E}\|\frac{1}{SK}\sum_{i\in\mathcal{S}_{r}}\sum_{k=0}^{K-1}\nabla f(x;z_{i})-\nabla F_{i}(x)\|^{2} (337)
𝔼1Si𝒮rFi(x)F(x)2+1S2K2i𝒮rk=0K1𝔼f(x;zi)Fi(x)2\displaystyle\leq\mathbb{E}\|\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}\nabla F_{i}(x)-\nabla F(x)\|^{2}+\frac{1}{S^{2}K^{2}}\sum_{i\in\mathcal{S}_{r}}\sum_{k=0}^{K-1}\mathbb{E}\|\nabla f(x;z_{i})-\nabla F_{i}(x)\|^{2} (338)

We get Eq. 337 by applying 𝔼[X2]=Var(X)+𝔼[X]2\mathbb{E}[X^{2}]=\text{Var}(X)+\mathbb{E}[X]^{2} over the randomness of the stochastic gradients (conditioning on the sampled clients). We get Eq. 338 using the fact that the gradient randomness is i.i.d. across ii and kk (so the cross terms in the expansion vanish). Note that 𝔼f(x;zi)Fi(x)2σ2\mathbb{E}\|\nabla f(x;z_{i})-\nabla F_{i}(x)\|^{2}\leq\sigma^{2} by Assumption B.6. On the other hand, by Assumption B.5,

𝔼1Si𝒮rFi(x)F(x)2(1S1N1)𝔼Fi(x)F(x)2S(1S1N1)ζ2S\displaystyle\mathbb{E}\|\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}\nabla F_{i}(x)-\nabla F(x)\|^{2}\leq(1-\frac{S-1}{N-1})\frac{\mathbb{E}\|\nabla F_{i}(x)-\nabla F(x)\|^{2}}{S}\leq(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S} (339)

So altogether

𝔼1SKi𝒮rk=1Kf(x;zi)F(x)2σ2SK+(1S1N1)ζ2S\displaystyle\mathbb{E}\|\frac{1}{SK}\sum_{i\in\mathcal{S}_{r}}\sum_{k=1}^{K}\nabla f(x;z_{i})-\nabla F(x)\|^{2}\leq\frac{\sigma^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta^{2}}{S} (340)

The second conclusion follows by noting

1Si𝒮rFi(x)F(x)21Si𝒮rFi(x)F(x)2\displaystyle\|\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}\nabla F_{i}(x)-\nabla F(x)\|^{2}\leq\frac{1}{S}\sum_{i\in\mathcal{S}_{r}}\|\nabla F_{i}(x)-\nabla F(x)\|^{2} (341)
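
As a quick numerical illustration of the without-replacement sampling bound above (a sketch with synthetic vectors standing in for client gradients; it is not part of the proof):

import numpy as np

# Monte Carlo sanity check of the bound in Eq. 334 (with K = 1 and no gradient
# noise): the squared error of the sampled-client average is at most
# (1 - (S-1)/(N-1)) * zeta^2 / S. The g_i are arbitrary synthetic vectors.
rng = np.random.default_rng(0)
N, S, d, trials = 20, 5, 3, 100_000

g = rng.normal(size=(N, d))              # synthetic per-client gradients
g_bar = g.mean(axis=0)                   # global gradient
zeta2 = np.max(np.sum((g - g_bar) ** 2, axis=1))

errs = np.empty(trials)
for t in range(trials):
    idx = rng.choice(N, size=S, replace=False)   # uniform sampling without replacement
    errs[t] = np.sum((g[idx].mean(axis=0) - g_bar) ** 2)

print(errs.mean(), (1 - (S - 1) / (N - 1)) * zeta2 / S)   # empirical mean <= bound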

Lemma H.2.

Let u,vu,v be arbitrary (possibly random) points. Define F^(x)=1SKi𝒮k=0K1f(x;z^i,k)\hat{F}(x)=\frac{1}{SK}\sum_{i\in\mathcal{S}}\sum_{k=0}^{K-1}f(x;\hat{z}_{i,k}) where z^i,k𝒟i\hat{z}_{i,k}\sim\mathcal{D}_{i}. Let

w=argminx{u,v}F^(x)w=\operatorname*{arg\,min}_{x\in\{u,v\}}\hat{F}(x)

Then

𝔼[F(w)F(x)]\displaystyle\mathbb{E}[F(w)-F(x^{*})] (342)
min{F(u)F(x),F(v)F(x)}+4σFSK+41S1N1ζFS\displaystyle\leq\min\{F(u)-F(x^{*}),F(v)-F(x^{*})\}+4\frac{\sigma_{F}}{\sqrt{SK}}+4\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}} (343)

and

𝔼[F(w)F(x)]\displaystyle\mathbb{E}[F(w)-F(x^{*})] (344)
min{𝔼F(u)F(x),𝔼F(v)F(x)}+4σFSK+41S1N1ζFS\displaystyle\leq\min\{\mathbb{E}F(u)-F(x^{*}),\mathbb{E}F(v)-F(x^{*})\}+4\frac{\sigma_{F}}{\sqrt{SK}}+4\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}} (345)
Proof.

Suppose that F(u)+2a=F(v)F(u)+2a=F(v), where a0a\geq 0. Then,

𝔼[F(w)F(x)]=P(w=u)(F(u)F(x))+P(w=v)(F(v)F(x))\displaystyle\mathbb{E}[F(w)-F(x^{*})]=P(w=u)(F(u)-F(x^{*}))+P(w=v)(F(v)-F(x^{*})) (346)

Substituting F(v)=F(u)+2aF(v)=F(u)+2a,

𝔼[F(w)F(x)]\displaystyle\mathbb{E}[F(w)-F(x^{*})] =P(w=u)(F(u)F(x))+P(w=v)(F(u)+2aF(x))\displaystyle=P(w=u)(F(u)-F(x^{*}))+P(w=v)(F(u)+2a-F(x^{*})) (347)
F(u)F(x)+2a\displaystyle\leq F(u)-F(x^{*})+2a (348)

Observe that 𝔼(F^(x)F(x))2σF2SK+(1S1N1)ζF2S\mathbb{E}(\hat{F}(x)-F(x))^{2}\leq\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S} by Assumption B.7 and Assumption B.8. Therefore by Chebyshev’s inequality,

P(|F^(x)F(x)|a)1a2(σF2SK+(1S1N1)ζF2S)\displaystyle P(|\hat{F}(x)-F(x)|\geq a)\leq\frac{1}{a^{2}}(\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S}) (349)

Observe that

P(w=v)\displaystyle P(w=v) P(F^(u)>F(u)+a)+P(F^(v)<F(v)a)\displaystyle\leq P(\hat{F}(u)>F(u)+a)+P(\hat{F}(v)<F(v)-a) (350)
2a2(σF2SK+(1S1N1)ζF2S)\displaystyle\leq\frac{2}{a^{2}}(\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S}) (351)

And so it is also true that

𝔼[F(w)F(x)]\displaystyle\mathbb{E}[F(w)-F(x^{*})] =F(u)F(x)+2aP(w=v)\displaystyle=F(u)-F(x^{*})+2aP(w=v) (352)
F(u)F(x)+2a(2a2(σF2SK+(1S1N1)ζF2S))\displaystyle\leq F(u)-F(x^{*})+2a(\frac{2}{a^{2}}(\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S})) (353)
=F(u)F(x)+4a(σF2SK+(1S1N1)ζF2S)\displaystyle=F(u)-F(x^{*})+\frac{4}{a}(\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S}) (354)

Therefore altogether we have that

𝔼[F(w)F(x)]F(u)F(x)+maxamin{4a(σF2SK+(1S1N1)ζF2S),2a}\displaystyle\mathbb{E}[F(w)-F(x^{*})]\leq F(u)-F(x^{*})+\max_{a}\min\{\frac{4}{a}(\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S}),2a\} (355)

Since the bound holds for the (unknown) value of aa, we may take the maximum over aa, which is attained when the two terms are equal:

4a(σF2SK+(1S1N1)ζF2S)=2aa=2(σF2SK+(1S1N1)ζF2S)\displaystyle\frac{4}{a}(\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S})=2a\implies a=\sqrt{2(\frac{\sigma_{F}^{2}}{SK}+(1-\frac{S-1}{N-1})\frac{\zeta_{F}^{2}}{S})} (356)

which gives us

𝔼[F(w)F(x)]F(u)F(x)+4σFSK+41S1N1ζFS\displaystyle\mathbb{E}[F(w)-F(x^{*})]\leq F(u)-F(x^{*})+4\frac{\sigma_{F}}{\sqrt{SK}}+4\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}} (357)

The proof also holds if we switch the roles of uu and vv, and so altogether we have the following: if F(u)<F(v)F(u)<F(v),

𝔼[F(w)F(x)]F(u)F(x)+4σFSK+41S1N1ζFS\displaystyle\mathbb{E}[F(w)-F(x^{*})]\leq F(u)-F(x^{*})+4\frac{\sigma_{F}}{\sqrt{SK}}+4\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}} (358)

if F(u)>F(v)F(u)>F(v),

𝔼[F(w)F(x)]F(v)F(x)+4σFSK+41S1N1ζFS\displaystyle\mathbb{E}[F(w)-F(x^{*})]\leq F(v)-F(x^{*})+4\frac{\sigma_{F}}{\sqrt{SK}}+4\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}} (359)

implying the first part of the theorem.

𝔼[F(w)F(x)]min{F(u)F(x),F(v)F(x)}+4σFSK+41S1N1ζFS\displaystyle\mathbb{E}[F(w)-F(x^{*})]\leq\min\{F(u)-F(x^{*}),F(v)-F(x^{*})\}+4\frac{\sigma_{F}}{\sqrt{SK}}+4\sqrt{1-\frac{S-1}{N-1}}\frac{\zeta_{F}}{\sqrt{S}} (360)

For the second part, note that the minimum of the two suboptimality gaps is pointwise at most each of them. Taking expectations of both pointwise inequalities yields

\displaystyle\mathbb{E}[\min\{F(u)-F(x^{*}),F(v)-F(x^{*})\}]\leq\min\{\mathbb{E}F(u)-F(x^{*}),\mathbb{E}F(v)-F(x^{*})\} (361)

which, combined with Eq. 360, proves the second claim. ∎
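
In our algorithms, this lemma is applied to choose between two candidate iterates using the sampled estimate of the objective. A minimal sketch of that selection step is below; sample_loss is a hypothetical placeholder for an unbiased stochastic evaluation of a client objective, and the function names are illustrative.

import numpy as np

# Minimal sketch of the selection step analyzed in Lemma H.2. sample_loss(x, i, rng)
# is a hypothetical placeholder for an unbiased stochastic evaluation of F_i(x);
# S clients are sampled without replacement and each contributes K evaluations.
def select_better(u, v, sample_loss, num_clients, S, K, rng=None):
    rng = rng or np.random.default_rng()
    chosen = rng.choice(num_clients, size=S, replace=False)
    def f_hat(x):
        return float(np.mean([sample_loss(x, i, rng) for i in chosen for _ in range(K)]))
    return u if f_hat(u) <= f_hat(v) else v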

Appendix I Experimental Setup Details

I.1 Convex Optimization

We empirically evaluate FedChain on federated regularized logistic regression. Let (xi,j,yi,j)(x_{i,j},y_{i,j}) be the jjth datapoint of the iith client, and let nin_{i} be the number of datapoints of the ii-th client. We minimize Eq. 1 where

Fi(w)\displaystyle F_{i}(w) =1ni(j=1niyi,jlog(wxi,j)(1yi,j)log(1wxi,j))+μ2w2.\displaystyle=\frac{1}{n_{i}}(\sum_{j=1}^{n_{i}}-y_{i,j}\log(w^{\top}x_{i,j})-(1-y_{i,j})\log(1-w^{\top}x_{i,j}))+\frac{\mu}{2}\|w\|^{2}.

Dataset.  We use the MNIST dataset of handwritten digits (LeCun et al., 2010). We model a federated setting with five clients by partitioning the data into groups. First, we take 500 images from each digit class (total 5,000). Each client’s local data is a mixture of data drawn from two digit classes (leading to heterogeneity), and data sampled uniformly from all classes.

We call a federated dataset $X\%$ homogeneous if the first $X\%$ of each class's 500 images is shuffled and evenly partitioned across the clients. The remaining $(100-X)\%$ is partitioned as follows: client $i\in\{1,\ldots,5\}$ receives the remaining non-shuffled data from classes $2i-2$ and $2i-1$. For example, in a 50% homogeneous setup, client 3 has 250 samples from digit 4, 250 samples from digit 5, and 500 samples drawn uniformly from all classes. Note that 100% homogeneity is not the same as setting the heterogeneity $\zeta=0$, due to sampling randomness; we use this construction for lack of finer control over $\zeta$. To model binary classification, we let even classes represent 0's and odd classes represent 1's, and set $K=20$. All clients participate in every round.
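A rough sketch of this $X\%$-homogeneous partition is given below, operating on index arrays only; loading MNIST, the shuffling seed, and the binary relabeling are omitted and assumed. Clients are 0-indexed here, so client $i$ receives classes $2i$ and $2i+1$ (classes $2i-2$ and $2i-1$ under the 1-indexing used in the text).

```python
# Sketch of the X%-homogeneous partition (indices only; hypothetical helper).
import numpy as np

def partition(class_indices, homogeneity, n_clients=5, per_class=500, seed=0):
    """class_indices: dict mapping class label -> array of image indices."""
    rng = np.random.default_rng(seed)
    n_hom = int(per_class * homogeneity)          # shuffled, i.i.d. portion
    clients = {i: [] for i in range(n_clients)}

    # Homogeneous part: pool the first n_hom indices of every class,
    # shuffle the pool, and deal it out evenly to the clients.
    pool = np.concatenate([idx[:n_hom] for idx in class_indices.values()])
    rng.shuffle(pool)
    for i, chunk in enumerate(np.array_split(pool, n_clients)):
        clients[i].extend(chunk.tolist())

    # Heterogeneous part: client i keeps the rest of classes 2i and 2i+1.
    for i in range(n_clients):
        for c in (2 * i, 2 * i + 1):
            clients[i].extend(class_indices[c][n_hom:].tolist())
    return clients
```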

Hyperparameters.  All experiments initialize iterates at 0 with regularization $\mu=0.1$. We fix the total number of rounds $R$ (which differs across experiments). For algorithms without stepsize decay, the tuning process is as follows.

We tune all stepsizes $\eta$ in the range below:

\displaystyle\{10^{-3},10^{-2.5},10^{-2},10^{-1.5},10^{-1}\} (365)

We tune the percentage of rounds before switching from $\mathcal{A}_{\text{local}}$ to $\mathcal{A}_{\text{global}}$ (if applicable) as

\displaystyle\{10^{-2},10^{-1.625},10^{-1.25},10^{-0.875},10^{-0.5}\} (366)

For algorithms with acceleration, we run experiments using the more easily implementable (but less mathematically tractable for our analysis) version in Aybat et al. (2019) as opposed to Algo. 3.

For algorithms with stepsize decay, we instead use the range in Eq. 366 to decide the percentage of rounds to wait before halving the stepsize; denote this number of rounds by $R_{\text{decay}}$. Subsequently, we halve the stepsize again each time the number of elapsed rounds reaches a power of 2 times $R_{\text{decay}}$ (i.e., at rounds $2R_{\text{decay}}$, $4R_{\text{decay}}$, and so on).

We use the heuristic that when the stepsize decreases to $\eta/K$, we switch to $\mathcal{A}_{\text{global}}$ and begin the stepsize decay process again. All algorithms are tuned for the best final gradient norm, averaged over 1000 runs.
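A minimal sketch of this decay-and-switch heuristic is shown below. The schedule (halve at $R_{\text{decay}}$, then again at $2R_{\text{decay}}$, $4R_{\text{decay}}$, ...) and the $\eta/K$ switching rule follow the description above; the function and argument names are ours, and restarting the decay inside $\mathcal{A}_{\text{global}}$ is only indicated in a comment.

```python
# Sketch of the stepsize schedule with the eta/K switch heuristic (names ours).
import math

def stepsize_and_phase(r, eta0, R_decay, K):
    """Return (stepsize, phase) at round r >= 1 of the local phase."""
    # Number of halvings so far: one at R_decay, then one more each time the
    # round count doubles relative to R_decay (2*R_decay, 4*R_decay, ...).
    n_halvings = 0 if r < R_decay else 1 + int(math.log2(r / R_decay))
    eta = eta0 / (2 ** n_halvings)
    if eta <= eta0 / K:
        # Heuristic from the text: switch to A_global here and restart the
        # same decay schedule within the global phase.
        return eta, "global"
    return eta, "local"
```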

I.2 Nonconvex Optimization

I.2.1 EMNIST

In this section we detail how we use the EMNIST (Cohen et al., 2017) dataset for our nonconvex experiments, where the handwritten characters are partitioned by their author. In this dataset, there are 3400 clients, with 671,585 images in the train set and 77,483 in the test set. The task is to train a convolutional network with two convolutional layers, max-pooling, dropout, and two dense layers as in the Federated Learning task suite from (Reddi et al., 2020).

Hyperparameters.  We set $R=500$, and rather than fixing the number of local steps, we fix the number of local client epochs (the number of times a client performs gradient descent over its own dataset) at 20. The number of clients sampled per round is 10. These choices account for the imbalanced dataset sizes across clients and are standard for this dataset-task pair (Reddi et al., 2020; Charles & Konečnỳ, 2020; Charles et al., 2021). For all algorithms aside from M-SGD, we tune $\eta$ in the range $\{0.05,0.1,0.15,0.2,0.25\}$. We tune the percentage of rounds before switching from $\mathcal{A}_{\text{local}}$ to $\mathcal{A}_{\text{global}}$ in $\{0.1,0.3,0.5,0.7,0.9\}$, where applicable.

We next detail how algorithms with stepsize decay are run. We tune the stepsize $\eta$ in $\{0.1,0.2,0.3,0.4,0.5\}$. Let $R_{\text{decay}}$ be the number of rounds before the first decay event; we tune $R_{\text{decay}}\in\lceil\{0.1R,0.275R,0.45R\}\rceil$. Whenever the number of elapsed rounds reaches a power of 2 times $R_{\text{decay}}$, we halve the stepsize. After $R^{\prime}$ rounds have passed, we enter $\mathcal{A}_{\text{global}}$, where we tune $R^{\prime}\in\lceil\{0.5R,0.7R,0.9R\}\rceil$, and we restart the decay process upon entering $\mathcal{A}_{\text{global}}$.

For M-SGD we tune $\eta\in\{0.1,0.2,0.3,0.4,0.5\}$ and $R_{\text{decay}}\in\lceil\{10^{-2}R,10^{-1.625}R,10^{-1.25}R,10^{-0.875}R,10^{-0.5}R\}\rceil$.

For stepsize-decaying SCAFFOLD and FedAvg, we tune $\eta\in\{0.1,0.2,0.3,0.4,0.5\}$ and $R_{\text{decay}}\in\lceil\{0.1R,0.275R,0.45R\}\rceil$. When evaluating a particular metric, the algorithms are tuned according to the mean of that metric over the last 5 evaluated iterates. Reported accuracies are likewise the mean over the last 5 evaluated iterates, as the values had mostly stabilized during training.

I.2.2 CIFAR-100

For these experiments we use the CIFAR-100 dataset (Krizhevsky, 2009), with images partitioned across clients via the method proposed in (Reddi et al., 2020). There are 500 train clients, with a total of 50,000 images in the train set, and 100 test clients with 10,000 images in the test set. The task is to train a ResNet-18, replacing batch normalization layers with group normalization, as in the Federated Learning task suite from (Reddi et al., 2020). We leave out SCAFFOLD in all experiments because of out-of-memory issues.

Hyperparameters.  We set $R=5000$, and rather than fixing the number of local steps, we fix the number of local client epochs (the number of times a client performs gradient descent over its own dataset) at 20. The number of clients sampled per round is 10. These choices account for the imbalanced dataset sizes across clients and are standard for this dataset-task pair (Reddi et al., 2020; Charles & Konečnỳ, 2020; Charles et al., 2021). For all algorithms we tune $\eta$ in $\{0.1,0.2,0.3,0.4,0.5\}$. Algorithms without stepsize decay tune the fraction of rounds before switching to $\mathcal{A}_{\text{global}}$ in $\{0.1,0.3,0.5,0.7,0.9\}$. Algorithms with stepsize decay (aside from M-SGD) tune the fraction of rounds before the first decay event in $\{0.1,0.275,0.45\}$ (and the stepsize is halved at every power of 2 of that fraction of rounds), and the fraction of rounds before swapping to $\mathcal{A}_{\text{global}}$ in $\{0.5,0.7,0.9\}$ if applicable (the stepsize decay process restarts after swapping). M-SGD tunes the fraction of rounds before the first decay event in $\{0.01,0.0237,0.0562,0.1334,0.3162\}$. We tune for, and report, the test accuracy of the last iterate, as accuracy was still improving at the end of training.

Appendix J Additional Convex Experiments

In this section, we give supplemental experimental results to verify the following claims:

  1. Higher $K$ (1) allows accelerated algorithms to perform better than non-accelerated algorithms, and furthermore (2) allows us to run $\mathcal{A}_{\text{local}}$ for only one round and still obtain satisfactory results.

  2. Our FedChain instantiations outperform stepsize-decaying baselines.

J.1 Verifying the effect of increased $K$ and $R=1$

Our proofs for FedChain also hold if we run $\mathcal{A}_{\text{local}}$ for only one communication round, so long as $K$ is sufficiently large (the exact requirement is given in the formal theorem proofs). In this section we verify the effect of large $K$ in conjunction with running $\mathcal{A}_{\text{local}}$ for only one round. The setup is the same as in § I.1, except we tune the stepsize $\eta$ over a larger grid:

\displaystyle\eta\in\{10^{-3},10^{-2.5},10^{-2},10^{-1.5},10^{-1},10^{-0.5},10^{0}\} (367)

The plots are in Fig. 3. For algorithms labeled “1-X$\to$Y”, we run algorithm X for one round and algorithm Y for the rest of the rounds. These algorithms are tuned in the following way: algorithm X has its stepsize tuned over Eq. 367, and algorithm Y independently has its stepsize tuned over Eq. 367. We require these to be tuned separately because local update algorithms and centralized algorithms have stepsizes that depend on $K$ differently when $K$ is large (see, for example, Thm. D.1 and Thm. E.1). The baselines have a single stepsize tuned over Eq. 367. All methods are tuned for the lowest final gradient norm averaged over 1000 runs.
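For concreteness, a minimal sketch of the “1-X$\to$Y” schedule is given below: one round of the local-update algorithm, followed by the centralized algorithm for the remaining rounds, each with its own tuned stepsize. The `local_round` and `global_round` callables are placeholders (e.g., one communication round of FedAvg or ASG), not the paper's exact implementations.

```python
# Sketch of a "1-X -> Y" chained run; round functions are hypothetical stubs.
def chained_run(x0, R, local_round, global_round, eta_local, eta_global):
    x = local_round(x0, eta_local)        # one round of A_local (algorithm X)
    for _ in range(R - 1):                # remaining rounds of A_global (Y)
        x = global_round(x, eta_global)
    return x
```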

Overall, we see that in the large-$K$ regime, algorithms that start with a local update algorithm and then finish with an accelerated centralized algorithm have the best communication complexity. This is because larger $K$ decreases the error due to variance, as seen for example in Thm. F.2. Acceleration increases instability (unless one performs a carefully calibrated stepsize decay, which we do not in the experiments), so decreasing the variance helps accelerated algorithms the most. Furthermore, a single round of $\mathcal{A}_{\text{local}}$ suffices for a large improvement, as seen in Fig. 3. This is expected, because both the “bias” term (the $\Delta\exp(-\frac{R\sqrt{K}}{\kappa})$ term in Thm. E.1) and the variance term become negligible, leaving just the heterogeneity term, as desired.


Figure 3: We investigate the effect of high $K$ ($K=100$). “1-X$\to$Y” denotes a chained algorithm with X run for one round and Y run for the rest of the rounds. Chained algorithms, even with only one round allocated to $\mathcal{A}_{\text{local}}$, perform the best. Furthermore, accelerated algorithms outperform their non-accelerated counterparts.

J.2 Including Stepsize Decay Baselines


Figure 4: The same as Fig. 2, except we include stepsize decay for baseline algorithms. Chained algorithms still perform the best.

In this section, we compare the algorithms against SGD, FedAvg, ASG, and SCAFFOLD with stepsize decay. We use the same setting as § I.1 for the decay process and the stepsize. We still see that the chained methods outperform their non-chained counterparts.