
FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices Using a Computing Power-Aware Scheduler

Zilinghan Li1,2,3, Pranshu Chaturvedi1,2,3, Shilan He1,2,3, Han Chen2, Gagandeep Singh2,4,
Volodymyr Kindratenko2,3, E. A. Huerta1,2,5, Kibaek Kim1, Ravi Madduri1
1
Argonne National Laboratory, 2University of Illinois at Urbana-Champaign, 3National Center
for Supercomputing Applications, 4VMware Research, 5The University of Chicago
{zl52, pranshu3, shilanh2, hanc7, ggnds, kindrtnk}@illinois.edu
{elihu, kimk, madduri}@anl.gov
Abstract

Cross-silo federated learning offers a promising solution to collaboratively train robust and generalized AI models without compromising the privacy of local datasets, e.g., in healthcare, finance, and scientific projects that lack a centralized data facility. Nonetheless, because of the disparity of computing resources among different clients (i.e., device heterogeneity), synchronous federated learning algorithms suffer from degraded efficiency when waiting for straggler clients. Meanwhile, asynchronous federated learning algorithms experience degradation in the convergence rate and final model accuracy on non-identically and independently distributed (non-IID) heterogeneous datasets due to stale local models and client drift. To address these limitations in cross-silo federated learning with heterogeneous clients and data, we propose FedCompass, an innovative semi-asynchronous federated learning algorithm with a computing power-aware scheduler on the server side, which adaptively assigns varying amounts of training tasks to different clients using the knowledge of the computing power of individual clients. FedCompass ensures that multiple locally trained models from clients are received almost simultaneously as a group for aggregation, effectively reducing the staleness of local models. At the same time, the overall training process remains asynchronous, eliminating prolonged waiting periods from straggler clients. Using diverse non-IID heterogeneous distributed datasets, we demonstrate that FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms while remaining more efficient than synchronous algorithms when performing federated learning on heterogeneous clients. The source code for FedCompass is available at https://github.com/APPFL/FedCompass.

1 Introduction

Federated learning (FL) is a model training approach where multiple clients train a global model under the orchestration of a central server (Konečnỳ et al., 2016; McMahan et al., 2017; Yang et al., 2019; Kairouz et al., 2021). In FL, $m$ distributed clients solve the optimization problem $\min_{w\in\mathbb{R}^{d}}F(w)=\sum_{i=1}^{m}p_{i}F_{i}(w)$, where $p_{i}$ is the importance weight of client $i$'s local data, usually defined as the relative data sample size, and $F_{i}(w)$ is the loss on client $i$'s local data. FL typically runs two steps iteratively: (i) the server distributes the global model to clients to train it using their local data, and (ii) the server collects the locally trained models and updates the global model by aggregating them. Federated Averaging (FedAvg) (McMahan et al., 2017) is the most popular FL algorithm, where each client trains a model using local data for $Q$ local steps in each training round, after which the server aggregates all local models by performing a weighted average and sends the updated global model back to all clients for the next round of training. By leveraging the training data from multiple clients without explicitly sharing it, FL empowers the training of more robust and generalized models while preserving the privacy of client data. The data in FL is generally heterogeneous, which means that client data is non-identically and independently distributed (non-IID). FL is most commonly conducted in two settings: cross-device FL and cross-silo FL (Kairouz et al., 2021). In cross-device FL, numerous mobile or IoT devices participate, but not in every training round because of reliability issues (Mills et al., 2019). In cross-silo FL, a smaller number of reliable clients contribute to training in each round, and the clients are often represented by large institutions in domains such as healthcare or finance, where preserving data privacy is essential (Wang et al., 2019; Wang, 2019; Kaissis et al., 2021; Pati et al., 2022; Hoang et al., 2023).
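As a concrete reference point for the aggregation step, the following minimal sketch (our illustration, not code from any particular FL framework; all function and variable names are assumptions) performs one FedAvg aggregation with $p_{i}$ set to the relative sample size:

import copy
import torch

def fedavg_aggregate(global_model, client_states, client_sizes):
    """One FedAvg aggregation round: w <- sum_i p_i * w_i, with p_i proportional
    to client i's sample count. A minimal illustration, not a framework API."""
    total = float(sum(client_sizes))
    avg_state = {}
    for key in global_model.state_dict():
        # Weighted average of the corresponding parameter across all clients
        avg_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    global_model.load_state_dict(avg_state)
    return global_model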

The performance of FL suffers from device heterogeneity, where the disparity of computing power among participating client devices leads to significant variations in local training times. Classical synchronous FL algorithms such as FedAvg, where the server waits for all client models before global aggregation, suffer from compute inefficiency as faster clients have to wait for slower clients (stragglers). Several asynchronous FL algorithms have been developed to immediately update the global model after receiving one client model to improve computing resource utilization. The principal shortcoming in asynchronous FL is that straggler clients may train on outdated global models and obtain stale local models, which can potentially undermine the performance of the global model. To deal with stale local models, various solutions have been proposed, including client selection, weighted aggregation, and tiered FL. Client selection strategies such as FedAR (Imteaj & Amini, 2020) select a subset of clients according to their robustness or computing power. However, these strategies may not be applicable for cross-silo FL settings, where all clients are expected to participate to maintain the robustness of the global model. Algorithms such as FedAsync (Xie et al., 2019) introduce a staleness factor to penalize the local models trained on outdated global models. While the staleness factor alleviates the negative impact of stale local models, it may cause the global model to drift away from the slower clients (client drift) even with minor disparities in client computing power, resulting in a further decrease in accuracy on non-IID heterogeneous distributed datasets. Tiered FL methods such as FedAT (Chai et al., 2021) group clients into tiers based on their computing power, allowing each tier to update at its own pace. Nonetheless, the waiting time inside each tier can be significant, and it is not resilient to sudden changes in computing power.

In this paper, we focus on improving the performance in cross-silo FL settings. Particularly, we propose a semi-asynchronous cross-silo FL algorithm, FedCompass, which employs a COMputing Power-Aware Scheduler (Compass) to reduce client local model staleness and enhance FL efficiency in cross-silo settings. Compass assesses the computing power of individual clients based on their previous response times and dynamically assigns varying numbers of local training steps to ensure that a group of clients can complete local training almost simultaneously. Compass is also designed to be resilient to sudden client computing power changes. This enables the server to perform global aggregation using the local models from a group of clients without prolonged waiting times, thus improving computing resource utilization. Group aggregation leads to a lower global update frequency and reduced local model staleness, thus mitigating client drift and improving the performance on non-IID heterogeneous data. In addition, the design of FedCompass is orthogonal to and compatible with several server-side optimization strategies for further enhancement of performance on non-IID data. In summary, the main contributions of this paper are as follows:

  • A novel semi-asynchronous cross-silo FL algorithm, FedCompass, which employs a resilient computing power-aware scheduler to make client updates arrive in groups almost simultaneously for aggregation, thus reducing client model staleness for enhanced performance on non-IID heterogeneous distributed datasets while ensuring computing efficiency.

  • A suite of convergence analyses which demonstrate that FedCompass can converge to a first-order critical point of the global loss function with heterogeneous clients and data, and offer insights into how the number of local training steps impacts the convergence rate.

  • A suite of experiments with heterogeneous clients and datasets which demonstrate that FedCompass converges faster than synchronous and asynchronous FL algorithms, and achieves higher accuracy than other asynchronous algorithms by mitigating client drift.

2 Related Work

Device Heterogeneity in Federated Learning. Device heterogeneity occurs when clients have varying computing power, leading to varied local training times (Li et al., 2020; Xu et al., 2021). For synchronous FL algorithms that wait for all local models before global aggregation, device heterogeneity leads to degraded efficiency due to the waiting time for slower clients, often referred to as stragglers. Some methods address device heterogeneity by waiting only for some clients and ignoring the stragglers (Bonawitz et al., 2019). However, this approach disregards the contributions of straggler clients, resulting in wasted client computing resources and a global model significantly biased towards faster clients. Other solutions explicitly select the clients that perform updates. Chen et al. (2021) employs a greedy selection method with a heuristic based on client computing resources. Hao et al. (2020) defines a priority function according to computing power and model accuracy to select participating clients. FedAR (Imteaj & Amini, 2020) assigns each client a trust score based on previous activities. FLNAP (Reisizadeh et al., 2022) starts from faster nodes and gradually involves slower ones. FedCS (Nishio & Yonetani, 2019) queries client resource information for client selection. SAFA (Wu et al., 2020) chooses the clients with lower crash probabilities. However, these methods may not be suitable for cross-silo FL settings, where the participation of every client is crucial for maintaining the robustness of the global model and there are stronger assumptions with regard to the trustworthiness and reliability of clients.

Asynchronous Federated Learning. Asynchronous FL algorithms provide another avenue to mitigate the impact of stragglers. In asynchronous FL, the server typically updates the global model upon receiving one or a few client models (Xie et al., 2019; Chen et al., 2020) or stores the local models in a size-$K$ buffer (Nguyen et al., 2022; So et al., 2021), and then immediately sends the global model back for further local training. Asynchronous FL offers significant advantages in heterogeneous client settings by reducing the idle time for each client. However, one of the key challenges in asynchronous FL is the staleness of local models. In asynchronous FL, global model updates can occur without waiting for all local models. This flexibility leads to stale local models trained on outdated global models, which may negatively impact the accuracy of the global model. Various solutions aimed at mitigating these adverse effects have been proposed. One common strategy is weighted aggregation, which is prevalent in numerous asynchronous FL algorithms (Xie et al., 2019; Chen et al., 2019; 2020; Wang et al., 2021; Nguyen et al., 2022). This approach applies a staleness weight factor to penalize outdated local models during the aggregation. While effective in reducing the negative impact of stale models, this often causes the global model to drift away from the local data on slower clients, and the staleness factor can significantly degrade the final model performance in settings with only minor computing disparities among clients. Tiered FL methods offer another option for improvement. FedAT (Chai et al., 2021) clusters clients into several faster and slower tiers, allowing each tier to update the global model at its own pace. However, the waiting time within tiers can be substantial, and it becomes less effective when the client pool is small and client heterogeneity is large. The tiering may also not be updated in a timely manner, making it less resilient to speed changes. Similarly, CSAFL (Zhang et al., 2021) groups clients based on gradient direction and computation time. While this offers some benefits, the fixed grouping can become a liability if client computing speeds vary significantly over the course of training.
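To make the staleness-weighting mechanism concrete, here is a small sketch in the spirit of FedAsync-style aggregation; the polynomial decay below is one common choice of staleness function, not necessarily the exact form used in the cited works, and the names are illustrative assumptions:

def staleness_weight(staleness, alpha=0.6, a=0.5):
    """Polynomial staleness discount s(tau) = alpha * (1 + tau)^(-a).
    One common choice; other decay functions are also used in practice."""
    return alpha * (1.0 + staleness) ** (-a)

def async_update(global_state, client_state, staleness):
    """Mix a (possibly stale) client model into the global model:
    w <- (1 - s) * w + s * w_client, where s shrinks as staleness grows."""
    s = staleness_weight(staleness)
    return {k: (1.0 - s) * v + s * client_state[k] for k, v in global_state.items()}

Note how a larger staleness shrinks the client's contribution, which is precisely the mechanism that can pull the global model away from consistently slow clients on non-IID data.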

3 Proposed Method: FedCompass

The design choices of FedCompass are motivated by the shortcomings of other asynchronous FL algorithms caused by the potential staleness of client local models. In FedAsync, as the global model gets updated frequently, the client model is significantly penalized by a staleness weight factor, leading to theoretically and empirically sub-optimal convergence characteristics, especially when the disparity in computing power among clients is small. This can cause the global model to gradually drift away from the data of slow clients and have decreased performance on non-IID datasets. FedBuff attempts to address this problem by introducing a size-$K$ buffer for client updates to reduce the global update frequency. However, the optimal buffer size varies for different client numbers and heterogeneity settings and needs to be selected heuristically. It is also possible that one buffer contains several updates from the same client, or that one client trains on the same global model several times. Motivated by these shortcomings, we aim to group clients for global aggregation to reduce model staleness and avoid repeated aggregation or local training. As cross-silo FL involves a small number of reliable clients, this enables us to design a scheduler that can profile the computing power of each client. The scheduler then dynamically and automatically creates groups by assigning clients different numbers of local training steps so that clients with similar computing power finish local training in close proximity to each other. The scheduler strikes a balance between the time efficiency of asynchronous algorithms and the objective consistency of synchronous ones. Additionally, we keep the scheduler operations separate from the global aggregation, making it possible to combine the benefits of aggregation strategies from other FL algorithms, such as server-side optimizations for improving performance when training data are non-IID (Hsu et al., 2019; Reddi et al., 2020) and normalized averaging to eliminate objective inconsistency when clients perform varying numbers of steps (Wang et al., 2020; Horváth et al., 2022). Further, informed by the drawbacks of tiered FL methods such as FedAT, the scheduler is designed to cover multiple corner cases to ensure its resilience to sudden speed changes, especially the occasional yet substantial changes that may arise in real-world cross-silo FL as a result of HPC/cloud auto-scaling.

3.1 Compass: Computing Power-Aware Scheduler

To address the aforementioned desiderata, we introduce Compass, a COMputing Power-Aware Scheduler that assesses the computing power of individual clients according to their previous response times. Compass then dynamically assigns varying numbers of local steps $Q$ within a predefined range $[Q_{\min},Q_{\max}]$ to different clients in each training round. This approach ensures that multiple client models are received nearly simultaneously, allowing for a grouped server aggregation. By incorporating several local models in a single global update and coordinating the timing of client responses, we effectively reduce the frequency of global model updates. This, in turn, reduces client model staleness while keeping the client idle time to a minimum during the training process.

Figure 1: Overview of an example FL run using the Compass scheduler on five clients with the minimum number of local steps $Q_{\min}=20$ and maximum number of local steps $Q_{\max}=100$.

Figure 1 illustrates an example run of federated learning using Compass on five clients with different computing power. For the sake of brevity, this illustration focuses on a simple scenario, and more details are described in Section 3.2. All times in the figure are absolute wall-clock times from the start of FL. First, all clients are initialized with $Q_{\min}=20$ local steps. When client 1 finishes local training and sends the update $\Delta_{1}$ to the server at time 02:00, Compass records its speed and creates the first arrival group with the expected arrival time $T_{a}=$ 12:00, which is computed based on the client speed and the assigned number of local steps $Q=Q_{\max}=100$. We assume negligible communication and aggregation times in this example, so client 1 receives an updated global model $w$ immediately at 02:00 and begins the next round of training. Upon receiving the update from client 2, Compass slots it into the first arrival group with $Q=40$ local steps, aiming for a completion time of $T_{a}=$ 12:00 as well. Client 3 follows suit with $28$ steps. When Compass computes local steps for client 4 in order to finish by $T_{a}=$ 12:00, it finds that $Q=10$ falls below $Q_{\min}=20$. Therefore, Compass creates a new arrival group for client 4 with expected arrival time $T_{a}=$ 22:00. Compass optimizes the value of $T_{a}$ such that the clients in an existing arrival group may join this newly created group in the future for group merging. After client 5 finishes, it is assigned to the group created for client 4. When clients 1–3 finish their second round of local training at 12:00, Compass assigns them to the arrival group of clients 4 and 5 (i.e., arrival group 2) with the corresponding values of $Q$ so that all the clients are in the same arrival group and expected to arrive nearly simultaneously for global aggregation in the future. The training then reaches an equilibrium with a fixed number of groups and group assignments, and future training rounds proceed with fixed values of $Q$ for each client in the ideal case. The detailed implementation and explanation of Compass can be found in Appendix A. We also provide analyses and empirical results for non-ideal scenarios with client computing power changes in Appendix G.2, where we show how the design of Compass makes it resilient when clients experience speedup or slowdown during the course of training, and compare it with the tiered FL method FedAT (Chai et al., 2021).
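To make the scheduling rule concrete, the sketch below (our illustration, not the released implementation; function names and the assumed report time of client 2 are hypothetical) assigns a client the largest number of local steps it can finish before a group's expected arrival time, clamped to $[Q_{\min},Q_{\max}]$, and reproduces client 2's assignment of $Q=40$ from Figure 1 (timestamps treated as minutes for the arithmetic):

Q_MIN, Q_MAX = 20, 100

def assign_local_steps(per_step_time, t_now, t_arrival):
    # Largest number of local steps the client can finish before the group's
    # expected arrival time, clamped to [Q_MIN, Q_MAX]; if the unclamped value
    # falls below Q_MIN, the real scheduler instead opens a new arrival group.
    q = int((t_arrival - t_now) / per_step_time)
    return max(Q_MIN, min(Q_MAX, q))

# Client 1 ran Q_MIN = 20 steps by 02:00 (0.1 min/step) and is assigned
# Q_MAX = 100 steps, so the first group's expected arrival time is
# 2:00 + 100 * 0.1 = 12:00, as in Figure 1.
t_arrival = 2.0 + Q_MAX * (2.0 / Q_MIN)

# Assuming client 2 reports at 04:00 after its 20 warm-up steps (0.2 min/step),
# it is assigned (12 - 4) / 0.2 = 40 steps so that it also arrives at 12:00.
print(assign_local_steps(per_step_time=4.0 / Q_MIN, t_now=4.0, t_arrival=t_arrival))  # -> 40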

3.2 FedCompass: Federated Learning with Compass

Algorithm 1: FedCompass

 1: Function FedCompass-Client($i$, $w$, $Q$, $\eta_{\ell}$, $B$):
    Input: client id $i$, global model $w$, number of local steps $Q$, local learning rate $\eta_{\ell}$, batch size $B$
 2:   for $q \in [1,\dots,Q]$ do
 3:     $\mathcal{B}_{i,q} :=$ random training batch with batch size $B$
 4:     $w_{i,q} := w_{i,q-1} - \eta_{\ell}\nabla f_{i}(w_{i,q-1},\mathcal{B}_{i,q})$   ▷ $w_{i,q}$ is client $i$'s model after $q$ steps, $w_{i,0}=w$
 5:   Send $\Delta_{i} := w_{i,0} - w_{i,Q}$ to server

 6: Function FedCompass-Server($N$, $\eta_{\ell}$, $B$, $w$):
    Input: number of comm. rounds $N$, local learning rate $\eta_{\ell}$, batch size $B$, initial model weight $w$
 7:   FedCompass-Client($i$, $w$, $Q_{\min}$, $\eta_{\ell}$, $B$) for all clients $i$
 8:   Initialize client info dictionary $\mathcal{C}$, group info dictionary $\mathcal{G}$, and global model timestamp $\tau_{g} := 0$
 9:   for $n \in [1,\dots,N]$ do
10:     if receive update $\Delta_{i}$ from client $i$ then
11:       Record client speed into $\mathcal{C}[i].S$
12:       if client $i$ has an arrival group $g$ then
13:         $\mathcal{G}[g].CL$.remove($i$);  $\mathcal{G}[g].ACL$.add($i$)   ▷ move client $i$ to the arrived client list
14:         if $T_{\text{now}} > \mathcal{G}[g].T_{\max}$ then
15:           $\overline{\Delta} := \overline{\Delta} + st(\tau_{g}-\mathcal{C}[i].\tau) \cdot p_{i} \cdot \Delta_{i}$   ▷ $st(\cdot)$ is the staleness factor
16:           $\mathcal{C}[i].\tau := \tau_{g}$
17:           AssignGroup($i$)   ▷ computes $\mathcal{C}[i].Q$; details in Alg. 2 in the Appendix
18:           FedCompass-Client($i$, $w$, $\mathcal{C}[i].Q$, $\eta_{\ell}$, $B$)
19:           delete $\mathcal{G}[g]$ if len($\mathcal{G}[g].CL$) $= 0$
20:         else
21:           $\overline{\Delta}^{g} := \overline{\Delta}^{g} + st(\tau_{g}-\mathcal{C}[i].\tau) \cdot p_{i} \cdot \Delta_{i}$   ▷ $st(\cdot)$ is the staleness factor
22:           if len($\mathcal{G}[g].CL$) $= 0$ then
23:             GroupAggregate($g$)   ▷ details in Alg. 3 in the Appendix
24:             delete $\mathcal{G}[g]$
25:       else
26:         $w := w - st(\tau_{g}-\mathcal{C}[i].\tau) \cdot p_{i} \cdot \Delta_{i}$;  $\tau_{g} := \tau_{g}+1$;  $\mathcal{C}[i].\tau := \tau_{g}$
27:         AssignGroup($i$)   ▷ computes $\mathcal{C}[i].Q$; details in Alg. 2 in the Appendix
28:         FedCompass-Client($i$, $w$, $\mathcal{C}[i].Q$, $\eta_{\ell}$, $B$)

FedCompass is a semi-asynchronous FL algorithm that leverages the Compass scheduler to coordinate the training process. It is semi-asynchronous in that the server may need to wait for a group of clients before global aggregation, but the Compass scheduler minimizes the waiting time by arranging for clients within the same group to finish local training nearly simultaneously, so the overall training process still remains asynchronous without long client idle times. The function FedCompass-Client in Algorithm 1 outlines the client-side algorithm for FedCompass. The client performs $Q$ local steps of mini-batch stochastic gradient descent and sends the difference between the initial and final models, $\Delta_{i}$, as the client update to the server.
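The following PyTorch-style sketch (our illustration; function and argument names are assumptions, not the released FedCompass code) mirrors FedCompass-Client: run $Q$ local SGD steps and return $\Delta_{i}=w_{i,0}-w_{i,Q}$:

import copy
import torch

def fedcompass_client_update(model, dataloader, loss_fn, Q, lr):
    """Run Q local SGD steps and return Delta_i = w_{i,0} - w_{i,Q}
    (a sketch of FedCompass-Client in Algorithm 1, not the reference code)."""
    initial = copy.deepcopy(model.state_dict())                 # w_{i,0}
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    data_iter = iter(dataloader)
    for _ in range(Q):                                          # Q local steps
        try:
            x, y = next(data_iter)
        except StopIteration:                                   # cycle through local data
            data_iter = iter(dataloader)
            x, y = next(data_iter)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    # Delta_i = w_{i,0} - w_{i,Q}, sent to the server for aggregation
    return {k: initial[k] - v for k, v in model.state_dict().items()}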

The function FedCompass-Server in Algorithm 1 presents the server algorithm for FedCompass. Initially, in the absence of prior knowledge of clients' computing speeds, the server assigns a minimum of $Q_{\min}$ local steps to all clients for a warm-up. Subsequently, client speed information is gathered and updated each time a client responds with a local update. For each client's first response (lines 25 to 28), the server immediately updates the global model using the client update and assigns the client to an arrival group. All clients within the same arrival group are expected to arrive almost simultaneously at a specified time $T_{a}$ for the next global group aggregation. This is achieved by assigning variable local steps based on individual computing speeds. The AssignGroup function first attempts to assign the client to an existing group with local steps within the range $[Q_{\min},Q_{\max}]$. If none of the existing groups have local step values that can accommodate the client, a new group is created for the client with a local step value that facilitates future group merging opportunities. Further details for this function are elaborated in Appendix A.

After the initial arrivals, each client is expected to adhere to the group-specific arrival time $T_{a}$. Nonetheless, to accommodate client speed fluctuations and potential estimation inaccuracies, a latest arrival time $T_{\max}$ is designated for each group. If a client arrives prior to $T_{\max}$ (as per lines 20 to 24), the server stores the client update in a group-specific buffer $\overline{\Delta}^{g}$ and checks whether this client is the last within its group to arrive. If so, the server invokes GroupAggregate to update the global model with contributions from all clients in the group and assigns them new training tasks with the latest global model. The details of GroupAggregate are shown in Algorithm 3 in the Appendix. If there remain pending clients within the group, the server waits until the last client arrives or the latest arrival time $T_{\max}$ is reached before the group aggregation. Clients that arrive after the latest time $T_{\max}$ and miss the group aggregation (lines 15 to 19), due to an unforeseen delay or slowdown, have their updates stored in a general buffer $\overline{\Delta}$, and the server immediately assigns the client to an arrival group using the AssignGroup function for the next round of local training. The updates in the general buffer $\overline{\Delta}$ are incorporated into the group aggregation of the following arrival group.
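The following simplified sketch summarizes this bookkeeping (our reading of Algorithm 1; the exact GroupAggregate procedure is given in Algorithm 3 of the Appendix, so the control flow and names here are illustrative rather than the reference implementation):

class ArrivalGroup:
    def __init__(self, expected_arrival, latest_arrival):
        self.T_a = expected_arrival      # expected arrival time for the group
        self.T_max = latest_arrival      # latest arrival time before aggregation proceeds
        self.pending = set()             # clients still training (CL)
        self.arrived = set()             # clients that have reported (ACL)
        self.buffer = {}                 # client id -> staleness-weighted update

def on_client_update(group, client_id, weighted_update, now, general_buffer):
    """Handle one update from a client already assigned to `group`."""
    group.pending.discard(client_id)
    group.arrived.add(client_id)
    if now > group.T_max:
        # Missed the group aggregation: park the update in the general buffer;
        # it is folded into the *next* group aggregation.
        general_buffer[client_id] = weighted_update
        return "reassign"                # server calls AssignGroup immediately
    group.buffer[client_id] = weighted_update
    if not group.pending:
        return "aggregate"               # last client arrived: GroupAggregate now
    return "wait"                        # wait for remaining clients or T_max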

4 Convergence Analysis

We provide the convergence analysis of FedCompass for smooth and non-convex loss functions $F_{i}(w)$. We denote by $\nabla F(w)$ the gradient of the global loss, $\nabla F_{i}(w)$ the gradient of client $i$'s local loss, and $\nabla f_{i}(w,\mathcal{B}_{i})$ the gradient of the loss on batch $\mathcal{B}_{i}$. With that, we make the following assumptions.

Assumption 1.

Lipschitz smoothness. The loss function of each client $i$ is Lipschitz smooth: $\|\nabla F_{i}(w)-\nabla F_{i}(w^{\prime})\|\leq L\|w-w^{\prime}\|$.

Assumption 2.

Unbiased client stochastic gradient. $\mathbb{E}_{\mathcal{B}_{i}}[\nabla f_{i}(w,\mathcal{B}_{i})]=\nabla F_{i}(w)$.

Assumption 3.

Bounded gradient, and bounded local and global gradient variance. $\|\nabla F_{i}(w)\|^{2}\leq M$, $\mathbb{E}_{\mathcal{B}_{i}}[\|\nabla f_{i}(w,\mathcal{B}_{i})-\nabla F_{i}(w)\|^{2}]\leq\sigma_{l}^{2}$, and $\mathbb{E}[\|\nabla F_{i}(w)-\nabla F(w)\|^{2}]\leq\sigma_{g}^{2}$.

Assumption 4.

Bounded client heterogeneity and staleness. Suppose that client $a$ is the fastest client and client $b$ is the slowest, and let $\mathcal{T}_{a}$ and $\mathcal{T}_{b}$ be the times per local step of these two clients, respectively. We then assume that $\mathcal{T}_{b}/\mathcal{T}_{a}\leq\mu$. This assumption also implies that the staleness $(\tau_{g}-\mathcal{C}[i].\tau)$ of each client model is bounded by a number $\tau_{\max}$.

Assumptions 1–3 are common assumptions in the convergence analysis of federated or distributed learning algorithms (Stich, 2018; Li et al., 2019; Yu et al., 2019; Karimireddy et al., 2020; Nguyen et al., 2022). Assumption 4 indicates that the speed difference among clients is finite and is equivalent to the assumption that all clients are reliable, which is reasonable in cross-silo FL.

Theorem 1.

Suppose that $\eta_{\ell}\leq\frac{1}{2LQ_{\max}}$, $\textit{Q}=Q_{\max}/Q_{\min}$, and $\mu^{\prime}=\textit{Q}^{\lfloor\log_{\textit{Q}}\mu\rfloor}$. Then, after $T$ updates of the global model $w$, FedCompass achieves the following convergence rate:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(w^{(t)})\|^{2}\leq \frac{2\gamma_{2}F^{*}}{\eta_{\ell}Q_{\min}T}+\frac{2\gamma_{2}\eta_{\ell}m^{2}\sigma_{l}^{2}LQ_{\max}^{2}}{\gamma_{1}^{2}Q_{\min}}+\frac{6\gamma_{2}m\eta_{\ell}^{2}\sigma^{2}L^{2}Q_{\max}^{3}}{\gamma_{1}Q_{\min}}(\tau_{\max}^{2}+1), \tag{1}$$

where $m$ is the number of clients, $w^{(t)}$ is the global model after $t$ global updates, $\gamma_{1}=1+\frac{m-1}{\mu^{\prime}}$, $\gamma_{2}=1+\mu^{\prime}(m-1)$, $F^{*}=F(w^{(0)})-F(w^{*})$, and $\sigma^{2}=\sigma_{l}^{2}+\sigma_{g}^{2}+M$.

Corollary 1.

When $Q_{\min}\leq Q_{\max}/\mu$ and $\eta_{\ell}=\mathcal{O}(1/Q_{\max}\sqrt{T})$ while satisfying $\eta_{\ell}\leq\frac{1}{2LQ_{\max}}$, then

$$\min_{t\in[T]}\ \mathbb{E}\|\nabla F(w^{(t)})\|^{2}\leq\mathcal{O}\Big(\frac{mF^{*}}{\sqrt{T}}\Big)+\mathcal{O}\Big(\frac{m\sigma_{l}^{2}L}{\sqrt{T}}\Big)+\mathcal{O}\Big(\frac{m\tau_{\max}^{2}\sigma^{2}L^{2}}{T}\Big). \tag{2}$$

The proof is provided in Appendix C. Based on Theorem 1 and Corollary 1, we make the following remarks on the algorithmic characteristics of FedCompass.

Worst-Case Convergence. Corollary 1 provides an upper bound for the expected squared norm of the gradients of the global loss, which is monotonically decreasing with respect to the total number of global updates $T$. This leads to the conclusion that FedCompass converges to a first-order stationary point in expectation, characterized by a zero gradient, of the smooth and non-convex global loss function $F(w)=\sum_{i=1}^{m}p_{i}F_{i}(w)$. Specifically, the first term of equation 2 represents the influence of model initialization, and the second term accounts for stochastic noise from local updates. The third term reflects the impact of device heterogeneity ($\tau_{\max}^{2}$) and data heterogeneity ($\sigma^{2}$) on the convergence process, which theoretically provides a convergence guarantee for FedCompass on non-IID heterogeneous distributed datasets and heterogeneous client devices.

Effect of Number of Local Steps. Theorem 1 establishes the requirement $\eta_{\ell}\leq\frac{1}{2LQ_{\max}}$, signifying a trade-off between the maximum number of client local steps $Q_{\max}$ and the client learning rate $\eta_{\ell}$. A higher $Q_{\max}$ necessitates a smaller $\eta_{\ell}$ to ensure the convergence of FL training. The value of $Q_{\min}$ also presents a trade-off: a smaller $Q_{\min}$ leads to a reduced $\mu^{\prime}$, causing an increase in $\gamma_{1}$ and a decrease in $\gamma_{2}$, which ultimately lowers the gradient bound. Appendix B shows that $Q_{\min}$ also influences the grouping of FedCompass. There are at most $\lceil\log_{\textit{Q}}\mu\rceil$ existing groups at the equilibrium state, so a smaller $Q_{\min}$ results in fewer groups and a smaller $\tau_{\max}$. However, since $Q_{\min}$ appears in the denominator in equation 1, a smaller $Q_{\min}$ could potentially increase the gradient bound. Corollary 1 represents an extreme case where the value of $Q_{\min}$ makes $\mu^{\prime}$ reach the minimum value of 1. We give empirical ablation study results on the effect of the number of local steps in Section 5.2 and Appendix G.1. The ablation study results show that the performance of FedCompass is not sensitive to the choice of the ratio $\textit{Q}=Q_{\max}/Q_{\min}$, and a wide range of $\textit{Q}$ still leads to better performance than other asynchronous FL algorithms such as FedBuff.
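As an illustrative instance of this grouping bound (the numbers below are hypothetical, not taken from the experiments), suppose the slowest client is $\mu=10$ times slower than the fastest. Then

$$\textit{Q}=Q_{\max}/Q_{\min}=2 \;\Rightarrow\; \lceil\log_{2}10\rceil=4 \text{ groups},\qquad \textit{Q}=5 \;\Rightarrow\; \lceil\log_{5}10\rceil=2 \text{ groups},\qquad \textit{Q}=10 \;\Rightarrow\; \lceil\log_{10}10\rceil=1 \text{ group},$$

so decreasing $Q_{\min}$ (i.e., increasing $\textit{Q}$) shrinks the number of equilibrium groups and hence $\tau_{\max}$, at the cost of the $Q_{\min}$ factor in the denominators of equation 1.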

5 Experiments

5.1 Experiment Setup

Datasets. To account for the inherent statistical heterogeneity of the datasets, we use two types of datasets: (i) artificially partitioned MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky et al., 2009) datasets, and (ii) two naturally split datasets from the cross-silo FL benchmark FLamby (Ogier du Terrail et al., 2022), Fed-IXI and Fed-ISIC2019. Artificially partitioned datasets are created by dividing classic datasets into client splits. We introduce class partition and dual Dirichlet partition strategies. The class partition makes each client hold data of only a few classes, and the dual Dirichlet partition employs Dirichlet distributions to model the class distribution within each client and the variation in sample sizes across clients. However, these methods might not accurately capture the intricate data heterogeneity in real federated datasets. Naturally split datasets, on the other hand, are composed of federated datasets obtained directly from multiple real-world clients, which retain the authentic characteristics and complexities of the underlying data distribution. The Fed-IXI dataset takes MRI images as inputs to generate 3D brain masks, and the Fed-ISIC2019 dataset takes dermoscopy images as inputs and aims to predict the corresponding melanoma classes. The details of the partition strategies and the FLamby datasets are elaborated in Appendix D.
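As a hedged illustration of how such a label-skew partition can be generated (a common recipe; the paper's exact dual Dirichlet procedure is detailed in Appendix D, so the function name and the concentration parameter below are assumptions):

import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with a per-class Dirichlet prior.
    A generic sketch of Dirichlet-based label-skew partitioning; the paper's
    dual Dirichlet strategy (Appendix D) additionally models client sample sizes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class-c samples assigned to each client
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

# Example: 10 non-IID client splits of MNIST-style labels (smaller alpha -> more skew)
# splits = dirichlet_partition(train_labels, num_clients=10, alpha=0.3)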

Client Heterogeneity Simulation. To simulate client heterogeneity, let $t_{i}$ be the average time for client $i$ to finish one local step, and assume that $t_{i}$ follows a random distribution. We use three different distributions in the experiments: a normal distribution $t_{i}\sim\mathcal{N}(\mu,\sigma^{2})$ with $\sigma=0$ (i.e., homogeneous clients), a normal distribution $t_{i}\sim\mathcal{N}(\mu,\sigma^{2})$ with $\sigma=0.3\mu$, and an exponential distribution $t_{i}\sim\textit{Exp}(\lambda)$. The client heterogeneity increases accordingly across these three settings. The mean values of the distributions ($\mu$ or $1/\lambda$) for different experiment datasets are given in Appendix E.1. Additionally, to account for the variance in training time within each client across different training rounds, we apply another normal distribution $t_{i}^{(k)}\sim\mathcal{N}(t_{i},(0.05t_{i})^{2})$ with a smaller variance, which simulates the local training time per step for client $i$ in training round $k$. The combination of distributions captures client heterogeneity in terms of both the average training time among clients and the variability in training time within each client across different rounds.
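The sampling scheme above can be written in a few lines; the sketch below is our illustration of it (the per-dataset distribution means are in Appendix E.1, so mean_step_time here is a placeholder):

import numpy as np

rng = np.random.default_rng(42)

def sample_client_step_times(num_clients, mean_step_time, mode="normal"):
    """Average per-step training time t_i for each client under the three
    heterogeneity settings described above."""
    if mode == "homo":        # sigma = 0: all clients identical
        return np.full(num_clients, mean_step_time)
    if mode == "normal":      # t_i ~ N(mu, (0.3 mu)^2), clipped to stay positive
        return np.abs(rng.normal(mean_step_time, 0.3 * mean_step_time, num_clients))
    if mode == "exp":         # t_i ~ Exp(lambda) with 1/lambda = mean_step_time
        return rng.exponential(mean_step_time, num_clients)
    raise ValueError(mode)

def per_round_step_time(t_i):
    """Round-specific step time t_i^(k) ~ N(t_i, (0.05 t_i)^2), clipped to stay positive."""
    return np.abs(rng.normal(t_i, 0.05 * t_i))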

Training Settings. We employ a convolutional neural network (CNN) for the partitioned MNIST dataset and a ResNet-18 (He et al., 2016) for the partitioned CIFAR-10 dataset. For both datasets, we conduct experiments with $m=\{5,10,20\}$ clients and the three client heterogeneity settings above, using the class partition and dual Dirichlet partition strategies. For the Fed-IXI and Fed-ISIC2019 datasets from FLamby, we use the default experiment settings provided by FLamby. We list the detailed model architectures and experiment hyperparameters in Appendix E.

5.2 Experiment Results

Table 1: Relative wall-clock time to first reach the target validation accuracy on different datasets for various FL algorithms with different numbers of clients and client heterogeneity settings.

                                          m=5                        m=10                       m=20
Dataset              Method          homo    normal  exp        homo    normal  exp        homo    normal  exp
MNIST-Class          FedAvg          1.35×   2.82×   4.32×      1.25×   2.10×   5.01×      1.44×   4.20×   9.03×
Partition (90%)      FedAvgM         1.27×   2.75×   4.23×      1.50×   2.51×   6.01×      2.20×   5.36×   13.73×
                     FedAsync        2.15×   2.34×   2.35×      -       2.17×   1.32×      -       4.44×   1.32×
                     FedBuff         1.30×   1.57×   1.81×      1.58×   1.39×   1.43×      1.37×   1.18×   1.39×
                     FedAT           1.36×   2.16×   3.07×      1.13×   1.77×   2.46×      0.97×   1.58×   4.32×
                     FedCompass      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×
                     FedCompass+M    1.10×   1.19×   0.81×      1.18×   0.99×   0.81×      1.10×   1.09×   1.08×
                     FedCompass+N    0.89×   1.24×   0.94×      1.44×   1.28×   0.92×      0.94×   1.01×   1.09×
MNIST-Dirichlet      FedAvg          0.91×   1.85×   4.90×      1.18×   2.44×   5.68×      1.14×   2.48×   5.94×
Partition (90%)      FedAvgM         0.91×   1.85×   4.89×      1.22×   2.45×   5.83×      1.21×   2.70×   6.12×
                     FedAsync        1.78×   1.68×   2.01×      -       2.28×   1.29×      -       -       1.76×
                     FedBuff         1.23×   1.52×   1.86×      1.81×   1.51×   1.49×      2.77×   1.22×   1.40×
                     FedAT           1.29×   1.97×   2.08×      1.08×   1.70×   2.44×      1.02×   1.83×   2.83×
                     FedCompass      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×
                     FedCompass+M    0.72×   0.92×   0.95×      0.93×   0.91×   0.88×      0.87×   1.31×   0.81×
                     FedCompass+N    0.92×   1.55×   1.26×      0.99×   1.12×   0.95×      0.93×   0.90×   1.08×
CIFAR10-Class        FedAvg          1.84×   2.64×   9.40×      3.18×   4.18×   11.51×     3.20×   3.24×   7.13×
Partition (60%)      FedAvgM         2.57×   3.73×   13.12×     2.19×   2.88×   9.04×      3.13×   3.19×   6.91×
                     FedAsync        -       3.18×   3.07×      -       -       5.18×      -       -       -
                     FedBuff         1.68×   2.18×   2.47×      -       -       2.62×      -       -       4.62×
                     FedAT           1.05×   1.42×   2.33×      1.21×   1.83×   3.21×      1.20×   1.23×   2.21×
                     FedCompass      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×
                     FedCompass+M    0.82×   0.82×   0.95×      0.89×   0.78×   0.96×      0.78×   0.81×   0.83×
                     FedCompass+N    0.98×   1.01×   1.18×      0.99×   1.22×   1.49×      1.32×   1.19×   1.16×
CIFAR10-Dirichlet    FedAvg          2.19×   2.81×   10.86×     4.55×   5.05×   11.47×     5.88×   5.83×   9.05×
Partition (50%)      FedAvgM         2.17×   2.97×   10.77×     3.77×   4.21×   11.06×     5.19×   5.44×   8.63×
                     FedAsync        -       3.18×   1.70×      -       -       5.35×      -       -       -
                     FedBuff         1.62×   1.99×   1.67×      -       -       2.42×      -       -       -
                     FedAT           1.04×   1.69×   2.90×      1.36×   1.66×   2.30×      1.45×   1.58×   2.21×
                     FedCompass      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×      1.00×   1.00×   1.00×
                     FedCompass+M    0.82×   0.80×   1.03×      1.07×   0.85×   0.94×      0.86×   0.69×   0.98×
                     FedCompass+N    1.05×   0.98×   1.65×      0.89×   1.29×   1.45×      0.77×   0.87×   1.23×
Table 2: Relative wall-clock time to first reach the target validation accuracy on the Fed-IXI and Fed-ISIC2019 datasets for various FL algorithms in different client heterogeneity settings.

                     Fed-IXI (0.98)                 Fed-ISIC2019 (50%)
Method           homo    normal  exp            homo    normal  exp
FedAvg           1.18×   1.80×   2.20×          1.35×   2.00×   1.90×
FedAvgM          1.14×   1.83×   2.20×          2.74×   2.74×   4.65×
FedAsync         1.40×   1.31×   1.45×          1.50×   1.75×   1.69×
FedBuff          1.20×   1.25×   1.38×          2.59×   3.76×   1.03×
FedAT            1.07×   1.30×   1.35×          1.08×   1.65×   2.64×
FedCompass       1.00×   1.00×   1.00×          1.00×   1.00×   1.00×
FedCompass+M     1.00×   0.98×   1.02×          2.22×   2.02×   2.03×
FedCompass+N     0.98×   0.97×   1.03×          1.04×   0.84×   0.97×
Figure 2: Change in validation accuracy and standard deviation for different FL algorithms on the dual Dirichlet partitioned MNIST dataset and the class partitioned CIFAR-10 dataset with five clients and three client heterogeneity settings. (Synchronous algorithms take the same amount of time, and asynchronous algorithms take the same amount of time in the same experiment setting.)

Performance Comparison. Table 1 shows the relative wall-clock time for various FL algorithms to first reach the target validation accuracy on the partitioned MNIST and CIFAR-10 datasets with different numbers of clients and client heterogeneity settings. Table 2 reports the relative wall-clock time to reach 0.98 DICE accuracy (Dice, 1945) on the Fed-IXI dataset and 50% balanced accuracy on the Fed-ISIC2019 dataset in three heterogeneity settings. The numbers are the average results of ten runs with different random seeds, and the wall-clock time for FedCompass is used as a baseline. A dash “-” signifies that a certain algorithm cannot reach the target accuracy in at least half of the experiment runs. The dataset target accuracy is carefully selected based on each algorithm’s top achievable accuracy during training (given in Appendix F). Among the benchmarked algorithms, FedAvg (McMahan et al., 2017) is the most commonly used synchronous FL algorithm, and FedAvgM (Hsu et al., 2019) is a variant of FedAvg using server-side momentum to address non-IID data. FedAsync (Xie et al., 2019) and FedBuff (Nguyen et al., 2022) are two popular asynchronous FL algorithms, and FedAT (Chai et al., 2021) is a tiered FL algorithm. Since the Compass scheduler is separated from global aggregation, making FedCompass compatible with server-side optimization strategies, we also benchmark the performance of using server-side momentum and normalized averaging (Wang et al., 2020) in FedCompass (FedCompass+M and FedCompass+N). From the table we obtain the following key observations: (i) The speedup provided by asynchronous FL algorithms increases as client heterogeneity increases. (ii) FedAT converges slower as client heterogeneity increases due to non-negligible inner-tier waiting time, while FedCompass minimizes inner-group waiting time by dynamic local step assignment and the usage of the latest arrival time. (iii) Similar to synchronous FL where server-side optimizations can sometimes speed up the convergence, applying server momentum or normalized averaging to FedCompass in group aggregation can sometimes speed up the convergence as well. This indicates that the design of FedCompass allows it to easily take advantage of other FL strategies. (iv) Because of client drift and model staleness, FedAsync and FedBuff occasionally fail to achieve the target accuracy, particularly in settings with minimal client heterogeneity and unnecessary application of the staleness factor. On the other hand, FedCompass can quickly converge in all cases, as Compass dynamically and automatically groups clients for global aggregation, reducing client model staleness and mitigating the client drift problem. This shows that FedCompass outperforms other asynchronous algorithms on heterogeneous non-IID data without any prior knowledge of clients. Additionally, Figure 2 illustrates the change in validation accuracy during the training for different FL algorithms on the dual Dirichlet partitioned MNIST dataset (subplots a to c) and the class partitioned CIFAR-10 dataset (subplots d to f) with five clients and three client heterogeneity settings. In the plots, synchronous algorithms take the same amount of time, and asynchronous algorithms take the same amount of time in the same experiment setting. These plots further illustrate how FedCompass converges faster than other algorithms and also achieves higher accuracy than other asynchronous algorithms. 
For example, subplots d and e explicitly demonstrate that FedAsync and FedBuff only converge to much lower accuracy in low-heterogeneity settings, mainly because of client drift from unnecessary staleness factors. More plots of the training process and the maximum achievable accuracy for each FL algorithm are given in Appendix F.

Effect of Number of Local Steps. We conduct ablation studies on the value of $1/\textit{Q}=Q_{\min}/Q_{\max}$ to investigate its impact on FedCompass's performance, with results listed in Table 3. From the table, we observe that smaller $1/\textit{Q}$ values (ranging from 0.05 to 0.2) usually lead to a faster convergence speed. Additionally, a wide range of $1/\textit{Q}$ (from 0.05 to 0.8) achieves a reasonable convergence rate compared with FedBuff or FedAT, which showcases that FedCompass is not sensitive to the value of $\textit{Q}$. Therefore, users do not need to heuristically select $\textit{Q}$ for different client numbers and heterogeneity settings. In the extreme case that $\textit{Q}=1$, i.e., $Q_{\min}=Q_{\max}$, there is a significant performance drop for FedCompass, as Compass loses the flexibility to assign different local steps to different clients. In that case, each client has its own arrival group, and FedCompass becomes equivalent to FedAsync. More ablation study results are given in Appendix G.1.

Table 3: Relative wall-clock time for FedCompass to first reach the target validation accuracy on the dual Dirichlet partitioned MNIST/CIFAR10 datasets (target accuracy 90%/50%) for various values of $1/\textit{Q}=Q_{\min}/Q_{\max}$. Each cell reports the MNIST/CIFAR10 results.

                        m=5                                     m=10                                    m=20
1/Q         homo        normal      exp          homo        normal      exp          homo        normal      exp
0.05        0.89/1.27×  1.01/0.97×  0.85/0.76×   0.96/1.05×  0.97/0.98×  0.82/1.19×   1.00/1.05×  1.02/0.97×  0.96/0.90×
0.10        0.91/1.15×  1.05/1.01×  0.92/0.73×   0.96/0.98×  0.95/1.11×  0.85/1.27×   0.97/1.01×  1.02/1.04×  0.96/1.01×
0.20        1.00/1.00×  1.00/1.00×  1.00/1.00×   1.00/1.00×  1.00/1.00×  1.00/1.00×   1.00/1.00×  1.00/1.00×  1.00/1.00×
0.40        1.02/1.28×  1.03/1.06×  1.70/1.36×   1.04/1.02×  1.05/1.06×  1.28/2.19×   1.00/0.96×  1.14/1.14×  1.25/1.08×
0.60        1.04/1.23×  1.01/0.99×  1.83/1.65×   1.00/0.97×  1.01/1.61×  1.19/2.52×   0.92/1.11×  1.11/1.06×  1.16/1.74×
0.80        1.04/1.23×  1.13/1.24×  1.58/1.64×   0.91/1.32×  1.25/1.91×  1.22/3.44×   0.86/1.33×  1.22/2.95×  1.21/1.64×
1.00        1.78×/-     1.68/3.18×  2.01/1.70×   -/-         2.28×/-     1.29/5.35×   -/-         -/-         1.76×/-
FedBuff     1.23/1.62×  1.52/1.99×  1.86/1.67×   1.81×/-     1.51×/-     1.49/2.42×   2.77×/-     1.22×/-     1.40×/-
FedAT       1.29/1.04×  1.97/1.69×  2.08/2.90×   1.08/1.36×  1.70/1.66×  2.44/2.30×   1.02/1.45×  1.83/1.58×  2.83/2.21×

6 Conclusion

This paper introduces FedCompass, which employs a novel computing power-aware scheduler to make several client models arrive almost simultaneously for group aggregation, thereby reducing the global aggregation frequency and mitigating the stale local model issue in asynchronous FL. FedCompass thus enhances FL training efficiency with heterogeneous client devices and improves performance on non-IID heterogeneous distributed datasets. Through a comprehensive suite of experiments, we have demonstrated that FedCompass achieves faster convergence than other prevalent FL algorithms in various training tasks with heterogeneous training data and clients. This approach opens up new pathways for the development and use of robust AI models in settings that require privacy, with the additional advantage that it can readily leverage the existing disparate computing ecosystem.

Reproducibility Statement

The source code and instructions to reproduce all the experiment results can be found in the supplemental materials. The details of the classic dataset partition strategies are provided in Appendix D, and the corresponding implementations are included in the source code. The setups and hyperparameters used in the experiments are given in Appendix E. Additionally, the proofs of Theorem 1 and Corollary 1 are provided in Appendix C.

Acknowledgments

This work was supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This research is also part of the Delta research computing project, which is supported by the National Science Foundation (award OCI 2005572), and the State of Illinois. Delta is a joint effort of the University of Illinois at Urbana-Champaign and the National Center for Supercomputing Applications. E.A.H. was partially supported by National Science Foundation awards OAC-1931561 and OAC-2209892.

References

  • Bonawitz et al. (2019) Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečnỳ, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems, 1:374–388, 2019.
  • Chai et al. (2021) Zheng Chai, Yujing Chen, Ali Anwar, Liang Zhao, Yue Cheng, and Huzefa Rangwala. FedAT: A high-performance and communication-efficient federated learning system with asynchronous tiers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp.  1–16, 2021.
  • Chen et al. (2019) Yang Chen, Xiaoyan Sun, and Yaochu Jin. Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation. IEEE transactions on neural networks and learning systems, 31(10):4229–4238, 2019.
  • Chen et al. (2020) Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. Asynchronous online federated learning for edge devices with non-iid data. In 2020 IEEE International Conference on Big Data (Big Data), pp.  15–24. IEEE, 2020.
  • Chen et al. (2021) Zheyi Chen, Weixian Liao, Kun Hua, Chao Lu, and Wei Yu. Towards asynchronous federated learning for heterogeneous edge-powered internet of things. Digital Communications and Networks, 7(3):317–326, 2021.
  • Çiçek et al. (2016) Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pp.  424–432. Springer, 2016.
  • Codella et al. (2018) Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pp.  168–172. IEEE, 2018.
  • Combalia et al. (2019) Marc Combalia, Noel CF Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Allan C Halpern, Susana Puig, et al. Bcn20000: Dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288, 2019.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on Computer Vision and Pattern Recognition, pp.  248–255. Ieee, 2009.
  • Dice (1945) Lee R Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
  • Hao et al. (2020) Jiangshan Hao, Yanchao Zhao, and Jiale Zhang. Time efficient federated learning with semi-asynchronous communication. In 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS), pp.  156–163. IEEE, 2020.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.  770–778, 2016.
  • Hoang et al. (2023) Trung-Hieu Hoang, Jordan Fuhrman, Ravi Madduri, Miao Li, Pranshu Chaturvedi, Zilinghan Li, Kibaek Kim, Minseok Ryu, Ryan Chard, EA Huerta, et al. Enabling end-to-end secure federated learning in biomedical research on heterogeneous computing environments with appflx. arXiv preprint arXiv:2312.08701, 2023.
  • Horváth et al. (2022) Samuel Horváth, Maziar Sanjabi, Lin Xiao, Peter Richtárik, and Michael Rabbat. Fedshuffle: Recipes for better use of local work in federated learning. arXiv preprint arXiv:2204.13169, 2022.
  • Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • Imteaj & Amini (2020) Ahmed Imteaj and M Hadi Amini. Fedar: Activity and resource-aware federated learning model for distributed mobile robots. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pp.  1153–1160. IEEE, 2020.
  • Kairouz et al. (2021) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • Kaissis et al. (2021) Georgios Kaissis, Alexander Ziller, Jonathan Passerat-Palmbach, Théo Ryffel, Dmitrii Usynin, Andrew Trask, Ionésio Lima Jr, Jason Mancuso, Friederike Jungmann, Marc-Matthias Steinborn, et al. End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nature Machine Intelligence, 3(6):473–484, 2021.
  • Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp.  5132–5143. PMLR, 2020.
  • Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Li et al. (2020) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
  • Li et al. (2019) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.
  • Li et al. (2023) Zilinghan Li, Shilan He, Pranshu Chaturvedi, Trung-Hieu Hoang, Minseok Ryu, EA Huerta, Volodymyr Kindratenko, Jordan Fuhrman, Maryellen Giger, Ryan Chard, et al. APPFLx: Providing privacy-preserving cross-silo federated learning as a service. arXiv preprint arXiv:2308.08786, 2023.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Ma et al. (2022) Xiaodong Ma, Jia Zhu, Zhihao Lin, Shanxuan Chen, and Yangjie Qin. A state-of-the-art survey on solving non-iid data in federated learning. Future Generation Computer Systems, 135:244–258, 2022.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.  1273–1282. PMLR, 2017.
  • Mills et al. (2019) Jed Mills, Jia Hu, and Geyong Min. Communication-efficient federated learning for wireless edge intelligence in IoT. IEEE Internet of Things Journal, 7(7):5986–5994, 2019.
  • Nguyen et al. (2022) John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pp.  3581–3607. PMLR, 2022.
  • Nishio & Yonetani (2019) Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019-2019 IEEE international Conference on Communications (ICC), pp.  1–7. IEEE, 2019.
  • Ogier du Terrail et al. (2022) Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He, Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, et al. Flamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings. Advances in Neural Information Processing Systems, 35:5315–5334, 2022.
  • Pati et al. (2022) Sarthak Pati, Ujjwal Baid, Brandon Edwards, Micah Sheller, Shih-Han Wang, G Anthony Reina, Patrick Foley, Alexey Gruzdev, Deepthi Karkada, Christos Davatzikos, et al. Federated learning enables big data for rare cancer boundary detection. Nature Communications, 13(1):7346, 2022.
  • Pérez-García et al. (2021) Fernando Pérez-García, Rachel Sparks, and Sébastien Ourselin. TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Computer Methods and Programs in Biomedicine, 208:106236, 2021.
  • Reddi et al. (2020) Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • Reisizadeh et al. (2022) Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, and Ramtin Pedarsani. Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity. IEEE Journal on Selected Areas in Information Theory, 3(2):197–205, 2022.
  • Ryu et al. (2022) Minseok Ryu, Youngdae Kim, Kibaek Kim, and Ravi K Madduri. APPFL: open-source software framework for privacy-preserving federated learning. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.  1074–1083. IEEE, 2022.
  • So et al. (2021) Jinhyun So, Ramy E Ali, Başak Güler, and A Salman Avestimehr. Secure aggregation for buffered asynchronous federated learning. arXiv preprint arXiv:2110.02177, 2021.
  • Stich (2018) Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
  • Tan & Le (2019) Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on Machine Learning, pp.  6105–6114. PMLR, 2019.
  • Tschandl et al. (2018) Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5(1):1–9, 2018.
  • Wang (2019) Guan Wang. Interpret federated learning with shapley values. arXiv preprint arXiv:1905.04519, 2019.
  • Wang et al. (2019) Guan Wang, Charlie Xiaoqian Dang, and Ziye Zhou. Measure contribution of participants in federated learning. In 2019 IEEE international conference on Big Data (Big Data), pp.  2597–2604. IEEE, 2019.
  • Wang et al. (2020) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.
  • Wang et al. (2021) Qizhao Wang, Qing Li, Kai Wang, Hong Wang, and Peng Zeng. Efficient federated learning for fault diagnosis in industrial cloud-edge computing. Computing, 103(10):2319–2337, 2021.
  • Wu et al. (2020) Wentai Wu, Ligang He, Weiwei Lin, Rui Mao, Carsten Maple, and Stephen Jarvis. SAFA: A semi-asynchronous protocol for fast federated learning with low overhead. IEEE Transactions on Computers, 70(5):655–668, 2020.
  • Xie et al. (2019) Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019.
  • Xu et al. (2021) Chenhao Xu, Youyang Qu, Yong Xiang, and Longxiang Gao. Asynchronous federated learning on heterogeneous devices: A survey. arXiv preprint arXiv:2109.04269, 2021.
  • Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
  • Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  5693–5700, 2019.
  • Yurochkin et al. (2019) Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International conference on machine learning, pp.  7252–7261. PMLR, 2019.
  • Zhang et al. (2021) Yu Zhang, Moming Duan, Duo Liu, Li Li, Ao Ren, Xianzhang Chen, Yujuan Tan, and Chengliang Wang. CSAFL: A clustered semi-asynchronous federated learning framework. In 2021 International Joint Conference on Neural Networks (IJCNN), pp.  1–10. IEEE, 2021.

Appendix A Detailed Implementation of FedCompass

Table 4 lists all frequently used notations and their corresponding explanations in the algorithm description of FedCompass.

Table 4: Summary of notations used in algorithm description of FedCompass.
Notation Explanation
𝒞\mathcal{C} client information dictionary
𝒢\mathcal{G} arrival group information dictionary
GG index of the arrival group for a client
QQ number of local steps for a client
SS estimated computing speed for a client
TstartT_{\text{start}} start time of current local training round for a client
CLCL list of non-arrived clients for an arrival group
ACLACL list of arrived clients for an arrival group
TaT_{a} expected arrival time for an arrival group
Tmax{T}_{\max} latest arrival time for an arrival group
TnowT_{\text{now}} current time
QmaxQ_{\max} maximum number of local steps for clients
QminQ_{\min} minimum number of local steps for clients
λ\lambda latest time factor
η\eta_{\ell} client learning rate
NN total number of communication rounds
BB batch size for local training
st()st(\cdot) staleness function
τ\tau timestamp of the global model that a client trained on
τg\tau_{g} timestamp of the global model
pip_{i} importance weight of client ii
Δi\Delta_{i} the difference between the original model and the trained model of client ii
Δ¯g\overline{\Delta}^{g} client update buffer for group gg
Δ¯\overline{\Delta} general client update buffer
ww global model
wi,qw_{i,q} local model of client ii after qq local steps
1
Function AssignGroup(ii): Input : Client id ii
2 if JoinGroup(i)=False\textsc{JoinGroup}(i)=\texttt{False} then
3       CreateGroup(i)\textsc{CreateGroup}(i);
4      
5
6
Function JoinGroup(ii): Input : Client id ii
7 Initialization: Gassign,Qassign:=1,1G_{\text{assign}},Q_{\text{assign}}:=-1,-1;
8 for g𝒢g\in\mathcal{G} do
9       q:=(𝒢[g].TaTnow)/𝒞[i].Sq:=\lfloor(\mathcal{G}\texttt{[}g\texttt{]}.T_{a}-T_{\text{now}})/\mathcal{C}\texttt{[}i\texttt{]}.S\rfloor;
10       if qQminq\geq Q_{\min} and qQassignq\geq Q_{\text{assign}} and qQmaxq\leq Q_{\max} then
11             Gassign,Qassign:=g,qG_{\text{assign}},Q_{\text{assign}}:=g,q;
12            
13      
14
15if Gassign1G_{\text{assign}}\neq-1 then
16       𝒞[i].(G,Q,Tstart):=(Gassign,Qassign,Tnow)\mathcal{C}\texttt{[}i\texttt{]}.(G,Q,T_{\text{start}}):=(G_{\text{assign}},Q_{\text{assign}},T_{\text{now}});
17       𝒢[g].CL.add(i)\mathcal{G}\texttt{[}g\texttt{]}.CL.\texttt{add(}i\texttt{)};
18       return True;
19      
20else
21       return False
22
23
Function CreateGroup(ii): Input : Client id ii
24 Initialization: Qassign:=1Q_{\text{assign}}:=-1;
25 for g𝒢g\in\mathcal{G} do
26       if Tnow<𝒢[g].TaT_{\text{now}}<\mathcal{G}\texttt{[}g\texttt{]}.T_{a} then
27             S:=speed of the fastest client in group gS:=\text{speed of the fastest client in group }g;
28             Texp:=𝒢[g].Ta+SQmaxT_{\text{exp}}:=\mathcal{G}\texttt{[}g\texttt{]}.T_{a}+S*Q_{\max};
29             Qassign:=max(Qassign,(TexpTnow)/𝒞[i].S)Q_{\text{assign}}:=\max(Q_{\text{assign}},\lfloor(T_{\text{exp}}-T_{\text{now}})/\mathcal{C}\texttt{[}i\texttt{]}.S\rfloor);
30            
31      
32
33if Qassign0Q_{\text{assign}}\geq 0 and Qassign<QminQ_{\text{assign}}<Q_{\min} then
34       Qassign:=QminQ_{\text{assign}}:=Q_{\min};
35      
36else
37       if Qassign<0orQassign>QmaxQ_{\text{assign}}<0\ \texttt{or}\ Q_{\text{assign}}>Q_{\max} then
38             Qassign:=QmaxQ_{\text{assign}}:=Q_{\max};
39            
40      
41
42Gnew:=a unique id for the new groupG_{\text{new}}:=\text{a unique id for the new group};
43 𝒢[Gnew].(CL,ACL):=([i],[])\mathcal{G}\texttt{[}G_{\text{new}}\texttt{]}.(CL,ACL):=(\texttt{[}i\texttt{]},\texttt{[]});
44 𝒢[Gnew].Ta:=Tnow+Qassign𝒞[i].S\mathcal{G}\texttt{[}G_{\text{new}}\texttt{]}.T_{a}:=T_{\text{now}}+Q_{\text{assign}}*\mathcal{C}\texttt{[}i\texttt{]}.S;
45 𝒢[Gnew].Tmax:=Tnow+Qassign𝒞[i].Sλ\mathcal{G}\texttt{[}G_{\text{new}}\texttt{]}.{T}_{\max}:=T_{\text{now}}+Q_{\text{assign}}*\mathcal{C}\texttt{[}i\texttt{]}.S*\lambda;
46 𝒞[i].(G,Q,Tstart):=(Gnew,Qassign,Tnow)\mathcal{C}\texttt{[}i\texttt{]}.(G,Q,T_{\text{start}}):=(G_{\text{new}},Q_{\text{assign}},T_{\text{now}});
47 Add a timer event at 𝒢[Gnew].Tmax\mathcal{G}\texttt{[}G_{\text{new}}\texttt{]}.{T}_{\max} to invoke GroupAggregate(GnewG_{\text{new}});
Algorithm 2 AssignGroup
1 Function GroupAggregate(gg):
Input: group index gg
2 if g𝒢g\in\mathcal{G} then
3       w:=wΔ¯gΔ¯w:=w-\overline{\Delta}_{g}-\overline{\Delta};  Δ¯:=𝟎\overline{\Delta}:=\mathbf{0};  τg:=τg+1\tau_{g}:=\tau_{g}+1;
4       sort 𝒢[g].ACL\mathcal{G}\texttt{[}g\texttt{]}.ACL from fastest client to slowest client;
5       for i𝒢[g].ACLi\in\mathcal{G}\texttt{[}g\texttt{]}.ACL do
6             𝒞[i].τ:=τg\mathcal{C}\texttt{[}i\texttt{]}.\tau:=\tau_{g};
7             AssignGroup(ii) ;
8             FedCompass-Client(ii, ww, 𝒞[i].Q\mathcal{C}\texttt{[}i\texttt{]}.Q, η\eta_{\ell}, BB);
9            
10      
Algorithm 3 GroupAggregate Algorithm

Compass utilizes the concept of an arrival group to determine which clients are expected to arrive together. To manage the client arrival groups, Compass employs two dictionaries: 𝒞\mathcal{C} for client information and 𝒢\mathcal{G} for arrival group information. Specifically, for client ii, 𝒞[i].G\mathcal{C}\texttt{[}i\texttt{]}.G, 𝒞[i].Q\mathcal{C}\texttt{[}i\texttt{]}.Q, 𝒞[i].S\mathcal{C}\texttt{[}i\texttt{]}.S, and 𝒞[i].Tstart\mathcal{C}\texttt{[}i\texttt{]}.T_{\text{start}} represent the arrival group, the number of local steps, the estimated computing speed (time per step), and the start time of the current communication round of client ii, and 𝒞[i].τ\mathcal{C}\texttt{[}i\texttt{]}.\tau represents the timestamp of the global model upon which client ii trained (i.e., the client local model timestamp). For arrival group gg, 𝒢[g].CL\mathcal{G}\texttt{[}g\texttt{]}.CL, 𝒢[g].ACL\mathcal{G}\texttt{[}g\texttt{]}.ACL, 𝒢[g].Ta\mathcal{G}\texttt{[}g\texttt{]}.T_{a}, and 𝒢[g].Tmax\mathcal{G}\texttt{[}g\texttt{]}.{T}_{\max} represent the list of non-arrived clients, the list of arrived clients, the expected arrival time, and the latest arrival time for clients in the group. According to the client speed estimation, all clients in group gg are expected to finish local training and send their trained local updates back to the server before 𝒢[g].Ta\mathcal{G}\texttt{[}g\texttt{]}.T_{a}. However, to account for potential errors in speed estimation and variations in computing speed, the group can wait until the latest arrival time 𝒢[g].Tmax\mathcal{G}\texttt{[}g\texttt{]}.{T}_{\max} for the clients to send their trained models back. The difference between 𝒢[g].Tmax\mathcal{G}\texttt{[}g\texttt{]}.{T}_{\max} and the group creation time is λ\lambda times as long as the difference between 𝒢[g].Ta\mathcal{G}\texttt{[}g\texttt{]}.T_{a} and the group creation time.
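As an illustration of this bookkeeping, the server-side state could be laid out as plain dictionaries like the following sketch; the field names and numbers are made up for this example and are not the reference implementation.

```python
# Illustrative in-memory layout of the two scheduler dictionaries (names assumed).
clients = {
    0: {"G": 2,             # arrival group id
        "Q": 80,            # assigned number of local steps
        "S": 0.15,          # estimated seconds per local step
        "T_start": 1042.7,  # start time of the current local training round
        "tau": 5},          # timestamp of the global model the client trains on
}
groups = {
    2: {"CL": [0, 3],       # clients expected but not yet arrived
        "ACL": [7],         # clients that have already arrived
        "T_a": 1054.7,      # expected arrival time of the group
        "T_max": 1057.1},   # latest arrival time (T_a stretched by the factor lambda = 1.2)
}
```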

Algorithm 2 presents the process through which Compass assigns a client to an arrival group. The dictionaries 𝒞\mathcal{C} and 𝒢\mathcal{G}, along with the constant parameters QmaxQ_{\max}, QminQ_{\min}, and λ\lambda, are assumed to be available implicitly and are not explicitly passed as algorithm inputs. When Compass assigns a client to a group, it first checks whether the client can join an existing arrival group (function JoinGroup, lines 5 to 16). For each group, Compass uses the current time, the group's expected arrival time, and the estimated client speed to calculate a tentative number of local steps qq (line 8). To avoid a large disparity in the number of local steps, Compass uses two parameters, QmaxQ_{\max} and QminQ_{\min}, to specify a valid range of local steps. If Compass can find the largest qq within the range [Qmin,Qmax][Q_{\min},Q_{\max}], it assigns the client to the corresponding group. Otherwise, Compass creates a new arrival group for the client. When creating this new group, Compass aims to ensure that a specific existing arrival group can join the newly created group for group merging once all of its clients have arrived, as described in function CreateGroup (lines 18 to 35). To achieve this, Compass first identifies the computing speed SS of the fastest client in group gg. It assumes that all clients in the group will arrive at the expected time 𝒢[g].Ta\mathcal{G}\texttt{[}g\texttt{]}.T_{a} and that the fastest client will perform QmaxQ_{\max} local steps in the next round. Compass then uses the expected arrival time in the next round (line 23) to determine the appropriate number of local steps (line 24), selecting the largest number of local steps that lies within the range [Qmin,Qmax][Q_{\min},Q_{\max}]. If no choice of local steps makes any existing group suitable for joining in the next round, Compass assigns QmaxQ_{\max} local steps to the client upon creating the group. Whenever a new group is created, Compass sets up a timer event to invoke the GroupAggregate function (Algorithm 3), which aggregates the group of client updates once the group's latest arrival time 𝒢[g].Tmax\mathcal{G}\texttt{[}g\texttt{]}.T_{\max} has passed. In GroupAggregate, the server updates the global model using the group-specific buffer Δ¯g\overline{\Delta}^{g}, which contains updates from clients in arrival group gg, and the general buffer Δ¯\overline{\Delta}, which contains updates from clients of other groups that arrive late. After the global update, the server updates the global model timestamp and assigns new training tasks to all arrived clients in group gg.
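To make the scheduling logic concrete, below is a minimal Python sketch of the group-assignment step described above. The data layout and names (groups, T_a, T_max, q_min, q_max, latest_time_factor) are illustrative assumptions, and the sketch intentionally omits the group-merging logic of CreateGroup; it is not the reference implementation.

```python
import math

def tentative_steps(expected_arrival, now, speed, q_min, q_max):
    """Largest number of local steps a client running at `speed` seconds per step
    could finish before `expected_arrival`; None if it falls outside [q_min, q_max]."""
    q = math.floor((expected_arrival - now) / speed)
    return q if q_min <= q <= q_max else None

def assign_group(client_speed, groups, now, q_min, q_max, latest_time_factor):
    """Return (group_id, local_steps): join the existing arrival group that permits
    the most local steps, or create a new group when no existing group is feasible."""
    best_group, best_q = None, -1
    for gid, group in groups.items():
        q = tentative_steps(group["T_a"], now, client_speed, q_min, q_max)
        if q is not None and q > best_q:
            best_group, best_q = gid, q
    if best_group is not None:
        return best_group, best_q
    # No feasible group: create a new one. The full CreateGroup routine would also
    # try to choose the step count so that an existing group can merge into this
    # group in the next round; this sketch simply falls back to q_max.
    q = q_max
    gid = max(groups, default=-1) + 1
    groups[gid] = {
        "T_a": now + q * client_speed,                         # expected arrival time
        "T_max": now + q * client_speed * latest_time_factor,  # latest arrival time
    }
    return gid, q
```

For example, a client estimated at 0.2 seconds per step with q_min = 20 and q_max = 100 would join a group whose expected arrival time is 10 seconds away and be assigned 50 local steps.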

Appendix B Upper Bound on the Number of Groups

If the client speeds do not have large variations for a sufficient number of global updates, then FedCompass can reach an equilibrium state where the arrival group assignment for clients is fixed and the total number of existing groups (denoted by #g\#_{g}) is bounded by logQμ\lceil\log_{\textit{Q}}\mu\rceil, where Q=Qmax/Qmin\textit{Q}=Q_{\max}/Q_{\min} and μ\mu is the ratio of the time per local step between the slowest client and the fastest client.

Consider one execution of FedCompass with a list of sorted client local computing times {𝒯1,𝒯2,,𝒯m}\{\mathcal{T}_{1},\mathcal{T}_{2},\cdots,\mathcal{T}_{m}\}, where 𝒯i\mathcal{T}_{i} represents the time for client ii to complete one local step and 𝒯i𝒯j\mathcal{T}_{i}\leq\mathcal{T}_{j} if i<ji<j. Let 𝒢1={i|𝒯i𝒯1QmaxQmin}={1,,g1}\mathcal{G}_{1}=\{i\ |\ \frac{\mathcal{T}_{i}}{\mathcal{T}_{1}}\leq\frac{Q_{\max}}{Q_{\min}}\}=\{1,\cdots,g_{1}\} be the first group of clients. Then all clients i𝒢1i\in\mathcal{G}_{1} will be assigned to the same arrival group at the equilibrium state with Qi=𝒯1𝒯iQmax[Qmin,Qmax]Q_{i}=\frac{\mathcal{T}_{1}}{\mathcal{T}_{i}}Q_{\max}\in[Q_{\min},Q_{\max}]. Similarly, if g1<mg_{1}<m, we can obtain the second group of clients 𝒢2={i|𝒯i𝒯g1+1QmaxQmin}={g1+1,,g2}\mathcal{G}_{2}=\{i\ |\ \frac{\mathcal{T}_{i}}{\mathcal{T}_{g_{1}+1}}\leq\frac{Q_{\max}}{Q_{\min}}\}=\{g_{1}+1,\cdots,g_{2}\}, where the clients in 𝒢2\mathcal{G}_{2} will be assigned to another arrival group at the equilibrium state. Therefore, the maximum number of existing arrival groups at the equilibrium state is equal to the smallest integer such that

(QmaxQmin)#g=Q#g𝒯m𝒯1=μ#glogQμ.\Big{(}\frac{Q_{\max}}{Q_{\min}}\Big{)}^{\#_{g}}=\textit{Q}^{\#_{g}}\leq\frac{\mathcal{T}_{m}}{\mathcal{T}_{1}}=\mu\implies\#_{g}\leq\lceil\log_{\textit{Q}}\mu\rceil. (3)

Additionally, this implies that the maximum ratio of time for two different clients to finish one communication round at the equilibrium state is bounded by μ=QlogQμ{\mu}^{\prime}=\textit{Q}^{\lfloor\log_{\textit{Q}}\mu\rfloor}. To see this, note that the maximum ratio of time for two different clients to finish one communication round is the same as that for two different arrival groups. In the extreme case where there are the maximum logQμ\lceil\log_{\textit{Q}}\mu\rceil arrival groups, the time ratio between the slowest group and the fastest group is bounded by QlogQμ1=QlogQμ\textit{Q}^{\lceil\log_{\textit{Q}}\mu\rceil-1}=\textit{Q}^{\lfloor\log_{\textit{Q}}\mu\rfloor}.
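As a quick numerical illustration of these two bounds (with made-up numbers), the snippet below evaluates them directly:

```python
import math

Q_max, Q_min = 100, 20   # so Q = Q_max / Q_min = 5
mu = 12                  # slowest client takes 12x longer per step than the fastest
Q = Q_max / Q_min

max_groups = math.ceil(math.log(mu, Q))              # ceil(log_5 12) = 2 arrival groups
max_round_ratio = Q ** math.floor(math.log(mu, Q))   # 5^1 = 5.0

print(max_groups, max_round_ratio)  # -> 2 5.0
```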

Appendix C FedCompass Convergence Analysis

C.1 List of Notations

Table 5 lists frequently used notations in the convergence analysis of FedCompass.

Table 5: Frequently used notations in the convergence analysis of FedCompass.
Notation Explanation
mm total number of clients
η\eta_{\ell} client local learning rate
LL Lipschitz constant
μ\mu ratio of the time per local step between the slowest client and the fastest client
σl\sigma_{l} bound for local gradient variance
σg\sigma_{g} bound for global gradient variance
MM bound for gradient
QminQ_{\min} minimum number of client local steps in each training round
QmaxQ_{\max} maximum number of client local steps in each training round
Q ratio between QmaxQ_{\max} and QminQ_{\min}, i.e., Qmax/QminQ_{\max}/Q_{\min}
Qi,tQ_{i,t} number of local steps for client ii in round tt
w(t)w^{(t)} global model after tt global updates
wi,q(tτi,t)w_{i,q}^{(t-\tau_{i,t})} local model of client ii after qq local steps, where the original model has timestamp tτi,tt-\tau_{i,t}
i,q\mathcal{B}_{i,q} a random training batch that client ii uses in the qq-th local step
F(w)F(w) global loss for model ww
Fi(w)F_{i}(w) client ii’s local loss for model ww
fi(w,i)f_{i}(w,\mathcal{B}_{i}) loss for model ww evaluated on batch i\mathcal{B}_{i} from client ii
F(w)\nabla F(w) gradient of global loss w.r.t. ww
Fi(w)\nabla F_{i}(w) gradient of client ii’s local loss w.r.t. ww
fi(w,i)\nabla f_{i}(w,\mathcal{B}_{i}) gradient of loss evaluated on batch i\mathcal{B}_{i} w.r.t. ww
𝒮t\mathcal{S}_{t} set of arriving clients at timestamp tt
Δi(tτi,t)\Delta_{i}^{(t-\tau_{i,t})} update from client ii at timestamp tt trained on global model with staleness τi,t\tau_{i,t}
Δ¯(t)\overline{\Delta}^{(t)} aggregated local updates from clients for updating the global model at timestamp tt

C.2 Proof of Theorem 1

Lemma 1.

𝔼[fi(w,i)2]3(σl2+σg2+M)\mathbb{E}\Big{[}\|\nabla f_{i}(w,\mathcal{B}_{i})\|^{2}\Big{]}\leq 3(\sigma_{l}^{2}+\sigma_{g}^{2}+M) for any model ww.

Proof of Lemma 1.

𝔼[fi(w,i)2]=\displaystyle\mathbb{E}\Big{[}\|\nabla f_{i}(w,\mathcal{B}_{i})\|^{2}\Big{]}= 𝔼[fi(w,i)Fi(w)+Fi(w)F(w)+F(w)2]\displaystyle\ \mathbb{E}\Big{[}\|\nabla f_{i}(w,\mathcal{B}_{i})-\nabla F_{i}(w)+\nabla F_{i}(w)-\nabla F(w)+\nabla F(w)\|^{2}\Big{]} (4)
(a)\displaystyle\overset{\text{(a)}}{\leq} 3𝔼[fi(w,i)Fi(w)2+Fi(w)F(w)2+F(w)2]\displaystyle\ 3\mathbb{E}\Big{[}\|\nabla f_{i}(w,\mathcal{B}_{i})-\nabla F_{i}(w)\|^{2}+\|\nabla F_{i}(w)-\nabla F(w)\|^{2}+\|\nabla F(w)\|^{2}\Big{]}
(b)\displaystyle\overset{\text{(b)}}{\leq} 3(σl2+σg2+M),\displaystyle\ 3(\sigma_{l}^{2}+\sigma_{g}^{2}+M),

where step (a) utilizes Cauchy-Schwarz inequality, and step (b) utilizes Assumption 3.

Lemma 2.

If a function F:nF:\mathbb{R}^{n}\rightarrow\mathbb{R} is Lipschitz smooth with constant LL, then we have

F(w)F(w)+F(w),ww+L2ww2.F(w^{\prime})\leq F(w)+\langle\nabla F(w),w^{\prime}-w\rangle+\frac{L}{2}\|w^{\prime}-w\|^{2}.\\ (5)

Proof of Lemma 2.

Let wt=w+t(ww)w_{t}=w+t(w^{\prime}-w) for t[0,1]t\in[0,1]. Then, by the fundamental theorem of calculus applied along this segment, we have

F(w)=\displaystyle F(w^{\prime})= F(w)+01F(wt),ww𝑑t\displaystyle\ F(w)+\int_{0}^{1}\langle\nabla F(w_{t}),w^{\prime}-w\rangle dt (6)
=\displaystyle= F(w)+F(w),ww+01F(wt)F(w),ww𝑑t\displaystyle\ F(w)+\langle\nabla F(w),w^{\prime}-w\rangle+\int_{0}^{1}\langle\nabla F(w_{t})-\nabla F(w),w^{\prime}-w\rangle dt
(a)\displaystyle\overset{\text{(a)}}{\leq} F(w)+F(w),ww+01F(wt)F(w)ww𝑑t\displaystyle\ F(w)+\langle\nabla F(w),w^{\prime}-w\rangle+\int_{0}^{1}\|\nabla F(w_{t})-\nabla F(w)\|\cdot\|w^{\prime}-w\|dt
(b)\displaystyle\overset{\text{(b)}}{\leq} F(w)+F(w),ww+L01wtwww𝑑t\displaystyle\ F(w)+\langle\nabla F(w),w^{\prime}-w\rangle+L\int_{0}^{1}\|w_{t}-w\|\cdot\|w^{\prime}-w\|dt
=\displaystyle= F(w)+F(w),ww+Lww201t𝑑t\displaystyle\ F(w)+\langle\nabla F(w),w^{\prime}-w\rangle+L\|w^{\prime}-w\|^{2}\int_{0}^{1}tdt
=\displaystyle= F(w)+F(w),ww+L2ww2,\displaystyle\ F(w)+\langle\nabla F(w),w^{\prime}-w\rangle+\frac{L}{2}\|w^{\prime}-w\|^{2},

where step (a) utilizes Cauchy-Schwarz inequality u,v|u,v|uv\langle u,v\rangle\leq|\langle u,v\rangle|\leq\|u\|\cdot\|v\|, and step (b) utilizes the definition of Lipschitz smoothness.

Proof of Theorem 1.

According to Lemma 2, we have

F(w(t+1))\displaystyle F(w^{(t+1)}) F(w(t))+F(w(t)),w(t+1)w(t)+L2w(t+1)w(t)2\displaystyle\leq F(w^{(t)})+\langle\nabla F(w^{(t)}),w^{(t+1)}-w^{(t)}\rangle+\frac{L}{2}\|w^{(t+1)}-w^{(t)}\|^{2} (7)
F(w(t))F(w(t)),Δ¯(t)+L2Δ¯(t)2\displaystyle\leq F(w^{(t)})-\langle\nabla F(w^{(t)}),\overline{\Delta}^{(t)}\rangle+\frac{L}{2}\|\overline{\Delta}^{(t)}\|^{2}
=F(w(t))1mk𝒮tF(w(t)),Δk(tτk,t)P1+L2m2k𝒮tΔk(tτk,t)2P2,\displaystyle=F(w^{(t)})\underbrace{-\frac{1}{m}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\langle\nabla F(w^{(t)}),\Delta_{k}^{(t-\tau_{k,t})}\rangle}_{P_{1}}+\underbrace{\frac{L}{2m^{2}}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\Delta_{k}^{(t-\tau_{k,t})}\Big{\|}^{2}}_{P_{2}},

where 𝒮t\mathcal{S}_{t} represents the set of clients that arrive at timestamp tt, and Δk(tτk,t)\Delta_{k}^{(t-\tau_{k,t})} represents the update from client kk trained on the global model with staleness τk,t\tau_{k,t}. Rearranging part P1P_{1},

P1\displaystyle P_{1} =1mk𝒮tF(w(t)),Δk(tτk,t)=ηmk𝒮tF(w(t)),q=0Qk,t1fk(wk,q(tτk,t),k,q),\displaystyle=-\frac{1}{m}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\langle\nabla F(w^{(t)}),\Delta_{k}^{(t-\tau_{k,t})}\rangle=-\frac{\eta_{\ell}}{m}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\Big{\langle}\nabla F(w^{(t)}),\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla f_{k}(w_{k,q}^{(t-\tau_{k,t})},\mathcal{B}_{k,q})\Big{\rangle}, (8)

where Qk,tQ_{k,t} represents the number of local steps client kk has taken before sending the update to the server at timestamp tt, wk,q(tτk,t)w_{k,q}^{(t-\tau_{k,t})} represents the model of client kk in local step qq, and k,q\mathcal{B}_{k,q} represents a random training batch that the client kk uses in local step qq. To compute an upper bound for the expectation of part P1P_{1}, we utilize conditional expectation: 𝔼[P1]=𝔼𝔼{i}[m]|𝔼i|{i}[m],[P1]\mathbb{E}[P_{1}]=\mathbb{E}_{\mathcal{H}}\mathbb{E}_{\{i\}\sim[m]|\mathcal{H}}\mathbb{E}_{\mathcal{B}_{i}|\{i\}\sim[m],\mathcal{H}}[P_{1}]. Specifically, 𝔼\mathbb{E}_{\mathcal{H}} is the expectation over the history \mathcal{H} of iterates up to timestamp tt, 𝔼{i}[m]|\mathbb{E}_{\{i\}\sim[m]|\mathcal{H}} is the expectation over the random group of clients arriving at timestamp tt, and 𝔼i|{i}[m],\mathbb{E}_{\mathcal{B}_{i}|\{i\}\sim[m],\mathcal{H}} is the expectation over the mini-batch stochastic gradient on client ii.

𝔼[P1]=\displaystyle\mathbb{E}[P_{1}]= ηm𝔼[k𝒮tF(w(t)),q=0Qk,t1fk(wk,q(tτk,t),k,q)]\displaystyle-\frac{\eta_{\ell}}{m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\Big{\langle}\nabla F(w^{(t)}),\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla f_{k}(w_{k,q}^{(t-\tau_{k,t})},\mathcal{B}_{k,q})\Big{\rangle}\Big{]} (9)
=(a)\displaystyle\overset{\text{(a)}}{=} ηm𝔼[k𝒮tq=0Qk,t1F(w(t)),Fk(wk,q(tτk,t))],\displaystyle-\frac{\eta_{\ell}}{m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\Big{\langle}\nabla F(w^{(t)}),\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\rangle}\Big{]},
=(b)\displaystyle\overset{\text{(b)}}{=} η2m𝔼[k𝒮tq=0Qk,t1(F(w(t))2F(w(t))Fk(wk,q(tτk,t))2)]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\Big{(}\|\nabla F(w^{(t)})\|^{2}-\|\nabla F(w^{(t)})-\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{)}\Big{]}
η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]}
=(c)\displaystyle\overset{\text{(c)}}{=} η2m𝔼[i=1mπiq=0Qi,t1(F(w(t))2F(w(t))Fi(wi,q(tτi,t))2)]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\pi_{i}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\Big{(}\|\nabla F(w^{(t)})\|^{2}-\|\nabla F(w^{(t)})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\|^{2}\Big{)}\Big{]}
η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]}
=\displaystyle= η𝒫2m𝔼[i=1mπiq=0Qi,t1F(w(t))2]\displaystyle-\frac{\eta_{\ell}\mathcal{P}}{2m}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\|\nabla F(w^{(t)})\|^{2}\Big{]}
+η𝒫2m𝔼[i=1mπiq=0Qi,t1F(w(t))Fi(wi,q(tτi,t))2]\displaystyle+\frac{\eta_{\ell}\mathcal{P}}{2m}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\|\nabla F(w^{(t)})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\|^{2}\Big{]}
η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]}
\displaystyle\leq η2m𝔼[i=1mπiq=0Qi,t1F(w(t))2]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\|\nabla F(w^{(t)})\|^{2}\Big{]}
+η2𝔼[i=1mπiq=0Qi,t1F(w(t))Fi(wi,q(tτi,t))2]\displaystyle+\frac{\eta_{\ell}}{2}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\|\nabla F(w^{(t)})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\|^{2}\Big{]}
η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2],\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]},

where step (a) utilizes Assumption 2, step (b) utilizes the identity u,v=12(u2+v2uv2)\langle u,v\rangle=\frac{1}{2}(\|u\|^{2}+\|v\|^{2}-\|u-v\|^{2}), and step (c) utilizes conditional expectation, with πi\pi_{i} denoting the probability that client ii arrives at timestamp tt. We let 𝒫=i=1mπi[1,m]\mathcal{P}=\mathop{\sum}\limits_{i=1}^{m}\pi_{i}\in[1,m] and πi:=πi/𝒫\pi_{i}^{\prime}:=\pi_{i}/\mathcal{P} such that i=1mπi=1\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}=1.

To derive the upper and lower bounds of πi\pi_{i}^{\prime}, note that according to the conclusion in Appendix B, we have πmax/πminQlogQμ=μ\pi_{\max}^{\prime}/\pi_{\min}^{\prime}\leq\textit{Q}^{\lfloor\log_{\textit{Q}}\mu\rfloor}=\mu^{\prime}, i.e., πmin1μπmax\pi_{\min}^{\prime}\geq\frac{1}{\mu^{\prime}}\pi_{\max}^{\prime}. Therefore, we can derive the upper and lower bounds of πi\pi_{i}^{\prime} as follows.

1=i=1mπi\displaystyle 1=\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}\geq πmax+(m1)πminπmax+m1μπmax\displaystyle\pi_{\max}^{\prime}+(m-1)\pi_{\min}^{\prime}\geq\pi_{\max}^{\prime}+\frac{m-1}{\mu^{\prime}}\pi_{\max}^{\prime} (10)
πmax11+m1μ\displaystyle\pi_{\max}^{\prime}\leq\frac{1}{1+\frac{m-1}{\mu^{\prime}}}
1=i=1mπi\displaystyle 1=\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}\leq πmin+(m1)πmaxπmin+μ(m1)πmin\displaystyle\pi_{\min}^{\prime}+(m-1)\pi_{\max}^{\prime}\leq\pi_{\min}^{\prime}+\mu^{\prime}(m-1)\pi_{\min}^{\prime} (11)
πmin11+μ(m1)\displaystyle\pi_{\min}^{\prime}\geq\frac{1}{1+\mu^{\prime}(m-1)}

Let γ1=1+m1μ\gamma_{1}=1+\frac{m-1}{\mu^{\prime}} and γ2=1+μ(m1)\gamma_{2}=1+\mu^{\prime}(m-1). We have πi[1γ2,1γ1]\pi_{i}^{\prime}\in[\frac{1}{\gamma_{2}},\frac{1}{\gamma_{1}}]. Then,

𝔼[P1]\displaystyle\mathbb{E}[P_{1}]\leq ηQmin2γ2𝔼[F(w(t))2]η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle-\frac{\eta_{\ell}Q_{\min}}{2\gamma_{2}}\mathbb{E}\Big{[}\|\nabla F(w^{(t)})\|^{2}\Big{]}-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]} (12)
+η2γ1𝔼[i=1mq=0Qmax1F(w(t))Fi(wi,q(tτi,t))2]P3.\displaystyle+\underbrace{\frac{\eta_{\ell}}{2\gamma_{1}}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\mathop{\sum}\limits_{q=0}^{Q_{\max}-1}\|\nabla F(w^{(t)})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\|^{2}\Big{]}}_{P_{3}}.

For part P3P_{3}, by utilizing Assumption 1, we can get

P3=\displaystyle P_{3}= η2γ1𝔼[i=1mq=0Qmax1F(w(t))Fi(wi,q(tτi,t))2]\displaystyle\ \frac{\eta_{\ell}}{2\gamma_{1}}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\mathop{\sum}\limits_{q=0}^{Q_{\max}-1}\|\nabla F(w^{(t)})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\|^{2}\Big{]} (13)
\displaystyle\leq ηγ1𝔼[i=1mq=0Qmax1(Fi(w(t))Fi(w(tτi,t))2+Fi(w(tτi,t))Fi(wi,q(tτi,t))2)]\displaystyle\ \frac{\eta_{\ell}}{\gamma_{1}}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\mathop{\sum}\limits_{q=0}^{Q_{\max}-1}\Big{(}\|\nabla F_{i}(w^{(t)})-\nabla F_{i}(w^{(t-\tau_{i,t})})\|^{2}+\|\nabla F_{i}(w^{(t-\tau_{i,t})})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\|^{2}\Big{)}\Big{]}
\displaystyle\leq ηL2γ1𝔼[i=1mq=0Qmax1(w(t)w(tτi,t)2+w(tτi,t)wi,q(tτi,t)2)].\displaystyle\ \frac{\eta_{\ell}L^{2}}{\gamma_{1}}\mathbb{E}_{\mathcal{H}}\Big{[}\mathop{\sum}\limits_{i=1}^{m}\mathop{\sum}\limits_{q=0}^{Q_{\max}-1}\Big{(}\|w^{(t)}-w^{(t-\tau_{i,t})}\|^{2}+\|w^{(t-\tau_{i,t})}-w^{(t-\tau_{i,t})}_{i,q}\|^{2}\Big{)}\Big{]}.

For w(t)w(tτi,t)2\|w^{(t)}-w^{(t-\tau_{i,t})}\|^{2}, we derive its upper bound in the following way, with step (a) utilizing Cauchy-Schwarz inequality and step (b) utilizing Lemma 1.

w(t)w(tτi,t)2=ρ=tτi,tt(w(ρ+1)w(ρ))2=ρ=tτi,tt1mk𝒮ρΔk(ρτk,ρ)2\displaystyle\ \|w^{(t)}-w^{(t-\tau_{i,t})}\|^{2}=\Bigg{\|}\mathop{\sum}\limits_{\rho=t-\tau_{i,t}}^{t}(w^{(\rho+1)}-w^{(\rho)})\Bigg{\|}^{2}=\Bigg{\|}\mathop{\sum}\limits_{\rho=t-\tau_{i,t}}^{t}\frac{1}{m}\mathop{\sum}\limits_{k\in\mathcal{S}_{\rho}}\Delta_{k}^{(\rho-\tau_{k,\rho})}\Bigg{\|}^{2} (14)
=\displaystyle= η2m2ρ=tτi,ttk𝒮ρq=0Qk,ρ1fk(wk,q(ρτk,ρ),k,q)2\displaystyle\ \frac{\eta_{\ell}^{2}}{m^{2}}\Bigg{\|}\mathop{\sum}\limits_{\rho=t-\tau_{i,t}}^{t}\mathop{\sum}\limits_{k\in\mathcal{S}_{\rho}}\mathop{\sum}\limits_{q=0}^{Q_{k,\rho}-1}\nabla f_{k}(w_{k,q}^{(\rho-\tau_{k,\rho})},\mathcal{B}_{k,q})\Bigg{\|}^{2}
(a)\displaystyle\overset{\text{(a)}}{\leq} η2τmaxQmaxmρ=tτi,ttk𝒮ρq=0Qk,ρ1fk(wk,q(ρτk,ρ),k,q)2\displaystyle\ \frac{\eta_{\ell}^{2}\tau_{\max}Q_{\max}}{m}\mathop{\sum}\limits_{\rho=t-\tau_{i,t}}^{t}\mathop{\sum}\limits_{k\in\mathcal{S}_{\rho}}\mathop{\sum}\limits_{q=0}^{Q_{k,\rho}-1}\Big{\|}\nabla f_{k}(w_{k,q}^{(\rho-\tau_{k,\rho})},\mathcal{B}_{k,q})\Big{\|}^{2}
(b)\displaystyle\overset{\text{(b)}}{\leq} 3η2τmax2Qmax2(σl2+σg2+M)\displaystyle\ 3\eta_{\ell}^{2}\tau_{\max}^{2}Q_{\max}^{2}(\sigma_{l}^{2}+\sigma_{g}^{2}+M)

For w(tτi,t)wi,q(tτi,t)2\|w^{(t-\tau_{i,t})}-w^{(t-\tau_{i,t})}_{i,q}\|^{2}, we derive its upper bound also using Lemma 1.

w(tτi,t)wi,q(tτi,t)2=η2e=0q1fi(wi,e(tτi,t),i,e)23η2Qmax2(σl2+σg2+M)\displaystyle\|w^{(t-\tau_{i,t})}-w^{(t-\tau_{i,t})}_{i,q}\|^{2}=\eta_{\ell}^{2}\Big{\|}\mathop{\sum}\limits_{e=0}^{q-1}\nabla f_{i}(w_{i,e}^{(t-\tau_{i,t})},\mathcal{B}_{i,e})\Big{\|}^{2}\leq 3\eta_{\ell}^{2}Q_{\max}^{2}(\sigma_{l}^{2}+\sigma_{g}^{2}+M) (15)

Combining equation 14 and equation 15, we can obtain the upper bound of P3P_{3}:

P33mL2η3Qmax3γ1(τmax2+1)(σl2+σg2+M).P_{3}\leq\frac{3mL^{2}\eta_{\ell}^{3}Q_{\max}^{3}}{\gamma_{1}}(\tau_{\max}^{2}+1)(\sigma_{l}^{2}+\sigma_{g}^{2}+M). (16)

Putting the upper bound of P3P_{3} (equation 16) into equation 12, we can get:

𝔼[P1]\displaystyle\mathbb{E}[P_{1}]\leq ηQmin2γ2𝔼[F(w(t))2]η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]P4\displaystyle-\frac{\eta_{\ell}Q_{\min}}{2\gamma_{2}}\mathbb{E}\Big{[}\|\nabla F(w^{(t)})\|^{2}\Big{]}\underbrace{-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]}}_{P_{4}} (17)
+3mL2η3Qmax3γ1(τmax2+1)(σl2+σg2+M).\displaystyle+\frac{3mL^{2}\eta_{\ell}^{3}Q_{\max}^{3}}{\gamma_{1}}(\tau_{\max}^{2}+1)(\sigma_{l}^{2}+\sigma_{g}^{2}+M).

To compute the upper bound for the expectation of part P2P_{2}, similar to P1P_{1}, we also apply conditional expectation: 𝔼[P2]=𝔼𝔼{i}[m]|[P2]\mathbb{E}[P_{2}]=\mathbb{E}_{\mathcal{H}}\mathbb{E}_{\{i\}\sim[m]|\mathcal{H}}[P_{2}].

𝔼[P2]=\displaystyle\mathbb{E}[P_{2}]= L2m2𝔼[k𝒮tΔk(tτk,t)2]=η2L2m2𝔼[k𝒮tq=0Qk,t1fk(wk,q(tτk,t),k,q)2]\displaystyle\ \frac{L}{2m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\Delta_{k}^{(t-\tau_{k,t})}\Big{\|}^{2}\Bigg{]}=\frac{\eta_{\ell}^{2}L}{2m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla f_{k}(w_{k,q}^{(t-\tau_{k,t})},\mathcal{B}_{k,q})\Big{\|}^{2}\Bigg{]} (18)
=\displaystyle= η2L2m2𝔼[k𝒮tq=0Qk,t1(fk(wk,q(tτk,t),k,q)Fk(wk,q(tτk,t)))+k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle\ \frac{\eta_{\ell}^{2}L}{2m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\Big{(}\nabla f_{k}(w_{k,q}^{(t-\tau_{k,t})},\mathcal{B}_{k,q})-\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{)}+\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]}
(a)\displaystyle\overset{\text{(a)}}{\leq} η2Lm2𝔼[i=1mπiq=0Qi,t1(fi(wi,q(tτi,t),i,q)Fi(wi,q(tτi,t)))2]\displaystyle\ \frac{\eta_{\ell}^{2}L}{m^{2}}\mathbb{E}_{\mathcal{H}}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{i=1}^{m}\pi_{i}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\Big{(}\nabla f_{i}(w_{i,q}^{(t-\tau_{i,t})},\mathcal{B}_{i,q})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\Big{)}\Big{\|}^{2}\Bigg{]}
+η2Lm2𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle+\frac{\eta_{\ell}^{2}L}{m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]}
=\displaystyle= η2L𝒫2m2𝔼[i=1mπiq=0Qi,t1(fi(wi,q(tτi,t),i,q)Fi(wi,q(tτi,t)))2]\displaystyle\ \frac{\eta_{\ell}^{2}L\mathcal{P}^{2}}{m^{2}}\mathbb{E}_{\mathcal{H}}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{i=1}^{m}\pi_{i}^{\prime}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\Big{(}\nabla f_{i}(w_{i,q}^{(t-\tau_{i,t})},\mathcal{B}_{i,q})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\Big{)}\Big{\|}^{2}\Bigg{]}
+η2Lm2𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle+\frac{\eta_{\ell}^{2}L}{m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]}
\displaystyle\leq η2Lγ12𝔼[i=1mq=0Qi,t1(fi(wi,q(tτi,t),i,q)Fi(wi,q(tτi,t)))2]\displaystyle\ \frac{\eta_{\ell}^{2}L}{\gamma_{1}^{2}}\mathbb{E}_{\mathcal{H}}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{i=1}^{m}\mathop{\sum}\limits_{q=0}^{Q_{i,t}-1}\Big{(}\nabla f_{i}(w_{i,q}^{(t-\tau_{i,t})},\mathcal{B}_{i,q})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\Big{)}\Big{\|}^{2}\Bigg{]}
+η2Lm2𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle+\frac{\eta_{\ell}^{2}L}{m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]}
(b)\displaystyle\overset{\text{(b)}}{\leq} mη2LQmaxγ12i=1mq=0Qmax1𝔼[fi(wi,q(tτi,t),i,q)Fi(wi,q(tτi,t))2]\displaystyle\ \frac{m\eta_{\ell}^{2}LQ_{\max}}{\gamma_{1}^{2}}\mathop{\sum}\limits_{i=1}^{m}\mathop{\sum}\limits_{q=0}^{Q_{\max}-1}\mathbb{E}\Big{[}\Big{\|}\nabla f_{i}(w_{i,q}^{(t-\tau_{i,t})},\mathcal{B}_{i,q})-\nabla F_{i}(w_{i,q}^{(t-\tau_{i,t})})\Big{\|}^{2}\Big{]}
+η2Lm2𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle+\frac{\eta_{\ell}^{2}L}{m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]}
(c)\displaystyle\overset{\text{(c)}}{\leq} m2η2σl2LQmax2γ12+η2Lm2𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]P5,\displaystyle\ \frac{m^{2}\eta_{\ell}^{2}\sigma_{l}^{2}LQ_{\max}^{2}}{\gamma_{1}^{2}}+\underbrace{\frac{\eta_{\ell}^{2}L}{m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]}}_{P_{5}},

where step (a) utilizes conditional expectation and Cauchy-Schwarz inequality, step (b) utilizes Cauchy-Schwarz inequality, and step (c) utilizes Assumption 3.

To find the upper bound of 𝔼[P1+P2]\mathbb{E}[P_{1}+P_{2}], we need to ensure that P4+P50P_{4}+P_{5}\leq 0 so that these two parts can be dropped when evaluating the upper bound. Step (a) below employs the Cauchy-Schwarz inequality.

P4+P5=\displaystyle P4+P5= η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]+η2Lm2𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]}+\frac{\eta_{\ell}^{2}L}{m^{2}}\mathbb{E}\Bigg{[}\Big{\|}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]} (19)
(a)\displaystyle\overset{\text{(a)}}{\leq} η2m𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]+η2LQmaxm𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]\displaystyle-\frac{\eta_{\ell}}{2m}\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]}+\frac{\eta_{\ell}^{2}LQ_{\max}}{m}\mathbb{E}\Bigg{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\Big{\|}\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\Big{\|}^{2}\Bigg{]}
=\displaystyle= (η2LQmaxmη2m)𝔼[k𝒮tq=0Qk,t1Fk(wk,q(tτk,t))2]0\displaystyle\ (\frac{\eta_{\ell}^{2}LQ_{\max}}{m}-\frac{\eta_{\ell}}{2m})\mathbb{E}\Big{[}\mathop{\sum}\limits_{k\in\mathcal{S}_{t}}\mathop{\sum}\limits_{q=0}^{Q_{k,t}-1}\|\nabla F_{k}(w_{k,q}^{(t-\tau_{k,t})})\|^{2}\Big{]}\leq 0

Therefore, we need to have η12LQmax\eta_{\ell}\leq\frac{1}{2LQ_{\max}}. In such cases, combining equation 17 and equation 18, we have

𝔼[F(w(t+1))]\displaystyle\mathbb{E}[F(w^{(t+1)})]\leq 𝔼[F(w(t))]ηQmin2γ2𝔼[F(w(t))2]+m2η2σl2LQmax2γ12\displaystyle\ \mathbb{E}[F(w^{(t)})]-\frac{\eta_{\ell}Q_{\min}}{2\gamma_{2}}\mathbb{E}\Big{[}\|\nabla F(w^{(t)})\|^{2}\Big{]}+\frac{m^{2}\eta_{\ell}^{2}\sigma_{l}^{2}LQ_{\max}^{2}}{\gamma_{1}^{2}} (20)
+3mL2η3Qmax3γ1(τmax2+1)(σl2+σg2+M).\displaystyle+\frac{3mL^{2}\eta_{\ell}^{3}Q_{\max}^{3}}{\gamma_{1}}(\tau_{\max}^{2}+1)(\sigma_{l}^{2}+\sigma_{g}^{2}+M).

Summing over tt from 0 to T1T-1, we obtain

ηQmin2γ2t=0T1𝔼[F(w(t))2]\displaystyle\frac{\eta_{\ell}Q_{\min}}{2\gamma_{2}}\mathop{\sum}\limits_{t=0}^{T-1}\mathbb{E}\Big{[}\|\nabla F(w^{(t)})\|^{2}\Big{]}\leq t=0T1(𝔼[F(w(t))]𝔼[F(w(t+1))])+m2η2σl2LQmax2Tγ12\displaystyle\mathop{\sum}\limits_{t=0}^{T-1}\Big{(}\mathbb{E}[F(w^{(t)})]-\mathbb{E}[F(w^{(t+1)})]\Big{)}+\frac{m^{2}\eta_{\ell}^{2}\sigma_{l}^{2}LQ_{\max}^{2}T}{\gamma_{1}^{2}} (21)
+3mL2η3Qmax3Tγ1(τmax2+1)(σl2+σg2+M)\displaystyle+\frac{3mL^{2}\eta_{\ell}^{3}Q_{\max}^{3}T}{\gamma_{1}}(\tau_{\max}^{2}+1)(\sigma_{l}^{2}+\sigma_{g}^{2}+M)
\displaystyle\leq (F(w(0))F(w))+m2η2σl2LQmax2Tγ12\displaystyle\ \Big{(}F(w^{(0)})-F(w^{*})\Big{)}+\frac{m^{2}\eta_{\ell}^{2}\sigma_{l}^{2}LQ_{\max}^{2}T}{\gamma_{1}^{2}}
+3mL2η3Qmax3Tγ1(τmax2+1)(σl2+σg2+M).\displaystyle+\frac{3mL^{2}\eta_{\ell}^{3}Q_{\max}^{3}T}{\gamma_{1}}(\tau_{\max}^{2}+1)(\sigma_{l}^{2}+\sigma_{g}^{2}+M).

Therefore, we obtain the conclusion in Theorem 1.

1Tt=0T1𝔼[F(w(t))2]\displaystyle\frac{1}{T}\mathop{\sum}\limits_{t=0}^{T-1}\mathbb{E}\Big{[}\|\nabla F(w^{(t)})\|^{2}\Big{]}\leq 2γ2(F(w(0))F(w))ηQminT+2γ2ηm2σl2LQmax2γ12Qmin\displaystyle\frac{2\gamma_{2}\Big{(}F(w^{(0)})-F(w^{*})\Big{)}}{\eta_{\ell}Q_{\min}T}+\frac{2\gamma_{2}\eta_{\ell}m^{2}\sigma_{l}^{2}LQ_{\max}^{2}}{\gamma_{1}^{2}Q_{\min}} (22)
+6γ2mη2L2Qmax3γ1Qmin(τmax2+1)(σl2+σg2+M)\displaystyle+\frac{6\gamma_{2}m\eta_{\ell}^{2}L^{2}Q_{\max}^{3}}{\gamma_{1}Q_{\min}}(\tau_{\max}^{2}+1)(\sigma_{l}^{2}+\sigma_{g}^{2}+M)

Proof of Corollary 1.

When QminQmax/μQ_{\min}\leq{Q_{\max}}/{\mu}, we have μQmax/Qmin=Q\mu\leq Q_{\max}/Q_{\min}=\textit{Q}. Therefore,

μ=QlogQμ=Q0=1.\displaystyle\mu^{\prime}=\textit{Q}^{\lfloor\log_{\textit{Q}}\mu\rfloor}=\textit{Q}^{0}=1. (23)

Consequently, γ1=1+m1μ=m,γ2=1+μ(m1)=m\gamma_{1}=1+\frac{m-1}{\mu^{\prime}}=m,\gamma_{2}=1+\mu^{\prime}(m-1)=m. Letting F=F(w(0))F(w)F^{*}=F(w^{(0)})-F(w^{*}) and σ2=σl2+σg2+M\sigma^{2}=\sigma_{l}^{2}+\sigma_{g}^{2}+M, equation 22 can be written as

1Tt=0T1𝔼[F(w(t))2]\displaystyle\frac{1}{T}\mathop{\sum}\limits_{t=0}^{T-1}\mathbb{E}\Big{[}\|\nabla F(w^{(t)})\|^{2}\Big{]}\leq 2mFηQminT+2ηmσl2LQmax2Qmin+6mη2σ2L2Qmax3Qmin(τmax2+1).\displaystyle\frac{2mF^{*}}{\eta_{\ell}Q_{\min}T}+\frac{2\eta_{\ell}m\sigma_{l}^{2}LQ_{\max}^{2}}{Q_{\min}}+\frac{6m\eta_{\ell}^{2}\sigma^{2}L^{2}Q_{\max}^{3}}{Q_{\min}}(\tau_{\max}^{2}+1). (24)

Choosing η=𝒪(1/QmaxT)\eta_{\ell}=\mathcal{O}(1/Q_{\max}\sqrt{T}) and considering QmaxQ_{\max} and QminQ_{\min} to be of the same order in the 𝒪\mathcal{O}-notation, we can derive the conclusion of Corollary 1.

mint[T]𝔼F(w(t))21Tt=0T1𝔼[F(w(t))2]\displaystyle\underset{t\in[T]}{\min}\ \mathbb{E}\|\nabla F(w^{(t)})\|^{2}\leq\frac{1}{T}\mathop{\sum}\limits_{t=0}^{T-1}\mathbb{E}\Big{[}\|\nabla F(w^{(t)})\|^{2}\Big{]}\leq 𝒪(mFT)+𝒪(mσl2LT)+𝒪(mτmax2σ2L2T)\displaystyle\mathcal{O}\Big{(}\frac{mF^{*}}{\sqrt{T}}\Big{)}+\mathcal{O}\Big{(}\frac{m\sigma_{l}^{2}L}{\sqrt{T}}\Big{)}+\mathcal{O}\Big{(}\frac{m\tau_{\max}^{2}\sigma^{2}L^{2}}{T}\Big{)} (25)

Appendix D Details of the Experiment Datasets

D.1 Classic Dataset Partition Strategy

To simulate data heterogeneity on the classic datasets, we employ two partition strategies, (i) class partition and (ii) dual Dirichlet partition, and apply them to both the MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky et al., 2009) datasets. These are the two datasets most commonly used to evaluate FL algorithms (MNIST in 32% of cases and CIFAR-10 in 28% of cases) (Ma et al., 2022). The details of the two partition strategies are elaborated in the following subsections.

D.1.1 Class Partition

Data: Classic dataset 𝒟\mathcal{D}, sample indices for different classes class_indicesclass\_indices, number of clients mm, total number of distinct classes nn, minimum and maximum number of sample classes for each client split nminn_{\min} and nmaxn_{\max}, mean value of the normal distribution μ\mu, standard deviation of the normal distribution σ\sigma.
1 client_datasets:={i:[] for i[1,m]}client\_datasets:=\{i:\texttt{[]}\textbf{ for }i\in[1,m]\} \ \triangleright Return value: local datasets for each client;
2 repeat
3       class_partition:={}class\_partition:=\{\} \ \triangleright For a certain class, which client(s) have samples of that class in its own client split;
4       for i[1,m]i\in[1,m] do
5             ni:=randint(nmin,nmax)n_{i}:=\texttt{randint(}n_{\min},n_{\max}\texttt{)};
6             classesi:=randpermutation(n)[:ni]classes_{i}:=\texttt{randpermutation(}n\texttt{)[:}n_{i}\texttt{]};
7             for cclassesic\in classes_{i} do
8                   class_partition[c]:=[i] if cclass_partition else class_partition[c].add(i)class\_partition[c]:=\texttt{[}i\texttt{]}\textbf{ if }c\notin class\_partition\textbf{ else }class\_partition\texttt{[}c\texttt{]}.add(i);
9                  
10            
11      
12until len(class_partition)=n\texttt{len(}class\_partition\texttt{)}=n;
13for cclass_partitionc\in class\_partition do
14       ns:=normal(mean=μ,std=σ,size=len(class_partition[c]))n_{s}:=\texttt{normal(mean=}\mu,\texttt{std=}\sigma,\texttt{size=}\texttt{len(}class\_partition\texttt{[}c\texttt{]}\texttt{)}\texttt{)};
15       ns:=(ns/sum(ns))len(class_indices[c])n_{s}:=(n_{s}/\texttt{sum(}n_{s}\texttt{)})*\texttt{len(}class\_indices\texttt{[}c\texttt{]}\texttt{)} \ \triangleright Number of samples with class cc for each assigned client;
16       for num,izip(ns,class_partition[c])num,i\in\texttt{zip(}n_{s},class\_partition\texttt{[}c\texttt{]}\texttt{)} do
17             Add numnum samples of 𝒟\mathcal{D} with distinct indices from classes_indices[c]classes\_indices\texttt{[}c\texttt{]} to client_datasets[i]client\_datasets\texttt{[}i\texttt{]};
18            
19      
20return client_datasetsclient\_datasets;
Algorithm 4 ClassPartition

Algorithm 4 outlines the process of class partition, which partitions a classic dataset into several client splits such that each client split has samples from only a few classes. Many previous works partition classic datasets in a similar way to artificially create non-IID federated datasets (McMahan et al., 2017; Chen et al., 2019). To partition a dataset 𝒟\mathcal{D}, we first obtain a dictionary class_indicesclass\_indices that contains the indices of samples belonging to each class; namely, class_indices[c]class\_indices\texttt{[}c\texttt{]} represents the indices of all samples in 𝒟\mathcal{D} with class cc. Lines 2 to 9 of the algorithm assign nin_{i} classes to each client ii for its own client split, where ni[nmin,nmax]n_{i}\in[n_{\min},n_{\max}]. Additionally, the algorithm ensures that each sample class is assigned to at least one client split. Subsequently, in lines 10 to 14, the algorithm determines the number of samples of class cc that each assigned client split should have based on a normal distribution, and then assigns the corresponding number of data samples to the respective clients. The class partition strategy results in imbalanced and non-IID FL datasets, with clients holding different sets of sample classes within their local datasets.
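For illustration, the following is a minimal NumPy sketch of the class partition strategy; the function and argument names are assumptions made for this example, and details such as the handling of leftover samples from integer rounding may differ from the actual data pipeline.

```python
import numpy as np

def class_partition(class_indices, m, n, n_min, n_max, mu=10.0, sigma=3.0, rng=None):
    """Partition a dataset into m non-IID client splits by sample class.

    class_indices: dict mapping class label (0..n-1) -> list of sample indices.
    Each client receives between n_min and n_max classes; the per-client share of
    each class is drawn from a normal distribution with mean mu and std sigma."""
    rng = rng or np.random.default_rng()
    while True:  # retry until every class is assigned to at least one client
        class_owners = {c: [] for c in range(n)}
        for i in range(m):
            n_i = rng.integers(n_min, n_max + 1)
            for c in rng.permutation(n)[:n_i]:
                class_owners[c].append(i)
        if all(class_owners[c] for c in range(n)):
            break
    client_datasets = {i: [] for i in range(m)}
    for c, owners in class_owners.items():
        # abs() guards against the rare negative draw so shares stay non-negative.
        weights = np.abs(rng.normal(mu, sigma, size=len(owners)))
        counts = (weights / weights.sum() * len(class_indices[c])).astype(int)
        start = 0
        for num, i in zip(counts, owners):
            client_datasets[i].extend(class_indices[c][start:start + num])
            start += num
    return client_datasets
```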

In the experiments, the mean value of the normal distribution is μ=10\mu=10 and the standard deviation is σ=3\sigma=3. Both the MNIST and CIFAR-10 datasets have 10 distinct classes, i.e., n=10n=10. When the number of clients m=5m=5, we use nmin=5n_{\min}=5 and nmax=6n_{\max}=6; when m=10m=10 or 2020, we use nmin=3n_{\min}=3 and nmax=5n_{\max}=5. Figure 3 shows a sample class distribution generated by the class partition strategy among m=10m=10 clients with nmin=3n_{\min}=3, nmax=5n_{\max}=5, and n=10n=10.

Figure 3: Sample class distribution generated by the class partition strategy among ten clients.

D.1.2 Dual Dirichlet Partition

Data: Classic dataset 𝒟\mathcal{D}, sample indices for different classes class_indicesclass\_indices, number of clients mm, total number of distinct classes nn, two concentration parameters α1\alpha_{1} and α2\alpha_{2} for Dirichlet distributions.
1 client_datasets:={i:[] for i[1,m]}client\_datasets:=\{i:\texttt{[]}\textbf{ for }i\in[1,m]\} \ \triangleright Return value: local datasets for each client;
2 𝐩𝟏:=[1/m for i[1,m]]\mathbf{p_{1}}:=\texttt{[}1/m\textbf{ for }i\in[1,m]\texttt{]} \ \triangleright Prior distribution for the number of samples for each client;
3 𝐩𝟐:=[len(class_indices[c]) for cclass_indices\mathbf{p_{2}}:=\texttt{[len(}class\_indices\texttt{[}c\texttt{]}\texttt{)}\textbf{ for }c\in class\_indices];
4 𝐩𝟐:=[p/sum(𝐩𝟐) for p𝐩𝟐]\mathbf{p_{2}}:=\texttt{[}p/\texttt{sum(}\mathbf{p_{2}}\texttt{)}\textbf{ for }p\in\mathbf{p_{2}}\texttt{]} \triangleright Prior distribution for the number of samples for each class;
5 𝜶𝟏:=α1𝐩𝟏\bm{\alpha_{1}}:=\alpha_{1}*\mathbf{p_{1}};
6 𝜶𝟐:=α2𝐩𝟐\bm{\alpha_{2}}:=\alpha_{2}*\mathbf{p_{2}};
7 client_wts:=dirichlet(𝜶𝟏)client\_wts:=\texttt{dirichlet(}\bm{\alpha_{1}}\texttt{)} \ \triangleright Normalized number of samples for each client, shape is m\mathbb{R}^{m}, sum is 1;
8 class_wts:=dirichlet(𝜶𝟐,size=m)class\_wts:=\texttt{dirichlet(}\bm{\alpha_{2}},\texttt{size=}m\texttt{)} \ \triangleright Normalized number of samples for each class within each client, shape is m×n\mathbb{R}^{m\times n}, row sum is 1;
9 for i[1,m]i\in[1,m] do
10       for c[1,n]c\in[1,n] do
11             num:=client_wts[i]class_wts[i][c]len(class_indices[c])/client_wts,class_wts[:,c]num:=client\_wts[i]*class\_wts[i][c]*\texttt{len(}class\_indices[c])/\langle client\_wts,class\_wts[:,c]\rangle;
12             Add numnum samples of 𝒟\mathcal{D} with distinct indices from classes_indices[c]classes\_indices\texttt{[}c\texttt{]} to client_datasets[i]client\_datasets\texttt{[}i\texttt{]};
13            
14      
15return client_datasetsclient\_datasets;
Algorithm 5 DualDirichletPartition

Some prior works that model data heterogeneity also employ a Dirichlet distribution to model the class distribution within each local client dataset (Yurochkin et al., 2019; Hsu et al., 2019). We extend these works by using two Dirichlet distributions, with details outlined in Algorithm 5. The first Dirichlet distribution uses concentration parameter α1\alpha_{1} and prior distribution 𝐩𝟏\mathbf{p_{1}} to generate the normalized number of samples for each client, client_wtsmclient\_wts\in\mathbb{R}^{m}. The second Dirichlet distribution uses concentration parameter α2\alpha_{2} and prior distribution 𝐩𝟐\mathbf{p_{2}} to generate the normalized number of samples for each class within each client, class_wtsm×nclass\_wts\in\mathbb{R}^{m\times n}. In lines 9 to 12, the algorithm combines client_wtsclient\_wts, class_wtsclass\_wts, and the total number of samples of class cc to calculate the number of samples of class cc to be assigned to each client, and assigns the samples accordingly. In summary, the dual Dirichlet partition strategy employs two Dirichlet distributions to model two levels of data heterogeneity: the distribution across clients and the distribution across different classes within each client. The generated client datasets are therefore imbalanced and non-IID, which more accurately reflects real-world federated datasets.
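A minimal NumPy sketch of this dual Dirichlet partition is shown below; the names and the exact rounding of per-class sample counts are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def dual_dirichlet_partition(class_indices, m, n, alpha1, alpha2, rng=None):
    """Partition a dataset into m non-IID client splits with two Dirichlet draws:
    one over clients (dataset sizes) and one over classes within each client."""
    rng = rng or np.random.default_rng()
    p1 = np.full(m, 1.0 / m)                                    # prior over clients
    class_sizes = np.array([len(class_indices[c]) for c in range(n)])
    p2 = class_sizes / class_sizes.sum()                        # prior over classes
    client_wts = rng.dirichlet(alpha1 * p1)                     # shape (m,), sums to 1
    class_wts = rng.dirichlet(alpha2 * p2, size=m)              # shape (m, n), rows sum to 1
    client_datasets = {i: [] for i in range(m)}
    for c in range(n):
        # Share of class c for each client, rescaled so that the shares sum to
        # the number of available samples of class c (cf. line 11 of Algorithm 5).
        shares = client_wts * class_wts[:, c]
        counts = (shares / shares.sum() * class_sizes[c]).astype(int)
        start = 0
        for i in range(m):
            client_datasets[i].extend(class_indices[c][start:start + counts[i]])
            start += counts[i]
    return client_datasets
```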

In the experiments, we set α1=m\alpha_{1}=m and α2=0.5\alpha_{2}=0.5. The concentration parameter α\alpha governs the similarity among clients or classes. When α\alpha\rightarrow\infty, the distribution becomes identical to the prior distribution. Conversely, when α0\alpha\rightarrow 0, only one item is selected with a value of 1 while the remaining items have a value of 0. Since concentration parameter α2\alpha_{2} has a very small value, each client has data samples from only a few classes. Figure 4 illustrates a sample class distribution generated by the dual Dirichlet partition strategy among m=10m=10 clients, where n=10n=10, α1=10\alpha_{1}=10, and α2=0.5\alpha_{2}=0.5.

Figure 4: Sample class distribution generated by the dual Dirichlet partition strategy among ten clients.

D.2 Naturally Split Datasets: FLamby

FLamby is an open-source suite that provides naturally split datasets specifically designed for benchmarking cross-silo federated learning algorithms (Ogier du Terrail et al., 2022). These datasets are sourced directly from distributed real-world entities, ensuring that they preserve the authentic characteristics and complexities of data distribution in FL. In our study, we selected two medium-sized datasets from FLamby – Fed-IXI and Fed-ISIC2019 – to evaluate the performance of FedCompass and other FL algorithms. The two datasets cover two healthcare domains: brain MRI segmentation (Fed-IXI) and skin cancer detection (Fed-ISIC2019). By leveraging these datasets, our goal is to showcase the robustness and versatility of FedCompass across diverse machine learning tasks in real-world scenarios. Table 6 provides an overview of the two selected datasets from FLamby, and detailed descriptions are given in the following subsections.

Table 6: Overview of the two selected datasets in FLamby.
Dataset Fed-IXI Fed-ISIC2019
Inputs T1 weighted image (T1WI) Dermoscopy
Task type 3D segmentation Classification
Prediction Brain mask Melanoma class
Client number 3 6
Model 3D U-net (Çiçek et al., 2016) EfficientNet (Tan & Le, 2019)
Metric DICE (Dice, 1945) Balanced accuracy
Input dimension 48 ×\times 60 ×\times 48 200 ×\times 200 ×\times 3
Dataset size 444MB 9GB

D.2.1 Fed-IXI

The Fed-IXI dataset is sourced from the Information eXtraction from Images (IXI) database and includes brain T1 magnetic resonance images (MRIs) from three hospitals (Pérez-García et al., 2021). These MRIs have been repurposed for a brain segmentation task; the model performance is evaluated with the DICE score (Dice, 1945). To ensure consistency, the images undergo preprocessing steps, such as volume resizing to 48 × 60 × 48 voxels, and sample-wise intensity normalization. The 3D U-net (Çiçek et al., 2016) serves as the baseline model for this task.

D.2.2 Fed-ISIC2019

The Fed-ISIC2019 dataset is sourced from the ISIC2019 dataset, which includes dermoscopy images collected from four hospitals (Tschandl et al., 2018; Codella et al., 2018; Combalia et al., 2019). The images are restricted to a subset from the public train set and are split based on the imaging acquisition system, resulting in a 6-client federated dataset. The task involves image classification among eight different melanoma classes, and the performance is measured through balanced accuracy. Images are preprocessed by resizing and normalizing brightness and contrast. The baseline classification model is an EfficientNet (Tan & Le, 2019) pretrained on ImageNet (Deng et al., 2009).

Appendix E Experiment Setup Details

All experiments are implemented using APPFL, an open-source software framework for privacy-preserving federated learning (Ryu et al., 2022; Li et al., 2023).

E.1 Details of Client Heterogeneity Simulation

When simulating the client heterogeneity, we consider the average time tit_{i} required for client ii to complete one local step. We assume that tit_{i} follows a random distribution PP, namely, tiP(t)t_{i}\sim P(t). In our experiments, we employ three different distributions to simulate client heterogeneity: the normal distribution ti𝒩(μ,σ2)t_{i}\sim\mathcal{N}(\mu,\sigma^{2}) with σ=0\sigma=0 (i.e., homogeneous clients), the normal distribution ti𝒩(μ,σ2)t_{i}\sim\mathcal{N}(\mu,\sigma^{2}) with σ=0.3μ\sigma=0.3\mu, and the exponential distribution tiExp(λ)t_{i}\sim\textit{Exp}(\lambda). The exponential distribution models the amount of time until a certain event occurs (arrival time), specifically events that follow a Poisson process. The mean values of the distribution (μ\mu or 1/λ1/\lambda) are adjusted for different datasets according to the size of the model and training data. Table 7 presents the mean values of the client training time distribution for different datasets used in the experiments. Furthermore, Figure 5 illustrates the distribution of the training time to complete one local step among 50 clients under normal distribution with σ=0.3μ\sigma=0.3\mu and exponential distribution, both with a mean value of 0.15.
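As a rough sketch of how such client timings can be simulated (the function name and the clipping of normal draws to positive values are assumptions of this example, not the exact experimental code):

```python
import numpy as np

def sample_step_times(m, mean, heterogeneity, rng=None):
    """Draw the per-local-step training time t_i for m simulated clients.

    heterogeneity: "homo"   -> all clients take `mean` seconds per step,
                   "normal" -> t_i ~ N(mean, (0.3 * mean)^2), clipped to stay positive,
                   "exp"    -> t_i ~ Exp(rate = 1 / mean)."""
    rng = rng or np.random.default_rng()
    if heterogeneity == "homo":
        return np.full(m, mean)
    if heterogeneity == "normal":
        return np.clip(rng.normal(mean, 0.3 * mean, size=m), 1e-3, None)
    if heterogeneity == "exp":
        return rng.exponential(scale=mean, size=m)
    raise ValueError(f"unknown heterogeneity setting: {heterogeneity}")

# Example: 50 MNIST clients (mean 0.15 s per step, cf. Table 7) with exponential speeds.
step_times = sample_step_times(50, mean=0.15, heterogeneity="exp")
```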

Table 7: Mean values of the training time distribution for different datasets.
Dataset Name Batch Size Mean (μ\mu or 1/λ1/\lambda)
MNIST 64 0.15s
CIFAR-10 128 0.5s
Fed-IXI 2 0.8s
Fed-ISIC2019 64 1.5s
Refer to caption
Figure 5: Sample distribution of the training time to complete one local step among 50 clients under (a) normal distribution with σ=0.3μ\sigma=0.3\mu and (b) exponential distribution.

E.2 Detailed Setups and Hyperparameters for Experiments on the MNIST Dataset

Table 8 shows the detailed architecture of the convolutional neural network (CNN) used for the experiments on the MNIST dataset. Table 9 provides a comprehensive list of the hyperparameter values used for training. Specifically, QQ is the number of client local steps for each training round, QminQ_{\min} and QmaxQ_{\max} are the minimum and maximum numbers of client local steps in FedCompass, η\eta_{\ell} is the client local learning rate, β\beta is the server-side momentum, and BB is the batch size. For all asynchronous FL algorithms, we adopt the polynomial staleness function introduced in (Xie et al., 2019), denoted as st(tτ)=α(tτ+1)ast(t-\tau)=\alpha(t-\tau+1)^{-a}. The values of α\alpha and aa are also listed in Table 9 for all asynchronous algorithms. Additionally, KK is the buffer size for the FedBuff algorithm, υ\upsilon is the speed ratio factor for creating client tiers (clients whose speed ratio is within υ\upsilon are tiered together; its value is selected via grid search), and λ\lambda is the latest time factor for FedCompass. For hyperparameters given as length-three tuples, the tuple items correspond to the hyperparameter values when the number of clients is 5, 10, and 20, respectively. For hyperparameters that are not applicable to certain algorithms, the corresponding value is marked with a dash “-”. To ensure a fair comparison, similar algorithms are assigned the same hyperparameter values, as depicted in the table. The number of global training rounds is chosen separately for different experiment settings and FL algorithms such that the corresponding FL algorithm converges, synchronous FL algorithms take the same amount of wall-clock time, and asynchronous algorithms take the same amount of wall-clock time in the same setting. It is also worth mentioning that the buffer size KK for the FedBuff algorithm (Nguyen et al., 2022) is selected from a search over multiple candidate values. Although the original paper states that K=10K=10 is a good setting across all their benchmarks and does not require tuning, this value does not apply to our cross-silo settings, where most experiments involve at most ten clients, whereas all experiments in the original paper involve thousands of clients.
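For reference, the polynomial staleness function above can be written as the following small helper (a sketch using the α and a values from Table 9 as defaults):

```python
def staleness_weight(t, tau, alpha=0.9, a=0.5):
    """Polynomial staleness function st(t - tau) = alpha * (t - tau + 1) ** (-a),
    used to down-weight updates trained on an old global model (Xie et al., 2019)."""
    return alpha * (t - tau + 1) ** (-a)

# A fresh update (t == tau) keeps weight alpha = 0.9; an update that is
# 3 global steps stale is weighted 0.9 * 4 ** -0.5 = 0.45.
```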

Table 8: MNIST CNN model architecture.
Layer Outsize Setting Param #
Input (1, 28, 28) - 0
Conv1 (32, 24, 24) kernel=5, stride=1 832
ReLU1 (32, 24, 24) - 0
MaxPool1 (32, 12, 12) kernel=2, stride=2 0
Conv2 (64, 8, 8) kernel=5, stride=1 51,264
ReLU2 (64, 8, 8) - 0
MaxPool2 (64, 4, 4) kernel=2, stride=2 0
Flatten (1024,) - 0
FC1 (512,) - 524,800
ReLU3 (512,) - 0
FC2 (10,) - 5,130
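The architecture in Table 8 corresponds to the following PyTorch sketch (an illustrative reimplementation whose layer shapes and parameter counts match the table; it is not necessarily the exact code used in the experiments):

```python
import torch.nn as nn

class MNISTCNN(nn.Module):
    """CNN for 1x28x28 MNIST inputs, following Table 8."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=1),   # (32, 24, 24), 832 params
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # (32, 12, 12)
            nn.Conv2d(32, 64, kernel_size=5, stride=1),   # (64, 8, 8), 51,264 params
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # (64, 4, 4)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # (1024,)
            nn.Linear(64 * 4 * 4, 512),   # 524,800 params
            nn.ReLU(),
            nn.Linear(512, num_classes),  # 5,130 params
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```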
Table 9: Hyperparameters for various FL algorithms on the MNIST dataset.
FedAvg FedAvgM FedAsync FedBuff FedAT FedCompass(+N) FedCompass+M
Q (200, 200, 100) (200, 200, 100) (200, 200, 100) (200, 200, 100) (200, 200, 100) - -
Q_min - - - - - (40, 40, 20) (40, 40, 20)
Q_max - - - - - (200, 200, 100) (200, 200, 100)
Optim Adam Adam Adam Adam Adam Adam Adam
η_ℓ 0.003 0.003 0.003 0.003 0.003 0.003 0.003
β - 0.9 - - - - 0.9
B 64 64 64 64 64 64 64
α - - 0.9 0.9 - 0.9 0.9
a - - 0.5 0.5 - 0.5 0.5
K - - - (3, 3, 5) - - -
υ - - - - 2.0 - -
λ - - - - - 1.2 1.2

E.3 Detailed Setups and Hyperparameters for Experiments on the CIFAR-10 Dataset

For experiments on the CIFAR-10 dataset, a randomly initialized ResNet-18 model (He et al., 2016) is used for training. Table 10 shows the detailed architecture of the ResNet-18 model, which contains eight residual blocks. Each residual block contains two convolutional + batch normalization layers, as shown in Table 11. The block input is added to the output of the second batch normalization layer; if the output channel dimension differs from the input channel dimension, the input first passes through a 1x1 convolutional + batch normalization layer before being added to the output. A final ReLU layer is applied after the addition. Table 12 lists the hyperparameter values used in training for the various FL algorithms. The number of global training rounds is chosen separately for each experiment setting and FL algorithm such that the corresponding FL algorithm converges, all synchronous FL algorithms take the same amount of wall-clock time, and all asynchronous algorithms take the same amount of wall-clock time in the same setting.

Table 10: CIFAR-10 ResNet18 model architecture.
Layer Outsize Setting Param #
Input (3, 32, 32) - 0
Conv1 (64, 32, 32) kernel=3, stride=1, padding=1, bias=False 1,728
BatchNorm1 (64, 32, 32) - 128
ReLU1 (64, 32, 32) - 0
ResidualBlock1-1 (64, 32, 32) stride=1 73,984
ResidualBlock1-2 (64, 32, 32) stride=1 73,984
ResidualBlock2-1 (128, 16, 16) stride=2 230,144
ResidualBlock2-2 (128, 16, 16) stride=1 295,424
ResidualBlock3-1 (256, 8, 8) stride=2 919,040
ResidualBlock3-2 (256, 8, 8) stride=1 1,180,672
ResidualBlock4-1 (512, 4, 4) stride=2 3,673,088
ResidualBlock4-2 (512, 4, 4) stride=1 4,720,640
AvgPool (512,) kernel=4, stride=4 0
Linear (10,) - 5,130
Table 11: ResidualBlock in ResNet18 architecture.
Layer Outsize Setting Param #
Input (in_planes, H, W) - 0
Conv1 (planes, H/stride, W/stride) kernel=3, stride=stride, padding=1 in_planes*planes*9
BatchNorm1 (planes, H/stride, W/stride) - 2*planes
ReLU1 (planes, H/stride, W/stride) - 0
Conv2 (planes, H/stride, W/stride) kernel=3, stride=1, padding=1 planes*planes*9
BatchNorm2 (planes, H/stride, W/stride) - 2*planes
“Identity” (planes, H/stride, W/stride) 1x1 Conv + BatchNorm if in_planes ≠ planes, identity otherwise -
ReLU2 (planes, H/stride, W/stride) applied after adding the BatchNorm2 and “Identity” outputs 0
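For concreteness, Tables 10 and 11 correspond to the following PyTorch sketch of the CIFAR-style ResNet-18 (an illustrative reimplementation, not necessarily the exact code used in the experiments):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block from Table 11: two 3x3 conv + batch-norm layers."""
    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        # 1x1 conv + batch norm on the shortcut when the shape changes,
        # identity otherwise (the "Identity" row in Table 11)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, planes, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(planes),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # add the (possibly projected) input
        return F.relu(out)            # final ReLU after the addition

class ResNet18(nn.Module):
    """ResNet-18 for 3x32x32 CIFAR-10 inputs, following the layer list in Table 10."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(
            ResidualBlock(64, 64, stride=1), ResidualBlock(64, 64, stride=1),
            ResidualBlock(64, 128, stride=2), ResidualBlock(128, 128, stride=1),
            ResidualBlock(128, 256, stride=2), ResidualBlock(256, 256, stride=1),
            ResidualBlock(256, 512, stride=2), ResidualBlock(512, 512, stride=1),
        )
        self.pool = nn.AvgPool2d(kernel_size=4, stride=4)  # (512, 4, 4) -> (512, 1, 1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        out = self.pool(self.blocks(self.stem(x)))
        return self.fc(out.flatten(1))
```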
Table 12: Hyperparameters for various FL algorithms on the CIFAR-10 dataset.
FedAvg FedAvgM FedAsync FedBuff FedAT FedCompass(+N) FedCompass+M
Q (200, 200, 100) (200, 200, 100) (200, 200, 100) (200, 200, 100) (200, 200, 100) - -
Q_min - - - - - (40, 40, 20) (40, 40, 20)
Q_max - - - - - (200, 200, 100) (200, 200, 100)
Optim SGD SGD SGD SGD SGD SGD SGD
η_ℓ 0.1 0.1 0.1 0.1 0.1 0.1 0.1
β - 0.9 - - - - 0.9
B 128 128 128 128 128 128 128
α - - 0.9 0.9 - 0.9 0.9
a - - 0.5 0.5 - 0.5 0.5
K - - - (3, 3, 5) - - -
υ - - - - 2.0 - -
λ - - - - - 1.2 1.2

E.4 Detailed Setups and Hyperparameters for Experiments on the FLamby Datasets

For the datasets from FLamby (Ogier du Terrail et al., 2022), we use the default experiment settings provided by FLamby. The details of these settings are shown in Table 13.

Table 13: Detailed settings for experiments on FLamby datasets.
Fed-IXI Fed-ISIC2019
Model 3D U-net (Çiçek et al., 2016) EfficientNet (Tan & Le, 2019)
Metric DICE Balanced accuracy
m (Client #) 3 6
Q 50 50
Q_min 20 20
Q_max 100 100
Optimizer AdamW (Loshchilov & Hutter, 2017) Adam
η_ℓ 0.001 0.0005
β 0.9 0.5
B 2 64
α 0.9 0.9
a 0.5 0.5
K 2 3
λ 1.2 1.2

Appendix F Additional Experiment Results

F.1 Additional Experiment Results on the MNIST Dataset

Table 14 shows the top validation accuracy and the corresponding standard deviation on the class partitioned MNIST dataset for various federated learning algorithms with different numbers of clients and client heterogeneity settings over ten independent runs with different random seeds, and Table 15 shows the corresponding results on the dual Dirichlet partitioned MNIST dataset. Since all asynchronous FL algorithms take the same amount of training time in the same experiment setting, the results indicate that FedCompass not only converges faster than the other asynchronous FL algorithms but also converges to higher accuracy. Notably, in the homogeneous client settings, FedCompass does not behave exactly the same as FedAvg: clients still exhibit small speed variations within each local training round, which prevents the Compass scheduler from assigning Q_max to all clients and thus from reproducing the behavior of FedAvg exactly. As a result, the global model is updated more frequently in FedCompass, which leads to relatively faster convergence.

Figures 6–11 depict the change in validation accuracy during the training process for the various FL algorithms in the different experiment settings. These figures illustrate how FedCompass achieves faster convergence under different data and device heterogeneity settings.

Table 14: Average top validation accuracy and standard deviation on the class partitioned MNIST dataset for various federated learning algorithms with different numbers of clients and client heterogeneity settings.
m=5 m=10 m=20
Method homo normal exp homo normal exp homo normal exp
FedAvg 92.42±3.96 92.42±3.96 92.42±3.96 92.27±3.88 92.27±3.88 92.27±3.88 97.38±1.59 97.38±1.59 97.38±1.59
FedAvgM 96.22±2.45 96.22±2.45 96.22±2.45 97.11±1.45 97.11±1.45 97.11±1.45 98.27±0.18 98.27±0.18 98.27±0.18
FedAsync 90.60±6.96 92.53±5.57 92.44±3.07 86.73±6.62 92.27±3.82 96.25±2.46 84.26±7.96 90.98±2.48 98.01±0.89
FedBuff 95.77±1.93 94.69±3.52 95.68±1.78 91.93±4.50 94.83±3.36 96.82±1.91 93.04±3.61 98.12±0.54 98.56±0.22
FedAT 92.47±3.51 92.36±3.59 91.40±3.21 91.50±3.15 91.01±3.43 93.03±2.94 97.17±1.62 97.89±0.94 95.74±2.48
FedCompass 95.06±2.89 95.54±2.89 96.34±1.89 94.33±3.80 95.47±3.05 97.51±1.66 98.49±0.45 98.66±0.36 98.72±0.47
FedCompass+M 95.58±2.87 95.98±2.43 95.93±1.92 95.64±2.72 96.61±2.12 97.69±0.76 98.58±0.23 97.05±0.91 98.51±0.47
FedCompass+N 94.03±3.82 93.07±4.00 96.44±1.72 94.55±3.86 94.80±3.65 97.28±1.51 98.52±0.42 98.79±0.17 98.80±0.18
Table 15: Average top accuracy and standard deviation on the dual Dirichlet partitioned MNIST dataset for various federated learning algorithms with different numbers of clients and client heterogeneity settings.
m=5 m=10 m=20
Method homo normal exp homo normal exp homo normal exp
FedAvg 93.08±4.16 93.08±4.16 93.08±4.16 93.71±4.52 93.71±4.52 93.71±4.52 95.77±2.98 95.77±2.98 95.77±2.98
FedAvgM 93.88±3.16 93.88±3.16 93.88±3.16 94.00±3.49 94.00±3.49 94.00±3.49 96.06±2.17 96.06±2.17 96.06±2.17
FedAsync 92.11±2.99 92.96±3.70 93.29±3.08 86.31±4.51 90.72±4.30 94.06±5.91 78.49±6.28 85.82±5.31 95.90±1.58
FedBuff 93.87±4.30 94.11±3.60 94.94±2.49 92.27±4.60 94.29±4.04 94.50±3.72 87.04±6.26 95.27±1.84 96.90±0.98
FedAT 92.30±4.00 93.06±3.74 92.88±3.16 95.40±2.70 95.63±2.42 93.80±4.94 95.26±2.47 97.32±0.76 94.67±3.01
FedCompass 94.15±4.26 94.36±4.28 95.46±3.59 95.53±3.21 95.41±3.52 96.34±2.64 96.41±1.96 97.62±1.57 97.86±0.71
FedCompass+M 94.74±3.42 95.21±3.13 96.15±2.03 96.01±2.76 96.10±2.91 95.59±2.81 97.03±1.50 93.25±3.28 97.42±0.96
FedCompass+N 94.56±4.25 92.56±4.96 94.57±3.69 95.45±3.41 95.13±3.02 95.01±4.62 96.28±2.06 97.61±0.60 97.92±0.50
Figure 6: Change in validation accuracy during the training process for different FL algorithms on the class partitioned MNIST dataset with five clients and three client heterogeneity settings.
Figure 7: Change in validation accuracy during the training process for different FL algorithms on the class partitioned MNIST dataset with ten clients and three client heterogeneity settings.
Figure 8: Change in validation accuracy during the training process for different FL algorithms on the class partitioned MNIST dataset with twenty clients and three client heterogeneity settings.
Figure 9: Change in validation accuracy during the training process for different FL algorithms on the dual Dirichlet partitioned MNIST dataset with five clients and three client heterogeneity settings.
Figure 10: Change in validation accuracy during the training process for different FL algorithms on the dual Dirichlet partitioned MNIST dataset with ten clients and three client heterogeneity settings.
Figure 11: Change in validation accuracy during the training process for different FL algorithms on the dual Dirichlet partitioned MNIST dataset with twenty clients and three client heterogeneity settings.

F.2 Additional Experiment Results on the CIFAR-10 Dataset

Table 16 shows the top validation accuracy and the corresponding standard deviation on the class partitioned CIFAR-10 dataset for various federated learning algorithms with different numbers of clients and client heterogeneity settings over ten independent runs with different random seeds, and Table 17 shows the corresponding results on the dual Dirichlet partitioned CIFAR-10 dataset. Figures 12–17 show the change in validation accuracy during the training process for the various FL algorithms in the different experiment settings on the CIFAR-10 dataset.

Table 16: Average top accuracy and standard deviation on the class partitioned CIFAR-10 dataset for various FL algorithms with different numbers of clients and client heterogeneity settings.
m=5 m=10 m=20
Method homo normal exp homo normal exp homo normal exp
FedAvg 67.10±6.33 67.10±6.33 67.10±6.33 66.80±2.71 66.80±2.71 66.80±2.71 70.03±2.37 70.03±2.37 70.03±2.37
FedAvgM 63.44±3.26 63.44±3.26 63.44±3.26 63.50±2.56 63.50±2.56 63.50±2.56 64.47±1.88 64.47±1.88 64.47±1.88
FedAsync 53.30±4.12 63.65±2.81 66.27±3.91 37.09±2.99 44.87±3.54 59.40±4.21 30.18±2.84 36.73±4.93 54.30±2.64
FedBuff 67.02±2.55 67.33±2.46 70.86±2.99 46.14±4.14 50.86±4.43 64.32±3.78 43.83±3.01 49.63±4.60 60.90±3.37
FedAT 72.41±3.31 72.92±2.52 71.15±3.18 65.25±4.21 65.70±4.34 66.57±4.48 66.09±1.72 66.38±1.50 66.31±1.87
FedCompass 72.66±2.44 73.64±1.80 73.96±1.83 66.27±2.41 67.07±2.40 68.73±1.95 66.83±1.29 67.14±1.32 67.41±1.59
FedCompass+M 72.82±2.27 74.03±1.77 74.22±1.98 66.79±1.93 67.72±1.77 69.14±1.78 66.80±1.18 67.48±1.38 67.91±1.45
FedCompass+N 72.79±2.50 73.69±2.33 74.85±1.91 65.53±2.52 66.78±2.31 68.04±1.99 66.32±1.41 66.76±1.40 67.51±1.50
Table 17: Average top accuracy and standard deviation on the dual Dirichlet partitioned CIFAR-10 dataset for various FL algorithms with different numbers of clients and client heterogeneity settings.
m=5 m=10 m=20
Method homo normal exp homo normal exp homo normal exp
FedAvg 48.88±8.03 48.88±8.03 48.88±8.03 51.91±7.65 51.91±7.65 51.91±7.65 49.28±2.62 49.28±2.62 49.28±2.62
FedAvgM 49.89±5.27 49.89±5.27 49.89±5.27 51.72±2.64 51.72±2.64 51.72±2.64 49.94±4.15 49.94±4.15 49.94±4.15
FedAsync 44.38±4.71 47.67±4.29 51.21±4.99 33.52±3.29 38.13±2.07 47.62±2.59 34.89±3.34 36.12±1.69 39.92±3.39
FedBuff 51.85±3.23 54.03±4.84 58.11±4.62 42.60±2.47 45.24±1.12 53.04±2.45 41.37±2.85 42.65±1.70 46.30±2.90
FedAT 58.91±4.63 58.34±3.77 55.40±4.05 52.82±6.37 53.54±7.12 53.83±5.36 48.63±3.02 50.56±1.31 50.37±2.38
FedCompass 59.29±3.49 59.98±3.65 61.23±2.67 54.51±3.50 54.95±3.68 57.29±1.98 51.01±2.20 52.03±2.75 51.39±2.12
FedCompass+M 59.29±2.65 60.20±3.01 59.73±3.97 54.24±4.21 54.53±3.14 56.30±2.87 51.48±2.54 52.73±2.61 51.70±2.06
FedCompass+N 59.38±3.59 60.38±2.20 61.08±3.74 54.28±4.62 54.31±6.03 55.90±2.97 52.13±2.61 52.44±1.91 52.23±2.20
Figure 12: Change in validation accuracy during the training process for different FL algorithms on the class partitioned CIFAR-10 dataset with five clients and three client heterogeneity settings.
Figure 13: Change in validation accuracy during the training process for different FL algorithms on the class partitioned CIFAR-10 dataset with ten clients and three client heterogeneity settings.
Figure 14: Change in validation accuracy during the training process for different FL algorithms on the class partitioned CIFAR-10 dataset with twenty clients and three client heterogeneity settings.
Figure 15: Change in validation accuracy during the training process for different FL algorithms on the dual Dirichlet partitioned CIFAR-10 dataset with five clients and three client heterogeneity settings.
Figure 16: Change in validation accuracy during the training process for different FL algorithms on the dual Dirichlet partitioned CIFAR-10 dataset with ten clients and three client heterogeneity settings.
Figure 17: Change in validation accuracy during the training process for different FL algorithms on the dual Dirichlet partitioned CIFAR-10 dataset with twenty clients and three client heterogeneity settings.

F.3 Additional Experiment Results on the FLamby Datasets

Table 18 shows the top validation accuracy and the corresponding standard deviation on the Fed-IXI and Fed-ISIC2019 datasets from FLamby for various federated learning algorithms in different client heterogeneity settings over ten independent runs with different random seeds. Figures 18 and 19 present the change in validation accuracy during the training process for the various FL algorithms on the Fed-IXI and Fed-ISIC2019 datasets, respectively.

Table 18: Average top accuracy and standard deviation on the FLamby datasets for various federated learning methods in different client heterogeneity settings.
Fed-IXI Fed-ISIC2019
Method homo normal exp homo normal exp
FedAvg 98.39±0.02 98.39±0.02 98.39±0.02 59.12±1.38 59.12±1.38 59.12±1.38
FedAvgM 98.39±0.00 98.39±0.00 98.39±0.00 50.25±2.57 50.25±2.57 50.25±2.57
FedAsync 98.05±0.25 98.24±0.18 98.32±0.20 59.43±1.18 59.66±3.04 56.34±4.34
FedBuff 98.36±0.03 98.35±0.03 98.35±0.08 57.65±2.07 57.46±2.60 55.13±4.74
FedAT 98.72±0.01 98.65±0.06 98.63±0.08 58.31±2.35 58.55±1.76 61.93±4.56
FedCompass 98.76±0.00 98.61±0.06 98.59±0.04 68.46±0.38 66.35±3.76 65.46±5.33
FedCompass+M 98.76±0.00 98.61±0.06 98.59±0.04 64.72±0.94 62.43±5.23 60.18±4.95
FedCompass+N 98.72±0.01 98.61±0.06 98.59±0.04 69.59±0.06 67.04±3.56 66.48±4.84
Figure 18: Change in validation accuracy during the training process for different federated learning algorithms on the FLamby Fed-IXI dataset with three different client heterogeneity settings.
Figure 19: Change in validation accuracy during the training process for different federated learning algorithms on the FLamby Fed-ISIC2019 dataset with three different client heterogeneity settings.

Appendix G Ablation Study

G.1 Ablation Study on Q_min/Q_max

In this section, we conduct an ablation study on the value of 1/Q = Q_min/Q_max to investigate its impact on the performance of the FedCompass algorithm. Table 19 lists the relative wall-clock time required for different values of 1/Q to reach 90% validation accuracy on the class partitioned and dual Dirichlet partitioned MNIST datasets and 50% validation accuracy on the dual Dirichlet partitioned CIFAR-10 dataset in different client heterogeneity settings, and Table 20 shows the corresponding validation accuracy and standard deviations. The results are averaged over ten runs with different random seeds. Table 19 also lists the results of FedBuff for reference. From Table 19, we can derive the following observations: (1) A smaller Q_min usually leads to faster convergence, which is consistent with the convergence analysis in Section 4, where a smaller 1/Q results in fewer arrival groups at equilibrium and may help reduce the impact of device heterogeneity. (2) Although FedCompass generally converges fastest with small values of 1/Q (0.05–0.2), it still achieves a reasonable convergence rate with large values of 1/Q (0.2–0.8) compared with FedBuff, which shows that FedCompass is not sensitive to the choice of Q. (3) In the extreme case of Q = 1, i.e., Q_min = Q_max, there is a significant performance drop for FedCompass: the Compass scheduler can no longer assign different numbers of local steps to different clients, so each client forms its own group, and FedCompass becomes equivalent to FedAsync.

Table 19: Relative wall-clock time to first reach the target validation accuracy on the MNIST and CIFAR-10 datasets for various values of 1/Q = Q_min/Q_max with different numbers of clients and client heterogeneity settings.
m=5 m=10 m=20
1/Q homo normal exp homo normal exp homo normal exp
MNIST-Class Partition (90%)
0.05 1.04× 1.16× 0.92× 0.93× 0.90× 0.95× 0.98× 0.93× 0.99×
0.10 1.00× 1.38× 0.93× 1.03× 0.92× 0.93× 1.01× 1.03× 1.02×
0.20 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00×
0.40 0.99× 1.67× 2.48× 0.98× 0.98× 1.41× 1.01× 1.19× 1.10×
0.60 0.97× 1.23× 2.64× 0.98× 0.97× 1.28× 1.02× 1.31× 1.11×
0.80 0.96× 1.33× 1.51× 0.98× 1.21× 1.23× 1.01× 1.61× 1.28×
1.00 2.15× 2.34× 2.35× - 2.17× 1.32× - 4.44× 1.32×
FedBuff 1.30× 1.57× 1.81× 1.58× 1.39× 1.43× 1.37× 1.18× 1.39×
MNIST-Dirichlet Partition (90%)
0.05 0.89× 1.01× 0.85× 0.96× 0.97× 0.82× 1.00× 1.02× 0.96×
0.10 0.91× 1.05× 0.92× 0.96× 0.95× 0.85× 0.97× 1.02× 0.96×
0.20 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00×
0.40 1.02× 1.03× 1.70× 1.04× 1.05× 1.28× 1.00× 1.14× 1.25×
0.60 1.04× 1.01× 1.83× 1.00× 1.01× 1.19× 0.92× 1.11× 1.16×
0.80 1.04× 1.13× 1.58× 0.91× 1.25× 1.22× 0.86× 1.22× 1.21×
1.00 1.78× 1.68× 2.01× - 2.28× 1.29× - - 1.76×
FedBuff 1.23× 1.52× 1.86× 1.81× 1.51× 1.49× 2.77× 1.22× 1.40×
CIFAR10-Dirichlet Partition (50%)
0.05 1.27× 0.97× 0.76× 1.05× 0.98× 1.19× 1.05× 0.97× 0.90×
0.10 1.15× 1.01× 0.73× 0.98× 1.11× 1.27× 1.01× 1.04× 1.01×
0.20 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00× 1.00×
0.40 1.28× 1.06× 1.36× 1.02× 1.06× 2.19× 0.96× 1.14× 1.08×
0.60 1.23× 0.99× 1.65× 0.97× 1.61× 2.52× 1.11× 1.06× 1.74×
0.80 1.23× 1.24× 1.64× 1.32× 1.91× 3.44× 1.33× 2.95× 1.64×
1.00 - 3.18× 1.70× - - 5.35× - - -
FedBuff 1.62× 1.99× 1.67× - - 2.42× - - -
Table 20: Average top validation accuracy and standard deviation for various values of 1/Q = Q_min/Q_max with different numbers of clients and client heterogeneity settings.
m=5 m=10 m=20
1/Q homo normal exp homo normal exp homo normal exp
MNIST Class Partition (90%)
0.05 95.20±2.53 95.21±2.90 97.22±1.67 94.29±3.80 94.88±3.73 97.49±1.55 98.56±0.43 98.80±0.17 98.75±0.20
0.10 95.03±3.13 95.35±2.15 97.07±1.67 94.72±3.56 94.65±3.74 97.71±1.25 98.57±0.38 98.80±0.11 98.82±0.18
0.20 95.06±2.89 95.55±2.89 96.34±1.89 94.33±3.80 95.47±3.05 97.51±1.66 98.49±0.45 98.67±0.36 98.73±0.47
0.40 94.44±3.26 95.34±3.63 92.73±3.09 94.42±3.94 95.01±3.52 96.91±2.13 98.58±0.36 98.74±0.26 98.75±0.32
0.60 95.29±2.75 95.25±3.36 94.90±0.97 94.17±3.88 95.38±2.55 97.73±0.86 98.48±0.56 98.80±0.20 98.80±0.19
0.80 95.61±2.34 95.35±2.30 95.04±3.13 94.77±3.59 96.08±2.60 97.35±1.49 98.60±0.45 98.72±0.25 98.74±0.25
1.00 90.60±6.96 92.53±5.57 92.44±3.07 86.73±6.62 92.27±3.82 96.25±2.46 84.26±7.96 90.98±2.48 98.01±0.89
MNIST Dirichlet Partition (90%)
0.05 94.10±4.12 94.30±4.18 95.81±3.12 95.64±3.15 95.75±2.98 96.17±3.01 96.22±2.25 97.40±1.82 97.04±1.97
0.10 94.14±4.08 94.14±4.44 96.11±3.04 95.49±3.32 95.65±3.23 96.78±2.56 96.39±1.92 97.59±1.28 97.68±1.28
0.20 94.15±4.26 94.36±4.28 95.46±3.59 95.53±3.21 95.41±3.52 96.34±2.64 96.41±1.96 97.62±1.57 97.86±0.71
0.40 94.12±4.11 94.27±4.11 95.40±2.46 95.46±3.47 95.67±3.14 95.86±3.35 96.07±2.45 98.03±0.93 97.85±1.00
0.60 94.09±4.10 94.65±4.11 94.29±3.43 95.43±3.36 95.89±3.26 95.27±4.71 96.37±2.35 98.04±0.68 97.66±1.14
0.80 94.23±3.89 94.71±3.56 95.09±2.62 95.32±3.55 95.46±3.20 95.26±4.36 96.72±2.20 97.51±0.84 97.66±0.62
1.00 92.11±2.99 92.96±3.70 93.29±3.08 86.31±4.51 90.72±4.30 94.06±5.91 78.49±6.28 85.82±5.31 95.90±1.58
CIFAR10 Dirichlet Partition (50%)
0.05 59.40±2.28 60.86±1.94 61.79±3.27 54.04±4.68 53.85±6.74 57.36±1.90 50.58±2.26 50.62±2.58 51.40±1.68
0.10 59.50±2.69 61.05±2.26 61.70±2.92 53.91±4.00 55.13±5.42 56.36±2.60 51.15±2.64 50.81±2.39 51.08±1.90
0.20 59.29±3.49 59.98±3.65 61.23±2.67 54.51±3.50 54.95±3.68 57.29±1.98 51.01±2.20 52.03±2.75 51.39±2.12
0.40 59.25±3.49 60.85±2.23 59.44±3.22 53.29±5.17 56.47±2.83 55.58±2.04 50.90±2.65 50.26±2.04 51.43±1.32
0.60 59.25±2.79 60.76±2.24 59.01±3.90 52.92±4.00 56.45±3.05 54.76±2.71 50.67±2.20 50.33±2.04 51.44±1.79
0.80 59.10±3.03 60.53±4.18 58.79±3.68 53.11±2.50 55.78±3.00 53.23±2.03 49.83±2.04 49.63±2.10 51.01±0.96
1.00 44.38±4.71 47.67±4.29 51.21±4.99 33.52±3.29 38.13±2.07 47.62±2.59 34.89±3.34 36.12±1.69 39.92±3.39

G.2 Client Speed Changes in FedCompass and Ablation Study on λ

In cross-silo federated learning scenarios, the computing power and speed of a client may change over time. For instance, a client running on a supercomputer may experience significant changes in computing power due to auto-scaling after the allocation or deallocation of other tasks within the same cluster. Similarly, in cloud computing environments, a client may be subject to auto-scaling policies that dynamically adjust the number of virtual machines or container instances. In this section, we provide analyses and empirical results to study how resilient FedCompass is to client speed changes.

Figure 20 depicts an execution of FedCompass involving a client speedup. During its second round of local training, client 3 increases its computing speed from 4 step/min to 5 step/min. Consequently, client 3 transmits its local update Δ3 at 10:36, earlier than the expected group arrival time T_a = 12:00. In this situation, the FedCompass server first updates the speed record for client 3 and then waits for the arrival of clients 1 and 2 in the same group. Upon the arrival of clients 1 and 2 at 12:00, the server assigns all three clients to arrival group 2 for the next round of local training, and the number of local steps for client 3 is assigned according to its new speed.

Figure 20: Overview of an execution of FedCompass on five clients with the minimum number of local steps Q_min = 20, the maximum number of local steps Q_max = 100, and a client speedup.

Figures 21 and 22 illustrate two executions of FedCompass with different degrees of client slowdown. In Figure 21, the speed of client 3 drops from 4 step/min to 3.33 step/min during its second round of training. Consequently, the client transmits its local update Δ3 back to the server at 13:24, exceeding the expected arrival time T_a = 12:00. Recall, however, that each arrival group has a latest arrival time T_max to accommodate potential client speed fluctuations: the interval between T_max and the group's creation time is λ times the interval between T_a and the group's creation time. In this example we use λ = 1.2, resulting in T_max = 14:00 for arrival group 1. Since client 3 arrives later than T_a but earlier than T_max, the FedCompass server waits until client 3 arrives at 13:24 and then assigns clients 1 to 3 to a new arrival group for the next training round. Figure 22 presents a case with a larger slowdown. Here, the speed of client 3 drops from 4 step/min to 2.5 step/min, so it transmits its local update Δ3 at 15:12, exceeding even T_max = 14:00. When the FedCompass server does not receive an update from client 3 by the latest arrival time 14:00, it responds to clients 1 and 2 and assigns them to arrival group 2 with the corresponding numbers of local steps. When client 3 arrives at 15:12, the server first attempts to assign it to group 2, but this fails because the corresponding number of local steps (Q = (22:00 − 15:12) × 2.5 step/min = 17) falls below Q_min = 20. As a result, the server creates a new group 3 for client 3. Subsequently, once all clients in group 2 arrive at 22:00, the server assigns them to group 3 as well.

Figure 21: Overview of an execution of FedCompass on five clients with the minimum number of local steps Q_min = 20 and the maximum number of local steps Q_max = 100, with a client slowdown.
Figure 22: Overview of an execution of FedCompass on five clients with the minimum number of local steps Q_min = 20 and the maximum number of local steps Q_max = 100, with a larger client slowdown.
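The step-count and latest-arrival-time computations in these examples can be summarized by the following simplified sketch (an illustration of the scheduling rule only, not the full Compass scheduler; the function and variable names are our own):

```python
def assign_local_steps(now, group_arrival_time, client_speed, q_min, q_max):
    """Number of local steps so that the client finishes near the group arrival time.

    Returns None if even the minimum workload cannot be finished in time, in which
    case the server creates a new arrival group for the client (as for client 3 above).
    """
    q = int((group_arrival_time - now) * client_speed)  # e.g., 6.8 min * 2.5 step/min = 17
    if q < q_min:
        return None
    return min(q, q_max)

def latest_arrival_time(group_create_time, group_arrival_time, lam=1.2):
    """T_max: the server waits at most lam times the planned interval past group creation."""
    return group_create_time + lam * (group_arrival_time - group_create_time)
```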

To empirically study how resilient FedCompass is to client speed changes compared with similar tiered FL methods such as FedAT (Chai et al., 2021), and how the value of λ affects this resilience, we set up a scenario in which each client has a 10% probability of changing its computing speed in each local training round by re-sampling its speed from the client speed distribution (normal or exponential). This setup mimics the occasional yet substantial speed changes that may arise in real-world cross-silo federated learning environments as a result of auto-scaling. Table 21 shows the convergence speed of FedCompass with different values of λ and of FedAT, and Table 22 shows the corresponding average top accuracy and standard deviations. From Table 21, we draw the following conclusions: (1) When λ = 1.00, the Compass scheduler is not resilient to even small client slowdowns, so FedCompass generally converges more slowly than with λ values slightly above 1.00. (2) λ values slightly above 1.00 (e.g., 1.20–1.50) empirically yield the fastest convergence, as the Compass scheduler absorbs minor client speed fluctuations without long waiting times. (3) Large λ values sometimes slow down convergence, as the server may wait for a long time when a client slows down significantly. (4) FedCompass converges faster than FedAT regardless of the value of λ, which further demonstrates its resilience to speed changes.
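For reference, the speed-change setup described above can be sketched as follows (a minimal illustration mirroring the earlier sampling sketch; the function name and the clipping of normal samples are our own assumptions):

```python
import numpy as np

def maybe_resample_speed(step_time, mean, heterogeneity, rng, p_change=0.1):
    """With probability p_change, re-sample a client's per-step time from its
    original distribution at the beginning of a local training round."""
    if rng.random() >= p_change:
        return step_time
    if heterogeneity == "normal":
        return float(np.clip(rng.normal(mean, 0.3 * mean), 1e-3, None))
    return float(rng.exponential(mean))  # exponential setting

rng = np.random.default_rng(0)
new_step_time = maybe_resample_speed(0.15, 0.15, "normal", rng)
```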

Finally, it is worth noting that under extreme circumstances, where the computing speeds of all clients fluctuate dramatically and constantly, the scheduling mechanism of FedCompass may be unable to maintain the desired client synchronization, because it relies on prior speed records that become obsolete under such variability. In these scenarios, training proceeds in a completely asynchronous manner, and FedCompass effectively reverts to behavior similar to that of FedAsync (Xie et al., 2019). However, unlike cross-device FL, where each client may experience frequent speed changes due to power constraints, other concurrently running applications, or network connectivity issues, a cross-silo FL client in practice usually has dedicated computing resources, such as an HPC system with a job scheduler or cloud computing facilities, where significant changes in computing speed are less frequent and generally arise from events such as auto-scaling.

Table 21: Relative wall-clock time to first reach the target validation accuracy on the MNIST datasets for various values of λ with different numbers of clients and client heterogeneity settings, where each client has a 10% probability of changing its speed in each local training round.
m=5 m=10 m=20
λ normal exp normal exp normal exp
MNIST-Class Partition (90%)
1.00 1.00× 1.00× 1.00× 1.00× 1.00× 1.00×
1.20 0.64× 0.83× 0.84× 0.88× 0.86× 0.90×
1.50 0.67× 1.00× 0.72× 1.01× 1.01× 1.00×
2.00 0.68× 1.11× 0.80× 1.11× 1.04× 1.37×
FedAT 3.13× 3.20× 4.04× 5.92× 1.66× 3.33×
MNIST-Dirichlet Partition (90%)
1.00 1.00× 1.00× 1.00× 1.00× 1.00× 1.00×
1.20 0.63× 1.05× 0.75× 0.62× 0.79× 0.98×
1.50 0.55× 1.20× 0.75× 0.62× 0.85× 1.16×
2.00 0.55× 1.29× 0.76× 0.83× 0.86× 1.28×
FedAT 1.03× 1.55× 1.41× 1.61× 1.71× 2.94×
Table 22: Average top validation accuracy and standard deviation on the MNIST datasets for various values of λ with different numbers of clients and client heterogeneity settings, where each client has a 10% probability of changing its speed in each local training round.
m=5 m=10 m=20
λ normal exp normal exp normal exp
MNIST-Class Partition (90%)
1.00 97.01±1.19 97.57±1.01 96.55±1.07 97.45±2.17 98.78±0.03 98.76±0.15
1.20 96.19±2.46 97.54±1.54 96.78±2.94 97.03±2.71 98.86±0.10 98.90±0.05
1.50 95.99±2.30 97.59±1.98 96.25±3.26 97.44±1.90 98.86±0.09 98.71±0.26
2.00 96.29±1.77 97.84±1.06 96.37±3.17 97.37±2.29 98.86±0.09 98.83±0.13
FedAT 93.56±6.40 94.97±3.86 93.93±3.51 94.20±3.36 97.98±1.42 97.98±1.32
MNIST-Dirichlet Partition (90%)
1.00 91.15±4.48 94.34±2.84 94.88±4.40 96.90±1.53 98.08±0.34 98.09±0.23
1.20 92.03±4.10 93.92±3.45 96.59±2.10 97.19±1.82 98.16±0.13 98.22±0.19
1.50 91.82±4.35 93.32±4.03 96.04±3.36 97.47±1.42 98.28±0.13 98.16±0.19
2.00 92.09±4.11 93.08±3.61 95.77±3.29 97.36±1.68 98.16±0.20 98.26±0.07
FedAT 91.43±3.31 92.36±2.50 95.49±2.94 96.36±2.36 97.65±0.44 97.68±0.35