FedCompass: Efficient Cross-Silo Federated Learning on Heterogeneous Client Devices Using a Computing Power-Aware Scheduler
Abstract
Cross-silo federated learning offers a promising solution for collaboratively training robust and generalized AI models without compromising the privacy of local datasets, e.g., in healthcare and finance, as well as in scientific projects that lack a centralized data facility. Nonetheless, because of the disparity of computing resources among different clients (i.e., device heterogeneity), synchronous federated learning algorithms suffer from degraded efficiency when waiting for straggler clients. Meanwhile, asynchronous federated learning algorithms experience degradation in the convergence rate and final model accuracy on non-identically and independently distributed (non-IID) heterogeneous datasets due to stale local models and client drift. To address these limitations in cross-silo federated learning with heterogeneous clients and data, we propose FedCompass, an innovative semi-asynchronous federated learning algorithm with a computing power-aware scheduler on the server side, which adaptively assigns varying amounts of training tasks to different clients using knowledge of the computing power of individual clients. FedCompass ensures that multiple locally trained models from clients are received almost simultaneously as a group for aggregation, effectively reducing the staleness of local models. At the same time, the overall training process remains asynchronous, eliminating prolonged waiting periods from straggler clients. Using diverse non-IID heterogeneous distributed datasets, we demonstrate that FedCompass achieves faster convergence and higher accuracy than other asynchronous algorithms while remaining more efficient than synchronous algorithms when performing federated learning on heterogeneous clients. The source code for FedCompass is available at https://github.com/APPFL/FedCompass.
1 Introduction
Federated learning (FL) is a model training approach where multiple clients train a global model under the orchestration of a central server (Konečnỳ et al., 2016; McMahan et al., 2017; Yang et al., 2019; Kairouz et al., 2021). In FL, $m$ distributed clients solve the optimization problem $\min_{w} f(w) = \sum_{i=1}^{m} p_i f_i(w)$, where $p_i$ is the importance weight of client $i$'s local data, usually defined as the relative data sample size, and $f_i(w)$ is the loss on client $i$'s local data. FL typically runs two steps iteratively: (i) the server distributes the global model to clients to train it using their local data, and (ii) the server collects the locally trained models and updates the global model by aggregating them. Federated Averaging (FedAvg) (McMahan et al., 2017) is the most popular FL algorithm, where each client trains a model using local data for $Q$ local steps in each training round, after which the server aggregates all local models by performing a weighted average and sends the updated global model back to all clients for the next round of training. By leveraging the training data from multiple clients without explicitly sharing them, FL empowers the training of more robust and generalized models while preserving the privacy of client data. The data in FL are generally heterogeneous, meaning that client data are non-identically and independently distributed (non-IID). FL is most commonly conducted in two settings: cross-device FL and cross-silo FL (Kairouz et al., 2021). In cross-device FL, numerous mobile or IoT devices participate, but not in every training round because of reliability issues (Mills et al., 2019). In cross-silo FL, a smaller number of reliable clients contribute to training in each round, and the clients are often large institutions in domains such as healthcare or finance, where preserving data privacy is essential (Wang et al., 2019; Wang, 2019; Kaissis et al., 2021; Pati et al., 2022; Hoang et al., 2023).
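The two-step FL loop and FedAvg's weighted averaging can be sketched concretely. The following toy example is our own illustration, not the paper's implementation: `local_sgd`, `fedavg_round`, the quadratic client losses, and all parameter values are stand-ins.

```python
import numpy as np

def local_sgd(w, grad_fn, data, lr=0.1, steps=5):
    """Run a fixed number of local SGD steps starting from the global model."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad_fn(w, data)
    return w

def fedavg_round(w_global, clients, grad_fn, lr=0.1, steps=5):
    """One synchronous FedAvg round: broadcast, local training, weighted average."""
    sizes = np.array([len(d) for d in clients], dtype=float)
    p = sizes / sizes.sum()  # importance weights p_i (relative data sample size)
    local_models = [local_sgd(w_global, grad_fn, d, lr, steps) for d in clients]
    return sum(pi * wi for pi, wi in zip(p, local_models))

# Toy non-IID setup: each client's loss is ||w - c||^2 over its own targets c,
# so the clients pull the model toward different optima.
def grad_fn(w, data):
    return np.mean([2.0 * (w - c) for c in data], axis=0)

clients = [[np.array([1.0])] * 4, [np.array([3.0])] * 4]  # two clients, optima at 1 and 3
w = np.zeros(1)
for _ in range(50):
    w = fedavg_round(w, clients, grad_fn)
# With equal data sizes, w approaches the average of the client optima (2.0).
```

The weighted average of locally trained models is exactly the aggregation step described above; the non-IID targets make each local model drift toward its own data before averaging pulls them back together.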
The performance of FL suffers from device heterogeneity, where the disparity of computing power among participating client devices leads to significant variations in local training times. Classical synchronous FL algorithms such as FedAvg, where the server waits for all client models before global aggregation, suffer from compute inefficiency as faster clients have to wait for slower clients (stragglers). Several asynchronous FL algorithms have been developed to immediately update the global model after receiving one client model to improve computing resource utilization. The principal shortcoming in asynchronous FL is that straggler clients may train on outdated global models and obtain stale local models, which can potentially undermine the performance of the global model. To deal with stale local models, various solutions have been proposed, including client selection, weighted aggregation, and tiered FL. Client selection strategies such as FedAR (Imteaj & Amini, 2020) select a subset of clients according to their robustness or computing power. However, these strategies may not be applicable for cross-silo FL settings, where all clients are expected to participate to maintain the robustness of the global model. Algorithms such as FedAsync (Xie et al., 2019) introduce a staleness factor to penalize the local models trained on outdated global models. While the staleness factor alleviates the negative impact of stale local models, it may cause the global model to drift away from the slower clients (client drift) even with minor disparities in client computing power, resulting in a further decrease in accuracy on non-IID heterogeneous distributed datasets. Tiered FL methods such as FedAT (Chai et al., 2021) group clients into tiers based on their computing power, allowing each tier to update at its own pace. Nonetheless, the waiting time inside each tier can be significant, and it is not resilient to sudden changes in computing power.
In this paper, we focus on improving the performance in cross-silo FL settings. Particularly, we propose a semi-asynchronous cross-silo FL algorithm, FedCompass, which employs a COMputing Power-Aware Scheduler (Compass) to reduce client local model staleness and enhance FL efficiency in cross-silo settings. Compass assesses the computing power of individual clients based on their previous response times and dynamically assigns varying numbers of local training steps to ensure that a group of clients can complete local training almost simultaneously. Compass is also designed to be resilient to sudden changes in client computing power. This enables the server to perform global aggregation using the local models from a group of clients without prolonged waiting times, thus improving computing resource utilization. Group aggregation leads to a lower global update frequency and reduced local model staleness, thus mitigating client drift and improving the performance on non-IID heterogeneous data. In addition, the design of FedCompass is orthogonal to and compatible with several server-side optimization strategies for further enhancement of performance on non-IID data. In summary, the main contributions of this paper are as follows:
• A novel semi-asynchronous cross-silo FL algorithm, FedCompass, which employs a resilient computing power-aware scheduler to make client updates arrive in groups almost simultaneously for aggregation, thus reducing client model staleness for enhanced performance on non-IID heterogeneous distributed datasets while ensuring computing efficiency.
• A suite of convergence analyses which demonstrate that FedCompass can converge to a first-order critical point of the global loss function with heterogeneous clients and data, and offer insights into how the number of local training steps impacts the convergence rate.
• A suite of experiments with heterogeneous clients and datasets which demonstrate that FedCompass converges faster than synchronous and asynchronous FL algorithms, and achieves higher accuracy than other asynchronous algorithms by mitigating client drift.
2 Related Work
Device Heterogeneity in Federated Learning. Device heterogeneity occurs when clients have varying computing power, leading to varied local training times (Li et al., 2020; Xu et al., 2021). For synchronous FL algorithms that wait for all local models before global aggregation, device heterogeneity leads to degraded efficiency due to the waiting time for slower clients, often referred to as stragglers. Some methods address device heterogeneity by waiting only for some clients and ignoring the stragglers (Bonawitz et al., 2019). However, this approach disregards the contributions of straggler clients, resulting in wasted client computing resources and a global model significantly biased towards faster clients. Other solutions explicitly select the clients that perform updates. Chen et al. (2021) employs a greedy selection method with a heuristic based on client computing resources. Hao et al. (2020) defines a priority function according to computing power and model accuracy to select participating clients. FedAR (Imteaj & Amini, 2020) assigns each client a trust score based on previous activities. FLNAP (Reisizadeh et al., 2022) starts from faster nodes and gradually involves slower ones. FedCS (Nishio & Yonetani, 2019) queries client resource information for client selection. SAFA (Wu et al., 2020) chooses the clients with lower crash probabilities. However, these methods may not be suitable for cross-silo FL settings, where the participation of every client is crucial for maintaining the robustness of the global model and there are stronger assumptions with regard to the trustworthiness and reliability of clients.
Asynchronous Federated Learning. Asynchronous FL algorithms provide another avenue to mitigate the impact of stragglers. In asynchronous FL, the server typically updates the global model upon receiving one or a few client models (Xie et al., 2019; Chen et al., 2020) or stores the local models in a size-$K$ buffer (Nguyen et al., 2022; So et al., 2021), and then immediately sends the global model back for further local training. Asynchronous FL offers significant advantages in heterogeneous client settings by reducing the idle time for each client. However, one of the key challenges in asynchronous FL is the staleness of local models. In asynchronous FL, global model updates can occur without waiting for all local models. This flexibility leads to stale local models trained on outdated global models, which may negatively impact the accuracy of the global model. Various solutions aimed at mitigating these adverse effects have been proposed. One common strategy is weighted aggregation, which is prevalent in numerous asynchronous FL algorithms (Xie et al., 2019; Chen et al., 2019; 2020; Wang et al., 2021; Nguyen et al., 2022). This approach applies a staleness weight factor to penalize outdated local models during the aggregation. While effective in reducing the negative impact of stale models, this often causes the global model to drift away from the local data on slower clients, and the staleness factor could significantly degrade the final model performance in settings with minor computing disparities among clients. Tiered FL methods offer another option for improvement. FedAT (Chai et al., 2021) clusters clients into several faster and slower tiers, allowing each tier to update the global model at its own pace. However, the waiting time within tiers can be substantial, and it becomes less effective when the client pool is small and client heterogeneity is large. The tiering may also not be updated in a timely manner, reducing its resilience to speed changes.
Similarly, CSAFL (Zhang et al., 2021) groups clients based on gradient direction and computation time. While this offers some benefits, the fixed grouping can become a liability if client computing speeds vary significantly over the course of training.
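The staleness-weighted aggregation discussed above can be made concrete with a short sketch. This is our illustration of the general idea, assuming the polynomial staleness function $s(\tau) = (\tau + 1)^{-a}$ used in FedAsync-style algorithms; the function names and parameter values are ours.

```python
def staleness_weight(alpha, global_step, client_model_step, a=0.5):
    """Polynomial staleness factor: the mixing weight decays as (tau + 1)^(-a),
    where tau counts how many global updates the client's base model missed."""
    tau = global_step - client_model_step
    return alpha * (tau + 1) ** (-a)

def async_update(w_global, w_local, alpha_t):
    """Asynchronous server update: mix the (possibly stale) local model in."""
    return [(1 - alpha_t) * g + alpha_t * l for g, l in zip(w_global, w_local)]

# A fresh model (tau = 0) keeps the full mixing weight; a stale one is damped.
# Heavy damping of slow clients' updates is what can drift the global model
# away from their local data.
fresh = staleness_weight(0.6, global_step=10, client_model_step=10)
stale = staleness_weight(0.6, global_step=10, client_model_step=1)
w_new = async_update([0.0, 0.0], [1.0, 1.0], fresh)
```

Note how a client whose base model is nine global updates old receives a much smaller weight than a fresh one, illustrating both the benefit (stale updates do less damage) and the drawback (slow clients are systematically down-weighted).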
3 Proposed Method: FedCompass
The design choices of FedCompass are motivated by the shortcomings of other asynchronous FL algorithms caused by the potential staleness of client local models. In FedAsync, as the global model gets updated frequently, the client model is significantly penalized by a staleness weight factor, leading to theoretically and empirically sub-optimal convergence characteristics, especially when the disparity in computing power among clients is small. This can cause the global model to gradually drift away from the data of slow clients and have decreased performance on non-IID datasets. FedBuff attempts to address this problem by introducing a size-$K$ buffer for client updates to reduce the global update frequency. However, the optimal buffer size varies for different client numbers and heterogeneity settings and needs to be selected heuristically. It is also possible that one buffer contains several updates from the same client, or one client may train on the same global model several times. Motivated by these shortcomings, we aim to group clients for global aggregation to reduce model staleness and avoid repeated aggregation or local training. As cross-silo FL involves a small number of reliable clients, this enables us to design a scheduler that can profile the computing power of each client. The scheduler then dynamically and automatically creates groups accordingly by assigning clients different numbers of local training steps, allowing clients with similar computing power to finish local training in close proximity to each other. The scheduler tries to strike a balance between the time efficiency of asynchronous algorithms and the objective consistency of synchronous ones.
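For concreteness, the size-$K$ buffering idea that FedBuff uses to reduce the global update frequency can be sketched as follows. This is an illustrative simplification (our `BufferedServer` class omits staleness weighting and server learning-rate details, and the update rule is a stand-in).

```python
class BufferedServer:
    """Sketch of size-K buffered asynchronous aggregation (FedBuff-style)."""

    def __init__(self, w_global, K):
        self.w = w_global      # global model, here a plain list of floats
        self.K = K             # buffer size: aggregate once K updates arrive
        self.buffer = []

    def receive(self, delta, lr=1.0):
        """Store a client update; trigger a global update when the buffer fills.
        Returns True iff a global update happened on this call."""
        self.buffer.append(delta)
        if len(self.buffer) >= self.K:
            avg = [sum(d[i] for d in self.buffer) / self.K
                   for i in range(len(self.w))]
            self.w = [w + lr * a for w, a in zip(self.w, avg)]
            self.buffer.clear()
            return True
        return False

server = BufferedServer([0.0], K=3)
# Only the third arrival triggers a global update; note the buffer does not
# track which client an update came from, so duplicates are possible.
updates = [server.receive([1.0]), server.receive([2.0]), server.receive([3.0])]
```

The last comment highlights the shortcoming the paragraph above describes: nothing prevents one buffer from holding several updates from the same fast client.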
Additionally, we keep the scheduler operations separated from the global aggregation, thus making it possible to combine the benefits of aggregation strategies in other FL algorithms, such as server-side optimizations for improving performance when training data are non-IID (Hsu et al., 2019; Reddi et al., 2020) and normalized averaging to eliminate objective inconsistency when clients perform varying numbers of steps (Wang et al., 2020; Horváth et al., 2022). Further, informed by the drawbacks of tiered FL methods such as FedAT, the scheduler is designed to cover multiple corner cases to ensure its resilience to sudden speed changes, especially those occasional yet substantial changes that may arise in real-world cross-silo FL as a result of HPC/cloud auto-scaling.
3.1 Compass: Computing Power-Aware Scheduler
To address the aforementioned desiderata, we introduce Compass, a COMputing Power-Aware Scheduler that assesses the computing power of individual clients according to their previous response times. Compass then dynamically assigns varying numbers of local steps within a predefined range to different clients in each training round. This approach ensures that multiple client models are received nearly simultaneously, allowing for a grouped server aggregation. By incorporating several local models in a single global update and coordinating the timing of client responses, we effectively reduce the frequency of global model updates. This, in turn, reduces client model staleness while keeping the client idle time to a minimum during the training process.

Figure 1 illustrates an example run of federated learning using Compass on five clients with different computing power. For the sake of brevity, this illustration focuses on a simple scenario, and more details are described in Section 3.2. All times in the figure are absolute wall-clock times from the start of FL. First, all clients are initialized with the same number of local steps. When client 1 finishes local training and sends the update to the server at time 02:00, Compass records its speed and creates the first arrival group with the expected arrival time 12:00, which is computed based on the client speed and the assigned number of local steps. We assume negligible communication and aggregation times in this example, so client 1 receives an updated global model immediately at 02:00 and begins the next round of training. Upon receiving the update from client 2, Compass slots it into the first arrival group with a correspondingly computed number of local steps, aiming for a completion time of 12:00 as well. Client 3 follows suit with its own computed number of steps. When Compass computes the number of local steps for client 4 in order to finish by 12:00, it finds that this number falls below the minimum $Q_{\min}$. Therefore, Compass creates a new arrival group for client 4 with expected arrival time 22:00. Compass optimizes the number of steps assigned to client 4 such that the clients in an existing arrival group may join this newly created group in the future for group merging. After client 5 finishes, it is assigned to the group created for client 4. When clients 1–3 finish their second round of local training at 12:00, Compass assigns them to the arrival group of clients 4 and 5 (i.e., arrival group 2) with correspondingly computed numbers of local steps so that all the clients are in the same arrival group and expected to arrive nearly simultaneously for global aggregation in the future. The training then achieves equilibrium with a fixed number of existing groups and group assignments, and future training rounds will proceed with fixed numbers of local steps for each client in the ideal case.
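The core computation in this example can be sketched as follows: given a client's measured time per step and a group's expected arrival time, pick the number of local steps that lands the client on the deadline, clamped to $[Q_{\min}, Q_{\max}]$. This is our simplified illustration (the default step bounds are placeholders, and the real scheduler additionally handles group creation, merging, and the latest arrival time; see Appendix A).

```python
def assign_steps(now, group_arrival_time, t_client, q_min=20, q_max=100):
    """Pick local steps so the client finishes near the group's expected
    arrival time. Returns (steps, fits): fits=False signals that even the
    minimum number of steps overshoots the deadline, so the scheduler should
    create a new arrival group instead. q_min/q_max defaults are illustrative."""
    q = int((group_arrival_time - now) / t_client)  # steps that fit in the remaining time
    if q < q_min:
        return q_min, False
    return min(q, q_max), True

# A client finishing at t=2.0 that takes 0.5 time units per step fits exactly
# 20 steps before the group's 12.0 deadline (mirroring the 02:00 -> 12:00 example).
q1, ok1 = assign_steps(now=2.0, group_arrival_time=12.0, t_client=0.5)
# A much slower client (1.5 per step) cannot fit q_min steps in time, so a
# new arrival group with a later deadline would be created for it.
q2, ok2 = assign_steps(now=2.0, group_arrival_time=12.0, t_client=1.5)
```

Clamping to $[Q_{\min}, Q_{\max}]$ is what forces the creation of a new group for client 4 in the figure: its computed step count falls below the minimum, so no valid assignment within the first group's deadline exists.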
The detailed implementation and explanation of Compass can be found in Appendix A. We also provide analyses and empirical results for non-ideal scenarios with client computing power changes in Appendix G.2, where we show how the design of Compass makes it resilient when clients experience speedup or slowdown during the course of training, and compare it with the tiered FL method FedAT (Chai et al., 2021).
3.2 FedCompass: Federated Learning with Compass
FedCompass is a semi-asynchronous FL algorithm that leverages the Compass scheduler to coordinate the training process. It is semi-asynchronous in that the server may need to wait for a group of clients for global aggregation, but the Compass scheduler minimizes the waiting time by arranging for clients within the same group to finish local training nearly simultaneously, so the overall training process still remains asynchronous without long client idle times. The function FedCompass-Client in Algorithm 1 outlines the client-side algorithm for FedCompass. The client performs its assigned number of local steps using mini-batch stochastic gradient descent and sends the difference between the initial and final models as the client update to the server.
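A minimal sketch of this client-side procedure follows. This is our illustration, not Algorithm 1 itself: the paper only states that the difference between the initial and final models is sent, so the sign convention $\Delta = w_{\text{global}} - w_{\text{local}}$ below is an assumption.

```python
import numpy as np

def fedcompass_client(w_global, grad_fn, data, q, lr=0.1):
    """Client side: run q local mini-batch SGD steps from the received global
    model and return the model difference as the update sent to the server."""
    w = np.array(w_global, dtype=float)
    for _ in range(q):
        w -= lr * grad_fn(w, data)  # one mini-batch SGD step
    return np.array(w_global, dtype=float) - w  # delta = w_global - w_local

# With a constant unit gradient, q steps at learning rate lr move the model
# by q * lr in total, so the returned difference is exactly that displacement.
delta = fedcompass_client([0.0, 0.0], lambda w, _: np.ones_like(w), None, q=10)
```

Because the number of steps `q` is supplied by the server's scheduler rather than fixed globally, the same client code serves clients of very different speeds.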
The function FedCompass-Server in Algorithm 1 presents the server algorithm for FedCompass. Initially, in the absence of prior knowledge of clients' computing speeds, the server assigns the minimum number of local steps $Q_{\min}$ to all clients for a warm-up. Subsequently, client speed information is gathered and updated each time a client responds with a local update. For each client's first response (lines 25 to 28), the server immediately updates the global model using the client update and assigns the client to an arrival group. All clients within the same arrival group are expected to arrive almost simultaneously at a specified time for the next global group aggregation. This is achieved by assigning variable local steps based on individual computing speeds. The AssignGroup function first attempts to assign the client to an existing group with a number of local steps within the range $[Q_{\min}, Q_{\max}]$. If none of the existing groups have local step values that can accommodate the client, a new group is created for the client with a local step value that facilitates future group merging opportunities. Further details for this function are elaborated in Appendix A.
After the initial arrivals, each client is expected to adhere to its group's expected arrival time. Nonetheless, to accommodate client speed fluctuations and potential estimation inaccuracies, a latest arrival time is designated for each group. If a client arrives prior to the latest arrival time (lines 20 to 24), the server stores the client update in a group-specific buffer and checks whether this client is the last within its group to arrive. If so, the server invokes GroupAggregate to update the global model with contributions from all clients in the group and assigns new training tasks to them with the latest global model. The details of GroupAggregate are shown in Algorithm 3 in the Appendix. If there remain pending clients within the group, the server waits until the last client arrives or until the latest arrival time before performing the group aggregation. Clients that arrive after the latest arrival time and miss the group aggregation (lines 15 to 19), due to an unforeseen delay or slowdown, have their updates stored in a general buffer, and the server immediately assigns each such client to an arrival group using the AssignGroup function for the next round of local training. The updates in the general buffer are incorporated into the group aggregation of the following arrival group.
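The grouped aggregation step can be sketched as follows, assuming client updates are model differences of the form $w_{\text{old}} - w_{\text{local}}$ and uniform aggregation weights. This is our simplification: the actual GroupAggregate (Algorithm 3 in the paper's appendix) may include importance and staleness weighting, which we omit.

```python
def group_aggregate(w_global, group_buffer, general_buffer, lr=1.0):
    """One grouped global update: average all buffered client updates (the
    group's own plus any late updates parked in the general buffer) and step
    the global model. Buffers are cleared afterwards, as in the server loop."""
    updates = group_buffer + general_buffer
    n = len(updates)
    avg = [sum(u[i] for u in updates) / n for i in range(len(w_global))]
    # Deltas follow the convention delta = w_old - w_local, so we subtract.
    new_global = [g - lr * a for g, a in zip(w_global, avg)]
    group_buffer.clear()
    general_buffer.clear()
    return new_global

gb = [[0.2], [0.4]]   # updates from the arrival group's own buffer
late = [[0.6]]        # a late update that missed its group's aggregation
w_new = group_aggregate([1.0], gb, late)
```

Folding the general buffer into the next group's aggregation is what keeps late updates from being discarded while still charging only one global update for the whole group.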
4 Convergence Analysis
We provide the convergence analysis of FedCompass for smooth and non-convex loss functions. We denote by $\nabla f(w)$ the gradient of the global loss, $\nabla f_i(w)$ the gradient of client $i$'s local loss, and $\nabla f_i(w; \xi)$ the gradient of the loss on mini-batch $\xi$. With that, we make the following assumptions.
Assumption 1.
Lipschitz smoothness. The loss function of each client $i$ is Lipschitz smooth: $\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|,\ \forall w, w'$.
Assumption 2.
Unbiased client stochastic gradient. $\mathbb{E}_{\xi}\big[\nabla f_i(w; \xi)\big] = \nabla f_i(w)$.
Assumption 3.
Bounded gradient, and bounded local and global gradient variance. $\|\nabla f_i(w; \xi)\|^2 \le M$, $\mathbb{E}_{\xi}\|\nabla f_i(w; \xi) - \nabla f_i(w)\|^2 \le \sigma_l^2$, and $\frac{1}{m}\sum_{i=1}^{m}\|\nabla f_i(w) - \nabla f(w)\|^2 \le \sigma_g^2$.
Assumption 4.
Bounded client heterogeneity and staleness. Suppose that client $i_1$ is the fastest client and client $i_2$ is the slowest, and $t_{i_1}$ and $t_{i_2}$ are the times per one local step of these two clients, respectively. We then assume that $t_{i_2}/t_{i_1} \le \lambda$. This assumption also implies that the staleness $\tau$ of each client model is bounded by a number $\tau_{\max}$.
Assumptions 1–3 are common assumptions in the convergence analysis of federated or distributed learning algorithms (Stich, 2018; Li et al., 2019; Yu et al., 2019; Karimireddy et al., 2020; Nguyen et al., 2022). Assumption 4 indicates that the speed difference among clients is finite and is equivalent to the assumption that all clients are reliable, which is reasonable in cross-silo FL.
Theorem 1.
Suppose that the client learning rate $\eta$ and the local step range $[Q_{\min}, Q_{\max}]$ satisfy the conditions specified in Appendix C. Then, after $K$ updates of the global model $w$, FedCompass achieves the following convergence rate:
(1) [the full expression is given in Appendix C]
where $m$ is the number of clients, $w^{(k)}$ is the global model after $k$ global updates, and the remaining quantities are constants defined in Appendix C.
Corollary 1.
When the client learning rate $\eta$ and the ratio $Q_{\min}/Q_{\max}$ are chosen as specified in Appendix C, then
(2) [the full expression is given in Appendix C]
The proof is provided in Appendix C. Based on Theorem 1 and Corollary 1, we remark the following algorithmic characteristics.
Worst-Case Convergence. Corollary 1 provides an upper bound for the expected squared norm of the gradients of the global loss, which is monotonically decreasing with respect to the total number of global updates $K$. This leads to the conclusion that FedCompass converges to a first-order stationary point in expectation, characterized by a zero gradient, of the smooth and non-convex global loss function $f$. Specifically, the first term of equation 2 represents the influence of model initialization, and the second term accounts for stochastic noise from local updates. The third term reflects the impact of device heterogeneity and data heterogeneity on the convergence process, which theoretically provides a convergence guarantee for FedCompass on non-IID heterogeneous distributed datasets and heterogeneous client devices.
Effect of Number of Local Steps. Theorem 1 establishes a requirement coupling $Q_{\max}$ and the client learning rate $\eta$, signifying a trade-off between the maximum number of client local steps and the learning rate: a higher $Q_{\max}$ necessitates a smaller $\eta$ to ensure the convergence of FL training. The ratio $Q_{\min}/Q_{\max}$ also presents a trade-off. Appendix B shows that this ratio influences the grouping of FedCompass: the number of existing groups at the equilibrium state is bounded, and a smaller ratio results in fewer groups and a smaller staleness bound, which lowers parts of the gradient bound. However, since related terms appear in the denominator in equation 1, a smaller ratio could also potentially increase the gradient bound. Corollary 1 represents the extreme case where the choice of $Q_{\min}/Q_{\max}$ makes the number of groups reach the minimum value of 1. We give empirical ablation study results on the effect of the number of local steps in Section 5.2 and Appendix G.1. The ablation study results show that the performance of FedCompass is not sensitive to the choice of the ratio $Q_{\min}/Q_{\max}$, and a wide range of values still leads to better performance than other asynchronous FL algorithms such as FedBuff.
5 Experiments
5.1 Experiment Setup
Datasets. To account for the inherent statistical heterogeneity of federated datasets, we use two types of datasets: (i) artificially partitioned MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky et al., 2009) datasets, and (ii) two naturally split datasets from the cross-silo FL benchmark FLamby (Ogier du Terrail et al., 2022), Fed-IXI and Fed-ISIC2019. Artificially partitioned datasets are created by dividing classic datasets into client splits. We introduce the class partition and dual Dirichlet partition strategies: the class partition makes each client hold data of only a few classes, and the dual Dirichlet partition employs Dirichlet distributions to model the class distribution within each client and the variation in sample sizes across clients. However, these methods might not accurately capture the intricate data heterogeneity in real federated datasets. Naturally split datasets, on the other hand, are federated datasets obtained directly from multiple real-world clients, which retain the authentic characteristics and complexities of the underlying data distribution. The Fed-IXI dataset takes MRI images as inputs to generate 3D brain masks, and the Fed-ISIC2019 dataset takes dermoscopy images as inputs to predict the corresponding melanoma classes. The details of the partition strategies and the FLamby datasets are elaborated in Appendix D.
Client Heterogeneity Simulation. To simulate client heterogeneity, let $t_i$ be the average time for client $i$ to finish one local step, and assume that $t_i$ follows a random distribution. We use three different distributions in the experiments: a normal distribution with zero variance (i.e., homogeneous clients), a normal distribution with nonzero variance, and an exponential distribution. The client heterogeneity increases accordingly across these three settings. The mean values of the distributions for different experiment datasets are given in Appendix E.1. Additionally, to account for the variance in training time within each client across different training rounds, we apply another normal distribution with smaller variance to simulate the local training time per step for client $i$ in training round $t$, capturing the variability experienced by the client during different training rounds. The combination of distributions adequately captures client heterogeneity in terms of both the average training time among clients and the variability in training time within each client across rounds.
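This simulation setup can be sketched as follows. The specific variance parameters (0.3 of the mean for the heterogeneous normal setting, 0.05 for the per-round jitter) are illustrative placeholders of our own, not the paper's values (those are listed in its Appendix E.1).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_client_speeds(m, mean, setting):
    """Sample the average per-step time t_i for m clients under the three
    heterogeneity settings; variance choices here are illustrative only."""
    if setting == "homo":
        return np.full(m, mean)  # zero-variance normal: identical clients
    if setting == "normal":
        return np.maximum(rng.normal(mean, 0.3 * mean, m), 1e-3)
    if setting == "exp":
        return rng.exponential(mean, m)  # heavy-tailed: strong heterogeneity
    raise ValueError(f"unknown setting: {setting}")

def per_round_time(t_i, steps):
    """Per-round jitter: draw this round's per-step time from a tighter
    normal around the client's average speed, then scale by the step count."""
    t = max(rng.normal(t_i, 0.05 * t_i), 1e-3)
    return steps * t

speeds = sample_client_speeds(5, mean=0.1, setting="exp")
round_times = [per_round_time(t, steps=50) for t in speeds]
```

The two-level sampling mirrors the description above: one distribution fixes each client's average speed, and a second, tighter distribution perturbs it from round to round.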
Training Settings. We employ a convolutional neural network (CNN) for the partitioned MNIST dataset and a ResNet-18 (He et al., 2016) for the partitioned CIFAR-10 dataset. For both datasets, we conduct experiments with varying numbers of clients and the three client heterogeneity settings above, using the class partition and dual Dirichlet partition strategies. For the Fed-IXI and Fed-ISIC2019 datasets from FLamby, we use the default experiment settings provided by FLamby. We list the detailed model architectures and experiment hyperparameters in Appendix E.
5.2 Experiment Results
Table 1: Relative wall-clock time for each FL algorithm to first reach the target validation accuracy on the partitioned MNIST and CIFAR-10 datasets (FedCompass = 1.00). The three column groups correspond to the three client-number settings; "-" means the target accuracy was not reached in at least half of the runs.

| Dataset | Method | homo | normal | exp | homo | normal | exp | homo | normal | exp |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST Class Partition (90%) | FedAvg | 1.35 | 2.82 | 4.32 | 1.25 | 2.10 | 5.01 | 1.44 | 4.20 | 9.03 |
| | FedAvgM | 1.27 | 2.75 | 4.23 | 1.50 | 2.51 | 6.01 | 2.20 | 5.36 | 13.73 |
| | FedAsync | 2.15 | 2.34 | 2.35 | - | 2.17 | 1.32 | - | 4.44 | 1.32 |
| | FedBuff | 1.30 | 1.57 | 1.81 | 1.58 | 1.39 | 1.43 | 1.37 | 1.18 | 1.39 |
| | FedAT | 1.36 | 2.16 | 3.07 | 1.13 | 1.77 | 2.46 | 0.97 | 1.58 | 4.32 |
| | FedCompass | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | FedCompass+M | 1.10 | 1.19 | 0.81 | 1.18 | 0.99 | 0.81 | 1.10 | 1.09 | 1.08 |
| | FedCompass+N | 0.89 | 1.24 | 0.94 | 1.44 | 1.28 | 0.92 | 0.94 | 1.01 | 1.09 |
| MNIST Dirichlet Partition (90%) | FedAvg | 0.91 | 1.85 | 4.90 | 1.18 | 2.44 | 5.68 | 1.14 | 2.48 | 5.94 |
| | FedAvgM | 0.91 | 1.85 | 4.89 | 1.22 | 2.45 | 5.83 | 1.21 | 2.70 | 6.12 |
| | FedAsync | 1.78 | 1.68 | 2.01 | - | 2.28 | 1.29 | - | - | 1.76 |
| | FedBuff | 1.23 | 1.52 | 1.86 | 1.81 | 1.51 | 1.49 | 2.77 | 1.22 | 1.40 |
| | FedAT | 1.29 | 1.97 | 2.08 | 1.08 | 1.70 | 2.44 | 1.02 | 1.83 | 2.83 |
| | FedCompass | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | FedCompass+M | 0.72 | 0.92 | 0.95 | 0.93 | 0.91 | 0.88 | 0.87 | 1.31 | 0.81 |
| | FedCompass+N | 0.92 | 1.55 | 1.26 | 0.99 | 1.12 | 0.95 | 0.93 | 0.90 | 1.08 |
| CIFAR10 Class Partition (60%) | FedAvg | 1.84 | 2.64 | 9.40 | 3.18 | 4.18 | 11.51 | 3.20 | 3.24 | 7.13 |
| | FedAvgM | 2.57 | 3.73 | 13.12 | 2.19 | 2.88 | 9.04 | 3.13 | 3.19 | 6.91 |
| | FedAsync | - | 3.18 | 3.07 | - | - | 5.18 | - | - | - |
| | FedBuff | 1.68 | 2.18 | 2.47 | - | - | 2.62 | - | - | 4.62 |
| | FedAT | 1.05 | 1.42 | 2.33 | 1.21 | 1.83 | 3.21 | 1.20 | 1.23 | 2.21 |
| | FedCompass | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | FedCompass+M | 0.82 | 0.82 | 0.95 | 0.89 | 0.78 | 0.96 | 0.78 | 0.81 | 0.83 |
| | FedCompass+N | 0.98 | 1.01 | 1.18 | 0.99 | 1.22 | 1.49 | 1.32 | 1.19 | 1.16 |
| CIFAR10 Dirichlet Partition (50%) | FedAvg | 2.19 | 2.81 | 10.86 | 4.55 | 5.05 | 11.47 | 5.88 | 5.83 | 9.05 |
| | FedAvgM | 2.17 | 2.97 | 10.77 | 3.77 | 4.21 | 11.06 | 5.19 | 5.44 | 8.63 |
| | FedAsync | - | 3.18 | 1.70 | - | - | 5.35 | - | - | - |
| | FedBuff | 1.62 | 1.99 | 1.67 | - | - | 2.42 | - | - | - |
| | FedAT | 1.04 | 1.69 | 2.90 | 1.36 | 1.66 | 2.30 | 1.45 | 1.58 | 2.21 |
| | FedCompass | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| | FedCompass+M | 0.82 | 0.80 | 1.03 | 1.07 | 0.85 | 0.94 | 0.86 | 0.69 | 0.98 |
| | FedCompass+N | 1.05 | 0.98 | 1.65 | 0.89 | 1.29 | 1.45 | 0.77 | 0.87 | 1.23 |
[Table 2: Relative wall-clock time to reach 0.98 DICE accuracy on Fed-IXI and 50% balanced accuracy on Fed-ISIC2019 under the homo, normal, and exp client heterogeneity settings, for FedAvg, FedAvgM, FedAsync, FedBuff, FedAT, FedCompass, FedCompass+M, and FedCompass+N. The numeric entries were lost in extraction and are not reproduced here.]
[Figure 2: Change in validation accuracy during training for different FL algorithms on the dual Dirichlet partitioned MNIST dataset (subplots a–c) and the class partitioned CIFAR-10 dataset (subplots d–f) with five clients under the three client heterogeneity settings.]
Performance Comparison. Table 1 shows the relative wall-clock time for various FL algorithms to first reach the target validation accuracy on the partitioned MNIST and CIFAR-10 datasets with different numbers of clients and client heterogeneity settings. Table 2 reports the relative wall-clock time to reach 0.98 DICE accuracy (Dice, 1945) on the Fed-IXI dataset and 50% balanced accuracy on the Fed-ISIC2019 dataset in three heterogeneity settings. The numbers are the average results of ten runs with different random seeds, and the wall-clock time for FedCompass is used as a baseline. A dash “-” signifies that a certain algorithm cannot reach the target accuracy in at least half of the experiment runs. The dataset target accuracy is carefully selected based on each algorithm’s top achievable accuracy during training (given in Appendix F). Among the benchmarked algorithms, FedAvg (McMahan et al., 2017) is the most commonly used synchronous FL algorithm, and FedAvgM (Hsu et al., 2019) is a variant of FedAvg using server-side momentum to address non-IID data. FedAsync (Xie et al., 2019) and FedBuff (Nguyen et al., 2022) are two popular asynchronous FL algorithms, and FedAT (Chai et al., 2021) is a tiered FL algorithm. Since the Compass scheduler is separated from global aggregation, making FedCompass compatible with server-side optimization strategies, we also benchmark the performance of using server-side momentum and normalized averaging (Wang et al., 2020) in FedCompass (FedCompass+M and FedCompass+N). From the table we obtain the following key observations: (i) The speedup provided by asynchronous FL algorithms increases as client heterogeneity increases. (ii) FedAT converges slower as client heterogeneity increases due to non-negligible inner-tier waiting time, while FedCompass minimizes inner-group waiting time by dynamic local step assignment and the usage of the latest arrival time. 
(iii) Just as server-side optimizations can sometimes speed up convergence in synchronous FL, applying server momentum or normalized averaging to FedCompass in group aggregation can sometimes speed up convergence as well. This indicates that the design of FedCompass allows it to easily take advantage of other FL strategies. (iv) Because of client drift and model staleness, FedAsync and FedBuff occasionally fail to achieve the target accuracy, particularly in settings with minimal client heterogeneity, where the staleness factor is applied unnecessarily. FedCompass, by contrast, converges quickly in all cases, as Compass dynamically and automatically groups clients for global aggregation, reducing client model staleness and mitigating client drift. This shows that FedCompass outperforms other asynchronous algorithms on heterogeneous non-IID data without any prior knowledge of the clients. Additionally, Figure 2 illustrates the change in validation accuracy during training for different FL algorithms on the dual Dirichlet partitioned MNIST dataset (subplots a to c) and the class partitioned CIFAR-10 dataset (subplots d to f) with five clients and three client heterogeneity settings. In each experiment setting, all synchronous algorithms are run for the same total wall-clock time, as are all asynchronous algorithms. These plots further illustrate how FedCompass converges faster than other algorithms and also achieves higher accuracy than other asynchronous algorithms. For example, subplots d and e explicitly demonstrate that FedAsync and FedBuff converge to much lower accuracy in low-heterogeneity settings, mainly because of client drift caused by unnecessary staleness factors. More plots of the training process and the maximum achievable accuracy for each FL algorithm are given in Appendix F.
Effect of Number of Local Steps. We conduct ablation studies on the value of Q to investigate its impact on FedCompass’s performance, with results listed in Table 3. From the table, we observe that smaller 1/Q values (ranging from 0.05 to 0.2) usually lead to faster convergence. Additionally, a wide range of 1/Q values (from 0.05 to 0.8) achieves a reasonable convergence rate compared with FedBuff or FedAT, which showcases that FedCompass is not sensitive to the value of Q; users therefore do not need to heuristically select Q for different client numbers and heterogeneity settings. In the extreme case that 1/Q = 1, i.e., the minimum and maximum numbers of local steps coincide, there is a significant performance drop for FedCompass, as Compass loses the flexibility to assign different local steps to different clients. In this case, each client has its own arrival group, and FedCompass becomes equivalent to FedAsync. More ablation study results are given in Appendix G.1.
Table 3: Relative wall-clock time for FedCompass to reach the target accuracy with 1/Q set to 0.05, 0.10, 0.20, 0.40, 0.60, 0.80, and 1.00, compared against FedBuff and FedAT, for each dataset and client number under three client heterogeneity settings (homo, normal, exp). (Numeric entries omitted.)
6 Conclusion
This paper introduces FedCompass, which employs a novel computing power-aware scheduler that makes several client models arrive almost simultaneously for group aggregation, reducing the global aggregation frequency and mitigating stale local model issues in asynchronous FL. FedCompass thus enhances FL training efficiency on heterogeneous client devices and improves performance on non-IID heterogeneous distributed datasets. Through a comprehensive suite of experiments, we have demonstrated that FedCompass achieves faster convergence than other prevalent FL algorithms in various training tasks with heterogeneous training data and clients. This new approach opens up new pathways for the development and use of robust AI models in settings that require privacy, with the additional advantage that it can readily leverage the existing disparate computing ecosystem.
Reproducibility Statement
The source code and instructions to reproduce all the experiment results can be found in the supplemental materials. The details of the classic dataset partition strategies are provided in Appendix D, and the corresponding implementations are included in the source code. The setups and hyperparameters used in the experiments are given in Appendix E. Additionally, the proofs of Theorem 1 and Corollary 1 are provided in Appendix C.
Acknowledgments
This work was supported by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory, provided by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-06CH11357. This research is also part of the Delta research computing project, which is supported by the National Science Foundation (award OCI 2005572), and the State of Illinois. Delta is a joint effort of the University of Illinois at Urbana-Champaign and the National Center for Supercomputing Applications. E.A.H. was partially supported by National Science Foundation awards OAC-1931561 and OAC-2209892.
References
- Bonawitz et al. (2019) Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečnỳ, Stefano Mazzocchi, Brendan McMahan, et al. Towards federated learning at scale: System design. Proceedings of Machine Learning and Systems, 1:374–388, 2019.
- Chai et al. (2021) Zheng Chai, Yujing Chen, Ali Anwar, Liang Zhao, Yue Cheng, and Huzefa Rangwala. FedAT: A high-performance and communication-efficient federated learning system with asynchronous tiers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16, 2021.
- Chen et al. (2019) Yang Chen, Xiaoyan Sun, and Yaochu Jin. Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation. IEEE transactions on neural networks and learning systems, 31(10):4229–4238, 2019.
- Chen et al. (2020) Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. Asynchronous online federated learning for edge devices with non-iid data. In 2020 IEEE International Conference on Big Data (Big Data), pp. 15–24. IEEE, 2020.
- Chen et al. (2021) Zheyi Chen, Weixian Liao, Kun Hua, Chao Lu, and Wei Yu. Towards asynchronous federated learning for heterogeneous edge-powered internet of things. Digital Communications and Networks, 7(3):317–326, 2021.
- Çiçek et al. (2016) Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pp. 424–432. Springer, 2016.
- Codella et al. (2018) Noel CF Codella, David Gutman, M Emre Celebi, Brian Helba, Michael A Marchetti, Stephen W Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pp. 168–172. IEEE, 2018.
- Combalia et al. (2019) Marc Combalia, Noel CF Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Allan C Halpern, Susana Puig, et al. Bcn20000: Dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288, 2019.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
- Dice (1945) Lee R Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
- Hao et al. (2020) Jiangshan Hao, Yanchao Zhao, and Jiale Zhang. Time efficient federated learning with semi-asynchronous communication. In 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS), pp. 156–163. IEEE, 2020.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- Hoang et al. (2023) Trung-Hieu Hoang, Jordan Fuhrman, Ravi Madduri, Miao Li, Pranshu Chaturvedi, Zilinghan Li, Kibaek Kim, Minseok Ryu, Ryan Chard, EA Huerta, et al. Enabling end-to-end secure federated learning in biomedical research on heterogeneous computing environments with appflx. arXiv preprint arXiv:2312.08701, 2023.
- Horváth et al. (2022) Samuel Horváth, Maziar Sanjabi, Lin Xiao, Peter Richtárik, and Michael Rabbat. Fedshuffle: Recipes for better use of local work in federated learning. arXiv preprint arXiv:2204.13169, 2022.
- Hsu et al. (2019) Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
- Imteaj & Amini (2020) Ahmed Imteaj and M Hadi Amini. Fedar: Activity and resource-aware federated learning model for distributed mobile robots. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1153–1160. IEEE, 2020.
- Kairouz et al. (2021) Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
- Kaissis et al. (2021) Georgios Kaissis, Alexander Ziller, Jonathan Passerat-Palmbach, Théo Ryffel, Dmitrii Usynin, Andrew Trask, Ionésio Lima Jr, Jason Mancuso, Friederike Jungmann, Marc-Matthias Steinborn, et al. End-to-end privacy preserving deep learning on multi-institutional medical imaging. Nature Machine Intelligence, 3(6):473–484, 2021.
- Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pp. 5132–5143. PMLR, 2020.
- Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
- Li et al. (2020) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
- Li et al. (2019) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.
- Li et al. (2023) Zilinghan Li, Shilan He, Pranshu Chaturvedi, Trung-Hieu Hoang, Minseok Ryu, EA Huerta, Volodymyr Kindratenko, Jordan Fuhrman, Maryellen Giger, Ryan Chard, et al. APPFLx: Providing privacy-preserving cross-silo federated learning as a service. arXiv preprint arXiv:2308.08786, 2023.
- Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Ma et al. (2022) Xiaodong Ma, Jia Zhu, Zhihao Lin, Shanxuan Chen, and Yangjie Qin. A state-of-the-art survey on solving non-iid data in federated learning. Future Generation Computer Systems, 135:244–258, 2022.
- McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR, 2017.
- Mills et al. (2019) Jed Mills, Jia Hu, and Geyong Min. Communication-efficient federated learning for wireless edge intelligence in IoT. IEEE Internet of Things Journal, 7(7):5986–5994, 2019.
- Nguyen et al. (2022) John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pp. 3581–3607. PMLR, 2022.
- Nishio & Yonetani (2019) Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019-2019 IEEE international Conference on Communications (ICC), pp. 1–7. IEEE, 2019.
- Ogier du Terrail et al. (2022) Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He, Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, et al. Flamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings. Advances in Neural Information Processing Systems, 35:5315–5334, 2022.
- Pati et al. (2022) Sarthak Pati, Ujjwal Baid, Brandon Edwards, Micah Sheller, Shih-Han Wang, G Anthony Reina, Patrick Foley, Alexey Gruzdev, Deepthi Karkada, Christos Davatzikos, et al. Federated learning enables big data for rare cancer boundary detection. Nature Communications, 13(1):7346, 2022.
- Pérez-García et al. (2021) Fernando Pérez-García, Rachel Sparks, and Sébastien Ourselin. TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Computer Methods and Programs in Biomedicine, 208:106236, 2021.
- Reddi et al. (2020) Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
- Reisizadeh et al. (2022) Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, and Ramtin Pedarsani. Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity. IEEE Journal on Selected Areas in Information Theory, 3(2):197–205, 2022.
- Ryu et al. (2022) Minseok Ryu, Youngdae Kim, Kibaek Kim, and Ravi K Madduri. APPFL: open-source software framework for privacy-preserving federated learning. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1074–1083. IEEE, 2022.
- So et al. (2021) Jinhyun So, Ramy E Ali, Başak Güler, and A Salman Avestimehr. Secure aggregation for buffered asynchronous federated learning. arXiv preprint arXiv:2110.02177, 2021.
- Stich (2018) Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
- Tan & Le (2019) Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on Machine Learning, pp. 6105–6114. PMLR, 2019.
- Tschandl et al. (2018) Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5(1):1–9, 2018.
- Wang (2019) Guan Wang. Interpret federated learning with shapley values. arXiv preprint arXiv:1905.04519, 2019.
- Wang et al. (2019) Guan Wang, Charlie Xiaoqian Dang, and Ziye Zhou. Measure contribution of participants in federated learning. In 2019 IEEE international conference on Big Data (Big Data), pp. 2597–2604. IEEE, 2019.
- Wang et al. (2020) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33:7611–7623, 2020.
- Wang et al. (2021) Qizhao Wang, Qing Li, Kai Wang, Hong Wang, and Peng Zeng. Efficient federated learning for fault diagnosis in industrial cloud-edge computing. Computing, 103(10):2319–2337, 2021.
- Wu et al. (2020) Wentai Wu, Ligang He, Weiwei Lin, Rui Mao, Carsten Maple, and Stephen Jarvis. SAFA: A semi-asynchronous protocol for fast federated learning with low overhead. IEEE Transactions on Computers, 70(5):655–668, 2020.
- Xie et al. (2019) Cong Xie, Sanmi Koyejo, and Indranil Gupta. Asynchronous federated optimization. arXiv preprint arXiv:1903.03934, 2019.
- Xu et al. (2021) Chenhao Xu, Youyang Qu, Yong Xiang, and Longxiang Gao. Asynchronous federated learning on heterogeneous devices: A survey. arXiv preprint arXiv:2109.04269, 2021.
- Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
- Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693–5700, 2019.
- Yurochkin et al. (2019) Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International conference on machine learning, pp. 7252–7261. PMLR, 2019.
- Zhang et al. (2021) Yu Zhang, Morning Duan, Duo Liu, Li Li, Ao Ren, Xianzhang Chen, Yujuan Tan, and Chengliang Wang. CSAFL: A clustered semi-asynchronous federated learning framework. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–10. IEEE, 2021.
Appendix A Detailed Implementation of FedCompass
Table 4 lists all frequently used notations and their corresponding explanations in the algorithm description of FedCompass.
Notation | Explanation
---|---
 | client information dictionary
 | arrival group information dictionary
 | index of the arrival group for a client
 | number of local steps for a client
 | estimated computing speed for a client
 | start time of current local training round for a client
 | list of non-arrived clients for an arrival group
 | list of arrived clients for an arrival group
 | expected arrival time for an arrival group
 | latest arrival time for an arrival group
 | current time
 | maximum number of local steps for clients
 | minimum number of local steps for clients
 | latest time factor
 | client learning rate
 | total number of communication rounds
 | batch size for local training
 | staleness function
 | importance weight of a client
 | client update buffer for a group
 | general client update buffer
 | global model
 | local model of a client after its local steps
Compass utilizes the concept of an arrival group to determine which clients are supposed to arrive together. To manage the client arrival groups, Compass employs two dictionaries: one for client information and one for arrival group information. Specifically, for each client, the client dictionary stores the arrival group, the number of local steps, the estimated computing speed (time per step), the start time of the current communication round, and the timestamp of the global model upon which the client trained (i.e., the client’s local model timestamp). For each arrival group, the group dictionary stores the lists of non-arrived and arrived clients, the expected arrival time for clients in the group, and the latest arrival time for clients in the group. According to the client speed estimation, all clients in a group are expected to finish local training and send the trained local updates back to the server before the group’s expected arrival time; however, to account for potential errors in speed estimation and variations in computing speed, the group waits until the latest arrival time for clients to send their trained models back. The difference between the latest arrival time and the group creation time is the latest time factor times the difference between the expected arrival time and the group creation time.
Algorithm 2 presents the process through which Compass assigns a client to an arrival group. The client and group dictionaries, along with the constant parameters (the maximum number of local steps, the minimum number of local steps, and the latest time factor), are assumed to be implicitly available and are not explicitly passed as algorithm inputs. When Compass assigns a client to a group, it first checks whether the client can join an existing arrival group (function JoinGroup, lines 5 to 16). For each group, Compass uses the current time, the group’s expected arrival time, and the estimated client speed to calculate a tentative number of local steps (line 8). To avoid a large disparity in the number of local steps, Compass specifies a valid range of local steps via the minimum and maximum step parameters. If Compass can find the largest tentative step number within this valid range, it assigns the client to the corresponding group. Otherwise, Compass creates a new arrival group for the client. When creating this new group, Compass aims to ensure that a specific existing arrival group can merge into the newly created group once all of its clients have arrived, as described in function CreateGroup (lines 18 to 35). To achieve this, Compass first identifies the computing speed of the fastest client in that existing group. It assumes that all clients in the group will arrive at the expected time and that the fastest client will perform the maximum number of local steps in the next round. Subsequently, Compass utilizes the expected arrival time in the next round (line 23) to determine the appropriate number of local steps (line 24). It then selects the largest number of local steps while ensuring that it is within the valid range. In cases where Compass is unable to set the number of local steps in a way that allows any existing group to join in the next round, it assigns the maximum number of local steps to the client upon creating the group. Whenever a new group is created, Compass sets up a timer event that invokes the GroupAggregate function (Algorithm 3) to aggregate the group of client updates once the group’s latest arrival time has passed.
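The JoinGroup/CreateGroup logic described above can be sketched as follows. This is a deliberately simplified illustration, not the paper's Algorithm 2: the constants `Q_MIN`, `Q_MAX`, and `LAMBDA`, the dictionary layout, and the function names are all assumptions, and `create_group` omits the group-merging lookahead.

```python
import math

# Illustrative constants; the real minimum/maximum local steps and latest
# time factor are FedCompass hyperparameters, not these fixed numbers.
Q_MIN, Q_MAX = 20, 100
LAMBDA = 1.2

def join_group(groups, now, speed):
    """Try to place a client into an existing arrival group.

    For each group, compute the number of local steps the client could run
    and still finish by the group's expected arrival time; choose the group
    that permits the largest step count within [Q_MIN, Q_MAX].
    """
    best = None
    for gid, g in groups.items():
        q = math.floor((g["expected_arrival"] - now) / speed)
        if Q_MIN <= q <= Q_MAX and (best is None or q > best[1]):
            best = (gid, q)
    return best  # (group id, assigned local steps), or None if no group fits

def create_group(groups, now, speed):
    """Create a new arrival group for a client that fits no existing group.

    This sketch simply assigns Q_MAX steps; the paper's CreateGroup also
    searches for a step count that lets an existing group merge later.
    """
    gid = max(groups, default=-1) + 1
    q = Q_MAX
    expected = now + q * speed
    groups[gid] = {
        "expected_arrival": expected,
        # The group waits until the latest arrival time before aggregating.
        "latest_arrival": now + LAMBDA * (expected - now),
    }
    return gid, q
```

For example, after a client with 1 s/step creates a group expected at t = 100, a client twice as fast can join it with 50 local steps, while a client ten times slower cannot finish even `Q_MIN` steps in time and triggers a new group.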
In GroupAggregate, the server updates the global model using the group-specific buffer, which contains updates from clients in the corresponding arrival group, and the general client update buffer, which contains updates from clients that arrive late in other groups. After the global update, the server updates the global model timestamp and assigns new training tasks to all arrived clients in the group.
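A minimal sketch of this buffered aggregation step, under simplifying assumptions: models are plain dicts of scalars, each buffered entry is a `(client_id, update, client_model_timestamp)` tuple, and `poly_staleness` is one common illustrative choice of staleness function, not necessarily the one used in the paper's Algorithm 3.

```python
def group_aggregate(global_model, timestamp, group_buffer, general_buffer,
                    weights, staleness_fn):
    """Apply all buffered client updates to the global model, then clear
    the buffers and advance the global model timestamp."""
    for cid, update, t_client in list(group_buffer) + list(general_buffer):
        # Discount stale updates: staleness is the number of global updates
        # since the client downloaded the model it trained on.
        factor = weights[cid] * staleness_fn(timestamp - t_client)
        for name, delta in update.items():
            global_model[name] += factor * delta
    group_buffer.clear()
    general_buffer.clear()
    return global_model, timestamp + 1

def poly_staleness(s):
    """Polynomial staleness discount (illustrative)."""
    return (1.0 + s) ** -0.5
```

With two equally weighted clients whose updates have staleness 1, each update is scaled by 0.5 × (1 + 1)^-0.5 before being added to the global model.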
Appendix B Upper Bound on the Number of Groups
If the client speeds do not have large variations for a sufficient number of global updates, then FedCompass can reach an equilibrium state where the arrival group assignment for clients is fixed and the total number of existing groups (denoted by ) is bounded by , where and is the ratio of the time per local step between the slowest client and the fastest client.
Consider one execution of FedCompass with a list of sorted client local computing times , where represents the time for client to complete one local step and if . Let be the first group of clients. Then all the clients will be assigned to the same arrival group at the equilibrium state with . Similarly, if , we can get the second group of clients where the clients in will be assigned to another arrival group at the equilibrium state. Therefore, the maximum number of existing arrival groups at the equilibrium state is equal to the smallest integer such that
$\lambda \le Q^{n}$, i.e., $n = \lceil \log_{Q} \lambda \rceil$.  (3)
Additionally, this implies that the maximum ratio of time for two different clients to finish one communication round at the equilibrium state is bounded by . First, we can assume that the maximum ratio of time for two different clients to finish one communication round is the same as that for two different arrival groups to finish one communication round. In the extreme case that there are maximum arrival groups. Then the time ratio between the slowest group and fastest group is bounded by .
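The grouping argument above can be illustrated with a short greedy sketch. The rule encoded here is an assumption distilled from the derivation (clients whose time per step is within a factor Q of the fastest client in their group can share a group); the function name and interface are illustrative.

```python
def equilibrium_group_count(step_times, Q):
    """Greedily count arrival groups for the given per-step times.

    Sort clients by time per local step; a client joins the current group
    while its time is within a factor Q of the group's fastest client,
    otherwise it starts a new group.
    """
    groups, base = 0, None
    for t in sorted(step_times):
        if base is None or t > Q * base:
            groups += 1
            base = t  # fastest client of the newly started group
    return groups
```

Under this rule, the group count grows logarithmically (base Q) in the slowest-to-fastest speed ratio, consistent with the bound discussed above: e.g., times (1, 2, 5, 30) with Q = 3 yield three groups.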
Appendix C FedCompass Convergence Analysis
C.1 List of Notations
Table 5 lists frequently used notations in the convergence analysis of FedCompass.
Notation | Explanation
---|---
 | total number of clients
 | client local learning rate
 | Lipschitz constant
 | bound for local gradient variance
 | bound for global gradient variance
 | bound for gradient
 | minimum number of client local steps in each training round
 | maximum number of client local steps in each training round
Q | ratio between the maximum and minimum numbers of local steps
 | number of local steps for a client in a round
 | global model after a number of global updates
 | a random training batch that a client uses in a local step
 | global loss for a model
 | a client's local loss for a model
 | loss for a model evaluated on a batch from a client
 | gradient of the global loss w.r.t. the model
 | gradient of a client's local loss w.r.t. the model
 | gradient of the loss evaluated on a batch w.r.t. the model
 | set of arriving clients at a timestamp
C.2 Proof of Theorem 1
Lemma 1.
for any model .
Proof of Lemma 1.
(4) | ||||
where step (a) utilizes Cauchy-Schwarz inequality, and step (b) utilizes Assumption 3.
Lemma 2.
When a function $f$ is $L$-Lipschitz smooth, we have
$f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\,\|y - x\|^{2}$.  (5)
Proof of Lemma 2.
Let $g(t) = f\big(x + t(y - x)\big)$ for $t \in [0, 1]$. Then, according to the definition of the gradient, we have
(6) | ||||
where step (a) utilizes Cauchy-Schwarz inequality , and step (b) utilizes the definition of Lipschitz smoothness.
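For reference, the standard chain of inequalities behind Lemma 2, written to match steps (a) and (b) described above, is (a reconstruction sketch, not the paper's exact display):

```latex
% Standard descent-lemma argument for an L-smooth function f.
\begin{aligned}
f(y) - f(x)
  &= \int_0^1 \big\langle \nabla f\big(x + t(y-x)\big),\, y-x \big\rangle \, dt \\
  &= \langle \nabla f(x),\, y-x \rangle
     + \int_0^1 \big\langle \nabla f\big(x + t(y-x)\big) - \nabla f(x),\, y-x \big\rangle \, dt \\
  &\overset{(a)}{\le} \langle \nabla f(x),\, y-x \rangle
     + \int_0^1 \big\| \nabla f\big(x + t(y-x)\big) - \nabla f(x) \big\| \, \| y-x \| \, dt \\
  &\overset{(b)}{\le} \langle \nabla f(x),\, y-x \rangle
     + \int_0^1 L\,t\,\| y-x \|^2 \, dt
   = \langle \nabla f(x),\, y-x \rangle + \frac{L}{2}\,\| y-x \|^2 .
\end{aligned}
```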
Proof of Theorem 1.
According to Lemma 2, we have
(7) | ||||
where represents the set of clients arrived at timestamp , and represents the update from client trained on the global model with staleness . Rearranging part ,
(8) |
where represents the number of local steps client has taken before sending the update to the server at timestamp , represents the model of client in local step , and represents a random training batch that the client uses in local step . To compute an upper bound for the expectation of part , we utilize conditional expectation: . Specifically, is the expectation over the history of iterates up to timestamp , is the expectation over the random group of clients arriving at timestamp , and is the expectation over the mini-batch stochastic gradient on client .
(9) | ||||
where step (a) utilizes Assumption 2, step (b) utilizes the identity , and step (c) utilizes conditional expectation with the probability of client to arrive at timestamp . We let and such that .
To derive the upper and lower bounds of , according to the conclusion we get in Appendix B, we have , i.e., . Therefore, we can derive the upper and lower bounds of as follows.
(10) | ||||
(11) | ||||
Let and . We have . Then,
(12) | ||||
For part , by utilizing Assumption 1, we can get
(13) | ||||
For , we derive its upper bound in the following way, with step (a) utilizing Cauchy-Schwarz inequality and step (b) utilizing Lemma 1.
(14) | ||||
For , we derive its upper bound also using Lemma 1.
(15) |
To compute the upper bound for the expectation of part , similar to , we also apply conditional expectation: .
(18) | ||||
where step (a) utilizes conditional expectation and Cauchy-Schwarz inequality, step (b) utilizes Cauchy-Schwarz inequality, and step (c) utilizes Assumption 3.
To find the upper bound of , we need to make sure that so that we can eliminate these two parts when evaluating the upper bound. The following step (a) employs Cauchy-Schwarz inequality.
(19) | ||||
Therefore, we need to have . In such cases, combining equation 17 and equation 18, we have
(20) | ||||
Summing up from to , we can get
(21) | ||||
Therefore, we obtain the conclusion in Theorem 1.
(22) | ||||
Proof of Corollary 1.
If choosing and considering and having the same order in the -class estimation, we can derive the conclusion of Corollary 1.
(25) |
Appendix D Details of the Experiment Datasets
D.1 Classic Dataset Partition Strategy
To simulate data heterogeneity on the classic datasets, we employ two partition strategies, (i) class partition and (ii) dual Dirichlet partition, and apply them to both the MNIST (LeCun, 1998) and CIFAR-10 (Krizhevsky et al., 2009) datasets. These are the two datasets most commonly used to evaluate FL algorithms (MNIST in 32% of cases and CIFAR-10 in 28%) (Ma et al., 2022). The details of these two partition strategies are elaborated in the following subsections.
D.1.1 Class Partition
Algorithm 4 outlines the process of class partition, which partitions a classic dataset into several client splits such that each client split contains samples from only a few classes. Many previous works partition classic datasets in a similar way to artificially create non-IID federated datasets (McMahan et al., 2017; Chen et al., 2019). To partition a dataset, we first obtain a dictionary that maps each class to the indices of all samples belonging to that class. Lines 2 to 9 of the algorithm assign each client a small subset of classes for its own client split. Additionally, the algorithm ensures that each sample class is assigned to at least one client split. Subsequently, from lines 10 to 14, the algorithm determines the number of samples of each class that each assigned client split should have based on a normal distribution, and then assigns the corresponding number of data samples to the respective clients. The class partition strategy results in imbalanced and non-IID FL datasets, with clients having varying sample classes within their local datasets.
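A minimal Python sketch of the class-partition idea. This is a simplification of Algorithm 4: the function name and parameters are illustrative, and samples of each class are split round-robin among their holders rather than by the paper's normal-distribution draw.

```python
import random
from collections import defaultdict

def class_partition(labels, num_clients, classes_per_client, seed=0):
    """Assign sample indices to clients so each client sees only a few classes.

    labels: integer class label for each sample.
    Returns a dict mapping client id -> list of sample indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)

    # Cycle classes through clients first so every class is held by at
    # least one client, then top each client up to classes_per_client.
    client_classes = defaultdict(set)
    for k, c in enumerate(classes):
        client_classes[k % num_clients].add(c)
    for k in range(num_clients):
        while len(client_classes[k]) < min(classes_per_client, len(classes)):
            client_classes[k].add(rng.choice(classes))

    # Split each class's samples round-robin among its holders (the paper
    # instead draws each holder's share from a normal distribution).
    holders = defaultdict(list)
    for k, cs in client_classes.items():
        for c in sorted(cs):
            holders[c].append(k)
    splits = defaultdict(list)
    for c, idxs in by_class.items():
        ks = holders[c]
        for i, idx in enumerate(idxs):
            splits[ks[i % len(ks)]].append(idx)
    return dict(splits)
```

Every sample lands in exactly one client split, and each client's split contains at most `classes_per_client` distinct classes.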
In the experiments, the mean value of the normal distribution , and the standard deviation . Both the MNIST and the CIFAR-10 datasets have 10 distinct classes, namely, . When the number of clients , and , when or , and . Figure 3 shows a sample class distribution generated by the class partition strategy among clients with , , and .

D.1.2 Dual Dirichlet Partition
Some prior works that model data heterogeneity also employ a Dirichlet distribution to model the class distribution within each local client dataset (Yurochkin et al., 2019; Hsu et al., 2019). We extend the work of others using two Dirichlet distributions, with details outlined in Algorithm 5. The first Dirichlet distribution uses concentration parameter and prior distribution to generate the normalized number of samples for each client, . The second Dirichlet distribution uses concentration parameter and prior distribution to generate the normalized number of samples for each class within each client, . From lines 9 to 12, the algorithm combines , , and the total number of samples for class to calculate the number of samples with class to be assigned for each client, and assigns the samples accordingly. In summary, the dual Dirichlet partition strategy employs two Dirichlet distributions to model two levels of data heterogeneity: the distribution across clients and the distribution across different classes within each client. Therefore, the generated client datasets are imbalanced and non-IID, which more accurately reflect real-world federated datasets.
In the experiments, we set and . The concentration parameter governs the similarity among clients or classes. When , the distribution becomes identical to the prior distribution. Conversely, when , only one item is selected with a value of 1 while the remaining items have a value of 0. Since concentration parameter has a very small value, each client has data samples from only a few classes. Figure 4 illustrates a sample class distribution generated by the dual Dirichlet partition strategy among clients, where , , and .
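The two-level Dirichlet idea described above can be sketched with NumPy. This is an illustrative simplification, not the paper's Algorithm 5: the function name, the uniform priors, the default concentration values, and the floor-plus-remainder rounding are all assumptions.

```python
import numpy as np

def dual_dirichlet_partition(labels, num_clients, alpha_client=10.0,
                             alpha_class=0.5, seed=0):
    """One Dirichlet draw sets each client's share of the data; a second,
    per-client draw sets the class mix inside that client. A smaller
    alpha_class makes each client's class distribution sparser."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    client_share = rng.dirichlet(alpha_client * np.ones(num_clients))
    class_mix = rng.dirichlet(alpha_class * np.ones(len(classes)),
                              size=num_clients)

    splits = {k: [] for k in range(num_clients)}
    for ci, c in enumerate(classes):
        idxs = rng.permutation(np.flatnonzero(labels == c))
        # Weight client k's claim on class c by its size share times its
        # class-c mixture weight, then split the class's samples accordingly.
        w = client_share * class_mix[:, ci]
        w = w / w.sum()
        counts = np.floor(w * len(idxs)).astype(int)
        counts[-1] = len(idxs) - counts[:-1].sum()  # give remainder to last
        start = 0
        for k in range(num_clients):
            splits[k].extend(idxs[start:start + counts[k]].tolist())
            start += counts[k]
    return splits
```

Each sample is assigned to exactly one client, and both the per-client dataset sizes and the within-client class mixes are imbalanced, mirroring the two levels of heterogeneity discussed above.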

D.2 Naturally Split Datasets: FLamby
FLamby is an open-source suite that provides naturally split datasets specifically designed for benchmarking cross-silo federated learning algorithms (Ogier du Terrail et al., 2022). These datasets are sourced directly from distributed real-world entities, ensuring that they preserve the authentic characteristics and complexities of data distribution in FL. In our study, we select two medium-sized datasets from FLamby, Fed-IXI and Fed-ISIC2019, to evaluate the performance of FedCompass and other FL algorithms. The two datasets offer valuable perspectives on two healthcare domains: brain MRI interpretation (Fed-IXI) and skin cancer detection (Fed-ISIC2019). By leveraging these datasets, our goal is to showcase the robustness and versatility of FedCompass across diverse machine learning tasks in real-world scenarios. Table 6 provides an overview of the two selected datasets. Detailed descriptions are provided in the following subsections.
Dataset | Fed-IXI | Fed-ISIC2019 |
---|---|---|
Inputs | T1 weighted image (T1WI) | Dermoscopy |
Task type | 3D segmentation | Classification |
Prediction | Brain mask | Melanoma class |
Client number | 3 | 6 |
Model | 3D U-net (Çiçek et al., 2016) | EfficientNet (Tan & Le, 2019) |
Metric | DICE (Dice, 1945) | Balanced accuracy |
Input dimension | 48 × 60 × 48 | 200 × 200 × 3 |
Dataset size | 444MB | 9GB |
D.2.1 Fed-IXI
The Fed-IXI dataset is sourced from the Information eXtraction from Images (IXI) database and includes brain T1 magnetic resonance images (MRIs) from three hospitals (Pérez-García et al., 2021). These MRIs have been repurposed for a brain segmentation task; the model performance is evaluated with the DICE score (Dice, 1945). To ensure consistency, the images undergo preprocessing steps, such as volume resizing to 48 × 60 × 48 voxels, and sample-wise intensity normalization. The 3D U-net (Çiçek et al., 2016) serves as the baseline model for this task.
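For reference, the DICE score used to evaluate this task is twice the intersection of the predicted and ground-truth masks divided by the sum of their sizes. A minimal sketch for flattened binary masks:

```python
def dice_score(pred, target):
    """Dice coefficient for binary masks given as equal-length 0/1 sequences."""
    assert len(pred) == len(target)
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Convention: two empty masks agree perfectly.
    return 1.0 if denom == 0 else 2.0 * intersection / denom
```

For example, `dice_score([1, 1, 0, 1], [1, 0, 0, 1])` gives 2·2/(3+2) = 0.8.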
D.2.2 Fed-ISIC2019
The Fed-ISIC2019 dataset is sourced from the ISIC2019 dataset, which includes dermoscopy images collected from four hospitals (Tschandl et al., 2018; Codella et al., 2018; Combalia et al., 2019). The images are restricted to a subset from the public train set and are split based on the imaging acquisition system, resulting in a 6-client federated dataset. The task involves image classification among eight different melanoma classes, and the performance is measured through balanced accuracy. Images are preprocessed by resizing and normalizing brightness and contrast. The baseline classification model is an EfficientNet (Tan & Le, 2019) pretrained on ImageNet (Deng et al., 2009).
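Balanced accuracy, the metric for this task, is the unweighted mean of per-class recall, which prevents the majority melanoma classes from dominating the score. A minimal sketch:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, averaged over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(recalls)
```

A classifier that always predicts the majority class scores 1/C on a C-class problem, regardless of class imbalance.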
Appendix E Experiment Setup Details
All the experiments are implemented using APPFL, an open-source software framework for privacy-preserving federated learning (Ryu et al., 2022; Li et al., 2023).
E.1 Details of Client Heterogeneity Simulation
When simulating client heterogeneity, we consider the average time required for each client to complete one local step and assume that this time follows a random distribution. In our experiments, we employ three different distributions to simulate client heterogeneity: a normal distribution with zero standard deviation (i.e., homogeneous clients), a normal distribution with nonzero standard deviation, and an exponential distribution. The exponential distribution models the amount of time until a certain event occurs (arrival time), specifically for events that follow a Poisson process. The mean value of each distribution is adjusted for different datasets according to the size of the model and training data. Table 7 presents the mean values of the client training time distribution for the different datasets used in the experiments. Furthermore, Figure 5 illustrates the distribution of the time to complete one local step among 50 clients under the normal and exponential distributions, both with a mean value of 0.15 seconds.
Dataset Name | Batch Size | Mean Time per Step |
---|---|---|
MNIST | 64 | 0.15s |
CIFAR-10 | 128 | 0.5s |
Fed-IXI | 2 | 0.8s |
Fed-ISIC2019 | 64 | 1.5s |
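The sampling procedure above can be sketched as follows. The relative standard deviation `rel_std` used for the heterogeneous normal case is an assumed placeholder, since the exact value is not preserved here; all names are our own.

```python
import random

def sample_step_times(n_clients, mean, dist, rel_std=0.3, seed=0):
    """Per-client mean seconds per local step under a chosen heterogeneity model."""
    rng = random.Random(seed)
    if dist == "homo":      # zero-variance normal: all clients identical
        return [mean] * n_clients
    if dist == "normal":    # normal distribution, truncated to stay positive
        return [max(1e-3, rng.gauss(mean, rel_std * mean)) for _ in range(n_clients)]
    if dist == "exp":       # exponential distribution with the same mean
        return [rng.expovariate(1.0 / mean) for _ in range(n_clients)]
    raise ValueError(f"unknown distribution: {dist}")
```

The exponential case produces a long right tail, i.e., a few very slow stragglers, which is the regime where asynchronous scheduling matters most.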

E.2 Detailed Setups and Hyperparameters for Experiments on the MNIST Dataset
Table 8 shows the detailed architecture of the convolutional neural network (CNN) used for the experiments on the MNIST dataset. Table 9 provides a comprehensive list of the hyperparameter values used for training: the number of client local steps per training round, the minimum and maximum numbers of client local steps in FedCompass, the client local learning rate, the server-side momentum, and the batch size. For all asynchronous FL algorithms, we adopt the polynomial staleness function introduced in (Xie et al., 2019), whose two parameters (the base mixing factor and the polynomial exponent) are also listed in Table 9. Additionally, Table 9 lists the buffer size for the FedBuff algorithm, the speed ratio factor used to create client tiers in FedAT (clients whose relative speeds fall within this factor are tiered together; its value is selected via grid search), and the latest time factor for FedCompass. For hyperparameters given as length-three tuples, the tuple items correspond to the hyperparameter values when the number of clients is 5, 10, and 20, respectively. Hyperparameters that are not applicable to certain algorithms are marked with a dash "-". To ensure a fair comparison, similar algorithms are assigned the same hyperparameter values, as shown in the table. The number of global training rounds is chosen separately for each experiment setting and FL algorithm such that the corresponding FL algorithm converges, all synchronous FL algorithms take the same amount of wall-clock time, and all asynchronous algorithms take the same amount of wall-clock time in the same setting. It is also worth mentioning that the buffer size for the FedBuff algorithm (Nguyen et al., 2022) is selected from a search over multiple candidate values.
Though the original paper states that a single buffer size is a good setting across all their benchmarks and does not require tuning, that value does not apply to cross-silo cases with at most ten clients in most of our experiment settings, whereas all experiments in the original paper involve thousands of clients.
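For concreteness, the polynomial staleness function of Xie et al. (2019) scales the server-side mixing weight by (staleness + 1) raised to a negative exponent, so fresher updates receive larger weights:

```python
def staleness_weight(alpha, a, staleness):
    """Server mixing weight under the polynomial staleness function:
    alpha * (staleness + 1) ** (-a), where staleness is the number of global
    updates that occurred since the client last downloaded the model."""
    return alpha * (staleness + 1) ** (-a)
```

With the values used in Table 9 (alpha = 0.9, a = 0.5), a fresh update (staleness 0) is mixed with weight 0.9, while an update that is 3 rounds stale is mixed with weight 0.9/√4 = 0.45.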
Layer | Outsize | Setting | Param # |
---|---|---|---|
Input | (1, 28, 28) | - | 0 |
Conv1 | (32, 24, 24) | kernel=5, stride=1 | 832 |
ReLU1 | (32, 24, 24) | - | 0 |
MaxPool1 | (32, 12, 12) | kernel=2, stride=2 | 0 |
Conv2 | (64, 8, 8) | kernel=5, stride=1 | 51,264 |
ReLU2 | (64, 8, 8) | - | 0 |
MaxPool2 | (64, 4, 4) | kernel=2, stride=2 | 0 |
Flatten | (1024,) | - | 0 |
FC1 | (512,) | - | 524,800 |
ReLU3 | (512,) | - | 0 |
FC2 | (10,) | - | 5,130 |
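As a sanity check, the output sizes and parameter counts in Table 8 can be reproduced with a few lines of arithmetic: a valid convolution with kernel k and stride s maps spatial size n to (n - k) // s + 1, and each convolutional or fully connected output unit carries one bias.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (unpadded) convolution or pooling layer."""
    return (size - kernel) // stride + 1

def conv_params(c_in, c_out, kernel):
    """Conv layer parameters: weights plus one bias per output channel."""
    return c_out * (c_in * kernel * kernel + 1)

def fc_params(n_in, n_out):
    """Fully connected layer parameters: weights plus one bias per output."""
    return n_out * (n_in + 1)

# Conv1: 1 -> 32 channels, kernel 5, on the 28x28 input.
assert conv_out(28, 5, 1) == 24 and conv_params(1, 32, 5) == 832
# Conv2: 32 -> 64 channels, kernel 5, on the 12x12 map left after 2x2 max-pooling.
assert conv_out(12, 5, 1) == 8 and conv_params(32, 64, 5) == 51_264
# FC1: 64 * 4 * 4 = 1024 -> 512; FC2: 512 -> 10.
assert fc_params(1024, 512) == 524_800 and fc_params(512, 10) == 5_130
```

Summing the four trainable layers gives 582,026 parameters for the whole network.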
Hyperparameter | FedAvg | FedAvgM | FedAsync | FedBuff | FedAT | FedCompass(+N) | FedCompass+M |
---|---|---|---|---|---|---|---|
Local steps | (200, 200, 100) | (200, 200, 100) | (200, 200, 100) | (200, 200, 100) | (200, 200, 100) | - | - |
Min. local steps | - | - | - | - | - | (40, 40, 20) | (40, 40, 20) |
Max. local steps | - | - | - | - | - | (200, 200, 100) | (200, 200, 100) |
Optim | Adam | Adam | Adam | Adam | Adam | Adam | Adam |
Learning rate | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 |
Server momentum | - | 0.9 | - | - | - | - | 0.9 |
Batch size | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
Staleness factor α | - | - | 0.9 | 0.9 | - | 0.9 | 0.9 |
Staleness exponent a | - | - | 0.5 | 0.5 | - | 0.5 | 0.5 |
Buffer size | - | - | - | (3, 3, 5) | - | - | - |
Tiering factor | - | - | - | - | 2.0 | - | - |
Latest time factor | - | - | - | - | - | 1.2 | 1.2 |
E.3 Detailed Setups and Hyperparameters for Experiments on the CIFAR-10 Dataset
For the experiments on the CIFAR-10 dataset, a randomly initialized ResNet-18 model (He et al., 2016) is used for training. Table 10 shows the detailed architecture of the ResNet-18 model, which contains eight residual blocks. Each residual block contains two convolutional + batch normalization layers, as shown in Table 11. The input of the block is added to its output if the number of output channels equals the number of input channels; otherwise, the input first passes through a 1x1 convolutional + batch normalization layer before being added to the output. An additional ReLU layer is appended at the end of the residual block. Table 12 lists the hyperparameter values used in training for the various FL algorithms. The number of global training rounds is chosen separately for each experiment setting and FL algorithm such that the corresponding FL algorithm converges, all synchronous FL algorithms take the same amount of wall-clock time, and all asynchronous algorithms take the same amount of wall-clock time in the same setting.
Layer | Outsize | Setting | Param # |
---|---|---|---|
Input | (3, 32, 32) | - | 0 |
Conv1 | (64, 32, 32) | kernel=3, stride=1, padding=1, bias=False | 1,728 |
BatchNorm1 | (64, 32, 32) | - | 128 |
ReLU1 | (64, 32, 32) | - | 0 |
ResidualBlock1-1 | (64, 32, 32) | stride=1 | 73,984 |
ResidualBlock1-2 | (64, 32, 32) | stride=1 | 73,984 |
ResidualBlock2-1 | (128, 16, 16) | stride=2 | 230,144 |
ResidualBlock2-2 | (128, 16, 16) | stride=1 | 295,424 |
ResidualBlock3-1 | (256, 8, 8) | stride=2 | 919,040 |
ResidualBlock3-2 | (256, 8, 8) | stride=1 | 1,180,672 |
ResidualBlock4-1 | (512, 4, 4) | stride=2 | 3,673,088 |
ResidualBlock4-2 | (512, 4, 4) | stride=1 | 4,720,640 |
AvgPool | (512,) | kernel=4, stride=4 | 0 |
Linear | (10,) | - | 5,130 |
Layer | Outsize | Setting | Param # |
---|---|---|---|
Input | (in_planes, H, W) | - | 0 |
Conv1 | (planes, H/stride, W/stride) | kernel=3, stride=stride, padding=1 | in_planes*planes*9 |
BatchNorm1 | (planes, H/stride, W/stride) | - | 2*planes |
ReLU1 | (planes, H/stride, W/stride) | - | 0 |
Conv2 | (planes, H/stride, W/stride) | kernel=3, stride=1, padding=1 | planes*planes*9 |
BatchNorm2 | (planes, H/stride, W/stride) | - | 2*planes |
“Identity” | (planes, H/stride, W/stride) | 1x1 Conv + BatchNorm if in_planes ≠ planes | - |
ReLU2 | (planes, H/stride, W/stride) | Apply after adding BatchNorm2 and “Identity” outputs | 0 |
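The per-block parameter counts in Table 10 can be verified arithmetically. The helper below is a sketch that mirrors the structure in Table 11: two bias-free 3x3 convolutions, batch normalization contributing 2 learnable parameters per channel, and a 1x1 convolution + batch normalization shortcut when the channel counts differ.

```python
def residual_block_params(in_planes, planes):
    """Parameter count of one ResNet-18 residual block (bias-free convs)."""
    n = in_planes * planes * 9 + 2 * planes   # Conv1 (3x3) + BatchNorm1
    n += planes * planes * 9 + 2 * planes     # Conv2 (3x3) + BatchNorm2
    if in_planes != planes:                   # 1x1 conv + BatchNorm shortcut
        n += in_planes * planes + 2 * planes
    return n
```

For instance, `residual_block_params(64, 128)` gives 230,144, matching the entry for ResidualBlock2-1 in Table 10.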
Hyperparameter | FedAvg | FedAvgM | FedAsync | FedBuff | FedAT | FedCompass(+N) | FedCompass+M |
---|---|---|---|---|---|---|---|
Local steps | (200, 200, 100) | (200, 200, 100) | (200, 200, 100) | (200, 200, 100) | (200, 200, 100) | - | - |
Min. local steps | - | - | - | - | - | (40, 40, 20) | (40, 40, 20) |
Max. local steps | - | - | - | - | - | (200, 200, 100) | (200, 200, 100) |
Optim | SGD | SGD | SGD | SGD | SGD | SGD | SGD |
Learning rate | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
Server momentum | - | 0.9 | - | - | - | - | 0.9 |
Batch size | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
Staleness factor α | - | - | 0.9 | 0.9 | - | 0.9 | 0.9 |
Staleness exponent a | - | - | 0.5 | 0.5 | - | 0.5 | 0.5 |
Buffer size | - | - | - | (3, 3, 5) | - | - | - |
Tiering factor | - | - | - | - | 2.0 | - | - |
Latest time factor | - | - | - | - | - | 1.2 | 1.2 |
E.4 Detailed Setups and Hyperparameters for Experiments on the FLamby Datasets
For the datasets from FLamby (Ogier du Terrail et al., 2022), we use the default experiment settings provided by FLamby. The details of the experiment settings are shown in Table 13.
Hyperparameter | Fed-IXI | Fed-ISIC2019 |
---|---|---|
Model | 3D U-net (Çiçek et al., 2016) | EfficientNet (Tan & Le, 2019) |
Metric | DICE | Balanced accuracy |
Client # | 3 | 6 |
Local steps | 50 | 50 |
Min. local steps | 20 | 20 |
Max. local steps | 100 | 100 |
Optimizer | AdamW (Loshchilov & Hutter, 2017) | Adam |
Learning rate | 0.001 | 0.0005 |
Server momentum | 0.9 | 0.5 |
Batch size | 2 | 64 |
Staleness factor α | 0.9 | 0.9 |
Staleness exponent a | 0.5 | 0.5 |
Buffer size | 2 | 3 |
Latest time factor | 1.2 | 1.2 |
Appendix F Additional Experiment Results
F.1 Additional Experiment Results on the MNIST Dataset
Table 14 shows the top validation accuracy and the corresponding standard deviation on the class partitioned MNIST dataset for various federated learning algorithms with different numbers of clients and client heterogeneity settings, averaged over ten independent experiment runs with different random seeds; Table 15 shows the same for the dual Dirichlet partitioned MNIST dataset. Since all asynchronous FL algorithms take the same amount of training time in the same experiment setting, the results indicate that FedCompass not only converges faster than other asynchronous FL algorithms but also converges to higher accuracy. Notably, in the homogeneous client settings, FedCompass does not behave exactly the same as FedAvg: the clients still exhibit some speed variance in each local training round, which prevents the Compass scheduler from assigning the same number of local steps to all clients and thus from reproducing FedAvg exactly. The global model is updated more frequently in FedCompass, which leads to relatively faster convergence.
Figures 6–11 depict the change in validation accuracy during the training process for various FL algorithms in different experiment settings. These figures illustrate how FedCompass achieves faster convergence under different data and device heterogeneity settings.
[Table 14: Top validation accuracy and standard deviation on the class partitioned MNIST dataset for FedAvg, FedAvgM, FedAsync, FedBuff, FedAT, and the FedCompass variants, with 5, 10, and 20 clients under homo, normal, and exp client settings. Numeric entries not preserved.]
[Table 15: Top validation accuracy and standard deviation on the dual Dirichlet partitioned MNIST dataset for the same algorithms and settings as Table 14. Numeric entries not preserved.]
[Figures 6–11: Change in validation accuracy during the training process for various FL algorithms on the MNIST experiments under different data and client heterogeneity settings.]
F.2 Additional Experiment Results on the CIFAR-10 Dataset
Table 16 shows the top validation accuracy and the corresponding standard deviation on the class partitioned CIFAR-10 dataset for various federated learning algorithms with different numbers of clients and client heterogeneity settings, averaged over ten independent experiment runs with different random seeds; Table 17 shows the same for the dual Dirichlet partitioned CIFAR-10 dataset. Figures 12–17 show the change in validation accuracy during the training process for various FL algorithms in different experiment settings on the CIFAR-10 dataset.
[Table 16: Top validation accuracy and standard deviation on the class partitioned CIFAR-10 dataset for FedAvg, FedAvgM, FedAsync, FedBuff, FedAT, and the FedCompass variants, with 5, 10, and 20 clients under homo, normal, and exp client settings. Numeric entries not preserved.]
[Table 17: Top validation accuracy and standard deviation on the dual Dirichlet partitioned CIFAR-10 dataset for the same algorithms and settings as Table 16. Numeric entries not preserved.]
[Figures 12–17: Change in validation accuracy during the training process for various FL algorithms on the CIFAR-10 experiments under different data and client heterogeneity settings.]
F.3 Additional Experiment Results on the FLamby Datasets
Table 18 shows the top validation accuracy and the corresponding standard deviation on the Fed-IXI and Fed-ISIC2019 datasets from FLamby for various federated learning algorithms in different client heterogeneity settings, averaged over ten independent experiment runs with different random seeds. Figures 18 and 19 present the change in validation accuracy during the training process for various FL algorithms on the Fed-IXI and Fed-ISIC2019 datasets, respectively.
Method | Fed-IXI (homo) | Fed-IXI (normal) | Fed-IXI (exp) | Fed-ISIC2019 (homo) | Fed-ISIC2019 (normal) | Fed-ISIC2019 (exp) |
---|---|---|---|---|---|---|
FedAvg | 98.39±0.02 | 98.39±0.02 | 98.39±0.02 | 59.12±1.38 | 59.12±1.38 | 59.12±1.38 |
FedAvgM | 98.39±0.00 | 98.39±0.00 | 98.39±0.00 | 50.25±2.57 | 50.25±2.57 | 50.25±2.57 |
FedAsync | 98.05±0.25 | 98.24±0.18 | 98.32±0.20 | 59.43±1.18 | 59.66±3.04 | 56.34±4.34 |
FedBuff | 98.36±0.03 | 98.35±0.03 | 98.35±0.08 | 57.65±2.07 | 57.46±2.60 | 55.13±4.74 |
FedAT | 98.72±0.01 | 98.65±0.06 | 98.63±0.08 | 58.31±2.35 | 58.55±1.76 | 61.93±4.56 |
FedCompass | 98.76±0.00 | 98.61±0.06 | 98.59±0.04 | 68.46±0.38 | 66.35±3.76 | 65.46±5.33 |
FedCompass+N | 98.76±0.00 | 98.61±0.06 | 98.59±0.04 | 64.72±0.94 | 62.43±5.23 | 60.18±4.95 |
FedCompass+M | 98.72±0.01 | 98.61±0.06 | 98.59±0.04 | 69.59±0.06 | 67.04±3.56 | 66.48±4.84 |
[Figures 18 and 19: Change in validation accuracy during the training process for various FL algorithms on the Fed-IXI and Fed-ISIC2019 datasets.]
Appendix G Ablation Study
G.1 Ablation Study on 1/Q
In this section, we conduct an ablation study on the value of 1/Q to investigate its impact on the performance of the FedCompass algorithm. Table 19 lists the relative wall-clock time for different values of 1/Q to reach 90% validation accuracy on the class partitioned and dual Dirichlet partitioned MNIST datasets, and 50% accuracy on the dual Dirichlet partitioned CIFAR-10 dataset, in different client heterogeneity settings; Table 20 shows the corresponding validation accuracy and standard deviations. The results are averaged over ten runs with different random seeds. Table 19 also lists the results of FedBuff for reference. From Table 19, the following observations can be derived: (1) A smaller 1/Q usually leads to faster convergence, which is consistent with the convergence analysis in Section 4, where a smaller 1/Q results in fewer arrival groups at equilibrium and may help reduce the impact of device heterogeneity. (2) Though FedCompass generally achieves a faster convergence rate with smaller 1/Q values (0.05–0.2), it still achieves a reasonable convergence rate with large 1/Q values (0.2–0.8) compared with FedBuff, which shows that FedCompass is not sensitive to the choice of Q. (3) In the extreme case 1/Q = 1.00, there is a significant performance drop for FedCompass, as the Compass scheduler cannot assign different numbers of local steps to different clients; each client then forms its own group, and FedCompass becomes equivalent to FedAsync.
[Table 19: Relative wall-clock time to reach the target validation accuracy for 1/Q values 0.05, 0.10, 0.20, 0.40, 0.60, 0.80, and 1.00, with FedBuff for reference, on the class partitioned MNIST, dual Dirichlet partitioned MNIST, and dual Dirichlet partitioned CIFAR-10 datasets under homo, normal, and exp client settings. Numeric entries not preserved.]
[Table 20: Corresponding top validation accuracy and standard deviations for the 1/Q ablation study in Table 19. Numeric entries not preserved.]
G.2 Client Speed Changes in FedCompass and Ablation Study on the Latest Time Factor
In cross-silo federated learning scenarios, variations in computing power and speed among clients are possible. For instance, a client utilizing a supercomputer might experience significant changes in computing power due to the allocation or deallocation of other tasks within the same cluster. Similarly, in cloud computing environments, a client might be subject to auto-scaling policies that dynamically adjust the number of virtual machines or container instances. In this section, we provide analyses and empirical results to study how FedCompass is resilient to client speed changes.
Figure 20 depicts an execution of FedCompass involving a client speedup. During the second round of local training, client 3 increases its computing speed and consequently transmits its local update at 10:36, earlier than the expected group arrival time of 12:00. In this situation, the FedCompass server first updates the speed record for client 3 and then waits for the arrival of clients 1 and 2 in the same group. Upon the arrival of clients 1 and 2 at 12:00, the server assigns all three clients to arrival group 2 for the subsequent round of local training, and the number of local steps for client 3 is assigned according to its new speed.
[Figure 20: An execution of FedCompass with a client speedup.]
Figures 21 and 22 illustrate two executions of FedCompass with different levels of client slowdown. In Figure 21, client 3 slows down during its second round of training and transmits its local update back to the server at 13:24, past the expected arrival time of 12:00. Nonetheless, recall that each arrival group has a latest arrival time to accommodate potential client speed fluctuations: the interval between the latest arrival time and the group's creation time is the latest time factor times the interval between the expected arrival time and the creation time. In this example, the latest time factor is 1.2, resulting in a latest arrival time of 14:00 for arrival group 1. Since client 3 arrives later than the expected arrival time but earlier than the latest arrival time, the FedCompass server waits until client 3 arrives at 13:24 and subsequently assigns clients 1 to 3 to a new arrival group for the next training round. Figure 22 presents another case with a relatively larger slowdown. Here, client 3 slows down enough that it transmits its local update at 15:12, exceeding even the latest arrival time of 14:00. When the FedCompass server does not receive an update from client 3 by 14:00, it responds to clients 1 and 2 and assigns them to arrival group 2 with corresponding numbers of local steps. When client 3 arrives at 15:12, the server first attempts to assign it to group 2, but this is unsuccessful because the corresponding number of local steps would fall below the minimum. As a result, the server establishes a new group 3 for client 3. Subsequently, once all clients in group 2 arrive, the server assigns all of them to group 3.
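The latest-arrival-time rule above can be sketched as a one-line computation. The 02:00 group creation time in the example below is an assumed value, chosen to be consistent with the example's 12:00 expected arrival and 14:00 latest arrival under a latest time factor of 1.2.

```python
def latest_arrival_time(creation, expected_arrival, latest_time_factor):
    """Latest arrival time of an arrival group: the interval from the group's
    creation time to the latest arrival time is latest_time_factor times the
    interval from the creation time to the expected arrival time."""
    return creation + latest_time_factor * (expected_arrival - creation)

# Hypothetical times in fractional hours: group created at 02:00,
# expected arrival 12:00, factor 1.2 -> latest arrival 14:00.
assert latest_arrival_time(2.0, 12.0, 1.2) == 14.0
```

A client arriving between the expected and latest arrival times keeps its group intact; one arriving after the latest arrival time triggers the regrouping described above.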
[Figure 21: An execution of FedCompass with a moderate client slowdown. Figure 22: An execution of FedCompass with a larger client slowdown.]
To empirically study how FedCompass is resilient to client speed changes compared with similar tiered FL methods such as FedAT (Chai et al., 2021), and how the value of the latest time factor may impact this resilience, we set up a scenario in which each client has a 10% probability of changing its computing speed in each local training round by re-sampling a speed from the client speed distribution (normal or exponential). This setup mimics the occasional yet substantial speed changes that may arise in real-world cross-silo federated learning environments as a result of auto-scaling. Table 21 shows the convergence speed for FedCompass with different values of the latest time factor and for FedAT, and Table 22 shows the corresponding average top accuracy and standard deviations. From Table 21, we can draw the following conclusions: (1) When the latest time factor is 1.00, the Compass scheduler is not resilient to even small client slowdowns, so FedCompass generally converges more slowly than with values slightly above 1.00. (2) Values slightly above 1.00 (e.g., 1.20) empirically result in the fastest convergence, as the Compass scheduler accommodates minor client speed fluctuations without long waiting times. (3) Large values sometimes slow down convergence, as the server may wait for a long time when there are significant client slowdowns. (4) FedCompass converges faster than FedAT regardless of the value of the latest time factor, which further demonstrates its resilience to speed changes.
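The speed-change scenario described above can be sketched as a simple simulation. This is illustrative only; the exponential speed distribution and all names are our own assumptions.

```python
import random

def simulate_speed_changes(speeds, rounds, change_prob, mean, seed=0):
    """Each round, every client independently re-samples its per-step time from
    the original exponential speed distribution with probability change_prob.
    Returns the per-round history of client speeds, including the initial state."""
    rng = random.Random(seed)
    history = [list(speeds)]
    for _ in range(rounds):
        current = [
            rng.expovariate(1.0 / mean) if rng.random() < change_prob else t
            for t in history[-1]
        ]
        history.append(current)
    return history
```

With `change_prob = 0.1`, a client's speed changes roughly once every ten rounds, matching the occasional-but-substantial regime studied in Tables 21 and 22.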
Finally, it is noteworthy that under extreme circumstances, where all client computing speeds are subject to dramatic and constant fluctuations, the scheduling mechanism of FedCompass may be unable to maintain the desired client synchronization, since it relies on prior speed data that becomes obsolete in the face of such variability. In these scenarios, the training process would proceed in a completely asynchronous manner, effectively causing FedCompass to revert to behavior similar to that of the FedAsync (Xie et al., 2019) algorithm. However, unlike cross-device FL, where each client may experience frequent speed changes due to power constraints, the operation of other concurrent applications, or network connectivity issues, a cross-silo FL client usually has dedicated computing resources, such as an HPC cluster with a job scheduler or cloud computing facilities, where significant changes in computing speed are less frequent and generally arise from events like auto-scaling.
[Table 21: Convergence speed of FedCompass with latest time factors 1.00, 1.20, 1.50, and 2.00, and of FedAT, under client speed changes in the normal and exp client settings. Numeric entries not preserved.]
[Table 22: Corresponding average top validation accuracy and standard deviations for the latest time factor ablation study in Table 21. Numeric entries not preserved.]