pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning
Abstract
Personalized Federated Learning (pFL), which utilizes and deploys distinct local models, has gained increasing attention in recent years due to its success in handling the statistical heterogeneity of FL clients. However, standardized evaluation and systematic analysis of diverse pFL methods remain a challenge. Firstly, the highly varied datasets, FL simulation settings and pFL implementations prevent easy and fair comparisons of pFL methods. Secondly, the current pFL literature diverges in the adopted evaluation and ablation protocols. Finally, the effectiveness and robustness of pFL methods are under-explored in various practical scenarios, such as the generalization to new clients and the participation of resource-limited clients. To tackle these challenges, we propose the first comprehensive pFL benchmark, pFL-Bench, for facilitating rapid, reproducible, standardized and thorough pFL evaluation. The proposed benchmark contains more than 10 dataset variants in various application domains with a unified data partition and realistic heterogeneous settings; a modularized and easy-to-extend pFL codebase with more than 20 competitive pFL method implementations; and systematic evaluations under containerized environments in terms of generalization, fairness, system overhead, and convergence. We highlight the benefits and potential of state-of-the-art pFL methods and hope that pFL-Bench enables further pFL research and broad applications that would otherwise be difficult owing to the absence of a dedicated benchmark. The code is released at https://github.com/alibaba/FederatedScope/tree/master/benchmark/pFL-Bench. We will continuously maintain the benchmark and update the codebase and arXiv version.
1 Introduction
Federated learning (FL) is an emerging machine learning (ML) paradigm, which collaboratively trains models via coordinating certain distributed clients (e.g., smart IoT devices) with a logically centralized aggregator [1, 2]. Due to the benefit that it does not transmit local data and circumvents the high cost and privacy risks of collecting raw sensitive data from clients, FL has gained widespread interest and has been applied in numerous ML tasks such as image classification [3, 4], object detection [5, 6], keyboard suggestion [7, 8], text classification [9], relation extraction [10], speech recognition [11, 12], graph classification [13], recommendation [14, 15], and healthcare [16, 17].
Pioneering FL researchers made great efforts to find a global model that performs well for most FL clients [18, 19, 20, 21]. However, the intrinsic statistical and system heterogeneity of clients limits the performance and applicability of such classical FL methods [22, 23]. Take concept shift as an example, in which the conditional distributions vary across clients that share the same marginal distributions [18]: a single shared global model cannot fit all of these clients well at the same time. Recently, personalized FL (pFL) methods that utilize client-distinct models to overcome these challenges have been gaining popularity, such as those based on multi-task learning [24, 25, 26], meta-learning [27, 28], and transfer learning [29, 30, 31].
Although many pFL methods have been explored, a standard benchmark is still lacking. As a result, pFL methods are currently evaluated with non-standardized datasets and implementations and with highly diverse evaluation protocols, and their effectiveness and robustness under various practical scenarios remain unclear. To be specific:
- Non-standardized datasets and implementations for pFL. Currently, researchers often use custom FL datasets and implementations to evaluate the effectiveness of proposed methods due to the absence of standardized pFL benchmarks. For example, although many pFL works use the same public FEMNIST [32] and CIFAR-10/100 [33] datasets, the partitioning schemes differ: the number of clients for FEMNIST is 205 in [26] but 539 in [34]; and [35] adopts a Dirichlet distribution based partition while [36] uses the pathological partition for CIFAR-10/100. Prior pFL studies also set up different computation environments and simulation settings, increasing the difficulty of fast evaluation and the risk of unfair comparisons.
- Diverse evaluation protocols. The current pFL methods often focus on different views and adopt diverse evaluation protocols, which may lead to isolated development of pFL and prevent pFL research from reaching its full potential. For example, besides the global accuracy improvement, a few works studied the local accuracy evaluation characterized by fairness, and system efficiency in terms of communication and computational costs [26, 37]. Without careful design and control of the evaluation, it is difficult to compare the pros and cons of different pFL methods and to understand how much cost we pay for personalization.
- Under-explored practicability of pFL in various scenarios. Most existing pFL methods examine their effectiveness in several mild Non-IID FL cases [23]. However, it is unclear whether existing pFL methods can consistently work well in more practical scenarios, such as the participation of partial clients in which the clients have spotty connectivity [38]; the participation of resource-limited clients in which the personalization is required to be highly efficient [39]; and the generalization to new clients in which learned models will be applied to new clients that do not participate in the FL process [40].
To quantify the progress in the pFL community and facilitate rapid, reproducible, and generalizable pFL research, we propose the first comprehensive pFL benchmark characterized as follows:
- We provide 4 benchmark families with 12 dataset variants for diverse application domains involving image, text, graph and recommendation data, each with a unified data partition and some realistic Non-IID settings such as client sampling and the participation of new clients. Popular public DataZoos such as LEAF [32], Torchvision [41] and Huggingface datasets [42] are also compatible with the proposed pFL-Bench to enable flexible and easily extended experiments.
- We implement an extensible open-sourced pFL codebase that contains more than 20 pFL methods, providing re-implementations of many state-of-the-art (SOTA) methods, unified interfaces and pluggable personalization sub-modules such as model parameter decoupling, model mixture, meta-learning and personalized regularization.
- We conduct systematic evaluation under unified experimental settings and containerized environments to show the efficacy of pFL-Bench and provide standardized scripts to ensure the reproducibility and maintainability of pFL-Bench. We also highlight the advantages of pFL methods and opportunities for further pFL study in terms of generalization, fairness, system overhead and convergence.
2 Related Works
Personalized Federated Learning.
Despite the promising performance of using a shared global model for all clients as demonstrated in [1, 43, 44, 45, 46, 47], it is challenging to find a converged best-for-all global model under statistical and system heterogeneity among clients [48, 49, 50]. As a natural way to handle the heterogeneity, personalization has been gaining popularity in recent years. A rich body of pFL literature has explored accuracy and convergence improvements based on clustering [51, 48, 52], multi-task learning [53, 54, 55, 56], model mixture [36, 26, 57, 58], model parameter decoupling [37, 6], Bayesian treatment [59, 54], knowledge distillation [60, 31, 61], meta-learning [62, 63, 28, 64, 65], and transfer learning [66, 67, 68]. We refer readers to related FL and pFL survey papers for more details [18, 22, 23]. In pFL-Bench, we provide modularized re-implementations of numerous SOTA pFL methods with several fundamental and pluggable sub-routines for easy and fast pFL research and deployment. We plan to add more pFL methods in the future and also welcome contributions to pFL-Bench.
Federated Learning Benchmark.
We are aware of great efforts on benchmarking FL from various aspects, such as heterogeneous datasets (LEAF [32], TFF [69]), heterogeneous system resources (Flower [70], FedML [71], FedScale [38]), and specific domains (FedNLP [72], FS-G [73]). However, they mostly benchmark general FL algorithms and lack recently proposed pFL methods that perform well in heterogeneous FL scenarios. Besides, no benchmark so far supports the evaluation of generalization to new clients, and few existing benchmarks simultaneously support comprehensive evaluation of the trade-offs among accuracy, fairness and system costs. We hope to close these gaps with the proposed pFL-Bench, and to facilitate further pFL research and broad applications.
3 Background and Problem Formulation
3.1 A Generalized FL Setting
We first introduce some important concepts in FL, taking the FedAvg method [1] as an illustrative example. A typical FL procedure using FedAvg is as follows: each client has its own private dataset, and the goal of FL is to train a single global model through collaborative learning across this set of clients without directly sharing their local data. At each FL round, the server broadcasts the global model to the selected clients, who then perform local learning based on their private local data and upload the local update information (e.g., the gradients of trained models) to the server. After collecting the update information from the clients and aggregating it by averaging, the server applies the updates to the global model for the next round of federation, and the process repeats.
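The FedAvg round described above can be sketched in a few lines. This is a minimal illustration rather than the pFL-Bench implementation: the flat-list model representation, the `local_train` callback and the function names are assumptions made for the example, and clients return full parameters (with their sample counts) instead of gradients.

```python
from typing import Callable, List, Sequence, Tuple

Params = List[float]  # assumption: a model flattened into a list of floats


def weighted_average(updates: Sequence[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(p[k] * n for p, n in updates) / total for k in range(dim)]


def fedavg_round(
    global_params: Params,
    selected_clients: Sequence[str],
    local_train: Callable[[Params, str], Tuple[Params, int]],
) -> Params:
    """One round: broadcast, local training on each selected client, aggregation."""
    # Each client starts from a copy of the broadcast global model.
    updates = [local_train(list(global_params), cid) for cid in selected_clients]
    return weighted_average(updates)
```

Repeating `fedavg_round` with the returned parameters as the next `global_params` gives the round-by-round loop; swapping `weighted_average` for a uniform mean gives the unweighted variant.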
We then present a generalized FL formulation, which underpins the proposed comprehensive benchmark in terms of diverse evaluation and personalization perspectives. Specifically, besides the FL-participated clients, we consider a set of new clients that do not participate in the FL training process. Most FL approaches implicitly solve the following problem:
(1) |
where indicates the loss at data point with model . The term indicates the global objective based on the shared global model with an aggregation function for model parameters, and indicates the local objective based on the local distinct models . We note that and can be in various forms such as uniform averaging or weighted averaging according to local training data size of clients, which corresponds to the commonly used intra-client generalization case [40].
For the latter two terms, usually has similar forms to and measures the generalization to new clients that do not contribute to the FL training process. indicates the modeling of the relationship between the global and local models, such as the norm to regularize the model parameters in Ditto [26]. Besides, different pFL methods may flexibly introduce various constraints on this optimization objective, which we have omitted in Eq.(1) for brevity. The coefficients , , and trade off these terms. For non-personalized FL algorithms, . Although the generalization term has been primarily explored in a few recent studies [40, 35], most existing FL works overlooked it with . Later, we will discuss more instantiations for and in the personalization setting.
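As a reading aid, one way to write an objective with the four terms described above is sketched below. All symbols here are notation assumed for illustration and may differ from Eq. (1): $\theta_g$ for the global model, $\theta_i$ for the local model of client $i$, $\mathcal{C}$ and $\tilde{\mathcal{C}}$ for the participating and new clients, $S_i$ for the local data of client $i$, and $f$, $G$, $L$, $\tilde{G}$, $R$ for the loss and the four terms. In particular, evaluating new clients with $\theta_g$ is an assumption of this sketch.

```latex
\min_{\theta_g,\ \{\theta_i\}_{i\in\mathcal{C}}}\;
  G\Big(\big\{\mathbb{E}_{(x,y)\sim S_i}\, f(\theta_g; x, y)\big\}_{i\in\mathcal{C}}\Big)
  + \alpha\, L\Big(\big\{\mathbb{E}_{(x,y)\sim S_i}\, f(\theta_i; x, y)\big\}_{i\in\mathcal{C}}\Big)
  + \beta\, \tilde{G}\Big(\big\{\mathbb{E}_{(x,y)\sim S_j}\, f(\theta_g; x, y)\big\}_{j\in\tilde{\mathcal{C}}}\Big)
  + \gamma\, R\big(\theta_g, \{\theta_i\}_{i\in\mathcal{C}}\big)
```

Under this reading, a method that ignores new clients corresponds to $\beta = 0$, and a Ditto-style regularizer would take $R$ to be a sum of squared distances between each $\theta_i$ and $\theta_g$.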
3.2 Personalization Setup
With the above generalized formulation, we can see that existing pFL works achieve personalization via multi-granularity objects, including the global model and local models . For example, many two-step pFL approaches first find a strong in the FL training stages, then get local models by fine-tuning on local data and use in inference [23, 74]. A more flexible manner is to directly learn distinct local models in the FL process, while this introduces additional storage and computation costs for clients [75]. Recently, to gain a better accuracy-efficiency trade-off, several pFL works propose to only personalize a sub-module of , and transmit and aggregate the remaining part as [24].
We illustrate some representative personalization operations in pFL works w.r.t. the different choices of the local objective and . Fine-tuning is a basic step widely used in abundant pFL works [63] to minimize , where is usually initialized using the parameters of before fine-tuning. Model mixture is a general pFL approach assuming that the local data distribution is a mixture of underlying data distributions with mixture weight [56], thus learning a group of intermediate models is suitable to handle the data heterogeneity as . For clustering-based pFL methods, indicates the cluster number, and the mixture weight is 0-1 indicative function for belonged clusters [25]. Besides, taking as the reference point, model interpolation and model regularization are also widely explored in the pFL literature [76], where is the regularization factor.
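A minimal sketch of three of the operations above (model interpolation, model regularization, and model mixture) is given below. The flat-list parameters, the function names, the squared-distance regularizer and the parameter-space mixture are assumptions made for illustration; individual pFL methods differ in these details.

```python
from typing import List, Sequence

Params = List[float]  # assumption: a model flattened into a list of floats


def interpolate(local: Params, global_: Params, lam: float) -> Params:
    """Model interpolation: lam * local + (1 - lam) * global."""
    return [lam * l + (1.0 - lam) * g for l, g in zip(local, global_)]


def regularized_step(
    local: Params, global_: Params, grad: Params, lr: float, lam: float
) -> Params:
    """One SGD step on: local loss + (lam / 2) * ||local - global||^2.

    `grad` is the gradient of the local loss at `local`; the regularizer
    adds lam * (local - global), pulling the local model toward the global one.
    """
    return [l - lr * (d + lam * (l - g)) for l, g, d in zip(local, global_, grad)]


def mix_models(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Model mixture: weighted combination of K intermediate models.

    One-hot weights recover the clustering case (pick the client's cluster).
    """
    dim = len(models[0])
    return [sum(w * m[k] for w, m in zip(weights, models)) for k in range(dim)]
```

With `lam = 0` in `regularized_step` the update reduces to plain local fine-tuning starting from whatever `local` was initialized to.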
4 Benchmark Design and Resources
4.1 Datasets and Models
We conduct experiments on 12 publicly available dataset variants with heterogeneous partitions in our benchmark. These datasets are popular in their corresponding fields, and cover a wide range of domains, scales, partition manners and Non-IID degrees. We list the statistics in Table 1 and illustrate the violin plot of data size per client in Figure 1, which show diverse properties across the FL datasets, enabling thorough comparisons among different pFL methods. We provide more detailed descriptions of these datasets in Appendix A. Besides, thanks to its carefully designed modularity, our codebase is compatible with a large number of datasets from other popular public DataZoos, including LEAF [32], Torchvision [41], Huggingface datasets [42] and FederatedScope-GNN [73].
Table 1: Statistics of the datasets in pFL-Bench.
Dataset | Task | Model | Partition By | # Clients | # Samples Per Client (mean) | # Samples Per Client (std) |
FEMNIST | Image Classification | CNN | Writers | 200 | 217 | 73 |
CIFAR10- | Image Classification | CNN | Labels | 100 | 600 | 46 |
CIFAR10- | Image Classification | CNN | Labels | 100 | 600 | 137 |
CIFAR10- | Image Classification | CNN | Labels | 100 | 600 | 383 |
COLA | Linguistic Acceptability | BERT | Labels | 50 | 192 | 159 |
SST-2 | Sentiment Analysis | BERT | Labels | 50 | 1,364 | 1,291 |
 | Sentiment Analysis | LR | Users | 13,203 | 10 | 11 |
Cora | Node Classification | GIN | Community | 5 | 542 | 30 |
Pubmed | Node Classification | GIN | Community | 5 | 3,943 | 34 |
Citeseer | Node Classification | GIN | Community | 5 | 665 | 29 |
MovieLens1M | Recommendation | MF | Users | 1000 | 1,000 | 482 |
MovieLens10M | Recommendation | MF | Items | 1000 | 10,000 | 8,155 |

We preset the widely adopted CNN model [77, 78, 79, 80] with additional batch normalization layers for image datasets, and the pre-trained BERT-Tiny model from [81] and linear regression (LR) model for the textual datasets. For the graph and recommendation datasets, we preset the graph isomorphism neural network, GIN [82], and Matrix Factorization (MF) [83] respectively. It is worth noting that pFL-Bench provides a unified model interface decoupled from FL algorithms, enabling users to easily register and use more customized models or built-in models from existing ModelZoos including Torchvision [41], Huggingface [84], and FederatedScope [13].
Benchmark scenarios.
(Generalization) For a comprehensive evaluation, pFL-Bench supports examining the generalization performance for both FL-participated and FL-non-participated clients, i.e., the new-client generalization term in formulation (1). Specifically, we randomly select 20% of the clients as non-participated clients for each dataset, and these clients do not transmit any training-related messages during the FL process.
(Client Sampling) In addition to generalization performance, we also care about how the pFL methods perform when client sampling is adopted in the FL process, which is useful in cross-device scenarios where a large number of clients have spotty connectivity. For the image, text and recommendation datasets, we uniformly sample 20% of the participating clients without replacement at each FL round.
(Cross-silo vs. cross-device) Note that the adopted datasets have quite different numbers of clients after heterogeneous partition. We choose the three graph datasets with a small number of clients to simulate cross-silo FL scenarios [73, 55], while the other datasets correspond to different scales of cross-device FL scenarios.
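The hold-out of non-participated clients and the per-round client sampling described above can be sketched as follows. The function names, the seeding and the rounding of the 20% fractions are assumptions for illustration, not the benchmark's actual code.

```python
import random
from typing import Iterable, List, Optional, Tuple


def split_clients(
    client_ids: Iterable[int], holdout_frac: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly reserve a fraction of clients that never join FL training."""
    rng = random.Random(seed)
    ids = list(client_ids)
    rng.shuffle(ids)
    n_new = int(round(holdout_frac * len(ids)))
    return ids[n_new:], ids[:n_new]  # (participating, non-participating)


def sample_round(
    participants: List[int], rate: float = 0.2, rng: Optional[random.Random] = None
) -> List[int]:
    """Uniformly sample a fraction of participating clients without replacement."""
    rng = rng or random.Random()
    k = max(1, int(round(rate * len(participants))))
    return rng.sample(participants, k)
```

For a dataset with 200 clients, this keeps 160 clients for training, evaluates generalization on the other 40, and selects 32 of the 160 at each round.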
4.2 Methods
We consider a broad set of methods for extensive pFL comparisons. pFL-Bench provides unified and modularized interfaces for a range of popular and SOTA methods in the following three categories:
Non-pFL methods. As two naive methods sitting on opposite ends of the local-global spectrum, we evaluate the Global-Train method, which trains the model on centralized data merged from all clients, and the Isolated method, which trains a separate model for each client without any information transmission among clients. Besides, we consider the classical FedAvg [1] with weighted averaging based on local data size, FedProx [85], which introduces a proximal term during the local training process, and FedOpt [43], which generalizes FedAvg by introducing an optimizer for the FL server.
pFL methods. We compare several SOTA methods, including FedBN [86], a simple yet effective method for dealing with feature-shift Non-IID data by maintaining the clients’ batch normalization parameters locally, without transmitting or aggregating them; Ditto [26], which improves the fairness and robustness of FL by training a local personalized model and the global model simultaneously, where the local model update is regularized toward the global model parameters; pFedMe [78], a meta-learning based method that decouples the personalized model and the global model with Moreau envelopes; HypCluster [74], which splits users into clusters and learns different models for different clusters; and FedEM [56], which assumes that the local data distribution is a mixture of unknown underlying distributions and correspondingly learns a mixture of multiple intermediate models with the Expectation-Maximization algorithm.
Combined variants. Note that pFL-Bench provides pluggable re-implementations of existing methods, enabling users to pick different personalized behaviors and different personalized objects to form a new pFL variant. Here we combine FedBN, FedOpt, and Fine-tuning (FT) with other compatible methods, and provide fine-grained ablations via more than 20 method variants for systematic pFL study and exploration of pFL potential. More details of the considered methods are in Appendix B.
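To make the parameter-decoupling idea behind FedBN and its combined variants concrete, the sketch below keeps normalization parameters local and shares everything else. It is an illustration only: the dictionary-based model state, the name-based filter and the `"bn"` keyword are assumptions, not the interfaces of pFL-Bench or FederatedScope.

```python
from typing import Dict, List, Sequence

State = Dict[str, List[float]]  # assumption: parameter name -> flat values


def shared_part(state: State, local_keywords: Sequence[str] = ("bn",)) -> State:
    """Return only the parameters that are uploaded and aggregated.

    Parameters whose names contain any keyword (e.g. batch-norm layers)
    stay on the client and are never transmitted.
    """
    return {
        name: value
        for name, value in state.items()
        if not any(key in name for key in local_keywords)
    }


def merge_global(local_state: State, global_shared: State) -> State:
    """Overwrite shared parameters with the aggregated ones, keep local ones."""
    merged = dict(local_state)
    merged.update(global_shared)
    return merged
```

Because the filter is just a list of keywords, the same routine can be combined with other operators, for example running fine-tuning on the merged state or excluding a different family of layers.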
4.3 Evaluation Criteria
We propose a unified and comprehensive evaluation protocol in pFL-Bench, in which evaluations from multiple perspectives are taken into consideration. (1) For the generalization examination, we support monitoring on both the server and client sides, with various and extensible metric aggregation manners, and a wide range of metrics for diverse tasks such as classification, regression and ranking. (2) We also report several fairness-related metrics, including the standard deviation and the top and bottom deciles of performance across different clients. (3) Numerous system-related metrics are considered as well, including the process running time, the memory cost w.r.t. average and peak memory usage, the total computational cost w.r.t. FLOPs on the server and clients, the communication cost w.r.t. the total number of downloaded/uploaded bytes, and the number of FL rounds to convergence. Detailed descriptions can be found in Appendix B.
4.4 Codebase
To facilitate innovation in the pFL community, pFL-Bench contains a well-modularized, easy-to-extend codebase for standardizing the implementation, evaluation, and ablation of pFL methods.
pFL implementations.
We build pFL-Bench upon FederatedScope (FS) [87], an event-driven FL framework that abstracts FL information transmission and processing as message passing and several pluggable subroutines. With the help of FS, we eliminate the cumbersome engineering for coordinating FL participants, and customize many message handlers and subroutines, such as model parameter decoupling, model mixture, local fine-tuning, meta-learning and regularization for personalization. By combining these useful and pluggable components, we re-implement a number of SOTA pFL methods with unified and extensible interfaces. This modularity also makes pFL methods convenient to use and makes contributing new pFL methods easy and flexible. We release the code under the Apache License 2.0 and will continuously include more pFL methods.
Reproducibility.
To enable easily reproducible research, we conduct experiments in containerized environments and provide standardized and documented evaluation procedures for prescribed metrics. For fair comparisons, we search for the optimal hyper-parameters using the validation sets for all methods, with early stopping and a large number of total FL rounds. We run experiments 3 times with the optimal configurations and report the average results. All the experiments are conducted on a cluster of 8 Tesla V100 and 64 NVIDIA GTX 1080 Ti GPUs, amounting to 13,000 runs with a total of 112 days of process computing time. More details, such as hyper-parameter search spaces, can be found in Appendix C.
5 Experimental Results and Analysis
To demonstrate the utility of pFL-Bench in providing fair, comprehensive, and rigorous comparisons among pFL methods, we conduct extensive experiments and present some main results in terms of generalization, fairness and efficiency under various FL datasets and scenarios. The complete experimental results are presented in Appendix D due to the space limitation.
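Table 2: Accuracy of participated and un-participated clients and the generalization gap between them.
Method | FEMNIST: participated | FEMNIST: un-participated | FEMNIST: gap | SST-2: participated | SST-2: un-participated | SST-2: gap | PUBMED: participated | PUBMED: un-participated | PUBMED: gap |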
Global-Train | 74.51 | - | - | 80.57 | - | - | 87.01 | - | - |
Isolated | 68.74 | - | - | 60.82 | - | - | 85.56 | - | - |
FedAvg | 83.97 | 81.97 | -2.00 | 74.88 | 80.24 | 5.36 | 87.27 | 72.63 | -14.64 |
FedAvg-FT | 86.44 | 84.94 | -1.50 | 74.14 | 83.28 | 9.13 | 87.21 | 79.78 | -7.43 |
FedProx | 84.10 | 81.49 | -2.61 | 74.36 | 79.20 | 4.84 | 87.23 | 75.02 | -12.21 |
FedProx-FT | 87.34 | 85.27 | -2.08 | 79.94 | 80.48 | 0.59 | 88.24 | 79.12 | -9.12 |
pFedMe | 87.50 | 82.76 | -4.73 | 71.27 | 69.34 | -1.92 | 86.91 | 71.64 | -15.27 |
pFedMe-FT | 88.19 | 82.46 | -5.73 | 75.61 | 66.48 | -9.13 | 85.71 | 77.07 | -8.64 |
HypCluster | 83.80 | 81.88 | -1.92 | 46.26 | 61.32 | 15.05 | 87.20 | 75.37 | -11.83 |
HypCluster-FT | 87.79 | 85.67 | -2.12 | 52.46 | 78.67 | 26.21 | 86.43 | 76.69 | -9.74 |
FedBN | 86.72 | 7.86 | -78.86 | 74.88 | 75.40 | 0.52 | 88.49 | 52.53 | -35.95 |
FedBN-FT | 88.51 | 82.87 | -5.64 | 68.81 | 82.43 | 13.63 | 87.45 | 80.36 | -7.09 |
FedBN-FedOPT | 88.25 | 8.77 | -79.49 | 64.70 | 65.50 | 0.81 | 87.87 | 42.72 | -45.15 |
FedBN-FedOPT-FT | 88.14 | 80.25 | -7.88 | 68.65 | 70.56 | 1.91 | 87.54 | 77.07 | -10.47 |
Ditto | 88.39 | 2.20 | -86.19 | 52.03 | 46.79 | -5.24 | 87.27 | 2.84 | -84.43 |
Ditto-FT | 85.72 | 56.96 | -28.76 | 56.49 | 65.50 | 9.01 | 87.47 | 35.03 | -52.44 |
Ditto-FedBN | 88.94 | 2.20 | -86.74 | 56.03 | 46.79 | -9.24 | 88.18 | 2.84 | -85.34 |
Ditto-FedBN-FT | 86.53 | 58.96 | -27.57 | 53.15 | 66.49 | 13.34 | 87.83 | 28.52 | -59.30 |
Ditto-FedBN-FedOpt | 88.73 | 2.20 | -86.54 | 57.67 | 46.79 | -10.88 | 87.81 | 2.84 | -84.97 |
Ditto-FedBN-FedOpt-FT | 87.02 | 55.22 | -31.80 | 52.89 | 66.49 | 13.60 | 87.60 | 18.18 | -69.42 |
FedEM | 84.35 | 82.81 | -1.54 | 75.78 | 67.67 | -8.11 | 85.64 | 71.12 | -14.52 |
FedEM-FT | 86.17 | 85.01 | -1.16 | 64.86 | 81.63 | 16.77 | 85.88 | 78.08 | -7.80 |
FedEM-FedBN | 84.37 | 12.88 | -71.49 | 75.43 | 62.81 | -12.62 | 88.12 | 48.64 | -39.48 |
FedEM-FedBN-FT | 88.29 | 83.96 | -4.33 | 64.96 | 81.04 | 16.08 | 86.38 | 72.02 | -14.35 |
FedEM-FedBN-FedOPT | 82.12 | 6.64 | -75.48 | 72.25 | 64.69 | -7.56 | 87.56 | 42.37 | -45.19 |
FedEM-FedBN-FedOPT-FT | 87.54 | 85.76 | -1.79 | 62.26 | 73.87 | 11.61 | 87.49 | 72.39 | -15.09 |
5.1 Generalization
We present the accuracy results for both participated and un-participated clients in Table 2 on the FEMNIST, SST-2 and PUBMED datasets. For each dataset, we report the aggregated accuracy weighted by the local data samples of the participated clients and of the un-participated clients respectively, together with the generalization gap between the two.
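For concreteness, the two aggregated accuracies and the gap can be computed as in the sketch below. The function names are hypothetical; the sign convention (un-participated minus participated) is inferred from the rows of Table 2, e.g. FedAvg on FEMNIST with 83.97, 81.97 and -2.00.

```python
from typing import Sequence


def weighted_accuracy(accs: Sequence[float], sizes: Sequence[int]) -> float:
    """Aggregate per-client accuracy, weighting each client by its sample count."""
    return sum(a * n for a, n in zip(accs, sizes)) / sum(sizes)


def generalization_gap(acc_participated: float, acc_unparticipated: float) -> float:
    """Gap between un-participated and participated clients (negative = worse on new clients)."""
    return acc_unparticipated - acc_participated
```

A strongly negative gap, as for some rows in Table 2, means the learned models transfer poorly to clients that never took part in training.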
Comparison of original methods.
For the original methods without combination (“-”), we mark the best and second-best results as red and blue respectively in Table 2. Notably, we find that no method consistently beats the others across all metrics and all datasets. The pFL methods perform significantly better than FedAvg in some cases (e.g., Ditto on FEMNIST), showing the effectiveness of personalization. However, the advantages of pFL methods on un-participated clients’ generalization are relatively smaller than those on intra-client generalization. Ditto and FedBN even fail to achieve reasonable accuracy on un-participated clients for the FEMNIST and PUBMED datasets. As these methods did not discuss how to apply them to unseen clients, we kept the inference behavior of the original algorithms in order to respect the original methods. Examining their running dynamics, we find that their local models diverge on the un-participated clients’ data. Besides, there are still performance gaps between pFL methods and the Global-Train method on the textual dataset. These observations demonstrate plenty of room for pFL improvement.
Comparison of pFL variants.
We then extend the comparison to include FL variant methods incorporating other compatible personalized operators or methods. In a nutshell, almost all the best and second-best results (marked in bold and underlined, respectively) are achieved by the combined variants, showing the efficacy and flexibility of our benchmark, and the potential of new pFL methods. Among these methods, fine-tuning (FT) effectively improves the accuracy on both participated and un-participated clients in most cases, even for pFL methods that have implicitly adopted a local training process, which calls for a deeper understanding of how much we should fit the local data and how the personalized fitting impacts the FL dynamics. Besides, FedBN is a simple yet effective method to improve the accuracy on participated clients, while it may have a negative effect on un-participated clients (e.g., FedEM vs. FedEM-FT), since the feature space of un-participated clients lacks informative characterization and the frozen BN parameters of the global model can arbitrarily mismatch these un-participated clients. We also find that FedBN is much more effective on graph data with the GIN model than on textual data with the BERT model, for which we made a simple modification that filters out the Layer Normalization parameters, showing the opportunity of designing domain- and model-specific pFL techniques.

Effect of Non-IID split.
To gain further insight into the pFL methods, we vary the Non-IID degree with different Dirichlet factors for the CIFAR10 dataset and illustrate the results in Figure 2(a). Generally speaking, almost all methods suffer performance degradation as the Non-IID degree increases (from a Dirichlet factor of 5 to 0.1). Besides, we can see that most pFL methods show superior accuracy and robustness over FedAvg, especially in the highly heterogeneous case with a Dirichlet factor of 0.1. These results indicate the benefit of pFL methods in Non-IID situations, as well as their substantial room for improvement.
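A common way to realize a label-based Dirichlet split is sketched below: for every class, the share of its samples given to each client is drawn from a symmetric Dirichlet distribution, so a smaller factor yields more skewed label distributions. This is a generic, standard-library sketch under that assumption; the function names are hypothetical and the partitioner used in pFL-Bench may differ in details such as minimum client sizes.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def dirichlet(alpha: float, k: int, rng: random.Random) -> List[float]:
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws) or 1.0  # guard against an all-zero underflow
    return [d / total for d in draws]


def dirichlet_partition(
    labels: Sequence[Hashable], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Assign sample indices to clients with per-class Dirichlet proportions."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        props = dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c, p in enumerate(props):
            cum += p
            # The last client takes the remainder so every sample is assigned.
            end = len(idxs) if c == num_clients - 1 else min(len(idxs), int(round(cum * len(idxs))))
            clients[c].extend(idxs[start:end])
            start = end
    return clients
```

Calling `dirichlet_partition(labels, 100, 0.1)` and `dirichlet_partition(labels, 100, 5)` on the same labels gives a quick way to compare how concentrated each client's label histogram becomes.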
Effect of client sampling.
We also vary the client sampling rate and present the generalization results on the FEMNIST dataset in Figure 2(b). In summary, most pFL methods still achieve better results than FedAvg as the sampling rate decreases. However, the advantages diminish for un-participated clients, and several pFL methods are prone to fail with small sampling rates. In addition, we observe that FT increases the performance variance in client sampling cases. Although several pFL works provide convergence guarantees under mild assumptions, there remain open questions about the theoretical impact of clients with spotty connections [88] in the personalization case, and about the design of robust pFL algorithms.

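Table 3: Equally-weighted average accuracy, standard deviation across clients, and bottom accuracy.
Method | FEMNIST: average | FEMNIST: std | FEMNIST: bottom | SST-2: average | SST-2: std | SST-2: bottom | PUBMED: average | PUBMED: std | PUBMED: bottom |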
Isolated | 67.08 | 10.76 | 53.16 | 59.40 | 41.29 | 0.00 | 84.67 | 6.26 | 74.63 |
FedAvg | 82.40 | 9.91 | 69.11 | 76.30 | 22.02 | 44.85 | 86.72 | 3.93 | 79.76 |
FedAvg-FT | 85.17 | 8.69 | 72.34 | 75.36 | 27.67 | 31.08 | 86.71 | 3.86 | 80.57 |
FedProx | 82.37 | 10.25 | 68.57 | 72.25 | 22.03 | 43.51 | 88.06 | 3.62 | 80.08 |
FedProx-FT | 86.06 | 7.89 | 74.60 | 58.72 | 37.58 | 4.17 | 87.87 | 4.02 | 78.80 |
pFedMe | 86.50 | 8.52 | 75.00 | 65.08 | 26.59 | 27.75 | 86.35 | 4.43 | 78.76 |
pFedMe-FT | 87.06 | 8.02 | 75.00 | 74.36 | 27.02 | 32.49 | 85.47 | 3.06 | 80.95 |
HypCluster | 82.34 | 9.72 | 68.57 | 57.29 | 39.27 | 0.00 | 86.82 | 5.06 | 77.66 |
HypCluster-FT | 86.58 | 7.84 | 75.71 | 56.47 | 42.59 | 0.00 | 85.97 | 5.69 | 76.45 |
FedBN | 85.38 | 8.19 | 74.26 | 76.30 | 22.02 | 44.85 | 87.97 | 3.42 | 81.77 |
FedBN-FT | 87.65 | 6.33 | 80.02 | 68.50 | 26.83 | 29.17 | 87.02 | 3.47 | 80.13 |
FedBN-FedOPT | 87.27 | 7.34 | 76.87 | 65.59 | 31.07 | 22.22 | 87.43 | 4.64 | 80.81 |
FedBN-FedOPT-FT | 87.13 | 7.36 | 78.27 | 68.42 | 28.18 | 30.71 | 87.02 | 3.94 | 81.78 |
Ditto | 87.18 | 7.52 | 78.23 | 49.94 | 40.81 | 0.00 | 86.85 | 3.98 | 80.44 |
Ditto-FT | 84.30 | 8.16 | 73.95 | 54.34 | 39.26 | 0.00 | 87.10 | 3.52 | 80.46 |
Ditto-FedBN | 87.82 | 7.19 | 77.78 | 49.44 | 41.80 | 0.00 | 87.75 | 3.70 | 81.82 |
Ditto-FedBN-FT | 85.16 | 7.98 | 75.25 | 52.18 | 39.85 | 0.00 | 87.43 | 3.77 | 81.15 |
Ditto-FedBN-FedOpt | 87.64 | 7.08 | 78.23 | 55.61 | 40.43 | 1.39 | 87.27 | 3.90 | 79.14 |
Ditto-FedBN-FedOpt-FT | 85.71 | 7.91 | 75.81 | 53.16 | 34.75 | 9.72 | 87.10 | 3.79 | 80.93 |
FedEM | 82.61 | 9.57 | 69.29 | 76.53 | 23.34 | 44.44 | 85.05 | 4.44 | 78.51 |
FedEM-FT | 84.91 | 8.39 | 73.64 | 64.29 | 32.84 | 12.96 | 85.54 | 4.48 | 79.39 |
FedEM-FedBN | 82.94 | 9.35 | 70.43 | 75.06 | 18.48 | 53.33 | 87.63 | 4.14 | 82.54 |
FedEM-FedBN-FT | 87.09 | 9.24 | 76.36 | 64.33 | 35.72 | 8.59 | 85.68 | 4.33 | 79.44 |
FedEM-FedBN-FedOPT | 80.48 | 11.02 | 64.84 | 72.66 | 27.18 | 34.17 | 87.11 | 4.24 | 80.32 |
FedEM-FedBN-FedOPT-FT | 86.23 | 8.33 | 75.44 | 58.42 | 31.21 | 17.93 | 87.16 | 3.66 | 82.20 |
5.2 Fairness Study
We then empirically investigate what degree of fairness can be achieved by pFL methods, and report the equally-weighted average accuracy, the standard deviation across evaluated clients, and the bottom individual accuracy in Table 3. These metrics are considered as the fairness criteria in related pFL works [26, 56]. We find that the equally-weighted average accuracy is usually smaller than the one weighted by local data size (in Table 2), indicating the client bias in existing pFL evaluation. Across the three datasets, the standard deviations on SST-2 are much larger (in the tens) than those on the image dataset (6.33 to 11.02), which are in turn larger than those on the graph dataset (3.06 to 6.26), leaving room for further research to understand this difference and improve fairness in various application domains. Interestingly, compared with FedAvg, pFL methods can effectively improve the bottom accuracy, although they may yield larger standard deviations. An exception is the Ditto-based methods on SST-2, as the parameter regularization in Ditto may fail for the complex BERT model. Besides, the Isolated method performs badly for clients with a very small data size (a few dozen samples), even reaching a bottom accuracy of 0 on SST-2. We find that most of the other methods achieve much better results than it, verifying the benefits of transmitting knowledge across clients.
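The three fairness statistics can be computed from per-client accuracies as sketched below. The choices made here are assumptions for illustration: the population standard deviation is used, and the bottom accuracy is taken as the nearest-rank 10th percentile (the bottom decile); the exact definitions used for Table 3 may differ.

```python
import statistics
from typing import Dict, Sequence


def fairness_summary(client_accs: Sequence[float]) -> Dict[str, float]:
    """Equally-weighted mean, standard deviation and bottom-decile accuracy."""
    accs = sorted(client_accs)
    rank = (len(accs) + 9) // 10  # ceil(n / 10) with integer arithmetic
    return {
        "mean": statistics.mean(accs),    # every client counts equally
        "std": statistics.pstdev(accs),   # spread across clients
        "bottom": accs[rank - 1],         # nearest-rank 10th percentile
    }
```

Comparing `"mean"` here with a sample-size-weighted average shows how much clients with little data pull the equally-weighted figure down.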

5.3 Efficiency
To quantify the system overhead that personalization introduces into the FL process, we count the total FLOPs, communication bytes and convergence rounds, and demonstrate the trade-off between these metrics and accuracy in Figure 2(c). Not surprisingly, pFL methods usually incur much larger computation and communication costs than non-personalized methods, which calls for more careful and efficient designs in further pFL research for resource-limited scenarios. Another interesting observation is that when combined with FT, or with FedOpt, which aggregates the clients’ model updates as pseudo-gradients for the global model, the convergence speed improves for some methods such as FedEM and Ditto, showing the potential of co-optimizing from the server and local client sides.
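The pseudo-gradient view of FedOpt mentioned above can be illustrated with a server-side SGD-with-momentum step. This is a simplified sketch: the uniform client averaging, the choice of momentum SGD as the server optimizer, and the function name are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

Params = List[float]  # assumption: a model flattened into a list of floats


def fedopt_server_step(
    global_params: Params,
    client_params: Sequence[Params],
    server_lr: float = 1.0,
    momentum: float = 0.0,
    velocity: Optional[Params] = None,
) -> Tuple[Params, Params]:
    """Treat (global - average of client models) as a gradient for the server optimizer."""
    dim = len(global_params)
    avg = [sum(p[k] for p in client_params) / len(client_params) for k in range(dim)]
    pseudo_grad = [g - a for g, a in zip(global_params, avg)]
    velocity = velocity if velocity is not None else [0.0] * dim
    velocity = [momentum * v + d for v, d in zip(velocity, pseudo_grad)]
    new_global = [g - server_lr * v for g, v in zip(global_params, velocity)]
    return new_global, velocity
```

With `server_lr = 1.0` and `momentum = 0.0` the step simply replaces the global model by the client average, which is how this formulation generalizes FedAvg.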
5.4 More Experiments
Due to the space limitation, we present more experimental results in Appendix D, including results in terms of generalization (Sec.D.1), fairness (Sec.D.2) and efficiency (Sec.D.3) for all the datasets in Table 1. To demonstrate the potential and ease of extensibility of pFL-Bench, we also conduct experiments in the scenario of heterogeneous device resources based on FedScale [38] in Sec.D.4, where we adopt the over-selection mechanism for the server and a temporal event simulator [89]. The simulator executes the behaviors of clients according to the virtual timestamps of their message delivery to the server, and the virtual timestamps are updated by the estimated execution time based on different clients’ computational and communication capacities. This enables us to simulate different response speeds and participation degrees of clients, which correspond to heterogeneous real-world mobile devices. Moreover, in Sec.D.5, we show that pFL-Bench supports the exploration of trade-offs between pFL and privacy protection techniques, and conduct demonstrative experiments with Differential Privacy [90].
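The combination of over-selection and virtual timestamps can be pictured with a small priority-queue simulation: the server selects more clients than it needs, each client's reply is stamped with its estimated compute plus communication time, and only the earliest replies are aggregated. The data layout and the function name below are hypothetical and do not reflect the simulator of [89].

```python
import heapq
from typing import Dict, List, Tuple


def simulate_round(
    clients: Dict[str, Tuple[float, float]], num_needed: int
) -> Tuple[List[str], float]:
    """Return the first `num_needed` responders and the virtual time of the last one.

    `clients` maps a client id to (estimated compute time, estimated
    communication time); their sum is the virtual delivery timestamp.
    """
    events = [(compute + comm, cid) for cid, (compute, comm) in clients.items()]
    heapq.heapify(events)  # earliest virtual timestamp first

    received: List[str] = []
    now = 0.0
    while events and len(received) < num_needed:
        now, cid = heapq.heappop(events)
        received.append(cid)
    return received, now
```

Slow clients that are never popped in a round model devices whose updates arrive too late to be used, which is how different participation degrees emerge from the capacity estimates.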
6 Conclusions
In this paper, we propose a comprehensive, standardized, and extensible benchmark for personalized Federated Learning (pFL), pFL-Bench, which contains 12 dataset variants with a wide range of domains and unified partitions, and more than 20 pFL methods with pluggable and easy-to-extend pFL subroutines. We conduct extensive and systematic comparisons and conclude that designing effective, efficient and robust pFL methods with good generalization and convergence remains challenging. We release pFL-Bench with guaranteed maintenance for the community, and believe that it will benefit reproducible, easy and generalizable pFL research and potential applications. We also welcome contributions of new pFL methods and datasets to keep pFL-Bench up-to-date.
References
- [1] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 1273–1282, 2017.
- [2] Junyuan Hong, Zhuangdi Zhu, Shuyang Yu, Zhangyang Wang, Hiroko H. Dodge, and Jiayu Zhou. Federated adversarial debiasing for fair and transferable representations. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 617–627, 2021.
- [3] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint, abs/1806.00582, 2018.
- [4] Jaehoon Oh, Sangmook Kim, and Se-Young Yun. Fedbabu: Toward enhanced representation for federated image classification. In The Tenth International Conference on Learning Representations, 2022.
- [5] Yang Liu, Anbu Huang, Yun Luo, He Huang, Youzhi Liu, Yuanyuan Chen, Lican Feng, Tianjian Chen, Han Yu, and Qiang Yang. Fedvision: An online visual object detection platform powered by federated learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 13172–13179, 2020.
- [6] Benyuan Sun, Hongxing Huo, Yi Yang, and Bo Bai. Partialfed: Cross-domain personalized federated learning via partial initialization. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 23309–23320, 2021.
- [7] Andrew Hard, Kanishka Rao, Rajiv Mathews, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint, abs/1811.03604, 2018.
- [8] Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Françoise Beaufays. Applied federated learning: Improving google keyboard query suggestions. arXiv preprint, abs/1812.02903, 2018.
- [9] Xinghua Zhu, Jianzong Wang, Zhenhou Hong, and Jing Xiao. Empirical studies of institutional federated learning for natural language processing. In Findings of the Association for Computational Linguistics: EMNLP, pages 625–634, 2020.
- [10] Dianbo Sui, Yubo Chen, Jun Zhao, Yantao Jia, Yuantao Xie, and Weijian Sun. Feded: Federated learning via ensemble distillation for medical relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2118–2128, 2020.
- [11] Matthias Paulik, Matt Seigel, Henry Mason, Dominic Telaar, Joris Kluivers, Rogier C. van Dalen, Chi Wai Lau, Luke Carlson, Filip Granqvist, Chris Vandevelde, Sudeep Agarwal, Julien Freudiger, Andrew Byde, Abhishek Bhowmick, Gaurav Kapoor, Si Beaumont, Áine Cahill, Dominic Hughes, Omid Javidbakht, Fei Dong, Rehan Rishi, and Stanley Hung. Federated evaluation and tuning for on-device personalization: System design & applications. arXiv preprint, abs/2102.08503, 2021.
- [12] Dhruv Guliani, Françoise Beaufays, and Giovanni Motta. Training speech recognition models with federated learning: A quality/cost framework. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3080–3084, 2021.
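- [13] Han Xie, Jing Ma, Li Xiong, and Carl Yang. Federated graph classification over non-iid graphs. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 18839–18852, 2021.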
- [14] Khalil Muhammad, Qinqin Wang, Diarmuid O’Reilly-Morgan, Elias Z. Tragos, Barry Smyth, Neil Hurley, James Geraci, and Aonghus Lawlor. Fedfast: Going beyond average for faster training of federated recommender systems. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1234–1242, 2020.
- [15] Chuhan Wu, Fangzhao Wu, Yang Cao, Yongfeng Huang, and Xing Xie. Fedgnn: Federated graph neural network for privacy-preserving recommendation. arXiv preprint, abs/2102.04925, 2021.
- [16] Qian Yang, Jianyi Zhang, Weituo Hao, Gregory P. Spell, and Lawrence Carin. FLOP: federated learning on medical datasets using partial networks. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3845–3853, 2021.
- [17] Ittai Dayan, Holger R Roth, Aoxiao Zhong, Ahmed Harouni, Amilcare Gentili, Anas Z Abidin, Andrew Liu, Anthony Beardsworth Costa, Bradford J Wood, Chien-Sung Tsai, et al. Federated learning for predicting clinical outcomes in patients with covid-19. Nature medicine, 27(10):1735–1743, 2021.
- [18] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Hang Qi, Daniel Ramage, Ramesh Raskar, Mariana Raykova, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Found. Trends Mach. Learn., 14(1-2):1–210, 2021.
- [19] Chuizheng Meng, Sirisha Rambhatla, and Yan Liu. Cross-node federated graph neural network for spatio-temporal data modeling. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1202–1211, 2021.
- [20] Fuxun Yu, Weishan Zhang, Zhuwei Qin, Zirui Xu, Di Wang, Chenchen Liu, Zhi Tian, and Xiang Chen. Fed2: Feature-aligned federated learning. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2066–2074, 2021.
- [21] Viraaji Mothukuri, Reza M. Parizi, Seyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and Gautam Srivastava. A survey on security and privacy of federated learning. Future Gener. Comput. Syst., 115:619–640, 2021.
- [22] Viraj Kulkarni, Milind Kulkarni, and Aniruddha Pant. Survey of personalization techniques for federated learning. arXiv preprint, abs/2003.08673, 2020.
- [23] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. arXiv preprint, abs/2103.00710, 2021.
- [24] Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 2089–2099, 2021.
- [25] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [26] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 6357–6368, 2021.
- [27] Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [28] Alireza Fallah, Aryan Mokhtari, and Asuman E. Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [29] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol., 10(2):12:1–12:19, 2019.
- [30] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 5132–5143, 2020.
- [31] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 12878–12889, 2021.
- [32] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
- [33] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- [34] Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 15434–15447, 2021.
- [35] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 9489–9502, 2021.
- [36] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M. Alvarez. Personalized federated learning with first order model optimization. In 9th International Conference on Learning Representations, 2021.
- [37] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In 9th International Conference on Learning Representations, 2021.
- [38] Fan Lai, Yinwei Dai, Sanjay Sri Vallabh Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. Fedscale: Benchmarking model and system performance of federated learning at scale. In International Conference on Machine Learning, volume 162, pages 11814–11827, 2022.
- [39] Fan Lai, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. Oort: Efficient federated learning via guided participant selection. In 15th USENIX Symposium on Operating Systems Design and Implementation, pages 19–35, 2021.
- [40] Honglin Yuan, Warren Richard Morningstar, Lin Ning, and Karan Singhal. What do we mean by generalization in federated learning? In International Conference on Learning Representations, 2022.
- [41] Sébastien Marcel and Yann Rodriguez. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, page 1485–1488, 2010.
- [42] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, 2021.
- [43] Muhammad Asad, Ahmed Moustafa, and Takayuki Ito. Fedopt: Towards communication efficiency and privacy preservation in federated learning. Applied Sciences, 10(8):2864, 2020.
- [44] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [45] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for on-device federated learning. arXiv preprint, abs/1910.06378, 2019.
- [46] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020.
- [47] Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. FL-NTK: A neural tangent kernel-based framework for federated learning analysis. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4423–4434, 2021.
- [48] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Trans. Neural Networks Learn. Syst., 32(8):3710–3722, 2021.
- [49] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag., 37(3):50–60, 2020.
- [50] Zheng Chai, Hannan Fayyaz, Zeshan Fayyaz, Ali Anwar, Yi Zhou, Nathalie Baracaldo, Heiko Ludwig, and Yue Cheng. Towards taming the resource and data heterogeneity in federated learning. In 2019 USENIX Conference on Operational Machine Learning (OpML 19), pages 19–21, 2019.
- [51] Christopher Briggs, Zhong Fan, and Peter Andras. Federated learning with hierarchical clustering of local updates to improve training on non-iid data. In 2020 International Joint Conference on Neural Networks, pages 1–9, 2020.
- [52] Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. Tifl: A tier-based federated learning system. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pages 125–136, 2020.
- [53] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pages 4424–4434, 2017.
- [54] Luca Corinzia and Joachim M. Buhmann. Variational federated multi-task learning, 2019.
- [55] Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and Yong Zhang. Personalized cross-silo federated learning on non-iid data. In Thirty-Fifth AAAI Conference on Artificial Intelligence, pages 7865–7873, 2021.
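- [56] Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 15434–15447, 2021.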
- [57] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M. Alvarez. Personalized federated learning with first order model optimization. In 9th International Conference on Learning Representations, 2021.
- [58] Filip Hanzely, Slavomír Hanzely, Samuel Horváth, and Peter Richtárik. Lower bounds and optimal algorithms for personalized federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [59] Idan Achituve, Aviv Shamsian, Aviv Navon, Gal Chechik, and Ethan Fetaya. Personalized federated learning with gaussian processes. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 8392–8406, 2021.
- [60] Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [61] Kaan Ozkara, Navjot Singh, Deepesh Data, and Suhas N. Diggavi. Quped: Quantized personalization via distillation with applications to federated learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 3622–3634, 2021.
- [62] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, pages 5915–5926, 2019.
- [63] Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. arXiv preprint, abs/1909.12488, 2019.
- [64] Karan Singhal, Hakim Sidahmed, Zachary Garrett, Shanshan Wu, John Rush, and Sushant Prakash. Federated reconstruction: Partially local federated learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 11220–11232, 2021.
- [65] Durmus Alp Emre Acar, Yue Zhao, Ruizhao Zhu, Ramon Matas Navarro, Matthew Mattina, Paul N. Whatmough, and Venkatesh Saligrama. Debiasing model updates for improving personalized federated training. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 21–31, 2021.
- [66] Hongwei Yang, Hui He, Weizhe Zhang, and Xiaochun Cao. Fedsteg: A federated transfer learning framework for secure image steganalysis. IEEE Trans. Netw. Sci. Eng., 8(2):1084–1094, 2021.
- [67] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [68] Jie Zhang, Song Guo, Xiaosong Ma, Haozhao Wang, Wenchao Xu, and Feijie Wu. Parameterized knowledge transfer for personalized federated learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 10092–10104, 2021.
- [69] Kallista A. Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In Proceedings of Machine Learning and Systems, volume 1, pages 374–388, 2019.
- [70] Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D Lane. Flower: A friendly federated learning research framework. arXiv preprint arXiv:2007.14390, 2020.
- [71] Chaoyang He, Songze Li, Jinhyun So, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, Li Shen, Peilin Zhao, Yan Kang, Yang Liu, Ramesh Raskar, Qiang Yang, Murali Annavaram, and Salman Avestimehr. Fedml: A research library and benchmark for federated machine learning. arXiv preprint, abs/2007.13518, 2020.
- [72] Bill Yuchen Lin, Chaoyang He, Zihang Zeng, Hulin Wang, Yufen Huang, Mahdi Soltanolkotabi, Xiang Ren, and Salman Avestimehr. Fednlp: A research platform for federated learning in natural language processing. arXiv preprint, abs/2104.08815, 2021.
- [73] Zhen Wang, Weirui Kuang, Yuexiang Xie, Liuyi Yao, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope-gnn: Towards a unified, comprehensive and efficient package for federated graph learning. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4110–4120, 2022.
- [74] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020.
- [75] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
- [76] Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models. arXiv preprint, abs/2002.05516, 2020.
- [77] Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In 9th International Conference on Learning Representations, 2021.
- [78] Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
- [79] Paul Pu Liang, Terrance Liu, Ziyin Liu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint, abs/2001.01523, 2020.
- [80] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 9489–9502, 2021.
- [81] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962v2, 2019.
- [82] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.
- [83] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
- [84] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
- [85] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020.
- [86] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Federated learning on non-iid features via local batch normalization. In 9th International Conference on Learning Representations, 2021.
- [87] Yuexiang Xie, Zhen Wang, Daoyuan Chen, Dawei Gao, Liuyi Yao, Weirui Kuang, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope: A comprehensive and flexible federated learning platform via message passing. arXiv preprint, abs/2204.05011, 2022.
- [88] John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pages 3581–3607, 28–30 Mar 2022.
- [89] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In A. Talwalkar, V. Smith, and M. Zaharia, editors, Proceedings of Machine Learning and Systems, volume 1, pages 374–388, 2019.
- [90] Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, 5th International Conference, volume 4978, pages 1–19, 2008.
- [91] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks, pages 2921–2926, 2017.
- [92] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- [93] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019.
- [94] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
- [95] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
- [96] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.
- [97] Galileo Namata, Ben London, Lise Getoor, Bert Huang, and U Edu. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, page 1, 2012.
- [98] Lise Getoor. Link-based classification. In Advanced methods for knowledge discovery from complex data, pages 189–207. 2005.
- [99] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1–19, 2015.
- [100] Zitao Li, Bolin Ding, Ce Zhang, Ninghui Li, and Jingren Zhou. Federated matrix factorization with privacy guarantee. Proceedings of the VLDB Endowment, 15(4):900–913, 2021.
- [101] Alireza Fallah, Aryan Mokhtari, and Asuman E. Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint, abs/2002.07948, 2020.
- [102] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In 5th International Conference on Learning Representations, 2017.
- [103] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
- [104] Lingjuan Lyu, Han Yu, and Qiang Yang. Threats to federated learning: A survey. arXiv preprint, abs/2003.02133, 2020.
- [105] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H. Yang, Farhad Farokhi, Shi Jin, Tony Q. S. Quek, and H. Vincent Poor. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur., 15:3454–3469, 2020.
Appendices for the Paper: pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning
We provide more details and experimental results for pFL-Bench in the appendices:
- Sec.A: the details of adopted datasets and models (e.g., tasks, heterogeneous partitions, and model architectures), and the extensions for other datasets and models with pFL-Bench.
- Sec.B: detailed description of methods and metrics in our experiments.
- Sec.C: implementation details including the experimental environments and hyper-parameters.
- Sec.D: more experimental results in terms of generalization (Sec.D.1), fairness (Sec.D.2) and efficiency (Sec.D.3) for all the datasets in Table 1. Besides, to demonstrate the potential and ease of extensibility of pFL-Bench, we also conducted experiments in the heterogeneous device resource scenario based on FedScale [38] (Sec.D.4), as well as experiments incorporating privacy-preserving techniques (Sec.D.5).
Appendix A Datasets and Models
Experimental datasets.
We present detailed descriptions of the 12 publicly available dataset variants used in pFL-Bench. These datasets are popular in the corresponding fields, and cover a wide range of domains, scales, partition manners, and Non-IID degrees.
- The Federated Extended MNIST (FEMNIST) is a widely used FL dataset for 62-class handwritten character recognition [32]. The original FEMNIST dataset contains 3,550 clients, each corresponding to a character writer from EMNIST [91]. Following [13], we adopt the sub-sampled version in pFL-Bench, which contains 200 clients and 43,400 images in total with a resolution of 28x28 pixels, and the dataset is randomly split into train/valid/test sets with a ratio of 3:1:1. (For all the adopted datasets, the train/val/test splitting is conducted within the local data of each client.) In pFL-Bench, we use this dataset to vary the client sampling rate in FL processes, as shown in Figure 4 in the main body of the paper.
- CIFAR10 is a popular dataset for 10-class image classification containing 60,000 color images with a resolution of 32x32 pixels. Following the heterogeneous partition manners used in [56, 32, 37, 28], we use Dirichlet allocation to split this dataset into 100 clients with different Dirichlet factors (a smaller factor indicates a higher degree of heterogeneity). We split the dataset into train/valid/test sets with a ratio of 4:1:1.
- The Corpus of Linguistic Acceptability (COLA) is a text classification dataset from [92, 93], which contains 9,600 English sentences labeled with grammatical correctness. In pFL-Bench, this dataset is partitioned into 50 clients via Dirichlet allocation. We split the dataset into train/valid/test sets with a ratio of about 7:2:1.
- The Stanford Sentiment Treebank (SST-2) is a sentiment classification dataset from [92, 94], which contains 68,200 movie review sentences labeled with human sentiment. Similar to COLA, in pFL-Bench, this dataset is partitioned into 50 clients via Dirichlet allocation. The train/valid/test sets have a ratio of about 60:15:1. For COLA and SST-2, since the test subsets from GLUE [92] are unlabeled (the labels are kept private on the GLUE server), we made new train/val/test partitions that differ from the GLUE versions.
- The Twitter dataset is a textual sentiment analysis dataset from [32]. We adopt a subset that contains 13,203 users. The partition of this dataset is natural w.r.t. users, and the median number of data samples per user is 7. The train/valid/test sets for each client have a ratio of about 3:1:1.
- The Cora dataset is a citation network that contains 2,708 nodes and 5,429 edges, in which each node indicates a scientific publication classified into one of seven classes [95]. Following FS-G [73], we split it into 5 clients using a community detection algorithm, Louvain [96]. The train/valid/test sets have a ratio of about 3:1:1.
- •
- The Citeseer dataset is a citation network that contains 3,312 nodes and 4,732 edges, in which each node indicates a scientific publication classified into one of six classes [98]. Following FS-G [73], we split it into 5 clients with Louvain community partition. The train/valid/test sets are with a ratio of about 4:1:1.
- •
- •
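As an illustration of the Dirichlet allocation used for several datasets above, the following sketch shows a common label-wise recipe: for each class, client proportions are drawn from a symmetric Dirichlet distribution and the class samples are split accordingly. The function `dirichlet_partition` and its parameters are hypothetical, and the exact splitter in pFL-Bench may differ.

```python
import random
from typing import Dict, List, Sequence


def dirichlet_partition(
    labels: Sequence[int], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Assign sample indices to clients with label-wise Dirichlet proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A symmetric Dirichlet sample is a vector of normalized Gamma draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        if total <= 0.0:
            # Numerical guard for tiny alpha: give the class to one client.
            props = [0.0] * num_clients
            props[rng.randrange(num_clients)] = 1.0
        else:
            props = [g / total for g in gammas]
        # Turn proportions into contiguous slices of the shuffled indices.
        start, cum = 0, 0.0
        for k in range(num_clients):
            cum += props[k]
            end = len(idxs) if k == num_clients - 1 else int(round(cum * len(idxs)))
            end = min(max(end, start), len(idxs))
            clients[k].extend(idxs[start:end])
            start = end
    return clients


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(1000)]
    parts = dirichlet_partition(toy_labels, num_clients=10, alpha=0.5, seed=1)
    print([len(p) for p in parts])
```

A smaller `alpha` concentrates each class on fewer clients, which matches the statement that a smaller Dirichlet factor yields higher heterogeneity.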
For all these experimental datasets, we randomly select 20% of the clients as new clients that do not participate in the FL processes. We summarize some statistics in Table 1 in the main body of the paper. Besides, we illustrate the violin plot of data size per client in Figure 2(d), the label skew visualization of certain datasets in Figure 2(e), and clients’ pairwise similarity of label distribution in terms of Jensen–Shannon distance in Figure 2(f). The smaller the Jensen–Shannon distance, the more similar the compared distributions. We can see that as the degree of heterogeneity increases (i.e., as the Dirichlet factor decreases), both the label skew and the Jensen–Shannon distances become larger. We can perform similar calculations on a variety of FL datasets, rank these distances, and in turn select those clients whose distributions are very different but whose models do not perform well for further analysis, understanding and algorithm improvement. Furthermore, all these results show diverse properties across the adopted FL datasets in pFL-Bench, enabling comprehensive comparisons among different methods.
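The pairwise comparison of clients' label distributions can be sketched as follows. The helper names are hypothetical, and the base-2 logarithm (which bounds the distance in [0, 1]) is an assumption, since the base used for the figure is not stated.

```python
import math
from typing import List, Sequence


def label_distribution(labels: Sequence[int], num_classes: int) -> List[float]:
    """Empirical label distribution of one client."""
    if not labels:
        raise ValueError("a client needs at least one sample")
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance: square root of the JS divergence (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i = 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))


def pairwise_js(dists: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of distances between all clients."""
    return [[js_distance(p, q) for q in dists] for p in dists]


if __name__ == "__main__":
    a = label_distribution([0, 0, 1, 1], num_classes=2)
    b = label_distribution([0, 0, 0, 1], num_classes=2)
    print(js_distance(a, b))
```

Sorting the off-diagonal entries of `pairwise_js` gives the kind of ranking described above for selecting clients with very dissimilar distributions.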



Models
To align with previous works [77, 56, 78, 79, 80], we preset a 2-layer CNN for FEMNIST and CIFAR10. Specifically, the model consists of two convolutional layers with 5 × 5 kernels, max pooling, batch normalization, ReLU activation, and two dense layers. The hidden size is 2,048 for FEMNIST and 512 for CIFAR10. For the COLA and SST-2 datasets, we preset the pre-trained BERT-Tiny model from [81], which contains a 2-layer Transformer encoder with a hidden size of 128. For the Twitter dataset, we preset an LR model with 50-dimensional GloVe embeddings (https://nlp.stanford.edu/data/glove.6B.zip). For the graph datasets, we preset the graph isomorphism network, GIN [82], which contains 2 convolution layers with batch normalization, a hidden size of 64, and a dropout rate of 0.5. For the recommendation datasets, we preset the Matrix Factorization (MF) model [83] with a hidden size of 20 for user and item embeddings.
Remark on the adopted dataset scales and model sizes.
It is worth noting that simulating pFL algorithms at a large client scale is very challenging, because we need to maintain a distinct (personalized) model object for each client. Take the well-known FEMNIST benchmark as an example, which has 3,550 users, and suppose we adopt the widely used two-layer CNN. Although this model occupies only 200MB, maintaining 3,550 such models would consume more than 700GB of memory. Unlike non-personalized FL algorithms, for which maintaining a single model object for all clients is feasible, pFL algorithms may require switching and caching the personalized models among CPU, GPU and even disk. Given the large number of methods and datasets included in our benchmark, and the correspondingly huge hyper-parameter search space, we use several subsets of the FL datasets to lower the barriers to reproduction and experimentation.
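The memory estimate above is a simple multiplication, sketched below using the two figures quoted in the text. The function name and the decimal convention (1 GB = 1,000 MB) are assumptions made for illustration.

```python
MODEL_SIZE_MB = 200   # per-model footprint quoted in the text
NUM_CLIENTS = 3550    # number of FEMNIST users quoted in the text

def total_memory_gb(model_size_mb, num_models, mb_per_gb=1000):
    """Memory needed to keep `num_models` model copies resident (decimal GB)."""
    return model_size_mb * num_models / mb_per_gb

# One personalized model per client versus a single shared global model.
personalized_gb = total_memory_gb(MODEL_SIZE_MB, NUM_CLIENTS)
shared_gb = total_memory_gb(MODEL_SIZE_MB, 1)
```

Under these assumptions the personalized setting needs 710 GB, against 0.2 GB for a single shared model, which is why caching models across CPU, GPU and disk becomes necessary.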
Extension.
We note that besides the experimental datasets and models introduced above, our code-base is compatible with a large number of datasets and models from other popular public DataZoos and ModelZoos. We provide unified dataset, dataloader, and model I/O interfaces with carefully designed modularity, which enable users to easily register and extend the datasets/models with simple and flexible configuration, such as different heterogeneous partition manners, number of clients, new client ratio, model types and model parameter dimensions. Currently, we support datasets from LEAF [32], Torchvision [41], Huggingface datasets [42], FederatedScope (FS) [13] and FederatedScope-GNN (FS-G) [73]; and models from Torchvision [41], Huggingface [84], FS [13] and FS-G [73].
Appendix B Methods and Metrics
B.1 Methods
We present detailed descriptions of the methods in pFL-Bench, which covers a range of popular and SOTA methods in three categories: Non-pFL methods, pFL methods and Combined variants.
The following Non-pFL methods are considered in pFL-Bench:
-
•
The Global-Train method refers to training only a centralized model from all data merged from all clients.
-
•
The Isolated method indicates that each client trains its own client-specific model without FL communication. The Global-Train and Isolated methods provide a good reference to examine the benefits of pFL processes. For these two methods, we omit the un-participated clients.
-
•
The FedProx [85] method leverages a proximal term to encourage the updated models at clients not to differ too much from the global model.
-
•
In addition, we include the classical FedAvg [1], which averages gradients weighted by the data size of clients in each FL round.
-
•
The FedOpt [43] algorithm is also considered, which generalizes FedAvg by introducing an optimizer for the FL server. We use SGD as the server optimizer for FedOpt and search its learning rate.
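The three federated baselines above differ only in small pieces of arithmetic, which the following sketch illustrates on plain Python lists. It is not the FederatedScope implementation: the function names are hypothetical, and the aggregation is written over parameter vectors, although the same weighting applies when updates or gradients are exchanged instead.

```python
def weighted_average(client_params, client_sizes):
    """FedAvg-style aggregation: average weighted by local data size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n * p[i] for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

def fedopt_sgd_step(global_params, client_params, client_sizes, server_lr):
    """FedOpt with a server SGD optimizer: move the global model towards
    the weighted average with step size `server_lr`."""
    avg = weighted_average(client_params, client_sizes)
    return [w + server_lr * (a - w) for w, a in zip(global_params, avg)]

def fedprox_penalty(local_params, global_params, mu):
    """FedProx proximal term (mu / 2) * ||w_local - w_global||^2,
    added to the local training loss."""
    return 0.5 * mu * sum((l - g) ** 2 for l, g in zip(local_params, global_params))

global_w = [0.0, 0.0]
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [1, 3]

fedavg_w = weighted_average(clients, sizes)
fedopt_w = fedopt_sgd_step(global_w, clients, sizes, server_lr=0.5)
```

With a server learning rate of 1, the SGD server step coincides with plain weighted averaging, which is the sense in which FedOpt generalizes FedAvg.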
pFL methods. We consider the following representative SOTA methods:
-
•
FedBN [86] is a simple yet effective pFL method aiming to handle the feature shift Non-IID challenge. It locally maintains the clients’ batch normalization parameters without FL communication and aggregation. In pFL-Bench, we generalize FedBN into the Transformer model by filtering out the layer normalization parameters.
-
•
Ditto [26] is a pFL method aiming to improve the fairness and robustness of FL. For each client, Ditto maintains a local personalized model and the global model at the same time. The global model is trained with the same procedure as in FedAvg, and the local model is trained with a personalized regularization according to the global model parameters.
-
•
pFedMe [78] is a meta-learning based method that also regularizes the local models according to the global model parameters. The authors propose to use Moreau-envelope-based regularization to reduce the complexity caused by Hessian matrix computation, which is required by some meta-learning based pFL methods such as Per-FedAvg [101].
-
•
The pFL-Bench also contains multi-model based pFL methods.
-
•
The HypCluster [74] method proposes to split clients into clusters and learns different personalized models for different clusters. The cluster is determined by performance on validation sets. In our experiments, we set the number of clusters as 3 for a fair comparison with FedEM.
The FedEM method [56] assumes that the local data distribution is a mixture of multiple underlying distributions. It learns a mixture of multiple local models with the Expectation-Maximization algorithm to deal with data heterogeneity, and can be easily extended to several clustering-based and multi-task-learning-based pFL methods. In our experiments, we use 3 internal models for FedEM according to the authors’ default choice.
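As an illustration of the regularized local training used by Ditto above, the sketch below shows a single gradient step on a personalized model that is pulled towards the global parameters. The function name and arguments are hypothetical, and `grad` stands for the gradient of the local loss at the personalized model, which a real implementation would obtain from automatic differentiation.

```python
def ditto_personal_step(personal, global_params, grad, lr, lam):
    """One SGD step on local_loss(v) + (lam / 2) * ||v - w||^2, where v is the
    personalized model and w is the current global model."""
    return [
        v - lr * (g + lam * (v - w))
        for v, w, g in zip(personal, global_params, grad)
    ]

# With a zero local gradient, the step only shrinks the distance to the global model.
pulled = ditto_personal_step([1.0], [0.0], grad=[0.0], lr=0.1, lam=1.0)

# With lam = 0 the regularizer vanishes and the step is plain local SGD.
plain = ditto_personal_step([1.0], [0.0], grad=[2.0], lr=0.1, lam=0.0)
```

The weight `lam` corresponds to the personalized regularization weight that is searched during hyper-parameter tuning; larger values keep the personalized model closer to the global one.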
Combined variants. It is worth noting that we provide pluggable re-implementations of numerous existing methods in pFL-Bench. This modularity enables users to pick different personalized objects and behaviors to form a new pFL variant. We combine FedBN, FedOpt, and Fine-tuning (FT) with other compatible methods. The FedBN combination makes the batch/layer normalization parameters personalized and locally maintained. The FedOpt combination introduces the server optimizer into the FL processes. The Fine-tuning (FT) combination fine-tunes the local models for a few steps before evaluation within the FL processes.
To facilitate fine-grained ablations and systematic pFL study, we finally compare more than 20 pFL method variants in the experiments. We will continuously add more pFL methods to pFL-Bench.
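The FedBN-style combination amounts to deciding, per parameter, whether it is communicated or kept local. The sketch below shows one way to express that split over a name-to-value dictionary. The keyword matching (`"bn"`, `"norm"`), the function name and the example parameter names are all assumptions made for illustration; the benchmark's actual filtering rule may differ.

```python
def split_shared_and_local(state, local_keywords=("bn", "norm")):
    """Split a name -> value parameter dict into the part sent to the server
    and the part kept on the client (normalization parameters)."""
    shared, local = {}, {}
    for name, value in state.items():
        is_local = any(k in name.lower() for k in local_keywords)
        (local if is_local else shared)[name] = value
    return shared, local

# Hypothetical parameter names mixing batch-norm and layer-norm layers.
state = {
    "conv1.weight": [0.1],
    "bn1.weight": [1.0],
    "bn1.running_mean": [0.0],
    "encoder.LayerNorm.bias": [0.0],
    "fc.bias": [0.2],
}
shared, local = split_shared_and_local(state)
```

Only `shared` would take part in federated aggregation, while `local` stays on the client across rounds.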
B.2 Metrics
Here we summarize the metrics monitored in our benchmark and describe them in more detail.
Generalization.
We support server-side and client-side monitoring w.r.t. widely used performance metrics such as accuracy, loss, and F1. Specifically, in the paper we report three quantities: the accuracy (or loss) averaged over participating clients and weighted by the number of local data samples, the accuracy (or loss) of un-participated clients, and the participation generalization gap between the two (the un-participated value minus the participated one, as in the tables of Appendix D.1).
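A minimal sketch of these generalization quantities is given below. The function names are hypothetical, and the sign convention of the gap (un-participated minus participated) is inferred from the reported tables, for example the FedAvg row on COLA (71.85, 63.49, -8.36).

```python
def weighted_accuracy(accs, sizes):
    """Average of per-client accuracy, weighted by local sample count."""
    total = sum(sizes)
    return sum(a * n for a, n in zip(accs, sizes)) / total

def generalization_gap(participated_acc, unparticipated_acc):
    """Participation generalization gap: un-participated minus participated."""
    return unparticipated_acc - participated_acc

# Toy example: a large client with 80% accuracy and a small one with 60%.
toy_acc = weighted_accuracy([80.0, 60.0], [3, 1])

# Reproducing the gap of the FedAvg row on COLA from the generalization table.
cola_gap = round(generalization_gap(71.85, 63.49), 2)
```

A negative gap means the model transfers worse to clients that did not take part in training.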
Fairness.
We support different ways of summarizing the evaluated metrics over clients, such as the weighted average, the uniform average, the standard deviation, and various quantiles such as the bottom accuracy. For the bottom accuracy, we report a rank-based worst-case accuracy over all evaluated clients rather than that of the single worst client. To align with FedEM, the 90th percentile is considered here, which omits the particularly noisy results from poorly performing clients with very small data sizes.
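The summaries listed above can be computed from per-client results as in the following sketch. The rank used for the bottom accuracy (the worst 10% of clients, rounded up) is an assumption chosen to match the "90th percentile" wording and is not necessarily the exact rank used in the paper; the function name and the use of the population standard deviation are likewise illustrative choices.

```python
import math
import statistics

def fairness_summary(accs, sizes, bottom_percent=10):
    """Summarize per-client accuracy: weighted/uniform mean, std, bottom accuracy."""
    n = len(accs)
    weighted = sum(a * s for a, s in zip(accs, sizes)) / sum(sizes)
    uniform = sum(accs) / n
    std = statistics.pstdev(accs)
    # Assumed rank: the ceil(bottom_percent% * n)-th worst client, at least the worst one.
    rank = max(1, math.ceil(n * bottom_percent / 100))
    bottom = sorted(accs)[rank - 1]
    return {"weighted": weighted, "uniform": uniform, "std": std, "bottom": bottom}

# Twenty toy clients with accuracies 1..20 and equal data sizes.
toy_accs = [float(a) for a in range(1, 21)]
toy_sizes = [1] * 20
summary = fairness_summary(toy_accs, toy_sizes)
```

With 20 clients the assumed rule picks the second-worst client, so a single extremely noisy client does not determine the reported bottom accuracy.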
System costs.
For computational and communication costs, we support monitoring several proxy metrics, including FLOPs, communication bytes, and convergence rounds. FLOPs are counted as the sum over both training and inference via a per-operator FLOP counting tool, fvcore/flop_count (https://github.com/facebookresearch/fvcore/blob/main/docs/flop_count.md). The reported communication bytes are counted as the sum of upstream and downstream traffic across all participants until convergence with early stopping. Besides, thanks to the integration with wandb, our benchmark also supports more runtime metrics, including the dynamic utilization of CPU, GPU, memory, disk, etc. (https://docs.wandb.ai/ref/app/features/system-metrics).
Appendix C Implementation
Environments.
We implement pFL-Bench based on the FS [13] package and PyTorch. The experiments are conducted on a cluster of 8 Tesla V100 and 64 NVIDIA GeForce GTX 1080 Ti GPUs, each machine with 380G memory and a Xeon Platinum 8163 2.50GHz CPU containing 96 cores. Our experiments are conducted in containerized environments with Ubuntu 18.04.
We provide versioned DockerFiles, the built docker images and the experimental datasets on our website (https://github.com/alibaba/FederatedScope/tree/master/benchmark/pFL-Bench) with the Aliyun storage service. The pFL-Bench and the underlying FS [13] package are continuously developed and maintained by the Data Analytics and Intelligence Lab (DAIL) of DAMO Academy. We will actively fix potential issues, track updates and publish releases on GitHub.
Hyper-parameters.
For fair comparisons, we first use wandb sweep (https://docs.wandb.ai/sweeps) with the hyper-parameter optimization (HPO) algorithm, HyperBand [102], to find the best hyper-parameters for all the methods on all datasets. The validation sets are used and we employ early stopping with a large number of total FL rounds . We set this hyper-parameter to make almost all methods converge within rounds. Specifically, for FEMNIST, CIFAR10, Movielens-1M, Movielens-10M and Twitter, we set . For COLA, SST-2, Pubmed, Cora and Citeseer, we set . The batch size is set to 32 for image datasets, 64 for textual datasets, and 1,024 for the recommendation datasets. For graph datasets, we adopt full-batch training. For all methods, we search the local update steps (i.e., the number of local training epochs in each FL round) from . For the local SGD learning rate, we search from . For FedOpt, we search the server learning rate from . For pFedMe, we search the personalized regularization weight from and its local meta-learning step from . For FedEM, we set its number of mixture models as 3. For Ditto, we search its personalized regularization weight from .
To enable easily reproducible research, we provide standardized and documented scripts including the HPO scripts, the experiments running scripts and searched best configuration files in our code-base (see the link in the above paragraph).
Appendix D Additional Experimental Results
D.1 Generalization
The generalization results for FEMNIST, SST-2 and PUBMED are shown in Table 2 and Figure 3 in the main body of the paper. Here we present the results for other datasets, including all the textual datasets (Table 4), all the graph datasets (Table 5), and all the recommendation datasets (Table 6). We note that FedOpt may perform poorly on some datasets when the models contain batch/layer normalization parameters. Besides, for the Twitter and recommendation datasets, we do not compare the FedBN-based methods, as the LR and MF models used there do not contain batch/layer normalization parameters.
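Method | COLA: participated | COLA: un-participated | COLA: gap | SST-2: participated | SST-2: un-participated | SST-2: gap | Twitter: participated | Twitter: un-participated | Twitter: gap |
---|---|---|---|---|---|---|---|---|---|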
Global-Train | 69.06 | - | - | 80.57 | - | - | 55.56 | - | - |
Isolated | 55.96 | - | - | 60.82 | - | - | 70.04 | - | - |
FedAvg | 71.85 | 63.49 | -8.36 | 74.88 | 80.24 | 5.36 | 62.15 | 61.24 | -0.91 |
FedAvg-FT | 68.29 | 58.66 | -9.63 | 74.14 | 83.28 | 9.13 | 70.53 | 71.17 | 0.64 |
FedOpt | 71.85 | 59.62 | -12.23 | 72.28 | 83.06 | 10.78 | 62.09 | 61.64 | -0.45 |
FedOpt-FT | 62.59 | 47.82 | -14.77 | 65.77 | 80.02 | 14.25 | 71.08 | 71.41 | 0.33 |
pFedMe | 74.40 | 67.64 | -6.76 | 71.27 | 69.34 | -1.92 | 63.45 | 62.52 | -0.94 |
pFedMe-FT | 78.47 | 76.33 | -2.14 | 75.61 | 66.48 | -9.13 | 84.00 | 71.80 | -12.20 |
FedBN | 71.85 | 63.49 | -8.36 | 74.88 | 75.40 | 0.52 | - | - | - |
FedBN-FT | 66.71 | 49.87 | -16.84 | 68.81 | 82.43 | 13.63 | - | - | - |
FedBN-FedOPT | 71.85 | 62.48 | -9.37 | 64.70 | 65.50 | 0.81 | - | - | - |
FedBN-FedOPT-FT | 67.48 | 57.59 | -9.90 | 68.65 | 70.56 | 1.91 | - | - | - |
Ditto | 55.46 | 49.90 | -5.56 | 52.03 | 46.79 | -5.24 | 70.23 | 49.60 | -20.63 |
Ditto-FT | 72.11 | 52.15 | -19.96 | 56.49 | 65.50 | 9.01 | 69.99 | 51.32 | -18.67 |
Ditto-FedBN | 70.69 | 49.90 | -20.79 | 56.03 | 46.79 | -9.24 | - | - | - |
Ditto-FedBN-FT | 72.66 | 53.44 | -19.21 | 53.15 | 66.49 | 13.34 | - | - | - |
Ditto-FedBN-FedOpt | 50.25 | 49.90 | -0.35 | 57.67 | 46.79 | -10.88 | - | - | - |
Ditto-FedBN-FedOpt-FT | 55.01 | 58.22 | 3.21 | 52.89 | 66.49 | 13.60 | - | - | - |
FedEM | 71.85 | 63.49 | -8.36 | 75.78 | 67.67 | -8.11 | 63.44 | 62.68 | -0.75 |
FedEM-FT | 54.90 | 48.29 | -6.61 | 64.86 | 81.63 | 16.77 | 70.97 | 71.59 | 0.62 |
FedEM-FedBN | 71.44 | 63.99 | -7.45 | 75.43 | 62.81 | -12.62 | - | - | - |
FedEM-FedBN-FT | 57.62 | 58.88 | 1.26 | 64.96 | 81.04 | 16.08 | - | - | - |
FedEM-FedBN-FedOPT | 71.85 | 62.82 | -9.03 | 72.25 | 64.69 | -7.56 | - | - | - |
FedEM-FedBN-FedOPT-FT | 57.23 | 58.88 | 1.65 | 62.26 | 73.87 | 11.61 | - | - | - |
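Method | PUBMED: participated | PUBMED: un-participated | PUBMED: gap | CORA: participated | CORA: un-participated | CORA: gap | CITESEER: participated | CITESEER: un-participated | CITESEER: gap |
---|---|---|---|---|---|---|---|---|---|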
Global-Train | 87.01 | - | - | 86.10 | - | - | 74.03 | - | - |
Isolated | 85.56 | - | - | 82.48 | - | - | 69.83 | - | - |
FedAvg | 87.27 | 72.63 | -14.64 | 81.30 | 72.14 | -9.16 | 75.58 | 59.83 | -15.74 |
FedAvg-FT | 87.21 | 79.78 | -7.43 | 82.07 | 75.84 | -6.22 | 75.63 | 66.07 | -9.57 |
FedOpt | 67.38 | 53.84 | -13.54 | 70.70 | 59.58 | -11.12 | 71.59 | 55.16 | -16.43 |
FedOpt-FT | 82.36 | 64.09 | -18.27 | 82.68 | 62.17 | -20.51 | 74.34 | 62.36 | -11.99 |
pFedMe | 86.91 | 71.64 | -15.27 | 83.18 | 70.13 | -13.05 | 75.30 | 58.45 | -16.85 |
pFedMe-FT | 85.71 | 77.07 | -8.64 | 82.11 | 71.48 | -10.63 | 75.35 | 62.14 | -13.20 |
FedBN | 88.49 | 52.53 | -35.95 | 84.13 | 57.33 | -26.80 | 75.80 | 51.29 | -24.51 |
FedBN-FT | 87.45 | 80.36 | -7.09 | 76.20 | 64.13 | -12.07 | 75.07 | 64.20 | -10.87 |
FedBN-FedOPT | 87.87 | 42.72 | -45.15 | 84.64 | 53.27 | -31.37 | 76.20 | 50.34 | -25.86 |
FedBN-FedOPT-FT | 87.54 | 77.07 | -10.47 | 84.10 | 68.14 | -15.96 | 76.70 | 62.84 | -13.86 |
Ditto | 87.27 | 2.84 | -84.43 | 83.67 | 14.53 | -69.14 | 74.79 | 14.52 | -60.27 |
Ditto-FT | 87.47 | 35.03 | -52.44 | 81.47 | 72.64 | -8.84 | 76.47 | 39.27 | -37.20 |
Ditto-FedBN | 88.18 | 2.84 | -85.34 | 81.38 | 14.53 | -66.84 | 75.35 | 14.52 | -60.83 |
Ditto-FedBN-FT | 87.83 | 28.52 | -59.30 | 83.25 | 65.87 | -17.38 | 75.97 | 36.51 | -39.46 |
Ditto-FedBN-FedOpt | 87.81 | 2.84 | -84.97 | 82.54 | 14.53 | -68.01 | 75.07 | 14.52 | -60.55 |
Ditto-FedBN-FedOpt-FT | 87.60 | 18.18 | -69.42 | 82.00 | 70.75 | -11.25 | 76.42 | 48.99 | -27.43 |
FedEM | 85.64 | 71.12 | -14.52 | 81.92 | 72.02 | -9.90 | 75.41 | 59.60 | -15.81 |
FedEM-FT | 85.88 | 78.08 | -7.80 | 77.32 | 79.45 | 2.13 | 72.71 | 66.77 | -5.94 |
FedEM-FedBN | 88.12 | 48.64 | -39.48 | 85.07 | 50.98 | -34.09 | 74.90 | 41.55 | -33.35 |
FedEM-FedBN-FT | 86.38 | 72.02 | -14.35 | 84.61 | 76.77 | -7.83 | 75.29 | 68.38 | -6.91 |
FedEM-FedBN-FedOPT | 87.56 | 42.37 | -45.19 | 84.68 | 56.45 | -28.24 | 76.08 | 53.81 | -22.27 |
FedEM-FedBN-FedOPT-FT | 87.49 | 72.39 | -15.09 | 85.02 | 76.97 | -8.05 | 76.59 | 68.62 | -7.96 |
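Method | Movielens-1M: participated | Movielens-1M: un-participated | Movielens-1M: gap | Movielens-10M: participated | Movielens-10M: un-participated | Movielens-10M: gap |
---|---|---|---|---|---|---|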
Global-Train | 0.78 | - | - | 0.67 | - | - |
Isolated | 10.35 | - | - | 11.48 | - | - |
FedAvg | 0.84 | 14.17 | 13.33 | 0.70 | 13.39 | 12.68 |
FedAvg-FT | 0.84 | 9.76 | 8.92 | 0.71 | 11.07 | 10.36 |
FedAvg-FT-FedOpt | 0.85 | 5.15 | 4.31 | 0.73 | 11.40 | 10.66 |
FedOpt | 0.83 | 14.17 | 13.34 | 0.71 | 13.39 | 12.68 |
FedOpt-FT | 0.83 | 12.06 | 11.23 | 0.74 | 11.92 | 11.18 |
pFedMe | 0.54 | 14.18 | 13.64 | 13.06 | 12.73 | -0.33 |
pFedMe-FT | 0.60 | 8.20 | 7.60 | 0.80 | 12.59 | 11.79 |
Ditto | 1.29 | 14.19 | 12.89 | 1.84 | 13.39 | 11.55 |
Ditto-FT | 1.35 | 14.17 | 12.81 | 1.69 | 13.39 | 11.70 |
Ditto-FT-FedOpt | 1.36 | 14.15 | 12.79 | 2.03 | 13.39 | 11.35 |
FedEM | 0.85 | 14.27 | 13.43 | 1.75 | 13.41 | 11.65 |
FedEM-FT | 0.85 | 4.86 | 4.01 | 0.87 | 12.80 | 11.93 |
FedEM-FT-FedOpt | 0.86 | 4.47 | 3.61 | 1.43 | 13.25 | 11.82 |
D.2 Fairness
The fairness results for FEMNIST, SST-2 and PUBMED are listed in Table 3 in the main body of the paper. Here we present the results for other datasets, including all the textual datasets (Table 7), all the graph datasets (Table 8), and all the recommendation datasets (Table 9).
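Method | COLA: average | COLA: std | COLA: bottom | SST-2: average | SST-2: std | SST-2: bottom | Twitter: average | Twitter: std | Twitter: bottom |
---|---|---|---|---|---|---|---|---|---|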
Isolated | 56.86 | 35.34 | 0.00 | 59.40 | 41.29 | 0.00 | 67.44 | 37.86 | 0.00 |
FedAvg | 51.53 | 35.96 | 0.00 | 76.30 | 22.02 | 44.85 | 56.98 | 37.97 | 0.00 |
FedAvg-FT | 58.74 | 35.21 | 0.00 | 75.36 | 27.67 | 31.08 | 68.45 | 36.19 | 0.00 |
FedOpt | 57.10 | 31.70 | 10.85 | 73.78 | 34.03 | 17.53 | 59.95 | 37.88 | 0.00 |
FedOpt-FT | 59.77 | 35.99 | 0.00 | 66.17 | 33.73 | 15.56 | 69.20 | 36.28 | 0.00 |
pFedMe | 67.58 | 38.06 | 0.90 | 65.08 | 26.59 | 27.75 | 58.18 | 36.67 | 0.00 |
pFedMe-FT | 69.17 | 33.63 | 9.90 | 74.36 | 27.02 | 32.49 | 78.82 | 33.28 | 22.22 |
FedBN | 59.60 | 35.96 | 0.00 | 76.30 | 22.02 | 44.85 | - | - | - |
FedBN-FT | 59.69 | 35.44 | 0.00 | 68.50 | 26.83 | 29.17 | - | - | - |
FedBN-FedOPT | 59.60 | 36.03 | 0.00 | 65.59 | 31.07 | 22.22 | - | - | - |
FedBN-FedOPT-FT | 59.10 | 35.15 | 0.00 | 68.42 | 28.18 | 30.71 | - | - | - |
Ditto | 55.14 | 36.76 | 0.00 | 49.94 | 40.81 | 0.00 | 66.90 | 38.05 | 0.00 |
Ditto-FT | 63.61 | 35.02 | 0.00 | 54.34 | 39.26 | 0.00 | 66.91 | 38.08 | 0.00 |
Ditto-FedBN | 62.68 | 35.74 | 0.00 | 49.44 | 41.80 | 0.00 | - | - | - |
Ditto-FedBN-FT | 63.58 | 34.58 | 0.00 | 52.18 | 39.85 | 0.00 | - | - | - |
Ditto-FedBN-FedOpt | 52.48 | 35.89 | 0.00 | 55.61 | 40.43 | 1.39 | - | - | - |
Ditto-FedBN-FedOpt-FT | 57.13 | 36.20 | 0.00 | 53.16 | 34.75 | 9.72 | - | - | - |
FedEM | 51.52 | 35.96 | 0.00 | 76.53 | 23.34 | 44.44 | 61.70 | 37.72 | 0.00 |
FedEM-FT | 57.80 | 35.24 | 0.00 | 64.29 | 32.84 | 12.96 | 70.19 | 36.35 | 0.00 |
FedEM-FedBN | 57.95 | 34.11 | 1.52 | 75.06 | 18.48 | 53.33 | - | - | - |
FedEM-FedBN-FT | 58.74 | 35.56 | 1.00 | 64.33 | 35.72 | 8.59 | - | - | - |
FedEM-FedBN-FedOPT | 59.60 | 35.49 | 0.00 | 72.66 | 27.18 | 34.17 | - | - | - |
FedEM-FedBN-FedOPT-FT | 56.74 | 35.61 | 1.00 | 58.42 | 31.21 | 17.93 | - | - | - |
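Method | PUBMED: average | PUBMED: std | PUBMED: bottom | CORA: average | CORA: std | CORA: bottom | CITESEER: average | CITESEER: std | CITESEER: bottom |
---|---|---|---|---|---|---|---|---|---|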
Isolated | 84.67 | 6.26 | 74.63 | 81.62 | 4.67 | 72.64 | 69.90 | 5.76 | 61.10 |
FedAvg | 86.72 | 3.93 | 79.76 | 81.07 | 5.06 | 73.26 | 75.64 | 5.03 | 67.30 |
FedAvg-FT | 86.71 | 3.86 | 80.57 | 81.90 | 3.06 | 77.57 | 75.77 | 5.00 | 68.68 |
FedOpt | 66.69 | 16.69 | 48.50 | 70.17 | 12.56 | 38.89 | 71.60 | 9.20 | 42.25 |
FedOpt-FT | 81.53 | 16.69 | 46.21 | 82.31 | 6.35 | 68.03 | 74.38 | 4.13 | 69.26 |
pFedMe | 86.35 | 4.43 | 78.76 | 82.76 | 3.40 | 76.70 | 75.36 | 4.79 | 68.55 |
pFedMe-FT | 85.47 | 3.06 | 80.95 | 81.98 | 1.63 | 79.58 | 75.40 | 4.39 | 70.31 |
FedBN | 87.97 | 3.42 | 81.77 | 83.64 | 3.88 | 77.53 | 75.59 | 3.80 | 71.24 |
FedBN-FT | 87.02 | 3.47 | 80.13 | 76.01 | 4.32 | 69.92 | 75.16 | 5.04 | 67.91 |
FedBN-FedOPT | 87.43 | 4.64 | 80.81 | 84.11 | 3.93 | 75.44 | 76.33 | 4.28 | 72.43 |
FedBN-FedOPT-FT | 87.02 | 3.94 | 81.78 | 83.79 | 2.39 | 80.26 | 76.77 | 4.60 | 71.68 |
Ditto | 86.85 | 3.98 | 80.44 | 83.50 | 3.54 | 75.02 | 75.55 | 4.23 | 67.99 |
Ditto-FT | 87.10 | 3.52 | 80.46 | 81.53 | 3.67 | 75.63 | 76.57 | 4.24 | 70.31 |
Ditto-FedBN | 87.75 | 3.70 | 81.82 | 81.36 | 3.34 | 74.93 | 75.46 | 4.83 | 68.57 |
Ditto-FedBN-FT | 87.43 | 3.77 | 81.15 | 82.21 | 4.37 | 72.56 | 76.05 | 4.21 | 69.16 |
Ditto-FedBN-FedOpt | 87.27 | 3.90 | 79.14 | 82.38 | 2.10 | 77.69 | 75.19 | 4.06 | 68.60 |
Ditto-FedBN-FedOpt-FT | 87.10 | 3.79 | 80.93 | 81.99 | 3.67 | 73.29 | 76.52 | 4.43 | 71.31 |
FedEM | 85.05 | 4.44 | 78.51 | 81.72 | 3.72 | 74.95 | 75.49 | 4.66 | 68.50 |
FedEM-FT | 85.54 | 4.48 | 79.39 | 78.43 | 11.72 | 56.20 | 72.88 | 6.55 | 63.53 |
FedEM-FedBN | 87.63 | 4.14 | 82.54 | 84.45 | 3.14 | 76.71 | 74.29 | 4.22 | 69.06 |
FedEM-FedBN-FT | 85.68 | 4.33 | 79.44 | 84.57 | 3.04 | 80.22 | 75.40 | 3.84 | 70.59 |
FedEM-FedBN-FedOPT | 87.11 | 4.24 | 80.32 | 83.88 | 4.37 | 78.93 | 76.17 | 4.76 | 70.54 |
FedEM-FedBN-FedOPT-FT | 87.16 | 3.66 | 82.20 | 84.40 | 4.43 | 79.53 | 76.74 | 4.53 | 71.92 |
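Method | Movielens-1M: average | Movielens-1M: std | Movielens-1M: bottom | Movielens-10M: average | Movielens-10M: std | Movielens-10M: bottom |
---|---|---|---|---|---|---|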
Isolated | 11.12 | 2.43 | 14.34 | 11.44 | 1.83 | 13.75 |
FedAvg | 0.85 | 0.21 | 1.13 | 0.71 | 0.11 | 0.84 |
FedAvg-FT | 0.85 | 0.21 | 1.12 | 0.71 | 0.11 | 0.85 |
FedAvg-FT-FedOpt | 0.86 | 0.22 | 1.14 | 0.75 | 0.12 | 0.90 |
FedOpt | 0.84 | 0.21 | 1.11 | 0.72 | 0.12 | 0.87 |
FedOpt-FT | 0.84 | 0.21 | 1.11 | 0.77 | 0.13 | 0.94 |
pFedMe | 0.55 | 0.11 | 0.69 | 12.48 | 2.44 | 15.76 |
pFedMe-FT | 0.60 | 0.12 | 0.75 | 0.80 | 0.13 | 0.96 |
Ditto | 1.31 | 0.77 | 1.69 | 1.81 | 0.24 | 2.12 |
Ditto-FT | 1.35 | 0.88 | 1.70 | 2.30 | 1.15 | 4.05 |
Ditto-FT-FedOpt | 1.35 | 0.79 | 1.70 | 1.98 | 0.27 | 2.32 |
FedEM | 0.87 | 0.22 | 1.15 | 2.37 | 1.25 | 4.09 |
FedEM-FT | 0.87 | 0.23 | 1.16 | 0.98 | 0.26 | 1.29 |
FedEM-FT-FedOpt | 0.87 | 0.22 | 1.16 | 1.88 | 0.94 | 3.09 |
D.3 Efficiency
The efficiency-accuracy trade-off results for FEMNIST are plotted in Figure 5 in the main body of the paper. Here we present more efficiency-accuracy trade-off results for the experimental datasets, including the FEMNIST datasets with different client sampling rates (Table 11 and Table 12), the CIFAR-10 datasets with different α (Table 13 and Table 14), all the textual datasets (Table 15 and Table 16), all the graph datasets (Table 18 and Table 19), and all the recommendation datasets (Table 20 and Table 17).
Besides the reported proxy system metrics such as FLOPs and the number of convergence rounds of FL processes, our benchmark also supports monitoring more runtime metrics. Thanks to the integration with wandb, we can easily track the usage of system resources at runtime, including the utilization of CPU, GPU, memory, disk, etc. In Table 21, we report the average and peak process memory usage (in MB) and process running times (in seconds). In general, most pFL algorithms do have higher time and space overheads. We do not report results for the other metrics: because we launched many sets of experiments concurrently to occupy as much GPU memory and CPU/GPU capacity as possible, these metrics differ little across our experiments. However, it is worth noting that these omitted metrics can be used to analyse algorithm bottlenecks in terms of system performance, and to optimise the space-time efficiency in single-experiment scenarios.
D.4 Heterogeneous Device Resources
The proposed pFL-Bench is extensible enough to support experiments in heterogeneous device resource scenarios, where clients have different computational and communication capacities. Specifically, we integrate FedScale [38] into our benchmark with a simulator that executes the behaviors of clients according to the virtual timestamps of their message delivery to the server. The virtual timestamps are updated by the estimated execution time based on clients’ computational and communication capacities, using the cost model proposed in FedScale. The server employs an over-selection mechanism for clients at each broadcast round, and thus some clients’ messages may be dropped, since the clients have different system capacities and different response speeds corresponding to real-world mobile devices (https://github.com/SymbioticLab/FedScale/tree/master/benchmark/dataset/data/device_info).
Here we take the Ditto method on the FEMNIST dataset as an example and present the results of experiments with heterogeneous device resources in Table 10. Two settings control the experiment: the client sampling rate for each FL round, and the minimal ratio of received feedback (w.r.t. the number of clients) required for the server to trigger federated aggregation in over-selection mode. The homo-device and hetero-device cases use different values of these two settings, chosen so that the same number of clients is used for each federated aggregation. From the results, we can see that the hetero-device version converges more slowly (a convergence round of 0 in Table 10 indicates that early stopping is not triggered within the large number of FL rounds) and performs worse than the homo-device version, especially in terms of the bottom accuracy and the standard deviation of the accuracy. This shows unfairness among clients: some low-resourced clients take too long in computation or communication for their feedback to be incorporated into the federated aggregation, which calls for more consideration of device heterogeneity in pFL algorithm design.
Setting | participated | un-participated | gap | average | std | bottom | FLOPS | Com. | convergence round |
---|---|---|---|---|---|---|---|---|---|
Ditto, Homo-Device | 88.39 | 2.2 | -86.19 | 87.18 | 7.52 | 78.23 | 849.3G | 2.81M | 610 |
Ditto, Hetero-Device | 79.76 | 1.43 | -78.33 | 77.39 | 11.25 | 61.76 | 1.72T | 5.72M | 0 |
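The over-selection mechanism described in this subsection can be sketched as a small simulation. This is a toy model under stated assumptions, not the FedScale cost model: each client's virtual finish time is simply a fixed amount of work divided by a hypothetical speed, and the function and parameter names are introduced here for illustration.

```python
import random

def over_selection_round(client_speeds, sample_rate, min_received_ratio,
                         work=1.0, seed=0):
    """Simulate one round: sample clients, order them by virtual finish time,
    aggregate the earliest responses and drop the rest."""
    rng = random.Random(seed)
    num_sampled = max(1, round(sample_rate * len(client_speeds)))
    sampled = rng.sample(sorted(client_speeds), num_sampled)
    # Toy virtual timestamp of message delivery: work divided by client speed.
    finish_time = {cid: work / client_speeds[cid] for cid in sampled}
    num_needed = max(1, round(min_received_ratio * num_sampled))
    by_time = sorted(sampled, key=lambda cid: finish_time[cid])
    return by_time[:num_needed], by_time[num_needed:]

# Ten hypothetical clients; client i has speed i + 1, so client 9 is the fastest.
speeds = {i: float(i + 1) for i in range(10)}
aggregated, dropped = over_selection_round(speeds, sample_rate=1.0,
                                           min_received_ratio=0.5)
```

In this toy run the five slowest clients are always dropped, which mirrors the unfairness discussed above: low-resourced clients rarely contribute to the aggregated model.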
D.5 Incorporating Differential Privacy
The trade-off between personalization and privacy protection is interesting and under-explored, and it is important because FL involves transmitting local (and possibly private) information. We note that FederatedScope supports various privacy-related fundamental components and algorithms, such as Differential Privacy (DP) [103] and privacy attack methods [104] that can be used to examine the privacy-preserving strength. Further research that combines pFL and privacy-preserving techniques will be convenient thanks to the modularized and extensible design of our benchmark. As a preliminary example, here we demonstrate the combination of pFL with a differential privacy algorithm, NbAFL [105], which achieves (ε, δ)-DP via noise injection and gradient clipping.
In Figure 2(g), we plot the learning curves of the FedAvg and Ditto methods on the FEMNIST dataset with various (ε, δ)-DP settings. Generally speaking, the weaker the privacy protection, the less performance degradation there is. We can see in the figure that with larger ε and δ, the accuracy is better for both compared methods, which meets our expectation. Interestingly, Ditto is significantly more robust than FedAvg to the dramatically varying privacy protection strengths during the whole learning process. This may be because, under noise perturbation, the personalized local model potentially provides more local optima that clients can reach. However, there is still a gap in the best achievable performance between Ditto and FedAvg, leaving an interesting open question about how to reduce the performance degradation by co-designing personalization and noise injection.
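To make the role of clipping and noise injection concrete, the sketch below combines L2-norm clipping with the classical Gaussian mechanism. It is a generic illustration rather than the NbAFL algorithm itself: the noise calibration is the textbook formula, which is valid for 0 < ε < 1, and taking the sensitivity equal to the clipping norm is a simplifying assumption. All function names are hypothetical.

```python
import math
import random

def clip_by_l2_norm(vec, clip_norm):
    """Scale the vector down so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]

def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale of the classical Gaussian mechanism (0 < epsilon < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon

def privatize(vec, clip_norm, epsilon, delta, rng):
    """Clip the update, then add Gaussian noise calibrated to (epsilon, delta)."""
    clipped = clip_by_l2_norm(vec, clip_norm)
    # Assumption: sensitivity is taken to be the clipping norm.
    sigma = gaussian_sigma(epsilon, delta, sensitivity=clip_norm)
    return [x + rng.gauss(0.0, sigma) for x in clipped]

clipped = clip_by_l2_norm([3.0, 4.0], clip_norm=1.0)
strong_privacy_sigma = gaussian_sigma(epsilon=0.5, delta=1e-5, sensitivity=1.0)
weak_privacy_sigma = gaussian_sigma(epsilon=0.9, delta=1e-3, sensitivity=1.0)
noisy = privatize([3.0, 4.0], 1.0, epsilon=0.5, delta=1e-5, rng=random.Random(0))
```

Larger ε and δ yield a smaller noise scale, which is consistent with the observation above that weaker protection leads to less accuracy degradation.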

FEMNIST, | FEMNIST, | FEMNIST, | |||||||
FLOPS | Com. | FLOPS | Com. | FLOPS | Com. | ||||
FedAvg | 195.38G | 1.24M | 83.97 | 159.9G | 373.57K | 83.84 | 78.48G | 384.0K | 83.21 |
FedAvg-FT | 296.8G | 1.35M | 86.44 | 155.49G | 630.44K | 85.45 | 217.97G | 522.82K | 84.78 |
FedOpt | 846.22G | 1.68M | 19.31 | 272.3G | 635.48K | 27.29 | 29.0G | 80.81K | 22.53 |
FedOpt-FT | 77.81G | 149.97K | 9.69 | 54.88G | 222.47K | 31.02 | 365.54G | 877.15K | 29.73 |
pFedMe | 2.24T | 1.52M | 87.50 | 516.47G | 1012.93K | 84.70 | 216.66G | 493.06K | 85.23 |
pFedMe-FT | 1.33T | 2.19M | 88.19 | 1.22T | 953.21K | 86.28 | 585.23G | 569.72K | 87.37 |
FedBN | 243.31G | 1.21M | 86.72 | 120.66G | 510.15K | 85.10 | 189.64G | 841.57K | 84.36 |
FedBN-FT | 208.99G | 687.58K | 88.51 | 174.22G | 610.31K | 87.31 | 118.16G | 407.71K | 86.87 |
FedBN-FedOPT | 451.07G | 1.78M | 88.25 | 179.28G | 758.36K | 86.22 | 109.73G | 487.2K | 84.04 |
FedBN-FedOPT-FT | 195.3G | 696.81K | 88.14 | 217.59G | 762.72K | 87.91 | 171.93G | 593.18K | 87.34 |
Ditto | 2.04T | 1.78M | 88.39 | 700.28G | 700.96K | 88.87 | 399.96G | 533.77K | 88.90 |
Ditto-FT | 2.73T | 2.26M | 85.72 | 349.06G | 600.22K | 67.97 | 433.51G | 749.29K | 72.08 |
Ditto-FedBN | 1.51T | 1.02M | 88.94 | 720.27G | 623.37K | 89.33 | 336.44G | 695.85K | 67.63 |
Ditto-FedBN-FT | 2.86T | 1.89M | 86.53 | 1.42T | 1.19M | 86.35 | 410.03G | 762.09K | 72.84 |
Ditto-FedBN-FedOpt | 2.45T | 1.65M | 88.73 | 922.71G | 801.91K | 89.30 | 408.13G | 493.82K | 88.95 |
Ditto-FedBN-FedOpt-FT | 2.52T | 1.66M | 87.02 | 1.42T | 1.19M | 87.34 | 876.33G | 993.59K | 79.38 |
FedEM | 2.98T | 2.0M | 84.35 | 1.99T | 1.57M | 84.49 | 1017.32G | 1.03M | 84.46 |
FedEM-FT | 5.79T | 1.67M | 86.17 | 4.12T | 1.16M | 86.13 | 3.83T | 1.35M | 86.11 |
FedEM-FedBN | 10.34T | 3.1M | 84.37 | 1.87T | 1.26M | 84.83 | 1.23T | 1.15M | 84.26 |
FedEM-FedBN-FT | 7.55T | 2.59M | 88.29 | 4.07T | 1.34M | 87.49 | 5.03T | 1.29M | 86.54 |
FedEM-FedBN-FedOPT | 6.0T | 1.79M | 82.12 | 3.89T | 2.63M | 85.81 | 1.81T | 1.7M | 85.16 |
FedEM-FedBN-FedOPT-FT | 13.86T | 4.76M | 87.54 | 4.11T | 1.36M | 87.69 | 3.62T | 1.15M | 85.91 |
FEMNIST, | FEMNIST, | FEMNIST, | ||||
FedAvg | 173.33 | 82.40 | 246.67 | 82.07 | 350.00 | 82.31 |
FedAvg-FT | 220.00 | 85.17 | 416.67 | 84.26 | 476.67 | 83.45 |
FedOpt | 733.33 | 19.59 | 420.00 | 27.47 | 73.33 | 22.45 |
FedOpt-FT | 63.33 | 10.39 | 146.67 | 30.88 | 800.00 | 30.02 |
pFedMe | 640.00 | 86.50 | 670.00 | 83.69 | 450.00 | 84.29 |
pFedMe-FT | 960.00 | 87.06 | 630.00 | 85.05 | 520.00 | 86.36 |
FedBN | 273.33 | 85.38 | 390.00 | 83.81 | 846.67 | 82.76 |
FedBN-FT | 340.00 | 87.65 | 466.67 | 86.22 | 410.00 | 86.01 |
FedBN-FedOPT | 943.33 | 87.27 | 580.00 | 85.09 | 490.00 | 82.63 |
FedBN-FedOPT-FT | 360.00 | 87.13 | 583.33 | 87.10 | 596.67 | 86.53 |
Ditto | 406.67 | 87.18 | 463.33 | 87.65 | 486.67 | 87.80 |
Ditto-FT | 286.67 | 84.30 | 396.67 | 65.86 | 483.33 | 70.20 |
Ditto-FedBN | 536.67 | 87.82 | 476.67 | 88.28 | 700.00 | 66.15 |
Ditto-FedBN-FT | 0.00 | 85.16 | 263.33 | 84.98 | 766.67 | 71.76 |
Ditto-FedBN-FedOpt | 873.33 | 87.64 | 613.33 | 88.20 | 496.67 | 87.84 |
Ditto-FedBN-FedOpt-FT | 880.00 | 85.71 | 263.33 | 86.00 | 0.00 | 76.75 |
FedEM | 290.00 | 82.61 | 373.33 | 83.13 | 346.67 | 82.91 |
FedEM-FT | 243.33 | 84.91 | 276.67 | 84.77 | 453.33 | 84.61 |
FedEM-FedBN | 570.00 | 82.94 | 350.00 | 83.35 | 430.00 | 82.98 |
FedEM-FedBN-FT | 476.67 | 87.09 | 373.33 | 86.56 | 483.33 | 85.37 |
FedEM-FedBN-FedOPT | 330.00 | 80.48 | 730.00 | 85.01 | 633.33 | 83.87 |
FedEM-FedBN-FedOPT-FT | 876.67 | 86.23 | 376.67 | 86.58 | 430.00 | 84.98 |
CIFAR10, | CIFAR10, | CIFAR10, | |||||||
---|---|---|---|---|---|---|---|---|---|
FLOPS | Com. | FLOPS | Com. | FLOPS | Com. | ||||
FedAvg | 388.27G | 654.43K | 73.86 | 384.14G | 646.69K | 70.82 | 305.15G | 513.76K | 56.21 |
FedAvg-FT | 422.66G | 685.52K | 73.89 | 404.17G | 654.46K | 70.12 | 259.72G | 425.4K | 53.99 |
FedOpt | 1.26T | 862.74K | 54.66 | 494.38G | 833.2K | 49.90 | 651.29G | 1.06M | 36.41 |
FedOpt-FT | 1.37T | 917.13K | 57.33 | 1003.55G | 1.59M | 51.70 | 1.36T | 895.36K | 41.56 |
pFedMe | 3.26T | 2.07M | 73.92 | 3.95T | 918.26K | 70.30 | 2.51T | 594.98K | 55.70 |
pFedMe-FT | 3.57T | 2.22M | 77.73 | 3.57T | 825.05K | 72.76 | 4.15T | 969.52K | 58.65 |
FedBN | 388.27G | 540.02K | 72.24 | 375.0G | 520.8K | 68.32 | 632.15G | 350.44K | 56.36 |
FedBN-FT | 394.14G | 527.19K | 72.55 | 356.18G | 475.9K | 67.45 | 249.57G | 337.61K | 50.52 |
FedBN-FedOPT | 581.98G | 809.39K | 71.63 | 581.92G | 809.43K | 67.13 | 660.49G | 360.45K | 57.17 |
FedBN-FedOPT-FT | 638.27G | 854.28K | 72.32 | 288.66G | 386.11K | 67.72 | 381.03G | 206.52K | 49.35 |
Ditto | 1.81T | 491.25K | 72.02 | 1.31T | 685.55K | 67.96 | 929.51G | 479.8K | 45.67 |
Ditto-FT | 4.24T | 1.12M | 68.59 | 1.22T | 631.15K | 58.52 | 1001.19G | 510.88K | 37.79 |
Ditto-FedBN | 2.21T | 495.12K | 71.70 | 1.87T | 418.18K | 62.22 | 940.86G | 392.52K | 41.59 |
Ditto-FedBN-FT | 5.71T | 1.24M | 67.90 | 1.3T | 552.87K | 57.14 | 953.17G | 392.52K | 37.21 |
Ditto-FedBN-FedOpt | 2.33T | 520.77K | 71.58 | 1.68T | 552.87K | 60.78 | 1.36T | 578.53K | 47.94 |
Ditto-FedBN-FedOpt-FT | 3.09T | 687.53K | 68.41 | 1.01T | 431.01K | 56.66 | 1.18T | 495.14K | 38.73 |
FedEM | 6.78T | 1.41M | 73.34 | 7.42T | 1.55M | 70.56 | 5.74T | 1.21M | 53.43 |
FedEM-FT | 12.49T | 1.81M | 72.89 | 12.05T | 1.74M | 65.47 | 9.75T | 1.45M | 44.98 |
FedEM-FedBN | 6.88T | 1.18M | 71.43 | 8.58T | 1.46M | 68.42 | 7.61T | 741.57K | 55.30 |
FedEM-FedBN-FT | 9.91T | 1.17M | 70.46 | 11.44T | 1.35M | 61.43 | 9.01T | 704.68K | 43.48 |
FedEM-FedBN-FedOPT | 11.2T | 1.91M | 71.43 | 12.79T | 2.18M | 67.11 | 10.63T | 1.01M | 56.87 |
FedEM-FedBN-FedOPT-FT | 9.45T | 1.12M | 71.32 | 8.85T | 1.05M | 62.03 | 9.6T | 758.5K | 41.10 |
Method | CIFAR10, | | CIFAR10, | | CIFAR10, | |
FedAvg | 280.00 | 73.87 | 276.67 | 70.89 | 213.33 | 57.71 |
FedAvg-FT | 293.33 | 73.86 | 280.00 | 69.94 | 173.33 | 54.47 |
FedOpt | 370.00 | 54.66 | 356.67 | 49.61 | 466.67 | 37.30 |
FedOpt-FT | 393.33 | 57.35 | 696.67 | 51.69 | 383.33 | 42.01 |
pFedMe | 573.33 | 73.74 | 393.33 | 70.63 | 243.33 | 56.46 |
pFedMe-FT | 643.33 | 77.82 | 353.33 | 73.46 | 400.00 | 59.74 |
FedBN | 280.00 | 72.17 | 270.00 | 68.28 | 173.33 | 57.59 |
FedBN-FT | 273.33 | 72.50 | 246.67 | 67.69 | 166.67 | 51.40 |
FedBN-FedOPT | 420.00 | 71.59 | 420.00 | 67.22 | 186.67 | 58.19 |
FedBN-FedOPT-FT | 443.33 | 72.27 | 200.00 | 68.07 | 106.67 | 50.76 |
Ditto | 210.00 | 71.97 | 293.33 | 67.65 | 196.67 | 49.77 |
Ditto-FT | 490.00 | 68.57 | 270.00 | 58.63 | 210.00 | 38.97 |
Ditto-FedBN | 256.67 | 71.65 | 216.67 | 62.65 | 203.33 | 43.07 |
Ditto-FedBN-FT | 660.00 | 67.78 | 286.67 | 57.27 | 203.33 | 39.13 |
Ditto-FedBN-FedOpt | 270.00 | 71.54 | 286.67 | 61.12 | 300.00 | 48.27 |
Ditto-FedBN-FedOpt-FT | 356.67 | 68.37 | 223.33 | 57.59 | 256.67 | 40.67 |
FedEM | 213.33 | 73.36 | 233.33 | 70.56 | 173.33 | 54.52 |
FedEM-FT | 273.33 | 72.84 | 263.33 | 65.49 | 210.00 | 42.92 |
FedEM-FedBN | 216.67 | 71.40 | 270.00 | 68.64 | 133.33 | 57.82 |
FedEM-FedBN-FT | 216.67 | 70.40 | 250.00 | 62.13 | 126.67 | 43.97 |
FedEM-FedBN-FedOPT | 353.33 | 71.39 | 403.33 | 67.19 | 186.67 | 58.40 |
FedEM-FedBN-FedOPT-FT | 206.67 | 71.27 | 193.33 | 62.31 | 136.67 | 40.70 |
Method | COLA | | | SST-2 | | | | | |
 | FLOPS | Com. | | FLOPS | Com. | | FLOPS | Com. | |
FedAvg | 22.15G | 310.94K | 71.85 | 954.66G | 464.37K | 74.88 | 726.45K | 60.32K | 62.15 |
FedAvg-FT | 37.23G | 251.93K | 68.29 | 1.41T | 629.61K | 74.14 | 4.75M | 19.26K | 70.53 |
FedOpt | 50.71G | 299.14K | 71.85 | 808.78G | 393.56K | 72.28 | 726.45K | 60.32K | 62.09 |
FedOpt-FT | 58.3G | 393.56K | 62.59 | 707.21G | 310.94K | 65.77 | 4.52M | 18.35K | 71.08 |
pFedMe | 208.22G | 1.04M | 74.40 | 4.66T | 830.51K | 71.27 | 603.46K | 29.3K | 63.45 |
pFedMe-FT | 240.44G | 357.98K | 78.47 | 3.76T | 629.4K | 75.61 | 12.07M | 15.57K | 84.00 |
FedBN | 46.07G | 259.58K | 71.85 | 954.66G | 464.37K | 74.88 | - | - | - |
FedBN-FT | 40.29G | 259.58K | 66.71 | 1.06T | 476.17K | 68.81 | - | - | - |
FedBN-FedOPT | 86.98G | 482.11K | 71.85 | 593.16G | 270.71K | 64.70 | - | - | - |
FedBN-FedOPT-FT | 39.08G | 248.45K | 67.48 | 707.21G | 292.96K | 68.65 | - | - | - |
Ditto | 133.52G | 322.75K | 55.46 | 1.43T | 275.53K | 52.03 | 2.64M | 38.42K | 70.23 |
Ditto-FT | 124.81G | 440.77K | 72.11 | 1.85T | 334.55K | 56.49 | 4.2M | 36.59K | 69.99 |
Ditto-FedBN | 96.29G | 404.22K | 70.69 | 1.48T | 270.71K | 56.03 | - | - | - |
Ditto-FedBN-FT | 113.99G | 381.97K | 72.66 | 1.58T | 270.7K | 53.15 | - | - | - |
Ditto-FedBN-FedOpt | 163.88G | 370.84K | 50.25 | 3.19T | 626.86K | 57.67 | - | - | - |
Ditto-FedBN-FedOpt-FT | 148.0G | 292.96K | 55.01 | 1.58T | 537.74K | 52.89 | - | - | - |
FedEM | 414.28G | 801.7K | 71.85 | 18.46T | 1.55M | 75.78 | 18.82M | 95.64K | 63.44 |
FedEM-FT | 1.85T | 1.02M | 54.90 | 24.19T | 1.32M | 64.86 | 24.19M | 35.27K | 70.97 |
FedEM-FedBN | 729.78G | 753.83K | 71.44 | 16.84T | 1.31M | 75.43 | - | - | - |
FedEM-FedBN-FT | 1.37T | 979.64K | 57.62 | 24.19T | 1.24M | 64.96 | - | - | - |
FedEM-FedBN-FedOPT | 414.28G | 753.83K | 71.85 | 13.34T | 1.05M | 72.25 | - | - | - |
FedEM-FedBN-FedOPT-FT | 1.49T | 1.02M | 57.23 | 22.0T | 1.12M | 62.26 | - | - | - |
Method | COLA | | SST-2 | | | |
FedAvg | 43.33 | 51.53 | 65.00 | 76.30 | 223.33 | 56.98 |
FedAvg-FT | 35.00 | 58.74 | 88.33 | 75.36 | 70.67 | 68.45 |
FedOpt | 41.67 | 57.10 | 55.00 | 73.78 | 220.33 | 59.95 |
FedOpt-FT | 55.00 | 59.77 | 43.33 | 66.17 | 66.67 | 69.20 |
pFedMe | 150.00 | 67.58 | 116.67 | 65.08 | 106.67 | 58.18 |
pFedMe-FT | 50.00 | 69.17 | 88.33 | 74.36 | 53.33 | 78.82 |
FedBN | 38.33 | 59.60 | 65.00 | 76.30 | - | - |
FedBN-FT | 38.33 | 59.69 | 66.67 | 68.50 | - | - |
FedBN-FedOPT | 71.67 | 59.60 | 40.00 | 65.59 | - | - |
FedBN-FedOPT-FT | 36.67 | 59.10 | 43.33 | 68.42 | - | - |
Ditto | 45.00 | 55.14 | 38.33 | 49.94 | 140.33 | 66.90 |
Ditto-FT | 61.67 | 63.61 | 46.67 | 54.34 | 133.33 | 66.91 |
Ditto-FedBN | 60.00 | 62.68 | 40.00 | 49.44 | - | - |
Ditto-FedBN-FT | 56.67 | 63.58 | 40.00 | 52.18 | - | - |
Ditto-FedBN-FedOpt | 55.00 | 52.48 | 93.33 | 55.61 | - | - |
Ditto-FedBN-FedOpt-FT | 43.33 | 57.13 | 80.00 | 53.16 | - | - |
FedEM | 38.33 | 51.52 | 76.67 | 76.53 | 163.33 | 61.70 |
FedEM-FT | 50.00 | 57.80 | 65.00 | 64.29 | 40.67 | 70.19 |
FedEM-FedBN | 38.33 | 57.95 | 68.33 | 75.06 | - | - |
FedEM-FedBN-FT | 50.00 | 58.74 | 65.00 | 64.33 | - | - |
FedEM-FedBN-FedOPT | 38.33 | 59.60 | 55.00 | 72.66 | - | - |
FedEM-FedBN-FedOPT-FT | 53.33 | 56.74 | 58.33 | 58.42 | - | - |
Method | Movielens-1M | | Movielens-10M | |
FedAvg | 360.33 | 0.85 | 470.67 | 0.71 |
FedAvg-FT | 0 | 0.85 | 520.33 | 0.71 |
FedAvg-FT-FedOpt | 830.0 | 0.86 | 0 | 0.75 |
FedOpt | 270.0 | 0.84 | 0 | 0.72 |
FedOpt-FT | 300.0 | 0.84 | 0 | 0.77 |
pFedMe | 0 | 0.55 | 280.33 | 12.48 |
pFedMe-FT | 470.0 | 0.60 | 840.00 | 0.80 |
Ditto | 360.67 | 1.31 | 910.67 | 1.81 |
Ditto-FT | 450.33 | 1.35 | 0 | 2.30 |
Ditto-FT-FedOpt | 550.67 | 1.35 | 0 | 1.98 |
FedEM | 700.33 | 0.87 | 0 | 2.37 |
FedEM-FT | 780.00 | 0.87 | 0 | 0.98 |
FedEM-FT-FedOpt | 0 | 0.87 | 0 | 1.88 |
Method | PUBMED | | | CORA | | | CITESEER | | |
 | FLOPS | Com. | | FLOPS | Com. | | FLOPS | Com. | |
FedAvg | 40.41G | 909.81K | 87.27 | 2.94G | 980.35K | 81.30 | 71.06G | 2.98M | 75.58 |
FedAvg-FT | 31.39G | 1.79M | 87.21 | 5.16G | 885.92K | 82.07 | 77.54G | 2.39M | 75.63 |
FedOpt | 163.3G | 6.92M | 67.38 | 2.22G | 744.07K | 70.70 | 3.89G | 437.18K | 71.59 |
FedOpt-FT | 24.78G | 791.28K | 82.36 | 6.27G | 1.05M | 82.68 | 25.07G | 791.28K | 74.34 |
pFedMe | 45.67G | 1.0M | 86.91 | 10.02G | 1.26M | 83.18 | 193.83G | 744.07K | 75.30 |
pFedMe-FT | 119.12G | 862.1K | 85.71 | 22.76G | 2.09M | 82.11 | 64.31G | 460.79K | 75.35 |
FedBN | 27.53G | 875.47K | 88.49 | 2.51G | 615.38K | 84.13 | 14.01G | 441.99K | 75.80 |
FedBN-FT | 35.86G | 2.04M | 87.45 | 29.13G | 4.85M | 76.20 | 28.25G | 531.61K | 75.07 |
FedBN-FedOPT | 13.26G | 632.72K | 87.87 | 2.07G | 736.76K | 84.64 | 26.08G | 823.45K | 76.20 |
FedBN-FedOPT-FT | 45.55G | 1.04M | 87.54 | 3.21G | 615.38K | 84.10 | 21.97G | 771.43K | 76.70 |
Ditto | 46.9G | 909.53K | 87.27 | 7.27G | 933.13K | 83.67 | 79.77G | 838.49K | 74.79 |
Ditto-FT | 43.73G | 1.54M | 87.47 | 8.59G | 909.31K | 81.47 | 42.34G | 744.07K | 76.47 |
Ditto-FedBN | 23.72G | 752.63K | 88.18 | 8.37G | 788.77K | 81.38 | 37.51G | 528.68K | 75.35 |
Ditto-FedBN-FT | 36.49G | 963.57K | 87.83 | 5.46G | 424.65K | 83.25 | 26.06G | 494.0K | 75.97 |
Ditto-FedBN-FedOpt | 57.84G | 821.99K | 87.81 | 7.45G | 702.08K | 82.54 | 20.14G | 407.31K | 75.07 |
Ditto-FedBN-FedOpt-FT | 49.88G | 650.06K | 87.60 | 7.92G | 615.38K | 82.00 | 35.48G | 684.74K | 76.42 |
FedEM | 575.33G | 3.24M | 85.64 | 61.17G | 2.01M | 81.92 | 760.69G | 8.36M | 75.41 |
FedEM-FT | 647.22G | 3.04M | 85.88 | 61.14G | 2.7M | 77.32 | 150.55G | 2.22M | 72.71 |
FedEM-FedBN | 253.94G | 2.07M | 88.12 | 18.99G | 1.92M | 85.07 | 52.64G | 1.82M | 74.90 |
FedEM-FedBN-FT | 275.77G | 3.02M | 86.38 | 34.58G | 1.72M | 84.61 | 89.54G | 1.47M | 75.29 |
FedEM-FedBN-FedOPT | 229.51G | 1.87M | 87.56 | 108.73G | 2.62M | 84.68 | 419.47G | 3.37M | 76.08 |
FedEM-FedBN-FedOPT-FT | 233.63G | 2.12M | 87.49 | 285.87G | 4.92M | 85.02 | 561.46G | 3.22M | 76.59 |
Method | PUBMED | | CORA | | CITESEER | |
FedAvg | 63.33 | 86.72 | 68.33 | 81.07 | 215.00 | 75.64 |
FedAvg-FT | 128.33 | 86.71 | 61.67 | 81.90 | 171.67 | 75.77 |
FedOpt | 0.00 | 66.69 | 51.67 | 70.17 | 30.00 | 71.60 |
FedOpt-FT | 55.00 | 81.53 | 75.00 | 82.31 | 55.00 | 74.38 |
pFedMe | 71.67 | 86.35 | 90.00 | 82.76 | 51.67 | 75.36 |
pFedMe-FT | 60.00 | 85.47 | 150.00 | 81.98 | 31.67 | 75.40 |
FedBN | 83.33 | 87.97 | 58.33 | 83.64 | 41.67 | 75.59 |
FedBN-FT | 146.67 | 87.02 | 183.33 | 76.01 | 36.67 | 75.16 |
FedBN-FedOPT | 60.00 | 87.43 | 70.00 | 84.11 | 78.33 | 76.33 |
FedBN-FedOPT-FT | 101.67 | 87.02 | 58.33 | 83.79 | 73.33 | 76.77 |
Ditto | 63.33 | 86.85 | 65.00 | 83.50 | 58.33 | 75.55 |
Ditto-FT | 110.00 | 87.10 | 63.33 | 81.53 | 51.67 | 76.57 |
Ditto-FedBN | 71.67 | 87.75 | 75.00 | 81.36 | 50.00 | 75.46 |
Ditto-FedBN-FT | 91.67 | 87.43 | 40.00 | 82.21 | 46.67 | 76.05 |
Ditto-FedBN-FedOpt | 78.33 | 87.27 | 66.67 | 82.38 | 38.33 | 75.19 |
Ditto-FedBN-FedOpt-FT | 61.67 | 87.10 | 58.33 | 81.99 | 65.00 | 76.52 |
FedEM | 78.33 | 85.05 | 48.33 | 81.72 | 203.33 | 75.49 |
FedEM-FT | 73.33 | 85.54 | 65.00 | 78.43 | 53.33 | 72.88 |
FedEM-FedBN | 68.33 | 87.63 | 63.33 | 84.45 | 60.00 | 74.29 |
FedEM-FedBN-FT | 100.00 | 85.68 | 56.67 | 84.57 | 48.33 | 75.40 |
FedEM-FedBN-FedOPT | 61.67 | 87.11 | 86.67 | 83.88 | 111.67 | 76.17 |
FedEM-FedBN-FedOPT-FT | 70.00 | 87.16 | 163.33 | 84.40 | 106.67 | 76.74 |
Method | Movielens-1M | | | Movielens-10M | | |
 | FLOPS | Com. | | FLOPS | Com. | |
FedAvg | 343.73M | 108.75K | 0.84 | 375.55G | 142.58K | 0.70 |
FedAvg-FT | 995.55M | 301.5K | 0.84 | 162.79G | 157.72K | 0.71 |
FedAvg-FT-FedOpt | 236.68M | 250.45K | 0.85 | 21.39G | 302.91K | 0.73 |
FedOpt | 258.31M | 81.62K | 0.83 | 19.75G | 302.91K | 0.71 |
FedOpt-FT | 299.3M | 90.66K | 0.83 | 52.04G | 302.91K | 0.74 |
pFedMe | 763.2M | 301.51K | 0.54 | 131.49G | 24.44K | 13.06 |
pFedMe-FT | 376.37M | 142.35K | 0.60 | 93.58G | 254.7K | 0.80 |
Ditto | 332.94M | 108.75K | 1.29 | 72.4G | 302.91K | 1.84 |
Ditto-FT | 220.62M | 135.88K | 1.35 | 242.74G | 302.91K | 1.69 |
Ditto-FT-FedOpt | 269.63M | 166.03K | 1.36 | 73.26G | 302.91K | 2.03 |
FedEM | 2.07G | 523.1K | 0.85 | 86.69G | 135.15K | 1.75 |
FedEM-FT | 4.07G | 582.82K | 0.85 | 253.37G | 269.78K | 0.87 |
FedEM-FT-FedOpt | 5.22G | 746.53K | 0.86 | 113.27G | 120.19K | 1.43 |
Method | FEMNIST, | | | SST-2 | | |
Global-Train | 86.56 | 86.56 | 11.00 | 87.23 | 93.37 | 892.00 |
Isolated | 746.85 | 1351.94 | 1108.00 | 118.00 | 151.07 | 994.00 |
FedAvg | 10506.71 | 17707.04 | 840.00 | 297.43 | 399.51 | 319.00 |
FedAvg-FT | 13400.46 | 21752.04 | 1193.00 | 314.05 | 371.73 | 545.00 |
FedProx | 19389.09 | 20729.26 | 1415.00 | 6635.17 | 7378.15 | 672.00 |
FedProx-FT | 21209.42 | 22503.71 | 1935.00 | 7804.79 | 8160.99 | 1810.00 |
pFedMe | 1880.78 | 1979.59 | 7205.00 | 262.67 | 366.15 | 1572.00 |
pFedMe-FT | 18573.50 | 21332.41 | 4676.00 | 282.19 | 367.81 | 1080.00 |
HypCluster | 21659.61 | 22386.94 | 2803.00 | 6942.08 | 7510.02 | 678.00 |
HypCluster-FT | 22569.13 | 23588.59 | 3602.00 | 7624.09 | 8232.55 | 887.00 |
FedBN | 11135.25 | 15406.92 | 1011.00 | 279.89 | 373.27 | 332.00 |
FedBN-FT | 17347.31 | 25649.89 | 936.00 | 266.80 | 331.84 | 551.00 |
FedBN-FedOPT | 21866.64 | 35335.47 | 2881.00 | 216.22 | 311.33 | 202.00 |
FedBN-FedOPT-FT | 9827.22 | 14711.30 | 845.00 | 260.59 | 350.32 | 283.00 |
Ditto | 1127.92 | 1268.18 | 4628.00 | 149.16 | 204.95 | 638.00 |
Ditto-FT | 1110.87 | 1532.88 | 8915.00 | 228.51 | 292.20 | 599.00 |
Ditto-FedBN | 1116.05 | 1227.17 | 4049.00 | 196.52 | 228.59 | 593.00 |
Ditto-FedBN-FT | 1453.92 | 1730.06 | 8528.00 | 216.48 | 290.13 | 627.00 |
Ditto-FedBN-FedOpt | 1176.76 | 1273.67 | 5782.00 | 202.71 | 237.94 | 560.00 |
Ditto-FedBN-FedOpt-FT | 1299.42 | 1564.89 | 7545.00 | 275.08 | 333.58 | 637.00 |
FedEM | 939.58 | 1063.35 | 5070.00 | 267.37 | 357.79 | 1568.00 |
FedEM-FT | 563.77 | 619.02 | 7592.00 | 327.36 | 374.10 | 3210.00 |
FedEM-FedBN | 572.14 | 912.14 | 12051.00 | 269.62 | 337.22 | 1615.00 |
FedEM-FedBN-FT | 489.82 | 616.71 | 9391.00 | 306.51 | 367.51 | 3040.00 |
FedEM-FedBN-FedOPT | 755.61 | 842.33 | 8977.00 | 251.02 | 343.16 | 1314.00 |
FedEM-FedBN-FedOPT-FT | 608.29 | 638.73 | 19137.00 | 300.92 | 353.46 | 3116.00 |