pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning

Daoyuan Chen, Dawei Gao, Weirui Kuang, Yaliang Li, Bolin Ding
Alibaba Group
{daoyuanchen.cdy, gaodawei.gdw, weirui.kwr}@alibaba-inc.com
{yaliang.li, bolin.ding}@alibaba-inc.com
corresponding author

Abstract

Personalized Federated Learning (pFL), which utilizes and deploys distinct local models, has gained increasing attention in recent years due to its success in handling the statistical heterogeneity of FL clients. However, standardized evaluation and systematic analysis of diverse pFL methods remain a challenge. Firstly, the highly varied datasets, FL simulation settings and pFL implementations prevent easy and fair comparisons of pFL methods. Secondly, the current pFL literature diverges in the adopted evaluation and ablation protocols. Finally, the effectiveness and robustness of pFL methods are under-explored in various practical scenarios, such as the generalization to new clients and the participation of resource-limited clients. To tackle these challenges, we propose the first comprehensive pFL benchmark, pFL-Bench, for facilitating rapid, reproducible, standardized and thorough pFL evaluation. The proposed benchmark contains more than 10 dataset variants in various application domains with a unified data partition and realistic heterogeneous settings; a modularized and easy-to-extend pFL codebase with more than 20 competitive pFL method implementations; and systematic evaluations under containerized environments in terms of generalization, fairness, system overhead, and convergence. We highlight the benefits and potential of state-of-the-art pFL methods and hope that pFL-Bench enables further pFL research and broad applications that would otherwise be difficult owing to the absence of a dedicated benchmark. The code is released at https://github.com/alibaba/FederatedScope/tree/master/benchmark/pFL-Bench.¹

¹ We will continuously maintain the benchmark and update the codebase and arXiv version.

1 Introduction

Federated learning (FL) is an emerging machine learning (ML) paradigm, which collaboratively trains models via coordinating certain distributed clients (e.g., smart IoT devices) with a logically centralized aggregator [1, 2]. Due to the benefit that it does not transmit local data and circumvents the high cost and privacy risks of collecting raw sensitive data from clients, FL has gained widespread interest and has been applied in numerous ML tasks such as image classification [3, 4], object detection [5, 6], keyboard suggestion [7, 8], text classification [9], relation extraction [10], speech recognition [11, 12], graph classification [13], recommendation [14, 15], and healthcare [16, 17].

Pioneering FL researchers made great efforts to find a global model that performs well for most FL clients [18, 19, 20, 21]. However, the intrinsic statistical and system heterogeneity of clients limits the performance and applicability of such classical FL methods [22, 23]. Take concept shift as an example, in which the conditional distribution $P(Y|X)$ varies across clients that share the same marginal distribution $P(X)$ [18]: a shared global model cannot fit all of these clients well at the same time. Recently, personalized FL (pFL) methods that utilize client-distinct models to overcome these challenges have been gaining popularity, such as those based on multi-task learning [24, 25, 26], meta-learning [27, 28], and transfer learning [29, 30, 31].

Even though many pFL methods have been explored, a standard benchmark is still lacking. As a result, the evaluation of pFL methods currently suffers from non-standardized datasets and implementations, highly diverse evaluation protocols, and unclear effectiveness and robustness under various practical scenarios. To be specific:

• Non-standardized datasets and implementations for pFL. Currently, researchers often use custom FL datasets and implementations to evaluate the effectiveness of proposed methods due to the absence of standardized pFL benchmarks. For example, although many pFL works use the same public FEMNIST [32] and CIFAR-10/100 [33] datasets, the partition manners diverge: the number of clients for FEMNIST is 205 in [26] but 539 in [34]; and for CIFAR-10/100, [35] adopts the Dirichlet distribution based partition while [36] uses the pathological partition. Prior pFL studies also set up different computation environments and simulation settings, increasing the difficulty of fast evaluation and the risk of unfair comparisons.

• Diverse evaluation protocols. Current pFL methods often focus on different views and adopt diverse evaluation protocols, which may lead to isolated development of pFL and prevent pFL research from reaching its full potential. For example, besides global accuracy improvement, only a few works study local accuracy characterized by fairness, and system efficiency in terms of communication and computational costs [26, 37]. Without careful design and control of the evaluation, it is difficult to compare the pros and cons of different pFL methods and to understand how much cost is paid for personalization.

• Under-explored practicability of pFL in various scenarios. Most existing pFL methods examine their effectiveness in several mild Non-IID FL cases [23]. However, it is unclear whether existing pFL methods can consistently work well in more practical scenarios, such as the participation of partial clients in which the clients have spotty connectivity [38]; the participation of resource-limited clients in which the personalization is required to be highly efficient [39]; and the generalization to new clients in which learned models will be applied to new clients that do not participate in the FL process [40].

To quantify the progress in the pFL community and facilitate rapid, reproducible, and generalizable pFL research, we propose the first comprehensive pFL benchmark characterized as follows:

• We provide 4 benchmark families with 12 dataset variants for diverse application domains involving image, text, graph and recommendation data, each with a unified data partition and some realistic Non-IID settings such as client sampling and the participation of new clients. Some popular public DataZoos such as LEAF [32], Torchvision [41] and Huggingface datasets [42] are also compatible with the proposed pFL-Bench to enable flexible and easily extended experiments.

• We implement an extensible open-sourced pFL codebase that contains more than 20 pFL methods, providing re-implementations of many state-of-the-art (SOTA) methods, unified interfaces and pluggable personalization sub-modules such as model parameter decoupling, model mixture, meta-learning and personalized regularization.

• We conduct systematic evaluation under unified experimental settings and containerized environments to show the efficacy of pFL-Bench, and provide standardized scripts to ensure the reproducibility and maintainability of pFL-Bench. We also highlight the advantages of pFL methods and opportunities for further pFL study in terms of generalization, fairness, system overhead and convergence.

2 Related Works

Personalized Federated Learning.

Despite the promising performance of using a shared global model for all clients as demonstrated in [1, 43, 44, 45, 46, 47], it is challenging to find a converged best-for-all global model under statistical and system heterogeneity among clients [48, 49, 50]. As a natural way to handle the heterogeneity, personalization has been gaining popularity in recent years. A rich body of pFL literature has explored accuracy and convergence improvements based on clustering [51, 48, 52], multi-task learning [53, 54, 55, 56], model mixture [36, 26, 57, 58], model parameter decoupling [37, 6], Bayesian treatment [59, 54], knowledge distillation [60, 31, 61], meta-learning [62, 63, 28, 64, 65], and transfer learning [66, 67, 68]. We refer readers to related FL and pFL survey papers for more details [18, 22, 23]. In pFL-Bench, we provide modularized re-implementations of numerous SOTA pFL methods with several fundamental and pluggable sub-routines for easy and fast pFL research and deployment. We plan to add more pFL methods in the future and also welcome contributions to pFL-Bench.

Federated Learning Benchmark.

We are aware of substantial efforts on benchmarking FL from various aspects, such as heterogeneous datasets (LEAF [32], TFF [69]), heterogeneous system resources (Flower [70], FedML [71], FedScale [38]), and specific domains (FedNLP [72], FS-G [73]). However, they mostly benchmark general FL algorithms and lack recently proposed pFL methods that perform well in heterogeneous FL scenarios. Besides, no benchmark so far supports the evaluation of generalization to new clients, and few existing benchmarks simultaneously support comprehensive evaluation of the trade-offs among accuracy, fairness and system costs. We hope to close these gaps with the proposed pFL-Bench, and to facilitate further pFL research and broad applications.

3 Background and Problem Formulation

3.1 A Generalized FL Setting

We first introduce some important concepts in FL, taking the FedAvg method [1] as an illustrative example. A typical FL procedure using FedAvg is as follows: each client $i\in\mathcal{C}$ has its own private dataset $\mathcal{D}_{i}$ over $\mathcal{X}\times\mathcal{Y}$, and the goal of FL is to train a single global model $\theta_{g}$ with collaborative learning from this set of clients $\mathcal{C}$ without directly sharing their local data. At each FL round, the server broadcasts $\theta_{g}$ to selected clients $\mathcal{C}_{s}\subseteq\mathcal{C}$, who then perform local learning based on their private local data and upload the local update information (e.g., the gradients of trained models) to the server. After collecting the update information from clients and aggregating it by averaging, the server applies the updates to $\theta_{g}$ for the next round of federation, and the process repeats.
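The round structure described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the benchmark's implementation: the one-parameter least-squares model, the helper names `local_update` and `fedavg_round`, and the toy data are all assumptions made for the example, and updates are weighted by local data size (one common choice of averaging).

```python
import random

def local_update(theta_g, data, lr=0.1, epochs=1):
    """Hypothetical local step: SGD on a 1-D least-squares model y ~ theta * x."""
    theta = theta_g
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (theta * x - y) * x
            theta -= lr * grad
    return theta - theta_g  # update (delta) uploaded to the server

def fedavg_round(theta_g, client_data, sample_rate, rng):
    """One round: broadcast, local training on sampled clients, averaged aggregation."""
    num_selected = max(1, int(sample_rate * len(client_data)))
    selected = rng.sample(sorted(client_data), num_selected)
    deltas = [local_update(theta_g, client_data[c]) for c in selected]
    sizes = [len(client_data[c]) for c in selected]
    total = sum(sizes)
    # Average the uploaded updates, weighted by local data size.
    aggregated = sum(n / total * d for n, d in zip(sizes, deltas))
    return theta_g + aggregated

rng = random.Random(0)
# Toy clients whose data all follow y = 2x, so the shared optimum is theta = 2.
client_data = {c: [(x / 10, 2 * x / 10) for x in range(1, 6)] for c in range(10)}
theta_g = 0.0
for _ in range(50):
    theta_g = fedavg_round(theta_g, client_data, sample_rate=0.5, rng=rng)
```

In this toy setting every client shares the same optimum, so the global model approaches it; under heterogeneous client data the averaged model may fit no single client well, which is the motivation for personalization.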

We then present a generalized FL formulation, which underpins the proposed comprehensive benchmark in terms of diverse evaluation and personalization perspectives. Specifically, besides the FL-participated clients $\mathcal{C}$, we consider a set of new clients that do not participate in the FL training process and denote it as $\mathcal{\tilde{C}}$. Most FL approaches implicitly solve the following problem:

$$
\begin{split}
\min_{\{h_{\theta_{g}}\}\cup\{h_{\theta_{i}}\}_{i\in\mathcal{C}}\cup\{h_{\theta_{j}}\}_{j\in\tilde{\mathcal{C}}}}\;
& \alpha\, G\big(\{\mathbb{E}_{(x,y)\sim\mathcal{D}_{i}}[f(\theta_{g};x,y)]\}_{i\in\mathcal{C}}\big)
+ \beta\, L\big(\{\mathbb{E}_{(x,y)\sim\mathcal{D}_{i}}[f(\theta_{i};x,y)]\}_{i\in\mathcal{C}}\big) \\
& + \gamma\, R\big(\{\mathbb{E}_{(x,y)\sim\mathcal{D}_{j}}[f(\theta_{j};x,y)\,|\,\theta_{g}]\}_{j\in\tilde{\mathcal{C}}}\big)
+ \zeta\, Q(\theta_{g},\theta_{k})_{k\in(\mathcal{C}\cup\tilde{\mathcal{C}})},
\end{split}
\tag{1}
$$

where $f(\theta;x,y)$ indicates the loss at data point $(x,y)$ with model $\theta$ . The term $G(\cdot)$ indicates the global objective based on the shared global model $\theta_{g}=Agg([\theta_{i}]_{i\in\mathcal{C}})$ with an aggregation function $Agg(\cdot)$ for model parameters, and $L(\cdot)$ indicates the local objective based on the local distinct models $[\theta_{i}]_{i\in\mathcal{C}}$ . We note that $G(\cdot)$ and $L(\cdot)$ can be in various forms such as uniform averaging or weighted averaging according to local training data size of clients, which corresponds to the commonly used intra-client generalization case [40].

For the latter two terms, $R(\cdot)$ usually has similar forms to $G(\cdot)$ and measures the generalization to new clients $\mathcal{\tilde{C}}$ that do not contribute to the FL training process. $Q(\cdot)$ indicates the modeling of the relationship between the global and local models, such as the $L^{2}$ norm to regularize the model parameters in Ditto [26]. Besides, different pFL methods may flexibly introduce various constraints on this optimization objective, which we have omitted in Eq.(1) for brevity. The coefficients $\alpha$ , $\beta$ , $\gamma$ and $\zeta$ trade off these terms. For non-personalized FL algorithms, $\beta=0$ . Although the generalization term $R(\cdot)$ has been primarily explored in a few recent studies [40, 35], most existing FL works overlooked it with $\gamma=0$ . Later, we will discuss more instantiations for $L(\cdot)$ and $Q(\cdot)$ in the personalization setting.
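As a rough illustration of how the four terms of Eq. (1) combine, the sketch below evaluates the objective from pre-computed per-client expected losses. The concrete choices are assumptions for the example only: $G$ and $L$ are taken as data-size-weighted averages, $R$ as a uniform average over new clients, and $Q$ as a summed squared $L^{2}$ distance between each local model and $\theta_{g}$; the function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Average of values weighted by (for example) local training data size."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

def generalized_objective(global_losses, local_losses, new_client_losses,
                          sizes, theta_g, local_thetas,
                          alpha=1.0, beta=1.0, gamma=0.0, zeta=0.0):
    """Evaluate alpha*G + beta*L + gamma*R + zeta*Q under the assumed forms."""
    G = weighted_mean(global_losses, sizes)   # shared model on participating clients
    L = weighted_mean(local_losses, sizes)    # personalized models on their own clients
    R = (sum(new_client_losses) / len(new_client_losses)
         if new_client_losses else 0.0)       # generalization to new clients
    Q = sum(sum((a - b) ** 2 for a, b in zip(theta_k, theta_g))
            for theta_k in local_thetas)      # global-local relationship term
    return alpha * G + beta * L + gamma * R + zeta * Q

args = dict(global_losses=[0.5, 0.7], local_losses=[0.3, 0.4],
            new_client_losses=[0.9], sizes=[100, 300],
            theta_g=[0.0, 0.0], local_thetas=[[1.0, 0.0], [0.0, 2.0]])
# Non-personalized FL that ignores new clients: beta = gamma = zeta = 0.
non_personalized = generalized_objective(beta=0.0, **args)
# All four terms active.
full = generalized_objective(gamma=1.0, zeta=0.1, **args)
```

Setting `beta=0` recovers the non-personalized case and `gamma=0` the common case in which generalization to new clients is not considered.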

3.2 Personalization Setup

With the above generalized formulation, we can see that existing pFL works achieve personalization via multi-granularity objects, including the global model $\theta_{g}$ and local models $[\theta_{i}]$ . For example, many two-step pFL approaches first find a strong $\theta_{g}$ in the FL training stages, then get local models $[\theta_{i}]$ by fine-tuning $\theta_{g}$ on local data and use $[\theta_{i}]$ in inference [23, 74]. A more flexible manner is to directly learn distinct local models $[\theta_{i}]$ in the FL process, while this introduces additional storage and computation costs for clients [75]. Recently, to gain a better accuracy-efficiency trade-off, several pFL works propose to only personalize a sub-module $\pi_{i}$ of $\theta_{i}$ , and transmit and aggregate the remaining part as $\theta_{g}=Agg([\theta_{i}\setminus\pi_{i}]_{i\in\mathcal{C}})$ [24].

We illustrate some representative personalization operations in pFL works w.r.t. the different choices of the local objective $L(\cdot)$ and $Q(\cdot)$. Fine-tuning is a basic step widely used in pFL works [63] to minimize $\mathbb{E}_{(x,y)\sim\mathcal{D}_{i}}[f(\theta_{i};x,y)]$, where $\theta_{i}$ is usually initialized using the parameters of $\theta_{g}$ before fine-tuning. Model mixture is a general pFL approach assuming that the local data distribution is a mixture of $K$ underlying data distributions $\mathcal{D}_{i}=Mix(\mathcal{D}_{i,k},w_{i,k})$ with mixture weight $w_{i,k}$ $\text{for}~{}k\in[K]$ [56]; learning a group of intermediate models is therefore suitable for handling the data heterogeneity as $\theta_{i}=Mix(\theta_{i,k})_{k\in[K]}$. For clustering-based pFL methods, $K$ indicates the number of clusters, and the mixture weight is a 0-1 indicator function of cluster membership [25]. Besides, taking $\theta_{g}$ as the reference point, model interpolation $\theta_{i}\equiv w_{g}\theta_{g}+(1-w_{g})\theta_{i}$ and model regularization $\theta_{i}=argmin\big{(}\sum_{(x,y)\sim\mathcal{D}_{i}}(f(\theta_{i};x,y)+\frac{\lambda}{2}||\theta_{i}-\theta_{g}||)\big{)}$ are also widely explored in the pFL literature [76], where $\lambda$ is the regularization factor.
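Two of the operations above, model interpolation and model regularization, can be sketched on flat parameter lists as follows. This is an illustrative sketch only: the function names are hypothetical, and the regularizer is assumed to be the squared $L^{2}$ distance $\frac{\lambda}{2}\|\theta_{i}-\theta_{g}\|^{2}$ (as used, for instance, in Ditto-style proximal terms), whose gradient is $\lambda(\theta_{i}-\theta_{g})$.

```python
def interpolate(theta_g, theta_i, w_g):
    """Model interpolation: w_g * theta_g + (1 - w_g) * theta_i, element-wise."""
    return [w_g * g + (1.0 - w_g) * l for g, l in zip(theta_g, theta_i)]

def regularized_step(theta_i, theta_g, loss_grad, lam, lr):
    """One gradient step on f(theta_i) + (lam / 2) * ||theta_i - theta_g||^2.

    loss_grad is the gradient of the local loss f at theta_i; the proximal
    term adds lam * (theta_i - theta_g), pulling theta_i toward theta_g.
    """
    return [t - lr * (g + lam * (t - tg))
            for t, g, tg in zip(theta_i, loss_grad, theta_g)]

mixed = interpolate([1.0, 1.0], [0.0, 2.0], w_g=0.25)
pulled = regularized_step([1.0], [0.0], loss_grad=[0.0], lam=0.5, lr=0.1)
```

With `w_g=1` interpolation returns the global model and with `w_g=0` the purely local one; a larger `lam` keeps the personalized model closer to $\theta_{g}$.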

4 Benchmark Design and Resources

4.1 Datasets and Models

We conduct experiments on 12 publicly available dataset variants with heterogeneous partitions in our benchmark. These datasets are popular in their corresponding fields, and cover a wide range of domains, scales, partition manners and Non-IID degrees. We list the statistics in Table 1 and show violin plots of the data size per client in Figure 1; the datasets exhibit diverse properties, enabling thorough comparisons among different pFL methods. We provide more detailed descriptions of these datasets in Appendix A. Besides, with carefully designed modularity, our codebase is compatible with a large number of datasets from other popular public DataZoos, including LEAF [32], Torchvision [41], Huggingface datasets [42] and FederatedScope-GNN [73].

Table 1: Statistics of the experimental datasets, tasks, and models in pFL-Bench. We sample 5% clients from FEMNIST [32]. Following previous works [71, 56, 37, 28], we adopt Dirichlet allocation with different $\alpha$s to simulate the heterogeneous partition for CIFAR10 and textual datasets. The $\mu$ and $\sigma$ indicate the mean and std of the number of samples per client. More datasets from popular DataZoos such as LEAF [32], Torchvision [41], Huggingface datasets [42] and FederatedScope-GNN [73] are also supported. Detailed descriptions can be found in Appendix A.

Dataset	Task	Model	Partition By	# Clients	# Samples Per Client ($\mu$)	# Samples Per Client ($\sigma$)
FEMNIST	Image Classification	CNN	Writers	200	217	73
CIFAR10-$\alpha$5	Image Classification	CNN	Labels	100	600	46
CIFAR10-$\alpha$0.5	Image Classification	CNN	Labels	100	600	137
CIFAR10-$\alpha$0.1	Image Classification	CNN	Labels	100	600	383
COLA	Linguistic Acceptability	BERT	Labels	50	192	159
SST-2	Sentiment Analysis	BERT	Labels	50	1,364	1,291
Twitter	Sentiment Analysis	LR	Users	13,203	10	11
Cora	Node Classification	GIN	Community	5	542	30
Pubmed	Node Classification	GIN	Community	5	3,943	34
Citeseer	Node Classification	GIN	Community	5	665	29
MovieLens1M	Recommendation	MF	Users	1000	1,000	482
MovieLens10M	Recommendation	MF	Items	1000	10,000	8,155

Figure 1: Violin plots of the number of samples per client for a subset of the adopted datasets. In Appendix A, we present the plots for the other datasets, the label skew visualization and the clients’ pairwise similarity of label distribution in terms of Jensen–Shannon distance.
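The Dirichlet allocation mentioned in the caption of Table 1 can be sketched as follows. This section does not spell out the exact partition routine, so the code shows one common label-wise scheme as an assumption: for every class, client proportions are drawn from a symmetric Dirichlet($\alpha$) and the class samples are split accordingly. The function name `dirichlet_partition` is hypothetical, and the Dirichlet draw is obtained by normalizing Gamma($\alpha$, 1) variates so that only the standard library is needed.

```python
import random

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with label-wise Dirichlet proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    client_indices = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet(alpha) sample equals normalized Gamma(alpha, 1) draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += gammas[c] / total
            # The last client takes the remainder so that no sample is dropped.
            end = len(idxs) if c == num_clients - 1 else int(round(cumulative * len(idxs)))
            client_indices[c].extend(idxs[start:end])
            start = end
    return client_indices

labels = [i % 10 for i in range(1000)]  # toy labels: 10 balanced classes
parts = dirichlet_partition(labels, num_clients=10, alpha=0.5)
```

Smaller values of `alpha` concentrate each class on fewer clients, which corresponds to a higher Non-IID degree.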

We preset the widely adopted CNN model [77, 78, 79, 80] with additional batch normalization layers for image datasets, and the pre-trained BERT-Tiny model from [81] and the linear regression (LR) model for the textual datasets. For the graph and recommendation datasets, we preset the graph isomorphism network, GIN [82], and Matrix Factorization (MF) [83] respectively. It is worth noting that pFL-Bench provides a unified model interface decoupled from FL algorithms, enabling users to easily register and use customized models or built-in models from existing ModelZoos including Torchvision [41], Huggingface [84], and FederatedScope [13].

Benchmark scenarios.

(Generalization) For a comprehensive evaluation, pFL-Bench supports examining the generalization performance for both FL-participated and FL-non-participated clients, i.e., the $R(\cdot)$ term in formulation (1). Specifically, we randomly select 20% of the clients as non-participated clients for each dataset, and these clients do not transmit any training-related messages during the FL process.

(Client Sampling) In addition to generalization performance, we also care about how pFL methods perform when client sampling is adopted in the FL process, which is useful in cross-device scenarios where a large number of clients have spotty connectivity. For the image, text and recommendation datasets, we uniformly sample 20% of the clients without replacement from the participating clients at each FL round.

(Cross-silo vs. cross-device) Note that the adopted datasets have quite different numbers of clients after heterogeneous partition. We choose the three graph datasets with a small number of clients to simulate cross-silo FL scenarios [73, 55], while the other datasets correspond to different scales of cross-device FL scenarios.
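The generalization and client sampling scenarios above amount to two sampling steps, sketched below under simple assumptions: a one-off random 20% hold-out of non-participated clients, and a uniform draw without replacement of 20% of the participating clients in every round. The helper names and seeds are hypothetical; the count of 200 clients mirrors the FEMNIST row of Table 1.

```python
import random

def split_clients(client_ids, holdout_frac=0.2, seed=0):
    """Randomly hold out a fraction of clients that never join FL training."""
    rng = random.Random(seed)
    ids = list(client_ids)
    rng.shuffle(ids)
    num_new = int(round(holdout_frac * len(ids)))
    return ids[num_new:], ids[:num_new]  # (participating, non-participated)

def sample_round(participating, rate, rng):
    """Uniformly sample a fraction of participating clients, without replacement."""
    k = max(1, int(round(rate * len(participating))))
    return rng.sample(participating, k)

participating, new_clients = split_clients(range(200))
round_rng = random.Random(1)
selected = sample_round(participating, rate=0.2, rng=round_rng)
```

With 200 clients this yields 160 participating clients, 40 held-out clients, and 32 selected clients per round.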

4.2 Methods

We consider abundant methods for extensive pFL comparisons. pFL-Bench provides unified and modularized interfaces for a range of popular and SOTA methods in the following three categories:

Non-pFL methods. As two naive methods sitting on opposite ends of the local-global spectrum, we evaluate the Global-Train method, which trains the model on centralized data merged from all clients, and the Isolated method, which trains a separate model for each client without any information transmission among clients. Besides, we consider the classical FedAvg [1] with weighted averaging based on local data size, FedProx [85], which introduces a proximal term during the local training process, and FedOpt [43], which generalizes FedAvg by introducing an optimizer for the FL server.

pFL methods. We compare several SOTA methods, including FedBN [86], a simple yet effective method for dealing with feature-shift Non-IID data by locally maintaining the clients’ batch normalization parameters without transmitting or aggregating them; Ditto [26], which improves fairness and robustness of FL by training a local personalized model and the global model simultaneously, with the local model update regularized toward the global model parameters; pFedMe [78], a meta-learning based method that decouples the personalized model and the global model with Moreau envelopes; HypCluster [74], which splits users into clusters and learns different models for different clusters; and FedEM [56], which assumes that the local data distribution is a mixture of unknown underlying distributions, and correspondingly learns a mixture of multiple intermediate models with the Expectation-Maximization algorithm.

Combined variants. Note that pFL-Bench provides pluggable re-implementations of existing methods, enabling users to pick different personalized behaviors and different personalized objects to form a new pFL variant. Here we combine FedBN, FedOpt, and Fine-tuning (FT) with other compatible methods, and provide fine-grained ablations via more than 20 method variants for systematic pFL study and exploration of the potential of pFL. More details of the considered methods are in Appendix B.
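To make the idea of a pluggable personalization component concrete, the sketch below mimics FedBN-style parameter decoupling on a plain dictionary of named parameters: normalization parameters stay on the client, and only the remaining parameters are averaged. It is not the FederatedScope API; the function names, the keyword-based filter, and the scalar toy parameters are assumptions for illustration.

```python
def split_shared_and_local(state, local_keywords=("bn", "norm")):
    """Separate parameters to upload (shared) from those kept on the client (local)."""
    shared, local = {}, {}
    for name, value in state.items():
        is_local = any(key in name.lower() for key in local_keywords)
        (local if is_local else shared)[name] = value
    return shared, local

def aggregate_shared(client_shared, sizes):
    """Average the shared parameters, weighted by local data size."""
    total = sum(sizes)
    return {name: sum(n / total * s[name] for n, s in zip(sizes, client_shared))
            for name in client_shared[0]}

# Hypothetical parameter names; scalars stand in for tensors.
state_a = {"conv1.weight": 1.0, "bn1.weight": 5.0}
state_b = {"conv1.weight": 3.0, "bn1.weight": 7.0}
shared_a, local_a = split_shared_and_local(state_a)
shared_b, local_b = split_shared_and_local(state_b)
global_shared = aggregate_shared([shared_a, shared_b], sizes=[100, 300])
# Each client rebuilds its model from the aggregate plus its own local part.
personalized_a = {**global_shared, **local_a}
```

Because the filter is just a predicate on parameter names, the same pattern can be combined with other components such as fine-tuning or a server-side optimizer.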

4.3 Evaluation Criteria

We propose a unified and comprehensive evaluation protocol in pFL-Bench, in which evaluations from multiple perspectives are taken into consideration. (1) For the generalization examination, we support monitoring on both the server and client sides, with various and extensible metric aggregation manners, and a wide range of metrics for diverse tasks such as classification, regression and ranking. (2) We also report several fairness-related metrics, including the standard deviation, and the top and bottom deciles of performance across different clients. (3) Numerous system-related metrics are considered as well, including the process running time, the memory cost w.r.t. average and peak memory usage, the total computational cost w.r.t. FLOPs on the server and clients, the communication cost w.r.t. the total number of downloaded/uploaded bytes, and the number of FL rounds to convergence. Detailed descriptions can be found in Appendix B.
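The aggregation and fairness metrics in (1) and (2) can be computed from per-client accuracies as sketched below. The exact definitions used by the benchmark are given in its appendix and are not reproduced here; as an assumption, the deciles are taken with the nearest-rank rule and the standard deviation is the population form. The function names and the toy numbers are hypothetical.

```python
import statistics

def nearest_rank(sorted_values, percent):
    """Nearest-rank percentile with integer arithmetic (percent in 1..100)."""
    n = len(sorted_values)
    rank = -(-n * percent // 100)  # ceil(n * percent / 100)
    return sorted_values[max(0, rank - 1)]

def fairness_summary(accs, sizes):
    """Aggregate per-client accuracies into accuracy and fairness metrics."""
    ranked = sorted(accs)
    return {
        "weighted_avg": sum(a * n for a, n in zip(accs, sizes)) / sum(sizes),
        "equal_avg": statistics.fmean(accs),
        "std": statistics.pstdev(accs),
        "bottom_decile": nearest_rank(ranked, 10),
        "top_decile": nearest_rank(ranked, 90),
    }

accs = [50, 60, 70, 80, 90, 55, 65, 75, 85, 95]   # toy per-client accuracies (%)
sizes = [10] * 5 + [30] * 5                       # toy local test-set sizes
summary = fairness_summary(accs, sizes)
```

A large gap between the weighted and the equally weighted average, or a low bottom decile, signals that good aggregate accuracy is hiding poorly served clients.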

4.4 Codebase

To facilitate innovation in the pFL community, pFL-Bench contains a well-modularized, easy-to-extend codebase for standardizing the implementation, evaluation, and ablation of pFL methods.

pFL implementations.

We build pFL-Bench upon the event-driven FL framework FederatedScope (FS) [87], which abstracts FL information transmission and processing as message passing and several pluggable subroutines. With the help of FS, we eliminate the cumbersome engineering for coordinating FL participants, and customize many message handlers and subroutines, such as model parameter decoupling, model mixture, local fine-tuning, meta-learning and regularization for personalization. By combining these pluggable components, we re-implement a number of SOTA pFL methods with unified and extensible interfaces. This modularity also makes pFL methods convenient to use, and makes contributing new pFL methods easy and flexible. We release the code under the Apache License 2.0 and will continuously include more pFL methods.

Reproducibility.

To enable easily reproducible research, we conduct experiments in containerized environments and provide standardized and documented evaluation procedures for prescribed metrics. For fair comparisons, we search for the optimal hyper-parameters using the validation sets for all methods, with early stopping and a large number of total FL rounds. We run experiments 3 times with the optimal configurations and report the average results. All the experiments are conducted on a cluster of 8 Tesla V100 and 64 NVIDIA GTX 1080 Ti GPUs, taking $\sim$13,000 runs with a total of $\sim$112 days of process computing time. More details, such as hyper-parameter search spaces, can be found in Appendix C.

5 Experimental Results and Analysis

To demonstrate the utility of pFL-Bench in providing fair, comprehensive, and rigorous comparisons among pFL methods, we conduct extensive experiments and present some main results in terms of generalization, fairness and efficiency under various FL datasets and scenarios. The complete experimental results are presented in Appendix D due to the space limitation.

Table 2: Accuracy results for both participated clients and un-participated clients. $\overline{Acc}$ indicates the aggregated accuracy weighted by the number of local data samples of participated clients, $\widetilde{Acc}$ indicates the aggregated accuracy of un-participated clients, and $\Delta$ indicates the participation generalization gap. Bold and underlined indicate the best and second-best results among all compared methods, while red and blue indicate the best and second-best results for original methods without combination “-”.

	FEMNIST, $s=0.2$			SST-2			PUBMED
	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$
Global-Train	74.51	-	-	80.57	-	-	87.01	-	-
Isolated	68.74	-	-	60.82	-	-	85.56	-	-
FedAvg	83.97	81.97	-2.00	74.88	80.24	5.36	87.27	72.63	-14.64
FedAvg-FT	86.44	84.94	-1.50	74.14	83.28	9.13	87.21	79.78	-7.43
FedProx	84.10	81.49	-2.61	74.36	79.20	4.84	87.23	75.02	-12.21
FedProx-FT	87.34	85.27	-2.08	79.94	80.48	0.59	88.24	79.12	-9.12
pFedMe	87.50	82.76	-4.73	71.27	69.34	-1.92	86.91	71.64	-15.27
pFedMe-FT	88.19	82.46	-5.73	75.61	66.48	-9.13	85.71	77.07	-8.64
HypCluster	83.80	81.88	-1.92	46.26	61.32	15.05	87.20	75.37	-11.83
HypCluster-FT	87.79	85.67	-2.12	52.46	78.67	26.21	86.43	76.69	-9.74
FedBN	86.72	7.86	-78.86	74.88	75.40	0.52	88.49	52.53	-35.95
FedBN-FT	88.51	82.87	-5.64	68.81	82.43	13.63	87.45	80.36	-7.09
FedBN-FedOPT	88.25	8.77	-79.49	64.70	65.50	0.81	87.87	42.72	-45.15
FedBN-FedOPT-FT	88.14	80.25	-7.88	68.65	70.56	1.91	87.54	77.07	-10.47
Ditto	88.39	2.20	-86.19	52.03	46.79	-5.24	87.27	2.84	-84.43
Ditto-FT	85.72	56.96	-28.76	56.49	65.50	9.01	87.47	35.03	-52.44
Ditto-FedBN	88.94	2.20	-86.74	56.03	46.79	-9.24	88.18	2.84	-85.34
Ditto-FedBN-FT	86.53	58.96	-27.57	53.15	66.49	13.34	87.83	28.52	-59.30
Ditto-FedBN-FedOpt	88.73	2.20	-86.54	57.67	46.79	-10.88	87.81	2.84	-84.97
Ditto-FedBN-FedOpt-FT	87.02	55.22	-31.80	52.89	66.49	13.60	87.60	18.18	-69.42
FedEM	84.35	82.81	-1.54	75.78	67.67	-8.11	85.64	71.12	-14.52
FedEM-FT	86.17	85.01	-1.16	64.86	81.63	16.77	85.88	78.08	-7.80
FedEM-FedBN	84.37	12.88	-71.49	75.43	62.81	-12.62	88.12	48.64	-39.48
FedEM-FedBN-FT	88.29	83.96	-4.33	64.96	81.04	16.08	86.38	72.02	-14.35
FedEM-FedBN-FedOPT	82.12	6.64	-75.48	72.25	64.69	-7.56	87.56	42.37	-45.19
FedEM-FedBN-FedOPT-FT	87.54	85.76	-1.79	62.26	73.87	11.61	87.49	72.39	-15.09

5.1 Generalization

We present the accuracy results for both participated and un-participated clients in Table 2 on the FEMNIST, SST-2 and PUBMED datasets, where $\overline{Acc}$ and $\widetilde{Acc}$ indicate the aggregated accuracy weighted by the local data samples of participated and un-participated clients respectively, and $\Delta=\widetilde{Acc}-\overline{Acc}$ indicates the generalization gap.

Comparison of original methods.

For the original methods without combination (“-”), we mark the best and second-best results in red and blue respectively in Table 2. Notably, we find that no method consistently beats the others across all metrics and all datasets. The pFL methods gain significantly better $\overline{Acc}$ over FedAvg in some cases (e.g., Ditto on FEMNIST), showing the effectiveness of personalization. However, the advantages of pFL methods on un-participated clients’ generalization $\widetilde{Acc}$ are relatively smaller than those on intra-client generalization $\overline{Acc}$. Ditto and FedBN even fail to gain reasonable $\widetilde{Acc}$ on the FEMNIST and PUBMED datasets. Because these methods did not discuss how to apply them to unseen clients, we kept the original inference behavior of their algorithms in order to respect the original methods. Examining their running dynamics, we find that their local models diverge on the un-participated clients’ data. Besides, there are still performance gaps between pFL methods and the Global-Train method on the textual dataset. These observations demonstrate plenty of room for pFL improvement.

Comparison of pFL variants.

We then extend the comparison to include pFL variants that incorporate other compatible personalized operators or methods. In a nutshell, almost all the best and second-best results (marked in bold and underlined) are achieved by the combined variants, showing the efficacy and flexibility of our benchmark, and the potential of new pFL methods. Among these methods, fine-tuning (FT) effectively improves both the $\overline{Acc}$ and $\widetilde{Acc}$ metrics in most cases, even for pFL methods that have implicitly adopted a local training process, which calls for a deeper understanding of how much we should fit the local data and how the personalized fitting impacts the FL dynamics. Besides, FedBN is a simple yet effective method to improve $\overline{Acc}$, while it may have a negative effect on $\widetilde{Acc}$ (e.g., FedEM vs. FedEM-FedBN), since the feature space of un-participated clients lacks informative characterization and the frozen BN parameters of the global model can arbitrarily mismatch these un-participated clients. We also find that FedBN is much more effective on graph data with the GIN model than on textual data with the BERT model, for which we made a simple modification that filters out the Layer Normalization parameters, showing the opportunity of designing domain- and model-specific pFL techniques.

Effect of Non-IID split.

To gain further insight into the pFL methods, we vary the Non-IID degree with different Dirichlet factors $\alpha$ for the CIFAR10 dataset and illustrate the results in Figure 2(a). Generally speaking, almost all methods suffer performance degradation as the Non-IID degree increases (from $\alpha$=5 to $\alpha$=0.1). Besides, we can see that most pFL methods show superior accuracy and robustness over FedAvg, especially for the highly heterogeneous case with $\alpha$=0.1. These results indicate the benefit of pFL methods in Non-IID situations, as well as their substantial room for improvement.

Effect of client sampling.

We also vary the client sampling rate $s$ and present the generalization results on the FEMNIST dataset in Figure 2(b). In summary, most pFL methods still achieve better results than FedAvg as $s$ decreases. However, the advantages diminish for un-participated clients, and several pFL methods are prone to failure with small $s$. In addition, we observe that FT increases the performance variance in client sampling cases. Although several pFL works provide convergence guarantees under mild assumptions, there remain open questions about the theoretical impact of clients with spotty connections [88] in the personalization case, and about the design of robust pFL algorithms.

Table 3: Fairness results in terms of $\overline{Acc}^{\prime}$ indicating the equally-weighted average, $\sigma$ indicating the standard deviation of the average accuracy, and $\breve{Acc}$ indicating the bottom accuracy. Bold, underlined, red and blue indicate the same highlights as used in Table 2.

	FEMNIST, $s=0.2$			SST-2			PUBMED
	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$
Isolated	67.08	10.76	53.16	59.40	41.29	0.00	84.67	6.26	74.63
FedAvg	82.40	9.91	69.11	76.30	22.02	44.85	86.72	3.93	79.76
FedAvg-FT	85.17	8.69	72.34	75.36	27.67	31.08	86.71	3.86	80.57
FedProx	82.37	10.25	68.57	72.25	22.03	43.51	88.06	3.62	80.08
FedProx-FT	86.06	7.89	74.60	58.72	37.58	4.17	87.87	4.02	78.80
pFedMe	86.50	8.52	75.00	65.08	26.59	27.75	86.35	4.43	78.76
pFedMe-FT	87.06	8.02	75.00	74.36	27.02	32.49	85.47	3.06	80.95
HypCluster	82.34	9.72	68.57	57.29	39.27	0.00	86.82	5.06	77.66
HypCluster-FT	86.58	7.84	75.71	56.47	42.59	0.00	85.97	5.69	76.45
FedBN	85.38	8.19	74.26	76.30	22.02	44.85	87.97	3.42	81.77
FedBN-FT	87.65	6.33	80.02	68.50	26.83	29.17	87.02	3.47	80.13
FedBN-FedOPT	87.27	7.34	76.87	65.59	31.07	22.22	87.43	4.64	80.81
FedBN-FedOPT-FT	87.13	7.36	78.27	68.42	28.18	30.71	87.02	3.94	81.78
Ditto	87.18	7.52	78.23	49.94	40.81	0.00	86.85	3.98	80.44
Ditto-FT	84.30	8.16	73.95	54.34	39.26	0.00	87.10	3.52	80.46
Ditto-FedBN	87.82	7.19	77.78	49.44	41.80	0.00	87.75	3.70	81.82
Ditto-FedBN-FT	85.16	7.98	75.25	52.18	39.85	0.00	87.43	3.77	81.15
Ditto-FedBN-FedOpt	87.64	7.08	78.23	55.61	40.43	1.39	87.27	3.90	79.14
Ditto-FedBN-FedOpt-FT	85.71	7.91	75.81	53.16	34.75	9.72	87.10	3.79	80.93
FedEM	82.61	9.57	69.29	76.53	23.34	44.44	85.05	4.44	78.51
FedEM-FT	84.91	8.39	73.64	64.29	32.84	12.96	85.54	4.48	79.39
FedEM-FedBN	82.94	9.35	70.43	75.06	18.48	53.33	87.63	4.14	82.54
FedEM-FedBN-FT	87.09	9.24	76.36	64.33	35.72	8.59	85.68	4.33	79.44
FedEM-FedBN-FedOPT	80.48	11.02	64.84	72.66	27.18	34.17	87.11	4.24	80.32
FedEM-FedBN-FedOPT-FT	86.23	8.33	75.44	58.42	31.21	17.93	87.16	3.66	82.20

5.2 Fairness Study

We then empirically investigate what degrees of fairness can be achieved by pFL methods, and report the equally-weighted average of accuracy $\overline{Acc}^{\prime}$ , the standard deviation $\sigma$ across evaluated clients, and the bottom individual accuracy $\breve{Acc}$ in Table 3. These metrics are considered as the fairness criteria in related pFL works [26, 56]. We find that $\overline{Acc}^{\prime}$ is usually smaller than the average weighted by local data size ( $\overline{Acc}$ in Table 2), indicating a client bias in existing pFL evaluation. Across the three datasets, the $\sigma$ values on SST-2 (18.48 $\sim$ 42.59) are much larger than those on the image dataset (6.33 $\sim$ 11.02), which are in turn larger than those on the graph dataset (3.06 $\sim$ 6.26), leaving room for further research to understand this difference and improve fairness in various application domains. Interestingly, compared with FedAvg, pFL methods can effectively improve bottom accuracy, while they may incur larger standard deviations. An exception is Ditto-based methods on SST-2, as the parameter regularization in Ditto may fail for the complex BERT model. Besides, the Isolated method performs poorly for clients with very small data sizes (a few dozen samples), even reaching $\breve{Acc}$ =0 on SST-2. Most of the other methods achieve much better results than Isolated, verifying the benefits of transmitting knowledge across clients.
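The three fairness metrics can be computed from per-client accuracies roughly as sketched below. The function name `fairness_summary` is hypothetical, and since this section does not define how the bottom accuracy is selected, the sketch assumes a low quantile of the per-client accuracies (the 10th percentile by default, with `bottom_q=0` giving the minimum).

```python
import statistics


def fairness_summary(accs, sizes, bottom_q=0.1):
    """Summarize per-client accuracies (in percent).

    accs:     per-client accuracy values
    sizes:    per-client local sample counts, used only for the weighted mean
    bottom_q: ASSUMPTION, quantile used as the "bottom" accuracy
    """
    if not accs or len(accs) != len(sizes):
        raise ValueError("accs and sizes must be non-empty and aligned")
    ranked = sorted(accs)
    bottom_idx = int(bottom_q * (len(ranked) - 1))  # nearest lower rank
    return {
        # Average weighted by local data size.
        "weighted_avg": sum(a * n for a, n in zip(accs, sizes)) / sum(sizes),
        # Equally-weighted average: every client counts once.
        "equal_avg": statistics.fmean(accs),
        # Population standard deviation across evaluated clients.
        "std": statistics.pstdev(accs),
        # Bottom accuracy under the quantile assumption above.
        "bottom": ranked[bottom_idx],
    }


# Toy example: the smallest client is also the least accurate one.
summary = fairness_summary(accs=[90.0, 80.0, 40.0], sizes=[500, 300, 20])
```

In this toy example the equally-weighted average (70.0) falls below the size-weighted one (about 85.1), because the poorly served client holds very little data. This is the kind of client bias that a size-weighted average can hide.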

5.3 Efficiency

To quantify the system overhead that personalization introduces into the FL process, we count the total FLOPs, communication bytes and convergence rounds, and illustrate the trade-off between these metrics and accuracy in Figure 2(c). Not surprisingly, pFL methods usually incur much larger computation and communication costs than non-personalized methods, which calls for more careful and efficient designs in future pFL research for resource-limited scenarios. Another interesting observation is that, when combined with FT or with FedOpt, which aggregates the clients’ model updates as pseudo-gradients for the global model, the convergence speed improves for some methods such as FedEM and Ditto, showing the potential of co-optimizing on both the server and local client sides.
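The pseudo-gradient view of server aggregation can be sketched as follows: the server averages the differences between the global model and the returned client models, then feeds that average to a server-side optimizer. The choice of SGD with momentum, the unweighted average, and the name `server_step` are assumptions made for illustration, not the benchmark's implementation.

```python
def server_step(global_w, client_ws, momentum_buf, server_lr=1.0, beta=0.9):
    """One server update that treats the mean client change as a pseudo-gradient.

    global_w, momentum_buf and each client_ws[i] are flat lists of floats with
    the same length. With server_lr=1.0 and beta=0.0 this reduces to plain
    unweighted model averaging.
    """
    n = len(client_ws)
    # Pseudo-gradient: average of (w_global - w_client) over the clients.
    pseudo_grad = [
        sum(g - cw[j] for cw in client_ws) / n
        for j, g in enumerate(global_w)
    ]
    # Server-side SGD with momentum applied to the pseudo-gradient.
    new_buf = [beta * m + pg for m, pg in zip(momentum_buf, pseudo_grad)]
    new_w = [g - server_lr * m for g, m in zip(global_w, new_buf)]
    return new_w, new_buf


# Toy usage with two clients and a two-parameter model.
w, buf = server_step(
    global_w=[1.0, 2.0],
    client_ws=[[0.0, 2.0], [1.0, 4.0]],
    momentum_buf=[0.0, 0.0],
    server_lr=1.0,
    beta=0.0,
)
```

Keeping a server-side optimizer state such as the momentum buffer is what distinguishes this from simple averaging, and it gives the server its own step size to tune alongside the clients' local optimization.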

5.4 More Experiments

Due to space limitations, we present more experimental results in Appendix D, including results in terms of generalization (Sec.D.1), fairness (Sec.D.2) and efficiency (Sec.D.3) for all the datasets in Table 1. To demonstrate the potential and extensibility of pFL-Bench, we also conduct experiments in the scenario of heterogeneous device resources based on FedScale [38] in Sec.D.4, where we adopt the over-selection mechanism for the server and a temporal event simulator [89]. The simulator executes the behaviors of clients according to the virtual timestamps of their message delivery to the server, and the virtual timestamps are updated by the estimated execution time based on different clients’ computational and communication capacities. This enables us to simulate different response speeds and participation degrees of clients, which correspond to heterogeneous real-world mobile devices. Moreover, in Sec.D.5, we show that pFL-Bench supports the exploration of trade-offs between pFL and privacy protection techniques, and conduct demonstrative experiments with Differential Privacy [90].
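A simplified sketch of one simulated round with virtual timestamps and over-selection is given below. It assumes that over-selection means sampling more clients than needed and aggregating only the earliest responders, and that each client's capacity can be summarized by a single speed value. The function `simulate_round` and all of its parameters are hypothetical.

```python
import heapq
import random


def simulate_round(now, speeds, work, num_needed, over_select, rng):
    """Simulate one over-selected round driven by virtual timestamps.

    speeds: dict client_id -> work units per virtual second, a stand-in for
            the client's combined computation and communication capacity.
    Returns (aggregation_time, accepted_ids, dropped_ids).
    """
    num_selected = min(len(speeds), round(num_needed * over_select))
    selected = rng.sample(sorted(speeds), num_selected)
    # Each event is (virtual timestamp of message delivery, client id).
    events = [(now + work / speeds[c], c) for c in selected]
    heapq.heapify(events)
    accepted, t = [], now
    while events and len(accepted) < num_needed:
        t, c = heapq.heappop(events)  # earliest virtual timestamp first
        accepted.append(c)
    dropped = [c for _, c in events]  # stragglers are ignored this round
    return t, accepted, dropped


# Toy usage: four devices of different speeds, two updates needed per round.
device_speeds = {0: 1.0, 1: 2.0, 2: 4.0, 3: 0.5}
t_agg, accepted, dropped = simulate_round(
    now=0.0, speeds=device_speeds, work=8.0,
    num_needed=2, over_select=2.0, rng=random.Random(0),
)
```

Because the clock advances by estimated execution time rather than wall-clock time, slow devices can be emulated on a single machine without actually waiting for them.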

6 Conclusions

In this paper, we propose a comprehensive, standardized, and extensible benchmark for personalized Federated Learning (pFL), pFL-Bench, which contains 12 dataset variants with a wide range of domains and unified partitions, and more than 20 pFL methods with pluggable and easy-to-extend pFL subroutines. We conduct extensive and systematic comparisons and conclude that designing effective, efficient and robust pFL methods with good generalization and convergence remains challenging. We release pFL-Bench with guaranteed maintenance for the community, and believe that it will benefit reproducible, easy and generalizable pFL research and potential applications. We also welcome contributions of new pFL methods and datasets to keep pFL-Bench up-to-date.

References

[1] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 1273–1282, 2017.
[2] Junyuan Hong, Zhuangdi Zhu, Shuyang Yu, Zhangyang Wang, Hiroko H. Dodge, and Jiayu Zhou. Federated adversarial debiasing for fair and transferable representations. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 617–627, 2021.
[3] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint, abs/1806.00582, 2018.
[4] Jaehoon Oh, Sangmook Kim, and Se-Young Yun. Fedbabu: Toward enhanced representation for federated image classification. In The Tenth International Conference on Learning Representations, 2022.
[5] Yang Liu, Anbu Huang, Yun Luo, He Huang, Youzhi Liu, Yuanyuan Chen, Lican Feng, Tianjian Chen, Han Yu, and Qiang Yang. Fedvision: An online visual object detection platform powered by federated learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 13172–13179, 2020.
[6] Benyuan Sun, Hongxing Huo, Yi Yang, and Bo Bai. Partialfed: Cross-domain personalized federated learning via partial initialization. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 23309–23320, 2021.
[7] Andrew Hard, Kanishka Rao, Rajiv Mathews, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint, abs/1811.03604, 2018.
[8] Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Françoise Beaufays. Applied federated learning: Improving google keyboard query suggestions. arXiv preprint, abs/1812.02903, 2018.
[9] Xinghua Zhu, Jianzong Wang, Zhenhou Hong, and Jing Xiao. Empirical studies of institutional federated learning for natural language processing. In Findings of the Association for Computational Linguistics: EMNLP, pages 625–634, 2020.
[10] Dianbo Sui, Yubo Chen, Jun Zhao, Yantao Jia, Yuantao Xie, and Weijian Sun. Feded: Federated learning via ensemble distillation for medical relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2118–2128, 2020.
[11] Matthias Paulik, Matt Seigel, Henry Mason, Dominic Telaar, Joris Kluivers, Rogier C. van Dalen, Chi Wai Lau, Luke Carlson, Filip Granqvist, Chris Vandevelde, Sudeep Agarwal, Julien Freudiger, Andrew Byde, Abhishek Bhowmick, Gaurav Kapoor, Si Beaumont, Áine Cahill, Dominic Hughes, Omid Javidbakht, Fei Dong, Rehan Rishi, and Stanley Hung. Federated evaluation and tuning for on-device personalization: System design & applications. arXiv preprint, abs/2102.08503, 2021.
[12] Dhruv Guliani, Françoise Beaufays, and Giovanni Motta. Training speech recognition models with federated learning: A quality/cost framework. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3080–3084, 2021.
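[13] Han Xie, Jing Ma, Li Xiong, and Carl Yang. Federated graph classification over non-iid graphs. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 18839–18852, 2021.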
[14] Khalil Muhammad, Qinqin Wang, Diarmuid O’Reilly-Morgan, Elias Z. Tragos, Barry Smyth, Neil Hurley, James Geraci, and Aonghus Lawlor. Fedfast: Going beyond average for faster training of federated recommender systems. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1234–1242, 2020.
[15] Chuhan Wu, Fangzhao Wu, Yang Cao, Yongfeng Huang, and Xing Xie. Fedgnn: Federated graph neural network for privacy-preserving recommendation. arXiv preprint, abs/2102.04925, 2021.
[16] Qian Yang, Jianyi Zhang, Weituo Hao, Gregory P. Spell, and Lawrence Carin. FLOP: federated learning on medical datasets using partial networks. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3845–3853, 2021.
[17] Ittai Dayan, Holger R Roth, Aoxiao Zhong, Ahmed Harouni, Amilcare Gentili, Anas Z Abidin, Andrew Liu, Anthony Beardsworth Costa, Bradford J Wood, Chien-Sung Tsai, et al. Federated learning for predicting clinical outcomes in patients with covid-19. Nature medicine, 27(10):1735–1743, 2021.
[18] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaïd Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Hang Qi, Daniel Ramage, Ramesh Raskar, Mariana Raykova, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Found. Trends Mach. Learn., 14(1-2):1–210, 2021.
[19] Chuizheng Meng, Sirisha Rambhatla, and Yan Liu. Cross-node federated graph neural network for spatio-temporal data modeling. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1202–1211, 2021.
[20] Fuxun Yu, Weishan Zhang, Zhuwei Qin, Zirui Xu, Di Wang, Chenchen Liu, Zhi Tian, and Xiang Chen. Fed2: Feature-aligned federated learning. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2066–2074, 2021.
[21] Viraaji Mothukuri, Reza M. Parizi, Seyedamin Pouriyeh, Yan Huang, Ali Dehghantanha, and Gautam Srivastava. A survey on security and privacy of federated learning. Future Gener. Comput. Syst., 115:619–640, 2021.
[22] Viraj Kulkarni, Milind Kulkarni, and Aniruddha Pant. Survey of personalization techniques for federated learning. arXiv preprint, abs/2003.08673, 2020.
[23] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. arXiv preprint, abs/2103.00710, 2021.
[24] Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 2089–2099, 2021.
[25] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[26] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 6357–6368, 2021.
[27] Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[28] Alireza Fallah, Aryan Mokhtari, and Asuman E. Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[29] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol., 10(2):12:1–12:19, 2019.
[30] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 5132–5143, 2020.
[31] Zhuangdi Zhu, Junyuan Hong, and Jiayu Zhou. Data-free knowledge distillation for heterogeneous federated learning. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 12878–12889, 2021.
[32] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint, abs/1812.01097, 2018.
[33] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[34] Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 15434–15447, 2021.
[35] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 9489–9502, 2021.
[36] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M. Alvarez. Personalized federated learning with first order model optimization. In 9th International Conference on Learning Representations, 2021.
[37] Enmao Diao, Jie Ding, and Vahid Tarokh. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In 9th International Conference on Learning Representations, 2021.
[38] Fan Lai, Yinwei Dai, Sanjay Sri Vallabh Singapuram, Jiachen Liu, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. Fedscale: Benchmarking model and system performance of federated learning at scale. In International Conference on Machine Learning, volume 162, pages 11814–11827, 2022.
[39] Fan Lai, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury. Oort: Efficient federated learning via guided participant selection. In 15th USENIX Symposium on Operating Systems Design and Implementation, pages 19–35, 2021.
[40] Honglin Yuan, Warren Richard Morningstar, Lin Ning, and Karan Singhal. What do we mean by generalization in federated learning? In International Conference on Learning Representations, 2022.
[41] Sébastien Marcel and Yann Rodriguez. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1485–1488, 2010.
[42] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, 2021.
[43] Muhammad Asad, Ahmed Moustafa, and Takayuki Ito. Fedopt: Towards communication efficiency and privacy preservation in federated learning. Applied Sciences, 10(8):2864, 2020.
[44] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[45] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for on-device federated learning. arXiv preprint, abs/1910.06378, 2019.
[46] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020.
[47] Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. FL-NTK: A neural tangent kernel-based framework for federated learning analysis. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4423–4434, 2021.
[48] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE Trans. Neural Networks Learn. Syst., 32(8):3710–3722, 2021.
[49] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag., 37(3):50–60, 2020.
[50] Zheng Chai, Hannan Fayyaz, Zeshan Fayyaz, Ali Anwar, Yi Zhou, Nathalie Baracaldo, Heiko Ludwig, and Yue Cheng. Towards taming the resource and data heterogeneity in federated learning. In 2019 USENIX Conference on Operational Machine Learning (OpML 19), pages 19–21, 2019.
[51] Christopher Briggs, Zhong Fan, and Peter Andras. Federated learning with hierarchical clustering of local updates to improve training on non-iid data. In 2020 International Joint Conference on Neural Networks, pages 1–9, 2020.
[52] Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. Tifl: A tier-based federated learning system. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pages 125–136, 2020.
[53] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pages 4424–4434, 2017.
[54] Luca Corinzia and Joachim M. Buhmann. Variational federated multi-task learning, 2019.
[55] Yutao Huang, Lingyang Chu, Zirui Zhou, Lanjun Wang, Jiangchuan Liu, Jian Pei, and Yong Zhang. Personalized cross-silo federated learning on non-iid data. In Thirty-Fifth AAAI Conference on Artificial Intelligence, pages 7865–7873, 2021.
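[56] Othmane Marfoq, Giovanni Neglia, Aurélien Bellet, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 15434–15447, 2021.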
[57] Michael Zhang, Karan Sapra, Sanja Fidler, Serena Yeung, and Jose M. Alvarez. Personalized federated learning with first order model optimization. In 9th International Conference on Learning Representations, 2021.
[58] Filip Hanzely, Slavomír Hanzely, Samuel Horváth, and Peter Richtárik. Lower bounds and optimal algorithms for personalized federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[59] Idan Achituve, Aviv Shamsian, Aviv Navon, Gal Chechik, and Ethan Fetaya. Personalized federated learning with gaussian processes. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 8392–8406, 2021.
[60] Tao Lin, Lingjing Kong, Sebastian U. Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[61] Kaan Ozkara, Navjot Singh, Deepesh Data, and Suhas N. Diggavi. Quped: Quantized personalization via distillation with applications to federated learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 3622–3634, 2021.
[62] Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, pages 5915–5926, 2019.
[63] Yihan Jiang, Jakub Konečný, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. arXiv preprint, abs/1909.12488, 2019.
[64] Karan Singhal, Hakim Sidahmed, Zachary Garrett, Shanshan Wu, John Rush, and Sushant Prakash. Federated reconstruction: Partially local federated learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 11220–11232, 2021.
[65] Durmus Alp Emre Acar, Yue Zhao, Ruizhao Zhu, Ramon Matas Navarro, Matthew Mattina, Paul N. Whatmough, and Venkatesh Saligrama. Debiasing model updates for improving personalized federated training. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 21–31, 2021.
[66] Hongwei Yang, Hui He, Weizhe Zhang, and Xiaochun Cao. Fedsteg: A federated transfer learning framework for secure image steganalysis. IEEE Trans. Netw. Sci. Eng., 8(2):1084–1094, 2021.
[67] Chaoyang He, Murali Annavaram, and Salman Avestimehr. Group knowledge transfer: Federated learning of large cnns at the edge. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[68] Jie Zhang, Song Guo, Xiaosong Ma, Haozhao Wang, Wenchao Xu, and Feijie Wu. Parameterized knowledge transfer for personalized federated learning. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, pages 10092–10104, 2021.
[69] Kallista A. Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In Proceedings of Machine Learning and Systems, volume 1, pages 374–388, 2019.
[70] Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D Lane. Flower: A friendly federated learning research framework. arXiv preprint, abs/2007.14390, 2020.
[71] Chaoyang He, Songze Li, Jinhyun So, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, Li Shen, Peilin Zhao, Yan Kang, Yang Liu, Ramesh Raskar, Qiang Yang, Murali Annavaram, and Salman Avestimehr. Fedml: A research library and benchmark for federated machine learning. arXiv preprint, abs/2007.13518, 2020.
[72] Bill Yuchen Lin, Chaoyang He, Zihang Zeng, Hulin Wang, Yufen Huang, Mahdi Soltanolkotabi, Xiang Ren, and Salman Avestimehr. Fednlp: A research platform for federated learning in natural language processing. arXiv preprint, abs/2104.08815, 2021.
[73] Zhen Wang, Weirui Kuang, Yuexiang Xie, Liuyi Yao, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope-gnn: Towards a unified, comprehensive and efficient package for federated graph learning. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4110–4120, 2022.
[74] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for personalization with applications to federated learning. arXiv preprint, abs/2002.10619, 2020.
[75] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint, abs/2003.13461, 2020.
[76] Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models. arXiv preprint, abs/2002.05516, 2020.
[77] Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In 9th International Conference on Learning Representations, 2021.
[78] Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020.
[79] Paul Pu Liang, Terrance Liu, Ziyin Liu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint, abs/2001.01523, 2020.
[80] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 9489–9502, 2021.
[81] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint, abs/1908.08962v2, 2019.
[82] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.
[83] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[84] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
[85] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems, 2020.
[86] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. Fedbn: Federated learning on non-iid features via local batch normalization. In 9th International Conference on Learning Representations, 2021.
[87] Yuexiang Xie, Zhen Wang, Daoyuan Chen, Dawei Gao, Liuyi Yao, Weirui Kuang, Yaliang Li, Bolin Ding, and Jingren Zhou. Federatedscope: A comprehensive and flexible federated learning platform via message passing. arXiv preprint, abs/2204.05011, 2022.
[88] John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Mike Rabbat, Mani Malek, and Dzmitry Huba. Federated learning with buffered asynchronous aggregation. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pages 3581–3607, 28–30 Mar 2022.
[89] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In A. Talwalkar, V. Smith, and M. Zaharia, editors, Proceedings of Machine Learning and Systems, volume 1, pages 374–388, 2019.
[90] Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, 5th International Conference, volume 4978, pages 1–19, 2008.
[91] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks, pages 2921–2926, 2017.
[92] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[93] Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019.
[94] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
[95] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
[96] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.
[97] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, page 1, 2012.
[98] Lise Getoor. Link-based classification. In Advanced methods for knowledge discovery from complex data, pages 189–207. 2005.
[99] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.
[100] Zitao Li, Bolin Ding, Ce Zhang, Ninghui Li, and Jingren Zhou. Federated matrix factorization with privacy guarantee. Proceedings of the VLDB Endowment, 15(4):900–913, 2021.
[101] Alireza Fallah, Aryan Mokhtari, and Asuman E. Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint, abs/2002.07948, 2020.
[102] Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In 5th International Conference on Learning Representations, 2017.
[103] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
[104] Lingjuan Lyu, Han Yu, and Qiang Yang. Threats to federated learning: A survey. arXiv preprint, abs/2003.02133, 2020.
[105] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H. Yang, Farhad Farokhi, Shi Jin, Tony Q. S. Quek, and H. Vincent Poor. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur., 15:3454–3469, 2020.

Appendices for the Paper: pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning

We provide more details and experimental results for pFL-Bench in the appendices:

• Sec. A: the details of the adopted datasets and models (e.g., tasks, heterogeneous partitions, and model architectures), and the extensions to other datasets and models with pFL-Bench.
• Sec. B: detailed descriptions of the methods and metrics in our experiments.
• Sec. C: implementation details, including the experimental environments and hyper-parameters.
• Sec. D: more experimental results in terms of generalization (Sec. D.1), fairness (Sec. D.2) and efficiency (Sec. D.3) for all the datasets in Table 1. Besides, to demonstrate the potential and ease of extensibility of pFL-Bench, we also conducted experiments in the heterogeneous device resource scenario based on FedScale [38] (Sec. D.4), as well as experiments incorporating privacy-preserving techniques (Sec. D.5).

Appendix A Datasets and Models

Experimental datasets.

We present detailed descriptions of the 12 publicly available dataset variants used in pFL-Bench. These datasets are popular in the corresponding fields, and cover a wide range of domains, scales, partition manners, and Non-IID degrees.

• The Federated Extended MNIST (FEMNIST) is a widely used FL dataset for 62-class handwritten character recognition [32]. The original FEMNIST dataset contains 3,550 clients, each corresponding to a character writer from EMNIST [91]. Following [13], we adopt the sub-sampled version in pFL-Bench, which contains 200 clients and 43,400 images in total with a resolution of 28x28 pixels, and the dataset is randomly split into train/valid/test sets with ratio 3:1:1. (For all the adopted datasets, the train/val/test splitting is conducted within the local data of each client.) In pFL-Bench, we use this dataset to vary the client sampling rate in FL processes, as shown in Figure 4 in the main body of the paper.
• CIFAR10 is a popular dataset for 10-class image classification containing 60,000 colored images with a resolution of 32x32 pixels. Following the heterogeneous partition manners used in [56, 32, 37, 28], we use Dirichlet allocation to split this dataset into 100 clients with different Dirichlet factors $\alpha=[5,0.5,0.1]$ (a smaller $\alpha$ indicates a higher degree of heterogeneity). We split the dataset into train/valid/test sets with ratio 4:1:1.
• The Corpus of Linguistic Acceptability (COLA) is a textual classification dataset from [92, 93], which contains 9,600 English sentences labeled with grammatical correctness. In pFL-Bench, this dataset is partitioned into 50 clients via Dirichlet allocation with $\alpha=0.4$. We split the dataset into train/valid/test sets with a ratio of about 7:2:1.
• The Stanford Sentiment Treebank (SST-2) is a sentiment classification dataset from [92, 94], which contains 68,200 movie review sentences labeled with human sentiment. Similar to COLA, in pFL-Bench, this dataset is partitioned into 50 clients via Dirichlet allocation with $\alpha=0.4$. The train/valid/test sets are with a ratio of about 60:15:1. For COLA and SST-2, since the test subsets from GLUE [92] are unlabeled (private in the GLUE server), we made new train/val/test partitions different from the GLUE versions.
• The Twitter dataset is a textual sentiment analysis dataset from [32]. We adopt a subset that contains 13,203 users. The partition manner for this dataset is natural w.r.t. users, and the median number of data samples per user is 7. The train/valid/test sets for each client are with a ratio of about 3:1:1.
• The Cora dataset is a citation network that contains 2,708 nodes and 5,429 edges, in which each node indicates a scientific publication classified into one of seven classes [95]. Following FS-G [73], we split it into 5 clients using a community detection algorithm, Louvain [96]. The train/valid/test sets are with a ratio of about 3:1:1.
• The Pubmed dataset contains 19,717 nodes and 44,338 edges. The nodes indicate scientific publications classified into one of three classes [97]. Following FS-G [73], we split it into 5 clients with the Louvain community partition. The train/valid/test sets are with a ratio of about 3:1:5.
• The Citeseer dataset is a citation network that contains 3,312 nodes and 4,732 edges, in which each node indicates a scientific publication classified into one of six classes [98]. Following FS-G [73], we split it into 5 clients with the Louvain community partition. The train/valid/test sets are with a ratio of about 4:1:1.
• Movielens1M contains 1,000,209 ratings from 6,040 users and 3,900 movies [99]. Following the horizontal partition manner used in [100], in pFL-Bench, we split this dataset into 1,000 clients according to users. The train/valid/test sets are with a ratio of about 14:3:3.
• Movielens10M contains 10,000,054 ratings from 71,567 users and 10,681 movies [99]. Following the vertical partition manner used in [100], in pFL-Bench, we split this dataset into 1,000 clients according to items. The train/valid/test sets are with a ratio of about 14:3:3.
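
The Dirichlet allocation used for CIFAR10, COLA and SST-2 above can be illustrated with a minimal sketch. It shows one common label-skew variant (per class, client proportions are drawn from a symmetric Dirichlet with factor $\alpha$); the exact partitioner in the benchmark code may differ. The function names `dirichlet_sample` and `dirichlet_partition` are hypothetical and only the Python standard library is assumed.

```python
import random
from collections import defaultdict


def dirichlet_sample(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Numerical underflow for tiny alpha: put all mass on one category.
        one_hot = [0.0] * k
        one_hot[rng.randrange(k)] = 1.0
        return one_hot
    return [d / total for d in draws]


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices into clients with label skew controlled by alpha.

    Illustrative assumption: for every class, the share each client receives
    is drawn from Dirichlet(alpha); a smaller alpha gives a more skewed split.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = dirichlet_sample(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += props[c]
            # The last client takes the remainder so that no sample is lost.
            end = len(idxs) if c == num_clients - 1 else min(len(idxs), int(round(cum * len(idxs))))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(1000)]
    for a in (5, 0.5, 0.1):
        sizes = [len(p) for p in dirichlet_partition(toy_labels, 10, a)]
        print(f"alpha={a}: client sizes {sizes}")
```

Running the script with the three factors from the CIFAR10 setting typically shows client sizes becoming more uneven as $\alpha$ shrinks.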

For all these experimental datasets, we randomly select 20% of the clients as new clients that do not participate in the FL processes. We summarize some statistics in Table 1 in the main body of the paper. Besides, we illustrate the violin plot of data size per client in Figure 2(d), the label skew visualization of certain datasets in Figure 2(e), and the clients’ pairwise similarity of label distributions in terms of Jensen–Shannon distance in Figure 2(f). The smaller the Jensen–Shannon distance, the more similar the compared distributions. We can see that as the degree of heterogeneity increases (i.e., as $\alpha$ decreases), both the label skew and the Jensen–Shannon distances become larger. Similar calculations can be performed on a variety of FL datasets; one can then rank these distances and select those clients whose distributions are very different but whose models do not perform well for further analysis, understanding and algorithm improvement. Furthermore, all these results show diverse properties across the adopted FL datasets in pFL-Bench, enabling comprehensive comparisons among different methods.
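
As a hedged illustration of the pairwise comparison described above, the following sketch computes the Jensen–Shannon distance between the label distributions of two clients. It assumes base-2 logarithms (so the distance lies in [0, 1]) and integer class labels; the helper names are hypothetical and not taken from the benchmark code.

```python
import math


def label_distribution(labels, num_classes):
    """Empirical label distribution of one client (labels are ints in [0, num_classes))."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


if __name__ == "__main__":
    client_a = label_distribution([0, 0, 0, 1], num_classes=3)
    client_b = label_distribution([1, 2, 2, 2], num_classes=3)
    print(js_distance(client_a, client_b))
```

Ranking all client pairs by this value is one way to find the most dissimilar clients for the follow-up analysis mentioned in the text.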

Models

To align with previous works [77, 56, 78, 79, 80], we preset a 2-layer CNN for FEMNIST and CIFAR10. Specifically, the model consists of two convolutional layers with 5 × 5 kernels, max pooling, batch normalization, ReLU activation, and two dense layers. The hidden size is 2,048 for FEMNIST and 512 for CIFAR10. For the COLA and SST-2 datasets, we preset the pre-trained BERT-Tiny model from [81], which contains 2-layer Transformer encoders with a hidden size of 128. For the Twitter dataset, we preset an LR model with 50d GloVe embeddings (https://nlp.stanford.edu/data/glove.6B.zip). For the graph datasets, we preset the graph isomorphism neural network, GIN [82], which contains 2-layer convolutions with batch normalization, a hidden size of 64, and a dropout rate of 0.5. For the recommendation datasets, we preset the Matrix Factorization (MF) model [83] with a hidden size of 20 for user and item embeddings.

Remark on the adopted dataset scales and model sizes.

It is worth noting that simulation with pFL algorithms on a large client scale is very challenging, because we need to maintain a distinct (personalized) model object for each client. Take the well-known benchmark FEMNIST as an example: it has 3,550 users, and suppose we adopt the widely used two-layer CNN. Although this model only occupies 200MB, maintaining 3,550 such models would consume more than 700GB of memory. Unlike non-personalized FL algorithms, for which it is feasible to maintain only one model object for all the clients, pFL algorithms may have to switch and cache the personalized models among CPU, GPU and even disks. Due to the large number of methods and datasets included in our benchmark, and the correspondingly huge hyper-parameter search space, we used several subsets of the FL datasets to reduce the reproduction and experimental barriers.
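
The memory estimate above is simple arithmetic, sketched below under the assumption of decimal units (1 GB = 1,000 MB) and ignoring optimizer states, activations and data. The function names are hypothetical; the 380 GB figure is the per-machine memory reported in Appendix C.

```python
def total_model_memory_gb(num_clients, model_mb):
    """Memory needed to keep one personalized model per client, in GB."""
    return num_clients * model_mb / 1000.0


def max_resident_models(host_memory_gb, model_mb):
    """Upper bound on how many models fit in host memory at once."""
    return int(host_memory_gb * 1000 // model_mb)


if __name__ == "__main__":
    # 3,550 FEMNIST clients with a 200MB model each.
    print(total_model_memory_gb(3550, 200))   # 710.0
    # A single 380 GB machine can hold at most this many such models.
    print(max_resident_models(380, 200))      # 1900
```

Under these assumptions, a single machine could keep roughly half of the personalized models resident, which is why caching across devices or sub-sampling clients becomes necessary.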

Extension.

We note that besides the experimental datasets and models introduced above, our code-base is compatible with a large number of datasets and models from other popular public DataZoos and ModelZoos. We provide unified dataset, dataloader, and model I/O interfaces with carefully designed modularity, which enable users to easily register and extend datasets/models with simple and flexible configuration, such as different heterogeneous partition manners, numbers of clients, new client ratios, model types and model parameter dimensions. Currently, we support datasets from LEAF [32], Torchvision [41], Huggingface datasets [42], FederatedScope (FS) [13] and FederatedScope-GNN (FS-G) [73]; and models from Torchvision [41], Huggingface [84], FS [13] and FS-G [73].

Appendix B Methods and Metrics

B.1 Methods

We present detailed descriptions of the methods in pFL-Bench, which covers a range of popular and SOTA methods in three categories: Non-pFL methods, pFL methods and Combined variants.

The following Non-pFL methods are considered in pFL-Bench:

• The Global-Train method refers to training only a centralized model on the data merged from all clients.
• The Isolated method indicates that each client trains its client-specific model without FL communication. The Global-Train and Isolated methods provide a good reference to examine the benefits of pFL processes. For these two methods, we omit the un-participated clients.
• The FedProx [85] method leverages a proximal term to encourage the updated models at clients not to differ too much from the global model.
• In addition, we include the classical FedAvg [1], which averages gradients weighted by the data size of clients in each FL round.
• The FedOpt [43] algorithm is also considered, which generalizes FedAvg by introducing an optimizer for the FL server. We use SGD as the server optimizer for FedOpt and search its learning rate.
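
To make the relation between FedAvg and FedOpt concrete, here is a minimal sketch on flat parameter lists. It assumes the usual formulation in which the server treats the difference between the current global model and the size-weighted client average as a pseudo-gradient and applies a plain SGD step; the function names are hypothetical and not the benchmark's API.

```python
def weighted_average(client_params, client_sizes):
    """FedAvg-style aggregation: average client vectors weighted by local data size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


def fedopt_sgd_step(global_params, aggregated, server_lr):
    """FedOpt with a server-side SGD optimizer.

    The pseudo-gradient is (global - aggregated); server_lr = 1.0 recovers FedAvg.
    """
    return [g - server_lr * (g - a) for g, a in zip(global_params, aggregated)]


if __name__ == "__main__":
    global_model = [0.0, 0.0]
    updates = [[1.0, 2.0], [3.0, 4.0]]
    sizes = [1, 3]
    avg = weighted_average(updates, sizes)
    print(avg)                                       # [2.5, 3.5]
    print(fedopt_sgd_step(global_model, avg, 1.0))   # [2.5, 3.5]
    print(fedopt_sgd_step(global_model, avg, 0.5))   # [1.25, 1.75]
```

Searching the server learning rate therefore controls how far the global model moves towards the aggregated client result in each round.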

pFL methods. We consider the following representative SOTA methods:

• FedBN [86] is a simple yet effective pFL method aiming to handle the feature shift Non-IID challenge. It locally maintains the clients’ batch normalization parameters without FL communication and aggregation. In pFL-Bench, we generalize FedBN to the Transformer model by filtering out the layer normalization parameters.
• Ditto [26] is a pFL method aiming to improve the fairness and robustness of FL. For each client, Ditto maintains a local personalized model and the global model at the same time. The global model is trained with the same procedure as in FedAvg, and the local model is trained with a personalized regularization according to the global model parameters.
• pFedMe [78] is a meta-learning based method that also regularizes the local models according to the global model parameters. The authors propose to use Moreau envelope based regularization to reduce the complexity caused by Hessian matrix computation, which is required by some meta-learning based pFL methods such as Per-FedAvg [101].
• pFL-Bench also contains multi-model based pFL methods. The HypCluster [74] method proposes to split clients into clusters and learns different personalized models for different clusters. The cluster is determined by the performance on validation sets. In our experiments, we set the number of clusters to 3 for a fair comparison with FedEM.
• The FedEM method [56] assumes that the local data distribution is a mixture of multiple underlying distributions. It learns a mixture of multiple local models with the Expectation-Maximization algorithm to deal with the data heterogeneity, and can be easily extended to several clustering based and multi-task learning based pFL methods. In our experiments, we use 3 internal models for FedEM according to the authors’ default choice.
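
For reference, the regularization-based objectives mentioned above can be written side by side. The block below restates them as commonly given in the respective papers, up to notation; here $F_k$ (or $f_k$) is the local loss of client $k$, $w$ the global model, $v_k$ and $\theta_k$ personalized models, $N$ the number of clients, and $\mu$, $\lambda$ the regularization weights. The symbol choices are ours and may differ from those in the benchmark code.

```latex
% FedProx [85]: local objective of client k at round t, given the global model w^t
\min_{w} \; F_k(w) + \frac{\mu}{2}\,\lVert w - w^{t} \rVert^{2}

% Ditto [26]: personalized model v_k regularized towards the global model w^{*}
\min_{v_k} \; F_k(v_k) + \frac{\lambda}{2}\,\lVert v_k - w^{*} \rVert^{2}

% pFedMe [78]: bi-level objective whose client loss is a Moreau envelope
\min_{w} \; \frac{1}{N}\sum_{k=1}^{N} \Big[ \min_{\theta_k} \; f_k(\theta_k) + \frac{\lambda}{2}\,\lVert \theta_k - w \rVert^{2} \Big]
```

All three share a quadratic penalty that ties a locally trained model to the global one; they differ in which model is kept for inference and in how the inner problem is solved.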

Combined variants. It is worth noting that we provide pluggable re-implementations of numerous existing methods in pFL-Bench. This modularity enables users to pick different personalized objects and behaviors to form a new pFL variant. We combine FedBN, FedOpt, and Fine-tuning (FT) with other compatible methods. The FedBN combination makes the batch/layer normalization parameters personalized and locally maintained. The FedOpt combination introduces the server optimizer into the FL processes. The Fine-tuning (FT) combination fine-tunes the local models for a few steps before evaluation within the FL processes.
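
The FedBN combination can be pictured as a filter over parameter names: normalization parameters stay on the client, and everything else is shared. The sketch below is an assumption-laden illustration, not the benchmark's implementation; in particular, the keyword matching (`"bn"`, `"norm"`) and the example parameter names are hypothetical and depend on how a given model names its layers.

```python
def split_personalized(state, local_keywords=("bn", "norm")):
    """Split a name->value parameter dict into shared and locally kept parts.

    Parameters whose (lower-cased) name contains any keyword are treated as
    batch/layer normalization parameters and are not communicated.
    """
    shared, local = {}, {}
    for name, value in state.items():
        if any(key in name.lower() for key in local_keywords):
            local[name] = value
        else:
            shared[name] = value
    return shared, local


if __name__ == "__main__":
    toy_state = {
        "conv1.weight": [0.1],
        "bn1.weight": [1.0],
        "encoder.layer.0.output.LayerNorm.bias": [0.0],
        "fc.weight": [0.3],
    }
    shared, local = split_personalized(toy_state)
    print(sorted(shared))  # ['conv1.weight', 'fc.weight']
    print(sorted(local))   # ['bn1.weight', 'encoder.layer.0.output.LayerNorm.bias']
```

Only the `shared` part would be uploaded for aggregation; when a new global model arrives, the client overwrites its shared entries and keeps its `local` ones.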

To facilitate fine-grained ablations and systematic pFL study, we finally compare more than 20 pFL method variants in the experiments. We will continuously include more pFL methods into pFL-Bench.

B.2 Metrics

Here we summarize the metrics monitored in our benchmark and describe them in more detail.

Generalization.

We support server-side and client-side monitoring w.r.t. widely used performance metrics such as accuracy, loss, F1, etc. Specifically, in the paper, we let $\overline{Acc}$ ($\overline{Loss}$) denote the accuracy (loss) average weighted by the number of local data samples, $\widetilde{Acc}$ ($\widetilde{Loss}$) denote the accuracy (loss) of un-participated clients, and $\Delta$ denote the participation generalization gap.
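
A minimal sketch of these generalization metrics follows. It assumes that the gap is computed as the un-participated value minus the participated value, which matches the sign convention of the result tables (e.g., FedAvg on COLA: 63.49 − 71.85 = −8.36), and that the un-participated accuracy is also weighted by local data size; the latter is our assumption. Function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Average of per-client values weighted by the number of local samples."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def participation_gap(participated, unparticipated):
    """Return (acc_bar, acc_tilde, delta) from lists of (accuracy, num_samples)."""
    acc_bar = weighted_mean([a for a, _ in participated], [n for _, n in participated])
    acc_tilde = weighted_mean([a for a, _ in unparticipated], [n for _, n in unparticipated])
    return acc_bar, acc_tilde, acc_tilde - acc_bar


if __name__ == "__main__":
    seen = [(0.8, 100), (0.6, 300)]
    unseen = [(0.5, 50), (0.7, 50)]
    print(participation_gap(seen, unseen))
```

A negative gap means the method performs worse on clients that did not take part in training, which is the typical case in the tables of Appendix D.1.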

Fairness.

We support different ways of summarizing the evaluated metrics over clients, such as the weighted average (e.g., $\overline{Acc}$), the uniform average (e.g., $\overline{Acc}^{\prime}$), the standard deviation (denoted by $\sigma$ in our paper), and various quantiles such as the bottom accuracy $\breve{Acc}$. For the bottom accuracy, we report the ($\lfloor|\mathcal{C}|/10\rfloor$)-th worst accuracy, where $|\mathcal{C}|$ is the number of all evaluated clients. To align with FedEM, the 90th percentile is considered here to omit the particularly noisy results from poorly performing clients with very small data sizes.
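
The fairness summaries can be sketched in a few lines. Two choices below are assumptions rather than statements about the benchmark: the population standard deviation is used for $\sigma$, and when fewer than ten clients are evaluated (so that $\lfloor|\mathcal{C}|/10\rfloor = 0$) the single worst accuracy is returned. The function name is hypothetical.

```python
import statistics


def fairness_summary(accs):
    """Uniform average, standard deviation and bottom accuracy over clients."""
    n = len(accs)
    uniform_avg = sum(accs) / n
    sigma = statistics.pstdev(accs)   # assumption: population standard deviation
    k = max(n // 10, 1)               # floor(|C| / 10), at least the worst client
    bottom = sorted(accs)[k - 1]      # k-th worst accuracy
    return uniform_avg, sigma, bottom


if __name__ == "__main__":
    client_accs = [float(i) for i in range(20)]   # toy accuracies 0, 1, ..., 19
    print(fairness_summary(client_accs))
```

With 20 clients, the bottom accuracy is the second-worst value, so a single extremely noisy client does not dominate the reported number.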

System costs.

For computational and communication costs, we support monitoring several proxy metrics, including FLOPs, communication bytes, and convergence rounds. The FLOPs are counted as the sum over both training and inference via a per-operator FLOP counting tool, fvcore/flop_count (https://github.com/facebookresearch/fvcore/blob/main/docs/flop_count.md). The reported communication bytes are counted as the sum of upstream and downstream traffic across all participants until convergence with early stopping. Besides, thanks to the integration with wandb, our benchmark also supports more runtime metrics, including the dynamic utilization of CPU, GPU, memory, disk, etc. (https://docs.wandb.ai/ref/app/features/system-metrics).
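
As a rough illustration of the communication metric, the sketch below sums upstream and downstream traffic under simplifying assumptions: a fixed number of participants per round and fixed message sizes. The function name and the example numbers are hypothetical; in practice, a method that keeps some parameters local (e.g., normalization layers) would upload and download fewer bytes.

```python
def communication_bytes(rounds, clients_per_round, download_bytes, upload_bytes):
    """Total traffic until convergence: every sampled client downloads and uploads once per round."""
    return rounds * clients_per_round * (download_bytes + upload_bytes)


if __name__ == "__main__":
    # Hypothetical run: converges after 300 rounds, 40 clients sampled per round,
    # and a 2 MB model message in each direction.
    total = communication_bytes(300, 40, 2_000_000, 2_000_000)
    print(f"{total / 1e9:.1f} GB")   # 48.0 GB
```

Because the count stops at the early-stopping round, methods that converge faster are credited with lower communication cost even if their per-round messages are the same size.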

Appendix C Implementation

Environments.

We implement pFL-Bench based on the FS [13] package and PyTorch. The experiments are conducted on a cluster of 8 Tesla V100 and 64 NVIDIA GeForce GTX 1080 Ti GPUs, each machine with 380 GB of memory and a Xeon Platinum 8163 2.50GHz CPU containing 96 cores. Our experiments are conducted in containerized environments with Ubuntu 18.04.

We provide versioned Dockerfiles, the built Docker images and the experimental datasets on our website with the Aliyun storage service (https://github.com/alibaba/FederatedScope/tree/master/benchmark/pFL-Bench). pFL-Bench and the underlying FS [13] package are continuously developed and maintained by the Data Analytics and Intelligence Lab (DAIL) of DAMO Academy. We will actively fix potential issues, track updates and publish GitHub releases.

Hyper-parameters.

For fair comparisons, we first use wandb sweep (https://docs.wandb.ai/sweeps) with the hyper-parameter optimization (HPO) algorithm HyperBand [102] to find the best hyper-parameters for all the methods on all datasets. The validation sets are used, and we employ early stopping with a large number of total FL rounds $T$. We set this hyper-parameter so that almost all methods converge within $T$ rounds. Specifically, for FEMNIST, CIFAR10, Movielens-1M, Movielens-10M and Twitter, we set $T=1,000$. For COLA, SST-2, Pubmed, Cora and Citeseer, we set $T=500$. The batch size is set to 32 for image datasets, 64 for textual datasets, and 1,024 for the recommendation datasets. For graph datasets, we adopt full batch training. For all methods, we search the local update steps (i.e., the number of local training epochs in each FL round) from $[1,3,6]$. For the local SGD learning rate, we search from $[0.05,0.005,0.5,0.01,0.1,1,2]$. For FedOpt, we search the server learning rate from $[0.05,0.1,0.5,1.5]$. For pFedMe, we search the personalized regularization weight from $[0.05,0.1,0.2,0.9]$ and its local meta-learning step from $[1,3]$. For FedEM, we set its number of mixture models to 3. For Ditto, we search its personalized regularization weight from $[0.05,0.1,0.2,0.5,0.8]$.
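
To give a feel for the size of the search space described above, the following sketch enumerates the candidate values as a plain grid. The dictionary keys are hypothetical names, not the configuration keys of the benchmark, and HyperBand does not evaluate every combination to completion; the grid only shows how many configurations exist per method.

```python
from itertools import product

# Candidate values shared by all methods (taken from the text above).
COMMON = {
    "local_update_steps": [1, 3, 6],
    "local_lr": [0.05, 0.005, 0.5, 0.01, 0.1, 1, 2],
}

# Method-specific candidates; key names are illustrative only.
METHOD_SPECIFIC = {
    "FedAvg": {},
    "FedOpt": {"server_lr": [0.05, 0.1, 0.5, 1.5]},
    "pFedMe": {"regular_weight": [0.05, 0.1, 0.2, 0.9], "meta_steps": [1, 3]},
    "Ditto": {"regular_weight": [0.05, 0.1, 0.2, 0.5, 0.8]},
    "FedEM": {"num_mixture_models": [3]},
}


def grid(method):
    """List every hyper-parameter combination for one method."""
    space = {**COMMON, **METHOD_SPECIFIC[method]}
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]


if __name__ == "__main__":
    for name in METHOD_SPECIFIC:
        print(name, len(grid(name)))
```

Even this modest grid yields between 21 and 168 configurations per method and dataset, which helps explain why early stopping and a bandit-based scheduler are used.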

To enable easily reproducible research, we provide standardized and documented scripts including the HPO scripts, the experiments running scripts and searched best configuration files in our code-base (see the link in the above paragraph).

Appendix D Additional Experimental Results

D.1 Generalization

The generalization results for FEMNIST, SST-2 and PUBMED are shown in Table 2 and Figure 3 in the main body of the paper. Here we present the results for other datasets, including all the textual datasets (Table 4), all the graph datasets (Table 5), and all the recommendation datasets (Table 6). We note that FedOpt may yield poor results on some datasets when the models contain batch/layer normalization parameters. Besides, for the Twitter and recommendation datasets, we do not compare the FedBN based methods, as the LR and MF models used do not contain batch/layer normalization parameters.

Table 4: Accuracy results for both participated clients and un-participated clients on COLA, SST-2 and Twitter datasets. $\overline{Acc}$ indicates the aggregated accuracy weighted by the number of local data samples of participated clients, $\widetilde{Acc}$ indicates the aggregated accuracy of un-participated clients, and $\Delta$ indicates the participation generalization gap.

	COLA			SST-2			Twitter
	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$
Global-Train	69.06	-	-	80.57	-	-	55.56	-	-
Isolated	55.96	-	-	60.82	-	-	70.04	-	-
FedAvg	71.85	63.49	-8.36	74.88	80.24	5.36	62.15	61.24	-0.91
FedAvg-FT	68.29	58.66	-9.63	74.14	83.28	9.13	70.53	71.17	0.64
FedOpt	71.85	59.62	-12.23	72.28	83.06	10.78	62.09	61.64	-0.45
FedOpt-FT	62.59	47.82	-14.77	65.77	80.02	14.25	71.08	71.41	0.33
pFedMe	74.40	67.64	-6.76	71.27	69.34	-1.92	63.45	62.52	-0.94
pFedMe-FT	78.47	76.33	-2.14	75.61	66.48	-9.13	84.00	71.80	-12.20
FedBN	71.85	63.49	-8.36	74.88	75.40	0.52	-	-	-
FedBN-FT	66.71	49.87	-16.84	68.81	82.43	13.63	-	-	-
FedBN-FedOPT	71.85	62.48	-9.37	64.70	65.50	0.81	-	-	-
FedBN-FedOPT-FT	67.48	57.59	-9.90	68.65	70.56	1.91	-	-	-
Ditto	55.46	49.90	-5.56	52.03	46.79	-5.24	70.23	49.60	-20.63
Ditto-FT	72.11	52.15	-19.96	56.49	65.50	9.01	69.99	51.32	-18.67
Ditto-FedBN	70.69	49.90	-20.79	56.03	46.79	-9.24	-	-	-
Ditto-FedBN-FT	72.66	53.44	-19.21	53.15	66.49	13.34	-	-	-
Ditto-FedBN-FedOpt	50.25	49.90	-0.35	57.67	46.79	-10.88	-	-	-
Ditto-FedBN-FedOpt-FT	55.01	58.22	3.21	52.89	66.49	13.60	-	-	-
FedEM	71.85	63.49	-8.36	75.78	67.67	-8.11	63.44	62.68	-0.75
FedEM-FT	54.90	48.29	-6.61	64.86	81.63	16.77	70.97	71.59	0.62
FedEM-FedBN	71.44	63.99	-7.45	75.43	62.81	-12.62	-	-	-
FedEM-FedBN-FT	57.62	58.88	1.26	64.96	81.04	16.08	-	-	-
FedEM-FedBN-FedOPT	71.85	62.82	-9.03	72.25	64.69	-7.56	-	-	-
FedEM-FedBN-FedOPT-FT	57.23	58.88	1.65	62.26	73.87	11.61	-	-	-

Table 5: Accuracy results for both participated clients and un-participated clients on Pubmed, Cora and Citeseer datasets. $\overline{Acc}$ indicates the aggregated accuracy weighted by the number of local data samples of participated clients, $\widetilde{Acc}$ indicates the aggregated accuracy of un-participated clients, and $\Delta$ indicates the participation generalization gap.

	PUBMED			CORA			CITESEER
	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$
Global-Train	87.01	-	-	86.10	-	-	74.03	-	-
Isolated	85.56	-	-	82.48	-	-	69.83	-	-
FedAvg	87.27	72.63	-14.64	81.30	72.14	-9.16	75.58	59.83	-15.74
FedAvg-FT	87.21	79.78	-7.43	82.07	75.84	-6.22	75.63	66.07	-9.57
FedOpt	67.38	53.84	-13.54	70.70	59.58	-11.12	71.59	55.16	-16.43
FedOpt-FT	82.36	64.09	-18.27	82.68	62.17	-20.51	74.34	62.36	-11.99
pFedMe	86.91	71.64	-15.27	83.18	70.13	-13.05	75.30	58.45	-16.85
pFedMe-FT	85.71	77.07	-8.64	82.11	71.48	-10.63	75.35	62.14	-13.20
FedBN	88.49	52.53	-35.95	84.13	57.33	-26.80	75.80	51.29	-24.51
FedBN-FT	87.45	80.36	-7.09	76.20	64.13	-12.07	75.07	64.20	-10.87
FedBN-FedOPT	87.87	42.72	-45.15	84.64	53.27	-31.37	76.20	50.34	-25.86
FedBN-FedOPT-FT	87.54	77.07	-10.47	84.10	68.14	-15.96	76.70	62.84	-13.86
Ditto	87.27	2.84	-84.43	83.67	14.53	-69.14	74.79	14.52	-60.27
Ditto-FT	87.47	35.03	-52.44	81.47	72.64	-8.84	76.47	39.27	-37.20
Ditto-FedBN	88.18	2.84	-85.34	81.38	14.53	-66.84	75.35	14.52	-60.83
Ditto-FedBN-FT	87.83	28.52	-59.30	83.25	65.87	-17.38	75.97	36.51	-39.46
Ditto-FedBN-FedOpt	87.81	2.84	-84.97	82.54	14.53	-68.01	75.07	14.52	-60.55
Ditto-FedBN-FedOpt-FT	87.60	18.18	-69.42	82.00	70.75	-11.25	76.42	48.99	-27.43
FedEM	85.64	71.12	-14.52	81.92	72.02	-9.90	75.41	59.60	-15.81
FedEM-FT	85.88	78.08	-7.80	77.32	79.45	2.13	72.71	66.77	-5.94
FedEM-FedBN	88.12	48.64	-39.48	85.07	50.98	-34.09	74.90	41.55	-33.35
FedEM-FedBN-FT	86.38	72.02	-14.35	84.61	76.77	-7.83	75.29	68.38	-6.91
FedEM-FedBN-FedOPT	87.56	42.37	-45.19	84.68	56.45	-28.24	76.08	53.81	-22.27
FedEM-FedBN-FedOPT-FT	87.49	72.39	-15.09	85.02	76.97	-8.05	76.59	68.62	-7.96

Table 6: Loss results for both participated clients and un-participated clients on Movielens-1M and Movielens-10M datasets. $\overline{Loss}$ indicates the loss average weighted by the number of local data samples, $\widetilde{Loss}$ indicates the loss of un-participated clients, and $\Delta$ indicates the participation generalization gap.

	Movielens-1M			Movielens-10M
	$\overline{Loss}$	$\widetilde{Loss}$	$\Delta$	$\overline{Loss}$	$\widetilde{Loss}$	$\Delta$
Global-Train	0.78	-	-	0.67	-	-
Isolated	10.35	-	-	11.48	-	-
FedAvg	0.84	14.17	13.33	0.70	13.39	12.68
FedAvg-FT	0.84	9.76	8.92	0.71	11.07	10.36
FedAvg-FT-FedOpt	0.85	5.15	4.31	0.73	11.40	10.66
FedOpt	0.83	14.17	13.34	0.71	13.39	12.68
FedOpt-FT	0.83	12.06	11.23	0.74	11.92	11.18
pFedMe	0.54	14.18	13.64	13.06	12.73	-0.33
pFedMe-FT	0.60	8.20	7.60	0.80	12.59	11.79
Ditto	1.29	14.19	12.89	1.84	13.39	11.55
Ditto-FT	1.35	14.17	12.81	1.69	13.39	11.70
Ditto-FT-FedOpt	1.36	14.15	12.79	2.03	13.39	11.35
FedEM	0.85	14.27	13.43	1.75	13.41	11.65
FedEM-FT	0.85	4.86	4.01	0.87	12.80	11.93
FedEM-FT-FedOpt	0.86	4.47	3.61	1.43	13.25	11.82

D.2 Fairness

The fairness results for FEMNIST, SST-2 and PUBMED are listed in Table 3 in the main body of the paper. Here we present the results for other datasets, including all the textual datasets (Table 7), all the graph datasets (Table 8), and all the recommendation datasets (Table 9).

Table 7: Fairness results on COLA, SST-2 and Twitter datasets. $\overline{Acc}^{\prime}$ indicates the equally-weighted average, $\sigma$ indicates the standard deviation of the average accuracy, and $\breve{Acc}$ indicates the bottom accuracy. Bold, underlined, red and blue indicate the same highlights as used in Table 2.

	COLA			SST-2			Twitter
	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$
Isolated	56.86	35.34	0.00	59.40	41.29	0.00	67.44	37.86	0.00
FedAvg	51.53	35.96	0.00	76.30	22.02	44.85	56.98	37.97	0.00
FedAvg-FT	58.74	35.21	0.00	75.36	27.67	31.08	68.45	36.19	0.00
FedOpt	57.10	31.70	10.85	73.78	34.03	17.53	59.95	37.88	0.00
FedOpt-FT	59.77	35.99	0.00	66.17	33.73	15.56	69.20	36.28	0.00
pFedMe	67.58	38.06	0.90	65.08	26.59	27.75	58.18	36.67	0.00
pFedMe-FT	69.17	33.63	9.90	74.36	27.02	32.49	78.82	33.28	22.22
FedBN	59.60	35.96	0.00	76.30	22.02	44.85	-	-	-
FedBN-FT	59.69	35.44	0.00	68.50	26.83	29.17	-	-	-
FedBN-FedOPT	59.60	36.03	0.00	65.59	31.07	22.22	-	-	-
FedBN-FedOPT-FT	59.10	35.15	0.00	68.42	28.18	30.71	-	-	-
Ditto	55.14	36.76	0.00	49.94	40.81	0.00	66.90	38.05	0.00
Ditto-FT	63.61	35.02	0.00	54.34	39.26	0.00	66.91	38.08	0.00
Ditto-FedBN	62.68	35.74	0.00	49.44	41.80	0.00	-	-	-
Ditto-FedBN-FT	63.58	34.58	0.00	52.18	39.85	0.00	-	-	-
Ditto-FedBN-FedOpt	52.48	35.89	0.00	55.61	40.43	1.39	-	-	-
Ditto-FedBN-FedOpt-FT	57.13	36.20	0.00	53.16	34.75	9.72	-	-	-
FedEM	51.52	35.96	0.00	76.53	23.34	44.44	61.70	37.72	0.00
FedEM-FT	57.80	35.24	0.00	64.29	32.84	12.96	70.19	36.35	0.00
FedEM-FedBN	57.95	34.11	1.52	75.06	18.48	53.33	-	-	-
FedEM-FedBN-FT	58.74	35.56	1.00	64.33	35.72	8.59	-	-	-
FedEM-FedBN-FedOPT	59.60	35.49	0.00	72.66	27.18	34.17	-	-	-
FedEM-FedBN-FedOPT-FT	56.74	35.61	1.00	58.42	31.21	17.93	-	-	-

Table 8: Fairness results on Pubmed, Cora, and Citeseer datasets. $\overline{Acc}^{\prime}$ indicates the equally-weighted average accuracy, $\sigma$ indicates the standard deviation of the average accuracy, and $\breve{Acc}$ indicates the bottom accuracy. Bold, underlined, red and blue indicate the same highlights as used in Table 2.

	PUBMED			CORA			CITESEER
	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$
Isolated	84.67	6.26	74.63	81.62	4.67	72.64	69.90	5.76	61.10
FedAvg	86.72	3.93	79.76	81.07	5.06	73.26	75.64	5.03	67.30
FedAvg-FT	86.71	3.86	80.57	81.90	3.06	77.57	75.77	5.00	68.68
FedOpt	66.69	16.69	48.50	70.17	12.56	38.89	71.60	9.20	42.25
FedOpt-FT	81.53	16.69	46.21	82.31	6.35	68.03	74.38	4.13	69.26
pFedMe	86.35	4.43	78.76	82.76	3.40	76.70	75.36	4.79	68.55
pFedMe-FT	85.47	3.06	80.95	81.98	1.63	79.58	75.40	4.39	70.31
FedBN	87.97	3.42	81.77	83.64	3.88	77.53	75.59	3.80	71.24
FedBN-FT	87.02	3.47	80.13	76.01	4.32	69.92	75.16	5.04	67.91
FedBN-FedOPT	87.43	4.64	80.81	84.11	3.93	75.44	76.33	4.28	72.43
FedBN-FedOPT-FT	87.02	3.94	81.78	83.79	2.39	80.26	76.77	4.60	71.68
Ditto	86.85	3.98	80.44	83.50	3.54	75.02	75.55	4.23	67.99
Ditto-FT	87.10	3.52	80.46	81.53	3.67	75.63	76.57	4.24	70.31
Ditto-FedBN	87.75	3.70	81.82	81.36	3.34	74.93	75.46	4.83	68.57
Ditto-FedBN-FT	87.43	3.77	81.15	82.21	4.37	72.56	76.05	4.21	69.16
Ditto-FedBN-FedOpt	87.27	3.90	79.14	82.38	2.10	77.69	75.19	4.06	68.60
Ditto-FedBN-FedOpt-FT	87.10	3.79	80.93	81.99	3.67	73.29	76.52	4.43	71.31
FedEM	85.05	4.44	78.51	81.72	3.72	74.95	75.49	4.66	68.50
FedEM-FT	85.54	4.48	79.39	78.43	11.72	56.20	72.88	6.55	63.53
FedEM-FedBN	87.63	4.14	82.54	84.45	3.14	76.71	74.29	4.22	69.06
FedEM-FedBN-FT	85.68	4.33	79.44	84.57	3.04	80.22	75.40	3.84	70.59
FedEM-FedBN-FedOPT	87.11	4.24	80.32	83.88	4.37	78.93	76.17	4.76	70.54
FedEM-FedBN-FedOPT-FT	87.16	3.66	82.20	84.40	4.43	79.53	76.74	4.53	71.92
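
The three fairness columns used in these tables can be derived from per-client scores. The following is a minimal sketch, not the benchmark's implementation: the function name `fairness_summary`, the use of the population standard deviation, and the reading of "bottom" as the single worst client (rather than a lower quantile) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def fairness_summary(client_scores, higher_is_better=True):
    """Summarize per-client scores as (average, sigma, bottom).

    Assumptions (not confirmed by the text): sigma is the population
    standard deviation over clients, and "bottom" is the single worst
    client rather than a lower quantile.
    """
    if not client_scores:
        raise ValueError("client_scores must be non-empty")
    avg = mean(client_scores)      # equally-weighted average over clients
    sigma = pstdev(client_scores)  # spread of the scores across clients
    # For accuracy the worst client has the minimum; for loss, the maximum.
    bottom = min(client_scores) if higher_is_better else max(client_scores)
    return avg, sigma, bottom


# Hypothetical per-client accuracies (in %), for illustration only.
accs = [90.0, 80.0, 70.0, 60.0]
avg, sigma, bottom = fairness_summary(accs)
```

For loss-based tables, passing `higher_is_better=False` makes the bottom value the largest loss, matching the "bottom loss (the largest)" convention.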

Table 9: Fairness results on Movielens-1M and Movielens-10M datasets. $\overline{Loss}^{\prime}$ indicates the equally-weighted average loss, $\breve{Loss}$ indicates the bottom loss (the largest), and $\sigma$ indicates the standard deviation of the average loss. Bold, underlined, red and blue indicate the same highlights as used in Table 2.

	Movielens-1M			Movielens-10M
	$\overline{Loss}^{\prime}$	$\sigma$	$\breve{Loss}$	$\overline{Loss}^{\prime}$	$\sigma$	$\breve{Loss}$
Isolated	11.12	2.43	14.34	11.44	1.83	13.75
FedAvg	0.85	0.21	1.13	0.71	0.11	0.84
FedAvg-FT	0.85	0.21	1.12	0.71	0.11	0.85
FedAvg-FT-FedOpt	0.86	0.22	1.14	0.75	0.12	0.90
FedOpt	0.84	0.21	1.11	0.72	0.12	0.87
FedOpt-FT	0.84	0.21	1.11	0.77	0.13	0.94
pFedMe	0.55	0.11	0.69	12.48	2.44	15.76
pFedMe-FT	0.60	0.12	0.75	0.80	0.13	0.96
Ditto	1.31	0.77	1.69	1.81	0.24	2.12
Ditto-FT	1.35	0.88	1.70	2.30	1.15	4.05
Ditto-FT-FedOpt	1.35	0.79	1.70	1.98	0.27	2.32
FedEM	0.87	0.22	1.15	2.37	1.25	4.09
FedEM-FT	0.87	0.23	1.16	0.98	0.26	1.29
FedEM-FT-FedOpt	0.87	0.22	1.16	1.88	0.94	3.09

D.3 Efficiency

The efficiency-accuracy trade-off results for FEMNIST are plotted in Figure 5 in the main body of the paper. Here we present more efficiency-accuracy trade-off results for the experimental datasets, including the FEMNIST datasets with different client sampling rates (Table 11 and Table 12), CIFAR-10 datasets with different $\alpha$ (Table 13 and Table 14), all the textual datasets (Table 15 and Table 16), all the graph datasets (Table 18 and Table 19), and all the recommendation datasets (Table 17 and Table 20).

Besides the reported proxy system metrics such as FLOPs and the number of convergence rounds of FL processes, our benchmark also supports monitoring more runtime metrics. Thanks to the integration with wandb, we can easily track the runtime usage of system resources, including CPU, GPU, memory, and disk utilization. In Table 21, we report the average and peak process memory usage (in MB) and the process running time (in seconds). In general, most pFL algorithms do have higher time and space overheads. We omit results for the other metrics: because we launched many sets of experiments concurrently, occupying as much GPU memory and CPU/GPU utilization as possible, these metrics differ little across our experiments. However, it is worth noting that the omitted metrics can be used to analyze algorithm bottlenecks in terms of system performance and to optimize space-time efficiency in single-experiment scenarios.
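
To illustrate what average memory, peak memory, and running time mean for a single process, the sketch below measures them with the Python standard library. It is not the wandb-based monitoring used by the benchmark: `profile_steps` is a hypothetical helper, and `tracemalloc` only sees Python-level allocations, so its numbers are not comparable to the process memory reported in Table 21.

```python
import time
import tracemalloc


def profile_steps(steps):
    """Run callables in order and return (avg_mb, peak_mb, seconds).

    Memory is sampled once after each step, so the average is only as
    fine-grained as the steps themselves.
    """
    mb = 1024 * 1024
    samples = []
    tracemalloc.start()
    start = time.perf_counter()
    try:
        for step in steps:
            step()
            current, _ = tracemalloc.get_traced_memory()
            samples.append(current / mb)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    elapsed = time.perf_counter() - start
    avg = sum(samples) / len(samples) if samples else 0.0
    return avg, peak / mb, elapsed


# Hypothetical workload: three "rounds" that each keep a buffer alive.
buffers = []
rounds = [lambda: buffers.append(bytearray(200_000)) for _ in range(3)]
mem_avg, mem_peak, t_run = profile_steps(rounds)
```

Because the peak is tracked continuously while the average is built from per-step samples, the peak is never below the average.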

D.4 Heterogeneous Device Resources

The proposed pFL-Bench is readily extensible to experiments with heterogeneous device resources, where clients have different computational and communication capacities. Specifically, we integrate FedScale [38] into our benchmark with a simulator that executes client behaviors according to the virtual timestamps of their message delivery to the server. The virtual timestamps are updated with the execution time estimated from the clients’ computational and communication capacities, using the cost model proposed in FedScale. The server employs an over-selection mechanism for clients at each broadcast round, so some clients’ messages may be dropped, since the clients have different system capacities and response speeds corresponding to real-world mobile devices.⁷

⁷ https://github.com/SymbioticLab/FedScale/tree/master/benchmark/dataset/data/device_info

Here we take the Ditto method on the FEMNIST dataset as an example and present the results of experiments with heterogeneous device resources in Table 10. Let $s$ be the client sampling rate for each FL round, and $s_{agg}$ be the minimal ratio of received feedback, relative to the number of sampled clients, for the server to trigger federated aggregation in over-selection mode. For the homo-device case, we set $s=0.2$, and for the hetero-device case, we set $s=0.25$ and $s_{agg}=0.8$, leading to the same number of clients used for each federated aggregation ($0.25\times 0.8=0.2$). From the results, we can see that the hetero-device version converges more slowly ($T^{\prime}=0$ indicates that early stopping is not triggered within the large number of FL rounds $T=1000$), and it achieves worse performance than the homo-device version, especially for the bottom accuracy ($\breve{Acc}$) and the standard deviation of the average accuracy ($\sigma$). This shows unfairness among clients: some low-resourced clients take too long in computation or communication for their feedback to be incorporated into the federated aggregation, which calls for more consideration of device heterogeneity in pFL algorithm design.
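
The over-selection setting above can be sketched as follows. This is an illustrative simulation under stated assumptions, not the FedScale simulator: the total client count `N`, the function `over_select`, and the uniformly random finish times are hypothetical stand-ins for the device cost model.

```python
import random


def over_select(num_clients, s, s_agg, rng):
    """Sample round(s * N) clients and keep the fastest round(s_agg * sampled).

    Finish times are drawn at random as a stand-in for a device cost
    model; they are hypothetical, not FedScale's estimates.
    """
    num_sampled = round(s * num_clients)
    sampled = rng.sample(range(num_clients), num_sampled)
    finish_time = {cid: rng.uniform(1.0, 10.0) for cid in sampled}
    num_needed = round(s_agg * num_sampled)
    by_arrival = sorted(sampled, key=finish_time.get)
    # The server aggregates the earliest arrivals and drops the stragglers.
    return by_arrival[:num_needed], by_arrival[num_needed:]


rng = random.Random(0)
N = 200  # hypothetical number of clients
homo_used = round(0.2 * N)  # homo-device case: s = 0.2, no over-selection
used, dropped = over_select(N, s=0.25, s_agg=0.8, rng=rng)
```

With these settings both cases aggregate the same number of clients per round, but in the hetero-device case the dropped clients are always the slowest ones, which is the source of the unfairness discussed above.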

Table 10: Comparison between Ditto methods with and without heterogeneous device capabilities on the FEMNIST dataset. $\overline{Acc}$ indicates the accuracy average weighted by the number of local data samples, $\widetilde{Acc}$ indicates the accuracy of un-participated clients, and $\Delta$ indicates the participation generalization gap. $\overline{Acc}^{\prime}$ indicates the equally-weighted average accuracy, $\sigma$ indicates the standard deviation of the average accuracy, and $\breve{Acc}$ indicates the bottom accuracy. Efficiency metrics include total FLOPS, communication bytes (Com.) and the convergence round $T^{\prime}$. $T^{\prime}=0$ indicates that early stopping is not triggered within the large number of FL rounds, $T=1000$.

	$\overline{Acc}$	$\widetilde{Acc}$	$\Delta$	$\overline{Acc}^{\prime}$	$\sigma$	$\breve{Acc}$	FLOPS	Com.	$T^{\prime}$
Ditto, Homo-Device	88.39	2.2	-86.19	87.18	7.52	78.23	849.3G	2.81M	610
Ditto, Hetero-Device	79.76	1.43	-78.33	77.39	11.25	61.76	1.72T	5.72M	0
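
The two rows of Table 10 are consistent with reading the participation generalization gap as the accuracy of un-participated clients minus the accuracy of participated clients. This reading is inferred from the reported numbers rather than from a definition given in this section:

```latex
\Delta = \widetilde{Acc} - \overline{Acc}:\qquad
2.2 - 88.39 = -86.19 \ \text{(homo-device)},\qquad
1.43 - 79.76 = -78.33 \ \text{(hetero-device)}.
```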

D.5 Incorporating Differential Privacy

The trade-off between personalization and privacy protection is interesting and under-explored, and it matters because FL involves transmitting local (possibly private) information. We note that FederatedScope supports various privacy-related fundamental components and algorithms, such as Differential Privacy (DP) [103] and privacy attack methods [104] that can be used to examine the privacy-preserving strength. The modularized and extensible design of our benchmark makes further research that combines pFL and privacy-preserving techniques convenient. As a preliminary example, here we demonstrate the combination of pFL with a Differential Privacy algorithm, NbAFL [105], which achieves $(\epsilon,\delta)$-DP via noise injection and gradient clipping.

In Figure 2(g), we plot the learning curves of the FedAvg and Ditto methods on the FEMNIST dataset with various $(\epsilon,\delta)$-DP settings. Generally speaking, the weaker the privacy protection, the smaller the performance degradation. We can see in the figure that, with larger $\epsilon$ and $\delta$, the accuracy ($\overline{Acc}$) is better for both compared methods, which meets our expectation. Interestingly, Ditto is significantly more robust than FedAvg to the dramatically varying privacy protection strengths throughout the learning process. This may be because, under noise perturbation, the personalized local models potentially provide more local optima that clients can reach. However, there is still a gap in the best achievable performance between Ditto and FedAvg, leaving an interesting open question about how to reduce the performance degradation by co-designing personalization and noise injection.
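
The two ingredients named above, gradient clipping and noise injection, can be sketched generically as follows. This is not NbAFL's procedure: the helper `clip_and_noise` and the parameters `clip_norm` and `noise_std` are hypothetical, and the calibration of the noise scale to a target $(\epsilon,\delta)$ is deliberately left out.

```python
import math
import random


def clip_and_noise(update, clip_norm, noise_std, rng):
    """Clip an update to L2 norm <= clip_norm, then add Gaussian noise.

    The mapping from noise_std to an (epsilon, delta) guarantee is not
    computed here; noise_std is a free, hypothetical parameter.
    """
    norm = math.sqrt(sum(x * x for x in update))
    # Scale down only if the update exceeds the clipping threshold.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [x * scale for x in update]
    return [x + rng.gauss(0.0, noise_std) for x in clipped]


rng = random.Random(0)
update = [3.0, 4.0]  # L2 norm 5, so it is scaled down to norm 1
private = clip_and_noise(update, clip_norm=1.0, noise_std=0.1, rng=rng)
```

A larger `noise_std` corresponds to stronger protection and, as the learning curves suggest, to a larger expected loss in accuracy.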

Table 11: The efficiency-accuracy trade-off results including total FLOPS, communication bytes (Com.), and $\overline{Acc}$ for FEMNIST datasets with different $s$.

	FEMNIST, $s=0.2$			FEMNIST, $s=0.1$			FEMNIST, $s=0.05$
	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$
FedAvg	195.38G	1.24M	83.97	159.9G	373.57K	83.84	78.48G	384.0K	83.21
FedAvg-FT	296.8G	1.35M	86.44	155.49G	630.44K	85.45	217.97G	522.82K	84.78
FedOpt	846.22G	1.68M	19.31	272.3G	635.48K	27.29	29.0G	80.81K	22.53
FedOpt-FT	77.81G	149.97K	9.69	54.88G	222.47K	31.02	365.54G	877.15K	29.73
pFedMe	2.24T	1.52M	87.50	516.47G	1012.93K	84.70	216.66G	493.06K	85.23
pFedMe-FT	1.33T	2.19M	88.19	1.22T	953.21K	86.28	585.23G	569.72K	87.37
FedBN	243.31G	1.21M	86.72	120.66G	510.15K	85.10	189.64G	841.57K	84.36
FedBN-FT	208.99G	687.58K	88.51	174.22G	610.31K	87.31	118.16G	407.71K	86.87
FedBN-FedOPT	451.07G	1.78M	88.25	179.28G	758.36K	86.22	109.73G	487.2K	84.04
FedBN-FedOPT-FT	195.3G	696.81K	88.14	217.59G	762.72K	87.91	171.93G	593.18K	87.34
Ditto	2.04T	1.78M	88.39	700.28G	700.96K	88.87	399.96G	533.77K	88.90
Ditto-FT	2.73T	2.26M	85.72	349.06G	600.22K	67.97	433.51G	749.29K	72.08
Ditto-FedBN	1.51T	1.02M	88.94	720.27G	623.37K	89.33	336.44G	695.85K	67.63
Ditto-FedBN-FT	2.86T	1.89M	86.53	1.42T	1.19M	86.35	410.03G	762.09K	72.84
Ditto-FedBN-FedOpt	2.45T	1.65M	88.73	922.71G	801.91K	89.30	408.13G	493.82K	88.95
Ditto-FedBN-FedOpt-FT	2.52T	1.66M	87.02	1.42T	1.19M	87.34	876.33G	993.59K	79.38
FedEM	2.98T	2.0M	84.35	1.99T	1.57M	84.49	1017.32G	1.03M	84.46
FedEM-FT	5.79T	1.67M	86.17	4.12T	1.16M	86.13	3.83T	1.35M	86.11
FedEM-FedBN	10.34T	3.1M	84.37	1.87T	1.26M	84.83	1.23T	1.15M	84.26
FedEM-FedBN-FT	7.55T	2.59M	88.29	4.07T	1.34M	87.49	5.03T	1.29M	86.54
FedEM-FedBN-FedOPT	6.0T	1.79M	82.12	3.89T	2.63M	85.81	1.81T	1.7M	85.16
FedEM-FedBN-FedOPT-FT	13.86T	4.76M	87.54	4.11T	1.36M	87.69	3.62T	1.15M	85.91
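
To compare methods across these efficiency tables, the unit-suffixed entries first need to be converted to plain numbers. The sketch below does this for two rows copied from Table 11; the helper `parse_quantity` is hypothetical, and treating K/M/G/T as decimal powers of 1000 is an assumption that the text does not state.

```python
_SCALE = {"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}


def parse_quantity(text):
    """Convert a table entry such as '195.38G' to a float.

    Assumes decimal (powers of 1000) suffixes; entries without a
    suffix are returned unchanged as floats.
    """
    text = text.strip()
    if text and text[-1] in _SCALE:
        return float(text[:-1]) * _SCALE[text[-1]]
    return float(text)


# Two rows copied from Table 11 (FEMNIST, s=0.2): (FLOPS, Com., Acc).
rows = {
    "FedAvg": ("195.38G", "1.24M", 83.97),
    "Ditto": ("2.04T", "1.78M", 88.39),
}
flops_ratio = parse_quantity(rows["Ditto"][0]) / parse_quantity(rows["FedAvg"][0])
acc_gain = rows["Ditto"][2] - rows["FedAvg"][2]
```

Under this reading, Ditto spends roughly ten times the FLOPS of FedAvg in this setting for an accuracy gain of about 4.4 points.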

Table 12: The convergence results including the convergence round $T^{\prime}$ and $\overline{Acc}$ for FEMNIST datasets with different $s$. $T^{\prime}=0$ indicates that early stopping is not triggered within the large number of FL rounds, $T=1000$.

	FEMNIST, $s=0.2$		FEMNIST, $s=0.1$		FEMNIST, $s=0.05$
	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$
FedAvg	173.33	82.40	246.67	82.07	350.00	82.31
FedAvg-FT	220.00	85.17	416.67	84.26	476.67	83.45
FedOpt	733.33	19.59	420.00	27.47	73.33	22.45
FedOpt-FT	63.33	10.39	146.67	30.88	800.00	30.02
pFedMe	640.00	86.50	670.00	83.69	450.00	84.29
pFedMe-FT	960.00	87.06	630.00	85.05	520.00	86.36
FedBN	273.33	85.38	390.00	83.81	846.67	82.76
FedBN-FT	340.00	87.65	466.67	86.22	410.00	86.01
FedBN-FedOPT	943.33	87.27	580.00	85.09	490.00	82.63
FedBN-FedOPT-FT	360.00	87.13	583.33	87.10	596.67	86.53
Ditto	406.67	87.18	463.33	87.65	486.67	87.80
Ditto-FT	286.67	84.30	396.67	65.86	483.33	70.20
Ditto-FedBN	536.67	87.82	476.67	88.28	700.00	66.15
Ditto-FedBN-FT	0.00	85.16	263.33	84.98	766.67	71.76
Ditto-FedBN-FedOpt	873.33	87.64	613.33	88.20	496.67	87.84
Ditto-FedBN-FedOpt-FT	880.00	85.71	263.33	86.00	0.00	76.75
FedEM	290.00	82.61	373.33	83.13	346.67	82.91
FedEM-FT	243.33	84.91	276.67	84.77	453.33	84.61
FedEM-FedBN	570.00	82.94	350.00	83.35	430.00	82.98
FedEM-FedBN-FT	476.67	87.09	373.33	86.56	483.33	85.37
FedEM-FedBN-FedOPT	330.00	80.48	730.00	85.01	633.33	83.87
FedEM-FedBN-FedOPT-FT	876.67	86.23	376.67	86.58	430.00	84.98
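
The convention that $T^{\prime}=0$ marks a run in which early stopping never fires can be illustrated with a simple patience-based rule. This is a sketch under assumptions: the function `convergence_round`, the patience value, and the strict-improvement criterion are hypothetical, and this section does not state whether $T^{\prime}$ records the stopping round or the best round.

```python
def convergence_round(val_scores, patience):
    """Return the round at which early stopping fires, or 0 if it never does.

    Stopping fires once `patience` consecutive rounds bring no strict
    improvement over the best score seen so far. Rounds are 1-indexed.
    """
    best = float("-inf")
    stale = 0
    for rnd, score in enumerate(val_scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return rnd
    return 0  # budget exhausted without triggering early stopping


# Hypothetical validation curve: the best score is reached at round 2.
stops = convergence_round([0.5, 0.6, 0.6, 0.59, 0.58], patience=3)
```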

Table 13: The efficiency-accuracy trade-off results including total FLOPS, communication bytes (Com.), and $\overline{Acc}$ for CIFAR10 datasets with different $\alpha$.

	CIFAR10, $\alpha=5$			CIFAR10, $\alpha=0.5$			CIFAR10, $\alpha=0.1$
	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$
FedAvg	388.27G	654.43K	73.86	384.14G	646.69K	70.82	305.15G	513.76K	56.21
FedAvg-FT	422.66G	685.52K	73.89	404.17G	654.46K	70.12	259.72G	425.4K	53.99
FedOpt	1.26T	862.74K	54.66	494.38G	833.2K	49.90	651.29G	1.06M	36.41
FedOpt-FT	1.37T	917.13K	57.33	1003.55G	1.59M	51.70	1.36T	895.36K	41.56
pFedMe	3.26T	2.07M	73.92	3.95T	918.26K	70.30	2.51T	594.98K	55.70
pFedMe-FT	3.57T	2.22M	77.73	3.57T	825.05K	72.76	4.15T	969.52K	58.65
FedBN	388.27G	540.02K	72.24	375.0G	520.8K	68.32	632.15G	350.44K	56.36
FedBN-FT	394.14G	527.19K	72.55	356.18G	475.9K	67.45	249.57G	337.61K	50.52
FedBN-FedOPT	581.98G	809.39K	71.63	581.92G	809.43K	67.13	660.49G	360.45K	57.17
FedBN-FedOPT-FT	638.27G	854.28K	72.32	288.66G	386.11K	67.72	381.03G	206.52K	49.35
Ditto	1.81T	491.25K	72.02	1.31T	685.55K	67.96	929.51G	479.8K	45.67
Ditto-FT	4.24T	1.12M	68.59	1.22T	631.15K	58.52	1001.19G	510.88K	37.79
Ditto-FedBN	2.21T	495.12K	71.70	1.87T	418.18K	62.22	940.86G	392.52K	41.59
Ditto-FedBN-FT	5.71T	1.24M	67.90	1.3T	552.87K	57.14	953.17G	392.52K	37.21
Ditto-FedBN-FedOpt	2.33T	520.77K	71.58	1.68T	552.87K	60.78	1.36T	578.53K	47.94
Ditto-FedBN-FedOpt-FT	3.09T	687.53K	68.41	1.01T	431.01K	56.66	1.18T	495.14K	38.73
FedEM	6.78T	1.41M	73.34	7.42T	1.55M	70.56	5.74T	1.21M	53.43
FedEM-FT	12.49T	1.81M	72.89	12.05T	1.74M	65.47	9.75T	1.45M	44.98
FedEM-FedBN	6.88T	1.18M	71.43	8.58T	1.46M	68.42	7.61T	741.57K	55.30
FedEM-FedBN-FT	9.91T	1.17M	70.46	11.44T	1.35M	61.43	9.01T	704.68K	43.48
FedEM-FedBN-FedOPT	11.2T	1.91M	71.43	12.79T	2.18M	67.11	10.63T	1.01M	56.87
FedEM-FedBN-FedOPT-FT	9.45T	1.12M	71.32	8.85T	1.05M	62.03	9.6T	758.5K	41.10

Table 14: The convergence results including the convergence round $T^{\prime}$ and $\overline{Acc}$ for CIFAR10 datasets with different $\alpha$. $T^{\prime}=0$ indicates that early stopping is not triggered within the large number of FL rounds, $T=1000$.

	CIFAR10, $\alpha=5$		CIFAR10, $\alpha=0.5$		CIFAR10, $\alpha=0.1$
	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$
FedAvg	280.00	73.87	276.67	70.89	213.33	57.71
FedAvg-FT	293.33	73.86	280.00	69.94	173.33	54.47
FedOpt	370.00	54.66	356.67	49.61	466.67	37.30
FedOpt-FT	393.33	57.35	696.67	51.69	383.33	42.01
pFedMe	573.33	73.74	393.33	70.63	243.33	56.46
pFedMe-FT	643.33	77.82	353.33	73.46	400.00	59.74
FedBN	280.00	72.17	270.00	68.28	173.33	57.59
FedBN-FT	273.33	72.50	246.67	67.69	166.67	51.40
FedBN-FedOPT	420.00	71.59	420.00	67.22	186.67	58.19
FedBN-FedOPT-FT	443.33	72.27	200.00	68.07	106.67	50.76
Ditto	210.00	71.97	293.33	67.65	196.67	49.77
Ditto-FT	490.00	68.57	270.00	58.63	210.00	38.97
Ditto-FedBN	256.67	71.65	216.67	62.65	203.33	43.07
Ditto-FedBN-FT	660.00	67.78	286.67	57.27	203.33	39.13
Ditto-FedBN-FedOpt	270.00	71.54	286.67	61.12	300.00	48.27
Ditto-FedBN-FedOpt-FT	356.67	68.37	223.33	57.59	256.67	40.67
FedEM	213.33	73.36	233.33	70.56	173.33	54.52
FedEM-FT	273.33	72.84	263.33	65.49	210.00	42.92
FedEM-FedBN	216.67	71.40	270.00	68.64	133.33	57.82
FedEM-FedBN-FT	216.67	70.40	250.00	62.13	126.67	43.97
FedEM-FedBN-FedOPT	353.33	71.39	403.33	67.19	186.67	58.40
FedEM-FedBN-FedOPT-FT	206.67	71.27	193.33	62.31	136.67	40.70

Table 15: The efficiency-accuracy trade-off results including total FLOPS, communication bytes (Com.), and $\overline{Acc}$ for COLA, SST-2 and Twitter datasets.

	COLA			SST-2			Twitter
	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$
FedAvg	22.15G	310.94K	71.85	954.66G	464.37K	74.88	726.45K	60.32K	62.15
FedAvg-FT	37.23G	251.93K	68.29	1.41T	629.61K	74.14	4.75M	19.26K	70.53
FedOpt	50.71G	299.14K	71.85	808.78G	393.56K	72.28	726.45K	60.32K	62.09
FedOpt-FT	58.3G	393.56K	62.59	707.21G	310.94K	65.77	4.52M	18.35K	71.08
pFedMe	208.22G	1.04M	74.40	4.66T	830.51K	71.27	603.46K	29.3K	63.45
pFedMe-FT	240.44G	357.98K	78.47	3.76T	629.4K	75.61	12.07M	15.57K	84.00
FedBN	46.07G	259.58K	71.85	954.66G	464.37K	74.88	-	-	-
FedBN-FT	40.29G	259.58K	66.71	1.06T	476.17K	68.81	-	-	-
FedBN-FedOPT	86.98G	482.11K	71.85	593.16G	270.71K	64.70	-	-	-
FedBN-FedOPT-FT	39.08G	248.45K	67.48	707.21G	292.96K	68.65	-	-	-
Ditto	133.52G	322.75K	55.46	1.43T	275.53K	52.03	2.64M	38.42K	70.23
Ditto-FT	124.81G	440.77K	72.11	1.85T	334.55K	56.49	4.2M	36.59K	69.99
Ditto-FedBN	96.29G	404.22K	70.69	1.48T	270.71K	56.03	-	-	-
Ditto-FedBN-FT	113.99G	381.97K	72.66	1.58T	270.7K	53.15	-	-	-
Ditto-FedBN-FedOpt	163.88G	370.84K	50.25	3.19T	626.86K	57.67	-	-	-
Ditto-FedBN-FedOpt-FT	148.0G	292.96K	55.01	1.58T	537.74K	52.89	-	-	-
FedEM	414.28G	801.7K	71.85	18.46T	1.55M	75.78	18.82M	95.64K	63.44
FedEM-FT	1.85T	1.02M	54.90	24.19T	1.32M	64.86	24.19M	35.27K	70.97
FedEM-FedBN	729.78G	753.83K	71.44	16.84T	1.31M	75.43	-	-	-
FedEM-FedBN-FT	1.37T	979.64K	57.62	24.19T	1.24M	64.96	-	-	-
FedEM-FedBN-FedOPT	414.28G	753.83K	71.85	13.34T	1.05M	72.25	-	-	-
FedEM-FedBN-FedOPT-FT	1.49T	1.02M	57.23	22.0T	1.12M	62.26	-	-	-

Table 16: The convergence results including the convergence round $T^{\prime}$ and $\overline{Acc}$ for COLA, SST-2 and Twitter datasets. $T^{\prime}=0$ indicates that early stopping is not triggered within a large number of FL rounds: $T=500$ for COLA and SST-2, and $T=1000$ for Twitter datasets.

	COLA		SST-2		Twitter
	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$
FedAvg	43.33	51.53	65.00	76.30	223.33	56.98
FedAvg-FT	35.00	58.74	88.33	75.36	70.67	68.45
FedOpt	41.67	57.10	55.00	73.78	220.33	59.95
FedOpt-FT	55.00	59.77	43.33	66.17	66.67	69.20
pFedMe	150.00	67.58	116.67	65.08	106.67	58.18
pFedMe-FT	50.00	69.17	88.33	74.36	53.33	78.82
FedBN	38.33	59.60	65.00	76.30	-	-
FedBN-FT	38.33	59.69	66.67	68.50	-	-
FedBN-FedOPT	71.67	59.60	40.00	65.59	-	-
FedBN-FedOPT-FT	36.67	59.10	43.33	68.42	-	-
Ditto	45.00	55.14	38.33	49.94	140.33	66.90
Ditto-FT	61.67	63.61	46.67	54.34	133.33	66.91
Ditto-FedBN	60.00	62.68	40.00	49.44	-	-
Ditto-FedBN-FT	56.67	63.58	40.00	52.18	-	-
Ditto-FedBN-FedOpt	55.00	52.48	93.33	55.61	-	-
Ditto-FedBN-FedOpt-FT	43.33	57.13	80.00	53.16	-	-
FedEM	38.33	51.52	76.67	76.53	163.33	61.70
FedEM-FT	50.00	57.80	65.00	64.29	40.67	70.19
FedEM-FedBN	38.33	57.95	68.33	75.06	-	-
FedEM-FedBN-FT	50.00	58.74	65.00	64.33	-	-
FedEM-FedBN-FedOPT	38.33	59.60	55.00	72.66	-	-
FedEM-FedBN-FedOPT-FT	53.33	56.74	58.33	58.42	-	-

Table 17: The convergence results including the convergence round $T^{\prime}$ and $\overline{Loss}$ for Movielens-1M and Movielens-10M datasets. $T^{\prime}=0$ indicates that early stopping is not triggered within the large number of FL rounds, $T=1000$.

	Movielens-1M		Movielens-10M
	$T^{\prime}$	$\overline{Loss}$	$T^{\prime}$	$\overline{Loss}$
FedAvg	360.33	0.85	470.67	0.71
FedAvg-FT	0	0.85	520.33	0.71
FedAvg-FT-FedOpt	830.0	0.86	0	0.75
FedOpt	270.0	0.84	0	0.72
FedOpt-FT	300.0	0.84	0	0.77
pFedMe	0	0.55	280.33	12.48
pFedMe-FT	470.0	0.60	840.00	0.80
Ditto	360.67	1.31	910.67	1.81
Ditto-FT	450.33	1.35	0	2.30
Ditto-FT-FedOpt	550.67	1.35	0	1.98
FedEM	700.33	0.87	0	2.37
FedEM-FT	780.00	0.87	0	0.98
FedEM-FT-FedOpt	0	0.87	0	1.88

Table 18: The efficiency-accuracy trade-off results including total FLOPS, communication bytes (Com.), and $\overline{Acc}$ for Pubmed, Cora, and Citeseer datasets.

	PUBMED			CORA			CITESEER
	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$	FLOPS	Com.	$\overline{Acc}$
FedAvg	40.41G	909.81K	87.27	2.94G	980.35K	81.30	71.06G	2.98M	75.58
FedAvg-FT	31.39G	1.79M	87.21	5.16G	885.92K	82.07	77.54G	2.39M	75.63
FedOpt	163.3G	6.92M	67.38	2.22G	744.07K	70.70	3.89G	437.18K	71.59
FedOpt-FT	24.78G	791.28K	82.36	6.27G	1.05M	82.68	25.07G	791.28K	74.34
pFedMe	45.67G	1.0M	86.91	10.02G	1.26M	83.18	193.83G	744.07K	75.30
pFedMe-FT	119.12G	862.1K	85.71	22.76G	2.09M	82.11	64.31G	460.79K	75.35
FedBN	27.53G	875.47K	88.49	2.51G	615.38K	84.13	14.01G	441.99K	75.80
FedBN-FT	35.86G	2.04M	87.45	29.13G	4.85M	76.20	28.25G	531.61K	75.07
FedBN-FedOPT	13.26G	632.72K	87.87	2.07G	736.76K	84.64	26.08G	823.45K	76.20
FedBN-FedOPT-FT	45.55G	1.04M	87.54	3.21G	615.38K	84.10	21.97G	771.43K	76.70
Ditto	46.9G	909.53K	87.27	7.27G	933.13K	83.67	79.77G	838.49K	74.79
Ditto-FT	43.73G	1.54M	87.47	8.59G	909.31K	81.47	42.34G	744.07K	76.47
Ditto-FedBN	23.72G	752.63K	88.18	8.37G	788.77K	81.38	37.51G	528.68K	75.35
Ditto-FedBN-FT	36.49G	963.57K	87.83	5.46G	424.65K	83.25	26.06G	494.0K	75.97
Ditto-FedBN-FedOpt	57.84G	821.99K	87.81	7.45G	702.08K	82.54	20.14G	407.31K	75.07
Ditto-FedBN-FedOpt-FT	49.88G	650.06K	87.60	7.92G	615.38K	82.00	35.48G	684.74K	76.42
FedEM	575.33G	3.24M	85.64	61.17G	2.01M	81.92	760.69G	8.36M	75.41
FedEM-FT	647.22G	3.04M	85.88	61.14G	2.7M	77.32	150.55G	2.22M	72.71
FedEM-FedBN	253.94G	2.07M	88.12	18.99G	1.92M	85.07	52.64G	1.82M	74.90
FedEM-FedBN-FT	275.77G	3.02M	86.38	34.58G	1.72M	84.61	89.54G	1.47M	75.29
FedEM-FedBN-FedOPT	229.51G	1.87M	87.56	108.73G	2.62M	84.68	419.47G	3.37M	76.08
FedEM-FedBN-FedOPT-FT	233.63G	2.12M	87.49	285.87G	4.92M	85.02	561.46G	3.22M	76.59

Table 19: The convergence results including the convergence round $T^{\prime}$ and $\overline{Acc}$ for Pubmed, Cora, and Citeseer datasets. $T^{\prime}=0$ indicates that early stopping is not triggered within the large number of FL rounds, $T=500$.

	PUBMED		CORA		CITESEER
	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$	$T^{\prime}$	$\overline{Acc}$
FedAvg	63.33	86.72	68.33	81.07	215.00	75.64
FedAvg-FT	128.33	86.71	61.67	81.90	171.67	75.77
FedOpt	0.00	66.69	51.67	70.17	30.00	71.60
FedOpt-FT	55.00	81.53	75.00	82.31	55.00	74.38
pFedMe	71.67	86.35	90.00	82.76	51.67	75.36
pFedMe-FT	60.00	85.47	150.00	81.98	31.67	75.40
FedBN	83.33	87.97	58.33	83.64	41.67	75.59
FedBN-FT	146.67	87.02	183.33	76.01	36.67	75.16
FedBN-FedOPT	60.00	87.43	70.00	84.11	78.33	76.33
FedBN-FedOPT-FT	101.67	87.02	58.33	83.79	73.33	76.77
Ditto	63.33	86.85	65.00	83.50	58.33	75.55
Ditto-FT	110.00	87.10	63.33	81.53	51.67	76.57
Ditto-FedBN	71.67	87.75	75.00	81.36	50.00	75.46
Ditto-FedBN-FT	91.67	87.43	40.00	82.21	46.67	76.05
Ditto-FedBN-FedOpt	78.33	87.27	66.67	82.38	38.33	75.19
Ditto-FedBN-FedOpt-FT	61.67	87.10	58.33	81.99	65.00	76.52
FedEM	78.33	85.05	48.33	81.72	203.33	75.49
FedEM-FT	73.33	85.54	65.00	78.43	53.33	72.88
FedEM-FedBN	68.33	87.63	63.33	84.45	60.00	74.29
FedEM-FedBN-FT	100.00	85.68	56.67	84.57	48.33	75.40
FedEM-FedBN-FedOPT	61.67	87.11	86.67	83.88	111.67	76.17
FedEM-FedBN-FedOPT-FT	70.00	87.16	163.33	84.40	106.67	76.74

Table 20: The efficiency-accuracy trade-off results including total FLOPS, communication bytes (Com.), and $\overline{Loss}$ for Movielens-1M and Movielens-10M datasets.

	Movielens-1M			Movielens-10M
	FLOPS	Com.	$\overline{Loss}$	FLOPS	Com.	$\overline{Loss}$
FedAvg	343.73M	108.75K	0.84	375.55G	142.58K	0.70
FedAvg-FT	995.55M	301.5K	0.84	162.79G	157.72K	0.71
FedAvg-FT-FedOpt	236.68M	250.45K	0.85	21.39G	302.91K	0.73
FedOpt	258.31M	81.62K	0.83	19.75G	302.91K	0.71
FedOpt-FT	299.3M	90.66K	0.83	52.04G	302.91K	0.74
pFedMe	763.2M	301.51K	0.54	131.49G	24.44K	13.06
pFedMe-FT	376.37M	142.35K	0.60	93.58G	254.7K	0.80
Ditto	332.94M	108.75K	1.29	72.4G	302.91K	1.84
Ditto-FT	220.62M	135.88K	1.35	242.74G	302.91K	1.69
Ditto-FT-FedOpt	269.63M	166.03K	1.36	73.26G	302.91K	2.03
FedEM	2.07G	523.1K	0.85	86.69G	135.15K	1.75
FedEM-FT	4.07G	582.82K	0.85	253.37G	269.78K	0.87
FedEM-FT-FedOpt	5.22G	746.53K	0.86	113.27G	120.19K	1.43

Table 21: Efficiency results in terms of process memory (MB) and running time (seconds). A higher value indicates that more system resources were consumed. $MEM_{avg}$ and $MEM_{peak}$ indicate the average and peak values of process-used memory respectively, and $T_{run}$ indicates the process running time. As in the tables above, bold and underlined indicate the best and second-best results among all compared methods, while red and blue indicate the best and second-best results for original methods without combination “-”.

	FEMNIST, $s=0.2$			SST-2
	$MEM_{avg}$	$MEM_{peak}$	$T_{run}$	$MEM_{avg}$	$MEM_{peak}$	$T_{run}$
Global-Train	86.56	86.56	11.00	87.23	93.37	892.00
Isolated	746.85	1351.94	1108.00	118.00	151.07	994.00
FedAvg	10506.71	17707.04	840.00	297.43	399.51	319.00
FedAvg-FT	13400.46	21752.04	1193.00	314.05	371.73	545.00
FedProx	19389.09	20729.26	1415.00	6635.17	7378.15	672.00
FedProx-FT	21209.42	22503.71	1935.00	7804.79	8160.99	1810.00
pFedMe	1880.78	1979.59	7205.00	262.67	366.15	1572.00
pFedMe-FT	18573.50	21332.41	4676.00	282.19	367.81	1080.00
HypCluster	21659.61	22386.94	2803.00	6942.08	7510.02	678.00
HypCluster-FT	22569.13	23588.59	3602.00	7624.09	8232.55	887.00
FedBN	11135.25	15406.92	1011.00	279.89	373.27	332.00
FedBN-FT	17347.31	25649.89	936.00	266.80	331.84	551.00
FedBN-FedOPT	21866.64	35335.47	2881.00	216.22	311.33	202.00
FedBN-FedOPT-FT	9827.22	14711.30	845.00	260.59	350.32	283.00
Ditto	1127.92	1268.18	4628.00	149.16	204.95	638.00
Ditto-FT	1110.87	1532.88	8915.00	228.51	292.20	599.00
Ditto-FedBN	1116.05	1227.17	4049.00	196.52	228.59	593.00
Ditto-FedBN-FT	1453.92	1730.06	8528.00	216.48	290.13	627.00
Ditto-FedBN-FedOpt	1176.76	1273.67	5782.00	202.71	237.94	560.00
Ditto-FedBN-FedOpt-FT	1299.42	1564.89	7545.00	275.08	333.58	637.00
FedEM	939.58	1063.35	5070.00	267.37	357.79	1568.00
FedEM-FT	563.77	619.02	7592.00	327.36	374.10	3210.00
FedEM-FedBN	572.14	912.14	12051.00	269.62	337.22	1615.00
FedEM-FedBN-FT	489.82	616.71	9391.00	306.51	367.51	3040.00
FedEM-FedBN-FedOPT	755.61	842.33	8977.00	251.02	343.16	1314.00
FedEM-FedBN-FedOPT-FT	608.29	638.73	19137.00	300.92	353.46	3116.00