From Optimization to Generalization: Fair Federated Learning against
Quality Shift via Inter-Client Sharpness Matching

Nannan Wu Zhuo Kuang Zengqiang Yan&Li Yu
School of Electronic Information and Communications, Huazhong University of Science and Technology
{wnn2000, kuangzhuo, z_yan, hustlyu}@hust.edu.cn Corresponding author

Abstract

Due to escalating privacy concerns, federated learning has been recognized as a vital approach for training deep neural networks with decentralized medical data. In practice, it is challenging to ensure consistent imaging quality across various institutions, often attributed to equipment malfunctions affecting a minority of clients. This imbalance in image quality can cause the federated model to develop an inherent bias towards higher-quality images, thus posing a severe fairness issue. In this study, we pioneer the identification and formulation of this new fairness challenge within the context of the imaging quality shift. Traditional methods for promoting fairness in federated learning predominantly focus on balancing empirical risks across diverse client distributions. This strategy primarily facilitates fair optimization across different training data distributions, yet neglects the crucial aspect of generalization. To address this, we introduce a solution termed Federated learning with Inter-client Sharpness Matching (FedISM). FedISM enhances both local training and global aggregation by incorporating sharpness-awareness, aiming to harmonize the sharpness levels across clients for fair generalization. Our empirical evaluations, conducted using the widely-used ICH and ISIC 2019 datasets, establish FedISM’s superiority over current state-of-the-art federated learning methods in promoting fairness. Code is available at https://github.com/wnn2000/FFL4MIA.

1 Introduction

In light of escalating concerns regarding data privacy, federated learning (FL) McMahan et al. (2017) has emerged as a promising approach for training deep neural networks in the realm of medical image analysis Dou et al. (2021). A significant challenge within FL is the inherent data heterogeneity Ye et al. (2023); Huang et al. (2022, 2023a) observed across various medical institutions, primarily attributed to their independent data collection processes. This heterogeneity has been examined from several perspectives in existing research, e.g., domain shift Li et al. (2021); Liu et al. (2021); Jiang et al. (2023b), label skew Zhang et al. (2022); Wu et al. (2023b), and label quality variation Wu et al. (2023a); Chen et al. (2023); Wu et al. (2024). Nevertheless, the prevalent issue of quality heterogeneity Fang et al. (2023) in medical imaging, a factor that could potentially raise new challenges for FL, remains under-explored.

Refer to caption — Figure 1: Imaging quality shift across clients. Most clients possess clean images, while others have corrupted images (e.g., exhibiting noise or blur).

Despite relatively rigorous protocols in medical imaging environments, uniformity in image quality across various institutions cannot be strictly assured. As illustrated in Fig. 1, images from some clients may exhibit noise due to equipment malfunctions or the necessity for low-dose imaging. Moreover, random movements from either patients or cameras in certain cases can result in motion blur. Typically, the proportion of low-quality (i.e., corrupted) images is smaller compared to high-quality (i.e., clean) images Huang et al. (2023d). In such scenarios, characterized by quality shifts across clients, FL tends to exhibit a bias towards the more-prevalent clean images, thereby compromising the performance on the less-frequent corrupted images. Such a bias can be a critical concern when applying federated models to clients with corrupted images. Therefore, it is imperative to address the issue of biased performance under quality shifts.

In this paper, we pioneer the identification and formulation of this significant challenge in FL within real-world medical contexts. In FL, especially under the mantle of privacy protection, there is an absence of prior knowledge about the imaging quality of each client, suggesting the necessity for a data-agnostic solution to this issue. We theoretically demonstrate improving performance on the poorest-quality images is equivalent to achieving client-level fairness in FL. Hence, we remodel this challenge as a problem of client-level fairness, i.e., how to ensure equitable FL performance among clients with clean and corrupted image distributions? This investigation marks the first effort to promote fair FL across clients with image quality shifts, a departure from previous research focused on fairness under domain shifts Jiang et al. (2023a) or class distribution shifts Li et al. (2020).

In this work, we observe that existing methods to foster client-level performance fairness in FL typically modify the importance weights in global aggregation to harmonize certain training metrics (e.g., loss) across clients Mohri et al. (2019); Li et al. (2020); Jiang et al. (2023a). However, as illustrated in Fig. 2, while these methods facilitate fairness regarding optimization during training, they do not necessarily ensure fairness regarding generalization during testing. This gap stems from the issue that equalizing training losses for all clients may result in a simplistic convergence towards a sharp minimum, particularly for the minority distributions (i.e., corrupted image distributions), thereby impairing generalization on testing sets. To counter this, our focus extends beyond fair optimization to fair generalization. Recent advancements in sharpness-aware minimization reveal an inverse relationship between the generalization capability and the sharpness of the loss surface Foret et al. (2021); Zhuang et al. (2022). Inspired by this finding, we introduce a novel Federated learning framework with Inter-client Sharpness Matching (FedISM), which aims to equalize the sharpness levels across clients for fair generalization. In this framework, local optimization in FL is made sharpness-aware, where each client’s local update aims to minimize sharpness on its local data. Subsequently, weights for global aggregation are determined based on each client’s sharpness level, with higher weights assigned to clients exhibiting greater sharpness. In this way, the global model update is more effective in minimizing sharpness in higher-level clients, leading to more uniform sharpness and fair generalization across both clean and corrupted image distributions as shown in Fig. 2. It is worth mentioning that FedISM involves only simple modifications to local optimization and global aggregation compared to FedAvg McMahan et al. (2017), thus eschewing complex loss functions and additional regularization and offering ease of implementation.

The contributions are summarized into three folds:

•

New Fairness Challenge in FL - Imaging Quality Shift: We identify and articulate a new challenge in FL, specifically concentrating on achieving performance fairness across clients with diverse imaging qualities.
•

Innovative Solution - FedISM: To address this challenge, our focus extends beyond previous fair optimization to fair generalization. Our proposed solution, FedISM, integrates sharpness-aware local updates with sharpness-dependent global aggregation, promoting uniform sharpness across clients and achieving more equitable generalization.
•

Extensive Validation: We validate the effectiveness of FedISM through a series of experiments on two real-world medical image classification datasets, i.e., RSNA ICH and ISIC 2019. FedISM demonstrates superior performance, outperforming several state-of-the-art methods in promoting fairness in FL.

2 Related Work

2.1 Fair Federated Learning

Fairness Huang et al. (2023c) has become a crucial topic in FL, primarily concentrating on collaborative fairness Lyu et al. (2020); Xu et al. (2021) and performance fairness Li et al. (2020). Collaborative fairness aims for the reward of each participant to be proportional to its contribution to the federation. In contrast, performance fairness advocates for unbiased performance across different devices/attributes/distributions. Existing solutions mainly achieve this goal by balancing weights of different objects to optimize fairly. For instance, q-FedAvg Li et al. (2020) addresses this by prioritizing clients that are harder to optimize; FedCE Jiang et al. (2023a) tackles this issue by considering task-specific performance. Nevertheless, fairness respecting generalization has not been considered thoroughly till now, leading to sub-optimal outcomes.

2.2 Sharpness of Loss Surface

Bridging the gap between training optimization and testing generalization remains a pivotal challenge in machine learning. Recent developments in sharpness-aware minimization suggest that models tend to generalize better on flat minima than on sharp minima Foret et al. (2021); Zhuang et al. (2022). Motivated by this insight, various studies have incorporated the concept of loss surface sharpness to address poor generalization. SharpDRO Huang et al. (2023d) merges sharpness with GroupDRO Sagawa et al. (2020) to achieve robust generalization. ImbSAM Zhou et al. (2023) focuses on enhancing the performance for tail classes in long-tailed recognition by minimizing sharpness in these classes. In addition to typical machine learning, it has recently been applied in FL Qu et al. (2022); Caldarola et al. (2022); Sun et al. (2023). However, the relationship between sharpness and fairness within FL has not yet been investigated.

3 Methodology

3.1 Preliminaries

For a typical image classification task with $C$ classes, we consider a cross-silo FL scenario with $K$ participants. Each $k$ -th participant possesses a private dataset $D_{k}=\{(\bm{x}_{i}\in\mathcal{X},y_{i}\in\mathcal{Y})\}_{i=1}^{N_{k}},k\in[K]$ , where $\mathcal{X}$ and $\mathcal{Y}=[C]$ represent the input image and label spaces, respectively. An image-label pair $(\bm{x}_{i},y_{i})$ is drawn from the client-specific distribution $\mathbb{P}_{k}(\bm{x},y\mid a_{k})$ , with $a_{k}\in[A]$ indicating an attribute only influencing image quality in client $k$ . This paper highlights the uneven distribution of clients across various quality attributes. Define $f(\cdot;\bm{\theta}):\mathcal{X}\rightarrow\Delta^{C-1}$ as a deep learning model parameterized by $\bm{\theta}$ and $\ell:\Delta^{C-1}\times\mathcal{Y}\rightarrow\mathbb{R_{+}}$ as the loss function, where $\Delta$ denotes the probability simplex. In this paper, the goal of FL is to optimize $\bm{\theta}$ for performance enhancement on images with the worst-performing quality:

\bm{\theta}^{*}=\mathop{\arg\min}\limits_{\bm{\theta}}\{\max_{a\in[A]}\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}(\bm{x},y\mid a)}[\ell(f(\bm{x};\bm{\theta}),y)]\}.

(1)

When treating each quality as a group, Eq. 1 achieves group fairness which is similar to the aim in distributionally robust optimization Sagawa et al. (2020). However, in FL, there is no prior knowledge of group information (i.e., the imaging quality in each client), making group-wise design impractical. Therefore, we achieve Eq. 1 via client fairness, as stated in the following theorem following Papadaki et al. (2022):

Theorem 1 (Equivalence).

Assuming class distributions of the testing set and all clients’ training sets are identical, we have:

$\begin{aligned} &\bm{\theta}^{*},\bm{\lambda}^{*}=\arg\mathop{\min}\limits_{\bm{\theta}}\mathop{\max}\limits_{\bm{\lambda}\in\Delta^{K-1}}\sum_{k=1}^{K}\lambda_{k}\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}_{k}(\bm{x},y\mid a_{k})}[\ell(f(\bm{x};\bm{\theta}),y)],\end{aligned}$

(2)

and

$\begin{aligned} &\bm{\theta}^{*},\bm{\mu}^{*}=\arg\mathop{\min}\limits_{\bm{\theta}}\mathop{\max}\limits_{\bm{\mu}\in\Delta^{A-1}}\sum_{u=1}^{A}\mu_{u}\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}(\bm{x},y\mid u)}[\ell(f(\bm{x};\bm{\theta}),y)],\end{aligned}$

(3)

where $\mu^{*}_{u}=\sum_{k=1}^{K}\mathds{1}_{a_{k}=u}\lambda^{*}_{k}$ .

The proof is detailed in the Appendix. Notably, even when label distribution shifts occur, they can be theoretically addressed by logit adjustment Menon et al. (2021); Zhang et al. (2022), thereby not considered to be addressed in this paper. Theorem 1 demonstrates that achieving client fairness (Eq. 2) inherently ensures group fairness (Eq. 3). Therefore, this paper subsequently concentrates on enhancing fairness across clients with various imaging qualities.

3.2 Previous Solution: Fair Optimization

Client-level unfairness in FL arises from a tendency to overlook certain clients, particularly those with limited data or outlier distributions. These clients are often overlooked because optimizing them does not significantly contribute to the overall optimization objective. To address this challenge, various approaches have been proposed. AFL Mohri et al. (2019) concentrates efforts on the worst-performing client; q-FedAvg Li et al. (2020) and FairFed Ezzeldin et al. (2023) give greater weights to those clients with higher training losses; and FedCE Jiang et al. (2023a) focuses on clients with lower task-specific metrics. These methods collectively aim to balance optimization across the empirical distribution of each client, striving for more uniform risks among clients, as represented by the following optimization problem:

\mathop{\min}\limits_{\bm{\theta}}\mathop{\max}\limits_{\bm{\lambda}\in\Delta^{K-1}}\sum_{k=1}^{K}\frac{\lambda_{k}}{N_{k}}\sum_{(\bm{x},y)\in D_{k}}\ell(f(\bm{x};\bm{\theta}),y).

(4)

Essentially, these solutions seek to achieve a balance in optimization by prioritizing clients that are performing worse. However, this strategy may result in clients with poorer performance rapidly converging to sharp minima, primarily because the rate of loss minimization is most pronounced when moving towards them. Due to the discrepancy between empirical and expected risks, this strategy to optimize fairly does not necessarily lead to client fairness in a strict sense, as denoted by Eq. 2. A conceptual illustration of this issue is presented in Fig. 2. Thus, it becomes crucial to extend the exploration of client fairness to include considerations of generalization, beyond mere optimization.

3.3 Measurement of Generalization: Sharpness

Since it is sub-optimal to measure generalization capacity by a single value of empirical risk (i.e., training loss), promoting fairness by merely seeking uniformity in such a metric is inappropriate. To tackle this problem, we should identify a new indicator more closely correlated with generalization ability.

A key limitation of using single training loss as a metric is its insensitivity to the geometric properties of the loss landscape, treating sharp and flat minima indiscriminately. To overcome this, we propose focusing on a range of loss values, specifically the sharpness of the loss surface Foret et al. (2021); Zhuang et al. (2022). It is defined as the largest loss change in the vicinity of the initial model parameters:

\mathcal{S}:=\max_{\|\bm{\epsilon}\|_{2}\leq\rho}\{\ell(f(\bm{x};\bm{\theta}+\bm{\epsilon}),y)-\ell(f(\bm{x};\bm{\theta}),y)\},

(5)

where $\rho$ is a positive step parameter controlling the search radius. Calculating sharpness as per Eq. 5 poses a challenge due to the continuous and infinite nature of the perturbation. To simplify, the difference can be approximated linearly through the Taylor series when $\rho$ is sufficiently small:

\ell(f(\bm{x};\bm{\theta}+\bm{\epsilon}),y)-\ell(f(\bm{x};\bm{\theta}),y)\approx\bm{\epsilon}^{\top}\nabla\ell(f(\bm{x};\bm{\theta}),y).

(6)

This approximation enables us to identify the optimal perturbation by maximizing the right-hand side of the equation:

\bm{\epsilon}^{*}=\mathop{\arg\max}\limits_{\|\bm{\epsilon}\|_{2}\leq\rho}\bm{\epsilon}^{\top}\nabla\ell(f(\bm{x};\bm{\theta}),y)=\rho\frac{\nabla\ell(f(\bm{x};\bm{\theta}),y)}{\|\nabla\ell(f(\bm{x};\bm{\theta}),y)\|_{2}}.

(7)

Therefore, sharpness can be computed more feasibly as:

\mathcal{S}\approx\ell(f(\bm{x};\bm{\theta}+\bm{\epsilon}^{*}),y)-\ell(f(\bm{x};\bm{\theta}),y).

(8)

Unlike a single training loss, sharpness reflects the rate of change in training loss across the loss surface. Prior research has shown that this rate of change is closely related to generalization capacity. To be specific, models generally perform better on a flat minimum (i.e., with smaller sharpness) than on a sharp minimum (i.e., with larger sharpness) Foret et al. (2021); Zhuang et al. (2022). Therefore, we select this indicator to measure generalization capacity.

3.4 Sharpness Matching for Fair Generalization

As the sharpness of the loss surface is indicative of a model’s generalization ability, we propose an FL method with inter-client sharpness matching (FedISM) to establish a uniform sharpness distribution across clients, thereby achieving fairer generalization. The overview of FedISM is depicted in Fig. 3, and details are as follows.

Our proposed method is structured around the following objective:

\mathop{\min}\limits_{\bm{\theta}}\mathop{\max}\limits_{\bm{\lambda}\in\Delta^{K-1}}\sum_{k=1}^{K}\frac{\lambda_{k}}{N_{k}}\sum_{(\bm{x},y)\in D_{k}}\mathcal{S}(f(\bm{x};\bm{\theta}),y),

(9)

where $\mathcal{S}(f(\bm{x};\bm{\theta}),y)$ represents the sharpness for a given sample and model, as defined in Eq. 8. Unlike previous solutions which focus on a uniformly-low loss (Eq. 4), our key objective is to achieve a uniformly-low sharpness across clients, thereby avoiding convergence to sharp minima while aligning empirical risks. Considering that lower sharpness typically reflects better generalization, this strategy emphasizes fairness regarding generalization more.

Algorithm 1 Pseudocode of FedAvg and our FedISM

Input: number of clients $K$ , local datasets $\{D_{1},\dots,D_{K}\}$ , local dataset size, total communication rounds $T$ , learning rate of local training $\eta$ .
Output: final global model $\bm{\theta}_{T+1}$

1: Initialize the global model

\bm{\theta}_{1}

2: for

t=1,2,\dots,T

3: for Client

k=1,2,\dots,K

in parallel do

\bm{\theta}_{(t,k)}\leftarrow\bm{\theta}_{t}

\triangleright

download the global model

5: for

(\bm{x}_{i},y_{i})\in D_{k}

6: Update

\bm{\theta}_{(t,k)}

with

(\bm{x}_{i},y_{i})

by Eq. 10

7: Update

\bm{\theta}_{(t,k)}

with

(\bm{x}_{i},y_{i})

by Eq. 11

8: end for

9: end for

10:

\bm{\theta}_{t+1}\leftarrow

Aggregate

\{\bm{\theta}_{(t,k)}\}_{k=1}^{K}

with

\bm{w}_{\texttt{Avg}}

(Eq. 12)

11:

\bm{\theta}_{t+1}\leftarrow

Aggregate

\{\bm{\theta}_{(t,k)}\}_{k=1}^{K}

with

\bm{w}

(Eq. 14)

12: end for

13: return

\bm{\theta}_{T+1}

In conventional local training in FL, e.g., FedAvg McMahan et al. (2017), the objective is to minimize the loss on local data via gradient descent:

\bm{\theta}\leftarrow\bm{\theta}-\eta\nabla\ell(f(\bm{x};\bm{\theta}),y),

(10)

with $\eta$ as the learning rate. However, it tends to find an isolated point of low loss, without aiming for a flat minimum as desired in Eq. 9. To address this, local training should incorporate sharpness-awareness. Inspired by sharpness-aware minimization Foret et al. (2021), the update rule is adapted to:

\bm{\theta}\leftarrow\bm{\theta}-\eta\nabla\ell(f(\bm{x};\bm{\theta}+\bm{\epsilon}^{*}),y),

(11)

where $\bm{\epsilon}^{*}$ represents the optimal perturbation that maximizes the change in the training loss as Eq. 7. This modification directs the optimization towards a point where the surrounding area exhibits the lowest loss, effectively minimizing the sharpness of the loss surface.

After each round of local training, gradients from all clients are transmitted to the server for global aggregation. Notably, under sharpness-aware minimization (Eq. 11), each client’s gradient is representative of the direction to minimize sharpness with respect to local data. The crux of achieving uniform sharpness across clients lies in how these gradients are fairly aggregated. Traditional FedAvg McMahan et al. (2017) assigns aggregation weights based on the quantity of data each client contributes, as defined by:

\bm{w}_{\texttt{Avg}}=\frac{1}{\sum_{k=1}^{K}N_{k}}[N_{1},N_{2},\cdots,N_{K}]^{\top}.

(12)

However, such a weighting policy may not facilitate uniform sharpness. Given that clients with corrupted images typically have less data, the global update tends to be dominated by clients with clean images. Consequently, this might only reduce the sharpness for clean clients, leading to a disparity in sharpness similar to the loss disparity addressed by fair optimization Mohri et al. (2019); Li et al. (2020); Ezzeldin et al. (2023); Jiang et al. (2023a). To solve this, we propose sharpness-aware aggregation:

\widetilde{\bm{w}}_{t}=\frac{1}{\sum_{k=1}^{K}\mathcal{S}_{k,t}^{q}}[\mathcal{S}_{1,t}^{q},\mathcal{S}_{2,t}^{q},\cdots,\mathcal{S}_{K,t}^{q}]^{\top},

(13)

where $\mathcal{S}_{k,t}$ denotes the sharpness calculated on the entire local dataset $D_{k}$ at any round $t$ by Eq. 8, and $q$ is a predetermined positive parameter. This strategy emphasizes clients with higher sharpness levels during aggregation. As $q$ increases, the aggregation process focuses more on these high-sharpness clients. In extreme cases, aggregation will exclusively focus on the client exhibiting the highest sharpness given $q\rightarrow+\infty$ . To conclude, this method ensures that the sharpness levels of clients with initially higher sharpness are primarily reduced, thus promoting a uniform sharpness distribution across all clients. In this way, it accomplishes the objective defined in Eq. 9. To maintain stability in federated training, a moving average is further employed for rounds $t>1$ , formulated as:

{\bm{w}}_{t}=\beta\widetilde{\bm{w}}_{t}+(1-\beta){\bm{w}}_{t-1}.

(14)

In particular, we set ${\bm{w}}_{1}=\widetilde{\bm{w}}_{1}$ for the first round.

It is important to note that FedISM only requires clients to share their sharpness values instead of any information regarding data distributions (e.g., imaging quality of each client), which is privacy-preserving.

To facilitate a clearer understanding, the processes of both FedISM and the conventional FedAvg McMahan et al. (2017) are summarized in Algorithm 1. Compared to FedAvg, FedISM modifies only the local optimizer and global aggregation weights, without the need to introduce complex loss functions or additional regularization techniques. Such a streamlined approach enhances the ease of implementation.

4 Experiments

We conduct a series of experiments to assess the effectiveness of FedISM. More results and discussion can be found in Appendix.

4.1 Experimental Setup

4.1.1 Datasets

Two medical datasets are used for evaluation, in line with prior FL research Jiang et al. (2022); Wu et al. (2023a):

•

RSNA ICH Flanders et al. (2020): The task is to classify each CT slice into five intracranial hemorrhage (ICH) subtypes. Following Jiang et al. (2022), we randomly select 25000 images for experiments.
•

ISIC 2019 Tschandl et al. (2018); Codella and others (2018); Combalia et al. (2019): This dataset contains 25331 images for developing models to classify eight skin diseases.

Both datasets are split into training and test sets in an 8:2 ratio and resized to $224\times 224$ pixels following the standard operation Jiang et al. (2022). For data partitioning, training sets are partitioned into 20 clients using a Dirichlet distribution (i.e., $Dir(1.0)$ ), simulating the prevalent label distribution shifts though this paper does not focus mainly on it. Imaging quality shifts are created via Gaussian noise added on a subset of clients following Hendrycks and Dietterich (2019).

4.1.2 Model

For standard evaluation, pretrained ResNet-18 He et al. (2016) is used as the base model for all experiments.

4.1.3 Implement Details

To mitigate label distribution shifts between local training and testing sets, we incorporate logit adjustment Menon et al. (2021) in the training of local models. We train batches of 32 images using the Adam optimizer, with a constant learning rate of 0.0003, beta values of (0.9, 0.999), and a weight decay of 0.0005. For FL specifics, we set a maximum of 300 communication rounds and the local epoch as 1. These parameters are consistent across all experiments to facilitate fair comparison. In FedISM, we implement GSAM Zhuang et al. (2022) for sharpness-aware minimization, with default settings of $q$ = 2.0 and $\beta$ = 0.5.

4.1.4 Evaluation Strategy

The federated model’s performance is evaluated on both an unaltered clean testing set and a generated corrupted testing set (with the identically distributed Gaussian noise in corrupted clients). We use class-balanced accuracy (ACC) and the area under the receiver operating characteristic curve (AUC) as our evaluation metrics. To ensure the robustness of results, three independent experiments are conducted, and performance is averaged over the last five communication rounds, in line with Huang et al. (2023b).

4.2 Comparison to State-of-the-Arts

To validate the superiority, we compare FedISM with several leading methods, including the basic FL approach FedAvg McMahan et al. (2017), and five advanced fair FL methods: Agnostic-FL Mohri et al. (2019), q-FedAvg Li et al. (2020), FairFed Ezzeldin et al. (2023), FedCE Jiang et al. (2023a), and FedGA Zhang et al. (2023). Comprehensive details on these methods and implementation strategies are available in the Appendix. In our experiments, 4 out of 20 clients are equipped with corrupted images, constituting a corrupted client ratio of 20%.

Tab. 1 summarizes the mean and standard deviation of ACC and AUC for both clean and corrupted images, as well as their average. Although most fair FL methods improve the performance on corrupted images, excluding the less stable Agnostic-FL, they focus primarily on fair optimization rather than generalization, which can be suboptimal. Comparatively, our FedISM, with its emphasis on sharpness minimization, achieves better generalization on corrupted images. For instance, on the ICH dataset, FedISM outperforms FedAvg and the second-best method (FedGA) by 10.85% and 3.78%, respectively. Notably, previous methods often boost the performance on corrupted images at the expense of performance degradation on clean images in this setting. For instance, FedGA’s ACC on clean ICH images falls by 4.84% compared to FedAvg. This trade-off can be problematic in medical scenarios, potentially discouraging high-quality institutions from participation in FL. In contrast, FedISM not only enhances the performance on corrupted images but also maintains top results on clean images. This balanced improvement is crucial in medical scenarios, ensuring accurate diagnostics and encouraging broader participation in FL.

Category	Method	Dataset (Corrupted Clients Ratio: 20%)
		ICH						ISIC 2019
		Clean		Corrupted		Average		Clean		Corrupted		Average
		ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$
Naive FL	FedAvg	$76.77$	$94.58$	$53.66$	$84.37$	$65.21$	$89.48$	$64.43$	$91.91$	$38.26$	$78.03$	$51.35$	$84.97$
Naive FL	(AISTATS’17)	$(0.68)$	$(0.20)$	$(1.28)$	$(0.73)$	$(0.75)$	$(0.34)$	$(1.01)$	$(0.40)$	$(1.57)$	$(1.32)$	$(0.70)$	$(0.64)$
Fair FL	Agnostic-FL	$55.13$	$83.20$	$46.69$	$78.13$	$50.91$	$80.67$	$39.66$	$78.64$	$33.55$	$75.62$	$36.60$	$77.13$
	(ICML’19)	$(7.28)$	$(4.86)$	$(7.54)$	$(4.79)$	$(2.56)$	$(1.65)$	$(12.94)$	$(9.08)$	$(8.88)$	$(6.48)$	$(3.30)$	$(2.52)$
	q-FedAvg	$75.94$	$94.36$	$59.46$	$86.81$	$67.70$	$90.58$	$65.20$	$91.59$	$44.54$	$82.88$	$54.87$	$87.24$
	(ICLR’20)	$(1.05)$	$(0.28)$	$(1.90)$	$(0.95)$	$(0.69)$	$(0.40)$	$(1.26)$	$(0.67)$	$(0.80)$	$(0.61)$	$(0.58)$	$(0.47)$
	FairFed	$74.13$	$93.55$	$60.27$	$86.74$	$67.20$	$90.15$	$60.36$	$90.29$	$49.04$	$84.86$	$54.70$	$87.58$
	(AAAI’23)	$(1.15)$	$(0.30)$	$(1.08)$	$(0.63)$	$(0.79)$	$(0.38)$	$(1.59)$	$(0.54)$	$(1.80)$	$(0.64)$	$(1.39)$	$(0.52)$
	FedCE	$75.82$	$94.31$	$58.77$	$86.93$	$67.29$	$90.62$	$62.21$	$90.37$	$45.25$	$83.37$	$53.73$	$86.87$
	(CVPR’23)	$(0.45)$	$(0.10)$	$(1.92)$	$(0.70)$	$(0.94)$	$(0.37)$	$(1.27)$	$(0.56)$	$(2.65)$	$(1.72)$	$(1.11)$	$(0.91)$
	FedGA	$71.93$	$92.88$	$60.73$	$87.45$	$66.33$	$90.16$	$59.56$	$89.53$	$48.16$	$84.64$	$53.86$	$87.08$
	(CVPR’23)	$(1.43)$	$(0.36)$	$(1.24)$	$(0.47)$	$(0.99)$	$(0.33)$	$(1.55)$	$(0.55)$	$(1.38)$	$(0.47)$	$(1.03)$	$(0.47)$
Ours	FedISM	$\bm{77.52}$	$\bm{95.02}$	$\bm{64.51}$	$\bm{89.56}$	$\bm{71.01}$	$\bm{92.29}$	$\bm{66.94}$	$\bm{93.21}$	$\bm{51.62}$	$\bm{86.89}$	$\bm{59.28}$	$\bm{90.05}$
Ours	FedISM	$\bm{(0.58)}$	$\bm{(0.16)}$	$\bm{(0.64)}$	$\bm{(0.22)}$	$\bm{(0.36)}$	$\bm{(0.13)}$	$\bm{(0.84)}$	$\bm{(0.32)}$	$\bm{(1.39)}$	$\bm{(0.49)}$	$\bm{(0.52)}$	$\bm{(0.18)}$

Table 1: Comparison of the mean (%) and standard deviation (%) for ACC and AUC against state-of-the-art methods. Values without parentheses represent the mean, while those within parentheses indicate the standard deviation.

4.3 Ablation Study

FedISM is comprised of two key components: Sharpness-Aware Local Training (SALT, Eq. 11) and Sharpness-Aware Global Aggregation (SAGA, Eqs. 13 and 14). To evaluate the effectiveness of each component, we conduct ablation studies by integrating them individually with FedAvg, with performance summaries provided in Table 2. SALT, by directing the model towards flat minima, enhances FedAvg’s performance on both clean and corrupted images. Nonetheless, this approach is not fully optimized for fairness, as it fails to address the disparity in sharpness between clean and corrupted image distributions. This is where SAGA comes into play; it helps the model prioritize distributions with poorer generalization capacity, typically those of corrupted images. As a result, the combination of both SALT and SAGA yields the best performance on corrupted images, demonstrating the comprehensive effectiveness of FedISM’s designs.

Method	ICH (Corrupted Clients Ratio: 20%)
	Clean				Crorrupted				Average
	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$
FedAvg	$76.77$	-	$94.58$	-	$53.66$	-	$84.37$	-	$65.21$	-	$89.48$	-
(AISTATS’17)	$(0.68)$	-	$(0.20)$	-	$(1.28)$	-	$(0.73)$	-	$(0.75)$	-	$(0.34)$	-
FedAvg	$\bm{79.21}$	$\bm{+2.44}$	$\bm{95.66}$	$\bm{+1.08}$	$55.29$	$+1.63$	$86.61$	$+2.24$	$67.25$	$+2.04$	$91.14$	$+1.66$
+SALT	$\bm{(0.21)}$	$\bm{+2.44}$	$\bm{(0.10)}$	$\bm{+1.08}$	$(2.05)$	$+1.63$	$(0.70)$	$+2.24$	$(1.08)$	$+2.04$	$(0.32)$	$+1.66$
FedAvg	$75.45$	$-1.32$	$93.91$	$-0.67$	$61.24$	$+7.58$	$87.46$	$+3.09$	$68.34$	$+3.13$	$90.68$	$+1.20$
+SAGA	$(1.20)$	$-1.32$	$(0.49)$	$-0.67$	$(1.81)$	$+7.58$	$(0.95)$	$+3.09$	$(0.60)$	$+3.13$	$(0.26)$	$+1.20$
FedISM	$77.52$	$+0.75$	$95.02$	$+0.44$	$\bm{64.51}$	$\bm{+10.85}$	$\bm{89.56}$	$\bm{+5.19}$	$\bm{71.01}$	$\bm{+5.80}$	$\bm{92.29}$	$\bm{+2.81}$
(Ours)	$(0.58)$	$+0.75$	$(0.16)$	$+0.44$	$\bm{(0.64)}$	$\bm{+10.85}$	$\bm{(0.22)}$	$\bm{+5.19}$	$\bm{(0.36)}$	$\bm{+5.80}$	$\bm{(0.13)}$	$\bm{+2.81}$

Table 2: Component-wise ablation study in mean (%) and standard deviation (%) in ACC and AUC. Values without parentheses represent the mean, while those within parentheses indicate the standard deviation.

\Delta

ACC and

\Delta

AUC represent the difference to ACC and AUC of FedAvg, respectively.

4.4 SALT Helps Fair Optimization

A key motivation for developing Sharpness-Aware Local Training (SALT) is to address the issue of fair optimization methods converging to sharp minima, often resulting in suboptimal performance (see Fig. 2). To validate this premise and demonstrate SALT’s effectiveness, we integrate SALT with existing fair optimization methods and analyze the impact on performance enhancement. The results, presented in Tab. 3, reveal that, aside from Agnostic-FL Mohri et al. (2019) which shows large standard deviations, other methods exhibit performance improvements when combined with SALT. It not only confirms that SALT is an effective and adaptable component for fair FL, but also validates our initial motivation for its design.

	ICH (Corrupted Clients Ratio: 20%)
	Corrupted
Method	ACC	$\Delta$ ACC $\uparrow$	AUC	$\Delta$ AUC $\uparrow$
	$53.66$		$84.37$
FedAvg	$(1.28)$	-	$(0.73)$	-
	$55.29$		$86.61$
+ SALT	$(2.05)$	$+1.63$	$(0.70)$	$+2.24$
	$46.69$		$78.13$
Agnostic-FL	$(7.54)$	-	$(4.79)$	-
	$45.91$		$78.23$
+ SALT	$(10.89)$	$-0.78$	$(6.90)$	$+0.10$
	$59.46$		$86.81$
q-FedAvg	$(1.90)$	-	$(0.95)$	-
	$61.30$		$88.66$
+ SALT	$(1.57)$	$+1.84$	$(0.51)$	$+1.85$
	$60.27$		$86.74$
FairFed	$(1.08)$	-	$(0.63)$	-
	$62.86$		$89.14$
+ SALT	$(1.08)$	$+2.59$	$(0.39)$	$+2.40$
	$58.77$		$86.93$
FedCE	$(1.92)$	-	$(0.70)$	-
	$62.34$		$88.81$
+ SALT	$(0.94)$	$+3.57$	$((0.41)$	$+1.88$
	$60.73$		$87.45$
FedGA	$(1.24)$	-	$(0.47)$	-
	$61.76$		$88.56$
+ SALT	$(0.78)$	$+1.03$	$(0.15)$	$+1.11$

Table 3: Performance enhancement in mean (%) and standard deviation (%) on corrupted images by combining SALT with existing FL for fair optimization.

4.5 Robustness to Different Imbalanced Ratios

Our experiments also demonstrate the robustness of FedISM, showcasing stable performance enhancements across varying ratios of clean and corrupted clients. Quantitative results under various ratios of clients with corrupted images through {0.1, 0.2, 0.3} are illustrated in Fig. 4. Across all ratios and on both image distributions, FedISM consistently outperforms other methods in the two evaluation metrics. Such findings further highlight FedISM’s strong adaptability and effectiveness irrespective of the corruption ratio.

4.6 Discussion on Parameters

The parameter $q$ is closely related to how strongly FedISM concentrates on clients with higher sharpness levels. We examine this impact by varying $q$ through {0.1, 0.5, 1.0, 2.0, 5.0, 10.0} and assess the performance on the ICH dataset with 20% corrupted clients, as shown in Fig. 5. For comparison, we also report the performance of FedAvg and the second-best fair FL method. As $q$ increases, FedISM increasingly focuses on clients with greater sharpness (typically those with corrupted images), which results in a decrease in performance on clean images and an improvement on corrupted images. It is important to note that using very high values of $q$ , such as 5 and 10, would slightly degrade the performance on corrupted images, likely due to training instability. Overall, FedISM demonstrates improved performance across a wide range of $q$ values, reducing the burden of parameter tuning in practice.

5 Conclusion

In this paper, we pioneer the identification and formulation of a new fairness challenge in FL, specifically concerning imaging quality shifts across clients. To address this problem, existing FL approaches have primarily concentrated on balancing the empirical risk among distinct client distributions. Despite their effectiveness for fair optimization, they often neglect the crucial aspect of fair generalization. To address this overlooked area, we introduce Federated Learning with Inter-client Sharpness Matching (FedISM). FedISM innovatively refines both local training and global aggregation by integrating sharpness-awareness, effectively harmonizing sharpness levels across clients to achieve fair generalization in FL. Extensive empirical evaluations, on the well-recognized RSNA ICH and ISIC 2019 datasets, clearly demonstrate FedISM’s superiority over current state-of-the-art FL methods in terms of promoting fairness. It highlights the effectiveness of FedISM in resolving fairness issues stemming from imaging quality shifts across medical datasets. We are confident that our proposed challenge and solution will pave the way for more equitable and effective FL system designs in medical applications and beyond.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 62202179 and Grant 62271220, and in part by the Natural Science Foundation of Hubei Province of China under Grant 2022CFB585. The computation is supported by the HPC Platform of HUST.

References

Caldarola et al. [2022] Debora Caldarola, Barbara Caputo, and Marco Ciccone. Improving generalization in federated learning by seeking flat minima. In ECCV, 2022.
Chen et al. [2023] Zhen Chen, Wuyang Li, Xiaohan Xing, and Yixuan Yuan. Medical federated learning with joint graph purification for noisy label learning. Medical Image Anal., 90:102976, 2023.
Codella and others [2018] Noel CF Codella et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). In ISBI, pages 168–172, 2018.
Combalia et al. [2019] Marc Combalia, Noel CF Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Allan C Halpern, Susana Puig, et al. BCN20000: Dermoscopic lesions in the wild. arXiv:1908.02288, 2019.
Dou et al. [2021] Qi Dou, Tiffany Y So, Meirui Jiang, Quande Liu, Varut Vardhanabhuti, Georgios Kaissis, Zeju Li, Weixin Si, Heather HC Lee, Kevin Yu, et al. Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study. npj Digit. Medicine, 4(1):60, 2021.
Ezzeldin et al. [2023] Yahya H Ezzeldin, Shen Yan, Chaoyang He, Emilio Ferrara, and A Salman Avestimehr. Fairfed: Enabling group fairness in federated learning. In AAAI, 2023.
Fang et al. [2023] Xiuwen Fang, Mang Ye, and Xiyuan Yang. Robust heterogeneous federated learning under data corruption. In ICCV, 2023.
Flanders et al. [2020] Adam E Flanders, Luciano M Prevedello, George Shih, Safwan S Halabi, Jayashree Kalpathy-Cramer, Robyn Ball, John T Mongan, Anouk Stein, Felipe C Kitamura, Matthew P Lungren, et al. Construction of a machine learning dataset through collaboration: The RSNA 2019 brain CT hemorrhage challenge. Radiology: Artificial Intelligence, 2(3), 2020.
Foret et al. [2021] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Hendrycks and Dietterich [2019] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
Huang et al. [2022] Wenke Huang, Mang Ye, and Bo Du. Learn from others and be yourself in heterogeneous federated learning. In CVPR, 2022.
Huang et al. [2023a] Wenke Huang, Mang Ye, Zekun Shi, and Bo Du. Generalizable heterogeneous federated cross-correlation and instance similarity learning. IEEE Trans. Pattern Anal. Mach. Intell., 2023.
Huang et al. [2023b] Wenke Huang, Mang Ye, Zekun Shi, He Li, and Bo Du. Rethinking federated learning with domain shift: A prototype view. In CVPR, 2023.
Huang et al. [2023c] Wenke Huang, Mang Ye, Zekun Shi, Guancheng Wan, He Li, Bo Du, and Qiang Yang. Federated learning for generalization, robustness, fairness: A survey and benchmark. arXiv:2311.06750, 2023.
Huang et al. [2023d] Zhuo Huang, Miaoxi Zhu, Xiaobo Xia, Li Shen, Jun Yu, Chen Gong, Bo Han, Bo Du, and Tongliang Liu. Robust generalization against photon-limited corruptions via worst-case sharpness minimization. In CVPR, 2023.
Jiang et al. [2022] Meirui Jiang, Hongzheng Yang, Xiaoxiao Li, Quande Liu, Pheng-Ann Heng, and Qi Dou. Dynamic bank learning for semi-supervised federated image diagnosis with class imbalance. In MICCAI, pages 196–206, 2022.
Jiang et al. [2023a] Meirui Jiang, Holger R Roth, Wenqi Li, Dong Yang, Can Zhao, Vishwesh Nath, Daguang Xu, Qi Dou, and Ziyue Xu. Fair federated medical image segmentation via client contribution estimation. In CVPR, pages 16302–16311, 2023.
Jiang et al. [2023b] Meirui Jiang, Hongzheng Yang, Chen Cheng, and Qi Dou. IOP-FL: Inside-outside personalization for federated medical image segmentation. IEEE Trans. Medical Imaging, 2023.
Li et al. [2020] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. In ICLR, 2020.
Li et al. [2021] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated learning on non-iid features via local batch normalization. In ICLR, 2021.
Liu et al. [2021] Quande Liu, Cheng Chen, Jing Qin, Qi Dou, and Pheng-Ann Heng. FedDG: Federated domain generalization on medical image segmentation via episodic learning in continuous frequency space. In CVPR, pages 1013–1023, 2021.
Lyu et al. [2020] Lingjuan Lyu, Xinyi Xu, Qian Wang, and Han Yu. Collaborative fairness in federated learning. Federated Learning: Privacy and Incentive, pages 189–204, 2020.
McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, pages 1273–1282, 2017.
Menon et al. [2021] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In ICLR, 2021.
Mohri et al. [2019] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In ICML, pages 4615–4625, 2019.
Papadaki et al. [2022] Afroditi Papadaki, Natalia Martinez, Martin Bertran, Guillermo Sapiro, and Miguel Rodrigues. Minimax demographic group fairness in federated learning. In ACM FAccT, pages 142–159, 2022.
Qu et al. [2022] Zhe Qu, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. Generalized federated learning via sharpness aware minimization. In ICML, 2022.
Sagawa et al. [2020] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR, 2020.
Sun et al. [2023] Yan Sun, Li Shen, Shixiang Chen, Liang Ding, and Dacheng Tao. Dynamic regularized sharpness aware minimization in federated learning: Approaching global consistency and smooth landscape. In ICML, 2023.
Tschandl et al. [2018] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1):1–9, 2018.
Wu et al. [2023a] Nannan Wu, Li Yu, Xuefeng Jiang, Kwang-Ting Cheng, and Zengqiang Yan. Fednoro: Towards noise-robust federated learning by addressing class imbalance and label noise heterogeneity. In IJCAI, 2023.
Wu et al. [2023b] Nannan Wu, Li Yu, Xin Yang, Kwang-Ting Cheng, and Zengqiang Yan. Fediic: Towards robust federated learning for class-imbalanced medical image classification. In MICCAI, 2023.
Wu et al. [2024] Nannan Wu, Zhaobin Sun, Zengqiang Yan, and Li Yu. Feda3i: Annotation quality-aware aggregation for federated medical image segmentation against heterogeneous annotation noise. In AAAI, 2024.
Xu et al. [2021] Xinyi Xu, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo, and Bryan Kian Hsiang Low. Gradient driven rewards to guarantee fairness in collaborative machine learning. 2021.
Ye et al. [2023] Mang Ye, Xiuwen Fang, Bo Du, Pong C Yuen, and Dacheng Tao. Heterogeneous federated learning: State-of-the-art and research challenges. ACM Computing Surveys, 56(3):1–44, 2023.
Zhang et al. [2022] Jie Zhang, Zhiqi Li, Bo Li, Jianghe Xu, Shuang Wu, Shouhong Ding, and Chao Wu. Federated learning with label distribution skew via logits calibration. In ICML, 2022.
Zhang et al. [2023] Ruipeng Zhang, Qinwei Xu, Jiangchao Yao, Ya Zhang, Qi Tian, and Yanfeng Wang. Federated domain generalization with generalization adjustment. In CVPR, 2023.
Zhou et al. [2023] Yixuan Zhou, Yi Qu, Xing Xu, and Hengtao Shen. Imbsam: A closer look at sharpness-aware minimization in class-imbalanced recognition. In ICCV, pages 11345–11355, 2023.
Zhuang et al. [2022] Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha Dvornek, Sekhar Tatikonda, James Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. In ICLR, 2022.

Appendix

This appendix is organized as follows:

•

In Section A, we summarize the mathematical notations.
•

In Section B, the theoretical proof regarding the theorem in this paper is given.
•

In Section C, experimental details and additional results are presented.

Appendix A Notation Table

Notations	Description
$C$	Class Number
$K$	Client Number
$D_{k}$	Dataset of $k$ -th Client
$\bm{x}$	Image
$y$	Label
$\mathcal{X}$	Image Space
$\mathcal{Y}$	Label Space
$N_{k}$	Size of the Local Dataset in Client $k$
$a_{k}$	Imaging Quality of $k$ -th Client
$A$	Number of Imaging Quality Categories
$f$	Neural Network
$\ell$	Loss Function
$\bm{\theta}$	Weights of the Neural Network
$\Delta$	Probability Simplex
$\mathbb{R_{+}}$	Set of Positive Real Numbers
$\bm{\lambda}$	Linear Combination Weights of Clients
$\bm{\mu}$	Linear Combination Weights of Imaging Qualities
$\mathds{1}$	Indicator Function
$\mathcal{S}$	Sharpness
$\bm{\epsilon}$	Weights Perturbation
$\rho$	Searching Distance
$\eta$	Learning Rate of Local Training
$T$	Total Communication Rounds
$\bm{w}_{\texttt{Avg}}$	Aggregation Weights for FedAvg
$\widetilde{\bm{w}}_{t}$	Aggregation Weights of FedISM in round $t$
$\bm{w}_{t}$	Aggregation Weights of FedISM in round $t$ with Moving Average
$q$	Parameter for Global Aggregation
$\beta$	Parameter for Moving Average

Table 4: Mathematical notations in this paper.

Appendix B Theoretical Proof

In this section, the proof of the theorem is given.

Theorem 1 (Equivalence).

Assuming class distributions of the testing set and all clients’ training sets are identical, we have:

(2)

and

(3)

where $\mu^{*}_{t}=\sum_{k=1}^{K}\mathds{1}_{a_{k}=u}\lambda^{*}_{k}$ .

Proof.

Eq. 2 can be re-written as:

		$\displaystyle\mathop{\max}\limits_{\bm{\lambda}\in\Delta^{K-1}}\sum_{k=1}^{K}\lambda_{k}\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}_{k}(\bm{x},y\mid a_{k})}[\ell(f(\bm{x};\bm{\theta}),y)]$
	$\displaystyle=$	$\displaystyle\mathop{\max}\limits_{\bm{\lambda}\in\Delta^{K-1}}\sum_{k=1}^{K}\lambda_{k}\sum_{u=1}^{A}\mathds{1}_{a_{k}=u}\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}_{k}(\bm{x},y\mid u)}[\ell(f(\bm{x};\bm{\theta}),y)]]$
	$\displaystyle=$	$\displaystyle\mathop{\max}\limits_{\bm{\lambda}\in\Delta^{K-1}}\sum_{u=1}^{A}(\sum_{k=1}^{K}\mathds{1}_{a_{k}=u}\lambda_{k})\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}_{k}(\bm{x},y\mid u)}[\ell(f(\bm{x};\bm{\theta}),y)]]$

We define $\mu_{u}:=\sum_{k=1}^{K}\mathds{1}_{a_{k}=u}\lambda_{k}$ and $\bm{\mu}=[\mu_{1},\dots,\mu_{A}]^{\top}$ . For each $u\in[A]$ , please note that there is at lease one client $k$ satisfying $\mathds{1}_{a_{k}=u}=1$ . Hence, we can obtain all the standard orthogonal bases of $\mathbb{R}^{A}$ space by changing the preimage $\bm{\lambda}\in\Delta^{K-1}$ . This indicates that, for any $\bm{\mu}\in\Delta^{A-1}$ , it can be acquired by the linear combination of these preimages. Therefore, we have:

		$\displaystyle\mathop{\max}\limits_{\bm{\lambda}\in\Delta^{K-1}}\sum_{u=1}^{A}(\sum_{k=1}^{K}\mathds{1}_{a_{k}=u}\lambda_{k})\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}_{k}(\bm{x},y\mid u)}[\ell(f(\bm{x};\bm{\theta}),y)]]$
	$\displaystyle=$	$\displaystyle\mathop{\max}\limits_{\bm{\mu}\in\Delta^{A-1}}\sum_{u=1}^{A}\mu_{u}\mathbb{E}_{(\bm{x},y)\sim\mathbb{P}(\bm{x},y\mid u)}[\ell(f(\bm{x};\bm{\theta}),y)]$

∎

This completes the proof of the theorem.

Method	ICH (Corrupted Clients Ratio: 30%)
	Clean				Crorrupted				Average
	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$
FedAvg	$75.56$	-	$94.36$	-	$57.14$	-	$85.95$	-	$66.35$	-	$90.16$	-
(AISTATS’17)	$(0.69)$	-	$(0.22)$	-	$(1.86)$	-	$(0.83)$	-	$(0.88)$	-	$(0.41)$	-
FedAvg	$78.57$	$+3.01$	$\bm{95.63}$	$\bm{+1.27}$	$62.16$	$+5.02$	$89.14$	$+3.19$	$70.37$	$+4.02$	$92.39$	$+2.23$
+SALT	$(0.50)$	$+3.01$	$\bm{(0.12)}$	$\bm{+1.27}$	$(0.80)$	$+5.02$	$(0.20)$	$+3.19$	$(0.53)$	$+4.02$	$(0.12)$	$+2.23$
FedAvg	$75.70$	$+0.14$	$94.08$	$-0.28$	$62.41$	$+5.27$	$88.20$	$+2.25$	$69.06$	$+2.71$	$91.14$	$+0.98$
+SAGA	$(0.70)$	$+0.14$	$(0.26)$	$-0.28$	$(1.89)$	$+5.27$	$(0.67)$	$+2.25$	$(1.01)$	$+2.71$	$(0.32)$	$+0.98$
FedISM	$\bm{78.74}$	$\bm{+3.18}$	$95.32$	$+0.96$	$\bm{66.86}$	$\bm{+9.72}$	$\bm{90.66}$	$\bm{+4.71}$	$\bm{72.80}$	$\bm{+6.45}$	$\bm{92.99}$	$\bm{+2.83}$
(Ours)	$\bm{(0.56)}$	$\bm{+3.18}$	$(0.23)$	$+0.96$	$\bm{(1.22)}$	$\bm{+9.72}$	$\bm{(0.18)}$	$\bm{+4.71}$	$\bm{(0.44)}$	$\bm{+6.45}$	$\bm{(0.10)}$	$\bm{+2.83}$

Table 5: Component-wise ablation study in mean (%) and standard deviation (%) in ACC and AUC. Values without parentheses represent the mean, while those within parentheses indicate the standard deviation.

\Delta

ACC and

\Delta

AUC represent the difference to ACC and AUC of FedAvg, respectively. The ratio of corrupted clients is 30%.

Method	ICH (Corrupted Clients Ratio: 10%)
	Clean				Crorrupted				Average
	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$	ACC $\uparrow$	$\Delta$ ACC $\uparrow$	AUC $\uparrow$	$\Delta$ AUC $\uparrow$
FedAvg	$78.17$	-	$95.26$	-	$42.72$	-	$78.26$	-	$60.44$	-	$86.76$	-
(AISTATS’17)	$(0.55)$	-	$(0.07)$	-	$(2.12)$	-	$(1.35)$	-	$(1.05)$	-	$(0.68)$	-
FedAvg	$\bm{80.04}$	$\bm{+1.90}$	$\bm{96.00}$	$\bm{+0.74}$	$46.91$	$+4.19$	$83.37$	$+5.11$	$63.48$	$+3.04$	$89.68$	$+2.92$
+SALT	$\bm{(0.34)}$	$\bm{+1.90}$	$\bm{(0.11)}$	$\bm{+0.74}$	$(1.28)$	$+4.19$	$(0.82)$	$+5.11$	$(0.67)$	$+3.04$	$(0.43)$	$+2.92$
FedAvg	$75.35$	$-2.79$	$94.16$	$-1.10$	$55.15$	$+12.43$	$84.68$	$+6.42$	$65.25$	$+4.81$	$89.42$	$+2.66$
+SAGA	$(1.03)$	$-2.79$	$(0.28)$	$-1.10$	$(2.37)$	$+12.43$	$(1.06)$	$+6.42$	$(1.26)$	$+4.81$	$(0.55)$	$+2.66$
FedISM	$78.91$	$+0.77$	$95.54$	$+0.28$	$\bm{58.84}$	$\bm{+16.12}$	$\bm{87.61}$	$\bm{+9.35}$	$\bm{68.88}$	$\bm{+8.44}$	$\bm{91.57}$	$\bm{+4.81}$
(Ours)	$(0.59)$	$+0.77$	$(0.10)$	$+0.28$	$\bm{(1.72)}$	$\bm{+16.12}$	$\bm{(0.46)}$	$\bm{+9.35}$	$\bm{(1.03)}$	$\bm{+8.44}$	$\bm{(0.25)}$	$\bm{+4.81}$

Table 6: Component-wise ablation study in mean (%) and standard deviation (%) in ACC and AUC. Values without parentheses represent the mean, while those within parentheses indicate the standard deviation.

\Delta

ACC and

\Delta

AUC represent the difference to ACC and AUC of FedAvg, respectively. The ratio of corrupted clients is 10%.

Appendix C Experiments

C.1 Details of SOTA Methods for Comparison

This section provides an overview and implementation approaches of the methods compared in Section 4.2.

•

Agnostic-FL Mohri et al. [2019]: Agnostic-FL, a pioneering work in fair FL, introduces a training strategy that focuses on updating only the poorest-performing client. This approach aims to enhance fairness, albeit with potential drawbacks such as slower convergence and model instability.
•

q-FedAvg Li et al. [2020]: q-FedAvg addresses fairness issues by incorporating training loss into the global update process. To improve model convergence and maintain consistency with other methods, we have adjusted the initial global update method by introducing a multiplicative constant. This adjustment aligns it with FedAvg, employing loss-aware aggregation weights. For parameter optimization, we experimented with different values of the parameter $q$ in q-FedAvg, specifically {0.5, 1.0, 2.0, 5.0}, and report the configuration yielding the best performance.
•

FairFed Ezzeldin et al. [2023]: FairFed aims at achieving group fairness through equitable optimization across clients. The parameter $\beta$ in FairFed is tuned from {0.1, 0.5, 1.0}.
•

FedCE Jiang et al. [2023a]: FedCE focuses on promoting fair FL specifically in the context of medical image segmentation. For the purposes of this study, we have adapted it for image classification tasks.
•

FedGA Zhang et al. [2023]: This method is designed for better domain generalization by pursuing fairness across different clients/domains.

C.2 Additional Experimental Results

C.2.1 Ablation Study

Component-wise ablation studies are conducted in two additional settings with varying ratios of clean and corrupted clients. Specifically, the ratio of corrupted clients is altered to 30% and 10%, while keeping the corruption type the same as in Tab. 2. The results are summarized in Tab. 5 and Tab. 6. Both experiments consistently indicate the same conclusion as the main text. SALT enhances generalization but does not significantly improve fairness. SAGA solely balances importance weights without directly addressing sharpness minimization during training. The optimal performance is achieved by combining these two components.

C.2.2 Discussion on Parameter $\beta$

To ensure training stability, we employ a moving average approach in our FedISM as follows:

{\bm{w}}_{t}=\beta\widetilde{\bm{w}}_{t}+(1-\beta){\bm{w}}_{t-1}.

In this study, we explore the impact of the parameter $\beta$ . We vary $\beta$ through the values {0.3, 0.5, 0.7, 0.9, 1.0} and conduct experiments on the ICH dataset, using the approach described in Sec. 4.2. The results of these experiments are depicted in Fig. 6. Additionally, we include the performance metrics of FedAvg and the second-best method for enhanced comparison. Notably, compared to the scenario where moving average is not applied (i.e., $\beta=1.0$ ), the incorporation of moving average yields improved and more consistent performance, typically characterized by a reduced standard deviation. Moreover, we observe that the performance is relatively insensitive to different values of $\beta$ when moving average is integrated.

C.2.3 Sharpness across Clients

We present the statistical information (i.e., mean and stand deviation) of clients’ sharpness as shown in Tab. 7. Our proposed method, FedISM, demonstrates a more uniform and lower sharpness across clients compared to other methods. This aligns well with our conceptual framework of inter-client sharpness matching (i.e., ensuring consistent and low sharpness levels across different clients).

Method	FedAvg	Agnostic-FL	q-FedAvg	FairFed	FedCE	FedGA	FedISM (Ours)
Mean $\downarrow$	$0.67370$	$1.01556$	$0.63291$	$1.07346$	$0.66945$	$1.06669$	$\bm{0.44165}$
Std $\downarrow$	$0.56693$	$0.65518$	$0.16827$	$0.22205$	$0.29507$	$0.28508$	$\bm{0.09966}$

Table 7: Mean and standard deviation of sharpness across clients.

C.2.4 Visualization of Loss landscape

The loss landscape under model weight perturbation is visualized as Fig. 7 following [Li et al., 2018], which effectively captures the dynamics of loss variation in the vicinity of the model’s convergence point. Traditional approaches for fair optimization primarily aim to reduce training loss across various clients, often neglecting the geometric characteristics of the loss surface. This can result in convergence at sharper minima, as illustrated by the two training columns in the figure. However, our proposed method, FedISM, diverges from this norm. By targeting uniformly low sharpness across both clean and corrupted image clients, FedISM consistently achieves convergence at flatter minima. This strategy fosters superior generalization, as evidenced by the comparatively lower testing losses (see two columns of testing).

C.2.5 Evaluation with Different Corruption Type

Different experiments in Sec. 4.2 that model corruption as Gaussian noise, we additionally generate corrupted images by random motion blur Hendrycks and Dietterich [2019]. This is also a typical corruption in imaging caused by random movements from either patients or cameras. Following Sec. 4.2, the ratio of corrupted clients is set as 20%. We summarize the results as Tab. 8, which can see the consistent superiority of our method. Here, the parameter of GSAM Zhuang et al. [2022] is tuned to enhance absolute classification performance, indicating the potential impact of sharpness calculation in our method. This may inspire the next steps in our research on sharpness definition, which may further enhance our method.

Category	Method	Dataset (Corrupted Clients Ratio: 20%)
		ICH						ISIC 2019
		Clean		Corrupted		Average		Clean		Corrupted		Average
		ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$	ACC $\uparrow$	AUC $\uparrow$
Naive FL	FedAvg	$75.97$	$94.74$	$68.69$	$90.98$	$72.33$	$92.86$	$67.16$	$92.60$	$54.98$	$88.16$	$61.07$	$90.38$
Naive FL	(AISTATS’17)	$(1.11)$	$(0.22)$	$(0.74)$	$(0.41)$	$(0.81)$	$(0.29)$	$(1.15)$	$(0.37)$	$(1.72)$	$(0.54)$	$(1.31)$	$(0.41)$
Fair FL	Agnostic-FL	$63.57$	$89.45$	$57.27$	$86.35$	$60.42$	$87.90$	$51.24$	$87.90$	$50.11$	$86.56$	$50.67$	$87.23$
	(ICML’19)	$(3.48)$	$(1.32)$	$(7.12)$	$(3.32)$	$(3.27)$	$(1.78)$	$(5.49)$	$(1.91)$	$(3.85)$	$(1.64)$	$(2.93)$	$(1.18)$
	q-FedAvg	$75.68$	$94.77$	$70.53$	$91.87$	$73.10$	$93.32$	$71.39$	$93.88$	$60.79$	$90.03$	$66.09$	$91.95$
	(ICLR’20)	$(1.59)$	$(0.34)$	$(1.18)$	$(0.53)$	$(0.54)$	$(0.17)$	$(0.73)$	$(0.30)$	$(1.38)$	$(0.39)$	$(0.96)$	$(0.30)$
	FairFed	$74.59$	$94.34$	$69.66$	$91.75$	$72.12$	$93.05$	$63.94$	$91.89$	$56.27$	$89.58$	$60.11$	$90.73$
	(AAAI’23)	$(1.12)$	$(0.21)$	$(1.08)$	$(0.30)$	$(0.66)$	$(0.18)$	$(1.58)$	$(0.48)$	$(1.59)$	$(0.58)$	$(1.48)$	$(0.49)$
	FedCE	$76.83$	$95.00$	$70.38$	$92.04$	$73.60$	$93.52$	$72.29$	$94.26$	$61.85$	$90.95$	$67.07$	$92.61$
	(CVPR’23)	$(0.83)$	$(0.15)$	$(0.72)$	$(0.28)$	$(0.73)$	$(0.17)$	$(0.64)$	$(0.28)$	$(1.28)$	$(0.40)$	$(0.74)$	$(0.31)$
	FedGA	$72.97$	$93.75$	$69.64$	$91.78$	$71.31$	$92.76$	$66.35$	$92.78$	$59.22$	$90.20$	$62.79$	$91.49$
	(CVPR’23)	$(0.78)$	$(0.16)$	$(0.60)$	$(0.27)$	$(0.54)$	$(0.16)$	$(1.53)$	$(0.45)$	$(1.46)$	$(0.36)$	$(1.31)$	$(0.34)$
Ours	FedISM	$\bm{77.86}$	$\bm{95.60}$	$\bm{72.38}$	$\bm{93.59}$	$\bm{75.12}$	$\bm{94.60}$	$\bm{73.44}$	$\bm{95.07}$	$\bm{63.45}$	$\bm{92.21}$	$\bm{68.45}$	$\bm{93.64}$
Ours	FedISM	$\bm{(0.67)}$	$\bm{(0.07)}$	$\bm{(0.75)}$	$\bm{(0.18)}$	$\bm{(0.53)}$	$\bm{(0.08)}$	$\bm{(0.88)}$	$\bm{(0.18)}$	$\bm{(0.75)}$	$\bm{(0.29)}$	$\bm{(0.68)}$	$\bm{(0.16)}$

Table 8: Comparison of the mean (%) and standard deviation (%) for ACC and AUC against state-of-the-art methods. Values without parentheses represent the mean, while those within parentheses indicate the standard deviation. The type of corruption is motion blur.

Additional References for Appendix

[Li et al., 2018] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer and Tom Goldstein. Visualizing the Loss Landscape of Neural Nets. In NeurIPS, 2018.

From Optimization to Generalization: Fair Federated Learning against Quality Shift via Inter-Client Sharpness Matching

Abstract

1 Introduction

2 Related Work

2.1 Fair Federated Learning

2.2 Sharpness of Loss Surface

3 Methodology

3.1 Preliminaries

Theorem 1 (Equivalence).

3.2 Previous Solution: Fair Optimization

3.3 Measurement of Generalization: Sharpness

3.4 Sharpness Matching for Fair Generalization

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets

4.1.2 Model

4.1.3 Implement Details

4.1.4 Evaluation Strategy

4.2 Comparison to State-of-the-Arts

4.3 Ablation Study

4.4 SALT Helps Fair Optimization

4.5 Robustness to Different Imbalanced Ratios

4.6 Discussion on Parameters

5 Conclusion

Acknowledgments

References

Appendix

Appendix A Notation Table

Appendix B Theoretical Proof

Theorem 1 (Equivalence).

Proof.

Appendix C Experiments

C.1 Details of SOTA Methods for Comparison

C.2 Additional Experimental Results

C.2.1 Ablation Study

C.2.2 Discussion on Parameter β\beta

C.2.3 Sharpness across Clients

C.2.4 Visualization of Loss landscape

C.2.5 Evaluation with Different Corruption Type

Additional References for Appendix

From Optimization to Generalization: Fair Federated Learning against
Quality Shift via Inter-Client Sharpness Matching

C.2.2 Discussion on Parameter $\beta$