Improving Fairness via Federated Learning
Abstract
Recently, many algorithms have been proposed for learning a fair classifier from decentralized data. However, many theoretical and algorithmic questions remain open. First, is federated learning necessary, i.e., can we simply train locally fair classifiers and aggregate them? In this work, we first propose a new theoretical framework, with which we demonstrate that federated learning can strictly boost model fairness compared with such non-federated algorithms. We then theoretically and empirically show that the performance tradeoff of FedAvg-based fair learning algorithms is strictly worse than that of a fair classifier trained on centralized data. To bridge this gap, we propose FedFB, a private fair learning algorithm for decentralized data. The key idea is to modify the FedAvg protocol so that it effectively mimics centralized fair learning. Our experimental results show that FedFB significantly outperforms existing approaches, sometimes matching the performance of the centrally trained model.
1 Introduction
Fair learning has recently received increasing attention. Various fairness notions have been introduced in the past few years (Dwork et al., 2012; Hardt et al., 2016; Zafar et al., 2017b, a; Kearns et al., 2018; Friedler et al., 2016). Among various fairness notions, group fairness (Hardt et al., 2016; Zafar et al., 2017a) is the most studied one; it requires the classifier to treat different groups similarly, where groups are defined with respect to sensitive attributes such as gender and race. One of the most commonly used group fairness notions is demographic parity, which requires that different groups are equally likely to receive desirable outcomes.
There has been a large amount of work in training fair classifiers (Zafar et al., 2017c; Hardt et al., 2016; Roh et al., 2021), and almost all of these studies assume that the learner has access to the entire training data. Unfortunately, this is not the case in many critical applications. For example, the government may want to train a single classifier that is fair and accurate using datasets owned by different stakeholders across the country, but the stakeholders may not be able to share their raw data with the government due to the privacy act. This precisely sets the core question that we aim to answer in this paper – how can we privately train a fair classifier on decentralized data? To answer this, consider the following approaches: Local Fair Training + Ensemble and Local Fair Training + FedAvg.
Local Fair Training + Ensemble (LFT+Ensemble)
The simplest way to learn fair classifiers from decentralized data without violating privacy policies is to train locally fair classifiers and ensemble them. We call this strategy LFT+Ensemble. It is more private than any data-sharing scheme or federated learning approach, as only the fully trained models are shared, and only once.
Local Fair Training + FedAvg (LFT+FedAvg)
LFT+FedAvg applies federated learning (Konečnỳ et al., 2017) together with existing fair learning algorithms. Federated learning is a distributed learning framework in which many data owners collaboratively train a model under the orchestration of a central server while keeping their data decentralized. For instance, under FedAvg, the de facto aggregation protocol for federated learning, the central server periodically computes a weighted average of the locally trained model parameters. If each data owner runs a fair learning algorithm on its own data and these locally trained models are aggregated via FedAvg, we call this approach Local Fair Training + FedAvg (LFT+FedAvg).
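To make the protocol concrete, the following is a minimal sketch of the FedAvg aggregation step; the container names (`local_params`, `local_sizes`) are illustrative rather than taken from the paper's implementation.

```python
# A minimal sketch of FedAvg aggregation (McMahan et al., 2017):
# average local parameter vectors with weights proportional to the
# clients' local dataset sizes.
import numpy as np

def fedavg_aggregate(local_params, local_sizes):
    """Weighted average of local model parameters."""
    total = float(sum(local_sizes))
    weights = [n / total for n in local_sizes]
    # sum() over arrays starts at 0 and accumulates element-wise
    return sum(w * p for w, p in zip(weights, local_params))
```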
Goal and Main Contributions
As LFT+FedAvg involves more than one round of communication between the clients and the central aggregator, it is reasonable to expect better performance (fairness and accuracy) from LFT+FedAvg than from LFT+Ensemble. However, no rigorous performance analysis has appeared in the literature. Motivated by this, we first study the following question:
Is federated learning needed for decentralized fair learning?

In other words, it is not clear whether there is any strict performance (fairness and accuracy) improvement from using LFT+FedAvg over LFT+Ensemble. If not, there is no reason to use LFT+FedAvg, as LFT+Ensemble is the more private scheme. Our first contribution is to theoretically compare their performances and show that federated learning is indeed necessary. Intuitively, if the data is highly heterogeneous, an ensemble of locally trained models will not be fair even if its constituent models are locally fair. See Fig. 2 for a toy example illustrating this phenomenon. On the other hand, LFT+FedAvg might find a globally fairer model by communicating frequently, of course at the cost of some privacy loss. Another natural question follows:
How good is LFT+FedAvg compared to the case where the whole data is centralized?
That is, can LFT+FedAvg achieve performance similar to Centralized Fair Learning (CFL)? If not, can we develop a better federated learning approach? Our second contribution is to identify the performance gap between LFT+FedAvg and CFL and to develop FedFB, a new federated learning algorithm that bridges the gap. Our major contributions are summarized as follows (see Fig. 1):
- We develop a theoretical framework for analyzing approaches to decentralized fair learning. Using it, we show a strict ordering among the existing approaches, i.e., under certain conditions, LFT+Ensemble ≺ LFT+FedAvg ≺ CFL (each strictly dominated by the next) with respect to their fairness-accuracy tradeoffs. To the best of our knowledge, this is the first rigorous comparison of these approaches. Our finding implies the necessity of federated learning, but at the same time, the limitation of the naïve application of FedAvg together with local fair training.
- Leveraging the state-of-the-art algorithm for (centralized) fair learning (Roh et al., 2021), we design FedFB, a novel fairness-aware federated learning algorithm.
- Via extensive experiments, we show that (1) our theoretical findings hold under more general settings, and (2) FedFB significantly outperforms the existing approaches and achieves performance similar to CFL.
2 Related work
Model Fairness
Several works have studied the fundamental tradeoff between fairness and accuracy (Menon and Williamson, 2018; Wick et al., 2019; Zhao and Gordon, 2019). Our analysis is highly inspired by that of Menon and Williamson (2018), which characterized the tradeoff between accuracy and fairness in the centralized setting. Among various fair training algorithms (Zemel et al., 2013; Jiang and Nachum, 2020; Zafar et al., 2017c, a; Hardt et al., 2016), our work is based on the current state of the art — FairBatch (Roh et al., 2021), which we detail in Sec. 4.
Federated Learning
Unlike traditional, centralized machine learning approaches, federated learning keeps the data decentralized throughout training, reducing the privacy risks involved in traditional approaches (Konečnỳ et al., 2017; McMahan et al., 2017). FedAvg (McMahan et al., 2017) is the first and most widely used federated learning algorithm. The idea is to iteratively compute a weighted average of the local model parameters, with the weights proportional to the local datasets’ sizes. Prior work (Li et al., 2020b) has shown that FedAvg provably converges under some mild conditions. The design of our proposed algorithm FedFB is also based on that of FedAvg.
Federated Fair Learning for Group Fairness
A few very recent studies (Ezzeldin et al., 2021; Rodríguez-Gálvez et al., 2021; Chu et al., 2021; Du et al., 2021; Cui et al., 2021), conducted concurrently with our work, aim at achieving group fairness under federated learning. In particular, Du et al. (2021), Rodríguez-Gálvez et al. (2021), and Chu et al. (2021) mimic centralized fair learning by exchanging information very frequently — at each local update, not at each round of local training. In contrast, FedFB requires much less frequent communication (once per round), resulting in higher privacy and lower communication costs. Moreover, Rodríguez-Gálvez et al. (2021) and Chu et al. (2021) are designed for specific group fairness notions (such as equal opportunity and accuracy parity) and cannot be applied to other definitions such as demographic parity. On the other hand, FedFB is applicable to all popular group fairness notions, including demographic parity, equalized odds, and equal opportunity. Similar to FedFB, Ezzeldin et al. (2021) employs FedAvg and a reweighting mechanism, but achieves worse performance.

We visualize a high-level comparison of LFT+Ensemble, LFT+FedAvg, CFL, FedFB and AgnosticFair (Du et al., 2021) in terms of performance and privacy in Fig. 3(a). Most of the work in the literature and our work focus on satisfying a single fairness requirement on the “global” data distribution, but there is also a recent work that aims at finding a classifier that simultaneously satisfies multiple fairness requirements, each defined with respect to each client’s “local” data (Cui et al., 2021).
Federated Fair Learning for Client Fairness
The definition of fairness used in most existing federated learning work differs slightly from the standard notion of group fairness. One popular definition in the federated setting is client parity (CP), which requires that all clients (i.e., data owners) achieve similar accuracies (or loss values). Several algorithms have been proposed to achieve this goal (Li et al., 2021, 2020a; Mohri et al., 2019; Yue et al., 2021; Zhang et al., 2020a). Even though their goal differs from ours, we show that an adapted version of FedFB achieves client parity comparable to that of the existing algorithms designed specifically for CP.
3 Analysis of LFT+Ensemble & LFT+FedAvg
In this section, we first show the necessity of federated learning by proving that LFT+FedAvg achieves strictly higher fairness than LFT+Ensemble. We then prove the limitation of LFT+FedAvg by comparing its performance with an oracle bound of CFL (non-private). These two results together imply that federated learning is necessary, but there exists a limit on what can be achieved by FedAvg-based approaches. We will present informal theoretical statements, deferring the formal versions and proofs to Sec. A.
Problem Setting
We assume there are multiple clients, each holding the same amount of data. We further assume a simple binary classification setting with a binary sensitive attribute, i.e., the input feature is continuous, the outcome is binary, and the sensitive attribute is binary. The algorithm we develop later applies to general settings.
We now introduce parameters for describing the data distribution. Each client's group-conditional feature distributions are parameterized through a strictly monotone increasing link function, and each client is further characterized by its group proportions (see Sec. A for the formal statements). Given a data sample, we consider a randomized classifier that predicts the positive class with a sample-dependent probability. Using these definitions, we define demographic parity, specialized to the binary case, as

$$P(\hat{Y}=1 \mid A=0) \;=\; P(\hat{Y}=1 \mid A=1).$$
To measure how unfair a classifier is with respect to demographic parity (DP), we define the DP disparity as

$$\mathrm{DDP} \;=\; \bigl|\,P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1)\,\bigr|.$$
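For concreteness, a minimal sketch of this quantity in code, for hard binary predictions and a binary sensitive attribute; the function and variable names are ours.

```python
# DP disparity: the gap between the two groups' positive prediction rates.
import numpy as np

def dp_disparity(y_pred, a):
    """|P(Yhat = 1 | A = 0) - P(Yhat = 1 | A = 1)|."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    return abs(y_pred[a == 0].mean() - y_pred[a == 1].mean())
```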

3.1 LFT+FedAvg strictly outperforms LFT+Ensemble
LFT+Ensemble Optimization
We now present the optimization problem solved in the LFT+Ensemble scenario. Here, for analytical tractability, we assume the population limit, i.e., the true data distribution is used in optimization. Recall that in this scenario, each client solves its own optimization problem to train a locally fair classifier. In particular, each client first sets its own local fairness level and solves the following problem:
$$\text{(LFT+Ensemble($\varepsilon_k$))}\qquad \min_{f_k}\ \mathrm{err}_k(f_k)\quad\text{s.t.}\quad \mathrm{DDP}_k(f_k)\le\varepsilon_k, \tag{1}$$

where $\mathrm{err}_k$ and $\mathrm{DDP}_k$ denote the error rate and the DP disparity on client $k$'s local distribution.
We denote by $\bar{f}_k$ the solution to LFT+Ensemble($\varepsilon_k$). We consider the following ensemble of the locally trained classifiers:

$$\bar{f}(x) \;=\; \sum_{k} w_k\, \bar{f}_k(x), \tag{2}$$

where $w_k$ is the fraction of the overall data held by client $k$ (equal across clients in our setting).
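For concreteness, a minimal sketch of the ensemble predictor in (2), treating it as a randomized mixture of the locally fair classifiers; all names here are illustrative.

```python
# LFT+Ensemble prediction: sample a client by data-size weight, then
# use that client's locally fair (possibly randomized) classifier.
import numpy as np

def ensemble_predict(local_clfs, w, x, rng=None):
    """Mixture of local classifiers with mixing weights w (sum to 1)."""
    rng = rng or np.random.default_rng()
    k = rng.choice(len(local_clfs), p=w)
    return local_clfs[k](x)
```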
Since there always exists a perfectly fair classifier (Menon and Williamson, 2018), one question we can ask is whether the ensemble $\bar{f}$ can achieve an arbitrary level of fairness. The following lemma asserts that it cannot.
Lemma 1 ((informal) Achievable fairness range of LFT+Ensemble).
Suppose the sensitive-attribute proportion is the same constant on every client. Under certain conditions, there exists a constant $C>0$ such that $\mathrm{DDP}(\bar{f}) \ge C$ for every choice of local fairness levels $\varepsilon_k$.
Lemma 1 implies that LFT+Ensemble fails to achieve perfect fairness even when the sensitive-attribute distribution is identical across the clients. Instead, Lemma 1 only requires the non-sensitive feature distribution to be highly heterogeneous across the clients. The following corollary provides a concrete example satisfying such conditions.
Corollary 2 (informal).
Suppose each client's group-conditional feature distributions are Gaussian. If one client has a much larger variance than the other, then under certain assumptions there exists a constant $C>0$ such that $\mathrm{DDP}(\bar{f}) \ge C$ for every choice of local fairness levels.
Note that the condition that one client has a much larger variance than the other contributes to the “high data heterogeneity” requirement. We provide the explicit form of the lower bound in Corollary 11 in Sec. A.2.2. Corollary 2 implies that, in a limiting Gaussian case, LFT+Ensemble cannot meet high fairness requirements. In Sec. 5.1, we numerically demonstrate that the same claim holds in more general cases. Next, we give an example satisfying the conditions of Corollary 2, whose achievable DP disparity and lower bound are visualized in Fig. 4(a).
Example 1.
Consider two Gaussian clients as in Corollary 2, with one client's variance much larger than the other's; the resulting achievable DP disparity and its lower bound are visualized in Fig. 4(a).

Remark 1.
In this work, we classify local training followed by an ensemble method as a non-federated learning algorithm. However, one may also view it as an extreme instance of federated learning, called one-shot (single round) federated learning (Guha et al., 2019). Depending on how one would classify this approach, our finding can be viewed as either the necessity of federated learning or the necessity of multi-round federated learning.
LFT+FedAvg Optimization
The classifier obtained by LFT+FedAvg can be shown to be equivalent to the solution of the following problem.
$$\text{(LFT+FedAvg)}\qquad \min_{f}\ \mathrm{err}(f)\quad\text{s.t.}\quad \mathrm{DDP}_k(f)\le\varepsilon_k\ \ \text{for every client } k. \tag{3}$$
The proof of the equivalence is provided in Sec. A.3.2. The following theorem asserts that LFT+FedAvg can achieve a strictly higher fairness than LFT+Ensemble.
Theorem 3 ((informal) Fundamental value of federated learning for fairness).
Suppose the sensitive-attribute proportion is the same constant on every client. Under certain conditions, the lowest DP disparity achievable by LFT+FedAvg is strictly smaller than that achievable by LFT+Ensemble.
Thm. 3 implies that LFT+Ensemble cannot match the highest fairness achieved by LFT+FedAvg. See Fig. 5 for the performance tradeoffs. As in Lemma 1, the technical assumptions include sufficiently high heterogeneity of the feature distribution across the clients (see Sec. A.4).
Remark 2.
Extending Thm. 3 (and the analysis of LFT+FedAvg) to the general case where the sensitive-attribute proportions differ across clients remains open. However, we conjecture that our result holds in more general cases, and we support this conjecture numerically.
Remark 3.
Thm. 3 shows that even with just two clients, a constant gap exists between the fairness performance of non-federated and federated algorithms. This phenomenon is starkly different from the well-known gain of federated learning due to increased sample size, which is generally negligible with a small number of clients. Our finding on this untapped gain in fairness can be seen as a new justification for small-scale federated learning, i.e., cross-silo settings.
3.2 LFT+FedAvg is strictly worse than CFL
We first present the optimization problem for CFL, whose achievable fairness region can serve as an upper bound on that of all other decentralized algorithms. We then show the existence of data distributions on which LFT+FedAvg achieves a strictly worse fairness performance than CFL.
CFL Optimization
We consider the same problem setting as in the last subsection. We now formulate the CFL problem as:

$$\text{(CFL($\varepsilon$))}\qquad \min_{f}\ \mathrm{err}(f)\quad\text{s.t.}\quad \mathrm{DDP}(f)\le\varepsilon. \tag{4}$$
Denote the solution to CFL($\varepsilon$) by $f^{*}_{\varepsilon}$. It is clear that CFL achieves the best accuracy-fairness tradeoff, at the cost of no privacy. The following lemma shows that there exists a distribution on which LFT+FedAvg is strictly worse than CFL when the distribution of the sensitive attribute is heterogeneous (the group proportions are not all the same across clients).
Lemma 4 ((informal) A strict gap between LFT+FedAvg and CFL).
When the group proportions differ across clients, there exists a distribution on which the performance of LFT+FedAvg is strictly worse than that of CFL.
Remark 4.
A strict gap exists for certain distributions, but not for all distributions.
3.3 Numerical Comparisons of tradeoffs
One limitation of our current theoretical results is that they only compare the maximum achievable fairness. Extending our theoretical results to fully characterize two-dimensional tradeoffs is non-trivial, so we leave it as future work. Instead, we numerically solve each of the optimization problems and visualize the tradeoff curves.
Shown in Fig. 5 are the tradeoff curves for the two-client case. We fix the group-conditional feature distributions and vary the two clients' group proportions; the gap between these proportions captures the heterogeneity of the sensitive attribute, which increases from left to right. First, one can observe that LFT+Ensemble ≺ LFT+FedAvg ≺ CFL in terms of the achievable fairness range, as predicted by our theory. Beyond our theory, we also observe an increasing gap between the tradeoff curves as the data heterogeneity increases.
4 Proposed Algorithm: FedFB
Our findings in Sec. 3 imply that federated learning is necessary, but the current FedAvg-based approach might not be the best approach. Can we design a federated learning algorithm that is strictly better than FedAvg-based approaches? In this section, we propose a new federated learning algorithm for fair learning, which we dub FedFB (short for Federated FairBatch). Our approach is based on the state-of-the-art (centralized) fair learning algorithm FairBatch (FB for short) (Roh et al., 2021) and comes with a few desirable theoretical guarantees.
Centralized FB
Let us first review how FB works in the centralized setting when demographic parity is considered. Assume there are multiple sensitive groups. Let $\theta$ be the model parameters, $\ell$ the loss function, $n_{a,y}$ the number of samples in group $a$ with label $y$, and $n_a$ the number of samples in group $a$. For each group $a$, FB maintains a reweighting coefficient $\lambda_a$ and defines a subgroup loss statistic $F_a(\theta)$, which measures how far group $a$'s loss deviates from the overall loss.
Then, we can show the following proposition.
Proposition 5 (Necessary and sufficient condition for demographic parity).
Consider the 0-1 loss $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$, where $\mathbb{1}[\cdot]$ is the indicator function. Then,

$$F_a(\theta) = 0 \ \text{ for every group } a \tag{5}$$

is a necessary and sufficient condition for demographic parity.
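To see the link between subgroup losses and positive rates concretely, note the following standard calculation for the 0-1 loss, with $\widehat{P}$ denoting empirical probability and $n_a$ the size of group $a$:

```latex
% Group a's average 0-1 loss against the constant label 1 is one minus
% its positive prediction rate:
\frac{1}{n_a}\sum_{i\,:\,a_i=a}\mathbb{1}\bigl[\hat{y}(x_i)\neq 1\bigr]
  \;=\; 1-\widehat{P}\bigl(\hat{Y}=1 \mid A=a\bigr).
```

Hence, under the reading of $F_a$ above, the condition in (5) holds exactly when all groups' positive rates agree, i.e., when demographic parity holds.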
Note that Proposition 5 allows us to measure demographic disparity using subgroup-specific losses when the 0-1 loss is used. Inspired by this observation, FB uses $F_a$ as a surrogate of the unfairness measure even for non-0-1 loss functions. With this surrogate, FB solves the following bi-level optimization.
$$\min_{\lambda}\ \text{unfairness surrogate of } \theta^{*}(\lambda), \tag{6}$$
$$\text{where}\quad \theta^{*}(\lambda) \in \arg\min_{\theta}\ \sum_{a}\lambda_a\, L_a(\theta) \tag{7}$$

and $L_a(\theta)$ denotes the empirical loss on subgroup $a$.
Here, $\lambda = (\lambda_a)_a$ are the outer optimization variables, which are adaptive coefficients on the loss from each subgroup. Given $\lambda$, the inner optimization problem minimizes a reweighted loss function. The intuition is that if one carefully chooses the coefficient of each subpopulation's loss term, the weighted risk minimization will yield a fair classifier, even though it is unconstrained.
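For concreteness, a minimal sketch of the inner problem (7) in code; `lam` (our name for $\lambda$) is a tensor of per-subgroup coefficients, and `group` holds each sample's integer subgroup id. All names are illustrative.

```python
# Reweighted empirical risk: each sample's loss is scaled by the
# coefficient of its sensitive subgroup. lam is treated as fixed here.
import torch
import torch.nn.functional as F

def reweighted_loss(model, x, y, group, lam):
    """Cross-entropy with per-subgroup coefficients lam[group]."""
    per_sample = F.cross_entropy(model(x), y, reduction="none")
    return (lam[group] * per_sample).mean()
```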
In particular, Roh et al. (2021) proposed an iterative scheme for the outer optimization problem, given in (8)-(9): at each step, the coefficients $\lambda$ are adjusted with a step size in the direction that upweights the subgroups the surrogate identifies as unfairly treated, and are then normalized so that they remain valid reweighting coefficients.
Altogether, FB alternates this update (for the outer optimization) with the inner-loop updates (e.g., via SGD) to solve the bi-level optimization problem. (Technically, the algorithm shown here is a slightly generalized version of the original FB: it works for more than two sensitive groups, while the original FB algorithm only handles binary groups. We will still call ours FB.)
FedFB: Federated Learning + FB
Recall that under the FedAvg protocol, clients periodically share their locally trained model parameters with the server. Our proposed algorithm is based on the following simple observation:
The bi-level structure of FB naturally fits the hierarchical structure of federated learning.
To see this, observe that the update equation given in (8) only requires knowledge of the global subgroup losses, each of which is, by definition, the sum of the corresponding locally computed subgroup losses across all the clients. Thus, these values can be securely collected by the central aggregator via secure aggregation. Once they are securely aggregated at the central server, the central aggregator can update the per-group coefficients using exactly the same equation (8).
This allows us to run the FB algorithm even on decentralized data. More specifically, we modify the FedAvg protocol so that each client shares not only its intermediate model parameters but also its local subgroup losses with the central coordinator. Once the central server receives the securely aggregated model parameters and subgroup losses, it performs both (1) model averaging and (2) the reweighting-coefficient update ($\lambda$). The server then broadcasts the averaged model parameters together with the updated coefficients, which are used for the subsequent round of local training with a reweighted loss function. Note that $\lambda$ is updated once per communication round.
This algorithm, which we dub FedFB, is formally described in Alg. 1. While this description is specific to demographic parity, FedFB can be used for other fairness notions, including equal opportunity, equalized odds, and client parity. The only difference is the design of the surrogate function for the unfairness measure. See Sec. B for more details.
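As a complement to Alg. 1, the following is a minimal sketch of one FedFB communication round under the protocol just described. The client interface (`local_train`, `set_params`) is hypothetical, and the $\lambda$ step shown is an illustrative stand-in for the exact FB update (8).

```python
# One FedFB round: clients run reweighted local training, the server
# averages parameters (FedAvg step), sums the securely aggregated
# subgroup losses, and updates the reweighting coefficients lam.
import numpy as np

def fedfb_round(clients, lam, alpha):
    params, losses, sizes = [], [], []
    for c in clients:
        p, subgroup_losses, n = c.local_train(lam)  # one round of reweighted SGD
        params.append(p); losses.append(subgroup_losses); sizes.append(n)
    weights = np.asarray(sizes, dtype=float) / sum(sizes)
    global_params = sum(w * p for w, p in zip(weights, params))  # FedAvg step
    global_losses = np.sum(losses, axis=0)  # sums of locally computed subgroup losses
    # Illustrative outer step (stand-in for Eq. (8)): shift weight toward
    # higher-loss subgroups, then renormalize to keep valid coefficients.
    lam = np.clip(lam + alpha * (global_losses - global_losses.mean()), 1e-8, None)
    lam = lam / lam.sum()
    for c in clients:
        c.set_params(global_params)  # broadcast averaged model; lam is sent alongside
    return global_params, lam
```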
Convergence of FedFB
The following theorem shows that FedFB converges to the optimal reweighting coefficients. The proof is based on analysis tools for federated learning and FB; see Thm. 24 for more details.
Theorem 6 ((informal) Partial convergence guarantee of FedFB).
Let $\hat{\lambda}$ be the output of FedFB. Then, under certain conditions, $\hat{\lambda}$ converges to the optimal reweighting coefficients.
Additional privacy leakage
Under FedFB, clients exchange real-valued subgroup losses with the server in addition to the model parameters. To limit the information leakage, we develop a variant of FedFB that exchanges quantized loss values. For instance, “FedFB(10bits)” means each loss value is uniformly quantized using 10 bits. Such a quantization scheme limits the amount of additional information shared in each communication round, at the cost of perturbed coefficient updates.
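A minimal sketch of such uniform quantization; the clipping range `[lo, hi]` is our assumption, since the range of the loss values is not specified here.

```python
# Uniform b-bit quantization of loss values: clip to a known range,
# encode to 2**bits - 1 levels (integers sent on the wire), dequantize
# on the server side.
import numpy as np

def quantize(values, bits=10, lo=0.0, hi=1.0):
    levels = 2 ** bits - 1
    v = np.clip(np.asarray(values, dtype=float), lo, hi)
    codes = np.round((v - lo) / (hi - lo) * levels)  # transmitted integers
    return lo + codes / levels * (hi - lo)           # recovered values
```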
5 Experiments
In this section, we numerically study the performance of LFT+Ensemble, LFT+FedAvg, and CFL in more general cases, and evaluate the empirical performance of FedFB. (Our implementation is available at https://github.com/yzeng58/Improving-Fairness-via-Federated-Learning.) In each simulation study, we report summary statistics across five replications. Similar to the experimental settings used in Roh et al. (2020), we train all algorithms using a two-layer ReLU neural network with four hidden neurons to evaluate the performance of FedFB in the non-convex case. Due to the space limit, we defer (1) logistic regression results, (2) the performance tradeoffs of FedFB as a function of the number of clients, and (3) further implementation details to Sec. C.
5.1 Limitation of LFT+Ensemble on general cases
The first experiment examines the fairness range of LFT+Ensemble under various Gaussian distributions that do not satisfy the conditions of Corollary 2. To do so, we numerically solve the optimization problems LFT+Ensemble($\varepsilon_k$); see Sec. A for more details. We conjecture that the same phenomenon holds for more general distributions, and corroborate this conjecture with numerical experiments. Shown in Fig. 4(b,c) are the numerically computed lower bounds on LFT+Ensemble's achievable fairness, for two Gaussian parameter settings that are easy to check fall outside the conditions of Corollary 2. In particular, distribution (b) corresponds to the case where the same group is favored on both clients but the groups' positive rates differ across clients, while distribution (c) represents the case where different groups are favored on the two clients. In both cases, LFT+Ensemble fails to achieve perfect fairness, i.e., its achievable DP disparity is bounded away from zero. We also observe that the lower bound is large for distribution (c). This supports our conjecture that LFT+Ensemble's fairness performance is strictly limited not only on certain data distributions but also on more general ones.
5.2 Accuracy-fairness tradeoffs of LFT+Ensemble, LFT+FedAvg and CFL
The second experiment extends the experiments conducted in Sec. 3.3. We assess the relationship between data heterogeneity and the gaps between the three fair learning scenarios with three clients. As shown in Fig. 9, LFT+FedAvg achieves a strictly worse tradeoff than CFL and a strictly higher maximum fairness level than LFT+Ensemble. The results corroborate both the benefit and the limitation of FedAvg-based federated learning in improving fairness. Interestingly, LFT+Ensemble obtains strictly higher accuracy than LFT+FedAvg. This could be attributed to the fact that the average of locally fair models might not be fair on any sub-distribution, while LFT+Ensemble at least ensures that each component of the mixture classifier is fair on some sub-distribution.
5.3 FedFB evaluation on demographic parity
Table 1: Test accuracy and DP disparity of the baselines and FedFB on four datasets.

| Private | Fair | Method | Synthetic Acc. (↑) | Synthetic DP Disp. (↓) | Adult Acc. (↑) | Adult DP Disp. (↓) | COMPAS Acc. (↑) | COMPAS DP Disp. (↓) | Bank Acc. (↑) | Bank DP Disp. (↓) |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | FedAvg | .886±.003 | .406±.009 | .829±.012 | .153±.022 | .655±.009 | .167±.037 | .898±.001 | .026±.003 |
| ✓ | ✓ | LFT+Ensemble | .727±.194 | .248±.194 | .825±.008 | .034±.028 | .620±.019 | .088±.055 | .892±.002 | .014±.006 |
| ✓ | ✓ | LFT+FedAvg | .823±.102 | .305±.131 | .801±.043 | .123±.071 | .595±.005 | .059±.009 | .893±.000 | .017±.001 |
| ✓ | ✓ | FedFB (Ours) | .725±.012 | .051±.018 | .804±.001 | .028±.001 | .606±.019 | .086±.029 | .883±.000 | .000±.000 |
| ✗ | ✓ | CFL | .726±.009 | .028±.016 | .814±.002 | .010±.004 | .616±.033 | .036±.028 | .883±.000 | .000±.000 |
We assess the empirical performance of FedFB in non-convex cases on four datasets, as well as its performance under different levels of data heterogeneity. We focus on demographic parity and report a DP disparity defined over all sensitive groups, where the number of groups may be larger than two. Note that this is slightly different from the definition used in the previous sections, which was specific to the binary sensitive attribute.
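For concreteness, a minimal sketch of one common multi-group form — the largest gap between any two groups' positive rates — which reduces to the binary definition of Sec. 3 when there are two groups; whether this matches the exact formula used here is an assumption.

```python
# One common multi-group DP disparity (an assumption, not necessarily
# the paper's exact formula): max gap between groups' positive rates.
import numpy as np

def dp_disparity_multi(y_pred, a):
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    rates = [y_pred[a == g].mean() for g in np.unique(a)]
    return max(rates) - min(rates)
```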
Baselines
We employ three types of baselines: (1) decentralized non-fair training (FedAvg); (2) decentralized fair training (LFT+Ensemble, LFT+FedAvg, FairFed, AgnosticFair); and (3) centralized fair training (CFL). Here, for a fairer comparison and better performance, LFT+Ensemble, LFT+FedAvg, and CFL are implemented based on FB, the state-of-the-art fair learning algorithm on centralized data. Note that LFT+Ensemble is the most private and CFL is the least private, while FedAvg, LFT+FedAvg, FairFed, and FedFB lie in between, as they share some statistics at each communication round without directly revealing the data. AgnosticFair is close to CFL, as it exchanges information at every local update.
Table 2: Accuracy and DP disparity under different levels of data heterogeneity.

| Method | Low Het. Acc. (↑) | Low Het. DP Disp. (↓) | Medium Het. Acc. (↑) | Medium Het. DP Disp. (↓) | High Het. Acc. (↑) | High Het. DP Disp. (↓) |
|---|---|---|---|---|---|---|
| FedFB (Ours) | .669±.040 | .058±.042 | .725±.012 | .051±.018 | .703±.012 | .013±.008 |
| CFL | .726±.009 | .028±.016 | .726±.009 | .028±.016 | .726±.009 | .028±.016 |

Datasets
(synthetic) We follow Roh et al. (2021) for data generation, but with a slight modification to make the dataset more unbalanced. To study the empirical relationship between accuracy, fairness, and data heterogeneity, we split the dataset in different ways to obtain desired levels of data heterogeneity. More details are given in Sec. C.3. (real) We use three benchmark datasets: Adult (Dua and Graff, 2017) with 48,842 samples, COMPAS (ProPublica, 2021) with 7,214 samples, and Bank (Moro et al., 2014) with 45,211 samples. We follow Du et al. (2021)’s method to preprocess and split Adult into two clients and Jiang and Nachum (2020)’s method to preprocess COMPAS and Bank. Then, we split COMPAS into two clients based on age and split Bank into three clients based on the loan decision. Note that all the datasets are split in heterogeneous ways.
Results
Table 1 reports the test accuracy and DP disparity of the four baselines and FedFB. As expected, the fairness level achieved by FedFB is close to that of CFL. Moreover, we observe poor performance from LFT+Ensemble and LFT+FedAvg, reflecting the fundamental limitations shown in Sec. 3. Table 2 reports the accuracy and fairness of each method under different levels of data heterogeneity. FedFB is observed to be robust to data heterogeneity. This agrees with our expectation, as FedFB closely mimics the operation of centralized FB, which is not affected by data heterogeneity. We make a more thorough comparison between CFL and FedFB by plotting the accuracy-fairness tradeoff curves in Fig. 3(b). To demonstrate that the performance gain does not come at the cost of privacy loss, we restrict FedFB to exchange only 10 bits of information per communication round. Fig. 3(b) shows that FedFB still performs well with quantized communication.
As suggested in Ezzeldin et al. (2021), we use logistic regression to compare our approach with FairFed and AgnosticFair. Since FairFed is only applicable to the single binary sensitive attribute case, we report performance on the synthetic and Adult datasets. Fig. 6 shows that FedFB achieves the best accuracy-fairness performance among the three methods, with a much smaller privacy cost than AgnosticFair. For a fair comparison, we provide the experimental results under the same setting as Ezzeldin et al. (2021), and a full comparison with AgnosticFair, in Sec. C.
5.4 FedFB evaluation on client parity

We evaluate the performance of FedFB in achieving client parity (CP) and compare it with the state-of-the-art algorithms for CP. We measure the CP disparity in terms of the per-client losses, where the relevant quantity for client $k$ is its loss $L_k$.
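For concreteness, a minimal sketch of such a CP disparity measure; the exact normalization is our assumption, chosen so that the measure vanishes exactly when all clients' losses are equal.

```python
# CP disparity sketch: maximal deviation of a client's loss from the
# cross-client average (zero iff all client losses coincide).
import numpy as np

def cp_disparity(client_losses):
    losses = np.asarray(client_losses, dtype=float)
    return float(np.max(np.abs(losses - losses.mean())))
```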
Baselines
We consider GIFAIR (Yue et al., 2021), q-FFL (Li et al., 2020a), Ditto (Li et al., 2021), and the unconstrained baseline FedAvg (McMahan et al., 2017). Similar to FedFB, both GIFAIR and q-FFL propose a modified aggregation protocol under which clients share some additional information with the central server, which then adjusts the objective function used for the next round of training. The key difference is that while FedFB optimizes the coefficients of the primary objective terms (i.e., sample reweighting) by solving a bi-level optimization problem, GIFAIR updates the coefficients of penalty terms, and q-FFL implicitly updates the weight on each objective term through the nonlinear behavior of polynomial functions, equivalent to the α-fairness scheme used in networking (Mo and Walrand, 2000). The Ditto algorithm combines multitask learning with federated learning to learn a personalized classifier for each client, improving the accuracy of clients with low accuracy.
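For reference, q-FFL's implicit reweighting can be written explicitly; its objective (Li et al., 2020a) raises each client's loss to a power:

```latex
\min_{w}\; f_q(w) \;=\; \sum_{k=1}^{K} \frac{p_k}{q+1}\, F_k(w)^{\,q+1},
```

where $F_k(w)$ is client $k$'s empirical loss and $p_k$ its data fraction; larger $q$ places more weight on the worst-off clients, pushing the solution toward client parity.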
Datasets
We use the same datasets as before, but split the datasets according to their sensitive attributes to simulate the same setting as assumed by GIFAIR, q-FFL, and Ditto.
Results
Fig. 7 shows that FedFB offers competitive and stable performance in mitigating model bias, especially in the high-fairness regime. Although q-FFL achieves better accuracy and fairness on the synthetic data, under strict fairness constraints FedFB and its private variant nearly achieve the highest accuracy on the other three datasets.
6 Conclusions
Summary
We have investigated how to achieve group fairness in a decentralized setting. We developed a theoretical framework for decentralized fair learning algorithms and analyzed the performance of LFT+Ensemble, LFT+FedAvg, and CFL. As a result, we showed that (1) federated learning can significantly boost model fairness even with only a handful of participating clients, and (2) FedAvg-based federated fair learning algorithms are strictly worse than the oracle upper bound of CFL. To close the gap, we proposed FedFB. Our extensive experimental results demonstrate that FedFB achieves state-of-the-art performance.
Open questions
(Theory) First, full theoretical characterization of accuracy-fairness tradeoff still remains open. Furthermore, a three-way tradeoff between accuracy, fairness, and privacy remains widely open. (Algorithm) It remains open whether or not FedFB can handle different fairness notions such as proportional fairness (Zhang et al., 2020b; Lyu et al., 2020a, b).
Acknowledgement
This work was supported in part by NSF Award DMS-2023239, NSF/Intel Partnership on Machine Learning for Wireless Networking Program under Grant No. CNS-2003129, and the Understanding and Reducing Inequalities Initiative of the University of Wisconsin-Madison, Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.
References
- Chu et al. (2021) L. Chu, L. Wang, Y. Dong, J. Pei, Z. Zhou, and Y. Zhang. Fedfair: Training fair models in cross-silo federated learning. arXiv preprint arXiv:2109.05662, 2021.
- Cui et al. (2021) S. Cui, W. Pan, J. Liang, C. Zhang, and F. Wang. Addressing algorithmic disparity and performance inconsistency in federated learning. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
- Du et al. (2021) W. Du, D. Xu, X. Wu, and H. Tong. Fairness-aware agnostic federated learning. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 181–189, 2021.
- Dua and Graff (2017) D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Dwork et al. (2012) C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, page 214–226, 2012. ISBN 9781450311151. doi: 10.1145/2090236.2090255. URL https://doi.org/10.1145/2090236.2090255.
- Ezzeldin et al. (2021) Y. H. Ezzeldin, S. Yan, C. He, E. Ferrara, and S. Avestimehr. Fairfed: Enabling group fairness in federated learning. arXiv preprint arXiv:2110.00857, 2021.
- Friedler et al. (2016) S. A. Friedler, C. Scheidegger, and S. Venkatasubramanian. On the (im)possibility of fairness, 2016.
- Guha et al. (2019) N. Guha, A. Talwalkar, and V. Smith. One-shot federated learning. arXiv preprint arXiv:1902.11175, 2019.
- Hardt et al. (2016) M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), volume 29, pages 3315–3323, 2016.
- Jiang and Nachum (2020) H. Jiang and O. Nachum. Identifying and correcting label bias in machine learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 702–712, 2020. URL http://proceedings.mlr.press/v108/jiang20a.html.
- Kearns et al. (2018) M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 2564–2572, 2018. URL http://proceedings.mlr.press/v80/kearns18a.html.
- Konečnỳ et al. (2017) J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2017.
- Li et al. (2020a) T. Li, M. Sanjabi, A. Beirami, and V. Smith. Fair resource allocation in federated learning. In International Conference on Learning Representations (ICLR), 2020a. URL https://openreview.net/forum?id=ByexElSYDr.
- Li et al. (2021) T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 6357–6368, 2021. URL http://proceedings.mlr.press/v139/li21h.html.
- Li et al. (2020b) X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations (ICLR), 2020b. URL https://openreview.net/forum?id=HJxNAnVtDS.
- Lyu et al. (2020a) L. Lyu, X. Xu, Q. Wang, and H. Yu. Collaborative fairness in federated learning. In Federated Learning, pages 189–204. Springer, 2020a.
- Lyu et al. (2020b) L. Lyu, J. Yu, K. Nandakumar, Y. Li, X. Ma, J. Jin, H. Yu, and K. Ng. Towards fair and privacy-preserving federated deep models. IEEE Transactions on Parallel & Distributed Systems, 31(11):2524–2541, 2020b. ISSN 1558-2183. doi: 10.1109/TPDS.2020.2996273.
- McMahan et al. (2017) B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 1273–1282, 2017. URL http://proceedings.mlr.press/v54/mcmahan17a.html.
- Menon and Williamson (2018) A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 107–118, 2018. URL http://proceedings.mlr.press/v81/menon18a.html.
- Mo and Walrand (2000) J. Mo and J. Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on networking, 8(5):556–567, 2000.
- Mohri et al. (2019) M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4615–4625, 2019. URL http://proceedings.mlr.press/v97/mohri19a.html.
- Moro et al. (2014) S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014. ISSN 0167-9236. doi: https://doi.org/10.1016/j.dss.2014.03.001. URL https://www.sciencedirect.com/science/article/pii/S016792361400061X.
- ProPublica (2021) ProPublica. Compas recidivism risk score data and analysis. 2021. URL https://www.propublica.org/datastore/results?q=compas.
- Rodríguez-Gálvez et al. (2021) B. Rodríguez-Gálvez, F. Granqvist, R. van Dalen, and M. Seigel. Enforcing fairness in private federated learning via the modified method of differential multipliers. arXiv preprint arXiv:2109.08604, 2021.
- Roh et al. (2020) Y. Roh, K. Lee, S. Whang, and C. Suh. FR-train: A mutual information-based approach to fair and robust training. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 8147–8157, 2020. URL http://proceedings.mlr.press/v119/roh20a.html.
- Roh et al. (2021) Y. Roh, K. Lee, S. E. Whang, and C. Suh. Fairbatch: Batch selection for model fairness. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=YNnpaAKeCfx.
- Wick et al. (2019) M. Wick, S. Panda, and J.-B. Tristan. Unlocking fairness: a trade-off revisited. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/373e4c5d8edfa8b74fd4b6791d0cf6dc-Paper.pdf.
- Yue et al. (2021) X. Yue, M. Nouiehed, and R. A. Kontar. Gifair-fl: An approach for group and individual fairness in federated learning. arXiv preprint arXiv:2108.02741, 2021.
- Zafar et al. (2017a) M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, page 1171–1180, 2017a.
- Zafar et al. (2017b) M. B. Zafar, I. Valera, M. G. Rodriguez, K. P. Gummadi, and A. Weller. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems (NIPS), NIPS’17, page 228–238, 2017b. ISBN 9781510860964.
- Zafar et al. (2017c) M. B. Zafar, I. Valera, M. G. Rogriguez, and K. P. Gummadi. Fairness Constraints: Mechanisms for Fair Classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 962–970, 2017c.
- Zemel et al. (2013) R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International conference on machine learning, pages 325–333. PMLR, 2013.
- Zhang et al. (2020a) D. Y. Zhang, Z. Kou, and D. Wang. Fairfl: A fair federated learning approach to reducing demographic bias in privacy-sensitive classification models. In 2020 IEEE International Conference on Big Data (Big Data), pages 1051–1060, 2020a. doi: 10.1109/BigData50022.2020.9378043.
- Zhang et al. (2020b) J. Zhang, C. Li, A. Robles-Kelly, and M. Kankanhalli. Hierarchically fair federated learning. arXiv preprint arXiv:2004.10386, 2020b.
- Zhao and Gordon (2019) H. Zhao and G. Gordon. Inherent tradeoffs in learning fair representations. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/b4189d9de0fb2b9cce090bd1a15e3420-Paper.pdf.
Appendix A Appendix - LFT+Ensemble, LFT+FedAvg, CFL analysis
In this section, we provide the full analysis of LFT+Ensemble, LFT+FedAvg, and CFL. For illustration purposes, we start with the two-client case and then extend the analysis to more clients in Sec. A.4. We begin with the analysis of CFL in Sec. A.1, and then analyze LFT+Ensemble and LFT+FedAvg. To be specific, in Sec. A.2 we analyze the limitation of LFT+Ensemble and present the formal version of Lemma 1 in the two-client case together with Corollary 2. In Sec. A.3, we analyze LFT+FedAvg and compare it with LFT+Ensemble and CFL, then present the formal two-client versions of Thm. 3 and Lemma 4. All the multi-client statements are included in Sec. A.4. We summarize the commonly used notation in Table 3.
Table 3 summarizes the commonly used notation: the feature, the sensitive attribute, the client index, the predicted class, the indicator function, the randomized classifier, the data distribution, the bias in each client, the unfairness (DP disparity) of a classifier, and the LFT+Ensemble classifier.
A.1 CFL analysis
In this section, we analyze the CFL classifier given by CFL($\varepsilon$). We mainly derive the solution of CFL($\varepsilon$) in Lemma 7; Lemma 8 and Lemma 9 summarize the properties of this solution. For a given classifier $f$, we define the mean difference to be the gap between the two groups' positive prediction rates,

$$\mu(f) \;=\; P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1).$$
Lemma 7.
Define the classifier in (10) by thresholding the class-probability function with group-dependent thresholds, shifted from $1/2$ by a multiplier chosen to satisfy the fairness constraint; then this classifier solves CFL($\varepsilon$). Here we denote by $\mathbb{1}[\cdot]$ the indicator function, equal to one if its argument is true and zero otherwise.
Proof.
The proof is similar to that of Menon and Williamson [2018]. To solve CFL($\varepsilon$), we first write the error rate and the fairness constraint as linear functionals of the classifier. Let $p$ denote the density of the feature distribution, and consider the joint distribution of the features and labels. Note that
(11)
and
(12)
Consequently, our goal becomes solving
(13) |
Denote the function that minimizes the error rate (ERM) by $f_{\mathrm{ERM}}$. It is easy to see that it thresholds the class-probability function at $1/2$:

$$f_{\mathrm{ERM}}(x) \;=\; \mathbb{1}\bigl[P(Y=1 \mid x) > 1/2\bigr]. \tag{14}$$
Next, consider the following three cases. In particular, we provide the proofs for the first two cases; the proof of the third case is similar to that of the second.
Case 1.
: ERM is already fair.
Case 2.
: ERM is favoring group 0 over group 1.
We will show that solving (13) is equivalent to solving an unconstrained optimization problem.
First, we prove by contradiction that the solution to (13) makes the fairness constraint tight. We use $f^{*}$ to denote the solution of (13), and suppose the claim does not hold, so the constraint is slack at $f^{*}$. To exhibit the contradiction, we construct a classifier $g$ that satisfies the fairness constraint and has a lower error rate than $f^{*}$. Let $g$ be a linear combination of $f^{*}$ and the ERM:
(15) |
where the mixing weight is chosen small enough that $g$ still satisfies the fairness constraint. Then we obtain
(16) |
Denote the error rate of a classifier by $\mathrm{err}(\cdot)$. Then the error rate of $g$ is strictly smaller:

(17)

which contradicts the optimality of $f^{*}$. Therefore, the fairness constraint is tight at $f^{*}$.
Now, solving CFL() is equivalent to solving
(18) |
Furthermore, the optimization problem above is also equivalent to
(20)
for all .
Next, our goal is to select a suitable multiplier such that the constrained optimization problem above becomes unconstrained, i.e., such that the minimizer of the unconstrained problem (20) satisfies the equality constraint (20).

Since the class-probability function takes values in $[0,1]$, thresholds outside $[0,1]$ are equivalent to thresholds at the endpoints of this interval. Therefore, the only remaining step is to find the multiplier such that the resulting classifier satisfies the constraint (20).
Now consider the mean difference between the positive rate of two groups:
(25)
Recall that the link function is strictly monotone increasing. Consequently, for an appropriate choice of the multiplier, the classifier satisfies (20). Recall also that optimization problem (20) with constraint (20) is equivalent to CFL($\varepsilon$); thus, with this choice, we obtain the solution of CFL($\varepsilon$).
Case 3.
: ERM is favoring group 1 over group 0.
Similarly to Case 2, we obtain that the solution to CFL($\varepsilon$) is
(26) |
where
(27) |
Combining all the cases above yields the desired conclusion. The proof is now complete. ∎
Remark 5.
Lemma 8.
If and are two solutions to CFL(), then almost everywhere.
For illustration purposes, we denote
(29) |
Below we summarize some useful properties of .
Lemma 9.
The sign of and are determined by .
1. If , then , and . If then and .
2. If or , then for any , we have or , respectively.
Proof.
The first property follows directly from the definition of . Next, we prove the second property.
When , we have and the first property holds. When , by the definition of , we have . When , we have . Combining all the cases above yields the desired conclusion. ∎
A.2 LFT+Ensemble analysis
With the analysis of CFL from Sec. A.1 in hand, in this section we analyze the LFT+Ensemble classifier LFT+Ensemble($\varepsilon_k$) in the two-client case. In Sec. A.2.1 we introduce notation for the LFT+Ensemble classifier following Lemma 7. In Sec. A.2.2 we analyze the limitation of the ensemble classifier as stated in Sec. 3.1; specifically, we present the two-client version of Lemma 1 and the formal version of Corollary 2, and complete their proofs. In Sec. A.2.3, we analyze the performance gap between the LFT+Ensemble classifier and the CFL classifier.
A.2.1 Problem setting
By Lemma 7, the solution to LFT+Ensemble() is
(30) |
where the associated is defined as
(31) |
and
Note that this quantity is the mean difference on the $k$-th client of a classifier of the form (28). Now, the demographic disparity of the ensemble classifier can be written as
(32) |
For ease of notation, we define the local mean difference and the local demographic disparity on the $k$-th client analogously. Since the overall distribution of the data samples is the mixture of the clients' local distributions (see the definitions in Lemma 7), the global and local quantities are related as follows:
A.2.2 Limitation of LFT+Ensemble
In this section, we analyze the limitation of LFT+Ensemble via Lemma 10, which shows that the ensemble classifier cannot achieve arbitrarily small demographic disparity in certain cases. Corollary 11 gives a specific example of Lemma 10.
Lemma 10 (Formal version of Lemma 1 in the two-client case).
Let . Let . Define as
(33) |
If the ERM favors different groups on the two clients, and each local classifier favors the same group as the local ERM when its constraint is tight, then the demographic disparity of the ensemble is bounded below by a positive constant for every choice of local fairness levels.
Proof.
Define . The goal is to show that the demographic disparity has a positive lower bound. Note that the mean difference can be expressed as
(34) |
In the following proof, we show that the mean difference cannot reach 0.
Without any loss of generality, assume . First we consider . We will discuss later. By and for all , we have and for all .
First, we will prove that LFT+Ensemble achieves its lowest mean difference when . In what follows, we consider five different cases to derive the desired result.
Case 1.
: ERM is fair on both clients.
By (31), we have . Recall is a monotone increasing function, we combine and Lemma 9 to have Applying the above conclusion yields
Case 2.
: ERM is unfair on client 0, but fair on client 1.
Applying (31) results in . By the fact that is a strictly monotone increasing function, we have Applying the above conclusion yields
(35)
Case 3.
: ERM is unfair on both client 0 and client 1.
Applying (31) we have . Then we have
(36) | ||||
(37) |
Case 4.
: ERM is fair on client 0 and very unfair on client 1.
By (31), we have . Then we obtain
(38) | ||||
(39) |
Case 5.
: ERM is fair on client 0 and unfair on client 1.
Applying (31) implies . Therefore,
(40) | ||||
(41) |
Combining all the cases above, we conclude that when for all .
Now we consider the remaining case. By the setting and the assumptions above, the relevant inequalities are reversed. Following computations similar to those above, Case 1 - Case 5 become:
Case 1. . Now we have , thus
Case 2. . Now we have , , thus
Case 3. In this case we have
Case 4. Now we have , thus
Case 5. Now we have , thus
Then we conclude the proof.
∎
Remark 6.
Note that the quantity in Lemma 10 is the smallest local demographic disparity the ERM achieves across the clients. The first condition implies that the ERM favors different groups on different clients. The second condition implies that each local classifier favors the same group as the ERM when its constraint is very tight. If these conditions hold, Lemma 10 gives a positive lower bound on the demographic disparities that LFT+Ensemble can achieve; in particular, LFT+Ensemble fails to achieve perfect demographic parity.
Among the conditions of Lemma 10, the first can be satisfied by distributions with high data heterogeneity. To demonstrate that the second can also be satisfied, we consider a limiting Gaussian case. The following corollary serves as an example satisfying the conditions of Lemma 10, and provides a more explicit expression for the lowest demographic disparity LFT+Ensemble can reach in the Gaussian case.
Corollary 11 (Formal version of Corollary 2).
Let the clients' distributions be Gaussian. Then the DP disparity is bounded below for all fairness levels if one of the following conditions holds:
1. client 0 has a much larger variance than client 1, and client 1 is favoring group 0;
2. client 1 has a much larger variance than client 0, and client 0 is favoring group 0.
Proof.
In this example, note that local mean difference function of can be written as:
(42) |
where $\Phi$ denotes the CDF of the standard Gaussian distribution.
We only provide the proof for condition 1, and the proof for condition 2 is similar. Assume condition 1 holds. By and , we have . By (42), we have and . Consequently, combining Lemma 9 and (33) yields
First we show, by taking the derivative, that the objective reaches its minimum at one of the endpoints of the relevant interval, where the endpoint value involves the smallest local demographic disparity. Then we estimate the minimum on that interval.
We take the derivative of with respect to and get
(43) |
By condition 1, we have thus and . Since these are increasing functions, we have , and thus . Therefore, reaches its extreme value at and .
Now, we evaluate
where the inequality comes from to have . And at , we have
In what follows, we will show that by proving that .
Consider and . Since , we have
Therefore, the only thing left is to show . We divide the rest of the proof into the following three cases.
Case 1.
: on client 1, the local classifier is favoring group 0 over group 1, and the positive rate of both groups are under .
Clearly, under this case, we have . We select such that . Then we have , while . Thus we get . Combining this with the intermediate value theorem yields the claim. Then we obtain
(44) |
Case 2.
: on client 1, the local classifier is favoring group 0 over group 1, and the positive rate of both groups are above .
The proof of this case is similar to that of Case 1.
Case 3.
: with respect to the local classifier trained by client 1, the positive rate of group 0 is above while that of group 1 is under .
Without any loss of generality, we assume . Select such that Clearly , while
Consequently, we get and . Then we draw the same conclusion as (44). Therefore, .
Combining all three cases above, we get . Then applying Lemma 10 we complete the proof. ∎
A.2.3 Comparison between LFT+Ensemble and CFL
In this section, we compare the performance of LFT+Ensemble and CFL. In Lemma 12 and Lemma 13, we illustrate the conditions for LFT+Ensemble to have the same performance as CFL. In Lemma 14, Lemma 15 and Lemma 16, we illustrate the scenarios when CFL outperforms LFT+Ensemble.
To make the comparison, we first introduce some additional notation. Define the accuracy of a classifier as
Given the required global demographic disparity , define the performance of as:
(45) |
and define performance of as:
Now we are able to compare the performance of LFT+Ensemble and CFL using these metrics. In particular, we will show that, under some mild conditions, the former is strictly worse, which implies that the gap between LFT+Ensemble and CFL is inevitable.
We begin with the following two lemmas, which describe cases where LFT+Ensemble matches CFL.
Lemma 12.
Proof.
The proof is straightforward. Since the CFL classifier is the optimizer of CFL($\varepsilon$), its performance is at least as good. By Lemma 7 and the form of the solution to CFL($\varepsilon$), the ensemble classifier is the solution to CFL($\varepsilon$) exactly under the stated condition. This completes the proof.
∎
Lemma 13.
Let . If the ERM is already fair, i.e., , then
Proof.
Since the ERM is already fair, the CFL solution is the ERM. Taking the local fairness levels loose enough, the ensemble classifier also equals the ERM. This proves the lemma.
∎
The next two lemmas describe cases where CFL strictly outperforms LFT+Ensemble.
Lemma 14.
Let . If , for all .
Proof.
In this proof, we only consider the case that , . The proof for or is similar. Next, we divide the proof into two cases.
Case 1.
: LFT+Ensemble cannot achieve the required global demographic disparity.
The conclusion holds.
Case 2.
: LFT+Ensemble can achieve the required global demographic disparity.
Remark 7.
Lemma 14 implies that if ERM is favoring different groups in different clients, there exists an inevitable gap between the performance of LFT+Ensemble and that of CFL.
Lemma 15.
Let . Let .
If , we have
(46) |
Proof.
Without any loss of generality, assume . Then by Lemma 9, we have for all , and .
To study the performance of LFT+Ensemble when , recall that we use to denote the local classifier trained by client in LFT+Ensemble analysis. Therefore, is the local classifier trained by client that achieves perfect local fairness.
Next, we discuss two cases to prove the result. In Case 1, we will show that , and then prove that ; in Case 2, we will show that , and then prove that when .
Case 1.
: the two local classifiers that achieve perfect local fairness are equal.
When , the conclusion holds by directly applying Lemma 13. Therefore, in what follows, we focus on the case that .
Next, we will first show that when , , which implies that .
Since , we have , and thus . Consequently, we get
By the monotonicity of , we have and . Next, we will show that, for , there also exists such that , which implies .
Consider . Select By Lemma 7 and the monotonicity of , we have . Therefore, . By Lemma 7, we get for . By the selection of , we have . Therefore, when .
Combining all the discussion above yields for all , thus we conclude for all . Consequently, the lemma holds under Case 1.
Case 2.
: the two local classifiers that achieve perfect local fairness are different.
The key idea of this proof is: when , then we can always select such that and , thus ; when , however, by Lemma 9, for all , we have , thus there exist such that and . Next, we give a rigorous proof.
Since , we have . By Lemma 7, we get . Without any loss of generality, assume , which implies and . Thus we get
Combining the inequality above and yields
(47) |
Thus we have
When , by Lemma 13, clearly we have .
For the other case , by Lemma 9 we have . Similar to Case 1, in order to achieve , we select .
When ,
Since , by Lemma 9 we have , from the monotonicity of we conclude . By Lemma 12, we have for all .
When , applying (47) we have
where the first inequality comes from Case 1. Thus, as desired, and we obtain .
Combining both cases above yields the desired conclusion. ∎
Remark 8.
Recall that we use to denote the local classifier trained by client in LFT+Ensemble analysis. In the expression of :
(48) |
The terms in (48) are the demographic disparity of the ERM and the demographic disparities of the two local classifiers. According to the proof of Lemma 15, equality holds if and only if the two local classifiers that achieve perfect local fairness are equal. Therefore, Lemma 15 implies that, if the ERM favors the same group on both clients and the two local classifiers achieving perfect local fairness are unequal, then LFT+Ensemble performs strictly worse than CFL when the required demographic disparity is smaller than a certain value.
So far we have assumed that both clients share the same group proportions. When they do not, the following lemma shows that CFL outperforms LFT+Ensemble.
Lemma 16.
Assume in client , and . Then for all .
Proof.
Given an LFT+Ensemble classifier such that , the solution reads
We prove the lemma by a contradiction argument. If , then is a solution to CFL() with . Since , by Lemma 9 we have . Without any loss of generality, assume . Below we discuss three cases.
Case 1.
: the LFT+Ensemble classifier is ERM.
Case 2.
or , and : the LFT+Ensemble classifier is not ERM, but one of the local classifiers is ERM.
Without loss of generality, let , . Then for , while for . Thus we get
Case 3.
: the local classifiers are not ERM.
When , without loss of generality, let . Then by the same argument as in Case 2, we have
When , similarly, is not a solution to CFL().
When and , since , we have
which leads to and thus . This contradicts the assumption that .
Combining all three cases yields the desired conclusion.
∎
A.3 LFT+FedAvg analysis
In this section, we analyze LFT+FedAvg for the case of two clients. For purposes of illustration, with , we denote by the solution to LFT+FedAvg. In Sec. A.3.1, we present a formal version of Thm. 3 and show that, compared to LFT+Ensemble, LFT+FedAvg achieves strictly better fairness. In Sec. A.3.2, we derive the solution to LFT+FedAvg and show that it is equivalent to minimizing a reweighted local objective and applying FedAvg. In Sec. A.3.3, we analyze the limitation of LFT+FedAvg and present a formal version of Lemma 4.
A.3.1 Improve fairness via federated learning
Unlike LFT+Ensemble, LFT+FedAvg can reach any demographic disparity:
Theorem 17 (Formal version of Thm. 3 under the two-client case).
Let . For all , there exists such that . Thus, under the condition in Lemma 10, we have
Proof.
For any , let . Then the global DP disparity becomes
(49)
∎
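For illustration, the mixing idea in the construction above can be checked numerically: the global DP disparity is a weighted mixture of per-client group statistics, so sweeping a local fairness level continuously traces out every value between the ERM disparity and 0. The sketch below is a toy instance; the client weights, group rates, and the linear interpolation between ERM and the locally fair classifiers are all assumptions made for illustration, not the exact construction of the proof.

```python
import numpy as np

# Toy check of the mixing idea in Thm. 17. All numbers are assumed.
p = np.array([0.6, 0.4])                     # client weights (assumed)
erm_rate = np.array([[0.8, 0.3],             # per-client, per-group positive
                     [0.7, 0.4]])            # rates of the unconstrained ERM
fair_rate = erm_rate.mean(axis=1, keepdims=True)  # a perfectly fair local classifier

def global_dp_disparity(t):
    # t = 0 recovers ERM, t = 1 recovers the locally fair classifiers
    rate = (1 - t) * erm_rate + t * fair_rate
    global_rate = p @ rate                   # group-wise rates of the global model
    return abs(global_rate[0] - global_rate[1])

for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"t = {t:.2f}: global DP disparity = {global_dp_disparity(t):.3f}")
```

As t increases, the printed disparity decreases continuously from the ERM value to 0, which is the intermediate-value argument underlying the theorem.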
A.3.2 The best classifier of LFT+FedAvg
For LFT+FedAvg, we directly consider multi-client cases. To visualize the gap between LFT+Ensemble, LFT+FedAvg, and CFL, in our numerical experiments we draw finitely many samples from a Gaussian distribution and then optimize the empirical risk under the fairness constraint to obtain the classifier trained by LFT+FedAvg. The following lemma provides the solution to LFT+FedAvg when is finite.
Lemma 18.
Proof.
This proof is based on Menon and Williamson [2018]. The key idea is to use the Lagrangian approach. Before applying it, we show that LFT+FedAvg is expressible as a linear program, so strong duality holds.
Since is finite, is a vector of finite dimension. Based on the proof of Lemma 7, the error rate can be written as
and the fairness constraints can be written as
(51)
for . Let , , and , for . Note that these are vectors of the same dimension as . For ease of notation, we allow to be applied to pairs of vectors in an element-wise manner. Therefore, the optimization is
(52)
(53)
(54)
which is a linear objective with linear constraints. Therefore, strong duality holds for LFT+FedAvg. Next, we apply the Lagrangian approach to solve LFT+FedAvg.
Recall that
(55)
and
(56)
By strong duality, for , the corresponding Lagrangian version of LFT+FedAvg is
(57)
(58)
Let and . Then we get
(59)
(60)
where is defined in Lemma 18. Thus the above equation reaches its minimum at
∎
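For concreteness, a minimal sketch of the linear-program view used in this proof is below, solved with scipy over a toy finite support. The support size, masses, conditional label probabilities, group assignment, and the disparity budget eps are all assumed values, and a single global DP constraint stands in for the per-client constraints of LFT+FedAvg.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 6                                        # size of the finite support (assumed)
mass = np.full(n, 1.0 / n)                   # P(X = x_i), assumed uniform
p_y1 = rng.uniform(0.2, 0.8, size=n)         # P(Y = 1 | x_i), assumed
group = np.array([0, 0, 0, 1, 1, 1])         # sensitive attribute of each x_i

# 0-1 error of a randomized classifier q (q_i = P(Yhat = 1 | x_i)) is
# const + sum_i mass_i * (1 - 2 * p_y1_i) * q_i, a linear objective in q.
c = mass * (1.0 - 2.0 * p_y1)

# DP constraint |P(Yhat=1 | g=0) - P(Yhat=1 | g=1)| <= eps is linear in q.
w = np.where(group == 0,
             mass / mass[group == 0].sum(),
             -mass / mass[group == 1].sum())
eps = 0.05                                   # disparity budget (assumed)

res = linprog(c, A_ub=np.vstack([w, -w]), b_ub=[eps, eps], bounds=[(0, 1)] * n)
print("optimal randomized classifier q:", np.round(res.x, 3))
```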
Remark 9.
Based on the proof of Lemma 18, LFT+FedAvg is equivalent to solving
(61)
Under certain conditions (Assumptions 1 to 4 in Li et al. [2020b]), solving LFT+FedAvg is equivalent to minimizing
locally and applying FedAvg (Theorem 1 in Li et al. [2020b]).
A.3.3 Comparison of LFT+FedAvg and CFL
Lemma 19 (Formal version of Lemma 4 under the two-client case).
When and , we have
Proof.
Since , the constraints in LFT+FedAvg vanish. When , the constraint in CFL() always holds and thus also vanishes. Thus, in this scenario, the solution to LFT+FedAvg becomes . Then, from the assumption, we have
∎
A.4 Extension to multi-client cases
In this subsection, we analyze LFT+Ensemble and LFT+FedAvg in the multi-client case. We present more general versions of Lemma 10, Thm. 17, and Lemma 19.
The following lemma shows the fundamental limitation of LFT+Ensemble:
Lemma 20 (Formal version of Lemma 1).
Let . Consider a partition that divides the clients into two subsets. Denote the mixture distribution of each subset as , where is the index of the subset and is a distribution, for . Similar to the two-client case, define . Consider the case that for all and . Denote the proportions of the two subsets as and , where and . Let . Define as
(62)
If there exists a partition such that and for all , then for all , .
Proof.
By Lemma 10, the achievable fairness range of LFT+Ensemble is strictly smaller than that of CFL. Pooling the datasets within a subset and performing fair learning can clearly achieve a wider range of fairness than performing fair learning on each client individually. Thus, we can treat the two subsets as two clients with uneven amounts of data, which is almost exactly the setting Lemma 10 considers. We therefore follow the same proof idea as in Lemma 10.
Denote the assembled classifier trained from the two pooled datasets as . Note that the mean difference can be expressed as
(63)
In the following, we show that the mean difference cannot reach 0.
Without loss of generality, assume and . By and for all , we have and for all . Without loss of generality, assume .
First, we prove that LFT+Ensemble achieves its lowest mean difference when . In what follows, we consider five cases to derive the desired result.
Case 1.
: ERM is fair on both clients.
By (31), we have . Recall that is a monotonically increasing function; combining with Lemma 9, we thus have . Applying the above conclusion yields
Case 2.
: ERM is unfair on client 0, but fair on client 1.
Applying (31) gives . Since is a strictly monotonically increasing function, we have . Applying the above conclusion yields
(64)
In the first equality we used .
Case 3.
: ERM is unfair on both client 0 and client 1.
Applying (31), we have . Then we have
(65)
(66)
Case 4.
: ERM is fair on client 0 and very unfair on client 1.
By (31), we have . Then we obtain
(67)
(68)
Case 5.
: ERM is fair on client 0 and unfair on client 1.
Applying (31) implies . Therefore,
(69)
(70)
Combining all the cases above, we conclude that when , for all .
∎
Remark 10.
Based on the proof above, we conclude that the fundamental limitation of LFT+Ensemble persists in the multi-client case.
The following theorem shows that LFT+FedAvg can reach any DP disparity:
Theorem 21 (Generalized version of Thm. 17).
Let for all . For all , let for all ; then . Thus, under the condition in Lemma 20, we have
Proof.
When , the global DP disparity becomes
(71)
∎
The following lemma shows the limitation of LFT+FedAvg:
Lemma 22 (Generalized version of Lemma 19).
Let or for all . When , we have
Proof.
Since or , the constraints in (LFT+FedAvg) vanish. When , the constraint in CFL() always holds and thus also vanishes. Thus, in this scenario, the solution to (LFT+FedAvg) becomes . Then, from the assumption, we have
∎
Appendix B Appendix - FedFB analysis and algorithm description
In this section, we provide our bi-level optimization formulation for FedFB for four fairness notions: demographic parity, equal opportunity, equalized odds, and client parity, and design the corresponding update rules. This development can also be applied to the centralized case. We then provide more details on how we incorporate FB into federated learning.
To explain how to optimize the weights of different groups, we first introduce some necessary notation. Recall that in the setting of Sec. 4 we assume sensitive attributes; here we further assume clients. Each sample then has attributes , where is the input feature, is the label, is the sensitive attribute, and is the index of the client the sample belongs to. We denote by the number of samples belonging to client with label and sensitive attribute , and we define the local version of as .
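As a small illustration of this bookkeeping, the sketch below counts the samples of each (client, label, group) cell with numpy; the toy arrays are assumed.

```python
import numpy as np

# Count the samples in each (client, label, group) cell.
c = np.array([0, 0, 1, 1, 2, 2])             # client index of each sample (assumed)
y = np.array([1, 0, 1, 1, 0, 1])             # label (assumed)
a = np.array([0, 1, 0, 1, 0, 1])             # sensitive attribute (assumed)

m = np.zeros((3, 2, 2), dtype=int)           # (clients, labels, groups)
np.add.at(m, (c, y, a), 1)                   # m[c, y, a] = cell count
print("global group sizes:", m.sum(axis=(0, 1)))
```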
B.1 FedFB w.r.t demographic parity
We first provide the proof for Proposition 5.
Proof of Proposition 5.
We denote by the empirical probability. Demographic parity is satisfied when holds for all . Thus,
(72)
For the 0-1 loss, we have ; thus
(73)
By replacing , we have
(74)
∎
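As a sanity check of the quantity being rewritten above, the following sketch computes the empirical DP disparity from per-group positive-prediction rates; the predictions and group labels are assumed toy data.

```python
import numpy as np

# Empirical DP disparity from per-group positive-prediction rates.
yhat = np.array([1, 0, 1, 1, 0, 1, 0, 0])    # model predictions (assumed)
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])       # sensitive attribute (assumed)

rates = np.array([yhat[a == g].mean() for g in np.unique(a)])
print("per-group positive rates:", rates)
print("DP disparity:", rates.max() - rates.min())
```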
We make the following assumption on our loss function to show the partial convergence guarantee of FB.
Assumption 1.
is twice differentiable for all , , and
(75)
If is convex for all , condition (75) holds unless , for all , share their stationary points, which is very unlikely (see Remark 1 in Roh et al. [2021]).
Recalling the bi-level optimization problem in (6), we denote the outer objective function by . The following lemma provides a decreasing direction of , which inspires the update rule (8) for .
Lemma 23 (Decreasing direction of ).
If Assumption 1 holds, then on the direction where
(76)
for all , we have , and the equality holds if and only if .
Proof.
We compute the derivative as
(77)
Note that is the minimizer of the inner problem in (6); we have
Taking the derivative of the above equation, we have
Thus we get
(78)
Then, on the direction given by (76), we combine (77) and (78) to have ; here we have used Assumption 1, and the equality holds only when . ∎
Now we introduce how clients collaborate to solve the bi-level optimization problem (6).
First, we focus on the outer objective function in and introduce how the clients collaborate to update the weight . Note that the central server can compute by weight-averaging the local group losses sent from the clients at communication rounds as
thereby obtaining and updating by (8).
Next, we focus on the inner objective function in and introduce how the clients collaborate to update the model parameters using FedAvg. Note that we can decompose the objective function into , where is the client objective function of client . The global objective can be seen as a weighted sum of the client objective functions. Therefore, we can use FedAvg to solve the inner optimization problem.
Denoting , we present the pseudocode of FedFB w.r.t demographic parity in Algorithm 2.
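To summarize the coordination just described, here is a schematic sketch of one FedFB round. Since the exact form of update (8) involves quantities elided in this version of the text, the step below is a representative gradient-style update on the averaged group losses, and fedavg and group_losses are hypothetical helpers; Algorithm 2 remains the authoritative description.

```python
import numpy as np

def fedfb_round(clients, lam, alpha, client_weights):
    # 1) inner problem: fit theta to the lambda-reweighted loss via FedAvg
    theta = fedavg(clients, lam)                   # hypothetical helper
    # 2) each client reports its local per-group losses at the new theta
    local_losses = np.stack([c.group_losses(theta) for c in clients])
    # 3) the server recovers the global group losses by weight-averaging ...
    group_loss = client_weights @ local_losses
    # 4) ... and shifts weight toward the worse-off groups (assumed form of (8))
    lam = np.clip(lam + alpha * (group_loss - group_loss.mean()), 0.0, None)
    return theta, lam / lam.sum()                  # keep lam on the simplex
```

Note that only aggregate group losses and model parameters are exchanged, never raw data, which is what preserves the privacy of FedAvg.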
Next, we analyze the convergence performance of FedFB. We need the following assumptions on the objective functions , . For simplicity, we drop the here and use the notations and instead. We use to denote the model parameters at the -th iteration on the -th client. The assumptions below are adopted from Li et al. [2020b]:
Assumption 2 (Strong convexity, Assumption 1 in Li et al. [2020b]).
is -strongly convex for , i.e., for all and , , .
Assumption 3 (Smoothness, Assumption 2 in Li et al. [2020b]).
is -smooth for , i.e., for all and , .
Assumption 4 (Bounded variance, Assumption 3 in Li et al. [2020b]).
Let be sampled from the -th device's local data uniformly at random, where and is the total number of SGD steps of every client. The variance of the stochastic gradients on each device is bounded: for .
Assumption 5 (Bounded gradients, Assumption 4 in Li et al. [2020b]).
The expected squared norm of stochastic gradients is uniformly bounded, i.e., for all .
In FedAvg, the central server first broadcasts the latest model to all clients; every client then performs local updates for a number of iterations; finally, the central server aggregates the local models to produce the new global model (see the algorithm description in Li et al. [2020b] for a more detailed explanation).
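A minimal sketch of one such FedAvg round is given below, assuming a model whose state is entirely floating-point (e.g., a logistic regression) and clients given as (data loader, sample count) pairs; both assumptions are ours, not part of Li et al. [2020b].

```python
import copy
import torch

def fedavg_round(global_model, clients, loss_fn, local_epochs=30, lr=0.01):
    states, sizes = [], []
    for loader, n_samples in clients:
        model = copy.deepcopy(global_model)        # broadcast the latest model
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_epochs):              # local updates
            for x, y in loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        states.append(model.state_dict())
        sizes.append(n_samples)
    total = sum(sizes)
    avg = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}                     # size-weighted aggregation
    global_model.load_state_dict(avg)
    return global_model
```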
Note that we assume is not updated at each communication round. With the above assumptions, the following theorem shows the convergence of FedFB in the case of two clients.
Theorem 24.
Consider the case of . Let Assumptions 2, 3, 4, and 5 on and Assumption 1 on hold. Choose such that . Suppose we have a sufficiently large number of communication rounds to solve the inner optimization problem between two update rounds, i.e., in Algorithm 2. Denote the number of updates as and the weight at the -th update round as . We can find a sufficiently large such that, with high probability, applying update rule (8) leads to , where is the global minimizer: .
Proof.
We first derive the update rule in the case of . From (8), we have . Denote
Then the update rule becomes
(85)
We denote by the number of local epochs performed between two communication rounds and by the total number of iterations between two update rounds; thus is the number of communication rounds between two update rounds. We then apply FedAvg with total iterations to solve and obtain . Then, in the update round from to , by Thm. 1 in Li et al. [2020b], we have
(86)
where . By Markov’s inequality, with probability ,
(87)
Then, taking the union bound over the update iterations of , the conclusions above hold with probability at least for all . Therefore, with sufficiently large such that , for all in the iteration, we have , . ∎
Remark 11.
Note that Thm. 24 assumes FedFB does not update in each communication round and that there are infinitely many rounds of aggregation between two update rounds. However, for computational efficiency, we update at every communication round in practice.
B.2 FedFB w.r.t equal opportunity
Similar to Proposition 5, we design the following bi-level optimization problem to capture equal opportunity:
(89)
(90)
Here .
For equal opportunity, we make the following assumption:
Assumption 6.
is twice differentiable for all , , and
for all .
With the above assumption, the following lemma provides the update rule:
Lemma 25.
If Assumption 6 holds, then on the direction
(91)
we have , and the equality holds if and only if .
Update rule for equal opportunity:
(92)
Proof of Lemma 25.
We compute the derivative as
(93)
Note that is the minimizer of (90); we have
Taking the derivative of the above equation, we have
(94)
(95)
Thus we get
(96)
Then on the direction given by (91), we combine (93) and (96) to have
(97)
(98)
(99)
(100) |
where we have used Assumption 6; the equality holds only when .
∎
We present the FedFB algorithm w.r.t equal opportunity in Algorithm 3.
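The only implementation change relative to the demographic parity case is that the group losses fed to the weight update are computed on positive-label samples only, matching the conditioning in (89) and (90). A minimal sketch with assumed per-sample arrays:

```python
import numpy as np

def eo_group_losses(losses, y, a):
    # per-group average loss restricted to the y == 1 samples
    mask = (y == 1)
    return np.array([losses[mask & (a == g)].mean() for g in np.unique(a)])
```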
B.3 FedFB w.r.t equalized odds
For equalized odds, we design the following bi-level optimization problem:
(101)
(102)
(103)
Here
(104)
(105)
For equalized odds, we make the following assumption:
Assumption 7.
is twice differentiable for all , , and
(106)
(107)
for all .
With the above assumption, the following lemma provides the update rule:
Lemma 26.
If Assumption 7 holds, then on the direction
, with
(108)
we have , and the equality holds if and only if .
Update rule for equalized odds:
(109)
Proof of Lemma 26.
Similar to the proof of Lemma 25, we compute the derivative of as (110) and note that is the minimizer of (102) and (103). Taking the derivative of the corresponding optimality condition, we have
(113)
(114)
Thus we get
(115)
(116)
(117)
Then on the direction given by (108), we combine (110) and (117) to have
(118)
(119)
(120)
(121)
(122)
(123)
(124) |
where we have used Assumption 7, and the equality holds only when .
∎
The full procedure is described in Algorithm 4.
B.4 FedFB w.r.t client parity
For client parity, we slightly abuse notation and define the loss over client as , with . Note that the here is different from the one in Sec. B.1. We design the following bi-level optimization problem:
(125)
(126)
Here
For client parity, we make the following assumption:
Assumption 8.
is twice differentiable for all , and
With the above assumption, the following lemma provides the update rule:
Lemma 27.
If Assumption 8 holds, then on the direction
(127)
we have , and the equality holds if and only if .
Then we update as follows:
Update rule for client parity:
(128)
Proof of Lemma 27.
Similar to the proof of Lemma 25, we compute the derivative of as (129) and note that is the minimizer of (126). Taking the derivative of the corresponding optimality condition, we have
Thus we get
(130)
Then on the direction given by (127), we combine (129) and (130) to have
(131)
(132)
(133)
(134) |
where we have used Assumption 8, and the equality holds only when .
∎
Algorithm 5 gives the full description.
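As a small illustration, under client parity the disparity being controlled is simply the spread of the per-client losses, since the reweighting acts on clients rather than sensitive groups; the numbers below are assumed.

```python
import numpy as np

client_losses = np.array([0.31, 0.18, 0.26])   # assumed per-client losses
print("client-parity disparity:", client_losses.max() - client_losses.min())
```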
Appendix C Appendix - Experiments
We continue from Sec. 2, providing more details on the experimental settings along with supplementary experimental results.
C.1 Experiment setting
The reported statistics are computed on the test set, and we set 10 communication rounds and 30 local epochs for all federated learning algorithms except AgnosticFair. We run 300 epochs for the other methods. For all tasks, we randomly split the data into a training set and a test set at a ratio of 7:3. The batch size is set to 128. For all methods, we choose the learning rate from . We solve FedFB with several sample-weight learning rates in parallel and select the that achieves the highest fairness. All benchmark models are tuned according to the hyperparameter configurations suggested in their original works. We perform cross-validation on the training sets to find the best hyperparameters for all algorithms.
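The parallel selection over sample-weight learning rates can be sketched as follows; run_fedfb is a hypothetical stand-in for a full training run, and both grids are assumed since the exact candidate values are elided in this version of the text.

```python
import random
from collections import namedtuple

Result = namedtuple("Result", ["alpha", "lr", "dp_disparity"])

def run_fedfb(alpha, lr):
    # hypothetical stand-in for a full FedFB training run; the dummy
    # disparity below only makes the sketch self-contained
    return Result(alpha, lr, random.random())

alphas = [0.005, 0.01, 0.05, 0.1]              # assumed grid
learning_rates = [0.001, 0.005, 0.01]          # assumed grid

best = min((run_fedfb(a, lr) for a in alphas for lr in learning_rates),
           key=lambda r: r.dp_disparity)       # i.e., the highest fairness
print(best)
```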
C.2 Supplementary experiment results
Fig. 9 shows accuracy versus fairness violation for the three-client case. We see a clear advantage of LFT+FedAvg over LFT+Ensemble in both the two-client and three-client cases. The achievable fairness range of LFT+FedAvg is much wider than that of LFT+Ensemble, though the accuracy of FedAvg is not guaranteed as data heterogeneity increases.
We also compare FedAvg, LFT+Ensemble, LFT+FedAvg, FedFB, AgnosticFair, and CFL on logistic regression. Table 4 shows that our method outperforms LFT+Ensemble and LFT+FedAvg while achieving similar performance to AgnosticFair and CFL but ensuring higher privacy.
| Property | | | Synthetic | | Adult | | COMPAS | | Bank | |
|---|---|---|---|---|---|---|---|---|---|---|
| Private | Fair | Method | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) |
| ✓ | ✗ | FedAvg | .884 ± .001 | .419 ± .006 | .837 ± .007 | .144 ± .015 | .658 ± .006 | .149 ± .022 | .900 ± .000 | .026 ± .001 |
| ✓ | ✓ | LFT+Ensemble | .712 ± .198 | .266 ± .175 | .819 ± .006 | .032 ± .031 | .606 ± .018 | .089 ± .058 | .888 ± .005 | .008 ± .007 |
| ✓ | ✓ | LFT+FedAvg | .789 ± .138 | .301 ± .192 | .828 ± .002 | .098 ± .008 | .577 ± .003 | .053 ± .003 | .892 ± .000 | .013 ± .000 |
| ✓ | ✓ | FedFB (Ours) | .756 ± .001 | .085 ± .001 | .820 ± .000 | .002 ± .001 | .600 ± .003 | .059 ± .003 | .890 ± .001 | .011 ± .002 |
| ✗ | ✓ | CFL | .709 ± .001 | .011 ± .001 | .800 ± .003 | .016 ± .003 | .587 ± .003 | .032 ± .002 | .883 ± .000 | .000 ± .000 |
C.3 Synthetic dataset generation
We generate a synthetic dataset of 5,000 examples with two non-sensitive attributes , a binary sensitive attribute , and a binary label . A tuple is randomly generated from the two Gaussian distributions: and , where . For the sensitive attribute , we generate biased data using an unfair scenario , where is the pdf of .
We split the data across three clients in a non-iid way. We randomly assign of the samples from group 0 and of the samples from group 1 to the 1st, 2nd, and 3rd clients, respectively. To study the empirical relationship between the performance of FedFB and data heterogeneity, we also split the dataset at other ratios to obtain the desired level of data heterogeneity. To obtain a dataset with low data heterogeneity, we distribute the samples from each group across the three clients at ratios of and . For a dataset with high data heterogeneity, the ratios we choose are and .
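A sketch of this generation and split is below; the Gaussian parameters, the exact form of the bias rule, and the split proportions are assumed placeholders, since the corresponding constants are elided in this version of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)                          # binary label
mu = {0: np.array([-1.0, -1.0]),                        # assumed Gaussian means
      1: np.array([1.0, 1.0])}
x = np.stack([rng.normal(mu[int(yi)], 1.0, size=2) for yi in y])

def gauss_pdf(v, mean):                                 # isotropic Gaussian pdf
    return np.exp(-0.5 * ((v - mean) ** 2).sum()) / (2 * np.pi)

# biased sensitive attribute: z depends on the features (assumed rule)
p_z1 = np.array([gauss_pdf(xi, mu[1]) /
                 (gauss_pdf(xi, mu[0]) + gauss_pdf(xi, mu[1])) for xi in x])
z = (rng.uniform(size=n) < p_z1).astype(int)

# non-iid split into three clients by group-dependent proportions (assumed)
props = {0: [0.6, 0.3, 0.1], 1: [0.1, 0.3, 0.6]}
client = np.array([rng.choice(3, p=props[int(zi)]) for zi in z])
```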
C.4 Empirical relationship between accuracy, fairness, and the number of clients
We investigate the empirical relationship between accuracy, fairness, and the number of clients. We generate three synthetic datasets of 3,333, 5,000, and 6,667 samples and split them into two, three, and four clients, respectively. Table 5 shows that our method outperforms the baselines in all three cases.
| Number of clients | two clients | | three clients | | four clients | |
|---|---|---|---|---|---|---|
| Method | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) |
| FedAvg | .879 ± .004 | .360 ± .013 | .886 ± .003 | .406 ± .009 | .879 ± .003 | .382 ± .005 |
| LFT+Ensemble | .780 ± .090 | .161 ± .157 | .727 ± .194 | .248 ± .194 | .720 ± .192 | .246 ± .180 |
| LFT+FedAvg | .866 ± .002 | .429 ± .017 | .823 ± .102 | .305 ± .131 | .608 ± .367 | .491 ± .102 |
| FedFB | .679 ± .007 | .047 ± .012 | .725 ± .012 | .051 ± .018 | .705 ± .004 | .005 ± .003 |
| CFL | .668 ± .028 | .063 ± .035 | .726 ± .009 | .028 ± .016 | .670 ± .035 | .064 ± .020 |
C.5 Comparison with FairFed
| | | Adult (heterogeneity level) | | | | | COMPAS (heterogeneity level) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Method | 0.1 | 0.2 | 0.5 | 10 | 5000 | 0.1 | 0.2 | 0.5 | 10 | 5000 |
| Acc. (↑) | FairFed | .775 | .794 | .819 | .824 | .824 | .594 | .586 | .608 | .636 | .640 |
| | FedFB (Ours) | .799 | .769 | .772 | .822 | .769 | .668 | .655 | .665 | .672 | .666 |
| SPD (↓) | FairFed | .021 | .037 | .061 | .065 | .065 | .048 | .040 | .072 | .108 | .115 |
| | FedFB (Ours) | .042 | .034 | .038 | .063 | .060 | .006 | .011 | .026 | .032 | .004 |
| | Synthetic | | Adult | | COMPAS | | Bank | |
|---|---|---|---|---|---|---|---|---|
| Method | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) | Acc. (↑) | DP Disp. (↓) |
| FedFB (Ours) | .725 ± .012 | .051 ± .018 | .804 ± .001 | .028 ± .001 | .616 ± .033 | .036 ± .028 | .883 ± .000 | .000 ± .000 |
| AgnosticFair | .655 ± .028 | .028 ± .033 | .790 ± .012 | .032 ± .016 | .612 ± .037 | .096 ± .046 | .883 ± .000 | .000 ± .000 |
C.6 Comparison with AgnosticFair
Lastly, we compare our method with AgnosticFair [Du et al., 2021], which exchanges the model parameters and other information after every gradient update to mimic the performance of CFL with the FairnessConstraint implementation [Zafar et al., 2017c]. Table 7 shows that our FedFB achieves similar performance to AgnosticFair while sacrificing much less privacy.