This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Improving Fairness via Federated Learning

Yuchen Zeng,   Hongxu Chen,  Kangwook Lee
University of Wisconsin-Madison
Abstract

Recently, lots of algorithms have been proposed for learning a fair classifier from decentralized data. However, many theoretical and algorithmic questions remain open. First, is federated learning necessary, i.e., can we simply train locally fair classifiers and aggregate them? In this work, we first propose a new theoretical framework, with which we demonstrate that federated learning can strictly boost model fairness compared with such non-federated algorithms. We then theoretically and empirically show that the performance tradeoff of FedAvg-based fair learning algorithms is strictly worse than that of a fair classifier trained on centralized data. To bridge this gap, we propose FedFB, a private fair learning algorithm on decentralized data. The key idea is to modify the FedAvg protocol so that it can effectively mimic the centralized fair learning. Our experimental results show that FedFB significantly outperforms existing approaches, sometimes matching the performance of the centrally trained model.

1 Introduction

Fair learning has recently received increasing attention. Various fairness notions have been introduced in the past few years (Dwork et al., 2012; Hardt et al., 2016; Zafar et al., 2017b, a; Kearns et al., 2018; Friedler et al., 2016). Among various fairness notions, group fairness (Hardt et al., 2016; Zafar et al., 2017a) is the most studied one; it requires the classifier to treat different groups similarly, where groups are defined with respect to sensitive attributes such as gender and race. One of the most commonly used group fairness notions is demographic parity, which requires that different groups are equally likely to receive desirable outcomes.

There has been a large amount of work in training fair classifiers (Zafar et al., 2017c; Hardt et al., 2016; Roh et al., 2021), and almost all of these studies assume that the learner has access to the entire training data. Unfortunately, this is not the case in many critical applications. For example, the government may want to train a single classifier that is fair and accurate using datasets owned by different stakeholders across the country, but the stakeholders may not be able to share their raw data with the government due to the privacy act. This precisely sets the core question that we aim to answer in this paper – how can we privately train a fair classifier on decentralized data? To answer this, consider the following approaches: Local Fair Training + Ensemble and Local Fair Training + FedAvg.

\begin{overpic}[width=433.62pt]{figures/fl_learning} \put(112.0,60.0){$\displaystyle{\color[rgb]{0.1328125,0.421875,0.87890625}<}$} \put(250.0,50.0){$\displaystyle{\color[rgb]{0.1328125,0.421875,0.87890625}<}$} \put(280.0,15.0){$\displaystyle{\color[rgb]{0.1328125,0.421875,0.87890625}<}$} \par\put(340.0,50.0){$\displaystyle{\color[rgb]{0.1328125,0.421875,0.87890625}\approx}$} \put(20.0,90.0){LFT+Ensemble{}} \put(160.0,90.0){{LFT+{FedAvg}{}}} \put(280.0,90.0){{FedFB (Ours)}} \put(375.0,90.0){{CFL (Upper Bound)}} \put(100.0,50.0){{\color[rgb]{0.1328125,0.421875,0.87890625}Sec.~{}\ref{sec:necessity}}} \put(98.0,40.0){{\color[rgb]{0.1328125,0.421875,0.87890625}(Thm.~{}\ref{thm:uflvsffl})}} \par\put(240.0,40.0){{\color[rgb]{0.1328125,0.421875,0.87890625}Sec.~{}\ref{sec:experiments}}} \put(335.0,40.0){{\color[rgb]{0.1328125,0.421875,0.87890625}Sec.~{}\ref{sec:experiments}}} \par\put(250.0,0.0){{\color[rgb]{0.1328125,0.421875,0.87890625}Sec.~{}\ref{sec:4.1} (Lemma~{}\ref{lemma:ffl_limit})}} \end{overpic}
Figure 1: A high-level illustration of various approaches to fair learning on decentralized data and our contributions. Assuming two data owners, from left to right, we show LFT+Ensemble (Local Fair Training + Ensemble), LFT+FedAvg (Local Fair Training + FedAvg), FedFB (ours), and CFL (Centralized Fair Learning). In LFT+Ensemble, each data owner trains a locally fair model on its own data and we assemble the local models to obtain a global model. LFT+FedAvg applies FedAvg together with off-the-shelf fair training algorithms for local training. Our proposed solution FedFB consists of a modified FedAvg protocol with a custom-designed fair learning algorithm. CFL is the setting where a fair model is trained on the pooled data. In this work, we theoretically characterize the strict ordering between the existing approaches and empirically demonstrate the superior performance of FedFB.
Local Fair Training + Ensemble (LFT+Ensemble)

One naïve yet the simplest way to learn fair classifiers from decentralized data without violating privacy policy is training locally fair classifiers and assembling them. We call this strategy LFT+Ensemble. This is more private than any data sharing schemes or federated learning approaches as only the fully trained models are shared just once.

Local Fair Training + FedAvg (LFT+FedAvg)

LFT+FedAvg applies federated learning (Konečnỳ et al., 2017) together with existing fair learning algorithms. Federated learning is a distributed learning framework, using which many data owners can collaboratively train a model under the orchestration of a central server while keeping their data decentralized. For instance, under FedAvg, the de facto aggregation protocol for federated learning, the central server periodically computes a weighted average of the locally trained model parameters. If each data owner runs a fair learning algorithm on its own data and these locally trained models are aggregated via FedAvg, then we call this approach Local Fair Training + FedAvg (LFT+FedAvg).

Goal and Main Contributions

As LFT+FedAvg makes more than one round of communications between the clients and the centralized aggregator, it is reasonable to expect a better performance (fairness and accuracy) with LFT+FedAvg over LFT+Ensemble. However, rigorous performance analysis has not been made yet in the literature. Inspired by this, we first study the following question:

Is federated learning needed for decentralized fair learning?

Refer to caption
Figure 2: An example illustrating the failure of LFT+Ensemble (Local Fair Training + Ensemble). Consider three clients with four samples of one-dimensional feature xx. F and M denote female and male, respectively, and pink boxes denote positive labels. Under LFT+Ensemble, every client first finds a locally fair classifier on their data, which are depicted in short blue dotted lines. The voting ensemble model is visualized in a long blue line. Note that this is strictly worse than the green line that is more fair and accurate.

In other words, it is not clear whether there is any strict performance (fairness and accuracy) improvement by using LFT+FedAvg over LFT+Ensemble. If not, there is no reason to use LFT+FedAvg as LFT+Ensemble is a more private scheme. Our first contribution is to theoretically compare their performances and show that federated learning is indeed necessary. Intuitively, if the data is highly heterogeneous, an ensemble model of locally trained models will not be fair even if the consisting models are locally fair. See Fig. 2 for a toy example illustrating this phenomenon. On the other hand, LFT+FedAvg might find a globally fairer model by making frequent communications, of course at the cost of privacy loss. Another natural question follows:

How good is LFT+FedAvg compared to the case where the whole data is centralized?

That is, can LFT+FedAvg achieve similar performance with Centralized Fair Learning (CFL)? If not, can we develop a better federated learning approach? Our second contribution is to identify the performance gap between LFT+FedAvg and CFL and to develop FedFB, a new federated learning algorithm that bridges the gap. Our major contributions are summarized as follows (see Fig. 1):

  • We develop a theoretical framework for analyzing approaches for decentralized fair learning. Using this, we show the strict ordering between the existing approaches, i.e., under certain conditions, LFT+Ensemble << LFT+FedAvg << CFL, w.r.t. their fairness-accuracy tradeoffs. To the best of our knowledge, it is the first rigorous comparison of these approaches. Our finding implies the necessity of federated learning, but at the same time, implies the limitation of the naiv̈e application of FedAvg together with local fair training.

  • Leveraging the state-of-the-art algorithm for (centralized) fair learning (Roh et al., 2021), we design FedFB, a novel fairness-aware federated learning algorithm.

  • Via extensive experiments, we show that (1) our theoretical findings hold under more general settings, and (2) FedFB significantly outperforms the existing approaches and achieves similar performance as CFL.

2 Related work

Model Fairness

Several works have studied the fundamental tradeoff between fairness and accuracy  (Menon and Williamson, 2018; Wick et al., 2019; Zhao and Gordon, 2019). Our analysis is highly inspired by the one by Menon and Williamson (2018), which characterized the tradeoff between accuracy and fairness under the centralized setting. Among various fair training algorithms (Zemel et al., 2013; Jiang and Nachum, 2020; Zafar et al., 2017c, a; Hardt et al., 2016), our work is based on the current state of the art — FairBatch (Roh et al., 2021), which we will detail in Sec. 4

Federated Learning

Unlike traditional, centralized machine learning approaches, federated learning keeps the data decentralized throughout training, reducing the privacy risks involved in traditional approaches (Konečnỳ et al., 2017; McMahan et al., 2017). FedAvg (McMahan et al., 2017) is the first and most widely used federated learning algorithm. The idea is to iteratively compute a weighted average of the local model parameters, with the weights proportional to the local datasets’ sizes. Prior work (Li et al., 2020b) has shown that FedAvg provably converges under some mild conditions. The design of our proposed algorithm FedFB is also based on that of FedAvg.

Federated Fair Learning for Group Fairness

A few very recent studies (Ezzeldin et al., 2021; Rodríguez-Gálvez et al., 2021; Chu et al., 2021; Du et al., 2021; Cui et al., 2021), conducted concurrently with our work, aim at achieving group fairness under federated learning. In particular, Du et al. (2021), Rodríguez-Gálvez et al. (2021) and Chu et al. (2021) mimic the centralized fair learning setting by very frequently exchanging information for each local update, not for each round of local training. In contrast, FedFB requires much less frequent communications (one per round), resulting in higher privacy and lower communication costs. Moreover, Rodríguez-Gálvez et al. (2021) and Chu et al. (2021) are designed specifically for certain group fairness notions (such as equal opportunity and accuracy parity) and cannot be applied for other definitions such as demographic parity. On the other hand, FedFB is applicable for all popular group fairness notions including demographic parity, equalized odds, and equal opportunity. Similar to FedFB, Ezzeldin et al. (2021) employs FedAvg and a reweighting mechanism, but it achieves a worse performance.

Refer to caption
Figure 3: (a) A high-level comparison of various fair learning methods. (b) Accuracy-fairness tradeoff curves on the synthetic dataset. FedFB nearly matches the performance of CFL.

We visualize a high-level comparison of LFT+Ensemble, LFT+FedAvg, CFL, FedFB and AgnosticFair (Du et al., 2021) in terms of performance and privacy in Fig. 3(a). Most of the work in the literature and our work focus on satisfying a single fairness requirement on the “global” data distribution, but there is also a recent work that aims at finding a classifier that simultaneously satisfies multiple fairness requirements, each defined with respect to each client’s “local” data (Cui et al., 2021).

Federated Fair Learning for Client Fairness

The definition of fairness used in most of the existing federated learning work is slightly different from the standard notion of group fairness. One popular definition of fairness in the federated setting is client parity (CP), which requires that all clients (i.e. data owners) achieve similar accuracies (or loss values). Several algorithms have been proposed to achieve this goal (Li et al., 2021, 2020a; Mohri et al., 2019; Yue et al., 2021; Zhang et al., 2020a). Even though their goal is different from ours, we show that an adapted version of FedFB can also achieve comparable client parity, compared with the existing algorithms proposed specifically for CP.

3 Analysis of LFT+Ensemble & LFT+FedAvg

In this section, we first show the necessity of federated learning by proving that LFT+FedAvg achieves strictly higher fairness than LFT+Ensemble. We then prove the limitation of LFT+FedAvg by comparing its performance with an oracle bound of CFL (non-private). These two results together imply that federated learning is necessary, but there exists a limit on what can be achieved by FedAvg-based approaches. We will present informal theoretical statements, deferring the formal versions and proofs to Sec. A.

Problem Setting

Denote [N]:={0,1,,N1}[N]:=\{0,1,\dots,N-1\} for any N+N\in\mathbb{Z}^{+}. We assume II clients, which have the same amount of data. We further assume a simple binary classification setting with a binary sensitive attribute, i.e., x𝒳={\mathrm{x}}\in{\mathcal{X}}={\mathbb{R}} is the input feature, y𝒴=[1]{\mathrm{y}}\in{\mathcal{Y}}=[1] is the outcome and a𝒜=[A]=[1]{\mathrm{a}}\in{\mathcal{A}}=[A]=[1] is the binary sensitive attribute. Assume that x{\mathrm{x}} is a continuous variable. The algorithm we will develop later is applicable to general settings.

We now introduce parameters for describing the data distribution. Let yxBern(η(x)){\mathrm{y}}\mid{\mathrm{x}}\sim\operatorname{Bern}(\eta({\mathrm{x}})) for all client ii, where η():𝒳[0,1]\eta(\cdot):{\mathcal{X}}\rightarrow[0,1] is a strictly monotone increasing function. Assume xa=a,i=i𝒫a(i){\mathrm{x}}\mid{\mathrm{a}}=a,{\mathrm{i}}=i\sim{\mathcal{P}}_{a}^{(i)}, ai=iBern(qi){\mathrm{a}}\mid{\mathrm{i}}=i\sim\operatorname{Bern}(q_{i}), where i{\mathrm{i}} is the index of the client, 𝒫a(i){\mathcal{P}}_{a}^{(i)} is a distribution, and qi[0,1]q_{i}\in[0,1] for a=0,1,i[I]a=0,1,i\in[I]. Let ={f:𝒳×𝒜[0,1]}{\mathcal{F}}=\{f:{\mathcal{X}}\times{\mathcal{A}}\rightarrow[0,1]\}. Given ff\in{\mathcal{F}} and data sample (x,a)({\mathrm{x}},{\mathrm{a}}), we consider the following randomized classifier: y^x,aBern(f(x,a))\hat{{\mathrm{y}}}\mid{\mathrm{x}},{\mathrm{a}}\sim\operatorname{Bern}(f({\mathrm{x}},{\mathrm{a}})). Using these definitions, we now define demographic parity, specialized for a binary case as

(Demographic parity):(y^=1a=0)=(y^=1a=1).\textrm{(Demographic parity)}:~{}{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0)={\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1).

To measure how unfair a classifier is with respect to demographic parity (DP), we define DP disparity as:

DP Disp(f)=|(y^=1a=0)(y^=1a=1)|.\text{DP Disp}(f)=|{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1)|.
Refer to caption
Figure 4: Fundamental limitation of LFT+Ensemble in terms of fairness range. DP disparity of LFT+Ensemble as a function of local unfairness budgets. The blue plane is of perfect fairness, and the pink plane visualizes the value of δ\delta (the lowest DP disparity that LFT+Ensemble can achieve). (a) (Example 1) On a distribution that satisfies the conditions of Corollary 2. (b, c) On distributions illustrated in Sec. 5.1, which do not satisfy the conditions of Corollary 2. In all three cases, the LFT+Ensemble cannot achieve perfect fairness, i.e., δ>0\delta>0.

3.1 LFT+FedAvg strictly outperforms LFT+Ensemble

LFT+Ensemble Optimization

We now present the optimization problem solved in the LFT+Ensemble scenario. Here, for analytical tractability, we will assume the population limit, i.e., the true data distribution is used in optimization. Recall that in this scenario, each client solves their own optimization problem to train a locally fair classifier. In particular, each client i[I]i\in[I] first sets its own local fairness constraint εi[0,1]{\varepsilon}_{i}\in[0,1], and solves the following problem:

minf(y^yi=i),\displaystyle\min_{f\in{\mathcal{F}}}{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}}\mid{\mathrm{i}}=i), (LFT+Ensemble(i,εii,{\varepsilon}_{i}))
s.t.|(y^=1a=0,i=i)(y^=1a=1,i=i)|εi.\displaystyle\mathrm{s.t}.~{}|{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i)|\leq{\varepsilon}_{i}. (1)

We denote by fiεif_{i}^{{\varepsilon}_{i}} the solution to LFT+Ensemble(i,εii,{\varepsilon}_{i}). Let 𝜺=(ε0,,εI1)\bm{{\varepsilon}}=({\varepsilon}_{0},\dots,{\varepsilon}_{I-1}). We consider the following ensemble of II locally trained classifiers:

y^x,aBern(i[I]fiεi(x,a)/I)=Bern(f𝜺LFT+Ensemble),\displaystyle\hat{{\mathrm{y}}}\mid{\mathrm{x}},{\mathrm{a}}\sim\operatorname{Bern}\bigg{(}\sum_{i\in[I]}f_{i}^{{\varepsilon}_{i}}({\mathrm{x}},{\mathrm{a}})/I\bigg{)}=\operatorname{Bern}(f_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}}), (2)

where f𝜺LFT+Ensemble:=i[I]fiεi(x,a)/If_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}}:=\sum_{i\in[I]}f_{i}^{{\varepsilon}_{i}}({\mathrm{x}},{\mathrm{a}})/I.

Since there always exists a perfectly fair classifier (Menon and Williamson, 2018), one question we can ask is if f𝜺LFT+Ensemblef_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}} can achieve an arbitrary level of fairness. The following lemma asserts that it cannot.

Lemma 1 ((informal) Achievable fairness range of LFT+Ensemble).

Let qi=qq_{i}=q for all i[I]i\in[I], where q(0,1)q\in(0,1) is a constant. Under certain conditions, there exists δ\delta (δ>0\delta>0) such that min𝛆[0,1]IDP Disp(f𝛆LFT+Ensemble)>δ.\min_{\bm{{\varepsilon}}\in[0,1]^{I}}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}})>\delta.

Lemma 1 implies that LFT+Ensemble fails to achieve perfect fairness even when the sensitive attribute (a{\mathrm{a}}) distribution is identical across the clients. Instead, Lemma 1 only requires the insensitive attribute distribution (x{\mathrm{x}}) to be highly heterogeneous across the clients. The following corollary provides a concrete example satisfying such conditions.

Corollary 2 (informal).

Let I=2I=2 and q0=q1=0.5q_{0}=q_{1}=0.5, η(x)=11+ex,𝒫a(i)=𝒩(μa(i),σ(i)2)\eta(x)=\frac{1}{1+\text{e}^{-x}},{\mathcal{P}}_{a}^{(i)}={\mathcal{N}}(\mu_{a}^{(i)},\sigma^{(i)2}), where i=0,1i=0,1. If one client has much larger variance than the other, under certain assumptions, there exists δ>0\delta>0 such that minε0,ε1[0,1]DP Disp(f𝛆LFT+Ensemble)>δ.\min_{{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1]}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}})>\delta.

Note that the condition that one client has a much larger variance than the other contributes to the “high data heterogeneity” requirement. We provide the explicit form of δ\delta in Corollary 11 in Sec. A.2.2. Corollary 2 implies that under a limiting case of Gaussian distribution, LFT+Ensemble cannot achieve high fairness requirements. In Sec. 2, we will numerically demonstrate the same claim holds for more general cases. Next, we give an example that satisfies the conditions of Corollary 2, whose achievable DP disparity and lower bound are visualized in Fig. 4(a).

Example 1.

Let μ0(0)=μ1(0)=0,μ0(1)=3,μ1(1)=1,σ(0)=70\mu_{0}^{(0)}=\mu_{1}^{(0)}=0,\mu_{0}^{(1)}=3,\mu_{1}^{(1)}=-1,\sigma^{(0)}=70 and σ(1)=1\sigma^{(1)}=1. Then, δ0.21\delta\approx 0.21.

Refer to caption
Figure 5: Accuracy-fairness tradeoff curves of CFL, LFT+FedAvg, and LFT+Ensemble for two clients cases. As qiq_{i} denotes the sensitive attribute distribution for client ii, |q1q0||q_{1}-q_{0}| captures the sensitive attribute distribution’s heterogeneity. The dotted vertical lines describe the lowest DP disparities that can be achieved. LFT+FedAvg’s maximum fairness is strictly higher than that of LFT+Ensemble but strictly lower than that of CFL. Moreover, the tradeoff curves are strictly ordered in the same order.
Remark 1.

In this work, we classify local training followed by an ensemble method as a non-federated learning algorithm. However, one may also view it as an extreme instance of federated learning, called one-shot (single round) federated learning (Guha et al., 2019). Depending on how one would classify this approach, our finding can be viewed as either the necessity of federated learning or the necessity of multi-round federated learning.

LFT+FedAvg Optimization

The classifier obtained by LFT+FedAvg can be shown to be equivalent to f𝜺LFT+FedAvgf_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}}, the solution to the following problem.

minf(y^y)\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\min_{f\in{\mathcal{F}}}{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}}) (LFT+FedAvg(𝜺)(\bm{{\varepsilon}}))
s.t.|(y^=1a=0,i=i)(y^=1a=1,i=i)|εi.\displaystyle\mathrm{s.t}.~{}|{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i)|\leq{\varepsilon}_{i}. (3)

The proof of the equivalence is provided in Sec. A.3.2. The following theorem asserts that LFT+FedAvg can achieve a strictly higher fairness than LFT+Ensemble.

Theorem 3 ((informal) Fundamental values of federated learning for fairness).

Let qi=qq_{i}=q for all i[I]i\in[I], where q(0,1)q\in(0,1) is a constant. Under certain conditions, min𝛆[0,1]IDP Disp(f𝛆LFT+Ensemble)>min𝛆[0,1]IDP Disp(f𝛆LFT+FedAvg).\min_{\bm{{\varepsilon}}\in[0,1]^{I}}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}})>\min_{\bm{{\varepsilon}}\in[0,1]^{I}}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}}).

Thm. 3 implies that LFT+Ensemble cannot match the highest fairness achieved by LFT+FedAvg. See Fig. 5 for the performance tradeoffs. Similar to Lemma 1, the technical assumptions include high enough heterogeneity of x{\mathrm{x}} across the clients (see Sec. A.4).

Remark 2.

Extending Thm. 3 (and the analysis of LFT+FedAvg) to general cases where qiq_{i}’s are different remains open. However, we conjecture that our lemma holds for more general cases, and we will numerically support our conjecture empirically.

Remark 3.

Thm. 3 shows that even with just two clients (I=2I=2), a constant gap exists between non-federated algorithms and federated algorithms in their fairness performances. There is a stark difference between this phenomenon and the well-known gain of federated learning due to an increased sample size, which is generally negligible with a few number of clients. Our finding on this untapped gain in fairness can be seen as a new justification for small-scale federated learning, i.e., cross-silo settings.

3.2 LFT+FedAvg is strictly worse than CFL

We first present the optimization problem for CFL, whose achievable fairness region can serve as an upper bound on that of all other decentralized algorithms. We then show the existence of data distributions on which LFT+FedAvg achieves a strictly worse fairness performance than CFL.

CFL Optimization

We consider the same problem setting as last subsection. We now formulate the CFL problem as:

minf(y^y),\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\min_{f\in{\mathcal{F}}}{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}}), (CFL(ε{\varepsilon}))
s.t.|(y^=1a=0)(y^=1a=1)|ε.\displaystyle\mathrm{s.t}.~{}|{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1)|\leq{\varepsilon}. (4)

Denote the solution to CFL(ε{\varepsilon}) as fεCFLf_{{\varepsilon}}^{\text{CFL}}. It is clear that fεCFLf_{{\varepsilon}}^{\text{CFL}} achieves the best accuracy-fairness tradeoff, at the cost of no privacy. The following lemma shows that there exists some distribution such that LFT+FedAvg is strictly worse than CFL when the distribution of sensitive attribute a{\mathrm{a}} is heterogeneous (qiq_{i} are not all the same).

Lemma 4 ((informal) A strict gap between LFT+FedAvg and CFL).

When there exist ij[I]s.t.qiqji\neq j\in[I]~{}\mathrm{s.t}.~{}q_{i}\neq q_{j}, there exists a distribution such that min𝛆[0,1]IDP Disp(f𝛆LFT+FedAvg)>minεDP Disp(fεCFL)=0.\min_{\bm{{\varepsilon}}\in[0,1]^{I}}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})>\min_{{\varepsilon}}\text{DP Disp}(f_{{\varepsilon}}^{\text{CFL}})=0.

Remark 4.

A strict gap exists for certain distributions, but not for all distributions.

3.3 Numerical Comparisons of tradeoffs

One limitation of our current theoretical results is that they only compare the maximum achievable fairness. Extending our theoretical results to fully characterize two-dimensional tradeoffs is non-trivial, so we leave it as future work. Instead, we numerically solve each of the optimization problems and visualize the tradeoff curves.

Shown in Fig. 5 are the tradeoff curves for two clients cases. We let xa=0,i=0𝒩(3,1),xa=1,i=0𝒩(5,1),xa=0,i=1𝒩(1,1),xa=1,i=1𝒩(1,1),ai=0Bern(q0),ai=1Bern(q1){\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=0\sim{\mathcal{N}}(3,1),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=0\sim{\mathcal{N}}(5,1),{\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=1\sim{\mathcal{N}}(1,1),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=1\sim{\mathcal{N}}(-1,1),{\mathrm{a}}\mid{\mathrm{i}}=0\sim\operatorname{Bern}(q_{0}),{\mathrm{a}}\mid{\mathrm{i}}=1\sim\operatorname{Bern}(q_{1}), η(x)=11+ex\eta(x)=\frac{1}{1+e^{-x}}, and vary the values of q0q_{0} and q1q_{1}. Note that |q1q0||q_{1}-q_{0}| captures the heterogeneity of the sensitive data a{\mathrm{a}}, which increases from left to right. First, one can observe that LFT+Ensemble<<LFT+FedAvg<< CFL in terms of the achievable fairness range, as predicted by our theory. Beyond our theory, we also observe an increasing gap between the tradeoff curves as the data heterogeneity increases.

4 Proposed Algorithm: FedFB

Our findings in Sec. 3 imply that federated learning is necessary, but the current FedAvg-based approach might not be the best approach. Can we design a federated learning algorithm that is strictly better than FedAvg-based approaches? In this section, we propose a new federated learning algorithm for fair learning, which we dub FedFB (short for Federated FairBatch). Our approach is based on the state-of-the-art (centralized) fair learning algorithm FairBatch (FB for short) (Roh et al., 2021) and comes with a few desirable theoretical guarantees.

Centralized FB

Let us first review how FB works in the centralized setting when demographic parity is considered. Assume there are AA sensitive attributes. Let 𝒘\bm{w} be the model parameters, (y,y^;𝒘)\ell({\mathrm{y}},\hat{{\mathrm{y}}};\bm{w}) be the loss function, ny,a:=|{(x,y,a):y=y,a=a}|n_{y,a}:=|\left\{({\mathrm{x}},{\mathrm{y}},{\mathrm{a}}):{\mathrm{y}}=y,{\mathrm{a}}=a\right\}| be the number of samples in group aa with label yy, and n,a:=|{(x,y,a):a=a}|n_{\star,a}:=|\left\{({\mathrm{x}},{\mathrm{y}},{\mathrm{a}}):{\mathrm{a}}=a\right\}| be the number of samples in group aa. Let Ly,a(𝒘)=y=y,a=a(y,y^;𝒘)L_{y,a}(\bm{w})=\sum_{{\mathrm{y}}=y,{\mathrm{a}}=a}\ell({\mathrm{y}},\hat{{\mathrm{y}}};\bm{w}), Ly,a(𝒘):=ny,aLy,a(𝒘)/n,aL^{\prime}_{y,a}(\bm{w}):=n_{y,a}L_{y,a}(\bm{w})/n_{\star,a}. Then, for a[A]a\in[A], we define

Fa(𝒘):=L0,0(𝒘)+L1,0(𝒘)+L0,a(𝒘)L1,a(𝒘)+n0,0n,0n0,an,a.F_{a}(\bm{w}):=-L_{0,0}^{\prime}(\bm{w})+L_{1,0}^{\prime}(\bm{w})+L_{0,a}^{\prime}(\bm{w})-L_{1,a}^{\prime}(\bm{w})+\tfrac{n_{0,0}}{n_{\star,0}}-\tfrac{n_{0,a}}{n_{\star,a}}.

Then, we can show the following proposition.

Proposition 5 (Necessary and sufficient condition for demographic parity).

Consider 0-1 loss: (y,y^;𝐰)=yy^(𝐰)\ell(\mathrm{y},\hat{\mathrm{y}};\bm{w})=\llbracket\mathrm{y}\neq\hat{\mathrm{y}}(\bm{w})\rrbracket, where \llbracket\cdot\rrbracket is the indicator function. Then,

Fa(𝒘)=0,a[A]\displaystyle F_{a}(\bm{w})=0,\forall a\in[A] (5)

is a necessary & sufficient condition for demographic parity.

Note that Proposition 5 allows us to measure demographic disparity using subgroup-specific losses when the 0-1 loss is used. Inspired by this observation, FB uses Fa(𝒘)F_{a}(\bm{w}) as a surrogate of the unfairness measure even for non-010-1 loss functions. With this surrogate, FB solves the following bi-level optimization.

min𝝀[0,2n,0n]××[0,2n,A1n]a=1A1(Fa(𝒘𝝀))2\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\min_{\bm{\lambda}\in[0,2\frac{n_{\star,0}}{n}]\times\dots\times[0,2\frac{n_{\star,A-1}}{n}]}\sum_{a=1}^{A-1}\big{(}F_{a}(\bm{w}_{\bm{\lambda}})\big{)}^{2} (6)
𝒘𝝀=argmin𝒘a=0A1[λaL0,a(𝒘)+(2n,anλa)L1,a(𝒘)],\displaystyle\bm{w}_{\bm{\lambda}}=\arg\min_{\bm{w}}\sum_{a=0}^{A-1}\left[\lambda_{a}L_{0,a}^{\prime}(\bm{w})+(2\frac{n_{\star,a}}{n}-\lambda_{a})L_{1,a}^{\prime}(\bm{w})\right], (7)

Here, 𝝀=(λ0,,λA1)\bm{\lambda}=(\lambda_{0},\dots,\lambda_{A-1}) are the outer optimization variables, which denote adaptive coefficient to loss occurring from each subgroup. Given 𝝀\bm{\lambda}, the inner optimization problem minimizes a reweighted loss function. The intuition here is that if one carefully chooses the coefficients for each subpopulation loss term, the weighted risk minimization will result in a fair classifier, even if it is unconstrained.

In particular, Roh et al. (2021) proposed the following iterative algorithms to solve the outer optimization problem:

μ0(𝝀)=a=1A1Fa(𝒘𝝀),μa(𝝀)=Fa(𝒘𝝀), for a0,\displaystyle\mu_{0}(\bm{\lambda})=-\sum_{a=1}^{A-1}F_{a}(\bm{w}_{\bm{\lambda}}),\mu_{a}(\bm{\lambda})=F_{a}(\bm{w}_{\bm{\lambda}}),\text{ for }a\neq 0, (8)
λaλa+α𝝁(𝝀)2μa(𝝀), for a[A],α is the step size.\displaystyle\lambda_{a}\leftarrow\lambda_{a}+\frac{\alpha}{\|\bm{\mu}(\bm{\lambda})\|_{2}}\mu_{a}(\bm{\lambda}),\text{ for }a\in[A],\alpha\text{ is the step size}. (9)

All together, FB alternates this update equation (for the outer optimization) and the inner-loop updates (e.g., via SGD) to solve the bi-level optimization problem111Technically speaking, the algorithm shown here is a slightly generalized version of the original FB. In particular, it works for more than two sensitive groups, while the original FB algorithm only works for binary groups. However, we will still call ours FB. .

FedFB: Federated Learning + FB

Recall that under the FedAvg protocol, clients periodically share their locally trained model parameters with the server. Our proposed algorithm is based on the following simple observation:

The bi-level structure of FB naturally fits the hierarchical structure of federated learning.

To see this, observe that the update equation given in (8) only requires the knowledge about all μa\mu_{a}, which is the function of FaF_{a}’s. By definition of FaF_{a}, it is the sum of the locally computed FaF_{a} values across all the clients. Thus, this can be securely collected by the central aggregator via secure aggregation. Once the FaF_{a} values are securely aggregated at the central server, the central aggregator can update the per-group coefficients 𝝀\bm{\lambda} using the exact same equation (8).

This allows us to run the FB algorithm even on decentralized data. More specifically, we modify the FedAvg protocol so that each client shares not only its intermediate model parameters but also its {Fa}a[A]\{F_{a}\}_{a\in[A]} with the central coordinator. Once the central server receives the securely aggregated model parameters and FaF_{a} values, it performs both (1) model averaging and (2) reweighting coefficients updates (𝝀\bm{\lambda}). The server then broadcasts the averaged model parameter together with the updated coefficients, which are then used for the subsequent round of local training with a reweighted loss function. Note that we update 𝝀\bm{\lambda} every kk communication rounds.

ClientUpdate(i,w,λ)(i,\bm{w},\bm{\lambda}):
       Update local 𝒘(i)\bm{w}^{(i)} according to sample weights 𝝀\bm{\lambda};
       Compute local Fa(i)F_{a}^{(i)} for all a[A]a\in[A];
      
ServerExecutes:
       for each round tt do
             Wait until clients perform their updates;
             𝒘\bm{w}\leftarrow SecAgg({𝒘(i)})(\{\bm{w}^{(i)}\});
             𝑭a\bm{F}_{a}\leftarrow SecAgg({Fa(i)})(I1)(n0,0n,0n0,an,a)(\{F_{a}^{(i)}\})-(I-1)(\frac{n_{0,0}}{n_{\star,0}}-\frac{n_{0,a}}{n_{\star,a}});
             if t%k=0t\%k=0 then
                   𝝀Update(𝝀,𝑭a)\bm{\lambda}\leftarrow\texttt{Update}(\bm{\lambda},\bm{F}_{a});
                   Broadcast 𝝀\bm{\lambda} to clients;
                  
             Broadcast 𝒘\bm{w} to clients;
            
      output : 𝒘,𝝀\bm{w},\bm{\lambda}
      
Algorithm 1 FedFB(k,t)(k,t)

This algorithm, which we dub FedFB, is formally described in Alg. 1. While this description is specific to the demographic parity, FedFB can be used for other fairness notions including equal opportunity, equalized odds, and client parity. The only difference would be how to design the FaF_{a} function, which is the surrogate for the unfairness measure. See Sec. B for more details.

Convergence of FedFB

The following theorem shows that FedFB converges to the optimal reweighting coefficients. The proof is based on the analysis tools for federated learning and FB – See Thm. 24 for more details.

Theorem 6 ((informal) Partial convergence guarantee of FedFB).

Let the output of FedFB(k,t)(k,t) be 𝛌k,t\bm{\lambda}_{k,t}. Assume A=2A=2. Then, under certain conditions, limtlimk𝛌k,t=𝛌\lim_{t\to\infty}\lim_{k\to\infty}\bm{\lambda}_{k,t}=\bm{\lambda}^{\star} for some 𝛌argmin𝛌a[Fa(𝐰𝛌)]2\bm{\lambda}^{\star}\in\arg\min_{\bm{\lambda}}\sum_{a}[F_{a}(\bm{w}_{\bm{\lambda}})]^{2}.

Additional privacy leakage

Under FedFB, clients exchange real-valued FaF_{a}’s with the server in addition to the model parameters. To limit the information leakage, we develop a variant of FedFB, which exchanges the quantized loss values. For instance, “FedFB(10bits)” means each loss value is uniformly quantized using 10 bits. Such a quantization scheme limits the amount of additional information shared in communication round, at the cost of perturbed coefficient updates.

5 Experiments

222Our implemtation is available in https://github.com/yzeng58/Improving-Fairness-via-Federated-Learning.

In this section, we numerically study the performance of LFT+Ensemble, LFT+FedAvg, and CFL for more general cases, and evaluate the empirical performance of FedFB. In each simulation study, we report the summary statistics across five replications. Similar to the experimental settings used in (Roh et al., 2020), we train all algorithms using a two-layer ReLU neural network with four hidden neurons to evaluate the performance of FedFB for the non-convex case. Due to the space limit, we defer (1) logistic regression results, (2) the performance tradeoffs of FedFB as a function of the number of clients, and (3) more implementation details to Sec. C.

5.1 Limitation of LFT+Ensemble on general cases

The first experiment examines the fairness range of LFT+Ensemble under various Gaussian distributions, which does not satisfy the conditions of Corollary 2. To do so, we numerically solve the optimization problems LFT+Ensemble(i,εii,{\varepsilon}_{i}). See Sec. A for more details. We conjecture that the same phenomenon holds for more general distributions, and corroborate our conjecture with numerical experiments. Shown in Fig. 4(b,c) are the numerically computed lower bound on LFT+Ensemble’s achievable fairness. In particular, for (b), we let xa=0,i=0𝒩(10,0.22),xa=1,i=0𝒩(9.8,0.22),xa=0,i=1𝒩(0.2,0.22),xa=1,i=1𝒩(0,0.22),aBern(0.2){\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=0\sim{\mathcal{N}}(10,0.2^{2}),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=0\sim{\mathcal{N}}(9.8,0.2^{2}),{\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=1\sim{\mathcal{N}}(0.2,0.2^{2}),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=1\sim{\mathcal{N}}(0,0.2^{2}),{\mathrm{a}}\sim\operatorname{Bern}(0.2), and for (c), we let xa=0,i=0𝒩(3,1),xa=1,i=0𝒩(5,1),xa=0,i=1𝒩(1,1),xa=1,i=1𝒩(1,1),aBern(0.5){\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=0\sim{\mathcal{N}}(3,1),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=0\sim{\mathcal{N}}(5,1),{\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=1\sim{\mathcal{N}}(1,1),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=1\sim{\mathcal{N}}(-1,1),{\mathrm{a}}\sim\operatorname{Bern}(0.5). For both cases, we set η(x)=11+ex\eta(x)=\frac{1}{1+e^{-x}}. It is easy to check that these distributions do not satisfy the conditions of Corollary 2. In particular, the distribution (b) corresponds to the case that the same group is favored on both clients, and the positive rates of each group in different clients are distinctive. The distribution (c) represents the case that different groups are favored on two clients. In both cases, LFT+Ensemble fails to achieve perfect fairness, i.e., δ>0\delta>0. We also observe that δ\delta is large on the distribution (c). This supports our conjecture that LFT+Ensemble’s fairness performance is strictly limited not only on certain data distributions but also on more general ones.

5.2 Accuracy-fairness tradeoffs of LFT+Ensemble, LFT+FedAvg and CFL

The second experiment extends the experiments conducted in Sec. 3.3. We assess the relationship between the data heterogeneity and the gap between the three fair learning scenarios with three clients. As shown in Fig. 9, LFT+FedAvg is observed to achieve a strictly worse tradeoff than CFL and a strictly higher maximum fairness value than LFT+Ensemble. The results corroborate the benefit and limitation of FedAvg-based federated learning in improving fairness. A very interesting observation is that LFT+Ensemble is observed to obtain a strictly higher accuracy than LFT+FedAvg. Indeed, this could be attributed to the fact that the average of locally fair models might not be fair to any sub-distribution, while LFT+Ensemble at least ensures that each component of the mixture classifier is fair on some sub-distribution.

5.3 FedFB evaluation on demographic parity

Table 1: Comparison of accuracy and DP disparity on the synthetic, Adult, COMPAS, and Bank datasets. FedFB outperforms the other approaches on most of the tested datasets, sometimes nearly matching the performance of CFL. Note that LFT+FedAvg sometimes gets a strictly worse performance than FedAvg. This is because the average of fair models may not be fair at all.
Property Synthetic Adult COMPAS Bank
Private Fair Method Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow)
FedAvg .886±\pm.003 .406±\pm.009 .829±\pm.012 .153±\pm.022 .655±\pm.009 .167±\pm.037 .898±\pm.001 .026±\pm.003
LFT+Ensemble .727±\pm.194 .248±\pm.194 .825±\pm.008 .034±\pm.028 .620±\pm.019 .088±\pm.055 .892±\pm.002 .014±\pm.006
LFT+FedAvg .823±\pm.102 .305±\pm.131 .801±\pm.043 .123±\pm.071 .595±\pm.005 .059±\pm.009 .893±\pm.000 .017±\pm.001
FedFB (Ours) .725±\pm.012 .051±\pm.018 .804±\pm.001 .028±\pm.001 .606±\pm.019 .086±\pm.029 .883±\pm.000 .000±\pm.000
CFL .726±\pm.009 .028±\pm.016 .814±\pm.002 .010±\pm.004 .616±\pm.033 .036±\pm.028 .883±\pm.000 .000±\pm.000

We assess the empirical performance of FedFB for non-convex cases on four datasets and the performance of FedFB under different data heterogeneity. We focus on demographic parity and report DP disparity =maxa[A]|(y^=1a=a)(y^=1)|=\max_{a\in[A]}|{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=a)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1)|, where AA is the number of groups. Note that this is slightly different from the definition we used in the previous sections, which was specific for the binary sensitive attribute.

Baselines

We employ three types of baselines: (1) decentralized non-fair training (FedAvg); (2) decentralized fair training (LFT+Ensemble, LFT+FedAvg, FairFed, AgnosticFairs); (3) centralized fair training (CFL). Here, for the fairer comparison and better performance, LFT+Ensemble, LFT+FedAvg, and CFL are implemented based on FB, which is the state-of-the-art fair learning algorithm on centralized data. Note that LFT+Ensemble is the most private, CFL is the least private, and FedAvg, LFT+FedAvg, FairFed and FedFB are somewhere in-between as they share some statistics at each communication round but not directly reveal the data. AgnosticFair is close to CFL, as it exchanges information every local update.

Table 2: Comparison of accuracy and DP disparity on the synthetic dataset with varying heterogeneity. FedFB achieves good performance on all the tested levels of heterogeneity. This is because FedFB is designed to mimic the operation of CFL, whose performance is independent of data heterogeneity.
Low Data Heterogeneity Medium Data Heterogeneity High Data Heterogeneity
Method Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow)
FedFB (Ours) .669±\pm.040 .058±\pm.042 .725±\pm.012 .051±\pm.018 .703±\pm.012 .013±\pm.008
CFL .726±\pm.009 .028±\pm.016 .726±\pm.009 .028±\pm.016 .726±\pm.009 .028±\pm.016
Refer to caption
Figure 6: Performance comparison in terms of accuracy and DP disparity on logistic regression. FedFB achieves good accuracy and fairness, compared to FairFed and AgnosticFair.
Datasets

(synthetic) We follow Roh et al. (2021) for data generation, but with a slight modification to make the dataset more unbalanced. To study the empirical relationship between accuracy, fairness, and data heterogeneity, we split the dataset in different ways to obtain desired levels of data heterogeneity. More details are given in Sec. C.3. (real) We use three benchmark datasets: Adult (Dua and Graff, 2017) with 48,842 samples, COMPAS (ProPublica, 2021) with 7,214 samples, and Bank (Moro et al., 2014) with 45,211 samples. We follow Du et al. (2021)’s method to preprocess and split Adult into two clients and Jiang and Nachum (2020)’s method to preprocess COMPAS and Bank. Then, we split COMPAS into two clients based on age and split Bank into three clients based on the loan decision. Note that all the datasets are split in heterogeneous ways.

Results

Table 1 reports the test accuracy and DP disparity of four baselines and FedFB. As expected, the resulting fairness level of FedFB is close to that of CFL. Besides, we observe the poor performance of LFT+Ensemble and LFT+FedAvg, which is due to the fundamental limitation of LFT+Ensemble and LFT+FedAvg. Table 2 reports the accuracy and fairness of each method under different data heterogeneity. FedFB is observed to be robust to data heterogeneity. This agrees with our expectation as FedFB closely mimics the operation of the centralized FB, which is not affected by data heterogeneity. We make a more thorough comparison between CFL and FedFB by plotting the accuracy-fairness tradeoff curves in Fig. 3(b). To demonstrate the performance gain does not come at the cost of privacy loss, we restrict FedFB to only exchange 10 bits of information per communication round. Fig. 3(b) shows that FedFB still performs well with quantized communications.

As suggested in Ezzeldin et al. (2021), we use logistic regression to compare our approach with FairFed and AgnosticFair. Since FairFed is only applicable to single binary sensitive attribute cases, we report the performance on synthetic and Adult datasets. Fig. 6 shows that FedFB achieves the best accuracy-fairness performance among the three methods with a much smaller privacy cost compared to AgnosticFair. For a fair comparison, we provide the experiment results under the same setting as Ezzeldin et al. (2021), and full comparison with AgnosticFair in Sec. C.

5.4 FedFB evaluation on client parity

Refer to caption
Figure 7: Comparison of accuracy and Client Parity (CP) disparity on the synthetic, Adult, COMPAS, and Bank datasets. Although our algorithm is not specifically designed for CP, it closely matches the state-of-the-art accuracy-CP performance.

We evaluate the performance of FedFB in achieving client parity (CP) and compare it with the state-of-the-art algorithms for CP. We will measure CP disparity =maxij[I]|L(i)L(j)|=\max_{i\neq j\in[I]}|L^{(i)}-L^{(j)}|, where L(i)L^{(i)} is the loss in iith client.

Baselines

We consider GIFAIR (Yue et al., 2021), q-FFL (Li et al., 2020a), Ditto (Li et al., 2021), and the unconstrained baseline FedAvg (McMahan et al., 2017). Similar to FedFB, both GIFAIR and q-FFL propose a modified aggregation protocol, under which clients share some additional information with the central server, which then accordingly adjusts the objective function used for the next round of training. The key difference is that while FedFB optimizes the coefficients for the primary objective terms (i.e. sample reweighting) by solving a bi-level optimization problem, GIFAIR updates the coefficient for the penalty terms, and q-FFL implicitly updates the weights on each objective term based on nonlinear behaviors of polynomial functions, which is equivalent to the α\alpha-fairness algorithm used in networking (Mo and Walrand, 2000). The Ditto algorithm combines multitask learning with federated learning to learn a personalized classifier for each client, improving the accuracy of the clients with low accuracy.

Datasets

We use the same datasets as before, but split the datasets according to their sensitive attributes to simulate the same setting as assumed by GIFAIR, q-FFL, and Ditto.

Results

Fig. 7 shows that FedFB offers competitive and stable performances in mitigating the model bias, especially in the high fairness region. Although q-FFL achieves better accuracy and fairness on the synthetic data, under strict fairness constraint, FedFB and its private variant nearly achieve the highest accuracy on the other three datasets.

6 Conclusions

Summary

We have investigated how to achieve group fairness under a decentralized setting. We developed a theoretical framework for decentralized fair learning algorithms and analyzed the performance of LFT+Ensemble, LFT+FedAvg, and CFL. As a result, we showed that (1) federated learning can significantly boost model fairness even with only a handful number of participating clients, and (2) FedAvg-based federated fair learning algorithms are strictly worse than the oracle upper bound of CFL. To close the gap, we propose FedFB. Our extensive experimental results demonstrate that our proposed solution FedFB achieves state-of-the-art performances.

Open questions

(Theory) First, full theoretical characterization of accuracy-fairness tradeoff still remains open. Furthermore, a three-way tradeoff between accuracy, fairness, and privacy remains widely open. (Algorithm) It remains open whether or not FedFB can handle different fairness notions such as proportional fairness (Zhang et al., 2020b; Lyu et al., 2020a, b).

Acknowledgement

This work was supported in part by NSF Award DMS-2023239, NSF/Intel Partnership on Machine Learning for Wireless Networking Program under Grant No. CNS-2003129, and the Understanding and Reducing Inequalities Initiative of the University of Wisconsin-Madison, Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

References

  • Chu et al. (2021) L. Chu, L. Wang, Y. Dong, J. Pei, Z. Zhou, and Y. Zhang. Fedfair: Training fair models in cross-silo federated learning. arXiv preprint arXiv:2109.05662, 2021.
  • Cui et al. (2021) S. Cui, W. Pan, J. Liang, C. Zhang, and F. Wang. Addressing algorithmic disparity and performance inconsistency in federated learning. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • Du et al. (2021) W. Du, D. Xu, X. Wu, and H. Tong. Fairness-aware agnostic federated learning. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pages 181–189, 2021.
  • Dua and Graff (2017) D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Dwork et al. (2012) C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, page 214–226, 2012. ISBN 9781450311151. doi: 10.1145/2090236.2090255. URL https://doi.org/10.1145/2090236.2090255.
  • Ezzeldin et al. (2021) Y. H. Ezzeldin, S. Yan, C. He, E. Ferrara, and S. Avestimehr. Fairfed: Enabling group fairness in federated learning. arXiv preprint arXiv:2110.00857, 2021.
  • Friedler et al. (2016) S. A. Friedler, C. Scheidegger, and S. Venkatasubramanian. On the (im)possibility of fairness, 2016.
  • Guha et al. (2019) N. Guha, A. Talwalkar, and V. Smith. One-shot federated learning. arXiv preprint arXiv:1902.11175, 2019.
  • Hardt et al. (2016) M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), volume 29, pages 3315–3323, 2016.
  • Jiang and Nachum (2020) H. Jiang and O. Nachum. Identifying and correcting label bias in machine learning. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 702–712, 2020. URL http://proceedings.mlr.press/v108/jiang20a.html.
  • Kearns et al. (2018) M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 2564–2572, 2018. URL http://proceedings.mlr.press/v80/kearns18a.html.
  • Konečnỳ et al. (2017) J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2017.
  • Li et al. (2020a) T. Li, M. Sanjabi, A. Beirami, and V. Smith. Fair resource allocation in federated learning. In International Conference on Learning Representations (ICLR), 2020a. URL https://openreview.net/forum?id=ByexElSYDr.
  • Li et al. (2021) T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pages 6357–6368, 2021. URL http://proceedings.mlr.press/v139/li21h.html.
  • Li et al. (2020b) X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang. On the convergence of fedavg on non-iid data. In International Conference on Learning Representations (ICLR), 2020b. URL https://openreview.net/forum?id=HJxNAnVtDS.
  • Lyu et al. (2020a) L. Lyu, X. Xu, Q. Wang, and H. Yu. Collaborative fairness in federated learning. In Federated Learning, pages 189–204. Springer, 2020a.
  • Lyu et al. (2020b) L. Lyu, J. Yu, K. Nandakumar, Y. Li, X. Ma, J. Jin, H. Yu, and K. Ng. Towards fair and privacy-preserving federated deep models. IEEE Transactions on Parallel & Distributed Systems, 31(11):2524–2541, 2020b. ISSN 1558-2183. doi: 10.1109/TPDS.2020.2996273.
  • McMahan et al. (2017) B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 1273–1282, 2017. URL http://proceedings.mlr.press/v54/mcmahan17a.html.
  • Menon and Williamson (2018) A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 107–118, 2018. URL http://proceedings.mlr.press/v81/menon18a.html.
  • Mo and Walrand (2000) J. Mo and J. Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on networking, 8(5):556–567, 2000.
  • Mohri et al. (2019) M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 4615–4625, 2019. URL http://proceedings.mlr.press/v97/mohri19a.html.
  • Moro et al. (2014) S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014. ISSN 0167-9236. doi: https://doi.org/10.1016/j.dss.2014.03.001. URL https://www.sciencedirect.com/science/article/pii/S016792361400061X.
  • ProPublica (2021) ProPublica. Compas recidivism risk score data and analysis. 2021. URL https://www.propublica.org/datastore/results?q=compas.
  • Rodríguez-Gálvez et al. (2021) B. Rodríguez-Gálvez, F. Granqvist, R. van Dalen, and M. Seigel. Enforcing fairness in private federated learning via the modified method of differential multipliers. arXiv preprint arXiv:2109.08604, 2021.
  • Roh et al. (2020) Y. Roh, K. Lee, S. Whang, and C. Suh. FR-train: A mutual information-based approach to fair and robust training. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 8147–8157, 2020. URL http://proceedings.mlr.press/v119/roh20a.html.
  • Roh et al. (2021) Y. Roh, K. Lee, S. E. Whang, and C. Suh. Fairbatch: Batch selection for model fairness. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=YNnpaAKeCfx.
  • Wick et al. (2019) M. Wick, s. panda, and J.-B. Tristan. Unlocking fairness: a trade-off revisited. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/373e4c5d8edfa8b74fd4b6791d0cf6dc-Paper.pdf.
  • Yue et al. (2021) X. Yue, M. Nouiehed, and R. A. Kontar. Gifair-fl: An approach for group and individual fairness in federated learning. arXiv preprint arXiv:2108.02741, 2021.
  • Zafar et al. (2017a) M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, page 1171–1180, 2017a.
  • Zafar et al. (2017b) M. B. Zafar, I. Valera, M. G. Rodriguez, K. P. Gummadi, and A. Weller. From parity to preference-based notions of fairness in classification. In Advances in Neural Information Processing Systems (NIPS), NIPS’17, page 228–238, 2017b. ISBN 9781510860964.
  • Zafar et al. (2017c) M. B. Zafar, I. Valera, M. G. Rogriguez, and K. P. Gummadi. Fairness Constraints: Mechanisms for Fair Classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 962–970, 2017c.
  • Zemel et al. (2013) R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. Learning fair representations. In International conference on machine learning, pages 325–333. PMLR, 2013.
  • Zhang et al. (2020a) D. Y. Zhang, Z. Kou, and D. Wang. Fairfl: A fair federated learning approach to reducing demographic bias in privacy-sensitive classification models. In 2020 IEEE International Conference on Big Data (Big Data), pages 1051–1060, 2020a. doi: 10.1109/BigData50022.2020.9378043.
  • Zhang et al. (2020b) J. Zhang, C. Li, A. Robles-Kelly, and M. Kankanhalli. Hierarchically fair federated learning. arXiv preprint arXiv:2004.10386, 2020b.
  • Zhao and Gordon (2019) H. Zhao and G. Gordon. Inherent tradeoffs in learning fair representations. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/b4189d9de0fb2b9cce090bd1a15e3420-Paper.pdf.

Appendix A Appendix - LFT+Ensemble, LFT+FedAvg, CFL analysis

In this section, we provide the concrete analysis for LFT+Ensemble, LFT+FedAvg, and CFL. For illustration purposes, we will start by discussing two clients cases, and then extend the analysis into more clients cases in Sec. A.4. We begin with the analysis of CFL in Sec. A.1. Then we analyze LFT+Ensemble and LFT+FedAvg. To be specific, in Sec. A.2, we analyze the limitation of LFT+Ensemble and present the formal version of Lemma 1 under two clients cases and Corollary 2. In Sec. A.3, we analyze LFT+FedAvg and compare it with LFT+Ensemble and CFL, then we present the formal two-client version of Thm. 3 and Lemma 4. All the multi-client statements are included in Sec. A.4 We summarize the commonly used notations in Table 3 .

Symbol Meaning Symbol Meaning Symbol Meaning
x\mathrm{x} feature \llbracket\cdot\rrbracket indicator function εi\varepsilon_{i} bias in client ii
a\mathrm{a} sensitive attribute ff randomized classifier qq aBern(q)\mathrm{a}\sim\text{Bern}(q)
i\mathrm{i} client index 𝒫a(i)\mathcal{P}_{a}^{(i)} distribution DP Disp(f)\text{DP Disp}(f) unfairness of ff
y^\hat{\mathrm{y}} predicted class η(x)\eta(x) (y=1|x=x)\mathbb{P}({\mathrm{y}}=1|{\mathrm{x}}=x) fε0,ε1LFT+Ensemblef_{\varepsilon_{0},\varepsilon_{1}}^{\text{LFT+Ensemble{}}} LFT+Ensemble classifier
Table 3: Commonly used notations.

A.1 CFL analysis

In this section, we analyze the CFL classifier fεCFLf_{{\varepsilon}}^{\text{CFL}} given in CFL(ε{\varepsilon}). We mainly derive the solution of CFL(ε{\varepsilon}) in Lemma 7. In Lemma 8 and Lemma 9 we summarize the properties of fεCFLf_{{\varepsilon}}^{\text{CFL}}. For a given classifier ff, we define the mean difference to be

MD(f)=(y^=1a=0)(y^=1a=1).\text{MD}(f)={\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1).
Lemma 7.

Let q(0,1)q\in(0,1). Define g():[max(q,1q),max(q,1q)][1,1]g(\cdot):[-\max(q,1-q),\max(q,1-q)]\rightarrow[-1,1] as

g(λ)=[η1(12λ2(1q)),+]d𝒫0[η1(12+λ2q),+]d𝒫1,g(\lambda)=\int_{[\eta^{-1}(\frac{1}{2}-\frac{\lambda}{2(1-q)}),+\infty]}\,\mathrm{d}{\mathcal{P}}_{0}-\int_{[\eta^{-1}(\frac{1}{2}+\frac{\lambda}{2q}),+\infty]}\,\mathrm{d}{\mathcal{P}}_{1}, (10)

then fεCFL={s(x,a)>0+αs(x,a)=0:α[0,1]}f_{{\varepsilon}}^{\text{CFL}}=\{\llbracket s(x,a)>0\rrbracket+\alpha\llbracket s(x,a)=0\rrbracket:\alpha\in[0,1]\}, where s(x,0)=2η(x)1+λ1qs(x,0)=2\eta(x)-1+\frac{\lambda}{1-q}, s(x,1)=2η(x)1λqs(x,1)=2\eta(x)-1-\frac{\lambda}{q}, λ=g1(sign(g(0))min{ε,|g(0)|})\lambda=g^{-1}(\operatorname{sign}(g(0))\min\{{\varepsilon},|g(0)|\}). Here we denote the indicator function as E:\llbracket E\rrbracket: E=1\llbracket E\rrbracket=1 if EE is true, zero otherwise.

Proof.

The proof is similar as Menon and Williamson [2018]. To solve CFL(ε{\varepsilon}), we first write the error rate and the fairness constraint as a linear function of ff. Let pa()p_{a}(\cdot) be the pdf of 𝒫a{\mathcal{P}}_{a}, where a=0,1a=0,1. Denote the joint distribution of x{\mathrm{x}} and a{\mathrm{a}} as px,a(x,a)p_{{\mathrm{x}},{\mathrm{a}}}(x,a). Note that

(y^y)\displaystyle{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}}) (11)
=\displaystyle= 𝒳a𝒜[f(x,a)(1η(x))+(1f(x,a))η(x)]px,a(x,a)dx\displaystyle\int_{{\mathcal{X}}}\sum_{a\in{\mathcal{A}}}\left[f(x,a)(1-\eta(x))+(1-f(x,a))\eta(x)\right]p_{{\mathrm{x}},{\mathrm{a}}}(x,a)\,\mathrm{d}x
=\displaystyle= 𝔼x,af(x,a)(12η(x))+(y=1)\displaystyle{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}f({\mathrm{x}},{\mathrm{a}})(1-2\eta({\mathrm{x}}))+{\mathbb{P}}({\mathrm{y}}=1)

and

(y^=1a=0)(y^=1a=1)\displaystyle{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1) (12)
=\displaystyle= 𝒳f(x,0)p0(x)dx𝒳f(x,1)p1(x)dx\displaystyle\int_{\mathcal{X}}f(x,0)p_{0}(x)\,\mathrm{d}x-\int_{\mathcal{X}}f(x,1)p_{1}(x)\,\mathrm{d}x
=\displaystyle= 𝒳a𝒜a=0f(x,0)px,a(x,a)(a=0)dx𝒳a𝒜a=1f(x,1)px,a(x,a)(a=1)dx\displaystyle\int_{{\mathcal{X}}}\sum_{a\in{\mathcal{A}}}\llbracket a=0\rrbracket f(x,0)\frac{p_{{\mathrm{x}},{\mathrm{a}}}(x,a)}{{\mathbb{P}}({\mathrm{a}}=0)}\,\mathrm{d}x-\int_{{\mathcal{X}}}\sum_{a\in{\mathcal{A}}}\llbracket a=1\rrbracket f(x,1)\frac{p_{{\mathrm{x}},{\mathrm{a}}}(x,a)}{{\mathbb{P}}({\mathrm{a}}=1)}\,\mathrm{d}x
=\displaystyle= 𝔼x,a[f(x,0)a=01qf(x,1)a=1q].\displaystyle{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}\left[f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right].

Consequently, our goal becomes solving

minf𝔼x,af(x,a)(12η(x))+(y=1)s.t.|𝔼x,a[f(x,0)a=01qf(x,1)a=1q]|ε.\begin{array}[]{c}\min_{f\in{\mathcal{F}}}{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}f({\mathrm{x}},{\mathrm{a}})(1-2\eta({\mathrm{x}}))+{\mathbb{P}}({\mathrm{y}}=1)\\ \mathrm{s.t}.~{}|{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}\left[f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right]|\leq{\varepsilon}.\end{array} (13)

Denote the function that minimizes the error rate (ERM) as f~\tilde{f}\in{\mathcal{F}}. It is easy to see that,

f~(x){η(x)>1/2+αη(x)=1/2:α[0,1]}.\tilde{f}(x)\in\left\{\llbracket\eta(x)>1/2\rrbracket+\alpha\llbracket\eta(x)=1/2\rrbracket:\alpha\in[0,1]\right\}. (14)

Next, consider the following three cases. In particular, we provide the proof for |MD(f~)|ε|\text{MD}(\tilde{f})|\leq{\varepsilon} and MD(f~)>ε\text{MD}(\tilde{f})>{\varepsilon}. The proof for MD(f~)<ε\text{MD}(\tilde{f})<-{\varepsilon} is similar as the proof for MD(f~)>ε\text{MD}(\tilde{f})>{\varepsilon}.

Case 1.

|MD(f~)|ε|\text{MD}(\tilde{f})|\leq{\varepsilon}: ERM is already fair.

The solution to (13) and CFL(ε{\varepsilon}) is f~\tilde{f}.

Case 2.

MD(f~)>ε\text{MD}(\tilde{f})>{\varepsilon}: ERM is favoring group 0 over group 1.

We will show that solving (13) is equivalent to solving an unconstrained optimization problem.

First, we will prove by contradiction that the solution ff^{\star}\in{\mathcal{F}} to (13) satisfies MD(f)=ε\text{MD}(f^{\star})={\varepsilon}. We use ff^{\star}\in{\mathcal{F}} to denote the solution of (13). Suppose the above claim does not hold. Then we have MD(f)<ε\text{MD}(f^{\star})<{\varepsilon}. To show the contradiction, we construct a ff^{\prime}\in{\mathcal{F}} that satisfies the fairness constraint and has a lower error rate than that of ff^{\star}. Let ff^{\prime} be a linear combination of ff^{\star} and f~\tilde{f}:

f=af+(1a)f~,f^{\prime}=af^{\star}+(1-a)\tilde{f}, (15)

where a=MD(f~)εMD(f~)MD(f)(0,1)a=\frac{\text{MD}(\tilde{f})-{\varepsilon}}{\text{MD}(\tilde{f})-\text{MD}(f^{\star})}\in(0,1). Then we obtain

MD(f)=aMD(f)+(1a)MD(f~)=ε(MD(f~)MD(f))MD(f~)MD(f)=ε.\text{MD}(f^{\prime})=a\text{MD}(f^{\star})+(1-a)\text{MD}(\tilde{f})=\frac{{\varepsilon}(\text{MD}(\tilde{f})-\text{MD}(f^{\star}))}{\text{MD}(\tilde{f})-\text{MD}(f^{\star})}={\varepsilon}. (16)

Denote the error rate {y^y}{\mathbb{P}}\left\{\hat{{\mathrm{y}}}\neq{\mathrm{y}}\right\} as e:[0,1]e:{\mathcal{F}}\rightarrow[0,1]. Then the error rate of ff^{\prime} is

e(f)=ae(f)+(1a)e(f~)<e(f),e(f^{\prime})=ae(f^{\star})+(1-a)e(\tilde{f})<e(f^{\star}), (17)

which is inconsistent to the optimality assumption of ff^{\star}. Therefore, MD(f)=ε\text{MD}(f^{\star})={\varepsilon}.

Now, solving CFL(ε{\varepsilon}) is equivalent to solving

minf𝔼x,af(x,a)(12η(x))+(y=1)s.t.|𝔼x,a[f(x,0)a=01qf(x,1)a=1q]|=ε.\begin{array}[]{c}\min_{f\in{\mathcal{F}}}{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}f({\mathrm{x}},{\mathrm{a}})(1-2\eta({\mathrm{x}}))+{\mathbb{P}}({\mathrm{y}}=1)\\ \mathrm{s.t}.~{}|{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}\left[f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right]|={\varepsilon}.\end{array} (18)

Furthermore, the optimization problem above is also equivalent to

minf\displaystyle\min_{f\in{\mathcal{F}}} 𝔼x,af(x,a)(12η(x))λ𝔼x,a(f(x,0)a=01qf(x,1)a=1q)\displaystyle{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}f({\mathrm{x}},{\mathrm{a}})(1-2\eta({\mathrm{x}}))-\lambda{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}\left(f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right) (20)
s.t.|𝔼x,a[f(x,0)a=01qf(x,1)a=1q]|=ε,\displaystyle~{}~{}~{}~{}~{}~{}~{}\mathrm{s.t}.~{}|{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}\left[f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right]|={\varepsilon},

for all λ\lambda\in{\mathbb{R}}.

Next, our goal is to select a suitable λ\lambda such that the constrained optimization problem above becomes an unconstrained problem, i.e., we will select a suitable λ\lambda such that the minimizer to the unconstrained optimization problem (20) satisfies equality constraint (20).

Note that

𝔼x,af(x,a)(12η(x))λ𝔼x,a(f(x,0)a=01qf(x,1)a=1q)\displaystyle{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}f({\mathrm{x}},{\mathrm{a}})(1-2\eta({\mathrm{x}}))-\lambda{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}\left(f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right) (21)
=\displaystyle= 𝔼x,af(x,a)(12η(x)λa=01q+λa=1q),\displaystyle{\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}}f({\mathrm{x}},{\mathrm{a}})\left(1-2\eta({\mathrm{x}})-\lambda\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}+\lambda\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right), (22)

then the solution to unconstrained optimization problem (20) is

f¯{s(x,a)>0+αs(x,a)=0:α[0,1]},\bar{f}\in\left\{\llbracket s(x,a)>0\rrbracket+\alpha\llbracket s(x,a)=0\rrbracket:\alpha\in[0,1]\right\}, (23)

where

s(x,a)={1+2η(x)+λ1qa=01+2η(x)λqa=1.s(x,a)=\begin{cases}-1+2\eta(x)+\frac{\lambda}{1-q}&a=0\\ -1+2\eta(x)-\frac{\lambda}{q}&a=1\end{cases}. (24)
Refer to caption
Figure 8: Visualization of g(λ)g(\lambda) and |g(λ)||g(\lambda)| under Case 2: MD(f~)>ε\text{MD}(\tilde{f})>{\varepsilon}. When g(0)=MD(f~)>ε0g(0)=\text{MD}(\tilde{f})>{\varepsilon}\geq 0, the corresponding λ\lambda of the best classifier is λ=g1(ε)<0\lambda=g^{-1}({\varepsilon})<0.

Since the range of η(x)\eta(x) is [0,1], f¯\bar{f} with λ>max(q,1q)\lambda>\max(q,1-q) is no different from f¯\bar{f} with λ=max(q,1q)\lambda=\max(q,1-q); f¯\bar{f} with λ<max(q,1q)\lambda<-\max(q,1-q) is no different from f¯\bar{f} with λ=max(q,1q)\lambda=-\max(q,1-q). Therefore, the only thing left is to find the λ[max(q,1q),max(q,1q)]\lambda\in[-\max(q,1-q),\max(q,1-q)] such that f¯\bar{f} satisfies the constraint (20).

Now consider the mean difference between the positive rate of two groups:

MD(f¯)\displaystyle\text{MD}(\bar{f}) =(y^=1a=0)(y^=1a=1)\displaystyle={\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1) (25)
=+η(x)>12λ2(1q)d𝒫0+η(x)>12+λ2qd𝒫1\displaystyle=\int_{-\infty}^{+\infty}\llbracket\eta(x)>\frac{1}{2}-\frac{\lambda}{2(1-q)}\rrbracket\,\mathrm{d}{\mathcal{P}}_{0}-\int_{-\infty}^{+\infty}\llbracket\eta(x)>\frac{1}{2}+\frac{\lambda}{2q}\rrbracket\,\mathrm{d}{\mathcal{P}}_{1}
=[η1(12λ2(1q)),+]d𝒫0[η1(12+λ2q),+]d𝒫1\displaystyle=\int_{[\eta^{-1}(\frac{1}{2}-\frac{\lambda}{2(1-q)}),+\infty]}\,\mathrm{d}{\mathcal{P}}_{0}-\int_{[\eta^{-1}(\frac{1}{2}+\frac{\lambda}{2q}),+\infty]}\,\mathrm{d}{\mathcal{P}}_{1}
=g(λ).\displaystyle=g(\lambda).

Note that g():[max(q,1q),max(q,1q)][1,1]g(\cdot):[-\max(q,1-q),\max(q,1-q)]\rightarrow[-1,1] is a strictly monotone increasing function. Consequently, if and only if λ=g1(ε)\lambda=g^{-1}({\varepsilon}), f¯\bar{f} satisfies (20). Recall that optimization problem (20) with constraint (20) is equivalent as CFL(ε{\varepsilon}). Thus, let λ=g1(ε)\lambda=g^{-1}({\varepsilon}) and f¯\bar{f} is the solution of CFL(ε{\varepsilon}).

Case 3.

MD(f~)<ε\text{MD}(\tilde{f})<-{\varepsilon}: ERM is favoring group 1 over group 0.

Similarly, like Case 2, we obtain that the solution CFL(ε{\varepsilon}) is

f¯{s(x,a)>0+αs(x,a)=0:α[0,1]},\bar{f}\in\left\{\llbracket s(x,a)>0\rrbracket+\alpha\llbracket s(x,a)=0\rrbracket:\alpha\in[0,1]\right\}, (26)

where

s(x,a)={1+2η(x)+λ1qa=01+2η(x)λqa=1,s(x,a)=\begin{cases}-1+2\eta(x)+\frac{\lambda}{1-q}&a=0\\ -1+2\eta(x)-\frac{\lambda}{q}&a=1\end{cases}, (27)

and λ=g1(ε)\lambda=g^{-1}(-{\varepsilon}). Combining all the cases above yields the desired conclusion. The proof is now complete. ∎

Remark 5.

Select α=0\alpha=0, then the solution to CFL(ε{\varepsilon}) can be written as

f(x,a)={η(x)>12λ2(1q)a=0η(x)>12+λ2qa=1.f(x,a)=\begin{cases}\llbracket\eta(x)>\frac{1}{2}-\frac{\lambda}{2(1-q)}\rrbracket&a=0\\ \llbracket\eta(x)>\frac{1}{2}+\frac{\lambda}{2q}\rrbracket&a=1\end{cases}. (28)

Therefore, Lemma 7 implies that the best classifier of the CFL problem is equivalent to simply applying a constant threshold to the class-probabilities for each value of the sensitive feature.

Lemma 7 suggests the following property of the solution to CFL(ε{\varepsilon}).

Lemma 8.

If ff and gg are two solutions to CFL(ε{\varepsilon}), then f=gf=g almost everywhere.

For illustration purposes, we denote

λεCFL=g1(sign(g(0))min{ε,|g(0)|}).\lambda_{{\varepsilon}}^{\text{CFL}}=g^{-1}(\operatorname{sign}(g(0))\min\{{\varepsilon},|g(0)|\}). (29)

Below we summarize some useful properties of λεCFL\lambda_{{\varepsilon}}^{\text{CFL}}.

Lemma 9.

The sign of λεCFL\lambda_{{\varepsilon}}^{\text{CFL}} and MD(fεCFL)\text{MD}(f_{{\varepsilon}}^{\text{CFL}}) are determined by g(0)g(0).

  1. 1.

    If ε<|g(0)|{\varepsilon}<|g(0)|, then MD(fεCFL)=λεCFL0\text{MD}(f_{{\varepsilon}}^{\text{CFL}})=\lambda_{{\varepsilon}}^{\text{CFL}}\neq 0, and g(λεCFL)=sign(g(0))εg(\lambda_{{\varepsilon}}^{\text{CFL}})=\operatorname{sign}(g(0)){\varepsilon}. If |g(0)|ε,|g(0)|\leq{\varepsilon}, then λεCFL=0\lambda_{{\varepsilon}}^{\text{CFL}}=0 and MD(fεCFL)=g(λεCFL)=g(0)\text{MD}(f_{{\varepsilon}}^{\text{CFL}})=g(\lambda_{{\varepsilon}}^{\text{CFL}})=g(0).

  2. 2.

    If g(0)>0g(0)>0 or g(0)<0g(0)<0, then for any ε0{\varepsilon}\geq 0, we have λ0\lambda\leq 0 or λ0\lambda\geq 0, respectively.

Proof.

The first property follows directly from the definition of λεCFL\lambda_{{\varepsilon}}^{\text{CFL}}. Next, we prove the second property.

When ε>|g(0)|{\varepsilon}>|g(0)|, we have λ=0\lambda=0 and the first property holds. When g(0)>ε>0g(0)>{\varepsilon}>0, by the definition of λεCFL\lambda_{{\varepsilon}}^{\text{CFL}}, we have λ=g1(ε)<g1(g(0))=0\lambda=g^{-1}({\varepsilon})<g^{-1}(g(0))=0. When g(0)<ε<0g(0)<-{\varepsilon}<0, we have λ=g1(ε)>g1(g(0))=0\lambda=g^{-1}(-{\varepsilon})>g^{-1}(g(0))=0. Combining all the cases above yields the desired conclusion. ∎

A.2 LFT+Ensemble analysis

With the analysis of CFL in Sec. A.1, in this section, we analyze the LFT+Ensemble classifier LFT+Ensemble(i,εii,{\varepsilon}_{i}) for the case of two clients. For illustration purpose, with I=2I=2, we denote fε0,ε1LFT+Ensemble=f𝜺LFT+Ensemble=(f0ε0+f1ε1)/2f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}=f_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}}=(f_{0}^{{\varepsilon}_{0}}+f_{1}^{{\varepsilon}_{1}})/2. In Sec. A.2.1 we introduce some notations for the LFT+Ensemble classifier fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} that follows from Lemma 7. In Sec. A.2.2 we analyze the limitation of fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} as stated in Sec. 3.1. To be more specific, we present the two clients’ version of Lemma 1, formal version of Corollary 2 and conclude their proof. In Sec. A.2.3, we analyze the performance gap between fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} and CFL classifier fεCFLf_{{\varepsilon}}^{\text{CFL}}.

A.2.1 Problem setting

By Lemma 7, the solution to LFT+Ensemble(i,εii,{\varepsilon}_{i}) is

fiεi(x,a)={η(x)>12λεiLFT+Ensemblei2(1q)a=0η(x)>12+λεiLFT+Ensemblei2qa=1,f_{i}^{{\varepsilon}_{i}}(x,a)=\begin{cases}\llbracket\eta(x)>\frac{1}{2}-\frac{\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}}}{2(1-q)}\rrbracket&a=0\\ \llbracket\eta(x)>\frac{1}{2}+\frac{\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}}}{2q}\rrbracket&a=1\end{cases}, (30)

where the associated λεiLFT+Ensemblei\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}} is defined as

λεiLFT+Ensemblei=gi1(sign(gi(0))min(εi,|gi(0)|))\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}}=g_{i}^{-1}(\operatorname{sign}(g_{i}(0))\min({\varepsilon}_{i},|g_{i}(0)|)) (31)

and

gi(λ)=[η1(12λ2(1q)),+)d𝒫0(i)[η1(12+λ2q),+)d𝒫1(i).g_{i}(\lambda)=\int_{[\eta^{-1}(\frac{1}{2}-\frac{\lambda}{2(1-q)}),+\infty)}\,\mathrm{d}{\mathcal{P}}_{0}^{(i)}-\int_{[\eta^{-1}(\frac{1}{2}+\frac{\lambda}{2q}),+\infty)}\,\mathrm{d}{\mathcal{P}}_{1}^{(i)}.

Note that gi(λ)g_{i}(\lambda) is the mean difference on iith client 𝔼x𝒫0(i)f(x,0)𝔼x𝒫1(i)f(x,1){\mathbb{E}}_{{\mathrm{x}}\sim{\mathcal{P}}_{0}^{(i)}}f({\mathrm{x}},0)-{\mathbb{E}}_{{\mathrm{x}}\sim{\mathcal{P}}_{1}^{(i)}}f({\mathrm{x}},1) of the classifier of the form (28). Now, the demographic disparity for fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} can be written as

DP Disp(fε0,ε1LFT+Ensemble)=|12i=01[(y^=1a=0,i=i)(y^=1a=1,i=i)]|=|14i,j=01[𝔼x𝒫0(i)fjεj(x,0)𝔼x𝒫1(i)fjεj(x,1)]|=|14(g0(λε0LFT+Ensemble0)+g0(λε1LFT+Ensemble1)+g1(λε0LFT+Ensemble0)+g1(λε1LFT+Ensemble1))|.\begin{split}\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})&=\left|\frac{1}{2}\sum_{i=0}^{1}\left[{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i)\right]\right|\\ &=\left|\frac{1}{4}\sum_{i,j=0}^{1}\left[{\mathbb{E}}_{{\mathrm{x}}\sim{\mathcal{P}}_{0}^{(i)}}f_{j}^{{\varepsilon}_{j}}({\mathrm{x}},0)-{\mathbb{E}}_{{\mathrm{x}}\sim{\mathcal{P}}_{1}^{(i)}}f_{j}^{{\varepsilon}_{j}}({\mathrm{x}},1)\right]\right|\\ &=|\frac{1}{4}\left(g_{0}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})+g_{0}(\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}})+g_{1}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})+g_{1}(\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}})\right)|.\end{split} (32)

For ease of notation, we define local mean difference on iith client as MDi(f)=(y^=1a=0,i=i)(y^=1a=1,i=i)\text{MD}_{i}(f)={\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i) and local demographic disparity on iith client as DP Dispi(f)=|MDi(f)|\text{DP Disp}_{i}(f)=|\text{MD}_{i}(f)|, where i=0,1i=0,1. Since the overall distribution of the data samples is xa=a𝒫a=𝒫a(0)/2+𝒫a(1)/2{\mathrm{x}}\mid{\mathrm{a}}=a\sim{\mathcal{P}}_{a}={\mathcal{P}}_{a}^{(0)}/2+{\mathcal{P}}_{a}^{(1)}/2, a=0,1a=0,1, gg (see the definition of gg in Lemma 7) and g0,g1g_{0},g_{1} has the following relation:

g(λ)=12g0(λ)+12g1(λ).g(\lambda)=\frac{1}{2}g_{0}(\lambda)+\frac{1}{2}g_{1}(\lambda).

A.2.2 Limitation of LFT+Ensemble

In this section, we mainly analyze the limitation of LFT+Ensemble in Lemma 10, which shows that fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} can not achieve 0 demographic disparity in certain cases. Corollary 11 is a specific example of Lemma 10.

Lemma 10 (Formal version of Lemma 1 under two clients cases).

Let q(0,1)q\in(0,1). Let c=min{|g0(0)|,|g1(0)|}c=\min\{|g_{0}(0)|,|g_{1}(0)|\}. Define ψ:[0,c]×[0,c][1,1]\psi:[0,c]\times[0,c]\rightarrow[-1,1] as

ψ(ε0,ε1)=MD(fε0,ε1LFT+Ensemble)=14g0(g11(sign(g1(0))ε1))+14g1(g01(sign(g0(0))ε0))+14sign(g0(0))ε0+14sign(g1(0))ε1.\begin{split}\psi({\varepsilon}_{0},{\varepsilon}_{1})=\text{MD}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})&=\frac{1}{4}g_{0}(g_{1}^{-1}(\operatorname{sign}(g_{1}(0)){\varepsilon}_{1}))+\frac{1}{4}g_{1}(g_{0}^{-1}(\operatorname{sign}(g_{0}(0)){\varepsilon}_{0}))\\ &+\frac{1}{4}\operatorname{sign}(g_{0}(0)){\varepsilon}_{0}+\frac{1}{4}\operatorname{sign}(g_{1}(0)){\varepsilon}_{1}.\end{split} (33)

If g0(0)g1(0)<0g_{0}(0)g_{1}(0)<0 and ψ(ε0,ε1)(g0(0)+g1(0))>0\psi({\varepsilon}_{0},{\varepsilon}_{1})(g_{0}(0)+g_{1}(0))>0 for all ε0,ε1[0,c]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c], then for all ε0,ε1[0,1]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1], DP Disp(fε0,ε1LFT+Ensemble)δ=min{|ψ(ε0,ε1)|:ε0,ε1[0,c]}>0\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})\geq\delta=\min\{|\psi({\varepsilon}_{0},{\varepsilon}_{1})|:{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c]\}>0.

Proof.

Define δ=min{|ψ(ε0,ε1)|:ε0,ε1[0,c]}\delta=\min\{|\psi({\varepsilon}_{0},{\varepsilon}_{1})|:{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c]\}. The goal is to show that the demographic disparity has a positive lower bound. Note that the mean difference can be expressed as

MD(fε0,ε1LFT+Ensemble)=14(g0(λε0LFT+Ensemble0)+g0(λε1LFT+Ensemble1)+g1(λε0LFT+Ensemble0)+g1(λε1LFT+Ensemble1)).\text{MD}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})=\frac{1}{4}\left(g_{0}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})+g_{0}(\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}})+g_{1}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})+g_{1}(\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}})\right). (34)

In the following proof, we will show, the mean difference cannot reach 0.

Without any loss of generality, assume |g0(0)|<|g1(0)||g_{0}(0)|<|g_{1}(0)|. First we consider g1(0)>0g_{1}(0)>0. We will discuss g1(0)<0g_{1}(0)<0 later. By g0(0)g1(0)<0g_{0}(0)g_{1}(0)<0 and ψ(ε0,ε1)(g0(0)+g1(0))>0\psi({\varepsilon}_{0},{\varepsilon}_{1})(g_{0}(0)+g_{1}(0))>0 for all ε0,ε1[0,c]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c], we have g0(0)<0g_{0}(0)<0 and ψ(ε0,ε1)>0\psi({\varepsilon}_{0},{\varepsilon}_{1})>0 for all ε0,ε1[0,c]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c].

First, we will prove that LFT+Ensemble achieves its lowest mean difference when ε0,ε1[0,c]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c]. In what follows, we consider five different cases to derive the desired result.

Case 1.

ε0>|g0(0)|,ε1>|g1(0)|{\varepsilon}_{0}>|g_{0}(0)|,{\varepsilon}_{1}>|g_{1}(0)|: ERM is fair on both clients.

By (31), we have λε0LFT+Ensemble0=λε1LFT+Ensemble1=0\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=0. Recall gi()g_{i}(\cdot) is a monotone increasing function, we combine g1(0)>0g_{1}(0)>0 and Lemma 9 to have g0(g11(0))<g0(0)<0.g_{0}(g_{1}^{-1}(0))<g_{0}(0)<0. Applying the above conclusion yields

(34)=12g0(0)+12g1(0)>2(14g0(0)+14g1(0)+14g0(g11(0)))=2ψ(g0(0),0)δ.(\ref{LFT+Ensemble{}_dp})=\frac{1}{2}g_{0}(0)+\frac{1}{2}g_{1}(0)>2\left(\frac{1}{4}g_{0}(0)+\frac{1}{4}g_{1}(0)+\frac{1}{4}g_{0}(g_{1}^{-1}(0))\right)=2\psi(g_{0}(0),0)\geq\delta.
Case 2.

ε0|g0(0)|,ε1>|g1(0)|{\varepsilon}_{0}\leq|g_{0}(0)|,{\varepsilon}_{1}>|g_{1}(0)|: ERM is unfair on client 0, but fair on client 1.

Applying (31) results in λε1LFT+Ensemble1=0\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=0. By the fact that gi()g_{i}(\cdot) is a strictly monotone increasing function, we have λε0LFT+Ensemble0=g01(ε0)>g01(g0(0))=0.\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=g_{0}^{-1}(-{\varepsilon}_{0})>g_{0}^{-1}(g_{0}(0))=0. Applying the above conclusion yields

(34)=\displaystyle~{}(\ref{LFT+Ensemble{}_dp})= 14ε0+14g1(0)+14g0(0)+14g1(λε0LFT+Ensemble0)(λε0LFT+Ensemble0>0,g1(λε0LFT+Ensemble0)>g1(0),g0(0)<ε0)\displaystyle-\frac{1}{4}{\varepsilon}_{0}+\frac{1}{4}g_{1}(0)+\frac{1}{4}g_{0}(0)+\frac{1}{4}g_{1}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})~{}~{}~{}\big{(}\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}>0,g_{1}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})>g_{1}(0),g_{0}(0)<-{\varepsilon}_{0}\big{)} (35)
>\displaystyle> 12g0(0)+12g1(0)>2ψ(g0(0),0)δ.\displaystyle\frac{1}{2}g_{0}(0)+\frac{1}{2}g_{1}(0)>2\psi(g_{0}(0),0)\geq\delta.
Case 3.

ε0|g0(0)|,ε1|g1(0)|{\varepsilon}_{0}\leq|g_{0}(0)|,{\varepsilon}_{1}\leq|g_{1}(0)|: ERM is unfair on both client 0 and client 1.

Applying (31) we have λε0LFT+Ensemble0=g01(ε0),λε1LFT+Ensemble1=g11(ε1)\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=g_{0}^{-1}(-{\varepsilon}_{0}),\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=g_{1}^{-1}({\varepsilon}_{1}). Then we have

(34)\displaystyle~{}(\ref{LFT+Ensemble{}_dp}) =14(ε0+ε1+g0(g11(ε1))+g1(g01(ε0)))\displaystyle=\frac{1}{4}\left(-{\varepsilon}_{0}+{\varepsilon}_{1}+g_{0}(g_{1}^{-1}({\varepsilon}_{1}))+g_{1}(g_{0}^{-1}(-{\varepsilon}_{0}))\right) (36)
14(ε0+ε0+g0(g11(ε0))+g1(g01(ε0)))=ψ(ε0,ε0)δ.\displaystyle\geq\frac{1}{4}\left(-{\varepsilon}_{0}+{\varepsilon}_{0}+g_{0}(g_{1}^{-1}({\varepsilon}_{0}))+g_{1}(g_{0}^{-1}(-{\varepsilon}_{0}))\right)=\psi({\varepsilon}_{0},{\varepsilon}_{0})\geq\delta. (37)
Case 4.

ε0>|g0(0)|,ε1|g0(0)|{\varepsilon}_{0}>|g_{0}(0)|,{\varepsilon}_{1}\leq|g_{0}(0)|: ERM is fair on client 0 and very unfair on client 1.

By (31), we have λε0LFT+Ensemble0=0,λε1LFT+Ensemble1=g11(ε1)>g11(0)\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=0,\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=g_{1}^{-1}({\varepsilon}_{1})>g_{1}^{-1}(0). Then we obtain

(34)\displaystyle~{}(\ref{LFT+Ensemble{}_dp}) =14(g0(0)+g0(λε1LFT+Ensemble1)+g1(0)+ε1)\displaystyle=\frac{1}{4}\left(g_{0}(0)+g_{0}(\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}})+g_{1}(0)+{\varepsilon}_{1}\right) (38)
>(g0(0)+g0(g11(0))+g1(0))/4=ψ(g0(0),0)δ.\displaystyle>(g_{0}(0)+g_{0}(g_{1}^{-1}(0))+g_{1}(0))/4=\psi(g_{0}(0),0)\geq\delta. (39)
Case 5.

ε0>|g0(0)|,|g0(0)|ε1<|g1(0)|{\varepsilon}_{0}>|g_{0}(0)|,|g_{0}(0)|\leq{\varepsilon}_{1}<|g_{1}(0)|: ERM is fair on client 0 and unfair on client 1.

Applying (31) implies λε1LFT+Ensemble1=g11(ε1)>g11(0)\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=g^{-1}_{1}({\varepsilon}_{1})>g_{1}^{-1}(0). Therefore,

(34)=\displaystyle~{}(\ref{LFT+Ensemble{}_dp})= 14(g0(0)+g0(g11(ε1))+ε1+g1(0))\displaystyle\frac{1}{4}\left(g_{0}(0)+g_{0}(g_{1}^{-1}({\varepsilon}_{1}))+{\varepsilon}_{1}+g_{1}(0)\right) (40)
>(g0(0)+g1(0)+g0(g11(0))+ε1)/4>ψ(g0(0),0)δ.\displaystyle>(g_{0}(0)+g_{1}(0)+g_{0}(g_{1}^{-1}(0))+{\varepsilon}_{1})/4>\psi(g_{0}(0),0)\geq\delta. (41)

Combining all the cases above, we conclude that when g1(0)>0g_{1}(0)>0 DP Disp(fε0,ε1LFT+Ensemble)δ=min{|ψ(ε0,ε1)|:ε0,ε1[0,c]}>0\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})\geq\delta=\min\{|\psi({\varepsilon}_{0},{\varepsilon}_{1})|:{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c]\}>0 for all ε0,ε1[0,1]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1].

Now we consider g1(0)<0g_{1}(0)<0, by the setting |g0(0)|<|g1(0)||g_{0}(0)|<|g_{1}(0)| and the assumption g0(0)g1(0)<0g_{0}(0)g_{1}(0)<0 and ψ(ε0,ε1)(g0(0)+g1(0))>0\psi({\varepsilon}_{0},{\varepsilon}_{1})(g_{0}(0)+g_{1}(0))>0, we have g0(0)>0g_{0}(0)>0 and ψ(ε0,ε1)<0\psi({\varepsilon}_{0},{\varepsilon}_{1})<0 for all ε0,ε1[0,c]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c]. Following similar computation above, Case 1 - Case 5 become:

Case 1. ε0>|g0(0)|,ε1>|g1(0)|{\varepsilon}_{0}>|g_{0}(0)|,{\varepsilon}_{1}>|g_{1}(0)|. Now we have 0<g0(0)<g0(g11(0))0<g_{0}(0)<g_{0}(g_{1}^{-1}(0)), thus

(34)<2(14g0(0)+14g1(0)+14g0(g11(0)))=2ψ(g0(0),0)δ.(\ref{LFT+Ensemble{}_dp})<2\left(\frac{1}{4}g_{0}(0)+\frac{1}{4}g_{1}(0)+\frac{1}{4}g_{0}(g_{1}^{-1}(0))\right)=2\psi(g_{0}(0),0)\leq-\delta.

Case 2. ε0|g0(0)|,ε1>|g1(0)|{\varepsilon}_{0}\leq|g_{0}(0)|,{\varepsilon}_{1}>|g_{1}(0)|. Now we have g0(0)>ε0g_{0}(0)>{\varepsilon}_{0}, g1(λε0LFT+Ensemble0)<g1(0)g_{1}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})<g_{1}(0), thus

(34)<12g0(0)+12g1(0)<2ψ(g0(0),0)δ.(\ref{LFT+Ensemble{}_dp})<\frac{1}{2}g_{0}(0)+\frac{1}{2}g_{1}(0)<2\psi(g_{0}(0),0)\leq-\delta.

Case 3. In this case we have

(34)=14(ε0ε1+g0(g11(ε1))+g1(g01(ε0)))=ψ(ε0,ε1)δ.(\ref{LFT+Ensemble{}_dp})=\frac{1}{4}\left({\varepsilon}_{0}-{\varepsilon}_{1}+g_{0}(g_{1}^{-1}(-{\varepsilon}_{1}))+g_{1}(g_{0}^{-1}({\varepsilon}_{0}))\right)=\psi({\varepsilon}_{0},{\varepsilon}_{1})\leq-\delta.

Case 4. Now we have λε1LFT+Ensemble1=g11(ε1)<g11(0)\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=g_{1}^{-1}(-{\varepsilon}_{1})<g_{1}^{-1}(0), thus

(34)<(g0(0)+g0(g11(0))+g1(0))/4=ψ(g0(0),0)δ.(\ref{LFT+Ensemble{}_dp})<(g_{0}(0)+g_{0}(g_{1}^{-1}(0))+g_{1}(0))/4=\psi(g_{0}(0),0)\leq-\delta.

Case 5. Now we have λε1LFT+Ensemble1=g11(ε1)<g11(0)\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=g^{-1}_{1}(-{\varepsilon}_{1})<g_{1}^{-1}(0), thus

(34)=(g0(0)+g1(0)+g0(g11(0))ε1)/4<ψ(g0(0),0)δ.(\ref{LFT+Ensemble{}_dp})=(g_{0}(0)+g_{1}(0)+g_{0}(g_{1}^{-1}(0))-{\varepsilon}_{1})/4<\psi(g_{0}(0),0)\leq-\delta.

Then we conclude the proof.

Remark 6.

Note that cc is the smallest local demographic disparity the ERM achieves on clients. The condition g0(0)g1(0)<0g_{0}(0)g_{1}(0)<0 implies that the ERM is favoring different groups in different clients. The condition ψ(ε0,ε1)(g0(0)+g1(0))>0\psi({\varepsilon}_{0},{\varepsilon}_{1})(g_{0}(0)+g_{1}(0))>0 for all ε0,ε1[0,c]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c] implies that fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} favors the same group as ERM when the constraint is very tight. If the conditions above hold, Lemma 10 suggests that there exists a lower bound of all the demographic disparity that LFT+Ensemble can achieve. In particular, if the conditions above hold, LFT+Ensemble fails to achieve perfect demographic parity.

Among the conditions of Lemma 10, g0(0)g1(0)<0g_{0}(0)g_{1}(0)<0 can be satisfied by the distribution with high data heterogeneity. To demonstrate the condition ψ(ε0,ε1)(g0(0)+g1(0))>0\psi({\varepsilon}_{0},{\varepsilon}_{1})(g_{0}(0)+g_{1}(0))>0 for all ε0,ε1[0,c]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,c] can be satisfied, we consider a limiting Gaussian case. The following corollary serves as an example that satisfies the conditions of Lemma 10, and provides a more explicit expression of the lowest demographic disparity LFT+Ensemble can reach in the Gaussian case.

Corollary 11 (Formal version of Corollary 2).

Let q=0.5q=0.5, η(x)=11+ex,𝒫a(i)=𝒩(μa(i),σ(i)2)\eta(x)=\frac{1}{1+\text{e}^{-x}},{\mathcal{P}}_{a}^{(i)}={\mathcal{N}}(\mu_{a}^{(i)},\sigma^{(i)2}) where (μ0(0)μ1(0))(μ0(1)μ1(1))<0(\mu_{0}^{(0)}-\mu_{1}^{(0)})(\mu_{0}^{(1)}-\mu_{1}^{(1)})<0. Then DP Disp(fε0,ε1LFT+Ensemble)δ14|g0(0)+g1(0)|>0(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})\geq\delta\approx\frac{1}{4}|g_{0}(0)+g_{1}(0)|>0 for all ε0,ε1[0,1]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1] if one of the following condition holds:

  1. 1.

    σ(0)σ(1),|μ0(0)|,|μ1(0)|,|μ0(1)|,|μ1(1)|\sigma^{(0)}\gg\sigma^{(1)},|\mu_{0}^{(0)}|,|\mu_{1}^{(0)}|,|\mu_{0}^{(1)}|,|\mu_{1}^{(1)}| and μ0(1)>μ1(1)\mu_{0}^{(1)}>\mu_{1}^{(1)}: client 0 has much larger variance than client 1, and client 1 is favoring group 0;

  2. 2.

    σ(1)σ(0),|μ0(0)|,|μ1(0)|,|μ0(1)|,|μ1(1)|\sigma^{(1)}\gg\sigma^{(0)},|\mu_{0}^{(0)}|,|\mu_{1}^{(0)}|,|\mu_{0}^{(1)}|,|\mu_{1}^{(1)}| and μ0(0)>μ1(0)\mu_{0}^{(0)}>\mu_{1}^{(0)}: client 1 has much larger variance than client 0, and client 0 is favoring group 0.

Proof.

In this example, note that local mean difference function of λ\lambda can be written as:

gi(λ)=Φ(η1(12+λ)μ1(i)σ(i))Φ(η1(12λ)μ0(i)σ(i)),g_{i}(\lambda)=\Phi(\frac{\eta^{-1}(\frac{1}{2}+\lambda)-\mu_{1}^{(i)}}{\sigma^{(i)}})-\Phi(\frac{\eta^{-1}(\frac{1}{2}-\lambda)-\mu_{0}^{(i)}}{\sigma^{(i)}}), (42)

where Φ()\Phi(\cdot) is the CDF of the standard Gaussian distribution.

We only provide the proof for condition 1, and the proof for condition 2 is similar. Assume condition 1 holds. By (μ0(0)μ1(0))(μ0(1)μ1(1))<0(\mu_{0}^{(0)}-\mu_{1}^{(0)})(\mu_{0}^{(1)}-\mu_{1}^{(1)})<0 and μ0(1)μ1(1)>0\mu_{0}^{(1)}-\mu_{1}^{(1)}>0, we have μ0(0)μ1(0)<0\mu_{0}^{(0)}-\mu_{1}^{(0)}<0. By (42), we have g0(0)<0g_{0}(0)<0 and g1(0)>0g_{1}(0)>0. Consequently, combining Lemma 9 and (33) yields

ψ(ε0,ε1)=14(g0(g11(ε1))+g1(g01(ε0))ε0+ε1).\psi({\varepsilon}_{0},{\varepsilon}_{1})=\frac{1}{4}\left(g_{0}(g_{1}^{-1}({\varepsilon}_{1}))+g_{1}(g_{0}^{-1}(-{\varepsilon}_{0}))-{\varepsilon}_{0}+{\varepsilon}_{1}\right).

First we show that ψ(ε0,ε1)\psi({\varepsilon}_{0},{\varepsilon}_{1}) reachs its minimum either at (c,0)(c,0) or (0,c)(0,c) by taking the derivative of ψ\psi, where c=min{|g0(0)|,|g1(0)|}c=\min\{|g_{0}(0)|,|g_{1}(0)|\} is the smallest local demographic disparity. And then, we will estimate the minimum of ψ\psi on [0,1]×[0,1][0,1]\times[0,1].

We take the derivative of ψ\psi with respect to εi{\varepsilon}_{i} and get

ψεi(ε0,ε1)=sign(gi(0))(1+g1i(gi1(sign(gi(0))εi))gi(gi1(sign(gi(0))εi)))/4.\frac{\partial\psi}{\partial{\varepsilon}_{i}}({\varepsilon}_{0},{\varepsilon}_{1})=\operatorname{sign}(g_{i}(0))\left(1+\frac{g_{1-i}^{\prime}(g_{i}^{-1}(\operatorname{sign}(g_{i}(0)){\varepsilon}_{i}))}{g_{i}^{\prime}(g_{i}^{-1}(\operatorname{sign}(g_{i}(0)){\varepsilon}_{i}))}\right)/4. (43)

By condition 1, we have g0(0)=Φ(μ1(0)σ(0))Φ(μ0(0)σ(0))0,g_{0}(0)=\Phi(-\frac{\mu_{1}^{(0)}}{\sigma^{(0)}})-\Phi(-\frac{\mu_{0}^{(0)}}{\sigma^{(0)}})\approx 0, thus |g0(0)||g1(0)||g_{0}(0)|\ll|g_{1}(0)| and c=|g0(0)|c=|g_{0}(0)|. Since gig_{i} are increasing function, i=0,1i=0,1, we have gi()>0g_{i}^{\prime}(\cdot)>0, and thus ψε0<0,ψε1>0\frac{\partial\psi}{\partial{\varepsilon}_{0}}<0,\frac{\partial\psi}{\partial{\varepsilon}_{1}}>0. Therefore, ψ\psi reaches its extreme value at (0,c)(0,c) and (c,0)(c,0).

Now, we evaluate

ψ(0,c)=(g0(g11(c))+g1(g01(0))g0(0))/4>g1(0)g0(0)+g0(g11(c))4,\psi(0,c)=(g_{0}(g_{1}^{-1}(c))+g_{1}(g_{0}^{-1}(0))-g_{0}(0))/4>\frac{g_{1}(0)-g_{0}(0)+g_{0}(g_{1}^{-1}(c))}{4},

where the inequality comes from g0(0)<0g_{0}(0)<0 to have g1(g01(0))>g1(0)g_{1}(g_{0}^{-1}(0))>g_{1}(0). And at (c,0)(c,0), we have

ψ(c,0)=(g0(g11(0))+g1(0)+g0(0))/4.\psi(c,0)=(g_{0}(g_{1}^{-1}(0))+g_{1}(0)+g_{0}(0))/4.

In what follows, we will show that min{ψ(c,0),ψ(0,c)}δ=g1(0)+g0(0)4\min\left\{\psi(c,0),\psi(0,c)\right\}\approx\delta=\frac{g_{1}(0)+g_{0}(0)}{4} by proving that 0>g0(g11(c))>g0(g11(0))00>g_{0}(g_{1}^{-1}(c))>g_{0}(g_{1}^{-1}(0))\approx 0.

Consider ψ(c,0)\psi(c,0) and ψ(0,c)\psi(0,c). Since g1(0)>c,g0(0)<0g_{1}(0)>c,g_{0}(0)<0, we have

g0(g11(0))<g0(g11(c))<g0(0)<0.g_{0}(g_{1}^{-1}(0))<g_{0}(g_{1}^{-1}(c))<g_{0}(0)<0.

Therefore, the only thing left is to show g0(g11(0))0g_{0}(g_{1}^{-1}(0))\approx 0. We divide the rest of the proof into the following three cases.

Case 1.

μ1(1)<μ0(1)<0\mu_{1}^{(1)}<\mu_{0}^{(1)}<0: on client 1, the local classifier is favoring group 0 over group 1, and the positive rate of both groups are under 12\frac{1}{2}.

Clearly, under this case, we have [η1(12),)d𝒫0(1)=[0,)d𝒫0(1)<12\int_{[\eta^{-1}(\frac{1}{2}),\infty)}\,\mathrm{d}{\mathcal{P}}_{0}^{(1)}=\int_{[0,\infty)}\,\mathrm{d}{\mathcal{P}}_{0}^{(1)}<\frac{1}{2}. We select λ<0\lambda^{\prime}<0 such that η1(12+λ)=μ1(1)\eta^{-1}(\frac{1}{2}+\lambda^{\prime})=\mu_{1}^{(1)}. Then we have [η1(12+λ),)d𝒫1(1)=[μ1(1),)d𝒫1(1)=12\int_{[\eta^{-1}(\frac{1}{2}+\lambda^{\prime}),\infty)}\,\mathrm{d}{\mathcal{P}}_{1}^{(1)}=\int_{[\mu_{1}^{(1)},\infty)}\,\mathrm{d}{\mathcal{P}}_{1}^{(1)}=\frac{1}{2}, while [η1(12λ2(1q)),)d𝒫0(1)<0\int_{[\eta^{-1}(\frac{1}{2}-\frac{\lambda^{\prime}}{2(1-q)}),\infty)}\,\mathrm{d}{\mathcal{P}}_{0}^{(1)}<0. Thus we get g1(λ)<0g_{1}(\lambda^{\prime})<0. Combining g1(0)>0g_{1}(0)>0 and intermediate value theorem results in λ<g11(0)<0\lambda^{\prime}<g^{-1}_{1}(0)<0. Then we obtain

μ1(1)=η1(12+λ)<η1(12+g11(0))<0μ1(1)=η1(12λ)>η1(12g11(0))>0(η(x)12 is odd).\begin{split}\mu_{1}^{(1)}&=\eta^{-1}(\frac{1}{2}+\lambda^{\prime})<\eta^{-1}(\frac{1}{2}+g^{-1}_{1}(0))<0\\ -\mu_{1}^{(1)}&=\eta^{-1}(\frac{1}{2}-\lambda^{\prime})>\eta^{-1}(\frac{1}{2}-g^{-1}_{1}(0))>0~{}~{}~{}~{}~{}\text{($\eta(x)-\frac{1}{2}$ is odd)}\end{split}. (44)

By plugging (44) into (42), we have the other side of g0(g11(0))<0g_{0}(g_{1}^{-1}(0))<0 is bounded by g0(λ)=Φ(μ1(1)μ1(1)σ(1))Φ(μ1(1)μ0(1)σ(1))g_{0}(\lambda^{\prime})=\Phi(\frac{\mu_{1}^{(1)}-\mu_{1}^{(1)}}{\sigma^{(1)}})-\Phi(\frac{-\mu_{1}^{(1)}-\mu_{0}^{(1)}}{\sigma^{(1)}}). Since σ(1)|μ1(1)|,|μ1(0)|\sigma^{(1)}\gg|\mu_{1}^{(1)}|,|\mu_{1}^{(0)}|, we get g0(g11(0))0g_{0}(g^{-1}_{1}(0))\approx 0.

Case 2.

0<μ1(1)<μ0(1)0<\mu_{1}^{(1)}<\mu_{0}^{(1)}: on client 1, the local classifier is favoring group 0 over group 1, and the positive rate of both groups are above 12\frac{1}{2}.

This proof of this case is similar to Case 1.

Case 3.

μ1(1)<0<μ0(1)\mu_{1}^{(1)}<0<\mu_{0}^{(1)}: with respect to the local classifier trained by client 1, the positive rate of group 0 is above 12\frac{1}{2} while that of group 1 is under 12\frac{1}{2}.

Without any loss of generality, we assume |μ1(1)|<|μ0(1)||\mu_{1}^{(1)}|<|\mu_{0}^{(1)}|. Select λ′′<0\lambda^{\prime\prime}<0 such that η(12λ′′)=μ0(1).\eta(\frac{1}{2}-\lambda^{\prime\prime})=\mu_{0}^{(1)}. Clearly [η1(12λ′′),+)d𝒫0(1)=[μ0(1),+)d𝒫0(1)=12\int_{[\eta^{-1}(\frac{1}{2}-\lambda^{\prime\prime}),+\infty)}\,\mathrm{d}{\mathcal{P}}_{0}^{(1)}=\int_{[\mu_{0}^{(1)},+\infty)}\,\mathrm{d}{\mathcal{P}}_{0}^{(1)}=\frac{1}{2}, while

[η1(12+λ′′),+)d𝒫1(1)=[μ0(1),+)d𝒫1(1)>[μ1(1),+)d𝒫1(1)>12.\int_{[\eta^{-1}(\frac{1}{2}+\lambda^{\prime\prime}),+\infty)}\,\mathrm{d}{\mathcal{P}}_{1}^{(1)}=\int_{[-\mu_{0}^{(1)},+\infty)}\,\mathrm{d}{\mathcal{P}}_{1}^{(1)}>\int_{[-\mu_{1}^{(1)},+\infty)}\,\mathrm{d}{\mathcal{P}}_{1}^{(1)}>\frac{1}{2}.

Consequently, we get g1(λ′′)<0g_{1}(\lambda^{\prime\prime})<0 and λ′′<g11(0)<0\lambda^{\prime\prime}<g_{1}^{-1}(0)<0. Then we draw the same conclusion as (44). Therefore, g0(g11(0))0g_{0}(g_{1}^{-1}(0))\approx 0.

Combining all three cases above, we get δ>0\delta>0. Then applying Lemma 10 we complete the proof. ∎

A.2.3 Comparison between LFT+Ensemble and CFL

In this section, we compare the performance of LFT+Ensemble and CFL. In Lemma 12 and Lemma 13, we illustrate the conditions for LFT+Ensemble to have the same performance as CFL. In Lemma 14, Lemma 15 and Lemma 16, we illustrate the scenarios when CFL outperforms LFT+Ensemble.

To do the comparison, first we introduce some additional notations. Define the accuracy of a classifier ff as

Acc(f)=(y^=y)=(y=0)𝔼x,ay=0[1f(x,a)]+(y=1)𝔼x,ay=0f(x,a).\text{Acc}(f)={\mathbb{P}}(\hat{{\mathrm{y}}}={\mathrm{y}})={\mathbb{P}}({\mathrm{y}}=0){\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}\mid{\mathrm{y}}=0}[1-f({\mathrm{x}},{\mathrm{a}})]+{\mathbb{P}}({\mathrm{y}}=1){\mathbb{E}}_{{\mathrm{x}},{\mathrm{a}}\mid{\mathrm{y}}=0}f({\mathrm{x}},{\mathrm{a}}).

Given the required global demographic disparity ε{\varepsilon}, define the performance of fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} as:

LFT+Ensemble(ε0,ε1;ε)={Acc(fε0,ε1LFT+Ensemble)DP Disp(fε0,ε1LFT+Ensemble)ε0o.w.,\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})=\begin{cases}\text{Acc}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})&\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})\leq{\varepsilon}\\ 0&\emph{o.w}.\end{cases}, (45)

and define performance of fεCFLf_{{\varepsilon}}^{\text{CFL}} as:

CFL(ε)=Acc(fεCFL).\text{CFL}({\varepsilon})=\text{Acc}(f^{\text{CFL}}_{{\varepsilon}}).

Now we are able to compare the performance between LFT+Ensemble and CFL with the metric LFT+Ensemble(ε0,ε1;ε)\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon}) and CFL(ε)\text{CFL}({\varepsilon}). In particular, we will show that, under some mild conditions, maxε0,ε1LFT+Ensemble(ε0,ε1;ε)<CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})<\text{CFL}({\varepsilon}), which implies the gap between LFT+Ensemble and CFL is inevitable.

We begin with the following two lemmas, which describe the cases that LFT+Ensemble(ε0,ε1;ε)=CFL(ε)\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})=\text{CFL}({\varepsilon}).

Lemma 12.

Let q(0,1)q\in(0,1). Given an LFT+Ensemble classifier fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} such that DP Disp(fε0,ε1LFT+Ensemble)ε\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})\leq{\varepsilon} and a CFL classifier fεCFLf_{{\varepsilon}}^{\text{CFL}}, we have LFT+Ensemble(ε0,ε1;ε)CFL(ε)\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})\leq\text{CFL}({\varepsilon}). The equality holds if and only if λε0LFT+Ensemble0=λε1LFT+Ensemble1=λεCFL\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=\lambda_{{\varepsilon}}^{\text{CFL}}, where λεCFL\lambda_{{\varepsilon}}^{\text{CFL}} is defined in (29), λε0LFT+Ensemble0,λε1LFT+Ensemble1\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}},\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}} are defined in (31).

Proof.

The proof is straightforward. Clearly, since fεCFLf_{{\varepsilon}}^{\text{CFL}} is the optimizer to CFL(ε{\varepsilon}), we have LFT+Ensemble(ε0,ε1;ε)CFL(ε)\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})\leq\text{CFL}({\varepsilon}). By Lemma 7, according to the form of the solution to CFL(ε{\varepsilon}), fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} is the solution to CFL(ε{\varepsilon}) if and only if λε0LFT+Ensemble0=λε1LFT+Ensemble1=λεCFL\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=\lambda_{{\varepsilon}}^{\text{CFL}}. Thus complete the proof.

Lemma 13.

Let q(0,1)q\in(0,1). If the ERM is already fair, i.e., ε|g(0)|{\varepsilon}\geq|g(0)|, then

maxε0,ε1LFT+Ensemble(ε0,ε1;ε)=CFL(ε).\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})=\text{CFL}({\varepsilon}).
Proof.

Since the ERM is already fair, fεCFLf_{{\varepsilon}}^{\text{CFL}} is ERM=η(x)>1/2=\llbracket\eta(x)>1/2\rrbracket. Therefore, we take ε0=ε1=1{\varepsilon}_{0}={\varepsilon}_{1}=1, and fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} also equals to ERM. Thus we conclude the lemma.

The next two lemmas describes the cases that LFT+Ensemble(ε0,ε1;ε)<CFL(ε)\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})<\text{CFL}({\varepsilon}).

Lemma 14.

Let q(0,1)q\in(0,1). If g0(0)g1(0)<0g_{0}(0)g_{1}(0)<0, maxε0,ε1LFT+Ensemble(ε0,ε1;ε)<CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})<\text{CFL}({\varepsilon}) for all ε<|g(0)|{\varepsilon}<|g(0)|.

Proof.

In this proof, we only consider the case that g1(0)>0g_{1}(0)>0, |g1(0)||g0(0)||g_{1}(0)|\geq|g_{0}(0)|. The proof for g1(0)>0g_{1}(0)>0 or |g1(0)||g0(0)||g_{1}(0)|\geq|g_{0}(0)| is similar. Next, we divide the proof into two cases.

Case 1.

maxε0,ε1LFT+Ensemble(ε0,ε1;ε)=0\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})=0: LFT+Ensemble cannot achieve ε{\varepsilon} global demographic disparity.

The conclusion holds.

Case 2.

maxε0,ε1LFT+Ensemble(ε0,ε1;ε)>0\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})>0: LFT+Ensemble can achieve ε{\varepsilon} global demographic disparity.

Since ε<|g(0)|{\varepsilon}<|g(0)|, by Lemma 9, we have λεCFL0\lambda_{{\varepsilon}}^{\text{CFL}}\neq 0. Next, we solve fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} by solving the local version of CFL(ε{\varepsilon}). Combining Lemma 9, g1(0)>0g_{1}(0)>0 and g0(0)<0g_{0}(0)<0 yields λε0LFT+Ensemble00,λε1LFT+Ensemble10\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}\geq 0,\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}\leq 0. If λε0LFT+Ensemble0=λε1LFT+Ensemble1\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}, then λε0LFT+Ensemble0=λε1LFT+Ensemble1=0λεCFL\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=0\neq\lambda_{{\varepsilon}}^{\text{CFL}}. Thus, we conclude the lemma by applying Lemma 12. ∎

Remark 7.

Lemma 14 implies that if ERM is favoring different groups in different clients, there exists an inevitable gap between the performance of LFT+Ensemble and that of CFL.

Lemma 15.

Let q(0,1)q\in(0,1). Let τ=min{|g(0)|,max{sign(g(0))g(g01(0)),sign(g(0))g(g11(0))}}\tau=\min\big{\{}|g(0)|,\max\{\operatorname{sign}(g(0))g(g_{0}^{-1}(0)),\operatorname{sign}(g(0))g(g_{1}^{-1}(0))\}\big{\}}.

If g0(0)g1(0)>0g_{0}(0)g_{1}(0)>0, we have

maxε0,ε1LFT+Ensemble(ε0,ε1;ε){=CFL(ε)for all ετ<CFL(ε)o.w..\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})\begin{cases}=\text{CFL}({\varepsilon})&\text{for all }{\varepsilon}\geq\tau\\ <\text{CFL}({\varepsilon})&\emph{o.w}.\end{cases}. (46)
Proof.

Without any loss of generality, assume g0(0),g1(0)>0g_{0}(0),g_{1}(0)>0. Then by Lemma 9, we have λε0LFT+Ensemble00,λε1LFT+Ensemble10\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}\leq 0,\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}\leq 0 for all ε0,ε1[0,1]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1], and g(0)=(g0(0)+g1(0))/2>0g(0)=(g_{0}(0)+g_{1}(0))/2>0.

To study the performance of LFT+Ensemble when g0(0)g1(0)<0g_{0}(0)g_{1}(0)<0, recall that we use fiεif_{i}^{{\varepsilon}_{i}} to denote the local classifier trained by client ii in LFT+Ensemble analysis. Therefore, fi0f_{i}^{0} is the local classifier trained by client ii that achieves perfect local fairness.

Next, we discuss two cases to prove the result. In Case 1, we will show that τ=0\tau=0, and then prove that maxε0,ε1LFT+Ensemble(ε0,ε1;ε)=CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})=\text{CFL}({\varepsilon}); in Case 2, we will show that τ>0\tau>0, and then prove that maxε0,ε1LFT+Ensemble(ε0,ε1;ε)<CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})<\text{CFL}({\varepsilon}) when ε<τ{\varepsilon}<\tau.

Case 1.

f00=f10f_{0}^{0}=f_{1}^{0}: the two local classifiers that achieve perfect local fairness are equal.

When εg(0){\varepsilon}\geq g(0), the conclusion holds by directly applying Lemma 13. Therefore, in what follows, we focus on the case that ε<g(0){\varepsilon}<g(0).

Next, we will first, show that when ε=0{\varepsilon}=0, λ0LFT+Ensemble0=λ0LFT+Ensemble1=λ0CFL\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{0}}=\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{1}}=\lambda_{0}^{\mathrm{CFL}}, which implies that f00=f10=f0CFLf_{0}^{0}=f_{1}^{0}=f_{0}^{\mathrm{CFL}}.

Since f00=f10f_{0}^{0}=f_{1}^{0}, we have λ0LFT+Ensemble0=λ0LFT+Ensemble1\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{0}}=\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{1}}, and thus g01(0)=g11(0)g_{0}^{-1}(0)=g_{1}^{-1}(0). Consequently, we get

g(λ0LFT+Ensemble0)=g(λ0LFT+Ensemble1)=g(g01(0))=g0(g01(0))+g1(g11(0))2=0=g(λ0CFL).g(\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{0}})=g(\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{1}})=g(g_{0}^{-1}(0))=\frac{g_{0}(g_{0}^{-1}(0))+g_{1}(g_{1}^{-1}(0))}{2}=0=g(\lambda_{0}^{\mathrm{CFL}}).

By the monotonicity of gg, we have λ0LFT+Ensemble0=λ0LFT+Ensemble1=λ0CFL\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{0}}=\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{1}}=\lambda_{0}^{\mathrm{CFL}} and τ=0\tau=0. Next, we will show that, for ε0{\varepsilon}\neq 0, there also exists ε0,ε1[0,1]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1] such that λε0LFT+Ensemble0=λε1LFT+Ensemble1=λεCFL\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=\lambda_{{\varepsilon}}^{\text{CFL}}, which implies fε0LFT+Ensemble0=fε0LFT+Ensemble1=fεCFLf_{{\varepsilon}_{0}}^{\mathrm{LFT+Ensemble{}}_{0}}=f_{{\varepsilon}_{0}}^{\mathrm{LFT+Ensemble{}}_{1}}=f_{{\varepsilon}}^{\text{CFL}}.

Consider ε0{\varepsilon}\neq 0. Select εi=gi(λεCFL),i=0,1.{\varepsilon}_{i}=g_{i}(\lambda_{{\varepsilon}}^{\text{CFL}}),i=0,1. By Lemma 7 and the monotonicity of gg, we have λ0CFL<λεCFL<0\lambda_{0}^{\mathrm{CFL}}<\lambda_{{\varepsilon}}^{\text{CFL}}<0. Therefore, εi=gi(λεCFL)>gi(λ0CFL)=0{\varepsilon}_{i}=g_{i}(\lambda_{{\varepsilon}}^{\text{CFL}})>g_{i}(\lambda_{0}^{\mathrm{CFL}})=0. By Lemma 7, we get λεiLFT+Ensemblei=gi1(εi)\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}}=g_{i}^{-1}({\varepsilon}_{i}) for i=0,1i=0,1. By the selection of εi{\varepsilon}_{i}, we have λεCFL=gi1(εi)\lambda_{{\varepsilon}}^{\text{CFL}}=g_{i}^{-1}({\varepsilon}_{i}). Therefore, λεCFL=λε0LFT+Ensemble0=λε1LFT+Ensemble1\lambda_{{\varepsilon}}^{\text{CFL}}=\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}} when ε0{\varepsilon}\neq 0.

Combining all the discussion above yields fε0,ε1LFT+Ensemble=fεCFLf_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}=f_{{\varepsilon}}^{\text{CFL}} for all ε<g(0){\varepsilon}<g(0), thus we conclude maxε0,ε1LFT+Ensemble(ε0,ε1;ε)=CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})=\text{CFL}({\varepsilon}) for all ε<g(0){\varepsilon}<g(0). Consequently, the lemma holds under Case 1.

Case 2.

f00f10f_{0}^{0}\neq f_{1}^{0}: the two local classifiers that achieve perfect local fairness are different.

The key idea of this proof is: when MD0(fεCFL),MD1(fεCFL)0\text{MD}_{0}(f_{{\varepsilon}}^{\text{CFL}}),\text{MD}_{1}(f_{{\varepsilon}}^{\text{CFL}})\geq 0, then we can always select ε0=g0(λεCFL)=MD0(fεCFL)>0,ε1=g1(λεCFL)=MD1(fεCFL)>0{\varepsilon}_{0}=g_{0}(\lambda_{{\varepsilon}}^{\text{CFL}})=\text{MD}_{0}(f_{{\varepsilon}}^{\text{CFL}})>0,{\varepsilon}_{1}=g_{1}(\lambda_{{\varepsilon}}^{\text{CFL}})=\text{MD}_{1}(f_{{\varepsilon}}^{\text{CFL}})>0 such that λε0LFT+Ensemble0=g01(ε0)=λεCFL\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=g_{0}^{-1}({\varepsilon}_{0})=\lambda_{{\varepsilon}}^{\text{CFL}} and λε1LFT+Ensemble1=g11(ε1)=λεCFL\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=g_{1}^{-1}({\varepsilon}_{1})=\lambda_{{\varepsilon}}^{\text{CFL}}, thus fε0,ε1LFT+Ensemble=fεCFLf_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}=f_{{\varepsilon}}^{\text{CFL}}; when MD0(fεCFL)MD1(fεCFL)=g0(λεCFL)g1(λεCFL)<0\text{MD}_{0}(f_{{\varepsilon}}^{\text{CFL}})\text{MD}_{1}(f_{{\varepsilon}}^{\text{CFL}})=g_{0}(\lambda_{{\varepsilon}}^{\text{CFL}})g_{1}(\lambda_{{\varepsilon}}^{\text{CFL}})<0, however, by Lemma 9, for all ε0,ε1[0,1]×[0,1]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1]\times[0,1], we have g0(λε0LFT+Ensemble0)g1(λε1LFT+Ensemble1)>0>g0(λεCFL)g1(λεCFL)g_{0}(\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}})g_{1}(\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}})>0>g_{0}(\lambda_{{\varepsilon}}^{\text{CFL}})g_{1}(\lambda_{{\varepsilon}}^{\text{CFL}}), thus there exist i{0,1}i\in\left\{0,1\right\} such that λεiLFT+EnsembleiλεCFL\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}}\neq\lambda_{{\varepsilon}}^{\text{CFL}} and fε0,ε1LFT+EnsemblefεCFLf_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}\neq f_{{\varepsilon}}^{\text{CFL}}. Next, we will give rigorous proof.

Since f00f10f_{0}^{0}\neq f_{1}^{0}, we have λ0LFT+Ensemble0λ0LFT+Ensemble1\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{0}}\neq\lambda_{0}^{\mathrm{LFT+Ensemble{}}_{1}}. By Lemma 7, we get g01(0)g11(0)g_{0}^{-1}(0)\neq g_{1}^{-1}(0). Without any loss of generality, assume g01(0)<g11(0)g_{0}^{-1}(0)<g_{1}^{-1}(0), which implies g1(g01(0))<g1(g11(0))=0g_{1}(g_{0}^{-1}(0))<g_{1}(g_{1}^{-1}(0))=0 and g0(g01(0))=0<g0(g11(0))g_{0}(g_{0}^{-1}(0))=0<g_{0}(g_{1}^{-1}(0)). Thus we get

g1(g01(0))<0<g0(g11(0)).g_{1}(g_{0}^{-1}(0))<0<g_{0}(g_{1}^{-1}(0)).

Combining the inequality above and g=g0+g12g=\frac{g_{0}+g_{1}}{2} yields

g(g01(0))=g1(g01(0))<0=g(g1(0))<g0(g11(0))=g(g11(0)).g(g_{0}^{-1}(0))=g_{1}(g_{0}^{-1}(0))<0=g(g^{-1}(0))<g_{0}(g_{1}^{-1}(0))=g(g_{1}^{-1}(0)). (47)

Thus we have

τ=max{sign(g(0))g(g01(0)),sign(g(0))g(g11(0))}=g(g11(0)).\tau=\max\{\operatorname{sign}(g(0))g(g_{0}^{-1}(0)),\operatorname{sign}(g(0))g(g_{1}^{-1}(0))\}=g(g_{1}^{-1}(0)).

When ε|g(0)|{\varepsilon}\geq|g(0)|, by Lemma 13, clearly we have maxε0,ε1LFT+Ensemble(ε0,ε1;ε)=CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})=\text{CFL}({\varepsilon}).

For the other case ε<|g(0)|{\varepsilon}<|g(0)|, by Lemma 9 we have λ=g1(ε)\lambda=g^{-1}({\varepsilon}). Similar to Case 1, in order to achieve λε0LFT+Ensemble0=λε1LFT+Ensemble1=λεCFL\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=\lambda_{{\varepsilon}}^{\text{CFL}}, we select εi=gi(λεCFL){\varepsilon}_{i}=g_{i}(\lambda_{{\varepsilon}}^{\text{CFL}}).

When ε<g(g11(0)){\varepsilon}<g(g_{1}^{-1}(0)),

g1(λεCFL)=g1(g1(ε))<g1(g1(g(g11(0))))=0.g_{1}(\lambda_{{\varepsilon}}^{\text{CFL}})=g_{1}(g^{-1}({\varepsilon}))<g_{1}(g^{-1}(g(g_{1}^{-1}(0))))=0.

Since g1(0)>0g_{1}(0)>0, by Lemma 9 we have g1(λε1LFT+Ensemble1)0g_{1}(\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}})\geq 0, from the monotonicity of g1g_{1} we conclude λε1LFT+Ensemble1λεCFL\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}\neq\lambda_{{\varepsilon}}^{\text{CFL}}. By Lemma 12, we have maxε0,ε1LFT+Ensemble(ε0,ε1;ε)<CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})<\text{CFL}({\varepsilon}) for all ετ{\varepsilon}\leq\tau.

When εg(g11(0)){\varepsilon}\geq g(g_{1}^{-1}(0)), applying (47) we have

gi(0)>εi=gi(λ)=gi(g1(ε))gi(g1(g(gi1(0))))=0,g_{i}(0)>{\varepsilon}_{i}=g_{i}(\lambda)=g_{i}(g^{-1}({\varepsilon}))\geq g_{i}(g^{-1}(g(g_{i}^{-1}(0))))=0,

where the first inequality comes from Case 1. Thus, λi=g1(εi)=λ\lambda_{i}=g^{-1}({\varepsilon}_{i})=\lambda as desired, and we obtain fε0,ε1LFT+Ensemble=fεCFLf_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}=f_{{\varepsilon}}^{\text{CFL}}.

Combining both cases above yields the desired conclusion. ∎

Remark 8.

Recall that we use fiεif_{i}^{{\varepsilon}_{i}} to denote the local classifier trained by client ii in LFT+Ensemble analysis. In the expression of τ\tau:

min{|g(0)|,max{sign(g(0))g(g01(0)),sign(g(0))g(g11(0))}},\min\big{\{}|g(0)|,\max\{\operatorname{sign}(g(0))g(g_{0}^{-1}(0)),\operatorname{sign}(g(0))g(g_{1}^{-1}(0))\}\big{\}}, (48)

|g(0)||g(0)| is the demographic disparity of ERM, sign(g(0))g(g01(0))\operatorname{sign}(g(0))g(g_{0}^{-1}(0)) is the demographic disparity of local classifier f00f_{0}^{0}, and sign(g(0))g(g11(0))\operatorname{sign}(g(0))g(g_{1}^{-1}(0)) is the demographic disparity of local classifier f10f_{1}^{0}. According to the proof of Lemma 15, we obtain max{sign(g(0))g(g01(0)),sign(g(0))g(g11(0))}>0\max\{\operatorname{sign}(g(0))g(g_{0}^{-1}(0)),\operatorname{sign}(g(0))g(g_{1}^{-1}(0))\}>0 if and only if two local classifiers which achieves perfect local fairness is equal, i.e., f00=f10f_{0}^{0}=f_{1}^{0}. Therefore, Lemma 15 implies that, if the ERM is favoring the same group on different clients and the two local classifiers which achieve perfect local fairness are unequal, then LFT+Ensemble performs strictly worse than CFL when the required demographic disparity is smaller than a certain value.

So far we assume aBern(q){\mathrm{a}}\sim\operatorname{Bern}(q) for both client 0 and client 1. When both clients do not share the same qq, we can conclude that CFL outperforms LFT+Ensemble in the following lemma.

Lemma 16.

Assume aBern(qi){\mathrm{a}}\sim\operatorname{Bern}(q_{i}) in client ii, and q0q1(0,1)q_{0}\neq q_{1}\in(0,1). Then maxε0,ε1LFT+Ensemble(ε0,ε1;ε)<CFL(ε)\max_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{LFT+Ensemble{}}({\varepsilon}_{0},{\varepsilon}_{1};{\varepsilon})<\text{CFL}({\varepsilon}) for all ε<|g(0)|{\varepsilon}<|g(0)|.

Proof.

We assemble the dataset from two clients to have xa=a𝒫a=q0q0+q1𝒫a(0)+q1q0+q1𝒫a(1){\mathrm{x}}\mid{\mathrm{a}}=a\sim{\mathcal{P}}_{a}=\frac{q_{0}}{q_{0}+q_{1}}{\mathcal{P}}_{a}^{(0)}+\frac{q_{1}}{q_{0}+q_{1}}{\mathcal{P}}_{a}^{(1)}, a=0,1a=0,1. We let fεCFLf_{{\varepsilon}}^{\text{CFL}} be the solution to CFL(ε{\varepsilon}), with q=q0+q12q=\frac{q_{0}+q_{1}}{2}:

fεCFL=s(x,a)>0,f_{{\varepsilon}}^{\text{CFL}}=\llbracket s(x,a)>0\rrbracket,
where s(x,0)=2η(x)1+λεCFL1q,s(x,1)=2η(x)1λεCFLq.\text{where }s(x,0)=2\eta(x)-1+\frac{\lambda_{{\varepsilon}}^{\text{CFL}}}{1-q},\quad s(x,1)=2\eta(x)-1-\frac{\lambda_{{\varepsilon}}^{\text{CFL}}}{q}.

Given an LFT+Ensemble classifier fε0,ε1LFT+Ensemble=(f0ε0+f1ε1)/2f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}=(f_{0}^{{\varepsilon}_{0}}+f_{1}^{{\varepsilon}_{1}})/2 such that DP Disp(fε0,ε1LFT+Ensemble)ε\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})\leq{\varepsilon}, the solution reads

fiεi=si(x,a)>0,f_{i}^{{\varepsilon}_{i}}=\llbracket s_{i}(x,a)>0\rrbracket,
where si(x,0)=2η(x)1+λεiLFT+Ensemblei1qi,si(x,1)=2η(x)1λεiLFT+Ensembleiqi.\text{where }s_{i}(x,0)=2\eta(x)-1+\frac{\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}}}{1-q_{i}},\quad s_{i}(x,1)=2\eta(x)-1-\frac{\lambda_{{\varepsilon}_{i}}^{\text{LFT+Ensemble}_{i}}}{q_{i}}.

We prove the lemma by contradiction argument. If Acc(fε0,ε1LFT+Ensemble)=CFL(ε)\text{Acc}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})=\text{CFL}({\varepsilon}), then fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} is a solution to CFL(ε{\varepsilon}) with q=q0+q12q=\frac{q_{0}+q_{1}}{2}. Since ε<|g(0)|{\varepsilon}<|g(0)|, by Lemma 9 we have λεCFL0\lambda_{{\varepsilon}}^{\text{CFL}}\neq 0. Without any loss of generality, assume λεCFL<0\lambda_{{\varepsilon}}^{\text{CFL}}<0. Below we discuss three cases.

Case 1.

λε0LFT+Ensemble0=λε1LFT+Ensemble1=0\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=0: the LFT+Ensemble classifier is ERM.

In this case, f0ε0=f1ε1f_{0}^{{\varepsilon}_{0}}=f_{1}^{{\varepsilon}_{1}}. We have fε0,ε1LFT+Ensemble(x,0)=1f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}(x,0)=1 for η(x)>12\eta(x)>\frac{1}{2}, and fεCFL=0f_{{\varepsilon}}^{\text{CFL}}=0 for η(x)<(1λεCFL/(1q))/2\eta(x)<(1-\lambda_{{\varepsilon}}^{\text{CFL}}/(1-q))/2. By Lemma 8, fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} is not a solution to CFL(ε{\varepsilon}).

Case 2.

λε0LFT+Ensemble00\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}\neq 0 or λε1LFT+Ensemble10\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}\neq 0, and λε0LFT+Ensemble0λε1LFT+Ensemble1=0\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=0: the LFT+Ensemble classifier is not ERM, but one of the local classifier is ERM.

Without any loss of generality, let λε0LFT+Ensemble0=0\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}=0, λε1LFT+Ensemble1<0\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}<0. Then f0ε0(x,0)=1f_{0}^{{\varepsilon}_{0}}(x,0)=1 for η(x)>12\eta(x)>\frac{1}{2}, while f1ε1(x,0)=0f_{1}^{{\varepsilon}_{1}}(x,0)=0 for η(x)<(1λε0LFT+Ensemble0(1q1))/2.\eta(x)<(1-\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}(1-q_{1}))/2. Thus we get

fε0,ε1LFT+Ensemble(x,0)=12 for 12<η(x)<(1λε1LFT+Ensemble1(1q1))/2.f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}(x,0)=\frac{1}{2}\text{ for }\frac{1}{2}<\eta(x)<(1-\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}(1-q_{1}))/2.

By Lemma 8, fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} is not a solution to CFL(ε{\varepsilon}).

Case 3.

λε0LFT+Ensemble0λε1LFT+Ensemble10\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}\neq 0: The local classifiers are not ERM.

When λε0LFT+Ensemble01q0λε1LFT+Ensemble11q1\frac{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}{1-q_{0}}\neq\frac{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}}{1-q_{1}}, without loss of generality, let λε0LFT+Ensemble01q0>λε1LFT+Ensemble11q1\frac{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}{1-q_{0}}>\frac{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}}{1-q_{1}}. Then by the same argument in Case 2, we have

fε0,ε1LFT+Ensemble(x,0)=12 for 1λε0LFT+Ensemble01q02<η(x)<1λε1LFT+Ensemble11q12.f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}}(x,0)=\frac{1}{2}\text{ for }\frac{1-\frac{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}{1-q_{0}}}{2}<\eta(x)<\frac{1-\frac{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}}{1-q_{1}}}{2}.

By Lemma 8, fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} is not a solution to CFL(ε{\varepsilon}).

When λε0LFT+Ensemble0q0λε1LFT+Ensemble1q1\frac{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}{q_{0}}\neq\frac{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}}{q_{1}}, similarly, fε0,ε1LFT+Ensemblef_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}} is not a solution to CFL(ε{\varepsilon}).

When λε0LFT+Ensemble0q0=λε1LFT+Ensemble1q1\frac{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}{q_{0}}=\frac{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}}{q_{1}} and λε0LFT+Ensemble01q0=λε1LFT+Ensemble11q1\frac{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}{1-q_{0}}=\frac{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}}{1-q_{1}}, since λε0LFT+Ensemble0λε1LFT+Ensemble10\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}\neq 0, we have

q0λε0LFT+Ensemble0=q1λε1LFT+Ensemble1,1q0λε0LFT+Ensemble0=1q1λε1LFT+Ensemble1,\frac{q_{0}}{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}=\frac{q_{1}}{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}},\quad\frac{1-q_{0}}{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}=\frac{1-q_{1}}{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}},

which leads to 1λε0LFT+Ensemble0=1λε1LFT+Ensemble1\frac{1}{\lambda_{{\varepsilon}_{0}}^{\text{LFT+Ensemble}_{0}}}=\frac{1}{\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}} and thus q0=q1q_{0}=q_{1}. This contradicts with the assumption that q0q1q_{0}\neq q_{1}.

Combining all three cases yields desired conclusion.

A.3 LFT+FedAvg analysis

In this section, we analyze LFT+FedAvg for the case of two clients. For purpose of illustration, with I=2I=2, we denote fε0,ε1LFT+FedAvg=f𝜺LFT+FedAvgf_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}}=f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}} to be the solution to LFT+FedAvg(𝜺)(\bm{{\varepsilon}}). In Sec. A.3.1, we present a formal version of Thm. 3 and show that compared to LFT+Ensemble, LFT+FedAvg has strictly higher fairness. In Sec. A.3.2 we derive the solution to LFT+FedAvg(𝜺)(\bm{{\varepsilon}}), and show it is equivalent to LFT+FedAvg. In Sec. A.3.3, we analyze the limitation of LFT+FedAvg and present a formal version of Lemma 4.

A.3.1 Improve fairness via federated learning

Different to LFT+Ensemble, LFT+FedAvg can reach any ε{\varepsilon} demographic disparity:

Theorem 17 (Formal version of Thm. 3 under two clients cases).

Let q(0,1)q\in(0,1). For all ε[0,1]{\varepsilon}\in[0,1], there exists ε0,ε1[0,1]{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1] such that DP Disp(fε0,ε1LFT+FedAvg)ε\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})\leq{\varepsilon}. Thus under the condition in Lemma 10, we have

minε0,ε1[0,1]DP Disp(fε0,ε1LFT+Ensemble)>minε0,ε1[0,1]DP Disp(fε0,ε1LFT+FedAvg)=0.\min_{{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1]}\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+Ensemble}})>\min_{{\varepsilon}_{0},{\varepsilon}_{1}\in[0,1]}\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})=0.
Proof.

For any ε[0,1]{\varepsilon}\in[0,1], let ε0=ε1=ε{\varepsilon}_{0}={\varepsilon}_{1}={\varepsilon}. Then the global DP disparity becomes

DP Disp(fε0,ε1LFT+FedAvg)\displaystyle\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}}) =|𝔼xa=0f(x,0)𝔼xa=1f(x,1)|\displaystyle=|{\mathbb{E}}_{{\mathrm{x}}\mid{\mathrm{a}}=0}f({\mathrm{x}},0)-{\mathbb{E}}_{{\mathrm{x}}\mid{\mathrm{a}}=1}f({\mathrm{x}},1)| (49)
=|(𝒳f(x,0)d𝒫0(0)+𝒳f(x,0)d𝒫0(1))/2\displaystyle=|(\int_{{\mathcal{X}}}f(x,0)\,d{\mathcal{P}}_{0}^{(0)}+\int_{{\mathcal{X}}}f(x,0)\,d{\mathcal{P}}_{0}^{(1)})/2
(𝒳f(x,1)d𝒫1(0)+𝒳f(x,1)d𝒫1(1))/2|\displaystyle-(\int_{{\mathcal{X}}}f(x,1)\,d{\mathcal{P}}_{1}^{(0)}+\int_{{\mathcal{X}}}f(x,1)\,d{\mathcal{P}}_{1}^{(1)})/2|
=|MD0(fε0,ε1LFT+FedAvg)/2+MD1(fε0,ε1LFT+FedAvg)/2|\displaystyle=|\text{MD}_{0}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})/2+\text{MD}_{1}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})/2|
DP Disp0(fε0,ε1LFT+FedAvg)/2+DP Disp1(fε0,ε1LFT+FedAvg)/2\displaystyle\leq\text{DP Disp}_{0}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})/2+\text{DP Disp}_{1}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})/2
(ε0+ε1)/2=ε.\displaystyle\leq({\varepsilon}_{0}+{\varepsilon}_{1})/2={\varepsilon}.

A.3.2 The best classifier of LFT+FedAvg

For LFT+FedAvg, we directly consider multi-client cases. To visualize the gap between LFT+Ensemble, LFT+FedAvg, and CFL, in our numerical experiments, we draw finite samples from Gaussian distribution, and then we optimize the empirical risk with the fairness constraint to obtain the classifier trained by LFT+FedAvg. The following lemma provides the solution to LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) when 𝒳{\mathcal{X}} is finite.

Lemma 18.

For finite 𝒳{\mathcal{X}}, the solution to LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) is given by

f(x,a)\displaystyle f(x,a) =i[I]si(x,a)pa(i)(x)>0,\displaystyle=\llbracket\sum_{i\in[I]}s_{i}(x,a)p^{(i)}_{a}(x)>0\rrbracket, (50)

where si(x,a)=2η(x)1+Iλia=01qIλia=1qs_{i}(x,a)=2\eta(x)-1+I\lambda_{i}\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-I\lambda_{i}\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}, for certain λ0,λI1\lambda_{0},\dots\lambda_{I-1}\in{\mathbb{R}}.

Proof.

This proof is based on Menon and Williamson [2018]. The key idea of this proof is to use the Lagrangian approach. Before we applying the Lagrangian approach, we will show that LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) is expressible as a linear program, and thus the strong duality holds.

Since 𝒳{\mathcal{X}} is finite, ff is a vector of finite dimension. Based on the proof of Lemma 7, the error rate can be written as

(y^y)=x𝒳,a𝒜f(x,a)(12η(x))(x=x,a=a)+(y=1),{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}})=\sum_{x\in{\mathcal{X}},a\in{\mathcal{A}}}f(x,a)(1-2\eta(x)){\mathbb{P}}({\mathrm{x}}=x,{\mathrm{a}}=a)+{\mathbb{P}}({\mathrm{y}}=1),

and the fairness constraints can be written as

(y^=1a=0,i=i)(y^=1a=1,i=i)=x𝒳[f(x,0)(x=xa=0,i=i)(a=0)f(x,1)(x=xa=1,i=i)(a=1)],\begin{split}&{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i)\\ &=\sum_{x\in{\mathcal{X}}}\left[f(x,0)\frac{{\mathbb{P}}({\mathrm{x}}=x\mid{\mathrm{a}}=0,{\mathrm{i}}=i)}{{\mathbb{P}}({\mathrm{a}}=0)}-f(x,1)\frac{{\mathbb{P}}({\mathrm{x}}=x\mid{\mathrm{a}}=1,{\mathrm{i}}=i)}{{\mathbb{P}}({\mathrm{a}}=1)}\right],\end{split} (51)

for i[I]i\in[I]. Let u(x,a)=(12η(x))(x=x,a=a)u(x,a)=(1-2\eta(x)){\mathbb{P}}({\mathrm{x}}=x,{\mathrm{a}}=a), u=(y=1)u^{\prime}={\mathbb{P}}({\mathrm{y}}=1), and vi(x,a)=(a=0a=1)(x=xa=a,i=i)/(a=a)v_{i}(x,a)=(\llbracket a=0\rrbracket-\llbracket a=1\rrbracket){\mathbb{P}}({\mathrm{x}}=x\mid{\mathrm{a}}=a,{\mathrm{i}}=i)/{\mathbb{P}}({\mathrm{a}}=a), for x𝒳,a𝒜,i[I]x\in{\mathcal{X}},a\in{\mathcal{A}},i\in[I]. Note that u,v0,v1u,v_{0},v_{1} are vectors of the same dimension of ff. For ease of notation, we allow \leq to be applied to pairs of vectors in an element-wise manner. Therefore, the optimization is

minf\displaystyle\min_{f} uf+u\displaystyle~{}u^{\top}f+u^{\prime} (52)
s.t.\displaystyle\mathrm{s.t}.~{} vifεi\displaystyle v_{i}^{\top}f\leq{\varepsilon}_{i} (53)
0f1,\displaystyle 0\leq f\leq 1, (54)

which is a linear objective with linear constraint. Therefore, the strong duality holds for LFT+FedAvg(𝜺)(\bm{{\varepsilon}}). Next, we apply Lagrangian approach to solve the LFT+FedAvg(𝜺)(\bm{{\varepsilon}}).

Recall that

(y^y)=\displaystyle{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}})= 1I𝔼a𝔼x𝒫a(i)f(x,a)(12η(x))+(y=1),\displaystyle\frac{1}{I}{\mathbb{E}}_{{\mathrm{a}}}{\mathbb{E}}_{{\mathrm{x}}\sim{\mathcal{P}}^{(i)}_{{\mathrm{a}}}}f({\mathrm{x}},{\mathrm{a}})(1-2\eta({\mathrm{x}}))+{\mathbb{P}}({\mathrm{y}}=1), (55)

and

(y^=1a=0,i=i)(y^=1a=1,i=i)\displaystyle{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i) (56)
=\displaystyle= 𝔼a𝔼x𝒫a(i)[f(x,0)a=01qf(x,1)a=1q].\displaystyle{\mathbb{E}}_{{\mathrm{a}}}{\mathbb{E}}_{{\mathrm{x}}\sim{\mathcal{P}}^{(i)}_{{\mathrm{a}}}}\left[f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\right].

By strong duality, for λ0,,λ2I10\lambda_{0}^{\prime},\dots,\lambda_{2I-1}^{\prime}\geq 0, the corresponding Lagrangian version of LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) is

minf(y^y)i[I][λ2i[(y^=1a=0,i=i)(y^=1a=1,i=i)εi]\displaystyle\min_{f\in{\mathcal{F}}}{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}})-\sum_{i\in[I]}\Big{[}\lambda_{2i}^{\prime}[{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i)-{\varepsilon}_{i}] (57)
+λ2i+1[εi(y^=1a=0,i=i)(y^=1a=1,i=i)]]\displaystyle\quad\quad\quad\quad\quad\quad+\lambda_{2i+1}^{\prime}[{\varepsilon}_{i}-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i)]\Big{]} (58)

Let λi=λ2iλ2i+1\lambda_{i}=\lambda_{2i}^{\prime}-\lambda_{2i+1}^{\prime}, i[I]i\in[I], then we get

(57)=\displaystyle~{}(\ref{eq:lagrangianffl})= minf𝔼a𝔼x𝒫a(i)[1If(x,a)(12η(x))i[I](λif(x,0)a=01qλif(x,1)a=1q)]\displaystyle\min_{f\in{\mathcal{F}}}{\mathbb{E}}_{{\mathrm{a}}}{\mathbb{E}}_{{\mathrm{x}}\sim{\mathcal{P}}^{(i)}_{{\mathrm{a}}}}\bigg{[}\frac{1}{I}f({\mathrm{x}},{\mathrm{a}})(1-2\eta({\mathrm{x}}))-\sum_{i\in[I]}\Big{(}\lambda_{i}f({\mathrm{x}},0)\frac{\llbracket{\mathrm{a}}=0\rrbracket}{1-q}-\lambda_{i}f({\mathrm{x}},1)\frac{\llbracket{\mathrm{a}}=1\rrbracket}{q}\Big{)}\bigg{]} (59)
=\displaystyle= minf𝒳a𝒜1If(x,a)[i[I]si(x,a)pa(i)(x)]dx.\displaystyle\min_{f\in{\mathcal{F}}}\int_{{\mathcal{X}}}\sum_{a\in{\mathcal{A}}}-\frac{1}{I}f(x,a)[\sum_{i\in[I]}s_{i}(x,a)p^{(i)}_{a}(x)]\mathrm{d}x. (60)

where sis_{i} is defined in Lemma 18. Thus the above equation reaches its minimum at

f(x,a)=i[I]si(x,a)pa(i)(x)>0.f(x,a)=\llbracket\sum_{i\in[I]}s_{i}(x,a)p^{(i)}_{a}(x)>0\rrbracket.

Remark 9.

Based on the proof of Lemma 18LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) is equivalent to solving

minf(y^y)i=0I1λi((y^=1a=0,i=i)(y^=1a=1,i=i)).\min_{f\in{\mathcal{F}}}{\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}})-\sum_{i=0}^{I-1}\lambda_{i}({\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i)). (61)

Under certain conditions (assumptions 1 to 4 in Li et al. [2020b]), we have solving LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) is equivalent to minimizing

(y^yi=i)λi((y^=1a=0,i=i)(y^=1a=1,i=i)){\mathbb{P}}(\hat{{\mathrm{y}}}\neq{\mathrm{y}}\mid{\mathrm{i}}=i)-\lambda_{i}({\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=0,{\mathrm{i}}=i)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\mathrm{a}}=1,{\mathrm{i}}=i))

locally and applying FedAvg (Theorem 1 in Li et al. [2020b]).

A.3.3 Comparison of LFT+FedAvg and CFL

Lemma 19 (Formal version of Lemma 4 under two clients cases).

When ai=0Bern(0),ai=1Bern(1){\mathrm{a}}\mid{\mathrm{i}}=0\sim\operatorname{Bern}(0),{\mathrm{a}}\mid{\mathrm{i}}=1\sim\operatorname{Bern}(1) and DP Disp(f1CFL)>0\text{DP Disp}(f_{1}^{\text{CFL}})>0, we have

minε0,ε1DP Disp(fε0,ε1LFT+FedAvg)=DP Disp(f1CFL)>minεDP Disp(fεCFL)=0.\min_{{\varepsilon}_{0},{\varepsilon}_{1}}\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})=\text{DP Disp}(f_{1}^{\text{CFL}})>\min_{{\varepsilon}}\text{DP Disp}(f_{{\varepsilon}}^{\text{CFL}})=0.
Proof.

Since ai=0Bern(0),ai=1Bern(1){\mathrm{a}}\mid{\mathrm{i}}=0\sim\operatorname{Bern}(0),{\mathrm{a}}\mid{\mathrm{i}}=1\sim\operatorname{Bern}(1), the constraints in LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) vanish. When ε=1{\varepsilon}=1, the constraint in CFL(ε{\varepsilon}) always holds and thus also vanishes. Thus in such scenario the solution to LFT+FedAvg(𝜺)(\bm{{\varepsilon}}) becomes f1CFLf_{1}^{\text{CFL}}. Then from the assumption we have

DP Disp(fε0,ε1LFT+FedAvg)=DP Disp(f1CFL)>DP Disp(f0CFL)=0.\text{DP Disp}(f_{{\varepsilon}_{0},{\varepsilon}_{1}}^{\text{LFT+FedAvg}})=\text{DP Disp}(f_{1}^{\text{CFL}})>\text{DP Disp}(f_{0}^{\text{CFL}})=0.

A.4 Extension to multi-client cases

In this subsection, we perform the analysis of LFT+Ensemble and LFT+FedAvg for the multi-client cases. We present a more general version of Lemma 10, Thm. 17 and Lemma 19.

The following lemma shows the fundamental limitation of LFT+Ensemble:

Lemma 20 (Formal version of Lemma 1).

Let q(0,1)q\in(0,1). Consider a partition which divides II clients into two subsets. Denote the mixture distribution of each subset as xa=a,j=j𝒫~a(j){\mathrm{x}}\mid{\mathrm{a}}=a,{\mathrm{j}}=j\sim\tilde{{\mathcal{P}}}_{a}^{(j)}, where jj is the index of the subset, and 𝒫~a(j)\tilde{{\mathcal{P}}}_{a}^{(j)} is a distribution, for a,j=0,1a,j=0,1. Similar to two clients case, define g~j(λ)=[η1(12λ2(1q)),+)d𝒫~0(j)[η1(12+λ2q),+)d𝒫~1(j)\tilde{g}_{j}(\lambda)=\int_{[\eta^{-1}(\frac{1}{2}-\frac{\lambda}{2(1-q)}),+\infty)}\,\mathrm{d}\tilde{{\mathcal{P}}}_{0}^{(j)}-\int_{[\eta^{-1}(\frac{1}{2}+\frac{\lambda}{2q}),+\infty)}\,\mathrm{d}\tilde{{\mathcal{P}}}_{1}^{(j)}. Consider the case that qi=qq_{i}=q for all i[I]i\in[I] and q(0,1)q\in(0,1). Denote the proportion of the two subset as J0J_{0} and J1J_{1}, where J0,J1>0J_{0},J_{1}>0 and J0+J1=1J_{0}+J_{1}=1. Let c=min{|g~0(0)|,|g~1(0)|}c=\min\left\{|\tilde{g}_{0}(0)|,|\tilde{g}_{1}(0)|\right\}. Define ψ~:[0,c]×[0,c][1,1]\tilde{\psi}:[0,c]\times[0,c]\rightarrow[-1,1] as

ψ~(ε~0,ε~1)=J0J1g~0(g~11(sign(g~1(0))ε~1))+J0J1g~1(g~01(sign(g~0(0))ε~0))+J02sign(g~0(0))ε~0+J12sign(g~1(0))ε~1.\begin{split}\tilde{\psi}(\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1})&=J_{0}J_{1}\tilde{g}_{0}(\tilde{g}_{1}^{-1}(\operatorname{sign}(\tilde{g}_{1}(0))\tilde{{\varepsilon}}_{1}))+J_{0}J_{1}\tilde{g}_{1}(\tilde{g}_{0}^{-1}(\operatorname{sign}(\tilde{g}_{0}(0))\tilde{{\varepsilon}}_{0}))\\ &+J_{0}^{2}\operatorname{sign}(\tilde{g}_{0}(0))\tilde{{\varepsilon}}_{0}+J_{1}^{2}\operatorname{sign}(\tilde{g}_{1}(0))\tilde{{\varepsilon}}_{1}.\end{split} (62)

If there exists a partition such that g~0(0)g~1(0)<0\tilde{g}_{0}(0)\tilde{g}_{1}(0)<0 and ψ~(ε~0,ε~1)(g~0(0)+g~1(0))>0\tilde{\psi}(\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1})(\tilde{g}_{0}(0)+\tilde{g}_{1}(0))>0 for all ε~0,ε~1[0,c]\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,c], then for all ε~0,ε~1[0,1]\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,1], DP Disp(f𝛆LFT+Ensemble)δ~=min{|ψ~(ε~0,ε~1)|:ε~0,ε~1[0,c]}>0\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}})\geq\tilde{\delta}=\min\{|\tilde{\psi}(\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1})|:\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,c]\}>0.

Proof.

By Lemma 10, we conclude the achievable fairness range of LFT+Ensemble is strictly smaller than that of CFL. Therefore, pooling the datasets in one subset and perform fair learning can clearly achieve a wider range of fairness than perform fair learning on each client individually. Thus, we can consider the two subsets as two clients with uneven amounts of data, which is almost the same case Lemma 10 considers. Therefore, we follow the same proof idea as Lemma 10 to prove our claim.

Denote the assembled classifier trained from two pooled datasets as f~ε~0,ε~1LFT+Ensemble\tilde{f}_{\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}}. Note that the mean difference can be expressed as

MD(f~ε~0,ε~1LFT+Ensemble)=J02g~0(λ~ε~0LFT+Ensemble0)+J0J1g~0(λ~ε~1LFT+Ensemble1)+J0J1g~1(λ~ε~0LFT+Ensemble0)+J12g~1(λ~ε~1LFT+Ensemble1).\text{MD}(\tilde{f}_{\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}})=J_{0}^{2}\tilde{g}_{0}(\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}})+J_{0}J_{1}\tilde{g}_{0}(\tilde{\lambda}_{\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}_{1}})+J_{0}J_{1}\tilde{g}_{1}(\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}})+J_{1}^{2}\tilde{g}_{1}(\tilde{\lambda}_{\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}_{1}}). (63)

In the following proof, we will show, the mean difference cannot reach 0.

Without any loss of generality, assume |g~0(0)|<|g~1(0)||\tilde{g}_{0}(0)|<|\tilde{g}_{1}(0)| and g~1(0)>0\tilde{g}_{1}(0)>0. By g~0(0)g~1(0)<0\tilde{g}_{0}(0)\tilde{g}_{1}(0)<0 and ψ~(ε~0,ε~1)(g~0(0)+g~1(0))>0\tilde{\psi}(\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1})(\tilde{g}_{0}(0)+\tilde{g}_{1}(0))>0 for all ε~0,ε~1[0,c]\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,c], we have g~0(0)<0\tilde{g}_{0}(0)<0 and ψ~(ε~0,ε~1)>0\tilde{\psi}(\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1})>0 for all ε~0,ε~1[0,c]\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,c]. Without any loss of generality, assume J0J1J_{0}\leq J_{1}.

First, we will prove that LFT+Ensemble achieves its lowest mean difference when ε~0,ε~1[0,c]\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,c]. In what follows, we consider five different cases to derive the desired result.

Case 1.

ε~0>|g~0(0)|,ε~1>|g~1(0)|\tilde{{\varepsilon}}_{0}>|\tilde{g}_{0}(0)|,\tilde{{\varepsilon}}_{1}>|\tilde{g}_{1}(0)|: ERM is fair on both clients.

By (31), we have λ~ε~0LFT+Ensemble0=λ~ε~1LFT+Ensemble1=0\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}}=\tilde{\lambda}_{\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}_{1}}=0. Recall g~i()\tilde{g}_{i}(\cdot) is a monotone increasing function, combine g~1(0)>0\tilde{g}_{1}(0)>0 and Lemma 9, and thus g~0(g~11(0))<g~0(0)<0.\tilde{g}_{0}(\tilde{g}_{1}^{-1}(0))<\tilde{g}_{0}(0)<0. Applying the above conclusion yields

(63)=J0g~0(0)+J1g~1(0)>1J1(J02g~0(0)+J12g~1(0)+J0J1g~0(g~11(0)))=1J1ψ~(g~0(0),0)δ~.(\ref{LFT+Ensemble{}_dp_p})=J_{0}\tilde{g}_{0}(0)+J_{1}\tilde{g}_{1}(0)>\frac{1}{J_{1}}\left(J_{0}^{2}\tilde{g}_{0}(0)+J_{1}^{2}\tilde{g}_{1}(0)+J_{0}J_{1}\tilde{g}_{0}(\tilde{g}_{1}^{-1}(0))\right)=\frac{1}{J_{1}}\tilde{\psi}(\tilde{g}_{0}(0),0)\geq\tilde{\delta}.
Case 2.

ε~0|g~0(0)|,ε~1>|g~1(0)|\tilde{{\varepsilon}}_{0}\leq|\tilde{g}_{0}(0)|,\tilde{{\varepsilon}}_{1}>|\tilde{g}_{1}(0)|: ERM is unfair on client 0, but fair on client 1.

Applying (31) results in λε1LFT+Ensemble1=0\lambda_{{\varepsilon}_{1}}^{\text{LFT+Ensemble}_{1}}=0. By the fact that g~i()\tilde{g}_{i}(\cdot) is a strictly monotone increasing function, we have λ~ε~0LFT+Ensemble0=g~01(ε~0)>g~01(g~0(0))=0.\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}}=\tilde{g}_{0}^{-1}(-\tilde{{\varepsilon}}_{0})>\tilde{g}_{0}^{-1}(\tilde{g}_{0}(0))=0. Applying the above conclusion yields

(63)=\displaystyle~{}(\ref{LFT+Ensemble{}_dp_p})= J02ε~0+J12g~1(0)+J0J1g~0(0)+J0J1g~1(λ~ε~0LFT+Ensemble0)\displaystyle-J_{0}^{2}\tilde{{\varepsilon}}_{0}+J_{1}^{2}\tilde{g}_{1}(0)+J_{0}J_{1}\tilde{g}_{0}(0)+J_{0}J_{1}\tilde{g}_{1}(\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}}) (64)
>\displaystyle> J0g~0(0)+J1g~1(0)>1J1ψ~(g~0(0),0)δ~.\displaystyle J_{0}\tilde{g}_{0}(0)+J_{1}\tilde{g}_{1}(0)>\frac{1}{J_{1}}\tilde{\psi}(\tilde{g}_{0}(0),0)\geq\tilde{\delta}.

In the first equality we used λ~ε~0LFT+Ensemble0>0,g~1(λ~ε~0LFT+Ensemble0)>g~1(0),g~0(0)<ε~0\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}}>0,\tilde{g}_{1}(\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}})>\tilde{g}_{1}(0),\tilde{g}_{0}(0)<-\tilde{{\varepsilon}}_{0}.

Case 3.

ε~0|g~0(0)|,ε~1|g~1(0)|\tilde{{\varepsilon}}_{0}\leq|\tilde{g}_{0}(0)|,\tilde{{\varepsilon}}_{1}\leq|\tilde{g}_{1}(0)|: ERM is unfair on both client 0 and client 1.

Applying (31) we have λ~ε~0LFT+Ensemble0=g~01(ε~0),λ~ε~1LFT+Ensemble1=g~11(ε~1)\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}}=\tilde{g}_{0}^{-1}(-\tilde{{\varepsilon}}_{0}),\tilde{\lambda}_{\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}_{1}}=\tilde{g}_{1}^{-1}(\tilde{{\varepsilon}}_{1}). Then we have

(63)\displaystyle~{}(\ref{LFT+Ensemble{}_dp_p}) =J02ε~0+J12ε~1+J0J1g~0(g~11(ε~1))+J0J1g~1(g~01(ε~0))\displaystyle=-J_{0}^{2}\tilde{{\varepsilon}}_{0}+J_{1}^{2}\tilde{{\varepsilon}}_{1}+J_{0}J_{1}\tilde{g}_{0}(\tilde{g}_{1}^{-1}(\tilde{{\varepsilon}}_{1}))+J_{0}J_{1}\tilde{g}_{1}(\tilde{g}_{0}^{-1}(-\tilde{{\varepsilon}}_{0})) (65)
J02ε~0+J12ε~0+J0J1g~0(g~11(ε~0))+J0J1g~1(g~01(ε~0))=ψ~(ε~0,ε~0)δ~.\displaystyle\geq-J_{0}^{2}\tilde{{\varepsilon}}_{0}+J_{1}^{2}\tilde{{\varepsilon}}_{0}+J_{0}J_{1}\tilde{g}_{0}(\tilde{g}_{1}^{-1}(\tilde{{\varepsilon}}_{0}))+J_{0}J_{1}\tilde{g}_{1}(\tilde{g}_{0}^{-1}(-\tilde{{\varepsilon}}_{0}))=\tilde{\psi}(\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{0})\geq\tilde{\delta}. (66)
Case 4.

ε~0>|g~0(0)|,ε~1|g~0(0)|\tilde{{\varepsilon}}_{0}>|\tilde{g}_{0}(0)|,\tilde{{\varepsilon}}_{1}\leq|\tilde{g}_{0}(0)|: ERM is fair on client 0 and very unfair on client 1.

By (31), we have λ~ε~0LFT+Ensemble0=0,λ~ε~1LFT+Ensemble1=g~11(ε~1)>g~11(0)\tilde{\lambda}_{\tilde{{\varepsilon}}_{0}}^{\text{LFT+Ensemble}_{0}}=0,\tilde{\lambda}_{\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}_{1}}=\tilde{g}_{1}^{-1}(\tilde{{\varepsilon}}_{1})>\tilde{g}_{1}^{-1}(0). Then we obtain

(63)\displaystyle~{}(\ref{LFT+Ensemble{}_dp_p}) =J02g~0(0)+J0J1g~0(λ~ε~1LFT+Ensemble1)+J0J1g~1(0)+J12ε~1\displaystyle=J_{0}^{2}\tilde{g}_{0}(0)+J_{0}J_{1}\tilde{g}_{0}(\tilde{\lambda}_{\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}_{1}})+J_{0}J_{1}\tilde{g}_{1}(0)+J_{1}^{2}\tilde{{\varepsilon}}_{1} (67)
>J02g~0(0)+J0J1g~0(g~11(0))+J12g~1(0)=ψ~(g~0(0),0)δ~.\displaystyle>J_{0}^{2}\tilde{g}_{0}(0)+J_{0}J_{1}\tilde{g}_{0}(\tilde{g}_{1}^{-1}(0))+J_{1}^{2}\tilde{g}_{1}(0)=\tilde{\psi}(\tilde{g}_{0}(0),0)\geq\tilde{\delta}. (68)
Case 5.

ε~0>|g~0(0)|,|g~0(0)|ε~1<|g~1(0)|\tilde{{\varepsilon}}_{0}>|\tilde{g}_{0}(0)|,|\tilde{g}_{0}(0)|\leq\tilde{{\varepsilon}}_{1}<|\tilde{g}_{1}(0)|: ERM is fair on client 0 and unfair on client 1.

Applying (31) implies λ~ε~1LFT+Ensemble1=g~11(ε~1)>g~11(0)\tilde{\lambda}_{\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}_{1}}=\tilde{g}^{-1}_{1}(\tilde{{\varepsilon}}_{1})>\tilde{g}_{1}^{-1}(0). Therefore,

(63)=\displaystyle~{}(\ref{LFT+Ensemble{}_dp_p})= J02g~0(0)+J0J1g~0(g~11(ε~1))+J12ε~1+J0J1g~1(0)\displaystyle J_{0}^{2}\tilde{g}_{0}(0)+J_{0}J_{1}\tilde{g}_{0}(\tilde{g}_{1}^{-1}(\tilde{{\varepsilon}}_{1}))+J_{1}^{2}\tilde{{\varepsilon}}_{1}+J_{0}J_{1}\tilde{g}_{1}(0) (69)
>J02g~0(0)+J0J1g~1(0)+J0J1g~0(g~11(0))+J12ε~1>ψ~(g~0(0),0)δ~.\displaystyle>J_{0}^{2}\tilde{g}_{0}(0)+J_{0}J_{1}\tilde{g}_{1}(0)+J_{0}J_{1}\tilde{g}_{0}(\tilde{g}_{1}^{-1}(0))+J_{1}^{2}\tilde{{\varepsilon}}_{1}>\tilde{\psi}(\tilde{g}_{0}(0),0)\geq\tilde{\delta}. (70)

Combining all the cases above, we conclude that when g~1(0)>0\tilde{g}_{1}(0)>0, DP Disp(f~ε~0,ε~1LFT+Ensemble)δ~=min{|ψ~(ε~0,ε~1)|:ε~0,ε~1[0,c]}>0\text{DP Disp}(\tilde{f}_{\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}}^{\text{LFT+Ensemble}})\geq\tilde{\delta}=\min\{|\tilde{\psi}(\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1})|:\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,c]\}>0 for all ε~0,ε~1[0,1]\tilde{{\varepsilon}}_{0},\tilde{{\varepsilon}}_{1}\in[0,1].

Remark 10.

Based on the proof above, we can conclude, for the cases with multiple clients, the fundamental limitation of LFT+Ensemble still exists.

The following theorem shows that LFT+FedAvg can reach any ε{\varepsilon} DP disparity:

Theorem 21 (Generalized version of Thm. 17).

Let qi=q(0,1)q_{i}=q\in(0,1) for all i[I]i\in[I]. For all ε[0,1]{\varepsilon}\in[0,1], let εiε{\varepsilon}_{i}\leq{\varepsilon} for all i[I]i\in[I], then DP Disp(f𝛆LFT+FedAvg)ε\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})\leq{\varepsilon}. Thus under the condition in Lemma 20, we have

min𝜺[0,1]IDP Disp(f𝜺LFT+Ensemble)>min𝜺[0,1]IDP Disp(f𝜺LFT+FedAvg)=0.\min_{\bm{{\varepsilon}}\in[0,1]^{I}}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+Ensemble}})>\min_{\bm{{\varepsilon}}\in[0,1]^{I}}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})=0.
Proof.

When εiε{\varepsilon}_{i}\leq{\varepsilon} the global DP disparity becomes

DP Disp(f𝜺LFT+FedAvg)=|𝔼xa=0f(x,0)𝔼xa=1f(x,1)|\displaystyle\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})=|{\mathbb{E}}_{{\mathrm{x}}\mid{\mathrm{a}}=0}f({\mathrm{x}},0)-{\mathbb{E}}_{{\mathrm{x}}\mid{\mathrm{a}}=1}f({\mathrm{x}},1)| (71)
=|(i=0I1𝒳f(x,0)𝑑𝒫0(i))/I(i=0I1𝒳f(x,1)𝑑𝒫1(i))/I|\displaystyle=|(\sum_{i=0}^{I-1}\int_{{\mathcal{X}}}f(x,0)\,d{\mathcal{P}}_{0}^{(i)})/I-(\sum_{i=0}^{I-1}\int_{{\mathcal{X}}}f(x,1)\,d{\mathcal{P}}_{1}^{(i)})/I|
=|i=0I1MDi(f𝜺LFT+FedAvg)/I|i=0I1DP Dispi(f𝜺LFT+FedAvg)/Ii=0I1εi/I=ε.\displaystyle=|\sum_{i=0}^{I-1}\text{MD}_{i}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})/I|\leq\sum_{i=0}^{I-1}\text{DP Disp}_{i}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})/I\leq\sum_{i=0}^{I-1}{\varepsilon}_{i}/I={\varepsilon}.

The following theorem shows the limitation of LFT+FedAvg:

Lemma 22 (Generalized version of Lemma 19).

Let ai=iBern(0){\mathrm{a}}\mid{\mathrm{i}}=i\sim\operatorname{Bern}(0) or ai=iBern(1){\mathrm{a}}\mid{\mathrm{i}}=i\sim\operatorname{Bern}(1) for all i[I]i\in[I]. When DP Disp(f1CFL)>0\text{DP Disp}(f_{1}^{\text{CFL}})>0, we have

min𝜺[0,1]IDP Disp(f𝜺LFT+FedAvg)=DP Disp(f1CFL)>minεDP Disp(fεCFL)=0.\min_{\bm{{\varepsilon}}\in[0,1]^{I}}\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})=\text{DP Disp}(f_{1}^{\text{CFL}})>\min_{{\varepsilon}}\text{DP Disp}(f_{{\varepsilon}}^{\text{CFL}})=0.
Proof.

Since ai=iBern(0){\mathrm{a}}\mid{\mathrm{i}}=i\sim\operatorname{Bern}(0) or ai=iBern(1){\mathrm{a}}\mid{\mathrm{i}}=i\sim\operatorname{Bern}(1), the constraints in (LFT+FedAvg(𝜺)(\bm{{\varepsilon}})) vanish. When ε=1{\varepsilon}=1, the constraint in CFL(ε{\varepsilon}) always holds and thus also vanishes. Thus in such scenario the solution to (LFT+FedAvg(𝜺)(\bm{{\varepsilon}})) becomes f1CFLf_{1}^{\text{CFL}}. Then from the assumption we have

DP Disp(f𝜺LFT+FedAvg)=DP Disp(f1CFL)>DP Disp(f0CFL)=0.\text{DP Disp}(f_{\bm{{\varepsilon}}}^{\text{LFT+FedAvg}})=\text{DP Disp}(f_{1}^{\text{CFL}})>\text{DP Disp}(f_{0}^{\text{CFL}})=0.

Appendix B Appendix - FedFB analysis and algorithm description

In this section, we provide our bi-level optimization formulation for FedFB for four fairness notions: demographic parity, equal opportunity, equalized odds and client parity, and design the corresponding update rule. This development can also be applied to centralized case. Then, we provides more details of how we incorporate FB with federated learning.

To explain how to optimize the weights of different groups, we introduce some necessary notations first. Recall that in the setting of Sec. 4 we assume AA sensitive attributes, here we further assume II clients. Then the sample has attributes (x,y,a,i)(\mathrm{x},\mathrm{y},\mathrm{a},\mathrm{i}), where x𝒳\mathrm{x}\in\mathcal{X} is the input feature, y{0,1}\mathrm{y}\in\{0,1\} is the label, a𝒜=[A]\mathrm{a}\in\mathcal{A}=[A] is the sensitive attribute and i[I]\mathrm{i}\in[I] is the index of the client that the sample belongs to. We further denote ny,a(i):=|{(x,y,a,i):y=y,a=a,i=i}|n_{y,a}^{(i)}:=|\{(\mathrm{x},\mathrm{y},\mathrm{a},\mathrm{i}):\mathrm{y}=y,\mathrm{a}=a,\mathrm{i}=i\}| be the number of samples belong to client ii of label yy and sensitive attribute aa. And we further define the local version of Ly,aL_{y,a} as Ly,a(i)(𝒘):=(x,y,a,i):y=y,a=a,i=i(y,y^;𝒘)/ny,a(i)L_{y,a}^{(i)}(\bm{w}):=\sum_{(\mathrm{x},\mathrm{y},\mathrm{a},\mathrm{i}):\mathrm{y}=y,\mathrm{a}=a,\mathrm{i}=i}\ell(\mathrm{y},\hat{\mathrm{y}};\bm{w})/n_{y,a}^{(i)}.

B.1 FedFB w.r.t demographic parity

We first provide the proof for Proposition 5.

Proof of Proposition 5.

We denote by \mathbb{P} the empirical probability. The demographic parity is satisfied when (y^=1a=0)=(y^=1a=a)\mathbb{P}(\hat{\mathrm{y}}=1\mid\mathrm{a}=0)=\mathbb{P}(\hat{\mathrm{y}}=1\mid\mathrm{a}=a) holds for all a[A]a\in[A]. Thus,

(y^=1,y=0a=0)+(y^=1,y=1a=0)\displaystyle\mathbb{P}(\hat{\mathrm{y}}=1,\mathrm{y}=0\mid\mathrm{a}=0)+\mathbb{P}(\hat{\mathrm{y}}=1,\mathrm{y}=1\mid\mathrm{a}=0) (72)
=\displaystyle= (y^=1,y=0a=a)+(y^=1,y=1a=a)\displaystyle\mathbb{P}(\hat{\mathrm{y}}=1,\mathrm{y}=0\mid\mathrm{a}=a)+\mathbb{P}(\hat{\mathrm{y}}=1,\mathrm{y}=1\mid\mathrm{a}=a)

For 0-1 loss, we have (|1y|,)=1(y,)\ell(|1-y|,\cdot)=1-\ell(y,\cdot), thus

1n,0(x,y,a):y=0,a=0(1(y,y^;𝒘))+1n,0(x,y,a):y=1,a=0(y,y^;𝒘)\displaystyle\frac{1}{n_{\star,0}}\sum_{(\mathrm{x},\mathrm{y},\mathrm{a}):\mathrm{y}=0,\mathrm{a}=0}\left(1-\ell\left(\mathrm{y},\hat{\mathrm{y}};\bm{w}\right)\right)+\frac{1}{n_{\star,0}}\sum_{(\mathrm{x},\mathrm{y},\mathrm{a}):\mathrm{y}=1,\mathrm{a}=0}\ell\left(\mathrm{y},\hat{\mathrm{y}};\bm{w}\right) (73)
=\displaystyle= 1n,a(x,y,a):y=0,a=a(1(y,y^;𝒘))+1n,a(x,y,a):y=1,a=a(y,y^;𝒘).\displaystyle\frac{1}{n_{\star,a}}\sum_{(\mathrm{x},\mathrm{y},\mathrm{a}):\mathrm{y}=0,\mathrm{a}=a}\left(1-\ell\left(\mathrm{y},\hat{\mathrm{y}};\bm{w}\right)\right)+\frac{1}{n_{\star,a}}\sum_{(\mathrm{x},\mathrm{y},\mathrm{a}):\mathrm{y}=1,\mathrm{a}=a}\ell\left(\mathrm{y},\hat{\mathrm{y}};\bm{w}\right).

By replacing (x,y,a):y=y,a=a(y,y^;𝒘)=ny,aLy,a(𝒘)\sum_{(\mathrm{x},\mathrm{y},\mathrm{a}):\mathrm{y}=y,\mathrm{a}=a}\ell\left(\mathrm{y},\hat{\mathrm{y}};\bm{w}\right)=n_{y,a}L_{y,a}(\bm{w}), we have

n0,0n,0(1L0,0(𝒘))+n1,0n,0L1,0(𝒘)\displaystyle\frac{n_{0,0}}{n_{\star,0}}(1-L_{0,0}(\bm{w}))+\frac{n_{1,0}}{n_{\star,0}}L_{1,0}(\bm{w}) (74)
=\displaystyle= n0,an,a(1L0,a(𝒘))+n1,an,aL1,a(𝒘).\displaystyle\frac{n_{0,a}}{n_{\star,a}}(1-L_{0,a}(\bm{w}))+\frac{n_{1,a}}{n_{\star,a}}L_{1,a}(\bm{w}).

We make the following assumption to our loss function for showing the partial convergence guarantee of FB.

Assumption 1.

Ly,a()L^{\prime}_{y,a}(\cdot) is twice differentiable for all y{0,1}y\in\{0,1\}, a[A]a\in[A], and

a=0A1[λa2L0,a(𝒘)+(2n,anλa)2L1,a(𝒘)]0 for all λΛ.\sum_{a=0}^{A-1}\left[\lambda_{a}\nabla^{2}L_{0,a}^{\prime}(\bm{w})+(2\frac{n_{\star,a}}{n}-\lambda_{a})\nabla^{2}L_{1,a}^{\prime}(\bm{w})\right]\succ 0\text{ for all }\lambda\in\Lambda. (75)

If Ly,a(𝒘𝝀)L^{\prime}_{y,a}(\bm{w}_{\bm{\lambda}}) is convex for all y{0,1},a[A]y\in\{0,1\},a\in[A], the condition (75) holds unless for all aa, L0,a(),L1,a()L_{0,a}^{\prime}(\cdot),L_{1,a}^{\prime}(\cdot) share their stationary points, which is very unlikely (see Remark 1 in Roh et al. [2021]).

Recall the bi-level optimization problem in (6), we denote the outer objective function to be Fdp(𝝀)=a=1A1(Fa(𝒘𝝀))2F_{\mathrm{dp}}(\bm{\lambda})=\sum_{a=1}^{A-1}\big{(}F_{a}(\bm{w}_{\bm{\lambda}})\big{)}^{2}. The following lemma provides a decreasing direction of FdpF_{\mathrm{dp}}, which inspired us to design the update rule (8) of 𝝀\bm{\lambda}.

Lemma 23 (Decreasing direction of FdpF_{\mathrm{dp}}).

If Assumption 1 holds, then on the direction 𝛍(𝛌)=(μ0(𝛌),,μA1(𝛌))\bm{\mu}(\bm{\lambda})=(\mu_{0}(\bm{\lambda}),\dots,\mu_{A-1}(\bm{\lambda})) where

μ0(𝝀)=a=1A1Fa(𝒘𝝀)μa(𝝀)=Fa(𝒘𝝀)\begin{split}\mu_{0}(\bm{\lambda})&=-\sum_{a=1}^{A-1}F_{a}(\bm{w}_{\bm{\lambda}})\\ \mu_{a}(\bm{\lambda})&=F_{a}(\bm{w}_{\bm{\lambda}})\end{split} (76)

for all a{1,,A1}a\in\{1,\dots,A-1\}, we have 𝛍(𝛌)Fdp(𝛌)0\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\mathrm{dp}}(\bm{\lambda})\leq 0, and the equality holds if only if 𝛍(𝛌)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}.

Proof.

We compute the derivative as

Fdp(𝝀)λj=2a=1A1[(L0,0(𝒘𝝀)+L1,0(𝒘𝝀)+L0,a(𝒘𝝀)L1,a(𝒘𝝀)+n0,0n,0n0,an,a)(L0,0(𝒘𝝀)+L1,0(𝒘𝝀)+L0,a(𝒘𝝀)L1,a(𝒘𝝀))]𝒘𝝀λj.\begin{split}\frac{\partial F_{\text{dp}}(\bm{\lambda})}{\partial\lambda_{j}}&=2\sum_{a=1}^{A-1}\Big{[}\Big{(}-L_{0,0}^{\prime}(\bm{w}_{\bm{\lambda}})+L_{1,0}^{\prime}(\bm{w}_{\bm{\lambda}})+L_{0,a}^{\prime}(\bm{w}_{\bm{\lambda}})-L_{1,a}^{\prime}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,0}}{n_{\star,0}}-\frac{n_{0,a}}{n_{\star,a}}\Big{)}\\ &\Big{(}-\nabla L^{\prime}_{0,0}(\bm{w}_{\bm{\lambda}})+\nabla L^{\prime}_{1,0}(\bm{w}_{\bm{\lambda}})+\nabla L^{\prime}_{0,a}(\bm{w}_{\bm{\lambda}})-\nabla L^{\prime}_{1,a}(\bm{w}_{\bm{\lambda}})\Big{)}\Big{]}\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}.\end{split} (77)

Note that 𝒘𝝀\bm{w}_{\bm{\lambda}} is the minimizer to(6)inner~{}(\ref{dp:w})_{\text{inner}}, we have

a=0A1[λaL0,a(𝒘𝝀)+(2n,anλa)L1,a(𝒘𝝀)]=0.\sum_{a=0}^{A-1}\Big{[}\lambda_{a}\nabla L_{0,a}^{\prime}(\bm{w}_{\bm{\lambda}})+(2\frac{n_{\star,a}}{n}-\lambda_{a})\nabla L_{1,a}^{\prime}(\bm{w}_{\bm{\lambda}})\Big{]}=0.

We take the λj\lambda_{j} derivative to the above equation and have

L0,j(𝒘𝝀)L1,j(𝒘𝝀)+a=0A1[λa2L0,a(𝒘𝝀)+(2n,anλa)2L1,a(𝒘𝝀)]𝒘𝝀λj=0.\nabla L_{0,j}^{\prime}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,j}^{\prime}(\bm{w}_{\bm{\lambda}})+\sum_{a=0}^{A-1}\Big{[}\lambda_{a}\nabla^{2}L_{0,a}^{\prime}(\bm{w}_{\bm{\lambda}})+(2\frac{n_{\star,a}}{n}-\lambda_{a})\nabla^{2}L^{\prime}_{1,a}(\bm{w}_{\bm{\lambda}})\Big{]}\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}=0.

Thus we get

𝒘𝝀λj=(a=0A1[λa2L0,a(𝒘𝝀)+(2n,anλa)2L1,a(𝒘𝝀)])1[L1,j(𝒘𝝀)L0,j(𝒘𝝀)].\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}=\Big{(}\sum_{a=0}^{A-1}\Big{[}\lambda_{a}\nabla^{2}L_{0,a}^{\prime}(\bm{w}_{\bm{\lambda}})+(2\frac{n_{\star,a}}{n}-\lambda_{a})\nabla^{2}L^{\prime}_{1,a}(\bm{w}_{\bm{\lambda}})\Big{]}\Big{)}^{-1}[\nabla L_{1,j}^{\prime}(\bm{w}_{\bm{\lambda}})-\nabla L_{0,j}^{\prime}(\bm{w}_{\bm{\lambda}})]. (78)

Then on the direction 𝝁(𝝀)\bm{\mu}(\bm{\lambda}) given by (76), we combine (77) and (78) to have

𝝁(𝝀)Fdp(𝝀)\displaystyle\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\text{dp}}(\bm{\lambda}) (79)
=2(a=1A1[(L0,0(𝒘𝝀)+L1,0(𝒘𝝀)+L0,a(𝒘𝝀)L1,a(𝒘𝝀)+n0,0n,0n0,an,a)\displaystyle=2\bigg{(}\sum_{a=1}^{A-1}\Big{[}\Big{(}-L_{0,0}^{\prime}(\bm{w}_{\bm{\lambda}})+L_{1,0}^{\prime}(\bm{w}_{\bm{\lambda}})+L_{0,a}^{\prime}(\bm{w}_{\bm{\lambda}})-L_{1,a}^{\prime}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,0}}{n_{\star,0}}-\frac{n_{0,a}}{n_{\star,a}}\Big{)} (80)
(L0,0(𝒘𝝀)+L1,0(𝒘𝝀)+L0,a(𝒘𝝀)L1,a(𝒘𝝀))])\displaystyle\quad\Big{(}-\nabla L^{\prime}_{0,0}(\bm{w}_{\bm{\lambda}})+\nabla L^{\prime}_{1,0}(\bm{w}_{\bm{\lambda}})+\nabla L^{\prime}_{0,a}(\bm{w}_{\bm{\lambda}})-\nabla L^{\prime}_{1,a}(\bm{w}_{\bm{\lambda}})\Big{)}\Big{]}\bigg{)} (81)
(a=0A1[λa2L0,a(𝒘𝝀)+(2n,anλa)2L1,a(𝒘𝝀)])1\displaystyle\quad\Big{(}\sum_{a=0}^{A-1}\Big{[}\lambda_{a}\nabla^{2}L_{0,a}^{\prime}(\bm{w}_{\bm{\lambda}})+(2\frac{n_{\star,a}}{n}-\lambda_{a})\nabla^{2}L^{\prime}_{1,a}(\bm{w}_{\bm{\lambda}})\Big{]}\Big{)}^{-1} (82)
(a=1A1[(L0,0(𝒘𝝀)+L1,0(𝒘𝝀)+L0,a(𝒘𝝀)L1,a(𝒘𝝀)+n0,0n,0n0,an,a)\displaystyle\quad\bigg{(}-\sum_{a=1}^{A-1}\Big{[}\Big{(}-L_{0,0}^{\prime}(\bm{w}_{\bm{\lambda}})+L_{1,0}^{\prime}(\bm{w}_{\bm{\lambda}})+L_{0,a}^{\prime}(\bm{w}_{\bm{\lambda}})-L_{1,a}^{\prime}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,0}}{n_{\star,0}}-\frac{n_{0,a}}{n_{\star,a}}\Big{)} (83)
(L0,0(𝒘𝝀)+L1,0(𝒘𝝀)+L0,a(𝒘𝝀)L1,a(𝒘𝝀))])0\displaystyle\quad\Big{(}-\nabla L^{\prime}_{0,0}(\bm{w}_{\bm{\lambda}})+\nabla L^{\prime}_{1,0}(\bm{w}_{\bm{\lambda}})+\nabla L^{\prime}_{0,a}(\bm{w}_{\bm{\lambda}})-\nabla L^{\prime}_{1,a}(\bm{w}_{\bm{\lambda}})\Big{)}\Big{]}\bigg{)}\leq 0 (84)

where we have used Assumption 1, and the equality holds only when 𝝁(𝝀)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}. ∎

Now we introduce how clients collaborate to solve the bi-level optimization problem (6).

First, we focus on the outer objective function in (6)(\ref{dp:w}) and introduce how clients collaborate to update the weight 𝝀\bm{\lambda}. Note that the central server can compute Ly,a(𝒘𝝀)L_{y,a}^{\prime}(\bm{w}_{\bm{\lambda}}) by weight-averaging the local group loss Ly,a(i)(𝒘𝝀)L_{y,a}^{(i)}(\bm{w}_{\bm{\lambda}}) sent from clients at communication rounds as

Ly,a(𝒘𝝀)=i[I]ny,a(i)n,aLy,a(i)(𝒘𝝀),L^{\prime}_{y,a}(\bm{w}_{\bm{\lambda}})=\sum_{i\in[I]}\frac{n^{(i)}_{y,a}}{n_{\star,a}}L^{(i)}_{y,a}(\bm{w}_{\bm{\lambda}}),

thereby obtain 𝝁(𝝀)\bm{\mu}(\bm{\lambda}) and update 𝝀\bm{\lambda} by (8).

Next, we focus on the inner objective function in (6)(\ref{dp:w}) and introduce how clients collaborate to update the model parameters 𝒘𝝀\bm{w}_{\bm{\lambda}} using FedAvg. Note that we can decompose the objective function L(𝒘,𝝀)L(\bm{w},\bm{\lambda}) into L(𝒘,𝝀)=i[I]L(i)(𝒘,𝝀)L(\bm{w},\bm{\lambda})=\sum_{i\in[I]}L^{(i)}(\bm{w},\bm{\lambda}), where L(i)(𝒘,𝝀):=a=0A1[λan0,a(i)L0,a(i)/n,a+(2n,anλa)n1,a(i)L1,a(i)/n,a]L^{(i)}(\bm{w},\bm{\lambda}):=\sum_{a=0}^{A-1}[\lambda_{a}n_{0,a}^{(i)}L_{0,a}^{(i)}/n_{\star,a}+(2\frac{n_{\star,a}}{n}-\lambda_{a})n_{1,a}^{(i)}L_{1,a}^{(i)}/n_{\star,a}] is the client objective function of client ii. The global objective can be seen as a weighted sum of the client objective function. Therefore, we can use FedAvg to solve the inner optimization problem.

Denote Ly,a(i)(𝒘):=ny,a(i)n,aLy,a(i)(𝒘)L_{y,a}^{\prime(i)}(\bm{w}):=\frac{n^{(i)}_{y,a}}{n_{\star,a}}L_{y,a}^{(i)}(\bm{w}), we present the pseudocode of FedFB w.r.t demographic parity in Algorithm 2.

Server executes:
      input : Learning rate {αt}t\{\alpha_{t}\}_{t\in\mathbb{N}};
      Initialize λa\lambda_{a} as n,an\frac{n_{\star,a}}{n} for all a[A]\{0}a\in[A]\backslash\left\{0\right\};
      for each iteration t=1,2,t=1,2,\dots do
           Clients perform updates;
           𝒘𝝀\bm{w}_{\bm{\lambda}}\leftarrow SecAgg({𝒘(i)})(\left\{\bm{w}^{(i)}\right\}) for all ii;
           FaF_{a}\leftarrow SecAgg({Fa(i)})(I1)(n0,0n,0n0,an,a)(\left\{F^{(i)}_{a}\right\})-(I-1)(\frac{n_{0,0}}{n_{\star,0}}-\frac{n_{0,a}}{n_{\star,a}}) for all a,ia,i;
           μ0a=1A1Fa\mu_{0}\leftarrow-\sum_{a=1}^{A-1}F_{a};
           μaFa,for all a[A]\{0}\mu_{a}\leftarrow F_{a},\text{for all }a\in[A]\backslash\{0\};
           if t%k=0t\%k=0 then
                λaλa+αt𝝁2μa, for all a[A]\lambda_{a}\leftarrow\lambda_{a}+\frac{\alpha_{t}}{\|\bm{\mu}\|_{2}}\mu_{a},\text{ for all }a\in[A];
                Broadcast 𝝀\bm{\lambda} to clients
                Broadcast 𝒘𝝀\bm{w}_{\bm{\lambda}} to clients;
                end for
               output : 𝒘𝝀,𝝀\bm{w}_{\bm{\lambda}},\bm{\lambda}
               
               
               
               
               ClientUpdate(i,w,λ)(i,\bm{w},\bm{\lambda}):
                     𝒘(i)\bm{w}^{(i)}\leftarrow Gradient descent w.r.t objective function a=0A1[λaL0,a(i)(𝒘)+(2n,anλa)L1,a(i)(𝒘)]\sum_{a=0}^{A-1}\left[\lambda_{a}L_{0,a}^{\prime(i)}(\bm{w})+(2\frac{n_{\star,a}}{n}-\lambda_{a})L_{1,a}^{\prime(i)}(\bm{w})\right];
                     Fa(i)L0,0(i)+L1,0(i)+L0,a(i)L1,a(i)+n0,0n,0n0,an,aF_{a}^{(i)}\leftarrow-L_{0,0}^{\prime(i)}+L_{1,0}^{\prime(i)}+L_{0,a}^{\prime(i)}-L_{1,a}^{\prime(i)}+\frac{n_{0,0}}{n_{\star,0}}-\frac{n_{0,a}}{n_{\star,a}};
                     Send 𝒘(i),F0,a(i)(𝒘)\bm{w}^{(i)},F_{0,a}^{(i)}(\bm{w}) for all a[A]a\in[A] to server via a SecAgg protocol;
                    
                    
                    
Algorithm 2 FedFB(k,t)(k,t) w.r.t Demographic Parity

Next, we analyze the convergence performance of FedFB. We need to make the following assumptions on the objective function L(i)(𝒘,𝝀)L^{(i)}(\bm{w},\bm{\lambda}), i[I]i\in[I]. For simplicity, we drop the 𝝀\bm{\lambda} here and use the notations L(i)(𝒘)L^{(i)}(\bm{w}) and L(𝒘)L(\bm{w}) instead. We use 𝒘t(i)\bm{w}_{t}^{(i)} to denote the model parameters at tt-th iteration in ii-th client. The assumptions below are proposed by work Li et al. [2020b]:

Assumption 2 (Strong convexity, Assumption 1 in Li et al. [2020b]).

L(i)(𝒘)L^{(i)}(\bm{w}) is μ\mu-strongly convex for i[I]i\in[I], i.e., for all 𝐯\bm{v} and 𝐰\bm{w}, 𝐰\bm{w}, L(i)(𝐯)L(i)(𝐰)+(𝐯𝐰)L(i)(𝐰)+μ2𝐯𝐰22L^{(i)}(\bm{v})\geq L^{(i)}(\bm{w})+(\bm{v}-\bm{w})^{\top}\nabla L^{(i)}(\bm{w})+\frac{\mu}{2}\left\lVert\bm{v}-\bm{w}\right\rVert_{2}^{2}.

Assumption 3 (Smoothness, Assumption 2 in Li et al. [2020b]).

L(i)(𝒘)L^{(i)}(\bm{w}) is LL-smooth for i[I]i\in[I], i.e., for all 𝐯\bm{v} and 𝐰\bm{w}, L(i)(𝐯)L(i)(𝐰)+(𝐯𝐰)L(i)(𝐰)+L2𝐯𝐰22L^{(i)}(\bm{v})\leq L^{(i)}(\bm{w})+(\bm{v}-\bm{w})^{\top}\nabla L^{(i)}(\bm{w})+\frac{L}{2}\left\lVert\bm{v}-\bm{w}\right\rVert_{2}^{2}.

Assumption 4 (Bounded variance, Assumption 3 in Li et al. [2020b]).

Let ξt(i)\xi^{(i)}_{t} be sampled from ii-th device’s local data uniformly at random, where t[T]t\in[T], and TT is the total number of every client’s SGDs. The variance of stochastic gradients in each device is bounded: 𝔼L(i)(𝐰t(i),ξt(i))L(i)(𝐰t(i))2<\mathbb{E}\left\lVert\nabla L^{(i)}(\bm{w}_{t}^{(i)},\xi_{t}^{(i)})-\nabla L^{(i)}(\bm{w}_{t}^{(i)})\right\rVert^{2}<\infty for i[I]i\in[I].

Assumption 5 (Bounded gradients, Assumption 4 in Li et al. [2020b]).

The expected squared norm of stochastic gradients is uniformly bounded, i.e., 𝔼L(i)(𝐰t(i),ξt(i))<\mathbb{E}\left\lVert\nabla L^{(i)}(\bm{w}^{(i)}_{t},\xi_{t}^{(i)})\right\rVert<\infty for all i[I],t[T]i\in[I],t\in[T].

In FedAvg, first, the central server broadcasts the lastest model to all clients, then, every client performs local updates for a number of iterations, last, the central server aggregates the local models to produce the new global model(see Algorithm Description in Li et al. [2020b] for more detailed explanation).

Note that we assume 𝝀\bm{\lambda} is not updated at each communication round. With the above assumptions, the following theorem shows the convergence of FedFB in the case of two clients.

Theorem 24.

Consider the case of A=2A=2. Let Assumption 2345 on {L(i)(,𝛌)}i[I]\{L^{(i)}(\cdot,\bm{\lambda})\}_{i\in[I]} and Assumption 1 on {Ly,a()}y,a{0,1}\{L^{\prime}_{y,a}(\cdot)\}_{y,a\in\{0,1\}} hold. Choose {αt}t=1\{\alpha_{t}\}_{t=1}^{\infty} such that limtαt=0,t=1αt=\lim_{t\rightarrow\infty}\alpha_{t}=0,\sum_{t=1}^{\infty}\alpha_{t}=\infty. Suppose we have sufficiently large number of communication rounds to solve the inner optimization problem between two 𝛌\bm{\lambda} update rounds, i.e, kk\to\infty in Algorithm 2. Denote the number of 𝛌\bm{\lambda} update as TT, 𝛌(t)=(λ0(t),,λA1(t))\bm{\lambda}^{(t)}=(\lambda_{0}^{(t)},\cdots,\lambda_{A-1}^{(t)}) as the weight at tt-th updated round. We can find sufficiently large TT such that with high probability, applying update rule (8) leads to |λa(T)λa|max{|λa(0)λa|t=1Tαt,αT}0,|\lambda_{a}^{(T)}-\lambda_{a}^{\star}|\leq\max\Big{\{}|\lambda_{a}^{(0)}-\lambda_{a}^{\star}|-\sum_{t=1}^{T}\alpha_{t},\alpha_{T}\Big{\}}\to 0, where 𝛌\bm{\lambda}^{\star} is the global minimizer: 𝛌argmin𝛌a[Fa(𝐰𝛌)]2\bm{\lambda}^{\star}\in\arg\min_{\bm{\lambda}}\sum_{a}[F_{a}(\bm{w}_{\bm{\lambda}})]^{2}. .

Proof.

We first derive the update rule in the case of A=2A=2. From (8), we have μ0=μ1\mu_{0}=-\mu_{1}. Denote

f(𝝀(t))=L0,0(𝒘𝝀(t))+L1,0(𝒘𝝀(t))+L0,1(𝒘𝝀(t))L1,1(𝒘𝝀(t))+n0,0n,0n0,1n,1.f(\bm{\lambda}^{(t)})=-L_{0,0}^{\prime}(\bm{w}_{\bm{\lambda}^{(t)}})+L_{1,0}^{\prime}(\bm{w}_{\bm{\lambda}}^{(t)})+L_{0,1}^{\prime}(\bm{w}_{\bm{\lambda}}^{(t)})-L_{1,1}^{\prime}(\bm{w}_{\bm{\lambda}}^{(t)})+\frac{n_{0,0}}{n_{\star,0}}-\frac{n_{0,1}}{n_{\star,1}}.

Then the update rule becomes

λ0(t+1)=λ0(t)22αtsign(f(𝝀(t)))λ1(t+1)=λ1(t)+22αtsign(f(𝝀(t))).\begin{split}\lambda_{0}^{(t+1)}&=\lambda_{0}^{(t)}-\frac{\sqrt{2}}{2}\alpha_{t}\text{sign}(f(\bm{\lambda}^{(t)}))\\ \lambda_{1}^{(t+1)}&=\lambda_{1}^{(t)}+\frac{\sqrt{2}}{2}\alpha_{t}\text{sign}(f(\bm{\lambda}^{(t)})).\end{split} (85)

We denote EE as the number local epochs performed between two communication rounds, and RR be the total number of iterations between two 𝝀\bm{\lambda} update rounds. Thus RE\frac{R}{E} is the number of communication rounds between two 𝝀\bm{\lambda} update rounds. Then we apply FedAvg with RR number of total iterations to solve min𝒘L(𝒘;𝝀(t))\min_{\bm{w}}L(\bm{w};\bm{\lambda}^{(t)}), and obtain 𝒘R(t)\bm{w}_{R}^{(t)}. Then in the update round from 𝝀(t)\bm{\lambda}^{(t)} to 𝝀(t+1)\bm{\lambda}^{(t+1)}, by Thm. 1 in Li et al. [2020b], we have

𝔼[L(𝒘R(t);𝝀(t))]L(𝒘𝝀(𝒕);𝝀(t))=O(E2R),\displaystyle\mathbb{E}[L(\bm{w}_{R}^{(t)};\bm{\lambda}^{(t)})]-L(\bm{w}_{\bm{\lambda^{(t)}}};\bm{\lambda}^{(t)})=O\Big{(}\frac{E^{2}}{R}\Big{)}, (86)
𝔼𝒘R(t)𝒘𝝀(t)2=O(E2R),\displaystyle\mathbb{E}\left\lVert\bm{w}_{R}^{(t)}-\bm{w}_{\bm{\lambda}^{(t)}}\right\rVert^{2}=O\Big{(}\frac{E^{2}}{R}\Big{)},

where 𝒘𝝀(t)=argmin𝒘L(𝒘;𝝀(t))\bm{w}_{\bm{\lambda}^{(t)}}=\underset{\bm{w}}{\arg\min}L(\bm{w};\bm{\lambda}^{(t)}). By Markov’s inequality, with probability 1δ1-\delta,

L(𝒘R(t);𝝀(t))L(𝒘𝝀(t);𝝀(t))=O(E2δR),\displaystyle L(\bm{w}_{R}^{(t)};\bm{\lambda}^{(t)})-L(\bm{w}_{\bm{\lambda}^{(t)}};\bm{\lambda}^{(t)})=O\Big{(}\frac{E^{2}}{\delta R}\Big{)}, (87)
𝒘R(t)𝒘𝝀(t)2=O(E2δR).\displaystyle\left\lVert\bm{w}_{R}^{(t)}-\bm{w}_{\bm{\lambda}^{(t)}}\right\rVert^{2}=O\Big{(}\frac{E^{2}}{\delta R}\Big{)}.

Then taking the union bound over TT updating iterations of 𝝀(t)\bm{\lambda}^{(t)}, the conclusions above hold with probability at least 1Tδ1-T\delta for all 𝝀(t),t=1,2,,T\bm{\lambda}^{(t)},t=1,2,\cdots,T. Therefore, with sufficiently large R=R(δ)R=R(\delta) such that E2δR0\frac{E^{2}}{\delta R}\to 0, for all 𝝀(t)\bm{\lambda}^{(t)} in the TT iteration, |L(𝒘R(t);𝝀(t))L(𝒘𝝀(t);𝝀(t))|1|L(\bm{w}_{R}^{(t)};\bm{\lambda}^{(t)})-L(\bm{w}_{\bm{\lambda}^{(t)}};\bm{\lambda}^{(t)})|\ll 1, 𝒘R(t)𝒘𝝀(t)1\|\bm{w}_{R}^{(t)}-\bm{w}_{\bm{\lambda}^{(t)}}\|\ll 1.

By Lemma 23, Fdp(𝝀)F_{\text{dp}}(\bm{\lambda}) has minimizer 𝝀\bm{\lambda}^{\star} on direction 𝝁(𝝀)\bm{\mu}(\bm{\lambda}) such that 𝝁(𝝀)=0\bm{\mu}(\bm{\lambda}^{\star})=0, which implies Fdp(𝝀)=0F_{\mathrm{dp}}(\bm{\lambda}^{\star})=0. Therefore, 𝝀\bm{\lambda}^{\star} is a global minimizer. By update rule (85), we can find large T>0T>0 to have

|λa(T)λa|max{|λa(0)λa|t=1Tαt,αT}0,|\lambda_{a}^{(T)}-\lambda_{a}^{\star}|\leq\max\left\{|\lambda_{a}^{(0)}-\lambda_{a}^{\star}|-\sum_{t=1}^{T}\alpha_{t},\alpha_{T}\right\}\to 0, (88)

where a{0,1}a\in\{0,1\}. ∎

Remark 11.

Note that Thm. 24 assumes FedFB does not update 𝛌\bm{\lambda} in each communication round and there are infinite rounds of aggregations between two 𝛌\bm{\lambda} updating round. However, for computation efficiency, we update 𝛌\bm{\lambda} at every communication round in practice.

B.2 FedFB w.r.t equal opportunity

Similar to Proposition 5, we design the following bi-level optimization problems to capture equal opportunity:

min𝝀ΛFeo(𝝀)\displaystyle\min_{\bm{\lambda}\in\Lambda}F_{\mathrm{eo}}(\bm{\lambda}) =min𝝀Λa=1A1(L1,a(𝒘𝝀)L1,0(𝒘𝝀))2\displaystyle=\min_{\bm{\lambda}\in\Lambda}\sum_{a=1}^{A-1}\left(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}})\right)^{2} (89)
𝒘𝝀\displaystyle\bm{w}_{\bm{\lambda}} =argmin𝒘a=1A1λaL1,a(𝒘)+(n1,na=1A1λa)L1,0(𝒘)+n0,nL0,(𝒘).\displaystyle=\arg\min_{\bm{w}}\sum_{a=1}^{A-1}\lambda_{a}L_{1,a}(\bm{w})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{a})L_{1,0}(\bm{w})+\frac{n_{0,\star}}{n}L_{0,\star}(\bm{w}). (90)

Here Λ={(λ1,,λA1):λ1++λA1n1,n,λa0 for all a=1,,A1}\Lambda=\{(\lambda_{1},\dots,\lambda_{A-1}):\lambda_{1}+\dots+\lambda_{A-1}\leq\frac{n_{1,\star}}{n},\lambda_{a}\geq 0\text{ for all }a=1,\dots,A-1\}.

For equal opportunity, we make the following assumption:

Assumption 6.

Ly,a()L_{y,a}(\cdot) is twice differentiable for all y{0,1}y\in\{0,1\}, a[A]a\in[A], and

a=1A1λa2L1,a(𝒘𝝀)+(n1,na=1A1λa)2L1,0(𝒘𝝀)+n0,n2L0,(𝒘𝝀)0\sum_{a=1}^{A-1}\lambda_{a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,\star}}{n}\nabla^{2}L_{0,\star}(\bm{w}_{\bm{\lambda}})\succ 0

for all λΛ\lambda\in\Lambda.

With the above assumption, the following lemma provides the update rule:

Lemma 25.

If Assumption 6 holds, then on the direction

𝝁(𝝀)=(L1,1(𝒘𝝀)L1,0(𝒘𝝀),,L1,A1(𝒘𝝀)L1,0(𝒘𝝀)),\bm{\mu}(\bm{\lambda})=(L_{1,1}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}),\dots,L_{1,A-1}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}})), (91)

we have 𝛍(𝛌)Feo(𝛌)0\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\mathrm{eo}}(\bm{\lambda})\leq 0, and the equality holds if only if 𝛍(𝛌)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}.

Update rule for equal opportunity:

λa(t+1)=λa(t)+αt𝝁(𝝀(t))2(L1,a(𝒘𝝀(t))L1,0(𝒘𝝀(t))).\lambda_{a}^{(t+1)}=\lambda_{a}^{(t)}+\frac{\alpha_{t}}{\|\bm{\mu}(\bm{\lambda}^{(t)})\|_{2}}(L_{1,a}(\bm{w}_{\bm{\lambda}^{(t)}})-L_{1,0}(\bm{w}_{\bm{\lambda}^{(t)}})). (92)
Proof of Lemma 25.

We compute the derivative as

Feo(𝝀)λj=2a=1A1(L1,a(𝒘𝝀)L1,0(𝒘𝝀))(L1,a(𝒘𝝀)L1,0(𝒘𝝀))𝒘𝝀λj.\begin{split}\frac{\partial F_{\mathrm{eo}}(\bm{\lambda})}{\partial\lambda_{j}}&=2\sum_{a=1}^{A-1}(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{1,a}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,0}(\bm{w}_{\bm{\lambda}}))\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}.\end{split} (93)

Note that 𝒘𝝀\bm{w}_{\bm{\lambda}} is the minimizer to (90), we have

a=1A1λaL1,a(𝒘𝝀)+(n1,na=1A1λa)L1,0(𝒘𝝀)+n0,nL0,(𝒘𝝀)=0.\sum_{a=1}^{A-1}\lambda_{a}\nabla L_{1,a}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{a})\nabla L_{1,0}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,\star}}{n}\nabla L_{0,\star}(\bm{w}_{\bm{\lambda}})=0.

We take the λj\lambda_{j} derivative to the above equation and have

[a=1A1λa2L1,a(𝒘𝝀)+(n1,na=1A1λa)2L1,0(𝒘𝝀)+n0,n2L0,(𝒘𝝀)]\displaystyle\Big{[}\sum_{a=1}^{A-1}\lambda_{a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,\star}}{n}\nabla^{2}L_{0,\star}(\bm{w}_{\bm{\lambda}})\Big{]} (94)
L1,j(𝒘𝝀)L1,0(𝒘𝝀)+𝒘𝝀λj=0.\displaystyle\nabla L_{1,j}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,0}(\bm{w}_{\bm{\lambda}})+\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}=0. (95)

Thus we get

𝒘𝝀λj=[a=1A1λa2L1,a(𝒘𝝀)+(n1,na=1A1λa)2L1,0(𝒘𝝀)+n0,n2L0,(𝒘𝝀)]1[L1,0(𝒘𝝀)L1,j(𝒘𝝀)].\begin{split}\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}=&\Big{[}\sum_{a=1}^{A-1}\lambda_{a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,\star}}{n}\nabla^{2}L_{0,\star}(\bm{w}_{\bm{\lambda}})\Big{]}^{-1}\\ &[\nabla L_{1,0}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,j}(\bm{w}_{\bm{\lambda}})].\end{split} (96)

Then on the direction 𝝁(𝝀)\bm{\mu}(\bm{\lambda}) given by (91), we combine (93) and (96) to have

𝝁(𝝀)Feo(𝝀)\displaystyle\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\mathrm{eo}}(\bm{\lambda}) (97)
=2[a=1A1(L1,a(𝒘𝝀)L1,0(𝒘𝝀))(L1,a(𝒘𝝀)L1,0(𝒘𝝀))]\displaystyle=2\Big{[}\sum_{a=1}^{A-1}(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{1,a}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,0}(\bm{w}_{\bm{\lambda}}))\Big{]} (98)
[a=1A1λa2L1,a(𝒘𝝀)+(n1,na=1A1λa)2L1,0(𝒘𝝀)+n0,n2L0,(𝒘𝝀)]1\displaystyle\quad\Big{[}\sum_{a=1}^{A-1}\lambda_{a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}})+\frac{n_{0,\star}}{n}\nabla^{2}L_{0,\star}(\bm{w}_{\bm{\lambda}})\Big{]}^{-1} (99)
[a=1A1(L1,a(𝒘𝝀)L1,0(𝒘𝝀))(L1,0(𝒘𝝀)L1,a(𝒘𝝀))]0,\displaystyle\quad\Big{[}\sum_{a=1}^{A-1}(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{1,0}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,a}(\bm{w}_{\bm{\lambda}}))\Big{]}\leq 0, (100)

where we have used Assumption 6, the equality holds only when 𝝁(𝝀)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}.

We present the FedFB algorithm w.r.t equal opportunity in Algorithm 3.

Server executes:
      input : Learning rate {αt}t\{\alpha_{t}\}_{t\in\mathbb{N}};
      Initialize λa\lambda_{a} as n1,an\frac{n_{1,a}}{n} for all a[A]\{0}a\in[A]\backslash\left\{0\right\};
      for each iteration t=1,2,t=1,2,\dots do
           Clients perform updates;
           𝒘𝝀\bm{w}_{\bm{\lambda}}\leftarrow SecAgg({𝒘(i)}\left\{\bm{w}^{(i)}\right\}) for all ii;
           μa\mu_{a}\leftarrow SecAgg({μa(i)})(\left\{\mu_{a}^{(i)}\right\}) for all a[A]\{0}a\in[A]\backslash\left\{0\right\};
           if t%k=0t\%k=0 then
                λaλa+αt𝝁2μa, for all a[A]\{0}\lambda_{a}\leftarrow\lambda_{a}+\frac{\alpha_{t}}{\|\bm{\mu}\|_{2}}\mu_{a},\text{ for all }a\in[A]\backslash\left\{0\right\};
                Broadcast 𝝀\bm{\lambda} to clients;
               
                Broadcast 𝒘𝝀\bm{w}_{\bm{\lambda}} to clients;
                end for
               output : 𝒘𝝀,𝝀\bm{w}_{\bm{\lambda}},\bm{\lambda}
               
               
               
               
               ClientUpdate(i,w,λ)(i,\bm{w},\bm{\lambda}):
                     𝒘(i)\bm{w}^{(i)}\leftarrow Gradient descent w.r.t objective function a=1A1λaL1,a(i)(𝒘)+(n1,na=1A1λa)L1,0(i)(𝒘)+n0,nL0,(i)(𝒘)\sum_{a=1}^{A-1}\lambda_{a}L_{1,a}^{(i)}(\bm{w})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{a})L^{(i)}_{1,0}(\bm{w})+\frac{n_{0,\star}}{n}L^{(i)}_{0,\star}(\bm{w});
                     Send 𝒘(i),μa(i)\bm{w}^{(i)},\mu_{a}^{(i)} for all a[A]\{0}a\in[A]\backslash\left\{0\right\} to server via a SecAgg protocol;
                    
                    
                    
Algorithm 3 FedFB(k,t)(k,t) w.r.t Equal Opportunity

B.3 FedFB w.r.t equalized odds

For equalized odd, we design the following bi-level optimization problem:

min𝝀ΛFeod(𝝀)\displaystyle\min_{\bm{\lambda}\in\Lambda}F_{\mathrm{eod}}(\bm{\lambda}) =min𝝀Λa=1A1[(L1,a(𝒘𝝀)L1,0(𝒘𝝀))2+(L0,a(𝒘𝝀)L0,0(𝒘𝝀))2]\displaystyle=\min_{\bm{\lambda}\in\Lambda}\sum_{a=1}^{A-1}\left[(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}))^{2}+(L_{0,a}(\bm{w}_{\bm{\lambda}})-L_{0,0}(\bm{w}_{\bm{\lambda}}))^{2}\right] (101)
𝒘𝝀\displaystyle\bm{w}_{\bm{\lambda}} =argmin𝒘a=1A1(λ0,aL0,a(𝒘)+λ1,aL1,a(𝒘))\displaystyle=\arg\min_{\bm{w}}\sum_{a=1}^{A-1}\left(\lambda_{0,a}L_{0,a}(\bm{w})+\lambda_{1,a}L_{1,a}(\bm{w})\right) (102)
+(n0,na=1A1λ0,a)L0,0(𝒘)+(n1,na=1A1λ1,a)L1,0(𝒘).\displaystyle+(\frac{n_{0,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{0,a})L_{0,0}(\bm{w})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{1,a})L_{1,0}(\bm{w}). (103)

Here

Λ={(λ0,1,,λ0,A1,λ1,1,,λ1,A1):\displaystyle\Lambda=\{(\lambda_{0,1},\dots,\lambda_{0,A-1},\lambda_{1,1},\dots,\lambda_{1,A-1}): a=1A1λ0,an0,n,a=1A1λ1,an1,n,\displaystyle\sum_{a=1}^{A-1}\lambda_{0,a}\leq\frac{n_{0,\star}}{n},\quad\sum_{a=1}^{A-1}\lambda_{1,a}\leq\frac{n_{1,\star}}{n}, (104)
λ0,a,λ1,a0,for all a=1,,A1}.\displaystyle\lambda_{0,a},\lambda_{1,a}\geq 0,\quad\text{for all }a=1,\dots,A-1\}. (105)

For equalized odds, we make the following assumption:

Assumption 7.

Ly,a()L_{y,a}(\cdot) is twice differentiable for all y{0,1}y\in\{0,1\}, a[A]a\in[A], and

a=1A1[(λ1,a2L1,a(𝒘𝝀)+λ0,a2L0,a(𝒘𝝀)+(n0,na=1A1λ0,a)2L0,0(𝒘𝝀)\displaystyle\sum_{a=1}^{A-1}\Big{[}(\lambda_{1,a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})+\lambda_{0,a}\nabla^{2}L_{0,a}(\bm{w}_{\bm{\lambda}})+(\frac{n_{0,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{0,a})\nabla^{2}L_{0,0}(\bm{w}_{\bm{\lambda}}) (106)
+(n1,na=1A1λ1,a)2L1,0(𝒘𝝀))]0\displaystyle+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{1,a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}}))\Big{]}\succ 0 (107)

for all λΛ\lambda\in\Lambda.

With the above assumption, the following lemma provides the update rule:

Lemma 26.

If Assumption 7 holds, then on the direction

𝝁(𝝀)=(μ0,1(𝝀),,μ0,A1(𝝀),μ1,1(𝝀),,μ1,A1(𝝀))\bm{\mu}(\bm{\lambda})=(\mu_{0,1}(\bm{\lambda}),\dots,\mu_{0,A-1}(\bm{\lambda}),\mu_{1,1}(\bm{\lambda}),\dots,\mu_{1,A-1}(\bm{\lambda})), with

μy,a(𝝀)=Ly,a(𝒘𝝀)Ly,0(𝒘𝝀),y{0,1},a[A],\mu_{y,a}(\bm{\lambda})=L_{y,a}(\bm{w}_{\bm{\lambda}})-L_{y,0}(\bm{w}_{\bm{\lambda}}),\quad y\in\{0,1\},\quad a\in[A], (108)

we have 𝛍(𝛌)Feod(𝛌)0\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\mathrm{eod}}(\bm{\lambda})\leq 0, and the equality holds if only if 𝛍(𝛌)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}.

Update rule for equalized odd:

λy,a(t+1)=λy,a(t)+αt𝝁(𝝀(t))2(Ly,a(𝒘𝝀(t))Ly,0(𝒘𝝀(t))) for y{0,1},a[A].\lambda_{y,a}^{(t+1)}=\lambda_{y,a}^{(t)}+\frac{\alpha_{t}}{\|\bm{\mu}(\bm{\lambda}^{(t)})\|_{2}}(L_{y,a}(\bm{w}_{\bm{\lambda}^{(t)}})-L_{y,0}(\bm{w}_{\bm{\lambda}^{(t)}}))\text{ for }y\in\{0,1\},a\in[A]. (109)
Proof of Lemma 26.

We compute the derivative as

Feod(𝝀)λy,j=2a=1A1[(L1,a(𝒘𝝀)L1,0(𝒘𝝀))(L1,a(𝒘𝝀)L1,0(𝒘𝝀))+(L0,a(𝒘𝝀)L0,0(𝒘𝝀))(L0,a(𝒘𝝀)L0,0(𝒘𝝀))]𝒘𝝀λy,j.\begin{split}\frac{\partial F_{\mathrm{eod}}(\bm{\lambda})}{\partial\lambda_{y,j}}&=2\sum_{a=1}^{A-1}\big{[}(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{1,a}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,0}(\bm{w}_{\bm{\lambda}}))\\ &+(L_{0,a}(\bm{w}_{\bm{\lambda}})-L_{0,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{0,a}(\bm{w}_{\bm{\lambda}})-\nabla L_{0,0}(\bm{w}_{\bm{\lambda}}))\big{]}\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{y,j}}.\end{split} (110)

Note that 𝒘𝝀\bm{w}_{\bm{\lambda}} is the minimizer to (103), we have

a=1A1(λ0,aL0,a(𝒘𝝀)+λ1,aL1,a(𝒘𝝀))\displaystyle\sum_{a=1}^{A-1}(\lambda_{0,a}\nabla L_{0,a}(\bm{w}_{\bm{\lambda}})+\lambda_{1,a}\nabla L_{1,a}(\bm{w}_{\bm{\lambda}})) (111)
+(n0,na=1A1λ0,a)L0,0(𝒘𝝀)+(n1,na=1A1λ1,a)L1,0(𝒘𝝀)=0.\displaystyle+(\frac{n_{0,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{0,a})\nabla L_{0,0}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{1,a})\nabla L_{1,0}(\bm{w}_{\bm{\lambda}})=0. (112)

We take the λy,j\lambda_{y,j} derivative to the above equation and have

Ly,j(𝒘𝝀)Ly,0(𝒘𝝀)+[a=1A1(λ0,a2L0,a(𝒘𝝀)+λ1,a2L1,a(𝒘𝝀))\displaystyle\nabla L_{y,j}(\bm{w}_{\bm{\lambda}})-\nabla L_{y,0}(\bm{w}_{\bm{\lambda}})+\Big{[}\sum_{a=1}^{A-1}(\lambda_{0,a}\nabla^{2}L_{0,a}(\bm{w}_{\bm{\lambda}})+\lambda_{1,a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})) (113)
+(n0,na=1A1λ0,a)2L0,0(𝒘𝝀)+(n1,na=1A1λ1,a)2L1,0(𝒘𝝀)]𝒘𝝀λy,j=0.\displaystyle+(\frac{n_{0,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{0,a})\nabla^{2}L_{0,0}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{1,a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}})\Big{]}\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{y,j}}=0. (114)

Thus we get

𝒘𝝀λy,j\displaystyle\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{y,j}} =[a=1A1(λ0,a2L0,a(𝒘𝝀)+λ1,a2L1,a(𝒘𝝀))\displaystyle=\Big{[}\sum_{a=1}^{A-1}(\lambda_{0,a}\nabla^{2}L_{0,a}(\bm{w}_{\bm{\lambda}})+\lambda_{1,a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})) (115)
+(n0,na=1A1λ0,a)2L0,0(𝒘𝝀)+(n1,na=1A1λ1,a)2L1,0(𝒘𝝀)]1\displaystyle+(\frac{n_{0,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{0,a})\nabla^{2}L_{0,0}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{1,a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}})\Big{]}^{-1} (116)
[Ly,0(𝒘𝝀)Ly,j(𝒘𝝀)].\displaystyle\quad[\nabla L_{y,0}(\bm{w}_{\bm{\lambda}})-\nabla L_{y,j}(\bm{w}_{\bm{\lambda}})]. (117)

Then on the direction 𝝁(𝝀)\bm{\mu}(\bm{\lambda}) given by (108), we combine (110) and (117) to have

𝝁(𝝀)Feod(𝝀)\displaystyle\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\mathrm{eod}}(\bm{\lambda}) (118)
=2[a=1A1[(L1,a(𝒘𝝀)L1,0(𝒘𝝀))(L1,a(𝒘𝝀)L1,0(𝒘𝝀))\displaystyle=2\Big{[}\sum_{a=1}^{A-1}[(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{1,a}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,0}(\bm{w}_{\bm{\lambda}})) (119)
+(L0,a(𝒘𝝀)L0,0(𝒘𝝀))(L0,a(𝒘𝝀)L0,0(𝒘𝝀))]]\displaystyle+(L_{0,a}(\bm{w}_{\bm{\lambda}})-L_{0,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{0,a}(\bm{w}_{\bm{\lambda}})-\nabla L_{0,0}(\bm{w}_{\bm{\lambda}}))]\Big{]} (120)
[a=1A1(λ0,a2L0,a(𝒘𝝀)+λ1,a2L1,a(𝒘𝝀))\displaystyle\Big{[}\sum_{a=1}^{A-1}(\lambda_{0,a}\nabla^{2}L_{0,a}(\bm{w}_{\bm{\lambda}})+\lambda_{1,a}\nabla^{2}L_{1,a}(\bm{w}_{\bm{\lambda}})) (121)
+(n0,na=1A1λ0,a)2L0,0(𝒘𝝀)+(n1,na=1A1λ1,a)2L1,0(𝒘𝝀)]1\displaystyle+(\frac{n_{0,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{0,a})\nabla^{2}L_{0,0}(\bm{w}_{\bm{\lambda}})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{1,a})\nabla^{2}L_{1,0}(\bm{w}_{\bm{\lambda}})\Big{]}^{-1} (122)
[a=1A1[(L1,a(𝒘𝝀)L1,0(𝒘𝝀))(L1,0(𝒘𝝀)L1,a(𝒘𝝀))\displaystyle\Big{[}\sum_{a=1}^{A-1}[(L_{1,a}(\bm{w}_{\bm{\lambda}})-L_{1,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{1,0}(\bm{w}_{\bm{\lambda}})-\nabla L_{1,a}(\bm{w}_{\bm{\lambda}})) (123)
+(L0,a(𝒘𝝀)L0,0(𝒘𝝀))(L0,0(𝒘𝝀)L0,a(𝒘𝝀))]]0\displaystyle+(L_{0,a}(\bm{w}_{\bm{\lambda}})-L_{0,0}(\bm{w}_{\bm{\lambda}}))(\nabla L_{0,0}(\bm{w}_{\bm{\lambda}})-\nabla L_{0,a}(\bm{w}_{\bm{\lambda}}))]\Big{]}\leq 0 (124)

where we have used Assumption 7, and the equality holds only when 𝝁(𝝀)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}.

The full procedure is described in Algorithm 4.

Server executes:
      input : Learning rate {αt}t\{\alpha_{t}\}_{t\in\mathbb{N}};
      Initialize λy,a\lambda_{y,a} as ny,an\frac{n_{y,a}}{n} for all y{0,1},a[A]\{0}y\in\left\{0,1\right\},a\in[A]\backslash\left\{0\right\};
      for each iteration t=1,2,t=1,2,\dots do
           Clients perform updates;
           𝒘𝝀\bm{w}_{\bm{\lambda}}\leftarrow SecAgg ({𝒘(i)})(\left\{\bm{w}^{(i)}\right\}) for all ii;
           μy,a\mu_{y,a}\leftarrow SecAgg({μy,a(i)})(\left\{\mu^{(i)}_{y,a}\right\}) for all y{0,1},a[A]\{0}y\in\{0,1\},a\in[A]\backslash\left\{0\right\};
           if t%k=0t\%k=0 then
                λy,aλy,a+αt𝝁2μy,a, for all a[A]\{0}\lambda_{y,a}\leftarrow\lambda_{y,a}+\frac{\alpha_{t}}{\|\bm{\mu}\|_{2}}\mu_{y,a},\text{ for all }a\in[A]\backslash\left\{0\right\};
                Broadcast 𝝀\bm{\lambda} to clients;
               
               
               Broadcast 𝒘𝝀\bm{w}_{\bm{\lambda}} to clients;
               
                end for
               output : 𝒘𝝀\bm{w}_{\bm{\lambda}}
               
               
               
               
               ClientUpdate(i,w,λ)(i,\bm{w},\bm{\lambda}):
                     𝒘(i)\bm{w}^{(i)}\leftarrow Gradient descent w.r.t objective function a=1A1(λ0,aL0,a(i)(𝒘)+λ1,aL1,a(i)(𝒘))+(n0,na=1A1λ0,a)L0,0(i)(𝒘)+(n1,na=1A1λ1,a)L1,0(i)(𝒘)\sum_{a=1}^{A-1}\left(\lambda_{0,a}L^{(i)}_{0,a}(\bm{w})+\lambda_{1,a}L^{(i)}_{1,a}(\bm{w})\right)+(\frac{n_{0,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{0,a})L^{(i)}_{0,0}(\bm{w})+(\frac{n_{1,\star}}{n}-\sum_{a=1}^{A-1}\lambda_{1,a})L^{(i)}_{1,0}(\bm{w});
                     Send 𝒘(i),μy,a(i)\bm{w}^{(i)},\mu_{y,a}^{(i)} for all y[1],a[A]y\in[1],a\in[A] to server via a SecAgg protocol;
                    
                    
                    
Algorithm 4 FedFB(k,t)(k,t) w.r.t Equalized Odds

B.4 FedFB w.r.t client parity

For client parity, we slightly abuse the notation and define the loss over client ii as L(i)(𝒘):=(x,y,a,i):i=i(y,y^;𝒘)/n(i)L^{(i)}(\bm{w}):=\sum_{(\mathrm{x},\mathrm{y},\mathrm{a},\mathrm{i}):\mathrm{i}=i}\ell(\mathrm{y},\hat{\mathrm{y}};\bm{w})/n^{(i)}, with n(i):=|{(x,y,a,i):i=i}|n^{(i)}:=|\{(\mathrm{x},\mathrm{y},\mathrm{a},\mathrm{i}):\mathrm{i}=i\}|. Note that the L(i)(𝒘)L^{(i)}(\bm{w}) here is different from the one in Sec. B.1. We design the following bi-level optimization problem:

min𝝀ΛFcp(𝝀)=min𝝀Λi=1I1(L(i)(𝒘𝝀)L(0)(𝒘𝝀))2,\displaystyle\min_{\bm{\lambda}\in\Lambda}F_{\mathrm{cp}}(\bm{\lambda})=\min_{\bm{\lambda}\in\Lambda}\sum_{i=1}^{I-1}\left(L^{(i)}\left(\bm{w}_{\bm{\lambda}}\right)-L^{(0)}\left(\bm{w}_{\bm{\lambda}}\right)\right)^{2}, (125)
𝒘𝝀=argmin𝒘i=1I1λ(i)L(i)(𝒘)+(1i=1I1λ(i))L(0)(𝒘).\displaystyle\bm{w}_{\bm{\lambda}}=\underset{\bm{w}}{\arg\min}\sum_{i=1}^{I-1}\lambda^{(i)}L^{(i)}(\bm{w})+(1-\sum_{i=1}^{I-1}\lambda^{(i)})L^{(0)}(\bm{w}). (126)

Here

Λ={(λ(1),,λ(I1)):0λ(i)1,i=1I1λ(i)1}.\Lambda=\{(\lambda^{(1)},\dots,\lambda^{(I-1)}):0\leq\lambda^{(i)}\leq 1,\quad\sum_{i=1}^{I-1}\lambda^{(i)}\leq 1\}.

For client parity, we make the following assumption:

Assumption 8.

L(i)()L^{(i)}(\cdot) is twice differentiable for all i=1,,I1i=1,\dots,I-1, and

i=1I1λ(i)2L(i)(𝒘𝝀)+(1i=1I1λ(i))2L(0)(𝒘𝝀)0.\sum_{i=1}^{I-1}\lambda^{(i)}\nabla^{2}L^{(i)}(\bm{w}_{\bm{\lambda}})+(1-\sum_{i=1}^{I-1}\lambda^{(i)})\nabla^{2}L^{(0)}(\bm{w}_{\bm{\lambda}})\succ 0.

With the above assumption, the following lemma provides the update rule:

Lemma 27.

If Assumption 8 holds, then on the direction

𝝁(𝝀)=(L(1)(𝒘𝝀)L(0)(𝒘𝝀),,L(I1)(𝒘𝝀)L(0)(𝒘𝝀)),\bm{\mu}(\bm{\lambda})=(L^{(1)}(\bm{w}_{\bm{\lambda}})-L^{(0)}(\bm{w}_{\bm{\lambda}}),\dots,L^{(I-1)}(\bm{w}_{\bm{\lambda}})-L^{(0)}(\bm{w}_{\bm{\lambda}})), (127)

we have 𝛍(𝛌)Fcp(𝛌)0\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\mathrm{cp}}(\bm{\lambda})\leq 0, and the equality holds if and only if 𝛍(𝛌)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}.

Then we update 𝝀(t)=(λ(1)(t),,λ(I1)(t))\bm{\lambda}^{(t)}=(\lambda^{(1)^{(t)}},\dots,\lambda^{(I-1)^{(t)}}) as follows:

Update rule for client parity:

λ(i)(t+1)=λ(i)(t)+αt𝝁(𝝀(t))2(L(i)(𝒘𝝀(t))L(0)(𝒘𝝀(t))) for i=1,,I1.\lambda^{(i)^{(t+1)}}=\lambda^{(i)^{(t)}}+\frac{\alpha_{t}}{\|\bm{\mu}(\bm{\lambda}^{(t)})\|_{2}}(L^{(i)}(\bm{w}_{\bm{\lambda}^{(t)}})-L^{(0)}(\bm{w}_{\bm{\lambda}^{(t)}}))\text{ for }i=1,\dots,I-1. (128)
Proof of Lemma 27.

We compute the derivative as

Fcp(𝝀)λ(j)=(2i=1I1[L(i)(𝒘𝝀)L(0)(𝒘𝝀)][L(i)(𝒘𝝀)L(0)(𝒘𝝀)])𝒘𝝀λ(j).\begin{split}\frac{\partial F_{\mathrm{cp}}(\bm{\lambda})}{\partial\lambda^{(j)}}&=\Big{(}2\sum_{i=1}^{I-1}[L^{(i)}(\bm{w}_{\bm{\lambda}})-L^{(0)}(\bm{w}_{\bm{\lambda}})][\nabla L^{(i)}(\bm{w}_{\bm{\lambda}})-\nabla L^{(0)}(\bm{w}_{\bm{\lambda}})]\Big{)}\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda^{(j)}}.\end{split} (129)

Note that 𝒘𝝀\bm{w}_{\bm{\lambda}} is the minimizer to (126), we have

i=1I1λ(i)L(i)(𝒘𝝀)+(1i=1I1λ(i))L(0)(𝒘𝝀)=0.\sum_{i=1}^{I-1}\lambda^{(i)}\nabla L^{(i)}(\bm{w}_{\bm{\lambda}})+(1-\sum_{i=1}^{I-1}\lambda^{(i)})\nabla L^{(0)}(\bm{w}_{\bm{\lambda}})=0.

We take the λ(j)\lambda^{(j)} derivative to the above equation and have

L(j)(𝒘𝝀)+i=1I1λ(i)2L(i)(𝒘𝝀)𝒘𝝀λjL(0)(𝒘𝝀)+(1i=1I1λ(i))2L(0)(𝒘𝝀)𝒘𝝀λj=0.\nabla L^{(j)}(\bm{w}_{\bm{\lambda}})+\sum_{i=1}^{I-1}\lambda^{(i)}\nabla^{2}L^{(i)}(\bm{w}_{\bm{\lambda}})\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}-\nabla L^{(0)}(\bm{w}_{\bm{\lambda}})+(1-\sum_{i=1}^{I-1}\lambda^{(i)})\nabla^{2}L^{(0)}(\bm{w}_{\bm{\lambda}})\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda_{j}}=0.

Thus we get

𝒘𝝀λ(j)=[i=1I1λ(i)2L(i)(𝒘𝝀)+(1i=1I1λ(i))2L(0)(𝒘𝝀)]1[L(0)(𝒘𝝀)L(j)(𝒘𝝀)].\frac{\partial\bm{w}_{\bm{\lambda}}}{\partial\lambda^{(j)}}=\Big{[}\sum_{i=1}^{I-1}\lambda^{(i)}\nabla^{2}L^{(i)}(\bm{w}_{\bm{\lambda}})+(1-\sum_{i=1}^{I-1}\lambda^{(i)})\nabla^{2}L^{(0)}(\bm{w}_{\bm{\lambda}})\Big{]}^{-1}[\nabla L^{(0)}(\bm{w}_{\bm{\lambda}})-\nabla L^{(j)}(\bm{w}_{\bm{\lambda}})]. (130)

Then on the direction given by (127), we combine (129) and (130) to have

𝝁(𝝀)Fcp(𝝀)\displaystyle\bm{\mu}(\bm{\lambda})\cdot\nabla F_{\mathrm{cp}}(\bm{\lambda}) (131)
=2(i=1I1[L(i)(𝒘𝝀)L(0)(𝒘𝝀)][L(i)(𝒘𝝀)L(0)(𝒘𝝀)])\displaystyle=2\Big{(}\sum_{i=1}^{I-1}[L^{(i)}(\bm{w}_{\bm{\lambda}})-L^{(0)}(\bm{w}_{\bm{\lambda}})][\nabla L^{(i)}(\bm{w}_{\bm{\lambda}})-\nabla L^{(0)}(\bm{w}_{\bm{\lambda}})]\Big{)} (132)
[i=1I1λ(i)2L(i)(𝒘𝝀)+(1i=1I1λ(i))2L(0)(𝒘𝝀)]1\displaystyle\quad\Big{[}\sum_{i=1}^{I-1}\lambda^{(i)}\nabla^{2}L^{(i)}(\bm{w}_{\bm{\lambda}})+(1-\sum_{i=1}^{I-1}\lambda^{(i)})\nabla^{2}L^{(0)}(\bm{w}_{\bm{\lambda}})\Big{]}^{-1} (133)
i=1I1([L(i)(𝒘𝝀)L(0)(𝒘𝝀)][L(0)(𝒘𝝀)L(i)(𝒘𝝀)])0,\displaystyle\quad\sum_{i=1}^{I-1}\Big{(}[L^{(i)}(\bm{w}_{\bm{\lambda}})-L^{(0)}(\bm{w}_{\bm{\lambda}})][\nabla L^{(0)}(\bm{w}_{\bm{\lambda}})-\nabla L^{(i)}(\bm{w}_{\bm{\lambda}})]\Big{)}\leq 0, (134)

where we have used Assumption 8, and the equality holds only when 𝝁(𝝀)=𝟎\bm{\mu}(\bm{\lambda})=\bm{0}.

The Algorithm 5 gives the full description.

Server executes:
      input : Learning rate {αt}t\{\alpha_{t}\}_{t\in\mathbb{N}};
      Initialize λ(i)\lambda^{(i)} as n(i)n\frac{n^{(i)}}{n} for all i[I]\{0}i\in[I]\backslash\left\{0\right\};
      for each iteration t=1,2,t=1,2,\dots do
           Clients perform updates;
           𝒘𝝀\bm{w}_{\bm{\lambda}}\leftarrow SecAgg {𝒘(i)}\left\{\bm{w}^{(i)}\right\} for all ii;
           μ(i)L(i)L(0),i[I]\{0}\mu^{(i)}\leftarrow L^{(i)}-L^{(0)},i\in[I]\backslash\{0\};
           if t%k=0t\%k=0 then
                λ(i)λ(i)+αt𝝁2μ(i), for all i[I]\{0}\lambda^{(i)}\leftarrow\lambda^{(i)}+\frac{\alpha_{t}}{\|\bm{\mu}\|_{2}}\mu^{(i)},\text{ for all }i\in[I]\backslash\left\{0\right\};
                Broadcast 𝝀\bm{\lambda} to clients;
               
               
               Broadcast 𝒘𝝀\bm{w}_{\bm{\lambda}} to clients;
               
                end for
               output : 𝒘𝝀\bm{w}_{\bm{\lambda}}
               
               
               
               
               ClientUpdate(i,w,λ)(i,\bm{w},\bm{\lambda}):
                     𝒘(i)\bm{w}^{(i)}\leftarrow Gradient descent w.r.t objective function i0λ(i)L(i)(𝒘)+i=0(1j=1I1λ(i))L(0)(𝒘)\llbracket i\neq 0\rrbracket\lambda^{(i)}L^{(i)}(\bm{w})+\llbracket i=0\rrbracket(1-\sum_{j=1}^{I-1}\lambda^{(i)})L^{(0)}(\bm{w});
                     Send 𝒘(i)\bm{w}^{(i)} to server via a SecAgg protocol; Send L(i)(𝒘)L^{(i)}(\bm{w}) to server;
                    
                    
                    
                    
Algorithm 5 FedFB w.r.t Client Parity

Appendix C Appendix - Experiments

We continue from Sec. 2 and provide more details on experimental settings, and other supplementary experiment results.

C.1 Experiment setting

The reported statistics are computed on the test set, and we set 10 communication rounds and 30 local epochs for all federated learning algorithms except AgnosticFair. We run 300 epochs for other methods. For all tasks, we randomly split data into a training set and a testing set at a ratio of 7:3. The batch size is set to be 128. For all methods, we choose learning rate from {0.001,0.005,0.01}\left\{0.001,0.005,0.01\right\}. We solve FedFB with sample weight learning rates α{0.001,0.05,0.08,0.1,0.2,0.5,1,2}\alpha\in\left\{0.001,0.05,0.08,0.1,0.2,0.5,1,2\right\} in parallel and select the α\alpha which achieves the highest fairness. All benchmark models are tuned according to the hyperparameter configuration suggested in their original works. We perform cross-validation on the training sets to find the best hyperparameter for all algorithms.

C.2 Supplementary experiment results

Fig. 9 shows the accuracy versus fairness violation for the 3 clients’ cases. We see a clear advantage of LFT+FedAvg over the LFT+Ensemble in both 2 clients’ cases and 3 clients’ cases. The achievable fairness range of LFT+FedAvg is much wider than that of LFT+Ensemble, though the accuracy of FedAvg is not guaranteed when the data heterogeneity increases.

Refer to caption
Figure 9: Accuracy-Fairness tradeoff curves of CFL, LFT+FedAvg, and LFT+Ensemble for three clients cases. The data heterogeneity is increasing from left to right. The green dotted vertical line describes the lower bound of unfairness LFT+FedAvg can achieve, and and the orange dotted vertical line describes the lower bound of unfairness LFT+Ensemble can achieve. Here the distribution setting is xa=0,i=0𝒩(3,1),xa=1,i=0𝒩(5,1),xa=0,i=1𝒩(1,1),xa=1,i=1𝒩(1,1),xa=0,i=2𝒩(1,1),xa=1,i=2𝒩(2,1),ai=iBern(qi){\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=0\sim{\mathcal{N}}(3,1),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=0\sim{\mathcal{N}}(5,1),{\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=1\sim{\mathcal{N}}(1,1),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=1\sim{\mathcal{N}}(-1,1),{\mathrm{x}}\mid{\mathrm{a}}=0,{\mathrm{i}}=2\sim{\mathcal{N}}(1,1),{\mathrm{x}}\mid{\mathrm{a}}=1,{\mathrm{i}}=2\sim{\mathcal{N}}(2,1),{\mathrm{a}}\mid{\mathrm{i}}=i\sim\operatorname{Bern}(q_{i}) for i=0,1,2i=0,1,2. The data heterogeneity here is captured by |q2q0||q_{2}-q_{0}|. (a) q0=q1=q2=0.5q_{0}=q_{1}=q_{2}=0.5. (b) q0=0.4,q1=0.5,q2=0.6q_{0}=0.4,q_{1}=0.5,q_{2}=0.6. (c): q0=0.3,q1=0.5,q2=0.7q_{0}=0.3,q_{1}=0.5,q_{2}=0.7. (d): q0=0.2,q1=0.5,q2=0.8q_{0}=0.2,q_{1}=0.5,q_{2}=0.8.

We also compare FedAvg, LFT+Ensemble, LFT+FedAvg, FedFB, AgnosticFair and CFL on logistic regression. Table 4 shows that our method outperforms LFT+Ensemble and LFT+FedAvg, while achieving similar performance as AgnosticFair and CFL but ensuring higher privacy.

Table 4: Comparison of accuracy and fairness in the synthetic, Adult, COMPAS, and Bank datasets w.r.t demographic parity (DP) on logistic regression. The implementation of LFT+Ensemble, LFT+FedAvg, and CFL are all based on FB.
Property Synthetic Adult COMPAS Bank
Private Fair Method Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow)
FedAvg .884±\pm.001 .419±\pm.006 .837±\pm.007 .144±\pm.015 .658±\pm.006 .149±\pm.022 .900±\pm.000 .026±\pm.001
LFT+Ensemble .712±\pm.198 .266±\pm.175 .819±\pm006. .032±\pm.031 .606±\pm.018 .089±\pm.058 .888±\pm.005 .008±\pm.007
LFT+FedAvg .789±\pm.138 .301±\pm.192 .828±\pm.002 .098±\pm.008 .577±\pm.003 .053±\pm.003 .892±\pm.000 .013±\pm.000
FedFB (Ours) .756±\pm.001 .085±\pm.001 .820±\pm.000 .002±\pm.001 .600±\pm.003 .059±\pm.003 .890±\pm.001 .011±\pm.002
CFL .709±\pm.001 .011±\pm.001 .800±\pm.003 .016±\pm.003 .587±\pm.003 .032±\pm.002 .883±\pm.000 .000±\pm.000

C.3 Synthetic dataset generation

We generate a synthetic dataset of 5,000 examples with two non-sensitive attributes (x1,x2)({\mathrm{x}}_{1},{\mathrm{x}}_{2}), a binary sensitive a{\mathrm{a}}, and a binary label y{\mathrm{y}}. A tuple (x1,x2,y)({\mathrm{x}}_{1},{\mathrm{x}}_{2},{\mathrm{y}}) is randomly generated based on the two Gaussian distributions: (x1,x2)y𝒩([2;2],[10,1;1,3])({\mathrm{x}}_{1},{\mathrm{x}}_{2})\mid{\mathrm{y}}\sim{\mathcal{N}}([-2;-2],[10,1;1,3]) and (x1,x2)y=1𝒩([2;2],[5,1])({\mathrm{x}}_{1},{\mathrm{x}}_{2})\mid{\mathrm{y}}=1\sim{\mathcal{N}}([2;2],[5,1]), where yBern(0.6){\mathrm{y}}\sim\operatorname{Bern}(0.6). For the sensitive attribute a{\mathrm{a}}, we generate biased data using an unfair scenario px1,x2((x1,x2)y=1)/[px1,x2((x1,x2)y=0)+px1,x2((x1,x2)y=1)]p_{{\mathrm{x}}_{1},{\mathrm{x}}_{2}}(({\mathrm{x}}_{1}^{\prime},{\mathrm{x}}_{2}^{\prime})\mid{\mathrm{y}}=1)/[p_{{\mathrm{x}}_{1},{\mathrm{x}}_{2}}(({\mathrm{x}}_{1}^{\prime},{\mathrm{x}}_{2}^{\prime})\mid{\mathrm{y}}=0)+p_{{\mathrm{x}}_{1},{\mathrm{x}}_{2}}(({\mathrm{x}}_{1}^{\prime},{\mathrm{x}}_{2}^{\prime})\mid{\mathrm{y}}=1)], where px1,x2p_{{\mathrm{x}}_{1},{\mathrm{x}}_{2}} is the pdf of (x1,x2)({\mathrm{x}}_{1},{\mathrm{x}}_{2}).

We split the data into three clients in a non-iid way. We randomly assign 50%,30%,20%50\%,30\%,20\% of the samples from group 0 and 20%,40%,40%20\%,40\%,40\% of the samples from group 1 to 1st, 2nd, and 3rd client, respectively. To study the empirical relationship between the performance of FedFB and data heterogeneity, we split the dataset into other ratios to obtain the desired level of data heterogeneity. To get the dataset with low data heterogeneity, we draw samples from each group into three clients in a ratio of 33%,33%,34%33\%,33\%,34\% and 33%,33%,34%33\%,33\%,34\%. For dataset with high data heterogeneity, the ratio we choose is 70%,10%,20%70\%,10\%,20\% and 10%,80%,10%10\%,80\%,10\%.

C.4 Empirical relationship between accuracy, fairness, and the number of clients

We investigate the empirical relationship between accuracy, fairness, and the number of clients. We generate three synthetic datasets of 3,333, 5,000, and 6,667 samples, and split them into two, three, and four clients, respectively. Table 5 shows that our method outperforms under the three cases.

Table 5: Comparison of accuracy and fairness in the synthetic datasets with different numbers of clients w.r.t demographic parity (DP) on multilayer perceptron. The implementation of LFT+Ensemble, LFT+FedAvg, and CFL are all based on FB.
Number of clients two clients three clients four clients
Method Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow)
FedAvg .879±\pm.004 .360±\pm.013 .886±\pm.003 .406±\pm.009 .879±\pm.003 .382±\pm.005
LFT+Ensemble .780±\pm.090 .161±\pm.157 .727±\pm.194 .248±\pm.194 .720±\pm.192 .246±\pm.180
LFT+FedAvg .866±\pm.002 .429±\pm.017 .823±\pm.102 .305±\pm.131 .608±\pm.367 .491±\pm.102
FedFB .679±\pm.007 .047±\pm.012 .725±\pm.012 .051±\pm.018 .705±\pm.004 .005±\pm.003
CFL .668±\pm.028 .063±\pm.035 .726±\pm.009 .028±\pm.016 .670±\pm.035 .064±\pm.020

C.5 Comparison with FairFed

Table 6: Comparison of accuracy and fairness under the same setting as Ezzeldin et al. [2021]. The statistics of FairFed are from Ezzeldin et al. [2021].
Method Adult COMPAS
Heterogeneity Level α\alpha Heterogeneity Level α\alpha
0.1 0.2 0.5 10 5000 0.1 0.2 0.5 10 5000
Acc.(\uparrow) FairFed .775 .794 .819 .824 .824 .594 .586 .608 .636 .640
FedFB (Ours) .799 .769 .772 .822 .769 .668 .655 .665 .672 .666
||SPD||(\downarrow) FairFed .021 .037 .061 .065 .065 .048 .040 .072 .108 .115
FedFB (Ours) .042 .034 .038 .063 .060 .006 .011 .026 .032 .004
Table 7: Comparison of accuracy and fairness in the synthetic, Adult, COMPAS, and Bank datasets w.r.t demographic parity (DP) on multilayer perceptron.
Synthetic Adult COMPAS Bank
Method Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow) Acc.(\uparrow) DP Disp.(\downarrow)
FedFB (Ours) .725±\pm.012 .051±\pm.018 .804±\pm.001 .028±\pm.001 .616±\pm.033 .036±\pm.028 .883±\pm.000 .000±\pm.000
AgnosticFair .655±\pm.028 .028±\pm.033 .790±\pm.012 .032±\pm.016 .612±\pm.037 .096±\pm.046 .883±\pm.000 .000±\pm.000

To make fairer comparison, we follow the exact same setting as Ezzeldin et al. [2021] to re-split Adult and COMPAS dataset, employ |SPD|=|(y^=1a=0)(y^=1a=1)||SPD|=|{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\textnormal{a}}=0)-{\mathbb{P}}(\hat{{\mathrm{y}}}=1\mid{\textnormal{a}}=1)| as unfairness metric and present the results in Table 6. We observe that FedFB achieves higher fairness than FairFed while being robust to data heterogeneity.

C.6 Comparison with AgnosticFair

Lastly, we compare our method with AgnosticFair [Du et al., 2021], which exchanges the model parameters and the other information after every gradient update to mimic the performance of CFL with FairnessConstraint implementation [Zafar et al., 2017c]. Table 7 shows that our FedFB achieves similar performance as AgnosticFair, at a much lower cost of privacy.