
Loss Balancing for Fair Supervised Learning

Mohammad Mahdi Khalili    Xueru Zhang    Mahed Abroshan
Abstract

Supervised learning models have been used in various domains such as lending, college admission, face recognition, natural language processing, etc. However, they may inherit pre-existing biases from training data and exhibit discrimination against protected social groups. Various fairness notions have been proposed to address unfairness issues. In this work, we focus on Equalized Loss (EL), a fairness notion that requires the expected loss to be (approximately) equalized across different groups. Imposing EL on the learning process leads to a non-convex optimization problem even if the loss function is convex, and existing fair learning algorithms cannot be readily adapted to find the fair predictor under the EL constraint. This paper introduces an algorithm that can leverage off-the-shelf convex programming tools (e.g., CVXPY (Diamond and Boyd, 2016; Agrawal et al., 2018)) to efficiently find the global optimum of this non-convex optimization. In particular, we propose the ELminimizer algorithm, which finds the optimal fair predictor under EL by reducing the non-convex optimization to a sequence of convex optimization problems. We theoretically prove that our algorithm finds the global optimal solution under certain conditions, and we support our theoretical results through several empirical studies.


1 Introduction

As machine learning (ML) algorithms are increasingly being used in applications such as education, lending, recruitment, healthcare, and criminal justice, there is a growing concern that the algorithms may exhibit discrimination against protected population groups. For example, speech recognition products such as Google Home and Amazon Alexa were shown to have accent bias (Harwell, 2018). The COMPAS recidivism prediction tool, used by courts in the US in parole decisions, has been shown to have a substantially higher false positive rate for African Americans compared to the general population (Dressel and Farid, 2018). Amazon had been using automated software since 2014 to assess applicants' resumes; the tool was found to be biased against women (Dastin, 2018). As a result, there have been several works focusing on interpreting machine learning models to understand how features and sensitive attributes contribute to the output of the model (Ribeiro et al., 2016; Lundberg and Lee, 2017; Abroshan et al., 2023).

Various fairness notions have been proposed in the literature to measure and remedy the biases in ML systems; they can be roughly classified into two categories: 1) individual fairness focuses on equity at the individual level and requires similar individuals to be treated similarly (Dwork et al., 2012; Biega et al., 2018; Jung et al., 2019; Gupta and Kamble, 2019); 2) group fairness requires certain statistical measures to be (approximately) equalized across different groups distinguished by some sensitive attributes (Hardt et al., 2016; Conitzer et al., 2019; Khalili et al., 2020; Zhang et al., 2020; Khalili et al., 2021; Diana et al., 2021; Williamson and Menon, 2019; Zhang et al., 2022).

Several approaches have been developed to satisfy a given definition of fairness; they fall under three categories: 1) pre-processing: modifying the original dataset, e.g., by removing certain features or reweighing (e.g., (Kamiran and Calders, 2012; Celis et al., 2020; Abroshan et al., )); 2) in-processing: modifying the algorithm, e.g., by imposing fairness constraints or changing the objective function during training (e.g., (Zhang et al., 2018; Agarwal et al., 2018, 2019; Reimers et al., 2021; Calmon et al., 2017)); 3) post-processing: adjusting the output of the algorithm based on sensitive attributes (e.g., (Hardt et al., 2016)).

In this paper, we focus on group fairness, and we aim to mitigate unfairness issues in supervised learning using an in-processing approach. This problem can be cast as a constrained optimization problem by minimizing a loss function subject to a group fairness constraint. We are particularly interested in the Equalized Loss (EL) fairness notion proposed by Zhang et al. (2019), which requires the expected loss (e.g., Mean Squared Error (MSE), Binary Cross Entropy (BCE) loss) to be equalized across different groups. (Zhang et al. (2019) propose the EL fairness notion without providing an efficient algorithm for satisfying it.)

The problem of finding fair predictors by solving constrained optimizations has been widely studied. Komiyama et al. (2018) propose the coefficient of determination constraint for learning a fair regressor and develop an algorithm for minimizing the mean squared error (MSE) under their proposed fairness notion. Agarwal et al. (2019) propose an approach to finding a fair regression model under bounded group loss and statistical parity fairness constraints. Agarwal et al. (2018) study classification problems and aim to find fair classifiers under various fairness notions, including statistical parity and equal opportunity. In particular, they consider zero-one loss as the objective function and train a randomized fair classifier over a finite hypothesis space; they show that this problem can be reduced to finding the saddle point of a linear Lagrangian function. Zhang et al. (2018) propose an adversarial debiasing technique to find fair classifiers under equalized odds, equal opportunity, and statistical parity. Unlike these works, we focus on the Equalized Loss fairness notion, which has not been well studied. Finding an EL fair predictor requires solving a non-convex optimization, and there is no algorithm in the fair ML literature with a theoretical performance guarantee that can properly be applied to EL fairness (see Section 2 for a detailed discussion).

Our main contributions can be summarized as follows:

  • We develop an algorithm with a theoretical performance guarantee for EL fairness. In particular, we propose the ELminimizer algorithm to solve the non-convex constrained optimization problem that finds the optimal fair predictor under the EL constraint. We show that this non-convex optimization problem can be reduced to a sequence of convex constrained optimizations. The proposed algorithm finds the global optimal solution and is applicable to both regression and classification problems. Importantly, it can be easily implemented using off-the-shelf convex programming tools.

  • In addition to ELminimizer, which finds the global optimal solution, we develop a simple algorithm for finding a sub-optimal predictor satisfying EL fairness. We prove that there is a sub-optimal solution satisfying EL fairness that is a linear combination of the optimal solutions to two unconstrained optimizations, and that it can be found without solving any constrained optimization.

  • We conduct a sample complexity analysis and provide a generalization performance guarantee. In particular, we show that the sample complexity analysis found in (Donini et al., 2018) is applicable to learning problems under EL.

  • We also examine (in the appendix) the relation between Equalized Loss (EL) and Bounded Group Loss (BGL), another fairness notion proposed by (Agarwal et al., 2019). We show that under certain conditions, these two notions are closely related, and they do not contradict each other.

2 Problem Formulation

Consider a supervised learning problem where the training dataset consists of triples $(\boldsymbol{X}, A, Y)$ from two social groups (we use bold letters to represent vectors). Random variable $\boldsymbol{X} \in \mathcal{X}$ is the feature vector (in the form of a column vector), $A \in \{0,1\}$ is the sensitive attribute (e.g., race, gender) indicating group membership, and $Y \in \mathcal{Y} \subseteq \mathbb{R}$ is the label/output. We denote realizations of random variables by lowercase letters (e.g., $(\boldsymbol{x}, a, y)$ is a realization of $(\boldsymbol{X}, A, Y)$). Feature vector $\boldsymbol{X}$ may or may not include the sensitive attribute $A$. Set $\mathcal{Y}$ can be either $\{0,1\}$ or $\mathbb{R}$: if $\mathcal{Y} = \{0,1\}$ (resp. $\mathcal{Y} = \mathbb{R}$), the problem of interest is a binary classification (resp. regression) problem.

Let $\mathcal{F}$ be a set of predictors $f_{\boldsymbol{w}}: \mathcal{X} \to \mathbb{R}$ parameterized by a weight vector $\boldsymbol{w}$ of dimension $d_{\boldsymbol{w}}$ (predictive models such as logistic regression, linear regression, and deep learning models are parameterized by a weight vector). If the problem is binary classification, then $f_{\boldsymbol{w}}(\boldsymbol{x})$ is an estimate of $\Pr(Y=1|\boldsymbol{X}=\boldsymbol{x})$. (Our framework can be easily extended to multi-class classification, where $f_{\boldsymbol{w}}(\boldsymbol{X})$ becomes a vector; because this only complicates the notation without providing additional insight into our algorithm, we present the method in the binary setting.) Consider a loss function $l: \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$, where $l(Y, f_{\boldsymbol{w}}(\boldsymbol{X}))$ measures the error of $f_{\boldsymbol{w}}(\boldsymbol{X})$ in predicting $Y$. We denote the expected loss with respect to the joint probability distribution of $(\boldsymbol{X}, Y)$ by $L(\boldsymbol{w}) := \mathbb{E}\{l(Y, f_{\boldsymbol{w}}(\boldsymbol{X}))\}$. Similarly, $L_a(\boldsymbol{w}) := \mathbb{E}\{l(Y, f_{\boldsymbol{w}}(\boldsymbol{X})) | A = a\}$ denotes the expected loss of the group with sensitive attribute $A = a$. In this work, we assume that $l(y, f_{\boldsymbol{w}}(\boldsymbol{x}))$ is differentiable and strictly convex in $\boldsymbol{w}$ (e.g., binary cross entropy loss). We do not consider non-differentiable losses (e.g., zero-one loss), as they have already been extensively studied in the literature, e.g., (Hardt et al., 2016; Zafar et al., 2017; Lohaus et al., 2020).

Without fairness considerations, a predictor that simply minimizes the total expected loss, i.e., $\arg\min_{\boldsymbol{w}} L(\boldsymbol{w})$, may be biased against certain groups. To mitigate the risk of unfairness, we consider the Equalized Loss (EL) fairness notion, formally defined below.

Definition 2.1 ($\gamma$-EL (Zhang et al., 2019)).

We say $f_{\boldsymbol{w}}$ satisfies the equalized loss (EL) fairness notion if $L_0(\boldsymbol{w}) = L_1(\boldsymbol{w})$. Moreover, we say $f_{\boldsymbol{w}}$ satisfies $\gamma$-EL for some $\gamma > 0$ if $-\gamma \leq L_0(\boldsymbol{w}) - L_1(\boldsymbol{w}) \leq \gamma$.

Note that if $l(Y, f_{\boldsymbol{w}}(\boldsymbol{X}))$ is a (strictly) convex function in $\boldsymbol{w}$, both $L_0(\boldsymbol{w})$ and $L_1(\boldsymbol{w})$ are also (strictly) convex in $\boldsymbol{w}$. However, $L_0(\boldsymbol{w}) - L_1(\boldsymbol{w})$ is not necessarily convex. (As an example, consider $h_0(x) = x^2$ and $h_1(x) = 2x^2 - x$: although both $h_0$ and $h_1$ are convex, their difference $h_0(x) - h_1(x) = -x^2 + x$ is not a convex function.) As a result, the following optimization problem for finding a fair predictor under $\gamma$-EL is not a convex program:

$$\min_{\boldsymbol{w}} L(\boldsymbol{w}) \quad \text{s.t.} \quad -\gamma \leq L_0(\boldsymbol{w}) - L_1(\boldsymbol{w}) \leq \gamma. \qquad (1)$$

We say a group is the disadvantaged group if it experiences a higher loss than the other group. Before discussing how to find the global optimal solution of the above non-convex optimization problem and train a $\gamma$-EL fair predictor, we first explain why $\gamma$-EL is an important fairness notion and why the majority of fair learning algorithms in the literature cannot be used to find $\gamma$-EL fair predictors.

2.1 Existing Fairness Notions & Algorithms

Next, we formally introduce some of the most commonly used fairness notions and compare them with $\gamma$-EL. We also discuss why the majority of proposed fair learning algorithms are not readily applicable to EL fairness.

Overall Misclassification Rate (OMR): This notion was considered by (Zafar et al., 2017, 2019) for classification problems. Let $\hat{Y} = I(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5)$, where $I(.) \in \{0,1\}$ is the indicator function, so $\hat{Y} = 1$ if $f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5$. OMR requires $\Pr(\hat{Y} \neq Y | A=0) = \Pr(\hat{Y} \neq Y | A=1)$, which is not a convex constraint. As a result, Zafar et al. (2017; 2019) propose a method for relaxing this constraint using decision boundary covariances. We emphasize that OMR differs from EL fairness: OMR only equalizes the accuracy of binary predictions across groups, while EL can also account for fairness in estimating the probability $\Pr(Y=1|\boldsymbol{X}=\boldsymbol{x})$, e.g., by using the binary cross entropy loss. Note that in many applications, such as conversion prediction, click prediction, and medical diagnosis, it is critical to estimate $\Pr(Y=1|\boldsymbol{X}=\boldsymbol{x})$ accurately for different groups, beyond the final predictions $\hat{Y}$. Moreover, unlike EL, OMR is not applicable to regression problems. Therefore, the relaxation method proposed by (Zafar et al., 2017, 2019) cannot be applied to the EL fairness constraint.

Statistical Parity (SP), Equal Opportunity (EO): For binary classification, Statistical Parity (SP) (Dwork et al., 2012) (resp. Equal Opportunity (EO) (Hardt et al., 2016)) requires the positive classification rates (resp. true positive rates) to be equalized across different groups. Formally,

$$\Pr(\hat{Y}=1|A=0) = \Pr(\hat{Y}=1|A=1), \qquad \Pr(\hat{Y}=1|A=0, Y=1) = \Pr(\hat{Y}=1|A=1, Y=1).$$

Both notions can be rewritten in expectation form using the indicator function. Specifically, SP is equivalent to $\mathbb{E}\{I(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5)|A=0\} = \mathbb{E}\{I(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5)|A=1\}$, and EO to $\mathbb{E}\{I(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5)|A=0, Y=1\} = \mathbb{E}\{I(f_{\boldsymbol{w}}(\boldsymbol{X}) > 0.5)|A=1, Y=1\}$. Since the indicator function is neither differentiable nor convex, Donini et al. (2018) use a linear relaxation of EO as a proxy. (This linear relaxation is applicable to EL with some modification; we use linear relaxation as one of our baselines.) However, linear relaxation may negatively affect the fairness of the predictor (Lohaus et al., 2020). To address this issue, Lohaus et al. (2020) and Wu et al. (2019) develop convex relaxation techniques for the SP and EO fairness criteria by convexifying the indicator function $I(.)$. However, these convex relaxation techniques are not applicable to the EL fairness notion because $l(.,.)$ in our setting is already convex, not a zero-one function. FairBatch (Roh et al., 2020) is another algorithm proposed to find a predictor under SP or EO; it introduces a sampling bias into the mini-batch selection. However, the bias in the mini-batch sampling distribution leads to a biased estimate of the gradient, and there is no guarantee that FairBatch finds the globally optimal solution. FairBatch can be used to find a sub-optimal fair predictor under the EL fairness notion, and we use it as a baseline. Shen et al. (2022) propose an algorithm for EO that adds a penalty term to the objective function, similar to the Penalty Method (Ben-Tal and Zibulevsky, 1997); we use the penalty method as a baseline as well.

Hardt et al. (2016) propose a post-processing algorithm that randomly flips binary predictions to satisfy EO or SP. However, this method does not guarantee finding an optimal classifier (Woodworth et al., 2017). Agarwal et al. (2018) introduce a reduction approach for SP or EO, but this method finds a randomized classifier satisfying SP or EO only in expectation. In other words, to satisfy SP, the reduction approach finds a distribution $Q$ over $\mathcal{F}$ such that,

$$\sum_{f\in\mathcal{F}} Q(f)\,\mathbb{E}\{l(Y, f(\boldsymbol{X}))|A=0\} = \sum_{f\in\mathcal{F}} Q(f)\,\mathbb{E}\{l(Y, f(\boldsymbol{X}))|A=1\},$$

where $Q(f)$ is the probability of selecting model $f$ under distribution $Q$. Obviously, satisfying a fairness constraint in expectation may lead to unfair predictions because $Q$ can still assign a non-zero probability to unfair models.

In summary, some of the approaches used for SP/EO may be applicable to the EL fairness notion (e.g., linear relaxation or FairBatch), but they can only find sub-optimal solutions (see Section 6 for more details).

Statistical Parity for Regression: SP can be adapted to regression. As proposed by (Agarwal et al., 2019), statistical parity for a regressor $f_{\boldsymbol{w}}(.)$ is defined as:

$$\Pr(f_{\boldsymbol{w}}(\boldsymbol{X}) \leq z \mid A = a) = \Pr(f_{\boldsymbol{w}}(\boldsymbol{X}) \leq z), \quad \forall z, a. \qquad (2)$$

To find a predictor that satisfies constraint (2), Agarwal et al. (2019) use the reduction approach mentioned above. However, this approach only finds a randomized predictor satisfying SP in expectation and cannot be applied to optimization problem (1). (The appendix includes a detailed discussion of why the reduction approach is not appropriate for EL fairness.)

Bounded Group Loss (BGL): $\gamma$-BGL was introduced by (Agarwal et al., 2019) for regression problems. It requires the loss experienced by each group to be bounded by $\gamma$; that is, $L_a(\boldsymbol{w}) \leq \gamma, \forall a \in \{0,1\}$. Agarwal et al. (2019) use the reduction approach to find a randomized regression model under $\gamma$-BGL. In addition to the reduction method, if $L(\boldsymbol{w})$, $L_0(\boldsymbol{w})$, and $L_1(\boldsymbol{w})$ are convex in $\boldsymbol{w}$, then we can directly use convex solvers (e.g., CVXPY (Diamond and Boyd, 2016; Agrawal et al., 2018)) to find a $\gamma$-BGL fair predictor, because the following is a convex problem:

$$\min_{\boldsymbol{w}} L(\boldsymbol{w}), \quad \text{s.t.} \quad L_a(\boldsymbol{w}) \leq \gamma, ~\forall a. \qquad (3)$$
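For instance, with squared loss, problem (3) can be handed directly to a convex solver. The following is a minimal CVXPY sketch; the synthetic data, variable names, and the value of gamma are illustrative assumptions rather than part of the paper:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))                      # features
A = rng.integers(0, 2, size=n)                   # binary sensitive attribute
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = cp.Variable(d)
gamma = 1.0                                      # illustrative BGL bound

def group_loss(a):
    idx = A == a
    return cp.sum_squares(X[idx] @ w - y[idx]) / int(idx.sum())

# Problem (3): minimize the overall loss subject to each group's loss <= gamma.
problem = cp.Problem(cp.Minimize(cp.sum_squares(X @ w - y) / n),
                     [group_loss(0) <= gamma, group_loss(1) <= gamma])
problem.solve()
print("BGL-fair weights:", w.value)
```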

However, for non-convex optimization problems such as (1), convex solvers cannot be used directly.

We emphasize that even though many fairness notions and algorithms already exist in the literature, none of the existing algorithms can efficiently solve the non-convex optimization (1) and find the globally optimal fair predictor under the EL notion.

3 Optimal Model under $\gamma$-EL

In this section, we consider optimization problem (1) under the EL fairness constraint. This optimization problem is non-convex, and finding the global optimal solution is difficult. We propose an algorithm that finds the solution to the non-convex problem (1) by solving a sequence of convex optimization problems. Before presenting the algorithm, we introduce two assumptions, which will be used in proving the convergence of the proposed algorithm.

Algorithm 1 Function ELminimizer

Input: $\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, \gamma$
Parameters: $\lambda_{start}^{(0)} = L_0(\boldsymbol{w}_{G_0})$, $\lambda_{end}^{(0)} = L_0(\boldsymbol{w}_{G_1})$, $i = 0$
Define $\tilde{L}_1(\boldsymbol{w}) = L_1(\boldsymbol{w}) + \gamma$

1:  while $\lambda_{end}^{(i)} - \lambda_{start}^{(i)} > \epsilon$ do
2:     $\lambda_{mid}^{(i)} = (\lambda_{end}^{(i)} + \lambda_{start}^{(i)})/2$
3:     Solve the following convex optimization problem,
       $\boldsymbol{w}_i^* = \arg\min_{\boldsymbol{w}} \tilde{L}_1(\boldsymbol{w}) \quad \text{s.t.} \quad L_0(\boldsymbol{w}) \leq \lambda_{mid}^{(i)}$   (4)
4:     $\lambda^{(i)} = \tilde{L}_1(\boldsymbol{w}_i^*)$
5:     if $\lambda^{(i)} \geq \lambda_{mid}^{(i)}$ then
6:        $\lambda_{start}^{(i+1)} = \lambda_{mid}^{(i)}$, $\lambda_{end}^{(i+1)} = \lambda_{end}^{(i)}$;
7:     else
8:        $\lambda_{end}^{(i+1)} = \lambda_{mid}^{(i)}$, $\lambda_{start}^{(i+1)} = \lambda_{start}^{(i)}$;
9:     end if
10:    $i = i + 1$;
11: end while

Output: $\boldsymbol{w}_i^*$
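To make the procedure concrete, the following is a minimal Python sketch of ELminimizer for linear regression with squared loss, using CVXPY to solve subproblem (4) in each bisection step. The data layout and helper names (group_mse, el_minimizer) are our own illustrative assumptions, not the paper's released implementation:

```python
import cvxpy as cp
import numpy as np

def group_mse(w, X, y, A, a):
    """Group-a mean squared error as a CVXPY expression."""
    idx = A == a
    return cp.sum_squares(X[idx] @ w - y[idx]) / int(idx.sum())

def el_minimizer(X, y, A, w_g0, w_g1, eps, gamma):
    """Sketch of Algorithm 1: bisection on lambda_mid over subproblem (4)."""
    def L0(wv):  # numeric group-0 loss at a fixed weight vector
        idx = A == 0
        return float(np.mean((X[idx] @ wv - y[idx]) ** 2))
    lam_start, lam_end = L0(w_g0), L0(w_g1)
    w = cp.Variable(X.shape[1])
    w_star = w_g0
    while lam_end - lam_start > eps:
        lam_mid = (lam_start + lam_end) / 2
        # Subproblem (4): min L1(w) + gamma  s.t.  L0(w) <= lam_mid
        prob = cp.Problem(cp.Minimize(group_mse(w, X, y, A, 1) + gamma),
                          [group_mse(w, X, y, A, 0) <= lam_mid])
        prob.solve()
        w_star = w.value
        if prob.value >= lam_mid:   # lambda^(i) = L1(w*) + gamma
            lam_start = lam_mid
        else:
            lam_end = lam_mid
    return w_star
```

Each iteration solves one instance of the convex subproblem (4), mirroring lines 1-11 of Algorithm 1.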

Assumption 3.1.

Expected losses $L_0(\boldsymbol{w})$, $L_1(\boldsymbol{w})$, and $L(\boldsymbol{w})$ are strictly convex and differentiable in $\boldsymbol{w}$. Moreover, each of them has a unique minimizer.

Let $\boldsymbol{w}_{G_a}$ be the optimal weight vector minimizing the loss associated with group $A = a$. That is,

$$\boldsymbol{w}_{G_a} = \arg\min_{\boldsymbol{w}} L_a(\boldsymbol{w}). \qquad (5)$$

Since problem (5) is an unconstrained convex optimization problem, $\boldsymbol{w}_{G_a}$ can be found efficiently by common convex solvers. We make the following assumption about $\boldsymbol{w}_{G_a}$.

Assumption 3.2.

We assume the following holds,

$$L_0(\boldsymbol{w}_{G_0}) \leq L_1(\boldsymbol{w}_{G_0}) \quad \text{and} \quad L_1(\boldsymbol{w}_{G_1}) \leq L_0(\boldsymbol{w}_{G_1}).$$
Algorithm 2 Solving Optimization (1)

Input: $\boldsymbol{w}_{G_0}$, $\boldsymbol{w}_{G_1}$, $\epsilon$, $\gamma$

1:  $\boldsymbol{w}_\gamma = \texttt{ELminimizer}(\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, \gamma)$
2:  $\boldsymbol{w}_{-\gamma} = \texttt{ELminimizer}(\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, -\gamma)$
3:  if $L(\boldsymbol{w}_\gamma) \leq L(\boldsymbol{w}_{-\gamma})$ then
4:     $\boldsymbol{w}^* = \boldsymbol{w}_\gamma$
5:  else
6:     $\boldsymbol{w}^* = \boldsymbol{w}_{-\gamma}$
7:  end if

Output: $\boldsymbol{w}^*$

Assumption 3.2 implies that when a group experiences its lowest possible loss, it is not the disadvantaged group. Under Assumptions 3.1 and 3.2, the optimal 0-EL fair predictor can be found using our proposed algorithm (i.e., function $\texttt{ELminimizer}(\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, \epsilon, \gamma)$ with $\gamma = 0$); the complete procedure is shown in Algorithm 1, in which parameter $\epsilon > 0$ specifies the stopping criterion: as $\epsilon \to 0$, the output approaches the global optimal solution.

Intuitively, Algorithm 1 solves the non-convex optimization (1) by solving a sequence of constrained convex optimizations. When $\gamma > 0$ (i.e., relaxed fairness), the optimal $\gamma$-EL fair predictor can be found with Algorithm 2, which calls the function ELminimizer twice (a sketch follows below). The convergence of Algorithm 1 to the optimal 0-EL fair solution and the convergence of Algorithm 2 to the optimal $\gamma$-EL fair solution are stated in the following theorems.
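Continuing the sketch above, Algorithm 2 reduces to two calls of el_minimizer (again for squared loss; the function name solve_gamma_el is an illustrative assumption):

```python
import numpy as np

def solve_gamma_el(X, y, A, w_g0, w_g1, eps, gamma):
    """Sketch of Algorithm 2: run ELminimizer with +gamma and -gamma and
    keep whichever weight vector attains the lower overall loss."""
    w_plus = el_minimizer(X, y, A, w_g0, w_g1, eps, gamma)
    w_minus = el_minimizer(X, y, A, w_g0, w_g1, eps, -gamma)
    overall = lambda wv: float(np.mean((X @ wv - y) ** 2))
    return w_plus if overall(w_plus) <= overall(w_minus) else w_minus
```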

Theorem 3.3 (Convergence of Algorithm 1 when $\gamma = 0$).

Let $\{\lambda_{mid}^{(i)} | i = 0, 1, 2, \ldots\}$ and $\{\boldsymbol{w}_i^* | i = 0, 1, 2, \ldots\}$ be the two sequences generated by Algorithm 1 when $\gamma = \epsilon = 0$, i.e., $\texttt{ELminimizer}(\boldsymbol{w}_{G_0}, \boldsymbol{w}_{G_1}, 0, 0)$. Under Assumptions 3.1 and 3.2, we have,

$$\lim_{i\to\infty} \boldsymbol{w}_i^* = \boldsymbol{w}^* \quad \text{and} \quad \lim_{i\to\infty} \lambda_{mid}^{(i)} = \mathbb{E}\{l(Y, f_{\boldsymbol{w}^*}(\boldsymbol{X}))\},$$

where $\boldsymbol{w}^*$ is the global optimal solution to (1).

Theorem 3.3 implies that when $\gamma = \epsilon = 0$ and $i$ goes to infinity, the solution to the convex problem (4) is the same as the solution to (1).

Theorem 3.4 (Convergence of Algorithm 2).

Assume that $L_0(\boldsymbol{w}_{G_0}) - L_1(\boldsymbol{w}_{G_0}) < -\gamma$ and $L_0(\boldsymbol{w}_{G_1}) - L_1(\boldsymbol{w}_{G_1}) > \gamma$. If $\boldsymbol{w}_O$ (the unconstrained optimum defined in (7)) does not satisfy the $\gamma$-EL constraint, then, as $\epsilon \to 0$, the output of Algorithm 2 converges to the optimal $\gamma$-EL fair solution (i.e., the solution to (1)).

Complexity Analysis. The while loop in Algorithm 1 is executed $\mathcal{O}(\log(1/\epsilon))$ times. Therefore, Algorithm 1 solves a constrained convex optimization problem $\mathcal{O}(\log(1/\epsilon))$ times. Note that constrained convex optimization problems can be solved efficiently via sub-gradient methods (Nedić and Ozdaglar, 2009), barrier methods (Wright, 2001), stochastic gradient descent with one projection (Mahdavi et al., 2012), interior point methods (Nemirovski, 2004), etc. For instance, (Nemirovski, 2004) shows that several classes of convex optimization problems can be solved in polynomial time. Therefore, the time complexity of Algorithm 1 depends on the convex solver: if the time complexity of solving (4) is $\mathcal{O}(p(d_{\boldsymbol{w}}))$, then the overall time complexity of Algorithm 1 is $\mathcal{O}(p(d_{\boldsymbol{w}}) \log(1/\epsilon))$.

Regularization. So far we have considered a supervised learning model without regularization. Next, we explain how Algorithm 2 can be applied to a regularized problem. Consider the following optimization problem,

$$\min_{\boldsymbol{w}} \Pr(A=0)L_0(\boldsymbol{w}) + \Pr(A=1)L_1(\boldsymbol{w}) + R(\boldsymbol{w}), \quad \text{s.t.} \quad |L_0(\boldsymbol{w}) - L_1(\boldsymbol{w})| < \gamma, \qquad (6)$$

where $R(\boldsymbol{w})$ is a regularizer function. In this case, we can rewrite the optimization problem as follows,

$$\min_{\boldsymbol{w}} \Pr(A=0)\big(L_0(\boldsymbol{w}) + R(\boldsymbol{w})\big) + \Pr(A=1)\big(L_1(\boldsymbol{w}) + R(\boldsymbol{w})\big), \quad \text{s.t.} \quad \big|\big(L_0(\boldsymbol{w}) + R(\boldsymbol{w})\big) - \big(L_1(\boldsymbol{w}) + R(\boldsymbol{w})\big)\big| < \gamma.$$

If we define $\bar{L}_a(\boldsymbol{w}) := L_a(\boldsymbol{w}) + R(\boldsymbol{w})$ and $\bar{L}(\boldsymbol{w}) := \Pr(A=0)\bar{L}_0(\boldsymbol{w}) + \Pr(A=1)\bar{L}_1(\boldsymbol{w})$, then problem (6) can be written in the form of problem (1) using $(\bar{L}_0(\boldsymbol{w}), \bar{L}_1(\boldsymbol{w}), \bar{L}(\boldsymbol{w}))$ and solved by Algorithm 2.
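As a sketch of this reduction, one can fold the regularizer into each group loss before handing the problem to Algorithm 2; here reg_coef is an illustrative value, and group_mse_reg plays the role of $\bar{L}_a$:

```python
import cvxpy as cp

reg_coef = 0.002  # illustrative regularization strength

def group_mse_reg(w, X, y, A, a):
    """bar{L}_a(w) = L_a(w) + R(w) with R(w) = reg_coef * ||w||_2^2."""
    idx = A == a
    return (cp.sum_squares(X[idx] @ w - y[idx]) / int(idx.sum())
            + reg_coef * cp.sum_squares(w))
```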

4 Sub-optimal Model under $\gamma$-EL

In Section 3, we showed that the non-convex optimization problem (1) can be reduced to a sequence of convex constrained optimizations (4), based on which we proposed Algorithm 2 for finding the optimal $\gamma$-EL fair predictor. However, that algorithm still requires solving a convex constrained optimization in each iteration. In this section, we propose another algorithm that finds a sub-optimal solution to optimization (1) without solving a constrained optimization in each iteration. The algorithm consists of two phases: (1) finding two weight vectors by solving two unconstrained convex optimization problems; (2) generating a new weight vector satisfying $\gamma$-EL from the two weight vectors found in the first phase.

Phase 1: Unconstrained optimization. We ignore EL fairness and solve the following unconstrained problem,

$$\boldsymbol{w}_O = \arg\min_{\boldsymbol{w}} L(\boldsymbol{w}). \qquad (7)$$

Because $L(\boldsymbol{w})$ is strictly convex in $\boldsymbol{w}$, the above optimization problem can be solved efficiently using convex solvers. Predictor $f_{\boldsymbol{w}_O}$ is the optimal predictor without the fairness constraint, and $L(\boldsymbol{w}_O)$ is the smallest overall expected loss that is attainable. Let $\hat{a} = \arg\max_{a\in\{0,1\}} L_a(\boldsymbol{w}_O)$, i.e., group $\hat{a}$ is the disadvantaged group under predictor $f_{\boldsymbol{w}_O}$. Then, for the disadvantaged group $\hat{a}$, we find $\boldsymbol{w}_{G_{\hat{a}}}$ by solving optimization (5).

Phase 2: Binary search to find the fair predictor. For $\beta \in [0,1]$, we define the following two functions,

$$g(\beta) := L_{\hat{a}}\big((1-\beta)\boldsymbol{w}_O + \beta\boldsymbol{w}_{G_{\hat{a}}}\big) - L_{1-\hat{a}}\big((1-\beta)\boldsymbol{w}_O + \beta\boldsymbol{w}_{G_{\hat{a}}}\big); \qquad h(\beta) := L\big((1-\beta)\boldsymbol{w}_O + \beta\boldsymbol{w}_{G_{\hat{a}}}\big),$$

where $g(\beta)$ can be interpreted as the loss disparity between the two demographic groups under predictor $f_{(1-\beta)\boldsymbol{w}_O + \beta\boldsymbol{w}_{G_{\hat{a}}}}$, and $h(\beta)$ is the corresponding overall expected loss. Some properties of functions $g(.)$ and $h(.)$ are summarized in the following theorem.

Theorem 4.1.

Under Assumptions 3.1 and 3.2: 1) there exists $\beta_0 \in [0,1]$ such that $g(\beta_0) = 0$; 2) $h(\beta)$ is strictly increasing in $\beta \in [0,1]$; 3) $g(\beta)$ is strictly decreasing in $\beta \in [0,1]$.

Theorem 4.1 implies that, in the $d_{\boldsymbol{w}}$-dimensional space, if we start from $\boldsymbol{w}_O$ and move toward $\boldsymbol{w}_{G_{\hat{a}}}$ along a straight line, the overall loss increases and the disparity between the two groups decreases until we reach $(1-\beta_0)\boldsymbol{w}_O + \beta_0\boldsymbol{w}_{G_{\hat{a}}}$, at which point 0-EL fairness is satisfied. Note that $\beta_0$ is the unique root of $g$. Since $g(\beta)$ is a strictly decreasing function, $\beta_0$ can be found using binary search.

For approximate $\gamma$-EL fairness, there are multiple values of $\beta$ such that $(1-\beta)\boldsymbol{w}_O + \beta\boldsymbol{w}_{G_{\hat{a}}}$ satisfies $\gamma$-EL. Since $h(\beta)$ is strictly increasing in $\beta$, among all $\beta$ that satisfy $\gamma$-EL fairness, we choose the smallest one. The method for finding a sub-optimal solution to optimization (1) is described in Algorithm 3.

Algorithm 3 Sub-optimal solution to optimization (1)

Input: $\boldsymbol{w}_{G_{\hat{a}}}$, $\boldsymbol{w}_O$, $\epsilon$, $\gamma$
Initialization: $g_\gamma(\beta) = g(\beta) - \gamma$, $i = 0$, $\beta_{start}^{(0)} = 0$, $\beta_{end}^{(0)} = 1$

1:  if $g_\gamma(0) \leq 0$ then
2:     $\underline{\boldsymbol{w}} = \boldsymbol{w}_O$, and go to line 14;
3:  end if
4:  while $\beta_{end}^{(i)} - \beta_{start}^{(i)} > \epsilon$ do
5:     $\beta_{mid}^{(i)} = (\beta_{start}^{(i)} + \beta_{end}^{(i)})/2$;
6:     if $g_\gamma(\beta_{mid}^{(i)}) \geq 0$ then
7:        $\beta_{start}^{(i+1)} = \beta_{mid}^{(i)}$, $\beta_{end}^{(i+1)} = \beta_{end}^{(i)}$;
8:     else
9:        $\beta_{start}^{(i+1)} = \beta_{start}^{(i)}$, $\beta_{end}^{(i+1)} = \beta_{mid}^{(i)}$;
10:    end if
11:    $i = i + 1$;
12: end while
13: $\underline{\boldsymbol{w}} = (1 - \beta_{mid}^{(i)})\boldsymbol{w}_O + \beta_{mid}^{(i)}\boldsymbol{w}_{G_{\hat{a}}}$;
14: Output: $\underline{\boldsymbol{w}}$
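The following is a minimal numpy sketch of Algorithm 3 for linear regression with squared loss; w_o is assumed to be the unconstrained optimum from (7), and w_g the disadvantaged group's minimizer $\boldsymbol{w}_{G_{\hat{a}}}$ from (5). Function and variable names are illustrative assumptions:

```python
import numpy as np

def algorithm3(X, y, A, w_o, w_g, eps, gamma):
    def group_loss(wv, a):
        idx = A == a
        return float(np.mean((X[idx] @ wv - y[idx]) ** 2))
    # Disadvantaged group under the unconstrained optimum w_o.
    a_hat = 0 if group_loss(w_o, 0) >= group_loss(w_o, 1) else 1
    def g_gamma(beta):  # g(beta) - gamma along the segment w_o -> w_g
        wv = (1 - beta) * w_o + beta * w_g
        return group_loss(wv, a_hat) - group_loss(wv, 1 - a_hat) - gamma
    if g_gamma(0.0) <= 0:
        return w_o                     # w_O already satisfies gamma-EL
    lo, hi = 0.0, 1.0                  # bisection on beta
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if g_gamma(mid) >= 0:
            lo = mid                   # disparity still above gamma
        else:
            hi = mid
    beta = (lo + hi) / 2
    return (1 - beta) * w_o + beta * w_g
```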

Note that the while loop in Algorithm 3 is repeated $\mathcal{O}(\log(1/\epsilon))$ times. Since the time complexity of the operations in each iteration (i.e., evaluating $g_\gamma(\beta_{mid}^{(i)})$) is $\mathcal{O}(d_{\boldsymbol{w}})$, the total time complexity of Algorithm 3 is $\mathcal{O}(d_{\boldsymbol{w}} \log(1/\epsilon))$. We can formally prove that the output returned by Algorithm 3 satisfies the $\gamma$-EL constraint.

Theorem 4.2.

Assume that Assumptions 3.1 and 3.2 hold, and let $g_\gamma(\beta) = g(\beta) - \gamma$. If $g_\gamma(0) \leq 0$, then $\boldsymbol{w}_O$ satisfies $\gamma$-EL fairness; if $g_\gamma(0) > 0$, then $\lim_{i\to\infty} \beta_{mid}^{(i)} = \beta_{mid}^{(\infty)}$ exists, and $(1-\beta_{mid}^{(\infty)})\boldsymbol{w}_O + \beta_{mid}^{(\infty)}\boldsymbol{w}_{G_{\hat{a}}}$ satisfies the $\gamma$-EL fairness constraint.

Note that since $h(\beta)$ is increasing in $\beta$, we only need to find the smallest $\beta$ such that $(1-\beta)\boldsymbol{w}_O + \beta\boldsymbol{w}_{G_{\hat{a}}}$ satisfies $\gamma$-EL, which is $\beta_{mid}^{(\infty)}$ in Theorem 4.2. Since Algorithm 3 finds a sub-optimal solution, it is important to investigate the performance of this sub-optimal fair predictor, especially in the worst case. The following theorem gives an upper bound on the expected loss of $f_{\underline{\boldsymbol{w}}}$, where $\underline{\boldsymbol{w}}$ is the output of Algorithm 3.

Theorem 4.3.

Under Assumptions 3.1 and 3.2, we have $L(\underline{\boldsymbol{w}}) \leq \max_{a\in\{0,1\}} L_a(\boldsymbol{w}_O)$. That is, the expected loss of $f_{\underline{\boldsymbol{w}}}$ is no worse than the loss of the disadvantaged group under predictor $f_{\boldsymbol{w}_O}$.

Learning with Finite Samples. So far, we have proposed algorithms for solving optimization (1). In practice, the joint probability distribution of $(\boldsymbol{X}, A, Y)$ is unknown, and the expected loss must be estimated using the empirical loss. Specifically, given $n$ i.i.d. samples $\{(\boldsymbol{X}_i, A_i, Y_i)\}_{i=1}^{n}$ and a predictor $f_{\boldsymbol{w}}$, the empirical losses of the entire population and of each group are defined as follows,

$$\hat{L}(\boldsymbol{w}) = \frac{1}{n}\sum_{i=1}^{n} l(Y_i, f_{\boldsymbol{w}}(\boldsymbol{X}_i)), \qquad \hat{L}_a(\boldsymbol{w}) = \frac{1}{n_a}\sum_{i: A_i = a} l(Y_i, f_{\boldsymbol{w}}(\boldsymbol{X}_i)),$$

where $n_a = |\{i \mid A_i = a\}|$. Because the $\gamma$-EL fairness constraint is defined in terms of expected loss, the optimization problem for finding an optimal $\gamma$-EL fair predictor using empirical losses is as follows,

$$\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} \hat{L}(\boldsymbol{w}), \quad \text{s.t.} \quad |\hat{L}_0(\boldsymbol{w}) - \hat{L}_1(\boldsymbol{w})| \leq \hat{\gamma}. \qquad (8)$$
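For concreteness, the empirical quantities in (8) can be computed as follows (a numpy sketch with squared loss; the names are our assumptions):

```python
import numpy as np

def empirical_losses(w, X, y, A):
    per_sample = (X @ w - y) ** 2               # l(Y_i, f_w(X_i))
    L_hat = per_sample.mean()                   # hat{L}(w)
    L_hat_a = {a: per_sample[A == a].mean()     # hat{L}_a(w), a in {0, 1}
               for a in (0, 1)}
    return L_hat, L_hat_a
```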

We aim to determine $\hat{\gamma}$ so that, with high probability, the predictor found by solving problem (8) satisfies $\gamma$-EL fairness, while $\hat{\boldsymbol{w}}$ is a good estimate of the solution $\boldsymbol{w}^*$ to optimization (1). In particular, we show that we can set $\hat{\gamma} = \gamma$ if the number of samples is sufficiently large. To understand the relation between (8) and (1), we follow the general sample complexity analysis found in (Donini et al., 2018) and show that it is applicable to EL. To proceed, we adopt the assumption used in (Donini et al., 2018).

Assumption 4.4.

With probability $1 - \delta$, the following holds:

$$\sup_{f_{\boldsymbol{w}}\in\mathcal{F}} |L(\boldsymbol{w}) - \hat{L}(\boldsymbol{w})| \leq B(\delta, n, \mathcal{F}),$$

where $B(\delta, n, \mathcal{F})$ is a bound that goes to zero as $n \to +\infty$.

Note that, according to (Shalev-Shwartz and Ben-David, 2014), if the class $\mathcal{F}$ is learnable with respect to loss function $l(.,.)$, then such a bound $B(\delta, n, \mathcal{F})$ always exists and goes to zero as $n$ goes to infinity. As an example, if $\mathcal{F}$ is a compact subset of linear predictors in a Reproducing Kernel Hilbert Space (RKHS) and the loss $l(y, f(x))$ is Lipschitz in its second argument $f(x)$, then Assumption 4.4 can be satisfied (Bartlett and Mendelson, 2002); the vast majority of linear predictors, such as support vector machines and logistic regression, can be defined in an RKHS.

Theorem 4.5.

Let $\mathcal{F}$ be a set of learnable functions, and let $\hat{\boldsymbol{w}}$ and $\boldsymbol{w}^*$ be the solutions to (8) and (1), respectively, with $\hat{\gamma} = \gamma + \sum_{a\in\{0,1\}} B(\delta, n_a, \mathcal{F})$. Then, with probability at least $1 - 6\delta$, the following hold,

$$L(\hat{\boldsymbol{w}}) - L(\boldsymbol{w}^*) \leq 2B(\delta, n, \mathcal{F}) \quad \text{and} \quad |L_0(\hat{\boldsymbol{w}}) - L_1(\hat{\boldsymbol{w}})| \leq \gamma + 2B(\delta, n_0, \mathcal{F}) + 2B(\delta, n_1, \mathcal{F}).$$

Theorem 4.5 shows that as $n_0$ and $n_1$ go to infinity, $\hat{\gamma} \to \gamma$, and both the empirical loss and the expected loss satisfy $\gamma$-EL. In addition, as $n$ goes to infinity, the expected loss at $\hat{\boldsymbol{w}}$ approaches the minimum possible expected loss. Therefore, solving (8) using the empirical loss is equivalent to solving (1) if the number of data points from each group is sufficiently large.

5 Beyond Linear Models

So far, we have assumed that the loss function is strictly convex. This assumption is mainly valid for training linear models (e.g., ridge regression, regularized logistic regression); training deep models generally requires minimizing non-convex objective functions. To train a deep model under the equalized loss fairness notion, we can take advantage of Algorithm 2 for fine-tuning under EL, as long as the objective function is convex with respect to the parameters of the output layer. (In classification or regression problems with an $l_2$ regularizer, the objective function is strictly convex with respect to the parameters of the output layer, regardless of the network structure before the output layer.) To clarify how Algorithm 2 can be used for deep models, consider for simplicity a neural network with one hidden layer for regression. Let $W$ be an $m \times d$ matrix ($d$ is the size of feature vector $\boldsymbol{X}$ and $m$ is the number of neurons in the hidden layer) denoting the parameters of the first layer of the neural network, and let $\boldsymbol{w}$ be the weight vector of the output layer. To find a neural network satisfying the equalized loss fairness notion, we first train the network without any fairness constraint using common gradient descent algorithms (e.g., stochastic gradient descent). Let $\tilde{W}$ and $\tilde{\boldsymbol{w}}$ denote the network parameters after this unconstrained training. We can then use Algorithm 2 to fine-tune the parameters of the output layer under the equalized loss fairness notion. Define $\tilde{\boldsymbol{X}} := [1, \tilde{W}\boldsymbol{X}]^T$. The problem of fine-tuning the output layer can be written as follows,

$$\boldsymbol{w}^* = \arg\min_{\boldsymbol{w}} \mathbb{E}\{l(Y, \boldsymbol{w}^T\tilde{\boldsymbol{X}})\}, \quad \text{s.t.} \quad \left|\mathbb{E}\{l(Y, \boldsymbol{w}^T\tilde{\boldsymbol{X}})|A=0\} - \mathbb{E}\{l(Y, \boldsymbol{w}^T\tilde{\boldsymbol{X}})|A=1\}\right| \leq \gamma. \qquad (9)$$

The objective function of the above optimization problem is strictly convex, and the problem can be solved using Algorithm 2. After solving it, $[\tilde{W}, \boldsymbol{w}^*]$ are the final parameters of the neural network satisfying the equalized loss fairness notion. Note that a similar optimization problem can be written for fine-tuning any deep model with a classification/regression task.
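A sketch of this fine-tuning step is given below, assuming a sigmoid hidden layer (as in the experiment of Section 6) and reusing the solve_gamma_el sketch from Section 3; W_tilde denotes the frozen, already-trained hidden-layer weights, and the least-squares computation of the per-group optima is an illustrative choice for squared loss:

```python
import numpy as np

def finetune_output_layer(X, y, A, W_tilde, eps, gamma):
    """Freeze the hidden layer, build tilde{X} = [1, sigma(W X)], solve (9)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W_tilde.T)))          # hidden activations
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), H])  # prepend bias feature
    # Per-group unconstrained optima of the output layer (least squares).
    w_g = [np.linalg.lstsq(X_tilde[A == a], y[A == a], rcond=None)[0]
           for a in (0, 1)]
    return solve_gamma_el(X_tilde, y, A, w_g[0], w_g[1], eps, gamma)
```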

6 Experiments

We conduct experiments on two real-world datasets to evaluate the performance of the proposed algorithms. In our experiments, we used a system with the following configuration: 24 GB of RAM, 2 cores of a P100-16GB GPU, and 2 cores of an Intel Xeon processor. More information about the experiments and instructions for reproducing the empirical results are provided in the Appendix. The code is available at https://github.com/KhaliliMahdi/Loss_Balancing_ICML2023.

Baselines. As discussed in Section 2, not all fair learning algorithms are applicable to EL fairness. The following three baselines are applicable to EL fairness.

Penalty Method (PM): The penalty method (Ben-Tal and Zibulevsky, 1997) finds a fair predictor under the $\gamma$-EL fairness constraint by solving the following problem,

$$\min_{\boldsymbol{w}} \hat{L}(\boldsymbol{w}) + t\cdot\max\{0, |\hat{L}_0(\boldsymbol{w}) - \hat{L}_1(\boldsymbol{w})| - \gamma\}^2 + R(\boldsymbol{w}), \qquad (10)$$

where $t$ is the penalty parameter and $R(\boldsymbol{w}) = 0.002\cdot\|\boldsymbol{w}\|_2^2$ is the regularizer. The above optimization problem is not convex in general and cannot be solved with a convex solver. We solve problem (10) using Adam gradient descent (Kingma and Ba, 2014) with a learning rate of 0.005 and the default Adam parameters in PyTorch. We set the penalty parameter to $t = 0.1$ and double this penalty coefficient every 100 iterations.
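A PyTorch sketch of this baseline for linear regression is given below; the tensor shapes and training-loop skeleton are illustrative assumptions, while the learning rate, the initial t = 0.1, and the doubling schedule follow the setup above:

```python
import torch

def penalty_method(X, y, A, gamma, iters=1000, lr=0.005):
    w = torch.zeros(X.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    t = 0.1                                      # initial penalty coefficient
    for it in range(iters):
        opt.zero_grad()
        per_sample = (X @ w - y) ** 2
        l0, l1 = per_sample[A == 0].mean(), per_sample[A == 1].mean()
        # Objective (10): loss + t * max{0, |L0 - L1| - gamma}^2 + R(w)
        penalty = torch.clamp((l0 - l1).abs() - gamma, min=0.0) ** 2
        obj = per_sample.mean() + t * penalty + 0.002 * (w ** 2).sum()
        obj.backward()
        opt.step()
        if (it + 1) % 100 == 0:
            t *= 2.0                             # double t every 100 iterations
    return w.detach()
```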

Linear Relaxation (LinRe): Inspired by (Donini et al., 2018), for linear regression we relax the EL constraint to $-\gamma \leq \frac{1}{n_0}\sum_{i: A_i=0}(Y_i - \boldsymbol{w}^T\boldsymbol{X}_i) - \frac{1}{n_1}\sum_{i: A_i=1}(Y_i - \boldsymbol{w}^T\boldsymbol{X}_i) \leq \gamma$. For logistic regression, we relax the constraint to $-\gamma \leq \frac{1}{n_0}\sum_{i: A_i=0}(Y_i - 0.5)\cdot(\boldsymbol{w}^T\boldsymbol{X}_i) - \frac{1}{n_1}\sum_{i: A_i=1}(Y_i - 0.5)\cdot(\boldsymbol{w}^T\boldsymbol{X}_i) \leq \gamma$. Note that the sign of $(Y_i - 0.5)\cdot(\boldsymbol{w}^T\boldsymbol{X}_i)$ determines whether the binary classifier makes a correct prediction.
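As a sketch, the relaxed constraint for the regression case can be passed directly to CVXPY, since it is linear in $\boldsymbol{w}$ (function and variable names are our assumptions):

```python
import cvxpy as cp
import numpy as np

def linre_regression(X, y, A, gamma):
    w = cp.Variable(X.shape[1])
    def mean_resid(a):
        idx = A == a
        return (np.sum(y[idx]) - cp.sum(X[idx] @ w)) / int(idx.sum())
    gap = mean_resid(0) - mean_resid(1)          # linear in w, hence convex
    prob = cp.Problem(cp.Minimize(cp.sum_squares(X @ w - y) / X.shape[0]),
                      [gap <= gamma, gap >= -gamma])
    prob.solve()
    return w.value
```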

FairBatch (Roh et al., 2020): This method was originally proposed for equal opportunity, statistical parity, and equalized odds. With some modifications (see the appendix for more details), it can be applied to EL fairness. The algorithm measures the loss of each group in each epoch and adjusts the mini-batch sampling distribution in favor of the group with the higher empirical loss. When implementing FairBatch, we use Adam optimization with default parameters, a learning rate of 0.005, and a batch size of 100.

Linear Regression and Law School Admission Dataset. In the first experiment, we use the law school admission dataset, which includes the information of 21,790 law students studying in 163 different law schools across the United States (Wightman, 1998). This dataset contains entrance exam scores (LSAT), grade-point average (GPA) prior to law school, and first-year average grade (FYA). Our goal is to train a $\gamma$-EL fair regularized linear regression model to estimate the FYA of students given their LSAT and GPA. In this study, we consider the Black and White demographic groups; 18,285 data points belong to White students and 1,282 to Black students. We randomly split the dataset into training and test sets (70% for training and 30% for testing) and conduct five independent runs of the experiment. A fair predictor is found by solving the following optimization problem,

$$\min_{\boldsymbol{w}} \hat{L}(\boldsymbol{w}) + 0.002\cdot\|\boldsymbol{w}\|_2^2 \quad \text{s.t.} \quad |\hat{L}_0(\boldsymbol{w}) - \hat{L}_1(\boldsymbol{w})| \leq \gamma, \qquad (11)$$

with $\hat{L}$ and $\hat{L}_a$ being the overall and group-specific empirical MSE, respectively; $0.002\cdot\|\boldsymbol{w}\|_2^2$ is the regularizer. We use Algorithm 2 and Algorithm 3 with $\epsilon = 0.01$ to find the optimal linear regression model under EL, and we adopt the CVXPY Python library (Diamond and Boyd, 2016; Agrawal et al., 2018) as the convex optimization solver inside the ELminimizer algorithm.

Figure 1: Trade-off between overall MSE and unfairness. A lower curve implies a better trade-off.
Figure 2: Trade-off between overall BCE and unfairness. A lower curve implies a better trade-off.
Table 1: Linear regression model under EL fairness. The loss function in this example is the mean squared error loss.

| Method        | Metric                         | $\gamma=0$      | $\gamma=0.1$    |
|---------------|--------------------------------|-----------------|-----------------|
| PM            | test loss                      | 0.9246 ± 0.0083 | 0.9332 ± 0.0101 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.1620 ± 0.0802 | 0.1438 ± 0.0914 |
| LinRe         | test loss                      | 0.9086 ± 0.0190 | 0.8668 ± 0.0164 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.2687 ± 0.0588 | 0.2587 ± 0.0704 |
| FairBatch     | test loss                      | 0.8119 ± 0.0316 | 0.8610 ± 0.0884 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.2862 ± 0.1933 | 0.2708 ± 0.1526 |
| Ours (Alg. 2) | test loss                      | 0.9186 ± 0.0179 | 0.8556 ± 0.0217 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0699 ± 0.0469 | 0.1346 ± 0.0749 |
| Ours (Alg. 3) | test loss                      | 0.9522 ± 0.0209 | 0.8977 ± 0.0223 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0930 ± 0.0475 | 0.1437 ± 0.0907 |

Table 1 reports the means and standard deviations of the empirical loss and the loss difference between Black and White students. The first row specifies the desired fairness level ($\gamma = 0$ and $\gamma = 0.1$) used as input to each algorithm. Based on Table 1, when the desired fairness level is $\gamma = 0$, the fairness levels of the models trained by LinRe and FairBatch are far from $\gamma = 0$. We also observed that the performance of FairBatch depends heavily on the random seed. As a result, the fairness level of the model trained by FairBatch has a high variance (0.1933 in this example) across the five independent runs of the experiment, and only in some of these runs does it achieve the desired fairness level. This is because the FairBatch algorithm does not come with a performance guarantee; as stated in (Roh et al., 2020), FairBatch calculates a biased estimate of the gradient in each epoch, and the mini-batch sampling distribution keeps changing from one epoch to the next. We observed that FairBatch performs better with a non-linear model (see Table 3). Both Algorithms 2 and 3 achieve a fairness level close to $\gamma = 0$; however, Algorithm 3 finds a sub-optimal solution and achieves a higher MSE than Algorithm 2. For $\gamma = 0.1$, in addition to Algorithms 2 and 3, the penalty method also achieves a fairness level close to the desired level $\gamma = 0.1$ (i.e., $|\hat{L}_1 - \hat{L}_0| = 0.0892$). Algorithm 2 still achieves the lowest MSE compared to Algorithm 3 and the penalty method. The model trained by FairBatch again suffers from high variance in its fairness level. We emphasize that even though Algorithm 3 has a higher MSE than Algorithm 2, it is much faster, as stated in Section 3.

We also investigate the trade-off between fairness and overall loss under different algorithms. Figure 1 illustrates the MSE as a function of the loss difference between Black and White students. Specifically, we run Algorithm 2, Algorithm 3, and the baselines under different values of $\gamma \in \{0.025, 0.05, 0.1, 0.15, 0.2\}$. For each $\gamma$, we repeat the experiment five times and calculate the average MSE and average MSE difference over these five runs on the test dataset. Figure 1 shows that the penalty method, linear relaxation, and FairBatch are not sensitive to the input $\gamma$. However, Algorithm 2 and Algorithm 3 are sensitive to $\gamma$: as $\gamma$ increases, $|\hat{L}_0(\boldsymbol{w}^*) - \hat{L}_1(\boldsymbol{w}^*)|$ increases and the MSE decreases.

Logistic Regression and Adult Income Dataset. We consider the adult income dataset containing the information of 48,842 individuals (Kohavi, 1996). Each data point consists of 14 features, including age, education, race, etc. In this study, we consider race (White or Black) as the sensitive attribute and denote the White demographic group by $A = 0$ and the Black group by $A = 1$. We first pre-process the dataset by removing data points with a missing value or with a race other than Black or White; this results in 41,961 data points, of which 4,585 belong to the Black group. For each data point, we convert all the categorical features to one-hot vectors with dimension 110 and randomly split the dataset into training and test sets (70% of the dataset is used for training). The goal is to predict whether the income of an individual is above $50K using a $\gamma$-EL fair logistic regression model. In this experiment, we solve optimization problem (11), with $\hat{L}$ and $\hat{L}_a$ being the overall and group-specific empirical average of the binary cross entropy (BCE) loss, respectively.

Table 2: Logistic regression model under EL fairness. The loss function in this example is the binary cross entropy loss.

| Method        | Metric                         | $\gamma=0$      | $\gamma=0.1$    |
|---------------|--------------------------------|-----------------|-----------------|
| PM            | test loss                      | 0.5594 ± 0.0101 | 0.5404 ± 0.0046 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0091 ± 0.0067 | 0.0892 ± 0.0378 |
| LinRe         | test loss                      | 0.3468 ± 0.0013 | 0.3441 ± 0.0012 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0815 ± 0.0098 | 0.1080 ± 0.0098 |
| FairBatch     | test loss                      | 1.5716 ± 0.8071 | 1.2116 ± 0.8819 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.6191 ± 0.5459 | 0.3815 ± 0.3470 |
| Ours (Alg. 2) | test loss                      | 0.3516 ± 0.0015 | 0.3435 ± 0.0012 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0336 ± 0.0075 | 0.1110 ± 0.0140 |
| Ours (Alg. 3) | test loss                      | 0.3521 ± 0.0015 | 0.3377 ± 0.0015 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0278 ± 0.0075 | 0.1068 ± 0.0138 |

The comparison of Algorithm 2, Algorithm 3, and the baselines is shown in Table 2, where we conduct five independent runs of the experiment and calculate the mean and standard deviation of the overall loss and the loss difference between the two demographic groups. The first row of the table shows the value of $\gamma$ used as input to the algorithms. The results show that linear relaxation, Algorithm 2, and Algorithm 3 perform very similarly: all three satisfy $\gamma$-EL with a small test loss. As in Table 1, we observe high variance in the performance of FairBatch, which depends heavily on the random seed.

In Figure 2, we compare the performance-fairness trade-off, focusing on binary cross entropy on the test dataset. To generate this figure, we run Algorithm 2, Algorithm 3, and the baselines (we do not include the curve for FairBatch due to its large overall loss and high performance variance) under different values of $\gamma \in \{0.02, 0.04, 0.06, 0.08, 0.1\}$, five times each, and calculate the average BCE and the average BCE difference. We observe that Algorithms 2 and 3 and the linear relaxation exhibit a similar trade-off between $\hat{L}$ and $|\hat{L}_0 - \hat{L}_1|$.

Experiment with a non-linear model

We repeat our first experiment with nonlinear models to demonstrate how we can use our algorithms to fine-tune a non-linear model. We work with the Law School Admission dataset, and we train a neural network with one hidden layer which consists of 125 neurons. We use sigmoid as the activation function for the hidden layer. We run the following algorithms,

  • Penalty Method: We solve optimization problem (10). In this example, $\hat{L}$ and $\hat{L}_a$ are no longer convex. All hyperparameters except the learning rate remain the same as in the first experiment; the learning rate is set to 0.001.

  • FairBatch: We train the whole network using FairBatch with mini-batch Adam optimization, a batch size of 100, and a learning rate of 0.001.

  • Linear Relaxation: To take advantage of CVXPY, we first train the network without any fairness constraint using full-batch Adam optimization (i.e., the batch size equals the size of the training dataset) with a learning rate of 0.001. Then, we fine-tune the parameters of the output layer (126 parameters) under the relaxed EL constraint; in particular, we solve problem (9) after linear relaxation.

  • Algorithm 2 and Algorithm 3: We can run Algorithm 2 and Algorithm 3 to fine-tune the neural network. After training the network without any constraint using batch Adam optimization, we solve (9) using Algorithm 2 and Algorithm 3.

Table 3 reports the average and standard deviation of the empirical loss and the loss difference between Black and White students. Both Algorithm 2 and Algorithm 3 achieve a fairness level (i.e., $|\hat{L}_0 - \hat{L}_1|$) close to the desired level $\gamma$. Moreover, the MSE of Algorithms 2 and 3 under the non-linear model is slightly lower than their MSE under the linear model.

We also investigate how the MSE $\hat{L}$ changes as a function of the fairness level $|\hat{L}_1 - \hat{L}_0|$. Figure 3 illustrates the MSE-fairness trade-off. To generate this plot, we repeat the experiment for $\gamma \in \{0.025, 0.05, 0.1, 0.15, 0.2\}$; for each $\gamma$, we run the experiment five times and calculate the average MSE $\hat{L}$ and the average MSE difference on the test dataset. Based on Figure 3, we observe that FairBatch and LinRe are not very sensitive to the input $\gamma$, although FairBatch may sometimes show a better trade-off than Algorithm 2. In this example, PM, Algorithm 2, and Algorithm 3 are very sensitive to $\gamma$: as $\gamma$ increases, the MSE $\hat{L}$ decreases and $|\hat{L}_0 - \hat{L}_1|$ increases.

Table 3: Neural network training under EL fairness. The loss function in this example is the mean squared error loss.

| Method        | Metric                         | $\gamma=0$      | $\gamma=0.1$    |
|---------------|--------------------------------|-----------------|-----------------|
| PM            | test loss                      | 0.9490 ± 0.0584 | 0.9048 ± 0.0355 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.1464 ± 0.1055 | 0.1591 ± 0.0847 |
| LinRe         | test loss                      | 0.8489 ± 0.0195 | 0.8235 ± 0.0165 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.6543 ± 0.0322 | 0.5595 ± 0.0482 |
| FairBatch     | test loss                      | 0.9012 ± 0.1918 | 0.8638 ± 0.0863 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.2771 ± 0.1252 | 0.1491 ± 0.0928 |
| Ours (Alg. 2) | test loss                      | 0.9117 ± 0.0172 | 0.8519 ± 0.0195 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0761 ± 0.0498 | 0.1454 ± 0.0749 |
| Ours (Alg. 3) | test loss                      | 0.9427 ± 0.0190 | 0.8908 ± 0.0209 |
|               | test $|\hat{L}_0-\hat{L}_1|$   | 0.0862 ± 0.0555 | 0.1423 ± 0.0867 |
Figure 3: Trade-off between overall MSE and unfairness. A lower curve implies a better trade-off.

Limitations and Negative Societal Impact. 1) Our theoretical guarantees are valid under the stated assumptions (e.g., the convexity of $L(\boldsymbol{w})$, i.i.d. samples, a binary sensitive attribute); these assumptions have been clearly stated throughout this paper. 2) In this paper, we develop an algorithm for finding a fair predictor under EL fairness. However, we do not claim this notion is better than other fairness notions; depending on the scenario, this notion may or may not be suitable for mitigating unfairness.

7 Conclusion

In this work, we studied supervised learning problems under Equalized Loss (EL) fairness (Zhang et al., 2019), a notion that requires the expected loss to be balanced across different demographic groups. Imposing the EL constraint makes the learning problem non-convex. We proposed two algorithms with theoretical performance guarantees to find the global optimal solution and a sub-optimal solution to this non-convex problem.

Acknowledgment

This work is partially supported by the NSF under grants IIS-2202699, IIS-2301599, and ECCS-2301601.

References

  • Abroshan et al. [2022] Mahed Abroshan, Mohammad Mahdi Khalili, and Andrew Elliott. Counterfactual fairness in synthetic data generation. In NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
  • Abroshan et al. [2023] Mahed Abroshan, Saumitra Mishra, and Mohammad Mahdi Khalili. Symbolic metamodels for interpreting black-boxes using primitive functions. arXiv preprint arXiv:2302.04791, 2023.
  • Agarwal et al. [2018] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60–69. PMLR, 2018.
  • Agarwal et al. [2019] Alekh Agarwal, Miroslav Dudik, and Zhiwei Steven Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pages 120–129. PMLR, 2019.
  • Agrawal et al. [2018] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
  • Bartlett and Mendelson [2002] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Ben-Tal and Zibulevsky [1997] Aharon Ben-Tal and Michael Zibulevsky. Penalty/barrier multiplier methods for convex programming problems. SIAM Journal on Optimization, 7(2):347–366, 1997.
  • Biega et al. [2018] Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 405–414, 2018.
  • Calmon et al. [2017] Flavio P Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3995–4004, 2017.
  • Celis et al. [2020] L Elisa Celis, Vijay Keswani, and Nisheeth Vishnoi. Data preprocessing to mitigate bias: A maximum entropy based approach. In International Conference on Machine Learning, pages 1349–1359. PMLR, 2020.
  • Conitzer et al. [2019] Vincent Conitzer, Rupert Freeman, Nisarg Shah, and Jennifer Wortman Vaughan. Group fairness for the allocation of indivisible goods. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1853–1860, 2019.
  • Dastin [2018] Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. http://reut.rs/2MXzkly, 2018.
  • Diamond and Boyd [2016] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
  • Diana et al. [2021] Emily Diana, Wesley Gill, Michael Kearns, Krishnaram Kenthapadi, and Aaron Roth. Minimax group fairness: Algorithms and experiments. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 66–76, 2021.
  • Donini et al. [2018] Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. Advances in Neural Information Processing Systems, 31, 2018.
  • Dressel and Farid [2018] Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science advances, 4(1):eaao5580, 2018.
  • Dwork et al. [2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.
  • Gupta and Kamble [2019] Swati Gupta and Vijay Kamble. Individual fairness in hindsight. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 805–806, 2019.
  • Hardt et al. [2016] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315–3323, 2016.
  • Harwell [2018] Drew Harwell. The accent gap. http://wapo.st/3pUqZ0S, 2018.
  • Jung et al. [2019] Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, Logan Stapleton, and Zhiwei Steven Wu. Eliciting and enforcing subjective individual fairness. arXiv preprint arXiv:1905.10660, 2019.
  • Kamiran and Calders [2012] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33, 2012.
  • Khalili et al. [2020] Mohammad Mahdi Khalili, Xueru Zhang, Mahed Abroshan, and Somayeh Sojoudi. Improving fairness and privacy in selection problems. arXiv preprint arXiv:2012.03812, 2020.
  • Khalili et al. [2021] Mohammad Mahdi Khalili, Xueru Zhang, and Mahed Abroshan. Fair sequential selection using supervised learning models. Advances in Neural Information Processing Systems, 34:28144–28155, 2021.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kohavi [1996] Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pages 202–207, 1996.
  • Komiyama et al. [2018] Junpei Komiyama, Akiko Takeda, Junya Honda, and Hajime Shimao. Nonconvex optimization for regression with fairness constraints. In International Conference on Machine Learning, pages 2737–2746. PMLR, 2018.
  • Lohaus et al. [2020] Michael Lohaus, Michael Perrot, and Ulrike Von Luxburg. Too relaxed to be fair. In International Conference on Machine Learning, pages 6360–6369. PMLR, 2020.
  • Lundberg and Lee [2017] Scott Lundberg and Su-In Lee. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017.
  • Mahdavi et al. [2012] Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi. Stochastic gradient descent with only one projection. Advances in Neural Information Processing Systems, 25:494–502, 2012.
  • Nedić and Ozdaglar [2009] Angelia Nedić and Asuman Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205–228, 2009.
  • Nemirovski [2004] Arkadi Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, 42(16):3215–3224, 2004.
  • Reimers et al. [2021] Christian Reimers, Paul Bodesheim, Jakob Runge, and Joachim Denzler. Towards learning an unbiased classifier from biased data via conditional adversarial debiasing. arXiv preprint arXiv:2103.06179, 2021.
  • Ribeiro et al. [2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
  • Roh et al. [2020] Yuji Roh, Kangwook Lee, Steven Euijong Whang, and Changho Suh. Fairbatch: Batch selection for model fairness. In International Conference on Learning Representations, 2020.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Shen et al. [2022] Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. Optimising equal opportunity fairness in model training. arXiv preprint arXiv:2205.02393, 2022.
  • Wightman [1998] Linda F Wightman. LSAC National Longitudinal Bar Passage Study. LSAC Research Report Series, 1998.
  • Williamson and Menon [2019] Robert Williamson and Aditya Menon. Fairness risk measures. In International Conference on Machine Learning, pages 6786–6797. PMLR, 2019.
  • Woodworth et al. [2017] Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920–1953. PMLR, 2017.
  • Wright [2001] Stephen J Wright. On the convergence of the newton/log-barrier method. Mathematical programming, 90(1):71–100, 2001.
  • Wu et al. [2019] Yongkai Wu, Lu Zhang, and Xintao Wu. On convexity and bounds of fairness-aware classification. In The World Wide Web Conference, pages 3356–3362, 2019.
  • Zafar et al. [2017] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171–1180, 2017.
  • Zafar et al. [2019] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez-Rodriguez, and Krishna P Gummadi. Fairness constraints: A flexible approach for fair classification. The Journal of Machine Learning Research, 20(1):2737–2778, 2019.
  • Zhang et al. [2018] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018.
  • Zhang et al. [2019] Xueru Zhang, Mohammadmahdi Khaliligarekani, Cem Tekin, and Mingyan Liu. Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness. Advances in Neural Information Processing Systems, 32:15269–15278, 2019.
  • Zhang et al. [2020] Xueru Zhang, Mohammad Mahdi Khalili, and Mingyan Liu. Long-term impacts of fair machine learning. Ergonomics in Design, 28(3):7–11, 2020.
  • Zhang et al. [2022] Xueru Zhang, Mohammad Mahdi Khalili, Kun Jin, Parinaz Naghizadeh, and Mingyan Liu. Fairness interventions as (dis) incentives for strategic manipulation. In International Conference on Machine Learning, pages 26239–26264. PMLR, 2022.

Appendix A

A.1 Some notes on the code for reproducibility

In this part, we describe the files provided in our GitHub repository.

  • law_data.py: This file includes a function called law_data(seed), which processes the law school admission dataset and splits it randomly into training and test datasets (we keep 70% of the datapoints for training). In our experiments, we set seed equal to 0, 1, 2, 3, and 4 to get five different splits and repeat our experiments five times.

  • Adult_data.py: This file includes a function called Adult_dataset(seed), which processes the adult income dataset and splits it randomly into training and test datasets. In our experiments, we set seed equal to 0, 1, 2, 3, and 4 to get five different splits and repeat our experiments five times.

  • Algorithms.py: This file includes the following functions:

    • ELminimizer(X0, Y0, X1, Y1, gamma, eta, model): This function implements the ELminimizer algorithm. (X0, Y0) are the training datapoints belonging to group $A=0$, and (X1, Y1) are the datapoints belonging to group $A=1$. gamma is the fairness level for the EL constraint. eta is the regularizer parameter (in our experiments, eta = 0.002). model determines the model that we want to train: if model="linear", we train a linear regression model; if model="logistic", we train a logistic regression model. This function returns five variables (w, b, l0, l1, l): w and b are the weight vector and bias term of the trained model; l0 and l1 are the average training losses of group 0 and group 1, respectively; l is the overall training loss.

    • Algorithm2(X0, Y0, X1, Y1, gamma, eta, model): This function implements Algorithm 2, which calls the ELminimizer algorithm twice. It also returns the five variables (w, b, l0, l1, l) defined above.

    • Algorithm3(X0, Y0, X1, Y1, gamma, eta, model): This function implements Algorithm 3, which finds a sub-optimal solution under EL fairness. It also returns the five variables (w, b, l0, l1, l) defined above.

    • solve_constrained_opt(X0, Y0, X1, Y1, eta, landa, model): This function uses the CVXPY package to solve optimization problem (4). We set landa equal to $\lambda_{mid}^{(i)}$ to solve optimization problem (4) in iteration $i$ of Algorithm 1.

    • calculate_loss(w, b, X0, Y0, X1, Y1, model): This function is used to find the test loss. w and b are the model parameters (trained by Algorithm 2 or 3). It returns the average loss of group 0, the average loss of group 1, and the overall loss on the given dataset.

    • solve_lin_constrained_opt(X0, Y0, X1, Y1, gamma, eta, model): This function solves optimization problem (8) after linear relaxation.

  • Baseline.py: This file includes the following functions:

    • penalty_method(method, X_0, y_0, X_1, y_1, num_itr, lr, r, gamma, seed, epsilon), where method can be either "linear" for linear regression or "logistic" for logistic regression. This function uses the penalty method and trains the model under EL using Adam optimization. num_itr is the maximum number of iterations; r is the regularization parameter (set to 0.002 in our experiments); lr is the learning rate; gamma is the fairness level; epsilon is used for the stopping criterion. This function returns the trained model (a torch module), the training losses of group 0 and group 1, and the overall training loss.

    • fair_batch(method, X_0, y_0, X_1, y_1, num_itr, lr, r, alpha, gamma, seed, epsilon): This function simulates the FairBatch algorithm [Roh et al., 2020]. The input parameters are similar to those of penalty_method except for alpha, which determines how to adjust the sub-sampling distribution for mini-batch formation (see the next section for more details). This function returns the trained model (a torch module), the training losses of group 0 and group 1, and the overall training loss.

table1_2.py uses the above functions to reproduce the results in Table 1 and Table 2, and figure1_2.py uses them to reproduce Figure 1 and Figure 2. We provide comments in these files to make the code more readable. We have also provided code for training non-linear models; please use Table3.py and figure3.py to generate the results in Table 3 and Figure 3, respectively.
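For orientation, the following minimal sketch strings the documented functions together on synthetic arrays. It assumes nothing beyond the argument and return conventions listed above; the toy data merely stand in for the processed datasets.

```python
# Minimal usage sketch of the documented signatures (synthetic data only).
import numpy as np
from Algorithms import Algorithm2, Algorithm3, calculate_loss

rng = np.random.default_rng(0)
X0, Y0 = rng.normal(size=(100, 5)), rng.normal(size=100)  # group A=0
X1, Y1 = rng.normal(size=(80, 5)), rng.normal(size=80)    # group A=1

# Train a linear model under 0.1-EL (Algorithm3 can be swapped in for the
# sub-optimal variant; it takes the same arguments).
w, b, l0, l1, l = Algorithm2(X0, Y0, X1, Y1, 0.1, 0.002, "linear")

# Evaluate the trained parameters on a dataset of interest.
l0_eval, l1_eval, l_eval = calculate_loss(w, b, X0, Y0, X1, Y1, "linear")
print(l_eval, abs(l0_eval - l1_eval))
```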

Lastly, use the following commands to generate the results in Table 1:

  • python3 table1_2.py --experiment=1 --gamma=0.0

  • python3 table1_2.py --experiment=1 --gamma=0.1

Use the following commands to generate the results in Table 2:

  • python3 table1_2.py --experiment=2 --gamma=0.0

  • python3 table1_2.py --experiment=2 --gamma=0.1

Use the following commands to generate the results in Table 3:

  • python3 table3.py --gamma=0.0

  • python3 table3.py --gamma=0.1

Use the following command to generate results in Figure 1:

  • python3 figure1_2.py --experiment=1

Use the following command to generate results in Figure 2:

  • python3 figure1_2.py --experiment=2

Use the following command to generate results in Figure 3:

  • python3 figure3.py

Note that you need to install the packages in requirements.txt.

A.2 Notes on FairBatch [Roh et al., 2020]

This method has been proposed to find a predictor under equal opportunity, equalized odds, or statistical parity. In each epoch, it identifies the disadvantaged group and increases the sub-sampling rate of that group in mini-batch selection for the next epoch. We modify this approach for $\gamma$-EL as follows (a code sketch of the resulting update appears after this list):

  • We initialize the sub-sampling rate of group $a$ (denoted by $SR^{(0)}_{a}$) for mini-batch formation by $SR^{(0)}_{a}=\frac{n_{a}}{n}, a=0,1$. We form the mini-batches using $SR^{(0)}_{0}$ and $SR^{(0)}_{1}$.

  • At epoch $i$, we run gradient descent using the mini-batches formed by $SR^{(i-1)}_{0}$ and $SR^{(i-1)}_{1}$, and we obtain new model parameters $\boldsymbol{w}_{i}$.

  • After epoch $i$, we calculate the empirical loss of each group. Then, we update $SR^{(i)}_{a}$ as follows:

    \[
    SR^{(i)}_{a} \longleftarrow \begin{cases} SR^{(i-1)}_{a}+\alpha & \text{if } \hat{L}_{a}(\boldsymbol{w}_{i})-\hat{L}_{1-a}(\boldsymbol{w}_{i})>\gamma \\ SR^{(i-1)}_{a}-\alpha & \text{if } \hat{L}_{a}(\boldsymbol{w}_{i})-\hat{L}_{1-a}(\boldsymbol{w}_{i})<-\gamma \\ SR^{(i-1)}_{a} & \text{o.w.,} \end{cases}
    \]

    where $\alpha$ is a hyperparameter, equal to $0.005$ in our experiments.
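The update above is simple enough to state in a few lines of code. The sketch below is one way to write it; the function name and dict-based bookkeeping are ours, not from the FairBatch codebase.

```python
def update_subsampling_rates(sr, losses, gamma, alpha=0.005):
    """One step of the modified FairBatch update for gamma-EL (sketch).

    sr:     dict mapping group a in {0, 1} to its sub-sampling rate SR_a.
    losses: dict mapping group a to its empirical loss after the last epoch.
    """
    new_sr = dict(sr)
    for a in (0, 1):
        gap = losses[a] - losses[1 - a]
        if gap > gamma:        # group a is disadvantaged: sample it more
            new_sr[a] = sr[a] + alpha
        elif gap < -gamma:     # group a is advantaged: sample it less
            new_sr[a] = sr[a] - alpha
        # otherwise SR_a is left unchanged
    return new_sr
```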

A.3 Details of numerical experiments and additional numerical results

Due to the space limits of the main paper, we provide more details on our experiments here:

  • Stopping criteria for the penalty method and FairBatch: We stopped the learning process when the change in the objective function between two consecutive epochs fell below $10^{-6}$. We used $10^{-6}$ because we did not observe any significant change with smaller values.

  • Learning rate for the penalty method and FairBatch: We chose a learning rate of $0.005$ for training the linear model. For the experiment with the non-linear model, we set the learning rate to $0.001$.

  • Stopping criteria for Algorithm 2 and Algorithm 3: As stated in the main paper, we set $\epsilon=0.01$ in ELminimizer and Algorithm 3. Choosing a smaller $\epsilon$ did not change the performance significantly.

  • Linear relaxation: Note that equation (8) after linear relaxation is a convex optimization problem; we solve it directly using CVXPY.

The experiments were run on a system with the following configuration: 24 GB of RAM, 2 cores of a P100-16GB GPU, and 2 cores of an Intel Xeon processor. We used the GPUs for training FairBatch.

A.4 Notes on the Reduction Approach [Agarwal et al., 2018, 2019]

Let $Q(f)$ be a distribution over $\mathcal{F}$. In order to find the optimal $Q(f)$ using the reduction approach, we have to solve the following optimization problem:

\[
\begin{aligned}
\min_{Q}\quad & \sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\} \\
\text{s.t.}\quad & \sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=0\}=\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\} \\
& \sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=1\}=\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}
\end{aligned}
\]

Similar to [Agarwal et al., 2018, 2019], we can rewrite the above optimization problem in the following form:

\[
\begin{aligned}
\min_{Q}\quad & \sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\} \\
\text{s.t.}\quad & \sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=0\}-\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\leq 0 \\
& -\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=0\}+\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\leq 0 \\
& \sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=1\}-\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\leq 0 \\
& -\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=1\}+\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\leq 0
\end{aligned}
\]

Then, the reduction approach forms the Lagrangian as follows:

\[
\begin{aligned}
L(Q,\mu) ={} & \sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\} \\
& -\mu_{1}\left(\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=0\}-\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\right) \\
& -\mu_{2}\left(-\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=0\}+\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\right) \\
& -\mu_{3}\left(\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=1\}-\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\right) \\
& -\mu_{4}\left(-\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))|A=1\}+\sum_{f\in\mathcal{F}}Q(f)\,\mathbb{E}\{l(Y,f(\boldsymbol{X}))\}\right), \\
& \mu_{1}\geq 0,\ \mu_{2}\geq 0,\ \mu_{3}\geq 0,\ \mu_{4}\geq 0.
\end{aligned}
\]

Since $f$ is parametrized by $\boldsymbol{w}$, we can instead find a distribution $Q(\boldsymbol{w})$ over $\mathbb{R}^{d_{\boldsymbol{w}}}$. Therefore, we rewrite the problem in the following form:

\[
\begin{aligned}
L(Q(\boldsymbol{w}),\mu_{1},\mu_{2},\mu_{3},\mu_{4}) ={} & \sum_{\boldsymbol{w}}Q(\boldsymbol{w})L(\boldsymbol{w}) \\
& -\mu_{1}\left(\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L_{0}(\boldsymbol{w})-\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L(\boldsymbol{w})\right) \\
& -\mu_{2}\left(-\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L_{0}(\boldsymbol{w})+\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L(\boldsymbol{w})\right) \\
& -\mu_{3}\left(\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L_{1}(\boldsymbol{w})-\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L(\boldsymbol{w})\right) \\
& -\mu_{4}\left(-\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L_{1}(\boldsymbol{w})+\sum_{\boldsymbol{w}}Q(\boldsymbol{w})L(\boldsymbol{w})\right)
\end{aligned}
\]

The reduction approach updates $Q(\boldsymbol{w})$ and $(\mu_{1},\mu_{2},\mu_{3},\mu_{4})$ alternately. Looking carefully at Algorithm 1 in [Agarwal et al., 2018], after updating $(\mu_{1},\mu_{2},\mu_{3},\mu_{4})$, we need access to an oracle that can solve the following optimization problem in each iteration:

\[
\min_{\boldsymbol{w}}\ (1+\mu_{1}-\mu_{2}+\mu_{3}-\mu_{4})L(\boldsymbol{w})+(-\mu_{1}+\mu_{2})L_{0}(\boldsymbol{w})+(-\mu_{3}+\mu_{4})L_{1}(\boldsymbol{w})
\]

The above optimization problem is not convex for all $\mu_{1},\mu_{2},\mu_{3},\mu_{4}$: for instance, if $\mu_{2}>1+\mu_{1}+\mu_{3}-\mu_{4}$, the coefficient of $L(\boldsymbol{w})$ is negative, and the objective becomes a difference of convex functions. Therefore, using the reduction approach requires an oracle that can solve the above non-convex optimization problem, which is not available. Note that the original problem (1) is a non-convex optimization problem, and the reduction approach merely leads to another non-convex optimization problem.
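As a concrete illustration, the snippet below builds the oracle objective for a least-squares loss with multipliers of our own choosing; CVXPY's disciplined-convex-programming check rejects it, confirming the objective is not recognized as convex. The data are synthetic and the multiplier values are illustrative only.

```python
# Sketch: why the reduction oracle can be non-convex. With mu_2 large, the
# coefficient of L(w) is negative, making the objective a difference of
# convex functions, which CVXPY's DCP check rejects.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X0, y0 = rng.normal(size=(50, 3)), rng.normal(size=50)
X1, y1 = rng.normal(size=(50, 3)), rng.normal(size=50)
X, y = np.vstack([X0, X1]), np.concatenate([y0, y1])

w = cp.Variable(3)
L = cp.sum_squares(X @ w - y) / 100    # overall loss L(w)
L0 = cp.sum_squares(X0 @ w - y0) / 50  # group-0 loss
L1 = cp.sum_squares(X1 @ w - y1) / 50  # group-1 loss

mu1, mu2, mu3, mu4 = 0.0, 2.0, 0.0, 0.0  # multipliers from the outer loop
obj = (1 + mu1 - mu2 + mu3 - mu4) * L + (mu2 - mu1) * L0 + (mu4 - mu3) * L1
prob = cp.Problem(cp.Minimize(obj))
print(prob.is_dcp())  # False: the oracle objective is not convex here
```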

A.5 Equalized Loss & Bounded Group Loss

In this section, we study the relation between the EL and BGL fairness notions. It is straightforward to see that any predictor satisfying $\gamma$-BGL also satisfies $\gamma$-EL. However, it is unclear to what extent an optimal fair predictor under $\gamma$-EL satisfies the BGL fairness notion. Next, we study the relation between the two notions theoretically.

Let $\boldsymbol{w}^{*}$ denote the solution to (1) and $f_{\boldsymbol{w}^{*}}$ the corresponding optimal $\gamma$-EL fair predictor. Theorem A.1 below shows that, under certain conditions, it is impossible for both groups to experience a loss larger than $2\gamma$ under the optimal $\gamma$-EL fair predictor.

Theorem A.1.

Suppose there exists a predictor satisfying the $\gamma$-BGL fairness notion; that is, the following optimization problem has at least one feasible point:

\[
\min_{\boldsymbol{w}}\ L(\boldsymbol{w})\quad \text{s.t.}\quad L_{a}(\boldsymbol{w})\leq\gamma,\ \forall a\in\{0,1\}. \tag{12}
\]

Then, the following hold:

\[
\begin{aligned}
\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\} &\leq \gamma; \\
\max\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\} &\leq 2\gamma.
\end{aligned}
\]

Theorem A.1 shows that $\gamma$-EL implies $2\gamma$-BGL whenever $\gamma$-BGL is a feasible constraint. Therefore, if $\gamma$ is not too small ($\gamma=0$, for instance, would be too small), EL and BGL do not contradict each other.
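For a concrete instance, take $\gamma=0.1$: if the $\gamma$-BGL problem (12) is feasible, Theorem A.1 gives $\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}\leq 0.1$, and the $\gamma$-EL constraint $|L_{0}(\boldsymbol{w}^{*})-L_{1}(\boldsymbol{w}^{*})|\leq\gamma$ then bounds the other group's loss:

\[
\max\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\} \leq \min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\} + \gamma \leq 0.1 + 0.1 = 0.2 = 2\gamma.
\]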

We emphasize that we are not claiming that EL fairness is better than BGL. Instead, these relations indicate the impacts the two fairness constraints can have on model performance; the results may further provide guidance for policy-makers.

A.6 Proofs

In order to prove Theorem 3.3, we first introduce two lemmas.

Lemma A.2.

Under Assumption 3.2, there exists $\overline{\boldsymbol{w}}\in\mathbb{R}^{d_{\boldsymbol{w}}}$ such that $L_{0}(\overline{\boldsymbol{w}})=L_{1}(\overline{\boldsymbol{w}})=L(\overline{\boldsymbol{w}})$ and $\lambda^{(0)}_{start}\leq L(\overline{\boldsymbol{w}})\leq\lambda^{(0)}_{end}$.

Proof. Let $q_{0}(\beta)=L_{0}((1-\beta)\boldsymbol{w}_{G_{0}}+\beta\boldsymbol{w}_{G_{1}})$, $q_{1}(\beta)=L_{1}((1-\beta)\boldsymbol{w}_{G_{0}}+\beta\boldsymbol{w}_{G_{1}})$, and $q(\beta)=q_{0}(\beta)-q_{1}(\beta)$ for $\beta\in[0,1]$. Note that $\nabla_{\boldsymbol{w}}L_{a}(\boldsymbol{w}_{G_{a}})=0$ because $\boldsymbol{w}_{G_{a}}$ is the minimizer of $L_{a}(\boldsymbol{w})$.

First, we show that $L_{0}((1-\beta)\boldsymbol{w}_{G_{0}}+\beta\boldsymbol{w}_{G_{1}})$ is increasing in $\beta$ and $L_{1}((1-\beta)\boldsymbol{w}_{G_{0}}+\beta\boldsymbol{w}_{G_{1}})$ is decreasing in $\beta$. Note that $q^{\prime}_{0}(0)=(\boldsymbol{w}_{G_{1}}-\boldsymbol{w}_{G_{0}})^{T}\nabla_{\boldsymbol{w}}L_{0}(\boldsymbol{w}_{G_{0}})=0$, and $q_{0}(\beta)$ is convex because $L_{0}(\boldsymbol{w})$ is convex. This implies that $q^{\prime}_{0}(\beta)$ is increasing, and hence $q^{\prime}_{0}(\beta)\geq 0$ for all $\beta\in[0,1]$. Similarly, we can show that $q^{\prime}_{1}(\beta)\leq 0$ for all $\beta\in[0,1]$.

Note that under Assumption 3.2, $q(0)<0$ and $q(1)>0$. Therefore, by the intermediate value theorem, there exists $\overline{\beta}\in(0,1)$ such that $q(\overline{\beta})=0$. Define $\overline{\boldsymbol{w}}=(1-\overline{\beta})\boldsymbol{w}_{G_{0}}+\overline{\beta}\boldsymbol{w}_{G_{1}}$. We have:

\[
\begin{aligned}
& q(\overline{\beta})=0 \implies L_{0}(\overline{\boldsymbol{w}})=L_{1}(\overline{\boldsymbol{w}})=L(\overline{\boldsymbol{w}}); \\
& \boldsymbol{w}_{G_{0}} \text{ is the minimizer of } L_{0} \implies L(\overline{\boldsymbol{w}})=L_{0}(\overline{\boldsymbol{w}})\geq\lambda^{(0)}_{start}; \\
& q^{\prime}_{0}(\beta)\geq 0,\ \forall\beta\in[0,1] \implies q_{0}(1)\geq q_{0}(\overline{\beta}) \implies \lambda_{end}^{(0)}\geq L_{0}(\overline{\boldsymbol{w}})=L(\overline{\boldsymbol{w}}).
\end{aligned}
\]
Lemma A.3.

$L_{0}(\boldsymbol{w}_{i}^{*})=\lambda_{mid}^{(i)}$, where $\boldsymbol{w}_{i}^{*}$ is the solution to (4).

Proof. We proceed by contradiction. Assume that $L_{0}(\boldsymbol{w}_{i}^{*})<\lambda_{mid}^{(i)}$ (i.e., $\boldsymbol{w}_{i}^{*}$ is an interior point of the feasible set of (4)). Notice that $\boldsymbol{w}_{G_{1}}$ cannot be in the feasible set of (4) because $L_{0}(\boldsymbol{w}_{G_{1}})=\lambda_{end}^{(0)}>\lambda_{mid}^{(i)}$. As a result, $\nabla_{\boldsymbol{w}}L_{1}(\boldsymbol{w}_{i}^{*})\neq 0$. This is a contradiction: an interior point of the feasible set of a convex optimization problem cannot be optimal if the gradient of the objective at that point is nonzero.

Proof [Theorem 3.3]

First, we show that $L(\boldsymbol{w}^{*})\in I_{i}$ for all $i$, where $\boldsymbol{w}^{*}$ is the solution to (1) when $\gamma=0$ (so $L_{0}(\boldsymbol{w}^{*})=L_{1}(\boldsymbol{w}^{*})=L(\boldsymbol{w}^{*})$). Note that $L(\boldsymbol{w}^{*})=L_{0}(\boldsymbol{w}^{*})\geq\lambda_{start}^{(0)}$ because $\boldsymbol{w}_{G_{0}}$ is the minimizer of $L_{0}$. Moreover, $\lambda_{end}^{(0)}\geq L(\boldsymbol{w}^{*})$; otherwise $L(\overline{\boldsymbol{w}})<L(\boldsymbol{w}^{*})$ (where $\overline{\boldsymbol{w}}$ is defined in Lemma A.2) and $\boldsymbol{w}^{*}$ would not be the optimal solution under 0-EL. Therefore, $L(\boldsymbol{w}^{*})\in I_{0}$.

Now we proceed by induction. Suppose $L(\boldsymbol{w}^{*})\in I_{i}$; we show that $L(\boldsymbol{w}^{*})\in I_{i+1}$ as well. We consider two cases.

  • $L(\boldsymbol{w}^{*})\leq\lambda_{mid}^{(i)}$. In this case, $\boldsymbol{w}^{*}$ is a feasible point for (4), and $L_{1}(\boldsymbol{w}_{i}^{*})=\lambda^{(i)}\leq L_{1}(\boldsymbol{w}^{*})=L(\boldsymbol{w}^{*})\leq\lambda_{mid}^{(i)}$. Therefore, $L(\boldsymbol{w}^{*})\in I_{i+1}$.

  • $L(\boldsymbol{w}^{*})>\lambda_{mid}^{(i)}$. In this case, we proceed by contradiction to show that $\lambda^{(i)}\geq\lambda_{mid}^{(i)}$. Assume that $\lambda^{(i)}<\lambda_{mid}^{(i)}$. Define $r(\beta)=r_{0}(\beta)-r_{1}(\beta)$, where $r_{a}(\beta)=L_{a}((1-\beta)\boldsymbol{w}_{G_{0}}+\beta\boldsymbol{w}_{i}^{*})$. Note that $\lambda^{(i)}=r_{1}(1)$. By Lemma A.3, $r_{0}(1)=\lambda_{mid}^{(i)}$. Therefore, $r(1)=\lambda_{mid}^{(i)}-\lambda^{(i)}>0$. Moreover, under Assumption 3.2, $r(0)<0$. Therefore, by the intermediate value theorem, there exists $\overline{\beta}_{0}\in(0,1)$ such that $r(\overline{\beta}_{0})=0$. Similar to the proof of Lemma A.2, we can show that $r_{0}(\beta)$ is an increasing function for all $\beta\in[0,1]$. As a result, $r_{0}(\overline{\beta}_{0})<r_{0}(1)=\lambda_{mid}^{(i)}$. Define $\overline{\boldsymbol{w}}_{0}=(1-\overline{\beta}_{0})\boldsymbol{w}_{G_{0}}+\overline{\beta}_{0}\boldsymbol{w}_{i}^{*}$. We have:

    \[
    r_{0}(\overline{\beta}_{0})=L_{0}(\overline{\boldsymbol{w}}_{0})=L_{1}(\overline{\boldsymbol{w}}_{0})=L(\overline{\boldsymbol{w}}_{0})<\lambda_{mid}^{(i)} \tag{13}
    \]
    \[
    L(\boldsymbol{w}^{*})>\lambda_{mid}^{(i)} \tag{14}
    \]

    The last two equations imply that $\boldsymbol{w}^{*}$ is not a globally optimal fair solution under the 0-EL fairness constraint, i.e., it is not the global optimal solution to (1). This is a contradiction. Therefore, if $L(\boldsymbol{w}^{*})>\lambda_{mid}^{(i)}$, then $\lambda^{(i)}\geq\lambda_{mid}^{(i)}$. As a result, $L(\boldsymbol{w}^{*})\in I_{i+1}$.

By the two cases above and the nested interval theorem, we conclude that:

\[
L(\boldsymbol{w}^{*})\in\bigcap_{i=1}^{\infty}I_{i},\qquad \lim_{i\to\infty}\lambda_{mid}^{(i)}=L(\boldsymbol{w}^{*}),\qquad \text{and we define } \lambda_{mid}^{\infty}:=\lim_{i\to\infty}\lambda_{mid}^{(i)}.
\]

Therefore, $\lim_{i\to\infty}\boldsymbol{w}_{i}^{*}$ is the solution to the following optimization problem:

\[
\arg\min_{\boldsymbol{w}} L_{1}(\boldsymbol{w})\quad \text{s.t.}\quad L_{0}(\boldsymbol{w})\leq\lambda_{mid}^{\infty}=L(\boldsymbol{w}^{*})
\]

By Lemma A.3, the solution to the above optimization problem (i.e., $\lim_{i\to\infty}\boldsymbol{w}_{i}^{*}$) satisfies $L_{0}(\lim_{i\to\infty}\boldsymbol{w}_{i}^{*})=\lambda_{mid}^{\infty}=L(\boldsymbol{w}^{*})$. Therefore, $\lim_{i\to\infty}\boldsymbol{w}_{i}^{*}$ is the global optimal solution to optimization problem (1).

Proof [Theorem 3.4 ] Let’s assume that 𝒘O\boldsymbol{w}_{O} does not satisfy the γ\gamma-EL.111111If 𝒘O\boldsymbol{w}_{O} satisfies γ\gamma-EL, it will be the optimal predictor under γ\gamma-EL fairness. Therefore, there is no need to solve any constrained optimization problem. Note that 𝒘O\boldsymbol{w}_{O} is the solution to problem (7). Let 𝒘\boldsymbol{w}^{*} be the optimal weight vector under γ\gamma-EL. It is clear that 𝒘𝒘O\boldsymbol{w}^{*}\neq\boldsymbol{w}_{O}.

Step 1. We show that one of the following holds:

\[
L_{0}(\boldsymbol{w}^{*})-L_{1}(\boldsymbol{w}^{*})=\gamma \tag{15}
\]
\[
L_{0}(\boldsymbol{w}^{*})-L_{1}(\boldsymbol{w}^{*})=-\gamma \tag{16}
\]

Proof by contradiction: assume $-\gamma<L_{0}(\boldsymbol{w}^{*})-L_{1}(\boldsymbol{w}^{*})<\gamma$. This implies that $\boldsymbol{w}^{*}$ is an interior point of the feasible set of optimization problem (1). Since $\boldsymbol{w}^{*}\neq\boldsymbol{w}_{O}$, we have $\nabla L(\boldsymbol{w}^{*})\neq 0$. As a result, the objective function of (1) can be improved at $\boldsymbol{w}^{*}$ by moving in the direction of $-\nabla L(\boldsymbol{w}^{*})$. This is a contradiction. Therefore, $|L_{0}(\boldsymbol{w}^{*})-L_{1}(\boldsymbol{w}^{*})|=\gamma$.

Step 2. $\boldsymbol{w}_{\gamma}=\texttt{ELminimizer}(\boldsymbol{w}_{G_{0}},\boldsymbol{w}_{G_{1}},\epsilon,\gamma)$ is the solution to the following optimization problem:

\[
\begin{aligned}
\min_{\boldsymbol{w}}\quad & \Pr\{A=0\}L_{0}(\boldsymbol{w})+\Pr\{A=1\}L_{1}(\boldsymbol{w}) \\
\text{s.t.}\quad & L_{0}(\boldsymbol{w})-L_{1}(\boldsymbol{w})=\gamma \tag{17}
\end{aligned}
\]

To show this claim, notice that the solution to optimization problem (17) is the same as the solution to the following:

\[
\begin{aligned}
\min_{\boldsymbol{w}}\quad & \Pr\{A=0\}L_{0}(\boldsymbol{w})+\Pr\{A=1\}\tilde{L}_{1}(\boldsymbol{w}) \\
\text{s.t.}\quad & L_{0}(\boldsymbol{w})-\tilde{L}_{1}(\boldsymbol{w})=0, \tag{18}
\end{aligned}
\]

where $\tilde{L}_{1}(\boldsymbol{w})=L_{1}(\boldsymbol{w})+\gamma$. Since $L_{0}(\boldsymbol{w}_{G_{0}})-\tilde{L}_{1}(\boldsymbol{w}_{G_{0}})<0$ and $L_{0}(\boldsymbol{w}_{G_{1}})-\tilde{L}_{1}(\boldsymbol{w}_{G_{1}})>0$, by Theorem 3.3 we know that $\boldsymbol{w}_{\gamma}=\texttt{ELminimizer}(\boldsymbol{w}_{G_{0}},\boldsymbol{w}_{G_{1}},\epsilon,\gamma)$ finds the solution to (17) as $\epsilon$ goes to zero.

Lastly, because $|L_{0}(\boldsymbol{w}^{*})-L_{1}(\boldsymbol{w}^{*})|=\gamma$, we have:

\[
\boldsymbol{w}^{*}=\begin{cases}\boldsymbol{w}_{\gamma} & \text{if } L(\boldsymbol{w}_{\gamma})\leq L(\boldsymbol{w}_{-\gamma})\\ \boldsymbol{w}_{-\gamma} & \text{o.w.}\end{cases} \tag{21}
\]

Thus, Algorithm 2 finds the solution to (1).

Proof [Theorem 4.1]

  1. Under Assumption 3.2, $g(1)<0$. Moreover, $g(0)\geq 0$. Therefore, by the intermediate value theorem, there exists $\beta_{0}\in[0,1]$ such that $g(\beta_{0})=0$.

  2. Since $\boldsymbol{w}_{O}$ is the minimizer of $L(\boldsymbol{w})$, $h^{\prime}(0)=0$. Moreover, since $L(\boldsymbol{w})$ is strictly convex, $h(\beta)$ is strictly convex and $h^{\prime}(\beta)$ is a strictly increasing function. As a result, $h^{\prime}(\beta)>0$ for $\beta>0$, and $h(\beta)$ is strictly increasing.

  3. Similar to the above argument, $s(\beta)=L_{\hat{a}}((1-\beta)\boldsymbol{w}_{O}+\beta\boldsymbol{w}_{G_{\hat{a}}})$ is a strictly decreasing function (notice that $s^{\prime}(1)=0$ and $s(\beta)$ is strictly convex).

    Since $h(\beta)=\Pr\{A=\hat{a}\}L_{\hat{a}}((1-\beta)\boldsymbol{w}_{O}+\beta\boldsymbol{w}_{G_{\hat{a}}})+\Pr\{A=1-\hat{a}\}L_{1-\hat{a}}((1-\beta)\boldsymbol{w}_{O}+\beta\boldsymbol{w}_{G_{\hat{a}}})$ is strictly increasing and $L_{\hat{a}}((1-\beta)\boldsymbol{w}_{O}+\beta\boldsymbol{w}_{G_{\hat{a}}})$ is strictly decreasing, we conclude that $L_{1-\hat{a}}((1-\beta)\boldsymbol{w}_{O}+\beta\boldsymbol{w}_{G_{\hat{a}}})$ is strictly increasing. As a result, $g(\beta)$ must be strictly decreasing.

Proof [Theorem 4.2] First, we show that if $g_{\gamma}(0)\leq 0$, then $\boldsymbol{w}_{O}$ satisfies $\gamma$-EL:

\[
g_{\gamma}(0)\leq 0 \implies g(0)-\gamma\leq 0 \implies L_{\hat{a}}(\boldsymbol{w}_{O})-L_{1-\hat{a}}(\boldsymbol{w}_{O})\leq\gamma
\]

Moreover, $L_{\hat{a}}(\boldsymbol{w}_{O})-L_{1-\hat{a}}(\boldsymbol{w}_{O})\geq 0$ because $\hat{a}=\arg\max_{a}L_{a}(\boldsymbol{w}_{O})$. Therefore, $\gamma$-EL is satisfied.

Now, let $g_{\gamma}(0)>0$. Note that under Assumption 3.2, $g_{\gamma}(1)=L_{\hat{a}}(\boldsymbol{w}_{G_{\hat{a}}})-L_{1-\hat{a}}(\boldsymbol{w}_{G_{\hat{a}}})-\gamma<0$. Therefore, by the intermediate value theorem, there exists $\beta_{0}$ such that $g_{\gamma}(\beta_{0})=0$. Moreover, by Theorem 4.1, $g_{\gamma}$ is a strictly decreasing function. Therefore, the binary search proposed in Algorithm 3 converges to the root of $g_{\gamma}(\beta)$. As a result, $(1-\beta_{mid}^{(\infty)})\boldsymbol{w}_{O}+\beta_{mid}^{(\infty)}\boldsymbol{w}_{G_{\hat{a}}}$ satisfies $\gamma$-EL. Since $g(\beta)$ is strictly decreasing and $g(\beta_{mid}^{(\infty)})=\gamma$, $\beta_{mid}^{(\infty)}$ is the smallest possible $\beta$ under which $(1-\beta)\boldsymbol{w}_{O}+\beta\boldsymbol{w}_{G_{\hat{a}}}$ satisfies $\gamma$-EL. Since $h$ is increasing, the smallest possible $\beta$ gives the best accuracy.
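As a side note, the binary search analyzed here is ordinary bisection on a strictly decreasing function. A minimal sketch is below, with a toy stand-in for $g_{\gamma}$ (the real one requires evaluating the group losses):

```python
def bisect_decreasing(g, lo=0.0, hi=1.0, eps=0.01):
    """Find the root of a strictly decreasing g on [lo, hi] (sketch).

    Assumes g(lo) > 0 > g(hi), as guaranteed in the proof above when
    g_gamma(0) > 0 and Assumption 3.2 holds.
    """
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if g(mid) > 0:   # root lies to the right of mid for decreasing g
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

beta0 = bisect_decreasing(lambda b: 0.5 - b)  # toy g with root at 0.5
```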

Proof [Theorem 4.3] If $g_{\gamma}(0)\leq 0$, then $\boldsymbol{w}_{O}$ satisfies $\gamma$-EL and $\underline{\boldsymbol{w}}=\boldsymbol{w}_{O}$. In this case, it is easy to see that $L(\boldsymbol{w}_{O})\leq\max_{a\in\{0,1\}}L_{a}(\boldsymbol{w}_{O})$ (because $L(\boldsymbol{w}_{O})$ is a weighted average of $L_{0}(\boldsymbol{w}_{O})$ and $L_{1}(\boldsymbol{w}_{O})$).

Now assume that $g_{\gamma}(0)>0$. Note that if we prove this theorem for $\gamma=0$, then it also holds for $\gamma>0$; this is because the optimal predictor under 0-EL satisfies the $\gamma$-EL condition as well. In other words, 0-EL is a stronger constraint than $\gamma$-EL.

Let $\gamma=0$. In this case, Algorithm 3 finds $\underline{\boldsymbol{w}}=(1-\beta_{0})\boldsymbol{w}_{O}+\beta_{0}\boldsymbol{w}_{G_{\hat{a}}}$, where $\beta_{0}$ is defined in Theorem 4.1. We have:

\[
(*)\quad g(\beta_{0})=0=L_{\hat{a}}(\underline{\boldsymbol{w}})-L_{1-\hat{a}}(\underline{\boldsymbol{w}})
\]

In the proof of Theorem 4.1, we showed that $L_{\hat{a}}((1-\beta)\boldsymbol{w}_{O}+\beta\boldsymbol{w}_{G_{\hat{a}}})$ is decreasing in $\beta$. Therefore:

\[
(**)\quad L_{\hat{a}}(\underline{\boldsymbol{w}})\leq L_{\hat{a}}(\boldsymbol{w}_{O})
\]

Combining $(*)$ and $(**)$, we have:

\[
\begin{aligned}
L(\underline{\boldsymbol{w}}) &= \Pr(A=\hat{a})\cdot L_{\hat{a}}(\underline{\boldsymbol{w}})+\Pr(A=1-\hat{a})\cdot L_{1-\hat{a}}(\underline{\boldsymbol{w}}) && (22)\\
&= L_{\hat{a}}(\underline{\boldsymbol{w}}) \qquad \text{(by $(*)$)} && (23)\\
&\leq L_{\hat{a}}(\boldsymbol{w}_{O}) \qquad \text{(by $(**)$)} && (24)
\end{aligned}
\]

Proof [Theorem 4.5]

By the triangle inequality, the following holds:

\[
\begin{aligned}
&\sup_{f_{\boldsymbol{w}}\in\mathcal{F}}\Big|\,|L_{0}(\boldsymbol{w})-L_{1}(\boldsymbol{w})|-|\hat{L}_{0}(\boldsymbol{w})-\hat{L}_{1}(\boldsymbol{w})|\,\Big| \leq && (25)\\
&\sup_{f_{\boldsymbol{w}}\in\mathcal{F}}|L_{0}(\boldsymbol{w})-\hat{L}_{0}(\boldsymbol{w})|+\sup_{f_{\boldsymbol{w}}\in\mathcal{F}}|L_{1}(\boldsymbol{w})-\hat{L}_{1}(\boldsymbol{w})|. && (26)
\end{aligned}
\]

Therefore, with probability at least $1-2\delta$, we have:

\[
\sup_{f_{\boldsymbol{w}}\in\mathcal{F}}\Big|\,|L_{0}(\boldsymbol{w})-L_{1}(\boldsymbol{w})|-|\hat{L}_{0}(\boldsymbol{w})-\hat{L}_{1}(\boldsymbol{w})|\,\Big| \leq B(\delta,n_{0},\mathcal{F})+B(\delta,n_{1},\mathcal{F}) \tag{27}
\]

As a result, with probability at least $1-2\delta$, the following holds:

\[
\{\boldsymbol{w}\,|\,f_{\boldsymbol{w}}\in\mathcal{F},\ |L_{0}(\boldsymbol{w})-L_{1}(\boldsymbol{w})|\leq\gamma\} \subseteq \{\boldsymbol{w}\,|\,f_{\boldsymbol{w}}\in\mathcal{F},\ |\hat{L}_{0}(\boldsymbol{w})-\hat{L}_{1}(\boldsymbol{w})|\leq\hat{\gamma}\} \tag{28}
\]

Now consider the following decomposition:

\[
L(\hat{\boldsymbol{w}})-L(\boldsymbol{w}^{*})=L(\hat{\boldsymbol{w}})-\hat{L}(\hat{\boldsymbol{w}})+\hat{L}(\hat{\boldsymbol{w}})-\hat{L}(\boldsymbol{w}^{*})+\hat{L}(\boldsymbol{w}^{*})-L(\boldsymbol{w}^{*}) \tag{29}
\]

By (28), $\hat{L}(\hat{\boldsymbol{w}})-\hat{L}(\boldsymbol{w}^{*})\leq 0$ with probability at least $1-2\delta$. Thus, with probability at least $1-2\delta$, we have:

\[
L(\hat{\boldsymbol{w}})-L(\boldsymbol{w}^{*})\leq L(\hat{\boldsymbol{w}})-\hat{L}(\hat{\boldsymbol{w}})+\hat{L}(\boldsymbol{w}^{*})-L(\boldsymbol{w}^{*}). \tag{30}
\]

Therefore, under Assumption 4.4, we conclude that with probability at least $1-6\delta$, $L(\hat{\boldsymbol{w}})-L(\boldsymbol{w}^{*})\leq 2B(\delta,n,\mathcal{F})$. In addition, by (27), with probability at least $1-2\delta$, we have:

\[
\begin{aligned}
|L_{0}(\hat{\boldsymbol{w}})-L_{1}(\hat{\boldsymbol{w}})| &\leq B(\delta,n_{0},\mathcal{F})+B(\delta,n_{1},\mathcal{F})+|\hat{L}_{0}(\hat{\boldsymbol{w}})-\hat{L}_{1}(\hat{\boldsymbol{w}})| \\
&\leq \hat{\gamma}+B(\delta,n_{0},\mathcal{F})+B(\delta,n_{1},\mathcal{F}) \\
&= \gamma+2B(\delta,n_{0},\mathcal{F})+2B(\delta,n_{1},\mathcal{F})
\end{aligned}
\]

Proof [Theorem A.1] Let $\tilde{\boldsymbol{w}}$ be a feasible point of optimization problem (12); then $\tilde{\boldsymbol{w}}$ is also a feasible point of (1).

We proceed by contradiction and consider three cases:

  • If $\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}>\gamma$ and $\max\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}>2\gamma$. In this case,

    \[
    L(\boldsymbol{w}^{*})>\gamma\geq L(\tilde{\boldsymbol{w}}).
    \]

    This is a contradiction: it implies that $\boldsymbol{w}^{*}$ is not an optimal solution to (1), since $\tilde{\boldsymbol{w}}$ is a better solution for (1).

  • If $\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}>\gamma$ and $\max\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}\leq 2\gamma$. This case is similar to the above: $\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}>\gamma$ implies that $L(\boldsymbol{w}^{*})>\gamma\geq L(\tilde{\boldsymbol{w}})$, which is a contradiction because it implies that $\boldsymbol{w}^{*}$ is not an optimal solution to (1).

  • If $\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}\leq\gamma$ and $\max\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}>2\gamma$. We have:

    \[
    \max\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}-\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}>\gamma,
    \]

    which shows that $\boldsymbol{w}^{*}$ is not a feasible point of (1). This is a contradiction.

Therefore, $\max\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}\leq 2\gamma$ and $\min\{L_{0}(\boldsymbol{w}^{*}),L_{1}(\boldsymbol{w}^{*})\}\leq\gamma$.