
Towards Fair Machine Learning Software: Understanding and Addressing Model Bias Through Counterfactual Thinking

Zichong Wang (Florida International University, Miami, Florida, USA), Yang Zhou (Singapore Management University, Singapore), Israat Haque (Dalhousie University, Halifax, Nova Scotia, Canada), David Lo (Singapore Management University, Singapore), and Wenbin Zhang (Florida International University, Miami, Florida, USA)
Abstract.


The increasing use of Machine Learning (ML) software can lead to unfair and unethical decisions; thus, fairness bugs in software are becoming a growing concern. Addressing these fairness bugs often involves sacrificing ML performance, such as accuracy. To address this issue, we present a novel approach that uses counterfactual thinking to tackle the root causes of bias in ML software. In addition, our approach combines models optimized for both performance and fairness, resulting in an optimal solution in both aspects. We conducted a thorough evaluation of our approach on 10 benchmark tasks using a combination of 5 performance metrics, 3 fairness metrics, and 15 measurement scenarios, all applied to 8 real-world datasets. These extensive evaluations show that the proposed method significantly improves the fairness of ML software while maintaining competitive performance, outperforming state-of-the-art solutions in 84.6% of overall cases based on a recent benchmarking tool.

Counterfactual Fairness, Counterfactual Generation, Group Fairness

1. Introduction

Machine learning (ML) systems have garnered widespread recognition due to their demonstrated ability to tackle a wide range of critical tasks, including human resource management (Chan and Wang, 2018), healthcare (Rasmy et al., 2021), and sentiment analysis (Hoang et al., 2019). The success of ML systems is largely due to the availability of large-scale datasets; a report by the World Bank, for instance, indicates that credit bureaus have been using machine learning for credit scoring and fraud detection. Furthermore, ML systems are also employed in hiring, including resume screening and candidate assessment. However, despite these positive aspects, recent studies have revealed a significant drawback: ML systems can be biased. Specifically, ML systems have been shown to exhibit biases along dimensions such as gender and race. For instance, state-of-the-art sentiment analysis tools often predict texts containing female names as negative (Asyrofi et al., 2022). Additionally, a deployed ML system used to evaluate a defendant's risk of future crime was found to be biased against black defendants (i.e., producing higher risk scores) (Angwin et al., 2016a).

The prevalence of bias in ML systems has drawn significant attention from researchers in both the software engineering and machine learning communities, as it has the potential to cause harm to society. To address this issue, various efforts have been made to uncover and mitigate bias in ML systems. For example, Sun et al. (Sun et al., 2020) have made progress in uncovering and mitigating bias in commercial machine translation systems. Ribeiro et al. (Ribeiro et al., 2020) have proposed an evaluation framework for language models that goes beyond simply assessing accuracy and also considers the fairness of the models. Recently, Chen et al. (Chen et al., 2022b) introduced MAAT, an ensemble approach that strikes a balance between fairness and performance in ML software. While MAAT achieves state-of-the-art performance, it has limitations. In particular, it does not account for the degree of bias of each example and conducts under-sampling by randomly removing examples. However, different training examples contribute different degrees of bias to the final model. We argue that fairness-oriented model building should be achieved through sampling according to data bias rather than randomly.

To mitigate the limitations of existing approaches, we introduce Counterfactual Fair Software Approach (CFSA), a novel framework that aims to improve the fairness and performance of ML software by identifying and removing the bias encoded in the training data. Unlike existing methods that treat all training examples in a subgroup equally, our framework adopts the concept of counterfactual fairness of a training example to quantify its impact on the fairness of the model. Based on this concept, we construct a ranked counterfactual bias list, where examples with a higher rank are more likely to introduce bias into the model. Intuitively, these examples should be removed from the training data with higher priority. Our framework also includes a fair data synthesis approach to maintain class balance when generating new training examples, avoiding exacerbating data imbalance and ensuring that the synthesized dataset closely resembles the original dataset.

We evaluate our approach and the baseline methods on 8 benchmark datasets that cover multiple types of bias, e.g., race, gender, and age bias. We compare with baselines from both the machine learning and software engineering communities. Reweighing (REW) (Kamiran and Calders, 2012), Adversarial Debiasing (ADV) (Zhang et al., 2018a), and Reject Option Classification (ROC) (Kamiran et al., 2012) are from machine learning venues and have been integrated into the AIF360 toolkit maintained by IBM (Bellamy et al., 2019). Fair-SMOTE (Chakraborty et al., 2021a) and MAAT (Chen et al., 2022b) are from software engineering venues. The investigated methods also cover bias mitigation at different stages, e.g., pre-processing techniques that remove biased examples before training the models.

Our experiment results show that CFSA can effectively identify biased data in a large set of training data. CFSA is also model-agnostic: its effectiveness can be observed on the multiple types of models and datasets we investigate. We run CFSA 1,500 times under different configurations and find that it outperforms the state-of-the-art in terms of reducing bias in 80% of cases. Statistical tests also show that CFSA provides statistically significantly better fairness improvement than the baselines. We also explore the impact of the ensemble strategy on the performance of CFSA. After removing biased examples, we train a model on the new dataset and obtain a fair model; we also train a model on the original dataset, called the performance model. We find that the setting used in MAAT (Chen et al., 2022b) to ensemble the outputs of the two models is not optimal: giving larger weights to the outputs of the performance model can further balance the trade-off between accuracy and fairness. Our additional analysis also demonstrates that CFSA is capable of reducing bias for multiple sensitive attributes simultaneously.

We make the following contributions in the paper:

  • We propose CFSA, a novel bias mitigation approach that can improve the fairness and performance of ML software by identifying and removing bias encoded in the training data.

  • CFSA investigates the impact of each training example on the fairness of the model and constructs a ranked list where biased examples have higher ranks. We empirically show that removing examples from the ranked list can better mitigate the bias in the model than the state-of-the-art methods.

  • We design a fair data synthesis approach that can maintain the class balance when generating new training examples. Experiments show that including the new examples generated by our approach can further improve the fairness and performance of the model.

The paper is organized as follows. Section 2 explains the preliminaries and background of this paper. We explain the details of our methodology in Section 3 and present the experiment settings in Section 4. Section 5 describes our answers to various research questions. Additionally, related work is discussed in Section 6, and the paper concludes in Section 7.

2. Preliminaries and Background

This section begins by presenting the necessary background information and notation. Next, biases in real-world software development are discussed, showing the root causes of model bias.

2.1. Terminology

Before we explore the causes of model bias, we introduce the notation and concepts related to bias. First, we review the definitions of class label, feature, and sensitive attribute.

  • Class label: A class label indicates the class or category to which an instance belongs. For example, in a classification problem where the goal is to predict whether an applicant is qualified for a job, “hire” and “do not hire” are the class labels. In addition, “hire” is called the favored class label, as it gives an advantage to the applicant, while “do not hire” is the deprived class label.

  • Feature: A feature is a measurable property or characteristic of the phenomenon being observed. It is a specific aspect of the data used as input to a machine learning model to make predictions.

  • Sensitive attribute: A sensitive attribute is a special feature that divides the instances into favored and deprived groups. Examples of sensitive attributes include race, gender, and age.

We will follow up by introducing the concept of machine learning fairness, which is generally divided into group fairness and individual fairness:

  • Group fairness: first identifies the sensitive attribute(s) that define(s) a potential source of bias and then asks for parity of some statistic of the classifier, e.g., prediction accuracy, across the favored and deprived groups.

  • Individual fairness: requires similarly situated individuals to receive similar probability distributions over class labels to prevent inequitable treatment.

In this study, we further consider a particular type of bias embodied in the real world: an individual's sensitive attribute largely or completely determines their class label, reflecting profound bias towards the deprived group. To this end, we assume each individual has a corresponding counterfactual individual that differs only in the sensitive attribute in feature space and should receive a similar class label prediction.

With the key concepts outlined, we now introduce the corresponding notation. Let D=\{d_{1},d_{2},\dots,d_{n}\} be a dataset, where each d_{i}=\{A,S,Y\} is described by a set of attributes A, with S denoting the sensitive attribute such as gender or race, and the class label Y, with \hat{Y} as the predicted class label. We also define D^{\prime}=\{d_{1}^{\prime},d_{2}^{\prime},\dots,d_{n}^{\prime}\} as the counterfactual dataset, where d_{i}^{\prime} only differs in S in comparison to d_{i}. Mathematically, d_{i}=\{Y|S=s,A\} and d_{i}^{\prime}=\{Y|S=\overline{s},A\}, in which s\in S is a specific value, referred to as the sensitive value, e.g., female/black, that defines the deprived group (with \overline{s} = male/non-black defining the favored group). F is a classifier that outputs a predicted probability F(d_{i})=P(\hat{Y}|S,A) for each input sample d_{i}.
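To make the notation concrete, the following sketch constructs the counterfactual dataset D' from a pandas DataFrame by flipping a binary-encoded sensitive attribute. This is our illustration, not the released implementation; the column name "sex" and the 0/1 encoding are assumptions.

```python
import pandas as pd

def make_counterfactual(df: pd.DataFrame, sensitive_col: str) -> pd.DataFrame:
    """Return D': identical to D except that the binary sensitive attribute S
    is flipped for every instance (assumes S is encoded as 0/1)."""
    df_cf = df.copy()
    df_cf[sensitive_col] = 1 - df_cf[sensitive_col]
    return df_cf

# Hypothetical usage on an Adult-like dataset with a 0/1 "sex" column:
# adult_cf = make_counterfactual(adult, "sex")
```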

2.2. Root of Model Bias

This section explores the root causes of model bias. Previous studies (Chakraborty et al., 2021b; Kamiran et al., 2018; Kamiran and Calders, 2012) have demonstrated that classification models constructed using the datasets we used are prone to showing bias. One particular cause is that the collected data is often a reflection of the systems and processes that were in place at the time the data was generated. These systems and processes can be influenced by a wide range of factors, including social, cultural, and economic biases. For example, the classification models built from these datasets may be biased if the human labelers had conscious or unconscious biases against certain social groups. This could lead to unfair labeling of the data, which could in turn lead to biased classification models.

Another root cause is selection bias. For example, a road evaluation program may mistakenly conclude that roads in high-income neighborhoods are worse than those in low-income neighborhoods due to selection bias in the data. This was seen in a real-world case, “Street Bump”, a project by the city of Boston to crowdsource data on potholes (Barocas et al., 2017). The app received more requests from high-income communities due to higher smartphone penetration and user proficiency in those areas, leading to a model that incorporates political and economic biases.

In the next sections, we delve deeper into the concepts discussed by examining data imbalance and biased data labeling in more detail.

2.2.1. Data Imbalance Bias

Real-world datasets often have unbalanced distributions, as shown by the example in Figure 1, the Adult dataset used to predict annual income. In this dataset, high-income individuals make up only 23.93% of the total, with low-income individuals making up 76.07%. Additionally, men make up the majority of the high-income category (20.3%) compared to women (3.62%), and men also make up the majority of the low-income category (46.54%) compared to women (29.53%). This imbalance can affect machine learning models, leading to a bias towards assigning favorable labels to males and unfavorable labels to females. The remaining datasets exhibit the same issue.

Refer to caption
Figure 1. All datasets exhibit an imbalanced distribution concerning the sensitive attribute and class label.

2.2.2. Labeling bias

In addition to distribution bias, unfair labeling can also lead to bias in machine learning models. Previous research has shown that incorrect labeling can result in individuals with similar characteristics being treated differently. For example, if two samples, one from the favored and one from the deprived group, have similar features except for the sensitive attribute yet have distinct class labels, this can be because the sample from the deprived group was labeled in a biased way. This can further increase bias towards the deprived group, particularly if those deprived samples are selected for balancing as part of the sampling strategy. To address this issue, we use counterfactual thinking (c.f., Section 3.2.1) to identify samples with biased labels for bias mitigation.

3. Methodology

This section proposes solutions to address the aforementioned root causes of model bias toward fair machine learning software development.

3.1. Overview

CFSA is a framework that aims to improve the fairness and performance of ML software by addressing bias at its origin, i.e., the bias encoded in the data, while jointly optimizing for performance via ensemble learning. CFSA identifies biased data encoding by evaluating the counterfactual fairness of each sample and creates a Counterfactual Bias List (c.f., Section 3.2.1) to balance biased representation (c.f., Section 3.2.2) and to correct labeling bias (c.f., Sections 3.2.3 and 3.2.4) for fairness improvement. Additionally, it uses ensemble theory to combine models with different optimization objectives to achieve the best results when making predictions (c.f., Sections 3.3 and 3.4). In short, CFSA is a tool that balances the fairness and performance of ML software.

Refer to caption
Figure 2. The overall framework of CFSA: (a) Biased dataset; (b) Counterfactual fairness test; (c) Debiased dataset; (d) Fairness-oriented training; (e) Performance-driven training; (f) Ensemble prediction.

3.2. Counterfactually Debiasing Biased Dataset

As previously discussed in section 2.2, bias in the model stems from the dataset. By addressing bias in the dataset, the model and its predictions can also be fairer. Improving the fairness of the model is thus equivalent to improving the fair representation of the dataset, which can be achieved by balancing the representation of different groups and correcting biased labels.

3.2.1. Counterfactual Bias List

Previous works (Jiang and Nachum, 2019; Das et al., 2021; Chakraborty et al., 2021b) have shown that biased labeling can be a major root cause of discrimination. To address this inherent bias, we propose to counterfactually test whether the decision for a given individual would flip if the value of this individual's sensitive attribute changed. As an illustrating example, consider the sensitive attribute S to be gender, with the value s being female (\overline{s} being male) representing the deprived community, and the label Y being binary with values “rejected” or “granted”. By combining these two binary features, the dataset D can be divided into four subgroups:

  • Deprived Rejected (DR): Females being rejected a benefit.

  • Deprived Granted (DG): Females being granted a benefit.

  • Favored Rejected (FR): Males being rejected a benefit.

  • Favored Granted (FG): Males being granted a benefit.

Our goal is to test whether changing the sensitive attribute S while keeping the insensitive attributes A constant would flip the prediction for this individual, revealing inherent bias toward the deprived community. To this end, we define the Counterfactual Flip Test (CFTest) of each instance mathematically as:

(1) CFTest=|F(d_{i})>0.5|\oplus|F(d_{i}^{\prime})>0.5|

where F(d_{i}) and F(d_{i}^{\prime}) are the classifier F's predictions, i.e., predicted class probabilities, on the selected instance and its corresponding counterfactual instance. In addition, \oplus stands for exclusive disjunction, which is 1 when the two sides differ and 0 when they are equal. By Equation 1, CFTest equals 1 when the class label flips, showing inherent bias, and 0 otherwise. In practice, a classifier is learned using all other instances as the training set and then used to predict the class labels of the selected instance (i.e., d_{i}) and its corresponding counterfactual instance (i.e., d_{i}^{\prime}), yielding the CFTest value for each instance.
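A minimal sketch of Equation 1 is given below, assuming a fitted scikit-learn-style binary classifier whose predict_proba places the positive class in column 1; for brevity it trains a single classifier rather than repeating the leave-one-out procedure described above.

```python
import numpy as np

def cf_test(clf, X, X_cf):
    """Equation 1: CFTest_i = 1[F(d_i) > 0.5] XOR 1[F(d_i') > 0.5]."""
    p = clf.predict_proba(X)[:, 1]        # F(d_i):  P(Y_hat = 1 | S, A)
    p_cf = clf.predict_proba(X_cf)[:, 1]  # F(d_i'): prediction on the counterfactual
    return ((p > 0.5) != (p_cf > 0.5)).astype(int)   # 1 where the predicted label flips
```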

The CFTest is strictly defined from the class label's perspective and is loosely constrained in terms of predicted probability. Now, consider the following two cases from the predicted probability's perspective:

  • Case 1: F(d_{i}) = 0.9 and F(d_{i}^{\prime}) = 0.6.

  • Case 2: F(d_{i}) = 0.55 and F(d_{i}^{\prime}) = 0.45.

As we can see, although Case 1 shows no bias according to CFTest, its predicted probabilities differ more substantially than in Case 2, which leads to label flipping but a smaller difference in predicted probabilities. Therefore, relying on label flipping alone is insensitive to bias manifested in the prediction probabilities. To account for this, we further define the Counterfactual Deviation Test (CDTest) to measure the prediction probability deviation between an instance and its counterfactual instance as below:

(2) CDTest=|F(d_{i})-F(d_{i}^{\prime})|

We now integrate CFTest and CDTest into a joint objective called Counterfactual Bias Test (CBTest):

(3) CBTest(d_{i})=\begin{cases}CFTest+CDTest&\text{if }CFTest\neq 0\\ CDTest&\text{otherwise}\end{cases}

Clearly, CBTest jointly considers the bias rooted in the form of class label flipping as well as the prediction distribution. For individuals showing no class label bias towards their counterfactual representation (i.e., CFTest = 0), their CBTest reduces to CDTest. Based on the CBTest values, a ranked list can be created showing the level of bias of each instance. We name this list the Counterfactual Bias List (CBList), which forms the basis for our following debiasing techniques.
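The sketch below combines Equations 1-3 and ranks instances into a CBList; it reuses the assumptions of the previous sketch and is our illustration of the procedure, not the authors' implementation.

```python
import numpy as np

def cb_test(clf, X, X_cf):
    """Equations 1-3: combine label flipping (CFTest) with the probability
    deviation (CDTest) into a single per-instance bias score (CBTest)."""
    p = clf.predict_proba(X)[:, 1]
    p_cf = clf.predict_proba(X_cf)[:, 1]
    cftest = ((p > 0.5) != (p_cf > 0.5)).astype(float)     # Eq. 1
    cdtest = np.abs(p - p_cf)                              # Eq. 2
    return np.where(cftest != 0, cftest + cdtest, cdtest)  # Eq. 3

# CBList: training indices ranked from most to least biased.
# cb_list = np.argsort(-cb_test(clf, X_train, X_train_cf))
```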

3.2.2. Balancing Biased Representation

We now proceed to address imbalance-based data bias by utilizing the CBList, which serves as the basis for identifying individuals that are prone to bias when rebalancing the data distribution. Before going further, we first discuss our guiding principle for addressing bias caused by unbalanced data distributions: the “We are all equal” (WAE) worldview (Friedler et al., 2021). Specifically, WAE calls for an equal probability of being granted for both the deprived and favored communities, mathematically represented as:

(4) \frac{FG}{FG+FR}=\frac{DG}{DG+DR}

In real-world data, however, the distribution is often highly imbalanced, so WAE does not hold. As discussed in section 2.2.1, the probability of being granted for the favored community is significantly greater than that of the deprived community, i.e., the left side of Equation 4 is larger than the right side, as a result of the significant relative overrepresentation of FG and DR. To this end, undersampling is performed on FG and DR to align with the WAE view:

(5) \frac{FG-FG_{Remove}}{FG-FG_{Remove}+FR}=\frac{DG}{DG+DR-DR_{Remove}}

where FG_{Remove} and DR_{Remove} are the number of samples to be removed from FG and DR, respectively.

There are various combinations of FG_{Remove} and DR_{Remove} that achieve this goal. To mitigate bias while preserving as much of the original data information as possible, the combination that maintains the original relative representation between the favored and deprived groups is desired:

(6) \frac{FG+FR}{DG+DR}=\frac{FG_{Remove}}{DR_{Remove}}

With the desired FG_{Remove} and DR_{Remove} determined from Equations 5 and 6, we utilize CBList to determine exactly which individuals to remove. Specifically, the top-ranked FG_{Remove} and DR_{Remove} individuals in CBList from FG and DR are undersampled, respectively, to meet the WAE criteria (a further bias correction operation on the undersampled DR is discussed in the following Section 3.2.3). This debiasing-guided sampling, as opposed to the random sampling used in previous studies (Chen et al., 2022b; Chakraborty et al., 2021a), yields a balancing procedure that effectively mitigates bias, given the bias identification power of CBList.
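One way to obtain the two removal counts (the paper does not prescribe a particular solver; this closed form is one option) is to substitute Equation 6, i.e., FG_{Remove} = r \cdot DR_{Remove} with r=(FG+FR)/(DG+DR), into Equation 5, which yields the quadratic r y^{2}-(FG+r\,DR)y+(FG\cdot DR-DG\cdot FR)=0 in y=DR_{Remove}; the smaller root is the feasible solution. A minimal sketch:

```python
import math

def removal_counts(FG, FR, DG, DR):
    """Solve Equations 5 and 6 for (FG_Remove, DR_Remove).
    With r = (FG+FR)/(DG+DR) and FG_Remove = r * DR_Remove (Eq. 6),
    Eq. 5 reduces to r*y^2 - (FG + r*DR)*y + (FG*DR - DG*FR) = 0,
    where y = DR_Remove; the smaller root is the feasible one."""
    r = (FG + FR) / (DG + DR)
    a, b, c = r, -(FG + r * DR), FG * DR - DG * FR
    y = (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)  # DR_Remove
    return round(r * y), round(y)                      # (FG_Remove, DR_Remove)

# fg_rm, dr_rm = removal_counts(FG, FR, DG, DR)  # subgroup sizes of the training split
# Then undersample the fg_rm / dr_rm top-ranked FG / DR instances in CBList.
```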

3.2.3. Correcting Labeling Bias

With a balanced representation, we further address labeling bias, also on the basis of the proposed CBList. Since labeling bias in FG and DR has been addressed in the preceding balancing procedure by selecting individuals with labeling bias for undersampling, we focus on FR and DG at this stage. Specifically, all individuals from FR and DG that do not meet the criteria of counterfactual fairness are removed, i.e., individuals with CBTest values greater than 1, which indicates that the classifier produces different predictions in the real and counterfactual worlds.

In addition, to ensure that WAE still holds, a corresponding number of removed instances is added back through either synthesis or class label flipping. First, regarding DG, recall the bias correction for the undersampled individuals in DR from Section 3.2.2. Their discriminatory treatment due to their sensitive attribute is a major manifestation of bias in the real world; in CBList, this manifestation corresponds to a class label that flips solely because of the individual's sensitive attribute, i.e., CFTest. To correct these biased labels, we flip such DR individuals' class labels and re-include them as DG in the dataset, mitigating bias while preserving as much original data as possible. If the number of removed individuals is greater than the number of flipped individuals, a corresponding number of instances is synthesized (c.f., Section 3.2.4 for synthesizing details) and included in DG. Second, for FR, synthesis is applied directly without class label flipping, as FR is typically not the major manifestation of labeling bias. The data is now prepared for building fairness-oriented models, having undergone representation balancing and biased label correction.

3.2.4. Fair Synthesis

Our synthesis algorithm addresses the increased intra-class imbalance introduced by traditional oversampling techniques such as ROS (random oversampling), SMOTE, and KMeans-SMOTE (Last et al., [n. d.]). It is designed to maintain class balance while generating new samples, thus avoiding exacerbating the intra-class imbalance. This is achieved through a three-step process of clustering, filtering, and synthesizing: i) the algorithm first groups the data based on sensitive attribute and class label, and each subgroup is then clustered using k-nearest neighbors (Peterson, 2009), so that individuals with similar characteristics are grouped together within subgroups. ii) To avoid blurring the clustering boundaries, the algorithm filters each cluster by removing the 20 percent of points farthest from the cluster center. iii) Finally, the algorithm generates simulated data proportionally in the different classes to ensure that the distribution of the synthesized dataset is consistent with that of the original dataset. This avoids exacerbating the intra-class imbalance while generating new samples, making the synthesized dataset more representative of the original data.
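A simplified sketch of the three steps for a single subgroup (fixed sensitive attribute value and class label) is shown below. It is our illustration under stated assumptions, not the released implementation: KMeans stands in for the clustering step, new points are generated by SMOTE-style interpolation within a cluster, and the parameter values (n_clusters=5, keep=0.8) are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def fair_synthesize(X_sub, n_new, n_clusters=5, keep=0.8, rng=None):
    """Cluster one subgroup, filter the 20% of points farthest from their
    cluster center, then interpolate new samples proportionally to the
    (filtered) cluster sizes."""
    rng = rng or np.random.default_rng(0)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_sub)
    synthetic = []
    for c in range(n_clusters):
        pts = X_sub[km.labels_ == c]
        if len(pts) < 2:
            continue
        # Step ii) keep the 80% of points closest to the cluster center.
        dist = np.linalg.norm(pts - km.cluster_centers_[c], axis=1)
        pts = pts[np.argsort(dist)[: max(2, int(keep * len(pts)))]]
        # Step iii) number of new points roughly proportional to cluster size.
        k = int(round(n_new * len(pts) / len(X_sub)))
        for _ in range(k):
            a, b = pts[rng.integers(len(pts))], pts[rng.integers(len(pts))]
            synthetic.append(a + rng.random() * (b - a))  # SMOTE-style interpolation
    return np.array(synthetic)
```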

3.3. Accuracy-Driven Training

In CFSA, the objective of the performance model is to maximize its performance. To achieve this, various machine learning (ML) algorithms are used to train the models on the original training data DD. The model with the highest accuracy is chosen as the performance model. In the experimental analysis, the effect of using different performance models on the final results will be explored. This approach allows us to determine the optimal performance model for the given dataset and task.

3.4. Ensemble Training

With the fairness and performance models trained on the debiased and original datasets, respectively, CFSA combines their outputs to make the final prediction. This involves ensembling the prediction probability vectors generated by each model using the formula in Equation 7:

(7) \hat{Y}=\sum_{i=1}^{n}W_{i}\times P_{i}

where W_{i} is the weight of each model and P_{i} is its prediction probability vector. Consider a binary classification task with the average ensemble as an example. The combination module first takes the prediction probability vectors from the two models as inputs, e.g., P_{f} and P_{a} for the fairness-oriented and the accuracy-driven model, respectively. Next, the combination module uses an averaged weighting strategy, i.e., the weighting vector W = [0.5, 0.5], and the final combined prediction probability vector is thus [0.5\times(P_{f}(\hat{Y}=0)+P_{a}(\hat{Y}=0)),\ 0.5\times(P_{f}(\hat{Y}=1)+P_{a}(\hat{Y}=1))]. When 0.5\times(P_{f}(\hat{Y}=0)+P_{a}(\hat{Y}=0)) > 0.5\times(P_{f}(\hat{Y}=1)+P_{a}(\hat{Y}=1)), the label is predicted as 0, and as 1 otherwise. In addition to the commonly used averaging strategy in ensemble learning, we also explore other combination strategies in Section 5 to investigate how different strategies affect our results, shedding light on finding the best combination strategy for a given dataset and task.
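A minimal sketch of Equation 7 for two models, assuming both expose scikit-learn-style predict_proba outputs:

```python
import numpy as np

def ensemble_predict(p_fair, p_perf, w=(0.5, 0.5)):
    """Equation 7: weighted combination of the two prediction probability
    vectors; p_fair / p_perf have shape (n_samples, n_classes)."""
    combined = w[0] * p_fair + w[1] * p_perf
    return combined.argmax(axis=1)   # predicted class label per sample

# y_hat = ensemble_predict(fair_model.predict_proba(X_test),
#                          perf_model.predict_proba(X_test))
# Non-uniform weightings are explored in Section 5 (RQ4).
```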

4. Experiment Settings

In this section, we describe our experimental setting and experimental datasets to answer the research questions in Section 5. We first describe the datasets used in our experiments and then present the baselines and metrics selected in our experiments.

4.1. Datasets

In contrast to typical fair machine learning studies that evaluate at most three datasets (Hort et al., 2021; Zhang and Ntoutsi, 2019; Zhang et al., 2021a; Zhang and Weiss, 2021), our method is comprehensively evaluated on eight real-world datasets with varying feature spaces and sensitive attributes (the details of the datasets can be found in Table 1), covering the following domains:

Financial domain: i) The Adult dataset (Fox and Carvalho, 2012) is used for a prediction task aimed at determining whether a person earns an annual income of over $50K based on their demographic and financial information. ii) The Bank dataset (Moro et al., 2014) is employed for predicting whether a client of a bank will opt for a term deposit, based on their demographic, financial, and social information. iii) The German dataset (Dheeru and Taniskidou, 2017) is a financial dataset of bank account holders, commonly used for predicting creditworthiness to assess credit risk. iv) The Default dataset (Yeh and Lien, 2009) studies the default payments of customers in Taiwan, with the aim of predicting the probability of default in the next month.

Criminological domain: v) The COMPAS dataset (Angwin et al., 2016b) is a well-known dataset in the field of algorithmic bias, used for predicting the likelihood of criminal recidivism in defendants.

Social domain: vi) The Dutch dataset (Van der Laan, 2000) compiles information on individuals in the Netherlands for the year 2001 and is utilized for predicting a person’s occupation. vii) The Heart dataset (Tarawneh and Embarak, 2019), dating back to 1988, gathers medical information on patients and is employed for predicting the presence or absence of heart disease.

Educational domain: viii) The Student dataset (Wightman, 1998) contains law school admission records and is used to predict if a candidate will pass the bar exam and their first-year average grade.

Table 1. Summary of datasets used in experiments.
Dataset Sample# Features# Sensitive Attribute
Adult 45,222 12 Race,Gender
Bank 45,211 17 Age
COMPAS 6,172 12 Race,Gender
German 1,000 21 Gender
Default 30,000 11 Gender
Dutch 60,420 12 Gender
Law 20,798 12 Gender
Heart 297 14 Age

4.2. Baselines

To evaluate the performance of our method, we compare it with five existing bias mitigation methods: Reweighing (REW) (Kamiran and Calders, 2012), Adversarial Debiasing (ADV) (Zhang et al., 2018a), Reject Option Classification (ROC) (Kamiran et al., 2012), Fair-SMOTE (Chakraborty et al., 2021a), and MAAT (Chen et al., 2022b). The first three baselines are recent advanced methods proposed by the ML community and are integrated into the IBM AIF360 toolkit (Bellamy et al., 2019). The remaining two baselines are state-of-the-art approaches recently proposed in Software Engineering venues. These baselines cover a wide range of debiasing strategies, including pre-processing, in-processing, post-processing, and ensemble. Next, we briefly describe each approach.

  • REW (Kamiran and Calders, 2012), Fair-SMOTE (Chakraborty et al., 2021a), and MAAT (Chen et al., 2022b) are pre-processing methods used to address bias in machine learning algorithms: i) REW calculates the weight of each group based on the label and protected attributes. ii) Fair-SMOTE uses a combination of data clustering and oversampling to equalize the number of training examples across subgroups. iii) MAAT divides the dataset into four groups and adjusts the sample size of each group to ensure equal favorable rates for the favored and deprived groups; fair and accurate models are then trained and combined through an ensemble method.

  • ADV (Zhang et al., 2018a) is an in-processing method for addressing bias in machine learning algorithms. It uses adversarial techniques to reduce the impact of sensitive attributes on the model’s predictions while maximizing overall performance.

  • ROC (Kamiran et al., 2012) is a post-processing method that reduces bias in machine learning algorithms. It focuses on predictions with high uncertainty and reassigns them to reduce bias. Specifically, it aims to allocate favorable outcomes to deprived groups and unfavorable outcomes to favored groups.

4.3. Evaluation Metrics

The evaluation involves three fairness metrics and five ML performance metrics. We will first present the fairness measures, then the ML performance metrics, and finally describe how to quantify the trade-off between fairness and performance.

4.3.1. Fairness Metrics

To measure ML software fairness, we employed three commonly used fairness metrics: Statistical Parity Difference (SPD), Average Odds Difference (AOD), and Equal Opportunity Difference (EOD). Our choice of these metrics is based on their widespread adoption in the field, as demonstrated in the literature (Spinellis, 2021).

  • SPD: quantifies the disparity between the probability of the favored and deprived group receiving a benefit:

    (8) SPD=P\left[\hat{Y}=1|S=0\right]-P\left[\hat{Y}=1|S=1\right]
  • AOD: measures the average of the False Positive Rate (FPR) difference and the True Positive Rate (TPR) difference between the favored and deprived groups:

    (9) AOD=\frac{1}{2}\left(\left|P\left[\hat{Y}=1|S=0,Y=0\right]-P\left[\hat{Y}=1|S=1,Y=0\right]\right|+\left|P\left[\hat{Y}=1|S=0,Y=1\right]-P\left[\hat{Y}=1|S=1,Y=1\right]\right|\right)
  • EOD: measures the True Positive Rate (TPR) difference between the favored and deprived groups:

    (10) EOD=P\left[\hat{Y}=1|S=0,Y=1\right]-P\left[\hat{Y}=1|S=1,Y=1\right]

We use the absolute value of all fairness metrics, with zero representing maximum fairness and higher values indicating greater bias.
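A compact sketch of Equations 8-10 for a binary sensitive attribute, following the S=0 / S=1 conditioning used above (which group is coded 0 or 1 is an assumption of the sketch, not fixed by the paper):

```python
import numpy as np

def fairness_metrics(y_true, y_pred, s):
    """Absolute SPD, AOD, and EOD (Equations 8-10) for binary labels and a
    binary sensitive attribute s."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    def rate(mask):                      # P[Y_hat = 1 | mask]
        return y_pred[mask].mean() if mask.any() else 0.0
    spd = rate(s == 0) - rate(s == 1)
    tpr_diff = rate((s == 0) & (y_true == 1)) - rate((s == 1) & (y_true == 1))
    fpr_diff = rate((s == 0) & (y_true == 0)) - rate((s == 1) & (y_true == 0))
    aod = 0.5 * (abs(fpr_diff) + abs(tpr_diff))
    return abs(spd), aod, abs(tpr_diff)  # |SPD|, AOD, |EOD|
```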

4.3.2. Performance Metrics

To measure ML software performance, we employed five performance metrics: accuracy, recall, Matthews correlation coefficient, precision, and F1-score. These metrics can be calculated from the confusion matrix of a binary classification, which comprises four elements: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

  • Accuracy: measures the fraction of correct predictions made by the model out of all the predictions.

    (11) Accuracy=\frac{TP+TN}{TP+FP+TN+FN}
  • Recall: measures the proportion of actual positive cases that the model correctly identified.

    (12) Recall=\frac{TP}{TP+FN}
  • Matthews Correlation Coefficient (MCC): measures the correlation between the observed and predicted classifications, taking all four confusion matrix elements into account.

    (13) MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
  • Precision: refers to the fraction of positive predictions that are actually correct.

    (14) Precision=\frac{TP}{TP+FP}
  • F1-Score: represents the harmonic mean of precision and recall.

    (15) F1\text{-}Score=\frac{2\times(Precision\times Recall)}{Precision+Recall}

For all of the performance metrics, the higher the value, the better the performance. The values of all metrics other than MCC, i.e., accuracy, recall, precision, and F1-score, range from 0 to 1. MCC ranges from -1 to 1, where 1 represents a perfect prediction, 0 a prediction no better than random, and -1 a prediction that completely contradicts the observations. Among the performance metrics, MCC is also the most sensitive to the overall quality of the model's predictions, as it takes both the true and false positive and negative rates into account, and it has been demonstrated to be suitable for dealing with class imbalance in various software engineering studies (Chicco and Jurman, 2020).
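For reference, all five metrics of Equations 11-15 are available in scikit-learn; a short sketch:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

def performance_metrics(y_true, y_pred):
    """The five performance metrics of Equations 11-15."""
    return {"accuracy":  accuracy_score(y_true, y_pred),
            "recall":    recall_score(y_true, y_pred),
            "mcc":       matthews_corrcoef(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "f1":        f1_score(y_true, y_pred)}
```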

4.3.3. Jointly evaluating Fairness and Performance of the Model

Based on the aforementioned fairness and performance metrics, the benchmarking tool Fairea (Hort et al., 2021) is employed to jointly evaluate the fairness and performance of the ML software models. The fundamental idea of Fairea is to gradually convert the original model into a random guessing model, which enhances its fairness. This is achieved by mutating predicted class labels such that predictive performance degrades equally in the favored and deprived groups. Effective bias mitigation techniques are therefore expected to surpass the fairness-performance trade-offs of the mutated models. Specifically, Fairea operates in the following manner:

Trade-off baseline: Fairea begins by using the original model to make predictions and then duplicates these predictions. Next, Fairea randomly selects predictions and mutates their class labels, i.e., changes them to the majority class of the data, at different mutation degrees, e.g., 10%, 20%, …, 100%. The mutated model is called a pseudo model. As the mutation degree increases, the accuracy of the model's predictions decreases, but the fairness of the predictions improves as they become more random and similar across subgroups. In particular, when the mutation degree reaches 100%, all instances receive the same prediction, resulting in the lowest accuracy but the highest fairness.
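The mutation step can be sketched as follows (our paraphrase of Fairea's procedure, not its released code); metric_perf and metric_fair stand for any of the performance and fairness metrics of Section 4.3.

```python
import numpy as np

def mutate_predictions(y_pred, degree, majority_label, rng=None):
    """Fairea-style pseudo model: replace a `degree` fraction of the original
    predictions with the majority class of the data."""
    rng = rng or np.random.default_rng(0)
    y_mut = np.asarray(y_pred).copy()
    idx = rng.choice(len(y_mut), size=int(degree * len(y_mut)), replace=False)
    y_mut[idx] = majority_label
    return y_mut

# Fairness-performance points of the pseudo models (one per mutation degree):
# points = [(metric_perf(y_test, mutate_predictions(y_pred, d, majority)),
#            metric_fair(y_test, mutate_predictions(y_pred, d, majority), s_test))
#           for d in np.arange(0.1, 1.01, 0.1)]
```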

Refer to caption
Figure 3. Fairea’s mitigation regions based on changes in performance and bias.

Lastly, Fairea constructs a trade-off baseline for one specific model, e.g., CFSA or a competing baseline, by connecting the fairness-performance points (measured by the aforementioned metrics) of the original model and the series of pseudo models (as shown in Figure 3).

Five effectiveness levels: The trade-off baseline categorizes bias mitigation methods into five trade-off effectiveness levels: i) “win-win”: a method falls into this level if it enhances both ML performance and fairness compared to the trade-off baseline; such a fairness-performance point is located in region 1. ii) A method achieves a “good” trade-off if it improves either ML performance or fairness compared to the trade-off baseline and is overall better than the trade-off baseline, thus situating in region 2. iii) If a method improves ML performance but causes fairness to drop, it falls into an “inverted” trade-off, located in region 3. iv) A method falls into a “poor” trade-off if either its ML performance or fairness declines compared to the trade-off baseline and it is overall worse than the trade-off baseline; the fairness-performance point is then located in region 4. v) A method is a “lose-lose” trade-off if it results in a decline in both ML performance and fairness compared to the original model; the fairness-performance point is then positioned in region 5.

5. Experimental Results

5.1. Experimental Setup

The experiment involves the use of 8 datasets, which are shuffled and divided into 70% training and 30% test data. Samples with missing values are removed, continuous features are transformed into categorical ones, non-numerical features are converted to numerical values, and all feature values are normalized to [0,1]. Three standard machine learning models, i.e., Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), are employed as the base models to build our proposed CFSA as well as the baselines, i.e., REW, ADV, ROC, Fair-SMOTE, and MAAT. Among them, the first three baselines are implemented based on IBM AIF360 (Bellamy et al., 2019), while the remaining two use the implementations released by their authors (Chakraborty et al., 2021a; Chen et al., 2022b).

To implement the benchmarking tool Fairea, we establish a trade-off baseline for each benchmark task, ML algorithm, and fairness-performance measurement. Specifically, we train the original model 50 times. For each trained original model, the mutation is repeated 50 times, each time with a different mutation degree. The trade-off baseline is then constructed using the averaged results from the multiple runs.

The experiments are implemented with Python 3.7 and executed on a 64-bit machine with a 10-core processor (i9, 3.3GHz), 64GB memory with GTX 1080Ti GPU. The experimental results are organized to answer the following five research questions (RQs).

5.2. Research Questions

RQ1: Can CBList effectively identify biased data samples?

This research question focuses on the effectiveness of our method in identifying biased samples, which is crucial because addressing bias in the dataset prior to training can significantly reduce the likelihood of biased decisions. To answer this question, we use the three aforementioned machine learning models as base models to construct the CBList. Based on it, instances prone to bias are identified and then removed for fairness-oriented training. The results, shown in Figure 4, indicate that all models trained on the original biased datasets (yellow line) have higher bias scores (only statistical parity difference is shown for ease of distinction) than the models trained on the CBList-debiased datasets (blue line), with the largest fairness improvement of 98.05% on the Law dataset. This suggests that CBList can effectively identify biased data instances for fairness-oriented model training. Additionally, there is no notable difference across the three base models built in conjunction with CBList, suggesting that CBList is model-agnostic for fair ML software tasks.

Ans. to RQ1: Yes, CFSA improves fairness by as much as 98.05%. To conclude, CBList is effective in identifying biased instances, and removing these instances effectively mitigates the biased data representation.

Refer to caption
Figure 4. The statistical parity differences with and without the biased samples identified by CBList removed.
RQ2: Can CFSA reduce bias?

To answer this question, we first evaluate the effectiveness of CFSA on 10 uni-attribute benchmark tasks. For each task, CFSA is applied with the same three ML models, i.e., LR, SVM, and RF, 50 times in Fairea. Hence, we have 10×3×50 = 1,500 cases in total. As can be seen in Figure 5, CFSA (green bars) beats the trade-off baseline constructed from the mutated CFSA in 81% of the cases.

In addition, the reduction in model bias is only meaningful if it does not cause a significant decrease in model performance, which is reflected by the cases where CFSA outperforms its trade-off baseline, as shown in rows 6, 12, and 18 of Table 3. As can be seen, CFSA wins in at least 78% of the cases, showing significant bias reduction while maintaining competitive performance.

Table 2. Proportions of scenarios where each method significantly improves fairness and decreases performance. CFSA significantly improves fairness in 98.4% of the scenarios, without decreasing ML performance too much.
REW ADV ROC Fair-SMOTE MAAT CFSA
Fairness (+) 57.1% 63.5% 90.5% 66.7% 77.8% 98.4%
Accuracy (-) 10.5% 13.4% 25.7% 17.1% 17.2% 22.3%
Refer to caption
Figure 5. Proportion of cases where CFSA beats the baseline in different ML algorithms.
Table 3. The proportion of mitigation cases that surpass the trade-off baseline in 15 fairness-performance evaluations from CFSA and existing methods (the darker cells show top rank and the lighter cells show the second rank).
Methods SPD-Accuracy SPD-Precision SPD-Recall SPD-F1Score SPD-MCC
REW 0.71 0.74 0.79 0.80 0.78
ADV 0.69 0.66 0.70 0.70 0.70
ROC 0.58 0.51 0.72 0.71 0.72
Fair-SMOTE 0.32 0.29 0.34 0.34 0.34
MAAT 0.71 0.70 0.71 0.73 0.72
CFSA 0.81 0.85 0.85 0.84 0.85
Methods AOD-Accuracy AOD-Precision AOD-Recall AOD-F1Score AOD-MCC
REW 0.70 0.75 0.79 0.80 0.79
ADV 0.40 0.38 0.40 0.40 0.40
ROC 0.38 0.38 0.50 0.49 0.50
Fair-SMOTE 0.46 0.39 0.47 0.47 0.47
MAAT 0.72 0.76 0.77 0.78 0.77
CFSA 0.80 0.87 0.84 0.85 0.87
Methods EOD-Accuracy EOD-Precision EOD-Recall EOD-F1Score EOD-MCC
REW 0.77 0.76 0.79 0.79 0.79
ADV 0.39 0.37 0.40 0.40 0.40
ROC 0.32 0.31 0.42 0.41 0.42
Fair-SMOTE 0.56 0.49 0.57 0.57 0.57
MAAT 0.75 0.84 0.85 0.87 0.86
CFSA 0.78 0.86 0.86 0.88 0.85

Ans. to RQ2: CFSA achieves a good or win-win trade-off in 85% of the cases, while a poor or lose-lose trade-off occurs in only 2%. In sum, CFSA can reduce bias without a significant decrease in model performance.

RQ3: How well does CFSA perform compared to the state-of-the-art bias mitigation algorithms?

To answer this question, CFSA is evaluated against the 5 state-of-the-art baselines, i.e., REW, ADV, ROC, Fair-SMOTE, and MAAT, on 10 benchmark tasks. As in the previous RQs, for each task, CFSA and the other baselines are constructed using the same 3 base models 50 times, and each individual run is treated as a distinct mitigation case. As a result, we have a total of 10×(1+5)×3×50 = 9,000 cases. To simplify the presentation, we use the percentage of mitigation scenarios that exceed the trade-off baseline established by Fairea (i.e., scenarios that result in either a good or win-win trade-off) as the measure of effectiveness.

Figure 6 shows the overall results. As we can see, CFSA achieves a good or win-win trade-off (i.e., beats the trade-off baseline constructed by Fairea) in most cases, i.e., 85% of the time. In comparison, the corresponding percentages for REW, ADV, ROC, Fair-SMOTE, and MAAT are 76%, 50%, 49%, 44%, and 77%, respectively. In addition, CFSA has significantly fewer lose-lose trade-off cases (2%) than the existing methods; for example, Fair-SMOTE has a lose-lose trade-off rate of 28%, 14 times higher than CFSA. Figure 7 displays the results in more detail by showing the percentage of times that CFSA and the other baselines beat their corresponding trade-off baselines on the 10 benchmark tasks. The results show that the performance of existing methods, such as REW and Fair-SMOTE, is inconsistent across decision-making tasks. For example, on the Compas-Sex task, REW and Fair-SMOTE outperform the baseline in 97.4% and 91.7% of cases, respectively, but on the Adult-Sex task, their success rates drop to 67.5% and 52.8%. In contrast, CFSA shows consistent performance, with a success rate of 91.1% on the Compas-Sex task and 99.9% on the Adult-Sex task, a difference of only 8.8%. This highlights the improved ability of CFSA to achieve a trade-off between fairness and performance compared to existing methods.

Additionally, for each combination of task, base model, and fairness-performance metric, we compare the percentage of cases surpassing the trade-off baseline for CFSA and the five baselines. The results are displayed in Table 3, where CFSA, across the 15 fairness-performance measurements, secures 14 first-place finishes and only 1 second-place finish (with a margin of only 1% to first place).

Ans. to RQ3: CFSA achieves the best trade-off, outperforming the other methods by at least 8% in good or win-win trade-offs. CFSA also incurs fewer poor or lose-lose trade-offs than the other methods. In summary, the superiority of CFSA over the state-of-the-art is maintained across all studied ML algorithms, uni-attribute benchmark tasks, and fairness-performance evaluations.

Refer to caption
Figure 6. CFSA and other methods’ effectiveness distribution in benchmark tasks; CFSA shows the best balance, with 85% of mitigation cases having good or win-win results.
Refer to caption
Figure 7. The comparison of trade-off baseline performance between CFSA and 5 baselines.
RQ4: How do various combination strategies impact the performance of our method?

To answer this question, we evaluate 11 different weighting strategies ranging from 0 to 1 with step size 0.1 (i.e., W = [0, 1], [0.1, 0.9], \cdots, [0.9, 0.1], [1, 0]). When fairness is the sole focus, the weighting strategy W is set to [1, 0], while [0, 1] focuses purely on performance. The results for each weighting strategy across all CFSA tasks are presented in Figure 8. The effectiveness indicator is the percentage of scenarios that exceed the trade-off baseline constructed by Fairea.

Overall, the [0.6, 0.4] strategy shows the best results, with CFSA beating the trade-off baseline in 84.25% of cases. In real-world deployment, the requirements on fairness and performance may vary depending on task-specific goals. Software engineers can explore different strategies and evaluate their effectiveness to determine the most appropriate one for a given task.
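A sketch of the sweep over the 11 weighting strategies; evaluate is a caller-supplied (hypothetical) scoring function, e.g., the fraction of fairness-performance points beating the Fairea baseline.

```python
import numpy as np

def sweep_weights(p_fair, p_perf, evaluate):
    """Evaluate every weighting W = [w_fair, 1 - w_fair] with w_fair in 0.0..1.0."""
    results = {}
    for w_fair in np.round(np.arange(0.0, 1.01, 0.1), 1):
        combined = w_fair * p_fair + (1.0 - w_fair) * p_perf  # Equation 7
        results[float(w_fair)] = evaluate(combined.argmax(axis=1))
    return results
```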

Refer to caption
Figure 8. The effectiveness of combination strategies of CFSA, with the [0.6-0.4] strategy being the most effective.

Ans. to RQ4: The balance between fairness and performance can be adjusted through the combination strategy. In general, the [0.6, 0.4] weighting can serve as a starting point for exploring the best weighting strategy based on factors such as the number of features and the ML algorithm.

RQ5: Is CFSA effective in handling multiple sensitive attributes?

The first four RQs examine the bias reduction of CFSA for a single sensitive attribute, which is the current focus of the existing fairness literature (Zhang and Weiss, 2022; Biswas and Rajan, 2020, 2021; Chakraborty et al., 2020; Hort et al., 2021; Chakraborty et al., 2021b; Zhang and Weiss, 2021). This research question assesses CFSA's effectiveness in dealing with multiple sensitive attributes, a common fairness concern in the real world (Mehrabi et al., 2021).

We compare CFSA with MAAT and Fair-SMOTE, to the best of our knowledge the only two approaches capable of handling multiple sensitive attributes, on two multi-attribute tasks (i.e., Adult and Compas), still using LR, SVM, and RF as the base models. When training CFSA, one fairness-oriented model is built for each sensitive attribute together with one performance model, and the averaging strategy is then used for ensemble learning. With 3 base models, 2 datasets, 3 methods, and 50 repeated runs, we have a total of 3×2×3×50 = 900 mitigation cases. The results, shown in Figure 9, indicate that CFSA has a higher proportion of good or win-win trade-offs and fewer poor or lose-lose trade-offs than the other methods. For example, on the Adult task, CFSA outperforms the trade-off baseline for race in 80% of cases, compared to 33.3% for Fair-SMOTE and 66.3% for MAAT.

Refer to caption
Figure 9. Distribution of CFSA, MAAT, and Fair-SMOTE’s effectiveness in multi-attribute tasks. CFSA generally achieves better results (70.5%) compared to MAAT (55.7%) and Fair-SMOTE (48.7%) in terms of good or win-win trade-off in mitigation cases.

Ans. to RQ5: CFSA can decrease bias for multiple sensitive attributes. It outperforms state-of-the-art methods by beating the trade-off baseline in 70.5% of cases, outperforming MAAT and Fair-SMOTE by 14.75% and 21.75%, respectively.

6. Related Work

This section presents the two lines of work related to this study. ML systems are a subset of AI systems, so we first discuss work on AI testing. Then, we discuss work specific to testing and improving fairness in AI systems.

6.1. AI Testing

Although AI systems have demonstrated the potential to tackle important tasks, substantial effort is still required to ensure their quality in various aspects, including correctness, robustness, security, and privacy. Asyrofi et al. (Asyrofi et al., 2021a) leverage differential testing to synthesize speech inputs for testing the correctness of speech recognition systems and show that these inputs can be used to improve the performance of the systems under test (Asyrofi et al., 2021b). Robustness testing evaluates how AI systems behave when small perturbations are introduced to the inputs. Researchers have conducted robustness testing on various AI systems, e.g., computer vision (Gao et al., 2020; Pei et al., 2019), code models (Yang et al., 2022b; Zhang et al., 2020a), and reinforcement learning (Gong et al., 2022; Wu et al., 2021).

Researchers have built a wide range of tools to test various AI systems. Motivated by the use of code coverage metrics (e.g., line coverage, branch coverage) in testing conventional software systems, Pei et al. (Pei et al., 2019) propose DeepXplore, a tool that uses neuron coverage as guidance to generate test cases for deep neural networks. Subsequent researchers extended this work by proposing structural neuron coverage metrics, e.g., neuron boundary coverage. These metrics are the foundation of a series of AI testing tools, including DeepHunter (Xie et al., 2019), DeepGauge (Ma et al., 2018), DeepCT (Ma et al., 2019), and DeepTest (Tian et al., 2018). However, recent studies (Li et al., 2019; Harel-Canada et al., 2020; Dong et al., 2020; Yan et al., 2020; Trujillo et al., 2020; Yang et al., 2022a) also reveal that neuron coverage may not be effective enough to expose the vulnerabilities of AI systems. Researchers have therefore explored other metrics to test AI systems. Gao et al. (Gao et al., 2020) propose Sensei, a fuzz testing tool that uses genetic algorithms to synthesize inputs to test and improve the robustness of computer vision systems. Zhang et al. (Zhang et al., 2018b) utilize generative adversarial networks (GANs) to generate driving scenes with various weather conditions to test autonomous driving systems. We refer readers to (Zhang et al., 2022) for a comprehensive survey of work on AI testing.

6.2. AI Fairness Testing and Improvement

A recent survey by Chen et al. (Chen et al., 2022a) provides a comprehensive overview of work on fairness in AI software. Beyond the five baselines evaluated in our study, a growing number of works aim to improve the fairness of AI systems. Zhang et al. (Zhang et al., 2020b) propose a white-box testing technique that leverages adversarial sampling to generate test cases to uncover and repair fairness violations in DNN-based classifiers. Zheng et al. (Zheng et al., 2022) design NeuronFair to identify biased neurons and conduct interpretable white-box fairness testing. Zhang et al. (Zhang et al., 2021b) conduct gradient search to improve the efficiency of generating fairness test cases. Fan et al. (Fan et al., 2022) use a genetic algorithm to conduct explanation-guided fairness testing. Some works focus on natural language processing (NLP) systems. Asyrofi et al. (Asyrofi et al., 2022) propose BiasFinder, a tool that uses metamorphic testing to generate test cases to uncover fairness violations in sentiment analysis. BiasRV (Yang et al., 2021) builds on BiasFinder and can verify fairness violations at runtime. Sun et al. (Sun et al., 2020) uncover bias in machine translation systems. Soremekun et al. (Soremekun et al., 2022) use context-free grammars to synthesize inputs to test the fairness of NLP systems. Researchers have also put effort into improving the fairness of AI systems. Hort et al. (Hort et al., 2021) design Fairea, a model behaviour mutation approach to benchmark bias mitigation methods. Gao et al. (Gao et al., 2022) model the problem of balancing fairness and accuracy as an adversarial game and propose FairNeuron, which strategically selects neurons to improve AI fairness. Zhang and Sun (Zhang and Sun, 2022) adaptively improve model fairness based on causality analysis.

7. Conclusion and Future Work

In this paper, we present CFSA, a method that tackles the root causes of bias in ML software through counterfactual thinking. A thorough evaluation shows that CFSA significantly surpasses existing bias reduction techniques from both the ML and SE fields by improving fairness while maintaining competitive performance. The successful implementation of CFSA opens possibilities for further exploration of software fairness in software engineering. In the future, we aim to broaden the scope of this work to text mining and image processing. Additionally, we intend to improve our approach by incorporating new evaluation systems and utilizing industry datasets.

References

  • Angwin et al. (2016a) Julia Angwin, Jeff Larson, Lauren Kirchner, and Surya Mattu. 2016a. Machine bias. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
  • Angwin et al. (2016b) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016b. Machine bias. In Ethics of Data and Analytics. Auerbach Publications, 254–264.
  • Asyrofi et al. (2021a) Muhammad Hilmi Asyrofi, Zhou Yang, and David Lo. 2021a. CrossASR++: A Modular Differential Testing Framework for Automatic Speech Recognition. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 1575–1579. https://doi.org/10.1145/3468264.3473124
  • Asyrofi et al. (2021b) Muhammad Hilmi Asyrofi, Zhou Yang, Jieke Shi, Chu Wei Quan, and David Lo. 2021b. Can Differential Testing Improve Automatic Speech Recognition Systems?. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). 674–678. https://doi.org/10.1109/ICSME52107.2021.00079
  • Asyrofi et al. (2022) Muhammad Hilmi Asyrofi, Zhou Yang, Imam Nur Bani Yusuf, Hong Jin Kang, Ferdian Thung, and David Lo. 2022. BiasFinder: Metamorphic Test Generation to Uncover Bias for Sentiment Analysis Systems. IEEE Transactions on Software Engineering 48, 12 (2022), 5087–5101. https://doi.org/10.1109/TSE.2021.3136169
  • Barocas et al. (2017) Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2017. Fairness in machine learning. Nips tutorial 1 (2017), 2.
  • Bellamy et al. (2019) Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilović, et al. 2019. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development 63, 4/5 (2019), 4–1.
  • Biswas and Rajan (2020) Sumon Biswas and Hridesh Rajan. 2020. Do the machine learning models on a crowd sourced platform exhibit bias? an empirical study on model fairness. In Proceedings of the 28th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. 642–653.
  • Biswas and Rajan (2021) Sumon Biswas and Hridesh Rajan. 2021. Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 981–993.
  • Chakraborty et al. (2021a) Joymallya Chakraborty, Suvodeep Majumder, and Tim Menzies. 2021a. Bias in Machine Learning Software: Why? How? What to Do?. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 429–440. https://doi.org/10.1145/3468264.3468537
  • Chakraborty et al. (2021b) Joymallya Chakraborty, Suvodeep Majumder, and Tim Menzies. 2021b. Bias in machine learning software: why? how? what to do?. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 429–440.
  • Chakraborty et al. (2020) Joymallya Chakraborty, Suvodeep Majumder, Zhe Yu, and Tim Menzies. 2020. Fairway: A way to build fair ml software. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 654–665.
  • Chan and Wang (2018) Jason Chan and Jing Wang. 2018. Hiring preferences in online labor markets: Evidence of a female hiring bias. Management Science 64, 7 (2018), 2973–2994.
  • Chen et al. (2022a) Zhenpeng Chen, Jie M. Zhang, Max Hort, Federica Sarro, and Mark Harman. 2022a. Fairness Testing: A Comprehensive Survey and Analysis of Trends. https://doi.org/10.48550/ARXIV.2207.10223
  • Chen et al. (2022b) Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. 2022b. MAAT: a novel ensemble approach to addressing fairness and performance bugs for machine learning software. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1122–1134.
  • Chicco and Jurman (2020) Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC genomics 21, 1 (2020), 1–13.
  • Das et al. (2021) Sanjiv Das, Michele Donini, Jason Gelman, Kevin Haas, Mila Hardt, Jared Katzman, Krishnaram Kenthapadi, Pedro Larroy, Pinar Yilmaz, and Muhammad Bilal Zafar. 2021. Fairness measures for machine learning in finance. The Journal of Financial Data Science 3, 4 (2021), 33–64.
  • Dheeru and Taniskidou (2017) Dua Dheeru and Efi Karra Taniskidou. 2017. UCI machine learning repository. http://archive.ics.uci.edu/ml (2017).
  • Dong et al. (2020) Yizhen Dong, Peixin Zhang, Jingyi Wang, Shuang Liu, Jun Sun, Jianye Hao, Xinyu Wang, Li Wang, Jinsong Dong, and Ting Dai. 2020. An Empirical Study on Correlation between Coverage and Robustness for Deep Neural Networks. In 2020 25th International Conference on Engineering of Complex Computer Systems (ICECCS). 73–82. https://doi.org/10.1109/ICECCS51672.2020.00016
  • Fan et al. (2022) Ming Fan, Wenying Wei, Wuxia Jin, Zijiang Yang, and Ting Liu. 2022. Explanation-Guided Fairness Testing through Genetic Algorithm. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 871–882. https://doi.org/10.1145/3510003.3510137
  • Fox and Carvalho (2012) John Fox and Marilia S Carvalho. 2012. The RcmdrPlugin. survival package: Extending the R Commander interface to survival analysis. Journal of Statistical Software 49 (2012), 1–32.
  • Friedler et al. (2021) Sorelle A Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2021. The (im) possibility of fairness: Different value systems require different mechanisms for fair decision making. Commun. ACM 64, 4 (2021), 136–143.
  • Gao et al. (2020) Xiang Gao, Ripon K. Saha, Mukul R. Prasad, and Abhik Roychoudhury. 2020. Fuzz Testing Based Data Augmentation to Improve Robustness of Deep Neural Networks. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 1147–1158. https://doi.org/10.1145/3377811.3380415
  • Gao et al. (2022) Xuanqi Gao, Juan Zhai, Shiqing Ma, Chao Shen, Yufei Chen, and Qian Wang. 2022. FairNeuron: Improving Deep Neural Network Fairness with Adversary Games on Selective Neurons. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 921–933. https://doi.org/10.1145/3510003.3510087
  • Gong et al. (2022) Chen Gong, Zhou Yang, Yunpeng Bai, Jieke Shi, Arunesh Sinha, Bowen Xu, David Lo, Xinwen Hou, and Guoliang Fan. 2022. Curiosity-Driven and Victim-Aware Adversarial Policies. In Proceedings of the 38th Annual Computer Security Applications Conference (Austin, TX, USA) (ACSAC ’22). Association for Computing Machinery, New York, NY, USA, 186–200. https://doi.org/10.1145/3564625.3564636
  • Harel-Canada et al. (2020) Fabrice Harel-Canada, Lingxiao Wang, Muhammad Ali Gulzar, Quanquan Gu, and Miryung Kim. 2020. Is Neuron Coverage a Meaningful Measure for Testing Deep Neural Networks?. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 851–862. https://doi.org/10.1145/3368089.3409754
  • Hoang et al. (2019) Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. 2019. Aspect-Based Sentiment Analysis using BERT. In Proceedings of the 22nd Nordic Conference on Computational Linguistics. Linköping University Electronic Press, Turku, Finland, 187–196. https://aclanthology.org/W19-6120
  • Hort et al. (2021) Max Hort, Jie M Zhang, Federica Sarro, and Mark Harman. 2021. Fairea: A model behaviour mutation approach to benchmarking bias mitigation methods. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 994–1006.
  • Jiang and Nachum (2019) Heinrich Jiang and Ofir Nachum. 2019. Identifying and Correcting Label Bias in Machine Learning. CoRR abs/1901.04966 (2019). arXiv:1901.04966 http://arxiv.org/abs/1901.04966
  • Kamiran and Calders (2012) Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and information systems 33, 1 (2012), 1–33.
  • Kamiran et al. (2012) Faisal Kamiran, Asim Karim, and Xiangliang Zhang. 2012. Decision Theory for Discrimination-Aware Classification. In 2012 IEEE 12th International Conference on Data Mining. 924–929. https://doi.org/10.1109/ICDM.2012.45
  • Kamiran et al. (2018) Faisal Kamiran, Sameen Mansha, Asim Karim, and Xiangliang Zhang. 2018. Exploiting reject option in classification for social discrimination control. Information Sciences 425 (2018), 18–33.
  • Last et al. ([n. d.]) F Last, G Douzas, and F Bacao. [n. d.]. Oversampling for imbalanced learning based on k-means and SMOTE. arXiv preprint arXiv:1711.00837 (2017).
  • Li et al. (2019) Zenan Li, Xiaoxing Ma, Chang Xu, and Chun Cao. 2019. Structural Coverage Criteria for Neural Networks Could Be Misleading. In 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). 89–92. https://doi.org/10.1109/ICSE-NIER.2019.00031
  • Ma et al. (2019) Lei Ma, Felix Juefei-Xu, Minhui Xue, Bo Li, Li Li, Yang Liu, and Jianjun Zhao. 2019. DeepCT: Tomographic Combinatorial Testing for Deep Learning Systems. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). 614–618. https://doi.org/10.1109/SANER.2019.8668044
  • Ma et al. (2018) Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (Montpellier, France) (ASE 2018). Association for Computing Machinery, New York, NY, USA, 120–131. https://doi.org/10.1145/3238147.3238202
  • Mehrabi et al. (2021) Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
  • Moro et al. (2014) Sérgio Moro, Paulo Cortez, and Paulo Rita. 2014. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62 (2014), 22–31.
  • Pei et al. (2019) Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2019. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Commun. ACM 62, 11 (Oct. 2019), 137–145. https://doi.org/10.1145/3361566
  • Peterson (2009) Leif E Peterson. 2009. K-nearest neighbor. Scholarpedia 4, 2 (2009), 1883.
  • Rasmy et al. (2021) Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. 2021. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine 4, 1 (2021), 86.
  • Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4902–4912. https://doi.org/10.18653/v1/2020.acl-main.442
  • Soremekun et al. (2022) Ezekiel Soremekun, Sakshi Udeshi, and Sudipta Chattopadhyay. 2022. Astraea: Grammar-Based Fairness Testing. IEEE Transactions on Software Engineering 48, 12 (2022), 5188–5211. https://doi.org/10.1109/TSE.2022.3141758
  • Spinellis (2021) Diomidis Spinellis. 2021. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. (2021).
  • Sun et al. (2020) Zeyu Sun, Jie M. Zhang, Mark Harman, Mike Papadakis, and Lu Zhang. 2020. Automatic Testing and Improvement of Machine Translation. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 974–985. https://doi.org/10.1145/3377811.3380420
  • Tarawneh and Embarak (2019) Monther Tarawneh and Ossama Embarak. 2019. Hybrid approach for heart disease prediction using data mining techniques. In Advances in Internet, Data and Web Technologies: The 7th International Conference on Emerging Internet, Data and Web Technologies (EIDWT-2019). Springer, 447–454.
  • Tian et al. (2018) Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (Gothenburg, Sweden) (ICSE ’18). Association for Computing Machinery, New York, NY, USA, 303–314. https://doi.org/10.1145/3180155.3180220
  • Trujillo et al. (2020) Miller Trujillo, Mario Linares-Vásquez, Camilo Escobar-Velásquez, Ivana Dusparic, and Nicolás Cardozo. 2020. Does Neuron Coverage Matter for Deep Reinforcement Learning? A Preliminary Study. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops (Seoul, Republic of Korea) (ICSEW’20). Association for Computing Machinery, New York, NY, USA, 215–220. https://doi.org/10.1145/3387940.3391462
  • Van der Laan (2000) Paul Van der Laan. 2000. The 2001 Census in the Netherlands. In Conference The Census of Population.
  • Wightman (1998) Linda F Wightman. 1998. LSAC national longitudinal bar passage study. Law School Admission Council.
  • Wu et al. (2021) Xian Wu, Wenbo Guo, Hua Wei, and Xinyu Xing. 2021. Adversarial policy training against deep reinforcement learning. In 30th USENIX Security Symposium (USENIX Security 21). 1883–1900.
  • Xie et al. (2019) Xiaofei Xie, Lei Ma, Felix Juefei-Xu, Minhui Xue, Hongxu Chen, Yang Liu, Jianjun Zhao, Bo Li, Jianxiong Yin, and Simon See. 2019. DeepHunter: A Coverage-Guided Fuzz Testing Framework for Deep Neural Networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (Beijing, China) (ISSTA 2019). Association for Computing Machinery, New York, NY, USA, 146–157. https://doi.org/10.1145/3293882.3330579
  • Yan et al. (2020) Shenao Yan, Guanhong Tao, Xuwei Liu, Juan Zhai, Shiqing Ma, Lei Xu, and Xiangyu Zhang. 2020. Correlations between Deep Neural Network Model Coverage Criteria and Model Quality. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 775–787. https://doi.org/10.1145/3368089.3409671
  • Yang et al. (2021) Zhou Yang, Muhammad Hilmi Asyrofi, and David Lo. 2021. BiasRV: Uncovering Biased Sentiment Predictions at Runtime. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece) (ESEC/FSE 2021). Association for Computing Machinery, New York, NY, USA, 1540–1544. https://doi.org/10.1145/3468264.3473117
  • Yang et al. (2022a) Z. Yang, J. Shi, M. Asyrofi, and D. Lo. 2022a. Revisiting Neuron Coverage Metrics and Quality of Deep Neural Networks. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE Computer Society, Los Alamitos, CA, USA, 408–419. https://doi.org/10.1109/SANER53432.2022.00056
  • Yang et al. (2022b) Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022b. Natural Attack for Pre-Trained Models of Code. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1482–1493. https://doi.org/10.1145/3510003.3510146
  • Yeh and Lien (2009) I-Cheng Yeh and Che-hui Lien. 2009. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert systems with applications 36, 2 (2009), 2473–2480.
  • Zhang et al. (2018a) Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018a. Mitigating Unwanted Biases with Adversarial Learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (New Orleans, LA, USA) (AIES ’18). Association for Computing Machinery, New York, NY, USA, 335–340. https://doi.org/10.1145/3278721.3278779
  • Zhang et al. (2020a) Huangzhao Zhang, Zhuo Li, Ge Li, Lei Ma, Yang Liu, and Zhi Jin. 2020a. Generating Adversarial Examples for Holding Robustness of Source Code Processing Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1169–1176.
  • Zhang et al. (2022) Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2022. Machine Learning Testing: Survey, Landscapes and Horizons. IEEE Transactions on Software Engineering 48, 1 (2022), 1–36. https://doi.org/10.1109/TSE.2019.2962027
  • Zhang et al. (2021b) Lingfeng Zhang, Yueling Zhang, and Min Zhang. 2021b. Efficient White-Box Fairness Testing through Gradient Search. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (Virtual, Denmark) (ISSTA 2021). Association for Computing Machinery, New York, NY, USA, 103–114. https://doi.org/10.1145/3460319.3464820
  • Zhang and Sun (2022) Mengdi Zhang and Jun Sun. 2022. Adaptive Fairness Improvement Based on Causality Analysis. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, 6–17. https://doi.org/10.1145/3540250.3549103
  • Zhang et al. (2018b) Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018b. DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (Montpellier, France) (ASE 2018). Association for Computing Machinery, New York, NY, USA, 132–142. https://doi.org/10.1145/3238147.3238187
  • Zhang et al. (2020b) Peixin Zhang, Jingyi Wang, Jun Sun, Guoliang Dong, Xinyu Wang, Xingen Wang, Jin Song Dong, and Ting Dai. 2020b. White-Box Fairness Testing through Adversarial Sampling. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (Seoul, South Korea) (ICSE ’20). Association for Computing Machinery, New York, NY, USA, 949–960. https://doi.org/10.1145/3377811.3380331
  • Zhang et al. (2021a) Wenbin Zhang, Albert Bifet, Xiangliang Zhang, Jeremy C Weiss, and Wolfgang Nejdl. 2021a. Farf: A fair and adaptive random forests classifier. In Advances in Knowledge Discovery and Data Mining: 25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, May 11–14, 2021, Proceedings, Part II. Springer, 245–256.
  • Zhang and Ntoutsi (2019) Wenbin Zhang and Eirini Ntoutsi. 2019. FAHT: an adaptive fairness-aware decision tree classifier. arXiv preprint arXiv:1907.07237 (2019).
  • Zhang and Weiss (2021) Wenbin Zhang and Jeremy C Weiss. 2021. Fair decision-making under uncertainty. In 2021 IEEE International Conference on Data Mining (ICDM). IEEE, 886–895.
  • Zhang and Weiss (2022) Wenbin Zhang and Jeremy C Weiss. 2022. Longitudinal fairness with censorship. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 12235–12243.
  • Zheng et al. (2022) Haibin Zheng, Zhiqing Chen, Tianyu Du, Xuhong Zhang, Yao Cheng, Shouling Ji, Jingyi Wang, Yue Yu, and Jinyin Chen. 2022. NeuronFair: Interpretable White-Box Fairness Testing through Biased Neuron Identification. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1519–1531. https://doi.org/10.1145/3510003.3510123