
Making Bias Amplification in Balanced Datasets Directional and Interpretable

Bhanu Tokas
Arizona State University
btokas@asu.edu
   Rahul Nair
Arizona State University
rnair21@asu.edu
   Hannah Kerner
Arizona State University
hkerner@asu.edu
Abstract

Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them, a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute $A$ (e.g., gender) and a task $T$ (e.g., cooking). However, these metrics fail to measure biases when $A$ is balanced with $T$. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.

These authors contributed equally to this work.

1 Introduction

Machine learning models should perform fairly across demographics, genders, and other groups. However, ensuring fairness is challenging when training datasets are biased, as is the case with many datasets. For instance, in the imSitu dataset [12], 67% of the images labeled “cooking” feature females, indicating a gender bias that women are more likely to be associated with cooking than men [14]. Given a biased training set, it is not surprising for a model to learn these dataset biases. Surprisingly, models not only learn dataset biases but can also amplify them [14, 10, 13]. In the example from imSitu, where females and cooking co-occurred 67% of the time, bias amplification occurs when more than 67% of the images predicted as cooking feature females.

Several metrics have been proposed to measure bias amplification between a protected attribute (e.g., gender), denoted as $A$, and a task (e.g., cooking), denoted as $T$ [14, 10, 13]. If $A$ and $T$ co-occur more frequently than random in the training dataset, these metrics measure the increase in co-occurrence between the predictions of $A$ and $T$. For instance, if the co-occurrence between “females” ($A$) and “cooking” ($T$) is 67% in the training dataset and 90% at test time, the bias amplification value is 23%.

These metrics imply that if a protected attribute and task are balanced in the training dataset, there are no dataset biases to amplify. However, simply balancing a protected attribute $A$ with a pre-defined task $T$ does not ensure an unbiased dataset. Biases may emerge from unannotated parts of the dataset.

Suppose we balance imSitu such that 50% of the images labeled “cooking” feature females. In this case, gender is balanced with respect to cooking. Now, assume that cooking objects in imSitu, like hairnets, are not annotated. If most of the cooking images with females have hairnets, while most of the cooking images with males do not, the model may learn a spurious correlation between hairnets, cooking, and females. Hence, the model may more often predict the presence of a female when cooking images have hairnets in the test set, leading to bias amplification between females and cooking. However, since gender appears balanced with respect to the cooking labels, current metrics would report 0 bias amplification.

Wang et al. [11] identified that metrics measuring bias through co-occurrences between a protected attribute and a task failed to account for biases emerging from unannotated elements. They proposed a term called “leakage” to measure bias amplification, even when a dataset’s protected attribute is balanced with a task. Leakage measures how predictable the protected attribute $A$ is from the ground-truth labels of the task $T$ (dataset leakage) and from the model's predictions of the task $\hat{T}$ (model leakage). Wang et al. [11] describe bias amplification as the difference between dataset leakage ($\lambda_{D}$) and model leakage ($\lambda_{M}$). $\lambda_{D}$ and $\lambda_{M}$ are quantified using an attacker model that predicts the protected attribute.

In this work, we refer to Wang et al.’s [11] method of calculating bias amplification as leakage amplification. Leakage amplification was an important step toward measuring bias amplification in balanced datasets. However, it has the following limitations:

1. Leakage amplification lacks direction. In the cooking example, we need to identify whether the model amplifies the bias towards predicting only women as cooking ($A\rightarrow T$) or towards predicting all cooks as women ($T\rightarrow A$).

2. Leakage amplification is unbounded. Leakage amplification does not have a bounded range of values since it is the absolute difference between $\lambda_{M}$ and $\lambda_{D}$. This makes leakage amplification values hard to interpret.

3. Leakage amplification does not measure the relative change in biases. In a slightly biased dataset (e.g., $\lambda_{D} = 0.55$), a bias amplification of 0.05 (to $\lambda_{M} = 0.60$) is a larger relative increase than the same 0.05 amplification in a highly biased dataset (e.g., $\lambda_{D} = 0.90$ to $\lambda_{M} = 0.95$). Since leakage amplification calculates the absolute difference between $\lambda_{M}$ and $\lambda_{D}$, it gives the same bias amplification value of 0.05 for both datasets.

4. Leakage amplification is sensitive to the choice of attacker model. The choice of attacker model influences $\lambda_{M}$ and $\lambda_{D}$, and consequently, leakage amplification values. An attacker model with poor predictability of the protected attribute will yield very different results for $\lambda_{M}$ and $\lambda_{D}$ compared to one with high predictability.

We propose a new metric called Directional Predictability Amplification ($DPA$) that addresses the limitations of leakage amplification. The contributions of $DPA$ are:

1. $DPA$ is the only metric that can measure directional bias amplification in a balanced dataset.

2. $DPA$ is bounded and interpretable.

3. $DPA$ measures the relative change of predictability (as opposed to the absolute change of predictability in leakage amplification).

4. $DPA$ is minimally sensitive to attacker models.

2 Related Work

Co-occurrence for Bias Amplification

Men Also Like Shopping ($BA_{MALS}$) [14] proposed the first metric for bias amplification. The proposed metric measured the co-occurrences between a protected attribute $A$ and a task $T$. For any $T$-$A$ pairs that showed a positive correlation (i.e., the pair occurred more frequently than independent events) in the training dataset, it measured how much the positive correlation increased in the model predictions.

Wang et al. [10] generalized the $BA_{MALS}$ metric to also measure negative correlation (i.e., the pair occurred less frequently than independent events). Further, Wang et al. [10] changed how positive bias is defined by comparing the independent and joint probabilities of a pair. However, both $BA_{MALS}$ [14] and Wang et al. [10] could only work for $T$-$A$ pairs where $T$ and $A$ were singleton sets (e.g., {Basketball} & {Male}). Zhao et al. [13] extended the metric proposed by Wang et al. [10] to allow $T$-$A$ pairs where $T$ and $A$ are non-singleton sets (e.g., {Basketball, Sneakers} & {African-American, Male}).

Lin et al. [4] proposed a new metric called bias disparity to measure bias amplification in recommender systems. Foulds et al. [2] measured bias amplification using the difference in “differential fairness”, a measure of the difference in co-occurrences of $T$-$A$ pairs across different values of $A$. Seshadri et al. [8] measured bias amplification for text-to-image generation using the increase in percentage bias in generated vs. training samples.

Bias Amplification in Balanced Datasets

Wang et al. [11] identified that $BA_{MALS}$ [14] failed to measure bias amplification for balanced datasets. They proposed a metric, which we refer to as leakage amplification, that can measure bias amplification in balanced datasets. While some of the previously discussed metrics [2, 8, 4, 13] can measure bias amplification in a balanced dataset, these metrics do not work for continuous variables because they use co-occurrences to quantify biases.

Leakage amplification quantifies biases in terms of predictability, i.e., how easily a model can predict the protected attribute $A$ from a task $T$. Attacker functions ($f$) are trained to predict the attribute ($A$) from the ground-truth observations of the task ($T$) and from the model predictions of the task ($\hat{T}$). The relative performance of $f$ on $T$ vs. $\hat{T}$ represents how much information about $A$ leaks through $T$.

As the attacker function can be any kind of machine learning model, it can process continuous inputs, text, and images. This flexibility gives leakage amplification a distinct advantage over co-occurrence-based bias amplification metrics. Subsequent work used leakage amplification for quantifying bias amplification in image captioning  [3].

Capturing Directionality in Bias Amplification

While previous metrics, including leakage amplification [11], could detect the presence of bias, they could not explain its causality or directionality. Wang et al. [10] were the first to introduce a directional bias amplification metric, $BA_{\rightarrow}$. However, the metric only works for unbalanced datasets. Zhao et al. [13] proposed a new metric, $Multi_{\rightarrow}$, to measure directional bias amplification for multiple attributes and balanced datasets. However, the metric cannot distinguish between positive and negative bias amplification, as shown in Appendix A. This lack of sign awareness makes $Multi_{\rightarrow}$ unsuitable for many use cases.

In summary, no existing metric can measure the positive and negative directional bias amplification in a balanced dataset, as shown in Table 1.

Method | Balanced Datasets | Directional | Negative Amp.
$BA_{MALS}$ [14] | ✗ | ✗ | ✗
$BA_{\rightarrow}$ [10] | ✗ | ✓ | ✓
$Multi_{\rightarrow}$ [13] | ✓ | ✓ | ✗
Leakage Amp. [11] | ✓ | ✗ | ✓
DPA (Ours) | ✓ | ✓ | ✓
Table 1: We compare different desirable properties of bias amplification metrics. Only $DPA$ has all three.

3 Leakage Amplification

Before introducing our metric, we explain the formulation and limitations of the leakage amplification metric proposed by Wang et al. [11].

3.1 Formulation

To measure the leakage of an attribute ($A$) from a task ($T$), Wang et al. [11] trained an attacker function ($f$) that takes $T$ as input to predict $A$. The performance of the attacker is measured using a quality function ($Q$). Previous works [11, 3] used accuracy and F1-scores for $Q$. Wang et al. [11] describe dataset leakage ($\lambda_{D}$) as:

$\lambda_{D} = Q(f_{D}(T), A)$  (1)

Similarly, model leakage ($\lambda_{M}$) is described as:

$\lambda_{M} = Q(f_{M}(\hat{T}), A)$  (2)

where $T$ and $\hat{T}$ represent the ground truth and model predictions for the task, respectively. $f_{D}$ is trained on task observations from the dataset, while $f_{M}$ is trained on task predictions from the model.

Leakage amplification measures the increase of leakage in model predictions compared to the leakage in the dataset:

$Leakage\ Amplification = \lambda_{M} - \lambda_{D}$  (3)

Model predictions ($\hat{T}$) are not 100% accurate and might have errors. These errors might create a difference in leakage values, which could be misinterpreted as bias. To prevent conflation of errors with bias, Wang et al. [11] introduced a similar error rate in $T$ using random perturbations. If the model predictions $\hat{T}$ are 70% accurate, they randomly flipped 30% of the labels in $T$. As the bias in $T$ can vary significantly between two random perturbations, they measured bias amplification using confidence intervals. This quality equalization prevents conflation of model biases and errors.
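To make the formulation concrete, below is a minimal sketch of leakage amplification on toy data. The logistic-regression attacker, accuracy as $Q$, and the synthetic labels are illustrative assumptions, not the exact setup of Wang et al. [11].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def leakage(task_repr, attr):
    # Train an attacker f to predict the protected attribute A from a task
    # representation (ground truth T or predictions T_hat); Q = accuracy.
    attacker = LogisticRegression(max_iter=1000).fit(task_repr, attr)
    return accuracy_score(attr, attacker.predict(task_repr))

def quality_equalize(T, T_hat, rng):
    # Randomly flip labels in T so its error rate matches T_hat's, preventing
    # model errors from being read as bias (assumes binary task labels).
    error_rate = np.mean(T != T_hat)
    flip = rng.random(T.shape) < error_rate
    return np.where(flip, 1 - T, T)

rng = np.random.default_rng(0)
# Toy data: T and T_hat are (n_samples, n_tasks); A is (n_samples,).
T = rng.integers(0, 2, size=(1000, 5))
T_hat = np.where(rng.random(T.shape) < 0.1, 1 - T, T)    # a 90%-accurate "model"
A = T[:, 0] ^ (rng.random(1000) < 0.2).astype(int)       # A leaks through task 0

lambda_D = leakage(quality_equalize(T, T_hat, rng), A)   # dataset leakage (Eq. 1)
lambda_M = leakage(T_hat, A)                             # model leakage (Eq. 2)
print("leakage amplification:", lambda_M - lambda_D)     # Eq. 3
```

In practice, the attacker would be evaluated on held-out data, and the random perturbation would be repeated several times to obtain the confidence intervals described above.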

3.2 Limitations

3.2.1 Incompatible with directionality

In leakage amplification, as seen in Equation 1, the attacker function $f$ tries to model the relationship $P(A|T)$. Hence, we can approximate Equation 1 as:

$\lambda_{D} = Q(f(T), A) \propto P(A|T)$  (4)

Similarly, for Equations 2 and 3, we can say:

$\lambda_{M} \propto P(A|\hat{T})$

$Leakage\ Amplification \propto (P(A|\hat{T}) - P(A|T))$  (5)

We observe that leakage amplification approximates differences in probability with fixed posteriors. This is different from Wang et al.'s [10] definition of directionality, where fixed priors are used. Wang et al. [10] defined their metric $BA_{\rightarrow}$ in the following manner:

$BA_{\rightarrow} = \frac{1}{|A||T|} \sum_{a \in A, t \in T} y_{at}\Delta_{at} + (1 - y_{at})(-\Delta_{at})$  (6)

where,

$y_{at} = 1[P(A_{a}=1, T_{t}=1) > P(A_{a}=1)P(T_{t}=1)]$  (7)

$\Delta_{at} = \begin{cases} P(\hat{T}_{t}=1|A_{a}=1) - P(T_{t}=1|A_{a}=1) & \text{if measuring } A\rightarrow T \\ P(\hat{A}_{a}=1|T_{t}=1) - P(A_{a}=1|T_{t}=1) & \text{if measuring } T\rightarrow A \end{cases}$  (8)

For $T\rightarrow A$, $BA_{\rightarrow}$ measures the change in $P(\hat{A}|T)$ with respect to $P(A|T)$, i.e., the change in the conditional probability of $\hat{A}$ vs. $A$ with respect to a fixed prior $T$. Similarly, for $A\rightarrow T$, $BA_{\rightarrow}$ measures the change in the conditional probability of $\hat{T}$ vs. $T$ with respect to a fixed prior $A$.

In leakage amplification, unlike $BA_{\rightarrow}$, the posterior is fixed. To measure directionality, we need fixed priors. Thus, leakage amplification does not align with existing definitions of directionality.
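For reference, the following is a minimal sketch of $BA_{\rightarrow}$ (Equations 6-8) for the simple case where $A$ and $T$ each take the values 0 and 1; the function and variable names are ours and do not come from the original implementation.

```python
import numpy as np

def ba_arrow(A, T, pred, direction="T->A"):
    # A, T: ground-truth arrays with values in {0, 1}. pred: A_hat for "T->A",
    # T_hat for "A->T". Returns the signed directional bias amplification.
    total = 0.0
    for a in (0, 1):
        for t in (0, 1):
            # y_at: does the (a, t) pair co-occur more often than under independence?
            y_at = np.mean((A == a) & (T == t)) > np.mean(A == a) * np.mean(T == t)
            if direction == "T->A":
                # fixed prior T = t: change in P(A_hat = a | T = t) vs. P(A = a | T = t)
                delta = np.mean(pred[T == t] == a) - np.mean(A[T == t] == a)
            else:
                # fixed prior A = a: change in P(T_hat = t | A = a) vs. P(T = t | A = a)
                delta = np.mean(pred[A == a] == t) - np.mean(T[A == a] == t)
            total += delta if y_at else -delta
    return total / 4.0  # 1 / (|A| * |T|)
```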

3.2.2 Variable bounds

Leakage amplification is the difference between $\lambda_{M}$ and $\lambda_{D}$ (Equation 3). Hence, leakage amplification is bounded in the interval $[\min(\lambda_{M}) - \max(\lambda_{D}),\ \max(\lambda_{M}) - \min(\lambda_{D})]$. However, the max and min values of $\lambda_{M}$ and $\lambda_{D}$ depend on the choice of quality function $Q$. Depending on the choice of $Q$, we can get completely different leakage amplification values for the same input. This makes leakage amplification values hard to interpret.

3.2.3 Does not measure relative amplification

Leakage amplification does not account for the magnitude of biases in the dataset ($\lambda_{D}$). Let us understand this using two cases. In the first case, we are working with a slightly biased dataset ($D_{1}$). In the second case, we are working with a significantly biased dataset ($D_{2}$). We train two identical models on these datasets to get predictions $M_{1}$ and $M_{2}$, respectively. Let us assume we are using accuracy for $Q$. Suppose we get the following $\lambda$ values: $\lambda_{D_{1}} = 0.55$ (slightly biased), $\lambda_{D_{2}} = 0.9$ (highly biased), $\lambda_{M_{1}} = 0.60$, $\lambda_{M_{2}} = 0.95$.

Leakage amplification treats both cases as equivalent. Although the relative increase in bias in the first case ($\approx 0.09$) is greater than in the second case ($\approx 0.06$), both cases report the same bias amplification value (0.05).

3.2.4 Sensitive to attacker model hyperparameters

The performance of attacker functions (usually neural networks) directly impacts leakage amplification values. Since neural network performance is sensitive to the hyperparameter settings, leakage amplification values are too.

4 Directional Predictability Amplification

We propose our new metric, Directional Predictability Amplification ($DPA$), which addresses the previously mentioned limitations of leakage amplification.

4.1 Formulation

As noted in Section 3.2.1, Wang et al.'s [11] formula for leakage amplification is not compatible with directionality, as it has fixed posteriors, not priors. We define predictability ($\Psi$) using fixed priors.

We define the predictability of $T$ from $A$, which represents the dataset bias in the $A\rightarrow T$ direction, as:

$\Psi_{D, A\rightarrow T} = Q(f^{T}_{D}(A), T)$  (9)

We define the predictability of $\hat{T}$ from $A$, which represents the model bias in the $A\rightarrow T$ direction, as:

$\Psi_{M, A\rightarrow T} = Q(f^{T}_{M}(A), \hat{T})$  (10)

We define the predictability of $A$ from $T$, which represents the dataset bias in the $T\rightarrow A$ direction, as:

$\Psi_{D, T\rightarrow A} = Q(f^{A}_{D}(T), A)$  (11)

We define the predictability of $\hat{A}$ from $T$, which represents the model bias in the $T\rightarrow A$ direction, as:

$\Psi_{M, T\rightarrow A} = Q(f^{A}_{M}(T), \hat{A})$  (12)

$f^{A}$ represents an attacker function that takes $T$ as input and tries to predict $A$. $f^{T}$ represents an attacker function that takes $A$ as input and tries to predict $T$.

While leakage amplification computed the difference between $\lambda_{M}$ and $\lambda_{D}$, we normalize the difference in predictability by their sum.

Using Equations 11 and 12, we define bias amplification in the $T \rightarrow A$ direction as:

$DPA_{T\rightarrow A} = \frac{\Psi_{M, T\rightarrow A} - \Psi_{D, T\rightarrow A}}{\Psi_{M, T\rightarrow A} + \Psi_{D, T\rightarrow A}}$  (13)

Similarly, using Equations 9 and 10, we define bias amplification in the $A \rightarrow T$ direction as:

$DPA_{A\rightarrow T} = \frac{\Psi_{M, A\rightarrow T} - \Psi_{D, A\rightarrow T}}{\Psi_{M, A\rightarrow T} + \Psi_{D, A\rightarrow T}}$  (14)
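Below is a minimal sketch of $DPA_{T\rightarrow A}$ (Equations 11-13). The MLP attacker and accuracy-based $Q$ are illustrative assumptions, and the quality equalization step from Section 3.1 is omitted for brevity.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def predictability(inputs, targets):
    # Psi = Q(f(inputs), targets): train an attacker f on `inputs` and score
    # how well it predicts `targets`; Q = accuracy here.
    inputs = np.asarray(inputs)
    if inputs.ndim == 1:
        inputs = inputs.reshape(-1, 1)
    attacker = MLPClassifier(hidden_layer_sizes=(4,), activation="logistic",
                             max_iter=2000, random_state=0).fit(inputs, targets)
    return accuracy_score(targets, attacker.predict(inputs))

def dpa_t_to_a(T, A, A_hat):
    # DPA_{T->A}: the prior T is the attacker input in both terms; the targets
    # are A (dataset bias, Eq. 11) and A_hat (model bias, Eq. 12).
    psi_D = predictability(T, A)
    psi_M = predictability(T, A_hat)
    return (psi_M - psi_D) / (psi_M + psi_D)   # Eq. 13, bounded in (-1, 1)
```

$DPA_{A\rightarrow T}$ follows by swapping the roles of the attribute and the task.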

4.2 Benefits

The new formulation gives $DPA$ the following benefits:

Directionality

For $A\rightarrow T$, we keep the prior fixed by giving $A$ as input to both attacker models ($f^{T}_{D},\ f^{T}_{M}$). Similarly, for $T\rightarrow A$, we keep the prior fixed by giving $T$ as input to both attacker models ($f^{A}_{D},\ f^{A}_{M}$). Hence, our method follows Wang et al.'s [10] definition of directionality.

Fixed Bounds

For any chosen quality function $Q$ with a non-negative range (i.e., $[0, \infty)$ or $[0, q]$ for some $q \in \mathbb{R}^{+}$), the range of $DPA$ is restricted to $(-1, 1)$. This normalization fixes the issue of unbounded values in leakage amplification.

While selecting $Q$, users must ensure that 0 represents the worst possible performance by the attacker function (i.e., low predictability or no bias), and the upper bound represents the best possible performance by the attacker function (i.e., high predictability or significant bias). This is true for most typical choices of quality functions, such as accuracy or F1 score, but not for certain losses like cross-entropy.
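To see why the bounds hold, note that for non-negative predictabilities that are not both zero,

$|DPA| = \frac{|\Psi_{M} - \Psi_{D}|}{\Psi_{M} + \Psi_{D}} \leq \frac{\Psi_{M} + \Psi_{D}}{\Psi_{M} + \Psi_{D}} = 1,$

with equality only in the degenerate case where one of the two predictabilities is exactly zero. For any attacker with non-zero quality on both the dataset and the model predictions, the value therefore lies strictly inside $(-1, 1)$.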

Relative Amplification

The normalization in $DPA$ not only gives a bounded range but also accounts for the original bias in the dataset. To demonstrate this shift in behavior, we plot the relation between leakage amplification and $\lambda_{M}$ at different values of dataset bias ($\lambda_{D}$) in Figure 1. For $DPA$, we plot the relation between $DPA$ and $\Psi_{M}$ at different values of dataset bias ($\Psi_{D}$).

We observe that the slope for leakage amplification remains constant irrespective of the value of $\lambda_{D}$. On the other hand, for $DPA$ we observe higher slopes between $DPA$ and $\Psi_{M}$ for smaller values of $\Psi_{D}$, and vice versa. Hence, in nearly balanced datasets (smaller $\Psi_{D}$), $DPA$ reports high bias amplification even for small increases in bias. For highly biased datasets (higher $\Psi_{D}$), $DPA$ reports a small bias amplification value for a similar increase in biases.
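As a concrete example, treating the $\lambda$ values from Section 3.2.3 as predictabilities, the normalization yields

$\frac{0.60 - 0.55}{0.60 + 0.55} \approx 0.043 \quad \text{vs.} \quad \frac{0.95 - 0.90}{0.95 + 0.90} \approx 0.027,$

so the slightly biased dataset receives a larger amplification score for the same absolute increase, whereas leakage amplification reports 0.05 in both cases.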

Figure 1: The graphs show trends between (a) leakage amplification vs. $\lambda_{M}$, at different values of $\lambda_{D}$, and (b) $DPA$ vs. $\Psi_{M}$, at different values of $\Psi_{D}$. For the same model bias, $DPA$ reports much higher bias amplification values (compared to leakage amplification) when the dataset bias is small.
Attacker Robustness

Normalization also improves the robustness of $DPA$ to different hyperparameters of the attacker model. Since we use the same type of attacker for both the dataset and the model predictability terms, changes in hyperparameters impact their performance in similar ways. We show that taking the normalized difference of $\Psi_{M}$ and $\Psi_{D}$ is more robust in Appendix B.

5 Experiment Setup

We performed experiments using tabular (COMPAS [1]) and image (COCO [5]) datasets to compare $DPA$ to previous bias amplification metrics.

5.1 COMPAS Experiment

COMPAS [1] is a dataset containing information about individuals who have been previously arrested. Each entry is associated with 52 features. We used five features: age, juv_fel_count, juv_misd_count, juv_other_count, priors_count.

We limited the dataset to two races (Caucasian or African-American), which we used as the protected attribute ($A$). The task ($T$) was recidivism (i.e., whether the person was arrested again for a crime in the next two years). Hence, $A = \{\texttt{Caucasian}: 0, \texttt{African-American}: 1\}$ and $T = \{\texttt{No Recidivism}: 0, \texttt{Recidivism}: 1\}$.

We created balanced and unbalanced versions of the COMPAS dataset. For the unbalanced dataset, we sampled all available COMPAS instances (attributes, race labels, and recidivism labels) for each of the four $A$-$T$ pairs. For the balanced dataset, we sampled an equal number of instances across the four $A$-$T$ pairs. The counts for the $A$-$T$ pairs in the unbalanced dataset are shown in the top-left quadrant of Table 2(a), while the counts for the balanced dataset are shown in the top-left quadrant of Table 2(b).

We trained a decision tree model on the unbalanced and the balanced COMPAS datasets. Each model predicts a person's race ($A$) and recidivism ($T$) based on the five selected features. We measured the bias amplification caused by each model in two directions: bias amplification caused by race ($A$) on recidivism ($T$), referred to as $A\rightarrow T$, and bias amplification caused by recidivism ($T$) on race ($A$), referred to as $T\rightarrow A$. We compared our proposed metric, $DPA$, to the previous metrics $BA_{\rightarrow}$ and $Multi_{\rightarrow}$. For $DPA$, we used a 3-layer dense neural network (with a hidden layer of size 4 and sigmoid activations) as the attacker model for both directions. Following [11], we evaluated each bias amplification metric on the training set predictions.
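Below is a minimal sketch of this pipeline. It assumes the ProPublica release of COMPAS; the file name and column names are assumptions and may differ, and the quality equalization and evaluation protocol are simplified for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

FEATURES = ["age", "juv_fel_count", "juv_misd_count", "juv_other_count", "priors_count"]

df = pd.read_csv("compas-scores-two-years.csv")                # hypothetical local copy
df = df[df["race"].isin(["Caucasian", "African-American"])]
A = (df["race"] == "African-American").astype(int).to_numpy()  # Caucasian: 0, African-American: 1
T = df["two_year_recid"].to_numpy()                            # recidivism within two years
X = df[FEATURES].to_numpy()

# One decision tree per target, as described above.
T_hat = DecisionTreeClassifier(random_state=0).fit(X, T).predict(X)
A_hat = DecisionTreeClassifier(random_state=0).fit(X, A).predict(X)

def predictability(inp, target):
    # Attacker matching the description above: one hidden layer of size 4,
    # sigmoid activations; Q = accuracy.
    inp = np.asarray(inp).reshape(-1, 1)
    f = MLPClassifier(hidden_layer_sizes=(4,), activation="logistic",
                      max_iter=2000, random_state=0).fit(inp, target)
    return accuracy_score(target, f.predict(inp))

# DPA in both directions, evaluated on the training-set predictions.
psi_D_ta, psi_M_ta = predictability(T, A), predictability(T, A_hat)
psi_D_at, psi_M_at = predictability(A, T), predictability(A, T_hat)
dpa_t_to_a = (psi_M_ta - psi_D_ta) / (psi_M_ta + psi_D_ta)
dpa_a_to_t = (psi_M_at - psi_D_at) / (psi_M_at + psi_D_at)
```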

5.2 COCO Experiment

Next, we explore how different bias amplification metrics are impacted in $T\rightarrow A$ as a model's reliance on task-associated objects to predict gender increases. We used the gender-annotated version of the COCO dataset released by Wang et al. [11]. Each image is labeled with both gender ($A = \{\texttt{Female}: 0, \texttt{Male}: 1\}$) and object categories ($T = \{\texttt{Teddy Bear}: 0, \ldots, \texttt{Skateboard}: 78\}$). For the purpose of the experiment, we sampled two sub-datasets, “Unbalanced” and “Balanced”. The balanced dataset is subject to the following constraint:

$\forall y: \#(m, y) = \#(f, y)$  (15)

where $\#(m, y)$ represents the number of images of a male person performing task $y$, and $\#(f, y)$ represents the number of images of a female person performing task $y$. As these constraints are hard to satisfy, only a subset of 12 objects (tasks) is used in the final dataset. This results in a dataset of 6156 images (3078 male and 3078 female images).

We used the same 12 objects for the unbalanced case but relaxed the constraint from Equation 15, as shown in Equation 16. This results in a dataset of 15743 images (8885 male and 6588 female images).

$\forall y: \frac{1}{2} < \frac{\#(m, y)}{\#(f, y)} < 2$  (16)
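A small sketch of how the two constraints could be enforced when subsampling is shown below; `records` is a hypothetical list of (gender, task) pairs, and the selection logic illustrates Equations 15 and 16 rather than reproducing the exact preprocessing used for our splits.

```python
import random

def build_splits(records, seed=0):
    # records: hypothetical list of (gender, task) pairs, gender in {"m", "f"}.
    rng = random.Random(seed)
    tasks = sorted({t for _, t in records})
    balanced, unbalanced = [], []
    for task in tasks:
        m = [r for r in records if r[1] == task and r[0] == "m"]
        f = [r for r in records if r[1] == task and r[0] == "f"]
        if not m or not f:
            continue
        # Eq. 16 (relaxed constraint): keep the task only if the gender ratio
        # lies strictly between 1/2 and 2; both splits use the same tasks.
        if not (0.5 < len(m) / len(f) < 2):
            continue
        unbalanced += m + f
        # Eq. 15 (strict constraint): equal per-gender counts for the balanced split.
        k = min(len(m), len(f))
        balanced += rng.sample(m, k) + rng.sample(f, k)
    return balanced, unbalanced
```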

For each dataset, we have four versions: one original and three perturbed versions wherein the person in the image is masked using different techniques (i.e., partially masking the segment, completely masking the segment, completely masking the bounding box), as shown in Table 4. We trained a separate VGG16 [9] (pre-trained on ImageNet-1K [7]) for 12 epochs for each of the 8 cases (4 versions for both the balanced and unbalanced datasets). We measure the feature attribution of each model using Gradient-Shap [6]. This allows us to measure the attribution of different image elements and compare it with the $T\rightarrow A$ bias amplification reported by various metrics.

6 Results

(a) Unbalanced COMPAS Set
 | $A=0$ | $A=1$ | $\hat{A}=0$ | $\hat{A}=1$
$T=0$ | 1229 | 1402 | 1056 | 1575
$T=1$ | 874 | 1773 | 1115 | 1532
$\hat{T}=0$ | 1165 | 1546 | - | -
$\hat{T}=1$ | 938 | 1629 | - | -

(b) Balanced COMPAS Set
 | $A=0$ | $A=1$ | $\hat{A}=0$ | $\hat{A}=1$
$T=0$ | 874 | 874 | 1083 | 665
$T=1$ | 874 | 874 | 896 | 852
$\hat{T}=0$ | 1145 | 948 | - | -
$\hat{T}=1$ | 603 | 800 | - | -

Table 2: COMPAS Dataset: Counts of the protected attribute (race) and task (recidivism) in the dataset (represented as $A$ and $T$) and in the model predictions (represented as $\hat{A}$ and $\hat{T}$) for the balanced and unbalanced COMPAS sets. Here, $A = \{\texttt{Caucasian}: 0, \texttt{African-American}: 1\}$ and $T = \{\texttt{No Recidivism}: 0, \texttt{Recidivism}: 1\}$.
Method | Unbalanced $T\rightarrow A$ | Unbalanced $A\rightarrow T$ | Balanced $T\rightarrow A$ | Balanced $A\rightarrow T$
$BA_{\rightarrow}$ | $-0.078 \pm 0.031$ | $-0.038 \pm 0.001$ | $0.000 \pm 0.000$ | $0.000 \pm 0.000$
$Multi_{\rightarrow}$ | $0.078 \pm 0.026$ | $0.038 \pm 0.001$ | $0.066 \pm 0.007$ | $0.099 \pm 0.006$
$DPA$ (ours) | $0.063 \pm 0.005$ | $-0.004 \pm 0.002$ | $0.061 \pm 0.008$ | $0.100 \pm 0.004$
Table 3: COMPAS Results: The first two columns depict the bias amplification values for the unbalanced COMPAS set (Table 2(a)), while the last two columns depict the bias amplification values for the balanced COMPAS set (Table 2(b)).
Dataset Split | Metric | Original | Partial Masked | Segment Masked | Bounding-Box Masked
(Example images and their attribution maps for each masking scenario are shown in the original table.)
Unbalanced | Attribution Score | $0.6202 \pm 0.0026$ | $0.6777 \pm 0.0027$ | $0.7321 \pm 0.0020$ | $0.7973 \pm 0.020$
Unbalanced | $DPA$ (ours) | $0.0006 \pm 0.0002$ | $0.0013 \pm 0.0005$ | $0.0041 \pm 0.0005$ | $0.0048 \pm 0.0002$
Unbalanced | $BA_{\rightarrow}$ | $0.0029 \pm 0.0002$ | $0.0072 \pm 0.0005$ | $0.0108 \pm 0.0007$ | $0.0140 \pm 0.0007$
Unbalanced | $Multi_{\rightarrow}$ | $0.0057 \pm 0.0003$ | $0.0091 \pm 0.0005$ | $0.0109 \pm 0.0005$ | $0.0219 \pm 0.0011$
Balanced | Attribution Score | $0.6292 \pm 0.0027$ | $0.6992 \pm 0.0024$ | $0.7367 \pm 0.0019$ | $0.8065 \pm 0.0183$
Balanced | $DPA$ (ours) | $0.0002 \pm 0.0000$ | $0.0007 \pm 0.0002$ | $0.0011 \pm 0.003$ | $0.0015 \pm 0.0002$
Balanced | $BA_{\rightarrow}$ | $0.0000 \pm 0.0000$ | $0.0000 \pm 0.0000$ | $0.0000 \pm 0.0000$ | $0.0000 \pm 0.0000$
Balanced | $Multi_{\rightarrow}$ | $0.0035 \pm 0.0002$ | $0.0056 \pm 0.0004$ | $0.0060 \pm 0.0003$ | $0.0099 \pm 0.0010$
Table 4: COCO Results: Reported bias amplification in the $T\rightarrow A$ direction for the unbalanced and balanced datasets under different masking scenarios.

While interpreting the results, note that a co-occurrence-based metric like $BA_{\rightarrow}$ and a predictability-based metric like $DPA$ may sometimes give different results. This is because they measure bias amplification in different ways.

$BA_{\rightarrow}$ classifies each $A$-$T$ pair in the dataset as a majority or minority pair using Equation 7. It only measures whether the counts of the majority pair increased (positive bias amplification) or decreased (negative bias amplification) in the model predictions.

$DPA$, like [11], does not select a majority or a minority $A$-$T$ pair. It measures the change in the task distribution given the attribute (and vice versa). For instance, if $A$ and $T$ are binary, $DPA$ measures whether the absolute difference in counts between $T=0$ and $T=1$ increased (positive bias amplification) or decreased (negative bias amplification) in the model predictions. Both $BA_{\rightarrow}$ and $DPA$ offer different yet valuable insights into bias amplification.

6.1 COMPAS Results

6.1.1 Unbalanced COMPAS dataset

The bias amplification scores for the unbalanced case are shown in the first two columns of Table 3.

$\boldsymbol{T\rightarrow A}$:

For $BA_{\rightarrow}$, when $T=0$, the count of the majority class $A=0$ decreased from 1229 in the dataset to 1056 in the model predictions. Similarly, when $T=1$, the count of the majority class $A=1$ decreased from 1773 in the dataset to 1532 in the model predictions. Since the counts of the majority classes decreased in the model predictions, $BA_{\rightarrow}$ reported a negative bias amplification in $T\rightarrow A$.

For $DPA$, when $T=0$, the difference in counts between $A=0$ and $A=1$ increased from 173 ($1402 - 1229 = 173$) in the dataset to 519 ($1575 - 1056 = 519$) in the model predictions. However, when $T=1$, the difference in counts between $A=0$ and $A=1$ decreased from 899 ($1773 - 874 = 899$) in the dataset to 417 ($1532 - 1115 = 417$) in the model predictions. Since the decrease in bias when $T=1$ is larger than the increase in bias when $T=0$ ($899 - 417 > 519 - 173$), we might naively assume a negative bias amplification in $T\rightarrow A$.

This naive assumption does not account for the conflation of model errors and model biases. As noted in Section 3.1, the quality equalization step prevents the model's errors from being conflated with its biases. The model has low accuracy when predicting $\hat{A}$ (approx. 69%); hence, 31% of the instances in $A$ are perturbed to match the model's accuracy. As a result, the bias in the perturbed $A$ is smaller than in $\hat{A}$, indicating a positive bias amplification. The positive score reported by $DPA$ is not an incorrect result; it is the low model accuracy that misleadingly suggests a negative bias amplification.

$Multi_{\rightarrow}$ also reports a positive bias amplification of the same magnitude as $BA_{\rightarrow}$. However, this positive value is the result of $Multi_{\rightarrow}$ not being able to distinguish between positive and negative amplification, as shown in Appendix A.

$\boldsymbol{A\rightarrow T}$:

For $BA_{\rightarrow}$, when $A=0$, the count of the majority class $T=0$ decreased from 1229 in the dataset to 1165 in the model predictions. Similarly, when $A=1$, the count of the majority class $T=1$ decreased from 1773 in the dataset to 1629 in the model predictions. Since the counts of the majority classes decreased in the model predictions, $BA_{\rightarrow}$ reported negative bias amplification in $A\rightarrow T$.

For $DPA$, when $A=0$, the difference in counts between $T=0$ and $T=1$ decreased from 355 ($1229 - 874 = 355$) in the dataset to 227 ($1165 - 938 = 227$) in the model predictions. Similarly, when $A=1$, the difference in counts between $T=0$ and $T=1$ decreased from 371 ($1773 - 1402 = 371$) in the dataset to 83 ($1629 - 1546 = 83$) in the model predictions. Since the overall count difference decreased in the model predictions, $DPA$ reported negative bias amplification in $A\rightarrow T$.

$Multi_{\rightarrow}$ reported positive bias amplification as it cannot capture negative bias amplification. It only measures the magnitude of bias amplification, not its sign.

6.1.2 Balanced COMPAS Dataset

The bias amplification scores for the balanced case are shown in the last two columns of Table 3.

$\boldsymbol{T\rightarrow A}$:

Since $BA_{\rightarrow}$ assumes a balanced dataset to be unbiased, it reported zero bias amplification in $T\rightarrow A$. For $DPA$, when $T=0$, the difference in counts between $A=0$ and $A=1$ increased from 0 ($874 - 874 = 0$) in the dataset to 418 ($1083 - 665 = 418$) in the model predictions. Similarly, when $T=1$, the difference in counts between $A=0$ and $A=1$ increased from 0 ($874 - 874 = 0$) in the dataset to 44 ($896 - 852 = 44$) in the model predictions. Since the overall count difference increased in the model predictions, $DPA$ reported positive bias amplification in $T\rightarrow A$.

$\boldsymbol{A\rightarrow T}$:

Since the dataset is balanced, $BA_{\rightarrow}$ reported zero bias amplification in $A\rightarrow T$. For $DPA$, when $A=0$, the difference in counts between $T=0$ and $T=1$ increased from 0 ($874 - 874 = 0$) in the dataset to 542 ($1145 - 603 = 542$) in the model predictions. Similarly, when $A=1$, the difference in counts between $T=0$ and $T=1$ increased from 0 ($874 - 874 = 0$) in the dataset to 148 ($948 - 800 = 148$) in the model predictions. Since the overall count difference increased in the model predictions, $DPA$ reported positive bias amplification in $A\rightarrow T$.

$Multi_{\rightarrow}$ reported positive bias amplification as it only looks at the magnitude of the amplification scores.

Figure 2: Bias amplification heatmaps for (a) $BA_{\rightarrow}$, (b) $DPA$, and (c) $Multi_{\rightarrow}$ for different configurations of the dataset (X-axis) and model predictions (Y-axis). $\alpha_{d}$ creates different configurations of the dataset, while $\alpha_{m}$ creates different configurations of the model predictions. $BA_{\rightarrow}$ and $DPA$ show similar behavior (except when the dataset is balanced). However, $Multi_{\rightarrow}$ always reports positive bias amplification.

6.2 COCO Results

In Table 4, the “attribution score” measures the contribution of non-person image elements to the model's prediction of a person's gender. To calculate the attribution score, we take the normalized attribution map created using Gradient-Shap [6] and mask the values for the person's segment (similar to the segment-masked case). We add the remaining values and average across all images in the dataset to get the final score.
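A minimal sketch of this computation is shown below, assuming the per-image attribution maps (e.g., from Gradient-Shap) and binary person masks are already available; the exact normalization used in our experiments may differ from this illustration.

```python
import numpy as np

def attribution_score(attr_maps, person_masks):
    # attr_maps: list of (H, W) attribution maps; person_masks: list of (H, W)
    # boolean arrays that are True on the person's segment.
    scores = []
    for attr, mask in zip(attr_maps, person_masks):
        attr = np.abs(attr)
        attr = attr / (attr.sum() + 1e-12)   # normalize the map to sum to 1
        scores.append(attr[~mask].sum())     # keep only the non-person attribution
    return float(np.mean(scores))            # average over the dataset
```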

The unbalanced section of Table 4 shows that all metrics report increasing scores as the attribution score increases. It makes intuitive sense that as the model relies more on background objects (including task-associated objects) to predict gender, the bias of tasks on gender (i.e., $T\rightarrow A$) increases.

However, in Table 4's balanced section, this trend no longer holds for $BA_{\rightarrow}$. $BA_{\rightarrow}$ reports a constant zero bias amplification despite the model's increasing reliance on background objects to predict gender. Thus, for balanced datasets, $BA_{\rightarrow}$ continues to report zero bias amplification despite changes in model biases.

Thus, $DPA$ is the most reliable metric, as it avoids pitfalls such as $BA_{\rightarrow}$'s inability to work with balanced datasets and $Multi_{\rightarrow}$'s inability to distinguish positive and negative bias amplification.

7 Discussion

7.1 Different metrics interpret bias amplification differently

As we observed in Section 6, each metric reported a different value for bias amplification. This makes it challenging for users to decide which metric to use when measuring bias amplification.

To understand the behavior of each metric, we simulated the following scenario. Consider a dataset with a protected attribute $A$ (where $A=0$ or $A=1$) and task $T$ (where $T=0$ or $T=1$). Initially, we have the same probability (0.25) for each $A$-$T$ pair. To introduce bias in the dataset, we modify the probabilities for specific groups. We add a term $\alpha$ to the group $\{A=0, T=0\}$ and subtract it from $\{A=1, T=1\}$. Here, $\alpha$ ranges from $-0.25$ to $0.25$ in steps of 0.005. This setup creates a dataset that is balanced only when $\alpha=0$; as $\alpha$ moves away from 0 (in either direction), the dataset becomes increasingly unbalanced. We follow the same setup to simulate the model predictions.

With $\alpha$ ranging from $-0.25$ to $0.25$, we create 100 different versions of the dataset and model predictions, controlled by $\alpha_{d}$ and $\alpha_{m}$, respectively. For each metric, we plot a $100\times 100$ heatmap of the reported bias amplification scores. Each pixel in the heatmap represents the bias amplification score for a specific {dataset, model} pair.
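The simulation can be sketched as follows; the sample size is an arbitrary choice, and the metric call is left as a placeholder where $BA_{\rightarrow}$, $Multi_{\rightarrow}$, or $DPA$ would be plugged in.

```python
import numpy as np

def joint(alpha):
    # Joint P(A, T): start from 0.25 everywhere, add alpha to (A=0, T=0)
    # and subtract it from (A=1, T=1).
    p = np.full((2, 2), 0.25)
    p[0, 0] += alpha
    p[1, 1] -= alpha
    return p

def sample(p, n=10_000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(4, size=n, p=p.ravel())
    return idx // 2, idx % 2                          # (A values, T values)

alphas = np.arange(-0.25, 0.25, 0.005)                # 100 configurations
heatmap = np.zeros((len(alphas), len(alphas)))
for i, alpha_m in enumerate(alphas):                  # rows: model configurations
    for j, alpha_d in enumerate(alphas):              # columns: dataset configurations
        A, T = sample(joint(alpha_d), seed=0)         # the "dataset"
        A_hat, T_hat = sample(joint(alpha_m), seed=1) # the "model predictions"
        heatmap[i, j] = 0.0  # placeholder: evaluate BA_->, Multi_->, or DPA here
```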

Figure 2 shows the heatmaps for all metrics. Figures 2(a) and 2(b) display the bias amplification heatmaps for $BA_{\rightarrow}$ and $DPA$, respectively. These heatmaps look similar, suggesting that both metrics show similar behavior. However, $BA_{\rightarrow}$ (Figure 2(a)) shows a distinct vertical green line in the center, indicating that when the dataset is balanced ($\alpha_{d}=0$ on the X-axis), bias amplification remains at 0, regardless of changes in the model's bias (indicated by varying $\alpha_{m}$ values on the Y-axis). In contrast, $DPA$ (Figure 2(b)) detects non-zero bias amplification whenever there is a shift in bias in either the dataset or the model predictions. Thus, $DPA$ is a more reliable metric for measuring bias amplification. $Multi_{\rightarrow}$ (Figure 2(c)) reports positive bias amplification in all scenarios, making it an unreliable metric.

7.2 Should we always use DPA?

While $DPA$ is generally the most reliable metric for measuring bias amplification, there are cases where $BA_{\rightarrow}$ is more suitable. Consider a job hiring dataset: 100 men ($A=0$) and 50 women ($A=1$) apply for a job. Out of these, 25 men and 25 women are hired ($T=0$), while the rest are rejected ($T=1$). Since the acceptance rate for women (50%) is higher than for men (25%), $BA_{\rightarrow}$ sees this as a bias. In contrast, $DPA$ interprets this as an unbiased scenario because the same number of men and women are hired. In situations like this, where $T=0$ (acceptance) is almost always less frequent than $T=1$ (rejection), $BA_{\rightarrow}$ may be a better fit, as it considers a dataset unbiased only if the $T=0$-to-$T=1$ ratio is the same for both genders.

In another scenario, imagine a dataset of men ($A=0$) and women ($A=1$), where each person is either indoors ($T=0$) or outdoors ($T=1$). It would make more sense to call this dataset unbiased when there is an equal number of instances for all $A$-$T$ pairs. In this case, $DPA$ is a better metric, as it treats a dataset as unbiased when all $A$-$T$ combinations have equal representation. $BA_{\rightarrow}$ and $DPA$ each measure distinct types of bias. The choice of metric depends on the specific bias we aim to address.

8 Conclusion

In this work, we showed how our novel predictability-based metric ($DPA$) can measure directional bias amplification, even for balanced datasets. We also showed that $DPA$ is easy to interpret and minimally sensitive to attacker models. $DPA$ is the only reliable directional metric for balanced datasets. It should be used on unbalanced datasets with an accurate understanding of the type of biases an end-user wants to measure.

References

  • Angwin et al. [2022] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. In Ethics of Data and Analytics, pages 254–264. Auerbach Publications, 2022.
  • Foulds et al. [2020] James R. Foulds, Rashidul Islam, Kamrun Naher Keya, and Shimei Pan. An intersectional definition of fairness. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1918–1921, 2020.
  • Hirota et al. [2022] Yusuke Hirota, Yuta Nakashima, and Noa Garcia. Quantifying societal bias amplification in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13450–13459, 2022.
  • Lin et al. [2019] Kun Lin, Nasim Sonboli, Bamshad Mobasher, and Robin Burke. Crank up the volume: preference bias amplification in collaborative recommendation, 2019.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Lundberg and Lee [2017] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • Seshadri et al. [2023] Preethi Seshadri, Sameer Singh, and Yanai Elazar. The bias amplification paradox in text-to-image generation. arXiv preprint arXiv:2308.00755, 2023.
  • Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
  • Wang and Russakovsky [2021] Angelina Wang and Olga Russakovsky. Directional bias amplification. In International Conference on Machine Learning, pages 10882–10893. PMLR, 2021.
  • Wang et al. [2019] Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5310–5319, 2019.
  • Yatskar et al. [2016] Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5534–5542, 2016.
  • Zhao et al. [2023] Dora Zhao, Jerone Andrews, and Alice Xiang. Men also do laundry: Multi-attribute bias amplification. In Proceedings of the 40th International Conference on Machine Learning, pages 42000–42017. PMLR, 2023.
  • Zhao et al. [2017] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457, 2017.

Supplementary Material

Appendix A: $Multi_{\rightarrow}$ Explanation

To understand why $Multi_{\rightarrow}$ cannot differentiate between positive bias amplification and negative bias amplification (i.e., bias reduction), let us take a look at its formulation.

$Multi_{\rightarrow} = X,\ Var(\Delta_{gm})$

$X = \frac{1}{|\mathcal{G}||\mathcal{M}|} \sum_{g\in\mathcal{G}} \sum_{m\in\mathcal{M}} y_{gm}|\Delta_{gm}| + (1 - y_{gm})|-\Delta_{gm}|$  (17)

where,

$y_{gm} = 1[P(m=1, g=1) > P(g=1)P(m=1)]$

and,

$\Delta_{gm} = \begin{cases} P(\hat{g}=1|m=1) - P(g=1|m=1) & \text{if measuring } M\rightarrow G \\ P(\hat{m}=1|g=1) - P(m=1|g=1) & \text{if measuring } G\rightarrow M \end{cases}$  (18)

Following [13], $\mathcal{M}$ represents the attribute groups and $\mathcal{G}$ represents the task groups.

From Equation 17, we get

$X = \frac{1}{|\mathcal{G}||\mathcal{M}|} \sum_{g\in\mathcal{G}} \sum_{m\in\mathcal{M}} y_{gm}|\Delta_{gm}| + |\Delta_{gm}| - y_{gm}|\Delta_{gm}|$

$\implies X = \frac{1}{|\mathcal{G}||\mathcal{M}|} \sum_{g\in\mathcal{G}} \sum_{m\in\mathcal{M}} |\Delta_{gm}|$  (19)

Hence, we see from Equations 18 and 19 that $Multi_{\rightarrow}$ simply measures the average absolute difference of the conditional probabilities. Due to the absolute value, any positive or negative bias amplification is treated in the same manner.

Appendix B Attacker Robustness

To show that the normalization improves robustness, we conducted the following experiment:

We define $A \sim \mathcal{N}(3, 2)$. We define $T$ and $\hat{T}$ in the following manner:

$T = poly(A + (\alpha_{1} \cdot \epsilon),\ p)$  (20)

$\hat{T} = poly(A + (\alpha_{2} \cdot \epsilon),\ p)$  (21)

Here, $poly(x, p)$ represents any $p^{th}$-degree polynomial of $x$, and $\epsilon \sim \mathcal{N}(0, 1)$. To demonstrate positive bias amplification, we want $\hat{T}$ to be a better predictor of $A$ than $T$. Hence, we set $\alpha_{2} < \alpha_{1}$.

As the attacker only needs to model a simple polynomial function, we use a simple fully connected network as the attacker. The attacker has varying depth $d$ and width $w$ with a combination of TanH and ReLU activations. We used the inverse of the RMSE loss as the quality function. Figure 3 shows the reported value of non-normalized and normalized DPA for different values of $w$. Table 5 lists the parameters used for this experiment.
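A minimal sketch of this experiment is shown below; the polynomial, the attacker configuration, and the specific widths are illustrative choices within the ranges of Table 5, and we follow the requirement that $\alpha_{2} < \alpha_{1}$.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, p = 5000, 2
alpha1, alpha2 = 2.0, 1.0                    # alpha_2 < alpha_1, as required above
A = rng.normal(3, 2, size=n)
eps1, eps2 = rng.normal(0, 1, size=(2, n))
T = (A + alpha1 * eps1) ** p                 # an illustrative degree-p polynomial
T_hat = (A + alpha2 * eps2) ** p             # T_hat predicts A better than T

def psi(x, y, width, depth=3):
    # Predictability with Q = 1 / RMSE, using a fully connected attacker.
    f = MLPRegressor(hidden_layer_sizes=(width,) * depth, activation="tanh",
                     max_iter=2000, random_state=0).fit(x.reshape(-1, 1), y)
    rmse = np.sqrt(np.mean((f.predict(x.reshape(-1, 1)) - y) ** 2))
    return 1.0 / rmse

for w in (20, 100, 500):                     # a few widths from Table 5's range
    psi_D, psi_M = psi(T, A, w), psi(T_hat, A, w)
    # non-normalized difference vs. normalized (DPA-style) value
    print(w, psi_M - psi_D, (psi_M - psi_D) / (psi_M + psi_D))
```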

Figure 3: Non-normalized vs. normalized DPA.

In Figure 3, non-normalized $DPA$ shows unstable bias amplification values with high variance across different attacker models. On the other hand, normalized $DPA$ shows relatively stable bias amplification values with minimal variance across models of different sizes. Hence, we conclude that normalized $DPA$ is more robust to changes in attacker hyperparameters.

Parameter | $p$ | $\alpha_{1}$ | $\alpha_{2}$ | $w$ | $d$
Value | 2 | 1 | 2 | [20, 500] | [2, 6]
Table 5: Experiment Parameters