
Fairness Testing of Deep Image Classification with Adequacy Metrics

Peixin Zhang (Zhejiang University), Jingyi Wang (Zhejiang University), Jun Sun (Singapore Management University), and Xinyu Wang (Zhejiang University)
Abstract.

As deep image classification applications, e.g., face recognition, become increasingly prevalent in our daily lives, their fairness issues raise increasing concern. It is thus crucial to comprehensively test the fairness of these applications before deployment. Existing fairness testing methods suffer from the following limitations: 1) applicability, i.e., they are only applicable to structured data or text and cannot handle the high-dimensional, semantically abstract domain sampling required by image classification applications; 2) functionality, i.e., they generate unfair samples without providing testing criteria to characterize the model’s fairness adequacy. To fill the gap, we propose DeepFAIT, a systematic fairness testing framework specifically designed for deep image classification applications. DeepFAIT consists of several important components enabling effective fairness testing of deep image classification applications: 1) a neuron selection strategy to identify the fairness-related neurons; 2) a set of multi-granularity adequacy metrics to evaluate the model’s fairness; 3) a test selection algorithm for fixing the fairness issues efficiently. We have conducted experiments on widely adopted large-scale face recognition datasets, i.e., VGGFace and FairFace. The experimental results confirm that our approach can effectively identify the fairness-related neurons, characterize the model’s fairness, and select the most valuable test cases to mitigate the model’s fairness issues.

1. Introduction

Deep learning (DL) has created a new programming paradigm for solving many real-world problems, e.g., computer vision (Schroff et al., 2015), medical diagnosis (Vieira et al., 2017), and natural language processing (Wulczyn et al., 2017). However, DL is far from trustworthy enough to be applied in certain ethics-critical scenarios, e.g., toxic language detection (Wiegand et al., 2019) and facial recognition (Yucer et al., 2020), as decisions of DL models can be unfair, i.e., discriminating against minorities or vulnerable subpopulations, which has raised wide public concern (High-Level Expert Group on Artificial Intelligence (AI HLEG), 2018). Therefore, just like for traditional software, it is all the more important to test the fairness of DL models systematically before their deployment.

Unfortunately, there is still no commonly agreed definition of fairness. Existing fairness formalizations typically concern different sub-populations (Feldman et al., 2015; Bastani et al., 2019; Dwork et al., 2012; Garg et al., 2019). These sub-populations are normally determined by different domains (values) of sensitive attributes (e.g., race and gender), which are often application-dependent. To name a few, demographic parity requires that minority candidates be classified at approximately the same rate as majority members (Feldman et al., 2015; Bastani et al., 2019). Individual discrimination states that a well-trained model must output approximately the same predictions for instances (i.e., pairs of instances) which only differ in sensitive attributes (Dwork et al., 2012; Garg et al., 2019). We refer the readers to (Thomas et al., 2019) for detailed definitions of fairness and remark that we focus on individual fairness in this work.

Multiple recent works (Galhotra et al., 2017; Udeshi et al., 2018; Aggarwal et al., 2019; Zhang et al., 2020) have investigated the fairness testing problem of machine learning models (so far restricted to individual fairness). For instance, THEMIS (Galhotra et al., 2017) measures the frequency of unfair samples by randomly sampling the value domain of each attribute. AEQUITAS (Udeshi et al., 2018) integrates a global and a local phase to search for unfair samples in the input space more systematically. Symbolic Generation (Aggarwal et al., 2019) utilizes a constraint solver (Wang et al., 2018) to solve the paths on the local explanation decision tree (Ribeiro et al., 2016) of a given seed sample and thereby acquire a large number of diverse unfair samples. The state-of-the-art work ADF (Zhang et al., 2020, 2021) and its variants (Zhang, Zhang, and Zhang, 2021) adopt a gradient-guided search strategy to identify unfair samples more effectively. Despite the considerable progress, existing fairness testing works still suffer from the following limitations: 1) applicability, i.e., they are only applicable to structured data or text and cannot handle the high-dimensional, semantically abstract domain sampling required by image classification applications; 2) functionality, i.e., they generate unfair samples without providing testing criteria to characterize the model’s fairness adequacy.

Figure 1. Overview of DeepFAIT. Among the generated data, from left to right, the image pairs are crafted by Random Generation (RG), Gradient-based Generation (GG), and Gaussian-noise Injection (GI), respectively. The blue and brown dashed lines indicate the process of measuring testing adequacy and fairness enhancement, respectively.

To fill the gap, we propose a systematic fairness testing framework named DeepFAIT, which is specifically designed for evaluating and improving the fairness adequacy of deep image classification applications. DeepFAIT provides several key functionalities enabling effective fairness testing of image applications: 1) a neuron selection strategy to identify the fairness-related neurons; 2) a set of multi-granularity adequacy metrics to evaluate the model’s fairness; 3) a test selection algorithm for fixing the fairness issues efficiently. We address multiple technical challenges to realize DeepFAIT. Specifically, as shown in Fig. 1, DeepFAIT consists of five modules. First, we adopt a widely-used image-to-image transformation technique, i.e., Generative Adversarial Networks (GANs) (Zhu et al., 2017), to transform images across the sensitive domains. Then, we apply significance testing on the activation differences of neurons to obtain the fairness-related neurons, and design 5 testing metrics at both the layer and neuron levels based on the identified fairness-related neurons. Next, we implement three test case generation strategies, including fairness testing methods and image processing techniques, to generate a variety of unfair samples. Last, we propose a test selection algorithm to select the more valuable test cases and repair the model at a smaller cost.

DeepFAIT has been implemented as an open-source self-contained toolkit. We have evaluated DeepFAIT on widely adopted large-scale face recognition datasets (VGGFace (Parkhi et al., 2015) and FairFace (Karkkainen and Joo, 2021)). The results show that, compared with DeepImportance (Gerasimou et al., 2020), DeepFAIT is more capable of identifying the fairness-sensitive neurons of the model. Furthermore, the proposed testing metrics calculated on these neurons are highly correlated with fairness and can be used to guide the search for unfair samples effectively. More importantly, the fairness issues can be fixed by selecting a small number of test cases with our test selection algorithm to further train the model.

In a nutshell, we make the following technical contributions:

  • We propose a systematic fairness testing framework specially designed for deep image classification applications, consisting of a set of multi-granularity fairness adequacy metrics on fairness-related neurons.

  • Based on the proposed adequacy metrics, we propose a test selection algorithm to estimate the value of each test case for improving the model’s fairness, thereby reducing the cost of fixing the model.

  • We implemented DeepFAIT as a self-contained toolkit, which can be freely accessed online (https://github.com/icse44/DeepFAIT). The evaluation shows that the proposed testing criteria in DeepFAIT are well correlated with the fairness of DL models and are effective in guiding the selection of unfair samples.

We organize the remainder of the paper as follows. We provide the necessary background on DNNs and robustness testing criteria in Section 2. We then present DeepFAIT in detail in Section 3. In Section 4, we discuss the experimental setup and results. Lastly, we briefly review the related works in Section 5 and conclude in Section 6.

2. Preliminaries

2.1. Deep Neural Network

In this work, we focus on the fairness testing of deep learning models, specifically, deep neural networks (DNNs) for image classification applications. A deep neural network M consists of multiple hidden layers between an input layer and an output layer, and can be denoted as a tuple M=(I,L,\Phi,TS) where

  • I is the input layer;

  • L=\{L_{j} \mid j\in\{1,\dots,J\}\} is the set of hidden layers plus the output layer; the number of neurons in the j-th layer is |L_{j}|; the k-th neuron in layer L_{j} is denoted as n_{j,k}, and its output value with respect to the input x is v_{j,k}(x);

  • \Phi is a set of activation functions, e.g., Sigmoid, Hyperbolic Tangent (TanH), or Rectified Linear Unit (ReLU);

  • TS: L\times\Phi\to L is a set of transitions between layers. As shown in Equation 1, the neuron activation value v_{j,k}(x) is computed by applying the activation function to the weighted sum of the activation values of all the neurons in its previous layer, where the weights represent the strength of the connections between two linked neurons.

    (1) v_{j,k}(x)=\phi(\sum_{l=1}^{|L_{j-1}|}\omega_{j-1,k,l}\cdot v_{j-1,l}(x))

A classification DNN can be defined as M:X\to Y, which maps a given input x\in X to an output label y\in Y by propagating layer by layer as above.
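To make Equation 1 concrete, the following is a minimal sketch (not the DeepFAIT implementation) of the layer-by-layer propagation for a fully-connected network; the layer sizes, random weights, and choice of ReLU are illustrative assumptions.

```python
import numpy as np

def forward_activations(x, weights, phi=lambda z: np.maximum(z, 0.0)):
    """Compute v_{j,k}(x) for every layer of a fully-connected DNN.

    weights[j] is a |L_{j+1}| x |L_j| matrix, so each neuron value is the
    activation function applied to the weighted sum over the previous
    layer, as in Equation 1. Returns one activation vector per layer.
    """
    activations = []
    v = x  # the input layer I
    for W in weights:
        v = phi(W @ v)  # v_{j,k}(x) = phi(sum_l w_{j-1,k,l} * v_{j-1,l}(x))
        activations.append(v)
    return activations

# Illustrative 3-4-2 network with random weights.
rng = np.random.default_rng(0)
ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
print(forward_activations(np.array([0.5, -1.0, 2.0]), ws))
```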

2.2. Individual Fairness for Image Classification

We denote the fairness sensitive attribute of interest as SA (e.g., race). Note that for image classification, SA is hidden from X. We define HF:X\to SA as a function which returns the sensitive attribute of a given sample x\in X. We further define X_{A}\subset X as the set of samples satisfying HF(x)=A, where x\in X and A\in SA. To change the sensitive attribute of a sample x, we define a transformation function T_{A\to B}:X\to X which transforms a sample from X_{A} to X_{B} while preserving other information. Then, we define the individual fairness of an image classification model M as follows (Thomas et al., 2019).

Definition 2.1 (Individual Fairness).

Given an image classification model M trained on X, we define it to be individually fair iff there exists no data x\in X satisfying the following conditions:

  • x\in X_{A},\ x^{\prime}=T_{A\to B}(x);

  • M(x)\neq M(x^{\prime}).

On the other hand, x (and x^{\prime}) is called an unfair sample if x satisfies the above conditions.
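As an illustration of Definition 2.1, a straightforward check (an assumed sketch, not the paper's tool) for unfair samples given a trained classifier M and a realized transformation T is:

```python
import torch

def find_unfair_samples(model, transform, loader, device="cpu"):
    """Flag samples x with M(x) != M(T(x)), i.e., unfair samples (Def. 2.1).

    `model` is assumed to return class logits and `transform` to realize
    T_{A->B} (e.g., a trained CycleGAN generator); both operate on batches.
    """
    unfair = []
    model.eval()
    with torch.no_grad():
        for batch_idx, (x, _) in enumerate(loader):
            x = x.to(device)
            x_prime = transform(x)                    # x' = T_{A->B}(x)
            y = model(x).argmax(dim=1)
            y_prime = model(x_prime).argmax(dim=1)
            for i in (y != y_prime).nonzero(as_tuple=True)[0]:
                unfair.append((batch_idx, i.item()))  # locate the unfair pair
    return unfair
```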

2.3. Robustness Testing Criteria

A variety of robustness testing criteria for DNNs have been proposed (Pei et al., 2017; Ma et al., 2018a; Kim et al., 2019; Wang et al., 2021; Feng et al., 2020; Gerasimou et al., 2020). We briefly introduce the following representative robustness testing metrics and refer readers to the original papers for details.

Neuron Activation. Neuron coverage (Pei et al., 2017) is the first robustness testing metric for DNNs; it computes the percentage of activated neurons, i.e., neurons whose values are greater than a threshold. Later, DeepGauge (Ma et al., 2018a) extends it with multi-granularity neuron coverage criteria at two different levels: 1) the neuron level, e.g., k-Multisection Neuron Coverage, Neuron Boundary Coverage and Strong Neuron Activation Coverage, focusing on the value distribution of a single neuron, and 2) the layer level, measuring the ranking of neuron values in each layer, e.g., Top-k Neuron Coverage and Top-k Neuron Patterns. Surprise Adequacy (Kim et al., 2019) evaluates the similarity between a test case and the training data based on the kernel density estimation or Euclidean distance of neuron activation traces. Importance-driven coverage (Gerasimou et al., 2020) measures the value of neurons from another perspective, i.e., the contribution of each neuron within the same layer to the prediction.

Output Impurity. Unlike the aforementioned work, DeepGini (Feng et al., 2020) only takes the output vector into consideration and measures the likelihood of misclassification by Gini impurity.

Loss Convergence. RobOT (Wang et al., 2021) proposed the First-Order Loss to measure convergence quality, inspired by the observation that if we perturb an instance along its gradient with respect to the loss, the loss will increase and gradually converge (Madry et al., 2018).

2.4. Problem Definition

Different from the previous robustness testing works, we aim to propose a set of testing adequacy metrics for individual fairness specially designed for image classification. In particular, we aim to achieve the following research objectives:

  • How can we design testing adequacy metrics which are well correlated with the model’s fairness?

  • How can we select test cases which can effectively fix the model’s fairness issues?

3. DeepFAIT Framework

As shown in Figure 1, DeepFAIT systematically tests, evaluates and improves a DNN’s fairness with the following 4 main components:

  1) Domain transformation. We develop a method based on CycleGAN (Zhu et al., 2017) to realize the transformation function T_{A\to B}.

  2) Fairness-related neuron selection. We propose to conduct testing in a more effective way by focusing on the neurons which are strongly correlated with the model’s fairness.

  3) Multi-granularity metric analysis. We design a set of multi-granularity fairness testing coverage metrics to measure the adequacy of fairness testing. These metrics are calculated on the selected fairness-related neurons to be more effective.

  4) Fairness enhancement. We develop a set of test case generation algorithms to identify a diverse set of unfair samples and mitigate discrimination by augmented training on these unfair samples. To further reduce the cost, we also propose a test selection algorithm that selects the most valuable test cases for the model’s fairness enhancement based on the proposed metrics.

In the following, we present the details of each component.

3.1. Domain Transformation

Figure 2. Structure of the CycleGAN model (A to B). The blue and red triangles indicate data from A and B respectively, the solid and dotted frames represent the original and generated data respectively, the black arrows indicate the data flow, and the purple lines indicate the computation of the loss functions. The model for B to A has a similar structure.

The first question is how to realize the domain transformation function T_{A\to B}. Note that this is straightforward for structured or text data, where it can be done by replacing the protected feature or token with a value from a predefined domain (Zhang et al., 2020, 2021). However, for image data, the sensitive attribute of interest is hidden from the input feature space. We thus follow (Yucer et al., 2020) and adopt CycleGAN (Zhu et al., 2017) to transform images across different protected domains as follows.

As shown in Figure 2, CycleGAN provides a mechanism to transfer from domain A to domain B and from B to A with two corresponding generative models T_{A\to B} and T_{B\to A}. Similar to the traditional GAN (Goodfellow et al., 2014), it contains two discriminators D_{A} and D_{B} to distinguish whether an input is ‘real’, i.e., whether a sample comes from the original dataset or is generated by a generative model.

The loss function consists of three parts. The first is the adversarial loss, which is defined as follows:

(2) L_{GAN}(T_{A\to B},D_{B},A,B)=\mathbb{E}[\log(1-D_{B}(T_{A\to B}(X_{A})))]+\mathbb{E}[\log D_{B}(X_{B})]

The transformer T_{A\to B} aims to synthesize a picture satisfying the distribution of X_{B} based on a seed from X_{A}, while the goal of the discriminator D_{B} is to distinguish the raw images X_{B} from the artificial ones T_{A\to B}(X_{A}). T_{B\to A} and D_{A} have the same form of adversarial loss. Since domain transformation needs to modify other information as little as possible while changing the sensitive attribute, the second part of the objective function ensures that the generated image stays as close as possible to the raw one, which is defined as follows:

(3) L_{IDE}(T_{A\to B})=\mathbb{E}[\|T_{A\to B}(X_{A})-X_{A}\|_{p}]

In addition, it also needs to ensure that T_{A\to B}(X_{A}) lies in the data distribution of X_{B}. To this end, CycleGAN introduces the cycle consistency loss as the core of its joint optimization objective to make the synthesized image more realistic. It is defined as follows.

(4) L_{CYC}(T_{A\to B},T_{B\to A})=\mathbb{E}[\|T_{B\to A}(T_{A\to B}(X_{A}))-X_{A}\|_{p}]+\mathbb{E}[\|T_{A\to B}(T_{B\to A}(X_{B}))-X_{B}\|_{p}]

The intuition is that a pair of well-trained generators can recover the original image through the reconstruction process, i.e., for a given x\in X_{A}, T_{B\to A}(T_{A\to B}(x))=x. Overall, the complete loss function of CycleGAN is defined as follows.

(5) L=L_{GAN}(T_{A\to B},D_{B},A,B)+L_{GAN}(T_{B\to A},D_{A},B,A)+\gamma(L_{IDE}(T_{A\to B})+L_{IDE}(T_{B\to A}))+\eta L_{CYC}(T_{A\to B},T_{B\to A})

where \gamma and \eta are hyperparameters to balance the three losses. In the “Raw Data” frame of Figure 1, we show an example race transformation (Caucasian to African) from VGGFace.
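For concreteness, the following is a simplified sketch of the generator-side objective in Equation 5 (discriminator updates and training details are omitted; the non-saturating BCE form of the adversarial term and probability-valued discriminators are assumed stand-ins, and gamma and eta follow Table 2).

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(T_ab, T_ba, D_a, D_b, x_a, x_b, gamma=5.0, eta=10.0):
    """Generator-side part of Equation 5: adversarial + identity + cycle terms."""
    fake_b, fake_a = T_ab(x_a), T_ba(x_b)

    # Adversarial terms (cf. Equation 2): generators try to fool D_A and D_B,
    # which are assumed to output probabilities in [0, 1].
    pred_b, pred_a = D_b(fake_b), D_a(fake_a)
    adv = F.binary_cross_entropy(pred_b, torch.ones_like(pred_b)) + \
          F.binary_cross_entropy(pred_a, torch.ones_like(pred_a))

    # Identity terms (Equation 3): keep non-sensitive information unchanged.
    ide = (fake_b - x_a).abs().mean() + (fake_a - x_b).abs().mean()

    # Cycle-consistency terms (Equation 4): reconstruct the original images.
    cyc = (T_ba(fake_b) - x_a).abs().mean() + (T_ab(fake_a) - x_b).abs().mean()

    return adv + gamma * ide + eta * cyc
```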

3.2. Fairness-related Neuron Selection

The next step is to select the fairness-related neurons, i.e., those neurons in the DNN which may have a significant impact on the fairness of the model’s decisions. The benefit is that, rather than blindly selecting neurons in certain layers (Ma et al., 2018a) or only the model’s output (Feng et al., 2020), it enables us to test the model’s fairness in a more fine-grained and focused way as these neurons are more correlated with the model’s fairness.

Our key idea is to decide whether a neuron is fairness-related by significance testing (Kruskal and Wallis, 1952) as follows. Formally, we make the following null hypothesis on a given neuron n_{j,k}.

(6) H_{0}: n_{j,k}\ \text{is not fairness-related}

In addition, we use a standard parameter (a.k.a. the significance level), \alpha, to control the probability of making errors, e.g., rejecting H_{0} when H_{0} is true. We then use the Kruskal-Wallis test (Kruskal and Wallis, 1952) (also known as the H-test) to test the above hypothesis. The intuition is that we can identify a fairness-related neuron by looking at the difference between its activation distributions on fair samples and on unfair samples: the larger the difference, the more fairness-related the neuron is. Specifically, given the training dataset X and the transformation T, we collect the activation value differences on neuron n_{j,k} over X:

(7) XD=\{\|v_{j,k}(x)-v_{j,k}(x^{\prime})\| \mid x\in X,\ x^{\prime}=T(x)\}.

We further divide XD into two disjoint subsets XD_{f}, f=0,1, according to whether the prediction results M(x) and M(x^{\prime}) are equal or not, respectively.

We first sort XD in ascending order and denote the rank of the i-th element in XD_{f} as r_{f}^{i}. Then we add up the ranks in each subset XD_{f} to obtain the rank sum, denoted as R_{f}=\sum_{i}r_{f}^{i}. When the null hypothesis H_{0} is true, the average rank of each subset should be close to that of all samples, i.e., (|XD|+1)/2, and we thus use the following equation to compute the rank variance between subsets:

(8) RVS=\sum_{f=0,1}|XD_{f}|\left(\frac{R_{f}}{|XD_{f}|}-\frac{|XD|+1}{2}\right)^{2}

To eliminate the influence of scale, we then calculate the average rank variance over all samples, which is defined as follows.

(9) ARV=\frac{1}{|XD|-1}\sum_{f=0,1}\sum_{i=1}^{|XD_{f}|}\left(r_{f}^{i}-\frac{|XD|+1}{2}\right)^{2}=\frac{|XD|(|XD|+1)}{12}

Note that the degree of freedom of the sample variance is |XD|-1. Thus, the Kruskal-Wallis rank-sum statistic H is given by

(10) H=\frac{RVS}{ARV}=\frac{12}{|XD|(|XD|+1)}\sum_{f=0,1}\frac{R_{f}^{2}}{|XD_{f}|}-3(|XD|+1)

When |XD| is large, H approximately follows a chi-square distribution (Pearson, 1900) with one degree of freedom, H\sim\chi^{2}(1). Therefore, the critical value of the Kruskal-Wallis test, H_{c}, corresponding to \alpha is determined from the chi-square distribution table. That is, if the computed statistic H>H_{c}, we reject H_{0} and conclude that the neuron n_{j,k} is fairness-related.
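The selection procedure can be sketched with scipy's built-in Kruskal-Wallis test, which computes the same H statistic and its chi-square p-value; the data layout below is an illustrative assumption.

```python
import numpy as np
from scipy.stats import kruskal

def select_fairness_related_neurons(acts, acts_prime, fair_mask, alpha=0.05):
    """Return the fairness-related neurons of one layer via the H-test.

    acts, acts_prime: arrays of shape (num_samples, num_neurons) holding
    v_{j,k}(x) and v_{j,k}(x') for the pairs (x, x' = T(x)).
    fair_mask: boolean array, True where M(x) == M(x').
    A neuron is kept when H0 is rejected at significance level alpha.
    """
    xd = np.abs(acts - acts_prime)  # activation differences XD, per neuron
    selected = []
    for k in range(xd.shape[1]):
        xd_fair, xd_unfair = xd[fair_mask, k], xd[~fair_mask, k]
        if len(xd_fair) == 0 or len(xd_unfair) == 0:
            continue
        h_stat, p_value = kruskal(xd_fair, xd_unfair)
        if p_value < alpha:         # equivalent to H > H_c under chi^2(1)
            selected.append((k, h_stat))
    return selected
```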

3.3. Fairness Adequacy Metrics

Next, we propose a set of testing metrics to measure the adequacy of DNN fairness testing. Note that, different from the robustness testing metrics (Ma et al., 2018a; Gerasimou et al., 2020; Wang et al., 2021; Feng et al., 2020; Kim et al., 2019), fairness testing metrics are based on the behavioral differences between an instance pair, i.e., x and x^{\prime} (which only differ in certain sensitive attributes). One important desirable property of such metrics is that they should be well correlated with the model’s fairness.

Our fairness metrics satisfy the above property as we define them on the basis of the selected fairness-related neurons. Specifically, let NF_{j}\subset\{n_{j,1},\dots,n_{j,|L_{j}|}\} be the set of fairness-related neurons within layer L_{j}. We denote the activation value vector and the boolean activation pattern over the neurons in NF_{j} with respect to an input x as \vec{v}(x,NF_{j}) and \vec{a}(x,NF_{j}), respectively. In \vec{a}(x,NF_{j}), 1 and 0 represent whether the value of n_{j,k}\in NF_{j} is the same before and after ReLU or not, respectively, i.e., whether the neuron is activated. We first characterize the model’s behavioral differences between the inputs x and x^{\prime} at the layer level, as the decision is propagated layer by layer.

Tanimoto Coefficient. The Tanimoto coefficient (Tanimoto, 1968) is a similarity ratio defined over bitmaps. In a DNN, the activation of a neuron indicates whether the abstract feature of the neuron will be used in the subsequent decision-making process. As shown in Equation 11, we compute the ratio of the number of commonly activated neurons (i.e., shared nonzero bits) to the number of neurons activated by either sample.

(11) TC(x,x^{\prime},NF_{j})=\frac{\vec{A}\cdot\vec{A^{\prime}}}{\|\vec{A}\|^{2}+\|\vec{A^{\prime}}\|^{2}-\vec{A}\cdot\vec{A^{\prime}}}

where \vec{A}=\vec{a}(x,NF_{j}) and \vec{A^{\prime}}=\vec{a}(x^{\prime},NF_{j}).

Cosine Similarity. Cosine similarity is one of the most commonly used similarity measures in machine learning applications. For example, it is often used to judge whether two faces belong to the same person in face recognition (Huang et al., 2008; Yucer et al., 2020). We calculate the cosine similarity of the activation traces in the layer representation space, which is defined as follows:

(12) CS(x,x^{\prime},NF_{j})=\frac{\vec{v}(x,NF_{j})\cdot\vec{v}(x^{\prime},NF_{j})}{\|\vec{v}(x,NF_{j})\|\cdot\|\vec{v}(x^{\prime},NF_{j})\|}

Spearman Correlation. Spearman correlation (Spearman, 1904) is a non-parametric statistic measuring the dependence between the rankings of two variables. Although it discards some information, i.e., the real activation values, it retains the order of the neurons’ activation status, which is an essential characteristic, since neurons within the same layer often learn similar features and the closer an activated neuron is to the output layer, the more important it is for the model’s decision (Ma et al., 2018a). Formally, the Spearman correlation is computed by

(13) SC(x,x^{\prime},NF_{j})=1-\frac{6\sum_{k=1}^{|NF_{j}|}(r(\vec{v}_{j,k}(x))-r(\vec{v}_{j,k}(x^{\prime})))^{2}}{|NF_{j}|(|NF_{j}|^{2}-1)}

where r(\cdot) is the rank function.
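A minimal sketch of the three layer-level metrics (Equations 11-13), assuming the activation vectors over the fairness-related neurons NF_j have already been extracted and that activation means a positive post-ReLU value:

```python
import numpy as np
from scipy.stats import spearmanr

def layer_level_metrics(v, v_prime):
    """Tanimoto, cosine, and Spearman metrics for one layer (Eqs. 11-13).

    v, v_prime: activation vectors of x and x' over the neurons in NF_j.
    """
    a, a_prime = (v > 0).astype(float), (v_prime > 0).astype(float)
    common = a @ a_prime                     # neurons activated by both inputs
    tanimoto = common / (a.sum() + a_prime.sum() - common)                  # Eq. 11
    cosine = (v @ v_prime) / (np.linalg.norm(v) * np.linalg.norm(v_prime))  # Eq. 12
    spearman = spearmanr(v, v_prime).correlation                            # Eq. 13
    return tanimoto, cosine, spearman
```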

The above layer-level metrics are based on the statistical results on a set of neurons in a layer. In addition, we also provide a more fine-grained metric based on a single neuron.

Neuron Distance. As presented in Section 2.1, each neuron within the same layer independently learns and extracts features from its previous layer. Therefore, the neuron distance is a fine-grained metric to characterize the diversity of model behavior differences. In particular, for each fairness-related neuron n_{j,k}\in NF_{j}, the distance between x and x^{\prime} is denoted as nd(x,x^{\prime},n_{j,k}), which is defined as follows:

(14) nd(x,x^{\prime},n_{j,k})=\begin{cases}|v_{j,k}(x)-v_{j,k}(x^{\prime})|,&\text{absolute}\\ \frac{\max(v_{j,k}(x),v_{j,k}(x^{\prime}))}{\min(v_{j,k}(x),v_{j,k}(x^{\prime}))},&\text{relative}\end{cases}

Note that the absolute and relative distances are only computed when n_{j,k} is activated by both x and x^{\prime} (Sun et al., 2018a).
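A sketch of the neuron-level distance in Equation 14 (returning nothing when the neuron is not activated by both inputs):

```python
def neuron_distance(v_k, v_prime_k, mode="absolute"):
    """Neuron distance nd(x, x', n_{j,k}) of Equation 14."""
    if v_k <= 0 or v_prime_k <= 0:  # both activations required (post-ReLU > 0)
        return None
    if mode == "absolute":
        return abs(v_k - v_prime_k)
    return max(v_k, v_prime_k) / min(v_k, v_prime_k)  # relative distance
```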

Finally, we define the coverage for each testing metric by dividing its value range (based on X and the domain transformer T) into Z equal bins, denoted as B^{z}, and then compute the coverage ratio as follows:

(15) \frac{\sum_{i=1}^{|C|}|\{B_{i}^{z} \mid \exists x\in X:C_{i}(x,T(x))\in B_{i}^{z}\}|}{Z\cdot|C|}

where |C| is the number of dimensions over which the metric is collected. For the layer-level similarities, we traverse layer by layer to calculate the value, thus |C|=J, and the value range is [0,1] for the Tanimoto and cosine coefficients and [-1,1] for the Spearman correlation. For the neuron distance, we select the topK neurons in each layer to analyze, thus |C|=J*topK, and the upper bound is computed from the training data (Ma et al., 2018a).
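The coverage ratio of Equation 15 can then be sketched as follows; the per-dimension value ranges are assumed to come from the fixed ranges above or from bounds estimated on the training data.

```python
import numpy as np

def coverage_ratio(metric_values, value_ranges, Z=1000):
    """Coverage ratio of Equation 15.

    metric_values: one list of metric values C_i(x, T(x)) per dimension
    (layers for the similarities, or layer x topK neurons for the distances).
    value_ranges: one (lower, upper) bound pair per dimension.
    """
    covered, total = 0, Z * len(metric_values)
    for values, (lo, hi) in zip(metric_values, value_ranges):
        bins = np.floor((np.asarray(values) - lo) / (hi - lo) * Z).astype(int)
        bins = np.clip(bins, 0, Z - 1)   # values exactly at the upper bound
        covered += len(np.unique(bins))  # bins hit by at least one pair
    return covered / total
```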

3.4. Fairness Enhancement

The above metrics enable us to evaluate a model’s fairness adequacy. The follow-up questions are 1) how to generate diverse test cases to improve the fairness adequacy, and 2) how to select the most valuable test cases to enhance the model’s fairness by augmented training, which has been proven useful in enhancing the robustness or fairness of DNNs (Ma et al., 2018a; Zhang et al., 2020, 2021).

We address the first question by introducing the following test case generation methods, drawing both on the fairness testing literature and on traditional digital image processing. Given a fair sample pair (x,x^{\prime}), where x and x^{\prime} differ only in certain sensitive attributes, we aim to maximize the output difference between them after applying a perturbation p to both, which is formally defined as follows:

(16) \text{argmax}_{p}\{M(x+p)-M(x^{\prime}+p)\}

We introduce the following perturbation methods for approximating the above goal.

Random Generation (RG). Random testing is the most common method in software testing and is adopted by THEMIS (Galhotra et al., 2017) and AEQUITAS (Udeshi et al., 2018). Here, we select the perturbed pixels and the perturbation direction randomly as follows:

(17) p=random(-1,0,1)*step\_size

where 0 means the pixel value is retained, and 1 and -1 represent increasing and decreasing the pixel value by step\_size, respectively.

Gradient-based Generation (GG). The gradient is known to be an effective tool for generating test cases for DL models (Zhang et al., 2020, 2021; Goodfellow et al., 2015; Papernot et al., 2016). We utilize the gradient-based method in (Zhang et al., 2020, 2021) to approximate the solution of the optimization problem in Equation 16, which relies on the sign of the gradient of the loss function with respect to the input, i.e., sg=sign(\nabla_{x}J(x,y)) and sg^{\prime}=sign(\nabla_{x^{\prime}}J(x^{\prime},y)). As shown in Equation 18, we then choose the pixels on which the two signs agree and perturb them in that common direction.

(18) p=(sg_{i}==sg_{i}^{\prime})*sg_{i}*step\_size
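A simplified sketch of one gradient-based generation (GG) step following Equation 18; the cross-entropy loss, shared label y, and pixel range [0, 255] are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_based_perturbation(model, x, x_prime, y, step_size=5.0):
    """Perturb a pair (x, x') along the pixels whose gradient signs agree."""
    x = x.clone().detach().requires_grad_(True)
    x_prime = x_prime.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_prime), y)
    loss.backward()

    sg, sg_prime = x.grad.sign(), x_prime.grad.sign()
    p = (sg == sg_prime).float() * sg * step_size   # Equation 18
    # The same perturbation p is applied to both images of the pair.
    return (x + p).clamp(0, 255).detach(), (x_prime + p).clamp(0, 255).detach()
```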

Gaussian-noise Injection (GI). Gaussian noise is a common type of noise in the image domain (especially for RGB images), physically caused by poor lighting and high sensor temperature. Therefore, natural-looking synthetic images are often acquired by applying Gaussian perturbations (An, 1996; Arpit et al., 2017; Gerasimou et al., 2020). The probability density of the Gaussian noise (perturbation) p is defined as follows:

(19) f(p)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(p-\mu)^{2}}{2\sigma^{2}}}

where \mu and \sigma are the mean and standard deviation, respectively.
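A corresponding sketch of Gaussian-noise injection (GI) with the Table 2 parameters (mu = 0, sigma = 7); the numpy array input and [0, 255] pixel range are assumptions.

```python
import numpy as np

def gaussian_noise_injection(x, mu=0.0, sigma=7.0, rng=None):
    """Add Gaussian noise p ~ N(mu, sigma^2) to an image (cf. Equation 19)."""
    rng = rng or np.random.default_rng()
    p = rng.normal(loc=mu, scale=sigma, size=x.shape)  # sampled perturbation
    return np.clip(x + p, 0, 255)
```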

After generating test cases to increase the testing adequacy, i.e., the coverage of the metrics, we then address the second question, which is to select a subset of test cases for augmented training so as to improve the model’s fairness more efficiently. In particular, we prefer to select a set of test cases with diverse metric values, inspired by (Wang et al., 2021). Note that the neuron distance of each sample is a vector whose values are collected from each neuron independently, which makes it difficult to quantify accurately. We thus only utilize the layer-level metrics, i.e., the Tanimoto coefficient, cosine similarity, and Spearman correlation, for selection.

Specifically, we adopt the K-Multisection Strategy (KM-ST) in our work, which uniformly samples the value space of a given metric. Formally, let C_{min} and C_{max} be the minimum and maximum values of criterion C collected from all the test cases, n be the total number of test cases we need, and k be the number of sections. We first divide the value range [C_{min},C_{max}] into k equal sections with interval d=(C_{max}-C_{min})/k, and then randomly select n/k samples from each section independently to form the final augmentation dataset.
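A sketch of the KM-ST selection described above, over one layer-level criterion; variable names are illustrative.

```python
import numpy as np

def km_st_select(metric_values, n, k, rng=None):
    """Select roughly n test cases uniformly across k sections of [C_min, C_max].

    metric_values: 1-D array holding the criterion value C of every candidate
    pair. Returns the indices of the selected test cases.
    """
    rng = rng or np.random.default_rng()
    values = np.asarray(metric_values)
    c_min, c_max = values.min(), values.max()
    d = (c_max - c_min) / k
    selected = []
    for i in range(k):
        lo, hi = c_min + i * d, c_min + (i + 1) * d
        hi_mask = values < hi if i < k - 1 else values <= hi
        in_section = np.where((values >= lo) & hi_mask)[0]
        if len(in_section) > 0:  # a section may hold fewer than n/k candidates
            take = min(n // k, len(in_section))
            selected.extend(rng.choice(in_section, size=take, replace=False))
    return np.array(selected)
```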

4. Experiments

We have implemented DeepFAIT as a self-contained toolkit based on PyTorch. We have published the source code, along with all the experiment details and results, online. The evaluations are conducted on a server with an Intel Xeon 3.50GHz CPU, 64GB of memory, and 2 NVIDIA GTX 1080Ti GPUs.

4.1. Experimental Setup

Datasets and Models. We adopt 4 open-source datasets in our evaluation. Two of them are used for training the domain transformers (refer to Section 3.1) for the protected attributes gender and race respectively, and the remaining two are the experiment subjects.

  • CelebA (Liu et al., 2015, [n.d.]) is a widely used large-scale face recognition dataset. It contains 202,599 face images from 10,177 celebrities around the world. Each image has 5 landmark locations and 40 binary attributes, e.g., male, big lips, and pale skin. We use CelebA for gender transformation.

  • BUPT-Transferface (Wang et al., 2019a; Deng, [n.d.]) is a dataset of face images which aims to bring racial awareness into studies on face discrimination and to achieve algorithmic fairness. We use it for race transformation.

  • VGGFace (Parkhi et al., 2015, [n.d.]) consists of over 2,600,000 images of 2,622 identities from IMDB. The dataset is collected from multiple search engines, e.g., Google and Bing, with limited manual curation. We evaluate DeepFAIT on VGGFace.

  • FairFace (Karkkainen and Joo, 2021, [n.d.]) contains 108,501 face images collected from the YFCC100m dataset. It is manually labeled with gender, race, and age groups and is balanced on race. It divides age into 9 bins. We train an age classifier on it and take it as another benchmark for evaluating DeepFAIT.

We utilize two state-of-the-art DNN architectures as the benchmark models.

  • VGG (Simonyan and Zisserman, 2015) is an advanced architecture for extracting abstract features from image data. VGG utilizes small convolution kernels (e.g., 3*3 or 1*1) and pooling kernels (e.g., 2*2) to significantly increase the expressive power of the model.

  • ResNet (He et al., 2016) improves traditional sequential DNNs by addressing the vanishing gradient problem when increasing the number of layers. It utilizes short-cuts (also called skip connections), which add up the input and output of a layer and pass the sum to the next layer as input.

The details of our pre-trained models are shown in Table 1.

Table 1. Accuracies of the experimented models.
Dataset Model Accuracy
VGGFace VGG-16 97.28%
FairFace ResNet-50 96.08%

Face Annotation and Transfer. For VGGFace, we crawl the racial and gender information by retrieving the celebrity identities from an active fan community, FamousFix.com (Raji et al., 2020). We then check manually to make sure the meta information is correct. For FairFace, the metadata about sensitive attributes is downloaded with the images.

When training the transformation models, we need to ensure that only the given sensitive attribute changes, so as to avoid the impact of other sensitive attributes. Therefore, when we train the race (respectively, gender) transformation model, the image pairs that we construct are guaranteed to have the same gender (respectively, race).

Parameters. Table 2 shows the values of all the parameters used to run DeepFAIT in our experiments, which either follow the settings of existing approaches or are decided empirically.

Table 2. Configuration of experiments.
Parameter Value Description
γ 5 the weight of L_{IDE} in CycleGAN
η 10 the weight of L_{CYC} in CycleGAN
α 0.05 the significance level
step_size 5 the step size of perturbation in RG and GG
μ 0 the mean of the Gaussian distribution in GI
σ 7 the standard deviation of the Gaussian distribution in GI
maxIt 10 the maximum number of generation iterations
topK 10 the number of top fairness-sensitive neurons per layer
Z 1000 the number of bins

4.2. Research Questions

We aim to answer the following 3 research questions through our evaluation.

RQ1: Is DeepFAIT effective in identifying fairness-related neurons?

Figure 3. Cumulative distribution function (CDF) w.r.t. the difference of the neuron activation values. Panels: (a) DeepFAIT (Race); (b) DeepFAIT (Gender); (c) DeepImportance (Race); (d) Activation Importance (Race).

Recall that fairness-related neuron selection is key to improving the efficiency and effectiveness of fairness adequacy testing. In the following, we compare DeepFAIT with two baselines. The first is DeepImportance (Gerasimou et al., 2020), a recent work aiming to identify the neurons most relevant to the model’s decision. The other is activation importance (hereafter ActImp), which identifies the neurons with the maximum activation values in a layer. Figure 3 shows the cumulative distribution function (CDF) of the activation differences for VGG-16. Each panel shows the top neuron within the last hidden layer selected by DeepFAIT with respect to race (Figure 3(a)) and gender (Figure 3(b)), by DeepImportance (Figure 3(c)), and by ActImp (Figure 3(d)), respectively. The larger the gap, the more relevant the neuron is to the model’s fairness. We observe that, compared with DeepFAIT, the neurons selected by DeepImportance and ActImp are less related to the model’s fairness. This is because they only capture a neuron’s contribution to the decision on a single sample, which makes them less sensitive to the behavioral differences of a pair of samples. Moreover, it is evident that different sensitive attributes lead to different selected neurons. For instance, the most relevant neurons in FC7 for race and gender are #3369 and #1317, with H values of 9,309 and 646, respectively.

Table 3. The number of fairness-related neurons in the deepest 5 layers. FC, Conv, and Maxpool represent the Fully-Connected layer, Convolutional layer, and Maxpool layer, respectively. The layers are listed from deep to shallow.
Model Layer Total Neurons Fairness-related Neurons (Race) Fairness-related Neurons (Gender)
VGG FC7 4096 3471 3272
FC6 4096 2881 2672
Maxpool5 512 502 492
Conv5_3 512 504 490
Conv5_2 512 512 509
ResNet Conv4.2_3 2048 2028 2044
Conv4.2_2 512 487 508
Conv4.2_1 512 484 507
Conv4.1_3 2048 1904 2036
Conv4.1_2 512 427 509
Figure 4. The H distribution of the deepest 5 layers. Panels: (a) VGGFace (Race); (b) VGGFace (Gender); (c) FairFace (Race); (d) FairFace (Gender).

Table 3 lists the total number of neurons and the number of selected fairness-related neurons in the deepest 5 layers on both VGGFace and FairFace, with respect to race and gender. We can observe that a considerable percentage of the neurons show correlation with fairness; for instance, the proportion ranges from 70.34% (65.23%) to 100.00% (99.41%) on VGGFace with respect to race (gender). Further, Figure 4 shows the H distribution of these 5 layers. Note that the H values within each layer follow a long-tailed distribution, i.e., the H values of most neurons concentrate on lower values, whereas a few are very large, showing a significant correlation with fairness.

We thus have the following answer to RQ1,

Answer to RQ1: DeepFAIT can effectively identify fairness-related neurons. While most of the neurons in the DNN are related to fairness, only a small percentage of them are strongly correlated.

RQ2: How effective is DeepFAIT for measuring the adequacy of fairness testing?

Table 4. Coverage performance (%) w.r.t. layers.
Layer Data Tan. Cos. Spe. Abs. Rel.
FC7 Fair 69.50 61.00 34.30 31.04 37.54
Original 78.30 78.60 42.50 40.73 49.21
Fair+GG 86.20 87.10 45.90 37.41 47.78
FC6 Fair 57.60 46.10 29.80 41.44 61.00
Original 65.00 59.70 36.70 49.31 68.52
Fair+GG 72.30 67.40 40.50 47.62 68.46
Conv5_3 Fair 16.40 36.70 29.20 43.83 78.92
Original 17.60 44.80 34.50 50.86 82.45
Fair+GG 17.60 49.10 37.40 49.95 83.84
Conv4_2 Fair 0.20 13.70 25.90 26.40 45.89
Original 0.20 14.80 27.20 29.35 46.01
Fair+GG 0.20 15.10 28.40 27.76 46.00

As presented in Section 3.3, all the criteria are calculated in a layer-wise manner. We thus first conduct an experiment to investigate the layer sensitivity of the testing adequacy before evaluating their effectiveness.

We evaluate the testing coverage on 3 data combinations. Table 4 shows the coverage computed on 4 ordered layers, i.e., FC7, FC6, Conv5_3, and Conv4_2, of VGG-16 with respect to the sensitive attribute gender. Rows Fair and Original report the coverage of 10,000 fair pairs and of the fair pairs plus all the discriminatory ones in the original dataset (since their number is smaller than 10,000), respectively, and row Fair+GG reports the adequacy of the original fair data augmented with 10,000 discriminatory instance pairs generated by the gradient-based strategy.

First, it can be observed that for the layer-level statistics, the coverage on the layers close to the output layer is significantly higher. For instance, the Tanimoto coefficient of the fair images drops from 69.50% in FC7 to 57.60% and 16.40% in FC6 and Conv5_3, and reaches 0.20% in Conv4_2. One possible explanation is that the deeper layers of the DNN are able to extract more identity-related information. In addition, we also observe that: 1) a deeper layer is more sensitive to the original discriminatory samples, e.g., compared with the fair images, the absolute distance adequacy of the whole original testing dataset increases by 2.95%, 7.03%, 7.87%, and 9.69% for Conv4_2, Conv5_3, FC6, and FC7 respectively; 2) the generated test cases show similar layer sensitivity, e.g., when we augment the original fair data with the data generated using the gradient-based strategy, the coverage of the relative distance from FC7 to Conv4_2 increases by 10.24%, 7.46%, 3.92%, and 0.11%, respectively. Clearly, the selection of layers affects the effectiveness of the testing. We thus suggest choosing the deeper layers to measure the adequacy of test cases.

Figure 5. Test coverage vs. fairness score.

Next, we conduct a correlation analysis between the coverage metrics and the individual fairness of the experimented models. We adopt the model mutation technique developed in (Wang et al., 2019b; Ma et al., 2018b) to efficiently obtain a significant number of models with different behaviors for the study. In this work, we generate 10 models for each mutation operator, e.g., Gaussian Fuzzing, Weight Shuffling, Neuron Switch, and Neuron Activation Inverse. Besides, since it is almost impossible to obtain meaningful images through random sampling in the input space, we measure the individual fairness of a model by the ratio of non-discriminatory instances in the original testing dataset. The accuracy and fairness scores of these 40 models are in the ranges [94.00%, 97.28%] and [64.37%, 89.10%], respectively. To ensure a fair comparison, we randomly select 10,000 non-discriminatory image pairs and obtain the coverage on the deepest hidden layer, i.e., FC7.

We show the Pearson product-moment correlation (Pearson, 1920) results on VGGFace with the sensitive attribute gender in Figure 5. The number in each cell is the correlation between the metrics of the corresponding row and column, which ranges from -1 to 1. Note that all the correlations are significant, i.e., p<0.05.

From Figure 5, we have the following observations. First, the three layer-level criteria are significantly positively correlated with the fairness score, with a minimum correlation coefficient of 0.85. This indicates that if the non-discriminatory instances have a higher coverage on the layer statistics, the DNN is fairer. This is intuitively expected, i.e., a fairer model can tolerate greater behavioral differences. Moreover, the two neuron distances show a highly negative correlation with individual fairness, with values of 0.69 and 0.90. Our explanation is that for a fair model, the output difference of the neurons is small (so as to ensure that the final prediction does not change). Second, it is obvious that the Tanimoto, cosine, and Spearman similarities have strong positive correlations with each other, while there are moderate negative correlations between the layer-level and neuron-level criteria.

Table 5. The adequacy of criteria on different data settings.
Dataset Protected Attr. Criteria Fair Fair+Unfair Fair+GG Fair+RG Fair+GI
VGGFace Race Tanimoto 70.90 84.00 86.40 84.00 78.50
Cosine 30.00 38.20 42.50 52.60 42.20
Spearman 28.40 38.50 40.30 42.70 37.50
Abs. Distance 30.88 46.66 49.68 48.01 44.44
Rel. Distance 31.10 42.50 53.54 53.34 46.65
DeepImportance 0.78 1.10 1.35 1.42 1.23
Gender Tanimoto 69.50 78.30 86.20 78.10 74.00
Cosine 61.00 78.60 87.10 77.70 69.50
Spearman 34.30 42.50 45.90 41.30 37.90
Abs. Distance 31.04 40.73 37.41 39.27 39.87
Rel. Distance 37.54 49.21 47.78 51.34 52.68
DeepImportance 0.93 1.28 1.26 1.38 1.35
FairFace Race Tanimoto 20.30 22.50 35.90 39.00 24.80
Cosine 32.50 46.10 82.00 79.80 40.90
Spearman 20.10 23.20 45.30 38.80 22.00
Abs. Distance 23.42 31.13 58.49 41.55 28.18
Rel. Distance 39.94 53.37 78.01 58.48 50.83
DeepImportance 25.29 29.68 30.56 32.61 34.76
Gender Tanimoto 20.60 23.70 34.20 32.90 23.90
Cosine 27.50 46.10 80.10 57.30 35.80
Spearman 17.70 23.30 43.40 28.90 20.70
Abs. Distance 32.31 45.43 81.85 37.57 37.17
Rel. Distance 39.85 55.80 68.47 50.77 47.29
DeepImportance 26.56 32.61 30.27 33.20 33.20

In traditional software testing, coverage is not only a measure of testing adequacy, but also an effective tool for revealing bugs (Zhang et al., 2019) (i.e., by generating tests with high coverage). In this work, a bug refers to the existence of individual discriminatory samples for a given DNN model. Therefore, we conduct an effectiveness evaluation by comparing the coverage of the testing criteria on the fair dataset and on the dataset augmented with individual discriminatory samples. The latter are obtained in two ways. One is the discriminatory data in the original testing set (column Fair+Unfair), and the other is discriminatory instances generated using 3 different approaches, i.e., gradient-based generation (column Fair+GG), random generation (column Fair+RG), and Gaussian-noise injection (column Fair+GI). The parameters for generation are shown in Table 2.

For each dataset and each sensitive attribute, we randomly select 10,000 image pairs from each testing subset, including the fair and discriminatory data in the original set (if the number is less than 10,000, we take all of them) and the generated individual discriminatory pairs. The coverage results on the penultimate layer are presented in Table 5. We repeat the procedure 5 times and report the average coverage to avoid the effect of randomness. It can be observed that, compared with the original fair pairs, the coverage of all the criteria increases significantly after adding the discriminatory ones. Furthermore, compared with the discriminatory pairs in the original set and those generated with image processing techniques, the fairness testing methods lead to higher coverage on all the criteria in most cases, except for the absolute and relative distances of VGG-16 with respect to gender. This is in line with our expectation, i.e., the criteria are more sensitive to individual discriminatory pairs generated by fairness testing, especially by the gradient-based method. In addition, we further conduct a comparison with DeepImportance (Gerasimou et al., 2020). It is worth noting that DeepImportance reports similar layer sensitivity in its evaluation. For a fair comparison, we also take the 10 most important neurons, and the total numbers of important neuron cluster combinations on VGG-16 and ResNet-50 are 131,072 and 1,024, respectively. We observe that the coverage of DeepImportance also increases when discriminatory samples are added. One possible reason is that the generation of unfair test cases pushes the seed towards the decision boundary (Zhang et al., 2020, 2021), which also reduces the robustness of the seed. However, we also observe that DeepImportance is less sensitive than DeepFAIT in most cases. In addition, since DeepImportance is only computed on individual samples, it cannot distinguish the similarity between each transformed sample and the original one when sensitive attributes have multiple values.

We have the following answer to RQ2,

Answer to RQ2: The performance of our proposed fairness testing criteria varies across layers, i.e., the deeper the layer, the more significant the criteria are. Furthermore, they are strongly correlated with the fairness of the DNN. Specifically, the layer-level statistics show positive correlations, whereas the neuron distances show negative correlations. The fairness testing criteria are effective for measuring testing adequacy and capable of identifying individual discriminatory instances.

RQ3: How effective is DeepFAIT for test case selection?

Figure 6. Fairness improvement with different test case selection strategies. Panels: (a) Race; (b) Gender.

To answer this question, we evaluate the fairness improvement from augmented training, where the augmented data is selected by KM-ST and by a completely random strategy (the baseline), respectively. The training images are generated by the gradient-based method from the original training data, and the validation set is composed of 5,000 unfair samples each from the original testing set and the three synthesized sets (Wang et al., 2021). Note that the training set and the validation set are disjoint. Recall that the neuron distance is a vector and thus not a good quantitative metric for test case selection; DeepImportance has the same problem. Thus we only apply KM-ST to the layer-level criteria.

Figure 6 shows the results on the FairFace dataset. Since we randomly select 10% of the generated unfair samples for data augmentation, we report the average improvement over 5 repetitions. All models after augmented training have only a slight loss of accuracy, i.e., a maximum decrease of 1.44%. From left to right, the bars represent the completely random strategy (RA) and KM-ST on the Tanimoto Coefficient (TC), Cosine Similarity (CS), and Spearman Correlation (SC), respectively. Compared with the completely random strategy, we observe that: 1) applying KM-ST on the cosine similarity and the Spearman correlation reduces more discrimination in the model, especially on the cosine similarity, which achieves the greatest improvement of 5.15% and 4.08% with respect to race and gender, respectively. The reason is that, compared with the completely random method, KM-ST samples uniformly in each smaller subspace, which better ensures that the selected unfair samples are representative; 2) applying KM-ST on the Tanimoto coefficient yields 1.16% and 1.73% less fairness improvement with respect to race and gender, respectively. The reason is that, compared with the cosine similarity and the Spearman coefficient, the Tanimoto coefficient only considers whether neurons in each layer are activated, while ignoring more fine-grained information such as the activation values.

We have the following answer to RQ3,

Answer to RQ3: Compared with completely random selection, applying KM-ST based on the proposed criteria of DeepFAIT is more effective for selecting test cases to reduce the model’s discrimination.

4.3. Threats to Validity

Limited Subjects. Our experimental subjects (i.e., the datasets and DNN models) are limited. It would be interesting to conduct further evaluation on other datasets like VGGFace2 (Cao et al., 2018), and on other model structures like recurrent neural networks (RNNs) (Ribeiro et al., 2016).

Limitation of Domain Transfer. We adopt an image-to-image transformation approach, CycleGAN, to transfer the sensitive facial attribute. The transformation process may have limitations, such as difficulty in translating attributes which implicitly encode unique identity, and sensitivity to illumination variation. However, how to transfer images across sensitive domains while retaining as much other information as possible is still an open problem, and we will further investigate other possible directions.

Limitation of Test Case Generation. Other generation methods, such as coverage-guided fuzzing driven by the proposed metrics, are possible; we will further explore them in future work.

5. Related Work

Fairness Testing. Galhotra et al. (Galhotra et al., 2017) were the first to propose fairness testing of machine learning, and utilized random generation to evaluate the frequency of discriminatory samples. Later, Udeshi et al. (Udeshi et al., 2018) improved it with a two-step random strategy, AEQUITAS. The first stage is global generation, which samples discriminatory cases in the input space completely at random. The second stage is local generation, which randomly searches the neighborhood of the identified discriminatory samples based on a dynamically updated probability. Besides, AEQUITAS tried to improve the model's fairness through automatic augmentation retraining. Aggarwal et al. (Aggarwal et al., 2019) acquired unfair test cases by applying symbolic execution (Wang et al., 2018) to solve the unexplored paths of a local explanation tree (Ribeiro et al., 2016). Its global and local generation aim to maximize the path coverage (diversity) and the number of instances, respectively. Zhang et al. (Zhang et al., 2020, 2021) proposed a DNN-specific algorithm to search for discrimination based on gradients. It first maximizes the output difference between an input pair, and then perturbs the identified discriminatory instance along the attributes with less impact. The difference between DeepFAIT and the above-mentioned works is two-fold. First, we introduce fairness testing to image data by domain transformation, instead of substituting the value of a sensitive attribute with a pre-defined one as in tabular and text data. Second, previous works paid more attention to the generation of test cases, but did not measure the adequacy of testing.

Robustness Testing Criteria. Many robustness testing criteria have been proposed. In (Pei et al., 2017), Pei et al. designed DeepXplore, the first white-box robustness testing framework in the literature, which identifies and crafts unexpected instances with Neuron Coverage. In (Ma et al., 2018a), Ma et al. inherited the key insight and introduced five more fine-grained testing criteria at both the layer and neuron levels. In (Sun et al., 2018b), Sun et al. brought Modified Condition/Decision Coverage into DL testing and proposed the first concolic testing method for DL models to improve four covering metrics based on a given neuron pair from adjacent layers, e.g., Sign-Sign Cover, Distance-Sign Cover, Sign-Value Cover, and Distance-Value Cover. In (Kim et al., 2019), Kim et al. proposed two surprise coverage metrics, LSC and DSC, which measure the range of the likelihood-based and distance-based adequacy values, respectively. Later, (Dong et al., 2020; Yan et al., 2020) conducted empirical studies and showed that there is limited correlation between the robustness of a DL model and the aforementioned coverage criteria. In (Gerasimou et al., 2020), Gerasimou et al. calculate the contribution of each neuron to the final prediction through layer-wise backward propagation. In (Feng et al., 2020), Feng et al. proposed DeepGini, which prioritizes unlabeled test cases by utilizing the impurity of the output vector, so as to reduce the resource consumption of labeling and better improve the robustness of the model. More recently, Wang et al. designed a robustness-oriented fuzzing framework (RobOT) based on loss coverage (First-Order Loss) (Wang et al., 2021). Different from the above robustness testing metrics, our fairness testing criteria aim to measure the behavioral difference between two similar samples, rather than the output of a single sample.

6. Conclusion

In this paper, we bridge the gap in existing fairness testing research by proposing a novel testing framework, DeepFAIT, which systematically evaluates and improves the fairness testing adequacy of deep image classification applications. Our approach first selects the fairness-related neurons using significance testing, then evaluates the fairness testing adequacy with five multi-granularity adequacy metrics, and lastly selects test cases based on the criteria to mitigate discrimination efficiently. We evaluate DeepFAIT on two large-scale public face recognition datasets. The results show that DeepFAIT is effective in identifying fairness-related neurons, detecting unfair samples, and selecting representative test cases to improve the model's fairness.

References

  • Aggarwal et al. (2019) Aniya Aggarwal, Pranay Lohia, Seema Nagar, Kuntal Dey, and Diptikalyan Saha. 2019. Black Box Fairness Testing of Machine Learning Models. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Tallinn, Estonia. ACM, 625–635. https://doi.org/10.1145/3338906.3338937
  • An (1996) Guozhong An. 1996. The Effects of Adding Noise During Backpropagation Training on a Generalization Performance. Neural Comput. 8, 3 (1996), 643–674. https://doi.org/10.1162/neco.1996.8.3.643
  • Zhang, Zhang, and Zhang (2021) Lingfeng Zhang, Yueling Zhang, and Min Zhang. 2021. Efficient White-box Fairness Testing through Gradient Search. In ISSTA ’21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Denmark. ACM, 103–114. https://doi.org/10.1145/3460319.3464820
  • Arpit et al. (2017) Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. 2017. A Closer Look at Memorization in Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, Vol. 70. PMLR, 233–242. http://proceedings.mlr.press/v70/arpit17a.html
  • Bastani et al. (2019) Osbert Bastani, Xin Zhang, and Armando Solar-Lezama. 2019. Probabilistic Verification of Fairness Properties via Concentration. PACMPL 3, OOPSLA (2019), 118:1–118:27. https://doi.org/10.1145/3360544
  • Cao et al. (2018) Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. 2018. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China. IEEE Computer Society, 67–74. https://doi.org/10.1109/FG.2018.00020
  • Deng ([n.d.]) Weihong Deng. [n.d.]. Ethnicity Aware Training Datasets. http://www.whdeng.cn/RFW/Trainingdataste.html
  • Dong et al. (2020) Yizhen Dong, Peixin Zhang, Jingyi Wang, Shuang Liu, Jun Sun, Jianye Hao, Xinyu Wang, Li Wang, Jinsong Dong, and Ting Dai. 2020. An Empirical Study on Correlation between Coverage and Robustness for Deep Neural Networks. In 25th International Conference on Engineering of Complex Computer Systems, ICECCS 2020, Singapore. IEEE, 73–82. https://doi.org/10.1109/ICECCS51672.2020.00016
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2012. Fairness through Awareness. In Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA. ACM, 214–226. https://doi.org/10.1145/2090236.2090255
  • Feldman et al. (2015) Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia. ACM, 259–268. https://doi.org/10.1145/2783258.2783311
  • Feng et al. (2020) Yang Feng, Qingkai Shi, Xinyu Gao, Jun Wan, Chunrong Fang, and Zhenyu Chen. 2020. DeepGini: Prioritizing Massive Tests to Enhance the Robustness of Deep Neural Networks. In ISSTA ’20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA. ACM, 177–188. https://doi.org/10.1145/3395363.3397357
  • Galhotra et al. (2017) Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness Testing: Testing Software for Discrimination. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software, ESEC/FSE 2017, Paderborn, Germany. ACM, 498–510. https://doi.org/10.1145/3106237.3106277
  • Garg et al. (2019) Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual Fairness in Text Classification through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA. ACM, 219–226. https://doi.org/10.1145/3306618.3317950
  • Gerasimou et al. (2020) Simos Gerasimou, Hasan Ferit Eniser, Alper Sen, and Alper Cakan. 2020. Importance-Driven Deep Learning System Testing. In ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea. ACM, 702–713. https://doi.org/10.1145/3377811.3380391
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada. 2672–2680. https://proceedings.neurips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
  • Goodfellow et al. (2015) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA. http://arxiv.org/abs/1412.6572
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA. IEEE Computer Society, 770–778. https://doi.org/10.1109/CVPR.2016.90
  • Huang et al. (2008) Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. 2008. Labeled faces in the wild: A database for Studying Face Recognition in Unconstrained Environments. In Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition. https://hal.inria.fr/inria-00321923/file/Huang_long_eccv2008-lfw.pdf
  • Karkkainen and Joo ([n.d.]) Kimmo Karkkainen and Jungseock Joo. [n.d.]. FairFace. https://github.com/dchen236/FairFace
  • Karkkainen and Joo (2021) Kimmo Karkkainen and Jungseock Joo. 2021. FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. IEEE, 1548–1558. http://arxiv.org/abs/1908.04913
  • Kim et al. (2019) Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding Deep Learning System Testing Using Surprise Adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada. IEEE / ACM, 1039–1049. https://doi.org/10.1109/ICSE.2019.00108
  • Kruskal and Wallis (1952) William H. Kruskal and W. Allen Wallis. 1952. Use of Ranks in One-Criterion Variance Analysis. J. Amer. Statist. Assoc. 47, 260 (1952), 583–621. https://doi.org/10.1080/01621459.1952.10483441
  • Liu et al. ([n.d.]) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. [n.d.]. Large-scale CelebFaces Attributes (CelebA) Dataset. http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile. IEEE Computer Society, 3730–3738. https://doi.org/10.1109/ICCV.2015.425
  • Ma et al. (2018a) Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018a. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France. ACM, 120–131. https://doi.org/10.1145/3238147.3238202
  • Ma et al. (2018b) Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018b. DeepMutation: Mutation Testing of Deep Learning Systems. In 29th IEEE International Symposium on Software Reliability Engineering, ISSRE 2018, Memphis, TN, USA. IEEE Computer Society, 100–111. https://doi.org/10.1109/ISSRE.2018.00021
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada. OpenReview.net. https://openreview.net/forum?id=rJzIBfZAb
  • on Artificial Intelligence (AI HLEG) (2018) High-Level Expert Group on Artificial Intelligence (AI HLEG). 2018. Draft Ethics Guidelines for Trustworthy AI. Technical Report. European Commission.
  • Papernot et al. (2016) Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. 2016. The Limitations of Deep Learning in Adversarial Settings. In IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany. 372–387. https://doi.org/10.1109/EuroSP.2016.36
  • Parkhi et al. ([n.d.]) Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. [n.d.]. VGG Face Dataset. https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
  • Parkhi et al. (2015) Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep Face Recognition. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK. BMVA Press, 41.1–41.12. https://doi.org/10.5244/C.29.41
  • Pearson (1900) Karl Pearson. 1900. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it can be Reasonably Supposed to Have Arisen from Random Sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175.
  • Pearson (1920) Karl Pearson. 1920. Notes on the History of Correlation. Biometrika 13, 1 (1920), 25–45. https://doi.org/10.1093/biomet/13.1.25
  • Pei et al. (2017) Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China. ACM, 1–18. https://doi.org/10.1145/3132747.3132785
  • Raji et al. (2020) Inioluwa Deborah Raji, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton. 2020. Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. In AIES ’20: AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA. ACM, 145–151. https://doi.org/10.1145/3375627.3375820
  • Ribeiro et al. (2016) Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. ACM, 1135–1144. https://doi.org/10.1145/2939672.2939778
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA. 815–823. https://doi.org/10.1109/CVPR.2015.7298682
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA. http://arxiv.org/abs/1409.1556
  • Spearman (1904) Charles Spearman. 1904. “General Intelligence,” Objectively Determined and Measured. The American Journal of Psychology 15, 2 (1904), 201–292. https://doi.org/10.2307/1412107
  • Sun et al. (2018a) Youcheng Sun, Xiaowei Huang, and Daniel Kroening. 2018a. Testing Deep Neural Networks. CoRR abs/1803.04792 (2018). http://arxiv.org/abs/1803.04792
  • Sun et al. (2018b) Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018b. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France. ACM, 109–119. https://doi.org/10.1145/3238147.3238172
  • Tanimoto (1968) T. T. Tanimoto. 1968. An Elementary Mathematical Theory of Classification and Prediction. Technical Report. IBM.
  • Thomas et al. (2019) Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Stephen Giguere, Yuriy Brun, and Emma Brunskill. 2019. Preventing Undesirable Behavior of Intelligent Machines. Science 366, 6468 (2019), 999–1004. https://science.sciencemag.org/content/366/6468/999
  • Udeshi et al. (2018) Sakshi Udeshi, Pryanshu Arora, and Sudipta Chattopadhyay. 2018. Automated Directed Fairness Testing. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France. ACM, 98–108. https://doi.org/10.1145/3238147.3238165
  • Vieira et al. (2017) Sandra Vieira, Walter H.L. Pinaya, and Andrea Mechelli. 2017. Using Deep Learning to Investigate the Neuroimaging Correlates of Psychiatric and Neurological Disorders: Methods and Applications. Neuroscience & Biobehavioral Reviews 74 (2017), 58–75. https://doi.org/10.1016/j.neubiorev.2017.01.002
  • Wang et al. (2021) Jingyi Wang, Jialuo Chen, Youcheng Sun, Xingjun Ma, Dongxia Wang, Jun Sun, and Peng Cheng. 2021. RobOT: Robustness-Oriented Testing for Deep Learning Systems. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain. IEEE, 300–311. https://doi.org/10.1109/ICSE43902.2021.00038
  • Wang et al. (2019b) Jingyi Wang, Guoliang Dong, Jun Sun, Xinyu Wang, and Peixin Zhang. 2019b. Adversarial Sample Detection for Deep Neural Network through Model Mutation Testing. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada. IEEE / ACM, 1245–1256. https://doi.org/10.1109/ICSE.2019.00126
  • Wang et al. (2019a) Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang. 2019a. Racial Faces in the Wild: Reducing Racial Bias by Information Maximization Adaptation Network. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South). IEEE, 692–702. https://doi.org/10.1109/ICCV.2019.00078
  • Wang et al. (2018) Xinyu Wang, Jun Sun, Zhenbang Chen, Peixin Zhang, Jingyi Wang, and Yun Lin. 2018. Towards optimal concolic testing. In Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden. ACM, 291–302. https://doi.org/10.1145/3180155.3180177
  • Wiegand et al. (2019) Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. Detection of Abusive Language: the Problem of Biased Datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA. Association for Computational Linguistics, 602–608. https://doi.org/10.18653/v1/n19-1060
  • Wulczyn et al. (2017) Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex Machina: Personal Attacks Seen at Scale. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia. ACM, 1391–1399. https://doi.org/10.1145/3038912.3052591
  • Yan et al. (2020) Shenao Yan, Guanhong Tao, Xuwei Liu, Juan Zhai, Shiqing Ma, Lei Xu, and Xiangyu Zhang. 2020. Correlations between Deep Neural Network Model Coverage Criteria and Model Quality. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA. ACM, 775–787. https://doi.org/10.1145/3368089.3409671
  • Yucer et al. (2020) Seyma Yucer, Samet Akçay, Noura Al Moubayed, and Toby P. Breckon. 2020. Exploring Racial Bias within Face Recognition via per-subject Adversarially-Enabled Data Augmentation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA. IEEE, 83–92. https://doi.org/10.1109/CVPRW50498.2020.00017
  • Zhang et al. (2019) Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2019. Machine Learning Testing: Survey, Landscapes and Horizons. CoRR abs/1906.10742 (2019). http://arxiv.org/abs/1906.10742
  • Zhang et al. (2020) Peixin Zhang, Jingyi Wang, Jun Sun, Guoliang Dong, Xinyu Wang, Xingen Wang, Jin Song Dong, and Ting Dai. 2020. White-box Fairness Testing through Adversarial Sampling. In ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea. ACM, 949–960. https://doi.org/10.1145/3377811.3380331
  • Zhang et al. (2021) Peixin Zhang, Jingyi Wang, Jun Sun, Xinyu Wang, Guoliang Dong, Xingen Wang, Ting Dai, and Jin Song Dong. 2021. Automatic Fairness Testing of Neural Classifiers through Adversarial Sampling. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2021.3101478
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy. IEEE Computer Society, 2242–2251. https://doi.org/10.1109/ICCV.2017.244