
Fairness Testing of Deep Image Classification with Adequacy Metrics

Peixin Zhang (Zhejiang University), Jingyi Wang (Zhejiang University), Jun Sun (Singapore Management University), and Xinyu Wang (Zhejiang University)
Abstract.

As deep image classification applications, e.g., face recognition, become increasingly prevalent in our daily lives, their fairness issues raise increasing concern. It is thus crucial to comprehensively test the fairness of these applications before deployment. Existing fairness testing methods suffer from the following limitations: 1) applicability, i.e., they are only applicable to structured data or text and cannot handle the high-dimensional, semantically abstract domain sampling required by image classification applications; 2) functionality, i.e., they generate unfair samples without providing testing criteria to characterize the model’s fairness adequacy. To fill the gap, we propose DeepFAIT, a systematic fairness testing framework specifically designed for deep image classification applications. DeepFAIT consists of several important components enabling effective fairness testing of deep image classification applications: 1) a neuron selection strategy to identify the fairness-related neurons; 2) a set of multi-granularity adequacy metrics to evaluate the model’s fairness; 3) a test selection algorithm for fixing the fairness issues efficiently. We have conducted experiments on widely adopted large-scale face recognition datasets, i.e., VGGFace and FairFace. The experimental results confirm that our approach can effectively identify the fairness-related neurons, characterize the model’s fairness, and select the most valuable test cases to mitigate the model’s fairness issues.

1. Introduction

Deep learning (DL) has created a new programming paradigm for solving many real-world problems, e.g., computer vision (Schroff et al., 2015), medical diagnosis (Vieira et al., 2017), and natural language processing (Wulczyn et al., 2017). However, DL is far from trustworthy enough to be applied in certain ethics-critical scenarios, e.g., toxic language detection (Wiegand et al., 2019) and facial recognition (Yucer et al., 2020), as decisions of DL models can be unfair, i.e., discriminating against minorities or vulnerable subpopulations, which has raised wide public concern (High-Level Expert Group on Artificial Intelligence (AI HLEG), 2018). Therefore, just like for traditional software, it is all the more important to test the fairness of DL models systematically before their deployment.

Unfortunately, there is still no commonly agreed definition of fairness. Existing fairness formalizations typically concern different sub-populations (Feldman et al., 2015; Bastani et al., 2019; Dwork et al., 2012; Garg et al., 2019). These sub-populations are normally determined by different domains (values) of sensitive attributes (e.g., race and gender), which are often application-dependent. To name a few, demographic parity requires that minority candidates be classified at approximately the same rate as majority members (Feldman et al., 2015; Bastani et al., 2019). Individual discrimination states that a well-trained model must output approximately the same predictions for instances (i.e., pairs of instances) which only differ in sensitive attributes (Dwork et al., 2012; Garg et al., 2019). We refer the readers to (Thomas et al., 2019) for detailed definitions of fairness and remark that we focus on individual fairness in this work.

Multiple recent works (Galhotra et al., 2017; Udeshi et al., 2018; Aggarwal et al., 2019; Zhang et al., 2020) have investigated the fairness testing problem of machine learning models (so far restricted to individual fairness). For instance, THEMIS (Galhotra et al., 2017) measures the frequency of unfair samples by randomly sampling the value domain of each attribute. AEQUITAS (Udeshi et al., 2018) integrates a global and a local phase to search for unfair samples in the input space more systematically. Symbolic Generation (Aggarwal et al., 2019) utilizes a constraint solver (Wang et al., 2018) to solve the paths on the local explanation decision tree (Ribeiro et al., 2016) of a given seed sample and thereby acquire a large number of diverse unfair samples. The state-of-the-art work ADF (Zhang et al., 2020, 2021) and its variants (Zhang, Zhang, and Zhang, 2021) adopt a gradient-guided search strategy to identify unfair samples more effectively. Despite the considerable progress, existing fairness testing works still suffer from the following limitations: 1) applicability, i.e., they are only applicable to structured data or text and cannot handle the high-dimensional, semantically abstract domain sampling required by image classification applications; 2) functionality, i.e., they generate unfair samples without providing testing criteria to characterize the model’s fairness adequacy.

Figure 1. Overview of DeepFAIT. Among the generated data, from left to right, the image pairs are crafted by Random Generation (RG), Gradient-based Generation (GG), and Gaussian-noise Injection (GI), respectively. The blue and brown dashed lines indicate the process of measuring testing adequacy and fairness enhancement, respectively.

To fill the gap, we propose a systematic fairness testing framework named DeepFAIT, which is specifically designed for evaluating and improving the fairness adequacy of deep image classification applications. DeepFAIT provides several key functionalities enabling effective fairness testing of image applications: 1) a neuron selection strategy to identify the fairness-related neurons; 2) a set of multi-granularity adequacy metrics to evaluate the model’s fairness; 3) a test selection algorithm for fixing the fairness issues efficiently. We address multiple technical challenges to realize DeepFAIT. Specifically, as shown in Fig. 1, DeepFAIT consists of five modules. First, we adopt a widely-used image-to-image transformation technique, i.e., Generative Adversarial Networks (GANs) (Zhu et al., 2017), to transform images across the sensitive domains. Then, we apply significance testing on the activation differences of neurons to obtain the fairness-related neurons, and design 5 testing metrics at both the layer and neuron levels based on the identified fairness-related neurons. Next, we implement three test case generation strategies, including fairness testing methods and image processing techniques, to generate a variety of unfair samples. Last, we propose a test selection algorithm to select the more valuable test cases and repair the model at a smaller cost.

DeepFAIT has been implemented as an open-source self-contained toolkit. We have evaluated DeepFAIT on widely adopted large-scale face recognition datasets (VGGFace (Parkhi et al., 2015) and FairFace (Karkkainen and Joo, 2021)). The results show that, compared with DeepImportance (Gerasimou et al., 2020), DeepFAIT is more capable of identifying the fairness-sensitive neurons of the model. Furthermore, the proposed testing metrics calculated on these neurons are highly correlated with fairness and can be used to guide the search for unfair samples effectively. More importantly, the fairness issues can be fixed by selecting a small number of test cases with our test selection algorithm to further train the model.

In a nutshell, we make the following technical contributions:

  • We propose a systematic fairness testing framework specially designed for deep image classification applications, consisting of a set of multi-granularity fairness adequacy metrics on fairness-related neurons.

  • Based on the proposed adequacy metrics, we propose a test selection algorithm to estimate the value of each test case for improving the model’s fairness, thereby reducing the cost of fixing the model.

  • We implemented DeepFAIT as a self-contained toolkit, which can be freely accessed online (https://github.com/icse44/DeepFAIT). The evaluation shows that the proposed testing criteria in DeepFAIT are well correlated with the fairness of DL models and are effective in guiding the selection of unfair samples.

We organize the remainder of the paper as follows. We provide the necessary background on DNNs and robustness testing criteria in Section 2. We then present DeepFAIT in detail in Section 3. In Section 4, we discuss the experimental setup and results. Lastly, we briefly review the related works in Section 5 and conclude in Section 6.

2. Preliminaries

2.1. Deep Neural Network

In this work, we focus on the fairness testing of deep learning models, specifically, deep neural networks (DNNs) for image classification applications. A deep neural network M consists of multiple hidden layers between an input layer and an output layer, and can be denoted as a tuple M=(I,L,\Phi,TS) where

  • I is the input layer;

  • L=\{L_{j} \mid j\in\{1,\dots,J\}\} is the set of hidden layers plus the output layer; the number of neurons in the j-th layer is |L_{j}|; the k-th neuron in layer L_{j} is denoted as n_{j,k}, and its output value with respect to the input x is v_{j,k}(x);

  • \Phi is a set of activation functions, e.g., Sigmoid, Hyperbolic Tangent (TanH), or Rectified Linear Unit (ReLU);

  • TS: L\times\Phi\to L is a set of transitions between layers. As shown in Equation 1, the neuron activation value v_{j,k}(x) is computed by applying the activation function to the weighted sum of the activation values of all the neurons in its previous layer, where the weights represent the strength of the connections between two linked neurons.

    (1) v_{j,k}(x)=\phi(\sum_{l=1}^{|L_{j-1}|}\omega_{j-1,k,l}\cdot v_{j-1,l}(x))

A classification DNN can be defined as M:X\to Y, which maps a given input x\in X to an output label y\in Y by propagating layer by layer as above.
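To make Equation 1 concrete, the following is a minimal sketch (not the DeepFAIT implementation) of the layer-by-layer propagation for a fully-connected network; the layer sizes, random weights, and choice of ReLU are illustrative assumptions.

```python
import numpy as np

def forward_activations(x, weights, phi=lambda z: np.maximum(z, 0.0)):
    """Compute v_{j,k}(x) for every layer of a fully-connected DNN.

    weights[j] is a |L_{j+1}| x |L_j| matrix, so each neuron value is the
    activation function applied to the weighted sum over the previous
    layer, as in Equation 1. Returns one activation vector per layer.
    """
    activations = []
    v = x  # the input layer I
    for W in weights:
        v = phi(W @ v)  # v_{j,k}(x) = phi(sum_l w_{j-1,k,l} * v_{j-1,l}(x))
        activations.append(v)
    return activations

# Illustrative 3-4-2 network with random weights.
rng = np.random.default_rng(0)
ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
print(forward_activations(np.array([0.5, -1.0, 2.0]), ws))
```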

2.2. Individual Fairness for Image Classification

We denote the fairness sensitive attribute of interest as SA (e.g., race). Note that for image classification, SA is hidden from X. We define HF:X\to SA as a function which returns the sensitive attribute of a given sample x\in X. We further define X_{A}\subset X as the set of samples satisfying HF(x)=A, where x\in X and A\in SA. To change the sensitive attribute of a sample x, we define a transformation function T_{A\to B}:X\to X which transforms a sample from X_{A} to X_{B} while preserving other information. Then, we define the individual fairness of an image classification model M as follows (Thomas et al., 2019).

Definition 2.1 (Individual Fairness).

Given an image classification model M trained on X, we define it to be individually fair iff there exists no data x\in X satisfying the following conditions:

  • x\in X_{A},\ x^{\prime}=T_{A\to B}(x);

  • M(x)\neq M(x^{\prime}).

On the other hand, x (and x^{\prime}) is called an unfair sample if x satisfies the above conditions.
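As an illustration of Definition 2.1, a straightforward check (an assumed sketch, not the paper's tool) for unfair samples given a trained classifier M and a realized transformation T is:

```python
import torch

def find_unfair_samples(model, transform, loader, device="cpu"):
    """Flag samples x with M(x) != M(T(x)), i.e., unfair samples (Def. 2.1).

    `model` is assumed to return class logits and `transform` to realize
    T_{A->B} (e.g., a trained CycleGAN generator); both operate on batches.
    """
    unfair = []
    model.eval()
    with torch.no_grad():
        for batch_idx, (x, _) in enumerate(loader):
            x = x.to(device)
            x_prime = transform(x)                    # x' = T_{A->B}(x)
            y = model(x).argmax(dim=1)
            y_prime = model(x_prime).argmax(dim=1)
            for i in (y != y_prime).nonzero(as_tuple=True)[0]:
                unfair.append((batch_idx, i.item()))  # locate the unfair pair
    return unfair
```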

2.3. Robustness Testing Criteria

A variety of robustness testing criteria for DNNs have been proposed (Pei et al., 2017; Ma et al., 2018a; Kim et al., 2019; Wang et al., 2021; Feng et al., 2020; Gerasimou et al., 2020). We briefly introduce the following representative robustness testing metrics and refer readers to the original papers for details.

Neuron Activation. Neuron coverage (Pei et al., 2017) is the first robustness testing metric for DNNs; it computes the percentage of activated neurons, i.e., neurons whose values are greater than a threshold. Later, DeepGauge (Ma et al., 2018a) extends it with multi-granularity neuron coverage criteria at two different levels: 1) the neuron level, e.g., k-Multisection Neuron Coverage, Neuron Boundary Coverage and Strong Neuron Activation Coverage, focusing on the value distribution of a single neuron, and 2) the layer level, measuring the ranking of neuron values in each layer, e.g., Top-k Neuron Coverage and Top-k Neuron Patterns. Surprise Adequacy (Kim et al., 2019) evaluates the similarity between a test case and the training data based on the kernel density estimation or Euclidean distance of neuron activation traces. Importance-driven coverage (Gerasimou et al., 2020) measures the value of neurons from another perspective, i.e., the contribution of each neuron within the same layer to the prediction.

Output Impurity. Unlike the aforementioned work, DeepGini (Feng et al., 2020) only takes the output vector into consideration and measures the likelihood of misclassification by Gini impurity.

Loss Convergence. RobOT (Wang et al., 2021) proposed the First-Order Loss to measure convergence quality, inspired by the observation that if we perturb an instance along its gradient with respect to the loss, the loss will increase and gradually converge (Madry et al., 2018).

2.4. Problem Definition

Different from the previous robustness testing works, we aim to propose a set of testing adequacy metrics for individual fairness specially designed for image classification. In particular, we aim to achieve the following research objectives:

  • How can we design testing adequacy metrics which are well correlated with the model’s fairness?

  • How can we select test cases which can effectively fix the model’s fairness issues?

3. DeepFAIT Framework

As shown in Figure 1, DeepFAIT systematically tests, evaluates and improves a DNN’s fairness with the following 4 main components:

  1) Domain transformation. We develop a method based on CycleGAN (Zhu et al., 2017) to realize the transformation function T_{A\to B}.

  2) Fairness-related neuron selection. We propose to conduct testing in a more effective way by focusing on the neurons which are strongly correlated with the model’s fairness.

  3) Multi-granularity metric analysis. We design a set of multi-granularity fairness testing coverage metrics to measure the adequacy of fairness testing. These metrics are calculated on the selected fairness-related neurons to be more effective.

  4) Fairness enhancement. We develop a set of test case generation algorithms to identify a diverse set of unfair samples and mitigate discrimination by augmented training on these unfair samples. To further reduce the cost, we also propose a test selection algorithm that selects the most valuable test cases for the model’s fairness enhancement based on the proposed metrics.

In the following, we present the details of each component.

3.1. Domain Transformation

Figure 2. Structure of the CycleGAN model (A to B). The blue and red triangles indicate data from A and B respectively, the solid and dotted frames represent the original and generated data respectively, the black arrows indicate the data flow, and the purple lines indicate the computation of the loss functions. The model for B to A has a similar structure.

The first question is how to realize the domain transformation function T_{A\to B}. Note that this is straightforward for structured or text data, where it can be done by replacing the protected feature or token with a value from a predefined domain (Zhang et al., 2020, 2021). However, for image data, the sensitive attribute of interest is hidden from the input feature space. We thus follow (Yucer et al., 2020) and adopt CycleGAN (Zhu et al., 2017) to transform images across different protected domains as follows.

As shown in Figure 2, CycleGAN provides a mechanism to transfer from domain A to domain B and from B to A with two corresponding generative models T_{A\to B} and T_{B\to A}. Similar to the traditional GAN (Goodfellow et al., 2014), it contains two discriminators D_{A} and D_{B} to distinguish whether an input is ‘real’, i.e., whether a sample comes from the original dataset or is generated by a generative model.

The loss function consists of three parts. The first is the adversarial loss, which is defined as follows:

(2) L_{GAN}(T_{A\to B},D_{B},A,B)=\mathbb{E}[\log(1-D_{B}(T_{A\to B}(X_{A})))]+\mathbb{E}[\log D_{B}(X_{B})]

The transformer T_{A\to B} aims to synthesize a picture satisfying the distribution of X_{B} based on a seed from X_{A}, while the goal of the discriminator D_{B} is to distinguish the raw images X_{B} from the artificial ones T_{A\to B}(X_{A}). T_{B\to A} and D_{A} have the same form of adversarial loss. Since domain transformation needs to modify other information as little as possible while changing the sensitive attribute, the second part of the objective function ensures that the generated image stays as close as possible to the raw one, which is defined as follows:

(3) L_{IDE}(T_{A\to B})=\mathbb{E}[\|T_{A\to B}(X_{A})-X_{A}\|_{p}]

In addition, it also needs to ensure that T_{A\to B}(X_{A}) lies in the data distribution of X_{B}. To this end, CycleGAN introduces the cycle consistency loss as the core of its joint optimization objective to make the synthesized image more realistic. It is defined as follows.

(4) L_{CYC}(T_{A\to B},T_{B\to A})=\mathbb{E}[\|T_{B\to A}(T_{A\to B}(X_{A}))-X_{A}\|_{p}]+\mathbb{E}[\|T_{A\to B}(T_{B\to A}(X_{B}))-X_{B}\|_{p}]

The intuition is that a pair of well-trained generators can recover the original image through the reconstruction process, i.e., for a given x\in X_{A}, T_{B\to A}(T_{A\to B}(x))=x. Overall, the complete loss function of CycleGAN is defined as follows.

(5) L=L_{GAN}(T_{A\to B},D_{B},A,B)+L_{GAN}(T_{B\to A},D_{A},B,A)+\gamma(L_{IDE}(T_{A\to B})+L_{IDE}(T_{B\to A}))+\eta L_{CYC}(T_{A\to B},T_{B\to A})

where \gamma and \eta are hyperparameters to balance the three losses. In the “Raw Data” frame of Figure 1, we show an example race transformation (Caucasian to African) from VGGFace.
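For concreteness, the following is a simplified sketch of the generator-side objective in Equation 5 (discriminator updates and training details are omitted; the non-saturating BCE form of the adversarial term and probability-valued discriminators are assumed stand-ins, and gamma and eta follow Table 2).

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(T_ab, T_ba, D_a, D_b, x_a, x_b, gamma=5.0, eta=10.0):
    """Generator-side part of Equation 5: adversarial + identity + cycle terms."""
    fake_b, fake_a = T_ab(x_a), T_ba(x_b)

    # Adversarial terms (cf. Equation 2): generators try to fool D_A and D_B,
    # which are assumed to output probabilities in [0, 1].
    pred_b, pred_a = D_b(fake_b), D_a(fake_a)
    adv = F.binary_cross_entropy(pred_b, torch.ones_like(pred_b)) + \
          F.binary_cross_entropy(pred_a, torch.ones_like(pred_a))

    # Identity terms (Equation 3): keep non-sensitive information unchanged.
    ide = (fake_b - x_a).abs().mean() + (fake_a - x_b).abs().mean()

    # Cycle-consistency terms (Equation 4): reconstruct the original images.
    cyc = (T_ba(fake_b) - x_a).abs().mean() + (T_ab(fake_a) - x_b).abs().mean()

    return adv + gamma * ide + eta * cyc
```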

3.2. Fairness-related Neuron Selection

The next step is to select the fairness-related neurons, i.e., those neurons in the DNN which may have a significant impact on the fairness of the model’s decisions. The benefit is that, rather than blindly selecting neurons in certain layers (Ma et al., 2018a) or only the model’s output (Feng et al., 2020), it enables us to test the model’s fairness in a more fine-grained and focused way as these neurons are more correlated with the model’s fairness.

Our key idea is to decide whether a neuron is fairness-related by significance testing (Kruskal and Wallis, 1952) as follows. Formally, we make the following null hypothesis on a given neuron n_{j,k}.

(6) H_{0}: n_{j,k}\ \text{is not fairness-related}

In addition, we use a standard parameter (a.k.a. the significance level), \alpha, to control the probability of making errors, e.g., rejecting H_{0} when H_{0} is true. We then use the Kruskal-Wallis test (Kruskal and Wallis, 1952) (also known as the H-test) to test the above hypothesis. The intuition is that we can identify a fairness-related neuron by looking at the difference between its activation distributions on fair samples and on unfair samples: the larger the difference, the more fairness-related the neuron is. Specifically, given the training dataset X and the transformation T, we collect the activation value differences on neuron n_{j,k} over X:

(7) XD=\{\|v_{j,k}(x)-v_{j,k}(x^{\prime})\| \mid x\in X,\ x^{\prime}=T(x)\}.

We further divide XD into two disjoint subsets XD_{f}, f=0,1, according to whether the prediction results M(x) and M(x^{\prime}) are equal or not, respectively.

We first sort XD in ascending order and denote the rank of the i-th element in XD_{f} as r_{f}^{i}. Then we add up the ranks in each subset XD_{f} to obtain the rank sum, denoted as R_{f}=\sum_{i}r_{f}^{i}. When the null hypothesis H_{0} is true, the average rank of each subset should be close to that of all samples, i.e., (|XD|+1)/2, and we thus use the following equation to compute the rank variance between subsets:

(8) RVS=\sum_{f=0,1}|XD_{f}|\left(\frac{R_{f}}{|XD_{f}|}-\frac{|XD|+1}{2}\right)^{2}

To eliminate the influence of scale, we then calculate the average rank variance over all samples, which is defined as follows.

(9) ARV=\frac{1}{|XD|-1}\sum_{f=0,1}\sum_{i=1}^{|XD_{f}|}\left(r_{f}^{i}-\frac{|XD|+1}{2}\right)^{2}=\frac{|XD|(|XD|+1)}{12}

Note that the degree of freedom of the sample variance is |XD|-1. Thus, the Kruskal-Wallis rank-sum statistic H is given by

(10) H=\frac{RVS}{ARV}=\frac{12}{|XD|(|XD|+1)}\sum_{f=0,1}\frac{R_{f}^{2}}{|XD_{f}|}-3(|XD|+1)

When |XD| is large, H approximately follows a chi-square distribution (Pearson, 1900) with one degree of freedom, H\sim\chi^{2}(1). Therefore, the critical value of the Kruskal-Wallis test, H_{c}, corresponding to \alpha is determined from the chi-square distribution table. That is, if the computed statistic H>H_{c}, we reject H_{0} and conclude that the neuron n_{j,k} is fairness-related.
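The selection procedure can be sketched with scipy's built-in Kruskal-Wallis test, which computes the same H statistic and its chi-square p-value; the data layout below is an illustrative assumption.

```python
import numpy as np
from scipy.stats import kruskal

def select_fairness_related_neurons(acts, acts_prime, fair_mask, alpha=0.05):
    """Return the fairness-related neurons of one layer via the H-test.

    acts, acts_prime: arrays of shape (num_samples, num_neurons) holding
    v_{j,k}(x) and v_{j,k}(x') for the pairs (x, x' = T(x)).
    fair_mask: boolean array, True where M(x) == M(x').
    A neuron is kept when H0 is rejected at significance level alpha.
    """
    xd = np.abs(acts - acts_prime)  # activation differences XD, per neuron
    selected = []
    for k in range(xd.shape[1]):
        xd_fair, xd_unfair = xd[fair_mask, k], xd[~fair_mask, k]
        if len(xd_fair) == 0 or len(xd_unfair) == 0:
            continue
        h_stat, p_value = kruskal(xd_fair, xd_unfair)
        if p_value < alpha:         # equivalent to H > H_c under chi^2(1)
            selected.append((k, h_stat))
    return selected
```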

3.3. Fairness Adequacy Metrics

Next, we propose a set of testing metrics to measure the adequacy of DNN fairness testing. Note that, different from the robustness testing metrics (Ma et al., 2018a; Gerasimou et al., 2020; Wang et al., 2021; Feng et al., 2020; Kim et al., 2019), fairness testing metrics are based on the behavioral differences between an instance pair, i.e., x and x^{\prime} (which only differ in certain sensitive attributes). One important desirable property of such metrics is that they should be well correlated with the model’s fairness.

Our fairness metrics satisfy the above property as we define them on the basis of the selected fairness-related neurons. Specifically, let NF_{j}\subset\{n_{j,1},\dots,n_{j,|L_{j}|}\} be the set of fairness-related neurons within layer L_{j}. We denote the activation value vector and the boolean activation pattern over the neurons in NF_{j} with respect to an input x as \vec{v}(x,NF_{j}) and \vec{a}(x,NF_{j}), respectively. In \vec{a}(x,NF_{j}), 1 and 0 represent whether the value of n_{j,k}\in NF_{j} is the same before and after ReLU or not, respectively, i.e., whether the neuron is activated. We first characterize the model’s behavioral differences between the inputs x and x^{\prime} at the layer level, as the decision is propagated layer by layer.

Tanimoto Coefficient. The Tanimoto coefficient (Tanimoto, 1968) is a similarity ratio defined over bitmaps. In a DNN, the activation of a neuron indicates whether the abstract feature of the neuron will be used in the subsequent decision-making process. As shown in Equation 11, we compute the ratio of the number of commonly activated neurons (i.e., shared nonzero bits) to the number of neurons activated by either sample.

(11) TC(x,x^{\prime},NF_{j})=\frac{\vec{A}\cdot\vec{A^{\prime}}}{\|\vec{A}\|^{2}+\|\vec{A^{\prime}}\|^{2}-\vec{A}\cdot\vec{A^{\prime}}}

where \vec{A}=\vec{a}(x,NF_{j}) and \vec{A^{\prime}}=\vec{a}(x^{\prime},NF_{j}).

Cosine Similarity. Cosine similarity is one of the most commonly used similarity measures in machine learning applications. For example, it is often used to judge whether two faces belong to the same person in face recognition (Huang et al., 2008; Yucer et al., 2020). We calculate the cosine similarity of the activation traces in the layer representation space, which is defined as follows:

(12) CS(x,x^{\prime},NF_{j})=\frac{\vec{v}(x,NF_{j})\cdot\vec{v}(x^{\prime},NF_{j})}{\|\vec{v}(x,NF_{j})\|\cdot\|\vec{v}(x^{\prime},NF_{j})\|}

Spearman Correlation. Spearman correlation (Spearman, 1904) is a non-parametric statistic measuring the dependence between the rankings of two variables. Although it discards some information, i.e., the real activation values, it retains the order of the neurons’ activation status, which is an essential characteristic, since neurons within the same layer often learn similar features and the closer an activated neuron is to the output layer, the more important it is for the model’s decision (Ma et al., 2018a). Formally, the Spearman correlation is computed by

(13) SC(x,x^{\prime},NF_{j})=1-\frac{6\sum_{k=1}^{|NF_{j}|}(r(\vec{v}_{j,k}(x))-r(\vec{v}_{j,k}(x^{\prime})))^{2}}{|NF_{j}|(|NF_{j}|^{2}-1)}

where r(\cdot) is the rank function.
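A minimal sketch of the three layer-level metrics (Equations 11-13), assuming the activation vectors over the fairness-related neurons NF_j have already been extracted and that activation means a positive post-ReLU value:

```python
import numpy as np
from scipy.stats import spearmanr

def layer_level_metrics(v, v_prime):
    """Tanimoto, cosine, and Spearman metrics for one layer (Eqs. 11-13).

    v, v_prime: activation vectors of x and x' over the neurons in NF_j.
    """
    a, a_prime = (v > 0).astype(float), (v_prime > 0).astype(float)
    common = a @ a_prime                     # neurons activated by both inputs
    tanimoto = common / (a.sum() + a_prime.sum() - common)                  # Eq. 11
    cosine = (v @ v_prime) / (np.linalg.norm(v) * np.linalg.norm(v_prime))  # Eq. 12
    spearman = spearmanr(v, v_prime).correlation                            # Eq. 13
    return tanimoto, cosine, spearman
```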

The above layer-level metrics are based on the statistical results on a set of neurons in a layer. In addition, we also provide a more fine-grained metric based on a single neuron.

Neuron Distance. As presented in Section 2.1, each neuron within the same layer independently learns and extracts features from its previous layer. Therefore, the neuron distance is a fine-grained metric to characterize the diversity of model behavior differences. In particular, for each fairness-related neuron n_{j,k}\in NF_{j}, the distance between x and x^{\prime} is denoted as nd(x,x^{\prime},n_{j,k}), which is defined as follows:

(14) nd(x,x^{\prime},n_{j,k})=\begin{cases}|v_{j,k}(x)-v_{j,k}(x^{\prime})|,&\text{absolute}\\ \frac{\max(v_{j,k}(x),v_{j,k}(x^{\prime}))}{\min(v_{j,k}(x),v_{j,k}(x^{\prime}))},&\text{relative}\end{cases}

Note that the absolute and relative distances are only computed when n_{j,k} is activated by both x and x^{\prime} (Sun et al., 2018a).
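A sketch of the neuron-level distance in Equation 14 (returning nothing when the neuron is not activated by both inputs):

```python
def neuron_distance(v_k, v_prime_k, mode="absolute"):
    """Neuron distance nd(x, x', n_{j,k}) of Equation 14."""
    if v_k <= 0 or v_prime_k <= 0:  # both activations required (post-ReLU > 0)
        return None
    if mode == "absolute":
        return abs(v_k - v_prime_k)
    return max(v_k, v_prime_k) / min(v_k, v_prime_k)  # relative distance
```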

Finally, we define the coverage for each testing metric by dividing its value range (based on X and the domain transformer T) into Z equal bins, denoted as B^{z}, and then compute the coverage ratio as follows:

(15) \frac{\sum_{i=1}^{|C|}|\{B_{i}^{z} \mid \exists x\in X:C_{i}(x,T(x))\in B_{i}^{z}\}|}{Z\cdot|C|}

where |C| is the number of dimensions over which the metric is collected. For the layer-level similarities, we traverse layer by layer to calculate the value, thus |C|=J, and the value range is [0,1] for the Tanimoto and cosine coefficients and [-1,1] for the Spearman correlation. For the neuron distance, we select the topK neurons in each layer to analyze, thus |C|=J*topK, and the upper bound is computed from the training data (Ma et al., 2018a).
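The coverage ratio of Equation 15 can then be sketched as follows; the per-dimension value ranges are assumed to come from the fixed ranges above or from bounds estimated on the training data.

```python
import numpy as np

def coverage_ratio(metric_values, value_ranges, Z=1000):
    """Coverage ratio of Equation 15.

    metric_values: one list of metric values C_i(x, T(x)) per dimension
    (layers for the similarities, or layer x topK neurons for the distances).
    value_ranges: one (lower, upper) bound pair per dimension.
    """
    covered, total = 0, Z * len(metric_values)
    for values, (lo, hi) in zip(metric_values, value_ranges):
        bins = np.floor((np.asarray(values) - lo) / (hi - lo) * Z).astype(int)
        bins = np.clip(bins, 0, Z - 1)   # values exactly at the upper bound
        covered += len(np.unique(bins))  # bins hit by at least one pair
    return covered / total
```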

3.4. Fairness Enhancement

The above metrics enable us to evaluate a model’s fairness adequacy. The follow-up questions are 1) how to generate diverse test cases to improve the fairness adequacy, and 2) how to select the most valuable test cases to enhance the model’s fairness by augmented training, which has been proven useful in enhancing the robustness or fairness of DNNs (Ma et al., 2018a; Zhang et al., 2020, 2021).

We address the first question by introducing the following test case generation methods, drawing both on the fairness testing literature and on traditional digital image processing. Given a fair sample pair (x,x^{\prime}), where x and x^{\prime} differ only in certain sensitive attributes, we aim to maximize the output difference between them after applying a perturbation p to both, which is formally defined as follows:

(16) \text{argmax}_{p}\{M(x+p)-M(x^{\prime}+p)\}

We introduce the following perturbation methods for approximating the above goal.

Random Generation (RG). Random testing is the most common method in software testing and is adopted by THEMIS (Galhotra et al., 2017) and AEQUITAS (Udeshi et al., 2018). Here, we select the perturbed pixels and the perturbation direction randomly as follows:

(17) p=random(-1,0,1)*step\_size

where 0 means the pixel value is retained, and 1 and -1 represent increasing and decreasing the pixel value by step\_size, respectively.

Gradient-based Generation (GG). The gradient is known to be an effective tool for generating test cases for DL models (Zhang et al., 2020, 2021; Goodfellow et al., 2015; Papernot et al., 2016). We utilize the gradient-based method in (Zhang et al., 2020, 2021) to approximate the solution of the optimization problem in Equation 16, which relies on the sign of the gradient of the loss function with respect to the input, i.e., sg=sign(\nabla_{x}J(x,y)) and sg^{\prime}=sign(\nabla_{x^{\prime}}J(x^{\prime},y)). As shown in Equation 18, we then choose the pixels on which the two signs agree and perturb them in that common direction.

(18) p=(sg_{i}==sg_{i}^{\prime})*sg_{i}*step\_size
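A simplified sketch of one gradient-based generation (GG) step following Equation 18; the cross-entropy loss, shared label y, and pixel range [0, 255] are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_based_perturbation(model, x, x_prime, y, step_size=5.0):
    """Perturb a pair (x, x') along the pixels whose gradient signs agree."""
    x = x.clone().detach().requires_grad_(True)
    x_prime = x_prime.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_prime), y)
    loss.backward()

    sg, sg_prime = x.grad.sign(), x_prime.grad.sign()
    p = (sg == sg_prime).float() * sg * step_size   # Equation 18
    # The same perturbation p is applied to both images of the pair.
    return (x + p).clamp(0, 255).detach(), (x_prime + p).clamp(0, 255).detach()
```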

Gaussian-noise Injection (GI). Gaussian noise is a common type of noise in the image domain (especially for RGB images), physically caused by poor lighting and high sensor temperature. Therefore, natural-looking synthetic images are often acquired by applying Gaussian perturbations (An, 1996; Arpit et al., 2017; Gerasimou et al., 2020). The probability density of the Gaussian noise (perturbation) p is defined as follows:

(19) f(p)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(p-\mu)^{2}}{2\sigma^{2}}}

where \mu and \sigma are the mean and standard deviation, respectively.
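A corresponding sketch of Gaussian-noise injection (GI) with the Table 2 parameters (mu = 0, sigma = 7); the numpy array input and [0, 255] pixel range are assumptions.

```python
import numpy as np

def gaussian_noise_injection(x, mu=0.0, sigma=7.0, rng=None):
    """Add Gaussian noise p ~ N(mu, sigma^2) to an image (cf. Equation 19)."""
    rng = rng or np.random.default_rng()
    p = rng.normal(loc=mu, scale=sigma, size=x.shape)  # sampled perturbation
    return np.clip(x + p, 0, 255)
```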

After generating test cases to increase the testing adequacy, i.e., the coverage of the metrics, we then address the second question, which is to select a subset of test cases for augmented training so as to improve the model’s fairness more efficiently. In particular, we prefer to select a set of test cases with diverse metric values, inspired by (Wang et al., 2021). Note that the neuron distance of each sample is a vector whose values are collected from each neuron independently, which makes it difficult to quantify accurately. We thus only utilize the layer-level metrics, i.e., the Tanimoto coefficient, cosine similarity, and Spearman correlation, for selection.

Specifically, we adopt the K-Multisection Strategy (KM-ST) in our work, which uniformly samples the value space of a given metric. Formally, let C_{min} and C_{max} be the minimum and maximum values of criterion C collected from all the test cases, n be the total number of test cases we need, and k be the number of sections. We first divide the value range [C_{min},C_{max}] into k equal sections with interval d=(C_{max}-C_{min})/k, and then randomly select n/k samples from each section independently to form the final augmentation dataset.
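A sketch of the KM-ST selection described above, over one layer-level criterion; variable names are illustrative.

```python
import numpy as np

def km_st_select(metric_values, n, k, rng=None):
    """Select roughly n test cases uniformly across k sections of [C_min, C_max].

    metric_values: 1-D array holding the criterion value C of every candidate
    pair. Returns the indices of the selected test cases.
    """
    rng = rng or np.random.default_rng()
    values = np.asarray(metric_values)
    c_min, c_max = values.min(), values.max()
    d = (c_max - c_min) / k
    selected = []
    for i in range(k):
        lo, hi = c_min + i * d, c_min + (i + 1) * d
        hi_mask = values < hi if i < k - 1 else values <= hi
        in_section = np.where((values >= lo) & hi_mask)[0]
        if len(in_section) > 0:  # a section may hold fewer than n/k candidates
            take = min(n // k, len(in_section))
            selected.extend(rng.choice(in_section, size=take, replace=False))
    return np.array(selected)
```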

4. Experiments

We have implemented DeepFAIT as a self-contained toolkit based on PyTorch. We have published the source code, along with all the experiment details and results, online. The evaluations are conducted on a server with an Intel Xeon 3.50GHz CPU, 64GB of memory, and 2 NVIDIA GTX 1080Ti GPUs.

4.1. Experimental Setup

Datasets and Models. We adopt 4 open-source datasets in our evaluation. Two of them are used for training the domain transformers (refer to Section 3.1) for the protected attributes gender and race respectively, and the remaining two are the experiment subjects.

  • CelebA (Liu et al., 2015, [n.d.]) is a widely used large-scale face recognition dataset. It contains 202,599 face images from 10,177 celebrities around the world. Each image has 5 landmark locations and 40 binary attributes, e.g., male, big lips, and pale skin. We use CelebA for gender transformation.

  • BUPT-Transferface (Wang et al., 2019a; Deng, [n.d.]) is a dataset of face images which aims to bring racial awareness into studies on face discrimination and to achieve algorithmic fairness. We use it for race transformation.

  • VGGFace (Parkhi et al., 2015, [n.d.]) consists of over 2,600,000 images of 2,622 identities from IMDB. The dataset is collected from multiple search engines, e.g., Google and Bing, with limited manual curation. We evaluate DeepFAIT on VGGFace.

  • FairFace (Karkkainen and Joo, 2021, [n.d.]) contains 108,501 face images collected from the YFCC100m dataset. It is manually labeled with gender, race, and age groups and is balanced on race. It divides age into 9 bins. We train an age classifier on it and take it as another benchmark for evaluating DeepFAIT.

We utilize two state-of-the-art DNN architectures as the benchmark models.

  • VGG (Simonyan and Zisserman, 2015) is an advanced architecture for extracting abstract features from image data. VGG utilizes small convolution kernels (e.g., 3*3 or 1*1) and pooling kernels (e.g., 2*2) to significantly increase the expressive power of the model.

  • ResNet (He et al., 2016) improves traditional sequential DNNs by addressing the vanishing gradient problem when increasing the number of layers. It utilizes short-cuts (also called skip connections), which add up the input and output of a layer and pass the sum to the next layer as input.

The details of our pre-trained models are shown in Table 1.

Table 1. Accuracies of the experimented models.
Dataset Model Accuracy
VGGFace VGG-16 97.28%
FairFace ResNet-50 96.08%

Face Annotation and Transfer. For VGGFace, we crawl the racial and gender information by retrieving the celebrity identities from an active fan community, FamousFix.com (Raji et al., 2020). We then check manually to make sure the meta information is correct. For FairFace, the metadata about sensitive attributes is downloaded with the images.

When training the transformation models, we need to ensure that only the given sensitive attribute changes, so as to avoid the impact of other sensitive attributes. Therefore, when we train the race (respectively, gender) transformation model, the image pairs that we construct are guaranteed to have the same gender (respectively, race).

Parameters. Table 2 shows the values of all the parameters used to run DeepFAIT in our experiments, which either follow the settings of existing approaches or are decided empirically.

Table 2. Configuration of experiments.
Parameter Value Description
γ 5 the weight of L_{IDE} in CycleGAN
η 10 the weight of L_{CYC} in CycleGAN
α 0.05 the significance level
step_size 5 the step size of perturbation in RG and GG
μ 0 the mean of the Gaussian distribution in GI
σ 7 the standard deviation of the Gaussian distribution in GI
maxIt 10 the maximum number of generation iterations
topK 10 the number of top fairness-sensitive neurons per layer
Z 1000 the number of bins

4.2. Research Questions

We aim to answer the following 3 research questions through our evaluation.

RQ1: Is DeepFAIT effective in identifying fairness-related neurons?

Figure 3. Cumulative distribution function (CDF) w.r.t. the difference of the neuron activation values. Panels: (a) DeepFAIT (Race); (b) DeepFAIT (Gender); (c) DeepImportance (Race); (d) Activation Importance (Race).

Recall that fairness-related neuron selection is key to improving the efficiency and effectiveness of fairness adequacy testing. In the following, we compare DeepFAIT with two baselines. The first is DeepImportance (Gerasimou et al., 2020), a recent work aiming to identify the neurons most relevant to the model’s decision. The other is activation importance (hereafter ActImp), which identifies the neurons with the maximum activation values in a layer. Figure 3 shows the cumulative distribution function (CDF) of the activation differences for VGG-16. Each panel shows the top neuron within the last hidden layer selected by DeepFAIT with respect to race (Figure 3(a)) and gender (Figure 3(b)), by DeepImportance (Figure 3(c)), and by ActImp (Figure 3(d)), respectively. The larger the gap, the more relevant the neuron is to the model’s fairness. We observe that, compared with DeepFAIT, the neurons selected by DeepImportance and ActImp are less related to the model’s fairness. This is because they only capture a neuron’s contribution to the decision on a single sample, which makes them less sensitive to the behavioral differences of a pair of samples. Moreover, it is evident that different sensitive attributes lead to different selected neurons. For instance, the most relevant neurons in FC7 for race and gender are #3369 and #1317, with H values of 9,309 and 646, respectively.

Table 3. The number of fairness-related neurons in the deepest 5 layers. FC, Conv, and Maxpool represent the Fully-Connected layer, Convolutional layer, and Maxpool layer, respectively. The layers are listed from deep to shallow.
Model Layer Total Neurons Fairness-related Neurons (Race) Fairness-related Neurons (Gender)
VGG FC7 4096 3471 3272
FC6 4096 2881 2672
Maxpool5 512 502 492
Conv5_3 512 504 490
Conv5_2 512 512 509
ResNet Conv4.2_3 2048 2028 2044
Conv4.2_2 512 487 508
Conv4.2_1 512 484 507
Conv4.1_3 2048 1904 2036
Conv4.1_2 512 427 509
Figure 4. The H distribution of the deepest 5 layers. Panels: (a) VGGFace (Race); (b) VGGFace (Gender); (c) FairFace (Race); (d) FairFace (Gender).

Table 3 lists the total number of neurons and the number of selected fairness-related neurons in the deepest 5 layers on both VGGFace and FairFace, with respect to race and gender. We can observe that a considerable percentage of the neurons show correlation with fairness; for instance, the proportion ranges from 70.34% (65.23%) to 100.00% (99.41%) on VGGFace with respect to race (gender). Further, Figure 4 shows the H distribution of these 5 layers. Note that the H values within each layer follow a long-tailed distribution, i.e., the H values of most neurons concentrate on lower values, whereas a few are very large, showing a significant correlation with fairness.

We thus have the following answer to RQ1,

Answer to RQ1: DeepFAIT can effectively identify fairness-related neurons. While most of the neurons in the DNN are related to fairness, only a small percentage of them are strongly correlated.

RQ2: How effective is DeepFAIT for measuring the adequacy of fairness testing?

Table 4. Coverage performance (%) w.r.t. layers.
Layer Data Tan. Cos. Spe. Abs. Rel.
FC7 Fair 69.50 61.00 34.30 31.04 37.54
Original 78.30 78.60 42.50 40.73 49.21
Fair+GG 86.20 87.10 45.90 37.41 47.78
FC6 Fair 57.60 46.10 29.80 41.44 61.00
Original 65.00 59.70 36.70 49.31 68.52
Fair+GG 72.30 67.40 40.50 47.62 68.46
Conv5_3 Fair 16.40 36.70 29.20 43.83 78.92
Original 17.60 44.80 34.50 50.86 82.45
Fair+GG 17.60 49.10 37.40 49.95 83.84
Conv4_2 Fair 0.20 13.70 25.90 26.40 45.89
Original 0.20 14.80 27.20 29.35 46.01
Fair+GG 0.20 15.10 28.40 27.76 46.00

As presented in Section 3.3, all the criteria are calculated in a layer-wise manner. We thus first conduct an experiment to investigate the layer sensitivity of the testing adequacy before evaluating their effectiveness.

We evaluate the testing coverage on 3 data combinations. Table 4 shows the coverage computed on 4 ordered layers, i.e., FC7, FC6, Conv5_3, and Conv4_2, of VGG-16 with respect to the sensitive attribute gender. Rows Fair and Original report the coverage of 10,000 fair pairs and of the fair pairs plus all the discriminatory ones in the original dataset (since their number is smaller than 10,000), respectively, and row Fair+GG reports the adequacy of the original fair data augmented with 10,000 discriminatory instance pairs generated by the gradient-based strategy.

First, it can be observed that for the layer-level statistics, the coverage on the layers close to the output layer is significantly higher. For instance, the Tanimoto coefficient of the fair images drops from 69.50% in FC7 to 57.60% and 16.40% in FC6 and Conv5_3, and reaches 0.20% in Conv4_2. One possible explanation is that the deeper layers of the DNN are able to extract more identity-related information. In addition, we also observe that: 1) a deeper layer is more sensitive to the original discriminatory samples, e.g., compared with the fair images, the absolute distance adequacy of the whole original testing dataset increases by 2.95%, 7.03%, 7.87%, and 9.69% for Conv4_2, Conv5_3, FC6, and FC7 respectively; 2) the generated test cases show similar layer sensitivity, e.g., when we augment the original fair data with the data generated using the gradient-based strategy, the coverage of the relative distance from FC7 to Conv4_2 increases by 10.24%, 7.46%, 3.92%, and 0.11%, respectively. Clearly, the selection of layers affects the effectiveness of the testing. We thus suggest choosing the deeper layers to measure the adequacy of test cases.

Figure 5. Test coverage vs. fairness score.

Next, we conduct a correlation analysis between the coverage metrics and the individual fairness of the experimented models. We adopt the model mutation technique developed in (Wang et al., 2019b; Ma et al., 2018b) to efficiently obtain a significant number of models with different behaviors for the study. In this work, we generate 10 models for each mutation operator, e.g., Gaussian Fuzzing, Weight Shuffling, Neuron Switch, and Neuron Activation Inverse. Besides, since it is almost impossible to obtain meaningful images through random sampling in the input space, we measure the individual fairness of a model by the ratio of non-discriminatory instances in the original testing dataset. The accuracy and fairness scores of these 40 models are in the ranges [94.00%, 97.28%] and [64.37%, 89.10%], respectively. To ensure a fair comparison, we randomly select 10,000 non-discriminatory image pairs and obtain the coverage on the deepest hidden layer, i.e., FC7.

We show the Pearson product-moment correlation (Pearson, 1920) results on VGGFace with the sensitive attribute gender in Figure 5. The number in each cell is the correlation between the metrics of the corresponding row and column, which ranges from -1 to 1. Note that all the correlations are significant, i.e., p<0.05.

From Figure 5, we have the following observations. First, the three layer-level criteria are significantly positively correlated with the fairness score, with a minimum correlation coefficient of 0.85. This indicates that if the non-discriminatory instances have a higher coverage on the layer statistics, the DNN is fairer. This is intuitively expected, i.e., a fairer model can tolerate greater behavioral differences. Moreover, the two neuron distances show a highly negative correlation with individual fairness, with values of 0.69 and 0.90. Our explanation is that for a fair model, the output difference of the neurons is small (so as to ensure that the final prediction does not change). Second, it is obvious that the Tanimoto, cosine, and Spearman similarities have strong positive correlations with each other, while there are moderate negative correlations between the layer-level and neuron-level criteria.

Table 5. The adequacy of criteria on different data settings.
Dataset Protected Attr. Criteria Fair Fair+Unfair Fair+GG Fair+RG Fair+GI
VGGFace Race Tanimoto 70.90 84.00 86.40 84.00 78.50
Cosine 30.00 38.20 42.50 52.60 42.20
Spearman 28.40 38.50 40.30 42.70 37.50
Abs. Distance 30.88 46.66 49.68 48.01 44.44
Rel. Distance 31.10 42.50 53.54 53.34 46.65
DeepImportance 0.78 1.10 1.35 1.42 1.23
Gender Tanimoto 69.50 78.30 86.20 78.10 74.00
Cosine 61.00 78.60 87.10 77.70 69.50
Spearman 34.30 42.50 45.90 41.30 37.90
Abs. Distance 31.04 40.73 37.41 39.27 39.87
Rel. Distance 37.54 49.21 47.78 51.34 52.68
DeepImportance 0.93 1.28 1.26 1.38 1.35
FairFace Race Tanimoto 20.30 22.50 35.90 39.00 24.80
Cosine 32.50 46.10 82.00 79.80 40.90
Spearman 20.10 23.20 45.30 38.80 22.00
Abs. Distance 23.42 31.13 58.49 41.55 28.18
Rel. Distance 39.94 53.37 78.01 58.48 50.83
DeepImportance 25.29 29.68 30.56 32.61 34.76
Gender Tanimoto 20.60 23.70 34.20 32.90 23.90
Cosine 27.50 46.10 80.10 57.30 35.80
Spearman 17.70 23.30 43.40 28.90 20.70
Abs. Distance 32.31 45.43 81.85 37.57 37.17
Rel. Distance 39.85 55.80 68.47 50.77 47.29
DeepImportance 26.56 32.61 30.27 33.20 33.20

In traditional software testing, coverage is not only a measure of testing adequacy, but also an effective tool for revealing bugs (Zhang et al., 2019) (i.e., by generating tests with high coverage). In this work, a bug refers to the existence of individual discriminatory samples for a given DNN model. Therefore, we conduct an effectiveness evaluation by comparing the coverage of the testing criteria on the fair dataset and on the dataset augmented with individual discriminatory samples. The latter are obtained in two ways. One is the discriminatory data in the original testing set (column Fair+Unfair), and the other is discriminatory instances generated using 3 different approaches, i.e., gradient-based generation (column Fair+GG), random generation (column Fair+RG), and Gaussian-noise injection (column Fair+GI). The parameters for generation are shown in Table 2.

For each dataset and each sensitive attribute, we randomly select 10,000 image pairs from each testing subset, including the fair and discriminatory data in the original set (if the number is less than 10,000, we take all of them) and the generated individual discriminatory pairs. The coverage results on the penultimate layer are presented in Table 5. We repeat the procedure 5 times and report the average coverage to avoid the effect of randomness. It can be observed that, compared with the original fair pairs, the coverage of all the criteria increases significantly after adding the discriminatory ones. Furthermore, compared with the discriminatory pairs in the original set and those generated with image processing techniques, the fairness testing methods lead to higher coverage on all the criteria in most cases, except for the absolute and relative distances of VGG-16 with respect to gender. This is in line with our expectation, i.e., the criteria are more sensitive to individual discriminatory pairs generated by fairness testing, especially by the gradient-based method. In addition, we further conduct a comparison with DeepImportance (Gerasimou et al., 2020). It is worth noting that DeepImportance reports similar layer sensitivity in its evaluation. For a fair comparison, we also take the 10 most important neurons, and the total numbers of important neuron cluster combinations on VGG-16 and ResNet-50 are 131,072 and 1,024, respectively. We observe that the coverage of DeepImportance also increases when discriminatory samples are added. One possible reason is that the generation of unfair test cases pushes the seed towards the decision boundary (Zhang et al., 2020, 2021), which also reduces the robustness of the seed. However, we also observe that DeepImportance is less sensitive than DeepFAIT in most cases. In addition, since DeepImportance is only computed on individual samples, it cannot distinguish the similarity between each transformed sample and the original one when sensitive attributes have multiple values.

We have the following answer to RQ2,

Answer to RQ2: The performance of our proposed fairness testing criteria varies across layers, i.e., the deeper the layer, the more significant the criteria are. Furthermore, they are strongly correlated with the fairness of the DNN. Specifically, the layer-level statistics show positive correlations, whereas the neuron distances show negative correlations. The fairness testing criteria are effective for measuring testing adequacy and capable of identifying individual discriminatory instances.

RQ3: How effective is DeepFAIT for test case selection?

Figure 6. Fairness improvement with different test case selection strategies. Panels: (a) Race; (b) Gender.

To answer this question, we evaluate the fairness improvement from augmented training, where the augmented data is selected by KM-ST and by a completely random strategy (the baseline), respectively. The training images are generated by the gradient-based method from the original training data, and the validation set is composed of 5,000 unfair samples each from the original testing set and the three synthesized sets (Wang et al., 2021). Note that the training set and the validation set are disjoint. Recall that the neuron distance is a vector and thus not a good quantitative metric for test case selection; DeepImportance has the same problem. Thus we only apply KM-ST to the layer-level criteria.

Figure 6 shows the results on the FairFace dataset. Since we randomly select 10% of the generated unfair samples for data augmentation, we report the average improvement over 5 repetitions. All models after augmented training have only a slight loss of accuracy, i.e., a maximum decrease of 1.44%. From left to right, the bars represent the completely random strategy (RA) and KM-ST on the Tanimoto Coefficient (TC), Cosine Similarity (CS), and Spearman Correlation (SC), respectively. Compared with the completely random strategy, we observe that: 1) applying KM-ST on the cosine similarity and the Spearman correlation reduces more discrimination in the model, especially on the cosine similarity, which achieves the greatest improvement of 5.15% and 4.08% with respect to race and gender, respectively. The reason is that, compared with the completely random method, KM-ST samples uniformly in each smaller subspace, which better ensures that the selected unfair samples are representative; 2) applying KM-ST on the Tanimoto coefficient yields 1.16% and 1.73% less fairness improvement with respect to race and gender, respectively. The reason is that, compared with the cosine similarity and the Spearman coefficient, the Tanimoto coefficient only considers whether neurons in each layer are activated, while ignoring more fine-grained information such as the activation values.

We have the following answer to RQ3,

Answer to RQ3: Compared with completely random selection, applying KM-ST based on the proposed criteria of DeepFAIT is more effective for selecting test cases to reduce the model’s discrimination.

4.3. Threats to Validity

Limited Subjects. Our experimental subjects (i.e., the datasets and DNN models) are limited. It would be interesting to conduct further evaluation on other datasets like VGGFace2 (Cao et al., 2018), and on other model structures like recurrent neural networks (RNNs) (Ribeiro et al., 2016).

Limitation of Domain Transfer. We adopt an image-to-image transformation approach, CycleGAN, to transfer the sensitive facial attribute. The transformation process may have limitations, such as difficulty in translating attributes which implicitly encode unique identity, and sensitivity to illumination variation. However, how to transfer images across sensitive domains while retaining as much other information as possible is still an open problem, and we will further investigate other possible directions.

Limitation of Test Case Generation. Other generation methods, such as coverage-guided fuzzing driven by the proposed metrics, are possible; we will further explore them in future work.

5. Related Work

Fairness Testing. Galhotra et al. (Galhotra et al., 2017) were the first to propose fairness testing of machine learning, and utilized random generation to evaluate the frequency of discriminatory samples. Later, Udeshi et al. (Udeshi et al., 2018) improved it with a two-step random strategy, AEQUITAS. The first stage is global generation, which samples discriminatory cases in the input space completely at random. The second stage is local generation, which randomly searches the neighborhood of the identified discriminatory samples based on a dynamically updated probability. Besides, AEQUITAS tried to improve the model's fairness through automatic augmentation retraining. Aggarwal et al. (Aggarwal et al., 2019) acquired unfair test cases by applying symbolic execution (Wang et al., 2018) to solve the unexplored paths of a local explanation tree (Ribeiro et al., 2016). Its global and local generation aim to maximize the path coverage (diversity) and the number of instances, respectively. Zhang et al. (Zhang et al., 2020, 2021) proposed a DNN-specific algorithm to search for discrimination based on gradients. It first maximizes the output difference between an input pair, and then perturbs the identified discriminatory instance along the attributes with less impact. The difference between DeepFAIT and the above-mentioned works is two-fold. First, we introduce fairness testing to image data by domain transformation, instead of substituting the value of a sensitive attribute with a pre-defined one as in tabular and text data. Second, previous works paid more attention to the generation of test cases, but did not measure the adequacy of testing.

Robustness Testing Criteria. Many robustness testing criteria have been proposed. In (Pei et al., 2017), Pei et al. designed DeepXplore, the first white-box robustness testing framework in the literature, which identifies and crafts unexpected instances with Neuron Coverage. In (Ma et al., 2018a), Ma et al. inherited the key insight and introduced five more fine-grained testing criteria at both the layer and neuron levels. In (Sun et al., 2018b), Sun et al. brought Modified Condition/Decision Coverage into DL testing and proposed the first concolic testing method for DL models to improve four covering metrics based on a given neuron pair from adjacent layers, e.g., Sign-Sign Cover, Distance-Sign Cover, Sign-Value Cover, and Distance-Value Cover. In (Kim et al., 2019), Kim et al. proposed two surprise coverage metrics, LSC and DSC, which measure the range of the likelihood-based and distance-based adequacy values, respectively. Later, (Dong et al., 2020; Yan et al., 2020) conducted empirical studies and showed that there is limited correlation between the robustness of a DL model and the aforementioned coverage criteria. In (Gerasimou et al., 2020), Gerasimou et al. calculate the contribution of each neuron to the final prediction through layer-wise backward propagation. In (Feng et al., 2020), Feng et al. proposed DeepGini, which prioritizes unlabeled test cases by utilizing the impurity of the output vector, so as to reduce the resource consumption of labeling and better improve the robustness of the model. More recently, Wang et al. designed a robustness-oriented fuzzing framework (RobOT) based on loss coverage (First-Order Loss) (Wang et al., 2021). Different from the above robustness testing metrics, our fairness testing criteria aim to measure the behavioral difference between two similar samples, rather than the output of a single sample.

6. Conclusion

In this paper, we bridge the gap in existing fairness testing research by proposing a novel testing framework, DeepFAIT, which systematically evaluates and improves the fairness testing adequacy of deep image classification applications. Our approach first selects the fairness-related neurons using significance testing, then evaluates the fairness testing adequacy with five multi-granularity adequacy metrics, and lastly selects test cases based on the criteria to mitigate discrimination efficiently. We evaluate DeepFAIT on two large-scale public face recognition datasets. The results show that DeepFAIT is effective in identifying fairness-related neurons, detecting unfair samples, and selecting representative test cases to improve the model's fairness.

References

  • Aggarwal et al. (2019) Aniya Aggarwal, Pranay Lohia, Seema Nagar, Kuntal Dey, and Diptikalyan Saha. 2019. Black Box Fairness Testing of Machine Learning Models. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Tallinn, Estonia. ACM, 625–635. https://doi.org/10.1145/3338906.3338937
  • An (1996) Guozhong An. 1996. The Effects of Adding Noise During Backpropagation Training on a Generalization Performance. Neural Comput. 8, 3 (1996), 643–674. https://doi.org/10.1162/neco.1996.8.3.643
  • Zhang, Zhang, and Zhang (2021) Lingfeng Zhang, Yueling Zhang, and Min Zhang. 2021. Efficient White-box Fairness Testing through Gradient Search. In ISSTA ’21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Denmark. ACM, 103–114. https://doi.org/10.1145/3460319.3464820
  • Arpit et al. (2017) Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. 2017. A Closer Look at Memorization in Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, Vol. 70. PMLR, 233–242. http://proceedings.mlr.press/v70/arpit17a.html
  • Bastani et al. (2019) Osbert Bastani, Xin Zhang, and Armando Solar-Lezama. 2019. Probabilistic Verification of Fairness Properties via Concentration. PACMPL 3, OOPSLA (2019), 118:1–118:27. https://doi.org/10.1145/3360544
  • Cao et al. (2018) Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. 2018. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China. IEEE Computer Society, 67–74. https://doi.org/10.1109/FG.2018.00020
  • Deng ([n.d.]) Weihong Deng. [n.d.]. Ethnicity Aware Training Datasets. http://www.whdeng.cn/RFW/Trainingdataste.html
  • Dong et al. (2020) Yizhen Dong, Peixin Zhang, Jingyi Wang, Shuang Liu, Jun Sun, Jianye Hao, Xinyu Wang, Li Wang, Jinsong Dong, and Ting Dai. 2020. An Empirical Study on Correlation between Coverage and Robustness for Deep Neural Networks. In 25th International Conference on Engineering of Complex Computer Systems, ICECCS 2020, Singapore. IEEE, 73–82. https://doi.org/10.1109/ICECCS51672.2020.00016
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2012. Fairness through Awareness. In Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA. ACM, 214–226. https://doi.org/10.1145/2090236.2090255
  • Feldman et al. (2015) Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and Removing Disparate Impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia. ACM, 259–268. https://doi.org/10.1145/2783258.2783311
  • Feng et al. (2020) Yang Feng, Qingkai Shi, Xinyu Gao, Jun Wan, Chunrong Fang, and Zhenyu Chen. 2020. DeepGini: Prioritizing Massive Tests to Enhance the Robustness of Deep Neural Networks. In ISSTA ’20: 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, USA. ACM, 177–188. https://doi.org/10.1145/3395363.3397357
  • Galhotra et al. (2017) Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. 2017. Fairness Testing: Testing Software for Discrimination. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software, ESEC/FSE 2017, Paderborn, Germany. ACM, 498–510. https://doi.org/10.1145/3106237.3106277
  • Garg et al. (2019) Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual Fairness in Text Classification through Robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA. ACM, 219–226. https://doi.org/10.1145/3306618.3317950
  • Gerasimou et al. (2020) Simos Gerasimou, Hasan Ferit Eniser, Alper Sen, and Alper Cakan. 2020. Importance-Driven Deep Learning System Testing. In ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea. ACM, 702–713. https://doi.org/10.1145/3377811.3380391
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada. 2672–2680. https://proceedings.neurips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
  • Goodfellow et al. (2015) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA. http://arxiv.org/abs/1412.6572
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA. IEEE Computer Society, 770–778. https://doi.org/10.1109/CVPR.2016.90
  • Huang et al. (2008) Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. 2008. Labeled faces in the wild: A database for Studying Face Recognition in Unconstrained Environments. In Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition. https://hal.inria.fr/inria-00321923/file/Huang_long_eccv2008-lfw.pdf
  • Karkkainen and Joo ([n.d.]) Kimmo Karkkainen and Jungseock Joo. [n.d.]. FairFace. https://github.com/dchen236/FairFace
  • Karkkainen and Joo (2021) Kimmo Karkkainen and Jungseock Joo. 2021. FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. IEEE, 1548–1558. http://arxiv.org/abs/1908.04913
  • Kim et al. (2019) Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding Deep Learning System Testing Using Surprise Adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada. IEEE / ACM, 1039–1049. https://doi.org/10.1109/ICSE.2019.00108
  • Kruskal and Wallis (1952) William H. Kruskal and W. Allen Wallis. 1952. Use of Ranks in One-Criterion Variance Analysis. J. Amer. Statist. Assoc. 47, 260 (1952), 583–621. https://doi.org/10.1080/01621459.1952.10483441
  • Liu et al. ([n.d.]) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. [n.d.]. Large-scale CelebFaces Attributes (CelebA) Dataset. http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
  • Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile. IEEE Computer Society, 3730–3738. https://doi.org/10.1109/ICCV.2015.425
  • Ma et al. (2018a) Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018a. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France. ACM, 120–131. https://doi.org/10.1145/3238147.3238202
  • Ma et al. (2018b) Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018b. DeepMutation: Mutation Testing of Deep Learning Systems. In 29th IEEE International Symposium on Software Reliability Engineering, ISSRE 2018, Memphis, TN, USA. IEEE Computer Society, 100–111. https://doi.org/10.1109/ISSRE.2018.00021
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada. OpenReview.net. https://openreview.net/forum?id=rJzIBfZAb
  • on Artificial Intelligence (AI HLEG) (2018) High-Level Expert Group on Artificial Intelligence (AI HLEG). 2018. Draft Ethics Guidelines for Trustworthy AI. Technical Report. European Commission.
  • Papernot et al. (2016) Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. 2016. The Limitations of Deep Learning in Adversarial Settings. In IEEE European Symposium on Security and Privacy, EuroS&P 2016, Saarbrücken, Germany. 372–387. https://doi.org/10.1109/EuroSP.2016.36
  • Parkhi et al. ([n.d.]) Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. [n.d.]. VGG Face Dataset. https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
  • Parkhi et al. (2015) Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. 2015. Deep Face Recognition. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK. BMVA Press, 41.1–41.12. https://doi.org/10.5244/C.29.41
  • Pearson (1900) Karl Pearson. 1900. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it can be Reasonably Supposed to Have Arisen from Random Sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50, 302 (1900), 157–175.
  • Pearson (1920) Karl Pearson. 1920. Notes on the History of Correlation. Biometrika 13, 1 (1920), 25–45. https://doi.org/10.1093/biomet/13.1.25
  • Pei et al. (2017) Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China. ACM, 1–18. https://doi.org/10.1145/3132747.3132785
  • Raji et al. (2020) Inioluwa Deborah Raji, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton. 2020. Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. In AIES ’20: AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA. ACM, 145–151. https://doi.org/10.1145/3375627.3375820
  • Ribeiro et al. (2016) Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. ACM, 1135–1144. https://doi.org/10.1145/2939672.2939778
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA. 815–823. https://doi.org/10.1109/CVPR.2015.7298682
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA. http://arxiv.org/abs/1409.1556
  • Spearman (1904) Charles Spearman. 1904. “General Intelligence,” Objectively Determined and Measured. The American Journal of Psychology 15, 2 (1904), 201–292. https://doi.org/10.2307/1412107
  • Sun et al. (2018a) Youcheng Sun, Xiaowei Huang, and Daniel Kroening. 2018a. Testing Deep Neural Networks. CoRR abs/1803.04792 (2018). http://arxiv.org/abs/1803.04792
  • Sun et al. (2018b) Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018b. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France. ACM, 109–119. https://doi.org/10.1145/3238147.3238172
  • Tanimoto (1968) T. T. Tanimoto. 1968. An Elementary Mathematical Theory of Classification and Prediction. Technical Report. IBM.
  • Thomas et al. (2019) Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, Stephen Giguere, Yuriy Brun, and Emma Brunskill. 2019. Preventing Undesirable Behavior of Intelligent Machines. Science 366, 6468 (2019), 999–1004. https://science.sciencemag.org/content/366/6468/999
  • Udeshi et al. (2018) Sakshi Udeshi, Pryanshu Arora, and Sudipta Chattopadhyay. 2018. Automated Directed Fairness Testing. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France. ACM, 98–108. https://doi.org/10.1145/3238147.3238165
  • Vieira et al. (2017) Sandra Vieira, Walter H.L. Pinaya, and Andrea Mechelli. 2017. Using Deep Learning to Investigate the Neuroimaging Correlates of Psychiatric and Neurological Disorders: Methods and Applications. Neuroscience & Biobehavioral Reviews 74 (2017), 58–75. https://doi.org/10.1016/j.neubiorev.2017.01.002
  • Wang et al. (2021) Jingyi Wang, Jialuo Chen, Youcheng Sun, Xingjun Ma, Dongxia Wang, Jun Sun, and Peng Cheng. 2021. RobOT: Robustness-Oriented Testing for Deep Learning Systems. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain. IEEE, 300–311. https://doi.org/10.1109/ICSE43902.2021.00038
  • Wang et al. (2019b) Jingyi Wang, Guoliang Dong, Jun Sun, Xinyu Wang, and Peixin Zhang. 2019b. Adversarial Sample Detection for Deep Neural Network through Model Mutation Testing. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada. IEEE / ACM, 1245–1256. https://doi.org/10.1109/ICSE.2019.00126
  • Wang et al. (2019a) Mei Wang, Weihong Deng, Jiani Hu, Xunqiang Tao, and Yaohai Huang. 2019a. Racial Faces in the Wild: Reducing Racial Bias by Information Maximization Adaptation Network. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South). IEEE, 692–702. https://doi.org/10.1109/ICCV.2019.00078
  • Wang et al. (2018) Xinyu Wang, Jun Sun, Zhenbang Chen, Peixin Zhang, Jingyi Wang, and Yun Lin. 2018. Towards optimal concolic testing. In Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden. ACM, 291–302. https://doi.org/10.1145/3180155.3180177
  • Wiegand et al. (2019) Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. Detection of Abusive Language: the Problem of Biased Datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA. Association for Computational Linguistics, 602–608. https://doi.org/10.18653/v1/n19-1060
  • Wulczyn et al. (2017) Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex Machina: Personal Attacks Seen at Scale. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia. ACM, 1391–1399. https://doi.org/10.1145/3038912.3052591
  • Yan et al. (2020) Shenao Yan, Guanhong Tao, Xuwei Liu, Juan Zhai, Shiqing Ma, Lei Xu, and Xiangyu Zhang. 2020. Correlations between Deep Neural Network Model Coverage Criteria and Model Quality. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA. ACM, 775–787. https://doi.org/10.1145/3368089.3409671
  • Yucer et al. (2020) Seyma Yucer, Samet Akçay, Noura Al Moubayed, and Toby P. Breckon. 2020. Exploring Racial Bias within Face Recognition via per-subject Adversarially-Enabled Data Augmentation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA. IEEE, 83–92. https://doi.org/10.1109/CVPRW50498.2020.00017
  • Zhang et al. (2019) Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2019. Machine Learning Testing: Survey, Landscapes and Horizons. CoRR abs/1906.10742 (2019). http://arxiv.org/abs/1906.10742
  • Zhang et al. (2020) Peixin Zhang, Jingyi Wang, Jun Sun, Guoliang Dong, Xinyu Wang, Xingen Wang, Jin Song Dong, and Ting Dai. 2020. White-box Fairness Testing through Adversarial Sampling. In ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea. ACM, 949–960. https://doi.org/10.1145/3377811.3380331
  • Zhang et al. (2021) Peixin Zhang, Jingyi Wang, Jun Sun, Xinyu Wang, Guoliang Dong, Xingen Wang, Ting Dai, and Jin Song Dong. 2021. Automatic Fairness Testing of Neural Classifiers through Adversarial Sampling. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2021.3101478
  • Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy. IEEE Computer Society, 2242–2251. https://doi.org/10.1109/ICCV.2017.244