Foundations for Unfairness in Anomaly Detection - Case Studies in Facial Imaging Data
Abstract
Deep anomaly detection (AD) is perhaps the most controversial of data analytic tasks as it identifies entities that are then specifically targeted for further investigation or exclusion. Also controversial is the application of AI to facial imaging data. This work explores the intersection of these two areas to understand two core questions: “Who” these algorithms are being unfair to and, equally important, “Why”. Recent work has shown that deep AD can be unfair to different groups despite being unsupervised, with a recent study showing that, for portraits of people, men of color are far more likely to be chosen as outliers. We study the two main categories of AD algorithms, autoencoder-based and one-class-based, which effectively try to compress all the instances, with those that cannot be easily compressed being deemed outliers. We experimentally verify sources of unfairness such as the under-representation of a group (e.g. people of color are relatively rare), spurious group features (e.g. men are often photographed with hats), and group labeling noise (e.g. race is subjective). We conjectured that lack of compressibility is the main foundation and that the others cause it, but experimental results show otherwise, and we present a natural hierarchy amongst them.
Introduction
Anomaly detection (AD) is a central part of data analytics and perhaps the most controversial, given that it is employed in high-impact applications that identify individuals for intervention, policing, and investigation. It is widely used to identify unusual behavior in finance (transactions) (Huang et al. 2018; Zamini and Hasheminejad 2019), social media (posting and account creation) (Yu et al. 2016; Savage et al. 2014), and government services (Medicare claims) (Zhang and He 2017; Bauder and Khoshgoftaar 2017).
Perhaps one of the most controversial applications of AI is to facial imaging. This is due to our faces being uniquely identifying and personal. Further, AI's ability to identify us and make decisions (without consent) crosses many cultural and legal barriers (Garvie, Bedoya, and Frankle 2016). Existing work on facial data has focused predominantly on facial recognition, that is, given a large collection of people in a known database, identifying whether any of them occur in an image. Though legislation and progress have been made towards regulating facial recognition technology (Almeida, Shmarko, and Lomas 2022), other technologies involving facial images, in particular AD, are starting to emerge, which gives rise to new ethical considerations and the need for new understanding.
Previous work (Zhang and Davidson 2021) has just begun to explore the unfairness at the intersection of AD and facial imaging data. For example, our previous work showed that applying AD to a collection of celebrity images overwhelmingly flagged people of color and males as anomalies (see Figure 1). However, that work was mainly focused on making AD algorithms fairer. We recreate our earlier results not only for the one-class AD method and the celebrity image dataset used there, but also for the popular auto-encoder AD method and a more challenging dataset (Labeled Faces in the Wild (Huang et al. 2007)).
Our experimental section addresses the “Who” and “Why” questions. We adopt a measure of unfairness, the Disparate Impact Ratio (DIR), which measures how over-represented a protected group (or its complement) is in the anomaly set. We then experimentally investigate who these algorithms are being unfair to, along with more nuanced questions such as whether the same group is always treated unfairly regardless of algorithm. We also explore why an unsupervised algorithm can be biased. We conjecture four main foundations of unfairness, propose metrics to measure them, and outline a series of experiments to test a hypothesis on how they are structured.
The contributions of this work are as follows:
• We study the “Who” and “Why” questions when anomaly detection is applied to facial imaging data, a topic that, to our knowledge, has not been addressed before.
• Our experiments addressing the “Who” question show that group-level unfairness is due to an interaction between the dataset and the algorithm.
• We conjecture four main reasons for the “Why” question: i) incompressibility, ii) sample size bias (SSB), iii) spurious feature variance (SFV) within a group, and iv) attribute/group labeling noise (ALN).
• We postulate an intuitive structure among our conjectured reasons, show that it is not empirically verified, and present the alternative structure suggested by our experimental results.
We begin by discussing background and related work. We then introduce how we measure unfairness in AD and our four proposed foundations of unfairness. Next, our experimental results addressing the “Who” and “Why” questions are presented after which we discuss and conclude our work.
Background and Related Work
Applications of AD to Facial Data. AD algorithms have been used on imaging data for a variety of reasons. Perhaps the most ubiquitous is data cleaning, where anomalies are viewed as “noise” (Ng and Winkler 2014), removed, and then a downstream supervised algorithm is applied. However, if the AD algorithm is biased, this creates an under-representation in the downstream training data.
Another common use of AD is to view the outliers as “signal” and in doing so flag them for extra attention. Examples include using AD to identify facial expressions that signal emotions (Zhang et al. 2020) such as surprise. However, if the AD is biased towards some groups, it will over-predict certain emotions for those groups. Similarly, AD can be used to identify aggressive behavior (Cao et al. 2021); if the AD is biased towards some groups, it will incorrectly identify them as being overly aggressive.
Source of Bias. It has been well established that supervised learning algorithms can be biased for a variety of reasons. In particular, class labeling bias has been extensively studied in the context of the COMPAS dataset (Angwin et al. 2016). Even when features (e.g. race) associated with this bias are removed, deep learning offers the ability to learn surrogates (e.g. zip code) (Raghavan et al. 2020).
Work on fair AD started in 2020 (Davidson and Ravi 2020; Abraham et al. 2021) and has shown that AD algorithms can produce biased results. Most work has focused on how to correct unfairness for a particular algorithm. This involves understanding the limitations of the algorithm's computation and then correcting for them. This has been explored for classic density-based methods such as LOF (Abraham et al. 2021) and for deep learning methods, including autoencoder (Shekhar, Shah, and Akoglu 2021), one-class (Zhang and Davidson 2021) and multi-class deep AD methods. However, despite this earlier body of work, there has been surprisingly little work discussing what produces unfairness in unsupervised anomaly detection.
Four Reasons for Unfairness And Their Measurement
Here we outline our four premises for unfairness in AD and explain them at a conceptual level using Figure 1. We then describe how we measure them.
Incompressibility of Data
We begin by discussing how AD methods work, in particular what causes an instance to be an outlier. Deep AD methods at their core employ compression, either directly or indirectly. Instances that cannot be compressed well are deemed outliers, and if a group is unusual in some sense it will be unfairly treated, as it will be hard to compress and hence overwhelmingly flagged as an outlier.
To understand this further, we present a common taxonomy of anomaly detection algorithms (Pang et al. 2021).
Autoencoder for Anomaly Detection. Let $\phi_E(\cdot;\theta_E)$ be the encoding network which maps the data into the compressed latent space and $\phi_D(\cdot;\theta_D)$ be the decoding network which maps the latent representation back to the original feature space (Hinton 1989). Given the network parameters $\theta = (\theta_E, \theta_D)$, the standard reconstruction objective to train the autoencoder on instances $x_1,\ldots,x_n$ is:

$$\min_{\theta}\; \sum_{i=1}^{n} \big\| x_i - \phi_D\big(\phi_E(x_i;\theta_E);\theta_D\big) \big\|^2 + \lambda R(\theta) \qquad (1)$$

The term $R(\theta)$ denotes the regularization on the encoder and decoder. The anomaly score for instance $x_i$ is calculated from the reconstruction error:

$$s(x_i) = \big\| x_i - \phi_D\big(\phi_E(x_i;\theta_E);\theta_D\big) \big\|^2 \qquad (2)$$

Here clearly an outlier is defined as being an instance that the AE cannot easily compress and hence cannot easily reconstruct (Japkowicz, Myers, and Gluck 1995).
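For concreteness, the following is a minimal PyTorch sketch of this scoring rule (our own illustration, not the original implementation; it assumes already-trained `encoder` and `decoder` modules):

```python
import torch

@torch.no_grad()
def ae_anomaly_scores(encoder, decoder, x):
    """Equation 2: anomaly score = squared reconstruction error per instance.

    x: tensor of shape (batch, channels, height, width).
    Returns a tensor of shape (batch,); larger means harder to compress.
    """
    recon = decoder(encoder(x))
    err = (x - recon) ** 2
    return err.flatten(start_dim=1).sum(dim=1)
```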
One-Class/Cluster Anomaly Detection. Next, consider one-class anomaly detection, which is still unsupervised. Given the training data of instances $X = \{x_1, \ldots, x_n\}$, a one-class AD method such as the popular deep SVDD (Ruff et al. 2018) network is trained to map all the instances close to a fixed center $c$. Denoting by $\phi(\cdot; W)$ a neural network with parameters $W$, the training objective function is:

$$\min_{W}\; \frac{1}{n}\sum_{i=1}^{n} \big\| \phi(x_i; W) - c \big\|^2 + \lambda R(W) \qquad (3)$$

where the term $R(W)$ represents the regularization function. Then the anomaly score is naturally the distance to $c$:

$$s(x_i) = \big\| \phi(x_i; W) - c \big\|^2 \qquad (4)$$

Here the aim is to map all points onto a central point; those that cannot be so compressed are deemed outliers.
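A corresponding sketch of the deep SVDD objective and score (Equations 3-4), again illustrative only and assuming a trained embedding network `phi` and a fixed `center` tensor:

```python
import torch

def svdd_loss(phi, x, center):
    """Equation 3 (without the regularization term): mean squared distance
    of the embeddings to the fixed center."""
    z = phi(x)                                  # (batch, latent_dim)
    return ((z - center) ** 2).sum(dim=1).mean()

@torch.no_grad()
def svdd_anomaly_scores(phi, x, center):
    """Equation 4: squared distance of each embedding to the center."""
    z = phi(x)
    return ((z - center) ** 2).sum(dim=1)
```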
Deep Clustering for Anomaly Detection. Deep Embedded Clustering (DEC) (Xie, Girshick, and Farhadi 2016) is one of the earlier deep clustering methods that combines representation learning with clustering using a clever self-supervision approach. Recently this work was extended to perform outlier detection (Song, Li, and Liu 2021).
The distance of a point from its closest centroid is naturally an anomaly score:

$$s(x_i) = \min_{j \in \{1,\ldots,k\}} \big\| \phi(x_i) - \mu_j \big\|^2 \qquad (5)$$

where $\mu_j$ denotes the centroid of the cluster $j$ that instance $x_i$ is assigned to, $k$ denotes the total number of clusters, and $\phi$ is the deep learner embedding function.
The core idea here is an extension to the one-class AD method mentioned earlier but extended to clusters.
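The clustering-based score (Equation 5) differs from the one-class score only in taking the minimum distance over $k$ centroids, e.g. (illustrative sketch):

```python
import torch

@torch.no_grad()
def cluster_anomaly_scores(phi, x, centroids):
    """Equation 5: squared distance to the nearest of k centroids.

    centroids: tensor of shape (k, latent_dim).
    """
    z = phi(x)                          # (batch, latent_dim)
    d = torch.cdist(z, centroids)       # (batch, k) pairwise distances
    return d.min(dim=1).values ** 2
```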
Causes Beyond Incompressibility
The above states that outliers are inherently points that the deep learner cannot compress. Hence it is natural to consider the reasons why a deep learner cannot compress a group as key issues for unfairness. Here we conjecture three main reasons, with the view that they are related to biased outliers as shown in Figure 2.
Group Underrepresentation. Here we have a group that is relatively rare in the dataset but has some unique properties, so the deep learner cannot compress it well. For example, in Figure 1, many outliers are African Americans as they make up under 15% of the dataset; hence the deep learner uses its limited encoding space to encode more populous properties.
Spurious Features for Groups. In this situation, the group has a property that is not critical for the outlier detection task but is highly variable. For example, in Figure 1, many groups that are over-represented in the outliers wear different styles of hats.
Attribute Labeling Noise. Here the labeling of a group is inaccurate, which can be a reason a group appears to be overly abundant in the outlier set. For example, in Figure 1, the second-to-bottom row of outliers all have the tag Male, but this is erroneous.
Measurements of Unfairness and Four Properties
Before discussing our empirical results, we first define each of the properties and how anomaly unfairness is measured. Many of these metrics are the maximum of some expression and its reciprocal. This is because the presence of a tag is equally important as its absence: for example, disparate treatment of young people and disparate treatment of old (i.e. not young) people are equally important phenomena to study. We first describe how we measure unfairness for anomalies and then how we measure our four properties.
Anomaly DIR: The unfairness of an AD algorithm's output for a particular group $g$ is measured by the disparate impact ratio (DIR), which is (Feldman et al. 2015):

$$\mathrm{DIR}(D, g) = \max\!\left( \frac{P_{x \in D}\big(f(x)=1 \mid x \in g\big)}{P_{x \in D}\big(f(x)=1 \mid x \notin g\big)},\; \frac{P_{x \in D}\big(f(x)=1 \mid x \notin g\big)}{P_{x \in D}\big(f(x)=1 \mid x \in g\big)} \right) \qquad (6)$$

Here $D$ is the dataset on which the AD algorithm ($f$) has made predictions (normal vs anomaly), with $f(x)=1$ implying $x$ is an anomaly and $f(x)=0$ implying it is a normal instance, and $g$ is the group in question. This is a natural choice for anomaly detection as it compares the rate at which different attributes are flagged as anomalies, normalized by how often the rest of the data is considered anomalous. It is also the most widely used metric in fair unsupervised learning (Verma and Rubin 2018). The range for this metric is $[1, \infty)$, with the larger the number the more unfairly group $g$ is treated.
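As a concrete reference for Equation 6, the sketch below computes the Anomaly DIR from binary anomaly flags and group membership (illustrative only; it assumes both the group and its complement contain at least one flagged instance):

```python
import numpy as np

def anomaly_dir(is_anomaly, in_group):
    """Equation 6: disparate impact ratio of anomaly flags for a group.

    is_anomaly, in_group: boolean arrays of shape (n,).
    Returns max(rate_in_group / rate_rest, rate_rest / rate_in_group),
    so the value is >= 1 and symmetric in presence/absence of the tag.
    """
    is_anomaly = np.asarray(is_anomaly, dtype=bool)
    in_group = np.asarray(in_group, dtype=bool)
    rate_in = is_anomaly[in_group].mean()
    rate_out = is_anomaly[~in_group].mean()
    ratio = rate_in / rate_out
    return max(ratio, 1.0 / ratio)
```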
Incompressibility: To measure this property, we extend the typical measure of reconstruction error into the novel metric of reconstruction ratio, which is defined:

$$\mathrm{RR}(D, g) = \max\!\left( \frac{\mathbb{E}_{x \in g}\big[\|x - AE(x)\|^2\big]}{\mathbb{E}_{x \notin g}\big[\|x - AE(x)\|^2\big]},\; \frac{\mathbb{E}_{x \notin g}\big[\|x - AE(x)\|^2\big]}{\mathbb{E}_{x \in g}\big[\|x - AE(x)\|^2\big]} \right) \qquad (7)$$

Here $D$ and $g$ are the data used for AD and the group again, with $AE$ being the autoencoder model (both encoder and decoder). The range of Equation 7 is therefore also $[1, \infty)$, where a higher number indicates that a group is harder to compress than the rest of the data. For example, an RR of 2 indicates that the attribute/group (or the absence of the attribute/group) is twice as difficult to compress as the rest of the data.
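Equation 7 can be computed directly from per-instance reconstruction errors, as in the following illustrative sketch:

```python
import numpy as np

def reconstruction_ratio(recon_error, in_group):
    """Equation 7: how much harder a group is to compress than the rest.

    recon_error: per-instance autoencoder reconstruction errors, shape (n,).
    in_group: boolean group membership, shape (n,).
    """
    recon_error = np.asarray(recon_error, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    ratio = recon_error[in_group].mean() / recon_error[~in_group].mean()
    return max(ratio, 1.0 / ratio)
```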
Sample Size Bias (SSB): SSB (sometimes referred to as representation bias) is determined by the proportion of that tag, or its absence, in the dataset and is measured as (Suresh and Guttag 2021):

$$\mathrm{SSB}(D, g) = \max\!\left( \frac{|\{x \in D : x \in g\}|}{|D|},\; \frac{|\{x \in D : x \notin g\}|}{|D|} \right) \qquad (8)$$

where $D$ and $g$ are again the data and the group in question. Because all groups are binary (or one-hot encoded), the range of this metric is $[0.5, 1]$, with 0.5 indicating perfect balance of the group (i.e. males and females are equally likely) and 1 indicating that the group is always on or always off. Most groups fall between these two extremes.
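Equation 8 reduces to the proportion of the more common side of the binary tag, e.g. (illustrative sketch):

```python
import numpy as np

def sample_size_bias(in_group):
    """Equation 8: proportion of the more common side of a binary tag.

    Returns a value in [0.5, 1]; 0.5 is perfect balance, 1 is a constant tag.
    """
    p = np.asarray(in_group, dtype=bool).mean()
    return max(p, 1.0 - p)
```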
Spurious Feature Variance (SFV): SFV refers to the amount of variance in the background objects in the image and is measured as a proportion of the reconstruction error of the image:

$$\mathrm{SFV}(D, g) = \frac{\mathbb{E}_{x \in g}\big[\|(x - AE(x)) \odot (1 - B_x)\|^2\big]}{\mathbb{E}_{x \in g}\big[\|x - AE(x)\|^2\big]} \qquad (9)$$

where $D$ is the data, $AE$ is the autoencoder, $g$ the tag, and $B_x$ is (the binary mask of) a bounding rectangle around the foreground/focus of the image (i.e. the face), either provided by the data or estimated (Kumar et al. 2009). As the denominator is clearly always greater than or equal to the numerator, SFV ranges between $[0, 1]$, where higher values indicate that more error comes from spurious features.
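A sketch of Equation 9, under our reading that the numerator is the reconstruction error falling outside the face bounding box (the array layout and box format below are assumptions for illustration):

```python
import numpy as np

def spurious_feature_variance(images, recons, boxes, in_group):
    """Equation 9 (sketch): fraction of reconstruction error outside the face box.

    images, recons: numpy arrays of shape (n, H, W, C); boxes: (n, 4) integer
    face rectangles as (top, bottom, left, right); in_group: bool array (n,).
    """
    in_group = np.asarray(in_group, dtype=bool)
    bg_err, total_err = 0.0, 0.0
    for img, rec, (top, bottom, left, right) in zip(
            images[in_group], recons[in_group], boxes[in_group]):
        err = (img - rec) ** 2
        face_err = err[top:bottom, left:right].sum()
        bg_err += err.sum() - face_err     # error outside the face rectangle
        total_err += err.sum()
    return bg_err / total_err
```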
Attribute Labeling Noise (ALN): This is a metric of how noisy the labeling of a particular group is, as provided by the academic literature ((Lingenfelter, Davis, and Hand 2022) for CelebA and (Kumar et al. 2009) for LFW). Some groups such as Gender tend to have very low ALN, whereas other tags such as Blurry have very high ALN (Kumar et al. 2009). We define ALN as:

$$\mathrm{ALN}(D, g) = P_{x \in D}\big(g(x) \neq g^{*}(x)\big) \qquad (10)$$

where $D$ is the data, $g(x)$ the provided label for the group in question, and $g^{*}(x)$ the true label for the group. The higher the value of this property, the less reliable the group labeling.
Experimental Results - Who Is AD Unfair To?
Here we answer the question: Who are the groups of individuals most adversely affected? Following this, we explore more nuanced inquiries, such as whether the unfairness is attributable solely to the data, the algorithm, or a combination of both. In the subsequent section, we aim to investigate the underlying reasons for the unfairness inherent in AD.
Our experiments consist of two core AD algorithms: a reconstruction-based autoencoder anomaly detection algorithm (hereafter referred to as AE) and deep one-class SVDD (Ruff et al. 2018). As mentioned earlier, clustering-based AD is a generalization of the one-class and AE methods. Our datasets consist of the CelebFaces Attributes Dataset (CelebA) (Liu et al. 2015) (the 50,000-instance version, to reduce compute), which consists primarily of popular individuals in movies, music, or the arts, whilst Labeled Faces in the Wild (LFW) (Huang et al. 2007) consists of approximately 13,000 instances and includes a wider variety of popular individuals such as politicians, sports stars, and criminals. Attribute annotations for CelebA are provided with the dataset, and those for LFW are provided by (Kumar et al. 2009). These two datasets were chosen as they are well-annotated, including analyses of labeling error, and have been extensively studied. Across our datasets, we test a total of 63,233 facial images covering 111 attribute tags. We examine each algorithm individually for a total of 222 data points on fairness. Both the CelebA and LFW datasets are publicly available.
For each dataset and algorithm, we determine the unfairness of each group using the Anomaly DIR. Results are collected over five random initializations of the network, and the median results for each property are reported. The list of all raw results is in the appendix; below we outline some key insights.
The Algorithms are Overwhelmingly Fair to Most Groups. In total, across the two algorithms and two datasets, there are 222 group evaluations, and a frequency distribution shows that the algorithms are overwhelmingly fair: over 70% of the groups have low scores, as shown in Figure 3. A score of less than 1.2 indicates that the occurrence of the group in the anomalies is no more than 20% greater than the rate at which all other groups (together) are labeled anomalies.
However, there are significant examples of unfairness whose properties we now discuss.
Few Groups Are Always Treated Unfairly. We found that there are several groups that are always (regardless of algorithm or dataset) treated unfairly, but they are relatively rare. These include the groups centered around weight (the annotations Chubby and Double-Chin) and those centered around very unusual image properties such as Wearing-Hat. This is not unexpected, given that a very rare group with unusual properties (not shared by other groups) is unlikely to be well compressed. In total, less than 2% of all groups are treated unfairly all the time.
Unfairness Varies Due to Both Algorithm and Dataset. A more likely occurrence is that some groups are treated very unfairly, but only for some datasets and some algorithms. Table 1 shows in bold the groups treated unfairly (the Anomaly DIR is shown in parentheses) only for that dataset and algorithm combination. For other algorithm-dataset combinations they are treated fairly, as the table shows. This result is surprising and shows the strong interaction between the algorithm and the data. Consider that the AE method flagged the No-Beard group (reported as “Beard”) in the CelebA dataset as anomalous at a rate over 3 times greater than the other groups. Yet the SVDD algorithm on the very same dataset produced just a 1.27 DIR for that group, and in the LFW dataset the DIR for both algorithms was below 1.2.
[Table 1: groups treated unfairly (Anomaly DIR in parentheses, shown in bold) for each algorithm (rows: AE, SVDD) and dataset (columns: CelebA, LFW) combination.]
The More Focused The Dataset The More Likely Unfairness Can Occur. When we aggregated all fairness DIR scores (see Appendix) for each group and all algorithms we found that the CelebA dataset (Mean DIR = 1.4) causes significantly more unfairness than the LFW dataset (Mean DIR = 1.13).
This is likely due to the CelebA dataset having a much more focused selection bias as it is limited to people who are overwhelmingly in the arts (film, television, music) whereas the LFW dataset consists of a larger representation of popular people. Hence, the definition of normality learned is very specific and there are many ways to deviate from the norm. Examples of groups that are found to be unfairly treated in the CelebA dataset but NOT the LFW dataset are: Wearing Hat, Big Nose, Eye-Glasses, Goatee, Wavy-Hair.
The More Focused The Algorithm The More Likely Unfairness Can Occur.
Similarly, the way the algorithm defines normality is influential in who it identifies as an anomaly. The SVDD algorithm has the strictest definition of normality as it attempts to find just one group of normal instances (centered around $c$, see Equation 3), whereas the AE algorithm with $k$ encoding nodes can in practice (assuming perfect disentanglement) find at least $k$ definitions of normality. Hence, not surprisingly, the SVDD algorithm is more unfair than the AE algorithm, as shown by the histogram of unfairness for both algorithms in Figure 4.
Experimental Results - Why Is AD Unfair?
Here we attempt to experimentally answer the following questions:
• How strongly are our four properties correlated with unfairness?
• How are our four properties related to each other and, in particular, is there a hierarchical structure to them?
• How can these properties be combined to create a model to explain unfairness in anomaly detection?
Relationship between Unfairness and Each Property
Our experiments (see Figure 6) demonstrate strong (Pearson) correlations and moderate to strong RSQ (R-squared values of the regression trendline) for each of the properties studied. Each plot shows the results for the two datasets (CelebA and LFW), with each data point representing a group of individuals. A positive trend line indicates a positive Pearson correlation (see the sub-titles of the plots for exact values), and we see that incompressibility is the property most strongly correlated with unfairness, followed by spurious feature variance, then attribute labeling noise, and finally sample size bias. This is an interesting result, as earlier seminal results showed that AD using facial images (Zhang and Davidson 2021) was unfair due to an under-representation of African Americans and males in the underlying datasets.
However, it is also clear that no individual property explains unfairness completely by itself. This is shown as each graph has points that not only do not fit the trendline, but are contradictory to the relationship implied by the overall data. Further investigation (see next subsection) reveals that when one property fails to explain why that attribute is anomalous, another one typically will.
For example, the group Bags Under Eyes (from CelebA) under the AE model has a reconstruction ratio of only 1.077 (it is easy to compress), but a DIR of 1.31 (it is treated unfairly). Following the trend, the expected reconstruction ratio at a group with this DIR would be approximately 1.17. Further, this group has only 20.1% representation, though looking at the DIR one would expect only half that. This group’s treatment, however, is explained by the spurious feature variance, as it sits nearly perfectly on the trendline. Similarly, the group Gray Hair (from LFW) under Deep SVDD was towards the far end of spurious feature variance at 0.180, but has extremely low anomaly DIR score at 1.04 (i.e. was treated fairly), though it sits just above the trendline for attribute label noise at 1.05.
A full list of these attributes and their squared error for all trendlines is available in the Appendix, and one can see that every tag can be explained by at least one of these properties with high fidelity, with the average sum of squared errors being only 0.00351 (std 0.006498), supporting our claim that unfairness in the anomaly detection setting can typically be explained by one of these four properties. This claim is rigorously tested in the section Hypothesis Testing of Relationship Claims.
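As an illustration of how the per-property statistics behind Figure 6 are obtained, the sketch below computes the Pearson correlation and trendline R-squared between one property and the per-group DIR values (our own sketch; variable names are illustrative):

```python
import numpy as np
from scipy import stats

def property_vs_unfairness(prop_values, dir_values):
    """Pearson correlation, trendline R-squared, and p-value relating one
    property (e.g. reconstruction ratio) to Anomaly DIR across groups."""
    x = np.asarray(prop_values, dtype=float)
    y = np.asarray(dir_values, dtype=float)
    r, p_value = stats.pearsonr(x, y)
    slope, intercept = np.polyfit(x, y, deg=1)   # linear trendline
    residuals = y - (slope * x + intercept)
    rsq = 1.0 - residuals.var() / y.var()
    return r, rsq, p_value
```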
Relationship between Multiple Properties
We also examine the correlation between the different properties. This analysis is useful in examining potential redundancies and creating our model of unfairness for anomaly detection. Figure 7 examines these relationships. Some features are, indeed, positively correlated with each other, though none have high enough correlation to suggest that they are redundant with each other. In the subsequent subsection, we examine this claim more rigorously via a hypothesis test.
Hypothesis Testing of Relationship Claims
In order to test our claims, we create four hypotheses that we verify through significance testing. These are:
• H1: No individual property is sufficient to always explain unfairness.
• H2: The properties, when combined into a multiple regression, are sufficient to explain unfairness.
• H3: No properties of the multiple regression are redundant and all are needed.
• H4: The results of H2 are significant in that when one property fails to predict unfairness, another does.
The null hypotheses are constructed straightforwardly. To create the significance test for H1, we perform an F-test on individual regression models crafted from the relationship between each property and DIR. The results of this F-test (visualized in Figure 8) indicate that individual properties are reasonable though comparatively weak predictors of unfairness, with P-values ranging from 0.0137-0.0986 for the AE model and 0.0279-0.0571 for Deep SVDD. Therefore, we reject the null hypothesis and validate hypothesis H1.
To test hypotheses H2 and H3, we construct a multiple-regression model. Specifically, this is a stacked multiple regression where the meta-function selects the best individual model for the datum. To validate H2, we create such a multiple regression using all four of the properties (the “full” model). This yields P-values of 0.00589 for the AE model and 0.0127 for Deep SVDD, significantly lower than those of the respective single-regression models, indicating that using all four properties is sufficient to explain how unfairness occurs. We reject the null hypothesis and validate hypothesis H2.
For H3, we conduct a similar experiment except we leave one property out. In every case, the resulting multiple regression models were worse than the full model, with P-Values ranging from 0.00674-0.0109 for the AE model and 0.0138-0.0164 for Deep SVDD, all greater than that of the full model, indicating that every property is necessary and none are redundant. We reject the null hypothesis and validate hypothesis H3.
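The overall workflow of these tests (single-property models, the full model, and leave-one-out models compared by F-test p-values) can be outlined as below. This is a simplified sketch using ordinary least squares via statsmodels rather than the stacked meta-regression described above; the data structures and names are our own.

```python
import numpy as np
import statsmodels.api as sm

def regression_f_pvalue(X, y):
    """F-test p-value of an OLS fit of per-group Anomaly DIR on properties."""
    model = sm.OLS(y, sm.add_constant(np.asarray(X, dtype=float)))
    return model.fit().f_pvalue

def compare_models(props, dir_values):
    """props: dict mapping property name -> per-group values;
    dir_values: per-group Anomaly DIR scores."""
    y = np.asarray(dir_values, dtype=float)
    names = list(props)
    report = {"single_" + n: regression_f_pvalue(props[n], y) for n in names}
    full = np.column_stack([props[n] for n in names])
    report["full"] = regression_f_pvalue(full, y)          # H2
    for n in names:                                          # H3: leave one out
        rest = np.column_stack([props[m] for m in names if m != n])
        report["without_" + n] = regression_f_pvalue(rest, y)
    return report
```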
One may object to the multiple-regression models used above, given that a model as described will monotonically increase in predictive power as more properties are added. It is important to note that this model matches the central claim of this paper: that unfairness with respect to a group occurs because of one of the four properties described. However, one may still be wary of the statistical significance of the reported results given the technique. To resolve these concerns, we demonstrate that our model is not just combining the predictive power of four already powerful predictors, but rather that when one model fails, it is because the unfairness is explained by one of the other properties.
To validate this claim, we construct fabricated distributions similar to those of Figure 6. Specifically, unfairness is kept the same, and we create distributions of random fake data which has the same correlation and RSQ as all of those shown. This is accomplished by, for each property, finding random points (sampled across a uniform distribution) along the X-axis, giving them fabricated values perfectly in line with the correlation, and then adding noise such that the correlation is maintained and the RSQ matches that of the actual measured properties. Then, we create the same full model of the multiple regression and measure the P-value. We repeat this process 10,000 times to get 10,000 such distributions.
The distributions therefore should be statistically similar to our real data, but there is no reason to believe that when one of the fabricated models fails, another will explain the unfairness. To validate hypothesis H4, we measure the number of times the fake distributions produce P-values under that of the real data. If the statistically similar fabricated data cannot match the predictive performance of our models, this validates hypothesis H4.
In the case of the AE model, the fabricated data averaged a P-value of 0.0194 with a standard deviation of 0.00629 and never beat the full model’s P-value of 0.00589. Similarly, the model simulating Deep SVDD’s data yielded an average P-value of 0.0173 with a standard deviation of 0.00304. Out of the 10,000 trials, only 5 yielded lower P-values. Therefore, we reject the null hypothesis and validate hypothesis H4. Our model does not simply take four independent good predictors of anomaly and get good statistical results but rather holds the property that when one fails, another property explains it.
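The fabrication experiment can be outlined as follows. This is a simplified sketch: synthetic properties are generated to match the target correlations with the fixed DIR values in expectation, rather than exactly matching both the correlation and the RSQ as in the procedure above, and it reuses compare_models from the previous sketch.

```python
import numpy as np

def fake_property(dir_values, target_r, rng):
    """Synthetic property with expected correlation target_r to the fixed DIRs."""
    y = np.asarray(dir_values, dtype=float)
    z = (y - y.mean()) / y.std()
    noise = rng.standard_normal(len(y))
    return target_r * z + np.sqrt(1.0 - target_r ** 2) * noise

def fabricated_pvalue_trials(dir_values, target_rs, real_full_pvalue,
                             n_trials=10_000, seed=0):
    """H4 check: count trials where fake properties yield a full-model
    F-test p-value at least as small as the real one (e.g. 0.00589 for AE)."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_trials):
        fakes = {f"fake_{i}": fake_property(dir_values, r, rng)
                 for i, r in enumerate(target_rs)}
        if compare_models(fakes, dir_values)["full"] <= real_full_pvalue:
            wins += 1
    return wins
```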
A Proposed Model Of Unsupervised Unfairness Relationships
Given the resulting hypothesis tests, we craft our model of unfairness in unsupervised learning. Figure 9 provides a graphical representation of this model. Edges between properties indicate a relationship (binarized: an edge is drawn when the correlation is at least 0.15). This is supported by the high correlation between each of these properties and unfairness (Figure 6), the result that the properties together form a uniquely powerful multiple regression to explain unfairness (H2, H4), that no single property could do this alone (H1), and that no property is redundant (H3).
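Under the stated binarization, the edge set of such a relationship graph can be derived from the pairwise property correlations, as in the following sketch (our illustration; `props` maps property names to per-group values, and the absolute-correlation reading of the 0.15 threshold is an assumption):

```python
import numpy as np

def relationship_edges(props, threshold=0.15):
    """Edges of a Figure 9 style graph: connect two properties when the
    absolute Pearson correlation between them reaches the threshold."""
    names = list(props)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = np.corrcoef(props[a], props[b])[0, 1]
            if abs(r) >= threshold:
                edges.append((a, b, round(float(r), 3)))
    return edges
```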
Conclusion, Limitations, and Future Work
We study the intersection of controversial deep AD algorithms with facial imaging data to address the “Who” and “Why” questions. We found that overwhelmingly both auto-encoder and one-class deep AD algorithms are fair to most groups. However, due to their compression-based focus, they are unfair to some sub-groups.
With regard to the “Who” question, we found that it was rare for the algorithms to be consistently unfair to any one group; instead, unfairness was due to the interaction of the data and the algorithm. In particular, the more focused the dataset and the algorithm, the more unfairness was found.
Our study of the “Why” question aimed at developing a deeper understanding of the effect of data-related factors on the fairness, as well as the detection performance, of AD algorithms. We postulated four hypotheses and found all to be statistically significant by rejecting the corresponding null hypotheses. The first hypothesis is that no single property alone is sufficient to explain unfairness. The second hypothesis is that, when combined, the properties can explain unfairness. The third hypothesis is that all properties are relevant and none are redundant. Finally, the fourth hypothesis is that the combination of properties is meaningful beyond the predictive power of each individual property.
Limitations. The use of groups may have varying degrees of applicability to real-world fairness scenarios. For example, some groups such as Male, Black and Young correspond to legally recognized protected classes (88th United States Congress 1964; 90th United States Congress 1967), while others such as Goatee, Wearing Hat and Attractive may not. However, we believe that this study still provides meaningful insights into the mechanism of unfairness with respect to different people. Real-world protected attributes may be of varying degrees of visibility, as are our groups, and our analysis reflects this.
Future work. Remediation strategies to improve fairness are out of the scope of our investigation. We briefly discuss them here. Fairness interventions are typically grouped into three categories: pre-, post-, and in-processing strategies, which, respectively, modify the input data, modify the output scores or decisions, and account for fairness during model training.
As we showed, AD unfairness can stem from algorithmic bias alone in the face of natural heterogeneities in the data among or within groups. When this is the case, pre-processing strategies become void, as it is not clear how to modify organic, unbiased data. Post-processing could select different thresholds for each group separately, as in (Corbett-Davies et al. 2017; Menon and Williamson 2018), where the group-specific thresholds could either be “natural” cut-off values, or selected to optimize demographic parity if it is a desired fairness metric. Note that metrics that involve true labels cannot be optimized due to the lack of any ground truth during training. In-processing techniques are also limited to only enforcing demographic parity, which, as we showed, remains susceptible to unfairness. One such strategy that has not been applied to AD is decoupling, as in (Dwork et al. 2018; Ustun, Liu, and Parkes 2019), where a different detector is trained for each group while optimizing a joint loss.
We remark that post-processing and decoupling exhibit treatment disparity, as they both assume it to be ethical and legal to use the sensitive attribute at test (decision) time, in particular to select which threshold or detector to employ on a given new sample. When there are differences among groups, coming to terms with treatment disparity might be the only way to mitigate disparate impact, as argued previously (Lipton, McAuley, and Chouldechova 2018). These solutions, however, do not address unfairness against heterogeneous subpopulations within groups, i.e. within-group discrimination. Here, one direction is to explore clustering-based AD algorithms. Alternatively, one could establish a more nuanced or granular sensitive attribute, labeling each subpopulation differently.
References
- 88th United States Congress. 1964. Civil Rights Act of 1964. Public Law 88-352, 78 Stat. 241.
- 90th United States Congress. 1967. Age Discrimination in Employment Act of 1967. Public Law 90-202, 81 Stat. 602.
- Abraham, S. S.; et al. 2021. FairLOF: Fairness in Outlier Detection. Data Science and Engineering, 6(4): 485–499.
- Almeida, D.; Shmarko, K.; and Lomas, E. 2022. The ethics of facial recognition technologies, surveillance, and accountability in an age of artificial intelligence: a comparative analysis of US, EU, and UK regulatory frameworks. AI and Ethics, 2(3): 377–387.
- Angwin, J.; Larson, J.; Mattu, S.; and Kirchner, L. 2016. Machine Bias. ProPublica.
- Bauder, R. A.; and Khoshgoftaar, T. M. 2017. Multivariate anomaly detection in Medicare using model residuals and probabilistic programming. In The Thirtieth International FLAIRS Conference.
- Cao, R.; Liu, X.; Zhou, J.; Chen, D.; Peng, D.; and Chen, T. 2021. Outlier detection for spotting micro-expressions. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 3006–3011.
- Corbett-Davies, S.; Pierson, E.; Feller, A.; Goel, S.; and Huq, A. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 797–806.
- Davidson, I.; and Ravi, S. S. 2020. A framework for determining the fairness of outlier detection. In ECAI 2020, 2465–2472.
- Dwork, C.; Immorlica, N.; Kalai, A. T.; and Leiserson, M. 2018. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, 119–133.
- Feldman, M.; Friedler, S. A.; Moeller, J.; Scheidegger, C.; and Venkatasubramanian, S. 2015. Certifying and Removing Disparate Impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 259–268.
- Garvie, C.; Bedoya, A.; and Frankle, J. 2016. The Perpetual Line-Up: Unregulated Police Face Recognition in America. Georgetown Law, Center on Privacy & Technology.
- Hinton, G. E. 1989. Connectionist Learning Procedures. Artificial Intelligence, 40(1-3): 185–234.
- Huang, D.; Mu, D.; Yang, L.; and Cai, X. 2018. CoDetect: Financial Fraud Detection With Anomaly Feature Detection. IEEE Access, 6: 19161–19174.
- Huang, G. B.; Ramesh, M.; Berg, T.; and Learned-Miller, E. 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49, University of Massachusetts, Amherst.
- Japkowicz, N.; Myers, C.; and Gluck, M. A. 1995. A Novelty Detection Approach to Classification. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 518–523.
- Kumar, N.; Berg, A. C.; Belhumeur, P. N.; and Nayar, S. K. 2009. Attribute and Simile Classifiers for Face Verification. In 2009 IEEE 12th International Conference on Computer Vision, 365–372.
- Lingenfelter, B.; Davis, S.; and Hand, E. 2022. A Quantitative Analysis of Labeling Issues in the CelebA Dataset. In Advances in Visual Computing (ISVC 2022), Lecture Notes in Computer Science, 13598.
- Lipton, Z.; McAuley, J.; and Chouldechova, A. 2018. Does mitigating ML's impact disparity require treatment disparity? In Advances in Neural Information Processing Systems, 31.
- Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Menon, A. K.; and Williamson, R. C. 2018. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, 107–118.
- Ng, H.-W.; and Winkler, S. 2014. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), 343–347.
- Pang, G.; Shen, C.; Cao, L.; and Hengel, A. V. D. 2021. Deep learning for anomaly detection: A review. ACM Computing Surveys (CSUR), 54(2): 1–38.
- Raghavan, M.; Barocas, S.; Kleinberg, J.; and Levy, K. 2020. Mitigating bias in algorithmic hiring: evaluating claims and practices. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 469–481.
- Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S. A.; Binder, A.; Müller, E.; and Kloft, M. 2018. Deep One-Class Classification. In International Conference on Machine Learning, PMLR 80: 4393–4402.
- Savage, D.; Zhang, X.; Yu, X.; Chou, P.; and Wang, Q. 2014. Anomaly detection in online social networks. Social Networks, 39: 62–70.
- Shekhar, S.; Shah, N.; and Akoglu, L. 2021. FairOD: Fairness-aware outlier detection. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 210–220.
- Song, H.; Li, P.; and Liu, H. 2021. Deep clustering based fair outlier detection. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 1481–1489.
- Suresh, H.; and Guttag, J. 2021. A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle.
- Ustun, B.; Liu, Y.; and Parkes, D. 2019. Fairness without harm: Decoupled classifiers with preference guarantees. In International Conference on Machine Learning, 6373–6382.
- Verma, S.; and Rubin, J. 2018. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), 1–7.
- Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised Deep Embedding for Clustering Analysis. In International Conference on Machine Learning.
- Yu, R.; Qiu, H.; Wen, Z.; Lin, C.; and Liu, Y. 2016. A survey on social media anomaly detection. ACM SIGKDD Explorations Newsletter, 18(1): 1–14.
- Zamini, M.; and Hasheminejad, S. M. H. 2019. A comprehensive survey of anomaly detection in banking, wireless sensor networks, social networks, and healthcare. Intelligent Decision Technologies, 13(2): 229–270.
- Zhang, G.; Luo, T.; Pedrycz, W.; El-Meligy, M. A.; Sharaf, M. A. F.; and Li, Z. 2020. Outlier processing in multimodal emotion recognition. IEEE Access, 8: 55688–55701.
- Zhang, H.; and Davidson, I. 2021. Towards fair deep anomaly detection. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 138–148.
- Zhang, W.; and He, X. 2017. An anomaly detection method for Medicare fraud detection. In 2017 IEEE International Conference on Big Knowledge (ICBK), 309–314.
Appendix
Appendix A Models
Both the AE and SVDD models use the same architecture, which is modeled off those in (Zhang and Davidson 2021). The architecture is summarized below.
Part | Layer | Details
---|---|---
Encoder | Conv2d | In: 3, Out: 16, Kernel: 3x3, Stride: 2, Padding: 1, Bias: False
 | ReLU | In-place: True
 | Conv2d | In: 16, Out: 32, Kernel: 3x3, Stride: 2, Padding: 1, Bias: False
 | BatchNorm2d | Num Features: 32
 | ReLU | In-place: True
 | Conv2d | In: 32, Out: 64, Kernel: 3x3, Stride: 2, Padding: 0, Bias: False
 | ReLU | In-place: True
 | Flatten | Start Dim: 1
 | Linear | In: 38016, Out: 128, Bias: False
 | ReLU | In-place: True
 | Linear | In: 128, Out: Encoded Space Dim, Bias: False
Decoder | Linear | In: Encoded Space Dim, Out: 128
 | ReLU | In-place: True
 | Linear | In: 128, Out: 38016
 | ReLU | In-place: True
 | Unflatten | Dim: 1, Unflattened Size: (64, 22, 27)
 | ConvTranspose2d | In: 64, Out: 32, Kernel: 3x3, Stride: 2, Output Padding: 0
 | BatchNorm2d | Num Features: 32
 | ReLU | In-place: True
 | ConvTranspose2d | In: 32, Out: 16, Kernel: 3x3, Stride: 2, Padding: 1
 | BatchNorm2d | Num Features: 16
 | ReLU | In-place: True
 | ConvTranspose2d | In: 16, Out: 3, Kernel: 3x3, Stride: 2, Padding: 1, Output Padding: 1
 | Sigmoid |
Datasets are split into a random (reset for each initialization) 80-20 split and the model is trained with early stopping if it does not improve in test loss within three epochs. In practice, the model took, on average, 25 minutes to train on a 56-core, 16 GB Tesla P100 GPU.
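The architecture table translates directly into PyTorch. The sketch below is our reading of the table, not the authors' released code: the encoded space dimension is left as a parameter, and the 38016-unit flatten size corresponds to a 64 x 22 x 27 feature map (CelebA-sized inputs); adjust for other input sizes.

```python
import torch.nn as nn

def build_autoencoder(encoded_dim=32):  # encoded_dim is an assumed default
    """Encoder/decoder matching the architecture table above."""
    encoder = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1, bias=False),
        nn.ReLU(inplace=True),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(32),
        nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=0, bias=False),
        nn.ReLU(inplace=True),
        nn.Flatten(start_dim=1),
        nn.Linear(38016, 128, bias=False),
        nn.ReLU(inplace=True),
        nn.Linear(128, encoded_dim, bias=False),
    )
    decoder = nn.Sequential(
        nn.Linear(encoded_dim, 128),
        nn.ReLU(inplace=True),
        nn.Linear(128, 38016),
        nn.ReLU(inplace=True),
        nn.Unflatten(1, (64, 22, 27)),
        nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2, output_padding=0),
        nn.BatchNorm2d(32),
        nn.ReLU(inplace=True),
        nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(16),
        nn.ReLU(inplace=True),
        nn.ConvTranspose2d(16, 3, kernel_size=3, stride=2, padding=1,
                           output_padding=1),
        nn.Sigmoid(),
    )
    return encoder, decoder
```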
Appendix B Raw Data Results
This subsection of the appendix reports the raw values for DIR and the four properties for each datum, separated by algorithm-dataset interaction. Table 6 gives the raw sum of squared errors for the individual property models and the entire models used to craft the hypothesis tests.
Attribute | Unfairness (DIR) | Reconstruction Ratio | SSB | SFV | Label Noise |
---|---|---|---|---|---|
5_o_Clock_Shadow | 1.118 | 1.183 | 0.8904 | 0.2252 | 0.4869 |
Arched_Eyebrows | 1.124 | 1.0033 | 0.7252 | 0.2287 | 0.4869 |
Attractive | 1.075 | 1.1356 | 0.5122 | 0.2339 | 0.486 |
Bags_Under_Eyes | 1.308 | 1.077 | 0.799 | 0.2259 | 0.6119 |
Bald | 1.164 | 1.1017 | 0.9766 | 0.2233 | 0.5019 |
Bangs | 1.059 | 1.1121 | 0.8518 | 0.2414 | 0.5687 |
Big_Lips | 1.219 | 1.1 | 0.7534 | 0.2262 | 0.2721 |
Big_Nose | 1.457 | 1.1 | 0.7684 | 0.2247 | 0.4415 |
Black_Hair | 1.007 | 1.15 | 0.7568 | 0.2248 | 0.5283 |
Blond_Hair | 1.042 | 1.14 | 0.854 | 0.2276 | 0.4273 |
Blurry | 1.128 | 1.19 | 0.946 | 0.2406 | 1.0181 |
Brown_Hair | 1.080 | 1.11 | 0.7962 | 0.224 | 0.6281 |
Bushy_Eyebrows | 1.088 | 1.17 | 0.859 | 0.2248 | 0.5107 |
Chubby | 2.992 | 1.23 | 0.942 | 0.2236 | 0.6076 |
Double_Chin | 1.413 | 1.26 | 0.9578 | 0.2244 | 0.709 |
Eyeglasses | 1.600 | 1.35 | 0.937 | 0.2244 | 0.585 |
Goatee | 1.479 | 1.27 | 0.9368 | 0.2346 | 0.4938 |
Gray_Hair | 1.053 | 1.19 | 0.952 | 0.2247 | 0.5767 |
Heavy_Makeup | 1.100 | 1 | 0.6148 | 0.229 | 0.3698 |
High_Cheekbones | 1.547 | 1.06 | 0.5536 | 0.235 | 0.6822 |
Male | 1.117 | 1.01 | 0.5834 | 0.2236 | 0.0211 |
Mouth_Slightly_Open | 1.058 | 1.08 | 0.5222 | 0.2262 | 0.7859 |
Mustache | 1.280 | 1.3 | 0.9616 | 0.2409 | 0.5055 |
Narrow_Eyes | 1.017 | 1.18 | 0.8808 | 0.2252 | 0.7622 |
No_Beard | 3.201 | 1.43 | 0.8322 | 0.231 | 0.355 |
Oval_Face | 1.672 | 1.07 | 0.7296 | 0.2285 | 0.6119 |
Pale_Skin | 1.491 | 1.17 | 0.9586 | 0.2367 | 0.8438 |
Pointy_Nose | 1.301 | 1.08 | 0.732 | 0.2272 | 0.5454 |
Receding_Hairline | 1.576 | 1.15 | 0.9228 | 0.2248 | 0.6595 |
Rosy_Cheeks | 1.035 | 1.19 | 0.9382 | 0.2248 | 0.6718 |
Sideburns | 1.553 | 1.27 | 0.9396 | 0.2286 | 0.5241 |
Smiling | 1.317 | 1.07 | 0.5188 | 0.239 | 0.7449 |
Straight_Hair | 1.118 | 1.17 | 0.7874 | 0.2268 | 0.6559 |
Wavy_Hair | 1.488 | 1.02 | 0.69 | 0.2239 | 0.5728 |
Wearing_Earrings | 1.052 | 1.11 | 0.8064 | 0.2269 | 0.6107 |
Wearing_Hat | 1.645 | 1.32 | 0.9512 | 0.2269 | 0.7502 |
Wearing_Lipstick | 1.213 | 1.07 | 0.5288 | 0.2286 | 0.2678 |
Wearing_Necklace | 1.349 | 1.18 | 0.8686 | 0.2246 | 0.6887 |
Wearing_Necktie | 1.077 | 1.18 | 0.9244 | 0.2260 | 0.7288 |
Young | 1.997 | 1.36 | 0.7826 | 0.2250 | 0.164 |
Attribute | Unfairness (DIR) | Reconstruction Ratio | SSB | SFV | Label Noise
---|---|---|---|---|---
Male | 1.12200367380267 | 1.00535333156585 | 0.774632884425169 | 0.2031201482 | 0.07679999999999998 |
Asian | 1.053645403248 | 1.1167961359024 | 0.92322909533592 | 0.2021178782 | |
White | 1.1264462529671 | 1.01018273830413 | 0.747926652971163 | 0.2088530302 | |
Black | 1.12365634206545 | 1.18320667743682 | 0.957391767480788 | 0.2025963485 | 0.08120000000000005 |
Baby | 1.06544960186443 | 1.11203300952911 | 0.836643079966522 | 0.2036614358 | 0.09550000000000003 |
Child | 1.12626262626262 | 1.12434077262878 | 0.898196758730883 | 0.2015907168 | 0.10470000000000002 |
Youth | 1.1684570024365 | 1.09732460975646 | 0.786274062238453 | 0.2201078296 | 0.13770000000000004 |
Middle Aged | 1.05672615298764 | 1.10370337963104 | 0.866316670470973 | 0.2059026539 | 0.0645 |
Senior | 1.77747312898089 | 1.1931574344635 | 0.957467853610286 | 0.2064542949 | 0.16779999999999995 |
Black Hair | 1.12004451070385 | 1.07356524467468 | 0.63349311420528 | 0.2042071939 | |
Blond Hair | 1.05108769459044 | 1.13573598861694 | 0.891653351594004 | 0.2307599187 | |
Brown Hair | 1.03576168696236 | 1.01701772212982 | 0.7891653351594 | 0.2089367867 | |
Bald | 1.00087648056866 | 1.13213109970092 | 0.824621471505744 | 0.2131045997 | 0.11350000000000005 |
No Eyewear | 1.08863610960647 | 1.14704251289367 | 0.985087118618275 | 0.2012821913 | 0.06899999999999995 |
Eyeglasses | 1.06348181302805 | 1.1471596956253 | 0.877729589895762 | 0.2019755244 | 0.19679999999999997 |
Sunglasses | 1.05997138025237 | 1.06211602687835 | 0.586700144563646 | 0.2035428464 | 0.20889999999999997 |
Mustache | 1.0421456164088 | 1.03791272640228 | 0.581298029369246 | 0.2034806907 | 0.21950000000000003 |
Smiling | 1.14102186869087 | 1.07792913913726 | 0.641101727155139 | 0.2054580331 | 0.2974 |
Frowning | 1.16748745804309 | 1.10730016231536 | 0.843262573232899 | 0.2032266498 | 0.07879999999999998 |
Chubby | 1.08701997540087 | 1.08353006839752 | 0.683557787415354 | 0.2049415469 | 0.10560000000000003 |
Blurry | 1.04091852227881 | 1.10518634319305 | 0.811154226584493 | 0.2110888839 | 0.27580000000000005 |
Harsh Lighting | 1.05681504499685 | 1.02375900745391 | 0.695579395876131 | 0.2157218277 | 0.30279999999999996 |
Soft Lighting | 1.05644459380154 | 1.03066658973693 | 0.598417408506429 | 0.2086786151 | 0.15849999999999997 |
Outdoor | 1.06132796694575 | 1.07666659355163 | 0.566613406376017 | 0.221262145 | 0.22760000000000002 |
Curly Hair | 1.01377517221455 | 1.06283998489379 | 0.62375408962946 | 0.2088014245 | 0.14180000000000004 |
Wavy Hair | 1.18200199173129 | 1.03124058246612 | 0.581830632275736 | 0.2119037926 | 0.04500000000000004 |
Straight Hair | 1.25206733987405 | 1.16164600849151 | 0.835806132542037 | 0.2045027018 | 0.25670000000000004 |
Receding Hairline | 1.132162388614 | 1.08063757419586 | 0.690329452940728 | 0.2026151061 | 0.31120000000000003 |
Bangs | 1.16524283964575 | 1.00697135925292 | 0.672981815415049 | 0.2061040878 | 0.33009999999999995 |
Sideburns | 1.12084015275504 | 1.16936266422271 | 0.939739785437114 | 0.2015084624 | 0.22319999999999995 |
Fully Visible Forehead | 1.21900390887339 | 1.1811419725418 | 0.858784143650612 | 0.2038781226 | 0.2298 |
Partially Visible Forehead | 1.05020804838356 | 1.04769682884216 | 0.536483299094575 | 0.2050496221 | 0.15200000000000002 |
Obstructed Forehead | 1.197095435684 | 1.09454452991485 | 0.73674199193487 | 0.2027293146 | 0.11260000000000003 |
Bushy Eyebrows | 1.21132478772795 | 1.02557587623596 | 0.645286464277562 | 0.2025717795 | 0.0998 |
Arched Eyebrows | 1.11604546137808 | 1.02139496803283 | 0.862360191737046 | 0.2056749165 | 0.15269999999999995 |
Narrow Eyes | 1.01965937186759 | 1.00794005393981 | 0.69078596971772 | 0.2016550004 | 0.19099999999999995 |
Eyes Open | 1.21624935631726 | 1.01386857032775 | 0.698166324279083 | 0.2054687798 | 0.2893 |
Big Nose | 1.02608985048702 | 1.0604817867279 | 0.634406147759263 | 0.205928582 | 0.06599999999999995 |
Pointy Nose | 1.19868957288718 | 1.07568454742431 | 0.622764969945978 | 0.2045042574 | 0.046699999999999964 |
Big Lips | 1.06428433432607 | 1.06256353855133 | 0.663166704709731 | 0.2033233523 | 0.08440000000000003 |
Mouth Closed | 1.01175554129597 | 1.12742841243743 | 0.904511907479266 | 0.2021872401 | 0.32189999999999996 |
Mouth Slightly Open | 1.06630991503093 | 1.04525172710418 | 0.571102488016434 | 0.2044097185 | 0.07479999999999998 |
Mouth Wide Open | 1.01637465524165 | 1.01459431648254 | 0.714981358898272 | 0.2013282001 | 0.18810000000000004 |
Teeth Not Visible | 1.09729244959597 | 1.1062124967575 | 0.759796089172943 | 0.2011253536 | 0.27669999999999995 |
No Beard | 1.0605139319402 | 1.01765537261962 | 0.869284029521418 | 0.2096437275 | 0.2319 |
Goatee | 1.11854311102431 | 1.08315765857696 | 0.639351746176672 | 0.2126889467 | 0.04530000000000001 |
Round Jaw | 1.12212437767378 | 1.15832161903381 | 0.860229780111085 | 0.2023314357 | 0.050899999999999945 |
Double Chin | 1.01444585996835 | 1.04857730865478 | 0.519135661568896 | 0.2015054762 | 0.1965 |
Wearing Hat | 1.31578440808469 | 1.13376498222351 | 0.950467929696416 | 0.2047562778 | 0.08360000000000001 |
Oval Face | 1.227980920874 | 1.10575580596923 | 0.920718253062466 | 0.2021733284 | 0.12819999999999998 |
Square Face | 1.08180300500834 | 1.03950214385986 | 0.957087422962793 | 0.2091827631 | 0.08360000000000001 |
Round Face | 1.00385912356425 | 1.03282678127288 | 0.504983641482157 | 0.2034206629 | 0.2126 |
Color Photo | 1.03475440467016 | 1.07052874565124 | 0.664764513429201 | 0.2038946807 | 0.10570000000000002 |
Posed Photo | 1.05681639747742 | 1.10381340980529 | 0.848740774556798 | 0.2014933527 | 0.15300000000000002 |
Attractive Man | 1.15967929714224 | 1.09643280506134 | 0.977098075020923 | 0.2019033909 | 0.35509999999999997 |
Attractive Woman | 1.08906867243748 | 1.10497522354125 | 0.84105607547744 | 0.227733314 | 0.13529999999999998 |
Indian | 1.01517435331474 | 1.03677427768707 | 0.588830556189606 | 0.2017122924 | 0.14029999999999998 |
Gray Hair | 1.0282016857369 | 1.10868871212005 | 0.882523016054173 | 0.2012593031 | 0.18779999999999997 |
Bags Under Eyes | 1.00131664057342 | 1.11418402194976 | 0.805143422354104 | 0.2055422544 | 0.13219999999999998 |
Heavy Makeup | 1.14755164575804 | 1.11894488334655 | 0.879707829262725 | 0.2015730679 | 0.21609999999999996 |
Rosy Cheeks | 1.02025763283369 | 1.05831480026245 | 0.507037966978619 | 0.2184995234 | 0.08520000000000005 |
Shiny Skin | 1.09951096814278 | 1.06581234931945 | 0.591797915240051 | 0.2018399835 | 0.10729999999999995 |
Pale Skin | 1.06768325049461 | 1.05000007152557 | 0.534428973598113 | 0.2011277676 | 0.1421 |
5 o’ Clock Shadow | 1.16855307810665 | 1.09448540210723 | 0.84432777904588 | 0.2016790688 | |
Strong Nose-Mouth Lines | 1.03358017791439 | 1.10516810417175 | 0.866697101118466 | 0.2044023335 | |
Wearing Lipstick | 1.08893014058315 | 1.07446813583374 | 0.656471125313855 | 0.2032339275 | |
Flushed Face | 1.0144694850683 | 1.01248931884765 | 0.655482005630373 | 0.2137254417 | |
High Cheekbones | 1.00916172995591 | 1.12738478183746 | 0.860229780111085 | 0.2017118096 | |
Brown Eyes | 1.08648174717041 | 1.01540994644165 | 0.636384387126226 | 0.2055488884 | |
Wearing Earrings | 1.10247725115406 | 1.09833109378814 | 0.79563265616678 | 0.2028558612 |
Attribute | Unfairness (DIR, SVDD) | Reconstruction Ratio | SSB | SFV | Label Noise
---|---|---|---|---|---
5_o_Clock_Shadow | 1.68123553498308 | 1.46047670114505 | 0.8904 | 0.1409505606 | 0.4869 |
Arched_Eyebrows | 1.25764192139738 | 1.29294249928091 | 0.7252 | 0.1452494413 | 0.4869 |
Attractive | 1.09368792760979 | 1.05381571022971 | 0.5122 | 0.1520317346 | 0.486 |
Bags_Under_Eyes | 1.06134410518395 | 1.12471149407601 | 0.798999999999999 | 0.1400723457 | 0.6119 |
Bald | 2.352 | 1.02880658436214 | 0.9766 | 0.1451713741 | 0.5019 |
Bangs | 1.39449541284403 | 1.32236633976589 | 0.8518 | 0.1576949656 | 0.5687 |
Big_Lips | 1.70156624102154 | 1.08017998183669 | 0.7534 | 0.1442556977 | 0.2721 |
Big_Nose | 1.30569948186528 | 1.16960464068483 | 0.7684 | 0.1389202923 | 0.4415 |
Black_Hair | 1.00635593220339 | 1.10986682808716 | 0.7568 | 0.1428498179 | 0.5283 |
Blond_Hair | 1.03992089562244 | 1.29937377627469 | 0.854 | 0.1420869976 | 0.4273 |
Blurry | 1.35800508259212 | 1.12928843710292 | 0.946 | 0.1826313585 | 1.0181 |
Brown_Hair | 1.07992104600792 | 1.00426740416926 | 0.7962 | 0.1399643421 | 0.6281 |
Bushy_Eyebrows | 1.26066424494032 | 1.1293009118541 | 0.859 | 0.1396305859 | 0.5106999999999999 |
Chubby | 1.15950659293917 | 1.0393457117595 | 0.942 | 0.143850103 | 0.6075999999999999 |
Double_Chin | 1.46185598532334 | 1.24815246204514 | 0.9578 | 0.1393095106 | 0.7090000000000001 |
Eyeglasses | 1.4847619047619 | 1.13053239255933 | 0.937 | 0.1726125926 | 0.585 |
Goatee | 1.30087633885102 | 1.29235531479741 | 0.9368 | 0.148198694 | 0.4938 |
Gray_Hair | 2.44949494949494 | 1.63565217391304 | 0.952 | 0.1400457323 | 0.5767 |
Heavy_Makeup | 1.37794331165961 | 1.21201795786807 | 0.6148 | 0.1471818388 | 0.3698 |
High_Cheekbones | 1.41521739130434 | 1.07322226737098 | 0.5536 | 0.148946777 | 0.6821999999999999 |
Male | 1.16378620579292 | 1.12330668559143 | 0.583399999999999 | 0.1411117315 | 0.021100000000000008 |
Mouth_Slightly_Open | 1.47328992862486 | 1.01889931435045 | 0.5222 | 0.1415492892 | 0.7859 |
Mustache | 1.28 | 1.04602510460251 | 0.9616 | 0.1618342251 | 0.5055000000000001 |
Narrow_Eyes | 1.3557779799818 | 1.08768131630222 | 0.8808 | 0.1405434906 | 0.7622 |
No_Beard | 1.26765068774848 | 1.32170279829207 | 0.8322 | 0.1428951621 | 0.355 |
Oval_Face | 1.00961538461538 | 1.05142857142857 | 0.7296 | 0.1448870301 | 0.6119 |
Pale_Skin | 1.49135109864422 | 1.06838387528924 | 0.9586 | 0.1604245007 | 0.8438 |
Pointy_Nose | 1.28422782037239 | 1.26457127210139 | 0.732 | 0.1431550533 | 0.5454 |
Receding_Hairline | 1.10142050741269 | 1.33176813471502 | 0.9228 | 0.1404222101 | 0.6595 |
Rosy_Cheeks | 1.153123680878 | 1.25196285352469 | 0.9382 | 0.1406327337 | 0.6718 |
Sideburns | 1.36525725929699 | 1.50509087726463 | 0.9396 | 0.1386207491 | 0.5241 |
Smiling | 1.11647331786542 | 1.01603413341645 | 0.518799999999999 | 0.1510140896 | 0.7449 |
Straight_Hair | 1.13916759320035 | 1.16279926135717 | 0.7874 | 0.1428056359 | 0.6558999999999999 |
Wavy_Hair | 1.6726155889433 | 1.30170504067402 | 0.69 | 0.1401683241 | 0.5728 |
Wearing_Earrings | 1.08250497017892 | 1.01847107438016 | 0.8064 | 0.1419264823 | 0.6107 |
Wearing_Hat | 5.03622577927548 | 1.54812552653748 | 0.9512 | 0.2158842981 | 0.7502 |
Wearing_Lipstick | 1.26436951774677 | 1.16687742370595 | 0.528799999999999 | 0.1471352577 | 0.26780000000000004 |
Wearing_Necklace | 1.00260846420015 | 1.07914052831476 | 0.8686 | 0.1395401657 | 0.6887 |
Wearing_Necktie | 1.52579365079365 | 1.36231575963718 | 0.9244 | 0.1407860667 | 0.7288 |
Young | 1.0892026578073 | 1.17668546526531 | 0.7826 | 0.1402778327 | 0.16400000000000003 |
Attribute | Unfairness (DIR) | Incompressibility (Reconstruction Ratio) | SSB | SFV | Label Noise
---|---|---|---|---|---
Male | 1.17931562745317 | 1.0924447774887 | 0.774632884425169 | 0.1429237619 | 0.07679999999999998 |
Asian | 1.23055692048871 | 1.01016497611999 | 0.92322909533592 | 0.1415031001 | |
White | 1.00406917599186 | 1.08468961715698 | 0.747926652971163 | 0.1511053567 | |
Black | 1.34819532908704 | 1.03384220600128 | 0.957391767480788 | 0.1421764963 | 0.08120000000000005 |
Baby | 1.06271364829537 | 1.03743159770965 | 0.836643079966522 | 0.143683949 | 0.09550000000000003 |
Child | 1.13670569529881 | 1.03432464599609 | 0.898196758730883 | 0.140765438 | 0.10470000000000002 |
Youth | 1.05082822021653 | 1.03086674213409 | 0.786274062238453 | 0.1678981342 | 0.13770000000000004 |
Middle Aged | 1.10867550207333 | 1.02771830558776 | 0.866316670470973 | 0.1468662707 | 0.0645 |
Senior | 1.00186866902908 | 1.04160547256469 | 0.957467853610286 | 0.1476565742 | 0.16779999999999995 |
Black Hair | 1.02750194844192 | 1.04895257949829 | 0.63349311420528 | 0.1444525348 | |
Blond Hair | 1.08838038386602 | 1.01417303085327 | 0.891653351594004 | 0.184528688 | |
Brown Hair | 1.03209559606518 | 1.09958708286285 | 0.7891653351594 | 0.1512480983 | |
Bald | 1.00790551940226 | 1.00073754787445 | 0.824621471505744 | 0.1573695429 | 0.11350000000000005 |
No Eyewear | 1.40606623336428 | 1.00252616405487 | 0.985087118618275 | 0.1403350747 | 0.06899999999999995 |
Eyeglasses | 1.43913177607322 | 1.00310981273651 | 0.877729589895762 | 0.1413070618 | 0.19679999999999997 |
Sunglasses | 1.12142575468585 | 1.05022633075714 | 0.586700144563646 | 0.1435135175 | 0.20889999999999997 |
Mustache | 1.11839255634876 | 1.07019913196563 | 0.581298029369246 | 0.1434256518 | 0.21950000000000003 |
Smiling | 1.07582623948232 | 1.03398072719573 | 0.641101727155139 | 0.1462316354 | 0.2974 |
Frowning | 1.38021050679278 | 1.04254591464996 | 0.843262573232899 | 0.14306304 | 0.07879999999999998 |
Chubby | 1.37894686222649 | 1.04961144924163 | 0.683557787415354 | 0.1454682116 | 0.10560000000000003 |
Blurry | 1.07484216395665 | 1.01648831367492 | 0.811154226584493 | 0.1543799572 | 0.27580000000000005 |
Harsh Lighting | 1.76541734255385 | 1.06726253032684 | 0.695579395876131 | 0.1612156139 | 0.30279999999999996 |
Soft Lighting | 1.15505859850802 | 1.0463809967041 | 0.1456943556 | 0.1642 | |
Outdoor | 1.20495433082845 | 1.06911957263946 | 0.598417408506429 | 0.1508525053 | 0.1585 |
Curly Hair | 1.13923719958202 | 1.05589497089385 | 0.566613406376017 | 0.1696171921 | 0.2276 |
Wavy Hair | 1.06940992787003 | 1.04987812042236 | 0.62375408962946 | 0.1510261253 | 0.1418 |
Straight Hair | 1.04934265833276 | 1.07179963588714 | 0.581830632275736 | 0.1555814983 | 0.045 |
Receding Hairline | 1.17698276832539 | 1.00636541843414 | 0.835806132542037 | 0.1448722679 | 0.2567 |
Bangs | 1.09157918248827 | 1.06093919277191 | 0.690329452940728 | 0.1422119786 | 0.3112 |
Sideburns | 1.15947653456037 | 1.08820021152496 | 0.672981815415049 | 0.1471594972 | 0.3301 |
Fully Visible Forehead | 1.55668147556531 | 1.00061905384063 | 0.939739785437114 | 0.1406534992 | 0.2232 |
Partially Visible Forehead | 1.25747607655502 | 1.03077602386474 | 0.858784143650612 | 0.1439849571 | 0.2298 |
Obstructed Forehead | 1.11325281649095 | 1.05480468273162 | 0.536483299094575 | 0.1456306697 | 0.152 |
Bushy Eyebrows | 1.00786702803827 | 1.03234314918518 | 0.73674199193487 | 0.1423631794 | 0.1126 |
Arched Eyebrows | 1.06208761023718 | 1.07342624664306 | 0.645286464277562 | 0.1421420989 | 0.0998 |
Narrow Eyes | 1.08786442753544 | 1.09905493259429 | 0.862360191737046 | 0.1465277227 | 0.1527 |
Eyes Open | 1.223370100546 | 1.07901871204376 | 0.69078596971772 | 0.1408611888 | 0.191 |
Big Nose | 1.03111518672274 | 1.0845707654953 | 0.698166324279083 | 0.1462382611 | 0.2893 |
Pointy Nose | 1.11446611115883 | 1.05301141738891 | 0.634406147759263 | 0.1469018736 | 0.066 |
Big Lips | 1.14029929024963 | 1.05008065700531 | 0.622764969945978 | 0.1448668399 | 0.0467 |
Mouth Closed | 1.0816224959562 | 1.06419742107391 | 0.663166704709731 | 0.1431995614 | 0.0844 |
Mouth Slightly Open | 1.01337628971086 | 1.03440833091735 | 0.904511907479266 | 0.1416061479 | 0.3219 |
Mouth Wide Open | 1.03990024937655 | 1.06561398506164 | 0.571102488016434 | 0.1447347991 | 0.0748 |
Teeth Not Visible | 1.04770316767762 | 1.07595670223236 | 0.714981358898272 | 0.1403963187 | 0.1881 |
No Beard | 1.0876431987543 | 1.03276085853576 | 0.759796089172943 | 0.1401160741 | 0.2767 |
Goatee | 1.19802672343941 | 1.09399461746215 | 0.869284029521418 | 0.1522871416 | 0.2319 |
Round Jaw | 1.06897059287373 | 1.04447555541992 | 0.639351746176672 | 0.156724269 | 0.0453 |
Double Chin | 1.02390223246378 | 1.00809562206268 | 0.860229780111085 | 0.141805179 | 0.0509 |
Wearing Hat | 1.08353184055899 | 1.06220078468322 | 0.519135661568896 | 0.1406492755 | 0.1965 |
Oval Face | 1.12880495352612 | 1.02028930187225 | 0.950467929696416 | 0.1452499895 | 0.0836 |
Square Face | 1.03973957569458 | 1.0401998758316 | 0.920718253062466 | 0.141584407 | 0.1282 |
Round Face | 1.1325489572568 | 1.10276663303375 | 0.957087422962793 | 0.1514307107 | 0.0836 |
Color Photo | 1.13798432728612 | 1.07599568367004 | 0.504983641482157 | 0.143338242 | 0.2126 |
Posed Photo | 1.0394726007875 | 1.0795783996582 | 0.664764513429201 | 0.1440141143 | 0.1057 |
Attractive Man | 1.08110687391574 | 1.03554499149322 | 0.848740774556798 | 0.1406362299 | 0.153 |
Attractive Woman | 1.51685778921912 | 1.05630433559417 | 0.977098075020923 | 0.1412037472 | 0.3551 |
Indian | 1.04062717938913 | 1.05820059776306 | 0.84105607547744 | 0.1797280974 | 0.1353 |
Gray Hair | 1.1973171397336 | 1.06187999248504 | 0.588830556189606 | 0.1409442876 | 0.1403 |
Bags Under Eyes | 1.45747314192372 | 1.02336192131042 | 0.882523016054173 | 0.140306542 | 0.1878 |
Heavy Makeup | 1.00048536793256 | 1.02952170372009 | 0.805143422354104 | 0.1463585953 | 0.1322 |
Rosy Cheeks | 1.16560850348722 | 1.03843355178833 | 0.879707829262725 | 0.1407405867 | 0.2161 |
Shiny Skin | 1.26315502068762 | 1.05610179901123 | 0.507037966978619 | 0.1653680619 | 0.0852 |
Pale Skin | 1.18351263647553 | 1.0652779340744 | 0.591797915240051 | 0.1411130869 | 0.1073 |
5 o’ Clock Shadow | 1.17272883141058 | 1.05673563480377 | 0.534428973598113 | 0.1401202399 | 0.1421 |
Strong Nose-Mouth Lines | 1.22455279703978 | 1.03042745590209 | 0.84432777904588 | 0.1408965065 | |
Wearing Lipstick | 1.19464912714504 | 1.04016602039337 | 0.866697101118466 | 0.1447695088 | |
Flushed Face | 1.09233277829563 | 1.04554724693298 | 0.656471125313855 | 0.1430755171 | |
High Cheekbones | 1.11059337257828 | 1.08112835884094 | 0.655482005630373 | 0.1582355048 | |
Brown Eyes | 1.17224142515288 | 1.01272892951965 | 0.860229780111085 | 0.1409409852 | |
Wearing Earrings | 1.16634746922024 | 1.0849984884262 | 0.636384387126226 | 0.1463791189 |
Attribute | Incompressibility | SSB | SFV | Label Noise | Whole Model |
---|---|---|---|---|---|
5_o_Clock_Shadow | 0.0684682 | 0.0459237 | 0.0002894 | 3e-7 | 3e-7 |
Arched_Eyebrows | 0.0126105 | 0.0538039 | 0.0131643 | 0.0106393 | 0.0106393 |
Attractive | 0.0004621 | 0.0338707 | 0.0483136 | 0.0275897 | 0.0004621 |
Bags_Under_Eyes | 0.0202342 | 0.0623039 | 0.0050525 | 0.0093071 | 0.0050525 |
Bald | 0.4319569 | 0.00024 | 0.0943209 | 0.0748498 | 0.00024 |
Bangs | 0.0006496 | 0.1283069 | 0.0774581 | 0.0298166 | 0.0006496 |
Big_Lips | 0.1088419 | 0.0091254 | 0.0032081 | 0.0052196 | 0.0032081 |
Big_Nose | 0.2114748 | 0.0001715 | 0.0252706 | 0.020549 | 0.0001715 |
Black_Hair | 0.0142123 | 0.0593484 | 0.0061956 | 0.0106132 | 0.0061956 |
Blond_Hair | 0.0100063 | 0.0946047 | 0.0130189 | 0.0110102 | 0.0100063 |
Blurry | 0.5419093 | 0.0074364 | 0.0490523 | 0.0729299 | 0.0074364 |
Brown_Hair | 0.0001064 | 0.1154422 | 0.0218438 | 0.0360707 | 0.0001064 |
Bushy_Eyebrows | 0.0213848 | 0.0773384 | 0.0035219 | 0.006768 | 0.0035219 |
Chubby | 0.0013206 | 0.1581314 | 0.015918 | 0.0287418 | 0.0013206 |
Double_Chin | 0.0035046 | 0.1521108 | 0.0137349 | 0.0272636 | 0.0035046 |
Eyeglasses | 0.0113092 | 0.120019 | 0.0071329 | 0.0140104 | 0.0071329 |
Goatee | 0.1604753 | 0.0201623 | 0.0013985 | 0.0092697 | 0.0013985 |
Gray_Hair | 0.0002775 | 0.1933109 | 0.0296317 | 0.0411467 | 0.0002775 |
Heavy_Makeup | 0.0000469 | 0.0613383 | 0.0363665 | 0.0266154 | 0.0000469 |
High_Cheekbones | 0.0930655 | 0.0000754 | 0.0010459 | 0.0000749 | 0.0000749 |
Male | 0.0336658 | 0.0118035 | 0.0005703 | 0.0000012 | 0.0000012 |
Mouth_Slightly_Open | 0.0443864 | 0.0033491 | 0.0006975 | 0.0047076 | 0.0006975 |
Mustache | 0.282736 | 0.0040766 | 0.0060553 | 0.0346049 | 0.0040766 |
Narrow_Eyes | 0.2105005 | 0.005241 | 0.0237357 | 0.0113154 | 0.005241 |
No_Beard | 0.6583177 | 0.03911 | 0.1239485 | 0.1577867 | 0.03911 |
Oval_Face | 0.3213105 | 0.0066273 | 0.0411204 | 0.0393565 | 0.0066273 |
Pale_Skin | 0.2731508 | 0.0047057 | 0.0111297 | 0.0200174 | 0.0047057 |
Pointy_Nose | 0.0001962 | 0.0922673 | 0.0293014 | 0.0316888 | 0.0001962 |
Receding_Hairline | 0.5347036 | 0.0090082 | 0.1198915 | 0.0943515 | 0.0090082 |
Rosy_Cheeks | 0.0309238 | 0.0892779 | 0.0015001 | 0.0063849 | 0.0015001 |
Sideburns | 0.1497514 | 0.023381 | 0.005897 | 0.0069555 | 0.005897 |
Smiling | 0.1429648 | 0.0036202 | 0.0001827 | 0.0027342 | 0.0001827 |
Straight_Hair | 0.2121866 | 0.0005044 | 0.0203373 | 0.0143183 | 0.0005044 |
Wavy_Hair | 0.0181157 | 0.0392565 | 0.0036124 | 0.0094807 | 0.0036124 |
Wearing_Earrings | 0.0000825 | 0.1285602 | 0.0341021 | 0.0405731 | 0.0000825 |
Wearing_Hat | 0.0069229 | 0.1368304 | 0.0145395 | 0.0234406 | 0.0069229 |
Wearing_Lipstick | 0.0278786 | 0.0084025 | 0.0057513 | 0.0016339 | 0.0016339 |
Wearing_Necklace | 0.0691438 | 0.0408226 | 0.0005308 | 0.0004549 | 0.0004549 |
Wearing_Necktie | 0.1449726 | 0.0222941 | 0.0086847 | 0.00313 | 0.00313 |
Young | 0.1940491 | 0.0011516 | 0.0204862 | 0.0260069 | 0.0011516 |
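For concreteness, the DIR values reported in the tables above can be computed directly from a detector's anomaly set and a binary attribute label. The snippet below is only an illustrative sketch, not the implementation used for these experiments: it assumes DIR is taken as the larger of the two directional anomaly-rate ratios (so that DIR ≥ 1, consistent with the reported values), and all function and variable names (`disparate_impact_ratio`, `is_anomaly`, `in_group`) are hypothetical.

```python
# Illustrative sketch only (assumed formulation, not the paper's code):
# DIR for one binary attribute, taken as the larger of the two directional
# anomaly-rate ratios so that DIR >= 1.
import numpy as np

def disparate_impact_ratio(is_anomaly: np.ndarray, in_group: np.ndarray,
                           eps: float = 1e-12) -> float:
    """Boolean arrays of equal length: flagged-as-anomaly and group membership."""
    rate_group = is_anomaly[in_group].mean() if in_group.any() else 0.0
    rate_complement = is_anomaly[~in_group].mean() if (~in_group).any() else 0.0
    ratio = (rate_group + eps) / (rate_complement + eps)
    return max(ratio, 1.0 / ratio)

# Hypothetical usage: flag the top 5% of anomaly scores, then score one attribute.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random(10_000)              # stand-in anomaly scores
    in_group = rng.random(10_000) < 0.2      # stand-in binary attribute labels
    is_anomaly = scores >= np.quantile(scores, 0.95)
    print(disparate_impact_ratio(is_anomaly, in_group))
```

Under this assumed formulation, each attribute row above corresponds to one such computation, with the attribute's complement serving as the reference group.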