On Measuring Fairness in Generative Models
Abstract
Recently, there has been increased interest in fair generative models. In this work, we conduct, for the first time, an in-depth study on fairness measurement, a critical component in gauging progress on fair generative models. We make three contributions. First, we conduct a study that reveals that the existing fairness measurement framework has considerable measurement errors, even when highly accurate sensitive attribute (SA) classifiers are used. These findings cast doubt on previously reported fairness improvements. Second, to address this issue, we propose CLassifier Error-Aware Measurement (CLEAM), a new framework which uses a statistical model to account for inaccuracies in SA classifiers. Our proposed CLEAM reduces measurement errors significantly, e.g., 4.98% → 0.62% for StyleGAN2 w.r.t. Gender. Additionally, CLEAM achieves this with minimal additional overhead. Third, we utilize CLEAM to measure fairness in an important text-to-image generator and in GANs, revealing considerable biases in these models that raise concerns about their applications. Code and more resources: https://sutd-visual-computing-group.github.io/CLEAM/.
1 Introduction
Fair generative models have been attracting significant attention recently [1, 2, 7, 8, 9, 10, 11, 12, 13]. In generative models [14, 15, 16, 17, 18], fairness is commonly defined as equal generative quality [11] or equal representation [1, 2, 7, 9, 12, 19, 20] w.r.t. some Sensitive Attributes (SA). In this work, we focus on the more widely utilized definition – equal representation. In this definition, as an example, a generative model is regarded as fair w.r.t. Gender, if it generates Male and Female samples with equal probability. This is an important research topic as such biases in generative models could impact their application efficacy, e.g., by introducing racial bias in face generation of suspects [21] or reducing accuracy when supplementing data for disease diagnosis [22].
Fairness measurement for generative models. Recognizing the importance of fair generative models, several methods have been proposed to mitigate biases in generative models [1, 2, 7, 9, 12]. However, in our work, we focus mainly on the accurate fairness measurement of deep generative models, i.e., assessing and quantifying the bias of generative models. This is a critical topic, as accurate measurements are essential to reliably gauge the progress of bias mitigation techniques. The general fairness measurement framework is shown in Fig. 1 (see Sec. 2 for details). This framework is utilized in existing works to assess their proposed fair generators. Central to the fairness measurement framework is a SA classifier, which classifies the generated samples w.r.t. a SA in order to estimate the bias of the generator. For example, if eight out of ten generated face images are classified as Male by the SA classifier, then the generator is deemed biased towards Male (further discussion in Sec. 2). We follow previous works [1, 2, 12] and focus on binary SA due to dataset limitations.
Research gap. In this paper, we study a critical research gap in fairness measurement. Existing works assume that when SA classifiers are highly accurate, measurement errors should be insignificant. As a result, the effect of errors in SA classifiers has not been studied. However, our study reveals that even with highly accurate SA classifiers, considerable fairness measurement errors could still occur. This finding raises concerns about potential errors in previous works' results, which were measured using the existing framework. Note that the SA classifier is indispensable in fairness measurement as it enables automated measurement of generated samples.
Our contributions. We make three contributions to fairness measurement for generative models. As our first contribution, we analyze the accuracy of fairness measurement on generated samples, which previous works [1, 2, 7, 9, 12] have been unable to carry out due to the unavailability of proper datasets. We overcome this challenge by proposing new datasets of generated samples with manual labeling w.r.t. various SAs. The datasets include generated samples from Stable Diffusion Model (SDM) [5] —a popular text-to-image generator— as well as two State-of-The-Art (SOTA) GANs (StyleGAN2 [3] and StyleSwin [4]) w.r.t. different SAs. Our new datasets are then utilized in our work to evaluate the accuracy of the existing fairness measurement framework. Our results reveal that the accuracy of the existing fairness measurement framework is not adequate, due to the lack of consideration for the SA classifier inaccuracies. More importantly, we found that even in setups where the accuracy of the SA classifier is high, the error in fairness measurement could still be significant. Our finding raises concerns about the accuracy of previous works’ results [1, 2, 12], especially since some of their reported improvements are smaller than the margin of measurement errors that we observe in our study when evaluated under the same setup; further discussion in Sec. 3.
To address this issue, as our second (major) contribution, we propose CLassifier Error-Aware Measurement (CLEAM), a new, more accurate fairness measurement framework based on our developed statistical model for SA classification (further details on the statistical model in Sec. 4.1). Specifically, CLEAM utilizes this statistical model to account for the classifier's inaccuracies during SA classification and outputs a more accurate fairness measurement. We then evaluate the accuracy of CLEAM and validate its improvement over the existing fairness measurement framework. We further conduct a series of ablation studies to validate the performance of CLEAM. We remark that CLEAM is not a new fairness metric, but an improved fairness measurement framework that achieves better accuracy in bias estimation when used with various fairness metrics for generative models.
As our third contribution, we apply CLEAM as an accurate framework to reliably measure biases in popular generative models. Our study reveals that SOTA GANs have considerable biases w.r.t. several SAs. Furthermore, we observe an intriguing property in the Stable Diffusion Model: slight differences in semantically similar prompts could result in markedly different biases for SDM. These results prompt careful consideration of the implications of biases in generative models. Our contributions are:
• We conduct a study to reveal that even highly-accurate SA classifiers could still incur significant fairness measurement errors when using the existing framework.
• To enable evaluation of fairness measurement frameworks, we propose new datasets based on generated samples from StyleGAN2, StyleSwin and SDM, with manual labeling w.r.t. SA.
• We propose a statistically driven fairness measurement framework, CLEAM, which accounts for the SA classifier inaccuracies to output a more accurate bias estimate.
• Using CLEAM, we reveal considerable biases in several important generative models, prompting careful consideration when applying them to different applications.
2 Fairness Measurement Framework
Fig. 1(a) illustrates the fairness measurement framework for generative models as in [1, 2, 7, 9, 12]. Assume that with some input, e.g. a noise vector for a GAN or a text prompt for SDM, a generative model synthesizes a sample x. Generally, as the generator does not label synthesized samples, the ground truth (GT) class probability of these samples w.r.t. a SA (denoted by p* = {p*_0, p*_1}) is unknown. Thus, an SA classifier is utilized to estimate p*. Specifically, for each sample x, the SA classifier outputs the argmax classification for the respective SA. In existing works, the expected value of the SA classifier output over a batch of n samples, p̂ = {p̂_0, p̂_1} (or the average of p̂ over multiple batches of samples), is used as an estimate of p*. This estimate may then be used in some fairness metric to report the fairness value for the generator, e.g. the fairness discrepancy metric between p̂ and a uniform distribution [1, 20] (see Supp A.3 for details on how to calculate it). Note that the general assumption behind the existing framework is that, with a reasonably accurate SA classifier, p̂ could be an accurate estimation of p* [1, 9]. In the next section, we present a deeper analysis of the effects of an inaccurate SA classifier on fairness measurement. Our findings suggest that there could be a large discrepancy between p̂ and p*, even for highly accurate SA classifiers, indicative of significant fairness measurement errors in the current measurement framework.
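To make the pipeline concrete, the following is a minimal sketch of the baseline estimation of p̂ over one batch of generated samples. It assumes hypothetical `generate_batch` and `sa_classifier` callables standing in for a trained generator and SA classifier; it is illustrative rather than the authors' implementation.

```python
import numpy as np

def estimate_p_hat(generate_batch, sa_classifier, n=400):
    """Baseline estimate p_hat = {p_hat_0, p_hat_1} over one batch of n
    generated samples, for a binary sensitive attribute."""
    samples = generate_batch(n)                  # hypothetical generator interface
    labels = np.asarray(sa_classifier(samples))  # argmax SA predictions in {0, 1}
    p_hat_0 = float(np.mean(labels == 0))
    return np.array([p_hat_0, 1.0 - p_hat_0])
```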
One may argue that conditional GANs (cGANs) [23, 24] could be used to generate samples conditioned on the SA, thereby eliminating the need for an SA classifier. However, cGANs are not considered in previous works due to several limitations. These include the limited availability of large labeled training datasets, the unreliability of sample quality and labels [25], and the exponentially increasing number of conditional terms per SA. Similarly, for SDM, Bianchi et al. [26] found that utilizing well-crafted prompts to mitigate biases is ineffective due to existing biases in its training dataset. Furthermore, in Sec. 6, utilizing CLEAM, we show that even subtle prompt changes (while maintaining the semantics) result in drastically different SA biases. See Supp G for further comparison between [26] and our findings.
3 A Closer Look at Fairness Measurement
In this section, we take a closer look at the existing fairness measurement framework. In particular, we examine its performance in estimating p* for samples generated by SOTA GANs and SDM, a task previously unstudied due to the lack of a labeled generated dataset. We do so by designing an experiment to demonstrate these errors while evaluating biases in popular image generators. Following previous works, our main focus is on binary SA, which takes values in {0, 1}. Note that we assume that the accuracy of the SA classifier is known and is characterized by α = {α_0, α_1}, where α_u is the probability of correctly classifying a sample with label u. For example, for the Gender attribute, α_0 and α_1 are the probabilities of correctly classifying the Female and Male classes, respectively. In practice, the SA classifier is trained with standard training procedures (more details in Supp F), and α can be measured during its validation stage and considered a constant when the validation dataset is large enough. Additionally, p* can be assumed to be a constant vector, given that the generated samples can be considered to come from an infinite population, as theoretically there is no limit to the number of samples from a generative model like a GAN or SDM.
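The accuracies α = {α_0, α_1} can be measured directly from ordinary validation predictions; below is a minimal sketch under the assumption of binary labels (our own helper, not the authors' code).

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Estimate alpha = {alpha_0, alpha_1}: the probability that the SA
    classifier correctly classifies each class, measured on a labeled
    validation set."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return tuple(float(np.mean(y_pred[y_true == c] == c)) for c in (0, 1))
```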
New dataset by labeling generator outputs. The major limitation in evaluating the existing fairness measurement framework is the unavailability of p*. To pave the way for an accurate evaluation, we create a new dataset by manually labeling the samples generated by GANs and SDM. More specifically, we utilize the official publicly released StyleGAN2 [3] and StyleSwin [4] pre-trained on CelebA-HQ [27] for sample generation. Then, we randomly sample from these GANs and utilize Amazon Mechanical Turk to hand-label the samples w.r.t. Gender and BlackHair, resulting in 9K samples for each GAN; see Supp H for more details and examples. Next, we follow a similar labeling process w.r.t. Gender, but with an SDM [5] pre-trained on LAION-5B [28]. Here, we input prompts using best practices [26, 29, 30, 31], beginning with a scene description ("A photo with the face of"), followed by four indefinite (gender-neutral) pronouns or nouns [32, 33] – {"an individual", "a human being", "one person", "a person"} – to collect 2k high-quality samples. We refer to this new dataset as Generated Dataset (GenData), which includes generated images from three models with corresponding SA labels: GenData-StyleGAN2, GenData-StyleSwin, GenData-SDM. We remark that these labeled datasets only provide a strong approximation of p* for each generator; however, as the datasets are reasonably large, we find this approximation sufficient and simply refer to it as the GT p*. Then, utilizing this GT p*, we compare it against the estimate produced by the baseline framework (p̂). One interesting observation revealed by GenData is that all three generators exhibit a considerable amount of bias (see Tab. 1 and 2); more detail in Sec. 6. Note that for a fair generator we have p*_0 = p*_1 = 0.5, and measuring p̂_0 and p̂_1 is a good proxy for measuring fairness.
Experimental setup. Here, we follow Choi et al. [1] as the Baseline for measuring fairness. In particular, to calculate each p̂ value for a generator, a corresponding batch of n samples is randomly drawn from GenData and passed into the SA classifier for classification. We repeat this for s batches and report the mean result, denoted by μ̂_Base, and the 95% confidence interval, denoted by ρ_Base. For a comprehensive analysis of the GANs, we repeat the experiment using four different SA classifiers: ResNet-18, ResNet-34 [34], MobileNetv2 [35], and VGG-16 [36]. Then, to evaluate the SDM, we utilize CLIP [6] to explore the use of pre-trained models for zero-shot SA classification; more details on the CLIP SA classifier in Supp. E. As CLIP does not have a validation dataset, to measure α for CLIP we utilize CelebA-HQ, a dataset with a similar domain to our application. We found this to be a very accurate approximation; see Supp D.7 for validation results. Note that for SDM, a separate set of measurements is taken for each text prompt, as SDM's output images are conditioned on the input text prompt. As seen in Tab. 1 and 2, all classifiers demonstrate reasonably high average accuracy. Note that as we focus on binary SA (e.g. Gender: {Male, Female}), both p* and p̂ have two components, i.e., p* = {p*_0, p*_1} and p̂ = {p̂_0, p̂_1}. After computing μ̂_Base and ρ_Base, we calculate the normalized point error e_μ and the interval max error e_ρ w.r.t. the GT p* to evaluate the measurement accuracy of the baseline method:
e_μ = |p*_0 − μ̂_0| / p*_0 ,    e_ρ = max_{ρ ∈ {ρ_L, ρ_U}} |p*_0 − ρ| / p*_0        (1)

where μ̂_0 denotes the point estimate of p*_0 and ρ_L, ρ_U denote the lower and upper bounds of the interval estimate.
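For reference, a small helper implementing both error measures of Eqn. 1; the worked example reproduces the ResNet-18 (Gender, StyleGAN2) entry of Tab. 1. The function names are ours.

```python
def point_error(p_star_0, mu_hat_0):
    """Normalized point-estimate error e_mu = |p*_0 - mu_hat_0| / p*_0."""
    return abs(p_star_0 - mu_hat_0) / p_star_0

def interval_error(p_star_0, rho):
    """Interval max error e_rho: worst-case normalized deviation of the
    interval estimate rho = (lower, upper) from the GT p*_0."""
    return max(abs(p_star_0 - r) for r in rho) / p_star_0

print(point_error(0.642, 0.610))              # ~0.0498 -> 4.98%
print(interval_error(0.642, (0.602, 0.618)))  # ~0.0623 -> 6.23%
```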
Based on our results in Tab. 1, for GANs, we observe that despite the use of reasonably accurate SA classifiers, there are significant estimation errors in the existing fairness measurement framework, i.e. e_μ ≥ 4.98%. In particular, looking at the SA classifier with the highest average accuracy of 0.97 (ResNet-18 on Gender), we observe significant discrepancies between the GT p* and μ̂_Base, with e_μ = 4.98%. These errors generally worsen as accuracy marginally degrades, e.g. MobileNetv2 with 0.96 accuracy results in e_μ = 5.45%. These considerably large errors contradict prior assumptions – that for a reasonably accurate SA classifier, the measurement error can be assumed to be fairly negligible. Similarly, our results in Tab. 2 for the SDM show a large e_μ, even though the classifier is very accurate. We discuss the reason for this in more detail in Sec. 5.1.
Overall, these results are concerning as they cast doubt on the accuracy of prior reported results. For example, imp-weighting [1], which uses the same ResNet-18 source code as our experiment, reports a 2.35% relative improvement in fairness against its baseline w.r.t. Gender, which falls within our experiment's smallest relative error, e_μ = 4.98%. Similarly, Teo et al. [2] and Um et al. [12] report relative improvements in fairness of 0.32% and 0.75%, respectively, compared to imp-weighting [1]. These findings suggest that some prior results may be affected due to the oversight of the SA classifier's inaccuracies; see Supp. A.4 for more details on how to calculate these measurements.
Remark: In this section, we provide the keystone for evaluating measurement accuracy in the current framework by introducing a labeled dataset based on generated samples. These evaluation results raise concerns about the accuracy of the existing framework, as considerable error rates were observed even when using accurate SA classifiers, an issue previously assumed to be negligible.
4 Mitigating Error in Fairness Measurements
The previous section exposes the inaccuracies in the existing fairness measurement framework. Following that, in this section, we first develop a statistical model for the erroneous output of the SA classifier, p̂, to help draw a more systematic relationship between the inaccuracy of the SA classifier and the error in fairness estimation. Then, with this statistical model, we propose CLEAM – a new measurement framework that reduces the error in the measured p̂ by accounting for the SA classifier inaccuracies to output a more accurate statistical approximation of p*.
4.1 Proposed Statistical Model for Fairness Measurements
As shown in Fig. 1(a), to measure the fairness of the generator, we feed n generated samples to the SA classifier. The output of the SA classifier, p̂, is in fact a random variable that aims to approximate p*. Here, we propose a statistical model to derive the distribution of p̂.
As Fig. 1(b) demonstrates, in our running example of a binary SA, each generated sample is from class 0 with probability p*_0, or from class 1 with probability p*_1 = 1 − p*_0. Then, a generated sample from class u, where u ∈ {0, 1}, will be classified correctly with probability α_u, and wrongly with probability 1 − α_u. Thus, for each sample, there are four mutually exclusive possible events, denoted by e_{u→v}, with the corresponding probability vector p':
p' = [ p*_0 α_0 ,  p*_0 (1 − α_0) ,  p*_1 α_1 ,  p*_1 (1 − α_1) ]        (2)
where e_{u→v} denotes the event of assigning label v to a sample with GT label u. Given that this process is performed independently for each of the n generated images, the counts for the outcomes in Eqn. 2 (denoted by N) can be modeled by a multinomial distribution, i.e. N ∼ Mult(n, p') [37, 38, 39]. Note that Mult(n, p') models the joint probability distribution of these outcomes, i.e. N = [N_{0→0}, N_{0→1}, N_{1→1}, N_{1→0}], where N_{u→v} is the random variable of the count for event e_{u→v} after classifying n generated images. Since p' is not near the boundary of the parameter space, and as we utilize a large n, based on the central limit theorem N can be approximated by a multivariate Gaussian distribution, N ∼ N(n p', n Σ_N), with mean n p' and covariance n Σ_N [40, 39], where Σ_N is defined as:
Σ_N = diag(p') − p' p'ᵀ        (3)
Here, diag(p') denotes a square diagonal matrix corresponding to vector p' (see Supp A.1 for the expanded form). The marginal distribution of this multivariate Gaussian distribution gives us a univariate (one-dimensional) Gaussian distribution for the count of each outcome in Eqn. 2. For example, the distribution of the count for event e_{0→0}, denoted by N_{0→0}, can be modeled as N_{0→0} ∼ N( n p*_0 α_0 , n p*_0 α_0 (1 − p*_0 α_0) ).
Lastly, we find the total percentage of data points labeled as class 0 when classifying n generated images using the normalized sum of the related random variables, i.e. p̂_0 = (N_{0→0} + N_{1→0}) / n. For our binary example, p̂_0 can be calculated by summing random variables with Gaussian distributions, which results in another Gaussian distribution [41], i.e. p̂_0 ∼ N(μ_{p̂_0}, σ²_{p̂_0}), where:
μ_{p̂_0} = p*_0 α_0 + (1 − p*_0)(1 − α_1)        (4)
σ²_{p̂_0} = μ_{p̂_0} (1 − μ_{p̂_0}) / n        (5)
Similarly, with p̂_1 = (N_{1→1} + N_{0→1}) / n, we have μ_{p̂_1} = p*_1 α_1 + (1 − p*_1)(1 − α_0) and σ²_{p̂_1} = μ_{p̂_1}(1 − μ_{p̂_1}) / n, which is aligned with the fact that p̂_0 + p̂_1 = 1.
Remark: In this section, considering the probability tree diagram in Fig. 1(b), we propose a joint distribution for the possible classification events (e_{u→v}), and use it to compute the marginal distribution of each event and, finally, the distributions of the SA classifier outputs (p̂_0 and p̂_1). Note that, considering Eqn. 4 and 5, only with a perfect classifier (α_0 = α_1 = 1, i.e. 100% accuracy) does p̂ converge to p*. However, training a perfect SA classifier is not practical, e.g. due to the lack of an appropriate dataset and task hardness [42, 43]. As a result, in the following, we propose CLEAM, which instead utilizes this statistical model to mitigate the error of the SA classifier.
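The statistical model can be sanity-checked numerically: simulating the classification of n generated samples with a multinomial draw over the four events of Eqn. 2 reproduces the Gaussian mean and variance of Eqns. 4 and 5. The sketch below uses assumed values for p*_0 and α purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p_star_0, alpha = 0.642, (0.947, 0.983)   # assumed GT bias and SA classifier accuracies
n, trials = 400, 10_000

# Event probabilities of Eqn. 2: [e_{0->0}, e_{0->1}, e_{1->1}, e_{1->0}]
p_prime = np.array([
    p_star_0 * alpha[0], p_star_0 * (1 - alpha[0]),
    (1 - p_star_0) * alpha[1], (1 - p_star_0) * (1 - alpha[1]),
])

counts = rng.multinomial(n, p_prime, size=trials)   # N ~ Mult(n, p')
p_hat_0 = (counts[:, 0] + counts[:, 3]) / n         # fraction of samples labeled class 0

mu = p_star_0 * alpha[0] + (1 - p_star_0) * (1 - alpha[1])   # Eqn. 4
var = mu * (1 - mu) / n                                      # Eqn. 5
print(p_hat_0.mean(), mu)    # empirical vs. analytic mean
print(p_hat_0.var(), var)    # empirical vs. analytic variance
```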
4.2 CLEAM for Accurate Fairness Measurement
In this section, we propose a new estimation method for fairness measurement that considers the inaccuracy of the SA classifier. For this, we use the statistical model introduced in Sec. 4.1 to compute a more accurate estimation of p*. Specifically, we first propose a Point Estimate (PE) by approximating the maximum likelihood value of p*_0. Then, we use the confidence interval for the observed data (p̂_0) to propose an Interval Estimate (IE) for p*_0.
Point Estimate (PE) for p*_0. Suppose that we have access to s samples of p̂_0, denoted by {p̂_0^(1), …, p̂_0^(s)}, i.e. SA classification results on s batches of generated data. We can then use the proposed statistical model to approximate p*_0. In the previous section, we demonstrated that we can model p̂_0 using a Gaussian distribution. Considering this, we first use the available samples to calculate sample-based statistics, including the mean and variance of the samples:
μ̂ = (1/s) Σ_{j=1}^{s} p̂_0^(j)        (6)
σ̂² = (1/(s − 1)) Σ_{j=1}^{s} ( p̂_0^(j) − μ̂ )²        (7)
For a Gaussian distribution, the Maximum Likelihood Estimate (MLE) of the population mean is its sample mean [44]. Given that s is large enough (e.g. s ≥ 30), we can assume that μ̂ is a good approximation of the population mean [45], and equate it to the statistical population mean μ_{p̂_0} in Eqn. 4 (see Supp A.2 for the derivation). With that, we get the maximum likelihood approximation of p*_0, which we call CLEAM's point estimate, μ_CLEAM:
μ_CLEAM = ( μ̂ − (1 − α_1) ) / ( α_0 + α_1 − 1 )        (8)
Notice that μ_CLEAM accounts for the inaccuracy of the SA classifier.
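In code, the point estimate is a one-line inversion of Eqn. 4; the worked value below matches the ResNet-18 (Gender, StyleGAN2) row of Tab. 1. This is a sketch with our own function name.

```python
def cleam_point_estimate(mu_hat, alpha_0, alpha_1):
    """Invert Eqn. 4 to recover p*_0 from the sample mean of the SA
    classifier outputs and the per-class accuracies (Eqn. 8)."""
    return (mu_hat - (1.0 - alpha_1)) / (alpha_0 + alpha_1 - 1.0)

print(cleam_point_estimate(0.610, 0.947, 0.983))   # ~0.638, vs. GT p*_0 = 0.642
```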
Interval Estimate (IE) for p*_0. In the previous part, we proposed a PE for p*_0 using the statistical model and the sample-based mean μ̂. However, as we use only s samples of p̂_0, μ̂ may not capture the exact value of the population mean μ_{p̂_0}. This adds some degree of inaccuracy to μ_CLEAM. In fact, in our framework, μ̂ equals μ_{p̂_0} only when s → ∞. However, increasing each unit of s significantly increases the computational cost, as each p̂_0 requires n generated samples. To address this, we recall that p̂_0 follows a Gaussian distribution and instead utilize frequentist statistics [41] to propose a 95% confidence interval (CI) for p*_0. To do this, first we derive the CI for μ_{p̂_0}:
μ_{p̂_0} ∈ [ μ̂ − 1.96 σ̂ / √s ,  μ̂ + 1.96 σ̂ / √s ]        (9)
Then, applying Eqn. 4 to Eqn. 9 gives the lower and upper bounds of the approximated 95% CI for p*_0:
ρ_CLEAM = [ ρ_L , ρ_U ] = [ ( μ̂ − 1.96 σ̂/√s − (1 − α_1) ) / ( α_0 + α_1 − 1 ) ,  ( μ̂ + 1.96 σ̂/√s − (1 − α_1) ) / ( α_0 + α_1 − 1 ) ]        (10)
This gives us the interval estimate of CLEAM, ρ_CLEAM, a range of values that we can be approximately 95% confident contains p*_0. The range of possible values for p*_1 can be simply derived considering p*_1 = 1 − p*_0. The overall procedure of CLEAM is summarized in Alg. 1. Now, with the IE, we can provide statistical significance to reported fairness improvements.
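Below is a compact sketch of the overall CLEAM procedure, consistent with Eqns. 6-10 and the procedure summarized in Alg. 1; variable names are ours and the code is illustrative rather than the authors' implementation. Inputs are the s per-batch baseline estimates of p̂_0 and the validation accuracies α.

```python
import numpy as np

def cleam(p_hat_0_batches, alpha_0, alpha_1):
    """Return CLEAM's point estimate mu_CLEAM and 95% interval estimate
    rho_CLEAM of p*_0 from s per-batch estimates of p_hat_0."""
    x = np.asarray(p_hat_0_batches, dtype=float)
    s = x.size
    mu_hat = x.mean()              # sample mean (Eqn. 6)
    sigma_hat = x.std(ddof=1)      # sample standard deviation (Eqn. 7)

    def invert(m):
        # Eqn. 8: map a mean of p_hat_0 back to an estimate of p*_0
        return (m - (1.0 - alpha_1)) / (alpha_0 + alpha_1 - 1.0)

    half_width = 1.96 * sigma_hat / np.sqrt(s)   # 95% CI of the population mean (Eqn. 9)
    point = invert(mu_hat)
    interval = (invert(mu_hat - half_width), invert(mu_hat + half_width))   # Eqn. 10
    return point, interval
```

The corresponding estimates for p*_1 follow directly as one minus the returned values.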
Point Estimate | Interval Estimate
SA classifier | α = {α_0, α_1} | Avg. α | Baseline μ̂ | e_μ | Diversity μ̂ | e_μ | CLEAM (Ours) μ̂ | e_μ | Baseline ρ | e_ρ | Diversity ρ | e_ρ | CLEAM (Ours) ρ | e_ρ
(A) StyleGAN2 | ||||||||||||||
Gender with GT class probability =0.642 | ||||||||||||||
R18 | {0.947, 0.983} | 0.97 | 0.610 | 4.98% | — | — | 0.638 | 0.62% | [0.602, 0.618] | 6.23% | — | — | [0.629, 0.646] | 2.02% |
R34 | {0.932, 0.976]} | 0.95 | 0.596 | 7.17% | — | — | 0.634 | 1.25% | [0.589, 0.599] | 8.26% | — | — | [0.628, 0.638] | 2.18% |
MN2 | {0.938, 0.975} | 0.96 | 0.607 | 5.45% | — | — | 0.637 | 0.78% | [0.602, 0.612] | 6.23% | — | — | [0.632, 0.643] | 1.56% |
V16 | {0.801, 0.919} | 0.86 | 0.532 | 17.13% | 0.550 | 14.30% | 0.636 | 0.93% | [0.526, 0.538] | 18.06% | [0.536 , 0.564] | 16.51% | [0.628, 0.644] | 2.18% |
Average Error: | 8.68% | 14.30% | 0.90% | 9.70% | 16.51% | 1.99% | ||||||||
BlackHair with GT class probability =0.643 | ||||||||||||||
R18 | {0.869, 0.885} | 0.88 | 0.599 | 6.84% | — | — | 0.641 | 0.31% | [0.591, 0.607] | 8.08% | — | — | [0.631, 0.652] | 1.40% |
R34 | {0.834, 0.916} | 0.88 | 0.566 | 11.98% | — | — | 0.644 | 0.16% | [0.561, 0.572] | 12.75% | — | — | [0.637, 0.651] | 1.24% |
MN2 | {0.839, 0.881} | 0.86 | 0.579 | 9.95% | — | — | 0.639 | 0.62% | [0.574, 0.584] | 10.73% | — | — | [0.632, 0.647] | 1.71% |
V16 | {0.851, 0.836} | 0.84 | 0.603 | 6.22% | 0.582 | 9.49% | 0.640 | 0.47% | [0.597, 0.608] | 7.15% | [0.568, 0.596] | 11.66% | [0.632, 0.648] | 1.71% |
Average Error: | 8.75% | 9.49% | 0.39% | 9.68% | 11.66% | 1.52% | ||||||||
(B) StyleSwin | ||||||||||||||
Gender with GT class probability =0.656 | ||||||||||||||
R18 | {0.947, 0.983} | 0.97 | 0.620 | 5.49% | — | — | 0.648 | 1.22% | [0.612,0.629] | 6.70% | — | — | [0.639,0.658] | 2.59% |
R34 | {0.932, 0.976} | 0.95 | 0.610 | 7.01% | — | — | 0.649 | 1.07% | [0.605,0.615] | 7.77% | — | — | [0.643,0.654] | 1.98% |
MN2 | {0.938, 0.975} | 0.96 | 0.623 | 5.03% | — | — | 0.655 | 0.15% | [] | — | — | [0.649,0.661] | 1.07% | |
V16 | {0.801, 0.919} | 0.86 | 0.555 | 15.39% | 0.562 | 14.33% | 0.668 | 1.83% | [0.549,0.560] | 16.31% | [0.548,0.576] | 16.46% | [0.660,0.675] | 2.90% |
Average Error: | 8.23% | 14.33% | 1.07% | 9.14% | 16.46% | 2.14% | ||||||||
BlackHair with GT class probability =0.668 | ||||||||||||||
R18 | {0.869, 0.885} | 0.88 | 0.612 | 8.38% | — | — | 0.659 | 1.35% | [0.605,0.620] | 9.43% | — | — | [0.649,0.670] | 2.84% |
R34 | {0.834, 0.916} | 0.88 | 0.581 | 13.02% | — | — | 0.662 | 0.90% | [0.576,0.586] | 13.77% | — | — | [0.656,0.669] | 1.80% |
MN2 | {0.839, 0.881} | 0.86 | 0.596 | 10.78% | — | — | 0.659 | 1.35% | [0.591,0.600] | 11.50% | — | — | [0.652,0.666] | 2.40% |
V16 | {0.851, 0.836} | 0.84 | 0.625 | 6.44% | 0.608 | 8.98% | 0.677 | 1.35% | [0.620,0.630] | 7.19% | [0.590,0.626] | 11.68% | [0.670,0.684] | 2.40% |
Average Error: | 9.66% | 8.98% | 1.24% | 10.47% | 11.68% | 2.36% |
Point Estimate | Interval Estimate
Prompt | GT p*_0 | Baseline μ̂ | e_μ | CLEAM (Ours) μ̂ | e_μ | Baseline ρ | e_ρ | CLEAM (Ours) ρ | e_ρ
α = [0.998, 0.975], Avg. α = 0.987, CLIP SA classifier – Gender
"A photo with the face of an individual" | 0.186 | 0.203 | 9.14% | 0.187 | 0.05% | [ 0.198 , 0.208 ] | 11.83% | [ 0.182 , 0.192 ] | 3.23% |
"A photo with the face of a human being" | 0.262 | 0.277 | 5.73% | 0.263 | 0.38% | [ 0.270 , 0.285 ] | 8.78% | [ 0.255 , 0.271 ] | 3.44% |
"A photo with the face of one person" | 0.226 | 0.241 | 6.63% | 0.230 | 1.77% | [ 0.232 , 0.251 ] | 11.06% | [ 0.220 , 0.239 ] | 5.75% |
"A photo with the face of a person" | 0.548 | 0.556 | 1.49% | 0.548 | 0.00% | [ 0.545 , 0.566 ] | 3.28% | [ 0.537 , 0.558 ] | 2.01% |
Average Error | 5.75% | 0.44% | 8.74% | 3.61% |
5 Experiments
In this section, we first evaluate the fairness measurement accuracy of CLEAM on both GANs and SDM (Sec. 5.1) with our proposed GenData dataset. Then we evaluate CLEAM's robustness through several ablation studies (Sec. 5.2). To the best of our knowledge, there is no prior work on improving fairness measurements in generative models. Therefore, we compare CLEAM with the two most related works: a) the Baseline used in previous works [1, 2, 7, 9, 12]; b) Diversity [46], which computes disparity within a dataset via an intra-dataset pairwise similarity algorithm. We remark that, as discussed by Keswani et al. [46], Diversity is model-specific, using VGG-16 [36]; see Supp. D.2 for more details. Finally, unless specified, we repeat the experiments with s = 30 batches of images from the generators, with batch size n = 400. For a fair comparison, all three algorithms use the exact same inputs. However, while Baseline and Diversity ignore the SA classifier inaccuracies, CLEAM makes use of them to rectify the measurement error. As mentioned in Sec. 4.2, for CLEAM, we utilize α measured on real samples, which we found to be a good approximation of the α measured on generated samples (see Supp. D.7 for results). We repeat each experiment 5 times and report the mean value for each test point for both PE and IE. See Supp D.1 for the standard deviation.
5.1 Evaluating CLEAM’s Performance
CLEAM for fairness measurement of SOTA GANs – StyleGAN2 and StyleSwin. For a fair comparison, we first compute s samples of p̂_0, one for each batch of n images. For Baseline, we use the mean value as the PE (denoted by μ̂_Base) and the 95% confidence interval as the IE (ρ_Base). With the same samples of p̂_0, we apply Alg. 1 to obtain μ_CLEAM and ρ_CLEAM. For Diversity, following the original source code [46], a controlled dataset with fair representation is randomly selected from a held-out dataset of CelebA-HQ [27]. Then, we use a VGG-16 [36] feature extractor and compute the Diversity score, from which we derive the corresponding estimates of p*_0 and subsequently μ_Div and ρ_Div from their mean and CI (see Supp D.2 for more details on Diversity). We then compute e_μ and e_ρ with Eqn. 1 for CLEAM and Diversity by replacing the Baseline estimates with the respective estimates.
As discussed, our results in Tab. 1 show that the Baseline experiences significantly large errors, with e_μ of up to 17.13%, due to a lack of consideration for the inaccuracies of the SA classifier. We note that this problem is prevalent across the different SA classifier architectures, even with higher-capacity classifiers, e.g. ResNet-34. Diversity, a method similarly unaware of the inaccuracies of the SA classifier, presents a similar issue, with e_μ between 8.98% and 14.33%. In contrast, CLEAM dramatically reduces the error for all classifier architectures. Specifically, CLEAM reduces the average point estimate error from 8.23%–9.66% to 0.39%–1.24% across StyleGAN2 and StyleSwin. The IE presents similar results, where in most cases ρ_CLEAM bounds the GT value of p*_0.
CLEAM for fairness measurement of SDM. We evaluate CLEAM in estimating the bias of the SDM w.r.t. Gender, based on the synonymous (gender-neutral) prompts introduced in Sec. 3. Recall that here we utilize CLIP as the zero-shot SA classifier. Our results in Tab. 2 show that, as discussed, the Baseline incurs considerable error (average e_μ = 5.75%) for all prompts, even though the SA classifier's average accuracy is high at 0.987 (visual results in Fig. 2). A closer look at the theoretical model's Eqn. 4 reveals that this is due to the larger inaccuracy observed for the biased class (α_1 = 0.975), coupled with the large bias seen in p*, which results in p̂ deviating from p*. In contrast, CLEAM accounts for these inaccuracies and significantly reduces the average error to 0.44%. Moreover, CLEAM's IE consistently bounds the GT value of p*_0.
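For reference, below is a minimal sketch of zero-shot SA classification with CLIP via the Hugging Face `transformers` API; the checkpoint and class prompts shown here are illustrative assumptions and not necessarily those used in our setup (see Supp. E for the actual details).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
class_prompts = ["a photo of a woman", "a photo of a man"]  # assumed prompts for Gender

def classify_gender(image: Image.Image) -> int:
    """Return the argmax SA label (0 or 1) for one generated image."""
    inputs = processor(text=class_prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-text similarity scores
    return int(logits.argmax(dim=-1).item())
```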
5.2 Ablation Studies and Analysis
Here, we perform ablation studies and compare CLEAM with classifier correction methods. We remark that detailed results of these experiments are provided in the Supp. due to space limitations.
CLEAM for measuring varying degrees of bias. As we cannot control the bias in trained generative models, to simulate different degrees of bias we evaluate CLEAM with a pseudo-generator. Our results show that CLEAM is effective across different biases (p*_0 ∈ [0.5, 0.9]), reducing the average error from 2.80%–6.93% to under 0.75% on CelebA [47] w.r.t. {Gender, BlackHair}, and on AFHQ [48] w.r.t. Cat/Dog. See Supp D.3 and D.4 for full experimental results.
CLEAM vs. Classifier Correction Methods [49]. CLEAM generally accounts for the classifier's inaccuracies, without targeting any particular cause of inaccuracy, for the purpose of rectifying fairness measurements. This objective is unlike traditional classifier correction methods, as it does not aim to improve the actual classifier's accuracy. However, considering that classifier correction methods may improve fairness measurements by directly rectifying the classifier inaccuracies, we compare their performance against CLEAM. As an example, we utilize Black Box Shift Estimation / Correction (BBSE / BBSC) [49], which considers the label shift problem and aims to correct the classifier output by detecting the distribution shift. Our results, based on the Sec. 5.1 setup, show that while BBSE does improve on the fairness measurements of the Baseline, i.e. 4.20% → 3.38%, these results are far inferior to CLEAM's results in Tab. 1. In contrast, BBSC demonstrates no improvement in fairness measurements. See Supp D.8 for full experimental results. We postulate that this is likely due to the strong label shift assumption made by both methods.
Effect of batch size. Utilizing the experimental setup in Sec. 5.1 for batch sizes n ∈ [100, 600], our results in Fig. 9 show that n = 400 is an ideal batch size, balancing computational cost and measurement accuracy. See Supp F for full experimental details and results.
6 Applying CLEAM: Bias in Current SOTA Generative Models
In this section, we leverage the improved reliability of CLEAM to study biases in popular generative models. First, with the rise in popularity of text-to-image generators [5, 50, 51, 52], we revisit our results when passing different prompts with synonymous neutral meanings to an SDM, and take a closer look at how subtle prompt changes can impact bias w.r.t. Gender. We further investigate whether similar results occur for another SA, Smiling. Second, with the shift in popularity from convolution- to transformer-based architectures [53, 54, 55] due to their better sample quality, we determine whether the learned bias also changes. For this, we compare StyleSwin (transformer) and StyleGAN2 (convolution), which are both based on the same architecture backbone.
Our results on SDM demonstrate that the use of different synonymous neutral prompts [32, 33] results in different degrees of bias w.r.t. both the Gender and Smiling attributes. For example, in Fig. 2, a semantically insignificant prompt change from "one person" to "a person" results in a significant shift in Gender bias. Then, in Fig. 4(a), we observe that while the SDM w.r.t. our prompts appears to be heavily biased towards not-Smiling, having "person" in the prompt appears to significantly reduce this bias. This suggests that for SDM, even semantically similar neutral prompts [32, 33] could result in different degrees of bias, thereby demonstrating a certain instability in SDM. Next, our results in Fig. 4(b) compare the bias in StyleGAN2, StyleSwin, and the CelebA-HQ training dataset over an extended number of SAs. Overall, we found that while StyleSwin produces better-quality samples [4], the biases remain statistically unchanged between the two architectures, i.e. their IEs overlap. Interestingly, our results also show that both GANs are less biased than the training dataset itself.
7 Discussion
Conclusion. In this work, we address the limitations of the existing fairness measurement framework. Since generated samples are typically unlabeled, we first introduce a new labeled dataset based on three state-of-the-art generative models for our studies. Our findings suggest that the existing framework, which ignores classification inaccuracies, suffers from significant measurement errors, even when the SA classifier is very accurate. To rectify this, we propose CLEAM, which incorporates these inaccuracies into its statistical model and outputs a more accurate fairness measurement. Overall, CLEAM demonstrates improved accuracy across extensive experimentation, including both real generators and controlled setups. Moreover, by applying CLEAM to popular generative models, we uncover significant biases that raise efficacy concerns about these models' real-world application.
Broader Impact. Given that generative models are becoming more widely integrated into our everyday society e.g. text-to-image generation, it is important that we have reliable means to measure fairness in generative models, thereby allowing us to prevent these biases from proliferating into new technologies. CLEAM provides a step in this direction by allowing for more accurate evaluation. We remark that our work does not introduce any social harm but instead improves on the already existing measurement framework.
Limitations. Despite the effectiveness of the proposed method across various generative models, our work addresses only one facet of the problems in existing fairness measurement, and there is still room for further improvement. For instance, it may be beneficial to consider non-binary SAs, e.g. when hair color is not necessarily fully black (e.g., grey). Additionally, existing datasets used to train classifiers are commonly human-annotated, which may itself contain certain notions of bias. See Supp. I for further discussion.
Acknowledgements
This research is supported by the National Research Foundation, Singapore under its AI Singapore Programmes (AISG Award No.: AISG2-TC-2022-007) and SUTD project PIE-SGP-AI-2018-01. This research work is also supported by the Agency for Science, Technology and Research (A*STAR) under its MTC Programmatic Funds (Grant No. M23L7b0021). This material is based on the research/work support in part by the Changi General Hospital and Singapore University of Technology and Design, under the HealthTech Innovation Fund (HTIF Award No. CGH-SUTD-2021-004). We thank anonymous reviewers for their insightful feedback and discussion.
References
- Choi et al. [2020a] Kristy Choi, Aditya Grover, Trisha Singh, Rui Shu, and Stefano Ermon. Fair Generative Modeling via Weak Supervision. In Proceedings of the 37th International Conference on Machine Learning, pages 1887–1898. PMLR, November 2020a.
- Teo et al. [2023] Christopher TH Teo, Milad Abdollahzadeh, and Ngai-Man Cheung. Fair generative models via transfer learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2429–2437, 2023.
- Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, June 2019.
- Zhang et al. [2021] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin: Transformer-based gan for high-resolution image generation, 2021.
- Rombach et al. [2021] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, February 2021.
- Frankel and Vendrow [2020] Eric Frankel and Edward Vendrow. Fair Generation Through Prior Modification. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), June 2020.
- Humayun et al. [2022] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk. MaGNET: Uniform Sampling from Deep Generative Network Manifolds Without Retraining. In International Conference on Learning Representations, January 2022.
- Tan et al. [2020] Shuhan Tan, Yujun Shen, and Bolei Zhou. Improving the Fairness of Deep Generative Models without Retraining. arXiv:2012.04842 [cs], December 2020.
- Yu et al. [2020] Ning Yu, Ke Li, Peng Zhou, Jitendra Malik, Larry Davis, and Mario Fritz. Inclusive GAN: Improving Data and Minority Coverage in Generative Models, August 2020.
- Maluleke et al. [2022] Vongani H. Maluleke, Neerja Thakkar, Tim Brooks, Ethan Weber, Trevor Darrell, Alexei A. Efros, Angjoo Kanazawa, and Devin Guillory. Studying Bias in GANs through the Lens of Race, September 2022.
- Um and Suh [2021] Soobin Um and Changho Suh. A Fair Generative Model Using Total Variation Distance. openreview, September 2021.
- Abdollahzadeh et al. [2023] Milad Abdollahzadeh, Touba Malekzadeh, Christopher TH Teo, Keshigeyan Chandrasegaran, Guimeng Liu, and Ngai-Man Cheung. A survey on generative modeling with limited data, few shots, and zero shot. arXiv preprint arXiv:2307.14397, 2023.
- Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
- Karras et al. [2020] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in neural information processing systems, 33:12104–12114, 2020.
- Zhao et al. [2022] Yunqing Zhao, Keshigeyan Chandrasegaran, Milad Abdollahzadeh, and Ngai-Man Man Cheung. Few-shot image generation via adaptation-aware kernel modulation. Advances in Neural Information Processing Systems, 35:19427–19440, 2022.
- Chandrasegaran et al. [2022] Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Alexander Binder, and Ngai-Man Cheung. Discovering transferable forensic features for cnn-generated images detection. In European Conference on Computer Vision, pages 671–689. Springer, 2022.
- Zhao et al. [2023] Yunqing Zhao, Chao Du, Milad Abdollahzadeh, Tianyu Pang, Min Lin, Shuicheng Yan, and Ngai-Man Cheung. Exploring incompatible knowledge transfer in few-shot image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7380–7391, 2023.
- Hutchinson and Mitchell [2019] Ben Hutchinson and Margaret Mitchell. 50 Years of Test (Un)fairness: Lessons for Machine Learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, pages 49–58, New York, NY, USA, January 2019. Association for Computing Machinery. ISBN 978-1-4503-6125-5. doi: 10.1145/3287560.3287600.
- Teo and Cheung [2021] Christopher T. H. Teo and Ngai-Man Cheung. Measuring Fairness in Generative Models. 38th International Conference on Machine Learning (ICML) Workshop, July 2021.
- Jalan et al. [2020] Harsh Jaykumar Jalan, Gautam Maurya, Canute Corda, Sunny Dsouza, and Dakshata Panchal. Suspect Face Generation. In 2020 3rd International Conference on Communication System, Computing and IT Applications (CSCITA), pages 73–78, April 2020. doi: 10.1109/CSCITA47329.2020.9137812.
- Frid-Adar et al. [2018] Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. Gan-based synthetic medical image augmentation for increased cnn performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
- Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Odena et al. [2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In ICML, 2017.
- Thekumparampil et al. [2018] Kiran K Thekumparampil, Ashish Khetan, Zinan Lin, and Sewoong Oh. Robustness of conditional GANs to noisy labels. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- Bianchi et al. [2022] Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, and Aylin Caliskan. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale, November 2022.
- Lee et al. [2020] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. Maskgan: Towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Liu and Chilton [2022] Vivian Liu and Lydia B Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–23, 2022.
- Sta [2022] Stable Diffusion: Prompt Guide and Examples. https://strikingloo.github.io/stable-diffusion-vs-dalle-2, August 2022.
- Zhou et al. [2022] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9):2337–2348, September 2022. ISSN 0920-5691, 1573-1405. doi: 10.1007/s11263-022-01653-1.
- Haspelmath [1997] Martin Haspelmath. Indefinite Pronouns. Oxford University Press, 1997. ISBN 978-0-19-829963-9 978-0-19-823560-6. doi: 10.1093/oso/9780198235606.001.0001.
- Saguy and Williams [2022] Abigail C Saguy and Juliet A Williams. A little word that means a lot: A reassessment of singular they in a new era of gender politics. Gender & Society, 36(1):5–31, 2022.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, June 2016.
- Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, Salt Lake City, UT, June 2018. IEEE. ISBN 978-1-5386-6420-9. doi: 10.1109/CVPR.2018.00474.
- Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR2015, September 2014.
- Rao [1957] C Radhakrishna Rao. Maximum likelihood estimation for the multinomial distribution. Sankhyā: The Indian Journal of Statistics (1933-1960), 18(1/2):139–148, 1957.
- Kesten and Morse [1959] Harry Kesten and Norman Morse. A property of the multinomial distribution. The Annals of Mathematical Statistics, 30(1):120–127, 1959.
- Papoulis [2002] Athanasios Papoulis. Probability, Random Variables and Stochastic Processes with Errata Sheet. McGraw-Hill Europe, Boston, Mass., 4th edition edition, January 2002. ISBN 978-0-07-122661-5.
- Geyer [2010] Charles J Geyer. Stat 5101 Notes: Brand Name Distributions. University of Minnesota, Stat 5101:25, January 2010.
- Goodman [1963] Nathaniel R Goodman. Statistical analysis based on a certain multivariate complex gaussian distribution (an introduction). The Annals of mathematical statistics, 34(1):152–177, 1963.
- Abdollahzadeh et al. [2021] Milad Abdollahzadeh, Touba Malekzadeh, and Ngai-Man Man Cheung. Revisit multimodal meta-learning through the lens of multi-task learning. Advances in Neural Information Processing Systems, 35:14632–14644, 2021.
- Achille et al. [2019] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439, 2019.
- Anderson and Olkin [1985] Theodore Wilbur Anderson and Ingram Olkin. Maximum-likelihood estimation of the parameters of a multivariate normal distribution. Linear algebra and its applications, 70:147–171, 1985.
- Krohling and dos Santos Coelho [2006] Renato A Krohling and Leandro dos Santos Coelho. Coevolutionary particle swarm optimization using gaussian distribution for solving constrained optimization problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(6):1407–1416, 2006.
- Keswani and Celis [2021] Vijay Keswani and L. Elisa Celis. Auditing for Diversity Using Representative Examples. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 860–870, Virtual Event Singapore, August 2021. ACM. ISBN 978-1-4503-8332-5. doi: 10.1145/3447548.3467433.
- Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. Proceedings of International Conference on Computer Vision (ICCV), September 2015.
- Choi et al. [2020b] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse Image Synthesis for Multiple Domains, April 2020b.
- Lipton et al. [2018] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and correcting for label shift with black box predictors. In Proceedings of the 35th International Conference on Machine Learning, pages 3122–3130. PMLR, 10–15 Jul 2018.
- Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation, February 2021.
- Gal et al. [2021] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators, December 2021.
- Patashnik et al. [2021] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, March 2021.
- Raghu et al. [2022] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do Vision Transformers See Like Convolutional Neural Networks?, March 2022.
- Hudson and Zitnick [2021] Drew A. Hudson and Larry Zitnick. Generative Adversarial Transformers. In Proceedings of the 38th International Conference on Machine Learning, pages 4487–4499. PMLR, July 2021.
- Jia et al. [2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual Prompt Tuning. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, Lecture Notes in Computer Science, pages 709–727, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19827-4. doi: 10.1007/978-3-031-19827-4_41.
- Georgii [2012] Hans-Otto Georgii. Stochastics. In Stochastics. de Gruyter, 2012.
- Pratt and Gibbons [1981] John W. Pratt and Jean D. Gibbons. Concepts of Nonparametric Theory. Springer Series in Statistics. Springer, New York, NY, 1981. ISBN 978-1-4612-5933-6 978-1-4612-5931-2. doi: 10.1007/978-1-4612-5931-2.
- Deng et al. [2009] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, June 2009. doi: 10.1109/CVPR.2009.5206848.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Brock et al. [2019] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv:1809.11096 [cs, stat], February 2019.
- Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. ICLR 2015, January 2017.
- Hardt et al. [2016] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of Opportunity in Supervised Learning. arXiv:1610.02413 [cs], October 2016.
- Feldman et al. [2015] Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. arXiv:1412.3756 [cs, stat], July 2015.
- Chierichetti et al. [2017] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair Clustering Through Fairlets. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Celis et al. [2018] Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth Vishnoi. Fair and Diverse DPP-Based Data Summarization. In Proceedings of the 35th International Conference on Machine Learning, pages 716–725. PMLR, July 2018.
- Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. arXiv:1706.04599 [cs], August 2017.
- Platt et al. [1999] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
- Nyberg and Klami [2021] Otto Nyberg and Arto Klami. Reliably calibrated isotonic regression. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 578–589. Springer, 2021.
Supplementary Material
This supplementary provides additional experiments as well as details that are required to reproduce our results. These were not included in the main paper due to space limitations. The supplementary is arranged as follows:
• Section A: Details on Modelling
  – Section A.1 Details of Theoretical Modelling
  – Section A.2 Additional Details on CLEAM Algorithm
  – Section A.3 Details on Fairness Metric
  – Section A.4 Details of Significance of the Baseline Errors
• Section B: Deeper Analysis on Error in Fairness Measurement
• Section C: Validating Statistical Model for Classifier Output
  – Section C.1 Validation of Sample-Based Estimate vs Model-Based Estimate
  – Section C.2 Goodness-of-Fit Test: p̂ from the Real GANs with Our Theoretical Model
• Section D: Additional Experimental Results
  – Section D.1 Experimental Results with Standard Deviation
  – Section D.2 Experimental Setup for Diversity
  – Section D.3 Measuring Varying Degrees of Bias (Gender and BlackHair)
  – Section D.4 Measuring Varying Degrees of Bias with Additional Sensitive Attributes (Young and Attractive)
  – Section D.5 Measuring Varying Degrees of Bias with Additional Sensitive Attribute Classifiers (MobileNetV2)
  – Section D.6 Measuring SOTA GANs and Diffusion Models with Additional Classifier (CLIP)
  – Section D.7 Comparing Classifiers' Accuracy on Validation Dataset vs Generated Dataset
  – Section D.8 Comparing CLEAM with Classifier Correction Techniques (BBSE/BBSC)
  – Section D.9 Applying CLEAM to Re-evaluate Bias Mitigation Algorithms
• Section E: Details on Applying CLIP as a SA Classifier
• Section F: Ablation Study: Details of Hyper-Parameter Settings and Selection
• Section G: Related Work
• Section H: Details of the GenData: A New Dataset of Labeled Generated Images
• Section I: Limitations and Considerations
Appendix A Details on Modelling
A.1 Details of Theoretical Modelling
In Sec. 4.1 of the main paper, we proposed a statistical model for the sensitive attribute classifier output, which is then used in CLEAM to rectify the current measurement framework. In this section, we give more details of this model, which were not included in the main paper due to space constraints.
Recall that in the main paper, we mentioned that there are four possible mutually exclusive outcomes for each sample, with corresponding probability vector p':

p' = [ p*_0 α_0 ,  p*_0 (1 − α_0) ,  p*_1 α_1 ,  p*_1 (1 − α_1) ]
where e_{u→v} denotes the event of assigning label v to a sample with GT label u. Then, we mentioned that the counts for these outcomes can be modeled with a multinomial distribution, N ∼ Mult(n, p'). Note that N = [N_{0→0}, N_{0→1}, N_{1→1}, N_{1→0}] is the random vector of counts for the individual outcomes, where N_{u→v} is the random variable of the count for event e_{u→v} after classifying n generated images. First, we consider the following assumptions:
1. Classifiers are reasonably accurate. Given the advancements in classifier architectures, and the assumption that the sensitive attribute classifier is trained with proper training procedures, it is reasonable to assume that it achieves reasonable accuracy and hence α_0 > 0.5 and α_1 > 0.5. Similarly, we assume that it is highly unlikely to have a perfect classifier, and as such α_0 ≠ 1 and α_1 ≠ 1.
2. Generators are not completely biased. Given that a generator is trained on a reliable dataset with the availability of all classes of a given sensitive attribute, coupled with the advancement in generator architectures, it is a fair assumption that the generator learns some representation of each class of the sensitive attribute and is not completely biased; as such, p*_0 ≠ 0 and p*_0 ≠ 1.
Based on these assumptions, p' is not near the boundary of the parameter space, and we can conclude that the Gaussian approximation holds for large n. Therefore, we can approximate the multinomial distribution with a Gaussian, N ∼ N(n p', n Σ_N), with mean n p' and covariance n Σ_N [56], where Σ_N = diag(p') − p' p'ᵀ,
and therefore, writing p' = [p'_1, p'_2, p'_3, p'_4]:

Σ_N = [  p'_1 (1 − p'_1)   −p'_1 p'_2        −p'_1 p'_3        −p'_1 p'_4
        −p'_2 p'_1          p'_2 (1 − p'_2)  −p'_2 p'_3        −p'_2 p'_4
        −p'_3 p'_1         −p'_3 p'_2         p'_3 (1 − p'_3)  −p'_3 p'_4
        −p'_4 p'_1         −p'_4 p'_2        −p'_4 p'_3         p'_4 (1 − p'_4) ]
With this, we note that the marginal distribution of this multivariate Gaussian distribution gives us a univariate (one-dimensional) Gaussian distribution for the count of each outcome in N. For example, the distribution of the count for event e_{0→0}, denoted by N_{0→0}, can be modeled as N_{0→0} ∼ N( n p*_0 α_0 , n p*_0 α_0 (1 − p*_0 α_0) ). Then, we find the total rate of data points labeled as class 0 when classifying n generated images using the normalized sum of the related random variables, i.e. p̂_0 = (N_{0→0} + N_{1→0}) / n. More specifically:

p̂_0 ∼ N( μ_{p̂_0} , σ²_{p̂_0} ),   with   μ_{p̂_0} = p*_0 α_0 + (1 − p*_0)(1 − α_1)   and   σ²_{p̂_0} = μ_{p̂_0} (1 − μ_{p̂_0}) / n.
Remark: In Sec. 4.1 of the main paper, considering the probability tree diagram in Fig. 1(b) (main paper), we proposed a distribution for the possible classification events (e_{u→v}), and used it to compute the distribution of each event and, finally, the distribution of the output of the sensitive attribute classifier (p̂_0 and p̂_1). Here, we provide more information on the necessary assumptions and the expanded forms of the equations. In the following Sec. A.2, we similarly provide more information on the proposed CLEAM, presented in Sec. 4.2 of the main paper, which utilizes this statistical model to mitigate the sensitive attribute classifier's error.
A.2 Additional Details on CLEAM Algorithm
MLE value of Population Mean. In this section, we first discuss the maximum likelihood estimate (MLE) of the population mean for a Gaussian distribution. Given s i.i.d. samples x_1, …, x_s from a Gaussian distribution with population mean μ and standard deviation σ, we can first find the joint probability distribution from the product of each probabilistic outcome (we introduce the natural log as a monotonic function, for ease of calculation). Then, to find the MLE of μ, we take the partial derivative of this joint distribution w.r.t. μ and solve for its maximum. This maximum is attained at the sample mean, μ̂, as detailed below:

ln L(μ) = −(s/2) ln(2πσ²) − (1/(2σ²)) Σ_{j=1}^{s} (x_j − μ)²,
∂ ln L(μ) / ∂μ = (1/σ²) Σ_{j=1}^{s} (x_j − μ) = 0   ⇒   μ_MLE = (1/s) Σ_{j=1}^{s} x_j = μ̂.
Point Estimate of CLEAM. From this, given that s is sufficiently large, we utilize the sample mean μ̂ as the maximum likelihood approximation of the population mean. As the population mean was modeled in Sec. A.1, we can equate the sample mean to the expanded theoretical model:

μ̂ = p*_0 α_0 + (1 − p*_0)(1 − α_1).
Now, given that the classifier's accuracy α and the sample mean μ̂ can be measured, we are able to solve for the maximum likelihood point estimate of p*_0, which we denote by μ_CLEAM, as follows:

μ_CLEAM = ( μ̂ − (1 − α_1) ) / ( α_0 + α_1 − 1 ).
Note that we compute μ_CLEAM w.r.t. p*_0, i.e. μ_CLEAM,0, throughout this paper for ease of discussion; however, as p*_0 + p*_1 = 1, a similar estimate w.r.t. p*_1, i.e. μ_CLEAM,1, can be found with μ_CLEAM,1 = 1 − μ_CLEAM,0.
Interval Estimate of CLEAM. We acknowledge that, beyond the maximum likelihood point estimate of , there exist other statistically probable solutions for that could produce the observed samples. We thus propose the following approximation for the confidence interval of . Recall that and denote the sample mean and standard deviation, respectively:
Since follows a Gaussian distribution, we can propose the following equation:
Solving for , we get:
Then, given that we formulate the following:
(11)
As such when , we can determine that the approximated confidence interval of is :
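A rough sketch of how such an interval can be computed is shown below. The exact constants follow Eqn. 11; here, purely as an assumption for illustration, a 95% normal-approximation interval (z = 1.96) on the sample mean is mapped through the same linear inversion as the point estimate:

```python
import math

def cleam_interval_estimate(mu_hat, sigma_hat, s, alpha0, alpha1, z=1.96):
    # Assumed illustration: z = 1.96 gives a 95% normal interval on the sample mean;
    # the exact constants in the paper are defined by its Eqn. 11.
    half_width = z * sigma_hat / math.sqrt(s)
    lo = (mu_hat - half_width + alpha1 - 1.0) / (alpha0 + alpha1 - 1.0)
    hi = (mu_hat + half_width + alpha1 - 1.0) / (alpha0 + alpha1 - 1.0)
    return lo, hi

# e.g., mu_hat = 0.674, sigma_hat = 0.016 over s = 30 batches
print(cleam_interval_estimate(0.674, 0.016, 30, 0.95, 0.97))
```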
Extending the point estimate to a multi-label setup. We remark that in the current literature, fairness of generative models has been studied for binary sensitive attributes, mainly due to the lack of a large labeled dataset needed for systematic experimentation. As a result, CLEAM similarly focuses on binary SA to address a common flaw in the evaluation process of many proposed state-of-the-art methods.
Assuming that this dataset constraint is addressed, the same CLEAM approach can be easily extended to a multi-label setup. For example, consider a 3-class sensitive attribute where is the probability of generating a sample with label and denotes the probability (“accuracy”) of the SA classifier in classifying a sample with GT label as , for . Fig. 5 shows our statistical model for this setting. We can then similarly solve for the point estimate by solving the matrix:
(12)
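A minimal sketch of the corresponding multi-class inversion is given below, assuming an accuracy matrix A with A[j, i] denoting the probability that the SA classifier labels a GT class-i sample as class j (our notation):

```python
import numpy as np

def cleam_point_estimate_multiclass(p_hat_mean, A):
    """Solve p_hat_mean = A @ p_star for p_star, where
    p_hat_mean[j] : mean measured rate of output label j
    A[j, i]       : probability of labeling a GT class-i sample as class j
    """
    p_star = np.linalg.solve(A, p_hat_mean)
    p_star = np.clip(p_star, 0.0, None)   # guard against measurement noise
    return p_star / p_star.sum()

# Hypothetical 3-class example: columns of A sum to one
A = np.array([[0.90, 0.05, 0.05],
              [0.06, 0.92, 0.03],
              [0.04, 0.03, 0.92]])
p_star_true = np.array([0.5, 0.3, 0.2])
print(cleam_point_estimate_multiclass(A @ p_star_true, A))   # recovers ~[0.5, 0.3, 0.2]
```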

A.3 Details on Fairness Metric
Fairness in generative models is defined as equal representation, meaning that the generator is expected to generate an equal number of samples for each class of an attribute, e.g., an equal number of generated Male and Female samples when the sensitive attribute is Gender. Therefore, the expected distribution for a fair generator is usually a uniform distribution, denoted by . Considering this, the fairness discrepancy (FD) metric [1] measures the L2 norm between and the estimated class probability of the generator by each measurement method, i.e. , where {Base, CLEAM, Div}, as follows:
$\text{FD}_{\mu} = \lVert \bar{p} - \hat{p}^{*}_{\mu} \rVert_2, \quad \mu \in \{\text{Base, CLEAM, Div}\}$ (13)
Note that for a fair generator the fairness discrepancy would be zero, which also indicates zero bias.
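A minimal sketch of this metric (variable names ours), where p_est collects the estimated class probabilities returned by one of the measurement methods:

```python
import numpy as np

def fairness_discrepancy(p_est):
    """FD (Eqn. 13): L2 norm between the uniform target distribution and the
    estimated class probabilities of the generator."""
    p_est = np.asarray(p_est, dtype=float)
    p_bar = np.full(len(p_est), 1.0 / len(p_est))   # fair (uniform) target
    return float(np.linalg.norm(p_bar - p_est))

print(fairness_discrepancy([0.5, 0.5]))     # 0.0 for a perfectly fair generator
print(fairness_discrepancy([0.64, 0.36]))   # larger FD for a biased generator
```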
A.4 Details of Significance of the Baseline Errors
In the main manuscript (Sec. 3 of the main paper), we discussed that the relative improvements of previous fair generative models could be small, e.g., Teo et al. [2] and Um et al. [12] report relative improvements in fairness of 0.32% and 0.75% over imp-weighting [1], which fall within the range of our experiment’s smallest relative error, =4.98%. Here, we provide more detail on how we calculate the relative improvements in the main manuscript. Specifically, we calculate the relative change of the proposed work against the previous work as follows:
$\text{Relative Change} = \frac{\lvert p_{\text{prev}} - p_{\text{proposed}} \rvert}{p_{\text{prev}}} \times 100\%$ (14)
Notice that this is similar to of Eqn. 1 in the main paper. For example, Teo et al. [2] (Tab. 1, 90_10 and perc=0.1 settings) report that fairTL measures a , which is compared against the previous work’s (Choi et al. [1]) . Utilizing Eqn. 13, we find that these are equivalent to 0.4257 or 0.5743, and 0.4243 or 0.5757, respectively. We remark that we report two values per because the FD metric is symmetric. Then, applying Eqn. 14 and taking the maximum of the values, we find the relative improvement to be , at best. Note that, as mentioned in the main paper, for this setup the baseline measurement framework results in an error rate of (with the best-performing sensitive attribute classifier), meaning that it may not be reliable for gauging the improvement.
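The calculation above can be sketched as follows (our helper function; the two p values per method arise from the symmetry of the FD metric, and rounding of the intermediate p values can shift the last digit of the result):

```python
def relative_change(p_prev: float, p_new: float) -> float:
    """Relative change (Eqn. 14) between the class probabilities implied by the
    previous and proposed methods, in percent."""
    return abs(p_prev - p_new) / p_prev * 100.0

# Matching pairs of p values recovered from the two FD scores in the example above;
# the reported relative improvement is the maximum over the pairs.
pairs = [(0.4243, 0.4257), (0.5757, 0.5743)]
print(max(relative_change(p_prev, p_new) for p_prev, p_new in pairs))
```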
Appendix B Deeper Analysis on Error in Fairness Measurement
In Sec. 3 of the main paper, we discussed that there could be considerable error in the fairness measurement, , even though the sensitive attribute classifier’s accuracy is considerably high. Here, we develop this further and discuss two additional factors that can cause the error to vary. We remark that in the main manuscript, we report Diversity only using VGG-16, as specified by Keswani et al. [46]; see Sec. D.2 for further discussion.
Accuracy of Individual Classes () Impacts the Degree of Error. Notice that even when two sensitive attribute classifiers have very similar average accuracy, they can produce different degrees of error, e.g., R18 and R34 in Tab. 3. This is because the fairness measurement error depends not only on the average accuracy but also on the individual class accuracies, i.e., and . More specifically, given that there is a larger error in for R34 and the bias exists in , this results in a compounded effect and hence a larger error of =11.98% compared to R18’s =6.84%.
Uniform Inaccuracies at the Unbiased Test-Point (0.5). In our extended experiments in Sec. D.3 for a pseudo-generator, we discuss that for some sensitive attribute classifiers, e.g., ResNet-18 for Gender and BlackHair, the Baseline performs better than CLEAM at the unbiased test-point, i.e., . This is due to the Gender and BlackHair setups having a specific combination of (i) the pseudo-generator producing almost perfectly unbiased data with , and (ii) a sensitive attribute classifier with almost perfectly uniform inaccuracies , leading to uniform misclassification and hence the false impression of better accuracy by the Baseline method at (see Tab. 4 for the extracted results). To further illustrate this, notice how the ResNet-18 trained on Cat/Dog did not demonstrate this better Baseline performance, due to its non-uniform . Nevertheless, this situation whereby the Baseline outperforms CLEAM is specific to the test-point and does not impact the overall effectiveness of CLEAM. Furthermore, CLEAM still demonstrates outstanding results with low error for both the PE and IE at .
To further demonstrate these effects, we repeat the same experiment with the sensitive attributes Young and Attractive from the CelebA dataset. As seen in Tab. 5, Young and Attractive have similar average accuracies of and , but different values of 0.103 and 0.027, respectively. As such, we are able to investigate the effect that has on both CLEAM and the Baseline. We did not include Diversity in this study due to its poor performance on harder sensitive attributes, as discussed in Sec. D.2. From our results in Tab. 5, we observe that as increases from Attractive to Young, the error in the Baseline becomes much more significant: the average increases from to . Furthermore, unlike Gender and BlackHair, which have relatively negligible skew, Young and Attractive observe a significantly larger error at .
Point Estimate | Interval Estimate | |||||||||||||
Classifier | Avg. | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | |||||||
(A) StyleGAN2 | ||||||||||||||
BlackHair with GT class probability =0.643 | ||||||||||||||
R18 | {0.869, 0.885} | 0.88 | 0.599 | 6.84% | — | — | 0.641 | 0.31% | [0.591, 0.607] | 8.08% | — | — | [0.631, 0.652] | 1.40% |
R34 | {0.834, 0.916} | 0.88 | 0.566 | 11.98% | — | — | 0.644 | 0.16% | [0.561, 0.572] | 12.75% | — | — | [0.637, 0.651] | 1.24% |
Point Estimate | Interval Estimate | |||||||||||
GT | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | ||||||
=[0.976,0.979], Gender (CelebA) | ||||||||||||
=0.5 | 0.501 | 0.20% | 0.481 | 3.80% | 0.502 | 0.40% | [0.495 , 0.507 ] | 1.40% | [0.473 , 0.490 ] | 5.40% | [0.497, 0.508] | 1.60% |
=[0.881,0.887], BlackHair (CelebA) | ||||||||||||
=0.5 | 0.500 | 0.00% | 0.521 | 4.20% | 0.504 | 0.8% | [ 0.495 , 0.505 ] | 1.00% | [0.506 , 0.536 ] | 7.20% | [0.497, 0.511] | 2.20% |
=[0.953,0.990], Cat/Dog (AFHQ) |
=0.5 | 0.486 | 2.80% | 0.469 | 6.20% | 0.505 | 1.00% | [ 0.480 , 0.493 ] | 4.00% | [ 0.458, 0.480 ] | 8.40% | [ 0.498 , 0.511 ] | 2.20% |
Point Estimate | Interval Estimate | |||||||||||
GT | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | ||||||
=[0.749,0.852], Young | ||||||||||||
0.690 | 23.33% | — | — | 0.905 | 0.56% | [0.684,0.695] | 24.00% | — | — | [0.890,0.920] | 2.22% | |
0.630 | 21.25% | — | — | 0.804 | 0.50% | [0.625,0.635] | 21.88% | — | — | [0.795,0.813] | 1.63% | |
0.570 | 18.57% | — | — | 0.698 | 0.29% | [0.565,0.575] | 19.29% | — | — | [0.690,0.706] | 1.43% | |
0.510 | 15.00% | — | — | 0.595 | 0.83% | [0.505,0.515] | 15.83% | — | — | [0.590,0.600] | 1.67% | |
0.450 | 10.0% | — | — | 0.506 | 1.20% | [0.445,0.455] | 11.00% | — | — | [0.502,0.510] | 2.00% | |
Avg Error | 17.63% | —% | 0.68% | 18.40% | —% | 1.79% | ||||||
=[0.780,0.807], Attractive | ||||||||||||
0.730 | 18.89% | — | — | 0.908 | 0.89% | [0.724,0.736] | 19.56% | — | — | [0.900,0.916] | 1.78% | |
0.670 | 16.25% | — | — | 0.804 | 0.50% | [0.665,0.675] | 16.88% | — | — | [0.795,0.813] | 1.63% | |
0.600 | 14.29% | — | — | 0.696 | 0.57% | [0.594,0.606] | 15.14% | — | — | [0.690,0.712] | 1.71% | |
0.540 | 10.00% | — | — | 0.592 | 1.33% | [0.534,0.546] | 11.00% | — | — | [0.580,0.604] | 3.33% | |
0.480 | 4.00% | — | — | 0.493 | 1.40% | [0.475,0.485] | 5.00% | — | — | [0.487,0.499] | 2.60% | |
Avg Error | 12.69% | —% | 0.94% | 13.52% | —% | 2.22% |
Appendix C Validating Statistical Model for Classifier Output
C.1 Validation of Sample-Based Estimate vs Model-Based Estimate
As described in the main paper, we utilize the sample-based estimate, , as an approximation of the model-based estimate, . As discussed in Sec. A.2, allows us to find the maximum likelihood estimate of .
To validate this approximation, we utilize a ResNet-18 trained on Gender and BlackHair to compute . Then with the samples from the pseudo-generators with different (following Sec. D.3) we computed with a batch-size of and sample size . Finally, we calculate the sample-based estimates as given in Eqn. 6, 7 of the main paper. As the GT and classifier’s accuracy is known, we also calculate the model-based estimates as given in Eqn. 4, 5 of the main manuscript and compare it against the sample-based estimates.
Our results in Tab. 6 show that the sample and theoretical means and standard deviations closely approximate one another. Thus, we can utilise the sample statistics as a close approximation in our proposed method, CLEAM. Additional results for different batch sizes () and sample sizes () are tabulated in Tab. 7, 8 and 9. Notice that reducing and increases the error between the sample-based and model-based estimates, while making very large () results in the sample-based estimates almost perfectly approximating the model-based estimates.
GT | Sample-based estimates | Model-based estimates | |
---|---|---|---|---|
Gender, =[0.976,0.979] | ||||
0.881 | 0.0101 | 0.881 | 0.0106 | |
0.781 | 0.0133 | 0.785 | 0.0135 | |
0.692 | 0.0149 | 0.690 | 0.0152 | |
0.590 | 0.0165 | 0.594 | 0.0162 | |
0.503 | 0.0164 | 0.499 | 0.0164 | |
=[0.881,0.887], Black-Hair | ||||
0.802 | 0.0130 | 0.804 | 0.0139 | |
0.723 | 0.0151 | 0.727 | 0.0162 | |
0.653 | 0.0169 | 0.650 | 0.0177 | |
0.580 | 0.0180 | 0.574 | 0.0186 | |
0.502 | 0.0180 | 0.497 | 0.0189 |
GT | Sample-based estimates | Model-based estimates | |
---|---|---|---|---|
Gender, =[0.976,0.979] | ||||
0.855 | 0.0201 | 0.881 | 0.0106 | |
0.774 | 0.0211 | 0.785 | 0.0135 | |
0.672 | 0.0219 | 0.690 | 0.0152 | |
0.580 | 0.0181 | 0.594 | 0.0162 | |
0.510 | 0.0230 | 0.499 | 0.0164 | |
=[0.881,0.887], Black-Hair | ||||
0.768 | 0.180 | 0.804 | 0.0139 | |
0.712 | 0.210 | 0.727 | 0.0162 | |
0.658 | 0.190 | 0.650 | 0.0177 | |
0.554 | 0.230 | 0.574 | 0.0186 | |
0.508 | 0.242 | 0.497 | 0.0189 |
GT | Sample-based estimates | Model-based estimates | |
---|---|---|---|---|
Gender, =[0.976,0.979] | ||||
0.860 | 0.0232 | 0.881 | 0.0149 | |
0.780 | 0.0286 | 0.785 | 0.0191 | |
0.710 | 0.0294 | 0.690 | 0.0215 | |
0.578 | 0.0380 | 0.594 | 0.0228 | |
0.520 | 0.0321 | 0.499 | 0.0233 | |
=[0.881,0.887], Black-Hair | ||||
0.742 | 0.0312 | 0.804 | 0.0197 | |
0.740 | 0.0332 | 0.727 | 0.0229 | |
0.610 | 0.0291 | 0.650 | 0.0250 | |
0.582 | 0.350 | 0.574 | 0.0262 | |
0.542 | 0.388 | 0.497 | 0.0267 |
GT | Sample-based estimates | Model-based estimates | |
---|---|---|---|---|
Gender, =[0.976,0.979] | ||||
0.881 | 0.0104 | 0.881 | 0.0106 | |
0.784 | 0.0133 | 0.785 | 0.0135 | |
0.690 | 0.0153 | 0.690 | 0.0152 | |
0.594 | 0.0160 | 0.594 | 0.0162 | |
0.500 | 0.0164 | 0.499 | 0.0164 | |
=[0.881,0.887], Black-Hair | ||||
0.804 | 0.0137 | 0.804 | 0.0139 | |
0.726 | 0.0160 | 0.727 | 0.0162 | |
0.650 | 0.0179 | 0.650 | 0.0177 | |
0.573 | 0.0185 | 0.574 | 0.0186 | |
0.498 | 0.0191 | 0.497 | 0.0189 |
C.2 Goodness-of-Fit Test: from the Real GANs with Our Theoretical Model
To ensure that our proposed theoretical model in Eqn. 4 and Eqn. 5 of the main paper is also a good representation of the distribution when using a real generator, we perform a goodness-of-fit test between the proposed model for the distribution of and sample data generated by a GAN.
Model Type | Sensitive Attribute | |||
---|---|---|---|---|
StyleGAN2 | Gender | 0.1048 | 0.610 | 0.609 |
Blackhair | 0.1065 | 0.601 | 0.601 | |
StyleSwin | Gender | 0.1509 | 0.628 | 0.629 |
Blackhair | 0.1079 | 0.619 | 0.614 |
To do this, we first obtain values of from the framework shown in Fig. 1 of the main paper, using StyleGAN2 [3] and StyleSwin [4] as the generative models. Then, using a ResNet-18 with known and the GAN’s GT , as discussed in Sec. 4.1 of the main paper, we form the theoretical model’s Gaussian distribution, .
Now, with both our model distribution and the GAN samples, we utilise the Kolmogorov-Smirnov goodness-of-fit test (K-S test) to determine whether the sample distribution is statistically similar to the proposed Gaussian model. We thus propose the following hypothesis test for the samples :
The K-S test measures a D-statistic () and compares it against a for a given . As we use and a significance level of in our setup, we have . As seen in Tab. 10, all of the measured values are below ; thus, we cannot reject the null hypothesis at confidence with the K-S test. Therefore, we conclude that the distribution of the samples obtained from the framework (with GANs as the generator) is statistically similar to the proposed Gaussian distribution. As a result, we can utilise CLEAM to approximate the range in the presence of a real GAN as the generator.
We further perform a Quantile-Quantile (QQ) analysis for a more visual assessment. In particular, we plot the QQ plot between the samples (produced from the data generated by the GAN) and the proposed model. As seen in Fig. 6, the samples from the GAN correlate tightly with the standardised line (in red), which indicates a perfect correlation between theoretical and sample quantiles. This analysis supports our claim that the samples from a real generator (GAN) follow the distribution estimated by the proposed model.
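A minimal sketch of this check (our own, with placeholder values in place of the measured batch-level estimates and theoretical Gaussian parameters) is:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_hat = rng.normal(0.61, 0.015, size=30)   # placeholder for the s batch-level estimates
mu_model, sigma_model = 0.61, 0.015        # placeholder theoretical Gaussian parameters

# K-S test of the samples against the proposed Gaussian model
d_stat, p_value = stats.kstest(p_hat, "norm", args=(mu_model, sigma_model))
d_crit = 1.36 / np.sqrt(len(p_hat))        # large-sample critical value at alpha = 0.05
print(f"D = {d_stat:.3f}, D_crit = {d_crit:.3f}, p-value = {p_value:.3f}")
# Fail to reject H0 (samples follow the proposed Gaussian) when D < D_crit

# QQ plot of the standardized samples against a standard normal, as in Fig. 6
import matplotlib.pyplot as plt
stats.probplot((p_hat - mu_model) / sigma_model, dist="norm", plot=plt)
plt.show()
```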




Appendix D Additional Experiments
D.1 Experimental Results with Standard Deviation
In the main manuscript, we did not include the error bars of our experiments due to space constraints. Hence, in this section, we provide the full versions of Tab. 1 and 2 of the main manuscript with the standard deviation over 5 runs. Note that, generally, the standard deviation at each test point is relatively small and can be considered negligible. This is likely due to the large and utilized. As a result, we can utilize the mean results (as seen in the main manuscript) to compare CLEAM against Diversity and the Baseline.
Point Estimate | Interval Estimate | |||||||||||
Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | |||||||
(A) StyleGAN2 | ||||||||||||
Gender with GT class probability =0.642 | ||||||||||||
R18 | 0.610 0.004 | 4.98% | — | — | 0.638 0.006 | 0.62% | [0.602 0.004, 0.618 0.004] | 6.23% | — | — | [0.629 0.006, 0.646 0.006] | 2.02% |
R34 | 0.596 0.003 | 7.17% | — | — | 0.634 0.002 | 1.25% | [0.589 0.003, 0.599 0.003] | 8.26% | — | — | [0.628 0.002, 0.638 0.002] | 2.18% |
MN2 | 0.607 0.003 | 5.45% | — | — | 0.637 0.002 | 0.78% | [0.602 0.003, 0.612 0.003] | 6.23% | — | — | [0.632 0.002, 0.643 0.002] | 1.56% |
V16 | 0.532 0.007 | 17.13% | 0.550 0.011 | 14.3% | 0.636 0.007 | 0.93% | [0.526 0.007, 0.538 0.007] | 18.06% | [0.536 0.011 , 0.564 0.011] | 16.51% | [0.628 0.007, 0.644 0.007] | 2.18% |
Avg Error | 8.68% | 14.30% | 0.90% | 9.70% | 16.51% | 1.99% | ||||||
BlackHair with GT class probability =0.643 | ||||||||||||
R18 | 0.599 0.006 | 6.84% | — | — | 0.641 0.004 | 0.31% | [0.591 0.006, 0.607 0.005] | 8.08% | — | — | [0.631 0.004, 0.652 0.003] | 1.40% |
R34 | 0.566 0.007 | 11.98% | — | — | 0.644 0.008 | 0.16% | [0.561 0.007, 0.572 0.006] | 12.75% | — | — | [0.637 0.009, 0.651 0.008] | 1.24% |
MN2 | 0.579 0.007 | 9.95% | — | — | 0.639 0.007 | 0.62% | [0.574 0.008, 0.584 0.008] | 10.73% | — | — | [0.632 0.007, 0.647 0.007] | 1.71% |
V16 | 0.603 0.004 | 6.22% | 0.582 0.011 | 9.49% | 0.640 0.005 | 0.47% | [0.597 0.004, 0.608 0.003] | 7.15% | [0.568 0.010, 0.596 0.011] | 11.66% | [0.632 0.004, 0.648 0.005] | 1.71% |
Avg Error | 8.75% | 9.49% | 0.39% | 9.68% | 11.66% | 1.52% | ||||||
(B) StyleSwin | ||||||||||||
Gender with GT class probability =0.656 | ||||||||||||
R18 | 0.620 0.005 | 5.49% | — | — | 0.648 0.004 | 1.22% | [0.612 0.004,0.629 0.005] | 6.70% | — | — | [0.639 0.005,0.658 0.005] | 2.59% |
R34 | 0.610 0.002 | 7.01% | — | — | 0.649 0.005 | 1.07% | [0.605 0.003,0.615 0.003] | 7.77% | — | — | [0.643 0.006,0.654 0.006] | 1.98% |
MN2 | 0.623 0.008 | 5.03% | — | — | 0.655 0.005 | 0.15% | [0.618 0.007,0.629 0.007] | — | — | [0.649 0.006,0.661 0.006] | 1.07% | |
V16 | 0.555 0.004 | 15.39% | 0.562 0.015 | 14.33% | 0.668 0.006 | 1.83% | [0.549 0.004,0.560 0.004] | 16.31% | [0.548 0.014,0.576 0.014] | 16.46% | [0.660 0.007,0.675 0.007] | 2.90% |
Avg Error | 8.23% | 14.33% | 1.07% | 9.14% | 16.46% | 2.14% | ||||||
BlackHair with GT class probability =0.668 | ||||||||||||
R18 | 0.612 0.005 | 8.38% | — | — | 0.659 0.006 | 1.35% | [0.605 0.005,0.620 0.006] | 9.43% | — | — | [0.649 0.004,0.670 0.004] | 2.84% |
R34 | 0.581 0.006 | 13.02% | — | — | 0.662 0.006 | 0.90% | [0.576 0.005,0.586 0.006] | 13.77% | — | — | [0.656 0.005,0.669 0.005] | 1.80% |
MN2 | 0.596 0.006 | 10.78% | — | — | 0.659 0.005 | 1.35% | [0.591 0.006,0.600 0.007] | 11.50% | — | — | [0.652 0.005,0.666 0.005] | 2.40% |
V16 | 0.625 0.006 | 6.44% | 0.608 0.014 | 8.98% | 0.677 0.005 | 1.35% | [0.620 0.005,0.630 0.006] | 7.19% | [0.590 0.012,0.626 0.013] | 11.68% | [0.670 0.005,0.684 0.006] | 2.40% |
Avg Error | 9.66% | 8.98% | 1.24% | 10.47% | 11.68% | 2.36% |
Point Estimate | Interval Estimate | ||||||||
---|---|---|---|---|---|---|---|---|---|
Prompt | GT | Baseline | CLEAM (Ours) | Baseline | CLEAM (Ours) | ||||
=[0.998,0.975], Avg. =0.987, CLIP –Gender | |||||||||
"A photo with the face of an individual" | 0.186 | 0.203 0.011 | 9.14% | 0.187 0.11 | 0.05% | [ 0.198 0.10 , 0.208 0.10 ] | 11.83% | [ 0.182 0.10 , 0.192 0.10 ] | 3.23% |
"A photo with the face of a human being" | 0.262 | 0.277 0.10 | 5.73% | 0.263 0.10 | 0.38% | [ 0.270 0.10 , 0.285 0.10 ] | 8.78% | [ 0.255 0.10 , 0.271 0.10 ] | 3.44% |
"A photo with the face of one person" | 0.226 | 0.241 0.009 | 6.63% | 0.230 0.08 | 1.77% | [ 0.232 0.10 , 0.251 0.10 ] | 11.06% | [ 0.220 0.09 , 0.239 0.09 ] | 5.75% |
"A photo with the face of a person" | 0.548 | 0.556 0.12 | 1.49% | 0.548 0.11 | 0.00% | [ 0.545 0.11 , 0.566 0.11 ] | 3.28% | [ 0.537 0.11 , 0.558 0.11 ] | 2.01% |
Average Error | 5.75% | 0.44% | 8.74% | 3.61% |
D.2 Experimental Setup for Diversity[46]
In this section, we describe our experimental setup for Diversity [46], as utilized in the main paper. Recall that, as discussed by Keswani et al. [46], a VGG-16 [36] model pre-trained on ImageNet [58] is utilized as a feature extractor. This feature extractor is applied to both the unknown (generator’s) data and the controlled dataset. Finally, the unknown samples’ features are compared against the controlled ones via a similarity algorithm to compute the diversity, .
From our results in Fig. 7(a) (LHS), based on the pseudo-generator setup (discussed in more detail in Sec. D.3), we observe that the original implementation with VGG-16 trained on ImageNet works well on the Gender sensitive attribute. This is seen in the close approximation made by the proxy diversity score when compared against the GT diversity score evaluated with Eqn. 15, as per [46].
(15)
However, when evaluated on the harder BlackHair sensitive attribute, our results in Fig. 7(a) (RHS) show significant error between the GT diversity scores and the proxy diversity scores. This error was especially prevalent at larger biases, e.g., . We theorized that this was due to the difference between the domains of the feature extractor and the generated/controlled images, i.e., ImageNet versus CelebA/CelebA-HQ.
To verify this, we fine-tuned the VGG-16 model on the CelebA dataset with the respective sensitive attribute. We then removed the last fully connected layer of the classifier and utilised the 4096-dimensional feature vector for the diversity measurement, as per [46]. Our results in Fig. 7(b) demonstrate significant improvement on both Gender and BlackHair with the improved VGG-16 implementation. This verifies our intuition that there is a domain mismatch when the ImageNet-pretrained VGG-16 is used with CelebA samples.
However, upon further experimentation, we found that certain limitations still exist in the Diversity measure when used on more ambiguous and harder sensitive attributes, e.g., Young and Attractive. Similar to before, we fine-tuned the feature extractor, which achieved accuracies of and for Young and Attractive, respectively. However, even with this re-implementation, Diversity continued to perform poorly, as seen in Fig. 8.
Regardless, given the improvement seen on the BlackHair sensitive attribute, we utilized our improved VGG-16 feature extractor in the main paper, in place of the pre-trained VGG-16 (ImageNet).




D.3 Measuring Varying Degrees of Bias
CLEAM for Measuring Varying Degrees of Bias. In previous experiments, we showed the performance of different methods in measuring the fairness of generators and evaluating bias mitigation techniques. Another interesting analysis is how these methods fare under different degrees of bias, i.e., different values. A challenge of this analysis is that we cannot control the training dynamics of either the GANs or the Stable Diffusion Model to obtain an exact value of . Thus, we introduce a new setup and use a pseudo-generator instead of real GANs.
In this setup, we utilize the CelebA [47] and AFHQ [48] datasets to construct modified datasets that follow different values of w.r.t. the sensitive attribute; e.g., for the BlackHair attribute, when , the modified dataset contains 4880 BlackHair and 542 Non-BlackHair samples. A pseudo-generator with bias then works by randomly sampling from the corresponding dataset. Note that the samples in the modified dataset are unseen by the sensitive attribute classifier. For our experiment, we use different GT values, , where , and . For a pseudo-generator, to calculate each value of , a batch of samples is randomly drawn from the corresponding dataset and fed into the for classification. We utilize a ResNet-18 to evaluate our pseudo-generator. The results in Tab. 13 for demonstrate that CLEAM is effective for different degrees of bias, reducing the average error () of the Baseline from 1.43%0.27% and 6.23%0.49% for Gender and BlackHair on CelebA, respectively, and 3.52%0.75% for Cat/Dog on AFHQ. Additionally, note how the measurement error of the Baseline and Diversity increases with increasing data bias, while CLEAM’s remains consistently low. See Sec. D.4 and D.5 for analysis with more attributes and classifiers.
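A minimal sketch of how such a pseudo-generator can be constructed (our own construction with hypothetical variable names; the 4880/542 split corresponds to the BlackHair example above) is:

```python
import numpy as np

def make_pseudo_generator(class0_indices, class1_indices, p_star, pool_size=5422, seed=0):
    """Build a pseudo-generator with GT class-0 probability p_star by sampling
    from a ratio-controlled pool of held-out image indices."""
    rng = np.random.default_rng(seed)
    n0 = int(round(p_star * pool_size))     # e.g., 4880 BlackHair when p_star = 0.9
    n1 = pool_size - n0                     # e.g., 542 Non-BlackHair
    # assumes each class list holds enough held-out (classifier-unseen) samples
    pool = np.concatenate([rng.choice(class0_indices, n0, replace=False),
                           rng.choice(class1_indices, n1, replace=False)])

    def sample_batch(batch_size=400):       # batch size is a hypothetical value
        # a "generated" batch is simply a random draw from the controlled pool
        return rng.choice(pool, size=batch_size, replace=False)

    return sample_batch
```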
Point Estimate | Interval Estimate | |||||||||||
GT | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | ||||||
=[0.976,0.979], Gender (CelebA) | ||||||||||||
=0.9 | 0.880 | 2.22% | 0.950 | 5.55% | 0.899 | 0.11% | [0.876 , 0.884 ] | 2.67% | [0.913 , 0.986 ] | 9.56% | [0.895, 0.904] | 0.56% |
=0.8 | 0.783 | 2.10% | 0.785 | 1.88% | 0.798 | 0.25% | [ 0.778 , 0.788 ] | 2.75% | [0.762 , 0.809 ] | 4.75% | [0.794,0.803] | 0.75% |
=0.7 | 0.691 | 1.29% | 0.709 | 1.29% | 0.701 | 0.14% | [ 0.687 , 0.695 ] | 1.86% | [0.696 , 0.722 ] | 3.14% | [0.697, 0.707] | 0.10% |
=0.6 | 0.592 | 1.33% | 0.591 | 1.50% | 0.597 | 0.50% | [0.586 , 0.598 ] | 2.33% | [0.581 , 0.612 ] | 3.17% | [0.591,0.603] | 1.50% |
=0.5 | 0.501 | 0.20% | 0.481 | 3.80% | 0.502 | 0.40% | [0.495 , 0.507 ] | 1.40% | [0.473 , 0.490 ] | 5.40% | [0.497, 0.508] | 1.60% |
Average Error: | 1.43% | 2.80% | 0.27% | |||||||||
=[0.881,0.887], BlackHair (CelebA) | ||||||||||||
=0.9 | 0.803 | 10.77% | 0.803 | 10.77% | 0.899 | 0.11% | [ 0.800 , 0.806 ] | 11.11% | 12.11% | [0.893, 0.905] | 0.78% | |
=0.8 | 0.723 | 9.63% | 0.699 | 12.63% | 0.796 | 0.50% | [0.719 , 0.727 ] | 10.13% | [0.686 , 0.713 ] | 14.25% | [0.790, 0.803] | 1.25% |
=0.7 | 0.654 | 6.57% | 0.661 | 5.57% | 0.705 | 0.71% | [ 0.648 , 0.660 ] | 7.43% | [ 0.643 , 0.68 ] | 8.14% | [0.698, 0.712] | 1.71% |
=0.6 | 0.575 | 4.17% | 0.609 | 1.50% | 0.602 | 0.33% | [ 0.564 , 0.586 ] | 6.00% | [0.604 , 0.614 ] | 2.30% | [0.599, 0.606] | 1.00% |
=0.5 | 0.500 | 0.00% | 0.521 | 4.20% | 0.504 | 0.8% | [ 0.495 , 0.505 ] | 1.00% | [0.506 , 0.536 ] | 7.20% | [0.497, 0.511] | 2.20% |
Average Error: | 6.23% | 6.93% | 0.49% | 7.13% | 8.80% | 1.39% | ||||||
=[0.953,0.990], Cat/Dog (AFHQ) |
=0.9 | 0.862 | 4.44% | 0.855 | 5.00% | 0.903 | 0.33% | [ 0.859 , 0.865 ] | 4.56% | [ 0.844 , 0.866 ] | 6.22% | [ 0.900 , 0.907 ] | 0.78% |
=0.8 | 0.766 | 4.25% | 0.774 | 3.25% | 0.802 | 0.25% | [ 0.762 , 0.771 ] | 4.75% | [ 0.765 , 0.784 ] | 4.38% | [ 0.797 , 0.807 ] | 0.88% |
=0.7 | 0.677 | 3.29% | 0.670 | 4.29% | 0.707 | 1.00% | [ 0.672 , 0.682 ] | 4.00% | [ 0.655, 0.686 ] | 6.43% | [ 0.701 , 0.712 ] | 1.71% |
=0.6 | 0.583 | 2.83% | 0.551 | 8.17% | 0.607 | 1.17% | [ 0.578 , 0.588 ] | 3.67% | [ 0.540, 0.562 ] | 10.00% | [ 0.602 , 0.613 ] | 2.17% |
=0.5 | 0.486 | 2.80% | 0.469 | 6.20% | 0.505 | 1.00% | [ 0.480 , 0.493 ] | 4.00% | [ 0.458, 0.480 ] | 8.40% | [ 0.498 , 0.511 ] | 2.20% |
Average Error: | 3.52% | 5.38% | 0.75% | 4.20% | 7.09% | 1.55% |
D.4 Measuring Varying Degrees of Bias with Additional Sensitive Attributes
In Sec. D.3, we demonstrated CLEAM’s ability to improve accuracy in approximating for the sensitive attributes Gender and BlackHair. In this section, we extend the experiment on the CelebA dataset to harder (lower ) sensitive attributes, i.e., Young and Attractive. We did not include Diversity in this study, due to its poor performance on harder sensitive attributes, as discussed in Sec. D.2.
From our results in Tab. 14, both the Young and Attractive classifiers have relatively large Baseline errors (), i.e., 17.63% and 12.69% on average, respectively. Utilizing CLEAM, even with these harder sensitive attributes, we are able to significantly reduce the errors to and . See Sec. B for more details on the effect that different degrees of inaccuracy in have on the Baseline error.
Point Estimate | Interval Estimate | |||||||||||
GT | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | ||||||
=[0.749,0.852], Young | ||||||||||||
0.690 | 23.33% | — | — | 0.905 | 0.56% | [0.684,0.695] | 24.00% | — | — | [0.890,0.920] | 2.22% | |
0.630 | 21.25% | — | — | 0.804 | 0.50% | [0.625,0.635] | 21.88% | — | — | [0.795,0.813] | 1.63% | |
0.570 | 18.57% | — | — | 0.698 | 0.29% | [0.565,0.575] | 19.29% | — | — | [0.690,0.706] | 1.43% | |
0.510 | 15.00% | — | — | 0.595 | 0.83% | [0.505,0.515] | 15.83% | — | — | [0.590,0.600] | 1.67% | |
0.450 | 10.0% | — | — | 0.506 | 1.20% | [0.445,0.455] | 11.00% | — | — | [0.502,0.510] | 2.00% | |
Avg Error | 17.63% | —% | 0.68% | 18.40% | —% | 1.79% | ||||||
=[0.780,0.807], Attractive | ||||||||||||
0.730 | 18.89% | — | — | 0.908 | 0.89% | [0.724,0.736] | 19.56% | — | — | [0.900,0.916] | 1.78% | |
0.670 | 16.25% | — | — | 0.804 | 0.50% | [0.665,0.675] | 16.88% | — | — | [0.795,0.813] | 1.63% | |
0.600 | 14.29% | — | — | 0.696 | 0.57% | [0.594,0.606] | 15.14% | — | — | [0.690,0.712] | 1.71% | |
0.540 | 10.00% | — | — | 0.592 | 1.33% | [0.534,0.546] | 11.00% | — | — | [0.580,0.604] | 3.33% | |
0.480 | 4.00% | — | — | 0.493 | 1.40% | [0.475,0.485] | 5.00% | — | — | [0.487,0.499] | 2.60% | |
Avg Error | 12.69% | —% | 0.94% | 13.52% | —% | 2.22% |
D.5 Measuring Varying Degrees of Bias with Additional Sensitive Attribute Classifier
In this section, we validate CLEAM’s versatility with different sensitive attribute classifier architectures. In our setup, we utilise MobileNetV2 [35], as in [7]. Then, similar to Sec. D.3, we utilize a pseudo-generator with known GT for the Gender and BlackHair sensitive attributes from the CelebA [47] dataset, and Cat/Dog from the AFHQ [48] dataset, to evaluate CLEAM’s effectiveness at determining bias.
As seen in our results in Tab. 15, MobileNetV2 achieves reasonably high average accuracy, in the range [0.889, 0.983]. Then, when evaluating of the pseudo-generator, we observe behavior similar to the ResNet-18 discussed in Sec. D.3. In particular, we observe a significantly large Baseline of , and for the Gender, BlackHair and Cat/Dog sensitive attributes, respectively, whereas CLEAM reports an of , and , respectively. The same trend can be observed in the IE. We thus demonstrate CLEAM’s versatility and its ability to be deployed as a post-processing method (without retraining) on models of varying architectures.
Point Estimate | Interval Estimate | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
GT | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | ||||||
=[0.980,0.986], Gender (CelebA) | ||||||||||||
0.882 | 2.00% | 0.950 | 5.55% | 0.899 | 0.11% | [ 0.879 , 0.885 ] | 2.33% | [0.913 , 0.986 ] | 9.56% | [0.895,0.902] | 0.56% | |
0.786 | 1.75% | 0.785 | 1.88% | 0.800 | [ 0.782 , 0.790 ] | 2.25% | [0.762 , 0.809 ] | 4.75% | [0.794,0.804] | 0.75% | ||
0.689 | 1.57% | 0.709 | 1.30% | 0.699 | 0.14% | [ 0.685, 0.693 ] | 2.14% | [0.696 , 0.722 ] | 3.14% | [0.694,0.704] | 0.86% | |
0.593 | 1.17% | 0.591 | 1.50% | 0.600 | 0.00% | [ 0.585 , 0.597 ] | 2.50% | [0.581 , 0.612 ] | 3.17% | [0.594,0.605] | 1.00% |
0.497 | 0.60% | 0.481 | 3.80% | 0.502 | 0.40% | [ 0.491 , 0.502 ] | 1.80% | [0.473 , 0.490 ] | 5.40% | [0.495,0.507] | 1.40% |
Avg Error | 1.42% | 2.81% | 0.13% | 2.20% | 5.20% | 0.91% | ||||||
=[0.861,0.916], BlackHair (CelebA) | ||||||||||||
0.782 | 13.11% | 0.803 | 10.78% | 0.899 | 0.11% | [ 0.777 , 0.787 ] | 13.67% | [0.791 , 0.815 ] | 9.44% | [0.893,0.900] | 0.78% | |
0.705 | 11.88% | 0.699 | 12.63% | 0.800 | 0.00% | [ 0.699 , 0.710 ] | 12.63% | [0.686 , 0.713 ] | 14.25% | [0.793,0.807] | ||
0.623 | 11.00% | 0.661 | 5.56% | 0.700 | 0.00% | [ 0.618 , 0.628 ] | 11.71% | [ 0.643 , 0.68 ] | 8.14% | [0.694,0.706] | 0.86% | |
0.550 | 8.33% | 0.609 | 1.50% | 0.600 | 0.00% | [ 0.544 , 0.556 ] | 9.33% | [0.604 , 0.614 ] | 2.33% | [0.593,0.608] | 1.17% | |
0.478 | 4.40% | 0.521 | 4.20% | 0.506 | 1.20% | [ 0.472 , 0.484 ] | 5.60% | [0.506 , 0.536 ] | 7.20% | [0.498,0.514] | 2.80% | |
Avg Error | 9.74% | 6.93% | 0.70% | |||||||||
=[0.964,0.897], Cat/Dog (AFHQ) | ||||||||||||
0.875 | 2.77% | 0.880 | 3.26% | 0.897 | 0.34% | [ 0.872 , 0.878 ] | 3.07% | [0.871 , 0.890] | 3.25% | [ 0.894 , 0.900 ] | 0.68% | |
0.784 | 2.00% | 0.770 | 3.75% | 0.791 | 1.11% | [ 0.780 , 0.788 ] | 2.53% | [0.759 , 0.781 ] | 5.12% | [ 0.786 , 0.796 ] | 0.42% | |
0.704 | 0.62% | 0.692 | 1.08% | 0.698 | 0.20% | [ 0.700 , 0.708 ] | 1.19% | [ 0.684, 0.709 ] | 2.40% | [ 0.694 , 0.703 ] | 0.86% | |
0.617 | 2.78% | 0.611 | 1.83% | 0.597 | 0.54% | [ 0.611 , 0.622 ] | 2.78% | [0.602 , 0.620 ] | 3.42% | [ 0.591 , 0.603 ] | 1.58% | |
0.529 | 5.87% | 0.536 | 7.20% | 0.495 | 0.93% | [ 0.523 , 0.536 ] | 7.17% | [0.524 , 0.548 ] | 9.68% | [ 0.488 , 0.503 ] | ||
Avg Error | 2.81% | 3.42% | 0.62% | 3.35% | 4.77% | 1.20% |
D.6 Measuring SOTA GANs and Diffusion Models with Additional Classifier
In this section, we further explore the utilization of CLIP as a sensitive attribute classifier; more details on CLIP are given in Sec. E. Here, we follow the setup in Sec. 5.1 of our main manuscript to measure the bias in GenData-StyleGAN2 and GenData-StyleSwin w.r.t. Gender. Additionally, we evaluate a publicly available pre-trained Latent Diffusion Model (LDM) [59] on FFHQ [3], where we acquire the GT w.r.t. Gender with the same procedure as GenData.
Our results in Tab. 16 and 17 show that the Baseline achieves reasonable accuracy in estimating the GT . This is because CLIP’s accuracy is very high (=0.998) on the biased class () for StyleGAN2, StyleSwin and the LDM, resulting in fewer misclassifications. Regardless, CLEAM still further improves on the already accurate Baseline, reducing the error from , on StyleGAN2, StyleSwin and the LDM, to . A similar trend can be observed in the IE, which bounds the GT .
Point Estimate | Interval Estimate | |||||||||||||
Avg. | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | ||||||||
(A) StyleGAN2 | ||||||||||||||
Gender with GT class probability =0.642 | ||||||||||||||
CLIP | {0.998, 0.975} | 0.987 | 0.653 | 1.71% | — | — | 0.645 | 0.47% | [0.649, 0.657] | 2.34% | — | — | [0.641, 0.649] | 1.09% |
(B) StyleSwin | ||||||||||||||
Gender with GT class probability =0.656 | ||||||||||||||
CLIP | {0.998, 0.975} | 0.987 | 0.666 | 0.91% | — | — | 0.658 | 0.30% | [0.663,0.669] | 1.98% | — | — | [0.655,0.662] | 0.91% |
Point Estimate | Interval Estimate | |||||||||||||
Avg. | Baseline | Diversity | CLEAM (Ours) | Baseline | Diversity | CLEAM (Ours) | ||||||||
Latent Diffusion Model | ||||||||||||||
Gender with GT class probability =0.570 | ||||||||||||||
CLIP | {0.998, 0.975} | 0.987 | 0.585 | 2.63% | — | — | 0.571 | 0.18% | [0.578, 0.593] | 4.04% | — | — | [0.564, 0.579] | 1.58% |
D.7 Comparing Classifiers Accuracy on Validation Dataset vs Generated Dataset
In our proposed CLEAM, we use pre-measured on the validation dataset, denoted by . In this section, we show that is a good approximation of the measured on the generated data, denoted by . Note that is not available in practice; therefore, is used as an approximation during CLEAM measurement. We further remark that this experiment is only to validate as a good approximation of and is not necessary in the actual deployment of CLEAM.
Comparing vs on GANs. To validate this, we utilize our newly introduced generated dataset with known labels, measure the for both Gender and BlackHair on StyleGAN2 and StyleSwin, and compare it against the . The results in Tab. 18 show that is a good approximation of the of the generated dataset.
In addition, in Tab. 19, we further compare the effect of utilizing as opposed to with CLEAM for the sensitive attribute Gender on StyleGAN2 from the GenData dataset. Overall, we observe only marginal improvements, in most cases, when utilizing . Additionally, as the improvements by CLEAM remain very significant when utilizing , and as the labels for the generated dataset are not readily available to evaluate , we find to be a good approximation of for fairness measurement.
Comparing vs on SDM. Similarly, when evaluating the SDM with CLEAM, we utilize in place of . However, as a validation dataset is not readily available for the SDM, we explored the use of a proxy validation dataset whose domain is a close representation of our application. More specifically, we utilize CelebA-HQ as our proxy validation dataset (with known labels w.r.t. Gender) to attain . Then, similarly, we compare to from our labelled GenData-SDM (per prompt). As shown in Tab. 20, our approximated (measured on CelebA-HQ), although not perfect, is a close approximation of , thereby making it appropriate to be utilized with CLEAM.
StyleGAN2 | StyleSwin |
---|---|---|---|---|---|---|---|---|
ResNet18 | ResNet34 | MobileNetv2 | VGG16 | ResNet18 | ResNet34 | MobileNetv2 | VGG16 | |
Gender | ||||||||
Validated | [0.947,0.983] | [0.932,0.976] | [0.938, 0.975] | [0.801,0.919] | [0.947,0.983] | [0.932,0.976] | [0.958, 0.975] | [0.801,0.919] |
[0.940,0.984] | [0.928,0.982] | [0.948, 0.985] | [0.815,0.922] | [0.957,0.966] | [0.944,0.981] | [0.956, 0.977] | [0.804,0.924] | |
Blackhair | ||||||||
Validated | [0.869,0.884] | [0.834,0.919] | [0.839,0.880] | [0.850,0.836] | [0.869,0.884] | [0.834,0.919] | [0.839,0.880] | [0.850,0.836] |
[0.870,0,885] | [0.830,0.914] | [0.845,0.886] | [0.837,0.824] | [0.874,0.892] | [0.824,0.930] | [0.837,0.891] | [0.847,0.821] |
Point Estimate | Interval Estimate | |||||||||||
Classifier | Baseline[1] | CLEAM (Ours) with | CLEAM (Ours) with | Baseline[1] | CLEAM (Ours) with | CLEAM (Ours) with | ||||||
(A) StyleGAN2 | ||||||||||||
Gender with GT class probability =0.642 | ||||||||||||
R18 | 0.610 | 4.98% | 0.638 | 0.62% | 0.639 | 0.44% | [0.602, 0.618] | 6.23% | [0.629, 0.646] | 2.02% | [0.629, 0.648] | 2.02% |
R34 | 0.596 | 7.17% | 0.634 | 1.25% | 0.635 | 1.06% | [0.589, 0.599] | 8.26% | [0.628, 0.638] | 2.18% | [0.630, 0.640] | 1.87% |
MN2 | 0.607 | 5.45% | 0.637 | 0.78% | 0.636 | 0.86% | [0.602, 0.612] | 6.23% | [0.632, 0.643] | 1.56% | [0.630, 0.642] | 1.82% |
V16 | 0.532 | 17.13% | 0.636 | 0.93% | 0.640 | 0.36% | [0.526, 0.538] | 18.06% | [0.628, 0.644] | 2.18% | [0.632, 0.647] | 1.53% |
Avg Error | 8.68% | 0.90% | 0.68% | 9.70% | 1.99% | 1.81% |
Dataset | Stable Diffusion Model | |||||
---|---|---|---|---|---|---|
CelebA-HQ | "Somebody" | "an individual" | "a human being" | "a person" | "one person" | |
[0.998,0.975] | [1.0,0.970] | [1.0,0.980] | [1.0,0.970] | [0.990, 0.970] | [1.0, 0.980] |
D.8 Comparing CLEAM with Classifier Correction Techniques (BBSE/BBSC)
In this section, we compare CLEAM against a few classifier correction techniques. We remark that CLEAM, unlike the classifier correction techniques, does not aim to improve the sensitive attribute classifier’s accuracy but instead accounts for its errors during fairness measurement. However, given that classifier correction techniques may improve bias measurement, we found it useful to make a comparison. Specifically, we look into Black-Box shift estimator/correction (BBSE/BBSC) [49], methods previously proposed to address classifier inaccuracies due to label shift. We demonstrate that even with BBSE/BBSC, errors in bias measurement still remain significant.
Setup. To determine the effectiveness of BBSE/BBSC in tackling the errors in fairness measurement of generative models, we evaluate them on the same setup as Sec. 5.1 of the main manuscript, on GenData-StyleGAN and GenData-StyleSwin with ResNet-18. Specifically, for BBSE, we follow Lipton et al. [49] and first evaluate the confusion matrix of the trained ResNet-18 on the validation dataset. Then, utilizing the confusion matrix, we calculate the weight vector that accounts for the label shift of the generated data. With this weight vector, we implement a variant of CLEAM utilizing Algo. 1 of the main manuscript (with the weight vector in place of ) to evaluate the PE and IE. Similarly, for BBSC, we calculate the weight vector; however, unlike BBSE, we use it to fine-tune the classifier on the generated samples [49].
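A rough sketch of the BBSE weight estimation (our own, following the general formulation in Lipton et al. [49]; details of our actual implementation may differ) is:

```python
import numpy as np

def bbse_weights(val_gt, val_pred, gen_pred, num_classes=2):
    """val_gt, val_pred : GT and predicted labels on the validation set
       gen_pred         : predicted labels on the generated (target) samples"""
    # Joint confusion matrix C[i, j] = P(pred = i, GT = j) on the validation set
    C = np.zeros((num_classes, num_classes))
    for pred, gt in zip(val_pred, val_gt):
        C[pred, gt] += 1.0
    C /= len(val_gt)

    # Predicted-label distribution on the generated data
    mu_hat = np.bincount(gen_pred, minlength=num_classes) / len(gen_pred)

    # BBSE estimate of the label-shift importance weights: w = C^{-1} mu_hat
    return np.linalg.solve(C, mu_hat)
```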
Our results in Tab. 21 show that BBSE marginally reduces and for the PE and IE when compared against the Baseline. However, these results remain poor compared to our original CLEAM implementation. One reason for this difference may be that, unlike CLEAM, which is agnostic to the cause of the error, BBSE specifically corrects for label shift while neglecting other sources of error, e.g., task hardness. Meanwhile, our results in Tab. 22 show that BBSC makes no improvement over the Baseline fairness measurements. We hypothesize that this is due to the strong assumption of an invariant conditional input distribution used in BBSC, which may not hold in our problem. Overall, we conclude that while classifier correction techniques may improve fairness measurements in some cases, they may not always generalize, as they are often tailored to a specific problem.
Point Estimate | Interval Estimate | ||||||||||
Baseline | BBSE | CLEAM (Ours) | Baseline | BBSE | CLEAM (Ours) | ||||||
(A) StyleGAN2 | |||||||||||
Gender with GT class probability =0.642 | |||||||||||
0.610 | 4.98% | 0.621 | 3.38% | 0.638 | 0.62% | [0.602,0.618] | 6.23% | [0.613,0.628] | 4.52% | [0.629,0.646] | 2.02% |
BlackHair with GT class probability =0.643 | |||||||||||
0.599 | 6.48% | 0.630 | 2.02% | 0.641 | 0.31% | [0.591,0.607] | 8.08% | [0.622,0.638] | 3.27% | [0.631,0.652] | 1.40% |
(B) StyleSwin | |||||||||||
Gender with GT class probability =0.656 | |||||||||||
0.620 | 5.49% | 0.628 | 4.27% | 0.648 | 1.22% | [0.612,0.629] | 6.70% | [0.620,0.634] | 5.49% | [0.639,0.658] | 2.59% |
BlackHair with GT class probability =0.668 | |||||||||||
0.612 | 8.38% | 0.640 | 4.20% | 0.659 | 1.35% | [0.605,0.620] | 9.43% | [0.633,0.647] | 5.24% | [0.649,0.670] | 2.84% |
Point Estimate | Interval Estimate | |||||||||
Setup | Avg | Baseline | CLEAM (Ours) | Baseline | CLEAM (Ours) | |||||
(A) StyleGAN2 | ||||||||||
Gender with GT class probability =0.642 | ||||||||||
Original Classifier | {0.947,0.983} | 0.97 | 0.610 | 4.98% | 0.638 | 0.62% | [0.602,0.618] | 6.23% | [0.629,0.646] | 2.02% |
Adapted Classifier w. BBSC | {0.932,0.980} | 0.96 | 0.609 | 5.28% | 0.645 | 0.46% | [0.601,0.616] | 6.53% | [0.635,0.655] | 2.02% |
BlackHair with GT class probability =0.643 | ||||||||||
Original Classifier | {0.869,0.885} | 0.88 | 0.599 | 6.48% | 0.641 | 0.31% | [0.591,0.607] | 8.08% | [0.631,0.652] | 1.40% |
Adapted Classifier w. BBSC | {0.854,0.875} | 0.86 | 0.588 | 8.55% | 0.635 | 1.24% | [0.581,0.596] | 9.64% | [0.627,0.643] | 2.49% |
(B) StyleSwin | ||||||||||
Gender with GT class probability =0.656 | ||||||||||
Original Classifier | {0.947,0.983} | 0.97 | 0.620 | 5.49% | 0.648 | 1.22% | [0.612,0.629] | 6.70% | [0.639,0.658] | 2.59% |
Adapted Classifier w. BBSC | {0.932,0.980} | 0.96 | 0.617 | 5.94% | 0.655 | 0.15% | [0.610,0.614] | 7.01% | [0.649,0.661] | 1.06% |
BlackHair with GT class probability =0.668 | ||||||||||
Original Classifier | {0.869,0.885} | 0.88 | 0.612 | 8.38% | 0.659 | 1.35% | [0.605,0.620] | 9.43% | [0.649,0.670] | 2.84% |
Adapted Classifier w. BBSC | {0.854,0.875} | 0.86 | 0.608 | 8.98% | 0.663 | 0.75% | [0.600,0.616] | 10.18% | [0.655,0.671] | 1.95% |
D.9 Applying CLEAM to Re-evaluate Bias Mitigation Algorithms
Importance-weighting [1] is a simple and effective method for bias mitigation. However, its performance in fairness improvement is measured by the Baseline, which could be erroneous. In this section, we re-evaluate the performance of importance-weighting with CLEAM, which has shown better accuracy in fairness estimation.
Following Choi et al. [1], we utilize the original source code to train two BigGANs [60] on CelebA [47]: the first GAN is trained without any bias mitigation (Unweighted), while for the second we apply importance re-weighting (Weighted). We do this for the originally proposed sensitive attribute Gender, and extend the experiment to BlackHair. For a fair comparison, we follow [1] and similarly use a ResNet-18 with reasonably high average accuracies of and for the sensitive attributes BlackHair and Gender. Our results in Tab. 23 show that the Baseline measures a of and for Unweighted and Weighted with SA Gender (similar to the results reported in [1]). Meanwhile, CLEAM’s results show that , implying that previous work could have underestimated the bias of the GANs. This could lead to an erroneous evaluation of a bias mitigation technique, or of a comparison across different bias mitigation techniques. Then, when analyzing bias mitigation techniques using the IE of CLEAM (as per Tab. 24), since the IEs of the unweighted and weighted GANs do not overlap, we have some statistical assurance that the bias mitigation technique, importance-weighting, is indeed effective.
Setup | Baseline | Diversity | CLEAM (Ours) |
=[0.976,0.979], Gender | |||
Unweighted | 0.727 | 0.711 | 0.738 |
Weighted | 0.680 | 0.671 | 0.690 |
=[0.881,0.887], BlackHair | |||
Unweighted | 0.729 | 0.716 | 0.803 |
Weighted | 0.716 | 0.706 | 0.785 |
Setup | Baseline | Diversity | CLEAM(Ours) |
=[0.976,0.979], Gender | |||
Unweighted | [] | [] | [] |
Weighted | [] | [] | [] |
=[0.881,0.887], BlackHair | |||
Unweighted | [] | [] | [] |
Weighted | [] | [] | [] |
Appendix E Details on Applying CLIP as a SA Classifier
CLIP as a Sensitive Attribute Classifier. To utilize CLIP as a sensitive attribute classifier (with the ViT-B/32 architecture), we follow the best practices suggested by Radford et al. [6]. Here, we first input two different prompts, describing the respective classes, to the CLIP text encoder, as seen in Tab. 25. As suggested by Radford et al., we utilize a prompt starting with "A photo of a", i.e., a scene description, followed by our sensitive attribute’s classes, e.g., female/male. Next, we encode the generated images with the CLIP image encoder. Finally, for each encoded generated image and the two encoded text prompts, we take the cosine similarities followed by the . The output provides us with the hard label of the generated image.
Generated Image Pre-processing. We remark that, as the Stable Diffusion Model produces a mixture of colored and greyscale images, for a fair comparison we transform all images from RGB to greyscale before feeding them into CLIP for classification.
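A minimal sketch of this zero-shot classification procedure (our own, using the OpenAI CLIP package with the Gender prompts from Tab. 25) is:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["A photo of a female", "A photo of a male"]   # class 0 / class 1 (Gender)
text_tokens = clip.tokenize(prompts).to(device)

def classify(image_path):
    # Convert to greyscale (then back to 3 channels) for a fair comparison, as noted above
    image = Image.open(image_path).convert("L").convert("RGB")
    image = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text_tokens)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)     # cosine similarities
    return int(sims.argmax())                            # hard label via argmax
```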
Sensitive Attribute | Class 0 prompt | Class 1 prompt |
---|---|---|
Gender | "A photo of a female" | "A photo of a male" |
Smiling | "A photo of a face not smiling" | "A photo of a face smiling" |
Appendix F Ablation Study: Details of Hyper-Parameter Settings and Selection
Sensitive Attribute Classifier .
In our experiments, we utilized a ResNet-18/34 [34], MobileNetV2 [35] and VGG-16. The respective datasets, i.e., CelebA [47], CelebA-HQ [27] and AFHQ [48], are segmented into {Train, Test, Validation} splits with the ratio {80%, 10%, 10%}, where each split contains a uniform distribution w.r.t. the queried sensitive attribute. The classifiers are trained on the training split, and the are evaluated on the validation split. Each classifier is trained with an Adam optimizer [61] with learning rate=, batch size= and input dim=x for the CelebA dataset [47], dim=x for the AFHQ dataset and dim=x for the CelebA-HQ dataset [27]. Tab. 26 details the of the ResNet-18 utilized in Sec. 6 of our main manuscript.
Sensitive Attribute | Accuracy, |
---|---|
NoBeard | [0.968,0.898] |
HeavyMakeup | [0.925,0.883] |
Bald | [0.930,0.972] |
Chubby | [0.838,0.933]
Mustache | [0.925,0.896] |
Smiling | [0.933, 0.877] |
Young | [0.871, 0.857] |
BlackHair | [0.869,0.885] |
Gender | [0.947,0.983] |
Generator used in sec.D.9.
As mentioned in Sec. D.9, we utilized the setup in Choi et al. [1] (https://github.com/ermongroup/fairgen) for training our imp-weighted and unweighted GANs. With this, we replicate their hyperparameter selection of x CelebA [47] images with a learning rate=, , and four discriminator steps per generator step. We utilize a single RTX3090 for training our models.
Evaluating CLEAM with Different . Utilizing the same setup as in Sec. 5.1 of our main manuscript, with the GenData-StyleGAN and GenData-StyleSwin datasets, we repeated the experiment with ResNet-18 and . Our results in Fig. 9 show a marginal increase in error for both the Baseline and CLEAM as approaches 100, while the converse occurs as approaches 600. However, given the diminishing improvements for , we found to be ideal, balancing computational cost and measurement accuracy.
Batch Size .
In our experiments, we utilized batches of data, each containing images, to approximate with the Baseline and CLEAM methods. In the previous experiment, we found samples to be the ideal balance between computational time and fairness measurement error. Here, we repeat the same hyper-parameter search, utilizing the real-generator setup in Sec. 5.1 of the main paper with ResNet-18, but instead vary the number of batches, . Our results in Fig. 10 show to be the optimal value when approximating . Increasing did not result in significant improvements for either the Baseline or CLEAM; however, decreasing resulted in significant degradation in performance, i.e., an increase in .
Computational Time.
In our main paper, we note that CLEAM is a lightweight correction to the existing Baseline method that requires no additional parameters to be computed during evaluation. To support this, we evaluated the computational time of the Baseline, Diversity, and our proposed CLEAM. Our results in Tab. 27 show that there is only a small difference in computational time (s) between the Baseline and our proposed CLEAM; this difference is solely due to the computation of Algo. 1. See Tab. 28 for a discussion of carbon emissions.

Experiment | Hardware | GPU Hours | Carbon emitted (kg) |
---|---|---|---|
Training of SA Classifiers | RTX3090 | 2.0 | 0.39 |
Comparing CLEAM on GANs, Main Paper Tab. 1 | RTX3090 | 4.8 | 0.94 |
Comparing CLEAM on DGN, Main Paper Tab. 2 | RTX3090 | 0.3 | 0.1 |
Inferring with CLEAM on DGN, Main Paper Fig. 3a | RTX3090 | 0.3 | 0.1 |
Inferring CLEAM on GANs, Main Paper Fig. 3b | RTX3090 | 0.52 | 0.15 |
Comparing CLEAM on PsuedoG, Supp Tab 13 | RTX3090 | 4.5 | 0.88 |
Comparing CLEAM on PsuedoG Additional SA, Supp Tab 14 | RTX3090 | 3 | 0.59 |
Comparing CLEAM on PsuedoG Additional classifier, Supp Tab 15 | RTX3090 | 4.5 | 0.88 |
Comparing CLEAM on DGN with CLIP, Supp Tab. 16 | RTX3090 | 0.15 | 0.05 |
Comparing CLEAM with BBSE/BBSC, Supp Tab. 21 | RTX3090 | 0.25 | 0.07 |
Applying CLEAM on Bias mitigation, Subb Tab 23 | RTX3090 | 0.88 | 0.17 |
Total: | 21.2 | 4.32 |
Appendix G Related Work
Fairness in Generative Models. Fairness in machine learning is mostly studied for discriminative learning, where the objective is usually to handle a classification task independently of a sensitive attribute in the input data, e.g., making a hiring decision independent of the applicant’s Gender. However, the definition of fairness is quite different for generative learning, where it is considered as equal representation/generation probability w.r.t. a sensitive attribute. Because of this difference, the conventional fairness metrics used for classification, like Equalised Odds, Equalised Opportunity [62] and Demographic Parity [63], cannot be applied to generative models. Instead, the similarity between the probability distribution of the generated samples w.r.t. a sensitive attribute () and a target distribution (a uniform distribution) [1] is utilized as the fairness metric. See Sec. A.3 for details.
Existing Works on Fair Generative Models. Existing works focus on bias mitigation in generative models. Choi et al. [1] propose an importance re-weighting algorithm that favours a fair reference dataset w.r.t. the sensitive attribute over a larger biased dataset. Frankel et al. [7] introduce prior modification, where an additional smaller network modifies the prior of a GAN to achieve a fairer output. Tan et al. [9] learn the latent input space w.r.t. the sensitive attribute, which they later sample accordingly to achieve a fair output. MaGNET [8] demonstrates that enforcing uniformity in the latent feature space of a GAN, through a sampling process, improves fairness. Um et al. [12] improve fairness through a total variation distance that quantifies the unfairness between a small reference dataset and the generated samples. Teo et al. [2] introduce fairTL++, which utilizes a small fair dataset to implement fairness adaptation via transfer learning. In all of these works, the focus is on improving the fairness of the generative model, where the performance of the model is measured with a framework in which the inaccuracies of the sensitive attribute classifier are ignored. In contrast, our proposed CLEAM focuses on improving fairness measurement by compensating for the inaccuracies of the sensitive attribute classifier through a statistical model. Therefore, it can be used to evaluate bias mitigation algorithms more accurately.
Equal Representation. Some works also use a similar notion of equal representation (as used in generative models) to address fairness. For example, a fair clustering variant [64] is proposed that enforces the clusters to represent each attribute equally, and fair data summarization [65] mitigates bias when creating a representative subset of a given dataset, while handling the trade-off between fairness and diversity during sampling. However, unlike our setup, these works assume access to the attribute labels. Meanwhile, in data mining, a similar problem was recently studied: given a large dataset of unlabelled mined data, the objective is to evaluate the disparity of the dataset w.r.t. an attribute. To do this, an evaluation framework called Diversity [46] was introduced, in which a pre-trained classifier is used as a feature extractor and the unlabelled dataset is compared against a controlled reference dataset (with known labels) via a similarity algorithm.
Biases in Text-to-Image Generation. Some works have looked into the biases of text-to-image generators [26]. Specifically, Bianchi et al. study existing biases in occupation-based prompts for popular text-to-image generators, e.g., stable diffusion models. They found that the biases exacerbate existing occupation stereotypes, e.g., nurses being over-represented as non-Caucasian females. To measure these biases, [26] uses a simple approach utilizing a pre-trained feature extractor to assign sensitive attribute labels to a small batch of generated images (100 samples). We remark that this approach is similar to Diversity, a method which we found to also demonstrate significant errors due to the lack of consideration of the classifier’s error. Furthermore, we emphasize the difference between our study and Bianchi et al.: in our application of CLEAM (Sec. 6 of the main manuscript), we examine the impact of using prompts with indefinite pronouns/nouns that are synonymous with one another. Our objective, unlike Bianchi et al.’s work, is to investigate the influence of subtle changes in the prompts on bias, studied on a large dataset ( samples). Our results are the first to demonstrate that even subtle changes to the prompt (which leave its semantics unchanged) can result in drastically different biases.
Classifier Calibration. The proposed CLEAM can be seen from a classifier calibration point of view, as it refines the output of the classifier. However, CLEAM should not be mistaken for conventional calibration algorithms, e.g., temperature scaling [66], Platt scaling [67] and isotonic regression [68]. Unlike these algorithms, which concern themselves with the confidence of the prediction, CLEAM focuses on the sensitive attribute distribution, making these algorithms ineffective for our purpose.
More specifically, conventional classifier calibration methods usually work on soft labels (probabilities). Note that in our framework, the is applied to the output probabilities to determine the hard label. Therefore, in our application, which deals with hard labels, regular calibration techniques are less effective. To investigate this, we conduct calibration experiments by applying popular classifier calibration techniques, i.e., temperature scaling (T-scaling) [66], isotonic regression [68] and Platt scaling [67], on a pre-trained ResNet-18 [34] sensitive attribute classifier. In Fig. 11, we see that T-scaling is the most effective in correcting the calibration curve toward the ideal Ref line. Note that this Ref line indicates that the classifier is perfectly calibrated w.r.t. the soft labels.
Next, using the pseudo-generator from Sec. D.3, we applied the calibrated sensitive attribute classifiers from above and compared them against CLEAM (applied on an uncalibrated model). In our results, seen in Fig. 12, we observe that these traditional calibration methods are less effective in correcting the sensitive attribute distribution error. In fact, methods like Platt scaling worsen the error, and T-scaling, which is shown in [66] and in our experiment to be one of the most effective traditional calibration methods, does not change class predictions (hard labels) but merely perturbs the soft labels. This demonstrates that calibrating soft labels does not directly translate into correcting hard-label statistics, which is what CLEAM addresses.
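A small illustration of why T-scaling cannot change the measured sensitive attribute distribution: dividing the logits by any temperature rescales the soft labels but never changes the argmax, so the hard labels fed into the fairness measurement are identical.

```python
import torch

logits = torch.tensor([[2.3, 0.4], [0.1, 1.7], [3.0, 2.9]])
for T in [0.5, 1.0, 2.0, 5.0]:
    probs = torch.softmax(logits / T, dim=1)    # soft labels change with T ...
    hard = probs.argmax(dim=1).tolist()         # ... but the argmax does not
    print(T, [round(p, 3) for p in probs[:, 0].tolist()], hard)   # hard stays [0, 1, 0]
```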


Appendix H Details of the GenData: A New Dataset of Labeled Generated Images
In this section, we provide more information on our new dataset, which contains generated samples labeled w.r.t. sensitive attributes from StyleGAN2 [3] (https://github.com/NVlabs/stylegan2-ada-pytorch) and StyleSwin [4] (https://github.com/microsoft/StyleSwin) trained on CelebA-HQ [27], and a Stable Diffusion Model (SDM) [5]. More specifically, our dataset contains 9k randomly generated samples based on the original saved weights and code of the respective GANs, and 2k samples for four different prompts input to the SDM. These samples are then hand-labeled w.r.t. the sensitive attributes, namely Gender and BlackHair for both GANs and Gender for the SDM. With these labeled datasets, we can approximate the ground-truth sensitive attribute distribution of the respective generators.
Dataset Labeling Protocol.
To ensure high-quality samples and labels in our dataset, we passed the dataset through Amazon Mechanical Turk, where labelers were given detailed guidelines and examples for identifying the individual sensitive attributes. In addition to the sensitive attribute options, e.g. Gender (Male) or Gender (Female), labelers were also given an “unidentifiable” option, which they were instructed to select for low-quality samples, as per Fig. 13 and 17. We repeated this process for 4 runs such that each sample received the opinions of four independent labelers. Finally, each sample was assigned the label selected by the majority.
Overall, the GANs and SDM received 97% and 99% unanimous agreement rates, respectively, where agreement is over the full set of options, e.g. Male, Female, or unidentifiable for the sensitive attribute Gender. We discard the samples labeled unidentifiable and are left with a high-quality dataset, as per Fig. 14, 15 and 16. We remark that the discarded samples constitute only a small portion of the generated samples, i.e. 3% for the GANs and 1% for the SDM. Upon further evaluation, we found that the sensitive attribute classifiers appear to assign these (rejected) ambiguous samples a random class, roughly uniformly and with low confidence. As a result, we can assume that the impact of disregarding these samples on CLEAM’s evaluation is insignificant. A hypothetical sketch of the aggregation step is given below.
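The following is a hypothetical sketch of the label-aggregation step described above (the data and tie-breaking rule are illustrative, not our annotation pipeline): each sample carries four annotator labels, the majority label is kept, “unidentifiable” samples are dropped, and the remaining hard labels give the approximate ground-truth SA distribution.

from collections import Counter

# Illustrative aggregation of four independent annotator labels per sample.
def aggregate(annotations):
    label, count = Counter(annotations).most_common(1)[0]
    # Require a strict majority; ties are treated as unidentifiable in this sketch.
    return label if count > len(annotations) / 2 else "unidentifiable"

# Toy annotations (values are illustrative only).
raw = [
    ["male", "male", "male", "male"],
    ["female", "female", "male", "female"],
    ["unidentifiable", "unidentifiable", "unidentifiable", "male"],
]
labels = [aggregate(a) for a in raw]
kept = [l for l in labels if l != "unidentifiable"]          # discard rejected samples
p_male = sum(l == "male" for l in kept) / len(kept)          # approximate ground truth
print(labels, p_male)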
Appendix I Limitations and Considerations
Ethical consideration. In general, we note that our work does not introduce any social harm but instead improves on the existing fairness measurement framework to better gauge progress. However, we stress that it is important to consider the limitations of the existing fairness measurement framework, which we discuss in the following.
Sensitive Attribute Labels. Certain sensitive attributes may exist on a spectrum, e.g. Young. However, given that this work aims to improve fairness measurement, and the currently widely used definition is based on binary outcomes, we adopt the same setup in our work. Additionally, it is important to be aware that certain sensitive attributes may be ambiguous, e.g. Big Nose (which exists in popular datasets like CelebA-HQ), as their definitions may differ across cultural expectations. In our work, we try to select less ambiguous sensitive attributes, e.g. BlackHair.
Human and Auto Labelling. Labeling the sensitive attributes of generated samples is essential to better understand the possible biases that may exist in proposed generative model algorithms. To do this, researchers often rely either on human labelers or on machines for automated labeling. However, when utilizing such labeling procedures, it is important to consider the ethical implications, especially in the many cases where sensitive information such as gender is involved. One particular concern is potential discrimination in the assignment of labels such as gender. For example, if only certain facial features are considered when assigning gender labels, some individuals may be inaccurately labeled due to characteristics that deviate from traditional notions of male and female identity.
Human labelers may bring their own biases, subjectivity, and cultural background to the labeling process, which can lead to inaccuracies or reinforce stereotypes. Additionally, it is important to ensure that the labelers represent a diverse range of backgrounds and perspectives, particularly if the samples being labeled are from a diverse population. This can help mitigate potential discrimination against some social identities and improve the accuracy of the labeling process.
In the case when utilizing machines for labeling, it is important to be aware that labeling algorithms may be biased, depending on the data set it was trained on. If the data set is not diverse or balanced, the algorithm may produce inaccurate or biased results that reinforce stereotypes or discrimination against certain social identities.
Utilizing Zero-Shot Classifiers. When utilizing pre-trained classifiers, it is important to carefully select a proxy validation dataset with a domain similar to that of the generated images. A significant mismatch between these two domains could result in an inaccurate approximation of the SA classifier’s accuracy, resulting in poor performance by CLEAM. Similar to our previous discussion, we would also refrain from using ambiguous sensitive attributes, as this may result in a mismatch between the proxy validation dataset and the pre-trained sensitive attribute classifier. A hedged sketch of the accuracy-estimation step is shown below.
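The following is a hedged sketch (placeholder names throughout, not CLEAM’s released code) of how the per-class accuracies that CLEAM relies on could be estimated on a proxy validation set; a domain mismatch between this set and the generated images would bias these estimates and thus degrade the correction.

import numpy as np

# Illustrative estimation of per-class accuracies on a proxy validation set.
# `classify` is a placeholder for the pre-trained (e.g. zero-shot) SA classifier;
# `images`/`labels` are a labelled proxy set chosen to be close in domain to the
# generated images.
def estimate_accuracies(images, labels, classify):
    preds = np.array([classify(x) for x in images])
    labels = np.asarray(labels)
    alpha_0 = (preds[labels == 0] == 0).mean()   # accuracy on class 0
    alpha_1 = (preds[labels == 1] == 1).mean()   # accuracy on class 1
    return alpha_0, alpha_1

# If the proxy set's domain differs substantially from the generated images,
# alpha_0 and alpha_1 measured here may not reflect the classifier's true
# accuracy on generated samples, which undermines the downstream correction.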