FairAdaBN: Mitigating unfairness with adaptive batch normalization and its application to dermatological disease classification
Abstract
Deep learning is becoming increasingly ubiquitous in medical research and applications, where it handles sensitive information and even critical diagnostic decisions. Researchers have observed significant performance disparities among subgroups with different demographic attributes, known as model unfairness, and have devoted considerable effort to carefully designed architectures that address it; however, such designs impose a heavy training burden, generalize poorly, and expose a trade-off between model performance and fairness. To tackle these issues, we propose FairAdaBN, which makes batch normalization adaptive to sensitive attributes. This simple but effective design can be adopted by several classification backbones that are originally unaware of fairness. Additionally, we derive a novel loss function that restrains statistical parity between subgroups on mini-batches, encouraging the model to converge with considerable fairness. To evaluate the trade-off between model performance and fairness, we propose a new metric, named Fairness-Accuracy Trade-off Efficiency (FATE), which computes the normalized fairness improvement over the accuracy drop. Experiments on two dermatological datasets show that our proposed method outperforms other methods on fairness criteria and FATE. Our code is available at https://github.com/XuZikang/FairAdaBN.
Keywords: Dermatology, Fairness, Batch Normalization
1 Introduction
The past years have witnessed a rapid growth in applying deep learning methods to medical imaging [31]. As performance improves continuously, researchers also find that deep learning models tend to distinguish diseases using features related to a sample's demographic attributes, especially sensitive ones such as skin tone or gender. The biased performance across subgroups defined by sensitive attributes is referred to as unfairness [16]. For example, Seyyed-Kalantari et al. [21] find that their models trained on a chest X-ray dataset show a significant disparity in True Positive Rate (TPR) between male and female subgroups. Similar evaluations have been carried out on brain MRI [17], dermatology [12], and mammography [15], showing that unfairness issues are widespread in medical applications. If the unfairness of deep learning models is not handled properly, healthcare disparities increase and fundamental human rights cannot be guaranteed. Thus, there is a pressing need to investigate unfairness mitigation and eliminate biased inference in deep learning models.
There are two groups of methods to tackle unfairness. The first group proceeds implicitly, via fairness through unawareness [7], by leaving sensitive attributes out when training a single model, or by deriving invariant representations and deliberately ignoring the attributes when making a decision. However, numerous evaluations show that this can still lead to unfairness, because sensitive attributes are entangled with other variables in the data and the feature distributions of different subgroups differ statistically. The second group explicitly takes sensitive attributes into consideration when training models, for example, by training independent models for each subgroup [18, 25] with no parameters shared between subgroups. However, this can degrade performance because each model is built on less data (see Table 1).
It is natural to ask whether it is possible to inherit the advantages of both worlds, that is, learning a single model on the whole dataset while still modeling sensitive attributes explicitly. Therefore, we propose a framework with a powerful adapter termed Fair Adaptive Batch Normalization (FairAdaBN). Specifically, FairAdaBN is designed to mitigate the task disparity between subgroups captured by the neural network. It integrates common information across subgroups by sharing most network parameters, and enables subgroup-specific expression of feature maps while adding only a few parameters to the backbone. Thanks to FairAdaBN, the proposed architecture can minimize statistical differences between subgroups and learn subgroup-specific features for unfairness mitigation, which improves model fairness and preserves model precision at the same time. In addition, to strengthen the model's ability to balance performance and fairness, a new loss function, the Statistical Disparity Loss ($\mathcal{L}_{SD}$), is introduced to optimize statistical disparity on mini-batches and impose an explicit fairness constraint on network optimization. $\mathcal{L}_{SD}$ also enhances information transmission between subgroups, which independent models lack. Finally, an ideal model should achieve both higher precision and better fairness than current well-fitted models. However, most existing unfairness mitigation methods sacrifice overall performance to build a fairer model [20, 22]. Therefore, following the idea of discovering the fairness-accuracy Pareto frontier [32], we propose a novel metric for evaluating the Fairness-Accuracy Trade-off Efficiency (FATE), urging researchers to pay attention to performance and fairness simultaneously when building prediction models. We evaluate the proposed method on its application to mitigating unfairness in dermatology diagnosis.
To sum up, our contributions are as follows:
1. A novel framework is proposed for unfairness mitigation by replacing the normalization layers in common backbones with FairAdaBN;
2. A loss function is proposed that minimizes statistical parity differences between subgroups to improve fairness;
3. A new metric, FATE, is derived to evaluate the model's fairness-performance trade-off efficiency. Our proposed FairAdaBN achieves the highest FATE (48.79), more than double the best among the other unfairness mitigation methods (Ind, 22.63);
4. Experiments on two dermatological disease datasets and three backbones demonstrate the superiority of the proposed FairAdaBN framework in terms of high performance and strong portability.
2 Related Work
According to [4], unfairness mitigation methods can be categorized into pre-processing, in-processing, and post-processing, based on the stage at which they intervene.
Pre-Processing. Pre-processing methods focus on the quality of the training set, by organizing fair datasets via dataset combination [21], using generative adversarial networks [11] or sketching models [27] to generate extra images, or directly resampling the training set [18, 28]. However, most methods in this category require substantial effort because medical data are scarce and expensive to collect.
Post-Processing. Although calibration has been widely used for unfairness mitigation in machine learning, medical applications tend to rely on pruning strategies. For example, Wu et al. [26] mitigate unfairness by pruning a pre-trained diagnosis model according to the difference in feature importance between subgroups. However, their method requires an additional stage beyond training a precise classification model, while our FairAdaBN is a one-step method.
In-Processing. In-processing methods fall into two categories. Some studies mitigate unfairness by directly adding fairness constraints to the cost function [28], which often leads to overfitting. Other work mitigates unfairness by designing complex network architectures such as adversarial networks [30, 14] or representation learning [5]. This family of methods relies heavily on the accuracy of the sensitive-attribute classifier in the adversarial branch, leads to larger models, and cannot make full use of pre-trained weights. In contrast, our method does not increase the number of parameters significantly and can be applied to several common backbones for dermatology diagnosis.
3 FairAdaBN
Problem Definition. We assume a medical imaging dataset $\mathcal{D}$ with $N$ samples, where the $i$-th sample consists of an input image $x_i$, a sensitive attribute $a_i$, and a classification ground-truth label $y_i$, i.e., $\mathcal{D} = \{(x_i, a_i, y_i)\}_{i=1}^{N}$. The attribute $a$ is a binary variable (e.g., skin tone, gender), which splits the dataset into the unprivileged group $\mathcal{D}_0$, whose average performance is lower than the overall performance, and the privileged group $\mathcal{D}_1$, whose average performance is higher than the overall performance. Using accuracy as the performance metric, for a neural network $f_\theta$ our goal is to minimize the accuracy gap between $\mathcal{D}_0$ and $\mathcal{D}_1$ by finding a proper $\theta$:
$$\min_{\theta}\;\Big|\,\mathrm{Acc}\big(f_\theta;\mathcal{D}_0\big)-\mathrm{Acc}\big(f_\theta;\mathcal{D}_1\big)\,\Big| \tag{1}$$
[Fig. 1: Overview of the proposed FairAdaBN framework.]
In this paper, we propose FairAdaBN, which replaces normalization layers in vanilla models with adaptive batch normalization layers, while sharing other layers between subgroups. The overview of our method is shown in Fig. 1.
Batch normalization (BN) is a ubiquitous network layer that normalizes mini-batch features using their statistics [10]. Let $F \in \mathbb{R}^{C \times W \times H}$ denote a given layer's output feature map, where $C$, $W$, and $H$ are the number of channels, the width, and the height of the feature map. The BN function is defined as:
$$\mathrm{BN}(F)=\gamma\cdot\frac{F-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta \tag{2}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the feature map computed over the mini-batch, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ denote the learnable affine parameters.
We implant attribute awareness into BN, yielding FairAdaBN, by parallelizing multiple normalization blocks, one for each subgroup. Specifically, for subgroup $\mathcal{D}_k$, the adaptive affine parameters $\gamma_k$ and $\beta_k$ are learned from the samples in $\mathcal{D}_k$. Thus, the adaptive BN function for subgroup $\mathcal{D}_k$ is given by Eq. 3.
$$\mathrm{FairAdaBN}_k(F)=\gamma_k\cdot\frac{F-\mu_k}{\sqrt{\sigma_k^{2}+\epsilon}}+\beta_k \tag{3}$$
where $k$ is the index of the sensitive attribute of the current input image, and $\mu_k$ and $\sigma_k$ are computed within each subgroup independently.
FairAdaBN acquires subgroup-specific knowledge by learning the affine parameters $\gamma_k$ and $\beta_k$. Therefore, the feature maps of the subgroups can be aligned and the unfair representation gap between privileged and unprivileged groups can be mitigated. By applying FairAdaBN to vanilla backbones, the network learns subgroup-agnostic feature representations through the shared convolution parameters and subgroup-specific representations through the respective BN parameters, resulting in lower fairness criteria. The detailed structure of FairAdaBN is shown in Fig. 1; for simplicity we display the minimal unit of ResNet. Note that the normalization layer in the residual branch is left unchanged for faster convergence.
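To make the design concrete, the following PyTorch sketch shows one way such an attribute-adaptive normalization layer could be implemented. The class name `FairAdaBN2d`, the `set_group` routing mechanism, and the assumption that every mini-batch comes from a single subgroup are ours for illustration and are not taken from the released code.

```python
import torch
import torch.nn as nn

class FairAdaBN2d(nn.Module):
    """One BatchNorm2d branch per sensitive-attribute subgroup.

    Convolution weights stay shared outside this layer; only the normalization
    statistics and affine parameters (gamma_k, beta_k) are subgroup-specific.
    The active subgroup is selected via `set_group`, so the forward signature
    stays compatible with standard backbones.
    """

    def __init__(self, num_features: int, num_groups: int = 2):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(num_features) for _ in range(num_groups))
        self.group_idx = 0  # which subgroup's branch to use for the current batch

    def set_group(self, group_idx: int) -> None:
        self.group_idx = group_idx

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route the whole mini-batch through its subgroup's branch
        # (assumes each mini-batch is drawn from a single subgroup).
        return self.bns[self.group_idx](x)
```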
In this paper, we aim to retain skin lesion classification accuracy and improve model fairness simultaneously. The loss function consists of two parts: (i) the cross-entropy loss $\mathcal{L}_{CE}$, which constrains prediction precision, and (ii) the statistical disparity loss $\mathcal{L}_{SD}$ in Eq. 4, which minimizes the difference in prediction probability between subgroups and imposes an additional constraint on fairness.
$$\mathcal{L}_{SD}=\sum_{m=1}^{M}\Big|\,\mathbb{E}_{x\in\mathcal{D}_0}\big[\hat{p}_m(x)\big]-\mathbb{E}_{x\in\mathcal{D}_1}\big[\hat{p}_m(x)\big]\,\Big| \tag{4}$$
where $M$ is the number of classification categories and $\hat{p}_m(x)$ is the predicted probability of class $m$ for input $x$, computed over the mini-batch.
The overall loss function is the sum of the two parts, with a hyper-parameter $\alpha$ that adjusts the strength of the fairness constraint: $\mathcal{L} = \mathcal{L}_{CE} + \alpha\,\mathcal{L}_{SD}$.
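A minimal PyTorch sketch of the combined objective, assuming the reconstruction of Eq. 4 given above; the function names and the default value of $\alpha$ are illustrative only.

```python
import torch
import torch.nn.functional as F

def statistical_disparity_loss(logits: torch.Tensor, groups: torch.Tensor) -> torch.Tensor:
    """Mini-batch statistical disparity (Eq. 4 as reconstructed): sum over classes
    of the absolute gap between the two subgroups' mean predicted probabilities.
    Assumes both subgroups appear in the mini-batch."""
    probs = F.softmax(logits, dim=1)          # (B, M) predicted probabilities
    p0 = probs[groups == 0].mean(dim=0)       # mean per-class probability, unprivileged group
    p1 = probs[groups == 1].mean(dim=0)       # mean per-class probability, privileged group
    return (p0 - p1).abs().sum()

def total_loss(logits, labels, groups, alpha: float = 1.0) -> torch.Tensor:
    """L = L_CE + alpha * L_SD; the default alpha is a placeholder."""
    return F.cross_entropy(logits, labels) + alpha * statistical_disparity_loss(logits, groups)
```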
4 Experiments and Results
4.1 Evaluation Metrics
Many fairness criteria have been proposed, including statistical parity [7], equalized odds [9], equal opportunity [9], and counterfactual fairness [13]. In this paper, we use equal opportunity and equalized odds as fairness criteria. For equal opportunity, we split it into EOpp0 and EOpp1 according to the ground-truth label:
$$\mathrm{EOpp0}=\Big|\,P\big(\hat{y}=0\mid y=0,a=0\big)-P\big(\hat{y}=0\mid y=0,a=1\big)\,\Big| \tag{5}$$
$$\mathrm{EOpp1}=\Big|\,P\big(\hat{y}=1\mid y=1,a=0\big)-P\big(\hat{y}=1\mid y=1,a=1\big)\,\Big| \tag{6}$$
$$\mathrm{Eodd}=\Big|\,P\big(\hat{y}=1\mid y=1,a=0\big)-P\big(\hat{y}=1\mid y=1,a=1\big)+P\big(\hat{y}=1\mid y=0,a=0\big)-P\big(\hat{y}=1\mid y=0,a=1\big)\,\Big| \tag{7}$$
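For reference, a small NumPy sketch of these criteria under the reconstructed Eqs. 5-7, written for a binary task and a binary sensitive attribute; the per-class averaging used for the multi-class setting is omitted.

```python
import numpy as np

def fairness_gaps(y_true: np.ndarray, y_pred: np.ndarray, a: np.ndarray):
    """EOpp0, EOpp1, and Eodd for binary labels and a binary sensitive attribute."""
    def rate(pred_val: int, true_val: int, group: int) -> float:
        mask = (y_true == true_val) & (a == group)
        return float(np.mean(y_pred[mask] == pred_val))

    eopp0 = abs(rate(0, 0, 0) - rate(0, 0, 1))                                     # TNR gap
    eopp1 = abs(rate(1, 1, 0) - rate(1, 1, 1))                                     # TPR gap
    eodd = abs((rate(1, 1, 0) - rate(1, 1, 1)) + (rate(1, 0, 0) - rate(1, 0, 1)))  # TPR + FPR gaps
    return eopp0, eopp1, eodd
```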
However, these metrics only evaluate the level of fairness and do not consider the trade-off between fairness and accuracy. Therefore, inspired by [6], we propose FATE, a metric that evaluates the balance between the normalized improvement in fairness and the normalized drop in accuracy. The formula of FATE for a given fairness criterion is shown below:
$$\mathrm{FATE}_{FC}=\frac{\mathrm{ACC}_{m}-\mathrm{ACC}_{b}}{\mathrm{ACC}_{b}}+\lambda\cdot\frac{FC_{b}-FC_{m}}{FC_{b}} \tag{8}$$
where $FC$ can be one of EOpp0, EOpp1, or Eodd, and ACC denotes accuracy. The subscripts $m$ and $b$ denote the mitigation model and the baseline model, respectively. $\lambda$ is a weighting factor that adjusts the emphasis placed on fairness, pre-defined by the user according to the target application; here we set $\lambda = 1$ for simplicity. A model obtains a higher FATE if it mitigates unfairness while maintaining accuracy. Note that FATE should be interpreted together with utility and fairness metrics, rather than on its own.
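A one-line sketch of Eq. 8 with hypothetical numbers, illustrating that FATE rewards fairness gains obtained at a small accuracy cost and penalizes the opposite.

```python
def fate(acc_m: float, acc_b: float, fc_m: float, fc_b: float, lam: float = 1.0) -> float:
    """Fairness-Accuracy Trade-off Efficiency (Eq. 8 as reconstructed):
    relative accuracy change plus lambda times the relative fairness improvement."""
    return (acc_m - acc_b) / acc_b + lam * (fc_b - fc_m) / fc_b

# Hypothetical numbers: trading 1 point of accuracy for halving EOpp1 gives a clearly
# positive FATE, while gaining accuracy at the cost of fairness goes negative.
print(round(100 * fate(acc_m=0.86, acc_b=0.87, fc_m=0.05, fc_b=0.10), 2))   # ~48.85
print(round(100 * fate(acc_m=0.88, acc_b=0.87, fc_m=0.15, fc_b=0.10), 2))   # ~-48.85
```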
4.2 Dataset and Network Configuration
We use two well-known dermatology datasets to evaluate the proposed method. The Fitzpatrick-17k dataset [8] contains 16,577 dermatology images in 9 diagnostic categories. Skin tone is labeled with the Fitzpatrick skin type. In this paper, we regard Skin Types I to III as light and Skin Types IV to VI as dark for simplicity; the resulting light-to-dark ratio is skewed. The ISIC 2019 dataset [24, 1, 2] contains 25,331 images across 9 diagnostic categories; we use gender as the sensitive attribute. Based on subgroup analysis, dark and female are treated as the privileged groups, and light and male as the unprivileged groups.
We randomly split each dataset into training, validation, and test sets with a ratio of 6:2:2. The models are trained for 600 epochs and the model with the highest validation accuracy is selected for testing. Images are resized or cropped to 128 × 128 for both datasets. Random flipping and random rotation are used for data augmentation. The experiments are carried out on 8 NVIDIA 3090 GPUs, implemented in PyTorch, and repeated 3 times. Pre-trained ImageNet weights are used for all models. The networks are trained using the AdamW optimizer with weight decay. The batch size and learning rate are set to 128 and 1e-4, respectively. The fairness weight $\alpha$ is chosen according to the ablation in Sec. 4.4.
4.3 Results
We compare FairAdaBN with Vanilla (ResNet-152), Resampling [18], Ind (independently trained models for each subgroup) [18], GroupDRO [19], EnD [23], and CFair [29], which are commonly used for unfairness mitigation.
Results on the Fitzpatrick-17k Dataset. Table 1 shows the results of the seven methods on the Fitzpatrick-17k dataset. Compared to the Vanilla model, Resampling achieves comparable utility but does not improve fairness. FairAdaBN achieves the lowest unfairness with only a small drop in accuracy and has the highest FATE on all fairness criteria. Ind falls short because it shares no common information between subgroups and each of its models is trained on only part of the dataset. GroupDRO and EnD rely on discriminating features from different subgroups, which are hard to distinguish in this task. CFair is more effective on balanced datasets, whereas the ratio between the light and dark subgroups here is skewed.
Results on the ISIC 2019 Dataset. Table 1 also shows the results on the ISIC 2019 dataset. FairAdaBN is the fairest of the seven methods. Resampling improves fairness slightly but does not outperform ours. GroupDRO mitigates EOpp0 while increasing unfairness on EOpp1 and Eodd. Ind and CFair cannot mitigate unfairness on ISIC 2019, and EnD increases unfairness on EOpp0.
Fitzpatrick-17k Dataset
Method | Accuracy | Precision | Recall | F1 | EOpp0 | EOpp1 | Eodd | FATE (EOpp0) | FATE (EOpp1) | FATE (Eodd)
Vanilla | | | | | | | | / | / | /
Resampling [18] | | | | | | | | | |
Ind [18] | | | | | | | | | |
GroupDRO [19] | | | | | | | | | |
EnD [23] | | | | | | | | | |
CFair [29] | | | | | | | | | |
FairAdaBN | | | | | | | | | |
ISIC 2019 Dataset
Method | Accuracy | Precision | Recall | F1 | EOpp0 | EOpp1 | Eodd | FATE (EOpp0) | FATE (EOpp1) | FATE (Eodd)
Vanilla | | | | | | | | / | / | /
Resampling [18] | | | | | | | | -0.80 | -2.48 | -5.49
Ind [18] | | | | | | | | | |
GroupDRO [19] | | | | | | | | | |
EnD [23] | | | | | | | | | |
CFair [29] | | | | | | | | | |
FairAdaBN | | | | | | | | | |
• * denotes , respectively.
• Private implementation.
The FATE metric. Fig. 2 shows the FATE values. According to [3], the closer a curve is to the top-left corner, the smaller its fairness-accuracy trade-off. The figure shows that FATE follows the same trend. We prefer an algorithm with a higher FATE, since a higher FATE denotes stronger unfairness mitigation with a small drop in utility, whereas a negative FATE denotes that the mitigation model cannot decrease unfairness while retaining sufficient accuracy (i.e., it is not beneficial).
[Fig. 2: FATE values of the compared methods.]
Method | Accuracy | Precision | Recall | F1 | EOpp0 | EOpp1 | Eodd | FATE (EOpp0) | FATE (EOpp1) | FATE (Eodd)
VGG | | | | | | | | / | / | /
VGG + FairAdaBN | | | | | | | | 18.06 | -4.61 | 5.86
DenseNet | | | | | | | | / | / | /
DenseNet + FairAdaBN | | | | | | | | -29.11 | 21.82 | 19.71
ResNet | | | | | | | | / | / | /
Ours w/o $\mathcal{L}_{SD}$ | | | | | | | | -7.87 | 9.88 | 5.55
Ours w/o FairAdaBN | | | | | | | | 42.15 | -49.94 | -45.62
Ours ($\alpha$ = ) | | | | | | | | -29.10 | -31.85 | -24.16
Ours ($\alpha$ = ) | | | | | | | | | |
Ours ($\alpha$ = ) | | | | | | | | -13.38 | 14.60 | 16.92
Limitation. Compared with other methods, FairAdaBN requires sensitive attributes at test time, which EnD and CFair do not. Although these attributes are usually easy to acquire in real applications, removing this requirement is a direction for future improvement.
4.4 Ablation Study
Different backbones. First, we test FairAdaBN's compatibility with different backbones by applying it to VGG-19-BN and DenseNet-121. Note that the first and last BN layers in DenseNet are not changed. The results on the Fitzpatrick-17k dataset are shown in Table 2. They show that FairAdaBN is also effective on these two backbones, with an exception for DenseNet-121, demonstrating good model compatibility. However, we also observe a larger drop in model precision compared with the respective baselines, which needs to be addressed in future work.
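The sketch below illustrates how the normalization layers of a torchvision backbone could be swapped for the `FairAdaBN2d` sketch from Section 3. The uniform swap, the helper names, and the fact that the new branches are not initialized from the pre-trained BN statistics are simplifications of ours, not the paper's exact procedure (which keeps some normalization layers unchanged, e.g. in the residual branch of ResNet and the first/last BN of DenseNet).

```python
import torch.nn as nn
from torchvision.models import resnet152  # vgg19_bn and densenet121 work the same way

def replace_bn_with_fairadabn(module: nn.Module, num_groups: int = 2) -> None:
    """Recursively swap every BatchNorm2d layer for a FairAdaBN2d (see Section 3 sketch)."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, FairAdaBN2d(child.num_features, num_groups))
        else:
            replace_bn_with_fairadabn(child, num_groups)

def set_sensitive_group(model: nn.Module, group_idx: int) -> None:
    """Tell every FairAdaBN2d layer which subgroup the incoming mini-batch belongs to."""
    for m in model.modules():
        if isinstance(m, FairAdaBN2d):
            m.set_group(group_idx)

model = resnet152(weights="IMAGENET1K_V1")
replace_bn_with_fairadabn(model)
set_sensitive_group(model, group_idx=0)  # call before each mini-batch's forward pass
```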
Different loss terms. We train ResNet with only the BNs replaced by FairAdaBN (the second row of the last part of Table 2), and ResNet with only $\mathcal{L}_{SD}$ added to the total loss (the third row of the last part). The effectiveness of FairAdaBN itself is illustrated by comparing the first and second rows of the last part in Table 2: by replacing BNs with FairAdaBN, ResNet can normalize subgroup feature maps with subgroup-specific affine parameters, which reduces the fairness criteria. Comparing the second and fourth rows of the last part, we find that adding $\mathcal{L}_{SD}$ decreases unfairness significantly further. Besides, although adding $\mathcal{L}_{SD}$ to ResNet alone unexpectedly increases the fairness criteria, they decrease when FairAdaBN and $\mathcal{L}_{SD}$ are used together. The reason could be the connection between FairAdaBN and $\mathcal{L}_{SD}$, as both operate on subgroups in a similar form.
Hyper-parameter $\alpha$. Among the three values of $\alpha$ tested in Table 2, the selected setting achieves the best fairness scores and FATE, so we adopt it as our final setting.
5 Conclusion
We propose FairAdaBN, a simple but effective framework for unfairness mitigation in dermatological disease classification. Extensive experiments show that the proposed framework mitigates unfairness compared to models without fairness constraints and achieves a higher fairness-accuracy trade-off efficiency than other unfairness mitigation methods. By plugging FairAdaBN into several backbones, we demonstrate its generalization ability. However, the current study only evaluates FairAdaBN on dermatology datasets; its generalization to other modalities (chest X-ray, brain MRI) or tasks (segmentation, detection), where unfairness issues also exist, remains to be evaluated. We also plan to explore unfairness mitigation for universal models [31].
References
- [1] Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W., Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., et al.: Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). pp. 168–172. IEEE (2018)
- [2] Combalia, M., Codella, N.C., Rotemberg, V., Helba, B., Vilaplana, V., Reiter, O., Carrera, C., Barreiro, A., Halpern, A.C., Puig, S., et al.: BCN20000: Dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288 (2019)
- [3] Creager, E., Madras, D., Jacobsen, J.H., Weis, M., Swersky, K., Pitassi, T., Zemel, R.: Flexibly fair representation learning by disentanglement. In: International conference on machine learning. pp. 1436–1445. PMLR (2019)
- [4] Deho, O.B., Zhan, C., Li, J., Liu, J., Liu, L., Duy Le, T.: How do the existing fairness metrics and unfairness mitigation algorithms contribute to ethical learning analytics? British Journal of Educational Technology (2022)
- [5] Deng, W., Zhong, Y., Dou, Q., Li, X.: On fairness of medical image classification with multiple sensitive attributes via learning orthogonal representations. arXiv preprint arXiv:2301.01481 (2023)
- [6] Dhar, P., Gleason, J., Roy, A., Castillo, C.D., Chellappa, R.: PASS: Protected attribute suppression system for mitigating bias in face recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15087–15096 (2021)
- [7] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.: Fairness through awareness. In: Proceedings of the 3rd innovations in theoretical computer science conference. pp. 214–226 (2012)
- [8] Groh, M., Harris, C., Daneshjou, R., Badri, O., Koochek, A.: Towards transparency in dermatology image datasets with skin tone annotations by experts, crowds, and an algorithm. arXiv preprint arXiv:2207.02942 (2022)
- [9] Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. Advances in neural information processing systems 29 (2016)
- [10] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. pp. 448–456. PMLR (2015)
- [11] Joshi, N., Burlina, P.: AI fairness via domain adaptation. arXiv preprint arXiv:2104.01109 (2021)
- [12] Kinyanjui, N.M., Odonga, T., Cintas, C., Codella, N.C., Panda, R., Sattigeri, P., Varshney, K.R.: Fairness of classifiers across skin tones in dermatology. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 320–329. Springer (2020)
- [13] Kusner, M.J., Loftus, J., Russell, C., Silva, R.: Counterfactual fairness. Advances in neural information processing systems 30 (2017)
- [14] Li, X., Cui, Z., Wu, Y., Gu, L., Harada, T.: Estimating and improving fairness with adversarial learning. arXiv preprint arXiv:2103.04243 (2021)
- [15] Lu, C., Lemay, A., Hoebel, K., Kalpathy-Cramer, J.: Evaluating subgroup disparity using epistemic uncertainty in mammography. arXiv preprint arXiv:2107.02716 (2021)
- [16] Narayanan, A.: Translation tutorial: 21 fairness definitions and their politics. In: Proc. Conf. Fairness Accountability Transp., New York, USA. vol. 1170, p. 3 (2018)
- [17] Petersen, E., Feragen, A., Zemsch, L.d.C., Henriksen, A., Christensen, O.E.W., Ganz, M.: Feature robustness and sex differences in medical imaging: a case study in mri-based alzheimer’s disease detection. arXiv preprint arXiv:2204.01737 (2022)
- [18] Puyol-Antón, E., Ruijsink, B., Piechnik, S.K., Neubauer, S., Petersen, S.E., Razavi, R., King, A.P.: Fairness in cardiac MR image analysis: An investigation of bias due to data imbalance in deep learning based segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 413–423. Springer (2021)
- [19] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731 (2019)
- [20] Sarhan, M.H., Navab, N., Eslami, A., Albarqouni, S.: On the fairness of privacy-preserving representations in medical applications. In: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pp. 140–149. Springer (2020)
- [21] Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I.Y., Ghassemi, M.: Chexclusion: Fairness gaps in deep chest X-ray classifiers. In: BIOCOMPUTING 2021: Proceedings of the Pacific Symposium. pp. 232–243. World Scientific (2020)
- [22] Suriyakumar, V.M., Papernot, N., Goldenberg, A., Ghassemi, M.: Chasing your long tails: Differentially private prediction in health care settings. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. pp. 723–734 (2021)
- [23] Tartaglione, E., Barbano, C.A., Grangetto, M.: EnD: Entangling and disentangling deep representations for bias correction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13508–13517 (2021)
- [24] Tschandl, P., Rosendahl, C., Kittler, H.: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data 5(1), 1–9 (2018)
- [25] Wang, M., Deng, W.: Mitigating bias in face recognition using skewness-aware reinforcement learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9322–9331 (2020)
- [26] Wu, Y., Zeng, D., Xu, X., Shi, Y., Hu, J.: Fairprune: Achieving fairness through pruning for dermatological disease diagnosis. arXiv preprint arXiv:2203.02110 (2022)
- [27] Yao, R., Cui, Z., Li, X., Gu, L.: Improving fairness in image classification via sketching. arXiv preprint arXiv:2211.00168 (2022)
- [28] Zhang, H., Dullerud, N., Roth, K., Oakden-Rayner, L., Pfohl, S., Ghassemi, M.: Improving the fairness of chest x-ray classifiers. In: Conference on Health, Inference, and Learning. pp. 204–233. PMLR (2022)
- [29] Zhao, H., Coston, A., Adel, T., Gordon, G.J.: Conditional learning of fair representations. arXiv preprint arXiv:1910.07162 (2019)
- [30] Zhao, Q., Adeli, E., Pohl, K.M.: Training confounder-free deep learning models for medical applications. Nature communications 11(1), 1–9 (2020)
- [31] Zhou, S.K., Greenspan, H., Davatzikos, C., Duncan, J.S., Van Ginneken, B., Madabhushi, A., Prince, J.L., Rueckert, D., Summers, R.M.: A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proceedings of the IEEE (2021)
- [32] Zietlow, D., Lohaus, M., Balakrishnan, G., Kleindessner, M., Locatello, F., Schölkopf, B., Russell, C.: Leveling down in computer vision: Pareto inefficiencies in fair deep classifiers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10410–10421 (2022)