
Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework

Nathan Drenkow∗1,2    Mitchell Pavlak∗1    Keith Harrigian1    Ayah Zirikly1    Adarsh Subbaswamy1    Mathias Unberath1
1 The Johns Hopkins University    2 The Johns Hopkins University Applied Physics Laboratory    ∗ Equal contribution
Abstract

Data-driven AI is establishing itself at the center of evidence-based medicine. However, reports of shortcomings and unexpected behavior are growing due to AI’s reliance on association-based learning. A major reason for this behavior is that latent bias in machine learning datasets can be amplified during training and/or hidden during testing. We present a data modality-agnostic auditing framework for generating targeted hypotheses about sources of bias, which we refer to as Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets. Our method examines the relationship between task-level annotations and data properties, including protected attributes (e.g., race, age, sex) and environment and acquisition characteristics (e.g., clinical site, imaging protocols). G-AUDIT automatically quantifies the extent to which the observed data attributes may enable shortcut learning or, in the case of testing data, hide predictions made based on spurious associations. We demonstrate the broad applicability and value of our method by analyzing large-scale medical datasets for three distinct modalities and learning tasks: skin lesion classification in images, stigmatizing language classification in Electronic Health Records (EHR), and mortality prediction for ICU tabular data. In each setting, G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods that focus primarily on social and ethical objectives, underscoring its practical value in exposing dataset-level risks and supporting the downstream development of reliable AI systems. Our method paves the way for a deeper understanding of machine learning datasets throughout the AI development life-cycle, from initial prototyping all the way to regulation, and creates opportunities to reduce model bias, enabling safer and more trustworthy AI systems.

1 Introduction

The use of AI in healthcare necessitates methods and algorithms that ensure safe, unbiased, and fair outcomes. Recent shifts towards data-driven methods, where performance depends heavily on increasingly large datasets, have created pressure to rapidly scale data collection and curation efforts. Collecting large datasets that meet data quality and diversity requirements for fair and safe algorithm development remains an ongoing challenge.

Datasets collected for machine learning (ML) applications often need to (1) accurately represent populations of interest [41], (2) ensure fair treatment and minimize bias wherever possible [40, 47, 26, 21, 16, 1], and (3) represent the types of conditions/quality expected in real-world settings while avoiding the inclusion of learning shortcuts [32, 49, 48, 50, 34]. These requirements are further impacted by privacy, safety, cost, and other systematic constraints that hinder or prevent the ability to ensure all desired dataset properties are met [24, 46, 12].

Oftentimes, data collection constraints and requirements may be at odds or impractical to achieve. For instance, some populations may not be well-represented by fixed-sample datasets (even when those datasets are considered large-scale) [2, 14, 35]. Similarly, balancing datasets for parity across protected subgroups may create unintentional biases relating to the representation of particular clinical sites, imaging protocols, or other unprotected groups [14, 37]. Even the notions of fairness and bias are nuanced, disconcertingly suggesting that data collection cannot possibly satisfy all definitions [4]. In short, despite careful efforts to construct unbiased datasets, dataset bias may be inevitable and may manifest in forms beyond traditional ethical or social considerations [24].

Without effective methods to audit and correct dataset bias, data-driven AI methods are at considerable risk of learning and amplifying implicit biases. Recent research shows that this risk manifests through fairness/bias in AI predictions [15, 41, 40, 16, 21], shortcut learning [48, 49, 34, 22, 9], and lack of generalization or robustness [36], all of which may result in harmful performance disparities of the model within and across patient populations. While algorithmic auditing methods [27, 1] provide the means to detect, quantify, and characterize limitations of trained machine learning models, they are applied post-hoc and force mitigation strategies to be reactive rather than proactive. The best opportunity to address data-driven AI bias risks starts with exposing data-level bias prior to model training or evaluation. Therefore, data audits, the focus of this work, are of utmost importance for identifying and reducing model bias as early as possible in the training and evaluation phases.

Figure 1: Given a dataset consisting of inputs with associated labels and metadata, we conduct our Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for data procedure to generate quantitative hypotheses on the relative risk of attributes in the form of a Detectability vs. Utility scatter plot (right).

Despite the importance of identifying bias at the dataset level, techniques for performing dataset audits remain largely absent [24]. On one hand, some methods examine disparities in medical dataset metadata [21, 1, 35] but do not explicitly link these disparities to the primary data with which machine learning models are trained. On the other hand, some methods show that patient attributes may be predictable from sensor-level measurements [15, 17] but may not fully identify whether the attribute-level features represent direct shortcuts for achieving accurate prediction. Other methods more directly examine the relation between data attributes and task model accuracy [41, 40, 22], yet these do not enable assessment of dataset-level risks independent of task-level modeling assumptions. Lastly, while most methods have focused on addressing bias with respect to conventional protected attributes (such as race, age, or sex), some have begun to explore biases that may relate to other aspects of the data collection pipeline such as image acquisition or clinician-level markings [37, 38]. In short, current dataset auditing methods do not holistically address both the risk of relevant attributes leaking information about the task and the risk that information about those attributes can be directly exploited during model training. Automated dataset auditing techniques are required to address these primary risks and generate hypotheses to guide downstream model development, auditing, and mitigation.

We present a novel technique that represents a significant advance in enabling independent, dataset-level auditing and which we refer to as Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets. Our method presents the first unified approach to shortcut auditing by considering the interplay between attribute-level composition, sensor-level measurements, and task labels. Our method is not limited to only protected patient-level attributes and allows for datasets to be audited with respect to any variables of relevance to the data collection process, patient population, and targeted machine learning task.

Our auditing procedure (Figure 1) utilizes the tools of information theory and causal inference to identify the presence and strength of bias between attributes ($A$) and task labels ($Y$) (utility) as well as the ability to infer the values of those attributes directly from the data ($X$) itself (detectability). In particular, we first quantify utility as the strength of the association $A \leftrightarrow Y$. However, high utility is not sufficient for demonstrating the presence of a shortcut because we also need to estimate how easily information about $A$ can be inferred from $X$ to solve the task. We therefore next measure detectability as the association $A \leftrightarrow X$ by predicting $\hat{A}$ from $X$ while controlling for the influence of $Y$. High utility and detectability are indicative of potential shortcuts and would be flagged by our auditing method as targets for downstream model audits or review by clinicians. In this work, we consider the issue of dataset bias across a diverse range of health applications and domains including tabular, text, and image data and identify sources of bias commonly overlooked or previously unknown.

2 Results

To demonstrate its broad effectiveness, we apply our G-AUDIT procedure to datasets from three different modalities: image, text, and tabular data. For each dataset, we estimate the utility and detectability (see Sec. 4) of each attribute relative to the underlying learning task. We do this to generate hypotheses that identify and rank potential shortcut attributes according to the risk that downstream models may exploit them. We also include results from an optional calibration procedure in which an approximate upper bound on the drop in performance-related metrics can be calculated for each potential shortcut using a synthetic attribute in precisely controlled conditions (see Section 4.4). The bound provides a means of estimating worst-case downstream model performance risk for specific attributes in more familiar metrics.

2.1 Skin Lesion Classification

We first consider shortcut risks in vision-based tasks. The high dimensionality of image data creates many opportunities for shortcuts to exist without directly impacting the machine learning task. For instance, hospital-specific tokens placed in the chest X-ray field of view (e.g., [50]) may impact only a small number of pixels in the image but may constitute a shortcut when associated primarily with a specific disease condition in the training dataset. In cases like these, DNNs trained on the data may exploit salient features like tokens/watermarks (or other statistical regularities unrelated to the task) to achieve low training error. We focus here on skin lesion classification, where the construction of large-scale datasets may not be able to adequately balance across patient characteristics, clinical sites, and dermoscopic imaging sensors/settings, and where such features of the dataset and collection process may manifest as shortcuts.

The ISIC 2019 skin lesion dataset [43, 6, 7] was analyzed for bias with respect to the included metadata and patient attributes. The dataset consists of 25,331 training images with the attributes age, race, sex, anatomical location, and skin color on the Fitzpatrick scale. Image metadata includes height, width, and year of collection. While the original task labels included seven diagnostic categories, we reduce the classification task to malignant vs. benign conditions (e.g., [51, 29, 31]). The data auditing procedure (see Sec. 4) was applied to the entire training dataset relative to the binary malignancy classification task, excluding any images with missing attribute values.

The main auditing results are shown in Figure 2, where the height, width, and year attributes exhibit the highest combination of utility and detectability, indicating these attributes are more likely sources of bias within the dataset. While these particular attributes may seem unrelated to the task labels, this image metadata can act as a proxy for the camera type and/or clinical site. Furthermore, while all images are resized and cropped to a fixed resolution prior to running the auditing procedure, the high detectability scores indicate that some of this information is still retained in the images themselves, either directly or via proxy.

Figure 2: (left) Attribute Detectability vs. Utility for the ISIC 2019 dataset. (right) Approximate worst-case performance risk for a synthetic attribute with varying utility. Vertical lines indicate observed values of utility and correspond to a worst-case drop in AUC of $\approx 0.2$.
Table 1: Comparison of Utility and Detectability measures with macro-averaged F1 scores from the attribute detection model and from conducting the SPLIT procedure [15, 17] on trained task models.
Attribute    Utility (G-AUDIT)    Detectability (G-AUDIT)    Detection F1 (baseline)    SPLIT F1 (baseline)
Year 0.052 0.862 0.952 0.981
Image Height 0.050 0.887 0.918 0.948
Image Width 0.048 0.865 0.510 0.583
Age 0.035 0.112 0.292 0.334
Anatomical Location 0.012 0.169 0.288 0.624
Sex 0.003 0.168 0.736 0.768
Skin Color (Fitzpatrick) 0.000 0.424 0.538 0.632

2.2 Stigmatizing Language in Electronic Health Records

We next consider the text domain, where the dimensionality of the data remains high and the variability of natural language creates unique opportunities for shortcuts to exist. As in the image domain, DNNs are often used to learn compact representations of natural language text for various tasks and may exploit shortcuts in the data that are not relevant to the clinical task yet allow them to achieve low training error. We apply the G-AUDIT procedure to the electronic medical record dataset introduced by [19] for the purpose of characterizing stigmatizing language usage by physicians. The dataset contains 5,201 annotated instances across 3 tasks, with each task focusing on a different thematic group of stigmatizing language: Credibility & Obstinacy, Compliance, and Descriptions of Appearance/Demeanor. Models are provided a window of text centered around a keyword or phrase which has been identified by domain experts as a potential indicator of unconscious bias towards a patient. They are asked to characterize the implication of the input instance (e.g., “the patient claims to brush their teeth 2x daily”; “unable to track down insurance claims”). Each instance is associated with auxiliary attributes which indicate a patient’s race and gender, and the clinical setting from which the statement was drawn (e.g., OB-GYN, Surgery).

The stigmatizing language dataset serves as an interesting case study for several reasons. First, the authors of [19] included an analysis which examined how well each attribute could be predicted from the last embedding layer of the primary stigmatizing language task models. Their results provide a direct reference for our method. Second, as is often true in practice, the stigmatizing language dataset has an ambiguous causal structure. This allows us to evaluate the robustness of our method to potential misspecification of the direction of dependency between attributes and labels. Finally, there are documented disparities in the prevalence of stigmatizing language between demographic groups which we expect to appear directly in our utility measure (e.g., Black patients are more likely than white patients to experience discrimination). To gather an unbiased estimate of the prevalence of stigmatizing language in the population, models should not make use of demographics as a predictive shortcut.

As shown in Figure 3, clinical specialty had a higher utility than both patient race and sex for the compliance and appearance / demeanor tasks. Prior work shows that downstream models for these tasks likely did not encode sex and race characteristics beyond what could be explained by a reliance on clinical specialty alone [19]. This is consistent with our observation that clinical specialty had a higher utility than race and sex, implying a stronger association with the task label. Importantly, this does not preclude performance disparities based on these sensitive attributes—clinical specialties in the JHM dataset include OB-GYN with an all-female patient population and Pediatrics with an approximately 95% Black patient population [19]. Instead, our results suggest that downstream models are more likely to exploit shortcuts related to the identification of different clinical domains than to directly encode race or sex to improve performance.

In terms of detectability, we find differences between conditioned and unconditioned measures to be fairly small. Interestingly, within the Credibility & Obstinacy task, while all attributes had relatively low utility, the detectability of sex was higher than for any other attribute across all EHR tasks evaluated. This presents a potential explanation for the findings of [19], which identified sex within the Credibility & Obstinacy task as the only demographic attribute recoverable above baseline levels across all three tasks.

Figure 3: Utility vs. detectability for three identical attributes across three different EHR-related tasks when calculating detectability for the (unconditioned) $X \rightarrow Y$ and (conditioned) $Y \rightarrow X$ cases. 95% confidence intervals calculated via basic bootstrap [11].

2.3 Mortality Prediction with Intensive Care Unit Tabular Data

Lastly, we consider the dataset auditing problem in the context of tabular data. While not as feature rich or high dimensional as images and text, tabular data presents its own unique challenges from an auditing perspective. While the features learned by task models in the image and text domains often lack interpretability (e.g., DNN embeddings), tabular data provides a direct mapping between attributes and their measured values. In these cases, attributes used as inputs to a task model can still act as shortcuts when they exhibit undesirable associations with the task labels that could arise due to sample bias or incorrect usage of the attribute itself.

For instance, if clinical sites differ in their test-ordering protocols, associations between disease conditions and clinical site may be reflected in the test orders/results provided to machine learning task models [42]. Furthermore, the gap in task performance between DNNs and more traditional ML models (e.g., SVMs, ensemble methods) is much smaller in the tabular case, which provides an opportunity to measure both detectability and task model performance over a wider class of machine learning models.

Here, we evaluated our auditing methodology on tabular data extracted from the publicly available Medical Information Mart for Intensive Care (MIMIC) III dataset [23]. Specifically, we extracted a dataset for predicting mortality in Intensive Care Unit (ICU) patients using features involved in the Simplified Acute Physiology Score II (SAPS II) [25, 39], a measure of disease severity in patients after their first 24 hours in the ICU. These features include patient age as well as summary statistics of heart rate, systolic blood pressure, temperature, labwork indicators/results, ventilator usage, and Glasgow Coma Scale (GCS) during the first 24 hours of the patient’s ICU stay.

Our task is to predict patient mortality from the tabularized data of a given patient. The final processed dataset consists of 34,386 patient records and 40 features, detailed in Appendix B. Features are one-hot encoded and include medication details, indicators for missing variables (e.g., GCS or lab test results), and patient demographics such as race and insurance coverage. We select an FT-Transformer [18] with default hyperparameters for use as a task model in all experiments. A benefit of this selection is the ability to apply the SPLIT approach [17, 15] to empirically test how well our detectability measure corresponds with other baseline approaches for measuring detectability.

As shown in Figure 5, we find that, across model classes, G-AUDIT-based detectability correlates strongly with SPLIT-measured recoverability, with Spearman’s $\rho$ values of .92, .92, .76, .84, .75, and .86 for decision tree, random forest, logistic regression, FT-Transformer, XGBoost, and naive Bayes task models respectively (all $p < .0001$). Importantly, both obvious shortcuts such as ‘temperature missing’ (easily detectable because it is explicitly represented in the input data as a negative placeholder value for temperature) and less obvious instances such as dopamine, norepinephrine, vasopressin, ventilator, and IV usage are highly detectable. We do note some variability in performance that does not always appear correlated with model strength as measured in Figure 4. This suggests that using a small ensemble of models of varying complexity may provide a more holistic view of detectability.
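As a concrete illustration of how these rank correlations can be computed, the following sketch applies SciPy’s spearmanr to per-attribute detectability and SPLIT F1 values. The numbers shown are the ISIC values from Table 1, used solely to demonstrate the computation; the per-attribute MIMIC values underlying the $\rho$ values above are not reproduced here.

```python
import numpy as np
from scipy.stats import spearmanr

# Per-attribute detectability and SPLIT F1 (values from Table 1, ISIC 2019,
# used here only to illustrate the rank-correlation computation).
detectability = np.array([0.862, 0.887, 0.865, 0.112, 0.169, 0.168, 0.424])
split_f1 = np.array([0.981, 0.948, 0.583, 0.334, 0.624, 0.768, 0.632])

rho, p_value = spearmanr(detectability, split_f1)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```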

Figure 4: We first calibrate expectations by exploring the unmodified performance of various models on the task of predicting mortality.
Figure 5: We order each attribute from least to greatest recoverability in a downstream FT-Transformer task model as measured by SPLIT [17, 15]. Attributes that are more recoverable tend to have higher detectability values, though there is variance from model to model.

By running our synthetic calibration process (Sec. 4.4), we identify that some of the attributes in our set could cause worst-case performance drops as large as approximately 0.2 AUC for an average task model, assuming full detectability (see Figure 6).

Figure 6: Following our calibration procedure, we can explore the worst-case impact a completely detectable shortcut has on downstream models. Dotted vertical lines correspond to the utility values of real attributes in the dataset. Green lines correspond to test sets where the shortcut’s relationship to the label holds true; red lines correspond to the worst-case counterfactual set.

3 Discussion

G-AUDIT provides a procedural means to identify and quantify potential sources of dataset bias and machine learning shortcuts. The technique identifies potentially harmful relationships between attributes and task labels that may be exploited via the input data. G-AUDIT estimates the presence and strength of these relationships through attribute utility and detectability. Attributes with both high utility and high detectability are indicative of dataset bias and act as primary targets for downstream algorithmic auditing.

G-AUDIT is the first known method for fully quantitative measurement of dataset bias relative to both clinical and imaging factors. Prior to this method, the closest points of comparison consider only protected attributes (such as race, age, or sex) and overlook other aspects of the dataset which may be a stronger source of bias (such as the collection site or model of imaging device). In fact, our analysis finds that non-patient attributes are often of greatest concern, which is consistent with previous studies that focused on specific shortcut learning scenarios such as the use of digital watermarks in x-ray images [9] or the presence of chest drains for detecting pneumothorax [35]. While these cases were found through manual inspection by researchers, our auditing procedure would automatically and more efficiently enable detection of these possible sources of bias prior to model training. Since our audits are implicitly tied to measurable and interpretable attributes of the data, clinicians and AI developers are better equipped to interpret the results of the audit and identify relevant courses of action (e.g., bias mitigation or model auditing strategies).

We show the generality of G-AUDIT by applying it across multiple machine learning tasks and data modalities. G-AUDIT is shown to be effective in both causal and anti-causal scenarios where the relationship between the labels and data may be reversed. While G-AUDIT requires domain knowledge to estimate the directionality, our method gracefully handles each condition given the a priori assumption. Future work will consider augmenting G-AUDIT to include detection of label-data directionality.

While G-AUDIT is guaranteed to examine every sample in the dataset, as the size of the audited datasets or the number of attributes increases, additional considerations will be necessary to ensure that G-AUDIT remains computationally viable. Nonetheless, unlike algorithmic audits, which require the procedure to be run for each new task model, G-AUDIT can be run once per dataset-task combination, and adding new tasks only requires re-calculating utility and detectability given the original attribute predictions and new task labels.

In this work, we provide a generalized, quantitative technique for generating hypotheses about dataset bias to inform downstream model training and auditing. As adoption of data-driven machine learning methods for safety- and cost-critical medical applications continues to grow, we must have principled quantitative methods for analyzing the underlying training and evaluation data. G-AUDIT provides the first such approach and results demonstrate that commonly overlooked dataset attributes may induce dataset bias and ultimately lead to AI failures and disparities in diagnosis and treatment of underlying patient conditions. Our method provides a positive step towards identifying and mitigating these risks.

4 Methods

4.1 Utility and Detectability

The core objective of our data auditing procedure is to identify and quantify the degree to which each attribute of the data represents a potential learning shortcut.

For our purposes, datasets consist of samples ($X$), associated metadata or attributes ($A$), and task labels ($Y$). We establish the existence (not direction) of a relationship $A \leftrightarrow Y$ through a measure we call utility. The utility of an attribute refers to our ability to infer the value of the task label $Y$ simply by observing $A$. In the extreme case, if $A$ is perfectly correlated with $Y$, then machine learning models simply need to detect $A$ in order to correctly solve the task.

However, a large value for an attribute’s utility is not sufficient to consider it a shortcut. For this, we also need to know the attribute’s detectability. Since machine learning models will typically only take $X$ as input, detectability measures the extent to which the value of $A$ can be inferred from $X$.

Attributes with high utility and high detectability represent the greatest risk for biasing downstream task models. However, while we use utility and detectability as a proxy for risk, features that are causally relevant to the task and have high utility and detectability are useful rather than representing possible shortcuts. We must rely on domain expertise to determine whether high-risk attributes are reasonable features or unanticipated dataset flaws.

4.2 Measuring Utility and Detectability

We use information theory and principles of causal inference to measure the utility and detectability of dataset attributes. Utility is measured as the mutual information $MI(A;Y) = H(Y) - H(Y|A)$, adjusted for chance as per [44, 45], and represents the reduction in uncertainty about the value of $Y$ from observing $A$ after adjusting for the entropy of the underlying distributions and for chance based on the number of categories. We rely on the faithfulness assumption, which implies that if there is a relationship between $A$ and $Y$, then $MI(A;Y) > 0$.
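As a minimal sketch (not the exact implementation used in our experiments), this chance-adjusted mutual information between a discrete attribute and the task labels can be computed with scikit-learn’s adjusted_mutual_info_score, which implements the correction for chance of Vinh et al. [44, 45]:

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def utility(attribute: np.ndarray, labels: np.ndarray) -> float:
    """Chance-adjusted mutual information between a discrete attribute A and
    task labels Y (higher values indicate a stronger A <-> Y association)."""
    return adjusted_mutual_info_score(labels, attribute)

# Toy example: an attribute that tracks the label on roughly 90% of samples.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
a = np.where(rng.random(1000) < 0.9, y, 1 - y)
print(f"utility(A, Y) = {utility(a, y):.3f}")
```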

For detectability, the objective is to determine the ability to infer the value of $A$ from the data. We accomplish this by first training a surrogate model $f: X \rightarrow \hat{A}$ to predict $A$ from $X$. We next leverage clinician and domain expertise to establish the likely direction of dependency between data and labels. The primary role of this step is to determine whether the relationship is $Y \rightarrow X$ or $X \rightarrow Y$. Our ultimate goal is to identify whether there is a relationship $A \leftrightarrow X$ and, if so, to measure its strength. However, we need to control for the potential of information leaking through $Y$, and this process relies on the assumed/known relationship between $Y$ and $X$.

In anti-causal scenarios (i.e., $Y \rightarrow X$), it could be the case that $A \rightarrow Y \rightarrow X$, and as a result information about the attribute could leak into $X$ not through a shortcut, but through solving the task itself. To control for this, we condition on $Y$ during the training process of our surrogate attribute prediction model $f: X \rightarrow \hat{A}$. Specifically, we assume either that $Y$ is already discrete or define $Y^D$ as a sufficiently granular discretization of $Y$. Then, we partition $X, Y^D$ into disjoint training subsets $S$ such that for a given subset $S_i$, all $y^D_j \in S_i$ have the same value. For each subset $S_i \in S$, we train a separate surrogate $f_i(X) = \hat{A}$. By training separate surrogates in this manner, we ensure that differences in task labels $Y$ that could be used by $f_i$ to predict a given attribute are minimized. Based on test set predictions obtained using cross-validation with all $f_i$, we obtain a full set of predictions ($\hat{A}$) for the entire dataset and can similarly measure $MI(A;\hat{A})$ to understand how recoverable $A$ is from $X$ while reducing the impact of task-relevant information.

In the causal case (i.e., $X \rightarrow Y$), we are able to directly estimate the relationship without conditioning. In fact, we cannot condition on $Y$ since $Y$ represents a collider (i.e., $Y$ may be dependent on both $X$ and $A$). In that case, measuring the strength of the relationship between $X$ and $A$ given $Y$ could give falsely inflated results, since conditioning on a collider $Y$ creates an otherwise non-existent association between its parents ($X, A$). For example, if two conditions both increase mortality rate through different pathways, then given that a patient has died, knowledge of either attribute provides information about the other (e.g., if $A$ is the presence or absence of trauma wounds and $X$ contains information relating to disease, a patient who died but did not have trauma wounds is more likely to have had disease: $A \leftrightarrow X \mid Y$ even though $A \perp X$). As a result, we do not condition for cases where we expect $X \rightarrow Y$ and instead directly estimate $MI(A;\hat{A})$.

4.3 Data Audit Procedure

A key aspect of the data auditing procedure is that every sample in the dataset contributes to the calculation of attribute utility and detectability. For determining utility, only the labels and metadata are required, and utility can be estimated without any model training. However, detectability requires attribute prediction, so we ensure that every sample in the dataset contributes to the detectability estimate using a cross-validation procedure that yields unbiased predictions of $\hat{A}$.

We first partition the full dataset into $K$ disjoint folds. For each fold, we hold out the data of that fold for testing and use the data from the $K-1$ remaining folds to train or finetune a sufficiently expressive machine learning model to predict $\hat{A}$ from $X$. Given that trained model, we predict $\hat{A}$ on the held-out fold’s data. After repeating the train/test procedure for each fold, and, as necessary due to the dependency between attribute and label, for each value in $Y^D$, we aggregate all predictions together to compute detectability measures.
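A minimal sketch of this procedure for a tabular dataset is shown below, assuming discrete, integer-encoded attributes and labels. The logistic-regression surrogate and fold count are illustrative stand-ins for the modality-specific models described in Section 4.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.model_selection import cross_val_predict

def detectability(X, A, Y, n_folds=5, condition_on_y=True):
    """Estimate detectability of attribute A from inputs X.

    In the anti-causal setting (Y -> X), a separate surrogate is fit within
    each task-label stratum (conditioning on Y); in the causal setting
    (X -> Y), a single surrogate is fit on the full dataset. Out-of-fold
    predictions A_hat are aggregated over all samples before scoring.
    """
    A_hat = np.empty_like(A)
    strata = ([np.where(Y == y)[0] for y in np.unique(Y)]
              if condition_on_y else [np.arange(len(A))])
    for idx in strata:
        surrogate = LogisticRegression(max_iter=1000)  # illustrative surrogate
        A_hat[idx] = cross_val_predict(surrogate, X[idx], A[idx], cv=n_folds)
    # Chance-adjusted MI between true and predicted attribute values.
    return adjusted_mutual_info_score(A, A_hat)
```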

4.4 Bounding Performance Risk

While utility and detectability provide quantitative measures of bias, how to interpret them in the context of potential model performance risks is less clear. The magnitudes of these measures are not easily comparable across datasets, so we cannot rely on thresholds or conventions to directly assess risk. We could instead look to the relative rankings of attributes with respect to utility/detectability to understand attribute risk, but we cannot directly translate these values into drops in task-relevant metrics like AUC.

To address this concern, we provide a supplementary method for generating an upper bound on performance risk against which attribute utility can be interpreted. To construct the bound, we first create a synthetic attribute which we can insert with 100% detectability (e.g., a visible watermark in a fixed image position, a single token added to a text input, an added column in tabular data). We then vary the utility of this synthetic attribute and train a task model for each case. We evaluate the task models on synthetic and counterfactual data distributions and measure the resulting performance drop between the two distributions. The counterfactual data distribution is constructed to create the worst-case scenario in which the synthetic attribute is anti-correlated with the true label, so that any shortcut exploited during training will yield worst-case behavior at test time.

Formally, let $X$ be the feature data, $Y$ be the true binary task labels (of length $N_Y$), and $A$ be the synthetic artifact values. For simplicity, we assume $A$ is also binary and initialize the values of $A$ to be the same as $Y$ in the dataset (i.e., the normalized utility would be $1.0$ when $A = Y$). To vary the utility, we randomly select an index set $\mathcal{I}$ of $N < \frac{N_Y}{2}$ rows of $A$ and flip the values in those rows (i.e., if $i \in \mathcal{I}$ then $A[i] = \neg Y[i]$). This preserves the existence of the relationship between $Y$ and $A$ but reduces its strength. For each $i \in \mathcal{I}$, we also insert a fully detectable artifact into $X$. We run the same training and testing procedure as described in Section 4.5 to obtain the baseline task performance numbers for the synthetic distribution.

To construct the counterfactual distribution, we create an additional test set as follows. For each image in the test set, we insert the synthetic artifact $A_C$ to be anti-correlated with the label such that $Y = 0 \Rightarrow A_C = 1$ and vice versa. In the worst case, if the task model exploited the synthetic artifact shortcut during training, then at test time it will be more likely to predict $\hat{Y} = 1$ when $A_C = 1$, resulting in more errors and lower AUC. As the utility of $A$ increases in the training set, the risk of performance degradation in this context also increases.
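The construction of the synthetic attribute and its worst-case counterfactual counterpart can be sketched as below; the artifact-insertion step itself is modality-specific (a fixed-position watermark, an added token, or an extra column) and is therefore left as a comment rather than implemented.

```python
import numpy as np

def make_synthetic_attribute(Y, n_flips, seed=0):
    """Initialize A = Y (normalized utility of 1.0), then flip A on a random
    index set of size n_flips < len(Y) / 2 to weaken the A <-> Y association."""
    rng = np.random.default_rng(seed)
    A = Y.copy()
    flip_idx = rng.choice(len(Y), size=n_flips, replace=False)
    A[flip_idx] = 1 - Y[flip_idx]
    return A, flip_idx

def counterfactual_attribute(Y_test):
    """Worst-case test-time artifact assignment: anti-correlated with the true
    label (Y = 0 => A_C = 1 and vice versa)."""
    return 1 - Y_test

# A fully detectable artifact encoding the synthetic attribute (e.g., a
# watermark, token, or column) is then inserted into X, a task model is
# trained on the synthetic training distribution, and the drop in AUC is
# measured on the counterfactual test set.
```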

We view this as an approximate upper bound on risk because we have controlled for the detectability of the synthetic attribute and can therefore understand the performance risks associated with each value of utility. To assess risk for attributes in the original dataset, we can look at the worst-case AUC drop relative to the measured utility of the original attributes.

4.5 Model Training and Prediction

4.5.1 Skin Lesion Classification

For the attribute prediction step of our auditing procedure, we train ResNet18 [20] networks since they are sufficiently expressive for determining attribute detectability but not as prone to overfitting or training instability as larger or more complex architectures. For task label prediction, we train a more expressive SwinT [28] (using RandAugment [8]) that is capable of solving the more challenging vision task and demonstrates that the detectability results produced by our auditing procedure generalize to more complex architectures.

For continuous attributes (e.g., age, image height/width), we discretize the attribute values and train the attribute predictor in a multi-class setting. For instance, for age, we take $y = \lfloor \frac{\text{age}}{5} \rfloor$, which yields 18 total classes for the age attribute predictor. We use a similar binning procedure for the image height, width, and year.
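For example (assuming ages given in whole years), this binning amounts to integer floor division:

```python
import numpy as np

ages = np.array([4, 23, 37, 58, 85, 89])  # illustrative ages in years
age_class = ages // 5                      # y = floor(age / 5)
print(age_class)                           # [ 0  4  7 11 17 17]; ages 0-89 span 18 classes
```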

All prediction networks are pretrained on ImageNet with weights provided by the popular torchvision package [33]. For computing G-AUDIT’s detectability, networks are fine-tuned for 10 epochs using the AdamW optimizer [30] with a learning rate of 5e-5, weight decay of 0.01, momentum parameters $(0.9, 0.999)$, and linear learning rate decay with $\gamma = 0.7$. Cross-entropy loss is used for training both attribute prediction and task models. All images are resized to $(224, 224)$ and normalized using standard ImageNet mean/std statistics. Horizontal/vertical flips are only applied during attribute predictor network training.
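A sketch of this fine-tuning configuration in PyTorch/torchvision is given below. The number of attribute classes is a per-attribute placeholder, and the scheduler is one plausible reading of the linear decay with $\gamma = 0.7$ (decaying the learning rate linearly to a factor of 0.7 over the 10 epochs); these are assumptions rather than the exact training script.

```python
import torch
import torchvision
from torch import nn
from torchvision import transforms

num_attribute_classes = 18  # placeholder: e.g., binned age; set per attribute

# ImageNet-pretrained ResNet18 used as the attribute-prediction surrogate.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, num_attribute_classes)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5,
                              weight_decay=0.01, betas=(0.9, 0.999))
# Assumed interpretation of the linear decay: anneal to 0.7x over 10 epochs.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.7, total_iters=10)
criterion = nn.CrossEntropyLoss()

# Resizing, flip augmentation (attribute-predictor training only), and
# ImageNet normalization as described above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```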

4.5.2 Stigmatizing Language in Electronic Health Records

We fine-tune BERT [10] models for both attribute prediction and each of the three clinical tasks: (1) compliance, (2) appearance and demeanor, and (3) credibility and obstinacy. Attributes we evaluate as potential shortcuts include patient race, gender, and clinical specialty (see Appendix A). The dataset consists of manually annotated samples with a context window of 10 words before and 10 after each identified potentially stigmatizing anchor word [19]. The full list of anchor words for each task is available in Appendix A.0.1. As each task has a unique set of anchors, the input and attribute metadata are different across tasks. Hyperparameters for all task and attribute models are held constant. We use AdamW [30] with a fixed learning rate of 5e-5 and weight decay of 1e-5, a batch size of 16, dropout with probability 0.1, 10 training epochs with early stopping, and class balanced cross-entropy loss for all experiments.
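A minimal sketch of this text-model setup with the Hugging Face transformers library follows. The bert-base-uncased checkpoint, the illustrative class counts, and the single example forward pass are assumptions made for illustration; the optimizer settings, dropout, and class-balanced loss mirror the configuration described above (early stopping and the training loop are omitted).

```python
import torch
from torch import nn
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2, hidden_dropout_prob=0.1)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-5)

# Class-balanced cross-entropy: weight each class inversely to its frequency.
class_counts = torch.tensor([4000.0, 1200.0])  # illustrative counts
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Forward pass on one illustrative context window around an anchor word.
batch = tokenizer(["the patient claims to brush their teeth 2x daily"],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(**batch).logits
```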

4.5.3 Mortality Prediction from Intensive Care Unit Data

Our task models are FT-Transformers [18] trained with default hyperparameters to predict mortality. For attribute prediction, we compare logistic regression, decision tree, Random Forest [3], FT-Transformer, naive Bayes, and XGBoost [5] models. We discretize continuous-valued attributes prior to calculating mutual information-based estimators. We select the number of bins for discretization automatically via the Freedman-Diaconis rule [13]. To ensure sufficient examples across cross-validation splits, we combine or drop categories of attributes where necessary when they have fewer than 100 members in our dataset (e.g., the ethnicities Guatemalan and Honduran).
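The binning and category-merging steps might look like the following sketch; mapping merged rare categories into a shared 'Other' level is an assumption, since categories may alternatively be dropped as noted above.

```python
import numpy as np
import pandas as pd

def discretize_fd(values: np.ndarray) -> np.ndarray:
    """Discretize a continuous attribute using bin edges chosen by the
    Freedman-Diaconis rule prior to mutual-information estimation."""
    edges = np.histogram_bin_edges(values, bins="fd")
    return np.digitize(values, edges[1:-1])

def merge_rare_categories(col: pd.Series, min_count: int = 100) -> pd.Series:
    """Map categories with fewer than min_count members into a shared level."""
    counts = col.value_counts()
    rare = counts[counts < min_count].index
    return col.where(~col.isin(rare), other="Other")
```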

4.5.4 SPLIT method

As an alternative baseline for estimating detectability, and following Gichoya et al. [15], we test whether task model representations implicitly encode attribute information. Given a pre-trained classifier, we remove the final fully-connected layer and replace it with a randomly initialized linear layer. We freeze the weights of the pre-trained network and finetune only the linear layer to predict the specified attribute value. As before, we perform cross-validation and measure the performance of the finetuned model on the aggregated predictions across all folds. Better-than-chance performance is an indicator that model representations encode some degree of attribute information. However, it does not necessarily indicate that the dataset itself is biased, as some attribute information may be relevant to solving the task even when the dataset is balanced with respect to the attribute itself.
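A sketch of this SPLIT-style linear probe in PyTorch is shown below, assuming a ResNet-style task model with a .fc classification head; for other architectures the corresponding classification head would be replaced instead.

```python
import torch
import torchvision
from torch import nn

def make_split_probe(task_model: nn.Module, num_attr_classes: int) -> nn.Module:
    """Freeze a trained task model and replace its final fully connected layer
    with a randomly initialized linear layer; only the new layer is trained to
    predict the attribute."""
    for param in task_model.parameters():
        param.requires_grad = False
    task_model.fc = nn.Linear(task_model.fc.in_features, num_attr_classes)
    return task_model  # the fresh nn.Linear parameters remain trainable

# Example: probe a placeholder ResNet18 task model for a binary attribute.
probe = make_split_probe(torchvision.models.resnet18(weights=None),
                         num_attr_classes=2)
optimizer = torch.optim.AdamW(
    (p for p in probe.parameters() if p.requires_grad), lr=5e-5)
```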

5 Data Availability

Data for the ICU mortality prediction and skin lesion classification cases are publicly available. Because clinical notes from the Johns Hopkins Medicine (JHM) dataset contain identifiable information, they cannot be shared outside of our study team; however, all code to replicate the experiment on other datasets is available.

References

  • Aka et al. [2021] O. Aka, K. Burke, A. Bauerle, C. Greer, and M. Mitchell. Measuring model biases in the absence of ground truth. In AAAI/ACM AIES. ACM, 2021.
  • Banerjee et al. [2023] I. Banerjee, K. Bhattacharjee, J. L. Burns, H. Trivedi, S. Purkayastha, L. Seyyed-Kalantari, B. N. Patel, R. Shiradkar, and J. Gichoya. “shortcuts” causing bias in radiology artificial intelligence: Causes, evaluation, and mitigation. Journal of the American College of Radiology, 20(9):842–851, 2023. ISSN 1546-1440. doi: https://doi.org/10.1016/j.jacr.2023.06.025. URL https://www.sciencedirect.com/science/article/pii/S1546144023005264.
  • Breiman [2001] L. Breiman. Random forests. Machine learning, 45:5–32, 2001.
  • Castelnovo et al. [2022] A. Castelnovo, R. Crupi, G. Greco, D. Regoli, I. G. Penco, and A. C. Cosentini. A clarification of the nuances in the fairness metrics landscape. Sci. Rep., 12(1):4209, Mar. 2022.
  • Chen and Guestrin [2016] T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
  • Codella et al. [2018] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pages 168–172. IEEE, 2018.
  • Combalia et al. [2019] M. Combalia, N. C. Codella, V. Rotemberg, B. Helba, V. Vilaplana, O. Reiter, C. Carrera, A. Barreiro, A. C. Halpern, S. Puig, et al. Bcn20000: Dermoscopic lesions in the wild. arXiv preprint arXiv:1908.02288, 2019.
  • Cubuk et al. [2019] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. RandAugment: Practical automated data augmentation with a reduced search space. Sept. 2019.
  • DeGrave et al. [2021] A. J. DeGrave, J. D. Janizek, and S.-I. Lee. Ai for radiographic covid-19 detection selects shortcuts over signal. Nature Machine Intelligence, 2021.
  • Devlin et al. [2019] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423.
  • Efron and Tibshirani [1993] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.
  • Fabbrizzi et al. [2022] S. Fabbrizzi, S. Papadopoulos, E. Ntoutsi, and I. Kompatsiaris. A survey on bias in visual datasets. Comput. Vis. Image Underst., 223:103552, Oct. 2022.
  • Freedman and Diaconis [1981] D. Freedman and P. Diaconis. On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(4):453–476, Dec. 1981. ISSN 1432-2064. doi: 10.1007/BF01025868. URL https://doi.org/10.1007/BF01025868.
  • Gianfrancesco et al. [2018] M. A. Gianfrancesco, S. Tamang, J. Yazdany, and G. Schmajuk. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern. Med., 178(11):1544–1547, Nov. 2018.
  • Gichoya et al. [2022] J. W. Gichoya, I. Banerjee, A. R. Bhimireddy, J. L. Burns, L. A. Celi, L.-C. Chen, R. Correa, N. Dullerud, M. Ghassemi, S.-C. Huang, P.-C. Kuo, M. P. Lungren, L. J. Palmer, B. J. Price, S. Purkayastha, A. T. Pyrros, L. Oakden-Rayner, C. Okechukwu, L. Seyyed-Kalantari, H. Trivedi, R. Wang, Z. Zaiman, and H. Zhang. Ai recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health, 2022.
  • Glocker et al. [2022] B. Glocker, C. Jones, M. Bernhardt, and S. Winzeck. Risk of bias in chest x-ray foundation models. Sept. 2022.
  • Glocker et al. [2023] B. Glocker, C. Jones, M. Bernhardt, and S. Winzeck. Algorithmic encoding of protected characteristics in chest X-ray disease detection models. eBioMedicine, 89:104467, Mar. 2023. ISSN 23523964. doi: 10.1016/j.ebiom.2023.104467. URL https://linkinghub.elsevier.com/retrieve/pii/S2352396423000324.
  • Gorishniy et al. [2021] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko. Revisiting deep learning models for tabular data. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18932–18943. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/9d86d83f925f2149e9edb0ac3b49229c-Paper.pdf.
  • Harrigian et al. [2023] K. Harrigian, A. Zirikly, B. Chee, A. Ahmad, A. Links, S. Saha, M. C. Beach, and M. Dredze. Characterization of stigmatizing language in medical records. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 312–329, 2023.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE/CVPR, pages 770–778, 2016.
  • Henry Hinnefeld et al. [2018] J. Henry Hinnefeld, P. Cooman, N. Mammo, and R. Deese. Evaluating fairness metrics in the presence of dataset bias. Sept. 2018.
  • Jabbour et al. [2020] S. Jabbour, D. Fouhey, E. Kazerooni, M. W. Sjoding, and J. Wiens. Deep learning applied to chest x-rays: Exploiting and preventing shortcuts. In Machine Learning for Healthcare Conference, pages 750–782. PMLR, 2020.
  • Johnson et al. [2016] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
  • Jones et al. [2024] C. Jones, D. C. Castro, F. De Sousa Ribeiro, O. Oktay, M. McCradden, and B. Glocker. A causal perspective on dataset bias in machine learning for medical imaging. Nature Machine Intelligence, 6(2):138–146, Feb. 2024.
  • Le Gall et al. [1993] J.-R. Le Gall, S. Lemeshow, and F. Saulnier. A New Simplified Acute Physiology Score (SAPS II) Based on a European/North American Multicenter Study. JAMA, 270(24):2957–2963, 12 1993. ISSN 0098-7484. doi: 10.1001/jama.1993.03510240069035. URL https://doi.org/10.1001/jama.1993.03510240069035.
  • Li and Vasconcelos [2019] Y. Li and N. Vasconcelos. REPAIR: Removing representation bias by dataset resampling. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019.
  • Liu et al. [2022] X. Liu, B. Glocker, M. M. McCradden, M. Ghassemi, A. K. Denniston, and L. Oakden-Rayner. The medical algorithmic audit. Lancet Digit Health, 2022.
  • Liu et al. [2021] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVPR, 2021.
  • Lopez et al. [2017] A. R. Lopez, X. Giro-i Nieto, J. Burdick, and O. Marques. Skin lesion classification from dermoscopic images using deep learning techniques. In 2017 13th IASTED international conference on biomedical engineering (BioMed), pages 49–54. IEEE, 2017.
  • Loshchilov [2017] I. Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Mahbod et al. [2019] A. Mahbod, G. Schaefer, C. Wang, R. Ecker, and I. Ellinge. Skin lesion classification using hybrid deep neural networks. In ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 1229–1233. IEEE, 2019.
  • Mahmood et al. [2021] U. Mahmood, R. Shrestha, D. D. Bates, L. Mannelli, G. Corrias, Y. E. Erdi, and C. Kanan. Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems. Frontiers in digital health, page 85, 2021.
  • maintainers and contributors [2016] T. maintainers and contributors. Torchvision: Pytorch’s computer vision library. https://github.com/pytorch/vision, 2016.
  • Nauta et al. [2021] M. Nauta, R. Walsh, A. Dubowski, and C. Seifert. Uncovering and correcting shortcut learning in machine learning models for skin cancer diagnosis. Diagnostics, 2021.
  • Oakden-Rayner et al. [2020] L. Oakden-Rayner, J. Dunnmon, G. Carneiro, and C. Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proc. of the ACM conference on health, inference, and learning, pages 151–159, 2020.
  • O’Brien et al. [2022] M. O’Brien, J. Bukowski, G. Hager, A. Pezeshk, and M. Unberath. Evaluating neural network robustness for melanoma classification using mutual information. In Medical Imaging 2022: Image Processing. SPIE, 2022.
  • Ong Ly et al. [2024] C. Ong Ly, B. Unnikrishnan, T. Tadic, T. Patel, J. Duhamel, S. Kandel, Y. Moayedi, M. Brudno, A. Hope, H. Ross, and C. McIntosh. Shortcut learning in medical AI hinders generalization: method for estimating AI model generalization without external data. npj Digital Medicine, 7(1):1–10, May 2024.
  • Pavlak et al. [2023] M. F. Pavlak, N. G. Drenkow, N. Petrick, M. M. Farhangi, and M. Unberath. Data AUDIT: Identifying attribute utility- and detectability-induced bias in task models. Med. Image Comput. Comput. Assist. Interv., pages 442–452, Apr. 2023.
  • Pirracchio et al. [2015] R. Pirracchio, M. L. Petersen, M. Carone, M. R. Rigon, S. Chevret, and M. J. van der Laan. Mortality prediction in intensive care units with the super icu learner algorithm (sicula): a population-based study. The Lancet Respiratory Medicine, 3(1):42–52, 2015.
  • Seyyed-Kalantari et al. [2021a] L. Seyyed-Kalantari, G. Liu, M. McDermott, I. Y. Chen, and M. Ghassemi. CheXclusion: Fairness gaps in deep chest x-ray classifiers. Pac. Symp. Biocomput., 2021a.
  • Seyyed-Kalantari et al. [2021b] L. Seyyed-Kalantari, H. Zhang, M. B. A. McDermott, I. Y. Chen, and M. Ghassemi. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med., 2021b.
  • Subbaswamy et al. [2021] A. Subbaswamy, R. Adams, and S. Saria. Evaluating model robustness and stability to dataset shift. In A. Banerjee and K. Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2611–2619. PMLR, 2021.
  • Tschandl et al. [2018] P. Tschandl, C. Rosendahl, and H. Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 2018.
  • Vinh et al. [2009] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, page 1073–1080, New York, NY, USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553511. URL https://doi.org/10.1145/1553374.1553511.
  • Vinh et al. [2010] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. JMLR, 2010.
  • Wen et al. [2021] D. Wen, S. M. Khan, A. J. Xu, H. Ibrahim, L. Smith, J. Caballero, L. Zepeda, C. de Blas Perez, A. K. Denniston, X. Liu, et al. Characteristics of publicly available skin cancer image datasets: a systematic review. The Lancet Digital Health, 2021.
  • [47] M. Wick, S. Panda, and J.-B. Tristan. Unlocking fairness: A trade-off revisited. Accessed: 2023-2-1.
  • Winkler et al. [2019] J. K. Winkler, C. Fink, F. Toberer, A. Enk, T. Deinlein, R. Hofmann-Wellenhof, L. Thomas, A. Lallas, A. Blum, W. Stolz, et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA dermatology, 2019.
  • Winkler et al. [2021] J. K. Winkler, K. Sies, C. Fink, F. Toberer, A. Enk, M. S. Abassi, T. Fuchs, and H. A. Haenssle. Association between different scale bars in dermoscopic images and diagnostic performance of a market-approved deep learning convolutional neural network for melanoma recognition. European Journal of Cancer, 2021.
  • Zech et al. [2018] J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine, 2018.
  • Zhang et al. [2019] J. Zhang, Y. Xie, Y. Xia, and C. Shen. Attention residual learning for skin lesion classification. IEEE transactions on medical imaging, 38(9):2092–2103, 2019.

Appendix A EHR Dataset

A.0.1 EHR Tasks:

The Johns Hopkins Medicine (JHM) dataset we use seeks to enable the prediction of three types of stigma in healthcare. For each of these three tasks, ground truth labels from the initial work were formed in a two-step process. First, a set of anchor words specific to each task was identified in the raw note text. Data samples were created consisting of the ten words to the left and right of each identified anchor word. Examples were labeled as to whether they represented a case of stigmatizing language (e.g., ’the patient claims they were..’) or not (e.g., ’the patient’s claims were denied’). The team that originally annotated each example consisted of one research assistant and several physician coauthors. At least two annotators independently labeled each example.

The three tasks and associated anchor words are:

  1. Credibility & Obstinacy. Physician doubt regarding patient testimony or descriptions of patients as obstinate.

     Anchor Words: adamant, adamantly, adament, adamently, claim, claimed, claiming, claims, insist, insisted, insistence, insisting, insists

  2. Compliance. Related to whether or not patients appear to follow medical advice.

     Anchor Words: Adherance, adhere, adhered, adherence, adherent, adheres, adhering, compliance, compliant, complied, complies, comply, complying, declined, declines, declining, nonadherance, nonadherence, nonadherent, noncompliance, noncompliant, refusal, refuse, refused, refuses, refusing

  3. Descriptions of Appearance / Demeanor. A description of the patient’s appearance and/or behavior.

     Anchor Words: Aggression, aggressive, aggression, aggressive, aggressively, agitated, agitation, anger, angered, angers, angrier, angrily, angry, argumentative, argumentatively, belligerence, belligerent, belligerently, charming, combative, combatively, confrontational, cooperative, defensive, delightful, disheveled, drug seeking, drug-seeking, exaggerate, exaggerates, exaggerating, historian, lovely, malinger, malingered, malingerer, malingering, malingers, narcotic seeking, narcotic-seeking, pleasant, pleasantly, poorly groomed, poorly-groomed, secondary gain, uncooperative, unkempt, unmotivated, unwilling, unwillingly, well groomed.

Appendix B Potential Shortcut Attributes

We next outline the attributes evaluated as potential shortcuts via our G-AUDIT method.

B.1 Skin lesion classification

Dataset audits for the skin lesion classification task assessed the following attributes as candidate shortcuts: age, anatomical location, image height/width, sex, skin color (Fitzpatrick scale), and year.

B.2 Stigmatizing language in EHR data

For each of the EHR tasks, we have three potential shortcut attributes available from the original patient visit metadata. These are patient sex, race, and the visit’s clinical specialty within the JHM hospital system. Clinical specialties can be either: Internal Medicine, Surgery, Emergency Medicine, OB-GYN, or Pediatrics.

B.3 Mortality Prediction from ICU Data

Potential shortcuts for the mortality prediction task are one-hot encoded. Categories with fewer than 100 examples are merged wherever possible.

Ethnicity attributes:

ethnicity-Asian, ethnicity-Asian - Chinese, ethnicity-Black/African American, ethnicity-Black/Cape Verdean, ethnicity-Hispanic OR Latino, ethnicity-Hispanic/Latino - Puerto Rican, ethnicity-White, ethnicity-Other, ethnicity-Patient declined to answer, ethnicity-Unable to obtain, ethnicity-Unknown/not specified.

Insurance attributes:

insurance-Government, insurance-Medicaid, insurance-Medicare, insurance-Private, insurance-Self Pay.

Intervention attributes:

vent, vaso, dobutamine, dopamine, epinephrine, milrinone, norepinephrine, phenylephrine, vasopressin, colloid-bolus, crystalloid-bolus, nivdurations.

Missing data attributes:

heart rate missing, systolic blood pressure missing, temperature missing, blood urea nitrogen missing, white blood cell count missing, potassium missing, sodium missing, bicarbonate missing, bilirubin missing, glascow coma scale total missing, partial pressure of oxygen missing, fraction inspired oxygen missing.