From Biased Selective Labels to Pseudo-Labels:
An Expectation-Maximization Framework for Learning from Biased Decisions
Abstract
Selective labels occur when label observations are subject to a decision-making process; e.g., diagnoses that depend on the administration of laboratory tests. We study a clinically-inspired selective label problem called disparate censorship, where labeling biases vary across subgroups and unlabeled individuals are imputed as “negative” (i.e., no diagnostic test = no illness). Machine learning models naïvely trained on such labels could amplify labeling bias. Inspired by causal models of selective labels, we propose Disparate Censorship Expectation-Maximization (DCEM), an algorithm for learning in the presence of disparate censorship. We theoretically analyze how DCEM mitigates the effects of disparate censorship on model performance. We validate DCEM on synthetic data, showing that it improves bias mitigation (area between ROC curves) without sacrificing discriminative performance (AUC) compared to baselines. We achieve similar results in a sepsis classification task using clinical data.

1 Introduction
Selective labels occur when a decision-making process determines access to ground truth (Lakkaraju et al., 2017). We study a practical case of selective labels: disparate censorship (Chang et al., 2022). Disparate censorship introduces two challenges: different labeling biases across subgroups and the assumption that unlabeled individuals have a negative label. For example, in healthcare, labels may depend on laboratory test results only available in some patients.
Past work has trained ML models to predict outcomes based on laboratory test results (e.g., sepsis (Seymour et al., 2016; Rhee & Klompas, 2020)). In this setting, patients with no test result are defined as negative (Hartvigsen et al., 2018; Teeple et al., 2020; Jehi et al., 2020; McDonald et al., 2021; Adams et al., 2022; Kamran et al., 2022). However, laboratory testing decisions may be biased. For example, women are undertested and underdiagnosed for cardiovascular disease (Beery, 1995; Schulman et al., 1999). ML models trained on such data may recommend women less often for diagnostic testing than men, reinforcing inequity.
To address this bias, one option is to train only on tested individuals. Such an approach may discard a large subset of the data and may not generalize to untested patients. Another option is semi-supervised approaches that do not assume untested patients are negative, such as label propagation (Zhu & Ghahramani, 2002; Lee, 2013) or filtering (Li et al., 2020; Nguyen et al., 2020), or noisy-label learning methods (Blum & Stangl, 2020; Wang et al., 2021; Zhu et al., 2021). However, such methods do not leverage causal models of label bias, a potential source of additional information. We aim to develop an approach that leverages all available signal while accounting for labeling biases.
Inspired by causal models of selective labeling (Laine et al., 2020; Chang et al., 2022; Guerdan et al., 2023a), we propose a simple method for mitigating bias when training models under disparate censorship: Disparate Censorship Expectation-Maximization (DCEM; Fig. 1). First, we show that DCEM regularizes model estimates to counterbalance disparate censorship. We validate DCEM in a simulation study and a sepsis classification task on clinical data. We find that our method mitigates bias (area between ROC curves) while maintaining competitive discriminative performance (AUC), and is generally more robust than baselines to changes in the data generation process.
2 Preliminaries: Disparate Censorship
We consider a dataset $\mathcal{D} = \{(x_i, a_i, t_i, \tilde{y}_i)\}_{i=1}^N$, with covariates $x \in \mathcal{X}$, labeling/testing decision $t \in \{0, 1\}$, sensitive attribute $a \in \{0, 1\}$, and observed label $\tilde{y} \in \{0, 1\}$, a proxy for ground truth $y \in \{0, 1\}$. The proxy label $\tilde{y} = y$ when $t = 1$, and $\tilde{y} = 0$ otherwise (i.e., $\tilde{y} = t \cdot y$).
What is disparate censorship?
Disparate censorship models “double standards” in label collection decisions (Fig. 1, top). It is a variation of selective labeling or outcome measurement error (Lakkaraju et al., 2017; Guerdan et al., 2023b). Disparate censorship uniquely assumes that untested individuals are imputed as negative.
We consider disparate censorship in the context of binary classification (Chang et al., 2022) (Fig. 1, top). We justify the model by example. Consider a patient in an emergency room with characteristics $x$ and sensitive attribute $a$. This patient may have some condition $y$ (currently unobserved) caused by $x$ but not $a$. A clinician may order a diagnostic test ($t$ set to 1) to determine $y$. The decision $t$ is based on $x$, but could be swayed by biases in $a$.
To simplify, suppose that tests are perfectly sensitive.¹ Then, we observe ground truth for tested individuals ($\tilde{y} = y$ when $t = 1$). Otherwise, the patient's label is imputed as negative ($\tilde{y} = 0$; i.e., untested patients are presumed healthy). However, due to biases in testing decisions $t$, ground truth may only be available in a biased subset of the data. The causal model of disparate censorship (Fig. 1, top) encodes this decision-making pipeline. Beyond healthcare, disparate censorship may arise whenever potentially biased decisions affect data labeling.

¹If not, we can define $t$ to indicate whether a label is confirmed correct. This definition captures differences in test sensitivity across groups (i.e., spectrum bias (Mulherin & Miller, 2002)).
Learning under disparate censorship.
We aim to learn a mapping $f_\theta : \mathcal{X} \to [0, 1]$ parameterized by $\theta$ and optimized for discriminative performance (i.e., AUC), but we only observe proxy labels $\tilde{y}$. The default approach for learning under disparate censorship is to assume $\tilde{y} = y$ and proceed using supervised learning. However, such an $f_\theta$ may encode labeling biases: estimates of $P(Y = 1 \mid X = x)$ may be inflated for those more likely to be labeled compared to those less likely to be labeled. Thus, biased labeling could yield disproportionate impacts on performance across different subgroups of the data.
Note that we can interpret the estimand of interest, $P(Y = 1 \mid X = x)$, as the causal effect of testing on the observed label, since, in the language of do-calculus (Pearl, 2009),

$$P(Y = 1 \mid X = x) = P(\tilde{Y} = 1 \mid \mathrm{do}(T = 1),\, X = x), \qquad (1)$$
which follows from standard causal identifiability derivations given the causal graph of Fig. 1 (Imbens & Rubin, 2015). Intuitively, testing an individual ($T = 1$) reveals their outcome. Thus, a model trained only on tested individuals could consistently estimate $P(Y = 1 \mid X = x)$, but may not correct for labeling bias. We discuss other approaches in semi-supervised learning in Section 6.
3 Methodology
We propose Disparate Censorship Expectation-Maximization (DCEM) as an approach for learning in the presence of disparate censorship. We first build intuition for how one could mitigate disparate censorship based on the causal model (Section 3.1). We then derive DCEM (Section 3.2) and show that it mitigates disparate censorship via a form of regularization (Section 3.3). We consider alternative designs and their limitations (Section 3.4). Detailed proofs and definitions are in Appendix B.
3.1 Towards mitigating disparate censorship
Recalling the causal model of disparate censorship, suppose that we are naively training a model $f_\theta$ to predict $\tilde{y}$ from $x$. Define groups $a = 0$ and $a = 1$, and let $B \subseteq \mathcal{X}$ be a region of covariate space. Consider some $B$ so that

$$P(T = 1 \mid X = x, A = 0) < P(T = 1 \mid X = x, A = 1) \qquad (2)$$

for all $x \in B$. Define $q_t(x, a) \triangleq P(T = 1 \mid X = x, A = a)$ (e.g., probability of receiving a laboratory test) and $q_y(x) \triangleq P(Y = 1 \mid X = x)$. By assumption, $x$ is sufficient for predicting $y$ (i.e., $Y \perp A \mid X$, as in Fig. 1, top), such that the optimal $f_\theta$ should be similar across $a$ (within $B$). However, Eq. 2 states that group 0 is undertested relative to group 1: they have a lower $q_t$ within $B$. Equivalently, labeling bias favors group 1. Thus, our naive model would underestimate risk in group 0 (lower observed $\tilde{y}$ rates than group 1 within $B$) relative to group 1.
To counterbalance this bias, one could increase $f_\theta$ (within $B$) where group 0 is more prevalent than group 1, i.e., in lower-$q_t$ regions. Since we are interested in discriminative performance, this is analogous to decreasing $f_\theta$ where $q_t$ is higher, from which the proposed method follows. More broadly, variables associated with labeling bias ($a$ causally affects $t$) but not the outcome of interest ($a$ does not causally affect $y$) may be useful for mitigating labeling bias.
Given our causal model with latent variable $y$ (Fig. 1, top), we base our approach on expectation-maximization (EM) (Dempster et al., 1977). We can write:

$$\log P_\theta(\tilde{y}, t \mid x, a) = \log \sum_{y \in \{0, 1\}} P_\theta(\tilde{y}, y, t \mid x, a). \qquad (3)$$

Since $y$ is not fully observed, Eq. 3 cannot be optimized via standard supervised objectives. Dropping terms that do not involve $\theta$, we can write the maximization of Eq. 3 as

$$\max_\theta \; \mathbb{E}_{y \sim P(y \mid x, a, t, \tilde{y})}\left[\log P_\theta(\tilde{y}, y \mid x, a, t)\right]. \qquad (4)$$
Optimizing Eq. 4 proceeds via EM. We show that the resulting objectives align with reducing $f_\theta$ in higher-$q_t$ regions while maintaining discriminative performance on tested individuals.
3.2 Disparate Censorship Expectation-Maximization
Informal overview.
EM alternates an expectation step (E-step), which imputes guesses for the latent variable(s) (i.e., $y$ in Eq. 4), and a maximization step (M-step), which optimizes the likelihood (Eq. 4) given the imputed estimates. Our E-step imputes preliminary estimates of $y$ for untested individuals. Our M-step updates the estimates to counteract labeling biases when $t = 0$, and is equivalent to full supervision when $t = 1$. The E- and M-steps alternate until convergence. Fig. 1 (bottom) shows a schematic of DCEM, with pseudocode in Algorithm 1.
E-step.
Theorem 3.1 (E-step).
The posterior distribution of $y$ given the observed data is equivalent to

$$P(y = 1 \mid x, a, t, \tilde{y}) = t \tilde{y} + (1 - t)\, q_y(x). \qquad (6)$$
Intuitively, the E-step uses $\tilde{y}$ as the label when we have complete label information (recall $\tilde{y} = y$ when $t = 1$); otherwise, we use the posterior estimate $f_\theta(x) \approx q_y(x)$ as a smoothed label. Equivalently, the E-step imputes soft pseudo-labels $\hat{y}$ for unlabeled data, i.e., probabilistic estimates of $P(y = 1 \mid x)$. Motivated by approaches that train a pseudo-labeling model on labeled data (Arazo et al., 2020; Rizve et al., 2021), we pre-train $f_\theta$ on tested individuals.
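To make the E-step concrete, the following is a minimal NumPy sketch of the imputation in Eq. 6 (our own illustration, not a reference implementation); `f_x` stands for the current model estimates $f_\theta(x)$.

```python
import numpy as np

def e_step_pseudo_labels(t, y_tilde, f_x):
    """Impute soft pseudo-labels per Eq. 6 (illustrative sketch).

    t       : (N,) binary array; 1 if the individual was tested/labeled.
    y_tilde : (N,) observed proxy labels (equal to y when t == 1, else 0).
    f_x     : (N,) current model estimates of P(Y = 1 | X = x).
    """
    # Tested individuals keep their observed label; untested individuals
    # receive the posterior estimate f_theta(x) as a soft label.
    return t * y_tilde + (1 - t) * f_x
```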
M-step.
The M-step maximizes the log-likelihood of Eq. 4 given the E-step estimates (Eq. 6). There are two terms to model: an estimator for $q_y$ trained using the pseudo-labels $\hat{y}$, and an estimator for $P(\tilde{Y} = 1 \mid x, a)$. The latter is obtained by combining an estimate of $q_t$ with $f_\theta$. Concretely, let $\hat{y}_i$ denote the E-step estimate for individual $i$, and let $\hat{q}_t$ be a model of $q_t$. Maximizing the log-likelihood of Eq. 4 reduces to

$$\max_\theta \; \sum_{i=1}^{N} \mathbb{E}_{y_i \sim \hat{y}_i}\!\left[\log P_\theta(y_i \mid x_i)\right] + \log P_\theta(\tilde{y}_i \mid x_i, a_i), \qquad (7)$$

where $P_\theta(y = 1 \mid x) = f_\theta(x)$.
This leads to the following result:
Theorem 3.2 (M-step, informal). Maximizing Eq. 4 via Eq. 7 is equivalent to jointly fitting $f_\theta$ to the E-step pseudo-labels and fitting the model-implied proxy-label probability $P_\theta(\tilde{Y} = 1 \mid x, a)$ to the observed labels $\tilde{y}$.
In practice, we set $P_\theta(\tilde{Y} = 1 \mid x, a) = \hat{q}_t(x, a)\, f_\theta(x)$, a smoothed version of the assumption $\tilde{y} = t \cdot y$. Defining $\ell(\cdot, \cdot)$ as binary cross-entropy loss, we can rewrite Eq. 7:

$$\min_\theta \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, \hat{y}_i\big) + \ell\big(\hat{q}_t(x_i, a_i)\, f_\theta(x_i),\, \tilde{y}_i\big). \qquad (8)$$
Eq. 8 can be interpreted as a regularized cross-entropy loss with respect to the pseudo-label $\hat{y}$. The first term pushes $f_\theta$ towards $\hat{y}$, while the second "encourages" $f_\theta$ to be consistent with the causal model. To obtain $\hat{q}_t$, we pre-train and freeze a binary classifier for $t$, and take its probabilistic estimates as $\hat{q}_t$.
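For concreteness, a minimal PyTorch sketch of the M-step objective in Eq. 8 follows (our own illustration; the variable names mirror the notation above, and a reference implementation may differ).

```python
import torch
import torch.nn.functional as F

def m_step_loss(f_x, q_t_hat, y_hat, y_tilde, eps=1e-7):
    """Regularized cross-entropy of Eq. 8 (illustrative sketch).

    f_x     : (N,) model estimates f_theta(x) in (0, 1).
    q_t_hat : (N,) frozen estimates of P(T = 1 | x, a).
    y_hat   : (N,) E-step pseudo-labels (Eq. 6).
    y_tilde : (N,) observed proxy labels.
    """
    f_x = f_x.clamp(eps, 1 - eps)
    # Term 1: push f_theta towards the E-step pseudo-labels.
    pseudo_label_loss = F.binary_cross_entropy(f_x, y_hat)
    # Term 2 ("causal regularization"): the model-implied proxy-label
    # probability P(Y_tilde = 1 | x, a) = q_t_hat * f_theta(x) should
    # match the observed proxy label y_tilde.
    implied = (q_t_hat * f_x).clamp(eps, 1 - eps)
    causal_reg = F.binary_cross_entropy(implied, y_tilde)
    return pseudo_label_loss + causal_reg
```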
3.3 DCEM counterbalances disparate censorship
We show that DCEM imposes a form of "causal regularization" that lowers $f_\theta$ for untested individuals.
DCEM is a form of causal regularization.
By analogy to regularized risk minimization, consider an objective

$$\min_{\theta \in \Theta} \; L(\theta) + \lambda R(\theta) \qquad (9)$$

for $\lambda \geq 0$ (regularization strength) and a regularizer $R$, where $\Theta$ is the parameter space of $f_\theta$.
Without loss of generality, setting $L(\theta) = \ell(f_\theta(x), \hat{y})$ and matching terms between Eq. 9 and Eq. 8 yields $R(\theta) = \ell(\hat{q}_t(x, a) f_\theta(x), \tilde{y})$. While $\hat{q}_t$ affects the optimization of Eq. 8, it is not a multiplier (e.g., $\lambda$ in Eq. 9). To interpret the effect of $\hat{q}_t$, we propose a definition of causal regularization strength based on how the optimal $f_\theta$ changes.²

²"Causal regularization" has been defined in the context of causal discovery (Bahadori et al., 2017; Janzing, 2019). Our usage is unrelated: we use a causal model to regularize an estimator.
Definition 3.3 (Causal regularization strength, informal).
Let $f^*$ be the minimizer of Eq. 8 at a given input. For objectives finite and convex in $f_\theta(x)$ on $[0, 1]$, the causal regularization strength is the gap between $\hat{y}$ (the unregularized optimum) and $f^*$.
Definition 3.3 quantifies the tradeoff between matching $f_\theta$ to the E-step estimates and optimizing Eq. 8. While the causal regularization strength is not an optimization parameter, analyzing the optimal $f_\theta$ can clarify the inductive bias of the M-step. We proceed by considering how causal regularization impacts untested vs. tested individuals. When $t = 0$ (so $\tilde{y} = 0$), the M-step is

$$\min_\theta \; \ell\big(f_\theta(x),\, \hat{y}\big) + \ell\big(\hat{q}_t(x, a)\, f_\theta(x),\, 0\big), \qquad (10)$$

where $\ell(p, 0) = -\log(1 - p)$.
Since $-\log(1 - \hat{q}_t(x, a) f_\theta(x))$ increases in $f_\theta(x)$, the regularization term "encourages" $f_\theta(x)$ to decrease when $\hat{q}_t(x, a) > 0$. The regularization term is constant if $\hat{q}_t(x, a) = 0$, such that the M-step would not change the E-step estimate. This matches the intuition that one cannot learn about $y$ from individuals that are very different from labeled individuals (i.e., when the overlap assumption in causal inference is violated). The regularization strength depends on $\hat{q}_t$ as follows:
Theorem 3.4 (informal).
If $\hat{y} > 0$, causal regularization strength increases in $\hat{q}_t(x, a)$.
The result implies that causal regularization counterbalances disparate censorship. Recall that lowering $f_\theta$ in regions where $q_t$ is higher can mitigate bias. Equivalently, causal regularization must strengthen as $\hat{q}_t$ increases, which follows from Theorem 3.4.
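The effect in Theorem 3.4 can be checked numerically. The sketch below (our own illustration) grid-searches the per-example objective of Eq. 10 for an untested point with pseudo-label $\hat{y} = 0.7$ (an arbitrary choice) and shows the minimizer $f^*$ decreasing as $\hat{q}_t$ grows.

```python
import numpy as np

def objective_t0(f, y_hat, q_t):
    """Per-example M-step objective for an untested point (Eq. 10)."""
    bce = -(y_hat * np.log(f) + (1 - y_hat) * np.log(1 - f))
    causal_reg = -np.log(1 - q_t * f)  # BCE of q_t * f against y_tilde = 0
    return bce + causal_reg

f_grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
y_hat = 0.7  # E-step pseudo-label for this point
for q_t in [0.0, 0.25, 0.5, 0.75, 0.99]:
    f_star = f_grid[np.argmin(objective_t0(f_grid, y_hat, q_t))]
    print(f"q_t = {q_t:.2f} -> f* = {f_star:.3f}")
# f* decreases from 0.700 (no regularization) as q_t grows,
# matching Theorem 3.4: regularization strengthens with q_t.
```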
Causal regularization aligns with full supervision in tested individuals.
When $t = 1$ (so $\hat{y} = \tilde{y} = y$), the M-step is

$$\min_\theta \; \ell\big(f_\theta(x),\, y\big) + \ell\big(\hat{q}_t(x, a)\, f_\theta(x),\, y\big), \qquad (11)$$

substituting $y$ for $\hat{y}$ and $\tilde{y}$. Thus:
Proposition 3.5.
Eq. 11 is minimized when $f_\theta(x) = y$.
Proposition 3.5 states that causal regularization does not change the M-step optimum from matching ground truth when $t = 1$ (i.e., the regularization strength is zero). Thus, the M-step objective aligns with the fully-supervised loss.
Thus, the M-step (Eq. 8) counterbalances disparate censorship by regularizing $f_\theta$ towards 0 as $\hat{q}_t$ increases. For tested individuals, the M-step optimum stays constant, and DCEM should maintain discriminative performance.
3.4 Alternative designs and their limitations
We consider two alternative designs and their limitations: directly using the E-step estimates $\hat{y}$ throughout the M-step, and propensity score adjustment.
Why not use $\hat{y}$ directly?
We substitute $\hat{y}$ for $\tilde{y}$ in Eq. 8 and analyze one summand (without loss of generality):

$$\ell\big(f_\theta(x),\, y\big) + \ell\big(\hat{q}_t(x, a)\, f_\theta(x),\, y\big) \qquad (t = 1), \qquad (12)$$
$$\ell\big(f_\theta(x),\, \hat{y}\big) + \ell\big(\hat{q}_t(x, a)\, f_\theta(x),\, \hat{y}\big) \qquad (t = 0). \qquad (13)$$
Both losses use the E-step estimate as supervision. When $t = 1$ (Eq. 12), the M-step adds $\ell(\hat{q}_t(x, a) f_\theta(x), y)$, penalizing false negatives 2x as heavily as false positives. This does not affect ranking metrics (e.g., AUC). When $t = 0$ (Eq. 13), the M-step drops causal regularization towards the observed $\tilde{y} = 0$, and thus cannot counterbalance disparate censorship. Directly using $\hat{y}$ would only help if counterbalancing disparate censorship is unnecessary for good estimation, i.e., when tested individuals are representative of the population.
Why not propensity score adjustment/related causal approaches?
Recall that estimating the effect of $t$ on the observed label yields a consistent estimate of $P(Y = 1 \mid X = x)$ (Eq. 1, Section 2). Indeed, $\hat{q}_t$ is an estimate of $P(T = 1 \mid X = x, A = a)$, i.e., a propensity score, motivating the usage of causal effect estimators that leverage $\hat{q}_t$. However, propensity score adjustments (e.g., IPW (Rosenbaum & Rubin, 1983) or doubly-robust variations (Robins et al., 1994; Van Der Laan & Rubin, 2006; Hu et al., 2022)) require an "overlap" assumption $q_t(x, a) \geq \epsilon$ for some $\epsilon > 0$, and have asymptotic variance that grows as propensities approach zero, making them sensitive to extreme $\hat{q}_t$ (e.g., as in AIPW (Glynn & Quinn, 2010)).
However, in finite-sample settings, "sharp" testing decisions lead to weak overlap. Such extreme $q_t$ may arise in threshold-based decisions (Djulbegovic et al., 2014; Pierson et al., 2018). For example, a patient either exhibits or does not exhibit the requisite symptoms to warrant testing. This is analogous to inducing covariate shift between tested and untested individuals: "holes" emerge in the training data when using only labeled examples. Thus, systematic testing bias could exacerbate model performance gaps across population subgroups. While low overlap still impacts DCEM (since DCEM cannot learn where $\hat{q}_t(x, a) = 0$), our method instead leverages an evidence lower bound (ELBO) to model $q_y$ under disparate censorship. We further discuss potential improvements in the overlap-robustness of the proposed approach in Appendix B.
4 Experimental Setup
We validate DCEM with synthetic data across different data-generation processes on simulated binary classification tasks (Section 4.1) and in a pseudo-synthetic sepsis classification task using real clinical data (MIMIC-III) (Johnson et al., 2016), across potential laboratory testing policies (Section 4.2). We then discuss our chosen baselines (Section 4.3) and evaluation metrics (Section 4.4).
4.1 Synthetic Datasets
By definition, $y$ is not fully observed under disparate censorship. Thus, we design a simulation study in order to evaluate various methods with respect to ground truth. The data generation process follows the assumed causal model of disparate censorship (Fig. 1, top):

$$x \mid a \sim \mathcal{N}(\mu_a, \sigma^2 I), \qquad t \mid x, a \sim \mathrm{Bernoulli}\big(g(h_t(x, a))\big), \qquad y \mid x \sim \mathrm{Bernoulli}\big(g(h_y(x))\big),$$

where $I$ is the identity matrix, $g$ is a sigmoid link, and $\mu_a$, $\sigma$, $h_t$, and $h_y$ are simulation parameters. We induce confounding between $a$ and $y$ by drawing $x$ from group-specific Gaussians ($\mu_0 \neq \mu_1$), and assume Bernoulli-distributed $t$ and $y$ with parameters defined via $h_t$ and $h_y$, respectively. Intuitively, $h_t$ ($h_y$) is a soft "decision boundary" for $t$ ($y$). Inspired by observations that clinician testing can be represented by simpler functions than observed outcomes (Mullainathan & Obermeyer, 2022), we choose a non-linear $h_y$ and a linear $h_t$.
We simulate 20,000 individuals each for training, validation, and testing (i.e., 60,000 total). We define settings in terms of testing disparity $k_t$, prevalence disparity $k_p$, and testing multiple $k_m$. Intuitively, $k_t$ controls labeling biases, $k_p$ controls differences between groups, and $k_m$ controls the overall testing rate. We consider a grid of values for $k_t$, $k_p$, and $k_m$, and set simulation parameters to yield the desired rates.³

³We skip settings where $k_t$, $k_p$, and $k_m$ yield infeasible testing rates.
Since $h_y$ is unknown in practice, we replicate the main experiments across various $h_y$ as a robustness check. The simulation makes simplifying assumptions (e.g., low dimensionality and a binary sensitive attribute) but allows full control over the testing and outcome processes. Additional simulation details are in Appendix C.1.
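For illustration, a minimal data-generating sketch consistent with the causal graph is shown below; the group means, decision boundaries, and coefficients are placeholder choices of ours, not the simulation parameters used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(n, mu=(np.array([-0.5, -0.5]), np.array([0.5, 0.5])), sigma=1.0):
    """Draw (x, a, t, y, y_tilde) following the disparate censorship graph.

    Group means, boundaries, and coefficients below are illustrative
    placeholders, not the paper's parameter settings.
    """
    a = rng.integers(0, 2, size=n)                     # sensitive attribute
    x = rng.normal(0, sigma, size=(n, 2)) + np.stack([mu[g] for g in a])
    h_y = x[:, 0] ** 2 + x[:, 1] - 1.0                 # non-linear outcome boundary
    y = rng.binomial(1, sigmoid(2.0 * h_y))            # latent ground truth
    h_t = x[:, 0] + x[:, 1]                            # linear testing boundary
    t = rng.binomial(1, sigmoid(2.0 * h_t + 1.0 * a))  # biased testing: a shifts t
    y_tilde = t * y                                    # untested imputed negative
    return x, a, t, y, y_tilde

x, a, t, y, y_tilde = simulate(20_000)
```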
4.2 Clinical data: MIMIC-III
Multiple sepsis definitions, such as Sepsis-3 (Singer et al., 2016), are based on laboratory tests (blood culture) such that patients without a test result are by definition negative. Thus, sepsis classification is a potential real-world case of disparate censorship. We curate a sepsis classification task using the MIMIC-III Sepsis-3 cohort (Johnson et al., 2016, 2018), an electronic health record dataset.
We aim to distinguish patients who never develop sepsis from those who develop sepsis within 8 hours of an initial 3-hour observation period. If a patient met the Sepsis-3 criteria between 3-11 hours after the first chart measurement, we set $y = 1$; if the patient never develops sepsis during their hospital stay, $y = 0$. We exclude patients with onset times outside this range and include only White and Black patients to simplify the analysis of bias mitigation. We choose features following an existing sepsis prediction model (Delahanty et al., 2019), and exclude patients for whom all features are missing. From the resulting cohort, we create a 60-20-20 train-validation-test split. This is a simplified version of a real clinical task, since we exclude patients who develop sepsis later during their hospitalization. Nonetheless, it is helpful for probing the strengths and weaknesses of the proposed approach.
To evaluate model performance, we assume that the observed label reflects ground truth $y$, since most patients in our cohort were tested (i.e., received a blood culture). To generate label proxies $\tilde{y}$, we simulate multiple potential labeling biases via a clinically-inspired testing function $h_t$. We specify a linear $h_t$ based on qSOFA, a score for triaging patients at risk of sepsis (Seymour et al., 2016). Inspired by observations that clinicians over-weight representative symptoms in diagnostic test decisions (Mullainathan & Obermeyer, 2022), we create different versions of $h_t$ via different weightings of qSOFA features. We examine multiple settings of testing disparity $k_t$ and testing multiple $k_m$. Details of the sepsis cohort are in Appendix C.2.
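As an illustration of such testing functions, the sketch below builds a linear score over binarized qSOFA criteria (respiratory rate ≥ 22/min, systolic blood pressure ≤ 100 mmHg, altered mentation); the weights and intercept are hypothetical knobs for simulating labelers who over-weight individual symptoms, not the values used in our experiments.

```python
import numpy as np

def h_t_qsofa(resp_rate, sys_bp, gcs, weights=(1.0, 1.0, 1.0), bias=-1.5):
    """A linear testing score over binarized qSOFA features.

    The three criteria follow qSOFA (Seymour et al., 2016); the weights
    and intercept are illustrative placeholders for simulating labelers
    who over- or under-weight individual symptoms.
    """
    features = np.stack([
        (resp_rate >= 22).astype(float),  # tachypnea
        (sys_bp <= 100).astype(float),    # hypotension
        (gcs < 15).astype(float),         # altered mentation
    ], axis=-1)
    return features @ np.asarray(weights) + bias

# Example: a labeler who over-weights hypotension relative to the others.
score = h_t_qsofa(np.array([24.0]), np.array([95.0]), np.array([15.0]),
                  weights=(0.5, 2.0, 0.5))
p_test = 1.0 / (1.0 + np.exp(-score))  # testing probability via sigmoid
```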
4.3 Models
As naive baselines, we test a $\tilde{y}$-obs model (training on $\tilde{y}$) and models trained on one group only (group 0 or group 1). We select similarly-motivated or applicable baselines from related settings:

- Noisy-label learning: group peer loss (Wang et al., 2021) and SELF (Nguyen et al., 2020),
- Causal inference: tested-only (training on examples where $t = 1$), and DragonNet (Shi et al., 2019), using the treatment effect of the sensitive attribute on testing to correct disparate censorship (i.e., learning a correction for the labeling bias),
- Positive-unlabeled learning: Selected-At-Random EM (SAR-EM) (Bekker et al., 2020).
As an oracle, we compare to training on $y$ (the "$y$-model"). We use neural networks for all approaches. Training details, such as hyperparameters, are in Appendix D.
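For reference, the two naive baselines can be sketched in a few lines with scikit-learn (purely illustrative; as noted above, our experiments use neural networks for all approaches):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The y_tilde-obs model trains on all proxy labels; the tested-only model
# trains on the labeled subset (t == 1). Arrays follow Section 2's notation.
def fit_y_tilde_obs(x, y_tilde):
    return LogisticRegression().fit(x, y_tilde)

def fit_tested_only(x, t, y_tilde):
    mask = t == 1
    return LogisticRegression().fit(x[mask], y_tilde[mask])
```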

4.4 Evaluation metrics
We consider bias mitigation and discriminative performance metrics with respect to ground truth $y$, and measure the robustness of both metrics to changes in the data-generation process.
Discriminative performance.
We use the area under the receiver operating characteristic curve (AUC), a standard discriminative performance metric.
Mitigating bias.
We use the ROC gap (also called ROC fairness (Vogel et al., 2021) or ABROCA (Gardner et al., 2019)): the absolute area between the ROC curves for each group $a \in \{0, 1\}$. The ROC gap lies in $[0, 1]$; lower values indicate better bias mitigation. Intuitively, the ROC gap is zero when a classifier with any fixed false positive rate in each group obtains equal true positive rates across groups. Under disparate censorship, a zero ROC gap is achievable if a model perfectly predicts $y$ from $x$.
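A minimal sketch of the ROC gap computation (our own illustration) interpolates each group's ROC curve on a shared false-positive-rate grid and integrates the absolute difference:

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_gap(y_true, scores, groups):
    """Absolute area between group-wise ROC curves (a sketch of ABROCA).

    Interpolates each group's TPR on a shared FPR grid, then integrates
    the absolute difference; assumes binary groups {0, 1} and that both
    classes appear within each group.
    """
    fpr_grid = np.linspace(0.0, 1.0, 1001)
    tprs = []
    for g in (0, 1):
        mask = groups == g
        fpr, tpr, _ = roc_curve(y_true[mask], scores[mask])
        tprs.append(np.interp(fpr_grid, fpr, tpr))
    return np.trapz(np.abs(tprs[0] - tprs[1]), fpr_grid)
```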
Robustness.
We consider the median AUC and ROC gap over all $h_y$ (synthetic data setting) or $h_t$ (sepsis classification), as well as the empirical worst case (AUC: min.; ROC gap: max.) and range.

5 Experiments & Discussion
Our experiments aim to substantiate our main claims:

- In synthetic data, DCEM mitigates bias, maintains competitive discriminative performance, and improves robustness, while achieving better tradeoffs between performance and bias mitigation compared to baselines (Section 5.1).
- On a sepsis classification task, DCEM improves discriminative performance while maintaining good tradeoffs with bias mitigation, and is more robust compared to baselines (Section 5.2).
We also report full results (Appendix E.1) and an ablation study of DCEM (Appendix E.2). We also benchmark causal effect estimators (i.e., as alternatives to the tested-only model) and their overlap robustness compared to DCEM (Appendix E.3). Further sensitivity analyses can be found in Appendix E.4 (smoothed ) and E.5 (E-step initialization).
5.1 Results on simulated disparate censorship
Fig. 2 shows ROC gaps (left) and AUCs (right) of the baselines most competitive with our approach (DCEM, magenta) at a representative setting. In this setting, 25% of individuals are tested, the base rate of $y = 1$ differs between groups ($k_p$), and one group is twice as likely to be tested as the other ($k_t = 2$). Each point is an ROC gap/AUC value achieved under one decision boundary $h_y$. Results for the remaining baselines are in Appendix E.1; the takeaways align with the main results.
DCEM mitigates bias more effectively than baselines.
DCEM achieves a median ROC gap of 0.030 (2nd-best, SELF: 0.034), suggesting that it mitigates bias more effectively than baselines (Fig. 2, left). We show similar trends for other settings of $k_t$, $k_p$, and $k_m$ (Appendix E.1). At low testing rates, all models mitigate bias poorly. At high testing rates, the tested-only model is sufficient.
For settings with lower testing disparity (Appendix E.1), DCEM mitigates bias compared to the default approach ($\tilde{y}$-obs model) but no longer dominates the baselines. We hypothesize that DCEM has similar bias mitigation capabilities as baselines here, since there is less bias to mitigate. Recalling that $\tilde{y} = t \cdot y$, the proxy label, testing probability, and outcome are correlated in such settings, so learning to predict $\tilde{y}$ would preserve the ordering induced by $q_y$, reducing impacts on ranking metrics (e.g., ROC gap).⁴

⁴Such settings are related to boundary-consistent noise; see Proposition 1 of Menon et al. (2018).

DCEM is more robust than baselines to changes in the data-generating process.
Fig. 2 (left) shows that the maximum ROC gap is lower for DCEM compared to baselines (ours: 0.060 vs. 2nd-best, tested-only: 0.083). We report similar results for the minimum AUC (Fig. 2, right; ours: 0.768 vs. 2nd-best, tested-only: 0.623). DCEM also achieves the tightest ROC gap range (left; DCEM: 0.048 vs. tested-only, 2nd-tightest: 0.063) and AUC range (right; DCEM: 0.055 vs. DragonNet: 0.199).
The results suggest that DCEM maintains robust bias mitigation and discriminative performance across different data-generation processes ($h_y$). This is expected, as DCEM optimizes likelihood under the disparate censorship data-generation process by design. In contrast, the baselines may experience selection bias or misspecification, since they discard data or assume noise structures/variable dependencies that disparate censorship violates.
DCEM maintains competitive discriminative performance.
Fig. 2 (right) shows that DCEM outperforms all baselines except for the tested-only model, which our approach lags by 0.028 AUC (DCEM: 0.787 vs. tested-only: 0.815). Other causal estimators achieve similar discriminative performance to the tested-only approach (Appendix E.3). However, our method improves on the "default" $\tilde{y}$-obs model, increasing the median AUC by 0.130 ($\tilde{y}$-obs: 0.657). SELF, which has a similar median ROC gap to DCEM, underperforms DCEM by 0.110 AUC (SELF: 0.677 vs. DCEM: 0.787). Other baselines also underperform. This is expected, since some methods ignore label bias: training on $\tilde{y}$ alone is misspecified for estimating $q_y$, since it incorrectly assumes that if $t = 0$, then $y = 0$. The same argument applies to the group 0/group 1 only approaches.
Some baselines account for label noise/bias, but are misspecified under disparate censorship since they make different independence assumptions. Group peer loss assumes label noise rates that vary only with $a$, and SELF assumes noise is independent of the covariates, so both ignore the dependence of biased selective labeling on $x$. DragonNet accounts for $a$ by adding the estimated effect of $a$ on $\tilde{y}$ to the default model's estimates as a "correction factor." However, the correction factor may be biased for true negatives: the oracle correction is zero, because $a$ does not affect $y$, but the estimated correction is generally nonzero under systematic labeling bias.
SAR-EM, which is most similar to the proposed approach, models missingness at random (i.e., labeling that depends only on observed covariates), but discards reliable negatives. In contrast, the proposed approach incorporates reliable negatives in alignment with our assumptions about labeling bias, allowing it to counterbalance biased selective labeling. Trends are similar for other $k_t$ and $k_m$. Since the tested-only model is a strong baseline, we now compare it directly to DCEM.
DCEM achieves better tradeoffs between discriminative performance and bias mitigation.
Among models with similar AUC, DCEM reduces ROC gaps compared to the tested-only model (Fig. 3). For example, in the second-highest AUC bin (Fig. 3, 2nd from right), DCEM improves the average ROC gap by 0.022 (0.028 vs. 0.050), with similar trends at lower AUCs. Among the best-performing models (highest AUC bin; Fig. 3, 1st from right), both methods have similar ROC gaps.
The results suggest that DCEM is not trading discriminative performance for bias mitigation. At a given AUC, DCEM more often yields models with a lower ROC gap than the tested-only model. Since the tested-only approach does not account for label bias, it can achieve relatively high AUC without mitigating bias. In contrast, DCEM explicitly counteracts disparate censorship. A similar comparison to SELF shows that, at low ROC gaps, DCEM likewise finds higher-AUC solutions than SELF (Appendix E.6).
5.2 Results on sepsis classification in MIMIC-III
DCEM has better discriminative performance than baselines.
Fig. 4 compares the ROC gap and AUC of DCEM to selected baselines at a fixed testing disparity $k_t$ and testing rate multiplier $k_m$. Each dot corresponds to one variation of $h_t$ (laboratory testing policy). Our method has the highest median AUC among baselines (ours: 0.620 vs. DragonNet: 0.593), nearing the oracle ($y$-model, 0.633). Notably, DCEM has better discriminative performance than the tested-only approach, suggesting that extrapolation from tested to untested individuals is more difficult on the sepsis classification task than on the fully synthetic tasks.
DCEM achieves good tradeoffs with bias mitigation.
DCEM achieves a smaller median ROC gap than five of the eight baselines tested. Group peer loss, DragonNet, and the group 0 approach achieve lower median ROC gaps of 0.070, 0.088, and 0.082, respectively (DCEM: 0.105). However, the group 0 approach catastrophically fails (median AUC: 0.342). Models may perform arbitrarily poorly under disparate censorship if labeling biases sufficiently "conceal" the true decision boundary. Group peer loss (among many other baselines) also exhibits a much wider AUC range than the proposed approach (group peer loss: 0.182 vs. DCEM: 0.065), suggesting that its discriminative utility may be limited. DragonNet appears competitive (0.027 AUC lower than DCEM), but would only perform well when the effect of race on testing is close to zero, which is violated if labeling biases (a large effect of race on testing) are present in negative patients ($y = 0$).
Many approaches, including DCEM, obtain a lower ROC gap than training on $y$ directly. Although the oracle obtains the highest median AUC, optimizing discriminative performance on $y$ is not always guaranteed to mitigate bias. DCEM uses labeling probabilities $\hat{q}_t$ to mitigate bias via causal regularization, while DragonNet directly uses an estimate of the labeling bias as a correction factor. Thus, the results validate that the labeling bias can provide signal for bias mitigation.
DCEM is more robust than most baselines to changes in $h_t$.
DCEM maintains robust bias mitigation capabilities across $h_t$; i.e., across differences in how labelers weigh features in their decisions. Fig. 4 shows that DCEM attains a maximum ROC gap of 0.133 (left; DragonNet: 0.144) and a minimum AUC of 0.584 (right; DragonNet: 0.574). Fig. 4 also shows that DCEM achieves the tightest ROC gap range (left; DCEM: 0.094 vs. 2nd-best: 0.102) and the 2nd-tightest AUC range (right; DCEM: 0.065 vs. DragonNet: 0.018). Many baselines also exhibit a bimodal empirical AUC distribution and only perform well under specific labeling behaviors. We examine the sensitivity of baselines to $h_t$ by plotting AUC and ROC gaps against the coefficients of $h_t$ (Appendix E.7).
While DragonNet is competitive on this dataset, its robustness and performance may not generalize (e.g., simulation results, Fig. 2). DCEM is the only approach tested that achieved competitive discriminative performance and bias mitigation on both datasets. Trends in performance and robustness are similar for other $k_t$ and $k_m$ (Appendix E.1).
Overall takeaways.
In a simulation study of disparate censorship, DCEM mitigates bias while achieving similar or better discriminative performance compared to baselines. The proposed approach is empirically more robust than baselines to changes in the data-generating process. On a sepsis classification task, DCEM mitigates bias while improving or maintaining discriminative performance compared to baselines across different labeling behaviors. Thus, DCEM can potentially mitigate bias with less impact on discriminative performance than existing methods.
6 Related Work
Selective labeling/disparate censorship.
Disparate censorship is a variation of selective labeling (Lakkaraju et al., 2017; Kleinberg et al., 2018) and outcome measurement error (Guerdan et al., 2023b). Selective labeling problems have been studied in clinical settings (Farahani et al., 2020; Shanmugam et al., 2024; Chang et al., 2022; Mullainathan & Obermeyer, 2022; Balachandar et al., 2024), social/public policy (Saxena et al., 2020; Kontokosta & Hong, 2021; Laufer et al., 2022; Liu & Garg, 2022; Kiani et al., 2023), and finance (Björkegren & Grissen, 2020; Henderson et al., 2023), among other domains. For an extended literature review of selective labeling problems, see Appendix A.
Past work has trained ML models under disparate censorship, directly encoding untested individuals as negative (Henry et al., 2015; Jehi et al., 2020; McDonald et al., 2021; Adams et al., 2022; Kamran et al., 2022). Previous approaches for learning under selective labels leverage heterogeneity in human decisions to recover outcome estimates (Lakkaraju et al., 2017; Kleinberg et al., 2018; Chen et al., 2023), or use domain-specific adjustments (Gholami et al., 2018; Balachandar et al., 2024). We propose DCEM, a complementary approach for mitigating bias under disparate censorship without such restrictions.
Semi-supervised learning.
Semi-supervised approaches do not assume labels for untested individuals. However, many causally-motivated methods diverge from the causal model of disparate censorship (Madras et al., 2019; Yao et al., 2021; Garg et al., 2023; Guerdan et al., 2023a; Gong et al., 2021; Kato et al., 2023; Sportisse et al., 2023) via different independence/causal relationships between variables. Filtering methods (Han et al., 2018; Li et al., 2020; Nguyen et al., 2020; Chen et al., 2020; Zhang et al., 2021; Zhao et al., 2022) assume specific model behavior on noisy examples (e.g., that noise is learned late in training (Arpit et al., 2017)) or specific labeling bias (randomness/class-dependence), which disparate censorship violates. We also highlight historical expectation-maximization approaches for learning with missing data (Ghahramani & Jordan, 1993; Ghahramani et al., 1996; Ambroise & Govaert, 2000), which place parametric assumptions on the data-generation process. We use neural networks to target the estimands of interest, circumventing parametric assumptions.
Other alternatives include positive-unlabeled learning approaches that assume labeling depends on covariates (e.g., missing not at random) (Bekker et al., 2020; Furmańczyk et al., 2022; Gerych et al., 2022; Wang et al., 2024). However, these methods do not leverage correctly-labeled negatives, and naively incorporating negative examples without causal assumptions may harm model performance or bias mitigation. Other methods for noisy-label learning make assumptions incompatible with our setting, e.g., uniform noise within subgroups (Wang et al., 2021), almost-surely clean & noisy examples (Liu & Tao, 2015; Patrini et al., 2017; Tjandra & Wiens, 2023), different variable independence/directionality relationships (Wu et al., 2022), that noisy (i.e., out of distribution) examples are rare (Wald & Saria, 2023), or other noise constraints (Li et al., 2021; Zhu et al., 2021). Our approach complements existing work by jointly modeling selective and biased labeling via causal assumptions tailored to a biased decision-making pipeline.
7 Conclusion
When biased human decisions affect observations of ground truth, applying standard supervised learning techniques to data exhibiting disparate censorship can amplify the harm of ML models to marginalized groups. We propose Disparate Censorship Expectation-Maximization (DCEM), a novel approach to classification, to mitigate such harm. In a simulation study and a sepsis classification task, DCEM mitigates bias and maintains competitive discriminative performance compared to baselines. Limitations of DCEM include potentially slow convergence, since EM is iterative. Model evaluation under disparate censorship is also inherently challenging given the difficulty of obtaining ground truth, motivating future work in dataset curation. Furthermore, DCEM does not learn a full generative model with all variables. While such a model could target a wider range of estimands, it would also increase the number of terms that need to be modeled. Ultimately, DCEM is a step towards mitigating the disproportionate impacts of disparate censorship. Our work aims to raise awareness of disparate censorship and motivate the study of bias mitigation methods.
Impact Statement
This paper addresses disparate censorship, a realistic source of label bias in ML, and proposes a method that mitigates its harms. Since the goal of the paper is aligned with reducing inequity in decision-making, practical use cases of DCEM are inherently high-stakes settings. Thus, we believe that the ethical usage of DCEM (or any bias mitigation approach) in the real-world requires prospective model evaluation in the context of use (e.g., shadowing human decision-makers) to assess unforeseen negative impacts. Our work provides a general choice of bias mitigation (area between ROC curves) and discriminative performance metrics (AUC), which are motivated by clinical tasks where equitably ranking individuals in terms of resource needs is important. Practitioners should ensure their evaluation metrics align with domain-specific criteria for bias mitigation and performance.
Acknowledgements
We thank (in alphabetical order) Donna Tjandra, Divya Shanmugam, Fahad Kamran, Jung Min Lee, Maggie Makar, Meera Krishnamoorthy, Michael Ito, Sarah Jabbour, Shengpu Tang, Stephanie Shepard, and Winston Chen for helpful conversations and proofreading, and the anonymous reviewers for their constructive feedback. T.C. and J.W. are supported by the U.S. National Heart, Lung, and Blood Institute of the National Institutes of Health (Grant No. 5R01HL158626-03). The views and conclusions in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. National Institutes of Health.
References
- Adams et al. (2022) Adams, R., Henry, K. E., Sridharan, A., Soleimani, H., Zhan, A., Rawat, N., Johnson, L., Hager, D. N., Cosgrove, S. E., Markowski, A., et al. Prospective, multi-site study of patient outcomes after implementation of the TREWS machine learning-based early warning system for sepsis. Nature Medicine, pp. 1–6, 2022.
- Ambroise & Govaert (2000) Ambroise, C. and Govaert, G. EM algorithm for partially known labels. In Data Analysis, Classification, and Related Methods, pp. 161–166. Springer, 2000.
- Arazo et al. (2020) Arazo, E., Ortego, D., Albert, P., O’Connor, N. E., and McGuinness, K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2020.
- Arpit et al. (2017) Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., and Lacoste-Julien, S. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 233–242, 2017.
- Bahadori et al. (2017) Bahadori, M. T., Chalupka, K., Choi, E., Chen, R., Stewart, W. F., and Sun, J. Causal regularization. arXiv preprint arXiv:1702.02604, 2017.
- Balachandar et al. (2024) Balachandar, S., Garg, N., and Pierson, E. Domain constraints improve risk prediction when outcome data is missing. In 12th International Conference on Learning Representations, 2024.
- Beery (1995) Beery, T. A. Gender bias in the diagnosis and treatment of coronary artery disease. Heart & Lung, 24(6):427–435, 1995.
- Bekker et al. (2020) Bekker, J., Robberechts, P., and Davis, J. Beyond the selected completely at random assumption for learning from positive and unlabeled data. In Machine Learning and Knowledge Discovery in Databases, pp. 71–85, Cham, 2020.
- Bergman et al. (2021) Bergman, P., Kopko, E., and Rodriguez, J. E. A seven-college experiment using algorithms to track students: Impacts and implications for equity and fairness. Technical report, National Bureau of Economic Research, 2021.
- Binns et al. (2017) Binns, R., Veale, M., Van Kleek, M., and Shadbolt, N. Like trainer, like bot? Inheritance of bias in algorithmic content moderation. In Social Informatics: 9th International Conference, SocInfo 2017, Oxford, UK, September 13-15, 2017, Proceedings, Part II 9, pp. 405–415, 2017.
- Björkegren & Grissen (2020) Björkegren, D. and Grissen, D. Behavior revealed in mobile phone usage predicts credit repayment. The World Bank Economic Review, 34(3):618–634, 2020.
- Blum & Stangl (2020) Blum, A. and Stangl, K. Recovering from biased data: Can fairness constraints improve accuracy? In 1st Symposium on Foundations of Responsible Computing, 2020.
- Chang et al. (2022) Chang, T., Sjoding, M. W., and Wiens, J. Disparate censorship & undertesting: A source of label bias in clinical machine learning. In Proceedings of the 7th Machine Learning for Healthcare Conference, volume 182 of Proceedings of Machine Learning Research, pp. 343–390, Aug 2022.
- Chen et al. (2023) Chen, J., Li, Z., and Mao, X. Learning under selective labels with heterogeneous decision-makers: An instrumental variable approach. arXiv preprint arXiv:2306.07566, 2023.
- Chen et al. (2020) Chen, P., Ye, J., Chen, G., Zhao, J., and Heng, P.-A. Beyond class-conditional assumption: A primary attempt to combat instance-dependent label noise. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020.
- Delahanty et al. (2019) Delahanty, R. J., Alvarez, J., Flynn, L. M., Sherwin, R. L., and Jones, S. S. Development and evaluation of a machine learning model for the early identification of patients at risk for sepsis. Annals of Emergency Medicine, 73(4):334–344, 2019.
- Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (methodological), 39(1):1–22, 1977.
- Djulbegovic et al. (2014) Djulbegovic, B., Elqayam, S., Reljic, T., Hozo, I., Miladinovic, B., Tsalatsanis, A., Kumar, A., Beckstead, J., Taylor, S., and Cannon-Bowers, J. How do physicians decide to treat: an empirical evaluation of the threshold model. BMC Medical Informatics and Decision Making, 14:1–10, 2014.
- Englesson & Azizpour (2021) Englesson, E. and Azizpour, H. Generalized jensen-shannon divergence loss for learning with noisy labels. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 30284–30297, 2021.
- Farahani et al. (2020) Farahani, N. Z., Sundaram, D. S. B., Enayati, M., Arunachalam, S. P., Pasupathy, K., and Arruda-Olson, A. M. Explanatory analysis of a machine learning model to identify hypertrophic cardiomyopathy patients from EHR using diagnostic codes. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1932–1937, 2020.
- Furmańczyk et al. (2022) Furmańczyk, K., Mielniczuk, J., Rejchel, W., and Teisseyre, P. Joint estimation of posterior probability and propensity score function for positive and unlabelled data. arXiv preprint arXiv:2209.07787, 2022.
- Gardner et al. (2019) Gardner, J., Brooks, C., and Baker, R. Evaluating the fairness of predictive student models through slicing analysis. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge, pp. 225–234, 2019.
- Garg et al. (2023) Garg, A., Nguyen, C., Felix, R., Do, T.-T., and Carneiro, G. Instance-dependent noisy label learning via graphical modelling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2288–2298, 2023.
- Gerych et al. (2022) Gerych, W., Hartvigsen, T., Buquicchio, L., Agu, E., and Rundensteiner, E. Recovering the propensity score from biased positive unlabeled data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 6694–6702, 2022.
- Ghahramani & Jordan (1993) Ghahramani, Z. and Jordan, M. Supervised learning from incomplete data via an EM approach. Advances in Neural Information Processing Systems, 6, 1993.
- Ghahramani et al. (1996) Ghahramani, Z., Hinton, G. E., et al. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.
- Gholami et al. (2018) Gholami, S., Mc Carthy, S., Dilkina, B., Plumptre, A., Tambe, M., Driciru, M., Wanyama, F., Rwetsiba, A., Nsubaga, M., Mabonga, J., et al. Adversary models account for imperfect crime data: Forecasting and planning against real-world poachers. In International Conference on Autonomous Agents and Multiagent Systems, 2018.
- Glynn & Quinn (2010) Glynn, A. N. and Quinn, K. M. An introduction to the augmented inverse propensity weighted estimator. Political Analysis, 18(1):36–56, 2010.
- Gong et al. (2021) Gong, C., Wang, Q., Liu, T., Han, B., You, J. J., Yang, J., and Tao, D. Instance-dependent positive and unlabeled learning with labeling bias estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:4163–4177, 2021.
- Guerdan et al. (2023a) Guerdan, L., Coston, A., Holstein, K., and Wu, Z. S. Counterfactual prediction under outcome measurement error. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 1584–1598, 2023a.
- Guerdan et al. (2023b) Guerdan, L., Coston, A., Wu, Z. S., and Holstein, K. Ground(less) truth: A causal framework for proxy labels in human-algorithm decision-making. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 688–704, 2023b.
- Han et al. (2018) Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I. W., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8536–8546, 2018.
- Harris et al. (2020) Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, September 2020.
- Hartvigsen et al. (2018) Hartvigsen, T., Sen, C., Brownell, S., Teeple, E., Kong, X., and Rundensteiner, E. A. Early Prediction of MRSA Infections using Electronic Health Records. In HEALTHINF, pp. 156–167, 2018.
- Henderson et al. (2023) Henderson, P., Chugg, B., Anderson, B., Altenburger, K., Turk, A., Guyton, J., Goldin, J., and Ho, D. E. Integrating reward maximization and population estimation: Sequential decision-making for internal revenue service audit selection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 5087–5095, 2023.
- Henry et al. (2015) Henry, K. E., Hager, D. N., Pronovost, P. J., and Saria, S. A targeted real-time early warning score (trewscore) for septic shock. Science Translational Medicine, 7(299):299ra122–299ra122, 2015.
- Hu et al. (2022) Hu, X., Niu, Y., Miao, C., Hua, X.-S., and Zhang, H. On non-random missing labels in semi-supervised learning. In 10th International Conference on Learning Representations, 2022.
- Imbens & Rubin (2015) Imbens, G. W. and Rubin, D. B. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
- Janzing (2019) Janzing, D. Causal regularization. Advances in Neural Information Processing Systems, 32, 2019.
- Jehi et al. (2020) Jehi, L., Ji, X., Milinovich, A., Erzurum, S., Rubin, B. P., Gordon, S., Young, J. B., and Kattan, M. W. Individualizing risk prediction for positive coronavirus disease 2019 testing: results from 11,672 patients. Chest, 158(4):1364–1375, 2020.
- Johnson et al. (2016) Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
- Johnson et al. (2018) Johnson, A. E., Aboab, J., Raffa, J. D., Pollard, T. J., Deliberato, R. O., Celi, L. A., and Stone, D. J. A comparative analysis of sepsis identification methods in an electronic database. Critical Care Medicine, 46(4):494–499, 2018.
- Kamran et al. (2022) Kamran, F., Tang, S., Ötleş, E., McEvoy, D. S., Saleh, S. N., Gong, J., Li, B. Y., Dutta, S., Liu, X., Medford, R. J., Valley, T. S., West, L. R., Singh, K., Blumberg, S., Donnelly, J. P., Shenoy, E. S., Ayanian, J. Z., Nallamothu, B. K., Sjoding, M. W., and Wiens, J. Early identification of patients admitted to hospital for covid-19 at risk of clinical deterioration: model development and multisite external validation study. The BMJ, 376, 2022.
- Kato et al. (2023) Kato, M., Wu, S., Kureishi, K., and Yasui, S. Automatic debiased learning from positive, unlabeled, and exposure data. arXiv preprint arXiv:2303.04797, 2023.
- Kennedy (2023) Kennedy, E. H. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023.
- Kiani et al. (2023) Kiani, S., Barton, J., Sushinsky, J., Heimbach, L., and Luo, B. Counterfactual prediction under selective confounding. arXiv preprint arXiv:2310.14064, 2023.
- Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.
- Kleinberg et al. (2018) Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., and Mullainathan, S. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2018.
- Kontokosta & Hong (2021) Kontokosta, C. E. and Hong, B. Bias in smart city governance: How socio-spatial disparities in 311 complaint behavior impact the fairness of data-driven decisions. Sustainable Cities and Society, 64:102503, 2021.
- Laine et al. (2020) Laine, R., Hyttinen, A., and Mathioudakis, M. Evaluating decision makers over selectively labelled data: A causal modelling approach. In Discovery Science: 23rd International Conference, DS 2020, Thessaloniki, Greece, October 19–21, 2020, Proceedings 23, pp. 3–18, 2020.
- Lakkaraju et al. (2017) Lakkaraju, H., Kleinberg, J., Leskovec, J., Ludwig, J., and Mullainathan, S. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 275–284, 2017.
- Laufer et al. (2022) Laufer, B., Pierson, E., and Garg, N. End-to-end auditing of decision pipelines. In ICML Workshop on Responsible Decision-Making in Dynamic Environments., pp. 1–7, 2022.
- Lee (2013) Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3(2), pp. 896, 2013.
- Li et al. (2020) Li, J., Socher, R., and Hoi, S. C. H. Dividemix: Learning with noisy labels as semi-supervised learning. In 8th International Conference on Learning Representations, ICLR, 2020.
- Li et al. (2021) Li, X., Liu, T., Han, B., Niu, G., and Sugiyama, M. Provably end-to-end label-noise learning without anchor points. In Proceedings of the 38th International Conference on Machine Learning, pp. 6403–6413, 2021.
- Liu & Tao (2015) Liu, T. and Tao, D. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2015.
- Liu & Guo (2020) Liu, Y. and Guo, H. Peer loss functions: Learning from noisy labels without knowing noise rates. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- Liu & Garg (2022) Liu, Z. and Garg, N. Equity in resident crowdsourcing: Measuring under-reporting without ground truth data. In Proceedings of the 23rd ACM Conference on Economics and Computation, pp. 1016–1017, 2022.
- Madras et al. (2019) Madras, D., Creager, E., Pitassi, T., and Zemel, R. Fairness through causal awareness. Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019.
- McDonald et al. (2021) McDonald, S. A., Medford, R. J., Basit, M. A., Diercks, D. B., and Courtney, D. M. Derivation with internal validation of a multivariable predictive model to predict covid-19 test results in emergency department patients. Academic Emergency Medicine, 28(2):206–214, 2021.
- Menon et al. (2018) Menon, A. K., Van Rooyen, B., and Natarajan, N. Learning from binary labels with instance-dependent noise. Machine Learning, 107(8):1561–1595, 2018.
- Mulherin & Miller (2002) Mulherin, S. A. and Miller, W. C. Spectrum bias or spectrum effect? subgroup variation in diagnostic test evaluation. Annals of Internal Medicine, 137(7):598–602, 2002.
- Mullainathan & Obermeyer (2022) Mullainathan, S. and Obermeyer, Z. Diagnosing physician error: A machine learning approach to low-value health care. The Quarterly Journal of Economics, 137(2):679–727, 2022.
- Nguyen et al. (2020) Nguyen, D. T., Mummadi, C. K., Ngo, T., Nguyen, T. H. P., Beggel, L., and Brox, T. SELF: Learning to Filter Noisy Labels with Self-Ensembling. In 8th International Conference on Learning Representations, 2020.
- Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library, 2019.
- Patrini et al. (2017) Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952, 2017.
- Pearl (2009) Pearl, J. Causality. Cambridge university press, 2009.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Peng et al. (2019) Peng, A., Nushi, B., Kıcıman, E., Inkpen, K., Suri, S., and Kamar, E. What you see is what you get? the impact of representation criteria on human bias in hiring. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pp. 125–134, 2019.
- Pierson et al. (2018) Pierson, E., Corbett-Davies, S., and Goel, S. Fast threshold tests for detecting discrimination. In International Conference on Artificial Intelligence and Statistics, pp. 96–105, 2018.
- Pierson et al. (2020) Pierson, E., Simoiu, C., Overgoor, J., Corbett-Davies, S., Jenson, D., Shoemaker, A., Ramachandran, V., Barghouty, P., Phillips, C., Shroff, R., et al. A large-scale analysis of racial disparities in police stops across the united states. Nature Human Behaviour, 4(7):736–745, 2020.
- Rambachan & Roth (2020) Rambachan, A. and Roth, J. Bias in, bias out? Evaluating the folk wisdom. In 1st Symposium on Foundations of Responsible Computing, 2020.
- Rhee & Klompas (2020) Rhee, C. and Klompas, M. Sepsis trends: increasing incidence and decreasing mortality, or changing denominator? Journal of Thoracic Disease, 12(Suppl 1):S89, 2020.
- Rizve et al. (2021) Rizve, M. N., Duarte, K., Rawat, Y. S., and Shah, M. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In 9th International Conference on Learning Representations, 2021.
- Robins et al. (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, pp. 846–866, 1994.
- Rockafellar (1970) Rockafellar, R. T. Convex analysis. Princeton University Press, Princeton, N. J., 1970.
- Rosenbaum & Rubin (1983) Rosenbaum, P. R. and Rubin, D. B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
- Saxena et al. (2020) Saxena, D., Badillo-Urquiola, K., Wisniewski, P. J., and Guha, S. A human-centered review of algorithms used within the u.s. child welfare system. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–15, 2020.
- Schulman et al. (1999) Schulman, K. A., Berlin, J. A., Harless, W., Kerner, J. F., Sistrunk, S., Gersh, B. J., Dube, R., Taleghani, C. K., Burke, J. E., Williams, S., et al. The effect of race and sex on physicians’ recommendations for cardiac catheterization. New England Journal of Medicine, 340(8):618–626, 1999.
- Seymour et al. (2016) Seymour, C. W., Liu, V. X., Iwashyna, T. J., Brunkhorst, F. M., Rea, T. D., Scherag, A., Rubenfeld, G., Kahn, J. M., Shankar-Hari, M., Singer, M., et al. Assessment of clinical criteria for sepsis: for the third international consensus definitions for sepsis and septic shock (sepsis-3). JAMA, 315(8):762–774, 2016.
- Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- Shanmugam et al. (2024) Shanmugam, D., Hou, K., and Pierson, E. Quantifying disparities in intimate partner violence: a machine learning method to correct for underreporting. npj Women’s Health, 2(1), 2024.
- Shi et al. (2019) Shi, C., Blei, D., and Veitch, V. Adapting neural networks for the estimation of treatment effects. Advances in Neural Information Processing Systems, 32, 2019.
- Singer et al. (2016) Singer, M., Deutschman, C. S., Seymour, C. W., Shankar-Hari, M., Annane, D., Bauer, M., Bellomo, R., Bernard, G. R., Chiche, J.-D., Coopersmith, C. M., et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA, 315(8):801–810, 2016.
- Sportisse et al. (2023) Sportisse, A., Schmutz, H., Humbert, O., Bouveyron, C., and Mattei, P.-A. Are labels informative in semi-supervised learning? estimating and leveraging the missing-data mechanism. In Proceedings of the 40th International Conference on Machine Learning, 2023.
- Sühr et al. (2021) Sühr, T., Hilgard, S., and Lakkaraju, H. Does fair ranking improve minority outcomes? Understanding the interplay of human and algorithmic biases in online hiring. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 989–999, 2021.
- Teeple et al. (2020) Teeple, E., Hartvigsen, T., Sen, C., Claypool, K. T., and Rundensteiner, E. A. Clinical performance evaluation of a machine learning system for predicting hospital-acquired Clostridium difficile infection. In HEALTHINF, pp. 656–663, 2020.
- The pandas development team (2020) The pandas development team. pandas-dev/pandas: Pandas, 2020.
- Tjandra & Wiens (2023) Tjandra, D. and Wiens, J. Leveraging an alignment set in tackling instance-dependent label noise. In Proceedings of the Conference on Health, Inference, and Learning, 2023.
- Van Der Laan & Rubin (2006) Van Der Laan, M. J. and Rubin, D. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), 2006.
- Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.
- Vogel et al. (2021) Vogel, R., Bellet, A., and Clémençon, S. Learning fair scoring functions: Bipartite ranking under ROC-based fairness constraints. In International Conference on Artificial Intelligence and Statistics, pp. 784–792, 2021.
- Wager (2020) Wager, S. Stats 361: Causal inference. Technical report, Stanford University, 2020.
- Wald & Saria (2023) Wald, Y. and Saria, S. Birds of an odd feather: Guaranteed out-of-distribution (OOD) novel category detection. In Uncertainty in Artificial Intelligence, pp. 2179–2191, 2023.
- Wang et al. (2021) Wang, J., Liu, Y., and Levy, C. Fair classification with group-dependent label noise. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 526–536, 2021.
- Wang et al. (2024) Wang, X., Chen, H., Guo, T., and Wang, Y. Pue: Biased positive-unlabeled learning enhancement by causal inference. Advances in Neural Information Processing Systems, 36, 2024.
- Wu et al. (2022) Wu, S., Gong, M., Han, B., Liu, Y., and Liu, T. Fair classification with instance-dependent label noise. In Proceedings of the First Conference on Causal Learning and Reasoning, volume 177 of Proceedings of Machine Learning Research, pp. 927–943, Apr 2022.
- Yao et al. (2021) Yao, Y., Liu, T., Gong, M., Han, B., Niu, G., and Zhang, K. Instance-dependent label-noise learning under a structural causal model. In Advances in Neural Information Processing Systems, volume 34, pp. 4409–4420, 2021.
- Zhang et al. (2021) Zhang, Y., Zheng, S., Wu, P., Goswami, M., and Chen, C. Learning with feature-dependent label noise: A progressive approach. In 9th International Conference on Learning Representations, 2021.
- Zhang & Sabuncu (2018) Zhang, Z. and Sabuncu, M. R. Generalized cross entropy loss for training deep neural networks with noisy labels. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8792–8802, 2018.
- Zhao et al. (2022) Zhao, G., Li, G., Qin, Y., Liu, F., and Yu, Y. Centrality and consistency: Two-stage clean samples identification for learning with instance-dependent noisy labels. In Avidan, S., Brostow, G., Cissé, M., Farinella, G. M., and Hassner, T. (eds.), Computer Vision – ECCV 2022, pp. 21–37, 2022.
- Zhu & Ghahramani (2002) Zhu, X. and Ghahramani, Z. Learning from labeled and unlabeled data with label propagation. Technical report, Center for Automated Learning and Discovery, Carnegie Mellon University, 2002.
- Zhu et al. (2021) Zhu, Z., Song, Y., and Liu, Y. Clusterability as an alternative to anchor points when learning with noisy labels. In Proceedings of the 38th International Conference on Machine Learning, pp. 12912–12923, 2021.
Appendix A Selective labeling in the literature
We enumerate domains in which our literature review found instances of selective label problems in the ML methods and applications literature:
- Social & public policy: Child welfare assessment (Saxena et al., 2020; Kiani et al., 2023), urban planning/policy (Kontokosta & Hong, 2021; Laufer et al., 2022; Liu & Tao, 2015), hiring pipelines (Peng et al., 2019; Sühr et al., 2021), student placement (Bergman et al., 2021), and bias in policing (Rambachan & Roth, 2020; Pierson et al., 2020)
We note that this is not an exhaustive list of all papers in the selective labeling literature or related problem settings. However, this list illustrates the broad applicability and relevance of our problem setting.
Appendix B Omitted Proofs
For convenience, we restate all theorems and propositions here.
B.1 Theorem 3.1
Theorem (E-step derivation).
The posterior conditional mean of $y^{(i)}$ given the observed data, $(x^{(i)}, a^{(i)}, t^{(i)}, \tilde{y}^{(i)})$, is equal to
$$\mathbb{E}\big[y^{(i)} \mid x^{(i)}, a^{(i)}, t^{(i)}, \tilde{y}^{(i)}\big] = t^{(i)}\tilde{y}^{(i)} + \big(1 - t^{(i)}\big)\Pr\big(y^{(i)} = 1 \mid x^{(i)}\big). \quad (14)$$
Proof.
We drop superscripts in the proof for clarity. Denote by $p(y \mid x, a, t, \tilde{y})$ the posterior distribution of $y$ given the observed data (i.e., the E-step estimate). First, we can write
$$\mathbb{E}[y \mid x, a, t, \tilde{y}] = \Pr(y = 1 \mid x, a, t, \tilde{y}) \quad (15)$$
for simplicity, where we use the fact that the expectation of a binary variable equals the probability that it is 1, and the fact that $y$ is binary. Proceeding, we can write:
(16)–(20)
The second equality follows since $\tilde{y} = t \cdot y$. The third equality holds by construction, since $\tilde{y} = t \cdot y$ under disparate censorship. The final step follows from three facts: (1) $\Pr(\tilde{y} = 1 \mid y, t = 0) = 0$ for all $y$, (2) $\Pr(\tilde{y} = y \mid y, t = 1) = 1$ for all $y$, and (3) $y \perp (a, t) \mid x$ under the disparate censorship DAG. More succinctly, the E-step is:
$$\hat{y} \triangleq \mathbb{E}[y \mid x, a, t, \tilde{y}] = t\tilde{y} + (1 - t)\Pr(y = 1 \mid x), \quad (21)$$
which is what we wanted to show. ∎
Remark B.1.
Since $y$ is binary by assumption, this result fully determines the posterior distribution, since $\Pr(y = 0 \mid x, a, t, \tilde{y}) = 1 - \Pr(y = 1 \mid x, a, t, \tilde{y})$.
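For concreteness, the E-step in Eq. 21 reduces to a one-line imputation. Below is a minimal PyTorch sketch; the tensor names are illustrative, not the released implementation.

```python
import torch

def e_step(t: torch.Tensor, y_tilde: torch.Tensor, nu_x: torch.Tensor) -> torch.Tensor:
    """E-step pseudo-labels under disparate censorship (sketch).

    t        -- binary testing indicators (1 = tested/labeled)
    y_tilde  -- observed labels, with untested individuals imputed as 0
    nu_x     -- current outcome-model estimates of Pr(y = 1 | x)

    Tested examples keep their observed label; untested examples are
    imputed with the current outcome-model estimate.
    """
    return t * y_tilde + (1.0 - t) * nu_x
```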
B.2 Theorem 3.2
Theorem (M-step derivation).
Let $p_\theta$ be a model for the joint data distribution, parameterized by some arbitrary $\theta$ in some parameter space $\Theta$, which factorizes according to the disparate censorship DAG (Fig. 1). Let $\hat{y}^{(i)}$ be the posterior expectation that $y^{(i)} = 1$ given the observed data. Then (replacing random variables with their realized counterparts), we have
(22)
where $\hat{y}^{(i)} = \mathbb{E}[y^{(i)} \mid x^{(i)}, a^{(i)}, t^{(i)}, \tilde{y}^{(i)}]$.
Proof.
We first construct the evidence lower bound (ELBO) of the LHS in the theorem statement. First, for a single example indexed by $i$, we can write:
(23)–(24)
via Jensen’s inequality. Then, we note that
(25)
dropping the entropy of the posterior, which is constant with respect to $\theta$, after expanding the ELBO. We can then use the DAG to factorize the joint distribution of all variables (including the latent variable $y$), which is given by
$$p_\theta(x, a, t, y, \tilde{y}) = p(x, a)\, p(t \mid x, a)\, p_\theta(y \mid x)\, p(\tilde{y} \mid y, t). \quad (26)$$
Note that we need only model the last two terms to estimate $\theta$. The first two terms do not involve $\theta$, are not parameterized, and can be dropped from the maximization problem. Hence, we proceed to write
(27)
where the expectation is taken with respect to the E-step posterior. This can be rewritten as
(28)
at which point we note that it is sufficient to show that
(29)
is constant in $\theta$. We can rewrite the above as
(30)
First, note that the event $\tilde{y} \neq t y$ occurs with probability zero by definition (recall that $\tilde{y} = t \cdot y$). Thus, the corresponding term cannot change with respect to $\theta$; we drop it from the maximization problem. Similarly, $\Pr(\tilde{y} = t y \mid y, t) = 1$ by definition, so $\log \Pr(\tilde{y} \mid y, t) = 0$, which is constant as needed, from which the theorem follows. ∎
Remark B.2.
In the theorem statement, replacing $p_\theta(y = 1 \mid x)$ with $\nu_\theta(x)$ and $\Pr(t = 1 \mid x, a)$ with $\hat{\pi}(x, a)$, assuming $y$ and $\tilde{y}$ are binary, and writing the explicit form of negative binary cross-entropy (e.g., $y \log p + (1 - y)\log(1 - p)$) recovers the form of the M-step objective seen in Eq. 7. Note that the optimization problem flips from a maximization to a minimization due to the relationship between maximizing the log-likelihood of binary variable(s) and minimizing cross-entropy loss.
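To make this correspondence concrete, the following is a minimal PyTorch sketch of the example-wise objective in Eq. 86, which is our reading of Eq. 7; tensor names are illustrative, and the released implementation may differ.

```python
import torch
import torch.nn.functional as F

def m_step_loss(nu_x: torch.Tensor, pi_hat: torch.Tensor,
                y_hat: torch.Tensor, y_tilde: torch.Tensor) -> torch.Tensor:
    """Example-wise M-step objective (sketch).

    nu_x    -- outcome-model outputs, estimates of Pr(y = 1 | x)
    pi_hat  -- fixed propensity estimates of Pr(t = 1 | x, a)
    y_hat   -- E-step pseudo-labels (float targets)
    y_tilde -- observed (censored) labels, untested examples set to 0

    The first term fits the pseudo-labels; the second (causal
    regularization) fits the observed label with nu_x * pi_hat, since
    Pr(y~ = 1 | x, a) = Pr(y = 1 | x) * Pr(t = 1 | x, a) when y is
    conditionally independent of t given x.
    """
    pseudo_label_term = F.binary_cross_entropy(nu_x, y_hat)
    causal_reg_term = F.binary_cross_entropy(nu_x * pi_hat, y_tilde)
    return pseudo_label_term + causal_reg_term
```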
B.3 Theorem 3.4
Theorem (Strength of the causal regularizer in $\hat{\pi}$).
For an example indexed by $i$, $\hat{y} \in (0, 1)$, and $\nu^*$ defined as in Eq. 88, the causal regularization strength $|\nu^* - \hat{y}|$ is monotonically increasing in $\hat{\pi}$ on $(0, 1]$.
Proof.
As a proof outline, we first derive the closed form of $\nu^*$ by solving the first-order optimality condition of Eq. 88. Then, we show that $\nu^*$ is decreasing in $\hat{\pi}$ and attains a maximum of $\hat{y}$ as $\hat{\pi} \to 0^+$. We conclude by showing that the preceding implies that $|\nu^* - \hat{y}|$ is monotonically increasing in $\hat{\pi}$ on $(0, 1]$, as desired.
The first-order optimality condition of Eq. 88 is
$$-\frac{\hat{y}}{\nu} + \frac{1 - \hat{y}}{1 - \nu} + \frac{\hat{\pi}}{1 - \nu\hat{\pi}} = 0. \quad (31)$$
By assumption (convexity of $\ell$), the minimizer is unique. Some algebra yields
$$-\hat{y}(1 - \nu)(1 - \nu\hat{\pi}) + (1 - \hat{y})\nu(1 - \nu\hat{\pi}) + \hat{\pi}\nu(1 - \nu) = 0 \quad (32)$$
$$2\hat{\pi}\nu^2 - (1 + \hat{\pi} + \hat{y}\hat{\pi})\nu + \hat{y} = 0, \quad (33)$$
from which we can apply the quadratic formula. Define $b \triangleq 1 + \hat{\pi} + \hat{y}\hat{\pi}$ and $D \triangleq b^2 - 8\hat{\pi}\hat{y}$. (We use the letter $b$ because it corresponds, up to sign, to the coefficient $b$ in the conventional quadratic formula; we choose the letter $D$ since it corresponds to the discriminant.) The quadratic formula yields solutions
$$\nu = \frac{b \pm \sqrt{D}}{4\hat{\pi}}. \quad (34)$$
We use the fact that $\nu^*$ must be in $[0, 1]$ and the constraints that $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$ to determine which branch of Eq. 34 yields real solutions in $[0, 1]$. By Lemma 1, $D \geq 0$, so the solutions are real. Then, by Lemma 2,
$$\frac{b + \sqrt{D}}{4\hat{\pi}} \geq 1, \quad (35)$$
eliminating that branch. By elimination,
$$\nu^* = \frac{b - \sqrt{D}}{4\hat{\pi}}. \quad (36)$$
Lemma 3 verifies that the resulting $\nu^*$ lies in $[0, 1]$, as needed. To proceed, it suffices to show that $\nu^*$ is decreasing in $\hat{\pi}$ and attains a maximum of $\hat{y}$ as $\hat{\pi} \to 0^+$.
Applying Lemma 4, to prove that $\nu^*$ decreases in $\hat{\pi}$, it is sufficient to show
(37)
because the LHS of Eq. 37 has the same sign as the derivative of $\nu^*$ for the values of $\hat{\pi}$ of interest.
For values of $\hat{y}$ and $\hat{\pi}$ at which the boundary case of Lemma 1 holds, Lemma 1 yields the desired result. For the remaining values, we can write
(38)–(40)
which holds for all feasible values of $\hat{y}$ and $\hat{\pi}$. Lastly, due to the monotonicity of $\nu^*$ in $\hat{\pi}$, the following one-sided limit is the maximum:
$$\lim_{\hat{\pi} \to 0^+} \nu^* = \lim_{\hat{\pi} \to 0^+} \frac{b - \sqrt{D}}{4\hat{\pi}}. \quad (41)$$
We want to show that the limit is $\hat{y}$. Note that, since $\ell$ is finite and convex, it is continuous (Corollary 10.1.1, Rockafellar, 1970); hence, this limit exists. Since substituting $\hat{\pi} = 0$ yields the indeterminate form $0/0$, we appeal to L’Hôpital’s rule:
$$\lim_{\hat{\pi} \to 0^+} \frac{b - \sqrt{D}}{4\hat{\pi}} = \lim_{\hat{\pi} \to 0^+} \frac{1}{4}\left(\frac{\partial b}{\partial \hat{\pi}} - \frac{1}{2\sqrt{D}}\frac{\partial D}{\partial \hat{\pi}}\right) = \frac{(1 + \hat{y}) - \frac{2(1 + \hat{y}) - 8\hat{y}}{2}}{4} = \hat{y}. \quad (42)$$
Note that $\nu^* \leq \hat{y}$ for all $\hat{\pi} \in (0, 1]$. Since $\hat{y}$ is the maximum of $\nu^*$,
$$|\nu^* - \hat{y}| = \hat{y} - \nu^*. \quad (43)$$
Furthermore, since $\nu^*$ is decreasing in $\hat{\pi}$, $\hat{y} - \nu^*$ must increase in $\hat{\pi}$, from which the theorem follows. ∎
Remark B.3.
We comment on the potential for DCEM to improve robustness to low overlap. To do so, we analyze the sensitivity of the M-step optimum to extreme propensities $\hat{\pi}$. While analyzing the asymptotic variance of consistent estimators is a common approach, asymptotic guarantees for DCEM are unclear due to the inherently non-convex (with respect to the parameters) objective function. Thus, we analyze the Lipschitzness of the M-step optimum versus other causal effect estimators. First, note that
$$\frac{\partial \nu^*}{\partial \hat{\pi}} = \frac{\partial}{\partial \hat{\pi}}\left(\frac{b - \sqrt{D}}{4\hat{\pi}}\right), \quad (44)$$
where $b$ and $D$ are defined as in Lemma 1. For all $\hat{y} \in [0, 1]$ and all $\hat{\pi} \in (0, 1]$, this derivative is bounded (e.g., see Figure 5), and $\nu^*$ is Lipschitz in $\hat{\pi}$. However, consider the expression for an inverse-propensity-weighted estimator, which sums terms of the form
$$\frac{t^{(i)}\, \tilde{y}^{(i)}}{\hat{\pi}(x^{(i)}, a^{(i)})} \quad (45)$$
to obtain a final estimate. Eq. 45 has terms whose Lipschitz constant with respect to $\hat{\pi}$ grows without bound as $\hat{\pi} \to 0$. Thus, in a Lipschitz sense, DCEM may be less sensitive to extreme propensity scores than causal effect estimators such as IPW.
Corollary B.4.
For an example indexed by $i$, $\hat{y} \in [0, 1]$, and $\nu^*$ defined as in Eq. 88, $|\nu^* - \hat{y}|$ is monotonically non-decreasing in $\hat{\pi}$ on $(0, 1]$.
Proof.
The proof is identical to that of Theorem 3.4, except we find that
$$\frac{\partial \nu^*}{\partial \hat{\pi}} \leq 0 \quad (46)$$
instead of being strictly less than zero, from which the corollary follows. ∎
Remark B.5.
For intuition, we show a contour plot of $\nu^*$ in Fig. 5. We verify the result numerically in CVXPY.
(Figure 5: contour plot of $\nu^*$ as a function of $\hat{y}$ and $\hat{\pi}$.)
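A lighter-weight numerical check is sketched below, assuming the binary cross-entropy instantiation of Eq. 88 with $\tilde{y}_i = 0$ (an untested example); the helper names are ours, and the closed form is the one derived in Eq. 36 above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nu_star(y_hat: float, pi: float) -> float:
    # Closed form from 2*pi*nu^2 - (1 + pi + y_hat*pi)*nu + y_hat = 0,
    # taking the root shown to lie in [0, 1] (Lemma 3).
    b = 1.0 + pi + y_hat * pi
    disc = b * b - 8.0 * pi * y_hat  # nonnegative by Lemma 1
    return (b - np.sqrt(disc)) / (4.0 * pi)

def objective(nu: float, y_hat: float, pi: float) -> float:
    # BCE(y_hat, nu) + BCE(0, nu * pi): the untested-example case.
    eps = 1e-12
    bce = -(y_hat * np.log(nu + eps) + (1 - y_hat) * np.log(1 - nu + eps))
    return bce - np.log(1 - nu * pi + eps)

for y_hat in (0.2, 0.5, 0.9):
    prev = y_hat  # nu* -> y_hat as pi -> 0+
    for pi in (0.1, 0.3, 0.5, 0.9):
        closed = nu_star(y_hat, pi)
        numeric = minimize_scalar(objective, bounds=(1e-6, 1 - 1e-6),
                                  args=(y_hat, pi), method="bounded").x
        assert abs(closed - numeric) < 1e-3  # closed form matches numeric optimum
        assert closed <= prev + 1e-9         # nu* decreases in pi (Theorem 3.4)
        prev = closed
```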
B.4 Proposition 3.5
Proposition (Minimizer of M-step when $t^{(i)} = 1$).
Suppose that $t^{(i)} = 1$, let $\hat{y}_i$ be the E-step pseudo-label, and let $\hat{\pi}_i \in (0, 1]$ be some estimate of $\Pr(t = 1 \mid x^{(i)}, a^{(i)})$. We use $\ell$ as shorthand for binary cross-entropy loss. Then, the minimization problem
$$\min_{\nu \in [0, 1]}\; \ell(\hat{y}_i, \nu) + \ell(\tilde{y}_i, \nu\hat{\pi}_i) \quad (47)$$
admits the solution $\nu = \tilde{y}_i$ for all such examples.
Proof.
We briefly verify the convexity of the objective, which follows from the strict convexity of binary cross-entropy loss and the closure of convexity under addition and positive scalar multiplication. Thus, any minimizer of the objective is unique.
We proceed by cases. First, suppose that $\tilde{y}_i = 1$. Substituting the definition of $\hat{y}_i$, and using the fact that $\hat{y}_i = \tilde{y}_i$ when $t^{(i)} = 1$, the objective function for a single example becomes
$$-\log \nu - \log(\nu\hat{\pi}_i), \quad (48)$$
which, by inspection, is minimized at $\nu = 1$. Similarly, for $\tilde{y}_i = 0$, the objective function for a single example is
$$-\log(1 - \nu) - \log(1 - \nu\hat{\pi}_i), \quad (49)$$
which reduces to a sum of binary cross-entropy losses, and $\nu = 0$ minimizes the objective. Combining the two cases, the minimizer of the M-step objective when $t^{(i)} = 1$ is $\nu = \tilde{y}_i$, as desired. ∎
B.5 Causal identifiability
For completeness, we provide the derivation of the causal identifiability results, though it follows directly from existing results (Imbens & Rubin, 2015).
Proposition.
Suppose that conditional exchangeability, or $y \perp t \mid x, a$, holds. Then $\Pr(y = 1 \mid x, a) = \mathbb{E}[y \mid x, a]$, which is identifiable as $\mathbb{E}[\tilde{y} \mid x, a, t = 1]$.
Proof.
We can write
$$\mathbb{E}[\tilde{y} \mid x, a, t = 1] = \mathbb{E}[t y \mid x, a, t = 1] = \mathbb{E}[y \mid x, a, t = 1] = \mathbb{E}[y \mid x, a], \quad (50)$$
where the first equality is due to $\tilde{y} = t y$, the second equality results from conditioning on $t = 1$, and the final equality applies conditional exchangeability. Since $y$ is binary, $\mathbb{E}[y \mid x, a] = \Pr(y = 1 \mid x, a)$, and the proposition follows. ∎
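As an illustration, the identification result says that fitting a probabilistic classifier on the tested subpopulation alone targets $\Pr(y = 1 \mid x, a)$, which is exactly the "tested-only" baseline of Appendix E.3. A minimal scikit-learn sketch with stand-in data (the array names and data-generating choices are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: covariates X, group A, testing indicators T, observed labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
A = rng.integers(0, 2, size=1000)
T = rng.integers(0, 2, size=1000)             # stand-in testing decisions
Y_tilde = T * rng.integers(0, 2, size=1000)   # observed label = t * y

# Under conditional exchangeability, fitting Pr(y~ = 1 | x, a) on the tested
# rows consistently estimates Pr(y = 1 | x, a).
XA = np.column_stack([X, A])
tested = T == 1
model = LogisticRegression().fit(XA[tested], Y_tilde[tested])
p_hat = model.predict_proba(XA)[:, 1]         # estimate of Pr(y = 1 | x, a)
```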
B.6 Lemmas used
Below are the lemmas and proofs referenced in the preceding theorem and proposition proofs.
Lemma 1.
Define $b \triangleq 1 + \hat{\pi} + \hat{y}\hat{\pi}$ and $D \triangleq b^2 - 8\hat{\pi}\hat{y}$ on $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$. Then, $D \geq 0$.
Proof.
Choose any $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$. We can write:
(51)–(54)
The final LHS is convex (by inspection) in one argument, such that
(55)
where the minimum is attained at the stationary point, and concave (by inspection) in the other, such that it suffices to evaluate the final LHS at the endpoints. (Recall that, for a concave function $f$, $f(\lambda x + (1 - \lambda)y) \geq \lambda f(x) + (1 - \lambda)f(y)$ for $\lambda \in [0, 1]$, with equality for $\lambda = 0$ or $1$; thus, via the extreme value theorem, the minimum of $f$ on $[0, 1]$ is achieved at $0$ or $1$.)
(56)–(57)
such that $D \geq 0$ for all other values, as needed.
∎
Lemma 2.
Define $b$ and $D$ as in Lemma 1. Then, for $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$,
$$\frac{b + \sqrt{D}}{4\hat{\pi}} \geq 1. \quad (58)$$
Proof.
Choose any $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$. First, we rewrite
(59)–(61)
For one case, Lemma 1 yields the desired conclusion. For the remaining case, we can write
(62)–(65)
which all feasible $\hat{y}$ and $\hat{\pi}$ satisfy. This completes the proof. ∎
Lemma 3.
Define $b$ and $D$ as in Lemma 1. Then, for $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$,
$$0 \leq \frac{b - \sqrt{D}}{4\hat{\pi}} \leq 1. \quad (66)$$
Proof.
Choose any $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$. Equivalently, we can show
(67)
For the first inequality, note that
(68)–(69)
which rearranges to the desired bound. For the second inequality, note that
(70)–(72)
For one case, Lemma 1 yields the desired conclusion. For the remaining case, the proof proceeds similarly to Lemma 2:
(73)–(76)
which all feasible $\hat{y}$ and $\hat{\pi}$ satisfy. This completes the proof. ∎
Lemma 4.
The derivative $\partial \nu^* / \partial \hat{\pi}$ has the same sign as the left-hand side of Eq. 37 for all $\hat{y} \in [0, 1]$ and $\hat{\pi} \in (0, 1]$.
Proof.
The proof is largely algebraic simplification based on sign-preserving operations. Taking derivatives:
(79)
via the quotient rule and cancelling terms. We can apply sign-preserving operations, namely, positive scalar multiplication, cancelling additive zeroes, and commuting additive terms, as follows:
(80)–(84)
which completes the proof. ∎
B.7 Definition 3.3: causal regularization strength
We expand on our definition of causal regularization strength here. Conventionally, regularization strength is operationalized in terms of a regularization parameter $\lambda$, given a loss $L$ and a regularizer $R$; e.g., an objective of the form
$$\min_\theta\; L(\theta) + \lambda R(\theta). \quad (85)$$
Eq. 85 is an instance of regularized risk minimization (Shalev-Shwartz & Ben-David, 2014). It is also identical to linear scalarization, a technique for characterizing tradeoffs in multi-objective optimization. The equivalence between regularized risk minimization and linear scalarization simply reflects that regularization can impose tradeoffs between minimizing $L$ versus $R$. Regularized risk minimization treats $R$ as a “penalty” term, while linear scalarization treats $R$ as merely another objective. As $\lambda$ increases, the tradeoff increasingly favors minimizing $R$, and vice versa.
Now, consider our M-step objective example-wise:
$$\ell(\hat{y}_i, \nu_i) + \ell(\tilde{y}_i, \nu_i\hat{\pi}_i). \quad (86)$$
The M-step objective can similarly be interpreted as a variation of a regularized risk minimization problem, where $L = \ell(\hat{y}_i, \nu_i)$ and $R = \ell(\tilde{y}_i, \nu_i\hat{\pi}_i)$. However, $\hat{\pi}_i$ is a constant that can affect regularization strength, but it is not a multiplier like $\lambda$. The purpose of our result is to characterize the impact of $\hat{\pi}_i$ on the tradeoff between the two terms of the M-step objective.
Thus, motivated by the tradeoff/multi-objective perspective of regularization, we define regularization strength in terms of a tradeoff between optimizing $L$ and $R$. We observe that
$$\arg\min_{\nu} \ell(\hat{y}_i, \nu) = \hat{y}_i, \quad (87)$$
and define causal regularization strength as the absolute distance between $\hat{y}_i$, the minimizer of $L$, and the optimum of the example-wise M-step objective.
Definition B.6 (Causal regularization strength).
Given an example indexed by $i$, and a finite loss function $\ell(\cdot, \nu)$ convex in $\nu$ on $[0, 1]$ for all targets, define
$$\nu^* \triangleq \arg\min_{\nu \in [0, 1]}\; \ell(\hat{y}_i, \nu) + \ell(\tilde{y}_i, \nu\hat{\pi}_i). \quad (88)$$
The causal regularization strength of the objective is defined as $|\nu^* - \hat{y}_i|$.
Intuitively, we define causal regularization strength in terms of the absolute distance between the optima of the two terms of the M-step objective, which captures a notion of tradeoff between them. Note that this definition does not relate to convergence; we are largely interested in how much the solution to $\min_\nu \ell(\hat{y}_i, \nu)$ shifts after adding the causal regularization term.
Appendix C Additional experimental setup
For both settings, we set random seeds to 42 to facilitate reproducibility.
C.1 Additional details for fully synthetic dataset
We choose the synthetic testing decision boundary as follows:
where $\phi$ is a simulation parameter, and $R_\phi$ is a 2D rotation matrix. Intuitively, the boundary function rotates and translates $x$, then applies a sine-based function that yields a similarly rotated, sinusoidal decision boundary.
We choose the outcome decision boundary as follows:
where the offset is a simulation parameter. For demonstration, we set the parameters such that sufficiently-sized performance gaps across groups can emerge. (Empirically, at extreme parameter values, we found artificially small performance gaps. This is because model errors tend to concentrate near the true decision boundary, which lies in the tails of the covariate distributions defining each group; in those tail regions, the difference between the densities across groups is smaller in our two-Gaussian simulation design.) As a sensitivity analysis, we also replicate all experiments on fully synthetic data across values of $\phi$, representing the “phase” of the decision boundary.
Computing simulation parameters.
We discuss how we find simulation parameters for each target value. Given the targets, we have:
which yields, by substitution,
from which we use a binary search algorithm (bisection), evaluated on simulated data with the current estimate of the mean, to solve for the requisite parameter values. Given those values, we can then solve for the remaining parameters identically:
which yields, again by substitution,
and we can again use binary search to solve for them.
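The bisection step can be implemented with standard root-finding. Below is a minimal sketch under illustrative assumptions: a sigmoid testing boundary with sharpness $k = 30$ (the coefficient mentioned in Appendix E.3) and a scalar offset `c` calibrated so the simulated testing rate hits a target; the exact parameterization of our simulation differs as described above.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(42)
scores = rng.normal(size=100_000)  # stand-in for the simulated decision scores

def testing_rate(c: float, k: float = 30.0) -> float:
    # Mean of Pr(t = 1) = sigmoid(k * (score - c)) over the simulated cohort.
    return float(np.mean(1.0 / (1.0 + np.exp(-k * (scores - c)))))

target = 0.3  # desired overall testing rate
# brentq performs bisection-style root finding on f(c) = rate(c) - target;
# the bracket [-10, 10] guarantees a sign change for standardized scores.
c_star = brentq(lambda c: testing_rate(c) - target, -10.0, 10.0)
print(c_star, testing_rate(c_star))  # testing_rate(c_star) is approximately 0.3
```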
C.2 Additional details for pseudo-synthetic sepsis risk-stratification task
Cohort description.
Our cohort follows from the MIMIC-III Sepsis-3 cohort (Johnson et al., 2018), whose cohort exclusion criteria are publicly available (https://github.com/alistairewj/sepsis3-mimic). Before re-running their pipeline, we corrected an apparent Sepsis-3 definition bug that erroneously labeled individuals as having suspicion of infection if they received a blood culture at any time after an antibiotic. In contrast, the Sepsis-3 definition (Singer et al., 2016) requires the blood culture to occur within 24 hours of antibiotic administration for suspicion of infection. (There are multiple “paths” for meeting the criteria for suspicion of infection; for a full enumeration, see Table 2 of Singer et al. (2016).) In practice, this stricter condition affects of rows in their original cohort: their cohort size is , while ours is .
Feature extraction.
Following the Risk of Sepsis model (Delahanty et al., 2019), we extract the following 13 summary statistics over the initial 3-hour observation period:
1. Maximum lactic acid measurement,
2. first shock index times age (years),
3. last shock index times age (years),
4. maximum white blood cell count,
5. change in lactic acid (last - first),
6. maximum neutrophil count,
7. maximum blood glucose,
8. maximum blood urea nitrogen,
9. maximum respiratory rate,
10. last albumin measurement,
11. minimum systolic blood pressure,
12. maximum creatinine, and
13. maximum body temperature (Fahrenheit).
The shock index is defined as the ratio of heart rate (beats per minute) and systolic blood pressure. Missing features are replaced with -9999 following the original manuscript.
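For illustration, here is a hedged pandas sketch of the aggregation logic for a few of the 13 features; the column names (e.g., `heart_rate`, `charttime`) are hypothetical, and the actual extraction code is part of our code release.

```python
import pandas as pd

def extract_features(obs: pd.DataFrame) -> pd.Series:
    """Summary statistics over one patient's 3-hour observation window (sketch).

    Expects long-format vitals/labs with hypothetical columns:
    charttime, heart_rate, sbp, lactate, resp_rate, age_years.
    """
    obs = obs.sort_values("charttime")
    shock_index = obs["heart_rate"] / obs["sbp"]  # heart rate / systolic BP
    feats = {
        "max_lactate": obs["lactate"].max(),
        "first_shock_index_age": shock_index.iloc[0] * obs["age_years"].iloc[0],
        "last_shock_index_age": shock_index.iloc[-1] * obs["age_years"].iloc[0],
        "delta_lactate": obs["lactate"].iloc[-1] - obs["lactate"].iloc[0],
        "max_resp_rate": obs["resp_rate"].max(),
        "min_sbp": obs["sbp"].min(),
    }
    # Missing features are replaced with -9999, following the original manuscript.
    return pd.Series(feats).fillna(-9999)
```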
Testing decision boundary.
We define the testing probability as
(89)
where the inputs are maximum respiratory rate and minimum systolic blood pressure, standardized by their corresponding standard deviations on the training split. The weighting parameter allows us to examine different testing decisions; thus, we replicate all experiments over a range of weightings, as sketched below.
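Since Eq. 89 is parameterized by a feature weighting, the sketch below shows one plausible shape consistent with the description: a sigmoid over the standardized vitals with a weight `w` mixing the two features. The functional form, sharpness `k`, and offset `c` are assumptions for illustration, not the paper's exact equation.

```python
import numpy as np

def test_prob(max_rr, min_sbp, sigma_rr, sigma_sbp, w, k=30.0, c=0.0):
    """Hypothetical testing boundary Pr(t = 1 | x) for the sepsis task (sketch).

    Standardizes max respiratory rate and min systolic blood pressure by
    their training-split standard deviations, then mixes them with weight
    w in [0, 1]; higher RR and lower SBP increase the probability of testing.
    """
    z = w * (max_rr / sigma_rr) - (1.0 - w) * (min_sbp / sigma_sbp)
    # c would be calibrated (e.g., by bisection, as in Appendix C.1) to hit
    # a target overall testing rate.
    return 1.0 / (1.0 + np.exp(-k * (z - c)))
```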
Appendix D Hyperparameters & additional model details
All hyperparameters were selected using a validation set of examples. Hyperparameters for the sepsis simulation task were chosen such that all approaches attained similar performance when using . We reimplement all existing methods following the original papers, using the code repository as a reference if applicable. We set random seeds to 42 for all models (used for initialization), unless otherwise noted.
D.1 Default hyperparameters
Fully synthetic.
All models use a two-layer neural network with layer sizes , trained for 1000 epochs via Adam (Kingma & Ba, 2015) with learning rate and no weight decay unless specified. EM approaches are trained up to 50 iterations with early stopping on validation loss (patience 3) and warm starts (initialized with solution from the previous iteration).
Sepsis classification.
All predictors are three-layer neural networks with sizes trained for 10000 epochs using Adam with learning rate and weight decay . The DCEM propensity model () is trained for 20000 epochs with learning rate and early stopping with patience 1000, and the DCEM model () uses learning rate and weight decay and , respectively.
D.2 Simulation study
Peer loss & group peer loss:
Both peer loss methods depend on a hyperparameter whose optimal value depends on the setting. To show the peer loss methods in the best light, we manually calculate the optimal value for use in training.
ITE corrected model (DragonNet):
Our estimand of interest is the conditional average treatment effect (CATE) of the sensitive attribute $a$ on testing $t$, which is identifiable via
(90)
under assumptions of consistency and conditional exchangeability. We then apply the CATE as a correction factor to the default model:
(91)
i.e., counterbalancing disparate censorship by “subtracting out” the labeling bias. Note that this is an alternative to the counterbalancing approach of DCEM. We train and conduct inference with targeted regularization.
Truncated LQ:
We searched across and (using the notation of the original paper), using for the final results.
SELF:
We were unable to obtain convergence with Adam, so we used SGD with learning rate 0.01, momentum 0.9, noise parameter 0.05 (for input augmentation), consistency regularization parameter 1, and weight decay as used for one of the experiments in the original paper. Weight decay was selected from . The ensembling/mean teacher parameters were chosen from . The noise was chosen from . The regularization parameter was chosen from . SELF proceeds for a maximum of 50 iterations with patience 1 with respect to validation AUC. We set ensembling momentum to 0.9 and the mean teacher moving average parameter to 0.9. To show SELF in the best light, we prevented SELF from filtering tested positive individuals.
DivideMix:
We use 20 warmup epochs, with as the Beta parameter for the MixMatch step, , , , and , and weight decay . We also experimented with preventing DivideMix from filtering tested positive individuals, but DivideMix was unstable in both settings. Ultimately, we did not prevent DivideMix from filtering tested positive individuals.
EM-based methods (SAREM, DCEM):
We tested SAREM and DCEM with and without the usage of warm starts in the M-step.
D.3 Sepsis risk-stratification
For all baselines, the setup matches the fully synthetic setting except as specified below.
DCEM:
The learning rates under consideration were . The weight decay was selected from .
SELF:
For the sepsis classification experiments, we used SGD and set the learning rate to the highest value tested that did not result in NaN loss; we tested learning rates on a logarithmic grid.
Appendix E Additional empirical results and discussion
E.1 Full results
Here, we report empirical results for all baselines and settings. Due to the large number of empirical settings tested (simulation: 224, sepsis classification: 45), we include a representative subset of the figures, and report the raw numbers used for these results and results not shown in the Appendix via CSV files in the code appendix.
For the simulated task, we show results for , , and . Empirically, changing did not affect the general trends, but amplified/dampened the scale. Increasing beyond the selected range has similar impacts. For lower values of , all methods perform poorly. For the sepsis classification task, we show results for and .
Summary of results.
We summarize when our method (DCEM) empirically performed the best, when it performed similarly to baselines, and when it underperformed baselines.
DCEM is best when…
- (Both metrics) the higher-prevalence group is undertested, and
- (Both metrics) testing rates are sufficiently high.
DCEM is similar to baselines when…
- (Both metrics) testing rates are moderately low, or sufficiently high that it is easier to extrapolate from labeled data.
DCEM underperforms baselines when…
- (ROC gap only) testing rates are low, and
- (ROC gap only) the testing disparity aligns with the prevalence disparity (e.g., settings where learning to predict $\tilde{y}$ preserves ranking in $y$), or
- (Both metrics) testing rates are extremely low.
The strongest alternatives to DCEM in our experiments were SELF (both datasets, bias mitigation), DragonNet (sepsis only, both metrics), and the tested-only model (simulation only, discriminative performance).
Index of figures.
We provide here a list of all result figures in the Appendix, indexed by problem parameters (overall testing rate multiplier), (testing disparity), and (prevalence disparity; simulation only).
Fully-synthetic data
Sepsis classification
E.2 DCEM ablation study
To understand how DCEM design choices impact performance, we conduct an ablation study of repeated iterations and causal regularization:
- Imputation-only: This approach trains a model on the tested-only (labeled) examples, imputes pseudo-labels for the remaining examples, then trains a model on both the pseudo-labeled and labeled data. This is equivalent to a single EM iteration without causal regularization.
- No causal regularization: This approach runs multiple EM iterations, but without causal regularization.
The results (Table 1) suggest that both repeated iterations and causal regularization are essential to the bias mitigation and discriminative capabilities of DCEM. The imputation-only approach fails due to low overlap between the tested vs. untested regions; consequently, the imputed outcomes can be arbitrarily inaccurate. If we keep imputing and retraining (without causal regularization), we recover a form of pseudo-labeling (Lee, 2013). The empirical improvement in performance suggests that repeated supervision from reliably labeled examples helps improve discriminative performance. However, this approach does not adjust for labeling bias (e.g., by using $\hat{\pi}$), and indeed the ROC gap does not improve. Incorporating causal regularization recovers the DCEM M-step. Adding causal regularization guarantees that DCEM locally maximizes log-likelihood, and allows it to mitigate labeling bias by incorporating $\hat{\pi}$ into a propensity score-like term (causal regularization; see Theorem 3.2).
Method | AUC | ROC gap
---|---|---
Imputation-only | .676 [.644, .715] | .063 [.036, .086]
No causal regularization | .767 [.733, .813] | .056 [.016, .086]
DCEM (ours) | .791 [.763, .820] | .031 [.019, .072]
Method | AUC | ROC gap
---|---|---
Tested-only | .808 [.623, .876] | .052 [.020, .093]
Tested-only + group | .764 [.675, .863] | .078 [.025, .278]
IPW | .829 [.598, .874] | .048 [.020, .104]
DR-Learner | .643 [.558, .769] | .117 [.080, .216]
DCEM (ours) | .791 [.763, .820] | .031 [.019, .072]
E.3 Sensitivity analysis of causally-motivated approaches
Here, we conduct a sensitivity analysis of causally-motivated approaches under disparate censorship. These approaches are theoretically consistent estimators of $\mathbb{E}[\tilde{y} \mid x, a, t = 1]$, which we can interpret as the conditional average treatment effect of testing ($t$) on the observed outcome ($\tilde{y}$; see Appendix B.5). We examine the following causally-motivated estimators:
- Tested-only: training models on tested individuals only, using $x$ as covariates,
- Tested-only + group: training models on tested individuals only, using $x$ and $a$ as covariates,
- Inverse propensity weighting (IPW): an IPW-based (Rosenbaum & Rubin, 1983) version of the tested-only approach, and
- Doubly-robust estimator (DR-Learner): a doubly-robust estimator (Kennedy, 2023).
Models are evaluated in the same setting as Fig. 2. Under disparate censorship, low overlap is common due to the “sharpness” of the testing boundary. To validate this hypothesis, we also evaluate causal effect estimators versus DCEM at varying levels of overlap (five settings, expressed as multipliers of the original). Overlap is controlled by the coefficient inside the sigmoid used to generate $t$ (i.e., 30 in the original experiments); recall that $t$ is generated as a Bernoulli random variable whose parameter is a sigmoid of this form. For the DR-learner, we trimmed propensity scores (threshold: 0.05) to obtain estimates that were in [0, 1] (the possible values of $\tilde{y}$).
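Concretely, the overlap manipulation rescales the sigmoid sharpness. The sketch below uses illustrative multipliers and stand-in scores `f_x`; only the base coefficient of 30 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
f_x = rng.normal(size=10_000)  # stand-in decision scores f(x)

for mult in (0.25, 0.5, 1.0, 2.0, 4.0):  # illustrative multipliers
    k = 30.0 * mult  # sharper sigmoid => lower overlap between tested/untested
    p_t = 1.0 / (1.0 + np.exp(-k * f_x))
    # Fraction of examples with propensities near 0 or 1 (poor overlap):
    extreme = np.mean((p_t < 0.05) | (p_t > 0.95))
    print(f"multiplier={mult:4.2f}  extreme-propensity fraction={extreme:.2f}")
```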
DCEM has better bias mitigation capabilities than causal approaches, and a tighter range of discriminative performance.
Table 2 shows that, empirically, DCEM exhibits lower variance under overlap violations than causally-motivated approaches. Notably, DCEM achieves the lowest median ROC gap, and maintains competitive (but not necessarily the best) median AUC. Causally-motivated methods generally have good median discriminative performance, but poor bias mitigation properties. Furthermore, the wide performance ranges of causally-motivated approaches may be unacceptable for safety-critical/high-stakes domains. We note that the DR-learner may underperform in this setting if propensity score trimming introduces sufficient bias: recall that, although double-robustness only requires one correctly-specified model, the asymptotic properties may still depend on the asymptotics of each model (e.g., as shown in Wager (2020)).
DCEM is empirically more robust to overlap violations.
Figure 7 shows that, empirically, as overlap violations increase, DCEM degrades more slowly than causally-motivated approaches in terms of both bias mitigation and discriminative performance. Furthermore, DCEM maintains similarly tight performance ranges across levels of overlap, while the performance ranges of causal approaches widen as overlap violations increase. At low overlap, causally-motivated approaches have similarly tight performance ranges to DCEM.
E.4 Sensitivity analysis of softmax temperature scaling
We can further tune the smoothness of $\hat{\pi}$ via the softmax temperature $\tau$ of the binary classifier for $t$:
$$\hat{\pi}_\tau(x, a) = \frac{1}{1 + e^{-z/\tau}}, \quad (92)$$
where $z$ is the logit outputted by the classifier for each example. Lower values of $\tau$ sharpen $\hat{\pi}$ towards hard 0/1 predictions, while larger values smooth it toward 1/2. Note that $\tau = 1$ recovers the standard softmax function. Thus, adjusting $\tau$ allows us to control the smoothness of the constraint.
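A minimal sketch of the temperature-scaled propensity output, assuming a single-logit binary classifier (the function name is ours):

```python
import torch

def temp_scaled_propensity(logit: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Temperature-scaled estimate of Pr(t = 1 | x, a) (sketch).

    tau < 1 sharpens outputs toward {0, 1}; tau > 1 smooths them toward 1/2;
    tau = 1 recovers the standard sigmoid (binary softmax).
    """
    return torch.sigmoid(logit / tau)
```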
$\tau$ | AUC | ROC gap
---|---|---
0.01 | .778 [.737, .815] | .051 [.020, .104]
0.1 | .791 [.762, .818] | .025 [.014, .057]
1 (default) | .791 [.763, .820] | .031 [.019, .072]
10 | .800 [.730, .858] | .051 [.021, .096]
100 | .762 [.667, .835] | .071 [.032, .097]
(Figure 8: calibration of $\hat{\pi}$ across values of $\tau$.)
Table 3 shows full results (median AUC and ROC gap, plus minima and maxima across settings) for DCEM across various values of the temperature scaling parameter $\tau$. Empirically, our results suggest that temperature scaling does not significantly change the AUC, and may trade off with bias mitigation, since $\hat{\pi}$ may no longer be calibrated. Furthermore, even though median AUC improves in one case ($\tau = 10$), the corresponding range of AUC is much larger (0.128 vs. 0.057 at the default), and the default still yields the maximum empirical worst-case AUC (0.763).
Values of $\tau$ away from 1 tend to yield larger ROC gaps. We find that $\hat{\pi}$ is well-calibrated for $\tau = 1$, but not for other values of $\tau$ (Figure 8). Since $\hat{\pi}$ is critical to counterbalancing disparate censorship, miscalibration in $\hat{\pi}$ could result in larger ROC gaps by reducing the effectiveness/correctness of the causal regularization term. Thus, we opt to maintain $\tau = 1$.
E.5 Sensitivity analysis of E-step initialization
We compare random initialization to using a tested-only model as initialization (the final approach). Empirically, Table 4 shows trivial changes to performance when using a model trained on labeled data to initialize the E-step. This suggests that DCEM is able to overcome poor initialization in the settings studied; i.e., the gains from tested-only initialization may be marginal, if nonzero.
Initialization scheme | AUC | ROC gap
---|---|---
random | .787 [.768, .822] | .031 [.011, .060]
tested-only | .791 [.763, .820] | .031 [.019, .072]
E.6 Tradeoffs between bias mitigation and discriminative performance: SELF
(Figure 9: histograms of AUC for SELF vs. DCEM at matched ROC gap levels.)
We compare instances of DCEM to SELF, controlling for ROC gap. We find that DCEM optimizes discriminative performance more effectively than SELF. Fig. 9 shows a histogram of AUC for SELF and DCEM models with similar ROC gaps across and , increasing in ROC gap to the right. For models with ROC gaps (Fig. 9, 1st and 2nd from left), DCEM improves AUC compared to instances of SELF with similar ROC gaps. At larger ROC gaps, DCEM and SELF obtain similar AUCs (Fig. 9, 1st and 2nd from right). Similarly to the comparison with tested-only models, the results suggest that DCEM is not simply trading improved bias mitigation for performance, but is also able to optimize discriminative performance. Since SELF is a filtering approach that does not account for the causal structure of disparate censorship, its estimates of label bias are likely skewed. In contrast, DCEM explicitly uses the causal structure of disparate censorship to counterbalance label bias.
E.7 Sepsis classification and robustness to shifts in labeling decisions
(Figure 10: performance of DCEM vs. baselines across different testing decision boundaries.)
Fig. 10 shows the performance of DCEM vs. baselines across different testing decision boundaries, indexed by different feature weightings in the labeling decision; the baselines exhibit bimodal behavior. Our results suggest that the baselines require specific labeling decision boundaries to perform above random, and catastrophically underperform (AUC below 0.5) otherwise. Trends are analogous for the ROC gap.
Specifically, baseline performance improves when one feature is more heavily weighted than the other in the labeling decision ($x$-axis near 0 or 1). However, when both features contribute to labeling decisions ($x$-axis near 0.5), the baselines catastrophically fail, while DCEM performance stays high. As seen in Fig. 4, DCEM AUC and ROC gap also exhibit less variation across the different decision boundaries.
Determining which labeling decision boundary is appropriate is a clinical problem that requires domain expertise, and we make no claims as to the clinical appropriateness of any particular boundary. Thus, ML practitioners should not assume that their data will be representative of any particular decision-making pattern. DCEM is an alternative approach that is more robust than baselines to shifts in the labeling decision, and thus warrants consideration when narrow assumptions about labeling biases are undesirable.
Appendix F Computing Infrastructure
Hardware.
We parallelize experiments across 4 A6000 GPUs and 256 AMD CPU cores (4x AMD EPYC 7763 64-Core processors), though the memory requirements of each model are under 2GB of VRAM.
Software.
All experiments are run on a distribution of Ubuntu 20.04.5 LTS (Focal Fossa) with Python 3.9.16 managed by conda 23.3.1. We use PyTorch 1.13.1 with CUDA 11.6 for all experiments (Paszke et al., 2019), with scikit-learn 1.2.2 (Pedregosa et al., 2011), scipy 1.10.1 (Virtanen et al., 2020), numpy 1.25.0 (Harris et al., 2020), and pandas 1.5.3 (The pandas development team, 2020) for data processing/analysis. Matplotlib 3.7.1 was used to generate figures. Additionally, torch_ema 0.3 was used in our implementation of SELF. For the simulation study, we use a modified version of the official disparate censorship repository at https://github.com/MLD3/disparate_censorship (Chang et al., 2022), which is included with our code repository.
Appendix G Code
Code will be released at the MLD3 Github repository at https://github.com/MLD3/DCEM. We redact the data-processing code for the sepsis task only where necessary to ensure compliance with the terms of use for MIMIC-III (Johnson et al., 2016).
(Result figures for all fully-synthetic and sepsis classification settings; see the index in Appendix E.1.)