Learning with Selectively Labeled Data from Multiple Decision-makers
Abstract
We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings. Finally, we theoretically and numerically validate the efficacy of our proposed method.
1 Introduction
The problem of selective labels is common in many decision-making applications involving human subjects. In these applications, each individual receives a certain decision, which in turn determines whether the individual’s outcome label is observed. For example, in judicial bail, the label of interest is whether a defendant returns to court without committing another crime if released. However, this label cannot be observed if bail is denied. Similarly, in lending, the default status of a loan applicant cannot be observed if the loan is not approved. In hiring, a candidate’s job performance cannot be observed if the candidate is not hired.
The selective label problem presents significant challenges for developing effective machine learning (ML) algorithms to support decision-making [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2021]. Specifically, labeled data may not be representative of the broader population to which decisions will be applied, creating a classic selection bias issue. As a result, models trained on selectively labeled data may perform well on the observed subjects but fail to generalize to the full population when deployed in real-world settings. This challenge becomes even more critical when historical decisions relied on unobservable factors not captured in the available data, as is common when human decision-makers are involved. In such cases, the labeled data can differ substantially from the broader population due to unknown factors, further complicating the development of reliable ML models.
In this paper, we address the selective label problem by leveraging the heterogeneity of decision-making processes in many real-world applications. Specifically, we consider settings where decisions are made by multiple decision-makers with distinct decision rules, potentially leading to different outcomes for the same subject. Furthermore, since decision-makers often evaluate similar pools of subjects, each subject can be viewed as being quasi-randomly assigned to a specific decision-maker. This structure offers a way to mitigate the selective label problem: a subject who remains unlabeled under one decision-maker might have been labeled if assigned to another. For instance, judicial decisions are typically made by different judges who vary in their leniency when deciding whether to release the same defendant. The heterogeneous decision-maker framework has also been explored in prior research to address selection bias [e.g., Lakkaraju et al., 2017, Kleinberg et al., 2018, Rambachan et al., 2021, Arnold et al., 2022].
1.1 Main Contributions
In this paper, we leverage the instrumental variable framework in causal inference to formalize the selective label problem with multiple heterogeneous decision-makers. This enables principled identification analyses to characterize the identifiability of classification risks from the observed data, revealing fundamental power and limits of the multiple decision-maker structure in tackling the selective label problem. This provides valuable insights beyond the heuristic approaches in the prior literature.
Specifically, we describe the problem setup and the IV framework in Section 3. In Section 4, we establish conditions under which the classification risk is identified from the observed data with multiple decision-makers, and we highlight that these identification assumptions can be overly strong. In Section 5, we derive tight bounds on the plausible values of the classification risk when exact identification fails. In Section 6, we propose a Unified Cost-sensitive Learning (UCL) algorithm to learn classifiers robust to selection bias in both identification settings. In Section 7, we demonstrate the superior performance of our proposed method through comprehensive numerical experiments on synthetic and semi-synthetic datasets. We defer theoretical guarantees for our method to Appendix D.
Notations
For any positive integer $m$, define $[m] = \{1, \dots, m\}$. We use $|\cdot|$ to denote the absolute value of a number or the cardinality of a set.
2 Related Work
This work is closely related to the growing literature on the following topics: the selective labels problem, identification analysis with IVs, and individualized treatment rule learning.
Selective Labels Problem
Kleinberg et al. [2018], Lakkaraju et al. [2017] formalize the selective labels problem in predicting recidivism within the judicial bail context. They train a prediction algorithm on selectively labeled datasets and evaluate whether the algorithm can outperform human decisions when used to assist decision-making. By exploiting the random assignment of judges to defendants, they group defendants based on judge assignment, ensuring comparable demographic features across groups. They also leverage variations in judges’ release rates, where defendants detained by stricter judges might have been released by more lenient ones. Using a heuristic “contraction” technique, they compare the performance of the algorithm-assisted decisions combined with the most lenient judge’s decisions against the stricter judge’s decisions. Fixing the release rate for both methods, they evaluate the recidivism rates and demonstrate that the prediction algorithm can outperform human decisions in some judge groups.
Bertsimas and Fazel-Zarandi [2022] explore the application of machine learning algorithms to assist public decision-making in immigration enforcement. They leverage the fact that most defendants are randomly assigned to federal and state prisons after conviction, enabling the grouping of defendants based on prison assignments. Additionally, they exploit variations in release rates among prisons due to differences in geographic proximity to ICE offices, allowing for the matching of counterfactual outcomes for defendants detained by stricter prisons. Unlike Kleinberg et al. [2018], Lakkaraju et al. [2017], who focus on one-sided evaluations (assessing algorithmic predictions only on individuals not deported), Bertsimas and Fazel-Zarandi [2022] conduct both one-sided and two-sided evaluations. They not only evaluate the predictions for defendants not deported but also establish partial bounds for the recidivism rates of individuals released by the algorithm who were actually deported by ICE officers (for whom labels are missing). Their findings provide evidence that machine learning algorithms can sometimes outperform human officers in decision-making.
The aforementioned works focus on evaluating the performance of machine learning algorithms on selectively labeled data. However, the core challenge lies in training prediction algorithms that are robust to the selection bias inherent in such datasets, and this is what we focus on in this paper. To address this, De-Arteaga et al. [2018, 2021] propose imputing missing labels as negative when multiple human decision-makers consistently reject a decision, leading to missing labels. While this approach offers a partial solution, it is inherently one-sided, as some missing labels may actually correspond to positive outcomes. Consequently, this method does not fully address the selection bias present in the observed data.
There is also a stream of literature that considers online learning to correct the selection bias in the labels. Kilbertus et al. [2020], Wei [2021], Yang et al. [2022] consider a learn-to-decide approach to address the selective label problem by implementing a stochastic policy/online algorithm that balances exploration of the entire population with exploitation of the selectively labeled dataset. While effective, this approach is resource-intensive and may be impractical for high-cost decision-making scenarios, such as judicial bail determinations or school admissions decisions.
Finally, our work relates to the literature on counterfactual evaluation under unmeasured confounding [Coston et al., 2020a, b, Rambachan et al., 2021, 2022]. Specifically, Rambachan et al. [2022] address the problem of risk assessment under unmeasured confounding by proposing a mean outcome sensitivity model. This meta-framework translates bounds on the average treatment effect (ATE) into bounds on various risk metrics. They outline two approaches to specify the bounding function for the ATE: the outcome regression bound, which involves regressing the outcome on the observed features and treatment, and the instrumental variable (IV) bound, which aligns with our concept of IV partial bounds. However, our IV partial bounds are slightly tighter than their proposed bounds, as we refine them by taking the maximum over all decision-makers for the lower bound and the minimum over all decision-makers for the upper bound. We further prove the tightness of these IV partial bounds. The key distinction between our work and that of Rambachan et al. [2022] lies in focus. While they emphasize estimating mean outcome sensitivity bounds using doubly robust methods to derive bounds for various risk metrics, our work centers on training robust prediction models based on the estimated IV partial bounds. This ensures that the resulting models are resilient to selection bias in the observed training data.
Identification Analysis with IV
This work builds on the IV literature in statistics and econometrics. IV is a common and powerful tool to handle unmeasured confounding in causal inference [Angrist and Pischke, 2009, Angrist and Imbens, 1995, Angrist et al., 1996]. Indeed, some empirical studies have already considered heterogeneous decision-makers as an IV, such as the judge IV in Kling [2006], Mueller-Smith [2015]. However, these analyses are based on linear parametric models, while our identification analysis is free of any parametric restrictions. The point identification analysis builds on Wang and Tchetgen [2018], Cui and Tchetgen Tchetgen [2021], and the partial identification analysis builds on Manski and Pepper [1998], Swanson et al. [2018], Balke and Pearl [1994]. Our paper directly extends these results to the evaluation of prediction risks under multi-valued selective labels with a multi-valued IV, and develops novel reformulations that enable efficient learning of prediction rules via a unified procedure.
Individual Treatment Rule (ITR) Learning
There is a growing body of literature on learning individualized treatment rules (ITR), which seeks to determine the optimal treatment for each individual based on their observed features. This field has been extensively studied in personalized medicine and policy-making [Qian and Murphy, 2011, Zhao et al., 2012, Athey and Wager, 2021]. Qian and Murphy [2011] introduce an inverse propensity weighting (IPW) formulation for ITR learning, parameterizing the value function with a linear model and obtaining a linear ITR through $\ell_1$-penalized square loss minimization. Zhao et al. [2012] also adopt the IPW framework but treat ITR learning as a classification problem, proposing the hinge loss as a surrogate and solving for the optimal ITR using an SVM algorithm. Zhang et al. [2012] develop a doubly robust estimator for the value function and formalize ITR learning as a classification problem, using the square loss as a surrogate for the indicator function. Extending to multi-category problems, Zhang et al. [2020] propose the outcome-weighted learning (OWL) formulation and analyze the Fisher consistency of various surrogate loss functions. We demonstrate that learning under selective labels can similarly be formulated as a cost-sensitive or weighted classification problem when point identification is available. Furthermore, we propose several surrogate loss functions for learning the optimal classification rule and establish finite-sample error bounds for our unified learning algorithm.
Our work is also connected to the research on policy learning under partial identification. Ben-Michael et al. [2021] address robust policy learning under partial identification, focusing specifically on violations of the overlap assumption, where some instances have zero probability of receiving treatment. Kallus and Zhou [2021] investigate off-policy learning in the presence of unmeasured confounding, utilizing an IPW formulation for the value function. They address scenarios where the true propensity score is only partially bounded and propose a minimax optimization framework to learn robust policies. Their approach considers a differentiable, parameterized policy class and employs an iterative algorithm to solve the inner maximization via the dual problem, compute the gradient of the policy value, and update the policy parameters. In contrast, our work demonstrates that the inner maximization in our setting can be solved explicitly. This reformulation allows the minimax learning problem to be cast as a cost-sensitive classification task, enabling the use of state-of-the-art machine learning algorithms to efficiently solve the optimization problem.
Particularly relevant to our work are Cui and Tchetgen Tchetgen [2021] and Pu and Zhang [2021], who study ITR learning with instrumental variables under point identification and partial identification settings, respectively. Pu and Zhang [2021] consider a binary treatment setting and develop a minimax ITR framework using IV partial bounds of the expected potential outcomes. Similar to our approach, they explicitly solve the inner maximization and replace indicator functions with the hinge loss, solving the problem using an SVM algorithm. However, their work stops short of formalizing the minimax optimization as a general weighted classification problem, which could be implemented using any machine learning package. In contrast, we extend this framework to multi-valued treatments and demonstrate that minimax learning can be formulated as a cost-sensitive classification problem, generalizing the binary treatment setting in Pu and Zhang [2021] (see Section 5).
3 Problem Formulation
We address the selective label problem within a multiclass classification setting. Consider a dataset comprising $n$ units, indexed by $i \in [n]$. For each unit $i$, let $Y_i \in [K]$ denote the true label of interest, taking one of $K$ classes. Additionally, let $X_i$ and $U_i$ represent observable and unobservable features, respectively, both of which may be correlated with $Y_i$. Here, “observable” features are those accessible to the ML practitioner, while “unobservable” features are unavailable in the dataset.
In the selective label problem, the true label $Y_i$ is not always observed; its observability depends on a binary decision $D_i \in \{0, 1\}$ made by a decision-maker based on the features $(X_i, U_i)$. Specifically, we consider $m$ different decision-makers and let $Z_i \in [m]$ denote the specific decision-maker assigned to unit $i$. The observed label $\tilde{Y}_i$ is defined as:
$$\tilde{Y}_i = \begin{cases} Y_i & \text{if } D_i = 1, \\ \texttt{NaN} & \text{if } D_i = 0. \end{cases} \tag{3.1}$$
When $D_i = 1$ (e.g., bail granted, loan approved, candidate hired), the observed label is exactly the true label $Y_i$; when $D_i = 0$, the observed label is recorded as NaN (Not a Number), indicating that $Y_i$ remains unobserved in this case. Thus, each unit’s label is selectively observed according to the decision made by decision-maker $Z_i$.
We assume that $\{(X_i, U_i, Z_i, D_i, Y_i)\}_{i=1}^{n}$ are independent and identically distributed samples of a population tuple $(X, U, Z, D, Y)$. The causal structure of these variables is plotted in Figure 1. However, we cannot observe all variables; instead, we only observe the data $\{(X_i, Z_i, D_i, \tilde{Y}_i)\}_{i=1}^{n}$.
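To make the setup concrete, the data-generating process above can be sketched in simulation. All functional forms below (the linear thresholds, the leniency grid) are illustrative assumptions of ours, not part of the paper's model:

```python
import numpy as np

def simulate_selective_labels(n, m=5, seed=0):
    """Simulate the selective-label setup: observable X, unobservable U,
    a random decision-maker assignment Z in {0, ..., m-1}, a binary
    decision D, and a true binary label Y observed only when D = 1.
    All functional forms are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=n)                    # observable feature
    U = rng.normal(size=n)                    # unobservable feature
    Z = rng.integers(0, m, size=n)            # random decision-maker assignment
    leniency = np.linspace(-1.0, 1.0, m)[Z]   # decision-makers vary in leniency
    D = (X + U + leniency + rng.normal(size=n) > 0).astype(int)
    Y = (X + U + rng.normal(size=n) > 0).astype(int)
    Y_obs = np.where(D == 1, Y.astype(float), np.nan)  # NaN when unlabeled
    return X, U, Z, D, Y, Y_obs
```

Only `(X, Z, D, Y_obs)` would be available to the analyst; `U` and the full `Y` are latent.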
3.1 Learning under Selective Labels
Our goal is to learn a classifier $f$ that can accurately predict the true label $Y$ based on the observed features $X$. This classifier will aid in future decision-making processes through accurately predicting $Y$. The effectiveness of the classifier upon deployment across the full population is evaluated using the expected risk with the zero-one loss:
$$\mathcal{R}(f) = \mathbb{E}\big[\mathbb{I}\{f(X) \neq Y\}\big]. \tag{3.2}$$
Here $\mathbb{I}\{\cdot\}$ is an indicator function, which equals $1$ if the event inside occurs and $0$ otherwise. The expectation is taken over the unknown joint distribution of $X$ and $Y$. For a given classifier $f$, the risk $\mathcal{R}(f)$ quantifies its misclassification error rate $\mathbb{P}(f(X) \neq Y)$.
Conventionally, one can train a classifier through empirical risk minimization (ERM), i.e., minimizing an empirical approximation of the classification risk based on the sample data. However, this is challenging under selective labels since the true label $Y$ is not fully observed. One natural alternative is to perform ERM on the labeled subsample (with subsample size $n_1 = \sum_{i=1}^{n} D_i$):
$$\hat{\mathcal{R}}(f) = \frac{1}{n_1} \sum_{i: D_i = 1} \mathbb{I}\{f(X_i) \neq Y_i\}. \tag{3.3}$$
However, the empirical risk $\hat{\mathcal{R}}(f)$ may not capture the true risk even when the total sample size $n \to \infty$:
$$\hat{\mathcal{R}}(f) \to \mathbb{E}\big[\mathbb{I}\{f(X) \neq Y\} \mid D = 1\big] \neq \mathcal{R}(f).$$
This is known as a selection bias problem: the distribution of $(X, Y)$ in the labeled subpopulation differs from that in the full population. The selection bias arises because the historical decision $D$ can depend on the features $X$ and $U$ that are also correlated with the true outcome $Y$. Hence, one has to adjust for both $X$ and $U$ to eliminate the selection bias, as formalized below.
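The selection bias can be seen in a small simulation of our own (an illustrative data-generating process, not the paper's): when the decision depends on the same unobservable that drives the outcome, the labeled subsample over-represents one label.

```python
import numpy as np

# Illustrative simulation (assumed data-generating process): the decision D
# depends on observables X and unobservables U, and U also drives the true
# label Y, so the labeled subsample D = 1 is not representative.
rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=n)
U = rng.normal(size=n)
D = (X + U + rng.normal(size=n) > 0).astype(int)   # selective labeling
Y = (X + U + rng.normal(size=n) > 0).astype(int)   # outcome shares X and U

print(Y.mean())            # population label rate, about 0.50
print(Y[D == 1].mean())    # labeled-subsample label rate, about 0.73
```

Any model fit only to the labeled rows inherits this distorted label distribution.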
Assumption 1.
Suppose (1) Selection-on-observables-and-unobservables: $Y \perp D \mid (X, U)$; (2) Overlap: $\mathbb{P}(D = 1 \mid X, U) > 0$ almost surely.
The first condition in Assumption 1 means that the labeled subpopulation and the full population would have the same distribution if one properly adjusts for $(X, U)$. The overlap condition means that the subpopulation with almost any value of $(X, U)$ has a positive chance to be labeled. Both conditions are standard in the missing data literature [Little and Rubin, 2019]. If the features $(X, U)$ were both observed, then one could correct for the selection bias by adjusting for them via many methods in the literature [e.g., Rosenbaum and Rubin, 1983, Rubin, 2005, Imbens and Rubin, 2015]. However, adjusting for the unobservable $U$ is infeasible, so selection bias remains a significant problem in our setting. This setting is referred to as the missing-not-at-random (MNAR) problem in the missing data literature [Little and Rubin, 2019]. Addressing MNAR is notoriously difficult and typically requires additional information.
3.2 Multiple Decision-makers and IV
To address the selective label problem, we draw on recent literature that leverages the fact that in many applications the selective labels involve multiple decision-makers with heterogeneous decision-making preferences [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2018]. The rationale behind this idea is that the unobserved true label for a unit in the missing group ($D = 0$) could have been observed if the unit had been assigned to a different decision-maker. This is particularly useful when the assignment of decision-makers is random, independent of the unobserved features $U$. For instance, Lakkaraju et al. [2017], Kleinberg et al. [2018] consider that in judicial bail, a defendant is randomly assigned to a judge, so a unit who is not granted bail (and thus unlabeled) could have been assigned to a more lenient judge, received bail, and consequently been labeled.
Although previous works have already considered the structure of multiple decision-makers, they largely rely on heuristic approaches. In fact, the assignment of decision-makers can be formalized as an instrumental variable (IV). As we will show shortly, this IV formalization allows us to understand the role of multiple decision-makers in a principled way. IV is a well-known concept in causal inference. Following the standard literature [e.g., Angrist et al., 1996, Cui and Tchetgen Tchetgen, 2021, Wang and Tchetgen, 2018], a valid IV has to satisfy the following conditions.
Assumption 2 (IV conditions).
The assignment $Z$ satisfies the following three conditions: 1. IV relevance: $Z \not\perp D \mid X$; 2. IV independence: $Z \perp (U, Y) \mid X$; 3. Exclusion restriction: $\tilde{Y} \perp Z \mid (X, D, Y)$.
The IV conditions above are particularly suitable for the random assignment of decision-makers. The IV relevance means that even after controlling for the observed features $X$, different decision-makers could have systematically different decision rules, so that the assignment $Z$ remains correlated with the decision $D$. The IV independence trivially holds when the decision-maker assignment is random. The exclusion restriction automatically holds because the observed label $\tilde{Y}$ is completely determined by the decision $D$ and the true label $Y$, so it is independent of any other variable given $(D, Y)$. We note that these conditions are also explicitly or implicitly assumed in the prior literature. Assumption 2 connects these conditions to the IV framework.
Based on the IV framework, we will next understand the power and limits of the multiple decision-maker structure in addressing the selective label problem. In particular, we will analyze the identification of the classification risk in Equation 3.2, studying under what conditions the risk can be determined from the distribution of observed data with multiple decision-makers.
4 Exact Identification of Classification Risk
In this section, we establish conditions under which the classification risk can be exactly identified from the observed data. This represents settings where consistently evaluating the classification risk is possible, provided that one properly leverages the multiple decision-maker structure.
Before delving into our detailed analysis, we define $p_k(x) = \mathbb{P}(Y = k \mid X = x)$ as the conditional probability that $Y$ equals $k$, given the observable features $X = x$. The Bayes optimal classifier, which minimizes the classification risk $\mathcal{R}(f)$, is $f^*(x) = \arg\max_{k \in [K]} p_k(x)$. In fact, $f^*$ also minimizes the excess risk, defined as:
$$\mathcal{E}(f) = \mathcal{R}(f) - \mathcal{R}(f^*). \tag{4.1}$$
Minimizing $\mathcal{R}(f)$ over classifiers $f$ is equivalent to minimizing $\mathcal{E}(f)$, as the Bayes optimal risk $\mathcal{R}(f^*)$ is a constant. Therefore, we can equivalently study the identification of the excess risk $\mathcal{E}(f)$. The following lemma gives a weighted formulation of the excess risk in terms of the conditional probabilities.
Lemma 1.
The excess risk satisfies
$$\mathcal{E}(f) = \mathbb{E}\Big[\sum_{k=1}^{K} \Big(\max_{j \in [K]} p_j(X) - p_k(X)\Big)\, \mathbb{I}\{f(X) = k\}\Big].$$
Lemma 1 indicates that to identify the excess risk, we first need to identify the conditional probabilities of the latent label $Y$ from the observed data. We therefore focus on the identification of $p_k(x)$ in the sequel.
Identification of Conditional Probabilities
In Section 3.2, we show that the assignment of decision-makers can be viewed as an IV. By adapting the IV identification theory in the causal inference literature [Wang and Tchetgen, 2018, Cui and Tchetgen Tchetgen, 2021], we can identify the classification risk under the following no unmeasured common effect modifiers (NUCEM) assumption.
Assumption 3 (NUCEM).
For each $k \in [K]$ and any $z, z' \in [m]$, $\mathrm{Cov}\big(\mathbb{P}(D = 1 \mid X, U, Z = z) - \mathbb{P}(D = 1 \mid X, U, Z = z'),\; \mathbb{P}(Y = k \mid X, U) \mid X\big) = 0$.
Theorem 1 (Point Identification).
Under Assumptions 1, 2 and 3, $p_k(x)$ is identified for each $k \in [K]$:
$$p_k(x) = \frac{\mathrm{Cov}\big(g(Z), \mathbb{I}\{\tilde{Y} = k\}\, D \mid X = x\big)}{\mathrm{Cov}\big(g(Z), D \mid X = x\big)} \tag{4.2}$$
for any transformation $g$ of the assignment with $\mathrm{Cov}(g(Z), D \mid X = x) \neq 0$.
Conversely, if (4.2) holds, Assumption 3 is also satisfied.
Theorem 1 shows that Assumption 3 is sufficient and necessary for identifying the conditional probability functions $p_k$ through Equation 4.2. This identification equation connects the conditional probability of the latent true label $Y$ with the fully observed variables $(X, Z, D, \tilde{Y})$. This lays the foundation for learning the $p_k$'s (and the classification risk) from the observed data using the decision-maker assignment as an IV.
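As a sanity check, a covariance-ratio estimator in the spirit of Equation 4.2 can be sketched on simulated data satisfying the additive structure of Example 3 below (so that NUCEM holds). We omit the conditioning on $X$ for simplicity, and the specific moment form is our reconstruction rather than the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
m = 5
Z = rng.integers(0, m, size=n)              # random decision-maker assignment
U = rng.normal(size=n)                      # unobservable feature
s = 1.0 / (1.0 + np.exp(-U))                # common effect of U

# Additive decision probabilities: P(D=1 | U, Z=z) = 0.05 + 0.1*z + 0.3*s(U),
# so the cross-decision-maker differences do not depend on U.
D = (rng.random(n) < 0.05 + 0.1 * Z + 0.3 * s).astype(int)
Y = (rng.random(n) < s).astype(int)         # P(Y=1 | U) = s(U); E[Y] = 0.5

naive = Y[D == 1].mean()                    # biased: conditions on D = 1
# Covariance-ratio IV estimate of P(Y = 1) (reconstructed moment form):
iv_est = np.cov(Z, Y * D)[0, 1] / np.cov(Z, D)[0, 1]
print(naive, iv_est)                        # naive is biased upward; IV is near 0.5
```

Here `Y * D` plays the role of $\mathbb{I}\{\tilde{Y} = 1\} D$, which uses only the labeled outcomes.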
Identification of Excess Classification Risk
We can now identify the excess classification risk from observed data.
Theorem 2.
Under Assumptions 1, 2 and 3, the excess classification risk is identified: with $p_k$ given by Equation 4.2,
$$\mathcal{E}(f) = \mathbb{E}\Big[\sum_{k=1}^{K} \Big(\max_{j \in [K]} p_j(X) - p_k(X)\Big)\, \mathbb{I}\{f(X) = k\}\Big], \tag{4.3}$$
and for binary classification with $K = 2$ and classes coded as $\pm 1$,
$$\mathcal{E}(f) = \mathbb{E}\big[\,\big|2 p_1(X) - 1\big|\; \mathbb{I}\{f(X) \neq \mathrm{sign}(2 p_1(X) - 1)\}\,\big]. \tag{4.4}$$
Theorem 2 not only establishes the identification of the excess classification risk under Assumption 3, but also casts the identification in a weighted classification form. Specifically, minimizing the right-hand side of Equation 4.3 corresponds to a cost-sensitive classification problem, while minimizing the right-hand side of Equation 4.4 corresponds to a weighted binary classification problem with the sign of $2 p_1(X) - 1$ as the label and $|2 p_1(X) - 1|$ as the misclassification cost. In Section 6, we will show that these formulations are useful in the design of effective learning algorithms.
The NUCEM Identification Assumption
The identification theory in Theorems 1 and 2 crucially relies on the No Unmeasured Common Effect Modifier (NUCEM) assumption (Assumption 3), which generalizes Cui and Tchetgen Tchetgen [2021] from a binary IV to a multi-valued IV. These analyses reveal that additional assumptions are needed to fully identify the classification risk, even when the decision-maker structure already satisfies the IV conditions in Assumption 2. This fact was not apparent in the prior selective label literature, which lacked a formal IV framework.
To understand the NUCEM assumption, it is instructive to consider examples where this assumption holds.
Example 1 (Full Information).
If all features needed to predict the true label are recorded, then $\mathbb{P}(Y = k \mid X, U) = \mathbb{P}(Y = k \mid X)$ almost surely and Assumption 3 holds.
Example 2 (Separable and Independent Unobservables).
Suppose the unobservables can be divided into two categories, $U = (U_1, U_2)$: $U_1$ influences the decision $D$, and $U_2$ affects the outcome $Y$. Hence, we have $\mathbb{P}(D = 1 \mid X, U, Z) = \mathbb{P}(D = 1 \mid X, U_1, Z)$ and $\mathbb{P}(Y = k \mid X, U) = \mathbb{P}(Y = k \mid X, U_2)$ almost surely. If $U_1$ and $U_2$ are (conditionally) independent given $X$, then Assumption 3 holds.
Example 3 (Additive Decision Probability).
Suppose that for every decision-maker $z \in [m]$, the conditional decision probability satisfies $\mathbb{P}(D = 1 \mid X, U, Z = z) = g(X, U) + h_z(X)$ for a common function $g$ and potentially different functions $h_z$. That is, all decision-makers use the unobservables in a common and additive way. Under this condition, we can prove that Assumption 3 also holds (see Section A.2).
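A quick numerical check of Example 3, using a covariance rendering of the common-effect-modifier condition (our own illustrative formulation): under an additive decision model the cross-decision-maker difference in decision probabilities is free of $U$, while a multiplicative model is not.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=200_000)
s = 1.0 / (1.0 + np.exp(-U))                   # stands in for P(Y=1 | U)

# Difference in decision probabilities between two decision-makers:
add_diff = (0.2 + 0.3 * s) - (0.1 + 0.3 * s)   # additive model: constant in U
mult_diff = (0.4 * s) - (0.2 * s)              # multiplicative model: depends on U

print(np.cov(add_diff, s)[0, 1])               # essentially zero
print(np.cov(mult_diff, s)[0, 1])              # clearly nonzero
```

The vanishing covariance in the additive case is exactly what makes the NUCEM-style condition hold there.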
The examples above show that the NUCEM assumption imposes restrictions on the impact of the unobserved features on the decision, the true label, or both. However, these restrictions may be too strong in practice, so the NUCEM assumption may fail. As a result, the identification formulas in Theorems 1 and 2 are no longer valid. In the next section, we drop the NUCEM assumption and only impose some mild conditions, leading to correspondingly weaker identification results.
5 Partial Identification of Classification Risk
The last section shows that exact identification of the classification risk often requires assumptions that are untenable in practice (see Assumption 3). Hence, it is desirable to avoid such stringent assumptions and instead consider weaker, more reasonable ones. Nevertheless, without the stringent assumptions, it is generally impossible to exactly identify the classification risk from the observed data. Instead, there may exist a range of plausible values of the classification risk, all compatible with the observed data. This plausible range is characterized by so-called partial identification bounds.
Below we adopt a mild bound condition on the conditional probability of the true label for the partial bound analysis.
Assumption 4.
For each fixed $k \in [K]$, there exist two known functions $l_k$ and $u_k$ such that $l_k(X) \le \mathbb{P}(Y = k \mid X, U) \le u_k(X)$ almost surely.
Assumption 4 mandates a known bound on the conditional probability $\mathbb{P}(Y = k \mid X, U)$ while permitting arbitrary dependence on the unobserved features $U$. This assumption is mild, since it is automatically satisfied with $l_k = 0$ and $u_k = 1$ for each $k \in [K]$. With these bounding functions, we derive precise partial bounds on $p_k(x)$.
Partial Identification of Conditional Probabilities
Based on this condition, we can derive tight IV partial identification bounds on the conditional probability $p_k(x)$ for each $k \in [K]$.
Theorem 3.
Assume Assumptions 1, 2 and 4 hold. Then for any $k \in [K]$ and $x$, the value of $p_k(x)$ must fall within the interval $[L_k(x), U_k(x)]$, where
$$L_k(x) = \max_{z \in [m]} \Big\{ \mathbb{E}\big[\mathbb{I}\{\tilde{Y} = k\}\, D \mid X = x, Z = z\big] + l_k(x)\, \mathbb{P}(D = 0 \mid X = x, Z = z) \Big\},$$
$$U_k(x) = \min_{z \in [m]} \Big\{ \mathbb{E}\big[\mathbb{I}\{\tilde{Y} = k\}\, D \mid X = x, Z = z\big] + u_k(x)\, \mathbb{P}(D = 0 \mid X = x, Z = z) \Big\}.$$
Conversely, any value within the interval $[L_k(x), U_k(x)]$ can be a plausible value of $p_k(x)$. Moreover, the interval endpoints satisfy $L_k(x) \le U_k(x)$.
Theorem 3 generalizes the well-known Balke and Pearl's bound and Manski's bound [Balke and Pearl, 1994, Manski and Pepper, 1998]. Specifically, our bounds coincide with Balke and Pearl's bounds when applied to a binary IV (see Section B.2). Furthermore, if $l_k$ and $u_k$ are two constants independent of $x$ (such as $l_k = 0$ and $u_k = 1$), our bounds recover Manski's bounds. Theorem 3 also demonstrates the tightness of our IV partial bounds, in the sense that they give exactly the range of all plausible values of the conditional probabilities $p_k(x)$ (see Section B.1).
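The bounds of Theorem 3 are straightforward to estimate by plug-in. The sketch below computes them with $l_k = 0$ and $u_k = 1$ (the assumption-free case), marginally over $X$ for simplicity, on a simulated design of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
m = 5
Z = rng.integers(0, m, size=n)
U = rng.normal(size=n)
s = 1.0 / (1.0 + np.exp(-U))
D = (rng.random(n) < 0.05 + 0.1 * Z + 0.3 * s).astype(int)
Y = (rng.random(n) < s).astype(int)            # true P(Y = 1) = 0.5

# Per-decision-maker bounds with l = 0, u = 1: the unlabeled mass P(D=0 | Z=z)
# is attributed entirely to label 0 (lower bound) or label 1 (upper bound).
lb, ub = [], []
for z in range(m):
    mask = Z == z
    labeled_rate = np.mean(Y[mask] * D[mask])  # E[ I{Y=1, D=1} | Z=z ]
    miss_rate = np.mean(D[mask] == 0)          # P(D=0 | Z=z)
    lb.append(labeled_rate)                    # l = 0 case
    ub.append(labeled_rate + miss_rate)        # u = 1 case

L_hat, U_hat = max(lb), min(ub)                # sharpen across decision-makers
print(L_hat, U_hat)                            # interval containing the true 0.5
```

Taking the max of the lower bounds and the min of the upper bounds across decision-makers is what tightens the single-decision-maker Manski interval.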
Partial Identification of Excess Classification Risk
When the $p_k(x)$'s are only partially identified within $[L_k(x), U_k(x)]$ for each $k \in [K]$, the excess classification risk is also partially identified. Define the feasible region for the conditional probability vector $p(x) = (p_1(x), \dots, p_K(x))$ as
$$\mathcal{C}(x) = \Big\{ q \in [0, 1]^K : \sum_{k=1}^{K} q_k = 1,\; q_k \in [L_k(x), U_k(x)] \text{ for each } k \in [K] \Big\}. \tag{5.1}$$
Then the range of plausible values of the excess risk is given by the interval $[\underline{\mathcal{E}}(f), \overline{\mathcal{E}}(f)]$, where
$$\overline{\mathcal{E}}(f) = \mathbb{E}\Big[\max_{q \in \mathcal{C}(X)} \sum_{k=1}^{K} \Big(\max_{j \in [K]} q_j - q_k\Big)\, \mathbb{I}\{f(X) = k\}\Big], \tag{5.2}$$
and $\underline{\mathcal{E}}(f)$ is defined symmetrically by changing the maximization over $\mathcal{C}(X)$ into a minimization over the same set. We are particularly interested in the upper bound $\overline{\mathcal{E}}(f)$, as it gives the worst-case risk of a classifier $f$. Minimizing this upper bound, i.e., solving $\min_f \overline{\mathcal{E}}(f)$, hence leads to a robust classifier safeguarding the performance in the worst case. Below we further provide a reformulation of the worst-case risk under a mild regularity condition.
Theorem 4.
Suppose the conditions in Theorem 3 hold and the set $\mathcal{C}(x)$ defined in eq. (5.1) is non-empty. Given the bounds in Theorem 3, we define the “realizable” partial bounds for $p_k(x)$ as $\tilde{L}_k(x) = \max\big\{L_k(x),\, 1 - \sum_{j \neq k} U_j(x)\big\}$ and $\tilde{U}_k(x) = \min\big\{U_k(x),\, 1 - \sum_{j \neq k} L_j(x)\big\}$. We have
$$\overline{\mathcal{E}}(f) = \mathbb{E}\Big[\sum_{k=1}^{K} W_k(X)\, \mathbb{I}\{f(X) = k\}\Big],$$
where $W_k(x) = \big(\max_{j \neq k} \tilde{U}_j(x) - \tilde{L}_k(x)\big)_+$ denotes the weight function. This can be simplified for binary classification problems with $K = 2$:
$$\overline{\mathcal{E}}(f) = \mathbb{E}\big[\,\big|W_1(X) - W_2(X)\big|\; \mathbb{I}\{f(X) \neq k^*(X)\}\,\big] + c_0, \quad k^*(x) = \arg\min_{k \in \{1, 2\}} W_k(x),$$
where $c_0 = \mathbb{E}\big[\min\{W_1(X), W_2(X)\}\big]$ is a constant not relying on the classifier $f$.
The non-emptiness of the set $\mathcal{C}(x)$ defined in Equation 5.1 is a mild regularity condition. It is easy to verify from Theorem 3 that this condition is automatically satisfied with $l_k = 0$ and $u_k = 1$ for all $k \in [K]$. Like Theorem 2, Theorem 4 formulates the worst-case risk as a cost-sensitive classification problem for general multi-class problems and as a weighted classification problem for binary problems. This reformulation will be handy for the learning algorithm design in the next section.
Discussions
Notably, the partial identification results in this section are quite different from the exact identification results in Section 4. Section 4 relies on the strong NUCEM identification assumption (Assumption 3), which has a strong implication: we can pinpoint the exact value of the classification risk from the observed data, using the decision-maker assignment as an IV. However, the NUCEM assumption may often be implausible in practice. In this section, we instead consider a much weaker bound condition (Assumption 4). This condition is arguably more plausible, but it also has weaker identification implications. Under this condition, even with the decision-maker assignment as a valid IV, the classification risk is intrinsically ambiguous, so we can at most obtain a range of its plausible values from the observed data. In the next section, we will design algorithms to learn robust classifiers by approximately minimizing the worst-case classification risk. These analyses reveal the power and limits of the multiple decision-maker data structure in tackling the selective label problem, highlighting the value of the principled IV framework adopted in our paper.
We note that the recent work by Rambachan et al. [2022] is closely related to our paper. Rambachan et al. [2022] study counterfactual evaluation under unmeasured confounding using IV, mention selective labels with multiple decision-makers as one example, and derive IV partial identification bounds. However, our paper differs from Rambachan et al. [2022] in important respects. In particular, Rambachan et al. [2022] focus on partial identification bounds on various risk measures in counterfactual decision-making and on statistical estimation and inference for the bounds. In contrast, our paper systematically studies both the exact identification and partial identification settings and provides a unified weighting formulation of classification risks for these two settings (see Theorems 2 and 4). Moreover, Rambachan et al. [2022], like the prior literature [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2018], primarily focus on evaluating the risk of a given prediction rule. In contrast, our paper not only studies evaluation but also designs efficient algorithms to learn classifiers in the two identification settings. This will be the focus of the next section.
(5.3)
6 Unified Cost-sensitive Learning


In this section, we propose a unified cost-sensitive learning (UCL) algorithm tailored to learn a classifier from selectively labeled data. This algorithm accommodates both the exact identification setting in Section 4 and the partial identification setting in Section 5 in a unified manner.
Typically, a classifier is represented by some score function through . Here, for any given , quantifies the score of class . According to Theorems 2 and 4, the excess classification risk of a classifier with score function in the two identification settings can be written in the common weighted form as follows: for ,
(6.1)
Specifically, we can set for each to recover the exact identification in Theorem 2 and set to recover the worst-case excess risk in Theorem 4. In this section, we aim to minimize these risk objectives to learn effective classifiers. This corresponds to a cost-sensitive multiclass classification problem.
Theorems 2 and 4 also show that for binary classification problems with , the risk function simplifies to:
where for any . By specifying , we recover the classification risk in the exact identification setting. By specifying , we recover the worst-case risk (up to constants) under partial identification. Minimizing these risks corresponds to solving a weighted binary classification problem with pseudo label and misclassification cost .
6.1 Convex Surrogate Risk
The cost-sensitive classification risk in Equation 6.1 involves non-convex and non-smooth indicator functions , making the optimization challenging. To address this issue, we introduce a convex surrogate risk that replaces the indicator function with a convex loss function. This substitution defines the convex surrogate for the cost-sensitive classification risk:
(6.2)
For the binary case, we introduce the weighted -risk as
(6.3)
There are many possible choices for the convex surrogate loss , including the hinge loss , the logistic loss , and the exponential loss [Bartlett et al., 2006].
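As a concrete illustration, the three surrogate losses above are simple margin-based functions, and the weighted surrogate risk is just a weighted average of the loss evaluated at the margins. The sketch below is our own illustrative implementation (function names and test data are not from the paper):

```python
import numpy as np

# Margin-based surrogate losses phi(m), each convex in the margin m
# [Bartlett et al., 2006].
def hinge(m):
    return np.maximum(0.0, 1.0 - m)

def logistic(m):
    return np.log1p(np.exp(-m))

def exponential(m):
    return np.exp(-m)

def weighted_surrogate_risk(phi, w, y, f):
    """Empirical weighted phi-risk: mean of w_i * phi(y_i * f(x_i)).

    y: labels in {-1, +1}; f: real-valued scores; w: nonnegative weights.
    """
    return float(np.mean(w * phi(y * f)))

# Tiny usage example with arbitrary scores and uniform weights.
y = np.array([1, -1, 1, -1])
f = np.array([2.0, -1.5, 0.3, 0.8])
w = np.ones(4)
risk = weighted_surrogate_risk(hinge, w, y, f)
```

Any of the three losses can be swapped in for `phi`; all yield convex objectives in the score function.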
Theorem 5 establishes a calibration bound linking the cost-sensitive risk to its surrogate .
Theorem 5 (Calibration Bound for Cost-sensitive Risk).
For any given weight function , we have
For the binary case, a calibration bound can also be established for [Bartlett et al., 2006, Tewari and Bartlett, 2007].
Proposition 1 (Calibration Bound for Weighted Risk).
For any fixed weight function , we have
Theorem 5 and Proposition 1 show that the target risks are well bounded by the excess surrogate risks. Hence we can use sample data to approximately minimize the surrogate risks, which can be solved efficiently.
6.2 Empirical Risk Minimization
We now propose Algorithm 1 to minimize the empirical surrogate risk in Equation 6.2 using observed data . We also establish the finite-sample error bound for the resulting score function (see Appendix D).
Weight Estimation
The weight functions differ between the point-identified and partially identified settings, requiring separate estimation. For , we first estimate and in eq. (4.2), then compute their ratio. For example, can be expressed as , with the conditional expectations estimated via regression on . Alternatively, forest methods such as Generalized Random Forests can directly estimate the conditional covariance ratio in one step [Athey et al., 2019].
For , we estimate the bounds and for each as defined in Theorem 3, involving conditional expectations and via regression. These bounds are then used to construct .
Classifier Optimization
The empirical surrogate risk in Equation 5.3 defines a cost-sensitive classification problem. Once the weight functions are estimated, we can solve for the score function , which can be implemented by defining a cost-sensitive loss function with as the output layer. We can also incorporate regularization in Equation 5.3 to reduce over-fitting. In the binary setting, we can simply solve a weighted classification problem with the ’s as labels and as weights, which is easily implemented with off-the-shelf ML packages that accept additional weight inputs. For example, with the surrogate risk in Equation 6.3 and taken as the logistic loss, we can simply run a logistic regression with the aforementioned labels and weights.
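For the binary case, this weighted classification step can be as simple as passing the estimated weights to an off-the-shelf learner. The sketch below uses scikit-learn's `LogisticRegression` with its `sample_weight` argument; the data, pseudo labels, and weights are synthetic placeholders standing in for the quantities estimated above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
# Placeholder pseudo labels and misclassification costs; in the actual
# algorithm these come from the estimated weight functions.
pseudo_y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
weights = rng.uniform(0.5, 2.0, size=500)

# Weighted logistic regression minimizes the weighted logistic surrogate risk.
clf = LogisticRegression(max_iter=1000).fit(X, pseudo_y, sample_weight=weights)
preds = clf.predict(X)
```

Any classifier that accepts per-sample weights (gradient boosting, SVMs, neural networks with weighted losses) can be substituted for the logistic regression here.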
Cross-fitting
In Algorithm 1, we divide the dataset into folds and apply cross-fitting to estimate the weight functions. This approach has been widely used in statistical inference and learning with ML nuisance function estimators [e.g., Chernozhukov et al., 2018, Athey and Wager, 2021]; it alleviates over-fitting bias by avoiding the use of the same data to both train the weight function estimators and evaluate the weight function values.
7 Numeric Experiments
We conduct experiments on a synthetic dataset with a multiclass label () and a semi-synthetic dataset with a binary label ().
7.1 Synthetic Dataset
In this section, we construct a synthetic dataset with multi-valued selective labels. The experiment results are illustrated in Figures 4 and 5, which show that our unified cost-sensitive learning (UCL) achieves superior performance under various strengths of selection bias.
Data Generating Process
Consider a dataset consisting of observable variables and an unobservable variable , following a joint Gaussian distribution. We also introduce the variable representing the random assignment of decision-makers, drawn uniformly from the set . The missingness decision is modeled as a Bernoulli variable with parameter , and the true label is modeled as a categorical variable with parameters for . The complete data generating process is summarized in Equation 7.1.
(7.1) Observed variables; unobserved variables; coefficients; decision (NUCEM); decision (UC); outcome.
Here we define the functions and for any vector . In both the UC and NUCEM models, the parameter controls the impact of the unobservable variable on the probability of missingness , while the parameter adjusts the magnitude of 's effect on the distribution of the outcome . Overall, the pair jointly determines the degree of selection bias in our data generating process.
In this experiment, we simulate a full dataset with sample size , feature dimension , number of instrument levels , and label classes. Each data point in the dataset is a tuple . The observed dataset consists of tuples with if and if for each .
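An illustrative generator with the same structure as this process: (X, U) jointly Gaussian, Z uniform over decision-makers, a Bernoulli decision depending on (X, Z, U), and a categorical outcome depending on (X, U). The exact coefficients and link functions of Equation 7.1 are not reproduced here; all numbers below are placeholders:

```python
import numpy as np

def simulate(n=2000, d=3, n_decision_makers=5, n_classes=3,
             alpha=1.0, beta=1.0, seed=0):
    """Selective-labels DGP sketch: alpha scales U's effect on the decision D,
    beta scales U's effect on the outcome Y; Y is observed only where D == 1."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    U = 0.5 * X[:, 0] + rng.normal(size=n)            # correlated unobservable
    Z = rng.integers(0, n_decision_makers, size=n)     # decision-maker assignment
    leniency = np.linspace(-1.0, 1.0, n_decision_makers)[Z]
    p_d = 1.0 / (1.0 + np.exp(-(X[:, 0] + leniency + alpha * U)))
    D = rng.binomial(1, p_d)
    # Categorical outcome via a softmax over class-specific logits.
    logits = np.column_stack([k * (X[:, 1] + beta * U) for k in range(n_classes)])
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    Y = np.array([rng.choice(n_classes, p=p) for p in probs])
    Y_obs = np.where(D == 1, Y, -1)                    # -1 marks a missing label
    return X, Z, D, Y, Y_obs
```

Raising `alpha` or `beta` strengthens the selection bias, mirroring the roles of the two confounding-strength parameters in the experiments.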
Models UC and NUCEM
The missingness model NUCEM captures a setting with no unmeasured common effect modifier, which ensures that the sufficient condition in Proposition 2 is satisfied; the conditional probabilities can therefore be exactly identified according to Theorem 1. In contrast, the UC model captures a general unmeasured confounding setting in which the point-identification condition fails.

Baseline Methods and Our Proposed Methods
We divide our dataset into a training set and a testing set using a 7:3 split, denoted and . On the training set, five different meta-learning methods are trained on the selectively labeled data, using a simple neural network classifier with a softmax output layer. We then evaluate these methods on the testing set with fully labeled data. The five methods are as follows:
• The first two methods act as baselines. The SelectedSample method runs a multiclass classification algorithm solely on the labeled portion of the training set . For the second method, SelectedSample(IPW), we first estimate inverse propensity weights (IPW) for the labeled subset and then run weighted multiclass classification on the labeled training set. These baselines establish a “lower bound” for any other meta-learning algorithm, as any enhanced method using selectively labeled data should at least beat their performance.
• The third method, FullSample, runs a multiclass classification algorithm on the entire training set. This represents an ideal but impractical scenario, since we cannot actually observe the true label for the missing data (where ). This method serves as an “upper bound” for learning performance, indicating the highest possible effectiveness if full information were available.
• Finally, our two proposed methods correspond to the point- and partial-identification settings, named PointLearning and PartialLearning respectively. We anticipate that their performance will fall between the established “lower” and “upper” bounds, representing a realistic estimate of effectiveness under selective label conditions. Similar to SelectedSample(IPW), our methods involve estimating a series of weight functions to correct for selection bias (see Section 6 for details). In this experiment, we estimate these weight functions using Histogram Gradient Boosting with 5-fold cross-fitting. All hyperparameters are selected through 5-fold cross-validation. The cost-sensitive classification problem is solved by a simple neural network with a softmax output layer.

The Role of the Observed Variable Z
It is important to note that the random assignment of decision-makers, denoted , plays different roles in these methods. In our proposed methods, is treated as an instrumental variable that helps correct for selection bias, while in the baseline methods (SelectedSample, SelectedSample(IPW), and FullSample), is just one of the observed features.
Experiment Results
Figures 4 and 5 report the testing accuracy of each method on data generated by the NUCEM and UC models in Equation 7.1, respectively. All experiments are repeated 100 times with confounding strengths . Each boxplot shows the distribution of testing accuracy for the distinct methods under different combinations of the confounding strengths and .
Moving from the top to the bottom panel, increases from to , indicating a stronger influence of the unobservable variable on the true label . As Figures 4 and 5 show, the accuracy of all five methods decreases as increases, reflecting the growing difficulty of accurate prediction as the influence of unobservable factors intensifies. Moreover, moving from the left to the right panel, rises from to , indicating an enhanced role of the unobservable variable in determining the missingness decision . A larger value of suggests that the missingness decision is predominantly influenced by the unobservable , hinting at a potentially larger distributional difference between the selected and missing groups. This is visually corroborated in Figures 4 and 5, where the performance gap between the SelectedSample and FullSample methods widens as increases, illustrating the growing challenge of correcting the selection bias.
The results in Figures 4 and 5 demonstrate the effectiveness of our proposed methods in handling selection bias in the data generating process. For each combination of , our PartialLearning method consistently outperforms the baseline methods, including SelectedSample and SelectedSample(IPW). Moreover, the performance of PartialLearning is close to that of FullSample, the performance upper bound. The PointLearning method is also competitive, although slightly less robust than PartialLearning. The results in Figure 4 are consistent with those in Figure 5, except that in the latter setting the point-identification requirement in Theorem 1 is not satisfied, so PointLearning is less effective than PartialLearning.
Below we provide a detailed analysis of the performance of each method.
• SelectedSample(IPW): As seen in Figures 4 and 5, the performance of the SelectedSample(IPW) approach is close to the baseline of naively running multiclass classification on the selected subsample (SelectedSample), and it is outperformed by our proposed algorithms. This is unsurprising, since the propensity score cannot correct selection bias in the presence of the unobserved variables .
• PointLearning: Our proposed PointLearning method shows promising results in the bottom-right panel of Figure 4 with parameters , where the unobserved confounding strength is large. Although the NUCEM model in Equation 7.1 satisfies Proposition 2, which enables point identification of the conditional probabilities and therefore of the classification risk , PointLearning does not outperform the baseline methods when the unmeasured confounding strength is not so large. This limitation stems from the difficulty of accurately estimating the weight function in Equation 4.3.
The crux of the challenge lies in estimating the ratio of two conditional covariances, and , for each . Estimating conditional covariances is inherently harder than estimating conditional expectations, and taking the ratio of two such estimates compounds the error: if the denominator is not estimated precisely, the resulting ratio can become unstable, producing exaggerated or diminished values. Such inaccuracies can severely degrade the downstream cost-sensitive classification, undermining the robustness of PointLearning under these conditions.
• PartialLearning: The cornerstone of our learning framework is PartialLearning. As Figures 4 and 5 show, PartialLearning significantly improves prediction accuracy while maintaining low variance, showcasing its potential for real-world applications. Remarkably, this improvement is consistent across almost all combinations of the confounding strengths , as well as across the decision models NUCEM and UC.
Reflecting on the conditions outlined in Theorem 3, PartialLearning only requires the conditional mean function to be almost surely bounded, aside from the basic instrumental variable (IV) conditions. This requirement is generally more plausible in real-world scenarios than the point-identification assumptions detailed in Theorem 1. The success of PartialLearning can be attributed to two key strategies. First, we adopt a robust optimization approach to develop a prediction rule that remains effective under the least favorable conditions, as detailed in our minimax risk formulation (see Equation 5.2). Second, the weighting functions are based on sums of conditional expectations, which are inherently more stable than the ratio of conditional covariances required for point identification.
Overall, the compelling performance of PartialLearning in Figures 4 and 5 supports its practical viability. It stands as a reliable method that consistently outperforms baseline approaches, even when the presence of selection bias (missingness not at random) in the data generating process is uncertain.
7.2 Semi-synthetic Dataset: FICO


In this section, we evaluate the performance of our proposed algorithms in a semi-synthetic experiment based on the home loans dataset from FICO [2018]. This dataset consists of observations of approved home loan applications. It records whether the applicant repays the loan within days overdue, which we view as the true label , along with various transaction information from the bank account. The dataset also includes a variable called ExternalRisk, a risk score assigned to each application by a proprietary algorithm. We take ExternalRisk and all transaction features as the observed features .
Semi-synthetic Dataset with Selective Labels
In this dataset the label of interest is fully observed, so we synthetically create selective labels on top of the dataset. Specifically, we simulate decision-makers (e.g., bank officers handling the loan applications) and randomly assign one to each case. We simulate the decision from a Bernoulli distribution whose success rate depends on an “unobservable” variable , the decision-maker identity , and the ExternalRisk variable (which serves as algorithmic assistance to human decision-making). We blind the true label for observations with . Specifically, we construct as the residual from a random forest regression of on over the whole dataset, which is naturally dependent on . We then specify according to
(7.2) Decision models (NUCEM and UC).
Here the function is given by . The parameter controls the impact of on the labeling process and thus the degree of selection bias. One can verify that the sufficient condition in Proposition 2 holds, so point identification of the classification risk is attainable under the NUCEM model in Equation 7.2.
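The residual-based construction of the "unobservable" can be sketched as follows. Feature names, the leniency pattern, and all coefficients are placeholders; the actual decision model follows Equation 7.2:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_selective_labels(X, y, external_risk, gamma=1.0, n_dm=5, seed=0):
    """Construct U as the random-forest residual of y on X (so U is dependent
    on y by construction), then simulate a labeling decision D that depends on
    U, the decision-maker assignment Z, and the ExternalRisk score; finally
    blind y wherever D == 0."""
    rng = np.random.default_rng(seed)
    u = y - RandomForestRegressor(random_state=seed).fit(X, y).predict(X)
    z = rng.integers(0, n_dm, size=len(y))             # decision-maker identity
    leniency = np.linspace(-0.5, 0.5, n_dm)[z]
    p_d = 1.0 / (1.0 + np.exp(-(leniency - external_risk + gamma * u)))
    d = rng.binomial(1, p_d)
    y_obs = np.where(d == 1, y, -1)                    # -1 marks a blinded label
    return u, z, d, y_obs
```

The parameter `gamma` plays the role of the confounding-strength parameter: larger values make the labeling decision depend more heavily on the unobservable residual.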
Baseline Methods and Our Proposed Methods
We randomly split our data into training and testing sets. As in Section 7.1, on the training set we apply five methods: SelectedSample, SelectedSample(IPW), FullSample, PointLearning, and PartialLearning. The first three serve as benchmarks for the last two. For each method, we try multiple classification algorithms, including AdaBoost, Gradient Boosting, Logistic Regression, Random Forest, and SVM. Our proposed methods also need to estimate unknown weight functions, which we implement by cross-fitted Gradient Boosting. Again, all hyperparameters are chosen via cross-validation, and we evaluate the classification accuracy of the resulting classifiers on the testing data.
Similarly, we remark that the decision-maker assignment plays different roles in different methods. Our proposed methods (PointLearning and PartialLearning) treat the decision-maker assignment as an instrumental variable to correct for selection bias. In contrast, the baseline methods (SelectedSample, SelectedSample(IPW), and FullSample) do not strictly need . However, for a fair comparison between our proposals and the baselines, we still include as a classification feature in the baseline methods, so they also use its information.
Results and Discussions
Figures 6 and 7 present the testing accuracy of each method over 50 experiment replications under both the NUCEM and UC models defined in Equation 7.2. First, we observe that the performance of SelectedSample(IPW) is comparable to the SelectedSample baseline, which applies binary classification algorithms directly to the selectively labeled data. Notably, as the strength of unmeasured confounding () increases, the gains from our proposed methods (especially PartialLearning) over the baselines become more pronounced. Interestingly, PartialLearning outperforms PointLearning even under the NUCEM model, where the point-identification condition is satisfied. As discussed in Section 7.1, this may be because PointLearning relies on estimating a conditional covariance ratio, which is challenging to estimate accurately in practice. In contrast, PartialLearning requires only the estimation of conditional expectations, making it more stable and robust. This highlights the advantage of PartialLearning: it is robust to violations of the point-identification assumption and achieves more stable performance even when point identification holds.
To summarize, among all the methods we evaluated, PartialLearning emerges as a reliable approach that consistently outperforms the baselines, even in scenarios where the unobserved confounding is strong.
8 Conclusion
We address multiclass classification with selectively labeled data, where the label distribution deviates from the full population due to historical decision-making. Leveraging variations in decision rules across multiple decision-makers, we analyze this problem using an instrumental variable (IV) framework, providing necessary and sufficient conditions for exact classification risk identification. When exact identification is unattainable, we derive sharp partial risk bounds. To address label selection bias, we propose a unified cost-sensitive learning (UCL) approach, supported by theoretical guarantees, and demonstrate its effectiveness through comprehensive numerical experiments.
References
- Angrist and Imbens [1995] J. Angrist and G. Imbens. Identification and estimation of local average treatment effects, 1995.
- Angrist and Pischke [2009] J. D. Angrist and J.-S. Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton University Press, 2009.
- Angrist et al. [1996] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
- Arnold et al. [2022] D. Arnold, W. Dobbie, and P. Hull. Measuring racial discrimination in bail decisions. American Economic Review, 112(9):2992–3038, 2022.
- Athey and Wager [2021] S. Athey and S. Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.
- Athey et al. [2019] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
- Balke and Pearl [1994] A. Balke and J. Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46–54. Elsevier, 1994.
- Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Ben-Michael et al. [2021] E. Ben-Michael, D. J. Greiner, K. Imai, and Z. Jiang. Safe policy learning through extrapolation: Application to pre-trial risk assessment. arXiv preprint arXiv:2109.11679, 2021.
- Bertsimas and Fazel-Zarandi [2022] D. Bertsimas and M. M. Fazel-Zarandi. Prescriptive machine learning for public policy: The case of immigration enforcement. Under review, 2022.
- Chernozhukov et al. [2018] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
- Coston et al. [2020a] A. Coston, E. Kennedy, and A. Chouldechova. Counterfactual predictions under runtime confounding. Advances in Neural Information Processing Systems, 33:4150–4162, 2020a.
- Coston et al. [2020b] A. Coston, A. Mishler, E. H. Kennedy, and A. Chouldechova. Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 582–593, 2020b.
- Cui and Tchetgen Tchetgen [2021] Y. Cui and E. Tchetgen Tchetgen. A semiparametric instrumental variable approach to optimal treatment regimes under endogeneity. Journal of the American Statistical Association, 116(533):162–173, 2021.
- De-Arteaga et al. [2018] M. De-Arteaga, A. Dubrawski, and A. Chouldechova. Learning under selective labels in the presence of expert consistency. arXiv preprint arXiv:1807.00905, 2018.
- De-Arteaga et al. [2021] M. De-Arteaga, V. Jeanselme, A. Dubrawski, and A. Chouldechova. Leveraging expert consistency to improve algorithmic decision support. arXiv preprint arXiv:2101.09648, 2021.
- FICO [2018] FICO. Explainable machine learning challenge, 2018. URL https://community.fico.com/s/explainable-machine-learning-challenge?tabset-158d9=3.
- Imbens and Rubin [2015] G. W. Imbens and D. B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
- Kallus and Zhou [2021] N. Kallus and A. Zhou. Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5):2870–2890, 2021.
- Kédagni and Mourifie [2017] D. Kédagni and I. Mourifie. Generalized instrumental inequalities: Testing the iv independence assumption. Available at SSRN 2692274, 2017.
- Kilbertus et al. [2020] N. Kilbertus, M. G. Rodriguez, B. Schölkopf, K. Muandet, and I. Valera. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pages 277–287. PMLR, 2020.
- Kleinberg et al. [2018] J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2018.
- Kling [2006] J. R. Kling. Incarceration length, employment, and earnings. American Economic Review, 96(3):863–876, 2006.
- Lakkaraju et al. [2017] H. Lakkaraju, J. Kleinberg, J. Leskovec, J. Ludwig, and S. Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275–284, 2017.
- Little and Rubin [2019] R. J. Little and D. B. Rubin. Statistical analysis with missing data, volume 793. John Wiley & Sons, 2019.
- Manski and Pepper [1998] C. F. Manski and J. V. Pepper. Monotone instrumental variables with an application to the returns to schooling, 1998.
- Mao et al. [2023] A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International Conference on Machine Learning, pages 23803–23828. PMLR, 2023.
- Massart and Nédélec [2006] P. Massart and É. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 2006.
- Mueller-Smith [2015] M. Mueller-Smith. The criminal and labor market impacts of incarceration. Unpublished working paper, 18, 2015.
- Pu and Zhang [2021] H. Pu and B. Zhang. Estimating optimal treatment rules with an instrumental variable: A partial identification learning approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):318–345, 2021.
- Qian and Murphy [2011] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.
- Rambachan et al. [2022] A. Rambachan, A. Coston, and E. Kennedy. Counterfactual risk assessments under unmeasured confounding. arXiv preprint arXiv:2212.09844, 2022.
- Rambachan et al. [2021] A. Rambachan et al. Identifying prediction mistakes in observational data. Harvard University, 2021.
- Rosenbaum and Rubin [1983] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
- Rubin [2005] D. B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
- Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- Swanson et al. [2018] S. A. Swanson, M. A. Hernán, M. Miller, J. M. Robins, and T. S. Richardson. Partial identification of the average treatment effect using instrumental variables: review of methods for binary instruments, treatments, and outcomes. Journal of the American Statistical Association, 113(522):933–947, 2018.
- Tewari and Bartlett [2007] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007.
- Wainwright [2019] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- Wang and Tchetgen [2018] L. Wang and E. T. Tchetgen. Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society. Series B, Statistical methodology, 80(3):531, 2018.
- Wei [2021] D. Wei. Decision-making under selective labels: Optimal finite-domain policies and beyond. In International Conference on Machine Learning, pages 11035–11046. PMLR, 2021.
- Yang et al. [2022] Y. Yang, Y. Liu, and P. Naghizadeh. Adaptive data debiasing through bounded exploration. Advances in Neural Information Processing Systems, 35:1516–1528, 2022.
- Zhang et al. [2012] B. Zhang, A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012.
- Zhang et al. [2020] C. Zhang, J. Chen, H. Fu, X. He, Y.-Q. Zhao, and Y. Liu. Multicategory outcome weighted margin-based learning for estimating individualized treatment rules. Statistica Sinica, 30:1857, 2020.
- Zhao et al. [2012] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.
Appendix A Supplements for Point Identification
A.1 Proofs in Section 4
Proof of Lemma 1.
By the definition of excess risk in Equation 4.1, we have
We apply the iterated law of expectation in the second equality and the law of total expectation in the third. The fourth equality follows from the sum-to-one constraint of probabilities. The fifth equality follows from the definition of the indicator function, and the sixth again from the iterated law of expectation. Finally, we apply the definition of the Bayes optimal .
∎
Proof of Theorem 1.
In this part, we prove that, under Assumptions 1 and 2, Assumption 3 is the necessary and sufficient condition for the exact identification of . To simplify notation, we denote two new variables and .
• Sufficiency: For any , the conditional probability can be expressed as
(A.1) In the first and second equalities, we use the iterated law of expectation and the linearity of expectation. The third equality follows from the consistency between observed label and true label from Equation 3.1, that is, when . The fourth equality follows from the unconfoundedness of and given and , which is guaranteed by the general unconfoundedness () in Assumption 1. We use the IV independence () from Assumption 2 in the fifth equality. The last equality follows again by the iterated law of expectation and linearity of expectation.
By subtracting on both sides of Equation A.1, we have for any ,
where the last equality follows from the IV independence () from Assumption 2. Now, by multiplying the weight on both sides of the equation above, we have
Here we apply the IV independence () again in the last equality above. Taking the summation (or integral) over yields the following results:
Moreover, observe that
we have
(A.2) Here again we use the consistency that . By the unconfoundedness () from Assumption 1 and IV independence () from Assumption 2, we have and
Therefore terms II and IV in Equation A.2 cancel out, which leads us to the result
(A.3) Finally, by assuming Assumption 3 holds, that is,
we have
According to conditional covariance identity and IV independence (), we have
(A.4) Combining (A.4) with (A.3) leads us to the desired result, that is,
• Necessity: Given the identification result and Equation A.4, we have
On the other hand, identity (A.3) holds whenever Assumptions 1 and 2 hold, that is,
Therefore, combining these results together, we have
which then recovers condition (3).
∎
Proof of Theorem 2.
By the result in Lemma 1, we have
For each , we define . According to Theorem 1, when NUCEM conditions in Assumption 3 are satisfied for each , the conditional probability can be uniquely identified by Equation 4.2. Therefore, the weight function can be further identified as
Now, consider a binary classification with the true label . Let denote the conditional probability of given . The Bayes optimal classifier is . As a consequence, the excess risk can then be written as
By defining the weight function , we have
Applying the identification result from Theorem 1, we obtain
∎
A.2 Sufficient Condition for Point Identification
Proposition 2 (A Sufficient Condition for Point Identification).
Suppose Assumptions 1 and 2 hold. For any , define . Assumption 3 holds if, for any ,
(A.5)
Proposition 2 outlines a sufficient condition for Assumption 3. This condition permits unobserved features to influence both the decision and the true outcome , but constrains this influence to a specific form. Specifically, the condition is met when almost surely, where the functions and ensure that the impact of is additive and consistent across all decision-makers.
Proof of Proposition 2.
Notice that the conditional covariance can be decomposed as follows:
Define . If for any , we have
Assumption 3 is then guaranteed.
∎
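For reference, the conditional covariance decomposition invoked in the proof above is an instance of the law of total covariance. In generic placeholder notation (with $A$ and $B$ standing for the two variables whose covariance is decomposed, and $X$, $Z$ abbreviating the conditioning variables), it reads:

```latex
\operatorname{Cov}(A, B \mid X)
  = \mathbb{E}\big[\operatorname{Cov}(A, B \mid X, Z) \,\big|\, X\big]
  + \operatorname{Cov}\big(\mathbb{E}[A \mid X, Z],\ \mathbb{E}[B \mid X, Z] \,\big|\, X\big).
```

The proof applies this identity to show that the relevant conditional covariance vanishes under the stated additive structure, which is how Assumption 3 follows.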
Appendix B Supplements for Partial Identification
B.1 Tightness of IV Partial Bounds in Theorem 3
Theorem 6 (Tightness of IV Partial Bound).
Assume Assumption 1 is valid, and the conditional joint distribution of observable variables is fixed. Consider a new set of random variables where and are binary, and . Define the observed label as . If the expectation falls within the interval for all , then it is possible to establish a conditional joint distribution for given that satisfies:
1. The IV independence in Assumption 2 is met.
2. The conditional joint distribution of matches the initial distribution .
3. The natural bounds in Assumption 4 are maintained.
Theorem 6, together with Theorem 3, establishes a two-way correspondence between the conditional mean outcome and the IV partial bound . On one hand, any value of within this interval corresponds to a feasible conditional joint distribution of that is consistent with the observed data . On the other hand, any conditional joint distribution of consistent with the observed data must yield a conditional mean outcome within the interval , as established in Theorem 3.
These results underscore the tightness of the IV partial bound in Theorem 3 and the necessity of the bounded label assumption in Assumption 4 for the partial identification of the conditional probability .
Proof of Theorem 6.
The proof is adapted from Kédagni and Mourifie [2017]. Let be a sequence of random variables such that almost surely and for every . Given that Assumption 1 holds and fixing for some , we aim to prove that, for every possible value of , there exists a conditional joint distribution of given such that: (1) the IV independence in Assumption 2 holds, that is, for every ; (2) the new observable variables have the same conditional joint distribution as the fixed observable variables ; (3) Assumption 4 holds.
Without loss of generality, we first consider the case when . Accordingly, we let
(B.1)
where the last expression follows from the sum-to-one constraint on conditional probabilities. Note that we do not impose any observable restrictions on the conditional joint distribution of given and , so we define the following: for all ,
(B.2)
Notice that all of the conditional probabilities above are well-defined, as they all lie between and , and the sum-to-one constraint holds as follows:
We now prove the three statements mentioned above.
1. IV Independence: To check whether we have , notice that for any ,
We can see that the conditional distribution of given and does not depend on for any , and therefore is conditionally independent of given .
2. has the same conditional joint distribution as : Recall that the observed label is defined as . By Equation B.2, we have for any ,
(B.3) and
(B.4) As a result, we have shown that the observed variables have exactly the same conditional joint distribution as .
3. Finally, we show that when , there exists a joint distribution on such that . Without loss of generality, we assume
and then we have
(B.5) Meanwhile, by the statement of IV independence that , we have for ,
The first and second lines follow from the IV independence and the linearity of conditional expectation. The third line follows from the consistency of the observed outcome together with the iterated law of expectation. The last line follows from Assumption 1 () and the IV independence.
From Equations B.3 and B.4, we have , and therefore
(B.6) Now, suppose for contradiction that no conditional joint distribution of given satisfies almost surely; instead, assume there is a conditional joint distribution for such that almost surely. Substituting this condition into Equation B.6, we have
(B.7) In the first equality, we use the iterated law of expectation and the fact that is independent of the unobserved variable . The second equality follows from Equations B.3 and B.4, which give for any fixed and , along with the definition of .
As a consequence, if
then Equation B.7 contradicts Equation B.5, implying that there is indeed a conditional joint distribution for such that . On the other hand, if
(B.8) then the conditional joint distribution for induced by Equation B.8 is exactly the “right” distribution for establishing .
Therefore, we have shown that when , there is always a conditional joint distribution for such that almost surely.
To wrap up, observe that statements 1 and 2 do not rely on the condition in Equation B.1. In fact, whenever we choose for a fixed , we can simply choose . As a result, the IV independence always holds, and the new observable variables still have the same conditional distribution as , which is fixed a priori.
However, the value of condition does affect the value of . If we let , we can show in a similar way that almost surely. In fact, following the proof of statement 3, one can prove that, for any such that , there is always a conditional joint distribution for such that almost surely and .
Therefore, we have shown that under Assumption 1, whenever , there is always a conditional joint distribution for such that the three statements in Theorem 6 hold. To conclude, the IV partial bound introduced in Theorem 3 is sharp for under Assumption 1, the IV independence in Assumption 2, and Assumption 4.
∎
B.2 Balke and Pearl’s Bound
Balke and Pearl [1994] provide partial identification bounds for the average treatment effect of a binary treatment with a binary instrumental variable. In this section, we adapt their bound to our setting with partially observed labels and a binary IV (i.e., the assignment to one of two decision-makers). Under Assumption 2, we have the following decomposition of the joint probability distribution of
(B.9)
Here we omit the observed covariates for simplicity, or alternatively, all distributions can be considered as implicitly conditioning on . Now we define three response functions which characterize the values of , , and :
Next, we specify the joint distribution of unobservable variables and as follows:
which satisfies the constraint . Then the target mean parameter of the true outcome can be written as a linear combination of the ’s. Moreover, we note that the observable distribution is fully specified by the following six variables
with constraints and . We also have the following relation between ’s and ’s:
Therefore, we have where , , and
Then the lower bound on can be written as the optimal value of the following linear programming problem
(B.10)
Similarly, the upper bound on can be written as the optimal value of the following optimization problem:
(B.11)
subject to
In fact, by simply comparing the variables in the objective function with those in the constraints, one can find that
If we let
(B.12)
We then have the following partial bounds of .
According to Balke and Pearl [1994], the bounds above are tight for . We note that if we condition on in these bounds, then the corresponding bound on coincides with the bounds in Theorem 3 specialized to a binary instrument.
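The linear programs (B.10) and (B.11) are straightforward to solve numerically. Below is a minimal, hedged sketch for the selective-labels setting with a binary instrument: the response-type encoding `(d0, d1, y)` (the decision under each instrument value, plus the true label, assumed unaffected by the decision and the instrument) and the helper `balke_pearl_bounds` are our own illustrative notation, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

# Response types (d0, d1, y): decision under Z=0, decision under Z=1, true label.
TYPES = [(d0, d1, y) for d0 in (0, 1) for d1 in (0, 1) for y in (0, 1)]

def balke_pearl_bounds(p):
    """LP bounds on P(Y=1) from observed p[z] = (P(D=1, Yobs=1 | Z=z),
    P(D=1, Yobs=0 | Z=z), P(D=0 | Z=z)) for instrument values z in {0, 1}."""
    A_eq, b_eq = [], []
    for z in (0, 1):
        sel = [t[z] for t in TYPES]  # decision D under instrument value z
        A_eq.append([s * t[2] for s, t in zip(sel, TYPES)])        # D=1, Yobs=1
        A_eq.append([s * (1 - t[2]) for s, t in zip(sel, TYPES)])  # D=1, Yobs=0
        A_eq.append([1 - s for s in sel])                          # D=0, label unobserved
        b_eq.extend(p[z])
    A_eq.append([1.0] * len(TYPES)); b_eq.append(1.0)              # type shares sum to one
    c = np.array([t[2] for t in TYPES], dtype=float)               # objective: P(Y=1)
    lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    return lo.fun, -hi.fun
```

Feeding in observed probabilities generated from any ground-truth type distribution, the returned interval contains the true P(Y = 1) by construction.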
B.3 Proofs in Section 5
Proof of Theorem 3.
We first prove the construction of IV partial bound, and then show that the IV partial bound is tighter than the natural bound for all when is fixed.
1. Under Assumptions 1 and 2, Equation A.1 gives that
As we have assumed that almost surely in Assumption 4, we then have
and
for every . Optimizing the lower and upper bounds of over yields the desired results:
2. By taking the conditional expectation on both sides of the inequality, Assumption 4 directly induces a natural bound for , that is,
We now show that the IV partial bound is tighter than the natural bound for any . Without loss of generality, let
and then we have
(B.13) The first line comes from the definition, while the second line holds by the consistency between the observed label and the true label () and the linearity of conditional expectation. The third line holds by iterated law of expectation, and the fourth line holds by Assumption 1 (). The fifth line follows by Assumption 2 () and the fact that . The last line again follows by the linearity and the iterated law of expectation.
Recall that Assumption 4 states that almost surely; combining this with Equation B.13 and the fact that , we have
Similarly, we can show that for every fixed . We therefore complete the proof.
∎
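As a quick numerical companion to the construction above, the sketch below intersects the per-decision-maker intervals for binary labels. The function name and inputs are illustrative: `p_accept[z]` plays the role of P(D = 1 | x, z) and `p_pos[z]` that of P(Y = 1 | D = 1, x, z), so each per-z interval is [P(D=1, Yobs=1 | x, z), P(D=1, Yobs=1 | x, z) + P(D=0 | x, z)].

```python
import numpy as np

def iv_partial_bounds(p_accept, p_pos):
    """Intersect per-decision-maker bounds on E[Y | x] for labels in {0, 1}:
    for each z, E[D * Yobs | x, z] <= E[Y | x] <= E[D * Yobs | x, z] + P(D=0 | x, z)."""
    p_accept = np.asarray(p_accept, dtype=float)
    joint = p_accept * np.asarray(p_pos, dtype=float)   # P(D=1, Yobs=1 | x, z)
    lower = joint.max()                                 # largest lower bound over z
    upper = (joint + 1.0 - p_accept).min()              # smallest upper bound over z
    return lower, upper
```

For example, with two decision-makers, `iv_partial_bounds([0.9, 0.4], [0.6, 0.8])` returns approximately (0.54, 0.64), tighter than either single-decision-maker interval: the heterogeneity across z is what sharpens the bound.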
Proof of Theorem 4.
To simplify the notation, for any classifier and vector function , define the function
The worst-case risk function can then be written as
with . The following proof will rely on computing the maximization of over all possible values of for any fixed .
• Step 1. We start with the case when for some , we have
Given a fixed vector , we denote as the label with maximal conditional probability. We have
(B.14) where for any real-valued . In the second equality of Equation B.14, we exclude the case when from the maximization, as this situation only occurs when the lower bound of exceeds the upper bound of for all . Consequently, we have , which implies that
Therefore, we expect the maximization of over to be non-negative. We exclude the case from the maximization to facilitate further analysis.
• Step 2. As shown in Equation B.14, for any fixed and , the term is a non-decreasing () function of and a non-increasing () function of . Therefore, one may expect the maximization of over to be realized when and achieve their upper-bound and lower-bound, respectively. However, due to the constraint for any , the original upper-bound for may not be reached, and a similar concern arises for . To fix these issues, we define a “realizable” upper-bound for as
(B.15) which is the maximal value that can realize under the constraint . Meanwhile, we define a “realizable” lower-bound for as
(B.16) The last equality is established since for any , leading to the fact that . In Equation B.16, we replace the upper-bound with in the definition of , because in Equation B.14, the lowest possible value of depends on the “realizable” maximal value of . However, notice that does not depend on the label involved in the definition, indicating that the “realizable” lower-bound is an intrinsic property of the conditional probability for each . A similar statement applies to the “realizable” upper-bound for each .
• Step 3. Back to the maximization of in Equation B.14, we have
(B.17) By defining
as the weight function corresponding to the event , Equation B.17 can be written as
The worst-case excess risk can then be written as
The second inequality holds because defines a sequence of disjoint events for every fixed . By replacing the subscript with , we complete the proof.
• Step 4. We are now left to check whether there exists a vector such that the second equality in Equation B.17 is guaranteed, that is, whether there is indeed a feasible solution with , such that and for every . Before going into the details, note that the non-emptiness of the set implies that, for every ,
(B.18) In the following, we check the feasibility of for a fixed in three aspects.
1. For , check .
For the upper-bound, by the definition of in Equation B.15, is guaranteed.
For the lower-bound, on one hand, if , then naturally holds. On the other hand, suppose . By Equation B.18, we have , which suggests that . Hence, always holds.
2. For , check .
For the lower-bound, by the definition of in Equation B.16, always holds.
For the upper-bound, by Equation B.16, we have
Equation B.18 ensures that , and on the other hand simply holds. Therefore, we establish the upper-bound .
3. Check . We start by computing the summation
If the summation stays within the capacity region , we can always find a series of feasible solutions for such that is satisfied. The discussions are summarized below.
(a) Suppose . By the definition of in Equation B.15, we have and . We then have
For Case 1, we can simply let for , and this arrangement fulfills the requirement .
For Case 2, we have
As a result, there exists a sequence of such that for all that satisfies .
(b) Suppose . We then have and
where the last equality holds because for all . Therefore, we have
As a result, we can simply let for , and this arrangement again fulfills the requirement .
The above analysis shows the validity of the second equality in Equation B.17.
∎
Proof of Theorem 4 (Binary Classification Setting).
Recall from the first part of Theorem 4 that
Consider the binary case when and . Recall that in Theorem 3, we derive the partial bound for as . By the constraint for each , we define the partial bound for as and .
We can then compute the “realizable” upper- and lower-bounds for both and as below:
The weight functions for are therefore
Hence, in the binary outcome case, the risk function takes the form
For every fixed , define a weight function
we then have
(B.19)
For any fixed , notice that when and when , we then have
Consequently, by observing the establishment of the event
the first two terms in Equation B.19 can be further organized as
(B.20)
Finally, we define
for every . Combining Equations B.19 and B.20 together, we have
∎
B.4 Comparisons of Point and Partial Identification
Point identification hinges on decision-makers’ homogeneity in their use of unobservables, as defined by the generalized NUCEM assumption in Theorem 1. Concretely, the conditional irrelevance of and given reflects the uniformity of decision-makers’ reliance on unobserved information. Conversely, partial identification benefits from the heterogeneity of decision-makers’ use of observables , without requiring homogeneity of selection on unobservables . This approach uses IV partial bounds that account for variations in the decision rules and the labeled probabilities across decision-makers . Such heterogeneity helps construct tighter bounds for . Moreover, we have the following claim.
Theorem 7.
The partial bounds and in Theorem 3 are achieved by the most lenient and most stringent decision-makers, respectively.
Proof of Theorem 7.
We claim that for any fixed and , the lower-bound is achieved by the most lenient decision-maker, and the upper-bound is achieved by the most stringent decision-maker. To see this, notice that the lower-bound for with respect to decision-maker in Theorem 3 satisfies
The second equality holds by consistency, and the third equality follows from the iterated law of expectation. The fourth equality holds by Assumption 1, and the final equality is established by the IV independence in Assumption 2 (). Since Assumption 4 states that almost surely, the lower bound is an increasing function of the decision rule . Therefore, for any fixed and , the maximization over lower bounds is achieved by the most lenient decision-maker with the highest decision rule , that is,
Similarly, for the upper-bound, the minimization is achieved by the most stringent decision-maker when and are fixed. ∎
Appendix C Supplements for Unified Cost-sensitive Learning
Proof of Theorem 5.
To facilitate the analysis, for fixed weight function and score function , we define the class labels
Namely, when is fixed, denotes the class label with the minimal weight , while corresponds to the class label with the highest score . According to the definition of cost-sensitive classification risk in Equation 6.1, we define the cost-sensitive classification loss as
Obviously, the excess classification risk can be written as
Correspondingly, we define the cost-sensitive surrogate loss for in Equation 6.2 as
As a result, the excess surrogate risk can be written as
Therefore, we are left to analyze the excess surrogate loss for every fixed .
Following the idea of Mao et al. [2023], for every fixed score function , we define a new function induced by some real-valued parameter as follows: for fixed ,
Since the domain of function is , we should restrict the value of to ensure that and . Therefore, for fixed , we define the feasible region of as
A direct consequence of the above definition is that for every fixed . Consequently, the surrogate loss takes the form
Now, the excess surrogate loss can be lower-bounded by
The first inequality holds because the infimum of over the set is always no smaller than the infimum of over all measurable functions. Since is the class label with the minimal weight, we have for any . Consequently, the supremum of the above expression is achieved when , and the excess surrogate loss is therefore lower-bounded by
The last inequality follows from the definition of as the class label with the highest score , so that for any fixed . By taking the expectation over on both sides, we complete the proof.
∎
Proof of Proposition 1.
For the weighted classification risk , if we choose for , we simply have , which implies that the Bayes optimal risk equals zero, that is, . Therefore, the excess weighted classification risk can be written as
Here we use the fact that has the same sign as weight function for any fixed .
For the -risk introduced in Equation 6.3, we consider the following surrogate loss functions:
Notice that for any , the surrogate loss is lower-bounded (in fact, ); hence . Therefore, the excess -risk takes the form
In the sequel, we only need to check if the surrogate loss provides an upper-bound for the loss for each in the set . We fix the weight function and the observed features in the following discussion.
• Consider the case when ; we then have . For the -risk, we have
for any . Concretely, we have
Therefore, we have
for any .
• Consider the case when ; we then have . For the -risk, we have
for any . Concretely, we have
Hence, we have
for any .
Finally, by multiplying the common weight and taking the expectation over , we conclude that when the weight function is fixed, the excess -risk is always an upper-bound for the excess weighted risk for any :
∎
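The pointwise dominance argument in the proof — a surrogate loss upper-bounding the 0-1-type loss for each fixed , so that multiplying by a common nonnegative weight and taking expectations preserves the inequality — can be checked numerically. The sketch below uses the hinge loss as an illustrative margin-based surrogate; the paper's actual surrogate may differ.

```python
import numpy as np

def zero_one(margin):
    """0-1-type loss: 1{margin <= 0}."""
    return (margin <= 0).astype(float)

def hinge(margin):
    """Illustrative margin-based surrogate: max(0, 1 - margin)."""
    return np.maximum(0.0, 1.0 - margin)

# Pointwise dominance: hinge(m) >= 1{m <= 0} for every margin m ...
margins = np.linspace(-3.0, 3.0, 601)
assert np.all(hinge(margins) >= zero_one(margins))

# ... hence any nonnegative weight w preserves the ordering of the weighted risks.
rng = np.random.default_rng(0)
w = rng.uniform(0.0, 2.0, size=margins.shape)
assert (w * zero_one(margins)).mean() <= (w * hinge(margins)).mean()
```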
Appendix D Generalization Error Bound
In this section, we theoretically analyze the performance of the UCL algorithm in Algorithm 1 by deriving a generalization error bound for the output score function . Specifically, Theorem 5 establishes a calibration bound for the cost-sensitive excess risk induced by the output score function. To leverage this result, we first derive a generalization error bound for the excess surrogate risk .
To facilitate the analysis, we define the surrogate risk induced by the estimated weight functions from batch :
The lemma below decomposes the upper bound of the excess surrogate risk into two components: the estimation error of the nuisance functions and the empirical process error.
Lemma 2.
The excess surrogate risk satisfies
The first term in Lemma 2 captures the error arising from the estimation of the unknown weight function , which is evaluated on the true excess surrogate risk with the supremum over all . The second term reflects the error incurred by the estimated score function , evaluated as the supremum of an empirical process. To bound these two terms, we need to further specify the estimation errors of the nuisance estimators, as we will do in Assumptions 5 and 6.
Assumption 5 (Nuisances Estimation).
We assume that the true weight functions and their estimates satisfy the following conditions:
1. they are uniformly bounded by some constant over , namely, for all and ;
2. there exists such that
Assumption 5 requires that the nuisance functions and their estimates are uniformly bounded over the feature space and that the estimation error of the nuisance functions decays at a rate of for some .
Assumption 6 (Function Class Complexity).
Let and be some constants. For any score function candidate , we assume for any . Moreover, assume that the -bracketing number of with respect to the metric , denoted as , satisfies for any and some .
In Assumption 6, we further restrict the complexity of the hypothesis class by imposing boundedness on each component of the score function and limiting the growth rate of its covering number. The boundedness condition is reasonable: if any component of the score function, say , diverges to infinity, we cannot obtain a reasonable classifier via the operation. At the same time, limiting the growth rate of the covering number is a common way to quantify the complexity of a function class in statistical learning theory [Massart and Nédélec, 2006, Shalev-Shwartz and Ben-David, 2014, Wainwright, 2019].
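As a generic illustration of why polynomial covering-number growth keeps the complexity manageable (the constants $b$, $C$, and $V$ here are placeholders, not the paper's quantities): if a class $\mathcal{F}$ is uniformly bounded by $b$ and satisfies $N(\epsilon, \mathcal{F}, \|\cdot\|) \le (C/\epsilon)^{V}$ with $C \ge b$, then Dudley's entropy integral yields, up to universal constants,

```latex
\widehat{\mathcal{R}}_n(\mathcal{F})
  \;\lesssim\; \frac{1}{\sqrt{n}} \int_0^{b} \sqrt{\log N(\epsilon, \mathcal{F}, \|\cdot\|)}\, d\epsilon
  \;\lesssim\; b\,\sqrt{\frac{V\,\bigl(1 + \log(C/b)\bigr)}{n}},
```

so the empirical Rademacher complexity decays at the parametric $n^{-1/2}$ rate.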
Nuisance Estimation Error
To provide a finite-sample error bound for the estimation error, we start with bounding the nuisance estimation error introduced in Lemma 2.
Proposition 3 (Nuisance Estimation Error).
Suppose Assumption 5 and Assumption 6 hold. We have, for any ,
Empirical Process Error
We now consider bounding the supremum of the empirical process introduced in Lemma 2. We focus on analyzing the empirical process defined on the -th batch, that is,
As the cross-fitting batch size involved in Algorithm 1 is usually smaller than the sample size , here we simply assume . Note that the estimated weight function in the -th batch is a random variable that depends on the observed variables . To facilitate our analysis, we denote a new variable over the domain with some joint distribution . We will omit the superscript for simplicity in the sequel. For every given estimate , we define an excess surrogate risk class as
(D.1)
To simplify our analysis, we use the shorthand notation and , where is the empirical measure associated with the i.i.d. sample of size . Then the supremum of the empirical process can be written as
(D.2)
Therefore, our goal is now to provide a uniform bound for the empirical process over the function class . To this end, notice that is a uniformly bounded class; that is, for any , is bounded. Indeed, Assumption 5 assumes for any . Consequently, we have
Now, given a sequence of i.i.d. sample , we define the empirical Rademacher complexity with respect to class as
(D.3)
where is a vector of i.i.d. Rademacher variables taking values in with equal probability. The empirical Rademacher complexity provides an upper bound for the supremum of the empirical process in Equation D.2; see the lemma below.
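The empirical Rademacher complexity in Equation D.3 can be approximated by Monte Carlo for any finite (or finitely discretized) function class. The sketch below is purely illustrative — `empirical_rademacher` and the toy symmetric class are our own constructions, not part of the paper's method.

```python
import numpy as np

def empirical_rademacher(fvals, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_{f in F} (1/n) sum_i sigma_i f(x_i).
    fvals has shape (num_functions, n): row j holds f_j evaluated at the sample."""
    rng = np.random.default_rng(seed)
    _, n = fvals.shape
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))  # i.i.d. Rademacher draws
    sups = (sigmas @ fvals.T / n).max(axis=1)            # sup over the class, per draw
    return sups.mean()

# Toy symmetric class {f, -f} on a sample of size n = 50, with |f| <= 1.
rng = np.random.default_rng(1)
f = rng.uniform(-1.0, 1.0, size=50)
estimate = empirical_rademacher(np.vstack([f, -f]))
# Symmetry of the class makes the estimate nonnegative; boundedness caps it at max |f|.
```

For such a uniformly bounded toy class, the estimate is small (on the order of the bound divided by the square root of the sample size), matching the scaling delivered by Lemma 4.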
Lemma 3 (Symmetrization Bound, Wainwright [2019] Theorem 4.10).
For the -uniformly bounded class of function , with probability at least , we have
Define the norm for the function such that for any function . As the function class is -uniformly bounded, we then have
Moreover, let be the covering number of the class with respect to the metric induced by the norm. According to the definition of in Equation D.1, we have . Along with the growth-rate restriction on in Assumption 6, we establish the following Dudley entropy integral bound for the empirical Rademacher complexity defined in Equation D.3.
Lemma 4 (Dudley’s Entropy Integral).
Assume Assumption 6 holds. Given that , there exists a constant such that the empirical Rademacher complexity is almost surely upper-bounded by Dudley’s integral: ,
Given Assumption 6, we can now provide a generalization error bound for the empirical process error in Lemma 2 by combining Lemma 3 and Lemma 4: with probability at least , we have
(D.4)
Generalization Error Bound for Excess Cost-sensitive Risk
Combining the calibration bound in Theorem 5, the decomposition of the excess surrogate risk in Lemma 2, the nuisance estimation error in Proposition 3, and the empirical process error in Equation D.4, we can now provide a generalization error bound for the original excess cost-sensitive risk induced by the output score function in Algorithm 1. Formally, with probability at least , we have
(D.5)
This result shows that the excess cost-sensitive risk of the output score function converges to the optimal risk at a rate of , which is the standard rate for the multiclass classification risk in statistical learning theory [Massart and Nédélec, 2006, Shalev-Shwartz and Ben-David, 2014, Wainwright, 2019].
D.1 Remaining Proofs
Proof of Lemma 2.
Let denote the best-in-class score function that minimizes the excess surrogate risk . We use to denote the true weight function and to represent its estimate from batch . By definition, we have
∎
Proof of Proposition 3.
Notice that for any function and batch ,
The first inequality comes from the convexity of the absolute value function together with Jensen’s inequality. The second inequality follows from the Cauchy–Schwarz inequality, and the third follows from the fact that the function is uniformly bounded by for all and . Finally, combining Assumption 5 with the results above, we have for every batch
∎