
Learning with Selectively Labeled Data from Multiple Decision-makers

Jian Chen,  Zhehao Li,  Xiaojie Mao (alphabetical order; email: lizh21@mails.tsinghua.edu.cn)
( Tsinghua University )
Abstract

We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings, and we theoretically and numerically validate the efficacy of the proposed method.

1 Introduction

The problem of selective labels is common in many decision-making applications involving human subjects. In these applications, each individual receives a certain decision, which in turn determines whether the individual’s outcome label is observed. For example, in judicial bail, the label of interest is whether a defendant returns to court without committing another crime if released. However, this label cannot be observed if bail is denied. Similarly, in lending, the default status of a loan applicant cannot be observed if the loan is not approved. In hiring, a candidate’s job performance cannot be observed if the candidate is not hired.

The selective label problem presents significant challenges for developing effective machine learning (ML) algorithms to support decision-making [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2021]. Specifically, labeled data may not be representative of the broader population to which decisions will be applied, creating a classic selection bias issue. As a result, models trained on selectively labeled data may perform well on the observed subjects but fail to generalize to the full population when deployed in real-world settings. This challenge becomes even more critical when historical decisions relied on unobservable factors not captured in the available data, as is common when human decision-makers are involved. In such cases, the labeled data can differ substantially from the broader population due to unknown factors, further complicating the development of reliable ML models.

In this paper, we address the selective label problem by leveraging the heterogeneity of decision-making processes in many real-world applications. Specifically, we consider settings where decisions are made by multiple decision-makers with distinct decision rules, potentially leading to different outcomes for the same subject. Furthermore, since decision-makers often evaluate similar pools of subjects, each subject can be viewed as being quasi-randomly assigned to a specific decision-maker. This structure offers a way to mitigate the selective label problem: a subject who remains unlabeled under one decision-maker might have been labeled if assigned to another. For instance, judicial decisions are typically made by different judges who vary in their leniency when deciding whether to release the same defendant. The heterogeneous decision-maker framework has also been explored in prior research to address selection bias [e.g., Lakkaraju et al., 2017, Kleinberg et al., 2018, Rambachan et al., 2021, Arnold et al., 2022].

1.1 Main Contributions

In this paper, we leverage the instrumental variable framework in causal inference to formalize the selective label problem with multiple heterogeneous decision-makers. This enables principled identification analyses to characterize the identifiability of classification risks from the observed data, revealing fundamental power and limits of the multiple decision-maker structure in tackling the selective label problem. This provides valuable insights beyond the heuristic approaches in the prior literature.

Specifically, we describe the problem setup and the IV framework in Section 3. Then in Section 4 we establish identification conditions for the classification risk from the observed data with multiple decision-makers and highlight that the required identification assumptions can be overly strong. In Section 5, we further derive tight bounds on the plausible values of the classification risk when exact identification fails. In Section 6, we propose a Unified Cost-sensitive Learning (UCL) algorithm to learn classifiers robust to selection bias in both identification settings. We demonstrate the superior performance of our proposed method through comprehensive numerical experiments on synthetic and semi-synthetic datasets in Section 7. We defer theoretical guarantees for our method to Appendix D.

Notations

Define x^{+}=\max\{x,0\}, x\land y=\min\{x,y\} and x\lor y=\max\{x,y\} for x,y\in\mathbb{R}. We use the symbol [N] as a shorthand for the set \{1,2,\cdots,N\} and |\cdot| to denote the absolute value of a number or the cardinality of a set.

2 Related Work

This work is closely related to the growing literature on the following topics: the selective labels problem, identification analysis with IVs, and individual treatment rule learning.

Selective Labels Problem

Kleinberg et al. [2018], Lakkaraju et al. [2017] formalize the selective labels problem in predicting recidivism within the judicial bail context. They train a prediction algorithm on selectively labeled datasets and evaluate whether the algorithm can outperform human decisions when used to assist decision-making. By exploiting the random assignment of judges to defendants, they group defendants based on judge assignment, ensuring comparable demographic features across groups. They also leverage variations in judges’ release rates, where defendants detained by stricter judges might have been released by more lenient ones. Using a heuristic “contraction” technique, they compare the performance of the algorithm-assisted decisions combined with the most lenient judge’s decisions against the stricter judges’ decisions. Fixing the release rate for both methods, they evaluate the recidivism rates and demonstrate that the prediction algorithm can outperform human decisions in some judge groups.

Bertsimas and Fazel-Zarandi [2022] explore the application of machine learning algorithms to assist public decision-making in immigration enforcement. They leverage the fact that most defendants are randomly assigned to federal and state prisons after conviction, enabling the grouping of defendants based on prison assignments. Additionally, they exploit variations in release rates among prisons due to differences in geographic proximity to ICE offices, allowing for the matching of counterfactual outcomes for defendants detained by stricter prisons. Unlike Kleinberg et al. [2018], Lakkaraju et al. [2017], who focus on one-sided evaluations (assessing algorithmic predictions on individuals not deported), Bertsimas and Fazel-Zarandi [2022] conduct both one-sided and two-sided evaluations. They not only evaluate the predictions for defendants not deported but also establish partial bounds for the recidivism rates of individuals released by the algorithm who were actually deported by ICE officers (for whom labels are missing). Their findings provide evidence that machine learning algorithms can sometimes outperform human officers in decision-making.

The aforementioned works focus on evaluating the performance of machine learning algorithms on selectively labeled data. However, the core challenge lies in training prediction algorithms that are robust to the selection bias inherent in such datasets, and this is what we focus on in this paper. To address this, De-Arteaga et al. [2018, 2021] propose imputing missing labels as negative when multiple human decision-makers consistently reject a decision, leading to missing labels. While this approach offers a partial solution, it is inherently one-sided, as some missing labels may actually correspond to positive outcomes. Consequently, this method does not fully address the selection bias present in the observed data.

There is also a stream of literature that considers online learning to correct the selection bias in the labels. Kilbertus et al. [2020], Wei [2021], Yang et al. [2022] consider a learn-to-decide approach to address the selective label problem by implementing a stochastic policy/online algorithm that balances exploration of the entire population with exploitation of the selectively labeled dataset. While effective, this approach is resource-intensive and may be impractical for high-cost decision-making scenarios, such as judicial bail determinations or school admissions decisions.

Finally, our work relates to the literature on counterfactual evaluation under unmeasured confounding [Coston et al., 2020a, b, Rambachan et al., 2021, 2022]. Specifically, Rambachan et al. [2022] address the problem of risk assessment under unmeasured confounding by proposing a mean outcome sensitivity model. This meta-framework translates bounds on the average treatment effect (ATE) into bounds on various risk metrics. They outline two approaches to specify the bounding function for ATE: the outcome regression bound, which involves regressing the outcome on the observed features and treatment, and the instrumental variable (IV) bound, which aligns with our concept of IV partial bounds. However, our IV partial bounds are slightly tighter than their proposed bounds, as we refine them by taking the maximum over all j\in[J] for the lower bound l_{k} and the minimum over all j\in[J] for the upper bound u_{k}. We further prove the tightness of these IV partial bounds. The key distinction between our work and that of Rambachan et al. [2022] lies in focus. While they emphasize estimating mean outcome sensitivity bounds using doubly robust methods to derive bounds for various risk metrics, our work centers on training robust prediction models based on the estimated IV partial bounds. This ensures that the resulting models are resilient to selection bias in the observed training data.

Identification Analysis with IV

This work builds on the IV literature in statistics and econometrics. IV is a common and powerful tool to handle unmeasured confounding in causal inference [Angrist and Pischke, 2009, Angrist and Imbens, 1995, Angrist et al., 1996]. Indeed, some empirical studies have already considered heterogeneous decision-makers as an IV, such as the judge IV in Kling [2006], Mueller-Smith [2015]. However, these analyses are based on linear parametric models, while our identification analysis is free of any parametric restrictions. The point identification analysis builds on Wang and Tchetgen [2018], Cui and Tchetgen Tchetgen [2021] and the partial identification analysis builds on Manski and Pepper [1998], Swanson et al. [2018], Balke and Pearl [1994]. Our paper directly extends these results to the evaluation of prediction risks under multi-valued selective labels with a multi-valued IV, and develops novel reformulations that enable efficient learning of prediction rules via a unified procedure.

Individual Treatment Rule (ITR) Learning

There is a growing body of literature on learning individualized treatment rules (ITR), which seeks to determine the optimal treatment for each individual based on their observed features. This field has been extensively studied in personalized medicine and policy-making [Qian and Murphy, 2011, Zhao et al., 2012, Athey and Wager, 2021]. Qian and Murphy [2011] introduce an inverse propensity weighting (IPW) formulation for ITR learning, parameterizing the value function with a linear model and obtaining a linear ITR through l_{1}-penalized square loss minimization. Zhao et al. [2012] also adopt the IPW framework but treat ITR learning as a classification problem, proposing the hinge loss as a surrogate and solving for the optimal ITR using an SVM algorithm. Zhang et al. [2012] develop a doubly robust estimator for the value function and formalize ITR learning as a classification problem, using the square loss as a surrogate for the indicator function. Extending to multi-category problems, Zhang et al. [2020] propose the outcome-weighted learning (OWL) formulation and analyze the Fisher consistency of various surrogate loss functions. We demonstrate that learning under selective labels can similarly be formulated as a cost-sensitive or weighted classification problem when point identification is available. Furthermore, we propose several surrogate loss functions for learning the optimal classification rule and establish the finite-sample error bound for our unified learning algorithm.

Our work is also connected to the research on policy learning under partial identification. Ben-Michael et al. [2021] address robust policy learning under partial identification, focusing specifically on violations of the overlap assumption, where some instances have zero probability of receiving treatment. Kallus and Zhou [2021] investigate off-policy learning in the presence of unmeasured confounding, utilizing an IPW formulation for the value function. They address scenarios where the true propensity score is only partially bounded and propose a minimax optimization framework to learn robust policies. Their approach considers a differentiable, parameterized policy class and employs an iterative algorithm to solve the inner maximization via the dual problem, compute the gradient of the policy value, and update the policy parameters. In contrast, our work demonstrates that the inner maximization in our setting can be solved explicitly. This reformulation allows the minimax learning problem to be cast as a cost-sensitive classification task, enabling the use of state-of-the-art machine learning algorithms to efficiently solve the optimization problem.

Particularly relevant to our work are Cui and Tchetgen Tchetgen [2021] and Pu and Zhang [2021], who study ITR learning with instrumental variables under point identification and partial identification settings, respectively. Pu and Zhang [2021] consider a binary treatment setting and develop a minimax ITR framework using IV partial bounds of the expected potential outcomes. Similar to our approach, they explicitly solve the inner maximization and replace indicator functions with the hinge loss, solving the problem using an SVM algorithm. However, their work stops short of formalizing the minimax optimization as a general weighted classification problem, which could be implemented using any machine learning package. In contrast, we extend this framework to multi-valued treatments and demonstrate that minimax learning can be formulated as a cost-sensitive classification problem, generalizing the binary treatment setting in Pu and Zhang [2021] (see Section 5).

3 Problem Formulation

We address the selective label problem within a multiclass classification setting. Consider a dataset comprising N units, indexed by i=1,\dots,N. For each unit i, let Y_{i}^{\star}\in[K] denote the true label of interest for K\geq 2 classes. Additionally, let X_{i}\in\mathcal{X} and U_{i}\in\mathcal{U} represent observable and unobservable features, respectively, both of which may be correlated with Y_{i}^{\star}. Here, “observable” features are those accessible to the ML practitioner, while “unobservable” features are unavailable in the dataset.

In the selective label problem, the true label Y_{i}^{\star} is not always observed; its observability depends on a binary decision D_{i}\in\{0,1\} made by a decision-maker based on the features (X_{i},U_{i}). Specifically, we consider J\geq 2 different decision-makers and let Z_{i}\in[J] denote the specific decision-maker assigned to unit i. The observed label Y_{i} is defined as:

Y_{i}=\begin{cases}Y_{i}^{\star}&\text{if }D_{i}=1,\\ \text{NaN}&\text{if }D_{i}=0.\end{cases} (3.1)

When D_{i}=1 (e.g., bail, loan approval, hiring), the observed label Y_{i} is exactly the true label Y_{i}^{\star}, and when D_{i}=0, the observed label is recorded as Y_{i}=\text{NaN} (Not a Number), indicating that Y_{i}^{\star} remains unobserved in this case. Thus, each unit’s label is selectively observed according to the decision D_{i} made by decision-maker Z_{i}.
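To make the observation mechanism concrete, here is a minimal sketch (in Python, with hypothetical toy values) of how the observed label in Equation 3.1 is constructed from the decision and the latent true label.

import numpy as np

# Hypothetical toy example: five units with true labels in [K] = {1, 2, 3}.
y_star = np.array([1, 3, 2, 2, 1])   # latent true labels Y_i^star
d = np.array([1, 0, 1, 0, 1])        # binary decisions D_i
z = np.array([2, 1, 3, 2, 1])        # assigned decision-makers Z_i

# Observed label: Y_i = Y_i^star when D_i = 1 and NaN when D_i = 0, as in eq. (3.1).
y_obs = np.where(d == 1, y_star.astype(float), np.nan)
print(y_obs)   # [ 1. nan  2. nan  1.]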

Figure 1: Causal graph of the selective label problem. The dashed nodes U and Y^{\star} are unobserved. The dashed line from X to Z means that Z is allowed but not required to be affected by X.

We assume that \{(Y_{i},Y^{\star}_{i},D_{i},Z_{i},X_{i},U_{i})\}_{i=1}^{N} represents independent and identically distributed samples from a population denoted by (Y,Y^{\star},D,Z,X,U). The causal structure of these variables is plotted in Figure 1. However, we cannot observe all variables; instead, we only observe the data S_{N}=\{(Y_{i},D_{i},X_{i},Z_{i})\}_{i=1}^{N}.

3.1 Learning under Selective Labels

Our goal is to learn a classifier t:\mathcal{X}\mapsto[K] that can accurately predict the true label Y^{\star} based on the observed features X. This classifier will aid future decision-making processes by accurately predicting Y^{\star}. The effectiveness of the classifier upon deployment across the full population is evaluated using the expected risk with the zero-one loss:

\min_{t:\mathcal{X}\rightarrow[K]}\quad\mathcal{R}(t):=\mathbb{E}\left[\mathbb{I}(Y^{\star}\neq t(X))\right]. (3.2)

Here \mathbb{I}(\cdot) is an indicator function, which equals 1 if the event inside (\cdot) occurs and 0 otherwise. The expectation is taken over the unknown joint distribution of Y^{\star} and X. For a given classifier t, the risk \mathcal{R}(t) quantifies its misclassification error rate \mathbb{P}(Y^{\star}\neq t(X)).

Conventionally, one can train a classifier through empirical risk minimization (ERM), i.e., minimizing an empirical approximation of the classification risk \mathcal{R}(t) based on the sample data. However, this is challenging under selective labels since the true label Y^{\star} is not fully observed. One natural alternative is to perform ERM on the labeled subsample (with subsample size denoted by N_{1}):

\min_{t:\mathcal{X}\mapsto[K]}\ \widehat{\mathcal{R}}_{\operatorname{label}}(t):=\frac{1}{N_{1}}\sum_{i:D_{i}=1}\mathbb{I}\{Y_{i}\neq t(X_{i})\}. (3.3)

However, the empirical risk \widehat{\mathcal{R}}_{\operatorname{label}}(t) may not capture the true risk even when the total sample size N\to\infty:

\widehat{\mathcal{R}}_{\operatorname{label}}(t)\to\mathbb{P}(Y^{\star}\neq t(X)\mid D=1)\neq\mathcal{R}(t).

This is known as a selection bias problem, where the distribution of (Y^{\star},X)\mid D=1 in the labeled subpopulation and that of (Y^{\star},X) in the full population are different. The selection bias arises because the historical decision D can depend on the features X and U that are also correlated with the true outcome Y^{\star}. Hence, one has to adjust for both X and U to eliminate the selection bias, as formalized below.
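As a concrete illustration of this gap, the following toy simulation (a hypothetical sketch, not part of the paper's experiments) shows that the misclassification rate of a fixed classifier on the labeled subsample can differ markedly from its rate on the full population when the labeling decision depends on an unobservable that also drives Y^{\star}.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                                    # unobservable U
x = rng.normal(size=n)                                    # observable X
y_star = (x + u + rng.normal(size=n) > 0).astype(int)     # binary true label Y*
d = (u + rng.normal(size=n) > 0).astype(int)              # decision D depends on U

t = lambda feat: (feat > 0).astype(int)                   # a fixed classifier using X only
risk_full = np.mean(y_star != t(x))                       # risk R(t) on the full population
risk_labeled = np.mean(y_star[d == 1] != t(x[d == 1]))    # risk on the labeled subsample (D = 1)
print(risk_full, risk_labeled)                            # the two error rates differ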

Assumption 1.

Suppose (1) Selection-on-observables-and-unobservables: D\perp Y^{\star}\mid X,U; (2) Overlap: \mathbb{P}(D=1\mid X,U)>0 almost surely.

The condition D\perp Y^{\star}\mid X,U in Assumption 1 means that the labeled subpopulation and the full population would have the same Y^{\star} distribution if one properly adjusts for (X,U). The overlap condition means that the subpopulation with almost any X,U value has a positive chance of being labeled. Both conditions are standard in the missing data literature [Little and Rubin, 2019]. If the features X,U were both observed, then one could correct for the selection bias by adjusting for X,U via many methods in the literature [e.g., Rosenbaum and Rubin, 1983, Rubin, 2005, Imbens and Rubin, 2015]. However, adjusting for the unobservable U is infeasible, so selection bias remains a significant problem in our setting. This setting is referred to as the missing-not-at-random (MNAR) problem in the missing data literature [Little and Rubin, 2019]. Addressing MNAR is notoriously difficult and typically requires additional information.

3.2 Multiple Decision-makers and IV

To address the selective label problem, we draw on recent literature that leverages the fact that in many applications selective labels involve multiple decision-makers with heterogeneous decision-making preferences [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2018]. The rationale behind this idea is that the unobserved true label Y^{\star} for a unit in the missing group (D=0) could have been observed if the unit had been assigned to a different decision-maker. This is particularly useful when the assignment of decision-makers Z is random, independent of the unobserved features U. For instance, Lakkaraju et al. [2017], Kleinberg et al. [2018] consider that in judicial bail, a defendant is randomly assigned to a judge, so a unit who is not granted bail (and thus unlabeled) could have been assigned to a more lenient judge, received bail, and consequently been labeled.

Although previous works have already considered the structure of multiple decision-makers, they largely rely on heuristic approaches. In fact, the assignment of decision-makers can be formalized as an instrumental variable (IV). As we will show shortly, this IV formalization allows us to understand the role of multiple decision-makers in a principled way. IV is a well-known concept in causal inference. Following the standard literature [e.g., Angrist et al., 1996, Cui and Tchetgen Tchetgen, 2021, Wang and Tchetgen, 2018], a valid IV has to satisfy the following conditions.

Assumption 2 (IV conditions).

The assignment Z satisfies the following three conditions: 1. IV relevance: Z\not\perp D\mid X; 2. IV independence: Z\perp(U,Y^{\star})\mid X; 3. Exclusion restriction: Z\perp Y\mid D,Y^{\star}.

The IV conditions above are particularly suitable for the random assignment of decision-makers. IV relevance means that even after controlling for the observed features X, different decision-makers could have systematically different decision rules, so that the assignment Z is dependent on the decision D. IV independence trivially holds when the decision-maker assignment is random. The exclusion restriction automatically holds because the observed label Y is completely determined by the decision D and the true label Y^{\star}, so it is independent of any other variable given D,Y^{\star}. We note that these conditions are also explicitly or implicitly assumed in the prior literature. Assumption 2 connects these conditions to the IV framework.

Based on the IV framework, we will next understand the power and limits of the multiple decision-maker structure in addressing the selective label problem. In particular, we will analyze the identification of the classification risk \mathcal{R}(t) in Equation 3.2, studying under what conditions the risk can be determined from the distribution of observed data with multiple decision-makers.

4 Exact Identification of Classification Risk

In this section, we establish conditions under which the classification risk \mathcal{R}(t) can be exactly identified from the observed data. This represents settings where consistently evaluating the classification risk is possible, provided one properly leverages the multiple decision-maker structure.

Before delving into our detailed analysis, we define \eta_{k}^{\star}(X):=\mathbb{P}(Y^{\star}=k\mid X) as the conditional probability that Y^{\star} equals k, given the observable features X. The Bayes optimal classifier, which minimizes the classification risk \mathcal{R}(t), is determined by s^{\star}(X):=\operatorname*{arg\,max}_{k\in[K]}\eta_{k}^{\star}(X). In fact, s^{\star} also minimizes the excess risk, defined as:

\mathcal{R}(t,s^{\star})=\mathbb{P}(Y^{\star}\neq t(X))-\mathbb{P}(Y^{\star}\neq s^{\star}(X)). (4.1)

Minimizing \mathcal{R}(t) over the classifier t is equivalent to minimizing \mathcal{R}(t,s^{\star}) over t, as the Bayes optimal risk \mathcal{R}(s^{\star}) is a constant. Therefore, we can equivalently study the identification of the excess risk \mathcal{R}(t,s^{\star}). The following lemma gives a weighted formulation of the excess risk in terms of the conditional probabilities.

Lemma 1.

The excess risk \mathcal{R}(t,s^{\star}) satisfies

\mathcal{R}(t,s^{\star})=\mathbb{E}\Big[\sum_{k=1}^{K}\eta_{k}^{\star}(X)\cdot\big(\mathbb{I}_{s^{\star}(X)=k}-\mathbb{I}_{t(X)=k}\big)\Big].

Lemma 1 indicates that to identify the excess risk, we first need to identify the conditional probabilities \eta_{k}^{\star}\in[0,1] of the latent Y^{\star} from the observed data. We therefore focus on the identification of \boldsymbol{\eta}^{\star}(x)=(\eta_{1}^{\star}(x),\dots,\eta_{K}^{\star}(x)) in the sequel.

Identification of Conditional Probabilities

In Section 3.2, we show that the assignment of decision-makers Z can be viewed as an IV. By adapting the IV identification theory in the causal inference literature [Wang and Tchetgen, 2018, Cui and Tchetgen Tchetgen, 2021], we can identify the classification risk under the following no unmeasured common effect modifiers (NUCEM) assumption.

Assumption 3 (NUCEM).

For each x\in\mathcal{X}, \operatorname{Cov}\big(\operatorname{Cov}(D,Z\mid X,U),\ \mathbb{P}(Y^{\star}=k\mid X,U)\mid x\big)=0.

Theorem 1 (Point Identification).

Define Y_{k}:=\mathbb{I}\{Y=k\}. Under Assumptions 1, 2 and 3, \eta_{k}^{\star}(X) is identified:

\eta_{k}^{\star}(X)=\operatorname{Cov}(DY_{k},Z\mid X)/\operatorname{Cov}(D,Z\mid X). (4.2)

Conversely, if (4.2) holds, Assumption 3 is also satisfied.

Theorem 1 shows that Assumption 3 is sufficient and necessary for identifying the conditional probability functions \eta^{\star}_{k} through Equation 4.2. This identification equation connects the conditional probability of the latent true label Y^{\star} with the fully observed variables D,Y,Z,X. This lays the foundation for learning the \eta^{\star}_{k}’s (and the classification risk) from the observed data using the decision-maker assignment IV.

Identification of Excess Classification Risk

We can now identify the excess classification risk from observed data.

Theorem 2.

Under the assumptions in Theorem 1, the excess classification risk \mathcal{R}(t,s^{\star}) in eq. (4.1) is also identified:

\mathcal{R}(t,s^{\star})=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\Big], (4.3)

where w_{k}(x):=\max_{p\in[K]}(\eta_{p}^{\star}(x)-\eta_{k}^{\star}(x)) denotes the weight function and \eta_{k}^{\star}(x) is identified in eq. (4.2). This can be simplified for binary classification problems with K=2:

\mathcal{R}(t,s^{\star})=\mathbb{E}\big[|w(X)|\cdot\mathbb{I}\big\{t(X)\neq\operatorname{sgn}[w(X)]\big\}\big], (4.4)

where w(x):=\operatorname{Cov}(DY,Z\mid x)/\operatorname{Cov}(D,Z\mid x)-1/2 denotes the weight and \operatorname{sgn}:\mathcal{X}\rightarrow\{\pm 1\} is the sign function.

Theorem 2 not only establishes the identification of the excess classification risk under Assumption 3, but also formulates the identification in a weighted classification form. Specifically, minimizing the right-hand side of Equation 4.3 corresponds to searching for a cost-sensitive classifier, while minimizing the right-hand side of Equation 4.4 corresponds to a weighted classification problem with the sign of w(X) as the label and |w(X)| as the misclassification cost. In Section 6, we will show that these formulations are useful in the design of effective learning algorithms.

The NUCEM Identification Assumption

The identification theory in Theorems 1 and 2 crucially relies on the No Unmeasured Common Effect Modifier (NUCEM) assumption in Assumption 3, which generalizes Cui and Tchetgen Tchetgen [2021] from a binary IV to a multi-valued IV. These analyses reveal that additional assumptions are needed to fully identify the classification risk, even when the decision-maker structure already satisfies the IV conditions in Assumption 2. This fact was not known in the prior selective label literature without the formal IV framework.

To understand the NUCEM assumption, it is instructive to consider examples where this assumption holds.

Example 1 (Full Information).

If all features needed to predict the true label Y^{\star} are recorded, then \mathbb{E}[Y^{\star}\mid X,U]=\mathbb{E}[Y^{\star}\mid X] almost surely and Assumption 3 holds.

Example 2 (Separable and Independent Unobservables).

Suppose the unobservables U can be divided into two categories: U_{1} influences the decision D, and U_{2} affects the outcome Y^{\star}. Hence, we have \operatorname{Cov}(D,Z\mid X,U)=\operatorname{Cov}(D,Z\mid X,U_{1}) and \mathbb{E}[Y^{\star}\mid X,U]=\mathbb{E}[Y^{\star}\mid X,U_{2}] almost surely. If U_{1} and U_{2} are (conditionally) independent, then Assumption 3 holds.

Example 3 (Additive Decision Probability).

Suppose that for every decision-maker j\in[J], the conditional decision probability satisfies \mathbb{P}(D=1\mid U,X,Z=j)=g_{j}(X)+q(U) for a common function q and potentially different functions g_{j}. That is, all decision-makers use the unobservables U in a common and additive way. Under this condition, we can prove that Assumption 3 also holds (see Section A.2).
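To see why the additive structure implies the NUCEM condition, note the following brief sketch (assuming the IV independence Z\perp U\mid X from Assumption 2; the full argument is in Section A.2):

\begin{align*}
\operatorname{Cov}(D,Z\mid X,U)
&=\sum_{j=1}^{J}j\,\mathbb{P}(Z=j\mid X)\big(g_{j}(X)+q(U)\big)-\mathbb{E}[Z\mid X]\sum_{j=1}^{J}\mathbb{P}(Z=j\mid X)\big(g_{j}(X)+q(U)\big)\\
&=\sum_{j=1}^{J}j\,\mathbb{P}(Z=j\mid X)\,g_{j}(X)-\mathbb{E}[Z\mid X]\sum_{j=1}^{J}\mathbb{P}(Z=j\mid X)\,g_{j}(X),
\end{align*}

which does not depend on U. Hence \operatorname{Cov}(D,Z\mid X,U) is a function of X alone, and its conditional covariance with \mathbb{P}(Y^{\star}=k\mid X,U) given X=x is zero, which is exactly Assumption 3.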

The examples above show that the NUCEM assumption needs to impose restrictions on the impact of the unobserved features U on the decision D, the true label Y^{\star}, or both. However, these restrictions may be too strong in practice, so the NUCEM assumption may fail. As a result, the identification formulas in Theorems 1 and 2 are no longer valid. In the next section, we will drop the NUCEM assumption and only impose some mild conditions, leading to weaker identification results accordingly.

5 Partial Identification of Classification Risk

The last section shows that the exact identification of classification risk often requires assumptions that are untenable in practice (see Assumption 3). Hence it is desirable to avoid such stringent assumptions and instead consider weaker and more reasonable assumptions. Nevertheless, without the stringent assumptions, it is generally impossible to exactly identify the classification risk from the observed data. Instead, there may exist a range of plausible values of the classification risk, all compatible with the observed data. This plausible range is characterized by the so-called partial identification bounds.

Below we adopt a mild bound condition on the conditional probability of the true label for the partial bound analysis.

Assumption 4.

For each fixed x and k, there exist two known functions a_{k}(x)\in[0,1] and b_{k}(x)\in[0,1] such that a_{k}(x)\leq\mathbb{P}(Y^{\star}=k\mid U,X=x)\leq b_{k}(x) almost surely.

Assumption 4 mandates a known bound on the conditional probability \mathbb{P}(Y^{\star}=k\mid U,X=x) while permitting arbitrary dependence on the unobserved features U. This assumption is mild, since it is automatically satisfied when a_{k}(x)=0,b_{k}(x)=1 for each k\in[K]. With these bounding functions, we derive tight partial identification bounds on \eta_{k}^{\star}(X).

Partial Identification of Conditional Probabilities

Based on this condition, we can derive tight IV partial identification bounds on the conditional probability \eta_{k}^{\star}(X)=\mathbb{P}(Y^{\star}=k\mid X) for each k\in[K].

Theorem 3.

Define Y_{k}:=\mathbb{I}\{Y=k\} and assume that Assumptions 1, 2 and 4 hold. Then for any x\in\mathcal{X} and k\in[K], the value of \eta_{k}^{\star}(x) must fall within the interval [l_{k}(x),u_{k}(x)], where

l_{k}(x):=\max_{z\in[J]}\mathbb{E}[DY_{k}+a_{k}(x)(1-D)\mid X=x,Z=z],
u_{k}(x):=\min_{z\in[J]}\mathbb{E}[DY_{k}+b_{k}(x)(1-D)\mid X=x,Z=z].

Conversely, any value within the interval [l_{k}(x),u_{k}(x)] can be a plausible value of \eta_{k}^{\star}(x). Moreover, the interval endpoints satisfy a_{k}(x)\leq l_{k}(x)\leq u_{k}(x)\leq b_{k}(x).

Theorem 3 generalizes the well-known Balke and Pearl bounds and Manski bounds in Balke and Pearl [1994], Manski and Pepper [1998]. Specifically, our bounds coincide with Balke and Pearl’s bounds when applied to a binary IV (see Section B.2). Furthermore, if a(X),b(X) are two constants independent of X (such as 0,1), our bounds recover Manski’s bounds. Theorem 3 also demonstrates the tightness of our IV partial bounds, in the sense that they tightly give the range of all plausible values of the conditional probabilities \eta_{k}^{\star} (see Section B.1).

Partial Identification of Excess Classification Risk

When \{\eta_{k}^{\star}\}_{k\in[K]} are only partially identified within [l_{k},u_{k}] for each k\in[K], the excess classification risk \mathcal{R}(t,s^{\star}) is also only partially identified. Define the feasible region for the conditional probability vector \boldsymbol{\eta}^{\star}=(\eta_{1},\dots,\eta_{K}) as

S:=\big\{\boldsymbol{\eta}:\|\boldsymbol{\eta}\|_{1}=1,\ \eta_{k}\in[l_{k},u_{k}],\ k\in[K]\big\}. (5.1)

Then the range of plausible values of the excess risk \mathcal{R}(t,s^{\star}) is given by the interval [\underline{\mathcal{R}}(t),\overline{\mathcal{R}}(t)], where

\overline{\mathcal{R}}(t):=\mathbb{E}\Big[\max_{\boldsymbol{\eta}\in S}\sum_{k=1}^{K}\eta_{k}(X)\big(\mathbb{I}_{s^{\star}(X)=k}-\mathbb{I}_{t(X)=k}\big)\Big], (5.2)

and \underline{\mathcal{R}}(t) is defined symmetrically by changing the maximization over \boldsymbol{\eta}\in S into a minimization over the same set. We are particularly interested in the upper bound \overline{\mathcal{R}}(t) as it gives the worst-case risk of a classifier t. Minimizing this upper bound, i.e., \min_{t:\mathcal{X}\rightarrow[K]}\overline{\mathcal{R}}(t), hence leads to a robust classifier safeguarding the performance in the worst case. Below we further provide a reformulation of the worst-case risk under a mild regularity condition.

Theorem 4.

Suppose the conditions in Theorem 3 hold and the set S defined in eq. (5.1) is non-empty. Given the bounds [l_{k},u_{k}] in Theorem 3, we define the “realizable” partial bounds for \eta_{k}^{\star} as \tilde{l}_{k}(x):=[1-\sum_{p\neq k}u_{p}(x)]\lor l_{k}(x) and \tilde{u}_{k}(x):=[1-\sum_{p\neq k}l_{p}(x)]\land u_{k}(x). We have

\overline{\mathcal{R}}(t)=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\Big],

where w_{k}(x):=\max_{p\neq k,p\in[K]}\big(\tilde{u}_{p}(x)-\tilde{l}_{k}(x)\big)^{+} denotes the weight function. This can be simplified for binary classification problems with K=2:

\overline{\mathcal{R}}(t)=\mathbb{E}\big[|w(X)|\cdot\mathbb{I}\{t(X)\neq\operatorname{sgn}[w(X)]\}\big]+C,

where C is a constant not depending on the classifier t and w(x):=\max\{2u_{1}(x)-1,0\}+\min\{2l_{1}(x)-1,0\}.

The non-emptiness of the set S defined in Equation 5.1 is a mild regularity condition. It is easy to verify from Theorem 3 that this condition is automatically satisfied when a_{k}(X)=0 and b_{k}(X)=1 for all k\in[K]. Like Theorem 2, Theorem 4 formulates the worst-case risk as a cost-sensitive form for general multi-class classification problems and as a weighted classification form for binary classification problems; a computational sketch is given below. This reformulation will be handy for the learning algorithm design in the next section.
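For concreteness, the following sketch (hypothetical Python code, with l and u standing for arrays of estimated bounds evaluated at sample points) computes the realizable partial bounds and the worst-case weights w_{k}(x) of Theorem 4.

import numpy as np

def worst_case_weights(l, u):
    """Worst-case cost-sensitive weights from Theorem 4.

    l, u: arrays of shape (n, K) holding the partial bounds [l_k(x_i), u_k(x_i)].
    Returns an (n, K) array with w_k(x) = max_{p != k} (u~_p(x) - l~_k(x))^+.
    A sketch; it assumes the feasible set S is non-empty for every row.
    """
    n, K = l.shape
    sum_l = l.sum(axis=1, keepdims=True)
    sum_u = u.sum(axis=1, keepdims=True)
    # Realizable bounds: l~_k = max(1 - sum_{p != k} u_p, l_k), u~_k = min(1 - sum_{p != k} l_p, u_k).
    l_tilde = np.maximum(1.0 - (sum_u - u), l)
    u_tilde = np.minimum(1.0 - (sum_l - l), u)
    w = np.empty((n, K))
    for k in range(K):
        others = np.delete(u_tilde, k, axis=1)              # u~_p for p != k
        w[:, k] = np.maximum(others.max(axis=1) - l_tilde[:, k], 0.0)
    return w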

Discussions

Notably, the partial identification results in this section are quite different from the exact identification results in Section 4. Section 4 relies on the strong NUCEM identification assumption (Assumption 3). This strong assumption has a strong implication: we can pinpoint the exact value of the classification risk from the observed data with the decision-maker assignment as an IV. However, the NUCEM identification assumption may often be implausible in practice. In this section, we instead consider a much weaker bound condition in Assumption 4. This condition is arguably more plausible, yet it also has weaker identification implications. Under this condition, even with the decision-maker assignment as a valid IV, the classification risk is still intrinsically ambiguous, so we can at most obtain a range of its values from the observed data. In the next section, we will design algorithms to learn robust classifiers by approximately minimizing the worst-case classification risk. These analyses reveal the power and limits of the multiple decision-maker data structure in tackling the selective label problem, which highlights the value of the principled IV framework adopted in our paper.

We note that the recent work by Rambachan et al. [2022] is closely related to our paper. Rambachan et al. [2022] study counterfactual evaluation under unmeasured confounding using IVs, mention selective labels with multiple decision-makers as one example, and derive IV partial identification bounds. However, our paper differs from Rambachan et al. [2022] in important aspects. In particular, Rambachan et al. [2022] focus on partial identification bounds on various risk measures in counterfactual decision-making and on statistical estimation and inference for the bounds. In contrast, our paper systematically studies both the exact identification and partial identification settings and provides a unified weighting formulation of classification risks for these two settings (see Theorems 2 and 4). Moreover, Rambachan et al. [2022], like the prior literature [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2018], primarily focus on evaluating the risk of a given prediction rule. In contrast, our paper not only studies evaluation but also designs efficient algorithms to learn classifiers in the two identification settings. This is the focus of the next section.

Algorithm 1 Unified Cost-sensitive Learning
1: Input: function class \mathcal{H}, \operatorname{mode}\in\{\operatorname{point},\operatorname{partial}\}, data S_{N}=\{(Y_{i},X_{i},D_{i},Z_{i})\}_{i=1}^{N}, cross-fitting folds L.
2: Randomly split the data into L (even) batches with indices denoted by S_{N}^{1},\dots,S_{N}^{L} respectively.
3: for each batch l\in[L] do
4:     if \operatorname{mode}=\operatorname{point} then
5:         For k\in[K], use dataset S_{N}\setminus S_{N}^{l} to estimate the weight \widehat{w}_{k}^{[l]}(x)=\max_{p\in[K]}\big(\widehat{\eta}_{p}(x)-\widehat{\eta}_{k}(x)\big).
6:     else if \operatorname{mode}=\operatorname{partial} then
7:         For k\in[K], use dataset S_{N}\setminus S_{N}^{l} to estimate the weight \widehat{w}_{k}^{[l]}(x)=\max_{p\neq k}\big(\widehat{u}_{p}(x)-\widehat{l}_{k}(x)\big)^{+}.
8:     end if
9: end for
10: Construct the empirical risk for each batch l\in[L]:
\widehat{\mathcal{R}}_{\exp}^{[l]}(\boldsymbol{h})=\frac{1}{|I_{l}|}\sum_{i\in I_{l}}\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X_{i})\frac{\exp(h_{k}(X_{i}))}{\sum_{p=1}^{K}\exp(h_{p}(X_{i}))}.
11: Return the score function \widehat{\boldsymbol{h}} obtained by minimizing:
\widehat{\boldsymbol{h}}\in\operatorname*{arg\,min}_{\boldsymbol{h}\in\mathcal{H}}\frac{1}{L}\sum_{l=1}^{L}\widehat{\mathcal{R}}_{\exp}^{[l]}(\boldsymbol{h}). (5.3)

6 Unified Cost-sensitive Learning

Figure 2: The testing accuracy of methods with \alpha\in\{0.5,0.7,0.9\} for model NUCEM in the FICO dataset.
Figure 3: The testing accuracy of methods with \alpha\in\{0.5,0.7,0.9\} for model UC in the FICO dataset.

In this section, we propose a unified cost-sensitive learning (UCL) algorithm tailored to learn a classifier from selectively labeled data. This algorithm accommodates both the exact identification setting in Section 4 and the partial identification setting in Section 5 in a unified manner.

Typically, a classifier t(x) is represented by a score function \boldsymbol{h}:\mathcal{X}\rightarrow\mathbb{R}^{K} through t(x)=\operatorname*{arg\,max}_{k\in[K]}h_{k}(x). Here, for any given x, h_{k}(x) quantifies the score of class k. According to Theorems 2 and 4, the excess classification risk of a classifier t with score function \boldsymbol{h} in the two identification settings can be written in the following common weighted form: for t(X)=\operatorname*{arg\,max}_{k\in[K]}h_{k}(X),

\mathcal{R}(\boldsymbol{h},\boldsymbol{w})=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\Big]. (6.1)

Specifically, we can set w_{k}(x)=\max_{p\in[K]}(\eta_{p}^{\star}(x)-\eta_{k}^{\star}(x)) for each k to recover the exact identification in Theorem 2, and set w_{k}(x)=\max_{p\neq k,p\in[K]}(\tilde{u}_{p}(x)-\tilde{l}_{k}(x))^{+} to recover the worst-case excess risk \overline{\mathcal{R}}(t) in Theorem 4. In this section, we aim to minimize these risk objectives to learn effective classifiers. This corresponds to a cost-sensitive multiclass classification problem.

Theorems 2 and 4 also show that for binary classification problems with K=2, the risk function simplifies to:

\mathcal{R}(h,w)=\mathbb{E}\big[|w(X)|\cdot\mathbb{I}\big\{t(X)\neq\operatorname{sgn}[w(X)]\big\}\big],

where t(x)=\operatorname{sgn}[h(x)] for any x. By specifying w(x)=\operatorname{Cov}(DY,Z\mid x)/\operatorname{Cov}(D,Z\mid x)-1/2, we recover the classification risk in the exact identification setting. By specifying w(x)=\max\{2u_{1}(x)-1,0\}+\min\{2l_{1}(x)-1,0\}, we recover the worst-case risk (up to a constant) under partial identification. Minimizing these risks corresponds to solving a weighted binary classification problem with the pseudo label \operatorname{sgn}[w(X)] and misclassification cost |w(X)|, as sketched below.
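As a sketch of this weighted binary classification step (with a hypothetical feature matrix X_train and an already-estimated weight array w_hat), one could use any off-the-shelf learner that accepts sample weights, e.g.:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))          # hypothetical features X
w_hat = rng.normal(size=500)                 # hypothetical estimated weights w(X_i)

pseudo_label = np.where(w_hat >= 0, 1, -1)   # pseudo label sgn[w(X_i)]
cost = np.abs(w_hat)                         # misclassification cost |w(X_i)|

# Weighted logistic regression: ERM on the surrogate risk (6.3) with the logistic loss.
clf = LogisticRegression().fit(X_train, pseudo_label, sample_weight=cost)
t_hat = clf.predict(X_train)                 # learned classifier t(x) in {-1, +1}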

6.1 Convex Surrogate Risk

The cost-sensitive classification risk \mathcal{R}(\boldsymbol{h},\boldsymbol{w}) in Equation 6.1 involves non-convex and non-smooth indicator functions \mathbb{I}\{\cdot\}, making the optimization challenging. To address this issue, we introduce a convex surrogate risk \mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}) that replaces the indicator function with the \operatorname{softmax} function. This substitution allows us to define the convex surrogate for the cost-sensitive classification risk:

\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\Big]. (6.2)

For the binary case, we introduce the weighted \phi-risk as

\mathcal{R}_{\phi}(h,w)=\mathbb{E}\big[|w(X)|\cdot\phi\big(h(X)\cdot\operatorname{sgn}[w(X)]\big)\big]. (6.3)

There are many possible choices for the convex surrogate loss \phi, including the hinge loss \phi(\alpha)=\max\{1-\alpha,0\}, the logistic loss \phi(\alpha)=\log(1+e^{-\alpha}), and the exponential loss \phi(\alpha)=e^{-\alpha} [Bartlett et al., 2006].

Theorem 5 establishes a calibration bound linking the cost-sensitive risk \mathcal{R}(\boldsymbol{h},\boldsymbol{w}) to its surrogate \mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}).

Theorem 5 (Calibration Bound for Cost-sensitive Risk).

For any given weight function \boldsymbol{w}:\mathcal{X}\rightarrow\mathbb{R}^{K}, we have

\mathcal{R}(\boldsymbol{h},\boldsymbol{w})-\inf_{\boldsymbol{h}}\mathcal{R}(\boldsymbol{h},\boldsymbol{w})\leq K\big[\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})-\inf_{\boldsymbol{h}}\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})\big].

For the binary case, a calibration bound can also be established for \mathcal{R}_{\phi}(h,w) [Bartlett et al., 2006, Tewari and Bartlett, 2007].

Proposition 1 (Calibration Bound for Weighted Risk).

For any fixed weight function w:\mathcal{X}\rightarrow\mathbb{R}, we have

\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w)\leq\mathcal{R}_{\phi}(h,w)-\inf_{h}\mathcal{R}_{\phi}(h,w).

Theorem 5 and Proposition 1 show that the excess target risks are bounded by the excess surrogate risks. Hence we can use sample data to approximately minimize the surrogate risks, which can be done efficiently.

6.2 Empirical Risk Minimization

We now propose Algorithm 1 to minimize the empirical surrogate risk \mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}) in Equation 6.2 using the observed data S_{N}. We also establish a finite-sample error bound for the resulting score function \boldsymbol{h} (see Appendix D).

Weights Estimation

The weight functions \{w_{k}\}_{k=1}^{K} differ between the point-identified and partially identified settings, requiring separate estimation. For \operatorname{mode}=\operatorname{point}, we first estimate \operatorname{Cov}(D,Z\mid X) and \operatorname{Cov}(DY_{k},Z\mid X) in eq. (4.2), then compute their ratio. For example, \operatorname{Cov}(DY_{k},Z\mid x) can be expressed as \mathbb{E}[DZY_{k}\mid x]-\mathbb{E}[DY_{k}\mid x]\cdot\mathbb{E}[Z\mid x], with the conditional expectations estimated via regression on X=x. Alternatively, forest methods, such as Generalized Random Forests, can directly estimate the conditional covariance ratio in one step [Athey et al., 2019].
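A minimal plug-in sketch of this step (assuming sklearn-style regressors; function and variable names are illustrative, not from the paper) is:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def estimate_eta_point(X, D, Z, Y, K):
    """Plug-in estimate of eta_k(x) = Cov(D Y_k, Z | x) / Cov(D, Z | x), eq. (4.2).

    Each conditional expectation is fit by a separate regression on X; in Algorithm 1
    these fits would be done on held-out folds (cross-fitting).
    """
    def reg(target):
        return HistGradientBoostingRegressor().fit(X, target).predict(X)

    EZ, ED, EDZ = reg(Z), reg(D), reg(D * Z)
    cov_DZ = EDZ - ED * EZ                          # Cov(D, Z | X)
    eta = np.zeros((X.shape[0], K))
    for k in range(K):
        Yk = D * (Y == k + 1).astype(float)         # D * I{Y = k}; zero whenever the label is missing
        eta[:, k] = (reg(Yk * Z) - reg(Yk) * EZ) / cov_DZ   # Cov(D Y_k, Z | X) / Cov(D, Z | X)
    return np.clip(eta, 0.0, 1.0)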

For \operatorname{mode}=\operatorname{partial}, we estimate the bounds l_{k} and u_{k} for each k\in[K] as defined in Theorem 3, which involves estimating the conditional expectations \mathbb{E}[DY_{k}\mid x,z] and \mathbb{E}[D\mid x,z] via regression. These bounds are then used to construct w_{k}(x).
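A corresponding sketch for the partial-identification bounds (again with illustrative names, and with constant bounding functions a_k = a and b_k = b) is:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def estimate_partial_bounds(X, D, Z, Y, K, J, a=0.0, b=1.0):
    """Plug-in estimates of the IV partial bounds l_k(x) and u_k(x) in Theorem 3."""
    n = X.shape[0]
    l, u = np.zeros((n, K)), np.zeros((n, K))
    for k in range(K):
        Yk = D * (Y == k + 1).astype(float)
        lower_z, upper_z = [], []
        for z in range(1, J + 1):
            m = (Z == z)
            # E[D Y_k + a (1 - D) | X = x, Z = z] and E[D Y_k + b (1 - D) | X = x, Z = z]
            lower_z.append(HistGradientBoostingRegressor().fit(X[m], (Yk + a * (1 - D))[m]).predict(X))
            upper_z.append(HistGradientBoostingRegressor().fit(X[m], (Yk + b * (1 - D))[m]).predict(X))
        l[:, k] = np.max(lower_z, axis=0)    # l_k(x): maximum over decision-makers z
        u[:, k] = np.min(upper_z, axis=0)    # u_k(x): minimum over decision-makers z
    return np.clip(l, 0.0, 1.0), np.clip(u, 0.0, 1.0)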

Classifier Optimization

The empirical surrogate risk in Equation 5.3 defines a cost-sensitive classification problem. Once the weight functions \{w_{k}\}_{k=1}^{K} are estimated, we can solve for the score function \boldsymbol{h}, which can be implemented in \operatorname{PyTorch} by defining a cost-sensitive loss function with \operatorname{softmax} as the output layer, as sketched below. We can also incorporate regularization in Equation 5.3 to reduce over-fitting. In the binary setting, we can simply solve a weighted classification problem with \operatorname{sgn}[\hat{w}(X_{i})]’s as the labels and |\hat{w}(X_{i})|’s as the weights, which can be easily implemented by off-the-shelf ML packages that take additional weight inputs. For example, considering the surrogate risk in Equation 6.3 with \phi as the logistic loss, we can simply run a weighted logistic regression with the aforementioned labels and weights.
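Here is a minimal PyTorch sketch of the multiclass step (the network architecture and batch contents are hypothetical placeholders):

import torch
import torch.nn as nn

class CostSensitiveSoftmaxLoss(nn.Module):
    """Empirical surrogate risk of eq. (6.2): average of sum_k w_k(X) * softmax(h(X))_k."""
    def forward(self, scores, weights):
        # scores: (batch, K) raw scores h_k(X); weights: (batch, K) estimated costs w_k(X).
        return (weights * torch.softmax(scores, dim=1)).sum(dim=1).mean()

model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 3))   # score function h
criterion = CostSensitiveSoftmaxLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_batch = torch.randn(32, 5)     # a batch of features X
w_batch = torch.rand(32, 3)      # corresponding estimated weights w_k(X) from Algorithm 1
loss = criterion(model(x_batch), w_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()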

Cross-fitting

In Algorithm 1, we divide the dataset into L folds and apply cross-fitting to estimate the weight functions, as sketched below. This approach has been widely used in statistical inference and learning with ML nuisance function estimators [e.g., Chernozhukov et al., 2018, Athey and Wager, 2021]; it effectively alleviates the over-fitting bias by avoiding using the same data to both train the weight function estimators and evaluate the weight function values.
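A schematic of the cross-fitting loop (with a hypothetical estimate_weights routine standing in for either mode of Algorithm 1) looks as follows:

import numpy as np
from sklearn.model_selection import KFold

def cross_fitted_weights(X, D, Z, Y, K, estimate_weights, n_folds=5):
    """Cross-fitting as in Algorithm 1: fit weight estimators on L-1 folds, evaluate on the held-out fold.

    estimate_weights is assumed to return a function mapping features to an (n, K) weight matrix.
    """
    w_hat = np.zeros((X.shape[0], K))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, eval_idx in kf.split(X):
        fitted = estimate_weights(X[train_idx], D[train_idx], Z[train_idx], Y[train_idx], K)
        w_hat[eval_idx] = fitted(X[eval_idx])    # weights evaluated only on the held-out fold
    return w_hat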

7 Numeric Experiments

We conduct experiments on a synthetic dataset with a multiclass label (K=3) and a semi-synthetic dataset with a binary label (K=2).

7.1 Synthetic Dataset

In this section, we construct a synthetic dataset with multi-valued selective labels. The experiment results are illustrated in Figures 4 and 5, which show that our unified cost-sensitive learning (UCL) approach achieves superior performance under various strengths of selection bias.

Data Generating Process

Consider a dataset consisting of observable variables \mathbf{X}=(X_{1},\dots,X_{p}) and unobservable variables \mathbf{U}=(U_{1},U_{2},\dots,U_{q}), following a joint Gaussian distribution. We also introduce the variable Z representing the random assignment of decision-makers, which is uniformly drawn from the set \mathcal{Z}=\{1,2,\dots,J\}. The missingness decision D\in\{0,1\} is modeled as a Bernoulli variable with parameter p_{D}:=\mathbb{P}(D=1\mid X,U,Z), and the true label Y^{\star}\in\{1,\dots,K\} is modeled as a categorical variable with parameters p_{k}:=\mathbb{P}(Y^{\star}=k\mid X,U,Z) for k\in[K]. The complete data generating process is summarized in Equation 7.1.

Observed variables: \mathbf{X}=(X_{1},\dots,X_{p})\sim 2\cdot\mathcal{N}(0,I_{p}),\quad Z\sim\text{Uniform}(\{1,\dots,J\}) (7.1)
Unobserved variables: \mathbf{U}=(U_{1},\dots,U_{q})\sim 2\cdot\mathcal{N}(0,I_{q})
Coefficients: \mathbf{W}_{X\rightarrow Y}=\big[i+k\big]_{p\times K},\quad i\in\{1,\dots,p\},\ k\in\{1,\dots,K\}
\mathbf{W}_{U\rightarrow Y}=\big[i+k\big]_{q\times K},\quad i\in\{1,\dots,q\},\ k\in\{1,\dots,K\}
\mathbf{W}_{X\rightarrow D}=\big[2-i\big]_{p\times 1},\quad i\in\{1,\dots,p\}
\mathbf{W}_{U\rightarrow D}=\big[1+i\big]_{q\times 1},\quad i\in\{1,\dots,q\}
Decision (NUCEM): p_{D}=(1-\alpha_{D})\cdot\operatorname{expit}\big\{Z\cdot 2\mathbf{X}\mathbf{W}_{X\rightarrow D}\big\}+\alpha_{D}\cdot\operatorname{expit}\big\{3\mathbf{U}\mathbf{W}_{U\rightarrow D}\big\}
Decision (UC): p_{D}=\operatorname{expit}\big\{(1-\alpha_{D})\cdot Z\cdot 2\mathbf{X}\mathbf{W}_{X\rightarrow D}+\alpha_{D}\cdot 3\mathbf{U}\mathbf{W}_{U\rightarrow D}\big\}
Outcome: p_{k}=\operatorname{softmax}\big\{(1-\alpha_{Y})\cdot\mathbf{X}\mathbf{W}_{X\rightarrow Y}+\alpha_{Y}\cdot 4\mathbf{U}\mathbf{W}_{U\rightarrow Y}\big\},\quad k\in\{1,\dots,K\}
Y^{\star}\sim\text{Categorical}(p_{1},\dots,p_{K}).

Here we define the functions \operatorname{expit}:v\mapsto 1/(1+\exp(-v)) and \operatorname{softmax}:\mathbf{v}\mapsto\exp(\mathbf{v})/\sum_{k=1}^{K}\exp(v_{k}) for any vector \mathbf{v}\in\mathbb{R}^{K}. In both the UC and NUCEM models, the parameter \alpha_{D}\in(0,1) controls the impact of the unobservable variables \mathbf{U} on the probability of missingness p_{D}, while the parameter \alpha_{Y}\in(0,1) adjusts the magnitude of the effect of \mathbf{U} on the distribution of the outcome Y^{\star}. Overall, the pair (\alpha_{D},\alpha_{Y}) jointly determines the degree of selection bias in our data generating process.

In this experiment, we simulate a full dataset with sample size N=10000, feature dimensions p=q=5, number of instrument levels J=5, and number of label classes K=3. Each data point in the full dataset is a tuple \{(X_{i},U_{i},Z_{i},D_{i},Y_{i}^{\star})\}_{i=1}^{N}. The observed dataset consists of the tuples \{(X_{i},Z_{i},D_{i},Y_{i})\}_{i=1}^{N} with Y_{i}:=Y_{i}^{\star} if D_{i}=1 and Y_{i}:=\text{NaN} if D_{i}=0 for each i\in[N].
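For reference, a condensed sketch of the NUCEM variant of this data generating process (hypothetical helper code, not the authors' implementation) is:

import numpy as np
from scipy.special import expit, softmax

def generate_nucem(N=10000, p=5, q=5, J=5, K=3, alpha_D=0.5, alpha_Y=0.5, seed=0):
    """Condensed sketch of the NUCEM data generating process in eq. (7.1)."""
    rng = np.random.default_rng(seed)
    X = 2 * rng.normal(size=(N, p))
    U = 2 * rng.normal(size=(N, q))
    Z = rng.integers(1, J + 1, size=N)
    # Coefficient matrices [i + k] and vectors [2 - i], [1 + i] from eq. (7.1).
    W_xy = np.add.outer(np.arange(1, p + 1), np.arange(1, K + 1)).astype(float)
    W_uy = np.add.outer(np.arange(1, q + 1), np.arange(1, K + 1)).astype(float)
    W_xd = 2.0 - np.arange(1, p + 1)
    W_ud = 1.0 + np.arange(1, q + 1)
    # Decision probability (NUCEM form) and outcome probabilities.
    p_D = (1 - alpha_D) * expit(Z * 2 * (X @ W_xd)) + alpha_D * expit(3 * (U @ W_ud))
    D = rng.binomial(1, p_D)
    p_Y = softmax((1 - alpha_Y) * (X @ W_xy) + alpha_Y * 4 * (U @ W_uy), axis=1)
    Y_star = np.array([rng.choice(K, p=row) + 1 for row in p_Y])
    Y = np.where(D == 1, Y_star.astype(float), np.nan)
    return X, Z, D, Y, Y_star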

Model UC and NUCEM

The missingness model NUCEM captures one of the settings with No Unmeasured Common Effect Modifiers; it satisfies the sufficient condition in Proposition 2, so the conditional probabilities \{p_{k}\}_{k=1}^{K} can be exactly identified according to Theorem 1. Meanwhile, the model UC captures a general setting with Unmeasured Confounding, in which the point-identification condition does not hold.

Figure 4: The testing accuracy of different methods with \alpha_{Y}\in\{0.3,0.5,0.7\} and \alpha_{D}\in\{0.3,0.5,0.7\} for Model NUCEM in the synthetic dataset.

Baseline Methods and Our Proposed Methods

We divide our dataset into a training set and a testing set using a 7:3 split ratio, denoted by \mathcal{S}_{\text{train}} and \mathcal{S}_{\text{test}}. On the training set, five different meta-learning methods are trained on the selectively labeled data, using a simple neural network classifier with a softmax output layer. We then evaluate the performance of these methods on the testing set with fully observed labels. The five methods under consideration are as follows:

  • The first two methods act as baselines. The SelectedSample method runs a multiclass classification algorithm solely on the labeled portion of the training set \{i\in S_{\text{train}}:D_{i}=1\}. For the second method, SelectedSample(IPW), we first estimate inverse propensity weights (IPW) for the labeled subset and then run weighted multiclass classification on the labeled training set. These baseline methods establish a “lower bound” for any other meta-learning algorithm, as any enhanced method using the selectively labeled data should at least beat the performance of these two methods.

  • The third method, FullSample, runs a multiclass classification algorithm on the entire training set. This represents an ideal but impractical scenario since we cannot actually observe the true label Y_{i}^{\star} for the missing data (where D_{i}=0). This method serves as an “upper bound” for learning performance, indicating the highest possible effectiveness if full information were available.

  • Finally, our two proposed methods correspond to the point- and partial-identification settings, named PointLearning and PartialLearning respectively. We anticipate that the performance of these methods will fall between the established “lower” and “upper” bounds, representing a realistic estimate of effectiveness under selective label conditions. Similar to the SelectedSample(IPW) method, our methods involve estimating a series of weight functions w_{k},k=1,\dots,K to correct for selection bias (see the details in Section 6). In this experiment, we estimate these weight functions using Histogram Gradient Boosting with a 5-fold cross-fitted approach. We select all hyperparameters through 5-fold cross-validation. The cost-sensitive classification problem is solved by a simple neural network with a softmax output layer.

Figure 5: The testing accuracy of different methods with \alpha_{Y}\in\{0.3,0.5,0.7\} and \alpha_{D}\in\{0.3,0.5,0.7\} for Model UC in the synthetic dataset.

The Role of the Observed Variable Z

It is important to note that the random assignment of decision-makers, denoted Z, plays different roles in these methods. In our proposed methods, Z is treated as an instrumental variable that helps correct for selection bias, while in the baseline methods (SelectedSample, SelectedSample(IPW), and FullSample), Z is just one of the observed features.

Experiment Results

Figures 4 and 5 report the testing accuracy of each method on data generated by the NUCEM and UC models in Equation 7.1, respectively. All experiments are repeated 100 times with confounding strengths α_D, α_Y ∈ {0.3, 0.5, 0.7}. Each boxplot shows the distribution of testing accuracy for the different methods under a particular combination of confounding strengths α_D and α_Y.

Moving from the top to the bottom panels, α_Y increases from 0.3 to 0.7, indicating a stronger influence of the unobservable variable U on the true label Y^⋆. As Figures 4 and 5 show, the accuracy of all five methods decreases as α_Y increases, reflecting the growing difficulty of prediction as the influence of unobservable factors intensifies. Moving from the left to the right panels, α_D rises from 0.3 to 0.7, indicating a larger role of the unobservable U in determining the missingness decision D. A larger value of α_D within (0, 1) means that D is driven more by U, suggesting a potentially larger distributional gap between the selected and missing groups. This is visually corroborated in Figures 4 and 5, where the performance gap between SelectedSample and FullSample widens as α_D increases, illustrating the growing challenge of correcting the selection bias.

The results in Figures 4 and 5 demonstrate the effectiveness of our proposed methods in handling selection bias in the data generating process. For each combination of (α_Y, α_D), our Partial Learning method consistently outperforms the baseline methods, including Selected Sample and Selected Sample(IPW). Moreover, the performance of Partial Learning is close to that of Full Sample, which serves as the performance upper bound. The Point Learning method is also competitive, although it is slightly less robust than Partial Learning. The results in Figure 5 are consistent with those in Figure 4, except that under the UC model the point-identification requirement of Theorem 1 is not satisfied, so Point Learning is less effective than Partial Learning there.

Below we provide a detailed analysis of the performance of each method.

  • Selected Sample (IPW): As Figures 4 and 5 show, the performance of the Selected Sample (IPW) approach is close to that of naively running multiclass classification on the selected subsample (Selected Sample), and it is outperformed by our proposed algorithms. This is not surprising, since the propensity score P(D = 1 | X, Z) cannot correct the selection bias caused by the unobserved variables U.

  • PointLearning: Our proposed Point Learning method shows promising results in the bottom-right panel of Figure 4 with (α_Y, α_D) = (0.7, 0.7), in which case the unobserved confounding strength is large. Although the NUCEM model in Equation 7.1 satisfies Proposition 2, which enables point identification of the conditional probabilities and therefore of the classification risk R(h, w), Point Learning does not outperform the baseline methods when the unmeasured confounding strength is smaller. This limitation is attributed to the difficulty of accurately estimating the weight functions {w_k}_{k=1}^K defined in Equation 4.3.

    The crux of the challenge lies in estimating the ratio of two conditional covariances, Cov(D Y_k, Z | X) and Cov(D, Z | X), for each k ∈ [K]. Estimating conditional covariances is inherently harder than estimating conditional expectations, and taking the ratio of two noisy estimates compounds the problem: if the denominator is not estimated precisely, the ratio can become unstable and severely over- or under-shoot (a toy numerical illustration of this instability follows this list). Such inaccuracies can substantially degrade the downstream cost-sensitive classification, undermining the robustness of Point Learning under these conditions.

  • Partial Learning: The cornerstone of our learning framework is Partial Learning. As evidenced by Figures 4 and 5, Partial Learning substantially improves prediction accuracy while maintaining low variance, showcasing its potential for real-world applications. Remarkably, this improvement is consistent across almost all combinations of confounding strengths (α_Y, α_D) and across both decision models, NUCEM and UC.

    Reflecting on the conditions of Theorem 3, Partial Learning only requires the conditional mean functions E[Y_k^⋆ | U, X = x], k = 1, …, K, to be almost surely bounded, aside from the basic instrumental variable (IV) conditions. This requirement is generally more plausible in real-world scenarios than the point-identification assumptions of Theorem 1. The success of Partial Learning can be attributed to two key strategies. First, we adopt a robust optimization approach that develops a prediction rule effective under the least favorable configuration of η^⋆(X), as formalized in our minimax risk formulation (see Equation 5.2). Second, the weighting functions {w_k}_{k=1}^K are built from sums of conditional expectations, which are inherently more stable than the ratio of conditional covariances required for point identification.
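The following toy simulation (our own illustration, not an experiment from the paper) shows how noise in the estimated denominator Cov(D, Z | X) is amplified when the true covariance is small, which is precisely the instability that degrades Point Learning; all numbers are arbitrary placeholders.

```python
# Toy illustration (not from the paper): noise in the estimated denominator
# Cov(D, Z | X) is amplified when the true covariance is small, destabilizing
# the ratio-based weights used by Point Learning.
import numpy as np

rng = np.random.default_rng(0)
true_num, noise_sd = 0.02, 0.01            # hypothetical Cov(DY_k, Z | X) and estimation noise
for true_den in [0.20, 0.05, 0.02]:        # hypothetical values of Cov(D, Z | X)
    est = (true_num + rng.normal(0, noise_sd, 10000)) / \
          (true_den + rng.normal(0, noise_sd, 10000))
    rel_err = np.abs(est - true_num / true_den) / (true_num / true_den)
    print(f"denominator {true_den:.2f}: median relative error {np.median(rel_err):.2f}")
```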

Overall, the compelling performance of Partial Learning in Figures 4 and 5 convinces us of its practical viability. It stands as a reliable method that consistently outperforms the baseline approaches, even when it is uncertain whether the data generating process suffers from selection bias (missing not at random).

7.2 Semi-synthetic Dataset: FICO

Figure 6: Testing accuracy of the methods with α ∈ {0.5, 0.7, 0.9} under model NUCEM on the FICO dataset.
Figure 7: Testing accuracy of the methods with α ∈ {0.5, 0.7, 0.9} under model UC on the FICO dataset.

In this section, we evaluate the performance of our proposed algorithms in a semi-synthetic experiment based on the home loans dataset from FICO [2018]. This dataset consists of 10459 observations of approved home loan applications. It records whether the applicant repays the loan within 90 days overdue, which we view as the true label Y^⋆ ∈ {0, 1}, along with various transaction information from the bank account. The dataset also includes a variable called ExternalRisk, a risk score assigned to each application by a proprietary algorithm. We treat ExternalRisk and all transaction features as the observed features X.

Semi-synthetic Dataset with Selective Labels

In this dataset the label of interest is fully observed, so we synthetically create selective labels on top of it. Specifically, we simulate 10 decision-makers (e.g., bank officers who handle the loan applications) and randomly assign one to each case. We simulate the decision D from a Bernoulli distribution with success rate p_D that depends on an “unobservable” variable U, the decision-maker identity Z, and the ExternalRisk variable (which serves as algorithmic assistance to human decision-making). We then blind the true label Y^⋆ for observations with D = 0. Specifically, we construct U as the residual from a random forest regression of Y^⋆ on X over the whole dataset, which is by construction dependent on Y^⋆. We then specify p_D := P(D = 1 | X, U) according to

Decision (NUCEM): p_D = α · expit{U} + (1 − α) · expit{(1 + Z) · ExternalRisk}, (7.2)
(UC): p_D = expit{α · U + (1 − α) · (1 + Z) · ExternalRisk}.

Here the expit function is given by expit(t) = 1 / (1 + exp(−t)). The parameter α ∈ (0, 1) controls the impact of U on the labeling process and thus the degree of selection bias. It is easy to verify that the sufficient condition in Proposition 2 holds, so point identification of the classification risk R(h, w) is attainable under the NUCEM model in Equation 7.2. A minimal sketch of this construction is given below.
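The following sketch is our own reconstruction of the selective-label construction in Equation 7.2, with hypothetical variable and column names (e.g., external_risk): U is the random-forest residual of Y^⋆ on X, Z is a quasi-randomly assigned decision-maker index, and D is drawn from a Bernoulli distribution with success rate p_D.

```python
# A minimal sketch (our own reconstruction, hypothetical names) of the semi-synthetic
# selective-label construction in Equation 7.2: X is the feature matrix, external_risk
# the ExternalRisk column, and y_star the fully observed FICO label in {0, 1}.
import numpy as np
from scipy.special import expit
from sklearn.ensemble import RandomForestRegressor

def make_selective_labels(X, external_risk, y_star, alpha, model="NUCEM", seed=0):
    rng = np.random.default_rng(seed)
    # U: residual of a random forest regression of Y* on X, hence dependent on Y*.
    U = y_star - RandomForestRegressor(random_state=seed).fit(X, y_star).predict(X)
    Z = rng.integers(0, 10, size=len(y_star))           # quasi-random decision-maker id (0-9 here)
    if model == "NUCEM":                                 # U enters additively (Eq. 7.2, first line)
        p_D = alpha * expit(U) + (1 - alpha) * expit((1 + Z) * external_risk)
    else:                                                # "UC": U enters inside the expit
        p_D = expit(alpha * U + (1 - alpha) * (1 + Z) * external_risk)
    D = rng.binomial(1, p_D)
    y_obs = np.where(D == 1, y_star, np.nan)             # labels are blinded when D = 0
    return Z, D, y_obs, U
```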

Baseline Methods and Our Proposed Methods

We randomly split the data into training and testing sets at a 7:3 ratio. Similar to Section 7.1, on the training set we apply the five types of methods: SelectedSample, SelectedSample(IPW), FullSample, PointLearning, and PartialLearning, where the first three serve as benchmarks for the last two. For each type of method, we try multiple classification algorithms, including AdaBoost, Gradient Boosting, Logistic Regression, Random Forest, and SVM. Our proposed methods also need to estimate some unknown weight functions, which we implement by 5-fold cross-fitted Gradient Boosting. Again, all hyperparameters are chosen via 5-fold cross-validation, and we evaluate the classification accuracy of the resulting classifiers on the testing data.

Similarly, we remark that the decision-maker assignment Z plays different roles across methods. Our proposed methods (PointLearning and PartialLearning) treat Z as an instrumental variable to correct for selection bias. In contrast, the baseline methods (SelectedSample, SelectedSample(IPW), and FullSample) do not necessarily need Z. However, for a fair comparison, we still include Z as a classification feature in the baseline methods, so they also use the information in Z.

Results and Discussions

Figures 6 and 7 present the testing accuracy of each method over 50 experiment replications for α ∈ {0.5, 0.7, 0.9} under the NUCEM and UC models defined in Equation 7.2. First, we observe that the performance of SelectedSample(IPW) is comparable to the baseline SelectedSample, which applies binary classification algorithms directly to the selectively labeled data. Notably, as the strength of unmeasured confounding (α) increases, the gains from our proposed methods, especially PartialLearning, over the baselines (SelectedSample and SelectedSample(IPW)) become more pronounced. Interestingly, PartialLearning outperforms PointLearning even under the NUCEM model, where the point-identification condition is satisfied. As discussed in Section 7.1, this may be because PointLearning relies on estimating a conditional covariance ratio, which is difficult to estimate accurately in practice, whereas PartialLearning requires only conditional expectations and is therefore more stable. This highlights the advantage of PartialLearning: it is robust to violations of the point-identification assumption and achieves more stable performance even when point identification holds.

To summarize, among all the methods we evaluated, PartialLearning emerges as a reliable approach that consistently outperforms the baseline methods, even in scenarios where the unobserved confounding is strong.

8 Conclusion

We address multiclass classification with selectively labeled data, where the label distribution deviates from the full population due to historical decision-making. Leveraging variations in decision rules across multiple decision-makers, we analyze this problem using an instrumental variable (IV) framework, providing necessary and sufficient conditions for exact classification risk identification. When exact identification is unattainable, we derive sharp partial risk bounds. To address label selection bias, we propose a unified cost-sensitive learning (UCL) approach, supported by theoretical guarantees, and demonstrate its effectiveness through comprehensive numerical experiments.

References

  • Angrist and Imbens [1995] J. Angrist and G. Imbens. Identification and estimation of local average treatment effects, 1995.
  • Angrist and Pischke [2009] J. D. Angrist and J.-S. Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton university press, 2009.
  • Angrist et al. [1996] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455, 1996.
  • Arnold et al. [2022] D. Arnold, W. Dobbie, and P. Hull. Measuring racial discrimination in bail decisions. American Economic Review, 112(9):2992–3038, 2022.
  • Athey and Wager [2021] S. Athey and S. Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.
  • Athey et al. [2019] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
  • Balke and Pearl [1994] A. Balke and J. Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46–54. Elsevier, 1994.
  • Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • Ben-Michael et al. [2021] E. Ben-Michael, D. J. Greiner, K. Imai, and Z. Jiang. Safe policy learning through extrapolation: Application to pre-trial risk assessment. arXiv preprint arXiv:2109.11679, 2021.
  • Bertsimas and Fazel-Zarandi [2022] D. Bertsimas and M. M. Fazel-Zarandi. Prescriptive machine learning for public policy: The case of immigration enforcement. Under review, 2022.
  • Chernozhukov et al. [2018] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
  • Coston et al. [2020a] A. Coston, E. Kennedy, and A. Chouldechova. Counterfactual predictions under runtime confounding. Advances in neural information processing systems, 33:4150–4162, 2020a.
  • Coston et al. [2020b] A. Coston, A. Mishler, E. H. Kennedy, and A. Chouldechova. Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 582–593, 2020b.
  • Cui and Tchetgen Tchetgen [2021] Y. Cui and E. Tchetgen Tchetgen. A semiparametric instrumental variable approach to optimal treatment regimes under endogeneity. Journal of the American Statistical Association, 116(533):162–173, 2021.
  • De-Arteaga et al. [2018] M. De-Arteaga, A. Dubrawski, and A. Chouldechova. Learning under selective labels in the presence of expert consistency. arXiv preprint arXiv:1807.00905, 2018.
  • De-Arteaga et al. [2021] M. De-Arteaga, V. Jeanselme, A. Dubrawski, and A. Chouldechova. Leveraging expert consistency to improve algorithmic decision support. arXiv preprint arXiv:2101.09648, 2021.
  • FICO [2018] FICO. Explainable machine learning challenge, 2018. URL https://community.fico.com/s/explainable-machine-learning-challenge?tabset-158d9=3.
  • Imbens and Rubin [2015] G. W. Imbens and D. B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
  • Kallus and Zhou [2021] N. Kallus and A. Zhou. Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5):2870–2890, 2021.
  • Kédagni and Mourifie [2017] D. Kédagni and I. Mourifie. Generalized instrumental inequalities: Testing the iv independence assumption. Available at SSRN 2692274, 2017.
  • Kilbertus et al. [2020] N. Kilbertus, M. G. Rodriguez, B. Schölkopf, K. Muandet, and I. Valera. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pages 277–287. PMLR, 2020.
  • Kleinberg et al. [2018] J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2018.
  • Kling [2006] J. R. Kling. Incarceration length, employment, and earnings. American Economic Review, 96(3):863–876, 2006.
  • Lakkaraju et al. [2017] H. Lakkaraju, J. Kleinberg, J. Leskovec, J. Ludwig, and S. Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275–284, 2017.
  • Little and Rubin [2019] R. J. Little and D. B. Rubin. Statistical analysis with missing data, volume 793. John Wiley & Sons, 2019.
  • Manski and Pepper [1998] C. F. Manski and J. V. Pepper. Monotone instrumental variables with an application to the returns to schooling, 1998.
  • Mao et al. [2023] A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International conference on Machine learning, pages 23803–23828. PMLR, 2023.
  • Massart and Nédélec [2006] P. Massart and É. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 2006.
  • Mueller-Smith [2015] M. Mueller-Smith. The criminal and labor market impacts of incarceration. Unpublished working paper, 18, 2015.
  • Pu and Zhang [2021] H. Pu and B. Zhang. Estimating optimal treatment rules with an instrumental variable: A partial identification learning approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):318–345, 2021.
  • Qian and Murphy [2011] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.
  • Rambachan et al. [2022] A. Rambachan, A. Coston, and E. Kennedy. Counterfactual risk assessments under unmeasured confounding. arXiv preprint arXiv:2212.09844, 2022.
  • Rambachan et al. [2021] A. Rambachan et al. Identifying prediction mistakes in observational data. Harvard University, 2021.
  • Rosenbaum and Rubin [1983] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Rubin [2005] D. B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Swanson et al. [2018] S. A. Swanson, M. A. Hernán, M. Miller, J. M. Robins, and T. S. Richardson. Partial identification of the average treatment effect using instrumental variables: review of methods for binary instruments, treatments, and outcomes. Journal of the American Statistical Association, 113(522):933–947, 2018.
  • Tewari and Bartlett [2007] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007.
  • Wainwright [2019] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • Wang and Tchetgen [2018] L. Wang and E. T. Tchetgen. Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society. Series B, Statistical methodology, 80(3):531, 2018.
  • Wei [2021] D. Wei. Decision-making under selective labels: Optimal finite-domain policies and beyond. In International Conference on Machine Learning, pages 11035–11046. PMLR, 2021.
  • Yang et al. [2022] Y. Yang, Y. Liu, and P. Naghizadeh. Adaptive data debiasing through bounded exploration. Advances in Neural Information Processing Systems, 35:1516–1528, 2022.
  • Zhang et al. [2012] B. Zhang, A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012.
  • Zhang et al. [2020] C. Zhang, J. Chen, H. Fu, X. He, Y.-Q. Zhao, and Y. Liu. Multicategory outcome weighted margin-based learning for estimating individualized treatment rules. Statistica sinica, 30:1857, 2020.
  • Zhao et al. [2012] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

Appendix A Supplements for Point Identification

A.1 Proofs in Section 4

Proof of Lemma 1.

By the definition of excess risk in Equation 4.1, we have

(t,s)\displaystyle\mathcal{R}(t,s^{\star}) =(Yit(Xi))(Yis(Xi))\displaystyle=\mathbb{P}(Y_{i}^{\star}\neq t(X_{i}))-\mathbb{P}(Y_{i}^{\star}\neq s^{\star}(X_{i}))
=𝔼[(Yit(Xi)Xi)(Yis(Xi)Xi)]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}(Y_{i}^{\star}\neq t(X_{i})\mid X_{i})-\mathbb{P}(Y_{i}^{\star}\neq s^{\star}(X_{i})\mid X_{i})\Big{]}
=𝔼[k=1Kηk(Xi)((kt(Xi)Xi)(ks(Xi)Xi))]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{P}(k\neq t(X_{i})\mid X_{i})-\mathbb{P}(k\neq s^{\star}(X_{i})\mid X_{i})\Big{)}\right]
=𝔼[k=1Kηk(Xi)((s(Xi)=kXi)(t(Xi)=kXi))]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{P}(s^{\star}(X_{i})=k\mid X_{i})-\mathbb{P}(t(X_{i})=k\mid X_{i})\Big{)}\right]
=𝔼[k=1Kηk(Xi)𝔼[𝕀{s(Xi)=k}𝕀{t(Xi)=k}Xi]]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\mathbb{E}\Big{[}\mathbb{I}\{s^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\mid X_{i}\Big{]}\right]
=𝔼[k=1Kηk(Xi)(𝕀{s(Xi)=k}𝕀{t(Xi)=k})]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{I}\{s^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\Big{)}\right]
=𝔼[k=1Kηk(Xi)(𝕀{argmaxp[K]ηp(Xi)=k}𝕀{t(Xi)=k})].\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\eta_{p}^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\Big{)}\right].

The second equality applies the law of iterated expectations, and the third the law of total probability. The fourth equality uses the sum-to-one constraint of probabilities, the fifth the definition of the indicator function, and the sixth the law of iterated expectations again. Finally, we apply the definition of the Bayes optimal classifier s^⋆(X_i) = argmax_{k ∈ [K]} η_k^⋆(X_i).

Proof of Theorem 1.

In this part, we prove that, under Assumptions 1 and 2, Assumption 3 is the necessary and sufficient condition for exact identification of P(Y^⋆ = k | X). To simplify notation, we introduce two new variables Y_k^⋆ := 𝕀{Y^⋆ = k} and Y_k := 𝕀{Y = k}.

  • Sufficiency: For any z ∈ 𝒵, the conditional probability P(Y^⋆ = k | X, Z = z) = E[Y_k^⋆ | X, Z = z] can be expressed as

    𝔼[YkX,Z=z]=𝔼[𝔼[YkX,U,Z=z]X,Z=z]\displaystyle\mathbb{E}[Y_{k}^{\star}\mid X,Z=z]=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U,Z=z]\mid X,Z=z\big{]} (A.1)
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[(1D)YkX,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}^{\star}\mid X,U,Z=z]\mid X_{,}Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[(1-D)Y_{k}^{\star}\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[(1D)YkX,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}\mid X,U,Z=z]\mid X_{,}Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[(1-D)Y_{k}^{\star}\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[YkX,U,Z=z]𝔼[(1D)X,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}\mid X,U,Z=z]\mid X,Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U,Z=z]\cdot\mathbb{E}[(1-D)\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[YkX,U]𝔼[(1D)X,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}\mid X,U,Z=z]\mid X,Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[(1-D)\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[DYkX,Z=z]+𝔼[YkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]].\displaystyle=\mathbb{E}[DY_{k}\mid X,Z=z]+\mathbb{E}[Y_{k}^{\star}\mid X,Z=z]-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\big{]}.

    In the first and second equalities, we use the law of iterated expectations and the linearity of expectation. The third equality follows from the consistency between the observed and true labels in Equation 3.1, that is, Y_k = Y_k^⋆ when D = 1. The fourth equality follows from the unconfoundedness of Y_k^⋆ and D given X and U, which is guaranteed by the general unconfoundedness (Y^⋆ ⊥ D | X, U) in Assumption 1. The fifth equality uses the IV independence (Z ⊥ Y_k^⋆ | X) from Assumption 2. The last equality follows again from the law of iterated expectations and the linearity of expectation.

    By subtracting E[Y_k^⋆ | X] from both sides of Equation A.1, we have, for any z ∈ 𝒵,

    0\displaystyle 0 =𝔼[DYkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}DY_{k}\mid X,Z=z\big{]}-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[DYkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]X],\displaystyle=\mathbb{E}\big{[}DY_{k}\mid X,Z=z\big{]}-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\mid X\big{]},

    where the last equality follows from the IV independence (Z ⊥ U | X) in Assumption 2. Now, multiplying both sides of the equation above by the weight z · P(Z = z | X), we have

    0\displaystyle 0 =𝔼[DYkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]X]\displaystyle=\mathbb{E}[DY_{k}\mid X,Z=z]-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\mid X\big{]}
    =𝔼[zDYkX,Z=z](Z=zX)𝔼[𝔼[YkX,U]𝔼[zDX,U,Z=z](Z=zX)X]\displaystyle=\mathbb{E}[zDY_{k}\mid X,Z=z]\cdot\mathbb{P}(Z=z\mid X)-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[zD\mid X,U,Z=z]\cdot\mathbb{P}(Z=z\mid X)\mid X\big{]}
    =𝔼[zDYkX,Z=z](Z=zX)𝔼[𝔼[YkX,U]𝔼[zDX,U,Z=z](Z=zX,U)X].\displaystyle=\mathbb{E}[zDY_{k}\mid X,Z=z]\cdot\mathbb{P}(Z=z\mid X)-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[zD\mid X,U,Z=z]\cdot\mathbb{P}(Z=z\mid X,U)\mid X\big{]}.

    Here we apply the IV independence (Z ⊥ U | X) again in the last equality above. Summing (or integrating) over z ∈ 𝒵 yields

    0=𝔼[ZDYkX]𝔼[𝔼[YkX,U]𝔼[ZDX,U]X].0=\mathbb{E}[ZDY_{k}\mid X]-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[ZD\mid X,U]\mid X\big{]}.

    Moreover, observe that

    {𝔼[ZDYkX]=Cov(DYk,ZX)+𝔼[DYkX]𝔼[ZX]𝔼[ZDX,U]=Cov(D,ZX,U)+𝔼[DX,U]𝔼[ZX,U],\left\{\begin{aligned} &\mathbb{E}[ZDY_{k}\mid X]=\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)+\mathbb{E}[DY_{k}\mid X]\cdot\mathbb{E}[Z\mid X]\\ &\mathbb{E}[ZD\mid X,U]=\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)+\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U],\end{aligned}\right.

    we have

    0\displaystyle 0 =Cov(DYk,ZX)I+𝔼[DYkX]𝔼[ZX]II𝔼[𝔼[YkX,U]Cov(D,ZX,U)X]III\displaystyle=\underbrace{\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)}_{\text{I}}+\underbrace{\mathbb{E}[DY_{k}^{\star}\mid X]\cdot\mathbb{E}[Z\mid X]}_{\text{II}}-\underbrace{\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}}_{\text{III}} (A.2)
    𝔼[𝔼[YkX,U]𝔼[DX,U]𝔼[ZX,U]X]IV,\displaystyle\quad-\underbrace{\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U]\mid X\big{]}}_{\text{IV}},

    Here again we use the consistency E[D Y_k | X] = E[D Y_k^⋆ | X]. By the unconfoundedness (Y^⋆ ⊥ D | X, U) from Assumption 1 and the IV independence (Z ⊥ U | X) from Assumption 2, we have E[Z | X, U] = E[Z | X] and

    𝔼[YkX,U]𝔼[DX,U]𝔼[ZX,U]=𝔼[DYkX,U]𝔼[ZX].\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U]=\mathbb{E}[DY_{k}^{\star}\mid X,U]\cdot\mathbb{E}[Z\mid X].

    Therefore terms II and IV in Equation A.2 cancel out, which leads to

    0=Cov(DYk,ZX)𝔼[𝔼[YkX,U]Cov(D,ZX,U)X].0=\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}. (A.3)

    Finally, when Assumption 3 holds, that is,

    Cov(Cov(D,ZX,U),𝔼[YkX,U]X)=0,\operatorname*{\mathrm{Cov}}\big{(}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U),\ \mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{)}=0,

    we have

    𝔼[𝔼[YkX,U]Cov(D,ZX,U)X]\displaystyle\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]} =𝔼[𝔼[YkX,U]X]𝔼[Cov(D,ZX,U)X]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{]}\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}
    =𝔼[YkX]𝔼[Cov(D,ZX,U)X].\displaystyle=\mathbb{E}[Y_{k}^{\star}\mid X]\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}.

    By the law of total covariance (the conditional covariance identity) and the IV independence (Z ⊥ U | X), we have

    Cov(D, Z | X) = E[Cov(D, Z | X, U) | X] + Cov(E[D | X, U], E[Z | X, U] | X)    (A.4)
    = E[Cov(D, Z | X, U) | X] + Cov(E[D | X, U], E[Z | X] | X)
    = E[Cov(D, Z | X, U) | X],
    where the second equality uses E[Z | X, U] = E[Z | X], and the last holds because E[Z | X] is a function of X and hence has zero conditional covariance with E[D | X, U] given X.

    Combining (A.4) with (A.3) leads us to the desired result, that is,

    ηk(X)=𝔼[YkX]=Cov(DYk,ZX)Cov(D,ZX).\eta_{k}^{\star}(X)=\mathbb{E}[Y_{k}^{\star}\mid X]=\frac{\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}.
  • Necessity: Given the identification result E[Y_k^⋆ | X] = Cov(D Y_k, Z | X) / Cov(D, Z | X) and Equation A.4, we have

    Cov(DYk,ZX)\displaystyle\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X) =𝔼[YkX]Cov(D,ZX)\displaystyle=\mathbb{E}[Y_{k}^{\star}\mid X]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X)
    =𝔼[𝔼[YkX,U]X]𝔼[Cov(D,ZX,U)X].\displaystyle=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{]}\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}.

    On the other hand, identity (A.3) holds whenever Assumptions 1 and 2 hold, that is,

    Cov(DYk,ZX)=𝔼[𝔼[YkX,U]Cov(D,ZX,U)X].\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}.

    Therefore, combining these results, we have

    𝔼[𝔼[YkX,U]Cov(D,ZX,U)X]𝔼[𝔼[YkX,U]X]𝔼[Cov(D,ZX,U)X]=0,\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{]}\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}=0,

    which recovers the condition in Assumption 3.

Proof of Theorem 2.

By the result in Lemma 1, we have

(t,s)\displaystyle\mathcal{R}(t,s^{\star}) =𝔼[k=1Kηk(Xi)(𝕀{argmaxp[K]ηp(Xi)=k}𝕀{t(Xi)=k})]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\eta_{p}^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\Big{)}\right]
=𝔼[k=1K[maxp[K]ηp(Xi)ηk(Xi)]𝕀{t(Xi)=k}]\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K}\big{[}\max_{p\in[K]}\eta_{p}^{\star}(X_{i})-\eta_{k}^{\star}(X_{i})\big{]}\cdot\mathbb{I}\{t(X_{i})=k\}\Big{]}
=𝔼[k=1Kwk(Xi)𝕀{t(Xi)=k}].\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}w_{k}(X_{i})\cdot\mathbb{I}\{t(X_{i})=k\}\right].

For each k ∈ [K], we define w_k(X) := max_{p ∈ [K]} {η_p^⋆(X) − η_k^⋆(X)}. According to Theorem 1, when the NUCEM condition in Assumption 3 is satisfied for each k ∈ [K], the conditional probability η_k^⋆ is uniquely identified by Equation 4.2. Therefore, the weight functions can be further identified as

wk:=maxp[K]{Cov(DYp,ZX)Cov(D,ZX)Cov(DYk,ZX)Cov(D,ZX)},k[K].w_{k}:=\max_{p\in[K]}\left\{\frac{\operatorname*{\mathrm{Cov}}(DY_{p},Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}-\frac{\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}\right\},~{}~{}\forall~{}k\in[K].

Now, consider binary classification with true label Y^⋆ ∈ {−1, 1}. Let η^⋆(x) := P(Y^⋆ = 1 | X = x) denote the conditional probability of Y^⋆ = 1 given X = x. The Bayes optimal classifier is s^⋆(x) = sgn[η^⋆(x) − 1/2]. Consequently, the excess risk can be written as

(t,s)\displaystyle\mathcal{R}(t,s^{\star}) =(Yt(X))(Ys(X))\displaystyle=\mathbb{P}\big{(}Y^{\star}\neq t(X)\big{)}-\mathbb{P}\big{(}Y^{\star}\neq s^{\star}(X)\big{)}
=𝔼[𝕀{Yt(X)}𝕀{Ys(X)}]\displaystyle=\mathbb{E}\Big{[}\mathbb{I}\big{\{}Y^{\star}\neq t(X)\big{\}}-\mathbb{I}\big{\{}Y^{\star}\neq s^{\star}(X)\big{\}}\Big{]}
=𝔼[η(X)(𝕀{1t(X)}𝕀{1s(X)})+(1η(X))(𝕀{1t(X)}𝕀{1s(X)})]\displaystyle=\mathbb{E}\Big{[}\eta^{\star}(X)\cdot\Big{(}\mathbb{I}\{1\neq t(X)\}-\mathbb{I}\{1\neq s^{\star}(X)\}\Big{)}+(1-\eta^{\star}(X))\cdot\Big{(}\mathbb{I}\{-1\neq t(X)\}-\mathbb{I}\{-1\neq s^{\star}(X)\}\Big{)}\Big{]}
=𝔼[(2η(X)1)(s(X)t(X)2)]\displaystyle=\mathbb{E}\Big{[}\big{(}2\eta^{\star}(X)-1\big{)}\cdot\Big{(}\frac{s^{\star}(X)-t(X)}{2}\Big{)}\Big{]}
=𝔼[|η(X)1/2||s(X)t(X)|]\displaystyle=\mathbb{E}\Big{[}\big{|}\eta^{\star}(X)-1/2\big{|}\cdot\big{|}s^{\star}(X)-t(X)\big{|}\Big{]}
=𝔼[|η(X)1/2|𝕀{t(X)sgn[η(X)1/2]}].\displaystyle=\mathbb{E}\Big{[}\big{|}\eta^{\star}(X)-1/2\big{|}\cdot\mathbb{I}\big{\{}t(X)\neq\operatorname*{\mathrm{sgn}}[\eta^{\star}(X)-1/2]\big{\}}\Big{]}.

By defining the weight function w^exact(X) := η^⋆(X) − 1/2, we have

(t,s)=𝔼[|wexact(X)|𝕀{t(X)sgn[wexact(X)]}]:=(t,wexact).\mathcal{R}(t,s^{\star})=\mathbb{E}\Big{[}\big{|}w^{\rm exact}(X)\big{|}\cdot\mathbb{I}\big{\{}t(X)\neq\operatorname*{\mathrm{sgn}}[w^{\rm exact}(X)]\big{\}}\Big{]}:=\mathcal{R}(t,w^{\rm exact}).

Applying the identification result from Theorem 1, we obtain

wexact(X)=Cov(DY,ZX)Cov(D,ZX)1/2.w^{\rm exact}(X)=\frac{\operatorname*{\mathrm{Cov}}(DY,Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}-1/2.

A.2 Sufficient Condition for Point Identification

Proposition 2 (A Sufficient Condition for Point Identification).

Suppose Assumptions 1 and 2 hold. For any j, l ∈ [m], define Δ_{jl}(X, U) := E[D | X, U, Z = j] − E[D | X, U, Z = l]. Assumption 3 holds if, for any j, l ∈ [m],

Cov(Δjl(X,U),(Y=kX,U)X)=0,a.s.\operatorname{Cov}(\Delta_{jl}(X,U),\mathbb{P}(Y^{\star}=k\mid X,U)\mid X)=0,~{}a.s. (A.5)

Proposition 2 gives a sufficient condition for Assumption 3. This condition permits the unobserved features U to influence both the decision D and the true outcome Y^⋆, but constrains this influence to a specific form. In particular, the condition is met when E[D | U, X, Z = j] = g_j(X) + q(U) almost surely for some functions g_j and q, i.e., when the impact of U on the decision is additive and identical across decision-makers.
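As a quick check of this claim (a direct verification rather than an additional result): if E[D | U, X, Z = j] = g_j(X) + q(U) almost surely for every j, then

Δ_{jl}(X, U) = (g_j(X) + q(U)) − (g_l(X) + q(U)) = g_j(X) − g_l(X),

which is a function of X alone; hence Cov(Δ_{jl}(X, U), P(Y^⋆ = k | X, U) | X) = 0 almost surely, and condition (A.5) holds.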

Proof of Proposition 2.

Notice that the conditional covariance Cov(D, Z | X, U) can be decomposed as follows:

Cov(D,ZX,U)=𝔼[DZX,U]𝔼[DX,U]𝔼[ZX,U]\displaystyle\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)=\mathbb{E}[DZ\mid X,U]-\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U]
=\displaystyle= j=1m𝔼[DZX,U,Z=j](Z=jX,U)𝔼[DX,U]j=1mj(Z=jX,U)\displaystyle\sum_{j=1}^{m}\mathbb{E}[DZ\mid X,U,Z=j]\cdot\mathbb{P}(Z=j\mid X,U)-\mathbb{E}[D\mid X,U]\cdot\sum_{j=1}^{m}j\cdot\mathbb{P}(Z=j\mid X,U)
=\displaystyle= j=1mj𝔼[DX,U,Z=j](Z=jX,U)l=1m𝔼[DX,U,Z=l](Z=lX,U)j=1mj(Z=jX,U)\displaystyle\sum_{j=1}^{m}j\cdot\mathbb{E}[D\mid X,U,Z=j]\cdot\mathbb{P}(Z=j\mid X,U)-\sum_{l=1}^{m}\mathbb{E}[D\mid X,U,Z=l]\cdot\mathbb{P}(Z=l\mid X,U)\cdot\sum_{j=1}^{m}j\cdot\mathbb{P}(Z=j\mid X,U)
=\sum_{j=1}^{m}\sum_{l=1}^{m}j\cdot\mathbb{P}(Z=j\mid X,U)\cdot\mathbb{P}(Z=l\mid X,U)\cdot\big[\mathbb{E}[D\mid X,U,Z=j]-\mathbb{E}[D\mid X,U,Z=l]\big].

Define Δ_{jl}(X, U) := E[D | X, U, Z = j] − E[D | X, U, Z = l]. If for any j, l ∈ [m] we have

Cov(Δjl(X,U),(Y=kX,U)X)=0almost surely,\operatorname*{\mathrm{Cov}}(\Delta_{jl}(X,U),~{}\mathbb{P}(Y^{\star}=k\mid X,U)\mid X)=0~{}~{}\text{almost surely},

then Assumption 3 is guaranteed. Indeed, by the IV independence (Z ⊥ U | X) in Assumption 2, P(Z = j | X, U) = P(Z = j | X) for every j, so the decomposition above expresses Cov(D, Z | X, U) as a linear combination of the Δ_{jl}(X, U) with coefficients that depend on X only; its conditional covariance with P(Y^⋆ = k | X, U) given X is therefore the same linear combination of the covariances Cov(Δ_{jl}(X, U), P(Y^⋆ = k | X, U) | X), each of which is zero.

Appendix B Supplements for Partial Identification

B.1 Tightness of IV Partial Bounds in Theorem 3

Theorem 6 (Tightness of IV Partial Bound).

Assume Assumption 1 holds, and fix the conditional joint distribution of the observable variables (Y_k, D, Z) | X = x. Consider a new set of random variables (Ỹ_k^⋆, D̃, Z̃, X̃, Ũ), where Ỹ_k^⋆ and D̃ are binary and Z̃ = Z. Define the observed label Ỹ_k as D̃ · Ỹ_k^⋆ + (1 − D̃) · NaN. If the expectation E[Ỹ_k^⋆ | X̃ = x] falls within the interval [l_k(x), u_k(x)] for all x ∈ 𝒳, then it is possible to construct a conditional joint distribution for (Ỹ_k^⋆, D̃, Z̃, Ũ) given X̃ = x that satisfies:

  1. The IV independence in Assumption 2 is met.

  2. The conditional joint distribution of (Ỹ_k, D̃, Z̃) | X̃ = x matches the initial distribution of (Y_k, D, Z) | X = x.

  3. The natural bounds in Assumption 4 are maintained.

Theorem 6, together with Theorem 3, establishes a two-way correspondence between the conditional mean outcome E[Ỹ_k^⋆ | X̃ = x] and the IV partial bound [l_k(x), u_k(x)]. On the one hand, any value of E[Ỹ_k^⋆ | X̃ = x] within this interval corresponds to a feasible conditional joint distribution of (Ỹ_k^⋆, D̃, Z̃) | X̃ = x that is consistent with the observed data (Y_k, D, Z) | X = x. On the other hand, any conditional joint distribution of (Ỹ_k^⋆, D̃, Z̃) | X̃ = x consistent with the observed data must yield a conditional mean outcome E[Ỹ_k^⋆ | X̃ = x] within [l_k(x), u_k(x)], as established in Theorem 3.

These results underscore the tightness of the IV partial bound in Theorem 3 and the necessity of the bounded label assumption in Assumption 4 for the partial identification of the conditional probability ηk\eta_{k}^{\star}.

Proof of Theorem 6.

The proof is adapted from Kédagni and Mourifie [2017]. Let (Ỹ_k^⋆, D̃, Z̃, X̃, Ũ) be random variables such that Z̃ = Z almost surely and l_k(x) ≤ E[Ỹ_k^⋆ | X̃ = x] ≤ u_k(x) for every x ∈ 𝒳. Assuming Assumption 1 holds and fixing X̃ = x for some x ∈ 𝒳, we aim to prove that, for every possible value of E[Ỹ_k^⋆ | X̃ = x] ∈ [l_k(x), u_k(x)], there exists a conditional joint distribution of (Ỹ_k^⋆, D̃, Z̃, Ũ) given X̃ = x such that: (1) the IV independence in Assumption 2 holds, that is, Z̃ ⊥ Ỹ_k^⋆ | X̃ = x for every x ∈ 𝒳; (2) the new observable variables (Ỹ_k, D̃, Z̃) | X̃ = x have the same conditional joint distribution as the fixed observable variables (Y_k, D, Z) | X = x; and (3) Assumption 4 holds.

Without loss of generality, we first consider the case E[Ỹ_k^⋆ | X̃ = x] = l_k(x). Accordingly, we set

(Y~k=1X~=x)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1\mid\tilde{X}=x) :=lk(x)=maxz[m]{𝔼[DYkX=x,Z=z]+a(x)(1𝔼[DX=x,Z=z])}\displaystyle:=l_{k}(x)=\max_{z\in[m]}\big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a(x)\cdot\big{(}1-\mathbb{E}[D\mid X=x,Z=z]\big{)}\big{\}} (B.1)
(Y~k=0X=x)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0\mid X=x) :=1lk(x),\displaystyle:=1-l_{k}(x),

where the second line follows from the sum-to-one constraint of conditional probabilities. Since the observed data place no further restrictions on the conditional joint distribution of (Ỹ_k^⋆, D̃) given X̃ = x and Z̃ = z, we are free to define the following: for all z ∈ 𝒵,

(Y~k=1,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) :=𝔼[DYkX=x,Z=z]\displaystyle:=\mathbb{E}[DY_{k}\mid X=x,Z=z] (B.2)
(Y~k=1,D~=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) :=lk(x)𝔼[DYkX=x,Z=z]\displaystyle:=l_{k}(x)-\mathbb{E}[DY_{k}\mid X=x,Z=z]
(Y~k=0,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) :=𝔼[DX=x,Z=z]𝔼[DYkX=x,Z=z]\displaystyle:=\mathbb{E}[D\mid X=x,Z=z]-\mathbb{E}[DY_{k}\mid X=x,Z=z]
(Y~k=0,D~=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) :=(1lk(x))(𝔼[DX=x,Z=z]𝔼[DYkX=x,Z=z]).\displaystyle:=\big{(}1-l_{k}(x)\big{)}-\big{(}\mathbb{E}[D\mid X=x,Z=z]-\mathbb{E}[DY_{k}\mid X=x,Z=z]\big{)}.

Notice that all of these conditional probabilities are well defined, as they lie between 0 and 1, and the sum-to-one constraint holds:

i{0,1}j{0,1}(Y~k=i,D~=jX,Z=z)=1,z[m].\sum_{i\in\{0,1\}}\sum_{j\in\{0,1\}}\mathbb{P}(\tilde{Y}_{k}^{\star}=i,\tilde{D}=j\mid X,Z=z)=1,~{}~{}\forall~{}z\in[m].

We now prove the three statements mentioned above.

  1. IV Independence: To check that Z̃ ⊥ Ỹ_k^⋆ | X̃ = x, notice that for any z ∈ 𝒵,

    (Y~k=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=1,D~=1X~=,Z~=z)+(Y~k=1,D~=0X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=1\mid\tilde{X}=,\tilde{Z}=z)+\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z)
    =lk(x)\displaystyle=l_{k}(x)
    (Y~k=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=0,D~=1X~=x,Z~=z)+(Y~k=0,D~=0X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z)+\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z)
    =1lk(x).\displaystyle=1-l_{k}(x).

    The conditional distribution of Ỹ_k^⋆ given X̃ = x and Z̃ = z does not depend on z for any z ∈ [m], and therefore Z̃ is conditionally independent of Ỹ_k^⋆ given X̃ = x.

  2. (Ỹ_k, D̃, Z̃) | X̃ = x has the same conditional joint distribution as (Y_k, D, Z) | X = x: Recall that the observed label is defined as Ỹ_k = D̃ Ỹ_k^⋆ + (1 − D̃) · NaN. By Equation B.2, we have for any z ∈ 𝒵,

    (Y~k=1,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}=1,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=1,D~=1X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) (B.3)
    =𝔼[DYkX=x,Z=z]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]
    =(Yk=1,D=1X=x,Z=z)\displaystyle=\mathbb{P}(Y_{k}=1,D=1\mid X=x,Z=z)
    (Y~k=0,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=0,D~=1X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z)
    =𝔼[DX=x,Z=z]𝔼[DYkX=x,Z=z]\displaystyle=\mathbb{E}[D\mid X=x,Z=z]-\mathbb{E}[DY_{k}\mid X=x,Z=z]
    =(Yk=0,D=1X=x,Z=z),\displaystyle=\mathbb{P}(Y_{k}=0,D=1\mid X=x,Z=z),

    and

    (Y~k=NaN,D~=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}=\text{NaN},\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) =(D~=0X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) (B.4)
    =k{0,1}(Y~k=k,D~=0X~=x,Z~=z)\displaystyle=\sum_{k\in\{0,1\}}\mathbb{P}(\tilde{Y}_{k}^{\star}=k,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z)
    =1𝔼[DX=x,Z=z]\displaystyle=1-\mathbb{E}[D\mid X=x,Z=z]
    =(D=0X=x,Z=z)\displaystyle=\mathbb{P}(D=0\mid X=x,Z=z)
    =(Yk=NA,D=0X=x,Z=z).\displaystyle=\mathbb{P}(Y_{k}=\text{NA},D=0\mid X=x,Z=z).

    As a result, we have shown that the observed variables (Ỹ_k, D̃, Z̃) | X̃ = x have exactly the same conditional joint distribution as (Y_k, D, Z) | X = x.

  3. Finally, we show that when E[Ỹ_k^⋆ | X̃ = x] = l_k(x), there exists a joint distribution on (Ỹ_k^⋆, X̃, Ũ) such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x). Without loss of generality, we assume

    z0:=argmaxz[m]{𝔼[DYkX=x,Z=z]+ak(x)𝔼[(1D)X=x,Z=z]}z_{0}:=\operatorname*{arg\,max}_{z\in[m]}\big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]\big{\}}

    and then we have

    𝔼[Y~kX~=x]=lk(x)=𝔼[DYkX=x,Z=z0]+ak(x)𝔼[(1D)X=x,Z=z0].\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x]=l_{k}(x)=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z_{0}]. (B.5)

    Meanwhile, by the IV independence Z̃ ⊥ Ỹ_k^⋆ | X̃ = x, we have for Z̃ = z_0,

    𝔼[Y~kX~=x]=𝔼[Y~kX~=x,Z~=z0]\displaystyle\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x]=\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x,\tilde{Z}=z_{0}]
    =𝔼[D~Y~kX~=x,Z~=z0]+𝔼[(1D~)Y~kX~=x,Z~=z0]\displaystyle=\mathbb{E}[\tilde{D}\tilde{Y}_{k}^{\star}\mid\tilde{X}=x,\tilde{Z}=z_{0}]+\mathbb{E}[(1-\tilde{D})\tilde{Y}_{k}^{\star}\mid\tilde{X}=x,\tilde{Z}=z_{0}]
    =𝔼[D~Y~kX~=x,Z~=z0]+𝔼[𝔼[(1D~)Y~kU~,X~=x,Z~=z0]X~=x,Z~=z0]\displaystyle=\mathbb{E}[\tilde{D}\tilde{Y}_{k}\mid\tilde{X}=x,\tilde{Z}=z_{0}]+\mathbb{E}[\mathbb{E}[(1-\tilde{D})\tilde{Y}_{k}^{\star}\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}]
    =𝔼[D~Y~kX~=x,Z~=z0]+𝔼[𝔼[Y~kU~,X~=x]𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0].\displaystyle=\mathbb{E}[\tilde{D}\tilde{Y}_{k}\mid\tilde{X}=x,\tilde{Z}=z_{0}]+\mathbb{E}[\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{U},\tilde{X}=x]\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}].

    The first and second lines follow from the IV independence and the linearity of conditional expectation. The third line follows from the consistency of the observed outcome together with the law of iterated expectations. The last line follows from Assumption 1 (D̃ ⊥ Ỹ_k^⋆ | X̃, Ũ) and the IV independence.

    From Equations B.3 and B.4, we have E[D̃ Ỹ_k | X̃ = x, Z̃ = z_0] = E[D Y_k | X = x, Z = z_0], and therefore

    𝔼[Y~kX~=x]\displaystyle\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x] =𝔼[DYkX=x,Z=z0]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}] (B.6)
    +𝔼[𝔼[Y~kU~,X~=x]𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0].\displaystyle~{}+\mathbb{E}[\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{U},\tilde{X}=x]\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}].

    Now, suppose for contradiction that there is no conditional joint distribution of (Ỹ_k^⋆, D̃, Ũ) given X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x) almost surely. Then any conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x must instead satisfy E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x) + ε(x, Ũ) almost surely for some perturbation ε. Substituting this into Equation B.6, we have

    𝔼[Y~kX~=x]\displaystyle\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x] =𝔼[DYkX=x,Z=z0]+ak(x)𝔼[(1D~)X~=x,Z~=z0]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}]+a_{k}(x)\cdot\mathbb{E}\big{[}(1-\tilde{D})\mid\tilde{X}=x,\tilde{Z}=z_{0}\big{]} (B.7)
    +𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0]\displaystyle\quad+\mathbb{E}\Big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\Big{]}
    =lk(x)+𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0].\displaystyle=l_{k}(x)+\mathbb{E}\Big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\Big{]}.

    In the first equality, we use the law of iterated expectations and the fact that a_k(x) does not depend on the unobserved variable Ũ. The second equality follows from Equations B.3 and B.4, which give P(D̃ = 0 | X̃ = x, Z̃ = z_0) = P(D = 0 | X = x, Z = z_0) for any fixed x ∈ 𝒳, together with the definition of l_k(x).

    As a consequence, if

    𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0]0,\mathbb{E}\big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\big{]}\neq 0,

    then Equation B.7 contradicts Equation B.5, implying that there is indeed a conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x). On the other hand, if

    𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0]=0,\mathbb{E}\big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\big{]}=0, (B.8)

    then the conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x induced by Equation B.8 is exactly the “right” distribution for establishing E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x).

    Therefore, we have shown that when E[Ỹ_k^⋆ | X̃ = x] = l_k(x), there is always a conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x) almost surely.

To wrap up, note that statements 1 and 2 do not rely on the condition P(Ỹ_k^⋆ = 1 | X̃ = x) = l_k(x) in Equation B.1. In fact, for any g_k(x) ∈ [l_k(x), u_k(x)] with x ∈ 𝒳 fixed, we can simply set P(Ỹ_k^⋆ = 1 | X̃ = x) = g_k(x). The IV independence Z̃ ⊥ Ỹ_k^⋆ | X̃ = x still holds, and the new observable variables (Ỹ_k, D̃, Z̃) | X̃ = x still have the same conditional distribution as (Y_k, D, Z) | X = x, which is fixed a priori.

However, the choice of P(Ỹ_k^⋆ = 1 | X̃ = x) does affect the value of E[Ỹ_k^⋆ | Ũ, X̃ = x]. If we let E[Ỹ_k^⋆ | X̃ = x] = u_k(x), we can show in a similar way that E[Ỹ_k^⋆ | Ũ, X̃ = x] = b_k(x) almost surely. In fact, following the proof of statement 3, one can show that, for any g_k(x) ∈ [l_k(x), u_k(x)] with E[Ỹ_k^⋆ | X̃ = x] = g_k(x), there is always a conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = c_k(x) almost surely for some c_k(x) ∈ [a_k(x), b_k(x)].

Therefore, we have shown that under Assumption 1, whenever E[Ỹ_k^⋆ | X̃ = x] ∈ [l_k(x), u_k(x)], there is always a conditional joint distribution for (Ỹ_k^⋆, D̃, Z̃, Ũ) | X̃ = x such that the three statements in Theorem 6 hold. We conclude that the IV partial bound [l_k(x), u_k(x)] introduced in Theorem 3 is sharp for E[Y_k^⋆ | X = x] under Assumption 1, the IV independence in Assumption 2, and Assumption 4.

B.2 Balke and Pearl’s Bound

Balke and Pearl [1994] provide partial identification bounds for the average treatment effect of a binary treatment with a binary instrumental variable. In this section, we adapt their bound to our setting with partially observed labels and a binary IV (i.e., assignment to one of two decision-makers). Under Assumption 2, we have the following decomposition of the joint distribution of (Y, D, Z, U):

(Y,D,Z,U)=(YD,U)(DZ,U)(Z)(U).\mathbb{P}(Y,D,Z,U)=\mathbb{P}(Y\mid D,U)\cdot\mathbb{P}(D\mid Z,U)\cdot\mathbb{P}(Z)\cdot\mathbb{P}(U). (B.9)

Here we omit the observed covariates X for simplicity; alternatively, all distributions can be viewed as implicitly conditioning on X. We now define three response functions that characterize the values of Z, (D(0), D(1)), and Y^⋆:

rZ={0ifZ=01ifZ=1,rD={0ifD(0)=0andD(1)=01ifD(0)=0andD(1)=12ifD(0)=1andD(1)=03ifD(0)=1andD(1)=1,rY={0ifY=01ifY=1.r_{Z}=\left\{\begin{aligned} &0\ &\text{if}\ Z=0\\ &1\ &\text{if}\ Z=1\\ \end{aligned}\right.,\quad r_{D}=\left\{\begin{aligned} &0\quad&\text{if}\quad D(0)=0\ \text{and}\ D(1)=0\\ &1\quad&\text{if}\quad D(0)=0\ \text{and}\ D(1)=1\\ &2\quad&\text{if}\quad D(0)=1\ \text{and}\ D(1)=0\\ &3\quad&\text{if}\quad D(0)=1\ \text{and}\ D(1)=1\\ \end{aligned}\right.,\quad r_{Y}=\left\{\begin{aligned} &0\ &\text{if}\quad Y^{\star}=0\\ &1\ &\text{if}\quad Y^{\star}=1\\ \end{aligned}\right..

Next, we specify the joint distribution of the unobservable variables r_D and r_Y as

qkj=(rD=k,rY=j)k{0,1,2,3},j{0,1},q_{kj}=\mathbb{P}(r_{D}=k,r_{Y}=j)\quad\forall\ k\in\{0,1,2,3\},\ j\in\{0,1\},

which satisfies the constraint Σ_{k=0}^{3} (q_{k0} + q_{k1}) = 1. The target mean parameter of the true outcome can then be written as a linear combination of the q's. Moreover, the observable distribution P(Y, D | Z) is fully specified by the following six variables:

\begin{aligned}
p_{na,0}&=\mathbb{P}(D=0\mid Z=0)\qquad & p_{na,1}&=\mathbb{P}(D=0\mid Z=1)\\
p_{01,0}&=\mathbb{P}(Y=0,D=1\mid Z=0)\qquad & p_{01,1}&=\mathbb{P}(Y=0,D=1\mid Z=1)\\
p_{11,0}&=\mathbb{P}(Y=1,D=1\mid Z=0)\qquad & p_{11,1}&=\mathbb{P}(Y=1,D=1\mid Z=1),
\end{aligned}

with constraints $p_{11,0}+p_{01,0}+p_{na,0}=1$ and $p_{11,1}+p_{01,1}+p_{na,1}=1$. We also have the following relations between the $p$'s and the $q$'s:

\begin{aligned} p_{na,0}&=q_{00}+q_{01}+q_{10}+q_{11}\\ p_{01,0}&=q_{20}+q_{30}\\ p_{11,0}&=q_{21}+q_{31}\\ \end{aligned}\qquad\begin{aligned} p_{na,1}&=q_{00}+q_{01}+q_{20}+q_{21}\\ p_{01,1}&=q_{10}+q_{30}\\ p_{11,1}&=q_{11}+q_{31}.\\ \end{aligned}

Therefore, we have $p=Pq$, where $p=(p_{na,0},p_{01,0},p_{11,0},p_{na,1},p_{01,1},p_{11,1})$, $q=(q_{00},q_{10},q_{20},q_{30},q_{01},q_{11},q_{21},q_{31})$, and

P=\begin{bmatrix}1&1&0&0&1&1&0&0\\ 0&0&1&1&0&0&0&0\\ 0&0&0&0&0&0&1&1\\ 1&0&1&0&1&0&1&0\\ 0&1&0&1&0&0&0&0\\ 0&0&0&0&0&1&0&1\\ \end{bmatrix}.

Then the lower bound on $\mathbb{P}(Y^{\star}=1)$ can be written as the optimal value of the following linear program:

\begin{aligned} \min\quad&q_{01}+q_{11}+q_{21}+q_{31}\\ \text{subject to}\quad&\sum_{k=0}^{3}\sum_{j=0}^{1}q_{kj}=1\\ &Pq=p\\ &q_{kj}\geq 0\quad k\in\{0,1,2,3\},\ j\in\{0,1\}.\\ \end{aligned} (B.10)

Similarly, the upper bound on $\mathbb{P}(Y^{\star}=1)$ can be written as the optimal value of the following linear program:

\begin{aligned} \max\quad&q_{01}+q_{11}+q_{21}+q_{31}\\ \text{subject to}\quad&\sum_{k=0}^{3}\sum_{j=0}^{1}q_{kj}=1\\ &Pq=p\\ &q_{kj}\geq 0\quad k\in\{0,1,2,3\},\ j\in\{0,1\}.\\ \end{aligned} (B.11)

In fact, by simply comparing the variables in the objective function $\mathbb{P}(Y^{\star}=1)=q_{01}+q_{11}+q_{21}+q_{31}$ with those in the constraints, one finds that

\begin{aligned}
p_{11,0}&=q_{21}+q_{31}\leq\mathbb{P}(Y^{\star}=1)\\
p_{11,1}&=q_{11}+q_{31}\leq\mathbb{P}(Y^{\star}=1)\\
p_{11,0}+p_{na,0}&=q_{00}+q_{10}+q_{01}+q_{11}+q_{21}+q_{31}\geq\mathbb{P}(Y^{\star}=1)\\
p_{11,1}+p_{na,1}&=q_{00}+q_{20}+q_{01}+q_{11}+q_{21}+q_{31}\geq\mathbb{P}(Y^{\star}=1).
\end{aligned}

If we let

\begin{aligned}
L&=\max\{p_{11,0},p_{11,1}\}=\max_{z\in\{0,1\}}\ \{\mathbb{P}(Y=1,D=1\mid Z=z)\}\\
U&=\min\{p_{11,0}+p_{na,0},p_{11,1}+p_{na,1}\}=\min_{z\in\{0,1\}}\ \{\mathbb{P}(Y=1,D=1\mid Z=z)+\mathbb{P}(D=0\mid Z=z)\},
\end{aligned} (B.12)

then we have the following partial bounds on $\mathbb{P}(Y^{\star}=1)$:

L\ \leq\ \mathbb{P}(Y^{\star}=1)\ \leq\ U.

According to Balke and Pearl [1994], the bounds above are tight for $\mathbb{P}(Y^{\star}=1)$. We note that if we condition on $X$ in these bounds, then the corresponding bounds on $\mathbb{P}(Y^{\star}=1\mid X)$ coincide with the bounds in Theorem 3 specialized to a binary instrument.
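For concreteness, the following Python sketch (not part of the paper; it assumes numpy and scipy are available and uses a hypothetical observed distribution $p$) solves the linear programs B.10 and B.11 with scipy.optimize.linprog and checks that their optimal values match the closed-form bounds $L$ and $U$ in Equation B.12.

```python
# Minimal sketch (hypothetical distribution, assumes numpy and scipy): check the
# closed-form bounds L and U in Equation B.12 against the LPs in B.10-B.11.
import numpy as np
from scipy.optimize import linprog

# Response-type ordering for q: (q00, q10, q20, q30, q01, q11, q21, q31),
# i.e. r_Y = 0 for the first four entries and r_Y = 1 for the last four.
P = np.array([
    [1, 1, 0, 0, 1, 1, 0, 0],  # p_{na,0} = P(D=0 | Z=0)
    [0, 0, 1, 1, 0, 0, 0, 0],  # p_{01,0} = P(Y=0, D=1 | Z=0)
    [0, 0, 0, 0, 0, 0, 1, 1],  # p_{11,0} = P(Y=1, D=1 | Z=0)
    [1, 0, 1, 0, 1, 0, 1, 0],  # p_{na,1} = P(D=0 | Z=1)
    [0, 1, 0, 1, 0, 0, 0, 0],  # p_{01,1} = P(Y=0, D=1 | Z=1)
    [0, 0, 0, 0, 0, 1, 0, 1],  # p_{11,1} = P(Y=1, D=1 | Z=1)
], dtype=float)

# Hypothetical observed distribution generated from a feasible q, so Pq = p is
# consistent; the constraint sum(q) = 1 is then implied by the first three rows.
q_true = np.array([0.10, 0.10, 0.05, 0.05, 0.20, 0.20, 0.15, 0.15])
p = P @ q_true

c = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)    # P(Y*=1) = q01+q11+q21+q31
lower_lp = linprog(c, A_eq=P, b_eq=p, bounds=(0, 1)).fun
upper_lp = -linprog(-c, A_eq=P, b_eq=p, bounds=(0, 1)).fun

p_na0, p_010, p_110, p_na1, p_011, p_111 = p
L = max(p_110, p_111)                                  # closed-form lower bound
U = min(p_110 + p_na0, p_111 + p_na1)                  # closed-form upper bound
print(lower_lp, L)   # 0.35 and 0.35 for this q_true
print(upper_lp, U)   # 0.85 and 0.85 for this q_true
```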

B.3 Proofs in Section 5

Proof of Theorem 3.

We first prove the construction of the IV partial bound, and then show that the IV partial bound $[l_{k}(x),u_{k}(x)]$ is tighter than the natural bound $[a_{k}(x),b_{k}(x)]$ for all $k\in[K]$ when $x$ is fixed.

1.

    Under Assumptions 1 and 2, Equation A.1 gives that

    (Y=kX=x)=𝔼[YkX=x]\displaystyle\mathbb{P}(Y^{\star}=k\mid X=x)=\mathbb{E}[Y_{k}^{\star}\mid X=x]
    =𝔼[DYkX=x,Z=z]+𝔼[𝔼[YkU,X=x]𝔼[(1D)U,X=x,Z=z]X=x,Z=z].\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]+\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\cdot\mathbb{E}[(1-D)\mid U,X=x,Z=z]\mid X=x,Z=z\big{]}.

    As we have assumed that 𝔼[YkU,X=x][ak(x),bk(x)]\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\in[a_{k}(x),b_{k}(x)] almost surely in Assumption 4, we then have

    𝔼[YkX=x]\displaystyle\mathbb{E}[Y_{k}^{\star}\mid X=x] 𝔼[DYkX=x,Z=z]+ak(x)𝔼[𝔼[(1D)U,X=x,Z=z]X=x,Z=z]\displaystyle\geq\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}\big{[}\mathbb{E}[(1-D)\mid U,X=x,Z=z]\mid X=x,Z=z\big{]}
    =𝔼[DYkX=x,Z=z]+ak(x)𝔼[(1D)X=x,Z=z]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]

    and

    𝔼[YkX=x]\displaystyle\mathbb{E}[Y_{k}^{\star}\mid X=x] 𝔼[DYkX=x,Z=z]+bk(x)𝔼[𝔼[(1D)U,X=x,Z=z]X=x,Z=z]\displaystyle\leq\mathbb{E}[DY_{k}\mid X=x,Z=z]+b_{k}(x)\cdot\mathbb{E}\big{[}\mathbb{E}[(1-D)\mid U,X=x,Z=z]\mid X=x,Z=z\big{]}
    =𝔼[DYkX=x,Z=z]+bk(x)𝔼[(1D)X=x,Z=z]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]+b_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]

    for every z[m]z\in[m]. Optimizing the lower and upper bounds of 𝔼[YkX=x]\mathbb{E}[Y_{k}^{\star}\mid X=x] over z𝒵z\in\mathcal{Z} yields the desired results:

    maxz𝒵{𝔼[DYkX=x,Z=z]+ak(x)(1𝔼[DX=x,Z=z])}\displaystyle\max_{z\in\mathcal{Z}}\ \Big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\big{(}1-\mathbb{E}[D\mid X=x,Z=z]\big{)}\Big{\}}
    𝔼[YkX=x]\displaystyle\qquad\qquad\qquad\qquad\leq\mathbb{E}[Y_{k}^{\star}\mid X=x]\leq
    minz𝒵{𝔼[DYkX=x,Z=z]+bk(x)(1𝔼[DX=x,Z=z])}.\displaystyle\min_{z\in\mathcal{Z}}\ \Big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+b_{k}(x)\cdot\big{(}1-\mathbb{E}[D\mid X=x,Z=z]\big{)}\Big{\}}.
2.

Taking conditional expectations on both sides of the inequality in Assumption 4 directly yields a natural bound for $\mathbb{E}[Y_{k}^{\star}\mid X=x]$, that is,

    ak(x)𝔼[YkX=x]bk(x).a_{k}(x)\leq\mathbb{E}[Y_{k}^{\star}\mid X=x]\leq b_{k}(x).

    We now show that the IV partial bound [lk(x),uk(x)][l_{k}(x),u_{k}(x)] is tighter than the natural bound [ak(x),bk(x)][a_{k}(x),b_{k}(x)] for any x𝒳x\in\mathcal{X}. Without loss of generality, let

    z0:=argmaxz[m]{𝔼[DYkX=x,Z=z]+ak(x)𝔼[(1D)X=x,Z=z]}z_{0}:=\operatorname*{arg\,max}_{z\in[m]}\big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]\big{\}}

    and then we have

    lk(x)\displaystyle l_{k}(x) =𝔼[DYkX=x,Z=z0]+ak(x)𝔼[(1D)X=x,Z=z0]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z_{0}] (B.13)
    =ak(x)ak(x)(D=1X=x,Z=z0)()+𝔼[DYkX=x,Z=z0]\displaystyle=\underbrace{a_{k}(x)-a_{k}(x)\cdot\mathbb{P}(D=1\mid X=x,Z=z_{0})}_{(\cdots)}+\mathbb{E}[DY_{k}^{\star}\mid X=x,Z=z_{0}]
    =()+𝔼[𝔼[DYkU,X=x,Z=z0]X=x,Z=z0]\displaystyle=(\cdots)+\mathbb{E}\Big{[}\mathbb{E}[DY_{k}^{\star}\mid U,X=x,Z=z_{0}]\mid X=x,Z=z_{0}\Big{]}
    =()+𝔼[𝔼[DU,X=x,Z=z0]𝔼[YkU,X=x,Z=z0]X=x,Z=z0]\displaystyle=(\cdots)+\mathbb{E}\Big{[}\mathbb{E}[D\mid U,X=x,Z=z_{0}]\cdot\mathbb{E}[Y_{k}^{\star}\mid U,X=x,Z=z_{0}]\mid X=x,Z=z_{0}\Big{]}
    =()+𝔼[(D=1U,X=x,Z=z0)𝔼[YkU,X=x]X=x,Z=z0]\displaystyle=(\cdots)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X=x,Z=z_{0})\cdot\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\mid X=x,Z=z_{0}\Big{]}
\displaystyle=a_{k}(x)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X=x,Z=z_{0})\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid U,X=x]-a_{k}(x)\big{)}\mid X=x,Z=z_{0}\Big{]}.

The first line is the definition, and the second line holds by the consistency between the observed and true labels ($DY_{k}=DY_{k}^{\star}$) together with the linearity of conditional expectation. The third line follows from the law of iterated expectations, and the fourth line holds by Assumption 1 ($D\perp Y_{k}^{\star}\mid X,U$). The fifth line follows from the IV independence in Assumption 2 ($Z\perp(U,Y_{k}^{\star})\mid X$) and the fact that $\mathbb{E}[D\mid U,X=x,Z=z]=\mathbb{P}(D=1\mid U,X=x,Z=z)$. The last line again follows from linearity and the law of iterated expectations.

Since Assumption 4 states that $\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\geq a_{k}(x)$ almost surely, combining Equation B.13 with the fact that $\mathbb{P}(D=1\mid U,X=x,Z=z_{0})\geq 0$ gives

    lk(x)ak(x)x𝒳.l_{k}(x)\geq a_{k}(x)~{}~{}\forall\ x\in\mathcal{X}.

Similarly, we can show that $u_{k}(x)\leq b_{k}(x)$ for every fixed $x\in\mathcal{X}$. This completes the proof; a small numerical sketch of these bounds is given below.
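The following toy Python sketch (hypothetical per-decision-maker quantities, not from the paper) evaluates the IV partial bounds of Theorem 3 for one class $k$ at a fixed covariate value $x$.

```python
# Toy sketch (hypothetical numbers, assumes numpy): the IV partial bounds of
# Theorem 3 for one class k at a fixed x, with m = 3 decision-makers.
import numpy as np

labeled_prob = np.array([0.30, 0.42, 0.25])   # E[D * Y_k | X = x, Z = z]
select_prob = np.array([0.50, 0.70, 0.40])    # E[D | X = x, Z = z]
a_k, b_k = 0.0, 1.0                           # Assumption 4 bounds on E[Y_k* | U, X = x]

l_k = np.max(labeled_prob + a_k * (1.0 - select_prob))   # lower bound in Theorem 3
u_k = np.min(labeled_prob + b_k * (1.0 - select_prob))   # upper bound in Theorem 3
print(l_k, u_k)   # 0.42 and 0.72: tighter than the natural bound [a_k, b_k] = [0, 1]
```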

Proof of Theorem 4.

To simplify the notation, for any classifier t:𝒳[K]t:\mathcal{X}\mapsto[K] and vector function η:𝒳[0,1]K\eta:\mathcal{X}\mapsto[0,1]^{K}, define the function

A(t,η;x)=k=1Kηk(x)(𝕀{argmaxk[K]ηk(x)=k}𝕀{t(x)=k})0.A(t,\eta;x)=\sum_{k=1}^{K}\eta_{k}(x)\cdot\Big{(}\mathbb{I}\{\operatorname*{arg\,max}_{k\in[K]}\eta_{k}(x)=k\}-\mathbb{I}\{t(x)=k\}\Big{)}\geq 0.

The worst-case risk function can then be written as

¯(t)=𝔼[maxηSA(t,η;X)]\overline{\mathcal{R}}(t)=\mathbb{E}\Big{[}\max_{\eta\in S}A(t,\eta;X)\Big{]}

with S={η[0,1]K:η1=1,ηk[lk,uk],k=1,,K}S=\{\eta\in[0,1]^{K}:\|\eta\|_{1}=1,\eta_{k}\in[l_{k},u_{k}],k=1,\dots,K\}. The following proof will rely on computing the maximization of A(t,η;x)A(t,\eta;x) over all possible values of t(x)[K]t(x)\in[K] for any fixed x𝒳x\in\mathcal{X}.

• Step 1. We start with the case where $t(x)=k_{0}$ for some $k_{0}\in[K]$. We have

    maxηS{A(t,η;x)}𝕀{t(x)=k0}=maxηS{A(t,η;x)𝕀{t(x)=k0}}\displaystyle\max_{\eta\in S}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}=\max_{\eta\in S}\Big{\{}A(t,\eta;x)\cdot\mathbb{I}\{t(x)=k_{0}\}\Big{\}}
    =maxηS{(k[K]ηk(x)𝕀{argmaxp[K]{ηp(x)}=k}k[K]ηk(x)𝕀{t(x)=k})𝕀{t(x)=k0}}\displaystyle=\max_{\eta\in S}\ \Big{\{}\Big{(}\sum_{k\in[K]}\eta_{k}(x)\cdot\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\{\eta_{p}(x)\}=k\}-\sum_{k\in[K]}\eta_{k}(x)\cdot\mathbb{I}\{t(x)=k\}\Big{)}\cdot\mathbb{I}\{t(x)=k_{0}\}\Big{\}}
    =maxηS{(k[K]ηk(x)𝕀{argmaxp[K]{ηp(x)}=k}ηk0(X))𝕀{t(x)=k0}}\displaystyle=\max_{\eta\in S}\ \Big{\{}\Big{(}\sum_{k\in[K]}\eta_{k}(x)\cdot\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\{\eta_{p}(x)\}=k\}-\eta_{k_{0}}(X)\Big{)}\cdot\mathbb{I}\{t(x)=k_{0}\}\Big{\}}

    Given a fixed vector η(x)S\eta(x)\in S, we denote p0:=argmaxp[K]ηp(x)p_{0}:=\operatorname*{arg\,max}_{p\in[K]}\eta_{p}(x) as the label with maximal conditional probability. We have

    maxηS{A(t,η;x)}𝕀{t(x)=k0}\displaystyle\max_{\eta\in S}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\} =maxηS{(ηp0(x)ηk0(x))}𝕀{t(x)=k0}\displaystyle=\max_{\eta\in S}\Big{\{}\Big{(}\eta_{p_{0}}(x)-\eta_{k_{0}}(x)\Big{)}\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\} (B.14)
    =maxp0k0,ηS{(ηp0(x)ηk0(x))+}𝕀{t(x)=k0}0,\displaystyle=\max_{p_{0}\neq k_{0},\eta\in S}\Big{\{}\Big{(}\eta_{p_{0}}(x)-\eta_{k_{0}}(x)\Big{)}^{+}\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}\geq 0,

where $(z)^{+}:=\max(z,0)$ for any real-valued $z$. In the second equality of Equation B.14, we exclude the case $p_{0}=k_{0}$ from the maximization, as this situation only occurs when the lower bound of $\eta_{k_{0}}(x)$ exceeds the upper bound of $\eta_{p}(x)$ for all $p\neq k_{0}$. In that case we have $p_{0}=\operatorname*{arg\,max}_{p\in[K]}\eta_{p}(x)=k_{0}$, which implies that

    maxp0=k0,ηS{A(t,η;x)}𝕀{t(x)=k0}=maxp0=k0,ηS{(ηk0(x)ηk0(x))}𝕀{t(x)=k0}=0.\max_{p_{0}=k_{0},\eta\in S}\{A(t,\eta;x)\}\cdot\mathbb{I}\{t(x)=k_{0}\}=\max_{p_{0}=k_{0},\eta\in S}\{(\eta_{k_{0}}(x)-\eta_{k_{0}}(x))\}\cdot\mathbb{I}\{t(x)=k_{0}\}=0.

    Therefore, we expect the maximization of A(t,η;x)𝕀{t(x)=k0}A(t,\eta;x)\cdot\mathbb{I}\{t(x)=k_{0}\} over ηS\eta\in S to be non-negative. We exclude the case p0=k0p_{0}=k_{0} from the maximization to facilitate further analysis.

• Step 2. As shown in Equation B.14, for any fixed $x\in\mathcal{X}$ and $p_{0}\neq k_{0}$, the term $(\eta_{p_{0}}(x)-\eta_{k_{0}}(x))^{+}$ is non-decreasing in $\eta_{p_{0}}(x)$ and non-increasing in $\eta_{k_{0}}(x)$. Therefore, one may expect the maximization of $(\eta_{p_{0}}(x)-\eta_{k_{0}}(x))^{+}$ over $\eta(x)\in S(x)$ to be attained when $\eta_{p_{0}}(x)$ and $\eta_{k_{0}}(x)$ reach their upper and lower bounds, respectively. However, due to the constraint $\|\eta(x)\|_{1}=1$ for any $x\in\mathcal{X}$, the original upper bound $u_{p_{0}}(x)$ for $\eta_{p_{0}}(x)$ may not be reachable, and a similar concern arises for $\eta_{k_{0}}(x)$. To address this, we define a "realizable" upper bound for $\eta_{p_{0}}(x)$ as

    u~p0(x):=[1jp0lj(x)]up0(x),\tilde{u}_{p_{0}}(x):=\Big{[}1-\sum_{j\neq p_{0}}l_{j}(x)\Big{]}\land u_{p_{0}}(x), (B.15)

    which is the maximal value that ηp0(x)\eta_{p_{0}}(x) can realize under the constraint k[K]ηk(x)=1\sum_{k\in[K]}\eta_{k}(x)=1. Meanwhile, we define a “realizable” lower-bound for ηk0(x)\eta_{k_{0}}(x) as

    l~k0(x)\displaystyle\tilde{l}_{k_{0}}(x) :=[1u~p0(x)jp0,k0uj(x)]lk0(x)\displaystyle:=\Big{[}1-\tilde{u}_{p_{0}}(x)-\sum_{j\neq p_{0},k_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x) (B.16)
    ={[1(1jp0lj(x))jp0,k0uj(x)][1up0(x)jp0,k0uj(x)]}lk0(x)\displaystyle=\Big{\{}\Big{[}1-\Big{(}1-\sum_{j\neq p_{0}}l_{j}(x)\Big{)}-\sum_{j\neq p_{0},k_{0}}u_{j}(x)\Big{]}\lor\Big{[}1-u_{p_{0}}(x)-\sum_{j\neq p_{0},k_{0}}u_{j}(x)\Big{]}\Big{\}}\lor l_{k_{0}}(x)
    ={[lk0(x)+jp0,k0(lj(x)uj(x))][1jk0uj(x)]}lk0(x)\displaystyle=\Big{\{}\Big{[}l_{k_{0}}(x)+\sum_{j\neq p_{0},k_{0}}\big{(}l_{j}(x)-u_{j}(x)\big{)}\Big{]}\lor\Big{[}1-\sum_{j\neq k_{0}}u_{j}(x)\Big{]}\Big{\}}\lor l_{k_{0}}(x)
    =[1jk0uj(x)]lk0(x).\displaystyle=\Big{[}1-\sum_{j\neq k_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x).

The last equality holds because $l_{j}(x)-u_{j}(x)\leq 0$ for any $j\in[K]$, so that $l_{k_{0}}(x)+\sum_{j\neq p_{0},k_{0}}[l_{j}(x)-u_{j}(x)]\leq l_{k_{0}}(x)$. In Equation B.16, we replace the upper bound $u_{p_{0}}(x)$ with $\tilde{u}_{p_{0}}(x)$ in the definition of $\tilde{l}_{k_{0}}(x)$ because, in Equation B.14, the lowest realizable value of $\eta_{k_{0}}(x)$ depends on the maximal realizable value of $\eta_{p_{0}}(x)$. Notice, however, that $\tilde{l}_{k_{0}}(x)$ does not depend on the label $p_{0}$ involved in the definition, indicating that the "realizable" lower bound is an intrinsic property of the conditional probability for each $k_{0}\in[K]$. A similar statement applies to the "realizable" upper bound $\tilde{u}_{k_{0}}(x)$ for each $k_{0}\in[K]$.

  • Step 3. Back to the maximization of A(t,η;x)𝕀{t(x)=k0}A(t,\eta;x)\cdot\mathbb{I}\{t(x)=k_{0}\} in Equation B.14, we have

\displaystyle\max_{\eta\in S}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}=\max_{p_{0}\neq k_{0},\,\eta\in S}\Big{(}\eta_{p_{0}}(x)-\eta_{k_{0}}(x)\Big{)}^{+}\cdot\mathbb{I}\{t(x)=k_{0}\} (B.17)
    =maxp0k0,p0[K](u~p0(x)l~k0(x))+𝕀{t(x)=k0}.\displaystyle=\max_{p_{0}\neq k_{0},p_{0}\in[K]}\Big{(}\tilde{u}_{p_{0}}(x)-\tilde{l}_{k_{0}}(x)\Big{)}^{+}\cdot\mathbb{I}\{t(x)=k_{0}\}.

    By defining

    wk0(x)=maxp0k0,p0[K](u~p0(x)l~k0(x))+w_{k_{0}}(x)=\max_{p_{0}\neq k_{0},p_{0}\in[K]}\Big{(}\tilde{u}_{p_{0}}(x)-\tilde{l}_{k_{0}}(x)\Big{)}^{+}

    as the weight function corresponding to the event {t(x)=k0}\{t(x)=k_{0}\}, the Equation B.17 can be written as

    maxη(x)S(x){A(t,η;x)}𝕀{t(x)=k0}=wk0(x)𝕀{t(x)=k0}.\max_{\eta(x)\in S(x)}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}=w_{k_{0}}(x)\cdot\mathbb{I}\{t(x)=k_{0}\}.

    The worst-case excess risk ¯(t)\overline{\mathcal{R}}(t) can then be written as

    ¯(t)\displaystyle\overline{\mathcal{R}}(t) =𝔼[maxηSk0[K]A(t,η;X)𝕀{t(X)=k0}]\displaystyle=\mathbb{E}\left[\max_{\eta\in S}\sum_{k_{0}\in[K]}A(t,\eta;X)\cdot\mathbb{I}\{t(X)=k_{0}\}\right]
    =𝔼[k0[K]maxηSA(t,η;X)𝕀{t(X)=k0}]\displaystyle=\mathbb{E}\left[\sum_{k_{0}\in[K]}\max_{\eta\in S}A(t,\eta;X)\cdot\mathbb{I}\{t(X)=k_{0}\}\right]
    =𝔼[k0[K]wk0(X)𝕀{t(X)=k0}].\displaystyle=\mathbb{E}\left[\sum_{k_{0}\in[K]}w_{k_{0}}(X)\cdot\mathbb{I}\{t(X)=k_{0}\}\right].

The second equality holds because $\{t(x)=k_{0}\},k_{0}\in[K]$ defines a collection of disjoint events for every fixed $x\in\mathcal{X}$. By replacing the subscript $k_{0}$ with $k$, we obtain the stated expression.

• Step 4. We are now left to check that the second equality in Equation B.17 is attainable, that is, that there is indeed a feasible $\eta(x)$ with $\eta_{p_{0}}(x)=\tilde{u}_{p_{0}}(x)$ and $\eta_{k_{0}}(x)=\tilde{l}_{k_{0}}(x)$ such that $\|\eta(x)\|_{1}=1$ and $\eta_{j}(x)\in[l_{j}(x),u_{j}(x)]$ for every $j\in[K]$. Before going into the details, note that the non-emptiness of the set $S$ implies that, for every $x\in\mathcal{X}$,

    j[K]lj(x)1j[K]uj(x).\sum_{j\in[K]}l_{j}(x)\leq 1\leq\sum_{j\in[K]}u_{j}(x). (B.18)

    In the following, we check the feasibility of ηS\eta\in S for a fixed x𝒳x\in\mathcal{X} in three aspects.

1.

      For ηp0(x)=u~p0(x)\eta_{p_{0}}(x)=\tilde{u}_{p_{0}}(x), check ηp0(x)[lp0(x),up0(x)]\eta_{p_{0}}(x)\in[l_{p_{0}}(x),u_{p_{0}}(x)].

For the upper bound, by the definition of $\tilde{u}_{p_{0}}(x)$ in Equation B.15, $\eta_{p_{0}}(x)\leq u_{p_{0}}(x)$ is guaranteed.

      For the lower-bound, on one hand, if u~p0(x)=up0(x)\tilde{u}_{p_{0}}(x)=u_{p_{0}}(x), then up0(x)lp0(x)u_{p_{0}}(x)\geq l_{p_{0}}(x) naturally holds. On the other hand, suppose u~p0(x)=[1jp0lj(x)]\tilde{u}_{p_{0}}(x)=[1-\sum_{j\neq p_{0}}l_{j}(x)]. By Equation B.18, we have 1k[K]lk(x)1\geq\sum_{k\in[K]}l_{k}(x), which suggests that u~p0(x)lp0(x)\tilde{u}_{p_{0}}(x)\geq l_{p_{0}}(x). Hence, u~p0(x)lp0(x)\tilde{u}_{p_{0}}(x)\geq l_{p_{0}}(x) always holds.

2.

      For ηk0(x)=l~k0(x)\eta_{k_{0}}(x)=\tilde{l}_{k_{0}}(x), check ηk0(x)[lk0(x),uk0(x)]\eta_{k_{0}}(x)\in[l_{k_{0}}(x),u_{k_{0}}(x)].

      For the lower-bound, by the definition of l~k0(x)\tilde{l}_{k_{0}}(x) in Equation B.16, ηk0(x)lk0(x)\eta_{k_{0}}(x)\geq l_{k_{0}}(x) always holds.

      For the upper-bound, by Equation B.16, we have

      l~k0(x)=[1jk0uj(x)]lk0(x).\tilde{l}_{k_{0}}(x)=\Big{[}1-\sum_{j\neq k_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x).

      Equation B.18 ensures that 1jk0uj(x)uk0(x)1-\sum_{j\neq k_{0}}u_{j}(x)\leq u_{k_{0}}(x), and on the other hand lk0(x)uk0(x)l_{k_{0}}(x)\leq u_{k_{0}}(x) simply holds. Therefore, we establish the upper-bound ηk0(x)uk0(x)\eta_{k_{0}}(x)\leq u_{k_{0}}(x).

3.

      Check η(x)1=1\|\eta(x)\|_{1}=1. We start with computing the summation

      ηp0(x)+ηk0(x)=u~p0(x)+l~k0(x).\eta_{p_{0}}(x)+\eta_{k_{0}}(x)=\tilde{u}_{p_{0}}(x)+\tilde{l}_{k_{0}}(x).

If this sum stays within the region $\big{[}1-\sum_{j\neq p_{0},k_{0}}u_{j}(x),1-\sum_{j\neq p_{0},k_{0}}l_{j}(x)\big{]}$, we can always find feasible values $\eta_{j}(x)\in[l_{j}(x),u_{j}(x)]$ for $j\neq p_{0},k_{0}$ such that $\sum_{j\in[K]}\eta_{j}(x)=1$ is satisfied. The two cases are discussed below.

(a)

        Suppose up0(x)1jp0lj(x)u_{p_{0}}(x)\leq 1-\sum_{j\neq p_{0}}l_{j}(x), by the definition of u~p0(x)\tilde{u}_{p_{0}}(x) in Equation B.15, we have u~p0(x)=up0(x)\tilde{u}_{p_{0}}(x)=u_{p_{0}}(x) and l~k0(x)=[1jk0uj(x)]lk0(x)\tilde{l}_{k_{0}}(x)=[1-\sum_{j\neq k_{0}}u_{j}(x)]\lor l_{k_{0}}(x). We then have

        u~p0(x)+l~k0(x)={1jk0,p0uj(x),iflk0(x)1jk0uj(x)(Case 1)up0(x)+lk0(x),iflk0(x)1jk0uj(x)(Case 2).\tilde{u}_{p_{0}}(x)+\tilde{l}_{k_{0}}(x)=\left\{\begin{aligned} &1-\sum_{j\neq k_{0},p_{0}}u_{j}(x),&&\text{if}\ l_{k_{0}}(x)\leq 1-\sum_{j\neq k_{0}}u_{j}(x)\quad(\text{Case 1})\\ &u_{p_{0}}(x)+l_{k_{0}}(x),&&\text{if}\ l_{k_{0}}(x)\geq 1-\sum_{j\neq k_{0}}u_{j}(x)\quad(\text{Case 2}).\end{aligned}\right.

        For the Case 1, we can simply let ηj(x)=uj(x)\eta_{j}(x)=u_{j}(x) for jp0,k0j\neq p_{0},k_{0}, and this arrangement fulfills the requirement j[K]ηj(x)=1\sum_{j\in[K]}\eta_{j}(x)=1.

        For the Case 2, we have

        1jk0,p0uj(x)up0(x)+lk0(x)1jk0,p0lj(x).1-\sum_{j\neq k_{0},p_{0}}u_{j}(x)\leq u_{p_{0}}(x)+l_{k_{0}}(x)\leq 1-\sum_{j\neq k_{0},p_{0}}l_{j}(x).

As a result, there exists a sequence $\{\eta_{j}(x)\}_{j\neq p_{0},k_{0}}$ with $\eta_{j}(x)\in[l_{j}(x),u_{j}(x)]$ for all $j\neq p_{0},k_{0}$ that satisfies $u_{p_{0}}(x)+l_{k_{0}}(x)+\sum_{j\neq p_{0},k_{0}}\eta_{j}(x)=1$.

(b)

Suppose $u_{p_{0}}(x)\geq 1-\sum_{j\neq p_{0}}l_{j}(x)$. We then have $\tilde{u}_{p_{0}}(x)=1-\sum_{j\neq p_{0}}l_{j}(x)$ and

        l~k0(x)\displaystyle\tilde{l}_{k_{0}}(x) =[1u~p0(x)jk0,p0uj(x)]lk0(x)\displaystyle=\Big{[}1-\tilde{u}_{p_{0}}(x)-\sum_{j\neq k_{0},p_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x)
        =[lk0(x)+jk0,p0[lj(x)uj(x)]]lk0(x)\displaystyle=\Big{[}l_{k_{0}}(x)+\sum_{j\neq k_{0},p_{0}}[l_{j}(x)-u_{j}(x)]\Big{]}\lor l_{k_{0}}(x)
        =lk0(x),\displaystyle=l_{k_{0}}(x),

        where the last equality holds because lj(x)uj(x)l_{j}(x)\leq u_{j}(x) for all j[K]j\in[K]. Therefore, we have

        u~p0(x)+l~k0(x)=1jk0,p0lj(x).\tilde{u}_{p_{0}}(x)+\tilde{l}_{k_{0}}(x)=1-\sum_{j\neq k_{0},p_{0}}l_{j}(x).

As a result, we can simply let $\eta_{j}(x)=l_{j}(x)$ for $j\neq p_{0},k_{0}$, and this choice again fulfills the requirement $\sum_{j\in[K]}\eta_{j}(x)=1$.

The above analysis shows the validity of the second equality in Equation B.17. A small numerical sketch of this weight construction is given below.
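The following Python sketch (hypothetical partial bounds, not from the paper) computes the "realizable" bounds in Equations B.15-B.16 and the resulting weights $w_{k}(x)$ from Equation B.17 at a single $x$.

```python
# Sketch (hypothetical partial bounds, assumes numpy): the "realizable" bounds of
# Equations B.15-B.16 and the weights w_k(x) of Equation B.17 for K = 3 classes.
import numpy as np

l = np.array([0.10, 0.20, 0.15])   # lower bounds l_k(x)
u = np.array([0.50, 0.60, 0.45])   # upper bounds u_k(x); note sum(l) <= 1 <= sum(u)
K = len(l)

u_tilde = np.minimum(1.0 - (l.sum() - l), u)   # (1 - sum_{j != p} l_j) wedge u_p
l_tilde = np.maximum(1.0 - (u.sum() - u), l)   # (1 - sum_{j != k} u_j) vee  l_k

w = np.empty(K)
for k in range(K):
    # w_k(x) = max_{p != k} (u_tilde_p - l_tilde_k)^+
    w[k] = max(np.max(np.delete(u_tilde, k) - l_tilde[k]), 0.0)
print(u_tilde, l_tilde, w)
```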

Proof of Theorem 4 (Binary Classification Setting).

Recall from the first part of Theorem 4 that

¯(t)\displaystyle\overline{\mathcal{R}}(t) =𝔼[k=1Kwk(X)𝕀{t(X)=k}],where\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\right],\ \text{where}
wk(X)\displaystyle w_{k}(X) =maxpk,p[K]{(u~p(x)l~k(x))+}.\displaystyle=\max_{p\neq k,p\in[K]}\Big{\{}\Big{(}\tilde{u}_{p}(x)-\tilde{l}_{k}(x)\Big{)}^{+}\Big{\}}.

Consider the binary case where $Y^{\star}\in\{-1,1\}$ and $t:\mathcal{X}\mapsto\{-1,1\}$. Recall that in Theorem 3 we derive the partial bound for $\eta_{1}^{\star}(x)=\mathbb{P}(Y^{\star}=1\mid X=x)$ as $[l_{1}(x),u_{1}(x)]$. By the constraint $\eta_{1}^{\star}(x)+\eta_{-1}^{\star}(x)=1$ for each $x\in\mathcal{X}$, we define the partial bound $[l_{-1}(x),u_{-1}(x)]$ for $\eta_{-1}^{\star}(x)$ as $l_{-1}(x)=1-u_{1}(x)$ and $u_{-1}(x)=1-l_{1}(x)$.

We can then compute the “realizable” upper- and lower-bounds for both η1(x)\eta_{1}^{\star}(x) and η1(x)\eta_{-1}^{\star}(x) as below:

\begin{aligned}
\tilde{u}_{1}(x)&=(1-l_{-1}(x))\land u_{1}(x)=u_{1}(x) &&\text{and}\quad \tilde{u}_{-1}(x)=(1-l_{1}(x))\land u_{-1}(x)=1-l_{1}(x),\\
\tilde{l}_{1}(x)&=(1-u_{-1}(x))\lor l_{1}(x)=l_{1}(x) &&\text{and}\quad \tilde{l}_{-1}(x)=(1-u_{1}(x))\lor l_{-1}(x)=1-u_{1}(x).
\end{aligned}

The weight functions wk(x)w_{k}(x) for k=±1k=\pm 1 are therefore

\begin{aligned}
w_{1}(x)&=\big{(}\tilde{u}_{-1}(x)-\tilde{l}_{1}(x)\big{)}^{+}=\big{(}1-2l_{1}(x)\big{)}^{+}=\big{(}1-2l_{1}(x)\big{)}\cdot\mathbb{I}\{l_{1}(x)\leq 1/2\}\\
w_{-1}(x)&=\big{(}\tilde{u}_{1}(x)-\tilde{l}_{-1}(x)\big{)}^{+}=\big{(}2u_{1}(x)-1\big{)}^{+}=\big{(}2u_{1}(x)-1\big{)}\cdot\mathbb{I}\{u_{1}(x)\geq 1/2\}.
\end{aligned}

Hence, in binary outcome case, the risk function ¯(t)\overline{\mathcal{R}}(t) takes the form

\displaystyle\overline{\mathcal{R}}(t)=\mathbb{E}\Big{[}|1-2l_{1}(X)|\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\cdot\mathbb{I}\{t(X)=1\}+|2u_{1}(X)-1|\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}\cdot\mathbb{I}\{t(X)=-1\}\Big{]}
\displaystyle=\mathbb{E}\Big{[}|1-2l_{1}(X)|\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\cdot\big{(}1-\mathbb{I}\{t(X)=-1\}\big{)}+|2u_{1}(X)-1|\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}\cdot\mathbb{I}\{t(X)=-1\}\Big{]}
=𝔼[|12l1(X)|𝕀l1(X)1/2]+𝔼[𝕀{t(X)1}(|2u1(X)1|𝕀u1(X)1/2|12l1(X)|𝕀l1(X)1/2)]\displaystyle=\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{]}+\mathbb{E}\Big{[}\mathbb{I}\{t(X)\neq 1\}\cdot\Big{(}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}-\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{)}\Big{]}
=𝔼[|12l1(X)|𝕀l1(X)1/2]+𝔼[1t(X)2(|2u1(X)1|𝕀u1(X)1/2|12l1(X)|𝕀l1(X)1/2)].\displaystyle=\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{]}+\mathbb{E}\left[\frac{1-t(X)}{2}\cdot\Big{(}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}-\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{)}\right].

For every fixed x𝒳x\in\mathcal{X}, define a weight function

w(x)\displaystyle w(x) :=|u1(x)1/2|𝕀{u1(x)1/2}|l1(x)1/2|𝕀{l1(x)<1/2}\displaystyle:=\big{|}u_{1}(x)-1/2\big{|}\cdot\mathbb{I}\{u_{1}(x)\geq 1/2\}-\big{|}l_{1}(x)-1/2\big{|}\cdot\mathbb{I}\{l_{1}(x)<1/2\}
=max(u1(X)1/2,0)+min(l1(X)1/2,0),\displaystyle=\max\big{(}u_{1}(X)-1/2,0\big{)}+\min\big{(}l_{1}(X)-1/2,0\big{)},

we then have

¯(t)\displaystyle\overline{\mathcal{R}}(t) =𝔼[|12l1(X)|𝕀{l1(X)<1/2}]()+𝔼[w(X)(1t(X))]\displaystyle=\underbrace{\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{]}}_{(**)}+\mathbb{E}\Big{[}w(X)\cdot\big{(}1-t(X)\big{)}\Big{]} (B.19)
=()+𝔼[w(X)]+𝔼[w(X)t(X)]\displaystyle=(**)+\mathbb{E}\big{[}w(X)\big{]}+\ \mathbb{E}\big{[}-w(X)\cdot t(X)\big{]}
=()+𝔼[w(X)]+𝔼[|w(X)|sgn[w(X)]t(X)]\displaystyle=(**)+\mathbb{E}\big{[}w(X)\big{]}+\mathbb{E}\Big{[}|w(X)|\cdot\operatorname*{\mathrm{sgn}}[-w(X)]\cdot t(X)\Big{]}
=()+𝔼[w(X)]+𝔼[|w(X)|(2𝕀{sgn[w(X)]t(X)}1)]\displaystyle=(**)+\mathbb{E}\big{[}w(X)\big{]}+\mathbb{E}\Big{[}|w(X)|\cdot\Big{(}2\mathbb{I}\big{\{}\operatorname*{\mathrm{sgn}}[w(X)]\neq t(X)\big{\}}-1\Big{)}\Big{]}
=()+𝔼[w(X)|w(X)|]+2𝔼[|w(X)|𝕀(sgn[w(X)]sgn[h(X)])].\displaystyle=(**)+\mathbb{E}\Big{[}w(X)-|w(X)|\Big{]}+2\mathbb{E}\Big{[}|w(X)|\cdot\mathbb{I}(\operatorname*{\mathrm{sgn}}[w(X)]\neq\operatorname*{\mathrm{sgn}}[h(X)])\Big{]}.

For any fixed xx, notice that w(x)|w(x)|=0w(x)-|w(x)|=0 when w(x)0w(x)\geq 0 and w(x)|w(x)|=2w(x)w(x)-|w(x)|=2w(x) when w(x)<0w(x)<0, we then have

𝔼[w(X)|w(X)|]\displaystyle\mathbb{E}\Big{[}w(X)-|w(X)|\Big{]} =𝔼[2w(X)𝕀{w(X)<0}]\displaystyle=\mathbb{E}\Big{[}2w(X)\cdot\mathbb{I}\{w(X)<0\}\Big{]}
\displaystyle=\mathbb{E}\Big{[}\Big{(}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}-|1-2l_{1}(X)|\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{)}\cdot\mathbb{I}\{w(X)<0\}\Big{]}.

Consequently, noting that

w(X)<0~{}~{}\Leftrightarrow~{}~{}|u_{1}(X)-1/2|\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}<|l_{1}(X)-1/2|\cdot\mathbb{I}\{l_{1}(X)<1/2\},

the first two term in Equation B.19 can be further organized as

𝔼[|12l1(X)|𝕀{l1(X)<1/2}]+𝔼[w(X)|w(X)|]\displaystyle\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{]}+\mathbb{E}\Big{[}w(X)-|w(X)|\Big{]} (B.20)
={𝔼[|2u1(X)1|𝕀{u1(X)1/2}],ifw(X)<0𝔼[|12l1(X)|𝕀{l1(X)<1/2}],ifw(X)0\displaystyle=\left\{\begin{aligned} &\mathbb{E}\Big{[}|2u_{1}(X)-1|\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}\Big{]},&&\text{if}\ w(X)<0\\ &\mathbb{E}\Big{[}|1-2l_{1}(X)|\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{]},&&\text{if}\ w(X)\geq 0\\ \end{aligned}\right.
=min(𝔼[|12l1(X)|𝕀{l1(X)<1/2}],𝔼[|2u1(X)1|𝕀{u(X)1/2}]).\displaystyle=\min\Big{(}\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\big{]},\mathbb{E}\Big{[}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}\{u(X)\geq 1/2\}\big{]}\Big{)}.

Finally, we define

wpartial(x)=2w(x)=max(2u1(X)1,0)+min(2l1(X)1,0)w^{\rm partial}(x)=2w(x)=\max\big{(}2u_{1}(X)-1,0\big{)}+\min\big{(}2l_{1}(X)-1,0\big{)}

for every x𝒳x\in\mathcal{X}. Combining Equations B.19 and B.20 together, we have

¯(h)\displaystyle\overline{\mathcal{R}}(h) =𝔼[|wpartial(X)|𝕀{sgn[wpartial(X)]t(X)}]\displaystyle=\mathbb{E}\Big{[}|w^{\rm partial}(X)|\cdot\mathbb{I}\{\operatorname*{\mathrm{sgn}}[w^{\rm partial}(X)]\neq t(X)\}\Big{]}
+min(𝔼[|12l1(X)|𝕀{l1(X)<1/2}],𝔼[|2u1(X)1|𝕀{u1(X)1/2}])\displaystyle\qquad+\min\Big{(}\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\big{]},\mathbb{E}\Big{[}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}\big{]}\Big{)}
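As a quick illustration, the Python sketch below (hypothetical bounds $l_{1}(x)$, $u_{1}(x)$, not from the paper) evaluates the binary weight $w^{\rm partial}(x)$ and the decision $\operatorname*{\mathrm{sgn}}[w^{\rm partial}(x)]$ it induces.

```python
# Toy sketch (hypothetical bounds, assumes numpy): the binary weight w^partial(x)
# and the decision it induces, for three covariate values.
import numpy as np

l1 = np.array([0.10, 0.55, 0.30])   # lower bounds on P(Y* = 1 | X = x)
u1 = np.array([0.40, 0.80, 0.65])   # upper bounds on P(Y* = 1 | X = x)

w_partial = np.maximum(2 * u1 - 1, 0) + np.minimum(2 * l1 - 1, 0)
decision = np.sign(w_partial)       # +1 when the upper side dominates, -1 otherwise
print(w_partial, decision)          # [-0.8  0.6 -0.1] and [-1.  1. -1.]
```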

B.4 Comparisons of Point and Partial Identification

Point identification hinges on decision-makers' homogeneity in their use of unobservables, as formalized by the generalized NUCEM assumption in Theorem 1. Concretely, the conditional uncorrelatedness of $\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)$ and $\mathbb{E}[Y^{\star}_{k}\mid X,U]$ given $X$ reflects the uniformity of decision-makers' reliance on unobserved information. Conversely, partial identification benefits from the heterogeneity of decision-makers' use of observables $X$, without requiring homogeneity of selection on unobservables $U$. This approach uses IV partial bounds that exploit variation in the decision rules $\mathbb{P}(D=1\mid X,Z)$ and the labeled probabilities $\mathbb{P}(Y=k,D=1\mid X,Z)$ across decision-makers $Z$. Such heterogeneity helps construct tighter bounds for $\eta_{k}^{\star}(x)$. Moreover, we have the following claim.

Theorem 7.

The partial bounds lk(x)l_{k}(x) and uk(x)u_{k}(x) in Theorem 3 are achieved by the most lenient and most stringent decision-makers, respectively.

Proof of Theorem 7.

We claim that for any fixed $X$ and $U$, the lower bound is achieved by the most lenient decision-maker and the upper bound by the most stringent decision-maker. To see this, notice that the lower bound for $\eta_{k}^{\star}$ with respect to decision-maker $Z=z$ in Theorem 3 satisfies

lk(x,z)\displaystyle l_{k}(x,z) :=𝔼[DYk+ak(X)(1D)X=x,Z=z]\displaystyle:=\mathbb{E}\big{[}DY_{k}+a_{k}(X)\cdot(1-D)\mid X=x,Z=z\big{]}
=𝔼[DYk+ak(X)(1D)X=x,Z=z]\displaystyle=\mathbb{E}\big{[}DY_{k}^{\star}+a_{k}(X)\cdot(1-D)\mid X=x,Z=z\big{]}
=𝔼[𝔼[DYk+ak(X)(1D)U,X,Z=z]X=x,Z=z]\displaystyle=\mathbb{E}\Big{[}\mathbb{E}\big{[}DY_{k}^{\star}+a_{k}(X)\cdot(1-D)\mid U,X,Z=z\big{]}\mid X=x,Z=z\Big{]}
=𝔼[(D=1U,X,Z)𝔼[YkU,X,Z]+ak(X)(1(D=1U,X,Z))X=x,Z=z]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X,Z)\cdot\mathbb{E}\big{[}Y_{k}^{\star}\mid U,X,Z\big{]}+a_{k}(X)\cdot\big{(}1-\mathbb{P}(D=1\mid U,X,Z)\big{)}\mid X=x,Z=z\Big{]}
=ak(x)+𝔼[(D=1U,X,Z)(𝔼[YkX,U]ak(X))X=x,Z=z].\displaystyle=a_{k}(x)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X,Z)\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid X,U]-a_{k}(X)\big{)}\mid X=x,Z=z\Big{]}.

The second equality holds by consistency, and the third follows from the law of iterated expectations. The fourth equality holds by Assumption 1, and the final equality is established by the IV independence in Assumption 2 ($Z\perp(U,Y^{\star})\mid X$). Since Assumption 4 states that $a_{k}(X)\leq\mathbb{E}[Y_{k}^{\star}\mid X,U]\leq b_{k}(X)$ almost surely, the lower bound $l_{k}(x,z)$ is increasing in the decision rule $\mathbb{P}(D=1\mid U,X,Z=z)$. Therefore, for any fixed $X$ and $U$, the maximum over lower bounds $\max_{z\in[m]}l_{k}(x,z)$ is achieved by the most lenient decision-maker, i.e., the one with the largest $\mathbb{P}(D=1\mid X,U,Z=z)$; that is,

lk(x)\displaystyle l_{k}(x) =maxz𝒵lk(x,z)\displaystyle=\max_{z\in\mathcal{Z}}l_{k}(x,z)
=maxz𝒵ak(x)+𝔼[(D=1U,X,Z=z)(𝔼[YkX,U]ak(X))X=x]\displaystyle=\max_{z\in\mathcal{Z}}a_{k}(x)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X,Z=z)\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid X,U]-a_{k}(X)\big{)}\mid X=x\Big{]}
=ak(x)+𝔼[{maxz𝒵(D=1U,X,Z=z)}(𝔼[YkX,U]ak(X))X=x]\displaystyle=a_{k}(x)+\mathbb{E}\Big{[}\big{\{}\max_{z\in\mathcal{Z}}\mathbb{P}(D=1\mid U,X,Z=z)\big{\}}\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid X,U]-a_{k}(X)\big{)}\mid X=x\Big{]}

Similarly, for the upper-bound, the minimization minz[m]uk(x,z)\min_{z\in[m]}u_{k}(x,z) is achieved by the most stringent decision-maker when XX and UU are fixed. ∎

Appendix C Supplements for Unified Cost-sensitive Learning

Proof of Theorem 5.

To facilitate the analysis, for fixed weight function w:𝒳w:\mathcal{X}\rightarrow\mathbb{R} and score function h:𝒳Kh:\mathcal{X}\rightarrow\mathbb{R}^{K}, we define the class labels

k0:=argmink[K]wk(x)andkh:=argmaxk[K]hk(x).k_{0}:=\operatorname*{arg\,min}_{k\in[K]}w_{k}(x)\quad\text{and}\quad k_{h}:=\operatorname*{arg\,max}_{k\in[K]}h_{k}(x).

Namely, when x𝒳x\in\mathcal{X} is fixed, k0k_{0} denotes the class label with the minimal weight wk0(x)w_{k_{0}}(x), while khk_{h} corresponds to the class label with the highest score hkh(x)h_{k_{h}}(x). According to the definition of cost-sensitive classification risk (h,w)\mathcal{R}(h,w) in Equation 6.1, we define the cost-sensitive classification loss as

l(h,w;x)=\sum_{k=1}^{K}w_{k}(x)\cdot\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}h_{p}(x)=k\}.

Obviously, the excess classification risk can be written as

(h,w)infh(h,w)=𝔼[l(h,w;X)infhl(h,w;X)]=𝔼[wkh(X)wk0(X)].\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w)=\mathbb{E}\big{[}l(h,w;X)-\inf_{h}l(h,w;X)\big{]}=\mathbb{E}\big{[}w_{k_{h}}(X)-w_{k_{0}}(X)\big{]}.

Correspondingly, we define the cost-sensitive surrogate loss for exp(𝒉,𝒘)\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}) in Equation 6.2 as

lexp(h,w;x)=k=1Kwk(x)exp(hk(x))p=1Kexp(hp(x)).l_{\exp}(h,w;x)=\sum_{k=1}^{K}w_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}.

As a result, the excess surrogate risk can be written as

exp(h,w)infhexp(h,w)=𝔼[lexp(h,w;X)infhlexp(h,w;X)].\mathcal{R}_{\exp}(h,w)-\inf_{h}\mathcal{R}_{\exp}(h,w)=\mathbb{E}\big{[}l_{\exp}(h,w;X)-\inf_{h}l_{\exp}(h,w;X)\big{]}.

Therefore, we are left to analyze the excess surrogate loss lexp(h,w;x)infhlexp(h,w;x)l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x) for every fixed x𝒳x\in\mathcal{X}.

Following the idea of Mao et al. [2023], for every fixed score function $h:\mathcal{X}\rightarrow\mathbb{R}^{K}$, we define a new function $h^{\mu}$ induced by a real-valued parameter $\mu$ as follows: for fixed $x\in\mathcal{X}$,

{hkμ(x)=hk(x),kk0,khhk0μ(x)=log(exp(hkh(x))μ),k=k0hkhμ(x)=log(exp(hk0(x))+μ),k=kh.\left\{\begin{aligned} &h_{k}^{\mu}(x)=h_{k}(x),&&k\neq k_{0},k_{h}\\ &h_{k_{0}}^{\mu}(x)=\log\big{(}\exp(h_{k_{h}}(x))-\mu\big{)},&&k=k_{0}\\ &h_{k_{h}}^{\mu}(x)=\log\big{(}\exp(h_{k_{0}}(x))+\mu\big{)},&&k=k_{h}.\end{aligned}\right.

Since the domain of the $\log$ function is $(0,+\infty)$, we must restrict the value of $\mu$ to ensure that $\exp(h_{k_{h}}(x))-\mu>0$ and $\exp(h_{k_{0}}(x))+\mu>0$. Therefore, for fixed $x\in\mathcal{X}$, we define the feasible region of $\mu$ as

\mathcal{M}(\mu):=\big{\{}\mu\in\mathbb{R}:-\exp(h_{k_{0}}(x))<\mu<\exp(h_{k_{h}}(x))\big{\}}.

A direct consequence of the above definition is that $\sum_{k=1}^{K}\exp(h_{k}(x))=\sum_{k=1}^{K}\exp(h_{k}^{\mu}(x))$ for every fixed $x$. Consequently, the surrogate loss $l_{\exp}(h^{\mu},w;x)$ takes the form

lexp(hμ,w;x)=kk0,khwk(x)exp(hk(x))p=1Kexp(hp(x))+wk0(x)exp(hkh(x))μp=1Kexp(hp(x))+wkh(x)exp(hk0(x))+μp=1Kexp(hp(x)).l_{\exp}(h^{\mu},w;x)=\sum_{k\neq k_{0},k_{h}}w_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}+w_{k_{0}}(x)\frac{\exp(h_{k_{h}}(x))-\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}+w_{k_{h}}(x)\frac{\exp(h_{k_{0}}(x))+\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}.

Now, the excess surrogate loss lexp(h,w;x)infhlexp(h,w;x)l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x) can be lower-bounded by

lexp(h,w;x)infhlexp(h,w;x)\displaystyle l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x)
lexp(h,w;x)infμ(μ)lexp(hμ,w;x)\displaystyle\geq l_{\exp}(h,w;x)-\inf_{\mu\in\mathcal{M}(\mu)}l_{\exp}(h^{\mu},w;x)
=supμ(μ){wkh(x)exp(hkh(x))exp(hk0(x))μp=1Kexp(hp(x))+wk0(x)exp(hk0(x))exp(hkh(x))+μp=1Kexp(hp(x))}\displaystyle=\sup_{\mu\in\mathcal{M}(\mu)}\left\{w_{k_{h}}(x)\frac{\exp(h_{k_{h}}(x))-\exp(h_{k_{0}}(x))-\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}+w_{k_{0}}(x)\frac{\exp(h_{k_{0}}(x))-\exp(h_{k_{h}}(x))+\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right\}
=supμ(μ){(wkh(x)wk0(x))exp(hkh(x))exp(hk0(x))μp=1Kexp(hp(x))}\displaystyle=\sup_{\mu\in\mathcal{M}(\mu)}\left\{\big{(}w_{k_{h}}(x)-w_{k_{0}}(x)\big{)}\frac{\exp(h_{k_{h}}(x))-\exp(h_{k_{0}}(x))-\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right\}

The first inequality holds because the infimum over the restricted family $\{h^{\mu}:\mu\in\mathcal{M}(\mu)\}$ is no smaller than the infimum over all measurable score functions $h$. Since $k_{0}$ is the class label with the minimal weight, we have $w_{k_{h}}(x)-w_{k_{0}}(x)\geq 0$ for any $x$. Consequently, the supremum of the above expression is attained in the limit $\mu\rightarrow-\exp(h_{k_{0}}(x))$, and the excess surrogate loss is therefore lower-bounded by

lexp(h,w;x)infhlexp(h,w;x)\displaystyle l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x) (wkh(x)wk0(x))exp(hkh(x))p=1Kexp(hp(x))\displaystyle\geq\big{(}w_{k_{h}}(x)-w_{k_{0}}(x)\big{)}\frac{\exp(h_{k_{h}}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}
1K[wkh(x)wk0(x)]\displaystyle\geq\frac{1}{K}\big{[}w_{k_{h}}(x)-w_{k_{0}}(x)\big{]}
=1K[l(h,w;x)infhl(h,w;x)].\displaystyle=\frac{1}{K}\big{[}l(h,w;x)-\inf_{h}l(h,w;x)\big{]}.

The last inequality follows because $k_{h}$ is the class label with the highest score $h_{k_{h}}$, so that $\frac{\exp(h_{k_{h}}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\geq\frac{1}{K}$ for any fixed $x\in\mathcal{X}$. Taking the expectation over $X$ on both sides completes the proof.
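To make the two risks in Theorem 5 concrete, the following Python sketch (hypothetical weights and scores, not the paper's implementation) evaluates the cost-sensitive classification loss, taken here as the weight of the predicted class as in the excess-risk identity above, together with its softmax surrogate at a single $x$.

```python
# Sketch (hypothetical weights and scores, assumes numpy): the cost-sensitive loss,
# taken as the weight of the predicted class, and its softmax surrogate
# l_exp(h, w; x) = sum_k w_k(x) * softmax_k(h(x)).
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())          # subtract the max for numerical stability
    return e / e.sum()

w = np.array([0.50, 0.30, 0.45])     # weights w_k(x) at a fixed x
h = np.array([1.2, -0.3, 0.4])       # candidate scores h_k(x)

pred = int(h.argmax())
loss_cls = w[pred]                   # cost-sensitive classification loss at x
loss_exp = float(w @ softmax(h))     # cost-sensitive softmax surrogate at x
excess = w[pred] - w.min()           # pointwise excess: w_{k_h} - w_{k_0}
print(pred, loss_cls, loss_exp, excess)
```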

Proof of Proposition 1.

For the weighted classification risk $\mathcal{R}(h,w)$, if we choose $\tilde{h}(x)=w(x)$ for $x\in\mathcal{X}$, we simply have $\mathcal{R}(\tilde{h},w)=0$, which shows that the Bayes optimal risk equals zero, that is, $\inf_{h}\mathcal{R}(h,w)=0$. Therefore, the excess weighted classification risk can be written as

(h,w)infh(h,w)=(h,w)0=𝔼[|w(X)|𝕀{sgn[h(X)]sgn[w(X)]}].\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w)=\mathcal{R}(h,w)-0=\mathbb{E}\Big{[}|w(X)|\cdot\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(X)]\neq\operatorname*{\mathrm{sgn}}[w(X)]\}\Big{]}.

Here we use the fact that the minimizer $\tilde{h}(x)=w(x)$ has the same sign as the weight function $w(x)$ for any fixed $x\in\mathcal{X}$.

For the $\phi$-risk introduced in Equation 6.3, we consider the following surrogate loss functions:

\begin{aligned} &\text{Hinge loss}:&&\phi(\alpha)=\max\{1-\alpha,0\}\\ &\text{logistic loss}:&&\phi(\alpha)=\log_{2}(1+e^{-\alpha})\\ &\text{exponential loss}:&&\phi(\alpha)=e^{-\alpha}.\\ \end{aligned}

Notice that for any $\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}$, the surrogate loss $\phi(\alpha)$ is bounded below with $\inf_{\alpha}\phi(\alpha)=0$, so that $\inf_{\alpha}\mathbb{E}[\phi(\alpha(X))]=\mathbb{E}[\inf_{\alpha}\phi(\alpha(X))]=0$. Therefore, the excess $\phi$-risk takes the form

\mathcal{R}_{\phi}(h,w)-\inf_{h}\mathcal{R}_{\phi}(h,w)=\mathcal{R}_{\phi}(h,w)-0=\mathbb{E}\Big{[}|w(X)|\cdot\phi\big{(}h(X)\cdot\operatorname*{\mathrm{sgn}}[w(X)]\big{)}\Big{]}.

In the sequel, we only need to check that the surrogate loss $\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])$ upper-bounds the $0$-$1$ loss $\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}$ for each $\phi$ in $\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}$. We fix the weight function $w$ and the observed features $x\in\mathcal{X}$ in the following discussion.

• Consider the case where $\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]$, so that $\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}=1$. For the $\phi$-risk, we have

    ϕ(h(x)sgn[w(x)])=ϕ(|h(x)|)\phi\big{(}h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)]\big{)}=\phi(-|h(x)|)

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}. Concretely, we have

    ϕHinge(|h(x)|)\displaystyle\phi_{\operatorname{Hinge}}(-|h(x)|) =(1+|h(x)|)+1,\displaystyle=(1+|h(x)|)^{+}\geq 1,
    ϕlogistic(|h(x)|)\displaystyle\phi_{\operatorname{logistic}}(-|h(x)|) =log2(1+e|h(x)|)1,\displaystyle=\log_{2}(1+e^{|h(x)|})\geq 1,
    ϕexponential(|h(x)|)\displaystyle\phi_{\operatorname{exponential}}(-|h(x)|) =e|h(x)|1.\displaystyle=e^{|h(x)|}\geq 1.

    Therefore, we have

    ϕ(h(x)sgn[w(x)])𝕀{sgn[h(x)]sgn[w(x)]}\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])\geq\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}.

• Consider the case where $\operatorname*{\mathrm{sgn}}[h(x)]=\operatorname*{\mathrm{sgn}}[w(x)]$, so that $\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}=0$. For the $\phi$-risk, we have

    ϕ(h(x)sgn[w(x)])=ϕ(|h(x)|)\phi\big{(}h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)]\big{)}=\phi(|h(x)|)

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}. Concretely, we have

    ϕHinge(|h(x)|)\displaystyle\phi_{\operatorname{Hinge}}(|h(x)|) =(1|h(x)|)+0,\displaystyle=(1-|h(x)|)^{+}\geq 0,
    ϕlogistic(|h(x)|)\displaystyle\phi_{\operatorname{logistic}}(|h(x)|) =log2(1+e|h(x)|)0,\displaystyle=\log_{2}(1+e^{-|h(x)|})\geq 0,
    ϕexponential(|h(x)|)\displaystyle\phi_{\operatorname{exponential}}(|h(x)|) =e|h(x)|0.\displaystyle=e^{-|h(x)|}\geq 0.

    Hence, we have

    ϕ(h(x)sgn[w(x)])𝕀{sgn[h(x)]sgn[w(x)]}\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])\geq\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}.

Finally, multiplying by the common weight $|w(X)|$ and taking the expectation over $X$, we conclude that when the weight function $w$ is fixed, the excess $\phi$-risk always upper-bounds the excess weighted risk for any $\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}$:

ϕ(h,w)infhϕ(h,w)(h,w)infh(h,w).\mathcal{R}_{\phi}(h,w)-\inf_{h}\mathcal{R}_{\phi}(h,w)\geq\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w).
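The following short numerical check (toy values, assuming numpy) verifies on a small grid that each surrogate used above dominates the $0$-$1$ loss, i.e., $\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])\geq\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}$.

```python
# Quick numerical check (toy values, assumes numpy) that each surrogate phi used in
# Proposition 1 dominates the 0-1 loss: phi(h(x) * sgn[w(x)]) >= 1{sgn[h] != sgn[w]}.
import numpy as np

phis = {
    "hinge": lambda a: np.maximum(1 - a, 0),
    "logistic": lambda a: np.log2(1 + np.exp(-a)),   # base-2 logistic, as in the proof
    "exponential": lambda a: np.exp(-a),
}

h = np.array([-2.0, -0.3, 0.1, 1.5])   # scores h(x) on a small grid of x values
w = np.array([0.7, -0.2, -0.4, 0.9])   # weights w(x); only the sign matters here

zero_one = (np.sign(h) != np.sign(w)).astype(float)
for name, phi in phis.items():
    assert np.all(phi(h * np.sign(w)) >= zero_one), name
print("all three surrogates dominate the 0-1 loss on this grid")
```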

Appendix D Generalization Error Bound

In this section, we theoretically analyze the performance of the UCL algorithm in Algorithm 1 by deriving a generalization error bound for the output score function 𝒉^\widehat{\boldsymbol{h}}. Specifically, Theorem 5 establishes a calibration bound for the cost-sensitive excess risk (𝒉^)inf𝒉(𝒉)\mathcal{R}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}(\boldsymbol{h}) induced by the output score function. To leverage this result, we first derive a generalization error bound for the excess surrogate risk exp(𝒉^)inf𝒉exp(𝒉)\mathcal{R}_{\exp}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}_{\exp}(\boldsymbol{h}).

To facilitate the analysis, we define the surrogate risk induced by the estimated weight functions $\widehat{\boldsymbol{w}}^{[l]}$ from batch $l\in[L]$:

exp(𝒉,𝒘^[l]):=𝔼[k=1Kw^k[l](X)exp(hk(X))p=1Kexp(hp(X))].\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]}):=\mathbb{E}\Big{[}\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X)\cdot\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\Big{]}.

The lemma below decomposes an upper bound on the excess surrogate risk into two components: the estimation error of the nuisance functions $\{w_{k}\}_{k=1}^{K}$ and the empirical process error.

Lemma 2.

The excess surrogate risk ():=exp(𝐡^)inf𝐡exp(𝐡)\mathcal{E}(\mathcal{H}):=\mathcal{R}_{\exp}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}_{\exp}(\boldsymbol{h}) satisfies

()2Ll=1Lsup𝒉|exp(𝒉,𝒘^[l])exp(𝒉,𝒘)|Nuisance Estimation Error+2Ll=1Lsuph|^exp(𝒉,𝒘^[l])exp(𝒉,𝒘^[l])|Empirical Process Error.\mathcal{E}(\mathcal{H})\leq\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{\boldsymbol{h}\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})\Big{|}}_{\text{Nuisance Estimation Error}}+\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{h\in\mathcal{H}}\Big{|}\widehat{\mathcal{R}}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})\Big{|}}_{\text{Empirical Process Error}}.

The first term in Lemma 2 captures the error arising from the estimation of the unknown weight function 𝒘\boldsymbol{w}, which is evaluated on the true excess surrogate risk with the supremum over all 𝒉\boldsymbol{h}\in\mathcal{H}. The second term reflects the error incurred by the estimated score function 𝒉^\widehat{\boldsymbol{h}}, evaluated as the supremum of an empirical process. To bound these two terms, we need to further specify the estimation errors of the nuisance estimators, as we will do in Assumptions 5 and 6.

Assumption 5 (Nuisances Estimation).

We assume that the true weight functions $\{w_{k}\}_{k=1}^{K}$ and their estimates $\{\widehat{w}_{k}^{[1]},\dots,\widehat{w}_{k}^{[L]}\}_{k=1}^{K}$ satisfy the following conditions:

1.

    they are uniformly bounded by some constant CwC_{w} over 𝒳\mathcal{X}, namely, wk<Cw,w^k[l]<Cw\|w_{k}\|_{\infty}<C_{w},\|\widehat{w}_{k}^{[l]}\|_{\infty}<C_{w} for all k[K]k\in[K] and l[L]l\in[L];

2.

    there exists γ>0\gamma>0 such that

    𝔼[𝒘^[l](X)𝒘(X)1]=𝒪(Nγ),l[L].\mathbb{E}\big{[}\|\widehat{\boldsymbol{w}}^{[l]}(X)-\boldsymbol{w}(X)\|_{1}\big{]}=\mathcal{O}(N^{-\gamma}),~{}~{}\forall~{}l\in[L].

Assumption 5 requires that the nuisance functions {wk}k=1K\{w_{k}\}_{k=1}^{K} and their estimates {w^k[l]}k=1K\{\widehat{w}_{k}^{[l]}\}_{k=1}^{K} are uniformly bounded over the feature space 𝒳\mathcal{X} and that the estimation error of the nuisance functions decays at a rate of 𝒪(Nγ)\mathcal{O}(N^{-\gamma}) for some γ>0\gamma>0.

Assumption 6 (Function Class Complexity).

Let $C_{h}>0$ be a constant. For any score function candidate $h\in\mathcal{H}$, we assume $\|h_{k}\|_{\infty}<C_{h}$ for any $k\in[K]$. Moreover, assume that the $\epsilon$-covering number of $\mathcal{H}$ with respect to the metric $\rho$, denoted $\mathcal{N}(\epsilon,\mathcal{H},\rho)$, satisfies $\log\mathcal{N}(\epsilon,\mathcal{H},\rho)\leq\epsilon^{-\tau}$ for any $\epsilon\in(0,1)$ and some constant $0<\tau<2$.

In Assumption 6, we further restrict the complexity of the hypothesis class $\mathcal{H}$ by imposing boundedness on each component of the score function $h$ and limiting the growth rate of its covering number. The boundedness condition is reasonable: if any component of the score function, say $h_{k}$, diverges to infinity, then the $\operatorname*{arg\,max}$ operation cannot yield a reasonable classifier. Limiting the growth rate of the covering number is a common way to quantify the complexity of a function class in statistical learning theory [Massart and Nédélec, 2006, Shalev-Shwartz and Ben-David, 2014, Wainwright, 2019].

Nuisance Estimation Error

To provide a finite-sample error bound for the estimation error, we start with bounding the nuisance estimation error introduced in Lemma 2.

Proposition 3 (Nuisance Estimation Error).

Suppose Assumption 5 and Assumption 6 hold. We have, for any l[L]l\in[L],

suph|exp(𝒉,𝒘^[l])exp(𝒉,𝒘)|=𝒪(Nγ).\sup_{h\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})\Big{|}=\mathcal{O}(N^{-\gamma}).

Empirical Process Error

We now consider bounding the supremum of the empirical process introduced in Lemma 2. We focus on the empirical process defined on the $l$-th batch, that is,

^exp(𝒉,𝒘^[l])exp(𝒉,𝒘^[l])=1|Il|iIlk=1Kw^k[l](Xi)exp(hk(Xi))p=1Kexp(hp(Xi))𝔼[k=1Kw^k[l](X)exp(hk(X))p=1Kexp(hp(X))].\widehat{\mathcal{R}}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})=\frac{1}{|I_{l}|}\sum_{i\in I_{l}}\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X_{i})\frac{\exp(h_{k}(X_{i}))}{\sum_{p=1}^{K}\exp(h_{p}(X_{i}))}-\mathbb{E}\left[\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X)\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right].

As the number of cross-fitting batches $L$ in Algorithm 1 is usually much smaller than the sample size $N$, here we simply assume $|I_{l}|\approx N$. Note that the estimated weight function in the $l$-th batch, $\widehat{\boldsymbol{w}}^{[l]}$, is a random function that depends on the observed variables $(Y,D,X,Z)$. To facilitate our analysis, we denote a new variable $V=(Y,D,X,Z)$ over the domain $\mathcal{V}=\mathcal{Y}\times\mathcal{D}\times\mathcal{X}\times\mathcal{Z}$ with joint distribution $P_{V}$, and we omit the superscript $[l]$ for simplicity in the sequel. For given estimates $\{\widehat{w}_{k}\}_{k=1}^{K}$, we define the surrogate loss class $\mathcal{F}_{\mathcal{H}}$ as

:={fh:vk=1Kw^k(x)exp(hk(x))p=1Kexp(hp(x))|h}\mathcal{F}_{\mathcal{H}}:=\left\{f_{h}:v\mapsto\sum_{k=1}^{K}\hat{w}_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}~{}\Big{|}~{}h\in\mathcal{H}\right\} (D.1)

To simplify our analysis, we use the shorthand notation $Pf=\mathbb{E}[f(V)]$ and $P_{N}f=\frac{1}{N}\sum_{i=1}^{N}f(V_{i})$, where $P_{N}$ is the empirical measure associated with the i.i.d. sample of size $N$. Then the supremum of the empirical process can be written as

\|P_{N}f-Pf\|_{\mathcal{F}_{\mathcal{H}}}:=\sup_{f\in\mathcal{F}_{\mathcal{H}}}P_{N}f-Pf=\sup_{h\in\mathcal{H}}\widehat{\mathcal{R}}_{\exp}(h,\widehat{w})-\mathcal{R}_{\exp}(h,\widehat{w}). (D.2)

Therefore, our goal is now to provide a uniform bound for the empirical process $P_{N}f-Pf$ over the function class $\mathcal{F}_{\mathcal{H}}$. To this end, note that $\mathcal{F}_{\mathcal{H}}$ is a uniformly bounded class: for any $f\in\mathcal{F}_{\mathcal{H}}$, the norm $\|f\|_{\infty}:=\sup_{v\in\mathcal{V}}|f(v)|$ is bounded. Indeed, Assumption 5 gives $\|\widehat{w}_{k}\|_{\infty}=\sup_{x\in\mathcal{X}}|\widehat{w}_{k}(x)|<C_{w}$ for any $k\in[K]$. Consequently, we have

|f(v)|=|k=1Kw^k(x)exp(hk(x))p=1Kexp(hp(x))|\displaystyle|f(v)|=\left|\sum_{k=1}^{K}\widehat{w}_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right| k=1K|w^k(x)exp(hk(x))p=1Kexp(hp(x))|\displaystyle\leq\sum_{k=1}^{K}\left|\widehat{w}_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right|
k=1K|w^k(x)||exp(hk(x))p=1Kexp(hp(x))|\displaystyle\leq\sum_{k=1}^{K}\Big{|}\widehat{w}_{k}(x)\Big{|}\cdot\left|\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right|
KCw:=Cf.\displaystyle\leq KC_{w}:=C_{f}.

Now, given the i.i.d. sample $\{V_{1},\dots,V_{N}\}$, we define the empirical Rademacher complexity of the class $\mathcal{F}_{\mathcal{H}}$ as

\widehat{R}_{N}(\mathcal{F}_{\mathcal{H}}):=\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}_{\mathcal{H}}}\left|\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}f(V_{i})\right|\right], (D.3)

where $(\sigma_{1},\dots,\sigma_{N})$ is a vector of i.i.d. Rademacher variables taking values in $\{-1,+1\}$ with equal probability. The empirical Rademacher complexity provides an upper bound for the supremum of the empirical process in Equation D.2; see the lemma below.

Lemma 3 (Symmetrization Bound, Wainwright [2019] Theorem 4.10).

For the CfC_{f}-uniformly bounded class of function \mathcal{F}_{\mathcal{H}} , with probability at least 1δ1-\delta, we have

PNfPf2R^N()+Cf2log(1/δ)N.\|P_{N}f-Pf\|_{\mathcal{F}_{\mathcal{H}}}\leq 2\widehat{R}_{N}(\mathcal{F}_{\mathcal{H}})+C_{f}\sqrt{\frac{2\log(1/\delta)}{N}}.

Define the norm L2(PN)L_{2}(P_{N}) for the function ff\in\mathcal{F}_{\mathcal{H}} such that fgL2(PN):=1Ni=1N(f(Vi)g(Vi))2\|f-g\|_{L_{2}(P_{N})}:=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\big{(}f(V_{i})-g(V_{i})\big{)}^{2}} for any function f,gf,g\in\mathcal{F}_{\mathcal{H}}. As the function class \mathcal{F}_{\mathcal{H}} is CfC_{f}-uniformly bounded, we then have

supf,gfgL2(PN)2Cf.\sup_{f,g\in\mathcal{F}_{\mathcal{H}}}\|f-g\|_{L_{2}(P_{N})}\leq 2C_{f}.

Moreover, let $\mathcal{N}\big{(}\epsilon,\mathcal{F}_{\mathcal{H}},L_{2}(P_{N})\big{)}$ be the covering number of the class $\mathcal{F}_{\mathcal{H}}$ with respect to the metric induced by the $L_{2}(P_{N})$ norm. According to the definition of $\mathcal{F}_{\mathcal{H}}$ in Equation D.1, we have $\mathcal{N}\big{(}\epsilon,\mathcal{F}_{\mathcal{H}},L_{2}(P_{N})\big{)}\leq\mathcal{N}\big{(}\epsilon,\mathcal{H},L_{2}(P_{N})\big{)}$. Combined with the growth-rate condition on $\mathcal{N}\big{(}\epsilon,\mathcal{H},L_{2}(P_{N})\big{)}$ in Assumption 6, this yields the following Dudley entropy-integral bound for the empirical Rademacher complexity defined in Equation D.3.

Lemma 4 (Dudley’s Entropy Integral).

Assume Assumption 6 holds with $0<\tau<2$. Given that $\sup_{f,g\in\mathcal{F}_{\mathcal{H}}}\|f-g\|_{L_{2}(P_{N})}\leq 2C_{f}$, there exists a constant $C_{0}>0$ such that the empirical Rademacher complexity is almost surely upper-bounded by Dudley's entropy integral:

R^N()C0N02Cf𝑑ϵlog𝒩(ϵ,,L2(PN))C0N02Cfϵτ/2𝑑ϵ=2C0(2Cf)1τ/22τN1/2.\widehat{R}_{N}(\mathcal{F}_{\mathcal{H}})\leq\frac{C_{0}}{\sqrt{N}}\int_{0}^{2C_{f}}d\epsilon\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F}_{\mathcal{H}},L_{2}(P_{N}))}\leq\frac{C_{0}}{\sqrt{N}}\int_{0}^{2C_{f}}\epsilon^{-\tau/2}d\epsilon=\frac{2C_{0}(2C_{f})^{1-\tau/2}}{2-\tau}N^{-1/2}.

Under Assumption 6, we can now bound the empirical process error in Lemma 2 by combining Lemma 3 and Lemma 4: with probability at least $1-\delta$, we have

sup𝒉{^exp(𝒉,𝒘^)exp(𝒉,𝒘^)}=PnfPf=𝒪(N1/22log(1/δ)).\sup_{\boldsymbol{h}\in\mathcal{H}}\left\{\widehat{\mathcal{R}}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}})-\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}})\right\}=\|P_{n}f-Pf\|_{\mathcal{F}_{\mathcal{H}}}=\mathcal{O}\big{(}N^{-1/2}\sqrt{2\log(1/\delta)}\big{)}. (D.4)
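For intuition, the toy Monte Carlo sketch below (hypothetical data and a small finite stand-in for the class $\mathcal{H}$, not the paper's setup) estimates the empirical Rademacher complexity in Equation D.3 for the surrogate-loss class $\mathcal{F}_{\mathcal{H}}$.

```python
# Toy Monte Carlo sketch (hypothetical data, finite stand-in for H, assumes numpy):
# estimate the empirical Rademacher complexity in Equation D.3 for the class F_H.
import numpy as np

rng = np.random.default_rng(0)
N, K, n_candidates, n_mc = 200, 3, 50, 500

X = rng.normal(size=(N, 2))                           # hypothetical features
w_hat = rng.uniform(0, 1, size=(N, K))                # estimated weights w_hat_k(X_i)

# Finite stand-in for H: random linear score functions h_k(x) = x @ theta_k.
thetas = rng.normal(size=(n_candidates, 2, K))
scores = np.einsum("nd,cdk->cnk", X, thetas)          # (candidate, sample, class)
probs = np.exp(scores) / np.exp(scores).sum(axis=2, keepdims=True)
f_vals = (w_hat[None] * probs).sum(axis=2)            # f_h(V_i) for each candidate h

sigma = rng.choice([-1.0, 1.0], size=(n_mc, N))       # Rademacher draws
sup_abs = np.abs(sigma @ f_vals.T / N).max(axis=1)    # sup over candidates, per draw
print("estimated empirical Rademacher complexity:", sup_abs.mean())
```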

Generalization Error Bound for Excess Cost-sensitive Risk

Combining the calibration bound in Theorem 5, the decomposition of the excess surrogate risk in Lemma 2, the nuisance estimation error bound in Proposition 3, and the empirical process error bound in Equation D.4, we can now provide a generalization error bound for the original excess cost-sensitive risk $\mathcal{R}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}(\boldsymbol{h})$ induced by the output score function $\widehat{\boldsymbol{h}}$ in Algorithm 1. Formally, with probability at least $1-\delta$, we have

(𝒉^,w)inf𝒉(𝒉,𝒘)=𝒪(Kmax{Nγ,N1/2log(1/δ)}).\mathcal{R}(\widehat{\boldsymbol{h}},w)-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}(\boldsymbol{h},\boldsymbol{w})=\mathcal{O}\Big{(}K\cdot\max\Big{\{}N^{-\gamma},N^{-1/2}\sqrt{\log(1/\delta)}\Big{\}}\Big{)}. (D.5)

This result shows that the excess cost-sensitive risk of the output score function $\widehat{\boldsymbol{h}}$ converges to the best-in-class risk at a rate of $K\cdot N^{-1/2}$ whenever the nuisance rate satisfies $\gamma\geq 1/2$, which matches the standard rate for multiclass classification in statistical learning theory [Massart and Nédélec, 2006, Shalev-Shwartz and Ben-David, 2014, Wainwright, 2019].

D.1 Remaining Proofs

Proof of Lemma 2.

Let $h_{\mathcal{H}}^{\star}\in\mathcal{H}$ denote the best-in-class score function that minimizes the surrogate risk $\mathcal{R}_{\exp}(h,w)$. We use $w$ to denote the true weight function and $\widehat{w}^{[l]}$ to denote its estimate from batch $l\in[L]$. By definition, we have

()\displaystyle\mathcal{E}(\mathcal{H}) =exp(h^,w)exp(h,w)\displaystyle=\mathcal{R}_{\exp}(\hat{h},w)-\mathcal{R}_{\exp}(h_{\mathcal{H}}^{\star},w)
=exp(h^,w)1Ll=1Lexp(h^,w^[l])+1Ll=1Lexp(h^,w^[l])1Ll=1L^exp(h^,w^[l])\displaystyle={\color[rgb]{.5,0,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,0,.5}\mathcal{R}_{\exp}(\hat{h},w)-\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(\hat{h},\widehat{w}^{[l]})}+{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(\hat{h},\widehat{w}^{[l]})-\frac{1}{L}\sum_{l=1}^{L}\widehat{\mathcal{R}}_{\exp}(\hat{h},\widehat{w}^{[l]})}
+1Ll=1L^exp(h^,w^[l])1Ll=1Lexp(h,w^[l])+1Ll=1Lexp(h,w^[l])Rexp(h,w)\displaystyle+{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}\frac{1}{L}\sum_{l=1}^{L}\widehat{\mathcal{R}}_{\exp}(\hat{h},\hat{w}^{[l]})-\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(h_{\mathcal{H}}^{\star},\widehat{w}^{[l]})}+{\color[rgb]{.5,0,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,0,.5}\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(h_{\mathcal{H}}^{\star},\widehat{w}^{[l]})-R_{\exp}(h_{\mathcal{H}}^{\star},w)}
2Ll=1Lsuph|exp(h,w^[l])exp(h,w)|Nuisance Estimation Error+2Ll=1Lsuph|^exp(h,w^[l])exp(h,w^[l])|Empirical Process Error.\displaystyle\leq{\color[rgb]{.5,0,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,0,.5}\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{h\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,w)\Big{|}}_{\text{Nuisance Estimation Error}}}+{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{h\in\mathcal{H}}\Big{|}\widehat{\mathcal{R}}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})\Big{|}}_{\text{Empirical Process Error}}}.

Proof of Proposition 3.

Notice that for any function hh\in\mathcal{H} and batch l[L]l\in[L],

|exp(h,w^[l])exp(h,w)|\displaystyle\Big{|}\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,w)\Big{|} =|𝔼[k=1K(w^k[l](X)wk(X))exp(hk(X))p=1Kexp(hp(X))]|\displaystyle=\left|\mathbb{E}\left[\sum_{k=1}^{K}\Big{(}\widehat{w}_{k}^{[l]}(X)-w_{k}(X)\Big{)}\cdot\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right]\right|
𝔼[k=1K|(w^k[l](X)wk(X))exp(hk(X))p=1Kexp(hp(X))|]\displaystyle\leq\mathbb{E}\left[\sum_{k=1}^{K}\left|\Big{(}\widehat{w}_{k}^{[l]}(X)-w_{k}(X)\Big{)}\cdot\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right|\right]
𝔼[k=1K|w^k(X)wk(X)||exp(hk(X))p=1Kexp(hp(X))|]\displaystyle\leq\mathbb{E}\left[\sum_{k=1}^{K}\Big{|}\hat{w}_{k}(X)-w_{k}(X)\Big{|}\cdot\left|\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right|\right]
𝔼[k=1K|w^k(X)wk(X)|]=𝔼w^(X)w(X)1.\displaystyle\leq\mathbb{E}\Big{[}\sum_{k=1}^{K}\Big{|}\hat{w}_{k}(X)-w_{k}(X)\Big{|}\Big{]}=\mathbb{E}\Big{\|}\hat{w}(X)-w(X)\Big{\|}_{1}.

The first inequality follows from Jensen's inequality applied to the convex absolute value function $|\cdot|$. The second step uses $|ab|=|a|\cdot|b|$, and the third inequality follows from the fact that each $\operatorname{softmax}$ component is bounded by $1$ for all $x\in\mathcal{X}$ and $h\in\mathcal{H}$. Finally, combining the above with Assumption 5, we have, for every batch $l\in[L]$,

suph|exp(h,w^[l])exp(h,w)|=𝒪(Nγ).\sup_{h\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,w)\Big{|}=\mathcal{O}(N^{-\gamma}).