
Learning with Selectively Labeled Data from Multiple Decision-makers

Jian Chen,  Zhehao Li,  Xiaojie Mao (alphabetical order; email: lizh21@mails.tsinghua.edu.cn)
( Tsinghua University )
Abstract

We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings, and we theoretically and numerically validate the efficacy of the proposed method.

1 Introduction

The problem of selective labels is common in many decision-making applications involving human subjects. In these applications, each individual receives a certain decision, which in turn determines whether the individual’s outcome label is observed. For example, in judicial bail, the label of interest is whether a defendant returns to court without committing another crime if released. However, this label cannot be observed if bail is denied. Similarly, in lending, the default status of a loan applicant cannot be observed if the loan is not approved. In hiring, a candidate’s job performance cannot be observed if the candidate is not hired.

The selective label problem presents significant challenges for developing effective machine learning (ML) algorithms to support decision-making [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2021]. Specifically, labeled data may not be representative of the broader population to which decisions will be applied, creating a classic selection bias issue. As a result, models trained on selectively labeled data may perform well on the observed subjects but fail to generalize to the full population when deployed in real-world settings. This challenge becomes even more critical when historical decisions relied on unobservable factors not captured in the available data, as is common when human decision-makers are involved. In such cases, the labeled data can differ substantially from the broader population due to unknown factors, further complicating the development of reliable ML models.

In this paper, we address the selective label problem by leveraging the heterogeneity of decision-making processes in many real-world applications. Specifically, we consider settings where decisions are made by multiple decision-makers with distinct decision rules, potentially leading to different outcomes for the same subject. Furthermore, since decision-makers often evaluate similar pools of subjects, each subject can be viewed as being quasi-randomly assigned to a specific decision-maker. This structure offers a way to mitigate the selective label problem: a subject who remains unlabeled under one decision-maker might have been labeled if assigned to another. For instance, judicial decisions are typically made by different judges who vary in their leniency when deciding whether to release the same defendant. The heterogeneous decision-maker framework has also been explored in prior research to address selection bias [e.g., Lakkaraju et al., 2017, Kleinberg et al., 2018, Rambachan et al., 2021, Arnold et al., 2022].

1.1 Main Contributions

In this paper, we leverage the instrumental variable framework in causal inference to formalize the selective label problem with multiple heterogeneous decision-makers. This enables principled identification analyses to characterize the identifiability of classification risks from the observed data, revealing fundamental power and limits of the multiple decision-maker structure in tackling the selective label problem. This provides valuable insights beyond the heuristic approaches in the prior literature.

Specifically, we describe the problem setup and the IV framework in Section 3. Then in Section 4 we establish identification conditions for the classification risk from the observed data with multiple decision-makers and highlight that the required identification assumptions can be overly strong. In Section 5, we further derive tight bounds on the plausible values of the classification risk when exact identification fails. In Section 6, we propose a Unified Cost-sensitive Learning (UCL) algorithm to learn classifiers robust to selection bias in both identification settings. We demonstrate the superior performance of our proposed method through comprehensive numerical experiments on synthetic and semi-synthetic datasets in Section 7. We defer theoretical guarantees for our method to Appendix D.

Notations

Define x^{+}=\max\{x,0\}, x\land y=\min\{x,y\} and x\lor y=\max\{x,y\} for x,y\in\mathbb{R}. We use the symbol [N] as a shorthand for the set \{1,2,\cdots,N\} and |\cdot| to denote the absolute value of a number or the cardinality of a set.

2 Related Work

This work is closely related to the growing literature on the following topics: the selective labels problem, identification analysis with IVs, and individual treatment rule learning.

Selective Labels Problem

Kleinberg et al. [2018], Lakkaraju et al. [2017] formalize the selective labels problem in predicting recidivism within the judicial bail context. They train a prediction algorithm on selectively labeled datasets and evaluate whether the algorithm can outperform human decisions when used to assist decision-making. By exploiting the random assignment of judges to defendants, they group defendants based on judge assignment, ensuring comparable demographic features across groups. They also leverage variations in judges’ release rates, where defendants detained by stricter judges might have been released by more lenient ones. Using a heuristic “contraction” technique, they compare the performance of the algorithm-assisted decisions combined with the most lenient judge’s decisions against the stricter judges’ decisions. Fixing the release rate for both methods, they evaluate the recidivism rates and demonstrate that the prediction algorithm can outperform human decisions in some judge groups.

Bertsimas and Fazel-Zarandi [2022] explore the application of machine learning algorithms to assist public decision-making in immigration enforcement. They leverage the fact that most defendants are randomly assigned to federal and state prisons after conviction, enabling the grouping of defendants based on prison assignments. Additionally, they exploit variations in release rates among prisons due to differences in geographic proximity to ICE offices, allowing for the matching of counterfactual outcomes for defendants detained by stricter prisons. Unlike Kleinberg et al. [2018], Lakkaraju et al. [2017], who focus on one-sided evaluations (assessing algorithmic predictions on individuals not deported), Bertsimas and Fazel-Zarandi [2022] conduct both one-sided and two-sided evaluations. They not only evaluate the predictions for defendants not deported but also establish partial bounds for the recidivism rates of individuals released by the algorithm who were actually deported by ICE officers (for whom labels are missing). Their findings provide evidence that machine learning algorithms can sometimes outperform human officers in decision-making.

The aforementioned works focus on evaluating the performance of machine learning algorithms on selectively labeled data. However, the core challenge lies in training prediction algorithms that are robust to the selection bias inherent in such datasets, and this is what we focus on in this paper. To address this, De-Arteaga et al. [2018, 2021] propose imputing missing labels as negative when multiple human decision-makers consistently reject a decision, leading to missing labels. While this approach offers a partial solution, it is inherently one-sided, as some missing labels may actually correspond to positive outcomes. Consequently, this method does not fully address the selection bias present in the observed data.

There is also a stream of literature that considers online learning to correct the selection bias in the labels. Kilbertus et al. [2020], Wei [2021], Yang et al. [2022] consider a learn-to-decide approach to address the selective label problem by implementing a stochastic policy/online algorithm that balances exploration of the entire population with exploitation of the selectively labeled dataset. While effective, this approach is resource-intensive and may be impractical for high-cost decision-making scenarios, such as judicial bail determinations or school admissions decisions.

Finally, our work relates to the literature on counterfactual evaluation under unmeasured confounding [Coston et al., 2020a, b, Rambachan et al., 2021, 2022]. Specifically, Rambachan et al. [2022] address the problem of risk assessment under unmeasured confounding by proposing a mean outcome sensitivity model. This meta-framework translates bounds on the average treatment effect (ATE) into bounds on various risk metrics. They outline two approaches to specify the bounding function for ATE: the outcome regression bound, which involves regressing the outcome on the observed features and treatment, and the instrumental variable (IV) bound, which aligns with our concept of IV partial bounds. However, our IV partial bounds are slightly tighter than their proposed bounds, as we refine them by taking the maximum over all j\in[J] for the lower bound l_{k} and the minimum over all j\in[J] for the upper bound u_{k}. We further prove the tightness of these IV partial bounds. The key distinction between our work and that of Rambachan et al. [2022] lies in focus. While they emphasize estimating mean outcome sensitivity bounds using doubly robust methods to derive bounds for various risk metrics, our work centers on training robust prediction models based on the estimated IV partial bounds. This ensures that the resulting models are resilient to selection bias in the observed training data.

Identification Analysis with IV

This work builds on the IV literature in statistics and econometrics. IV is a common and powerful tool to handle unmeasured confounding in causal inference [Angrist and Pischke, 2009, Angrist and Imbens, 1995, Angrist et al., 1996]. Indeed, some empirical studies have already considered heterogeneous decision-makers as an IV, such as the judge IV in Kling [2006], Mueller-Smith [2015]. However, these analyses are based on linear parametric models, while our identification analysis is free of any parametric restrictions. The point identification analysis builds on Wang and Tchetgen [2018], Cui and Tchetgen Tchetgen [2021] and the partial identification analysis builds on Manski and Pepper [1998], Swanson et al. [2018], Balke and Pearl [1994]. Our paper directly extends these results to the evaluation of prediction risks under multi-valued selective labels with a multi-valued IV, and develops novel reformulations that enable efficient learning of prediction rules via a unified procedure.

Individual Treatment Rule (ITR) Learning

There is a growing body of literature on learning individualized treatment rules (ITR), which seeks to determine the optimal treatment for each individual based on their observed features. This field has been extensively studied in personalized medicine and policy-making [Qian and Murphy, 2011, Zhao et al., 2012, Athey and Wager, 2021]. Qian and Murphy [2011] introduce an inverse propensity weighting (IPW) formulation for ITR learning, parameterizing the value function with a linear model and obtaining a linear ITR through l_{1}-penalized square loss minimization. Zhao et al. [2012] also adopt the IPW framework but treat ITR learning as a classification problem, proposing the hinge loss as a surrogate and solving for the optimal ITR using an SVM algorithm. Zhang et al. [2012] develop a doubly robust estimator for the value function and formalize ITR learning as a classification problem, using the square loss as a surrogate for the indicator function. Extending to multi-category problems, Zhang et al. [2020] propose the outcome-weighted learning (OWL) formulation and analyze the Fisher consistency of various surrogate loss functions. We demonstrate that learning under selective labels can similarly be formulated as a cost-sensitive or weighted classification problem when point identification is available. Furthermore, we propose several surrogate loss functions for learning the optimal classification rule and establish the finite-sample error bound for our unified learning algorithm.

Our work is also connected to the research on policy learning under partial identification. Ben-Michael et al. [2021] address robust policy learning under partial identification, focusing specifically on violations of the overlap assumption, where some instances have zero probability of receiving treatment. Kallus and Zhou [2021] investigate off-policy learning in the presence of unmeasured confounding, utilizing an IPW formulation for the value function. They address scenarios where the true propensity score is only partially bounded and propose a minimax optimization framework to learn robust policies. Their approach considers a differentiable, parameterized policy class and employs an iterative algorithm to solve the inner maximization via the dual problem, compute the gradient of the policy value, and update the policy parameters. In contrast, our work demonstrates that the inner maximization in our setting can be solved explicitly. This reformulation allows the minimax learning problem to be cast as a cost-sensitive classification task, enabling the use of state-of-the-art machine learning algorithms to efficiently solve the optimization problem.

Particularly relevant to our work are Cui and Tchetgen Tchetgen [2021] and Pu and Zhang [2021], who study ITR learning with instrumental variables under point identification and partial identification settings, respectively. Pu and Zhang [2021] consider a binary treatment setting and develop a minimax ITR framework using IV partial bounds of the expected potential outcomes. Similar to our approach, they explicitly solve the inner maximization and replace indicator functions with the hinge loss, solving the problem using an SVM algorithm. However, their work stops short of formalizing the minimax optimization as a general weighted classification problem, which could be implemented using any machine learning package. In contrast, we extend this framework to multi-valued treatments and demonstrate that minimax learning can be formulated as a cost-sensitive classification problem, generalizing the binary treatment setting in Pu and Zhang [2021] (see Section 5).

3 Problem Formulation

We address the selective label problem within a multiclass classification setting. Consider a dataset comprising N units, indexed by i=1,\dots,N. For each unit i, let Y_{i}^{\star}\in[K] denote the true label of interest for K\geq 2 classes. Additionally, let X_{i}\in\mathcal{X} and U_{i}\in\mathcal{U} represent observable and unobservable features, respectively, both of which may be correlated with Y_{i}^{\star}. Here, “observable” features are those accessible to the ML practitioner, while “unobservable” features are unavailable in the dataset.

In the selective label problem, the true label Y_{i}^{\star} is not always observed; its observability depends on a binary decision D_{i}\in\{0,1\} made by a decision-maker based on the features (X_{i},U_{i}). Specifically, we consider J\geq 2 different decision-makers and let Z_{i}\in[J] denote the specific decision-maker assigned to unit i. The observed label Y_{i} is defined as:

Y_{i}=\begin{cases}Y_{i}^{\star}&\text{if }D_{i}=1,\\ \text{NaN}&\text{if }D_{i}=0.\end{cases} (3.1)

When D_{i}=1 (e.g., bail, loan approval, hiring), the observed label Y_{i} is exactly the true label Y_{i}^{\star}, and when D_{i}=0, the observed label is recorded as Y_{i}=\text{NaN} (Not a Number), indicating that Y_{i}^{\star} remains unobserved in this case. Thus, each unit’s label is selectively observed according to the decision D_{i} made by decision-maker Z_{i}.
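To make the observation mechanism concrete, here is a minimal sketch (in Python, with hypothetical toy values) of how the observed label in Equation 3.1 is constructed from the decision and the latent true label.

import numpy as np

# Hypothetical toy example: five units with true labels in [K] = {1, 2, 3}.
y_star = np.array([1, 3, 2, 2, 1])   # latent true labels Y_i^star
d = np.array([1, 0, 1, 0, 1])        # binary decisions D_i
z = np.array([2, 1, 3, 2, 1])        # assigned decision-makers Z_i

# Observed label: Y_i = Y_i^star when D_i = 1 and NaN when D_i = 0, as in eq. (3.1).
y_obs = np.where(d == 1, y_star.astype(float), np.nan)
print(y_obs)   # [ 1. nan  2. nan  1.]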

Figure 1: Causal graph of the selective label problem. The dashed nodes U and Y^{\star} are unobserved. The dashed line from X to Z means that Z is allowed but not required to be affected by X.

We assume that \{(Y_{i},Y^{\star}_{i},D_{i},Z_{i},X_{i},U_{i})\}_{i=1}^{N} represents independent and identically distributed samples from a population denoted by (Y,Y^{\star},D,Z,X,U). The causal structure of these variables is plotted in Figure 1. However, we cannot observe all variables; instead, we only observe the data S_{N}=\{(Y_{i},D_{i},X_{i},Z_{i})\}_{i=1}^{N}.

3.1 Learning under Selective Labels

Our goal is to learn a classifier t:\mathcal{X}\mapsto[K] that can accurately predict the true label Y^{\star} based on the observed features X. This classifier will aid future decision-making processes by accurately predicting Y^{\star}. The effectiveness of the classifier upon deployment across the full population is evaluated using the expected risk with the zero-one loss:

\min_{t:\mathcal{X}\rightarrow[K]}\quad\mathcal{R}(t):=\mathbb{E}\left[\mathbb{I}(Y^{\star}\neq t(X))\right]. (3.2)

Here \mathbb{I}(\cdot) is an indicator function, which equals 1 if the event inside (\cdot) occurs and 0 otherwise. The expectation is taken over the unknown joint distribution of Y^{\star} and X. For a given classifier t, the risk \mathcal{R}(t) quantifies its misclassification error rate \mathbb{P}(Y^{\star}\neq t(X)).

Conventionally, one can train a classifier through empirical risk minimization (ERM), i.e., minimizing an empirical approximation of the classification risk \mathcal{R}(t) based on the sample data. However, this is challenging under selective labels since the true label Y^{\star} is not fully observed. One natural alternative is to perform ERM on the labeled subsample (with subsample size denoted by N_{1}):

\min_{t:\mathcal{X}\mapsto[K]}\ \widehat{\mathcal{R}}_{\operatorname{label}}(t):=\frac{1}{N_{1}}\sum_{i:D_{i}=1}\mathbb{I}\{Y_{i}\neq t(X_{i})\}. (3.3)

However, the empirical risk \widehat{\mathcal{R}}_{\operatorname{label}}(t) may not capture the true risk even when the total sample size N\to\infty:

\widehat{\mathcal{R}}_{\operatorname{label}}(t)\to\mathbb{P}(Y^{\star}\neq t(X)\mid D=1)\neq\mathcal{R}(t).

This is known as a selection bias problem, where the distribution of (Y^{\star},X)\mid D=1 in the labeled subpopulation and that of (Y^{\star},X) in the full population are different. The selection bias arises because the historical decision D can depend on the features X and U that are also correlated with the true outcome Y^{\star}. Hence, one has to adjust for both X and U to eliminate the selection bias, as formalized below.
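As a concrete illustration of this gap, the following toy simulation (a hypothetical sketch, not part of the paper's experiments) shows that the misclassification rate of a fixed classifier on the labeled subsample can differ markedly from its rate on the full population when the labeling decision depends on an unobservable that also drives Y^{\star}.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                                    # unobservable U
x = rng.normal(size=n)                                    # observable X
y_star = (x + u + rng.normal(size=n) > 0).astype(int)     # binary true label Y*
d = (u + rng.normal(size=n) > 0).astype(int)              # decision D depends on U

t = lambda feat: (feat > 0).astype(int)                   # a fixed classifier using X only
risk_full = np.mean(y_star != t(x))                       # risk R(t) on the full population
risk_labeled = np.mean(y_star[d == 1] != t(x[d == 1]))    # risk on the labeled subsample (D = 1)
print(risk_full, risk_labeled)                            # the two error rates differ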

Assumption 1.

Suppose (1) Selection-on-observables-and-unobservables: D\perp Y^{\star}\mid X,U; (2) Overlap: \mathbb{P}(D=1\mid X,U)>0 almost surely.

The condition D\perp Y^{\star}\mid X,U in Assumption 1 means that the labeled subpopulation and the full population would have the same Y^{\star} distribution if one properly adjusts for (X,U). The overlap condition means that the subpopulation with almost any X,U value has a positive chance of being labeled. Both conditions are standard in the missing data literature [Little and Rubin, 2019]. If the features X,U were both observed, then one could correct for the selection bias by adjusting for X,U via many methods in the literature [e.g., Rosenbaum and Rubin, 1983, Rubin, 2005, Imbens and Rubin, 2015]. However, adjusting for the unobservable U is infeasible, so selection bias remains a significant problem in our setting. This setting is referred to as the missing-not-at-random (MNAR) problem in the missing data literature [Little and Rubin, 2019]. Addressing MNAR is notoriously difficult and typically requires additional information.

3.2 Multiple Decision-makers and IV

To address the selective label problem, we draw on recent literature that leverages the fact that in many applications selective labels involve multiple decision-makers with heterogeneous decision-making preferences [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2018]. The rationale behind this idea is that the unobserved true label Y^{\star} for a unit in the missing group (D=0) could have been observed if the unit had been assigned to a different decision-maker. This is particularly useful when the assignment of decision-makers Z is random, independent of the unobserved features U. For instance, Lakkaraju et al. [2017], Kleinberg et al. [2018] consider that in judicial bail, a defendant is randomly assigned to a judge, so a unit who is not granted bail (and thus unlabeled) could have been assigned to a more lenient judge, received bail, and consequently been labeled.

Although previous works have already considered the structure of multiple decision-makers, they largely rely on heuristic approaches. In fact, the assignment of decision-makers can be formalized as an instrumental variable (IV). As we will show shortly, this IV formalization allows us to understand the role of multiple decision-makers in a principled way. IV is a well-known concept in causal inference. Following the standard literature [e.g., Angrist et al., 1996, Cui and Tchetgen Tchetgen, 2021, Wang and Tchetgen, 2018], a valid IV has to satisfy the following conditions.

Assumption 2 (IV conditions).

The assignment Z satisfies the following three conditions: 1. IV relevance: Z\not\perp D\mid X; 2. IV independence: Z\perp(U,Y^{\star})\mid X; 3. Exclusion restriction: Z\perp Y\mid D,Y^{\star}.

The IV conditions above are particularly suitable for the random assignment of decision-makers. IV relevance means that even after controlling for the observed features X, different decision-makers could have systematically different decision rules, so that the assignment Z is dependent on the decision D. IV independence trivially holds when the decision-maker assignment is random. The exclusion restriction automatically holds because the observed label Y is completely determined by the decision D and the true label Y^{\star}, so it is independent of any other variable given D,Y^{\star}. We note that these conditions are also explicitly or implicitly assumed in the prior literature. Assumption 2 connects these conditions to the IV framework.

Based on the IV framework, we will next understand the power and limits of the multiple decision-maker structure in addressing the selective label problem. In particular, we will analyze the identification of the classification risk \mathcal{R}(t) in Equation 3.2, studying under what conditions the risk can be determined from the distribution of observed data with multiple decision-makers.

4 Exact Identification of Classification Risk

In this section, we establish conditions under which the classification risk \mathcal{R}(t) can be exactly identified from the observed data. This represents settings where consistently evaluating the classification risk is possible, provided one properly leverages the multiple decision-maker structure.

Before delving into our detailed analysis, we define \eta_{k}^{\star}(X):=\mathbb{P}(Y^{\star}=k\mid X) as the conditional probability that Y^{\star} equals k, given the observable features X. The Bayes optimal classifier, which minimizes the classification risk \mathcal{R}(t), is determined by s^{\star}(X):=\operatorname*{arg\,max}_{k\in[K]}\eta_{k}^{\star}(X). In fact, s^{\star} also minimizes the excess risk, defined as:

\mathcal{R}(t,s^{\star})=\mathbb{P}(Y^{\star}\neq t(X))-\mathbb{P}(Y^{\star}\neq s^{\star}(X)). (4.1)

Minimizing \mathcal{R}(t) over the classifier t is equivalent to minimizing \mathcal{R}(t,s^{\star}) over t, as the Bayes optimal risk \mathcal{R}(s^{\star}) is a constant. Therefore, we can equivalently study the identification of the excess risk \mathcal{R}(t,s^{\star}). The following lemma gives a weighted formulation of the excess risk in terms of the conditional probabilities.

Lemma 1.

The excess risk \mathcal{R}(t,s^{\star}) satisfies

\mathcal{R}(t,s^{\star})=\mathbb{E}\Big[\sum_{k=1}^{K}\eta_{k}^{\star}(X)\cdot\big(\mathbb{I}_{s^{\star}(X)=k}-\mathbb{I}_{t(X)=k}\big)\Big].

Lemma 1 indicates that to identify the excess risk, we first need to identify the conditional probabilities \eta_{k}^{\star}\in[0,1] of the latent Y^{\star} from the observed data. We therefore focus on the identification of \boldsymbol{\eta}^{\star}(x)=(\eta_{1}^{\star}(x),\dots,\eta_{K}^{\star}(x)) in the sequel.

Identification of Conditional Probabilities

In Section 3.2, we show that the assignment of decision-makers Z can be viewed as an IV. By adapting the IV identification theory in the causal inference literature [Wang and Tchetgen, 2018, Cui and Tchetgen Tchetgen, 2021], we can identify the classification risk under the following no unmeasured common effect modifiers (NUCEM) assumption.

Assumption 3 (NUCEM).

For each x\in\mathcal{X}, \operatorname{Cov}\big(\operatorname{Cov}(D,Z\mid X,U),\ \mathbb{P}(Y^{\star}=k\mid X,U)\mid x\big)=0.

Theorem 1 (Point Identification).

Define Y_{k}:=\mathbb{I}\{Y=k\}. Under Assumptions 1, 2 and 3, \eta_{k}^{\star}(X) is identified:

\eta_{k}^{\star}(X)=\operatorname{Cov}(DY_{k},Z\mid X)/\operatorname{Cov}(D,Z\mid X). (4.2)

Conversely, if (4.2) holds, Assumption 3 is also satisfied.

Theorem 1 shows that Assumption 3 is sufficient and necessary for identifying the conditional probability functions \eta^{\star}_{k} through Equation 4.2. This identification equation connects the conditional probability of the latent true label Y^{\star} with the fully observed variables D,Y,Z,X. This lays the foundation for learning the \eta^{\star}_{k}’s (and the classification risk) from the observed data using the decision-maker assignment IV.

Identification of Excess Classification Risk

We can now identify the excess classification risk from observed data.

Theorem 2.

Under the assumptions in Theorem 1, the excess classification risk \mathcal{R}(t,s^{\star}) in eq. (4.1) is also identified:

\mathcal{R}(t,s^{\star})=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\Big], (4.3)

where w_{k}(x):=\max_{p\in[K]}(\eta_{p}^{\star}(x)-\eta_{k}^{\star}(x)) denotes the weight function and \eta_{k}^{\star}(x) is identified in eq. (4.2). This can be simplified for binary classification problems with K=2:

\mathcal{R}(t,s^{\star})=\mathbb{E}\big[|w(X)|\cdot\mathbb{I}\big\{t(X)\neq\operatorname{sgn}[w(X)]\big\}\big], (4.4)

where w(x):=\operatorname{Cov}(DY,Z\mid x)/\operatorname{Cov}(D,Z\mid x)-1/2 denotes the weight and \operatorname{sgn}:\mathcal{X}\rightarrow\{\pm 1\} is the sign function.

Theorem 2 not only establishes the identification of the excess classification risk under Assumption 3, but also formulates the identification in a weighted classification form. Specifically, minimizing the right-hand side of Equation 4.3 corresponds to searching for a cost-sensitive classifier, while minimizing the right-hand side of Equation 4.4 corresponds to a weighted classification problem with the sign of w(X) as the label and |w(X)| as the misclassification cost. In Section 6, we will show that these formulations are useful in the design of effective learning algorithms.

The NUCEM Identification Assumption

The identification theory in Theorems 1 and 2 crucially relies on the No Unmeasured Common Effect Modifier (NUCEM) assumption in Assumption 3, which generalizes Cui and Tchetgen Tchetgen [2021] from a binary IV to a multi-valued IV. These analyses reveal that additional assumptions are needed to fully identify the classification risk, even when the decision-maker structure already satisfies the IV conditions in Assumption 2. This fact was not known in the prior selective label literature without the formal IV framework.

To understand the NUCEM assumption, it is instructive to consider examples where this assumption holds.

Example 1 (Full Information).

If all features needed to predict the true label Y^{\star} are recorded, then \mathbb{E}[Y^{\star}\mid X,U]=\mathbb{E}[Y^{\star}\mid X] almost surely and Assumption 3 holds.

Example 2 (Separable and Independent Unobservables).

Suppose the unobservables U can be divided into two categories: U_{1} influences the decision D, and U_{2} affects the outcome Y^{\star}. Hence, we have \operatorname{Cov}(D,Z\mid X,U)=\operatorname{Cov}(D,Z\mid X,U_{1}) and \mathbb{E}[Y^{\star}\mid X,U]=\mathbb{E}[Y^{\star}\mid X,U_{2}] almost surely. If U_{1} and U_{2} are (conditionally) independent, then Assumption 3 holds.

Example 3 (Additive Decision Probability).

Suppose that for every decision-maker j\in[J], the conditional decision probability satisfies \mathbb{P}(D=1\mid U,X,Z=j)=g_{j}(X)+q(U) for a common function q and potentially different functions g_{j}. That is, all decision-makers use the unobservables U in a common and additive way. Under this condition, we can prove that Assumption 3 also holds (see Section A.2).
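To see why the additive structure implies the NUCEM condition, note the following brief sketch (assuming the IV independence Z\perp U\mid X from Assumption 2; the full argument is in Section A.2):

\begin{align*}
\operatorname{Cov}(D,Z\mid X,U)
&=\sum_{j=1}^{J}j\,\mathbb{P}(Z=j\mid X)\big(g_{j}(X)+q(U)\big)-\mathbb{E}[Z\mid X]\sum_{j=1}^{J}\mathbb{P}(Z=j\mid X)\big(g_{j}(X)+q(U)\big)\\
&=\sum_{j=1}^{J}j\,\mathbb{P}(Z=j\mid X)\,g_{j}(X)-\mathbb{E}[Z\mid X]\sum_{j=1}^{J}\mathbb{P}(Z=j\mid X)\,g_{j}(X),
\end{align*}

which does not depend on U. Hence \operatorname{Cov}(D,Z\mid X,U) is a function of X alone, and its conditional covariance with \mathbb{P}(Y^{\star}=k\mid X,U) given X=x is zero, which is exactly Assumption 3.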

The examples above show that the NUCEM assumption needs to impose restrictions on the impact of the unobserved features U on the decision D, the true label Y^{\star}, or both. However, these restrictions may be too strong in practice, so the NUCEM assumption may fail. As a result, the identification formulas in Theorems 1 and 2 are no longer valid. In the next section, we will drop the NUCEM assumption and only impose some mild conditions, leading to weaker identification results accordingly.

5 Partial Identification of Classification Risk

The last section shows that the exact identification of classification risk often requires assumptions that are untenable in practice (see Assumption 3). Hence it is desirable to avoid such stringent assumptions and instead consider weaker and more reasonable assumptions. Nevertheless, without the stringent assumptions, it is generally impossible to exactly identify the classification risk from the observed data. Instead, there may exist a range of plausible values of the classification risk, all compatible with the observed data. This plausible range is characterized by the so-called partial identification bounds.

Below we adopt a mild bound condition on the conditional probability of the true label for the partial bound analysis.

Assumption 4.

For each fixed x and k, there exist two known functions a_{k}(x)\in[0,1] and b_{k}(x)\in[0,1] such that a_{k}(x)\leq\mathbb{P}(Y^{\star}=k\mid U,X=x)\leq b_{k}(x) almost surely.

Assumption 4 mandates a known bound on the conditional probability \mathbb{P}(Y^{\star}=k\mid U,X=x) while permitting arbitrary dependence on the unobserved features U. This assumption is mild, since it is automatically satisfied when a_{k}(x)=0,b_{k}(x)=1 for each k\in[K]. With these bounding functions, we derive tight partial identification bounds on \eta_{k}^{\star}(X).

Partial Identification of Conditional Probabilities

Based on this condition, we can derive tight IV partial identification bounds on the conditional probability \eta_{k}^{\star}(X)=\mathbb{P}(Y^{\star}=k\mid X) for each k\in[K].

Theorem 3.

Define Y_{k}:=\mathbb{I}\{Y=k\} and assume that Assumptions 1, 2 and 4 hold. Then for any x\in\mathcal{X} and k\in[K], the value of \eta_{k}^{\star}(x) must fall within the interval [l_{k}(x),u_{k}(x)], where

l_{k}(x):=\max_{z\in[J]}\mathbb{E}[DY_{k}+a_{k}(x)(1-D)\mid X=x,Z=z],
u_{k}(x):=\min_{z\in[J]}\mathbb{E}[DY_{k}+b_{k}(x)(1-D)\mid X=x,Z=z].

Conversely, any value within the interval [l_{k}(x),u_{k}(x)] can be a plausible value of \eta_{k}^{\star}(x). Moreover, the interval endpoints satisfy a_{k}(x)\leq l_{k}(x)\leq u_{k}(x)\leq b_{k}(x).

Theorem 3 generalizes the well-known Balke and Pearl bounds and Manski bounds in Balke and Pearl [1994], Manski and Pepper [1998]. Specifically, our bounds coincide with Balke and Pearl’s bounds when applied to a binary IV (see Section B.2). Furthermore, if a(X),b(X) are two constants independent of X (such as 0,1), our bounds recover Manski’s bounds. Theorem 3 also demonstrates the tightness of our IV partial bounds, in the sense that they tightly give the range of all plausible values of the conditional probabilities \eta_{k}^{\star} (see Section B.1).

Partial Identification of Excess Classification Risk

When \{\eta_{k}^{\star}\}_{k\in[K]} are only partially identified within [l_{k},u_{k}] for each k\in[K], the excess classification risk \mathcal{R}(t,s^{\star}) is also only partially identified. Define the feasible region for the conditional probability vector \boldsymbol{\eta}^{\star}=(\eta_{1},\dots,\eta_{K}) as

S:=\big\{\boldsymbol{\eta}:\|\boldsymbol{\eta}\|_{1}=1,\ \eta_{k}\in[l_{k},u_{k}],\ k\in[K]\big\}. (5.1)

Then the range of plausible values of the excess risk \mathcal{R}(t,s^{\star}) is given by the interval [\underline{\mathcal{R}}(t),\overline{\mathcal{R}}(t)], where

\overline{\mathcal{R}}(t):=\mathbb{E}\Big[\max_{\boldsymbol{\eta}\in S}\sum_{k=1}^{K}\eta_{k}(X)\big(\mathbb{I}_{s^{\star}(X)=k}-\mathbb{I}_{t(X)=k}\big)\Big], (5.2)

and \underline{\mathcal{R}}(t) is defined symmetrically by changing the maximization over \boldsymbol{\eta}\in S into a minimization over the same set. We are particularly interested in the upper bound \overline{\mathcal{R}}(t) as it gives the worst-case risk of a classifier t. Minimizing this upper bound, i.e., \min_{t:\mathcal{X}\rightarrow[K]}\overline{\mathcal{R}}(t), hence leads to a robust classifier safeguarding the performance in the worst case. Below we further provide a reformulation of the worst-case risk under a mild regularity condition.

Theorem 4.

Suppose the conditions in Theorem 3 hold and the set S defined in eq. (5.1) is non-empty. Given the bounds [l_{k},u_{k}] in Theorem 3, we define the “realizable” partial bounds for \eta_{k}^{\star} as \tilde{l}_{k}(x):=[1-\sum_{p\neq k}u_{p}(x)]\lor l_{k}(x) and \tilde{u}_{k}(x):=[1-\sum_{p\neq k}l_{p}(x)]\land u_{k}(x). We have

\overline{\mathcal{R}}(t)=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\Big],

where w_{k}(x):=\max_{p\neq k,p\in[K]}\big(\tilde{u}_{p}(x)-\tilde{l}_{k}(x)\big)^{+} denotes the weight function. This can be simplified for binary classification problems with K=2:

\overline{\mathcal{R}}(t)=\mathbb{E}\big[|w(X)|\cdot\mathbb{I}\{t(X)\neq\operatorname{sgn}[w(X)]\}\big]+C,

where C is a constant not depending on the classifier t and w(x):=\max\{2u_{1}(x)-1,0\}+\min\{2l_{1}(x)-1,0\}.

The non-emptiness of the set S defined in Equation 5.1 is a mild regularity condition. It is easy to verify from Theorem 3 that this condition is automatically satisfied when a_{k}(X)=0 and b_{k}(X)=1 for all k\in[K]. Like Theorem 2, Theorem 4 formulates the worst-case risk as a cost-sensitive form for general multi-class classification problems and as a weighted classification form for binary classification problems; a computational sketch is given below. This reformulation will be handy for the learning algorithm design in the next section.
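For concreteness, the following sketch (hypothetical Python code, with l and u standing for arrays of estimated bounds evaluated at sample points) computes the realizable partial bounds and the worst-case weights w_{k}(x) of Theorem 4.

import numpy as np

def worst_case_weights(l, u):
    """Worst-case cost-sensitive weights from Theorem 4.

    l, u: arrays of shape (n, K) holding the partial bounds [l_k(x_i), u_k(x_i)].
    Returns an (n, K) array with w_k(x) = max_{p != k} (u~_p(x) - l~_k(x))^+.
    A sketch; it assumes the feasible set S is non-empty for every row.
    """
    n, K = l.shape
    sum_l = l.sum(axis=1, keepdims=True)
    sum_u = u.sum(axis=1, keepdims=True)
    # Realizable bounds: l~_k = max(1 - sum_{p != k} u_p, l_k), u~_k = min(1 - sum_{p != k} l_p, u_k).
    l_tilde = np.maximum(1.0 - (sum_u - u), l)
    u_tilde = np.minimum(1.0 - (sum_l - l), u)
    w = np.empty((n, K))
    for k in range(K):
        others = np.delete(u_tilde, k, axis=1)              # u~_p for p != k
        w[:, k] = np.maximum(others.max(axis=1) - l_tilde[:, k], 0.0)
    return w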

Discussions

Notably, the partial identification results in this section are quite different from the exact identification results in Section 4. Section 4 relies on the strong NUCEM identification assumption (Assumption 3). This strong assumption has a strong implication: we can pinpoint the exact value of the classification risk from the observed data with the decision-maker assignment as an IV. However, the NUCEM identification assumption may often be implausible in practice. In this section, we instead consider a much weaker bound condition in Assumption 4. This condition is arguably more plausible, yet it also has weaker identification implications. Under this condition, even with the decision-maker assignment as a valid IV, the classification risk is still intrinsically ambiguous, so we can at most obtain a range of its values from the observed data. In the next section, we will design algorithms to learn robust classifiers by approximately minimizing the worst-case classification risk. These analyses reveal the power and limits of the multiple decision-maker data structure in tackling the selective label problem, which highlights the value of the principled IV framework adopted in our paper.

We note that the recent work by Rambachan et al. [2022] is closely related to our paper. Rambachan et al. [2022] study counterfactual evaluation under unmeasured confounding using IVs, mention selective labels with multiple decision-makers as one example, and derive IV partial identification bounds. However, our paper differs from Rambachan et al. [2022] in important aspects. In particular, Rambachan et al. [2022] focus on partial identification bounds on various risk measures in counterfactual decision-making and on statistical estimation and inference for the bounds. In contrast, our paper systematically studies both the exact identification and partial identification settings and provides a unified weighting formulation of classification risks for these two settings (see Theorems 2 and 4). Moreover, Rambachan et al. [2022], like the prior literature [Lakkaraju et al., 2017, Kleinberg et al., 2018, De-Arteaga et al., 2018], primarily focus on evaluating the risk of a given prediction rule. In contrast, our paper not only studies evaluation but also designs efficient algorithms to learn classifiers in the two identification settings. This is the focus of the next section.

Algorithm 1 Unified Cost-sensitive Learning
1: Input: function class \mathcal{H}, \operatorname{mode}\in\{\operatorname{point},\operatorname{partial}\}, data S_{N}=\{(Y_{i},X_{i},D_{i},Z_{i})\}_{i=1}^{N}, cross-fitting folds L.
2: Randomly split the data into L (even) batches with indices denoted by S_{N}^{1},\dots,S_{N}^{L} respectively.
3: for each batch l\in[L] do
4:     if \operatorname{mode}=\operatorname{point} then
5:         For k\in[K], use dataset S_{N}\setminus S_{N}^{l} to estimate the weight \widehat{w}_{k}^{[l]}(x)=\max_{p\in[K]}\big(\widehat{\eta}_{p}(x)-\widehat{\eta}_{k}(x)\big).
6:     else if \operatorname{mode}=\operatorname{partial} then
7:         For k\in[K], use dataset S_{N}\setminus S_{N}^{l} to estimate the weight \widehat{w}_{k}^{[l]}(x)=\max_{p\neq k}\big(\widehat{u}_{p}(x)-\widehat{l}_{k}(x)\big)^{+}.
8:     end if
9: end for
10: Construct the empirical risk for each batch l\in[L]:
\widehat{\mathcal{R}}_{\exp}^{[l]}(\boldsymbol{h})=\frac{1}{|I_{l}|}\sum_{i\in I_{l}}\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X_{i})\frac{\exp(h_{k}(X_{i}))}{\sum_{p=1}^{K}\exp(h_{p}(X_{i}))}.
11: Return the score function \widehat{\boldsymbol{h}} obtained by minimizing:
\widehat{\boldsymbol{h}}\in\operatorname*{arg\,min}_{\boldsymbol{h}\in\mathcal{H}}\frac{1}{L}\sum_{l=1}^{L}\widehat{\mathcal{R}}_{\exp}^{[l]}(\boldsymbol{h}). (5.3)

6 Unified Cost-sensitive Learning

Figure 2: The testing accuracy of methods with \alpha\in\{0.5,0.7,0.9\} for model NUCEM in the FICO dataset.
Figure 3: The testing accuracy of methods with \alpha\in\{0.5,0.7,0.9\} for model UC in the FICO dataset.

In this section, we propose a unified cost-sensitive learning (UCL) algorithm tailored to learn a classifier from selectively labeled data. This algorithm accommodates both the exact identification setting in Section 4 and the partial identification setting in Section 5 in a unified manner.

Typically, a classifier t(x) is represented by a score function \boldsymbol{h}:\mathcal{X}\rightarrow\mathbb{R}^{K} through t(x)=\operatorname*{arg\,max}_{k\in[K]}h_{k}(x). Here, for any given x, h_{k}(x) quantifies the score of class k. According to Theorems 2 and 4, the excess classification risk of a classifier t with score function \boldsymbol{h} in the two identification settings can be written in the following common weighted form: for t(X)=\operatorname*{arg\,max}_{k\in[K]}h_{k}(X),

\mathcal{R}(\boldsymbol{h},\boldsymbol{w})=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\Big]. (6.1)

Specifically, we can set w_{k}(x)=\max_{p\in[K]}(\eta_{p}^{\star}(x)-\eta_{k}^{\star}(x)) for each k to recover the exact identification in Theorem 2, and set w_{k}(x)=\max_{p\neq k,p\in[K]}(\tilde{u}_{p}(x)-\tilde{l}_{k}(x))^{+} to recover the worst-case excess risk \overline{\mathcal{R}}(t) in Theorem 4. In this section, we aim to minimize these risk objectives to learn effective classifiers. This corresponds to a cost-sensitive multiclass classification problem.

Theorems 2 and 4 also show that for binary classification problems with K=2, the risk function simplifies to:

\mathcal{R}(h,w)=\mathbb{E}\big[|w(X)|\cdot\mathbb{I}\big\{t(X)\neq\operatorname{sgn}[w(X)]\big\}\big],

where t(x)=\operatorname{sgn}[h(x)] for any x. By specifying w(x)=\operatorname{Cov}(DY,Z\mid x)/\operatorname{Cov}(D,Z\mid x)-1/2, we recover the classification risk in the exact identification setting. By specifying w(x)=\max\{2u_{1}(x)-1,0\}+\min\{2l_{1}(x)-1,0\}, we recover the worst-case risk (up to a constant) under partial identification. Minimizing these risks corresponds to solving a weighted binary classification problem with the pseudo label \operatorname{sgn}[w(X)] and misclassification cost |w(X)|, as sketched below.
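As a sketch of this weighted binary classification step (with a hypothetical feature matrix X_train and an already-estimated weight array w_hat), one could use any off-the-shelf learner that accepts sample weights, e.g.:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))          # hypothetical features X
w_hat = rng.normal(size=500)                 # hypothetical estimated weights w(X_i)

pseudo_label = np.where(w_hat >= 0, 1, -1)   # pseudo label sgn[w(X_i)]
cost = np.abs(w_hat)                         # misclassification cost |w(X_i)|

# Weighted logistic regression: ERM on the surrogate risk (6.3) with the logistic loss.
clf = LogisticRegression().fit(X_train, pseudo_label, sample_weight=cost)
t_hat = clf.predict(X_train)                 # learned classifier t(x) in {-1, +1}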

6.1 Convex Surrogate Risk

The cost-sensitive classification risk \mathcal{R}(\boldsymbol{h},\boldsymbol{w}) in Equation 6.1 involves non-convex and non-smooth indicator functions \mathbb{I}\{\cdot\}, making the optimization challenging. To address this issue, we introduce a convex surrogate risk \mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}) that replaces the indicator function with the \operatorname{softmax} function. This substitution allows us to define the convex surrogate for the cost-sensitive classification risk:

\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})=\mathbb{E}\Big[\sum_{k=1}^{K}w_{k}(X)\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\Big]. (6.2)

For the binary case, we introduce the weighted \phi-risk as

\mathcal{R}_{\phi}(h,w)=\mathbb{E}\big[|w(X)|\cdot\phi\big(h(X)\cdot\operatorname{sgn}[w(X)]\big)\big]. (6.3)

There are many possible choices for the convex surrogate loss \phi, including the hinge loss \phi(\alpha)=\max\{1-\alpha,0\}, the logistic loss \phi(\alpha)=\log(1+e^{-\alpha}), and the exponential loss \phi(\alpha)=e^{-\alpha} [Bartlett et al., 2006].

Theorem 5 establishes a calibration bound linking the cost-sensitive risk \mathcal{R}(\boldsymbol{h},\boldsymbol{w}) to its surrogate \mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}).

Theorem 5 (Calibration Bound for Cost-sensitive Risk).

For any given weight function \boldsymbol{w}:\mathcal{X}\rightarrow\mathbb{R}^{K}, we have

\mathcal{R}(\boldsymbol{h},\boldsymbol{w})-\inf_{\boldsymbol{h}}\mathcal{R}(\boldsymbol{h},\boldsymbol{w})\leq K\big[\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})-\inf_{\boldsymbol{h}}\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})\big].

For the binary case, a calibration bound can also be established for \mathcal{R}_{\phi}(h,w) [Bartlett et al., 2006, Tewari and Bartlett, 2007].

Proposition 1 (Calibration Bound for Weighted Risk).

For any fixed weight function w:\mathcal{X}\rightarrow\mathbb{R}, we have

\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w)\leq\mathcal{R}_{\phi}(h,w)-\inf_{h}\mathcal{R}_{\phi}(h,w).

Theorem 5 and Proposition 1 show that the excess target risks are bounded by the excess surrogate risks. Hence we can use sample data to approximately minimize the surrogate risks, which can be done efficiently.

6.2 Empirical Risk Minimization

We now propose Algorithm 1 to minimize the empirical surrogate risk \mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}) in Equation 6.2 using the observed data S_{N}. We also establish a finite-sample error bound for the resulting score function \boldsymbol{h} (see Appendix D).

Weights Estimation

The weight functions \{w_{k}\}_{k=1}^{K} differ between the point-identified and partially identified settings, requiring separate estimation. For \operatorname{mode}=\operatorname{point}, we first estimate \operatorname{Cov}(D,Z\mid X) and \operatorname{Cov}(DY_{k},Z\mid X) in eq. (4.2), then compute their ratio. For example, \operatorname{Cov}(DY_{k},Z\mid x) can be expressed as \mathbb{E}[DZY_{k}\mid x]-\mathbb{E}[DY_{k}\mid x]\cdot\mathbb{E}[Z\mid x], with the conditional expectations estimated via regression on X=x. Alternatively, forest methods, such as Generalized Random Forests, can directly estimate the conditional covariance ratio in one step [Athey et al., 2019].
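A minimal plug-in sketch of this step (assuming sklearn-style regressors; function and variable names are illustrative, not from the paper) is:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def estimate_eta_point(X, D, Z, Y, K):
    """Plug-in estimate of eta_k(x) = Cov(D Y_k, Z | x) / Cov(D, Z | x), eq. (4.2).

    Each conditional expectation is fit by a separate regression on X; in Algorithm 1
    these fits would be done on held-out folds (cross-fitting).
    """
    def reg(target):
        return HistGradientBoostingRegressor().fit(X, target).predict(X)

    EZ, ED, EDZ = reg(Z), reg(D), reg(D * Z)
    cov_DZ = EDZ - ED * EZ                          # Cov(D, Z | X)
    eta = np.zeros((X.shape[0], K))
    for k in range(K):
        Yk = D * (Y == k + 1).astype(float)         # D * I{Y = k}; zero whenever the label is missing
        eta[:, k] = (reg(Yk * Z) - reg(Yk) * EZ) / cov_DZ   # Cov(D Y_k, Z | X) / Cov(D, Z | X)
    return np.clip(eta, 0.0, 1.0)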

For \operatorname{mode}=\operatorname{partial}, we estimate the bounds l_{k} and u_{k} for each k\in[K] as defined in Theorem 3, which involves estimating the conditional expectations \mathbb{E}[DY_{k}\mid x,z] and \mathbb{E}[D\mid x,z] via regression. These bounds are then used to construct w_{k}(x).
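A corresponding sketch for the partial-identification bounds (again with illustrative names, and with constant bounding functions a_k = a and b_k = b) is:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def estimate_partial_bounds(X, D, Z, Y, K, J, a=0.0, b=1.0):
    """Plug-in estimates of the IV partial bounds l_k(x) and u_k(x) in Theorem 3."""
    n = X.shape[0]
    l, u = np.zeros((n, K)), np.zeros((n, K))
    for k in range(K):
        Yk = D * (Y == k + 1).astype(float)
        lower_z, upper_z = [], []
        for z in range(1, J + 1):
            m = (Z == z)
            # E[D Y_k + a (1 - D) | X = x, Z = z] and E[D Y_k + b (1 - D) | X = x, Z = z]
            lower_z.append(HistGradientBoostingRegressor().fit(X[m], (Yk + a * (1 - D))[m]).predict(X))
            upper_z.append(HistGradientBoostingRegressor().fit(X[m], (Yk + b * (1 - D))[m]).predict(X))
        l[:, k] = np.max(lower_z, axis=0)    # l_k(x): maximum over decision-makers z
        u[:, k] = np.min(upper_z, axis=0)    # u_k(x): minimum over decision-makers z
    return np.clip(l, 0.0, 1.0), np.clip(u, 0.0, 1.0)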

Classifier Optimization

The empirical surrogate risk in Equation 5.3 defines a cost-sensitive classification problem. Once the weight functions \{w_{k}\}_{k=1}^{K} are estimated, we can solve for the score function \boldsymbol{h}, which can be implemented in \operatorname{PyTorch} by defining a cost-sensitive loss function with \operatorname{softmax} as the output layer, as sketched below. We can also incorporate regularization in Equation 5.3 to reduce over-fitting. In the binary setting, we can simply solve a weighted classification problem with \operatorname{sgn}[\hat{w}(X_{i})]’s as the labels and |\hat{w}(X_{i})|’s as the weights, which can be easily implemented by off-the-shelf ML packages that take additional weight inputs. For example, considering the surrogate risk in Equation 6.3 with \phi as the logistic loss, we can simply run a weighted logistic regression with the aforementioned labels and weights.
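Here is a minimal PyTorch sketch of the multiclass step (the network architecture and batch contents are hypothetical placeholders):

import torch
import torch.nn as nn

class CostSensitiveSoftmaxLoss(nn.Module):
    """Empirical surrogate risk of eq. (6.2): average of sum_k w_k(X) * softmax(h(X))_k."""
    def forward(self, scores, weights):
        # scores: (batch, K) raw scores h_k(X); weights: (batch, K) estimated costs w_k(X).
        return (weights * torch.softmax(scores, dim=1)).sum(dim=1).mean()

model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 3))   # score function h
criterion = CostSensitiveSoftmaxLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_batch = torch.randn(32, 5)     # a batch of features X
w_batch = torch.rand(32, 3)      # corresponding estimated weights w_k(X) from Algorithm 1
loss = criterion(model(x_batch), w_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()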

Cross-fitting

In Algorithm 1, we divide the dataset into L folds and apply cross-fitting to estimate the weight functions, as sketched below. This approach has been widely used in statistical inference and learning with ML nuisance function estimators [e.g., Chernozhukov et al., 2018, Athey and Wager, 2021]; it effectively alleviates the over-fitting bias by avoiding using the same data to both train the weight function estimators and evaluate the weight function values.
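A schematic of the cross-fitting loop (with a hypothetical estimate_weights routine standing in for either mode of Algorithm 1) looks as follows:

import numpy as np
from sklearn.model_selection import KFold

def cross_fitted_weights(X, D, Z, Y, K, estimate_weights, n_folds=5):
    """Cross-fitting as in Algorithm 1: fit weight estimators on L-1 folds, evaluate on the held-out fold.

    estimate_weights is assumed to return a function mapping features to an (n, K) weight matrix.
    """
    w_hat = np.zeros((X.shape[0], K))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, eval_idx in kf.split(X):
        fitted = estimate_weights(X[train_idx], D[train_idx], Z[train_idx], Y[train_idx], K)
        w_hat[eval_idx] = fitted(X[eval_idx])    # weights evaluated only on the held-out fold
    return w_hat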

7 Numeric Experiments

We conduct experiments on a synthetic dataset with a multiclass label (K=3) and a semi-synthetic dataset with a binary label (K=2).

7.1 Synthetic Dataset

In this section, we construct a synthetic dataset with multi-valued selective labels. The experiment results are illustrated in Figures 4 and 5, which show that our unified cost-sensitive learning (UCL) approach achieves superior performance under various strengths of selection bias.

Data Generating Process

Consider a dataset consisting of observable variables \mathbf{X}=(X_{1},\dots,X_{p}) and unobservable variables \mathbf{U}=(U_{1},U_{2},\dots,U_{q}), following a joint Gaussian distribution. We also introduce the variable Z representing the random assignment of decision-makers, which is uniformly drawn from the set \mathcal{Z}=\{1,2,\dots,J\}. The missingness decision D\in\{0,1\} is modeled as a Bernoulli variable with parameter p_{D}:=\mathbb{P}(D=1\mid X,U,Z), and the true label Y^{\star}\in\{1,\dots,K\} is modeled as a categorical variable with parameters p_{k}:=\mathbb{P}(Y^{\star}=k\mid X,U,Z) for k\in[K]. The complete data generating process is summarized in Equation 7.1.

Observed variables: \mathbf{X}=(X_{1},\dots,X_{p})\sim 2\cdot\mathcal{N}(0,I_{p}),\quad Z\sim\text{Uniform}(\{1,\dots,J\}) (7.1)
Unobserved variables: \mathbf{U}=(U_{1},\dots,U_{q})\sim 2\cdot\mathcal{N}(0,I_{q})
Coefficients: \mathbf{W}_{X\rightarrow Y}=\big[i+k\big]_{p\times K},\quad i\in\{1,\dots,p\},\ k\in\{1,\dots,K\}
\mathbf{W}_{U\rightarrow Y}=\big[i+k\big]_{q\times K},\quad i\in\{1,\dots,q\},\ k\in\{1,\dots,K\}
\mathbf{W}_{X\rightarrow D}=\big[2-i\big]_{p\times 1},\quad i\in\{1,\dots,p\}
\mathbf{W}_{U\rightarrow D}=\big[1+i\big]_{q\times 1},\quad i\in\{1,\dots,q\}
Decision (NUCEM): p_{D}=(1-\alpha_{D})\cdot\operatorname{expit}\big\{Z\cdot 2\mathbf{X}\mathbf{W}_{X\rightarrow D}\big\}+\alpha_{D}\cdot\operatorname{expit}\big\{3\mathbf{U}\mathbf{W}_{U\rightarrow D}\big\}
Decision (UC): p_{D}=\operatorname{expit}\big\{(1-\alpha_{D})\cdot Z\cdot 2\mathbf{X}\mathbf{W}_{X\rightarrow D}+\alpha_{D}\cdot 3\mathbf{U}\mathbf{W}_{U\rightarrow D}\big\}
Outcome: p_{k}=\operatorname{softmax}\big\{(1-\alpha_{Y})\cdot\mathbf{X}\mathbf{W}_{X\rightarrow Y}+\alpha_{Y}\cdot 4\mathbf{U}\mathbf{W}_{U\rightarrow Y}\big\},\quad k\in\{1,\dots,K\}
Y^{\star}\sim\text{Categorical}(p_{1},\dots,p_{K}).

Here we define the functions \operatorname{expit}:v\mapsto 1/(1+\exp(-v)) and \operatorname{softmax}:\mathbf{v}\mapsto\exp(\mathbf{v})/\sum_{k=1}^{K}\exp(v_{k}) for any vector \mathbf{v}\in\mathbb{R}^{K}. In both the UC and NUCEM models, the parameter \alpha_{D}\in(0,1) controls the impact of the unobservable variables \mathbf{U} on the probability of missingness p_{D}, while the parameter \alpha_{Y}\in(0,1) adjusts the magnitude of the effect of \mathbf{U} on the distribution of the outcome Y^{\star}. Overall, the pair (\alpha_{D},\alpha_{Y}) jointly determines the degree of selection bias in our data generating process.

In this experiment, we simulate a full dataset with sample size N=10000, feature dimensions p=q=5, number of instrument levels J=5, and number of label classes K=3. Each data point in the full dataset is a tuple \{(X_{i},U_{i},Z_{i},D_{i},Y_{i}^{\star})\}_{i=1}^{N}. The observed dataset consists of the tuples \{(X_{i},Z_{i},D_{i},Y_{i})\}_{i=1}^{N} with Y_{i}:=Y_{i}^{\star} if D_{i}=1 and Y_{i}:=\text{NaN} if D_{i}=0 for each i\in[N].
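For reference, a condensed sketch of the NUCEM variant of this data generating process (hypothetical helper code, not the authors' implementation) is:

import numpy as np
from scipy.special import expit, softmax

def generate_nucem(N=10000, p=5, q=5, J=5, K=3, alpha_D=0.5, alpha_Y=0.5, seed=0):
    """Condensed sketch of the NUCEM data generating process in eq. (7.1)."""
    rng = np.random.default_rng(seed)
    X = 2 * rng.normal(size=(N, p))
    U = 2 * rng.normal(size=(N, q))
    Z = rng.integers(1, J + 1, size=N)
    # Coefficient matrices [i + k] and vectors [2 - i], [1 + i] from eq. (7.1).
    W_xy = np.add.outer(np.arange(1, p + 1), np.arange(1, K + 1)).astype(float)
    W_uy = np.add.outer(np.arange(1, q + 1), np.arange(1, K + 1)).astype(float)
    W_xd = 2.0 - np.arange(1, p + 1)
    W_ud = 1.0 + np.arange(1, q + 1)
    # Decision probability (NUCEM form) and outcome probabilities.
    p_D = (1 - alpha_D) * expit(Z * 2 * (X @ W_xd)) + alpha_D * expit(3 * (U @ W_ud))
    D = rng.binomial(1, p_D)
    p_Y = softmax((1 - alpha_Y) * (X @ W_xy) + alpha_Y * 4 * (U @ W_uy), axis=1)
    Y_star = np.array([rng.choice(K, p=row) + 1 for row in p_Y])
    Y = np.where(D == 1, Y_star.astype(float), np.nan)
    return X, Z, D, Y, Y_star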

Model UC and NUCEM

The missingness model NUCEM captures one of the settings with No Unmeasured Common Effect Modifiers; it satisfies the sufficient condition in Proposition 2, so the conditional probabilities \{p_{k}\}_{k=1}^{K} can be exactly identified according to Theorem 1. Meanwhile, the model UC captures a general setting with Unmeasured Confounding, in which the point-identification condition does not hold.

Figure 4: The testing accuracy of different methods with \alpha_{Y}\in\{0.3,0.5,0.7\} and \alpha_{D}\in\{0.3,0.5,0.7\} for Model NUCEM in the synthetic dataset.

Baseline Methods and Our Proposed Methods

We divide our dataset into a training set and a testing set using a 7:3 split ratio, denoted by \mathcal{S}_{\text{train}} and \mathcal{S}_{\text{test}}. On the training set, five different meta-learning methods are trained on the selectively labeled data, using a simple neural network classifier with a softmax output layer. We then evaluate the performance of these methods on the testing set with fully observed labels. The five methods under consideration are as follows:

  • The first two methods act as baselines. The SelectedSample method runs a multiclass classification algorithm solely on the labeled portion of the training set \{i\in S_{\text{train}}:D_{i}=1\}. For the second method, SelectedSample(IPW), we first estimate inverse propensity weights (IPW) for the labeled subset and then run weighted multiclass classification on the labeled training set. These baseline methods establish a “lower bound” for any other meta-learning algorithm, as any enhanced method using the selectively labeled data should at least beat the performance of these two methods.

  • The third method, FullSample, runs a multiclass classification algorithm on the entire training set. This represents an ideal but impractical scenario since we cannot actually observe the true label Y_{i}^{\star} for the missing data (where D_{i}=0). This method serves as an “upper bound” for learning performance, indicating the highest possible effectiveness if full information were available.

  • Finally, our two proposed methods correspond to the point- and partial-identification settings, named PointLearning and PartialLearning respectively. We anticipate that the performance of these methods will fall between the established “lower” and “upper” bounds, representing a realistic estimate of effectiveness under selective label conditions. Similar to the SelectedSample(IPW) method, our methods involve estimating a series of weight functions w_{k},k=1,\dots,K to correct for selection bias (see the details in Section 6). In this experiment, we estimate these weight functions using Histogram Gradient Boosting with a 5-fold cross-fitted approach. We select all hyperparameters through 5-fold cross-validation. The cost-sensitive classification problem is solved by a simple neural network with a softmax output layer.

Figure 5: The testing accuracy of different methods with \alpha_{Y}\in\{0.3,0.5,0.7\} and \alpha_{D}\in\{0.3,0.5,0.7\} for Model UC in the synthetic dataset.

The Role of the Observed Variable Z

It is important to note that the random assignment of decision-makers, denoted Z, plays different roles in these methods. In our proposed methods, Z is treated as an instrumental variable that helps correct for selection bias, while in the baseline methods (SelectedSample, SelectedSample(IPW), and FullSample), Z is just one of the observed features.

Experiment Results

Figures 4 and 5 report the testing accuracy of each method on data generated by the NUCEM and UC models in Equation 7.1, respectively. All experiments are repeated 100 times with confounding strengths α_D, α_Y ∈ {0.3, 0.5, 0.7}. Each boxplot shows the distribution of testing accuracy for the different methods under a particular combination of confounding strengths α_D and α_Y.

Moving from the top to the bottom panels, α_Y increases from 0.3 to 0.7, indicating a stronger influence of the unobservable variable U on the true label Y^⋆. As Figures 4 and 5 show, the accuracy of all five methods decreases as α_Y increases, reflecting the growing difficulty of prediction as the influence of unobservable factors intensifies. Moving from the left to the right panels, α_D rises from 0.3 to 0.7, indicating a larger role of the unobservable U in determining the missingness decision D. A larger value of α_D within (0, 1) means that D is driven more by U, suggesting a potentially larger distributional gap between the selected and missing groups. This is visually corroborated in Figures 4 and 5, where the performance gap between SelectedSample and FullSample widens as α_D increases, illustrating the growing challenge of correcting the selection bias.

The results in Figures 4 and 5 demonstrate the effectiveness of our proposed methods in handling selection bias in the data generating process. For each combination of (α_Y, α_D), our Partial Learning method consistently outperforms the baseline methods, including Selected Sample and Selected Sample(IPW). Moreover, the performance of Partial Learning is close to that of Full Sample, which serves as the performance upper bound. The Point Learning method is also competitive, although it is slightly less robust than Partial Learning. The results in Figure 5 are consistent with those in Figure 4, except that under the UC model the point-identification requirement of Theorem 1 is not satisfied, so Point Learning is less effective than Partial Learning there.

Below we provide a detailed analysis of the performance of each method.

  • Selected Sample (IPW): As Figures 4 and 5 show, the performance of the Selected Sample (IPW) approach is close to that of naively running multiclass classification on the selected subsample (Selected Sample), and it is outperformed by our proposed algorithms. This is not surprising, since the propensity score P(D = 1 | X, Z) cannot correct the selection bias caused by the unobserved variables U.

  • PointLearning: Our proposed Point Learning method shows promising results in the bottom-right panel of Figure 4 with (α_Y, α_D) = (0.7, 0.7), in which case the unobserved confounding strength is large. Although the NUCEM model in Equation 7.1 satisfies Proposition 2, which enables point identification of the conditional probabilities and therefore of the classification risk R(h, w), Point Learning does not outperform the baseline methods when the unmeasured confounding strength is smaller. This limitation is attributed to the difficulty of accurately estimating the weight functions {w_k}_{k=1}^K defined in Equation 4.3.

    The crux of the challenge lies in estimating the ratio of two conditional covariances, Cov(D Y_k, Z | X) and Cov(D, Z | X), for each k ∈ [K]. Estimating conditional covariances is inherently harder than estimating conditional expectations, and taking the ratio of two noisy estimates compounds the problem: if the denominator is not estimated precisely, the ratio can become unstable and severely over- or under-shoot (a toy numerical illustration of this instability follows this list). Such inaccuracies can substantially degrade the downstream cost-sensitive classification, undermining the robustness of Point Learning under these conditions.

  • Partial Learning: The cornerstone of our learning framework is Partial Learning. As evidenced by Figures 4 and 5, Partial Learning substantially improves prediction accuracy while maintaining low variance, showcasing its potential for real-world applications. Remarkably, this improvement is consistent across almost all combinations of confounding strengths (α_Y, α_D) and across both decision models, NUCEM and UC.

    Reflecting on the conditions of Theorem 3, Partial Learning only requires the conditional mean functions E[Y_k^⋆ | U, X = x], k = 1, …, K, to be almost surely bounded, aside from the basic instrumental variable (IV) conditions. This requirement is generally more plausible in real-world scenarios than the point-identification assumptions of Theorem 1. The success of Partial Learning can be attributed to two key strategies. First, we adopt a robust optimization approach that develops a prediction rule effective under the least favorable configuration of η^⋆(X), as formalized in our minimax risk formulation (see Equation 5.2). Second, the weighting functions {w_k}_{k=1}^K are built from sums of conditional expectations, which are inherently more stable than the ratio of conditional covariances required for point identification.
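The following toy simulation (our own illustration, not an experiment from the paper) shows how noise in the estimated denominator Cov(D, Z | X) is amplified when the true covariance is small, which is precisely the instability that degrades Point Learning; all numbers are arbitrary placeholders.

```python
# Toy illustration (not from the paper): noise in the estimated denominator
# Cov(D, Z | X) is amplified when the true covariance is small, destabilizing
# the ratio-based weights used by Point Learning.
import numpy as np

rng = np.random.default_rng(0)
true_num, noise_sd = 0.02, 0.01            # hypothetical Cov(DY_k, Z | X) and estimation noise
for true_den in [0.20, 0.05, 0.02]:        # hypothetical values of Cov(D, Z | X)
    est = (true_num + rng.normal(0, noise_sd, 10000)) / \
          (true_den + rng.normal(0, noise_sd, 10000))
    rel_err = np.abs(est - true_num / true_den) / (true_num / true_den)
    print(f"denominator {true_den:.2f}: median relative error {np.median(rel_err):.2f}")
```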

Overall, the compelling performance of Partial Learning in Figures 4 and 5 convinces us of its practical viability. It stands as a reliable method that consistently outperforms the baseline approaches, even when it is uncertain whether the data generating process suffers from selection bias (missing not at random).

7.2 Semi-synthetic Dataset: FICO

Figure 6: Testing accuracy of the methods with α ∈ {0.5, 0.7, 0.9} under model NUCEM on the FICO dataset.
Figure 7: Testing accuracy of the methods with α ∈ {0.5, 0.7, 0.9} under model UC on the FICO dataset.

In this section, we evaluate the performance of our proposed algorithms in a semi-synthetic experiment based on the home loans dataset from FICO [2018]. This dataset consists of 10459 observations of approved home loan applications. It records whether the applicant repays the loan within 90 days overdue, which we view as the true label Y^⋆ ∈ {0, 1}, along with various transaction information from the bank account. The dataset also includes a variable called ExternalRisk, a risk score assigned to each application by a proprietary algorithm. We treat ExternalRisk and all transaction features as the observed features X.

Semi-synthetic Dataset with Selective Labels

In this dataset the label of interest is fully observed, so we synthetically create selective labels on top of it. Specifically, we simulate 10 decision-makers (e.g., bank officers who handle the loan applications) and randomly assign one to each case. We simulate the decision D from a Bernoulli distribution with success rate p_D that depends on an “unobservable” variable U, the decision-maker identity Z, and the ExternalRisk variable (which serves as algorithmic assistance to human decision-making). We then blind the true label Y^⋆ for observations with D = 0. Specifically, we construct U as the residual from a random forest regression of Y^⋆ on X over the whole dataset, which is by construction dependent on Y^⋆. We then specify p_D := P(D = 1 | X, U) according to

Decision (NUCEM): p_D = α · expit{U} + (1 − α) · expit{(1 + Z) · ExternalRisk}, (7.2)
(UC): p_D = expit{α · U + (1 − α) · (1 + Z) · ExternalRisk}.

Here the expit function is given by expit(t) = 1 / (1 + exp(−t)). The parameter α ∈ (0, 1) controls the impact of U on the labeling process and thus the degree of selection bias. It is easy to verify that the sufficient condition in Proposition 2 holds, so point identification of the classification risk R(h, w) is attainable under the NUCEM model in Equation 7.2. A minimal sketch of this construction is given below.
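The following sketch is our own reconstruction of the selective-label construction in Equation 7.2, with hypothetical variable and column names (e.g., external_risk): U is the random-forest residual of Y^⋆ on X, Z is a quasi-randomly assigned decision-maker index, and D is drawn from a Bernoulli distribution with success rate p_D.

```python
# A minimal sketch (our own reconstruction, hypothetical names) of the semi-synthetic
# selective-label construction in Equation 7.2: X is the feature matrix, external_risk
# the ExternalRisk column, and y_star the fully observed FICO label in {0, 1}.
import numpy as np
from scipy.special import expit
from sklearn.ensemble import RandomForestRegressor

def make_selective_labels(X, external_risk, y_star, alpha, model="NUCEM", seed=0):
    rng = np.random.default_rng(seed)
    # U: residual of a random forest regression of Y* on X, hence dependent on Y*.
    U = y_star - RandomForestRegressor(random_state=seed).fit(X, y_star).predict(X)
    Z = rng.integers(0, 10, size=len(y_star))           # quasi-random decision-maker id (0-9 here)
    if model == "NUCEM":                                 # U enters additively (Eq. 7.2, first line)
        p_D = alpha * expit(U) + (1 - alpha) * expit((1 + Z) * external_risk)
    else:                                                # "UC": U enters inside the expit
        p_D = expit(alpha * U + (1 - alpha) * (1 + Z) * external_risk)
    D = rng.binomial(1, p_D)
    y_obs = np.where(D == 1, y_star, np.nan)             # labels are blinded when D = 0
    return Z, D, y_obs, U
```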

Baseline Methods and Our Proposed Methods

We randomly split the data into training and testing sets at a 7:3 ratio. Similar to Section 7.1, on the training set we apply the five types of methods: SelectedSample, SelectedSample(IPW), FullSample, PointLearning, and PartialLearning, where the first three serve as benchmarks for the last two. For each type of method, we try multiple classification algorithms, including AdaBoost, Gradient Boosting, Logistic Regression, Random Forest, and SVM. Our proposed methods also need to estimate some unknown weight functions, which we implement by 5-fold cross-fitted Gradient Boosting. Again, all hyperparameters are chosen via 5-fold cross-validation, and we evaluate the classification accuracy of the resulting classifiers on the testing data.

Similarly, we remark that the decision-maker assignment Z plays different roles across methods. Our proposed methods (PointLearning and PartialLearning) treat Z as an instrumental variable to correct for selection bias. In contrast, the baseline methods (SelectedSample, SelectedSample(IPW), and FullSample) do not necessarily need Z. However, for a fair comparison, we still include Z as a classification feature in the baseline methods, so they also use the information in Z.

Results and Discussions

Figures 6 and 7 present the testing accuracy of each method over 50 experiment replications for α ∈ {0.5, 0.7, 0.9} under the NUCEM and UC models defined in Equation 7.2. First, we observe that the performance of SelectedSample(IPW) is comparable to the baseline SelectedSample, which applies binary classification algorithms directly to the selectively labeled data. Notably, as the strength of unmeasured confounding (α) increases, the gains from our proposed methods, especially PartialLearning, over the baselines (SelectedSample and SelectedSample(IPW)) become more pronounced. Interestingly, PartialLearning outperforms PointLearning even under the NUCEM model, where the point-identification condition is satisfied. As discussed in Section 7.1, this may be because PointLearning relies on estimating a conditional covariance ratio, which is difficult to estimate accurately in practice, whereas PartialLearning requires only conditional expectations and is therefore more stable. This highlights the advantage of PartialLearning: it is robust to violations of the point-identification assumption and achieves more stable performance even when point identification holds.

To summarize, among all the methods we evaluated, PartialLearning emerges as a reliable approach that consistently outperforms the baseline methods, even in scenarios where the unobserved confounding is strong.

8 Conclusion

We address multiclass classification with selectively labeled data, where the label distribution deviates from the full population due to historical decision-making. Leveraging variations in decision rules across multiple decision-makers, we analyze this problem using an instrumental variable (IV) framework, providing necessary and sufficient conditions for exact classification risk identification. When exact identification is unattainable, we derive sharp partial risk bounds. To address label selection bias, we propose a unified cost-sensitive learning (UCL) approach, supported by theoretical guarantees, and demonstrate its effectiveness through comprehensive numerical experiments.

References

  • Angrist and Imbens [1995] J. Angrist and G. Imbens. Identification and estimation of local average treatment effects, 1995.
  • Angrist and Pischke [2009] J. D. Angrist and J.-S. Pischke. Mostly harmless econometrics: An empiricist’s companion. Princeton university press, 2009.
  • Angrist et al. [1996] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91(434):444–455, 1996.
  • Arnold et al. [2022] D. Arnold, W. Dobbie, and P. Hull. Measuring racial discrimination in bail decisions. American Economic Review, 112(9):2992–3038, 2022.
  • Athey and Wager [2021] S. Athey and S. Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021.
  • Athey et al. [2019] S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
  • Balke and Pearl [1994] A. Balke and J. Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46–54. Elsevier, 1994.
  • Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • Ben-Michael et al. [2021] E. Ben-Michael, D. J. Greiner, K. Imai, and Z. Jiang. Safe policy learning through extrapolation: Application to pre-trial risk assessment. arXiv preprint arXiv:2109.11679, 2021.
  • Bertsimas and Fazel-Zarandi [2022] D. Bertsimas and M. M. Fazel-Zarandi. Prescriptive machine learning for public policy: The case of immigration enforcement. Under review, 2022.
  • Chernozhukov et al. [2018] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters, 2018.
  • Coston et al. [2020a] A. Coston, E. Kennedy, and A. Chouldechova. Counterfactual predictions under runtime confounding. Advances in neural information processing systems, 33:4150–4162, 2020a.
  • Coston et al. [2020b] A. Coston, A. Mishler, E. H. Kennedy, and A. Chouldechova. Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 582–593, 2020b.
  • Cui and Tchetgen Tchetgen [2021] Y. Cui and E. Tchetgen Tchetgen. A semiparametric instrumental variable approach to optimal treatment regimes under endogeneity. Journal of the American Statistical Association, 116(533):162–173, 2021.
  • De-Arteaga et al. [2018] M. De-Arteaga, A. Dubrawski, and A. Chouldechova. Learning under selective labels in the presence of expert consistency. arXiv preprint arXiv:1807.00905, 2018.
  • De-Arteaga et al. [2021] M. De-Arteaga, V. Jeanselme, A. Dubrawski, and A. Chouldechova. Leveraging expert consistency to improve algorithmic decision support. arXiv preprint arXiv:2101.09648, 2021.
  • FICO [2018] FICO. Explainable machine learning challenge, 2018. URL https://community.fico.com/s/explainable-machine-learning-challenge?tabset-158d9=3.
  • Imbens and Rubin [2015] G. W. Imbens and D. B. Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
  • Kallus and Zhou [2021] N. Kallus and A. Zhou. Minimax-optimal policy learning under unobserved confounding. Management Science, 67(5):2870–2890, 2021.
  • Kédagni and Mourifie [2017] D. Kédagni and I. Mourifie. Generalized instrumental inequalities: Testing the iv independence assumption. Available at SSRN 2692274, 2017.
  • Kilbertus et al. [2020] N. Kilbertus, M. G. Rodriguez, B. Schölkopf, K. Muandet, and I. Valera. Fair decisions despite imperfect predictions. In International Conference on Artificial Intelligence and Statistics, pages 277–287. PMLR, 2020.
  • Kleinberg et al. [2018] J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2018.
  • Kling [2006] J. R. Kling. Incarceration length, employment, and earnings. American Economic Review, 96(3):863–876, 2006.
  • Lakkaraju et al. [2017] H. Lakkaraju, J. Kleinberg, J. Leskovec, J. Ludwig, and S. Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275–284, 2017.
  • Little and Rubin [2019] R. J. Little and D. B. Rubin. Statistical analysis with missing data, volume 793. John Wiley & Sons, 2019.
  • Manski and Pepper [1998] C. F. Manski and J. V. Pepper. Monotone instrumental variables with an application to the returns to schooling, 1998.
  • Mao et al. [2023] A. Mao, M. Mohri, and Y. Zhong. Cross-entropy loss functions: Theoretical analysis and applications. In International conference on Machine learning, pages 23803–23828. PMLR, 2023.
  • Massart and Nédélec [2006] P. Massart and É. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 2006.
  • Mueller-Smith [2015] M. Mueller-Smith. The criminal and labor market impacts of incarceration. Unpublished working paper, 18, 2015.
  • Pu and Zhang [2021] H. Pu and B. Zhang. Estimating optimal treatment rules with an instrumental variable: A partial identification learning approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(2):318–345, 2021.
  • Qian and Murphy [2011] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.
  • Rambachan et al. [2022] A. Rambachan, A. Coston, and E. Kennedy. Counterfactual risk assessments under unmeasured confounding. arXiv preprint arXiv:2212.09844, 2022.
  • Rambachan et al. [2021] A. Rambachan et al. Identifying prediction mistakes in observational data. Harvard University, 2021.
  • Rosenbaum and Rubin [1983] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Rubin [2005] D. B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Swanson et al. [2018] S. A. Swanson, M. A. Hernán, M. Miller, J. M. Robins, and T. S. Richardson. Partial identification of the average treatment effect using instrumental variables: review of methods for binary instruments, treatments, and outcomes. Journal of the American Statistical Association, 113(522):933–947, 2018.
  • Tewari and Bartlett [2007] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(5), 2007.
  • Wainwright [2019] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • Wang and Tchetgen [2018] L. Wang and E. T. Tchetgen. Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society. Series B, Statistical methodology, 80(3):531, 2018.
  • Wei [2021] D. Wei. Decision-making under selective labels: Optimal finite-domain policies and beyond. In International Conference on Machine Learning, pages 11035–11046. PMLR, 2021.
  • Yang et al. [2022] Y. Yang, Y. Liu, and P. Naghizadeh. Adaptive data debiasing through bounded exploration. Advances in Neural Information Processing Systems, 35:1516–1528, 2022.
  • Zhang et al. [2012] B. Zhang, A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber. Estimating optimal treatment regimes from a classification perspective. Stat, 1(1):103–114, 2012.
  • Zhang et al. [2020] C. Zhang, J. Chen, H. Fu, X. He, Y.-Q. Zhao, and Y. Liu. Multicategory outcome weighted margin-based learning for estimating individualized treatment rules. Statistica sinica, 30:1857, 2020.
  • Zhao et al. [2012] Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

Appendix A Supplements for Point Identification

A.1 Proofs in Section 4

Proof of Lemma 1.

By the definition of excess risk in Equation 4.1, we have

(t,s)\displaystyle\mathcal{R}(t,s^{\star}) =(Yit(Xi))(Yis(Xi))\displaystyle=\mathbb{P}(Y_{i}^{\star}\neq t(X_{i}))-\mathbb{P}(Y_{i}^{\star}\neq s^{\star}(X_{i}))
=𝔼[(Yit(Xi)Xi)(Yis(Xi)Xi)]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}(Y_{i}^{\star}\neq t(X_{i})\mid X_{i})-\mathbb{P}(Y_{i}^{\star}\neq s^{\star}(X_{i})\mid X_{i})\Big{]}
=𝔼[k=1Kηk(Xi)((kt(Xi)Xi)(ks(Xi)Xi))]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{P}(k\neq t(X_{i})\mid X_{i})-\mathbb{P}(k\neq s^{\star}(X_{i})\mid X_{i})\Big{)}\right]
=𝔼[k=1Kηk(Xi)((s(Xi)=kXi)(t(Xi)=kXi))]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{P}(s^{\star}(X_{i})=k\mid X_{i})-\mathbb{P}(t(X_{i})=k\mid X_{i})\Big{)}\right]
=𝔼[k=1Kηk(Xi)𝔼[𝕀{s(Xi)=k}𝕀{t(Xi)=k}Xi]]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\mathbb{E}\Big{[}\mathbb{I}\{s^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\mid X_{i}\Big{]}\right]
=𝔼[k=1Kηk(Xi)(𝕀{s(Xi)=k}𝕀{t(Xi)=k})]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{I}\{s^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\Big{)}\right]
=𝔼[k=1Kηk(Xi)(𝕀{argmaxp[K]ηp(Xi)=k}𝕀{t(Xi)=k})].\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\eta_{p}^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\Big{)}\right].

The second equality applies the law of iterated expectations, and the third the law of total probability. The fourth equality uses the sum-to-one constraint of probabilities, the fifth the definition of the indicator function, and the sixth the law of iterated expectations again. Finally, we apply the definition of the Bayes optimal classifier s^⋆(X_i) = argmax_{k ∈ [K]} η_k^⋆(X_i).

Proof of Theorem 1.

In this part, we prove that, under Assumptions 1 and 2, Assumption 3 is the necessary and sufficient condition for exact identification of P(Y^⋆ = k | X). To simplify notation, we introduce two new variables Y_k^⋆ := 𝕀{Y^⋆ = k} and Y_k := 𝕀{Y = k}.

  • Sufficiency: For any z ∈ 𝒵, the conditional probability P(Y^⋆ = k | X, Z = z) = E[Y_k^⋆ | X, Z = z] can be expressed as

    𝔼[YkX,Z=z]=𝔼[𝔼[YkX,U,Z=z]X,Z=z]\displaystyle\mathbb{E}[Y_{k}^{\star}\mid X,Z=z]=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U,Z=z]\mid X,Z=z\big{]} (A.1)
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[(1D)YkX,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}^{\star}\mid X,U,Z=z]\mid X_{,}Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[(1-D)Y_{k}^{\star}\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[(1D)YkX,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}\mid X,U,Z=z]\mid X_{,}Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[(1-D)Y_{k}^{\star}\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[YkX,U,Z=z]𝔼[(1D)X,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}\mid X,U,Z=z]\mid X,Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U,Z=z]\cdot\mathbb{E}[(1-D)\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[𝔼[DYkX,U,Z=z]X,Z=z]+𝔼[𝔼[YkX,U]𝔼[(1D)X,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[DY_{k}\mid X,U,Z=z]\mid X,Z=z\big{]}+\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[(1-D)\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[DYkX,Z=z]+𝔼[YkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]].\displaystyle=\mathbb{E}[DY_{k}\mid X,Z=z]+\mathbb{E}[Y_{k}^{\star}\mid X,Z=z]-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\big{]}.

    In the first and second equalities, we use the law of iterated expectations and the linearity of expectation. The third equality follows from the consistency between the observed and true labels in Equation 3.1, that is, Y_k = Y_k^⋆ when D = 1. The fourth equality follows from the unconfoundedness of Y_k^⋆ and D given X and U, which is guaranteed by the general unconfoundedness (Y^⋆ ⊥ D | X, U) in Assumption 1. The fifth equality uses the IV independence (Z ⊥ Y_k^⋆ | X) from Assumption 2. The last equality follows again from the law of iterated expectations and the linearity of expectation.

    By subtracting E[Y_k^⋆ | X] from both sides of Equation A.1, we have, for any z ∈ 𝒵,

    0\displaystyle 0 =𝔼[DYkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]X,Z=z]\displaystyle=\mathbb{E}\big{[}DY_{k}\mid X,Z=z\big{]}-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\mid X,Z=z\big{]}
    =𝔼[DYkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]X],\displaystyle=\mathbb{E}\big{[}DY_{k}\mid X,Z=z\big{]}-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\mid X\big{]},

    where the last equality follows from the IV independence (Z ⊥ U | X) in Assumption 2. Now, multiplying both sides of the equation above by the weight z · P(Z = z | X), we have

    0\displaystyle 0 =𝔼[DYkX,Z=z]𝔼[𝔼[YkX,U]𝔼[DX,U,Z=z]X]\displaystyle=\mathbb{E}[DY_{k}\mid X,Z=z]-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U,Z=z]\mid X\big{]}
    =𝔼[zDYkX,Z=z](Z=zX)𝔼[𝔼[YkX,U]𝔼[zDX,U,Z=z](Z=zX)X]\displaystyle=\mathbb{E}[zDY_{k}\mid X,Z=z]\cdot\mathbb{P}(Z=z\mid X)-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[zD\mid X,U,Z=z]\cdot\mathbb{P}(Z=z\mid X)\mid X\big{]}
    =𝔼[zDYkX,Z=z](Z=zX)𝔼[𝔼[YkX,U]𝔼[zDX,U,Z=z](Z=zX,U)X].\displaystyle=\mathbb{E}[zDY_{k}\mid X,Z=z]\cdot\mathbb{P}(Z=z\mid X)-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[zD\mid X,U,Z=z]\cdot\mathbb{P}(Z=z\mid X,U)\mid X\big{]}.

    Here we apply the IV independence (Z ⊥ U | X) again in the last equality above. Summing (or integrating) over z ∈ 𝒵 yields

    0=𝔼[ZDYkX]𝔼[𝔼[YkX,U]𝔼[ZDX,U]X].0=\mathbb{E}[ZDY_{k}\mid X]-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[ZD\mid X,U]\mid X\big{]}.

    Moreover, observe that

    {𝔼[ZDYkX]=Cov(DYk,ZX)+𝔼[DYkX]𝔼[ZX]𝔼[ZDX,U]=Cov(D,ZX,U)+𝔼[DX,U]𝔼[ZX,U],\left\{\begin{aligned} &\mathbb{E}[ZDY_{k}\mid X]=\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)+\mathbb{E}[DY_{k}\mid X]\cdot\mathbb{E}[Z\mid X]\\ &\mathbb{E}[ZD\mid X,U]=\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)+\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U],\end{aligned}\right.

    we have

    0\displaystyle 0 =Cov(DYk,ZX)I+𝔼[DYkX]𝔼[ZX]II𝔼[𝔼[YkX,U]Cov(D,ZX,U)X]III\displaystyle=\underbrace{\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)}_{\text{I}}+\underbrace{\mathbb{E}[DY_{k}^{\star}\mid X]\cdot\mathbb{E}[Z\mid X]}_{\text{II}}-\underbrace{\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}}_{\text{III}} (A.2)
    𝔼[𝔼[YkX,U]𝔼[DX,U]𝔼[ZX,U]X]IV,\displaystyle\quad-\underbrace{\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U]\mid X\big{]}}_{\text{IV}},

    Here again we use the consistency E[D Y_k | X] = E[D Y_k^⋆ | X]. By the unconfoundedness (Y^⋆ ⊥ D | X, U) from Assumption 1 and the IV independence (Z ⊥ U | X) from Assumption 2, we have E[Z | X, U] = E[Z | X] and

    𝔼[YkX,U]𝔼[DX,U]𝔼[ZX,U]=𝔼[DYkX,U]𝔼[ZX].\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U]=\mathbb{E}[DY_{k}^{\star}\mid X,U]\cdot\mathbb{E}[Z\mid X].

    Therefore terms II and IV in Equation A.2 cancel out, which leads to

    0=Cov(DYk,ZX)𝔼[𝔼[YkX,U]Cov(D,ZX,U)X].0=\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}. (A.3)

    Finally, when Assumption 3 holds, that is,

    Cov(Cov(D,ZX,U),𝔼[YkX,U]X)=0,\operatorname*{\mathrm{Cov}}\big{(}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U),\ \mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{)}=0,

    we have

    𝔼[𝔼[YkX,U]Cov(D,ZX,U)X]\displaystyle\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]} =𝔼[𝔼[YkX,U]X]𝔼[Cov(D,ZX,U)X]\displaystyle=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{]}\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}
    =𝔼[YkX]𝔼[Cov(D,ZX,U)X].\displaystyle=\mathbb{E}[Y_{k}^{\star}\mid X]\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}.

    By the law of total covariance (the conditional covariance identity) and the IV independence (Z ⊥ U | X), we have

    Cov(D, Z | X) = E[Cov(D, Z | X, U) | X] + Cov(E[D | X, U], E[Z | X, U] | X)    (A.4)
    = E[Cov(D, Z | X, U) | X] + Cov(E[D | X, U], E[Z | X] | X)
    = E[Cov(D, Z | X, U) | X],
    where the second equality uses E[Z | X, U] = E[Z | X], and the last holds because E[Z | X] is a function of X and hence has zero conditional covariance with E[D | X, U] given X.

    Combining (A.4) with (A.3) leads us to the desired result, that is,

    ηk(X)=𝔼[YkX]=Cov(DYk,ZX)Cov(D,ZX).\eta_{k}^{\star}(X)=\mathbb{E}[Y_{k}^{\star}\mid X]=\frac{\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}.
  • Necessity: Given the identification result E[Y_k^⋆ | X] = Cov(D Y_k, Z | X) / Cov(D, Z | X) and Equation A.4, we have

    Cov(DYk,ZX)\displaystyle\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X) =𝔼[YkX]Cov(D,ZX)\displaystyle=\mathbb{E}[Y_{k}^{\star}\mid X]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X)
    =𝔼[𝔼[YkX,U]X]𝔼[Cov(D,ZX,U)X].\displaystyle=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{]}\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}.

    On the other hand, identity (A.3) holds whenever Assumptions 1 and 2 hold, that is,

    Cov(DYk,ZX)=𝔼[𝔼[YkX,U]Cov(D,ZX,U)X].\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)=\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}.

    Therefore, combining these results, we have

    𝔼[𝔼[YkX,U]Cov(D,ZX,U)X]𝔼[𝔼[YkX,U]X]𝔼[Cov(D,ZX,U)X]=0,\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\cdot\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}-\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid X,U]\mid X\big{]}\cdot\mathbb{E}\big{[}\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)\mid X\big{]}=0,

    which recovers the condition in Assumption 3.

Proof of Theorem 2.

By the result in Lemma 1, we have

(t,s)\displaystyle\mathcal{R}(t,s^{\star}) =𝔼[k=1Kηk(Xi)(𝕀{argmaxp[K]ηp(Xi)=k}𝕀{t(Xi)=k})]\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}\eta_{k}^{\star}(X_{i})\cdot\Big{(}\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\eta_{p}^{\star}(X_{i})=k\}-\mathbb{I}\{t(X_{i})=k\}\Big{)}\right]
=𝔼[k=1K[maxp[K]ηp(Xi)ηk(Xi)]𝕀{t(Xi)=k}]\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K}\big{[}\max_{p\in[K]}\eta_{p}^{\star}(X_{i})-\eta_{k}^{\star}(X_{i})\big{]}\cdot\mathbb{I}\{t(X_{i})=k\}\Big{]}
=𝔼[k=1Kwk(Xi)𝕀{t(Xi)=k}].\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}w_{k}(X_{i})\cdot\mathbb{I}\{t(X_{i})=k\}\right].

For each k ∈ [K], we define w_k(X) := max_{p ∈ [K]} {η_p^⋆(X) − η_k^⋆(X)}. According to Theorem 1, when the NUCEM condition in Assumption 3 is satisfied for each k ∈ [K], the conditional probability η_k^⋆ is uniquely identified by Equation 4.2. Therefore, the weight functions can be further identified as

wk:=maxp[K]{Cov(DYp,ZX)Cov(D,ZX)Cov(DYk,ZX)Cov(D,ZX)},k[K].w_{k}:=\max_{p\in[K]}\left\{\frac{\operatorname*{\mathrm{Cov}}(DY_{p},Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}-\frac{\operatorname*{\mathrm{Cov}}(DY_{k},Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}\right\},~{}~{}\forall~{}k\in[K].

Now, consider binary classification with true label Y^⋆ ∈ {−1, 1}. Let η^⋆(x) := P(Y^⋆ = 1 | X = x) denote the conditional probability of Y^⋆ = 1 given X = x. The Bayes optimal classifier is s^⋆(x) = sgn[η^⋆(x) − 1/2]. Consequently, the excess risk can be written as

(t,s)\displaystyle\mathcal{R}(t,s^{\star}) =(Yt(X))(Ys(X))\displaystyle=\mathbb{P}\big{(}Y^{\star}\neq t(X)\big{)}-\mathbb{P}\big{(}Y^{\star}\neq s^{\star}(X)\big{)}
=𝔼[𝕀{Yt(X)}𝕀{Ys(X)}]\displaystyle=\mathbb{E}\Big{[}\mathbb{I}\big{\{}Y^{\star}\neq t(X)\big{\}}-\mathbb{I}\big{\{}Y^{\star}\neq s^{\star}(X)\big{\}}\Big{]}
=𝔼[η(X)(𝕀{1t(X)}𝕀{1s(X)})+(1η(X))(𝕀{1t(X)}𝕀{1s(X)})]\displaystyle=\mathbb{E}\Big{[}\eta^{\star}(X)\cdot\Big{(}\mathbb{I}\{1\neq t(X)\}-\mathbb{I}\{1\neq s^{\star}(X)\}\Big{)}+(1-\eta^{\star}(X))\cdot\Big{(}\mathbb{I}\{-1\neq t(X)\}-\mathbb{I}\{-1\neq s^{\star}(X)\}\Big{)}\Big{]}
=𝔼[(2η(X)1)(s(X)t(X)2)]\displaystyle=\mathbb{E}\Big{[}\big{(}2\eta^{\star}(X)-1\big{)}\cdot\Big{(}\frac{s^{\star}(X)-t(X)}{2}\Big{)}\Big{]}
=𝔼[|η(X)1/2||s(X)t(X)|]\displaystyle=\mathbb{E}\Big{[}\big{|}\eta^{\star}(X)-1/2\big{|}\cdot\big{|}s^{\star}(X)-t(X)\big{|}\Big{]}
=𝔼[|η(X)1/2|𝕀{t(X)sgn[η(X)1/2]}].\displaystyle=\mathbb{E}\Big{[}\big{|}\eta^{\star}(X)-1/2\big{|}\cdot\mathbb{I}\big{\{}t(X)\neq\operatorname*{\mathrm{sgn}}[\eta^{\star}(X)-1/2]\big{\}}\Big{]}.

By defining the weight function w^exact(X) := η^⋆(X) − 1/2, we have

(t,s)=𝔼[|wexact(X)|𝕀{t(X)sgn[wexact(X)]}]:=(t,wexact).\mathcal{R}(t,s^{\star})=\mathbb{E}\Big{[}\big{|}w^{\rm exact}(X)\big{|}\cdot\mathbb{I}\big{\{}t(X)\neq\operatorname*{\mathrm{sgn}}[w^{\rm exact}(X)]\big{\}}\Big{]}:=\mathcal{R}(t,w^{\rm exact}).

Applying the identification result from Theorem 1, we obtain

wexact(X)=Cov(DY,ZX)Cov(D,ZX)1/2.w^{\rm exact}(X)=\frac{\operatorname*{\mathrm{Cov}}(DY,Z\mid X)}{\operatorname*{\mathrm{Cov}}(D,Z\mid X)}-1/2.

A.2 Sufficient Condition for Point Identification

Proposition 2 (A Sufficient Condition for Point Identification).

Suppose Assumptions 1 and 2 hold. For any j, l ∈ [m], define Δ_{jl}(X, U) := E[D | X, U, Z = j] − E[D | X, U, Z = l]. Assumption 3 holds if, for any j, l ∈ [m],

Cov(Δjl(X,U),(Y=kX,U)X)=0,a.s.\operatorname{Cov}(\Delta_{jl}(X,U),\mathbb{P}(Y^{\star}=k\mid X,U)\mid X)=0,~{}a.s. (A.5)

Proposition 2 gives a sufficient condition for Assumption 3. This condition permits the unobserved features U to influence both the decision D and the true outcome Y^⋆, but constrains this influence to a specific form. In particular, the condition is met when E[D | U, X, Z = j] = g_j(X) + q(U) almost surely for some functions g_j and q, i.e., when the impact of U on the decision is additive and identical across decision-makers.
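As a quick check of this claim (a direct verification rather than an additional result): if E[D | U, X, Z = j] = g_j(X) + q(U) almost surely for every j, then

Δ_{jl}(X, U) = (g_j(X) + q(U)) − (g_l(X) + q(U)) = g_j(X) − g_l(X),

which is a function of X alone; hence Cov(Δ_{jl}(X, U), P(Y^⋆ = k | X, U) | X) = 0 almost surely, and condition (A.5) holds.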

Proof of Proposition 2.

Notice that the conditional covariance Cov(D, Z | X, U) can be decomposed as follows:

Cov(D,ZX,U)=𝔼[DZX,U]𝔼[DX,U]𝔼[ZX,U]\displaystyle\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)=\mathbb{E}[DZ\mid X,U]-\mathbb{E}[D\mid X,U]\cdot\mathbb{E}[Z\mid X,U]
=\displaystyle= j=1m𝔼[DZX,U,Z=j](Z=jX,U)𝔼[DX,U]j=1mj(Z=jX,U)\displaystyle\sum_{j=1}^{m}\mathbb{E}[DZ\mid X,U,Z=j]\cdot\mathbb{P}(Z=j\mid X,U)-\mathbb{E}[D\mid X,U]\cdot\sum_{j=1}^{m}j\cdot\mathbb{P}(Z=j\mid X,U)
=\displaystyle= j=1mj𝔼[DX,U,Z=j](Z=jX,U)l=1m𝔼[DX,U,Z=l](Z=lX,U)j=1mj(Z=jX,U)\displaystyle\sum_{j=1}^{m}j\cdot\mathbb{E}[D\mid X,U,Z=j]\cdot\mathbb{P}(Z=j\mid X,U)-\sum_{l=1}^{m}\mathbb{E}[D\mid X,U,Z=l]\cdot\mathbb{P}(Z=l\mid X,U)\cdot\sum_{j=1}^{m}j\cdot\mathbb{P}(Z=j\mid X,U)
=\sum_{j=1}^{m}\sum_{l=1}^{m}j\cdot\mathbb{P}(Z=j\mid X,U)\cdot\mathbb{P}(Z=l\mid X,U)\cdot\big[\mathbb{E}[D\mid X,U,Z=j]-\mathbb{E}[D\mid X,U,Z=l]\big].

Define Δ_{jl}(X, U) := E[D | X, U, Z = j] − E[D | X, U, Z = l]. If for any j, l ∈ [m] we have

Cov(Δjl(X,U),(Y=kX,U)X)=0almost surely,\operatorname*{\mathrm{Cov}}(\Delta_{jl}(X,U),~{}\mathbb{P}(Y^{\star}=k\mid X,U)\mid X)=0~{}~{}\text{almost surely},

then Assumption 3 is guaranteed. Indeed, by the IV independence (Z ⊥ U | X) in Assumption 2, P(Z = j | X, U) = P(Z = j | X) for every j, so the decomposition above expresses Cov(D, Z | X, U) as a linear combination of the Δ_{jl}(X, U) with coefficients that depend on X only; its conditional covariance with P(Y^⋆ = k | X, U) given X is therefore the same linear combination of the covariances Cov(Δ_{jl}(X, U), P(Y^⋆ = k | X, U) | X), each of which is zero.

Appendix B Supplements for Partial Identification

B.1 Tightness of IV Partial Bounds in Theorem 3

Theorem 6 (Tightness of IV Partial Bound).

Assume Assumption 1 holds, and fix the conditional joint distribution of the observable variables (Y_k, D, Z) | X = x. Consider a new set of random variables (Ỹ_k^⋆, D̃, Z̃, X̃, Ũ), where Ỹ_k^⋆ and D̃ are binary and Z̃ = Z. Define the observed label Ỹ_k as D̃ · Ỹ_k^⋆ + (1 − D̃) · NaN. If the expectation E[Ỹ_k^⋆ | X̃ = x] falls within the interval [l_k(x), u_k(x)] for all x ∈ 𝒳, then it is possible to construct a conditional joint distribution for (Ỹ_k^⋆, D̃, Z̃, Ũ) given X̃ = x that satisfies:

  1. The IV independence in Assumption 2 is met.

  2. The conditional joint distribution of (Ỹ_k, D̃, Z̃) | X̃ = x matches the initial distribution of (Y_k, D, Z) | X = x.

  3. The natural bounds in Assumption 4 are maintained.

Theorem 6, together with Theorem 3, establishes a two-way correspondence between the conditional mean outcome E[Ỹ_k^⋆ | X̃ = x] and the IV partial bound [l_k(x), u_k(x)]. On the one hand, any value of E[Ỹ_k^⋆ | X̃ = x] within this interval corresponds to a feasible conditional joint distribution of (Ỹ_k^⋆, D̃, Z̃) | X̃ = x that is consistent with the observed data (Y_k, D, Z) | X = x. On the other hand, any conditional joint distribution of (Ỹ_k^⋆, D̃, Z̃) | X̃ = x consistent with the observed data must yield a conditional mean outcome E[Ỹ_k^⋆ | X̃ = x] within [l_k(x), u_k(x)], as established in Theorem 3.

These results underscore the tightness of the IV partial bound in Theorem 3 and the necessity of the bounded label assumption in Assumption 4 for the partial identification of the conditional probability ηk\eta_{k}^{\star}.

Proof of Theorem 6.

The proof is adapted from Kédagni and Mourifie [2017]. Let (Ỹ_k^⋆, D̃, Z̃, X̃, Ũ) be random variables such that Z̃ = Z almost surely and l_k(x) ≤ E[Ỹ_k^⋆ | X̃ = x] ≤ u_k(x) for every x ∈ 𝒳. Assuming Assumption 1 holds and fixing X̃ = x for some x ∈ 𝒳, we aim to prove that, for every possible value of E[Ỹ_k^⋆ | X̃ = x] ∈ [l_k(x), u_k(x)], there exists a conditional joint distribution of (Ỹ_k^⋆, D̃, Z̃, Ũ) given X̃ = x such that: (1) the IV independence in Assumption 2 holds, that is, Z̃ ⊥ Ỹ_k^⋆ | X̃ = x for every x ∈ 𝒳; (2) the new observable variables (Ỹ_k, D̃, Z̃) | X̃ = x have the same conditional joint distribution as the fixed observable variables (Y_k, D, Z) | X = x; and (3) Assumption 4 holds.

Without loss of generality, we first consider the case E[Ỹ_k^⋆ | X̃ = x] = l_k(x). Accordingly, we set

(Y~k=1X~=x)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1\mid\tilde{X}=x) :=lk(x)=maxz[m]{𝔼[DYkX=x,Z=z]+a(x)(1𝔼[DX=x,Z=z])}\displaystyle:=l_{k}(x)=\max_{z\in[m]}\big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a(x)\cdot\big{(}1-\mathbb{E}[D\mid X=x,Z=z]\big{)}\big{\}} (B.1)
(Y~k=0X=x)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0\mid X=x) :=1lk(x),\displaystyle:=1-l_{k}(x),

where the second line follows from the sum-to-one constraint of conditional probabilities. Since the observed data place no further restrictions on the conditional joint distribution of (Ỹ_k^⋆, D̃) given X̃ = x and Z̃ = z, we are free to define the following: for all z ∈ 𝒵,

(Y~k=1,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) :=𝔼[DYkX=x,Z=z]\displaystyle:=\mathbb{E}[DY_{k}\mid X=x,Z=z] (B.2)
(Y~k=1,D~=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) :=lk(x)𝔼[DYkX=x,Z=z]\displaystyle:=l_{k}(x)-\mathbb{E}[DY_{k}\mid X=x,Z=z]
(Y~k=0,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) :=𝔼[DX=x,Z=z]𝔼[DYkX=x,Z=z]\displaystyle:=\mathbb{E}[D\mid X=x,Z=z]-\mathbb{E}[DY_{k}\mid X=x,Z=z]
(Y~k=0,D~=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) :=(1lk(x))(𝔼[DX=x,Z=z]𝔼[DYkX=x,Z=z]).\displaystyle:=\big{(}1-l_{k}(x)\big{)}-\big{(}\mathbb{E}[D\mid X=x,Z=z]-\mathbb{E}[DY_{k}\mid X=x,Z=z]\big{)}.

Notice that all of these conditional probabilities are well defined, as they lie between 0 and 1, and the sum-to-one constraint holds:

i{0,1}j{0,1}(Y~k=i,D~=jX,Z=z)=1,z[m].\sum_{i\in\{0,1\}}\sum_{j\in\{0,1\}}\mathbb{P}(\tilde{Y}_{k}^{\star}=i,\tilde{D}=j\mid X,Z=z)=1,~{}~{}\forall~{}z\in[m].

We now prove the three statements mentioned above.

  1. IV Independence: To check that Z̃ ⊥ Ỹ_k^⋆ | X̃ = x, notice that for any z ∈ 𝒵,

    (Y~k=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=1\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=1,D~=1X~=,Z~=z)+(Y~k=1,D~=0X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=1\mid\tilde{X}=,\tilde{Z}=z)+\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z)
    =lk(x)\displaystyle=l_{k}(x)
    (Y~k=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}^{\star}=0\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=0,D~=1X~=x,Z~=z)+(Y~k=0,D~=0X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z)+\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z)
    =1lk(x).\displaystyle=1-l_{k}(x).

    The conditional distribution of Ỹ_k^⋆ given X̃ = x and Z̃ = z does not depend on z for any z ∈ [m], and therefore Z̃ is conditionally independent of Ỹ_k^⋆ given X̃ = x.

  2. (Ỹ_k, D̃, Z̃) | X̃ = x has the same conditional joint distribution as (Y_k, D, Z) | X = x: Recall that the observed label is defined as Ỹ_k = D̃ Ỹ_k^⋆ + (1 − D̃) · NaN. By Equation B.2, we have for any z ∈ 𝒵,

    (Y~k=1,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}=1,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=1,D~=1X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=1,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) (B.3)
    =𝔼[DYkX=x,Z=z]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]
    =(Yk=1,D=1X=x,Z=z)\displaystyle=\mathbb{P}(Y_{k}=1,D=1\mid X=x,Z=z)
    (Y~k=0,D~=1X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z) =(Y~k=0,D~=1X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{Y}_{k}^{\star}=0,\tilde{D}=1\mid\tilde{X}=x,\tilde{Z}=z)
    =𝔼[DX=x,Z=z]𝔼[DYkX=x,Z=z]\displaystyle=\mathbb{E}[D\mid X=x,Z=z]-\mathbb{E}[DY_{k}\mid X=x,Z=z]
    =(Yk=0,D=1X=x,Z=z),\displaystyle=\mathbb{P}(Y_{k}=0,D=1\mid X=x,Z=z),

    and

    (Y~k=NaN,D~=0X~=x,Z~=z)\displaystyle\mathbb{P}(\tilde{Y}_{k}=\text{NaN},\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) =(D~=0X~=x,Z~=z)\displaystyle=\mathbb{P}(\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z) (B.4)
    =k{0,1}(Y~k=k,D~=0X~=x,Z~=z)\displaystyle=\sum_{k\in\{0,1\}}\mathbb{P}(\tilde{Y}_{k}^{\star}=k,\tilde{D}=0\mid\tilde{X}=x,\tilde{Z}=z)
    =1𝔼[DX=x,Z=z]\displaystyle=1-\mathbb{E}[D\mid X=x,Z=z]
    =(D=0X=x,Z=z)\displaystyle=\mathbb{P}(D=0\mid X=x,Z=z)
    =(Yk=NA,D=0X=x,Z=z).\displaystyle=\mathbb{P}(Y_{k}=\text{NA},D=0\mid X=x,Z=z).

    As a result, we have shown that the observed variables (Ỹ_k, D̃, Z̃) | X̃ = x have exactly the same conditional joint distribution as (Y_k, D, Z) | X = x.

  3. Finally, we show that when E[Ỹ_k^⋆ | X̃ = x] = l_k(x), there exists a joint distribution on (Ỹ_k^⋆, X̃, Ũ) such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x). Without loss of generality, we assume

    z0:=argmaxz[m]{𝔼[DYkX=x,Z=z]+ak(x)𝔼[(1D)X=x,Z=z]}z_{0}:=\operatorname*{arg\,max}_{z\in[m]}\big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]\big{\}}

    and then we have

    𝔼[Y~kX~=x]=lk(x)=𝔼[DYkX=x,Z=z0]+ak(x)𝔼[(1D)X=x,Z=z0].\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x]=l_{k}(x)=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z_{0}]. (B.5)

    Meanwhile, by the IV independence Z̃ ⊥ Ỹ_k^⋆ | X̃ = x, we have for Z̃ = z_0,

    𝔼[Y~kX~=x]=𝔼[Y~kX~=x,Z~=z0]\displaystyle\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x]=\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x,\tilde{Z}=z_{0}]
    =𝔼[D~Y~kX~=x,Z~=z0]+𝔼[(1D~)Y~kX~=x,Z~=z0]\displaystyle=\mathbb{E}[\tilde{D}\tilde{Y}_{k}^{\star}\mid\tilde{X}=x,\tilde{Z}=z_{0}]+\mathbb{E}[(1-\tilde{D})\tilde{Y}_{k}^{\star}\mid\tilde{X}=x,\tilde{Z}=z_{0}]
    =𝔼[D~Y~kX~=x,Z~=z0]+𝔼[𝔼[(1D~)Y~kU~,X~=x,Z~=z0]X~=x,Z~=z0]\displaystyle=\mathbb{E}[\tilde{D}\tilde{Y}_{k}\mid\tilde{X}=x,\tilde{Z}=z_{0}]+\mathbb{E}[\mathbb{E}[(1-\tilde{D})\tilde{Y}_{k}^{\star}\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}]
    =𝔼[D~Y~kX~=x,Z~=z0]+𝔼[𝔼[Y~kU~,X~=x]𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0].\displaystyle=\mathbb{E}[\tilde{D}\tilde{Y}_{k}\mid\tilde{X}=x,\tilde{Z}=z_{0}]+\mathbb{E}[\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{U},\tilde{X}=x]\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}].

    The first and second lines follow from the IV independence and the linearity of conditional expectation. The third line follows from the consistency of the observed outcome together with the law of iterated expectations. The last line follows from Assumption 1 (D̃ ⊥ Ỹ_k^⋆ | X̃, Ũ) and the IV independence.

    From Equations B.3 and B.4, we have E[D̃ Ỹ_k | X̃ = x, Z̃ = z_0] = E[D Y_k | X = x, Z = z_0], and therefore

    𝔼[Y~kX~=x]\displaystyle\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x] =𝔼[DYkX=x,Z=z0]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}] (B.6)
    +𝔼[𝔼[Y~kU~,X~=x]𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0].\displaystyle~{}+\mathbb{E}[\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{U},\tilde{X}=x]\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}].

    Now, suppose for contradiction that there is no conditional joint distribution of (Ỹ_k^⋆, D̃, Ũ) given X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x) almost surely. Then any conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x must instead satisfy E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x) + ε(x, Ũ) almost surely for some perturbation ε. Substituting this into Equation B.6, we have

    𝔼[Y~kX~=x]\displaystyle\mathbb{E}[\tilde{Y}_{k}^{\star}\mid\tilde{X}=x] =𝔼[DYkX=x,Z=z0]+ak(x)𝔼[(1D~)X~=x,Z~=z0]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}]+a_{k}(x)\cdot\mathbb{E}\big{[}(1-\tilde{D})\mid\tilde{X}=x,\tilde{Z}=z_{0}\big{]} (B.7)
    +𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0]\displaystyle\quad+\mathbb{E}\Big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\Big{]}
    =lk(x)+𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0].\displaystyle=l_{k}(x)+\mathbb{E}\Big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\Big{]}.

    In the first equality, we use the law of iterated expectations and the fact that a_k(x) does not depend on the unobserved variable Ũ. The second equality follows from Equations B.3 and B.4, which give P(D̃ = 0 | X̃ = x, Z̃ = z_0) = P(D = 0 | X = x, Z = z_0) for any fixed x ∈ 𝒳, together with the definition of l_k(x).

    As a consequence, if

    𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0]0,\mathbb{E}\big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\big{]}\neq 0,

    then Equation B.7 contradicts Equation B.5, implying that there is indeed a conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x). On the other hand, if

    𝔼[ε(X~,U~)𝔼[(1D~)U~,X~=x,Z~=z0]X~=x,Z~=z0]=0,\mathbb{E}\big{[}\varepsilon(\tilde{X},\tilde{U})\cdot\mathbb{E}[(1-\tilde{D})\mid\tilde{U},\tilde{X}=x,\tilde{Z}=z_{0}]\mid\tilde{X}=x,\tilde{Z}=z_{0}\big{]}=0, (B.8)

    then the conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x induced by Equation B.8 is exactly the “right” distribution for establishing E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x).

    Therefore, we have shown that when E[Ỹ_k^⋆ | X̃ = x] = l_k(x), there is always a conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = a_k(x) almost surely.

To wrap up, note that statements 1 and 2 do not rely on the condition P(Ỹ_k^⋆ = 1 | X̃ = x) = l_k(x) in Equation B.1. In fact, for any g_k(x) ∈ [l_k(x), u_k(x)] with x ∈ 𝒳 fixed, we can simply set P(Ỹ_k^⋆ = 1 | X̃ = x) = g_k(x). The IV independence Z̃ ⊥ Ỹ_k^⋆ | X̃ = x still holds, and the new observable variables (Ỹ_k, D̃, Z̃) | X̃ = x still have the same conditional distribution as (Y_k, D, Z) | X = x, which is fixed a priori.

However, the choice of P(Ỹ_k^⋆ = 1 | X̃ = x) does affect the value of E[Ỹ_k^⋆ | Ũ, X̃ = x]. If we let E[Ỹ_k^⋆ | X̃ = x] = u_k(x), we can show in a similar way that E[Ỹ_k^⋆ | Ũ, X̃ = x] = b_k(x) almost surely. In fact, following the proof of statement 3, one can show that, for any g_k(x) ∈ [l_k(x), u_k(x)] with E[Ỹ_k^⋆ | X̃ = x] = g_k(x), there is always a conditional joint distribution for (Ỹ_k^⋆, D̃, Ũ) | X̃ = x such that E[Ỹ_k^⋆ | Ũ, X̃ = x] = c_k(x) almost surely for some c_k(x) ∈ [a_k(x), b_k(x)].

Therefore, we have shown that under Assumption 1, whenever E[Ỹ_k^⋆ | X̃ = x] ∈ [l_k(x), u_k(x)], there is always a conditional joint distribution for (Ỹ_k^⋆, D̃, Z̃, Ũ) | X̃ = x such that the three statements in Theorem 6 hold. We conclude that the IV partial bound [l_k(x), u_k(x)] introduced in Theorem 3 is sharp for E[Y_k^⋆ | X = x] under Assumption 1, the IV independence in Assumption 2, and Assumption 4.

B.2 Balke and Pearl’s Bound

Balke and Pearl [1994] provide partial identification bounds for the average treatment effect of a binary treatment with a binary instrumental variable. In this section, we adapt their bound to our setting with partially observed labels and a binary IV (i.e., assignment to one of two decision-makers). Under Assumption 2, we have the following decomposition of the joint distribution of (Y, D, Z, U):

(Y,D,Z,U)=(YD,U)(DZ,U)(Z)(U).\mathbb{P}(Y,D,Z,U)=\mathbb{P}(Y\mid D,U)\cdot\mathbb{P}(D\mid Z,U)\cdot\mathbb{P}(Z)\cdot\mathbb{P}(U). (B.9)

Here we omit the observed covariates X for simplicity; alternatively, all distributions can be viewed as implicitly conditioning on X. We now define three response functions that characterize the values of Z, (D(0), D(1)), and Y^⋆:

rZ={0ifZ=01ifZ=1,rD={0ifD(0)=0andD(1)=01ifD(0)=0andD(1)=12ifD(0)=1andD(1)=03ifD(0)=1andD(1)=1,rY={0ifY=01ifY=1.r_{Z}=\left\{\begin{aligned} &0\ &\text{if}\ Z=0\\ &1\ &\text{if}\ Z=1\\ \end{aligned}\right.,\quad r_{D}=\left\{\begin{aligned} &0\quad&\text{if}\quad D(0)=0\ \text{and}\ D(1)=0\\ &1\quad&\text{if}\quad D(0)=0\ \text{and}\ D(1)=1\\ &2\quad&\text{if}\quad D(0)=1\ \text{and}\ D(1)=0\\ &3\quad&\text{if}\quad D(0)=1\ \text{and}\ D(1)=1\\ \end{aligned}\right.,\quad r_{Y}=\left\{\begin{aligned} &0\ &\text{if}\quad Y^{\star}=0\\ &1\ &\text{if}\quad Y^{\star}=1\\ \end{aligned}\right..

Next, we specify the joint distribution of the unobservable variables r_D and r_Y as

qkj=(rD=k,rY=j)k{0,1,2,3},j{0,1},q_{kj}=\mathbb{P}(r_{D}=k,r_{Y}=j)\quad\forall\ k\in\{0,1,2,3\},\ j\in\{0,1\},

which satisfies the constraint Σ_{k=0}^{3} (q_{k0} + q_{k1}) = 1. The target mean parameter of the true outcome can then be written as a linear combination of the q's. Moreover, the observable distribution P(Y, D | Z) is fully specified by the following six variables:

\begin{aligned}
p_{na,0}&=\mathbb{P}(D=0\mid Z=0)\qquad & p_{na,1}&=\mathbb{P}(D=0\mid Z=1)\\
p_{01,0}&=\mathbb{P}(Y=0,D=1\mid Z=0)\qquad & p_{01,1}&=\mathbb{P}(Y=0,D=1\mid Z=1)\\
p_{11,0}&=\mathbb{P}(Y=1,D=1\mid Z=0)\qquad & p_{11,1}&=\mathbb{P}(Y=1,D=1\mid Z=1),
\end{aligned}

with constraints $p_{11,0}+p_{01,0}+p_{na,0}=1$ and $p_{11,1}+p_{01,1}+p_{na,1}=1$. We also have the following relations between the $p$'s and the $q$'s:

\begin{aligned} p_{na,0}&=q_{00}+q_{01}+q_{10}+q_{11}\\ p_{01,0}&=q_{20}+q_{30}\\ p_{11,0}&=q_{21}+q_{31}\\ \end{aligned}\qquad\begin{aligned} p_{na,1}&=q_{00}+q_{01}+q_{20}+q_{21}\\ p_{01,1}&=q_{10}+q_{30}\\ p_{11,1}&=q_{11}+q_{31}.\\ \end{aligned}

Therefore, we have $p=Pq$, where $p=(p_{na,0},p_{01,0},p_{11,0},p_{na,1},p_{01,1},p_{11,1})$, $q=(q_{00},q_{10},q_{20},q_{30},q_{01},q_{11},q_{21},q_{31})$, and

P=\begin{bmatrix}1&1&0&0&1&1&0&0\\ 0&0&1&1&0&0&0&0\\ 0&0&0&0&0&0&1&1\\ 1&0&1&0&1&0&1&0\\ 0&1&0&1&0&0&0&0\\ 0&0&0&0&0&1&0&1\\ \end{bmatrix}.

Then the lower bound on $\mathbb{P}(Y^{\star}=1)$ can be written as the optimal value of the following linear program:

\begin{aligned} \min\quad&q_{01}+q_{11}+q_{21}+q_{31}\\ \text{subject to}\quad&\sum_{k=0}^{3}\sum_{j=0}^{1}q_{kj}=1\\ &Pq=p\\ &q_{kj}\geq 0\quad k\in\{0,1,2,3\},\ j\in\{0,1\}.\\ \end{aligned} (B.10)

Similarly, the upper bound on $\mathbb{P}(Y^{\star}=1)$ can be written as the optimal value of the following linear program:

\begin{aligned} \max\quad&q_{01}+q_{11}+q_{21}+q_{31}\\ \text{subject to}\quad&\sum_{k=0}^{3}\sum_{j=0}^{1}q_{kj}=1\\ &Pq=p\\ &q_{kj}\geq 0\quad k\in\{0,1,2,3\},\ j\in\{0,1\}.\\ \end{aligned} (B.11)

In fact, by simply comparing the variables in the objective function $\mathbb{P}(Y^{\star}=1)=q_{01}+q_{11}+q_{21}+q_{31}$ with those in the constraints, one finds that

\begin{aligned}
p_{11,0}&=q_{21}+q_{31}\leq\mathbb{P}(Y^{\star}=1)\\
p_{11,1}&=q_{11}+q_{31}\leq\mathbb{P}(Y^{\star}=1)\\
p_{11,0}+p_{na,0}&=q_{00}+q_{10}+q_{01}+q_{11}+q_{21}+q_{31}\geq\mathbb{P}(Y^{\star}=1)\\
p_{11,1}+p_{na,1}&=q_{00}+q_{20}+q_{01}+q_{11}+q_{21}+q_{31}\geq\mathbb{P}(Y^{\star}=1).
\end{aligned}

If we let

\begin{aligned}
L&=\max\{p_{11,0},p_{11,1}\}=\max_{z\in\{0,1\}}\ \{\mathbb{P}(Y=1,D=1\mid Z=z)\}\\
U&=\min\{p_{11,0}+p_{na,0},p_{11,1}+p_{na,1}\}=\min_{z\in\{0,1\}}\ \{\mathbb{P}(Y=1,D=1\mid Z=z)+\mathbb{P}(D=0\mid Z=z)\},
\end{aligned} (B.12)

then we have the following partial bounds on $\mathbb{P}(Y^{\star}=1)$:

L\ \leq\ \mathbb{P}(Y^{\star}=1)\ \leq\ U.

According to Balke and Pearl [1994], the bounds above are tight for $\mathbb{P}(Y^{\star}=1)$. We note that if we condition on $X$ in these bounds, then the corresponding bounds on $\mathbb{P}(Y^{\star}=1\mid X)$ coincide with the bounds in Theorem 3 specialized to a binary instrument.
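For concreteness, the following Python sketch (not part of the paper; it assumes numpy and scipy are available and uses a hypothetical observed distribution $p$) solves the linear programs B.10 and B.11 with scipy.optimize.linprog and checks that their optimal values match the closed-form bounds $L$ and $U$ in Equation B.12.

```python
# Minimal sketch (hypothetical distribution, assumes numpy and scipy): check the
# closed-form bounds L and U in Equation B.12 against the LPs in B.10-B.11.
import numpy as np
from scipy.optimize import linprog

# Response-type ordering for q: (q00, q10, q20, q30, q01, q11, q21, q31),
# i.e. r_Y = 0 for the first four entries and r_Y = 1 for the last four.
P = np.array([
    [1, 1, 0, 0, 1, 1, 0, 0],  # p_{na,0} = P(D=0 | Z=0)
    [0, 0, 1, 1, 0, 0, 0, 0],  # p_{01,0} = P(Y=0, D=1 | Z=0)
    [0, 0, 0, 0, 0, 0, 1, 1],  # p_{11,0} = P(Y=1, D=1 | Z=0)
    [1, 0, 1, 0, 1, 0, 1, 0],  # p_{na,1} = P(D=0 | Z=1)
    [0, 1, 0, 1, 0, 0, 0, 0],  # p_{01,1} = P(Y=0, D=1 | Z=1)
    [0, 0, 0, 0, 0, 1, 0, 1],  # p_{11,1} = P(Y=1, D=1 | Z=1)
], dtype=float)

# Hypothetical observed distribution generated from a feasible q, so Pq = p is
# consistent; the constraint sum(q) = 1 is then implied by the first three rows.
q_true = np.array([0.10, 0.10, 0.05, 0.05, 0.20, 0.20, 0.15, 0.15])
p = P @ q_true

c = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)    # P(Y*=1) = q01+q11+q21+q31
lower_lp = linprog(c, A_eq=P, b_eq=p, bounds=(0, 1)).fun
upper_lp = -linprog(-c, A_eq=P, b_eq=p, bounds=(0, 1)).fun

p_na0, p_010, p_110, p_na1, p_011, p_111 = p
L = max(p_110, p_111)                                  # closed-form lower bound
U = min(p_110 + p_na0, p_111 + p_na1)                  # closed-form upper bound
print(lower_lp, L)   # 0.35 and 0.35 for this q_true
print(upper_lp, U)   # 0.85 and 0.85 for this q_true
```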

B.3 Proofs in Section 5

Proof of Theorem 3.

We first prove the construction of the IV partial bound, and then show that the IV partial bound $[l_{k}(x),u_{k}(x)]$ is tighter than the natural bound $[a_{k}(x),b_{k}(x)]$ for all $k\in[K]$ when $x$ is fixed.

1.

    Under Assumptions 1 and 2, Equation A.1 gives that

    (Y=kX=x)=𝔼[YkX=x]\displaystyle\mathbb{P}(Y^{\star}=k\mid X=x)=\mathbb{E}[Y_{k}^{\star}\mid X=x]
    =𝔼[DYkX=x,Z=z]+𝔼[𝔼[YkU,X=x]𝔼[(1D)U,X=x,Z=z]X=x,Z=z].\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]+\mathbb{E}\big{[}\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\cdot\mathbb{E}[(1-D)\mid U,X=x,Z=z]\mid X=x,Z=z\big{]}.

    As we have assumed that 𝔼[YkU,X=x][ak(x),bk(x)]\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\in[a_{k}(x),b_{k}(x)] almost surely in Assumption 4, we then have

    𝔼[YkX=x]\displaystyle\mathbb{E}[Y_{k}^{\star}\mid X=x] 𝔼[DYkX=x,Z=z]+ak(x)𝔼[𝔼[(1D)U,X=x,Z=z]X=x,Z=z]\displaystyle\geq\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}\big{[}\mathbb{E}[(1-D)\mid U,X=x,Z=z]\mid X=x,Z=z\big{]}
    =𝔼[DYkX=x,Z=z]+ak(x)𝔼[(1D)X=x,Z=z]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]

    and

    𝔼[YkX=x]\displaystyle\mathbb{E}[Y_{k}^{\star}\mid X=x] 𝔼[DYkX=x,Z=z]+bk(x)𝔼[𝔼[(1D)U,X=x,Z=z]X=x,Z=z]\displaystyle\leq\mathbb{E}[DY_{k}\mid X=x,Z=z]+b_{k}(x)\cdot\mathbb{E}\big{[}\mathbb{E}[(1-D)\mid U,X=x,Z=z]\mid X=x,Z=z\big{]}
    =𝔼[DYkX=x,Z=z]+bk(x)𝔼[(1D)X=x,Z=z]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z]+b_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]

    for every z[m]z\in[m]. Optimizing the lower and upper bounds of 𝔼[YkX=x]\mathbb{E}[Y_{k}^{\star}\mid X=x] over z𝒵z\in\mathcal{Z} yields the desired results:

    maxz𝒵{𝔼[DYkX=x,Z=z]+ak(x)(1𝔼[DX=x,Z=z])}\displaystyle\max_{z\in\mathcal{Z}}\ \Big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\big{(}1-\mathbb{E}[D\mid X=x,Z=z]\big{)}\Big{\}}
    𝔼[YkX=x]\displaystyle\qquad\qquad\qquad\qquad\leq\mathbb{E}[Y_{k}^{\star}\mid X=x]\leq
    minz𝒵{𝔼[DYkX=x,Z=z]+bk(x)(1𝔼[DX=x,Z=z])}.\displaystyle\min_{z\in\mathcal{Z}}\ \Big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+b_{k}(x)\cdot\big{(}1-\mathbb{E}[D\mid X=x,Z=z]\big{)}\Big{\}}.
2.

Taking conditional expectations on both sides of the inequality in Assumption 4 directly yields a natural bound for $\mathbb{E}[Y_{k}^{\star}\mid X=x]$, that is,

    ak(x)𝔼[YkX=x]bk(x).a_{k}(x)\leq\mathbb{E}[Y_{k}^{\star}\mid X=x]\leq b_{k}(x).

    We now show that the IV partial bound [lk(x),uk(x)][l_{k}(x),u_{k}(x)] is tighter than the natural bound [ak(x),bk(x)][a_{k}(x),b_{k}(x)] for any x𝒳x\in\mathcal{X}. Without loss of generality, let

    z0:=argmaxz[m]{𝔼[DYkX=x,Z=z]+ak(x)𝔼[(1D)X=x,Z=z]}z_{0}:=\operatorname*{arg\,max}_{z\in[m]}\big{\{}\mathbb{E}[DY_{k}\mid X=x,Z=z]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z]\big{\}}

    and then we have

    lk(x)\displaystyle l_{k}(x) =𝔼[DYkX=x,Z=z0]+ak(x)𝔼[(1D)X=x,Z=z0]\displaystyle=\mathbb{E}[DY_{k}\mid X=x,Z=z_{0}]+a_{k}(x)\cdot\mathbb{E}[(1-D)\mid X=x,Z=z_{0}] (B.13)
    =ak(x)ak(x)(D=1X=x,Z=z0)()+𝔼[DYkX=x,Z=z0]\displaystyle=\underbrace{a_{k}(x)-a_{k}(x)\cdot\mathbb{P}(D=1\mid X=x,Z=z_{0})}_{(\cdots)}+\mathbb{E}[DY_{k}^{\star}\mid X=x,Z=z_{0}]
    =()+𝔼[𝔼[DYkU,X=x,Z=z0]X=x,Z=z0]\displaystyle=(\cdots)+\mathbb{E}\Big{[}\mathbb{E}[DY_{k}^{\star}\mid U,X=x,Z=z_{0}]\mid X=x,Z=z_{0}\Big{]}
    =()+𝔼[𝔼[DU,X=x,Z=z0]𝔼[YkU,X=x,Z=z0]X=x,Z=z0]\displaystyle=(\cdots)+\mathbb{E}\Big{[}\mathbb{E}[D\mid U,X=x,Z=z_{0}]\cdot\mathbb{E}[Y_{k}^{\star}\mid U,X=x,Z=z_{0}]\mid X=x,Z=z_{0}\Big{]}
    =()+𝔼[(D=1U,X=x,Z=z0)𝔼[YkU,X=x]X=x,Z=z0]\displaystyle=(\cdots)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X=x,Z=z_{0})\cdot\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\mid X=x,Z=z_{0}\Big{]}
\displaystyle=a_{k}(x)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X=x,Z=z_{0})\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid U,X=x]-a_{k}(x)\big{)}\mid X=x,Z=z_{0}\Big{]}.

The first line is the definition, and the second line holds by the consistency between the observed and true labels ($DY_{k}=DY_{k}^{\star}$) together with the linearity of conditional expectation. The third line follows from the law of iterated expectations, and the fourth line holds by Assumption 1 ($D\perp Y_{k}^{\star}\mid X,U$). The fifth line follows from the IV independence in Assumption 2 ($Z\perp(U,Y_{k}^{\star})\mid X$) and the fact that $\mathbb{E}[D\mid U,X=x,Z=z]=\mathbb{P}(D=1\mid U,X=x,Z=z)$. The last line again follows from linearity and the law of iterated expectations.

Since Assumption 4 states that $\mathbb{E}[Y_{k}^{\star}\mid U,X=x]\geq a_{k}(x)$ almost surely, combining Equation B.13 with the fact that $\mathbb{P}(D=1\mid U,X=x,Z=z_{0})\geq 0$ gives

    lk(x)ak(x)x𝒳.l_{k}(x)\geq a_{k}(x)~{}~{}\forall\ x\in\mathcal{X}.

Similarly, we can show that $u_{k}(x)\leq b_{k}(x)$ for every fixed $x\in\mathcal{X}$. This completes the proof; a small numerical sketch of these bounds is given below.
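The following toy Python sketch (hypothetical per-decision-maker quantities, not from the paper) evaluates the IV partial bounds of Theorem 3 for one class $k$ at a fixed covariate value $x$.

```python
# Toy sketch (hypothetical numbers, assumes numpy): the IV partial bounds of
# Theorem 3 for one class k at a fixed x, with m = 3 decision-makers.
import numpy as np

labeled_prob = np.array([0.30, 0.42, 0.25])   # E[D * Y_k | X = x, Z = z]
select_prob = np.array([0.50, 0.70, 0.40])    # E[D | X = x, Z = z]
a_k, b_k = 0.0, 1.0                           # Assumption 4 bounds on E[Y_k* | U, X = x]

l_k = np.max(labeled_prob + a_k * (1.0 - select_prob))   # lower bound in Theorem 3
u_k = np.min(labeled_prob + b_k * (1.0 - select_prob))   # upper bound in Theorem 3
print(l_k, u_k)   # 0.42 and 0.72: tighter than the natural bound [a_k, b_k] = [0, 1]
```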

Proof of Theorem 4.

To simplify the notation, for any classifier t:𝒳[K]t:\mathcal{X}\mapsto[K] and vector function η:𝒳[0,1]K\eta:\mathcal{X}\mapsto[0,1]^{K}, define the function

A(t,η;x)=k=1Kηk(x)(𝕀{argmaxk[K]ηk(x)=k}𝕀{t(x)=k})0.A(t,\eta;x)=\sum_{k=1}^{K}\eta_{k}(x)\cdot\Big{(}\mathbb{I}\{\operatorname*{arg\,max}_{k\in[K]}\eta_{k}(x)=k\}-\mathbb{I}\{t(x)=k\}\Big{)}\geq 0.

The worst-case risk function can then be written as

¯(t)=𝔼[maxηSA(t,η;X)]\overline{\mathcal{R}}(t)=\mathbb{E}\Big{[}\max_{\eta\in S}A(t,\eta;X)\Big{]}

with S={η[0,1]K:η1=1,ηk[lk,uk],k=1,,K}S=\{\eta\in[0,1]^{K}:\|\eta\|_{1}=1,\eta_{k}\in[l_{k},u_{k}],k=1,\dots,K\}. The following proof will rely on computing the maximization of A(t,η;x)A(t,\eta;x) over all possible values of t(x)[K]t(x)\in[K] for any fixed x𝒳x\in\mathcal{X}.

• Step 1. We start with the case where $t(x)=k_{0}$ for some $k_{0}\in[K]$. We have

    maxηS{A(t,η;x)}𝕀{t(x)=k0}=maxηS{A(t,η;x)𝕀{t(x)=k0}}\displaystyle\max_{\eta\in S}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}=\max_{\eta\in S}\Big{\{}A(t,\eta;x)\cdot\mathbb{I}\{t(x)=k_{0}\}\Big{\}}
    =maxηS{(k[K]ηk(x)𝕀{argmaxp[K]{ηp(x)}=k}k[K]ηk(x)𝕀{t(x)=k})𝕀{t(x)=k0}}\displaystyle=\max_{\eta\in S}\ \Big{\{}\Big{(}\sum_{k\in[K]}\eta_{k}(x)\cdot\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\{\eta_{p}(x)\}=k\}-\sum_{k\in[K]}\eta_{k}(x)\cdot\mathbb{I}\{t(x)=k\}\Big{)}\cdot\mathbb{I}\{t(x)=k_{0}\}\Big{\}}
    =maxηS{(k[K]ηk(x)𝕀{argmaxp[K]{ηp(x)}=k}ηk0(X))𝕀{t(x)=k0}}\displaystyle=\max_{\eta\in S}\ \Big{\{}\Big{(}\sum_{k\in[K]}\eta_{k}(x)\cdot\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}\{\eta_{p}(x)\}=k\}-\eta_{k_{0}}(X)\Big{)}\cdot\mathbb{I}\{t(x)=k_{0}\}\Big{\}}

    Given a fixed vector η(x)S\eta(x)\in S, we denote p0:=argmaxp[K]ηp(x)p_{0}:=\operatorname*{arg\,max}_{p\in[K]}\eta_{p}(x) as the label with maximal conditional probability. We have

    maxηS{A(t,η;x)}𝕀{t(x)=k0}\displaystyle\max_{\eta\in S}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\} =maxηS{(ηp0(x)ηk0(x))}𝕀{t(x)=k0}\displaystyle=\max_{\eta\in S}\Big{\{}\Big{(}\eta_{p_{0}}(x)-\eta_{k_{0}}(x)\Big{)}\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\} (B.14)
    =maxp0k0,ηS{(ηp0(x)ηk0(x))+}𝕀{t(x)=k0}0,\displaystyle=\max_{p_{0}\neq k_{0},\eta\in S}\Big{\{}\Big{(}\eta_{p_{0}}(x)-\eta_{k_{0}}(x)\Big{)}^{+}\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}\geq 0,

where $(z)^{+}:=\max(z,0)$ for any real-valued $z$. In the second equality of Equation B.14, we exclude the case $p_{0}=k_{0}$ from the maximization, as this situation only occurs when the lower bound of $\eta_{k_{0}}(x)$ exceeds the upper bound of $\eta_{p}(x)$ for all $p\neq k_{0}$. In that case we have $p_{0}=\operatorname*{arg\,max}_{p\in[K]}\eta_{p}(x)=k_{0}$, which implies that

    maxp0=k0,ηS{A(t,η;x)}𝕀{t(x)=k0}=maxp0=k0,ηS{(ηk0(x)ηk0(x))}𝕀{t(x)=k0}=0.\max_{p_{0}=k_{0},\eta\in S}\{A(t,\eta;x)\}\cdot\mathbb{I}\{t(x)=k_{0}\}=\max_{p_{0}=k_{0},\eta\in S}\{(\eta_{k_{0}}(x)-\eta_{k_{0}}(x))\}\cdot\mathbb{I}\{t(x)=k_{0}\}=0.

    Therefore, we expect the maximization of A(t,η;x)𝕀{t(x)=k0}A(t,\eta;x)\cdot\mathbb{I}\{t(x)=k_{0}\} over ηS\eta\in S to be non-negative. We exclude the case p0=k0p_{0}=k_{0} from the maximization to facilitate further analysis.

• Step 2. As shown in Equation B.14, for any fixed $x\in\mathcal{X}$ and $p_{0}\neq k_{0}$, the term $(\eta_{p_{0}}(x)-\eta_{k_{0}}(x))^{+}$ is non-decreasing in $\eta_{p_{0}}(x)$ and non-increasing in $\eta_{k_{0}}(x)$. Therefore, one may expect the maximization of $(\eta_{p_{0}}(x)-\eta_{k_{0}}(x))^{+}$ over $\eta(x)\in S(x)$ to be attained when $\eta_{p_{0}}(x)$ and $\eta_{k_{0}}(x)$ reach their upper and lower bounds, respectively. However, due to the constraint $\|\eta(x)\|_{1}=1$ for any $x\in\mathcal{X}$, the original upper bound $u_{p_{0}}(x)$ for $\eta_{p_{0}}(x)$ may not be reachable, and a similar concern arises for $\eta_{k_{0}}(x)$. To address this, we define a "realizable" upper bound for $\eta_{p_{0}}(x)$ as

    u~p0(x):=[1jp0lj(x)]up0(x),\tilde{u}_{p_{0}}(x):=\Big{[}1-\sum_{j\neq p_{0}}l_{j}(x)\Big{]}\land u_{p_{0}}(x), (B.15)

    which is the maximal value that ηp0(x)\eta_{p_{0}}(x) can realize under the constraint k[K]ηk(x)=1\sum_{k\in[K]}\eta_{k}(x)=1. Meanwhile, we define a “realizable” lower-bound for ηk0(x)\eta_{k_{0}}(x) as

    l~k0(x)\displaystyle\tilde{l}_{k_{0}}(x) :=[1u~p0(x)jp0,k0uj(x)]lk0(x)\displaystyle:=\Big{[}1-\tilde{u}_{p_{0}}(x)-\sum_{j\neq p_{0},k_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x) (B.16)
    ={[1(1jp0lj(x))jp0,k0uj(x)][1up0(x)jp0,k0uj(x)]}lk0(x)\displaystyle=\Big{\{}\Big{[}1-\Big{(}1-\sum_{j\neq p_{0}}l_{j}(x)\Big{)}-\sum_{j\neq p_{0},k_{0}}u_{j}(x)\Big{]}\lor\Big{[}1-u_{p_{0}}(x)-\sum_{j\neq p_{0},k_{0}}u_{j}(x)\Big{]}\Big{\}}\lor l_{k_{0}}(x)
    ={[lk0(x)+jp0,k0(lj(x)uj(x))][1jk0uj(x)]}lk0(x)\displaystyle=\Big{\{}\Big{[}l_{k_{0}}(x)+\sum_{j\neq p_{0},k_{0}}\big{(}l_{j}(x)-u_{j}(x)\big{)}\Big{]}\lor\Big{[}1-\sum_{j\neq k_{0}}u_{j}(x)\Big{]}\Big{\}}\lor l_{k_{0}}(x)
    =[1jk0uj(x)]lk0(x).\displaystyle=\Big{[}1-\sum_{j\neq k_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x).

The last equality holds because $l_{j}(x)-u_{j}(x)\leq 0$ for any $j\in[K]$, so that $l_{k_{0}}(x)+\sum_{j\neq p_{0},k_{0}}[l_{j}(x)-u_{j}(x)]\leq l_{k_{0}}(x)$. In Equation B.16, we replace the upper bound $u_{p_{0}}(x)$ with $\tilde{u}_{p_{0}}(x)$ in the definition of $\tilde{l}_{k_{0}}(x)$ because, in Equation B.14, the lowest realizable value of $\eta_{k_{0}}(x)$ depends on the maximal realizable value of $\eta_{p_{0}}(x)$. Notice, however, that $\tilde{l}_{k_{0}}(x)$ does not depend on the label $p_{0}$ involved in the definition, indicating that the "realizable" lower bound is an intrinsic property of the conditional probability for each $k_{0}\in[K]$. A similar statement applies to the "realizable" upper bound $\tilde{u}_{k_{0}}(x)$ for each $k_{0}\in[K]$.

  • Step 3. Back to the maximization of A(t,η;x)𝕀{t(x)=k0}A(t,\eta;x)\cdot\mathbb{I}\{t(x)=k_{0}\} in Equation B.14, we have

\displaystyle\max_{\eta\in S}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}=\max_{p_{0}\neq k_{0},\,\eta\in S}\Big{(}\eta_{p_{0}}(x)-\eta_{k_{0}}(x)\Big{)}^{+}\cdot\mathbb{I}\{t(x)=k_{0}\} (B.17)
    =maxp0k0,p0[K](u~p0(x)l~k0(x))+𝕀{t(x)=k0}.\displaystyle=\max_{p_{0}\neq k_{0},p_{0}\in[K]}\Big{(}\tilde{u}_{p_{0}}(x)-\tilde{l}_{k_{0}}(x)\Big{)}^{+}\cdot\mathbb{I}\{t(x)=k_{0}\}.

    By defining

    wk0(x)=maxp0k0,p0[K](u~p0(x)l~k0(x))+w_{k_{0}}(x)=\max_{p_{0}\neq k_{0},p_{0}\in[K]}\Big{(}\tilde{u}_{p_{0}}(x)-\tilde{l}_{k_{0}}(x)\Big{)}^{+}

    as the weight function corresponding to the event {t(x)=k0}\{t(x)=k_{0}\}, the Equation B.17 can be written as

    maxη(x)S(x){A(t,η;x)}𝕀{t(x)=k0}=wk0(x)𝕀{t(x)=k0}.\max_{\eta(x)\in S(x)}\Big{\{}A(t,\eta;x)\Big{\}}\cdot\mathbb{I}\{t(x)=k_{0}\}=w_{k_{0}}(x)\cdot\mathbb{I}\{t(x)=k_{0}\}.

    The worst-case excess risk ¯(t)\overline{\mathcal{R}}(t) can then be written as

    ¯(t)\displaystyle\overline{\mathcal{R}}(t) =𝔼[maxηSk0[K]A(t,η;X)𝕀{t(X)=k0}]\displaystyle=\mathbb{E}\left[\max_{\eta\in S}\sum_{k_{0}\in[K]}A(t,\eta;X)\cdot\mathbb{I}\{t(X)=k_{0}\}\right]
    =𝔼[k0[K]maxηSA(t,η;X)𝕀{t(X)=k0}]\displaystyle=\mathbb{E}\left[\sum_{k_{0}\in[K]}\max_{\eta\in S}A(t,\eta;X)\cdot\mathbb{I}\{t(X)=k_{0}\}\right]
    =𝔼[k0[K]wk0(X)𝕀{t(X)=k0}].\displaystyle=\mathbb{E}\left[\sum_{k_{0}\in[K]}w_{k_{0}}(X)\cdot\mathbb{I}\{t(X)=k_{0}\}\right].

The second equality holds because $\{t(x)=k_{0}\},k_{0}\in[K]$ defines a collection of disjoint events for every fixed $x\in\mathcal{X}$. By replacing the subscript $k_{0}$ with $k$, we obtain the stated expression.

• Step 4. We are now left to check that the second equality in Equation B.17 is attainable, that is, that there is indeed a feasible $\eta(x)$ with $\eta_{p_{0}}(x)=\tilde{u}_{p_{0}}(x)$ and $\eta_{k_{0}}(x)=\tilde{l}_{k_{0}}(x)$ such that $\|\eta(x)\|_{1}=1$ and $\eta_{j}(x)\in[l_{j}(x),u_{j}(x)]$ for every $j\in[K]$. Before going into the details, note that the non-emptiness of the set $S$ implies that, for every $x\in\mathcal{X}$,

    j[K]lj(x)1j[K]uj(x).\sum_{j\in[K]}l_{j}(x)\leq 1\leq\sum_{j\in[K]}u_{j}(x). (B.18)

    In the following, we check the feasibility of ηS\eta\in S for a fixed x𝒳x\in\mathcal{X} in three aspects.

1.

      For ηp0(x)=u~p0(x)\eta_{p_{0}}(x)=\tilde{u}_{p_{0}}(x), check ηp0(x)[lp0(x),up0(x)]\eta_{p_{0}}(x)\in[l_{p_{0}}(x),u_{p_{0}}(x)].

For the upper bound, by the definition of $\tilde{u}_{p_{0}}(x)$ in Equation B.15, $\eta_{p_{0}}(x)\leq u_{p_{0}}(x)$ is guaranteed.

      For the lower-bound, on one hand, if u~p0(x)=up0(x)\tilde{u}_{p_{0}}(x)=u_{p_{0}}(x), then up0(x)lp0(x)u_{p_{0}}(x)\geq l_{p_{0}}(x) naturally holds. On the other hand, suppose u~p0(x)=[1jp0lj(x)]\tilde{u}_{p_{0}}(x)=[1-\sum_{j\neq p_{0}}l_{j}(x)]. By Equation B.18, we have 1k[K]lk(x)1\geq\sum_{k\in[K]}l_{k}(x), which suggests that u~p0(x)lp0(x)\tilde{u}_{p_{0}}(x)\geq l_{p_{0}}(x). Hence, u~p0(x)lp0(x)\tilde{u}_{p_{0}}(x)\geq l_{p_{0}}(x) always holds.

2.

      For ηk0(x)=l~k0(x)\eta_{k_{0}}(x)=\tilde{l}_{k_{0}}(x), check ηk0(x)[lk0(x),uk0(x)]\eta_{k_{0}}(x)\in[l_{k_{0}}(x),u_{k_{0}}(x)].

      For the lower-bound, by the definition of l~k0(x)\tilde{l}_{k_{0}}(x) in Equation B.16, ηk0(x)lk0(x)\eta_{k_{0}}(x)\geq l_{k_{0}}(x) always holds.

      For the upper-bound, by Equation B.16, we have

      l~k0(x)=[1jk0uj(x)]lk0(x).\tilde{l}_{k_{0}}(x)=\Big{[}1-\sum_{j\neq k_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x).

      Equation B.18 ensures that 1jk0uj(x)uk0(x)1-\sum_{j\neq k_{0}}u_{j}(x)\leq u_{k_{0}}(x), and on the other hand lk0(x)uk0(x)l_{k_{0}}(x)\leq u_{k_{0}}(x) simply holds. Therefore, we establish the upper-bound ηk0(x)uk0(x)\eta_{k_{0}}(x)\leq u_{k_{0}}(x).

3.

      Check η(x)1=1\|\eta(x)\|_{1}=1. We start with computing the summation

      ηp0(x)+ηk0(x)=u~p0(x)+l~k0(x).\eta_{p_{0}}(x)+\eta_{k_{0}}(x)=\tilde{u}_{p_{0}}(x)+\tilde{l}_{k_{0}}(x).

If this sum stays within the region $\big{[}1-\sum_{j\neq p_{0},k_{0}}u_{j}(x),1-\sum_{j\neq p_{0},k_{0}}l_{j}(x)\big{]}$, we can always find feasible values $\eta_{j}(x)\in[l_{j}(x),u_{j}(x)]$ for $j\neq p_{0},k_{0}$ such that $\sum_{j\in[K]}\eta_{j}(x)=1$ is satisfied. The two cases are discussed below.

(a)

        Suppose up0(x)1jp0lj(x)u_{p_{0}}(x)\leq 1-\sum_{j\neq p_{0}}l_{j}(x), by the definition of u~p0(x)\tilde{u}_{p_{0}}(x) in Equation B.15, we have u~p0(x)=up0(x)\tilde{u}_{p_{0}}(x)=u_{p_{0}}(x) and l~k0(x)=[1jk0uj(x)]lk0(x)\tilde{l}_{k_{0}}(x)=[1-\sum_{j\neq k_{0}}u_{j}(x)]\lor l_{k_{0}}(x). We then have

        u~p0(x)+l~k0(x)={1jk0,p0uj(x),iflk0(x)1jk0uj(x)(Case 1)up0(x)+lk0(x),iflk0(x)1jk0uj(x)(Case 2).\tilde{u}_{p_{0}}(x)+\tilde{l}_{k_{0}}(x)=\left\{\begin{aligned} &1-\sum_{j\neq k_{0},p_{0}}u_{j}(x),&&\text{if}\ l_{k_{0}}(x)\leq 1-\sum_{j\neq k_{0}}u_{j}(x)\quad(\text{Case 1})\\ &u_{p_{0}}(x)+l_{k_{0}}(x),&&\text{if}\ l_{k_{0}}(x)\geq 1-\sum_{j\neq k_{0}}u_{j}(x)\quad(\text{Case 2}).\end{aligned}\right.

        For the Case 1, we can simply let ηj(x)=uj(x)\eta_{j}(x)=u_{j}(x) for jp0,k0j\neq p_{0},k_{0}, and this arrangement fulfills the requirement j[K]ηj(x)=1\sum_{j\in[K]}\eta_{j}(x)=1.

        For the Case 2, we have

        1jk0,p0uj(x)up0(x)+lk0(x)1jk0,p0lj(x).1-\sum_{j\neq k_{0},p_{0}}u_{j}(x)\leq u_{p_{0}}(x)+l_{k_{0}}(x)\leq 1-\sum_{j\neq k_{0},p_{0}}l_{j}(x).

As a result, there exists a sequence $\{\eta_{j}(x)\}_{j\neq p_{0},k_{0}}$ with $\eta_{j}(x)\in[l_{j}(x),u_{j}(x)]$ for all $j\neq p_{0},k_{0}$ that satisfies $u_{p_{0}}(x)+l_{k_{0}}(x)+\sum_{j\neq p_{0},k_{0}}\eta_{j}(x)=1$.

(b)

Suppose $u_{p_{0}}(x)\geq 1-\sum_{j\neq p_{0}}l_{j}(x)$. We then have $\tilde{u}_{p_{0}}(x)=1-\sum_{j\neq p_{0}}l_{j}(x)$ and

        l~k0(x)\displaystyle\tilde{l}_{k_{0}}(x) =[1u~p0(x)jk0,p0uj(x)]lk0(x)\displaystyle=\Big{[}1-\tilde{u}_{p_{0}}(x)-\sum_{j\neq k_{0},p_{0}}u_{j}(x)\Big{]}\lor l_{k_{0}}(x)
        =[lk0(x)+jk0,p0[lj(x)uj(x)]]lk0(x)\displaystyle=\Big{[}l_{k_{0}}(x)+\sum_{j\neq k_{0},p_{0}}[l_{j}(x)-u_{j}(x)]\Big{]}\lor l_{k_{0}}(x)
        =lk0(x),\displaystyle=l_{k_{0}}(x),

        where the last equality holds because lj(x)uj(x)l_{j}(x)\leq u_{j}(x) for all j[K]j\in[K]. Therefore, we have

        u~p0(x)+l~k0(x)=1jk0,p0lj(x).\tilde{u}_{p_{0}}(x)+\tilde{l}_{k_{0}}(x)=1-\sum_{j\neq k_{0},p_{0}}l_{j}(x).

As a result, we can simply let $\eta_{j}(x)=l_{j}(x)$ for $j\neq p_{0},k_{0}$, and this choice again fulfills the requirement $\sum_{j\in[K]}\eta_{j}(x)=1$.

The above analysis shows the validity of the second equality in Equation B.17. A small numerical sketch of this weight construction is given below.
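The following Python sketch (hypothetical partial bounds, not from the paper) computes the "realizable" bounds in Equations B.15-B.16 and the resulting weights $w_{k}(x)$ from Equation B.17 at a single $x$.

```python
# Sketch (hypothetical partial bounds, assumes numpy): the "realizable" bounds of
# Equations B.15-B.16 and the weights w_k(x) of Equation B.17 for K = 3 classes.
import numpy as np

l = np.array([0.10, 0.20, 0.15])   # lower bounds l_k(x)
u = np.array([0.50, 0.60, 0.45])   # upper bounds u_k(x); note sum(l) <= 1 <= sum(u)
K = len(l)

u_tilde = np.minimum(1.0 - (l.sum() - l), u)   # (1 - sum_{j != p} l_j) wedge u_p
l_tilde = np.maximum(1.0 - (u.sum() - u), l)   # (1 - sum_{j != k} u_j) vee  l_k

w = np.empty(K)
for k in range(K):
    # w_k(x) = max_{p != k} (u_tilde_p - l_tilde_k)^+
    w[k] = max(np.max(np.delete(u_tilde, k) - l_tilde[k]), 0.0)
print(u_tilde, l_tilde, w)
```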

Proof of Theorem 4 (Binary Classification Setting).

Recall from the first part of Theorem 4 that

¯(t)\displaystyle\overline{\mathcal{R}}(t) =𝔼[k=1Kwk(X)𝕀{t(X)=k}],where\displaystyle=\mathbb{E}\left[\sum_{k=1}^{K}w_{k}(X)\cdot\mathbb{I}\{t(X)=k\}\right],\ \text{where}
wk(X)\displaystyle w_{k}(X) =maxpk,p[K]{(u~p(x)l~k(x))+}.\displaystyle=\max_{p\neq k,p\in[K]}\Big{\{}\Big{(}\tilde{u}_{p}(x)-\tilde{l}_{k}(x)\Big{)}^{+}\Big{\}}.

Consider the binary case where $Y^{\star}\in\{-1,1\}$ and $t:\mathcal{X}\mapsto\{-1,1\}$. Recall that in Theorem 3 we derive the partial bound for $\eta_{1}^{\star}(x)=\mathbb{P}(Y^{\star}=1\mid X=x)$ as $[l_{1}(x),u_{1}(x)]$. By the constraint $\eta_{1}^{\star}(x)+\eta_{-1}^{\star}(x)=1$ for each $x\in\mathcal{X}$, we define the partial bound $[l_{-1}(x),u_{-1}(x)]$ for $\eta_{-1}^{\star}(x)$ as $l_{-1}(x)=1-u_{1}(x)$ and $u_{-1}(x)=1-l_{1}(x)$.

We can then compute the “realizable” upper- and lower-bounds for both η1(x)\eta_{1}^{\star}(x) and η1(x)\eta_{-1}^{\star}(x) as below:

\begin{aligned}
\tilde{u}_{1}(x)&=(1-l_{-1}(x))\land u_{1}(x)=u_{1}(x) &&\text{and}\quad \tilde{u}_{-1}(x)=(1-l_{1}(x))\land u_{-1}(x)=1-l_{1}(x),\\
\tilde{l}_{1}(x)&=(1-u_{-1}(x))\lor l_{1}(x)=l_{1}(x) &&\text{and}\quad \tilde{l}_{-1}(x)=(1-u_{1}(x))\lor l_{-1}(x)=1-u_{1}(x).
\end{aligned}

The weight functions wk(x)w_{k}(x) for k=±1k=\pm 1 are therefore

\begin{aligned}
w_{1}(x)&=\big{(}\tilde{u}_{-1}(x)-\tilde{l}_{1}(x)\big{)}^{+}=\big{(}1-2l_{1}(x)\big{)}^{+}=\big{(}1-2l_{1}(x)\big{)}\cdot\mathbb{I}\{l_{1}(x)\leq 1/2\}\\
w_{-1}(x)&=\big{(}\tilde{u}_{1}(x)-\tilde{l}_{-1}(x)\big{)}^{+}=\big{(}2u_{1}(x)-1\big{)}^{+}=\big{(}2u_{1}(x)-1\big{)}\cdot\mathbb{I}\{u_{1}(x)\geq 1/2\}.
\end{aligned}

Hence, in binary outcome case, the risk function ¯(t)\overline{\mathcal{R}}(t) takes the form

\displaystyle\overline{\mathcal{R}}(t)=\mathbb{E}\Big{[}|1-2l_{1}(X)|\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\cdot\mathbb{I}\{t(X)=1\}+|2u_{1}(X)-1|\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}\cdot\mathbb{I}\{t(X)=-1\}\Big{]}
\displaystyle=\mathbb{E}\Big{[}|1-2l_{1}(X)|\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\cdot\big{(}1-\mathbb{I}\{t(X)=-1\}\big{)}+|2u_{1}(X)-1|\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}\cdot\mathbb{I}\{t(X)=-1\}\Big{]}
=𝔼[|12l1(X)|𝕀l1(X)1/2]+𝔼[𝕀{t(X)1}(|2u1(X)1|𝕀u1(X)1/2|12l1(X)|𝕀l1(X)1/2)]\displaystyle=\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{]}+\mathbb{E}\Big{[}\mathbb{I}\{t(X)\neq 1\}\cdot\Big{(}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}-\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{)}\Big{]}
=𝔼[|12l1(X)|𝕀l1(X)1/2]+𝔼[1t(X)2(|2u1(X)1|𝕀u1(X)1/2|12l1(X)|𝕀l1(X)1/2)].\displaystyle=\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{]}+\mathbb{E}\left[\frac{1-t(X)}{2}\cdot\Big{(}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}_{u_{1}(X)\geq 1/2}-\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}_{l_{1}(X)\leq 1/2}\Big{)}\right].

For every fixed x𝒳x\in\mathcal{X}, define a weight function

w(x)\displaystyle w(x) :=|u1(x)1/2|𝕀{u1(x)1/2}|l1(x)1/2|𝕀{l1(x)<1/2}\displaystyle:=\big{|}u_{1}(x)-1/2\big{|}\cdot\mathbb{I}\{u_{1}(x)\geq 1/2\}-\big{|}l_{1}(x)-1/2\big{|}\cdot\mathbb{I}\{l_{1}(x)<1/2\}
=max(u1(X)1/2,0)+min(l1(X)1/2,0),\displaystyle=\max\big{(}u_{1}(X)-1/2,0\big{)}+\min\big{(}l_{1}(X)-1/2,0\big{)},

we then have

¯(t)\displaystyle\overline{\mathcal{R}}(t) =𝔼[|12l1(X)|𝕀{l1(X)<1/2}]()+𝔼[w(X)(1t(X))]\displaystyle=\underbrace{\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{]}}_{(**)}+\mathbb{E}\Big{[}w(X)\cdot\big{(}1-t(X)\big{)}\Big{]} (B.19)
=()+𝔼[w(X)]+𝔼[w(X)t(X)]\displaystyle=(**)+\mathbb{E}\big{[}w(X)\big{]}+\ \mathbb{E}\big{[}-w(X)\cdot t(X)\big{]}
=()+𝔼[w(X)]+𝔼[|w(X)|sgn[w(X)]t(X)]\displaystyle=(**)+\mathbb{E}\big{[}w(X)\big{]}+\mathbb{E}\Big{[}|w(X)|\cdot\operatorname*{\mathrm{sgn}}[-w(X)]\cdot t(X)\Big{]}
=()+𝔼[w(X)]+𝔼[|w(X)|(2𝕀{sgn[w(X)]t(X)}1)]\displaystyle=(**)+\mathbb{E}\big{[}w(X)\big{]}+\mathbb{E}\Big{[}|w(X)|\cdot\Big{(}2\mathbb{I}\big{\{}\operatorname*{\mathrm{sgn}}[w(X)]\neq t(X)\big{\}}-1\Big{)}\Big{]}
=()+𝔼[w(X)|w(X)|]+2𝔼[|w(X)|𝕀(sgn[w(X)]sgn[h(X)])].\displaystyle=(**)+\mathbb{E}\Big{[}w(X)-|w(X)|\Big{]}+2\mathbb{E}\Big{[}|w(X)|\cdot\mathbb{I}(\operatorname*{\mathrm{sgn}}[w(X)]\neq\operatorname*{\mathrm{sgn}}[h(X)])\Big{]}.

For any fixed xx, notice that w(x)|w(x)|=0w(x)-|w(x)|=0 when w(x)0w(x)\geq 0 and w(x)|w(x)|=2w(x)w(x)-|w(x)|=2w(x) when w(x)<0w(x)<0, we then have

𝔼[w(X)|w(X)|]\displaystyle\mathbb{E}\Big{[}w(X)-|w(X)|\Big{]} =𝔼[2w(X)𝕀{w(X)<0}]\displaystyle=\mathbb{E}\Big{[}2w(X)\cdot\mathbb{I}\{w(X)<0\}\Big{]}
\displaystyle=\mathbb{E}\Big{[}\Big{(}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}-|1-2l_{1}(X)|\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{)}\cdot\mathbb{I}\{w(X)<0\}\Big{]}.

Consequently, noting that

w(X)<0~{}~{}\Leftrightarrow~{}~{}|u_{1}(X)-1/2|\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}<|l_{1}(X)-1/2|\cdot\mathbb{I}\{l_{1}(X)<1/2\},

the first two term in Equation B.19 can be further organized as

𝔼[|12l1(X)|𝕀{l1(X)<1/2}]+𝔼[w(X)|w(X)|]\displaystyle\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{]}+\mathbb{E}\Big{[}w(X)-|w(X)|\Big{]} (B.20)
={𝔼[|2u1(X)1|𝕀{u1(X)1/2}],ifw(X)<0𝔼[|12l1(X)|𝕀{l1(X)<1/2}],ifw(X)0\displaystyle=\left\{\begin{aligned} &\mathbb{E}\Big{[}|2u_{1}(X)-1|\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}\Big{]},&&\text{if}\ w(X)<0\\ &\mathbb{E}\Big{[}|1-2l_{1}(X)|\cdot\mathbb{I}\{l_{1}(X)<1/2\}\Big{]},&&\text{if}\ w(X)\geq 0\\ \end{aligned}\right.
=min(𝔼[|12l1(X)|𝕀{l1(X)<1/2}],𝔼[|2u1(X)1|𝕀{u(X)1/2}]).\displaystyle=\min\Big{(}\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\big{]},\mathbb{E}\Big{[}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}\{u(X)\geq 1/2\}\big{]}\Big{)}.

Finally, we define

wpartial(x)=2w(x)=max(2u1(X)1,0)+min(2l1(X)1,0)w^{\rm partial}(x)=2w(x)=\max\big{(}2u_{1}(X)-1,0\big{)}+\min\big{(}2l_{1}(X)-1,0\big{)}

for every x𝒳x\in\mathcal{X}. Combining Equations B.19 and B.20 together, we have

¯(h)\displaystyle\overline{\mathcal{R}}(h) =𝔼[|wpartial(X)|𝕀{sgn[wpartial(X)]t(X)}]\displaystyle=\mathbb{E}\Big{[}|w^{\rm partial}(X)|\cdot\mathbb{I}\{\operatorname*{\mathrm{sgn}}[w^{\rm partial}(X)]\neq t(X)\}\Big{]}
+min(𝔼[|12l1(X)|𝕀{l1(X)<1/2}],𝔼[|2u1(X)1|𝕀{u1(X)1/2}])\displaystyle\qquad+\min\Big{(}\mathbb{E}\Big{[}\big{|}1-2l_{1}(X)\big{|}\cdot\mathbb{I}\{l_{1}(X)<1/2\}\big{]},\mathbb{E}\Big{[}\big{|}2u_{1}(X)-1\big{|}\cdot\mathbb{I}\{u_{1}(X)\geq 1/2\}\big{]}\Big{)}
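As a quick illustration, the Python sketch below (hypothetical bounds $l_{1}(x)$, $u_{1}(x)$, not from the paper) evaluates the binary weight $w^{\rm partial}(x)$ and the decision $\operatorname*{\mathrm{sgn}}[w^{\rm partial}(x)]$ it induces.

```python
# Toy sketch (hypothetical bounds, assumes numpy): the binary weight w^partial(x)
# and the decision it induces, for three covariate values.
import numpy as np

l1 = np.array([0.10, 0.55, 0.30])   # lower bounds on P(Y* = 1 | X = x)
u1 = np.array([0.40, 0.80, 0.65])   # upper bounds on P(Y* = 1 | X = x)

w_partial = np.maximum(2 * u1 - 1, 0) + np.minimum(2 * l1 - 1, 0)
decision = np.sign(w_partial)       # +1 when the upper side dominates, -1 otherwise
print(w_partial, decision)          # [-0.8  0.6 -0.1] and [-1.  1. -1.]
```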

B.4 Comparisons of Point and Partial Identification

Point identification hinges on decision-makers' homogeneity in their use of unobservables, as formalized by the generalized NUCEM assumption in Theorem 1. Concretely, the conditional uncorrelatedness of $\operatorname*{\mathrm{Cov}}(D,Z\mid X,U)$ and $\mathbb{E}[Y^{\star}_{k}\mid X,U]$ given $X$ reflects the uniformity of decision-makers' reliance on unobserved information. Conversely, partial identification benefits from the heterogeneity of decision-makers' use of observables $X$, without requiring homogeneity of selection on unobservables $U$. This approach uses IV partial bounds that exploit variation in the decision rules $\mathbb{P}(D=1\mid X,Z)$ and the labeled probabilities $\mathbb{P}(Y=k,D=1\mid X,Z)$ across decision-makers $Z$. Such heterogeneity helps construct tighter bounds for $\eta_{k}^{\star}(x)$. Moreover, we have the following claim.

Theorem 7.

The partial bounds lk(x)l_{k}(x) and uk(x)u_{k}(x) in Theorem 3 are achieved by the most lenient and most stringent decision-makers, respectively.

Proof of Theorem 7.

We claim that for any fixed $X$ and $U$, the lower bound is achieved by the most lenient decision-maker and the upper bound by the most stringent decision-maker. To see this, notice that the lower bound for $\eta_{k}^{\star}$ with respect to decision-maker $Z=z$ in Theorem 3 satisfies

lk(x,z)\displaystyle l_{k}(x,z) :=𝔼[DYk+ak(X)(1D)X=x,Z=z]\displaystyle:=\mathbb{E}\big{[}DY_{k}+a_{k}(X)\cdot(1-D)\mid X=x,Z=z\big{]}
=𝔼[DYk+ak(X)(1D)X=x,Z=z]\displaystyle=\mathbb{E}\big{[}DY_{k}^{\star}+a_{k}(X)\cdot(1-D)\mid X=x,Z=z\big{]}
=𝔼[𝔼[DYk+ak(X)(1D)U,X,Z=z]X=x,Z=z]\displaystyle=\mathbb{E}\Big{[}\mathbb{E}\big{[}DY_{k}^{\star}+a_{k}(X)\cdot(1-D)\mid U,X,Z=z\big{]}\mid X=x,Z=z\Big{]}
=𝔼[(D=1U,X,Z)𝔼[YkU,X,Z]+ak(X)(1(D=1U,X,Z))X=x,Z=z]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X,Z)\cdot\mathbb{E}\big{[}Y_{k}^{\star}\mid U,X,Z\big{]}+a_{k}(X)\cdot\big{(}1-\mathbb{P}(D=1\mid U,X,Z)\big{)}\mid X=x,Z=z\Big{]}
=ak(x)+𝔼[(D=1U,X,Z)(𝔼[YkX,U]ak(X))X=x,Z=z].\displaystyle=a_{k}(x)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X,Z)\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid X,U]-a_{k}(X)\big{)}\mid X=x,Z=z\Big{]}.

The second equality holds by consistency, and the third follows from the law of iterated expectations. The fourth equality holds by Assumption 1, and the final equality is established by the IV independence in Assumption 2 ($Z\perp(U,Y^{\star})\mid X$). Since Assumption 4 states that $a_{k}(X)\leq\mathbb{E}[Y_{k}^{\star}\mid X,U]\leq b_{k}(X)$ almost surely, the lower bound $l_{k}(x,z)$ is increasing in the decision rule $\mathbb{P}(D=1\mid U,X,Z=z)$. Therefore, for any fixed $X$ and $U$, the maximum over lower bounds $\max_{z\in[m]}l_{k}(x,z)$ is achieved by the most lenient decision-maker, i.e., the one with the largest $\mathbb{P}(D=1\mid X,U,Z=z)$; that is,

lk(x)\displaystyle l_{k}(x) =maxz𝒵lk(x,z)\displaystyle=\max_{z\in\mathcal{Z}}l_{k}(x,z)
=maxz𝒵ak(x)+𝔼[(D=1U,X,Z=z)(𝔼[YkX,U]ak(X))X=x]\displaystyle=\max_{z\in\mathcal{Z}}a_{k}(x)+\mathbb{E}\Big{[}\mathbb{P}(D=1\mid U,X,Z=z)\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid X,U]-a_{k}(X)\big{)}\mid X=x\Big{]}
=ak(x)+𝔼[{maxz𝒵(D=1U,X,Z=z)}(𝔼[YkX,U]ak(X))X=x]\displaystyle=a_{k}(x)+\mathbb{E}\Big{[}\big{\{}\max_{z\in\mathcal{Z}}\mathbb{P}(D=1\mid U,X,Z=z)\big{\}}\cdot\big{(}\mathbb{E}[Y_{k}^{\star}\mid X,U]-a_{k}(X)\big{)}\mid X=x\Big{]}

Similarly, for the upper-bound, the minimization minz[m]uk(x,z)\min_{z\in[m]}u_{k}(x,z) is achieved by the most stringent decision-maker when XX and UU are fixed. ∎

Appendix C Supplements for Unified Cost-sensitive Learning

Proof of Theorem 5.

To facilitate the analysis, for fixed weight function w:𝒳w:\mathcal{X}\rightarrow\mathbb{R} and score function h:𝒳Kh:\mathcal{X}\rightarrow\mathbb{R}^{K}, we define the class labels

k0:=argmink[K]wk(x)andkh:=argmaxk[K]hk(x).k_{0}:=\operatorname*{arg\,min}_{k\in[K]}w_{k}(x)\quad\text{and}\quad k_{h}:=\operatorname*{arg\,max}_{k\in[K]}h_{k}(x).

Namely, when x𝒳x\in\mathcal{X} is fixed, k0k_{0} denotes the class label with the minimal weight wk0(x)w_{k_{0}}(x), while khk_{h} corresponds to the class label with the highest score hkh(x)h_{k_{h}}(x). According to the definition of cost-sensitive classification risk (h,w)\mathcal{R}(h,w) in Equation 6.1, we define the cost-sensitive classification loss as

l(h,w;x)=\sum_{k=1}^{K}w_{k}(x)\cdot\mathbb{I}\{\operatorname*{arg\,max}_{p\in[K]}h_{p}(x)=k\}.

Obviously, the excess classification risk can be written as

(h,w)infh(h,w)=𝔼[l(h,w;X)infhl(h,w;X)]=𝔼[wkh(X)wk0(X)].\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w)=\mathbb{E}\big{[}l(h,w;X)-\inf_{h}l(h,w;X)\big{]}=\mathbb{E}\big{[}w_{k_{h}}(X)-w_{k_{0}}(X)\big{]}.

Correspondingly, we define the cost-sensitive surrogate loss for exp(𝒉,𝒘)\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w}) in Equation 6.2 as

lexp(h,w;x)=k=1Kwk(x)exp(hk(x))p=1Kexp(hp(x)).l_{\exp}(h,w;x)=\sum_{k=1}^{K}w_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}.

As a result, the excess surrogate risk can be written as

exp(h,w)infhexp(h,w)=𝔼[lexp(h,w;X)infhlexp(h,w;X)].\mathcal{R}_{\exp}(h,w)-\inf_{h}\mathcal{R}_{\exp}(h,w)=\mathbb{E}\big{[}l_{\exp}(h,w;X)-\inf_{h}l_{\exp}(h,w;X)\big{]}.

Therefore, we are left to analyze the excess surrogate loss lexp(h,w;x)infhlexp(h,w;x)l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x) for every fixed x𝒳x\in\mathcal{X}.

Following the idea of Mao et al. [2023], for every fixed score function $h:\mathcal{X}\rightarrow\mathbb{R}^{K}$, we define a new function $h^{\mu}$ induced by a real-valued parameter $\mu$ as follows: for fixed $x\in\mathcal{X}$,

{hkμ(x)=hk(x),kk0,khhk0μ(x)=log(exp(hkh(x))μ),k=k0hkhμ(x)=log(exp(hk0(x))+μ),k=kh.\left\{\begin{aligned} &h_{k}^{\mu}(x)=h_{k}(x),&&k\neq k_{0},k_{h}\\ &h_{k_{0}}^{\mu}(x)=\log\big{(}\exp(h_{k_{h}}(x))-\mu\big{)},&&k=k_{0}\\ &h_{k_{h}}^{\mu}(x)=\log\big{(}\exp(h_{k_{0}}(x))+\mu\big{)},&&k=k_{h}.\end{aligned}\right.

Since the domain of the $\log$ function is $(0,+\infty)$, we must restrict the value of $\mu$ to ensure that $\exp(h_{k_{h}}(x))-\mu>0$ and $\exp(h_{k_{0}}(x))+\mu>0$. Therefore, for fixed $x\in\mathcal{X}$, we define the feasible region of $\mu$ as

\mathcal{M}(\mu):=\big{\{}\mu\in\mathbb{R}:-\exp(h_{k_{0}}(x))<\mu<\exp(h_{k_{h}}(x))\big{\}}.

A direct consequence of the above definition is that $\sum_{k=1}^{K}\exp(h_{k}(x))=\sum_{k=1}^{K}\exp(h_{k}^{\mu}(x))$ for every fixed $x$. Consequently, the surrogate loss $l_{\exp}(h^{\mu},w;x)$ takes the form

lexp(hμ,w;x)=kk0,khwk(x)exp(hk(x))p=1Kexp(hp(x))+wk0(x)exp(hkh(x))μp=1Kexp(hp(x))+wkh(x)exp(hk0(x))+μp=1Kexp(hp(x)).l_{\exp}(h^{\mu},w;x)=\sum_{k\neq k_{0},k_{h}}w_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}+w_{k_{0}}(x)\frac{\exp(h_{k_{h}}(x))-\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}+w_{k_{h}}(x)\frac{\exp(h_{k_{0}}(x))+\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}.

Now, the excess surrogate loss lexp(h,w;x)infhlexp(h,w;x)l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x) can be lower-bounded by

lexp(h,w;x)infhlexp(h,w;x)\displaystyle l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x)
lexp(h,w;x)infμ(μ)lexp(hμ,w;x)\displaystyle\geq l_{\exp}(h,w;x)-\inf_{\mu\in\mathcal{M}(\mu)}l_{\exp}(h^{\mu},w;x)
=supμ(μ){wkh(x)exp(hkh(x))exp(hk0(x))μp=1Kexp(hp(x))+wk0(x)exp(hk0(x))exp(hkh(x))+μp=1Kexp(hp(x))}\displaystyle=\sup_{\mu\in\mathcal{M}(\mu)}\left\{w_{k_{h}}(x)\frac{\exp(h_{k_{h}}(x))-\exp(h_{k_{0}}(x))-\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}+w_{k_{0}}(x)\frac{\exp(h_{k_{0}}(x))-\exp(h_{k_{h}}(x))+\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right\}
=supμ(μ){(wkh(x)wk0(x))exp(hkh(x))exp(hk0(x))μp=1Kexp(hp(x))}\displaystyle=\sup_{\mu\in\mathcal{M}(\mu)}\left\{\big{(}w_{k_{h}}(x)-w_{k_{0}}(x)\big{)}\frac{\exp(h_{k_{h}}(x))-\exp(h_{k_{0}}(x))-\mu}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right\}

The first inequality holds because the infimum over the restricted family $\{h^{\mu}:\mu\in\mathcal{M}(\mu)\}$ is no smaller than the infimum over all measurable score functions $h$. Since $k_{0}$ is the class label with the minimal weight, we have $w_{k_{h}}(x)-w_{k_{0}}(x)\geq 0$ for any $x$. Consequently, the supremum of the above expression is attained in the limit $\mu\rightarrow-\exp(h_{k_{0}}(x))$, and the excess surrogate loss is therefore lower-bounded by

lexp(h,w;x)infhlexp(h,w;x)\displaystyle l_{\exp}(h,w;x)-\inf_{h}l_{\exp}(h,w;x) (wkh(x)wk0(x))exp(hkh(x))p=1Kexp(hp(x))\displaystyle\geq\big{(}w_{k_{h}}(x)-w_{k_{0}}(x)\big{)}\frac{\exp(h_{k_{h}}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}
1K[wkh(x)wk0(x)]\displaystyle\geq\frac{1}{K}\big{[}w_{k_{h}}(x)-w_{k_{0}}(x)\big{]}
=1K[l(h,w;x)infhl(h,w;x)].\displaystyle=\frac{1}{K}\big{[}l(h,w;x)-\inf_{h}l(h,w;x)\big{]}.

The last inequality follows because $k_{h}$ is the class label with the highest score $h_{k_{h}}$, so that $\frac{\exp(h_{k_{h}}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\geq\frac{1}{K}$ for any fixed $x\in\mathcal{X}$. Taking the expectation over $X$ on both sides completes the proof.
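To make the two risks in Theorem 5 concrete, the following Python sketch (hypothetical weights and scores, not the paper's implementation) evaluates the cost-sensitive classification loss, taken here as the weight of the predicted class as in the excess-risk identity above, together with its softmax surrogate at a single $x$.

```python
# Sketch (hypothetical weights and scores, assumes numpy): the cost-sensitive loss,
# taken as the weight of the predicted class, and its softmax surrogate
# l_exp(h, w; x) = sum_k w_k(x) * softmax_k(h(x)).
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())          # subtract the max for numerical stability
    return e / e.sum()

w = np.array([0.50, 0.30, 0.45])     # weights w_k(x) at a fixed x
h = np.array([1.2, -0.3, 0.4])       # candidate scores h_k(x)

pred = int(h.argmax())
loss_cls = w[pred]                   # cost-sensitive classification loss at x
loss_exp = float(w @ softmax(h))     # cost-sensitive softmax surrogate at x
excess = w[pred] - w.min()           # pointwise excess: w_{k_h} - w_{k_0}
print(pred, loss_cls, loss_exp, excess)
```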

Proof of Proposition 1.

For the weighted classification risk $\mathcal{R}(h,w)$, if we choose $\tilde{h}(x)=w(x)$ for $x\in\mathcal{X}$, we simply have $\mathcal{R}(\tilde{h},w)=0$, which shows that the Bayes optimal risk equals zero, that is, $\inf_{h}\mathcal{R}(h,w)=0$. Therefore, the excess weighted classification risk can be written as

(h,w)infh(h,w)=(h,w)0=𝔼[|w(X)|𝕀{sgn[h(X)]sgn[w(X)]}].\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w)=\mathcal{R}(h,w)-0=\mathbb{E}\Big{[}|w(X)|\cdot\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(X)]\neq\operatorname*{\mathrm{sgn}}[w(X)]\}\Big{]}.

Here we use the fact that the minimizer $\tilde{h}(x)=w(x)$ has the same sign as the weight function $w(x)$ for any fixed $x\in\mathcal{X}$.

For the $\phi$-risk introduced in Equation 6.3, we consider the following surrogate loss functions:

\begin{aligned} &\text{Hinge loss}:&&\phi(\alpha)=\max\{1-\alpha,0\}\\ &\text{logistic loss}:&&\phi(\alpha)=\log_{2}(1+e^{-\alpha})\\ &\text{exponential loss}:&&\phi(\alpha)=e^{-\alpha}.\\ \end{aligned}

Notice that for any $\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}$, the surrogate loss $\phi(\alpha)$ is bounded below with $\inf_{\alpha}\phi(\alpha)=0$, so that $\inf_{\alpha}\mathbb{E}[\phi(\alpha(X))]=\mathbb{E}[\inf_{\alpha}\phi(\alpha(X))]=0$. Therefore, the excess $\phi$-risk takes the form

\mathcal{R}_{\phi}(h,w)-\inf_{h}\mathcal{R}_{\phi}(h,w)=\mathcal{R}_{\phi}(h,w)-0=\mathbb{E}\Big{[}|w(X)|\cdot\phi\big{(}h(X)\cdot\operatorname*{\mathrm{sgn}}[w(X)]\big{)}\Big{]}.

In the sequel, we only need to check that the surrogate loss $\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])$ upper-bounds the $0$-$1$ loss $\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}$ for each $\phi$ in $\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}$. We fix the weight function $w$ and the observed features $x\in\mathcal{X}$ in the following discussion.

• Consider the case where $\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]$, so that $\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}=1$. For the $\phi$-risk, we have

    ϕ(h(x)sgn[w(x)])=ϕ(|h(x)|)\phi\big{(}h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)]\big{)}=\phi(-|h(x)|)

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}. Concretely, we have

    ϕHinge(|h(x)|)\displaystyle\phi_{\operatorname{Hinge}}(-|h(x)|) =(1+|h(x)|)+1,\displaystyle=(1+|h(x)|)^{+}\geq 1,
    ϕlogistic(|h(x)|)\displaystyle\phi_{\operatorname{logistic}}(-|h(x)|) =log2(1+e|h(x)|)1,\displaystyle=\log_{2}(1+e^{|h(x)|})\geq 1,
    ϕexponential(|h(x)|)\displaystyle\phi_{\operatorname{exponential}}(-|h(x)|) =e|h(x)|1.\displaystyle=e^{|h(x)|}\geq 1.

    Therefore, we have

    ϕ(h(x)sgn[w(x)])𝕀{sgn[h(x)]sgn[w(x)]}\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])\geq\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}.

• Consider the case where $\operatorname*{\mathrm{sgn}}[h(x)]=\operatorname*{\mathrm{sgn}}[w(x)]$, so that $\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}=0$. For the $\phi$-risk, we have

    ϕ(h(x)sgn[w(x)])=ϕ(|h(x)|)\phi\big{(}h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)]\big{)}=\phi(|h(x)|)

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}. Concretely, we have

    ϕHinge(|h(x)|)\displaystyle\phi_{\operatorname{Hinge}}(|h(x)|) =(1|h(x)|)+0,\displaystyle=(1-|h(x)|)^{+}\geq 0,
    ϕlogistic(|h(x)|)\displaystyle\phi_{\operatorname{logistic}}(|h(x)|) =log2(1+e|h(x)|)0,\displaystyle=\log_{2}(1+e^{-|h(x)|})\geq 0,
    ϕexponential(|h(x)|)\displaystyle\phi_{\operatorname{exponential}}(|h(x)|) =e|h(x)|0.\displaystyle=e^{-|h(x)|}\geq 0.

    Hence, we have

    ϕ(h(x)sgn[w(x)])𝕀{sgn[h(x)]sgn[w(x)]}\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])\geq\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}

    for any ϕ{Hinge,logistic,exponential}\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}.

Finally, multiplying by the common weight $|w(X)|$ and taking the expectation over $X$, we conclude that when the weight function $w$ is fixed, the excess $\phi$-risk always upper-bounds the excess weighted risk for any $\phi\in\{\operatorname{Hinge},\operatorname{logistic},\operatorname{exponential}\}$:

ϕ(h,w)infhϕ(h,w)(h,w)infh(h,w).\mathcal{R}_{\phi}(h,w)-\inf_{h}\mathcal{R}_{\phi}(h,w)\geq\mathcal{R}(h,w)-\inf_{h}\mathcal{R}(h,w).
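The following short numerical check (toy values, assuming numpy) verifies on a small grid that each surrogate used above dominates the $0$-$1$ loss, i.e., $\phi(h(x)\cdot\operatorname*{\mathrm{sgn}}[w(x)])\geq\mathbb{I}\{\operatorname*{\mathrm{sgn}}[h(x)]\neq\operatorname*{\mathrm{sgn}}[w(x)]\}$.

```python
# Quick numerical check (toy values, assumes numpy) that each surrogate phi used in
# Proposition 1 dominates the 0-1 loss: phi(h(x) * sgn[w(x)]) >= 1{sgn[h] != sgn[w]}.
import numpy as np

phis = {
    "hinge": lambda a: np.maximum(1 - a, 0),
    "logistic": lambda a: np.log2(1 + np.exp(-a)),   # base-2 logistic, as in the proof
    "exponential": lambda a: np.exp(-a),
}

h = np.array([-2.0, -0.3, 0.1, 1.5])   # scores h(x) on a small grid of x values
w = np.array([0.7, -0.2, -0.4, 0.9])   # weights w(x); only the sign matters here

zero_one = (np.sign(h) != np.sign(w)).astype(float)
for name, phi in phis.items():
    assert np.all(phi(h * np.sign(w)) >= zero_one), name
print("all three surrogates dominate the 0-1 loss on this grid")
```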

Appendix D Generalization Error Bound

In this section, we theoretically analyze the performance of the UCL algorithm in Algorithm 1 by deriving a generalization error bound for the output score function 𝒉^\widehat{\boldsymbol{h}}. Specifically, Theorem 5 establishes a calibration bound for the cost-sensitive excess risk (𝒉^)inf𝒉(𝒉)\mathcal{R}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}(\boldsymbol{h}) induced by the output score function. To leverage this result, we first derive a generalization error bound for the excess surrogate risk exp(𝒉^)inf𝒉exp(𝒉)\mathcal{R}_{\exp}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}_{\exp}(\boldsymbol{h}).

To facilitate the analysis, we define the surrogate risk induced by the estimated weight functions $\widehat{\boldsymbol{w}}^{[l]}$ from batch $l\in[L]$:

exp(𝒉,𝒘^[l]):=𝔼[k=1Kw^k[l](X)exp(hk(X))p=1Kexp(hp(X))].\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]}):=\mathbb{E}\Big{[}\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X)\cdot\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\Big{]}.

The lemma below decomposes an upper bound on the excess surrogate risk into two components: the estimation error of the nuisance functions $\{w_{k}\}_{k=1}^{K}$ and the empirical process error.

Lemma 2.

The excess surrogate risk ():=exp(𝐡^)inf𝐡exp(𝐡)\mathcal{E}(\mathcal{H}):=\mathcal{R}_{\exp}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}_{\exp}(\boldsymbol{h}) satisfies

()2Ll=1Lsup𝒉|exp(𝒉,𝒘^[l])exp(𝒉,𝒘)|Nuisance Estimation Error+2Ll=1Lsuph|^exp(𝒉,𝒘^[l])exp(𝒉,𝒘^[l])|Empirical Process Error.\mathcal{E}(\mathcal{H})\leq\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{\boldsymbol{h}\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})\Big{|}}_{\text{Nuisance Estimation Error}}+\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{h\in\mathcal{H}}\Big{|}\widehat{\mathcal{R}}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})\Big{|}}_{\text{Empirical Process Error}}.

The first term in Lemma 2 captures the error arising from the estimation of the unknown weight function 𝒘\boldsymbol{w}, which is evaluated on the true excess surrogate risk with the supremum over all 𝒉\boldsymbol{h}\in\mathcal{H}. The second term reflects the error incurred by the estimated score function 𝒉^\widehat{\boldsymbol{h}}, evaluated as the supremum of an empirical process. To bound these two terms, we need to further specify the estimation errors of the nuisance estimators, as we will do in Assumptions 5 and 6.

Assumption 5 (Nuisances Estimation).

We assume that the true weight functions $\{w_{k}\}_{k=1}^{K}$ and their estimates $\{\widehat{w}_{k}^{[1]},\dots,\widehat{w}_{k}^{[L]}\}_{k=1}^{K}$ satisfy the following conditions:

1.

    they are uniformly bounded by some constant CwC_{w} over 𝒳\mathcal{X}, namely, wk<Cw,w^k[l]<Cw\|w_{k}\|_{\infty}<C_{w},\|\widehat{w}_{k}^{[l]}\|_{\infty}<C_{w} for all k[K]k\in[K] and l[L]l\in[L];

2.

    there exists γ>0\gamma>0 such that

    𝔼[𝒘^[l](X)𝒘(X)1]=𝒪(Nγ),l[L].\mathbb{E}\big{[}\|\widehat{\boldsymbol{w}}^{[l]}(X)-\boldsymbol{w}(X)\|_{1}\big{]}=\mathcal{O}(N^{-\gamma}),~{}~{}\forall~{}l\in[L].

Assumption 5 requires that the nuisance functions {wk}k=1K\{w_{k}\}_{k=1}^{K} and their estimates {w^k[l]}k=1K\{\widehat{w}_{k}^{[l]}\}_{k=1}^{K} are uniformly bounded over the feature space 𝒳\mathcal{X} and that the estimation error of the nuisance functions decays at a rate of 𝒪(Nγ)\mathcal{O}(N^{-\gamma}) for some γ>0\gamma>0.

Assumption 6 (Function Class Complexity).

Let $C_{h}>0$ be a constant. For any score function candidate $h\in\mathcal{H}$, we assume $\|h_{k}\|_{\infty}<C_{h}$ for any $k\in[K]$. Moreover, assume that the $\epsilon$-covering number of $\mathcal{H}$ with respect to the metric $\rho$, denoted $\mathcal{N}(\epsilon,\mathcal{H},\rho)$, satisfies $\log\mathcal{N}(\epsilon,\mathcal{H},\rho)\leq\epsilon^{-\tau}$ for any $\epsilon\in(0,1)$ and some constant $0<\tau<2$.

In Assumption 6, we further restrict the complexity of the hypothesis class $\mathcal{H}$ by imposing boundedness on each component of the score function $h$ and limiting the growth rate of its covering number. The boundedness condition is reasonable: if any component of the score function, say $h_{k}$, diverges to infinity, then the $\operatorname*{arg\,max}$ operation cannot yield a reasonable classifier. Limiting the growth rate of the covering number is a common way to quantify the complexity of a function class in statistical learning theory [Massart and Nédélec, 2006, Shalev-Shwartz and Ben-David, 2014, Wainwright, 2019].

Nuisance Estimation Error

To provide a finite-sample error bound for the estimation error, we start with bounding the nuisance estimation error introduced in Lemma 2.

Proposition 3 (Nuisance Estimation Error).

Suppose Assumption 5 and Assumption 6 hold. We have, for any l[L]l\in[L],

suph|exp(𝒉,𝒘^[l])exp(𝒉,𝒘)|=𝒪(Nγ).\sup_{h\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\boldsymbol{w})\Big{|}=\mathcal{O}(N^{-\gamma}).

Empirical Process Error

We now consider bounding the supremum of the empirical process introduced in Lemma 2. We focus on the empirical process defined on the $l$-th batch, that is,

^exp(𝒉,𝒘^[l])exp(𝒉,𝒘^[l])=1|Il|iIlk=1Kw^k[l](Xi)exp(hk(Xi))p=1Kexp(hp(Xi))𝔼[k=1Kw^k[l](X)exp(hk(X))p=1Kexp(hp(X))].\widehat{\mathcal{R}}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})-\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}}^{[l]})=\frac{1}{|I_{l}|}\sum_{i\in I_{l}}\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X_{i})\frac{\exp(h_{k}(X_{i}))}{\sum_{p=1}^{K}\exp(h_{p}(X_{i}))}-\mathbb{E}\left[\sum_{k=1}^{K}\widehat{w}_{k}^{[l]}(X)\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right].

As the number of cross-fitting batches $L$ in Algorithm 1 is usually much smaller than the sample size $N$, here we simply assume $|I_{l}|\approx N$. Note that the estimated weight function in the $l$-th batch, $\widehat{\boldsymbol{w}}^{[l]}$, is a random function that depends on the observed variables $(Y,D,X,Z)$. To facilitate our analysis, we denote a new variable $V=(Y,D,X,Z)$ over the domain $\mathcal{V}=\mathcal{Y}\times\mathcal{D}\times\mathcal{X}\times\mathcal{Z}$ with joint distribution $P_{V}$, and we omit the superscript $[l]$ for simplicity in the sequel. For given estimates $\{\widehat{w}_{k}\}_{k=1}^{K}$, we define the surrogate loss class $\mathcal{F}_{\mathcal{H}}$ as

:={fh:vk=1Kw^k(x)exp(hk(x))p=1Kexp(hp(x))|h}\mathcal{F}_{\mathcal{H}}:=\left\{f_{h}:v\mapsto\sum_{k=1}^{K}\hat{w}_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}~{}\Big{|}~{}h\in\mathcal{H}\right\} (D.1)

To simplify our analysis, we use the shorthand notation $Pf=\mathbb{E}[f(V)]$ and $P_{N}f=\frac{1}{N}\sum_{i=1}^{N}f(V_{i})$, where $P_{N}$ is the empirical measure associated with the i.i.d. sample of size $N$. Then the supremum of the empirical process can be written as

\|P_{N}f-Pf\|_{\mathcal{F}_{\mathcal{H}}}:=\sup_{f\in\mathcal{F}_{\mathcal{H}}}P_{N}f-Pf=\sup_{h\in\mathcal{H}}\widehat{\mathcal{R}}_{\exp}(h,\widehat{w})-\mathcal{R}_{\exp}(h,\widehat{w}). (D.2)

Therefore, our goal is now to provide a uniform bound for the empirical process $P_{N}f-Pf$ over the function class $\mathcal{F}_{\mathcal{H}}$. To this end, note that $\mathcal{F}_{\mathcal{H}}$ is a uniformly bounded class: for any $f\in\mathcal{F}_{\mathcal{H}}$, the norm $\|f\|_{\infty}:=\sup_{v\in\mathcal{V}}|f(v)|$ is bounded. Indeed, Assumption 5 gives $\|\widehat{w}_{k}\|_{\infty}=\sup_{x\in\mathcal{X}}|\widehat{w}_{k}(x)|<C_{w}$ for any $k\in[K]$. Consequently, we have

|f(v)|=|k=1Kw^k(x)exp(hk(x))p=1Kexp(hp(x))|\displaystyle|f(v)|=\left|\sum_{k=1}^{K}\widehat{w}_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right| k=1K|w^k(x)exp(hk(x))p=1Kexp(hp(x))|\displaystyle\leq\sum_{k=1}^{K}\left|\widehat{w}_{k}(x)\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right|
k=1K|w^k(x)||exp(hk(x))p=1Kexp(hp(x))|\displaystyle\leq\sum_{k=1}^{K}\Big{|}\widehat{w}_{k}(x)\Big{|}\cdot\left|\frac{\exp(h_{k}(x))}{\sum_{p=1}^{K}\exp(h_{p}(x))}\right|
KCw:=Cf.\displaystyle\leq KC_{w}:=C_{f}.

Now, given the i.i.d. sample $\{V_{1},\dots,V_{N}\}$, we define the empirical Rademacher complexity of the class $\mathcal{F}_{\mathcal{H}}$ as

\widehat{R}_{N}(\mathcal{F}_{\mathcal{H}}):=\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}_{\mathcal{H}}}\left|\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}f(V_{i})\right|\right], (D.3)

where $(\sigma_{1},\dots,\sigma_{N})$ is a vector of i.i.d. Rademacher variables taking values in $\{-1,+1\}$ with equal probability. The empirical Rademacher complexity provides an upper bound for the supremum of the empirical process in Equation D.2; see the lemma below.

Lemma 3 (Symmetrization Bound, Wainwright [2019] Theorem 4.10).

For the CfC_{f}-uniformly bounded class of function \mathcal{F}_{\mathcal{H}} , with probability at least 1δ1-\delta, we have

PNfPf2R^N()+Cf2log(1/δ)N.\|P_{N}f-Pf\|_{\mathcal{F}_{\mathcal{H}}}\leq 2\widehat{R}_{N}(\mathcal{F}_{\mathcal{H}})+C_{f}\sqrt{\frac{2\log(1/\delta)}{N}}.

Define the norm L2(PN)L_{2}(P_{N}) for the function ff\in\mathcal{F}_{\mathcal{H}} such that fgL2(PN):=1Ni=1N(f(Vi)g(Vi))2\|f-g\|_{L_{2}(P_{N})}:=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\big{(}f(V_{i})-g(V_{i})\big{)}^{2}} for any function f,gf,g\in\mathcal{F}_{\mathcal{H}}. As the function class \mathcal{F}_{\mathcal{H}} is CfC_{f}-uniformly bounded, we then have

supf,gfgL2(PN)2Cf.\sup_{f,g\in\mathcal{F}_{\mathcal{H}}}\|f-g\|_{L_{2}(P_{N})}\leq 2C_{f}.

Moreover, let $\mathcal{N}\big{(}\epsilon,\mathcal{F}_{\mathcal{H}},L_{2}(P_{N})\big{)}$ be the covering number of the class $\mathcal{F}_{\mathcal{H}}$ with respect to the metric induced by the $L_{2}(P_{N})$ norm. According to the definition of $\mathcal{F}_{\mathcal{H}}$ in Equation D.1, we have $\mathcal{N}\big{(}\epsilon,\mathcal{F}_{\mathcal{H}},L_{2}(P_{N})\big{)}\leq\mathcal{N}\big{(}\epsilon,\mathcal{H},L_{2}(P_{N})\big{)}$. Combined with the growth-rate condition on $\mathcal{N}\big{(}\epsilon,\mathcal{H},L_{2}(P_{N})\big{)}$ in Assumption 6, this yields the following Dudley entropy-integral bound for the empirical Rademacher complexity defined in Equation D.3.

Lemma 4 (Dudley’s Entropy Integral).

Assume Assumption 6 holds with $0<\tau<2$. Given that $\sup_{f,g\in\mathcal{F}_{\mathcal{H}}}\|f-g\|_{L_{2}(P_{N})}\leq 2C_{f}$, there exists a constant $C_{0}>0$ such that the empirical Rademacher complexity is almost surely upper-bounded by Dudley's entropy integral:

R^N()C0N02Cf𝑑ϵlog𝒩(ϵ,,L2(PN))C0N02Cfϵτ/2𝑑ϵ=2C0(2Cf)1τ/22τN1/2.\widehat{R}_{N}(\mathcal{F}_{\mathcal{H}})\leq\frac{C_{0}}{\sqrt{N}}\int_{0}^{2C_{f}}d\epsilon\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F}_{\mathcal{H}},L_{2}(P_{N}))}\leq\frac{C_{0}}{\sqrt{N}}\int_{0}^{2C_{f}}\epsilon^{-\tau/2}d\epsilon=\frac{2C_{0}(2C_{f})^{1-\tau/2}}{2-\tau}N^{-1/2}.

Under Assumption 6, we can now bound the empirical process error in Lemma 2 by combining Lemma 3 and Lemma 4: with probability at least $1-\delta$, we have

sup𝒉{^exp(𝒉,𝒘^)exp(𝒉,𝒘^)}=PnfPf=𝒪(N1/22log(1/δ)).\sup_{\boldsymbol{h}\in\mathcal{H}}\left\{\widehat{\mathcal{R}}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}})-\mathcal{R}_{\exp}(\boldsymbol{h},\widehat{\boldsymbol{w}})\right\}=\|P_{n}f-Pf\|_{\mathcal{F}_{\mathcal{H}}}=\mathcal{O}\big{(}N^{-1/2}\sqrt{2\log(1/\delta)}\big{)}. (D.4)
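For intuition, the toy Monte Carlo sketch below (hypothetical data and a small finite stand-in for the class $\mathcal{H}$, not the paper's setup) estimates the empirical Rademacher complexity in Equation D.3 for the surrogate-loss class $\mathcal{F}_{\mathcal{H}}$.

```python
# Toy Monte Carlo sketch (hypothetical data, finite stand-in for H, assumes numpy):
# estimate the empirical Rademacher complexity in Equation D.3 for the class F_H.
import numpy as np

rng = np.random.default_rng(0)
N, K, n_candidates, n_mc = 200, 3, 50, 500

X = rng.normal(size=(N, 2))                           # hypothetical features
w_hat = rng.uniform(0, 1, size=(N, K))                # estimated weights w_hat_k(X_i)

# Finite stand-in for H: random linear score functions h_k(x) = x @ theta_k.
thetas = rng.normal(size=(n_candidates, 2, K))
scores = np.einsum("nd,cdk->cnk", X, thetas)          # (candidate, sample, class)
probs = np.exp(scores) / np.exp(scores).sum(axis=2, keepdims=True)
f_vals = (w_hat[None] * probs).sum(axis=2)            # f_h(V_i) for each candidate h

sigma = rng.choice([-1.0, 1.0], size=(n_mc, N))       # Rademacher draws
sup_abs = np.abs(sigma @ f_vals.T / N).max(axis=1)    # sup over candidates, per draw
print("estimated empirical Rademacher complexity:", sup_abs.mean())
```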

Generalization Error Bound for Excess Cost-sensitive Risk

Combining the calibration bound in Theorem 5, the decomposition of the excess surrogate risk in Lemma 2, the nuisance estimation error bound in Proposition 3, and the empirical process error bound in Equation D.4, we can now provide a generalization error bound for the original excess cost-sensitive risk $\mathcal{R}(\widehat{\boldsymbol{h}})-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}(\boldsymbol{h})$ induced by the output score function $\widehat{\boldsymbol{h}}$ in Algorithm 1. Formally, with probability at least $1-\delta$, we have

(𝒉^,w)inf𝒉(𝒉,𝒘)=𝒪(Kmax{Nγ,N1/2log(1/δ)}).\mathcal{R}(\widehat{\boldsymbol{h}},w)-\inf_{\boldsymbol{h}\in\mathcal{H}}\mathcal{R}(\boldsymbol{h},\boldsymbol{w})=\mathcal{O}\Big{(}K\cdot\max\Big{\{}N^{-\gamma},N^{-1/2}\sqrt{\log(1/\delta)}\Big{\}}\Big{)}. (D.5)

This result shows that the excess cost-sensitive risk of the output score function $\widehat{\boldsymbol{h}}$ converges to the best-in-class risk at a rate of $K\cdot N^{-1/2}$ whenever the nuisance rate satisfies $\gamma\geq 1/2$, which matches the standard rate for multiclass classification in statistical learning theory [Massart and Nédélec, 2006, Shalev-Shwartz and Ben-David, 2014, Wainwright, 2019].

D.1 Remaining Proofs

Proof of Lemma 2.

Let $h_{\mathcal{H}}^{\star}\in\mathcal{H}$ denote the best-in-class score function that minimizes the surrogate risk $\mathcal{R}_{\exp}(h,w)$. We use $w$ to denote the true weight function and $\widehat{w}^{[l]}$ to denote its estimate from batch $l\in[L]$. By definition, we have

()\displaystyle\mathcal{E}(\mathcal{H}) =exp(h^,w)exp(h,w)\displaystyle=\mathcal{R}_{\exp}(\hat{h},w)-\mathcal{R}_{\exp}(h_{\mathcal{H}}^{\star},w)
=exp(h^,w)1Ll=1Lexp(h^,w^[l])+1Ll=1Lexp(h^,w^[l])1Ll=1L^exp(h^,w^[l])\displaystyle={\color[rgb]{.5,0,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,0,.5}\mathcal{R}_{\exp}(\hat{h},w)-\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(\hat{h},\widehat{w}^{[l]})}+{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(\hat{h},\widehat{w}^{[l]})-\frac{1}{L}\sum_{l=1}^{L}\widehat{\mathcal{R}}_{\exp}(\hat{h},\widehat{w}^{[l]})}
+1Ll=1L^exp(h^,w^[l])1Ll=1Lexp(h,w^[l])+1Ll=1Lexp(h,w^[l])Rexp(h,w)\displaystyle+{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}\frac{1}{L}\sum_{l=1}^{L}\widehat{\mathcal{R}}_{\exp}(\hat{h},\hat{w}^{[l]})-\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(h_{\mathcal{H}}^{\star},\widehat{w}^{[l]})}+{\color[rgb]{.5,0,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,0,.5}\frac{1}{L}\sum_{l=1}^{L}\mathcal{R}_{\exp}(h_{\mathcal{H}}^{\star},\widehat{w}^{[l]})-R_{\exp}(h_{\mathcal{H}}^{\star},w)}
2Ll=1Lsuph|exp(h,w^[l])exp(h,w)|Nuisance Estimation Error+2Ll=1Lsuph|^exp(h,w^[l])exp(h,w^[l])|Empirical Process Error.\displaystyle\leq{\color[rgb]{.5,0,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,0,.5}\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{h\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,w)\Big{|}}_{\text{Nuisance Estimation Error}}}+{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}\underbrace{\frac{2}{L}\sum_{l=1}^{L}\sup_{h\in\mathcal{H}}\Big{|}\widehat{\mathcal{R}}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})\Big{|}}_{\text{Empirical Process Error}}}.

Proof of Proposition 3.

Notice that for any function hh\in\mathcal{H} and batch l[L]l\in[L],

|exp(h,w^[l])exp(h,w)|\displaystyle\Big{|}\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,w)\Big{|} =|𝔼[k=1K(w^k[l](X)wk(X))exp(hk(X))p=1Kexp(hp(X))]|\displaystyle=\left|\mathbb{E}\left[\sum_{k=1}^{K}\Big{(}\widehat{w}_{k}^{[l]}(X)-w_{k}(X)\Big{)}\cdot\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right]\right|
𝔼[k=1K|(w^k[l](X)wk(X))exp(hk(X))p=1Kexp(hp(X))|]\displaystyle\leq\mathbb{E}\left[\sum_{k=1}^{K}\left|\Big{(}\widehat{w}_{k}^{[l]}(X)-w_{k}(X)\Big{)}\cdot\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right|\right]
𝔼[k=1K|w^k(X)wk(X)||exp(hk(X))p=1Kexp(hp(X))|]\displaystyle\leq\mathbb{E}\left[\sum_{k=1}^{K}\Big{|}\hat{w}_{k}(X)-w_{k}(X)\Big{|}\cdot\left|\frac{\exp(h_{k}(X))}{\sum_{p=1}^{K}\exp(h_{p}(X))}\right|\right]
𝔼[k=1K|w^k(X)wk(X)|]=𝔼w^(X)w(X)1.\displaystyle\leq\mathbb{E}\Big{[}\sum_{k=1}^{K}\Big{|}\hat{w}_{k}(X)-w_{k}(X)\Big{|}\Big{]}=\mathbb{E}\Big{\|}\hat{w}(X)-w(X)\Big{\|}_{1}.

The first inequality follows from Jensen's inequality applied to the convex absolute value function $|\cdot|$. The second step uses $|ab|=|a|\cdot|b|$, and the third inequality follows from the fact that each $\operatorname{softmax}$ component is bounded by $1$ for all $x\in\mathcal{X}$ and $h\in\mathcal{H}$. Finally, combining the above with Assumption 5, we have, for every batch $l\in[L]$,

suph|exp(h,w^[l])exp(h,w)|=𝒪(Nγ).\sup_{h\in\mathcal{H}}\Big{|}\mathcal{R}_{\exp}(h,\widehat{w}^{[l]})-\mathcal{R}_{\exp}(h,w)\Big{|}=\mathcal{O}(N^{-\gamma}).