The Value of Information in Human-AI Decision-making

Ziyang Guo Department of Computer Science, Northwestern University. Email: [email protected] Yifan Wu Department of Computer Science, Northwestern University. Email: [email protected] Jason Hartline Department of Computer Science, Northwestern University. Email: [email protected] Jessica Hullman Department of Computer Science, Northwestern University. Email: [email protected]

Abstract

Humans and AIs are often paired on decision tasks with the expectation of achieving complementary performance, where the combination of human and AI outperforms either one alone. However, how to improve performance of a human-AI team is often not clear without knowing more about what particular information and strategies each agent employs. We provide a decision-theoretic framework for characterizing the value of information—and consequently, opportunities for agents to better exploit available information–in AI-assisted decision workflow. We demonstrate the use of the framework for model selection, empirical evaluation of human-AI performance, and explanation design. We propose a novel information-based instance-level explanation technique that adapts a conventional saliency-based explanation to explain information value in decision making.

1 Introduction

As the performance of artificial intelligence (AI) models improves, workflows in which human and AI model-based judgments are combined to make decisions are sought in medicine, finance, and other domains. Though statistical models often make more accurate predictions than human experts on average (Ægisdóttir et al., 2006; Grove et al., 2000; Meehl, 1954), whenever humans have access to additional information over the AI, there is potential to achieve complementary performance by pairing the two, i.e., better performance than either the human or AI alone. For example, a physician may have access to additional information that may not be captured in tabular electronic health records or other structured data (Alur et al., 2024b).

However, evidence of complementary performance between humans and AI is limited, with many studies showing that human-AI teams underperform an AI alone (Buçinca et al., 2020; Bussone et al., 2015; Green and Chen, 2019; Jacobs et al., 2021; Lai and Tan, 2019; Vaccaro and Waldo, 2019; Kononenko, 2001). A solid understanding of such results is limited by the fact that most analyses of human-AI decision-making focus on ranking the performance of human-AI teams or each individually using measures like posthoc decision accuracy. This approach is problematic for several reasons. First, it does not account for the best achievable performance based on the information available at the time of the decision (Kleinberg et al., 2015; Guo et al., 2024; Rambachan, 2024). Second, it cannot provide insight into the potential for available information to improve the decisions, making it difficult to design interventions that improve the team’s performance.

In contrast, identifying information complementarities that contribute to the maximum achievable decision performance of a human and AI model—such as when one of the agents has access to information not contained in the other’s judgments, or has not fully integrated information available in the environment into their judgments—provides more actionable information for intervening to improve the decision pipeline. For example, if human experts are found to possess decision-relevant information over the AI, we might collect further data to improve the AI model. If the model predictions contain decision-relevant information not contained in human decisions, we might design more targeted explanations to help humans integrate under-exploited information.

We contribute a decision-theoretic framework for characterizing the value of information available within an AI-assisted decision workflow. In our framework, information is considered valuable to a decision-maker to the extent that it is possible to in theory incorporate it into their decisions to improve performance. Specifically, our approach analyzes the expected marginal payoff gain from best case (Bayes rational) use of additional information over best case use of the information already encoded in agent decisions for a given decision problem. Based on the intuition that any information that is used by the agents will eventually reveal itself through variation in their decisions, we identify the value of the information in agent (human, AI, or human-AI) decisions by offering them as a signal to a Bayesian rational decision-maker.

We introduce two metrics for evaluating information value in human-AI collaboration. The first—global human-complementary information value—calculates the value of a new piece of information to an agent over the data-generating distribution. The second—instance-level human-complementary information value—supports analyses of how humans or AI systems use information on a instance-by-instance basis.

We apply the framework by applying it to three decision-making tasks where AI models serve as human decision-making assistants¹¹1Code to replicate our experiments is availabel at https://osf.io/p2qzy/?view_only=ec06600d06cd4e59bb6051f992e54c08: chest X-ray diagnosis (Rajpurkar et al., 2018; Johnson et al., 2019), deepfake detection (Dolhansky et al., 2020; Groh et al., 2022), and recidivism prediction (Angwin et al., 2022; Dressel and Farid, 2018). First, we demonstrate its utility in model selection by evaluating how well different AI models complement human decision-makers, showing how even among models with similar accuracy, some models strictly offer more complementary information than others across decision problems. Next, we use our framework to empirically evaluate how providing AI assistance (alongside instance-level features) helps human exploit available information for decision-making. Lastly, we demonstrate use of the framework to design explanations by extending SHAP Lundberg and Lee (2017) to highlight the portion of an AI’s prediction that complements human information.

2 Related work

Human-AI complementarity

Many empirical studies of human-AI collaboration focus on AI-assisted human decision-making for legal, ethical or safety reasons (Bo et al., 2021; Boskemper et al., 2022; Bondi et al., 2022; Schemmer et al., 2022). However, a recent meta-analysis by Vaccaro et al. (2024) finds that on average, human–AI teams perform worse than the better of either humans or AI alone. In response, a growing body of work seeks to evaluate and enhance complementarity in human–AI systems (Bansal et al., 2021b, 2019, a; Wilder et al., 2021; Rastogi et al., 2023; Mozannar et al., 2024b). The present work differs from much of this prior work by approaching human-AI complementarity from the perspective of information value and use, including whether the human and AI decisions provide additional information that is not used by the other.

Evaluation of human decision-making with machine learning

Our work contributes to the development of methods for evaluating decisions of human-AI teams (Kleinberg et al., 2015, 2018; Lakkaraju et al., 2017; Mullainathan and Obermeyer, 2022; Rambachan, 2024; Guo et al., 2024; Ben-Michael et al., 2024; Shreekumar, 2025). Kleinberg et al. (2015) first proposed that evaluations of human-AI collaboration should be based on what information is available at the time of decisions. Our work contributes to definition of Bayesian best-attainable-performance benchmarks (Hofman et al., 2021; Wu et al., 2023; Agrawal et al., 2020; Fudenberg et al., 2022). Closest to our work, Guo et al. (2024) use a rational Bayesian agent faced with deciding between the human and AI recommendations as the theoretical upper bound on expected performance of any human-AI team. This benchmark provides a basis for identifying informational “opportunities” within a decision problem.

Human information in machine learning

One popular approach for developing machine learning models is to incorporate human information or expertise in model predictions (Alur et al., 2024a, b; Corvelo Benz and Rodriguez, 2023; Mozannar et al., 2024a; Bastani et al., 2021; Madras et al., 2018; Raghu et al., 2019; Keswani et al., 2022, 2021; Okati et al., 2021). Corvelo Benz and Rodriguez (2023) propose multicalibration over human and AI model confidence information to guarantee the existence of an optimal monotonic decision rule. Alur et al. (2023) propose a hypothesis testing framework to evaluate the added value of human expertise over AI forecasts. Our work shares the motivation of incorporating human expertise but targets a broader scope by quantifying the information value for all available signals and agent decisions implied in a human–AI decision pipeline, rather than focusing solely on improving model performance.

3 Methodology

Our framework takes input as a decision problem associated with an information model and outputs the value of information in any available signals to any agent, conditioning on the existing information in their decisions within a Bayesian decision theoretic framework. Our framework provides two separate functions to quantify the value of information globally across the data-generating process and locally in a realization drawn from the data-generating process. We also introduce a robust analysis approach to information order, which enables to compare the agent-complementary information in signals for all possible decision problems.

Decision Problem

A decision problem consists of three key elements. We illustrate with an example of a weather decision.

•

A payoff-relevant state $\omega$ from a space $\mathbf{\Omega}$ . For example, $\omega\in\mathbf{\Omega}=\{0,1\}=\{\text{no rain},\text{rain}\}$ .
•

A decision $d$ from the decision space $\mathbf{D}$ characterizing the decision-maker (DM)’s choice. For example, $d\in\mathbf{D}=\{0,1\}=\{\text{not take umbrella},\text{take umbrella}\}$ .
•

A payoff function $S:\mathbf{D}\times\mathbf{\Omega}\to\mathbb{R}$ , used to assess the quality of a decision given a realization of the state. For example, $S(d=0,\omega=0)=0,S(d=0,\omega=1)=-100,S(d=1,\omega=0)=-50,S(d=1,\omega=1)=0$ , which punishes the DM for selecting an action that does not match the weather.

Information Model

We cast the information available to a DM as a signal defined within an information model. We use the definition of an information model in Blackwell et al. (1951). The information model can be represented by a data-generating model with a set of signals.

•

Signals. There are $n$ “basic signals” represented as random variables $\Sigma_{1},\ldots,\Sigma_{n}$ , from the signal spaces $\mathbf{\Sigma}_{1},\ldots,\mathbf{\Sigma}_{n}$ . These represent information obtained by a decision-maker, e.g., $\mathbf{\Sigma}_{1}=\{\text{cloudy},\text{not cloudy}\}$ , $\mathbf{\Sigma}_{2}\in\{0,\ldots,100\}$ for temprature Celsius, etc. The decision-maker observes a signal, which is a combination of the basic signals, represented as a set $V\subseteq 2^{\{\Sigma_{1},\dots,\Sigma_{n}\}}$ . For example, a signal representing a combination of two basic signals $V=\{\Sigma_{1},\Sigma_{2}\}$ observed by the decision-maker might consist of cloudiness $\Sigma_{1}$ and the temperature $\Sigma_{2}$ of the day. Given a signal composed of $m$ basic signals, we write the realization of $V$ as $v=(\sigma_{j_{1}},\dots,\sigma_{j_{m}})$ , where the realizations $\sigma_{j_{i}}\in\mathbf{\Sigma}_{j_{i}}$ are sorted by the index of the basic signals $j_{i}\in[n]$ . The union $V$ of two signals $V_{1},V_{2}$ takes the set union, i.e., $V=V_{1}\cup V_{2}$ . Though $V$ is initially defined as a set of random variables, we will slightly abuse notation $V$ to represent a random variable that is drawn from the joint distribution of the basic signals in it.
•

Data-generating process. A data-generating process is a joint distribution $\pi\in\Delta(\mathbf{\Sigma}_{1}\times\ldots\times\mathbf{\Sigma}_{n}\times\mathbf{\Omega})$ over the basic signals and the payoff-relevant state. However, the DM may only observe a subset $V$ of the $n$ basic signals. We can define the Bayesian posterior belief upon receiving a signal $V=v$ from the data-generating model as

$\Pr[\omega|v]=\frac{\pi(v,\omega)}{\pi(\omega)}$

where we slightly abuse notation to write $\pi(v,\omega)$ as the marginal probability of the signal realized to be $v$ and the state being $\omega$ with expectation over unobserved signals.

Information value

Our framework quantifies the value of information in a signal $V$ as the extent to which the payoff could be improved by the ideal use of $V$ over a baseline information set. We suppose a rational Bayesian DM who knows the data-generating process, observes a signal realization, updates their prior to arrive at posterior beliefs, and then chooses a decision to maximize their expected payoff based on the posterior belief. Formally, the rational DM’s expected payoff given a (set of) signal(s) $V$ is

\mathrm{R}^{\pi,S}(V)=\text{\bf E}_{v\sim\pi}\![{\max_{d\in\mathbf{D}}\text{\bf E}_{\omega\sim\Pr(\omega|v)}\![{S(d,\omega)}]}]

We use $\emptyset$ to represent a null signal, such that $\mathrm{R}^{\pi,S}(\emptyset)$ is the expected payoff of a Bayesian rational DM who has no access to a signal but only uses their prior belief to make decisions. In this case, the Bayesian rational DM will take the best fixed action under the prior, and their expected payoff is:

\mathrm{R}^{\pi,S}(\emptyset)=\max_{d\in\mathbf{D}}\text{\bf E}_{\omega\sim\pi}\![{S(d,\omega)}]

$\mathrm{R}^{\pi,S}(\emptyset)$ defines the maximum expected payoff that can be achieved with no information. Bayesian decision theory quantifies the information value of $V$ by the payoff improvement of $V$ over the payoff obtained without information.

Definition 3.1.

Given a decision task with payoff function $S$ and an information model $\pi$ , we define the information value of $V$ as

IV^{\pi,S}(V)=\mathrm{R}^{\pi,S}(V)-\mathrm{R}^{\pi,S}(\emptyset)

We adopt the same idea to define the agent-complementary information values in our framework.

3.1 Agent-Complementary Information Value

Given the above definitions, it becomes possible to measure the additional value that new signals can provide over the information already captured by an agent’s decisions. Here, agent may refer to a human, an AI system, or a human–AI team. The intuition behind our approach is that any information that is used by decision-makers should eventually reveal itself through variation in their behaviors. We recover the information value in agent decisions by offering the decisions as a signal to the Bayesian rational DM. We model the agent decisions as a random variable $D^{b}$ from the action space $\mathbf{D}$ , which follows a joint distribution $\pi^{b}\in\Delta(\mathbf{\Omega}\times\mathbf{\Sigma}_{1}\times\ldots\times\mathbf{\Sigma}_{n}\times\mathbf{D})$ with the state and signals. The expected payoff of a Bayesian rational DM who knows $\pi^{b}$ is given by the function:

\mathrm{R}(D^{b})=\text{\bf E}_{d^{b}\sim\pi^{b}}\![{\max_{d\in\mathbf{D}}\text{\bf E}_{\omega\sim\Pr(\omega|D^{b}=d^{b})}\![{S(d,\omega)}]}]

We seek to identify signals $V$ that can potentially improve agent decisions by analyzing the information value in the combined signal $D^{b}\cup V$ and the information value in $D^{b}$ , which we define as agent-complementary information value.

Definition 3.2.

Given a decision task with payoff function $S$ and an information model $\pi$ , we define the agent-complementary information value of $V$ on agent decisions $D^{b}$ as

ACIV^{\pi,S}(V;D^{b})=\mathrm{R}^{\pi,S}(D^{b}\cup V)-\mathrm{R}^{\pi,S}(D^{b})

If the $ACIV$ of a signal $V$ is low, this means either that the $IV$ of $V$ is low (e.g., it is not correlated with $\omega$ ), or that the agent has already exploited the information in $V$ (e.g., the agent relies on $V$ to make their decisions such that their decisions correlate with $\omega$ in the same way as $V$ correlates with $\omega$ ). If, however, the $ACIV$ of $V$ is high, then at least in theory, the agent can improve their payoff by incorporating $V$ in their decision making.

Furthermore, $ACIV$ can reveal complementary information between different types of agents. For instance, if we view AI predictions as $V$ and treat human decisions as the existing agent signal $D^{b}$ , a large $ACIV$ indicates that AI predictions add considerable value beyond what humans alone achieve. In the reverse scenario, if human decisions serve as $V$ and AI predictions are $D^{b}$ , we can measure how much humans contribute on top of the AI. We demonstrate further usage of $ACIV$ in Section 4 and Section 5.

3.2 Instance-level Agent-Complementary

Instance-level Agent-Complementary Information Value ( $ILIV$ ) evaluates the additional information contributed by a single realization of a signal rather than the entire joint distribution. This finer-grained view is critical for tasks where we need to understand the information value on individual instances, such as when asking whether a human expert should trust an individual prediction from an AI or how to help a human expert understand how the AI model is exploiting information about an instance.

To quantify the information value of a realization $V=v$ , we construct $Z_{v}$ as a binary variable indicating whether signal $V$ is realized as $v$ , i.e., $Z_{v}=\mathbb{1}[V=v]$ . The data-generating model defining the joint distribution of $Z_{v}$ , $\pi_{v}\in\Delta(\mathbf{\Omega}\times\{0,1\})$ , can be constructed through transforming the original data-generating model $\pi$ . Because $Z_{v}$ is a garbling of $V$ (Blackwell et al., 1951), there always exists a a Markov matrix $\Gamma:\mathbf{V}\rightarrow\{0,1\}$ such that the new data-generating process $\pi_{v}$ can be constructed through $\pi_{v}=\Gamma\pi$ .

Definition 3.3.

Given a decision task with payoff function $S$ and an information model $\pi$ , we define the instance-level agent-complementary information value of the realization $V=v$ as:

ILIV^{\pi,S}(v;D^{b})=\mathrm{R}^{\pi_{v},S}(D^{b}\cup Z_{v})-\mathrm{R}^{\pi_{v},S}(D^{b})

This local measure captures how much additional payoff can be gained by incorporating the specific realization $v$ into the agent decisions $D^{b}$ . Summing $ILIV$ over all possible realizations of $V$ recovers the global agent-complementary information value ( $ACIV$ ).

Proposition 3.4.

ACIV^{\pi,S}(V;D^{b})=\sum_{v\in\mathbf{V}}ILIV^{\pi,S}(v;D^{b})

We apply $ILIV$ to define an information-based explanation technique (ILIV-SHAP) that extends SHAP to explains how the information value of AI predictions complements human decisions $D^{\text{Human}}$ for specific instances. Vanilla SHAP (Lundberg and Lee, 2017) defines a saliency-based explanation with a set of effect variables $\phi_{i}$ representing the influence of the realization of basic signal $\Sigma_{i}=\sigma_{i}$ on the model output $f(v)$ , where $v=(\sigma_{1},\ldots,\sigma_{n})$ and $f:\mathbf{V}\rightarrow\mathbb{R}$ . Lundberg and Lee (2017) show that the only possible explanation model fulfills properties of local accuracy, missingness and consistency for model $f$ and instance $v$ is following:

Definition 3.5 (SHAP (Lundberg and Lee, 2017)).

\phi_{i}(f,v)=\sum_{v^{\prime}\subseteq v}\frac{|v^{\prime}|!(n-|v^{\prime}|-1)!}{n!}[g_{f}(v^{\prime})-g_{f}(v^{\prime}\backslash\sigma_{i})]

where $g_{f}(v^{\prime})$ denotes the expectation of model output conditioned on the subset of signals with $V^{\prime}=v^{\prime}$ , i.e., $\text{\bf E}\![{f(V)|V^{\prime}=v^{\prime}}]$ .

Instead of quantifying changes in the AI’s raw predictions, ILIV-SHAP clarifies how each feature contributes to the agent-complementary information value of predictions.

Proposition 3.6 (ILIV-SHAP).

An explanation model whose effective variable defined as

\phi_{i}(ILIV(f(\cdot);D^{b}),v)

fulfill properties of local accuracy, missingness and consistency for $ILIV$ of $f(\cdot)$ on instance $v$ .

Intuitively, when $\phi_{i}(ILIV(f(\cdot);D^{b}),v)$ for a feature $\sigma_{i}$ is relatively large, it means that the model $f(\cdot)$ is extracting information that $D^{b}$ lacks from $\Sigma_{i}$ (or other features with equivalent information to it). Otherwise, when $\sigma_{i}$ makes a small relative contribution, it means that it makes the model ignore information in other features that have higher information value for human decision-makers (such that on average lossing information value).

3.3 Robustness Analysis of Information Order

Ambiguity about the appropriate payoff function is not uncommon in human-AI decision settings due to challenges of eliciting utility functions and potential variance in these functions across decision-makers or groups of instances; e.g., doctors penalize certain false negative results differently when diagnosing younger versus older patients (Mclaughlin and Spiess, 2023). We therefore define the partial order of complementary information value using Blackwell order. Blackwell’s comparison of signals (Blackwell et al., 1951) defines a (set of) signal $V_{1}$ as more informative than $V_{2}$ if $V_{1}$ has a higher information value on all possible decision problems. We identify this partial order by decomposing the space of decision problems via a basis of proper scoring rules (Li et al., 2022; Kleinberg et al., 2023).

Definition 3.7 (Blackwell Order of Information).

A (set of) signal $V_{1}$ is Blackwell more informative than $V_{2}$ if $V_{1}$ achieves a higher payoff on any decision problem,

\mathrm{R}^{\pi,S}(V_{1})\geq\mathrm{R}^{\pi,S}(V_{2}),\forall S

We test the Blackwell order between signals on a basis of proper scoring rules induced from decision problems. The basis is the set of V-shaped scoring rules, parameterized by the kink of the piecewise-linear utility function.

Definition 3.8.

(V-shaped scoring rule) A V-shaped scoring rule with kink $\mu\in(0,\frac{1}{2}]$ is defined as

S_{\mu}(d,\omega)=\left\{\begin{array}[]{cc}\frac{1}{2}-\frac{1}{2}\cdot\frac{\omega-\mu}{1-\mu}&\text{if }d\leq\mu\\ \frac{1}{2}+\frac{1}{2}\cdot\frac{\omega-\mu}{1-\mu}&\text{else},\end{array}\right.

When $\mu^{\prime}\in(\frac{1}{2},1)$ , the V-shaped scoring rule can be symmetrically defined by $S_{\mu^{\prime}}=S_{1-\mu^{\prime}}(1-y,\omega)$ .

Intuitively, the kink $\mu$ represents the threshold belief where the decision-maker switches between two actions. Larger $\mu$ means that the decision-makers will prefer $d=1$ more. The closer $\mu$ is to $0.5$ , the more indifferent the decision-maker is to $d=0$ or $d=1$ .

Proposition 3.9 shows that if $V_{1}$ achieves a higher information value on the basis of V-shaped proper scoring rules than $V_{2}$ , then $V_{1}$ is Blackwell more informative than $V_{2}$ . Proposition 3.9 follows from the fact that any best-responding payoff can be linearly decomposed into the payoff on V-shaped scoring rules.

Proposition 3.9 (Hu and Wu 2024).

\mathrm{R}^{\pi,S_{\mu}}(V_{1})\geq\mathrm{R}^{\pi,S_{\mu}}(V_{2}),\forall\mu\in(0,1),

then $V_{1}$ is Blackwell more informative than $V_{2}$ .

Extending this to agent-complementary information value, we say that $V_{1}$ offers a higher complementary value than $V_{2}$ under the Blackwell order if

ACIV^{\pi,S_{\mu}}(V_{1};D^{b})\geq ACIV^{\pi,S_{\mu}}(V_{2};D^{b}),\forall\mu\in(0,1)

This definition allows us to rank signals (or sets of signals) universally, without needing to pin down any specific payoff function prior to the analysis. See the use case in Section 4.

4 Experiment I: Model Comparison on Chest Radiographs Diagnosis

We apply our framework to a well-known cardiac dysfunction diagnosis task (Rajpurkar et al., 2018; Tang et al., 2020; Shreekumar, 2025). We apply the framework to analyze how much complementary information value a set of possible AI models offer to human decision-makers.

Refer to caption — Figure 1: Information value of all deep-learning models calculated under our framework using Brier score as payoff function.

4.1 Data and Model

We use data from the MIMIC dataset (Goldberger et al., 2000), which contains anonymized electronic health records from Beth Israel Deaconess Medical Center (BIDMC), a large teaching hospital in Boston, Massachusetts, affiliated with Harvard Medical School. Specifically, we utilize chest x-ray images and radiology reports from the MIMIC-CXR database (Johnson et al., 2019) merged with patient and visit information from the broader MIMIC-IV database (Johnson et al., 2023). The payoff-related state, cardiac dysfunction $\omega\in\{0,1\}$ , is coded based on two common tests, the NT-proBNP and the troponin, using the age-specific cutoffs from Mueller et al. (2019) and Heidenreich et al. (2022). We use the labels from Irvin et al. (2019) as the human decisions (without AI’s assistance) in the diagnosis task, which is a rule-based tool labeling the symptoms as positive, negative, or uncertain, i.e., $d\in\{+,?,-\}$ ²²2The three symbols represents the encoding we use for signal construction, not an assertion of how radiologists communicate.. We fune-tuned five deep-learning models on the cardiac dysfunction diagnosis task, VisionTransformer (Alexey, 2020), SwinTransformer (Liu et al., 2021), ResNet (He et al., 2016), Inception-v3 (Szegedy et al., 2016), and DenseNet (Huang et al., 2017). Our training set contains 12,228 images and validation set contains 6,115 images. On a hold-out test set with 12,229 images, the AUC achieved by the five models are: DenseNet with $0.77$ , Inception v3 with $0.76$ , ResNet with $0.77$ , SwinTransformer with $0.78$ , VisionTransformer with $0.80$ .

We consider Brier score as a payoff function, and also conduct a robust analysis considering various V-shaped payoff functions with different kinks on a discretized grid of $[0,1]$ with a step $0.01$ . We use the hold-out test set to estimate the data-generating process, which defines the joint distribution of state, human decisions and AI models’ predictions.

4.2 Results

Can the AI models complement human judgment? We first analyze the agent-complementary information values in Figure 1, using the Brier score as the payoff function. We find that all AI models provide complementary information value to human judgment. This highlights the same takeaways as section 5, that the AI model has considerable potential to improve human decisions. As shown in Figure 1 (comparison between $\mathrm{R}^{\pi,S}(D^{\text{Human}}\cup D^{\text{AI}})$ and $\mathrm{R}^{\pi,S}(D^{\text{Human}})$ ), all AI models capture at least $20\%$ of the total available information value that is not exploited by human decisions. This motivates deploying at AI to assist humans in this scenario.

From the other direction, the human decisions also provide complementary information to all AI models, comparing $\mathrm{R}^{\pi,S}(D^{\text{AI}}\cup D^{\text{AI}})$ with $\mathrm{R}^{\pi,S}(D^{\text{AI}})$ in Figure 1. In scenarios where partial automation is possible, this observation might inspire, for example, further investigation into what information the humans may have access to that is not represented in AI training data.

Which AI model offers the most decision-relevant information over human judgments? Figure 1 shows that VisionTransformer contains slightly higher information value than the other models and Inception v3 contains slightly lower information value than the other models. However, these differences are slight. If the payoff function were questionable in any way, an organization may not want to trust the model rankings. This motivates evaluating model performance over many possible losses to test if there is a Blackwell ordering of models. Across all the V-shaped payoff functions, we find that VisionTransformer is Blackwell more informative and Inception v3 is Blackwell less informative than all other models. By Proposition 3.9, we test the payoff of models on all V-shaped scoring rules, shown in Figure 4. The VisionTransformer achieves a higher information value on all V-shaped scoring rules, implying a higher information value on all decision problems. This analysis highlights the insufficiency of accuracy-based model comparisons to account for all downstream decision problems: 1) while accuracy may rank models in one way, there may exist decision problems where the order between models is reversed; 2) while two models may seem comparable in accuracy, one can be more informative than the other for all decision problems, which is robustly good.

5 Experiment II: Behavioral Analysis on Deepfake Detection

We analyze a deepfake video detection task (Dolhansky et al., 2020), where participants are asked to judge whether a video was created by generative AI. We apply the framework to benchmark the use of available information by the human, AI, and human-AI team.

5.1 Data and Model

We define the information model on the experiment data of Groh et al. (2022). They recruited 5,524 non-expert participants through Prolific. Participants were asked to examine the videos, and provided with assistance from a computer vision model, which achieved $65\%$ accuracy on a holdout dataset. They reported their decisions in two rounds. They first reviewed the video and reported an initial decision ( $D^{\text{Human}}$ ) without access to the model. Then, in a second round, they were told the AI’s recommendation ( $D^{\text{AI}}$ ) and chose whether to change their initial decision, resulting in a final decision ( $D^{\text{Human-AI}}$ ). Participants’ decisions (both initial and final) were elicited as a percentage indicating how confident they were that the video was a deepfake, measured in $1\%$ increments: $d\in\{0\%,1\%,\ldots,100\%\}$ . We round the predictions from the AI to the same 100-scale probability scale available to study participants.

We use the Brier score as the payoff function: $S(\omega,d)=1-(\omega-d)^{2}$ , with the binary payoff-related state: $\omega\in\{0,1\}=\{\text{genuine},\text{fake}\}$ . The scale of Brier score is $[0,1]$ and a random guess ( $d\sim\text{Bernoulli}(0.5)$ ) achieves $0.75$ payoff under Brier score. We choose the Brier score instead of the norm-1 accuaracy used by Groh et al. (2022) because Brier score is a proper scoring rule where truthfully reporting the belief maximizes the payoff³³3We prefer a proper scoring rule so that the rational decision-maker’s strategy is to reveal their true belief, ensuring that the signal’s information value accurately reflects its role in forming beliefs. If the goal is merely to evaluate the outcome performance of real decision-makers, a non-proper scoring rule may still suffice..

We identify a set of features that were implicitly available to all three agents (human, AI, and human-AI). Because the video signal is high dimensional, we make use of seven seven video-level features using manually coded labels by Groh et al. (2022): graininess, blurriness, darkness, presence of a flickering face, presence of two people, presence of a floating distraction, and the presence of an individual with dark skin, all of which are labeled as binary indictors. These are the basic signals in our framework. We estimate the data-generating process $\pi$ using the realizations of signals, state, first round human decisions, AI predictions, and second round human-AI decisions.

5.2 Results

How much decision-relevant information do each agent’s decisions offer? We first compare the information value of the AI predictions to the decision problem to that of the human decisions in the first round (without AI assistance). To contextualize the value of each, we first construct a scale ranging from the expected payoff of rational DM with no information, i.e., $\mathrm{R}^{\pi,S}(\emptyset)$ , to that of the rational agent who has access to all information, i.e., $\mathrm{R}^{\pi,S}(\Sigma_{1}\cup\ldots\cup\Sigma_{n}\cup D^{\text{Human}}\cup D^{\text{Human-AI}}\cup D^{\text{AI}})$ . The lower-bound represented by the rational DM with no information is $0.75$ in Brier score, which is equivalent to the payoff achieved by a random guess drawn from $\text{Bernoulli}(0.5)$ .

Using this scale, Figure 2(a) shows that AI predictions provide about $65\%$ of the total possible information value over the no information baseline, while human decisions only provide about $15\%$ . Hence, human decisions are only weakly informative for the problem.

We next consider the human-AI decisions. Given that the AI predictions contained a significant portion of the total possible information value, we might hope that when participants are provided with AI predictions, their performance comes close to the full information baseline. However, the information value contained in human-AI decisions is also only take a small proportion of the total possible information value ( $30\%$ ). This aligns with the findings by Guo et al. (2024) that humans tend to rely on the AI but are bad at distinguishing when AI predictions are correct.

How much additional decision-relevant information do the available features offer over each agent’s decisions? To understand what information might improve human decisions, we assess the $ACIV$ s of different signals over different agents. This describes the additional information value in the signal after conditioning on the exisiting information in the agents’ decisions. As shown on the fifth row in Figure 2, the presence of a flickering face offers larger $ACIV$ over human decisions than over AI predictions, meaning that human decisions could improve by a greater amount if they were to incorporate this information. Meanwhile, as shown on the fourth row in Figure 2, the presence of a individual with dark skin offers larger $ACIV$ over AI predictions than over human decisions. This suggests that the AI and human rely on differing information to make their initial predictions, where the AI relies more on information associated with the presence of a flickering face while human participants rely more on information associated with the presence of an individual with dark skin.

By comparing the $ACIV$ s of different signals over human decisions and human-AI decisions, we also find that simply displaying AI predictions to human did not help human-AI to exploit the observed signals in their decisions. As shown in Figure 2, with the assistance of AI, the human-AI teams’ decisions have similar $ACIV$ compared to human decisions (note that the $ACIV$ over human decisions shows significant difference on same signal from the $ACIV$ over AI predictions), except a little improvement on the presence of a flickering face. This finding further comfirms the hypothsis on human simply relying on AI predictions without processing the information contained in them.

6 Experiment III: Information-based Local Explanation on Recidivism Prediction

We apply our framework to a recidivism prediction task, where the decision-maker decides whether to release a defendent. We apply the framework to evaluate and augment saliency-based explanations.

6.1 Data and Model

We use the dataset from COMPAS dataset (Angwin et al., 2022), which contains 11,758 defendents with associated features capturing demographics, information about their current charges, and their prior criminal history. We merge the dataset with the experimental data from Lin et al. (2020) to represent the human decisions, which contains recidivism decisions on a subset of instances from Angwin et al. (2022)’s experiment on laypeople⁴⁴4Here the experimental decisions avoid the problem of performative prediction (Perdomo et al., 2020). Human decisions were elicited on a 30-point probability scale. Merging the two datasets produces 9,269 instances (defendants). For the AI model, we trained an extreme gradient boosting (XGBoost) model (Chen and Guestrin, 2016) on a training set with 6,488 instances, achieving an AUC of $0.84$ on a hold-out test set with 2,781 instances. We round the model predictions to the same 30-point probability scale available to study participants in Lin et al. (2020).

We use Brier score as the payoff function, with the binary payoff-related state $\omega$ indicating whether the defendent gets rearrested within two years. We use the features of defendent contained in COMPAS dedaset as the signals in our demonstration: demographic features (age, sex, and race), information about the current charge (type of offense and whether it is a misdemeanor or a felony), prior criminal history (e.g., past number of arrest and charges for several offense categories) and the predicted score from COMPAS system⁵⁵5https://doc.wi.gov/Pages/AboutDOC/COMPAS.aspx. We use the hold-out test set to estimate the data-generating process, which defines the joint distribution of state, signals, human decisions and AI predictions.

Constructing instance-level signals.

Denote the XGBoost model’s predictions as a random variable $Y^{pred}=f(X)$ , where $X$ denotes the random variable of data features and $f(\cdot)$ denotes the model’s predictive function. We construct $Z_{f(x)}$ as $\mathbb{1}[Y^{pred}=f(x)]$ for every $x$ and use it to calculate the instance-level agent-complementary information value ( $ILIV$ ) of $Y^{pred}$ on instances $x$ (Definition 3.3). We generate the information-based explanations by ILIV-SHAP in this demonstration.

6.2 Results

How well does SHAP explain complementary information offered by the AI prediction over what humans already know? Figure 3 (A) and (B) compares the distribution of feature-attribution scores from SHAP to those from our ILIV-SHAP. We observe multiple discrepancies in feature importance across the methods. For instance, in Figure 3(A), the age feature negatively correlates with the AI’s prediction, indicating that younger defendants tend to receive higher predictions. However, Figure 3(B) ILIV-SHAP shows no such association between age and the AI’s information value over human decisions. This implies that, although age influences the AI’s prediction, on average it does not provide additional information that humans lack.

Conversely, some features that correlate strongly with the $ILIV$ of the AI prediction are relatively unimportant for the AI’s raw prediction. For example, in Figure 3(A), low decile scores predicted by the COMPAS (colored in blue) have a relatively small impact on the AI’s numerical output compared with other features. In contrast, Figure 3(B) reveals that some low decile scores yield large positive $ILIV$ , suggesting that this feature might offer valuable information for human decision-makers despite not significantly changing the AI’s raw prediction.

How can an information-based explanation enhance understanding of AI predictions? We argue that an information-based explanation (via ILIV-SHAP) can serve as a supplement to saliency-based explanations, providing users a sense of whether the individual prediction can help them make decisions. First, an information-based explanation conveys whether the prediction correlates with the payoff-related state. For example, Figure 3(C) shows that the model prediction $f(x)$ deviates $+0.383$ from the prior prediction $\text{\bf E}\![{f(X)}]$ on instance 4. However, the information-based explanation of the same instance in Figure 3(D) shows that the $ILIV$ of the AI prediction over human decisions does not change much from the $ILIV$ of the prior prediction (which is very low at $0.005$ ). This suggests to the user that focusing on the AI prediction on this instance is not necessary to good decision-making.

Second, the information-based explanation conveys to the user whether the AI gets useful information beyond their own information from a certain feature. In Figure 3(A), the defendent’s prior count record (priors_count=3) changes the AI prediction the most ( $+0.186$ ). However, Figure 3(D) shows that priors_count = 3 contributes marginally to the $ILIV$ ( $+0.00004$ ). Therefore, ILIV-SHAP conveys that even if priors_count=3 changes the AI prediction significantly, the AI does not necessarily better predict the payoff-related state as a result of it.

7 Discussion and Limitations

We propose a decision-theoretic framework for assigning value to information in human-AI decision-making. Our methods quantify the additional information value of any signals over an agent’s decisions. The three demonstrations show how quantified information value can be used in model selection, empricial evaluation, and explanation design. These demonstrations are just a few of many possible uses cases. For example, alternative explanation approaches could be compared via their information value to human decisions. Information value analysis could drive elicitation of human private signals or decision rules to further improve the pairing. New explanation strategies could integrate visualized information value with conventional depictions of feature importance.

Information value cannot definitively establish that particular signals were used by a human. It is always possible that the human has other private signals offering equivalent information to a feature being analyzed. Our framework cannot account for private unobservable signals even though they might have strong correlation with the payoff-related state and agent deicions. However, quantifying the value of observed information is a tool toward learning about information that may exist beyond a problem definition.

Our framework quantifies the best-attainable performance improvement from integrating signals in decisions. This does not necessarily mean that empirically those signals will lead to agents better performing. However, the motivating idea is that Bayesian decision theory provides a theoretical basis that can be adapted to support informative comparisons to human behavior. For example, if we suspect a human decision-maker uses AI predictions and their own predictions strictly monotonically, we could constrain the Bayesian decision-maker to only make monotonic decisions with AI prediction and their own predictions.

References

Ægisdóttir et al. [2006] Stefanía Ægisdóttir, Michael J White, Paul M Spengler, Alan S Maugherman, Linda A Anderson, Robert S Cook, Cassandra N Nichols, Georgios K Lampropoulos, Blain S Walker, Genna Cohen, et al. The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. The Counseling Psychologist, 34(3):341–382, 2006.
Agrawal et al. [2020] Mayank Agrawal, Joshua C Peterson, and Thomas L Griffiths. Scaling up psychology via scientific regret minimization. Proceedings of the National Academy of Sciences, 117(16):8825–8835, 2020.
Alexey [2020] Dosovitskiy Alexey. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv: 2010.11929, 2020.
Alur et al. [2023] Rohan Alur, Loren Laine, Darrick Li, Manish Raghavan, Devavrat Shah, and Dennis Shung. Auditing for human expertise. Advances in Neural Information Processing Systems, 36:79439–79468, 2023.
Alur et al. [2024a] Rohan Alur, Loren Laine, Darrick Li, Manish Raghavan, Devavrat Shah, and Dennis Shung. Auditing for human expertise. Advances in Neural Information Processing Systems, 36, 2024a.
Alur et al. [2024b] Rohan Alur, Manish Raghavan, and Devavrat Shah. Distinguishing the indistinguishable: Human expertise in algorithmic prediction. arXiv preprint arXiv:2402.00793, 2024b.
Angwin et al. [2022] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. In Ethics of data and analytics, pages 254–264. Auerbach Publications, 2022.
Bansal et al. [2019] Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S Weld, Walter S Lasecki, and Eric Horvitz. Updates in human-ai teams: Understanding and addressing the performance/compatibility tradeoff. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2429–2437, 2019.
Bansal et al. [2021a] Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, and Daniel S Weld. Is the most accurate ai the best teammate? optimizing ai for teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11405–11414, 2021a.
Bansal et al. [2021b] Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. Does the whole exceed its parts? the effect of ai explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, New York, NY, USA, 2021b. Association for Computing Machinery. ISBN 9781450380966. doi: 10.1145/3411764.3445717. URL https://doi.org/10.1145/3411764.3445717.
Bastani et al. [2021] Hamsa Bastani, Osbert Bastani, and Wichinpong Park Sinchaisri. Improving human decision-making with machine learning. arXiv preprint arXiv:2108.08454, 5, 2021.
Ben-Michael et al. [2024] Eli Ben-Michael, D James Greiner, Melody Huang, Kosuke Imai, Zhichao Jiang, and Sooahn Shin. Does ai help humans make better decisions? a statistical evaluation framework for experimental and observational studies. arXiv, 2403:v3, 2024.
Blackwell et al. [1951] David Blackwell et al. Comparison of experiments. In Proceedings of the second Berkeley symposium on mathematical statistics and probability, volume 1, page 26, 1951.
Bo et al. [2021] Zi-Hao Bo, Hui Qiao, Chong Tian, Yuchen Guo, Wuchao Li, Tiantian Liang, Dongxue Li, Dan Liao, Xianchun Zeng, Leilei Mei, et al. Toward human intervention-free clinical diagnosis of intracranial aneurysm via deep neural network. Patterns, 2(2), 2021.
Bondi et al. [2022] Elizabeth Bondi, Raphael Koster, Hannah Sheahan, Martin Chadwick, Yoram Bachrach, Taylan Cemgil, Ulrich Paquet, and Krishnamurthy Dvijotham. Role of human-ai interaction in selective prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 5286–5294, 2022.
Boskemper et al. [2022] Melanie M Boskemper, Megan L Bartlett, and Jason S McCarley. Measuring the efficiency of automation-aided performance in a simulated baggage screening task. Human factors, 64(6):945–961, 2022.
Buçinca et al. [2020] Zana Buçinca, Phoebe Lin, Krzysztof Z. Gajos, and Elena L. Glassman. Proxy tasks and subjective measures can be misleading in evaluating explainable ai systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, IUI ’20, page 454–464, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371186. doi: 10.1145/3377325.3377498. URL https://doi.org/10.1145/3377325.3377498.
Bussone et al. [2015] Adrian Bussone, Simone Stumpf, and Dympna O’Sullivan. The role of explanations on trust and reliance in clinical decision support systems. In 2015 International Conference on Healthcare Informatics, pages 160–169, Oct 2015. doi: 10.1109/ICHI.2015.26.
Chen and Guestrin [2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
Chen and Waggoner [2016] Yiling Chen and Bo Waggoner. Informational substitutes. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 239–247. IEEE, 2016.
Corvelo Benz and Rodriguez [2023] Nina Corvelo Benz and Manuel Rodriguez. Human-aligned calibration for ai-assisted decision making. Advances in Neural Information Processing Systems, 36:14609–14636, 2023.
Dolhansky et al. [2020] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397, 2020.
Dressel and Farid [2018] Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science advances, 4(1):eaao5580, 2018.
Fudenberg et al. [2022] Drew Fudenberg, Jon Kleinberg, Annie Liang, and Sendhil Mullainathan. Measuring the completeness of economic models. Journal of Political Economy, 130(4):956–990, 2022.
Goldberger et al. [2000] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. circulation, 101(23):e215–e220, 2000.
Green and Chen [2019] Ben Green and Yiling Chen. The principles and limits of algorithm-in-the-loop decision making. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–24, 2019.
Groh et al. [2022] Matthew Groh, Ziv Epstein, Chaz Firestone, and Rosalind Picard. Deepfake detection by human crowds, machines, and machine-informed crowds. Proceedings of the National Academy of Sciences, 119(1):e2110013119, 2022.
Grove et al. [2000] William M Grove, David H Zald, Boyd S Lebow, Beth E Snitz, and Chad Nelson. Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12(1):19, 2000.
Guo et al. [2024] Ziyang Guo, Yifan Wu, Jason D Hartline, and Jessica Hullman. A decision theoretic framework for measuring ai reliance. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 221–236, 2024.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Heidenreich et al. [2022] Paul A Heidenreich, Biykem Bozkurt, David Aguilar, Larry A Allen, Joni J Byun, Monica M Colvin, Anita Deswal, Mark H Drazner, Shannon M Dunlay, Linda R Evers, et al. 2022 aha/acc/hfsa guideline for the management of heart failure: a report of the american college of cardiology/american heart association joint committee on clinical practice guidelines. Journal of the American College of Cardiology, 79(17):e263–e421, 2022.
Hofman et al. [2021] Jake M Hofman, Duncan J Watts, Susan Athey, Filiz Garip, Thomas L Griffiths, Jon Kleinberg, Helen Margetts, Sendhil Mullainathan, Matthew J Salganik, Simine Vazire, et al. Integrating explanation and prediction in computational social science. Nature, 595(7866):181–188, 2021.
Hu and Wu [2024] Lunjia Hu and Yifan Wu. Predict to minimize swap regret for all payoff-bounded tasks. arXiv preprint arXiv:2404.13503, 2024.
Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
Irvin et al. [2019] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019.
Jacobs et al. [2021] Maia Jacobs, Melanie F Pradier, Thomas H McCoy Jr, Roy H Perlis, Finale Doshi-Velez, and Krzysztof Z Gajos. How machine-learning recommendations influence clinician treatment selections: the example of antidepressant selection. Translational psychiatry, 11(1):108, 2021.
Johnson et al. [2019] Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019.
Johnson et al. [2023] Alistair EW Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, et al. Mimic-iv, a freely accessible electronic health record dataset. Scientific data, 10(1):1, 2023.
Keswani et al. [2021] Vijay Keswani, Matthew Lease, and Krishnaram Kenthapadi. Towards unbiased and accurate deferral to multiple experts. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 154–165, 2021.
Keswani et al. [2022] Vijay Keswani, Matthew Lease, and Krishnaram Kenthapadi. Designing closed human-in-the-loop deferral pipelines. arXiv preprint arXiv:2202.04718, 2022.
Kleinberg et al. [2023] Bobby Kleinberg, Renato Paes Leme, Jon Schneider, and Yifeng Teng. U-calibration: Forecasting for an unknown agent. In The Thirty Sixth Annual Conference on Learning Theory, pages 5143–5145. PMLR, 2023.
Kleinberg et al. [2015] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. Prediction policy problems. American Economic Review, 105(5):491–495, 2015.
Kleinberg et al. [2018] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2018.
Kononenko [2001] Igor Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in medicine, 23(1):89–109, 2001.
Lai and Tan [2019] Vivian Lai and Chenhao Tan. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the conference on fairness, accountability, and transparency, pages 29–38, 2019.
Lakkaraju et al. [2017] Himabindu Lakkaraju, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. The selective labels problem: Evaluating algorithmic predictions in the presence of unobservables. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 275–284, 2017.
Li et al. [2022] Yingkai Li, Jason D Hartline, Liren Shan, and Yifan Wu. Optimization of scoring rules. In Proceedings of the 23rd ACM Conference on Economics and Computation, pages 988–989, 2022.
Lin et al. [2020] Zhiyuan “Jerry” Lin, Jongbin Jung, Sharad Goel, and Jennifer Skeem. The limits of human predictions of recidivism. Science advances, 6(7):eaaz0652, 2020.
Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
Lundberg and Lee [2017] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
Madras et al. [2018] David Madras, Toni Pitassi, and Richard Zemel. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in neural information processing systems, 31, 2018.
Mclaughlin and Spiess [2023] Bryce Mclaughlin and Jann Spiess. Algorithmic assistance with recommendation-dependent preferences. In Proceedings of the 24th ACM Conference on Economics and Computation, EC ’23, page 991, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701047. doi: 10.1145/3580507.3597775. URL https://doi.org/10.1145/3580507.3597775.
Meehl [1954] Paul E Meehl. Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. University of Minnesota Press, 1954.
Mozannar et al. [2024a] Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. When to show a suggestion? integrating human feedback in ai-assisted programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 10137–10144, 2024a.
Mozannar et al. [2024b] Hussein Mozannar, Jimin Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das, and David Sontag. Effective human-ai teams via learned natural language rules and onboarding. Advances in Neural Information Processing Systems, 36, 2024b.
Mueller et al. [2019] Christian Mueller, Kenneth McDonald, Rudolf A de Boer, Alan Maisel, John GF Cleland, Nikola Kozhuharov, Andrew JS Coats, Marco Metra, Alexandre Mebazaa, Frank Ruschitzka, et al. Heart failure association of the european society of cardiology practical guidance on the use of natriuretic peptide concentrations. European journal of heart failure, 21(6):715–731, 2019.
Mullainathan and Obermeyer [2022] Sendhil Mullainathan and Ziad Obermeyer. Diagnosing physician error: A machine learning approach to low-value health care. The Quarterly Journal of Economics, 137(2):679–727, 2022.
Okati et al. [2021] Nastaran Okati, Abir De, and Manuel Rodriguez. Differentiable learning under triage. Advances in Neural Information Processing Systems, 34:9140–9151, 2021.
Perdomo et al. [2020] Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020.
Raghu et al. [2019] Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil Mullainathan. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220, 2019.
Rajpurkar et al. [2018] Pranav Rajpurkar, Jeremy Irvin, Robyn L Ball, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis P Langlotz, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the chexnext algorithm to practicing radiologists. PLoS medicine, 15(11):e1002686, 2018.
Rambachan [2024] Ashesh Rambachan. Identifying prediction mistakes in observational data. The Quarterly Journal of Economics, page qjae013, 2024.
Rastogi et al. [2023] Charvi Rastogi, Liu Leqi, Kenneth Holstein, and Hoda Heidari. A taxonomy of human and ml strengths in decision-making to investigate human-ml complementarity. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 11, pages 127–139, 2023.
Schemmer et al. [2022] Max Schemmer, Patrick Hemmer, Maximilian Nitsche, Niklas Kühl, and Michael Vössing. A meta-analysis of the utility of explainable artificial intelligence in human-ai decision-making. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 617–626, 2022.
Shapley [1953] Lloyd S Shapley. A value for n-person games. Contributions to the Theory of Games, 2, 1953.
Shreekumar [2025] Advik Shreekumar. X-raying experts: Decomposing predictable mistakes in radiology. 2025.
Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
Tang et al. [2020] Yu-Xing Tang, You-Bao Tang, Yifan Peng, Ke Yan, Mohammadhadi Bagheri, Bernadette A Redd, Catherine J Brandon, Zhiyong Lu, Mei Han, Jing Xiao, et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. NPJ digital medicine, 3(1):70, 2020.
Vaccaro and Waldo [2019] Michelle Vaccaro and Jim Waldo. The effects of mixing machine learning and human judgment. Communications of the ACM, 62(11):104–110, 2019.
Vaccaro et al. [2024] Michelle Vaccaro, Abdullah Almaatouq, and Thomas Malone. When combinations of humans and ai are useful: A systematic review and meta-analysis. Nature Human Behaviour, pages 1–11, 2024.
Wilder et al. [2021] Bryan Wilder, Eric Horvitz, and Ece Kamar. Learning to complement humans. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 1526–1533, 2021.
Wu et al. [2023] Yifan Wu, Ziyang Guo, Michalis Mamakos, Jason Hartline, and Jessica Hullman. The rational agent benchmark for data visualization. IEEE transactions on visualization and computer graphics, 2023.

Appendix A The Combinatorial Nature of the Value of Signals

We model the information value of a single signal over the existing information in agent decisions. When decision-makers are provided with multiple signals, they might use them in combination. Therefore, our definition of information value in Definition 3.2 may overlook the signals’ value in combination with other signals. Signals can be complemented [Chen and Waggoner, 2016], i.e, they contain no information value by themselves but a considerable value when combined with other signals. For example, two signals $\Sigma_{1}$ and $\Sigma_{2}$ ight be uniformly random bits and the state $\omega=\Sigma_{1}\oplus\Sigma_{2}$ , the XOR of $\Sigma_{1}$ and $\Sigma_{2}$ . In this case, neither of the signals offers information value on its own, but knowing both leads to the maximum payoff. To consider this complementation between signals, we use the Shapley value $\phi$ [Shapley, 1953] to interpret the contribution to information gain of each basic signal. The Shapley value calculates the average of the marginal contribution of a basic signal $\mathbf{\Sigma}_{i}$ in every combination of signals.

\phi(\Sigma_{i})=\frac{1}{n}\sum_{V\subseteq\{V_{1},\ldots,V_{n}\}/\{\Sigma_{i}\}}{(n-1)\choose|V|}^{-1}(ACIV(V\cup\{\Sigma_{i}\};D^{b})-ACIV(V;D^{b}))

(1)

The Shapley value suggests how much information value of the basic signal is unexploited by the human decision-maker on average in all combinations.

The following algorithm provides a polynomial-time approximation of the Shapley value of $ACIV$ . Under the assumption of submodularity, it orders the signals the same as the Shapley value.

Algorithm 1 Algorithm 1

V^{*}=\{D^{b}\}

\Phi^{*}=\{\}

3: for

i=1

n

\phi^{\prime}_{j}=ACIV(\Sigma_{j};V^{*})\ for\ each\ j

j^{*}=\arg\max_{j\ s.t.\ \Sigma_{j}\notin V^{*}}\phi^{\prime}_{j}

\phi_{j^{*}}=\max_{j\ s.t.\ \Sigma_{j}\notin V^{*}}\phi^{\prime}_{j}

add\ \Sigma_{j^{*}}\ to\ V^{*}

add\ \phi_{j^{*}}\ to\ \Phi^{*}

9: end for

10:

output\ \phi_{j^{*}}