
Evaluating Treatment Benefit Predictors using Observational Data: Contending with Identification and Confounding Bias

Yuan Xia, Mohsen Sadatsafavi, and Paul Gustafson
Abstract

A treatment benefit predictor (TBP) maps patient characteristics into an estimate of the treatment benefit for that patient, which can support optimizing treatment decisions. However, evaluating the predictive performance of a TBP is challenging, as this often must be conducted in a sample where treatment assignment is not random. After briefly reviewing metrics for evaluating TBPs, we show conceptually how to evaluate a pre-specified TBP using observational data from the target population for a binary treatment decision at a single time point. We exemplify with a particular measure of discrimination (the concentration of benefit index) and a particular measure of calibration (the moderate calibration curve). The population-level definitions of these metrics involve the latent (counterfactual) treatment benefit variable, but we show identification by re-expressing the respective estimands in terms of the distribution of observable data only. We also show that in the absence of full confounding control, bias propagates in a more complex manner than when targeting more commonly encountered estimands (such as the average treatment effect). Our findings reveal that the patterns of bias are often unpredictable, and that general intuition about the direction of bias in causal effect estimates does not hold in the present context.


Keywords: calibration; discrimination; confounding bias; precision medicine.

Introduction

Precision medicine aims to optimize medical care by tailoring treatment decisions to the unique characteristics of each patient. This objective naturally falls at the intersection of predictive analytics and causal inference; the former aims at predicting the outcome of interest given salient patient characteristics, and the latter seeks to answer counterfactual “what if” questions about the outcome. Most progress in clinical prediction models has centered around estimating the risk of adverse outcomes, often guiding treatment decisions by prioritizing high-risk individuals for intervention [1, 2]. However, the success of such guidance relies on an implicit assumption: that those classified as high risk are also those who will benefit the most from the treatment. A more relevant approach for decision-making is to directly predict treatment benefits based on patient characteristics. In this context, treatment benefit refers to the conditional risk reduction for treated individuals given their characteristics, compared to the conditional risk they would face under identical conditions without the treatment. Because this comparison is hypothetical, predicting it requires causal reasoning [3]. Such a prediction is often termed “causal prediction” or “counterfactual prediction” [4]. In causal inference, the treatment benefit we are predicting is the conditional average treatment effect (CATE).

We are interested in a function that maps a patient’s characteristics to their putative treatment benefit, which we call a treatment benefit predictor (TBP). Suppose we have a pre-specified TBP developed from a randomized controlled trial (RCT) in one setting, where it demonstrates strong predictive performance. Before being adopted into patient care for the target population with a potentially distinct setting, this TBP needs to be evaluated (validated) [5, 6]. Transportability and generalizability have been studied across various epidemiology-adjacent disciplines, focusing on concepts, assessment methods, and correction techniques, most of which require access to both original and target population data. This includes applications in estimating causal estimands (e.g. average treatment effect (ATE)) [7], identifying causal relations [8], and predicting outcome risks [9] in the target population. However, we do not aim to assess or correct the transportability and generalizability of the pre-specified TBP, as we only have access to data from the target population. Instead, we focus on the predictive performance metrics, though transportability and generalizability may naturally be implied by these results, e.g. if we find the TBP performs well.

Traditionally, performance metrics for risk prediction are categorized into measures of overall fit (e.g. Brier score), discrimination (e.g. c-statistic), and calibration (e.g. calibration plot) [10, 11]. Various performance measures for TBPs have been formulated by extending these concepts from risk prediction to the treatment-benefit paradigm. The overall fit of a TBP is typically evaluated by the discrepancy between predicted treatment benefits and the estimated CATE, using metrics like the precision in estimating heterogeneous effects [12, 13]. The discriminative ability reflects the degree to which a TBP ranks patients in the correct order of treatment benefit. For instance, the c-for-benefit proposed by van Klaveren et al. is an extension of the c-statistic [14]. However, some possible limitations of the c-for-benefit have been discussed [9, 15, 16]. In particular, it has been shown to lack the properties of a proper scoring rule [17]. Other metrics of discriminatory performance are the concentration of benefit index ($C_b$) [18] and the rank-weighted average treatment effect metrics [19], which are extended from the relative concentration curve (RCC) [20] and the Qini curve [21], respectively. Both metrics quantify the extent to which individuals prioritized by TBP predictions benefit more from treatment than individuals selected at random. Note that RCCs and Qini curves are conceptually related to the Lorenz curve and the Gini index [22]. Calibration evaluates the agreement between the predicted and the actual quantities. Van Calster et al. proposed a hierarchical definition in the context of risk prediction [23]. Researchers have extended ‘weak calibration’ to assess the performance of TBPs via the calibration intercept and slope [9, 16]. They have also extended ‘moderate calibration’ to examine whether the conditional expected treatment benefit given the TBP prediction equals the TBP prediction [24].

None of these extensions are straightforward, given the unavailability of the counterfactual outcome. Another challenge arises when evaluating TBP performance using observational studies, as these introduce potential confounding bias. This is especially relevant when RCTs are unavailable, underpowered, or not representative of the target population. Additionally, for some interventions where equipoise is not established, conducting an RCT might be unethical.

In this study, we show conceptually how to approach evaluating a TBP from observational data, focusing on a single time-point binary treatment decision. Specifically, for several measures of TBP performance, we re-express the metrics, initially defined in terms of the latent treatment benefit variable, in terms of the distribution of observable quantities. To illustrate, we focus on two existing performance metrics for assessing pre-specified TBPs: $C_b$ and the moderate calibration curve. Additionally, we pay close attention to the bias incurred if we do not fully control for confounding. In doing so, we extend the understanding of identification and confounding bias from the context of well-understood estimands such as the ATE to estimands which describe the predictive performance of a pre-specified TBP [25, 26]. Some applied problems have more complex settings than that of a single time-point binary treatment choice. For instance, a recent study extended the definitions and estimators of calibration and discrimination metrics to time-to-event outcomes and time-varying treatment decisions [27]. However, their focus was not on evaluating predictive performance in scenarios where confounding is not fully controlled or on exploring confounding bias in this context.

Notation and Assumptions

Each individual in the target population is described by $(Y^{(0)},Y^{(1)},A,X,Z)$ with joint distribution $\mathbb{P}$. Here $A$ denotes a binary treatment, and $Y^{(a)}$ is the counterfactual outcome that would be observed under treatment $A=a$, where $a\in\{0,1\}$. A larger $Y^{(a)}$ is presumed to be the preferred outcome. The set $X$ consists of $d$ pre-treatment covariates that are observable in routine clinical practice and in observational studies, and are used to predict treatment benefit. In contrast, $Z$ is a distinct set of $p$ covariates that may only be available in the observational study and could be necessary for controlling confounding. Both $X$ and $Z$ may include a mix of continuous and discrete variables. For instance, $X$ could be blood pressure and obesity, available at the point of care, which are used to predict the benefit of statin therapy for cardiovascular diseases. Meanwhile, $Z$ could be socioeconomic status, which is a confounding variable but not often used for predicting benefit from statins.

Individual treatment benefit is quantified as $B=Y^{(1)}-Y^{(0)}$, which is unobservable. For instance, when an individual has received $A=0$, the corresponding outcome $Y^{(1)}$ remains unobserved (and therefore counterfactual). We denote the mean treatment benefit for groups partitioned by both $X$ and $Z$ as $\tau(x,z)=\operatorname{E}[B\mid X=x,Z=z]$. However, in clinical practice, the ideal quantity to inform treatment decisions for an individual with $X=x$ is $\tau_{s}(x)=\operatorname{E}[B\mid X=x]$, which conditions only on $X$. Denote the common causal estimand ATE as $\operatorname{E}[B]$. A TBP predicting the mean treatment benefit based on patient characteristics $X$ is denoted as $h(x)$. This can guide treatment decision-making, for example the care provider offering treatment only to those with $h(x)>0$. We denote the predicted treatment benefit from $h(x)$ as $H:=h(X)$. The best possible TBP is $\tau_{s}(x)$ itself, and the corresponding prediction is $\tau_{s}(X)$.

Figure S1: The directed acyclic graph (DAG) illustrates the causal effect of the exposure ($A$) on the outcome ($Y$). The set of covariates $X$ is used to build the TBP, while the set of covariates $Z$ is not used to build the TBP but needs to be controlled for confounding. The dashed arrow is intended to indicate that $X$ could consist of a mix of confounding variables and predictors.

Observed data from the observational study are realizations of a set of random variables drawn from an underlying probability distribution $\mathbb{P}_{obs}$, denoted as $(Y,A,X,Z)\sim\mathbb{P}_{obs}$, which is a consequence of $\mathbb{P}$. We define $\mu_{a}(x,z)=\operatorname{E}[Y\mid A=a,X=x,Z=z]$ and $\mu_{a}(x)=\operatorname{E}[Y\mid A=a,X=x]$. We denote the propensity score as $e(x,z)=\operatorname{P}(A=1\mid X=x,Z=z)$.

To ensure that $\tau_{s}(X)$ can be identified from the observed data, $\mu_{a}(X,Z)$ must have a causal interpretation. Therefore, the following assumptions are universally required: (1) no interference: between any two individuals, the treatment taken by one does not affect the counterfactual outcomes of the other; (2) consistency: the counterfactual outcome under the observed treatment assignment equals the observed outcome $Y$; (3) conditional exchangeability: the treatment assignment is independent of the counterfactual outcomes, given the set of variables $X\cup Z$; and (4) overlap (positivity): the conditional probability of receiving the active treatment or not is bounded away from 0 and 1, i.e., $0<e(x,z)<1$, for all possible $x$ and $z$. The first two assumptions are known as the stable-unit-treatment-value assumption (SUTVA) [28]. The third one assumes no unmeasured confounders given $X$ and $Z$, as illustrated in Figure S1.

Predictive Performance Metrics

Given a pre-specified TBP $h(x)$, we intend to evaluate its predictive performance on observational data from the target population. Among all possible predictive performance metrics for evaluating a TBP, we exemplify with two metrics by specifying the estimand of each, which helps conceptually in understanding how to measure the predictive performance of $h(\cdot)$ using observational data.

Discrimination

The concentration of benefit index ($C_b$) is a single-value summary of the difference in average treatment benefit between two treatment assignment rules: ‘treat at random’ and ‘treat greater $H$’ [18]. With i.i.d. copies $\{(B_{1},H_{1}),(B_{2},H_{2})\}$ of $(B,H)$, the $C_b$ of $H$ is defined as:

$$C_{b}=1-\frac{\operatorname{E}[B_{1}]}{\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]}, \tag{1}$$

where $I(\cdot)$ is an indicator function. The numerator in Equation (1) operationalizes the strategy of ‘treat at random’ among two patients randomly selected from the population, and the denominator refers to the strategy of ‘treat greater $H$.’ If the two patients have the same $H$, treatment is randomly assigned to one of them. When $\operatorname{E}[B]>0$ and $h(x)$ is at least not worse than ‘treat at random,’ $0\leq C_{b}\leq 1$. A value of $C_b$ can be interpreted as indicating that ‘treat at random’ is associated with a $C_{b}\times 100\%$ reduction in expected benefit compared with ‘treat greater $H$.’ Therefore, the higher the value of $C_b$, the stronger the discriminatory ability of $h(x)$.

We note that $C_b$ can be calculated using the cumulative distribution function (CDF) and the probability mass function (PMF) of $H$, thereby eliminating the need to consider patient pairs when evaluating the expectation in Equation (1). We therefore re-express the $C_b$ of $H$ as

$$C_{b}=1-\frac{\operatorname{E}[B]}{\operatorname{E}[B\eta(H)]},$$

where $\eta(H)=2F_{H}(H)-f_{H}(H)$, $F_{H}(\cdot)$ denotes the CDF of $H$, and $f_{H}(\cdot)$ is the PMF of $H$. This expression clearly indicates that $C_b$ summarizes the predictive performance of $h(\cdot)$ based on its ranking ability. The calculation is simplified by focusing on $B$, $F_{H}$, and $f_{H}$, and it becomes even simpler when $H$ is continuous, as then $\eta(H)=2F_{H}(H)$. Although $H$ is continuous in most applications, it is helpful to derive the general expression to create illustrative examples where $H$ is discrete. As the original publication on $C_b$ did not discuss estimation in the presence of ties, we elaborate on this point and give a derivation of why we are able to work with the CDF of $H$ (and its PMF if necessary) in Appendix S1.
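As a numerical illustration (our own sketch, not part of the original analysis), the $\eta(H)$-based formula can be cross-checked against the pairwise definition in Equation (1) on a toy two-point distribution; the function names and the averaged tie-breaking are our choices.

```python
import numpy as np

def c_b_via_eta(tau, H):
    """C_b = 1 - E[B] / E[B * eta(H)] with eta(H) = 2*F_H(H) - f_H(H)."""
    tau, H = np.asarray(tau, float), np.asarray(H, float)
    F = (H[None, :] <= H[:, None]).mean(axis=1)  # empirical CDF F_H(H_i)
    f = (H[None, :] == H[:, None]).mean(axis=1)  # empirical PMF f_H(H_i)
    eta = 2 * F - f
    return 1 - tau.mean() / (tau * eta).mean()

def c_b_via_pairs(tau, H):
    """Pairwise form of Equation (1); ties broken by a fair coin (averaged out)."""
    tau, H = np.asarray(tau, float), np.asarray(H, float)
    n = len(H)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if H[i] > H[j]:
                total += tau[i]                   # patient 1 has the larger prediction
            elif H[i] < H[j]:
                total += tau[j]                   # patient 2 has the larger prediction
            else:
                total += 0.5 * (tau[i] + tau[j])  # tie: treat either at random
    return 1 - tau.mean() / (total / n**2)

# Two equally likely patient types with benefits 0.1 and 0.3, correctly ranked.
print(c_b_via_eta([0.1, 0.3], [0.1, 0.3]),
      c_b_via_pairs([0.1, 0.3], [0.1, 0.3]))  # both give C_b = 0.2
```

The two routes agree because $\eta(H)$ encodes exactly the pairwise ranking comparisons, including ties.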

Calibration

The concept of moderate calibration for risk prediction is that the expected value of the outcome among individuals with the same predicted risk is equal to the predicted risk [23]. Similarly, in treatment benefit prediction, a TBP $h(x)$ can be considered moderately calibrated if, for any $h$,

$$\operatorname{E}[B\mid H=h]=h,$$

where $\operatorname{E}[B\mid H=h]$ is the moderate calibration function. The equality above states that the average treatment benefit among all patients with predicted treatment benefit $H=h$ equals $h$. For example, if $h(x)$ is moderately calibrated and predicts a group of individuals to have $H=0.5$, we should expect that the average treatment benefit within the group is also $0.5$. Furthermore, $h(x)$ is strongly calibrated if $\tau_{s}(X)=H$. Calibration of TBPs can also be visualized in a calibration plot [29]. The calibration plot compares $\operatorname{E}[B\mid H=h]$ against $h$, with a moderately calibrated TBP showing points on the diagonal identity line.
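To make the moderate-versus-strong distinction concrete, here is a small sketch of ours: a hypothetical TBP that pools two covariate values is moderately calibrated (each predicted value matches the group-average benefit) without being strongly calibrated.

```python
import numpy as np

# Three equally likely covariate values with true conditional benefits tau_s(x).
tau_s = np.array([0.10, 0.30, 0.20])

# Hypothetical TBP: predicts 0.10 for the first value and pools the other two
# into a single predicted value of 0.25 = (0.30 + 0.20) / 2.
h = np.array([0.10, 0.25, 0.25])

# Moderate calibration: within every group sharing a predicted value,
# the average benefit equals the prediction.
for value in np.unique(h):
    assert np.isclose(tau_s[h == value].mean(), value)

# But not strong calibration: tau_s(X) != H for the pooled individuals.
assert not np.allclose(tau_s, h)
print("moderately, but not strongly, calibrated")
```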

Estimands for metrics

These predictive performance metrics for TBPs involve the latent variable $B$. To evaluate a pre-specified $h(x)$ using observational data, we need to specify estimands for each metric so that they can be identified using observable data. We adjust for confounding by expressing the estimands in terms of $\tau(X,Z)$, even though the TBP is solely a function of $X$. To compute the $C_b$ of $h(x)$, we re-express $C_b$ as

$$C_{b}=1-\frac{\operatorname{E}[\tau(X,Z)]}{\operatorname{E}[\tau(X,Z)\eta(H)]}.$$

Similarly, to evaluate the moderate calibration of any $h(x)$, we need to determine $\operatorname{E}[B\mid H]$ (i.e., the moderate calibration curve), which can be obtained by taking the average of $\tau(X,Z)$ conditional on $H=h$:

$$\operatorname{E}[B\mid H=h]=\operatorname{E}[\tau(X,Z)\mid H=h].$$

Note that $\tau(X,Z)$, which plays a vital role in the determination of both $C_b$ and $\operatorname{E}[B\mid H]$, is identifiable through the expression $\tau(X,Z)=\mu_{1}(X,Z)-\mu_{0}(X,Z)$. Alternatively, it can be expressed as $\tau(X,Z)=\operatorname{E}\left[\frac{Y(A-e(X,Z))}{e(X,Z)(1-e(X,Z))}\,\middle|\,X,Z\right]$. These two expressions are equivalent at the population level. However, variations may emerge when considering specific finite-sample estimating techniques associated with each expression. We will return to the finite-sample estimation of $C_b$ and the calibration curve in the discussion section.
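The equivalence of the two identification formulas can be checked by simulation. The sketch below uses a toy data-generating process of our own (not one of the paper's synthetic populations) and compares the outcome-regression and weighting expressions within a single covariate cell, where the true value is known.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

# Toy data-generating process of our own: binary X and Z, known mu's and e.
X = rng.binomial(1, 0.5, n)
Z = rng.binomial(1, 0.4, n)
e = 0.2 + 0.5 * Z                      # propensity score e(X, Z), here driven by Z
A = rng.binomial(1, e)
mu0 = 0.3 + 0.2 * X - 0.1 * Z          # E[Y | A = 0, X, Z]
mu1 = 0.5 + 0.1 * X + 0.2 * Z          # E[Y | A = 1, X, Z]
Y = np.where(A == 1, mu1, mu0) + rng.normal(0, 1, n)

# Compare the two identification formulas for tau(x, z) in the cell (x, z) = (1, 0),
# where the true value is mu1 - mu0 = 0.6 - 0.5 = 0.1.
cell = (X == 1) & (Z == 0)
tau_or = Y[cell & (A == 1)].mean() - Y[cell & (A == 0)].mean()  # outcome regression
tau_ipw = (Y * (A - e) / (e * (1 - e)))[cell].mean()            # weighting form
print(tau_or, tau_ipw)
```

Both sample versions converge to the same population quantity; their finite-sample behavior differs, which is the point made in the text.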

Confounding Bias of Performance Metrics

If we treat the observational data as if they arose from an RCT, or if we do not sufficiently control for confounding, confounding bias may emerge. It is essential to investigate this bias and understand how the lack of full control for confounding might affect the accuracy of our TBP performance evaluations. We focus on the confounding bias that occurs when $X$ alone is insufficient to control for confounding, and denote the bias as a function of $X$, which is

$$\text{bias}(X)=\mu_{1}(X)-\mu_{0}(X)-\tau_{s}(X),$$

where $\tau_{s}(X)$ is obtained by taking the expectation of $\tau(X,Z)$ over $Z$. To illustrate the propagation of $\text{bias}(X)$ to performance measures, we denote the inaccurate $C_b$ and $\operatorname{E}[B\mid H]$, calculated without controlling for $Z$, as $\tilde{C}_{b}$ and $\tilde{\operatorname{E}}[B\mid H]$, respectively.

The bias of $C_b$ and the bias function of the moderate calibration curve, $\operatorname{E}[B\mid H=h]$, will be influenced by $\text{bias}(X)$. For $C_b$, this bias affects the identification of both $\operatorname{E}[B]$ and $\operatorname{E}[B\eta(H)]$. The discrepancy from the actual $\operatorname{E}[B]$ is $\operatorname{E}[\text{bias}(X)]$, while the deviation from $\operatorname{E}[B\eta(H)]$ is $\operatorname{E}[\text{bias}(X)\eta(H)]$. However, expressing the bias of $C_b$ is complex as it involves the difference between two ratios, which takes the form:

$$\tilde{C}_{b}-C_{b}=\frac{\operatorname{E}[B]\operatorname{E}[\text{bias}(X)\eta(H)]-\operatorname{E}[B\eta(H)]\operatorname{E}[\text{bias}(X)]}{\operatorname{E}[B\eta(H)]\left(\operatorname{E}[B\eta(H)]+\operatorname{E}[\text{bias}(X)\eta(H)]\right)}. \tag{2}$$

This bias depends on $\text{bias}(X)$, $B$, and $\eta(H)$.
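Equation (2) can be verified numerically on a toy example of ours: with a two-point covariate distribution and any chosen $\text{bias}(x)$ function, the difference between the directly computed $\tilde{C}_{b}$ and $C_b$ matches the closed form.

```python
import numpy as np

# Toy two-point population (ours): X uniform on {0, 1}, distinct predictions H.
p = np.array([0.5, 0.5])          # P(X = x)
tau_s = np.array([0.10, 0.30])    # true tau_s(x)
bias = np.array([0.05, 0.02])     # bias(x) from ignoring Z
# H takes values 0.10 < 0.30, so F_H = (0.5, 1.0) and f_H = (0.5, 0.5).
eta = 2 * np.array([0.5, 1.0]) - np.array([0.5, 0.5])   # eta(H) = 2 F_H - f_H

EB, EBeta = p @ tau_s, p @ (tau_s * eta)
Ebias, Ebias_eta = p @ bias, p @ (bias * eta)

# Direct computation of C_b with and without confounding control.
cb = 1 - EB / EBeta
cb_tilde = 1 - (EB + Ebias) / (EBeta + Ebias_eta)

# Equation (2) reproduces the deviation exactly.
eq2 = (EB * Ebias_eta - EBeta * Ebias) / (EBeta * (EBeta + Ebias_eta))
print(cb_tilde - cb, eq2)  # both roughly -0.0468
```

Note that here a strictly positive $\text{bias}(x)$ produces a negative deviation in $C_b$, illustrating that the sign of the metric's bias need not follow the sign of $\text{bias}(X)$.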

For $\operatorname{E}[B\mid H=h]$, the deviation from the accurate assessment can be expressed as

$$\tilde{\operatorname{E}}[B\mid H=h]-\operatorname{E}[B\mid H=h]=\operatorname{E}[\text{bias}(X)\mid H=h], \tag{3}$$

which is a function of $h$. It depends on $\text{bias}(X)$ and the association between $H$ and $X$.

According to Equations (2) and (3), the deviations in $C_b$ and the moderate calibration curve may be zero even with non-zero $\text{bias}(X)$. In particular, zero deviation in $C_b$ occurs when the numerator in Equation (2) equals zero, and zero deviation in the moderate calibration curve occurs at values of $h$ for which $\operatorname{E}[\text{bias}(X)\mid H=h]=0$. In the next section, we further investigate these biases in several illustrative examples to demonstrate how $\text{bias}(X)$ influences and propagates to the evaluation measures.

Evaluating TBP in Synthetic Populations

Synthetic Target Populations

To illustrate the evaluation of pre-specified TBPs and to explore the impact of confounding bias, we consider three synthetic target populations, described by the DAG in Figure S1. The synthetic populations were constructed around a motivating example: evaluating pre-specified TBPs that predict the benefit of bronchodilator therapy for lung function, measured by Forced Expiratory Volume in one second ($\text{FEV}_1$), using observational data from the target population. The TBPs considered are functions of obesity and symptom severity, variables that are often available in routine primary care and can be used to make treatment decisions. However, socioeconomic status, which is not usually available in RCTs, needs to be adjusted for when evaluating the predictive performance of the TBP using observational data.

In synthetic population 1, obesity and symptom severity are binary variables, denoted as $X=\{X_{1},X_{2}\}$, and socioeconomic status is also binary, denoted as $Z$. These covariates are dependent, with joint distribution $p=(p_{111},p_{110},p_{101},p_{100},p_{011},p_{010},p_{001},p_{000})=(0.181,0.100,0.035,0.148,0.174,0.087,0.121,0.153)$. The exposure ($A$) is a binary indicator for bronchodilator therapy, and the outcome, $\text{FEV}_1$ improvement, is measured by a binary indicator ($Y$), leading to $B\in\{-1,0,1\}$. The probability of receiving bronchodilator therapy is $e(Z)=0.120+0.762Z$. The probability of having $\text{FEV}_1$ improvement is affected by the covariates and the exposure linearly: $\mu_{0}(X,Z)=0.629+0.143X_{1}-0.479X_{2}-0.058Z$ and $\mu_{1}(X,Z)=0.335+0.304X_{1}-0.334X_{2}+0.314Z$. Then, $\tau(X,Z)=\mu_{1}(X,Z)-\mu_{0}(X,Z)$.
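Because population 1 is fully discrete, population-level quantities such as the ATE can be computed exactly by enumerating the eight covariate cells. The sketch below (ours) assumes the subscript convention $p_{x_1 x_2 z}$, which is our reading of the notation above.

```python
from itertools import product

# Joint distribution of (X1, X2, Z); subscripts read as p_{x1 x2 z} (our assumption).
p = {(1, 1, 1): 0.181, (1, 1, 0): 0.100, (1, 0, 1): 0.035, (1, 0, 0): 0.148,
     (0, 1, 1): 0.174, (0, 1, 0): 0.087, (0, 0, 1): 0.121, (0, 0, 0): 0.153}

def tau(x1, x2, z):
    """tau(X, Z) = mu1(X, Z) - mu0(X, Z) from the linear specifications."""
    mu0 = 0.629 + 0.143 * x1 - 0.479 * x2 - 0.058 * z
    mu1 = 0.335 + 0.304 * x1 - 0.334 * x2 + 0.314 * z
    return mu1 - mu0

# ATE = E[tau(X, Z)], exactly, by enumerating the eight covariate cells.
ate = sum(p[cell] * tau(*cell) for cell in product((0, 1), repeat=3))
print(round(ate, 4))  # about 0.0497
```

The same enumeration, applied to $\tau(X,Z)\eta(H)$ and within $H$-groups, yields closed-form values of $C_b$ and the calibration curve for this population.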

In synthetic population 2, obesity, symptom severity, and socioeconomic status are independent continuous variables, each following a uniform distribution on the interval $[0,1]$. The exposure is binary, and the probability of receiving bronchodilator therapy is directly influenced by socioeconomic status, with $e(Z)=Z$. The outcome $\text{FEV}_1$ is a continuous variable with non-linear relationships to the exposure and covariates. These are given by $\mu_{0}(X,Z)=-0.5\tau_{s}(X)+b(X,Z)$ and $\mu_{1}(X,Z)=0.5\tau_{s}(X)+b(X,Z)$, where $\tau_{s}(X)=\max(X_{1},X_{2})$ and the base response function is $b(X,Z)=\max(Z,X_{2})+0.1X_{1}$. A similar setup has been used by Foster and Syrgkanis [30]. From these population specifications, we can derive that $\tau(X,Z)=\tau_{s}(X)$, which shows that socioeconomic status contributes to explaining both the outcome and therapy assignment but is independent of treatment benefit conditional on $X$.

Synthetic population 3 differs slightly from population 1 in that $X$ and $Z$ are single binary variables, with joint distribution $p=(p_{11},p_{10},p_{01},p_{00})=(0.3,0.1,0.4,0.2)$, and inverse logit, rather than linear, functions are used: $\mu_{0}(X,Z)=\text{expit}(X+\alpha Z)$, $\mu_{1}(X,Z)=\text{expit}(0.1+X+\alpha Z)$, and $e(Z)=\text{expit}(0.35+\beta Z)$. Both populations 1 and 3 quantify the strength of confounding through parameters, but we use population 3 to explore the extent to which the strength of confounding affects the bias of $C_b$ and $\operatorname{E}[B\mid H=h]$.

All proposed populations are sufficiently simple to provide closed-form descriptions of the predictive performance of TBPs using $C_b$ and $\operatorname{E}[B\mid H=h]$, as well as of how confounding bias propagates to the bias of $C_b$ and $\operatorname{E}[B\mid H=h]$ for some pre-specified TBPs. Next, we compare evaluation results for TBPs with and without controlling for $Z$ using populations 1 and 2, and illustrate how $\text{bias}(X)$ and the bias of the metrics are influenced by the strength of confounding in population 3.

Evaluation Results

We formulate four pre-specified TBPs to be evaluated in synthetic population 1, which can be expressed in the form

$$h(x_{1},x_{2})=c_{0}+c_{1}x_{1}+c_{2}x_{2}+c_{3}x_{1}x_{2},$$

where $c=\{c_{0},c_{1},c_{2},c_{3}\}$ are coefficients. In particular, $h_{1}(x_{1},x_{2})$ is the mean of the covariates, with $c=\{0,0.5,0.5,0\}$, and $h_{2}(x_{1},x_{2})$ is designed to be moderately calibrated by carefully choosing the coefficients. Let $h_{3}(x_{1},x_{2}):=\tau_{s}(x)$, which is strongly calibrated, and $h_{4}(x_{1},x_{2}):=\mu_{1}(x)-\mu_{0}(x)$. (See Appendix S2 for the detailed definitions of the coefficients for these TBPs.)

Figure S2: The RCCs and $C_b$ values (left column) and moderate calibration plots (right column) for the four TBPs when $\beta_{1}=0.762$ in the first synthetic population. The blue dotted curves refer to $C_b$ and $\operatorname{E}[B\mid H]$, and the red dashed curves refer to $\tilde{C}_{b}$ and $\tilde{\operatorname{E}}[B\mid H]$.

The evaluation results are illustrated in Figure S2, where the four plots on the left show the RCCs and $C_b$ values for the four TBPs, with and without bias. Note that $h_{1}$ and $h_{2}$ have identical discriminatory ability, as do $h_{3}$ and $h_{4}$. This is because the CDFs of $H_{1}$ and $H_{2}$ are the same, as are those of $H_{3}$ and $H_{4}$, so the corresponding TBPs rank patients identically. Note that $h_{3}$ and $h_{4}$ yield a slightly larger $C_b$ compared to $h_{1}$ and $h_{2}$. Consequently, $H_{3}$ and $H_{4}$ result in more effective treatment assignment rules, leading to larger average treatment benefits. Although the values of $\tilde{C}_{b}$ suggest that $H_{3}$ and $H_{4}$ are more effective treatment assignment rules, they underestimate $C_b$ for all TBPs.

The four plots on the right in Figure S2 display the moderate calibration curves of all TBPs with and without bias. We see that $h_{2}$ and $h_{3}$ are moderately calibrated, aligning closely with the 45-degree line. Figure S2 highlights that the bias of the moderate calibration curve is positive for all $H$ values of all four TBPs, due to $\text{bias}(X)>0$. Consequently, the failure to fully control for confounding variables results in an inaccurate calibration assessment. The moderate calibration curve of $h_{4}$ suggests that if a TBP is developed in the target population without controlling for the confounder $Z$, it can give the misleading impression of being moderately calibrated. When $\beta_{1}=0$, all blue dotted curves align with the red dashed curves for both performance metrics because $Z$ is no longer a confounding variable. (See Appendix S2 for the corresponding RCCs, $C_b$ values, and calibration plots.) These findings highlight the importance of fully controlling for confounding variables; ignoring them can produce misleading patterns.

Figure S3: The RCC and $C_b$ value (left) and moderate calibration plot (right) for the second synthetic population.

For synthetic population 2, we have $h(X)=X_{1}+X_{2}$, where $H$ follows a triangular distribution with lower limit $a=0$, upper limit $b=2$, and mode $c=1$. Using knowledge of the population, we compute and then compare the two performance metrics, with and without bias, for $h(\cdot)$ in closed form. (See Appendix S3 for calculation details.) In this setup, $C_{b}=0.149$, indicating that assigning treatment to the patient with the larger $H$ value is slightly better than random treatment assignment. For the calibration curve, the average treatment benefit conditional on $H=h$ is $3h/4$ when $h\in(0,1]$ and $(1-h^{2}/4)(2-h)^{-1}$ when $h\in(1,2)$.
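The closed-form value of $C_b$ can be reproduced numerically. The sketch below (ours, not from the appendix) approximates the required expectations with a midpoint-rule grid over $(X_1,X_2)$, using $\tau_{s}(X)=\max(X_1,X_2)$ and the triangular CDF of $H$; since $\tau(X,Z)=\tau_{s}(X)$ in this population, $Z$ can be ignored.

```python
import numpy as np

# Midpoint-rule grid over (X1, X2) in [0, 1]^2 (Z drops out: tau = tau_s here).
m = 1000
g = (np.arange(m) + 0.5) / m
X1, X2 = np.meshgrid(g, g)

tau = np.maximum(X1, X2)   # tau_s(X) = max(X1, X2)
H = X1 + X2                # prediction, triangular on [0, 2] with mode 1

# CDF of the triangular(0, 2, mode 1) distribution; H is continuous, so eta = 2 F_H.
F = np.where(H <= 1, H**2 / 2, 1 - (2 - H)**2 / 2)
cb = 1 - tau.mean() / (tau * 2 * F).mean()
print(round(cb, 3))  # 0.149
```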

Figure S4: Heatmap of performance metrics. Plots on the left show the bias of $C_b$ when $p=(0.3,0.1,0.4,0.2)$ (row 1) and when $p=(0.300,0.008,0.194,0.498)$ (row 2). Plots on the right show the bias of the moderate calibration curve when $p=(0.3,0.1,0.4,0.2)$.

However, confounding bias causes a deviation from the actual $C_b$ and the moderate calibration curve, as depicted in Figure S3. The bias reduces the area between the independence line and the RCC by roughly half. It causes an overestimation of $\operatorname{E}[B]$ and an overestimation of $2\operatorname{E}[BF_{H}(H)]$. Consequently, $\tilde{C}_{b}=0.0773$, which is lower than $C_b$. Furthermore, in the calibration plot, the red curve deviates from the blue curve as the value of $H$ approaches zero. This deviation diminishes to zero as $H$ approaches two.

In synthetic population 3, the strength of confounding can be adjusted through the values of $\alpha$ and $\beta$. In Figure S4, the two plots on the right show the relationship between $\text{bias}(X)$ and varying strengths of confounding, where $\alpha$ and $\beta$ vary within the interval $(-5,5)$. For simplicity, we focus on the special case where $h(X):=\tau_{s}(X)$, and use the fact that the bias of the moderate calibration curve for $H$ equals $\text{bias}(X)$ when $h$ is a one-to-one mapping. Therefore, these two plots also display the bias of the moderate calibration curve when $H=h(1)$ and $H=h(0)$. The two plots on the left show the bias of $C_b$ under two different settings, highlighting the unpredictable relationship between the bias and the strength of confounding. However, the pattern of bias in the moderate calibration curve is more interpretable than that of $C_b$.

Discussion

In clinical settings, TBPs may offer valuable guidance for physicians and patients in making informed treatment decisions. Before being transported to a new population, these TBPs need to be evaluated in that population. Observational data, where treatment assignment is not random, might be the only opportunity for such evaluation. Consequently, addressing confounding bias is crucial when assessing TBPs based on observational data from the target population.

This study demonstrated conceptually how to evaluate pre-specified TBPs using observational data. We delved into two specific measures, $C_b$ and the moderate calibration curve, which are estimands defined in terms of the latent benefit variable $B$. We showed the identification of $C_b$ and $\operatorname{E}[B\mid H]$ by re-expressing the respective estimands in terms of the distribution of $(Y,A,X,Z)$. The absence of full confounding control leads to bias in identifying $C_b$ and $\operatorname{E}[B\mid H]$. The behavior of these biases is more complex than when targeting commonly encountered estimands such as the ATE. We showed how the bias associated with $\tau_{s}(X)$, denoted $\text{bias}(X)$, propagates to the bias associated with $C_b$ and $\operatorname{E}[B\mid H]$. We demonstrated the behavior of these biases in two synthetic populations with $\text{bias}(X)>0$. A positive $\text{bias}(X)$ function will always result in $\tilde{\operatorname{E}}[B\mid H]>\operatorname{E}[B\mid H]$; nevertheless, it might result in $\tilde{C}_{b}<C_{b}$ for some choices of $h(x)$. In the two synthetic populations, we showed that positive $\text{bias}(X)$ functions lead to overestimation of $\operatorname{E}[B]$ and $\operatorname{E}[B\eta(H)]$ but underestimation of $C_b$ for our choices of $h(x)$. However, it is also possible that the overestimation of $\operatorname{E}[B\eta(H)]$ exceeds that of $\operatorname{E}[B]$, leading to an overestimation of $C_b$. When the $\text{bias}(X)$ function takes both positive and negative values, the behavior of the bias in identifying $C_b$ and $\operatorname{E}[B\mid H]$ becomes more unpredictable, making it difficult to intuit the direction of bias.

While we focused on the population-level performance of TBPs and used synthetic populations, several findings from this study are relevant for practice. The primary objective when evaluating a pre-determined TBP is to estimate \tau(x,z), for all possible x and z, from sample data. Estimating \tau(x,z) is a classical topic in the causal inference literature. For instance, it can be estimated through outcome regression, inverse probability weighting [31], or a combination of both, as in the doubly robust method [32]. Additionally, Bayesian additive regression trees (BART) have been shown to offer several advantages in estimating the CATE [12]. Building on the work presented here, we have initiated a new project that explores BART for TBP assessment using a real-world observational study.
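For readers who want a concrete starting point, the sketch below illustrates the outcome-regression route on synthetic data: it fits separate linear models in the treated and untreated groups and differences the fitted outcomes. Everything here (the variable names, the linear specification, and the randomized treatment assignment) is an illustrative assumption, not the design of the study above.

```python
import numpy as np

def outcome_regression_tau(X, a, y):
    """T-learner sketch: regress y on covariates separately within each
    treatment arm, then difference the two fitted outcome surfaces."""
    X1 = np.column_stack([np.ones(len(y)), X])  # design matrix with intercept
    beta_t, *_ = np.linalg.lstsq(X1[a == 1], y[a == 1], rcond=None)
    beta_c, *_ = np.linalg.lstsq(X1[a == 0], y[a == 0], rcond=None)
    return X1 @ (beta_t - beta_c)  # estimated tau at each row's covariates

# Synthetic data with randomized treatment and true tau(x) = 0.5 + 0.3 x
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(0, 1, n)
a = rng.integers(0, 2, n)
y = 1.0 + 0.2 * x + a * (0.5 + 0.3 * x) + rng.normal(0.0, 0.1, n)
tau_hat = outcome_regression_tau(x.reshape(-1, 1), a, y)
```

Under confounded treatment assignment, the same skeleton applies but the arm-specific regressions must condition on all confounders (here there are none, by construction).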

Even with a scheme for estimating \tau(\cdot), further effort is needed to estimate specific performance metrics, such as C_{b} and the calibration curve. For instance, one possible estimator of C_{b} uses sample averages to estimate the expectations in the estimand, and estimates the CDF of H (and, where needed, its mass function f_{H}(h)) through the empirical distribution of H. For the calibration curve, when H is discrete, the sample average within groups sharing the same value of H can be used to estimate the conditional expectation of the estimated \tau(X,Z) given H=h. For continuous H, one approach is to discretize H into equally sized bins and use the sample average within each bin to estimate the conditional expectation [24]. Moderate calibration can also be assessed non-parametrically, without grouping or smoothing factors [33]. To our knowledge, this approach, based on the cumulative sum of prediction errors, has yet to be applied to TBP assessment. Overall, exploring and tackling the challenges associated with estimating different performance measures is worthwhile.
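A minimal sketch of this plug-in idea (hypothetical names; it assumes the predicted benefits are already in hand and that H is continuous with no ties, so the f_{H} term in \eta can be ignored):

```python
import numpy as np

def cb_plugin(b_hat, h):
    """Plug-in estimate of C_b = 1 - E[B] / (2 E[B F_H(H)]):
    sample averages replace the expectations, and the empirical CDF
    (rank / n) replaces F_H. Assumes continuous H with no ties."""
    n = len(h)
    ranks = np.argsort(np.argsort(h)) + 1          # ranks 1, ..., n
    F_hat = ranks / n                              # empirical CDF at each h
    return 1.0 - b_hat.mean() / (2.0 * (b_hat * F_hat).mean())

# Check against a case with a known value: B = max(X1, X2), H = X1 + X2
# with X1, X2 ~ U(0, 1) gives C_b = 7/47 (derived in Appendix S3).
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 1, (2, 200_000))
cb = cb_plugin(np.maximum(x1, x2), x1 + x2)
```

In practice b_hat would itself be an estimate of \tau(X,Z), so the sampling uncertainty of that first-stage fit would need to be propagated as well.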

In summary, we presented a conceptual framework for evaluating the predictive performance of TBPs using observational data. We demonstrated that in the absence of complete confounding control, the unpredictable nature of bias can significantly impact the reliability of TBP assessments. Moreover, this framework can be readily applied to practical settings, enabling the estimation of performance metrics such as CbC_{b} and calibration curves.

References

  • 1. Sperrin M, Martin GP, Pate A, Van Staa T, Peek N, Buchan I. Using marginal structural models to adjust for treatment drop-in when developing clinical prediction models. Statistics in medicine. 2018;37(28):4142–4154.
  • 2. Sperrin M, Diaz-Ordaz K, Pajouheshnia R. Invited commentary: treatment drop-in—making the case for causal prediction. American journal of epidemiology. 2021;190(10):2015–2018.
  • 3. Geloven N, Swanson SA, Ramspek CL, et al. Prediction meets causal inference: the role of treatment in clinical prediction models. European journal of epidemiology. 2020;35:619–630.
  • 4. Prosperi M, Guo Y, Sperrin M, et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nature Machine Intelligence. 2020;2(7):369–375.
  • 5. Roi-Teeuw HM, Royen FS, Hond A, et al. Don’t be misled: Three misconceptions about external validation of clinical prediction models. Journal of Clinical Epidemiology. 2024:111387.
  • 6. Riley RD, Archer L, Snell KI, et al. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ. 2024;384.
  • 7. Degtiar I, Rose S. A review of generalizability and transportability. Annual Review of Statistics and Its Application. 2023;10(1):501–524.
  • 8. Pearl J, Bareinboim E. Transportability of causal and statistical relations: A formal approach. in Proceedings of the AAAI Conference on Artificial Intelligence;25:247–254 2011.
  • 9. Efthimiou O, Hoogland J, Debray TP, et al. Measuring the performance of prediction models to personalize treatment choice. Statistics in medicine. 2023;42(8):1188–1206.
  • 10. Riley RD, Snell KI, Moons KG, Debray TP. Fundamental statistical methods for prognosis research. in Prognosis Research in Health Carech. 3, :37–68Oxford University Press 2019.
  • 11. Steyerberg EW. Clinical Prediction Models. Springer International Publishing 2019.
  • 12. Hill JL. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics. 2011;20(1):217–240.
  • 13. Schuler A, Baiocchi M, Tibshirani R, Shah N. A comparison of methods for model selection when estimating individual treatment effects. arXiv preprint arXiv:1804.05146. 2018.
  • 14. Klaveren D, Steyerberg EW, Serruys PW, Kent DM. The proposed ‘concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects. Journal of Clinical Epidemiology. 2018;94:59–68.
  • 15. Klaveren D, Maas CC, Kent DM. Measuring the performance of prediction models to personalize treatment choice: Defining observed and predicted pairwise treatment effects. Statistics in Medicine. 2023;42(24):4514–4515.
  • 16. Hoogland J, Efthimiou O, Nguyen TL, Debray TP. Evaluating individualized treatment effect predictions: A model-based perspective on discrimination and calibration assessment. Statistics in medicine. 2024.
  • 17. Xia Y, Gustafson P, Sadatsafavi M. Methodological concerns about “concordance-statistic for benefit” as a measure of discrimination in predicting treatment benefit. Diagnostic and Prognostic Research. 2023;7(1):10.
  • 18. Sadatsafavi M, Mansournia MA, Gustafson P. A threshold-free summary index for quantifying the capacity of covariates to yield efficient treatment rules. Statistics in Medicine. 2020;39:1362–1373.
  • 19. Yadlowsky S, Fleming S, Shah N, Brunskill E, Wager S. Evaluating treatment prioritization rules via rank-weighted average treatment effects. Journal of the American Statistical Association. 2024(just-accepted):1–25.
  • 20. Yitzhaki S, Olkin I. Concentration indices and concentration curves. Lecture Notes-Monograph Series. 1991:380–392.
  • 21. Radcliffe N. Using control groups to target on predicted lift: Building and assessing uplift model. 2007.
  • 22. Gastwirth JL. The estimation of the Lorenz curve and Gini index. The review of economics and statistics. 1972:306–316.
  • 23. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. Journal of Clinical Epidemiology. 2016;74:167–176.
  • 24. Xu Y, Yadlowsky S. Calibration error for heterogeneous treatment effects. in International Conference on Artificial Intelligence and Statistics:9280–9303PMLR 2022.
  • 25. Imbens GW. Sensitivity to exogeneity assumptions in program evaluation. American Economic Review. 2003;93(2):126–132.
  • 26. Veitch V, Zaveri A. Sense and sensitivity analysis: Simple post-hoc analysis of bias due to unobserved confounding. Advances in neural information processing systems. 2020;33:10999–11009.
  • 27. Keogh RH, Van Geloven N. Prediction under interventions: evaluation of counterfactual performance using longitudinal observational data. Epidemiology. 2024;35(3):329–339.
  • 28. Rubin DB. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American statistical association. 1980;75(371):591–593.
  • 29. Van Calster B, McLernon DJ, Van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC medicine. 2019;17(1):230.
  • 30. Foster DJ, Syrgkanis V. Orthogonal statistical learning. The Annals of Statistics. 2023;51(3):879–908.
  • 31. Athey S, Imbens GW. Machine learning methods for estimating heterogeneous causal effects. stat. 2015;1050(5):1–26.
  • 32. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–973.
  • 33. Sadatsafavi M, Petkau J. Non-parametric inference on calibration of predicted risks. Statistics in Medicine. 2024.

Appendix S1: Propositions for CbC_{b} Calculation

Proposition 1.

Given two i.i.d. copies \{(B_{1},H_{1}),(B_{2},H_{2})\}, the expected benefit under the ‘treat greater H’ strategy can be expressed as

\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]=\operatorname{E}[B\eta(H)],

where I(\cdot) is an indicator function and \eta(H)=2F_{H}(H)-f_{H}(H). Note that both F_{H}(H) and f_{H}(H) are random variables, where F_{H}(\cdot) is the CDF of H and f_{H}(\cdot) is the probability mass function of H when H is discrete; for continuous H, ties occur with probability zero, so \eta(H)=2F_{H}(H).

Proof.

We first show that \eta(H)=2F_{H}(H) for continuous H. Without loss of generality, we assume that B is continuous.

E[B1I(H1H2)+B2I(H1<H2)]\displaystyle\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]
=2E[E[B1I(H1H2)H1,H2]](H is continuous)\displaystyle=2\operatorname{E}\big{[}\operatorname{E}[B_{1}I(H_{1}\geq H_{2})\mid H_{1},H_{2}]\big{]}\quad\text{($H$ is continuous)}
=2B1H1FH2(h1)b1fB1H1(b1h1)fH1(h1)𝑑h1𝑑b1(fB1H1,H2=fB1H1)\displaystyle=2\int_{B_{1}}\int_{H_{1}}F_{H_{2}}(h_{1})b_{1}f_{B_{1}\mid H_{1}}(b_{1}\mid h_{1})f_{H_{1}}(h_{1})dh_{1}db_{1}\quad\text{($f_{B_{1}\mid H_{1},H_{2}}=f_{B_{1}\mid H_{1}}$)}
=2B1H1b1FH1(h1)fB1,H1(b1,h1)𝑑h1𝑑b1(FH1=FH2)\displaystyle=2\int_{B_{1}}\int_{H_{1}}b_{1}F_{H_{1}}(h_{1})f_{B_{1},H_{1}}(b_{1},h_{1})dh_{1}db_{1}\quad\text{($F_{H_{1}}=F_{H_{2}}$)}
=2E[BFH(H)],\displaystyle=2\operatorname{E}[BF_{H}(H)],

where fB1,H1,H2(b1,h1,h2)f_{B_{1},H_{1},H_{2}}(b_{1},h_{1},h_{2}) is the joint probability density function (PDF) of (B1,H1,H2)(B_{1},H_{1},H_{2}).

Then, we show that \eta(H)=2F_{H}(H)-f_{H}(H) for discrete H. Here, we assume that B is discrete without loss of generality.

E[B1I(H1H2)+B2I(H1<H2)]\displaystyle\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]
=2E[E[B1I(H1>H2)H1,H2]]+E[E[B1I(H1=H2)H1,H2]]\displaystyle=2\operatorname{E}\left[E[B_{1}I(H_{1}>H_{2})\mid H_{1},H_{2}]\right]+\operatorname{E}\left[E[B_{1}I(H_{1}=H_{2})\mid H_{1},H_{2}]\right]
=2h1b1b1P(B1=b1,H1=h1)(h2I(h2<h1)P(H2=h2))+\displaystyle=2\sum_{h_{1}}\sum_{b_{1}}b_{1}\operatorname{P}(B_{1}=b_{1},H_{1}=h_{1})\left(\sum_{h_{2}}I(h_{2}<h_{1})\operatorname{P}(H_{2}=h_{2})\right)+
h1b1b1P(B1=b1,H1=h1)(h2I(h2=h1)P(H2=h2))\displaystyle\sum_{h_{1}}\sum_{b_{1}}b_{1}\operatorname{P}(B_{1}=b_{1},H_{1}=h_{1})\left(\sum_{h_{2}}I(h_{2}=h_{1})\operatorname{P}(H_{2}=h_{2})\right)
=2E[BFH(H)]E[BfH(H)].\displaystyle=2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[Bf_{H}(H)].

Therefore, in both cases, \operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]=\operatorname{E}[B\eta(H)]. ∎
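Proposition 1 can be checked exactly for discrete H by enumerating a small, hypothetical joint distribution of (B,H) in rational arithmetic:

```python
from itertools import product
from fractions import Fraction as Fr

# A small, hypothetical joint distribution of (B, H) with discrete H:
# pmf[(b, h)] = P(B = b, H = h)
pmf = {(1, 0): Fr(1, 4), (3, 0): Fr(1, 8), (2, 1): Fr(3, 8), (5, 2): Fr(1, 4)}

pH = {}                                   # f_H: P(H = h)
for (b, h), p in pmf.items():
    pH[h] = pH.get(h, Fr(0)) + p
FH = {h: sum(q for h2, q in pH.items() if h2 <= h) for h in pH}   # CDF of H

# Left-hand side: E[B1 I(H1 >= H2) + B2 I(H1 < H2)] over two i.i.d. copies
lhs = sum(p1 * p2 * (b1 if h1 >= h2 else b2)
          for ((b1, h1), p1), ((b2, h2), p2) in product(pmf.items(), repeat=2))

# Right-hand side: E[B eta(H)] with eta(H) = 2 F_H(H) - f_H(H)
rhs = sum(p * b * (2 * FH[h] - pH[h]) for (b, h), p in pmf.items())
assert lhs == rhs
```

Because the arithmetic is exact, equality of the two sides holds identically, not just up to simulation error.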

For variables B and H, the RCC is R(p)=\frac{\operatorname{E}[BI(H\leq h_{p})]}{\operatorname{E}[B]}, where h_{p} denotes the p-th quantile of H. We assume that \operatorname{E}[B]>0.

Proposition 2.

The Gini-like coefficient is twice the area (A) between the line of independence (the diagonal p) and R(p), which satisfies

2A=\frac{\operatorname{E}[B\eta(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}.
Proof.

For continuous H, we start with twice the area between p and R(p), multiplied by \operatorname{E}[B]:

2AE[B]\displaystyle 2A\operatorname{E}[B] =201(pE[B]hpE[BH=h]fH(h)𝑑h)𝑑p\displaystyle=2\int^{1}_{0}\left(p\operatorname{E}[B]-\int^{h_{p}}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)dh\right)dp
=E[B]201hpE[BH=h]fH(h)𝑑h𝑑p\displaystyle=\operatorname{E}[B]-2\int^{1}_{0}\int^{h_{p}}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)dhdp
=E[B]2E[BH=h]fH(h)(1FH(h))𝑑h\displaystyle=\operatorname{E}[B]-2\int^{\infty}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)(1-F_{H}(h))dh
=2E[BH=h]fH(h)FH(h)𝑑hE[B]\displaystyle=2\int^{\infty}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)F_{H}(h)dh-\operatorname{E}[B]
=2E[BFH(H)]E[B],\displaystyle=2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[B],

where h_{p} denotes the p-th quantile of H. Since \eta(H)=2F_{H}(H) for continuous H, we obtain

2A=\frac{2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}=\frac{\operatorname{E}[B\eta(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}.

For discrete H, we assume that H takes k distinct values. To plot the RCC, patients are ranked by their value of H in ascending order: h_{(1)}<h_{(2)}<h_{(3)}<\cdots<h_{(k)}, with probabilities \operatorname{P}(H=h_{(i)})=p_{i}, where i=1,2,3,\cdots,k and \sum_{i=1}^{k}p_{i}=1. We have

\operatorname{E}[BF_{H}(H)]=p_{1}\operatorname{E}[BI(H=h_{(1)})]+(p_{1}+p_{2})\operatorname{E}[BI(H=h_{(2)})]+\cdots+1\cdot\operatorname{E}[BI(H=h_{(k)})],
\operatorname{E}[Bf_{H}(H)]=\sum^{k}_{i=1}p_{i}\operatorname{E}[BI(H=h_{(i)})].

When B\geq 0, the area A is bounded between 0 and 0.5 and can be calculated as 0.5 minus the sum of the areas of one triangle and (k-1) trapezoids. Therefore, we can express the Gini-like coefficient as

2A=\frac{1}{\operatorname{E}[B]}\left((1-p_{k})\operatorname{E}[B]-\sum^{k-1}_{i=1}(p_{i}+p_{i+1})\operatorname{E}[BI(H\leq h_{(i)})]\right),

where \operatorname{E}[BI(H\leq h_{(k)})]=\sum_{i=1}^{k}\operatorname{E}[BI(H=h_{(i)})]. Then, \operatorname{E}[BF_{H}(H)], \operatorname{E}[Bf_{H}(H)], \operatorname{E}[B], and 2A satisfy the relationship

\frac{2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}-2A=\frac{\operatorname{E}[Bf_{H}(H)]}{\operatorname{E}[B]}.

Since \eta(H)=2F_{H}(H)-f_{H}(H) for discrete H, we obtain

2A=\frac{2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[Bf_{H}(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}=\frac{\operatorname{E}[B\eta(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}.

When the distribution of B includes negative values, or when ‘treat greater H’ performs worse than ‘treat at random,’ this relationship still holds. However, the area A can then exceed 0.5 or fall below 0, causing C_{b} to fall outside the interval (0,1). This issue also arises with the Gini coefficient and the Lorenz curve, and it has been discussed in the literature. ∎
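The triangle/trapezoid formula for 2A and the expression in Proposition 2 can likewise be checked exactly on a small, hypothetical discrete distribution (all values below are illustrative):

```python
from fractions import Fraction as Fr

# Hypothetical discrete distribution: support h_(1) < h_(2) < h_(3),
# probabilities p_i, and conditional means E[B | H = h_(i)]
p = [Fr(3, 8), Fr(3, 8), Fr(1, 4)]
mB = [Fr(5, 3), Fr(2), Fr(5)]

EB = sum(pi * mi for pi, mi in zip(p, mB))                      # E[B]
F = [sum(p[:i + 1], Fr(0)) for i in range(3)]                   # F_H(h_(i))
cum = [sum((pi * mi for pi, mi in zip(p[:i + 1], mB[:i + 1])), Fr(0))
       for i in range(3)]                                       # E[B I(H <= h_(i))]

# Gini-like coefficient from the triangle/trapezoid decomposition
twoA = ((1 - p[-1]) * EB
        - sum((p[i] + p[i + 1]) * cum[i] for i in range(2))) / EB

# Proposition 2: 2A = (E[B eta(H)] - E[B]) / E[B], eta = 2 F_H - f_H
EBeta = sum(pi * mi * (2 * Fi - pi) for pi, mi, Fi in zip(p, mB, F))
assert twoA == (EBeta - EB) / EB
```

Again, exact fractions make the two routes to 2A agree identically rather than approximately.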

Appendix S2: Supplementary Information for Population 1

Figure S5: The RCCs and moderate calibration plots for h1(X1,X2)h_{1}(X_{1},X_{2}), h2(X1,X2)h_{2}(X_{1},X_{2}), h3(X1,X2)h_{3}(X_{1},X_{2}) and h4(X1,X2)h_{4}(X_{1},X_{2}) when β1=0\beta_{1}=0. The blue dotted curves refer to CbC_{b} and E[BH]\operatorname{E}[B\mid H], and the red dashed curves refer to C~b\tilde{C}_{b} and E~[BH]\tilde{\operatorname{E}}[B\mid H].

In the first population example, we defined a distribution \mathbb{P} and a linear function \tau(X,Z) with a total of 18 parameters. To begin with, we design moderately and strongly calibrated TBPs. To design a moderately calibrated h_{2}(x_{1},x_{2}), we specify a set of coefficients c=(b_{0},b_{1},b_{1},b_{2}), where

b0\displaystyle b_{0} =[(α10α00)+(α13α03)p001p001+p000],\displaystyle=\left[(\alpha_{10}-\alpha_{00})+(\alpha_{13}-\alpha_{03})\frac{p_{001}}{p_{001}+p_{000}}\right],
b1\displaystyle b_{1} =[(α11α01)p101+p100p101+p100+p011+p010+(α12α02)p011+p010p101+p100+p011+p010+\displaystyle=\Bigg{[}(\alpha_{11}-\alpha_{01})\frac{p_{101}+p_{100}}{p_{101}+p_{100}+p_{011}+p_{010}}+(\alpha_{12}-\alpha_{02})\frac{p_{011}+p_{010}}{p_{101}+p_{100}+p_{011}+p_{010}}+
(α13α03)(p101+p011p101+p100+p011+p010p001p001+p000)],\displaystyle(\alpha_{13}-\alpha_{03})\left(\frac{p_{101}+p_{011}}{p_{101}+p_{100}+p_{011}+p_{010}}-\frac{p_{001}}{p_{001}+p_{000}}\right)\Bigg{]},
b2\displaystyle b_{2} =[(α11α01)(12p101+p100p101+p100+p011+p010)+\displaystyle=\Bigg{[}(\alpha_{11}-\alpha_{01})\left(1-2\frac{p_{101}+p_{100}}{p_{101}+p_{100}+p_{011}+p_{010}}\right)+
(α12α02)(12p011+p010p101+p100+p011+p010)+\displaystyle(\alpha_{12}-\alpha_{02})\left(1-2\frac{p_{011}+p_{010}}{p_{101}+p_{100}+p_{011}+p_{010}}\right)+
(α13α03)(p111p111+p1102p101+p011p101+p100+p011+p010+p001p001+p000)].\displaystyle(\alpha_{13}-\alpha_{03})\left(\frac{p_{111}}{p_{111}+p_{110}}-2\frac{p_{101}+p_{011}}{p_{101}+p_{100}+p_{011}+p_{010}}+\frac{p_{001}}{p_{001}+p_{000}}\right)\Bigg{]}.

To design a strongly calibrated h_{3}(x_{1},x_{2}), we set c=(c_{0},c_{1},c_{2},c_{3}), where

c0\displaystyle c_{0} =[(α10α00)+(α13α03)p001p001+p000],\displaystyle=\left[(\alpha_{10}-\alpha_{00})+(\alpha_{13}-\alpha_{03})\frac{p_{001}}{p_{001}+p_{000}}\right],
c1\displaystyle c_{1} =[(α11α01)+(α13α03)(p101p101+p100p001p001+p000)],\displaystyle=\left[(\alpha_{11}-\alpha_{01})+(\alpha_{13}-\alpha_{03})\left(\frac{p_{101}}{p_{101}+p_{100}}-\frac{p_{001}}{p_{001}+p_{000}}\right)\right],
c2\displaystyle c_{2} =[(α12α02)+(α13α03)(p011p011+p010p001p001+p000)],\displaystyle=\left[(\alpha_{12}-\alpha_{02})+(\alpha_{13}-\alpha_{03})\left(\frac{p_{011}}{p_{011}+p_{010}}-\frac{p_{001}}{p_{001}+p_{000}}\right)\right],
c3\displaystyle c_{3} =[(α13α03)(p111p111+p110p101p101+p100p011p011+p010+p001p001+p000)].\displaystyle=\left[(\alpha_{13}-\alpha_{03})\left(\frac{p_{111}}{p_{111}+p_{110}}-\frac{p_{101}}{p_{101}+p_{100}}-\frac{p_{011}}{p_{011}+p_{010}}+\frac{p_{001}}{p_{001}+p_{000}}\right)\right].

Finally, we look at the discrimination and calibration assessments of h_{1}, h_{2}, h_{3} and h_{4} when there is no confounding bias. When \beta_{1}=0, \text{bias}(X)=0, resulting in zero bias in identifying both C_{b} and \operatorname{E}[B\mid H]. Results are shown in Figure S5. In particular, the red and blue calibration curves for h_{2}(X_{1},X_{2}), h_{3}(X_{1},X_{2}) and h_{4}(X_{1},X_{2}) lie on the 45-degree diagonal line. Note that all values in this section are rounded to three decimal places.

Appendix S3: Closed-form Measures Calculations for Population 2

We first compute \operatorname{E}[B\mid H]. Given \operatorname{E}[B\mid H]=\operatorname{E}[\tau_{s}(X)\mid H], we need to find the joint PDF of (\tau_{s}(X),H), where \tau_{s}(X)=\max(X_{1},X_{2}) and H=X_{1}+X_{2}. Since \tau_{s} is not a one-to-one function, we look at the pre-image of the point (\tau_{s}(x),h). When \tau_{s}(x)\leq h\leq 2\tau_{s}(x), the pre-image of any point (\tau_{s}(x),h) consists of two points:

\left\{(x_{1}=\tau_{s}(x),\,x_{2}=h-\tau_{s}(x)),\;(x_{1}=h-\tau_{s}(x),\,x_{2}=\tau_{s}(x))\right\}.

These two points correspond to the two scenarios x_{1}\geq x_{2} and x_{1}<x_{2}; within each scenario, the max function is one-to-one.

Therefore, the joint PDF of (τs(x),H)(\tau_{s}(x),H) is

fτs(X),H(τs(x),h)={2,τs(x)h,h2τs(x),0τs(x)1,0,otherwise.f_{\tau_{s}(X),H}(\tau_{s}(x),h)=\begin{cases}2,&\tau_{s}(x)\leq h,h\leq 2\tau_{s}(x),0\leq\tau_{s}(x)\leq 1,\\ 0,&\text{otherwise}.\end{cases}

With the expression of fτs(X),H(τs(x),h)f_{\tau_{s}(X),H}(\tau_{s}(x),h), we can calculate two marginal PDFs to check that τs(X)\tau_{s}(X) follows the Beta(2,1)\text{Beta}(2,1) distribution and HH follows the triangular(a=0,b=2,c=1)\text{triangular}(a=0,b=2,c=1) distribution. Thus, the CDF of HH is

FH(h)={0,h<0,h2/2,0h<1,1(2h)2/2,1h<2,1,2h.\displaystyle F_{H}(h)=\begin{cases}0,&h<0,\\ h^{2}/2,&0\leq h<1,\\ 1-(2-h)^{2}/2,&1\leq h<2,\\ 1,&2\leq h.\end{cases}

With fτs(X),H(τs(x),h)f_{\tau_{s}(X),H}(\tau_{s}(x),h), we calculate the conditional PDF of (τs(X)H)(\tau_{s}(X)\mid H):

fτs(X)H(τs(x)h)\displaystyle f_{\tau_{s}(X)\mid H}(\tau_{s}(x)\mid h) ={2/h,0<h1,τs(x)h,h2τs(x),2/(2h),1<h<2,τs(x)h,h2τs(x),τs(x)1,0,otherwise.\displaystyle=\begin{cases}2/h,&0<h\leq 1,\tau_{s}(x)\leq h,h\leq 2\tau_{s}(x),\\ 2/(2-h),&1<h<2,\tau_{s}(x)\leq h,h\leq 2\tau_{s}(x),\tau_{s}(x)\leq 1,\\ 0,&\text{otherwise}.\end{cases}

We obtain

E[τs(X)H=h]\displaystyle\operatorname{E}[\tau_{s}(X)\mid H=h] =τs(X)τs(x)fτs(X)H(τs(x)h)𝑑τs(x)\displaystyle=\int_{\tau_{s}(X)}\tau_{s}(x)f_{\tau_{s}(X)\mid H}(\tau_{s}(x)\mid h)d\tau_{s}(x)
={3h4,0<h1,1h2/42h,1<h<2.\displaystyle=\begin{cases}\frac{3h}{4},&0<h\leq 1,\\ \frac{1-h^{2}/4}{2-h},&1<h<2.\end{cases}
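As a sanity check, this conditional mean can be compared against binned Monte Carlo averages, simulating X_{1},X_{2} as independent U(0,1) (the distribution consistent with the stated \text{Beta}(2,1) and triangular margins); the sample size and bin width below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000
x1, x2 = rng.uniform(0, 1, (2, n))
tau_s, h = np.maximum(x1, x2), x1 + x2

def closed_form(h0):
    """E[tau_s(X) | H = h0] from the piecewise formula above."""
    return 3 * h0 / 4 if h0 <= 1 else (1 - h0**2 / 4) / (2 - h0)

def binned_mean(h0, width=0.01):
    """Monte Carlo estimate: average tau_s over draws with H near h0."""
    return tau_s[np.abs(h - h0) < width].mean()
```

For any h0 in (0, 2), binned_mean(h0) should agree with closed_form(h0) up to simulation and binning error.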

Then, we compute C_{b} for H via the expression 1-\operatorname{E}[\tau_{s}(X)]/\left(2\operatorname{E}[\tau_{s}(X)F_{H}(H)]\right), where \operatorname{E}[\tau_{s}(X)]=2/3 since \tau_{s}(X) follows the \text{Beta}(2,1) distribution. With the expression of f_{\tau_{s}(X),H}(\tau_{s}(x),h), we have

E[τs(X)FH(H)]\displaystyle\operatorname{E}[\tau_{s}(X)F_{H}(H)] =hτs(x)τs(x)FH(h)fτs(X),H(τs(x),h)𝑑τs(x)𝑑h\displaystyle=\int_{h}\int_{\tau_{s}(x)}\tau_{s}(x)F_{H}(h)f_{\tau_{s}(X),H}(\tau_{s}(x),h)d\tau_{s}(x)dh
=01h2h2τs(x)h22𝑑τs(x)𝑑h+12h212τs(x)(1(2h)22)𝑑τs(x)𝑑h\displaystyle=\int^{1}_{0}\int^{h}_{\frac{h}{2}}2\tau_{s}(x)\frac{h^{2}}{2}d\tau_{s}(x)dh+\int^{2}_{1}\int^{1}_{\frac{h}{2}}2\tau_{s}(x)\left(1-\frac{(2-h)^{2}}{2}\right)d\tau_{s}(x)dh
=2(380+19120)=E[BFH(H)].\displaystyle=2\left(\frac{3}{80}+\frac{19}{120}\right)=E[BF_{H}(H)].

Therefore, we have

C_{b,h}=1-\frac{\operatorname{E}[\tau_{s}(X)]}{2\operatorname{E}[BF_{H}(H)]}=1-\frac{2/3}{4\left(\frac{3}{80}+\frac{19}{120}\right)}=\frac{7}{47}\approx 0.1489362.

Furthermore, following the same process, the C_{b} for \tau_{s}(X) itself is 1-\frac{2/3}{4/5}=\frac{1}{6}\approx 0.1666667.
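Both closed-form values can be reproduced numerically; the sketch below evaluates \operatorname{E}[\tau_{s}(X)F_{H}(H)] by one-dimensional quadrature after performing the inner integral over \tau_{s}(x) analytically (grid size is an arbitrary choice):

```python
import numpy as np

# Midpoint grids on (0, 1) and (1, 2) for the outer integral over h
h1 = (np.arange(100_000) + 0.5) / 100_000
h2 = 1.0 + h1

# Inner integrals done analytically: int_{h/2}^{h} 2t dt = 3h^2/4 for h <= 1,
# and int_{h/2}^{1} 2t dt = 1 - h^2/4 for 1 < h < 2; F_H is the triangular
# CDF given above.
EBF = np.mean((3 * h1**2 / 4) * (h1**2 / 2)) + \
      np.mean((1 - h2**2 / 4) * (1 - (2 - h2)**2 / 2))
Cb_h = 1 - (2 / 3) / (2 * EBF)                   # should be 7/47

# For H = tau_s(X) itself: E[tau_s F(tau_s)] = int_0^1 t * t^2 * 2t dt = 2/5
t = (np.arange(100_000) + 0.5) / 100_000
Cb_tau = 1 - (2 / 3) / (2 * np.mean(2 * t**4))   # should be 1/6
```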

Finally, we compute E~[BH]\tilde{\operatorname{E}}[B\mid H] and C~b\tilde{C}_{b}, with the confounding bias:

\text{bias}(X)=\left(\operatorname{E}[Y\mid A=1,X]-\operatorname{E}[Y\mid A=0,X]\right)-\tau_{s}(X).

Given μa(x,z)\mu_{a}(x,z) and a{0,1}a\in\{0,1\}, we obtain E[YA=a,X]\operatorname{E}[Y\mid A=a,X]:

E[YA=1,X=x]\displaystyle\operatorname{E}[Y\mid A=1,X=x] =ZE[YA=1,X=x,Z=z]fZX,A(zx,1)𝑑z\displaystyle=\int_{Z}\operatorname{E}[Y\mid A=1,X=x,Z=z]f_{Z\mid X,A}(z\mid x,1)dz
=Z2z(12max(x1,x2)+max(x2,z)+110x1)𝑑z\displaystyle=\int_{Z}2z\left(\frac{1}{2}\max(x_{1},x_{2})+\max(x_{2},z)+\frac{1}{10}x_{1}\right)dz
=12max(x1,x2)+13x23+23+110x1.\displaystyle=\frac{1}{2}\max(x_{1},x_{2})+\frac{1}{3}x_{2}^{3}+\frac{2}{3}+\frac{1}{10}x_{1}.
E[YA=0,X=x]\displaystyle\operatorname{E}[Y\mid A=0,X=x] =ZE[YA=0,X=x,Z=z]fZX,A(zx,0)𝑑z\displaystyle=\int_{Z}\operatorname{E}[Y\mid A=0,X=x,Z=z]f_{Z\mid X,A}(z\mid x,0)dz
=Z2(1z)(12max(x1,x2)+max(x2,z)+110x1)𝑑z\displaystyle=\int_{Z}2(1-z)\left(-\frac{1}{2}\max(x_{1},x_{2})+\max(x_{2},z)+\frac{1}{10}x_{1}\right)dz
=12max(x1,x2)+13+x2213x23+110x1.\displaystyle=-\frac{1}{2}\max(x_{1},x_{2})+\frac{1}{3}+x_{2}^{2}-\frac{1}{3}x_{2}^{3}+\frac{1}{10}x_{1}.

Denoting \operatorname{E}[Y\mid A=1,X=x]-\operatorname{E}[Y\mid A=0,X=x] by D(x), we have

D(x)=\tau_{s}(x)+\frac{1}{3}-x_{2}^{2}+\frac{2}{3}x_{2}^{3}=\tau_{s}(x)+\text{bias}(x),

where \text{bias}(x)=\frac{2}{3}x_{2}^{3}-x_{2}^{2}+\frac{1}{3}, which depends only on x_{2} and is nonnegative on [0,1]. This confounding bias is illustrated in Figure S6.

Figure S6: The confounding bias function, bias(X)\text{bias}(X), which is only a function of X2X_{2}.
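The derivation of D(x) and \text{bias}(x) above can be verified by integrating \mu_{a}(x,z) against the conditional densities f_{Z\mid X,A}(z\mid x,1)=2z and f_{Z\mid X,A}(z\mid x,0)=2(1-z) numerically (a midpoint rule; grid size arbitrary):

```python
import numpy as np

z = (np.arange(200_000) + 0.5) / 200_000   # midpoint grid on (0, 1)

def D(x1, x2):
    """E[Y | A=1, X=x] - E[Y | A=0, X=x], integrating mu_a(x, z) against
    f_{Z|X,A}(z|x,1) = 2z and f_{Z|X,A}(z|x,0) = 2(1-z) numerically."""
    mu1 = 0.5 * max(x1, x2) + np.maximum(x2, z) + x1 / 10
    mu0 = -0.5 * max(x1, x2) + np.maximum(x2, z) + x1 / 10
    return np.mean(2 * z * mu1) - np.mean(2 * (1 - z) * mu0)

def bias(x2):
    """Closed-form bias(x) from the derivation above."""
    return 2 * x2**3 / 3 - x2**2 + 1 / 3
```

For any (x1, x2) in the unit square, D(x1, x2) should match max(x1, x2) + bias(x2) to quadrature accuracy.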

To compute E~[BH]=E[D(X)H]\tilde{\operatorname{E}}[B\mid H]=\operatorname{E}[D(X)\mid H], we have

E[D(X)H]\displaystyle\operatorname{E}[D(X)\mid H] =E[τs(X)H]+E[bias(X)H],\displaystyle=\operatorname{E}[\tau_{s}(X)\mid H]+\operatorname{E}[\text{bias}(X)\mid H],

where \operatorname{E}[\tau_{s}(X)\mid H] is known. To calculate \operatorname{E}[\text{bias}(X)\mid H], we need the joint distribution of (X_{2},H) and then the conditional distribution of X_{2}\mid H. Applying the change of variables (X_{1},X_{2})\mapsto(X_{2},H), with H=X_{1}+X_{2} and Jacobian determinant J, we get

f_{X_{2},H}(x_{2},h)=f_{X_{2},X_{1}}(x_{2},h-x_{2})\,|J|=1,

where 0\leq x_{2}\leq 1, 0\leq h\leq 2, h-1\leq x_{2}, and x_{2}\leq h. The conditional PDF is then

fX2H(x2h)\displaystyle f_{X_{2}\mid H}(x_{2}\mid h) ={1h,0x21,0<h1,h1x2,x2h,12h,0x21,1h<2,h1x2,x2h,0,otherwise.\displaystyle=\begin{cases}\frac{1}{h},&0\leq x_{2}\leq 1,0<h\leq 1,h-1\leq x_{2},x_{2}\leq h,\\ \frac{1}{2-h},&0\leq x_{2}\leq 1,1\leq h<2,h-1\leq x_{2},x_{2}\leq h,\\ 0,&\text{otherwise}.\end{cases}

Therefore, we have

E[bias(X2)H=h]\displaystyle\operatorname{E}[\text{bias}(X_{2})\mid H=h] ={16(h32h2+2),0<h1,16(h34h2+4h),1<h<2.\displaystyle=\begin{cases}\frac{1}{6}\left(h^{3}-2h^{2}+2\right),&0<h\leq 1,\\ \frac{1}{6}\left(h^{3}-4h^{2}+4h\right),&1<h<2.\end{cases}
E~[BH=h]\displaystyle\tilde{\operatorname{E}}[B\mid H=h] ={3h4+h32h2+26,0<h1,1h2/42h+h34h2+4h6,1<h<2.\displaystyle=\begin{cases}\frac{3h}{4}+\frac{h^{3}-2h^{2}+2}{6},&0<h\leq 1,\\ \frac{1-h^{2}/4}{2-h}+\frac{h^{3}-4h^{2}+4h}{6},&1<h<2.\end{cases}

Similarly, we calculate the C~b\tilde{C}_{b} for HH by computing

E[D(X)]\displaystyle\operatorname{E}[D(X)] =E[τs(X)]+E[bias(X)]\displaystyle=\operatorname{E}[\tau_{s}(X)]+\operatorname{E}[\text{bias}(X)]
=23+x2(23x23x22+13)fX2(x2)𝑑x2\displaystyle=\frac{2}{3}+\int_{x_{2}}\left(\frac{2}{3}x_{2}^{3}-x_{2}^{2}+\frac{1}{3}\right)f_{X_{2}}(x_{2})dx_{2}
=23+16=56.\displaystyle=\frac{2}{3}+\frac{1}{6}=\frac{5}{6}.
E[D(X)FH(H)]\displaystyle\operatorname{E}[D(X)F_{H}(H)] =E[τs(X)FH(H)]+E[bias(X2)FH(H)]\displaystyle=\operatorname{E}[\tau_{s}(X)F_{H}(H)]+\operatorname{E}[\text{bias}(X_{2})F_{H}(H)]
=2(380+19120)+x2h(23x23x22+13)FH(h)fX2,H(x2,h)𝑑x2𝑑h\displaystyle=2\left(\frac{3}{80}+\frac{19}{120}\right)+\int_{x_{2}}\int_{h}\left(\frac{2}{3}x_{2}^{3}-x_{2}^{2}+\frac{1}{3}\right)F_{H}(h)f_{X_{2},H}(x_{2},h)dx_{2}dh
=2(380+19120)+(13504+431260).\displaystyle=2\left(\frac{3}{80}+\frac{19}{120}\right)+\left(\frac{13}{504}+\frac{43}{1260}\right).

Therefore,

\tilde{C}_{b,h}=1-\frac{\operatorname{E}[D(X)]}{2\operatorname{E}[D(X)F_{H}(H)]}=1-\frac{5/6}{2\left(2\left(\frac{3}{80}+\frac{19}{120}\right)+\frac{13}{504}+\frac{43}{1260}\right)}=\frac{44}{569}\approx 0.07732865.
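These closed-form values can be double-checked in exact rational arithmetic:

```python
from fractions import Fraction as Fr

EB = Fr(2, 3) + Fr(1, 6)                # E[D(X)] = 5/6
EBF = 2 * (Fr(3, 80) + Fr(19, 120)) + Fr(13, 504) + Fr(43, 1260)
Cb_tilde = 1 - EB / (2 * EBF)           # confounded value for H: 44/569

Cb = 1 - Fr(2, 3) / (4 * (Fr(3, 80) + Fr(19, 120)))   # unconfounded: 7/47
```

The comparison Cb_tilde < Cb confirms the pattern discussed in the main text: the positive \text{bias}(X) in this population overstates \operatorname{E}[B] and \operatorname{E}[B\eta(H)] yet understates C_{b} (0.0773 versus 0.1489).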