Transformation models for ROC analysis

Ainesh Sewak
Universität Zürich &Torsten Hothorn
Universität Zürich

\Plainauthor

Ainesh Sewak, Torsten Hothorn \Plaintitle \Abstract Receiver operating characteristic (ROC) analysis is one of the most popular approaches for evaluating and comparing the accuracy of medical diagnostic tests. Although various methodologies have been developed for estimating ROC curves and its associated summary indices, there is no consensus on a single framework that can provide consistent statistical inference whilst handling the complexities associated with medical data. Such complexities might include covariates that influence the diagnostic potential of a test, ordinal test data, censored data due to instrument detection limits or correlated biomarkers. We propose a regression model for the transformed test results which exploits the invariance of ROC curves to monotonic transformations and naturally accommodates these features. Our use of maximum likelihood inference guarantees asymptotic efficiency of the resulting estimators and associated confidence intervals. Simulation studies show that the estimates based on transformation models are unbiased and yield coverage at nominal levels. The methodology is applied to a cross-sectional study of metabolic syndrome where we investigate the covariate-specific performance of weight-to-height ratio as a non-invasive diagnostic test. Software implementations for all the methods described in the article are provided in the \pkgtram \proglangR package. \Keywordstransformation model, ROC curve, AUC, diagnostic test, distribution regression, ordinal outcome, censoring, Youden Index, OVL, correlated biomarkers, limit of detection \Plainkeywordstransformation model, ROC curve, AUC, diagnostic tests, distribution regression, ordinal outcome, censoring, Youden Index, OVL, correlated biomarkers, limit of detection \Address Ainesh Sewak, Torsten Hothorn
Institut für Epidemiologie, Biostatistik und Prävention
Universität Zürich
Hirschengraben 84, CH-8001 Zürich, Switzerland
[email protected], [email protected]

1 Introduction

Estimating receiver operating characteristic (ROC) curves for evaluating the performance of medical diagnostic tests has been a main focus of statistical literature over the last decades (Green and Swets, 1966; Egan, 1975; Pepe, 2003; Zou et al., 2011; Inácio et al., 2021b). Diagnostic tests screen for the presence or absence of a disease. Characterizing their accuracy is essential to ensure appropriate prevention, treatment and monitoring of diseases. ROC curves are a valuable tool in determining the diagnostic potential of a test and continue to be extensively applied in biomedical studies as new tests or biomarkers are developed in radiology, oncology, genetics, and other related fields. Increasingly more applications can be expected due to advancements in technology and analyzing resulting data requires a computationally straightforward approach to provide accurate and consistent statistical inference.

Previous research has focused on extending statistical methodology for ROC curve estimation to address issues such as adjustment for covariates (Pepe, 1997, 1998, 2000; Faraggi, 2003), incorporating censoring due to instrument detection limits (Perkins et al., 2007; Ruopp et al., 2008; Bantis et al., 2017), comparing correlated diagnostic tests (Hanley and McNeil, 1983; DeLong et al., 1988; Zou and Hall, 2002; Molodianovitch et al., 2006; Bantis and Feng, 2016) and robustness to model misspecification (Bianco et al., 2020; Inácio et al., 2021a). In addition, a wide variety of parametric and nonparametric methods have been proposed within frequentist and Bayesian paradigms (Hsieh and Turnbull, 1996; Alonzo and Pepe, 2002; Erkanli et al., 2006; Branscum et al., 2008; Yao et al., 2010; González-Manteiga et al., 2011; Rodríguez-Álvarez et al., 2011; Cai and Pepe, 2002; Cai and Moskowitz, 2004; Inácio et al., 2013; Ghosal et al., 2022). However, there is no consensus on an analytic approach that can handle all these issues simultaneously.

An attractive feature of the ROC curve, which has scantly been used for its estimation, is that it remains invariant to monotonic transformations of the test results. Although transformations have been used to bring continuous test results into a form that approximately satisfies the assumptions of a suitable parametric model (Faraggi, 2003; Schisterman et al., 2004), estimation of a transformation function has been limited to the Box-Cox power transformation family (Box and Cox, 1964; Zou et al., 1998; Zou and Hall, 2000). For rank-based methods, the transformation function can be left unspecified but in all cases a restriction to normality has been previously imposed on the model for the ROC curve (Zou et al., 1997).

In this article we present a new unifying methodological framework for estimating ROC curves and its associated summary indices by modeling the relationship between the transformed test results and potential covariates. We employ transformation models to jointly estimate the transformation function and regression parameters (Hothorn et al., 2014, 2018). This approach specifies a parametric model for the ROC curve but remains distribution-free because we do not impose any strong assumptions on the transformation function. Using the estimated parameters, we show how to evaluate covariate effects on discriminatory performance of diagnostic tests. We focus on data obtained from case-control or cross-sectional studies of subjects with continuous or ordinal test results. Unlike nonparametric methods which are flexible but difficult to interpret and implement, transformation models excel on both fronts. Their output can be easily interpreted by clinicians, which is of important consideration in biomedical applications. \proglangR implementations of all methods discussed in this article are available, along with a set of supporting examples. The goals of this article are to introduce a general methodology for performing ROC analysis which is model-based but distribution-free; to apply maximum likelihood inference procedures to examine covariate effects with potentially censored observations; and to demonstrate methods for testing differences between correlated ROC curves of several tests at specific covariate levels within the same conceptual and computational framework.

1.1 Notation

Let the random variable $Y$ denote the continuous result of a diagnostic test and let $D$ denote the disease status, with $D=1$ if a subject is diseased and 0 if nondiseased. We denote quantities conditional on the disease status using subscripts. For example, $Y_{1}$ and $Y_{0}$ are the test results in the diseased and nondiseased populations with cumulative distribution functions (CDF) given by $F_{1}$ and $F_{0}$ and densities $f_{1}$ and $f_{0}$ , respectively. Suppose that the subject is diagnosed as diseased when their test result exceeds a threshold value, $c$ . By convention, we assume that larger values of the test result are more indicative of the disease. The probability of truly identifying a diseased and nondiseased subject are defined as sensitivity, $\Prob(Y_{1}>c)=1-F_{1}(c)$ , and specificity, $\Prob(Y_{0}\leq c)=F_{0}(c)$ , respectively. The set of pairs $(1-\text{specificity},\text{sensitivity})$ for all $c\in\mathbb{R}$ produce the ROC curve. By setting $p=1-F_{0}(c)$ , an equivalent representation of the ROC curve is

\displaystyle\text{ROC}(p)=1-F_{1}\left(F_{0}^{-1}(1-p)\right).

1.2 Summary indices

Summary indices of the ROC curve quantify the degree of separation between the distributions $Y_{1}$ and $Y_{0}$ . The most widely used index is the area under the ROC curve (AUC) defined by

\displaystyle\text{AUC}=\Prob(Y_{1}>Y_{0})=\int_{0}^{1}\text{ROC}(p)dp.

The AUC represents the probability that the test results of a randomly selected diseased subject exceeds the one of a nondiseased subject and is directly related to the Mann-Whitney-Wilcoxon U-statistic (MWW) (Bamber, 1975; Hanley and McNeil, 1982). Alternative indices include the Youden index (Youden, 1950), $J$ , which combines sensitivity and specificity over all possible thresholds to provide the maximum potential effectiveness of a diagnostic test, given by

\displaystyle J

\displaystyle=\max_{c\in\mathbb{R}}[F_{0}(c)-F_{1}(c)].

The Youden index is equivalent to the Smirnov (or the two-sample Kolmogorov-Smirnov) test statistic (Gail and Green, 1976; Komaba et al., 2022) and can be represented as half the $L_{1}$ distance between two densities or as the complement of the overlapping coefficient (OVL) (Weitzman, 1970; Feller, 1991; Schmid and Schmidt, 2006; Martínez-Camblor, 2022):

\displaystyle J

\displaystyle=\frac{1}{2}\int|f_{0}(y)-f_{1}(y)|dy=1-\int\min[f_{1}(y),f_{0}(y)]dy=1-\text{OVL}.

Additionally, the threshold corresponding to $J$ , where sensitivity and specificity are maximized, denoted as $c^{*}$ , is often used in clinical practice as the optimal classification threshold to screen subjects.

1.3 Covariates

The result of a diagnostic test can be affected by both the disease status, as well as by covariates associated with the subjects. Consequently, the test result distributions and their separation may differ by levels of the covariates. In order to appropriately understand the accuracy of the test in subpopulations, we can use covariate-specific or conditional ROC curves (Tosteson and Begg, 1988; Pepe, 1997). Throughout we use boldface letters to denote vectors and lightface for scalars. Let $X$ denote a vector of covariates that are hypothesized to have an impact on the accuracy of the test. The conditional CDF in the diseased population is given by $F_{1}(y\mid\text{\boldmath$x$})=\Prob(Y_{1}\leq y\mid\text{\boldmath$X$}=\text{\boldmath$x$})$ and analogously given for the nondiseased population. The covariate-specific ROC can be written as

\displaystyle\text{ROC}(p\mid\text{\boldmath$x$})=1-F_{1}\left(F_{0}^{-1}(1-p\mid\text{\boldmath$x$})\mid\text{\boldmath$x$}\right)

(1)

with its counterpart conditional summary indices, $\text{AUC}(\text{\boldmath$x$})$ and $J(\text{\boldmath$x$})$ , defined accordingly.

The covariate-specific ROC curve can be generated by modeling the conditional distribution of the test results. This is known as the induced or indirect methodology for modeling ROC curves (Pepe, 1997, 1998, 2003; Faraggi, 2003; Rodríguez-Álvarez et al., 2011). Alternatively, direct methodology is based on a regression model for the ROC curve itself (Pepe, 2000; Alonzo and Pepe, 2002; Cai and Moskowitz, 2004).

1.4 Overview

The article proceeds as follows. In Section 2, we propose a transformation modeling framework for parameterizing ROC curves from which we derive closed-form expressions for associated AUC and Youden summary indices. We then show how our approach addresses the main components of ROC analysis: summarizing the discriminatory capabilities of a diagnostic test for two-sample classification, adjusting for covariates, and comparing the efficacy of two (or more) correlated diagnostic tests. We discuss maximum likelihood estimation procedures for our model and corresponding inference. Specifically, we develop score confidence bands for the ROC curve and score confidence intervals for its summary indices, which remain invariant to reparametrizations and hence can be determined from the estimated parameters of the transformation model. In Section 4 we apply our approach to a cross-sectional study for detection of metabolic syndrome. Specifically, using the weight-to-height ratio as a non-invasive diagnostic test for metabolic syndrome, we investigate the age and gender-specific performance of the test. We conclude the article with a discussion.

2 Methods

2.1 Transformation model

The ROC curve is a composition of distribution functions and thus is invariant to strictly monotonically increasing transformations of $Y$ . We propose a model for the conditional distribution of the transformed test result given the disease status and covariates. This transformation is obtained from the data and leads to a distribution-free framework to parameterize the covariate-specific ROC curve and its summary indices.

Suppose there exists a strictly monotonically increasing function $h$ such that the relationship between the transformed test result and the covariates follows a shift-scale model

\displaystyle h(Y)

\displaystyle=\mu_{d}(\text{\boldmath$x$})+\sigma_{d}(\text{\boldmath$x$})Z,

where $D=d$ specifies the disease indicator ( $D=0$ for nondiseased and $D=1$ for diseased), $\text{\boldmath$X$}=\text{\boldmath$x$}$ a fixed set of covariates, $\mu_{d}(\text{\boldmath$x$})$ the shift term, $\sigma_{d}(\text{\boldmath$x$})$ the scale term and $Z\in\mathbb{R}$ is a latent random variable with an a priori known absolutely continuous log-concave CDF, $F_{Z}$ . Given that $D$ and $X$ are fixed, the conditional CDF for $Y$ is

\displaystyle\begin{split}\Prob(Y\leq y\mid D=d,\text{\boldmath$X$}=\text{\boldmath$x$})&=F_{d}(y\mid\text{\boldmath$x$})=F_{Z}\left(\frac{h(y)-\mu_{d}(\text{\boldmath$x$})}{\sigma_{d}(\text{\boldmath$x$})}\right).\end{split}

(2)

Equation 2 represents a general class of models called transformation models (Bickel and Doksum, 1981; Bickel, 1986; Bickel et al., 1993; Hothorn et al., 2018). The transformation function $h$ uniquely characterizes the distribution of $Y$ , similar to the density or distribution function. Plugging in this conditional CDF of $Y$ into Equation 1, $h$ cancels out and the covariate-specific ROC curve is given by

	$\displaystyle\begin{split}\text{ROC}(p\mid\text{\boldmath$x$})&=1-F_{Z}\left(\zeta(\text{\boldmath$x$})F_{Z}^{-1}(1-p)-\delta(\text{\boldmath$x$})\right),\end{split}$		(3)
where
	$\displaystyle\delta(\text{\boldmath$x$})$	$\displaystyle=\frac{\mu_{1}(\text{\boldmath$x$})-\mu_{0}(\text{\boldmath$x$})}{\sigma_{1}(\text{\boldmath$x$})},$
	$\displaystyle\zeta(\text{\boldmath$x$})$	$\displaystyle=\frac{\sigma_{0}(\text{\boldmath$x$})}{\sigma_{1}(\text{\boldmath$x$})}.$

Thus, the ROC curve is completely determined by the shift and scale terms of the model.

The binormal (Dorfman and Alf Jr, 1969) and bilogistic (Ogilvie and Creelman, 1968) ROC curves can be obtained by setting $F_{Z}$ to the standard normal distribution function $\operatorname{probit}^{-1}=\Phi$ , or the standard logistic distribution function $\operatorname{logit}^{-1}(x)=\operatorname{expit}(x)=(1+\exp(-x))^{-1}$ , in Equation 3, respectively. Similarly, the proportional hazard (Gönen and Heller, 2010) and reverse proportional hazard alternatives (Khan, 2022) for the ROC curve also fall within the purview of our transformation model with $F_{Z}$ specified as $\operatorname{cloglog}^{-1}(x)=1-\exp(-\exp(x))$ (minimum extreme value distribution function) and $\operatorname{loglog}^{-1}(x)=\exp(-\exp(-x))$ (maximum extreme value distribution function), respectively.

However, to the best of our knowledge, the only literature where the transformation function $h$ is included in the model formulation of the ROC curve is Zou (1997), who jointly models the shift term and the parameters of a Box-Cox power transformation function. A key point of this paper is that we explicitly estimate $h$ jointly with $\mu(\text{\boldmath$x$})$ from the observed data and are not restricted to normality imposed by power transformation families. Thus, the methods we propose allow for proper propagation of uncertainty from the estimated transformation function $\hat{h}$ into the estimates of the shift and scale terms of the model.

The ROC curve in Equation 3 follows a parametric model depending on $F_{Z}$ , but is distribution-free as in Pepe (2000); Alonzo and Pepe (2002), because no assumptions are made about the transformation $h$ and consequently for the distribution of the test results. The approach to model the test results as a function of the disease status and covariates was originally proposed in the latent variable ordinal regression setting by Tosteson and Begg (1988) and extended by Pepe (1997, 2000, 2003) to modeling covariate effects directly on the ROC curve.

Tosteson and Begg (1988) pointed out that to ensure concavity of the induced ROC curve, the scale term must be omitted, i.e., $\sigma_{d}(\text{\boldmath$x$})=1$ for $d=\{0,1\}$ . The ROC curve is termed proper if it is concave or, equivalently, if the derivative of the ROC curve is a monotonically decreasing function (Egan, 1975; Pan and Metz, 1997; Dorfman et al., 1997). A concave ROC curve is desirable as it yields the maximal sensitivity for a given value of specificity (McIntosh and Pepe, 2002). In this sense, as the decision criterion for classifying subjects is optimal when the ROC curve is concave, we focus the remaining work on the model involving only the shift term. Hence, the effect of covariates on the ROC curve is contained in the difference between the shift terms for diseased and nondiseased subjects, $\delta(\text{\boldmath$x$})=\mu_{1}(\text{\boldmath$x$})-\mu_{0}(\text{\boldmath$x$})$ . For a relaxation of this assumption, see Siegfried et al. (2022) who estimate the scale functions through regression models.

2.1.1 Two-sample case

To fix ideas, we first consider the case of two samples without covariates. Let the shift term take the form $\mu_{d}(\text{\boldmath$x$})=\delta d$ . The CDF of the test results in the nondiseased population is given by $F_{0}(y)=F_{Z}(h(y))$ and in the diseased population by $F_{1}(y)=F_{Z}(h(y)-\delta)$ . Using Equation 3, the induced ROC curve can be expressed as

\displaystyle\text{ROC}(p)

\displaystyle=1-F_{Z}\left(F_{Z}^{-1}(1-p)-\delta\right).

(4)

The model assumption implies that a monotone function $h$ exists to transform both $Y_{1}$ and $Y_{0}$ into the same distribution, $Z\sim F_{Z}$ separated by a shift parameter, $\delta$ . The induced ROC curve from this model doesn’t assume a particular distribution of the test result, rather, it quantifies the difference between the test result distributions on the scale of a user-defined $F_{Z}$ . In this sense, the difference between the test result distributions is described by $\delta$ . Each choice of $F_{Z}$ leads to a different interpretation of $\delta$ . For example, when $F_{Z}$ is selected to be the standard normal distribution function, $\delta$ is the difference in means of the transformed test results comparing the diseased and nondiseased groups, $\E[h(Y_{1})-h(Y_{0})]$ . Similarly, when $F_{Z}$ is the standard logistic distribution function, $\exp(\delta)$ is the ratio of odds of having a positive test result comparing diseased and nondiseased groups. Closed-form expressions can be derived for summary indices of the ROC curve by solving the appropriate integrals. The expressions of AUC, $J$ , the optimal threshold $c^{*}$ , sensitivity and specificity at $c^{*}$ are given for some well known choices of $F_{Z}$ in Table 1.

	$\operatorname{probit}^{-1}$	$\operatorname{logit}^{-1}$	$\operatorname{cloglog}^{-1}$	$\operatorname{loglog}^{-1}$
AUC	$\Phi\left(\frac{\delta}{\sqrt{2}}\right)$	$\left\{\begin{array}[]{ll}\frac{\exp(\delta)(\exp(\delta)-1-\delta)}{(\exp(\delta)-1)^{2}}&\delta\neq 0\\ 1/2&\delta=0\end{array}\right.$	$\operatorname{expit}(\delta)$
$J$	$1-2\Phi\left(\frac{-\|\delta\|}{2}\right)$	$1-2\operatorname{expit}\left(\frac{-\|\delta\|}{2}\right)$	$\exp\left(\frac{-\|\delta\|}{e^{\|\delta\|}-1}\right)-\exp\left(\frac{\|\delta\|}{e^{-\|\delta\|}-1}\right)$
$c^{*}$	$h^{-1}\left(\frac{\delta}{2}\right)$		$h^{-1}\left(\log\left(\frac{\delta}{1-e^{-\delta}}\right)\right)$	$h^{-1}\left(\log\left(\frac{e^{\delta}-1}{\delta}\right)\right)$
$\text{Sens}(c^{*})$	$\Phi\left(\frac{\delta}{2}\right)$	$\operatorname{expit}\left(\frac{\delta}{2}\right)$	$\exp\left(\frac{-\delta}{e^{\delta}-1}\right)$	$1-\exp\left(\frac{\delta}{e^{-\delta}-1}\right)$
$\text{Spec}(c^{*})$	$\Phi\left(\frac{\delta}{2}\right)$	$\operatorname{expit}\left(\frac{\delta}{2}\right)$	$1-\exp\left(\frac{\delta}{e^{-\delta}-1}\right)$	$\exp\left(\frac{-\delta}{e^{\delta}-1}\right)$

Table 1: Closed-form expressions for the area under the receiver operating characteristic curve (AUC), Youden Index (

J

), optimal classification threshold (

c^{*}

), sensitivity (Sens) and specificity (Spec) at

c^{*}

in terms of the shift parameter

\delta

in the linear transformation model given by

F_{d}(y)=F_{Z}(h(y)-\delta d)

2.1.2 Conditional ROC curve

The accuracy of a diagnostic test may be influenced by a set of covariates $X$ . To evaluate this effect on the ROC curve and its summary indices, we assume a linear transformation model with a shift term that takes the form

\displaystyle\mu_{d}(\text{\boldmath$x$})=\delta d+\text{\boldmath$x$}^{\top}\text{\boldmath$\xi$}+d\text{\boldmath$x$}^{\top}\text{\boldmath$\gamma$},

(5)

where $\text{\boldmath$\xi$},\text{\boldmath$\gamma$}\in\mathbb{R}^{P}$ are the coefficient vectors for the covariates and interaction term, respectively. Under this model, the resulting covariate-specific ROC curve is

\displaystyle\text{ROC}(p\mid\text{\boldmath$x$})=1-F_{Z}\left(F_{Z}^{-1}(1-p)-(\delta+\text{\boldmath$x$}^{\top}\text{\boldmath$\gamma$})\right),

where the covariate effect on the ROC curve is given by the difference in shift terms between diseased and nondiseased subjects, $\delta(\text{\boldmath$x$})=\delta+\text{\boldmath$x$}^{\top}\text{\boldmath$\gamma$}$ . Similarly, the covariate-specific AUC is given by

\displaystyle\text{AUC}(\text{\boldmath$x$})=\Prob(Y_{0}<Y_{1}\mid\text{\boldmath$X$}=\text{\boldmath$x$})=a\left(\delta(\text{\boldmath$x$})\right)=a\left(\delta+\text{\boldmath$x$}^{\top}\text{\boldmath$\gamma$}\right),

(6)

where $a:\mathbb{R}\mapsto[0,1]$ is the AUC function from the first row of Table 1 for different choices of $F_{Z}$ . The expressions for $J$ , $c^{*}$ , sensitivity and specificity can analogously be adjusted to account for covariates, with $\delta$ replaced by $\delta+\text{\boldmath$x$}^{\top}\text{\boldmath$\gamma$}$ in Table 1. Interpretation of the interaction coefficient is similar to Pepe (1998). For example, in the case of a continuous covariate $X=x\in\mathbb{R}$ , for each possible specificity value $1-p\in(0,1)$ , a unit increase in $x$ results in a $\gamma$ -unit increase in the ROC curve (or an increase in the sensitivity) on the scale of $F_{Z}$ . If $\gamma$ is positive, an increase in $x$ corresponds to an increase in the ROC curve, indicating that a test is better able to discriminate the two populations for larger values of $x$ and, vice versa. Note that the ROC curve varies with the covariate contingent upon the presence of an interaction between $d$ and $x$ (Pepe, 2003). For $\gamma=0$ , the covariate affects the distribution of the test results from the diseased and nondiseased population, but not the ROC curve. That is, for all levels of $x$ , the difference between the transformed distributions $h(Y_{1})$ and $h(Y_{0})$ is given by $\delta$ , thus the ROC curve is unchanged. Analogous interpretations hold when we are interested in modeling a set of covariates $X$ , which could possibly include categorical covariates.

Standard regression techniques have also been proposed as an alternative to assess the effect of covariates on summary indices rather than deriving the induced ROC curve. For example, Dodd and Pepe (2003) model the partial AUC as a regression function of covariates. Our model equivalently results in a regression model for the AUC when the shift term is given from Equation 5 and a particularly chosen scale $F_{Z}$ . Specifically, note that Equation 6 is the model proposed by Dodd and Pepe (2003), where $\delta+\text{\boldmath$x$}^{\top}\text{\boldmath$\gamma$}$ is in the form of a usual linear predictor and $a$ is a monotonically increasing inverse link function which defines the scale for the regression coefficients. This relationship provides another interpretation of the interaction term. For example, when $F_{Z}=\operatorname{cloglog}^{-1}$ or $F_{Z}=\operatorname{loglog}^{-1}$ , $\exp(\gamma)$ is a ratio of AUC odds comparing a subject with covariate value $x+1$ to $x$ . The AUC odds conditional on $x$ are defined as ${\Prob(Y_{1}>Y_{0}\mid X=x)}/{\Prob(Y_{1}\leq Y_{0}\mid X=x)}$ . As will be shown in Section 2.2, an advantage of our approach is that we estimate the regression parameters of the transformation model using maximum likelihood estimation and do not rely on less efficient binary regression techniques. Thus, we retain parameter interpretation while allowing for efficient estimation of a variety of covariate-specific summary indices of the ROC curve.

We show that our method is additionally related to the probabilistic index model (PIM) of Thas et al. (2012); De Neve et al. (2019), who propose to model the AUC as a function of $\text{\boldmath$X$}^{*}-\text{\boldmath$X$}$ . Here, the superscript (*) indicates an alternate configuration of the modeled covariates. We discuss the PIM of the form

\displaystyle\Prob(Y<Y^{*}\mid D,D^{*},X,X^{*})=a\left(\text{$\tilde{\delta}$}(D^{*}-D)+(\text{\boldmath$X$}^{*}-\text{\boldmath$X$})^{\top}\text{\boldmath$\tilde{\xi}$}+(D^{*}\text{\boldmath$X$}^{*}-D\text{\boldmath$X$})^{\top}\text{\boldmath$\tilde{\gamma}$}\right)

where $a$ is the function as defined previously and the tilde ( $\texttildelow$ ) is used to distinguish the parameters of the PIM from those of the transformation model. Contrasting a diseased and nondiseased subject with the same covariate value, we let $D^{*}=1$ , $D=0$ , $\text{\boldmath$X$}^{*}=\text{\boldmath$X$}$ and $a=\operatorname{logit}^{-1}$ , then the probabilistic index is given by $\operatorname{expit}(\text{$\tilde{\delta}$}+\text{\boldmath$X$}^{\top}\text{\boldmath$\tilde{\gamma}$})$ . Thus, this is equivalent to the AUC resulting from a linear transformation model with $F_{Z}=\operatorname{cloglog}^{-1}$ and a shift term as in Equation 5. Similarly, when $F_{Z}=\operatorname{probit}^{-1}$ , we have the following relationship between the parameters of the linear transformation model and the PIM, $\text{$\tilde{\delta}$}=\delta/\sqrt{2}$ and $\text{\boldmath$\tilde{\gamma}$}=\text{\boldmath$\gamma$}/\sqrt{2}$ . As was the case for ROC curves modeled by transformation models, the interaction between $D$ and $X$ allows the AUC to vary with covariates. The main effect $\tilde{\xi}$ , however, does not effect the AUC for configurations where the interest is to compare the distributions of $Y_{1}$ and $Y_{0}$ among subjects with a specific value of covariates.

We can also consider more general and potentially nonlinear formulations of the shift and scale terms in our framework. For the special case of $F_{Z}=\operatorname{probit}^{-1}$ , the AUC from a shift-scale transformation model (Siegfried et al., 2022) is given by

\displaystyle\Prob(Y_{0}<Y_{1}\mid\text{\boldmath$X$}_{0}=\text{\boldmath$x$}_{0},\text{\boldmath$X$}_{1}=\text{\boldmath$x$}_{1})=\Phi\left(\frac{\mu_{1}(\text{\boldmath$x$}_{1})-\ \mu_{0}(\text{\boldmath$x$}_{0})}{\displaystyle\sqrt{\sigma_{0}(\text{\boldmath$x$}_{0})^{2}+\sigma_{1}(\text{\boldmath$x$}_{1})^{2}}}\right)

where $\text{\boldmath$X$}_{0}$ and $\text{\boldmath$X$}_{1}$ are the corresponding (potentially different) sets of covariates in the nondiseased and diseased populations, respectively. When the scale term depends only on a single set of covariates, $\sigma_{0}(\text{\boldmath$x$}_{0})=\sigma_{1}(\text{\boldmath$x$}_{1})=\sigma(\text{\boldmath$x$})$ , despite varying sets of covariates in the shift terms, all the expressions hold from Table 1 with $\delta$ replaced by $\frac{\mu_{1}(\text{\boldmath$x$}_{1})-\mu_{0}(\text{\boldmath$x$}_{0})}{\sigma(\text{\boldmath$x$})}$ . However, such closed-form expressions cannot be derived for other choices of $F_{Z}$ when the scale term depends on the disease indicator or on different sets of covariates. In such cases, AUCs and other summary indices can be derived using numeric techniques on the induced ROC curve.

2.1.3 Comparing correlated ROC curves

If two (or more) diagnostic tests are measured on the same set of subjects (paired design), the test results are likely to be correlated leading to paired or correlated ROC curves (Swets and Pickett, 1982). To compare the accuracy of these tests, an approach that is frequently used in practice is to consider a summary index of the ROC curve and evaluate a hypothesis for the equivalence of the indices. For example, DeLong et al. (1988) developed an asymptotically normal nonparametric statistic to compare correlated AUCs based on the theory of U-statistics; see also Hanley and McNeil (1983); Wieand et al. (1989); Thompson and Zucchini (1989); McClish (1989); Jiang et al. (1996); Bandos et al. (2005, 2006) who provide similar methods. The problem with using the AUC as a summary index is that diagnostic tests with different ROC curves could potentially have the same value for the AUC. To overcome this limitation, we construct a distribution-free test for differences in the marginal ROC curve parameters using multivariate transformation models that inherently account for the correlations between outcomes in their dependency structure. The null hypothesis represents the equality of entire ROC curves instead of comparing correlated summary indices. Additionally, our test allows for covariate adjusted comparisons to be made.

Let the continuous random vector $\text{\boldmath$Y$}=(Y_{1}$ , $Y_{2},\dots,Y_{J})$ denote the results of $J$ diagnostic tests, possibly dependent on a set of covariates $X$ . This notation should not be confused with subscripts denoting quantities conditional on $D=1$ . To account for the correlation between the test results, we consider a multivariate transformation model with the strictly monotonically increasing conditional transformation function $g:\mathbb{R}^{J}\mapsto\mathbb{R}^{J}$ . This function maps the unknown conditional distribution of $Y$ depending on the disease indicator and covariates, to a random vector with a known distribution $\text{\boldmath$Z$}=g(\text{\boldmath$Y$}\mid d,\text{\boldmath$x$})$ . Specifically, the vector $\text{\boldmath$Z$}=(Z_{1},\dots,Z_{J})$ follows a zero mean multivariate normal distribution $\text{\boldmath$Z$}\sim N_{J}(\mathbf{0},\mathbf{\Sigma})$ with covariance matrix $\mathbf{\Sigma}\in\mathbb{R}^{J\times J}$ . Consequently, the marginal distribution of a multivariate normal vector is itself normally distributed with $Z_{j}\sim N(0,\varsigma^{2}_{j})$ , where the diagonal elements of $\mathbf{\Sigma}$ give the variances $\varsigma_{j}^{2}$ for $j=1,\dots,J$ . Thus, the conditional joint CDF of $Y$ is given by

	$\displaystyle\Prob(\text{\boldmath$Y$}\leq\text{\boldmath$y$}\mid D=d,\text{\boldmath$X$}=\text{\boldmath$x$})$	$\displaystyle=\Prob(g(\text{\boldmath$Y$}\mid d,\text{\boldmath$x$})\leq g(\text{\boldmath$y$}\mid d,\text{\boldmath$x$}))$
		$\displaystyle=\Prob(\text{\boldmath$Z$}\leq g(\text{\boldmath$y$}\mid d,\text{\boldmath$x$}))$
		$\displaystyle={\Phi}_{\mathbf{0},\mathbf{\Sigma}}\left(g_{1}(y_{1}\mid d,\text{\boldmath$x$}),\dots,g_{J}(y_{J}\mid d,\text{\boldmath$x$})\right)$

where $\Phi_{\mathbf{0},\mathbf{\Sigma}}$ is the joint CDF of a multivariate normal distribution with a zero mean vector and covariance matrix $\mathbf{\Sigma}$ . Letting

\displaystyle g_{j}(y_{j}\mid d,\text{\boldmath$x$})

\displaystyle=\Phi^{-1}_{0,\varsigma_{j}^{2}}\left(F_{Z}(h_{j}(y_{j})-\mu_{d}(\text{\boldmath$x$}))\right)

where $\Phi_{0,\varsigma^{2}}$ is the CDF of a normal distribution with variance $\varsigma^{2}$ , the CDF of $Y$ coincides to the Gaussian copula and the marginal distribution of $Y_{j}$ is given by

	$\displaystyle\Prob(Y_{j}\leq y_{j}\mid D=d,\text{\boldmath$X$}=\text{\boldmath$x$})$	$\displaystyle={\Phi}_{\mathbf{0},\mathbf{\Sigma}}\left(\infty,\dots,\infty,g_{j}(y_{j}\mid d,\text{\boldmath$x$}),\infty,\dots,\infty\right)$
		$\displaystyle=\Phi_{0,\varsigma_{j}^{2}}\left(g_{j}(y_{j}\mid d,\text{\boldmath$x$})\right)$
		$\displaystyle=F_{Z}\left(h_{j}(y_{j})-\mu_{d,j}(\text{\boldmath$x$})\right).$

By using the multivariate model specified we allow $Y$ to have a nonlinear dependence structure through the marginal transformation functions $h_{j}$ . Analogous to our univariate transformation model, this multivariate model implies that the $j$ th marginal transformed test results follow

\displaystyle h_{j}(Y_{j})=\mu_{d,j}(\text{\boldmath$x$})+Z

where the function $\mu_{d,j}$ is the shift term for the $j$ th test result model and $Z\sim F_{Z}$ , as described previously. The marginal ROC curve is then given by

\displaystyle\text{ROC}_{j}(p\mid\text{\boldmath$x$})=1-F_{Z}\left(F_{Z}^{-1}(1-p)-\delta_{j}(\text{\boldmath$x$})\right)

where $\delta_{j}(\text{\boldmath$x$})=\mu_{1,j}(\text{\boldmath$x$})-\mu_{0,j}(\text{\boldmath$x$})$ and values of marginal summary indices can be computed from the closed-form expressions given in Table 1 using the marginal ROC curve parameters. Specifically, $\delta$ can be replaced by the covariate effect on the summary indices given by $\delta_{j}(\text{\boldmath$x$})$ . From this result, we see that the null hypothesis of equal ROC curves at a given set of covariate values $x$ for the $j$ th and $k$ th test is equivalent to testing if $\delta_{j}(\text{\boldmath$x$})-\delta_{k}(\text{\boldmath$x$})=0$ . Moreover, this hypothesis test is distribution-free as we make no assumptions about $h_{j}$ .

Klein et al. (2022) establish conditions which decompose general transformation functions into the marginal ones described above. Specifically, the covariance matrix takes the form $\mathbf{\Sigma}=\mathbf{\Lambda}^{-1}\mathbf{\Lambda}^{-\top}$ where $\mathbf{\Lambda}$ is a lower triangular $(J\times J)$ matrix whose coefficients characterize the dependence structure given by the Gaussian copula. The matrix $\mathbf{\Lambda}$ could also vary between diseased and nondiseased subjects and potentially depend on a set of covariates $x$ . This allows the dependence structure of $Y$ to change as a function of $x$ . For the special case of two diagnostic tests, the correlation between $Y_{1}$ and $Y_{2}$ for a given set of covariates $x$ is given by $\frac{-\lambda_{d}(\text{\boldmath$x$})}{\sqrt{\lambda_{d}(\text{\boldmath$x$})^{2}+1}}$ . For more details and examples of multivariate transformation models, see Klein et al. (2022); Barbanti et al. (2022).

2.2 Univariate estimation

In this section we propose estimation methods for a transformation model with univariate test results. We provide an explicit parameterization of the transformation function and the shift term. We then maximize the likelihood contributions for a potentially exact continuous, right-, left- or interval-censored datum to jointly estimate the model parameters. This enables us to fully determine the ROC curve and its summary indices as well as handle test results which are ordinal or impacted by instrument detection limits.

2.2.1 Parameterization

We parameterize the transformation function as

\displaystyle h(y\mid\text{\boldmath$\vartheta$})=\text{\boldmath$b$}(y)^{\top}\text{\boldmath$\vartheta$}=\sum_{m=0}^{M}\vartheta_{m}b_{m}(y)\quad\text{for}\quad y\in\mathbb{R},

(7)

where $\text{\boldmath$b$}(y)=(b_{0}(y),\dots,b_{M}(y))^{\top}$ is vector of $M+1$ basis functions with coefficients $\text{\boldmath$\vartheta$}\in\mathbb{R}^{M+1}$ . Polynomials in Bernstein form offer a choice of basis that provides a flexible way of estimating the underlying function. The Bernstein basis polynomial of order $M$ is defined on the interval $[l,u]$ as

\displaystyle b_{m}(y)={M\choose m}\tilde{y}^{m}(1-\tilde{y})^{M-m},\quad m=0,\dots,M,

(8)

where $\tilde{y}=\frac{y-l}{u-y}\in[0,1]$ . The restriction $\vartheta_{m}\leq\vartheta_{m+1}$ for $m=0,\dots,M-1$ , guarantees the monotonicity of $h$ . Observe that the transformation function is linear in the parameters that define it and, any nonlinearity of the test results is modeled by the basis functions. If the order $M$ is chosen to be sufficiently large, Bernstein polynomials can uniformly approximate any real-valued continuous function on an interval (Farouki, 2012).

2.2.2 Likelihood

Denote the complete parameter vector as $\text{\boldmath$\theta$}=(\text{\boldmath$\beta$}^{\top},\text{\boldmath$\vartheta$}^{\top})^{\top}$ , where $\text{\boldmath$\beta$}=(\delta,\text{\boldmath$\xi$}^{\top},\text{\boldmath$\gamma$}^{\top})^{\top}\in\mathbb{R}^{2P+1}$ are the vector of regression coefficients parameterizing the function $\mu_{d}$ from Section 2.1 and $\text{\boldmath$\vartheta$}\in\mathbb{R}^{M+1}$ are the basis coefficients. We follow the maximum likelihood approach proposed by Hothorn et al. (2018) to jointly estimate $\beta$ and $\vartheta$ . The advantages of embedding the model in the likelihood framework are as follows. (i) All forms of random censoring (right, left, interval) as well as truncation can directly be incorporated into likelihood contributions defined in terms of the distribution function (Lindsey et al., 1996). (ii) If the given model is correctly specified, under regularity conditions, the maximum likelihood estimator (MLE) will be asymptotically the most efficient estimator (minimum variance and unbiased). (iii) The MLE is asymptotically normally distributed and has a sample variance that can be computed from the inverse of the Fisher information matrix. This can be used to generate confidence intervals for the estimated parameters. (iv) The MLE is equivariant which implies invariance of the score test (or the Lagrange multiplier test) to reparameterizations (Rao, 1948; Aitchison and Silvey, 1958; Silvey, 1959; Dagenais and Dufour, 1991). Specifically, as will be shown in Section 2.4.1, by inverting the score test, our method produces confidence bands for the ROC curve and appropriate score intervals for its summary indices.

The likelihood contribution of a single observation $O=(Y,D,\text{\boldmath$X$})$ where $Y\in(\underline{y},\overline{y}]=\{y\in\mathbb{R}:\underline{y}<y\leq\overline{y}\}$ is given by

L(\text{\boldmath$\theta$}\mid O)=\begin{cases}\begin{aligned} &f_{Z}(h(y\mid\text{\boldmath$\vartheta$})-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$}))h^{\prime}(y\mid\text{\boldmath$\vartheta$})&&y\in\mathbb{R}&\quad&\text{`exact continuous'}\\ &1-F_{Z}(h(\underline{y}\mid\text{\boldmath$\vartheta$})-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$}))&&y\in(\underline{y},\infty)&\quad&\text{`right censored'}\\ &F_{Z}(h(\overline{y}\mid\text{\boldmath$\vartheta$})-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$}))&&y\in(-\infty,\overline{y})&\quad&\text{`left censored'}\\ &\begin{split}F_{Z}(h(\overline{y}\mid\text{\boldmath$\vartheta$})-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$}))-\\ F_{Z}(h(\underline{y}\mid\text{\boldmath$\vartheta$})-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$}))\end{split}&&y\in(\underline{y},\overline{y}]&\quad&\text{`interval censored'},\\ \end{aligned}\end{cases}

where $f_{Z}$ is the density function of $Z$ and $h^{\prime}(y\mid\text{\boldmath$\vartheta$})$ is the first derivative of the transformation function with respect to $y$ . Given a sample of $N$ independent and identically distributed observations $O_{i}$ for $i=1,\dots,N$ , the log-likelihood is given by $\ell(\text{\boldmath$\theta$})=\sum_{i=1}^{N}\log(L_{i}(\text{\boldmath$\theta$}))$ , where $L_{i}$ is the likelihood contribution of observation $i$ . The (unconditional) maximum likelihood estimate of $\theta$ is the solution to the optimization problem

\displaystyle\hat{\text{\boldmath$\theta$}}=(\text{\boldmath$\hat{\beta}$},\text{\boldmath$\hat{\vartheta}$})=\operatorname*{{arg\,max}}_{\text{\boldmath$\beta$},\text{\boldmath$\vartheta$}}\ell(\text{\boldmath$\beta$},\text{\boldmath$\vartheta$}),

subject to the monotonicity constraint $\vartheta_{m}\leq\vartheta_{m+1}$ for $m=0,\dots,M-1$ . The resulting ROC curve only depends on $\beta$ which is decoupled from the parameters needed to model the transformation function $\vartheta$ . The score function is defined as the first derivative of the log-likelihood function with respect to each of the parameters and is given by

\displaystyle S(\text{\boldmath$\theta$})=\left(\begin{array}[]{l}\frac{\partial\ell(\text{\boldmath$\theta$})}{\partial\text{\boldmath$\beta$}}\\ \frac{\partial\ell(\text{\boldmath$\theta$})}{\partial\text{\boldmath$\vartheta$}}\end{array}\right)=\left(\begin{array}[]{l}S_{\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})\\ S_{\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})\end{array}\right).

We perform constrained optimization using the likelihood and score contributions to determine the maximum likelihood estimates for $\beta$ and $\vartheta$ . For computational details, see Hothorn (2020). The asymptotic variance of the MLE can further be estimated by the expected Fisher information matrix which is the variance-covariance matrix of the score function and is defined as

\displaystyle I(\text{\boldmath$\theta$})=-\E\left(\begin{array}[]{ll}\frac{\partial^{2}\ell(\text{\boldmath$\theta$})}{\partial\text{\boldmath$\beta$}\partial\text{\boldmath$\beta$}^{\top}}&\frac{\partial^{2}\ell(\text{\boldmath$\theta$})}{\partial\text{\boldmath$\beta$}\partial\text{\boldmath$\vartheta$}^{\top}}\\ \frac{\partial^{2}\ell(\text{\boldmath$\theta$})}{\partial\text{\boldmath$\vartheta$}\partial\text{\boldmath$\beta$}^{\top}}&\frac{\partial^{2}\ell(\text{\boldmath$\theta$})}{\partial\text{\boldmath$\vartheta$}\partial\text{\boldmath$\vartheta$}^{\top}}\end{array}\right)=\left(\begin{array}[]{ll}I_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})&I_{\text{\boldmath$\beta$},\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})\\ I_{\text{\boldmath$\beta$},\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})^{\top}&I_{\text{\boldmath$\vartheta$},\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})\end{array}\right).

The matrix is partitioned such that the submatrix $I_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})$ corresponds to the parameter related to the disease indicator and covariates. The variation in $\vartheta$ and its relationship to $\beta$ is not of direct relevance here but is nonetheless estimated.

2.2.3 Limit of detection

Instrument precision can affect the evaluation of diagnostic biomarkers. For example, when biomarker levels are at or below the limit of detection (LOD) $y_{\text{LOD}}$ , the observed value lies in an interval $(-\infty,y_{\text{LOD}})$ and the resulting measurement is left censored. Often a replacement value is substituted for such measurements. Alternatively, only biomarker values above the LOD are used for the ROC analysis. It has been shown that these approaches lead to biased estimation (Lynn, 2001; Singh and Nocerino, 2002). Various adjustments to ROC curves and its summary indices have been proposed to handle such censored measurements (Mumford et al., 2006; Perkins et al., 2007, 2009; Ruopp et al., 2008; Bantis et al., 2017; Xiong et al., 2022). However, these methods typically do not account for covariates. Our framework naturally accounts for such observations in the likelihood function for left censored test results. Similarly, the right censored likelihood accounts for measurements which are affected by an upper limit of detection. Thus, our method provides a smooth covariate-specific ROC curve for all values of specificity with estimates and inference appropriately incorporating the observed information.

2.2.4 Ordinal test results

Many diagnostic tests of interest to clinical researchers are measured on ordinal scales. These are rank-ordered categorical variables for which quantitative differences between levels are unknown. Examples include the Memory Impairment Screen (MIS) for the measurement of cognition to distinguish Alzheimer’s disease and other dementias (Buschke et al., 1999) or the risk assessment of disease from radiological images such as with the Breast Imaging Reporting and Data System (BI-RADS) (American College of Radiology (2013), ACR). A simple reparameterization of the transformation function facilitates an extension of ROC curves to such ordinal test results.

Let the random variable $Y\in\{y_{1},\dots,y_{K}\}$ denote the results of an ordered categorical diagnostic test with $K$ categories, where $y_{k}<y_{k+1}$ for $k=1,...,K-1$ . Define the covariate-specific ROC curve as a discrete function of the subject categories

\displaystyle\left\{\left(1-F_{0}(y\mid\text{\boldmath$x$}),1-F_{1}(y\mid\text{\boldmath$x$})\right),y\in\{y_{1},\dots,y_{K}\}\right\}.

This set of points plots the covariate-specific $1-\text{specificity}$ and sensitivity coordinates for each of the $K$ categories. Suppose that the following transformation model defines the distribution of the ordinal test results in the diseased and nondiseased populations

\displaystyle F_{d}(y_{k}\mid\text{\boldmath$x$})=F_{Z}\left(h(y_{k})-\mu_{d}(\text{\boldmath$x$})\right),

where the transformation function assigns one parameter to each of the categories of the test result except the last one, $h(y_{k})=\vartheta_{k}\in\mathbb{R}$ for $k=1,\dots,K-1$ . Note that since $\Prob(Y\leq y_{K}\mid\text{\boldmath$X$}=\text{\boldmath$x$})=1$ , only $K-1$ parameters need to be specified. The monotonicity constraint $\vartheta_{k}<\vartheta_{k+1}$ is also necessary so that $F_{d}(y_{k}\mid\text{\boldmath$x$})<F_{d}(y_{k+1}\mid\text{\boldmath$x$})$ for $k=1,\dots,K-1$ . Observe that this is a special case of the general shift-scale model in Equation 2. Under a linear shift term as given in Equation 5 and $F_{Z}=\operatorname{logit}^{-1}$ , this model is known as the proportional odds logistic regression (POLR).

Setting $p_{k}=1-F_{0}(y_{k}\mid\text{\boldmath$x$})$ , the ROC curve under the ordinal transformation model is given by

\displaystyle\text{ROC}(p_{k})=1-F_{Z}\left(F_{Z}^{-1}(1-p_{k})-\delta(\text{\boldmath$x$})\right),\quad k=1,\dots,K.

The parameters of the model can be estimated by using the maximum likelihood scheme described previously with the interval censored likelihood contribution

\displaystyle F_{Z}(\vartheta_{k}-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$}))-F_{Z}(\vartheta_{k-1}-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$})),

where the observed subject has a test result of $y_{k}$ and covariate value $x$ . After estimating the parameters, it is possible to plot a continuous ROC curve as a function of all $p\in[0,1]$ . This curve’s shape depends on the choice of $F_{Z}$ and represents the underlying parametric form of the model. However, it does not represent the ROC curve for a continuous test. Specifically, the function is only defined at the $K$ discrete specificity values. Summary measures can be calculated using the expressions detailed in Table 1. Note that these would be based on a smooth interpolation given by the model rather a discrete one and lead to the interpretation of an ordinal test with a latent continuous scale.

2.3 Multivariate estimation

The estimation of a multivariate test result follows a similar approach to the univariate case. The marginal transformation functions are parameterized in terms of monotonically increasing basis polynomials of Bernstein form $h_{j}(y_{j}\mid\text{\boldmath$\vartheta$}_{j})=\text{\boldmath$b$}_{j}(y_{j})^{\top}\text{\boldmath$\vartheta$}_{j}$ , where $\text{\boldmath$b$}_{j}:\mathbb{R}\mapsto\mathbb{R}^{M+1}$ is a vector of $M+1$ basis functions with individual components $b_{j,m}$ defined as in Equation 8 for $j=1,\dots,J$ and $m=0,\dots,M$ . The constraint $\vartheta_{j,m}\leq\vartheta_{j,m+1}$ for $m=0,\dots,M-1$ , guarantees the monotonicity of $h_{j}$ and the smooth parameterization the existence of the derivative $h^{\prime}_{j}(y_{j}\mid\text{\boldmath$\vartheta$}_{j})=\text{\boldmath$b$}^{\prime}_{j}(y_{j})^{\top}\text{\boldmath$\vartheta$}_{j}$ .

With a slight abuse of notation compared to the univariate case, we also denote the vector of regression coefficients as $\text{\boldmath$\beta$}=(\text{\boldmath$\beta$}_{1}^{\top},\dots,\text{\boldmath$\beta$}_{J}^{\top})^{\top}\in\mathbb{R}^{J(2P+1)}$ and the vector of basis coefficients as $\text{\boldmath$\vartheta$}=(\text{\boldmath$\vartheta$}_{1}^{\top},\dots,\text{\boldmath$\vartheta$}_{J}^{\top})^{\top}\in\mathbb{R}^{J(M+1)}$ , with each component parameterizing a separate marginal conditional distribution function. The covariance matrix $\mathbf{\Sigma}$ captures the correlation structure between the transformed diagnostic tests and the vector $\text{\boldmath$\varsigma$}\in\mathbb{R}^{J(J+1)/2}$ contains all the elements of $\mathbf{\Sigma}$ . We denote the complete set of unknown model parameters for a transformation model with a multivariate test result as $\text{\boldmath$\theta$}=(\text{\boldmath$\beta$}^{\top},\text{\boldmath$\vartheta$}^{\top},\text{\boldmath$\varsigma$}^{\top})^{\top}$ . The likelihood contribution of a single observation $O=(\text{\boldmath$Y$},D,\text{\boldmath$X$})$ , with $\text{\boldmath$Y$}=\text{\boldmath$y$}=(y_{1},\dots,y_{J})^{\top}$ being an exact continuous vector of test results, is given by

\displaystyle L(\text{\boldmath$\theta$}\mid O)={\phi}_{\mathbf{0},\mathbf{\Sigma}}\left(g(\text{\boldmath$y$}\mid d,\text{\boldmath$x$},\text{\boldmath$\beta$},\text{\boldmath$\vartheta$})\right)\prod_{j=1}^{J}g^{\prime}_{j}\left(y_{j}\mid d,\text{\boldmath$x$},\text{\boldmath$\beta$},\text{\boldmath$\vartheta$}\right),

where $\phi_{\mathbf{0},\mathbf{\Sigma}}$ is the joint density function of a multivariate normal distribution with a zero mean vector and covariance matrix $\mathbf{\Sigma}$ . With a choice of each component $g_{j}$ from Equation 2.1.3, its derivative is given by

\displaystyle g^{\prime}_{j}(y_{j}\mid d,\text{\boldmath$x$},\text{\boldmath$\beta$},\text{\boldmath$\vartheta$})=\frac{f_{Z}\left(h_{j}(y_{j}\mid\text{\boldmath$\vartheta$})-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$})\right)h^{\prime}_{j}(y_{j}\mid\text{\boldmath$\beta$})}{\phi_{0,\varsigma_{j}^{2}}\left(\Phi^{-1}_{0,\varsigma_{j}^{2}}\left\{F_{Z}\left(h_{j}(y_{j}\mid\text{\boldmath$\vartheta$})-\mu_{d}(\text{\boldmath$x$}\mid\text{\boldmath$\beta$})\right)\right\}\right)}.

In the special case where $F_{Z}=\operatorname{probit}^{-1}$ , this derivative simplifies to $\varsigma_{j}h^{\prime}_{j}(y_{j})$ .

We derive the maximum likelihood estimate of $\theta$ from a sample of $N$ independent and identically distributed observations using constrained maximization algorithms following the strategy detailed in Section 2.2.2. For more details see Klein et al. (2022).

2.4 Confidence intervals

In the following section, we present three methods to calculate confidence bands for the ROC curve and confidence intervals for its summary indices. These quantities are functions of the parameters in the model $G(\text{\boldmath$\beta$}):\mathbb{R}^{2P+1}\rightarrow\mathbb{R}$ and to maintain nominal coverage for a confidence interval for $G(\text{\boldmath$\beta$})$ , appropriate methods are needed. The methods discussed include inversion of the score test, delta method and simulation from the asymptotic distribution of the estimate. The methods are ordered by their degree of theoretical justification. We start with score intervals which are invariant to parameter transformations but become computationally expensive when dealing with a large set of parameters. We then discuss estimating the variance using the multivariate delta method and conclude with a simple simulation method which is versatile without being computationally demanding.

2.4.1 Score intervals

In the two-sample univariate case where $\delta$ defines the ROC curve, as in Equation 4, we can construct score intervals for $\delta$ . Unlike the Wald and other commonly used intervals, score intervals are especially desirable as they are invariant to transformations of the parameters. That is, a score CI for $G(\delta)$ (e.g., the AUC $a(\delta)$ ), provides the same level of coverage as would a score CI for $\delta$ . In turn, under a correctly specified model, a score CI for $\delta$ allows the construction of accurately covered uniform confidence bands for the ROC curve as well as intervals for its summary indices such as the AUC and the Youden index.

We first generate score confidence intervals for $\delta$ by inverting the score test. In this case, the null hypothesis is given by $H_{0}:\delta=\delta_{0}$ where $\delta_{0}$ is a specific value of the parameter of interest. Under $H_{0}$ , the restricted (conditional) maximum likelihood estimator for $\vartheta$ can be obtained by

\displaystyle\text{\boldmath$\hat{\vartheta}$}(\delta_{0})=\operatorname*{{arg\,max}}_{\text{\boldmath$\vartheta$}}\ell(\delta_{0},\text{\boldmath$\vartheta$}),

or as a solution of the $M+1$ score equations $S_{\text{\boldmath$\vartheta$}}(\delta_{0},\text{\boldmath$\vartheta$})=\mathbf{0}$ . Note that this estimate is a function of $\delta_{0}$ . Letting $\text{\boldmath$\tilde{\theta}$}=(\delta_{0},\text{\boldmath$\hat{\vartheta}$}(\delta_{0}))$ , the quadratic (Rao) score statistic simplifies to

	$\displaystyle R(\delta_{0})$	$\displaystyle=S(\text{\boldmath$\tilde{\theta}$})^{\top}I^{-1}(\text{\boldmath$\tilde{\theta}$})S(\text{\boldmath$\tilde{\theta}$})$
		$\displaystyle=(S_{\delta}(\text{\boldmath$\tilde{\theta}$})^{\top},\mathbf{0}^{\top})I^{-1}(\text{\boldmath$\tilde{\theta}$})(S_{\delta}(\text{\boldmath$\tilde{\theta}$})^{\top},\mathbf{0}^{\top})^{\top}$
		$\displaystyle=S_{\delta}(\text{\boldmath$\tilde{\theta}$})^{\top}A_{\delta,\delta}(\text{\boldmath$\tilde{\theta}$})S_{\delta}(\text{\boldmath$\tilde{\theta}$}),$

where $A_{\delta,\delta}(\text{\boldmath$\tilde{\theta}$})$ denotes the submatrix corresponding to $\delta$ of the inverse Fisher information matrix and is given by the Schur complement $I_{\delta,\delta}(\text{\boldmath$\theta$})-I_{\delta,\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})I^{-1}_{\text{\boldmath$\vartheta$},\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})I_{\delta,\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})^{\top}$ . Under $H_{0}$ , $R(\delta_{0})$ converges asymptotically to a chi-square distribution with 1 degree of freedom, $R(\delta_{0})\stackrel{{\scriptstyle D}}{{\longrightarrow}}\chi^{2}_{1}$ . This result is explained by Rao (2005). Thus, inverting the score statistic by enumerating values of $\delta_{0}$ allows for the construction of $(1-\alpha)$ score confidence intervals for $\delta$ defined as

\displaystyle\{\delta_{0}\in\mathbb{R}\mid R(\delta_{0})<\chi^{2}_{1}(1-\alpha)\},

where $\chi^{2}_{P}(1-\alpha)$ is the $(1-\alpha)$ quantile value of the $\chi^{2}_{1}$ distribution. Equivalently, we can use the square root of the score statistic to construct a $(1-\alpha)$ score interval using quantiles of the standard normal distribution, $\{\delta_{0}\in\mathbb{R}\mid\Phi^{-1}(\alpha/2)<\sqrt{R(\delta_{0})}\leq\Phi^{-1}(1-\alpha/2)\}$ . Finally, we apply the function $G$ to both the lower and upper limits of the interval to construct score confidence bands for the ROC curve or score confidence intervals for its summary indices

The score statistic is given by $R(\delta_{0})=S_{\delta}(\text{\boldmath$\tilde{\theta}$})^{2}A_{\delta,\delta}(\text{\boldmath$\tilde{\theta}$})$ . Testing if there is a significant difference between the nondiseased and diseased populations coincides to the hypothesis test, $H_{0}:\delta=0$ . This is computationally efficient because only the distribution of $R(0)$ needs to be computed. However, computing score CIs requires updating the restricted MLEs $\text{\boldmath$\hat{\vartheta}$}(\delta_{0})$ for an enumeration of $\delta_{0}$ values. This becomes computationally intractable when enumerating a higher dimensional grid of parameters.

2.4.2 Delta method

Since the MLE satisfies

\displaystyle\sqrt{n}(\text{\boldmath$\hat{\beta}$}-\text{\boldmath$\beta$})\xrightarrow{D}N_{P+1}\left(\mathbf{0},A_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})\right),

then by the multivariate delta method, $G(\text{\boldmath$\hat{\beta}$})$ also follows a normal distribution with

\displaystyle\mathbb{V}(G(\text{\boldmath$\hat{\beta}$}))=\frac{1}{n}\nabla G(\text{\boldmath$\beta$})^{\top}A_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})\nabla G(\text{\boldmath$\beta$}),

where $\nabla G(\text{\boldmath$\beta$})$ is the gradient of $G$ evaluated at $\beta$ and the inverse Fisher information matrix $A_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})$ is given by the Schur complement $I_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})-I_{\text{\boldmath$\beta$},\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})I^{-1}_{\text{\boldmath$\vartheta$},\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})I_{\text{\boldmath$\beta$},\text{\boldmath$\vartheta$}}(\text{\boldmath$\theta$})^{\top}$ . For example, when the shift term takes the linear form as in Equation 5 and $G$ defines the AUC function for $F_{Z}=\operatorname{probit}^{-1}$ , the entries of $\nabla G(\text{\boldmath$\beta$})$ are given by

\displaystyle\frac{\partial G(\text{\boldmath$\beta$})}{\partial\delta}=\frac{1}{\sqrt{2}}C,\quad\frac{\partial G(\text{\boldmath$\beta$})}{\partial\xi_{i}}=0\quad\text{and}\quad\frac{\partial G(\text{\boldmath$\beta$})}{\partial\gamma_{i}}=\frac{x_{i}}{\sqrt{2}}C,

where $C=\phi\left(\frac{\delta+\text{\boldmath$x$}^{\top}\text{\boldmath$\gamma$}}{\sqrt{2}}\right)$ , $\phi$ is the density of the standard normal distribution and $i$ indexes the $P$ covariates. In general, the gradient can be estimated by calculating such derivatives and evaluating the resulting function at the MLE. Similarly, the variance-covariance matrix of the estimated parameters $A_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\text{\boldmath$\theta$})$ can be computed by inverting the numerically evaluated Hessian matrix. Thus, a $(1-\alpha)$ level confidence interval for $G(\text{\boldmath$\beta$})$ is given by

\displaystyle G(\text{\boldmath$\beta$})\pm\Phi^{-1}(\alpha/2)\sqrt{\hat{\mathbb{V}}(G(\text{\boldmath$\hat{\beta}$}))}.

2.4.3 Simulated intervals

When the function $G$ has complex derivatives, as would be the case for nonlinear shift terms $\mu_{d}(\text{\boldmath$x$})$ or when calculating optimal thresholds $c^{*}$ where $G$ includes the inverse of the transformation function, constructing confidence intervals using the delta method becomes infeasible. For these cases, we apply a simple simulation-based algorithm which utilizes the asymptotic normality of the MLE to calculate confidence intervals for the ROC curve and its summary indices, which are functions of the parameters of interest. The steps of the algorithm to construct $(1-\alpha)$ level confidence intervals for $G(\text{\boldmath$\hat{\beta}$})$ can be summarized as follows:

1.

Generate $B$ independent samples from the asymptotic multivariate normal distribution of the parameter estimates $N_{P+1}\left(\text{\boldmath$\hat{\beta}$},\frac{1}{n}\hat{A}_{\text{\boldmath$\beta$},\text{\boldmath$\beta$}}(\mathbf{\hat{\Theta}})\right)$ and denote as $\text{\boldmath$\hat{\beta}$}^{*}_{1},\dots,\text{\boldmath$\hat{\beta}$}^{*}_{B}$ .
2.

For each sample $b=1,\dots,B$ , calculate the function of interest $G(\text{\boldmath$\hat{\beta}$}^{*}_{b})$ .
3.

Construct the confidence interval $\left(Q_{G(\text{\boldmath$\hat{\beta}$}^{*})}(\alpha/2),Q_{G(\text{\boldmath$\hat{\beta}$}^{*})}(1-\alpha/2)\right)$ , where $Q_{G(\text{\boldmath$\hat{\beta}$}^{*})}$ is the empirical quantile function of the sample $G(\text{\boldmath$\hat{\beta}$}^{*}_{1}),\dots,G(\text{\boldmath$\hat{\beta}$}^{*}_{B})$ .

A similar algorithm is presented in Mandel (2013), who discuss its asymptotic validity and present several examples that show its empirical coverage adheres to nominal levels with results similar to the delta method.

3 Empirical evaluation

We conducted a simulation study to evaluate the performance of our estimators in the two-sample setting. We considered a data generating process such that nondiseased test results followed a standard normal distribution $F_{0}(y)=\Phi(y)$ and the diseased results a distribution with the CDF $F_{1}(y)=F_{Z}(F_{Z}^{-1}(\Phi(y))-\delta)$ . To obtain different shapes of the ROC curve, we chose three choices of $F_{Z}\in\{\operatorname{probit}^{-1},\operatorname{logit}^{-1},\operatorname{cloglog}^{-1}\}$ and varied $\delta$ such that the $\text{AUC}\in\{0.5,0.65,0.8,0.95\}$ or that $J\in\{0,0.25,0.5,0.8\}$ , leading to a variety of configurations. Under this simulation paramaterization, the ROC curves followed the form of Equation 4 and summary indices could be calculated as a function of the estimated $\delta$ from Table 1. The conventional binormal model corresponded to $F_{Z}=\operatorname{probit}^{-1}=\Phi$ , which induced proper binormal ROC curves. This was the only configuration where the test results for both groups were normally distributed. We included this configuration to ascertain the loss of power associated with our estimators when the standard binormal assumption held. For other choices of $F_{Z}$ with $\text{AUC}>0.5$ , the resulting distributions of the diseased test results were non-normal, with variances and higher moments differing between the two groups. Specifically, the configuration of $F_{Z}=\operatorname{logit}^{-1}$ led to light tailed distributions for the diseased test results, whilst $F_{Z}=\operatorname{cloglog}^{-1}$ led to skewed, heavy-tailed distributions. The corresponding density functions for the data generating model with selected AUC values are given in Figure 1.

Refer to caption — Figure 1: Density functions for the model used to generate the data for the simulations. The nondiseased test results followed a standard normal distribution corresponding to an $\text{AUC}=0.5$ . The diseased test results varied with three choices of $F_{Z}$ : $\operatorname{probit}^{-1}$ , $\operatorname{logit}^{-1}$ and $\operatorname{cloglog}^{-1}$ each of which had an AUC of 0.5, 0.65, 0.8 and 0.95.

For 10 000 replications of each configuration, we generated balanced data sets with sample sizes $N_{0}=N_{1}\in\{25,50,100\}$ . The transformation models discussed in Section 2 were fitted to the simulated data sets assuming a parameterization of the transformation function given by a Bernstein basis polynomial of order $M=6$ . The true data-generating model had a nonlinear transformation function $h=F_{Z}^{-1}\circ\Phi$ . Our model estimation procedure aimed to approximate this function alongside the shift parameter $\delta$ . The functions implementing these models are available from the \pkgtram package (Hothorn, 2020), and included \codeBoxCox, \codeColr and \codeCoxph models which corresponded to $F_{Z}$ selected as $\operatorname{probit}^{-1}$ , $\operatorname{logit}^{-1}$ and $\operatorname{cloglog}^{-1}$ , respectively.

Table 2 displays the bias and root-mean-square error (RMSE) for the AUC estimates using different methods under the various configurations. We found that all three methods had minimal bias and equivalent RMSEs for an $\text{AUC}=0.5$ , where the test results were unable to distinguish between the two groups. The \codeBoxCox and \codeColr models yielded approximately unbiased AUC estimates in all cases, even when they were misspecified for the true data generating process. However, estimates based on the proportional hazards model, \codeCoxph, were biased for data generating processes other than $F_{Z}=\operatorname{cloglog}^{-1}$ . We also computed nominal 90% score confidence intervals for the AUC using the approach described in Section 2.4.1. The observed coverage probabilities are summarized in Table 3. When $F_{Z}$ was correctly specified, all methods had close to nominal coverage under all sample size configurations. In addition, the score confidence intervals from the \codeColr model were accurate even when it was misspecified for the true data generating process. This was unlike the \codeBoxCox and \codeCoxph models, whose coverage deteriorated for higher AUC values when they were misspecified.

We compared our approaches to a set of alternative methods for computing confidence intervals for the AUC and Youden index. The software details of all the methods used alongside their respective features and references are summarized in Supplementary Material Table S1. We detail the empirical coverage and average width of the confidence intervals for the AUC in Figures S1 and S2, respectively. Estimates based on transformation models (\proglangR packages \pkgtram, \pkgorm, \pkgpim) yielded coverage close to the nominal level and significantly outperformed the other methods when the model was correctly specified for the true data generating process. All other methods generally performed close to nominal levels for low to medium AUC values (0.5-0.8), but broke down for higher AUC values. Methods which used $F_{Z}=\operatorname{cloglog}^{-1}$ gave confidence intervals which were shorter in length (overconfident) and otherwise there were no notable differences between the methods.

Analogously, Figures S3 and S4 detail the coverage and length of the confidence intervals for the Youden index. The methods which were based on the overlap coefficient failed to cover the configuration where $J=0$ because their lower limits were never below 0. Our methods estimated confidence intervals for $\delta\in\mathbb{R}$ which naturally accounted for this scenario. Specifically, the lower limit of the CI for $J$ is given by $G(\max\{0,\delta^{l}\})$ and the upper limit by $G(\max\{\lvert\delta^{l}\rvert,\lvert\delta^{u}\rvert\})$ , where $G$ is the Youden index function defined in Table 1 and $(\delta^{l},\delta^{u})$ is the confidence interval for $\delta$ . Remarkably, the transformation model with $F_{Z}=\operatorname{logit}^{-1}$ provided coverage at nominal levels for all simulation configurations with a relatively small confidence interval width. The approach of Franco-Pereira et al. (2021) (FP) was also accurate under model misspecification but was more involved. Namely, it consisted of estimating Box Cox transformation parameters under a binormal framework with bootstrap variance, all carried out on the logit scale and then back-transformed. In a setting with covariates, censoring or with $J=0$ this methodology would be limited.

Figures S5 and S6 show the coverage and area of the confidence bands for the ROC curve. All the approaches based on transformation models covered the configuration with $\text{AUC}=0.5$ accurately. However, the other approaches did not yield coverage close to nominal levels in this configuration with the exception of Martínez-Camblor et al. (2018), whose confidence bands had a significantly larger area. For all other configurations, only transformation models which were correctly specified for the true data-generating model provided accurate results.

			$F_{Z}$	$\text{probit}^{-1}$		$\text{logit}^{-1}$		$\text{cloglog}^{-1}$
AUC	Method	$N_{0}=N_{1}$		Bias	RMSE	Bias	RMSE	Bias	RMSE
0.5	\codeBoxCox	25		0.001	0.083	0.000	0.082	0.000	0.082
		50		0.000	0.057	0.000	0.057	0.001	0.057
		100		0.000	0.041	0.000	0.040	0.000	0.040
	\codeColr	25		0.001	0.084	0.000	0.083	0.001	0.083
		50		0.000	0.058	-0.001	0.058	0.001	0.058
		100		0.000	0.041	0.000	0.041	0.000	0.041
	\codeCoxph	25		0.001	0.077	0.000	0.076	0.000	0.077
		50		0.000	0.052	0.000	0.052	0.001	0.052
		100		0.000	0.036	0.000	0.036	0.000	0.036
0.65	\codeBoxCox	25		0.003	0.078	-0.002	0.078	0.004	0.078
		50		0.001	0.055	-0.004	0.055	0.002	0.054
		100		0.000	0.038	-0.006	0.040	0.002	0.038
	\codeColr	25		0.001	0.078	0.002	0.078	0.002	0.079
		50		0.000	0.055	0.001	0.055	0.000	0.055
		100		0.000	0.038	0.000	0.039	0.001	0.038
	\codeCoxph	25		-0.021	0.077	-0.028	0.080	0.006	0.071
		50		-0.025	0.057	-0.032	0.061	0.003	0.049
		100		-0.027	0.044	-0.034	0.050	0.002	0.034
0.8	\codeBoxCox	25		0.004	0.061	-0.002	0.066	0.003	0.063
		50		0.002	0.043	-0.004	0.046	0.001	0.044
		100		0.001	0.030	-0.006	0.033	0.000	0.031
	\codeColr	25		0.000	0.061	0.002	0.063	0.002	0.062
		50		-0.002	0.043	0.001	0.044	0.000	0.044
		100		-0.002	0.030	0.000	0.031	0.000	0.031
	\codeCoxph	25		-0.038	0.075	-0.049	0.086	0.007	0.054
		50		-0.043	0.063	-0.056	0.075	0.003	0.038
		100		-0.047	0.057	-0.060	0.070	0.001	0.027
0.95	\codeBoxCox	25		0.001	0.027	0.003	0.030	0.003	0.031
		50		0.001	0.019	0.002	0.022	0.003	0.022
		100		0.001	0.014	0.002	0.016	0.003	0.016
	\codeColr	25		-0.006	0.028	0.000	0.028	0.001	0.028
		50		-0.007	0.021	0.000	0.020	0.001	0.019
		100		-0.007	0.015	0.000	0.014	0.001	0.014
	\codeCoxph	25		-0.038	0.054	-0.042	0.065	0.003	0.023
		50		-0.041	0.050	-0.049	0.062	0.002	0.016
		100		-0.044	0.049	-0.054	0.061	0.002	0.011

Table 2: Bias and RMSE simulation results for estimation of the AUC. Data were generated using

F_{0}(y)=\Phi(y)

for nondiseased results and

F_{1}(y)=F_{Z}(F_{Z}^{-1}(\Phi(y))-\delta)

for the diseased results. We varied

F_{Z}\in\{\operatorname{probit}^{-1}

\operatorname{logit}^{-1}

and

\operatorname{cloglog}^{-1}\}

, AUC and sample size. The proposed linear transformation models \codeBoxCox, \codeColr and \codeCoxph were used from the \pkgtram package to estimate

\delta

and the transformation function

h=F_{Z}^{-1}\circ\Phi

			$F_{Z}$
AUC	Method	$N_{0}=N_{1}$	$\text{probit}^{-1}$	$\text{logit}^{-1}$	$\text{cloglog}^{-1}$
0.5	\codeBoxCox	25	0.888	0.890	0.890
		50	0.895	0.894	0.894
		100	0.896	0.900	0.894
	\codeColr	25	0.890	0.890	0.893
		50	0.898	0.894	0.895
		100	0.895	0.901	0.896
	\codeCoxph	25	0.874	0.875	0.873
		50	0.887	0.887	0.886
		100	0.895	0.897	0.886
0.65	\codeBoxCox	25	0.883	0.882	0.890
		50	0.891	0.891	0.891
		100	0.899	0.888	0.897
	\codeColr	25	0.887	0.891	0.894
		50	0.894	0.897	0.892
		100	0.903	0.895	0.895
	\codeCoxph	25	0.851	0.831	0.878
		50	0.825	0.794	0.890
		100	0.774	0.713	0.897
0.8	\codeBoxCox	25	0.892	0.868	0.883
		50	0.893	0.876	0.885
		100	0.898	0.874	0.893
	\codeColr	25	0.902	0.893	0.895
		50	0.905	0.898	0.898
		100	0.908	0.899	0.906
	\codeCoxph	25	0.762	0.685	0.892
		50	0.650	0.538	0.892
		100	0.462	0.304	0.896
0.95	\codeBoxCox	25	0.885	0.831	0.810
		50	0.895	0.840	0.832
		100	0.894	0.839	0.830
	\codeColr	25	0.907	0.888	0.867
		50	0.901	0.897	0.887
		100	0.877	0.896	0.889
	\codeCoxph	25	0.557	0.532	0.855
		50	0.355	0.312	0.878
		100	0.119	0.100	0.871

Table 3: Observed coverage probabilities of nominal 90% score confidence intervals for the AUC. Score intervals were calculated for

\delta

from the proposed linear transformation models \codeBoxCox, \codeColr and \codeCoxph. The score intervals for the AUC were then calculated by transforming the intervals for

\delta

using the closed-form expressions given in Table 1.

4 Application

The prevalence of obesity has increased consistently for most countries in the recent decade and this trend is a serious global health concern (Abarca-Gómez et al., 2017). Obesity contributes directly to increased risk of cardiovascular disease (CVD) and its risk factors, including type 2 diabetes, hypertension and dyslipidemia (Zalesin et al., 2008; Grundy, 2004). Metabolic syndrome (MetS) refers to the joint presence of several cardiovascular risk factors and is characterized by insulin resistance (Eckel et al., 2010). The National Cholesterol Education Program Adult Treatment Panel III (NCEP-ATP III) criteria is the most widely used definition for MetS, but it requires laboratory analysis of a blood sample. This has led to the search for non-invasive techniques which allow reliable and early detection of MetS.

Waist-to-height ratio (WHtR) is a well-known anthropometric index used to predict visceral obesity. Visceral obesity is an independent risk factor for development of MetS by means of the increased production of free fatty acids whose presence obstructs insulin activity (Bosello and Zamboni, 2000). This suggests that higher values of WHtR, reflecting obesity and CVD risk factors, are more indicative of incident MetS. Several studies have found that WHtR is highly predictive of MetS (Shao et al., 2010; Romero-Saldaña et al., 2016; Suliga et al., 2019). However, as waist circumference changes with age and gender (Stevens et al., 2010), it is also important to study whether or not the performance of WHtR at diagnosing MetS is impacted by these variables. Evaluation of WHtR as a predictor of MetS after adjusting for covariates is necessary so that more tailored interventions can be initiated to improve outcomes.

We illustrate the use of our methods to data from a cross sectional study designed to validate the use of WHtR and systolic blood pressure (SBP) as markers for early detection of MetS in a working population from the Balearic Islands (Spain). Detailed descriptions of the study methodology and population characteristics are reported in Romero-Saldaña et al. (2018). Briefly, data on 60 799 workers were collected during their work health periodic assessments between 2012 and 2016. Presence of MetS was determined by the NCEP-ATP III criteria and the sample consisted of 5 487 workers with MetS.

4.1 Two-sample analysis

$F_{Z}$	$\delta$	AUC	$J$
$\Phi$	1.492 (1.462, 1.521)	0.854 (0.849, 0.859)	0.544 (0.535, 0.553)
$\operatorname{expit}$	2.785 (2.730, 2.841)	0.871 (0.866, 0.875)	0.602 (0.593, 0.611)
$\text{cloglog}^{-1}$	1.186 (1.157, 1.215)	0.766 (0.761, 0.771)	0.412 (0.403, 0.421)
$\text{loglog}^{-1}$	1.425 (1.397, 1.453)	0.806 (0.802, 0.810)	0.484 (0.475, 0.492)

Table 4: Estimates and 95% score confidence intervals of the shift paramater, AUC and

J

in the two-sample linear transformation model for WHtR as a marker of MetS.

We first examined the unconditional performance of WHtR as a marker to diagnose MetS, denoted $Y$ and $D$ , respectively. We fit a linear transformation model which leads to the ROC curve of the form in Equation 4, where $\delta$ is the shift parameter, for various choices of the inverse link function $F_{Z}$ . Associated inference of the AUC and $J$ are calculated using the closed-form expressions from Table 1. The resulting estimates are presented in Table 4. The AUCs are consistently bounded away from 0.5 indicating a good capacity of WHtR to discriminate between workers with and without MetS. This can also be seen from the estimated ROC curve plotted in Figure 2 which lies well above the diagonal line as well as the modeled densities which have a small degree of overlap. The confidence intervals and uniform confidence bands are quite small due to the large sample size.

4.2 Conditional ROC analysis

Next, we investigated if the discriminatory ability of WHtR in separating workers with and without MetS varies with covariates. We considered a transformation model that included the main effects of covariates plus interaction terms with the disease indicator, which leads to the ROC curve given by

\displaystyle\text{ROC}(p\mid\text{\boldmath$x$})=1-F_{Z}(F_{Z}^{-1}(1-p)-(\delta+\gamma^{\top}\text{\boldmath$x$}))

where the covariates $x$ included age, gender and tobacco consumption and the inverse link function $F_{Z}=\operatorname{logit}^{-1}$ . Figure 3 displays the covariate-specific ROC curves fitted to these data. The performance of WHtR appeared to be better for females compared to males and decreased with age. The effect of smoking, although significant in the model, does not seem to substantially alter the ROC curves given the other covariates are kept fixed. To inspect the covariate effect further, we calculated the age- and gender-specific AUCs and Youden indices from the model. Figure 4 clearly shows that the discriminatory capabilities of WHtR in distinguishing workers with MetS is consistently better for females and decreases with age.

4.3 Comparing correlated biomarkers

Type	WHtR	SBP	Difference
$\delta_{j}$	1.874 (1.811, 1.945)	1.214 (1.151, 1.273)	0.660 (0.578, 0.750)
AUC	0.907 (0.900, 0.915)	0.805 (0.792, 0.816)	0.103 (0.090, 0.117)
$J$	0.651 (0.635, 0.669)	0.456 (0.435, 0.476)	0.195 (0.171, 0.222)

Table 5: Marginal estimates and 95% Wald confidence intervals of the shift paramater, AUC and Youden index from the two-sample bivariate normal transformation model for WHtR and SBP as markers of MetS for a 40-year non-smoking female worker. Wald confidence intervals are calculated by simulating from the asymptotic distribution of the MLE.

Finally, we compare the performance of WHtR to SBP, explicitly accounting for the within-patient clustered nature of these measurements. That is, both markers are measured on the same subject and hence their measurements are correlated. We consider a transformation model for the joint distribution of $\text{\boldmath$Y$}=(\text{WHtR},\text{SBP})$ that induces a marginal ROC curve for the $j$ th marker, given by

\displaystyle\text{ROC}_{j}(p\mid\text{\boldmath$x$})=1-F_{Z}(F_{Z}^{-1}(1-p)-(\delta_{j}+\gamma_{j}^{\top}\text{\boldmath$x$}))\quad\text{for }j\in\{\text{WHtR, SBP}\},

where $\delta_{j}$ is the marginal shift parameter, $\gamma_{j}$ are the marginal interaction coefficients and $F_{Z}=\operatorname{probit}^{-1}$ . Figure 5 shows the corresponding estimated marginal ROC curves for the two markers for a 40-year old non-smoking female worker. The uniform confidence band is generated by the simulation procedure detailed in Section 2.4.3. Whilst both markers perform well at predicting MetS, WHtR seems to be favorable with a ROC curve uniformly above SBP. We tested the hypothesis for equality of ROC curves $H_{0}:\text{ROC}_{1}(p)=\text{ROC}_{2}(p)$ for all $p\in[0,1]$ , using the marginal coefficients from the joint model. This hypothesis test is equivalent to testing if the difference in the AUCs or Youden indices of the two markers is equal to 0. The estimates, differences and corresponding confidence intervals are detailed in Table 5. The difference of 0.103 between AUCs is significant ( $p$ value $<0.001$ ) and indicates that the WHtR is better at discriminating incident MetS compared to SBP.

5 Discussion

This article presents a new modeling framework for ROC analysis that can be used to characterize the accuracy of medical diagnostic tests. Our model is based on estimating an unknown transformation function for the test results and yields a distribution-free yet model-based estimator for the ROC curve. Covariates that influence the diagnostic accuracy of tests can naturally be accommodated as regression parameters into the model and covariate-specific summary indices such as the AUC and Youden index are easily computed using closed-form expressions. We extend the same conceptual transformation framework to a multivariate setting for comparing correlated diagnostic tests after appropriately accounting for their dependency structure.

Our proposed approach has several features which distinguish it from contemporary methods of ROC analysis. Firstly, we employ maximum likelihood to jointly estimate all parameters defining the transformation function and regression coefficients. This implies the variation in the estimated transformation parameters is accounted for and appropriately propagated for the inference of the ROC curve. In turn, asymptotic efficiency is guaranteed for our estimators and we avoid reliance on resampling procedures for the construction of confidence intervals. Secondly, transformation models focus on estimating the conditional distribution function whose evaluation directly provides the likelihood contributions for interval, right- and left-censored data that commonly arises due to instrument detection limits. Additionally, with a simple reparameterization of the transformation function, our model extends to ROC curves for ordinal test results. Thirdly, no strong assumptions are made regarding the transformation function which results in a highly flexible model that retains interpretability of the regression coefficients. Fourthly, we develop a test for covariate-specific correlated ROC curves, a problem which has yet to be addressed in the literature. Lastly, software implementations for all the methods described in this article are available in the \pkgtram \proglangR package (see Supplementary Material for example code), thus enabling a unified framework for ROC analysis.

One aspect that warrants further investigation is model selection, specifically with regards to the choice of $F_{Z}$ . In general, this may depend on the aim of the study and distinct features of the data. Our simulation explored the extent to which the bias and coverage of the resulting estimates were affected by model choice. We found that a model with $F_{Z}=\operatorname{logit}^{-1}$ provided accurate results even when it was misspecified for the true data generating process. This model also behaves very similarly to the semiparametric cumulative probability model (Tian et al., 2020), also known as a proportional odds ordinal logistic model, both which estimate a log-odds ratio $\delta$ . The equivalence of the transformation model’s odds ratio to the MWW test statistic has been well studied (Wang and Tian, 2017). The MWW statistic has a bounded influence function and is robust to contaminations of the specified model (Hampel, 1974). Due to their equivalence, we hypothesize that the transformation model with $F_{Z}=\operatorname{logit}^{-1}$ is also endowed with the same robustness properties as the MWW and can be chosen when no a priori model is known. However, further research is needed to develop suitable model verification and selection methods.

Our proposed estimators assume that the difference between the diseased and nondiseased distributions is described by a shift-term, $\mu_{d}$ on the scale of $F_{Z}$ . A relaxation of this assumption would allow for separate transformation functions in each of the two groups. Namely, consider a model where the nondiseased results follow a distribution with the CDF $F_{0}(y)=F_{Z}(h_{0}(y))$ and the diseased with the CDF $F_{1}(y)=F_{Z}(h_{1}(y))$ . Defining a new transformation function $r=h_{1}\circ h_{0}^{-1}\circ F_{Z}^{-1}:[0,1]\mapsto\mathbb{R}$ , the smooth ROC curve with no shift assumptions is given by $\text{ROC}(p)=1-F_{Z}(r(1-p))$ . This model has more flexibility but sacrifices the properness property desirable for ROC curves. Furthermore, care has to be taken in defining the correct likelihood contributions for accurate inference of this model as uncertainty enters from both transformation functions.

In future work we plan to pursue various extensions of transformation models for ROC analysis to consider (1) penalty terms for high-dimensional covariates (Kook and Hothorn, 2021), (2) mixed effects for clustered observations (Tamási and Hothorn, 2021), and (3) covariate-dependent transformation functions through forest-based machine learning methods (Hothorn and Zeileis, 2021).

Funding

Financial support by Swiss National Science Foundation, grant number 200021_184603, is gratefully acknowledged.

References

Abarca-Gómez et al. (2017) Abarca-Gómez L, Abdeen ZA, Hamid ZA, Abu-Rmeileh NM, Acosta-Cazares B, Acuin C, Adams RJ, Aekplakorn W, Afsana K, Aguilar-Salinas CA, et al. (2017). “Worldwide trends in body-mass index, underweight, overweight, and obesity from 1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128· 9 million children, adolescents, and adults.” The Lancet, 390(10113), 2627–2642.
Aitchison and Silvey (1958) Aitchison J, Silvey S (1958). “Maximum-likelihood estimation of parameters subject to restraints.” The Annals of Mathematical Statistics, pp. 813–828.
Alonzo and Pepe (2002) Alonzo TA, Pepe MS (2002). “Distribution-free ROC analysis using binary regression techniques.” Biostatistics, 3(3), 421–432.
American College of Radiology (2013) (ACR) American College of Radiology (ACR) (2013). ACR BI-RADS atlas: Breast Imaging Reporting and Data System. 5th edition. American College of Radiology.
Bamber (1975) Bamber D (1975). “The area above the ordinal dominance graph and the area below the receiver operating characteristic graph.” Journal of Mathematical Psychology, 12(4), 387–415.
Bandos et al. (2005) Bandos AI, Rockette HE, Gur D (2005). “A permutation test sensitive to differences in areas for comparing ROC curves from a paired design.” Statistics in Medicine, 24(18), 2873–2893.
Bandos et al. (2006) Bandos AI, Rockette HE, Gur D (2006). “A permutation test for comparing ROC curves in multireader studies: a multi-reader ROC, permutation test.” Academic Radiology, 13(4), 414–420.
Bantis and Feng (2016) Bantis LE, Feng Z (2016). “Comparison of two correlated ROC curves at a given specificity or sensitivity level.” Statistics in Medicine, 35(24), 4352–4367.
Bantis et al. (2017) Bantis LE, Yan Q, Tsimikas JV, Feng Z (2017). “Estimation of smooth ROC curves for biomarkers with limits of detection.” Statistics in Medicine, 36(24), 3830–3843.
Barbanti et al. (2022) Barbanti L, Brandl R, Hothorn T (2022). “Multi-species count transformation models.” arXiv:2201.13095 [stat.ME], URL https://arxiv.org/abs/2201.13095.
Bianco et al. (2020) Bianco AM, Boente G, Gonzalez-Manteiga W (2020). “A robust approach for ROC curves with covariates.” arXiv:2007.00150 [stat.ME], URL https://arxiv.org/abs/2007.00150.
Bickel (1986) Bickel P (1986). “Efficient Testing in a Class of Transformation Models.” In Proceedings of the 45th Session of the International Statistical Institute, volume 23, pp. 63–23.
Bickel and Doksum (1981) Bickel PJ, Doksum KA (1981). “An analysis of transformations revisited.” Journal of the American Statistical Association, 76(374), 296–311.
Bickel et al. (1993) Bickel PJ, Klaassen CA, Bickel PJ, Ritov Y, Klaassen J, Wellner JA, Ritov Y (1993). Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press Baltimore.
Bosello and Zamboni (2000) Bosello O, Zamboni M (2000). “Visceral obesity and metabolic syndrome.” Obesity Reviews, 1(1), 47–56.
Box and Cox (1964) Box GE, Cox DR (1964). “An analysis of transformations.” Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211–243.
Branscum et al. (2008) Branscum AJ, Johnson WO, Hanson TE, Gardner IA (2008). “Bayesian semiparametric ROC curve estimation and disease diagnosis.” Statistics in Medicine, 27(13), 2474–2496.
Buschke et al. (1999) Buschke H, Kuslansky G, Katz M, Stewart WF, Sliwinski MJ, Eckholdt HM, Lipton RB (1999). “Screening for dementia with the memory impairment screen.” Neurology, 52(2), 231–231.
Cai and Moskowitz (2004) Cai T, Moskowitz CS (2004). “Semi-parametric estimation of the binormal ROC curve for a continuous diagnostic test.” Biostatistics, 5(4), 573–586.
Cai and Pepe (2002) Cai T, Pepe MS (2002). “Semiparametric receiver operating characteristic analysis to evaluate biomarkers for disease.” Journal of the American Statistical Association, 97(460), 1099–1107.
Dagenais and Dufour (1991) Dagenais MG, Dufour JM (1991). “Invariance, nonlinear models, and asymptotic tests.” Econometrica: Journal of the Econometric Society, 59(6), 1601–1615.
De Neve et al. (2019) De Neve J, Thas O, Gerds TA (2019). “Semiparametric linear transformation models: Effect measures, estimators, and applications.” Statistics in Medicine, 38(8), 1484–1501.
DeLong et al. (1988) DeLong ER, DeLong DM, Clarke-Pearson DL (1988). “Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach.” Biometrics, 44(3), 837–845.
Demidenko (2012) Demidenko E (2012). “Confidence intervals and bands for the binormal ROC curve revisited.” Journal of Applied Statistics, 39(1), 67–79.
Dodd and Pepe (2003) Dodd LE, Pepe MS (2003). “Partial AUC estimation and regression.” Biometrics, 59(3), 614–623.
Dorfman and Alf Jr (1969) Dorfman DD, Alf Jr E (1969). “Maximum-likelihood estimation of parameters of signal-detection theory and determination of confidence intervals-rating-method data.” Journal of Mathematical Psychology, 6(3), 487–496.
Dorfman et al. (1997) Dorfman DD, Berbaum KS, Metz CE, Lenth RV, Hanley JA, Dagga HA (1997). “Proper receiver operating characteristic analysis: the bigamma model.” Academic Radiology, 4(2), 138–149.
Eckel et al. (2010) Eckel RH, Alberti KG, Grundy SM, Zimmet PZ (2010). “The metabolic syndrome.” The Lancet, 375(9710), 181–183.
Egan (1975) Egan JP (1975). Signal Detection Theory and ROC-Analysis. Academic Press.
Erkanli et al. (2006) Erkanli A, Sung M, Jane Costello E, Angold A (2006). “Bayesian semi-parametric ROC analysis.” Statistics in Medicine, 25(22), 3905–3928.
Faraggi (2003) Faraggi D (2003). “Adjusting receiver operating characteristic curves and related indices for covariates.” Journal of the Royal Statistical Society: Series D (The Statistician), 52(2), 179–192.
Farouki (2012) Farouki RT (2012). “The Bernstein polynomial basis: A centennial retrospective.” Computer Aided Geometric Design, 29(6), 379–419.
Fay (2022) Fay MP (2022). \pkgasht: Applied Statistical Hypothesis Tests. \proglangR package version 0.9.7, URL https://CRAN.R-project.org/package=asht.
Feller (1991) Feller W (1991). An Introduction to Probability Theory and Its Applications. Wiley, New York.
Feng et al. (2020) Feng D, Manevski D, Perme MP (2020). \pkgauRoc: Various Methods to Estimate the AUC. \proglangR package version 0.2-1, URL https://CRAN.R-project.org/package=auRoc.
Franco-Pereira et al. (2021) Franco-Pereira AM, Nakas CT, Reiser B, Carmen Pardo M (2021). “Inference on the overlap coefficient: The binormal approach and alternatives.” Statistical Methods in Medical Research, 30(12), 2672–2684.
Gail and Green (1976) Gail MH, Green SB (1976). “A generalization of the one-sided two-sample Kolmogorov-Smirnov statistic for evaluating diagnostic tests.” Biometrics, pp. 561–570.
Ghosal et al. (2022) Ghosal S, Grantz KL, Chen Z (2022). “Estimation of multiple ordered ROC curves using placement values.” Statistical Methods in Medical Research, pp. 1470–1483.
Gönen and Heller (2010) Gönen M, Heller G (2010). “Lehmann family of ROC curves.” Medical Decision Making, 30(4), 509–517.
González-Manteiga et al. (2011) González-Manteiga W, Pardo-Fernández JC, van Keilegom I (2011). “ROC curves in non-parametric location-scale regression models.” Scandinavian Journal of Statistics, 38(1), 169–184.
Green and Swets (1966) Green DM, Swets JA (1966). Signal Detection Theory and Psychophysics, volume 1. Wiley New York.
Grundy (2004) Grundy SM (2004). “Obesity, metabolic syndrome, and cardiovascular disease.” The Journal of Clinical Endocrinology & Metabolism, 89(6), 2595–2600.
Hampel (1974) Hampel FR (1974). “The influence curve and its role in robust estimation.” Journal of the American Statistical Association, 69(346), 383–393.
Hanley and McNeil (1982) Hanley JA, McNeil BJ (1982). “The meaning and use of the area under a receiver operating characteristic (ROC) curve.” Radiology, 143(1), 29–36.
Hanley and McNeil (1983) Hanley JA, McNeil BJ (1983). “A method of comparing the areas under receiver operating characteristic curves derived from the same cases.” Radiology, 148(3), 839–843.
Harrell Jr (2001) Harrell Jr FE (2001). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis, volume 608. Springer.
Harrell Jr (2022) Harrell Jr FE (2022). \pkgrms: Regression Modeling Strategies. \proglangR package version 6.3-0, URL https://CRAN.R-project.org/package=rms.
Hothorn (2020) Hothorn T (2020). “Most likely transformations: The \pkgmlt package.” Journal of Statistical Software, 92, 1–68.
Hothorn et al. (2014) Hothorn T, Kneib T, Bühlmann P (2014). “Conditional transformation models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 3–27.
Hothorn et al. (2018) Hothorn T, Möst L, Bühlmann P (2018). “Most likely transformations.” Scandinavian Journal of Statistics, 45(1), 110–134.
Hothorn and Zeileis (2021) Hothorn T, Zeileis A (2021). “Predictive distribution modeling using transformation forests.” Journal of Computational and Graphical Statistics, 30(4), 1181–1196.
Hsieh and Turnbull (1996) Hsieh F, Turnbull BW (1996). “Nonparametric and semiparametric estimation of the receiver operating characteristic curve.” The Annals of Statistics, 24(1), 25–40.
Inácio et al. (2013) Inácio V, Jara A, Hanson TE, de Carvalho M (2013). “Bayesian nonparametric ROC regression modeling.” Bayesian Analysis, 8(3), 623–646.
Inácio et al. (2021a) Inácio V, M Lourenço V, de Carvalho M, Parker RA, Gnanapragasam V (2021a). “Robust and flexible inference for the covariate-specific receiver operating characteristic curve.” Statistics in Medicine, 40(26), 5779–5795.
Inácio et al. (2021b) Inácio V, Rodríguez-Álvarez MX, Gayoso-Diz P (2021b). “Statistical evaluation of medical tests.” Annual Review of Statistics and Its Application, 8, 41–67.
Jensen et al. (2000) Jensen K, Müller HH, Schäfer H (2000). “Regional confidence bands for ROC curves.” Statistics in Medicine, 19(4), 493–509.
Jiang et al. (1996) Jiang Y, Metz CE, Nishikawa RM (1996). “A receiver operating characteristic partial area index for highly sensitive diagnostic tests.” Radiology, 201(3), 745–750.
Khan and Brandenburger (2020) Khan MRA, Brandenburger T (2020). \pkgROCit: Performance Assessment of Binary Classifier with Visualization. \proglangR package version 2.1.1, URL https://CRAN.R-project.org/package=ROCit.
Khan (2022) Khan RA (2022). “Resilience family of receiver operating characteristic curves.” arXiv:2203.13665 [stat.ME], [math.ST], URL https://arxiv.org/abs/2203.13665.
Klein et al. (2022) Klein N, Hothorn T, Barbanti L, Kneib T (2022). “Multivariate conditional transformation models.” Scandinavian Journal of Statistics, 49(1), 116–142.
Komaba et al. (2022) Komaba A, Johno H, Nakamoto K (2022). “A novel statistical approach for two-sample testing based on the overlap coefficient.” arXiv:2206.03166 [math.ST], URL https://arxiv.org/abs/2206.03166.
Konietschke et al. (2015) Konietschke F, Placzek M, Schaarschmidt F, Hothorn LA (2015). “\pkgnparcomp: An \proglangR Software Package for Nonparametric Multiple Comparisons and Simultaneous Confidence Intervals.” Journal of Statistical Software, 64(9), 1–17. URL http://www.jstatsoft.org/v64/i09/.
Kook and Hothorn (2021) Kook L, Hothorn T (2021). “Regularized Transformation Models: The \pkgtramnet Package.” R Journal, 13(1), 581.
Lindsey et al. (1996) Lindsey JK, et al. (1996). Parametric Statistical Inference. Oxford University Press.
Lynn (2001) Lynn HS (2001). “Maximum likelihood inference for left-censored HIV RNA data.” Statistics in Medicine, 20(1), 33–45.
Mandel (2013) Mandel M (2013). “Simulation-based confidence intervals for functions with complicated derivatives.” The American Statistician, 67(2), 76–81.
Martínez-Camblor (2022) Martínez-Camblor P (2022). “About the use of the overlap coefficient in the binary classification context.” Communications in Statistics-Theory and Methods, pp. 1–11.
Martínez-Camblor et al. (2018) Martínez-Camblor P, Pérez-Fernández S, Corral N (2018). “Efficient nonparametric confidence bands for receiver operating-characteristic curves.” Statistical Methods in Medical Research, 27(6), 1892–1908.
McClish (1989) McClish DK (1989). “Analyzing a portion of the ROC curve.” Medical Decision Making, 9(3), 190–195.
McIntosh and Pepe (2002) McIntosh MW, Pepe MS (2002). “Combining several screening tests: optimality of the risk score.” Biometrics, 58(3), 657–664.
Molodianovitch et al. (2006) Molodianovitch K, Faraggi D, Reiser B (2006). “Comparing the areas under two correlated ROC curves: parametric and non-parametric approaches.” Biometrical Journal, 48(5), 745–757.
Mumford et al. (2006) Mumford SL, Schisterman EF, Vexler A, Liu A (2006). “Pooling biospecimens and limits of detection: effects on ROC curve analysis.” Biostatistics, 7(4), 585–598.
Ogilvie and Creelman (1968) Ogilvie JC, Creelman CD (1968). “Maximum-likelihood estimation of receiver operating characteristic curve parameters.” Journal of Mathematical Psychology, 5(3), 377–391.
Pan and Metz (1997) Pan X, Metz CE (1997). “The “proper” binormal model: parametric receiver operating characteristic curve estimation with degenerate data.” Academic Radiology, 4(5), 380–389.
Pepe (1997) Pepe MS (1997). “A regression modelling framework for receiver operating characteristic curves in medical diagnostic testing.” Biometrika, 84(3), 595–608.
Pepe (1998) Pepe MS (1998). “Three approaches to regression analysis of receiver operating characteristic curves for continuous test results.” Biometrics, pp. 124–135.
Pepe (2000) Pepe MS (2000). “An interpretation for the ROC curve and inference using GLM procedures.” Biometrics, 56(2), 352–359.
Pepe (2003) Pepe MS (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Medicine.
Pérez Fernández et al. (2018) Pérez Fernández S, Martínez Camblor P, Filzmoser P, Corral Blanco NO, et al. (2018). “\pkgnsROC: an \proglangR package for non-standard ROC curve analysis.” The \proglangR Journal, 10 (2).
Perez-Jaume et al. (2017) Perez-Jaume S, Skaltsa K, Pallarès N, Carrasco JL (2017). “\pkgThresholdROC: Optimum Threshold Estimation Tools for Continuous Diagnostic Tests in \proglangR.” Journal of Statistical Software, 82(4), 1–21. 10.18637/jss.v082.i04.
Perkins et al. (2007) Perkins NJ, Schisterman EF, Vexler A (2007). “Receiver operating characteristic curve inference from a sample with a limit of detection.” American Journal of Epidemiology, 165(3), 325–333.
Perkins et al. (2009) Perkins NJ, Schisterman EF, Vexler A (2009). “Generalized ROC curve inference for a biomarker subject to a limit of detection and measurement error.” Statistics in Medicine, 28(13), 1841–1860.
Rao (1948) Rao CR (1948). “Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation.” In Mathematical Proceedings of the Cambridge Philosophical Society, volume 44, pp. 50–57. Cambridge University Press.
Rao (2005) Rao CR (2005). “Score Test: Historical Review and Recent Developments.” In Advances in Ranking and Selection, Multiple Comparisons, and Reliability, pp. 8–20. Birkhäuser, Boston, U.S.A.
Ridout and Linkie (2009) Ridout M, Linkie M (2009). “Estimating overlap of daily activity patterns from camera trap data.” Journal of Agricultural, Biological, and Environmental Statistics, 14(3), 322–337.
Robin et al. (2011) Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M (2011). “\pkgpROC: an open-source package for \proglangR and \proglangS+ to analyze and compare ROC curves.” BMC Bioinformatics, 12, 77.
Rodríguez-Álvarez et al. (2011) Rodríguez-Álvarez MX, Roca-Pardiñas J, Cadarso-Suárez C (2011). “ROC curve and covariates: extending induced methodology to the non-parametric framework.” Statistics and Computing, 21(4), 483–499.
Romero-Saldaña et al. (2016) Romero-Saldaña M, Fuentes-Jiménez FJ, Vaquero-Abellán M, Álvarez-Fernández C, Molina-Recio G, López-Miranda J (2016). “New non-invasive method for early detection of metabolic syndrome in the working population.” European Journal of Cardiovascular Nursing, 15(7), 549–558.
Romero-Saldaña et al. (2018) Romero-Saldaña M, Tauler P, Vaquero-Abellán M, López-González AA, Fuentes-Jiménez FJ, Aguiló A, Álvarez-Fernández C, Molina-Recio G, Bennasar-Veny M (2018). “Validation of a non-invasive method for the early detection of metabolic syndrome: a diagnostic accuracy test in a working population.” BMJ Open, 8(10), e020476.
Ruopp et al. (2008) Ruopp MD, Perkins NJ, Whitcomb BW, Schisterman EF (2008). “Youden Index and optimal cut-point estimated from observations affected by a lower limit of detection.” Biometrical Journal, 50(3), 419–430.
Schisterman et al. (2004) Schisterman EF, Faraggi D, Reiser B (2004). “Adjusting the generalized ROC curve for covariates.” Statistics in Medicine, 23(21), 3319–3331.
Schmid and Schmidt (2006) Schmid F, Schmidt A (2006). “Nonparametric estimation of the coefficient of overlapping—theory and empirical application.” Computational Statistics & Data Analysis, 50(6), 1583–1596.
Shao et al. (2010) Shao J, Yu L, Shen X, Li D, Wang K (2010). “Waist-to-height ratio, an optimal predictor for obesity and metabolic syndrome in Chinese adults.” The Journal of Nutrition, Health & Aging, 14(9), 782–785.
Siegfried et al. (2022) Siegfried S, Kook L, Hothorn T (2022). “Distribution-Free Location-Scale Regression.” arXiv:2208.05302 [stat.ME], URL https://arxiv.org/abs/2208.05302.
Silvey (1959) Silvey SD (1959). “The Lagrangian multiplier test.” The Annals of Mathematical Statistics, 30(2), 389–407.
Singh and Nocerino (2002) Singh A, Nocerino J (2002). “Robust estimation of mean and variance using environmental data sets with below detection limit observations.” Chemometrics and Intelligent Laboratory Systems, 60(1-2), 69–86.
Stevens et al. (2010) Stevens J, Katz EG, Huxley RR (2010). “Associations between gender, age and waist circumference.” European Journal of Clinical Nutrition, 64(1), 6–15.
Suliga et al. (2019) Suliga E, Ciesla E, Głuszek-Osuch M, Rogula T, Głuszek S, Kozieł D (2019). “The usefulness of anthropometric indices to identify the risk of metabolic syndrome.” Nutrients, 11(11), 2598.
Swets and Pickett (1982) Swets JA, Pickett RM (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. New York: Academic Press.
Tamási and Hothorn (2021) Tamási B, Hothorn T (2021). “\pkgtramME: Mixed-Effects Transformation Models Using Template Model Builder.” The R Journal, 13(1), 581–594.
Thas et al. (2012) Thas O, Neve JD, Clement L, Ottoy JP (2012). “Probabilistic index models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4), 623–671.
Therneau (2022) Therneau TM (2022). A Package for Survival Analysis in \proglangR. \proglangR package version 3.3-1, URL https://CRAN.R-project.org/package=survival.
Thompson and Zucchini (1989) Thompson ML, Zucchini W (1989). “On the statistical analysis of ROC curves.” Statistics in Medicine, 8(10), 1277–1290.
Tian et al. (2020) Tian Y, Hothorn T, Li C, Harrell Jr FE, Shepherd BE (2020). “An empirical comparison of two novel transformation models.” Statistics in Medicine, 39(5), 562–576.
Tosteson and Begg (1988) Tosteson ANA, Begg CB (1988). “A general regression methodology for ROC curve estimation.” Medical Decision Making, 8(3), 204–215.
Wang and Tian (2017) Wang Y, Tian L (2017). “The Equivalence Between Mann-Whitney Wilcoxon Test and Score Test Based on the Proportional Odds Model for Ordinal Responses.” In 4th International Conference on Industrial Economics System and Industrial Security Engineering (IEIS), pp. 1–5. IEEE, Kyoto, Japan. 10.1109/IEIS.2017.8078606.
Weitzman (1970) Weitzman MS (1970). Measures of Overlap of Income Distributions of White and Negro Families in the United States, volume 3. US Bureau of the Census.
Wieand et al. (1989) Wieand S, Gail MH, James BR, James KL (1989). “A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data.” Biometrika, 76(3), 585–592.
Xiong et al. (2022) Xiong C, Luo J, Agboola F, Grant E, Morris JC (2022). “A family of estimators to diagnostic accuracy when candidate tests are subject to detection limits—Application to diagnosing early stage Alzheimer disease.” Statistical Methods in Medical Research, 31(5), 882–898.
Yao et al. (2010) Yao F, Craiu RV, Reiser B (2010). “Nonparametric covariate adjustment for receiver operating characteristic curves.” Canadian Journal of Statistics, 38(1), 27–46.
Youden (1950) Youden WJ (1950). “Index for rating diagnostic tests.” Cancer, 3(1), 32–35.
Zalesin et al. (2008) Zalesin KC, Franklin BA, Miller WM, Peterson ED, McCullough PA (2008). “Impact of obesity on cardiovascular disease.” Endocrinology and Metabolism Clinics of North America, 37(3), 663–684.
Zou (1997) Zou KH (1997). Analysis of Some Transformation Models for the Two-sample Problem with Special Reference to Receiver Operating Characteristic Curves. Ph.D. thesis, University of Rochester.
Zou and Hall (2000) Zou KH, Hall W (2000). “Two transformation models for estimating an ROC curve derived from continuous data.” Journal of Applied Statistics, 27(5), 621–631.
Zou and Hall (2002) Zou KH, Hall W (2002). “Semiparametric and parametric transformation models for comparing diagnostic markers with paired design.” Journal of Applied Statistics, 29(6), 803–816.
Zou et al. (1997) Zou KH, Hall WJ, Shapiro DE (1997). “Smooth non-parametric receiver operating characteristic (ROC) curves for continuous diagnostic tests.” Statistics in Medicine, 16(19), 2143–2156.
Zou et al. (2011) Zou KH, Liu A, Bandos AI, Ohno-Machado L, Rockette HE (2011). Statistical Evaluation of Diagnostic Performance: Topics in ROC analysis. CRC Press.
Zou et al. (1998) Zou KH, Tempany CM, Fielding JR, Silverman SG (1998). “Original smooth receiver operating characteristic curve estimation from continuous data: statistical methods for analyzing the predictive value of spiral CT of ureteral stones.” Academic Radiology, 5(10), 680–687.

Appendix A Simulation study

Reference	\proglangR package	ROC		AUC		Youden index
Reference	\proglangR package	Estimate	CB	Estimate	CI	Estimate	CI
Hothorn (2020)	\pkgtram	✓	✓	✓	✓	✓	✓
Harrell Jr (2001, 2022)	\pkgrms			✓	✓
Thas et al. (2012)	\pkgpim			✓	✓
Therneau (2022)	\pkgsurvival			✓	✓
Robin et al. (2011)	\pkgpROC	✓	✓	✓	✓	✓
Fay (2022)	\pkgasht			✓	✓
Konietschke et al. (2015)	\pkgnparcomp			✓	✓
Khan and Brandenburger (2020)	\pkgROCit	✓	✓	✓	✓	✓
Feng et al. (2020)	\pkgauRoc			✓	✓
Perez-Jaume et al. (2017)	\pkgThresholdROC					✓	✓
Ridout and Linkie (2009)	\pkgoverlap					✓	✓
Franco-Pereira et al. (2021)	-					✓	✓
Pérez Fernández et al. (2018)	\pkgnsROC	✓	✓	✓	✓	✓

Table S1: Overview of the different methods used in the simulation study. References to the original publication along with \proglangR software details are given. The (✓) indicates if a method computes the specific metric. The metrics included estimates for the ROC curve, AUC and Youden index as well as corresponding confidence bands or confidence intervals.

Appendix B Computational details

This section provides \proglangR code for fitting transformation models and conducting ROC analysis for the metabolic syndrome (MetS) application detailed in the paper. We first consider the unconditional performance of waist-to-height ratio (WHtR) as a diagnostic test of MetS. Next, we examined if the discriminatory ability of WHtR varies with covariates. Finally, we compare the performance of an alternative diagnostic test for MetS, systolic blood pressure (SBP). The structure of the code follows the tables and figures presented in the main text.

B.1 Two-sample analysis

\MakeFramed

library("tram")
library("rdryad")

### MetS data; doi: 10.5061/dryad.cb51t54
dd <- dryad_files_download(79303)
dat <- read.csv(dd[[1]])
dat$D <- dat$MetS
dat$Y <- dat$WHtR
supp <- with(dat, c(min(Y), max(Y)))
nd <- data.frame(D = c(0,1))
ndD1 <- nd[2,, drop = FALSE]

### Two-sample ROC analysis
### Model with F_Z = logit^{-1}
(m_lgt <- Colr(Y ~ D, data = dat, support = supp))

### ROC curve with 95 percent Wald CI
r <- ROC(m_lgt, newdata = ndD1, conf.level = 0.95)
plot(r)
### Score test
(sc <- score_test(m_lgt, level = 0.95))

### Summary indices and Wald CIs
PI(m_lgt, newdata = ndD1, conf.level = 0.95)
TV(m_lgt, newdata = ndD1, conf.level = 0.95)
OVL(m_lgt, newdata = ndD1, conf.level = 0.95)

### Score CIs for summary indices
PI(sc$conf.int, link = "logistic")
TV(rev(sc$conf.int), link = "logistic")
OVL(sc$conf.int, link = "logistic")

### Modeled densities
grd <- seq(supp[1], supp[2], by=0.001)
dens <- predict(m_lgt, nd, type = "density", q=grd)
plot(grd, dens[,1], type = "l", ylim = c(0, 10),
     xlab = "Waist-to-height ratio (WHtR) cm/cm",
     ylab = "Density")
lines(grd, dens[,2], lty = 3)
legend("topright", legend=c("No MetS", "MetS"), lty = c(1, 3))

\endMakeFramed

B.2 Conditional ROC analysis

\MakeFramed

(m_cov <- Colr(Y ~ D * (Age + Gender + Smoke), data = dat, support = supp))

grd_age <- 20:70
nd1 <- expand.grid(D = 1, Age = grd_age, Gender = c(0, 1), Smoke = 0)
nd0 <- expand.grid(D = 0, Age = grd_age, Gender = c(0, 1), Smoke = 0)

### Covariate-specific ROC curve for 40 year old non-smokers
### Solid line: Female, Dashed line: Male
p <- seq(0, 1, by = 0.001)
rcov <- ROC(m_cov,
            newdata = subset(nd1, Age == 40),
            reference = subset(nd0, Age == 40),
            one2one = T, prob = p)
plot(rcov)
### Age-based AUC for non-smokers
### Solid line: Female, Dashed line: Male
auc <- PI(m_cov,
          newdata = nd1,
          reference = nd0,
          one2one = TRUE, conf.level = 0.95)
id_f <- which(nd1$Gender==0)
id_m <- which(nd1$Gender==1)

## Females
plot(grd_age, auc[id_f, "Estimate"],
     xlab = "Age (in years)", ylab = "AUC",
     ylim = c(0, 1), type = "l")
lines(grd_age, auc[id_f, "lwr"], lty = 3, col = "gray50")
lines(grd_age, auc[id_f, "upr"], lty = 3, col = "gray50")

## Males
lines(grd_age, auc[id_m, "Estimate"], lty = 2)
lines(grd_age, auc[id_m, "lwr"], lty = 3, col = "gray50")
lines(grd_age, auc[id_m, "upr"], lty = 3, col = "gray50")
legend("bottomright", legend =