Needle in a Haystack: Label-Efficient Evaluation under Extreme Class Imbalance
Abstract.
Important tasks like record linkage and extreme classification exhibit extreme class imbalance, with 1 minority instance to every 1 million or more majority instances. Obtaining a sufficient sample of all classes, even just to achieve statistically-significant evaluation, is so challenging that most current approaches yield poor estimates or incur impractical cost. Where importance sampling has been levied against this challenge, restrictive constraints are placed on performance metrics, estimates do not come with appropriate guarantees, or evaluations cannot adapt to incoming labels. This paper develops a framework for online evaluation based on adaptive importance sampling. Given a target performance metric and a model of the oracle response, the framework adapts a distribution over items to label in order to maximize statistical precision. We establish strong consistency and a central limit theorem for the resulting performance estimates, and instantiate our framework with worked examples that leverage Dirichlet-tree models. Experiments demonstrate average MSE superior to the state-of-the-art under fixed label budgets.
1. Introduction
Evaluation of machine learning systems under extreme class imbalance seems like a hopeless task. When minority classes are exceedingly rare—e.g. occurring at a rate of one in a million—a massive number of examples (1 million in expectation) must be labeled before a single minority example is encountered. It seems nigh impossible to reliably estimate performance in these circumstances, as the level of statistical noise is simply too high. Making matters worse, high-quality labels for evaluation are rarely available for free. Typically they are acquired manually at some cost—e.g. by employing expert annotators or crowdsourcing workers. Reducing labeling requirements for evaluation, while ensuring estimates of performance are precise and free from bias, is therefore paramount. One cannot afford to waste resources on evaluation if the results are potentially misleading or totally useless.
From a statistical perspective, evaluation can be cast as the estimation of population performance measures using independently-drawn test data. Although unlabeled test data is abundant in many applied settings, labels must usually be acquired as part of the evaluation process. To ensure estimated performance measures converge to their population values, it is important to select examples for labeling in a statistically sound manner. This can be achieved by sampling examples passively according to the data generating distribution. However, passive sampling suffers from poor label efficiency for some tasks, especially under extreme class imbalance. This impacts a range of application areas including fraud detection (Wei et al., 2013), record linkage (Marchant and Rubinstein, 2017), rare diseases (Khalilia et al., 2011) and extreme classification (Schultheis et al., 2020).
The poor efficiency of passive sampling for some evaluation tasks motivates active or biased sampling strategies, which improve efficiency by focusing labeling efforts on the “most informative” examples (Sawade et al., 2010b). Previous work in this area is based on variance-reduction methods, such as stratified sampling (Bennett and Carvalho, 2010; Druck and McCallum, 2011; Gao et al., 2019), importance sampling (Sawade et al., 2010a; Schnabel et al., 2016) and adaptive importance sampling (Marchant and Rubinstein, 2017). However existing approaches suffer from serious limitations, including lack of support for a broad range of performance measures (Bennett and Carvalho, 2010; Sawade et al., 2010a; Welinder et al., 2013; Schnabel et al., 2016; Marchant and Rubinstein, 2017), weak theoretical justification (Bennett and Carvalho, 2010; Druck and McCallum, 2011; Welinder et al., 2013) and an inability to adapt sampling based on incoming labels (Sawade et al., 2010a; Schnabel et al., 2016).
In this paper, we present a general framework for label-efficient online evaluation that addresses these limitations. Our framework supports any performance measure that can be expressed as a transformation of a vector-valued risk functional—a much broader class than previous work. This allows us to target simple scalar measures such as accuracy and F1 score, as well as more complex multi-dimensional measures such as precision-recall curves for the first time. We leverage adaptive importance sampling (AIS) to efficiently select examples for labeling in batches. The AIS proposal is adapted using labels from previous batches in order to approximate the asymptotically-optimal variance-minimizing proposal. This approximation relies on online estimates of the oracle response, which we obtain via a Bayesian Dirichlet-tree (Dennis, 1996) model that achieves asymptotic optimality for deterministic labels.
We analyze the asymptotic behavior of our framework under general conditions, establishing strong consistency and a central limit theorem. This improves upon a weak consistency result obtained in a less general setting (Marchant and Rubinstein, 2017). We also compare our framework empirically against four baselines: passive sampling, stratified sampling, importance sampling (Sawade et al., 2010a), and the stratified AIS method of Marchant and Rubinstein (2017). Our approach, based on a Dirichlet-tree model, achieves superior or competitive performance on all but one of seven test cases. Proofs and extensions are included in appendices. A Python package implementing our framework has been released open-source at https://github.com/ngmarchant/activeeval.
2. Preliminaries
We introduce notation and define the label-efficient evaluation problem in Section 2.1. Then in Section 2.2, we specify the family of performance measures supported by our framework. Section 2.3 presents novel insights into the impracticality of passive sampling relative to class imbalance and evaluation measure, supported by asymptotic analysis.
2.1. Problem formulation
Consider the task of evaluating a set of systems which solve a prediction problem on a feature space and label space . Let denote the output produced by system for a given input —e.g. a predicted label or distribution over labels. We assume instances encountered by the systems are generated i.i.d. from an unknown joint distribution with density on . Our objective is to obtain accurate and precise estimates of target performance measures (e.g. F1 score) with respect to this distribution at minimal cost.
We consider the common scenario where an unlabeled test pool drawn from the data generating distribution is available upfront. We assume labels are unavailable initially and can only be obtained by querying a stochastic oracle that returns draws from the conditional label distribution. We assume the response time and cost of oracle queries far outweigh contributions from other parts of the evaluation process. This is reasonable in practice, since the oracle requires human input—e.g. annotators on a crowdsourcing platform or domain experts. Under these assumptions, minimizing the cost of evaluation is equivalent to minimizing the number of oracle queries required to estimate target performance measures to a given precision.
Remark 1.
A stochastic oracle covers the most general case where the conditional label distribution has support on one or more labels. This may be due to a set of heterogeneous or noisy annotators (not modeled) or genuine ambiguity in the label. We also consider a deterministic oracle where the conditional label distribution is a point mass. This is appropriate when trusting individual judgments from an expert annotator.
2.2. Generalized measures
When embarking on an evaluation task it is important to select a suitable measure of performance. For some tasks it may be sufficient to measure global error rates, while for others it may be desirable to measure error rates for different classes, sub-populations or parameter configurations—the possibilities are boundless. Since no single measure is suitable for all tasks, we consider a broad family of measures which correspond mathematically to transformations of vector-valued risk functionals.
Definition 1 (Generalized measure).
Let be a vector-valued loss function that maps instances to vectors in dependent on the system outputs . We suppress explicit dependence on where it is understood. Assume is uniformly bounded in sup norm for all system outputs . Denote the corresponding vector-valued risk functional by . For any choice of and continuous mapping differentiable at , the generalized measure is defined as .
Table 1. Binary classification measures expressed as generalized measures.

| Measure | Loss | Mapping |
|---|---|---|
| Accuracy | | |
| Balanced accuracy | | |
| Precision | | |
| Recall | | |
| F score | | |
| Matthews correlation coefficient | | |
| Fowlkes-Mallows index | | |
| Brier score | | |
Table 2. Regression measures expressed as generalized measures.

| Measure | Loss | Mapping |
|---|---|---|
| Mean absolute error | | |
| Mean squared error | | |
| Coefficient of determination | | |
Although this definition may appear somewhat abstract, it encompasses a variety of practical measures. For instance, when the mapping is the identity and the loss is scalar-valued, the family reduces to a scalar-valued risk functional, which includes accuracy and mean-squared error as special cases. Other well-known performance measures such as precision and recall can be represented by selecting a non-linear mapping and a vector-valued loss. Tables 1 and 2 demonstrate how to recover standard binary classification and regression measures for different settings of the loss and mapping. In addition to scalar measures, the family also encompasses vector-valued measures when both the loss and mapping are vector-valued. These can be used to estimate multiple scalar measures simultaneously—e.g. precision and recall of a system, accuracy of several competing systems, or recall of a system for various sub-populations. Below, we demonstrate that a vector-valued generalized measure can represent a precision-recall (PR) curve.
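To make the construction concrete, here is a minimal sketch (hypothetical code, not from the paper's released package) showing precision and recall expressed as a mapping applied to a vector-valued risk, estimated on a fully labeled toy sample. The particular loss layout (true-positive, predicted-positive and actual-positive indicators) is one possible parameterization and is an assumption.

```python
import numpy as np

def loss_vector(y_pred, y_true):
    """One possible vector loss: indicators of (true positive, predicted positive,
    actual positive) for a single instance."""
    return np.array([
        float(y_pred == 1 and y_true == 1),  # true positive
        float(y_pred == 1),                  # predicted positive
        float(y_true == 1),                  # actual positive
    ])

def g_precision_recall(R):
    """Mapping g: transforms the vector risk R = E[loss] into (precision, recall)."""
    tp_rate, pred_pos_rate, actual_pos_rate = R
    return np.array([tp_rate / pred_pos_rate, tp_rate / actual_pos_rate])

# Toy labeled sample.
y_pred = np.array([1, 0, 1, 1, 0, 0])
y_true = np.array([1, 0, 0, 1, 1, 0])

R_hat = np.mean([loss_vector(p, t) for p, t in zip(y_pred, y_true)], axis=0)
print(g_precision_recall(R_hat))  # approximately [0.667, 0.667] on this toy sample
```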
Example 2 (PR curve).
A precision-recall (PR) curve plots the precision and recall of a soft binary classifier on a grid of classification thresholds . Let denote the classifier score for input , where a larger score means the classifier is more confident the label is positive (encoded as ‘1’) and a smaller score means the classifier is more confident the label is negative (encoded as ‘0’). We define a vector loss function that measures whether an instance is: (1) a predicted positive for each threshold (the first entries), (2) a true positive for each threshold (the next entries), and/or (3) a positive (the last entry):
A PR curve can then be obtained using the following mapping function:
where the first entries correspond to the precision at each threshold in ascending order, and the last entries correspond to the recall at each threshold in ascending order.
Remark 2.
We have defined generalized measures with respect to the data generating distribution. While this is the ideal target for evaluation, it is common in practice to define performance measures with respect to a sample. We can recover these from our definition by substituting an empirical distribution for the data generating distribution. For example, the familiar sample-based definition of recall can be obtained by making this substitution in the loss and mapping for recall given in Table 1.
Given our assumption that the test pool is drawn from , any consistent sample-based estimator will converge to the population measure.
2.3. Inadequacy of passive sampling
We have previously mentioned passive sampling as an obvious baseline for selecting instances to label for evaluation. In this section, we conduct an asymptotic analysis for two sample evaluation tasks, which highlights the impracticality of passive sampling under extreme class imbalance. This serves as concrete motivation for our interest in label-efficient solutions. We begin by defining an estimator for generalized measures based on passive samples.
Definition 3 (Passive estimator).
Let be a labeled sample of size drawn passively according to . In practice, is obtained by drawing instances i.i.d. from the marginal and querying labels from the oracle . The Monte Carlo or passive estimator for a generalized measure is then defined as follows:
Note that is a biased estimator for in general, since may be non-linear. However, it is asymptotically unbiased—that is, converges to with probability one in the limit . This property is known as strong consistency and it follows from the strong law of large numbers (Feller, 1968, pp. 243–245) and continuity of . There is also a central limit theorem (CLT) for , reflecting the rate of convergence: asymptotically where is an asymptotic covariance matrix (see Theorem 2). We shall now use this result to analyse the asymptotic efficiency of the passive estimator for two evaluation tasks.
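The passive estimator can be sketched as follows, assuming a generic per-instance loss that already incorporates the system's prediction and a mapping passed in as a function; this is illustrative code under those assumptions, not the paper's implementation.

```python
import numpy as np

def passive_estimate(pool, oracle, loss, g, n, rng):
    """Monte Carlo (passive) estimate of a generalized measure g(E[loss]):
    draw n instances i.i.d. from the pool, query their labels from the oracle,
    average the vector loss and apply the mapping g."""
    idx = rng.integers(0, len(pool), size=n)            # i.i.d. uniform draws
    losses = np.array([loss(pool[i], oracle(pool[i])) for i in idx])
    R_hat = losses.mean(axis=0)                         # estimated vector risk
    return g(R_hat)                                     # plug-in estimate of the measure
```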
Example 4 (Accuracy).
Consider estimating the accuracy (row 1 of Table 1) of a classifier. By the CLT, it is straightforward to show that the passive estimator for accuracy is asymptotically normal, and the labeled sample size required to reach a given precision depends only on the variance of the accuracy loss. Although this is suboptimal (theoretically it is possible to achieve an asymptotic variance of zero), it is not impractical: a passive sample reasonably captures the variance in this case.
This example shows that passive sampling is not always a poor choice. It can yield reasonably precise estimates of a generalized measure, so long as the measure is sensitive to regions of the space with high density as measured by the data generating distribution. However, where these conditions are not satisfied, passive sampling may become impractical, requiring huge samples of labeled data in order to sufficiently drive down the variance. This is the case for the example below, where the measure is sensitive to regions of the space with low density under the data generating distribution.
Example 5 (Recall).
Consider estimating recall (row 4 of Table 1) of a binary classifier. By the CLT, the passive estimator is asymptotically normal with a variance that scales inversely with the relative frequency of the positive class. Thus the labeled sample size required to reach a given precision grows as the positive class becomes rarer, which makes passive sampling impractical under extreme class imbalance.
This example is not merely an intellectual curiosity: there are important applications where the positive class is exceedingly rare. For instance, in record linkage the relative frequency of the positive class scales inversely with the size of the databases being linked (Marchant and Rubinstein, 2017).
3. An AIS-based framework for label-efficient evaluation

When passive sampling is inefficient (unless otherwise specified, we mean label or sample efficiency when using the term "efficiency" without qualification), as we have seen in the preceding analysis, substantial improvements can often be achieved through biased sampling. In this section, we devise a framework for efficiently estimating generalized measures that leverages a biased sampling approach called adaptive importance sampling (AIS) (Bugallo et al., 2017). AIS estimates an expectation using samples drawn sequentially from a biased proposal distribution that is adapted in stages based on samples from previous stages. It produces non-i.i.d. samples in general, unlike passive sampling and static (non-adaptive) importance sampling (IS) (Rubinstein and Kroese, 2016). AIS is a powerful approach because it does not assume an effective proposal distribution is known a priori; instead, the proposal is learnt from data. This addresses the main limitation of static IS: one may be stuck using a sub-optimal proposal, which compromises label efficiency.
There are many variations of AIS which differ in: (i) the way samples are allocated among stages; (ii) the dependence of the proposal on previous stages; (iii) the types of proposals considered; and (iv) the way samples are weighted within and across stages. Our approach is completely flexible with respect to points (i)–(iii). For point (iv), we use simple importance-weighting as it is amenable to asymptotic analysis using martingale theory (Portier and Delyon, 2018). A more complex weighting scheme is proposed in (Cornuet et al., 2012), which may have better stability; however, its asymptotic behavior is not well understood: consistency was proved for this scheme in the limit where the sample size at each stage is monotonically increasing (Marin et al., 2019), but to our knowledge a CLT remains elusive.
Our framework is summarized in Figure 1. Given a target performance measure and an unlabeled test pool , the labeling process proceeds in several stages indexed by . In the -th stage, instances are drawn i.i.d. from according to proposal . Labels are obtained for each instance by querying the oracle, and recorded with their importance weights in . At the end of the -th stage, the proposal is updated for the next stage. This update may depend on the weighted samples from all previous stages as recorded in . At any point during sampling, we estimate as follows:
(1)  Ĝ = g(R̂),   R̂ = (1/N) Σᵢ wᵢ ℓ(xᵢ, yᵢ),

where the sum runs over all N labeled instances collected so far and wᵢ denotes the importance weight of the i-th instance.
For generality, we permit the user to specify the sample allocations and the procedure for updating the proposals in Figure 1. In practice, we recommend allocating a small number of samples to each stage, as empirical studies suggest that efficiency improves when the proposal is updated more frequently (Portier and Delyon, 2018). However, this must be balanced with practical constraints, as a small sample allocation limits the ability to acquire labels in parallel. In Section 3.2, we recommend a practical procedure for updating the proposals. It approximates the asymptotically-optimal variance-minimizing proposal based on an online model of the oracle response. We present an example model in Section 4, which can leverage prior information from the systems under evaluation. Further practicalities including sample reuse and confidence intervals are discussed in Section 3.3.
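The staged labeling loop can be sketched as follows (a schematic of the process in Figure 1, not the interface of the released activeeval package): each stage draws a batch from the current proposal over the pool, queries the oracle, records importance weights, and hands the history to a user-supplied proposal-update procedure. The uniform marginal over the pool and the function names are assumptions for illustration.

```python
import numpy as np

def ais_evaluate(pool, oracle, loss, g, batch_sizes, init_proposal, update_proposal, rng):
    """Adaptive importance sampling over a finite test pool.

    init_proposal: initial pmf over pool indices (sums to 1).
    update_proposal: callback mapping the labeled history to a new pmf.
    """
    M = len(pool)
    f = np.full(M, 1.0 / M)            # marginal over the pool (uniform here)
    q = init_proposal
    history = []                       # (index, label, weight) triples

    for n_t in batch_sizes:            # stage t draws n_t instances i.i.d. from q
        idx = rng.choice(M, size=n_t, p=q)
        for i in idx:
            y = oracle(pool[i])                    # query the (costly) oracle
            w = f[i] / q[i]                        # importance weight
            history.append((i, y, w))
        q = update_proposal(history)               # adapt proposal for next stage

    # Weighted estimate of the vector risk, then apply the mapping g, as in (1).
    R_hat = np.mean([w * loss(pool[i], y) for i, y, w in history], axis=0)
    return g(R_hat)
```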
Remark 3 (Constraint on the proposal).
In a standard application of AIS, one is free to select any proposal with support on the set of instances that contribute to the target expectation. However, we have an additional constraint since we cannot bias sampling from the oracle. Thus we consider proposals that bias only the distribution over instances, with labels always drawn from the oracle's conditional distribution.
Remark 4 (Static Importance Sampling).
Our AIS framework reduces to static importance sampling when only a single stage is used, so that all samples are drawn from a single static proposal.
3.1. Asymptotic analysis
We study the asymptotic behavior of estimates for produced by the generic framework in Figure 1. Since our analysis does not depend on how samples are allocated among stages, it is cleaner to identify a sample using a single index where is the total number of samples, rather than a pair of indices . Concretely, we map each to index . With this change, we let denote the proposal used to generate sample . It is important to note that this notation conceals the dependence on previous samples. Thus should be understood as shorthand for where denotes the filtration.
Our analysis relies on the fact that is a martingale with respect to . The consistency of then follows by a strong law of large numbers for martingales (Feller, 1971) and the continuous mapping theorem. A proof is included in Appendix A.
Theorem 1 (Consistency).
Let the support of proposal be a superset of for all and assume
(2)
Then is strongly consistent for .
We also obtain a central limit theorem (CLT), which is useful for assessing asymptotic efficiency and computing approximate confidence intervals. Our proof (see Appendix B) invokes a CLT for martingales (Portier and Delyon, 2018) and the multivariate delta method.
Theorem 2 (CLT).
Suppose
(3)
where is an a.s. finite random positive semidefinite matrix, and there exists such that
(4)
Then converges in distribution to a multivariate normal with covariance matrix where is the Jacobian of .
3.2. Variance-minimizing proposals
In order to achieve optimal asymptotic efficiency, we would like to use the AIS proposal that achieves the minimal asymptotic variance as defined in Theorem 2. We can solve for this optimal proposal using functional calculus, as demonstrated in the proposition below. Note that we must generalize the variance minimization problem for vector-valued since becomes a covariance matrix. We opt to use the total variance (the trace of ) since the diagonal elements of are directly related to statistical efficiency, while the off-diagonal elements measure correlations between components of that are beyond our control. This choice also ensures the functional optimization problem is tractable.
Proposition 3 (Asymptotically-optimal proposal).
Suppose the Jacobian has full row rank and . Then the proposal
(5)
achieves the minimum total asymptotic variance of
Appendix D provides sufficient conditions on and the oracle which ensure .
We use the above result to design a practical scheme for adapting a proposal for AIS. At each stage of the evaluation process, we approximate the asymptotically-optimal proposal using an online model for the oracle response . The model for should be initialized using prior information if available (e.g. from the systems under evaluation) and updated at the end of each stage using labels received from the oracle. However, we cannot simply estimate by plugging in estimates of directly, as the resulting proposal may not satisfy the conditions of Theorems 1 and 2. Below we provide estimators for which do satisfy the conditions, and provide sufficient conditions for achieving asymptotic optimality.
Proposition 4.
If the oracle is stochastic, let be an estimate for whose support includes the support of for all stages , and assume pointwise in . Alternatively, if the oracle is deterministic, let be a posterior distribution for the response whose support includes for all , and assume pointwise in . Let be a positive bounded sequence and be an estimator for which converges a.s. to . Assume is finite (e.g. a pool of test data) and for all . Then the proposals
satisfy the conditions of Theorems 1 and 2. If in addition and (alternatively ) and , then the proposals approach asymptotic optimality.
3.3. Practicalities
We briefly discuss solutions to two issues that may arise in practical settings: sample reuse and approximate confidence regions.
3.3.1. Sample reuse
Suppose our framework is used to estimate a generalized measure. If the joint distribution associated with the prediction problem has not changed, it may be desirable to reuse the weighted samples to estimate a different generalized measure. This is possible so long as the sequence of proposals used for the original estimate has the required support for the new measure. More precisely, the support of each proposal must cover the instances that contribute to the loss functions associated with both measures.
If one anticipates sample reuse, the proposals can be made less specialized to a particular measure by mixing them with the marginal distribution over instances, with a mixing-weight hyperparameter that controls the degree of specialization.
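A minimal sketch of this mixing step, assuming the proposal and marginal are both represented as probability vectors over a finite pool; the parameter name alpha and the convention that larger alpha means less specialization are assumptions, not the paper's notation.

```python
import numpy as np

def mix_with_marginal(q, f, alpha):
    """Convex combination of the variance-minimizing proposal q and the marginal f:
    alpha = 0 keeps q unchanged, alpha = 1 recovers passive sampling from f."""
    return (1.0 - alpha) * np.asarray(q) + alpha * np.asarray(f)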
3.3.2. Approximate confidence regions
When publishing performance estimates, it may be desirable to quantify statistical uncertainty. An asymptotic confidence region for a generalized measure is given by the ellipsoid
where the center is the sample mean, the shape is governed by the sample covariance matrix, and the size is set by the critical value of the chi-squared distribution at the chosen significance level. This region can be approximated using the estimator in (1) and the following estimator for the covariance:
This is obtained from the expression for in Theorem 2, by plugging in AIS estimators for the variance and , and approximating by the most recent proposal .
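As a rough illustration, the region can be computed from the weighted losses via the delta method: estimate the covariance of the importance-weighted loss vectors, push it through the Jacobian of the mapping, and compare a quadratic form against a chi-squared critical value. This is a sketch under simplifying assumptions (simple importance weighting, hypothetical function names), not the paper's exact estimator.

```python
import numpy as np
from scipy.stats import chi2

def delta_method_covariance(weighted_losses, jacobian_at_R_hat):
    """Approximate covariance of the estimated measure g(R_hat) via the delta
    method: J Cov(w * loss) J^T / N."""
    N = weighted_losses.shape[0]
    V = np.atleast_2d(np.cov(weighted_losses, rowvar=False))  # covariance of w_i * loss_i
    J = np.atleast_2d(jacobian_at_R_hat)                      # Jacobian of g at R_hat
    return J @ V @ J.T / N

def in_confidence_region(G_candidate, G_hat, Sigma_hat, alpha=0.05):
    """Test whether a candidate value lies inside the approximate (1 - alpha)
    confidence ellipsoid centered at the estimate G_hat."""
    diff = np.atleast_1d(G_candidate - G_hat)
    stat = diff @ np.linalg.solve(np.atleast_2d(Sigma_hat), diff)
    return stat <= chi2.ppf(1.0 - alpha, df=diff.size)
```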
4. A Dirichlet-tree model for the oracle response
In the previous section, we introduced a scheme for updating the AIS proposal which relies on an online model of the oracle response. Since there are many conceivable choices for the model, we left it unspecified for full generality. In this section, we propose a particular model that is suited for evaluating classifiers when the response from the oracle is deterministic (see Remark 1). Concretely, we make the assumption that the label space is a finite set and is a point mass at for all . An extension of this section for stochastic oracles (where is not necessarily a point mass) is presented in Appendix H.
Since we would like to leverage prior information (e.g. classifier scores) from the system(s) under evaluation and perform regular updates as labels are received from the oracle, we opt to use a Bayesian model. Another design consideration is label efficiency. Since labels are scarce and the test pool may be huge, we want to design a model that allows for sharing of statistical strength between "similar" instances in the pool. To this end, we propose a model that incorporates a hierarchical partition of the pool, where instances assigned to hierarchically neighboring blocks are assumed to elicit a similar oracle response (in contrast to models used in related work (Bennett and Carvalho, 2010; Marchant and Rubinstein, 2017), which assume the oracle response is independent across blocks of a non-hierarchical partition). Various unsupervised methods may be used to learn a hierarchical partition, including hierarchical agglomerative/divisive clustering (Reddy and Vinzamuri, 2014), k-d trees (Bentley, 1975), and stratification based on classifier scores (see Appendix I for a brief review).
4.1. Generative process
We assume the global oracle response (averaged over all instances) is generated according to a Dirichlet distribution, viz.
where are concentration hyperparameters. The label for each instance (indexing ) is then assumed to be generated i.i.d. according to :
We assume a hierarchical partition of the test pool is given. The partition can be represented as a tree , where the leaf nodes of correspond to the finest partition of into disjoint blocks such that . We assume each instance is assigned to one of the blocks (leaf nodes) according to a distribution with a Dirichlet-tree prior (Dennis, 1996; Minka, 1999):
The Dirichlet-tree distribution is a generalization of the Dirichlet distribution, which allows for more flexible dependencies between the categories (blocks in this case). Categories that are hierarchically nearby based on the tree structure tend to be correlated. The Dirichlet concentration hyperparameters associated with the internal nodes also control the correlation structure.
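To make the prior concrete, here is a minimal sketch of sampling leaf (block) probabilities from a Dirichlet-tree prior, assuming the tree is represented as nested lists and a single shared concentration value (the paper sets node-level concentration hyperparameters, so this is a simplification): branching probabilities are drawn from a Dirichlet at each internal node, and a leaf's probability is the product of branching probabilities along its root-to-leaf path.

```python
import numpy as np

def sample_dirichlet_tree(node, rng, concentration=1.0):
    """Draw leaf probabilities from a Dirichlet-tree prior.

    node: either a list of child nodes (internal node) or a leaf identifier.
    Returns a dict mapping leaf identifiers to probabilities (summing to 1).
    """
    if not isinstance(node, list):              # leaf node
        return {node: 1.0}
    # Draw branching probabilities for this internal node from a Dirichlet.
    branch_probs = rng.dirichlet([concentration] * len(node))
    leaf_probs = {}
    for child, p in zip(node, branch_probs):
        for leaf, q in sample_dirichlet_tree(child, rng, concentration).items():
            leaf_probs[leaf] = p * q            # multiply along the root-to-leaf path
    return leaf_probs

# Example: a depth-2 binary tree over four blocks.
rng = np.random.default_rng(0)
tree = [["b0", "b1"], ["b2", "b3"]]
print(sample_dirichlet_tree(tree, rng))
```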
4.2. Inference
For a deterministic oracle, the response for instance is either observed (previously labeled) or unobserved (yet to be labeled). It is important to model the observation process in case it influences the values of inferred parameters. To this end, we let be observation indicators for the labels at the end of stage of the evaluation process (see Algorithm 1). We initialize and define in the obvious way: is 1 if the label for instance has been observed by the end of stage and 0 otherwise. From Algorithm 1, we have that the -th instance selected in stage depends on the labels of the previously observed instances and the block assignments :
Our goal is to infer the unobserved labels (and hence the oracle response) at each stage of the evaluation process. We assume the block assignments are fully observed. Since the observation indicators are independent of the unobserved labels conditional on the observed labels, our model satisfies ignorability (Jaeger, 2005). This means we can ignore the observation process when conducting inference. Since we do not require a full posterior distribution over all parameters, it is sufficient to conduct inference using the expectation-maximization algorithm. This yields a distribution over the unobserved label for each instance and point estimates for the other parameters ( and ). Full details are provided in Appendix F.
4.3. Asymptotic optimality
Since the Dirichlet-tree model described in this section is consistent for the true oracle (deterministic) response, it can be combined with the proposal updates described in Proposition 4 to yield an asymptotically-optimal AIS algorithm. This result is made precise in the following proposition, which is proved in Appendix G.
Proposition 5.
Consider an instantiation of our framework under a deterministic oracle where:
• the oracle response is estimated online using the Dirichlet-tree model described in this section via the EM algorithm;
• the proposals are adapted using the estimator defined in Proposition 4, with the positive bounded sequence chosen by the user.
Then Theorems 1 and 2 hold and the estimator is asymptotically-optimal.
5. Experimental study
We conduct experiments to assess the label efficiency of our proposed framework (using the procedure described in Section 3.2 to adapt the AIS proposal, together with the online model for the oracle response presented in Section 4) for a variety of evaluation tasks. The tasks vary in terms of the degree of class imbalance, the quality of predictions/scores from the classifier under evaluation, the size of the test pool, and the target performance measure. Where possible, we compare our framework (denoted Ours) with the following baselines:
• Passive: passive sampling as specified in Definition 3.
• IS: static importance sampling (Sawade et al., 2010a).
• Stratified: an online variant of stratified sampling with proportional allocation, as used in (Druck and McCallum, 2011). Items are sampled one-at-a-time in proportion to the size of the allocated stratum.
• OASIS: a stratified AIS method for estimating F scores (Marchant and Rubinstein, 2017).
5.1. Evaluation tasks
We prepare classifiers and test pools for evaluation using publicly-available datasets from various domains, as summarized in Table 3. amzn-goog, dblp-acm, abt-buy (Köpcke et al., 2010) and restaurant (RIDDLE, 2003) are benchmark datasets for entity resolution. They contain records from two sources and the goal is to predict whether pairs of records refer to the same entity or not. safedriver contains anonymized records from a car insurance company, and the task is to predict drivers who are likely to make a claim (Porto Seguro, 2017). creditcard relates to fraud detection for online credit card transactions (Pozzolo et al., 2015). tweets100k has been used for tweet sentiment analysis (Mozafari et al., 2014).
For amzn-goog, dblp-acm, abt-buy, restaurant and tweets100k we use the same classifiers and test pools as in (Marchant and Rubinstein, 2017). For safedriver and creditcard we prepare our own by randomly splitting the data into train/test with a 70:30 ratio, and training classifiers using supervised learning. In scenarios where labeled data is scarce, semi-supervised or unsupervised methods might be used instead—the choice of learning paradigm has no bearing on evaluation. We consider three target performance measures—F1 score, accuracy and precision-recall curves—as separate evaluation tasks.
Source | Domain | Size | Imbalance ratio | Classifier type | F1 score |
---|---|---|---|---|---|
abt-buy (Köpcke et al., 2010) | Entity resolution | 53,753 | 1075 | Linear SVM | 0.595 |
amzn-goog (Köpcke et al., 2010) | Entity resolution | 676,267 | 3381 | Linear SVM | 0.282 |
dblp-acm (Köpcke et al., 2010) | Entity resolution | 53,946 | 2697 | Linear SVM | 0.947 |
restaurant (RIDDLE, 2003) | Entity resolution | 149,747 | 3328 | Linear SVM | 0.899 |
safedriver (Porto Seguro, 2017) | Risk assessment | 178,564 | 26.56 | XGBoost (Chen and Guestrin, 2016) | 0.100 |
creditcard (Pozzolo et al., 2015) | Fraud detection | 85,443 | 580.2 | Logistic Regression | 0.728 |
tweets100k (Mozafari et al., 2014) | Sentiment analysis | 20,000 | 0.990 | Linear SVM | 0.770 |
5.2. Setup
Oracle.
We simulate an oracle using labels included with each dataset. Since the datasets only contain a single label for each instance, we assume a deterministic oracle. Thus, when computing the consumed label budget, we only count the first query to the oracle for an instance as a consumed label. If an instance is selected for labeling again, we reuse the label from the first query.
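For concreteness, the budget accounting can be implemented with a small caching wrapper of the following kind (a hypothetical helper, not code from the paper): repeated queries for the same instance return the cached label and do not consume additional budget.

```python
class CachedOracle:
    """Simulated deterministic oracle: only the first query per instance consumes
    label budget; repeat queries return the cached label for free."""

    def __init__(self, true_labels):
        self.true_labels = true_labels   # ground-truth labels from the dataset
        self.cache = {}

    @property
    def consumed_budget(self):
        return len(self.cache)

    def query(self, i):
        if i not in self.cache:
            self.cache[i] = self.true_labels[i]
        return self.cache[i]
```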
Partitioning.
Ours, Stratified and OASIS assume the test pool is partitioned so the oracle response within each block is ideally uniform. We construct partitions by binning instances according to their classifier scores. The bin edges are determined using the cumulative square-root frequency (CSF) method (Dalenius and Hodges, 1959), which is widely used for stratified sampling. We set the number of bins to . Since Ours is capable of exploiting partitions with hierarchical structure, we fill in post-hoc structure by associating the CSF bins with the leaf nodes of an appropriately-sized tree in breadth-first order. We consider two trees: a tree of depth 1 with branching factor (denoted Ours-1, equivalent to a non-hierarchical partition) and a tree of depth 8 with branching factor 2 (denoted Ours-8).
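A sketch of the CSF construction, assuming scores are first binned with a fine histogram; the exact bin counts and tie handling in the paper's implementation may differ.

```python
import numpy as np

def csf_bin_edges(scores, n_strata=8, n_fine_bins=1000):
    """Dalenius and Hodges cumulative square-root frequency (CSF) rule: cut the
    cumulative sqrt-frequency scale into equal-width segments."""
    counts, fine_edges = np.histogram(scores, bins=n_fine_bins)
    csf = np.cumsum(np.sqrt(counts))
    targets = csf[-1] * np.arange(1, n_strata) / n_strata
    cut_idx = np.searchsorted(csf, targets)
    return np.concatenate(([fine_edges[0]], fine_edges[cut_idx + 1], [fine_edges[-1]]))

# Usage: assign each pool instance to a stratum based on its classifier score.
rng = np.random.default_rng(0)
scores = rng.beta(0.5, 5.0, size=10_000)    # stand-in for classifier scores
edges = csf_bin_edges(scores, n_strata=8)
strata = np.digitize(scores, edges[1:-1])   # stratum index in {0, ..., 7}
```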
Hyperparameters.
We leverage prior information from the classifiers under evaluation to set hyperparameters. Wherever a prior estimate of is required, we use the classifier score , applying the softmax function to non-probabilistic scores. For the Dirichlet-tree model we set hyperparameters as follows:
where is the mean score over instances in the -th block, denotes a non-root node of the tree , denotes the depth of node in , and is an indicator equal to 1 if node is traversed to reach leaf node and 0 otherwise.
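Where a prior estimate of the positive-class probability is required from a non-probabilistic classifier score (e.g. an SVM decision value), a logistic transform (the two-class case of the softmax mentioned above) can be used; the unit scaling here is an assumption rather than a calibrated choice.

```python
import numpy as np

def scores_to_prior(raw_scores):
    """Map real-valued classifier scores to prior estimates of the probability
    that the oracle returns the positive label."""
    return 1.0 / (1.0 + np.exp(-np.asarray(raw_scores, dtype=float)))
```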
Repeats.
Since the evaluation process is randomized, we repeated each experiment 1000 times, computing and reporting the mean behavior with bootstrap confidence intervals.
5.3. Results
We provide a summary of the results here, with a particular focus on the results for F1 score, which is supported by all baselines. Detailed results for accuracy and precision-recall curves are included in Appendix J.
F1 score.
To assess convergence of the estimated F1 score, we plot the mean-squared error (MSE) as a function of the consumed label budget for all datasets and methods. A plot of this kind is included for abt-buy in Figure 2. It shows that Ours is significantly more efficient than the baseline methods in this case, achieving a lower MSE for all label budgets. The Passive and Stratified methods perform significantly worse, achieving an MSE at least one order of magnitude greater than the biased sampling methods. Figure 2 also plots the convergence of the proposal in terms of the mean KL divergence from the asymptotically-optimal proposal. The results here are in line with expectations: convergence of the F1 score is more rapid when the proposal is closer to asymptotic optimality.


Figure 3 summarizes the convergence plots for the six other data sets (included in Appendix J) by plotting the MSE of the estimated F1 score after 2000 labels are consumed. It shows that Ours achieves best or equal-best MSE on all but one of the datasets (dblp-acm) within the 95% confidence bands. We find the adaptive methods Ours and OASIS perform similarly to IS when the prior estimates of the oracle response are accurate—i.e. there is less to gain by adapting. Of the two adaptive methods, Ours generally converges more rapidly than OASIS, which might be expected since our procedure for adapting the proposal is asymptotically-optimal. The hierarchical variant of our model Ours-8 tends to outperform the non-hierarchical variant Ours-1, which we expect when the oracle response varies smoothly across neighboring blocks. Finally, we observe that Passive and Stratified are competitive when the class imbalance is less severe. This agrees with our analysis in Section 2.3. We note similar trends when targeting accuracy in place of F1 score, as presented in Appendix J.
PR curves.
We estimate PR curves on a uniformly-spaced grid of classifier thresholds , where is the minimum classifier score, is the maximum classifier score, and (see Example 2). We also use the uniform grid to partition the test pool (in place of the CSF method), associating each block of the partition with four neighboring bins on the grid to yield blocks. Figure 4 illustrates the vast improvement of the biased sampling methods (Ours-8 and IS) over Passive for this evaluation task. The estimated PR curves shown for Ours-8 and IS vary minimally about the true PR curve, and are reliable for selecting an operating threshold. The same cannot be said for the curves produced by Passive, which exhibit such high variance that they are essentially useless.
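For reference, precision and recall on a uniform threshold grid can be computed directly from scores and labels as below; this assumes fully labeled data and is shown only to make the target quantity concrete.

```python
import numpy as np

def pr_curve_on_grid(scores, labels, n_thresholds=100):
    """Precision and recall at uniformly spaced thresholds between the minimum
    and maximum classifier scores."""
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    precision, recall = [], []
    n_pos = labels.sum()
    for tau in thresholds:
        pred_pos = scores >= tau
        tp = np.sum(pred_pos & (labels == 1))
        precision.append(tp / max(pred_pos.sum(), 1))   # avoid division by zero
        recall.append(tp / n_pos)
    return thresholds, np.array(precision), np.array(recall)
```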

6. Related work
Existing approaches to label-efficient evaluation in a machine learning context largely fall into three categories: model-based (Welinder et al., 2013), stratified sampling (Bennett and Carvalho, 2010; Druck and McCallum, 2011) and importance sampling (Sawade et al., 2010a; Marchant and Rubinstein, 2017). The model-based approach in (Welinder et al., 2013) estimates precision-recall curves for binary classifiers. However, it uses inefficient passive sampling to select instances to label, and makes strong assumptions about the distribution of scores and labels which can result in biased estimates. Stratified sampling has been used to estimate scalar performance measures such as precision, accuracy and F1 score. Existing approaches (Bennett and Carvalho, 2010; Druck and McCallum, 2011) bias the sampling of instances from strata (akin to blocks) using a heuristic generalization of the optimal allocation principle (Cochran, 1977). However, stratified sampling is considered to be a less effective variance reduction method compared to importance sampling (Rubinstein and Kroese, 2016), and it does not naturally support stochastic oracles.
Static importance sampling (Sawade et al., 2010a) and stratified adaptive importance sampling (Marchant and Rubinstein, 2017) have been used for online evaluation, and are most similar to our approach. However, (Marchant and Rubinstein, 2017) only supports the estimation of F1 score, and (Sawade et al., 2010a) only supports the estimation of scalar generalized risks, which can be viewed as a sub-family of generalized measures. Both of these methods attempt to approximate the asymptotically-optimal variance-minimizing proposal; however, the approximation used in (Sawade et al., 2010a) is non-adaptive and is not optimized for deterministic oracles, while the approximation used in (Marchant and Rubinstein, 2017) is adaptive but less accurate due to the stratified design.
Novel evaluation methods are also studied in the information retrieval (IR) community (see survey (Kanoulas, 2016)). Some tasks in the IR setting can be cast as prediction problems by treating query-document pairs as inputs and relevance judgments as outputs. Early approaches used relevance scores from the IR system to manage the abundance of irrelevant documents in an ad hoc manner (Cormack et al., 1998). Recent approaches (Schnabel et al., 2016; Li and Kanoulas, 2017) are based on a statistical formulation similar to ours, however they are specialized to IR systems. Within the IR community, stratified sampling and cluster sampling have also been used to efficiently evaluate knowledge graphs (Gao et al., 2019).
Adaptive importance sampling (AIS) is studied more generally in the context of Monte Carlo integration (see review by Bugallo et al., 2017). Most methods are inappropriate for our application, as they assume a continuous space without constraints on the proposal (see Remark 3). Oh and Berger (1992) introduced the idea of adapting the proposal over multiple stages using samples from the previous stages. Cappé et al. (2008) devise a general framework using independent mixtures as proposals. The method of Cornuet et al. (2012) continually re-weights all past samples, however it is more computationally demanding and less amenable to analysis since it breaks the martingale property. Portier and Delyon (2018) analyze parametric AIS in the large sample limit. This improves upon earlier work which assumed the number of stages goes to infinity (Douc and Moulines, 2008) or the sample size at each stage is monotonically increasing (Marin et al., 2019).
Finally, we note that label-efficient evaluation may be viewed as a counterpart to active learning, as both are concerned with reducing labeling requirements. There is a large body of literature concerning active learning—we refer the reader to surveys (Settles, 2009; Gilyazev and Turdakov, 2018). However, whereas active learning aims to find a model with low risk using actively-sampled training data, active evaluation aims to estimate the risk of any given model using actively-sampled test data.
7. Conclusion
We have proposed a framework for online supervised evaluation, which aims to minimize the labeling effort required to achieve precise, asymptotically-unbiased performance estimates. Our framework is based on adaptive importance sampling, with variance-minimizing proposals that are refined adaptively based on an online model of the oracle response. Under verifiable conditions on the chosen performance measure and the model, we proved strong consistency (asymptotic unbiasedness) of the resulting performance estimates and a central limit theorem. We instantiated our framework to evaluate classifiers using deterministic or stochastic human annotators. Our approach, based on a hierarchical Dirichlet-tree model, achieves superior or competitive performance on all but one of seven test cases.
Acknowledgements.
N. Marchant acknowledges the support of an Australian Government Research Training Program Scholarship. B. Rubinstein acknowledges the support of Australian Research Council grant DP150103710.

References
- Bennett and Carvalho (2010) Paul N. Bennett and Vitor R. Carvalho. 2010. Online Stratified Sampling: Evaluating Classifiers at Web-scale. In CIKM. 1581–1584. https://doi.org/10.1145/1871437.1871677
- Bentley (1975) Jon Louis Bentley. 1975. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18, 9 (Sept. 1975), 509–517. https://doi.org/10.1145/361002.361007
- Bugallo et al. (2017) M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez, and P. M. Djuric. 2017. Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Processing Magazine 34, 4 (July 2017), 60–79. https://doi.org/10.1109/MSP.2017.2699226
- Cappé et al. (2008) Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. 2008. Adaptive importance sampling in general mixture classes. Statistics and Computing 18, 4 (01 Dec. 2008), 447–459. https://doi.org/10.1007/s11222-008-9059-x
- Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
- Cochran (1977) William G. Cochran. 1977. Sampling Techniques (3rd ed.). Wiley, New York.
- Cormack et al. (1998) Gordon V. Cormack, Christopher R. Palmer, and Charles L. A. Clarke. 1998. Efficient Construction of Large Test Collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia) (SIGIR ’98). Association for Computing Machinery, New York, NY, USA, 282–289. https://doi.org/10.1145/290941.291009
- Cornuet et al. (2012) Jean-Marie Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P. Robert. 2012. Adaptive Multiple Importance Sampling. Scandinavian Journal of Statistics 39, 4 (2012), 798–812. https://doi.org/10.1111/j.1467-9469.2011.00756.x
- Dalenius and Hodges (1959) Tore Dalenius and Joseph L. Hodges. 1959. Minimum Variance Stratification. J. Amer. Statist. Assoc. 54, 285 (March 1959), 88–101. https://doi.org/10.1080/01621459.1959.10501501
- Dennis (1996) Samuel Y. Dennis. 1996. A Bayesian analysis of tree-structured statistical decision problems. Journal of Statistical Planning and Inference 53, 3 (1996), 323–344. https://doi.org/10.1016/0378-3758(95)00112-3
- Douc and Moulines (2008) Randal Douc and Eric Moulines. 2008. Limit theorems for weighted samples with applications to sequential Monte Carlo methods. The Annals of Statistics 36, 5 (Oct. 2008), 2344–2376. https://doi.org/10.1214/07-AOS514
- Druck and McCallum (2011) Gregory Druck and Andrew McCallum. 2011. Toward Interactive Training and Evaluation. In CIKM (New York, NY, USA). 947–956. https://doi.org/10.1145/2063576.2063712
- Feller (1968) W. Feller. 1968. An Introduction to Probability Theory and Its Applications, Volume 1 (3rd ed.). Wiley.
- Feller (1971) W. Feller. 1971. An Introduction to Probability Theory and Its Applications, Volume 2 (2nd ed.). Wiley.
- Gao et al. (2019) Junyang Gao, Xian Li, Yifan Ethan Xu, Bunyamin Sisman, Xin Luna Dong, and Jun Yang. 2019. Efficient Knowledge Graph Accuracy Evaluation. Proc. VLDB Endow. 12, 11 (July 2019), 1679–1691. https://doi.org/10.14778/3342263.3342642
- Gilyazev and Turdakov (2018) R. A. Gilyazev and D. Yu. Turdakov. 2018. Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling. Programming and Computer Software 44, 6 (Nov. 2018), 476–491. https://doi.org/10.1134/S0361768818060142
- Gunning and Horgan (2004) Patricia Gunning and Jane M. Horgan. 2004. A New Algorithm for the Construction of Stratum Boundaries in Skewed Populations. Survey Methodology 30, 2 (Dec. 2004), 159–166.
- Jaeger (2005) Manfred Jaeger. 2005. Ignorability in Statistical and Probabilistic Inference. J. Artif. Int. Res. 24, 1 (Dec. 2005), 889–917.
- Kanoulas (2016) Evangelos Kanoulas. 2016. A Short Survey on Online and Offline Methods for Search Quality Evaluation. Springer International Publishing, Cham, 38–87. https://doi.org/10.1007/978-3-319-41718-9_3
- Khalilia et al. (2011) Mohammed Khalilia, Sounak Chakraborty, and Mihail Popescu. 2011. Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making 11, 1 (2011), 13 pages. https://doi.org/10.1186/1472-6947-11-51
- Köpcke et al. (2010) Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-world Match Problems. PVLDB 3, 1 (2010), 484–493. https://doi.org/10.14778/1920841.1920904
- Li and Kanoulas (2017) Dan Li and Evangelos Kanoulas. 2017. Active Sampling for Large-scale Information Retrieval Evaluation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore, Singapore) (CIKM ’17). ACM, New York, NY, USA, 49–58. https://doi.org/10.1145/3132847.3133015
- Marchant and Rubinstein (2017) Neil G. Marchant and Benjamin I. P. Rubinstein. 2017. In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1322–1333. https://doi.org/10.14778/3137628.3137642
- Marin et al. (2019) Jean-Michel Marin, Pierre Pudlo, and Mohammed Sedki. 2019. Consistency of adaptive importance sampling and recycling schemes. Bernoulli 25, 3 (Aug. 2019), 1977–1998. https://doi.org/10.3150/18-BEJ1042
- Minka (1999) Tom Minka. 1999. The Dirichlet-tree Distribution. Technical Report. Justsystem Pittsburgh Research Center.
- Mozafari et al. (2014) Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning. Proc. VLDB Endow. 8, 2 (Oct. 2014), 125–136. https://doi.org/10.14778/2735471.2735474
- Oh and Berger (1992) Man-Suk Oh and James O. Berger. 1992. Adaptive importance sampling in monte carlo integration. Journal of Statistical Computation and Simulation 41, 3-4 (1992), 143–168. https://doi.org/10.1080/00949659208810398
- Portier and Delyon (2018) François Portier and Bernard Delyon. 2018. Asymptotic optimality of adaptive importance sampling. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3138–3148.
- Porto Seguro (2017) Porto Seguro. 2017. Porto Seguro’s Safe Driver Prediction. https://www.kaggle.com/c/porto-seguro-safe-driver-prediction. Accessed: Dec 2019.
- Pozzolo et al. (2015) A. D. Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi. 2015. Calibrating Probability with Undersampling for Unbalanced Classification. In 2015 IEEE Symposium Series on Computational Intelligence. 159–166. https://doi.org/10.1109/SSCI.2015.33
- Reddy and Vinzamuri (2014) Chandan K. Reddy and Bhanukiran Vinzamuri. 2014. A Survey of Partitional and Hierarchical Clustering Algorithms (1st ed.). Chapman & Hall/CRC, 87–110. https://doi.org/10.1201/9781315373515-4
- RIDDLE (2003) RIDDLE 2003. Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets. http://www.cs.utexas.edu/users/ml/riddle/data.html. Accessed: Dec 2016.
- Rubinstein and Kroese (2016) Reuven Y. Rubinstein and Dirk P. Kroese. 2016. Simulation and the Monte Carlo Method. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118631980
- Sawade et al. (2010b) Christoph Sawade, Niels Landwehr, Steffen Bickel, and Tobias Scheffer. 2010b. Active Risk Estimation. In Proceedings of the 27th International Conference on International Conference on Machine Learning (Haifa, Israel) (ICML’10). Omnipress, Madison, WI, USA, 951–958.
- Sawade et al. (2010a) Christoph Sawade, Niels Landwehr, and Tobias Scheffer. 2010a. Active Estimation of F-Measures. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). Curran Associates, Inc., 2083–2091.
- Schnabel et al. (2016) Tobias Schnabel, Adith Swaminathan, Peter I. Frazier, and Thorsten Joachims. 2016. Unbiased Comparative Evaluation of Ranking Functions. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (Newark, Delaware, USA) (ICTIR ’16). ACM, New York, NY, USA, 109–118. https://doi.org/10.1145/2970398.2970410
- Schultheis et al. (2020) Erik Schultheis, Mohammadreza Qaraei, Priyanshu Gupta, and Rohit Babbar. 2020. Unbiased Loss Functions for Extreme Classification With Missing Labels. arXiv:2007.00237 [stat.ML]
- Settles (2009) Burr Settles. 2009. Active Learning Literature Survey. Technical Report. University of Wisconsin-Madison Department of Computer Sciences. http://digital.library.wisc.edu/1793/60660
- Vaart (1998) A. W. van der Vaart. 1998. Delta Method. Cambridge University Press, 25–34. https://doi.org/10.1017/CBO9780511802256.004
- Wei et al. (2013) Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 16, 4 (2013), 449–475. https://doi.org/10.1007/s11280-012-0178-0
- Welinder et al. (2013) P. Welinder, M. Welling, and P. Perona. 2013. A Lazy Man’s Approach to Benchmarking: Semisupervised Classifier Evaluation and Recalibration. In CVPR. 3262–3269. https://doi.org/10.1109/CVPR.2013.419
Appendix A Proof of Theorem 1
First we prove that using a strong law of large numbers (SLLN) for martingales (Feller, 1971, p. 243). Consider the martingale
and let denote the -th component of the -th contribution to . Since is drawn from and wherever , it follows that . In addition, we have
where the inequality follows from the boundedness of and (2). Thus the conditions of the SLLN are satisfied and we have . Now the continuous mapping theorem states that
provided is not in the set of discontinuity points of . This condition is satisfied by assumption.
Appendix B Proof of Theorem 2
The CLT of Portier and Delyon (2018) implies that , provided two conditions hold. The first condition of their CLT holds by assumption (3) and the second condition holds by the boundedness of the loss function and (4). The multivariate delta method (Vaart, 1998) then implies that , since is assumed to be differentiable at in Definition 1.
Appendix C Proof of Proposition 3
We want to find the proposal that minimizes the total asymptotic variance
We can express this as a functional optimization problem:
(6)
s.t.
where .
Appendix D Conditions for achieving zero total asymptotic variance
In typical applications of IS, the asymptotically-optimal proposal can achieve zero variance. This is not guaranteed in our application, since we do not have complete freedom in selecting the proposal (see Remark 3). Below we provide sufficient conditions under which the total asymptotic variance of can be reduced to zero.
Proposition 1.
Suppose the oracle is deterministic (i.e. is a point mass for all ) and the generalized measure is such that is constant for all and . Then the asymptotically-optimal proposal achieves .
Proof.
From Proposition 3, the asymptotically optimal proposal achieves a total variance of
(7)
We evaluate the two terms in this expression separately. Using the fact that , the first term becomes
The second line follows by application of the inequality , and the third line follows by assumption. For the second term we have
by application of Jensen’s inequality. Subtracting the second term from the first, we have . ∎
By way of illustration, we apply the above proposition to two common performance measures: accuracy and recall. We assume a deterministic oracle in both cases.
Example 0 (Asymptotic variance for accuracy).
Example 0 (Asymptotic variance for recall).
From Table 1, we have that recall can be expressed as a generalized performance measure by setting and . The conditions of Proposition 1 are not satisfied in this case, since
which is not constant for all . Indeed, when we evaluate the expression for the asymptotic total variance (see Proposition 3), we find that . Therefore, in general, there is a positive lower bound on the asymptotic variance that can be achieved by our framework when estimating recall under a deterministic oracle.
Appendix E Proof of Proposition 4
Before proving the proposition, we establish a useful corollary.
Corollary 1.
Proof.
For the first statement, we check conditions (2) and (4) of Theorem 1. Let be the support of and let . For we have
For the second statement, we must additionally check condition (3) regarding the convergence of . Letting
we can write . By the a.s. pointwise convergence of and the continuous mapping theorem, we have a.s. pointwise in and . Now observe that
It is straightforward to show that using the boundedness of (see Definition 1). Thus we have by the dominated convergence theorem. ∎
We can now prove Proposition 4 by showing that the conditions of Corollary 1 hold. We focus on the case of a deterministic oracle—the proof for a stochastic oracle follows by a similar argument.
First we examine the support of the sequence of proposals. At stage , the proposal can be expressed as
Observe that
and
where is a constant. The upper bound follows from the boundedness of (see Definition 1), the boundedness of , and the boundedness of the Jacobian. Since
by assumption and is bounded from above, we conclude that is a valid distribution for all . The lower bound on implies that the support of is
The inequality follows from the fact that the support of includes the support of . Thus has the required support for all .
Next we prove that the sequence of proposals converges a.s. pointwise in . Given that and pointwise in , one can show by application of the continuous mapping theorem and dominated convergence theorem that
pointwise in . By application of the continuous mapping theorem, we then have that . Thus Theorems 1 and 2 hold. Furthermore, if and , then is equal to the asymptotically-optimal proposal as defined in (5). ∎
Appendix F Inference for the Dirichlet-tree model in Section 4
In this appendix, we outline a procedure for inferring the oracle response based on the hierarchical model presented in Section 4. Recall that the model assumes a deterministic response—i.e. there is only one possible response (label) for a given input . At stage of the evaluation process (see Algorithm 1), the labels for instances in the test pool are partially-observed. To estimate the response for the unobserved instances, we can apply the expectation-maximization (EM) algorithm. Omitting the dependence on , we let denote the observed labels and denote the unobserved labels. The EM algorithm returns a distribution over the unobserved labels and MAP estimates of the model parameters . At each iteration of the EM algorithm, the following two steps are applied:
• E-step. Compute the function
(8)
which is the expected log posterior with respect to the current distribution over the unobserved labels, conditional on the observed labels and the current parameter estimates.
• M-step. Update the parameter estimates by maximizing:
(9)
In order to implement the E- and M-steps, we must evaluate the function for our model. Since the Dirichlet prior on and Dirichlet-tree priors on are conjugate to the categorical distribution, the posterior is straightforward to compute. We have
where , and is defined in (16). The posterior density for is
Minka (1999) gives the density for as:
where denotes the set of leaf nodes in , denotes the set of inner nodes in , denotes the leaf nodes reachable from node , and .
Substituting the posterior densities in (8), we have
where we define and similarly for and . When maximizing with respect to , we must obey the constraints:
• for all ,
• ,
• for all and leaf nodes , and
• .
We can maximize and separately since they are independent. For we have the mode of a Dirichlet random variable:
and for we have (see Minka, 1999):
(10)
(11)
The parameters may be interpreted as branching probabilities for node .
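For concreteness, the following minimal Python sketch computes the mode of a Dirichlet random variable, which is the closed form behind these M-step updates; the function name and the example concentration values are illustrative, not taken from the paper.

```python
import numpy as np

def dirichlet_mode(alpha):
    """Mode (MAP estimate) of a Dirichlet(alpha) random variable.

    Valid when every concentration parameter exceeds 1; this is the
    closed form used for the M-step updates of the label probabilities
    and the per-node branching probabilities.
    """
    alpha = np.asarray(alpha, dtype=float)
    return (alpha - 1.0) / (alpha.sum() - alpha.size)

# Hypothetical posterior concentrations at one internal node of the tree:
# prior pseudo-counts plus expected counts from the E-step.
alpha_node = np.array([2.5, 4.0, 1.5])
branch_probs = dirichlet_mode(alpha_node)  # non-negative, sums to 1
print(branch_probs)                        # [0.3 0.6 0.1]
```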
In summary, the EM algorithm reduces to the following two steps:
• E-step. Compute the expected value for each unobserved label using , as in (12). Then make a backward pass through the tree, computing at each internal node .
• M-step. Update the parameter estimates using the expected counts, as given in (10) and (11).
We can interpret (12) as providing a posterior estimate for the unknown oracle response, as in (13), where denotes the assigned stratum for instance . If the response for instance has been observed in a previous stage of the evaluation process, then collapses to a point mass at .
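To show how the two steps fit together, the sketch below runs a deliberately flattened (non-hierarchical) version of this EM loop, in which a single per-block label distribution stands in for the Dirichlet-tree parameters. The function names, the prior value, and the flat simplification are ours, not the paper's implementation.

```python
import numpy as np

def em_oracle_inference(obs_labels, strata, n_labels, n_strata,
                        alpha0=1.1, n_iter=50):
    """Flattened EM sketch for the deterministic-oracle model.

    obs_labels : dict mapping instance index -> observed label (partial).
    strata     : array giving the block (leaf) of every instance.
    Returns q[i, y], a posterior distribution over the label of each
    instance (a point mass for observed instances), and theta[k, y],
    the MAP per-block label distribution.
    """
    strata = np.asarray(strata)
    n_items = strata.size
    theta = np.full((n_strata, n_labels), 1.0 / n_labels)
    q = np.full((n_items, n_labels), 1.0 / n_labels)

    for _ in range(n_iter):
        # E-step: posterior over each label given the current parameters.
        for i in range(n_items):
            if i in obs_labels:
                q[i] = 0.0
                q[i, obs_labels[i]] = 1.0        # point mass at the observed label
            else:
                q[i] = theta[strata[i]]          # expected label distribution
        # M-step: MAP update from prior pseudo-counts plus expected counts.
        for k in range(n_strata):
            counts = (alpha0 - 1.0) + q[strata == k].sum(axis=0)
            theta[k] = counts / counts.sum()
    return q, theta

# Tiny example: 6 instances in 2 blocks, two of them labeled.
q, theta = em_oracle_inference({0: 1, 3: 0}, [0, 0, 0, 1, 1, 1],
                               n_labels=2, n_strata=2)
print(theta)
```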
Appendix G Proof of Proposition 1
Let denote the posterior estimate of the (deterministic) oracle response at stage , as defined in (13). First we must ensure that the support of includes the true label for all . This condition is satisfied since the priors on and ensure and for all and (see (12)). Once the label for instance is observed, the posterior degenerates to a point mass at the true value . This also implies that the sequence of proposals has the necessary support to ensure Theorem 2 holds.
Second, we verify that converges a.s. pointwise in and to a conditional pmf independent of . This condition is also satisfied, since the posterior degenerates to a point mass at once the label for instance is observed, and all instances are observed in the limit . Thus and , which implies that the sequence of proposals converges to the asymptotically-optimal proposal. ∎
Appendix H Extension of the Dirichlet-tree model to stochastic oracles
Recall that the Dirichlet-tree model in Section 4 is tailored for a deterministic oracle—where is a point mass at a single label . In this appendix, we extend the model to handle stochastic oracles, where has support on multiple labels in general. The model retains the same structure; however, we no longer assume there is a discrete set of items whose labels are either observed or unobserved. Instead, we allow for a potentially infinite number of item-label pairs (indexed by below) to be generated. In reality, these pairs correspond to labeled items drawn uniformly at random from the test pool . Since estimating the oracle response for individual instances requires estimating a large number of continuous parameters ( parameters), we instead opt to estimate the response averaged over instances at leaf nodes of the tree ( parameters for leaf nodes). We refer to this as the leaf-level oracle response below.
Model specification
For clarity, we reintroduce the model here despite some overlap with Section 4. The oracle response is modeled globally using a pmf over the label space with a Dirichlet prior:
where are concentration hyperparameters. The label for each instance is then assumed to be generated i.i.d. according to :
We assume a hierarchical partition of the space is given, encoded as a tree . Given and label , we assume instance is assigned to a leaf node of according to a distribution with a Dirichlet-tree prior:
where is a set of Dirichlet concentration parameters associated with the internal nodes of .
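For intuition, the sketch below samples from this generative process: a global label pmf from a Dirichlet prior, then one leaf-assignment distribution per label from a Dirichlet-tree prior over a small hand-coded tree. The tree, hyperparameters, and label count are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Global oracle pmf over labels, drawn from a Dirichlet prior
# (hypothetical label count and concentration hyperparameters).
n_labels = 2
gamma = np.ones(n_labels)
p_label = rng.dirichlet(gamma)

# A small hand-coded tree: internal nodes map to their children.
children = {"root": ["n0", "n1"],
            "n0": ["leaf0", "leaf1"],
            "n1": ["leaf2", "leaf3"]}

def sample_leaf_probs(tree, alpha=1.0):
    """Draw leaf probabilities from a Dirichlet-tree prior: branch
    probabilities at each internal node are Dirichlet-distributed, and a
    leaf's probability is the product of branches along its path."""
    leaf_probs = {}
    def descend(node, prob):
        if node not in tree:              # reached a leaf node
            leaf_probs[node] = prob
            return
        kids = tree[node]
        branches = rng.dirichlet(alpha * np.ones(len(kids)))
        for kid, b in zip(kids, branches):
            descend(kid, prob * b)
    descend("root", 1.0)
    return leaf_probs

# One leaf-assignment distribution per label.
leaf_given_label = [sample_leaf_probs(children) for _ in range(n_labels)]

# Generate item-label pairs: label ~ p_label, then leaf ~ leaf_given_label[label].
for _ in range(3):
    y = rng.choice(n_labels, p=p_label)
    leaves = list(leaf_given_label[y])
    probs = np.array([leaf_given_label[y][l] for l in leaves])
    print(y, rng.choice(leaves, p=probs))
```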
Inference
To estimate the leaf-level oracle response , we use the posterior predictive distribution
which encodes our uncertainty about the oracle response for a query instance from stratum conditional on the previously observed samples . If the observed samples were collected through unbiased sampling (as assumed in the model above), we would compute the posterior predictive distribution as follows:
(14)
where
(15)
denotes the inner nodes of , denotes the children of node , denotes the leaf index of instance and
(16)
In words, the posterior parameters are updated by adding a count of ‘1’ for every observation with label that is reachable from node in the tree . However, since the observed samples are biased in our application, we must apply a bias-correction to (15):
This guarantees that and .
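The count update, and a possible bias-corrected variant, can be sketched as follows. We assume here, purely for illustration, that the correction amounts to accumulating an importance weight in place of the unit count; the tree paths, labels, and weights below are hypothetical.

```python
def update_tree_counts(counts, path_to_leaf, leaf, label, weight=1.0):
    """Sketch of the Dirichlet-tree pseudo-count update.

    Under unbiased sampling each observation adds a count of 1 at every
    node on the path from the root to its leaf, for the observed label.
    Under biased sampling we accumulate a weight instead of 1 (an
    importance weight here, as an illustrative assumption; the paper's
    exact correction is the one applied to (15))."""
    for node in path_to_leaf[leaf]:
        counts[(node, label)] = counts.get((node, label), 0.0) + weight
    return counts

# Hypothetical paths (root -> ... -> leaf) and weighted observations.
path_to_leaf = {"leaf0": ["root", "n0", "leaf0"],
                "leaf1": ["root", "n0", "leaf1"],
                "leaf2": ["root", "n1", "leaf2"]}
counts = {}
for leaf, label, w in [("leaf0", 1, 0.8), ("leaf2", 0, 1.3), ("leaf0", 1, 0.9)]:
    update_tree_counts(counts, path_to_leaf, leaf, label, weight=w)
print(counts)
```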
Proposition 0.
Consider an instantiation of our evaluation framework (see Algorithm 1) under a stochastic oracle where:
Proof.
Let denote the posterior predictive in evaluated at stage . First we must ensure that the support of is a superset of the support of . This condition is satisfied since the priors on and ensure that for all (see (14)). This implies that the proposals have the necessary support to ensure that Theorem 1 is satisfied.
Second, we must ensure that converges a.s. pointwise in and to a conditional pmf independent of . This condition is also satisfied, since the posterior parameters and converge a.s. to constants by Theorem 1 (see the text preceding the statement of this proposition). ∎
Appendix I Unsupervised partitioning methods
The models for the oracle response described in Section 4 require that the test pool be hierarchically partitioned into blocks. Ideally, the partition should be selected so that the oracle distribution is approximately constant within each block. Since we begin without any labels, we are restricted to unsupervised partitioning methods. We briefly describe two settings for learning partitions: (i) where classifier scores are used as a proxy for and (ii) where feature vectors are used.
Score-based methods.
Our partitioning problem can be tackled using stratification methods studied in the survey statistics community (Cochran, 1977). The aim of stratification is to partition a population into roughly homogeneous subpopulations, known as strata (or blocks), for the purpose of variance reduction. When an auxiliary variable is observed that is correlated with the statistic of interest, the strata may be defined as a partition of the range of the auxiliary variable into sub-intervals. Various methods are used for determining the sub-intervals, including the cumulative square-root frequency (CSF) method (Dalenius and Hodges, 1959) and the geometric method (Gunning and Horgan, 2004). In our application, the classifier scores are the auxiliary variables and the statistic of interest is (for binary classification).
Stratification is conventionally used to produce a partition without hierarchical structure. However, if the strata are ordered, it is straightforward to “fill in” hierarchical structure. In our experiments, we specify the desired tree structure—e.g. an ordered binary tree of a particular depth. We then compute the stratum bins (sub-intervals) so that the number of bins matches the number of leaf nodes in the tree. The stratum bins are then associated with the leaf nodes of the tree in breadth-first order.
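A minimal sketch of this procedure, under assumed bin counts and synthetic scores rather than the experimental settings, is given below: CSF cut points on the classifier scores, followed by attaching the ordered bins to the leaves of an ordered binary tree from left to right.

```python
import numpy as np

def csf_strata(scores, n_strata, n_bins=200):
    """Cumulative square-root frequency (Dalenius-Hodges) stratification of
    classifier scores.  Returns a stratum index in [0, n_strata) per score.
    The number of histogram bins is an illustrative choice."""
    hist, edges = np.histogram(scores, bins=n_bins)
    csf = np.cumsum(np.sqrt(hist))
    targets = csf[-1] * np.arange(1, n_strata) / n_strata
    cuts = edges[1:][np.searchsorted(csf, targets)]   # score cut points
    return np.searchsorted(cuts, scores, side="right")

# Strata are ordered, so they can be attached to the leaves of an ordered
# binary tree of a chosen depth simply by enumerating leaves left to right.
depth = 3                                             # 2**depth leaf nodes
scores = np.random.default_rng(0).beta(0.5, 5.0, size=10_000)
strata = csf_strata(scores, n_strata=2 ** depth)
leaf_of_stratum = np.arange(2 ** depth)               # stratum k -> k-th leaf
print(np.bincount(strata, minlength=2 ** depth))
```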
Feature-based methods.
When scores are not available, it is natural to consider unsupervised clustering methods that operate on the feature vectors. We expect to be roughly constant within a cluster, since neighboring points in the feature space typically behave similarly. Reddy and Vinzamuri (2014) review hierarchical clustering methods, including agglomerative and divisive clustering. One disadvantage of clustering methods is that they tend to scale quadratically with the number of data points. A more efficient alternative is the k-d tree (Bentley, 1975).
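As a lightweight illustration of the latter, the sketch below builds a balanced hierarchical partition by recursive median splits in the spirit of a k-d tree; the depth, synthetic data, and axis-cycling rule are illustrative assumptions.

```python
import numpy as np

def kd_partition(X, depth):
    """Hierarchical partition by recursive median splits, in the spirit of a
    k-d tree.  Returns a leaf index in [0, 2**depth) for every row of X.
    Axis cycling and tie handling are kept deliberately simple."""
    leaf = np.zeros(len(X), dtype=int)

    def split(idx, d):
        if d == depth or idx.size == 0:
            return
        axis = d % X.shape[1]                      # cycle through features
        median = np.median(X[idx, axis])
        right = X[idx, axis] > median
        leaf[idx[right]] += 2 ** (depth - d - 1)   # offset for the right subtree
        split(idx[~right], d + 1)
        split(idx[right], d + 1)

    split(np.arange(len(X)), 0)
    return leaf

X = np.random.default_rng(0).normal(size=(1000, 5))
blocks = kd_partition(X, depth=3)                  # 8 leaf blocks
print(np.bincount(blocks, minlength=8))
```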
Appendix J Additional convergence plots
F1 score.
We provide additional convergence plots for estimating F1 score in Figure 5, which cover six of the seven datasets listed in Table 3. The convergence plot for abt-buy is featured in the main paper in Figure 2. The biased sampling methods (Ours-8, Ours-1, OASIS and IS) converge significantly more rapidly than Passive and Stratified for five of the six datasets. tweets100k is an exception because it is the only dataset with well-balanced classes. Of the biased sampling methods, Ours-8 performs best on two of the six datasets (amzn-goog and restaurant) and equal-best on one (safedriver). Ours-1 performs best on one dataset (dblp-acm) and equal-best on one (safedriver), while IS performs best on one dataset (creditcard). In general, we expect IS to perform well when the oracle response is already well-approximated by the model under evaluation. When this is not the case, the adaptive methods are expected to perform best as they produce refined estimates of during sampling.
Accuracy.
We repeated the experiments described in Section 5 for estimating accuracy. Although accuracy is not recommended for evaluating classifiers in the presence of class imbalance, it is interesting to see whether the biased sampling methods offer any improvement in label efficiency, given the discussion in Section 2.3. Figures 6(a) and 7 present convergence plots for the seven datasets listed in Table 3. A summary of the MSE for all datasets is included in Figure 6(b) assuming a label budget of 1000. OASIS does not appear in these results, as it does not support estimation of accuracy.
We find that gains in label efficiency are less pronounced for the biased sampling methods when estimating accuracy. This is to be expected as accuracy is less sensitive to class imbalance, as noted in Section 2.3. However, there is still a marked improvement in the convergence rate—by an order of magnitude or more—for three of the datasets (abt-buy, dblp-acm and restaurant). Again, we find that the more imbalanced datasets (abt-buy, amzn-goog, dblp-acm and restaurant) seem to benefit most from biased sampling.
Precision-recall curves.
Figure 8 presents convergence plots for estimating precision-recall curves for two of the test pools in Table 3 assuming a label budget of 5000. We find that the biased sampling methods (Ours-8, Ours-1 and IS) offer a significant improvement in the MSE compared to Passive and Stratified—by 1–2 orders of magnitude. The difference in the MSE between the AIS-based methods and IS is less pronounced here than when estimating F1-score.