Needle in a Haystack: Label-Efficient Evaluation under Extreme Class Imbalance
Abstract.
Important tasks like record linkage and extreme classification exhibit extreme class imbalance, with 1 minority instance to every 1 million or more majority instances. Obtaining a sufficient sample of all classes, even just to achieve statistically-significant evaluation, is so challenging that most current approaches yield poor estimates or incur impractical cost. Where importance sampling has been levied against this challenge, restrictive constraints are placed on performance metrics, estimates do not come with appropriate guarantees, or evaluations cannot adapt to incoming labels. This paper develops a framework for online evaluation based on adaptive importance sampling. Given a target performance metric and a model of the oracle response, the framework adapts a distribution over items to label in order to maximize statistical precision. We establish strong consistency and a central limit theorem for the resulting performance estimates, and instantiate our framework with worked examples that leverage Dirichlet-tree models. Experiments demonstrate average MSE superior to the state-of-the-art under fixed label budgets.
1. Introduction
Evaluation of machine learning systems under extreme class imbalance seems like a hopeless task. When minority classes are exceedingly rare—e.g. occurring at a rate of one in a million—a massive number of examples (1 million in expectation) must be labeled before a single minority example is encountered. It seems nigh impossible to reliably estimate performance in these circumstances, as the level of statistical noise is simply too high. Making matters worse, high-quality labels for evaluation are rarely available for free. Typically they are acquired manually at some cost—e.g. by employing expert annotators or crowdsourcing workers. Reducing labeling requirements for evaluation, while ensuring estimates of performance are precise and free from bias, is therefore paramount. One cannot afford to waste resources on evaluation if the results are potentially misleading or totally useless.
From a statistical perspective, evaluation can be cast as the estimation of population performance measures using independently-drawn test data. Although unlabeled test data is abundant in many applied settings, labels must usually be acquired as part of the evaluation process. To ensure estimated performance measures converge to their population values, it is important to select examples for labeling in a statistically sound manner. This can be achieved by sampling examples passively according to the data generating distribution. However, passive sampling suffers from poor label efficiency for some tasks, especially under extreme class imbalance. This impacts a range of application areas including fraud detection (Wei et al., 2013), record linkage (Marchant and Rubinstein, 2017), rare diseases (Khalilia et al., 2011) and extreme classification (Schultheis et al., 2020).
The poor efficiency of passive sampling for some evaluation tasks motivates active or biased sampling strategies, which improve efficiency by focusing labeling efforts on the “most informative” examples (Sawade et al., 2010b). Previous work in this area is based on variance-reduction methods, such as stratified sampling (Bennett and Carvalho, 2010; Druck and McCallum, 2011; Gao et al., 2019), importance sampling (Sawade et al., 2010a; Schnabel et al., 2016) and adaptive importance sampling (Marchant and Rubinstein, 2017). However existing approaches suffer from serious limitations, including lack of support for a broad range of performance measures (Bennett and Carvalho, 2010; Sawade et al., 2010a; Welinder et al., 2013; Schnabel et al., 2016; Marchant and Rubinstein, 2017), weak theoretical justification (Bennett and Carvalho, 2010; Druck and McCallum, 2011; Welinder et al., 2013) and an inability to adapt sampling based on incoming labels (Sawade et al., 2010a; Schnabel et al., 2016).
In this paper, we present a general framework for label-efficient online evaluation that addresses these limitations. Our framework supports any performance measure that can be expressed as a transformation of a vector-valued risk functional—a much broader class than previous work. This allows us to target simple scalar measures such as accuracy and F1 score, as well as more complex multi-dimensional measures such as precision-recall curves for the first time. We leverage adaptive importance sampling (AIS) to efficiently select examples for labeling in batches. The AIS proposal is adapted using labels from previous batches in order to approximate the asymptotically-optimal variance-minimizing proposal. This approximation relies on online estimates of the oracle response, which we obtain via a Bayesian Dirichlet-tree (Dennis, 1996) model that achieves asymptotic optimality for deterministic labels.
We analyze the asymptotic behavior of our framework under general conditions, establishing strong consistency and a central limit theorem. This improves upon a weak consistency result obtained in a less general setting (Marchant and Rubinstein, 2017). We also compare our framework empirically against four baselines: passive sampling, stratified sampling, importance sampling (Sawade et al., 2010a), and the stratified AIS method of Marchant and Rubinstein (2017). Our approach, based on a Dirichlet-tree model, achieves superior or competitive performance on all but one of seven test cases. Proofs and extensions are included in appendices. A Python package implementing our framework has been released open-source at https://github.com/ngmarchant/activeeval.
2. Preliminaries
We introduce notation and define the label-efficient evaluation problem in Section 2.1. Then in Section 2.2, we specify the family of performance measures supported by our framework. Section 2.3 presents novel insights into the impracticality of passive sampling relative to class imbalance and evaluation measure, supported by asymptotic analysis.
2.1. Problem formulation
Consider the task of evaluating a set of systems which solve a prediction problem on a feature space and label space . Let denote the output produced by system for a given input —e.g. a predicted label or distribution over labels. We assume instances encountered by the systems are generated i.i.d. from an unknown joint distribution with density on . Our objective is to obtain accurate and precise estimates of target performance measures (e.g. F1 score) with respect to this distribution at minimal cost.
We consider the common scenario where an unlabeled test pool drawn from the data generating distribution is available upfront. We assume labels are unavailable initially and can only be obtained by querying a stochastic oracle that returns draws from the conditional label distribution. We assume the response time and cost of oracle queries far outweigh contributions from other parts of the evaluation process. This is reasonable in practice, since the oracle requires human input—e.g. annotators on a crowdsourcing platform or domain experts. Under these assumptions, minimizing the cost of evaluation is equivalent to minimizing the number of oracle queries required to estimate target performance measures to a given precision.
Remark 1.
A stochastic oracle covers the most general case where the conditional label distribution has support on one or more labels. This may be due to a set of heterogeneous or noisy annotators (not modeled) or genuine ambiguity in the label. We also consider a deterministic oracle where the conditional label distribution is a point mass. This is appropriate when trusting individual judgments from an expert annotator.
2.2. Generalized measures
When embarking on an evaluation task it is important to select a suitable measure of performance. For some tasks it may be sufficient to measure global error rates, while for others it may be desirable to measure error rates for different classes, sub-populations or parameter configurations—the possibilities are boundless. Since no single measure is suitable for all tasks, we consider a broad family of measures which correspond mathematically to transformations of vector-valued risk functionals.
Definition 1 (Generalized measure).
Let be a vector-valued loss function that maps instances to vectors in dependent on the system outputs . We suppress explicit dependence on where it is understood. Assume is uniformly bounded in sup norm for all system outputs . Denote the corresponding vector-valued risk functional by . For any choice of and continuous mapping differentiable at , the generalized measure is defined as .
Table 1. Binary classification measures expressed as generalized measures.

| Measure | Loss | Mapping |
|---|---|---|
| Accuracy | | |
| Balanced accuracy | | |
| Precision | | |
| Recall | | |
| F score | | |
| Matthews correlation coefficient | | |
| Fowlkes-Mallows index | | |
| Brier score | | |
Table 2. Regression measures expressed as generalized measures.

| Measure | Loss | Mapping |
|---|---|---|
| Mean absolute error | | |
| Mean squared error | | |
| Coefficient of determination | | |
Although this definition may appear somewhat abstract, it encompasses a variety of practical measures. For instance, when the mapping is the identity and the loss is scalar-valued, the family reduces to a scalar-valued risk functional, which includes accuracy and mean-squared error as special cases. Other well-known performance measures such as precision and recall can be represented by selecting a non-linear mapping and a vector-valued loss. Tables 1 and 2 demonstrate how to recover standard binary classification and regression measures for different settings of the loss and mapping. In addition to scalar measures, the family also encompasses vector-valued measures when both the loss and mapping are vector-valued. These can be used to estimate multiple scalar measures simultaneously—e.g. precision and recall of a system, accuracy of several competing systems, or recall of a system for various sub-populations. Below, we demonstrate that a vector-valued generalized measure can represent a precision-recall (PR) curve.
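To make the construction concrete, here is a minimal sketch (hypothetical code, not from the paper's released package) showing precision and recall expressed as a mapping applied to a vector-valued risk, estimated on a fully labeled toy sample. The particular loss layout (true-positive, predicted-positive and actual-positive indicators) is one possible parameterization and is an assumption.

```python
import numpy as np

def loss_vector(y_pred, y_true):
    """One possible vector loss: indicators of (true positive, predicted positive,
    actual positive) for a single instance."""
    return np.array([
        float(y_pred == 1 and y_true == 1),  # true positive
        float(y_pred == 1),                  # predicted positive
        float(y_true == 1),                  # actual positive
    ])

def g_precision_recall(R):
    """Mapping g: transforms the vector risk R = E[loss] into (precision, recall)."""
    tp_rate, pred_pos_rate, actual_pos_rate = R
    return np.array([tp_rate / pred_pos_rate, tp_rate / actual_pos_rate])

# Toy labeled sample.
y_pred = np.array([1, 0, 1, 1, 0, 0])
y_true = np.array([1, 0, 0, 1, 1, 0])

R_hat = np.mean([loss_vector(p, t) for p, t in zip(y_pred, y_true)], axis=0)
print(g_precision_recall(R_hat))  # approximately [0.667, 0.667] on this toy sample
```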
Example 2 (PR curve).
A precision-recall (PR) curve plots the precision and recall of a soft binary classifier on a grid of classification thresholds . Let denote the classifier score for input , where a larger score means the classifier is more confident the label is positive (encoded as ‘1’) and a smaller score means the classifier is more confident the label is negative (encoded as ‘0’). We define a vector loss function that measures whether an instance is: (1) a predicted positive for each threshold (the first entries), (2) a true positive for each threshold (the next entries), and/or (3) a positive (the last entry):
A PR curve can then be obtained using the following mapping function:
where the first entries correspond to the precision at each threshold in ascending order, and the last entries correspond to the recall at each threshold in ascending order.
Remark 2.
We have defined generalized measures with respect to the data generating distribution. While this is the ideal target for evaluation, it is common in practice to define performance measures with respect to a sample. We can recover these from our definition by substituting an empirical distribution for the data generating distribution. For example, the familiar sample-based definition of recall can be obtained by making this substitution in the loss and mapping for recall given in Table 1.
Given our assumption that the test pool is drawn from , any consistent sample-based estimator will converge to the population measure.
2.3. Inadequacy of passive sampling
We have previously mentioned passive sampling as an obvious baseline for selecting instances to label for evaluation. In this section, we conduct an asymptotic analysis for two sample evaluation tasks, which highlights the impracticality of passive sampling under extreme class imbalance. This serves as concrete motivation for our interest in label-efficient solutions. We begin by defining an estimator for generalized measures based on passive samples.
Definition 3 (Passive estimator).
Let be a labeled sample of size drawn passively according to . In practice, is obtained by drawing instances i.i.d. from the marginal and querying labels from the oracle . The Monte Carlo or passive estimator for a generalized measure is then defined as follows:
Note that is a biased estimator for in general, since may be non-linear. However, it is asymptotically unbiased—that is, converges to with probability one in the limit . This property is known as strong consistency and it follows from the strong law of large numbers (Feller, 1968, pp. 243–245) and continuity of . There is also a central limit theorem (CLT) for , reflecting the rate of convergence: asymptotically where is an asymptotic covariance matrix (see Theorem 2). We shall now use this result to analyse the asymptotic efficiency of the passive estimator for two evaluation tasks.
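The passive estimator can be sketched as follows, assuming a generic per-instance loss that already incorporates the system's prediction and a mapping passed in as a function; this is illustrative code under those assumptions, not the paper's implementation.

```python
import numpy as np

def passive_estimate(pool, oracle, loss, g, n, rng):
    """Monte Carlo (passive) estimate of a generalized measure g(E[loss]):
    draw n instances i.i.d. from the pool, query their labels from the oracle,
    average the vector loss and apply the mapping g."""
    idx = rng.integers(0, len(pool), size=n)            # i.i.d. uniform draws
    losses = np.array([loss(pool[i], oracle(pool[i])) for i in idx])
    R_hat = losses.mean(axis=0)                         # estimated vector risk
    return g(R_hat)                                     # plug-in estimate of the measure
```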
Example 4 (Accuracy).
Consider estimating the accuracy (row 1 of Table 1) of a classifier. By the CLT, it is straightforward to show that the passive estimator for accuracy is asymptotically normal, and the labeled sample size required to reach a given precision depends only on the variance of the accuracy loss. Although this is suboptimal (theoretically it is possible to achieve an asymptotic variance of zero), it is not impractical: a passive sample reasonably captures the variance in this case.
This example shows that passive sampling is not always a poor choice. It can yield reasonably precise estimates of a generalized measure, so long as the measure is sensitive to regions of the space with high density as measured by the data generating distribution. However, where these conditions are not satisfied, passive sampling may become impractical, requiring huge samples of labeled data in order to sufficiently drive down the variance. This is the case for the example below, where the measure is sensitive to regions of the space with low density under the data generating distribution.
Example 5 (Recall).
Consider estimating recall (row 4 of Table 1) of a binary classifier. By the CLT, the passive estimator is asymptotically normal with a variance that scales inversely with the relative frequency of the positive class. Thus the labeled sample size required to reach a given precision grows as the positive class becomes rarer, which makes passive sampling impractical under extreme class imbalance.
This example is not merely an intellectual curiosity: there are important applications where the positive class is exceedingly rare. For instance, in record linkage the relative frequency of the positive class scales inversely with the size of the databases being linked (Marchant and Rubinstein, 2017).
3. An AIS-based framework for label-efficient evaluation

When passive sampling is inefficient (unless otherwise specified, we mean label or sample efficiency when using the term "efficiency" without qualification), as we have seen in the preceding analysis, substantial improvements can often be achieved through biased sampling. In this section, we devise a framework for efficiently estimating generalized measures that leverages a biased sampling approach called adaptive importance sampling (AIS) (Bugallo et al., 2017). AIS estimates an expectation using samples drawn sequentially from a biased proposal distribution that is adapted in stages based on samples from previous stages. It produces non-i.i.d. samples in general, unlike passive sampling and static (non-adaptive) importance sampling (IS) (Rubinstein and Kroese, 2016). AIS is a powerful approach because it does not assume an effective proposal distribution is known a priori; instead, the proposal is learnt from data. This addresses the main limitation of static IS: one may be stuck using a sub-optimal proposal, which compromises label efficiency.
There are many variations of AIS which differ in: (i) the way samples are allocated among stages; (ii) the dependence of the proposal on previous stages; (iii) the types of proposals considered; and (iv) the way samples are weighted within and across stages. Our approach is completely flexible with respect to points (i)–(iii). For point (iv), we use simple importance-weighting as it is amenable to asymptotic analysis using martingale theory (Portier and Delyon, 2018). A more complex weighting scheme is proposed in (Cornuet et al., 2012), which may have better stability; however, its asymptotic behavior is not well understood: consistency was proved for this scheme in the limit where the sample size at each stage is monotonically increasing (Marin et al., 2019), but to our knowledge a CLT remains elusive.
Our framework is summarized in Figure 1. Given a target performance measure and an unlabeled test pool , the labeling process proceeds in several stages indexed by . In the -th stage, instances are drawn i.i.d. from according to proposal . Labels are obtained for each instance by querying the oracle, and recorded with their importance weights in . At the end of the -th stage, the proposal is updated for the next stage. This update may depend on the weighted samples from all previous stages as recorded in . At any point during sampling, we estimate as follows:
(1)  Ĝ = g(R̂),   R̂ = (1/N) Σᵢ wᵢ ℓ(xᵢ, yᵢ),

where the sum runs over all N labeled instances collected so far and wᵢ denotes the importance weight of the i-th instance.
For generality, we permit the user to specify the sample allocations and the procedure for updating the proposals in Figure 1. In practice, we recommend allocating a small number of samples to each stage, as empirical studies suggest that efficiency improves when the proposal is updated more frequently (Portier and Delyon, 2018). However, this must be balanced with practical constraints, as a small sample allocation limits the ability to acquire labels in parallel. In Section 3.2, we recommend a practical procedure for updating the proposals. It approximates the asymptotically-optimal variance-minimizing proposal based on an online model of the oracle response. We present an example model in Section 4, which can leverage prior information from the systems under evaluation. Further practicalities including sample reuse and confidence intervals are discussed in Section 3.3.
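The staged labeling loop can be sketched as follows (a schematic of the process in Figure 1, not the interface of the released activeeval package): each stage draws a batch from the current proposal over the pool, queries the oracle, records importance weights, and hands the history to a user-supplied proposal-update procedure. The uniform marginal over the pool and the function names are assumptions for illustration.

```python
import numpy as np

def ais_evaluate(pool, oracle, loss, g, batch_sizes, init_proposal, update_proposal, rng):
    """Adaptive importance sampling over a finite test pool.

    init_proposal: initial pmf over pool indices (sums to 1).
    update_proposal: callback mapping the labeled history to a new pmf.
    """
    M = len(pool)
    f = np.full(M, 1.0 / M)            # marginal over the pool (uniform here)
    q = init_proposal
    history = []                       # (index, label, weight) triples

    for n_t in batch_sizes:            # stage t draws n_t instances i.i.d. from q
        idx = rng.choice(M, size=n_t, p=q)
        for i in idx:
            y = oracle(pool[i])                    # query the (costly) oracle
            w = f[i] / q[i]                        # importance weight
            history.append((i, y, w))
        q = update_proposal(history)               # adapt proposal for next stage

    # Weighted estimate of the vector risk, then apply the mapping g, as in (1).
    R_hat = np.mean([w * loss(pool[i], y) for i, y, w in history], axis=0)
    return g(R_hat)
```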
Remark 3 (Constraint on the proposal).
In a standard application of AIS, one is free to select any proposal with support on the set of instances that contribute to the target expectation. However, we have an additional constraint since we cannot bias sampling from the oracle. Thus we consider proposals that bias only the distribution over instances, with labels always drawn from the oracle's conditional distribution.
Remark 4 (Static Importance Sampling).
Our AIS framework reduces to static importance sampling when only a single stage is used, so that all samples are drawn from a single static proposal.
3.1. Asymptotic analysis
We study the asymptotic behavior of estimates for produced by the generic framework in Figure 1. Since our analysis does not depend on how samples are allocated among stages, it is cleaner to identify a sample using a single index where is the total number of samples, rather than a pair of indices . Concretely, we map each to index . With this change, we let denote the proposal used to generate sample . It is important to note that this notation conceals the dependence on previous samples. Thus should be understood as shorthand for where denotes the filtration.
Our analysis relies on the fact that is a martingale with respect to . The consistency of then follows by a strong law of large numbers for martingales (Feller, 1971) and the continuous mapping theorem. A proof is included in Appendix A.
Theorem 1 (Consistency).
Let the support of proposal be a superset of for all and assume
(2)
Then is strongly consistent for .
We also obtain a central limit theorem (CLT), which is useful for assessing asymptotic efficiency and computing approximate confidence intervals. Our proof (see Appendix B) invokes a CLT for martingales (Portier and Delyon, 2018) and the multivariate delta method.
Theorem 2 (CLT).
Suppose
(3)
where is an a.s. finite random positive semidefinite matrix, and there exists such that
(4)
Then converges in distribution to a multivariate normal with covariance matrix where is the Jacobian of .
3.2. Variance-minimizing proposals
In order to achieve optimal asymptotic efficiency, we would like to use the AIS proposal that achieves the minimal asymptotic variance as defined in Theorem 2. We can solve for this optimal proposal using functional calculus, as demonstrated in the proposition below. Note that we must generalize the variance minimization problem for vector-valued since becomes a covariance matrix. We opt to use the total variance (the trace of ) since the diagonal elements of are directly related to statistical efficiency, while the off-diagonal elements measure correlations between components of that are beyond our control. This choice also ensures the functional optimization problem is tractable.
Proposition 3 (Asymptotically-optimal proposal).
Suppose the Jacobian has full row rank and . Then the proposal
(5)
achieves the minimum total asymptotic variance of
Appendix D provides sufficient conditions on and the oracle which ensure .
We use the above result to design a practical scheme for adapting a proposal for AIS. At each stage of the evaluation process, we approximate the asymptotically-optimal proposal using an online model for the oracle response . The model for should be initialized using prior information if available (e.g. from the systems under evaluation) and updated at the end of each stage using labels received from the oracle. However, we cannot simply estimate by plugging in estimates of directly, as the resulting proposal may not satisfy the conditions of Theorems 1 and 2. Below we provide estimators for which do satisfy the conditions, and provide sufficient conditions for achieving asymptotic optimality.
Proposition 4.
If the oracle is stochastic, let be an estimate for whose support includes the support of for all stages , and assume pointwise in . Alternatively, if the oracle is deterministic, let be a posterior distribution for the response whose support includes for all , and assume pointwise in . Let be a positive bounded sequence and be an estimator for which converges a.s. to . Assume is finite (e.g. a pool of test data) and for all . Then the proposals
satisfy the conditions of Theorems 1 and 2. If in addition and (alternatively ) and , then the proposals approach asymptotic optimality.
3.3. Practicalities
We briefly discuss solutions to two issues that may arise in practical settings: sample reuse and approximate confidence regions.
3.3.1. Sample reuse
Suppose our framework is used to estimate a generalized measure. If the joint distribution associated with the prediction problem has not changed, it may be desirable to reuse the weighted samples to estimate a different generalized measure. This is possible so long as the sequence of proposals used for the original estimate has the required support for the new measure. More precisely, the support of each proposal must cover the instances that contribute to the loss functions associated with both measures.
If one anticipates sample reuse, the proposals can be made less specialized to a particular measure by mixing them with the marginal distribution over instances, with a mixing-weight hyperparameter that controls the degree of specialization.
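A minimal sketch of this mixing step, assuming the proposal and marginal are both represented as probability vectors over a finite pool; the parameter name alpha and the convention that larger alpha means less specialization are assumptions, not the paper's notation.

```python
import numpy as np

def mix_with_marginal(q, f, alpha):
    """Convex combination of the variance-minimizing proposal q and the marginal f:
    alpha = 0 keeps q unchanged, alpha = 1 recovers passive sampling from f."""
    return (1.0 - alpha) * np.asarray(q) + alpha * np.asarray(f)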
3.3.2. Approximate confidence regions
When publishing performance estimates, it may be desirable to quantify statistical uncertainty. An asymptotic confidence region for a generalized measure is given by the ellipsoid
where the center is the sample mean, the shape is governed by the sample covariance matrix, and the size is set by the critical value of the chi-squared distribution at the chosen significance level. This region can be approximated using the estimator in (1) and the following estimator for the covariance:
This is obtained from the expression for in Theorem 2, by plugging in AIS estimators for the variance and , and approximating by the most recent proposal .
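As a rough illustration, the region can be computed from the weighted losses via the delta method: estimate the covariance of the importance-weighted loss vectors, push it through the Jacobian of the mapping, and compare a quadratic form against a chi-squared critical value. This is a sketch under simplifying assumptions (simple importance weighting, hypothetical function names), not the paper's exact estimator.

```python
import numpy as np
from scipy.stats import chi2

def delta_method_covariance(weighted_losses, jacobian_at_R_hat):
    """Approximate covariance of the estimated measure g(R_hat) via the delta
    method: J Cov(w * loss) J^T / N."""
    N = weighted_losses.shape[0]
    V = np.atleast_2d(np.cov(weighted_losses, rowvar=False))  # covariance of w_i * loss_i
    J = np.atleast_2d(jacobian_at_R_hat)                      # Jacobian of g at R_hat
    return J @ V @ J.T / N

def in_confidence_region(G_candidate, G_hat, Sigma_hat, alpha=0.05):
    """Test whether a candidate value lies inside the approximate (1 - alpha)
    confidence ellipsoid centered at the estimate G_hat."""
    diff = np.atleast_1d(G_candidate - G_hat)
    stat = diff @ np.linalg.solve(np.atleast_2d(Sigma_hat), diff)
    return stat <= chi2.ppf(1.0 - alpha, df=diff.size)
```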
4. A Dirichlet-tree model for the oracle response
In the previous section, we introduced a scheme for updating the AIS proposal which relies on an online model of the oracle response. Since there are many conceivable choices for the model, we left it unspecified for full generality. In this section, we propose a particular model that is suited for evaluating classifiers when the response from the oracle is deterministic (see Remark 1). Concretely, we make the assumption that the label space is a finite set and is a point mass at for all . An extension of this section for stochastic oracles (where is not necessarily a point mass) is presented in Appendix H.
Since we would like to leverage prior information (e.g. classifier scores) from the system(s) under evaluation and perform regular updates as labels are received from the oracle, we opt to use a Bayesian model. Another design consideration is label efficiency. Since labels are scarce and the test pool may be huge, we want to design a model that allows for sharing of statistical strength between "similar" instances in the pool. To this end, we propose a model that incorporates a hierarchical partition of the pool, where instances assigned to hierarchically neighboring blocks are assumed to elicit a similar oracle response (in contrast to models used in related work (Bennett and Carvalho, 2010; Marchant and Rubinstein, 2017), which assume the oracle response is independent across blocks of a non-hierarchical partition). Various unsupervised methods may be used to learn a hierarchical partition, including hierarchical agglomerative/divisive clustering (Reddy and Vinzamuri, 2014), k-d trees (Bentley, 1975), and stratification based on classifier scores (see Appendix I for a brief review).
4.1. Generative process
We assume the global oracle response (averaged over all instances) is generated according to a Dirichlet distribution, viz.
where are concentration hyperparameters. The label for each instance (indexing ) is then assumed to be generated i.i.d. according to :
We assume a hierarchical partition of the test pool is given. The partition can be represented as a tree , where the leaf nodes of correspond to the finest partition of into disjoint blocks such that . We assume each instance is assigned to one of the blocks (leaf nodes) according to a distribution with a Dirichlet-tree prior (Dennis, 1996; Minka, 1999):
The Dirichlet-tree distribution is a generalization of the Dirichlet distribution, which allows for more flexible dependencies between the categories (blocks in this case). Categories that are hierarchically nearby based on the tree structure tend to be correlated. The Dirichlet concentration hyperparameters associated with the internal nodes also control the correlation structure.
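To make the prior concrete, here is a minimal sketch of sampling leaf (block) probabilities from a Dirichlet-tree prior, assuming the tree is represented as nested lists and a single shared concentration value (the paper sets node-level concentration hyperparameters, so this is a simplification): branching probabilities are drawn from a Dirichlet at each internal node, and a leaf's probability is the product of branching probabilities along its root-to-leaf path.

```python
import numpy as np

def sample_dirichlet_tree(node, rng, concentration=1.0):
    """Draw leaf probabilities from a Dirichlet-tree prior.

    node: either a list of child nodes (internal node) or a leaf identifier.
    Returns a dict mapping leaf identifiers to probabilities (summing to 1).
    """
    if not isinstance(node, list):              # leaf node
        return {node: 1.0}
    # Draw branching probabilities for this internal node from a Dirichlet.
    branch_probs = rng.dirichlet([concentration] * len(node))
    leaf_probs = {}
    for child, p in zip(node, branch_probs):
        for leaf, q in sample_dirichlet_tree(child, rng, concentration).items():
            leaf_probs[leaf] = p * q            # multiply along the root-to-leaf path
    return leaf_probs

# Example: a depth-2 binary tree over four blocks.
rng = np.random.default_rng(0)
tree = [["b0", "b1"], ["b2", "b3"]]
print(sample_dirichlet_tree(tree, rng))
```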
4.2. Inference
For a deterministic oracle, the response for instance is either observed (previously labeled) or unobserved (yet to be labeled). It is important to model the observation process in case it influences the values of inferred parameters. To this end, we let be observation indicators for the labels at the end of stage of the evaluation process (see Algorithm 1). We initialize and define in the obvious way: is 1 if the label for instance has been observed by the end of stage and 0 otherwise. From Algorithm 1, we have that the -th instance selected in stage depends on the labels of the previously observed instances and the block assignments :
Our goal is to infer the unobserved labels (and hence the oracle response) at each stage of the evaluation process. We assume the block assignments are fully observed. Since the observation indicators are independent of the unobserved labels conditional on the observed labels, our model satisfies ignorability (Jaeger, 2005). This means we can ignore the observation process when conducting inference. Since we do not require a full posterior distribution over all parameters, it is sufficient to conduct inference using the expectation-maximization algorithm. This yields a distribution over the unobserved label for each instance and point estimates for the other parameters ( and ). Full details are provided in Appendix F.
4.3. Asymptotic optimality
Since the Dirichlet-tree model described in this section is consistent for the true oracle (deterministic) response, it can be combined with the proposal updates described in Proposition 4 to yield an asymptotically-optimal AIS algorithm. This result is made precise in the following proposition, which is proved in Appendix G.
Proposition 5.
Consider an instantiation of our framework under a deterministic oracle where:
• the oracle response is estimated online using the Dirichlet-tree model described in this section via the EM algorithm;
• the proposals are adapted using the estimator defined in Proposition 4, with the positive bounded sequence chosen by the user.
Then Theorems 1 and 2 hold and the estimator is asymptotically-optimal.
5. Experimental study
We conduct experiments to assess the label efficiency of our proposed framework (using the procedure described in Section 3.2 to adapt the AIS proposal, together with the online model for the oracle response presented in Section 4) for a variety of evaluation tasks. The tasks vary in terms of the degree of class imbalance, the quality of predictions/scores from the classifier under evaluation, the size of the test pool, and the target performance measure. Where possible, we compare our framework (denoted Ours) with the following baselines:
• Passive: passive sampling as specified in Definition 3.
• IS: static importance sampling (Sawade et al., 2010a).
• Stratified: an online variant of stratified sampling with proportional allocation, as used in (Druck and McCallum, 2011). Items are sampled one-at-a-time in proportion to the size of the allocated stratum.
• OASIS: a stratified AIS method for estimating F scores (Marchant and Rubinstein, 2017).
5.1. Evaluation tasks
We prepare classifiers and test pools for evaluation using publicly-available datasets from various domains, as summarized in Table 3. amzn-goog, dblp-acm, abt-buy (Köpcke et al., 2010) and restaurant (RIDDLE, 2003) are benchmark datasets for entity resolution. They contain records from two sources and the goal is to predict whether pairs of records refer to the same entity or not. safedriver contains anonymized records from a car insurance company, and the task is to predict drivers who are likely to make a claim (Porto Seguro, 2017). creditcard relates to fraud detection for online credit card transactions (Pozzolo et al., 2015). tweets100k has been used for tweet sentiment analysis (Mozafari et al., 2014).
For amzn-goog, dblp-acm, abt-buy, restaurant and tweets100k we use the same classifiers and test pools as in (Marchant and Rubinstein, 2017). For safedriver and creditcard we prepare our own by randomly splitting the data into train/test with a 70:30 ratio, and training classifiers using supervised learning. In scenarios where labeled data is scarce, semi-supervised or unsupervised methods might be used instead—the choice of learning paradigm has no bearing on evaluation. We consider three target performance measures—F1 score, accuracy and precision-recall curves—as separate evaluation tasks.
Source | Domain | Size | Imbalance ratio | Classifier type | F1 score |
---|---|---|---|---|---|
abt-buy (Köpcke et al., 2010) | Entity resolution | 53,753 | 1075 | Linear SVM | 0.595 |
amzn-goog (Köpcke et al., 2010) | Entity resolution | 676,267 | 3381 | Linear SVM | 0.282 |
dblp-acm (Köpcke et al., 2010) | Entity resolution | 53,946 | 2697 | Linear SVM | 0.947 |
restaurant (RIDDLE, 2003) | Entity resolution | 149,747 | 3328 | Linear SVM | 0.899 |
safedriver (Porto Seguro, 2017) | Risk assessment | 178,564 | 26.56 | XGBoost (Chen and Guestrin, 2016) | 0.100 |
creditcard (Pozzolo et al., 2015) | Fraud detection | 85,443 | 580.2 | Logistic Regression | 0.728 |
tweets100k (Mozafari et al., 2014) | Sentiment analysis | 20,000 | 0.990 | Linear SVM | 0.770 |
5.2. Setup
Oracle.
We simulate an oracle using labels included with each dataset. Since the datasets only contain a single label for each instance, we assume a deterministic oracle. Thus, when computing the consumed label budget, we only count the first query to the oracle for an instance as a consumed label. If an instance is selected for labeling again, we reuse the label from the first query.
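For concreteness, the budget accounting can be implemented with a small caching wrapper of the following kind (a hypothetical helper, not code from the paper): repeated queries for the same instance return the cached label and do not consume additional budget.

```python
class CachedOracle:
    """Simulated deterministic oracle: only the first query per instance consumes
    label budget; repeat queries return the cached label for free."""

    def __init__(self, true_labels):
        self.true_labels = true_labels   # ground-truth labels from the dataset
        self.cache = {}

    @property
    def consumed_budget(self):
        return len(self.cache)

    def query(self, i):
        if i not in self.cache:
            self.cache[i] = self.true_labels[i]
        return self.cache[i]
```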
Partitioning.
Ours, Stratified and OASIS assume the test pool is partitioned so the oracle response within each block is ideally uniform. We construct partitions by binning instances according to their classifier scores. The bin edges are determined using the cumulative square-root frequency (CSF) method (Dalenius and Hodges, 1959), which is widely used for stratified sampling. We set the number of bins to . Since Ours is capable of exploiting partitions with hierarchical structure, we fill in post-hoc structure by associating the CSF bins with the leaf nodes of an appropriately-sized tree in breadth-first order. We consider two trees: a tree of depth 1 with branching factor (denoted Ours-1, equivalent to a non-hierarchical partition) and a tree of depth 8 with branching factor 2 (denoted Ours-8).
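A sketch of the CSF construction, assuming scores are first binned with a fine histogram; the exact bin counts and tie handling in the paper's implementation may differ.

```python
import numpy as np

def csf_bin_edges(scores, n_strata=8, n_fine_bins=1000):
    """Dalenius and Hodges cumulative square-root frequency (CSF) rule: cut the
    cumulative sqrt-frequency scale into equal-width segments."""
    counts, fine_edges = np.histogram(scores, bins=n_fine_bins)
    csf = np.cumsum(np.sqrt(counts))
    targets = csf[-1] * np.arange(1, n_strata) / n_strata
    cut_idx = np.searchsorted(csf, targets)
    return np.concatenate(([fine_edges[0]], fine_edges[cut_idx + 1], [fine_edges[-1]]))

# Usage: assign each pool instance to a stratum based on its classifier score.
rng = np.random.default_rng(0)
scores = rng.beta(0.5, 5.0, size=10_000)    # stand-in for classifier scores
edges = csf_bin_edges(scores, n_strata=8)
strata = np.digitize(scores, edges[1:-1])   # stratum index in {0, ..., 7}
```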
Hyperparameters.
We leverage prior information from the classifiers under evaluation to set hyperparameters. Wherever a prior estimate of is required, we use the classifier score , applying the softmax function to non-probabilistic scores. For the Dirichlet-tree model we set hyperparameters as follows:
where is the mean score over instances in the -th block, denotes a non-root node of the tree , denotes the depth of node in , and is an indicator equal to 1 if node is traversed to reach leaf node and 0 otherwise.
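Where a prior estimate of the positive-class probability is required from a non-probabilistic classifier score (e.g. an SVM decision value), a logistic transform (the two-class case of the softmax mentioned above) can be used; the unit scaling here is an assumption rather than a calibrated choice.

```python
import numpy as np

def scores_to_prior(raw_scores):
    """Map real-valued classifier scores to prior estimates of the probability
    that the oracle returns the positive label."""
    return 1.0 / (1.0 + np.exp(-np.asarray(raw_scores, dtype=float)))
```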
Repeats.
Since the evaluation process is randomized, we repeated each experiment 1000 times, computing and reporting the mean behavior with bootstrap confidence intervals.
5.3. Results
We provide a summary of the results here, with a particular focus on the results for F1 score, which is supported by all baselines. Detailed results for accuracy and precision-recall curves are included in Appendix J.
F1 score.
To assess convergence of the estimated F1 score, we plot the mean-squared error (MSE) as a function of the consumed label budget for all datasets and methods. A plot of this kind is included for abt-buy in Figure 2. It shows that Ours is significantly more efficient than the baseline methods in this case, achieving a lower MSE for all label budgets. The Passive and Stratified methods perform significantly worse, achieving an MSE at least one order of magnitude greater than the biased sampling methods. Figure 2 also plots the convergence of the proposal in terms of the mean KL divergence from the asymptotically-optimal proposal. The results here are in line with expectations: convergence of the F1 score is more rapid when the proposal is closer to asymptotic optimality.


Figure 3 summarizes the convergence plots for the six other data sets (included in Appendix J) by plotting the MSE of the estimated F1 score after 2000 labels are consumed. It shows that Ours achieves best or equal-best MSE on all but one of the datasets (dblp-acm) within the 95% confidence bands. We find the adaptive methods Ours and OASIS perform similarly to IS when the prior estimates of the oracle response are accurate—i.e. there is less to gain by adapting. Of the two adaptive methods, Ours generally converges more rapidly than OASIS, which might be expected since our procedure for adapting the proposal is asymptotically-optimal. The hierarchical variant of our model Ours-8 tends to outperform the non-hierarchical variant Ours-1, which we expect when the oracle response varies smoothly across neighboring blocks. Finally, we observe that Passive and Stratified are competitive when the class imbalance is less severe. This agrees with our analysis in Section 2.3. We note similar trends when targeting accuracy in place of F1 score, as presented in Appendix J.
PR curves.
We estimate PR curves on a uniformly-spaced grid of classifier thresholds , where is the minimum classifier score, is the maximum classifier score, and (see Example 2). We also use the uniform grid to partition the test pool (in place of the CSF method), associating each block of the partition with four neighboring bins on the grid to yield blocks. Figure 4 illustrates the vast improvement of the biased sampling methods (Ours-8 and IS) over Passive for this evaluation task. The estimated PR curves shown for Ours-8 and IS vary minimally about the true PR curve, and are reliable for selecting an operating threshold. The same cannot be said for the curves produced by Passive, which exhibit such high variance that they are essentially useless.
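For reference, precision and recall on a uniform threshold grid can be computed directly from scores and labels as below; this assumes fully labeled data and is shown only to make the target quantity concrete.

```python
import numpy as np

def pr_curve_on_grid(scores, labels, n_thresholds=100):
    """Precision and recall at uniformly spaced thresholds between the minimum
    and maximum classifier scores."""
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    precision, recall = [], []
    n_pos = labels.sum()
    for tau in thresholds:
        pred_pos = scores >= tau
        tp = np.sum(pred_pos & (labels == 1))
        precision.append(tp / max(pred_pos.sum(), 1))   # avoid division by zero
        recall.append(tp / n_pos)
    return thresholds, np.array(precision), np.array(recall)
```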

6. Related work
Existing approaches to label-efficient evaluation in a machine learning context largely fall into three categories: model-based (Welinder et al., 2013), stratified sampling (Bennett and Carvalho, 2010; Druck and McCallum, 2011) and importance sampling (Sawade et al., 2010a; Marchant and Rubinstein, 2017). The model-based approach in (Welinder et al., 2013) estimates precision-recall curves for binary classifiers. However, it uses inefficient passive sampling to select instances to label, and makes strong assumptions about the distribution of scores and labels which can result in biased estimates. Stratified sampling has been used to estimate scalar performance measures such as precision, accuracy and F1 score. Existing approaches (Bennett and Carvalho, 2010; Druck and McCallum, 2011) bias the sampling of instances from strata (akin to blocks) using a heuristic generalization of the optimal allocation principle (Cochran, 1977). However, stratified sampling is considered to be a less effective variance reduction method compared to importance sampling (Rubinstein and Kroese, 2016), and it does not naturally support stochastic oracles.
Static importance sampling (Sawade et al., 2010a) and stratified adaptive importance sampling (Marchant and Rubinstein, 2017) have been used for online evaluation, and are most similar to our approach. However, (Marchant and Rubinstein, 2017) only supports the estimation of F1 score, and (Sawade et al., 2010a) only supports the estimation of scalar generalized risks, which can be viewed as a sub-family of generalized measures. Both of these methods attempt to approximate the asymptotically-optimal variance-minimizing proposal; however, the approximation used in (Sawade et al., 2010a) is non-adaptive and is not optimized for deterministic oracles, while the approximation used in (Marchant and Rubinstein, 2017) is adaptive but less accurate due to the stratified design.
Novel evaluation methods are also studied in the information retrieval (IR) community (see survey (Kanoulas, 2016)). Some tasks in the IR setting can be cast as prediction problems by treating query-document pairs as inputs and relevance judgments as outputs. Early approaches used relevance scores from the IR system to manage the abundance of irrelevant documents in an ad hoc manner (Cormack et al., 1998). Recent approaches (Schnabel et al., 2016; Li and Kanoulas, 2017) are based on a statistical formulation similar to ours, however they are specialized to IR systems. Within the IR community, stratified sampling and cluster sampling have also been used to efficiently evaluate knowledge graphs (Gao et al., 2019).
Adaptive importance sampling (AIS) is studied more generally in the context of Monte Carlo integration (see review by Bugallo et al., 2017). Most methods are inappropriate for our application, as they assume a continuous space without constraints on the proposal (see Remark 3). Oh and Berger (1992) introduced the idea of adapting the proposal over multiple stages using samples from the previous stages. Cappé et al. (2008) devise a general framework using independent mixtures as proposals. The method of Cornuet et al. (2012) continually re-weights all past samples, however it is more computationally demanding and less amenable to analysis since it breaks the martingale property. Portier and Delyon (2018) analyze parametric AIS in the large sample limit. This improves upon earlier work which assumed the number of stages goes to infinity (Douc and Moulines, 2008) or the sample size at each stage is monotonically increasing (Marin et al., 2019).
Finally, we note that label-efficient evaluation may be viewed as a counterpart to active learning, as both are concerned with reducing labeling requirements. There is a large body of literature concerning active learning—we refer the reader to surveys (Settles, 2009; Gilyazev and Turdakov, 2018). However, whereas active learning aims to find a model with low risk using actively-sampled training data, active evaluation aims to estimate the risk of any given model using actively-sampled test data.
7. Conclusion
We have proposed a framework for online supervised evaluation, which aims to minimize the labeling effort required to achieve precise, asymptotically-unbiased performance estimates. Our framework is based on adaptive importance sampling, with variance-minimizing proposals that are refined adaptively based on an online model of the oracle response. Under verifiable conditions on the chosen performance measure and the model, we proved strong consistency (asymptotic unbiasedness) of the resulting performance estimates and a central limit theorem. We instantiated our framework to evaluate classifiers using deterministic or stochastic human annotators. Our approach, based on a hierarchical Dirichlet-tree model, achieves superior or competitive performance on all but one of seven test cases.
Acknowledgements.
N. Marchant acknowledges the support of an Australian Government Research Training Program Scholarship. B. Rubinstein acknowledges the support of Australian Research Council grant DP150103710.

References
- Bennett and Carvalho (2010) Paul N. Bennett and Vitor R. Carvalho. 2010. Online Stratified Sampling: Evaluating Classifiers at Web-scale. In CIKM. 1581–1584. https://doi.org/10.1145/1871437.1871677
- Bentley (1975) Jon Louis Bentley. 1975. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18, 9 (Sept. 1975), 509–517. https://doi.org/10.1145/361002.361007
- Bugallo et al. (2017) M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez, and P. M. Djuric. 2017. Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Processing Magazine 34, 4 (July 2017), 60–79. https://doi.org/10.1109/MSP.2017.2699226
- Cappé et al. (2008) Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. 2008. Adaptive importance sampling in general mixture classes. Statistics and Computing 18, 4 (01 Dec. 2008), 447–459. https://doi.org/10.1007/s11222-008-9059-x
- Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
- Cochran (1977) William G. Cochran. 1977. Sampling Techniques (3rd ed.). Wiley, New York.
- Cormack et al. (1998) Gordon V. Cormack, Christopher R. Palmer, and Charles L. A. Clarke. 1998. Efficient Construction of Large Test Collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia) (SIGIR ’98). Association for Computing Machinery, New York, NY, USA, 282–289. https://doi.org/10.1145/290941.291009
- Cornuet et al. (2012) Jean-Marie Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P. Robert. 2012. Adaptive Multiple Importance Sampling. Scandinavian Journal of Statistics 39, 4 (2012), 798–812. https://doi.org/10.1111/j.1467-9469.2011.00756.x
- Dalenius and Hodges (1959) Tore Dalenius and Joseph L. Hodges. 1959. Minimum Variance Stratification. J. Amer. Statist. Assoc. 54, 285 (March 1959), 88–101. https://doi.org/10.1080/01621459.1959.10501501
- Dennis (1996) Samuel Y. Dennis. 1996. A Bayesian analysis of tree-structured statistical decision problems. Journal of Statistical Planning and Inference 53, 3 (1996), 323–344. https://doi.org/10.1016/0378-3758(95)00112-3
- Douc and Moulines (2008) Randal Douc and Eric Moulines. 2008. Limit theorems for weighted samples with applications to sequential Monte Carlo methods. The Annals of Statistics 36, 5 (Oct. 2008), 2344–2376. https://doi.org/10.1214/07-AOS514
- Druck and McCallum (2011) Gregory Druck and Andrew McCallum. 2011. Toward Interactive Training and Evaluation. In CIKM (New York, NY, USA). 947–956. https://doi.org/10.1145/2063576.2063712
- Feller (1968) W. Feller. 1968. An Introduction to Probability Theory and Its Applications, Volume 1 (3rd ed.). Wiley.
- Feller (1971) W. Feller. 1971. An Introduction to Probability Theory and Its Applications, Volume 2 (2nd ed.). Wiley.
- Gao et al. (2019) Junyang Gao, Xian Li, Yifan Ethan Xu, Bunyamin Sisman, Xin Luna Dong, and Jun Yang. 2019. Efficient Knowledge Graph Accuracy Evaluation. Proc. VLDB Endow. 12, 11 (July 2019), 1679–1691. https://doi.org/10.14778/3342263.3342642
- Gilyazev and Turdakov (2018) R. A. Gilyazev and D. Yu. Turdakov. 2018. Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling. Programming and Computer Software 44, 6 (Nov. 2018), 476–491. https://doi.org/10.1134/S0361768818060142
- Gunning and Horgan (2004) Patricia Gunning and Jane M. Horgan. 2004. A New Algorithm for the Construction of Stratum Boundaries in Skewed Populations. Survey Methodology 30, 2 (Dec. 2004), 159–166.
- Jaeger (2005) Manfred Jaeger. 2005. Ignorability in Statistical and Probabilistic Inference. J. Artif. Int. Res. 24, 1 (Dec. 2005), 889–917.
- Kanoulas (2016) Evangelos Kanoulas. 2016. A Short Survey on Online and Offline Methods for Search Quality Evaluation. Springer International Publishing, Cham, 38–87. https://doi.org/10.1007/978-3-319-41718-9_3
- Khalilia et al. (2011) Mohammed Khalilia, Sounak Chakraborty, and Mihail Popescu. 2011. Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making 11, 1 (2011), 13 pages. https://doi.org/10.1186/1472-6947-11-51
- Köpcke et al. (2010) Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-world Match Problems. PVLDB 3, 1 (2010), 484–493. https://doi.org/10.14778/1920841.1920904
- Li and Kanoulas (2017) Dan Li and Evangelos Kanoulas. 2017. Active Sampling for Large-scale Information Retrieval Evaluation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore, Singapore) (CIKM ’17). ACM, New York, NY, USA, 49–58. https://doi.org/10.1145/3132847.3133015
- Marchant and Rubinstein (2017) Neil G. Marchant and Benjamin I. P. Rubinstein. 2017. In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1322–1333. https://doi.org/10.14778/3137628.3137642
- Marin et al. (2019) Jean-Michel Marin, Pierre Pudlo, and Mohammed Sedki. 2019. Consistency of adaptive importance sampling and recycling schemes. Bernoulli 25, 3 (Aug. 2019), 1977–1998. https://doi.org/10.3150/18-BEJ1042
- Minka (1999) Tom Minka. 1999. The Dirichlet-tree Distribution. Technical Report. Justsystem Pittsburgh Research Center.
- Mozafari et al. (2014) Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning. Proc. VLDB Endow. 8, 2 (Oct. 2014), 125–136. https://doi.org/10.14778/2735471.2735474
- Oh and Berger (1992) Man-Suk Oh and James O. Berger. 1992. Adaptive importance sampling in monte carlo integration. Journal of Statistical Computation and Simulation 41, 3-4 (1992), 143–168. https://doi.org/10.1080/00949659208810398
- Portier and Delyon (2018) François Portier and Bernard Delyon. 2018. Asymptotic optimality of adaptive importance sampling. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3138–3148.
- Porto Seguro (2017) Porto Seguro. 2017. Porto Seguro’s Safe Driver Prediction. https://www.kaggle.com/c/porto-seguro-safe-driver-prediction. Accessed: Dec 2019.
- Pozzolo et al. (2015) A. D. Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi. 2015. Calibrating Probability with Undersampling for Unbalanced Classification. In 2015 IEEE Symposium Series on Computational Intelligence. 159–166. https://doi.org/10.1109/SSCI.2015.33
- Reddy and Vinzamuri (2014) Chandan K. Reddy and Bhanukiran Vinzamuri. 2014. A Survey of Partitional and Hierarchical Clustering Algorithms (1st ed.). Chapman & Hall/CRC, 87–110. https://doi.org/10.1201/9781315373515-4
- RIDDLE (2003) RIDDLE 2003. Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets. http://www.cs.utexas.edu/users/ml/riddle/data.html. Accessed: Dec 2016.
- Rubinstein and Kroese (2016) Reuven Y. Rubinstein and Dirk P. Kroese. 2016. Simulation and the Monte Carlo Method. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118631980
- Sawade et al. (2010b) Christoph Sawade, Niels Landwehr, Steffen Bickel, and Tobias Scheffer. 2010b. Active Risk Estimation. In Proceedings of the 27th International Conference on International Conference on Machine Learning (Haifa, Israel) (ICML’10). Omnipress, Madison, WI, USA, 951–958.
- Sawade et al. (2010a) Christoph Sawade, Niels Landwehr, and Tobias Scheffer. 2010a. Active Estimation of F-Measures. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). Curran Associates, Inc., 2083–2091.
- Schnabel et al. (2016) Tobias Schnabel, Adith Swaminathan, Peter I. Frazier, and Thorsten Joachims. 2016. Unbiased Comparative Evaluation of Ranking Functions. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (Newark, Delaware, USA) (ICTIR ’16). ACM, New York, NY, USA, 109–118. https://doi.org/10.1145/2970398.2970410
- Schultheis et al. (2020) Erik Schultheis, Mohammadreza Qaraei, Priyanshu Gupta, and Rohit Babbar. 2020. Unbiased Loss Functions for Extreme Classification With Missing Labels. arXiv:2007.00237 [stat.ML]
- Settles (2009) Burr Settles. 2009. Active Learning Literature Survey. Technical Report. University of Wisconsin-Madison Department of Computer Sciences. http://digital.library.wisc.edu/1793/60660
- Vaart (1998) A. W. van der Vaart. 1998. Delta Method. Cambridge University Press, 25–34. https://doi.org/10.1017/CBO9780511802256.004
- Wei et al. (2013) Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 16, 4 (2013), 449–475. https://doi.org/10.1007/s11280-012-0178-0
- Welinder et al. (2013) P. Welinder, M. Welling, and P. Perona. 2013. A Lazy Man’s Approach to Benchmarking: Semisupervised Classifier Evaluation and Recalibration. In CVPR. 3262–3269. https://doi.org/10.1109/CVPR.2013.419
Appendix A Proof of Theorem 1
First we prove that using a strong law of large numbers (SLLN) for martingales (Feller, 1971, p. 243). Consider the martingale
and let denote the -th component of the -th contribution to . Since is drawn from and wherever , it follows that . In addition, we have
where the inequality follows from the boundedness of and (2). Thus the conditions of the SLLN are satisfied and we have . Now the continuous mapping theorem states that
provided is not in the set of discontinuity points of . This condition is satisfied by assumption.
Appendix B Proof of Theorem 2
The CLT of Portier and Delyon (2018) implies that , provided two conditions hold. The first condition of their CLT holds by assumption (3) and the second condition holds by the boundedness of the loss function and (4). The multivariate delta method (Vaart, 1998) then implies that , since is assumed to be differentiable at in Definition 1.
Appendix C Proof of Proposition 3
We want to find the proposal that minimizes the total asymptotic variance
We can express this as a functional optimization problem:
(6)
s.t.
where .
Appendix D Conditions for achieving zero total asymptotic variance
In typical applications of IS, the asymptotically-optimal proposal can achieve zero variance. This is not guaranteed in our application, since we do not have complete freedom in selecting the proposal (see Remark 3). Below we provide sufficient conditions under which the total asymptotic variance of can be reduced to zero.
Proposition 1.
Suppose the oracle is deterministic (i.e. is a point mass for all ) and the generalized measure is such that is constant for all and . Then the asymptotically-optimal proposal achieves .
Proof.
From Proposition 3, the asymptotically optimal proposal achieves a total variance of
(7)
We evaluate the two terms in this expression separately. Using the fact that , the first term becomes
The second line follows by application of the inequality , and the third line follows by assumption. For the second term we have
by application of Jensen’s inequality. Subtracting the second term from the first, we have . ∎
By way of illustration, we apply the above proposition to two common performance measures: accuracy and recall. We assume a deterministic oracle in both cases.
Example 0 (Asymptotic variance for accuracy).
Example 0 (Asymptotic variance for recall).
From Table 1, we have that recall can be expressed as a generalized performance measure by setting and . The conditions of Proposition 1 are not satisfied in this case, since
which is not constant for all . Indeed, when we evaluate the expression for the asymptotic total variance (see Proposition 3), we find that . Therefore, in general, there is a positive lower bound on the asymptotic variance that can be achieved by our framework when estimating recall under a deterministic oracle.
Appendix E Proof of Proposition 4
Before proving the proposition, we establish a useful corollary.
Corollary 1.
Proof.
For the first statement, we check conditions (2) and (4) of Theorem 1. Let be the support of and let . For we have
For the second statement, we must additionally check condition (3) regarding the convergence of . Letting
we can write . By the a.s. pointwise convergence of and the continuous mapping theorem, we have a.s. pointwise in and . Now observe that
It is straightforward to show that using the boundedness of (see Definition 1). Thus we have by the dominated convergence theorem. ∎
We can now prove Proposition 4 by showing that the conditions of Corollary 1 hold. We focus on the case of a deterministic oracle—the proof for a stochastic oracle follows by a similar argument.
First we examine the support of the sequence of proposals. At stage , the proposal can be expressed as
Observe that
and
where is a constant. The upper bound follows from the boundedness of (see Definition 1), the boundedness of , and the boundedness of the Jacobian. Since
by assumption and is bounded from above, we conclude that is a valid distribution for all . The lower bound on implies that the support of is
The inequality follows from the fact that the support of includes the support of . Thus has the required support for all .
Next we prove that the sequence of proposals converges a.s. pointwise in . Given that and pointwise in , one can show by application of the continuous mapping theorem and dominated convergence theorem that
pointwise in . By application of the continuous mapping theorem, we then have that . Thus Theorems 1 and 2 hold. Furthermore, if and , then is equal to the asymptotically-optimal proposal as defined in (5). ∎
Appendix F Inference for the Dirichlet-tree model in Section 4
In this appendix, we outline a procedure for inferring the oracle response based on the hierarchical model presented in Section 4. Recall that the model assumes a deterministic response—i.e. there is only one possible response (label) for a given input . At stage of the evaluation process (see Algorithm 1), the labels for instances in the test pool are partially-observed. To estimate the response for the unobserved instances, we can apply the expectation-maximization (EM) algorithm. Omitting the dependence on , we let denote the observed labels and denote the unobserved labels. The EM algorithm returns a distribution over the unobserved labels and MAP estimates of the model parameters . At each iteration of the EM algorithm, the following two steps are applied:
• E-step. Compute the function
(8)
which is the expected log posterior with respect to the current distribution over the unobserved labels, conditional on the observed labels and the current parameter estimates.
• M-step. Update the parameter estimates by maximizing:
(9)
In order to implement the E- and M-steps, we must evaluate the function for our model. Since the Dirichlet prior on and Dirichlet-tree priors on are conjugate to the categorical distribution, the posterior is straightforward to compute. We have
where , and is defined in (16). The posterior density for is
Minka (1999) gives the density for as:
where denotes the set of leaf nodes in , denotes the set of inner nodes in , denotes the leaf nodes reachable from node , and .
Substituting the posterior densities in (8), we have
where we define and similarly for and . When maximizing with respect to , we must obey the constraints:
• for all ,
• ,
• for all and leaf nodes , and
• .
We can maximize and separately since they are independent. For we have the mode of a Dirichlet random variable:
and for we have (see Minka, 1999):
(10)
(11)
The parameters may be interpreted as branching probabilities for node .
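For concreteness, the following minimal Python sketch computes the mode of a Dirichlet random variable, which is the closed form behind these M-step updates; the function name and the example concentration values are illustrative, not taken from the paper.

```python
import numpy as np

def dirichlet_mode(alpha):
    """Mode (MAP estimate) of a Dirichlet(alpha) random variable.

    Valid when every concentration parameter exceeds 1; this is the
    closed form used for the M-step updates of the label probabilities
    and the per-node branching probabilities.
    """
    alpha = np.asarray(alpha, dtype=float)
    return (alpha - 1.0) / (alpha.sum() - alpha.size)

# Hypothetical posterior concentrations at one internal node of the tree:
# prior pseudo-counts plus expected counts from the E-step.
alpha_node = np.array([2.5, 4.0, 1.5])
branch_probs = dirichlet_mode(alpha_node)  # non-negative, sums to 1
print(branch_probs)                        # [0.3 0.6 0.1]
```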
In summary, the EM algorithm reduces to the following two steps:
• E-step. Compute the expected value for each unobserved label using , as in (12). Then make a backward pass through the tree, computing at each internal node .
• M-step. Update the parameter estimates using the expected counts, as given in (10) and (11).
We can interpret (12) as providing a posterior estimate for the unknown oracle response, as in (13), where denotes the assigned stratum for instance . If the response for instance has been observed in a previous stage of the evaluation process, then collapses to a point mass at .
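To show how the two steps fit together, the sketch below runs a deliberately flattened (non-hierarchical) version of this EM loop, in which a single per-block label distribution stands in for the Dirichlet-tree parameters. The function names, the prior value, and the flat simplification are ours, not the paper's implementation.

```python
import numpy as np

def em_oracle_inference(obs_labels, strata, n_labels, n_strata,
                        alpha0=1.1, n_iter=50):
    """Flattened EM sketch for the deterministic-oracle model.

    obs_labels : dict mapping instance index -> observed label (partial).
    strata     : array giving the block (leaf) of every instance.
    Returns q[i, y], a posterior distribution over the label of each
    instance (a point mass for observed instances), and theta[k, y],
    the MAP per-block label distribution.
    """
    strata = np.asarray(strata)
    n_items = strata.size
    theta = np.full((n_strata, n_labels), 1.0 / n_labels)
    q = np.full((n_items, n_labels), 1.0 / n_labels)

    for _ in range(n_iter):
        # E-step: posterior over each label given the current parameters.
        for i in range(n_items):
            if i in obs_labels:
                q[i] = 0.0
                q[i, obs_labels[i]] = 1.0        # point mass at the observed label
            else:
                q[i] = theta[strata[i]]          # expected label distribution
        # M-step: MAP update from prior pseudo-counts plus expected counts.
        for k in range(n_strata):
            counts = (alpha0 - 1.0) + q[strata == k].sum(axis=0)
            theta[k] = counts / counts.sum()
    return q, theta

# Tiny example: 6 instances in 2 blocks, two of them labeled.
q, theta = em_oracle_inference({0: 1, 3: 0}, [0, 0, 0, 1, 1, 1],
                               n_labels=2, n_strata=2)
print(theta)
```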
Appendix G Proof of Proposition 1
Let denote the posterior estimate of the (deterministic) oracle response at stage , as defined in (13). First we must ensure that the support of includes the true label for all . This condition is satisfied since the priors on and ensure and for all and (see (12)). Once the label for instance is observed, the posterior degenerates to a point mass at the true value . This also implies that the sequence of proposals has the necessary support to ensure Theorem 2 holds.
Second, we verify that converges a.s. pointwise in and to a conditional pmf independent of . This condition is also satisfied, since the posterior degenerates to a point mass at once the label for instance is observed, and all instances are observed in the limit . Thus and , which implies that the sequence of proposals converges to the asymptotically-optimal proposal. ∎
Appendix H Extension of the Dirichlet-tree model to stochastic oracles
Recall that the Dirichlet-tree model in Section 4 is tailored for a deterministic oracle—where is a point mass at a single label . In this appendix, we extend the model to handle stochastic oracles, where has support on multiple labels in general. The model retains the same structure; however, we no longer assume there is a discrete set of items whose labels are either observed or unobserved. Instead, we allow for a potentially infinite number of item-label pairs (indexed by below) to be generated. In reality, these pairs correspond to labeled items drawn uniformly at random from the test pool . Since estimating the oracle response for individual instances requires estimating a large number of continuous parameters ( parameters), we instead opt to estimate the response averaged over instances at leaf nodes of the tree ( parameters for leaf nodes). We refer to this as the leaf-level oracle response below.
Model specification
For clarity, we reintroduce the model here despite some overlap with Section 4. The oracle response is modeled globally using a pmf over the label space with a Dirichlet prior:
where are concentration hyperparameters. The label for each instance is then assumed to be generated i.i.d. according to :
We assume a hierarchical partition of the space is given, encoded as a tree . Given and label , we assume instance is assigned to a leaf node of according to a distribution with a Dirichlet-tree prior:
where is a set of Dirichlet concentration parameters associated with the internal nodes of .
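For intuition, the sketch below samples from this generative process: a global label pmf from a Dirichlet prior, then one leaf-assignment distribution per label from a Dirichlet-tree prior over a small hand-coded tree. The tree, hyperparameters, and label count are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Global oracle pmf over labels, drawn from a Dirichlet prior
# (hypothetical label count and concentration hyperparameters).
n_labels = 2
gamma = np.ones(n_labels)
p_label = rng.dirichlet(gamma)

# A small hand-coded tree: internal nodes map to their children.
children = {"root": ["n0", "n1"],
            "n0": ["leaf0", "leaf1"],
            "n1": ["leaf2", "leaf3"]}

def sample_leaf_probs(tree, alpha=1.0):
    """Draw leaf probabilities from a Dirichlet-tree prior: branch
    probabilities at each internal node are Dirichlet-distributed, and a
    leaf's probability is the product of branches along its path."""
    leaf_probs = {}
    def descend(node, prob):
        if node not in tree:              # reached a leaf node
            leaf_probs[node] = prob
            return
        kids = tree[node]
        branches = rng.dirichlet(alpha * np.ones(len(kids)))
        for kid, b in zip(kids, branches):
            descend(kid, prob * b)
    descend("root", 1.0)
    return leaf_probs

# One leaf-assignment distribution per label.
leaf_given_label = [sample_leaf_probs(children) for _ in range(n_labels)]

# Generate item-label pairs: label ~ p_label, then leaf ~ leaf_given_label[label].
for _ in range(3):
    y = rng.choice(n_labels, p=p_label)
    leaves = list(leaf_given_label[y])
    probs = np.array([leaf_given_label[y][l] for l in leaves])
    print(y, rng.choice(leaves, p=probs))
```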
Inference
To estimate the leaf-level oracle response , we use the posterior predictive distribution
which encodes our uncertainty about the oracle response for a query instance from stratum conditional on the previously observed samples . If the observed samples were collected through unbiased sampling (as assumed in the model above), we would compute the posterior predictive distribution as follows:
(14)
where
(15)
denotes the inner nodes of , denotes the children of node , denotes the leaf index of instance and
(16)
In words, the posterior parameters are updated by adding a count of ‘1’ for every observation with label that is reachable from node in the tree . However, since the observed samples are biased in our application, we must apply a bias-correction to (15):
This guarantees that and .
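The count update, and a possible bias-corrected variant, can be sketched as follows. We assume here, purely for illustration, that the correction amounts to accumulating an importance weight in place of the unit count; the tree paths, labels, and weights below are hypothetical.

```python
def update_tree_counts(counts, path_to_leaf, leaf, label, weight=1.0):
    """Sketch of the Dirichlet-tree pseudo-count update.

    Under unbiased sampling each observation adds a count of 1 at every
    node on the path from the root to its leaf, for the observed label.
    Under biased sampling we accumulate a weight instead of 1 (an
    importance weight here, as an illustrative assumption; the paper's
    exact correction is the one applied to (15))."""
    for node in path_to_leaf[leaf]:
        counts[(node, label)] = counts.get((node, label), 0.0) + weight
    return counts

# Hypothetical paths (root -> ... -> leaf) and weighted observations.
path_to_leaf = {"leaf0": ["root", "n0", "leaf0"],
                "leaf1": ["root", "n0", "leaf1"],
                "leaf2": ["root", "n1", "leaf2"]}
counts = {}
for leaf, label, w in [("leaf0", 1, 0.8), ("leaf2", 0, 1.3), ("leaf0", 1, 0.9)]:
    update_tree_counts(counts, path_to_leaf, leaf, label, weight=w)
print(counts)
```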
Proposition 0.
Consider an instantiation of our evaluation framework (see Algorithm 1) under a stochastic oracle where:
Proof.
Let denote the posterior predictive in evaluated at stage . First we must ensure that the support of is a superset of the support of . This condition is satisfied since the priors on and ensure that for all (see (14)). This implies that the proposals have the necessary support to ensure that Theorem 1 is satisfied.
Second, we must ensure that converges a.s. pointwise in and to a conditional pmf independent of . This condition is also satisfied, since the posterior parameters and converge a.s. to constants by Theorem 1 (see the text preceding the statement of this proposition). ∎
Appendix I Unsupervised partitioning methods
The models for the oracle response described in Section 4 require that the test pool be hierarchically partitioned into blocks. Ideally, the partition should be selected so that the oracle distribution is approximately constant within each block. Since we begin without any labels, we are restricted to unsupervised partitioning methods. We briefly describe two settings for learning partitions: (i) where classifier scores are used as a proxy for and (ii) where feature vectors are used.
Score-based methods.
Our partitioning problem can be tackled using stratification methods studied in the survey statistics community (Cochran, 1977). The aim of stratification is to partition a population into roughly homogeneous subpopulations, known as strata (or blocks), for the purpose of variance reduction. When an auxiliary variable is observed that is correlated with the statistic of interest, the strata may be defined as a partition of the range of the auxiliary variable into sub-intervals. Various methods are used for determining the sub-intervals, including the cumulative square-root frequency (CSF) method (Dalenius and Hodges, 1959) and the geometric method (Gunning and Horgan, 2004). In our application, the classifier scores are the auxiliary variables and the statistic of interest is (for binary classification).
Stratification is conventionally used to produce a partition without hierarchical structure. However, if the strata are ordered, it is straightforward to “fill in” hierarchical structure. In our experiments, we specify the desired tree structure—e.g. an ordered binary tree of a particular depth. We then compute the stratum bins (sub-intervals) so that the number of bins matches the number of leaf nodes in the tree. The stratum bins are then associated with the leaf nodes of the tree in breadth-first order.
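A minimal sketch of this procedure, under assumed bin counts and synthetic scores rather than the experimental settings, is given below: CSF cut points on the classifier scores, followed by attaching the ordered bins to the leaves of an ordered binary tree from left to right.

```python
import numpy as np

def csf_strata(scores, n_strata, n_bins=200):
    """Cumulative square-root frequency (Dalenius-Hodges) stratification of
    classifier scores.  Returns a stratum index in [0, n_strata) per score.
    The number of histogram bins is an illustrative choice."""
    hist, edges = np.histogram(scores, bins=n_bins)
    csf = np.cumsum(np.sqrt(hist))
    targets = csf[-1] * np.arange(1, n_strata) / n_strata
    cuts = edges[1:][np.searchsorted(csf, targets)]   # score cut points
    return np.searchsorted(cuts, scores, side="right")

# Strata are ordered, so they can be attached to the leaves of an ordered
# binary tree of a chosen depth simply by enumerating leaves left to right.
depth = 3                                             # 2**depth leaf nodes
scores = np.random.default_rng(0).beta(0.5, 5.0, size=10_000)
strata = csf_strata(scores, n_strata=2 ** depth)
leaf_of_stratum = np.arange(2 ** depth)               # stratum k -> k-th leaf
print(np.bincount(strata, minlength=2 ** depth))
```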
Feature-based methods.
When scores are not available, it is natural to consider unsupervised clustering methods that operate on the feature vectors. We expect to be roughly constant within a cluster, since neighboring points in the feature space typically behave similarly. Reddy and Vinzamuri (2014) review hierarchical clustering methods, including agglomerative and divisive clustering. One disadvantage of clustering methods is that they tend to scale quadratically with the number of data points. A more efficient alternative is the k-d tree (Bentley, 1975).
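As a lightweight illustration of the latter, the sketch below builds a balanced hierarchical partition by recursive median splits in the spirit of a k-d tree; the depth, synthetic data, and axis-cycling rule are illustrative assumptions.

```python
import numpy as np

def kd_partition(X, depth):
    """Hierarchical partition by recursive median splits, in the spirit of a
    k-d tree.  Returns a leaf index in [0, 2**depth) for every row of X.
    Axis cycling and tie handling are kept deliberately simple."""
    leaf = np.zeros(len(X), dtype=int)

    def split(idx, d):
        if d == depth or idx.size == 0:
            return
        axis = d % X.shape[1]                      # cycle through features
        median = np.median(X[idx, axis])
        right = X[idx, axis] > median
        leaf[idx[right]] += 2 ** (depth - d - 1)   # offset for the right subtree
        split(idx[~right], d + 1)
        split(idx[right], d + 1)

    split(np.arange(len(X)), 0)
    return leaf

X = np.random.default_rng(0).normal(size=(1000, 5))
blocks = kd_partition(X, depth=3)                  # 8 leaf blocks
print(np.bincount(blocks, minlength=8))
```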
Appendix J Additional convergence plots
F1 score.
We provide additional convergence plots for estimating F1 score in Figure 5, which cover six of the seven datasets listed in Table 3. The convergence plot for abt-buy is featured in the main paper in Figure 2. The biased sampling methods (Ours-8, Ours-1, OASIS and IS) converge significantly more rapidly than Passive and Stratified for five of the six datasets. tweets100k is an exception because it is the only dataset with well-balanced classes. Of the biased sampling methods, Ours-8 performs best on two of the six datasets (amzn-goog and restaurant) and equal-best on one (safedriver). Ours-1 performs best on one dataset (dblp-acm) and equal-best on one (safedriver), while IS performs best on one dataset (creditcard). In general, we expect IS to perform well when the oracle response is already well-approximated by the model under evaluation. When this is not the case, the adaptive methods are expected to perform best as they produce refined estimates of during sampling.
Accuracy.
We repeated the experiments described in Section 5 for estimating accuracy. Although accuracy is not recommended for evaluating classifiers in the presence of class imbalance, it is interesting to see whether the biased sampling methods offer any improvement in label efficiency, given the discussion in Section 2.3. Figures 6(a) and 7 present convergence plots for the seven datasets listed in Table 3. A summary of the MSE for all datasets is included in Figure 6(b) assuming a label budget of 1000. OASIS does not appear in these results, as it does not support estimation of accuracy.
We find that gains in label efficiency are less pronounced for the biased sampling methods when estimating accuracy. This is to be expected as accuracy is less sensitive to class imbalance, as noted in Section 2.3. However, there is still a marked improvement in the convergence rate—by an order of magnitude or more—for three of the datasets (abt-buy, dblp-acm and restaurant). Again, we find that the more imbalanced datasets (abt-buy, amzn-goog, dblp-acm and restaurant) seem to benefit most from biased sampling.
Precision-recall curves.
Figure 8 presents convergence plots for estimating precision-recall curves for two of the test pools in Table 3 assuming a label budget of 5000. We find that the biased sampling methods (Ours-8, Ours-1 and IS) offer a significant improvement in the MSE compared to Passive and Stratified—by 1–2 orders of magnitude. The difference in the MSE between the AIS-based methods and IS is less pronounced here than when estimating F1-score.