This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Needle in a Haystack: Label-Efficient Evaluation under Extreme Class Imbalance

Neil G. Marchant [email protected] 0000-0001-5713-4235 School of Computing and Information Systems, University of MelbourneMelbourneVictoriaAustralia3010  and  Benjamin I. P. Rubinstein [email protected] 0000-0002-2947-6980 School of Computing and Information Systems, University of MelbourneMelbourneVictoriaAustralia3010
(2021)
Abstract.

Important tasks like record linkage and extreme classification demonstrate extreme class imbalance, with 1 minority instance to every 1 million or more majority instances. Obtaining a sufficient sample of all classes, even just to achieve statistically-significant evaluation, is so challenging that most current approaches yield poor estimates or incur impractical cost. Where importance sampling has been levied against this challenge, restrictive constraints are placed on performance metrics, estimates do not come with appropriate guarantees, or evaluations cannot adapt to incoming labels. This paper develops a framework for online evaluation based on adaptive importance sampling. Given a target performance metric and model for p(y|x)p(y|x), the framework adapts a distribution over items to label in order to maximize statistical precision. We establish strong consistency and a central limit theorem for the resulting performance estimates, and instantiate our framework with worked examples that leverage Dirichlet-tree models. Experiments demonstrate an average MSE superior to state-of-the-art on fixed label budgets.

performance evaluation, adaptive importance sampling, Dirichlet tree, central limit theorem
ccs: General and reference Evaluationccs: Computing methodologies Machine learningccs: Mathematics of computing Sequential Monte Carlo methods

1. Introduction

Evaluation of machine learning systems under extreme class imbalance seems like a hopeless task. When minority classes are exceedingly rare—e.g. occurring at a rate of one in a million—a massive number of examples (1 million in expectation) must be labeled before a single minority example is encountered. It seems nigh impossible to reliably estimate performance in these circumstances, as the level of statistical noise is simply too high. Making matters worse, is the fact that high quality labels for evaluation are rarely available for free. Typically they are acquired manually at some cost—e.g. by employing expert annotators or crowdsourcing workers. Reducing labeling requirements for evaluation, while ensuring estimates of performance are precise and free from bias, is therefore paramount. One cannot afford to waste resources on evaluation if the results are potentially misleading or totally useless.

From a statistical perspective, evaluation can be cast as the estimation of population performance measures using independently-drawn test data. Although unlabeled test data is abundant in many applied settings, labels must usually be acquired as part of the evaluation process. To ensure estimated performance measures converge to their population values, it is important to select examples for labeling in a statistically sound manner. This can be achieved by sampling examples passively according to the data generating distribution. However, passive sampling suffers from poor label efficiency for some tasks, especially under extreme class imbalance. This impacts a range of application areas including fraud detection (Wei et al., 2013), record linkage (Marchant and Rubinstein, 2017), rare diseases (Khalilia et al., 2011) and extreme classification (Schultheis et al., 2020).

The poor efficiency of passive sampling for some evaluation tasks motivates active or biased sampling strategies, which improve efficiency by focusing labeling efforts on the “most informative” examples (Sawade et al., 2010b). Previous work in this area is based on variance-reduction methods, such as stratified sampling (Bennett and Carvalho, 2010; Druck and McCallum, 2011; Gao et al., 2019), importance sampling (Sawade et al., 2010a; Schnabel et al., 2016) and adaptive importance sampling (Marchant and Rubinstein, 2017). However existing approaches suffer from serious limitations, including lack of support for a broad range of performance measures (Bennett and Carvalho, 2010; Sawade et al., 2010a; Welinder et al., 2013; Schnabel et al., 2016; Marchant and Rubinstein, 2017), weak theoretical justification (Bennett and Carvalho, 2010; Druck and McCallum, 2011; Welinder et al., 2013) and an inability to adapt sampling based on incoming labels (Sawade et al., 2010a; Schnabel et al., 2016).

In this paper, we present a general framework for label-efficient online evaluation that addresses these limitations. Our framework supports any performance measure that can be expressed as a transformation of a vector-valued risk functional—a much broader class than previous work. This allows us to target simple scalar measures such as accuracy and F1 score, as well as more complex multi-dimensional measures such as precision-recall curves for the first time. We leverage adaptive importance sampling (AIS) to efficiently select examples for labeling in batches. The AIS proposal is adapted using labels from previous batches in order to approximate the asymptotically-optimal variance-minimizing proposal. This approximation relies on online estimates of p(y|x)p(y|x), which we propose to estimate via a Bayesian Dirichlet-tree (Dennis, 1996) model that achieves asymptotic optimality for deterministic labels.

We analyze the asymptotic behavior of our framework under general conditions, establishing strong consistency and a central limit theorem. This improves upon a weak consistency result obtained in a less general setting (Marchant and Rubinstein, 2017). We also compare our framework empirically against four baselines: passive sampling, stratified sampling, importance sampling (Sawade et al., 2010a), and the stratified AIS method of (Marchant and Rubinstein, 2017). Our approach based on a Dirichlet-tree model, achieves superior or competitive performance on all but one of seven test cases. Proofs and extensions are included in appendices. A Python package implementing our framework has been released open-source at https://github.com/ngmarchant/activeeval.

2. Preliminaries

We introduce notation and define the label-efficient evaluation problem in Section 2.1. Then in Section 2.2, we specify the family of performance measures supported by our framework. Section 2.3 presents novel insights into the impracticality of passive sampling relative to class imbalance and evaluation measure, supported by asymptotic analysis.

2.1. Problem formulation

Consider the task of evaluating a set of systems 𝒮\mathcal{S} which solve a prediction problem on a feature space 𝒳m\mathcal{X}\subseteq\mathbb{R}^{m} and label space 𝒴l\mathcal{Y}\subseteq\mathbb{R}^{l}. Let f(s)(x)f^{(s)}(x) denote the output produced by system s𝒮s\in\mathcal{S} for a given input x𝒳x\in\mathcal{X}—e.g. a predicted label or distribution over labels. We assume instances encountered by the systems are generated i.i.d. from an unknown joint distribution with density p(x,y)p(x,y) on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}. Our objective is to obtain accurate and precise estimates of target performance measures (e.g. F1 score) with respect to p(x,y)p(x,y) at minimal cost.

We consider the common scenario where an unlabeled test pool 𝒯={x1,,xM}\mathcal{T}=\{x_{1},\ldots,x_{M}\} drawn from p(x)p(x) is available upfront. We assume labels are unavailable initially and can only be obtained by querying a stochastic oracle that returns draws from the conditional p(y|x)p(y|x). We assume the response time and cost of oracle queries far outweigh contributions from other parts of the evaluation process. This is reasonable in practice, since the oracle requires human input—e.g. annotators on a crowdsourcing platform or domain experts. Under these assumptions, minimizing the cost of evaluation is equivalent to minimizing the number of oracle queries required to estimate target performance measures to a given precision.

Remark 1.

A stochastic oracle covers the most general case where p(y|x)p(y|x) has support on one or more labels. This may be due to a set of heterogeneous or noisy annotators (not modeled) or genuine ambiguity in the label. We also consider a deterministic oracle where p(y|x)p(y|x) is a point mass. This is appropriate when trusting individual judgments from an expert annotator.

2.2. Generalized measures

When embarking on an evaluation task it is important to select a suitable measure of performance. For some tasks it may be sufficient to measure global error rates, while for others it may be desirable to measure error rates for different classes, sub-populations or parameter configurations—the possibilities are boundless. Since no single measure is suitable for all tasks, we consider a broad family of measures which correspond mathematically to transformations of vector-valued risk functionals.

Definition 0 (Generalized measure).

Let (x,y;f)\ell(x,y;f) be a vector-valued loss function that maps instances (x,y)𝒳×𝒴(x,y)\in\mathcal{X}\times\mathcal{Y} to vectors in d\mathbb{R}^{d} dependent on the system outputs f={f(s)}f=\{f^{(s)}\}. We suppress explicit dependence on ff where it is understood. Assume (x,y;f)\ell(x,y;f) is uniformly bounded in sup norm for all system outputs ff. Denote the corresponding vector-valued risk functional by R=𝔼X,Yp[(X,Y;f)]R=\mathbb{E}_{X,Y\sim p}[\ell(X,Y;f)]. For any choice of \ell and continuous mapping g:dmg:\mathbb{R}^{d}\to\mathbb{R}^{m} differentiable at RR, the generalized measure is defined as G=g(R)G=g(R).

Table 1. Representations of binary classification measures as generalized measures. Here we assume 𝒴={0,1}\mathcal{Y}=\{0,1\}, f(x)f(x) denotes the predicted class label, and p^1(x)\hat{p}_{1}(x) is an estimate of p(y=1|x)p(y=1|x) according to the system under evaluation.
Measure (x,y)\ell(x,y)^{\intercal} g(R)g(R)
Accuracy 𝟏yf(x)\mathbf{1}_{y\neq f(x)}\vphantom{\Big{]}} 1R1-R
Balanced accuracy [yf(x),y,f(x)]\left[yf(x),y,f(x)\right]\vphantom{\Big{]}} R1+R2(1R2R3)2R2(1R2)\frac{R_{1}+R_{2}(1-R_{2}-R_{3})}{2R_{2}(1-R_{2})}
Precision [yf(x),f(x)]\left[yf(x),f(x)\right]\vphantom{\Big{]}} R1R2\frac{R_{1}}{R_{2}}
Recall [yf(x),y]\left[yf(x),y\right]\vphantom{\Big{]}} R1R2\frac{R_{1}}{R_{2}}
FβF_{\beta} score [yf(x),β2y+f(x)1+β2]\left[yf(x),\frac{\beta^{2}y+f(x)}{1+\beta^{2}}\right] R1R2\frac{R_{1}}{R_{2}}
Matthews correlation coefficient [yf(x),y,f(x)]\left[yf(x),y,f(x)\right]\vphantom{\Big{]}} R1R2R3R2R3(1R2)(1R3)\frac{R_{1}-R_{2}R_{3}}{\sqrt{R_{2}R_{3}(1-R_{2})(1-R_{3})}}
Fowlkes-Mallows index [yf(x),y,f(x)]\left[yf(x),y,f(x)\right]\vphantom{\Big{]}} R1R2R3\frac{R_{1}}{\sqrt{R_{2}R_{3}}}\vphantom{\Big{]}}
Brier score 2(p^1(x)y)22(\hat{p}_{1}(x)-y)^{2} RR\vphantom{\Big{]}}
Table 2. Representations of scalar regression measures as generalized measures. Here f(x)f(x) denotes the predicted response according to the system under evaluation.
Measure (x,y)\ell(x,y)^{\intercal} g(R)g(R)
Mean absolute error |yf(x)||y-f(x)| RR\vphantom{\Big{]}}
Mean squared error (yf(x))2(y-f(x))^{2} RR\vphantom{\Big{]}}
Coefficient of determination [y,y2,f(x),f(x)2][y,y^{2},f(x),f(x)^{2}] R42R1R3+R12R2R12\frac{R_{4}-2R_{1}R_{3}+R_{1}^{2}}{R_{2}-R_{1}^{2}}\vphantom{\Big{]}}

Although this definition may appear somewhat abstract, it encompasses a variety of practical measures. For instance, when gg is the identity and d=1d=1 the family reduces to a scalar-valued risk functional, which includes accuracy and mean-squared error as special cases. Other well-known performance measures such as precision and recall can be represented by selecting a non-linear gg and a vector-valued \ell. Tables 1 and 2 demonstrate how to recover standard binary classification and regression measures for different settings of gg and \ell. In addition to scalar measures, the family also encompasses vector-valued measures for vector-valued gg and \ell. These can be used to estimate multiple scalar measures simultaneously—e.g. precision and recall of a system, accuracy of several competing systems, or recall of a system for various sub-populations. Below, we demonstrate that a vector-valued generalized measure can represent a precision-recall (PR) curve.

Example 0 (PR curve).

A precision-recall (PR) curve plots the precision and recall of a soft binary classifier on a grid of classification thresholds τ1<τ2<<τL\tau_{1}<\tau_{2}<\cdots<\tau_{L}. Let f(x)f(x)\in\mathbb{R} denote the classifier score for input xx, where a larger score means the classifier is more confident the label is positive (encoded as ‘1’) and a smaller score means the classifier is more confident the label is negative (encoded as ‘0’). We define a vector loss function that measures whether an instance (x,y)(x,y) is: (1) a predicted positive for each threshold (the first LL entries), (2) a true positive for each threshold (the next LL entries), and/or (3) a positive (the last entry):

(x,y)=[𝟏f(x)τ1,,𝟏f(x)τL,y𝟏f(x)τ1,,y𝟏f(x)τL,y].\ell(x,y)=\left[\mathbf{1}_{f(x)\geq\tau_{1}},\ldots,\mathbf{1}_{f(x)\geq\tau_{L}},y\mathbf{1}_{f(x)\geq\tau_{1}},\ldots,y\mathbf{1}_{f(x)\geq\tau_{L}},y\right]^{\intercal}.

A PR curve can then be obtained using the following mapping function:

G=g(R)=[RL+1R1,,R2LRL,RL+1R2L+1,,R2LR2L+1],G=g(R)=\left[\frac{R_{L+1}}{R_{1}},\ldots,\frac{R_{2L}}{R_{L}},\frac{R_{L+1}}{R_{2L+1}},\ldots,\frac{R_{2L}}{R_{2L+1}}\right]^{\intercal},

where the first LL entries correspond to the precision at each threshold in ascending order, and the last LL entries correspond to the recall at each threshold in ascending order.

Remark 2.

We have defined generalized measures with respect to the data generating distribution p(x,y)p(x,y). While this is the ideal target for evaluation, it is common in practice to define performance measures with respect to a sample. We can recover these from our definition by substituting an empirical distribution for p(x,y)p(x,y). For example, the familiar sample-based definition of recall can be obtained by setting (x,y)=[yf(x),y]\ell(x,y)=[yf(x),y]^{\intercal}, g(R)=R1/R2g(R)=R_{1}/R_{2} and p(x,y)=1Ni=1N𝟏xi=x𝟏yi=yp(x,y)=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{x_{i}=x}\mathbf{1}_{y_{i}=y}. Then

Grec=g(R)=1Ni=1Nyif(xi)1Ni=1Nyi=TPTP+FN.G_{\mathrm{rec}}=g(R)=\frac{\frac{1}{N}\sum_{i=1}^{N}y_{i}f(x_{i})}{\frac{1}{N}\sum_{i=1}^{N}y_{i}}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}.

Given our assumption that the test pool 𝒯\mathcal{T} is drawn from p(x)p(x), any consistent sample-based estimator will converge to the population measure.

2.3. Inadequacy of passive sampling

We have previously mentioned passive sampling as an obvious baseline for selecting instances to label for evaluation. In this section, we conduct an asymptotic analysis for two sample evaluation tasks, which highlights the impracticality of passive sampling under extreme class imbalance. This serves as concrete motivation for our interest in label-efficient solutions. We begin by defining an estimator for generalized measures based on passive samples.

Definition 0 (Passive estimator for GG).

Let ={(x1,y1),,(xN,yN)}\mathcal{L}=\{(x_{1},y_{1}),\allowbreak\ldots,\allowbreak(x_{N},y_{N})\} be a labeled sample of size NN drawn passively according to p(x,y)p(x,y). In practice, \mathcal{L} is obtained by drawing instances i.i.d. from the marginal p(x)p(x) and querying labels from the oracle p(y|x)p(y|x). The Monte Carlo or passive estimator for a generalized measure GG is then defined as follows:

G^NMC=g(R^NMC) with R^NMC=1N(x,y)(x,y).\hat{G}_{N}^{\mathrm{MC}}=g(\hat{R}_{N}^{\mathrm{MC}})\text{ with }\hat{R}_{N}^{\mathrm{MC}}=\frac{1}{N}\sum_{(x,y)\in\mathcal{L}}\ell(x,y).

Note that G^NMC\hat{G}_{N}^{\mathrm{MC}} is a biased estimator for GG in general, since gg may be non-linear. However, it is asymptotically unbiased—that is, G^NMC\hat{G}_{N}^{\mathrm{MC}} converges to GG with probability one in the limit NN\to\infty. This property is known as strong consistency and it follows from the strong law of large numbers (Feller, 1968, pp. 243–245) and continuity of gg. There is also a central limit theorem (CLT) for G^NMC\hat{G}_{N}^{\mathrm{MC}}, reflecting the rate of convergence: 𝔼[G^NMCG]Σ/N\mathbb{E}[\|\hat{G}_{N}^{\mathrm{MC}}-G\|]\leq\|\Sigma\|/\sqrt{N} asymptotically where Σ\Sigma is an asymptotic covariance matrix (see Theorem 2). We shall now use this result to analyse the asymptotic efficiency of the passive estimator for two evaluation tasks.

Example 0 (Accuracy).

Consider estimating the accuracy GaccG_{\mathrm{acc}} (row 1 of Table 1) of a classifier. By the CLT, it is straightforward to show that the passive estimator for GaccG_{\mathrm{acc}} is asymptotically normal with variance Gacc(1Gacc)/NG_{\mathrm{acc}}(1-G_{\mathrm{acc}})/N. Thus, to estimate GaccG_{\mathrm{acc}} with precision ww we require a labeled sample of size NGacc(1Gacc)/w2N\propto G_{\mathrm{acc}}(1-G_{\mathrm{acc}})/w^{2}. Although this is suboptimal111Theoretically it is possible to achieve an asymptotic variance of zero. it is not impractical. A passive sample reasonably captures the variance in GaccG_{\mathrm{acc}}.

This example shows that passive sampling is not always a poor choice. It can yield reasonably precise estimates of a generalized measure, so long as the measure is sensitive to regions of the space 𝒳×𝒴\mathcal{X}\times\mathcal{Y} with high density as measured by p(x,y)p(x,y). However, where these conditions are not satisfied, passive sampling may become impractical, requiring huge samples of labeled data in order to sufficiently drive down the variance. This is the case for the example below, where the measure is is sensitive to regions of 𝒳×𝒴\mathcal{X}\times\mathcal{Y} with low density as measured by p(x,y)p(x,y).

Example 0 (Recall).

Consider estimating recall GrecG_{\mathrm{rec}} (row 4 of Table 1) of a binary classifier. By the CLT, the passive estimator for GrecG_{\mathrm{rec}} is asymptotically normal with variance Grec(1Grec)/NϵG_{\mathrm{rec}}(1-G_{\mathrm{rec}})/N\epsilon where ϵ\epsilon denotes the relative frequency of the positive class. Thus we require a labeled sample of size NGrec(1Grec)/w2ϵN\propto G_{\mathrm{rec}}(1-G_{\mathrm{rec}})/w^{2}\epsilon to estimate GrecG_{\mathrm{rec}} with precision ww. This dependence on ϵ\epsilon makes passive sampling impractical when ϵ0\epsilon\ll 0—i.e. when the positive class is rare.

This example is not merely an intellectual curiosity—there are important applications where ϵ\epsilon is exceedingly small. For instance, in record linkage ϵ\epsilon scales inversely in the size of the databases to be linked (Marchant and Rubinstein, 2017).

3. An AIS-based framework for label-efficient evaluation

Refer to caption
Figure 1. Schematic of the proposed AIS-based evaluation framework.
Input: generalized measure GG; unlabeled test pool 𝒯\mathcal{T}; proposal update procedure; sample allocations N1,,NTN_{1},\ldots,N_{T}.
Initialize proposal q0q_{0} and sample history \mathcal{L}\leftarrow\emptyset
for t=1t=1 to TT do
     for n=1n=1 to NtN_{t} do
         xt,nqt1x_{t,n}\sim q_{t-1}
         wt,np(xt,n)qt1(xt,n)w_{t,n}\leftarrow\frac{p(x_{t,n})}{q_{t-1}(x_{t,n})}
         yt,nOracle(xt,n)y_{t,n}\sim\operatorname{Oracle}(x_{t,n})
         {(xt,n,yt,n,wt,n)}\mathcal{L}\leftarrow\mathcal{L}\cup\{(x_{t,n},y_{t,n},w_{t,n})\}
     end for
     Update proposal qtq_{t} using \mathcal{L}
end for
R^NAIS1N(x,y,w)w(x,y)\hat{R}_{N}^{\mathrm{AIS}}\leftarrow\frac{1}{N}\sum_{(x,y,w)\in\mathcal{L}}w\,\ell(x,y)
G^NAISg(R^NAIS)\hat{G}_{N}^{\mathrm{AIS}}\leftarrow g(\hat{R}_{N}^{\mathrm{AIS}})
return G^NAIS\hat{G}_{N}^{\mathrm{AIS}} and history \mathcal{L}
Algorithm 1 AIS for generalized measures

When passive sampling is inefficient222Unless otherwise specified, we mean label efficiency or sample efficiency when we use the term “efficiency” without qualification. , as we have seen in the preceding analysis, substantial improvements can often be achieved through biased sampling. In this section, we devise a framework for efficiently estimating generalized measures that leverages a biased sampling approach called adaptive importance sampling (AIS) (Bugallo et al., 2017). AIS estimates an expectation using samples drawn sequentially from a biased proposal distribution, that is adapted in stages based on samples from previous stages. It produces non-i.i.d. samples in general, unlike passive sampling and static (non-adaptive) importance sampling (IS) (Rubinstein and Kroese, 2016). AIS is a powerful approach because it does not assume an effective proposal distribution is known a priori—instead it is learnt from data. This addresses the main limitation of static IS—that one may be stuck using a sub-optimal proposal, which may compromise label efficiency.

There are many variations of AIS which differ in: (i) the way samples are allocated among stages; (ii) the dependence of the proposal on previous stages; (iii) the types of proposals considered; and (iv) the way samples are weighted within and across stages. Our approach is completely flexible with respect to points (i)–(iii). For point (iv), we use simple importance-weighting as it is amenable to asymptotic analysis using martingale theory (Portier and Delyon, 2018). A more complex weighting scheme is proposed in (Cornuet et al., 2012) which may have better stability, however its asymptotic behavior is not well understood.333Consistency was proved for this weighting scheme in the limit TT\to\infty where {Nt}\{N_{t}\} is a monotonically increasing sequence (Marin et al., 2019). To our knowledge, a CLT remains elusive.

Our framework is summarized in Figure 1. Given a target performance measure GG and an unlabeled test pool 𝒯\mathcal{T}, the labeling process proceeds in several stages indexed by t{1,,T}t\in\{1,\ldots,T\}. In the tt-th stage, Nt1N_{t}\geq 1 instances are drawn i.i.d. from 𝒯\mathcal{T} according to proposal qt1q_{t-1}. Labels are obtained for each instance by querying the oracle, and recorded with their importance weights in \mathcal{L}. At the end of the tt-th stage, the proposal is updated for the next stage. This update may depend on the weighted samples from all previous stages as recorded in \mathcal{L}. At any point during sampling, we estimate GG as follows:

(1) G^NAIS=g(R^NAIS) where R^NAIS=1N(x,y,w)w(x,y).\hat{G}_{N}^{\mathrm{AIS}}=g(\hat{R}_{N}^{\mathrm{AIS}})\text{ where }\hat{R}_{N}^{\mathrm{AIS}}=\frac{1}{N}\sum_{(x,y,w)\in\mathcal{L}}w\,\ell(x,y).

For generality, we permit the user to specify the sample allocations and the procedure for updating the proposals in Figure 1. In practice, we recommend allocating a small number of samples to each stage, as empirical studies suggest that efficiency improves when the proposal is updated more frequently (Portier and Delyon, 2018). However, this must be balanced with practical constraints, as a small sample allocation limits the ability to acquire labels in parallel. In Section 3.2, we recommend a practical procedure for updating the proposals. It approximates the asymptotically-optimal variance-minimizing proposal based on an online model of p(y|x)p(y|x). We present an example model for p(y|x)p(y|x) in Section 4, which can leverage prior information from the systems under evaluation. Further practicalities including sample reuse and confidence intervals are discussed in Section 3.3.

Remark 3 (Constraint on the proposal).

In a standard application of AIS for estimating 𝔼X,Yp[ϕ(X,Y)]\mathbb{E}_{X,Y\sim p}[\phi(X,Y)], one is free to select any proposal qt(x,y)q_{t}(x,y) with support on the set {(x,y):ϕ(x,y)p(x,y)0\{(x,y):\|\phi(x,y)\|p(x,y)\neq 0}. However, we have an additional constraint since we cannot bias sampling from the oracle p(y|x)p(y|x). Thus we consider proposals of the form qt(x,y)=qt(x)p(y|x)q_{t}(x,y)=q_{t}(x)p(y|x).

Remark 4 (Static Importance Sampling).

Our AIS framework reduces to static importance sampling when T=1T=1 so that all samples are drawn from a single static proposal q0q_{0}.

3.1. Asymptotic analysis

We study the asymptotic behavior of estimates for GG produced by the generic framework in Figure 1. Since our analysis does not depend on how samples are allocated among stages, it is cleaner to identify a sample using a single index j{1,,N}j\in\{1,\ldots,N\} where N=t=1TNtN=\sum_{t=1}^{T}N_{t} is the total number of samples, rather than a pair of indices (t,n)(t,n). Concretely, we map each (t,n)(t,n) to index j=n+t=1t1Ntj=n+\sum_{t^{\prime}=1}^{t-1}N_{t^{\prime}}. With this change, we let qj1(x)q_{j-1}(x) denote the proposal used to generate sample jj. It is important to note that this notation conceals the dependence on previous samples. Thus qj1(x)q_{j-1}(x) should be understood as shorthand for qj1(x|j1)q_{j-1}(x|\mathcal{F}_{j-1}) where j=σ((X1,Y1),,(Xj,Yj))\mathcal{F}_{j}=\sigma((X_{1},Y_{1}),\ldots,(X_{j},Y_{j})) denotes the filtration.

Our analysis relies on the fact that ZN=N(R^NAISR)Z_{N}=N(\hat{R}_{N}^{\mathrm{AIS}}-R) is a martingale with respect to N\mathcal{F}_{N}. The consistency of G^NAIS\hat{G}_{N}^{\mathrm{AIS}} then follows by a strong law of large numbers for martingales (Feller, 1971) and the continuous mapping theorem. A proof is included in Appendix A.

Theorem 1 (Consistency).

Let the support of proposal qj(x,y)=qj(x)p(y|x)q_{j}(x,y)=q_{j}(x)p(y|x) be a superset of {x,y𝒳×𝒴:(x,y)p(x,y)0}\{x,y\in\mathcal{X}\times\mathcal{Y}:\allowbreak\|\ell(x,y)\|p(x,y)\neq 0\} for all j0j\geq 0 and assume

(2) supj𝔼[(p(Xj)qj1(Xj))2|j1]<.\sup_{j\in\mathbb{N}}\mathbb{E}\!\left[\left(\frac{p(X_{j})}{q_{j-1}(X_{j})}\right)^{2}\middle|\mathcal{F}_{j-1}\right]<\infty.

Then G^NAIS\hat{G}_{N}^{\mathrm{AIS}} is strongly consistent for GG.

We also obtain a central limit theorem (CLT), which is useful for assessing asymptotic efficiency and computing approximate confidence intervals. Our proof (see Appendix B) invokes a CLT for martingales (Portier and Delyon, 2018) and the multivariate delta method.

Theorem 2 (CLT).

Suppose

(3) Vj:=var[p(Xj)qj1(Xj)(Xj,Yj)R|j1]Va.s.,V_{j}:={\operatorname{var}}\!\left[\frac{p(X_{j})}{q_{j-1}(X_{j})}\ell(X_{j},Y_{j})-R\middle|\mathcal{F}_{j-1}\right]\to V_{\infty}\quad\text{a.s.},

where VV_{\infty} is an a.s. finite random positive semidefinite matrix, and there exists η>0\eta>0 such that

(4) supj𝔼[(p(Xj)qj1(Xj))2+η|j1]<a.s.\sup_{j\in\mathbb{N}}\,\mathbb{E}\!\left[\left(\frac{p(X_{j})}{q_{j-1}(X_{j})}\right)^{2+\eta}\middle|\mathcal{F}_{j-1}\right]<\infty\quad\text{a.s.}

Then N(G^NAISG)\sqrt{N}(\hat{G}_{N}^{\mathrm{AIS}}-G) converges in distribution to a multivariate normal 𝒩(0,Σ)\mathcal{N}(0,\Sigma) with covariance matrix Σ=Dg(R)VDg(R)\Sigma=\mathrm{D}g(R)V_{\infty}\mathrm{D}g(R)^{\intercal} where [Dg]ij=giRj[\mathrm{D}g]_{ij}=\frac{\partial g_{i}}{\partial R_{j}} is the Jacobian of gg.

3.2. Variance-minimizing proposals

In order to achieve optimal asymptotic efficiency, we would like to use the AIS proposal that achieves the minimal asymptotic variance Σ\Sigma as defined in Theorem 2. We can solve for this optimal proposal using functional calculus, as demonstrated in the proposition below. Note that we must generalize the variance minimization problem for vector-valued GG since Σ\Sigma becomes a covariance matrix. We opt to use the total variance (the trace of Σ\Sigma) since the diagonal elements of Σ\Sigma are directly related to statistical efficiency, while the off-diagonal elements measure correlations between components of G^NAIS\hat{G}_{N}^{\mathrm{AIS}} that are beyond our control. This choice also ensures the functional optimization problem is tractable.

Proposition 0 (Asymptotically-optimal proposal).

Suppose the Jacobian Dg(R)\mathrm{D}_{g}(R) has full row rank and 𝔼X,Yp[Dg(R)(X,Y)22]>0\mathbb{E}_{X,Y\sim p}\left[\|\mathrm{D}_{g}(R)\,\ell(X,Y)\|_{2}^{2}\right]>0. Then the proposal

(5) q(x)=v(x)v(x)dxwithv(x)=p(x)Dg(R)(x,y)22p(y|x)dyq^{\star}(x)=\frac{v(x)}{\int v(x)\,\mathrm{d}x}\quad\text{with}\quad v(x)=p(x)\sqrt{\int\|\mathrm{D}_{g}(R)\,\ell(x,y)\|_{2}^{2}\,p(y|x)\,\mathrm{d}y}

achieves the minimum total asymptotic variance of

trΣ=(v(x)dx)2Dg(R)R22.\operatorname{tr}\Sigma=\left(\int v(x)\,\mathrm{d}x\right)^{2}-\|\mathrm{D}_{g}(R)\,R\|_{2}^{2}.

Appendix D provides sufficient conditions on GG and the oracle which ensure trΣ=0\operatorname{tr}\Sigma=0.

We use the above result to design a practical scheme for adapting a proposal for AIS. At each stage of the evaluation process, we approximate the asymptotically-optimal proposal q(x)q^{\star}(x) using an online model for the oracle response p(y|x)p(y|x). The model for p(y|x)p(y|x) should be initialized using prior information if available (e.g. from the systems under evaluation) and updated at the end of each stage using labels received from the oracle. However, we cannot simply estimate q(x)q^{\star}(x) by plugging in estimates of p(y|x)p(y|x) directly, as the resulting proposal may not satisfy the conditions of Theorems 1 and 2. Below we provide estimators for q(x)q^{\star}(x) which do satisfy the conditions, and provide sufficient conditions for achieving asymptotic optimality.

Proposition 0.

If the oracle is stochastic, let p^t(y|x)\hat{p}_{t}(y|x) be an estimate for p(y|x)p(y|x) whose support includes the support of p(y|x)p(y|x) for all stages t0t\geq 0, and assume p^t(y|x)a.s.p^(y|x)\hat{p}_{t}(y|x)\overset{\mathrm{a.s.}}{\to}\hat{p}_{\infty}(y|x) pointwise in xx. Alternatively, if the oracle is deterministic, let πt(y|x)\pi_{t}(y|x) be a posterior distribution for the response y(x)y(x) whose support includes y(x)y(x) for all t0t\geq 0, and assume πt(y|x)a.s.π(y|x)\pi_{t}(y|x)\overset{\mathrm{a.s.}}{\to}\pi_{\infty}(y|x) pointwise in xx. Let ϵt\epsilon_{t} be a positive bounded sequence and R^t\hat{R}_{t} be an estimator for RR which converges a.s. to R^\hat{R}_{\infty}. Assume 𝒳\mathcal{X} is finite (e.g. a pool of test data) and Dg(R^t)2K<\|\mathrm{D}_{g}(\hat{R}_{t})\|_{2}\leq K<\infty for all t0t\geq 0. Then the proposals

qt(x){p(x)max{Dg(R^t)(x,y)2,ϵt𝟏(x,y)0}πt(y|x)dy,for a deterministic oracle,p(x)[max{Dg(R^t)(x,y)22,ϵt𝟏(x,y)0}p^t(y|x)dy]12,for a stochastic oracle,q_{t}(x)\propto\begin{cases}p(x)\int\max\{\|\mathrm{D}_{g}(\hat{R}_{t})\,\ell(x,y)\|_{2},\epsilon_{t}\mathbf{1}_{\|\ell(x,y)\|\neq 0}\}\pi_{t}(y|x)\,\mathrm{d}y,&\text{for a deterministic oracle,}\\ p(x)\left[\int\max\{\|\mathrm{D}_{g}(\hat{R}_{t})\,\ell(x,y)\|_{2}^{2},\epsilon_{t}\mathbf{1}_{\|\ell(x,y)\|\neq 0}\}\hat{p}_{t}(y|x)\,\mathrm{d}y\right]^{\frac{1}{2}},&\text{for a stochastic oracle,}\end{cases}

satisfy the conditions of Theorems 1 and 2. If in addition R^=R\hat{R}_{\infty}=R and p^(y|x)=p(y|x)\hat{p}_{\infty}(y|x)=p(y|x) (alternatively π(y|x)=𝟏y=y(x)\pi_{\infty}(y|x)=\mathbf{1}_{y=y(x)}) and ϵt0\epsilon_{t}\downarrow 0, then the proposals approach asymptotic optimality.

3.3. Practicalities

We briefly discuss solutions to two issues that may arise in practical settings: sample reuse and approximate confidence regions.

3.3.1. Sample reuse

Suppose our framework is used to estimate a generalized measure G1G_{1}. If the joint distribution p(x,y)p(x,y) associated with the prediction problem has not changed, it may be desirable to reuse the weighted samples \mathcal{L} to estimate a different generalized measure G2G_{2}. This is possible so long as the sequence of proposals used to estimate G1G_{1} have the required support for G2G_{2}. More precisely, the support of qj(x,y)q_{j}(x,y) must include {x,y𝒳×𝒴:(x,y)p(x,y)0}\{x,y\in\mathcal{X}\times\mathcal{Y}:\|\ell(x,y)\|p(x,y)\neq 0\} for the loss functions associated with G1G_{1} and G2G_{2}.

If one anticipates sample reuse, the proposals can be made less specialized to a particular measure by mixing with the marginal distribution p(x)p(x), i.e. q(x)(1δ)q(x)+δp(x)q(x)\to(1-\delta)q(x)+\delta p(x) where δ(0,1]\delta\in(0,1] is a hyperparameter that controls the degree of specialization.

3.3.2. Approximate confidence regions

When publishing performance estimates, it may be desirable to quantify statistical uncertainty. An asymptotic 100(1α)%100(1-\alpha)\% confidence region for a generalized measure GG is given by the ellipsoid

{Gk:(GG^)Σ^1(GG^)(N1)kN(Nk)Fα,k,Nk},\left\{G^{\star}\in\mathbb{R}^{k}:(G^{\star}-\hat{G})^{\intercal}\hat{\Sigma}^{-1}(G^{\star}-\hat{G})\leq\frac{(N-1)k}{N(N-k)}F_{\alpha,k,N-k}\right\},

where G^\hat{G} is the sample mean, Σ^\hat{\Sigma} is the sample covariance matrix, and Fα,d1,d2F_{\alpha,d_{1},d_{2}} is the critical value of the FF distribution with d1,d2d_{1},d_{2} degrees of freedom at significance level α\alpha. This region can be approximated using the estimator for GG in (1) and the following estimator for Σ\Sigma:

Σ^AIS=Dg(R^AIS)(1Nj=1Np(xj)2(xj,yj)(xj,yj)qN(xj)qj1(xj)R^AISR^AIS)Dg(R^AIS).\hat{\Sigma}^{\mathrm{AIS}}=\mathrm{D}_{g}(\hat{R}^{\mathrm{AIS}})\bigg{(}\frac{1}{N}\sum_{j=1}^{N}\frac{p(x_{j})^{2}\ell(x_{j},y_{j})\ell(x_{j},y_{j})^{\intercal}}{q_{N}(x_{j})q_{j-1}(x_{j})}\\ -\hat{R}^{\mathrm{AIS}}\hat{R}^{\mathrm{AIS}\intercal}\bigg{)}\mathrm{D}_{g}(\hat{R}^{\mathrm{AIS}})^{\intercal}.

This is obtained from the expression for Σ\Sigma in Theorem 2, by plugging in AIS estimators for the variance and RR, and approximating q(x)q_{\infty}(x) by the most recent proposal qN(x)q_{N}(x).

4. A Dirichlet-tree model for the oracle response

In the previous section, we introduced a scheme for updating the AIS proposal which relies on an online model of the oracle response. Since there are many conceivable choices for the model, we left it unspecified for full generality. In this section, we propose a particular model that is suited for evaluating classifiers when the response from the oracle is deterministic (see Remark 1). Concretely, we make the assumption that the label space 𝒴={1,,C}\mathcal{Y}=\{1,\ldots,C\} is a finite set and p(y|x)p(y|x) is a point mass at y(x)y(x) for all x𝒯x\in\mathcal{T}. An extension of this section for stochastic oracles (where p(y|x)p(y|x) is not necessarily a point mass) is presented in Appendix H.

Since we would like to leverage prior information (e.g. classifier scores) from the system(s) under evaluation and perform regular updates as labels are received from the oracle, we opt to use a Bayesian model. Another design consideration is label efficiency. Since labels are scarce and the test pool 𝒯\mathcal{T} may be huge, we want to design a model that allows for sharing of statistical strength between “similar” instances in 𝒯\mathcal{T}. To this end, we propose a model that incorporates a hierarchical partition of 𝒯\mathcal{T}, where instances assigned to hierarchically neighboring blocks are assumed to elicit a similar oracle response.444 This is in contrast to models used in related work (Bennett and Carvalho, 2010; Marchant and Rubinstein, 2017) which assume the oracle response is independent across blocks of a non-hierarchical partition. Various unsupervised methods may be used to learn a hierarchical partition, including hierarchical agglomerative/divisive clustering (Reddy and Vinzamuri, 2014), kk-d trees (Bentley, 1975), and stratification based on classifier scores (see Appendix I for a brief review).

4.1. Generative process

We assume the global oracle response θ\theta (averaged over all instances) is generated according to a Dirichlet distribution, viz.

θ|αDirichlet(α),\theta|\alpha\sim\operatorname{Dirichlet}\!\left(\alpha\right),

where α=[α1,,αC]+C\alpha=[\alpha_{1},\ldots,\alpha_{C}]\in\mathbb{R}_{+}^{C} are concentration hyperparameters. The label yiy_{i} for each instance i{1,,M}i\in\{1,\ldots,M\} (indexing 𝒯\mathcal{T}) is then assumed to be generated i.i.d. according to θ\theta:

yi|θ\displaystyle y_{i}|\theta iid.Categorical(θ),\displaystyle\overset{\mathrm{iid.}}{\sim}\operatorname{Categorical}\!\left(\theta\right), i1,,M.\displaystyle i\in 1,\ldots,M.

We assume a hierarchical partition of the test pool 𝒯\mathcal{T} is given. The partition can be represented as a tree TT, where the leaf nodes of TT correspond to the finest partition of 𝒯\mathcal{T} into disjoint blocks {𝒯k}k=1K\{\mathcal{T}_{k}\}_{k=1}^{K} such that 𝒯=k=1K𝒯k\mathcal{T}=\bigcup_{k=1}^{K}\mathcal{T}_{k}. We assume each instance ii is assigned to one of the blocks (leaf nodes) ki{1,,K}k_{i}\in\{1,\ldots,K\} according to a distribution ψy\psi_{y} with a Dirichlet-tree prior (Dennis, 1996; Minka, 1999):

ψy|βy,T\displaystyle\psi_{y}|\beta_{y},T ind.DirichletTree(βy;T),\displaystyle\overset{\mathrm{ind.}}{\sim}\operatorname{DirichletTree}\!\left(\beta_{y};T\right), y𝒴,\displaystyle y\in\mathcal{Y},
ki|yi,ψyi\displaystyle k_{i}|y_{i},\psi_{y_{i}} ind.Categorical(ψyi),\displaystyle\overset{\mathrm{ind.}}{\sim}\operatorname{Categorical}\!\left(\psi_{y_{i}}\right), i1,,M.\displaystyle i\in 1,\ldots,M.

The Dirichlet-tree distribution is a generalization of the Dirichlet distribution, which allows for more flexible dependencies between the categories (blocks in this case). Categories that are hierarchically nearby based on the tree structure TT tend to be correlated. The Dirichlet concentration hyperparameters βy\beta_{y} associated with the internal nodes also control the correlation structure.

4.2. Inference

For a deterministic oracle, the response yiy_{i} for instance ii is either observed (previously labeled) or unobserved (yet to be labeled). It is important to model the observation process in case it influences the values of inferred parameters. To this end, we let 𝐨t=(ot,1,,ot,M)\mathbf{o}_{t}=(o_{t,1},\ldots,o_{t,M}) be observation indicators for the labels 𝐲=(y1,,yM)\mathbf{y}=(y_{1},\ldots,y_{M}) at the end of stage tt of the evaluation process (see Algorithm 1). We initialize 𝐨0=0\mathbf{o}_{0}=0 and define 𝐨t\mathbf{o}_{t} in the obvious way: ot,io_{t,i} is 1 if the label for instance ii has been observed by the end of stage tt and 0 otherwise. From Algorithm 1, we have that the nn-th instance selected in stage tt depends on the labels of the previously observed instances 𝐲(𝐨t1)\mathbf{y}_{(\mathbf{o}_{t-1})} and the block assignments 𝐤=(k1,,kM)\mathbf{k}=(k_{1},\ldots,k_{M}):

it,n|𝐨t1,𝐲(𝐨t1),𝐤qt1(𝐲(𝐨t),𝐤).i_{t,n}|\mathbf{o}_{t-1},\mathbf{y}_{(\mathbf{o}_{t-1})},\mathbf{k}\sim q_{t-1}(\mathbf{y}_{(\mathbf{o}_{t})},\mathbf{k}).

Our goal is to infer the unobserved labels (and hence the oracle response) at each stage tt of the evaluation process. We assume the block assignments 𝐤=(k1,,kM)\mathbf{k}=(k_{1},\ldots,k_{M}) are fully observed. Since the observation indicators are independent of the unobserved labels conditional on the observed labels, our model satisfies ignorability (Jaeger, 2005). This means we can ignore the observation process when conducting inference. Since we do not require a full posterior distribution over all parameters, it is sufficient to conduct inference using the expectation-maximization algorithm. This yields a distribution over the unobserved label for each instance and point estimates for the other parameters (ψy\psi_{y} and θ\theta). Full details are provided in Appendix F.

4.3. Asymptotic optimality

Since the Dirichlet-tree model described in this section is consistent for the true oracle (deterministic) response, it can be combined with the proposal updates described in Proposition 4 to yield an asymptotically-optimal AIS algorithm. This result is made precise in the following proposition, which is proved in Appendix G.

Proposition 0.

Consider an instantiation of our framework under a deterministic oracle where:

  • the oracle response is estimated online using the Dirichlet-tree model described in this section via the EM algorithm;

  • the proposals are adapted using the estimator defined in Proposition 4 with

  • ϵt=ϵ0(11Mi=1Mot,i)\epsilon_{t}=\epsilon_{0}(1-\frac{1}{M}\sum_{i=1}^{M}o_{t,i}) for some user-specified ϵ0>0\epsilon_{0}>0.

Then Theorems 1 and 2 hold and the estimator is asymptotically-optimal.

5. Experimental study

We conduct experiments to assess the label efficiency of our proposed framework555Using the procedure described in Section 3.2 to adapt the AIS proposal, together with the online model for the oracle response presented in Section 4. for a variety of evaluation tasks. The tasks vary in terms of the degree of class imbalance, the quality of predictions/scores from the classifier under evaluation, the size of the test pool, and the target performance measure. Where possible, we compare our framework (denoted Ours) with the following baselines:

  • Passive: passive sampling as specified in Definition 3.

  • IS: static importance sampling as described in Remark 4. We approximate the asymptotically-optimal proposal as in Proposition 4 using estimates of p(y|x)p(y|x) derived from classifier scores.

  • Stratified: an online variant of stratified sampling with proportional allocation, as used in (Druck and McCallum, 2011). Items are sampled one-at-a-time in proportion to the size of the allocated stratum.

  • OASIS: a stratified AIS method for estimating F scores (Marchant and Rubinstein, 2017).

5.1. Evaluation tasks

We prepare classifiers and test pools for evaluation using publicly-available datasets from various domains, as summarized in Table 3. amzn-googl, dblp-acm, abt-buy (Köpcke et al., 2010) and restaurant (RIDDLE, 2003) are benchmark datasets for entity resolution. They contain records from two sources and the goal is to predict whether pairs of records refer to the same entity or not. safedriver contains anonymized records from a car insurance company, and the task is to predict drivers who are likely to make a claim (Porto Seguro, 2017). creditcard relates to fraud detection for online credit card transactions (Pozzolo et al., 2015). tweets100k has been applied to tweet sentiment analysis (Mozafari et al., 2014).

For amzn-goog, dblp-acm, abt-buy, restaurant and tweets100k we use the same classifiers and test pools as in (Marchant and Rubinstein, 2017). For safedriver and creditcard we prepare our own by randomly splitting the data into train/test with a 70:30 ratio, and training classifiers using supervised learning. In scenarios where labeled data is scarce, semi-supervised or unsupervised methods might be used instead—the choice of learning paradigm has no bearing on evaluation. We consider three target performance measures—F1 score, accuracy and precision-recall curves—as separate evaluation tasks.

Table 3. Summary of test pools and classifiers under evaluation. The imbalance ratio is the number of positive class instances divided by the number of negative class instances. The true F1 score is assumed unknown.
Source Domain Size Imbalance ratio Classifier type F1 score
abt-buy (Köpcke et al., 2010) Entity resolution 53,753 1075 Linear SVM 0.595
amzn-goog (Köpcke et al., 2010) Entity resolution 676,267 3381 Linear SVM 0.282
dblp-acm (Köpcke et al., 2010) Entity resolution 53,946 2697 Linear SVM 0.947
restaurant (RIDDLE, 2003) Entity resolution 149,747 3328 Linear SVM 0.899
safedriver (Porto Seguro, 2017) Risk assessment 178,564 26.56 XGBoost (Chen and Guestrin, 2016) 0.100
creditcard (Pozzolo et al., 2015) Fraud detection 85,443 580.2 Logistic Regression 0.728
tweets100k (Mozafari et al., 2014) Sentiment analysis 20,000 0.990 Linear SVM 0.770

5.2. Setup

Oracle.

We simulate an oracle using labels included with each dataset. Since the datasets only contain a single label for each instance, we assume a deterministic oracle. Thus, when computing the consumed label budget, we only count the first query to the oracle for an instance as a consumed label. If an instance is selected for labeling again, we reuse the label from the first query.

Partitioning.

Ours, Stratified and OASIS assume the test pool is partitioned so the oracle response within each block is ideally uniform. We construct partitions by binning instances according to their classifier scores. The bin edges are determined using the cumulative square-root frequency (CSF) method (Dalenius and Hodges, 1959), which is widely used for stratified sampling. We set the number of bins to K=28K=2^{8}. Since Ours is capable of exploiting partitions with hierarchical structure, we fill in post-hoc structure by associating the CSF bins with the leaf nodes of an appropriately-size tree in breadth-first order. We consider two trees: a tree of depth 1 with branching factor KK (denoted Ours-1, equivalent to a non-hierarchical partition) and a tree of depth 8 with branching factor 2 (denoted Ours-8).

Hyperparameters.

We leverage prior information from the classifiers under evaluation to set hyperparameters. Wherever a prior estimate of p(y|x)p(y|x) is required, we use the classifier score s(y|x)s(y|x), applying the softmax function to non-probabilistic scores. For the Dirichlet-tree model we set hyperparameters as follows:

αy=1+k=1Ks(y|k),βyν=depth(ν)2+k=1Ks(y|k)δν(k)\alpha_{y}=1+\sum_{k=1}^{K}s(y|k),\quad\beta_{y\nu}=\operatorname{depth}(\nu)^{2}+\sum_{k=1}^{K}s(y|k)\delta_{\nu}(k)

where s(y|k)=1|𝒯k|xi𝒯ks(y|xi)s(y|k)=\frac{1}{|\mathcal{T}_{k}|}\sum_{x_{i}\in\mathcal{T}_{k}}s(y|x_{i}) is the mean score over instances in the kk-th block, ν\nu denotes a non-root node of the tree TT, depth(ν)\operatorname{depth}(\nu) denotes the depth of node ν\nu in TT, and δν(k)\delta_{\nu}(k) is an indicator equal to 1 if node ν\nu is traversed to reach leaf node kk and 0 otherwise.

Repeats.

Since the evaluation process is randomized, we repeated each experiment 1000 times, computing and reporting the mean behavior with bootstrap confidence intervals.

5.3. Results

We provide a summary of the results here, with a particular focus on the results for F1 score, which is supported by all baselines. Detailed results for accuracy and precision-recall curves are included in Appendix J.

F1 score.

To assess convergence of the estimated F1 score, we plot the mean-squared error (MSE) as a function of the consumed label budget for all datasets and methods. A plot of this kind is included for abt-buy in Figure 2. It shows that Ours is significantly more efficient than the baseline methods in this case, achieving a lower MSE for all label budgets. The Passive and Stratified methods perform significantly worse, achieving an MSE at least one order of magnitude greater than the biased sampling methods. Figure 2 also plots the convergence of the proposal in terms of the mean KL divergence from the asymptotically-optimal proposal. The results here are in line with expectations: convergence of the F1 score is more rapid when the proposal is closer to asymptotic optimality.

Refer to caption
Figure 2. Convergence for abt-buy over 1000 repeats. The upper panel plots the KL divergence from the proposal to the asymptotically-optimal one. The lower panel plots the MSE of the estimated F1 score. 95% bootstrap confidence intervals are included.
Refer to caption
Figure 3. MSE of the estimated F1 score after 2000 label queries over 1000 repeats. The order of the bars (from top to bottom) for each dataset matches the order in the legend. 95% bootstrap confidence intervals are shown in black.

Figure 3 summarizes the convergence plots for the six other data sets (included in Appendix J), by plotting the MSE of the estimated F1 score after 2000 labels are consumed. It shows that Ours achieves best or equal-best MSE on all but one of the datasets (dblp-acm) within the 95% confidence bands. We find the adaptive methods Ours and OASIS perform similarly to IS when the prior estimates for p(y|x)p(y|x) are accurate—i.e. there is less to gain by adapting. Of the two adaptive methods, Ours generally converges more rapidly than OASIS, which might be expected since our procedure for adapting the proposal is asymptotically-optimal. The hierarchical variant of our model Ours-8 tends to outperform the non-hierarchical variant Ours-1, which we expect when p(y|x)p(y|x) varies smoothly across neighboring blocks. Finally, we observe that Passive and Stratified are competitive when the class imbalance is less severe. This agrees with our analysis in Section 2.3. We note that similar trends when targeting accuracy in place of F1 score, as presented in Appendix J.

PR curves.

We estimate PR curves on a uniformly-spaced grid of classifier thresholds τ1<<τL\tau_{1}<\cdots<\tau_{L}, where τ1\tau_{1} is the minimum classifier score, τL\tau_{L} is the maximum classifier score, and L=210L=2^{10} (see Example 2). We also use the uniform grid to partition the test pool (in place of the CSF method), associating each block of the partition with four neighboring bins on the grid to yield K=28K=2^{8} blocks. Figure 4 illustrates the vast improvement of the biased sampling methods (Ours-8 and IS) over Passive for this evaluation task. The estimated PR curves shown for Ours-8 and IS vary minimally about the true PR curve, and are reliable for selecting an operating threshold. The same cannot be said for the curves produced by Passive, which exhibit such high variance that they are essentially useless.

Refer to caption
Figure 4. A sample of 100 estimated precision-recall (PR) curves (in dark gray) for three evaluation methods: Ours-8, IS and Passive. The PR curves are estimated for abt-buy using a label budget of 5000. The thick red curve is the true PR curve (assuming all labels are known).

6. Related work

Existing approaches to label-efficient evaluation in a machine learning context largely fall into three categories: model-based (Welinder et al., 2013), stratified sampling (Bennett and Carvalho, 2010; Druck and McCallum, 2011) and importance sampling (Sawade et al., 2010a; Marchant and Rubinstein, 2017). The model-based approach in (Welinder et al., 2013) estimates precision-recall curves for binary classifiers. However, it uses inefficient passive sampling to select instances to label, and makes strong assumptions about the distribution of scores and labels which can result in biased estimates. Stratified sampling has been used to estimate scalar performance measures such as precision, accuracy and F1 score. Existing approaches (Bennett and Carvalho, 2010; Druck and McCallum, 2011) bias the sampling of instances from strata (akin to blocks) using a heuristic generalization of the optimal allocation principle (Cochran, 1977). However, stratified sampling is considered to be a less effective variance reduction method compared to importance sampling (Rubinstein and Kroese, 2016), and it does not naturally support stochastic oracles.

Static importance sampling (Sawade et al., 2010a) and stratified adaptive importance sampling (Marchant and Rubinstein, 2017) have been used for online evaluation, and are most similar to our approach. However (Marchant and Rubinstein, 2017) only supports the estimation of F1 score, and (Sawade et al., 2010a) only supports the estimation of scalar generalized risks666These can be viewed as a sub-family of generalized measures with the following correspondence (x,y)=w(x,y,f(x))[(f(x),y),1]\ell(x,y)=w(x,y,f(x))[\ell(f(x),y),1]^{\intercal}, and g(R)=R1/R2g(R)=R_{1}/R_{2}.. Both of these methods attempt to approximate the asymptotically-optimal variance-minimizing proposal, however the approximation used in (Sawade et al., 2010a) is non-adaptive and is not optimized for deterministic oracles, while the approximation used in (Marchant and Rubinstein, 2017) is adaptive but less accurate due to the stratified design.

Novel evaluation methods are also studied in the information retrieval (IR) community (see survey (Kanoulas, 2016)). Some tasks in the IR setting can be cast as prediction problems by treating query-document pairs as inputs and relevance judgments as outputs. Early approaches used relevance scores from the IR system to manage the abundance of irrelevant documents in an ad hoc manner (Cormack et al., 1998). Recent approaches (Schnabel et al., 2016; Li and Kanoulas, 2017) are based on a statistical formulation similar to ours, however they are specialized to IR systems. Within the IR community, stratified sampling and cluster sampling have also been used to efficiently evaluate knowledge graphs (Gao et al., 2019).

Adaptive importance sampling (AIS) is studied more generally in the context of Monte Carlo integration (see review by Bugallo et al., 2017). Most methods are inappropriate for our application, as they assume a continuous space without constraints on the proposal (see Remark 3). Oh and Berger (1992) introduced the idea of adapting the proposal over multiple stages using samples from the previous stages. Cappé et al. (2008) devise a general framework using independent mixtures as proposals. The method of Cornuet et al. (2012) continually re-weights all past samples, however it is more computationally demanding and less amenable to analysis since it breaks the martingale property. Portier and Delyon (2018) analyze parametric AIS in the large sample limit. This improves upon earlier work which assumed the number of stages goes to infinity (Douc and Moulines, 2008) or the sample size at each stage is monotonically increasing (Marin et al., 2019).

Finally, we note that label-efficient evaluation may be viewed as a counterpart to active learning, as both are concerned with reducing labeling requirements. There is a large body of literature concerning active learning—we refer the reader to surveys (Settles, 2009; Gilyazev and Turdakov, 2018). However whereas active learning aims to find a model with low bounded risk using actively-sampled training data, active evaluation aims to estimate risk using actively-sampled test data for any model.

7. Conclusion

We have proposed a framework for online supervised evaluation, which aims to minimize labeling efforts required to achieve precise, asymptotically-unbiased performance estimates. Our framework is based on adaptive importance sampling, with variance-minimizing proposals that are refined adaptively based on an online model of p(y|x)p(y|x). Under verifiable conditions on the chosen performance measure and the model, we proved strong consistency (asymptotic unbiasedness) of the resulting performance estimates and a central limit theorem. We instantiated our framework to evaluate classifiers using deterministic or stochastic human annotators. Our approach based on a hierarchical Dirichlet-tree model, achieves superior or competitive performance on all but one of seven test cases.

Acknowledgements.
N. Marchant acknowledges the support of an Australian Government Research Training Program Scholarship. B. Rubinstein acknowledges the support of Australian Research Council grant DP150103710.

References

  • (1)
  • Bennett and Carvalho (2010) Paul N. Bennett and Vitor R. Carvalho. 2010. Online Stratified Sampling: Evaluating Classifiers at Web-scale. In CIKM. 1581–1584. https://doi.org/10.1145/1871437.1871677
  • Bentley (1975) Jon Louis Bentley. 1975. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18, 9 (Sept. 1975), 509–517. https://doi.org/10.1145/361002.361007
  • Bugallo et al. (2017) M. F. Bugallo, V. Elvira, L. Martino, D. Luengo, J. Miguez, and P. M. Djuric. 2017. Adaptive Importance Sampling: The past, the present, and the future. IEEE Signal Processing Magazine 34, 4 (July 2017), 60–79. https://doi.org/10.1109/MSP.2017.2699226
  • Cappé et al. (2008) Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. 2008. Adaptive importance sampling in general mixture classes. Statistics and Computing 18, 4 (01 Dec. 2008), 447–459. https://doi.org/10.1007/s11222-008-9059-x
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
  • Cochran (1977) William G. Cochran. 1977. Sampling Techniques (3rd ed.). Wiley, New York.
  • Cormack et al. (1998) Gordon V. Cormack, Christopher R. Palmer, and Charles L. A. Clarke. 1998. Efficient Construction of Large Test Collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia) (SIGIR ’98). Association for Computing Machinery, New York, NY, USA, 282–289. https://doi.org/10.1145/290941.291009
  • Cornuet et al. (2012) Jean-Marie Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P. Robert. 2012. Adaptive Multiple Importance Sampling. Scandinavian Journal of Statistics 39, 4 (2012), 798–812. https://doi.org/10.1111/j.1467-9469.2011.00756.x
  • Dalenius and Hodges (1959) Tore Dalenius and Joseph L. Hodges. 1959. Minimum Variance Stratification. J. Amer. Statist. Assoc. 54, 285 (March 1959), 88–101. https://doi.org/10.1080/01621459.1959.10501501
  • Dennis (1996) Samuel Y. Dennis. 1996. A Bayesian analysis of tree-structured statistical decision problems. Journal of Statistical Planning and Inference 53, 3 (1996), 323–344. https://doi.org/10.1016/0378-3758(95)00112-3
  • Douc and Moulines (2008) Randal Douc and Eric Moulines. 2008. Limit theorems for weighted samples with applications to sequential Monte Carlo methods. The Annals of Statistics 36, 5 (Oct. 2008), 2344–2376. https://doi.org/10.1214/07-AOS514
  • Druck and McCallum (2011) Gregory Druck and Andrew McCallum. 2011. Toward Interactive Training and Evaluation. In CIKM (New York, NY, USA). 947–956. https://doi.org/10.1145/2063576.2063712
  • Feller (1968) W. Feller. 1968. An Introduction to Probability Theory and Its Applications, Volume 1 (3rd ed.). Wiley.
  • Feller (1971) W. Feller. 1971. An Introduction to Probability Theory and Its Applications, Volume 2 (2nd ed.). Wiley.
  • Gao et al. (2019) Junyang Gao, Xian Li, Yifan Ethan Xu, Bunyamin Sisman, Xin Luna Dong, and Jun Yang. 2019. Efficient Knowledge Graph Accuracy Evaluation. Proc. VLDB Endow. 12, 11 (July 2019), 1679–1691. https://doi.org/10.14778/3342263.3342642
  • Gilyazev and Turdakov (2018) R. A. Gilyazev and D. Yu. Turdakov. 2018. Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling. Programming and Computer Software 44, 6 (Nov. 2018), 476–491. https://doi.org/10.1134/S0361768818060142
  • Gunning and Horgan (2004) Patricia Gunning and Jane M. Horgan. 2004. A New Algorithm for the Construction of Stratum Boundaries in Skewed Populations. Survey Methodology 30, 2 (Dec. 2004), 159–166.
  • Jaeger (2005) Manfred Jaeger. 2005. Ignorability in Statistical and Probabilistic Inference. J. Artif. Int. Res. 24, 1 (Dec. 2005), 889–917.
  • Kanoulas (2016) Evangelos Kanoulas. 2016. A Short Survey on Online and Offline Methods for Search Quality Evaluation. Springer International Publishing, Cham, 38–87. https://doi.org/10.1007/978-3-319-41718-9_3
  • Khalilia et al. (2011) Mohammed Khalilia, Sounak Chakraborty, and Mihail Popescu. 2011. Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making 11, 1 (2011), 13 pages. https://doi.org/10.1186/1472-6947-11-51
  • Köpcke et al. (2010) Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-world Match Problems. PVLDB 3, 1 (2010), 484–493. https://doi.org/10.14778/1920841.1920904
  • Li and Kanoulas (2017) Dan Li and Evangelos Kanoulas. 2017. Active Sampling for Large-scale Information Retrieval Evaluation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore, Singapore) (CIKM ’17). ACM, New York, NY, USA, 49–58. https://doi.org/10.1145/3132847.3133015
  • Marchant and Rubinstein (2017) Neil G. Marchant and Benjamin I. P. Rubinstein. 2017. In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1322–1333. https://doi.org/10.14778/3137628.3137642
  • Marin et al. (2019) Jean-Michel Marin, Pierre Pudlo, and Mohammed Sedki. 2019. Consistency of adaptive importance sampling and recycling schemes. Bernoulli 25, 3 (Aug. 2019), 1977–1998. https://doi.org/10.3150/18-BEJ1042
  • Minka (1999) Tom Minka. 1999. The Dirichlet-tree Distribution. Technical Report. Justsystem Pittsburgh Research Center.
  • Mozafari et al. (2014) Barzan Mozafari, Purna Sarkar, Michael Franklin, Michael Jordan, and Samuel Madden. 2014. Scaling up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning. Proc. VLDB Endow. 8, 2 (Oct. 2014), 125–136. https://doi.org/10.14778/2735471.2735474
  • Oh and Berger (1992) Man-Suk Oh and James O. Berger. 1992. Adaptive importance sampling in monte carlo integration. Journal of Statistical Computation and Simulation 41, 3-4 (1992), 143–168. https://doi.org/10.1080/00949659208810398
  • Portier and Delyon (2018) François Portier and Bernard Delyon. 2018. Asymptotic optimality of adaptive importance sampling. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3138–3148.
  • Porto Seguro (2017) Porto Seguro. 2017. Porto Seguro’s Safe Driver Prediction. https://www.kaggle.com/c/porto-seguro-safe-driver-prediction. Accessed: Dec 2019.
  • Pozzolo et al. (2015) A. D. Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi. 2015. Calibrating Probability with Undersampling for Unbalanced Classification. In 2015 IEEE Symposium Series on Computational Intelligence. 159–166. https://doi.org/10.1109/SSCI.2015.33
  • Reddy and Vinzamuri (2014) Chandan K. Reddy and Bhanukiran Vinzamuri. 2014. A Survey of Partitional and Hierarchical Clustering Algorithms (1st ed.). Chapman & Hall/CRC, 87–110. https://doi.org/10.1201/9781315373515-4
  • RIDDLE (2003) RIDDLE 2003. Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets. http://www.cs.utexas.edu/users/ml/riddle/data.html. Accessed: Dec 2016.
  • Rubinstein and Kroese (2016) Reuven Y. Rubinstein and Dirk P. Kroese. 2016. Simulation and the Monte Carlo Method. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118631980
  • Sawade et al. (2010b) Christoph Sawade, Niels Landwehr, Steffen Bickel, and Tobias Scheffer. 2010b. Active Risk Estimation. In Proceedings of the 27th International Conference on International Conference on Machine Learning (Haifa, Israel) (ICML’10). Omnipress, Madison, WI, USA, 951–958.
  • Sawade et al. (2010a) Christoph Sawade, Niels Landwehr, and Tobias Scheffer. 2010a. Active Estimation of F-Measures. In Advances in Neural Information Processing Systems 23, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). Curran Associates, Inc., 2083–2091.
  • Schnabel et al. (2016) Tobias Schnabel, Adith Swaminathan, Peter I. Frazier, and Thorsten Joachims. 2016. Unbiased Comparative Evaluation of Ranking Functions. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (Newark, Delaware, USA) (ICTIR ’16). ACM, New York, NY, USA, 109–118. https://doi.org/10.1145/2970398.2970410
  • Schultheis et al. (2020) Erik Schultheis, Mohammadreza Qaraei, Priyanshu Gupta, and Rohit Babbar. 2020. Unbiased Loss Functions for Extreme Classification With Missing Labels. arXiv:2007.00237 [stat.ML]
  • Settles (2009) Burr Settles. 2009. Active Learning Literature Survey. Technical Report. University of Wisconsin-Madison Department of Computer Sciences. http://digital.library.wisc.edu/1793/60660
  • Vaart (1998) A. W. van der Vaart. 1998. Delta Method. Cambridge University Press, 25––34. https://doi.org/10.1017/CBO9780511802256.004
  • Wei et al. (2013) Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 16, 4 (2013), 449–475. https://doi.org/10.1007/s11280-012-0178-0
  • Welinder et al. (2013) P. Welinder, M. Welling, and P. Perona. 2013. A Lazy Man’s Approach to Benchmarking: Semisupervised Classifier Evaluation and Recalibration. In CVPR. 3262–3269. https://doi.org/10.1109/CVPR.2013.419

Appendix A Proof of Theorem 1

First we prove that R^NAISa.s.R\hat{R}_{N}^{\mathrm{AIS}}\overset{\mathrm{a.s.}}{\to}R using a strong law of large numbers (SLLN) for martingales (Feller, 1971, p. 243). Consider the martingale

ZN=N(R^NAISR)=j=1N{p(Xj)qj1(Xj)(Xj,Yj)R}Z_{N}=N(\hat{R}_{N}^{\mathrm{AIS}}-R)=\sum_{j=1}^{N}\left\{\frac{p(X_{j})}{q_{j-1}(X_{j})}\ell(X_{j},Y_{j})-R\right\}

and let δj,i=p(Xj)qj1(Xj)i(Xj,Yj)Ri\delta_{j,i}=\frac{p(X_{j})}{q_{j-1}(X_{j})}\ell_{i}(X_{j},Y_{j})-R_{i} denote the ii-th component of the jj-th contribution to ZNZ_{N}. Since XjX_{j} is drawn from qj1(x)q_{j-1}(x) and qj1(x)>0q_{j-1}(x)>0 wherever p(x)(x,y)0p(x)\|\ell(x,y)\|\neq 0, it follows that 𝔼[δj,i|j1]=0\mathbb{E}[\delta_{j,i}|\mathcal{F}_{j-1}]=0. In addition, we have

j=1𝔼[δj,i2]j2\displaystyle\sum_{j=1}^{\infty}\frac{\mathbb{E}\!\left[\delta_{j,i}^{2}\right]}{j^{2}} =j=11j2{𝔼[p(Xj)2qj1(Xj)2i(Xj,Yj)2]+Ri2}j=1U2Cj2<,\displaystyle=\sum_{j=1}^{\infty}\frac{1}{j^{2}}\left\{\mathbb{E}\!\left[\frac{p(X_{j})^{2}}{q_{j-1}(X_{j})^{2}}\ell_{i}(X_{j},Y_{j})^{2}\right]+R_{i}^{2}\right\}\leq\sum_{j=1}^{\infty}\frac{U^{2}C}{j^{2}}<\infty,

where the inequality follows from the boundedness of (x,y)\ell(x,y) and (2). Thus the conditions of the SLLN are satisfied and we have 1Nj=1Nδi,ja.s.0R^NAISa.s.R\frac{1}{N}\sum_{j=1}^{N}\delta_{i,j}\overset{\mathrm{a.s.}}{\to}0\implies\hat{R}_{N}^{\mathrm{AIS}}\overset{\mathrm{a.s.}}{\to}R. Now the continuous mapping theorem states that

R^NAISa.s.Rg(R^NAIS)a.s.g(R),\hat{R}_{N}^{\mathrm{AIS}}\overset{\mathrm{a.s.}}{\to}R\quad\implies\quad g(\hat{R}_{N}^{\mathrm{AIS}})\overset{\mathrm{a.s.}}{\to}g(R),

provided RR is not in the set of discontinuity points of gg. This condition is satisfied by assumption.

Appendix B Proof of Theorem 2

The CLT of Portier and Delyon (2018) implies that N(R^AISR)𝒩(0,V)\sqrt{N}(\hat{R}^{\mathrm{AIS}}-R)\Rightarrow\mathcal{N}(0,V_{\infty}), provided two conditions hold. The first condition of their CLT holds by assumption (3) and the second condition holds by the boundedness of the loss function and (4). The multivariate delta method (Vaart, 1998) then implies that N(g(R^AIS)g(R))𝒩(0,Dg(R)VDg(R))\sqrt{N}(g(\hat{R}^{\mathrm{AIS}})-g(R))\Rightarrow\mathcal{N}(0,\mathrm{D}g(R)V_{\infty}\mathrm{D}g(R)^{\intercal}), since gg is assumed to be differentiable at RR in Definition 1.

Appendix C Proof of Proposition 3

We want to find the proposal qq that minimizes the total asymptotic variance

trΣ=𝔼X,Yp[p(X)Dg(R)(X,Y)22q(X)]Dg(R)R22.\operatorname{tr}\Sigma=\underset{X,Y\sim p}{\mathbb{E}}\left[\frac{p(X)\|\mathrm{D}_{g}(R)\,\ell(X,Y)\|_{2}^{2}}{q(X)}\right]-\|\mathrm{D}_{g}(R)\,R\|_{2}^{2}.

We can express this as a functional optimization problem:

(6) minq\displaystyle\min_{q} c(x)q(x)dx\displaystyle\int\frac{c(x)}{q(x)}\mathrm{d}x
s.t. q(x)dx=1,\displaystyle\int q(x)\,\mathrm{d}x=1,

where c(x)=p(x)2Dg(R)(x,y)22p(y|x)dyc(x)=p(x)^{2}\int\|\mathrm{D}_{g}(R)\,\ell(x,y)\|_{2}^{2}\,p(y|x)\mathrm{d}y.

Using the method of Lagrange multipliers, Sawade et al. (2010a) show that the solution to (6) is q(x)c(x)q^{\star}(x)\propto\sqrt{c(x)}. This yields the required result.

Appendix D Conditions for achieving zero total asymptotic variance

In typical applications of IS, the asymptotically-optimal proposal can achieve zero variance. This is not guaranteed in our application, since we do not have complete freedom in selecting the proposal (see Remark 3). Below we provide sufficient conditions under which the total asymptotic variance of G^NAIS\hat{G}_{N}^{\mathrm{AIS}} can be reduced to zero.

Proposition 0.

Suppose the oracle is deterministic (i.e. p(y|x)p(y|x) is a point mass for all xx) and the generalized measure is such that sign((x,y)gl(R))\operatorname{sign}(\ell(x,y)\cdot\nabla g_{l}(R)) is constant for all (x,y)𝒳×𝒴(x,y)\in\mathcal{X}\times\mathcal{Y} and l{1,,m}l\in\{1,\ldots,m\}. Then the asymptotically-optimal proposal achieves trΣ=0\operatorname{tr}\Sigma=0.

Proof.

From Proposition 3, the asymptotically optimal proposal achieves a total variance of

(7) trΣ=(v(x)dx)2Dg(R)R22.\operatorname{tr}\Sigma=\left(\int v(x)\,\mathrm{d}x\right)^{2}-\|\mathrm{D}_{g}(R)\,R\|_{2}^{2}.

We evaluate the two terms in this expression separately. Using the fact that p(y|x)=𝟏y=y(x)p(y|x)=\mathbf{1}_{y=y(x)}, the first term becomes

(v(x)𝑑x)2\displaystyle\left(\int v(x)\,dx\right)^{2} =(Dg(R)(x,y(x))2p(x)dx)2\displaystyle=\left(\int\|\mathrm{D}_{g}(R)\,\ell(x,y(x))\|_{2}\,p(x)\,\mathrm{d}x\right)^{2}
(Dg(R)(x,y(x))1p(x)dx)2\displaystyle\leq\left(\int\|\mathrm{D}_{g}(R)\,\ell(x,y(x))\|_{1}\,p(x)\,\mathrm{d}x\right)^{2}
=(l=1m(x,y(x))gl(R)p(x)dx)2.\displaystyle=\left(\sum_{l=1}^{m}\int\ell(x,y(x))\cdot\nabla g_{l}(R)\,p(x)\,\mathrm{d}x\right)^{2}.

The second line follows by application of the inequality x2x1\|x\|_{2}\leq\|x\|_{1}, and the third line follows by assumption. For the second term we have

Dg(R)R22\displaystyle\|\mathrm{D}_{g}(R)\,R\|_{2}^{2} =Dg(R)(x,y(x))p(x)dx22\displaystyle=\left\|\int\mathrm{D}_{g}(R)\,\ell(x,y(x))\,p(x)\,\mathrm{d}x\right\|_{2}^{2}
=l=1m((x,y(x))gl(R)p(x)dx)2\displaystyle=\sum_{l=1}^{m}\left(\int\ell(x,y(x))\cdot\nabla g_{l}(R)\,p(x)\,\mathrm{d}x\right)^{2}
(l=1m(x,y(x))gl(R)p(x)dx)2,\displaystyle\geq\left(\sum_{l=1}^{m}\int\ell(x,y(x))\cdot\nabla g_{l}(R)\,p(x)\,\mathrm{d}x\right)^{2},

by application of Jensen’s inequality. Subtracting the second term from the first, we have trΣ0\operatorname{tr}\Sigma\leq 0. ∎

By way of illustration, we apply the above proposition to two common performance measures: accuracy and recall. We assume a deterministic oracle in both cases.

Example 0 (Asymptotic variance for accuracy).

From Table 1, we have that accuracy can be expressed as a generalized performance measure by setting (x,y)=𝟏yf(x)\ell(x,y)=\mathbf{1}_{y\neq f(x)} and g(R)=1Rg(R)=1-R. Evaluating the condition in Proposition 1, we have

sign((x,y)g(R))=sign(𝟏yf(x))=1\operatorname{sign}\left(\ell(x,y)\cdot\nabla g(R)\right)=\operatorname{sign}\left(-\mathbf{1}_{y\neq f(x)}\right)=-1

for all (x,y)𝒳×𝒴(x,y)\in\mathcal{X}\times\mathcal{Y}. Thus our framework can achieve zero asymptotic total variance when estimating accuracy under a deterministic oracle.

Example 0 (Asymptotic variance for recall).

From Table 1, we have that recall can be expressed as a generalized performance measure by setting (x,y)=[yf(x),y]\ell(x,y)=[yf(x),y]^{\intercal} and g(R)=R1/R2g(R)=R_{1}/R_{2}. The conditions of Proposition 1 are not satisfied in this case, since

sign((x,y)g(R))=sign(yR2(f(x)Grec))=sign(f(x)Grec)\operatorname{sign}\left(\ell(x,y)\cdot\nabla g(R)\right)=\operatorname{sign}\left(\frac{y}{R_{2}}(f(x)-G_{\mathrm{rec}})\right)=\operatorname{sign}\left(f(x)-G_{\mathrm{rec}}\right)

which is not constant for all (x,y)𝒳×𝒴(x,y)\in\mathcal{X}\times\mathcal{Y}. Indeed, when we evaluate the expression for the asymptotic total variance (see Proposition 3), we find that Σ=4Grec2(1Grec)2\Sigma=4G_{\mathrm{rec}}^{2}(1-G_{\mathrm{rec}})^{2}. Therefore in general there is positive lower bound on the asymptotic variance that can be achieved by our framework when estimating recall under a deterministic oracle.

Appendix E Proof of Proposition 4

Before proving the proposition, we establish a useful corollary.

Corollary 0.

Suppose the generalized measure GG is defined with respect to a finite input space 𝒳\mathcal{X} (e.g. a finite pool of test data).

  1. (i)

    If the support of proposal qj(x,y)=qj(x)p(y|x)q_{j}(x,y)=q_{j}(x)p(y|x) is a superset of {x,y𝒳×𝒴:p(x,y)(x,y)0}\{x,y\in\mathcal{X}\times\mathcal{Y}:p(x,y)\|\ell(x,y)\|\neq 0\} for all j0j\geq 0, then Theorem 1 holds.

  2. (ii)

    If in addition qj(x)a.s.q(x)q_{j}(x)\overset{\mathrm{a.s.}}{\to}q_{\infty}(x) pointwise in xx, then Theorem 2 holds.

Proof.

For the first statement, we check conditions (2) and (4) of Theorem 1. Let 𝒬j𝒳\mathcal{Q}_{j}\subset\mathcal{X} be the support of qj(x)q_{j}(x) and let δj=infx𝒬jqj(x)>0\delta_{j}=\inf_{x\in\mathcal{Q}_{j}}q_{j}(x)>0. For η0\eta\geq 0 we have

𝔼[(p(Xj)qj1(Xj))2+η|j1]\displaystyle\mathbb{E}\left[\left(\frac{p(X_{j})}{q_{j-1}(X_{j})}\right)^{2+\eta}\middle|\mathcal{F}_{j-1}\right] =x𝒬j1p(x)2+ηqj1(x)p(y|x)qj1(x)2+ηdy\displaystyle=\int\sum_{x\in\mathcal{Q}_{j-1}}\frac{p(x)^{2+\eta}q_{j-1}(x)p(y|x)}{q_{j-1}(x)^{2+\eta}}\,\mathrm{d}y
(1δj)1+η<.\displaystyle\leq\left(\frac{1}{\delta_{j}}\right)^{1+\eta}<\infty.

For the second statement, we must additionally check condition (3) regarding the convergence of VjV_{j}. Letting

fj(x,y)=(p(x)qj(x)(x,y)R)(p(x)qj(x)(x,y)R)qj(x)p(y|x),f_{j}(x,y)=\left(\frac{p(x)}{q_{j}(x)}\ell(x,y)-R\right)\left(\frac{p(x)}{q_{j}(x)}\ell(x,y)-R\right)^{\intercal}q_{j}(x)p(y|x),

we can write Vj=x𝒬jfj(x,y)dyV_{j}=\int\sum_{x\in\mathcal{Q}_{j}}f_{j}(x,y)\,\mathrm{d}y. By the a.s. pointwise convergence of qj(x)q_{j}(x) and the continuous mapping theorem, we have fj(x,y)f(x,y)f_{j}(x,y)\to f_{\infty}(x,y) a.s. pointwise in xx and yy. Now observe that

fj(x,y)2\displaystyle\|f_{j}(x,y)\|_{2} =qj(x,y)p(x)qj(x)(x,y)R22\displaystyle=q_{j}(x,y)\left\|\frac{p(x)}{q_{j}(x)}\ell(x,y)-R\right\|_{2}^{2}
qj(x,y)(p(x,y)2qj(x,y)2(x,y)22+R22)\displaystyle\leq q_{j}(x,y)\left(\frac{p(x,y)^{2}}{q_{j}(x,y)^{2}}\|\ell(x,y)\|_{2}^{2}+\|R\|_{2}^{2}\right)
p(y|x)(1ϵ2(x,y)22+R22)=h(x,y).\displaystyle\leq p(y|x)\left(\frac{1}{\epsilon^{2}}\|\ell(x,y)\|_{2}^{2}+\|R\|_{2}^{2}\right)=h(x,y).

It is straightforward to show that x𝒬jh(x,y)dy<\int\sum_{x\in\mathcal{Q}_{j}}h(x,y)\,\mathrm{d}y<\infty using the boundedness of (x,y)\ell(x,y) (see Definition 1). Thus we have VjVV_{j}\to V_{\infty} by the dominated convergence theorem. ∎

We can now prove Proposition 4 by showing that the conditions of Corollary 1 hold. We focus on the case of a deterministic oracle—the proof for a stochastic oracle follows by a similar argument.

First we examine the support of the sequence of proposals. At stage tt, the proposal can be expressed as

qt(x)=vt(x)x𝒳vt(x)with\displaystyle q_{t}(x)=\frac{v_{t}(x)}{\sum_{x\in\mathcal{X}}v_{t}(x)}\quad\text{with}
vt(x)=p(x)max{Dg(R^t)(x,y)2,ϵt𝟏(x,y)0}πt(y|x)dy.\displaystyle v_{t}(x)=p(x)\int\max\left\{\|\mathrm{D}_{g}(\hat{R}_{t})\,\ell(x,y)\|_{2},\epsilon_{t}\mathbf{1}_{\|\ell(x,y)\|\neq 0}\right\}\,\pi_{t}(y|x)\,\mathrm{d}y.

Observe that

vt(x)ϵtp(x)𝟏(x,y)0πt(y|x)dyv_{t}(x)\geq\epsilon_{t}\,p(x)\int\mathbf{1}_{\|\ell(x,y)\|\neq 0}\,\pi_{t}(y|x)\,\mathrm{d}y

and

vt(x)p(x){ϵt+Dg(R^t)2(x,y)2}πt(y|x)dyp(x)(ϵt+d2Ksupx,y𝒳×𝒴(x,y))Cp(x)\begin{split}v_{t}(x)&\leq p(x)\int\{\epsilon_{t}+\|\mathrm{D}_{g}(\hat{R}_{t})\|_{2}\,\|\ell(x,y)\|_{2}\}\pi_{t}(y|x)\,\mathrm{d}y\\ &\leq p(x)\left(\epsilon_{t}+d^{2}K\sup_{x,y\in\mathcal{X}\times\mathcal{Y}}\,\|\ell(x,y)\|_{\infty}\right)\\ &\leq Cp(x)\end{split}

where C<C<\infty is a constant. The upper bound follows from the boundedness of (x,y)\ell(x,y) (see Definition 1), the boundedness of ϵt\epsilon_{t}, and the boundedness of the Jacobian. Since

x𝒳vt(x)ϵtx𝒳p(x)𝟏(x,y)0πt(y|x)dy>0\sum_{x\in\mathcal{X}}v_{t}(x)\geq\epsilon_{t}\sum_{x\in\mathcal{X}}p(x)\int\mathbf{1}_{\|\ell(x,y)\|\neq 0}\pi_{t}(y|x)\,\mathrm{d}y>0

by assumption and vt(x)v_{t}(x) is bounded from above, we conclude that qt(x)q_{t}(x) is a valid distribution for all t0t\geq 0. The lower bound on vt(x)v_{t}(x) implies that the support of qt(x,y)=qt(x)p(y|x)q_{t}(x,y)=q_{t}(x)p(y|x) is

{(x,y)𝒳×𝒴:p(x,y)πt(y|x)(x,y)0}\displaystyle\{(x,y)\in\mathcal{X}\times\mathcal{Y}:p(x,y)\pi_{t}(y|x)\|\ell(x,y)\|\neq 0\}
{(x,y)𝒳×𝒴:p(x,y)(x,y)0}.\displaystyle\supseteq\{(x,y)\in\mathcal{X}\times\mathcal{Y}:p(x,y)\|\ell(x,y)\|\neq 0\}.

The inequality follows from the fact that the support of πt(y|x)\pi_{t}(y|x) includes the support of p(y|x)=𝟏y=y(x)p(y|x)=\mathbf{1}_{y=y(x)}. Thus qt(x)q_{t}(x) has the required support for all t0t\geq 0.

Next we prove that the sequence of proposals converges a.s. pointwise in xx. Given that R^ta.s.R^\hat{R}_{t}\overset{\mathrm{a.s.}}{\to}\hat{R}_{\infty} and πt(y|x)a.s.π(y|x)\pi_{t}(y|x)\overset{\mathrm{a.s.}}{\to}\pi_{\infty}(y|x) pointwise in xx, one can show by application of the continuous mapping theorem and dominated convergence theorem that

vt(x)a.s.v(x)=p(x)Dg(R)(x,y)2π(y|x)dyv_{t}(x)\overset{\mathrm{a.s.}}{\to}v_{\infty}(x)=p(x)\int\|\mathrm{D}_{g}(R_{\infty})\,\ell(x,y)\|_{2}\pi_{\infty}(y|x)\,\mathrm{d}y

pointwise in xx. By application of the continuous mapping theorem, we then have that qt(x)a.s.q(x)=v(x)x𝒳v(x)q_{t}(x)\overset{\mathrm{a.s.}}{\to}q_{\infty}(x)=\frac{v_{\infty}(x)}{\sum_{x\in\mathcal{X}}v_{\infty}(x)}. Thus Theorems 1 and 2 hold. Furthermore, if R^=R\hat{R}_{\infty}=R and π(y|x)=𝟏y=y(x)\pi_{\infty}(y|x)=\mathbf{1}_{y=y(x)}, then q(x)q_{\infty}(x) is equal to the asymptotically-optimal proposal q(x)q^{\star}(x) as defined in (5). ∎

Appendix F Inference for the Dirichlet-tree model in Section 4

In this appendix, we outline a procedure for inferring the oracle response based on the hierarchical model presented in Section 4. Recall that the model assumes a deterministic response—i.e. there is only one possible response (label) yy for a given input xx. At stage tt of the evaluation process (see Algorithm 1), the labels for instances in the test pool are partially-observed. To estimate the response for the unobserved instances, we can apply the expectation-maximization (EM) algorithm. Omitting the dependence on tt, we let 𝐲(𝐨)\mathbf{y}_{(\mathbf{o})} denote the observed labels and 𝐲(¬𝐨)\mathbf{y}_{(\neg\mathbf{o})} denote the unobserved labels. The EM algorithm returns a distribution over the unobserved labels 𝐲(¬𝐨)\mathbf{y}_{(\neg\mathbf{o})} and MAP estimates of the model parameters ϕ=(θ,ψ)\mathbf{\phi}=(\theta,\psi). At each iteration τ\tau of the EM algorithm, the following two steps are applied:

  • E-step. Compute the function

    (8) Q(ϕ|ϕ(τ))=𝔼𝐲(¬𝐨)|𝐲𝐨,𝐤,ϕ(τ)(logp(ϕ|𝐲,𝐤)),Q(\mathbf{\phi}|\mathbf{\phi}^{(\tau)})=\mathbb{E}_{\mathbf{y}_{(\neg\mathbf{o})}|\mathbf{y}_{\mathbf{o}},\mathbf{k},\mathbf{\phi}^{(\tau)}}\left(\log p(\mathbf{\phi}|\mathbf{y},\mathbf{k})\right),

    which is the expected log posterior with respect to the current distribution over the unobserved labels 𝐲(¬𝐨)\mathbf{y}_{(\neg\mathbf{o})}, conditional on the observed labels 𝐲(𝐨)\mathbf{y}_{(\mathbf{o})} and the current parameter estimates ϕ(τ)\mathbf{\phi}^{(\tau)}.

  • M-step. Update the parameter estimates by maximizing QQ:

    (9) ϕ(τ+1)argmaxϕQ(ϕ|ϕ(τ)).\mathbf{\phi}^{(\tau+1)}\in\arg\max_{\mathbf{\phi}}Q(\mathbf{\phi}|\mathbf{\phi}^{(\tau)}).

In order to implement the E- and M-steps, we must evaluate the QQ function for our model. Since the Dirichlet prior on θ\theta and Dirichlet-tree priors on ϕy\phi_{y} are conjugate to the categorical distribution, the posterior p(ϕ|𝐲,𝐤)p(\mathbf{\phi}|\mathbf{y},\mathbf{k}) is straightforward to compute. We have

θ|𝐲,α\displaystyle\theta|\mathbf{y},\alpha Dirichlet(α~),\displaystyle\sim\operatorname{Dirichlet}(\tilde{\alpha}),
ψy|𝐲,𝐤,βy,T\displaystyle\psi_{y}|\mathbf{y},\mathbf{k},\beta_{y},T DirichletTree(β~y,T),\displaystyle\sim\operatorname{DirichletTree}(\tilde{\beta}_{y},T),

where α~y=αy+i=1M𝟏yi=y\tilde{\alpha}_{y}=\alpha_{y}+\sum_{i=1}^{M}\mathbf{1}_{y_{i}=y}, β~yν=βyν+i=1M𝟏yi=yδν(ki)\tilde{\beta}_{y\nu}=\beta_{y\nu}+\sum_{i=1}^{M}\mathbf{1}_{y_{i}=y}\delta_{\nu}(k_{i}) and δν(k)\delta_{\nu}(k) is defined in (16). The posterior density for θ\theta is

p(θ|𝐲,α)y=1Cθyα~y1.p(\theta|\mathbf{y},\alpha)\propto\prod_{y=1}^{C}\theta_{y}^{\tilde{\alpha}_{y}-1}.

Minka (1999) gives the density for ψy\psi_{y} as:

p(ψy|𝐲,𝐤,βy,T)klv(T)ψykβ~yk1νin(T)(klv(T)νchildren(ν)δν(k)ψyk)γyν,\displaystyle p(\psi_{y}|\mathbf{y},\mathbf{k},\mathbf{\beta}_{y},T)\propto\prod_{k\in\mathrm{lv}(T)}\psi_{yk}^{\tilde{\beta}_{yk}-1}\prod_{\nu\in\mathrm{in}(T)}\left(\sum_{k\in\mathrm{lv}(T)}\sum_{\nu^{\prime}\in\mathrm{children}(\nu)}\delta_{\nu^{\prime}}(k)\psi_{yk}\right)^{\gamma_{y\nu}},

where lv(T)\mathrm{lv}(T) denotes the set of leaf nodes in TT, in(T)\mathrm{in}(T) denotes the set of inner nodes in TT, lv(ν)\mathrm{lv}(\nu) denotes the leaf nodes reachable from node ν\nu, and γ~yν=β~yνcchildren(ν)β~yc\tilde{\gamma}_{y\nu}=\tilde{\beta}_{y\nu}-\sum_{c\in\mathrm{children}(\nu)}\tilde{\beta}_{yc}.

Substituting the posterior densities in (8), we have

Q(ϕ|ϕ(t))\displaystyle Q(\mathbf{\phi}|\mathbf{\phi}^{(t)}) =y𝒴klv(T)(β~yk(τ)1)logψky+y𝒴νin(T)γ~yν(τ)log(klv(T)νchildren(ν)δν(k)ψyk)\displaystyle=\sum_{y\in\mathcal{Y}}\sum_{k\in\mathrm{lv(T)}}\left(\tilde{\beta}_{yk}^{(\tau)}-1\right)\log\psi_{ky}+\sum_{y\in\mathcal{Y}}\sum_{\nu\in\mathrm{in}(T)}\tilde{\gamma}_{y\nu}^{(\tau)}\log\left(\sum_{k\in\mathrm{lv}(T)}\sum_{\nu^{\prime}\in\mathrm{children}(\nu)}\delta_{\nu^{\prime}}(k)\psi_{yk}\right)
+y𝒴(α~y(τ)1)logθy+const.\displaystyle\qquad{}+\sum_{y\in\mathcal{Y}}\left(\tilde{\alpha}_{y}^{(\tau)}-1\right)\log\theta_{y}+\mathrm{const}.

where we define β~yk(τ)=𝔼𝐲(¬𝐨)|𝐲𝐨,𝐤,ϕ(τ)[β~yk]\tilde{\beta}_{yk}^{(\tau)}=\mathbb{E}_{\mathbf{y}_{(\neg\mathbf{o})}|\mathbf{y}_{\mathbf{o}},\mathbf{k},\mathbf{\phi}^{(\tau)}}[\tilde{\beta}_{yk}] and similarly for α~y(τ)\tilde{\alpha}_{y}^{(\tau)} and γ~yν(τ)\tilde{\gamma}_{y\nu}^{(\tau)}. When maximizing Q(ϕ|ϕ(t))Q(\mathbf{\phi}|\mathbf{\phi}^{(t)}) with respect to ϕ\mathbf{\phi}, we must obey the constraints:

  • θy>0\theta_{y}>0 for all y𝒴y\in\mathcal{Y},

  • y𝒴θy=1\sum_{y\in\mathcal{Y}}\theta_{y}=1,

  • ψyk>0\psi_{yk}>0 for all y𝒴y\in\mathcal{Y} and leaf nodes klv(T)k\in\mathrm{lv}(T), and

  • klv(T)ψyk=1\sum_{k\in\mathrm{lv}(T)}\psi_{yk}=1.

We can maximize θ\theta and {ψy}\{\psi_{y}\} separately since they are independent. For θy\theta_{y} we have the mode of a Dirichlet random variable:

θy(τ+1)=α~y(τ)1y{α~y(τ)1}\theta_{y}^{(\tau+1)}=\frac{\tilde{\alpha}_{y}^{(\tau)}-1}{\sum_{y^{\prime}}\{\tilde{\alpha}_{y^{\prime}}^{(\tau)}-1\}}

and for ψy\psi_{y} we have (see Minka, 1999):

(10) ψyk(τ+1)=νin(T)cchildren(ν)(byc(τ+1))δc(k)\displaystyle\psi_{yk}^{(\tau+1)}=\prod_{\nu\in\mathrm{in}(T)}\prod_{c\in\mathrm{children}(\nu)}\left(b_{yc}^{(\tau+1)}\right)^{\delta_{c}(k)}
(11) wherebyc(τ+1)=β~yc(τ)klv(T)δc(k)csiblings(c){c}{β~yc(τ)klv(T)δc(k)}.\displaystyle\text{where}\quad b_{yc}^{(\tau+1)}=\frac{\tilde{\beta}_{yc}^{(\tau)}-\sum_{k\in\mathrm{lv}(T)}\delta_{c}(k)}{\sum_{c^{\prime}\in\mathrm{siblings}(c)\cup\{c\}}\{\tilde{\beta}_{yc^{\prime}}^{(\tau)}-\sum_{k\in\mathrm{lv}(T)}\delta_{c^{\prime}}(k)\}}.

The parameters {byc:cchildren(ν)}\{b_{yc}:c\in\mathrm{children}(\nu)\} may be interpreted as branching probabilities for node νin(T)\nu\in\mathrm{in}(T).

In summary, the EM algorithm reduces to the following two steps:

  • E-step. Compute the expected value for each unobserved label using ϕ(τ)\mathbf{\phi}^{(\tau)}:

    (12) 𝔼[𝟏yj=y|kj,ϕ(τ)]=ψykj(τ)θy(τ)y𝒴ψykj(τ)θy(τ).\mathbb{E}[\mathbf{1}_{y_{j}=y}|k_{j},\mathbf{\phi}^{(\tau)}]=\frac{\psi_{yk_{j}}^{(\tau)}\theta_{y}^{(\tau)}}{\sum_{y^{\prime}\in\mathcal{Y}}\psi_{y^{\prime}k_{j}}^{(\tau)}\theta_{y^{\prime}}^{(\tau)}}.

    Then make a backward pass through the tree, computing β~yν(τ)\tilde{\beta}_{y\nu}^{(\tau)} at each internal node νin(T)\nu\in\mathrm{in}(T).

  • M-step. Make a forward-pass through the tree, updating the branch probabilities byc(τ+1)b_{yc}^{(\tau+1)} using (11). Compute ψy(τ+1)\psi_{y}^{(\tau+1)} at the same time using (10).

We can interpret (12) as providing a posterior estimate for the unknown oracle response:

(13) π(y|x)ψykx(τ)θy(τ)\pi(y|x)\propto\psi_{yk_{x}}^{(\tau)}\theta_{y}^{(\tau)}

where kxk_{x} denotes the assigned stratum for instance xx. If the response y(x)y(x) for instance xx has been observed in a previous stage of the evaluation process, then π(y|x)\pi(y|x) collapses to a point mass at y(x)y(x).

Appendix G Proof of Proposition 1

Let πt(y|x)\pi_{t}(y|x) denote the posterior estimate of the (determinisitc) oracle response at stage tt, as defined in (13). First we must ensure that the support of πt(y|x)\pi_{t}(y|x) includes the true label y(x)y(x) for all t0t\geq 0. This condition is satisfied since the priors on θ\theta and ψ\psi ensure θy>0\theta_{y}>0 and ψky>0\psi_{ky}>0 for all k{1,,K}k\in\{1,\ldots,K\} and y𝒴y\in\mathcal{Y} (see 12). Once the label for instance xx is observed, the posterior degenerates to a point mass at the true value y(x)y(x). This also implies that the sequence of proposals have the necessary support to ensure Theorem 2 holds.

Second, we verify that πt(y|x)\pi_{t}(y|x) converges a.s. pointwise in xx and yy to a conditional pmf π(y|x)\pi_{\infty}(y|x) independent of tt. This condition is also satisfied, since the posterior πt(y|x)\pi_{t}(y|x) degenerates to a point mass at y(x)y(x) once the label for instance xx is observed, and all instances are observed in the limit tt\to\infty. Thus πt(y|x)a.s.𝟏y=y(x)\pi_{t}(y|x)\overset{\mathrm{a.s.}}{\to}\mathbf{1}_{y=y(x)} and ϵt0\epsilon_{t}\downarrow 0, which implies that the sequence of proposals converges to the asymptotically-optimal proposal. ∎

Appendix H Extension of the Dirichlet-tree model to stochastic oracles

Recall that the Dirichlet-tree model in Section 4 is tailored for a deterministic oracle—where p(y|x)p(y|x) is a point mass at a single label y(x)y(x). In this appendix, we extend the model to handle stochastic oracles, where p(y|x)p(y|x) has support on multiple labels in general. The model retains the same structure, however we no longer assume there is a discrete set of items whose labels are either observed or unobserved. Instead, we allow for a potentially infinite number of item-label pairs (indexed by jj below) to be generated. In reality, these pairs correspond to labelled items drawn uniformly at random from the test pool 𝒯\mathcal{T}. Since estimating the oracle response for individual instances requires estimating a large number of continuous parameters (|𝒯|×(|𝒴|1)|\mathcal{T}|\times(|\mathcal{Y}|-1) parameters), we instead opt to estimate the response averaged over instances at leaf nodes of the tree (K×(|𝒴|1)K\times(|\mathcal{Y}|-1) parameters for KK leaf nodes). We refer to this as the leaf-level oracle response pleaf(y|k)p_{\mathrm{leaf}}(y|k) below.

Model specification

For clarity, we reintroduce the model here despite some overlap with Section 4. The oracle response is modeled globally using a pmf θ=[θ1,,θC]\theta=[\theta_{1},\ldots,\theta_{C}] over the label space 𝒴\mathcal{Y} with a Dirichlet prior:

θ|αDirichlet(α),\theta|\alpha\sim\operatorname{Dirichlet}\!\left(\alpha\right),

where α=[α1,,αC]+C\alpha=[\alpha_{1},\ldots,\alpha_{C}]\in\mathbb{R}_{+}^{C} are concentration hyperparameters. The label yjy_{j} for each instance jj is then assumed to be generated i.i.d. according to θ\theta:

yj|θ\displaystyle y_{j}|\theta iid.Categorical(θ),\displaystyle\overset{\mathrm{iid.}}{\sim}\operatorname{Categorical}\!\left(\theta\right), j1,,J.\displaystyle j\in 1,\ldots,J.

We assume a hierarchical partition of the space 𝒳\mathcal{X} is given, encoded as a tree TT. Given TT and label yjy_{j}, we assume instance jj is assigned to a leaf node kj{1,,K}k_{j}\in\{1,\ldots,K\} of TT according to a distribution ψy\psi_{y} with a Dirichlet-tree prior:

ψy|βy,T\displaystyle\psi_{y}|\beta_{y},T ind.DirichletTree(βy;T),\displaystyle\overset{\mathrm{ind.}}{\sim}\operatorname{DirichletTree}\!\left(\beta_{y};T\right), y𝒴,\displaystyle y\in\mathcal{Y},
kj|yj,ψyj\displaystyle k_{j}|y_{j},\psi_{y_{j}} ind.Categorical(ψyj),\displaystyle\overset{\mathrm{ind.}}{\sim}\operatorname{Categorical}\!\left(\psi_{y_{j}}\right), j1,,J.\displaystyle j\in 1,\ldots,J.

where βy\beta_{y} is a set of Dirichlet concentration parameters associated with the internal nodes of TT.

Inference

To estimate the leaf-level oracle response pleaf(y|k)p_{\mathrm{leaf}}(y|k), we use the posterior predictive distribution

p(yj|kj,)=ψθp(yj|kj,ψ,θ)p(ψ,θ|),p(y_{j}|k_{j},\mathcal{L})=\int_{\psi}\int_{\theta}p(y_{j}|k_{j},\psi,\theta)p(\psi,\theta|\mathcal{L}),

which encodes our uncertainty about the oracle response yjy_{j} for a query instance xjx_{j} from stratum kjk_{j} conditional on the previously observed samples \mathcal{L}. If the observed samples \mathcal{L} were collected through unbiased sampling (as assumed in the model above), we would compute the posterior predictive distribution as follows:

(14) p(yj|kj,)θp(yj|θ)p(θ|)ψp(kj|yj,ψ)p(ψ|)θθyjp(θ|)ψψyj,kjp(ψ|)α~yj×νin(T)cchildren(ν)(β~yjccchildren(ν)β~yjc)δc(kj)\begin{split}p(y_{j}|k_{j},\mathcal{L})&\propto\int_{\theta}p(y_{j}|\theta)p(\theta|\mathcal{L})\int_{\psi}p(k_{j}|y_{j},\psi)p(\psi|\mathcal{L})\\ &\propto\int_{\theta}\theta_{y_{j}}p(\theta|\mathcal{L})\int_{\psi}\psi_{y_{j},k_{j}}p(\psi|\mathcal{L})\\ &\propto\tilde{\alpha}_{y_{j}}\times\prod_{\nu\in\mathrm{in}(T)}\prod_{c\in\mathrm{children}(\nu)}\left(\frac{\tilde{\beta}_{y_{j}c}}{\sum_{c^{\prime}\in\mathrm{children}(\nu)}\tilde{\beta}_{y_{j}c^{\prime}}}\right)^{\delta_{c}(k_{j})}\end{split}

where

(15) α~y=αy+(x,y,w)𝟏y=y,β~yc=βyc+(x,y,w)𝟏y=yδc(kx),\begin{split}\tilde{\alpha}_{y}&=\alpha_{y}+\sum_{(x^{\prime},y^{\prime},w^{\prime})\in\mathcal{L}}\mathbf{1}_{y^{\prime}=y},\\ \tilde{\beta}_{yc}&=\beta_{yc}+\sum_{(x^{\prime},y^{\prime},w^{\prime})\in\mathcal{L}}\mathbf{1}_{y^{\prime}=y}\delta_{c}(k_{x^{\prime}}),\end{split}

in(T)\mathrm{in}(T) denotes the inner nodes of TT, children(ν)\mathrm{children}(\nu) denotes the children of node ν\nu, kx{1,,K}k_{x}\in\{1,\ldots,K\} denotes the leaf index of instance xx and

(16) δν(k):={1,if node ν is traversed to reach leaf node k,0,otherwise.\delta_{\nu}(k):=\begin{cases}1,&\text{if node $\nu$ is traversed to reach leaf node $k$},\\ 0,&\text{otherwise.}\end{cases}

In words, the posterior parameters are updated by adding a count of ‘1’ for every observation with label yy that is reachable from node ν\nu in the tree TT. However, since the observed samples \mathcal{L} are biased in our application, we must apply a bias-correction to (15):

α~y\displaystyle\tilde{\alpha}_{y} =αy+(x,y,w)w𝟏y=y,\displaystyle=\alpha_{y}+\sum_{(x^{\prime},y^{\prime},w^{\prime})\in\mathcal{L}}w^{\prime}\mathbf{1}_{y^{\prime}=y},
β~yc\displaystyle\tilde{\beta}_{yc} =βyc+(x,y,w)w𝟏y=yδc(kx).\displaystyle=\beta_{yc}+\sum_{(x^{\prime},y^{\prime},w^{\prime})\in\mathcal{L}}w^{\prime}\mathbf{1}_{y^{\prime}=y}\delta_{c}(k_{x^{\prime}}).

This guarantees that α~yNa.s.𝔼[𝟏Y=y]\frac{\tilde{\alpha}_{y}}{N}\overset{\mathrm{a.s.}}{\to}\mathbb{E}[\mathbf{1}_{Y=y}] and β~ycNa.s.𝔼[𝟏Y=yδc(kX)]\frac{\tilde{\beta}_{yc}}{N}\overset{\mathrm{a.s.}}{\to}\mathbb{E}[\mathbf{1}_{Y=y}\delta_{c}(k_{X})].

Proposition 0.

Consider an instantiation of our evaluation framework (see Algorithm 1) under a stochastic oracle where:

  • the oracle response for instance xx is estimated online using the posterior predictive (14) for the leaf node kxk_{x} in which xx resides;

  • the proposals are adapted using the estimator defined in Proposition 4 with ϵt=ϵ0/(t+1)\epsilon_{t}=\epsilon_{0}/(t+1) for some user-specified ϵ0>0\epsilon_{0}>0; and

  • R^t=1Mi=1My𝒴p^t(y|xi)(xi,y).\hat{R}_{t}=\frac{1}{M}\sum_{i=1}^{M}\sum_{y\in\mathcal{Y}}\hat{p}_{t}(y|x_{i})\ell(x_{i},y).

Then Theorems 1 and 2 hold.

Proof.

Let p^t(y|x)\hat{p}_{t}(y|x) denote the posterior predictive in evaluated at stage tt. First we must ensure that the support of p^t(y|x)\hat{p}_{t}(y|x) is a superset of the the support of p(y|x)p(y|x). This condition is satisfied since the priors on θ\theta and ψ\psi ensure that p^t(y|x)>0\hat{p}_{t}(y|x)>0 for all y𝒴y\in\mathcal{Y} (see 14). This implies that the proposals {qt(x)}\{q_{t}(x)\} have the necessary support to ensure that Theorem 1 is satisfied.

Second, we must ensure that p^t(y|x)\hat{p}_{t}(y|x) converges a.s. pointwise in xx and yy to a conditional pmf p^(y|x)\hat{p}_{\infty}(y|x) independent of tt. This condition is also satisfied, since the posterior parameters α~y/N\tilde{\alpha}_{y}/N and β~yc/N\tilde{\beta}_{yc}/N converge a.s. to constants by Theorem 1 (see the text preceding the statement of this proposition). ∎

Appendix I Unsupervised partitioning methods

The models for the oracle response described in Section 4 require that the test pool be hierarchically partitioned into blocks. Ideally, the partition should be selected so that the oracle distribution p(y|x)p(y|x) is approximately constant within each block. Since we begin without any labels, we are restricted to unsupervised partitioning methods. We briefly describe two settings for learning partitions: (i) where classifier scores are used as a proxy for p(y|x)p(y|x) and (ii) where feature vectors are used.

Score-based methods.

Our partitioning problem can be tackled using stratification methods studied in the survey statistics community (Cochran, 1977). The aim of stratification is to partition a population into roughly homogenous subpopulations, known as strata (or blocks) for the purpose of variance reduction. When an auxiliary variable is observed that is correlated with the statistic of interest, the strata may be defined as a partition of the range of the auxiliary variable into sub-intervals. Various methods are used for determining the sub-intervals, including the cumulative square-root frequency (CSF) method (Dalenius and Hodges, 1959) and the geometric method (Gunning and Horgan, 2004). In our application, the classifier scores are the auxiliary variables and the statistic of interest is p(y=1|x)p(y=1|x) (for binary classification).

Stratification is conventionally used to produce a partition without hierarchical structure. However, if the strata are ordered, it is straightforward to “fill in” hierarchical structure. In our experiments, we specify the desired tree structure—e.g. an ordered binary tree of a particular depth. We then compute the stratum bins (sub-intervals) so that the number of bins matches the number of leaf nodes in the tree. The stratum bins are then associated with the leaf nodes of the tree in breadth-first order.

Feature-based methods.

When scores are not available, it is natural to consider unsupervised clustering methods which operate on the feature vectors. We expect p(y|x)p(y|x) to be roughly constant within a cluster, since neighboring points in the feature space typically behave similarly. Reddy and Vinzamuri (2014) reviews hierarchical clustering methods including agglomerative and divisive clustering. One disadvantage of clustering methods, is that they tend to scale quadratically with the number of data points. A more efficient alternative is the kk-d tree (Bentley, 1975).

Appendix J Additional convergence plots

Refer to caption
Figure 5. MSE of estimated F1 score over 1000 repeats as a function of consumed label budget. 95% bootstrap confidence intervals are included.
F1 score.

We provide additional convergence plots for estimating F1 score in Figure 5, which cover six of the seven datasets listed in Table 3. The convergence plot for abt-buy is featured in the main paper in Figure 2. The biased sampling methods (Ours-8, Ours-1, OASIS and IS) converge significantly more rapidly than Passive and Stratified for five of the six datasets. tweets100k is an exception because it is the only dataset with well-balanced classes. Of the biased sampling methods, Ours-8 performs best on two of the six datasets (amzn-goog and restaurant) and equal-best on one (safedriver). Ours-1 performs best on one dataset (dblp-acm) and equal-best on one (safedriver), while IS performs best on one dataset (creditcard). In general, we expect IS to perform well when the oracle response p(y|x)p(y|x) is already well-approximated by the model under evaluation. When this is not the case, the adaptive methods are expected to perform best as they produce refined estimates of p(y|x)p(y|x) during sampling.

Accuracy.

We repeated the experiments described in Section 5 for estimating accuracy. Although accuracy is not recommended for evaluating classifiers in the presence of class imbalance, it is interesting to see whether the biased sampling methods offer any improvement in label efficiency, given the discussion in Section 2.3. Figures 6(a) and 7 present convergence plots for the seven datasets listed in Table 3. A summary of the MSE for all datasets is included in Figure 6(b) assuming a label budget of 1000. OASIS does not appear in these results, as it does not support estimation of accuracy.

We find that gains in label efficiency are less pronounced for the biased sampling methods when estimating accuracy. This is to be expected as accuracy is less sensitive to class imbalance, as noted in Section 2.3. However, there is still a marked improvement in the convergence rate—by an order of magnitude or more—for three of the datasets (abt-buy, dblp-acm and restaurant). Again, we find that the more imbalanced datasets (abt-buy, amzn-goog, dblp-acm and restaurant) seem to benefit most from biased sampling.

Precision-recall curves.

Figure 8 presents convergence plots for estimating precision-recall curves for two of the test pools in Table 3 assuming a label budget of 5000. We find that the biased sampling methods (Ours-8, Ours-1 and IS) offer a significant improvement in the MSE compared to Passive and Stratified—by 1–2 orders of magnitude. The difference in the MSE between the AIS-based methods and IS is less pronounced here than when estimating F1-score.

(a)Refer to caption (b)Refer to caption

Figure 6. (a) Convergence plot for estimating accuracy on abt-buy over 1000 repeats. The upper panel plots the KL divergence from the proposal to the asymptotically-optimal one. The lower panel plots the MSE of the estimate for accuracy. 95% bootstrap confidence intervals are included. (b) MSE of the estimate for accuracy after 1000 label queries. 95% bootstrap confidence intervals are shown in black.
Refer to caption
Figure 7. Convergence plots for estimating accuracy over 1000 repeats. The upper panel in each sub-figure plots the KL divergence from the proposal to the asymptotically optimal one. The lower panel plots the MSE of the estimate for accuracy. 95% bootstrap confidence intervals are included.
Refer to caption
Figure 8. Convergence plots for estimating a precision-recall curve for abt-buy and dblp-acm over 1000 repeats. The upper panel in each sub-figure plots the KL divergence from the proposal to the asymptotically optimal one. The lower panel plots the total MSE of the precision and recall estimates at each threshold. 95% bootstrap confidence intervals are included.