
Active Classification with Uncertainty Comparison Queries

Zhenghang Cui$^{1,2}$   Issei Sato$^{1,2}$
$^{1}$ The University of Tokyo, Japan
$^{2}$ RIKEN Center for Advanced Intelligence Project, Japan
{cui, sato}@g.ecc.u-tokyo.ac.jp
Abstract

Noisy pairwise comparison feedback has been incorporated to improve the overall query complexity of interactively learning binary classifiers. The positivity comparison oracle provides feedback on which of a pair of data points is more likely to be positive. Because it is impossible to infer accurate labels using this oracle alone without knowing the classification threshold, existing methods still rely on the traditional explicit labeling oracle, which directly answers the label of a given data point: they sort all data points and use the explicit labeling oracle to find the classification threshold. The current methods, however, have two drawbacks: (1) they need unnecessary sorting for label inference; (2) quick sort is naively adapted to noisy feedback, which negatively affects practical performance. To avoid this inefficiency and acquire information about the classification threshold, we propose a new pairwise comparison oracle concerning uncertainties. This oracle receives two data points as input and answers which one has higher uncertainty. We then propose an efficient adaptive labeling algorithm using the proposed oracle and the positivity comparison oracle. In addition, we address the situation where the labeling budget is insufficient compared to the dataset size, which can be dealt with by plugging the proposed algorithm into an active learning algorithm. Furthermore, we confirm the feasibility of the proposed oracle and the performance of the proposed algorithm theoretically and empirically.

1 Introduction

Active learning studies interactive algorithms that can achieve the same generalization ability as passive learning with more efficient query complexity. It is known that active learning can achieve exponential improvement over passive learning under certain conditions [14]. However, this improvement does not always hold in more general situations. Consequently, active learning methods have been developed by imposing assumptions on the underlying data distribution and the target concept, or by designing different forms of oracles that can better take advantage of the knowledge of annotators.

This paper focuses on the latter approach, specifically on methods incorporating the positivity comparison oracle into active learning. This oracle has high practicality in applications, as it has already been extensively used in other fields of machine learning, such as preference learning [7, 11]. It is obvious that using feedback from only this oracle, we can at most sort all unlabeled data points according to their positivity, or class-posterior probabilities. Without knowing the location of the classification threshold to some extent, we cannot infer labels with accuracy guarantees. Therefore, existing methods [16, 31] still need to access the explicit labeling oracle to infer labels.

Among the existing methods, the one by Kane et al. [16] takes a geometric approach and thus has a dimension-dependent query complexity. Moreover, it only considers the noise-free setting of oracles, which limits its practicality. We therefore focus on the other method, by Xu et al. [31], which shares a similar problem setting with this paper. For $n$ unlabeled data points, the main idea of Xu et al. [31] is (1) to conduct quick sort using $\mathcal{O}(n\log n)$ queries to the positivity comparison oracle, treating the feedback as if it were noise-free; and (2) to conduct binary search to locate the classification threshold using $\mathcal{O}(\log n)$ queries to the explicit labeling oracle. We note that knowing the positivity comparison order of all data points is unnecessary for the goal of label inference. Given an unlabeled data point $x$, we are only interested in the relationship between $p(y=+1|x)$ and the classification threshold $0.5$, not the relationship between $p(y=+1|x)$ and the class-posterior probabilities of other data points. On the other hand, knowing all class labels cannot reconstruct the sorted list of class-posterior probabilities. We recognize that sorting at least a subset of data points is inevitable when collaborating with the explicit labeling oracle without information about the classification threshold. Although it is feasible to improve the naive quick sort approach of the existing method, we choose to pose a question at a higher level:

"What form of feedback can efficiently provide (indirect) information about the classification threshold for the positivity comparison oracle?"

With this desired form of feedback, we would be able to save positivity comparison queries that would otherwise be spent on unnecessary sorting. Note that this question is fundamentally motivated by Vapnik's principle [29]:

If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step.

In this paper, our problem is the lack of information about the classification threshold. To this end, we propose a new form of oracle, the uncertainty comparison oracle, which asks annotators to compare the uncertainties of a pair of data points. We assume that higher uncertainty indicates being closer to the classification threshold. Properly using this new oracle, we can efficiently select the set of data points with high uncertainties, namely the set of data points that appear closest to the classification threshold. Then, using this selected set as a delegation (a proxy or an approximation) of the unknown classification threshold, we can infer labels with accuracy guarantees for data points outside this selected set, which constitute the majority of unlabeled data points. Notably, the explicit labeling oracle is no longer needed, owing to its inferior compatibility with pairwise comparisons.

In summary, for the problem of interactive labeling with access to the positivity comparison oracle, our contributions are three-fold:

  • We propose a novel pairwise comparison oracle that compares the uncertainty of two unlabeled data points.

  • We propose a feasible labeling algorithm that effectively accesses the aforementioned two kinds of pairwise comparison oracles, as well as its applications under different query budgets.

  • We establish an error rate bound for the proposed algorithm and generalization error bounds for its applications, and confirm their empirical performance.

2 Labeling with pairwise comparisons

In this section, we introduce the two comparison oracles and the proposed labeling algorithm.

2.1 Preliminaries

We consider the binary classification problem. Let $\mathcal{X}\subset\mathbb{R}^d$ denote the $d$-dimensional sample space and $\mathcal{Y}=\{+1,-1\}$ denote the binary label space. Let $\mathcal{P}_{\mathcal{X}\mathcal{Y}}$ denote the underlying data distribution over $\mathcal{X}\times\mathcal{Y}$ and $\eta(x)\triangleq p(Y=+1|X=x)$ denote the underlying conditional probability of a data point $x$ being positive. Then $h^*(x)\triangleq\mathrm{sign}(\eta(x)-0.5)$ is the Bayes classifier minimizing the classification risk $R(f)\triangleq\mathbb{E}_{(x,y)\sim\mathcal{P}_{\mathcal{X}\mathcal{Y}}}[\mathbb{1}_{f(x)\neq y}]$ over classifiers $f:\mathcal{X}\rightarrow\mathcal{Y}$. In this problem setting, we are given only data points drawn from $\mathcal{P}_{\mathcal{X}}$, the marginal distribution over $\mathcal{X}$. Without access to the class labels $\{h^*(x)\,|\,x\in\mathcal{X}\}$, we query the following two oracles for essential information.

2.2 Two pairwise comparison oracles

Positivity comparison oracle

This oracle receives two data points as input and answers whether the first data point has a higher probability of being positive. The answer "$+1$" means "yes" and "$-1$" means "no". This is a common oracle that has been used in many different fields, such as interactive classification [16, 31] and preference learning [7, 11]. We denote this oracle by $O_1:\mathcal{X}\times\mathcal{X}\rightarrow\{+1,-1\}$ and define it with the following noise condition.

Condition 1.

Distribution $\mathcal{P}_{\mathcal{X}\mathcal{Y}}$ and oracle $O_1$ satisfy this condition with noise parameter $\epsilon_1\geq 0$ if $\mathbb{E}_{x_1,x_2\sim\mathcal{P}_{\mathcal{X}}}[\mathbb{1}_{O_1(x_1,x_2)(\eta(x_1)-\eta(x_2))<0}]=\epsilon_1$.

Uncertainty comparison oracle

This is our proposed oracle, which receives two data points as input and answers whether the first one has higher uncertainty. The answer "$+1$" means "yes" and "$-1$" means "no". We define the uncertainty of a data point $x\in\mathcal{X}$ through the difference between $\eta(x)$ and the classification threshold $0.5$: the smaller $|\eta(x)-0.5|$ is, the higher the uncertainty of $x$. We denote this oracle by $O_2:\mathcal{X}\times\mathcal{X}\rightarrow\{+1,-1\}$ and define it with the following noise condition.

Condition 2.

Distribution $\mathcal{P}_{\mathcal{X}\mathcal{Y}}$ and oracle $O_2$ satisfy this condition with noise parameter $\epsilon_2\geq 0$ if $\mathbb{E}_{x_1,x_2\sim\mathcal{P}_{\mathcal{X}}}[\mathbb{1}_{O_2(x_1,x_2)(|\eta(x_2)-0.5|-|\eta(x_1)-0.5|)<0}]=\epsilon_2$.

Note that the above conditions only constrain the error rates; answers need not respect a proper order. Namely, it is possible to receive positive answers from both $O_1(x_1,x_2)$ and $O_1(x_2,x_1)$ (asymmetry), or from $O_1(x_1,x_2)$, $O_1(x_2,x_3)$, and $O_1(x_3,x_1)$ (intransitivity) for $x_1,x_2,x_3\sim\mathcal{P}_{\mathcal{X}}$. The same holds for $O_2$. Therefore, our assumptions are relatively weak compared to parametric models such as the Bradley-Terry-Luce (BTL) model [5, 21].
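For concreteness, the following sketch shows one simple way to realize oracles satisfying Conditions 1 and 2, given access to $\eta$ and flipping each answer independently with the corresponding noise rate (the function names are ours; this matches the simulation protocol used in Section 5):

import random

def make_oracles(eta, eps1, eps2):
    # Build noisy comparison oracles from a conditional-probability function
    # eta: x -> p(y=+1|x). Flipping each answer independently with the given
    # rate satisfies Conditions 1 and 2.
    def O1(x1, x2):  # positivity comparison: is x1 more likely positive?
        ans = +1 if eta(x1) > eta(x2) else -1
        return -ans if random.random() < eps1 else ans

    def O2(x1, x2):  # uncertainty comparison: is x1 closer to the threshold?
        ans = +1 if abs(eta(x1) - 0.5) < abs(eta(x2) - 0.5) else -1
        return -ans if random.random() < eps2 else ans

    return O1, O2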

2.3 Proposed labeling algorithm

We propose a labeling algorithm that never accesses the explicit labeling oracle and can still infer accurate labels. Given a set $D$ of $n$ unlabeled data points sampled from $\mathcal{P}_{\mathcal{X}}$, the idea is to first select a subset $D'\subset D$ of $t$ data points as a proxy, or delegation, for the classification threshold, where $t\ll n$. Note that we do not need to know the ranking order based on class-posterior probabilities of either $D'$ or $D\setminus D'$, and we want to find this subset by actively accessing the oracle as few times as possible. This can be formulated as a top-$t$ item selection problem from noisy comparisons, which has been extensively studied. Since we want to select the most uncertain data points, the uncertainty comparison oracle $O_2$ is queried. To this end, we choose the theoretically guaranteed and practically promising algorithm proposed by Mohajer et al. [23] as the first step of our algorithm. To keep this paper self-contained, we briefly introduce this algorithm and its theoretical property.

Top selection algorithm from noisy comparisons [23]

Suppose we want to select a subset $D'$ of $t$ data points from a set $D$ of $n$ data points based on a noisy comparison oracle. The algorithm proceeds in the following steps.

  1. Separate the whole set into $t$ subsets of equal size, around $\frac{n}{t}$ each.

  2. On each subset, conduct a single-elimination tournament to select a single data point, which is supposed to have the highest uncertainty. Because the comparison results are noisy, each comparison is repeated $m$ times, where $m$ is a hyper-parameter.

  3. For the $t$ data points selected from the $t$ subsets, construct a heap structure.

  4. Move the data point at the top of the heap to $D'$.

  5. Conduct a single-elimination tournament with $m$ repetitions on the subset to which the data point most recently moved to $D'$ belongs.

  6. Reconstruct the heap and move the data point at the top of the heap to $D'$.

  7. Repeat steps 5 and 6 until $D'$ has enough data points.

Although the above algorithm is a simple combination of single-elimination tournaments and a heap structure, it enjoys the following favorable query complexity bound.

Theorem 1 (Mohajer et al. [23]).

With probability exceeding $1-(\log n)^{-c_0}$, the subset of top-$t$ instances can be identified by the above algorithm with query complexity upper bounded by $c_1(n+t\log t)\frac{\max(\log\log n,\,\log t)}{(\epsilon_2-0.5)^2}$. Here, $c_0,c_1$ are some universal positive constants.
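A simplified sketch of the selection procedure is given below. For brevity, the heap over group champions is replaced by a repeated tournament, which changes constants but not the overall structure; all names are ours, and we assume $n\geq t$ and distinct, hashable items (e.g., indices):

def majority_answer(oracle, a, b, m):
    # Repeat a noisy comparison m times and take the majority vote.
    return +1 if sum(oracle(a, b) for _ in range(m)) >= 0 else -1

def tournament(oracle, items, m):
    # Single-elimination tournament; returns the item judged most uncertain.
    pool = list(items)
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if majority_answer(oracle, a, b, m) == +1 else b)
        if len(pool) % 2 == 1:            # odd item gets a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]

def select_top_t(oracle, data, t, m):
    # Select the t items judged most uncertain.
    groups = [list(data[i::t]) for i in range(t)]   # t groups of ~n/t items
    champions = []                                  # (group id, champion) pairs
    for gid, group in enumerate(groups):
        champ = tournament(oracle, group, m)
        group.remove(champ)
        champions.append((gid, champ))
    selected = []
    while len(selected) < t:
        # Find the overall champion among group champions via a tournament.
        pool = list(champions)
        while len(pool) > 1:
            winners = []
            for i in range(0, len(pool) - 1, 2):
                (_, a), (_, b) = pool[i], pool[i + 1]
                winners.append(pool[i] if majority_answer(oracle, a, b, m) == +1
                               else pool[i + 1])
            if len(pool) % 2 == 1:
                winners.append(pool[-1])
            pool = winners
        gid, champ = pool[0]
        selected.append(champ)
        champions.remove((gid, champ))
        if groups[gid]:                             # refill from the same group
            new_champ = tournament(oracle, groups[gid], m)
            groups[gid].remove(new_champ)
            champions.append((gid, new_champ))
    return selected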

After selecting the $t$ most uncertain data points using the above algorithm, we use the selected subset $D'$ and the positivity comparison oracle $O_1$ to decide the labels of $D\setminus D'$. The whole algorithm can be summarized in the following three steps.

  1. Use $O_2$ to find $D'$, the subset of the $t$ most uncertain data points.

  2. For each data point $x\in D\setminus D'$, use $O_1$ to compare it with all (or part) of the data points in $D'$ and infer its label by majority vote.

  3. Because we do not assume that $D$ is sampled i.i.d. (so that the algorithm can be used in more general situations), the worst case could happen for any labeling of $D'$. Therefore, we can either randomly label $D'$ or repeat the whole algorithm using $D'$ as the initial input.

This algorithm can efficiently infer labels without requiring unnecessary information such as the ranking order of class-posterior probabilities. The algorithm is formally described in Algorithm 1. An error rate bound for inferred labels under noise conditions is established in Section 3.1.

Algorithm 1 Proposed Labeling Algorithm
Input: positive integer $t$, dataset $D$ of size $n$.
1: Select the $t$ most uncertain data points from $D$ using the algorithm of Mohajer et al. [23] and $O_2$. Denote the selected set by $D'$.
2: for $x_i\in D\setminus D'$ do
3:    If $\sum_{x_j\in D'} O_1(x_i,x_j)\geq\frac{1}{2}$, then let $\hat{y}_i\leftarrow 1$; else let $\hat{y}_i\leftarrow 0$.
4: end for
5: Randomly label $x_i\in D'$.
Output: inferred labels $\hat{Y}\triangleq\{\hat{y}_i\}_{i=1}^{n}$.
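Algorithm 1 can be transcribed almost line for line, reusing the select_top_t sketch above (the function names are ours; data items are assumed hashable):

import random

def algorithm1(data, O1, O2, t, m):
    # Sketch of Algorithm 1: label a dataset with the two comparison oracles
    # only, using the t most uncertain points as a threshold proxy.
    D_prime = select_top_t(O2, data, t, m)
    proxy = set(D_prime)
    labels = {}
    for x in data:
        if x in proxy:
            labels[x] = random.choice([0, 1])       # line 5: random labels on D'
        else:
            votes = sum(O1(x, z) for z in D_prime)  # O1 answers are +1/-1
            labels[x] = 1 if votes >= 0.5 else 0    # majority vote (line 3)
    return labels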

2.4 Learning classifiers under different budgets

For downstream tasks, we can feed $D$ and $\hat{Y}$ into any algorithm that relies on samples from $\mathcal{P}_{\mathcal{X}\mathcal{Y}}$. In this paper, we consider the general application of learning a binary classifier.

Sufficient budget case

In this case, we assume the budget suffices for running Algorithm 1 on the whole dataset. We can then obtain the inferred labels and feed them into any classification algorithm. In this paper, we consider the simplest non-parametric $k$-NN algorithm [1], which is easy to implement and enjoys good theoretical guarantees. A generalization bound for classifiers obtained in this case is established in Section 3.2.

Insufficient budget case

In this case, we consider the more practical situation where the dataset is too large for the budget, so we cannot afford to run Algorithm 1 on the whole dataset. We resort to active learning with Algorithm 1 as a subroutine for the batch selected at each step. As in Algorithm 3 of Xu et al. [31], we consider a disagreement-based active learning algorithm that calls the proposed Algorithm 1 at each step. Algorithm 2 describes the detailed procedure.

Algorithm 2 Disagreement-based active learning algorithm (Algorithm 3 of Xu et al. [31]).
Input: $\epsilon$, a sequence of sample sizes $n_i$, hypothesis set $H$.
1: $H_1\leftarrow H$
2: for $i=1,2,\cdots,\lceil\log(\frac{1}{\epsilon})\rceil$ do
3:    $S_i\leftarrow$ i.i.d. sample of size $n_i$ from $\mathcal{P}_{\mathcal{X}}$.
4:    $D_i\leftarrow\mathrm{DIS}(S_i,H_i)$.
5:    Run Algorithm 1 with $\epsilon_i=\frac{1}{2^{i+2}}$ on $D_i$ and obtain $\{\hat{y}_j\}_{j=1}^{|D_i|}$.
6:    $H_{i+1}\leftarrow\{h\in H_i:\sum_{j=1}^{n_i}\mathbb{1}_{h(x_j)\neq\hat{y}_j}\leq\epsilon_i n_i\}$
7: end for
Output: any classifier in $H_{i+1}$.
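A schematic of this loop, treating Algorithm 1 as a black-box labeler over a finite hypothesis set (the sample-size schedule and helper names are our assumptions; a real implementation would set $n_i$ following Xu et al. [31]):

import math

def disagreement_region(S, H):
    # DIS(S, H): points of S on which the hypotheses in H disagree.
    return [x for x in S if len({h(x) for h in H}) > 1]

def disagreement_based_al(H, sample, run_algorithm1, eps):
    # `sample(n)` draws n i.i.d. points from P_X; `run_algorithm1(D, eps_i)`
    # returns labels for D inferred by Algorithm 1 (same encoding as h(x)).
    for i in range(1, math.ceil(math.log2(1 / eps)) + 1):
        n_i = 2 ** (i + 4)                       # placeholder schedule for n_i
        S = sample(n_i)
        D = disagreement_region(S, H)
        eps_i = 1 / 2 ** (i + 2)
        y_hat = run_algorithm1(D, eps_i)
        # Keep hypotheses with small empirical error on the inferred labels.
        H = [h for h in H
             if sum(h(x) != y for x, y in zip(D, y_hat)) <= eps_i * n_i]
    return H[0]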

3 Theoretical analysis

In this section, we establish the error rate bound for Algorithm 1 and generalization error bounds for the $k$-NN algorithm and Algorithm 2.

3.1 Analysis of the proposed labeling algorithm

Theorem 2 (Error rate bound).

Suppose Condition 1 and Condition 2 hold for $\epsilon_1,\epsilon_2\in[0,0.5)$. Let $t=\Omega\left(\frac{\log 2}{2(0.5-\epsilon_1)^2}\right)$. Fix $\epsilon>0$ and assume $D$ to be a set of size $n>\frac{t}{\epsilon}$ that contains data points $x\in\mathcal{X}$. Then, there exist constants $C_1$ and $C_2$ such that for an execution of Algorithm 1 on $D$ with parameters $t$ and $m\geq\frac{C_1\max(\log\log n,\,\log t)}{(0.5-\epsilon_2)^2}$, with probability at least $1-\delta$, where we write $\delta\triangleq\delta(C_2,n,t,\epsilon_1)$ for simplicity, the error rate of the inferred labels is bounded as $\frac{|\{i\in[n]\,|\,\hat{y}_i\neq h^*(x_i)\}|}{n}\leq\epsilon$. The query complexity is $\mathcal{O}\left(\frac{n}{\epsilon_1^2}\right)$ for $O_1$ and $\mathcal{O}\left(\frac{n\log\log n}{\epsilon_2^2}\right)$ for $O_2$.

The proof can be found in Appendix A. Note that the algorithm has two hyper-parameters: the size $t$ of the delegation subset and the repetition number $m$ for each comparison. The above theory shows a principled way to select $t$, which depends only on the noise rate $\epsilon_1$. For a reasonable range of $\epsilon_1\leq 0.4$, Algorithm 1 only requires $t$ to be at most $35$, which is small compared to the size of a modern dataset. For the other hyper-parameter $m$, we empirically observe that a surprisingly small value, even $1$, shows promising performance. As for the query complexities, the $\mathcal{O}(n)$ factor is required by at least one oracle, since we cannot decide the label of a data point without accessing it at least once.
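For concreteness, evaluating the expression inside the $\Omega(\cdot)$ at the boundary value $\epsilon_1=0.4$ gives

$\frac{\log 2}{2(0.5-0.4)^2}=\frac{0.6931\ldots}{0.02}\approx 34.7,$

so $t=35$ suffices for any $\epsilon_1\leq 0.4$ (the bound is increasing in $\epsilon_1$), which is where the value $35$ above comes from.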

3.2 Analysis of nearest neighbors classifiers

We establish a generalization error bound for classifiers obtained by combining Algorithm 1 and $k$-NN. We want to estimate the function $\eta(x)$ from the labels inferred by Algorithm 1. For $x\in\mathcal{X}$, we denote the indices of the other points, sorted in ascending order of distance, by $\{\tau_q(x)\}_{q=1}^{n-1}$; that is, for a metric $\rho$, it holds that $\rho(x,x_{\tau_q})\leq\rho(x,x_{\tau_{q+1}})$ for $q\in[1,n-2]$. We can then write the resulting $k$-NN estimate as $\hat{f}(x;k)=\frac{1}{k}\sum_{q=1}^{k}\hat{y}_{\tau_q(x)}$.
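A minimal sketch of this classifier on the inferred labels, assuming the Euclidean metric for $\rho$ (the function name is ours; y_hat is an array of the $\{0,1\}$ labels produced by Algorithm 1):

import numpy as np

def knn_classify(x, X_train, y_hat, k):
    # k-NN on labels inferred by Algorithm 1: average the inferred {0, 1}
    # labels of the k nearest points and threshold at 1/2.
    dists = np.linalg.norm(X_train - x, axis=1)   # rho(x, x_i) for all i
    nearest = np.argsort(dists)[:k]               # tau_1(x), ..., tau_k(x)
    return 1 if np.mean(y_hat[nearest]) >= 0.5 else 0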

We then introduce two essential assumptions. First, we need a general assumption for achieving fast convergence rates for $k$-NN classifiers.

Assumption 1 (Measure smoothness [6]).

With $\lambda>0$ and $\omega>0$, for all $x_1,x_2\in\mathcal{X}$, it satisfies

$|\eta(x_1)-\eta(x_2)|\leq\omega\,\mu\left(B_{\rho(x_1,x_2)}(x_1)\right)^{\lambda},$

where $B_{\rho(x_1,x_2)}(x_1)$ denotes the ball with center $x_1$ and radius $\rho(x_1,x_2)$.

Then, we need the following Tsybakov's margin condition [22], which is a common assumption for establishing fast convergence rates.

Assumption 2 (Tsybakov's margin condition).

There exist $\alpha\geq 0$ and $C_\alpha\geq 1$ such that for all $\xi>0$ we have

$\mu\left(\left\{x\in\mathcal{X}: 0<\left|\eta(x)-\frac{1}{2}\right|<\xi\right\}\right)\leq C_\alpha\,\xi^{\alpha}.$

Finally, we establish the generalization error bound.

Theorem 3 (Generalization error bound for $k$-NN).

Suppose the conditions of Theorem 2 hold. Let the input and output of Algorithm 1 be $D=\{x_i\}_{i=1}^n$ and $\hat{Y}=\{\hat{y}_i\}_{i=1}^n$. Let $\hat{f}(x;k)$ be the obtained $k$-NN classifier and $f^*(x)\triangleq\mathbb{1}_{\eta(x)\geq\frac{1}{2}}$ be the Bayes classifier. Then, using the same notation as Theorem 2, supposing that Assumption 1 holds with $\lambda>0$ and $\omega>0$ and that Assumption 2 holds with $\alpha\geq 0$ and $C_\alpha\geq 1$, for $\delta'\in(0,1)$ and $4\log(\frac{1}{\delta'})+1\leq k\leq\frac{n}{2}$, with probability at least $(1-\delta)(1-\delta')$, we have

$R(\hat{f})\leq R(f^*)+C_\alpha\left(\frac{2\epsilon}{k}+\omega\left(\frac{2k}{n}\right)^{\lambda}\right)^{\alpha+1}.$

The proof can be found in Appendix B. The difference between the above bound and other generalization bounds under unknown asymmetric noise [12, 25] is that Theorem 3 does not require the labels to be an i.i.d. sample from an underlying distribution, as they are instead inferred by Algorithm 1.

3.3 Analysis of disagreement-based active learning

We establish the generalization error bound by the following corollary to justify Algorithm 2. Its proof can be found in Appendix C.

Corollary 4 (Generalization error bound for active learning).

Suppose the conditions of Theorem 2 hold. Then, for an execution of Algorithm 2 with $\epsilon\in(0,1)$ and $\epsilon_i=\frac{1}{2^{i+2}}$, with probability at least $1-\delta$, the output $\hat{h}$ satisfies $\mathrm{P}_{x\sim\mathcal{P}_{\mathcal{X}}}[\hat{h}(x)\neq h^*(x)]\leq\epsilon$.

4 Related work

Weakly-supervised learning

Learning classifiers from passively obtained comparisons without explicit class labels has been studied, for example learning from similarity comparisons [3, 27] and learning from triplet comparisons [9]. However, the feasibility of learning in such cases relies on various unavoidable assumptions. Bao et al. [3] assume the group with more data to be the positive class. The other two methods [27, 9] assume specific data generation processes, which may not always hold in some applications. Moreover, none of these methods has theoretical guarantees for noisy comparisons. On the other hand, learning from totally unlabeled data has also been studied [10, 20]. However, these methods require at least two datasets with different class priors $p(Y=+1)$, and they also need to know these class priors exactly, which can be impossible in some cases. The proposed labeling algorithm is transductive and can be combined with non-parametric classifiers, while the above existing methods mainly rewrite the classification risk and require a differentiable model. Furthermore, the proposed algorithm requires neither the above assumptions nor additional information such as exact class priors.

Preference learning

Results of $O_1$ are mainly used in this learning paradigm to learn a (partial) ranking over data points. However, a ranking cannot induce labels without further information, as the class prior is needed to decide the classification threshold. At the same time, labeling cannot induce a ranking, as there is no information on the ranking order of data with the same label. Similar arguments also hold for bipartite ranking [24].

Active learning

Interactive classification with oracles that do not answer explicit class labels has been studied [4, 16, 31, 28]. Beygelzimer et al. [4] use a search oracle that receives a function set as input and outputs a data point with its explicit class label. The other two methods [16, 31] use the same oracle as $O_1$. However, they all need to access the explicit labeling oracle. On the other hand, Balcan et al. [2] use only class conditional queries (CCQ) without accessing the explicit labeling oracle; however, labels can be inferred from a single CCQ. Although a direct comparison is not possible, we argue that $O_2$ is weaker than CCQ because labels cannot be inferred from the query results of $O_2$.

5 Experiments

In this section, we confirmed the feasibility and performance of the proposed algorithm using both simulation and crowdsourcing data.

5.1 Simulation study

All experiments were repeated ten times on a server with an Intel(R) Xeon(R) E5-2698 v4 @ 2.20GHz CPU and a Tesla V100 GPU. Mean values and standard deviations are reported.

5.1.1 Sufficient budget case

When constructing binary datasets from multi-class datasets, experiments in existing work usually split the whole dataset into two parts, such as separating odd and even numbers for handwritten digits. However, as the focus of the proposed oracle is uncertainty, it is important to simulate experiments that involve some form of uncertainty. For image datasets, uncertainty can be expressed as visual similarity between two classes. Therefore, we constructed the following eight binary classification datasets whose classes are visually similar to some extent.

  • MNIST-a denotes the MNIST [19] data with labels 1 (7877 data) and 7 (7293 data).

  • MNIST-b denotes the MNIST data with labels 3 (7141 data) and 5 (6313 data).

  • FMNIST-a denotes the Fashion-MNIST [30] data with labels T-shirt/top (7000 data) and shirt (7000 data).

  • FMNIST-b denotes the Fashion-MNIST data with labels pullover (7000 data) and coat (7000 data).

  • KMNIST-a denotes the Kuzushiji-MNIST [8] data with the second label (7000 data) and the eighth label (7000 data).

  • KMNIST-b denotes the Kuzushiji-MNIST data with the third label (7000 data) and the seventh label (7000 data).

  • CIFAR10-a denotes the CIFAR-10 [18] data with labels automobile (5000 data) and truck (5000 data).

  • CIFAR10-b denotes the CIFAR-10 data with labels deer (5000 data) and horse (5000 data).

For all datasets except CIFAR-10, a logistic regression classifier was first learned on all selected data with a maximum of one hundred thousand iterations. The oracles were then simulated using the conditional probabilities output by this logistic regression classifier. For CIFAR-10, a ResNet152 [15] classifier was first trained on the whole dataset (10 classes) for 100 epochs. The 2048-dimensional representations before the last fully connected layer were then used as features, on which a logistic regression classifier was trained. The logistic regression and $k$-NN classifiers were trained on these 2048-dimensional features instead of the original input. We set $k=5$ for $k$-NN classifiers throughout all experiments. We randomly split the data into training and test sets with a 4:1 ratio for every repetition of the algorithms. As we have no sensitive hyper-parameters to tune, we did not separate a validation set. For the Co-teaching experiments, we set the batch size to 1024 and the number of epochs to 100. We adopted the public code provided by the authors and followed all other default settings therein, such as the learning rate schedule.
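This simulation amounts to the following sketch, reusing make_oracles from Section 2.2 (the scikit-learn pipeline reflects our reading of the setup):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_eta(X, y):
    # Fit logistic regression and return an estimate of eta(x) = p(y=+1|x).
    lr = LogisticRegression(max_iter=100_000).fit(X, y)
    return lambda x: lr.predict_proba(np.atleast_2d(x))[0, 1]

# Example: eta = fit_eta(features, labels)
#          O1, O2 = make_oracles(eta, eps1=0.4, eps2=0.4)  # rates of Table 1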

We considered the conservative case where the noise rates are high and the repetition number $m$ is small. Theorem 2 indicates that the size of the delegation subset $t$ usually needs to be at most $35$, so we set $t$ to $10$ or $35$. Table 1 shows that a larger delegation set (a higher $t$) contributes to better label accuracy, and thus better generalization ability. This behavior matches the expectation, as the inferred label of each non-delegation data point becomes more accurate. We also observe that even with a small $t$, $k$-NN shows promising generalization ability.

Table 1: Performance when the repetition number $m=1$ and noise rates $\epsilon_1=\epsilon_2=0.4$.
Dataset Label Accuracy ($t$=10) $k$-NN Test Accuracy ($t$=10) Label Accuracy ($t$=35) $k$-NN Test Accuracy ($t$=35)
MNIST-a 67.89 (0.37) 77.63 (0.83) 80.94 (0.47) 92.36 (0.60)
MNIST-b 67.10 (0.52) 76.11 (0.79) 80.46 (0.37) 92.93 (0.37)
FMNIST-a 65.78 (0.26) 70.96 (0.45) 76.38 (0.20) 81.40 (0.19)
FMNIST-b 66.25 (0.34) 72.28 (0.50) 77.25 (0.24) 83.36 (0.20)
KMNIST-a 68.69 (0.56) 78.90 (1.07) 81.64 (0.62) 94.30 (0.58)
KMNIST-b 67.99 (0.26) 77.45 (0.45) 78.88 (0.36) 90.16 (0.33)
CIFAR10-a 69.34 (0.44) 80.09 (0.82) 82.07 (0.41) 94.28 (0.31)
CIFAR10-b 68.67 (0.20) 78.47 (0.59) 81.83 (0.50) 93.95 (0.42)

Table 2 shows the results of the optimistic situation, where the noise rates were low and sufficient budget for a larger $m$ was available.

Table 2: Performance when the repetition number $m=10$ and noise rates $\epsilon_1=\epsilon_2=0.1$.
Dataset Label Accuracy ($t$=10) $k$-NN Test Accuracy ($t$=10) Label Accuracy ($t$=35) $k$-NN Test Accuracy ($t$=35)
MNIST-a 99.74 (0.01) 99.39 (0.03) 99.84 (0.01) 99.35 (0.03)
MNIST-b 97.12 (0.03) 98.36 (0.09) 97.22 (0.02) 98.36 (0.06)
FMNIST-a 87.19 (0.06) 83.95 (0.18) 87.38 (0.06) 84.14 (0.16)
FMNIST-b 88.84 (0.04) 86.26 (0.20) 88.86 (0.04) 86.67 (0.18)
KMNIST-a 98.78 (0.01) 99.12 (0.05) 98.90 (0.01) 99.00 (0.02)
KMNIST-b 92.33 (0.03) 94.53 (0.14) 92.36 (0.03) 94.85 (0.09)
CIFAR10-a 99.87 (0.02) 99.92 (0.02) 99.97 (0.01) 99.95 (0.01)
CIFAR10-b 99.86 (0.01) 99.98 (0.01) 99.94 (0.01) 99.98 (0.01)

We next confirmed the quality of the inferred labels using a more powerful model. Co-teaching [13] is a recently proposed training method for extremely noisy labels. It maintains two classifiers, each feeding its small-loss data points to the other for training. Although it lacks theoretical guarantees, it reportedly shows promising performance [13]. We used relatively small ResNet18 [15] models and refrained from tuning any hyper-parameters for Co-teaching.

Figure 1 shows results with the same size of delegation set in the same color, using dotted lines for results with fewer repetitions. We observe that setting $m=1$ already shows promising accuracy with $t$ set to the theoretical maximum of $35$. For the same value of $t$, increasing $m$ from $1$ to $10$ offers only a small improvement in accuracy. Setting $m$ to $1$ means that we query each pair only once and proceed as if the answer were correct. This shows that the proposed algorithm is highly robust to query noise, as it performs well using a single noisy result without repeating the same query many times. Moreover, the low noise rate regime shows comparable performance under different settings, which means the proposed algorithm can generally achieve high performance with a low budget.

Figure 1: Generalization performance of co-teaching classifiers.

Figure 2 shows a detailed investigation on the Fashion-MNIST datasets. It shows a similar tendency to the previous Co-teaching results on the CIFAR-10 datasets.

Figure 2: Generalization performance of $k$-NN classifiers for Fashion-MNIST datasets.

5.1.2 Insufficient budget case

In this case, because Algorithm 2 needs to loop over every available hypothesis at each step, it is infeasible to start with a large hypothesis set. Note that even for the MNIST dataset with 784 features and the simplest linear models, using a discrete search space of size 10 for the parameter of each feature creates a huge hypothesis set of size $10^{784}$. Therefore, in order to illustrate the feasibility of the algorithm, we used 2-dimensional toy data generated from two Gaussian distributions that are symmetric about the origin. Specifically, we used two Gaussian distributions with means $(2,2)$ and $(-2,-2)$ and the identity matrix as both covariances. From these distributions, we drew ten thousand data points in total, with each data point having an equal probability of being generated from either distribution. A logistic regression classifier was then trained with a maximum of one hundred thousand iterations to simulate the oracles. For the hypothesis set, we used 1000 equally spaced linear classifiers passing through the origin. Setting the desired precision $\epsilon=0.1$ resulted in three steps of Algorithm 2. Table 3 shows the number of remaining candidate hypotheses and their test accuracy at each step.

Table 3: Active Learning Experiment Results
Step 1 Step 2 Step 3
Number of Remaining Hypotheses 674.10 (4.97) 525.60 (7.34) 196.90 (71.85)
Test Accuracy of Remaining Hypotheses 96.98 (0.44) 99.29 (0.19) 99.78 (0.11)
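The toy setup can be reproduced along the following lines (the random seed and the angular parametrization of the hypothesis set are our own choices):

import numpy as np

rng = np.random.default_rng(0)                    # seed is our choice
n = 10_000
y = rng.choice([+1, -1], size=n)                  # equal class probability
means = np.where(y[:, None] == +1, 2.0, -2.0)     # centers (2,2) and (-2,-2)
X = means + rng.standard_normal((n, 2))           # identity covariance

# hypothesis set: 1000 equally spaced linear classifiers through the origin
angles = np.linspace(0.0, np.pi, 1000, endpoint=False)
H = [lambda x, w=np.array([np.cos(a), np.sin(a)]): 1 if x @ w >= 0 else -1
     for a in angles]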

5.2 User study

The previous section investigated the proposed algorithm using artificial oracles, so its feasibility in real-world situations remained unexplored. Therefore, we conducted user studies using crowdsourcing.

5.2.1 Character recognition task

In this task, we focused on the classification of Kuzushiji (cursive Japanese) [8], which is important for advocating research on Japanese historical books and documents.

Goals

We want to justify the proposed oracle and confirm whether the proposed algorithm can work on results collected through a user study, without simulation. Specifically, we want to (1) confirm whether data pairs selected by the proposed algorithm are easier for uncertainty comparison than for explicit labeling, and (2) confirm whether the proposed algorithm can work on crowdsourcing results alone. We first introduce the data and the general interface used in the user study, followed by a detailed description of each user study setting in the following paragraphs.

Data
Figure 3: Sample images: 'NA' on the left and 'WO' on the right.

From the Kuzushiji-MNIST dataset [8], we selected the 5th and the 10th characters to form the binary classification task. The reading alphabet is 'NA' for the 5th character and 'WO' for the 10th character. Figure 3 shows them in a standard font. Despite their visual similarity, these two characters are important auxiliary words with distinct meanings; thus, wrongly recognizing them can harm the understanding of a sentence. This recognition task has a natural affinity with uncertainty comparison: in daily writing, the difficulty of recognizing a handwritten character is easier to judge than the exact identity of the character.

Methods

We prepared three types of questions: explicit labeling, pairwise positivity comparison, and pairwise uncertainty comparison. We also asked annotators for the difficulty of each question when necessary. The interfaces are shown in the following list.

  • Figure 4 shows how we ask annotators for explicit labels.

  • Figure 5 shows how we ask annotators for pairwise positivity comparisons. If we fix one label, such as 'NA', and ask which image is more likely to be 'NA', there are cases where both images in a pair look similar to 'WO', making the question difficult to answer. Therefore, we also ask annotators to choose either 'NA' or 'WO' as the criterion of positivity.

  • Figure 6 shows how we ask annotators for pairwise uncertainty comparisons. As this is a newly proposed comparison question that annotators may not be used to answering, we give an explanatory example of how to select.

  • Figure 7 shows how we ask annotators for the difficulty evaluation of uncertainty comparisons compared to explicit labeling. We asked annotators to answer both types of query first to familiarize them with the problem.

Figure 4: Questionnaire of explicit labeling.
Figure 5: Questionnaire of pairwise positivity comparisons.
Figure 6: Questionnaire of pairwise uncertainty comparisons.
Figure 7: Questionnaire of difficulty evaluation of pairwise uncertainty comparisons.
Justification for uncertainty comparisons

In this user study, we confirmed whether the data pairs selected by the algorithm for $O_2$ are difficult for explicit labeling. We first greedily selected 25 medoids from all data points. Then, we ran the proposed algorithm on these 25 data points using artificial oracles and collected the 42 pairs that were selected for $O_2$. Finally, we collected answers from 50 annotators on explicit labeling and uncertainty comparison for these 42 pairs. For each pair, we also asked for the difficulty of the uncertainty comparison compared to explicit labeling, using scores from one to five, with a smaller score indicating an easier question. Furthermore, we collected difficulty evaluations of explicit labeling for each image from 10 annotators.

To analyze the difficulty evaluations with respect to pair attributes, we introduce the individual difficulty of each single image; another notion of difficulty will be introduced in a later paragraph. Based on the user evaluation of individual difficulties, we classified data pairs into three types: (1) the 'E' type containing two easy data points, (2) the '&' type containing one easy and one difficult data point, and (3) the 'D' type containing two difficult data points.

We then aggregated the user evaluations by pair type. Table 4 shows the mean and standard deviation of the difficulty evaluations for each type, as well as the t-statistics and p-values of a one-sample t-test against the value 3, which corresponds to the two types of query having equal difficulty. From the results, we can conclude that as the pair type changes from 'E' to 'D', uncertainty comparison becomes less favored over explicit labeling. For type 'D', the mean difficulty evaluation is not significantly different from 3, as the p-value is $0.06>0.05$. However, as the proposed algorithm focuses on separating difficult images, random decisions on images with similar difficulty do not harm the performance.

Table 4: Statistics of difficulty evaluation of uncertainty comparisons.
Type 'E' Type '&' Type 'D' Total
Number of Pairs 12 25 5 42
Number of Total Evaluations 600 1250 250 2100
Mean 2.57 2.82 2.84 2.75
Standard Deviation 1.28 1.38 1.35 0.34
t statistic -8.23 -4.68 -1.91 -5.15
p value $1.19\times 10^{-15}$ $3.22\times 10^{-6}$ 0.06 $4.69\times 10^{-6}$
Algorithm feasibility using simulated pairwise comparisons

In this user study, we first greedily selected 50 medoids as training data. We then collected explicit label feedback from 20 annotators. For a single image among these 50 medoids, we simulated its class probability by the proportion of class assignments in the 20 evaluations. For example, if 15 annotators assigned the positive label to an image, we defined its probability of being positive as $\frac{15}{20}=0.75$. These probabilities were then used to simulate both kinds of pairwise comparisons. Using the inferred labels as input, the last layer of a pre-trained neural network was fine-tuned.
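In code, this estimation step looks roughly as follows, feeding the empirical probabilities into make_oracles from Section 2.2 (the names and the zero extra noise rate are our assumptions, since the noise here comes from the votes themselves):

def crowd_eta(positive_votes, n_annotators=20):
    # Empirical p(y=+1|x) from explicit crowd labels,
    # e.g. 15 positive votes out of 20 annotators gives 0.75.
    return positive_votes / n_annotators

# Example: eta = {img: crowd_eta(v) for img, v in votes.items()}
#          O1, O2 = make_oracles(eta.get, eps1=0.0, eps2=0.0)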

Figure 8: Performance using simulated pairwise comparisons.

Figure 8 shows the label accuracy and the test accuracy when using different numbers of medoids as training data. The test accuracy measures the performance of each classifier learned from the inferred labels on a test dataset of size 100, selected uniformly without replacement, excluding the training data points. For the fully supervised case, it can be clearly observed that more training data contribute to higher accuracies. This is less clear for the other two methods, because they rely not only on the number of training data but also on the quality of the pairwise comparison feedback. Although 64% of all uncertainty pairwise comparisons were ties, the proposed method showed consistent performance. In contrast, with 24% ties among all positivity pairwise comparisons, the existing method failed to perform consistently, even with parameter tuning.

Algorithm feasibility using user feedback on pairwise comparisons

In this user study, we confirmed the performance of each algorithm on crowdsourcing results alone. We greedily selected 25 medoids [26], collected answers for all possible combinations among these medoids from 10 annotators, and used the aggregated majority answers as input both to the existing algorithm [31], which uses positivity comparisons and explicit class labels, and to the proposed algorithm. We adopted a pre-trained neural network and fine-tuned its last layer, considering the small number of training data.

Figure 9 and Figure 10 show the label accuracies and test accuracies for 25 greedily selected medoids and 25 uniformly selected data points, respectively. The test accuracy measures the performance of each classifier learned from the inferred labels on a test dataset of size 100, selected uniformly without replacement after the selection of the training data points. Label accuracies were calculated for each trial; test accuracies used the aggregated results and were calculated only once. The mean value over the results of the 10 annotators is shown by dashed lines, the standard deviation by the shaded regions, and the values from the aggregated results by solid lines. The proposed algorithm showed performance competitive with fully supervised learning without accessing explicit class labels at all.

Figure 9: Performance on medoids.
Figure 10: Performance on uniformly selected data points.

When increasing the number of training data, we observed that the proposed algorithm also shows stable and promising generalization ability competitive with full supervision. However, the performance of the existing algorithm [31] was not stable, because it separates data points into small bags and queries a random subset of each bag for explicit class labels. With fewer training data, the size of each bag was small and most of a bag could be queried for explicit class labels, achieving high labeling accuracy. With more training data, a reasonable budget restricted the size of the subset queried from each bag, resulting in the drop in performance.

We then analyzed the properties of the pairs selected for $O_2$. Different from the last paragraph, these pairs were selected by the algorithm run on crowdsourcing results. We introduce another type of difficulty, the pair difficulty of a pair of data points, and investigate the relationship between pair types and pair difficulties. The user evaluations of pair difficulties were $0.16\,(\pm 0.33)$, $0.17\,(\pm 0.10)$, and $0.56\,(\pm 0.09)$ for types 'E', '&', and 'D', respectively. This matches the intuition that annotators are confused when both images are difficult to classify.

Figure 11: Uncertainty comparison query types and difficulties.

Figure 11 shows the trajectories of the actually queried uncertainty comparisons over 10 trials, indicating easy pairs in white and difficult pairs in orange. Note that each query consists of a pair of images. Taking Trial 02 as an example, we observe that for the first query, an annotator found it easy to assign the explicit class label to one image and difficult for the other; the same holds for the second query. The same annotator then found it difficult to assign explicit class labels to both images in the third query. Another annotator found it easier to compare uncertainties than to explicitly label images in the first and second queries, and difficult in the third query. We can observe that more difficult pairs are queried in the latter half of the executions. This can be interpreted as the algorithm successfully separating difficult data from easy data at an early stage. Note that for the purpose of separating data by difficulty, the results on 'E' pairs and 'D' pairs do not matter much, as the data points in these pairs have similar difficulty.

Figure 12: Histogram of uncertainty comparison query types and difficulties.

Figure 12 shows the corresponding histogram. As the pair type becomes more difficult, the proportion of pairs evaluated as difficult for uncertainty comparison increases, as expected. Although blue areas are preferable to orange areas, the proposed algorithm is not significantly influenced by the orange proportion of 'D' pairs.

User comments

At the end of each questionnaire, we also asked annotators to give their opinions on these tasks in free text. We select some representative opinions and list their English translations below (the translation is based on DeepL, https://www.deepl.com/translator).

The following list shows advantages of positivity comparisons over explicit labeling.

  • It is easy to choose between "NA" or "WO" even if you can't read the word.

  • You can choose the one you can easily recognize.

  • You can choose the letters by your feeling.

  • Unlike direct judgments, there is no clear correct answer, so it is possible to create questions that are easy for anyone to answer.

  • When it's not too curled up, it's easy to choose.

The following list shows disadvantages of positivity comparisons over explicit labeling.

  • If you cannot read either of them, your selection criteria will be blurred.

  • It is hard to judge a flaw when it's curled up.

  • It is not sure if the decision is accurate.

  • You need to stop and compare both images carefully, and may feel a great sense of hesitation before making a decision.

  • Unlike direct judgement, there is no clear correct answer, and if neither letter is difficult to judge, you don't have to think about the answer. You can make a good choice.

The following list shows advantages of uncertainty comparisons over explicit labeling.

  • It's easy to choose if you can read one or the other somehow.

  • It's quick and intuitive and I understand it quickly.

  • Can be narrowed down if both are recognized as "NA" or "WO".

  • It's easy to imagine how easy it is to read by just the simple criterion of being able to read, and how easy it is to read by pronouncing it in your head.

  • It is highly flexible and does not have any restrictions.

The following list shows disadvantages of uncertainty comparisons over explicit labeling.

  • You can only seem to read them, but you can't tell whether you actually chose the correct answer or not.

  • I don't know if other people can quickly recognize.

  • If the words are not read as "NA" or "WO", I use the elimination method to select.

  • When neither of them is likely to be readable, I tend to choose them at random.

  • Unlike direct judgments, there is no clear correct answer, which makes it difficult to evaluate the competence of the annotator.

As we can see from the above lists, it is difficult to choose when both images in a pair are unrecognizable. This may affect the accuracy of the existing method, as it is required to sort the whole dataset. However, it does not significantly degrade the performance of the proposed algorithm, as it suffices for either image in the pair to satisfy the desired uncertainty. Moreover, it is interesting to see the various criteria used by annotators.

5.3 Car preference task

The pairwise positivity oracle $O_1$ is extensively used in preference learning. Thus, in this study, we used a car dataset [17] to simulate binary classification with user preferences, denoting car images a user likes as positive and those a user dislikes as negative. Note that the true labels differ for each user, as different users may have different preferences for cars.

Goals

We want to verify if the proposed comparison oracle is useful for binary classification of individual user preference.

Method

In this user study, we conducted crowdsourcing in two ways.

  • First, we collected user preferences by five-stage evaluation. Stage one indicates that the user likes the car very much, and stage five indicates the opposite. The stages can be seen as different ranges of $p(y=1|x)$ for a given image $x$, and can thus be used to simulate both pairwise comparison oracles. For eliciting explicit labels, we considered the first two stages as positive.

  • Second, we directly collected user feedback for both kinds of pairwise comparisons on all possible pairs of a fixed set of training data.
We used an interface that is similar to the one used in the first user study.

Data

The original dataset [17] consists of 196 categories. We trained a ResNet18 [15] model to classify car categories and extract useful features. Based on the extracted features, we greedily selected a single medoid for each class, collecting 196 images. For the first crowdsourcing task on five-stage evaluation, we then uniformly selected 150 images. For the second task, we greedily selected 25 medoids based on the extracted features for training and used the remaining 125 images for testing. We collected user feedback on all 300 possible pairs for both kinds of pairwise comparisons. All tasks were answered by four users. After inferring labels, we trained both a neural network classifier and a $k$-NN classifier for each setting.

Algorithm feasibility using simulated pairwise comparisons

Using the five-stage evaluation to simulate pairwise comparisons, we had the freedom to choose various sets of training data points. Thus, we conducted experiments with different sizes of training data, selected either uniformly or greedily as medoids using the extracted features. The simulated feedback was noisy in the sense that when two images had the same stage evaluation, the answer was chosen randomly with equal probability.
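A minimal sketch of this stage-based simulation with random tie-breaking (mapping stages near the middle to high uncertainty is our reading of the correspondence to ranges of $p(y=1|x)$; the function names are ours):

import random

def simulate_O1(stage_a, stage_b):
    # Positivity comparison from five-stage scores (stage 1 = likes most);
    # ties are answered randomly with equal probability.
    if stage_a == stage_b:
        return random.choice([+1, -1])
    return +1 if stage_a < stage_b else -1

def simulate_O2(stage_a, stage_b):
    # Uncertainty comparison: stages nearer the middle (3) are treated as
    # more uncertain; ties are again broken at random.
    ua, ub = abs(stage_a - 3), abs(stage_b - 3)
    if ua == ub:
        return random.choice([+1, -1])
    return +1 if ua < ub else -1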

Figure 13: Test accuracies using simulated pairwise comparisons.

As shown in Figure 13, the proposed method using only simulated pairwise comparisons showed performance competitive with full supervision. The performance of the existing method was not stable, because its quick sort subroutine is very sensitive to the results of pairwise comparisons, which could be random in this case. The proposed algorithm showed consistent performance under the same conditions.

User Comments

After a user finished answering all questions, we asked for comments on the following open questions. The answers below summarize the comments from the four users.

Question 1: What are the characteristics of pairs that are easy for preference uncertainty comparison?

  • When one of the cars falls in the middle of like and dislike, or falls in a preferred category.

  • When the two cars are completely different from each other.

Question 2: What are the characteristics of pairs that are difficult for preference uncertainty comparison?

  • When the two cars have similar appearance or preference.

Question 3: What factors decide the difficulty of preference uncertainty comparison?

  • Appearance; category; experience.

Question 4: For what other items might preference uncertainty comparison work?

  • Food; plants; shoes; clothes; things that are unusual in daily life.

Question 5: What measures other than preference uncertainty comparison may work?

  • Fairness; measures that everyone is familiar with; measures based on experience.

6 Conclusion

In this paper, we address the problem of interactive labeling and propose a novel uncertainty comparison oracle, together with a noise-tolerant, theoretically guaranteed labeling algorithm that never accesses explicit class labels. We then confirm the performance of the algorithm theoretically and empirically. For future work, eliminating the $\mathcal{O}(n)$ factor from one of the query complexities could improve efficiency. Extending the uncertainty comparison oracle to multiple data points and multiple classes is another promising direction.

Broader Impact

We believe this research will benefit researchers in all fields who seek a more effective and less laborious annotation method for their unlabeled datasets. It can foster applications of machine learning by lowering the annotation barrier for people without specific professional knowledge. It can also benefit domain experts by saving their time for more important tasks. Furthermore, collecting comparison information can potentially mitigate the annotation biases of explicit labeling. It can also serve the aim of protecting privacy by not querying explicit class labels in some cases.

On the negative side, it may harm the performance of downstream classification models when the comparison annotations are mostly incorrect. However, there should be no consequential ethical issues arising from failure of the method.

References

  • [1] Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
  • [2] Maria Florina Balcan and Steve Hanneke. Robust interactive learning. In Conference on Learning Theory, pages 20–1, 2012.
  • [3] Han Bao, Gang Niu, and Masashi Sugiyama. Classification from pairwise similarity and unlabeled data. In International Conference on Machine Learning, pages 452–461, 2018.
  • [4] Alina Beygelzimer, Daniel J Hsu, John Langford, and Chicheng Zhang. Search improves label for active learning. In Advances in Neural Information Processing Systems, pages 3342–3350, 2016.
  • [5] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • [6] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.
  • [7] Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 137–144, New York, NY, USA, 2005. Association for Computing Machinery.
  • [8] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. preprint, 2018.
  • [9] Zhenghang Cui, Nontawat Charoenphakdee, Issei Sato, and Masashi Sugiyama. Classification from triplet comparison data. Neural Computation, 32(3):659–681, 2020.
  • [10] Marthinus Christoffel Du Plessis, Gang Niu, and Masashi Sugiyama. Clustering unclustered data: Unsupervised binary labeling of two datasets having different class balances. In 2013 Conference on Technologies and Applications of Artificial Intelligence, pages 1–6. IEEE, 2013.
  • [11] Johannes Fürnkranz and Eyke Hüllermeier. Preference learning and ranking by pairwise comparison. In Preference Learning, pages 65–82. Springer, 2010.
  • [12] Wei Gao, Bin-Bin Yang, and Zhi-Hua Zhou. On the resistance of nearest neighbor to random noisy labels. arXiv preprint arXiv:1607.07526, 2016.
  • [13] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8535–8545, 2018.
  • [14] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [16] Daniel M Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 355–366. IEEE, 2017.
  • [17] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
  • [18] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. preprint, 2009.
  • [19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [20] Nan Lu, Gang Niu, Aditya K. Menon, and Masashi Sugiyama. On the minimal supervision for training any binary classifier from only unlabeled data. In International Conference on Learning Representations, 2019.
  • [21] R Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Courier Corporation, 2012.
  • [22] Enno Mammen and Alexandre B Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
  • [23] Soheil Mohajer, Changho Suh, and Adel Elmahdy. Active learning for top-$k$ rank aggregation from noisy comparisons. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2488–2497. PMLR, 06–11 Aug 2017.
  • [24] Harikrishna Narasimhan and Shivani Agarwal. On the relationship between binary classification, bipartite ranking, and binary class probability estimation. In Advances in Neural Information Processing Systems, pages 2913–2921, 2013.
  • [25] Henry Reeve and Ata Kaban. Fast rates for a kNN classifier robust to unknown asymmetric label noise. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5401–5409, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
  • [26] Erich Schubert and Peter J Rousseeuw. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications, pages 171–187. Springer, 2019.
  • [27] Takuya Shimada, Han Bao, Issei Sato, and Masashi Sugiyama. Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. arXiv preprint arXiv:1904.11717, 2019.
  • [28] Christopher Tosh and Sanjoy Dasgupta. Interactive structure learning with structural query-by-committee. In Advances in Neural Information Processing Systems, pages 1121–1131, 2018.
  • [29] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  • [30] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. preprint, 2017.
  • [31] Yichong Xu, Hongyang Zhang, Kyle Miller, Aarti Singh, and Artur Dubrawski. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2431–2440, 2017.

Appendix A Proof of Theorem 2

Proof.

Algorithm 1 consists of two steps: selecting the relatively uncertain points and assigning labels to the remaining points by majority vote.

For the first step, the algorithm of Mohajer et al. [23] is executed with parameters $K=t$ and $m$. Adapting Theorem 1 of Mohajer et al. [23], if $m\geq\frac{C_{1}\max(\log\log{n},\log{t})}{(0.5-\epsilon_{2})^{2}}$, then the correct top-$t$ points are identified with probability at least $1-(\log{n})^{-C_{2}}$.
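For a rough feel of this condition, the following minimal Python sketch evaluates the required $m$ for concrete values; `min_comparisons` is a hypothetical helper, and the unspecified universal constant $C_{1}$ is set to $1$ purely for illustration.

```python
import math

def min_comparisons(n, t, eps2, C1=1.0):
    """Evaluate m >= C1 * max(log log n, log t) / (0.5 - eps2)^2,
    the condition adapted from Theorem 1 of Mohajer et al. [23].
    C1 is an unspecified universal constant, set to 1 for illustration."""
    assert 0.0 <= eps2 < 0.5, "comparison noise must be below 0.5"
    return C1 * max(math.log(math.log(n)), math.log(t)) / (0.5 - eps2) ** 2

# e.g., n = 10_000 unlabeled points, t = 50 uncertain points, noise eps2 = 0.3
print(min_comparisons(10_000, 50, 0.3))  # ~97.8 comparisons per pair
```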

For the second step, we analyze the probability that a point $x\in D\setminus D^{\prime}$ is correctly labeled. Without loss of generality, we assume the correct label of $x$ is $1$ and calculate the probability that $\frac{1}{t}\sum_{x_{j}\in D^{\prime}}O_{1}(x,x_{j})\geq\frac{1}{2}$.

Let $Z_{j}\triangleq O_{1}(x,x_{j})$ denote the random variable representing the outcome of each call to oracle $O_{1}$. Because $D^{\prime}$ is assumed to be correctly identified, $p(y|x)\geq p(y|x_{j})$ holds for every $x_{j}\in D^{\prime}$, and thus the expectation of $Z_{j}$ is $1-\epsilon_{1}$. Since $Z_{j}$ takes values only in $\{0,1\}$, applying Hoeffding's inequality to $Z_{1},Z_{2},\cdots,Z_{t}$ gives

\mathrm{Pr}\left[\frac{1}{t}\sum_{j=1}^{t}Z_{j}-(1-\epsilon_{1})\leq-(0.5-\epsilon_{1})\right]\leq\exp\left(-2t(0.5-\epsilon_{1})^{2}\right). (1)

This is exactly the probability that $\frac{1}{t}\sum_{j=1}^{t}Z_{j}$ is at most $0.5$.

Let $a\triangleq\exp\left(-2t(0.5-\epsilon_{1})^{2}\right)$. Because $t$ is selected so that $a\leq\frac{1}{2}$ and $\frac{1}{t}\sum_{j=1}^{t}Z_{j}$ is bounded within $[0,1]$, for a single $x\in D\setminus D^{\prime}$ it holds that

\mathrm{Pr}\left[\frac{1}{t}\sum_{j=1}^{t}Z_{j}\geq\frac{1}{2}\right]\geq 1-a (2)
\geq\exp(-a(a+1)). (3)
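For completeness, the elementary inequality $1-a\geq\exp(-a(a+1))$ used in Eq. (3), valid for $a\in[0,\frac{1}{2}]$, can be verified from the Taylor expansion of $-\log(1-a)$:

-\log(1-a)=\sum_{k=1}^{\infty}\frac{a^{k}}{k}=a+\sum_{k=2}^{\infty}\frac{a^{k}}{k}\leq a+\frac{1}{2}\sum_{k=2}^{\infty}a^{k}=a+\frac{a^{2}}{2(1-a)}\leq a+a^{2},

where the final step uses $1-a\geq\frac{1}{2}$; exponentiating both sides gives $1-a\geq e^{-a(a+1)}$.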

For the points in $D^{\prime}$, because we assign labels randomly, there is a positive probability that all of the assigned labels are wrong; we therefore count them all toward the error.

In conclusion, when all points in $D\setminus D^{\prime}$ are correctly labeled, the error rate $\epsilon=\frac{t}{n}$ is achieved with probability at least $1-\delta$, where $\delta\triangleq 1-(1-(\log{n})^{-C_{2}})\exp(-a(a+1)(n-t))$.

For the query complexities: since $O_{1}$ is queried $t(n-t)$ times, the query complexity of $O_{1}$ is $\mathcal{O}\left(\frac{n}{\epsilon_{1}^{2}}\right)$. Moreover, as indicated by Eq. (17) of [23], the query complexity of $O_{2}$ is $\mathcal{O}\left(\frac{n\log\log{n}}{\epsilon_{2}^{2}}\right)$. ∎
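The following self-contained Python sketch is our own illustration, not the paper's implementation: it simulates the majority-vote step under a synthetic noise model (oracle correct with probability $1-\epsilon_{1}$), with all function names hypothetical, and can be used to sanity-check the bound in Eq. (1).

```python
import numpy as np

def noisy_positivity_oracle(p_x, p_xj, eps1, rng):
    """Synthetic O1(x, xj): returns 1 if x is judged at least as positive
    as xj; the answer is correct with probability 1 - eps1."""
    correct = int(p_x >= p_xj)
    return correct if rng.random() > eps1 else 1 - correct

def majority_vote_label(p_x, anchor_probs, eps1, rng):
    """Second step of Algorithm 1: compare x against the t selected
    uncertain points D' and label x positive iff it wins at least
    half of the comparisons."""
    votes = [noisy_positivity_oracle(p_x, p_j, eps1, rng) for p_j in anchor_probs]
    return +1 if np.mean(votes) >= 0.5 else -1

rng = np.random.default_rng(0)
# t = 25 uncertain anchors near the threshold 0.5, oracle noise eps1 = 0.2,
# and a clearly positive point with p(y=+1|x) = 0.9
anchors = rng.uniform(0.45, 0.55, size=25)
labels = [majority_vote_label(0.9, anchors, 0.2, rng) for _ in range(1000)]
print(np.mean(np.array(labels) == 1))  # close to 1, consistent with Eq. (1)
```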

Appendix B Proof of Theorem 3

Proof.

First, we bound the difference between $\hat{f}(x;k)$ and $f(x)$. Similarly to Reeve and Kaban [25], we define $\tilde{f}(x;k)=\mathbb{E}_{p(y|x)}\left[\frac{1}{k}\sum_{q=1}^{k}y_{\tau_{q}(x)}\right]$.

Then we have

\left|\hat{f}(x;k)-f(x)\right|\leq\left|\hat{f}(x;k)-\tilde{f}(x;k)\right|+\left|\tilde{f}(x;k)-f(x)\right|. (4)

For the first term on the right-hand side, Theorem 2 shows it is bounded by $\frac{2\epsilon}{k}$ with probability at least $1-\delta$. For the second term, Lemma 4.1 of [25] shows it is bounded by $\omega\left(\frac{2k}{n}\right)^{\lambda}$ with probability at least $1-\delta^{\prime}$ for any $\delta^{\prime}>0$ and $\frac{n}{2}\geq k\geq 4\log(\frac{1}{\delta^{\prime}})+1$. Combining the two inequalities, the left-hand side is bounded by $\Delta\triangleq\frac{2\epsilon}{k}+\omega\left(\frac{2k}{n}\right)^{\lambda}$ with probability at least $(1-\delta)(1-\delta^{\prime})$. This means that, with at least the same probability, a randomly drawn point from $\mathcal{X}$ falls in the set

\mathcal{X}^{\prime}\triangleq\{x\in\mathcal{X}:|\hat{\eta}(x)-\eta(x)|\leq\Delta\}.

Thus

R(\hat{f})-R(f^{*})
=\int_{\mathcal{X}}\left|\eta(x)-\frac{1}{2}\right|\mathbbm{1}_{\hat{f}(x)\neq f^{*}(x)}\,d\mu(x)
=\int_{\mathcal{X}^{\prime}}\left|\eta(x)-\frac{1}{2}\right|\mathbbm{1}_{\hat{f}(x)\neq f^{*}(x)}\,d\mu(x)\quad\text{(with probability at least }(1-\delta)(1-\delta^{\prime})\text{)}
\leq\int_{\mathcal{X}}\left|\eta(x)-\frac{1}{2}\right|\mathbbm{1}_{\left|\eta(x)-\frac{1}{2}\right|\leq\Delta}\,d\mu(x)
\leq C\Delta^{\alpha+1}.

∎
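For concreteness, here is a minimal sketch of the plug-in estimate $\hat{f}(x;k)$ analyzed above. It is our own illustration under assumptions: Euclidean nearest neighbours, labels already inferred by Algorithm 1, and all function names hypothetical.

```python
import numpy as np

def knn_estimate(x, X, y_hat, k):
    """f_hat(x; k): the average of the inferred labels (in {0, 1}) of
    the k nearest neighbours of x under the Euclidean distance."""
    dists = np.linalg.norm(X - x, axis=1)
    neighbours = np.argsort(dists)[:k]
    return y_hat[neighbours].mean()

def plug_in_classifier(x, X, y_hat, k):
    """Threshold the estimate at 1/2, as in the excess-risk bound above."""
    return 1 if knn_estimate(x, X, y_hat, k) >= 0.5 else 0

# toy 1-D example; y_hat stands in for labels inferred by Algorithm 1
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 1))
y_hat = (X[:, 0] > 0.5).astype(int)
print(plug_in_classifier(np.array([0.8]), X, y_hat, k=15))  # -> 1
```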

Appendix C Proof of Corollary 4

Proof.

Similarly to the approach of Xu et al. [31], we use induction to show that at the end of every step $i$, $\mathrm{P}_{x\sim\mathcal{P}_{\mathcal{X}}}[h(x)\neq h^{*}(x)]\leq 4\epsilon_{i}$ always holds with probability at least $(1-\delta)^{\log(\frac{1}{\epsilon})}$ for a universal $\delta$; this is obvious for $i=0$.

Then, with a slight abuse of notation, we have

\left|\{x\in S_{i}:h(x)\neq h^{*}(x)\}\right| =\left|\{x\in D_{i}:h(x)\neq h^{*}(x)\}\right|
\leq\left|\{x\in D_{i}:h(x)\neq\hat{y}\}\right|+\left|\{x\in D_{i}:h^{*}(x)\neq\hat{y}\}\right|
\leq 2\epsilon_{i}|S_{i}|.

Thus $\mathrm{P}_{x\sim S_{i}}[h(x)\neq h^{*}(x)]=\frac{|\{x\in S_{i}:h(x)\neq h^{*}(x)\}|}{|S_{i}|}\leq 2\epsilon_{i}$. For $c_{0}\in(1,\infty)$ and $\gamma\in(0,1)$, Lemma 3.1 of Hanneke [14] gives $\mathrm{P}_{x\sim\mathcal{P}_{\mathcal{X}}}[h(x)\neq h^{*}(x)]\leq 4\epsilon_{i}$ with probability at least $1-\gamma$, provided that $c_{0}\frac{d\log(\frac{|S_{i}|}{d})+\log(\frac{1}{\gamma})}{|S_{i}|}\leq\epsilon_{i}$. Setting $\gamma=1-(1-\delta)^{\log(2\epsilon)}$, we have $\mathrm{P}_{\mathcal{P}_{\mathcal{X}}}[\hat{h}(x)\neq h^{*}(x)]\leq\epsilon$ with probability at least $(1-\delta)^{\log(\frac{1}{\epsilon})}(1-\delta)^{\log(2\epsilon)}=1-\delta$ at the end of the algorithm. ∎
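To make the epoch structure concrete, here is a schematic Python sketch of the loop analyzed above. It is our own abstraction under stated assumptions: `label_batch` stands in for labeling an epoch's sample with Algorithm 1, `erm_fit` for empirical risk minimization, and the sample-size rule is a crude stand-in for the condition on $|S_{i}|$; none of these names come from the paper.

```python
import numpy as np

def active_learning_loop(pool, label_batch, erm_fit, eps, d, c0=2.0, seed=0):
    """Schematic epoch loop behind Corollary 4 (a sketch, not the paper's
    exact procedure): halve the target error eps_i each epoch, draw a
    sample S_i large enough for c0*(d*log(|S_i|/d)+log(1/gamma))/|S_i|
    <= eps_i, label it via `label_batch` (Algorithm 1 in the paper),
    and refit the hypothesis by ERM."""
    rng = np.random.default_rng(seed)
    eps_i, h = 1.0, None
    while eps_i > eps:
        eps_i /= 2.0
        # crude sample size of order (d / eps_i) * log(1 / eps_i)
        m = min(len(pool), int(np.ceil(c0 * d * np.log(1.0 / eps_i) / eps_i)) + 1)
        S_i = pool[rng.choice(len(pool), size=m, replace=False)]
        y_i = label_batch(S_i)   # would be Algorithm 1 in the paper
        h = erm_fit(S_i, y_i)    # empirical risk minimization
    return h

# toy usage: 1-D threshold concept with a noise-free labeling stand-in
def erm_threshold(S, y):
    t = S[y == 1].min() if (y == 1).any() else np.inf
    return lambda x: (x >= t).astype(int)

pool = np.random.default_rng(2).uniform(0, 1, size=5000)
h = active_learning_loop(pool, lambda S: (S > 0.3).astype(int), erm_threshold,
                         eps=0.01, d=1)
print(h(np.array([0.1, 0.5])))  # -> [0 1]
```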