Rank Aggregation in Crowdsourcing for Listwise Annotations
Abstract.
Rank aggregation through crowdsourcing has recently gained significant attention, particularly in the context of listwise ranking annotations. However, existing methods primarily focus on a single problem and partial ranks, while the aggregation of listwise full ranks across numerous problems remains largely unexplored. This scenario finds relevance in various applications, such as model quality assessment and reinforcement learning with human feedback. In light of practical needs, we propose LAC, a Listwise rank Aggregation method in Crowdsourcing, where the global position information is carefully measured and included. In our design, an especially proposed annotation quality indicator is employed to measure the discrepancy between the annotated rank and the true rank. We also take the difficulty of the ranking problem itself into consideration, as it directly impacts the performance of annotators and consequently influences the final results. To our knowledge, LAC is the first work to directly deal with the full rank aggregation problem in listwise crowdsourcing, and simultaneously infer the difficulty of problems, the ability of annotators, and the ground-truth ranks in an unsupervised way. To evaluate our method, we collect a real-world business-oriented dataset for paragraph ranking. Experimental results on both synthetic and real-world benchmark datasets demonstrate the effectiveness of our proposed LAC method.
1. Introduction
Recently, inferring ranking (Du et al., 2020; Xu et al., 2021; Lin et al., 2018; Liu and Moitra, 2020; Wu et al., 2023, 2021) over a set of items has gained increasing attention due to its wide range of applications, such as information retrieval (Raghavan, 1997), recommendation systems (Das et al., 2017), and RLHF (Reinforcement Learning from Human Feedback) for finetuning large language models (Ouyang et al., 2022). This task aims to train a ranking model in a supervised way (Li, 2011), thereby requiring a large amount of well-annotated data.
Unfortunately, hiring expert annotators is costly, especially when the scale of data is large. To cope with this dilemma, many practitioners resort to crowdsourcing platforms (Hirth et al., 2011; Yu et al., 2023; Garcia-Molina et al., 2016; Zhang and Wu, 2021), such as Amazon Mechanical Turk and CrowdFlower. They distribute ranking problems into multiple sub-tasks, which are then solved by crowdsourced annotators. Aggregation methods are subsequently employed so that the solutions of the sub-tasks can be combined and the true ranks of the original ranking problems can be derived. Among these procedures, the effectiveness of the rank aggregation algorithm is of significant importance for the performance of the final aggregated results.
Existing rank aggregation methods can be roughly classified into three categories according to different forms of annotations: pointwise, pairwise, and listwise. Fig. 1 illustrates the similarities and differences between these forms of annotations. Under the setting of pointwise methods, each sub-task contains only one item, and the annotator assigns scores independently to each item without access to the information of other items. The aggregation methods will then derive one ranking sequence based on these score annotations (Aslam and Montague, 2001). In the pairwise setting, the sub-task contains one pair of items, and the annotator determines the relative ranking between two items. The aggregation methods use the results of compared pairs to form the final rank (Cohen et al., 1997). In the listwise setting, the sub-task contains multiple items, which is a subset of all the items. The annotator is required to provide the rank of this subset, and the aggregation methods aggregate ranks of all subsets into a final rank (Niu et al., 2015). These types of methods find applications in recommendation systems (Baltrunas et al., 2010), information retrieval (Dwork et al., 2001), bioinformatics (Kim et al., 2014), etc. Obviously, although the forms of annotations are different, all these types of methods aim to derive the ground-truth rank of one target ranking sequence.
However, the above forms of annotation may not cover all the rank aggregation problems. As shown in Fig. 1 (b), there also exists another scenario in which we need to deduce the ranks of multiple target sequences based on full rank annotations. Specifically, there are multiple short sequences required to be ranked, with each sub-task containing all items of one sequence. The annotator is required to give a full rank over each assigned sequence. Under this setting, different annotators may label the same target sequence, leading to redundant annotations. The aggregation method is therefore expected to infer the ground-truth rank of each sequence simultaneously. Such a setting can be applied to tasks such as model quality assessment (Chang et al., 2023) and reinforcement learning with human feedback (RLHF) (Ouyang et al., 2022). To be specific, in the model quality assessment task, the outputs of several models are evaluated in each round, and these outputs form a single sequence (as in Fig. 1 (b)). Similarly, in RLHF, the outputs of a large language model are sampled, forming a sequence to be ranked by human annotators. Table 1 further outlines the differences between various aggregation tasks and their applications, highlighting that listwise full rank aggregation aims to handle multiple short sequences with full rank annotations.
Scenarios | Multiple sequences? | Short sequences? | Partial annotations for each sequence? | Typical applications
pointwise rank aggregation | ✗ | ✗ | ✓ | 1) recommendation systems (Baltrunas et al., 2010) 2) information retrieval (Dwork et al., 2001) 3) bioinformatics (Kim et al., 2014)
pairwise rank aggregation | ✗ | ✗ | ✓ |
listwise partial rank aggregation | ✗ | ✗ | ✓ |
listwise full rank aggregation | ✓ | ✓ | ✗ | 1) model quality assessment (Chang et al., 2023) 2) reinforcement learning with human feedback (Ouyang et al., 2022)
Various methods have been proposed to tackle pointwise, pairwise, and listwise rank aggregation problems. The pointwise methods often minimize a specific distance between the collected scores and the aggregated ones (Fagin et al., 2003), while the pairwise methods usually model the pairwise probability between items (Montague and Aslam, 2002). Besides, the listwise partial rank aggregation methods often convert the aggregation task to the pairwise one with the consideration of uncertainty (Niu et al., 2013). Although some of these methods can be adapted to full rank annotations by solving each problem independently, they fail to characterize the relationship between different problems and cannot capture the global information in full rank annotations. Therefore, in this paper, we present a specialized study for listwise full rank aggregation to fill the gaps in research and practice.
The full rank aggregation task is intrinsically difficult since different sequences have different levels of difficulty, and the ability of annotators can be different. Therefore, explicit modeling of the ability of annotators and the difficulty of problems is desired, which is usually a missing concern in previous methods. Specifically, the ability of an annotator is supposed to reflect the extent of uncertainty associated with the rank position of each candidate item. Therefore, the probability that the annotator flips every pair of candidate items should be considered. Meanwhile, the difficulty of problems can also be represented by multiple probabilities, measuring whether each pair of items is prone to confusion. In addition, from a global point of view, in the final rank list, the nearby items are generally much more likely to be mispositioned than those with a more significant gap in position. Therefore, the ability of annotators and the difficulty of problems should be dynamically adjusted according to the gap in position.
To this end, we propose a Listwise rank Aggregation method in Crowdsourcing (“LAC” for short hereinafter), to deal with the listwise input in crowdsourcing. To the best of our knowledge, this method represents the first endeavor specifically addressing the task of full rank aggregation in crowdsourcing. In our method, for the ability of annotators and the difficulty of problems, two sets of confusion matrices are employed to estimate the degree of confusion for each pair of items. Then, the distance between positions is carefully defined to integrate the relative positional information between two items. Considering the unknown latent variables of the ability of annotators, the difficulty of problems, and the true rank, we derive the log-likelihood of the observations and maximize it with the Expectation Maximization (EM) method. Specifically, in the E-step, based on the estimated values of latent variables derived in the previous step, we obtain the conditional expectation of observations over the underlying rank distribution. In the M-step, the latent variables and the ground-truth ranks are estimated respectively via maximizing the expectation calculated in the E-step.
In experiments, we propose to synthesize ranking datasets with the consideration of five essential factors, including number of problems, length of rank, number of annotators, ability of annotators, and the annotation ratio, so that the performance in different scenarios can be investigated. Furthermore, a real-world dataset with a total of 5,981 listwise ranking annotations from 25 annotators is collected to examine the performance of LAC.
Our contributions are threefold:
• We focus on the underexplored problem of listwise full rank aggregation in crowdsourcing and propose LAC as a new rank aggregation method.
• Algorithmically, the ability of annotators, the difficulty of problems, and the ground-truth ranks are respectively modeled as latent variables, and we propose to use the EM method to deduce their optimal values iteratively.
• Experimentally, we simulate synthetic datasets with a comprehensive consideration of five essential factors to explore the performance of LAC. We also present, for the first time, a real-world dataset under the listwise full rank aggregation setting to benchmark previous methods. Experimental results on both synthetic and real-world datasets demonstrate the superiority of LAC over existing methods in most scenarios.
2. Related Work
In this section, we introduce some related works, including the background of crowdsourcing and the existing rank aggregation methods.
2.1. Crowdsourcing
There are three critical concerns in crowdsourcing, i.e., cost control (Li et al., 2016), latency control (Zeng et al., 2018), and quality control (Allahbakhsh et al., 2013; Yang et al., 2024).
Cost control. Crowdsourcing may be expensive when dealing with a large number of tasks and instances. In order to alleviate this issue, several cost-control techniques have been proposed. These techniques include removing unnecessary tasks and picking up valuable tasks (task pruning (Wang et al., 2012)), ranking and prioritizing valuable tasks (task selection (Guo et al., 2012)), deducing answers for the candidate tasks based on the feedback data (answer deduction (Wang et al., 2013)), and sampling tasks based on some specific criteria for crowdsourcing.
Latency control. Crowdsourcing for answering tasks may suffer from excessive latency due to the unavailability of annotators, the difficulty of tasks, and insufficient appeal to annotators. Therefore, latency control is needed. Two representative models for latency control have been proposed, namely the round model (Sarma et al., 2014) and the statistical model (Yan et al., 2010). Here, the round model arranges tasks to be published in many rounds and models the overall latency as the number of rounds. In contrast, the statistical model uses feedback data to build models that capture the arrival and completion times of annotators, allowing better prediction and adjustment of the expected latency.
Quality control. Crowdsourcing may produce low-quality or even incorrect answers due to annotators’ varying levels of expertise. Therefore, quality control is crucial. The ability of annotators can be modeled and controlled through several methods, including eliminating low-quality annotators (Ipeirotis et al., 2010), aggregating answers from multiple annotators (Cao et al., 2012), and assigning tasks to appropriate annotators based on their skills and experience (Zhao et al., 2015). How to infer the true labels from multiple noisy labels is a critical problem in quality control. The most seminal work is DS (Dawid and Skene, 1979), which uses a confusion matrix to represent the quality of a crowdsourcing annotator. Several works are derived directly from the DS algorithm. For example, (Raykar et al., 2010) simplifies the parameters of the annotators, (Venanzi et al., 2014) constrains the annotators in certain aspects, and (Zhang et al., 2014) optimizes the initial settings. In addition, another stream of approaches models the quality of an annotator using fewer parameters (Bi et al., 2014) while adding auxiliary parameters to model the annotator’s properties, such as bias (Kamar et al., 2015) and intention (Bi et al., 2014). All of them are based on probabilistic models and can be solved via an EM algorithm with gradient descent. Some of them (Bi et al., 2014; Dawid and Skene, 1979) can be applied to multi-class scenarios, while others (Raykar et al., 2010; Karger et al., 2011) are only suitable for binary classification problems. Unfortunately, none of these techniques can handle the listwise full rank aggregation problem. Next, we introduce some methods to aggregate multiple ranks.
2.2. Rank Aggregation Methods
According to various input forms, the existing ranking methods can be roughly classified into three categories, namely pointwise, pairwise, and listwise methods.
Pointwise. Two well-known pointwise methods, Borda Count (Aslam and Montague, 2001) and Median Rank (Fagin et al., 2003), are often adopted to obtain suitable rankings. Specifically, Borda Count minimizes the average Spearman Rank Coefficient, and Median Rank minimizes the average Spearman Footrule Distance between the true rank and each input.
Pairwise. Pairwise rank aggregation methods organize their ranking inputs in pairs and optimize the objective function or ranking functions accordingly (Dong et al., 2020). For example, BradleyTerry (Thurstone, 1927) defines the pairwise probability based on the BradleyTerry model and then optimizes the likelihood function by gradient descent. Afterwards, GreedyOrder (Cohen et al., 1997) focuses on minimizing the cost of pairwise disagreement in a tournament to infer the true rank. In contrast, CondorcetFuse (Montague and Aslam, 2002) builds a Condorcet Graph by majority voting and obtains a Hamiltonian path from the graph by QuickSort, and SVP (Gleich and Lim, 2011) minimizes the nuclear norm of a pairwise preference matrix by rank-2 factorization. However, pairwise methods usually cannot capture the global information over items.
Listwise. Unlike the pointwise and pairwise methods, listwise rank aggregation methods treat the ranking inputs in a listwise way to emphasize the importance of positions. Previous listwise methods usually aim at the aggregation of partial ranks (Wu et al., 2016). Typical methods include Plackett-Luce (Guiver and Snelson, 2009), St.Agg (Niu et al., 2013), and CrowdAgg (Niu et al., 2015). Among them, Plackett-Luce extends the Plackett-Luce model (Plackett, 1975) and defines the similarity of ranks based on generative probability. St.Agg incorporates uncertainty into the aggregation process and introduces a prior rank distribution. Besides, CrowdAgg further takes the quality of annotators into consideration. Unfortunately, the above methods primarily focus on partial ranks, where only part of a single long sequence is required to be ranked by each annotator (see Fig. 1). Meanwhile, studies that directly deal with full ranks over the items are largely absent. Since such a setting is of great significance in real-world applications, we undertake a careful study of the listwise full rank aggregation problem.
3. The Proposed Method
3.1. Preliminary
In this section, we formalize the listwise full rank aggregation task in crowdsourcing. Suppose that there are $N$ problems (sequences) in total and $M$ crowdsourced annotators. The dataset with ground-truth ranks is denoted by $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the $i$-th problem and $y_i$ is its underlying true rank. Each problem posted on the crowdsourcing platform is associated with $K$ items, and the selected annotators provide ranks over these items based on their knowledge. The ground-truth rank and the ranks provided by the selected annotators are permutations of the items in $x_i$. All the true ranks of the entire dataset are denoted by $Y=\{y_i\}_{i=1}^{N}$. Let $a_j$ be the $j$-th annotator. The rank given by $a_j$ for the problem $x_i$ is denoted by $r_i^j$, and all ranks provided for this problem are denoted by $R_i$. Then the observed dataset is denoted by $\mathcal{R}=\{R_i\}_{i=1}^{N}$. Obviously, the resulting annotations for each problem consist of noisy repeated ranks. Therefore, our task is to deduce the ground-truth rank $y_i$ for each problem based on the noisy ranks $R_i$. To this end, we use $\pi$ and $\sigma$ to represent ranked lists and define $\pi^{-1}(v)$ as the index of item $v$ in list $\pi$, so that $\pi(t)$ indicates the item in position $t$ of list $\pi$. The main mathematical notations that will be used later for the algorithm description are listed in Table 2, and a minimal data-layout sketch is given after the table.
Notation | Interpretation
$N$ | the number of problems / sequences.
$M$ | the number of annotators.
$K$ | the number of items in a single problem.
$x_i$ | the $i$-th problem.
$y_i$ | the unknown ground-truth rank of problem $x_i$.
$a_j$ | the $j$-th crowdsourced annotator.
$r_i^j$ | the rank of problem $x_i$ given by annotator $a_j$.
$A^j$ | the confusion matrix that models the ability of the annotator $a_j$.
$E^i$ | the confusion matrix that models the difficulty of the problem $x_i$.
$\theta_{y_i}$ | the likelihood for the true rank $y_i$.
$A^j_{u,v}$ | the probability of item $u$ being confused with item $v$ by annotator $a_j$.
$E^i_{u,v}$ | the probability of confusion between item $u$ and item $v$ in problem $x_i$.
$\pi$ | a ranked list.
$Y$ | the list containing all the ground-truth ranks.
$\mathcal{R}$ | the noisily ranked dataset collected from the crowdsourcing platform.
$[K]$ | the set $\{1, 2, \dots, K\}$.
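To make the input format concrete, the following minimal sketch shows one way a listwise full rank aggregation dataset could be organized in code. The variable names and the toy values are our own illustrative assumptions, not part of any released data.

```python
# A minimal, assumed data layout for listwise full rank aggregation: every problem
# holds all K of its items, and several annotators each provide a full permutation.
K = 3                                    # items per problem (a short sequence)
items = {0: ["s1", "s2", "s3"],          # problem_id -> the K items to be ranked
         1: ["t1", "t2", "t3"]}
annotations = {                          # problem_id -> {annotator_id -> full rank over item indices}
    0: {"a1": (0, 1, 2), "a2": (1, 0, 2), "a3": (0, 1, 2)},
    1: {"a1": (2, 0, 1), "a3": (2, 1, 0)},   # not every annotator labels every problem
}
# The aggregation task: recover one ground-truth permutation per problem from
# these noisy, redundant full ranks.
```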
3.2. Listwise Rank Aggregation in Crowdsourcing
To tackle the task introduced above, we propose a novel probabilistic generative model termed LAC. Unlike previous methods that may only consider the quality of annotators (Niu et al., 2015), we also incorporate the difficulty of problems into our model, which can further capture the intrinsic property of the aggregation task. Intuitively, it is reasonable that a crowdsourced annotator with stronger ability performs better when provided with easier problems, and vice versa. Therefore, we introduce a confusion matrix $A^j$ for the annotator $a_j$ and a confusion matrix $E^i$ for the problem $x_i$. Here, $A^j$ explicitly characterizes the ability of the $j$-th annotator, namely, its $(u,v)$-th element $A^j_{u,v}$ denotes the probability of item $u$ being confused with item $v$ by $a_j$. Similarly, $E^i$ explicitly characterizes the difficulty of the $i$-th problem, namely, its $(u,v)$-th element $E^i_{u,v}$ denotes the probability of item $u$ being confused with item $v$ in $x_i$. To be clear, suppose that the ground-truth rank of a problem places one item ahead of another, while the annotation of $a_j$ swaps these two items; the corresponding entry of $A^j$ then represents the probability that this pair is confused by annotator $a_j$, and the corresponding entry of $E^i$ represents the probability that this pair is confused in problem $x_i$. Examples of $A^j$ and $E^i$ are given by
(1) |
and
(2) |
respectively. The probabilistic graphical model is illustrated in Fig. 2, and it involves three groups of latent variables: the ability matrices of annotators, the difficulty matrices of problems, and the ground-truth ranks. Therefore, our target is transformed into deriving the most reasonable values of these latent variables to maximize the likelihood of all observations, and then inferring the possible ranks based on the optimal values.
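As an illustration of the two kinds of confusion matrices, the sketch below instantiates a small ability matrix and a small difficulty matrix with made-up numbers; only the row-stochastic structure (each row sums to one) follows the construction described later in Section 4.2, and all concrete values are ours.

```python
import numpy as np

# Ability matrix of one annotator: entry (u, v) is the probability that item u
# (as placed by the true rank) is confused with item v by this annotator.
A_j = np.array([[0.80, 0.10, 0.10],
                [0.10, 0.70, 0.20],
                [0.10, 0.20, 0.70]])
# Difficulty matrix of one problem: entry (u, v) is the probability that items u
# and v are confused within this problem, regardless of who annotates it.
E_i = np.array([[0.90, 0.05, 0.05],
                [0.05, 0.85, 0.10],
                [0.05, 0.10, 0.85]])
assert np.allclose(A_j.sum(axis=1), 1) and np.allclose(E_i.sum(axis=1), 1)
```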
Likelihood of the observed dataset. Suppose that the ground-truth rank of each problem is independently drawn from a multinomial distribution whose parameters are the prior probabilities of the $K!$ possible permutations over all items. Here, the prior probability of each permutation is defined as
(3) |
Since each problem is independently annotated, the likelihood of the observed dataset can be factorized as
(4) |
which is governed by the parameter set consisting of the prior probabilities, the ability matrices, and the difficulty matrices. Subsequently, the likelihood of each annotation can be derived as
(5) |
Since our input is in listwise form, the global ranking information can be readily obtained. It is crucial to utilize this information to calculate the distance between the annotated rank and the possible true rank. In view of this point, we calculate the likelihood that an item is wrongly ranked at another position in a position-wise way, where we multiply by the reciprocal of the positional distance to penalize ranks that are farther away. Therefore, the posterior probability is defined as
(6) |
where the distance term measures the gap between the position of an item in the annotated rank and its position in the candidate true rank. For brevity, in the following derivation, we denote
(7) |
To enhance clarity, here we introduce an example to show the rationale of the position-wise distance. Suppose there are three items and the ground-truth rank is (a, b, c). Assume that the first annotator gives the rank (b, a, c), and the second one gives the rank (c, b, a). For simplicity, we only take the first position of each rank as an example. For the first annotator, the distance in the first position is 1 (because the index of b in the ground-truth rank differs from the first position by 1). But for the second annotator, the distance in the first position is 2 (because the index of c in the ground-truth rank differs from the first position by 2). It is evident that, for the first position, the first annotation is better than the second one. Therefore, it is reasonable to take the distance as a penalty.
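The position-wise distance used above can be computed as follows. This is a small illustrative helper (the function name is ours) rather than the full likelihood of Eq. (6), which additionally multiplies the confusion probabilities by the reciprocal of this distance.

```python
def position_distance(true_rank, annotated_rank, pos):
    """Gap between the index (in the true rank) of the item the annotator put at
    position `pos` and the position itself."""
    item = annotated_rank[pos]
    return abs(true_rank.index(item) - pos)

truth = ("a", "b", "c")
print(position_distance(truth, ("b", "a", "c"), 0))   # 1: 'b' truly sits one position away
print(position_distance(truth, ("c", "b", "a"), 0))   # 2: 'c' truly sits two positions away
```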
Subsequently, the log-likelihood of the noisy repeated listwise rank can be represented as
(8) |
where the posterior probability is defined in Eq. (6).
3.3. Optimization with EM Algorithm
To effectively seek the latent variables that maximize the log-likelihood defined in Eq. (8), we employ the prevalent Expectation Maximization (EM) algorithm (Bishop, 2006), which estimates the prior probabilities, the elements of the confusion matrices, and the ground-truth ranks iteratively (a schematic sketch of the overall loop is provided at the end of this subsection).
E-step. We first construct an expectation function of the log-likelihood of the dataset with respect to the latent variables as follows:
(9) |
where the latent variables take the values estimated in the previous step.
Afterwards, we calculate the expectation over the possible ground-truth ranks, namely
(10)
M-step. The prior probabilities, the ability matrices, and the difficulty matrices are updated to maximize the expectation function. In the first step, we expand this function using Bayes’ Theorem, namely
(11)
Because the prior probability is a constant with respect to the other parameters when taking derivatives, we only need to maximize the remaining term. Consequently, Eq. (11) can be reformulated as
(12)
Each group of parameters is then updated in turn to maximize this function.
Update on the prior probabilities
Here, we use the Lagrange multiplier to find the optimal value in Eq. (12), and we construct the function
(13) |
Then, we set the partial derivative to zero, i.e.,
(14) |
By applying the sum-up-to-one condition, we can solve for the Lagrange multiplier. Subsequently, plugging it into Eq. (14), we can obtain the desired estimation, which is given by
(15) |
Update on the ability matrices of annotators
Update on the difficulty matrices of problems
By similarly using the Lagrange multiplier to optimize the corresponding term in Eq. (12), we construct the function
(19) |
and then we set the partial derivative to zero. Applying the sum-up-to-one condition, we have
(20) |
With this condition, we can solve for the Lagrange multiplier. By plugging it into Eq. (20), we obtain the corresponding estimation, given by
(21) |
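To make the overall procedure concrete, the following Python sketch outlines one way the EM loop could be organized. It is a simplified reading rather than the exact implementation of Eqs. (4)-(21): the function name and initialization are ours, and the M-step below uses plain expected-count normalization as a stand-in for the closed-form updates derived above.

```python
# Schematic EM sketch for LAC-style listwise full rank aggregation (assumptions noted above).
import itertools
import numpy as np

def lac_em_sketch(annotations, K, n_iters=50):
    """annotations: dict {problem_id: {annotator_id: tuple(rank of item ids 0..K-1)}}."""
    perms = list(itertools.permutations(range(K)))           # all K! candidate true ranks
    problems = sorted(annotations)
    annotators = sorted({a for r in annotations.values() for a in r})
    theta = np.full(len(perms), 1.0 / len(perms))             # prior over permutations
    A = {a: np.full((K, K), 1.0 / K) + 0.5 * np.eye(K) for a in annotators}   # ability
    E = {p: np.full((K, K), 1.0 / K) + 0.5 * np.eye(K) for p in problems}     # difficulty
    for a in annotators: A[a] /= A[a].sum(1, keepdims=True)
    for p in problems:   E[p] /= E[p].sum(1, keepdims=True)

    def ann_loglik(true_rank, ann_rank, Aj, Ei):
        # position-wise likelihood with a reciprocal-distance penalty in the spirit of Eq. (6);
        # the +1 keeping the penalty finite for correct positions is our assumption
        ll = 0.0
        for pos, item in enumerate(ann_rank):
            dist = abs(true_rank.index(item) - pos) + 1
            ll += np.log(Aj[true_rank[pos], item] * Ei[true_rank[pos], item] / dist + 1e-12)
        return ll

    for _ in range(n_iters):
        # E-step: posterior over candidate true ranks for every problem
        resp = {}
        for p in problems:
            logpost = np.log(theta + 1e-12)
            for l, cand in enumerate(perms):
                for a, r in annotations[p].items():
                    logpost[l] += ann_loglik(cand, r, A[a], E[p])
            post = np.exp(logpost - logpost.max())
            resp[p] = post / post.sum()
        # M-step: re-estimate prior and confusion matrices from expected counts
        theta = sum(resp[p] for p in problems) / len(problems)
        newA = {a: np.full((K, K), 1e-6) for a in annotators}
        newE = {p: np.full((K, K), 1e-6) for p in problems}
        for p in problems:
            for l, cand in enumerate(perms):
                w = resp[p][l]
                for a, r in annotations[p].items():
                    for pos, item in enumerate(r):
                        newA[a][cand[pos], item] += w
                        newE[p][cand[pos], item] += w
        A = {a: m / m.sum(1, keepdims=True) for a, m in newA.items()}
        E = {p: m / m.sum(1, keepdims=True) for p, m in newE.items()}
    return {p: perms[int(np.argmax(resp[p]))] for p in problems}, A, E
```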
3.4. Complexity Analysis
In this section, we analyze the computational complexity of the proposed LAC method.
Since our method can be divided into the E-step and the M-step, we first analyze the complexity of the E-step in Algorithm 1. We use the annotation ratio to denote the fraction of annotators required to provide ranks for each problem. Therefore, for each problem, only this fraction of the annotators give a full rank, and the cost of the E-step in Eq. (10) grows with the number of problems, the number of annotators assigned to each problem, and the number of candidate permutations. In the M-step, the updates of the prior probabilities, the ability matrices, and the difficulty matrices each require one pass over the expected counts computed in the E-step. Therefore, taking all the above results into consideration, the overall cost of LAC is the per-iteration cost multiplied by the total number of iterations. From the above analysis, we conclude that our algorithm is applicable to cases with a moderate number of problems and a relatively small length for each problem. Here, we take the model quality assessment task as an example: there are often only a few models to be compared (usually three to seven) (Ouyang et al., 2022), rendering the complexity of our LAC acceptable.
4. Experiments
In this section, we conduct intensive experiments on both synthetic and real-world datasets to demonstrate the superiority of the proposed method. The implementation code of LAC can be found at https://anonymous.4open.science/r/LAC-B871.
4.1. Experimental Setting
The target of rank aggregation is to correctly infer the truth at each position. Therefore, we utilize a position-wise accuracy metric to evaluate the overall performance, defined as
(23) |
where a position is counted as correctly predicted if the aggregated result places the same item as the true rank at that position.
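A minimal sketch of the position-wise accuracy in Eq. (23) as we read it: the fraction of positions, over all problems, at which the aggregated rank and the ground-truth rank place the same item. The function name is ours.

```python
def position_wise_accuracy(aggregated, ground_truth):
    """aggregated, ground_truth: dicts {problem_id: sequence of item ids}."""
    correct = total = 0
    for pid, true_rank in ground_truth.items():
        for pos, item in enumerate(true_rank):
            correct += int(aggregated[pid][pos] == item)
            total += 1
    return 100.0 * correct / total   # reported as a percentage, as in Tables 3-7
```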
To validate the effectiveness of the proposed LAC method, we choose five baseline methods for comparison. Specifically, the pairwise methods adopted are BradleyTerry (Thurstone, 1927), CondorcetFuse (Montague and Aslam, 2002), and CoarsenRank (Pan et al., 2022). Besides, the listwise rank aggregation methods adopted are St.Agg (Niu et al., 2013) and CrowdAgg (Niu et al., 2015), which are originally proposed for the partial rank aggregation scenarios. Therefore, to accommodate them to our setting, we apply these algorithms to each problem independently. Here, we provide a brief introduction of the adopted baseline methods:
• BradleyTerry (Thurstone, 1927). It is a pairwise method that estimates the probability of the superior item within each pairwise comparison.
• CondorcetFuse (Montague and Aslam, 2002). It is a pairwise method, which constructs the Condorcet Graph over the items and derives the final rank via a Hamiltonian path.
• CoarsenRank (Pan et al., 2022). It is specifically designed for mild model misspecification, and it assumes that the ideal preferences exist in a neighborhood of the actual preferences. Afterwards, it performs regular rank aggregation directly over a neighborhood of the preference set.
• St.Agg (Niu et al., 2013). It is a listwise method, which incorporates uncertainty into the aggregation process via introducing a prior distribution on ranks. It then transforms the ranking functions to their expectations over this distribution.
• CrowdAgg (Niu et al., 2015). It is an extension of the St.Agg method introduced above, where the annotator’s quality information is further incorporated into the definition of the rank distribution.
For the compared methods, hyperparameters are set as suggested in the original papers. For example, the hyperparameter in CrowdAgg is set to 0.95, as recommended by (Niu et al., 2013), and the rank measure follows the original setting. For St.Agg (Niu et al., 2013), the ranking function is obtained by incorporating the rank distribution into the mean position function. Moreover, for CoarsenRank (Pan et al., 2022), the optimal hyperparameter is determined by the Deviance Information Criterion (Spiegelhalter et al., 1998).
4.2. Data Synthetic Procedure
In this section, we introduce the generation of the synthetic datasets in detail, including the construction of the confusion matrices and the generation process of the biased ranks provided by each annotator.
To this end, we now take one annotator as an example. First, for a given quality of annotators (denoted by $q$) and a specific position, we select a random scalar uniformly in the range of $[q, 1]$, which serves as the corresponding diagonal element of this annotator's quality matrix. This value corresponds to the probability that the annotator gives the correct rank at that position. Subsequently, the remaining positive values in the same row can be selected randomly, provided that the whole row sums to one. For clarity, given a specific quality value, a feasible quality matrix of an annotator is given by
(24) |
The rationale for such a construction is that a higher quality value corresponds to a superior annotator ability. Therefore, such an annotator is more likely to give correct ranks at each position. Reflected in the transition matrix, the diagonal values should be relatively large, which is indeed satisfied by our construction. Note that we introduce randomness for these diagonal values to ensure discrepancies between different annotators.
Subsequently, we elaborate on the generation process of the biased ranks based on the constructed quality matrix. The biased rank is blank in the beginning, and the ground-truth rank is a random permutation. We iterate through the positions sequentially. For a specific position, we sample a candidate item according to the probabilities in the corresponding row of the quality matrix; the sampling is repeated until the drawn item has not yet been placed in the biased rank, and this item is then put in the current position. Finally, the item in the last position is selected so that the biased rank forms a permutation over all the items. Therefore, we obtain one biased rank, and the other ranks are generated in the same manner.
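To make the procedure concrete, here is a small Python sketch of the data-synthesis steps as we read them; the function names and concrete parameter choices are ours, and the sampling follows the description above.

```python
import numpy as np

def make_quality_matrix(K, q, rng):
    Q = np.zeros((K, K))
    for s in range(K):
        Q[s, s] = rng.uniform(q, 1.0)                 # probability of a correct rank at position s
        off = rng.random(K - 1)
        off = off / off.sum() * (1.0 - Q[s, s])       # remaining mass spread over the other items
        Q[s, np.arange(K) != s] = off
    return Q

def sample_biased_rank(true_rank, Q, rng):
    K = len(true_rank)
    biased = []
    for pos in range(K - 1):
        while True:
            # draw an item according to the row of the quality matrix for this position
            cand = true_rank[rng.choice(K, p=Q[pos])]
            if cand not in biased:
                biased.append(cand)
                break
    biased.append(next(v for v in true_rank if v not in biased))  # last item completes the permutation
    return tuple(biased)

rng = np.random.default_rng(0)
Q = make_quality_matrix(K=5, q=0.5, rng=rng)
true_rank = tuple(rng.permutation(5))
print(sample_biased_rank(true_rank, Q, rng))
```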
4.3. Experiments on Synthetic Datasets
In this part, we compare the performance of our LAC with the baseline methods under various predefined conditions on the synthetic datasets introduced in the previous section.
There are mainly five factors associated with the synthetic datasets, i.e., the number of problems, the length of each rank, the number of annotators, the ability of annotators, and the annotation ratio. Here, the annotation ratio controls the fraction of annotators required to provide ranks for each problem. Notably, we fix these parameters to a default configuration with a moderate number of problems and a suitable number of items to be ranked, since our approach targets such cases. Unless otherwise stated, this configuration is used throughout all our experiments.
Performance on problems with different lengths. We select the number of items in each problem from the set {3, 4, 5, 6, 7}. Other parameters for the dataset are fixed at the default values mentioned above. The accuracies, as well as the standard deviations over five independent trials of all methods, are shown in Table 3, where the best record under each setting is in bold, and the second best record is underlined (this presentation remains consistent throughout this paper). This table reveals that our LAC achieves relatively stable performance, demonstrating its superiority over the baseline methods across different problem lengths. It is worth noting that our LAC outperforms CrowdAgg by more than 4.7% under all tested lengths, which suggests that explicitly modeling the annotators can obtain more satisfactory performance than tackling each problem independently.
Performance on different numbers of examples. We choose the number of examples within the set {100, 200, 300, 400, 500}. The performance comparison of all methods is shown in Table 4. Apparently, with the increase in the number of examples, our LAC still performs better than other baseline methods, which shows the excellence of LAC on various scales of the ranking problem.
Performance on different annotation ratios. We select the annotation ratio from the set {0.3, 0.4, 0.5, 0.6, 0.7}. The experimental results of all methods are shown in Table 5. As revealed in this table, when the annotation ratio is very small, our LAC outperforms all the baseline methods by a large margin. Meanwhile, in all other cases, LAC shows superiority over the compared methods, especially when the annotation ratio is relatively large.
Performance on different numbers of annotators. We pick the number of annotators from the set {10, 12, 14, 16, 18} for each method. The detailed results are shown in Table 6. As shown in the results, LAC consistently outperforms other methods in all tested scenarios, indicating its robust aggregation ability. Notably, with an increase in the number of annotators, our LAC generally obtains more accurate predictions. Moreover, our method is potentially applicable to scenarios with a small number of annotators, since our LAC surpasses the second best method CondorcetFuse by 13.90% when only ten annotators are available.
Performance on different abilities of annotators. We finally evaluate the performance of our LAC on different abilities of annotators, and thus we select the ability parameter from the set {0.1, 0.3, 0.5, 0.7, 0.9}. The performance comparison is shown in Table 7. This table reveals that, as the annotator's ability decreases to 0.1, LAC exhibits superior performance relative to other methods by a significant margin, which can be attributed to the characterization of the annotator's ability matrices. It further justifies that our method can handle low-quality annotations at different levels.
Length of rank () | 3 | 4 | 5 | 6 | 7 |
BradleyTerry (Thurstone, 1927) | 80.91 4.23 | 79.50 5.79 | 74.17 3.81 | 76.48 2.23 | 77.54 4.69 |
CondorcetFuse (Montague and Aslam, 2002) | 73.94 4.28 | 75.65 4.40 | 75.62 4.87 | 79.70 1.90 | 82.89 2.15 |
CoarsenRank (Pan et al., 2022) | 67.05 7.96 | 69.49 7.30 | 67.03 3.92 | 71.83 3.34 | 73.80 4.60 |
St.Agg (Niu et al., 2013) | 77.95 7.53 | 60.65 3.58 | 71.57 2.46 | 59.40 0.52 | 77.08 3.59 |
CrowdAgg (Niu et al., 2015) | 81.38 5.90 | 63.53 3.20 | 72.64 2.26 | 64.47 0.59 | 78.49 3.44 |
LAC | 86.08 3.84 | 88.09 4.53 | 89.20 3.73 | 92.84 1.57 | 94.72 2.43 |
Number of problems () | 100 | 200 | 300 | 400 | 500 |
BradleyTerry (Thurstone, 1927) | 81.08 4.27 | 79.74 3.52 | 81.25 3.75 | 80.25 3.42 | 81.53 2.48 |
CondorcetFuse (Montague and Aslam, 2002) | 75.72 3.33 | 77.38 3.11 | 76.10 3.27 | 77.06 2.77 | 75.88 3.27 |
CoarsenRank (Pan et al., 2022) | 64.76 3.77 | 66.24 5.31 | 66.08 5.06 | 67.13 4.68 | 67.03 3.92 |
St.Agg (Niu et al., 2013) | 68.96 4.00 | 71.32 4.66 | 70.93 4.99 | 71.50 4.75 | 71.20 4.00 |
CrowdAgg (Niu et al., 2015) | 70.56 3.81 | 72.58 3.94 | 72.36 4.28 | 73.15 4.35 | 72.72 3.14 |
LAC | 95.96 1.71 | 96.04 2.08 | 96.14 1.92 | 96.03 2.32 | 96.33 1.91 |
Annotation ratio () | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
BradleyTerry (Thurstone, 1927) | 74.17 3.81 | 78.47 2.97 | 81.53 2.48 | 83.89 2.56 | 85.09 3.19 |
CondorcetFuse (Montague and Aslam, 2002) | 75.62 4.87 | 76.56 2.28 | 75.88 3.27 | 76.48 3.06 | 76.00 3.78 |
CoarsenRank (Pan et al., 2022) | 67.03 3.92 | 70.23 3.71 | 72.57 2.42 | 76.62 2.79 | 78.31 3.22 |
St.Agg (Niu et al., 2013) | 71.20 4.00 | 72.88 2.94 | 78.88 3.03 | 82.26 2.80 | 84.98 3.27 |
CrowdAgg (Niu et al., 2015) | 72.72 3.14 | 77.47 3.75 | 81.36 3.62 | 86.17 2.18 | 87.02 3.41 |
LAC | 89.20 3.73 | 93.00 2.09 | 96.33 1.91 | 97.88 1.07 | 98.60 0.91 |
Number of annotators () | 10 | 12 | 14 | 16 | 18 |
BradleyTerry (Thurstone, 1927) | 74.06 1.93 | 72.22 1.94 | 79.29 0.57 | 79.34 1.43 | 81.72 0.62 |
CondorcetFuse (Montague and Aslam, 2002) | 76.85 2.42 | 76.64 1.23 | 78.30 0.59 | 76.34 0.45 | 75.42 1.68 |
CoarsenRank (Pan et al., 2022) | 68.05 1.68 | 66.41 1.61 | 71.41 0.38 | 70.96 1.66 | 73.38 1.04 |
St.Agg (Niu et al., 2013) | 71.57 2.46 | 70.25 1.98 | 72.89 1.52 | 73.37 2.41 | 79.40 2.08 |
CrowdAgg (Niu et al., 2015) | 72.64 2.26 | 71.35 2.30 | 77.75 1.73 | 77.39 2.40 | 82.52 1.73 |
LAC | 90.75 1.01 | 90.50 0.74 | 95.12 0.61 | 94.37 0.80 | 96.85 1.15 |
Ability of annotators () | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
BradleyTerry (Thurstone, 1927) | 53.82 5.52 | 64.29 4.84 | 74.17 3.81 | 84.74 2.79 | 94.82 0.62 |
CondorcetFuse (Montague and Aslam, 2002) | 56.90 6.22 | 65.77 4.90 | 75.62 4.87 | 86.82 3.39 | 95.35 0.71 |
CoarsenRank (Pan et al., 2022) | 47.55 5.16 | 56.81 3.95 | 67.03 3.92 | 79.12 4.57 | 91.64 1.44 |
St.Agg (Niu et al., 2013) | 51.12 5.44 | 61.22 4.06 | 71.20 4.00 | 82.68 3.82 | 93.50 1.55 |
CrowdAgg (Niu et al., 2015) | 54.32 5.82 | 63.41 3.71 | 72.72 3.14 | 83.26 3.67 | 93.79 1.46 |
LAC | 68.41 7.54 | 79.85 5.28 | 89.20 3.73 | 96.32 1.91 | 99.60 0.19 |
Estimation error on the ability matrices of annotators. Since LAC explicitly models the annotators’ quality via confusion matrices, we also carry out experiments to verify the effectiveness of our method in estimating such matrices. To this end, we analyze the estimation errors quantitatively, and we propose to use the following metric to calculate the overall estimation error for a given quality level $q$:
(25) |
where $\hat{A}^j$ is the estimated ability matrix and $A^j$ is the ground-truth matrix for the annotator $a_j$. The estimation errors under various quality levels are shown in Fig. 3. From this figure, we identify that our estimations achieve small errors in most cases, demonstrating the effectiveness of the EM steps in finding the optimal values of the latent variables. Note that the estimation error can exhibit an increasing trend as $q$ increases. This can be attributed to the fact that a larger $q$ leads to a sparser ability matrix (when $q = 1$, the ability matrix degenerates to an identity matrix, representing the sparsest case), and thus it is harder to estimate such a matrix accurately.
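The exact form of Eq. (25) is not reproduced here, so the sketch below uses one plausible instantiation (entry-wise mean absolute error averaged over annotators) purely for illustration; the names are ours.

```python
import numpy as np

def ability_estimation_error(A_hat, A_true):
    """A_hat, A_true: dicts {annotator_id: (K, K) ability matrix}."""
    errs = [np.abs(A_hat[j] - A_true[j]).mean() for j in A_true]
    return float(np.mean(errs))
```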
Additionally, more qualitative results are provided. Here, we showcase some comparisons between the ground-truth matrices and the estimated ones, which are illustrated in Fig. 4. In this figure, for simplicity, we only present the results related to the first annotator. Obviously, our estimations are very close to the ground-truth matrices in most cases, which implies that our estimation is reliable.
To summarize, the experimental results on synthetic datasets clearly indicate that our LAC can obtain more accurate estimations of ground-truth ranks than other methods. Besides, our LAC performs well in estimating the annotator’s ability matrices. Next, we conduct experiments on a real-world dataset to further verify its superior performance.
Property | Number |
Paragraphs / Problems | 600 |
Words per paragraph | 300 |
Sentences per paragraph | 5 |
Crowdsourced annotators | 25 |
Problems for each annotator | 239 |
Total annotations | 5,981 |
4.4. Experiments on Real-world Dataset
Listwise full rank aggregation is required in real-world business. Specifically, in the game business of NetEase, large language models (Brown et al., 2020) (LLMs) can automatically generate plots and quests for players. However, the logic of the content generated by LLMs sometimes does not align with human logic. Therefore, we have to reorder the sentences of a generated paragraph to make it more reasonable. Subsequently, the reordered sentences and the originally generated sentences are used to train a reward model, which is further employed to fine-tune the LLM. However, hiring experts to determine the order of sentences is costly and inefficient, and thus we post this task on crowdsourcing platforms to collect redundant annotations. Here, we selected some examples of paragraphs and collected a real-world dataset called “ParaRank” (publicly available at https://anonymous.4open.science/r/LAC-B871) to evaluate the proposed rank aggregation method.
The basic properties of ParaRank are listed in Table 8. As shown in this table, the ParaRank dataset comprises 600 paragraphs generated by LLMs, with 300 words per paragraph on average, and each paragraph contains five sentences. Here, each paragraph corresponds to a ranking problem, where the sentences within it need to be ranked. We posted all the paragraphs on the NetEase Youling crowdsourcing platform (URL: https://zhongbao-web-9109-80.apps-fp.danlu.netease.com/mark/task) to obtain noisy rankings for them. In more detail, there are a total of 25 crowdsourced annotators, each with a satisfactory track record of historical accuracy. We solicited their recommendations for the most suitable order of the five sentences in each paragraph. Each problem was annotated by ten different annotators, and on average, each annotator annotated 239 problems. In terms of expenses, each annotator was paid 2 RMB for each problem, resulting in a total cost of 20,000 RMB. In terms of time cost, the average time expended for a single annotation was about 24.2 seconds, and the entire task was completed in 4 days. Finally, we obtained a total of 5,981 annotations for 600 problems. For evaluation, we collected the ground-truth rank for each problem, which was provided manually by human experts.
Five adopted baseline methods and our LAC are evaluated on ParaRank. For parameter setting, we choose the parameters with which each method reaches its best performance on the validation set. For example, the hyperparameter in CrowdAgg is set to 0.95, and the rank measure follows the setting used on the synthetic datasets. The results are presented in Table 9. Based on the experimental results, our LAC demonstrates superior performance to all other methods. To further investigate the performance of all methods on this real-world dataset, we gradually increase the number of examples (or the number of annotators) and record the corresponding test accuracy. The accuracy curves are illustrated in Fig. 5. The two graphs show that most methods achieve better performance with the increase in the number of examples and annotators, but LAC has more significant performance gains than the other methods. This is because LAC models the difficulty of each problem and the ability of the annotators in a more detailed manner, which results in enhanced robustness compared to other methods that do not explicitly address these critical factors.
BradleyTerry (Thurstone, 1927) | CondorcetFuse (Montague and Aslam, 2002) | CoarsenRank (Pan et al., 2022) | St.Agg (Niu et al., 2013) | CrowdAgg (Niu et al., 2015) | LAC |
68.93 | 64.53 | 73.52 | 69.27 | 69.97 | 79.26 |
5. Conclusion
In this paper, we propose a novel rank aggregation method dubbed LAC to deal with the listwise input in crowdsourcing. Unlike previous listwise methods that may only consider the partial ranks across items, our LAC delves into the underexplored problem of full rank aggregation. Moreover, LAC incorporates both the ability of annotators and the difficulty of problems into the modeling by introducing two sets of confusion matrices. Such matrices and the true ranks can be deduced iteratively by the EM algorithm. To our knowledge, LAC is the first work to directly deal with the full rank aggregation problem in listwise crowdsourcing, and simultaneously infer the difficulty of problems, the ability of annotators, and the ground-truth ranks in an unsupervised way. To evaluate our method on the listwise full rank aggregation task, we collect a dataset with real-world business consideration. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our proposed LAC method.
Acknowledgements.
This work is supported by the National Natural Science Foundation Program of China under Grants U1909207, U21B2029, and 62336003, the Natural Science Foundation of Jiangsu Province under Grant BK20220080, and the Key R&D Program of Zhejiang Province under Grant No. 2022C01011.
References
- Allahbakhsh et al. (2013) Mohammad Allahbakhsh, Boualem Benatallah, Aleksandar Ignjatovic, Hamid Reza Motahari-Nezhad, Elisa Bertino, and Schahram Dustdar. 2013. Quality control in crowdsourcing systems: issues and directions. IEEE Internet Computing 17, 2 (2013), 76–81.
- Aslam and Montague (2001) Javed A Aslam and Mark Montague. 2001. Models for metasearch. In ACM SIGIR Conference on Research and Development in Information Retrieval. 276–284.
- Baltrunas et al. (2010) Linas Baltrunas, Tadas Makcinskas, and Francesco Ricci. 2010. Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems. 119–126.
- Bi et al. (2014) Wei Bi, Liwei Wang, James T Kwok, and Zhuowen Tu. 2014. Learning to Predict from Crowdsourced Data. In Proceedings of the Uncertainty in Artificial Intelligence, Vol. 14. 82–91.
- Bishop (2006) Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
- Cao et al. (2012) Caleb Chen Cao, Jieying She, Yongxin Tong, and Lei Chen. 2012. Whom to ask? jury selection for decision making tasks on micro-blog services. arXiv preprint arXiv:1208.0273 (2012).
- Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 (2023).
- Cohen et al. (1997) William W Cohen, Robert E Schapire, and Yoram Singer. 1997. Learning to order things. Advances in Neural Information Processing Systems 10 (1997).
- Das et al. (2017) Debashis Das, Laxman Sahoo, and Sujoy Datta. 2017. A survey on recommendation system. International Journal of Computer Applications 160, 7 (2017).
- Dawid and Skene (1979) Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society 28, 1 (1979), 20–28.
- Dong et al. (2020) Jialin Dong, Kai Yang, and Yuanming Shi. 2020. Ranking from Crowdsourced Pairwise Comparisons via Smoothed Riemannian Optimization. ACM Transactions on Knowledge Discovery from Data 14, 2 (2020).
- Du et al. (2020) Yulu Du, Xiangwu Meng, Yujie Zhang, and Pengtao Lv. 2020. GERF: A Group Event Recommendation Framework Based on Learning-to-Rank. IEEE Transactions on Knowledge and Data Engineering 32, 4 (2020), 674–687.
- Dwork et al. (2001) Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. 2001. Rank Aggregation Methods for the Web. In Proceedings of the International Conference on World Wide Web. 613–622.
- Fagin et al. (2003) Ronald Fagin, Ravi Kumar, and Dandapani Sivakumar. 2003. Efficient similarity search and classification via rank aggregation. In ACM SIGMOD International Conference on Management of Data. 301–312.
- Garcia-Molina et al. (2016) Hector Garcia-Molina, Manas Joglekar, Adam Marcus, Aditya Parameswaran, and Vasilis Verroios. 2016. Challenges in Data Crowdsourcing. IEEE Transactions on Knowledge and Data Engineering (2016), 901–911.
- Gleich and Lim (2011) David F Gleich and Lek-heng Lim. 2011. Rank aggregation via nuclear norm minimization. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 60–68.
- Guiver and Snelson (2009) John Guiver and Edward Snelson. 2009. Bayesian inference for Plackett-Luce ranking models. In International Conference on Machine Learning. 377–384.
- Guo et al. (2012) Stephen Guo, Aditya Parameswaran, and Hector Garcia-Molina. 2012. So who won? Dynamic max discovery with the crowd. In ACM SIGMOD International Conference on Management of Data. 385–396.
- Hirth et al. (2011) Matthias Hirth, Tobias Hoßfeld, and Phuoc Tran-Gia. 2011. Anatomy of a crowdsourcing platform-using the example of microworkers.com. In International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing. 322–329.
- Ipeirotis et al. (2010) Panagiotis G Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on amazon mechanical turk. In ACM SIGKDD Workshop on Human Computation. 64–67.
- Kamar et al. (2015) Ece Kamar, Ashish Kapoor, and Eric Horvitz. 2015. Identifying and accounting for task-dependent bias in crowdsourcing. In AAAI Conference on Human Computation and Crowdsourcing.
- Karger et al. (2011) David Karger, Sewoong Oh, and Devavrat Shah. 2011. Iterative learning for reliable crowdsourcing systems. Advances in Neural Information Processing Systems 24 (2011).
- Kim et al. (2014) Minji Kim, Farzad Farnoud, and Olgica Milenkovic. 2014. HyDRA: gene prioritization via hybrid distance-score rank aggregation. Bioinformatics 31, 7 (2014), 1034–1043.
- Li et al. (2016) Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J Franklin. 2016. Crowdsourced data management: A survey. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2296–2319.
- Li (2011) Hang Li. 2011. A short introduction to learning to rank. IEICE Transactions on Information and Systems 94, 10 (2011), 1854–1862.
- Lin et al. (2018) Xin Lin, Jianliang Xu, Haibo Hu, and Fan Zhe. 2018. Reducing Uncertainty of Probabilistic Top-k Ranking via Pairwise Crowdsourcing. In IEEE International Conference on Data Engineering. 1757–1758.
- Liu and Moitra (2020) Allen Liu and Ankur Moitra. 2020. Better Algorithms for Estimating Non-Parametric Models in Crowd-Sourcing and Rank Aggregation. In Proceedings of the Thirty Third Conference on Learning Theory, Vol. 125. 2780–2829.
- Montague and Aslam (2002) Mark Montague and Javed A Aslam. 2002. Condorcet fusion for improved retrieval. In International Conference on Information and Knowledge Management. 538–548.
- Niu et al. (2013) Shuzi Niu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2013. Stochastic Rank Aggregation. In Proceedings of the Uncertainty in Artificial Intelligence.
- Niu et al. (2015) Shuzi Niu, Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Lei Yu, and Guoping Long. 2015. Listwise approach for rank aggregation in crowdsourcing. In Proceedings of the ACM International Conference on Web Search and Data Mining. 253–262.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
- Pan et al. (2022) Yuangang Pan, Ivor W Tsang, Weijie Chen, Gang Niu, and Masashi Sugiyama. 2022. Fast and Robust Rank Aggregation against Model Misspecification. Journal of Machine Learning Research 23 (2022), 23–1.
- Plackett (1975) Robin L Plackett. 1975. The analysis of permutations. Journal of the Royal Statistical Society 24, 2 (1975), 193–202.
- Raghavan (1997) Prabhakar Raghavan. 1997. Information retrieval algorithms: A survey. In ACM-SIAM Symposium on Discrete Algorithms. 11–18.
- Raykar et al. (2010) Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning from crowds. Journal of Machine Learning Research 11, 4 (2010).
- Sarma et al. (2014) Anish Das Sarma, Aditya Parameswaran, Hector Garcia-Molina, and Alon Halevy. 2014. Crowd-powered find algorithms. In IEEE International Conference on Data Engineering. 964–975.
- Spiegelhalter et al. (1998) David Spiegelhalter, Nicky Best, and Bradley Carlin. 1998. Bayesian Deviance, the Effective Number of Parameters, and the Comparison of Arbitrarily Complex Models. Journal of Royal Statistical Society 64 (1998).
- Thurstone (1927) Louis L Thurstone. 1927. The method of paired comparisons for social values. Journal of Abnormal and Social Psychology (1927), 384–400.
- Venanzi et al. (2014) Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. 2014. Community-based bayesian aggregation models for crowdsourcing. In Proceedings of the International Conference on World Wide Web. 155–164.
- Wang et al. (2012) Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2012. Crowder: Crowdsourcing entity resolution. arXiv preprint arXiv:1208.1927 (2012).
- Wang et al. (2013) Jiannan Wang, Guoliang Li, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2013. Leveraging transitive relations for crowdsourced joins. In ACM SIGMOD International Conference on Management of Data. 229–240.
- Wu et al. (2023) Gongqing Wu, Xingrui Zhuo, Xianyu Bao, Xuegang Hu, Richang Hong, and Xindong Wu. 2023. Crowdsourcing Truth Inference via Reliability-Driven Multi-View Graph Embedding. ACM Transactions on Knowledge Discovery from Data 17, 5 (2023).
- Wu et al. (2021) Hanlu Wu, Tengfei Ma, Lingfei Wu, Fangli Xu, and Shouling Ji. 2021. Exploiting Heterogeneous Graph Neural Networks with Latent Worker/Task Correlation Information for Label Aggregation in Crowdsourcing. ACM Transactions on Knowledge Discovery from Data 16, 2 (2021).
- Wu et al. (2016) Ou Wu, Qiang You, Fen Xia, Lei Ma, and Weiming Hu. 2016. Listwise Learning to Rank from Crowds. ACM Transactions on Knowledge Discovery from Data 11, 1 (2016).
- Xu et al. (2021) Qianqian Xu, Zhiyong Yang, Zuyao Chen, Yangbangyan Jiang, Xiaochun Cao, Yuan Yao, and Qingming Huang. 2021. Deep Partial Rank Aggregation for Personalized Attributes. In Proceedings of the AAAI Conference on Artificial Intelligence. 678–688.
- Yan et al. (2010) Tingxin Yan, Vikas Kumar, and Deepak Ganesan. 2010. Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones. In International Conference on Mobile Systems, Applications, and Services. 77–90.
- Yang et al. (2024) Yi Yang, Zhong-Qiu Zhao, Gongqing Wu, Xingrui Zhuo, Qing Liu, Quan Bai, and Weihua Li. 2024. A Lightweight, Effective, and Efficient Model for Label Aggregation in Crowdsourcing. ACM Transactions on Knowledge Discovery from Data 18, 4 (2024).
- Yu et al. (2023) Hao Yu, Chengyuan Zhang, Jiaye Li, and Shichao Zhang. 2023. Robust Sparse Weighted Classification for Crowdsourcing. IEEE Transactions on Knowledge and Data Engineering 35, 8 (2023), 8490–8502.
- Zeng et al. (2018) Yuxiang Zeng, Yongxin Tong, Lei Chen, and Zimu Zhou. 2018. Latency-oriented task completion via spatial crowdsourcing. In International Conference on Data Engineering. IEEE, 317–328.
- Zhang and Wu (2021) Jing Zhang and Xindong Wu. 2021. Multi-Label Truth Inference for Crowdsourcing Using Mixture Models. IEEE Transactions on Knowledge and Data Engineering 33, 5 (2021), 2083–2095.
- Zhang et al. (2014) Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. 2014. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Advances in Neural Information Processing Systems 27 (2014).
- Zhao et al. (2015) Zhou Zhao, Furu Wei, Ming Zhou, Weikeng Chen, and Wilfred Ng. 2015. Crowd-Selection Query Processing in Crowdsourcing Databases: A Task-Driven Approach. In International Conference on Extending Database Technology. 397–408.