Rank Aggregation in Crowdsourcing for Listwise Annotations
Abstract.
Rank aggregation through crowdsourcing has recently gained significant attention, particularly in the context of listwise ranking annotations. However, existing methods primarily focus on a single problem and partial ranks, while the aggregation of listwise full ranks across numerous problems remains largely unexplored. This scenario finds relevance in various applications, such as model quality assessment and reinforcement learning with human feedback. In light of practical needs, we propose LAC, a Listwise rank Aggregation method in Crowdsourcing, where the global position information is carefully measured and included. In our design, an especially proposed annotation quality indicator is employed to measure the discrepancy between the annotated rank and the true rank. We also take the difficulty of the ranking problem itself into consideration, as it directly impacts the performance of annotators and consequently influences the final results. To our knowledge, LAC is the first work to directly deal with the full rank aggregation problem in listwise crowdsourcing, and simultaneously infer the difficulty of problems, the ability of annotators, and the ground-truth ranks in an unsupervised way. To evaluate our method, we collect a real-world business-oriented dataset for paragraph ranking. Experimental results on both synthetic and real-world benchmark datasets demonstrate the effectiveness of our proposed LAC method.
1. Introduction
Recently, inferring ranking (Du et al., 2020; Xu et al., 2021; Lin et al., 2018; Liu and Moitra, 2020; Wu et al., 2023, 2021) over a set of items has gained increasing attention due to its wide range of applications, such as information retrieval (Raghavan, 1997), recommendation systems (Das et al., 2017), and RLHF (Reinforcement Learning from Human Feedback) for finetuning large language models (Ouyang et al., 2022). This task aims to train a ranking model in a supervised way (Li, 2011), thereby requiring a large amount of well-annotated data.
Unfortunately, hiring expert annotators is costly, especially when the scale of data is large. To cope with this dilemma, many practitioners resort to crowdsourcing platforms (Hirth et al., 2011; Yu et al., 2023; Garcia-Molina et al., 2016; Zhang and Wu, 2021), such as Amazon Mechanical Turk and CrowdFlower. They distribute ranking problems into multiple sub-tasks, which are then solved by crowdsourced annotators. Aggregation methods are subsequently employed so that the solutions of the sub-tasks can be combined and the true ranks of the original ranking problems can be derived. Among these procedures, the effectiveness of the rank aggregation algorithm is of significant importance for the performance of the final aggregated results.
Existing rank aggregation methods can be roughly classified into three categories according to different forms of annotations: pointwise, pairwise, and listwise. Fig. 1 illustrates the similarities and differences between these forms of annotations. Under the setting of pointwise methods, each sub-task contains only one item, and the annotator assigns scores independently to each item without access to the information of other items. The aggregation methods will then derive one ranking sequence based on these score annotations (Aslam and Montague, 2001). In the pairwise setting, the sub-task contains one pair of items, and the annotator determines the relative ranking between two items. The aggregation methods use the results of compared pairs to form the final rank (Cohen et al., 1997). In the listwise setting, the sub-task contains multiple items, which is a subset of all the items. The annotator is required to provide the rank of this subset, and the aggregation methods aggregate ranks of all subsets into a final rank (Niu et al., 2015). These types of methods find applications in recommendation systems (Baltrunas et al., 2010), information retrieval (Dwork et al., 2001), bioinformatics (Kim et al., 2014), etc. Obviously, although the forms of annotations are different, all these types of methods aim to derive the ground-truth rank of one target ranking sequence.
However, the above forms of annotation may not cover all the rank aggregation problems. As shown in Fig. 1 (b), there also exists another scenario in which we need to deduce the ranks of multiple target sequences based on full rank annotations. Specifically, there are multiple short sequences required to be ranked, with each sub-task containing all items of one sequence. The annotator is required to give a full rank over each assigned sequence. Under this setting, different annotators may label the same target sequence, leading to redundant annotations. The aggregation method is therefore expected to infer the ground-truth rank of each sequence simultaneously. Such a setting can be applied to tasks such as model quality assessment (Chang et al., 2023) and reinforcement learning with human feedback (RLHF) (Ouyang et al., 2022). To be specific, in the model quality assessment task, the outputs of several models are evaluated in each round, and these outputs form a single sequence (as in Fig. 1 (b)). Similarly, in RLHF, the outputs of a large language model are sampled, forming a sequence to be ranked by human annotators. Table 1 further outlines the differences between various aggregation tasks and their applications, highlighting that listwise full rank aggregation aims to handle multiple short sequences with full rank annotations.
Scenarios | Multiple sequences? | Short sequences? | Partial annotations for each sequence? | Typical applications
pointwise rank aggregation | ✗ | ✗ | ✓ | 1) recommendation systems (Baltrunas et al., 2010) 2) information retrieval (Dwork et al., 2001) 3) bioinformatics (Kim et al., 2014)
pairwise rank aggregation | ✗ | ✗ | ✓ |
listwise partial rank aggregation | ✗ | ✗ | ✓ |
listwise full rank aggregation | ✓ | ✓ | ✗ | 1) model quality assessment (Chang et al., 2023) 2) reinforcement learning with human feedback (Ouyang et al., 2022)
Various methods have been proposed to tackle pointwise, pairwise, and listwise rank aggregation problems. The pointwise methods often minimize a specific distance between the collected scores and the aggregated ones (Fagin et al., 2003), while the pairwise methods usually model the pairwise probability between items (Montague and Aslam, 2002). Besides, the listwise partial rank aggregation methods often convert the aggregation task to the pairwise one with the consideration of uncertainty (Niu et al., 2013). Although some of these methods can be adapted to full rank annotations by solving each problem independently, they fail to characterize the relationship between different problems and cannot capture the global information in full rank annotations. Therefore, in this paper, we present a specialized study for listwise full rank aggregation to fill the gaps in research and practice.
The full rank aggregation task is intrinsically difficult since different sequences have different levels of difficulty, and the ability of annotators can be different. Therefore, explicit modeling of the ability of annotators and the difficulty of problems is desired, which is usually a missing concern in previous methods. Specifically, the ability of an annotator is supposed to reflect the extent of uncertainty associated with the rank position of each candidate item. Therefore, the probability that the annotator flips every pair of candidate items should be considered. Meanwhile, the difficulty of problems can also be represented by multiple probabilities, measuring whether each pair of items is prone to confusion. In addition, from a global point of view, in the final rank list, the nearby items are generally much more likely to be mispositioned than those with a more significant gap in position. Therefore, the ability of annotators and the difficulty of problems should be dynamically adjusted according to the gap in position.
To this end, we propose a Listwise rank Aggregation method in Crowdsourcing (“LAC” for short hereinafter), to deal with the listwise input in crowdsourcing. To the best of our knowledge, this method represents the first endeavor specifically addressing the task of full rank aggregation in crowdsourcing. In our method, for the ability of annotators and the difficulty of problems, two sets of confusion matrices are employed to estimate the degree of confusion for each pair of items. Then, the distance between positions is carefully defined to integrate the relative positional information between two items. Considering the unknown latent variables of the ability of annotators, the difficulty of problems, and the true rank, we derive the log-likelihood of the observations and maximize it with the Expectation Maximization (EM) method. Specifically, in the E-step, based on the estimated values of latent variables derived in the previous step, we obtain the conditional expectation of observations over the underlying rank distribution. In the M-step, the latent variables and the ground-truth ranks are estimated respectively via maximizing the expectation calculated in the E-step.
In experiments, we propose to synthesize ranking datasets with the consideration of five essential factors, including number of problems, length of rank, number of annotators, ability of annotators, and the annotation ratio, so that the performance in different scenarios can be investigated. Furthermore, a real-world dataset with a total of 5,981 listwise ranking annotations from 25 annotators is collected to examine the performance of LAC.
Our contributions are threefold:
• We focus on the underexplored problem of listwise full rank aggregation in crowdsourcing and propose LAC as a new rank aggregation method.
• Algorithmically, the ability of annotators, the difficulty of problems, and the ground-truth ranks are respectively modeled as latent variables, and we propose to use the EM method to deduce their optimal values iteratively.
• Experimentally, we simulate synthetic datasets with a comprehensive consideration of five essential factors to explore the performance of LAC. We also present, for the first time, a real-world dataset under the listwise full rank aggregation setting to benchmark previous methods. Experimental results on both synthetic and real-world datasets demonstrate the superiority of LAC over existing methods in most scenarios.
2. Related Work
In this section, we introduce some related works, including the background of crowdsourcing and the existing rank aggregation methods.
2.1. Crowdsourcing
There are three critical concerns in crowdsourcing, i.e., cost control (Li et al., 2016), latency control (Zeng et al., 2018), and quality control (Allahbakhsh et al., 2013; Yang et al., 2024).
Cost control. Crowdsourcing may be expensive when dealing with a large number of tasks and instances. In order to alleviate this issue, several cost-control techniques have been proposed. These techniques include removing unnecessary tasks and picking up valuable tasks (task pruning (Wang et al., 2012)), ranking and prioritizing valuable tasks (task selection (Guo et al., 2012)), deducing answers for the candidate tasks based on the feedback data (answer deduction (Wang et al., 2013)), and sampling tasks based on some specific criteria for crowdsourcing.
Latency control. Crowdsourcing for answering tasks may suffer from excessive latency due to the unavailability of annotators, the difficulty of tasks, and insufficient appeal to annotators. Therefore, latency control is needed. Two representative models for latency control have been proposed, namely the round model (Sarma et al., 2014) and the statistical model (Yan et al., 2010). Here, the round model arranges tasks to be published in many rounds and models the overall latency as the number of rounds. In contrast, the statistical model uses feedback data to build models that capture the arrival and completion times of annotators, allowing better prediction and adjustment of the expected latency.
Quality control. Crowdsourcing may produce low-quality or even incorrect answers due to annotators’ varying levels of expertise. Therefore, quality control is crucial. The ability of annotators can be modeled and controlled through several methods, including eliminating low-quality annotators (Ipeirotis et al., 2010), aggregating answers from multiple annotators (Cao et al., 2012), and assigning tasks to appropriate annotators based on their skills and experience (Zhao et al., 2015). How to infer the true labels from multiple noisy labels is a critical problem in quality control. The most seminal work is DS (Dawid and Skene, 1979), which uses a confusion matrix to represent the quality of a crowdsourcing annotator. Several works are derived directly from the DS algorithm. For example, (Raykar et al., 2010) simplifies the parameters of the annotators, (Venanzi et al., 2014) constrains the annotators in certain aspects, and (Zhang et al., 2014) optimizes the initial settings. In addition, another stream of approaches models the quality of an annotator using fewer parameters (Bi et al., 2014) while adding auxiliary parameters to model the annotator’s properties, such as bias (Kamar et al., 2015) and intention (Bi et al., 2014). All of them are based on probabilistic models and can be solved via an EM algorithm with gradient descent. Some of them (Bi et al., 2014; Dawid and Skene, 1979) can be applied to multi-class scenarios, while others (Raykar et al., 2010; Karger et al., 2011) are only suitable for binary classification problems. Unfortunately, none of these techniques can handle the listwise full rank aggregation problem. Next, we introduce some methods to aggregate multiple ranks.
2.2. Rank Aggregation Methods
According to various input forms, the existing ranking methods can be roughly classified into three categories, namely pointwise, pairwise, and listwise methods.
Pointwise. Two well-known pointwise methods, Borda Count (Aslam and Montague, 2001) and Median Rank (Fagin et al., 2003), are often adopted to obtain suitable rankings. Specifically, Borda Count minimizes the average Spearman Rank Coefficient, and Median Rank minimizes the average Spearman Footrule Distance between the true rank and each input.
Pairwise. Pairwise rank aggregation methods organize their ranking inputs in pairs and optimize the objective function or ranking functions accordingly (Dong et al., 2020). For example, BradleyTerry (Thurstone, 1927) defines the pairwise probability based on the BradleyTerry model and then optimizes the likelihood function by gradient descent. Afterwards, GreedyOrder (Cohen et al., 1997) focuses on minimizing the cost of pairwise disagreement in a tournament to infer the true rank. In contrast, CondorcetFuse (Montague and Aslam, 2002) builds a Condorcet Graph by majority voting and obtains a Hamiltonian path from the graph by QuickSort, and SVP (Gleich and Lim, 2011) minimizes the nuclear norm of a pairwise preference matrix by rank-2 factorization. However, pairwise methods usually cannot capture the global information over items.
Listwise. Unlike the pointwise and pairwise methods, listwise rank aggregation methods treat the ranking inputs in a listwise way to emphasize the importance of positions. Previous listwise methods usually aim at the aggregation of partial ranks (Wu et al., 2016). Typical methods include Plackett-Luce (Guiver and Snelson, 2009), St.Agg (Niu et al., 2013), and CrowdAgg (Niu et al., 2015). Among them, Plackett-Luce extends the Plackett-Luce model (Plackett, 1975) and defines the similarity of ranks based on generative probability. St.Agg incorporates uncertainty into the aggregation process and introduces a prior rank distribution. Besides, CrowdAgg further takes the quality of annotators into consideration. Unfortunately, the above methods primarily focus on partial ranks, where only part of a single long sequence is required to be ranked by each annotator (see Fig. 1). Meanwhile, studies that directly deal with full ranks over the items are largely absent. Since such a setting is of great significance in real-world applications, we undertake a careful study of the listwise full rank aggregation problem.
3. The Proposed Method
3.1. Preliminary
In this section, we formalize the listwise full rank aggregation task in crowdsourcing. Suppose that there are $N$ problems (sequences) in total and $M$ crowdsourced annotators. The dataset with ground-truth ranks is denoted by $\mathcal{D}=\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the $i$-th problem and $y_i$ is its underlying true rank. Each problem posted on the crowdsourcing platform is associated with $K$ items, and the selected annotators provide ranks over these items based on their knowledge. The ground-truth rank and the ranks provided by the selected annotators are permutations of the items in $x_i$. All the true ranks of the entire dataset are denoted by $Y=\{y_i\}_{i=1}^{N}$. Let $a_j$ be the $j$-th annotator. The rank given by $a_j$ for the problem $x_i$ is denoted by $r_i^j$, and all ranks provided for this problem are denoted by $R_i$. Then the observed dataset is denoted by $\mathcal{R}=\{R_i\}_{i=1}^{N}$. Obviously, the resulting annotations for each problem consist of noisy repeated ranks. Therefore, our task is to deduce the ground-truth rank $y_i$ for each problem based on the noisy ranks $R_i$. To this end, we use $\pi$ and $\sigma$ to represent ranked lists and define $\pi^{-1}(v)$ as the index of item $v$ in list $\pi$, so that $\pi(t)$ indicates the item in position $t$ of list $\pi$. The main mathematical notations that will be used later for the algorithm description are listed in Table 2, and a minimal data-layout sketch is given after the table.
Notation | Interpretation
$N$ | the number of problems / sequences.
$M$ | the number of annotators.
$K$ | the number of items in a single problem.
$x_i$ | the $i$-th problem.
$y_i$ | the unknown ground-truth rank of problem $x_i$.
$a_j$ | the $j$-th crowdsourced annotator.
$r_i^j$ | the rank of problem $x_i$ given by annotator $a_j$.
$A^j$ | the confusion matrix that models the ability of the annotator $a_j$.
$E^i$ | the confusion matrix that models the difficulty of the problem $x_i$.
$\theta_{y_i}$ | the likelihood for the true rank $y_i$.
$A^j_{u,v}$ | the probability of item $u$ being confused with item $v$ by annotator $a_j$.
$E^i_{u,v}$ | the probability of confusion between item $u$ and item $v$ in problem $x_i$.
$\pi$ | a ranked list.
$Y$ | the list containing all the ground-truth ranks.
$\mathcal{R}$ | the noisily ranked dataset collected from the crowdsourcing platform.
$[K]$ | the set $\{1, 2, \dots, K\}$.
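To make the input format concrete, the following minimal sketch shows one way a listwise full rank aggregation dataset could be organized in code. The variable names and the toy values are our own illustrative assumptions, not part of any released data.

```python
# A minimal, assumed data layout for listwise full rank aggregation: every problem
# holds all K of its items, and several annotators each provide a full permutation.
K = 3                                    # items per problem (a short sequence)
items = {0: ["s1", "s2", "s3"],          # problem_id -> the K items to be ranked
         1: ["t1", "t2", "t3"]}
annotations = {                          # problem_id -> {annotator_id -> full rank over item indices}
    0: {"a1": (0, 1, 2), "a2": (1, 0, 2), "a3": (0, 1, 2)},
    1: {"a1": (2, 0, 1), "a3": (2, 1, 0)},   # not every annotator labels every problem
}
# The aggregation task: recover one ground-truth permutation per problem from
# these noisy, redundant full ranks.
```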
3.2. Listwise Rank Aggregation in Crowdsourcing
To tackle the task introduced above, we propose a novel probabilistic generative model termed LAC. Unlike previous methods that may only consider the quality of annotators (Niu et al., 2015), we also incorporate the difficulty of problems into our model, which can further capture the intrinsic property of the aggregation task. Intuitively, it is reasonable that a crowdsourced annotator with stronger ability performs better when provided with easier problems, and vice versa. Therefore, we introduce a confusion matrix $A^j$ for the annotator $a_j$ and a confusion matrix $E^i$ for the problem $x_i$. Here, $A^j$ explicitly characterizes the ability of the $j$-th annotator, namely, its $(u,v)$-th element $A^j_{u,v}$ denotes the probability of item $u$ being confused with item $v$ by $a_j$. Similarly, $E^i$ explicitly characterizes the difficulty of the $i$-th problem, namely, its $(u,v)$-th element $E^i_{u,v}$ denotes the probability of item $u$ being confused with item $v$ in $x_i$. To be clear, suppose that the ground-truth rank of a problem places one item ahead of another, while the annotation of $a_j$ swaps these two items; the corresponding entry of $A^j$ then represents the probability that this pair is confused by annotator $a_j$, and the corresponding entry of $E^i$ represents the probability that this pair is confused in problem $x_i$. Examples of $A^j$ and $E^i$ are given by
(1) |
and
(2) |
respectively. The probabilistic graphical model is illustrated in Fig. 2, and it involves three groups of latent variables: the ability matrices of annotators, the difficulty matrices of problems, and the ground-truth ranks. Therefore, our target is transformed into deriving the most reasonable values of these latent variables to maximize the likelihood of all observations, and then inferring the possible ranks based on the optimal values.
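As an illustration of the two kinds of confusion matrices, the sketch below instantiates a small ability matrix and a small difficulty matrix with made-up numbers; only the row-stochastic structure (each row sums to one) follows the construction described later in Section 4.2, and all concrete values are ours.

```python
import numpy as np

# Ability matrix of one annotator: entry (u, v) is the probability that item u
# (as placed by the true rank) is confused with item v by this annotator.
A_j = np.array([[0.80, 0.10, 0.10],
                [0.10, 0.70, 0.20],
                [0.10, 0.20, 0.70]])
# Difficulty matrix of one problem: entry (u, v) is the probability that items u
# and v are confused within this problem, regardless of who annotates it.
E_i = np.array([[0.90, 0.05, 0.05],
                [0.05, 0.85, 0.10],
                [0.05, 0.10, 0.85]])
assert np.allclose(A_j.sum(axis=1), 1) and np.allclose(E_i.sum(axis=1), 1)
```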
Likelihood of the observed dataset. Suppose that the ground-truth rank of each problem is independently drawn from a multinomial distribution whose parameters are the prior probabilities of the $K!$ possible permutations over all items. Here, the prior probability of each permutation is defined as
(3) |
Since each problem is independently annotated, the likelihood of the observed dataset can be factorized as
(4) |
which is governed by the parameter set consisting of the prior probabilities, the ability matrices, and the difficulty matrices. Subsequently, the likelihood of each annotation can be derived as
(5) |
Since our input is in listwise form, the global ranking information can be readily obtained. It is crucial to utilize this information to calculate the distance between the annotated rank and the possible true rank. In view of this point, we calculate the likelihood that an item is wrongly ranked at another position in a position-wise way, where we multiply by the reciprocal of the positional distance to penalize ranks that are farther away. Therefore, the posterior probability is defined as
(6) |
where the distance term measures the gap between the position of an item in the annotated rank and its position in the candidate true rank. For brevity, in the following derivation, we denote
(7) |
To enhance clarity, here we introduce an example to show the rationale of the position-wise distance. Suppose there are three items and the ground-truth rank is (a, b, c). Assume that the first annotator gives the rank (b, a, c), and the second one gives the rank (c, b, a). For simplicity, we only take the first position of each rank as an example. For the first annotator, the distance in the first position is 1 (because the index of b in the ground-truth rank differs from the first position by 1). But for the second annotator, the distance in the first position is 2 (because the index of c in the ground-truth rank differs from the first position by 2). It is evident that, for the first position, the first annotation is better than the second one. Therefore, it is reasonable to take the distance as a penalty.
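The position-wise distance used above can be computed as follows. This is a small illustrative helper (the function name is ours) rather than the full likelihood of Eq. (6), which additionally multiplies the confusion probabilities by the reciprocal of this distance.

```python
def position_distance(true_rank, annotated_rank, pos):
    """Gap between the index (in the true rank) of the item the annotator put at
    position `pos` and the position itself."""
    item = annotated_rank[pos]
    return abs(true_rank.index(item) - pos)

truth = ("a", "b", "c")
print(position_distance(truth, ("b", "a", "c"), 0))   # 1: 'b' truly sits one position away
print(position_distance(truth, ("c", "b", "a"), 0))   # 2: 'c' truly sits two positions away
```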
Subsequently, the log-likelihood of the noisy repeated listwise rank can be represented as
(8) |
where the posterior probability is defined in Eq. (6).
3.3. Optimization with EM Algorithm
To effectively seek the latent variables that maximize the log-likelihood defined in Eq. (8), we employ the prevalent Expectation Maximization (EM) algorithm (Bishop, 2006), which estimates the prior probabilities, the elements of the confusion matrices, and the ground-truth ranks iteratively (a schematic sketch of the overall loop is provided at the end of this subsection).
E-step. We first construct an expectation function of the log-likelihood of the dataset with respect to the latent variables as follows:
(9) |
where the latent variables take the values estimated in the previous step.
Afterwards, we calculate the expectation over the possible ground-truth ranks, namely
(10)
M-step. The prior probabilities, the ability matrices, and the difficulty matrices are updated to maximize the expectation function. In the first step, we expand this function using Bayes’ Theorem, namely
(11)
Because the prior probability is a constant with respect to the other parameters when taking derivatives, we only need to maximize the remaining term. Consequently, Eq. (11) can be reformulated as
(12)
Each group of parameters is then updated in turn to maximize this function.
Update on the prior probabilities
Here, we use the Lagrange multiplier to find the optimal value in Eq. (12), and we construct the function
(13) |
Then, we set the partial derivative to zero, i.e.,
(14) |
By applying the sum-up-to-one condition, we can solve for the Lagrange multiplier. Subsequently, plugging it into Eq. (14), we can obtain the desired estimation, which is given by
(15) |
Update on the ability matrices of annotators
Update on the difficulty matrices of problems
By similarly using the Lagrange multiplier to optimize the corresponding term in Eq. (12), we construct the function
(19) |
and then we set the partial derivative to zero. Applying the sum-up-to-one condition, we have
(20) |
With this condition, we can solve for the Lagrange multiplier. By plugging it into Eq. (20), we obtain the corresponding estimation, given by
(21) |
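To make the overall procedure concrete, the following Python sketch outlines one way the EM loop could be organized. It is a simplified reading rather than the exact implementation of Eqs. (4)-(21): the function name and initialization are ours, and the M-step below uses plain expected-count normalization as a stand-in for the closed-form updates derived above.

```python
# Schematic EM sketch for LAC-style listwise full rank aggregation (assumptions noted above).
import itertools
import numpy as np

def lac_em_sketch(annotations, K, n_iters=50):
    """annotations: dict {problem_id: {annotator_id: tuple(rank of item ids 0..K-1)}}."""
    perms = list(itertools.permutations(range(K)))           # all K! candidate true ranks
    problems = sorted(annotations)
    annotators = sorted({a for r in annotations.values() for a in r})
    theta = np.full(len(perms), 1.0 / len(perms))             # prior over permutations
    A = {a: np.full((K, K), 1.0 / K) + 0.5 * np.eye(K) for a in annotators}   # ability
    E = {p: np.full((K, K), 1.0 / K) + 0.5 * np.eye(K) for p in problems}     # difficulty
    for a in annotators: A[a] /= A[a].sum(1, keepdims=True)
    for p in problems:   E[p] /= E[p].sum(1, keepdims=True)

    def ann_loglik(true_rank, ann_rank, Aj, Ei):
        # position-wise likelihood with a reciprocal-distance penalty in the spirit of Eq. (6);
        # the +1 keeping the penalty finite for correct positions is our assumption
        ll = 0.0
        for pos, item in enumerate(ann_rank):
            dist = abs(true_rank.index(item) - pos) + 1
            ll += np.log(Aj[true_rank[pos], item] * Ei[true_rank[pos], item] / dist + 1e-12)
        return ll

    for _ in range(n_iters):
        # E-step: posterior over candidate true ranks for every problem
        resp = {}
        for p in problems:
            logpost = np.log(theta + 1e-12)
            for l, cand in enumerate(perms):
                for a, r in annotations[p].items():
                    logpost[l] += ann_loglik(cand, r, A[a], E[p])
            post = np.exp(logpost - logpost.max())
            resp[p] = post / post.sum()
        # M-step: re-estimate prior and confusion matrices from expected counts
        theta = sum(resp[p] for p in problems) / len(problems)
        newA = {a: np.full((K, K), 1e-6) for a in annotators}
        newE = {p: np.full((K, K), 1e-6) for p in problems}
        for p in problems:
            for l, cand in enumerate(perms):
                w = resp[p][l]
                for a, r in annotations[p].items():
                    for pos, item in enumerate(r):
                        newA[a][cand[pos], item] += w
                        newE[p][cand[pos], item] += w
        A = {a: m / m.sum(1, keepdims=True) for a, m in newA.items()}
        E = {p: m / m.sum(1, keepdims=True) for p, m in newE.items()}
    return {p: perms[int(np.argmax(resp[p]))] for p in problems}, A, E
```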
3.4. Complexity Analysis
In this section, we analyze the computational complexity of the proposed LAC method.
Since our method can be divided into the E-step and the M-step, we first analyze the complexity of the E-step in Algorithm 1. We use the annotation ratio to denote the fraction of annotators required to provide ranks for each problem. Therefore, for each problem, only this fraction of the annotators give a full rank, and the cost of the E-step in Eq. (10) grows with the number of problems, the number of annotators assigned to each problem, and the number of candidate permutations. In the M-step, the updates of the prior probabilities, the ability matrices, and the difficulty matrices each require one pass over the expected counts computed in the E-step. Therefore, taking all the above results into consideration, the overall cost of LAC is the per-iteration cost multiplied by the total number of iterations. From the above analysis, we conclude that our algorithm is applicable to cases with a moderate number of problems and a relatively small length for each problem. Here, we take the model quality assessment task as an example: there are often only a few models to be compared (usually three to seven) (Ouyang et al., 2022), rendering the complexity of our LAC acceptable.
4. Experiments
In this section, we conduct intensive experiments on both synthetic and real-world datasets to demonstrate the superiority of the proposed method. The implementation code of LAC can be found at https://anonymous.4open.science/r/LAC-B871.
4.1. Experimental Setting
The target of rank aggregation is to correctly infer the truth at each position. Therefore, we utilize a position-wise accuracy metric to evaluate the overall performance, defined as
(23) |
where a position is counted as correctly predicted if the aggregated result places the same item as the true rank at that position.
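A minimal sketch of the position-wise accuracy in Eq. (23) as we read it: the fraction of positions, over all problems, at which the aggregated rank and the ground-truth rank place the same item. The function name is ours.

```python
def position_wise_accuracy(aggregated, ground_truth):
    """aggregated, ground_truth: dicts {problem_id: sequence of item ids}."""
    correct = total = 0
    for pid, true_rank in ground_truth.items():
        for pos, item in enumerate(true_rank):
            correct += int(aggregated[pid][pos] == item)
            total += 1
    return 100.0 * correct / total   # reported as a percentage, as in Tables 3-7
```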
To validate the effectiveness of the proposed LAC method, we choose five baseline methods for comparison. Specifically, the pairwise methods adopted are BradleyTerry (Thurstone, 1927), CondorcetFuse (Montague and Aslam, 2002), and CoarsenRank (Pan et al., 2022). Besides, the listwise rank aggregation methods adopted are St.Agg (Niu et al., 2013) and CrowdAgg (Niu et al., 2015), which are originally proposed for the partial rank aggregation scenarios. Therefore, to accommodate them to our setting, we apply these algorithms to each problem independently. Here, we provide a brief introduction of the adopted baseline methods:
• BradleyTerry (Thurstone, 1927). It is a pairwise method that estimates the probability of the superior item within each pairwise comparison.
• CondorcetFuse (Montague and Aslam, 2002). It is a pairwise method, which constructs the Condorcet Graph over the items and derives the final rank via a Hamiltonian path.
• CoarsenRank (Pan et al., 2022). It is specifically designed for mild model misspecification, and it assumes that the ideal preferences exist in a neighborhood of the actual preferences. Afterwards, it performs regular rank aggregation directly over a neighborhood of the preference set.
• St.Agg (Niu et al., 2013). It is a listwise method, which incorporates uncertainty into the aggregation process via introducing a prior distribution on ranks. It then transforms the ranking functions to their expectations over this distribution.
• CrowdAgg (Niu et al., 2015). It is an extension of the St.Agg method introduced above, where the annotator’s quality information is further incorporated into the definition of the rank distribution.
For the compared methods, hyperparameters are set as suggested in the original papers. For example, the hyperparameter in CrowdAgg is set to 0.95, as recommended by (Niu et al., 2013), and the rank measure follows the original setting. For St.Agg (Niu et al., 2013), the ranking function is obtained by incorporating the rank distribution into the mean position function. Moreover, for CoarsenRank (Pan et al., 2022), the optimal hyperparameter is determined by the Deviance Information Criterion (Spiegelhalter et al., 1998).
4.2. Data Synthetic Procedure
In this section, we introduce the generation of the synthetic datasets in detail, including the construction of the confusion matrices and the generation process of the biased ranks provided by each annotator.
To this end, we now take one annotator as an example. First, for a given quality of annotators (denoted by $q$) and a specific position, we select a random scalar uniformly in the range of $[q, 1]$, which serves as the corresponding diagonal element of this annotator's quality matrix. This value corresponds to the probability that the annotator gives the correct rank at that position. Subsequently, the remaining positive values in the same row can be selected randomly, provided that the whole row sums to one. For clarity, given a specific quality value, a feasible quality matrix of an annotator is given by
(24) |
The rationale for such a construction is that a higher quality value corresponds to a superior annotator ability. Therefore, such an annotator is more likely to give correct ranks at each position. Reflected in the transition matrix, the diagonal values should be relatively large, which is indeed satisfied by our construction. Note that we introduce randomness for these diagonal values to ensure discrepancies between different annotators.
Subsequently, we elaborate on the generation process of the biased ranks based on the constructed quality matrix. The biased rank is blank in the beginning, and the ground-truth rank is a random permutation. We iterate through the positions sequentially. For a specific position, we sample a candidate item according to the probabilities in the corresponding row of the quality matrix; the sampling is repeated until the drawn item has not yet been placed in the biased rank, and this item is then put in the current position. Finally, the item in the last position is selected so that the biased rank forms a permutation over all the items. Therefore, we obtain one biased rank, and the other ranks are generated in the same manner.
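To make the procedure concrete, here is a small Python sketch of the data-synthesis steps as we read them; the function names and concrete parameter choices are ours, and the sampling follows the description above.

```python
import numpy as np

def make_quality_matrix(K, q, rng):
    Q = np.zeros((K, K))
    for s in range(K):
        Q[s, s] = rng.uniform(q, 1.0)                 # probability of a correct rank at position s
        off = rng.random(K - 1)
        off = off / off.sum() * (1.0 - Q[s, s])       # remaining mass spread over the other items
        Q[s, np.arange(K) != s] = off
    return Q

def sample_biased_rank(true_rank, Q, rng):
    K = len(true_rank)
    biased = []
    for pos in range(K - 1):
        while True:
            # draw an item according to the row of the quality matrix for this position
            cand = true_rank[rng.choice(K, p=Q[pos])]
            if cand not in biased:
                biased.append(cand)
                break
    biased.append(next(v for v in true_rank if v not in biased))  # last item completes the permutation
    return tuple(biased)

rng = np.random.default_rng(0)
Q = make_quality_matrix(K=5, q=0.5, rng=rng)
true_rank = tuple(rng.permutation(5))
print(sample_biased_rank(true_rank, Q, rng))
```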
4.3. Experiments on Synthetic Datasets
In this part, we compare the performance of our LAC with the baseline methods under various predefined conditions on the synthetic datasets introduced in the previous section.
There are mainly five factors associated with the synthetic datasets, i.e., the number of problems, the length of each rank, the number of annotators, the ability of annotators, and the annotation ratio. Here, the annotation ratio controls the fraction of annotators required to provide ranks for each problem. Notably, we fix these parameters to a default configuration with a moderate number of problems and a suitable number of items to be ranked, since our approach targets such cases. Unless otherwise stated, this configuration is used throughout all our experiments.
Performance on problems with different lengths. We select the number of items in each problem from the set {3, 4, 5, 6, 7}. Other parameters for the dataset are fixed at the default values mentioned above. The accuracies, as well as the standard deviations over five independent trials of all methods, are shown in Table 3, where the best record under each setting is in bold, and the second best record is underlined (this presentation remains consistent throughout this paper). This table reveals that our LAC achieves relatively stable performance, demonstrating its superiority over the baseline methods across different problem lengths. It is worth noting that our LAC outperforms CrowdAgg by more than 4.7% under all tested lengths, which suggests that explicitly modeling the annotators can obtain more satisfactory performance than tackling each problem independently.
Performance on different numbers of examples. We choose the number of examples within the set {100, 200, 300, 400, 500}. The performance comparison of all methods is shown in Table 4. Apparently, with the increase in the number of examples, our LAC still performs better than other baseline methods, which shows the excellence of LAC on various scales of the ranking problem.
Performance on different annotation ratios. We select the annotation ratio from the set {0.3, 0.4, 0.5, 0.6, 0.7}. The experimental results of all methods are shown in Table 5. As revealed in this table, when the annotation ratio is very small, our LAC outperforms all the baseline methods by a large margin. Meanwhile, in all other cases, LAC shows superiority over the compared methods, especially when the annotation ratio is relatively large.
Performance on different numbers of annotators. We pick the number of annotators from the set {10, 12, 14, 16, 18} for each method. The detailed results are shown in Table 6. As shown in the results, LAC consistently outperforms other methods in all tested scenarios, indicating its robust aggregation ability. Notably, with an increase in the number of annotators, our LAC generally obtains more accurate predictions. Moreover, our method is potentially applicable to scenarios with a small number of annotators, since our LAC surpasses the second best method CondorcetFuse by 13.90% when only ten annotators are available.
Performance on different abilities of annotators. We finally evaluate the performance of our LAC on different abilities of annotators, and thus we select the ability parameter from the set {0.1, 0.3, 0.5, 0.7, 0.9}. The performance comparison is shown in Table 7. This table reveals that, as the annotator's ability decreases to 0.1, LAC exhibits superior performance relative to other methods by a significant margin, which can be attributed to the characterization of the annotator's ability matrices. It further justifies that our method can handle low-quality annotations at different levels.
Length of rank () | 3 | 4 | 5 | 6 | 7 |
BradleyTerry (Thurstone, 1927) | 80.91 4.23 | 79.50 5.79 | 74.17 3.81 | 76.48 2.23 | 77.54 4.69 |
CondorcetFuse (Montague and Aslam, 2002) | 73.94 4.28 | 75.65 4.40 | 75.62 4.87 | 79.70 1.90 | 82.89 2.15 |
CoarsenRank (Pan et al., 2022) | 67.05 7.96 | 69.49 7.30 | 67.03 3.92 | 71.83 3.34 | 73.80 4.60 |
St.Agg (Niu et al., 2013) | 77.95 7.53 | 60.65 3.58 | 71.57 2.46 | 59.40 0.52 | 77.08 3.59 |
CrowdAgg (Niu et al., 2015) | 81.38 5.90 | 63.53 3.20 | 72.64 2.26 | 64.47 0.59 | 78.49 3.44 |
LAC | 86.08 3.84 | 88.09 4.53 | 89.20 3.73 | 92.84 1.57 | 94.72 2.43 |
Number of problems () | 100 | 200 | 300 | 400 | 500 |
BradleyTerry (Thurstone, 1927) | 81.08 4.27 | 79.74 3.52 | 81.25 3.75 | 80.25 3.42 | 81.53 2.48 |
CondorcetFuse (Montague and Aslam, 2002) | 75.72 3.33 | 77.38 3.11 | 76.10 3.27 | 77.06 2.77 | 75.88 3.27 |
CoarsenRank (Pan et al., 2022) | 64.76 3.77 | 66.24 5.31 | 66.08 5.06 | 67.13 4.68 | 67.03 3.92 |
St.Agg (Niu et al., 2013) | 68.96 4.00 | 71.32 4.66 | 70.93 4.99 | 71.50 4.75 | 71.20 4.00 |
CrowdAgg (Niu et al., 2015) | 70.56 3.81 | 72.58 3.94 | 72.36 4.28 | 73.15 4.35 | 72.72 3.14 |
LAC | 95.96 1.71 | 96.04 2.08 | 96.14 1.92 | 96.03 2.32 | 96.33 1.91 |
Annotation ratio () | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
BradleyTerry (Thurstone, 1927) | 74.17 3.81 | 78.47 2.97 | 81.53 2.48 | 83.89 2.56 | 85.09 3.19 |
CondorcetFuse (Montague and Aslam, 2002) | 75.62 4.87 | 76.56 2.28 | 75.88 3.27 | 76.48 3.06 | 76.00 3.78 |
CoarsenRank (Pan et al., 2022) | 67.03 3.92 | 70.23 3.71 | 72.57 2.42 | 76.62 2.79 | 78.31 3.22 |
St.Agg (Niu et al., 2013) | 71.20 4.00 | 72.88 2.94 | 78.88 3.03 | 82.26 2.80 | 84.98 3.27 |
CrowdAgg (Niu et al., 2015) | 72.72 3.14 | 77.47 3.75 | 81.36 3.62 | 86.17 2.18 | 87.02 3.41 |
LAC | 89.20 3.73 | 93.00 2.09 | 96.33 1.91 | 97.88 1.07 | 98.60 0.91 |
Number of annotators () | 10 | 12 | 14 | 16 | 18 |
BradleyTerry (Thurstone, 1927) | 74.06 1.93 | 72.22 1.94 | 79.29 0.57 | 79.34 1.43 | 81.72 0.62 |
CondorcetFuse (Montague and Aslam, 2002) | 76.85 2.42 | 76.64 1.23 | 78.30 0.59 | 76.34 0.45 | 75.42 1.68 |
CoarsenRank (Pan et al., 2022) | 68.05 1.68 | 66.41 1.61 | 71.41 0.38 | 70.96 1.66 | 73.38 1.04 |
St.Agg (Niu et al., 2013) | 71.57 2.46 | 70.25 1.98 | 72.89 1.52 | 73.37 2.41 | 79.40 2.08 |
CrowdAgg (Niu et al., 2015) | 72.64 2.26 | 71.35 2.30 | 77.75 1.73 | 77.39 2.40 | 82.52 1.73 |
LAC | 90.75 1.01 | 90.50 0.74 | 95.12 0.61 | 94.37 0.80 | 96.85 1.15 |
Ability of annotators () | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
BradleyTerry (Thurstone, 1927) | 53.82 5.52 | 64.29 4.84 | 74.17 3.81 | 84.74 2.79 | 94.82 0.62 |
CondorcetFuse (Montague and Aslam, 2002) | 56.90 6.22 | 65.77 4.90 | 75.62 4.87 | 86.82 3.39 | 95.35 0.71 |
CoarsenRank (Pan et al., 2022) | 47.55 5.16 | 56.81 3.95 | 67.03 3.92 | 79.12 4.57 | 91.64 1.44 |
St.Agg (Niu et al., 2013) | 51.12 5.44 | 61.22 4.06 | 71.20 4.00 | 82.68 3.82 | 93.50 1.55 |
CrowdAgg (Niu et al., 2015) | 54.32 5.82 | 63.41 3.71 | 72.72 3.14 | 83.26 3.67 | 93.79 1.46 |
LAC | 68.41 7.54 | 79.85 5.28 | 89.20 3.73 | 96.32 1.91 | 99.60 0.19 |
Estimation error on the ability matrices of annotators. Since LAC explicitly models the annotators’ quality via confusion matrices, we also carry out experiments to verify the effectiveness of our method in estimating such matrices. To this end, we analyze the estimation errors quantitatively, and we propose to use the following metric to calculate the overall estimation error for a given quality level $q$:
(25) |
where $\hat{A}^j$ is the estimated ability matrix and $A^j$ is the ground-truth matrix for the annotator $a_j$. The estimation errors under various quality levels are shown in Fig. 3. From this figure, we identify that our estimations achieve small errors in most cases, demonstrating the effectiveness of the EM steps in finding the optimal values of the latent variables. Note that the estimation error can exhibit an increasing trend as $q$ increases. This can be attributed to the fact that a larger $q$ leads to a sparser ability matrix (when $q = 1$, the ability matrix degenerates to an identity matrix, representing the sparsest case), and thus it is harder to estimate such a matrix accurately.
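The exact form of Eq. (25) is not reproduced here, so the sketch below uses one plausible instantiation (entry-wise mean absolute error averaged over annotators) purely for illustration; the names are ours.

```python
import numpy as np

def ability_estimation_error(A_hat, A_true):
    """A_hat, A_true: dicts {annotator_id: (K, K) ability matrix}."""
    errs = [np.abs(A_hat[j] - A_true[j]).mean() for j in A_true]
    return float(np.mean(errs))
```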
Additionally, more qualitative results are provided. Here, we showcase some comparisons between the ground-truth matrices and the estimated ones, which are illustrated in Fig. 4. In this figure, for simplicity, we only present the results related to the first annotator. Obviously, our estimations are very close to the ground-truth matrices in most cases, which implies that our estimation is reliable.
To summarize, the experimental results on synthetic datasets clearly indicate that our LAC can obtain more accurate estimations of ground-truth ranks than other methods. Besides, our LAC performs well in estimating the annotator’s ability matrices. Next, we conduct experiments on a real-world dataset to further verify its superior performance.
Property | Number |
Paragraphs / Problems | 600 |
Words per paragraph | 300 |
Sentences per paragraph | 5 |
Crowdsourced annotators | 25 |
Problems for each annotator | 239 |
Total annotations | 5,981 |
4.4. Experiments on Real-world Dataset
Listwise full rank aggregation is required in real-world business. Specifically, in the game business of NetEase, large language models (Brown et al., 2020) (LLMs) can automatically generate plots and quests for players. However, the logic of the content generated by LLMs sometimes does not align with human logic. Therefore, we have to reorder the sentences of a generated paragraph to make it more reasonable. Subsequently, the reordered sentences and the originally generated sentences are used to train a reward model, which is further employed to fine-tune the LLM. However, hiring experts to determine the order of sentences is costly and inefficient, and thus we post this task on crowdsourcing platforms to collect redundant annotations. Here, we selected some examples of paragraphs and collected a real-world dataset called “ParaRank” (publicly available at https://anonymous.4open.science/r/LAC-B871) to evaluate the proposed rank aggregation method.
The basic properties of ParaRank are listed in Table 8. As shown in this table, the ParaRank dataset comprises 600 paragraphs generated by LLMs, with 300 words per paragraph on average, and each paragraph contains five sentences. Here, each paragraph corresponds to a ranking problem, where the sentences within it need to be ranked. We posted all the paragraphs on the NetEase Youling crowdsourcing platform (URL: https://zhongbao-web-9109-80.apps-fp.danlu.netease.com/mark/task) to obtain noisy rankings for them. In more detail, there are a total of 25 crowdsourced annotators, each with a satisfactory track record of historical accuracy. We solicited their recommendations for the most suitable order of the five sentences in each paragraph. Each problem was annotated by ten different annotators, and on average, each annotator annotated 239 problems. In terms of expenses, each annotator was paid 2 RMB for each problem, resulting in a total cost of 20,000 RMB. In terms of time cost, the average time expended for a single annotation was about 24.2 seconds, and the entire task was completed in 4 days. Finally, we obtained a total of 5,981 annotations for 600 problems. For evaluation, we collected the ground-truth rank for each problem, which was provided manually by human experts.
Five adopted baseline methods and our LAC are evaluated on ParaRank. For parameter setting, we choose the parameters with which each method reaches its best performance on the validation set. For example, the hyperparameter in CrowdAgg is set to 0.95, and the rank measure follows the setting used on the synthetic datasets. The results are presented in Table 9. Based on the experimental results, our LAC demonstrates superior performance to all other methods. To further investigate the performance of all methods on this real-world dataset, we gradually increase the number of examples (or the number of annotators) and record the corresponding test accuracy. The accuracy curves are illustrated in Fig. 5. The two graphs show that most methods achieve better performance with the increase in the number of examples and annotators, but LAC has more significant performance gains than the other methods. This is because LAC models the difficulty of each problem and the ability of the annotators in a more detailed manner, which results in enhanced robustness compared to other methods that do not explicitly address these critical factors.
BradleyTerry (Thurstone, 1927) | CondorcetFuse (Montague and Aslam, 2002) | CoarsenRank (Pan et al., 2022) | St.Agg (Niu et al., 2013) | CrowdAgg (Niu et al., 2015) | LAC |
68.93 | 64.53 | 73.52 | 69.27 | 69.97 | 79.26 |
5. Conclusion
In this paper, we propose a novel rank aggregation method dubbed LAC to deal with the listwise input in crowdsourcing. Unlike previous listwise methods that may only consider the partial ranks across items, our LAC delves into the underexplored problem of full rank aggregation. Moreover, LAC incorporates both the ability of annotators and the difficulty of problems into the modeling by introducing two sets of confusion matrices. Such matrices and the true ranks can be deduced iteratively by the EM algorithm. To our knowledge, LAC is the first work to directly deal with the full rank aggregation problem in listwise crowdsourcing, and simultaneously infer the difficulty of problems, the ability of annotators, and the ground-truth ranks in an unsupervised way. To evaluate our method on the listwise full rank aggregation task, we collect a dataset with real-world business consideration. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our proposed LAC method.
Acknowledgements.
This work is supported by the National Natural Science Foundation Program of China under Grants U1909207, U21B2029, and 62336003, the Natural Science Foundation of Jiangsu Province under Grant BK20220080, and the Key R&D Program of Zhejiang Province under Grant No. 2022C01011.
References
- Allahbakhsh et al. (2013) Mohammad Allahbakhsh, Boualem Benatallah, Aleksandar Ignjatovic, Hamid Reza Motahari-Nezhad, Elisa Bertino, and Schahram Dustdar. 2013. Quality control in crowdsourcing systems: issues and directions. IEEE Internet Computing 17, 2 (2013), 76–81.
- Aslam and Montague (2001) Javed A Aslam and Mark Montague. 2001. Models for metasearch. In ACM SIGIR Conference on Research and Development in Information Retrieval. 276–284.
- Baltrunas et al. (2010) Linas Baltrunas, Tadas Makcinskas, and Francesco Ricci. 2010. Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems. 119–126.
- Bi et al. (2014) Wei Bi, Liwei Wang, James T Kwok, and Zhuowen Tu. 2014. Learning to Predict from Crowdsourced Data. In Proceedings of the Uncertainty in Artificial Intelligence, Vol. 14. 82–91.
- Bishop (2006) Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
- Cao et al. (2012) Caleb Chen Cao, Jieying She, Yongxin Tong, and Lei Chen. 2012. Whom to ask? jury selection for decision making tasks on micro-blog services. arXiv preprint arXiv:1208.0273 (2012).
- Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 (2023).
- Cohen et al. (1997) William W Cohen, Robert E Schapire, and Yoram Singer. 1997. Learning to order things. Advances in Neural Information Processing Systems 10 (1997).
- Das et al. (2017) Debashis Das, Laxman Sahoo, and Sujoy Datta. 2017. A survey on recommendation system. International Journal of Computer Applications 160, 7 (2017).
- Dawid and Skene (1979) Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society 28, 1 (1979), 20–28.
- Dong et al. (2020) Jialin Dong, Kai Yang, and Yuanming Shi. 2020. Ranking from Crowdsourced Pairwise Comparisons via Smoothed Riemannian Optimization. ACM Transactions on Knowledge Discovery from Data 14, 2 (2020).
- Du et al. (2020) Yulu Du, Xiangwu Meng, Yujie Zhang, and Pengtao Lv. 2020. GERF: A Group Event Recommendation Framework Based on Learning-to-Rank. IEEE Transactions on Knowledge and Data Engineering 32, 4 (2020), 674–687.
- Dwork et al. (2001) Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. 2001. Rank Aggregation Methods for the Web. In Proceedings of the International Conference on World Wide Web. 613–622.
- Fagin et al. (2003) Ronald Fagin, Ravi Kumar, and Dandapani Sivakumar. 2003. Efficient similarity search and classification via rank aggregation. In ACM SIGMOD International Conference on Management of Data. 301–312.
- Garcia-Molina et al. (2016) Hector Garcia-Molina, Manas Joglekar, Adam Marcus, Aditya Parameswaran, and Vasilis Verroios. 2016. Challenges in Data Crowdsourcing. IEEE Transactions on Knowledge and Data Engineering (2016), 901–911.
- Gleich and Lim (2011) David F Gleich and Lek-heng Lim. 2011. Rank aggregation via nuclear norm minimization. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 60–68.
- Guiver and Snelson (2009) John Guiver and Edward Snelson. 2009. Bayesian inference for Plackett-Luce ranking models. In International Conference on Machine Learning. 377–384.
- Guo et al. (2012) Stephen Guo, Aditya Parameswaran, and Hector Garcia-Molina. 2012. So who won? Dynamic max discovery with the crowd. In ACM SIGMOD International Conference on Management of Data. 385–396.
- Hirth et al. (2011) Matthias Hirth, Tobias Hoßfeld, and Phuoc Tran-Gia. 2011. Anatomy of a crowdsourcing platform-using the example of microworkers.com. In International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing. 322–329.
- Ipeirotis et al. (2010) Panagiotis G Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on amazon mechanical turk. In ACM SIGKDD Workshop on Human Computation. 64–67.
- Kamar et al. (2015) Ece Kamar, Ashish Kapoor, and Eric Horvitz. 2015. Identifying and accounting for task-dependent bias in crowdsourcing. In AAAI Conference on Human Computation and Crowdsourcing.
- Karger et al. (2011) David Karger, Sewoong Oh, and Devavrat Shah. 2011. Iterative learning for reliable crowdsourcing systems. Advances in Neural Information Processing Systems 24 (2011).
- Kim et al. (2014) Minji Kim, Farzad Farnoud, and Olgica Milenkovic. 2014. HyDRA: gene prioritization via hybrid distance-score rank aggregation. Bioinformatics 31, 7 (2014), 1034–1043.
- Li et al. (2016) Guoliang Li, Jiannan Wang, Yudian Zheng, and Michael J Franklin. 2016. Crowdsourced data management: A survey. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2296–2319.
- Li (2011) Hang Li. 2011. A short introduction to learning to rank. IEICE Transactions on Information and Systems 94, 10 (2011), 1854–1862.
- Lin et al. (2018) Xin Lin, Jianliang Xu, Haibo Hu, and Fan Zhe. 2018. Reducing Uncertainty of Probabilistic Top-k Ranking via Pairwise Crowdsourcing. In IEEE International Conference on Data Engineering. 1757–1758.
- Liu and Moitra (2020) Allen Liu and Ankur Moitra. 2020. Better Algorithms for Estimating Non-Parametric Models in Crowd-Sourcing and Rank Aggregation. In Proceedings of the Thirty Third Conference on Learning Theory, Vol. 125. 2780–2829.
- Montague and Aslam (2002) Mark Montague and Javed A Aslam. 2002. Condorcet fusion for improved retrieval. In International Conference on Information and Knowledge Management. 538–548.
- Niu et al. (2013) Shuzi Niu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. 2013. Stochastic Rank Aggregation. In Proceedings of the Uncertainty in Artificial Intelligence.
- Niu et al. (2015) Shuzi Niu, Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Lei Yu, and Guoping Long. 2015. Listwise approach for rank aggregation in crowdsourcing. In Proceedings of the ACM International Conference on Web Search and Data Mining. 253–262.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
- Pan et al. (2022) Yuangang Pan, Ivor W Tsang, Weijie Chen, Gang Niu, and Masashi Sugiyama. 2022. Fast and Robust Rank Aggregation against Model Misspecification. Journal of Machine Learning Research 23 (2022), 23–1.
- Plackett (1975) Robin L Plackett. 1975. The analysis of permutations. Journal of the Royal Statistical Society 24, 2 (1975), 193–202.
- Raghavan (1997) Prabhakar Raghavan. 1997. Information retrieval algorithms: A survey. In ACM-SIAM Symposium on Discrete Algorithms. 11–18.
- Raykar et al. (2010) Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning from crowds. Journal of Machine Learning Research 11, 4 (2010).
- Sarma et al. (2014) Anish Das Sarma, Aditya Parameswaran, Hector Garcia-Molina, and Alon Halevy. 2014. Crowd-powered find algorithms. In IEEE International Conference on Data Engineering. 964–975.
- Spiegelhalter et al. (1998) David Spiegelhalter, Nicky Best, and Bradley Carlin. 1998. Bayesian Deviance, the Effective Number of Parameters, and the Comparison of Arbitrarily Complex Models. Journal of Royal Statistical Society 64 (1998).
- Thurstone (1927) Louis L Thurstone. 1927. The method of paired comparisons for social values. Journal of Abnormal and Social Psychology (1927), 384–400.
- Venanzi et al. (2014) Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. 2014. Community-based bayesian aggregation models for crowdsourcing. In Proceedings of the International Conference on World Wide Web. 155–164.
- Wang et al. (2012) Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2012. Crowder: Crowdsourcing entity resolution. arXiv preprint arXiv:1208.1927 (2012).
- Wang et al. (2013) Jiannan Wang, Guoliang Li, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2013. Leveraging transitive relations for crowdsourced joins. In ACM SIGMOD International Conference on Management of Data. 229–240.
- Wu et al. (2023) Gongqing Wu, Xingrui Zhuo, Xianyu Bao, Xuegang Hu, Richang Hong, and Xindong Wu. 2023. Crowdsourcing Truth Inference via Reliability-Driven Multi-View Graph Embedding. ACM Transactions on Knowledge Discovery from Data 17, 5 (2023).
- Wu et al. (2021) Hanlu Wu, Tengfei Ma, Lingfei Wu, Fangli Xu, and Shouling Ji. 2021. Exploiting Heterogeneous Graph Neural Networks with Latent Worker/Task Correlation Information for Label Aggregation in Crowdsourcing. ACM Transactions on Knowledge Discovery from Data 16, 2 (2021).
- Wu et al. (2016) Ou Wu, Qiang You, Fen Xia, Lei Ma, and Weiming Hu. 2016. Listwise Learning to Rank from Crowds. ACM Transactions on Knowledge Discovery from Data 11, 1 (2016).
- Xu et al. (2021) Qianqian Xu, Zhiyong Yang, Zuyao Chen, Yangbangyan Jiang, Xiaochun Cao, Yuan Yao, and Qingming Huang. 2021. Deep Partial Rank Aggregation for Personalized Attributes. In Proceedings of the AAAI Conference on Artificial Intelligence. 678–688.
- Yan et al. (2010) Tingxin Yan, Vikas Kumar, and Deepak Ganesan. 2010. Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones. In International Conference on Mobile Systems, Applications, and Services. 77–90.
- Yang et al. (2024) Yi Yang, Zhong-Qiu Zhao, Gongqing Wu, Xingrui Zhuo, Qing Liu, Quan Bai, and Weihua Li. 2024. A Lightweight, Effective, and Efficient Model for Label Aggregation in Crowdsourcing. ACM Transactions on Knowledge Discovery from Data 18, 4 (2024).
- Yu et al. (2023) Hao Yu, Chengyuan Zhang, Jiaye Li, and Shichao Zhang. 2023. Robust Sparse Weighted Classification for Crowdsourcing. IEEE Transactions on Knowledge and Data Engineering 35, 8 (2023), 8490–8502.
- Zeng et al. (2018) Yuxiang Zeng, Yongxin Tong, Lei Chen, and Zimu Zhou. 2018. Latency-oriented task completion via spatial crowdsourcing. In International Conference on Data Engineering. IEEE, 317–328.
- Zhang and Wu (2021) Jing Zhang and Xindong Wu. 2021. Multi-Label Truth Inference for Crowdsourcing Using Mixture Models. IEEE Transactions on Knowledge and Data Engineering 33, 5 (2021), 2083–2095.
- Zhang et al. (2014) Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. 2014. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Advances in Neural Information Processing Systems 27 (2014).
- Zhao et al. (2015) Zhou Zhao, Furu Wei, Ming Zhou, Weikeng Chen, and Wilfred Ng. 2015. Crowd-Selection Query Processing in Crowdsourcing Databases: A Task-Driven Approach. In International Conference on Extending Database Technology. 397–408.