Partition-Mallows Model and Its Inference
for Rank Aggregation
Abstract
Learning how to aggregate ranking lists has been an active research area for many years and its advances have played a vital role in many applications ranging from bioinformatics to internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entities into two disjoint groups, i.e., relevant and irrelevant/background ones, and incorporating the Mallows model for the relative ranking of relevant entities, we propose a framework for rank aggregation that can not only distinguish quality differences among the rankers but also provide the detailed ranking information for relevant entities. Theoretical properties of the proposed approach are established, and its advantages over existing approaches are demonstrated via simulation studies and real-data applications. Extensions of the proposed method to handle partial ranking lists and conduct covariate-assisted rank aggregation are also discussed.
Keywords: Meta-analysis; Heterogeneous rankers; Mallows model; Partial ranking lists; Covariate-assisted aggregation.
1 Introduction
Rank data arise naturally in many fields, such as web searching (Renda and Straccia,, 2003), design of recommendation systems (Linas et al.,, 2010) and genomics (Bader,, 2011). Many probabilistic models have been proposed for analyzing this type of data, among which the Thurstone model (Thurstone,, 1927), the Mallows model (Mallows,, 1957) and the Plackett-Luce model (Luce,, 1959; Plackett,, 1975) are the most well-known representatives. The Thurstone model assumes that each entity possesses a hidden score and all the scores come from a joint probability distribution. The Mallows model is a location model defined on the permutation space of ordered entities, in which the probability mass of a permuted order is an exponential function of its distance from the true order. The Plackett-Luce model assumes that each entity is associated with a positive preference weight, and describes a recursive procedure for generating a random ranking list: entities are picked one by one, with probability proportional to their weights, in a sequential fashion without replacement, and are ranked based on their order of selection.
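To make the Plackett-Luce sampling procedure concrete, the following minimal Python sketch (the function name and example weights are ours, purely illustrative) draws one ranking by the recursive selection scheme described above.

```python
import numpy as np

def sample_plackett_luce(weights, rng=None):
    """Draw one ranking from a Plackett-Luce model.

    Entities are selected one by one, each time with probability
    proportional to its weight among the entities not yet chosen;
    the order of selection is the returned ranking (best first).
    """
    rng = np.random.default_rng(rng)
    remaining = list(range(len(weights)))
    ranking = []
    while remaining:
        w = np.array([weights[i] for i in remaining], dtype=float)
        pick = rng.choice(len(remaining), p=w / w.sum())
        ranking.append(remaining.pop(pick))
    return ranking  # ranking[r] = entity placed at position r+1

# Example: entity 0 has the largest weight, so it tends to be ranked first.
print(sample_plackett_luce([5.0, 2.0, 1.0, 0.5], rng=42))
```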
Rank aggregation aims to derive a “better” aggregated ranking list from multiple input ranking lists. It is a classic problem and has been studied in a variety of contexts for decades. Early applications of rank aggregation can be traced back to 18th-century France, where the idea of rank aggregation was proposed to solve the problem of political elections (Borda,, 1781). In the past 30 years, efficient rank aggregation algorithms have played important roles in many fields, such as web searching (Renda and Straccia,, 2003), information retrieval (Fagin et al.,, 2003), design of recommendation systems (Linas et al.,, 2010), social choice studies (Porello and Endriss,, 2012; Soufiani et al.,, 2014), genomics (Bader,, 2011) and bioinformatics (Lin and Ding,, 2010; Chen et al.,, 2016).
Some popular approaches for rank aggregation are based on summary statistics. These methods simply calculate a summary statistic, such as the mean, median or geometric mean, of each entity's rankings across the different ranking lists, and obtain the aggregated ranking list by sorting these summary statistics. Optimization-based methods obtain the aggregated ranking by minimizing a user-defined objective function, typically the total distance between the aggregated list and the observed lists, where the distance measure can be either Spearman's footrule distance (Diaconis and Graham,, 1977) or the Kendall tau distance (Diaconis,, 1988). More detailed studies on these optimization-based methods can be found in Young and Levenglick, (1978); Young, (1988); Dwork et al., (2001).
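For concreteness, here is a small Python sketch of the two distance measures mentioned above; the function names are ours, and the inputs are rankings given as lists of entities from most to least preferred.

```python
import itertools

def spearman_footrule(tau_a, tau_b):
    """Sum over entities of |position in tau_a - position in tau_b|."""
    pos_a = {e: r for r, e in enumerate(tau_a)}
    pos_b = {e: r for r, e in enumerate(tau_b)}
    return sum(abs(pos_a[e] - pos_b[e]) for e in pos_a)

def kendall_tau(tau_a, tau_b):
    """Number of entity pairs ordered differently by the two lists."""
    pos_a = {e: r for r, e in enumerate(tau_a)}
    pos_b = {e: r for r, e in enumerate(tau_b)}
    return sum(
        1
        for x, y in itertools.combinations(pos_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

print(kendall_tau([1, 2, 3, 4], [2, 1, 4, 3]))  # 2 discordant pairs
```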
In the early 2000s, a novel class of Markov chain-based methods was proposed (Dwork et al.,, 2001; Lin and Ding,, 2010; Lin,, 2010; Deconde et al.,, 2011), which first uses the observed ranking lists to construct a probabilistic transition matrix among the entities and then uses the magnitudes of the entities' equilibrium probabilities under the resulting Markov chain to rank them. The boosting-based method RankBoost (Freund et al.,, 2003) employs a feedback function over entity pairs to construct the final ranking, where a positive (or negative) feedback for a pair indicates that the first entity is (or is not) preferred to the second. Some statistical methods utilize the aforementioned probabilistic models (such as the Thurstone model) and derive the maximum likelihood estimate (MLE) of the final ranking. More recently, researchers have begun to pay attention to rank aggregation methods for pairwise comparison data (Rajkumar and Agarwal,, 2014; Chen and Suh,, 2015; Chen et al.,, 2019).
We note that all the aforementioned methods assume that the rankers of interest are equally reliable. In practice, however, it is very common that some rankers are more reliable than others, whereas some are nearly non-informative and may be regarded as “spam rankers”. Such differences in rankers' qualities, if ignored in the analysis, may significantly corrupt the rank aggregation and lead to seriously misleading results. To the best of our knowledge, the earliest effort to address this critical issue can be traced to Aslam and Montague, (2001), which derived an aggregated ranking list by calculating a weighted summation of the observed ranking lists, known as the Borda Fuse. Lin and Ding, (2010) extended the objective function of Dwork et al., (2001) to a weighted version. Independently, Liu et al., (2007) proposed a supervised rank aggregation that determines the weights of the rankers by training with some external data. Although assigning weights to rankers is an intuitive and simple way to handle quality differences, how to scientifically determine these weights remained a critical and unsolved problem in the aforementioned works.
Recently, Deng et al., (2014) proposed BARD, a Bayesian approach to deal with quality differences among independent rankers without the need of external information. BARD introduces a partition model, which assumes that all involved entities can be partitioned into two groups: the relevant ones and the background ones. A rationale of the approach is that, in many applications, distinguishing relevant entities from background ones takes priority over the construction of a final ranking of all entities. Under this setting, BARD decomposes the information in a ranking list into three components: (i) the relative rankings of all background entities, which are assumed to be uniform; (ii) the relative ranking of each relevant entity among all background ones, which takes the form of a truncated power-law; and (iii) the relative rankings of all relevant entities, which are again uniform. The parameter of the truncated power-law distribution, which is ranker-specific, naturally serves as a quality measure for each ranker, as a ranker of higher quality corresponds to a more concentrated truncated power-law distribution.
Li et al., (2020) proposed a stage-wise data generation process based on an extended Mallows model (EMM) introduced by Fligner and Verducci, (1986). EMM assumes that each entity comes from a two-component mixture model involving a uniform distribution for non-informative entities, a modified Mallows model for informative entities, and a ranker-specific proportion parameter. Li et al., (2021) followed the Thurstone model framework to handle available covariates for the entities as well as different qualities of the rankers. In their model, each entity is associated with a Gaussian-distributed latent score and a ranking list is determined by the ranking of these scores. The quality of each ranker is determined by the standard deviation parameter in the Gaussian model, so that a larger standard deviation indicates a poorer-quality ranker.
Although these recent papers have proposed different ways to learn the quality variation among rankers, they all suffer from some limitations. The BARD method (Deng et al.,, 2014) simplifies the problem by assuming that all relevant entities are exchangeable. In many applications, however, the observed ranking lists often carry strong ordering information for the relevant entities, and simply labeling these entities as “relevant” without considering their relative rankings loses too much information and oversimplifies the problem. The extended Mallows model of Li et al., (2020) does not explicitly measure quality differences; although the authors mentioned that some of their model parameters can indicate the rankers' qualities, it is not clear how to properly combine multiple indicators into an easily interpretable quality measurement. The learning framework of Li et al., (2021), based on Gaussian latent variables, appears to be more suitable for incorporating covariates than for handling heterogeneous rankers.
In this paper, we propose a partition-Mallows model (PAMA), which combines the partition modeling framework of Deng et al., (2014) with the Mallows model, to accommodate the detailed ordering information among the relevant entities. The new framework can not only quantify the quality differences of rankers and distinguish relevant entities from background entities like BARD, but also provide an explicit ranking estimate among the relevant entities in rank aggregation. In contrast to the strategy of imposing the Mallows model on the full ranking lists, which tends to be sensitive to noise among low-ranked entities, the combination of the partition and Mallows models allows us to focus on highly ranked entities, which typically contain the high-quality signals in the data, and is thus more robust. Both simulation studies and real data applications show that the proposed approach is superior to existing methods, e.g., BARD and EMM, for a large class of rank aggregation problems.
The rest of this paper is organized as follows. A brief review of BARD and the Mallows model is presented in Section 2 as preliminaries. The proposed PAMA model is described in Section 3, with some key theoretical properties established. Statistical inference of the PAMA model, including the Bayesian inference and the pursuit of the MLE, is detailed in Section 4. The performance of PAMA is evaluated and compared to existing methods via simulations in Section 5. Two real data applications are presented in Section 6 to demonstrate the strength of the PAMA model in practice. Finally, we conclude the article with a short discussion in Section 7.
2 Notations and Preliminaries
Let $U = \{E_1, \ldots, E_n\}$ be the set of $n$ entities to be ranked. We use $E_i \prec_{\tau} E_j$ to represent that entity $E_i$ is preferred to entity $E_j$ in a ranking list $\tau$, and denote the position of entity $E_i$ in $\tau$ by $\tau(i)$. Note that more preferred entities always have lower rankings. Our research interest is to aggregate $m$ observed ranking lists $\tau_1, \ldots, \tau_m$, presumably constructed by $m$ rankers independently, into one consensus ranking list that is supposed to be “better” than each individual one.
2.1 BARD and Its Partition Model
The partition model in BARD (Deng et al.,, 2014) assumes that $U$ can be partitioned into two non-overlapping subsets: $U = U_R \cup U_B$, with $U_R$ representing the set of relevant entities and $U_B$ the background ones. Let $I = (I_1, \ldots, I_n)$ be the vector of group indicators, where $I_i = \mathbb{1}(E_i \in U_R)$ and $\mathbb{1}(\cdot)$ is the indicator function. This formulation makes sense in many applications where people are only concerned about a fixed number of top-ranked entities. Under this formulation, the information in a ranking list $\tau_k$ can be equivalently represented by a triplet $(\tau_k^0, \tau_k^{1|0}, \tau_k^1)$, where $\tau_k^0$ denotes the relative rankings of all background entities, $\tau_k^{1|0}$ denotes the relative rankings of the relevant entities among the background entities, and $\tau_k^1$ denotes the relative rankings of all relevant entities.
Deng et al., (2014) suggested a three-component model for $\tau_k$ by taking advantage of this equivalent decomposition:
$$P(\tau_k \mid I, \gamma_k) = P(\tau_k^0 \mid I) \cdot P(\tau_k^{1|0} \mid \tau_k^0, I, \gamma_k) \cdot P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I), \qquad (1)$$
where both $P(\tau_k^0 \mid I)$ (relative ranking of the background entities) and $P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I)$ (relative ranking of the relevant entities conditional on their set of positions relative to background entities) are uniform, and the relative ranking $\tau_k^{1|0}(i)$ of a relevant entity $E_i$ among the background ones follows a power-law distribution with parameter $\gamma_k > 0$, i.e.,
$$P(\tau_k^{1|0}(i) = j) \propto j^{-\gamma_k}, \qquad j = 1, \ldots, n_0 + 1,$$
leading to the following explicit forms for the three terms in equation (1):
$$P(\tau_k^0 \mid I) = \frac{1}{n_0!}, \qquad (2)$$
$$P(\tau_k^{1|0} \mid \tau_k^0, I, \gamma_k) = \prod_{i:\, I_i = 1} \frac{\big(\tau_k^{1|0}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}}, \qquad (3)$$
$$P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I) = \frac{1}{|A(\tau_k^{1|0})|}, \qquad (4)$$
where $n_1$ and $n_0$ are the counts of relevant and background entities respectively, $Z_{\gamma_k} = \sum_{j=1}^{n_0+1} j^{-\gamma_k}$ is the normalizing constant of the power-law distribution, $A(\tau_k^{1|0})$ is the set of $\tau_k^1$'s that are compatible with $\tau_k^{1|0}$, and $n_1 + n_0 = n$.
Intuitively, this model assumes that each ranker first randomly places all background entities to generate $\tau_k^0$, then “inserts” each relevant entity independently into the list of background entities according to a truncated power-law distribution to generate $\tau_k^{1|0}$, and finally draws $\tau_k^1$ uniformly from $A(\tau_k^{1|0})$. In other words, $\tau_k^0$ serves as a baseline for modeling $\tau_k^{1|0}$ and $\tau_k^1$. It is easy to see from the model that a more reliable ranker should possess a larger $\gamma_k$. With the assumption of independent rankers, we have the full-data likelihood:
$$L(\gamma, I; \tau_1, \ldots, \tau_m) = \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k), \qquad (5)$$
where $\gamma = (\gamma_1, \ldots, \gamma_m)$. A detailed Bayesian inference procedure for $(\gamma, I)$ via Markov chain Monte Carlo can be found in Deng et al., (2014).
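The generative story behind (1)-(5) can be illustrated with a short Python sketch; the function below is our own illustrative rendering (not the authors' code), using the convention that a relevant entity's insertion slot $j = 1, \ldots, n_0 + 1$ follows the truncated power law, with slot 1 corresponding to the top of the list.

```python
import numpy as np

def sample_bard_list(n_relevant, n_background, gamma, rng=None):
    """Sketch of BARD's generative story for one ranker.

    1. Background entities are ordered uniformly at random.
    2. Each relevant entity independently draws an insertion slot among
       the background list from a truncated power law with exponent gamma.
    3. Relevant entities falling in the same slot are ordered uniformly.
    Entities 0..n_relevant-1 are relevant; the rest are background.
    """
    rng = np.random.default_rng(rng)
    background = list(rng.permutation(np.arange(n_relevant, n_relevant + n_background)))
    slots = np.arange(1, n_background + 2)           # slots 1..n0+1
    p = slots.astype(float) ** (-gamma)
    p /= p.sum()
    draws = rng.choice(slots, size=n_relevant, p=p)  # slot of each relevant entity
    ranking = []
    for s in range(1, n_background + 2):
        bucket = [i for i in range(n_relevant) if draws[i] == s]
        rng.shuffle(bucket)                          # uniform relative order
        ranking.extend(bucket)
        if s <= n_background:
            ranking.append(background[s - 1])
    return ranking

print(sample_bard_list(3, 7, gamma=2.0, rng=0))
```

With a large `gamma`, the relevant entities (0, 1, 2 here) concentrate near the top, mimicking a reliable ranker.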
2.2 The Mallows Model
Mallows, (1957) proposed the following probability model for a ranking list $\tau$ of $n$ entities:
$$P(\tau \mid \tau_0, \phi) = \frac{1}{Z_n(\phi)} \exp\{-\phi \cdot d(\tau, \tau_0)\}, \qquad (6)$$
where $\tau_0$ denotes the true ranking list, $\phi \geq 0$ characterizes the reliability of $\tau$, the function $d(\cdot, \cdot)$ is a distance metric between two ranking lists, and
$$Z_n(\phi) = \sum_{\tau} \exp\{-\phi \cdot d(\tau, \tau_0)\} \qquad (7)$$
is the normalizing constant, whose analytic form was derived in Diaconis, (1988). Clearly, a larger $\phi$ means that $\tau$ is more stable and concentrates in a tighter neighborhood of $\tau_0$. A common choice of $d(\cdot, \cdot)$ is the Kendall tau distance.
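Under the Kendall tau distance, $Z_n(\phi)$ admits the well-known closed form $\prod_{j=1}^{n} (1 - e^{-j\phi}) / (1 - e^{-\phi})$, which the following Python sketch (our own illustration) uses to evaluate the Mallows probability; the final line checks that the probabilities over all permutations of three entities sum to one.

```python
import math
import itertools

def kendall_tau(tau_a, tau_b):
    pos_a = {e: r for r, e in enumerate(tau_a)}
    pos_b = {e: r for r, e in enumerate(tau_b)}
    return sum(1 for x, y in itertools.combinations(pos_a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def mallows_normalizer(n, phi):
    """Closed-form Z_n(phi) under the Kendall tau distance."""
    return math.prod((1 - math.exp(-j * phi)) / (1 - math.exp(-phi))
                     for j in range(1, n + 1))

def mallows_pmf(tau, tau0, phi):
    d = kendall_tau(tau, tau0)
    return math.exp(-phi * d) / mallows_normalizer(len(tau), phi)

# Probabilities over all permutations of 3 entities sum to 1.
tau0 = [0, 1, 2]
print(sum(mallows_pmf(list(t), tau0, phi=1.2)
          for t in itertools.permutations(tau0)))
```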
The Mallows model under the Kendall tau distance can also be equivalently described by an alternative multistage model, which selects and positions entities one by one in a sequential fashion, with $\phi$ serving as a common parameter that governs the probabilistic behavior of each entity in the stochastic process (Mallows,, 1957). Later on, Fligner and Verducci, (1986) extended the Mallows model by allowing $\phi$ to vary across stages, i.e., introducing a position-specific parameter $\phi_s$ for each position $s$, which leads to a very flexible, in many cases too flexible, framework for modeling rank data. To stabilize the generalized Mallows model of Fligner and Verducci, (1986), Li et al., (2020) proposed to put a structural constraint on the $\phi_s$'s. As a probabilistic model for rank data, the Mallows model enjoys great interpretability, model compactness, and efficiency in inference and computation. For a comprehensive review of the Mallows model and its extensions, see Irurozki et al., (2014) and Li et al., (2020).
3 The Partition-Mallows Model
The partition model employed by BARD (Deng et al.,, 2014) tends to oversimplify the problem for scenarios where we care about the detailed rankings of relevant entities. To further enhance the partition model of BARD so that it can reflect the detailed rankings of relevant entities, we describe a new partition-Mallows model in this section.
3.1 The Reverse Partition Model
To combine the partition model with the Mallows model, a naive strategy is to simply replace the uniform model for the relevant entities, i.e., $P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I)$ in (1), by the Mallows model, which leads to the updated Equation (4) below:
$$P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I, \phi_k) = \frac{\exp\{-\phi_k \cdot d(\tau_k^1, \tau_0^1)\}}{Z(\tau_k^{1|0}, \phi_k)}, \qquad \tau_k^1 \in A(\tau_k^{1|0}),$$
where $\exp\{-\phi_k \cdot d(\tau_k^1, \tau_0^1)\}$ is the Mallows density of $\tau_k^1$ centered at the true relative ranking $\tau_0^1$ of the relevant entities, and $Z(\tau_k^{1|0}, \phi_k) = \sum_{\tau' \in A(\tau_k^{1|0})} \exp\{-\phi_k \cdot d(\tau', \tau_0^1)\}$ is the normalizing constant of the Mallows model with a constraint due to the compatibility of $\tau_k^1$ with respect to $\tau_k^{1|0}$. Apparently, the calculation of $Z(\tau_k^{1|0}, \phi_k)$, which involves the summation over the whole space of compatible $\tau_k^1$'s, is infeasible for most practical cases, rendering such a naive combination of the Mallows model and the partition model impractical.
To avoid the challenging computation caused by the compatibility constraints, we rewrite the partition model by switching the roles of the relevant and background entities: instead of decomposing $\tau_k$ as $(\tau_k^0, \tau_k^{1|0}, \tau_k^1)$ conditioning on the group indicators $I$, we decompose $\tau_k$ into an alternative triplet $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$, where $\tau_k^{0|1}$ denotes the relative reverse rankings of the background entities among the relevant ones; formally, for any background entity $E_i$, $\tau_k^{0|1}(i)$ denotes its relative reverse ranking among the relevant ones. In this reverse partition model, we first order the relevant entities according to a certain distribution and then use them as a reference system to “insert” the background entities. Figure 1 illustrates the equivalence between $\tau_k$ and its two alternative representations, $(\tau_k^0, \tau_k^{1|0}, \tau_k^1)$ and $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$.
Given the group indicator vector $I$, the reverse partition model gives rise to the following distributional form for $\tau_k$:
$$P(\tau_k \mid I, \gamma_k) = P(\tau_k^1 \mid I) \cdot P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) \cdot P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I), \qquad (8)$$
which is analogous to (1) for the original partition model in BARD. Compared to (1), however, the new form (8) enables us to specify an unconstrained marginal distribution for $\tau_k^1$. Moreover, due to the symmetry between $\tau_k^{1|0}$ and $\tau_k^{0|1}$, it is highly likely that the power-law distribution, which was shown in Deng et al., (2014) to approximate the distribution of $\tau_k^{1|0}(i)$ well for each relevant entity, can also model $\tau_k^{0|1}(i)$ for each background entity reasonably well. Detailed numerical validations are shown in Supplementary Material.
If we assume that all relevant entities are exchangeable, all background entities are exchangeable, and the relative reverse ranking $\tau_k^{0|1}(i)$ of a background entity among the relevant entities follows a power-law distribution, we have
$$P(\tau_k^1 \mid I) = \frac{1}{n_1!}, \qquad (9)$$
$$P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) = \prod_{i:\, I_i = 0} \frac{\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}}, \qquad (10)$$
$$P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I) = \frac{1}{|A(\tau_k^{0|1})|}, \qquad (11)$$
where $n_1$ and $n_0$ are the numbers of relevant and background entities, respectively, $\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}$ is the unnormalized part of the power-law, $Z_{\gamma_k} = \sum_{j=1}^{n_1+1} j^{-\gamma_k}$ is the normalizing constant (now truncated at $n_1 + 1$), $A(\tau_k^{0|1})$ is the set of all $\tau_k^0$'s that are compatible with a given $\tau_k^{0|1}$, and $n_1 + n_0 = n$. Apparently, the likelihood of this reverse-partition model shares the same structure as that of the original partition model in BARD, and thus can be inferred in a similar way.
3.2 The Partition-Mallows Model
The reverse partition model introduced in Section 3.1 allows us to freely model $\tau_k^1$ beyond a uniform distribution, which is infeasible for the original partition model in BARD. Here we employ the Mallows model for $\tau_k^1$ due to its interpretability, compactness and computability. To achieve this, we replace the group indicator vector $I$ in the partition model by a more general indicator vector, which takes values in the space of all arrangements of the multiset $\{1, \ldots, n_1, 0, \ldots, 0\}$ (with $n_0$ zeros), with $I_i = 0$ if $E_i \in U_B$, and $I_i = t > 0$ if $E_i \in U_R$ and $E_i$ is ranked at position $t$ among all relevant entities. Figure 1 provides an illustrative example of assigning an enhanced indicator vector to a universe of 10 entities with $n_1 = 5$.
Based on the status of $I_i$, we can define subvectors $I^+$ and $I^0$, where $I^+$ stands for the subvector of $I$ containing all positive elements, and $I^0$ for the remaining zero elements. Figure 1 demonstrates the constructions of $I$, $I^+$ and $I^0$, and the equivalence between $\tau_k$ and $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$ given $I$. Note that, different from the partition model in BARD, which allows the number of relevant entities to vary around its expected value, the number of relevant entities $n_1$ in the new model is assumed to be fixed and known for conceptual and computational convenience. In other words, we have $\sum_{i=1}^{n} \mathbb{1}(I_i > 0) = n_1$ in the new setting.
[Figure 1: an illustrative example with 10 entities and $n_1 = 5$, showing the enhanced indicator vector $I$, its subvectors $I^+$ and $I^0$, and the equivalent representations of a ranking list $\tau_k$ via $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$.]
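The bookkeeping behind the enhanced indicator vector can be sketched in a few lines of Python (our own illustrative helper functions, with 0-indexed entities):

```python
def enhanced_indicator(relevant_in_order, n):
    """Build the enhanced indicator vector I for n entities.

    relevant_in_order lists the relevant entities from best to worst;
    I[i] = 0 marks entity i as background, and I[i] = t > 0 says entity i
    is the t-th best relevant entity.
    """
    I = [0] * n
    for t, entity in enumerate(relevant_in_order, start=1):
        I[entity] = t
    return I

def split_indicator(I):
    """Recover the two subvectors: the ranks of the relevant entities
    (I+) and the zero pattern of the background entities (I0)."""
    I_plus = {i: t for i, t in enumerate(I) if t > 0}
    I_zero = [i for i, t in enumerate(I) if t == 0]
    return I_plus, I_zero

I = enhanced_indicator([3, 0, 2], n=6)   # entity 3 best, then 0, then 2
print(I)                                  # [2, 0, 3, 1, 0, 0]
print(split_indicator(I))
```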
As an analogy of Equations (1) and (8), we have the following decomposition of $\tau_k$ given the enhanced indicator vector $I$:
$$P(\tau_k \mid I, \gamma_k, \phi_k) = P(\tau_k^1 \mid I, \phi_k) \cdot P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) \cdot P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I). \qquad (12)$$
Assume that $\tau_k^1$ follows the Mallows model (with parameter $\phi_k$) centered at $I^+$:
$$P(\tau_k^1 \mid I, \phi_k) = \frac{1}{Z_{n_1}(\phi_k)} \exp\{-\phi_k \cdot d(\tau_k^1, I^+)\}, \qquad (13)$$
where $d(\cdot, \cdot)$ denotes the Kendall tau distance and $Z_{n_1}(\phi_k)$ is defined as in (7). Clearly, a larger $\phi_k$ indicates that ranker $k$ is of a higher quality, as the distribution is more concentrated at the “true ranking” defined by $I^+$. Since the relative rankings of the background entities are of no interest to us, we still assume that they are randomly ranked. Together with the power-law assumption for $\tau_k^{0|1}$, we have
$$P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) = \prod_{i:\, I_i = 0} \frac{\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}}, \qquad (14)$$
$$P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I) = \frac{1}{|A(\tau_k^{0|1})|}, \qquad (15)$$
where the notations $Z_{\gamma_k}$ and $A(\tau_k^{0|1})$ are the same as in the reverse-partition model. We call the resulting model the Partition-Mallows model, abbreviated as PAMA.
Different from the partition and reverse partition models, which quantify the quality of ranker $k$ with only the parameter $\gamma_k$ in the power-law distribution, the PAMA model contains two quality parameters $\phi_k$ and $\gamma_k$, with the former indicating the ranker's ability to rank the relevant entities and the latter reflecting the ranker's ability to differentiate relevant entities from background ones. Intuitively, $\phi_k$ and $\gamma_k$ reflect the quality of ranker $k$ in two different aspects. However, considering that a good ranker is typically strong in both dimensions, it is natural to further simplify the model by assuming
$$\phi_k = \alpha \cdot \gamma_k, \qquad (16)$$
with $\alpha > 0$ being a common factor for all rankers. This assumption, while reducing the number of free parameters by almost half, captures the natural positive correlation between $\phi_k$ and $\gamma_k$ and serves as a first-order (i.e., linear) approximation to the functional relationship between them. A wide range of numerical studies based on simulated data suggest that the linear approximation shown in (16) works reasonably well for many typical rank aggregation scenarios. In contrast, the more flexible model with both $\phi_k$ and $\gamma_k$ as free parameters (referred to as PAMA∗) suffers from unstable performance from time to time. Detailed evidence supporting assumption (16) can be found in Supplementary Material.
Plugging (16) into (13), we have a simplified model for $\tau_k^1$ given $I$:
$$P(\tau_k^1 \mid I, \gamma_k, \alpha) = \frac{1}{Z_{n_1}(\alpha \gamma_k)} \exp\{-\alpha \gamma_k \cdot d(\tau_k^1, I^+)\}. \qquad (17)$$
Combining (14), (15) and (17), we get the full likelihood of $\tau_k$:
$$P(\tau_k \mid I, \gamma_k, \alpha) = \frac{\exp\{-\alpha \gamma_k \cdot d(\tau_k^1, I^+)\}}{Z_{n_1}(\alpha \gamma_k)} \cdot \prod_{i:\, I_i = 0} \frac{\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}} \cdot \frac{1}{|A(\tau_k^{0|1})|}, \qquad (18)$$
where $Z_{n_1}(\cdot)$, $Z_{\gamma_k}$ and $A(\tau_k^{0|1})$ keep the same meanings as in the reverse partition model. At last, for the set of observed ranking lists $\tau = \{\tau_1, \ldots, \tau_m\}$ from $m$ independent rankers, we have the joint likelihood:
$$L(\theta; \tau) = \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (19)$$
where $\theta = (I, \gamma_1, \ldots, \gamma_m, \alpha)$ collects all model parameters.
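As a concrete illustration of the likelihood (18)-(19), the Python sketch below evaluates the log-likelihood up to the $\gamma$-free uniform constants, under two assumptions of ours: the Mallows parameter is linked by $\phi_k = \alpha \gamma_k$ as in (16), and a background entity's reverse rank is computed as one plus the number of relevant entities ranked below it (so that rank 1, the bottom of the list, is the most probable slot). These conventions are our reading of the model, not code from the paper.

```python
import itertools
import math

def kendall_tau(a, b):
    pos_a = {e: r for r, e in enumerate(a)}
    pos_b = {e: r for r, e in enumerate(b)}
    return sum(1 for x, y in itertools.combinations(pos_a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def log_mallows_normalizer(n, phi):
    # closed form of Z_n(phi) under the Kendall tau distance
    return sum(math.log((1 - math.exp(-j * phi)) / (1 - math.exp(-phi)))
               for j in range(1, n + 1))

def pama_loglik(tau_lists, I, gammas, alpha):
    """PAMA log-likelihood up to gamma-free uniform constants.

    tau_lists[k]: ranker k's full list, best first.
    I: enhanced indicator (0 = background, t > 0 = t-th best relevant).
    gammas[k] > 0: ranker k's quality; the Mallows parameter is
    phi_k = alpha * gammas[k], following (16).
    """
    relevant = sorted((i for i in range(len(I)) if I[i] > 0),
                      key=lambda i: I[i])           # the center I+
    n1 = len(relevant)
    ll = 0.0
    for tau, g in zip(tau_lists, gammas):
        phi = alpha * g
        tau1 = [e for e in tau if I[e] > 0]         # relative order of relevant
        ll += -phi * kendall_tau(tau1, relevant) - log_mallows_normalizer(n1, phi)
        log_z = math.log(sum(j ** (-g) for j in range(1, n1 + 2)))
        for pos, e in enumerate(tau):
            if I[e] == 0:                           # background entity
                # reverse rank: 1 + number of relevant entities ranked below
                rev = 1 + sum(1 for f in tau[pos + 1:] if I[f] > 0)
                ll += -g * math.log(rev) - log_z
    return ll
```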
3.3 Model Identifiability and Estimation Consistency
Let $\mathcal{I}$ be the space of all possible enhanced indicator vectors in which $I$ takes value, and let $\theta = (I, \gamma_1, \ldots, \gamma_m, \alpha)$ be the vector of model parameters. The PAMA model in (19) thus defines a family of probability distributions $\{P_\theta\}$ indexed by the parameter $\theta$ taking values in the space $\Theta = \mathcal{I} \times (0, \infty)^m \times (0, \infty)$. We show here that the PAMA model defined in (19) is identifiable and the model parameters can be estimated consistently under mild conditions.
Theorem 1.
The PAMA model is identifiable, i.e., for any $\theta_1, \theta_2 \in \Theta$,
$$P_{\theta_1}(\tau) = P_{\theta_2}(\tau) \ \text{ for all } \tau \implies \theta_1 = \theta_2. \qquad (20)$$
Proof.
See Supplementary Material. ∎
To show that parameters in the PAMA model can be estimated consistently, we will first construct an estimator of the indicator vector $I$ that is consistent as $m \to \infty$ with the number of ranked entities $n$ fixed, and show later that the remaining parameters can also be consistently estimated once $I$ is given. To this end, we define $\bar{\tau}(i) = \frac{1}{m} \sum_{k=1}^{m} \tau_k(i)$ to be the average rank of entity $E_i$ across all rankers, and assume that the ranker-specific quality parameters $\gamma_1, \ldots, \gamma_m$ are i.i.d. samples from a non-atomic probability measure $\pi$ defined on $(0, \infty)$ with a finite first moment (referred to as condition $\mathcal{A}$ hereinafter). Then, by the strong law of large numbers we have
$$\bar{\tau}(i) \longrightarrow \mu_i \quad \text{almost surely as } m \to \infty, \qquad (21)$$
since $\tau_1(i), \ldots, \tau_m(i)$ are i.i.d. random variables with expectation
$$\mu_i = \mathbb{E}_{\pi}\big[g_i(\gamma)\big],$$
where $g_i(\gamma)$ is the conditional mean of $\tau_k(i)$ given the model parameters, i.e.,
$$g_i(\gamma) = \mathbb{E}\big(\tau_k(i) \mid \gamma_k = \gamma, \alpha, I\big).$$
Clearly, this conditional mean is a function of $\gamma$ given $\alpha$ and $I$; we write it as $g_i(\gamma)$ to emphasize its nature as a continuous function of $\gamma$. Without loss of generality, we suppose hereinafter that the relevant entities are $E_1, \ldots, E_{n_1}$ with $I_i = i$ for $i \le n_1$, i.e., $I = (1, \ldots, n_1, 0, \ldots, 0)$. Then, the partition structure and the Mallows model embedded in the PAMA model lead to the following facts:
$$\mu_1 < \mu_2 < \cdots < \mu_{n_1}, \qquad \mu_{n_1+1} = \cdots = \mu_n \triangleq \mu_0. \qquad (22)$$
Note that $g_i(\gamma)$ degenerates to a common constant for the background entities, whose common limit $\mu_0$ does not involve $\alpha$, because the parameter $\alpha$ influences only the relative rankings of the relevant entities in the Mallows model. For the BARD model, the $\mu_i$'s can be derived explicitly.

[Figure 2: empirical estimates of the $\mu_i$'s under three specifications of $\alpha$.]

Figure 2 shows some empirical estimates of the $\mu_i$'s based on independent samples drawn from PAMA models sharing the same configuration but with three different values of $\alpha$: (a) $\alpha = 0$, which corresponds to the BARD model; (b) a moderate $\alpha$; and (c) a large $\alpha$. One surprise is that in case (c), some relevant entities may have a larger $\mu_i$ (worse average rank) than the average rank of the background entities. Lemma 1 guarantees that for almost all $\alpha$, $\mu_i$ is different from $\mu_0$ for every relevant entity. The proof of Lemma 1 can be found in Supplementary Material.
Lemma 1.
For the PAMA model with condition $\mathcal{A}$, there exists a set $B_0 \subset (0, \infty)$ containing only finitely many elements such that, for any $\alpha \notin B_0$,
$$\mu_i \neq \mu_0 \quad \text{for all } i \text{ with } I_i > 0. \qquad (23)$$
The facts demonstrated in (22) and (23) suggest the following three-step strategy to estimate $I$: (a) find the subset of $n_0$ entities from $U$ whose within-subset variation of the $\bar{\tau}(i)$'s is the smallest, i.e.,
$$\hat{U}_B = \mathop{\arg\min}_{S \subset U,\, |S| = n_0}\ \sum_{i \in S} \big(\bar{\tau}(i) - \bar{\tau}_S\big)^2, \quad \text{where } \bar{\tau}_S = \frac{1}{|S|} \sum_{i \in S} \bar{\tau}(i), \qquad (24)$$
and let $\hat{U}_B$ be an estimate of $U_B$; (b) rank the entities in $\hat{U}_R = U \setminus \hat{U}_B$ by $\bar{\tau}(i)$ increasingly and use the obtained ranking as an estimate of $I^+$; (c) combine the above two steps to obtain the estimate $\hat{I}$ of $I$, by setting $\hat{I}_i = 0$ for $E_i \in \hat{U}_B$ and $\hat{I}_i = t$ for the entity ranked $t$-th in step (b).
Note that $\hat{I}$ is based on the mean ranks $\bar{\tau}(i)$, and is thus clearly a moment estimator. Although this three-step estimation strategy is neither statistically efficient nor computationally feasible (step (a) is NP-hard), it nevertheless serves as a prototype for developing the consistency theory. Theorem 2 guarantees that $\hat{I}$ is a consistent estimator of $I$ under mild conditions.
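The three-step prototype can be written down directly; the brute-force subset search below (our own illustration) makes the NP-hardness of step (a) explicit and is only feasible for toy problem sizes.

```python
import itertools
import numpy as np

def moment_estimate_I(tau_lists, n1):
    """Prototype three-step moment estimator of the indicator vector.

    Step (a) searches all size-(n - n1) subsets for the one whose mean
    ranks have the smallest within-subset variance (the background group).
    Step (b) ranks the remaining (relevant) entities by increasing mean rank.
    Step (c) assembles the enhanced indicator vector.
    """
    n = len(tau_lists[0])
    pos = np.zeros(n)
    for tau in tau_lists:
        for r, e in enumerate(tau):
            pos[e] += r + 1
    mean_rank = pos / len(tau_lists)            # tau-bar(i)
    best_bg, best_var = None, np.inf
    for bg in itertools.combinations(range(n), n - n1):
        v = np.var(mean_rank[list(bg)])
        if v < best_var:
            best_bg, best_var = bg, v
    relevant = sorted(set(range(n)) - set(best_bg), key=lambda i: mean_rank[i])
    I = [0] * n
    for t, e in enumerate(relevant, start=1):
        I[e] = t
    return I
```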
Theorem 2.
For the PAMA model with condition $\mathcal{A}$ and for almost all $\alpha$, the moment estimator $\hat{I}$ converges to $I$ with probability 1 as $m$ goes to infinity.
Proof.
See Supplementary Material. ∎
Theorem 2 tells us that estimating $I$ is straightforward if the number of independent rankers goes to infinity: a simple moment method ignoring the quality differences of the rankers can provide us a consistent estimate of $I$. In a practical problem where only a finite number of rankers are involved, however, more efficient statistical inference of the PAMA model based on Bayesian or frequentist principles becomes more attractive, as effectively utilizing the quality information of different rankers is critical.
With $n$ and $I$ fixed, the ranker-specific parameter $\gamma_k$, which governs the power-law distribution for the rank list $\tau_k$, cannot be estimated consistently, and the distribution of the $\gamma_k$'s cannot be determined nonparametrically even when the number of rank lists goes to infinity. We therefore impose a parametric form $\pi(\gamma \mid \eta)$ with $\eta$ as the hyper-parameter and refer to the resulting hierarchically structured model as PAMA-H, which has the following marginal likelihood of $(I, \alpha, \eta)$:
$$L(I, \alpha, \eta; \tau) = \prod_{k=1}^{m} \int P(\tau_k \mid I, \gamma_k, \alpha)\, \pi(\gamma_k \mid \eta)\, d\gamma_k.$$
We show in Theorem 3 that the MLE based on the above marginal likelihood is consistent.
Theorem 3.
Under the PAMA-H model, assume that $(\alpha, \eta)$ belongs to the parameter space $\Theta_H$, and the true parameter $(\alpha_0, \eta_0)$ is an interior point of $\Theta_H$. Let $(\hat{\alpha}, \hat{\eta})$ be the maximizer of $L(I, \alpha, \eta; \tau)$. If $\pi(\gamma \mid \eta)$ has a density function that is differentiable and concave with respect to $\eta$, then $(\hat{\alpha}, \hat{\eta}) \to (\alpha_0, \eta_0)$ almost surely.
Proof.
See Supplementary Material. ∎
4 Inference with the Partition-Mallows Model
4.1 Maximum Likelihood Estimation
Under the PAMA model, the MLE of $\theta$ is $\hat{\theta} = \arg\max_{\theta} \ell(\theta; \tau)$, where
$$\ell(\theta; \tau) = \log L(\theta; \tau) \qquad (25)$$
is the logarithm of the likelihood function (19). Here, we adopt the Gauss-Seidel iterative method of Yang, (2018), also known as backfitting or cyclic coordinate ascent, to implement the optimization. Starting from an initial point $\theta^{(0)}$, the Gauss-Seidel method iteratively updates one coordinate of $\theta$ at each step with the other coordinates held fixed at their current values. A Newton-like method is adopted to update the $\gamma_k$'s and $\alpha$. Since $I$ is a discrete vector, we find favorable values of $I$ by swapping two neighboring entities and checking whether $\ell(\theta; \tau)$ increases. More details of the algorithm are provided in Supplementary Material.
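A skeleton of this cyclic coordinate-ascent scheme is sketched below in Python; a coarse grid search stands in for the Newton-like updates of the continuous parameters, and only adjacent-swap moves on $I$ are shown. All names and the search grid are our own illustrative choices, and `loglik` can be any implementation of (25), e.g., `pama_loglik` above.

```python
import numpy as np

def gauss_seidel_mle(loglik, I0, gamma0, alpha0, n_sweeps=20):
    """Cyclic coordinate-ascent skeleton for the PAMA log-likelihood:
    each gamma_k and alpha are updated by a coarse grid search (standing
    in for the Newton-like step), and I is improved greedily by swaps of
    adjacent relevant entities that increase the log-likelihood."""
    I, gammas, alpha = list(I0), list(gamma0), alpha0
    grid = np.linspace(0.05, 5.0, 40)
    for _ in range(n_sweeps):
        for k in range(len(gammas)):
            gammas[k] = max(grid, key=lambda g: loglik(
                I, gammas[:k] + [g] + gammas[k + 1:], alpha))
        alpha = max(grid, key=lambda a: loglik(I, gammas, a))
        improved = True
        while improved:                  # greedy adjacent-swap search on I
            improved = False
            order = sorted((i for i in range(len(I)) if I[i] > 0),
                           key=lambda i: I[i])
            for t in range(len(order) - 1):
                I2 = list(I)
                a, b = order[t], order[t + 1]
                I2[a], I2[b] = I2[b], I2[a]
                if loglik(I2, gammas, alpha) > loglik(I, gammas, alpha):
                    I, improved = I2, True
                    break
    return I, gammas, alpha
```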
With the MLE $\hat{\theta}$, we define $\hat{U}_R = \{E_i : \hat{I}_i > 0\}$ and $\hat{U}_B = \{E_i : \hat{I}_i = 0\}$, and generate the final aggregated ranking list by the rules below: (a) set the top-$n_1$ list by ordering the entities in $\hat{U}_R$ according to $\hat{I}^+$; (b) let all entities in $\hat{U}_B$ tie for the positions behind. Hereinafter, we refer to this MLE-based rank aggregation procedure under the PAMA model as PAMAF.
For the PAMA-H model, a similar procedure can be applied to find the MLE of $(I, \alpha, \eta)$, with the $\gamma_k$'s treated as missing data. With the MLE $(\hat{I}, \hat{\alpha}, \hat{\eta})$, we can generate the final aggregated ranking list based on $\hat{I}$ in the same way as in PAMA, and evaluate the quality of ranker $k$ via the mean or mode of the conditional distribution below:
$$\pi(\gamma_k \mid \tau_k, \hat{I}, \hat{\alpha}, \hat{\eta}) \propto P(\tau_k \mid \hat{I}, \gamma_k, \hat{\alpha}) \cdot \pi(\gamma_k \mid \hat{\eta}).$$
In this paper, we refer to the above MLE-based rank aggregation procedure under PAMA-H model as PAMAHF. The procedure is detailed in Supplementary Material.
4.2 Bayesian Inference
Since the three model parameters $I$, $\gamma$ and $\alpha$ encode “orthogonal” information of the PAMA model, it is natural to expect that they are mutually independent a priori. We thus specify their joint prior distribution as
$$\pi(I, \gamma, \alpha) = \pi(I) \cdot \pi(\alpha) \cdot \prod_{k=1}^{m} \pi(\gamma_k).$$
Without much loss, we may restrict the range of $\alpha$ and the $\gamma_k$'s to a closed interval $[0, B]$ with a large enough $B$. In contrast, $I$ is discrete and takes values in the finite space $\mathcal{I}$. It is convenient to specify $\pi(I)$, $\pi(\alpha)$ and $\pi(\gamma_k)$ as uniform, i.e.,
$$\pi(I) \propto 1, \qquad \pi(\alpha) = \mathrm{Unif}[0, B], \qquad \pi(\gamma_k) = \mathrm{Unif}[0, B].$$
Based on our experiences in a large range of simulation studies and real data applications, we find that this uniform specification works reasonably well. In Section 3.3 we also discussed letting $\pi(\gamma_k)$ take a parametric form, which will be further discussed later.
The posterior distribution can be expressed as
$$\pi(I, \gamma, \alpha \mid \tau) \propto \pi(I)\, \pi(\alpha) \prod_{k=1}^{m} \pi(\gamma_k) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (26)$$
with the following conditional distributions:
$$\pi(\gamma_k \mid \tau, I, \alpha, \gamma_{-k}) \propto \pi(\gamma_k) \cdot P(\tau_k \mid I, \gamma_k, \alpha), \qquad (27)$$
$$\pi(\alpha \mid \tau, I, \gamma) \propto \pi(\alpha) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (28)$$
$$\pi(I \mid \tau, \gamma, \alpha) \propto \pi(I) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (29)$$
based on which posterior samples of $(I, \gamma, \alpha)$ can be obtained by Gibbs sampling, where $\gamma_{-k}$ denotes all components of $\gamma$ except $\gamma_k$.
Considering that the conditional distributions in (27)-(29) are nonstandard, we adopt the Metropolis-Hastings algorithm (Hastings,, 1970) to enable the conditional sampling. To be specific, we choose random-walk proposal distributions for $\gamma_k$ and $\alpha$ (e.g., Gaussian random walks centered at the current values), whose step sizes can be tuned to optimize the mixing rate of the sampler. Since $I$ is a discrete vector, we propose new values of $I$ by swapping two randomly selected adjacent entities. Note that the entity ranked at position $n_1$ (the lowest-ranked relevant entity) could be swapped with any background entity; due to the homogeneity of the background entities, there is no need to swap two background entities. Therefore, the number of potential proposals in each step is $n_1 + n_0 - 1$. More details about MCMC sampling techniques can be found in Liu, (2008).
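The two kinds of Metropolis-Hastings moves can be sketched as follows (our own illustration): a random-walk update for a continuous parameter such as $\gamma_k$, and a swap-based update for $I$ that either exchanges two adjacent relevant entities or exchanges the lowest-ranked relevant entity with a random background entity. Rejecting negative proposals outright is a convenience that slightly departs from an exact truncated proposal.

```python
import numpy as np

def mh_update_gamma(loglik, I, gammas, alpha, k, step=0.3, rng=None):
    """One random-walk Metropolis update of gamma_k; negative proposals
    are rejected outright to stay in the allowed range."""
    rng = np.random.default_rng(rng)
    prop = gammas[k] + step * rng.standard_normal()
    if prop <= 0:
        return gammas
    new = gammas[:k] + [prop] + gammas[k + 1:]
    log_ratio = loglik(I, new, alpha) - loglik(I, gammas, alpha)
    return new if np.log(rng.uniform()) < log_ratio else gammas

def mh_update_I(loglik, I, gammas, alpha, rng=None):
    """Propose swapping two adjacent relevant entities, or the lowest-
    ranked relevant entity with a randomly chosen background entity."""
    rng = np.random.default_rng(rng)
    order = sorted((i for i in range(len(I)) if I[i] > 0), key=lambda i: I[i])
    background = [i for i in range(len(I)) if I[i] == 0]
    j = int(rng.integers(len(order)))
    I2 = list(I)
    if j + 1 < len(order):                   # adjacent relevant pair
        a, b = order[j], order[j + 1]
    else:                                    # last relevant vs. background
        a, b = order[-1], background[int(rng.integers(len(background)))]
    I2[a], I2[b] = I2[b], I2[a]
    log_ratio = loglik(I2, gammas, alpha) - loglik(I, gammas, alpha)
    return I2 if np.log(rng.uniform()) < log_ratio else I
```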
Suppose that $T$ posterior samples $\{(I^{(t)}, \gamma^{(t)}, \alpha^{(t)})\}_{t=1}^{T}$ are obtained. We calculate the posterior means of the parameters as below:
$$\hat{\gamma}_k = \frac{1}{T} \sum_{t=1}^{T} \gamma_k^{(t)}, \qquad \hat{\alpha} = \frac{1}{T} \sum_{t=1}^{T} \alpha^{(t)}, \qquad \hat{I}_i = \frac{1}{T} \sum_{t=1}^{T} I_i^{(t)}.$$
We quantify the quality of ranker $k$ with $\hat{\gamma}_k$, and generate the final aggregated ranking list by sorting the entities by $\hat{I}_i$ in increasing order, where a sampled value $I_i^{(t)} = 0$ (indicating a background entity) is replaced by the average background rank when computing $\hat{I}_i$.
Hereinafter, we refer to this MCMC-based Bayesian rank aggregation procedure under the Partition-Mallows model as PAMAB.
The Bayesian inference procedure PAMAHB for the PAMA-H model differs from PAMAB only by replacing the prior distribution $\pi(\gamma_k)$, which is uniform in PAMAB, with the hierarchically structured prior $\pi(\gamma_k \mid \eta)\, \pi(\eta)$. The conditional distributions needed for Gibbs sampling are almost the same as (27)-(29), except for an additional one:
$$\pi(\eta \mid \tau, I, \gamma, \alpha) \propto \pi(\eta) \cdot \prod_{k=1}^{m} \pi(\gamma_k \mid \eta). \qquad (30)$$
We may specify $\pi(\gamma \mid \eta)$ to be an exponential distribution and let $\pi(\eta)$ be a proper conjugate prior to make (30) easy to sample from. More details for PAMAHB with $\pi(\gamma \mid \eta)$ specified as an exponential distribution are provided in Supplementary Material.
Our simulation studies suggest that the practical performances of PAMAB and PAMAHB are very similar when $n$ and $m$ are reasonably large (see Supplementary Material for details). In contrast, as we will show in Section 5, the MLE-based estimates (e.g., PAMAF) typically produce less accurate results with a shorter computational time compared to PAMAB.
4.3 Extension to Partial Ranking Lists
The proposed Partition-Mallows model can be extended to more general scenarios where partial ranking lists, instead of full ranking lists, are involved in the aggregation. Given the entity set $U$ and a ranking list $\tau$ of the entities in $U' \subseteq U$, we say $\tau$ is a full ranking list if $U' = U$, and a partial ranking list if $U' \subset U$. Suppose $\tau'$ is a partial ranking list and $\tau$ is a full ranking list of $U$. If the projection of $\tau$ onto the entities of $\tau'$ equals $\tau'$, we say $\tau$ is compatible with $\tau'$, denoted as $\tau \sim \tau'$. Let $\mathcal{C}(\tau') = \{\tau : \tau \sim \tau'\}$ be the set of all full lists that are compatible with $\tau'$. Suppose a partial list $\tau_k'$ is involved in the rank aggregation problem. The probability of $\tau_k'$ can be evaluated by
$$P(\tau_k' \mid I, \gamma_k, \alpha) = \sum_{\tau_k \in \mathcal{C}(\tau_k')} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (31)$$
where $P(\tau_k \mid I, \gamma_k, \alpha)$ is the probability of a compatible full list under the PAMA model. Clearly, the probability in (31) does not have a closed-form representation due to the complicated constraints between $\tau_k$ and $\tau_k'$, and it is very challenging to do statistical inference directly based on this quantity. Fortunately, as rank aggregation with partial lists can be treated as a missing data problem, we can resolve the problem via standard methods for missing data inference.
The Bayesian inference can be accomplished by the classic data augmentation strategy (Tanner and Wong,, 1987) in a similar way as described in Deng et al., (2014), which iterates between imputing the missing data conditional on the observed data given the current parameter values, and updating the parameter values by sampling from the posterior distribution based on the imputed full data. To be specific, we iteratively draw from the following two conditional distributions:
$$\tau_k \sim P(\tau_k \mid \tau_k', I, \gamma, \alpha), \ k = 1, \ldots, m, \qquad \text{and} \qquad (I, \gamma, \alpha) \sim \pi(I, \gamma, \alpha \mid \tau_1, \ldots, \tau_m).$$
To find the MLE of $\theta$ in this more challenging scenario, we can use the Monte Carlo EM algorithm (MCEM, Wei and Tanner, (1990)). Let $\tau^{(1)}, \ldots, \tau^{(S)}$ be independent samples of the full lists drawn from the conditional distribution given the observed partial lists and the current parameter value $\theta^{(t)}$. The E-step involves the calculation of the $Q$-function below:
$$Q(\theta \mid \theta^{(t)}) \approx \frac{1}{S} \sum_{s=1}^{S} \log L(\theta; \tau^{(s)}).$$
In the M-step, we use the Gauss-Seidel method to maximize the above $Q$-function in a similar way as detailed in Supplementary Material.
No matter which method is used, a key step is to draw samples from
$$P(\tau_k \mid \tau_k', \theta) \propto P(\tau_k \mid \theta) \cdot \mathbb{1}(\tau_k \sim \tau_k').$$
To achieve this goal, we start with the full list $\tau_k$ obtained from the previous step of the data augmentation or MCEM algorithm, and conduct several iterations of the following Metropolis step with $P(\tau_k \mid \tau_k', \theta)$ as its target distribution: (a) construct the proposal $\tau_k^{*}$ by randomly selecting two elements in the current full list and swapping them; (b) accept or reject the proposal according to the Metropolis rule, i.e., accept $\tau_k^{*}$ with probability $\min\{1, P(\tau_k^{*} \mid \theta) / P(\tau_k \mid \theta)\}$. Note that the proposed list is automatically rejected if it is incompatible with the observed partial list $\tau_k'$.
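The compatibility check and the constrained Metropolis step can be sketched as below (our own illustration); `log_target` stands for any implementation of the PAMA log-probability of a full list.

```python
import numpy as np

def is_compatible(full, partial):
    """A full list is compatible with a partial list if its projection
    onto the partial list's entities preserves their order."""
    pos = {e: r for r, e in enumerate(full)}
    return all(pos[a] < pos[b] for a, b in zip(partial, partial[1:]))

def metropolis_impute(full, partial, log_target, n_steps=100, rng=None):
    """Random-swap Metropolis over full lists compatible with `partial`."""
    rng = np.random.default_rng(rng)
    cur = list(full)
    for _ in range(n_steps):
        i, j = rng.choice(len(cur), size=2, replace=False)
        prop = list(cur)
        prop[i], prop[j] = prop[j], prop[i]
        if not is_compatible(prop, partial):
            continue                    # automatic rejection
        if np.log(rng.uniform()) < log_target(prop) - log_target(cur):
            cur = prop
    return cur
```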
4.4 Incorporating Covariates in the Analysis
In some applications, covariate information for each ranked entity is available to assist rank aggregation. One of the earliest attempts to incorporate such information in analyzing rank data is perhaps the hidden score model due to Thurstone, (1927), which has become a standard approach and has many extensions. Briefly, these models assume that there is an unobserved score for each entity that is related to the entity-specific covariates under a regression framework, and the observed rankings are determined by these scores plus noises, i.e.,
$$S_{ki} = x_i^{T} \beta + \epsilon_{ki},$$
with ranker $k$ ranking the entities by sorting the scores $S_{k1}, \ldots, S_{kn}$. Here, $\beta$ is the common regression coefficient and $\epsilon_{ki}$ is the noise term. Recent progresses along this line are reviewed by Yu, (2000); Bhowmik and Ghosh, (2017); Li et al., (2021).
Here, we propose to incorporate covariates into the analysis in a different way. Assuming that the covariate vector $x_i$ provides information on the group assignment instead of the detailed ranking of entity $E_i$, we connect $x_i$ and $I_i$, the enhanced indicator of $E_i$, by a logistic regression model:
$$P(I_i > 0 \mid x_i) = \frac{\exp\{\beta_0 + x_i^{T} \beta\}}{1 + \exp\{\beta_0 + x_i^{T} \beta\}}, \qquad (32)$$
where $(\beta_0, \beta)$ are the regression parameters. Letting $X$ be the covariate matrix, we can extend the Partition-Mallows model as
$$P(\tau, I \mid X, \gamma, \alpha, \beta_0, \beta) = \pi(I \mid X, \beta_0, \beta) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (33)$$
where the first term
$$\pi(I \mid X, \beta_0, \beta) \propto \prod_{i=1}^{n} P(I_i > 0 \mid x_i)^{\mathbb{1}(I_i > 0)} \big(1 - P(I_i > 0 \mid x_i)\big)^{\mathbb{1}(I_i = 0)}$$
comes from the logistic regression model (32), and the second term comes from the original Partition-Mallows model. In the extended model, our goal is to infer $(I, \gamma, \alpha, \beta_0, \beta)$ based on $(\tau, X)$. We can achieve both Bayesian inference and the MLE for the extended model in a similar way as described for the Partition-Mallows model. More details are provided in the Supplementary Material.
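A sketch of the logistic-regression term in (33) is given below (our own illustration; the unnormalized product-Bernoulli form follows (32), and all function names are ours).

```python
import numpy as np

def log_prior_I_given_X(I, X, beta0, beta):
    """Logistic-regression contribution to the extended likelihood:
    each entity is relevant (I[i] > 0) with probability
    sigmoid(beta0 + X[i] @ beta), independently across entities
    (up to the normalization over valid indicator vectors)."""
    eta = beta0 + X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    z = (np.asarray(I) > 0).astype(float)
    return float(np.sum(z * np.log(p) + (1 - z) * np.log1p(-p)))

# Toy usage: 5 entities, 2 covariates.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))
print(log_prior_I_given_X([1, 0, 2, 0, 0], X,
                          beta0=-0.5, beta=np.array([1.0, -0.3])))
```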
An alternative way to incorporate covariates is to replace the logistic regression model by a naive Bayes model, which models the conditional distribution of $x_i$ given $I_i$ instead of $I_i$ given $x_i$, as follows:
$$P(\tau, X \mid I, \gamma, \alpha, \psi_0, \psi_1) = \prod_{i=1}^{n} f\big(x_i \mid \psi_{\mathbb{1}(I_i > 0)}\big) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (34)$$
where $f(\cdot \mid \psi)$ is a pre-specified parametric distribution for the covariates, with parameter $\psi_0$ for entities in the background group and $\psi_1$ for entities in the relevant group. Since the performances of the two approaches are very similar, in the rest of this paper we use the logistic regression strategy to handle covariates due to its convenient form.
5 Simulation Study
5.1 Simulation Settings
We simulated data from two models: (a) the proposed Partition-Mallows model, and (b) the Thurstone hidden score model. In the Partition-Mallows scenario, we specified the true indicator vector so that the first $n_1$ entities belong to $U_R$ (in the order of their indices) and the rest belong to the background group $U_B$, and set the ranker-specific quality parameters $\gamma_k$ so that the first half of the rankers are completely non-informative ($\gamma_k = 0$) while the qualities of the remaining rankers increase with the ranker index.
The maximal quality level and $\alpha$ (defined in (16)) control the quality difference and signal strength of the base rankers in the Partition-Mallows scenario. We fixed $\alpha$ and let the maximal quality level take two options: 2.5 and 1.5, referred to as the strong-signal case and the weak-signal case of the Partition-Mallows scenario, respectively.
In the Thurstone scenario, we used the Thurstone model to generate the rank lists: ranker $k$ assigns each entity $E_i$ a latent Gaussian score with mean $\mu_{ki}$, and ranks the entities by sorting these scores in decreasing order. In this model, the mean scores $\mu_{ki}$ control the quality difference and signal strength of the base rankers. We again specified two sub-cases: a stronger-signal case and a weaker-signal case. Table 1 shows the configuration matrix of the mean scores under the stronger-signal case with $m = 10$ rankers. In both scenarios, the first half of the rankers are completely non-informative, with the other half providing increasingly strong signals.
Table 1: Configuration matrix of mean scores (rows: entities; columns: rankers 1-10). Rankers 1-5 are non-informative, and background entities have zero mean scores for all rankers.

Entity | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10
---|---|---|---|---|---|---|---|---|---|---
E1 | 0 | 0 | 0 | 0 | 0 | 3.7 | 3.9 | 4.1 | 4.3 | 4.5
E2 | 0 | 0 | 0 | 0 | 0 | 3.5 | 3.7 | 3.9 | 4.1 | 4.3
E3 | 0 | 0 | 0 | 0 | 0 | 3.3 | 3.5 | 3.7 | 3.9 | 4.1
E4 | 0 | 0 | 0 | 0 | 0 | 3.1 | 3.3 | 3.5 | 3.7 | 3.9
E5 | 0 | 0 | 0 | 0 | 0 | 2.9 | 3.1 | 3.3 | 3.5 | 3.7
E6 | 0 | 0 | 0 | 0 | 0 | 2.7 | 2.9 | 3.1 | 3.3 | 3.5
E7 | 0 | 0 | 0 | 0 | 0 | 2.5 | 2.7 | 2.9 | 3.1 | 3.3
E8 | 0 | 0 | 0 | 0 | 0 | 2.3 | 2.5 | 2.7 | 2.9 | 3.1
E9 | 0 | 0 | 0 | 0 | 0 | 2.1 | 2.3 | 2.5 | 2.7 | 2.9
E10 | 0 | 0 | 0 | 0 | 0 | 1.9 | 2.1 | 2.3 | 2.5 | 2.7
E11 (background) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
E12 (background) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
For each of the four simulation scenarios (the strong- and weak-signal cases of the Partition-Mallows and Thurstone models), we fixed the true number of relevant entities at $n_1 = 10$, but allowed the number of rankers $m \in \{10, 20\}$ and the total number of entities $n \in \{100, 300\}$ to vary, resulting in a total of 16 simulation settings ($4 \times 2 \times 2$). Under each setting, we simulated 500 independent data sets to evaluate and compare the performances of different rank aggregation methods.
5.2 Methods in Comparison and Performance Measures
In addition to the proposed PAMAB and PAMAF, we considered state-of-the-art methods in several classes, including the Markov chain-based methods MC1, MC2, MC3 in Lin, (2010) and CEMC in Lin and Ding, (2010), the partition-based method BARD in Deng et al., (2014), and the Mallows model-based methods MM and EMM in Li et al., (2020). Classic naive methods based on summary statistics were excluded because they have been shown in previous studies to perform suboptimally, especially in cases where the base rankers are heterogeneous in quality. The Markov chain-based methods, MM, and EMM were implemented in the TopKLists, PerMallows and ExtMallows packages in R (https://www.r-project.org/), respectively. The code of BARD was provided by its authors.
Let $\rho$ be the underlying true ranking list of all entities, $\rho^1$ be the true relative ranking of the relevant entities, $\hat{\rho}$ be the aggregated ranking obtained from a rank aggregation approach, $\hat{\rho}^1$ be the relative ranking of the relevant entities in $\hat{\rho}$, and $\hat{\rho}_{n_1}$ be the top-$n_1$ list of $\hat{\rho}$. After obtaining the aggregated ranking, we evaluated its performance by two measurements, namely the recovery distance $D_R$ and the coverage $C$, defined as below:
$$D_R = d(\rho^1, \hat{\rho}^1) + n_{mis} \cdot \bar{r}_B, \qquad C = \frac{|U_R \cap \hat{\rho}_{n_1}|}{n_1},$$
where $d(\rho^1, \hat{\rho}^1)$ denotes the Kendall tau distance between $\rho^1$ and $\hat{\rho}^1$, and $n_{mis}$ denotes the number of relevant entities classified as background entities in $\hat{\rho}$. The recovery distance considers the detailed rankings of all relevant entities plus mis-classification distances, while the coverage cares only about the identification of relevant entities without considering the detailed rankings. In the setting of PAMA, $\bar{r}_B$ is the average rank of a background entity, so the recovery distance increases if some relevant entities are mis-classified as background entities. Clearly, we expect a smaller $D_R$ and a larger $C$ for a stronger aggregation approach.
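Under our reading of the two measurements (restricting the Kendall distance to the recovered relevant entities and taking the per-miss penalty equal to the average background rank $(n_1 + 1 + n)/2$ are our assumptions), they can be computed as follows:

```python
import itertools

def kendall_tau(a, b):
    pos_a = {e: r for r, e in enumerate(a)}
    pos_b = {e: r for r, e in enumerate(b)}
    return sum(1 for x, y in itertools.combinations(pos_a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def coverage(true_relevant, aggregated):
    """Fraction of the true relevant entities recovered in the top-n1."""
    n1 = len(true_relevant)
    return len(set(true_relevant) & set(aggregated[:n1])) / n1

def recovery_distance(true_relevant, aggregated, n):
    """Kendall distance between the true and estimated relative orders of
    the recovered relevant entities, plus a per-miss penalty equal to the
    average background rank (an assumed stand-in for the paper's rule)."""
    n1 = len(true_relevant)
    avg_bg_rank = (n1 + 1 + n) / 2
    top = aggregated[:n1]
    recovered_true_order = [e for e in true_relevant if e in top]
    recovered_est_order = [e for e in top if e in true_relevant]
    misses = n1 - len(recovered_true_order)
    return kendall_tau(recovered_true_order, recovered_est_order) \
        + misses * avg_bg_rank
```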
5.3 Simulation Results
Table 2 summarizes the performances of the nine competing methods in the 16 simulation settings, demonstrating that the proposed PAMAB and PAMAF outperform all the other methods by a significant margin in most settings and that PAMAB uniformly dominates PAMAF. Figure 3 shows the quality parameters $\hat{\gamma}_k$ learned by the Partition-Mallows model in various simulation scenarios, confirming that the proposed methods can effectively capture the quality differences among the rankers. The results for other combinations of $(n, m)$ can be found in Supplementary Material and are consistent with Figure 3.
Table 2: Recovery distances, with coverages shown in brackets on the following row, for the nine competing methods (PAMAF, PAMAB and BARD are partition-type models; EMM and MM are Mallows models; MC1, MC2, MC3 and CEMC are MC-based models).

n | m | PAMAF | PAMAB | BARD | EMM | MM | MC1 | MC2 | MC3 | CEMC
---|---|---|---|---|---|---|---|---|---|---
100 | 10 | 24.5 | 15.2 | 57.1 | 51.7 | 103.2 | 338.4 | 163.1 | 198.6 | 197.8 | |
[0.95] | [0.97] | [0.91] | [0.89] | [0.81] | [0.36] | [0.69] | [0.63] | [0.62] | |||
100 | 20 | 2.6 | 0.3 | 42.1 | 22.8 | 44.2 | 466.6 | 88.9 | 121.2 | 114.7 | |
[0.99] | [1.00] | [0.95] | [0.95] | [0.93] | [0.11] | [0.82] | [0.78] | [0.77] | |||
300 | 10 | 17.4 | 4.0 | 180.0 | 683.3 | 519.2 | 1268.3 | 997.7 | 1075.8 | 1085.7 | |
[0.99] | [1.00] | [0.89] | [0.66] | [0.55] | [0.17] | [0.34] | [0.29] | [0.28] | |||
300 | 20 | 7.1 | 3.2 | 122.3 | 124.4 | 157.1 | 1445.9 | 613.5 | 723.0 | 727.2 | |
[1.00] | [1.00] | [0.93] | [0.92] | [0.90] | [0.05] | [0.60] | [0.53] | [0.52] | |||
100 | 10 | 90.0 | 66.6 | 115.2 | 108.3 | 152.9 | 404.3 | 285.5 | 307.2 | 313.8 | |
[0.82] | [0.86] | [0.77] | [0.77] | [0.70] | [0.24] | [0.47] | [0.43] | [0.41] | |||
100 | 20 | 26.9 | 2.4 | 81.5 | 59.8 | 91.5 | 468.1 | 217.3 | 245.2 | 249.5 | |
[0.94] | [1.00] | [0.85] | [0.87] | [0.82] | [0.11] | [0.60] | [0.55] | [0.53] | |||
300 | 10 | 81.1 | 26.8 | 468.4 | 609.8 | 472.1 | 1388.4 | 1294.7 | 1321.5 | 1328.4 | |
[0.95] | [0.98] | [0.69] | [0.68] | [0.60] | [0.09] | [0.15] | [0.13] | [0.13] | |||
300 | 20 | 77.2 | 3.4 | 313.6 | 267.5 | 337.0 | 1469.0 | 1205.9 | 1251.8 | 1258.9 | |
[0.95] | [1.00] | [0.79] | [0.82] | [0.78] | [0.04] | [0.21] | [0.18] | [0.18] | |||
100 | 10 | 24.9 | 20.6 | 22.9 | 54.9 | 115.9 | 334.7 | 150.9 | 180.3 | 186.0 | |
[0.97] | [0.98] | [0.99] | [0.91] | [0.80] | [0.37] | [0.71] | [0.66] | [0.64] | |||
100 | 20 | 18.7 | 15.6 | 22.8 | 8.7 | 33.4 | 498.8 | 46.7 | 64.1 | 60.8 | |
[0.98] | [0.98] | [1.00] | [1.00] | [0.97] | [0.05] | [0.92] | [0.89] | [0.89] | |||
300 | 10 | 172.0 | 159.8 | 37.9 | 205.5 | 490.6 | 1098.6 | 627.0 | 752.9 | 769.4 | |
[0.89] | [0.90] | [0.99] | [0.87] | [0.68] | [0.28] | [0.59] | [0.50] | [0.49] | |||
300 | 20 | 7.4 | 7.0 | 22.7 | 11.4 | 114.1 | 1402.6 | 237.8 | 319.7 | 322.3 | |
[1.00] | [1.00] | [1.00] | [1.00] | [0.94] | [0.08] | [0.84] | [0.79] | [0.79] | |||
100 | 10 | 92.6 | 74.0 | 68.7 | 123.7 | 162.3 | 382.4 | 228.2 | 250.2 | 256.6 | |
[0.83] | [0.86] | [0.88] | [0.77] | [0.70] | [0.27] | [0.56] | [0.52] | [0.50] | |||
100 | 20 | 24.4 | 20.0 | 22.2 | 12.4 | 38.3 | 500.3 | 87.5 | 103.5 | 102.9 | |
[0.96] | [0.97] | [1.00] | [0.99] | [0.95] | [0.04] | [0.83] | [0.80] | [0.80] | |||
300 | 10 | 319.1 | 463.8 | 245.6 | 516.9 | 683.5 | 1267.9 | 998.0 | 1076.0 | 1085.5 | |
[0.79] | [0.69] | [0.84] | [0.66] | [0.55] | [0.17] | [0.34] | [0.29] | [0.28] | |||
300 | 20 | 8.7 | 8.0 | 23.2 | 30.3 | 155.5 | 1430.7 | 437.6 | 516.2 | 523.3 | |
[1.00] | [1.00] | [1.00] | [0.99] | [0.91] | [0.06] | [0.71] | [0.66] | [0.65]

[Figure 3: estimated quality parameters $\hat{\gamma}_k$ of the rankers in various simulation scenarios.]

Figure 4 (a) shows the boxplots of the recovery distances and coverages of the nine competing methods in the four simulation scenarios under a representative $(n, m)$ combination. The five methods on the left outperform the other four methods by a significant gap, and the PAMA-based methods generally perform the best. Figure 4 (b) confirms that the methods based on the Partition-Mallows model enjoy the same capability as BARD in detecting quality differences between informative and non-informative rankers. However, while both BARD and PAMA can further discern quality differences among the informative rankers, EMM fails this more subtle task. Similar figures for other combinations of $(n, m)$ are provided in the Supplementary Material, highlighting results consistent with Figure 4.

[Figure 4: (a) boxplots of recovery distances and coverages; (b) estimated ranker qualities, for the nine competing methods.]
5.4 Robustness to the Specification of $n_1$
We need to specify $n_1$, the number of relevant entities, when applying PAMAB or PAMAF. In many practical problems, however, there may not be strong prior information on $n_1$, and there may not even be a clear distinction between relevant and background entities. To examine the robustness of the algorithm with respect to the specification of $n_1$, we design a simulation setting that mimics this no-clear-cut scenario and investigate how the performance of PAMA is affected by the specification of $n_1$. The new setting follows the same data-generating framework as in Section 5.1, except for how the signal strength is configured across entities: different from the earlier scenarios, where the signal jumps from 0 to a positive number at the boundary between background and relevant entities, here the signal varies smoothly as a function of the entity index for each informative ranker. In such cases, the concept of “relevant” entities is not well-defined.
We simulate 500 independent data sets from this setting with $n = 100$ and $m = 10$. For each data set, we try different specifications of $n_1$ ranging from 10 to 50 and compare PAMA to several competing methods based on their performance in recovering the top-$n_1$ list, which is still well-defined by the simulation design. The results summarized in Table 3 show that no matter which $n_1$ is specified, the partition-type models consistently outperform all the competing methods in terms of a lower recovery distance from the true top-$n_1$ list. Figure 5 illustrates in detail the average aggregated rankings of the top-10 entities by PAMA as $n_1$ increases, suggesting that PAMA is able to figure out the correct rankings of the top entities effectively. These results give us confidence that PAMA is robust to the misspecification of $n_1$.
Table 3: Recovery distances, with coverages shown in brackets on the following row, under different specifications of $n_1$ ($n = 100$, $m = 10$).

n | m | $n_1$ | PAMAF | PAMAB | BARD | EMM | MM | MC1 | MC2 | MC3 | CEMC
---|---|---|---|---|---|---|---|---|---|---|---
100 | 10 | 10 | 44.8 | 34.6 | 42.6 | 61.5 | 227.7 | 423.8 | 45.6 | 199.1 | 241.3 |
[0.90] | [0.93] | [0.96] | [0.88] | [0.58] | [0.20] | [0.92] | [0.63] | [0.54] | |||
100 | 10 | 20 | 39.2 | 33.9 | 94.2 | 107.0 | 308.6 | 764.6 | 52.3 | 268.9 | 372.9 |
[0.95] | [0.96] | [0.99] | [0.90] | [0.75] | [0.33] | [0.96] | [0.78] | [0.67] | |||
100 | 10 | 30 | 27.5 | 29.2 | 207.4 | 126.2 | 360.6 | 1040.2 | 67.8 | 325.4 | 445.0 |
[0.98] | [0.98] | [0.98] | [0.93] | [0.83] | [0.44] | [0.96] | [0.84] | [0.77] | |||
100 | 10 | 40 | 16.0 | 17.4 | 363.9 | 131.6 | 408.1 | 1274.1 | 83.1 | 382.5 | 486.9 |
[0.99] | [0.99] | [0.98] | [0.95] | [0.87] | [0.54] | [0.97] | [0.87] | [0.83] | |||
100 | 10 | 50 | 8.8 | 9.3 | 565.3 | 134.6 | 452.2 | 1484.2 | 109.2 | 446.4 | 524.9 |
[1.00] | [1.00] | [0.99] | [0.97] | [0.90] | [0.62] | [0.96] | [0.89] | [0.88]

[Figure 5: average aggregated rankings of the top-10 entities by PAMA as the specified $n_1$ increases.]

Noticeably, although PAMA and BARD achieve comparable coverage as shown in Table 3, PAMA dominates BARD uniformly in terms of a much smaller recovery distance in all cases, suggesting that PAMA is capable of figuring out the detailed ranking of relevant entities that is missed by BARD. In fact, since BARD relies only on the posterior probability of each entity being relevant to rank entities, in cases where the signal distinguishing the relevant and background entities is strong, many of these probabilities are very close to 1, resulting in a nearly “random” ranking among the top relevant entities. Theoretically, if all relevant entities are recognized correctly but ranked randomly, the corresponding recovery distance would increase with $n_1$ on the order of $n_1^2$, which matches well with the increasing trend of the recovery distance of BARD shown in Table 3.
We also tested the model's performance when there is a true $n_1$ but it is mis-specified in our algorithm. We varied the specified $n_1$ around the truth for the setting with $n = 100$ and $m = 10$, where the true $n_1 = 10$ (the first ten entities). Figure 6 shows boxplots of the aggregated ranks for each mis-specified case. For visualization purposes, we only show the boxplots for the top entities; the other entities exhibit similar patterns. The figure shows a robust behavior of PAMAB as we vary the specification of $n_1$. It also shows that the results are slightly better if we specify an $n_1$ that is moderately larger than its true value. Consistent results for other mis-specified cases can be found in the Supplementary Material.

[Figure 6: boxplots of aggregated ranks of the top entities under mis-specified $n_1$.]
6 Real Data Applications
6.1 Aggregating Rankings of NBA Teams
We applied PAMAB to the NBA-team data analyzed in Deng et al., (2014), and compared it to competing methods in the literature. The NBA-team data set contains 34 “predictive” power rankings of the 30 NBA teams in the 2011-2012 season. The 34 “predictive” rankings were obtained from 6 professional rankers (sports websites) and 28 amateur rankers (college students), and the data quality varies significantly across rankers. More details of the dataset can be found in Table 1 of Deng et al., (2014).

[Figure 7: (a) posterior boxplots of the ranker quality parameters; (b) posterior distributions of the aggregated power rankings of the NBA teams; (c) log-likelihood trace plot.]

Figure 7 displays the results obtained by PAMAB (the partial-list version with $n_1$ specified as 16). Figure 7 (a) shows the posterior distributions, as boxplots, of the quality parameter of each involved ranker. Figure 7 (b) shows the posterior distribution of the aggregated power ranking of each NBA team. A posterior sample reporting “0” for the rank of an entity means that the entity is a background one; for visualization purposes we replace “0” by the average background rank. The final set of 16 playoff teams is shown in blue, while the champion of that season (i.e., the Heat) is shown in red. Figure 7 (c) shows the trace plot of the log-likelihood of the PAMA model along the MCMC iterations. Comparing the results to Figure 8 of Deng et al., (2014), we observe the following facts: (1) PAMAB can successfully discover the quality differences of rankers, as BARD does; (2) PAMAB not only picks up the relevant entities effectively like BARD, but also ranks the discovered relevant entities reasonably well, which cannot be achieved by BARD; (3) PAMAB converges quickly in this application.
We also applied the other methods, including MM, EMM and the Markov chain-based methods, to this data set. We found that none of these methods could discern the quality differences of rankers as successfully as PAMA and BARD. Moreover, using the team ranking at the end of the regular season as the surrogate true power ranking of these NBA teams in the season, we found that PAMA also outperformed BARD and EMM by reporting an aggregated ranking list that is the closest to the truth. Table 4 provides the detailed aggregated ranking lists inferred by BARD, EMM and PAMA respectively, as well as their coverages of the playoff teams and Kendall tau distances from the surrogate truth. Note that the Kendall distance is calculated for the Eastern teams and Western teams separately because the NBA Playoffs proceed in the Eastern conference and the Western conference in parallel until the NBA Finals, in which the two conference champions compete for the NBA title, making it difficult to validate the rankings between Eastern and Western teams.
Ranking | Surrogate truth | BARD | EMM | PAMA
Eastern | Western | Eastern | Western | Eastern | Western | Eastern | Western | |
1 | Bulls | Spurs | Heat | Thunder | Heat | Thunder | Heat | Thunder |
2 | Heat | Thunder | Bulls | Mavericks | Bulls | Mavericks | Bulls | Mavericks
3 | Pacers | Lakers | Celtics | Clippers | Knicks | Clippers | Celtics | Lakers |
4 | Celtics | Grizzlies | Knicks | Lakers | Celtics | Lakers | Knicks | Clippers |
5 | Hawks | Clippers | Magic | Spurs | Magic | Spurs | Magic | Spurs |
6 | Magic | Nuggets | Pacers | Grizzlies | Pacers | Grizzlies | Hawks | Nuggets |
7 | Knicks | Mavericks | 76ers | Nuggets | 76ers | Nuggets | Pacers | Grizzlies |
8 | 76ers | Jazz | Hawks | Jazz∗ | Hawks∗ | Jazz∗ | 76ers | Jazz∗ |
Kendall | - | - | 14.5 | 10.5 | 9 | 10 | 8 | 10 |
Coverage | - |
6.2 Aggregating Rankings of NFL Quarterback Players with the Presence of Covariates
Our next application is targeted at the NFL-player data reported by Li et al., (2021). The NFL-player data contains 13 predictive power rankings of 24 NFL quarterback players. The rankings were produced by 13 experts based on the performance of the 24 NFL players during the first 12 weeks in the 2014 season. The dataset also contains covariates for each player summarizing the performances of these players during the period, including the number of games played (G), pass completion percentage (Pct), the number of passing attempts per game (Att), average yards per carry (Avg), total receiving yards (Yds), average passing yards per attempt (RAvg), the touchdown percentage (TD), the intercept percentage (Int), running attempts per game (RAtt), running yards per attempt (RYds) and the running first down percentage (R1st). Details of the dataset can be found in Table 2 of Li et al., (2021).

[Figure 8: (a) posterior boxplots of the ranker quality parameters; (b) aggregated ranks of the NFL players; (c) log-likelihood traceplot; (d) posterior probabilities of positive covariate effects.]

Here, we set $n_1 = 12$ in order to find which players are above average. We analyzed the NFL-player data with PAMAB (the covariate-assisted version), and the results are summarized in Figure 8: (a) the posterior boxplots of the quality parameter for all the rankers; (b) the barplot of the aggregated ranks of all the NFL players in descending order; (c) the traceplot of the log-likelihood of the model; and (d) the barplot of the posterior probabilities that each covariate has a positive effect, with the covariates arranged from left to right by decreasing probability. From Figure 8 (a), we observe that rankers 1, 3, 4 and 5 are generally less reliable than the other rankers. In the study of the same dataset in Li et al., (2021), the authors assumed that the 13 rankers fall into three quality levels, and reported that seven rankers (i.e., 2, 6, 7, 8, 9, 10 and 13) are of a higher quality than the other six (see Figure 7 of Li et al., (2021)). Interestingly, according to Figure 8 (a), the PAMA algorithm suggested exactly the same set of high-quality rankers. Moreover, ranker 2 has the lowest quality among the seven high-quality rankers in both studies. From Figure 8 (b), a consensus ranking list can be obtained; our result is consistent with that of Figure 6 in Li et al., (2021). Figure 8 (d) shows that six covariates are more likely to have positive effects.
Ranking | Gold standard | BARD | EMM | PAMA |
1 | Aaron R. | Andrew L. | Andrew L. | Andrew L. |
2 | Andrew L. | Aaron R. | Aaron R. | Aaron R. |
3 | Ben R. | Tom B. | Tom B. | Tom B. |
4 | Drew B. | Drew B. | Ben R. | Drew B. |
5 | Russell W. | Ben R. | Drew B. | Ben R. |
6 | Matt R. | Ryan T. | Ryan T. | Ryan T. |
7 | Ryan T. | Russell W. | Russell W. | Russell W. |
8 | Tom B. | Philip R.* | Philip R.* | Philip R. |
9 | Eli M. | Eli M.* | Eli M.* | Eli M.* |
10 | Philip R. | Matt R.* | Matt R.* | Matt R.* |
R-distance | - | 35.5 | 32 | 25 |
Coverage | - | 0.7 | 0.7 | 0.8 |
Using the Fantasy points of the players (https://fantasy.nfl.com/research/players) at the end of the 2014 NFL season as the surrogate truth, the recovery distance and coverage of the aggregated rankings by different approaches can be calculated to evaluate their performances. Because the Fantasy points of two top NFL players, Peyton Manning and Tony Romo, are missing for unknown reasons, we excluded them from the analysis and only report results for the top 10 positions instead of the top 12. Table 5 summarizes the results, demonstrating that PAMA outperformed the other two methods.
7 Conclusion and Discussion
The proposed Partition-Mallows model embeds the classic Mallows model into the partition modeling framework developed earlier by Deng et al., (2014), which is analogous to the well-known “spike-and-slab” mixture distribution often employed in Bayesian variable selection. Such a nontrivial “mixture” combines the strengths of both the Mallows model and BARD's partition framework, leading to a stronger rank aggregation method that can not only learn the quality variation of rankers and distinguish relevant entities from background ones effectively, but also provide an accurate ranking estimate of the discovered relevant entities. Compared to other frameworks in the literature for rank aggregation with heterogeneous rankers, the Partition-Mallows model enjoys more accurate results and better interpretability at the price of a moderate increase in computational burden. We also show that the Partition-Mallows framework can easily handle partial lists and incorporate covariates in the analysis.
Throughout this work, we assume that the number of relevant entities $n_1$ is known. This is reasonable in many practical problems where a specific $n_1$ can be readily determined according to research demands. Empirically, we found that the ranking results are insensitive to the choice of $n_1$ within a wide range of values. If needed, we may also determine $n_1$ according to a model selection criterion, such as AIC or BIC.
In the PAMA model, the conditional distribution of the background entities' relative rankings, $P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I)$, is assumed to be uniform. If the detailed ranking of the background entities is of interest, we can modify this conditional distribution to be the Mallows model or another model. A quality parameter can still be incorporated to control the interaction between relevant entities and background entities. The resulting likelihood function becomes more complicated, but is still tractable.
In practice, the assumption of independent rankers may be violated. In the literature, a few approaches have been proposed to detect and handle dependent rankers. For example, Deng et al., (2014) proposed a hypothesis-testing-based framework to detect pairs of over-correlated rankers and a hierarchical model to accommodate clusters of dependent rankers; Johnson et al., (2020) adopted an extended Dirichlet process and a similar hierarchical model to achieve simultaneous ranker clustering and rank aggregation inference. Similar ideas can be incorporated in the PAMA model as well to deal with non-independent rankers.
Acknowledgement
We thank Miss Yuchen Wu for helpful discussions at the early stage of this work and the two reviewers for their insightful comments and suggestions that helped us improve the paper greatly. This research is supported in part by the National Natural Science Foundation of China (Grants 11771242 & 11931001), Beijing Academy of Artificial Intelligence (Grant BAAI2019ZD0103), and the National Science Foundation of USA (Grants DMS-1903139 and DMS-1712714). The author Wanchuang Zhu is partially supported by the Australian Research Council (Data Analytics for Resources and Environments, Grant IC190100031).
Supplementary Materials
Appendix A Numerical Support for the Power-Law Model of
Figure 9 presents numerical evidence supporting the power-law model for in different scenarios, in a spirit similar to Figure 2 of Deng et al., (2014): we generate each ranker as the order of , i.e., where is generated from two different distributions and via the following mechanism:
Figure 9 confirms that the log-log plot of versus is nearly linear (when is not too large), suggesting that the power-law model is an acceptable approximation of reality.
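The diagnostic behind Figure 9 is easy to replicate. The sketch below uses an illustrative hidden-score mechanism (the sizes, score gap, and noise level are our own choices, not the paper's settings): it tabulates the relative rank of each background entity among the relevant ones and fits a straight line to the log frequencies against the log ranks; a near-linear fit supports the power-law approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rel, reps = 100, 10, 2000         # entities, relevant entities, replications
mu = np.r_[np.full(n_rel, 2.0), np.zeros(n - n_rel)]  # relevant scores are higher

counts = np.zeros(n_rel + 1)           # relative ranks take values 1..n_rel+1
for _ in range(reps):
    scores = mu + rng.normal(size=n)
    ranks = np.argsort(np.argsort(-scores))   # rank of each entity, 0 = top
    rel_positions = np.sort(ranks[:n_rel])    # positions of the relevant entities
    for b in range(n_rel, n):
        # relative rank of background entity b among the relevant entities
        i = np.searchsorted(rel_positions, ranks[b]) + 1
        counts[i - 1] += 1

freq = counts / counts.sum()
pos = np.arange(1, n_rel + 2)
mask = freq > 0                        # guard against log(0) for unseen ranks
slope, intercept = np.polyfit(np.log(pos[mask]), np.log(freq[mask]), 1)
print("fitted log-log slope:", round(slope, 2))
```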
[Figure 9]
Appendix B Numerical Evidence for the Assumption
In this section, we provide numerical evidence to support the assumption by showing that it approximates reality reasonably well in more general settings. In the literature, the Thurstone hidden score model is widely used to generate ranked data. Specifying , , and , we explored two simulation settings based on the hidden score model, namely and , in which we assume
and specify in the setting
and in the setting
where stands for a random number drawn from the Uniform distribution on the interval . Clearly, the quality of ranker increases monotonically with the ranker index in both settings, and there is a clear gap in mean scores between background and relevant entities. The two hyper-parameters control the signal strength and ranker heterogeneity in the simulated data.
For each data set simulated from the above settings, given the true ranking list , we can always fit a Mallows model to the rankings of relevant entities in each to get an estimated , and a power-law model to the relative rankings of the background entities among the relevant ones in each to get an estimated , leading to a set of pairwise estimates . Figure 10 provides a graphical illustration of the estimated parameters for typical simulated datasets from the setting with different specifications of , and Figure 11 shows the counterpart for the setting. From the figures, we can clearly see a strong linear trend between and in all cases, suggesting that the presumed assumption approximates reality very well.
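As a companion sketch, the Mallows parameter of a single ranker with a known central ranking can be estimated by solving the score equation that matches the observed mean Kendall distance to its model expectation. The code below assumes the standard exponential parametrization of the Mallows model and the Kendall-distance decomposition of Fligner and Verducci, (1986); the paper's own quality parametrization may differ up to a reparametrization.

```python
import numpy as np
from scipy.optimize import brentq

def expected_kendall(theta, k):
    """E[d] under the Mallows model P(pi) ~ exp(-theta * d(pi, pi0)) on k items,
    using the decomposition of the Kendall distance into independent
    truncated-geometric components (Fligner and Verducci,, 1986)."""
    total = 0.0
    for j in range(1, k):                 # component V_j takes values 0..k-j
        r = np.arange(k - j + 1)
        w = np.exp(-theta * r)
        total += float((r * w).sum() / w.sum())
    return total

def fit_mallows_theta(kendall_dists, k):
    """Solve the score equation: mean observed distance = E_theta[d].

    Assumes the mean distance lies strictly between 0 and k(k-1)/4,
    so that a root exists inside the bracketing interval."""
    dbar = float(np.mean(kendall_dists))
    return brentq(lambda t: expected_kendall(t, k) - dbar, 1e-8, 50.0)
```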
[Figures 10 and 11]
Appendix C Proof of Theorem 1
Proof.
We start with the degenerate special case where . In this case, the parameter vector reduces to a 3-dimensional vector , and the PAMA model defined in Equation (19) reduces to the simpler form defined in Equation (18) with . To prove the identifiability of the degenerate PAMA model, we need to show that for any two proper parameter vectors and from the parameter space , we always have:
Here, we choose to prove the above conclusion by proving its equivalent counterpart:
Apparently, the condition implies two possible scenarios:
Below we discuss the two scenarios separately.
The scenario where . Let
(35)
be the rankings with the smallest and second smallest sampling probabilities in the ranking space given model parameter . According to the likelihood function (18), it is easy to check that the solutions of the optimization problems in (35) depend on only and take the general form below:
where the operator locates the relevant entities defined by to the last positions in the ranking list with their internal order reversed and places the background entities randomly in the other open positions, and the operator swaps the positions of either two adjacent relevant entities defined by or the last relevant entity and an arbitrary background entity in . For instance, assuming that , and , we have:
where stands for a random permutation of the input sequence . Note that although both and admit many variants, all of these variants correspond to the same sampling probability below:
where . Further, define
It can be shown by contradiction that the following two equations cannot hold simultaneously for :
Indeed, if , we would have either or based on the definition of function , which leads to because function is monotonically decreasing with respect to and . This fact indicates that the following two equations cannot hold simultaneously:
which means that the two distributions and are not identical. Therefore, there must exist such that .
The scenario where but . As and are the minima of and , and in this case, we have
Similarly, we have
Therefore, there exists (e.g., or ) such that . Combining the above two scenarios, we conclude that fact (20) holds for .
Next, we prove the more general cases where by mathematical induction: assuming that (20) holds for all , we will prove that it also holds for . Considering that for any , implies:
where and with , the condition in Theorem 1 for suggests that
(36)
Define , , and
For , equation (36) can be expressed alternatively as below:
For any fixed , summing over all equations of the above form for all that are compatible with , we have
Considering that
we have
which indicates that
Thus, we have , and the proof is complete. ∎
Appendix D Proof of Lemma 1
Proof.
The function is given as
(37)
where
As is a constant with respect to , we denote . Thus,
(38)
From Equation (38), we can see that is a continuous function of for , does not oscillate infinitely often, and degenerates to a constant for . Thus, the equation has only finitely many solutions in . Let be the solutions of the equation for and . We have . ∎
Appendix E Proof of Theorem 3
Proof.
Based on the classic theory for MLE consistency (Wald,, 1949), to prove that are consistent, we only need to show that the following regularity conditions hold for the PAMA-H model: (1) the probability measure has the same support for all , (2) the true parameter is an interior point of the parameter space , (3) the log-likelihood is differentiable with respect to and , and (4) the MLE is the unique solution of the score equations and .
It is easy to check that satisfies regularity conditions (1)-(3). Regularity condition (4) is verified by showing that is a concave function with respect to and , respectively. The log-likelihood of PAMA-H is
where , and . According to the properties of the Mallows model, it is easy to check that , and thus is a concave function with respect to . As , preserves the concavity of . Moreover, is a composition of with a logarithm and a summation, both of which preserve concavity. Therefore, is a concave function with respect to . Similarly, given that is a concave function of , it is easy to show that is a concave function with respect to . Therefore, regularity condition (4) is satisfied, and the proof is complete.
∎
Appendix F Details of the Gauss-Seidel Iterative Optimization
Let be the maximizer obtained at cycle . The Gauss-Seidel method works as follows:
1. Update . Define as the partial function of the log-likelihood with respect to given the other parameters. A Newton-like method is adopted to update from to .
2. Update . Define as the partial function of the log-likelihood with respect to given the other parameters. Similarly, a Newton-like method is adopted to update from to .
3. Update . Let denote the log-likelihood as a function of with the other parameters fixed. We randomly select two neighboring entities and swap their rankings to check whether increases.
The procedure starts from a random guess of all the parameters and then repeats Steps 1-3 until the log-likelihood converges. Convergence is declared when the difference in log-likelihood between two consecutive iterations is less than 0.1. The main difficulty of the procedure lies in the update of . In practice, the starting point of can be a quick estimate of , such as one obtained from the Mallows model.
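The following skeleton summarizes the cyclic structure; `update_phi`, `update_gamma` and `update_ranking` stand for the Newton-like steps of Sections F.1-F.2 and the swap-based search of Section F.3, and the parameter names are placeholders rather than the paper's notation.

```python
def gauss_seidel_mle(loglik, update_phi, update_gamma, update_ranking,
                     phi, gamma, ranking, tol=0.1, max_cycles=500):
    """Cyclic (Gauss-Seidel) ascent on the PAMA log-likelihood."""
    prev = loglik(phi, gamma, ranking)
    for _ in range(max_cycles):
        phi = update_phi(phi, gamma, ranking)           # Step 1
        gamma = update_gamma(phi, gamma, ranking)       # Step 2
        ranking = update_ranking(phi, gamma, ranking)   # Step 3
        cur = loglik(phi, gamma, ranking)
        if abs(cur - prev) < tol:   # the 0.1 threshold used in the text
            break
        prev = cur
    return phi, gamma, ranking
```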
In each cycle of the Gauss-Seidel update for finding the MLE, a partial function of the likelihood has to be defined. Suppose the current cycle index is ; the detailed computation of is given below.
F.1 Update
Define . Thus, for ,
(39)
The updating equation is given as
(40)
where is a step-length parameter that can be tuned to control the convergence of , and
(41)
(42)
where
F.2 Update
We define . Thus, for
(43)
The update equation is given as
(44)
where is a step-length parameter that controls the convergence of , and
(45)
(46)
where
F.3 Update
We define . Thus,
(47)
Given the current estimate , a new proposal can be obtained by iteratively swapping neighboring entities in . Note that the entity whose ranking is could be randomly swapped with any background entity. Denote the proposed estimate by . If , assign to ; otherwise, keep generating proposals until or no new proposal can be generated for .
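A minimal sketch of this swap-based search is given below; `loglik_I` plays the role of the partial log-likelihood in (47), and the list encodings of the relevant and background entities are illustrative assumptions.

```python
import random

def update_ranking_by_swaps(ranking, background, loglik_I, max_tries=1000):
    """Greedy local search over rankings: accept a proposed swap only if it
    increases the partial log-likelihood. Assumes `background` is non-empty."""
    ranking, background = list(ranking), list(background)
    cur = loglik_I(ranking)
    for _ in range(max_tries):
        prop, prop_bg = list(ranking), list(background)
        i = random.randrange(len(prop))
        if i < len(prop) - 1:                    # swap two neighboring relevant entities
            prop[i], prop[i + 1] = prop[i + 1], prop[i]
        else:                                    # the lowest-ranked relevant entity may be
            j = random.randrange(len(prop_bg))   # swapped with a random background entity
            prop[i], prop_bg[j] = prop_bg[j], prop[i]
        new = loglik_I(prop)
        if new > cur:
            ranking, background, cur = prop, prop_bg, new
    return ranking, background
```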
Appendix G Details of Inferring PAMA-H
G.1 The Gauss-Seidel Iterative Optimization
Let be the maximizer obtained at cycle . As is treated as missing data, we adopt MCEM (Wei and Tanner,, 1990) to implement the optimization. The E-step defines , where is a sample drawn from , which is defined in (49); the M-step maximizes with respect to to obtain . The Gauss-Seidel method is utilized to carry out this optimization; the detailed procedure is stated below.
Update . With the objective function being , the procedure for updating is the same as in Section F.
Update . With the objective function being , the procedure for updating is the same as in Section F.
Update . Define as the partial function of ; a Newton-like method is adopted to update . Then . Suppose is an exponential distribution ; then . It is straightforward to apply the Newton-like method to obtain .
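Schematically, the MCEM loop has the following structure; `draw_missing`, which samples the latent quantities from the conditional distribution in (49), and `maximize_Q`, which carries out the Gauss-Seidel maximization of the Monte Carlo Q-function, are placeholders rather than the paper's notation.

```python
def mcem(data, theta0, draw_missing, maximize_Q, n_mc=50, n_iter=100):
    """A generic Monte Carlo EM loop (Wei and Tanner,, 1990)."""
    theta = theta0
    for _ in range(n_iter):
        # E-step: Monte Carlo approximation of Q(. | theta) via samples
        # of the missing data drawn from their conditional distribution.
        samples = [draw_missing(data, theta) for _ in range(n_mc)]
        # M-step: maximize the averaged complete-data log-likelihood.
        theta = maximize_Q(data, samples)
    return theta
```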
G.2 The Bayesian Inference
Suppose that is an exponential distribution . More specifically,
Note that here. As the conditional distributions of and are already given in (27) and (28), we only show the conditional distributions of and below:
(48)
(49)
G.3 PAMA-H versus PAMA in Numerical Experiments
We simulate 500 independent data sets with the hierarchical distribution being an exponential distribution. The true parameters are and for the parameter of the exponential distribution. Four different configurations of are considered. The average recovery distances and coverages are displayed in Table 6. Figure 12 compares the estimates of obtained by the different models and frameworks. Both Table 6 and Figure 12 indicate that the PAMA model performs as well as the PAMA-H model.
| Configuration | | | Bayesian Inference | | MLE | |
| --- | --- | --- | --- | --- | --- | --- |
| | | Metric | PAMAHB | PAMAB | PAMAHF | PAMAF |
| 100 | 10 | Recovery | 0.70 (3.83) | 0.98 (6.11) | 19.09 (26.56) | 17.22 (24.80) |
| | | Coverage | 1.00 (0.01) | 1.00 (0.01) | 0.96 (0.05) | 0.97 (0.05) |
| 100 | 20 | Recovery | 1.22 (6.71) | 0.84 (5.09) | 33.76 (29.96) | 32.16 (29.28) |
| | | Coverage | 1.00 (0.01) | 1.00 (0.01) | 0.93 (0.06) | 0.93 (0.06) |
| 200 | 10 | Recovery | 4.15 (24.22) | 3.36 (19.94) | 57.23 (57.02) | 49.08 (56.00) |
| | | Coverage | 1.00 (0.02) | 1.00 (0.02) | 0.94 (0.06) | 0.95 (0.06) |
| 200 | 20 | Recovery | 1.51 (10.86) | 1.51 (10.97) | 92.59 (65.94) | 85.67 (64.35) |
| | | Coverage | 1.00 (0.01) | 1.00 (0.01) | 0.91 (0.07) | 0.91 (0.07) |
[Figure 12]
Appendix H Statistical Inference of the Covariate-Assisted Partition-Mallows Model
H.1 Bayesian Inference
Due to the incorporation of , the full conditional distributions differ from the conditional distributions in Section 4.2. While the full conditional distributions of and remain the same as in Equations (28) and (29), the full conditional distribution of changes to
(50)
In addition, the full conditional distribution of the element of is given as follows,
(51)
The MH algorithm is again adopted to draw the corresponding samples, and posterior point estimates can be calculated accordingly.
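A generic random-walk MH step of the kind used here is sketched below; `log_cond` stands for the logarithm of the relevant full conditional, e.g., (50) or (51), and the proposal scale is illustrative.

```python
import math
import random

def mh_step(x, log_cond, scale=0.1):
    """One Metropolis-Hastings step with a symmetric random-walk proposal."""
    prop = x + random.gauss(0.0, scale)
    log_ratio = log_cond(prop) - log_cond(x)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        return prop   # accept
    return x          # reject
```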
H.2 MLE
Gauss-Seidel iterative optimization is also adopted for the covariate-assisted PAMA model. Let be the maximizer obtained at cycle . The detailed computation of is given below.
H.2.1 Update of
This is the same as in Section F.1.
H.2.2 Update of
This is the same as in Section F.2.
H.2.3 Update of
We define . Thus,
(52)
Given the current estimate , a new proposal can be obtained by iteratively swapping neighboring entities in . Note that the entity whose ranking is could be randomly swapped with any background entity. Denote the proposed estimate by . If , assign to ; otherwise, keep generating proposals until or no new proposal can be generated for .
H.2.4 Update of
Define . Maximizing with respect to is a standard logistic regression problem; therefore, can be obtained by using standard statistical software.
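For illustration, a minimal sketch of this update using scikit-learn is given below; the covariate matrix `X` and the relevance indicators `z` are hypothetical stand-ins for the corresponding quantities in the current cycle.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: entity covariates and the current 0/1 relevance
# indicators produced by the partition at this cycle.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
z = (X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=100) > 0).astype(int)

# An unpenalized fit gives the MLE of the logistic regression coefficients;
# older scikit-learn versions use penalty='none' instead of penalty=None.
clf = LogisticRegression(penalty=None, max_iter=1000).fit(X, z)
beta_hat = np.r_[clf.intercept_, clf.coef_.ravel()]
print(beta_hat)
```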
Appendix I Boxplots of
Figure 13 shows the boxplots of for all settings with all combinations of .
[Figure 13]
Appendix J Simulation Results
Figures 14, 15 and 16 present results from different methods for simulated data under various combinations of and .
[Figures 14, 15 and 16]
Appendix K Robustness of
Figure 17 shows boxplots of estimated for each mis-specified case (5, 12, 15).
[Figure 17]
References
- Aslam and Montague, (2001) Aslam, J. A. and Montague, M. (2001). Models for metasearch. In Proceedings of the 24th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 276–284. ACM.
- Bader, (2011) Bader, M. (2011). The transposition median problem is NP-complete. Theoretical Computer Science, 412(12):1099–1110.
- Bhowmik and Ghosh, (2017) Bhowmik, A. and Ghosh, J. (2017). LETOR methods for unsupervised rank aggregation. In Proceedings of the 26th International Conference on World Wide Web, pages 1331–1340. International World Wide Web Conferences Steering Committee.
- Borda, (1781) Borda, J. C. (1781). Mémoire sur les élections au scrutin. Histoire de l’Académie Royale des Sciences.
- Chen et al., (2016) Chen, J., Long, R., Wang, X., Liu, B., and Chou, K. (2016). dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Scientific Reports, 6(32333).
- Chen et al., (2019) Chen, Y., Fan, J., Ma, C., and Wang, K. (2019). Spectral method and regularized MLE are both optimal for top- ranking. Annals of Statistics, 47(4):2204.
- Chen and Suh, (2015) Chen, Y. and Suh, C. (2015). Spectral MLE: Top- rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380.
- Deconde et al., (2011) Deconde, R. P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., and Etzioni, R. (2011). Combining results of microarray experiments: a rank aggregation approach. Statistical Applications in Genetics & Molecular Biology, 5(1):1544–6115.
- Deng et al., (2014) Deng, K., Han, S., Li, K. J., and Liu, J. S. (2014). Bayesian aggregation of order-based rank data. Journal of the American Statistical Association, 109(507):1023–1039.
- Diaconis, (1988) Diaconis, P. (1988). Group representations in probability and statistics. Lecture Notes-Monograph Series, 11:1–192.
- Diaconis and Graham, (1977) Diaconis, P. and Graham, R. L. (1977). Spearman’s footrule as a measure of disarray. Journal of the Royal Statistical Society Series B (Methodological), 39(2):262–268.
- Dwork et al., (2001) Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613–622. ACM.
- Fagin et al., (2003) Fagin, R., Kumar, R., and Sivakumar, D. (2003). Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 301–312. ACM.
- Fligner and Verducci, (1986) Fligner, M. A. and Verducci, J. S. (1986). Distance based ranking models. Journal of the Royal Statistical Society. Series B (Methodological), 48(3):359–369.
- Freund et al., (2003) Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(Nov):933–969.
- Hastings, (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
- Irurozki et al., (2014) Irurozki, E., Calvo, B., and Lozano, J. A. (2014). PerMallows: An R package for Mallows and generalized Mallows models. Journal of Statistical Software, 71(12):1–30.
- Johnson et al., (2020) Johnson, S. R., Henderson, D. A., and Boys, R. J. (2020). Revealing subgroup structure in ranked data using a Bayesian WAND. Journal of the American Statistical Association, 115(532):1888–1901.
- Li et al., (2020) Li, H., Xu, M., Liu, J. S., and Fan, X. (2020). An extended mallows model for ranked data aggregation. Journal of the American Statistical Association, 115(530):730–746.
- Li et al., (2021) Li, X., Yi, D., and Liu, J. S. (2021). Bayesian analysis of rank data with covariates and heterogeneous rankers. Statistical Science, (In press).
- Lin, (2010) Lin, S. (2010). Space oriented rank-based data integration. Statistical Applications in Genetics & Molecular Biology, 9(1):Article 20.
- Lin and Ding, (2010) Lin, S. and Ding, J. (2010). Integration of ranked lists via Cross Entropy Monte Carlo with applications to mRNA and microRNA studies. Biometrics, 65(1):9–18.
- Linas et al., (2010) Linas, B., Tadas, M., and Francesco, R. (2010). Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 119–126. ACM.
- Liu, (2008) Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.
- Liu et al., (2007) Liu, Y., Liu, T., Qin, T., Ma, Z., and Li, H. (2007). Supervised rank aggregation. In Proceedings of the 16th International Conference on World Wide Web, pages 481–490. ACM.
- Luce, (1959) Luce, R. D. (1959). Individual choice behavior: a theoretical analysis. New York: Wiley.
- Mallows, (1957) Mallows, C. L. (1957). Non-null ranking models. Biometrika, 44(1/2):114–130.
- Plackett, (1975) Plackett, R. L. (1975). The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202.
- Porello and Endriss, (2012) Porello, D. and Endriss, U. (2012). Ontology merging as social choice: judgment aggregation under the open world assumption. Journal of Logic and Computation, 24(6):1229–1249.
- Rajkumar and Agarwal, (2014) Rajkumar, A. and Agarwal, S. (2014). A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning, pages 118–126.
- Renda and Straccia, (2003) Renda, M. E. and Straccia, U. (2003). Web metasearch: rank vs. score based rank aggregation methods. In Proceedings of the 2003 ACM Symposium on Applied Computing, pages 841–846. ACM.
- Soufiani et al., (2014) Soufiani, H. A., Parkes, D. C., and Xia, L. (2014). A statistical decision-theoretic framework for social choice. In Advances in Neural Information Processing Systems, pages 3185–3193.
- Tanner and Wong, (1987) Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540.
- Thurstone, (1927) Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4):273–286.
- Wald, (1949) Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics, 20(4):595–601.
- Wei and Tanner, (1990) Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.
- Yang, (2018) Yang, K. H. (2018). Chapter 7 - Stepping through finite element analysis. In Yang, K.-H., editor, Basic Finite Element Method as Applied to Injury Biomechanics, pages 281–308. Academic Press.
- Young, (1988) Young, H. P. (1988). Condorcet’s theory of voting. American Political Science Review, 82(4):1231–1244.
- Young and Levenglick, (1978) Young, H. P. and Levenglick, A. (1978). A consistent extension of Condorcet’s election principle. SIAM Journal on Applied Mathematics, 35(2):285–300.
- Yu, (2000) Yu, P. L. H. (2000). Bayesian analysis of order-statistics models for ranking data. Psychometrika, 65(3):281–299.