Partition-Mallows Model and Its Inference
for Rank Aggregation
Abstract
Learning how to aggregate ranking lists has been an active research area for many years and its advances have played a vital role in many applications ranging from bioinformatics to internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entities into two disjoint groups, i.e., relevant and irrelevant/background ones, and incorporating the Mallows model for the relative ranking of relevant entities, we propose a framework for rank aggregation that can not only distinguish quality differences among the rankers but also provide the detailed ranking information for relevant entities. Theoretical properties of the proposed approach are established, and its advantages over existing approaches are demonstrated via simulation studies and real-data applications. Extensions of the proposed method to handle partial ranking lists and conduct covariate-assisted rank aggregation are also discussed.
Keywords: Meta-analysis; Heterogeneous rankers; Mallows model; Partial ranking lists; Covariate-assisted aggregation.
1 Introduction
Rank data arise naturally in many fields, such as web searching (Renda and Straccia,, 2003), design of recommendation systems (Linas et al.,, 2010) and genomics (Bader,, 2011). Many probabilistic models have been proposed for analyzing this type of data, among which the Thurstone model (Thurstone,, 1927), the Mallows model (Mallows,, 1957) and the Plackett-Luce model (Luce,, 1959; Plackett,, 1975) are the most well-known representatives. The Thurstone model assumes that each entity possesses a hidden score and all the scores come from a joint probability distribution. The Mallows model is a location model defined on the permutation space of ordered entities, in which the probability mass of a permuted order is an exponential function of its distance from the true order. The Plackett-Luce model assumes that each entity is associated with a positive preference weight, and describes a recursive procedure for generating a random ranking list: entities are picked one by one, with probability proportional to their weights, in a sequential fashion without replacement, and are ranked based on their order of selection.
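To make the Plackett-Luce sampling procedure concrete, the following minimal Python sketch (the function name and example weights are ours, purely illustrative) draws one ranking by the recursive selection scheme described above.

```python
import numpy as np

def sample_plackett_luce(weights, rng=None):
    """Draw one ranking from a Plackett-Luce model.

    Entities are selected one by one, each time with probability
    proportional to its weight among the entities not yet chosen;
    the order of selection is the returned ranking (best first).
    """
    rng = np.random.default_rng(rng)
    remaining = list(range(len(weights)))
    ranking = []
    while remaining:
        w = np.array([weights[i] for i in remaining], dtype=float)
        pick = rng.choice(len(remaining), p=w / w.sum())
        ranking.append(remaining.pop(pick))
    return ranking  # ranking[r] = entity placed at position r+1

# Example: entity 0 has the largest weight, so it tends to be ranked first.
print(sample_plackett_luce([5.0, 2.0, 1.0, 0.5], rng=42))
```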
Rank aggregation aims to derive a “better” aggregated ranking list from multiple input ranking lists. It is a classic problem and has been studied in a variety of contexts for decades. Early applications of rank aggregation can be traced back to 18th-century France, where the idea of rank aggregation was proposed to solve the problem of political elections (Borda,, 1781). In the past 30 years, efficient rank aggregation algorithms have played important roles in many fields, such as web searching (Renda and Straccia,, 2003), information retrieval (Fagin et al.,, 2003), design of recommendation systems (Linas et al.,, 2010), social choice studies (Porello and Endriss,, 2012; Soufiani et al.,, 2014), genomics (Bader,, 2011) and bioinformatics (Lin and Ding,, 2010; Chen et al.,, 2016).
Some popular approaches for rank aggregation are based on summary statistics. These methods simply calculate a summary statistic, such as the mean, median or geometric mean, of each entity's rankings across the different ranking lists, and obtain the aggregated ranking list by sorting these summary statistics. Optimization-based methods obtain the aggregated ranking by minimizing a user-defined objective function, typically the total distance between the aggregated list and the observed lists, where the distance measure can be either Spearman's footrule distance (Diaconis and Graham,, 1977) or the Kendall tau distance (Diaconis,, 1988). More detailed studies on these optimization-based methods can be found in Young and Levenglick, (1978); Young, (1988); Dwork et al., (2001).
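For concreteness, here is a small Python sketch of the two distance measures mentioned above; the function names are ours, and the inputs are rankings given as lists of entities from most to least preferred.

```python
import itertools

def spearman_footrule(tau_a, tau_b):
    """Sum over entities of |position in tau_a - position in tau_b|."""
    pos_a = {e: r for r, e in enumerate(tau_a)}
    pos_b = {e: r for r, e in enumerate(tau_b)}
    return sum(abs(pos_a[e] - pos_b[e]) for e in pos_a)

def kendall_tau(tau_a, tau_b):
    """Number of entity pairs ordered differently by the two lists."""
    pos_a = {e: r for r, e in enumerate(tau_a)}
    pos_b = {e: r for r, e in enumerate(tau_b)}
    return sum(
        1
        for x, y in itertools.combinations(pos_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

print(kendall_tau([1, 2, 3, 4], [2, 1, 4, 3]))  # 2 discordant pairs
```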
In the early 2000s, a novel class of Markov chain-based methods was proposed (Dwork et al.,, 2001; Lin and Ding,, 2010; Lin,, 2010; Deconde et al.,, 2011), which first uses the observed ranking lists to construct a probabilistic transition matrix among the entities and then uses the magnitudes of the entities' equilibrium probabilities under the resulting Markov chain to rank them. The boosting-based method RankBoost (Freund et al.,, 2003) employs a feedback function over entity pairs to construct the final ranking, where a positive (or negative) feedback for a pair indicates that the first entity is (or is not) preferred to the second. Some statistical methods utilize the aforementioned probabilistic models (such as the Thurstone model) and derive the maximum likelihood estimate (MLE) of the final ranking. More recently, researchers have begun to pay attention to rank aggregation methods for pairwise comparison data (Rajkumar and Agarwal,, 2014; Chen and Suh,, 2015; Chen et al.,, 2019).
We note that all the aforementioned methods assume that the rankers of interest are equally reliable. In practice, however, it is very common that some rankers are more reliable than others, whereas some are nearly non-informative and may be regarded as “spam rankers”. Such differences in rankers' qualities, if ignored in the analysis, may significantly corrupt the rank aggregation and lead to seriously misleading results. To the best of our knowledge, the earliest effort to address this critical issue can be traced to Aslam and Montague, (2001), which derived an aggregated ranking list by calculating a weighted summation of the observed ranking lists, known as the Borda Fuse. Lin and Ding, (2010) extended the objective function of Dwork et al., (2001) to a weighted version. Independently, Liu et al., (2007) proposed a supervised rank aggregation that determines the weights of the rankers by training with some external data. Although assigning weights to rankers is an intuitive and simple way to handle quality differences, how to scientifically determine these weights remained a critical and unsolved problem in the aforementioned works.
Recently, Deng et al., (2014) proposed BARD, a Bayesian approach to deal with quality differences among independent rankers without the need of external information. BARD introduces a partition model, which assumes that all involved entities can be partitioned into two groups: the relevant ones and the background ones. A rationale of the approach is that, in many applications, distinguishing relevant entities from background ones takes priority over the construction of a final ranking of all entities. Under this setting, BARD decomposes the information in a ranking list into three components: (i) the relative rankings of all background entities, which are assumed to be uniform; (ii) the relative ranking of each relevant entity among all background ones, which takes the form of a truncated power-law; and (iii) the relative rankings of all relevant entities, which are again uniform. The parameter of the truncated power-law distribution, which is ranker-specific, naturally serves as a quality measure for each ranker, as a ranker of higher quality corresponds to a more concentrated truncated power-law distribution.
Li et al., (2020) proposed a stage-wise data generation process based on an extended Mallows model (EMM) introduced by Fligner and Verducci, (1986). EMM assumes that each entity comes from a two-component mixture model involving a uniform distribution for non-informative entities, a modified Mallows model for informative entities, and a ranker-specific proportion parameter. Li et al., (2021) followed the Thurstone model framework to handle available covariates for the entities as well as different qualities of the rankers. In their model, each entity is associated with a Gaussian-distributed latent score and a ranking list is determined by the ranking of these scores. The quality of each ranker is determined by the standard deviation parameter in the Gaussian model, so that a larger standard deviation indicates a poorer-quality ranker.
Although these recent papers have proposed different ways to learn the quality variation among rankers, they all suffer from some limitations. The BARD method (Deng et al.,, 2014) simplifies the problem by assuming that all relevant entities are exchangeable. In many applications, however, the observed ranking lists often carry strong ordering information for the relevant entities, and simply labeling these entities as “relevant” without considering their relative rankings loses too much information and oversimplifies the problem. The extended Mallows model of Li et al., (2020) does not explicitly measure quality differences; although the authors mentioned that some of their model parameters can indicate the rankers' qualities, it is not clear how to properly combine multiple indicators into an easily interpretable quality measurement. The learning framework of Li et al., (2021), based on Gaussian latent variables, appears to be more suitable for incorporating covariates than for handling heterogeneous rankers.
In this paper, we propose a partition-Mallows model (PAMA), which combines the partition modeling framework of Deng et al., (2014) with the Mallows model, to accommodate the detailed ordering information among the relevant entities. The new framework can not only quantify the quality differences of rankers and distinguish relevant entities from background entities like BARD, but also provide an explicit ranking estimate among the relevant entities in rank aggregation. In contrast to the strategy of imposing the Mallows model on the full ranking lists, which tends to be sensitive to noise among low-ranked entities, the combination of the partition and Mallows models allows us to focus on highly ranked entities, which typically contain the high-quality signals in the data, and is thus more robust. Both simulation studies and real data applications show that the proposed approach is superior to existing methods, e.g., BARD and EMM, for a large class of rank aggregation problems.
The rest of this paper is organized as follows. A brief review of BARD and the Mallows model is presented in Section 2 as preliminaries. The proposed PAMA model is described in Section 3, with some key theoretical properties established. Statistical inference of the PAMA model, including the Bayesian inference and the pursuit of the MLE, is detailed in Section 4. The performance of PAMA is evaluated and compared to existing methods via simulations in Section 5. Two real data applications are presented in Section 6 to demonstrate the strength of the PAMA model in practice. Finally, we conclude the article with a short discussion in Section 7.
2 Notations and Preliminaries
Let $U = \{E_1, \ldots, E_n\}$ be the set of $n$ entities to be ranked. We use $E_i \prec_{\tau} E_j$ to represent that entity $E_i$ is preferred to entity $E_j$ in a ranking list $\tau$, and denote the position of entity $E_i$ in $\tau$ by $\tau(i)$. Note that more preferred entities always have lower rankings. Our research interest is to aggregate $m$ observed ranking lists $\tau_1, \ldots, \tau_m$, presumably constructed by $m$ rankers independently, into one consensus ranking list that is supposed to be “better” than each individual one.
2.1 BARD and Its Partition Model
The partition model in BARD (Deng et al.,, 2014) assumes that $U$ can be partitioned into two non-overlapping subsets: $U = U_R \cup U_B$, with $U_R$ representing the set of relevant entities and $U_B$ the background ones. Let $I = (I_1, \ldots, I_n)$ be the vector of group indicators, where $I_i = \mathbb{1}(E_i \in U_R)$ and $\mathbb{1}(\cdot)$ is the indicator function. This formulation makes sense in many applications where people are only concerned about a fixed number of top-ranked entities. Under this formulation, the information in a ranking list $\tau_k$ can be equivalently represented by a triplet $(\tau_k^0, \tau_k^{1|0}, \tau_k^1)$, where $\tau_k^0$ denotes the relative rankings of all background entities, $\tau_k^{1|0}$ denotes the relative rankings of the relevant entities among the background entities, and $\tau_k^1$ denotes the relative rankings of all relevant entities.
Deng et al., (2014) suggested a three-component model for $\tau_k$ by taking advantage of this equivalent decomposition:
$$P(\tau_k \mid I, \gamma_k) = P(\tau_k^0 \mid I) \cdot P(\tau_k^{1|0} \mid \tau_k^0, I, \gamma_k) \cdot P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I), \qquad (1)$$
where both $P(\tau_k^0 \mid I)$ (relative ranking of the background entities) and $P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I)$ (relative ranking of the relevant entities conditional on their set of positions relative to background entities) are uniform, and the relative ranking $\tau_k^{1|0}(i)$ of a relevant entity $E_i$ among the background ones follows a power-law distribution with parameter $\gamma_k > 0$, i.e.,
$$P(\tau_k^{1|0}(i) = j) \propto j^{-\gamma_k}, \qquad j = 1, \ldots, n_0 + 1,$$
leading to the following explicit forms for the three terms in equation (1):
$$P(\tau_k^0 \mid I) = \frac{1}{n_0!}, \qquad (2)$$
$$P(\tau_k^{1|0} \mid \tau_k^0, I, \gamma_k) = \prod_{i:\, I_i = 1} \frac{\big(\tau_k^{1|0}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}}, \qquad (3)$$
$$P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I) = \frac{1}{|A(\tau_k^{1|0})|}, \qquad (4)$$
where $n_1$ and $n_0$ are the counts of relevant and background entities respectively, $Z_{\gamma_k} = \sum_{j=1}^{n_0+1} j^{-\gamma_k}$ is the normalizing constant of the power-law distribution, $A(\tau_k^{1|0})$ is the set of $\tau_k^1$'s that are compatible with $\tau_k^{1|0}$, and $n_1 + n_0 = n$.
Intuitively, this model assumes that each ranker first randomly places all background entities to generate $\tau_k^0$, then “inserts” each relevant entity independently into the list of background entities according to a truncated power-law distribution to generate $\tau_k^{1|0}$, and finally draws $\tau_k^1$ uniformly from $A(\tau_k^{1|0})$. In other words, $\tau_k^0$ serves as a baseline for modeling $\tau_k^{1|0}$ and $\tau_k^1$. It is easy to see from the model that a more reliable ranker should possess a larger $\gamma_k$. With the assumption of independent rankers, we have the full-data likelihood:
$$L(\gamma, I; \tau_1, \ldots, \tau_m) = \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k), \qquad (5)$$
where $\gamma = (\gamma_1, \ldots, \gamma_m)$. A detailed Bayesian inference procedure for $(\gamma, I)$ via Markov chain Monte Carlo can be found in Deng et al., (2014).
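The generative story behind (1)-(5) can be illustrated with a short Python sketch; the function below is our own illustrative rendering (not the authors' code), using the convention that a relevant entity's insertion slot $j = 1, \ldots, n_0 + 1$ follows the truncated power law, with slot 1 corresponding to the top of the list.

```python
import numpy as np

def sample_bard_list(n_relevant, n_background, gamma, rng=None):
    """Sketch of BARD's generative story for one ranker.

    1. Background entities are ordered uniformly at random.
    2. Each relevant entity independently draws an insertion slot among
       the background list from a truncated power law with exponent gamma.
    3. Relevant entities falling in the same slot are ordered uniformly.
    Entities 0..n_relevant-1 are relevant; the rest are background.
    """
    rng = np.random.default_rng(rng)
    background = list(rng.permutation(np.arange(n_relevant, n_relevant + n_background)))
    slots = np.arange(1, n_background + 2)           # slots 1..n0+1
    p = slots.astype(float) ** (-gamma)
    p /= p.sum()
    draws = rng.choice(slots, size=n_relevant, p=p)  # slot of each relevant entity
    ranking = []
    for s in range(1, n_background + 2):
        bucket = [i for i in range(n_relevant) if draws[i] == s]
        rng.shuffle(bucket)                          # uniform relative order
        ranking.extend(bucket)
        if s <= n_background:
            ranking.append(background[s - 1])
    return ranking

print(sample_bard_list(3, 7, gamma=2.0, rng=0))
```

With a large `gamma`, the relevant entities (0, 1, 2 here) concentrate near the top, mimicking a reliable ranker.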
2.2 The Mallows Model
Mallows, (1957) proposed the following probability model for a ranking list $\tau$ of $n$ entities:
$$P(\tau \mid \tau_0, \phi) = \frac{1}{Z_n(\phi)} \exp\{-\phi \cdot d(\tau, \tau_0)\}, \qquad (6)$$
where $\tau_0$ denotes the true ranking list, $\phi \geq 0$ characterizes the reliability of $\tau$, the function $d(\cdot, \cdot)$ is a distance metric between two ranking lists, and
$$Z_n(\phi) = \sum_{\tau} \exp\{-\phi \cdot d(\tau, \tau_0)\} \qquad (7)$$
is the normalizing constant, whose analytic form was derived in Diaconis, (1988). Clearly, a larger $\phi$ means that $\tau$ is more stable and concentrates in a tighter neighborhood of $\tau_0$. A common choice of $d(\cdot, \cdot)$ is the Kendall tau distance.
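Under the Kendall tau distance, $Z_n(\phi)$ admits the well-known closed form $\prod_{j=1}^{n} (1 - e^{-j\phi}) / (1 - e^{-\phi})$, which the following Python sketch (our own illustration) uses to evaluate the Mallows probability; the final line checks that the probabilities over all permutations of three entities sum to one.

```python
import math
import itertools

def kendall_tau(tau_a, tau_b):
    pos_a = {e: r for r, e in enumerate(tau_a)}
    pos_b = {e: r for r, e in enumerate(tau_b)}
    return sum(1 for x, y in itertools.combinations(pos_a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def mallows_normalizer(n, phi):
    """Closed-form Z_n(phi) under the Kendall tau distance."""
    return math.prod((1 - math.exp(-j * phi)) / (1 - math.exp(-phi))
                     for j in range(1, n + 1))

def mallows_pmf(tau, tau0, phi):
    d = kendall_tau(tau, tau0)
    return math.exp(-phi * d) / mallows_normalizer(len(tau), phi)

# Probabilities over all permutations of 3 entities sum to 1.
tau0 = [0, 1, 2]
print(sum(mallows_pmf(list(t), tau0, phi=1.2)
          for t in itertools.permutations(tau0)))
```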
The Mallows model under the Kendall tau distance can also be equivalently described by an alternative multistage model, which selects and positions entities one by one in a sequential fashion, with $\phi$ serving as a common parameter that governs the probabilistic behavior of each entity in the stochastic process (Mallows,, 1957). Later on, Fligner and Verducci, (1986) extended the Mallows model by allowing $\phi$ to vary across stages, i.e., introducing a position-specific parameter $\phi_s$ for each position $s$, which leads to a very flexible, in many cases too flexible, framework for modeling rank data. To stabilize the generalized Mallows model of Fligner and Verducci, (1986), Li et al., (2020) proposed to put a structural constraint on the $\phi_s$'s. As a probabilistic model for rank data, the Mallows model enjoys great interpretability, model compactness, and efficiency in inference and computation. For a comprehensive review of the Mallows model and its extensions, see Irurozki et al., (2014) and Li et al., (2020).
3 The Partition-Mallows Model
The partition model employed by BARD (Deng et al.,, 2014) tends to oversimplify the problem for scenarios where we care about the detailed rankings of relevant entities. To further enhance the partition model of BARD so that it can reflect the detailed rankings of relevant entities, we describe a new partition-Mallows model in this section.
3.1 The Reverse Partition Model
To combine the partition model with the Mallows model, a naive strategy is to simply replace the uniform model for the relevant entities, i.e., $P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I)$ in (1), by the Mallows model, which leads to the updated Equation (4) below:
$$P(\tau_k^1 \mid \tau_k^{1|0}, \tau_k^0, I, \phi_k) = \frac{\exp\{-\phi_k \cdot d(\tau_k^1, \tau_0^1)\}}{Z(\tau_k^{1|0}, \phi_k)}, \qquad \tau_k^1 \in A(\tau_k^{1|0}),$$
where $\exp\{-\phi_k \cdot d(\tau_k^1, \tau_0^1)\}$ is the Mallows density of $\tau_k^1$ centered at the true relative ranking $\tau_0^1$ of the relevant entities, and $Z(\tau_k^{1|0}, \phi_k) = \sum_{\tau' \in A(\tau_k^{1|0})} \exp\{-\phi_k \cdot d(\tau', \tau_0^1)\}$ is the normalizing constant of the Mallows model with a constraint due to the compatibility of $\tau_k^1$ with respect to $\tau_k^{1|0}$. Apparently, the calculation of $Z(\tau_k^{1|0}, \phi_k)$, which involves the summation over the whole space of compatible $\tau_k^1$'s, is infeasible for most practical cases, rendering such a naive combination of the Mallows model and the partition model impractical.
To avoid the challenging computation caused by the compatibility constraints, we rewrite the partition model by switching the roles of the relevant and background entities: instead of decomposing $\tau_k$ as $(\tau_k^0, \tau_k^{1|0}, \tau_k^1)$ conditioning on the group indicators $I$, we decompose $\tau_k$ into an alternative triplet $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$, where $\tau_k^{0|1}$ denotes the relative reverse rankings of the background entities among the relevant ones; formally, for any background entity $E_i$, $\tau_k^{0|1}(i)$ denotes its relative reverse ranking among the relevant ones. In this reverse partition model, we first order the relevant entities according to a certain distribution and then use them as a reference system to “insert” the background entities. Figure 1 illustrates the equivalence between $\tau_k$ and its two alternative representations, $(\tau_k^0, \tau_k^{1|0}, \tau_k^1)$ and $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$.
Given the group indicator vector $I$, the reverse partition model gives rise to the following distributional form for $\tau_k$:
$$P(\tau_k \mid I, \gamma_k) = P(\tau_k^1 \mid I) \cdot P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) \cdot P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I), \qquad (8)$$
which is analogous to (1) for the original partition model in BARD. Compared to (1), however, the new form (8) enables us to specify an unconstrained marginal distribution for $\tau_k^1$. Moreover, due to the symmetry between $\tau_k^{1|0}$ and $\tau_k^{0|1}$, it is highly likely that the power-law distribution, which was shown in Deng et al., (2014) to approximate the distribution of $\tau_k^{1|0}(i)$ well for each relevant entity, can also model $\tau_k^{0|1}(i)$ for each background entity reasonably well. Detailed numerical validations are shown in Supplementary Material.
If we assume that all relevant entities are exchangeable, all background entities are exchangeable, and the relative reverse ranking $\tau_k^{0|1}(i)$ of a background entity among the relevant entities follows a power-law distribution, we have
$$P(\tau_k^1 \mid I) = \frac{1}{n_1!}, \qquad (9)$$
$$P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) = \prod_{i:\, I_i = 0} \frac{\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}}, \qquad (10)$$
$$P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I) = \frac{1}{|A(\tau_k^{0|1})|}, \qquad (11)$$
where $n_1$ and $n_0$ are the numbers of relevant and background entities, respectively, $\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}$ is the unnormalized part of the power-law, $Z_{\gamma_k} = \sum_{j=1}^{n_1+1} j^{-\gamma_k}$ is the normalizing constant (now truncated at $n_1 + 1$), $A(\tau_k^{0|1})$ is the set of all $\tau_k^0$'s that are compatible with a given $\tau_k^{0|1}$, and $n_1 + n_0 = n$. Apparently, the likelihood of this reverse-partition model shares the same structure as that of the original partition model in BARD, and thus can be inferred in a similar way.
3.2 The Partition-Mallows Model
The reverse partition model introduced in Section 3.1 allows us to freely model $\tau_k^1$ beyond a uniform distribution, which is infeasible for the original partition model in BARD. Here we employ the Mallows model for $\tau_k^1$ due to its interpretability, compactness and computability. To achieve this, we replace the group indicator vector $I$ in the partition model by a more general indicator vector, which takes values in the space of all arrangements of the multiset $\{1, \ldots, n_1, 0, \ldots, 0\}$ (with $n_0$ zeros), with $I_i = 0$ if $E_i \in U_B$, and $I_i = t > 0$ if $E_i \in U_R$ and $E_i$ is ranked at position $t$ among all relevant entities. Figure 1 provides an illustrative example of assigning an enhanced indicator vector to a universe of 10 entities with $n_1 = 5$.
Based on the status of $I_i$, we can define subvectors $I^+$ and $I^0$, where $I^+$ stands for the subvector of $I$ containing all positive elements, and $I^0$ for the remaining zero elements. Figure 1 demonstrates the constructions of $I$, $I^+$ and $I^0$, and the equivalence between $\tau_k$ and $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$ given $I$. Note that, different from the partition model in BARD, which allows the number of relevant entities to vary around its expected value, the number of relevant entities $n_1$ in the new model is assumed to be fixed and known for conceptual and computational convenience. In other words, we have $\sum_{i=1}^{n} \mathbb{1}(I_i > 0) = n_1$ in the new setting.
[Figure 1: an illustrative example with 10 entities and $n_1 = 5$, showing the enhanced indicator vector $I$, its subvectors $I^+$ and $I^0$, and the equivalent representations of a ranking list $\tau_k$ via $(\tau_k^1, \tau_k^{0|1}, \tau_k^0)$.]
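The bookkeeping behind the enhanced indicator vector can be sketched in a few lines of Python (our own illustrative helper functions, with 0-indexed entities):

```python
def enhanced_indicator(relevant_in_order, n):
    """Build the enhanced indicator vector I for n entities.

    relevant_in_order lists the relevant entities from best to worst;
    I[i] = 0 marks entity i as background, and I[i] = t > 0 says entity i
    is the t-th best relevant entity.
    """
    I = [0] * n
    for t, entity in enumerate(relevant_in_order, start=1):
        I[entity] = t
    return I

def split_indicator(I):
    """Recover the two subvectors: the ranks of the relevant entities
    (I+) and the zero pattern of the background entities (I0)."""
    I_plus = {i: t for i, t in enumerate(I) if t > 0}
    I_zero = [i for i, t in enumerate(I) if t == 0]
    return I_plus, I_zero

I = enhanced_indicator([3, 0, 2], n=6)   # entity 3 best, then 0, then 2
print(I)                                  # [2, 0, 3, 1, 0, 0]
print(split_indicator(I))
```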
As an analogy of Equations (1) and (8), we have the following decomposition of $\tau_k$ given the enhanced indicator vector $I$:
$$P(\tau_k \mid I, \gamma_k, \phi_k) = P(\tau_k^1 \mid I, \phi_k) \cdot P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) \cdot P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I). \qquad (12)$$
Assume that $\tau_k^1$ follows the Mallows model (with parameter $\phi_k$) centered at $I^+$:
$$P(\tau_k^1 \mid I, \phi_k) = \frac{1}{Z_{n_1}(\phi_k)} \exp\{-\phi_k \cdot d(\tau_k^1, I^+)\}, \qquad (13)$$
where $d(\cdot, \cdot)$ denotes the Kendall tau distance and $Z_{n_1}(\phi_k)$ is defined as in (7). Clearly, a larger $\phi_k$ indicates that ranker $k$ is of a higher quality, as the distribution is more concentrated at the “true ranking” defined by $I^+$. Since the relative rankings of the background entities are of no interest to us, we still assume that they are randomly ranked. Together with the power-law assumption for $\tau_k^{0|1}$, we have
$$P(\tau_k^{0|1} \mid \tau_k^1, I, \gamma_k) = \prod_{i:\, I_i = 0} \frac{\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}}, \qquad (14)$$
$$P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I) = \frac{1}{|A(\tau_k^{0|1})|}, \qquad (15)$$
where the notations $Z_{\gamma_k}$ and $A(\tau_k^{0|1})$ are the same as in the reverse-partition model. We call the resulting model the Partition-Mallows model, abbreviated as PAMA.
Different from the partition and reverse partition models, which quantify the quality of ranker $k$ with only the parameter $\gamma_k$ in the power-law distribution, the PAMA model contains two quality parameters $\phi_k$ and $\gamma_k$, with the former indicating the ranker's ability to rank the relevant entities and the latter reflecting the ranker's ability to differentiate relevant entities from background ones. Intuitively, $\phi_k$ and $\gamma_k$ reflect the quality of ranker $k$ in two different aspects. However, considering that a good ranker is typically strong in both dimensions, it is natural to further simplify the model by assuming
$$\phi_k = \alpha \cdot \gamma_k, \qquad (16)$$
with $\alpha > 0$ being a common factor for all rankers. This assumption, while reducing the number of free parameters by almost half, captures the natural positive correlation between $\phi_k$ and $\gamma_k$ and serves as a first-order (i.e., linear) approximation to the functional relationship between them. A wide range of numerical studies based on simulated data suggest that the linear approximation shown in (16) works reasonably well for many typical rank aggregation scenarios. In contrast, the more flexible model with both $\phi_k$ and $\gamma_k$ as free parameters (referred to as PAMA∗) suffers from unstable performance from time to time. Detailed evidence supporting assumption (16) can be found in Supplementary Material.
Plugging (16) into (13), we have a simplified model for $\tau_k^1$ given $I$:
$$P(\tau_k^1 \mid I, \gamma_k, \alpha) = \frac{1}{Z_{n_1}(\alpha \gamma_k)} \exp\{-\alpha \gamma_k \cdot d(\tau_k^1, I^+)\}. \qquad (17)$$
Combining (14), (15) and (17), we get the full likelihood of $\tau_k$:
$$P(\tau_k \mid I, \gamma_k, \alpha) = \frac{\exp\{-\alpha \gamma_k \cdot d(\tau_k^1, I^+)\}}{Z_{n_1}(\alpha \gamma_k)} \cdot \prod_{i:\, I_i = 0} \frac{\big(\tau_k^{0|1}(i)\big)^{-\gamma_k}}{Z_{\gamma_k}} \cdot \frac{1}{|A(\tau_k^{0|1})|}, \qquad (18)$$
where $Z_{n_1}(\cdot)$, $Z_{\gamma_k}$ and $A(\tau_k^{0|1})$ keep the same meanings as in the reverse partition model. At last, for the set of observed ranking lists $\tau = \{\tau_1, \ldots, \tau_m\}$ from $m$ independent rankers, we have the joint likelihood:
$$L(\theta; \tau) = \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (19)$$
where $\theta = (I, \gamma_1, \ldots, \gamma_m, \alpha)$ collects all model parameters.
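As a concrete illustration of the likelihood (18)-(19), the Python sketch below evaluates the log-likelihood up to the $\gamma$-free uniform constants, under two assumptions of ours: the Mallows parameter is linked by $\phi_k = \alpha \gamma_k$ as in (16), and a background entity's reverse rank is computed as one plus the number of relevant entities ranked below it (so that rank 1, the bottom of the list, is the most probable slot). These conventions are our reading of the model, not code from the paper.

```python
import itertools
import math

def kendall_tau(a, b):
    pos_a = {e: r for r, e in enumerate(a)}
    pos_b = {e: r for r, e in enumerate(b)}
    return sum(1 for x, y in itertools.combinations(pos_a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def log_mallows_normalizer(n, phi):
    # closed form of Z_n(phi) under the Kendall tau distance
    return sum(math.log((1 - math.exp(-j * phi)) / (1 - math.exp(-phi)))
               for j in range(1, n + 1))

def pama_loglik(tau_lists, I, gammas, alpha):
    """PAMA log-likelihood up to gamma-free uniform constants.

    tau_lists[k]: ranker k's full list, best first.
    I: enhanced indicator (0 = background, t > 0 = t-th best relevant).
    gammas[k] > 0: ranker k's quality; the Mallows parameter is
    phi_k = alpha * gammas[k], following (16).
    """
    relevant = sorted((i for i in range(len(I)) if I[i] > 0),
                      key=lambda i: I[i])           # the center I+
    n1 = len(relevant)
    ll = 0.0
    for tau, g in zip(tau_lists, gammas):
        phi = alpha * g
        tau1 = [e for e in tau if I[e] > 0]         # relative order of relevant
        ll += -phi * kendall_tau(tau1, relevant) - log_mallows_normalizer(n1, phi)
        log_z = math.log(sum(j ** (-g) for j in range(1, n1 + 2)))
        for pos, e in enumerate(tau):
            if I[e] == 0:                           # background entity
                # reverse rank: 1 + number of relevant entities ranked below
                rev = 1 + sum(1 for f in tau[pos + 1:] if I[f] > 0)
                ll += -g * math.log(rev) - log_z
    return ll
```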
3.3 Model Identifiability and Estimation Consistency
Let $\mathcal{I}$ be the space of all possible enhanced indicator vectors in which $I$ takes value, and let $\theta = (I, \gamma_1, \ldots, \gamma_m, \alpha)$ be the vector of model parameters. The PAMA model in (19) thus defines a family of probability distributions $\{P_\theta\}$ indexed by the parameter $\theta$ taking values in the space $\Theta = \mathcal{I} \times (0, \infty)^m \times (0, \infty)$. We show here that the PAMA model defined in (19) is identifiable and the model parameters can be estimated consistently under mild conditions.
Theorem 1.
The PAMA model is identifiable, i.e., for any $\theta_1, \theta_2 \in \Theta$,
$$P_{\theta_1}(\tau) = P_{\theta_2}(\tau) \ \text{ for all } \tau \implies \theta_1 = \theta_2. \qquad (20)$$
Proof.
See Supplementary Material. ∎
To show that parameters in the PAMA model can be estimated consistently, we will first construct an estimator of the indicator vector $I$ that is consistent as $m \to \infty$ with the number of ranked entities $n$ fixed, and show later that the remaining parameters can also be consistently estimated once $I$ is given. To this end, we define $\bar{\tau}(i) = \frac{1}{m} \sum_{k=1}^{m} \tau_k(i)$ to be the average rank of entity $E_i$ across all rankers, and assume that the ranker-specific quality parameters $\gamma_1, \ldots, \gamma_m$ are i.i.d. samples from a non-atomic probability measure $\pi$ defined on $(0, \infty)$ with a finite first moment (referred to as condition $\mathcal{A}$ hereinafter). Then, by the strong law of large numbers we have
$$\bar{\tau}(i) \longrightarrow \mu_i \quad \text{almost surely as } m \to \infty, \qquad (21)$$
since $\tau_1(i), \ldots, \tau_m(i)$ are i.i.d. random variables with expectation
$$\mu_i = \mathbb{E}_{\pi}\big[g_i(\gamma)\big],$$
where $g_i(\gamma)$ is the conditional mean of $\tau_k(i)$ given the model parameters, i.e.,
$$g_i(\gamma) = \mathbb{E}\big(\tau_k(i) \mid \gamma_k = \gamma, \alpha, I\big).$$
Clearly, this conditional mean is a function of $\gamma$ given $\alpha$ and $I$; we write it as $g_i(\gamma)$ to emphasize its nature as a continuous function of $\gamma$. Without loss of generality, we suppose hereinafter that the relevant entities are $E_1, \ldots, E_{n_1}$ with $I_i = i$ for $i \le n_1$, i.e., $I = (1, \ldots, n_1, 0, \ldots, 0)$. Then, the partition structure and the Mallows model embedded in the PAMA model lead to the following facts:
$$\mu_1 < \mu_2 < \cdots < \mu_{n_1}, \qquad \mu_{n_1+1} = \cdots = \mu_n \triangleq \mu_0. \qquad (22)$$
Note that $g_i(\gamma)$ degenerates to a common constant for the background entities, whose common limit $\mu_0$ does not involve $\alpha$, because the parameter $\alpha$ influences only the relative rankings of the relevant entities in the Mallows model. For the BARD model, the $\mu_i$'s can be derived explicitly.

[Figure 2: empirical estimates of the $\mu_i$'s under three specifications of $\alpha$.]

Figure 2 shows some empirical estimates of the $\mu_i$'s based on independent samples drawn from PAMA models sharing the same configuration but with three different values of $\alpha$: (a) $\alpha = 0$, which corresponds to the BARD model; (b) a moderate $\alpha$; and (c) a large $\alpha$. One surprise is that in case (c), some relevant entities may have a larger $\mu_i$ (worse average rank) than the average rank of the background entities. Lemma 1 guarantees that for almost all $\alpha$, $\mu_i$ is different from $\mu_0$ for every relevant entity. The proof of Lemma 1 can be found in Supplementary Material.
Lemma 1.
For the PAMA model with condition $\mathcal{A}$, there exists a set $B_0 \subset (0, \infty)$ containing only finitely many elements such that, for any $\alpha \notin B_0$,
$$\mu_i \neq \mu_0 \quad \text{for all } i \text{ with } I_i > 0. \qquad (23)$$
The facts demonstrated in (22) and (23) suggest the following three-step strategy to estimate $I$: (a) find the subset of $n_0$ entities from $U$ whose within-subset variation of the $\bar{\tau}(i)$'s is the smallest, i.e.,
$$\hat{U}_B = \mathop{\arg\min}_{S \subset U,\, |S| = n_0}\ \sum_{i \in S} \big(\bar{\tau}(i) - \bar{\tau}_S\big)^2, \quad \text{where } \bar{\tau}_S = \frac{1}{|S|} \sum_{i \in S} \bar{\tau}(i), \qquad (24)$$
and let $\hat{U}_B$ be an estimate of $U_B$; (b) rank the entities in $\hat{U}_R = U \setminus \hat{U}_B$ by $\bar{\tau}(i)$ increasingly and use the obtained ranking as an estimate of $I^+$; (c) combine the above two steps to obtain the estimate $\hat{I}$ of $I$, by setting $\hat{I}_i = 0$ for $E_i \in \hat{U}_B$ and $\hat{I}_i = t$ for the entity ranked $t$-th in step (b).
Note that $\hat{I}$ is based on the mean ranks $\bar{\tau}(i)$, and is thus clearly a moment estimator. Although this three-step estimation strategy is neither statistically efficient nor computationally feasible (step (a) is NP-hard), it nevertheless serves as a prototype for developing the consistency theory. Theorem 2 guarantees that $\hat{I}$ is a consistent estimator of $I$ under mild conditions.
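The three-step prototype can be written down directly; the brute-force subset search below (our own illustration) makes the NP-hardness of step (a) explicit and is only feasible for toy problem sizes.

```python
import itertools
import numpy as np

def moment_estimate_I(tau_lists, n1):
    """Prototype three-step moment estimator of the indicator vector.

    Step (a) searches all size-(n - n1) subsets for the one whose mean
    ranks have the smallest within-subset variance (the background group).
    Step (b) ranks the remaining (relevant) entities by increasing mean rank.
    Step (c) assembles the enhanced indicator vector.
    """
    n = len(tau_lists[0])
    pos = np.zeros(n)
    for tau in tau_lists:
        for r, e in enumerate(tau):
            pos[e] += r + 1
    mean_rank = pos / len(tau_lists)            # tau-bar(i)
    best_bg, best_var = None, np.inf
    for bg in itertools.combinations(range(n), n - n1):
        v = np.var(mean_rank[list(bg)])
        if v < best_var:
            best_bg, best_var = bg, v
    relevant = sorted(set(range(n)) - set(best_bg), key=lambda i: mean_rank[i])
    I = [0] * n
    for t, e in enumerate(relevant, start=1):
        I[e] = t
    return I
```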
Theorem 2.
For the PAMA model with condition $\mathcal{A}$ and for almost all $\alpha$, the moment estimator $\hat{I}$ converges to $I$ with probability 1 as $m$ goes to infinity.
Proof.
See Supplementary Material. ∎
Theorem 2 tells us that estimating $I$ is straightforward if the number of independent rankers goes to infinity: a simple moment method ignoring the quality differences of the rankers can provide us a consistent estimate of $I$. In a practical problem where only a finite number of rankers are involved, however, more efficient statistical inference of the PAMA model based on Bayesian or frequentist principles becomes more attractive, as effectively utilizing the quality information of different rankers is critical.
With $n$ and $I$ fixed, the ranker-specific parameter $\gamma_k$, which governs the power-law distribution for the rank list $\tau_k$, cannot be estimated consistently, and the distribution of the $\gamma_k$'s cannot be determined nonparametrically even when the number of rank lists goes to infinity. We therefore impose a parametric form $\pi(\gamma \mid \eta)$ with $\eta$ as the hyper-parameter and refer to the resulting hierarchically structured model as PAMA-H, which has the following marginal likelihood of $(I, \alpha, \eta)$:
$$L(I, \alpha, \eta; \tau) = \prod_{k=1}^{m} \int P(\tau_k \mid I, \gamma_k, \alpha)\, \pi(\gamma_k \mid \eta)\, d\gamma_k.$$
We show in Theorem 3 that the MLE based on the above marginal likelihood is consistent.
Theorem 3.
Under the PAMA-H model, assume that $(\alpha, \eta)$ belongs to the parameter space $\Theta_H$, and the true parameter $(\alpha_0, \eta_0)$ is an interior point of $\Theta_H$. Let $(\hat{\alpha}, \hat{\eta})$ be the maximizer of $L(I, \alpha, \eta; \tau)$. If $\pi(\gamma \mid \eta)$ has a density function that is differentiable and concave with respect to $\eta$, then $(\hat{\alpha}, \hat{\eta}) \to (\alpha_0, \eta_0)$ almost surely.
Proof.
See Supplementary Material. ∎
4 Inference with the Partition-Mallows Model
4.1 Maximum Likelihood Estimation
Under the PAMA model, the MLE of $\theta$ is $\hat{\theta} = \arg\max_{\theta} \ell(\theta; \tau)$, where
$$\ell(\theta; \tau) = \log L(\theta; \tau) \qquad (25)$$
is the logarithm of the likelihood function (19). Here, we adopt the Gauss-Seidel iterative method of Yang, (2018), also known as backfitting or cyclic coordinate ascent, to implement the optimization. Starting from an initial point $\theta^{(0)}$, the Gauss-Seidel method iteratively updates one coordinate of $\theta$ at each step with the other coordinates held fixed at their current values. A Newton-like method is adopted to update the $\gamma_k$'s and $\alpha$. Since $I$ is a discrete vector, we find favorable values of $I$ by swapping two neighboring entities and checking whether $\ell(\theta; \tau)$ increases. More details of the algorithm are provided in Supplementary Material.
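A skeleton of this cyclic coordinate-ascent scheme is sketched below in Python; a coarse grid search stands in for the Newton-like updates of the continuous parameters, and only adjacent-swap moves on $I$ are shown. All names and the search grid are our own illustrative choices, and `loglik` can be any implementation of (25), e.g., `pama_loglik` above.

```python
import numpy as np

def gauss_seidel_mle(loglik, I0, gamma0, alpha0, n_sweeps=20):
    """Cyclic coordinate-ascent skeleton for the PAMA log-likelihood:
    each gamma_k and alpha are updated by a coarse grid search (standing
    in for the Newton-like step), and I is improved greedily by swaps of
    adjacent relevant entities that increase the log-likelihood."""
    I, gammas, alpha = list(I0), list(gamma0), alpha0
    grid = np.linspace(0.05, 5.0, 40)
    for _ in range(n_sweeps):
        for k in range(len(gammas)):
            gammas[k] = max(grid, key=lambda g: loglik(
                I, gammas[:k] + [g] + gammas[k + 1:], alpha))
        alpha = max(grid, key=lambda a: loglik(I, gammas, a))
        improved = True
        while improved:                  # greedy adjacent-swap search on I
            improved = False
            order = sorted((i for i in range(len(I)) if I[i] > 0),
                           key=lambda i: I[i])
            for t in range(len(order) - 1):
                I2 = list(I)
                a, b = order[t], order[t + 1]
                I2[a], I2[b] = I2[b], I2[a]
                if loglik(I2, gammas, alpha) > loglik(I, gammas, alpha):
                    I, improved = I2, True
                    break
    return I, gammas, alpha
```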
With the MLE $\hat{\theta}$, we define $\hat{U}_R = \{E_i : \hat{I}_i > 0\}$ and $\hat{U}_B = \{E_i : \hat{I}_i = 0\}$, and generate the final aggregated ranking list by the rules below: (a) set the top-$n_1$ list by ordering the entities in $\hat{U}_R$ according to $\hat{I}^+$; (b) let all entities in $\hat{U}_B$ tie for the positions behind. Hereinafter, we refer to this MLE-based rank aggregation procedure under the PAMA model as PAMAF.
For the PAMA-H model, a similar procedure can be applied to find the MLE of $(I, \alpha, \eta)$, with the $\gamma_k$'s treated as missing data. With the MLE $(\hat{I}, \hat{\alpha}, \hat{\eta})$, we can generate the final aggregated ranking list based on $\hat{I}$ in the same way as in PAMA, and evaluate the quality of ranker $k$ via the mean or mode of the conditional distribution below:
$$\pi(\gamma_k \mid \tau_k, \hat{I}, \hat{\alpha}, \hat{\eta}) \propto P(\tau_k \mid \hat{I}, \gamma_k, \hat{\alpha}) \cdot \pi(\gamma_k \mid \hat{\eta}).$$
In this paper, we refer to the above MLE-based rank aggregation procedure under PAMA-H model as PAMAHF. The procedure is detailed in Supplementary Material.
4.2 Bayesian Inference
Since the three model parameters $I$, $\gamma$ and $\alpha$ encode “orthogonal” information of the PAMA model, it is natural to expect that they are mutually independent a priori. We thus specify their joint prior distribution as
$$\pi(I, \gamma, \alpha) = \pi(I) \cdot \pi(\alpha) \cdot \prod_{k=1}^{m} \pi(\gamma_k).$$
Without much loss, we may restrict the range of $\alpha$ and the $\gamma_k$'s to a closed interval $[0, B]$ with a large enough $B$. In contrast, $I$ is discrete and takes values in the finite space $\mathcal{I}$. It is convenient to specify $\pi(I)$, $\pi(\alpha)$ and $\pi(\gamma_k)$ as uniform, i.e.,
$$\pi(I) \propto 1, \qquad \pi(\alpha) = \mathrm{Unif}[0, B], \qquad \pi(\gamma_k) = \mathrm{Unif}[0, B].$$
Based on our experiences in a large range of simulation studies and real data applications, we find that this uniform specification works reasonably well. In Section 3.3 we also discussed letting $\pi(\gamma_k)$ take a parametric form, which will be further discussed later.
The posterior distribution can be expressed as
$$\pi(I, \gamma, \alpha \mid \tau) \propto \pi(I)\, \pi(\alpha) \prod_{k=1}^{m} \pi(\gamma_k) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (26)$$
with the following conditional distributions:
$$\pi(\gamma_k \mid \tau, I, \alpha, \gamma_{-k}) \propto \pi(\gamma_k) \cdot P(\tau_k \mid I, \gamma_k, \alpha), \qquad (27)$$
$$\pi(\alpha \mid \tau, I, \gamma) \propto \pi(\alpha) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (28)$$
$$\pi(I \mid \tau, \gamma, \alpha) \propto \pi(I) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (29)$$
based on which posterior samples of $(I, \gamma, \alpha)$ can be obtained by Gibbs sampling, where $\gamma_{-k}$ denotes all components of $\gamma$ except $\gamma_k$.
Considering that the conditional distributions in (27)-(29) are nonstandard, we adopt the Metropolis-Hastings algorithm (Hastings,, 1970) to enable the conditional sampling. To be specific, we choose random-walk proposal distributions for $\gamma_k$ and $\alpha$ (e.g., Gaussian random walks centered at the current values), whose step sizes can be tuned to optimize the mixing rate of the sampler. Since $I$ is a discrete vector, we propose new values of $I$ by swapping two randomly selected adjacent entities. Note that the entity ranked at position $n_1$ (the lowest-ranked relevant entity) could be swapped with any background entity; due to the homogeneity of the background entities, there is no need to swap two background entities. Therefore, the number of potential proposals in each step is $n_1 + n_0 - 1$. More details about MCMC sampling techniques can be found in Liu, (2008).
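The two kinds of Metropolis-Hastings moves can be sketched as follows (our own illustration): a random-walk update for a continuous parameter such as $\gamma_k$, and a swap-based update for $I$ that either exchanges two adjacent relevant entities or exchanges the lowest-ranked relevant entity with a random background entity. Rejecting negative proposals outright is a convenience that slightly departs from an exact truncated proposal.

```python
import numpy as np

def mh_update_gamma(loglik, I, gammas, alpha, k, step=0.3, rng=None):
    """One random-walk Metropolis update of gamma_k; negative proposals
    are rejected outright to stay in the allowed range."""
    rng = np.random.default_rng(rng)
    prop = gammas[k] + step * rng.standard_normal()
    if prop <= 0:
        return gammas
    new = gammas[:k] + [prop] + gammas[k + 1:]
    log_ratio = loglik(I, new, alpha) - loglik(I, gammas, alpha)
    return new if np.log(rng.uniform()) < log_ratio else gammas

def mh_update_I(loglik, I, gammas, alpha, rng=None):
    """Propose swapping two adjacent relevant entities, or the lowest-
    ranked relevant entity with a randomly chosen background entity."""
    rng = np.random.default_rng(rng)
    order = sorted((i for i in range(len(I)) if I[i] > 0), key=lambda i: I[i])
    background = [i for i in range(len(I)) if I[i] == 0]
    j = int(rng.integers(len(order)))
    I2 = list(I)
    if j + 1 < len(order):                   # adjacent relevant pair
        a, b = order[j], order[j + 1]
    else:                                    # last relevant vs. background
        a, b = order[-1], background[int(rng.integers(len(background)))]
    I2[a], I2[b] = I2[b], I2[a]
    log_ratio = loglik(I2, gammas, alpha) - loglik(I, gammas, alpha)
    return I2 if np.log(rng.uniform()) < log_ratio else I
```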
Suppose that $T$ posterior samples $\{(I^{(t)}, \gamma^{(t)}, \alpha^{(t)})\}_{t=1}^{T}$ are obtained. We calculate the posterior means of the parameters as below:
$$\hat{\gamma}_k = \frac{1}{T} \sum_{t=1}^{T} \gamma_k^{(t)}, \qquad \hat{\alpha} = \frac{1}{T} \sum_{t=1}^{T} \alpha^{(t)}, \qquad \hat{I}_i = \frac{1}{T} \sum_{t=1}^{T} I_i^{(t)}.$$
We quantify the quality of ranker $k$ with $\hat{\gamma}_k$, and generate the final aggregated ranking list by sorting the entities by $\hat{I}_i$ in increasing order, where a sampled value $I_i^{(t)} = 0$ (indicating a background entity) is replaced by the average background rank when computing $\hat{I}_i$.
Hereinafter, we refer to this MCMC-based Bayesian rank aggregation procedure under the Partition-Mallows model as PAMAB.
The Bayesian inference procedure PAMAHB for the PAMA-H model differs from PAMAB only by replacing the prior distribution $\pi(\gamma_k)$, which is uniform in PAMAB, with the hierarchically structured prior $\pi(\gamma_k \mid \eta)\, \pi(\eta)$. The conditional distributions needed for Gibbs sampling are almost the same as (27)-(29), except for an additional one:
$$\pi(\eta \mid \tau, I, \gamma, \alpha) \propto \pi(\eta) \cdot \prod_{k=1}^{m} \pi(\gamma_k \mid \eta). \qquad (30)$$
We may specify $\pi(\gamma \mid \eta)$ to be an exponential distribution and let $\pi(\eta)$ be a proper conjugate prior to make (30) easy to sample from. More details for PAMAHB with $\pi(\gamma \mid \eta)$ specified as an exponential distribution are provided in Supplementary Material.
Our simulation studies suggest that the practical performances of PAMAB and PAMAHB are very similar when $n$ and $m$ are reasonably large (see Supplementary Material for details). In contrast, as we will show in Section 5, the MLE-based estimates (e.g., PAMAF) typically produce less accurate results with a shorter computational time compared to PAMAB.
4.3 Extension to Partial Ranking Lists
The proposed Partition-Mallows model can be extended to more general scenarios where partial ranking lists, instead of full ranking lists, are involved in the aggregation. Given the entity set $U$ and a ranking list $\tau$ of the entities in $U' \subseteq U$, we say $\tau$ is a full ranking list if $U' = U$, and a partial ranking list if $U' \subset U$. Suppose $\tau'$ is a partial ranking list and $\tau$ is a full ranking list of $U$. If the projection of $\tau$ onto the entities of $\tau'$ equals $\tau'$, we say $\tau$ is compatible with $\tau'$, denoted as $\tau \sim \tau'$. Let $\mathcal{C}(\tau') = \{\tau : \tau \sim \tau'\}$ be the set of all full lists that are compatible with $\tau'$. Suppose a partial list $\tau_k'$ is involved in the rank aggregation problem. The probability of $\tau_k'$ can be evaluated by
$$P(\tau_k' \mid I, \gamma_k, \alpha) = \sum_{\tau_k \in \mathcal{C}(\tau_k')} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (31)$$
where $P(\tau_k \mid I, \gamma_k, \alpha)$ is the probability of a compatible full list under the PAMA model. Clearly, the probability in (31) does not have a closed-form representation due to the complicated constraints between $\tau_k$ and $\tau_k'$, and it is very challenging to do statistical inference directly based on this quantity. Fortunately, as rank aggregation with partial lists can be treated as a missing data problem, we can resolve the problem via standard methods for missing data inference.
The Bayesian inference can be accomplished by the classic data augmentation strategy (Tanner and Wong,, 1987) in a similar way as described in Deng et al., (2014), which iterates between imputing the missing data conditional on the observed data given the current parameter values, and updating the parameter values by sampling from the posterior distribution based on the imputed full data. To be specific, we iteratively draw from the following two conditional distributions:
$$\tau_k \sim P(\tau_k \mid \tau_k', I, \gamma, \alpha), \ k = 1, \ldots, m, \qquad \text{and} \qquad (I, \gamma, \alpha) \sim \pi(I, \gamma, \alpha \mid \tau_1, \ldots, \tau_m).$$
To find the MLE of $\theta$ in this more challenging scenario, we can use the Monte Carlo EM algorithm (MCEM, Wei and Tanner, (1990)). Let $\tau^{(1)}, \ldots, \tau^{(S)}$ be independent samples of the full lists drawn from the conditional distribution given the observed partial lists and the current parameter value $\theta^{(t)}$. The E-step involves the calculation of the $Q$-function below:
$$Q(\theta \mid \theta^{(t)}) \approx \frac{1}{S} \sum_{s=1}^{S} \log L(\theta; \tau^{(s)}).$$
In the M-step, we use the Gauss-Seidel method to maximize the above $Q$-function in a similar way as detailed in Supplementary Material.
No matter which method is used, a key step is to draw samples from
$$P(\tau_k \mid \tau_k', \theta) \propto P(\tau_k \mid \theta) \cdot \mathbb{1}(\tau_k \sim \tau_k').$$
To achieve this goal, we start with the full list $\tau_k$ obtained from the previous step of the data augmentation or MCEM algorithm, and conduct several iterations of the following Metropolis step with $P(\tau_k \mid \tau_k', \theta)$ as its target distribution: (a) construct the proposal $\tau_k^{*}$ by randomly selecting two elements in the current full list and swapping them; (b) accept or reject the proposal according to the Metropolis rule, i.e., accept $\tau_k^{*}$ with probability $\min\{1, P(\tau_k^{*} \mid \theta) / P(\tau_k \mid \theta)\}$. Note that the proposed list is automatically rejected if it is incompatible with the observed partial list $\tau_k'$.
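The compatibility check and the constrained Metropolis step can be sketched as below (our own illustration); `log_target` stands for any implementation of the PAMA log-probability of a full list.

```python
import numpy as np

def is_compatible(full, partial):
    """A full list is compatible with a partial list if its projection
    onto the partial list's entities preserves their order."""
    pos = {e: r for r, e in enumerate(full)}
    return all(pos[a] < pos[b] for a, b in zip(partial, partial[1:]))

def metropolis_impute(full, partial, log_target, n_steps=100, rng=None):
    """Random-swap Metropolis over full lists compatible with `partial`."""
    rng = np.random.default_rng(rng)
    cur = list(full)
    for _ in range(n_steps):
        i, j = rng.choice(len(cur), size=2, replace=False)
        prop = list(cur)
        prop[i], prop[j] = prop[j], prop[i]
        if not is_compatible(prop, partial):
            continue                    # automatic rejection
        if np.log(rng.uniform()) < log_target(prop) - log_target(cur):
            cur = prop
    return cur
```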
4.4 Incorporating Covariates in the Analysis
In some applications, covariate information for each ranked entity is available to assist rank aggregation. One of the earliest attempts to incorporate such information in analyzing rank data is perhaps the hidden score model due to Thurstone, (1927), which has become a standard approach and has many extensions. Briefly, these models assume that there is an unobserved score for each entity that is related to the entity-specific covariates under a regression framework, and the observed rankings are determined by these scores plus noises, i.e.,
$$S_{ki} = x_i^{T} \beta + \epsilon_{ki},$$
with ranker $k$ ranking the entities by sorting the scores $S_{k1}, \ldots, S_{kn}$. Here, $\beta$ is the common regression coefficient and $\epsilon_{ki}$ is the noise term. Recent progresses along this line are reviewed by Yu, (2000); Bhowmik and Ghosh, (2017); Li et al., (2021).
Here, we propose to incorporate covariates into the analysis in a different way. Assuming that the covariate vector $x_i$ provides information on the group assignment instead of the detailed ranking of entity $E_i$, we connect $x_i$ and $I_i$, the enhanced indicator of $E_i$, by a logistic regression model:
$$P(I_i > 0 \mid x_i) = \frac{\exp\{\beta_0 + x_i^{T} \beta\}}{1 + \exp\{\beta_0 + x_i^{T} \beta\}}, \qquad (32)$$
where $(\beta_0, \beta)$ are the regression parameters. Letting $X$ be the covariate matrix, we can extend the Partition-Mallows model as
$$P(\tau, I \mid X, \gamma, \alpha, \beta_0, \beta) = \pi(I \mid X, \beta_0, \beta) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (33)$$
where the first term
$$\pi(I \mid X, \beta_0, \beta) \propto \prod_{i=1}^{n} P(I_i > 0 \mid x_i)^{\mathbb{1}(I_i > 0)} \big(1 - P(I_i > 0 \mid x_i)\big)^{\mathbb{1}(I_i = 0)}$$
comes from the logistic regression model (32), and the second term comes from the original Partition-Mallows model. In the extended model, our goal is to infer $(I, \gamma, \alpha, \beta_0, \beta)$ based on $(\tau, X)$. We can achieve both Bayesian inference and the MLE for the extended model in a similar way as described for the Partition-Mallows model. More details are provided in the Supplementary Material.
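A sketch of the logistic-regression term in (33) is given below (our own illustration; the unnormalized product-Bernoulli form follows (32), and all function names are ours).

```python
import numpy as np

def log_prior_I_given_X(I, X, beta0, beta):
    """Logistic-regression contribution to the extended likelihood:
    each entity is relevant (I[i] > 0) with probability
    sigmoid(beta0 + X[i] @ beta), independently across entities
    (up to the normalization over valid indicator vectors)."""
    eta = beta0 + X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    z = (np.asarray(I) > 0).astype(float)
    return float(np.sum(z * np.log(p) + (1 - z) * np.log1p(-p)))

# Toy usage: 5 entities, 2 covariates.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 2))
print(log_prior_I_given_X([1, 0, 2, 0, 0], X,
                          beta0=-0.5, beta=np.array([1.0, -0.3])))
```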
An alternative way to incorporate covariates is to replace the logistic regression model by a naive Bayes model, which models the conditional distribution of $x_i$ given $I_i$ instead of $I_i$ given $x_i$, as follows:
$$P(\tau, X \mid I, \gamma, \alpha, \psi_0, \psi_1) = \prod_{i=1}^{n} f\big(x_i \mid \psi_{\mathbb{1}(I_i > 0)}\big) \cdot \prod_{k=1}^{m} P(\tau_k \mid I, \gamma_k, \alpha), \qquad (34)$$
where $f(\cdot \mid \psi)$ is a pre-specified parametric distribution for the covariates, with parameter $\psi_0$ for entities in the background group and $\psi_1$ for entities in the relevant group. Since the performances of the two approaches are very similar, in the rest of this paper we use the logistic regression strategy to handle covariates due to its convenient form.
5 Simulation Study
5.1 Simulation Settings
We simulated data from two models: (a) the proposed Partition-Mallows model, and (b) the Thurstone hidden score model. In the Partition-Mallows scenario, we specified the true indicator vector so that the first $n_1$ entities belong to $U_R$ (in the order of their indices) and the rest belong to the background group $U_B$, and set the ranker-specific quality parameters $\gamma_k$ so that the first half of the rankers are completely non-informative ($\gamma_k = 0$) while the qualities of the remaining rankers increase with the ranker index.
The maximal quality level and $\alpha$ (defined in (16)) control the quality difference and signal strength of the base rankers in the Partition-Mallows scenario. We fixed $\alpha$ and let the maximal quality level take two options: 2.5 and 1.5, referred to as the strong-signal case and the weak-signal case of the Partition-Mallows scenario, respectively.
In the Thurstone scenario, we used the Thurstone model to generate the rank lists: ranker $k$ assigns each entity $E_i$ a latent Gaussian score with mean $\mu_{ki}$, and ranks the entities by sorting these scores in decreasing order. In this model, the mean scores $\mu_{ki}$ control the quality difference and signal strength of the base rankers. We again specified two sub-cases: a stronger-signal case and a weaker-signal case. Table 1 shows the configuration matrix of the mean scores under the stronger-signal case with $m = 10$ rankers. In both scenarios, the first half of the rankers are completely non-informative, with the other half providing increasingly strong signals.
Table 1: Configuration matrix of mean scores (rows: entities; columns: rankers 1-10). Rankers 1-5 are non-informative, and background entities have zero mean scores for all rankers.

Entity | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10
---|---|---|---|---|---|---|---|---|---|---
E1 | 0 | 0 | 0 | 0 | 0 | 3.7 | 3.9 | 4.1 | 4.3 | 4.5
E2 | 0 | 0 | 0 | 0 | 0 | 3.5 | 3.7 | 3.9 | 4.1 | 4.3
E3 | 0 | 0 | 0 | 0 | 0 | 3.3 | 3.5 | 3.7 | 3.9 | 4.1
E4 | 0 | 0 | 0 | 0 | 0 | 3.1 | 3.3 | 3.5 | 3.7 | 3.9
E5 | 0 | 0 | 0 | 0 | 0 | 2.9 | 3.1 | 3.3 | 3.5 | 3.7
E6 | 0 | 0 | 0 | 0 | 0 | 2.7 | 2.9 | 3.1 | 3.3 | 3.5
E7 | 0 | 0 | 0 | 0 | 0 | 2.5 | 2.7 | 2.9 | 3.1 | 3.3
E8 | 0 | 0 | 0 | 0 | 0 | 2.3 | 2.5 | 2.7 | 2.9 | 3.1
E9 | 0 | 0 | 0 | 0 | 0 | 2.1 | 2.3 | 2.5 | 2.7 | 2.9
E10 | 0 | 0 | 0 | 0 | 0 | 1.9 | 2.1 | 2.3 | 2.5 | 2.7
E11 (background) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
E12 (background) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
For each of the four simulation scenarios (the strong- and weak-signal cases of the Partition-Mallows and Thurstone models), we fixed the true number of relevant entities at $n_1 = 10$, but allowed the number of rankers $m \in \{10, 20\}$ and the total number of entities $n \in \{100, 300\}$ to vary, resulting in a total of 16 simulation settings ($4 \times 2 \times 2$). Under each setting, we simulated 500 independent data sets to evaluate and compare the performances of different rank aggregation methods.
5.2 Methods in Comparison and Performance Measures
In addition to the proposed PAMAB and PAMAF, we considered state-of-the-art methods in several classes, including the Markov chain-based methods MC1, MC2, MC3 in Lin, (2010) and CEMC in Lin and Ding, (2010), the partition-based method BARD in Deng et al., (2014), and the Mallows model-based methods MM and EMM in Li et al., (2020). Classic naive methods based on summary statistics were excluded because they have been shown in previous studies to perform suboptimally, especially in cases where the base rankers are heterogeneous in quality. The Markov chain-based methods, MM, and EMM were implemented in the TopKLists, PerMallows and ExtMallows packages in R (https://www.r-project.org/), respectively. The code of BARD was provided by its authors.
Let $\rho$ be the underlying true ranking list of all entities, $\rho^1$ be the true relative ranking of the relevant entities, $\hat{\rho}$ be the aggregated ranking obtained from a rank aggregation approach, $\hat{\rho}^1$ be the relative ranking of the relevant entities in $\hat{\rho}$, and $\hat{\rho}_{n_1}$ be the top-$n_1$ list of $\hat{\rho}$. After obtaining the aggregated ranking, we evaluated its performance by two measurements, namely the recovery distance $D_R$ and the coverage $C$, defined as below:
$$D_R = d(\rho^1, \hat{\rho}^1) + n_{mis} \cdot \bar{r}_B, \qquad C = \frac{|U_R \cap \hat{\rho}_{n_1}|}{n_1},$$
where $d(\rho^1, \hat{\rho}^1)$ denotes the Kendall tau distance between $\rho^1$ and $\hat{\rho}^1$, and $n_{mis}$ denotes the number of relevant entities classified as background entities in $\hat{\rho}$. The recovery distance considers the detailed rankings of all relevant entities plus mis-classification distances, while the coverage cares only about the identification of relevant entities without considering the detailed rankings. In the setting of PAMA, $\bar{r}_B$ is the average rank of a background entity, so the recovery distance increases if some relevant entities are mis-classified as background entities. Clearly, we expect a smaller $D_R$ and a larger $C$ for a stronger aggregation approach.
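Under our reading of the two measurements (restricting the Kendall distance to the recovered relevant entities and taking the per-miss penalty equal to the average background rank $(n_1 + 1 + n)/2$ are our assumptions), they can be computed as follows:

```python
import itertools

def kendall_tau(a, b):
    pos_a = {e: r for r, e in enumerate(a)}
    pos_b = {e: r for r, e in enumerate(b)}
    return sum(1 for x, y in itertools.combinations(pos_a, 2)
               if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0)

def coverage(true_relevant, aggregated):
    """Fraction of the true relevant entities recovered in the top-n1."""
    n1 = len(true_relevant)
    return len(set(true_relevant) & set(aggregated[:n1])) / n1

def recovery_distance(true_relevant, aggregated, n):
    """Kendall distance between the true and estimated relative orders of
    the recovered relevant entities, plus a per-miss penalty equal to the
    average background rank (an assumed stand-in for the paper's rule)."""
    n1 = len(true_relevant)
    avg_bg_rank = (n1 + 1 + n) / 2
    top = aggregated[:n1]
    recovered_true_order = [e for e in true_relevant if e in top]
    recovered_est_order = [e for e in top if e in true_relevant]
    misses = n1 - len(recovered_true_order)
    return kendall_tau(recovered_true_order, recovered_est_order) \
        + misses * avg_bg_rank
```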
5.3 Simulation Results
Table 2 summarizes the performances of the nine competing methods in the 16 simulation settings, demonstrating that the proposed PAMAB and PAMAF outperform all the other methods by a significant margin in most settings and that PAMAB uniformly dominates PAMAF. Figure 3 shows the quality parameters $\hat{\gamma}_k$ learned by the Partition-Mallows model in various simulation scenarios, confirming that the proposed methods can effectively capture the quality differences among the rankers. The results for other combinations of $(n, m)$ can be found in Supplementary Material and are consistent with Figure 3.
Table 2: Recovery distances, with coverages shown in brackets on the following row, for the nine competing methods (PAMAF, PAMAB and BARD are partition-type models; EMM and MM are Mallows models; MC1, MC2, MC3 and CEMC are MC-based models).

n | m | PAMAF | PAMAB | BARD | EMM | MM | MC1 | MC2 | MC3 | CEMC
---|---|---|---|---|---|---|---|---|---|---
100 | 10 | 24.5 | 15.2 | 57.1 | 51.7 | 103.2 | 338.4 | 163.1 | 198.6 | 197.8 | |
[0.95] | [0.97] | [0.91] | [0.89] | [0.81] | [0.36] | [0.69] | [0.63] | [0.62] | |||
100 | 20 | 2.6 | 0.3 | 42.1 | 22.8 | 44.2 | 466.6 | 88.9 | 121.2 | 114.7 | |
[0.99] | [1.00] | [0.95] | [0.95] | [0.93] | [0.11] | [0.82] | [0.78] | [0.77] | |||
300 | 10 | 17.4 | 4.0 | 180.0 | 683.3 | 519.2 | 1268.3 | 997.7 | 1075.8 | 1085.7 | |
[0.99] | [1.00] | [0.89] | [0.66] | [0.55] | [0.17] | [0.34] | [0.29] | [0.28] | |||
300 | 20 | 7.1 | 3.2 | 122.3 | 124.4 | 157.1 | 1445.9 | 613.5 | 723.0 | 727.2 | |
[1.00] | [1.00] | [0.93] | [0.92] | [0.90] | [0.05] | [0.60] | [0.53] | [0.52] | |||
100 | 10 | 90.0 | 66.6 | 115.2 | 108.3 | 152.9 | 404.3 | 285.5 | 307.2 | 313.8 | |
[0.82] | [0.86] | [0.77] | [0.77] | [0.70] | [0.24] | [0.47] | [0.43] | [0.41] | |||
100 | 20 | 26.9 | 2.4 | 81.5 | 59.8 | 91.5 | 468.1 | 217.3 | 245.2 | 249.5 | |
[0.94] | [1.00] | [0.85] | [0.87] | [0.82] | [0.11] | [0.60] | [0.55] | [0.53] | |||
300 | 10 | 81.1 | 26.8 | 468.4 | 609.8 | 472.1 | 1388.4 | 1294.7 | 1321.5 | 1328.4 | |
[0.95] | [0.98] | [0.69] | [0.68] | [0.60] | [0.09] | [0.15] | [0.13] | [0.13] | |||
300 | 20 | 77.2 | 3.4 | 313.6 | 267.5 | 337.0 | 1469.0 | 1205.9 | 1251.8 | 1258.9 | |
[0.95] | [1.00] | [0.79] | [0.82] | [0.78] | [0.04] | [0.21] | [0.18] | [0.18] | |||
100 | 10 | 24.9 | 20.6 | 22.9 | 54.9 | 115.9 | 334.7 | 150.9 | 180.3 | 186.0 | |
[0.97] | [0.98] | [0.99] | [0.91] | [0.80] | [0.37] | [0.71] | [0.66] | [0.64] | |||
100 | 20 | 18.7 | 15.6 | 22.8 | 8.7 | 33.4 | 498.8 | 46.7 | 64.1 | 60.8 | |
[0.98] | [0.98] | [1.00] | [1.00] | [0.97] | [0.05] | [0.92] | [0.89] | [0.89] | |||
300 | 10 | 172.0 | 159.8 | 37.9 | 205.5 | 490.6 | 1098.6 | 627.0 | 752.9 | 769.4 | |
[0.89] | [0.90] | [0.99] | [0.87] | [0.68] | [0.28] | [0.59] | [0.50] | [0.49] | |||
300 | 20 | 7.4 | 7.0 | 22.7 | 11.4 | 114.1 | 1402.6 | 237.8 | 319.7 | 322.3 | |
[1.00] | [1.00] | [1.00] | [1.00] | [0.94] | [0.08] | [0.84] | [0.79] | [0.79] | |||
100 | 10 | 92.6 | 74.0 | 68.7 | 123.7 | 162.3 | 382.4 | 228.2 | 250.2 | 256.6 | |
[0.83] | [0.86] | [0.88] | [0.77] | [0.70] | [0.27] | [0.56] | [0.52] | [0.50] | |||
100 | 20 | 24.4 | 20.0 | 22.2 | 12.4 | 38.3 | 500.3 | 87.5 | 103.5 | 102.9 | |
[0.96] | [0.97] | [1.00] | [0.99] | [0.95] | [0.04] | [0.83] | [0.80] | [0.80] | |||
300 | 10 | 319.1 | 463.8 | 245.6 | 516.9 | 683.5 | 1267.9 | 998.0 | 1076.0 | 1085.5 | |
[0.79] | [0.69] | [0.84] | [0.66] | [0.55] | [0.17] | [0.34] | [0.29] | [0.28] | |||
300 | 20 | 8.7 | 8.0 | 23.2 | 30.3 | 155.5 | 1430.7 | 437.6 | 516.2 | 523.3 | |
[1.00] | [1.00] | [1.00] | [0.99] | [0.91] | [0.06] | [0.71] | [0.66] | [0.65]

[Figure 3: estimated quality parameters $\hat{\gamma}_k$ of the rankers in various simulation scenarios.]

Figure 4 (a) shows the boxplots of the recovery distances and coverages of the nine competing methods in the four simulation scenarios under a representative $(n, m)$ combination. The five methods on the left outperform the other four methods by a significant gap, and the PAMA-based methods generally perform the best. Figure 4 (b) confirms that the methods based on the Partition-Mallows model enjoy the same capability as BARD in detecting quality differences between informative and non-informative rankers. However, while both BARD and PAMA can further discern quality differences among the informative rankers, EMM fails this more subtle task. Similar figures for other combinations of $(n, m)$ are provided in the Supplementary Material, highlighting results consistent with Figure 4.

[Figure 4: (a) boxplots of recovery distances and coverages; (b) estimated ranker qualities, for the nine competing methods.]
5.4 Robustness to the Specification of $n_1$
We need to specify $n_1$, the number of relevant entities, when applying PAMAB or PAMAF. In many practical problems, however, there may not be strong prior information on $n_1$, and there may not even be a clear distinction between relevant and background entities. To examine the robustness of the algorithm with respect to the specification of $n_1$, we design a simulation setting that mimics this no-clear-cut scenario and investigate how the performance of PAMA is affected by the specification of $n_1$. The new setting follows the same data-generating framework as in Section 5.1, except for how the signal strength is configured across entities: different from the earlier scenarios, where the signal jumps from 0 to a positive number at the boundary between background and relevant entities, here the signal varies smoothly as a function of the entity index for each informative ranker. In such cases, the concept of “relevant” entities is not well-defined.
We simulate 500 independent data sets from this setting with $n = 100$ and $m = 10$. For each data set, we try different specifications of $n_1$ ranging from 10 to 50 and compare PAMA to several competing methods based on their performance in recovering the top-$n_1$ list, which is still well-defined by the simulation design. The results summarized in Table 3 show that no matter which $n_1$ is specified, the partition-type models consistently outperform all the competing methods in terms of a lower recovery distance from the true top-$n_1$ list. Figure 5 illustrates in detail the average aggregated rankings of the top-10 entities by PAMA as $n_1$ increases, suggesting that PAMA is able to figure out the correct rankings of the top entities effectively. These results give us confidence that PAMA is robust to the misspecification of $n_1$.
Table 3: Recovery distances, with coverages shown in brackets on the following row, under different specifications of $n_1$ ($n = 100$, $m = 10$).

n | m | $n_1$ | PAMAF | PAMAB | BARD | EMM | MM | MC1 | MC2 | MC3 | CEMC
---|---|---|---|---|---|---|---|---|---|---|---
100 | 10 | 10 | 44.8 | 34.6 | 42.6 | 61.5 | 227.7 | 423.8 | 45.6 | 199.1 | 241.3 |
[0.90] | [0.93] | [0.96] | [0.88] | [0.58] | [0.20] | [0.92] | [0.63] | [0.54] | |||
100 | 10 | 20 | 39.2 | 33.9 | 94.2 | 107.0 | 308.6 | 764.6 | 52.3 | 268.9 | 372.9 |
[0.95] | [0.96] | [0.99] | [0.90] | [0.75] | [0.33] | [0.96] | [0.78] | [0.67] | |||
100 | 10 | 30 | 27.5 | 29.2 | 207.4 | 126.2 | 360.6 | 1040.2 | 67.8 | 325.4 | 445.0 |
[0.98] | [0.98] | [0.98] | [0.93] | [0.83] | [0.44] | [0.96] | [0.84] | [0.77] | |||
100 | 10 | 40 | 16.0 | 17.4 | 363.9 | 131.6 | 408.1 | 1274.1 | 83.1 | 382.5 | 486.9 |
[0.99] | [0.99] | [0.98] | [0.95] | [0.87] | [0.54] | [0.97] | [0.87] | [0.83] | |||
100 | 10 | 50 | 8.8 | 9.3 | 565.3 | 134.6 | 452.2 | 1484.2 | 109.2 | 446.4 | 524.9 |
[1.00] | [1.00] | [0.99] | [0.97] | [0.90] | [0.62] | [0.96] | [0.89] | [0.88]

[Figure 5: average aggregated rankings of the top-10 entities by PAMA as the specified $n_1$ increases.]

Noticeably, although PAMA and BARD achieve comparable coverage as shown in Table 3, PAMA dominates BARD uniformly in terms of a much smaller recovery distance in all cases, suggesting that PAMA is capable of figuring out the detailed ranking of relevant entities that is missed by BARD. In fact, since BARD relies only on the posterior probability of each entity being relevant to rank entities, in cases where the signal distinguishing the relevant and background entities is strong, many of these probabilities are very close to 1, resulting in a nearly “random” ranking among the top relevant entities. Theoretically, if all relevant entities are recognized correctly but ranked randomly, the corresponding recovery distance would increase with $n_1$ on the order of $n_1^2$, which matches well with the increasing trend of the recovery distance of BARD shown in Table 3.
We also tested the model's performance when there is a true $n_1$ but it is mis-specified in our algorithm. We varied the specified $n_1$ around the truth for the setting with $n = 100$ and $m = 10$, where the true $n_1 = 10$ (the first ten entities). Figure 6 shows boxplots of the aggregated ranks for each mis-specified case. For visualization purposes, we only show the boxplots for the top entities; the other entities exhibit similar patterns. The figure shows a robust behavior of PAMAB as we vary the specification of $n_1$. It also shows that the results are slightly better if we specify an $n_1$ that is moderately larger than its true value. Consistent results for other mis-specified cases can be found in the Supplementary Material.

[Figure 6: boxplots of aggregated ranks of the top entities under mis-specified $n_1$.]
6 Real Data Applications
6.1 Aggregating Rankings of NBA Teams
We applied PAMAB to the NBA-team data analyzed in Deng et al., (2014), and compared it to competing methods in the literature. The NBA-team data set contains 34 “predictive” power rankings of the 30 NBA teams in the 2011-2012 season. The 34 “predictive” rankings were obtained from 6 professional rankers (sports websites) and 28 amateur rankers (college students), and the data quality varies significantly across rankers. More details of the dataset can be found in Table 1 of Deng et al., (2014).

[Figure 7: (a) posterior boxplots of the ranker quality parameters; (b) posterior distributions of the aggregated power rankings of the NBA teams; (c) log-likelihood trace plot.]

Figure 7 displays the results obtained by PAMAB (the partial-list version with $n_1$ specified as 16). Figure 7 (a) shows the posterior distributions, as boxplots, of the quality parameter of each involved ranker. Figure 7 (b) shows the posterior distribution of the aggregated power ranking of each NBA team. A posterior sample reporting “0” for the rank of an entity means that the entity is a background one; for visualization purposes we replace “0” by the average background rank. The final set of 16 playoff teams is shown in blue, while the champion of that season (i.e., the Heat) is shown in red. Figure 7 (c) shows the trace plot of the log-likelihood of the PAMA model along the MCMC iterations. Comparing the results to Figure 8 of Deng et al., (2014), we observe the following facts: (1) PAMAB can successfully discover the quality differences of rankers, as BARD does; (2) PAMAB not only picks up the relevant entities effectively like BARD, but also ranks the discovered relevant entities reasonably well, which cannot be achieved by BARD; (3) PAMAB converges quickly in this application.
We also applied the other methods, including MM, EMM and the Markov chain-based methods, to this data set. We found that none of these methods could discern the quality differences of rankers as successfully as PAMA and BARD. Moreover, using the team ranking at the end of the regular season as the surrogate true power ranking of these NBA teams in the season, we found that PAMA also outperformed BARD and EMM by reporting an aggregated ranking list that is the closest to the truth. Table 4 provides the detailed aggregated ranking lists inferred by BARD, EMM and PAMA respectively, as well as their coverages of the playoff teams and Kendall tau distances from the surrogate truth. Note that the Kendall distance is calculated for the Eastern teams and Western teams separately because the NBA Playoffs proceed in the Eastern conference and the Western conference in parallel until the NBA Finals, in which the two conference champions compete for the NBA title, making it difficult to validate the rankings between Eastern and Western teams.
Ranking | Surrogate truth | BARD | EMM | PAMA
Eastern | Western | Eastern | Western | Eastern | Western | Eastern | Western | |
1 | Bulls | Spurs | Heat | Thunder | Heat | Thunder | Heat | Thunder |
2 | Heat | Thunder | Bulls | Mavericks | Bulls | Mavericks | Bulls | Mavericks
3 | Pacers | Lakers | Celtics | Clippers | Knicks | Clippers | Celtics | Lakers |
4 | Celtics | Grizzlies | Knicks | Lakers | Celtics | Lakers | Knicks | Clippers |
5 | Hawks | Clippers | Magic | Spurs | Magic | Spurs | Magic | Spurs |
6 | Magic | Nuggets | Pacers | Grizzlies | Pacers | Grizzlies | Hawks | Nuggets |
7 | Knicks | Mavericks | 76ers | Nuggets | 76ers | Nuggets | Pacers | Grizzlies |
8 | 76ers | Jazz | Hawks | Jazz∗ | Hawks∗ | Jazz∗ | 76ers | Jazz∗ |
Kendall | - | - | 14.5 | 10.5 | 9 | 10 | 8 | 10 |
Coverage | - |
6.2 Aggregating Rankings of NFL Quarterback Players with the Presence of Covariates
Our next application is targeted at the NFL-player data reported by Li et al., (2021). The NFL-player data contains 13 predictive power rankings of 24 NFL quarterback players. The rankings were produced by 13 experts based on the performance of the 24 NFL players during the first 12 weeks in the 2014 season. The dataset also contains covariates for each player summarizing the performances of these players during the period, including the number of games played (G), pass completion percentage (Pct), the number of passing attempts per game (Att), average yards per carry (Avg), total receiving yards (Yds), average passing yards per attempt (RAvg), the touchdown percentage (TD), the intercept percentage (Int), running attempts per game (RAtt), running yards per attempt (RYds) and the running first down percentage (R1st). Details of the dataset can be found in Table 2 of Li et al., (2021).

[Figure 8: (a) posterior boxplots of the ranker quality parameters; (b) aggregated ranks of the NFL players; (c) log-likelihood traceplot; (d) posterior probabilities of positive covariate effects.]

Here, we set $n_1 = 12$ in order to find which players are above average. We analyzed the NFL-player data with PAMAB (the covariate-assisted version), and the results are summarized in Figure 8: (a) the posterior boxplots of the quality parameter for all the rankers; (b) the barplot of the aggregated ranks of all the NFL players in descending order; (c) the traceplot of the log-likelihood of the model; and (d) the barplot of the posterior probabilities that each covariate has a positive effect, with the covariates arranged from left to right by decreasing probability. From Figure 8 (a), we observe that rankers 1, 3, 4 and 5 are generally less reliable than the other rankers. In the study of the same dataset in Li et al., (2021), the authors assumed that the 13 rankers fall into three quality levels, and reported that seven rankers (i.e., 2, 6, 7, 8, 9, 10 and 13) are of a higher quality than the other six (see Figure 7 of Li et al., (2021)). Interestingly, according to Figure 8 (a), the PAMA algorithm suggested exactly the same set of high-quality rankers. Moreover, ranker 2 has the lowest quality among the seven high-quality rankers in both studies. From Figure 8 (b), a consensus ranking list can be obtained; our result is consistent with that of Figure 6 in Li et al., (2021). Figure 8 (d) shows that six covariates are more likely to have positive effects.
Ranking | Gold standard | BARD | EMM | PAMA |
1 | Aaron R. | Andrew L. | Andrew L. | Andrew L. |
2 | Andrew L. | Aaron R. | Aaron R. | Aaron R. |
3 | Ben R. | Tom B. | Tom B. | Tom B. |
4 | Drew B. | Drew B. | Ben R. | Drew B. |
5 | Russell W. | Ben R. | Drew B. | Ben R. |
6 | Matt R. | Ryan T. | Ryan T. | Ryan T. |
7 | Ryan T. | Russell W. | Russell W. | Russell W. |
8 | Tom B. | Philip R.* | Philip R.* | Philip R. |
9 | Eli M. | Eli M.* | Eli M.* | Eli M.* |
10 | Philip R. | Matt R.* | Matt R.* | Matt R.* |
R-distance | - | 35.5 | 32 | 25 |
Coverage | - | 0.7 | 0.7 | 0.8 |
Using the Fantasy points of the players (https://fantasy.nfl.com/research/players) at the end of the 2014 NFL season as the surrogate truth, the recovery distance and coverage of the aggregated rankings by different approaches can be calculated to evaluate their performances. Because the Fantasy points of two top NFL players, Peyton Manning and Tony Romo, are missing for unknown reasons, we excluded them from the analysis and only report results for the top 10 positions instead of the top 12. Table 5 summarizes the results, demonstrating that PAMA outperformed the other two methods.
7 Conclusion and Discussion
The proposed Partition-Mallows model embeds the classic Mallows model into the partition modeling framework developed earlier by Deng et al., (2014), which is analogous to the well-known “spike-and-slab” mixture distribution often employed in Bayesian variable selection. Such a nontrivial “mixture” combines the strengths of both the Mallows model and BARD's partition framework, leading to a stronger rank aggregation method that can not only learn the quality variation of rankers and distinguish relevant entities from background ones effectively, but also provide an accurate ranking estimate of the discovered relevant entities. Compared to other frameworks in the literature for rank aggregation with heterogeneous rankers, the Partition-Mallows model enjoys more accurate results and better interpretability at the price of a moderate increase in computational burden. We also show that the Partition-Mallows framework can easily handle partial lists and incorporate covariates in the analysis.
Throughout this work, we assume that the number of relevant entities $n_1$ is known. This is reasonable in many practical problems where a specific $n_1$ can be readily determined according to research demands. Empirically, we found that the ranking results are insensitive to the choice of $n_1$ within a wide range of values. If needed, we may also determine $n_1$ according to a model selection criterion, such as AIC or BIC.
In the PAMA model, the conditional distribution of the background entities' relative rankings, $P(\tau_k^0 \mid \tau_k^{0|1}, \tau_k^1, I)$, is assumed to be uniform. If the detailed ranking of the background entities is of interest, we can modify this conditional distribution to be the Mallows model or another model. A quality parameter can still be incorporated to control the interaction between relevant entities and background entities. The resulting likelihood function becomes more complicated, but is still tractable.
In practice, the assumption of independent rankers may be violated. In the literature, a few approaches have been proposed to detect and handle dependent rankers. For example, Deng et al., (2014) proposed a hypothesis-testing-based framework to detect pairs of over-correlated rankers and a hierarchical model to accommodate clusters of dependent rankers; Johnson et al., (2020) adopted an extended Dirichlet process and a similar hierarchical model to achieve simultaneous ranker clustering and rank aggregation inference. Similar ideas can be incorporated in the PAMA model as well to deal with non-independent rankers.
Acknowledgement
We thank Miss Yuchen Wu for helpful discussions at the early stage of this work and the two reviewers for their insightful comments and suggestions that helped us improve the paper greatly. This research is supported in part by the National Natural Science Foundation of China (Grants 11771242 & 11931001), Beijing Academy of Artificial Intelligence (Grant BAAI2019ZD0103), and the National Science Foundation of USA (Grants DMS-1903139 and DMS-1712714). The author Wanchuang Zhu is partially supported by the Australian Research Council (Data Analytics for Resources and Environments, Grant IC190100031).
Supplementary Materials
Appendix A Numerical Support for the Power-Law Model of
Figure 9 presents numerical evidence supporting the power-law model for in different scenarios, in a spirit similar to Figure 2 of Deng et al., (2014): we generate each ranker as the order of , i.e., where is generated from two different distributions and via the following mechanism:
Figure 9 confirms that the log-log plot of versus is nearly linear (when is not too large), suggesting that the power-law model is an acceptable approximation of reality.
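The diagnostic behind Figure 9 is easy to replicate. The sketch below uses an illustrative hidden-score mechanism (the sizes, score gap, and noise level are our own choices, not the paper's settings): it tabulates the relative rank of each background entity among the relevant ones and fits a straight line to the log frequencies against the log ranks; a near-linear fit supports the power-law approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rel, reps = 100, 10, 2000         # entities, relevant entities, replications
mu = np.r_[np.full(n_rel, 2.0), np.zeros(n - n_rel)]  # relevant scores are higher

counts = np.zeros(n_rel + 1)           # relative ranks take values 1..n_rel+1
for _ in range(reps):
    scores = mu + rng.normal(size=n)
    ranks = np.argsort(np.argsort(-scores))   # rank of each entity, 0 = top
    rel_positions = np.sort(ranks[:n_rel])    # positions of the relevant entities
    for b in range(n_rel, n):
        # relative rank of background entity b among the relevant entities
        i = np.searchsorted(rel_positions, ranks[b]) + 1
        counts[i - 1] += 1

freq = counts / counts.sum()
pos = np.arange(1, n_rel + 2)
mask = freq > 0                        # guard against log(0) for unseen ranks
slope, intercept = np.polyfit(np.log(pos[mask]), np.log(freq[mask]), 1)
print("fitted log-log slope:", round(slope, 2))
```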
[Figure 9]
Appendix B Numerical Evidence for the Assumption
In this section, we provide numerical evidence to support the assumption by showing that it approximates reality reasonably well in more general settings. In the literature, the Thurstone hidden score model is widely used to generate ranked data. Specifying , , and , we explored two simulation settings based on the hidden score model, namely and , in which we assume
and specify in the setting
and in the setting
where stands for a random number drawn from the Uniform distribution on the interval . Clearly, the quality of ranker increases monotonically with the ranker index in both settings, and there is a clear gap in mean scores between background and relevant entities. The two hyper-parameters control the signal strength and ranker heterogeneity in the simulated data.
For each data set simulated from the above settings, given the true ranking list , we can always fit a Mallows model to the rankings of relevant entities in each to get an estimated , and a power-law model to the relative rankings of the background entities among the relevant ones in each to get an estimated , leading to a set of pairwise estimates . Figure 10 provides a graphical illustration of the estimated parameters for typical simulated datasets from the setting with different specifications of , and Figure 11 shows the counterpart for the setting. From the figures, we can clearly see a strong linear trend between and in all cases, suggesting that the presumed assumption approximates reality very well.
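As a companion sketch, the Mallows parameter of a single ranker with a known central ranking can be estimated by solving the score equation that matches the observed mean Kendall distance to its model expectation. The code below assumes the standard exponential parametrization of the Mallows model and the Kendall-distance decomposition of Fligner and Verducci, (1986); the paper's own quality parametrization may differ up to a reparametrization.

```python
import numpy as np
from scipy.optimize import brentq

def expected_kendall(theta, k):
    """E[d] under the Mallows model P(pi) ~ exp(-theta * d(pi, pi0)) on k items,
    using the decomposition of the Kendall distance into independent
    truncated-geometric components (Fligner and Verducci,, 1986)."""
    total = 0.0
    for j in range(1, k):                 # component V_j takes values 0..k-j
        r = np.arange(k - j + 1)
        w = np.exp(-theta * r)
        total += float((r * w).sum() / w.sum())
    return total

def fit_mallows_theta(kendall_dists, k):
    """Solve the score equation: mean observed distance = E_theta[d].

    Assumes the mean distance lies strictly between 0 and k(k-1)/4,
    so that a root exists inside the bracketing interval."""
    dbar = float(np.mean(kendall_dists))
    return brentq(lambda t: expected_kendall(t, k) - dbar, 1e-8, 50.0)
```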
[Figures 10 and 11]
Appendix C Proof of Theorem 1
Proof.
We start with the degenerate special case where . In this case, the parameter vector reduces to a 3-dimensional vector , and the PAMA model defined in Equation (19) reduces to the simpler form defined in Equation (18) with . To prove the identifiability of the degenerate PAMA model, we need to show that for any two proper parameter vectors and from the parameter space , we always have:
Here, we choose to prove the above conclusion by proving its equivalent counterpart:
Apparently, the condition implies two possible scenarios:
Below we discuss the two scenarios separately.
The scenario where . Let
(35)
be the rankings with the smallest and second smallest sampling probabilities in the ranking space given model parameter . According to the likelihood function (18), it is easy to check that the solutions of the optimization problems in (35) depend on only and take the general form below:
where the operator locates the relevant entities defined by to the last positions in the ranking list with their internal order reversed and places the background entities randomly in the other open positions, and the operator swaps the positions of either two adjacent relevant entities defined by or the last relevant entity and an arbitrary background entity in . For instance, assuming that , and , we have:
where stands for a random permutation of the input sequence . Note that although both and admit many variants, all of these variants correspond to the same sampling probability below:
where . Further, define
It can be shown by contradiction that the following two equations cannot hold simultaneously for :
Indeed, if , we would have either or based on the definition of function , which leads to because function is monotonically decreasing with respect to and . This fact indicates that the following two equations cannot hold simultaneously:
which means that the two distributions and are not identical. Therefore, there must exist such that .
The scenario where but . As and are the minima of and , and in this case, we have
Similarly, we have
Therefore, there exists (e.g., or ) such that . Combining the above two scenarios, we conclude that fact (20) holds for .
Next, we prove the more general cases where by mathematical induction: assuming that (20) holds for all , we will prove that it also holds for . Considering that for any , implies:
where and with , the condition in Theorem 1 for suggests that
(36)
Define , , and
For , equation (36) can be expressed alternatively as below:
For any fixed , summing over all equations of the above form for all that are compatible with , we have
Considering that
we have
which indicates that
Thus, we have , and the proof is complete. ∎
Appendix D Proof of Lemma 1
Proof.
The function is given as
(37)
where
As is a constant with respect to , we denote . Thus,
(38)
From Equation (38), we can see that is a continuous function of for , does not oscillate infinitely often, and degenerates to a constant for . Thus, the equation has only finitely many solutions in . Let be the solutions of the equation for and . We have . ∎
Appendix E Proof of Theorem 3
Proof.
Based on the classic theory for MLE consistency (Wald,, 1949), to prove that are consistent, we only need to show that the following regularity conditions hold for the PAMA-H model: (1) the probability measure has the same support for all , (2) the true parameter is an interior point of the parameter space , (3) the log-likelihood is differentiable with respect to and , and (4) the MLE is the unique solution of the score equations and .
It is easy to check that satisfies regularity conditions (1)-(3). Regularity condition (4) is verified by showing that is a concave function with respect to and , respectively. The log-likelihood of PAMA-H is
where , and . According to the properties of the Mallows model, it is easy to check that , and thus is a concave function with respect to . As , preserves the concavity of . Moreover, is a composition of with a logarithm and a summation, both of which preserve concavity. Therefore, is a concave function with respect to . Similarly, given that is a concave function of , it is easy to show that is a concave function with respect to . Therefore, regularity condition (4) is satisfied, and the proof is complete.
∎
Appendix F Details of the Gauss-Seidel Iterative Optimization
Let be the maximizer obtained at cycle . The Gauss-Seidel method works as follows:
1. Update . Define as the partial function of the log-likelihood with respect to given the other parameters. A Newton-like method is adopted to update from to .
2. Update . Define as the partial function of the log-likelihood with respect to given the other parameters. Similarly, a Newton-like method is adopted to update from to .
3. Update . Let denote the log-likelihood as a function of with the other parameters fixed. We randomly select two neighboring entities and swap their rankings to check whether increases.
The procedure starts from a random guess of all the parameters and then repeats Steps 1-3 until the log-likelihood converges. Convergence is declared when the difference in log-likelihood between two consecutive iterations is less than 0.1. The main difficulty of the procedure lies in the update of . In practice, the starting point of can be a quick estimate of , such as one obtained from the Mallows model.
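The following skeleton summarizes the cyclic structure; `update_phi`, `update_gamma` and `update_ranking` stand for the Newton-like steps of Sections F.1-F.2 and the swap-based search of Section F.3, and the parameter names are placeholders rather than the paper's notation.

```python
def gauss_seidel_mle(loglik, update_phi, update_gamma, update_ranking,
                     phi, gamma, ranking, tol=0.1, max_cycles=500):
    """Cyclic (Gauss-Seidel) ascent on the PAMA log-likelihood."""
    prev = loglik(phi, gamma, ranking)
    for _ in range(max_cycles):
        phi = update_phi(phi, gamma, ranking)           # Step 1
        gamma = update_gamma(phi, gamma, ranking)       # Step 2
        ranking = update_ranking(phi, gamma, ranking)   # Step 3
        cur = loglik(phi, gamma, ranking)
        if abs(cur - prev) < tol:   # the 0.1 threshold used in the text
            break
        prev = cur
    return phi, gamma, ranking
```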
In each cycle of the Gauss-Seidel update for finding the MLE, a partial function of the likelihood has to be defined. Suppose the current cycle index is ; the detailed computation of is given below.
F.1 Update
Define . Thus, for ,
(39)
The updating equation is given as
(40)
where is a step-length parameter that can be tuned to control the convergence of , and
(41)
(42)
where
F.2 Update
We define . Thus, for
(43)
The update equation is given as
(44)
where is a step-length parameter that controls the convergence of , and
(45)
(46)
where
F.3 Update
We define . Thus,
(47)
Given the current estimate , a new proposal can be obtained by iteratively swapping neighboring entities in . Note that the entity whose ranking is could be randomly swapped with any background entity. Denote the proposed estimate by . If , assign to ; otherwise, keep generating proposals until or no new proposal can be generated for .
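A minimal sketch of this swap-based search is given below; `loglik_I` plays the role of the partial log-likelihood in (47), and the list encodings of the relevant and background entities are illustrative assumptions.

```python
import random

def update_ranking_by_swaps(ranking, background, loglik_I, max_tries=1000):
    """Greedy local search over rankings: accept a proposed swap only if it
    increases the partial log-likelihood. Assumes `background` is non-empty."""
    ranking, background = list(ranking), list(background)
    cur = loglik_I(ranking)
    for _ in range(max_tries):
        prop, prop_bg = list(ranking), list(background)
        i = random.randrange(len(prop))
        if i < len(prop) - 1:                    # swap two neighboring relevant entities
            prop[i], prop[i + 1] = prop[i + 1], prop[i]
        else:                                    # the lowest-ranked relevant entity may be
            j = random.randrange(len(prop_bg))   # swapped with a random background entity
            prop[i], prop_bg[j] = prop_bg[j], prop[i]
        new = loglik_I(prop)
        if new > cur:
            ranking, background, cur = prop, prop_bg, new
    return ranking, background
```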
Appendix G Details of Inferring PAMA-H
G.1 The Gauss-Seidel Iterative Optimization
Let be the maximizer obtained at cycle . As is treated as missing data, we adopt MCEM (Wei and Tanner,, 1990) to implement the optimization. The E-step defines , where is a sample drawn from , which is defined in (49); the M-step maximizes with respect to to obtain . The Gauss-Seidel method is utilized to carry out this optimization; the detailed procedure is stated below.
Update . With the objective function being , the procedure for updating is the same as in Section F.
Update . With the objective function being , the procedure for updating is the same as in Section F.
Update . Define as the partial function of ; a Newton-like method is adopted to update . Then . Suppose is an exponential distribution ; then . It is straightforward to apply the Newton-like method to obtain .
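Schematically, the MCEM loop has the following structure; `draw_missing`, which samples the latent quantities from the conditional distribution in (49), and `maximize_Q`, which carries out the Gauss-Seidel maximization of the Monte Carlo Q-function, are placeholders rather than the paper's notation.

```python
def mcem(data, theta0, draw_missing, maximize_Q, n_mc=50, n_iter=100):
    """A generic Monte Carlo EM loop (Wei and Tanner,, 1990)."""
    theta = theta0
    for _ in range(n_iter):
        # E-step: Monte Carlo approximation of Q(. | theta) via samples
        # of the missing data drawn from their conditional distribution.
        samples = [draw_missing(data, theta) for _ in range(n_mc)]
        # M-step: maximize the averaged complete-data log-likelihood.
        theta = maximize_Q(data, samples)
    return theta
```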
G.2 The Bayesian Inference
Suppose that is an exponential distribution . More specifically,
Note that here. As the conditional distributions of and are already given in (27) and (28), we only show the conditional distributions of and below:
(48)
(49)
G.3 PAMA-H versus PAMA in Numerical Experiments
We simulate 500 independent data sets with the hierarchical distribution being an exponential distribution. The true parameters are and for the parameter of the exponential distribution. Four different configurations of are considered. The average recovery distances and coverages are displayed in Table 6. Figure 12 compares the estimates of obtained by the different models and frameworks. Both Table 6 and Figure 12 indicate that the PAMA model performs as well as the PAMA-H model.
| Configuration | | | Bayesian Inference | | MLE | |
| --- | --- | --- | --- | --- | --- | --- |
| | | Metric | PAMAHB | PAMAB | PAMAHF | PAMAF |
| 100 | 10 | Recovery | 0.70 (3.83) | 0.98 (6.11) | 19.09 (26.56) | 17.22 (24.80) |
| | | Coverage | 1.00 (0.01) | 1.00 (0.01) | 0.96 (0.05) | 0.97 (0.05) |
| 100 | 20 | Recovery | 1.22 (6.71) | 0.84 (5.09) | 33.76 (29.96) | 32.16 (29.28) |
| | | Coverage | 1.00 (0.01) | 1.00 (0.01) | 0.93 (0.06) | 0.93 (0.06) |
| 200 | 10 | Recovery | 4.15 (24.22) | 3.36 (19.94) | 57.23 (57.02) | 49.08 (56.00) |
| | | Coverage | 1.00 (0.02) | 1.00 (0.02) | 0.94 (0.06) | 0.95 (0.06) |
| 200 | 20 | Recovery | 1.51 (10.86) | 1.51 (10.97) | 92.59 (65.94) | 85.67 (64.35) |
| | | Coverage | 1.00 (0.01) | 1.00 (0.01) | 0.91 (0.07) | 0.91 (0.07) |
[Figure 12]
Appendix H Statistical Inference of the Covariate-Assisted Partition-Mallows Model
H.1 Bayesian Inference
Due to the incorporation of , the full conditional distributions differ from the conditional distributions in Section 4.2. While the full conditional distributions of and remain the same as in Equations (28) and (29), the full conditional distribution of changes to
(50)
In addition, the full conditional distribution of the element of is given as follows,
(51)
The MH algorithm is again adopted to draw the corresponding samples, and posterior point estimates can be calculated accordingly.
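A generic random-walk MH step of the kind used here is sketched below; `log_cond` stands for the logarithm of the relevant full conditional, e.g., (50) or (51), and the proposal scale is illustrative.

```python
import math
import random

def mh_step(x, log_cond, scale=0.1):
    """One Metropolis-Hastings step with a symmetric random-walk proposal."""
    prop = x + random.gauss(0.0, scale)
    log_ratio = log_cond(prop) - log_cond(x)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        return prop   # accept
    return x          # reject
```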
H.2 MLE
Gauss-Seidel iterative optimization is also adopted for the covariate-assisted PAMA model. Let be the maximizer obtained at cycle . The detailed computation of is given below.
H.2.1 Update of
This is the same as in Section F.1.
H.2.2 Update of
This is the same as in Section F.2.
H.2.3 Update of
We define . Thus,
(52)
Given the current estimate , a new proposal can be obtained by iteratively swapping neighboring entities in . Note that the entity whose ranking is could be randomly swapped with any background entity. Denote the proposed estimate by . If , assign to ; otherwise, keep generating proposals until or no new proposal can be generated for .
H.2.4 Update of
Define . Maximizing with respect to is a standard logistic regression problem; therefore, can be obtained by using standard statistical software.
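For illustration, a minimal sketch of this update using scikit-learn is given below; the covariate matrix `X` and the relevance indicators `z` are hypothetical stand-ins for the corresponding quantities in the current cycle.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: entity covariates and the current 0/1 relevance
# indicators produced by the partition at this cycle.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
z = (X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=100) > 0).astype(int)

# An unpenalized fit gives the MLE of the logistic regression coefficients;
# older scikit-learn versions use penalty='none' instead of penalty=None.
clf = LogisticRegression(penalty=None, max_iter=1000).fit(X, z)
beta_hat = np.r_[clf.intercept_, clf.coef_.ravel()]
print(beta_hat)
```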
Appendix I Boxplots of
Figure 13 shows the boxplots of for all settings with all combinations of .
[Figure 13]
Appendix J Simulation Results
Figures 14, 15 and 16 present results from different methods for simulated data under various combinations of and .
[Figures 14, 15 and 16]
Appendix K Robustness of
Figure 17 shows boxplots of estimated for each mis-specified case (5, 12, 15).
[Figure 17]
References
- Aslam and Montague, (2001) Aslam, J. A. and Montague, M. (2001). Models for metasearch. In Proceedings of the 24th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 276–284. ACM.
- Bader, (2011) Bader, M. (2011). The transposition median problem is NP-complete. Theoretical Computer Science, 412(12):1099–1110.
- Bhowmik and Ghosh, (2017) Bhowmik, A. and Ghosh, J. (2017). LETOR methods for unsupervised rank aggregation. In Proceedings of the 26th International Conference on World Wide Web, pages 1331–1340. International World Wide Web Conferences Steering Committee.
- Borda, (1781) Borda, J. C. (1781). Mémoire sur les élections au scrutin. Histoire de l’Académie Royale des Sciences.
- Chen et al., (2016) Chen, J., Long, R., Wang, X., Liu, B., and Chou, K. (2016). dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Scientific Reports, 6(32333).
- Chen et al., (2019) Chen, Y., Fan, J., Ma, C., and Wang, K. (2019). Spectral method and regularized MLE are both optimal for top- ranking. Annals of Statistics, 47(4):2204.
- Chen and Suh, (2015) Chen, Y. and Suh, C. (2015). Spectral MLE: Top- rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380.
- Deconde et al., (2011) Deconde, R. P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., and Etzioni, R. (2011). Combining results of microarray experiments: a rank aggregation approach. Statistical Applications in Genetics & Molecular Biology, 5(1):1544–6115.
- Deng et al., (2014) Deng, K., Han, S., Li, K. J., and Liu, J. S. (2014). Bayesian aggregation of order-based rank data. Journal of the American Statistical Association, 109(507):1023–1039.
- Diaconis, (1988) Diaconis, P. (1988). Group representations in probability and statistics. Lecture Notes-Monograph Series, 11:1–192.
- Diaconis and Graham, (1977) Diaconis, P. and Graham, R. L. (1977). Spearman’s footrule as a measure of disarray. Journal of the Royal Statistical Society Series B (Methodological), 39(2):262–268.
- Dwork et al., (2001) Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613–622. ACM.
- Fagin et al., (2003) Fagin, R., Kumar, R., and Sivakumar, D. (2003). Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 301–312. ACM.
- Fligner and Verducci, (1986) Fligner, M. A. and Verducci, J. S. (1986). Distance based ranking models. Journal of the Royal Statistical Society. Series B (Methodological), 48(3):359–369.
- Freund et al., (2003) Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(Nov):933–969.
- Hastings, (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
- Irurozki et al., (2014) Irurozki, E., Calvo, B., and Lozano, J. A. (2014). PerMallows: An R package for Mallows and generalized Mallows models. Journal of Statistical Software, 71(12):1–30.
- Johnson et al., (2020) Johnson, S. R., Henderson, D. A., and Boys, R. J. (2020). Revealing subgroup structure in ranked data using a Bayesian WAND. Journal of the American Statistical Association, 115(532):1888–1901.
- Li et al., (2020) Li, H., Xu, M., Liu, J. S., and Fan, X. (2020). An extended mallows model for ranked data aggregation. Journal of the American Statistical Association, 115(530):730–746.
- Li et al., (2021) Li, X., Yi, D., and Liu, J. S. (2021). Bayesian analysis of rank data with covariates and heterogeneous rankers. Statistical Science, (In press).
- Lin, (2010) Lin, S. (2010). Space oriented rank-based data integration. Statistical Applications in Genetics & Molecular Biology, 9(1):Article 20.
- Lin and Ding, (2010) Lin, S. and Ding, J. (2010). Integration of ranked lists via Cross Entropy Monte Carlo with applications to mRNA and microRNA studies. Biometrics, 65(1):9–18.
- Linas et al., (2010) Linas, B., Tadas, M., and Francesco, R. (2010). Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 119–126. ACM.
- Liu, (2008) Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.
- Liu et al., (2007) Liu, Y., Liu, T., Qin, T., Ma, Z., and Li, H. (2007). Supervised rank aggregation. In Proceedings of the 16th International Conference on World Wide Web, pages 481–490. ACM.
- Luce, (1959) Luce, R. D. (1959). Individual choice behavior: a theoretical analysis. New York: Wiley.
- Mallows, (1957) Mallows, C. L. (1957). Non-null ranking models. Biometrika, 44(1/2):114–130.
- Plackett, (1975) Plackett, R. L. (1975). The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202.
- Porello and Endriss, (2012) Porello, D. and Endriss, U. (2012). Ontology merging as social choice: judgment aggregation under the open world assumption. Journal of Logic and Computation, 24(6):1229–1249.
- Rajkumar and Agarwal, (2014) Rajkumar, A. and Agarwal, S. (2014). A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning, pages 118–126.
- Renda and Straccia, (2003) Renda, M. E. and Straccia, U. (2003). Web metasearch: rank vs. score based rank aggregation methods. In Proceedings of the 2003 ACM Symposium on Applied Computing, pages 841–846. ACM.
- Soufiani et al., (2014) Soufiani, H. A., Parkes, D. C., and Xia, L. (2014). A statistical decision-theoretic framework for social choice. In Advances in Neural Information Processing Systems, pages 3185–3193.
- Tanner and Wong, (1987) Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540.
- Thurstone, (1927) Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4):273–286.
- Wald, (1949) Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics, 20(4):595–601.
- Wei and Tanner, (1990) Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.
- Yang, (2018) Yang, K. H. (2018). Chapter 7 - Stepping through finite element analysis. In Yang, K.-H., editor, Basic Finite Element Method as Applied to Injury Biomechanics, pages 281–308. Academic Press.
- Young, (1988) Young, H. P. (1988). Condorcet’s theory of voting. American Political Science Review, 82(4):1231–1244.
- Young and Levenglick, (1978) Young, H. P. and Levenglick, A. (1978). A consistent extension of Condorcet’s election principle. SIAM Journal on Applied Mathematics, 35(2):285–300.
- Yu, (2000) Yu, P. L. H. (2000). Bayesian analysis of order-statistics models for ranking data. Psychometrika, 65(3):281–299.