
Partition-Mallows Model and Its Inference
for Rank Aggregation

Wanchuang Zhu, Center for Statistical Science & Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
Yingkai Jiang, Yau Mathematical Sciences Center & Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
Jun S. Liu, Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Ke Deng, Center for Statistical Science & Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
Abstract

Learning how to aggregate ranking lists has been an active research area for many years and its advances have played a vital role in many applications ranging from bioinformatics to internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entities into two disjoint groups, i.e., relevant and irrelevant/background ones, and incorporating the Mallows model for the relative ranking of relevant entities, we propose a framework for rank aggregation that can not only distinguish quality differences among the rankers but also provide the detailed ranking information for relevant entities. Theoretical properties of the proposed approach are established, and its advantages over existing approaches are demonstrated via simulation studies and real-data applications. Extensions of the proposed method to handle partial ranking lists and conduct covariate-assisted rank aggregation are also discussed.


Keywords: Meta-analysis; Heterogeneous rankers; Mallows model; Partial ranking lists; Covariate-assisted aggregation.

1 Introduction

Rank data arise naturally in many fields, such as web searching (Renda and Straccia,, 2003), design of recommendation systems (Linas et al.,, 2010) and genomics (Bader,, 2011). Many probabilistic models have been proposed for analyzing this type of data, among which the Thurstone model (Thurstone,, 1927), the Mallows model (Mallows,, 1957) and the Plackett-Luce model (Luce,, 1959; Plackett,, 1975) are the most well-known representatives. The Thurstone model assumes that each entity possesses a hidden score and all the scores come from a joint probability distribution. The Mallows model is a location model defined on the permutation space of ordered entities, in which the probability mass of a permuted order is an exponential function of its distance from the true order. The Plackett-Luce model assumes that the preference of entity EiE_{i} is associated with a weight wiw_{i}, and describes a recursive procedure for generating a random ranking list: entities are picked one by one with the probability proportional to their weights in a sequential fashion without replacement, and ranked based on their order of being selected.
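To make the Plackett-Luce generative procedure concrete, the following minimal Python sketch draws one ranking list by sequential sampling without replacement; the weights and entity labels are purely illustrative and not taken from any data set.

```python
import random

def sample_plackett_luce(weights, rng=random.Random(0)):
    """Draw one ranking list from a Plackett-Luce model.

    weights[i] is the (positive) preference weight w_i of entity i.
    Entities are picked one by one with probability proportional to their
    weights, without replacement; the selection order gives the ranking.
    """
    remaining = list(range(len(weights)))
    ranking = []
    while remaining:
        total = sum(weights[i] for i in remaining)
        u = rng.random() * total
        acc = 0.0
        for i in remaining:
            acc += weights[i]
            if u <= acc:
                ranking.append(i)
                remaining.remove(i)
                break
    return ranking  # ranking[k] = entity placed at position k + 1

# Entity 0 has the largest weight, so it tends to be ranked first.
print(sample_plackett_luce([5.0, 2.0, 1.0, 0.5]))
```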

Rank aggregation aims to derive a “better” aggregated ranking list τ^\hat{\tau} from multiple ranking lists τ1,τ2,,τm\tau_{1},\tau_{2},\cdots,\tau_{m}. It is a classic problem and has been studied in a variety of contexts for decades. Early applications of rank aggregation can be traced back to the 18th-century France, where the idea of rank aggregation was proposed to solve the problem of political elections (Borda,, 1781). In the past 30 years, efficient rank aggregation algorithms have played important roles in many fields, such as web searching (Renda and Straccia,, 2003), information retrieval (Fagin et al.,, 2003), design of recommendation systems (Linas et al.,, 2010), social choice studies (Porello and Endriss,, 2012; Soufiani et al.,, 2014), genomics (Bader,, 2011) and bioinformatics (Lin and Ding,, 2010; Chen et al.,, 2016).

Some popular approaches for rank aggregation are based on summary statistics. These methods simply calculate a summary statistic, such as the mean, median or geometric mean, for each entity EiE_{i} based on its rankings across the different ranking lists, and obtain the aggregated ranking list by sorting these summary statistics. Optimization-based methods obtain the aggregated ranking by minimizing a user-defined objective function, i.e., τ^=argminτ1mi=1md(τ,τi)\hat{\tau}=\arg\min\limits_{\tau}\dfrac{1}{m}\sum\limits_{i=1}^{m}d\left(\tau,\tau_{i}\right), where the distance measure d(,)d(\cdot,\cdot) could be either Spearman’s footrule distance (Diaconis and Graham, 1977) or the Kendall tau distance (Diaconis, 1988). More detailed studies of these optimization-based methods can be found in Young and Levenglick (1978); Young (1988); Dwork et al. (2001).
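The sketch below contrasts these two classes of methods in pure Python: a mean-rank aggregation versus a brute-force minimizer of the average Kendall tau distance. It is an illustration only; the exhaustive search is feasible only for a handful of entities.

```python
from itertools import permutations

def kendall_tau(tau_a, tau_b):
    """Kendall tau distance: number of entity pairs ranked in opposite order.
    tau_a[i] and tau_b[i] are the positions (1 = most preferred) of entity i."""
    n = len(tau_a)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (tau_a[i] - tau_a[j]) * (tau_b[i] - tau_b[j]) < 0)

def mean_rank_aggregate(rank_lists):
    """Summary-statistic aggregation: order entities by their average rank."""
    n = len(rank_lists[0])
    mean_rank = [sum(tau[i] for tau in rank_lists) / len(rank_lists) for i in range(n)]
    order = sorted(range(n), key=lambda i: mean_rank[i])
    return [order.index(i) + 1 for i in range(n)]   # aggregated position of entity i

def kemeny_aggregate(rank_lists):
    """Optimization-based aggregation: brute-force minimizer of the average
    Kendall tau distance over all possible rankings."""
    n = len(rank_lists[0])
    best, best_cost = None, float("inf")
    for perm in permutations(range(1, n + 1)):      # perm[i] = position of entity i
        cost = sum(kendall_tau(perm, tau) for tau in rank_lists) / len(rank_lists)
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best

lists = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
print(mean_rank_aggregate(lists), kemeny_aggregate(lists))
```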

In the early 2000s, a novel class of Markov chain-based methods was proposed (Dwork et al., 2001; Lin and Ding, 2010; Lin, 2010; Deconde et al., 2011), which first uses the observed ranking lists to construct a probabilistic transition matrix among the entities and then ranks the entities by the magnitudes of their equilibrium probabilities under the resulting Markov chain. The boosting-based method RankBoost (Freund et al., 2003) employs a feedback function Φ(i,j)\Phi(i,j) to construct the final ranking, where Φ(i,j)>0\Phi(i,j)>0 (or 0\leq 0) indicates that entity EiE_{i} is (or is not) preferred to entity EjE_{j}. Some statistical methods utilize the aforementioned probabilistic models (such as the Thurstone model) and derive the maximum likelihood estimate (MLE) of the final ranking. More recently, researchers have begun to pay attention to rank aggregation methods for pairwise comparison data (Rajkumar and Agarwal, 2014; Chen and Suh, 2015; Chen et al., 2019).

We note that all the aforementioned methods assume that the rankers of interest are equally reliable. In practice, however, it is very common that some rankers are more reliable than others, whereas some are nearly non-informative and may be regarded as “spam rankers”. Such differences in rankers’ qualities, if ignored in the analysis, may significantly corrupt the rank aggregation and lead to seriously misleading results. To the best of our knowledge, the earliest effort to address this critical issue can be traced to Aslam and Montague (2001), who derived an aggregated ranking list by calculating a weighted summation of the observed ranking lists, known as the Borda Fuse. Lin and Ding (2010) extended the objective function of Dwork et al. (2001) to a weighted version. Independently, Liu et al. (2007) proposed a supervised rank aggregation that determines the weights of the rankers by training on external data. Although assigning weights to rankers is an intuitive and simple way to handle quality differences, how to determine these weights in a principled way remained an unsolved problem in these works.

Recently, Deng et al. (2014) proposed BARD, a Bayesian approach to deal with quality differences among independent rankers without the need for external information. BARD introduces a partition model, which assumes that all involved entities can be partitioned into two groups: the relevant ones and the background ones. A rationale for the approach is that, in many applications, distinguishing relevant entities from background ones takes priority over the construction of a final ranking of all entities. Under this setting, BARD decomposes the information in a ranking list into three components: (i) the relative rankings of all background entities, which are assumed to be uniform; (ii) the relative ranking of each relevant entity among all background ones, which takes the form of a truncated power-law; and (iii) the relative rankings of all relevant entities, which are again uniform. The ranker-specific parameter of the truncated power-law distribution naturally serves as a quality measure for each ranker, as a higher-quality ranker corresponds to a more concentrated truncated power-law distribution.

Li et al. (2020) proposed a stage-wise data generation process based on an extended Mallows model (EMM) introduced by Fligner and Verducci (1986). EMM assumes that each entity comes from a two-component mixture model involving a uniform distribution for non-informative entities, a modified Mallows model for informative entities, and a ranker-specific proportion parameter. Li et al. (2021) followed the Thurstone model framework to deal with available covariates for the entities as well as different qualities of the rankers. In their model, each entity is associated with a Gaussian-distributed latent score and a ranking list is determined by the ranking of these scores. The quality of each ranker is determined by the standard deviation parameter in the Gaussian model, with a larger standard deviation indicating a poorer-quality ranker.

Although these recent papers have proposed different ways of learning the quality variation among rankers, they all suffer from some limitations. The BARD method (Deng et al., 2014) simplifies the problem by assuming that all relevant entities are exchangeable. In many applications, however, the observed ranking lists often carry strong ordering information for the relevant entities, and simply labeling these entities as “relevant” without considering their relative rankings loses too much information and oversimplifies the problem. The extended Mallows model of Li et al. (2020) does not explicitly measure quality differences. Although they mentioned that some of their model parameters can indicate the rankers’ qualities, it is not clear how to properly combine multiple indicators into an easily interpretable quality measure. The learning framework of Li et al. (2021), based on Gaussian latent variables, appears to be more suitable for incorporating covariates than for handling heterogeneous rankers.

In this paper, we propose a Partition-Mallows model (PAMA), which combines the partition modeling framework of Deng et al. (2014) with the Mallows model, to accommodate the detailed ordering information among the relevant entities. The new framework can not only quantify the quality difference of rankers and distinguish relevant entities from background entities like BARD, but also provide an explicit ranking estimate for the relevant entities in rank aggregation. In contrast to the strategy of imposing the Mallows model on the full ranking lists, which tends to be sensitive to noise among low-ranked entities, the combination of the partition and Mallows models allows us to focus on highly ranked entities, which typically carry the high-quality signals in the data, and is thus more robust. Both simulation studies and real data applications show that the proposed approach is superior to existing methods, e.g., BARD and EMM, for a large class of rank aggregation problems.

The rest of this paper is organized as follows. A brief review of BARD and the Mallows model is presented in Section 2 as preliminaries. The proposed PAMA model is described in Section 3, with some key theoretical properties established. Statistical inference of the PAMA model, including Bayesian inference and the pursuit of the MLE, is detailed in Section 4. The performance of PAMA is evaluated and compared to existing methods via simulations in Section 5. Two real data applications are presented in Section 6 to demonstrate the strength of the PAMA model in practice. Finally, we conclude the article with a short discussion in Section 7.

2 Notations and Preliminaries

Let U={E1,E2,,En}U=\left\{E_{1},E_{2},\cdots,E_{n}\right\} be the set of entities to be ranked. We use “EiEjE_{i}\preceq E_{j}” to represent that entity EiE_{i} is preferred to entity EjE_{j} in a ranking list τ\tau, and denote the position of entity EiE_{i} in τ\tau by τ(i)\tau(i). Note that more preferred entities always have smaller rank values. Our goal is to aggregate mm observed ranking lists, τ1,,τm\tau_{1},\ldots,\tau_{m}, presumably constructed independently by mm rankers, into one consensus ranking list that is expected to be “better” than each individual list.

2.1 BARD and Its Partition Model

The partition model in BARD (Deng et al.,, 2014) assumes that UU can be partitioned into two non-overlapping subsets: U=URUBU=U_{R}\cup U_{B}, with URU_{R} representing the set of relevant entities and UBU_{B} for the background ones. Let I={Ii}iUI=\{I_{i}\}_{i\in U} be the vector of group indicators, where Ii=𝕀(EiUR)I_{i}=\mathbb{I}(E_{i}\in U_{R}) and 𝕀()\mathbb{I}(\cdot) is the indicator function. This formulation makes sense in many applications where people are only concerned about a fixed number of top-ranked entities. Under this formulation, the information in a ranking list τk\tau_{k} can be equivalently represented by a triplet (τk0,τk10,τk1)(\tau_{k}^{0},\tau_{k}^{1\mid 0},\tau_{k}^{1}), where τk0\tau_{k}^{0} denotes relative rankings of all background entities, τk10\tau_{k}^{1\mid 0} denotes relative rankings of relevant entities among the background entities and τk1\tau_{k}^{1} denotes relative rankings of all relevant entities.

Deng et al., (2014) suggested a three-component model for τk\tau_{k} by taking advantage of its equivalent decomposition:

P(τkI)=P(τk0,τk10,τk1I)=P(τk0I)×P(τk10I)×P(τk1τk10,I),\displaystyle P(\tau_{k}\mid I)=P(\tau_{k}^{0},\tau_{k}^{1\mid 0},\tau_{k}^{1}\mid I)=P(\tau_{k}^{0}\mid I)\times P(\tau_{k}^{1\mid 0}\mid I)\times P(\tau_{k}^{1}\mid\tau_{k}^{1\mid 0},I), (1)

where both P(τk0I)P(\tau_{k}^{0}\mid I) (relative ranking of the background entities) and P(τk1τk10,I)P(\tau_{k}^{1}\mid\tau_{k}^{1\mid 0},I) (relative ranking of the relevant entities conditional on their set of positions relative to background entities) are uniform, and the relative ranking of a relevant entity EiE_{i} among background ones follows a power-law distribution with parameter γk>0\gamma_{k}>0, i.e.,

P(τk10(i)=tI)=q(tγk,n0)tγk𝕀(1tn0+1),P(\tau_{k}^{1\mid 0}(i)=t\mid I)=q(t\mid\gamma_{k},n_{0})\propto t^{-\gamma_{k}}\cdot\mathbb{I}(1\leq t\leq n_{0}+1),

leading to the following explicit forms for the three terms in equation (1):

P(τk0I)\displaystyle P(\tau_{k}^{0}\mid I) =\displaystyle= 1n0!,\displaystyle\frac{1}{n_{0}!}, (2)
P(τk10I)\displaystyle P(\tau_{k}^{1\mid 0}\mid I) =\displaystyle= iURq(τk10(i)γk,I)=1(Bτk,I)γk×(Cγk,n1)n1,\displaystyle\prod_{i\in U_{R}}q(\tau_{k}^{1\mid 0}(i)\mid\gamma_{k},I)=\frac{1}{(B_{\tau_{k},I})^{\gamma_{k}}\times(C_{\gamma_{k},n_{1}})^{n_{1}}}, (3)
P(τk1τk10,I)\displaystyle P(\tau_{k}^{1}\mid\tau_{k}^{1\mid 0},I) =\displaystyle= 1Aτk,I×𝕀(τk1𝒜UR(τk10)),\displaystyle\frac{1}{A_{\tau_{k},I}}\times\mathbb{I}\big{(}\tau_{k}^{1}\in\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0})\big{)}, (4)

where n1=i=1nIin_{1}=\sum_{i=1}^{n}I_{i} and n0=nn1n_{0}=n-n_{1} are the counts of relevant and background entities respectively, Bτk,I=iURτk10(i)B_{\tau_{k},I}=\prod_{i\in U_{R}}\tau_{k}^{1\mid 0}(i), Cγk,n1=t=1n0+1tγkC_{\gamma_{k},n_{1}}=\sum_{t=1}^{n_{0}+1}t^{-\gamma_{k}} is the normalizing constant of the power-law distribution, 𝒜UR(τk10)\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0}) is the set of τk1\tau_{k}^{1}’s that are compatible with τk10\tau_{k}^{1\mid 0}, and Aτk,I=#{𝒜UR(τk10)}=t=1n0+1(nτk,t10!)A_{\tau_{k},I}=\#\{\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0})\}=\prod_{t=1}^{n_{0}+1}(n_{\tau_{k},t}^{1\mid 0}!) with nτk,t10=iUR𝕀(τk10(i)=t)n_{\tau_{k},t}^{1\mid 0}=\sum_{i\in U_{R}}\mathbb{I}(\tau_{k}^{1\mid 0}(i)=t).

Intuitively, this model assumes that each ranker first randomly places all background entities to generate τk0\tau_{k}^{0}, then “inserts” each relevant entity independently into the list of background entities according to a truncated power-law distribution to generate τk10\tau_{k}^{1\mid 0}, and finally draws τk1\tau_{k}^{1} uniformly from 𝒜UR(τk10)\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0}). In other words, τk0\tau_{k}^{0} serves as a baseline for modeling τk10\tau_{k}^{1\mid 0} and τk1\tau_{k}^{1}. It is easy to see from the model that a more reliable ranker should possess a larger γk\gamma_{k}. With the assumption of independent rankers, we have the full-data likelihood:

P(τ1,,τmI,𝜸)\displaystyle P(\tau_{1},\cdots,\tau_{m}\mid I,\boldsymbol{\gamma}) =\displaystyle= k=1mP(τkI,γk)\displaystyle\prod_{k=1}^{m}P(\tau_{k}\mid I,\gamma_{k}) (5)
=\displaystyle= [(n0)!]m×k=1m𝕀(τk1𝒜UR(τk10))Aτk,I×(Bτk,I)γk×(Cγk,n1)n1,\displaystyle[(n_{0})!]^{-m}\times\prod_{k=1}^{m}\frac{\mathbb{I}\big{(}\tau_{k}^{1}\in\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0})\big{)}}{A_{\tau_{k},I}\times(B_{\tau_{k},I})^{\gamma_{k}}\times\big{(}C_{\gamma_{k},n_{1}}\big{)}^{n_{1}}},

where 𝜸=(γ1,,γm)\boldsymbol{\gamma}=(\gamma_{1},\cdots,\gamma_{m}). A detailed Bayesian inference procedure for (I,𝜸)(I,\boldsymbol{\gamma}) via Markov chain Monte Carlo can be found in Deng et al., (2014).
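To make the generative view of BARD’s partition model concrete, a minimal Python sketch of the three steps described above is given below; the slot bookkeeping and sampling routine are our own illustration under the model assumptions, not the authors’ implementation.

```python
import random

def sample_bard_list(I, gamma, rng=random.Random(1)):
    """Generate one full ranking list from BARD's partition model.

    I[i] = 1 if entity i is relevant, 0 otherwise; gamma > 0 is the
    ranker-specific quality parameter of the truncated power law.
    Returns tau with tau[i] = position (1 = most preferred) of entity i.
    """
    relevant = [i for i, v in enumerate(I) if v == 1]
    background = [i for i, v in enumerate(I) if v == 0]
    n0 = len(background)

    rng.shuffle(background)                     # tau_k^0: uniform over permutations
    # tau_k^{1|0}: insertion slot t in {1, ..., n0 + 1} for each relevant entity,
    # drawn from the truncated power law q(t | gamma) proportional to t^(-gamma)
    weights = [t ** (-gamma) for t in range(1, n0 + 2)]
    total = sum(weights)
    slots = {t: [] for t in range(1, n0 + 2)}
    for i in relevant:
        u, acc, t = rng.random() * total, 0.0, 1
        for t, w in enumerate(weights, start=1):
            acc += w
            if u <= acc:
                break
        slots[t].append(i)

    ordering = []                               # build the full list slot by slot
    for t in range(1, n0 + 2):
        rng.shuffle(slots[t])                   # tau_k^1: uniform within each slot
        ordering.extend(slots[t])               # relevant entities in slot t ...
        if t <= n0:
            ordering.append(background[t - 1])  # ... precede the t-th background entity

    return [ordering.index(i) + 1 for i in range(len(I))]

print(sample_bard_list([1, 1, 1, 0, 0, 0, 0], gamma=1.5))
```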

2.2 The Mallows Model

Mallows, (1957) proposed the following probability model for a ranking list τ\tau of nn entities:

π(ττ0,ϕ)=1Zn(ϕ)exp{ϕd(τ,τ0)},\pi(\tau\mid\tau_{0},\phi)=\dfrac{1}{Z_{n}(\phi)}\cdot\exp\{-\phi\cdot d(\tau,\tau_{0})\}, (6)

where τ0\tau_{0} denotes the true ranking list, ϕ>0\phi>0 characterizes the reliability of τ\tau, d(,)d(\cdot,\cdot) is a distance metric between two ranking lists, and

Zn(ϕ)=τexp{ϕd(τ,τ0)}=t=2n(1etϕ)(1eϕ)n1Z_{n}(\phi)=\sum_{\tau^{\prime}}\exp\{-\phi\cdot d(\tau^{\prime},\tau_{0})\}=\frac{\prod_{t=2}^{n}(1-e^{-t\phi})}{(1-e^{-\phi})^{n-1}} (7)

is the normalizing constant, whose analytic form was derived in Diaconis (1988). Clearly, a larger ϕ\phi means that τ\tau is more stable and concentrates in a tighter neighborhood of τ0\tau_{0}. A common choice of d(,)d(\cdot,\cdot) is the Kendall tau distance.
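For concreteness, the density (6) with the closed-form normalizing constant (7) can be evaluated directly, as in the following Python sketch; the lists and the value of ϕ\phi are illustrative.

```python
import math

def kendall_tau(tau_a, tau_b):
    """Number of discordant entity pairs; tau[i] is the position of entity i."""
    n = len(tau_a)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (tau_a[i] - tau_a[j]) * (tau_b[i] - tau_b[j]) < 0)

def mallows_log_prob(tau, tau0, phi):
    """log pi(tau | tau0, phi) under the Mallows model with the Kendall tau
    distance, using the closed-form normalizing constant Z_n(phi) in (7)."""
    n = len(tau)
    log_Z = sum(math.log(1 - math.exp(-t * phi)) for t in range(2, n + 1)) \
            - (n - 1) * math.log(1 - math.exp(-phi))
    return -phi * kendall_tau(tau, tau0) - log_Z

tau0 = [1, 2, 3, 4, 5]
print(mallows_log_prob([1, 2, 3, 4, 5], tau0, phi=0.5))   # the mode of the model
print(mallows_log_prob([5, 4, 3, 2, 1], tau0, phi=0.5))   # the farthest list, lowest probability
```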

The Mallows model under the Kendall tau distance can also be equivalently described by an alternative multistage model, which selects and positions entities one by one in a sequential fashion, where ϕ\phi serves as a common parameter that governs the probabilistic behavior of each entity in the stochastic process (Mallows,, 1957). Later on, Fligner and Verducci, (1986) extended the Mallows model by allowing ϕ\phi to vary at different stages, i.e., introducing a position-specific parameter ϕi\phi_{i} for each position ii, which leads to a very flexible, in many cases too flexible, framework to model rank data. To stabilize the generalized Mallows model by Fligner and Verducci, (1986), Li et al., (2020) proposed to put a structural constraint on ϕi\phi_{i}s of the form ϕi=ϕ(1αi)\phi_{i}=\phi\cdot(1-\alpha^{i}) with 0<ϕ<10<\phi<1 and 0α10\leq\alpha\leq 1. As a probabilistic model for rank data, the Mallows model enjoys great interpretability, model compactness, inference and computation efficiency. For a comprehensive review of the Mallows model and its extensions, see Irurozki et al., (2014) and Li et al., (2020).

3 The Partition-Mallows Model

The partition model employed by BARD (Deng et al., 2014) tends to oversimplify the problem in scenarios where we care about the detailed rankings of relevant entities. To enhance the partition model of BARD so that it can capture this ordering information, we describe a new Partition-Mallows model in this section.

3.1 The Reverse Partition Model

To combine the partition model with the Mallows model, a naive strategy is to simply replace the uniform model for the relevant entities, i.e., P(τk1τk1|0,I)P(\tau_{k}^{1}\mid\tau_{k}^{1|0},I) in (1), by the Mallows model, which leads to the updated Equation (4) as below:

P(τk1τk10,I)=π(τk1)Zτk,I×𝕀(τk1𝒜UR(τk10)),P(\tau_{k}^{1}\mid\tau_{k}^{1\mid 0},I)=\frac{\pi(\tau_{k}^{1})}{Z_{\tau_{k},I}}\times\mathbb{I}\big{(}\tau_{k}^{1}\in\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0})\big{)},

where π(τk1)\pi(\tau_{k}^{1}) is the Mallows density of τk1\tau_{k}^{1} and Zτk,I=τ𝒜UR(τk10)π(τ)Z_{\tau_{k},I}=\sum_{\tau\in\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0})}\pi(\tau) is the normalizing constant of the Mallows model under the compatibility constraint imposed by 𝒜UR(τk10)\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0}). The calculation of Zτk,IZ_{\tau_{k},I}, however, involves a summation over the whole space 𝒜UR(τk10)\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0}), whose size is Aτk,I=#{𝒜UR(τk10)}=t=1n0+1(nτk,t10!)A_{\tau_{k},I}=\#\{\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0})\}=\prod_{t=1}^{n_{0}+1}(n_{\tau_{k},t}^{1\mid 0}!), and is thus infeasible in most practical cases, rendering such a naive combination of the Mallows model and the partition model impractical.

To avoid the challenging computation caused by the constraints due to 𝒜UR(τk10)\mathcal{A}_{U_{R}}(\tau_{k}^{1\mid 0}), we rewrite the partition model by switching the roles of τk0\tau_{k}^{0} and τk1\tau_{k}^{1} in the model: instead of decomposing τk\tau_{k} as (τk0,τk10,τk1)(\tau_{k}^{0},\tau_{k}^{1\mid 0},\tau_{k}^{1}) conditioning on the group indicators II, we decompose τk\tau_{k} into an alternative triplet (τk1,τk01,τk0)(\tau_{k}^{1},\tau_{k}^{0\mid 1},\tau_{k}^{0}), where τk01\tau_{k}^{0\mid 1} denotes the relative reverse rankings of background entities among the relevant ones. Formally, we note that τk01(i)n1+2τk|{i}UR(i)\tau_{k}^{0\mid 1}(i)\triangleq n_{1}+2-\tau_{k|\{i\}\cup U_{R}}(i) for any iUBi\in U_{B}, where τk|{i}UR(i)\tau_{k|\{i\}\cup U_{R}}(i) denotes the relative ranking of a background entity among the relevant ones. In this reverse partition model, we first order the relevant entities according to a certain distribution and then use them as a reference system to “insert” the background entities. Figure 1 illustrates the equivalence between τk\tau_{k} and its two alternative representations, (τk0,τk10,τk1)(\tau_{k}^{0},\tau_{k}^{1\mid 0},\tau_{k}^{1}) and (τk1,τk01,τk0)(\tau_{k}^{1},\tau_{k}^{0\mid 1},\tau_{k}^{0}).

Given the group indicator vector II, the reverse partition model based on (τk1,τk01,τk0)(\tau_{k}^{1},\tau_{k}^{0\mid 1},\tau_{k}^{0}) gives rise to the following distributional form for τk\tau_{k}:

P(τkI)=P(τk1,τk01,τk0I)=P(τk1I)×P(τk01I)×P(τk0τk01,I),\displaystyle P(\tau_{k}\mid I)=P(\tau_{k}^{1},\tau_{k}^{0\mid 1},\tau_{k}^{0}\mid I)=P(\tau_{k}^{1}\mid I)\times P(\tau_{k}^{0\mid 1}\mid I)\times P(\tau_{k}^{0}\mid\tau_{k}^{0\mid 1},I), (8)

which is analogous to (1) for the original partition model in BARD. Compared to (1), however, the new form (8) enables us to specify an unconstrained marginal distribution for τk1\tau_{k}^{1}. Moreover, due to the symmetry between τk10\tau_{k}^{1\mid 0} and τk01\tau_{k}^{0\mid 1}, it is highly likely that the power-law distribution, which was shown in Deng et al. (2014) to approximate the distribution of τk10(i)\tau_{k}^{1\mid 0}(i) well for each EiURE_{i}\in U_{R}, can also model τk01(i)\tau_{k}^{0\mid 1}(i) for each EiUBE_{i}\in U_{B} reasonably well. Detailed numerical validations are shown in the Supplementary Material.

If we assume that all relevant entities are exchangeable, all background entities are exchangeable, and the relative reverse ranking of a background entity among the relevant entities follows a power-law distribution, we have

P(τk1I)\displaystyle P(\tau_{k}^{1}\mid I) =\displaystyle= 1n1!,\displaystyle\frac{1}{n_{1}!}, (9)
P(τk01I,γk)\displaystyle P(\tau_{k}^{0\mid 1}\mid I,\gamma_{k}) =\displaystyle= iUBP(τk01(i)I,γk)=1(Bτk,I)γk×(Cγk,n1)n0,\displaystyle\prod_{i\in U_{B}}P(\tau_{k}^{0\mid 1}(i)\mid I,\gamma_{k})=\frac{1}{(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(C^{*}_{\gamma_{k},n_{1}})^{n_{0}}}, (10)
P(τk0τk01,I)\displaystyle P(\tau_{k}^{0}\mid\tau_{k}^{0\mid 1},I) =\displaystyle= 1Aτk,I×𝕀(τk0𝒜UB(τk01)),\displaystyle\frac{1}{A^{*}_{\tau_{k},I}}\times\mathbb{I}\big{(}\tau_{k}^{0}\in\mathcal{A}_{U_{B}}(\tau_{k}^{0\mid 1})\big{)}, (11)

where n1n_{1} and n0n_{0} are numbers of relevant and background entities, respectively, Bτk,I=iUBτk01(i)B^{*}_{\tau_{k},I}=\prod_{i\in U_{B}}\tau_{k}^{0\mid 1}(i) is the unnormalized part of the power-law, Cγk,n1=t=1n1+1tγkC^{*}_{\gamma_{k},n_{1}}=\sum_{t=1}^{n_{1}+1}t^{-\gamma_{k}} is the normalizing constant, 𝒜UB(τk01)\mathcal{A}_{U_{B}}(\tau_{k}^{0\mid 1}) is the set of all τk0\tau_{k}^{0} that are compatible with a given τk01\tau_{k}^{0\mid 1}, and Aτk,I=#{𝒜UB(τk01)}=t=1n1+1(nτk,t01!)A^{*}_{\tau_{k},I}=\#\{\mathcal{A}_{U_{B}}(\tau_{k}^{0\mid 1})\}=\prod_{t=1}^{n_{1}+1}(n_{\tau_{k},t}^{0\mid 1}!) with nτk,t01=iUB𝕀(τk01(i)=t)n_{\tau_{k},t}^{0\mid 1}=\sum_{i\in U_{B}}\mathbb{I}(\tau_{k}^{0\mid 1}(i)=t). Apparently, the likelihood of this reverse-partition model shares the same structure as that of the original partition model in BARD, and thus can be inferred in a similar way.

3.2 The Partition-Mallows Model

The reverse partition model introduced in section  3.1 allows us to freely model τk1\tau_{k}^{1} beyond a uniform distribution, which is infeasible for the original partition model in BARD. Here we employ the Mallows model for τk1\tau_{k}^{1} due to its interpretability, compactness and computability. To achieve this, we replace the group indicator vector II in the partition model by a more general indicator vector ={i}i=1n\mathcal{I}=\{\mathcal{I}_{i}\}_{i=1}^{n}, which takes value in Ω\Omega_{\mathcal{I}}, the space of all permutations of {1,,n1,0,,0n0}\{1,\cdots,n_{1},\underbrace{0,\ldots,0}_{n_{0}}\}, with i=0\mathcal{I}_{i}=0 if EiUBE_{i}\in U_{B}, and i=k>0\mathcal{I}_{i}=k>0 if EiURE_{i}\in U_{R} and is ranked at position kk among all relevant entities in URU_{R}. Figure 1 provides an illustrative example of assigning an enhanced indicator vector \mathcal{I} to a universe of 10 entities with n1=5n_{1}=5.

Based on the status of \mathcal{I}, we can define subvectors +\mathcal{I}^{+} and 0\mathcal{I}^{0}, where +\mathcal{I}^{+} stands for the subvector of \mathcal{I} containing all positive elements in \mathcal{I}, and 0\mathcal{I}^{0} for the remaining zero elements in \mathcal{I}. Figure 1 demonstrates the constructions of \mathcal{I}, +\mathcal{I}^{+} and 0\mathcal{I}^{0}, and the equivalence between τk\tau_{k}, (τk0,τk10,τk1)(\tau_{k}^{0},\tau_{k}^{1\mid 0},\tau_{k}^{1}), and (τk1,τk01,τk0)(\tau_{k}^{1},\tau_{k}^{0\mid 1},\tau_{k}^{0}) given \mathcal{I}. Note that, different from the partition model in BARD, in which the number of relevant entities n1n_{1} is allowed to vary around its expected value, the number of relevant entities in the new model is assumed to be fixed and known for conceptual and computational convenience. In other words, we have |UR|=n1|U_{R}|=n_{1} in the new setting.

\mathcal{I}^{+}  \mathcal{I}^{0}  I  \mathcal{I}   U     \tau_{k}  \tau_{k}^{1}  \tau_{k}^{0|1}  \tau_{k}^{0}  \tau_{k}^{1|0}
   1              -            1      1         E1        2          2             -               -              1
   2              -            1      2         E2        6          4             -               -              3
   3              -            1      3         E3        4          3             -               -              2
   4              -            1      4         E4        1          1             -               -              1
   5              -            1      5         E5        7          5             -               -              3
   -              0            0      0         E6        5          -             3               2              -
   -              0            0      0         E7        3          -             4               1              -
   -              0            0      0         E8        8          -             1               3              -
   -              0            0      0         E9        9          -             1               4              -
   -              0            0      0         E10      10          -             1               5              -
Figure 1: An illustrative example of the construction of \mathcal{I}^{+}, \mathcal{I}^{0} and II from the enhanced indicator vector \mathcal{I} (with n1=5n_{1}=5) for a universe of 10 entities, and of the decomposition of a ranking list τk\tau_{k} into the triplets (τk1,τk01,τk0)(\tau_{k}^{1},\tau_{k}^{0\mid 1},\tau_{k}^{0}) and (τk0,τk10,τk1)(\tau_{k}^{0},\tau_{k}^{1\mid 0},\tau_{k}^{1}) given \mathcal{I}.
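The decomposition illustrated in Figure 1 is deterministic given \mathcal{I} and τk\tau_{k}; the short Python helper below (our own illustration, with dictionary outputs keyed by entity index) reproduces the example of Figure 1.

```python
def decompose(tau, script_I):
    """Decompose a full list tau into (tau^1, tau^{0|1}, tau^0, tau^{1|0}) given the
    enhanced indicator vector script_I (0 for background, k > 0 for the k-th relevant
    entity); tau[i] is the position of entity i."""
    relevant = [i for i, v in enumerate(script_I) if v > 0]
    background = [i for i, v in enumerate(script_I) if v == 0]
    n1 = len(relevant)

    def relative_rank(i, group):
        # rank of entity i within {i} union group (i itself contributes nothing)
        return 1 + sum(tau[j] < tau[i] for j in group)

    tau1 = {i: relative_rank(i, relevant) for i in relevant}
    tau0 = {i: relative_rank(i, background) for i in background}
    tau_0_given_1 = {i: n1 + 2 - relative_rank(i, relevant) for i in background}
    tau_1_given_0 = {i: relative_rank(i, background) for i in relevant}
    return tau1, tau_0_given_1, tau0, tau_1_given_0

# Reproduces the example of Figure 1 (entities E1..E10, n1 = 5);
# e.g. tau^1 for E1..E5 comes out as {0: 2, 1: 4, 2: 3, 3: 1, 4: 5}.
tau = [2, 6, 4, 1, 7, 5, 3, 8, 9, 10]
script_I = [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
print(decompose(tau, script_I))
```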

As an analogy of Equations (1) and (8), we have the following decomposition of τk\tau_{k} given the enhanced indicator vector \mathcal{I}:

P(τk)=P(τk1,τk01,τk0)=P(τk1)×P(τk01)×P(τk0τk01,).\displaystyle P(\tau_{k}\mid\mathcal{I})=P(\tau_{k}^{1},\tau_{k}^{0\mid 1},\tau_{k}^{0}\mid\mathcal{I})=P(\tau_{k}^{1}\mid\mathcal{I})\times P(\tau_{k}^{0\mid 1}\mid\mathcal{I})\times P(\tau_{k}^{0}\mid\tau_{k}^{0\mid 1},\mathcal{I}). (12)

Assume that τk1\tau_{k}^{1}\mid\mathcal{I} follows the Mallows model (with parameter ϕk\phi_{k}) centered at +\mathcal{I}^{+}:

P(τk1,ϕk)=P(τk1+,ϕk)=exp{ϕkdτ(τk1,+)}Zn1(ϕk),\displaystyle P(\tau_{k}^{1}\mid\mathcal{I},\phi_{k})=P(\tau_{k}^{1}\mid\mathcal{I}^{+},\phi_{k})=\frac{\exp\{-\phi_{k}\cdot d_{\tau}(\tau_{k}^{1},\mathcal{I}^{+})\}}{Z_{n_{1}}(\phi_{k})}, (13)

where dτ(,)d_{\tau}(\cdot,\cdot) denotes the Kendall tau distance and Zn1(ϕk)Z_{n_{1}}(\phi_{k}) is defined as in (7). Clearly, a larger ϕk\phi_{k} indicates that ranker τk\tau_{k} is of a higher quality, as the distribution is more concentrated at the “true ranking” defined by +\mathcal{I}^{+}. Since the relative rankings of the background entities are of no interest to us, we still assume that they are randomly ranked. Together with the power-law assumption for τk01(i)\tau_{k}^{0\mid 1}(i), we have

P(τk01)\displaystyle P(\tau_{k}^{0\mid 1}\mid\mathcal{I}) =\displaystyle= P(τk01I,γk)=1(Bτk,I)γk×(Cγk,n1)nn1,\displaystyle P(\tau_{k}^{0\mid 1}\mid I,\gamma_{k})=\frac{1}{(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(C^{*}_{\gamma_{k},n_{1}})^{n-n_{1}}}, (14)
P(τk0τk01,)\displaystyle P(\tau_{k}^{0}\mid\tau_{k}^{0\mid 1},\mathcal{I}) =\displaystyle= P(τk0τk01,I)=1Aτk,I×𝕀(τk0𝒜UB(τk01)),\displaystyle P(\tau_{k}^{0}\mid\tau_{k}^{0\mid 1},I)=\frac{1}{A^{*}_{\tau_{k},I}}\times\mathbb{I}\big{(}\tau_{k}^{0}\in\mathcal{A}_{U_{B}}(\tau_{k}^{0\mid 1})\big{)}, (15)

where notations Aτk,IA^{*}_{\tau_{k},I}, Bτk,IB^{*}_{\tau_{k},I} and Cγk,n1C^{*}_{\gamma_{k},n_{1}} are the same as in the reverse-partition model. We call the resulting model the Partition-Mallows model, abbreviated as PAMA.

Different from the partition and reverse partition models, which quantify the quality of ranker τk\tau_{k} with only one parameter γk\gamma_{k} in the power-law distribution, the PAMA model contains two quality parameters ϕk\phi_{k} and γk\gamma_{k}, with the former indicating the ranker’s ability of ranking relevant entities and the latter reflecting the ranker’s ability in differentiating relevant entities from background ones. Intuitively, ϕk\phi_{k} and γk\gamma_{k} reflect the quality of ranker τk\tau_{k} in two different aspects. However, considering that a good ranker is typically strong in both dimensions, it looks quite natural to further simplify the model by assuming

ϕk=ϕγk,\phi_{k}=\phi\cdot\gamma_{k}, (16)

with ϕ>0\phi>0 being a common factor for all rankers. This assumption, while reducing the number of free parameters by almost half, captures the natural positive correlation between ϕk\phi_{k} and γk\gamma_{k} and serves as a first-order (i.e., linear) approximation to the functional relationship between ϕk\phi_{k} and γk\gamma_{k}. A wide range of numerical studies based on simulated data suggests that the linear approximation shown in (16) works reasonably well for many typical rank aggregation scenarios. In contrast, the more flexible model that keeps both ϕk\phi_{k} and γk\gamma_{k} as free parameters suffers from unstable performance from time to time. Detailed evidence supporting assumption (16) can be found in the Supplementary Material.

Plugging (16) into (13), we have a simplified model for τk1\tau_{k}^{1} given \mathcal{I} as follows:

P(τk1,ϕ,γk)=P(τk1+,ϕ,γk)=exp{ϕγkdτ(τk1,+)}Zn1(ϕγk).\displaystyle P(\tau_{k}^{1}\mid\mathcal{I},\phi,\gamma_{k})=P(\tau_{k}^{1}\mid\mathcal{I}^{+},\phi,\gamma_{k})=\frac{\exp\{-\phi\cdot\gamma_{k}\cdot d_{\tau}(\tau_{k}^{1},\mathcal{I}^{+})\}}{Z_{n_{1}}(\phi\cdot\gamma_{k})}. (17)

Combining (14), (15) and (17), we get the full likelihood of τk\tau_{k}:

P(τk,ϕ,γk)\displaystyle P(\tau_{k}\mid\mathcal{I},\phi,\gamma_{k}) =\displaystyle= P(τk1,ϕ,γk)×P(τk0|1,γk)×P(τk0τk0|1,)\displaystyle P(\tau_{k}^{1}\mid\mathcal{I},\phi,\gamma_{k})\times P(\tau_{k}^{0|1}\mid\mathcal{I},\gamma_{k})\times P(\tau_{k}^{0}\mid\tau_{k}^{0|1},\mathcal{I}) (18)
=\displaystyle= 𝕀(τk0𝒜UB(τk01))Aτk,I×(Bτk,I)γk×(Cγk,n1)nn1×(Dτk,)ϕγk×Eϕ,γk,\displaystyle\frac{\mathbb{I}\big{(}\tau_{k}^{0}\in\mathcal{A}_{U_{B}}(\tau_{k}^{0\mid 1})\big{)}}{A^{*}_{\tau_{k},I}\times(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(C^{*}_{\gamma_{k},n_{1}})^{n-n_{1}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}\times E^{*}_{\phi,\gamma_{k}}},

where Dτk,=exp{dτ(τk1,+)}D^{*}_{\tau_{k},\mathcal{I}}=\exp\{d_{\tau}(\tau_{k}^{1},\mathcal{I}^{+})\}, Eϕ,γk=Zn1(ϕγk)=t=2n1(1etϕγk)(1eϕγk)n11E^{*}_{\phi,\gamma_{k}}=Z_{n_{1}}(\phi\cdot\gamma_{k})=\frac{\prod_{t=2}^{n_{1}}(1-e^{-t\phi\gamma_{k}})}{(1-e^{-\phi\gamma_{k}})^{n_{1}-1}}, and Aτk,IA^{*}_{\tau_{k},I}, Bτk,IB^{*}_{\tau_{k},I} and Cγk,n1C^{*}_{\gamma_{k},n_{1}} keep the same meaning as in the reverse partition model. Finally, for the set of observed ranking lists 𝝉=(τ1,,τm)\boldsymbol{\tau}=(\tau_{1},\cdots,\tau_{m}) from mm independent rankers, we have the joint likelihood:

P(𝝉,ϕ,𝜸)\displaystyle P(\boldsymbol{\tau}\mid\mathcal{I},\phi,\boldsymbol{\gamma}) =\displaystyle= k=1mP(τk,ϕ,γk).\displaystyle\prod_{k=1}^{m}P(\tau_{k}\mid\mathcal{I},\phi,\gamma_{k}). (19)
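To summarize the pieces above, the following Python sketch evaluates the single-list log-likelihood implied by (18); variable names mirror the A*, B*, C*, D* and E* terms, ϕγk>0\phi\cdot\gamma_{k}>0 is assumed, and the compatibility indicator is automatically satisfied because all quantities are derived from a full list. It is an illustration of the formulas, not the authors’ implementation.

```python
import math
from collections import Counter

def kendall_tau(pos_a, pos_b):
    n = len(pos_a)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (pos_a[i] - pos_a[j]) * (pos_b[i] - pos_b[j]) < 0)

def pama_log_lik(tau, script_I, phi, gamma):
    """log P(tau_k | script_I, phi, gamma_k) following (18); tau[i] is the
    position of entity i and script_I is the enhanced indicator vector."""
    relevant = [i for i, v in enumerate(script_I) if v > 0]
    background = [i for i, v in enumerate(script_I) if v == 0]
    n, n1 = len(tau), len(relevant)

    # tau_k^{0|1}(i) for background entities: reverse rank among the relevant entities
    t01 = {i: n1 + 2 - (1 + sum(tau[j] < tau[i] for j in relevant)) for i in background}
    # tau_k^1 and I^+ restricted to the relevant entities
    tau1 = {i: 1 + sum(tau[j] < tau[i] for j in relevant) for i in relevant}
    d = kendall_tau([tau1[i] for i in relevant], [script_I[i] for i in relevant])

    log_A = sum(math.lgamma(c + 1) for c in Counter(t01.values()).values())
    log_B = sum(math.log(t) for t in t01.values())
    log_C = math.log(sum(t ** (-gamma) for t in range(1, n1 + 2)))
    log_E = sum(math.log(1 - math.exp(-t * phi * gamma)) for t in range(2, n1 + 1)) \
            - (n1 - 1) * math.log(1 - math.exp(-phi * gamma))
    return -(log_A + gamma * log_B + (n - n1) * log_C + phi * gamma * d + log_E)

tau = [2, 6, 4, 1, 7, 5, 3, 8, 9, 10]            # the ranking list of Figure 1
script_I = [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
print(pama_log_lik(tau, script_I, phi=0.3, gamma=1.2))
```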

3.3 Model Identifiability and Estimation Consistency

Let Ωn\Omega_{n} be the space of all permutations of {1,,n}\{1,\cdots,n\} in which τk\tau_{k} takes value, and let 𝜽=(,ϕ,𝜸){\boldsymbol{\theta}}=(\mathcal{I},\phi,\boldsymbol{\gamma}) be the vector of model parameters. The PAMA model in (19), i.e., P(𝝉𝜽)P(\boldsymbol{\tau}\mid{\boldsymbol{\theta}}), defines a family of probability distributions on Ωnm\Omega_{n}^{m} indexed by parameter 𝜽{\boldsymbol{\theta}} taking values in space 𝚯=Ω×Ωϕ×Ω𝜸\boldsymbol{\Theta}=\Omega_{\mathcal{I}}\times\Omega_{\phi}\times\Omega_{\boldsymbol{\gamma}}, where Ω\Omega_{\mathcal{I}} is the space of all permutations of {1,,n1,𝟎n0}\{1,\cdots,n_{1},{\bf 0}_{n_{0}}\}, Ωϕ=(0,+)\Omega_{\phi}=(0,+\infty) and Ω𝜸=[0,+)m\Omega_{\boldsymbol{\gamma}}=[0,+\infty)^{m}. We show here that the PAMA model defined in (19) is identifiable and the model parameters can be estimated consistently under mild conditions.

Theorem 1.

The PAMA model is identifiable, i.e.,

𝜽1,𝜽2𝚯,ifP(𝝉𝜽1)=P(𝝉𝜽2)for𝝉Ωnm,then𝜽1=𝜽2.\forall\ {\boldsymbol{\theta}}_{1},{\boldsymbol{\theta}}_{2}\in\boldsymbol{\Theta},\ \mbox{if}\ P(\boldsymbol{\tau}\mid{\boldsymbol{\theta}}_{1})=P(\boldsymbol{\tau}\mid{\boldsymbol{\theta}}_{2})\ \mbox{for}\ \forall\ \boldsymbol{\tau}\in\Omega_{n}^{m},\ \mbox{then}\ {\boldsymbol{\theta}}_{1}={\boldsymbol{\theta}}_{2}. (20)
Proof.

See Supplementary Material. ∎

To show that parameters in the PAMA model can be estimated consistently, we will first construct a consistent estimator for the indicator vector \mathcal{I} as mm\rightarrow\infty but with the number of ranked entities nn fixed, and show later that ϕ\phi can also be consistently estimated once \mathcal{I} is given. To this end, we define τ¯(i)=m1k=1mτk(i)\bar{\tau}(i)=m^{-1}\sum_{k=1}^{m}\tau_{k}(i) to be the average rank of entity EiE_{i} across all mm rankers, and assume that the ranker-specific quality parameters γ1,,γm\gamma_{1},\cdots,\gamma_{m} are i.i.d. samples from a non-atomic probability measure F(γ)F(\gamma) defined on [0,)[0,\infty) with a finite first moment (referred to as condition 𝑪γ\boldsymbol{C}_{\gamma} hereinafter). Then, by the strong law of large numbers we have

τ¯(i)=1mk=1mτk(i)𝔼[τ(i)]a.s.withm,\bar{\tau}(i)=\frac{1}{m}\sum_{k=1}^{m}\tau_{k}(i)\rightarrow\mathbb{E}\big{[}\tau(i)\big{]}\ a.s.\ \ \mbox{with}\ m\rightarrow\ \infty, (21)

since {τk(i)}k=1m\{\tau_{k}(i)\}_{k=1}^{m} are i.i.d. random variables with expectation

𝔼[τ(i)]=𝔼[𝔼[τ(i)γ]]=𝔼[τ(i)γ]𝑑F(γ),\mathbb{E}\big{[}\tau(i)\big{]}=\mathbb{E}\Big{[}\mathbb{E}\big{[}\tau(i)\mid\gamma\big{]}\Big{]}=\int\mathbb{E}\big{[}\tau(i)\mid\gamma\big{]}dF(\gamma),

where 𝔼[τ(i)γ]\mathbb{E}\big{[}\tau(i)\mid\gamma\big{]} is the conditional mean of τ(i)\tau(i) given the model parameters (,ϕ,γ)(\mathcal{I},\phi,\gamma), i.e.,

𝔼[τ(i)γ]=t=1ntP(τ(i)=t,ϕ,γ).\mathbb{E}\big{[}\tau(i)\mid\gamma\big{]}=\sum_{t=1}^{n}t\cdot P\big{(}\tau(i)=t\mid\mathcal{I},\phi,\gamma\big{)}.

Clearly, 𝔼[τ(i)]\mathbb{E}\big{[}\tau(i)\big{]} is a function of ϕ\phi given \mathcal{I} and F(γ)F(\gamma). We define ei(ϕ)𝔼[τ(i)]e_{i}(\phi)\triangleq\mathbb{E}\big{[}\tau(i)\big{]} to emphasize 𝔼[τ(i)]\mathbb{E}\big{[}\tau(i)\big{]}’s nature as a continuous function of ϕ\phi. Without loss of generality, we suppose that UR={1,,n1}U_{R}=\{1,\cdots,n_{1}\} and UB={n1+1,,n}U_{B}=\{n_{1}+1,\cdots,n\}, i.e., =(1,,n1,0,,0)\mathcal{I}=(1,\cdots,n_{1},0,\cdots,0), hereinafter. Then, the partition structure and the Mallows model embedded in the PAMA model lead to the following facts:

e1(ϕ)<<en1(ϕ)anden1+1(ϕ)==en(ϕ)=e0,ϕΩϕ.e_{1}(\phi)<\cdots<e_{n_{1}}(\phi)\ \mbox{and}\ e_{n_{1}+1}(\phi)=\cdots=e_{n}(\phi)=e_{0},\ \forall\ \phi\in\Omega_{\phi}. (22)

Note that ei(ϕ)e_{i}(\phi) degenerates to a constant with respect to ϕ\phi (i.e., e0e_{0}) for all i>n1i>n_{1} because parameter ϕ\phi influences only the relative rankings of relevant entities in the Mallows model. The value of e0e_{0} is completely determined by F(γ)F(\gamma). For the BARD model, it is easy to see that e1==en1en1+1==en.e_{1}=\cdots=e_{n_{1}}\leq e_{n_{1}+1}=\cdots=e_{n}.

Figure 2: Average ranks of all the entities with fixed =(1,,n1,0,,0)\mathcal{I}=(1,\cdots,n_{1},0,\cdots,0), n=30n=30, n1=15n_{1}=15, m=100000m=100000 and F(γ)=U(0,2)F(\gamma)=U(0,2). Figures (a), (b) and (c) are the corresponding results for ϕ=0,0.2and 0.4\phi=0,0.2\ \mbox{and }0.4 respectively.

Figure 2 shows some empirical estimates of the ei(ϕ)e_{i}(\phi)’s based on m=100,000m=100,000 independent samples drawn from PAMA models with n=30n=30, n1=15n_{1}=15, and F(γ)=U(0,2)F(\gamma)=U(0,2), but three different ϕ\phi values: (a) ϕ=0\phi=0, which corresponds to the BARD model; (b) ϕ=0.2\phi=0.2; and (c) ϕ=0.5\phi=0.5. One surprise is that in case (c), some relevant entities may have a larger ei(ϕ)e_{i}(\phi) (worse ranking) than the average rank of background entities. Lemma 1 guarantees that for almost all ϕΩϕ\phi\in\Omega_{\phi}, e0e_{0} is different from ei(ϕ)e_{i}(\phi) for i=1,,n1i=1,\cdots,n_{1}. The proof of Lemma 1 can be found in Supplementary Material.

Lemma 1.

For the PAMA model with condition 𝐂γ\boldsymbol{C}_{\gamma}, Ω~ϕΩϕ\exists\ \tilde{\Omega}_{\phi}\subset\Omega_{\phi}, s.t. (ΩϕΩ~ϕ)(\Omega_{\phi}-\tilde{\Omega}_{\phi}) contains only finite elements and

ei(ϕ)e0fori=1,,n1,ϕΩ~ϕ.e_{i}(\phi)\neq e_{0}\ \mbox{for}\ i=1,\cdots,n_{1},\ \forall\ \phi\in\tilde{\Omega}_{\phi}. (23)

The facts demonstrated in (22) and (23) suggest the following three-step strategy to estimate \mathcal{I}: (a) find the subset S0S_{0} of n0=(nn1)n_{0}=(n-n_{1}) entities from UU so that the within-subset variation of the τ¯(i)\bar{\tau}(i)’s is the smallest, i.e.,

S0=argminSU,|S|=n0iS(τ¯(i)τ¯S)2,withτ¯S=n01iSτ¯(i),S_{0}=\operatorname*{arg\,min}_{S\subseteq U,\ |S|=n_{0}}\sum_{i\in S}\big{(}\bar{\tau}(i)-\bar{\tau}_{S}\big{)}^{2},\ \ \mbox{with}\ \bar{\tau}_{S}=n_{0}^{-1}\sum_{i\in S}\bar{\tau}(i), (24)

and let U~B=S0\tilde{U}_{B}=S_{0} be an estimate of UBU_{B}; (b) rank the entities in US0U\setminus S_{0} by τ¯(i)\bar{\tau}(i) increasingly and use the obtained ranking ~+\tilde{\mathcal{I}}^{+} as an estimate of +\mathcal{I}^{+}; (c) combine the above two steps to obtain the estimate ~\tilde{\mathcal{I}} of \mathcal{I}. This can be achieved by defining U~R=UU~B\tilde{U}_{R}=U\setminus\tilde{U}_{B} and ~+=rank({τ¯(i):iU~R}),\tilde{\mathcal{I}}^{+}=rank(\{\bar{\tau}(i):i\in\tilde{U}_{R}\}), and obtain ~=(~1,,~n)\tilde{\mathcal{I}}=(\tilde{\mathcal{I}}_{1},\cdots,\tilde{\mathcal{I}}_{n}), with ~i=~i+𝕀(iU~R).\tilde{\mathcal{I}}_{i}=\tilde{\mathcal{I}}^{+}_{i}\cdot\mathbb{I}(i\in\tilde{U}_{R}).
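For a small number of entities, the three-step strategy can be carried out by exhaustive search; the Python sketch below (illustrative only, with hypothetical input lists) implements steps (a)-(c) directly.

```python
from itertools import combinations

def moment_estimate_I(rank_lists, n1):
    """Three-step moment estimator of the enhanced indicator vector; the subset
    search in step (a) is brute force and only feasible for small n."""
    m, n = len(rank_lists), len(rank_lists[0])
    n0 = n - n1
    mean_rank = [sum(tau[i] for tau in rank_lists) / m for i in range(n)]

    # (a) background set = the n0 entities whose mean ranks are least spread out
    def spread(S):
        mu = sum(mean_rank[i] for i in S) / n0
        return sum((mean_rank[i] - mu) ** 2 for i in S)
    U_B = min(combinations(range(n), n0), key=spread)

    # (b) rank the remaining entities by increasing mean rank
    U_R = [i for i in range(n) if i not in U_B]
    order = sorted(U_R, key=lambda i: mean_rank[i])

    # (c) assemble the estimate of the enhanced indicator vector
    I_hat = [0] * n
    for pos, i in enumerate(order, start=1):
        I_hat[i] = pos
    return I_hat

lists = [[1, 2, 3, 4, 5], [1, 2, 4, 5, 3], [1, 2, 5, 3, 4]]
print(moment_estimate_I(lists, n1=2))   # -> [1, 2, 0, 0, 0]
```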

Note that U~B\tilde{U}_{B} is based on the mean ranks, {τ¯(i)}iU\{\bar{\tau}(i)\}_{i\in U}, thus is clearly a moment estimator. Although this three-step estimation strategy is neither statistically efficient nor computationally feasible (step (a) is NP-hard), it nevertheless serves as a prototype for developing the consistency theory. Theorem 2 guarantees that ~\tilde{\mathcal{I}} is a consistent estimator of \mathcal{I} under mild conditions.

Theorem 2.

For the PAMA model with condition 𝐂γ\boldsymbol{C}_{\gamma} and for almost all ϕΩϕ\phi\in\Omega_{\phi}, the moment estimator ~\tilde{\mathcal{I}} converges to \mathcal{I} with probability 1 as mm goes to infinity.

Proof.

Combining fact (23) in Lemma 1 with fact (21), we have for ϕΩ~ϕ\forall\ \phi\in\tilde{\Omega}_{\phi} that

e1(ϕ)<<en1(ϕ)andei(ϕ)e0fori=1,,n1.e_{1}(\phi)<\cdots<e_{n_{1}}(\phi)\ \mbox{and}\ e_{i}(\phi)\neq e_{0}\ \mbox{for}\ i=1,\cdots,n_{1}.

Moreover, fact (21) tells us that for ϵ,δ>0\forall\ \epsilon,\delta>0, M>0\exists\ M>0 s.t. for m>M\forall\ m>M,

P(|τ¯(i)ei(ϕ)|<δ)1ϵ,i=1,,n,P\big{(}|\bar{\tau}(i)-e_{i}(\phi)|<\delta\big{)}\geq 1-\epsilon,\ i=1,\cdots,n,

it is straightforward to see the conclusion of the theorem. ∎

Theorem 2 tells us that estimating \mathcal{I} is straightforward if the number of independent rankers mm goes to infinity: a simple moment method ignoring the quality difference of rankers can provide us a consistent estimate of \mathcal{I}. In a practical problem where only a finite number of rankers are involved, however, more efficient statistical inference of the PAMA model based on Bayesian or frequentist principles becomes more attractive as effectively utilizing the quality information of different rankers is critical.

With n0n_{0} and n1n_{1} fixed, parameter γk\gamma_{k}, which governs the power-law distribution for the rank list τk\tau_{k}, cannot be estimated consistently. Thus, its distribution F(γ)F(\gamma) cannot be determined nonparametrically even when the number of rank lists mm goes to infinity. We impose a parametric form Fψ(γ)F_{\psi}(\gamma) with ψ\psi as the hyper-parameter and refer to the resulting hierarchical-structured model as PAMA-H, which has the following marginal likelihood of (ϕ,ψ)(\phi,\psi) given \mathcal{I}:

L(ϕ,ψ)=P(𝝉,ϕ,𝜸)𝑑Fψ(𝜸)=k=1mP(τk,ϕ,γk)𝑑Fψ(γk)=k=1mLk(ϕ,ψ).L(\phi,\psi\mid\mathcal{I})=\int P(\boldsymbol{\tau}\mid\mathcal{I},\phi,\boldsymbol{\gamma})dF_{\psi}(\boldsymbol{\gamma})=\prod_{k=1}^{m}\int P(\tau_{k}\mid\mathcal{I},\phi,\gamma_{k})dF_{\psi}(\gamma_{k})=\prod_{k=1}^{m}L_{k}(\phi,\psi\mid\mathcal{I}).

We show in Theorem 3 that the MLE based on the above marginal likelihood is consistent.

Theorem 3.

Under the PAMA-H model, assume that (ϕ,ψ)(\phi,\psi) belongs to the parameter space Ωϕ×Ωψ\Omega_{\phi}\times\Omega_{\psi}, and the true parameter (ϕ0,ψ0)(\phi_{0},\psi_{0}) is an interior point of Ωϕ×Ωψ\Omega_{\phi}\times\Omega_{\psi}. Let (ϕ^,ψ^)(\hat{\phi}_{\mathcal{I}},\hat{\psi}_{\mathcal{I}}) be the maximizer of L(ϕ,ψ)L(\phi,\psi\mid\mathcal{I}). If Fψ(γ)F_{\psi}(\gamma) has a density function fψ(γ)f_{\psi}(\gamma) that is differentiable and concave with respect to ψ\psi, then limm(ϕ^,ψ^)=(ϕ0,ψ0)\lim_{m\rightarrow\infty}(\hat{\phi}_{\mathcal{I}},\hat{\psi}_{\mathcal{I}})=(\phi_{0},\psi_{0}) almost surely.

Proof.

See Supplementary Material. ∎

4 Inference with the Partition-Mallows Model

4.1 Maximum Likelihood Estimation

Under the PAMA model, the MLE of 𝜽=(,ϕ,𝜸){\boldsymbol{\theta}}=(\mathcal{I},\phi,\boldsymbol{\gamma}) is 𝜽^=argmax𝜽l(𝜽)\hat{{\boldsymbol{\theta}}}=\arg\max_{{\boldsymbol{\theta}}}l({\boldsymbol{\theta}}), where

l(𝜽)=logP(τ1,τ2,,τm𝜽)l({\boldsymbol{\theta}})=\log P(\tau_{1},\tau_{2},\cdots,\tau_{m}\mid{\boldsymbol{\theta}}) (25)

is the logarithm of the likelihood function (19). Here, we adopt the Gauss-Seidel iterative method in Yang (2018), also known as backfitting or cyclic coordinate ascent, to implement the optimization. Starting from an initial point 𝜽(0){\boldsymbol{\theta}}^{(0)}, the Gauss-Seidel method iteratively updates one coordinate of 𝜽{\boldsymbol{\theta}} at each step with the other coordinates held fixed at their current values. A Newton-like method is adopted to update ϕ\phi and γk\gamma_{k}. Since \mathcal{I} is a discrete vector, we search for favorable values of \mathcal{I} by swapping two neighboring entities and checking whether g(𝜸(s+1),ϕ(s+1))g(\mathcal{I}\mid\boldsymbol{\gamma}^{(s+1)},\phi^{(s+1)}), the log-likelihood viewed as a function of \mathcal{I} with the other parameters fixed at their current values, increases. More details of the algorithm are provided in the Supplementary Material.

With the MLE 𝜽^=(^,ϕ^,𝜸^)\hat{\boldsymbol{\theta}}=(\hat{\mathcal{I}},\hat{\phi},\hat{\boldsymbol{\gamma}}), we define UR(^)={iU:^i>0}U_{R}(\hat{\mathcal{I}})=\{i\in U:\ \hat{\mathcal{I}}_{i}>0\} and UB(^)={iU:^i=0}U_{B}(\hat{\mathcal{I}})=\{i\in U:\ \hat{\mathcal{I}}_{i}=0\}, and generate the final aggregated ranking list τ^\hat{\tau} based on the rules below: (a) set the top-n1n_{1} list of τ^\hat{\tau} as τ^n1=sort(iUR(^)by^i)\hat{\tau}_{n_{1}}=sort(i\in U_{R}(\hat{\mathcal{I}})\ by\ \hat{\mathcal{I}}_{i}\uparrow), (b) let all entities in UB(^)U_{B}(\hat{\mathcal{I}}) tie for the positions behind. Hereinafter, we refer to this MLE-based rank aggregation procedure under the PAMA model as PAMAF.

For the PAMA-H model, a similar procedure can be applied to find the MLE of 𝜽=(,ϕ,ψ){\boldsymbol{\theta}}=(\mathcal{I},\phi,\psi), with the 𝜸=(γ1,,γm)\boldsymbol{\gamma}=(\gamma_{1},\cdots,\gamma_{m}) being treated as the missing data. With the MLE 𝜽^=(^,ϕ^,ψ^)\hat{\boldsymbol{\theta}}=(\hat{\mathcal{I}},\hat{\phi},\hat{\psi}), we can generate the final aggregated ranking list τ^\hat{\tau} based on ^\hat{\mathcal{I}} in the same way as in PAMA, and evaluate the quality of ranker τk\tau_{k} via the mean or mode of the conditional distribution below:

f(γkτk;^,ϕ^,ψ^)f(γkψ^)P(τk^,ϕ^,γk).f(\gamma_{k}\mid\tau_{k};\hat{\mathcal{I}},\hat{\phi},\hat{\psi})\propto f(\gamma_{k}\mid\hat{\psi})\cdot P(\tau_{k}\mid\hat{\mathcal{I}},\hat{\phi},\gamma_{k}).

In this paper, we refer to the above MLE-based rank aggregation procedure under the PAMA-H model as PAMAHF. The procedure is detailed in the Supplementary Material.

4.2 Bayesian Inference

Since the three model parameters \mathcal{I}, ϕ\phi and 𝜸\boldsymbol{\gamma} encode “orthogonal” information of the PAMA model, it is natural to expect that \mathcal{I}, ϕ\phi and 𝜸\boldsymbol{\gamma} are mutually independent a priori. We thus specify their joint prior distribution as

π(,ϕ,𝜸)=π()π(ϕ)k=1mπ(γk).\pi(\mathcal{I},\phi,\boldsymbol{\gamma})=\pi(\mathcal{I})\cdot\pi(\phi)\cdot\prod_{k=1}^{m}\pi(\gamma_{k}).

Without much loss, we may restrict the range of ϕ\phi and γk\gamma_{k}’s to a closed interval [0,b][0,b] with a large enough bb. In contrast, \mathcal{I} is discrete and takes value in the space Ω\Omega_{\mathcal{I}} of all permutations of {1,,n1,0,,0n0}\{1,\ldots,n_{1},\underbrace{0,\ldots,0}_{n_{0}}\}. It is convenient to specify π()\pi(\mathcal{I}), π(ϕ)\pi(\phi) and π(γk)\pi(\gamma_{k}) as uniform, i.e.,

π()U(Ω),π(ϕ)U[0,b],π(γk)U[0,b].\pi(\mathcal{I})\sim U(\Omega_{\mathcal{I}}),\ \pi(\phi)\sim U[0,b],\ \pi(\gamma_{k})\sim U[0,b].

Based on our experience in a wide range of simulation studies and real data applications, we find that setting b=10b=10 works reasonably well. In Section 3.3 we also considered letting π(γk)\pi(\gamma_{k}) take a parametric form, an option that will be revisited later.

The posterior distribution can be expressed as

f(,ϕ,𝜸|τ1,τ2,,τm)\displaystyle f(\mathcal{I},\phi,\boldsymbol{\gamma}|\tau_{1},\tau_{2},\cdots,\tau_{m}) (26)
\displaystyle\propto π(,ϕ,𝜸)P(τ1,τ2,,τm|,ϕ,𝜸)\displaystyle\pi(\mathcal{I},\phi,\boldsymbol{\gamma})\cdot P(\tau_{1},\tau_{2},\cdots,\tau_{m}|\mathcal{I},\phi,\boldsymbol{\gamma})
=\displaystyle= 𝕀(ϕ[0,10])×k=1m{𝕀(τk0𝒜UB(τk01))×𝕀(γk[0,10])Aτk,I×(Bτk,I)γk×(Cγk,n1)nn1×(Dτk,)ϕγk×Eϕ,γk},\displaystyle\mathbb{I}\big{(}\phi\in[0,10]\big{)}\times\prod_{k=1}^{m}\Big{\{}\frac{\mathbb{I}\big{(}\tau_{k}^{0}\in\mathcal{A}_{U_{B}}(\tau_{k}^{0\mid 1})\big{)}\times\mathbb{I}\big{(}\gamma_{k}\in[0,10]\big{)}}{A^{*}_{\tau_{k},I}\times(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(C^{*}_{\gamma_{k},n_{1}})^{n-n_{1}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}\times E^{*}_{\phi,\gamma_{k}}}\Big{\}},

with the following conditional distributions:

f(ϕ,𝜸)\displaystyle f(\mathcal{I}\mid\phi,\boldsymbol{\gamma}) \displaystyle\propto k=1m𝕀(τk0𝒜UB(τk01))Aτk,I×(Bτk,I)γk×(Dτk,)ϕγk,\displaystyle\prod_{k=1}^{m}\frac{\mathbb{I}\big{(}\tau_{k}^{0}\in\mathcal{A}_{U_{B}}(\tau_{k}^{0\mid 1})\big{)}}{A^{*}_{\tau_{k},I}\times(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}}, (27)
f(ϕ,𝜸)\displaystyle f(\phi\mid\mathcal{I},\boldsymbol{\gamma}) \displaystyle\propto 𝕀(ϕ[0,10])×k=1m1(Dτk,)ϕγk×Eϕ,γk,\displaystyle{\mathbb{I}}\big{(}\phi\in[0,10]\big{)}\times\prod_{k=1}^{m}\frac{1}{(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}\times E^{*}_{\phi,\gamma_{k}}}, (28)
f(γk,ϕ,𝜸[k])\displaystyle f(\gamma_{k}\mid\mathcal{I},\phi,\boldsymbol{\gamma}_{[-k]}) \displaystyle\propto 𝕀(γk[0,10])(Bτk,I)γk×(Cγk,n1)nn1×(Dτk,)ϕγk×Eϕ,γk,\displaystyle\frac{\mathbb{I}\big{(}\gamma_{k}\in[0,10]\big{)}}{(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(C^{*}_{\gamma_{k},n_{1}})^{n-n_{1}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}\times E^{*}_{\phi,\gamma_{k}}}, (29)

based on which posterior samples of (,ϕ,𝜸)(\mathcal{I},\phi,\boldsymbol{\gamma}) can be obtained by Gibbs sampling, where 𝜸[k]=(γ1,,γk1,γk+1,,γm)\boldsymbol{\gamma}_{[-k]}=(\gamma_{1},\cdots,\gamma_{k-1},\gamma_{k+1},\cdots,\gamma_{m}).

Considering that conditional distributions in (27)-(29) are nonstandard, we adopt the Metropolis-Hastings algorithm (Hastings,, 1970) to enable the conditional sampling. To be specific, we choose the proposal distributions for ϕ\phi and γk\gamma_{k} as

q(ϕϕ(t);,𝜸)\displaystyle q(\phi\mid\phi^{(t)};\mathcal{I},\boldsymbol{\gamma}) \displaystyle\sim 𝒩(ϕ(t),σϕ2)\displaystyle\mathcal{N}(\phi^{(t)},\sigma_{\phi}^{2})
q(γkγk(t);,ϕ,𝜸[k])\displaystyle q(\gamma_{k}\mid\gamma_{k}^{(t)};\mathcal{I},\phi,\boldsymbol{\gamma}_{[-k]}) \displaystyle\sim 𝒩(γk(t),σγk2),\displaystyle\mathcal{N}(\gamma_{k}^{(t)},\sigma_{\gamma_{k}}^{2}),

where σϕ2\sigma_{\phi}^{2} and σγk2\sigma_{\gamma_{k}}^{2} can be tuned to optimize the mixing rate of the sampler. Since \mathcal{I} is a discrete vector, we propose new values of \mathcal{I} by swapping two randomly selected adjacent entities. Note that the entity whose ranking is n1n_{1} could be swapped with any background entity. Due to the homogeneity of background entities, there is no need to swap two background entities. Therefore, the number of potential proposals in each step is 𝒪(nn1)\mathcal{O}(nn_{1}). More details about MCMC sampling techniques can be found in Liu, (2008).
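As an illustration of the Metropolis-Hastings updates, the sketch below implements one random-walk update of ϕ\phi based on the conditional distribution (28); the proposal standard deviation, the bound b=10b=10, and the handling of rankers with γk=0\gamma_{k}=0 (which contribute only a constant) are implementation choices of this sketch rather than prescriptions of the paper.

```python
import math
import random

def log_cond_phi(phi, gammas, dists, n1, b=10.0):
    """log of the conditional density (28) up to a constant; dists[k] is the
    Kendall tau distance between tau_k^1 and I^+ under the current script_I."""
    if not (0.0 < phi <= b):
        return -math.inf
    out = 0.0
    for g, d in zip(gammas, dists):
        if g <= 0:
            continue   # a ranker with gamma_k = 0 contributes only a constant
        log_E = sum(math.log(1 - math.exp(-t * phi * g)) for t in range(2, n1 + 1)) \
                - (n1 - 1) * math.log(1 - math.exp(-phi * g))
        out += -phi * g * d - log_E
    return out

def mh_update_phi(phi, gammas, dists, n1, sigma=0.1, rng=random.Random(2)):
    """One Metropolis-Hastings step for phi with a N(phi, sigma^2) proposal."""
    proposal = rng.gauss(phi, sigma)
    log_ratio = log_cond_phi(proposal, gammas, dists, n1) \
                - log_cond_phi(phi, gammas, dists, n1)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return phi

# Example with three rankers (gamma values and distances are hypothetical):
print(mh_update_phi(0.5, gammas=[1.2, 0.8, 0.0], dists=[3, 5, 7], n1=10))
```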

Suppose that MM posterior samples {((t),ϕ(t),𝜸(t))}t=1M\{(\mathcal{I}^{(t)},\phi^{(t)},\boldsymbol{\gamma}^{(t)})\}_{t=1}^{M} are obtained. We calculate the posterior means of different parameters as below:

¯i\displaystyle\bar{\mathcal{I}}_{i} =\displaystyle= 1Mt=1M[i(t)Ii(t)+n1+1+n2(1Ii(t))],i=1,,n,\displaystyle\frac{1}{M}\sum_{t=1}^{M}\Big{[}\mathcal{I}_{i}^{(t)}\cdot I^{(t)}_{i}+\frac{n_{1}+1+n}{2}\cdot(1-I^{(t)}_{i})\Big{]},\ i=1,\cdots,n,
ϕ¯\displaystyle\bar{\phi} =\displaystyle= 1Mt=1Mϕ(t),\displaystyle\frac{1}{M}\sum_{t=1}^{M}\phi^{(t)},
γ¯k\displaystyle\bar{\gamma}_{k} =\displaystyle= 1Mt=1Mγk(t),k=1,,m.\displaystyle\frac{1}{M}\sum_{t=1}^{M}\gamma_{k}^{(t)},k=1,\cdots,m.

We quantify the quality of ranker τk\tau_{k} with γ¯k\bar{\gamma}_{k}, and generate the final aggregated ranking list τ^\hat{\tau} based on the ¯i\bar{\mathcal{I}}_{i}’s as follows:

τ^=sort(iUby¯i).\hat{\tau}=sort(i\in U\ by\ \bar{\mathcal{I}}_{i}\uparrow).

Hereinafter, we refer to this MCMC-based Bayesian rank aggregation procedure under the Partition-Mallows model as PAMAB.

The Bayesian inference procedure PAMAHB for the PAMA-H model differs from PAMAB only by replacing the prior distribution k=1mπ(γk)\prod_{k=1}^{m}\pi(\gamma_{k}), which is uniform in [0,b]m[0,b]^{m}, with a hierarchically structured prior π(ψ)k=1mfψ(γk)\pi(\psi)\prod_{k=1}^{m}f_{\psi}(\gamma_{k}). The conditional distributions needed for Gibbs sampling are almost the same as (27)-(29), except for an additional one

f(ψ,ϕ,𝜸)\displaystyle f(\psi\mid\mathcal{I},\phi,\boldsymbol{\gamma}) \displaystyle\propto π(ψ)k=1mfψ(γk).\displaystyle\pi(\psi)\cdot\prod_{k=1}^{m}f_{\psi}(\gamma_{k}). (30)

We may specify fψ(γ)f_{\psi}(\gamma) to be an exponential distribution and let π(ψ)\pi(\psi) be a proper conjugate prior to make (30) easy to sample from. More details for PAMAHB with fψ(γ)f_{\psi}(\gamma) specified as an exponential distribution are provided in the Supplementary Material.

Our simulation studies suggest that the practical performances of PAMAB and PAMAHB are very similar when n0n_{0} and n1n_{1} are reasonably large (see the Supplementary Material for details). In contrast, as we will show in Section 5, the MLE-based estimates (e.g., PAMAF) typically produce less accurate results, but with a shorter computational time, compared to PAMAB.

4.3 Extension to Partial Ranking Lists

The proposed Partition-Mallows model can be extended to more general scenarios where partial ranking lists, instead of full ranking lists, are involved in the aggregation. Given the entity set UU and a ranking list τS\tau_{S} of entities in SUS\subseteq U, we say τS\tau_{S} is a full ranking list if S=US=U, and a partial ranking list if SUS\subset U. Suppose τS\tau_{S} is a partial ranking list and τU\tau_{U} is a full ranking list of UU. If the projection of τU\tau_{U} on SS equals τS\tau_{S}, we say τU\tau_{U} is compatible with τS\tau_{S}, denoted as τUτS\tau_{U}\sim\tau_{S}. Let 𝒜(τS)={τU:τUτS}\mathcal{A}(\tau_{S})=\{\tau_{U}:\tau_{U}\sim\tau_{S}\} be the set of all full lists that are compatible with τS\tau_{S}. Suppose a partial list τk\tau_{k} is involved in the rank aggregation problem. The probability of τk\tau_{k} can be evaluated by:

P(τk,ϕ,γk)=τkτkP(τk,ϕ,γk),\displaystyle P(\tau_{k}\mid\mathcal{I},\phi,\gamma_{k})=\sum_{\tau_{k}^{*}\sim\tau_{k}}P(\tau_{k}^{*}\mid\mathcal{I},\phi,\gamma_{k}), (31)

where P(τk,ϕ,γk)P(\tau_{k}^{*}\mid\mathcal{I},\phi,\gamma_{k}) is the probability of a compatible full list under the PAMA model. Clearly, the probability in (31) does not have a closed-form representation due to complicated constraints between τk\tau_{k} and τk\tau_{k}^{*}, and it is very challenging to do statistical inference directly based on this quantity. Fortunately, as rank aggregation with partial lists can be treated as a missing data problem, we can resolve the problem via standard methods for missing data inference.

The Bayesian inference can be accomplished by the classic data augmentation strategy (Tanner and Wong,, 1987) in a similar way as described in Deng et al., (2014), which iterates between imputing the missing data conditional on the observed data given the current parameter values, and updating parameter values by sampling from the posterior distribution based on the imputed full data. To be specific, we iteratively draw from the following two conditional distributions:

P(τ1,,τmτ1,,τm;,ϕ,𝜸)=k=1mP(τkτk;,ϕ,γk),\displaystyle P(\tau_{1}^{\ast},\cdots,\tau_{m}^{\ast}\mid\tau_{1},\cdots,\tau_{m};\mathcal{I},\phi,\boldsymbol{\gamma})=\prod_{k=1}^{m}P(\tau_{k}^{\ast}\mid\tau_{k};\mathcal{I},\phi,\gamma_{k}),
f(,ϕ,𝜸τ1,,τm)π()×π(𝜸)×π(ϕ)×k=1mP(τk,γk,ϕ).\displaystyle f(\mathcal{I},\phi,\boldsymbol{\gamma}\mid\tau_{1}^{\ast},\cdots,\tau_{m}^{\ast})\propto\pi(\mathcal{I})\times\pi(\boldsymbol{\gamma})\times\pi(\phi)\times\prod_{k=1}^{m}P(\tau_{k}^{\ast}\mid\mathcal{I},\gamma_{k},\phi).

To find the MLE of 𝜽{\boldsymbol{\theta}} for this more challenging scenario, we can use the Monte Carlo EM algorithm (MCEM, Wei and Tanner, 1990). Let τk(1),,τk(M)\tau_{k}^{(1)},\cdots,\tau_{k}^{(M)} be MM independent samples drawn from the distribution P(τkτk,(s),ϕ(s),γk(s))P(\tau_{k}^{\ast}\mid\tau_{k},\mathcal{I}^{(s)},\phi^{(s)},\gamma_{k}^{(s)}), where the superscript (s)(s) indicates the current parameter values. The E-step involves the calculation of the QQ-function below:

Q(,𝜸,ϕ(s),𝜸(s),ϕ(s))\displaystyle Q(\mathcal{I},\boldsymbol{\gamma},\phi\mid\mathcal{I}^{(s)},\boldsymbol{\gamma}^{(s)},\phi^{(s)}) =\displaystyle= E{k=1mlogP(τk,𝜸,ϕ)τk,(s),𝜸k(s),ϕ(s)}\displaystyle E\left\{\sum_{k=1}^{m}\log P(\tau_{k}^{*}\mid\mathcal{I},\boldsymbol{\gamma},\phi)\mid\tau_{k},\mathcal{I}^{(s)},\boldsymbol{\gamma}_{k}^{(s)},\phi^{(s)}\right\}
\displaystyle\approx 1Mk=1mt=1MlogP(τk(t),𝜸k,ϕ).\displaystyle\dfrac{1}{M}\sum_{k=1}^{m}\sum_{t=1}^{M}\log P(\tau_{k}^{(t)}\mid\mathcal{I},\boldsymbol{\gamma}_{k},\phi).

In the M-step, we use the Gauss-Seidel method to maximize the above QQ-function in a similar way as detailed in Supplementary Material.

No matter which method is used, a key step is to draw samples from

P(τkτk;,ϕ,γk)P(τk,γk,ϕ)𝕀(τk𝒜(τk)).P(\tau_{k}^{\ast}\mid\tau_{k};\mathcal{I},\phi,\gamma_{k})\propto P(\tau_{k}^{\ast}\mid\mathcal{I},\gamma_{k},\phi)\cdot\mathbb{I}\big{(}\tau_{k}^{\ast}\in\mathcal{A}(\tau_{k})\big{)}.

To achieve this goal, we start with τk\tau_{k}^{*} obtained from the previous step of the data augmentation or MCEM algorithms, and conduct several iterations of the following Metropolis step with P(τkτk;,ϕ,γk)P(\tau_{k}^{\ast}\mid\tau_{k};\mathcal{I},\phi,\gamma_{k}) as its target distribution: (a) construct the proposal τk\tau_{k}^{\prime} by randomly selecting two elements in the current full list τk\tau_{k}^{*} and swapping them; (b) accept or reject the proposal according to the Metropolis rule, that is, accept τk\tau_{k}^{\prime} with probability min(1,P(τk,γk,ϕ)P(τk,γk,ϕ))\min(1,\frac{P(\tau_{k}^{\prime}\mid\mathcal{I},\gamma_{k},\phi)}{P(\tau_{k}^{*}\mid\mathcal{I},\gamma_{k},\phi)}). Note that the proposed list τk\tau_{k}^{\prime} is automatically rejected if it is incompatible with the observed partial list τk\tau_{k}.
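A compact Python sketch of this Metropolis step is given below; the log-likelihood is passed in as a function, and the number of sweeps and the variable names are illustrative rather than part of the authors’ implementation.

```python
import math
import random

def mh_impute_full_list(tau_full, partial_order, log_lik, n_steps=100,
                        rng=random.Random(3)):
    """Metropolis sampler over full lists compatible with an observed partial list.

    tau_full      : current full list, tau_full[i] = position of entity i
    partial_order : entity indices in their observed (partial) order
    log_lik       : function returning log P(tau_full | script_I, gamma_k, phi)
    """
    def compatible(tau):
        pos = [tau[i] for i in partial_order]
        return all(pos[t] < pos[t + 1] for t in range(len(pos) - 1))

    tau = list(tau_full)
    for _ in range(n_steps):
        a, b = rng.sample(range(len(tau)), 2)     # pick two entities and swap them
        prop = list(tau)
        prop[a], prop[b] = prop[b], prop[a]
        if not compatible(prop):                  # automatic rejection
            continue
        log_ratio = log_lik(prop) - log_lik(tau)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            tau = prop
    return tau
```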

4.4 Incorporating Covariates in the Analysis

In some applications, covariate information for each ranked entity is available to assist rank aggregation. One of the earliest attempts to incorporate such information in analyzing rank data is perhaps the hidden score model due to Thurstone, (1927), which has become a standard approach and has many extensions. Briefly, these models assume that there is an unobserved score for each entity that is related to the entity-specific covariates Xi=(Xi1,,Xip)TX_{i}=(X_{i1},\cdots,X_{ip})^{T} under a regression framework, and that the observed rankings are determined by these scores plus noise, i.e.,

τk=sort(Sik,EiU),whereSik=XiT𝜷+εik.\tau_{k}=sort(S_{ik}\downarrow,\ E_{i}\in U),\ \mbox{where}\ S_{ik}=X_{i}^{T}\boldsymbol{\beta}+\varepsilon_{ik}.

Here, 𝜷\boldsymbol{\beta} is the common regression coefficient vector and εikN(0,σk2)\varepsilon_{ik}\sim N(0,\sigma^{2}_{k}) is the noise term. Recent progress along this line is reviewed by Yu, (2000); Bhowmik and Ghosh, (2017); Li et al., (2021).

Here, we propose to incorporate covariates into the analysis in a different way. Assuming that covariate XiX_{i} provides information on the group assignment instead of the detailed ranking of entity EiE_{i}, we connect XiX_{i} and i\mathcal{I}_{i}, the enhanced indicator of entity EiE_{i}, via a logistic regression model:

P(iXi)=P(IiXi,𝝍)=exp{XiT𝝍Ii}1+exp{XiT𝝍},i=1,,n,P(\mathcal{I}_{i}\mid X_{i})=P(I_{i}\mid X_{i},\boldsymbol{\psi})=\dfrac{\exp\{X_{i}^{T}\boldsymbol{\psi}\cdot I_{i}\}}{1+\exp\{X_{i}^{T}\boldsymbol{\psi}\}},~{}~{}i=1,\cdots,n, (32)

where 𝝍=(ψ1,,ψp)T\boldsymbol{\psi}=(\psi_{1},\ldots,\psi_{p})^{T} denotes the vector of regression parameters. Letting 𝑿=(X1,,Xn)\boldsymbol{X}=(X_{1},\cdots,X_{n}) be the covariate matrix, we can extend the Partition-Mallows model as

P(τ1,,τm,𝑿)=P(𝑿,𝝍)×P(τ1,,τm,ϕ,𝜸),P(\tau_{1},\cdots,\tau_{m},\mathcal{I}\mid\boldsymbol{X})=P(\mathcal{I}\mid\boldsymbol{X},\boldsymbol{\psi})\times P(\tau_{1},\cdots,\tau_{m}\mid\mathcal{I},\phi,\boldsymbol{\gamma}), (33)

where the first term

P(𝑿,𝝍)=i=1nP(IiXi,𝝍)P(\mathcal{I}\mid\boldsymbol{X},\boldsymbol{\psi})=\prod_{i=1}^{n}P(I_{i}\mid X_{i},\boldsymbol{\psi})

comes from the logistic regression model (32), and the second term comes from the original Partition-Mallows model. In the extended model, our goal is to infer (,ϕ,𝜸,𝝍)(\mathcal{I},\phi,\boldsymbol{\gamma},\boldsymbol{\psi}) based on (τ1,,τm;𝑿)(\tau_{1},\cdots,\tau_{m};\boldsymbol{X}). We can achieve both Bayesian inference and MLE for the extended model in a similar way as described for the Partition-Mallows model. More details are provided in the Supplementary Material.
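
For concreteness, a minimal sketch of the covariate factor in (33) is given below. It evaluates the logarithm of the first term under the logistic model (32), using the binary indicator I_i = 1 if entity i is relevant and 0 otherwise; adding this quantity to the original Partition-Mallows log-likelihood gives the log-likelihood of the extended model. The function name is illustrative.

```python
import numpy as np

def logistic_group_logprob(X, I_binary, psi):
    """log P(I | X, psi) under the logistic model (32), where I_binary[i] = 1 if
    entity i is relevant (I_i > 0) and 0 otherwise; X is the n x p covariate matrix."""
    eta = X @ psi                                        # linear predictors X_i' psi
    # sum_i [ I_i * eta_i - log(1 + exp(eta_i)) ], computed in a numerically stable way
    return float(np.sum(I_binary * eta - np.logaddexp(0.0, eta)))

# Illustration with random covariates: the extended log-likelihood of (33) is this value
# plus the original Partition-Mallows log-likelihood of the ranking lists.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
I_binary = np.zeros(100); I_binary[:10] = 1
print(logistic_group_logprob(X, I_binary, psi=np.array([0.5, -0.2, 0.1])))
```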

An alternative way to incorporate covariates is to replace the logistic regression model by a naive Bayes model, which models the conditional distribution of 𝑿\boldsymbol{X}\mid\mathcal{I} instead of 𝑿\mathcal{I}\mid\boldsymbol{X}, as follows:

f(τ1,,τm,𝑿)=P(τ1,,τm,ϕ,𝜸)×f(𝑿),f(\tau_{1},\cdots,\tau_{m},\boldsymbol{X}\mid\mathcal{I})=P(\tau_{1},\cdots,\tau_{m}\mid\mathcal{I},\phi,\boldsymbol{\gamma})\times f(\boldsymbol{X}\mid\mathcal{I}), (34)

where

f(𝑿)\displaystyle f(\boldsymbol{X}\mid\mathcal{I}) =\displaystyle= i=1nf(Xii)=i=1nf(XiIi)=i=1nj=1pf(XijIi)\displaystyle\prod_{i=1}^{n}f(X_{i}\mid\mathcal{I}_{i})=\prod_{i=1}^{n}f(X_{i}\mid I_{i})=\prod_{i=1}^{n}\prod_{j=1}^{p}f(X_{ij}\mid I_{i})
=\displaystyle= i=1nj=1p{[fj(Xijψj0)]1Ii[fj(Xijψj1)]Ii},\displaystyle\prod_{i=1}^{n}\prod_{j=1}^{p}\Big{\{}\big{[}f_{j}(X_{ij}\mid\psi_{j0})\big{]}^{1-I_{i}}\cdot\big{[}f_{j}(X_{ij}\mid\psi_{j1})\big{]}^{I_{i}}\Big{\}},

and fjf_{j} is a pre-specified parametric distribution for covariate XjX_{j}, with parameter ψj0\psi_{j0} for entities in the background group and ψj1\psi_{j1} for entities in the relevant group. Since the two approaches perform very similarly, in the rest of this paper we use the logistic regression strategy to handle covariates due to its convenient form.
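
As a minimal illustration of this alternative, the sketch below evaluates log f(X | I) with Gaussian f_j; the Gaussian choice and its parameterization by means and standard deviations are illustrative stand-ins for the generic (f_j, psi_j0, psi_j1).

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_logf(X, I_binary, mu0, sd0, mu1, sd1):
    """log f(X | I) under the naive Bayes factorization (34), with Gaussian f_j:
    covariate j of a background entity ~ N(mu0[j], sd0[j]^2), and of a relevant
    entity ~ N(mu1[j], sd1[j]^2)."""
    logf0 = norm.logpdf(X, loc=mu0, scale=sd0).sum(axis=1)   # log f(X_i | I_i = 0)
    logf1 = norm.logpdf(X, loc=mu1, scale=sd1).sum(axis=1)   # log f(X_i | I_i = 1)
    return float(np.sum((1 - I_binary) * logf0 + I_binary * logf1))
```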

5 Simulation Study

5.1 Simulation Settings

We simulated data from two models: (a) the proposed Partition-Mallows model, referred to as 𝒮PM\mathcal{S}_{PM}, and (b) the Thurstone hidden score model, referred to as 𝒮HS\mathcal{S}_{HS}. In the 𝒮PM\mathcal{S}_{PM} scenario, we specified the true indicator vector as =(1,,n1,0,,0)\mathcal{I}=(1,\cdots,n_{1},0,\cdots,0), indicating that the first n1n_{1} entities E1,,En1E_{1},\cdots,E_{n_{1}} belong to URU_{R} and the rest belong to the background group UBU_{B}, and set

γk={0.1,if km2;a+(km2)×δR,if k>m2.\gamma_{k}=\left\{\begin{array}[]{ll}0.1,&\mbox{if }k\leq\frac{m}{2};\\ a+(k-\frac{m}{2})\times\delta_{R},&\mbox{if }k>\frac{m}{2}.\\ \end{array}\right.

Clearly, a>0a>0 and δR>0\delta_{R}>0 control the quality difference and signal strength of the mm base rankers in the 𝒮PM\mathcal{S}_{PM} scenario. We set ϕ=0.6\phi=0.6 (defined in (16)), δR=2m\delta_{R}=\frac{2}{m}, and aa with two options: 2.5 and 1.5. For easy reference, we denoted the strong signal case with a=2.5a=2.5 and the weak signal case with a=1.5a=1.5 by 𝒮PM1\mathcal{S}_{PM_{1}} and 𝒮PM2\mathcal{S}_{PM_{2}}, respectively.

In the 𝒮HS\mathcal{S}_{HS} scenario, we used the Thurstone model to generate the rank lists as τk=sort(iUbySik),whereSikN(μik,1)\tau_{k}=sort(i\in U\ by\ S_{ik}\downarrow),\ \mbox{where}\ S_{ik}\sim N(\mu_{ik},1) and

μik={0,if km2ori>n1;a+bam×k+(n1i)×δE,otherwise.\mu_{ik}=\left\{\begin{array}[]{ll}0,&\mbox{if }k\leq\frac{m}{2}\ \mbox{or}\ i>n_{1};\\ a^{*}+\frac{b^{*}-a^{*}}{m}\times k+(n_{1}-i)\times\delta_{E}^{*},&\mbox{otherwise}.\\ \end{array}\right.

In this model, a,ba^{*},b^{*} and δE\delta_{E}^{*} control the quality difference and signal strength of the mm base rankers. We also specified two sub-cases: 𝒮HS1\mathcal{S}_{HS_{1}}, the stronger-signal case with (a,b,δE)=(0.5,2.5,0.2)(a^{*},b^{*},\delta_{E}^{*})=(0.5,2.5,0.2); and 𝒮HS2\mathcal{S}_{HS_{2}}, the weaker-signal case with (a,b,δE)=(0.5,1.5,0.2)(a^{*},b^{*},\delta_{E}^{*})=(-0.5,1.5,0.2). Table 1 shows the configuration matrix of μik\mu_{ik} under 𝒮HS1\mathcal{S}_{HS_{1}} when m=10,n=100m=10,n=100 and n1=10n_{1}=10. In both scenarios, the first half of the rankers are completely non-informative, with the other half providing increasingly strong signals.

μ1\mu_{1} μ2\mu_{2} μ3\mu_{3} μ4\mu_{4} μ5\mu_{5} μ6\mu_{6} μ7\mu_{7} μ8\mu_{8} μ9\mu_{9} μ10\mu_{10}
E1E_{1} 0 0 0 0 0 3.7 3.9 4.1 4.3 4.5
E2E_{2} 0 0 0 0 0 3.5 3.7 3.9 4.1 4.3
E3E_{3} 0 0 0 0 0 3.3 3.5 3.7 3.9 4.1
E4E_{4} 0 0 0 0 0 3.1 3.3 3.5 3.7 3.9
E5E_{5} 0 0 0 0 0 2.9 3.1 3.3 3.5 3.7
E6E_{6} 0 0 0 0 0 2.7 2.9 3.1 3.3 3.5
E7E_{7} 0 0 0 0 0 2.5 2.7 2.9 3.1 3.3
E8E_{8} 0 0 0 0 0 2.3 2.5 2.7 2.9 3.1
E9E_{9} 0 0 0 0 0 2.1 2.3 2.5 2.7 2.9
E10E_{10} 0 0 0 0 0 1.9 2.1 2.3 2.5 2.7
E11E_{11} 0 0 0 0 0 0 0 0 0 0
\vdots \vdots \vdots \vdots \vdots \vdots \vdots \vdots \vdots \vdots \vdots
E100E_{100} 0 0 0 0 0 0 0 0 0 0
Table 1: The configuration matrix of the μik\mu_{ik}’s under 𝒮HS1\mathcal{S}_{HS_{1}} with mm=10, nn=100 and n1n_{1}=10.
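
The hidden-score generating mechanism above is simple to reproduce; the sketch below draws one data set under the stronger-signal case using the μik formula displayed before Table 1 (the parameter values are those specified in this section, while the function name and implementation details are illustrative).

```python
import numpy as np

def simulate_hidden_score(m=10, n=100, n1=10, a=0.5, b=2.5, delta_E=0.2, seed=0):
    """Draw one data set from the hidden-score scenario: tau_k orders the n entities
    by S_ik ~ N(mu_ik, 1) in decreasing order, with mu_ik as displayed above."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((n, m))
    for k in range(1, m + 1):                 # ranker index k = 1, ..., m
        if k > m / 2:                         # only the second half of rankers is informative
            for i in range(1, n1 + 1):        # only the relevant entities E_1, ..., E_{n1}
                mu[i - 1, k - 1] = a + (b - a) / m * k + (n1 - i) * delta_E
    S = mu + rng.standard_normal((n, m))
    # column k lists entity indices (0-based) from the highest score to the lowest
    return np.argsort(-S, axis=0)

tau = simulate_hidden_score()   # rankers 6-10 tend to place entities 0-9 near the top
```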

For each of the four simulation scenarios (i.e., 𝒮PM1\mathcal{S}_{PM_{1}}, 𝒮PM2\mathcal{S}_{PM_{2}}, 𝒮HS1\mathcal{S}_{HS_{1}} and 𝒮HS2\mathcal{S}_{HS_{2}}), we fixed the true number of relevant entities n1=10n_{1}=10, but allowed the number of rankers mm and the total number of entities nn to vary, resulting in a total of 16 simulation settings ( {scenarios:𝒮PM1,𝒮PM2,𝒮HS1,𝒮HS2}×{m:10,20}×{n:100,300}×{n1:10}\{scenarios:\mathcal{S}_{PM_{1}},\mathcal{S}_{PM_{2}},\mathcal{S}_{HS_{1}},\mathcal{S}_{HS_{2}}\}\times\{m:10,20\}\times\{n:100,300\}\times\{n_{1}:10\}). Under each setting, we simulated 500 independent data sets to evaluate and compare performances of different rank aggregation methods.

5.2 Methods in Comparison and Performance Measures

In addition to the proposed PAMAB and PAMAF, we considered state-of-the-art methods in several classes, including the Markov chain-based methods MC1, MC2, MC3 in Lin, (2010) and CEMC in Lin and Ding, (2010), the partition-based method BARD in Deng et al., (2014), and the Mallows model-based methods MM and EMM in Li et al., (2020). Classic naive methods based on summary statistics were not included because they have been shown in previous studies to perform suboptimally, especially in cases where the base rankers are heterogeneous in quality. The Markov-chain-based methods, MM, and EMM were implemented in the TopKLists, PerMallows and ExtMallows packages in R (https://www.r-project.org/), respectively. The code of BARD was provided by its authors.

Let τ\tau be the underlying true ranking list of all entities, τR={τ(i):EiUR}\tau_{R}=\{\tau(i):\ E_{i}\in U_{R}\} be the true relative ranking of relevant entities, τ^\hat{\tau} be the aggregated ranking obtained from a rank aggregation approach, τ^R={τ^(i):EiUR}\hat{\tau}_{R}=\{\hat{\tau}(i):\ E_{i}\in U_{R}\} be the relative ranking of relevant entities after aggregation, and τ^n1\hat{\tau}_{n_{1}} be the top-n1n_{1} list of τ^\hat{\tau}. We evaluated the performance of each aggregated ranking τ^\hat{\tau} by two measures, namely the recovery distance κR\kappa_{R} and the coverage ρR\rho_{R}, defined as below:

κR\displaystyle\kappa_{R} \displaystyle\triangleq dτ(τ^R,τR)+nτ^×n+n1+12,\displaystyle d_{\tau}(\hat{\tau}_{R},\tau_{R})+n_{\hat{\tau}}\times\frac{n+n_{1}+1}{2},
ρR\displaystyle\rho_{R} \displaystyle\triangleq n1nτ^n1,\displaystyle\frac{n_{1}-n_{\hat{\tau}}}{n_{1}},

where dτ(τ^R,τR)d_{\tau}(\hat{\tau}_{R},\tau_{R}) denotes the Kendall tau distance between τ^R\hat{\tau}_{R} and τR\tau_{R}, and nτ^n_{\hat{\tau}} denotes the number of relevant entities that are classified as background entities in τ^\hat{\tau}. The recovery distance κR\kappa_{R} accounts for the detailed rankings of all relevant entities plus a mis-classification penalty, while the coverage ρR\rho_{R} concerns only the identification of relevant entities without considering their detailed rankings. In the setting of PAMA, n+n1+12\frac{n+n_{1}+1}{2} is the average rank of a background entity, so the recovery distance increases whenever a relevant entity is mis-classified as a background one. Clearly, we expect a smaller κR\kappa_{R} and a larger ρR\rho_{R} for a stronger aggregation approach.
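
Under one natural reading of these definitions (the Kendall tau distance is computed over the relevant entities that the aggregated list keeps in its relevant part, and each missed relevant entity is penalized by the average background rank), the two measures can be computed as in the sketch below; the function names and input conventions are illustrative.

```python
def kendall_tau(a, b):
    """Number of discordant pairs between two orderings of the same set of entities."""
    rank_in_b = {e: r for r, e in enumerate(b)}
    s = [rank_in_b[e] for e in a]
    return sum(1 for i in range(len(s)) for j in range(i + 1, len(s)) if s[i] > s[j])

def recovery_and_coverage(tau_hat_top, true_relevant, n, n1):
    """kappa_R and rho_R: tau_hat_top is the estimated relevant part (best to worst)
    of the aggregated list; true_relevant is the true order of the n1 relevant entities."""
    kept = [e for e in tau_hat_top if e in set(true_relevant)]        # recovered relevant entities
    kept_truth = [e for e in true_relevant if e in set(tau_hat_top)]  # same set, true order
    n_missed = n1 - len(kept)                                         # relevant entities pushed to background
    kappa = kendall_tau(kept, kept_truth) + n_missed * (n + n1 + 1) / 2
    rho = (n1 - n_missed) / n1
    return kappa, rho

# Example: the truth is entities 0..9 in order; the aggregated top-10 misses entity 9.
print(recovery_and_coverage([0, 2, 1, 3, 4, 5, 6, 8, 7, 42], list(range(10)), n=100, n1=10))
```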

5.3 Simulation Results

Table 2 summarizes the performances of the nine competing methods in the 16 different simulation settings, demonstrating that the proposed PAMAB and PAMAF outperform all the other methods by a significant margin in most settings and that PAMAB outperforms PAMAF in almost all settings. Figure 3 shows the quality parameter 𝜸\boldsymbol{\gamma} learned from the Partition-Mallows model in various simulation scenarios with m=10m=10 and n=100n=100, confirming that the proposed methods can effectively capture the quality differences among the rankers. The results of 𝜸\boldsymbol{\gamma} for other combinations of (m,n)(m,n) can be found in the Supplementary Material and are consistent with Figure 3.

Configuration Partition-type Models Mallows Models MC-based Models
𝒮\mathcal{S} nn mm PAMAF PAMAB BARD EMM MM MC1 MC2 MC3 CEMC
𝒮PM1\mathcal{S}_{PM_{1}} 100 10 24.5 15.2 57.1 51.7 103.2 338.4 163.1 198.6 197.8
[0.95] [0.97] [0.91] [0.89] [0.81] [0.36] [0.69] [0.63] [0.62]
100 20 2.6 0.3 42.1 22.8 44.2 466.6 88.9 121.2 114.7
[0.99] [1.00] [0.95] [0.95] [0.93] [0.11] [0.82] [0.78] [0.77]
300 10 17.4 4.0 180.0 683.3 519.2 1268.3 997.7 1075.8 1085.7
[0.99] [1.00] [0.89] [0.66] [0.55] [0.17] [0.34] [0.29] [0.28]
300 20 7.1 3.2 122.3 124.4 157.1 1445.9 613.5 723.0 727.2
[1.00] [1.00] [0.93] [0.92] [0.90] [0.05] [0.60] [0.53] [0.52]
𝒮PM2\mathcal{S}_{PM_{2}} 100 10 90.0 66.6 115.2 108.3 152.9 404.3 285.5 307.2 313.8
[0.82] [0.86] [0.77] [0.77] [0.70] [0.24] [0.47] [0.43] [0.41]
100 20 26.9 2.4 81.5 59.8 91.5 468.1 217.3 245.2 249.5
[0.94] [1.00] [0.85] [0.87] [0.82] [0.11] [0.60] [0.55] [0.53]
300 10 81.1 26.8 468.4 609.8 472.1 1388.4 1294.7 1321.5 1328.4
[0.95] [0.98] [0.69] [0.68] [0.60] [0.09] [0.15] [0.13] [0.13]
300 20 77.2 3.4 313.6 267.5 337.0 1469.0 1205.9 1251.8 1258.9
[0.95] [1.00] [0.79] [0.82] [0.78] [0.04] [0.21] [0.18] [0.18]
𝒮HS1\mathcal{S}_{HS_{1}} 100 10 24.9 20.6 22.9 54.9 115.9 334.7 150.9 180.3 186.0
[0.97] [0.98] [0.99] [0.91] [0.80] [0.37] [0.71] [0.66] [0.64]
100 20 18.7 15.6 22.8 8.7 33.4 498.8 46.7 64.1 60.8
[0.98] [0.98] [1.00] [1.00] [0.97] [0.05] [0.92] [0.89] [0.89]
300 10 172.0 159.8 37.9 205.5 490.6 1098.6 627.0 752.9 769.4
[0.89] [0.90] [0.99] [0.87] [0.68] [0.28] [0.59] [0.50] [0.49]
300 20 7.4 7.0 22.7 11.4 114.1 1402.6 237.8 319.7 322.3
[1.00] [1.00] [1.00] [1.00] [0.94] [0.08] [0.84] [0.79] [0.79]
𝒮HS2\mathcal{S}_{HS_{2}} 100 10 92.6 74.0 68.7 123.7 162.3 382.4 228.2 250.2 256.6
[0.83] [0.86] [0.88] [0.77] [0.70] [0.27] [0.56] [0.52] [0.50]
100 20 24.4 20.0 22.2 12.4 38.3 500.3 87.5 103.5 102.9
[0.96] [0.97] [1.00] [0.99] [0.95] [0.04] [0.83] [0.80] [0.80]
300 10 319.1 463.8 245.6 516.9 683.5 1267.9 998.0 1076.0 1085.5
[0.79] [0.69] [0.84] [0.66] [0.55] [0.17] [0.34] [0.29] [0.28]
300 20 8.7 8.0 23.2 30.3 155.5 1430.7 437.6 516.2 523.3
[1.00] [1.00] [1.00] [0.99] [0.91] [0.06] [0.71] [0.66] [0.65]
Table 2: Average recovery distances [coverages] of different methods based on 500 independent replicates under different simulation scenarios.
Refer to caption
Figure 3: (a) The boxplots of {γ¯k}\{\bar{\gamma}_{k}\} estimated by PAMAB with m=10m=10 and n=100n=100. (b) The boxplots of {γ^k}\{\hat{\gamma}_{k}\} estimated by PAMAF with m=10m=10 and n=100n=100. Each column denotes a scenario setting. The results were obtained from 500 independent replicates.

Figure 4 (a) shows the boxplots of recovery distances and the coverages of the nine competing methods in the four simulation scenarios with m=10m=10, n=100n=100, and n1=10n_{1}=10. The five leftmost methods outperform the other four by a significant gap, and the PAMA-based methods generally perform best. Figure 4 (b) confirms that the methods based on the Partition-Mallows model enjoy the same capability as BARD in detecting quality differences between informative and non-informative rankers. However, while both BARD and PAMA can further discern quality differences among informative rankers, EMM fails this more subtle task. Similar figures for other combinations of (m,n)(m,n) are provided in the Supplementary Material, showing results consistent with Figure 4.

Refer to caption
Figure 4: Boxplots of the rank aggregation results of 500 replications obtained from different methods under various scenarios with mm=10, nn=100, and n1n_{1}=10. (a) Recovery distances in log scale and coverage obtained from nine algorithms. (b) Quality parameters obtained by Partition-type models and EMM.

5.4 Robustness to the Specification of n1n_{1}

We need to specify n1n_{1}, the number of relevant entities, when applying PAMAB or PAMAF. In many practical problems, however, there may not be strong prior information on n1n_{1}, and there may not even be a clear distinction between relevant and background entities. To examine the robustness of the algorithm with respect to the specification of n1n_{1}, we designed a simulation setting 𝒮HS3\mathcal{S}_{HS_{3}} to mimic this no-clear-cut scenario and investigated how the performance of PAMA is affected by the specification of n1n_{1} under this setting. Formally, 𝒮HS3\mathcal{S}_{HS_{3}} assumes that τk=sort(iUbySik)\tau_{k}=sort(i\in U\ by\ S_{ik}\downarrow), where SikN(μik,1)S_{ik}\sim N(\mu_{ik},1), following the same data generating framework as 𝒮HS\mathcal{S}_{HS} defined in Section 5.1, with μik\mu_{ik} replaced by

μik={0,if km2,2×a×k/m1+eb×(70i),otherwise,\mu_{ik}=\left\{\begin{array}[]{ll}0,&\mbox{if }k\leq\frac{m}{2},\\ \frac{2\times a^{*}\times k/m}{1+e^{-b^{*}\times(70-i)}},&\mbox{otherwise},\\ \end{array}\right.

where a=50a^{*}=50 and b=0.1b^{*}=0.1. Different from 𝒮HS1\mathcal{S}_{HS_{1}} and 𝒮HS2\mathcal{S}_{HS_{2}}, where μik\mu_{ik} jumps from 0 to a positive number as ii ranges from background to relevant entities, in the 𝒮HS3\mathcal{S}_{HS_{3}} scenario μik\mu_{ik} increases smoothly as a function of ii for each informative ranker kk. In such cases, the concept of “relevant” entities is not well-defined.

We simulated 500 independent data sets from 𝒮HS3\mathcal{S}_{HS_{3}} with n=100n=100 and m=10m=10. For each data set, we tried different specifications of n1n_{1} ranging from 10 to 50 and compared PAMA to several competing methods based on their performance in recovering the top-AA list [E1E2EA][E_{1}\preceq E_{2}\preceq\cdots\preceq E_{A}], which is still well-defined based on the simulation design. The results summarized in Table 3 show that no matter which n1n_{1} is specified, the PAMA-based methods consistently outperform all the competing methods in terms of a lower recovery distance from the true top-n1n_{1} list of items, i.e., [E1E2En1][E_{1}\preceq E_{2}\preceq\cdots\preceq E_{n_{1}}]. Figure 5 illustrates in detail the average aggregated rankings of the top-10 entities by PAMA as n1n_{1} increases, suggesting that PAMA is able to recover the correct rankings of the top entities effectively. These results give us confidence that PAMA is robust to the misspecification of n1n_{1}.

Configuration Partition-type Models Mallows Models MC-based Models
nn mm n1n_{1} PAMAF PAMAB BARD EMM MM MC1 MC2 MC3 CEMC
100 10 10 44.8 34.6 42.6 61.5 227.7 423.8 45.6 199.1 241.3
[0.90] [0.93] [0.96] [0.88] [0.58] [0.20] [0.92] [0.63] [0.54]
100 10 20 39.2 33.9 94.2 107.0 308.6 764.6 52.3 268.9 372.9
[0.95] [0.96] [0.99] [0.90] [0.75] [0.33] [0.96] [0.78] [0.67]
100 10 30 27.5 29.2 207.4 126.2 360.6 1040.2 67.8 325.4 445.0
[0.98] [0.98] [0.98] [0.93] [0.83] [0.44] [0.96] [0.84] [0.77]
100 10 40 16.0 17.4 363.9 131.6 408.1 1274.1 83.1 382.5 486.9
[0.99] [0.99] [0.98] [0.95] [0.87] [0.54] [0.97] [0.87] [0.83]
100 10 50 8.8 9.3 565.3 134.6 452.2 1484.2 109.2 446.4 524.9
[1.00] [1.00] [0.99] [0.97] [0.90] [0.62] [0.96] [0.89] [0.88]
Table 3: Average recovery distances [coverages] of different methods based on 500 independent replicates under scenario 𝒮HS3\mathcal{S}_{HS_{3}}.
Refer to caption
Figure 5: Average aggregated rankings of the top-10 entities by PAMA as n1n_{1} increases from 10 to 50 for simulated data sets generated from 𝒮HS3\mathcal{S}_{HS_{3}}.

Notably, although PAMA and BARD achieve comparable coverage as shown in Table 3, PAMA dominates BARD uniformly in terms of a much smaller recovery distance in all cases, suggesting that PAMA is capable of recovering the detailed ranking of relevant entities that is missed by BARD. In fact, since BARD relies only on ρiP(Ii=1τ1,,τm)\rho_{i}\triangleq P(I_{i}=1\mid\tau_{1},\cdots,\tau_{m}) to rank entities, in cases where the signal to distinguish the relevant and background entities is strong, many ρi\rho_{i}’s are very close to 1, resulting in a nearly “random” ranking among the top relevant entities. Theoretically, if all relevant entities are recognized correctly but ranked randomly, the corresponding recovery distance would increase with n1n_{1} in an order of 𝒪(n12)\mathcal{O}({n_{1}}^{2}) (the expected Kendall tau distance between a uniformly random permutation of n1n_{1} items and the true order is n1(n11)/4n_{1}(n_{1}-1)/4), which matches well with the increasing trend of the recovery distance of BARD shown in Table 3.

We also tested the model’s performance when there is a true n1n_{1} but it is mis-specified in our algorithm. We varied the specified n1n_{1} as 8,108,10 and 1818, respectively, for setting 𝒮HS1\mathcal{S}_{HS_{1}} with nn=100 and mm=10, where the true n1n_{1}=10 (the first ten entities). Figure 6 shows boxplots of \mathcal{I} for each mis-specified case. For visualization purposes, we only show the boxplots of E1E_{1} to E20E_{20}; the other entities exhibit a pattern similar to that of E20E_{20}. The figure shows a robust behavior of PAMAB as we vary the specification of n1n_{1}. It also shows that the results are slightly better if we specify an n1n_{1} that is moderately larger than its true value. Consistent results for other mis-specified cases, such as 5,12,155,12,15, can be found in the Supplementary Material.

Refer to caption
Figure 6: Boxplots of the estimated \mathcal{I} from 500 replications under the setting of 𝒮HS1\mathcal{S}_{HS_{1}} with n1n_{1} being set as 8, 10 and 1818, respectively. The true n1n_{1} is 1010. The vertical lines separate relevant entities (left) from background ones. The Y-axis shows the logarithm of the entities’ ranks. The rank of a background entity is replaced by the average background rank 100+10+12\frac{100+10+1}{2}. The triangle denotes the mean rank of each entity.

6 Real Data Applications

6.1 Aggregating Rankings of NBA Teams

We applied PAMAB to the NBA-team data analyzed in Deng et al., (2014), and compared it to competing methods in the literature. The NBA-team data set contains 34 “predictive” power rankings of the 30 NBA teams in the 2011-2012 season. The 34 “predictive” rankings were obtained from 6 professional rankers (sports websites) and 28 amateur rankers (college students), and the data quality varies significantly across rankers. More details of the dataset can be found in Table 1 of Deng et al., (2014).

Refer to caption
Figure 7: Results from PAMAB for the NBA-team dataset. (a) boxplots of posterior samples of γk\gamma_{k}. (b) barplots of ¯i\bar{\mathcal{I}}_{i} where the vertical line divides the NBA teams in Western and Eastern Conferences. (c) the trace plot of the log-likelihood function.

Figure 7 displays the results obtained by PAMAB (the partial-list version with n1n_{1} specified as 16). Figure 7 (a) shows the posterior distributions, as boxplots, of the quality parameter of each involved ranker. Figure 7 (b) shows the posterior distribution of the aggregated power ranking of each NBA team. A posterior sample that reports “0” for the rank of an entity indicates that the entity is assigned to the background group; for visualization purposes, we replace “0” by the average background rank, n+n1+12=30+16+12=23.5\frac{n+n_{1}+1}{2}=\frac{30+16+1}{2}=23.5. The final set of 16 playoff teams is shown in blue, while the champion of that season (i.e., the Heat) is shown in red. Figure 7 (c) shows the trace plot of the log-likelihood of the PAMA model along the MCMC iterations. Comparing the results to Figure 8 of Deng et al., (2014), we observe the following facts: (1) PAMAB can successfully discover the quality differences of rankers, as BARD does; (2) PAMAB can not only pick up the relevant entities effectively like BARD, but also rank the discovered relevant entities reasonably well, which cannot be achieved by BARD; (3) PAMAB converges quickly in this application.

We also applied other methods, including MM, EMM and Markov-chain-based methods, to this data set. We found that none of these methods could discern the quality difference of rankers as successfully as PAMA and BARD. Moreover, using the team ranking at the end of the regular season as the surrogate true power ranking of these NBA teams in that season, we found that PAMA also outperformed BARD and EMM by reporting an aggregated ranking list that is the closest to the truth. Table 4 provides the detailed aggregated ranking lists inferred by BARD, EMM and PAMA, respectively, as well as their coverage of and Kendall τ\tau distance from the surrogate truth. Note that the Kendall τ\tau distance is calculated for the Eastern and Western teams separately because the NBA Playoffs proceed in the Eastern Conference and the Western Conference in parallel until the NBA Finals, in which the two conference champions compete for the NBA title, making it difficult to validate the rankings between Eastern and Western teams.

Ranking Surrogate truth BARD EMM PAMA
Eastern Western Eastern Western Eastern Western Eastern Western
1 Bulls Spurs Heat Thunder Heat Thunder Heat Thunder
2 Heat Thunder Bulls Mavericks Bulls Mavericks Bulls Mavericks
3 Pacers Lakers Celtics Clippers Knicks Clippers Celtics Lakers
4 Celtics Grizzlies Knicks Lakers Celtics Lakers Knicks Clippers
5 Hawks Clippers Magic Spurs Magic Spurs Magic Spurs
6 Magic Nuggets Pacers Grizzlies Pacers Grizzlies Hawks Nuggets
7 Knicks Mavericks 76ers Nuggets 76ers Nuggets Pacers Grizzlies
8 76ers Jazz Hawks Jazz Hawks Jazz 76ers Jazz
Kendall τ\tau - - 14.5 10.5 9 10 8 10
Coverage - 1516\frac{15}{16} 1416\frac{14}{16} 1516\frac{15}{16}
Table 4: Aggregated power ranking of the NBA teams inferred by BARD, EMM, and PAMA, respectively, and the corresponding coverage of and the Kendall τ\tau distance from the surrogate true rank based on the performances of these teams in the regular season. The teams in italic indicate that they have equal posterior probabilities of being in the relevant group, and the teams with asterisk are those that were misclassified to the background group.

6.2 Aggregating Rankings of NFL Quarterback Players with the Presence of Covariates

Our next application targets the NFL-player data reported by Li et al., (2021). The NFL-player data contains 13 predictive power rankings of 24 NFL quarterback players. The rankings were produced by 13 experts based on the performance of the 24 NFL players during the first 12 weeks of the 2014 season. The dataset also contains covariates for each player summarizing his performance during this period, including the number of games played (G), pass completion percentage (Pct), the number of passing attempts per game (Att), average yards per carry (Avg), total receiving yards (Yds), average passing yards per attempt (RAvg), the touchdown percentage (TD), the interception percentage (Int), running attempts per game (RAtt), running yards per attempt (RYds) and the running first down percentage (R1st). Details of the dataset can be found in Table 2 of Li et al., (2021).

Refer to caption
Figure 8: Key results from PAMAB for the NFL-player dataset. (a) Boxplots of posterior samples of γk\gamma_{k}. (b) Barplots of ¯i\bar{\mathcal{I}}_{i}. (c) Trace plot of the log-likelihood. (d) Barplots of posterior probabilities for each coefficient to be positive.

Here, we set n1=12n_{1}=12 in order to find which players are above average. We analyzed the NFL-player data with PAMAB (the covariate-assisted version) and the results are summarized in Figure 8: (a) the posterior boxplots of the quality parameter for all the rankers; (b) the barplot of i¯\bar{\mathcal{I}_{i}} for all the NFL players in descending order; (c) the traceplot of the log-likelihood of the model; and (d) the barplot of the probabilities P(ψj>0)P({\psi}_{j}>0), with the covariates arranged from left to right in decreasing order of this probability. From Figure 8 (a), we observe that rankers 1, 3, 4 and 5 are generally less reliable than the other rankers. In the study of the same dataset in Li et al., (2021), the authors assumed that the 13 rankers fall into three quality levels, and reported that seven rankers (i.e., 2, 6, 7, 8, 9, 10 and 13) are of a higher quality than the other six (see Figure 7 of Li et al., (2021)). Interestingly, according to Figure 8 (a), the PAMA algorithm suggested exactly the same set of high-quality rankers. Moreover, ranker 2 has the lowest quality among the seven high-quality rankers in both studies. From Figure 8 (b), a consensus ranking list can be obtained, and our result is consistent with that of Figure 6 in Li et al., (2021). Figure 8 (d) shows that six covariates are more likely to have positive effects.

Ranking Gold standard BARD EMM PAMA
1 Aaron R. Andrew L. Andrew L. Andrew L.
2 Andrew L. Aaron R. Aaron R. Aaron R.
3 Ben R. Tom B. Tom B. Tom B.
4 Drew B. Drew B. Ben R. Drew B.
5 Russell W. Ben R. Drew B. Ben R.
6 Matt R. Ryan T. Ryan T. Ryan T.
7 Ryan T. Russell W. Russell W. Russell W.
8 Tom B. Philip R.* Philip R.* Philip R.
9 Eli M. Eli M.* Eli M.* Eli M.*
10 Philip R. Matt R.* Matt R.* Matt R.*
R-distance - 35.5 32 25
Coverage - 0.7 0.7 0.8
Table 5: Top players in the aggregated rankings inferred by BARD, EMM and PAMA. The entities in italic indicate that they have equal posterior probabilities of being in the relevant group, and the players with asterisk are those that were mis-classified to the background group.

Using the Fantasy points of the players (https://fantasy.nfl.com/research/players) derived at the end of the 2014 NFL season as the surrogate truth, the recovery distance and coverage of the aggregated rankings produced by the different approaches can be calculated to evaluate their performance. Because the Fantasy points of two top NFL players, Peyton Manning and Tony Romo, are missing for unknown reasons, we excluded them from the analysis and only report results for the top 10 positions instead of the top 12. Table 5 summarizes the results, demonstrating that PAMA outperformed the other two methods.

7 Conclusion and Discussion

The proposed Partition-Mallows model embeds the classic Mallows model into a partition modeling framework developed earlier by Deng et al., (2014), which is analogous to the well-known “spike-and-slab” mixture distribution often employed in Bayesian variable selection. Such a nontrivial “mixture” combines the strengths of both the Mallows model and BARD’s partition framework, leading to a stronger rank aggregation method that can not only learn quality variation of rankers and distinguish relevant entities from background ones effectively, but also provide an accurate ranking estimate of the discovered relevant entities. Compared to other frameworks in the literature for rank aggregation with heterogeneous rankers, the Partition-Mallows model achieves more accurate results with better interpretability at the price of a moderate increase in computational burden. We also show that the Partition-Mallows framework can easily handle partial lists and incorporate covariates in the analysis.

Throughout this work, we assume that the number of relevant entities n1n_{1} is known. This is reasonable in many practical problems where a specific n1n_{1} can be readily determined according to research demands. Empirically, we found that the ranking results are insensitive to the choice of n1n_{1} over a wide range of values. If needed, we may also determine n1n_{1} according to a model selection criterion, such as AIC or BIC.

In the PAMA model, π(τk0τk01)\pi(\tau_{k}^{0}\mid\tau_{k}^{0\mid 1}) is assumed to be a uniform distribution. If the detailed ranking of the background entities is of interest, we can modify the conditional distribution π(τk0τk01)\pi(\tau_{k}^{0}\mid\tau_{k}^{0\mid 1}) to be the Mallows model or other models. A quality parameter can still be incorporated to control the interaction between relevant entities and background entities. The resulting likelihood function becomes more complicated, but is still tractable.

In practice, the assumption of independent rankers may be violated. In the literature, a few approaches have been proposed to detect and handle dependent rankers. For example, Deng et al., (2014) proposed a hypothesis-testing-based framework to detect pairs of over-correlated rankers and a hierarchical model to accommodate clusters of dependent rankers; Johnson et al., (2020) adopted an extended Dirichlet process and a similar hierarchical model to achieve simultaneous ranker clustering and rank aggregation inference. Similar ideas can be incorporated in the PAMA model as well to deal with non-independent rankers.

Acknowledgement

We thank Miss Yuchen Wu for helpful discussions at the early stage of this work and the two reviewers for their insightful comments and suggestions that helped us improve the paper greatly. This research is supported in part by the National Natural Science Foundation of China (Grants 11771242 & 11931001), Beijing Academy of Artificial Intelligence (Grant BAAI2019ZD0103), and the National Science Foundation of USA (Grants DMS-1903139 and DMS-1712714). The author Wanchuang Zhu is partially supported by the Australian Research Council (Data Analytics for Resources and Environments, Grant IC190100031).

Supplementary Materials

Appendix A Numerical Support for the Power-Law Model of τk01(i)\tau_{k}^{0\mid 1}(i)

Figure 9 provides numerical evidence to support the power-law model for τk01(i)\tau_{k}^{0\mid 1}(i) in different scenarios, in a similar spirit to Figure 2 of Deng et al., (2014): we generate each ranker τk\tau_{k} as the order of {Xk,1,,Xk,n}\{X_{k,1},\cdots,X_{k,n}\}, i.e., τk=sort(iU by Xk,i),\tau_{k}=sort(i\in U\mbox{ by }X_{k,i}\downarrow), where Xk,iX_{k,i} is generated from two different distributions Fk,0F_{k,0} and Fk,1F_{k,1} via the following mechanism:

Xk,iFk,0𝕀(iUB)+Fk,1𝕀(iUR),iU.X_{k,i}\sim F_{k,0}\cdot\mathbb{I}(i\in U_{B})+F_{k,1}\cdot\mathbb{I}(i\in U_{R}),\ \forall\ i\in U.

Figure 9 confirms that the log-log plot of tt versus h(t)=P(τk01(t)τk1;I)h(t)=P(\tau_{k}^{0\mid 1}(t)\mid\tau_{k}^{1};I) is also nearly linear (when tt is not too large), suggesting that the power-law model is an acceptable approximation of reality.

Refer to caption
(a) n1=30n_{1}=30.
Refer to caption
(b) n1=300n_{1}=300.
Figure 9: Log-log plots of relative reverse ranking τk01(i)=t\tau_{k}^{0\mid 1}(i)=t versus the corresponding probability h(t)=P(τk01(i)=tI)h(t)=P(\tau_{k}^{0\mid 1}(i)=t\mid I) under different scenarios. The values of h(t)h(t) are calculated by numerical integration.
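
As a quick complement to these plots, h(t) can also be approximated by plain Monte Carlo instead of the numerical integration used for Figure 9; the sketch below does so for a single background entity, with N(0,1) and N(2,1) as illustrative stand-ins for F_{k,0} and F_{k,1}.

```python
import numpy as np

def reverse_rank_freq(n1=30, mean_rel=2.0, n_rep=200_000, seed=1):
    """Monte Carlo estimate of h(t) = P(tau_k^{0|1}(i) = t) for one background entity i:
    relevant scores ~ N(mean_rel, 1) (stand-in for F_{k,1}), background score ~ N(0, 1)
    (stand-in for F_{k,0}); t = 1 + number of relevant entities scoring below entity i."""
    rng = np.random.default_rng(seed)
    rel = rng.normal(mean_rel, 1.0, size=(n_rep, n1))
    bg = rng.normal(0.0, 1.0, size=(n_rep, 1))
    t = 1 + (rel < bg).sum(axis=1)
    return np.bincount(t, minlength=n1 + 2)[1:] / n_rep   # h(1), ..., h(n1 + 1)

h = reverse_rank_freq()
# a near-linear log-log relationship between t and h(t) (for moderate t) supports the model
```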

Appendix B Numerical Evidence for the Assumption ϕk=ϕ×γk\phi_{k}=\phi\times\gamma_{k}

In this section, we provide numerical evidence to support the assumption ϕk=ϕ×γk\phi_{k}=\phi\times\gamma_{k} by showing that it approximates the reality reasonably well in more general settings. In the literature, the Thurstone hidden score model is widely used to generate ranked data. Specifying m=50m=50, n=100n=100, n1=10n_{1}=10 and =(1,,n1,0,,0)\mathcal{I}=(1,\cdots,n_{1},0,\cdots,0), we explored two simulation settings based on the hidden score model, namely 𝒮HS\mathcal{S}_{HS}^{{}^{\prime}} and 𝒮HS′′\mathcal{S}_{HS}^{{}^{\prime\prime}}, in which we assume

τk=sort(iUbySik)whereSikN(μik,1), 1km,\tau_{k}=sort(i\in U\ by\ S_{ik}\downarrow)\ \mbox{where}\ S_{ik}\sim N(\mu_{ik},1),\ \forall\ 1\leq k\leq m,

and specify in the 𝒮HS\mathcal{S}_{HS}^{{}^{\prime}} setting

μik={k×g,i>n1,(n1i)×k×s,in1;\mu_{ik}=\left\{\begin{array}[]{ll}-k\times g,&i>n_{1},\\ (n_{1}-i)\times k\times s,&i\leq n_{1};\\ \end{array}\right.

and in the 𝒮HS′′\mathcal{S}_{HS}^{{}^{\prime\prime}} setting

μik={k×g,i>n1,0,i=n1,μ(i+1)k+U[0,k×s],i<n1,\mu_{ik}=\left\{\begin{array}[]{ll}-k\times g,&i>n_{1},\\ 0,&i=n_{1},\\ \mu_{(i+1)k}+U[0,k\times s],&i<n_{1},\\ \end{array}\right.

where U[a,b]U[a,b] stands for a random number drawn from the Uniform distribution on interval [a,b][a,b]. Clearly, the quality of ranker τk\tau_{k} increases monotonically with the ranker index kk in both settings, and there is a clear gap in mean scores between background and relevant entities. The two hyper-parameters (s,g)(s,g) control the signal strength and ranker heterogeneity in the simulated data.

For each data set simulated from the above settings, given the true indicator vector \mathcal{I}, we can always fit a Mallows model to the rankings of the relevant entities in each τk\tau_{k} to get an estimate ϕ^k\hat{\phi}_{k}, and a power-law model to the relative rankings of the background entities among the relevant ones in each τk\tau_{k} to get an estimate γ^k\hat{\gamma}_{k}, leading to a set of paired estimates {(ϕ^k,γ^k)}1km\{(\hat{\phi}_{k},\hat{\gamma}_{k})\}_{1\leq k\leq m}. Figure 10 provides a graphical illustration of the estimated parameters {(ϕ^k,γ^k)}1km\{(\hat{\phi}_{k},\hat{\gamma}_{k})\}_{1\leq k\leq m} for typical simulated datasets from the 𝒮HS\mathcal{S}_{HS}^{{}^{\prime}} setting with different specifications of (s,g)(s,g), and Figure 11 shows the counterpart for the 𝒮HS′′\mathcal{S}_{HS}^{{}^{\prime\prime}} setting. From the figures, we can clearly see a strong linear trend between ϕ^k\hat{\phi}_{k} and γ^k\hat{\gamma}_{k} in all cases, suggesting that the assumed relationship approximates reality very well.

Refer to caption
Figure 10: The relationship between γ^k\hat{\gamma}_{k} and ϕ^k\hat{\phi}_{k} under the scenario 𝒮HS\mathcal{S}_{HS}^{{}^{\prime}} with different specifications of (s,g)(s,g).
Refer to caption
Figure 11: The relationship between γ^k\hat{\gamma}_{k} and ϕ^k\hat{\phi}_{k} under the scenario 𝒮HS′′\mathcal{S}_{HS}^{{}^{\prime\prime}} with different parameters.

Appendix C Proof of Theorem 1

Proof.

We start with the degenerate special case where m=1m=1. In this special case, the parameter vector 𝜽{\boldsymbol{\theta}} degenerates to a 3-dimensional vector 𝜽=(,ϕ,γ){\boldsymbol{\theta}}=(\mathcal{I},\phi,\gamma), and the PAMA model P(𝝉𝜽)P(\boldsymbol{\tau}\mid{\boldsymbol{\theta}}) defined in Equation (19) degenerates to a simpler form P(τ𝜽)P(\tau\mid{\boldsymbol{\theta}}) as defined in Equation (18) with τΩn\tau\in\Omega_{n}. To prove the identifiability of the degenerate PAMA model, we need to show that for any two proper parameter vectors 𝜽1=(1,ϕ1,γ1){\boldsymbol{\theta}}_{1}=(\mathcal{I}_{1},\phi_{1},\gamma_{1}) and 𝜽2=(2,ϕ2,γ2){\boldsymbol{\theta}}_{2}=(\mathcal{I}_{2},\phi_{2},\gamma_{2}) from the parameter space 𝚯\boldsymbol{\Theta}, we always have:

ifP(τ1,ϕ1,γ1)=P(τ2,ϕ2,γ2)forτΩn,then(1,ϕ1,γ1)=(2,ϕ2,γ2).\mbox{if}\ P(\tau\mid\mathcal{I}_{1},\phi_{1},\gamma_{1})=P(\tau\mid\mathcal{I}_{2},\phi_{2},\gamma_{2})\ \mbox{for}\ \forall\ \tau\in\Omega_{n},\ \mbox{then}\ (\mathcal{I}_{1},\phi_{1},\gamma_{1})=(\mathcal{I}_{2},\phi_{2},\gamma_{2}).

Here, we choose to prove the above conclusion by proving its equivalent counterpart:

for(1,ϕ1,γ1)(2,ϕ2,γ2),τΩn,s.t.P(τ1,ϕ1,γ1)P(τ2,ϕ2,γ2).for\ \forall\ (\mathcal{I}_{1},\phi_{1},\gamma_{1})\neq(\mathcal{I}_{2},\phi_{2},\gamma_{2}),\ \exists\ \tau\in\Omega_{n},\ s.t.\ P(\tau\mid\mathcal{I}_{1},\phi_{1},\gamma_{1})\neq P(\tau\mid\mathcal{I}_{2},\phi_{2},\gamma_{2}).

Apparently, the condition (1,ϕ1,γ1)(2,ϕ2,γ2)(\mathcal{I}_{1},\phi_{1},\gamma_{1})\neq(\mathcal{I}_{2},\phi_{2},\gamma_{2}) implies two possible scenarios:

(i)(ϕ1,γ1)(ϕ2,γ2),or(ii)(ϕ1,γ1)=(ϕ2,γ2)with12.(i)\ (\phi_{1},\gamma_{1})\neq(\phi_{2},\gamma_{2}),\ \mbox{or}\ (ii)\ (\phi_{1},\gamma_{1})=(\phi_{2},\gamma_{2})\ \mbox{with}\ \mathcal{I}_{1}\neq\mathcal{I}_{2}.

Below we discuss the two scenarios separately.

(i)(i) The scenario where (ϕ1,γ1)(ϕ2,γ2)(\phi_{1},\gamma_{1})\neq(\phi_{2},\gamma_{2}). Let

τ𝜽=argminτΩnP(τ𝜽)andτ𝜽=argminτΩnτ𝜽P(τ𝜽)\displaystyle\tau^{*}_{{\boldsymbol{\theta}}}=\arg\min_{\tau\in\Omega_{n}}P(\tau\mid{\boldsymbol{\theta}})\ \mbox{and}\ \tau^{**}_{{\boldsymbol{\theta}}}=\arg\min_{\tau\in\Omega_{n}\setminus\tau^{*}_{{\boldsymbol{\theta}}}}P(\tau\mid{\boldsymbol{\theta}}) (35)

be the rankings with the smallest and second smallest sampling probability among the ranking space Ωn\Omega_{n} given model parameter 𝜽{\boldsymbol{\theta}}. According to the likelihood function (18), it is easy to check that the solutions of the optimization problems in (35) depend on \mathcal{I} only and have the general form below:

τ𝜽=inv()andτ𝜽=swap(τ𝜽),\tau^{*}_{{\boldsymbol{\theta}}}=inv(\mathcal{I})\ \mbox{and}\ \tau^{**}_{{\boldsymbol{\theta}}}=swap_{\mathcal{I}}(\tau^{*}_{{\boldsymbol{\theta}}}),

where the operator inv()inv(\mathcal{I}) places the relevant entities defined by \mathcal{I} in the last n1n_{1} positions of the ranking list with their internal order reversed and assigns the background entities randomly to the other (nn1)(n-n_{1}) open positions, and the operator swap(τ𝜽)swap_{\mathcal{I}}(\tau^{*}_{{\boldsymbol{\theta}}}) swaps the positions of either two adjacent relevant entities defined by \mathcal{I} or the last relevant entity and an arbitrary background entity in τ𝜽\tau^{*}_{{\boldsymbol{\theta}}}. For instance, assuming that n=20n=20, n1=10n_{1}=10 and =(1,2,,n1,0,,0)\mathcal{I}=(1,2,\cdots,n_{1},0,\cdots,0), we have:

τ𝜽\displaystyle\tau^{*}_{{\boldsymbol{\theta}}} =\displaystyle= inv()=(20,19,,13,12,11,perm(10,,2,1)),\displaystyle\ inv(\mathcal{I})\ =\ \big{(}20,19,\cdots,13,12,11,perm(10,\cdots,2,1)\big{)},
τ𝜽\displaystyle\tau^{**}_{{\boldsymbol{\theta}}} =\displaystyle= swap(τ𝜽)=(20,19,,13,11,12,perm(10,,2,1)),\displaystyle swap_{\mathcal{I}}(\tau^{*}_{{\boldsymbol{\theta}}})=\big{(}20,19,\cdots,13,11,12,perm(10,\cdots,2,1)\big{)},
orτ𝜽\displaystyle or\ \tau^{**}_{{\boldsymbol{\theta}}} =\displaystyle= swap(τ𝜽)=(20,19,,13,12,10,perm(11,9,,2,1)),\displaystyle swap_{\mathcal{I}}(\tau^{*}_{{\boldsymbol{\theta}}})=\big{(}20,19,\cdots,13,12,10,perm(11,9,\cdots,2,1)\big{)},

where perm(S)perm(S) stands for a random permutation of the input sequence SS. Note that although both τ𝜽\tau^{*}_{{\boldsymbol{\theta}}} and τ𝜽\tau^{**}_{{\boldsymbol{\theta}}} have a lot of variations, the different variations correspond to the same sampling probability below:

p𝜽\displaystyle p^{*}_{{\boldsymbol{\theta}}} \displaystyle\triangleq P(τ𝜽𝜽)=1(nn1)!×(n1+1)(nn1)γ(Cγ,n1)nn1×exp{n1(n11)2ϕγ}Z(ϕγ)=h(ϕ,γ),\displaystyle P(\tau^{*}_{{\boldsymbol{\theta}}}\mid{\boldsymbol{\theta}})=\dfrac{1}{(n-n_{1})!}\times\dfrac{(n_{1}+1)^{-(n-n_{1})\gamma}}{(C^{*}_{\gamma,n_{1}})^{n-n_{1}}}\times\dfrac{\exp\left\{-\frac{n_{1}(n_{1}-1)}{2}\phi\gamma\right\}}{Z(\phi\gamma)}=h(\phi,\gamma),
p𝜽\displaystyle p^{**}_{{\boldsymbol{\theta}}} \displaystyle\triangleq P(τ𝜽𝜽)=h(ϕ,γ)×min{(nn1)×(1+1/n1)γ,exp{ϕγ/2}},\displaystyle P(\tau^{**}_{{\boldsymbol{\theta}}}\mid{\boldsymbol{\theta}})=h(\phi,\gamma)\times\min\{(n-n_{1})\times(1+1/n_{1})^{\gamma},\exp\{\phi\gamma/2\}\},

where Cγ,n1=t=1n1+1tγC^{*}_{\gamma,n_{1}}=\sum_{t=1}^{n_{1}+1}t^{-\gamma}. Further, define

r𝜽=p𝜽p𝜽=min{(nn1)×(1+1/n1)γ,exp{ϕγ/2}}=g(ϕ,γ).r^{*}_{\boldsymbol{\theta}}=\frac{p^{**}_{{\boldsymbol{\theta}}}}{p^{*}_{{\boldsymbol{\theta}}}}=\min\big{\{}(n-n_{1})\times(1+1/n_{1})^{\gamma},\exp\{\phi\gamma/2\}\big{\}}=g(\phi,\gamma).

It can be shown by contradiction that the following two equations cannot hold simultaneously when (ϕ1,γ1)(ϕ2,γ2)(\phi_{1},\gamma_{1})\neq(\phi_{2},\gamma_{2}):

h(ϕ1,γ1)=h(ϕ2,γ2)andg(ϕ1,γ1)=g(ϕ2,γ2).h(\phi_{1},\gamma_{1})=h(\phi_{2},\gamma_{2})\ \mbox{and}\ g(\phi_{1},\gamma_{1})=g(\phi_{2},\gamma_{2}).

Indeed, if g(ϕ1,γ1)=g(ϕ2,γ2)g(\phi_{1},\gamma_{1})=g(\phi_{2},\gamma_{2}), we would have either ϕ1γ1=ϕ2γ2\phi_{1}\gamma_{1}=\phi_{2}\gamma_{2} or γ1=γ2\gamma_{1}=\gamma_{2} based on the definition of the function g(ϕ,γ)g(\phi,\gamma); combining either case with h(ϕ1,γ1)=h(ϕ2,γ2)h(\phi_{1},\gamma_{1})=h(\phi_{2},\gamma_{2}) and the fact that h(ϕ,γ)h(\phi,\gamma) is monotonically decreasing with respect to ϕ\phi and γ\gamma would then force (ϕ1,γ1)=(ϕ2,γ2)(\phi_{1},\gamma_{1})=(\phi_{2},\gamma_{2}), a contradiction. Consequently, the above fact indicates that the following two equations cannot hold simultaneously:

P(τ𝜽1𝜽1)\displaystyle P(\tau^{*}_{{\boldsymbol{\theta}}_{1}}\mid{\boldsymbol{\theta}}_{1}) =\displaystyle= P(τ𝜽2𝜽2)andP(τ𝜽1𝜽1)=P(τ𝜽2𝜽2),\displaystyle P(\tau^{*}_{{\boldsymbol{\theta}}_{2}}\mid{\boldsymbol{\theta}}_{2})\ \mbox{and}\ P(\tau^{**}_{{\boldsymbol{\theta}}_{1}}\mid{\boldsymbol{\theta}}_{1})=P(\tau^{**}_{{\boldsymbol{\theta}}_{2}}\mid{\boldsymbol{\theta}}_{2}),

which means that the two distributions P(τ𝜽1)P(\tau\mid{\boldsymbol{\theta}}_{1}) and P(τ𝜽2)P(\tau\mid{\boldsymbol{\theta}}_{2}) are not identical. Therefore, there must exist τΩn\tau\in\Omega_{n} such that P(τ𝜽1)P(τ𝜽2)P(\tau\mid{\boldsymbol{\theta}}_{1})\neq P(\tau\mid{\boldsymbol{\theta}}_{2}).

(ii)(ii) The scenario where (ϕ1,γ1)=(ϕ2,γ2)(\phi_{1},\gamma_{1})=(\phi_{2},\gamma_{2}) but 12\mathcal{I}_{1}\neq\mathcal{I}_{2}. As τ𝜽1\tau^{*}_{{\boldsymbol{\theta}}_{1}} and τ𝜽2\tau^{*}_{{\boldsymbol{\theta}}_{2}} are the minimizers of P(τ𝜽1)P(\tau\mid{\boldsymbol{\theta}}_{1}) and P(τ𝜽2)P(\tau\mid{\boldsymbol{\theta}}_{2}), and τ𝜽1τ𝜽2\tau^{*}_{{\boldsymbol{\theta}}_{1}}\neq\tau^{*}_{{\boldsymbol{\theta}}_{2}} in this case, we have

P(τ𝜽1𝜽1)=h(ϕ1,γ1)=h(ϕ2,γ2)=P(τ𝜽2𝜽2)<P(τ𝜽1𝜽2).P(\tau^{*}_{{\boldsymbol{\theta}}_{1}}\mid{\boldsymbol{\theta}}_{1})=h(\phi_{1},\gamma_{1})=h(\phi_{2},\gamma_{2})=P(\tau^{*}_{{\boldsymbol{\theta}}_{2}}\mid{\boldsymbol{\theta}}_{2})<P(\tau^{*}_{{\boldsymbol{\theta}}_{1}}\mid{\boldsymbol{\theta}}_{2}).

Similarly, we have

P(τ𝜽2𝜽1)>P(τ𝜽2𝜽2).P(\tau^{*}_{{\boldsymbol{\theta}}_{2}}\mid{\boldsymbol{\theta}}_{1})>P(\tau^{*}_{{\boldsymbol{\theta}}_{2}}\mid{\boldsymbol{\theta}}_{2}).

Therefore, there exists τΩn\tau\in\Omega_{n} (e.g., τ𝜽1\tau^{*}_{{\boldsymbol{\theta}}_{1}} or τ𝜽2\tau^{*}_{{\boldsymbol{\theta}}_{2}}) such that P(τ𝜽1)P(τ𝜽2)P(\tau\mid{\boldsymbol{\theta}}_{1})\neq P(\tau\mid{\boldsymbol{\theta}}_{2}). Combining the above two scenarios, we conclude that fact (20) holds for m=1m=1.

Next, we prove the more general cases where m>1m>1 by mathematical induction: assuming that (20) holds for all mMm\leq M, we will prove that it also holds for m=M+1m=M+1. Considering that for any m>1m>1, P(𝝉𝜽1)=P(𝝉𝜽2)P(\boldsymbol{\tau}\mid{\boldsymbol{\theta}}_{1})=P(\boldsymbol{\tau}\mid{\boldsymbol{\theta}}_{2}) implies:

k=1mP(τk1,ϕ1,γ1k)=k=1mP(τk2,ϕ2,γ2k),\prod_{k=1}^{m}P(\tau_{k}\mid\mathcal{I}_{1},\phi_{1},\gamma_{1k})=\prod_{k=1}^{m}P(\tau_{k}\mid\mathcal{I}_{2},\phi_{2},\gamma_{2k}),

where 𝝉=(τ1,,τm)\boldsymbol{\tau}=(\tau_{1},\cdots,\tau_{m}) and 𝜽t=(t,ϕt,𝜸t){\boldsymbol{\theta}}_{t}=(\mathcal{I}_{t},\phi_{t},\boldsymbol{\gamma}_{t}) with 𝜸t={γtk}1km\boldsymbol{\gamma}_{t}=\{\gamma_{tk}\}_{1\leq k\leq m}, the condition in Theorem 1 for m=M+1m=M+1 suggests that

k=1M+1P(τk1,ϕ1,γ1k)=k=1M+1P(τk2,ϕ2,γ2k),𝝉ΩnM+1.\prod_{k=1}^{M+1}P(\tau_{k}\mid\mathcal{I}_{1},\phi_{1},\gamma_{1k})=\prod_{k=1}^{M+1}P(\tau_{k}\mid\mathcal{I}_{2},\phi_{2},\gamma_{2k}),\ \forall\ \boldsymbol{\tau}\in\Omega_{n}^{M+1}. (36)

Define 𝝉[j]=(τ1,,τj1,τj+1,,τM+1)\boldsymbol{\tau}_{[-j]}=(\tau_{1},\cdots,\tau_{j-1},\tau_{j+1},\cdots,\tau_{M+1}), 𝜸t[j]={γtk}kj\boldsymbol{\gamma}_{t[-j]}=\{\gamma_{tk}\}_{k\neq j}, 𝜽t[j]=(t,ϕt,𝜸t[j]){\boldsymbol{\theta}}_{t[-j]}=(\mathcal{I}_{t},\phi_{t},\boldsymbol{\gamma}_{t[-j]}) and

P(𝝉[j]𝜽t[j])=kjP(τkt,ϕt,γtk).P(\boldsymbol{\tau}_{[-j]}\mid{\boldsymbol{\theta}}_{t[-j]})=\prod_{k\neq j}P(\tau_{k}\mid\mathcal{I}_{t},\phi_{t},\gamma_{tk}).

For 1jM+1\forall\ 1\leq j\leq M+1, equation (36) can be expressed alternatively as below:

P(𝝉[j]𝜽1[j])P(τj1,ϕ1,γ1j)=P(𝝉[j]𝜽2[j])P(τj2,ϕ2,γ2j),𝝉ΩnM+1.P(\boldsymbol{\tau}_{[-j]}\mid{\boldsymbol{\theta}}_{1[-j]})\cdot P(\tau_{j}\mid\mathcal{I}_{1},\phi_{1},\gamma_{1j})=P(\boldsymbol{\tau}_{[-j]}\mid{\boldsymbol{\theta}}_{2[-j]})\cdot P(\tau_{j}\mid\mathcal{I}_{2},\phi_{2},\gamma_{2j}),\ \forall\ \boldsymbol{\tau}\in\Omega_{n}^{M+1}.

For any fixed 𝝉[j]\boldsymbol{\tau}_{[-j]}, summing over all equations of the above form for all 𝝉\boldsymbol{\tau} that are compatible with 𝝉[j]\boldsymbol{\tau}_{[-j]}, we have

P(𝝉[j]𝜽1[j])τjΩnP(τj1,ϕ1,γ1k)=P(𝝉[j]𝜽2[j])τjΩnP(τj2,ϕ2,γ2k)P(\boldsymbol{\tau}_{[-j]}\mid{\boldsymbol{\theta}}_{1[-j]})\cdot\sum_{\tau_{j}\in\Omega_{n}}P(\tau_{j}\mid\mathcal{I}_{1},\phi_{1},\gamma_{1k})=P(\boldsymbol{\tau}_{[-j]}\mid{\boldsymbol{\theta}}_{2[-j]})\cdot\sum_{\tau_{j}\in\Omega_{n}}P(\tau_{j}\mid\mathcal{I}_{2},\phi_{2},\gamma_{2k})

Considering that

τjΩnP(τj1,ϕ1,γ1k)=τjΩnP(τj2,ϕ2,γ2k)=1,\sum_{\tau_{j}\in\Omega_{n}}P(\tau_{j}\mid\mathcal{I}_{1},\phi_{1},\gamma_{1k})=\sum_{\tau_{j}\in\Omega_{n}}P(\tau_{j}\mid\mathcal{I}_{2},\phi_{2},\gamma_{2k})=1,

we have

P(𝝉[j]𝜽1[j])=P(𝝉[j]𝜽2[j]),𝝉[j]ΩnM,P(\boldsymbol{\tau}_{[-j]}\mid{\boldsymbol{\theta}}_{1[-j]})=P(\boldsymbol{\tau}_{[-j]}\mid{\boldsymbol{\theta}}_{2[-j]}),\ \forall\ \boldsymbol{\tau}_{[-j]}\in\Omega_{n}^{M},

which, by the induction hypothesis, indicates that

𝜽1[j]=𝜽2[j],j=1,,M+1.{\boldsymbol{\theta}}_{1[-j]}={\boldsymbol{\theta}}_{2[-j]},\ \forall\ j=1,\cdots,M+1.

Thus, we have 𝜽1=𝜽2{\boldsymbol{\theta}}_{1}={\boldsymbol{\theta}}_{2}, and the proof is complete. ∎

Appendix D Proof of Lemma 1

Proof.

The function ei(ϕ)e_{i}(\phi) is given as

ei(ϕ)\displaystyle e_{i}(\phi) =\displaystyle= 𝔼[τ(i)γ]𝑑F(γ)=j=1njP(τ(i)=j,ϕ,γ)dF(γ),\displaystyle\int\mathbb{E}\big{[}\tau(i)\mid\gamma\big{]}dF(\gamma)=\int\sum_{j=1}^{n}jP(\tau(i)=j\mid\mathcal{I},\phi,\gamma)dF(\gamma), (37)

where

P(τ(i)=j,ϕ,γ)=k=max(1,jn+n1)min(n1,j)P(τ1(i)=k)P(eUB𝕀(τ0|1(e)n1+2k)=jk),\displaystyle P(\tau(i)=j\mid\mathcal{I},\phi,\gamma)=\sum_{k=\max(1,j-n+n_{1})}^{\min(n_{1},j)}P(\tau^{1}(i)=k)P\left(\sum_{e\in U_{B}}\mathbb{I}(\tau^{0|1}(e)\geq n_{1}+2-k)=j-k\right),
P(τ1(i)=k)=eϕγ(ik)l=0n11eϕγl,kn1,\displaystyle P(\tau^{1}(i)=k)=\dfrac{e^{-\phi\gamma(i-k)}}{\sum_{l=0}^{n_{1}-1}e^{-\phi\gamma l}},~{}~{}~{}k\leq n_{1},
P(eUB𝕀(τ0|1(e)n1+2k)=jk)=(nn1jk)(ln1+2klγl=1n1+1lγ)jk(ln1+1klγl=1n1+1lγ)n,\displaystyle P\left(\sum_{e\in U_{B}}\mathbb{I}(\tau^{0|1}(e)\geq n_{1}+2-k)=j-k\right)=\begin{pmatrix}n-n_{1}\\ j-k\end{pmatrix}\left(\dfrac{\sum_{l\geq n_{1}+2-k}l^{-\gamma}}{\sum_{l=1}^{n_{1}+1}l^{-\gamma}}\right)^{j-k}\left(\dfrac{\sum_{l\leq n_{1}+1-k}l^{-\gamma}}{\sum_{l=1}^{n_{1}+1}l^{-\gamma}}\right)^{n^{{}^{\prime}}},
n=nn1j+k.\displaystyle n^{{}^{\prime}}=n-n_{1}-j+k.

As P(eUB𝕀(τ0|1(e)n1+2k)=jk)P\left(\sum_{e\in U_{B}}\mathbb{I}(\tau^{0|1}(e)\geq n_{1}+2-k)=j-k\right) is a constant with respect to ϕ\phi, we denote g(γ)P(eUB𝕀(τ0|1(e)n1+2k)=jk)g(\gamma)\triangleq P\left(\sum_{e\in U_{B}}\mathbb{I}(\tau^{0|1}(e)\geq n_{1}+2-k)=j-k\right). Thus,

ei(ϕ)\displaystyle e_{i}(\phi) =\displaystyle= j=1njP(τ(i)=j,ϕ,γ)dF(γ)\displaystyle\int\sum_{j=1}^{n}jP(\tau(i)=j\mid\mathcal{I},\phi,\gamma)dF(\gamma) (38)
=\displaystyle= j=1njk=max(1,jn+n1)min(n1,j)eϕγ(ik)l=0n11eϕγlg(γ)dF(γ).\displaystyle\int\sum_{j=1}^{n}j\sum_{k=\max(1,j-n+n_{1})}^{\min(n_{1},j)}\dfrac{e^{-\phi\gamma(i-k)}}{\sum_{l=0}^{n_{1}-1}e^{-\phi\gamma l}}g(\gamma)dF(\gamma).

From Equation (38), we can see that ei(ϕ)e_{i}(\phi) is a continuous function of ϕ\phi for iURi\in U_{R} that does not oscillate infinitely often, and that it degenerates to a constant e0e_{0} for iUBi\in U_{B}. Thus, the equation ei(ϕ)=e0e_{i}(\phi)=e_{0} has only finitely many solutions in [0,ϕmax)[0,\phi_{max}). Let 𝒮i\mathcal{S}_{i} be the set of solutions of ei(ϕ)=e0e_{i}(\phi)=e_{0} for in1i\leq n_{1} and 𝒮=i=1n1𝒮i\mathcal{S}=\cup_{i=1}^{n_{1}}\mathcal{S}_{i}. We have Ω~ϕ=Ωϕ𝒮\tilde{\Omega}_{\phi}=\Omega_{\phi}-\mathcal{S}. ∎

Appendix E Proof of Theorem 3

Proof.

Based on the classic theory for MLE consistency (Wald,, 1949), to prove that (ϕ^,ψ^)(\hat{\phi}_{\mathcal{I}},\hat{\psi}_{\mathcal{I}}) are consistent, we only need to show that the following regularity conditions hold for the PAMA-H model: (1) the probability measure Lk(ϕ,ψ)=P(τk,ϕ,γk)𝑑Fψ(γk)L_{k}(\phi,\psi)=\int P(\tau_{k}\mid\mathcal{I},\phi,\gamma_{k})dF_{\psi}(\gamma_{k}) has the same support for all (ϕ,ψ)Ωϕ×Ωψ(\phi,\psi)\in\Omega_{\phi}\times\Omega_{\psi}; (2) the true parameter (ϕ0,ψ0)(\phi_{0},\psi_{0}) is an interior point of the parameter space Ωϕ×Ωψ\Omega_{\phi}\times\Omega_{\psi}; (3) the log-likelihood l(ϕ,ψ)l(\phi,\psi\mid\mathcal{I}) is differentiable with respect to ϕ\phi and ψ\psi; and (4) the MLE (ϕ^,ψ^)(\hat{\phi},\hat{\psi}) is the unique solution of the score equations δl(ϕ,ψ)δϕ=0\frac{\delta l(\phi,\psi)}{\delta\phi}=0 and δl(ϕ,ψ)δψ=0\frac{\delta l(\phi,\psi)}{\delta\psi}=0.

It is easy to check that l(ϕ,ψ)l(\phi,\psi\mid\mathcal{I}) satisfies regularity conditions (1)-(3). Regularity condition (4) is verified by showing that l(ϕ,ψ)l(\phi,\psi\mid\mathcal{I}) is a concave function with respect to ϕ\phi and to ψ\psi, respectively. The log-likelihood of PAMA-H is

l(ϕ,ψ)\displaystyle l(\phi,\psi\mid\mathcal{I}) =\displaystyle= k=1mlk(ϕ,ψ)=k=1mlog(P(τk,ϕ,γk)𝑑Fψ(γk))\displaystyle\sum_{k=1}^{m}l_{k}(\phi,\psi\mid\mathcal{I})=\sum_{k=1}^{m}\log\left(\int P(\tau_{k}\mid\mathcal{I},\phi,\gamma_{k})dF_{\psi}(\gamma_{k})\right)
=\displaystyle= k=1mlog(0exp{ϕγkdτ(τk,+)}(1eϕγk)n11t=2n1(1etϕγk)fψ(γk)G(γk)𝑑γk)\displaystyle\sum_{k=1}^{m}\log\left(\int_{0}^{\infty}\frac{\exp\{-\phi\gamma_{k}d_{\tau}(\tau_{k},\mathcal{I}^{+})\}(1-e^{-\phi\gamma_{k}})^{n_{1}-1}}{\prod_{t=2}^{n_{1}}(1-e^{-t\phi\gamma_{k}})}\frac{f_{\psi}(\gamma_{k})}{G(\gamma_{k})}d\gamma_{k}\right)
=\displaystyle= k=1mlog(0R(ϕ,γk)fψ(γk)G(γk)𝑑γk)=k=1mlog(Hk(ϕ,ψ)),\displaystyle\sum_{k=1}^{m}\log\left(\int_{0}^{\infty}R(\phi,\gamma_{k})\frac{f_{\psi}(\gamma_{k})}{G(\gamma_{k})}d\gamma_{k}\right)=\sum_{k=1}^{m}\log(H_{k}(\phi,\psi)),

where Hk(ϕ,ψ)=0R(ϕ,γk)fψ(γk)G(γk)𝑑γkH_{k}(\phi,\psi)=\int_{0}^{\infty}R(\phi,\gamma_{k})\frac{f_{\psi}(\gamma_{k})}{G(\gamma_{k})}d\gamma_{k}, R(ϕ,γk)=exp{ϕγkdτ(τk,+)}(1eϕγk)n11t=2n1(1etϕγk)R(\phi,\gamma_{k})=\frac{\exp\{-\phi\gamma_{k}d_{\tau}(\tau_{k},\mathcal{I}^{+})\}(1-e^{-\phi\gamma_{k}})^{n_{1}-1}}{\prod_{t=2}^{n_{1}}(1-e^{-t\phi\gamma_{k}})} and G(γk)=Aτk,I(Bτk,I)γk(Cγk,n1)n0G(\gamma_{k})=A^{*}_{\tau_{k},I}(B^{*}_{\tau_{k},I})^{\gamma_{k}}(C^{*}_{\gamma_{k},n_{1}})^{n_{0}}. According to properties of the Mallows model, it is easy to check that 2R(ϕ,γk)2ϕ<0\frac{\partial^{2}R(\phi,\gamma_{k})}{\partial^{2}\phi}<0, and thus R(ϕ,γk)R(\phi,\gamma_{k}) is a concave function with respect to ϕ\phi. As fψ(γk)G(γk)>0\frac{f_{\psi}(\gamma_{k})}{G(\gamma_{k})}>0, Hk(ϕ,ψ)H_{k}(\phi,\psi), being a mixture of concave functions with nonnegative weights, preserves the concavity of R(ϕ,γk)R(\phi,\gamma_{k}). Moreover, l(ϕ,ψ)l(\phi,\psi\mid\mathcal{I}) is obtained from the Hk(ϕ,ψ)H_{k}(\phi,\psi)’s by taking logarithms and summing, and this composition also preserves concavity. Therefore, l(ϕ,ψ)l(\phi,\psi\mid\mathcal{I}) is a concave function with respect to ϕ\phi. Similarly, given that fψ(γk)f_{\psi}(\gamma_{k}) is a concave function of ψ\psi, it is easy to show that l(ϕ,ψ)l(\phi,\psi\mid\mathcal{I}) is a concave function with respect to ψ\psi. Therefore, regularity condition (4) is satisfied, and the proof is complete. ∎

Appendix F Details of the Gauss-Seidel Iterative Optimization

Let 𝜽(s)=((s),ϕ(s),𝜸(s)){\boldsymbol{\theta}}^{(s)}=(\mathcal{I}^{(s)},\phi^{(s)},\boldsymbol{\gamma}^{(s)}) be the maximizer obtained at cycle ss. The Gauss-Seidel method works as follows:

1. Update γk,k=1,,m\gamma_{k},k=1,\cdots,m. Define g(γk)g(\gamma_{k}) as the partial function of the log-likelihood with respect to γk\gamma_{k} given the other parameters; a Newton-like method is adopted to update γk\gamma_{k} from γk(s)\gamma_{k}^{(s)} to γk(s+1)\gamma_{k}^{(s+1)}.

2. Update ϕ\phi. Define g(ϕ)g(\phi) as the partial function of the log-likelihood with respect to ϕ\phi given the other parameters; similarly, a Newton-like method is adopted to update ϕ\phi from ϕ(s)\phi^{(s)} to ϕ(s+1)\phi^{(s+1)}.

3. Update \mathcal{I}. Let g()g(\mathcal{I}) denote the log-likelihood as a function of \mathcal{I} with the other parameters fixed. We randomly select two neighboring entities and swap their rankings, keeping the swap if g(𝜸(s+1),ϕ(s+1))g(\mathcal{I}\mid\boldsymbol{\gamma}^{(s+1)},\phi^{(s+1)}) increases.

The procedure starts from a random guess of all the parameters and then repeats Steps 1-3 until the log-likelihood converges. Convergence is declared when the difference in the log-likelihood between two consecutive cycles is less than 0.1. The main difficulty of the procedure lies in the update of \mathcal{I}. In practice, the starting point of \mathcal{I} can be a quick preliminary estimate, such as one obtained from the Mallows model.

In each cycle of the Gauss-Seidel update for finding the MLE, a partial function of the log-likelihood has to be defined. Suppose the current cycle index is ss; the detailed computation of 𝜽(s+1){\boldsymbol{\theta}}^{(s+1)} is given below.

F.1 Update γk\gamma_{k}

Define g(γk)=log(f(γk|𝜸[1:k1](s+1),𝜸[k+1:m](s),(s),ϕ(s)))g(\gamma_{k})=\log(f(\gamma_{k}|\boldsymbol{\gamma}_{[1:k-1]}^{(s+1)},\boldsymbol{\gamma}_{[k+1:m]}^{(s)},\mathcal{I}^{(s)},\phi^{(s)})). Thus, for γk[0,10]\gamma_{k}\in[0,10],

g(γk)\displaystyle g(\gamma_{k}) =\displaystyle= log(1(Bτk,I)γk×(Cγk,n1)nn1×(Dτk,)ϕγk×Eϕ,γk)\displaystyle\log\left(\frac{1}{(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(C^{*}_{\gamma_{k},n_{1}})^{n-n_{1}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}\times E^{*}_{\phi,\gamma_{k}}}\right) (39)
=\displaystyle= γk×log(Bτk,I)(nn1)×log(Cγk,n1)ϕ(s)×γk×log(Dτk,)log(Eϕ,γk).\displaystyle-\gamma_{k}\times\log(B^{*}_{\tau_{k},I})-(n-n_{1})\times\log(C^{*}_{\gamma_{k},n_{1}})-\phi^{(s)}\times\gamma_{k}\times\log(D^{*}_{\tau_{k},\mathcal{I}})-\log(E^{*}_{\phi,\gamma_{k}}).

The updating equation is given as

γk(s+1)=γk(s)αγkg(γk)g′′(γk),\displaystyle\gamma_{k}^{(s+1)}=\gamma_{k}^{(s)}-\alpha_{\gamma_{k}}\frac{g^{\prime}(\gamma_{k})}{g^{\prime\prime}(\gamma_{k})}, (40)

where αγk\alpha_{\gamma_{k}} is a step-length parameter that can be tuned to control the convergence of g(γk)g(\gamma_{k}), and

g(γk)\displaystyle g^{\prime}(\gamma_{k}) =\displaystyle= log(Bτk,I)(nn1)×log(Cγk,n1)γkϕ(s)×log(Dτk,)log(Eϕ,γk)γk,\displaystyle-\log(B^{*}_{\tau_{k},I})-(n-n_{1})\times\frac{\partial\log(C^{*}_{\gamma_{k},n_{1}})}{\partial\gamma_{k}}-\phi^{(s)}\times\log(D^{*}_{\tau_{k},\mathcal{I}})-\frac{\partial\log(E^{*}_{\phi,\gamma_{k}})}{\partial\gamma_{k}}, (41)
g′′(γk)\displaystyle g^{\prime\prime}(\gamma_{k}) =\displaystyle= (nn1)×2log(Cγk,n1)2γk2log(Eϕ,γk)2γk,\displaystyle-(n-n_{1})\times\frac{\partial^{2}\log(C^{*}_{\gamma_{k},n_{1}})}{\partial^{2}\gamma_{k}}-\frac{\partial^{2}\log(E^{*}_{\phi,\gamma_{k}})}{\partial^{2}\gamma_{k}}, (42)

where

log(Bτk,I)\displaystyle\log(B^{*}_{\tau_{k},I}) =\displaystyle= t=1n1+1nτk,t0|1log(t),\displaystyle\sum_{t=1}^{n_{1}+1}n_{\tau_{k},t}^{0|1}\log(t),
log(Cγk,n1)γk\displaystyle\frac{\partial\log(C^{*}_{\gamma_{k},n_{1}})}{\partial\gamma_{k}} =\displaystyle= t=1n1+1tγklog(t)Cγk,n1,\displaystyle-\frac{\sum_{t=1}^{n_{1}+1}t^{-\gamma_{k}}\log(t)}{C^{*}_{\gamma_{k},n_{1}}},
log(Dτk,)\displaystyle\log(D^{*}_{\tau_{k},\mathcal{I}}) =\displaystyle= dτ(τk1,+(s)),\displaystyle d_{\tau}(\tau_{k}^{1},\mathcal{I}^{+(s)}),
log(Eϕ,γk)γk\displaystyle\frac{\partial\log(E^{*}_{\phi,\gamma_{k}})}{\partial\gamma_{k}} =\displaystyle= t=2n1tϕ(s)exp{tϕ(s)γk}1exp{tϕ(s)γk}ϕ(s)(n11)1exp{ϕ(s)γk}×exp{ϕ(s)γk},\displaystyle\sum_{t=2}^{n_{1}}\frac{t\cdot\phi^{(s)}\exp\{-t\phi^{(s)}\gamma_{k}\}}{1-\exp\{-t\phi^{(s)}\gamma_{k}\}}-\frac{\phi^{(s)}\cdot(n_{1}-1)}{1-\exp\{-\phi^{(s)}\gamma_{k}\}}\times\exp\{-\phi^{(s)}\gamma_{k}\},
\frac{\partial^{2}\log(C^{*}_{\gamma_{k},n_{1}})}{\partial^{2}\gamma_{k}} = \frac{1}{(C^{*}_{\gamma_{k},n_{1}})^{2}}\left[C^{*}_{\gamma_{k},n_{1}}\sum_{t=1}^{n_{1}+1}t^{-\gamma_{k}}\log(t)^{2}-\Big(\sum_{t=1}^{n_{1}+1}t^{-\gamma_{k}}\log(t)\Big)^{2}\right],
2log(Eϕ,γk)2γk\displaystyle\frac{\partial^{2}\log(E^{*}_{\phi,\gamma_{k}})}{\partial^{2}\gamma_{k}} =\displaystyle= t=2n1t2ϕ(s)2exp{tϕ(s)γk}[1exp{tϕ(s)γk}]2+ϕ(s)2(n11)exp{ϕ(s)γk}[1exp{ϕ(s)γk}]2.\displaystyle\sum_{t=2}^{n_{1}}\frac{-t^{2}\phi^{(s)^{2}}\exp\{-t\phi^{(s)}\gamma_{k}\}}{[1-\exp\{-t\phi^{(s)}\gamma_{k}\}]^{2}}+\frac{\phi^{(s)^{2}}(n_{1}-1)\exp\{-\phi^{(s)}\gamma_{k}\}}{[1-\exp\{-\phi^{(s)}\gamma_{k}\}]^{2}}.
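
The following Python sketch illustrates one Newton-like step (40) for $\gamma_{k}$ using the derivative formulas above. The arguments log_B $=\log(B^{*}_{\tau_{k},I})$ and log_D $=\log(D^{*}_{\tau_{k},\mathcal{I}})$ are assumed to be precomputed from the data, phi stands for $\phi^{(s)}$, and the step length and the clipping to $[0,10]$ reflect the constraint $\gamma_{k}\in[0,10]$; it is an illustrative sketch rather than the authors' implementation.

    import numpy as np

    def newton_step_gamma_k(gamma_k, log_B, log_D, n, n1, phi, step=1.0):
        # One Newton-like step, Eq. (40), for gamma_k given the other parameters.
        t = np.arange(1, n1 + 2, dtype=float)           # t = 1, ..., n1 + 1
        w = t ** (-gamma_k)                             # summands of C*_{gamma_k, n1}
        C = w.sum()
        u = np.sum(w * np.log(t))                       # sum_t t^{-gamma} log t
        v = np.sum(w * np.log(t) ** 2)                  # sum_t t^{-gamma} (log t)^2
        dlogC = -u / C                                  # first derivative of log C*
        d2logC = (C * v - u ** 2) / C ** 2              # second derivative of log C*

        s = np.arange(2, n1 + 1, dtype=float)           # t = 2, ..., n1
        e_s = np.exp(-s * phi * gamma_k)
        e_1 = np.exp(-phi * gamma_k)
        dlogE = np.sum(s * phi * e_s / (1 - e_s)) - phi * (n1 - 1) * e_1 / (1 - e_1)
        d2logE = np.sum(-(s * phi) ** 2 * e_s / (1 - e_s) ** 2) \
                 + phi ** 2 * (n1 - 1) * e_1 / (1 - e_1) ** 2

        g1 = -log_B - (n - n1) * dlogC - phi * log_D - dlogE   # g'(gamma_k), Eq. (41)
        g2 = -(n - n1) * d2logC - d2logE                       # g''(gamma_k), Eq. (42)
        return float(np.clip(gamma_k - step * g1 / g2, 0.0, 10.0))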

F.2 Update ϕ\phi

We define $g(\phi)=\log(f(\phi\mid\boldsymbol{\gamma}^{(s+1)},\mathcal{I}^{(s)}))$. Thus, for $\phi\in[0,10]$,

g(ϕ)=k=1m(ϕγk(s+1)log(Dτk,)+log(Eϕ,γk(s+1))).\displaystyle g(\phi)=-\sum_{k=1}^{m}\left(\phi\cdot\gamma_{k}^{(s+1)}\log(D^{*}_{\tau_{k},\mathcal{I}})+\log(E^{*}_{\phi,\gamma_{k}^{(s+1)}})\right). (43)

The update equation is given as

ϕ(s+1)=ϕ(s)αϕg(ϕ)g′′(ϕ),\displaystyle\phi^{(s+1)}=\phi^{(s)}-\alpha_{\phi}\frac{g^{\prime}(\phi)}{g^{\prime\prime}(\phi)}, (44)

where $\alpha_{\phi}$ is a step-length parameter that controls the convergence of $g(\phi)$, and

g(ϕ)\displaystyle g^{\prime}(\phi) =\displaystyle= k=1m(γk(s+1)log(Dτk,)+[log(Eϕ,γk(s+1))]ϕ),\displaystyle-\sum_{k=1}^{m}\left(\gamma_{k}^{(s+1)}\log(D^{*}_{\tau_{k},\mathcal{I}})+\frac{\partial[\log(E^{*}_{\phi,\gamma_{k}^{(s+1)}})]}{\partial\phi}\right), (45)
g′′(ϕ)\displaystyle g^{\prime\prime}(\phi) =\displaystyle= k=1m2[log(Eϕ,γk(s+1))]2ϕ,\displaystyle-\sum_{k=1}^{m}\frac{\partial^{2}[\log(E^{*}_{\phi,\gamma_{k}^{(s+1)}})]}{\partial^{2}\phi}, (46)

where

[log(Eϕ,γk(s+1))]ϕ=(t=2n1tγk(s+1)exp{tϕγk(s+1)}1exp{tϕγk(s+1)})(n11)×γk(s+1)exp{ϕγk(s+1)}1exp{ϕγk(s+1)},\frac{\partial[\log(E^{*}_{\phi,\gamma_{k}^{(s+1)}})]}{\partial\phi}=\left(\sum_{t=2}^{n_{1}}\frac{t\cdot\gamma_{k}^{(s+1)}\exp\{-t\phi\gamma_{k}^{(s+1)}\}}{1-\exp\{-t\phi\gamma_{k}^{(s+1)}\}}\right)-(n_{1}-1)\times\frac{\gamma_{k}^{(s+1)}\exp\{-\phi\gamma_{k}^{(s+1)}\}}{1-\exp\{-\phi\gamma_{k}^{(s+1)}\}},
2[log(Eϕ,γk(s+1))]2ϕ=(t=2n1t2γk(s+1)2exp{tϕγk(s+1)}[1exp{tϕγk(s+1)}]2)+γk(s+1)2(n11)exp{ϕγk(s+1)}[1exp{ϕγk(s+1)}]2.\frac{\partial^{2}[\log(E^{*}_{\phi,\gamma_{k}^{(s+1)}})]}{\partial^{2}\phi}=\left(\sum_{t=2}^{n_{1}}\frac{-t^{2}\gamma_{k}^{(s+1)^{2}}\exp\{-t\phi\gamma_{k}^{(s+1)}\}}{[1-\exp\{-t\phi\gamma_{k}^{(s+1)}\}]^{2}}\right)+\frac{\gamma_{k}^{(s+1)^{2}}(n_{1}-1)\exp\{-\phi\gamma_{k}^{(s+1)}\}}{[1-\exp\{-\phi\gamma_{k}^{(s+1)}\}]^{2}}.
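
Analogously, a sketch of one Newton-like step (44) for $\phi$ is given below, where gamma holds the vector $(\gamma_{1}^{(s+1)},\cdots,\gamma_{m}^{(s+1)})$ and log_D[k] stores the precomputed $\log(D^{*}_{\tau_{k},\mathcal{I}})$; again, this only illustrates the update under those assumptions, not the exact code behind the paper.

    import numpy as np

    def newton_step_phi(phi, gamma, log_D, n1, step=1.0):
        # One Newton-like step, Eq. (44), for phi given gamma^{(s+1)} and I^{(s)}.
        g1, g2 = 0.0, 0.0
        s = np.arange(2, n1 + 1, dtype=float)
        for k, gk in enumerate(gamma):
            e_s = np.exp(-s * phi * gk)
            e_1 = np.exp(-phi * gk)
            dlogE = np.sum(s * gk * e_s / (1 - e_s)) - (n1 - 1) * gk * e_1 / (1 - e_1)
            d2logE = np.sum(-(s * gk) ** 2 * e_s / (1 - e_s) ** 2) \
                     + gk ** 2 * (n1 - 1) * e_1 / (1 - e_1) ** 2
            g1 -= gk * log_D[k] + dlogE        # contribution to g'(phi), Eq. (45)
            g2 -= d2logE                       # contribution to g''(phi), Eq. (46)
        return float(np.clip(phi - step * g1 / g2, 0.0, 10.0))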

F.3 Update \mathcal{I}

We define g()=log(f(|𝜸(s+1),ϕ(s+1)))g(\mathcal{I})=\log(f(\mathcal{I}|\boldsymbol{\gamma}^{(s+1)},\phi^{(s+1)})). Thus,

g()\displaystyle g(\mathcal{I}) =\displaystyle= log(k=1m1Aτk,I×(Bτk,I)γk×(Dτk,)ϕγk).\displaystyle\log\left(\prod_{k=1}^{m}\frac{1}{A^{*}_{\tau_{k},I}\times(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}}\right). (47)

Given the current estimate $\mathcal{I}^{(s)}$, a candidate is proposed by swapping a pair of neighboring entities in $\mathcal{I}^{(s)}$. Note that the entity ranked at position $n_{1}$ may instead be swapped with a randomly chosen background entity. Denote the proposed estimate by $\mathcal{I}^{(s+\frac{1}{2})}$. If $g(\mathcal{I}^{(s+\frac{1}{2})})>g(\mathcal{I}^{(s)})$, set $\mathcal{I}^{(s+1)}=\mathcal{I}^{(s+\frac{1}{2})}$; otherwise, keep generating proposals $\mathcal{I}^{(s+\frac{1}{2})}$ until $g(\mathcal{I}^{(s+\frac{1}{2})})>g(\mathcal{I}^{(s)})$ or no new proposal can be generated for $\mathcal{I}$. A minimal sketch of this swap-based search is shown below.
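
In the sketch, $\mathcal{I}$ is assumed to be stored as a vector with $\mathcal{I}_{i}\in\{1,\cdots,n_{1}\}$ for relevant entities and $\mathcal{I}_{i}=0$ for background entities, and g is an assumed callable evaluating (47) with $\boldsymbol{\gamma}^{(s+1)}$ and $\phi^{(s+1)}$ fixed; the cap on the number of proposals is a simplification of the stopping rule described above.

    import numpy as np

    def propose_swap(I, rng):
        # Propose a neighbor of I: swap the entities ranked r and r+1 for a random
        # r < n1, or swap the entity ranked n1 with a random background entity.
        I_new = I.copy()
        n1 = int(I.max())
        r = int(rng.integers(1, n1 + 1))
        i = int(np.flatnonzero(I == r)[0])
        if r < n1:
            j = int(np.flatnonzero(I == r + 1)[0])
        else:
            j = int(rng.choice(np.flatnonzero(I == 0)))
        I_new[i], I_new[j] = I_new[j], I_new[i]
        return I_new

    def update_I(I, g, rng, max_proposals=1000):
        # Accept the first proposal that increases g, Eq. (47); otherwise keep I.
        g_old = g(I)
        for _ in range(max_proposals):
            I_prop = propose_swap(I, rng)
            if g(I_prop) > g_old:
                return I_prop
        return I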

Appendix G Details of Inferring PAMA-H

G.1 The Gauss-Seidel Iterative Optimization

Let ${\boldsymbol{\theta}}^{(s)}=(\mathcal{I}^{(s)},\phi^{(s)},\psi^{(s)})$ be the maximizer obtained at cycle $s$. Since $\boldsymbol{\gamma}$ is treated as missing data, we adopt the Monte Carlo EM (MCEM) algorithm (Wei and Tanner, 1990) to implement the optimization. The E-step defines $\hat{Q}^{(s+1)}({\boldsymbol{\theta}}\mid{\boldsymbol{\theta}}^{(s)})=\frac{1}{m^{(s)}}\sum_{i=1}^{m^{(s)}}\log f(\mathcal{I},\phi,\psi,\boldsymbol{\gamma}_{i}\mid\tau_{1},\cdots,\tau_{m})$, where $m^{(s)}$ denotes the Monte Carlo sample size at cycle $s$ and $\boldsymbol{\gamma}_{i}$ is a sample drawn from $f(\boldsymbol{\gamma}\mid\mathcal{I}^{(s)},\phi^{(s)},\psi^{(s)})$, whose component-wise conditional distributions are given in (49). The M-step maximizes $\hat{Q}^{(s+1)}({\boldsymbol{\theta}}\mid{\boldsymbol{\theta}}^{(s)})$ with respect to ${\boldsymbol{\theta}}$ to obtain ${\boldsymbol{\theta}}^{(s+1)}=(\mathcal{I}^{(s+1)},\phi^{(s+1)},\psi^{(s+1)})$. The Gauss-Seidel method is used to carry out this maximization, as detailed below.

Update $\mathcal{I}$. With the objective function being $\hat{Q}^{(s+1)}({\boldsymbol{\theta}}\mid{\boldsymbol{\theta}}^{(s)})$, the procedure for updating $\mathcal{I}$ is the same as that in Appendix F.

Update $\phi$. With the objective function being $\hat{Q}^{(s+1)}({\boldsymbol{\theta}}\mid{\boldsymbol{\theta}}^{(s)})$, the procedure for updating $\phi$ is the same as that in Appendix F.

Update $\psi$. Define $g(\psi)$ as the partial function of $\hat{Q}^{(s+1)}({\boldsymbol{\theta}}\mid{\boldsymbol{\theta}}^{(s)})$ with respect to $\psi$, i.e., $g(\psi)=\frac{1}{m^{(s)}}\sum_{i=1}^{m^{(s)}}\log f_{\psi}(\boldsymbol{\gamma}_{i})$, and update $\psi$ with a Newton-like method. Suppose $F_{\psi}(\gamma)$ is the exponential distribution $f(\gamma\mid\alpha)$. Then $g(\alpha)=m\log(\alpha)-\frac{\alpha}{m^{(s)}}\sum_{i=1}^{m^{(s)}}\sum_{k=1}^{m}\gamma_{ik}$, and it is straightforward to apply a Newton-like method to obtain $\psi^{(s+1)}$; in fact, the maximizer is available in closed form, $\hat{\alpha}=m\,m^{(s)}/\sum_{i}\sum_{k}\gamma_{ik}$, as illustrated in the sketch below.
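
As a brief illustration, the closed-form M-step update of $\psi=\alpha$ under the exponential choice of $F_{\psi}$ can be written as follows, where gamma_samples is assumed to be the $m^{(s)}\times m$ array of Monte Carlo draws of $\boldsymbol{\gamma}$ produced in the E-step.

    import numpy as np

    def update_psi_exponential(gamma_samples):
        # M-step for psi = alpha when F_psi is exponential:
        # g(alpha) = m*log(alpha) - (alpha/m_s) * sum_{i,k} gamma_ik,
        # maximized at alpha_hat = m * m_s / sum_{i,k} gamma_ik = 1 / mean(gamma_ik).
        return 1.0 / float(np.mean(gamma_samples))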

G.2 The Bayesian Inference

Suppose $F_{\psi}(\gamma)$ is the exponential distribution $f(\gamma\mid\alpha)$. More specifically,

f(xα)=αexp{αx},α>0.f(x\mid\alpha)=\alpha\exp\{-\alpha x\},\alpha>0.

Note that $\psi=\alpha$ here. As the conditional distributions of $\mathcal{I}$ and $\phi$ remain as given in (27) and (28), we only present the conditional distributions of $\psi$ and $\gamma_{k}$ below:

f(α,ϕ,𝜸)\displaystyle f(\alpha\mid\mathcal{I},\phi,\boldsymbol{\gamma}) \displaystyle\propto π(α)k=1mαexp{αγk},\displaystyle\pi(\alpha)\cdot\prod_{k=1}^{m}{\alpha}\exp\{-\alpha\gamma_{k}\}, (48)
f(\gamma_{k}\mid\mathcal{I},\phi,\alpha,\boldsymbol{\gamma}_{[-k]}) \propto \frac{\exp\{-\alpha\gamma_{k}\}}{(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(C^{*}_{\gamma_{k},n_{1}})^{n-n_{1}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}\times E^{*}_{\phi,\gamma_{k}}}. (49)
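
As an illustration, the following Python sketch gives one Gibbs-type sweep for (48) and (49). It assumes a flat prior $\pi(\alpha)\propto 1$, under which (48) becomes a Gamma$(m+1,\sum_{k}\gamma_{k})$ distribution, and treats log_target_k as an assumed callable returning the logarithm of the right-hand side of (49) for ranker $k$; in the actual sampler any valid prior and proposal may be used.

    import numpy as np

    def draw_alpha(gamma, rng):
        # Draw alpha from (48) under a flat prior: Gamma(m + 1, sum_k gamma_k) in the
        # shape-rate parameterization (numpy's gamma uses shape and scale = 1/rate).
        return rng.gamma(shape=len(gamma) + 1, scale=1.0 / np.sum(gamma))

    def mh_step_gamma_k(gamma_k, log_target_k, rng, sd=0.3):
        # One random-walk Metropolis step for gamma_k with target density (49);
        # proposals outside the support (0, 10] are rejected outright.
        prop = gamma_k + rng.normal(0.0, sd)
        if prop <= 0.0 or prop > 10.0:
            return gamma_k
        if np.log(rng.uniform()) < log_target_k(prop) - log_target_k(gamma_k):
            return prop
        return gamma_k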

G.3 PAMA-H versus PAMA in Numerical Experiments

We simulate 500 independent data sets with the hierarchical distribution $F_{\psi}$ being an exponential distribution. The true parameters are $\mathcal{I}=(1,2,\cdots,n_{1},{\bf 0}_{n_{0}})$, $\phi=0.6$, $n_{1}=10$, and $\alpha=1$ for the exponential rate. Four configurations of $(n,m)$ are considered. The average recovery distances and coverages are displayed in Table 6, and Figure 12 compares the estimates of $\gamma_{k}$ obtained by the different models and frameworks. Both Table 6 and Figure 12 indicate that the PAMA model performs as well as the PAMA-H model.

 n    m    Metric     PAMAHB (Bayes)   PAMAB (Bayes)   PAMAHF (MLE)    PAMAF (MLE)
100   10   Recovery   0.70 (3.83)      0.98 (6.11)     19.09 (26.56)   17.22 (24.80)
           Coverage   1.00 (0.01)      1.00 (0.01)     0.96 (0.05)     0.97 (0.05)
100   20   Recovery   1.22 (6.71)      0.84 (5.09)     33.76 (29.96)   32.16 (29.28)
           Coverage   1.00 (0.01)      1.00 (0.01)     0.93 (0.06)     0.93 (0.06)
200   10   Recovery   4.15 (24.22)     3.36 (19.94)    57.23 (57.02)   49.08 (56.00)
           Coverage   1.00 (0.02)      1.00 (0.02)     0.94 (0.06)     0.95 (0.06)
200   20   Recovery   1.51 (10.86)     1.51 (10.97)    92.59 (65.94)   85.67 (64.35)
           Coverage   1.00 (0.01)      1.00 (0.01)     0.91 (0.07)     0.91 (0.07)
Table 6: Average recovery distances and coverages of different methods based on 500 independent replicates under the PAMA-H model with different configurations of $n$ and $m$. The numbers in parentheses are the corresponding standard deviations.
Figure 12: Boxplots of $\hat{\gamma}_{k}$ obtained under different models and frameworks with $n=200$ and $m=20$.

Appendix H Statistical Inference of the Covariate-Assisted Partition-Mallows Model

H.1 Bayesian Inference

Due to the incorporation of $\boldsymbol{X}$, some of the full conditional distributions differ from those in Section 4.2. While the full conditional distributions of $\boldsymbol{\gamma}$ and $\phi$ remain the same as in Equations (28) and (29), the full conditional distribution of $\mathcal{I}$ becomes

f()\displaystyle f(\mathcal{I}\mid\cdot) \displaystyle\propto exp{𝝍Ti:Ii=1Xi}k=1m𝕀(τk0𝒜UR(τk01))Aτk,I×(Bτk,I)γk×(Dτk,)ϕγk.\displaystyle\exp\{\boldsymbol{\psi}^{T}\sum_{i:I_{i}=1}X_{i}\}\prod_{k=1}^{m}\frac{\mathbb{I}\big{(}\tau_{k}^{0}\in\mathcal{A}_{U_{R}}(\tau_{k}^{0\mid 1})\big{)}}{A^{*}_{\tau_{k},I}\times(B^{*}_{\tau_{k},I})^{\gamma_{k}}\times(D^{*}_{\tau_{k},\mathcal{I}})^{\phi\cdot\gamma_{k}}}. (50)

In addition, the full conditional distribution of the $l$-th element of $\boldsymbol{\psi}$ is given as follows,

f(𝝍l)\displaystyle f(\boldsymbol{\psi}_{l}\mid\cdot) \displaystyle\propto exp{𝝍T(i:Ii>0Xi)}i=1n(1+exp{𝝍TXi}),l=1,,p.\displaystyle\dfrac{\exp\{\boldsymbol{\psi}^{T}(\sum_{i:I_{i}>0}X_{i})\}}{\prod_{i=1}^{n}(1+\exp\{\boldsymbol{\psi}^{T}X_{i}\})},l=1,\cdots,p. (51)

The MH algorithm is again adopted to draw the corresponding samples, and posterior point estimates can be calculated accordingly. A sketch of a component-wise MH step for $\boldsymbol{\psi}$ is given below.
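
For illustration, one possible component-wise random-walk MH step targeting (51) is sketched below, where X is the $n\times p$ covariate matrix and relevant is a Boolean vector indicating $\{i:I_{i}>0\}$; the proposal scale sd is a tuning parameter, and the sketch is only a minimal realization of the step described above.

    import numpy as np

    def log_cond_psi(psi, X, relevant):
        # Logarithm of (51) up to an additive constant.
        eta = X @ psi
        return float(np.sum(eta[relevant]) - np.sum(np.log1p(np.exp(eta))))

    def mh_step_psi_l(psi, l, X, relevant, rng, sd=0.2):
        # One random-walk Metropolis step for the l-th component of psi.
        prop = psi.copy()
        prop[l] += rng.normal(0.0, sd)
        log_ratio = log_cond_psi(prop, X, relevant) - log_cond_psi(psi, X, relevant)
        if np.log(rng.uniform()) < log_ratio:
            return prop
        return psi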

H.2 MLE

The Gauss-Seidel iterative optimization is also adopted for the covariate-assisted PAMA model. Let ${\boldsymbol{\theta}}^{(s)}=(\mathcal{I}^{(s)},\phi^{(s)},\boldsymbol{\gamma}^{(s)},\boldsymbol{\psi}^{(s)})$ be the maximizer obtained at cycle $s$. The detailed computation of ${\boldsymbol{\theta}}^{(s+1)}$ is given below.

H.2.1 Update of γk\gamma_{k}

This is the same as in Appendix F.1.

H.2.2 Update of ϕ\phi

This is the same as in Appendix F.2.

H.2.3 Update of \mathcal{I}

We define g()=log(f(|𝜸(s+1),ϕ(s+1),𝝍(s)))g(\mathcal{I})=\log(f(\mathcal{I}|\boldsymbol{\gamma}^{(s+1)},\phi^{(s+1)},\boldsymbol{\psi}^{(s)})), thus,

g()\displaystyle g(\mathcal{I}) =\displaystyle= log(P(𝑿,𝝍(s))×P(τ1,,τm,ϕ(s+1),𝜸(s+1)))\displaystyle\log\left(P(\mathcal{I}\mid\boldsymbol{X},\boldsymbol{\psi}^{(s)})\times P(\tau_{1},\cdots,\tau_{m}\mid\mathcal{I},\phi^{(s+1)},\boldsymbol{\gamma}^{(s+1)})\right) (52)

Given the current estimate $\mathcal{I}^{(s)}$, a candidate is proposed by swapping a pair of neighboring entities in $\mathcal{I}^{(s)}$; as in Appendix F.3, the entity ranked at position $n_{1}$ may instead be swapped with a randomly chosen background entity. Denote the proposed estimate by $\mathcal{I}^{(s+\frac{1}{2})}$. If $g(\mathcal{I}^{(s+\frac{1}{2})})>g(\mathcal{I}^{(s)})$, set $\mathcal{I}^{(s+1)}=\mathcal{I}^{(s+\frac{1}{2})}$; otherwise, keep generating proposals $\mathcal{I}^{(s+\frac{1}{2})}$ until $g(\mathcal{I}^{(s+\frac{1}{2})})>g(\mathcal{I}^{(s)})$ or no new proposal can be generated for $\mathcal{I}$.

H.2.4 Update of 𝝍\boldsymbol{\psi}

Define $g(\boldsymbol{\psi})=\log(P(\mathcal{I}^{(s+1)}\mid\boldsymbol{X},\boldsymbol{\psi}))$. Maximizing $g(\boldsymbol{\psi})$ with respect to $\boldsymbol{\psi}$ is a standard logistic regression problem, so $\boldsymbol{\psi}^{(s+1)}=\arg\max_{\boldsymbol{\psi}}\log(P(\mathcal{I}^{(s+1)}\mid\boldsymbol{X},\boldsymbol{\psi}))$ can be obtained using standard statistical software; a minimal sketch of such a fit is given below.
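
As a sketch, the maximization can be carried out by treating $y_{i}=\mathbb{I}(I_{i}>0)$ as the binary response and running a plain Newton-Raphson (IRLS) fit of the logistic regression of $y$ on $\boldsymbol{X}$; whether an intercept column is included in $\boldsymbol{X}$ follows the model specification, and any off-the-shelf logistic-regression routine would serve equally well.

    import numpy as np

    def fit_psi_logistic(X, I, n_iter=25):
        # Newton-Raphson (IRLS) maximization of log P(I | X, psi), i.e. a standard
        # logistic regression with response y_i = 1{I_i > 0}.
        y = (np.asarray(I) > 0).astype(float)
        psi = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ psi)))       # fitted probabilities
            W = p * (1.0 - p)                          # IRLS weights
            grad = X.T @ (y - p)                       # score vector
            hess = X.T @ (X * W[:, None])              # Fisher information
            psi = psi + np.linalg.solve(hess, grad)    # Newton step
        return psi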

Appendix I Boxplots of 𝜸\boldsymbol{\gamma}

Figure 13 shows the boxplots of $\boldsymbol{\gamma}$ for all scenario settings under all combinations of $(n,m)$.

Figure 13: (a) The boxplots of {γ¯k}\{\bar{\gamma}_{k}\} estimated by PAMAB. (b) The boxplots of {γ^k}\{\hat{\gamma}_{k}\} estimated by PAMAF. Each row denotes a fixed combination of mm and nn. Each column denotes a scenario setting. The boxplots are based on results from 500 independent replicates.

Appendix J Simulation Results

Figures 14, 15 and 16 present results from different methods for simulated data under various combinations of $n$ and $m$.

Figure 14: Results from different methods for simulated data under various scenarios with m=20,n=100m=20,n=100, and n1=10n_{1}=10. The boxplots are based on 500 replications. (a) Recovery distances in log scale and coverage obtained from nine algorithms. (b) Quality parameters obtained by Partition-type models and EMM.
Figure 15: Results from different methods for simulated data under various scenarios with m=10,n=300m=10,n=300, and n1=10n_{1}=10. The boxplots are based on 500 replications. (a) Recovery distances in log scale and coverage obtained from nine algorithms. (b) Quality parameters obtained by Partition-type models and EMM.
Figure 16: Results from different methods for simulated data under various scenarios with m=20,n=300m=20,n=300, and n1=10n_{1}=10. The boxplots are based on 500 replications. (a) Recovery distances in log scale and coverage obtained from nine algorithms. (b) Quality parameters obtained by Partition-type models and EMM.

Appendix K Robustness of n1n_{1}

Figure 17 shows boxplots of the estimated $\mathcal{I}$ for each mis-specified value of $n_{1}$ (5, 12 and 15).

Figure 17: Boxplots of the estimated $\mathcal{I}$ from 500 replications under the setting of $\mathcal{S}_{HS_{1}}$ with $n_{1}$ set to 5, 12 and 15, respectively. The true $n_{1}$ is 10. The vertical lines separate relevant entities (left) from background ones. The Y-axis shows the logarithm of the entities' ranks. The rank of a background entity is replaced by the average rank $\frac{100+10+1}{2}$. The triangle denotes the mean rank of each entity.

References

  • Aslam and Montague, (2001) Aslam, J. A. and Montague, M. (2001). Models for metasearch. In Proceedings of the 24th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 276–284. ACM.
  • Bader, (2011) Bader, M. (2011). The transposition median problem is NP-complete. Theoretical Computer Science, 412(12):1099–1110.
  • Bhowmik and Ghosh, (2017) Bhowmik, A. and Ghosh, J. (2017). LETOR methods for unsupervised rank aggregation. In Proceedings of the 26th International Conference on World Wide Web, pages 1331–1340. International World Wide Web Conferences Steering Committee.
  • Borda, (1781) Borda, J. C. (1781). Mémoire sur les élections au scrutin. Histoire del’ Académie Royale des Sciences.
  • Chen et al., (2016) Chen, J., Long, R., Wang, X., Liu, B., and Chou, K. (2016). dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Scientific Reports, 6(32333).
  • Chen et al., (2019) Chen, Y., Fan, J., Ma, C., and Wang, K. (2019). Spectral method and regularized MLE are both optimal for top-$K$ ranking. Annals of Statistics, 47(4):2204.
  • Chen and Suh, (2015) Chen, Y. and Suh, C. (2015). Spectral MLE: Top-KK rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pages 371–380.
  • Deconde et al., (2011) Deconde, R. P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., and Etzioni, R. (2011). Combining results of microarray experiments: a rank aggregation approach. Statistical Applications in Genetics & Molecular Biology, 5(1):1544–6115.
  • Deng et al., (2014) Deng, K., Han, S., Li, K. J., and Liu, J. S. (2014). Bayesian aggregation of order-based rank data. Journal of the American Statistical Association, 109(507):1023–1039.
  • Diaconis, (1988) Diaconis, P. (1988). Group representations in probability and statistics. Lecture Notes-Monograph Series, 11:1–192.
  • Diaconis and Graham, (1977) Diaconis, P. and Graham, R. L. (1977). Spearman’s footrule as a measure of disarray. Journal of the Royal Statistical Society Series B (Methodological), 39(2):262–268.
  • Dwork et al., (2001) Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web, pages 613–622. ACM.
  • Fagin et al., (2003) Fagin, R., Kumar, R., and Sivakumar, D. (2003). Efficient similarity search and classification via rank aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 301–312. ACM.
  • Fligner and Verducci, (1986) Fligner, M. A. and Verducci, J. S. (1986). Distance based ranking models. Journal of the Royal Statistical Society. Series B (Methodological), 48(3):359–369.
  • Freund et al., (2003) Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(Nov):933–969.
  • Hastings, (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.
  • Irurozki et al., (2014) Irurozki, E., Calvo, B., and Lozano, J. A. (2014). PerMallows: An R package for Mallows and generalized Mallows models. Journal of Statistical Software, 71(12):1–30.
  • Johnson et al., (2020) Johnson, S. R., Henderson, D. A., and Boys, R. J. (2020). Revealing subgroup structure in ranked data using a Bayesian WAND. Journal of the American Statistical Association, 115(532):1888–1901.
  • Li et al., (2020) Li, H., Xu, M., Liu, J. S., and Fan, X. (2020). An extended mallows model for ranked data aggregation. Journal of the American Statistical Association, 115(530):730–746.
  • Li et al., (2021) Li, X., Yi, D., and Liu, J. S. (2021). Bayesian analysis of rank data with covariates and heterogeneous rankers. Statistical Science, (In press).
  • Lin, (2010) Lin, S. (2010). Space oriented rank-based data integration. Statistical Applications in Genetics & Molecular Biology, 9(1):Article20.
  • Lin and Ding, (2010) Lin, S. and Ding, J. (2010). Integration of ranked lists via cross entropy Monte Carlo with applications to mRNA and microRNA studies. Biometrics, 65(1):9–18.
  • Linas et al., (2010) Linas, B., Tadas, M., and Francesco, R. (2010). Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 119–126. ACM.
  • Liu, (2008) Liu, J. S. (2008). Monte Carlo strategies in scientific computing. Springer Science & Business Media.
  • Liu et al., (2007) Liu, Y., Liu, T., Qin, T., Ma, Z., and Li, H. (2007). Supervised rank aggregation. In Proceedings of the 16th International Conference on World Wide Web, pages 481–490. ACM.
  • Luce, (1959) Luce, R. D. (1959). Individual choice behavior: a theoretical analysis. New York: Wiley.
  • Mallows, (1957) Mallows, C. L. (1957). Non-null ranking models. Biometrika, 44(1/2):114–130.
  • Plackett, (1975) Plackett, R. L. (1975). The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202.
  • Porello and Endriss, (2012) Porello, D. and Endriss, U. (2012). Ontology merging as social choice: judgment aggregation under the open world assumption. Journal of Logic and Computation, 24(6):1229–1249.
  • Rajkumar and Agarwal, (2014) Rajkumar, A. and Agarwal, S. (2014). A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning, pages 118–126.
  • Renda and Straccia, (2003) Renda, M. E. and Straccia, U. (2003). Web metasearch: rank vs. score based rank aggregation methods. In Proceedings of the 2003 ACM Symposium on Applied Computing, pages 841–846. ACM.
  • Soufiani et al., (2014) Soufiani, H. A., Parkes, D. C., and Xia, L. (2014). A statistical decision-theoretic framework for social choice. In Advances in Neural Information Processing Systems, pages 3185–3193.
  • Tanner and Wong, (1987) Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–540.
  • Thurstone, (1927) Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4):273–286.
  • Wald, (1949) Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. The Annals of Mathematical Statistics, 20(4):595–601.
  • Wei and Tanner, (1990) Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.
  • Yang, (2018) Yang, K. H. (2018). Chapter 7 - Stepping through finite element analysis. In Yang, K.-H., editor, Basic Finite Element Method as Applied to Injury Biomechanics, pages 281–308. Academic Press.
  • Young, (1988) Young, H. P. (1988). Condorcet’s theory of voting. American Political Science Review, 82(4):1231–1244.
  • Young and Levenglick, (1978) Young, H. P. and Levenglick, A. (1978). A consistent extension of condorcet’s election principle. SIAM Journal on Applied Mathematics, 35(2):285–300.
  • Yu, (2000) Yu, P. L. H. (2000). Bayesian analysis of order-statistics models for ranking data. Psychometrika, 65(3):281–299.