
PairRank: Online Pairwise Learning to Rank by Divide-and-Conquer

Yiling Jia (University of Virginia, Charlottesville, VA, USA) [email protected] · Huazheng Wang (University of Virginia, Charlottesville, VA, USA) [email protected] · Stephen Guo (Walmart Labs, Sunnyvale, CA, USA) [email protected] · Hongning Wang (University of Virginia, Charlottesville, VA, USA) [email protected]
(2021)
Abstract.

Online Learning to Rank (OL2R) eliminates the need for explicit relevance annotations by directly optimizing rankers from their interactions with users. However, the required exploration drives it away from successful practices in offline learning to rank, which limits OL2R's empirical performance and practical applicability. In this work, we propose to estimate a pairwise learning to rank model online. In each round, candidate documents are partitioned and ranked according to the model's confidence on the estimated pairwise rank orders, and exploration is only performed on the uncertain pairs of documents, i.e., divide-and-conquer. A regret bound defined directly on the number of mis-ordered pairs is proven, which connects the online solution's theoretical convergence with its expected ranking performance. Comparisons against an extensive list of OL2R baselines on two public learning to rank benchmark datasets demonstrate the effectiveness of the proposed solution.

Online learning to rank; divide and conquer; regret analysis
Copyright: ACM. Journal year: 2021. Conference: Proceedings of the Web Conference 2021 (WWW '21), April 19-23, 2021, Ljubljana, Slovenia. DOI: 10.1145/3442381.3449972. ISBN: 978-1-4503-8312-7/21/04. CCS Concepts: Information systems → Learning to rank; Theory of computation → Online learning algorithms; Theory of computation → Divide and conquer; Theory of computation → Regret bounds.

1. Introduction

Online learning to rank (OL2R) empowers modern retrieval systems to optimize their performance directly from users’ implicit feedback (Wang et al., 2019; Wang et al., 2018a; Yue and Joachims, 2009; Hofmann et al., 2013; Schuth et al., 2016; Zoghi et al., 2017; Lattimore et al., 2018; Oosterhuis and de Rijke, 2018; Li et al., 2018). The essence of OL2R solutions is to infer the quality of individual documents under a given query (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018; Li et al., 2018) or a parameterized ranking function (Wang et al., 2019; Wang et al., 2018a; Yue and Joachims, 2009; Oosterhuis and de Rijke, 2018) via sequential interactions with users, i.e., trial and error. It eliminates classical offline learning to rank solutions’ strong dependency on explicit relevance annotations and makes supervised learning of ranking models possible when collecting explicit annotations from experts is economically infeasible or even impossible (e.g., private collection search).

Although influential and theoretically sound, the current OL2R solutions are not compatible with the successful practices in offline learning to rank, which directly optimize rankers by minimizing loss defined by rank-based metrics, such as Average Relevance Position (ARP) (Joachims et al., 2017) or Normalized Discounted Cumulative Gain (NDCG) (Burges, 2010). As a result, the performance of existing OL2R solutions is still behind that of offline solutions, which directly restricts OL2R’s real-world applicability.

The key barrier separating the practices in online and offline learning to rank is the need for exploration. Since users' feedback is implicit and known to be biased and noisy (Joachims et al., 2005; Agichtein et al., 2006; Joachims et al., 2007), more clicks on a top-ranked document do not necessarily indicate greater relevance. Effective exploration in the problem space is thus vital for the online model update. Current OL2R solutions explore either in the action space (e.g., presenting currently underestimated results at higher positions of a ranked list) (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018), or in the model space (e.g., presenting ranked results from different rankers) (Yue and Joachims, 2009; Schuth et al., 2014). However, due to the combinatorial nature of ranking, the action space is too large to be explored efficiently (e.g., all permutations of the returned documents). This forces such OL2R solutions to take a pointwise approach and estimate the utility of each query-document pair separately, which has proven to be inferior to the pairwise or listwise approaches in offline learning to rank studies (Chapelle and Chang, 2011). For model space exploration, although an interleaved test makes it possible to compare different rankers with respect to a hidden utility function in an unbiased fashion, it is hard to link this comparison to the optimization of any rank-based metric. Moreover, due to the required uniform sampling in the model space, this type of OL2R solution suffers from high variance and high regret during online result serving and model update (Wang et al., 2019).

In this work, we aim to bridge the gap by directly training a pairwise learning to rank model online. We target pairwise ranking models for three major reasons. First, a pairwise ranker reduces the exponentially sized action space to a quadratic one, by deriving the full ranking order from pairwise comparisons between documents. Second, existing studies in search log analysis demonstrate that relative preferences derived from clicks are more accurate and reliable than absolute judgments (Joachims et al., 2005, 2007). Third, pairwise learning to rank models have competitive empirical performance and have been widely used in practical systems (Chapelle and Chang, 2011; Joachims, 2002; Burges, 2010). To gain theoretical insight into the proposed solution, we work with a single layer RankNet model with a sigmoid activation function (Burges, 2010), which makes analytical convergence analysis possible. In particular, we explore in the pairwise ranking space of all candidate documents via divide-and-conquer:

we partition the documents under a given query in each round of result serving, such that the ranking across different parts of the partition is certain (e.g., all documents in one part should be ranked higher than those in another part), while the ranking among documents within each part is still uncertain. The ranked list is thus generated by a topological sort across parts of the partition and random shuffling within each part. We name our solution PairRank. We rigorously prove that the exploration space in PairRank shrinks exponentially fast as the ranker's estimation converges, such that the cumulative regret defined on the number of mis-ordered pairs has a sublinear upper bound. As most existing ranking metrics can be reduced to different kinds of pairwise comparisons among candidate documents (Wang et al., 2018b) (e.g., Average Relevance Position is counted over a relevant document against all other documents), PairRank can directly optimize those ranking metrics based on users' implicit feedback on the fly. Our extensive empirical evaluations also demonstrate the strong advantage of PairRank against a rich set of state-of-the-art OL2R solutions on a collection of OL2R benchmark datasets under standard retrieval metrics.

2. Related Work

We broadly categorize existing OL2R solutions into two families. The first type learns the best ranked list for each individual query separately by modeling users' click and examination behaviors with multi-armed bandit algorithms (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018). Typically, solutions in this category depend on specific click models to decompose the estimation on each query-document pair; as a result, exploration is performed on the ranking of individual documents. For example, by assuming users examine documents from top to bottom until reaching the first relevant document, cascading bandit models perform exploration by ranking the documents based on the upper confidence bounds of their estimated relevance (Kveton et al., 2015a, b; Li et al., 2016). Other types of click models have also been explored (such as the dependent click model) (Katariya et al., 2016; Zoghi et al., 2017; Lattimore et al., 2018; Li et al., 2018; Kveton et al., 2018). However, as the relevance is estimated for each query-document pair, such algorithms can hardly generalize to unseen queries or documents. Moreover, pointwise relevance estimation has been shown to be ineffective for rank estimation in established offline learning to rank studies (Chapelle and Chang, 2011; Burges, 2010).

The second type of OL2R solution leverages ranking features for relevance estimation and explores for the best ranker in the entire model space (Yue and Joachims, 2009; Li et al., 2018; Oosterhuis and de Rijke, 2018). The most representative work is Dueling Bandit Gradient Descent (DBGD) (Yue and Joachims, 2009; Schuth et al., 2014), which proposes an exploratory direction in each round of interaction and uses an interleaved test (Chapelle et al., 2012) to validate the exploration for model update. To ensure an unbiased gradient estimate, DBGD uniformly explores the entire model space, which incurs high variance and high regret during online ranking and model update. Subsequent methods improve upon DBGD by developing more efficient sampling strategies, such as multiple interleaving and projected gradients, to reduce variance (Hofmann et al., 2012; Zhao and King, 2016; Oosterhuis and de Rijke, 2017; Wang et al., 2018a; Wang et al., 2019). However, as exploration is performed in the model space, click feedback is used to infer which ranker is preferred under a hypothetical utility function. It is difficult to reason about how the update in DBGD relates to the optimization of any rank-based metric. Hence, though generalizable, this type of OL2R solution's empirical performance is still worse than that of classical offline solutions.

The clear divide between the practices in online and offline learning to rank is quite remarkable, which has motivated some recent efforts to bridge the gap. Hofmann et al. (2013) adopt $\epsilon$-greedy to estimate a stochastic RankSVM (Joachims, 2002; Herbrich et al., 1999) model on the fly. Though RankSVM is effective for pairwise learning to rank, the totally random exploration by $\epsilon$-greedy is independent of the learned ranker: it keeps distorting the ranked results, even when the ranker has already identified some high-quality results. Oosterhuis and de Rijke (2018) perform exploration by sampling the next ranked document from a Plackett-Luce model and estimate gradients of this ranking model from the inferred pairwise result preferences. Although exploration is linked to the ranker's estimation, the convergence of this solution is still unknown.

3. Method

In this section, we present our solution, which trains a pairwise learning to rank model online. The key idea is to partition the pairwise document ranking space and only explore the pairs of documents whose order the ranker is currently uncertain about, i.e., divide-and-conquer. We name our solution PairRank. We rigorously bound the regret of PairRank, defined on the cumulative number of mis-ordered pairs over the course of online model update.

3.1. Problem Formulation

In OL2R, a ranker interacts with users for $T$ rounds. At each round $t=1,2,\dots,T$, the ranker receives a query $q_t$ and its associated candidate documents, which are represented as a set of $d$-dimensional query-document feature vectors $\mathcal{X}_t=\{\mathbf{x}_1^t,\mathbf{x}_2^t,\dots,\mathbf{x}_{L_t}^t\}$ with $\mathbf{x}_i^t\in\mathbb{R}^d$ and $\|\mathbf{x}_i^t\|\leq u$. The ranker determines the ranking of the candidate documents $\tau_t=\big(\tau_t(1),\tau_t(2),\dots,\tau_t(L_t)\big)\in\Pi([L_t])$ based on its knowledge so far, where $\Pi([L_t])$ represents the set of all permutations of $L_t$ documents and $\tau_t(i)$ is the rank position of document $i$ under query $q_t$. Once the ranked list is returned to the user, the user examines the results and provides his/her click feedback $C_t=\{c_1^t,c_2^t,\dots,c_{L_t}^t\}$, where $c_i^t=1$ if the user clicks on document $i$ at round $t$ and $c_i^t=0$ otherwise. Based on this feedback, the ranker updates itself and proceeds to the next query.

$C_t$ is known to be biased and noisy (Joachims et al., 2005; Agichtein et al., 2006; Joachims et al., 2007). Existing studies find that users tend to click more on higher-ranked documents, as they stop examination early in the list; this is known as position bias. Users can also only interact with the documents shown to them, which is known as presentation bias. As a result, the ranker cannot simply treat non-clicked documents as irrelevant. Such implicit feedback imposes two key challenges for online learning to rank: how to deal with the implicit feedback, and, in the meanwhile, how to effectively explore the unknowns for the model update. Following the practice in (Joachims et al., 2005), we treat clicks as relative preference feedback and assume that clicked documents are preferred over the examined but unclicked ones. In addition, we consider every document that precedes a clicked document, as well as the first subsequent unclicked document, as examined. This approach has been widely adopted and proven effective in learning to rank (Wang et al., 2019; Agichtein et al., 2006; Oosterhuis and de Rijke, 2018). Accordingly, we use $o_t$ to denote the index of the last examined position in the ranked list $\tau_t$ at round $t$.

Exploration is the key component that differentiates OL2R from offline L2R, as OL2R needs to serve users while learning from the rankings it presents. The most straightforward exploration is to present a random list of candidate documents. However, such random exploration is undesirable in OL2R as it hurts the user experience, even though it may benefit model training. Therefore, regret becomes an important metric for evaluating OL2R. Though various types of regret have been defined and analyzed in existing OL2R studies, few of them link to any rank-based metric, which is the key in ranker evaluation. For example, for OL2R solutions that explore in the document space, regret is typically defined on the number of clicks received on the presented ranking versus that known only in hindsight (Li et al., 2016; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018). For solutions that explore in the model space, such as DBGD, regret is defined as the number of rounds in which the optimal ranker is preferred over the chosen ranker (Yue and Joachims, 2009; Wang et al., 2019). It is difficult to reason about how such measures indicate an OL2R solution's ranking performance against a desired retrieval metric, such as ARP or NDCG. To bridge this gap, we define regret by the number of mis-ordered pairs between the presented ranking and the ideal one, i.e., the Kendall tau rank distance:

$R_T=\mathbb{E}\big[\sum_{t=1}^{T}r_t\big]=\mathbb{E}\big[\sum_{t=1}^{T}K(\tau_t,\tau_t^*)\big]$

where $K(\tau_t,\tau_t^*)=\big|\{(i,j):i<j,\big(\tau_t(i)<\tau_t(j)\wedge\tau_t^*(i)>\tau_t^*(j)\big)\vee\big(\tau_t(i)>\tau_t(j)\wedge\tau_t^*(i)<\tau_t^*(j)\big)\}\big|$. As shown in (Wang et al., 2018b), most ranking metrics employed in real-world retrieval systems, such as ARP and NDCG, can be decomposed into pairwise comparisons; hence, our defined regret directly connects an OL2R algorithm's online performance with classical ranking evaluations.
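To make this regret measure concrete, the following minimal Python sketch (our own illustration, not part of the paper's implementation) counts the mis-ordered pairs between a presented ranking and the ideal one:

    from itertools import combinations

    def kendall_tau_distance(tau, tau_star):
        # tau and tau_star map each document id to its rank position,
        # e.g., tau[i] is the position of document i in the presented list.
        distance = 0
        for i, j in combinations(tau.keys(), 2):
            # A pair contributes regret iff the two rankings disagree on it.
            if (tau[i] - tau[j]) * (tau_star[i] - tau_star[j]) < 0:
                distance += 1
        return distance

    # Example: ideal order is 1 > 2 > 3 > 4 > 5; the presented list swaps
    # pairs (1, 2) and (4, 5), so K = 2.
    tau = {1: 2, 2: 1, 3: 3, 4: 5, 5: 4}
    tau_star = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
    assert kendall_tau_distance(tau, tau_star) == 2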

3.2. Online Pairwise Learning to Rank

[Figure: from document graph to block graph, and then the generated ranked lists.]

Figure 1. PairRank: pairwise explore and exploit by divide-and-conquer. Assume that the optimal ranking order among the 5 documents is $1\succ 2\succ 3\succ 4\succ 5$. At the current round $t$, the ranker is confident about its preference estimation between all pairs except $(1,2)$, $(3,5)$, and $(4,5)$. In this example, the instantaneous regret of the first proposed ranked list is 3, and that of the second is 2.

The key in OL2R is to effectively explore the unknowns while providing high-quality results to the users, which is often referred to as the explore-exploit trade-off. In this work, we propose to train a pairwise model directly from its interactions with users and to explore in the pairwise document ranking space via a divide-and-conquer strategy. The high-level idea of PairRank is illustrated in Figure 1, and we explain its details in the following.

3.2.1. Pairwise Learning to Rank

We focus on pairwise ranking models because of their demonstrated effectiveness in prior offline learning to rank studies (Chapelle and Chang, 2011) and their ability to handle implicit feedback. Specifically, we adopt a single layer RankNet model with a sigmoid activation function (Burges, 2010) as our pairwise ranker. This choice is based on the promising empirical performance of RankNet and the feasibility of analyzing the resulting online solution's convergence.

In a single layer RankNet, the probability that document $i$ is more relevant than document $j$ under query $q$ is computed as $\mathbb{P}(i\succ j|q)=\sigma(\mathbf{x}_i^\top\boldsymbol{\theta}-\mathbf{x}_j^\top\boldsymbol{\theta})$, where $\boldsymbol{\theta}\in\mathbb{R}^d$ with $\|\boldsymbol{\theta}\|\leq Q$ is the model parameter and $\sigma(x)=1/(1+\exp(-x))$. To simplify our notation, we use $\mathbf{x}_{ij}$ to denote $\mathbf{x}_i-\mathbf{x}_j$ in our subsequent discussions.

With the knowledge of $\boldsymbol{\theta}$, due to the monotonicity and transitivity of the sigmoid function, the ranking of documents in $\mathcal{X}$ can be uniquely determined by $\{\mathbf{x}_1^\top\boldsymbol{\theta},\mathbf{x}_2^\top\boldsymbol{\theta},\dots,\mathbf{x}_L^\top\boldsymbol{\theta}\}$. Therefore, the key to learning a RankNet model is to estimate its parameter $\boldsymbol{\theta}$. As RankNet specifies a distribution on pairwise comparisons, the objective function for $\boldsymbol{\theta}$ estimation can be readily derived as the cross-entropy loss between the predicted pairwise distribution on documents and that inferred from user feedback up to round $t$:

(1) $\mathcal{L}_t=\sum_{s=1}^{t}\sum_{(m,n)\in\mathcal{G}_s}-\mathbf{y}_{mn}^{s}\log\big(\sigma({\mathbf{x}_{mn}^{s}}^\top\boldsymbol{\theta})\big)-(1-\mathbf{y}_{mn}^{s})\log\big(1-\sigma({\mathbf{x}_{mn}^{s}}^\top\boldsymbol{\theta})\big)+\frac{\lambda}{2}\|\boldsymbol{\theta}\|^2$

where $\lambda$ is the L2 regularization coefficient, $\mathcal{G}_s$ denotes the set of document pairs that received different click feedback at round $s$, i.e., $\mathcal{G}_s=\{(m,n):c_m^s\neq c_n^s,\forall\tau_s(m)<\tau_s(n)\leq o_s\}$, and $\mathbf{y}_{mn}^s$ indicates whether document $m$ is preferred over document $n$ in the click feedback, i.e., $\mathbf{y}_{mn}^s=\frac{1}{2}(c_m^s-c_n^s)+\frac{1}{2}$ (Burges, 2010). Due to the convexity of the loss function defined in Eq (1), the globally optimal solution $\hat{\boldsymbol{\theta}}_t$ at round $t$ exists and can be efficiently obtained by gradient descent.
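As a concrete illustration, the sketch below minimizes the loss in Eq (1) with off-the-shelf convex optimization; the data layout (a stacked history of difference vectors $\mathbf{x}_{mn}^s$ with binary labels $\mathbf{y}_{mn}^s$) and all names are our assumptions rather than the authors' code:

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_ranknet(X_pairs, y_pairs, lam=1.0):
        # X_pairs: (n, d) array of difference vectors x_mn = x_m - x_n,
        #          accumulated from G_1, ..., G_t.
        # y_pairs: (n,) array; y_mn = 1 if m was preferred over n, else 0.
        n, d = X_pairs.shape

        def loss(theta):
            p = sigmoid(X_pairs @ theta)
            eps = 1e-12  # guard against log(0)
            ce = -(y_pairs * np.log(p + eps) + (1 - y_pairs) * np.log(1 - p + eps))
            return ce.sum() + 0.5 * lam * theta @ theta

        def grad(theta):
            # Gradient of the pairwise cross-entropy plus L2 regularization.
            p = sigmoid(X_pairs @ theta)
            return X_pairs.T @ (p - y_pairs) + lam * theta

        # The loss is convex in theta, so L-BFGS finds the global optimum.
        res = minimize(loss, np.zeros(d), jac=grad, method="L-BFGS-B")
        return res.x  # the estimate theta_hat_t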

Online learning of RankNet boils down to the construction of $\{\mathcal{G}_s\}_{s=1}^{T}$ over time. However, the conventional practice of using all the inferred pairwise preferences from click feedback (Joachims, 2002; Agichtein et al., 2006) imposes a higher risk in an online setting. In the presence of click noise (e.g., a user mistakenly clicks on an irrelevant document), exhaustively pairing documents would cause a quadratically increasing number of noisy training instances, which impose a strong negative impact on the quality of the learned ranker (Carvalho et al., 2008). As the updated ranker is immediately executed, cascading ranker deterioration is possible. To alleviate this deficiency, we propose to only use independent pairs inferred from the feedback, i.e., $\mathcal{G}_s^{ind}=\{(m,n):c_m^s\neq c_n^s,\forall(\tau_s(m),\tau_s(n))\in D\}$, where $D$ represents the set of disjoint position pairs, for example, $D=\{(1,2),(3,4),\dots,(L-1,L)\}$. In other words, we only use a subset of non-overlapping pairwise comparisons for our online ranker update.
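A minimal sketch of this pair-selection step, assuming the disjoint position pairing $D=\{(1,2),(3,4),\dots\}$ given above (all names are hypothetical):

    def independent_pairs(ranked_docs, clicks):
        # ranked_docs: document ids in presented order (position 1 first).
        # clicks: clicks[k] = 1 if ranked_docs[k] was clicked, else 0.
        if 1 not in clicks:
            return []  # no click, so no preference can be inferred
        last_click = max(k for k, c in enumerate(clicks) if c == 1)
        o_t = min(last_click + 2, len(ranked_docs))  # examined prefix length

        pairs = []  # (i, j, y_ij) with y_ij = 1 iff i is preferred over j
        for k in range(0, o_t - 1, 2):  # disjoint position pairs (1,2), (3,4), ...
            if clicks[k] != clicks[k + 1]:  # keep only pairs with different feedback
                y = 1 if clicks[k] == 1 else 0  # y_mn = (c_m - c_n)/2 + 1/2
                pairs.append((ranked_docs[k], ranked_docs[k + 1], y))
        return pairs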

3.2.2. Uncertainty Quantification


Figure 2. Illustration of certain and uncertain rank orders.

As discussed in Section 3.1, $\hat{\boldsymbol{\theta}}_t$ is obtained based on the feedback acquired from what has been presented to the user, which is subject to various types of biases and noises (Joachims et al., 2005; Agichtein et al., 2006; Joachims et al., 2007). Hence, $\hat{\boldsymbol{\theta}}_t$ only reflects what the ranker knows so far, and it is vital to effectively explore the unknowns to complete its knowledge. In PairRank, we propose to explore in the pairwise document ranking space spanned by $\mathcal{X}_t$ under $q_t$, with respect to the current ranker's uncertainty about the pairwise comparisons.

Accurate quantification of the current ranker's uncertainty on its pairwise preference estimation is the key to such exploration. The model estimation uncertainty is caused by the existence of click noise, i.e., $\|\hat{\boldsymbol{\theta}}_t-\boldsymbol{\theta}^*\|\neq 0$, where $\boldsymbol{\theta}^*$ is the underlying ground-truth model parameter. This model estimation uncertainty directly leads to the uncertainty in the ranker's pairwise preference estimation. To quantify the source of uncertainty, we follow conventional click models and assume that clicks are independent of each other given the true relevance of documents, and so is their noise (Joachims et al., 2005; Guo et al., 2009b; Guo et al., 2009a). As a result, the pairwise noise becomes the sum of the noise from the two associated clicks. Because we only use the independent document pairs $\mathcal{G}^{ind}$, the pairwise noise in PairRank is independent across pairs and of the history of result serving, which directly leads to the following proposition.

Proposition 1.

For any $t\geq 1$ and $\forall(i,j)\in\mathcal{G}_t^{ind}$, define the pairwise noise $\epsilon_{ij}^t=\mathbf{y}_{ij}^t-\sigma({\mathbf{x}_{ij}^t}^\top\boldsymbol{\theta}^*)$. For all $t\geq 1$, $\epsilon_{ij}^t$ is a sub-Gaussian random variable with $\mathbb{E}[\epsilon_{ij}^t|\mathcal{G}_{t-1}^{ind},\{\epsilon^{t-1}\},\dots,\mathcal{G}_1^{ind},\{\epsilon^1\}]=0$, where $\{\epsilon^t\}=\{\epsilon_{ij}^t:(i,j)\in\mathcal{G}_t^{ind}\}$.

According to the properties of sub-Gaussian random variables, this assumption can be easily satisfied in practice as long as the pointwise click noise follows a sub-Gaussian distribution. For example, the pointwise noise can be modeled as a binary random variable related to the document's true relevance under the given query, which follows a $\frac{1}{2}$-sub-Gaussian distribution.

Therefore, based on the solution of Eq (1), the uncertainty of the estimated pairwise preference $\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}}_t)$ by RankNet at round $t$ can be analytically bounded with high probability, as shown in the following lemma.

Lemma 2.

(Confidence Interval of Pairwise Preference Estimation). At round $t<T$, for any pair of documents $\mathbf{x}_i^t$ and $\mathbf{x}_j^t$ under query $q_t$, with probability at least $1-\delta_1$, we have

$|\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}}_t)-\sigma({\mathbf{x}_{ij}^t}^\top\boldsymbol{\theta}^*)|\leq\alpha_t\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_t^{-1}},$

where $\alpha_t=(2k_\mu/c_\mu)\Big(\sqrt{R^2\log\big(\det(\mathbf{M}_t)/(\delta_1^2\det(\lambda\mathbf{I}))\big)}+\sqrt{\lambda}Q\Big)$, $\mathbf{M}_t=\lambda\mathbf{I}+\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_s^{ind}}\mathbf{x}_{mn}^s{\mathbf{x}_{mn}^s}^\top$, $k_\mu$ is the Lipschitz constant of the sigmoid link function $\sigma$, $c_\mu=\inf_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\dot{\sigma}(\mathbf{x}^\top\boldsymbol{\theta})$, with $\dot{\sigma}$ as the first derivative of $\sigma$, and $R$ is the sub-Gaussian parameter of the noise $\epsilon$.

The detailed proof of Lemma 2 can be found in the appendix. This lemma provides a tight high probability bound of the pairwise preference estimation uncertainty under a RankNet specified by 𝜽^t\hat{\boldsymbol{\theta}}_{t}, which enables us to perform efficient pairwise exploration for model update. To better illustrate our exploration strategy based on the pairwise estimation uncertainty, we introduce the following definition on document pairs.

Definition 3.

(Certain Rank Order) At any round $t<T$, the ranking order between documents $i$ and $j$, denoted as $(i,j)$, is considered a certain rank order if and only if $\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}}_t)-\alpha_t\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_t^{-1}}>\frac{1}{2}$.

Intuitively, based on Lemma 2, if $(i,j)$ is a certain rank order, then with high probability the estimated preference (order) between documents $i$ and $j$ is consistent with the ground-truth. For example, as shown in Figure 2, $\hat{\sigma}_{ij}^t$ and $\hat{\sigma}_{mn}^t$ represent the estimated pairwise preferences on document pairs $(i,j)$ and $(m,n)$ based on $\hat{\boldsymbol{\theta}}_t$, while $CB_{ij}^t$ and $CB_{mn}^t$ represent the corresponding confidence bounds defined in Lemma 2, i.e., $CB_{ij}^t=\alpha_t\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_t^{-1}}$ and $CB_{mn}^t=\alpha_t\|\mathbf{x}_{mn}^t\|_{\mathbf{M}_t^{-1}}$. According to Lemma 2, the ground-truth pairwise preferences, $\sigma_{ij}^*$ and $\sigma_{mn}^*$, lie within the corresponding confidence intervals with probability at least $1-\delta_1$, i.e., $\sigma_{ij}^*\in[\hat{\sigma}_{ij}^t-CB_{ij}^t,\hat{\sigma}_{ij}^t+CB_{ij}^t]$. In Figure 2, for pair $(m,n)$, the lower bound of its pairwise estimation, $\hat{\sigma}_{mn}^t-CB_{mn}^t$, is greater than $\frac{1}{2}$. This indicates that with high probability $1-\delta_1$, the estimated preference between documents $m$ and $n$ is consistent with the ground-truth model $\boldsymbol{\theta}^*$, and thus there is no need to explore this pair. In contrast, with $\hat{\sigma}_{ij}^t-CB_{ij}^t<\frac{1}{2}$, the estimated order $(i\succ j)$ is still uncertain, as the ground-truth model may impose the opposite order; hence, exploration on this pair is necessary.

We use $\mathcal{E}_c^t$ to represent the set of all certain rank orders at round $t$. Accordingly, the set of uncertain rank orders at round $t$ is defined as $\mathcal{E}_u^t=\{(i,j)\in[L_t]^2:(i,j)\notin\mathcal{E}_c^t\wedge(j,i)\notin\mathcal{E}_c^t\}$.
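In implementation terms, checking whether a pair is a certain rank order amounts to a few matrix operations. The sketch below is our own, with $\alpha_t$ computed from Lemma 2 (note that $\det(\lambda\mathbf{I})=\lambda^d$ for a $d$-dimensional model):

    import numpy as np

    def alpha_t(M, lam, k_mu, c_mu, R, Q, delta1):
        # alpha_t from Lemma 2.
        d = M.shape[0]
        log_term = np.log(np.linalg.det(M) / (delta1 ** 2 * lam ** d))
        return (2 * k_mu / c_mu) * (np.sqrt(R ** 2 * log_term) + np.sqrt(lam) * Q)

    def is_certain_order(x_i, x_j, theta_hat, M_inv, alpha):
        # Definition 3: (i, j) is certain iff the lower confidence bound of
        # the estimated pairwise preference stays above 1/2.
        x_ij = x_i - x_j
        sigma_hat = 1.0 / (1.0 + np.exp(-(x_ij @ theta_hat)))
        cb = alpha * np.sqrt(x_ij @ M_inv @ x_ij)  # alpha * ||x_ij||_{M^{-1}}
        return sigma_hat - cb > 0.5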

3.2.3. Explore the Unknowns via Divide-and-Conquer

Algorithm 1. PairRank
Input: $\lambda$, $\delta_1$, $\delta_2$. Initialize $\mathbf{M}_0=\lambda\mathbf{I}$, $\hat{\boldsymbol{\theta}}_1=\mathbf{0}$.
for $t=1$ to $T$ do
  1. Receive query $q_t$ and its corresponding candidate document set $\mathcal{X}_t=\{\mathbf{x}_1^t,\mathbf{x}_2^t,\dots,\mathbf{x}_{L_t}^t\}$.
  2. $\mathcal{E}_c^t=\{(i,j)\in[L_t]^2:\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}}_{t-1})-\alpha_{t-1}\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_{t-1}^{-1}}>1/2\}$.
  3. $\mathcal{E}_u^t=\{(i,j)\in[L_t]^2:(i,j)\notin\mathcal{E}_c^t\wedge(j,i)\notin\mathcal{E}_c^t\}$.
  4. Construct the ordered block list $\mathcal{B}_t=\{\mathcal{B}_1^t,\mathcal{B}_2^t,\dots,\mathcal{B}_{d_t}^t\}$.
  5. Generate the ranked list $\tau_t=\{\pi(\mathcal{B}_1^t),\pi(\mathcal{B}_2^t),\dots,\pi(\mathcal{B}_{d_t}^t)\}$.
  6. Observe click feedback $C_t$ and the corresponding last examined position $o_t$.
  7. $\mathcal{G}_t^{ind}=\{(i,j)\in[o_t]^2:c_i^t\neq c_j^t\wedge(\tau_t(i),\tau_t(j))\in D\}$.
  8. Estimate $\hat{\boldsymbol{\theta}}_t$ as the solution of Eq (1).
  9. $\mathbf{M}_t=\mathbf{M}_{t-1}+\sum_{(i,j)\in\mathcal{G}_t^{ind}}\mathbf{x}_{ij}^t{\mathbf{x}_{ij}^t}^\top$.
end for

With the aforementioned pairwise estimation uncertainty and the corresponding sets of certain and uncertain rank orders, i.e., $\mathcal{E}_c^t$ and $\mathcal{E}_u^t$, we can effectively explore the unknowns. Intuitively, we only need to randomize the ranking among documents whose pairwise orders the model is still uncertain about, i.e., the uncertain rank orders, and thereby obtain feedback to further update the model (and reduce its uncertainty). For example, in the document graph shown in Figure 1, the solid lines represent certain rank orders, while the dashed lines represent uncertain rank orders. When generating the ranked list, we should randomly swap the order between documents 1 and 2 (i.e., explore) while preserving the order between document 1 and documents 3, 4, and 5 (i.e., exploit).

This naturally leads to a divide-and-conquer exploration strategy in the space of pairwise document comparisons. Specifically, we partition $\mathcal{X}_t$ into different parts (referred to as blocks hereafter) based on $\mathcal{E}_c^t$ and $\mathcal{E}_u^t$, such that the ranking orders between any documents belonging to different blocks are certain, as shown in the block graph in Figure 1. The ranked list can thus be generated by a topological sort across blocks and random shuffling within each block. As the exploration is confined to the pairwise ranking space, it effectively reduces the exponentially sized action space to a quadratic one.

Algorithm 1 shows the detailed steps of PairRank. At round $t$, we first construct $\mathcal{E}_c^t$ and $\mathcal{E}_u^t$ according to the current model $\hat{\boldsymbol{\theta}}_{t-1}$. Then, we create the blocks of $\mathcal{X}_t$ according to the definition below.

Definition 4.

(Block) At any round $t<T$, a block $\mathcal{B}_d^t$ is a set of documents that satisfies:

(1) $\forall i\in\mathcal{B}_d^t$, $(i,j)\in\mathcal{E}_c^t$ for any $j\in[L_t]\setminus\mathcal{B}_d^t$;

(2) $\nexists k\in[L_t]\setminus\mathcal{B}_d^t$ such that for $i,j\in\mathcal{B}_d^t$, $(i,k)\in\mathcal{E}_c^t\wedge(k,j)\in\mathcal{E}_c^t$.

Intuitively, each block is a subset of documents linked to each other by uncertain rank orders. It can be viewed as a connected component in an undirected graph with documents as vertices and uncertain rank orders as edges under the given query. The connected components (blocks) can be found by linearly scanning through the vertices and running breadth-first or depth-first search from each unvisited vertex. When the algorithm stops, each vertex (document) is assigned to a connected component (block). Once the blocks are constructed, the order of the blocks can be obtained by topological sort (line 4 of Algorithm 1). Let $\mathcal{B}_t=\{\mathcal{B}_1^t,\mathcal{B}_2^t,\dots,\mathcal{B}_{d_t}^t\}$ be the ordered block list for $\mathcal{X}_t$ at round $t$; the ranked list $\tau_t$ is then generated as $\tau_t=\{\pi(\mathcal{B}_1^t),\pi(\mathcal{B}_2^t),\dots,\pi(\mathcal{B}_{d_t}^t)\}$, where $\pi(\cdot)$ randomly permutes its input set as output.

To further improve exploration efficiency, we propose two options for generating the ranked list. As shown in Figure 1, the first ranked list is generated by randomly shuffling all documents within each block (referred to as random exploration), while in the second list only the uncertain rank orders are shuffled and the certain ones are preserved (referred to as conservative exploration). In our empirical studies, we observe that such conservative exploration yields better improvement than random exploration, which further confirms the importance of efficient exploration in OL2R.
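Lines 4-5 of Algorithm 1 can be sketched as follows: blocks are the connected components of the uncertain-order graph, their topological order can be recovered by sorting on the estimated scores (certain orders always agree with the score order, since $\sigma(\mathbf{x}_{ij}^\top\hat{\boldsymbol{\theta}})>\frac{1}{2}$ implies $\mathbf{x}_i^\top\hat{\boldsymbol{\theta}}>\mathbf{x}_j^\top\hat{\boldsymbol{\theta}}$), and each block is then shuffled. This is the random-exploration (PairRank_R) variant; all names are our own illustration.

    import random
    from collections import defaultdict

    def rank_with_blocks(docs, scores, uncertain_pairs):
        # docs: document ids; scores[i] = x_i^T theta_hat;
        # uncertain_pairs: iterable of (i, j) pairs in E_u^t.
        adj = defaultdict(set)
        for i, j in uncertain_pairs:  # uncertain orders form an undirected graph
            adj[i].add(j)
            adj[j].add(i)

        blocks, seen = [], set()
        for doc in docs:  # blocks = connected components (iterative DFS)
            if doc in seen:
                continue
            comp, stack = [], [doc]
            seen.add(doc)
            while stack:
                u = stack.pop()
                comp.append(u)
                for v in adj[u] - seen:
                    seen.add(v)
                    stack.append(v)
            blocks.append(comp)

        # Every cross-block pair is a certain order, so sorting blocks by
        # score yields a valid topological order across blocks.
        blocks.sort(key=lambda b: max(scores[d] for d in b), reverse=True)

        ranked = []
        for block in blocks:
            random.shuffle(block)  # PairRank_R; PairRank_C would only swap
            ranked.extend(block)   # documents joined by uncertain orders.
        return ranked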

3.3. Regret Analysis

We theoretically analyze the regret of PairRank, defined as the cumulative number of mis-ordered pairs in its proposed ranked lists up to round $T$. The key to analyzing this regret is to quantify how fast the model becomes certain about its pairwise preference estimation on the candidate documents. First, we define $E_t$ as the success event at round $t$:

$E_t=\big\{\forall(i,j)\in[L_t]^2,\ |\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}}_t)-\sigma({\mathbf{x}_{ij}^t}^\top\boldsymbol{\theta}^*)|\leq\alpha_t\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_t^{-1}}\big\}.$

Intuitively, $E_t$ is the event that the estimated $\hat{\boldsymbol{\theta}}_t$ is "close" to the optimal model $\boldsymbol{\theta}^*$ at round $t$. According to Lemma 2, it is easy to reach the following conclusion.

Corollary 5.

On the event $E_t$, it holds that $\sigma({\mathbf{x}_{ij}^t}^\top\boldsymbol{\theta}^*)>1/2$ if $(i,j)\in\mathcal{E}_c^t$.

PairRank suffers regret as a result of misplacing a pair of documents, i.e., swapping a pair already in the correct order. Based on Corollary 5, under event $E_t$, the certain rank orders identified by PairRank are consistent with the ground-truth. As our partition design always places documents under a certain rank order into distinct blocks, under event $E_t$ the ranking order across blocks is consistent with the ground-truth. In other words, regret only occurs when ranking documents within each block.

To analyze the regret caused by random shuffling within each block, we need the following technical lemma from random matrix theory, which we adapt from Equation (5.23) of Theorem 5.39 in (Vershynin, 2010).

Lemma 6.

Let $A\in\mathbb{R}^{n\times d}$ be a matrix whose rows $A_i$ are independent sub-Gaussian isotropic random vectors in $\mathbb{R}^d$ with parameter $\sigma$, namely $\mathbb{E}[\exp(x^\top(A_i-\mathbb{E}[A_i]))]\leq\exp(\sigma^2\|x\|^2/2)$ for any $x\in\mathbb{R}^d$. Then there exist positive universal constants $C_1$ and $C_2$ such that, for every $t\geq 0$, the following holds with probability at least $1-2\exp(-C_2t^2)$: $\|A^\top A/n-\mathbf{I}_d\|\leq\max\{\epsilon,\epsilon^2\}$, where $\epsilon=\sigma(C_1\sqrt{d/n}+t/\sqrt{n})$.

The detailed proof can be found in (Vershynin, 2010). We should note the condition in Lemma 6 is not hard to satisfy in OL2R: at every round, the ranker is serving a potentially distinct query; and even for the same query, different documents might be returned at different times. This gives the ranker a sufficient chance to collect informative observations for model estimation. Based on Lemma 6, we have the following lemma, which provides a tight upper bound of the probability that PairRank’s estimation of the pairwise preference is an uncertain rank order.

Lemma 7.

At round $t\geq t'$, with $\delta_1\in(0,\frac{1}{2})$, $\delta_2\in(0,\frac{1}{2})$, $\beta\in(0,\frac{1}{2})$, and $C_1$, $C_2$ defined in Lemma 6, under event $E_t$, the following holds with probability at least $1-\delta_2$: $\forall(i,j)\in[L_t]^2$, $\mathbb{P}\big((i,j)\in\mathcal{E}_u^t\big)\leq\frac{8k_\mu^2\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_t^{-1}}^2}{(1-2\beta)c_\mu^2\Delta_{\min}^2}\log\frac{1}{\delta_1}$, with $t'=\Big(\frac{c_1\sqrt{d}+c_2\sqrt{\log(1/\delta_2)}+abd\sqrt{o_{\max}u^2/(d\lambda)}}{\lambda_{\min}(\mathbf{\Sigma})}\Big)^2+\frac{2ab\log(1/\delta_1^2)+8a\lambda Q^2-\lambda}{\lambda_{\min}(\mathbf{\Sigma})}$, where $\Delta_{\min}=\min_{t\in[T],(i,j)\in[L_t]^2}|\sigma({\mathbf{x}_{ij}^t}^\top\boldsymbol{\theta}^*)-\frac{1}{2}|$ represents the smallest gap of pairwise difference between any pair of documents associated with the same query over time (across all queries), $a=4k_\mu^2u^2/(\beta^2c_\mu^2\Delta_{\min}^2)$, and $b=R^2+4\sqrt{\lambda}QR$.

Proof Sketch.

According to the definition of certain rank order, a pairwise estimation $\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}})$ is certain if and only if $|\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}})-1/2|\geq\alpha_t\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_t^{-1}}$. By the reverse triangle inequality, the probability of an uncertain rank order can be upper bounded by $\mathbb{P}\big(\big||\sigma({\mathbf{x}_{ij}^t}^\top\hat{\boldsymbol{\theta}})-\sigma({\mathbf{x}_{ij}^t}^\top\boldsymbol{\theta}^*)|-|\sigma({\mathbf{x}_{ij}^t}^\top\boldsymbol{\theta}^*)-1/2|\big|\geq\alpha_t\|\mathbf{x}_{ij}^t\|_{\mathbf{M}_t^{-1}}\big)$, which can be further bounded by Theorem 1 in (Abbasi-Yadkori et al., 2011). The key in this proof is to obtain a tighter bound on the uncertainty of PairRank's parameter estimation than the bound determined by $\delta_1$ in Lemma 2, such that its confidence interval on a pair of documents' comparison at round $t$ excludes the possibility of flipping their ranking order, i.e., the lower confidence bound of this pairwise estimation is above $0.5$. ∎

In each round of result serving, as the model $\hat{\boldsymbol{\theta}}_t$ does not change until the next round, the expected number of uncertain rank orders, denoted as $N_t=|\mathcal{E}_u^t|$, can be estimated by the sum of the uncertainty probabilities over all possible pairwise comparisons under the current query $q_t$, i.e., $\mathbb{E}[N_t]=\frac{1}{2}\sum_{(i,j)\in[L_t]^2}\mathbb{P}\big((i,j)\in\mathcal{E}_u^t\big)$.

Denote $p_t$ as the probability that the user examines all documents in $\tau_t$ at round $t$, and let $p^*=\min_{1\leq t\leq T}p_t$ be the minimal probability that all documents in a query are examined over time. We present the upper regret bound of PairRank as follows.

Theorem 8.

Assume the pairwise query-document feature vectors $\mathbf{x}_{ij}^t$ under query $q_t$, where $(i,j)\in[L_t]^2$ and $t\in[T]$, satisfy Proposition 1. With $\delta_1\in(0,\frac{1}{2})$, $\delta_2\in(0,\frac{1}{2})$, and $\beta\in(0,\frac{1}{2})$, the $T$-step regret of PairRank is upper bounded by:

$R_T\leq R'+(1-\delta_1)(1-\delta_2){p^*}^{-2}\Big(2adL_{\max}\log\big(1+\frac{o_{\max}Tu^2}{2d\lambda}\big)+aw\Big)^2$

where $R'=t'L_{\max}^2+(T-t')\big(\delta_2L_{\max}^2-(1-\delta_2)\delta_1L_{\max}^2\big)$, with $t'$ and $a$ defined in Lemma 7, $w=\sum_{s=t'}^{T}(L_{\max}^2-2L_{\max})u^2/\lambda_{\min}(\mathbf{M}_s)$, and $L_{\max}$ representing the maximum number of documents associated with the same query over time. By choosing $\delta_1=\delta_2=1/T$, the expected regret is at most $R_T\leq O(d\log^4(T))$.

Proof Sketch.

The detailed proof is provided in the appendix. Here, we only present the key ideas behind our regret analysis. The regret is first decomposed into two parts: $R'$ represents the regret when either $E_t$ or Lemma 7 does not hold, in which case the regret is out of our control and we bound it using $L_{\max}$, the maximum number of documents associated with a query over time. The second part corresponds to the case when both events hold. Then, the instantaneous regret at round $t$ can be bounded by

(2) $r_t=\mathbb{E}\big[K(\tau_t,\tau_t^*)\big]=\sum_{i=1}^{d_t}\mathbb{E}\big[\frac{(N_i^t+1)N_i^t}{2}\big]\leq\mathbb{E}\big[\frac{N_t(N_t+1)}{2}\big]$

where $N_i^t$ denotes the number of uncertain rank orders in block $\mathcal{B}_i^t$ at round $t$, and $N_t$ denotes the total number of uncertain rank orders. The last inequality follows from the worst case where all $N_t$ uncertain rank orders are placed into the same block, which then generates at most $(N_t^2+N_t)/2$ mis-ordered pairs under random shuffling: based on the blocks created by PairRank, a block with $N_t$ uncertain rank orders can contain at most $N_t+1$ documents. The cumulative number of mis-ordered pairs can then be bounded by the probability of observing uncertain rank orders in each round, which shrinks rapidly as more observations are collected over time. ∎

Remark (1).

By choosing $\delta=1/T$, the theorem shows that the expected regret increases at a rate of $\mathcal{O}(\log^4 T)$. In this analysis, we provide a gap-dependent regret upper bound for PairRank, where the gap $\Delta_{\min}$ characterizes the hardness of sorting the $L_t$ candidate documents at round $t$. As the matrix $\mathbf{M}_t$ only contains information from observed document pairs, we adopt the probability that a ranked list is fully observed to handle the partial feedback (Kveton et al., 2015b, c), which is a constant independent of $T$.

Remark (2).

Our regret is defined over the number of mis-ordered pairs; to the best of our knowledge, this is the first pairwise regret analysis for an OL2R algorithm. As discussed before, existing OL2R algorithms optimize their own metrics, which can hardly be linked to any conventional ranking metric. As shown in (Wang et al., 2018b), most classical ranking evaluation metrics, such as ARP and NDCG, are based on pairwise document comparisons. Our regret analysis thus connects PairRank's theoretical properties with such metrics, which is later confirmed in our empirical evaluations.

4. Experiments

Figure 3. Offline ranking performance (NDCG@10) on two different datasets under three different click models: (a) the MSLR-WEB10K dataset; (b) the Yahoo! dataset.

In this section, we empirically compare our proposed PairRank with an extensive list of state-of-the-art OL2R algorithms on two public learning to rank benchmark datasets. Both quantitative and qualitative evaluations are performed to examine our proposed solution, especially its advantages over existing OL2R solutions in online learning efficiency.

Reproducibility: Our entire codebase, baselines, analysis, and experiments can be found on GitHub at https://github.com/yilingjia/PairRank.

4.1. Experiment Setup

Datasets. We experiment on two publicly available learning to rank datasets: 1) the Yahoo! Learning to Rank Challenge dataset (Chapelle and Chang, 2011), which consists of 292,921 queries and 709,877 documents represented by 700 ranking features; and 2) MSLR-WEB10K (Qin and Liu, 2013), which contains 30,000 queries, each with 125 assessed documents on average, represented by 136 ranking features. Both datasets are labeled on a five-grade relevance scale, from not relevant (0) to perfectly relevant (4). We follow the train/test/validation split provided with the datasets.

Table 1. Configuration of simulated click models.

                  Click Probability      Stop Probability
Relevance grade    0     1     2          0     1     2
Perfect            0.0   0.5   1.0        0.0   0.0   0.0
Navigational       0.05  0.5   0.95       0.2   0.5   0.9
Informational      0.4   0.7   0.9        0.1   0.3   0.5

User Interaction Simulation. We simulate user behavior via the standard setup for OL2R evaluations (Oosterhuis and de Rijke, 2018; Wang et al., 2019) to make our reported results directly comparable to those in the literature. First, at each time $t$, a query is uniformly sampled from the training set for result serving. Then, the model determines the ranked list and returns it to the user. The interaction between the user and the list is then simulated with a dependent click model (Guo et al., 2009a), which assumes that the user sequentially examines the list and makes a click decision on each document. At each position, the user decides whether to click on the document, modeled as a probability conditioned on the document's relevance label, i.e., $\mathbb{P}(\text{click}=1\mid\text{relevance grade})$. After a click, the user might stop due to his/her satisfaction with the result, or continue to examine more results. The stop probability after a click is modeled as $\mathbb{P}(\text{stop}=1\mid\text{click}=1,\text{relevance grade})$. If there is no click, the user continues examining the next position. We employ three different click model configurations to represent three different types of users, with details shown in Table 1. Basically, we have perfect users, who click on all relevant documents and do not stop browsing until the last returned document; navigational users, who are very likely to click on the first highly relevant document and stop there; and informational users, who tend to examine more documents but sometimes click on irrelevant ones, and thus contribute a significant amount of noise in their click feedback.
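For reference, a minimal simulator of this dependent click model, using the probabilities of Table 1 indexed by relevance grade (our own sketch, with grades assumed to be in {0, 1, 2} as in the table):

    import random

    CLICK_PROB = {"perfect":       [0.0, 0.5, 1.0],
                  "navigational":  [0.05, 0.5, 0.95],
                  "informational": [0.4, 0.7, 0.9]}
    STOP_PROB = {"perfect":       [0.0, 0.0, 0.0],
                 "navigational":  [0.2, 0.5, 0.9],
                 "informational": [0.1, 0.3, 0.5]}

    def simulate_clicks(grades, user="informational"):
        # grades: relevance grades of the ranked documents, top to bottom.
        clicks = [0] * len(grades)
        for k, g in enumerate(grades):  # sequential examination
            if random.random() < CLICK_PROB[user][g]:
                clicks[k] = 1
                if random.random() < STOP_PROB[user][g]:
                    break  # user is satisfied and stops examining
        return clicks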

Baselines. As our PairRank learns a parametric ranker, we focus our comparison against existing parametric OL2R solutions, which are known to be more generalizable and powerful than those estimating pointwise query-document relevance (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018). We list the OL2R solutions used for our empirical comparisons below.

• DBGD (Yue and Joachims, 2009): As detailed in our related work discussion, DBGD uniformly samples an exploratory direction from the entire model space for exploration and model update.

• MGD (Schuth et al., 2016): It improves DBGD by sampling multiple directions and comparing them via a multileaving test. If there is a tie, the model updates towards the mean of all winners.

• PDGD (Oosterhuis and de Rijke, 2018): PDGD samples the next ranked document from a Plackett-Luce model and estimates the gradient from the inferred pairwise result preferences.

• $\epsilon$-Greedy (Hofmann et al., 2013): At each position, it randomly samples an unranked document with probability $\epsilon$, or selects the next best document based on the currently learned RankNet with probability $1-\epsilon$.

• SGD RankNet: It estimates a single layer RankNet with stochastic gradient descent, using the same click preference inference method as PairRank. At each round, it presents the best ranking based on the currently learned RankNet model.

• PairRank: We compare two shuffling strategies within each partition: 1) randomly shuffling the documents, denoted as PairRank_R; and 2) shuffling only with respect to the uncertain rank orders while preserving the certain rank orders within each partition, denoted as PairRank_C (as illustrated in Figure 1).

4.2. Experiment results

4.2.1. Offline Performance

In offline evaluation, while the algorithm is learning, we evaluate the learned rankers on a separate testing set using its ground-truth relevance labels. We use Normalized Discounted Cumulative Gain at 10 (NDCG@10) to assess the ranking quality of different algorithms. We compare all algorithms over 3 click models and 2 datasets. We set the hyper-parameters, such as the learning rate, of DBGD, MGD, and PDGD according to their original papers. For SGD RankNet, $\epsilon$-Greedy, and our proposed PairRank, we set the hyper-parameters via the best performance on the validation data of each dataset. We fix the total number of iterations $T$ to 5000. We execute the experiments 20 times with different random seeds and report the averaged results in Figure 3. For comparison purposes, we also include the performance of an offline RankNet trained on the ground-truth relevance labels of the training set, whose performance can be considered as an upper bound in our evaluation.

Figure 4. Online ranking performance (cNDCG@10) on two different datasets under three different click models: (a) the MSLR-WEB10K dataset; (b) the Yahoo! dataset.

We can clearly observe that PairRank achieves significant improvements over all baselines. Across the two datasets and different click models, DBGD and MGD perform worse than the other OL2R solutions. This is consistent with previous studies' findings: DBGD and MGD depend on the interleaved test to determine the update direction across rankers, but such model-level feedback can hardly inform the optimization of any rank-based metric (e.g., NDCG) directly. PDGD consistently outperforms DBGD-type solutions under different click models and datasets. However, its document sampling-based exploration limits its learning efficiency, especially when users only examine a small number of documents (e.g., the navigational users). $\epsilon$-Greedy seriously suffers from its random exploration, which is independent of how the current ranker performs. Oftentimes, as shown in Figure 3(a), its performance is even worse than MGD, although $\epsilon$-Greedy can directly optimize the pairwise ranking loss. One interesting observation is that SGD RankNet performs comparably to PDGD. Exploration in SGD RankNet is implicitly achieved via stochastic gradient descent. However, because this passive exploration is subject to the given queries and the noise in user feedback, its performance in online result serving is hard to predict. We attribute PairRank's fast convergence to its uncertainty-based exploration: it only explores when its estimation on a pair of documents is uncertain. As proved in our regret analysis, the number of such pairs shrinks at a logarithmic rate, such that more and more documents are presented at their correct ranking positions as PairRank learns from user feedback. This conclusion is further backed by the comparison between PairRank_R and PairRank_C: by respecting the certain rank orders within a partition, PairRank_C further improves result ranking quality. When click noise is small, i.e., with perfect users, PairRank converges to its offline counterpart with only a few thousand impressions.

4.2.2. Online Performance

In OL2R, in addition to the offline evaluation, a model's ranking performance during online optimization should also be evaluated, as it reflects user experience. Sacrificing user experience for model training would compromise the goal of OL2R. We adopt the cumulative Normalized Discounted Cumulative Gain (Oosterhuis and de Rijke, 2018) to assess models' online performance. For $T$ rounds, the cumulative NDCG (cNDCG) is calculated as

$\text{cNDCG}=\sum_{t=1}^{T}\text{NDCG}(\tau_t)\cdot\gamma^{t-1},$

which computes the expected NDCG reward a user receives, with probability $\gamma$ that he/she stops searching after each query (Oosterhuis and de Rijke, 2018). Following previous work (Oosterhuis and de Rijke, 2018; Wang et al., 2019; Wang et al., 2018a), we set $\gamma=0.9995$.
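In code, the metric is a one-liner (assuming per-round NDCG@10 values have been recorded; our own sketch):

    def cumulative_ndcg(ndcg_per_round, gamma=0.9995):
        # cNDCG = sum_t NDCG(tau_t) * gamma^(t - 1), with rounds 1-indexed.
        return sum(ndcg * gamma ** t for t, ndcg in enumerate(ndcg_per_round))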

Figure 4 shows the online performance of PairRank and all the other baselines. It is clear that directly adding model-independent exploration, e.g., $\epsilon$-greedy, strongly hurts a model's online ranking performance, and therefore hurts the user experience. Compared to PDGD, SGD RankNet shows consistently better performance, especially under the navigational and informational click models. We attribute this difference to the exploration strategy inside PDGD: though SGD RankNet has limited exploration and focuses more on exploiting the learned model when ranking the documents, PDGD's sampling-based exploration might introduce unwanted distortion into the ranked results, especially at the early stage of online learning.

We should note that in cumulative NDCG, the earlier stages play a much more important role due to the strong shrinking effect of $\gamma$. Our proposed PairRank demonstrates significant improvements over all baselines. Such improvement indicates the effectiveness of our uncertainty-based exploration, which only explores when the ranker's pairwise estimation is uncertain. We can also observe the difference between PairRank_C and PairRank_R in this online evaluation. This is because PairRank_C preserves the certain rank orders within blocks, which further eliminates unnecessary exploration (random shuffling) in result serving.

4.2.3. Closeness to the Ideal Ranker

Figure 5. Cosine similarity between the offline RankNet model and online models.

In this experiment, we view the RankNet model trained offline on the complete relevance annotations as the ideal model, denoted as $w^*$, and compare the cosine similarities between $w^*$ and the online estimated models in Figure 5. We can observe that all the pairwise models push the learned ranker closer to the ideal one with more iterations, while PairRank converges faster and closer to $w^*$. Apparently, the rankers obtained by DBGD and MGD are quite different from the ideal model. This confirms our earlier analysis that DBGD-type ranker updates can hardly be linked to any rank-based metric optimization, and thus the learned ranker becomes less related to the offline learned one. The difference in objective functions also explains why SGD RankNet converges faster than PDGD: PDGD optimizes a Plackett-Luce model, while SGD RankNet shares the same objective function as the offline model.

4.2.4. Zoom into PairRank

Figure 6. The size of blocks at top ranks: (a) the MSLR-WEB10K dataset; (b) the Yahoo! dataset.

To further verify the effectiveness of the exploration strategy devised in PairRank, we zoom into the trace of its block sizes across queries during the online model update. As PairRank uses random shuffling within each block for exploration, a smaller block size, especially at the top-ranked positions, is preferred for reducing regret. Figure 6 shows the size of document blocks at rank positions 1, 5, and 10.

First, we can clearly observe that after a few hundred rounds of interaction, the block sizes quickly converge to 1, especially at the top-ranked positions. This confirms our theoretical analysis of PairRank's block convergence. By comparing the results across different click models, we can observe that the block size converges more slowly under the navigational click model. Similar trends can be observed in Figure 7(b), where the number of blocks is averaged over every 200 iterations to reduce variance. We can observe that at the early stages, PairRank under the navigational click model has fewer blocks (hence more documents in one block), which indicates higher uncertainty in model estimation. The key reason is that many fewer clicks can be observed in each round under the navigational click model, as the stop probability is much higher, i.e., there is stronger position bias. As a result, more interactions are needed to improve the model estimation. For the same reason, the observation in Figure 6 that the block size at rank position 10 shrinks much more slowly than at the other positions also suggests that position bias is the main factor slowing down the learning of PairRank.

Figure 7. Qualitative analysis of blocks in PairRank: (a) block size at each rank position on MSLR-WEB10K with the informational click model; (b) number of blocks identified on MSLR-WEB10K.

In Figure 7(a), we show the block size at each rank position in PairRank under the informational click model at different time points. One interesting finding is that the blocks at the middle ranks, i.e., ranks 300 to 400, tend to contain more documents, while the blocks at both the top and the bottom tend to contain fewer. We attribute this to the pairwise learning in PairRank: since uncertainty is computed on pairwise preferences, document pairs with larger differences, which concentrate at the two ends of the ranked list, are identified as certain much earlier.

5. Conclusion

Existing OL2R solutions suffer from slow convergence and sub-optimal performance due to inefficient exploration and limited optimization strategies. Motivated by the success of offline models, we propose to estimate a pairwise learning to rank model on the fly, named PairRank. Based on the model's confidence in its estimated pairwise rank orders, exploration is performed only on the pairs where the ranker is still uncertain, i.e., divide-and-conquer. We prove a sub-linear upper regret bound defined on the number of mis-ordered pairs, which directly links PairRank's convergence with classical ranking evaluations. Our empirical experiments support our regret analysis and demonstrate significant improvements of PairRank over several state-of-the-art OL2R baselines.

Our effort sheds light on moving more powerful offline learning to rank solutions online. Currently, our work is based on a single-layer RankNet for the purpose of analysis. Following recent efforts in the convergence analysis of deep learning (Zhou et al., 2019), it is possible to extend PairRank with deep ranking models and to directly optimize rank-based metrics (such as NDCG). Furthermore, most OL2R solutions focus on population-level ranker estimation; thanks to PairRank's improved learning efficiency, it becomes possible to study individual-level ranking problems, e.g., personalized OL2R.

6. Acknowledgements

We thank the reviewers for their insightful comments. This work is based upon work supported by the National Science Foundation under grants IIS-1553568 and IIS-1618948, and by a Google Faculty Research Award.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved algorithms for linear stochastic bandits. In NIPS. 2312–2320.
  • Agichtein et al. (2006) Eugene Agichtein, Eric Brill, and Susan Dumais. 2006. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th ACM SIGIR. ACM, 19–26.
  • Burges (2010) Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23-581 (2010), 81.
  • Carvalho et al. (2008) Vitor R Carvalho, Jonathan L Elsas, William W Cohen, and Jaime G Carbonell. 2008. Suppressing outliers in pairwise preference ranking. In Proceedings of the 17th ACM CIKM. 1487–1488.
  • Chapelle and Chang (2011) Olivier Chapelle and Yi Chang. 2011. Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge. 1–24.
  • Chapelle et al. (2012) Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM TOIS 30, 1 (2012), 6.
  • Filippi et al. (2010) Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. 2010. Parametric bandits: The generalized linear case. In NIPS. 586–594.
  • Guo et al. (2009b) Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. 2009b. Click chain model in web search. In Proceedings of the 18th WWW. 11–20.
  • Guo et al. (2009a) Fan Guo, Chao Liu, and Yi Min Wang. 2009a. Efficient multiple-click models in web search. In Proceedings of the 2nd WSDM. 124–131.
  • Herbrich et al. (1999) Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 1999. Support vector learning for ordinal regression. (1999).
  • Hofmann et al. (2012) Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2012. Estimating interleaved comparison outcomes from historical click data. In Proceedings of the 21st CIKM. 1779–1783.
  • Hofmann et al. (2013) Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2013. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval 16, 1 (2013), 63–90.
  • Joachims (2002) Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM KDD. 133–142.
  • Joachims et al. (2005) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th ACM SIGIR. ACM, 154–161.
  • Joachims et al. (2007) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. 2007. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS 25, 2 (2007), 7.
  • Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the 10th ACM WSDM. ACM, 781–789.
  • Katariya et al. (2016) Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2016. DCM bandits: Learning to rank with multiple clicks. In ICML. 1215–1224.
  • Kveton et al. (2018) Branislav Kveton, Chang Li, Tor Lattimore, Ilya Markov, Maarten de Rijke, Csaba Szepesvari, and Masrour Zoghi. 2018. Bubblerank: Safe online learning to rerank. arXiv preprint arXiv:1806.05819 (2018).
  • Kveton et al. (2015a) Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. 2015a. Cascading bandits: Learning to rank in the cascade model. In ICML. 767–776.
  • Kveton et al. (2015b) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. 2015b. Combinatorial cascading bandits. In NIPS. 1450–1458.
  • Kveton et al. (2015c) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. 2015c. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics. 535–543.
  • Lattimore et al. (2018) Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. 2018. Toprank: A practical algorithm for online stochastic ranking. In NIPS. 3945–3954.
  • Li et al. (2018) Shuai Li, Tor Lattimore, and Csaba Szepesvári. 2018. Online learning to rank with features. arXiv preprint arXiv:1810.02567 (2018).
  • Li et al. (2016) Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. 2016. Contextual Combinatorial Cascading Bandits.. In ICML, Vol. 16. 1245–1253.
  • Oosterhuis and de Rijke (2017) Harrie Oosterhuis and Maarten de Rijke. 2017. Balancing speed and quality in online learning to rank for information retrieval. In Proceedings of the 26th ACM CIKM. 277–286.
  • Oosterhuis and de Rijke (2018) Harrie Oosterhuis and Maarten de Rijke. 2018. Differentiable unbiased online learning to rank. In Proceedings of the 27th ACM CIKM. 1293–1302.
  • Qin and Liu (2013) Tao Qin and Tie-Yan Liu. 2013. Introducing LETOR 4.0 Datasets. arXiv:1306.2597 [cs.IR]
  • Radlinski et al. (2008) Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning diverse rankings with multi-armed bandits. In ICML. 784–791.
  • Schuth et al. (2016) Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. 2016. Multileave gradient descent for fast online learning to rank. In Proceedings of the 9th ACM WSDM. 457–466.
  • Schuth et al. (2014) Anne Schuth, Floor Sietsma, Shimon Whiteson, Damien Lefortier, and Maarten de Rijke. 2014. Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM CIKM. ACM, 71–80.
  • Vershynin (2010) Roman Vershynin. 2010. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010).
  • Wang et al. (2019) Huazheng Wang, Sonwoo Kim, Eric McCord-Snook, Qingyun Wu, and Hongning Wang. 2019. Variance Reduction in Gradient Exploration for Online Learning to Rank. In SIGIR 2019. 835–844.
  • Wang et al. (2018a) Huazheng Wang, Ramsey Langley, Sonwoo Kim, Eric McCord-Snook, and Hongning Wang. 2018a. Efficient exploration of gradient space for online learning to rank. In SIGIR 2018. 145–154.
  • Wang et al. (2018b) Xuanhui Wang, Cheng Li, Nadav Golbandi, Michael Bendersky, and Marc Najork. 2018b. The LambdaLoss Framework for Ranking Metric Optimization. In CIKM ’18. ACM, 1313–1322.
  • Yue and Joachims (2009) Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML. 1201–1208.
  • Zhao and King (2016) Tong Zhao and Irwin King. 2016. Constructing reliable gradient exploration for online learning to rank. In Proceedings of the 25th ACM CIKM. 1643–1652.
  • Zhou et al. (2019) Dongruo Zhou, Lihong Li, and Quanquan Gu. 2019. Neural Contextual Bandits with UCB-based Exploration. arXiv preprint arXiv:1911.04462 (2019).
  • Zoghi et al. (2017) Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2017. Online learning to rank in stochastic click models. In ICML 2017. 4199–4208.

Appendix A Preliminaries

In this section, we present some basic definitions and inequalities for later use. The first inequality is about the Lipschitz continuity of the logistic function.

Lemma 1.

For the logistic function $\sigma(x)=1/(1+\exp(-x))$, we have $|\sigma(x)-\sigma(y)|<k_{\mu}|x-y|$ with $k_{\mu}=1/4$.
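For completeness, this constant follows from the standard bound on the logistic derivative combined with the mean value theorem:

\dot{\sigma}(x)=\sigma(x)\big(1-\sigma(x)\big)\leq\max\nolimits_{p\in[0,1]}p(1-p)=\frac{1}{4}

so that for some $\xi$ between $x$ and $y$, $|\sigma(x)-\sigma(y)|=\dot{\sigma}(\xi)|x-y|\leq\frac{1}{4}|x-y|$.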

The second inequality is the self-normalized bound for martingales, adopted from Lemma 9 in (Abbasi-Yadkori et al., 2011).

Lemma (Self-normalized bound for martingales, Lemma 9 in (Abbasi-Yadkori et al., 2011)).

Let $t$ be a stopping time with respect to the filtration $\{F_{\tau}\}_{\tau=0}^{\infty}$. Because $\epsilon_{ij}^{s}$ is $\frac{1}{2}$-sub-Gaussian, for any $\delta>0$, with probability $1-\delta$,

\left\|\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(i,j)\in\mathcal{G}_{s}^{ind}}\epsilon_{ij}^{s}\mathbf{x}_{ij}^{s}\right\|_{\mathbf{M}_{t}^{-1}}^{2}\leq\frac{1}{2}\log\left(\frac{\det(\mathbf{M}_{t})^{1/2}}{\delta\det(\lambda\mathbf{I})^{1/2}}\right)

Appendix B Proof of PairRank’s confidence interval: Lemma 2

B.1. Model learning for PairRank

As introduced in Section 3.2, we adopt a single-layer RankNet model with a sigmoid activation function as our pairwise ranker. The objective function for our ranker, defined in Eq (2), is the cross-entropy loss between the predicted pairwise preference distribution and the distribution inferred from user feedback. As this objective function is log-convex with respect to $\boldsymbol{\theta}$, its solution $\tilde{\boldsymbol{\theta}}_{t}$ is the unique root of the following estimating equation at each round $t$,

(3) \sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}\left(\sigma({\mathbf{x}_{mn}^{s}}^{\top}\boldsymbol{\theta})-\mathbf{y}_{mn}^{s}\right)\mathbf{x}_{mn}^{s}+\lambda\boldsymbol{\theta}=0

Let $g_{t}(\boldsymbol{\theta})=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\sigma({\mathbf{x}_{mn}^{s}}^{\top}\boldsymbol{\theta})\mathbf{x}_{mn}^{s}+\lambda\boldsymbol{\theta}$ be the invertible function such that the estimated parameter $\tilde{\boldsymbol{\theta}}_{t}$ satisfies $g_{t}(\tilde{\boldsymbol{\theta}}_{t})=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\mathbf{y}_{mn}^{s}\mathbf{x}_{mn}^{s}$. Since $\tilde{\boldsymbol{\theta}}_{t}$ might fall outside the admissible parameter set $\Theta$ (e.g., $\{\boldsymbol{\theta}:\|\boldsymbol{\theta}\|_{2}\leq Q\}$), the estimate is projected into $\Theta$ to obtain $\hat{\boldsymbol{\theta}}_{t}$:

(4) \hat{\boldsymbol{\theta}}_{t}=\operatorname*{argmin}_{\boldsymbol{\theta}\in\Theta}\left\|g_{t}(\boldsymbol{\theta})-g_{t}(\tilde{\boldsymbol{\theta}}_{t})\right\|_{\mathbf{M}_{t}^{-1}}=\operatorname*{argmin}_{\boldsymbol{\theta}\in\Theta}\left\|g_{t}(\boldsymbol{\theta})-\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}\mathbf{y}_{mn}^{s}\mathbf{x}_{mn}^{s}\right\|_{\mathbf{M}_{t}^{-1}}

where $\mathbf{M}_{t}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\mathbf{x}_{mn}^{s}{\mathbf{x}_{mn}^{s}}^{\top}+\lambda\mathbf{I}$. To summarize, $\tilde{\boldsymbol{\theta}}_{t}$ is the solution of Eq (3), and $\hat{\boldsymbol{\theta}}_{t}$ is the estimated model obtained by projecting $\tilde{\boldsymbol{\theta}}_{t}$ onto $\Theta$ via Eq (4).
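The following is a minimal sketch of this update step, assuming plain gradient descent as a stand-in for the actual solver and an $\ell_{2}$-ball projection as a crude surrogate for the $\mathbf{M}_{t}^{-1}$-norm projection in Eq (4); the function name and hyperparameters are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def estimate_theta(X, y, lam=1.0, Q=10.0, lr=0.1, n_iters=500):
    """Fit theta on all observed independent pairs so far.

    X : (n, d) pairwise feature differences x_mn^s
    y : (n,) inferred pairwise labels in {0, 1}
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # gradient of the l2-regularized cross-entropy loss; setting it
        # to zero recovers the estimating equation in Eq (3)
        grad = X.T @ (sigmoid(X @ theta) - y) + lam * theta
        theta -= lr * grad
    # project back into the admissible set ||theta||_2 <= Q
    norm = np.linalg.norm(theta)
    return theta if norm <= Q else theta * (Q / norm)
```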

B.2. Confidence interval analysis

Now we present the detailed proof of the confidence interval in Lemma 2, which is based on Proposition 1 in (Filippi et al., 2010).

At round $t$, for any document pair $(\mathbf{x}_{i}^{t},\mathbf{x}_{j}^{t})$, the estimation error of the pairwise preference in PairRank is defined as $|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})|$, where $\hat{\boldsymbol{\theta}}_{t}$ is the model learned from the observations up to the last round. According to Lemma 1, we have $|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})|\leq k_{\mu}|{\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*}-{\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t}|$. As the logistic function $\sigma(\cdot)$ is continuously differentiable, $\nabla g_{t}(\boldsymbol{\theta})=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\dot{\sigma}({\mathbf{x}_{mn}^{s}}^{\top}\boldsymbol{\theta})\mathbf{x}_{mn}^{s}{\mathbf{x}_{mn}^{s}}^{\top}+\lambda\mathbf{I}$ is continuous, where $\dot{\sigma}$ is the first-order derivative of $\sigma$. Hence, by the fundamental theorem of calculus, $g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\hat{\boldsymbol{\theta}}_{t})=\mathbf{G}_{t}(\boldsymbol{\theta}^{*}-\hat{\boldsymbol{\theta}}_{t})$ with $\mathbf{G}_{t}=\int_{0}^{1}\nabla g_{t}\big(l\boldsymbol{\theta}^{*}+(1-l)\hat{\boldsymbol{\theta}}_{t}\big)dl$. With $c_{\mu}=\inf_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\dot{\sigma}(\mathbf{x}^{\top}\boldsymbol{\theta})$, it is easy to verify that $c_{\mu}\leq 1/4$, and we can conclude that $\mathbf{G}_{t}\succeq c_{\mu}\mathbf{M}_{t}$. Accordingly, we have the following inequality,

\left|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})\right|
\leq k_{\mu}\left|{\mathbf{x}_{ij}^{t}}^{\top}\mathbf{G}_{t}^{-1}\left(g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\hat{\boldsymbol{\theta}}_{t})\right)\right|\leq k_{\mu}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{G}_{t}^{-1}}\left\|g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\hat{\boldsymbol{\theta}}_{t})\right\|_{\mathbf{G}_{t}^{-1}}
\leq(2k_{\mu}/c_{\mu})\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\left\|g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\tilde{\boldsymbol{\theta}}_{t})\right\|_{\mathbf{M}_{t}^{-1}}

where the first and second inequalities hold because $\mathbf{G}_{t}$ and $\mathbf{G}_{t}^{-1}$ are positive definite. The third inequality holds because $\mathbf{G}_{t}\succeq c_{\mu}\mathbf{M}_{t}$ implies $\mathbf{G}_{t}^{-1}\preceq c_{\mu}^{-1}\mathbf{M}_{t}^{-1}$ and $\|\mathbf{x}\|_{\mathbf{G}_{t}^{-1}}\leq\|\mathbf{x}\|_{\mathbf{M}_{t}^{-1}}/\sqrt{c_{\mu}}$ for any $\mathbf{x}\in\mathbb{R}^{d}$, and because $\boldsymbol{\theta}^{*}\in\Theta$ while $\hat{\boldsymbol{\theta}}_{t}$ minimizes Eq (4) within $\Theta$, so that by the triangle inequality $\|g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\hat{\boldsymbol{\theta}}_{t})\|_{\mathbf{M}_{t}^{-1}}\leq 2\|g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\tilde{\boldsymbol{\theta}}_{t})\|_{\mathbf{M}_{t}^{-1}}$, which yields the factor of 2.

Based on the definition of $\tilde{\boldsymbol{\theta}}_{t}$ and the assumption on the noisy feedback that $y^{t}=\sigma(\mathbf{x}_{t}^{\top}\boldsymbol{\theta}^{*})+\epsilon^{t}$, where $\epsilon^{t}$ is the noise in user feedback defined in Section 3.2, we have

g_{t}(\tilde{\boldsymbol{\theta}}_{t})-g_{t}(\boldsymbol{\theta}^{*})=\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}\left(\mathbf{y}_{mn}^{s}-\sigma({\mathbf{x}_{mn}^{s}}^{\top}\boldsymbol{\theta}^{*})\right)\mathbf{x}_{mn}^{s}-\lambda\boldsymbol{\theta}^{*}=\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}\epsilon_{mn}^{s}\mathbf{x}_{mn}^{s}-\lambda\boldsymbol{\theta}^{*}=S_{t}-\lambda\boldsymbol{\theta}^{*},

where we define $S_{t}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\epsilon_{mn}^{s}\mathbf{x}_{mn}^{s}$.

Therefore, the confidence interval of the estimated pairwise preference in PairRank can be derived as:

\left|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})\right|\leq(2k_{\mu}/c_{\mu})\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\left\|S_{t}-\lambda\boldsymbol{\theta}^{*}\right\|_{\mathbf{M}_{t}^{-1}}\leq(2k_{\mu}/c_{\mu})\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\left(\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}+\sqrt{\lambda}Q\right)

where the second inequality follows from $\lambda_{\min}(\mathbf{M}_{t})\geq\lambda$ and $\|\boldsymbol{\theta}^{*}\|_{2}\leq Q$.

As $\epsilon_{mn}^{t}$ is $R$-sub-Gaussian, according to Theorem 1 in (Abbasi-Yadkori et al., 2011),

\mathbb{P}\left[\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}^{2}>2R^{2}\log\frac{\det(\mathbf{M}_{t})^{1/2}}{\delta\det(\lambda\mathbf{I})^{1/2}}\right]\leq\delta

Therefore, with probability at least $1-\delta_{1}$, we have

\left|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})\right|\leq\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}

with $\alpha_{t}=(2k_{\mu}/c_{\mu})\Big(\sqrt{R^{2}\log\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})}}+\sqrt{\lambda}Q\Big)$.
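As a sanity check on how these quantities compose in practice, here is a small sketch of computing $\alpha_{t}$ and the width $CB_{ij}^{t}=\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}$; the constants $R$, $Q$, $c_{\mu}$, and $\delta_{1}$ below are illustrative placeholders.

```python
import numpy as np

def alpha_t(M, lam, R=0.5, Q=1.0, k_mu=0.25, c_mu=0.1, delta1=0.05):
    """Exploration scale from Lemma 2; note det(lambda * I) = lambda ** d."""
    d = M.shape[0]
    _, logdet_M = np.linalg.slogdet(M)
    log_ratio = logdet_M - d * np.log(lam) - 2.0 * np.log(delta1)
    return (2.0 * k_mu / c_mu) * (np.sqrt(R**2 * log_ratio) + np.sqrt(lam) * Q)

def confidence_width(x_ij, M_inv, alpha):
    """CB_ij^t = alpha_t * ||x_ij||_{M_t^{-1}}."""
    return alpha * np.sqrt(x_ij @ M_inv @ x_ij)
```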

Appendix C Proof of Theorem 8

In this section, we present the detailed proof of Theorem 8. We first prove Lemma 7, which provides an upper bound on the probability that an estimated pairwise preference is identified as uncertain. The blocks with uncertain rank orders lead to regret in the ranked list due to the random-shuffling-based exploration strategy.

C.1. Proof of Lemma 7

Proof.

In this proof, we first derive a lower bound on the minimum eigenvalue of $\mathbf{M}_{t}$, and then derive the upper bound on the probability that a pairwise estimate is identified as uncertain.

$\bullet$ Lower bound the minimum eigenvalue of $\mathbf{M}_{t}$. As discussed in Lemma 6, we assume the pairwise feature vectors are random vectors drawn from distribution $v$. With $\mathbf{\Sigma}=\mathbb{E}[\mathbf{x}_{mn}^{s}{\mathbf{x}_{mn}^{s}}^{\top}]$ as the second-moment matrix, define $\mathbf{Z}=\mathbf{\Sigma}^{-\frac{1}{2}}\mathbf{X}$, where $\mathbf{X}$ is a random vector drawn from the same distribution $v$. Then $\mathbf{Z}$ is isotropic, namely $\mathbb{E}[\mathbf{Z}\mathbf{Z}^{\top}]=\mathbf{I}_{d}$.

Define $\mathbf{U}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\mathbf{Z}_{mn}^{s}{\mathbf{Z}_{mn}^{s}}^{\top}=\mathbf{\Sigma}^{-\frac{1}{2}}\bar{\mathbf{M}}_{t}\mathbf{\Sigma}^{-\frac{1}{2}}$, where $\bar{\mathbf{M}}_{t}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\mathbf{x}_{mn}^{s}{\mathbf{x}_{mn}^{s}}^{\top}=\mathbf{M}_{t}-\lambda\mathbf{I}$. From Lemma 6, we know that for any $l$, with probability at least $1-2\exp(-C_{2}l^{2})$, $\lambda_{\min}(\mathbf{U})\geq n-C_{1}\sigma^{2}\sqrt{nd}-\sigma^{2}l\sqrt{n}$, where $\sigma$ is the sub-Gaussian parameter of $\mathbf{Z}$, whose square is upper bounded by $\|\mathbf{\Sigma}^{-1/2}\|^{2}=\lambda_{\min}^{-1}(\mathbf{\Sigma})$, and $n=\sum_{s=1}^{t-1}|\mathcal{G}_{s}^{ind}|$ is the number of observations so far. We can thus rewrite the above inequality, which holds with probability $1-\delta_{2}$, as $\lambda_{\min}(\mathbf{U})\geq n-\lambda_{\min}^{-1}(\mathbf{\Sigma})(C_{1}\sqrt{nd}+l\sqrt{n})$. We now lower bound the minimum eigenvalue of $\bar{\mathbf{M}}_{t}$ as follows:

\lambda_{\min}(\bar{\mathbf{M}}_{t})=\min_{x\in\mathbb{B}^{d}}x^{\top}\bar{\mathbf{M}}_{t}x=\min_{x\in\mathbb{B}^{d}}x^{\top}\mathbf{\Sigma}^{1/2}\mathbf{U}\mathbf{\Sigma}^{1/2}x\geq\lambda_{\min}(\mathbf{U})\min_{x\in\mathbb{B}^{d}}x^{\top}\mathbf{\Sigma}x=\lambda_{\min}(\mathbf{U})\lambda_{\min}(\mathbf{\Sigma})\geq\lambda_{\min}(\mathbf{\Sigma})n-C_{1}\sqrt{nd}-C_{2}\sqrt{n\log(1/\delta_{2})}

According to Lemma 6, for $t\geq t^{\prime}$, we have:

\lambda_{\min}(\mathbf{\Sigma})t-\left(c_{1}\sqrt{d}+c_{2}\sqrt{\log(1/\delta_{2})}+abd\sqrt{o_{\max}u^{2}/(d\lambda)}\right)\sqrt{t}-\left(ab\log(1/\delta_{1}^{2})+4a\lambda Q^{2}-\lambda\right)\geq 0

As $n\geq t$, we have

(5) \lambda_{\min}(\mathbf{M}_{t})\geq\lambda_{\min}(\bar{\mathbf{M}}_{t})+\lambda\geq abd\sqrt{\frac{o_{\max}u^{2}}{d\lambda}}+ab\log\Big(\frac{1}{\delta_{1}^{2}}\Big)+4a\lambda Q^{2}

$\bullet$ Upper bound the probability of an uncertain rank order. Under event $E_{t}$, based on the definition of $\mathcal{E}_{u}^{t}$ in Section 3.2, for any documents $\mathbf{x}_{i}^{t}$ and $\mathbf{x}_{j}^{t}$ at round $t$, $(i,j)\in\mathcal{E}_{u}^{t}$ if and only if $\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\leq\frac{1}{2}$ and $\sigma({\mathbf{x}_{ji}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\alpha_{t}\|\mathbf{x}_{ji}^{t}\|_{\mathbf{M}_{t}^{-1}}\leq\frac{1}{2}$. For the logistic function, $\sigma(s)=1-\sigma(-s)$. Therefore, letting $CB_{ij}^{t}$ denote $\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}$, we conclude that $(i,j)\in\mathcal{E}_{u}^{t}$ if and only if $|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\frac{1}{2}|\leq CB_{ij}^{t}$; accordingly, $(i,j)\in\mathcal{E}_{c}^{t}$ when $|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\frac{1}{2}|>CB_{ij}^{t}$. To simplify notation, in the following analysis we write $\hat{\sigma}_{t}$ and $\sigma^{*}$ for $\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})$ and $\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})$, respectively. Based on the discussion above, the probability that the estimated preference between documents $\mathbf{x}_{i}^{t}$ and $\mathbf{x}_{j}^{t}$ is an uncertain rank order, i.e., $(i,j)\in\mathcal{E}_{u}^{t}$, can be upper bounded by:

\mathbb{P}\big((i,j)\in\mathcal{E}_{u}^{t}\big)=\mathbb{P}\big(|\hat{\sigma}_{t}-1/2|\leq CB_{ij}^{t}\big)
\leq\mathbb{P}\left(\big||\hat{\sigma}_{t}-\sigma^{*}|-|\sigma^{*}-1/2|\big|\leq CB_{ij}^{t}\right)\leq\mathbb{P}\left(|\sigma^{*}-1/2|-|\hat{\sigma}_{t}-\sigma^{*}|\leq CB_{ij}^{t}\right)
\leq\mathbb{P}\left(\Delta_{\min}-|\hat{\sigma}_{t}-\sigma^{*}|\leq CB_{ij}^{t}\right)

where the first inequality is based on the reverse triangle inequality, and the last on the definition of $\Delta_{\min}$. Based on Lemma 2, the above probability can be further bounded by

\mathbb{P}\left(\Delta_{\min}-|\hat{\sigma}_{t}-\sigma^{*}|\leq CB_{ij}^{t}\right)=\mathbb{P}\left(|\hat{\sigma}_{t}-\sigma^{*}|\geq\Delta_{\min}-\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\right)
\leq\mathbb{P}\left((2k_{\mu}/c_{\mu})\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\left(\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}+\sqrt{\lambda}Q\right)\geq\Delta_{\min}-\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\right)
\leq\mathbb{P}\left(\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}\geq\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}}-\Big(\sqrt{R^{2}\log\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})}}+2\sqrt{\lambda}Q\Big)\right)

For the RHS of the inequality inside the probability, we know that:

\left(\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}}\right)^{2}-\left(\sqrt{R^{2}\log\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})}}+2\sqrt{\lambda}Q\right)^{2}
\geq\lambda_{\min}(\mathbf{M}_{t})/a-R^{2}\log\big(\det(\mathbf{M}_{t})/(\delta_{1}^{2}\det(\lambda\mathbf{I}))\big)-4\lambda Q^{2}-4\sqrt{\lambda}QR\sqrt{\log(\det(\mathbf{M}_{t})/\det(\lambda\mathbf{I}))+\log(1/\delta_{1}^{2})}
\geq\lambda_{\min}(\mathbf{M}_{t})/a-b\left(\log(\det(\mathbf{M}_{t})/\det(\lambda\mathbf{I}))+\log(1/\delta_{1}^{2})\right)-4\lambda Q^{2}
\geq\lambda_{\min}(\mathbf{M}_{t})/a-bd\log\Big(1+\frac{o_{\max}tu^{2}}{d\lambda}\Big)-b\log\Big(\frac{1}{\delta_{1}^{2}}\Big)-4\lambda Q^{2}\geq 0

where the last inequality follows from Eq (5). Therefore, the probability can be upper bounded as:

\mathbb{P}\left(\Delta_{\min}-|\hat{\sigma}_{t}-\sigma^{*}|\leq CB_{ij}^{t}\right)
\leq\mathbb{P}\left(\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}^{2}\geq\left(\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}}-\Big(\sqrt{R^{2}\log\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})}}+2\sqrt{\lambda}Q\Big)\right)^{2}\right)
\leq\mathbb{P}\left(\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}^{2}\geq\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{4k_{\mu}^{2}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}^{2}}+R^{2}\log\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})}\right)
\leq\mathbb{P}\left(\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}^{2}\geq 2R^{2}\log\Big(\exp\Big(\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{8R^{2}k_{\mu}^{2}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}^{2}}\Big)\cdot\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})}\Big)\right)
\leq\delta_{1}\exp\Big(-\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{8R^{2}k_{\mu}^{2}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}^{2}}\Big)\leq\log\Big(\frac{1}{\delta_{1}}\Big)\cdot\frac{8R^{2}k_{\mu}^{2}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}^{2}}{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}

C.2. Proof of Theorem 8

With $\delta_{1}$ and $\delta_{2}$ defined in the previous lemmas, we have the $T$-step regret upper bounded as:

(6) R_{T}=R_{t^{\prime}}+R_{T-t^{\prime}}\leq R^{\prime}+(1-\delta_{1})(1-\delta_{2})\sum\nolimits_{s=t^{\prime}}^{T}r_{s}

where $R^{\prime}=t^{\prime}L_{\max}+(T-t^{\prime})\left[\delta_{2}L_{\max}+(1-\delta_{2})\delta_{1}L_{\max}\right]$. Under events $E_{t}^{1}$ and $E_{t}^{2}$, the instantaneous regret at round $s$ is bounded by

(7) r_{s}=\mathbb{E}\big[K(\tau_{s},\tau_{s}^{*})\big]=\sum\nolimits_{i=1}^{d_{s}}\mathbb{E}\left[(N_{i}^{s}+1)N_{i}^{s}/2\right]\leq\mathbb{E}\left[N_{s}(N_{s}+1)/2\right]

where $N_{i}^{s}$ denotes the number of uncertain rank orders in block $\mathcal{B}_{i}^{s}$ at round $s$, and $N_{s}$ denotes the total number of uncertain rank orders at round $s$. In the worst case, when all $N_{s}$ uncertain rank orders fall into the same block, at most $(N_{s}^{2}+N_{s})/2$ mis-ordered pairs can be generated by the random shuffling in PairRank, because a block containing $N_{s}$ uncertain rank orders can have at most $N_{s}+1$ documents.
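For intuition on this worst case, a tiny sketch counting $K(\tau_{s},\tau_{s}^{*})$, the number of mis-ordered pairs of a presented list against the ideal order; a fully shuffled block of $N_{s}+1$ documents contributes at most $N_{s}(N_{s}+1)/2$ such pairs.

```python
def misordered_pairs(presented, true_scores):
    """K(tau, tau*): pairs shown in an order contradicting the ideal
    ranking induced by true_scores (higher score = more relevant)."""
    return sum(
        1
        for a in range(len(presented))
        for b in range(a + 1, len(presented))
        if true_scores[presented[a]] < true_scores[presented[b]]
    )

# A fully reversed 4-document block (3 uncertain orders) yields the
# maximum 3 * 4 / 2 = 6 mis-ordered pairs:
assert misordered_pairs([0, 1, 2, 3], [0, 1, 2, 3]) == 6
```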

Therefore, the cumulative regret after tt^{\prime} can be bounded by:

\sum\nolimits_{s=t^{\prime}}^{T}r_{s}\leq\sum\nolimits_{s=t^{\prime}}^{T}\mathbb{E}\left[N_{s}(N_{s}+1)/2\right]\leq\left(\sum\nolimits_{s=t^{\prime}}^{T}\mathbb{E}[N_{s}^{2}]+\mathbb{E}[N_{s}]\right)/2\leq\left(\mathbb{E}\Big[\sum\nolimits_{s=t^{\prime}}^{T}N_{s}\Big]^{2}+\mathbb{E}\Big[\sum\nolimits_{s=t^{\prime}}^{T}N_{s}\Big]\right)/2

According to our previous analysis, at round $t\geq t^{\prime}$, the number of uncertain rank orders can be estimated by the probability of observing an uncertain rank order, i.e., $\mathbb{P}\big((i,j)\in\mathcal{E}_{u}^{t}\big)\leq\log(1/\delta_{1})\cdot\frac{8R^{2}k_{\mu}^{2}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}^{2}}{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}$. Therefore, the cumulative number of mis-ordered pairs can be bounded via the probability of observing uncertain rank orders in each round, which shrinks as more observations become available over time,

\mathbb{E}\Big[\sum\nolimits_{s=t^{\prime}}^{T}N_{s}\Big]\leq\mathbb{E}\Big[\frac{1}{2}\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\mathbb{P}\big((m,n)\in\mathcal{E}_{u}^{s}\big)\Big]\leq\mathbb{E}\Big[a\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}\Big]

where $a=\log(1/\delta_{1})\cdot 4k_{\mu}^{2}/\big((1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}\big)$.

Because $\mathbf{M}_{t}$ only contains information about the document pairs observed so far, PairRank guarantees that the number of mis-ordered pairs among the observed documents in the above inequality is upper bounded. To reason about the mis-ordered pairs involving unobserved documents (i.e., from $o_{t}$ to $L_{t}$ for each query $q_{t}$), we leverage the constant $p^{*}$, defined as the minimal probability that all documents under a query are examined over time,

\mathbb{E}\Big[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}\Big]=\mathbb{E}\Big[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}\times\mathbb{E}\Big[\frac{1}{p_{s}}\mathbf{1}\{o_{s}=L_{s}\}\Big]\Big]
\leq{p^{*}}^{-1}\mathbb{E}\Big[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}\mathbf{1}\{o_{s}=L_{s}\}\Big]

Besides, PairRank only uses the independent pairs $\mathcal{G}_{s}^{ind}$ to update the model and the corresponding $\mathbf{M}_{s}$ matrix. Therefore, to bound the regret, we rewrite the above quantity as:

\mathbb{E}\Big[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}\Big]
=\mathbb{E}\Big[\sum_{s=t^{\prime}}^{T}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\Big(\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}+\sum_{k\in[L_{s}]\setminus\{m,n\}}\|\mathbf{x}_{mk}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}+\|\mathbf{x}_{nk}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}\Big)\Big]
=\mathbb{E}\Big[\sum_{s=t^{\prime}}^{T}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\Big((L_{s}-1)\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}+\sum_{k\in[L_{s}]\setminus\{m,n\}}2{\mathbf{x}_{mk}^{s}}^{\top}\mathbf{M}_{s}^{-1}\mathbf{x}_{nk}^{s}\Big)\Big]

Based on Lemma 10 and Lemma 11 in (Abbasi-Yadkori et al., 2011), the expected value of the first term can be upper bounded by,

\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}(L_{s}-1)\|\mathbf{x}_{mn}^{s}\|_{\mathbf{M}_{s}^{-1}}^{2}\leq 2dL_{\max}\log\Big(1+\frac{o_{\max}Tu^{2}}{2d\lambda}\Big)

And the second term can be bounded as:

\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}\sum\nolimits_{k\in[L_{s}]\setminus\{m,n\}}2{\mathbf{x}_{mk}^{s}}^{\top}\mathbf{M}_{s}^{-1}\mathbf{x}_{nk}^{s}\leq\sum\nolimits_{s=t^{\prime}}^{T}(L_{\max}^{2}-2L_{\max})u^{2}/\lambda_{\min}(\mathbf{M}_{s})=w.

Therefore, combining the bound on $\mathbb{E}[\sum_{s=t^{\prime}}^{T}N_{s}]$ with Eq (6) and Eq (7), the regret can be upper bounded by:

R_{T}\leq R^{\prime}+(1-\delta_{1})(1-\delta_{2})\frac{1}{{p^{*}}^{2}}\left(2adL_{\max}\log\Big(1+\frac{o_{\max}Tu^{2}}{2d\lambda}\Big)+aw\right)^{2}

where $R^{\prime}=t^{\prime}L_{\max}+(T-t^{\prime})\left(\delta_{2}L_{\max}+(1-\delta_{2})\delta_{1}L_{\max}\right)$ as in Eq (6), with $t^{\prime}$ defined in Lemma 7 and $L_{\max}$ denoting the maximum number of documents associated with the same query over time. By choosing $\delta_{1}=\delta_{2}=1/T$, the theorem shows that the expected regret is at most $R_{T}\leq O(d\log^{4}(T))$.