PairRank: Online Pairwise Learning to Rank by Divide-and-Conquer

Yiling Jia University of VirginiaCharlottesvilleVAUSA [email protected] , Huazheng Wang University of VirginiaCharlottesvilleVAUSA [email protected] , Stephen Guo Walmart LabsSunnyvaleCAUSA [email protected] and Hongning Wang University of VirginiaCharlottesvilleVAUSA [email protected]

(2021)

Abstract.

Online Learning to Rank (OL2R) eliminates the need of explicit relevance annotation by directly optimizing the rankers from their interactions with users. However, the required exploration drives it away from successful practices in offline learning to rank, which limits OL2R’s empirical performance and practical applicability. In this work, we propose to estimate a pairwise learning to rank model online. In each round, candidate documents are partitioned and ranked according to the model’s confidence on the estimated pairwise rank order, and exploration is only performed on the uncertain pairs of documents, i.e., divide-and-conquer. Regret directly defined on the number of mis-ordered pairs is proven, which connects the online solution’s theoretical convergence with its expected ranking performance. Comparisons against an extensive list of OL2R baselines on two public learning to rank benchmark datasets demonstrate the effectiveness of the proposed solution.

Online learning to rank; divide and conquer; regret analysis

^†^†copyright: acmcopyright^†^†journalyear: 2021^†^†conference: Proceedings of the Web Conference 2021; April 19–23, 2021; Ljubljana, Slovenia^†^†booktitle: Proceedings of the Web Conference 2021 (WWW ’21), April 19–23, 2021, Ljubljana, Slovenia^†^†doi: 10.1145/3442381.3449972^†^†isbn: 978-1-4503-8312-7/21/04^†^†ccs: Information systems Learning to rank^†^†ccs: Theory of computation Online learning algorithms^†^†ccs: Theory of computation Divide and conquer^†^†ccs: Theory of computation Regret bounds

1. Introduction

Online learning to rank (OL2R) empowers modern retrieval systems to optimize their performance directly from users’ implicit feedback (Wang et al., 2019; Wang et al., 2018a; Yue and Joachims, 2009; Hofmann et al., 2013; Schuth et al., 2016; Zoghi et al., 2017; Lattimore et al., 2018; Oosterhuis and de Rijke, 2018; Li et al., 2018). The essence of OL2R solutions is to infer the quality of individual documents under a given query (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018; Li et al., 2018) or a parameterized ranking function (Wang et al., 2019; Wang et al., 2018a; Yue and Joachims, 2009; Oosterhuis and de Rijke, 2018) via sequential interactions with users, i.e., trial and error. It eliminates classical offline learning to rank solutions’ strong dependency on explicit relevance annotations and makes supervised learning of ranking models possible when collecting explicit annotations from experts is economically infeasible or even impossible (e.g., private collection search).

Although influential and theoretically sound, the current OL2R solutions are not compatible with the successful practices in offline learning to rank, which directly optimize rankers by minimizing loss defined by rank-based metrics, such as Average Relevance Position (ARP) (Joachims et al., 2017) or Normalized Discounted Cumulative Gain (NDCG) (Burges, 2010). As a result, the performance of existing OL2R solutions is still behind that of offline solutions, which directly restricts OL2R’s real-world applicability.

The key barrier separating the practices in online and offline learning to rank is the need of exploration. Since users’ feedback is implicit and known to be biased and noisy (Joachims et al., 2005; Agichtein et al., 2006; Joachims et al., 2007), more clicks on a top-ranked document do not necessarily indicate greater relevance. Effective exploration in the problem space is thus vital for the online model update. Current OL2R solutions explore either in the action space (e.g., presenting currently underestimated results at higher positions of a ranked list) (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018), or in the model space (e.g., presenting ranked results from different rankers) (Yue and Joachims, 2009; Schuth et al., 2014). However, due to the combinatorial nature of ranking, the action space is too large to be efficiently explored (e.g., all permutations of returned documents). This forces such OL2R solutions to take a pointwise approach to estimate the utility of each query-document pair separately, which has proven to be inferior to the pairwise or listwise approaches in offline learning to rank studies (Chapelle and Chang, 2011). While for model space exploration, though an interleaved test makes it possible to compare different rankers with respect to a hidden utility function in an unbiased fashion, it is hard to link this comparison to the optimization of any rank-based metrics. Moreover, due to the required uniform sampling in the model space, this type of OL2R solutions suffers from high variance and high regret during online result serving and model update (Wang et al., 2019).

In this work, we aim to bridge the gap by directly training a pairwise learning to rank model online. We target pairwise ranking models for three major reasons. First, a pairwise ranker reduces the exponentially sized action space to quadratic, by deriving the full ranking order from the pairwise comparisons between documents. Second, existing studies in search log analysis demonstrate relative preferences derived from clicks are more accurate and reliable than absolute judgments (Joachims et al., 2005, 2007). Third, pairwise learning to rank models have competitive empirical performance and have been widely used in practical systems (Chapelle and Chang, 2011; Joachims, 2002; Burges, 2010). To gain theoretical insight into the proposed solution, we work with a single layer RankNet model with a sigmoid activation function (Burges, 2010), which makes analytical convergence analysis possible. In particular, we explore in the pairwise ranking space of all candidate documents via divide-and-conquer:

we partition documents under a given query in each round of result serving, where the document ranking across different parts of the partition is certain (e.g., all documents in one part should be ranked higher than those in another part), but the ranking among documents within each part is still uncertain. The ranked list is thus generated by a topological sort across parts of a partition and randomly shuffling within each part. We name our solution PairRank. We rigorously prove that the exploration space shrinks exponentially fast in PairRank as the ranker estimation converges, such that the cumulative regret defined on the number of mis-ordered pairs has a sublinear upper bound. As most existing ranking metrics can be reduced to different kinds of pairwise comparisons among candidate documents (Wang et al., 2018b), e.g., Average Relevance Position is counted over a relevant document against all other documents, PairRank can directly optimize those ranking metrics based on users’ implicit feedback on the fly. Our extensive empirical evaluations also demonstrate the strong advantage of PairRank against a rich set of state-of-the-art OL2R solutions over a collection of OL2R benchmark datasets on standard retrieval metrics.

2. Related Work

We broadly categorize existing OL2R solutions into two families. The first type learns the best ranked list for each individual query separately by modeling users’ click and examination behaviors with multi-armed bandit algorithms (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018). Typically, solutions in this category depend on specific click models to decompose the estimation on each query-document pair; as a result, exploration is performed on the ranking of individual documents. For example, by assuming users examine documents from top to bottom until reaching the first relevant document, cascading bandit models perform exploration by ranking the documents based on the upper confidence bound of their estimated relevance (Kveton et al., 2015a, b; Li et al., 2016). Other types of click models have also been explored (such as the dependent click model) (Katariya et al., 2016; Zoghi et al., 2017; Lattimore et al., 2018; Li et al., 2018; Kveton et al., 2018). However, as the relevance is estimated for each query-document pair, such algorithms can hardly generalize to unseen queries or documents. Moreover, pointwise relevance estimation is proved to be ineffective for rank estimation in established offline learning to rank studies (Chapelle and Chang, 2011; Burges, 2010).

The second type of OL2R solutions leverage ranking features for relevance estimation and explores for the best ranker in the entire model space (Yue and Joachims, 2009; Li et al., 2018; Oosterhuis and de Rijke, 2018). The most representative work is Dueling Bandit Gradient Descent (DBGD) (Yue and Joachims, 2009; Schuth et al., 2014), which proposes an exploratory direction in each round of interaction and uses an interleaved test (Chapelle et al., 2012) to validate the exploration for model update. To ensure an unbiased gradient estimate, DBGD uniformly explores the entire model space, which costs high variance and high regret during online ranking and model update. Subsequent methods improved upon DBGD by developing more efficient sampling strategies, such as multiple interleaving and projected gradient, to reduce variance (Hofmann et al., 2012; Zhao and King, 2016; Oosterhuis and de Rijke, 2017; Wang et al., 2018a; Wang et al., 2019). However, as exploration is performed in the model space, click feedback is used to infer which ranker is preferred under a hypothetical utility function. It is difficult to reason how the update in DBGD is related to the optimization of any rank-based metric. Hence, though generalizable, this type of OL2R solutions’ empirical performance is still worse than classical offline solutions.

The clear divide between the practices in online and offline learning to rank is quite remarkable, which has motivated some recent efforts to bridge the gap. Hofmann et al. (2013) adopt $\epsilon$ -greedy to estimate a stochastic RankSVM (Joachims, 2002; Herbrich et al., 1999) model on the fly. Though RankSVM is effective for pairwise learning to rank, the totally random exploration by $\epsilon$ -greedy is independent from the learned ranker. It keeps distorting the ranked results, even when the ranker has identified some high-quality results. Oosterhuis and de Rijke (2018) perform exploration by sampling the next ranked document from a Plackett-Luce model and estimate gradients of this ranking model from the inferred pairwise result preferences. Although exploration is linked to the ranker’s estimation, the convergence of this solution is still unknown.

3. Method

In this section, we present our solution, which trains a pairwise learning to rank model online. The key idea is to partition the pairwise document ranking space and only explore the pairs where the ranker is currently uncertain, i.e., divide-and-conquer. We name our solution PairRank. We rigorously prove the regret of PairRank defined on the cumulative number of mis-ordered pairs over the course of online model update.

3.1. Problem Formulation

In OL2R, a ranker interacts with users for $T$ rounds. At each round $t=1,2,...,T$ , the ranker receives a query $q_{t}$ and its associated candidate documents, which are represented as a set of $d$ -dimensional query-document feature vectors $\mathcal{X}_{t}=\{\mathbf{x}_{1}^{t},\mathbf{x}_{2}^{t},\dots\mathbf{x}_{L_{t}}^{t}\}$ with $\mathbf{x}^{t}_{i}\in\mathbb{R}^{d}$ and $\|\mathbf{x}_{i}^{t}\|\leq u$ . The ranker determines the ranking of the candidate documents $\tau_{t}=\big{(}\tau_{t}(1),\tau_{t}(2),\dots,\tau_{t}(L_{t})\big{)}\in\Pi([L_{t}])$ , based on its knowledge so far, where $\Pi([L_{t}])$ represents the set of all permutations of $L_{t}$ documents and $\tau_{t}(i)$ is the rank position of document $i$ under query $q_{t}$ . Once the ranked list is returned to the user, the user examines the results and provides his/her click feedback $C_{t}=\{c_{1}^{t},c_{2}^{t},...,c_{L_{t}}^{t}\}$ , where $c_{i}^{t}=1$ if the user clicks on document $i$ at round $t$ ; otherwise $c_{i}^{t}=0$ . Based on this feedback, the ranker updates itself and proceeds to the next query.

$C_{t}$ is known to be biased and noisy (Joachims et al., 2005; Agichtein et al., 2006; Joachims et al., 2007). Existing studies find that users tend to click more on higher-ranked documents, as they stop examination early in the list. This is known as position bias. And users can only interact with the documents shown to them, known as the presentation bias. As a result, the ranker cannot simply treat non-clicked documents as irrelevant. Such implicit feedback imposes several key challenges for online learning to rank: how to deal with the implicit feedback, and in the meanwhile, how to effectively explore the unknowns for the model update. Following the practice in (Joachims et al., 2005), we treat clicks as relative preference feedback and assume that clicked documents are preferred over the examined but unclicked ones. In addition, we consider every document that precedes a clicked document and the first subsequent unclicked document as examined. This approach has been widely adopted and proven to be effective in learning to rank (Wang et al., 2019; Agichtein et al., 2006; Oosterhuis and de Rijke, 2018). Accordingly, we use $o_{t}$ to represent the index of the last examined position in the ranked list $\tau_{t}$ at round $t$ .

Exploration is the key component that differentiates OL2R to offline L2R, where OL2R needs to serve while learning from its presented rankings. The most straightforward exploration is to provide a random list of candidate documents. However, such random exploration is less appreciated for OL2R as it hurts user experience, even though it may be beneficial for model training. Therefore, regret becomes an important metric for evaluating OL2R. Though various types of regret have been defined and analyzed in existing OL2R studies, few of them link to any rank-based metric, which is the key in ranker evaluation. For example, for OL2R solutions that explore in the document space, regret is typically defined on the number of clicks received on the presented ranking versus that known only in hindsight (Li et al., 2016; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018). For solutions that explore in the model space, such as DBGD, regret is defined as the number of rounds where the chosen ranker is preferred over the optimal ranker (Yue and Joachims, 2009; Wang et al., 2019). It is difficult to reason how such measures indicate an OL2R solution’s ranking performance against a desired retrieval metric, such as ARP and NDCG. To bridge this gap, we define regret by the number of mis-ordered pairs from the presented ranking to the ideal one, i.e., the Kendall tau rank distance,

R_{T}=\mathbb{E}\big{[}\sum\nolimits_{t=1}^{T}r_{t}\big{]}=\mathbb{E}\big{[}\sum\nolimits_{t=1}^{T}K(\tau_{t},\tau_{t}^{*})\big{]}

where $K(\tau_{t},\tau_{t}^{*})=\Big{|}\big{\{}(i,j):i<j,\big{(}\tau_{t}(i)<\tau_{t}(j)\wedge\tau^{*}_{t}(i)>\tau^{*}_{t}(j)\big{)}\vee\big{(}\tau_{t}(i)>\tau_{t}(j)\wedge\tau^{*}_{t}(i)<\tau^{*}_{t}(j)\big{)}\big{\}}\Big{|}$ . As shown in (Wang et al., 2018b), most ranking metrics employed in real-world retrieval systems, such as ARP and NDCG, can be decomposed into pairwise comparisons; hence, our defined regret directly connects an OL2R algorithm’s online performance with classical ranking evaluations.

3.2. Online Pairwise Learning to Rank

Refer to caption — Figure 1. PairRank: pairwise explore and exploit by divide-and-conquer. Assume that the optimal ranking order among the 5 documents is $1\succ 2\succ 3\succ 4\succ 5$ . At the current round $t$ , the ranker is confident about its preference estimation between all the pairs expect $(1,2),(3,5),(4,5)$ . In this example, the instantaneous regret of the first proposed ranked list is 3 and 2 in the second proposal.

The key in OL2R is to effectively explore the unknowns while providing high-quality results to the users, which is often referred to as the explore-exploit trade-off. In this work, we propose to directly train a pairwise model from its interactions with users and directly explore in the pairwise document ranking space via a divide-and-conquer strategy. The high-level idea of PairRank is illustrated in Figure 1, and we explain its details in the following.

3.2.1. Pairwise Learning to Rank

We focus on the pairwise ranking models because of their advantageous effectiveness reported in prior offline LTR studies (Chapelle and Chang, 2011), and their ability to tackle implicit feedback. Specifically, we adopt a single layer RankNet model with a sigmoid activation function (Burges, 2010) as our pairwise ranker. This choice is based on the promising empirical performance of RankNet and the feasibility of analyzing the resulting online solution’s convergence.

In a single layer RankNet, the probability that a document $i$ is more relevant than document $j$ under query $q$ is computed as $\mathbb{P}(i\succ j|q)=\sigma({\mathbf{x}^{\top}_{i}}\boldsymbol{\theta}-{\mathbf{x}^{\top}_{j}}\boldsymbol{\theta})$ , where $\boldsymbol{\theta}\in\mathbb{R}^{d}$ and $\|\boldsymbol{\theta}\|\leq Q$ is the model parameter and $\sigma(x)={1}/({1+\exp(-x)})$ . To simplify our notations, we use $\mathbf{x}_{ij}$ to denote $\mathbf{x}_{i}-\mathbf{x}_{j}$ in our subsequent discussions.

With the knowledge of $\boldsymbol{\theta}$ , due to the monotonicity and transitivity of the sigmoid function, the ranking of documents in $\mathcal{X}$ can be uniquely determined by $\{{\mathbf{x}^{\top}_{1}}\boldsymbol{\theta},{\mathbf{x}^{\top}_{2}}\boldsymbol{\theta},\dots,{\mathbf{x}^{\top}_{L}}\boldsymbol{\theta}\}$ . Therefore, the key of learning a RankNet model is to estimate its parameter $\boldsymbol{\theta}$ . As RankNet specifies a distribution on pairwise comparisons, the objective function for $\boldsymbol{\theta}$ estimation can be readily derived as the cross-entropy loss between the predicted pairwise distribution on documents and those inferred from user feedback till round $t$ :

(1)		$\displaystyle\mathcal{L}_{t}=\sum_{s=1}^{t}\sum_{(m,n)\in\mathcal{G}_{s}}$	$\displaystyle-{\mathbf{y}_{mn}^{s}}\log\big{(}\sigma({{\mathbf{x}_{mn}^{s}}}^{\top}\boldsymbol{\theta})\big{)}$
		$\displaystyle-(1-{\mathbf{y}_{mn}^{s}})\log\big{(}1-\sigma({{\mathbf{x}_{mn}^{s}}}^{\top}\boldsymbol{\theta})\big{)}+\frac{\lambda}{2}\\|\boldsymbol{\theta}\\|^{2}$

where $\lambda$ is the L2 regularization coefficient, $\mathcal{G}_{s}$ denotes the set of document pairs that received different click feedback at round $s$ , i.e., $\mathcal{G}_{s}=\{(m,n):c_{m}^{s}\neq c_{n}^{s},\forall\tau(m)<\tau(n)\leq o_{s}\}$ , ${\mathbf{y}_{mn}^{s}}$ indicates whether the document $m$ is preferred over document $n$ in the click feedback, i.e., ${\mathbf{y}_{mn}^{s}}=\frac{1}{2}(c_{m}^{t}-c_{n}^{t})+\frac{1}{2}$ (Burges, 2010). Due to the log-convexity of the loss function defined in Eq (1), the global optimal solution $\hat{\boldsymbol{\theta}}_{t}$ at round $t$ exists and can be efficiently obtained by gradient descent.

Online learning of RankNet boils down to the construction of $\{\mathcal{G}_{s}\}^{T}_{s=1}$ over time. However, the conventional practice of using all the inferred pairwise preferences from click feedback (Joachims, 2002; Agichtein et al., 2006) imposes a higher risk in an online setting. In the presence of click noise (e.g., a user mistakenly clicks on an irrelevant document), pairing documents would cause a quadratically increasing number of noisy training instances, which impose strong negative impact on the quality of the learned ranker (Carvalho et al., 2008). As the updated ranker is immediately executed, cascading of ranker deterioration is possible. To alleviate this deficiency, we propose to only use independent pairs inferred from the feedback, e.g., $\mathcal{G}_{s}^{ind}=\{(m,n):c_{m}^{s}\neq c_{n}^{s},\forall(\tau_{s}(m),\tau_{s}(n))\in D\}$ , where $D$ represents the set of disjointed position pairs, for example, $D=\{(1,2),(3,4),...(L-1,L)\}$ . In other words, we will only use a subset of non-overlapping pairwise comparisons for our online ranker update.

3.2.2. Uncertainty Qualification

As discussed in Section 3.1, $\hat{\boldsymbol{\theta}}_{t}$ is obtained based on the acquired feedback from what has been presented to the user, which is subject to various types of biases and noises (Joachims et al., 2005; Agichtein et al., 2006; Joachims et al., 2007). Hence, $\hat{\boldsymbol{\theta}}_{t}$ only reflects what the ranker knows so far; and it is vital to effectively explore the unknowns to complete its knowledge. In PairRank, we propose to explore in the pairwise document ranking space spanned by $\mathcal{X}_{t}$ under $q_{t}$ , with respect to the current ranker’s uncertainty about the pairwise comparisons.

Accurate quantification of the current ranker’s uncertainty on pairwise preference estimation is the key to such exploration. The model estimation uncertainty is caused by the existence of click noise, i.e., $\|\hat{\boldsymbol{\theta}}_{t}-\boldsymbol{\theta}^{*}\|\neq 0$ , where $\boldsymbol{\theta}^{*}$ is the underlying ground-truth model parameter. And this model estimation uncertainty directly leads to the uncertainty in the ranker’s pairwise preference estimation. To quantify the source of uncertainty, we follow conventional click models to assume the clicks are independent of each other given the true relevance of documents, so as their noise (Joachims et al., 2005; Guo et al., 2009b; Guo et al., 2009a). As a result, the pairwise noise becomes the sum of noise from the two associated clicks. Because we only use the independent document pairs $\mathcal{G}^{ind}$ , the pairwise noise is thus independent of each other in PairRank and the history of result serving, which directly leads to the following proposition.

Proposition 0.

For any $t\geq 1$ , $\forall(i,j)\in\mathcal{G}_{t}^{ind}$ , define pairwise noise $\epsilon_{ij}^{t}={\mathbf{y}_{ij}^{t}}-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\theta^{*})$ . For all $t\geq 1$ , $\epsilon_{ij}^{t}$ is a sub-Gaussian random variable with $\mathbb{E}[\epsilon_{ij}^{t}|\mathcal{G}_{t-1}^{ind},\{\epsilon^{t-1}\},\dots,\mathcal{G}_{1}^{ind},\{\epsilon^{1}\}]=0$ , where $\{\epsilon^{t}\}=\{\epsilon_{ij}^{t},(i,j)\in\mathcal{G}_{t}^{ind}\}$

According to the property of sub-Gaussian random variables, such assumption can be easily satisfied in practice as long as the pointwise click noise follows a sub-Gaussian distribution. For example, the pointwise noise can be modeled as a binary random variable related to the document’s true relevance under the given query, which follows $\frac{1}{2}$ -sub-Gaussian distribution.

Therefore, based on the solution of Eq (1), the uncertainty of the estimated pairwise preference $\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})$ by RankNet at round $t$ can be analytically bounded with a high probability, as shown in the following lemma.

Lemma 0.

(Confidence Interval of Pairwise Preference Estimation). At rount $t<T$ , for any pair of documents $\mathbf{x}_{i}^{t}$ and $\mathbf{x}_{j}^{t}$ under query $q_{t}$ , with probability at least $1-\delta_{1}$ , we have,

|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})|\leq\alpha_{t}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}},

where $\alpha_{t}=({2k_{\mu}}/{c_{\mu}})\Big{(}\sqrt{R^{2}\log{({\det(\mathbf{M}_{t})}/({\delta_{1}^{2}\det(\lambda\mathbf{I})})})}+\sqrt{\lambda}Q\Big{)}$ , $\mathbf{M}_{t}=\lambda\mathbf{I}+\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}^{ind}_{s}}{\mathbf{x}_{mn}^{s}}{\mathbf{x}_{mn}^{s}}^{\top}$ , $k_{\mu}$ is the Lipschitz constant of the sigmoid link function $\sigma$ , $c_{\mu}=\inf_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\dot{\sigma}(\mathbf{x}^{\top}\boldsymbol{\theta})$ , with $\dot{\sigma}$ as the first derivative of $\sigma$ , and $R$ is the sub-gaussian parameter for noise $\epsilon$ .

The detailed proof of Lemma 2 can be found in the appendix. This lemma provides a tight high probability bound of the pairwise preference estimation uncertainty under a RankNet specified by $\hat{\boldsymbol{\theta}}_{t}$ , which enables us to perform efficient pairwise exploration for model update. To better illustrate our exploration strategy based on the pairwise estimation uncertainty, we introduce the following definition on document pairs.

Definition 0.

(Certain Rank Order) At any round $t<T$ , the ranking order between document $i$ and $j$ , denoted as $(i,j)$ , is considered in a certain rank order if and only if $\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}>\frac{1}{2}$ .

Intuitively, based on Lemma 2, if $(i,j)$ is in a certain rank order, with a high probability that the estimated preference (order) between document $i$ and $j$ is consistent with the ground-truth. For example, as shown in Figure 2, $\hat{\sigma}^{t}_{ij}$ and $\hat{\sigma}_{mn}^{t}$ represent the estimated pairwise preference on document pair $(i,j)$ and $(m,n)$ based on $\hat{\boldsymbol{\theta}}_{t}$ , while $CB_{ij}^{t}$ and $CB_{mn}^{t}$ represent the corresponding confidence bound defined in Lemma 2, i.e., $CB_{ij}^{t}=\alpha_{t}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}}$ and $CB_{mn}^{t}=\alpha_{t}\|{\mathbf{x}_{mn}^{t}}\|_{\mathbf{M}_{t}^{-1}}$ . According to Lemma 2, we know that the ground-truth pairwise preferences, $\sigma_{ij}^{*}$ and $\sigma_{mn}^{*}$ , lie within the corresponding confidence intervals with a probability at least $1-\delta_{1}$ , i.e., $\sigma_{ij}^{*}\in[\hat{\sigma}_{ij}^{t}-CB_{ij}^{t},\hat{\sigma}_{ij}^{t}+CB_{ij}^{t}]$ . In Figure 2, for pair $(m,n)$ , the lower bound of its pairwise estimation, $\hat{\sigma}_{mn}^{t}-CB_{mn}^{t}$ , is greater than $\frac{1}{2}$ . This indicates that with a high probability $1-\delta_{1}$ , the estimated preference between document $m$ and $n$ is consistent with the ground-truth model $\boldsymbol{\theta}^{*}$ ; and thus there is no need to explore this pair. In contrast, with $\hat{\sigma}_{ij}^{t}-CB_{ij}^{t}<\frac{1}{2}$ , the estimated order $(i\succ j)$ is still with uncertainty as the ground-truth model may present an opposite order; hence, exploration on this pair is necessary.

We use $\mathcal{E}_{c}^{t}$ to represent the set of all certain rank orders at round $t$ . Accordingly, the set of uncertain rank orders at round $t$ is defined as: $\mathcal{E}_{u}^{t}=\{(i,j)\in[L_{t}]^{2}:(i,j)\notin\mathcal{E}_{c}^{t}\wedge(j,i)\notin\mathcal{E}_{c}^{t}\}$ .

3.2.3. Explore the Unknowns via Divide-and-Conquer

1 Input:

\lambda

\delta_{1}

\delta_{2}

Initialize

\mathbf{M}_{0}=\lambda\mathbf{I},\hat{\boldsymbol{\theta}}_{1}=0

for $t=1$ to $T$ do

2 Receive query

q_{t}

and its corresponding candidate documents set

\mathcal{X}_{t}=\{\mathbf{x}_{1}^{t},\mathbf{x}_{2}^{t},...,\mathbf{x}_{L_{t}}^{t}\}

\mathcal{E}_{c}^{t}=\{(i,j)\in[L_{t}]^{2}:\sigma({\mathbf{x}_{ij}^{t}}^{\top}\mathbf{\hat{\boldsymbol{\theta}}}_{t-1})-\alpha_{t-1}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t-1}}>1/2\}

\mathcal{E}_{u}^{t}=\{(i,j)\in[L_{t}]^{2}:(i,j)\notin\mathcal{E}_{c}^{t}\wedge(j,i)\notin\mathcal{E}_{c}^{t}\}

6 Construct ordered block list

\mathcal{B}_{t}=\{\mathcal{B}_{1}^{t},\mathcal{B}_{2}^{t},...\mathcal{B}_{d_{t}}^{t}\}

7 Generate ranked list

\tau_{t}=\{\pi(\mathcal{B}_{1}^{t}),\pi(\mathcal{B}_{2}^{t}),...,\pi(\mathcal{B}_{d_{t}}^{t})\}

8 Observe click feedback

{C}_{t}

, and corresponding

o_{t}

\mathcal{G}_{t}^{ind}=\{(i,j)\in[o_{t}]^{2}:c_{i}^{t}\neq c_{j}^{t}\wedge(\tau_{t}(i),\tau_{t}(j))\in D\}

10 Estimate

\hat{\theta}_{t}

as the solution of Eq (1).

\mathbf{M}_{t}=\mathbf{M}_{t-1}+\sum_{(i,j)\in\mathcal{G}_{t}^{ind}}\mathbf{x}_{ij}^{t}{\mathbf{x}_{ij}^{t}}^{\top}

12 end for

Algorithm 1 PairRank

With the aforementioned pairwise estimation uncertainty and the corresponding sets of certain and uncertain rank orders, i.e., $\mathcal{E}_{c}^{t}$ and $\mathcal{E}_{u}^{t}$ , we can effectively explore the unknowns. Intuitively, we only need to randomize the ranking of documents among which the model is still uncertain about their ranking orders, i.e., the uncertain rank orders, and therefore obtain feedback to further update the model (and reduce uncertainty). For example, in the document graph shown in Figure 1, the solid lines represent certain rank orders, while the dash lines represent the uncertain rank orders. When generating the ranked list, we should randomly swap the order between document 1 and 2 (i.e., to explore) while preserving the order between document 1 and documents 3, 4, 5 (i.e., to exploit).

This naturally leads to a divide-and-conquer exploration strategy in the space of pairwise document comparisons. Specifically, we partition $\mathcal{X}_{t}$ into different parts (referred to as blocks hereafter) based on $\mathcal{E}_{c}^{t}$ and $\mathcal{E}_{u}^{t}$ such that the ranking orders between any documents belonging to different blocks are certain, shown in the block graph in Figure 1. The ranked list can thus be generated by topological sort across blocks and random shuffling within each block. As the exploration is confined to the pairwise ranking space, it effectively reduces the exponentially sized action space to quadratic.

Algorithm 1 shows the detailed steps of PairRank. At round t, we first construct $\mathcal{E}_{c}^{t}$ and $\mathcal{E}_{u}^{t}$ according to the current mode $\hat{\boldsymbol{\theta}}_{t-1}$ . Then, we create blocks of $\mathcal{X}_{t}$ according to the definition below.

Definition 0.

(Block) At any round $t<T$ , the block $\mathcal{B}$ is a set of documents that satisfy:

(1)

$\forall i\in\mathcal{B}_{d}^{t},(i,j)\in\mathcal{E}_{c}^{t}\text{ for any }j\in[L_{t}]\setminus\mathcal{B}_{d}^{t}.$
(2)

$\nexists k\in[L_{t}]\setminus\mathcal{B}_{d}^{t},\text{for }i,j\in\mathcal{B}_{d}^{t},(i,k)\in\mathcal{E}_{t}^{c}\wedge(k,j)\in\mathcal{E}_{c}^{t}$

Intuitively, each block is a subset of documents linked to each other by the uncertain rank order. It can be viewed as a connected component in an undirected graph with documents as vertices and uncertain rank order as edges under a given query. The connected components (blocks) can be found by linearly scanning through the vertices, based on breadth-first search or depth-first search if a vertex is not visited before. When the algorithm stops, each vertex (document) will be assigned to a connected component (block). Once the blocks are constructed, the order of the blocks can be obtained by topological sort (line 6). Let $\mathcal{B}_{t}=\{\mathcal{B}_{1}^{t},\mathcal{B}_{2}^{t},...\mathcal{B}_{d_{t}}^{t}\}$ be the ordered block list for $\mathcal{X}_{t}$ at round $t$ , the ranked list $\tau_{t}$ is generated as $\tau_{t}=\{\pi(\mathcal{B}_{1}^{t}),\pi(\mathcal{B}_{1}^{t}),...,\pi(\mathcal{B}_{d_{t}}^{t})\}$ , where $\pi(\cdot)$ randomly permutes its input set as output.

To further improve exploration efficiency, we propose two options to generate the ranked list. As shown in Figure 1, the first ranked list is generated by randomly shuffling all documents within each block (referred to as random exploration), while in the second list, only the uncertain rank orders are shuffled, and the certain ones are preserved (referred to as conservative exploration). In our empirical studies, we observe such conservative exploration gives better improvement than random exploration, which further confirms the importance of efficient exploration in OL2R.

3.3. Regret Analysis

We theoretically analyze the regret of PairRank, which is defined by the cumulative number of mis-ordered pairs in its proposed ranked list till round $T$ . The key in analyzing this regret is to quantify how fast the model achieves certainty about its pairwise preference estimation in candidate documents. First, we define $E_{t}$ as the success event at round $t$ :

E_{t}=\big{\{}\forall(i,j)\in[L_{t}]^{2},|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})|\leq\alpha_{t}\|\mathbf{x}_{ij}^{t}\|_{\mathbf{M}_{t}^{-1}}\big{\}}.

Intuitively, $E_{t}$ is the event that the estimated $\hat{\theta}_{t}$ is “close” to the optimal model $\theta^{*}$ at round $t$ . According to Lemma 2, it is easy to reach the following conclusion,

Corollary 0.

On the event $E_{t}$ , it holds that $\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})>{1}/{2}$ if $(i,j)\in\mathcal{E}_{c}^{t}$ .

PairRank suffers regret as a result of misplacing a pair of documents, i.e., swapping a pair already in the correct order. Based on Corollary 5, under event $E_{t}$ , the certain rank order identified by PairRank is consistent with the ground-truth. As our partition design always places documents under a certain rank order into distinct blocks, under event $E_{t}$ the ranking order across blocks is consistent with the ground-truth. In other words, regret only occurs when ranking documents within each block.

To analyze the regret caused by random shuffling within each block, we need the following technical lemma derived from random matrix theory. We adapted it from Equation (5.23) of Theorem 5.39 from (Vershynin, 2010).

Lemma 0.

Let $A\in\mathbb{R}^{n\times d}$ be a matrix whose rows $A_{i}$ are independent sub-Gaussian isotropic random vectors in $\mathbb{R}^{d}$ with parameter $\sigma$ , namely $\mathbb{E}[\text{exp}(x^{\top}(A_{i}-\mathbb{E}[A_{i}])]\leq\text{exp}(\sigma^{2}\|x\|^{2}/2)$ for any $x\in\mathbb{R}^{d}$ . Then, there exist positive universal constants $C_{1}$ and $C_{2}$ such that, for every $t\geq 0$ , the following holds with probability at least $1-2\text{exp}(-C_{2}t^{2}),\text{where }\epsilon=\sigma(C_{1}\sqrt{d/n}+{t}/{\sqrt{n}})$ : $\|A^{\top}A/n-\mathbf{I}_{d}\|\leq\text{max}\{\epsilon,\epsilon^{2}\}$ .

The detailed proof can be found in (Vershynin, 2010). We should note the condition in Lemma 6 is not hard to satisfy in OL2R: at every round, the ranker is serving a potentially distinct query; and even for the same query, different documents might be returned at different times. This gives the ranker a sufficient chance to collect informative observations for model estimation. Based on Lemma 6, we have the following lemma, which provides a tight upper bound of the probability that PairRank’s estimation of the pairwise preference is an uncertain rank order.

Lemma 0.

At round $t\geq t^{\prime}$ , with $\delta_{1}\in(0,\frac{1}{2})$ , $\delta_{2}\in(0,\frac{1}{2})$ , $\beta\in(0,\frac{1}{2})$ , and $C_{1}$ , $C_{2}$ defined in Lemma 6, under event $E_{t}$ , the following holds with probability at least $1-\delta_{2}$ : $\forall(i,j)\in[L_{t}]^{2}$ , $\mathbb{P}\big{(}(i,j)\in\mathcal{E}_{u}^{t}\big{)}\leq\frac{8k_{\mu}^{2}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}}^{2}}{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}\log{\frac{1}{\delta_{1}}}$ , with $t^{\prime}=\Big{(}\frac{c_{1}\sqrt{d}+c_{2}\sqrt{\log(\frac{1}{\delta_{2}})}+abd\sqrt{\frac{o_{\text{max}}u^{2}}{d\lambda}}}{\lambda_{\text{min}}(\mathbf{\Sigma})}\Big{)}^{2}+\frac{2ab\log({1}/{\delta_{1}^{2}})+8a\lambda Q^{2}-\lambda}{\lambda_{\text{min}}(\mathbf{\Sigma})}$ , where $\Delta_{\min}=\min\limits_{t\in T,(i,j)\in[L_{t}]^{2}}|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})-\frac{1}{2}|$ representing the smallest gap of pairwise difference between any pair of documents associated to the same query over time (across all queries), $a={4k_{\mu}^{2}u^{2}}/({\beta^{2}c_{\mu}^{2}\Delta_{\text{min}}^{2}})$ , and $b=R^{2}+4\sqrt{\lambda}QR$ ,

Proof Sketch.

According to the definition of certain rank order, a pairwise estimation $\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}})$ is certain if and only if $|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}})-1/2|\geq\alpha_{t}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}}$ . By the reverse triangle inequality, the probability can be upper bounded by $\mathbb{P}\big{(}\big{|}|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})|-|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})-1/2|\big{|}\geq\alpha_{t}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}}\big{)}$ , which can be further bounded by Theorem 1 in (Abbasi-Yadkori et al., 2011). The key in this proof is to obtain a tighter bound on the uncertainty of PairRank’s parameter estimation compared to the bound determined by $\delta_{1}$ in Lemma 2, such that its confidence interval on a pair of documents’ comparison at round $t$ will exclude the possibility of flipping their ranking order, i.e., the lower confidence bound of this pairwise estimation is above 0.5. ∎

In each round of result serving, as the model $\hat{\boldsymbol{\theta}}_{t}$ would not change until next round, the expected number of uncertain rank orders, denoted as $N_{t}=|\mathcal{E}_{u}^{t}|$ , can be estimated by the summation of the uncertain probabilities over all possible pairwise comparisons under the current query $q_{t}$ , e.g., $\mathbb{E}[N_{t}]=\frac{1}{2}(\sum_{(i,j)\in[L_{t}]^{2}}\mathbb{P}((i,j)\in\mathcal{E}_{u}^{t})$ .

Denote $p_{t}$ as the probability that the user examines all documents in $\tau_{t}$ at round $t$ , and let $p^{*}=\min_{1\leq t\leq T}p_{t}$ be the minimal probability that all documents in a query are examined over time. We present the upper regret bound of PairRank as follows.

Theorem 8.

Assume pairwise query-document feature vector ${\mathbf{x}_{ij}^{t}}$ under query $q_{t}$ , where $(i,j)\in[L_{t}]^{2}$ and $t\in[T]$ , satisfies Proposition 1. With $\delta_{1}\in(0,\frac{1}{2})$ , $\delta_{2}\in(0,\frac{1}{2})$ , $\beta\in(0,\frac{1}{2})$ , the $T$ -step regret of PairRank is upper bounded by:

\displaystyle R_{T}\leq

\displaystyle R^{\prime}+(1-\delta_{1})(1-\delta_{2}){p^{*}}^{-2}\left(2adL_{\max}\log(1+\frac{o_{\max}Tu^{2}}{2d\lambda})+aw\right)^{2}

where $R^{\prime}=t^{\prime}L_{\max}^{2}+(T-t^{\prime})\big{(}\delta_{2}L_{\max}^{2}-(1-\delta_{2})\delta_{1}L_{\max}^{2}\big{)}$ , with $t^{\prime}$ and $a$ defined in Lemma 7, and $w=\sum\nolimits_{s=t^{\prime}}^{T}({(L_{\max}^{2}-2L_{\max})u^{2}}/({\lambda_{\min}(\mathbf{M}_{s})})$ , and $L_{\max}$ representing the maximum number of document associated to the same query over time. By choosing $\delta_{1}=\delta_{2}=1/T$ , we have the expected regret at most $R_{T}\leq O(d\log^{4}(T))$ .

Proof Sketch.

The detailed proof is provided in the appendix. Here, we only provide the key ideas behind our regret analysis. The regret is first decomposed into two parts: $R^{\prime}$ represents the regret when either $E_{t}$ or Lemma 7 does not hold, in which the regret is out of our control, and we use the maximum number of pairs associated to a query over time, $L_{\text{max}}$ to compute the regret. The second part corresponds to the cases when both events happen. Then, the instantaneous regret at round $t$ can be bounded by

(2)

\displaystyle r_{t}=\mathbb{E}\big{[}K(\tau_{t},\tau_{t}^{*})\big{]}=\sum\nolimits_{i=1}^{d_{t}}\mathbb{E}\big{[}\frac{(N_{i}^{t}+1)N_{i}^{t}}{2}\big{]}\leq\mathbb{E}\big{[}\frac{N_{t}(N_{t}+1)}{2}\big{]}

where $N_{i}^{t}$ denotes the number of uncertain rank orders in block $\mathcal{B}_{i}^{t}$ at round $t$ , and $N_{t}$ denotes the total number of uncertain rank orders. From the last inequality, it follows that in the worst case where the $N_{t}$ uncertain rank orders are placed into the same block and thus generate at most $({N_{t}^{2}+N_{t}})/{2}$ mis-ordered pairs with random shuffling. This is because based on the blocks created by PairRank, with $N_{t}$ uncertain rank orders in one block, this block can at most have $N_{t}+1$ documents. Then, the cumulative number of mis-ordered pairs can be bounded by the probability of observing uncertain rank orders in each round, which shrinks rapidly with more observations over time. ∎

Remark (1).

By choosing $\delta=1/T$ , the theorem shows the expected regret increases at a rate of $\mathcal{O}({\log^{4}{T}})$ . In this analysis, we provide a gap-dependent regret upper bound of PairRank, where the gap $\Delta_{\min}$ characterizes the hardness of sorting the $L_{t}$ candidate documents at round $t$ . As the matrix $M_{t}$ only contains information from observed document pairs, we adopt the probability of a ranked list is fully observed to tackle with the partial feedback (Kveton et al., 2015b, c), which is a constant independent of $T$ .

Remark (2).

Our regret is defined over the number of mis-ordered pairs, which is the first pairwise regret analysis for an OL2R algorithm, to the best of our knowledge. As we discussed before, existing OL2R algorithms optimize their own metrics, which can hardly link to any conventional rank metrics. As shown in (Wang et al., 2018b), most classical ranking evaluation metrics, such as ARP and NDCG, are based on pairwise document comparisons. Our regret analysis of PairRank connects its theoretical property with such metrics, which has been later confirmed in our empirical evaluations.

4. Experiments

In this section, we empirically compare our proposed PairRank with an extensive list of state-of-the-art OL2R algorithms on two public learning to rank benchmark datasets. Both quantitative and qualitative evaluations are performed to examine our proposed solution, especially its advantages over existing OL2R solutions in online learning efficiency.

Reproducibility Our entire codebase, baselines, analysis, and experiments can be found on Github ¹¹1https://github.com/yilingjia/PairRank.

4.1. Experiment Setup

Datasets. We experiment on two publicly available learning to rank datasets: 1) Yahoo! Learning to Rank Challenge dataset (Chapelle and Chang, 2011), which consists of 292,921 queries and 709,877 documents represented by 700 ranking features; and 2) MSLR-WEB10K (Qin and Liu, 2013), which contains 30,000 queries, each having 125 assessed documents on average, and is represented by 136 ranking features. Both datasets are labeled with a five-grade relevance scale: from not relevant (0) to perfectly relevant (4). We followed the train/test/validation split provided in the datasets.

Table 1. Configuration of simulated click models.

	Click Probability			Stop Probability
Relevance Grade	0	1	2	0	1	2
Perfect	0.0	0.5	1.0	0.0	0.0	0.0
Navigational	0.05	0.5	0.95	0.2	0.5	0.9
Informational	0.4	0.7	0.9	0.1	0.3	0.5

User Interaction Simulation. We simulate user behavior via the standard setup for OL2R evaluations (Oosterhuis and de Rijke, 2018; Wang et al., 2019) to make our reported results directly comparable to those in literature. First, at each time $t$ , a query is uniformly sampled from the training set for result serving. Then, the model determines the ranked list and returns it to the users. The interaction between the user and the list is then simulated with a dependent click model (Guo et al., 2009a), which assumes that the user will sequentially examine the list and make a click decision on each document. At each position, the user decides whether to click on the document or not, modeled as a probability conditioned on the document’s relevance label, e.g, $\mathbb{P}(click=1|\text{relevance grade})$ . After the click, the user might stop due to his/her satisfaction with the result or continue to examine more results. The stop probability after a click is modeled as $\mathbb{P}(stop=1|click=1,\text{relevance grade})$ . If there is no click, the user will continue examining the next position. We employ three different click model configurations to represent three different types of users, for which the details are shown in Table 1. Basically, we have the perfect users, who click on all relevant documents and do not stop browsing until the last returned document; the navigational users, who are very likely to click on the first highly relevant document and stop there; and the informational users, who tend to examine more documents, but sometimes click on irrelevant ones, and thus contribute a significant amount of noise in their click feedback.

Baselines. As our PairRank learns a parametric ranker, we focus our comparison against existing parametric OL2R solutions, which are known to be more generalizable and powerful than those estimating pointwise query-document relevance (Radlinski et al., 2008; Kveton et al., 2015a; Zoghi et al., 2017; Lattimore et al., 2018). We list the OL2R solutions used for our empirical comparisons below.

$\bullet$ DBGD (Yue and Joachims, 2009): As detailed in our related work discussion, DBGD uniformly samples an exploratory direction from the entire model space for exploration and model update.

$\bullet$ MGD (Schuth et al., 2016): It improves DBGD by sampling multiple directions and compares them via a multileaving test. If there is a tie, the model updates towards the mean of all winners.

$\bullet$ PDGD (Oosterhuis and de Rijke, 2018): PDGD samples the next ranked document from a Plackett-Luce model and estimates the gradient from the inferred pairwise result preferences.

$\bullet$ $\epsilon$ -Greedy (Hofmann et al., 2013): It randomly samples an unranked document with probability $\epsilon$ or selects the next best document based on the currently learned RankNet with probability $1-\epsilon$ , at each position.

$\bullet$ SGD RankNet: It estimates a single layer RankNet with stochastic gradient descent using the same click preference inference method in PairRank. At each round, it presents the best ranking based on the currently learned RankNet model.

$\bullet$ PairRank: We compare two shuffling strategies within each partition: 1) random shuffling the documents, denoted as PairRank_R, and 2) shuffling with respect to the uncertain rank orders, while preserving the certain rank orders within each partition, denoted as PairRank_C (as illustrated in Figure 1).

4.2. Experiment results

4.2.1. Offline Performance

In offline evaluation, while the algorithm is learning, we evaluate the learned rankers on a separate testing set using its ground-truth relevance labels. We use Normalized Discounted Cumulative Gain at 10 (NDCG@10) to assess different algorithms’ ranking quality. We compare all algorithms over 3 click models and 2 datasets. We set the hyper-parameters, such as learning rate, in DBGD, MGD, and PDGD according to their original papers. For SGD RankNet, $\epsilon$ -Greedy, and our proposed PairRank, we set the hyper-parameters via the best performance on the validation data for each dataset. We fix the total number of iterations $T$ to 5000. We execute the experiments 20 times with different random seeds and report the averaged result in Figure 3. For comparison purposes, we also include the performance of an offline RankNet trained on the ground-truth relevance labels on the training set, whose performance can be considered as an upper bound in our evaluation.

We can clearly observe that PairRank achieved significant improvement over all the baselines. Across the two datasets under different click models, DBGD and MGD performed worse than other OL2R solutions. This is consistent with previous studies’ findings: DBGD and MGD depend on interleave test to determine the update direction across rankers. But such model-level feedback is hard to directly inform the optimization of any rank-based metric (e.g., NDCG). PDGD consistently outperformed DBGD-type solutions under different click models and datasets. However, its document sampling-based exploration limits its learning efficiency, especially when users only examine a small number of documents (e.g., the navigational users). $\epsilon$ -Greedy seriously suffered from its random exploration, which is independent of how the current ranker performs. Oftentimes as shown in Figure 3(a), its performance is even worse than MGD, although $\epsilon$ -Greedy can directly optimize the pairwise ranking loss. One interesting observation is that SGD RankNet performs comparably to PDGD. Exploration in SGD RankNet is implicitly achieved via stochastic gradient descent. However, because this passive exploration is subject to the given queries and noise in user feedback, its performance in online result serving is hard to predict. We attribute PairRank’s fast convergence to its uncertainty based exploration: it only explores when its estimation on a pair of documents is uncertain. As proved in our regret analysis, the number of such pairs shrinks at a logarithmic rate, such that more and more documents are presented in their correct ranking positions as PairRank learns from user feedback. This conclusion is further backed by the comparison between PairRank_R and PairRank_C: by respecting the certain rank orders within a partition, PairRank_C further improves result ranking quality. When click noise is small, i.e., from perfect users, PairRank converges to its offline counterpart with only a few thousand impressions.

4.2.2. Online Performance

In OL2R, in addition to the offline evaluation, a model’s ranking performance during online optimization should also be evaluated, as it reflects user experience. Sacrificing user experience for model training will compromise the goal of OL2R. We adopt the cumulative Normalized Discounted Cumulative Gain (Oosterhuis and de Rijke, 2018) to assess models’ online performance. For $T$ rounds, the cumulative NDCG (cNDCG) is calculated as

\text{cNDCG}=\sum\nolimits_{t=1}^{T}\text{NDCG}(\tau_{t})\cdot\gamma^{(t-1)},

which computes the expected NDCG reward a user receives with a probability $\gamma$ that he/she stops searching after each query (Oosterhuis and de Rijke, 2018). Following the previous work (Oosterhuis and de Rijke, 2018; Wang et al., 2019; Wang et al., 2018a), we set $\gamma=0.9995$ .

Figure 4 shows the online performance of PairRank and all the other baselines. It is clear to observe that directly adding model-independent exploration, e.g., $\epsilon$ -greedy, strongly hurts a model’s online ranking performance, and therefore hurts user experience. Compared to PDGD, SGD RankNet showed consistently better performance, especially under the navigational and informational click models. We attribute this difference to the exploration strategy inside PDGD: though SGD RankNet has limited exploration and focuses more on exploiting the learned model in ranking the documents, PDGD’s sampling-based exploration might introduce unwanted distortion in the ranked results, especially at the early stage of online learning.

We should note in cumulative NDCG, the earlier stages play a much more important role due to the strong shrinking effect of $\gamma$ . Our proposed PairRank demonstrated significant improvements over all the baselines. Such improvement indicates the effectiveness of our uncertainty based exploration, which only explores when the ranker’s pairwise estimation is uncertain. We can also observe the difference between PairRank_C and PairRank_R in this online evaluation. This is because PairRank_C preserves the certain rank order within blocks, which further eliminates unnecessary exploration (random shuffling) in result serving.

4.2.3. Closeness to the Ideal Ranker

In this experiment, we view the offline trained RankNet model on the complete annotated relevance as the ideal model, denoted as $w^{*}$ , and compare the cosine similarities between $w^{*}$ and the online estimated models in Figure 5. We can observe that all the pairwise models push the learned ranker closer to the ideal one with more iterations, while PairRank converges faster and closer to $w^{*}$ . Apparently, the rankers obtained by DBGD and MGD are quite different from the ideal model. This confirms our earlier analysis that DBGD-type ranker update can hardly link to any rank-based metric optimization, and thus it becomes less related to the offline learned ranker. The difference in objective function can also explain why SGD RankNet converges faster than PDGD, as PDGD adopts the pairwise Plackett-Luce model while SGD RankNet has the same objective function as the offline model.

4.2.4. Zoom into PairRank

To further verify the effectiveness of the exploration strategy devised in PairRank, we zoom into the trace of its block size across queries during the online model update. As PairRank uses random shuffling within each block for exploration, a smaller block size, especially at the top-ranked positions, is preferred to reduce regret. Figure 6 shows the size of document blocks at rank position 1, 5 and 10.

First, we can clearly observe that after hundred rounds of interactions, the sizes of blocks quickly converge to 1, especially at the top-ranked positions. This confirms our theoretical analysis about PairRank’s block convergence. And by comparing the results across different click models, we can observe that the block size converges slower under the navigational click model. Similar trends can be observed in Figure 7(b). In this figure, the number of blocks is calculated by averaging in every 200 iterations to reduce variance. We can observe that at the early stages, PairRank under the navigational click model has fewer blocks (hence more documents in one block), which indicates a higher uncertainty in model estimation. The key reason is that much fewer clicks can be observed in each round under the navigational click model, as the stop probability is much higher, i.e., stronger position bias. As a result, more interactions are needed to improve model estimation. For the same reason, in Figure 6, the block size at rank position 10 shrinks much slower than that at other positions also suggests position bias is the main factor that slows down the learning of PairRank.

In Figure 7(a), we show the block size at each rank position in PairRank under the informational click model at different time points. One interesting finding is the blocks in the middle ranks, i.e., rank 300 - 400, tend to have more documents, while for the ranks at both the top and the bottom, the corresponding blocks tend to have fewer documents. We attribute it to the pairwise learning in PairRank, where the uncertainty is calculated on the pairwise preferences, and thus it is easy to identify the document pairs with greater differences.

5. Conclusion

Existing OL2R solutions suffer from slow convergence and sub-optimal performance due to inefficient exploration and limited optimization strategies. Motivated by the success of offline models, we propose to estimate a pairwise learning to rank model on the fly, named as PairRank. Based on the model’s pairwise order estimation confidence, exploration is performed only on the pairs where the ranker is still uncertain, i.e., divide-and-conquer. We prove a sub-linear upper regret bound defined on the number of mis-ordered pairs, which directly links PairRank’s convergence with classical ranking evaluations. Our empirical experiments support our regret analysis and demonstrate significant improvement of PairRank over several state-of-the-art OL2R baselines.

Our effort sheds light on moving more powerful offline learning to rank solutions online. Currently, our work is based on a single layer RankNet for analysis purposes. Following recent efforts of convergence analysis in deep learning (Zhou et al., 2019), it is possible to extend PairRank with deep ranking models and directly optimize rank-based metrics (such as NDCG). Furthermore, most OL2R solutions focus on population-level ranker estimation; thanks to the improved learning efficiency by PairRank, it is possible for us to study individual-level ranking problems, e.g., personalized OL2R.

6. Acknowledgements

We want to thank the reviewers for their insightful comments. This work is based upon work supported by National Science Foundation under grant IIS-1553568 and IIS-1618948, and Google Faculty Research Award.

References

(1)
Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved algorithms for linear stochastic bandits. In NIPS. 2312–2320.
Agichtein et al. (2006) Eugene Agichtein, Eric Brill, and Susan Dumais. 2006. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th ACM SIGIR. ACM, 19–26.
Burges (2010) Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23-581 (2010), 81.
Carvalho et al. (2008) Vitor R Carvalho, Jonathan L Elsas, William W Cohen, and Jaime G Carbonell. 2008. Suppressing outliers in pairwise preference ranking. In Proceedings of the 17th ACM CIKM. 1487–1488.
Chapelle and Chang (2011) Olivier Chapelle and Yi Chang. 2011. Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge. 1–24.
Chapelle et al. (2012) Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM TOIS 30, 1 (2012), 6.
Filippi et al. (2010) Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. 2010. Parametric bandits: The generalized linear case. In NIPS. 586–594.
Guo et al. (2009b) Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. 2009b. Click chain model in web search. In Proceedings of the 18th WWW. 11–20.
Guo et al. (2009a) Fan Guo, Chao Liu, and Yi Min Wang. 2009a. Efficient multiple-click models in web search. In Proceedings of the 2nd WSDM. 124–131.
Herbrich et al. (1999) Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 1999. Support vector learning for ordinal regression. (1999).
Hofmann et al. (2012) Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2012. Estimating interleaved comparison outcomes from historical click data. In Proceedings of the 21st CIKM. 1779–1783.
Hofmann et al. (2013) Katja Hofmann, Shimon Whiteson, and Maarten de Rijke. 2013. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval 16, 1 (2013), 63–90.
Joachims (2002) Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM KDD. 133–142.
Joachims et al. (2005) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th ACM SIGIR. ACM, 154–161.
Joachims et al. (2007) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. 2007. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM TOIS 25, 2 (2007), 7.
Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased Learning-to-Rank with Biased Feedback. In Proceedings of the 10th ACM WSDM. ACM, 781–789.
Katariya et al. (2016) Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2016. DCM bandits: Learning to rank with multiple clicks. In ICML. 1215–1224.
Kveton et al. (2018) Branislav Kveton, Chang Li, Tor Lattimore, Ilya Markov, Maarten de Rijke, Csaba Szepesvari, and Masrour Zoghi. 2018. Bubblerank: Safe online learning to rerank. arXiv preprint arXiv:1806.05819 (2018).
Kveton et al. (2015a) Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. 2015a. Cascading bandits: Learning to rank in the cascade model. In ICML. 767–776.
Kveton et al. (2015b) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. 2015b. Combinatorial cascading bandits. In NIPS. 1450–1458.
Kveton et al. (2015c) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. 2015c. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics. 535–543.
Lattimore et al. (2018) Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. 2018. Toprank: A practical algorithm for online stochastic ranking. In NIPS. 3945–3954.
Li et al. (2018) Shuai Li, Tor Lattimore, and Csaba Szepesvári. 2018. Online learning to rank with features. arXiv preprint arXiv:1810.02567 (2018).
Li et al. (2016) Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. 2016. Contextual Combinatorial Cascading Bandits.. In ICML, Vol. 16. 1245–1253.
Oosterhuis and de Rijke (2017) Harrie Oosterhuis and Maarten de Rijke. 2017. Balancing speed and quality in online learning to rank for information retrieval. In Proceedings of the 26th 2017 ACM CIKM. 277–286.
Oosterhuis and de Rijke (2018) Harrie Oosterhuis and Maarten de Rijke. 2018. Differentiable unbiased online learning to rank. In Proceedings of the 27th ACM CIKM. 1293–1302.
Qin and Liu (2013) Tao Qin and Tie-Yan Liu. 2013. Introducing LETOR 4.0 Datasets. arXiv:1306.2597 [cs.IR]
Radlinski et al. (2008) Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. 2008. Learning diverse rankings with multi-armed bandits. In ICML. 784–791.
Schuth et al. (2016) Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. 2016. Multileave gradient descent for fast online learning to rank. In Proceedings of the 9th ACM WSDM. 457–466.
Schuth et al. (2014) Anne Schuth, Floor Sietsma, Shimon Whiteson, Damien Lefortier, and Maarten de Rijke. 2014. Multileaved comparisons for fast online evaluation. In Proceedings of the 23rd ACM CIKM. ACM, 71–80.
Vershynin (2010) Roman Vershynin. 2010. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 (2010).
Wang et al. (2019) Huazheng Wang, Sonwoo Kim, Eric McCord-Snook, Qingyun Wu, and Hongning Wang. 2019. Variance Reduction in Gradient Exploration for Online Learning to Rank. In SIGIR 2019. 835–844.
Wang et al. (2018a) Huazheng Wang, Ramsey Langley, Sonwoo Kim, Eric McCord-Snook, and Hongning Wang. 2018a. Efficient exploration of gradient space for online learning to rank. In SIGIR 2018. 145–154.
Wang et al. (2018b) Xuanhui Wang, Cheng Li, Nadav Golbandi, Michael Bendersky, and Marc Najork. 2018b. The LambdaLoss Framework for Ranking Metric Optimization. In CIKM ’18. ACM, 1313–1322.
Yue and Joachims (2009) Yisong Yue and Thorsten Joachims. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML. 1201–1208.
Zhao and King (2016) Tong Zhao and Irwin King. 2016. Constructing reliable gradient exploration for online learning to rank. In Proceedings of the 25th ACM CIKM. 1643–1652.
Zhou et al. (2019) Dongruo Zhou, Lihong Li, and Quanquan Gu. 2019. Neural Contextual Bandits with UCB-based Exploration. arXiv preprint arXiv:1911.04462 (2019).
Zoghi et al. (2017) Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. 2017. Online learning to rank in stochastic click models. In ICML 2017. 4199–4208.

Appendix A Preliminaries

In this section, we present some basic definitions and inequalities for later use. The first inequality is about the Lipschitz continuity of the logistic function.

Lemma 0.

For the logistic function $\sigma(x)={1}/{(1+\exp(-x))}$ , we have: $|\sigma(x)-\sigma(y)|<k_{\mu}|x-y|$ where $k_{\mu}=1/4$ .

The second inequality is the self-normalized bound for martingales, adopted from Lemma 9 in (Abbasi-Yadkori et al., 2011).

Lemma 0 (Self-normalized bound for martingales, Lemma 9 in (Abbasi-Yadkori et al., 2011)).

Let $t$ be a stopping time with respect to the filtration $\{F_{\tau}\}^{\infty}_{\tau=0}$ . Because $\epsilon_{ij}^{s}$ is $\frac{1}{2}$ -sub-Gaussian, for $\delta>0$ , with probability $1-\delta$ ,

\left\|\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(i,j)\in\mathcal{G}_{s}^{ind}}\epsilon_{ij}^{s}\mathbf{x}^{s}_{ij}\right\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}\leq({1}/{2})\log({{\det(\mathbf{M}_{t})^{1/2}}/{(\delta\det(\lambda\mathbf{I})^{1/2})}})

Appendix B Proof of PairRank’s confidence interval: Lemma 2

B.1. Model learning for PairRank

As we introduced in Section 3.2, we adopt a single layer RankNet model with sigmoid activation function as our pairwise ranker. The objective function for our ranker defined in Eq (2) is the cross-entropy loss between the predicted pairwise preference distribution and the inferred distribution from user feedback. Given this objective function is log-convex with respect to $\boldsymbol{\theta}$ , its solution $\tilde{\boldsymbol{\theta}}_{t}$ is unique under the following estimating equation at each round $t$ ,

(3)

\displaystyle\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{\scriptscriptstyle(m,n)\in\mathcal{G}_{s}^{ind}}\left(\sigma({{\mathbf{x}_{mn}^{s}}}^{\top}\boldsymbol{\theta})-{\mathbf{y}_{mn}^{s}}\right){\mathbf{x}_{mn}^{s}}+\lambda\boldsymbol{\theta}=0

Let $g_{t}(\boldsymbol{\theta})=\sum_{\scriptscriptstyle s=1}^{\scriptscriptstyle t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\sigma({{\mathbf{x}_{mn}^{s}}}^{\top}\boldsymbol{\theta}){\mathbf{x}_{mn}^{s}}+\lambda\boldsymbol{\theta}$ be the invertible function such that the estimated parameter $\tilde{\boldsymbol{\theta}}_{t}$ satisfies $g_{t}(\tilde{\boldsymbol{\theta}}_{t})=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}{\mathbf{y}_{mn}^{s}}{\mathbf{x}_{mn}^{s}}$ . Since $\tilde{\boldsymbol{\theta}}_{t}$ might be outside of the admissible set of parameter $\Theta$ that $\boldsymbol{\theta}\in\Theta$ (e.g., $\|\boldsymbol{\theta}\|_{2}\leq Q$ ), the estimation can be projected into $\Theta$ to obtain $\hat{\boldsymbol{\theta}}_{t}$ :

(4)		$\displaystyle\hat{\boldsymbol{\theta}}_{t}$	$\displaystyle{=}\operatorname*{argmin}_{\boldsymbol{\theta}\in\Theta}\left\\|g_{t}(\boldsymbol{\theta}){-}g_{t}(\tilde{\boldsymbol{\theta}}_{t})\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}$
		$\displaystyle{=}\operatorname*{argmin}_{\boldsymbol{\theta}\in\Theta}\left\\|g_{t}(\boldsymbol{\theta}){-}\sum_{\scriptscriptstyle s=1}^{\scriptscriptstyle t-1}\sum_{\scriptscriptstyle(m,n)\in\mathcal{G}_{s}^{ind}}{\mathbf{y}_{mn}^{s}}{\mathbf{x}_{mn}^{s}}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}$

where $\mathbf{M}_{t}$ is defined as $\mathbf{M}_{t}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}{\mathbf{x}_{mn}^{s}}{{\mathbf{x}_{mn}^{s}}}^{\top}+\lambda\mathbf{I}$ . To summarize, $\tilde{\boldsymbol{\theta}}_{t}$ is the solution of Eq (4), and $\hat{\boldsymbol{\theta}}_{t}$ is the estimated model which is generated by projecting $\tilde{\boldsymbol{\theta}}_{t}$ onto $\Theta$ .

B.2. Confidence interval analysis

Now we present detailed proof of the confidence interval in Lemma 2, which is based on Proposition 1 in (Filippi et al., 2010).

At round $t$ , for any document pair $\mathbf{x}^{t}_{i}$ and $\mathbf{x}^{t}_{j}$ , the estimation error of the pairwise preference in PairRank is defined as $|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})-\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})|$ , which is based on the model $\hat{\boldsymbol{\theta}}_{t}$ learned from the observations until last round. According to Lemma 1, we have $|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})-\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})|\leq k_{\mu}|{{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*}-{{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t}|$ . As logistic function $\sigma(\cdot)$ is continuously differentiable, $\nabla g_{t}$ is continuous. Hence, according to the Fundamental Theorem of Calculus, we have $g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\hat{\boldsymbol{\theta}}_{t})=\mathbf{G}_{t}(\boldsymbol{\theta}^{*}-\hat{\boldsymbol{\theta}}_{t})$ , where $\mathbf{G}_{t}=\int_{0}^{1}\nabla g_{t}\left(l\boldsymbol{\theta}^{*}+(1-l)\hat{\boldsymbol{\theta}}_{t}\right)dl.$ Therefore, $\nabla g_{t}(\boldsymbol{\theta})=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\dot{\sigma}({{\mathbf{x}_{mn}^{s}}}^{\top}\boldsymbol{\theta}){{\mathbf{x}_{mn}^{s}}}{{\mathbf{x}_{mn}^{s}}}^{\top}+\lambda\mathbf{I}$ , where $\dot{\sigma}$ is the first order derivative of $\sigma$ . As $c_{\mu}=\inf_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\dot{\sigma}(\mathbf{x}^{\top}\boldsymbol{\theta})$ , it is easy to verify that $c_{\mu}\leq 1/4$ . Thus, we can conclude that $\mathbf{G}_{t}\succeq c_{\mu}\mathbf{M}_{t}$ . Accordingly, we have the following inequality,

		$\displaystyle\left\|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})-\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})\right\|$
	$\displaystyle\leq$	$\displaystyle k_{\mu}\left\|{{\mathbf{x}_{ij}^{t}}}^{\top}\mathbf{G}_{t}^{-1}\left(g_{t}(\boldsymbol{\theta}^{})-g_{t}(\hat{\boldsymbol{\theta}}_{t})\right)\right\|\leq k_{\mu}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{G}_{t}^{-1}}\left\\|g_{t}(\boldsymbol{\theta}^{})-g_{t}(\hat{\boldsymbol{\theta}}_{t})\right\\|_{\scriptscriptstyle\mathbf{G}_{t}^{-1}}$
	$\displaystyle\leq$	$\displaystyle({2k_{\mu}}/{c_{\mu}})\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\left\\|g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\tilde{\boldsymbol{\theta}}_{t})\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}$

where the first and second inequalities stand as $\mathbf{G}_{t}$ and $\mathbf{G}_{t}^{-1}$ are positive definite. The third inequality stands as $\mathbf{G}_{t}\succeq c_{\mu}\mathbf{M}_{t}$ , which implies that $\mathbf{G}_{t}^{-1}\preceq c_{\mu}^{-1}\mathbf{M}_{t}^{-1}$ and $\|\mathbf{x}\|_{\scriptscriptstyle\mathbf{G}_{t}^{-1}}\leq\|\mathbf{x}\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}/{\sqrt{c_{\mu}}}$ hold for any $\mathbf{x}\in\mathbb{R}^{d}$ . The last inequality stands as $\boldsymbol{\theta}^{*}\in\Theta$ , and $\hat{\boldsymbol{\theta}}_{t}$ is the optimum solution for Eq (4) at round $t$ within $\Theta$ .

Based on the definition of $\tilde{\boldsymbol{\theta}}_{t}$ and the assumption on the noisy feedback that $y^{t}=\sigma({\mathbf{x}_{t}}^{\top}\boldsymbol{\theta}^{*})+\epsilon^{t}$ , where $\epsilon^{t}$ is the noise in user feedback defined in Section 3.2, we have

		$\displaystyle g_{t}(\tilde{\boldsymbol{\theta}}_{t})-g_{t}(\boldsymbol{\theta}^{*})$
	$\displaystyle=$	$\displaystyle\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}({\mathbf{y}_{mn}^{s}}-\sigma({{\mathbf{x}_{mn}^{s}}}^{\top}\boldsymbol{\theta}^{})){\mathbf{x}_{mn}^{s}}-\lambda\boldsymbol{\theta}^{}$
	$\displaystyle=$	$\displaystyle\sum\nolimits_{s=1}^{t-1}\sum\nolimits_{(m,n)\in\mathcal{G}_{s}^{ind}}\epsilon_{mn}^{s}{\mathbf{x}_{mn}^{s}}-\lambda\boldsymbol{\theta}^{}=S_{t}-\lambda\boldsymbol{\theta}^{}.$

where we define $S_{t}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\epsilon^{s}_{mn}{\mathbf{x}_{mn}^{s}}$ .

Therefore, the confidence interval of the estimated pairwise preference in PairRank can be derived as:

	$\displaystyle\left\|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})\right\|\leq$	$\displaystyle({2k_{\mu}}/{c_{\mu}})\\|{\mathbf{x}_{ij}^{t}}\\|_{\mathbf{M}_{t}^{-1}}\left\\|S_{t}-\lambda\boldsymbol{\theta}^{*}\right\\|_{\mathbf{M}_{t}^{-1}}$
	$\displaystyle\leq$	$\displaystyle({2k_{\mu}}/{c_{\mu}})\\|{\mathbf{x}_{ij}^{t}}\\|_{\mathbf{M}_{t}^{-1}}\left(\left\\|S_{t}\right\\|_{\mathbf{M}_{t}^{-1}}+\sqrt{\lambda}Q\right)$

where the second inequality is based on minimum eigenvalue $\lambda_{\min}(\mathbf{M}_{t})\geq\lambda$ and $\|\boldsymbol{\theta}\|_{2}\leq Q$ .

As $\epsilon_{mn}^{t}\sim R$ -sub-Gaussian, according to Theorem 1 in (Abbasi-Yadkori et al., 2011),

\mathbb{P}\left[\left\|S_{t}\right\|_{\mathbf{M}_{t}^{-1}}^{2}>2R^{2}\log{\frac{\det(\mathbf{M}_{t})^{1/2}}{\delta\det(\lambda\mathbf{I})^{1/2}}}\right]\leq\delta

Therefore, with probability at least $1-\delta_{1}$ , we have

\left|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})\right|\leq\alpha_{t}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}}

with $\alpha_{t}=({2k_{\mu}}/{c_{\mu}})\Big{(}\sqrt{R^{2}\log{\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})}}}+\sqrt{\lambda}Q\Big{)}$

Appendix C Proof of Theorem 8

In this section, we present the detailed proof of Theorem 1. We first prove Lemma 7, which provides an upper bound of the probability that an estimated pairwise preference is identified as uncertain. The blocks with uncertain rank orders will lead to regret in the ranked list due to the random shuffling based exploration strategy.

C.1. Proof of Lemma 7

Proof.

In this proof, we will first provide an upper bound of the minimum eigenvalue of $\mathbf{M}_{t}$ , and then provide the detailed derivation of the probability’s upper bound.

$\bullet\leavevmode\nobreak\$ Upper bound the minimum eigenvalue of $\mathbf{M}_{t}$ . As discussion in Lemma 6 , we assume that pairwise feature vectors are random vectors drawn from distribution $v$ . With $\mathbf{\Sigma}=\mathbb{E}[{\mathbf{x}_{mn}^{s}}{\mathbf{x}_{mn}^{s}}^{\top}]$ as the second moment matrix, define $\mathbf{Z}=\mathbf{\Sigma}^{\frac{-1}{2}}\mathbf{X}$ , where $\mathbf{X}$ is a random vector drawn from the same distribution $v$ . Then $\mathbf{Z}$ is isotropic, namely $\mathbb{E}[\mathbf{Z}\mathbf{Z}^{\top}]=\mathbf{I}_{d}$ .

Define $\mathbf{U}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}\mathbf{Z}_{mn}^{s}{\mathbf{Z}_{mn}^{s}}^{\top}=\Sigma^{\frac{-1}{2}}\bar{\mathbf{M}}_{t}\Sigma^{\frac{-1}{2}}$ , where $\bar{\mathbf{M}}_{t}=\sum_{s=1}^{t-1}\sum_{(m,n)\in\mathcal{G}_{s}^{ind}}{\mathbf{x}_{mn}^{s}}{\mathbf{x}_{mn}^{s}}^{\top}=\mathbf{M}_{t}-\lambda\mathbf{I}$ . From Lemma 6, we know that for any $l$ , with probability at least $1-2\text{exp}(-C_{2}l^{2})$ , $\lambda_{\text{min}}(\mathbf{U})\geq n-C_{1}\sigma^{2}\sqrt{nd}-\sigma^{2}l\sqrt{n}$ , where $\sigma$ is the sub-Gaussian parameter of $\mathbf{Z}$ , which is upper-bounded by $\|\mathbf{\Sigma}^{-1/2}\|=\lambda_{\text{min}(\mathbf{\Sigma})}$ , and $n=\sum_{s=1}^{t-1}|\mathcal{G}_{s}^{ind}|$ , represents the number of observations so far. We thus can rewrite the above inequality which holds with probability $1-\delta_{2}$ as $\lambda_{\text{min}}(\mathbf{U})\geq n-\lambda_{\text{min}}^{-1}(\mathbf{\Sigma})(C_{1}\sqrt{nd}+l\sqrt{n})$ . We now bound the minimum eigenvalue of $\bar{\mathbf{M}}_{t}$ , as follows:

	$\displaystyle\lambda_{\text{min}}(\bar{\mathbf{M}}_{t})$	$\displaystyle=\min_{x\in\mathbb{B}^{d}}x^{\top}\bar{\mathbf{M}}_{t}x=\min_{x\in\mathbb{B}^{d}}x^{\top}\mathbf{\Sigma}^{1/2}\mathbf{U}\mathbf{\Sigma}^{1/2}x\geq\lambda_{\text{min}}(\mathbf{U})\min_{x\in\mathbb{B}^{d}}x^{\top}\mathbf{\Sigma}x$
		$\displaystyle=\lambda_{\text{min}}(\mathbf{U})\lambda_{\text{min}}(\mathbf{\Sigma})\geq\lambda_{\text{min}}(\mathbf{\Sigma})n-C_{1}\sqrt{nd}-C_{2}\sqrt{n\log(1/\delta_{2})}$

According to Lemma 6, for $t\geq t^{\prime}$ , we have:

	$\displaystyle\lambda_{\text{min}}(\mathbf{\Sigma})t-(c_{1}\sqrt{d}+c_{2}\sqrt{\log({1}/{\delta_{2}})}+abd\sqrt{{o_{\text{max}}u^{2}}/({d\lambda})})\sqrt{t}$
	$\displaystyle-(ab\log({1}/{\delta_{1}^{2}})+4a\lambda Q^{2}-\lambda)\geq 0$

As $n\geq t$ , we have

(5)

\lambda_{\text{min}}({\mathbf{M}}_{t})\geq\lambda_{\text{min}}(\bar{\mathbf{M}}_{t})+\lambda\geq abd\sqrt{\frac{o_{\text{max}}u^{2}}{d\lambda}}+ab\log(\frac{1}{\delta_{1}^{2}})+4a\lambda Q^{2}

$\bullet\leavevmode\nobreak\$ Upper bound the probability of being uncertain rank order. Under event $E_{t}$ , based on the definition of $\mathcal{E}_{u}^{t}$ in Section 3.2, we know that for any document $\mathbf{x}^{t}_{i}$ and $\mathbf{x}^{t}_{j}$ at round $t$ , $(i,j)\in\mathcal{E}_{u}^{t}$ if and only if $\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\alpha_{t}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}}\leq\frac{1}{2}$ and $\sigma({{\mathbf{x}_{ji}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\alpha_{t}\|{\mathbf{x}_{ji}^{t}}\|_{\mathbf{M}_{t}^{-1}}\leq\frac{1}{2}$ . For a logistic function, we know that $\sigma(s)=1-\sigma(-s)$ . Therefore, let $CB_{ij}^{t}$ denote $\alpha_{t}\|{\mathbf{x}_{ij}^{t}}\|_{\mathbf{M}_{t}^{-1}}$ , we can conclude that $(i,j)\in\mathcal{E}_{u}^{t}$ if and only if $|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\frac{1}{2}|\leq CB_{ij}^{t}$ ; and accordingly, $(i,j)\in\mathcal{E}_{c}^{t}$ , when $|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})-\frac{1}{2}|>CB_{ij}^{t}$ . To further simplify our notations, in the following analysis, we use $\hat{\sigma}_{t}$ and $\sigma^{*}$ to present $\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})$ and $\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})$ respectively. According to the discussion above, the probability that the estimated preference between document $\mathbf{x}^{t}_{i}$ and $\mathbf{x}^{t}_{j}$ to be in an uncertain rank order, e.g., $(i,j)\in\mathcal{E}_{u}^{t}$ can be upper bounded by:

		$\displaystyle\mathbb{P}\big{(}(i,j)\in\mathcal{E}_{u}^{t}\big{)}=\mathbb{P}\big{(}\|\hat{\sigma}_{t}-{1}/{2}\|\leq CB_{ij}^{t}\big{)}$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\|\|\hat{\sigma}_{t}-\sigma^{}\|-\|\sigma^{}-{1}/{2}\|\right\|\leq CB_{ij}^{t}\right)\leq\mathbb{P}\left(\|\sigma^{}-{1}/{2}\|-\|\hat{\sigma}_{t}-\sigma^{}\|\leq CB_{ij}^{t}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\Delta_{\min}-\|\hat{\sigma}_{t}-\sigma^{*}\|\leq CB_{ij}^{t}\right).$

where, the first inequality is based on the reverse triangle inequality. The last inequality is based on the definition of $\Delta_{\min}$ . Based on Lemma 2, the above probability can be further bounded by

		$\displaystyle\mathbb{P}\left(\Delta_{\min}-\|\hat{\sigma}_{t}-\sigma^{}\|\leq CB_{ij}^{t}\right)=\mathbb{P}\left(\|\hat{\sigma}_{t}-\sigma^{}\|\geq\Delta_{\min}-\alpha_{t}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(({2k_{\mu}}/{c_{\mu}})\|\|{\mathbf{x}_{ij}^{t}}\|\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}+\sqrt{\lambda}Q\right)\geq\Delta_{\min}-\alpha_{t}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\geq\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\|\|{\mathbf{x}_{ij}^{t}}\|\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}}-\left(\sqrt{\frac{1}{4}\log{\frac{\det(\mathbf{M}_{t})}{\delta^{2}\det(\lambda\mathbf{I})}}}+2\sqrt{\lambda}Q\right)\right)$

For the RHS of the inequality inside the probability, we know that:

		$\displaystyle\left(\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\|\|{\mathbf{x}_{ij}^{t}}\|\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}}\right)^{2}-\left(\sqrt{R^{2}\log{\frac{\det(\mathbf{M}_{t})}{\delta^{2}\det(\lambda\mathbf{I})}}}+2\sqrt{\lambda}Q\right)^{2}$
	$\displaystyle\geq$	$\displaystyle{\lambda_{\min}(\mathbf{M}_{t})}/{a}-R^{2}\log({\det(\mathbf{M}_{t})}/{(\delta_{1}^{2}\det(\lambda\mathbf{I}))})$
		$\displaystyle-4\lambda Q^{2}-4\sqrt{\lambda}QR\sqrt{\log({\det(\mathbf{M}_{t})}/{\det{\lambda\mathbf{I}}})+\log({1}/{\delta_{1}^{2}})}$
	$\displaystyle\geq$	$\displaystyle{\lambda_{\min}(\mathbf{M}_{t})}/{a}-b\left(\log({\det(\mathbf{M}_{t})}/{\det(\lambda\mathbf{I})})+\log({1}/{\delta_{1}^{2}})\right)-4\lambda Q^{2}$
	$\displaystyle\geq$	$\displaystyle{\lambda_{\min}(\mathbf{M}_{t})}/{a}-bd\log(1+\frac{o_{\max}tu^{2}}{d\lambda})-b\log(\frac{1}{\delta_{1}^{2}})-4\lambda Q^{2}\geq 0$

where the last inequality follows Eq (5). Therefore, the probability could be upper bounded as:

		$\displaystyle\mathbb{P}\left(\Delta_{\min}-\|\hat{\sigma}_{t}-\sigma^{*}\|\leq CB_{ij}^{t}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}{\geq}\left(\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}}-\Big{(}\sqrt{R^{2}\log{\frac{\det(\mathbf{M}_{t})}{\delta\det(\lambda\mathbf{I})}}}+2\sqrt{\lambda}Q\Big{)}\right)^{2}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}{\geq}\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{4k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}}^{-1}}+R^{2}\log(\frac{\det(M_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})})\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}{\geq}2R^{2}\log\left(\exp\left(\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{8R^{2}k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}}^{-1}}\right)\cdot\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})})\right)\right)$
	$\displaystyle\leq$	$\displaystyle\delta_{1}\cdot\exp^{-1}\left(\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{8R^{2}k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}}\right)\leq\log(\frac{1}{\delta_{1}})\cdot\frac{8R^{2}k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\mathbf{M}_{t}^{-1}}^{2}}{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}$

∎

C.2. Proof of Theorem 8

With $\delta_{1}$ and $\delta_{2}$ defined in the previous lemmas, we have the T-step regret upper bounded as:

(6)

\displaystyle R_{T}=R_{t^{\prime}}+R_{T-t^{\prime}}\leq R^{\prime}+(1-\delta_{1})(1-\delta_{2})\sum\nolimits_{s=t^{\prime}}^{T}r_{s}

where $R^{\prime}=t^{\prime}\cdot L_{\max}+(T-t^{\prime})\left[\delta_{2}\cdot L_{\max}+(1-\delta_{2})\cdot\delta_{1}L_{\max}\right]$ Under event $E_{t}^{1}$ and $E_{t}^{2}$ , the instantaneous regret at round $s$ is bounded by

(7)

\displaystyle r_{s}=\mathbb{E}\big{[}K(\tau_{s},\tau_{s}^{*})\big{]}=\sum\nolimits_{i=1}^{d_{s}}\mathbb{E}\left[{(N_{i}^{s}+1)N_{i}^{s}}/{2}\right]\leq\mathbb{E}\left[{N_{s}(N_{s}+1)}/{2}\right]

where $N_{i}^{s}$ denotes the number of uncertain rank orders in block $\mathcal{B}_{i}^{s}$ at round $s$ and $N_{s}$ denotes the total number of uncertain rank orders at round $s$ . It follows that in the worst case scenario, when $N_{s}$ uncertain rank orders are all placed into the same block, at most $(N_{s}^{2}+N_{s})/2$ mis-ordered pairs will be generated by random shuffling in PairRank. This is due to the fact that based on the block created by PairRank, with $N_{s}$ uncertain rank orders in one block, this block can have at most $N_{s}+1$ documents.

Therefore, the cumulative regret after $t^{\prime}$ can be bounded by:

	$\displaystyle\sum\nolimits_{s=t^{\prime}}^{T}r_{s}\leq$	$\displaystyle\sum\nolimits_{s=t^{\prime}}^{T}\mathbb{E}\left[{N_{s}(N_{s}+1)}/{2}\right]\leq\left(\sum\nolimits_{s=t^{\prime}}^{T}\mathbb{E}[N_{s}^{2}]+\mathbb{E}[N_{s}]\right)/2$
	$\displaystyle\leq$	$\displaystyle\left(\mathbb{E}[\sum\nolimits_{s=t^{\prime}}^{T}N_{s}]^{2}+\mathbb{E}[\sum\nolimits_{s=t^{\prime}}^{T}N_{s}]\right)/2$

According to our previous analysis, at round $t\geq t^{\prime}$ , the number of uncertain rank orders can be estimated by the probability of observing an uncertain rank order, i.e., $\mathbb{P}\big{(}(i,j)\in\mathcal{E}_{u}^{t}\big{)}\leq\log({1}/{\delta_{1}})\cdot\frac{8R^{2}k_{\mu}^{2}\|{\mathbf{x}_{ij}^{t}}\|_{M_{t}^{-1}}^{2}}{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}$ . Therefore, the cumulative number of mis-ordered pairs can be bounded by the probability of observing uncertain rank orders in each round, which shrinks with more observations become available over time,

	$\displaystyle\mathbb{E}\left[\sum\nolimits_{s=t^{\prime}}^{T}N_{s}\right]\leq$	$\displaystyle\mathbb{E}\left[\frac{1}{2}\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\mathbb{P}\big{(}(m,n)\in\mathcal{E}_{u}^{t}\big{)}\right]$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}\left[a\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\\|{\mathbf{x}_{mn}^{s}}\\|_{\mathbf{M}_{s}^{-1}}^{2}\right]$

where $a=\log({1}/{\delta_{1}})\cdot{4k_{\mu}^{2}}/({(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}})$

Because $\mathbf{M}_{t}$ only contains information of observed document pairs so far, PairRank guarantees the number of mis-ordered pairs among the observed documents in the above inequality is upper bounded. To reason about the number of mis-ordered pairs in those unobserved documents (i.e., from $o_{t}$ to $L_{t}$ for each query $q_{t}$ ), we leverage the constant $p^{*}$ , which is defined as the minimal probability that all documents in a query are examined over time,

		$\displaystyle\mathbb{E}\left[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\\|{\mathbf{x}_{mn}^{s}}\\|_{\mathbf{M}_{s}^{-1}}^{2}\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\\|{\mathbf{x}_{mn}^{s}}\\|_{\mathbf{M}_{s}^{-1}}^{2}\times\mathbb{E}\left[\frac{1}{p_{s}}\mathbf{1}\{o_{s}=L_{s}\}\right]\right]$
	$\displaystyle\leq$	$\displaystyle{p^{*}}^{-1}\mathbb{E}\left[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\\|{\mathbf{x}_{mn}^{s}}\\|_{\mathbf{M}_{s}^{-1}}^{2}\mathbf{1}\{o_{s}=L_{s}\}\right]$

Besides, in PairRank, we only use the independent pairs, $\mathcal{G}_{t}^{ind}$ to update the model and the corresponding $M_{t}$ matrix. Therefore, to bound the regret, we rewrite the above equation as:

		$\displaystyle\mathbb{E}\left[\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in[L_{s}]^{2}}\\|{\mathbf{x}_{mn}^{s}}\\|_{\mathbf{M}_{s}^{-1}}^{2}\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\sum_{s=t^{\prime}}^{T}\sum_{(m,n)\in\mathcal{G}_{t}^{ind}}\left(\\|{\mathbf{x}_{mn}^{s}}\\|_{\mathbf{M}_{s}^{-1}}^{2}+\sum_{k\in[L_{t}]\setminus\{m,n\}}\\|\mathbf{x}_{mk}^{s}\\|_{\mathbf{M}_{s}^{-1}}^{2}+\\|\mathbf{x}_{nk}^{s}\\|_{\mathbf{M}_{s}^{-1}}^{2}\right)\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\sum_{s=t^{\prime}}^{T}\sum_{(m,n)\in\mathcal{G}_{t}^{ind}}\left((L_{t}-1)\\|{\mathbf{x}_{mn}^{s}}\\|_{\mathbf{M}_{s}^{-1}}^{2}+\sum_{k\in[L_{t}]\setminus\{m,n\}}2{\mathbf{x}_{mk}^{s}}^{\top}\mathbf{M}_{s}^{-1}\mathbf{x}_{nk}^{s}\right)\right]$

Based on Lemma 10 and Lemma 11 in (Abbasi-Yadkori et al., 2011), the expected number of the first term can be upper bounded by,

\displaystyle\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in\mathcal{G}_{t}^{ind}}(L_{t}-1)\|{\mathbf{x}_{mn}^{s}}\|_{\mathbf{M}_{s}^{-1}}^{2}\leq 2dL_{max}\log(1+\frac{o_{\max}Tu^{2}}{2d\lambda})

And the second term can be bounded as:

		$\displaystyle\sum\nolimits_{s=t^{\prime}}^{T}\sum\nolimits_{(m,n)\in\mathcal{G}_{t}^{ind}}\sum\nolimits_{k\in[L_{t}]\setminus\{m,n\}}2{\mathbf{x}_{mk}^{s}}^{\top}\mathbf{M}_{s}^{-1}\mathbf{x}_{nk}^{s}$
	$\displaystyle\leq$	$\displaystyle\sum\nolimits_{s=t^{\prime}}^{T}{(L_{max}^{2}-2L_{max})u^{2}}/{\lambda_{\min}(\mathbf{M}_{s})}=w.$

Therefore, combining the conclusion of $\mathbb{E}[\sum_{s=t^{\prime}}^{T}N_{s}]$ , and Eq (6), Eq (7), the regret could be upper bounded by:

\displaystyle R_{T}\leq

\displaystyle R^{\prime}+(1-\delta_{1})(1-\delta_{2})\frac{1}{p^{*}{{}^{2}}}\left(2adL_{\max}\log(1+\frac{o_{\max}Tu^{2}}{2d\lambda})+aw\right)^{2}

where $R^{\prime}=t^{\prime}L_{\max}+(T-t^{\prime})(\delta_{2}L_{\max}-(1-\delta_{2})\delta_{1}L_{\max})$ , with $t^{\prime}$ defined in Lemma 7, and $L_{\max}$ representing the maximum number of documents associated to the same query over time. By choosing $\delta_{1}=\delta_{2}=1/T$ , the theorem shows that the expected regret is at most $R_{T}\leq O(d\log^{4}(T))$ .

		$\displaystyle\left\|\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\boldsymbol{\theta}^{*})-\sigma({{\mathbf{x}_{ij}^{t}}}^{\top}\hat{\boldsymbol{\theta}}_{t})\right\|$
	$\displaystyle\leq$	$\displaystyle k_{\mu}\left\|{{\mathbf{x}_{ij}^{t}}}^{\top}\mathbf{G}_{t}^{-1}\left(g_{t}(\boldsymbol{\theta}^{})-g_{t}(\hat{\boldsymbol{\theta}}_{t})\right)\right\|\leq k_{\mu}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{G}_{t}^{-1}}\left\\|g_{t}(\boldsymbol{\theta}^{})-g_{t}(\hat{\boldsymbol{\theta}}_{t})\right\\|_{\scriptscriptstyle\mathbf{G}_{t}^{-1}}$
	$\displaystyle\leq$	$\displaystyle({2k_{\mu}}/{c_{\mu}})\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\left\\|g_{t}(\boldsymbol{\theta}^{*})-g_{t}(\tilde{\boldsymbol{\theta}}_{t})\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}$

	$\displaystyle\left\|\sigma({\mathbf{x}_{ij}^{t}}^{\top}\boldsymbol{\theta}^{*})-\sigma({\mathbf{x}_{ij}^{t}}^{\top}\hat{\boldsymbol{\theta}}_{t})\right\|\leq$	$\displaystyle({2k_{\mu}}/{c_{\mu}})\\|{\mathbf{x}_{ij}^{t}}\\|_{\mathbf{M}_{t}^{-1}}\left\\|S_{t}-\lambda\boldsymbol{\theta}^{*}\right\\|_{\mathbf{M}_{t}^{-1}}$
	$\displaystyle\leq$	$\displaystyle({2k_{\mu}}/{c_{\mu}})\\|{\mathbf{x}_{ij}^{t}}\\|_{\mathbf{M}_{t}^{-1}}\left(\left\\|S_{t}\right\\|_{\mathbf{M}_{t}^{-1}}+\sqrt{\lambda}Q\right)$

		$\displaystyle\mathbb{P}\big{(}(i,j)\in\mathcal{E}_{u}^{t}\big{)}=\mathbb{P}\big{(}\|\hat{\sigma}_{t}-{1}/{2}\|\leq CB_{ij}^{t}\big{)}$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\|\|\hat{\sigma}_{t}-\sigma^{}\|-\|\sigma^{}-{1}/{2}\|\right\|\leq CB_{ij}^{t}\right)\leq\mathbb{P}\left(\|\sigma^{}-{1}/{2}\|-\|\hat{\sigma}_{t}-\sigma^{}\|\leq CB_{ij}^{t}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\Delta_{\min}-\|\hat{\sigma}_{t}-\sigma^{*}\|\leq CB_{ij}^{t}\right).$

		$\displaystyle\mathbb{P}\left(\Delta_{\min}-\|\hat{\sigma}_{t}-\sigma^{}\|\leq CB_{ij}^{t}\right)=\mathbb{P}\left(\|\hat{\sigma}_{t}-\sigma^{}\|\geq\Delta_{\min}-\alpha_{t}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(({2k_{\mu}}/{c_{\mu}})\|\|{\mathbf{x}_{ij}^{t}}\|\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}+\sqrt{\lambda}Q\right)\geq\Delta_{\min}-\alpha_{t}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}\geq\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\|\|{\mathbf{x}_{ij}^{t}}\|\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}}-\left(\sqrt{\frac{1}{4}\log{\frac{\det(\mathbf{M}_{t})}{\delta^{2}\det(\lambda\mathbf{I})}}}+2\sqrt{\lambda}Q\right)\right)$

		$\displaystyle\mathbb{P}\left(\Delta_{\min}-\|\hat{\sigma}_{t}-\sigma^{*}\|\leq CB_{ij}^{t}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}{\geq}\left(\frac{c_{\mu}\Delta_{\min}}{2k_{\mu}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}}-\Big{(}\sqrt{R^{2}\log{\frac{\det(\mathbf{M}_{t})}{\delta\det(\lambda\mathbf{I})}}}+2\sqrt{\lambda}Q\Big{)}\right)^{2}\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}{\geq}\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{4k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}}^{-1}}+R^{2}\log(\frac{\det(M_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})})\right)$
	$\displaystyle\leq$	$\displaystyle\mathbb{P}\left(\left\\|S_{t}\right\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}{\geq}2R^{2}\log\left(\exp\left(\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{8R^{2}k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}}^{-1}}\right)\cdot\frac{\det(\mathbf{M}_{t})}{\delta_{1}^{2}\det(\lambda\mathbf{I})})\right)\right)$
	$\displaystyle\leq$	$\displaystyle\delta_{1}\cdot\exp^{-1}\left(\frac{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}{8R^{2}k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\scriptscriptstyle\mathbf{M}_{t}^{-1}}^{2}}\right)\leq\log(\frac{1}{\delta_{1}})\cdot\frac{8R^{2}k_{\mu}^{2}\\|{\mathbf{x}_{ij}^{t}}\\|_{\mathbf{M}_{t}^{-1}}^{2}}{(1-2\beta)c_{\mu}^{2}\Delta_{\min}^{2}}$