
GRN: Generative Rerank Network for Context-wise Recommendation

Yufei Feng, Binbin Hu, Yu Gong, Fei Sun, Qingwen Liu, Wenwu Ou
Alibaba Group, Hangzhou, China
(2021)
Abstract.

Reranking is attracting increasing attention in recommender systems, as it rearranges the input ranking list into the final ranking list to better meet user demands. Most existing methods greedily rerank candidates through the rating scores from point-wise or list-wise models. Despite their effectiveness, these greedy-strategy-based reranking methods neglect the mutual influence between each item and its contexts in the final ranking list, which often makes them sub-optimal. In this work, we propose a new context-wise reranking framework named Generative Rerank Network (GRN for short). Specifically, we first design the evaluator, which applies Bi-LSTM and the self-attention mechanism to model the contextual information in the labeled final ranking list and predict the interaction probability of each item more precisely. Afterwards, we elaborate on the generator, equipped with GRU, attention mechanism and pointer network, to select items from the input ranking list step by step. Finally, we apply cross-entropy loss to train the evaluator and, subsequently, policy gradient to optimize the generator under the guidance of the evaluator. Empirical results show that GRN consistently and significantly outperforms state-of-the-art point-wise and list-wise methods. Moreover, GRN has achieved performance improvements of 5.2% on the PV metric and 6.1% on the IPV metric after successful deployment in one popular recommendation scenario of the Taobao application.


1. Introduction

Figure 1. The difference between existing works and ours. Greedy reranking changes the original contexts and leads to inaccurate predictions in the new permutation. Our work attempts to learn a context-wise reranking strategy under the guidance of a well-trained context-wise model.

Recommender systems (RS) have been widely deployed in various web-scale applications, including e-commerce Sarwar et al. (2001); Covington et al. (2016); Zhou et al. (2018); Feng et al. (2019), social media Covington et al. (2016); Gomez-Uribe and Hunt (2015) and online news (Das et al., 2007; Liu et al., 2010; Li et al., 2011). Owing to its great impact on user satisfaction as well as the revenue of the RS, attention is increasingly shifting towards the reranking stage, which is typically designed to rearrange the input ranking list generated by the previous stages (i.e., matching and ranking). With the rapid development of deep learning techniques, various reranking algorithms have been proposed to exploit the internal relevance within the input ranking list and improve recommendation performance.

In particular, several recent efforts have attempted to follow the greedy strategy for reranking items with refined rating scores, which mainly fall into two groups: global point-wise models Covington et al. (2016); Cheng et al. (2016); Guo et al. (2017) and local list-wise models Ai et al. (2018); Burges ([n. d.]); Pei et al. (2019); Bello et al. (2018); Zhuang et al. (2018). Global point-wise models learn a ranking function from labeled user-item pairs. In spite of their effectiveness, they ignore the different feature distributions within the input ranking list generated for each user. In fact, users may show different concerns about different input ranking lists, such as the price of snacks and the brand of electronics. To solve this, several local list-wise models have been proposed to refine the rating scores by taking the feature distribution of the input ranking list into account. Unfortunately, these methods commonly follow the greedy strategy for item rearrangement, which changes the contexts of each item from the initial list to the final list, resulting in imprecise estimation under the new item permutation. As shown in Fig. 1, the rating scores predicted by modeling the contexts of items $(i_{1},i_{2},i_{3},i_{4})$ in the final list could be $(0.4,0.1,0.2,0.3)$. Specifically, $0.2$ represents the contextual interaction probability of $i_{3}$ when placed after $i_{1},i_{2}$ and before $i_{4}$. Once the candidates are rearranged according to the rating scores into $(i_{1},i_{4},i_{3},i_{2})$, the contextual items of $i_{3}$ are modified, leading to inaccurate estimation in the new permutation and sub-optimal item arrangements.

How to better leverage contexts in the reranking stage remains a great challenge in RS. Intuitively, instead of directly recommending a cheap item, placing a more expensive item ahead of it can stimulate the user's desire to buy the cheaper one. Besides, placing all the items of the user's greatest interest at the front may reduce the possibility of his/her continued browsing. Such generalized contextual knowledge is of crucial importance for item rearrangement. Unfortunately, such contexts are difficult to determine and capture since the final ranking list is not given in advance. Inspired by continuous advances in actor-critic reinforcement learning Schulman et al. (2015, 2017); Mnih et al. (2016), one practical and effective solution is to learn a context-wise reranking strategy under the guidance of a well-trained context-wise model. Specifically, as illustrated in Fig. 1, we transform the context-wise reranking objective into the following two tasks: (1) The local context-wise model is designed to predict the contextual interaction probability (e.g., click-through rate and conversion rate) more precisely, by fully leveraging the contexts of each item in the labeled records between users and item lists; (2) The reranking strategy is meant to obtain the rearranged item list from the input ranking list. Moreover, the reranking strategy converges to a context-wise optimum under the guidance of the local context-wise model. In this way, by considering the contextual information, the reranking strategy can reach context-wise item arrangements by distilling contextual knowledge from the local context-wise model, and thus bring a better experience for users.

In this work, with the above discussions in mind, we propose a novel context-wise framework named Generative Rerank Network (GRN for short), which is composed of three parts: (1) Evaluator. To better refine the contextual interaction probability of each item in the final ranking list, we employ Bi-LSTM and the self-attention mechanism to capture the sequential information along the list and the mutual influence between items, respectively; (2) Generator. With the aim of learning the reranking strategy to generate the final ranking list, we apply a pointer network combined with GRU and an attention mechanism to select the appropriate item from the input ranking list at each step; (3) Training procedure. We adopt cross-entropy loss to train the evaluator and, subsequently, policy gradient with a technically designed advantage reward to train the generator.

In sum, we make the following contributions:

  • We highlight the necessity of modeling the contexts of each item in the final ranking list for more precise prediction and better item rearrangements. We propose a practical and effective solution: learning a context-wise reranking strategy under the guidance of a well-trained context-wise model.

  • We propose a novel context-wise reranking framework named GRN, which employs a well-designed evaluator and generator in a cooperative manner, with the aim of comprehensively capturing the evolving intent and mutual influence implied in the input and final ranking lists.

  • We perform a series of experiments on a public dataset from Alimama and a proprietary dataset from Taobao application. Online and offline experimental results demonstrate that GRN consistently and significantly outperforms various state-of-the-art methods.

The remainder of this paper is organized as follows. Section 2 discusses related work for reranking, including point-wise, pair-wise and list-wise models. Section 3 presents some preliminary knowledge as well as the task description. Then, we propose the generative rerank network in Section 4. Experiments and detailed analysis are reported in Section 5. Finally, we conclude the paper in Section 6.

2. Related Work

Reranking usually serves as the last stage after the matching and ranking stages in a typical recommender system. In this section, we review the studies most related to the reranking stage of recommender systems, where an input ranking list provided by the previous stages is rearranged into the final ranking list that is expected to better meet user demands. Most reranking methods can be broadly classified into three categories: point-wise, pair-wise and list-wise methods.

  • Point-wise methods Covington et al. (2016); Cheng et al. (2016); Guo et al. (2017) regard the recommendation task as a binary classification problem and globally learn a scoring function for a given user-item pair with manual feature engineering. Despite their continuous achievements, these methods neglect the comparison and mutual influence between items in the input ranking list.

  • Pair-wise methods work by considering the semantic distance of one item pair at a time. Following this line, a surge of works propose technically designed pair-wise loss functions to compare item pairs in the input ranking list with well-designed architectures, including RankSVM (Joachims, 2006) based on SVM, GBRank (Zheng et al., 2007) based on GBDT, as well as RankNet (Burges et al., 2005) and DSF (Ai et al., 2019) based on emerging deep neural networks. However, pair-wise methods neglect the local information in the list and increase model complexity.

  • List-wise methods are proposed to capture the interior correlations among items in the list in different ways. LambdaMART (Burges, [n. d.]) is a well-known tree-based method with a list-wise loss function. MIDNN (Zhuang et al., 2018) relies on handmade global list-wise features, which requires much domain knowledge and limits its generalization ability. DLCM (Ai et al., 2018) and PRM (Pei et al., 2019) apply GRU and the transformer, respectively, to encode the list-wise information of the input ranking list for better prediction. Though effective, these list-wise models do not escape the paradigm of greedy ranking based on predicted scores, where the adjustment of the final ranking list greatly changes the contextual information of each item.

Closest to our work are MIRNN (Zhuang et al., 2018) and Seq2Slate (Bello et al., 2018), which directly learn a step-greedy reranking strategy to generate the final ranking list through an RNN and a pointer network, respectively. We argue that considering only the preceding information is insufficient and incomplete for optimal item rearrangement, since the interaction probability of each item is heavily affected by both preceding and subsequent items. In this work, we upgrade the step-greedy reranking strategy to a context-wise reranking strategy.

There are also some works (Chen et al., 2018; Wilhelm et al., 2018; Gelada et al., 2019; Gogna and Majumdar, 2017) focusing on the trade-off between relevance and diversity in the reranking stage. Different from these works, GRN is an end-to-end context-wise reranking framework, which may automatically generate diverse or alike recommendation results for the sole sake of effectiveness. Besides, group-wise methods (Ai et al., 2019) determine the priority of items by jointly scoring multiple documents in the list. Though effective, the computation complexity of group-wise methods is at least $O(N^{2})$, which may not be appropriate for industrial recommender systems.

3. Preliminary

In general, a mature recommender system (e.g., for e-commerce or news) is composed of three stages chronologically: matching, ranking and reranking. In this paper, we focus on the final reranking stage, whose input is the ranking list produced by the previous two stages (i.e., matching and ranking). The goal of reranking is to elaborately select candidates from the input ranking list and rearrange them into the final ranking list, which is then exhibited to users. Mathematically, with user set $\mathcal{U}$ and item set $\mathcal{I}$, we denote the list interaction records as $\mathcal{R}=\{(u,\mathcal{C},\mathcal{V},\mathcal{Y})\,|\,u\in\mathcal{U},\mathcal{V}\subset\mathcal{C}\subset\mathcal{I}\}$. Here, $\mathcal{C}$ and $\mathcal{V}$ represent the recorded input ranking list with $m$ items for the reranking stage and the recorded final ranking list with $n$ items exhibited to user $u$, respectively. Intuitively, $n\leq m$. $y^{t}_{uv}\in\mathcal{Y}$ is the implicit feedback of user $u$ w.r.t. the $t$-th item $v_{t}\in\mathcal{V}$, defined as follows: $y^{t}_{uv}=1$ when an interaction (e.g., a click) is observed, and $y^{t}_{uv}=0$ otherwise. In real-world industrial recommender systems, each user $u$ is associated with a user profile $x_{u}$ consisting of sparse features $x_{u}^{s}$ (e.g., user id and gender) and dense features $x_{u}^{d}$ (e.g., age), while each item $i$ is associated with an item profile $x_{i}$ consisting of sparse features $x_{i}^{s}$ (e.g., item id and brand) and dense features $x_{i}^{d}$ (e.g., price).
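To make the notation concrete, below is a minimal Python sketch of how one list interaction record $(u,\mathcal{C},\mathcal{V},\mathcal{Y})$ could be organized; the class name, field names and values are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ListRecord:
    """One record (u, C, V, Y); hypothetical field names for illustration only."""
    user_id: int               # user u
    input_ranking: List[int]   # C: m candidate item ids from matching/ranking
    final_ranking: List[int]   # V: n items actually exhibited (V subset of C, n <= m)
    feedback: List[int]        # Y: y_uv^t in {0, 1} for each item in V

record = ListRecord(
    user_id=42,
    input_ranking=[101, 205, 377, 412, 560, 731],   # m = 6
    final_ranking=[205, 101, 560, 412],             # n = 4
    feedback=[1, 0, 0, 1],                          # clicks on the 1st and 4th exhibited items
)
assert set(record.final_ranking) <= set(record.input_ranking)
assert len(record.feedback) == len(record.final_ranking)
```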

Given the above definitions, we now formulate the reranking task to be addressed in this paper:

Definition 1.

Task Description. Given a certain user $u$ and his/her input ranking list $\mathcal{C}$, the goal is to learn a reranking strategy $\pi:\mathcal{C}\rightarrow\mathcal{O}$, which selects and rearranges items from $\mathcal{C}$ and subsequently recommends a final ranking list $\mathcal{O}$ that is of the most interest to $u$.

Many efforts have been made for the reranking task, yet most of these works ignore the contextual information in the final ranking list (i.e., $\mathcal{V}$), resulting in sub-optimal performance. In practice, such contextual knowledge is of crucial importance, since users are commonly sensitive to the permutation of items during browsing.

4. The Proposed Framework

Table 1. Key notations.
Notations Description
$\mathcal{U}$, $\mathcal{I}$ the set of users and items, respectively
$\mathcal{C}$, $\mathcal{V}$ the recorded input and final ranking list, respectively
$\mathcal{Y}$ the implicit feedback towards the final ranking list
$m$, $n$ the number of items in the recorded input and final ranking list, respectively
$\pi$ the learned reranking strategy
$\mathcal{O}$ the generated final ranking list
$x_{u}^{s}$, $x_{u}^{d}$ the sparse and dense feature sets of a user, respectively
$x_{i}^{s}$, $x_{i}^{d}$ the sparse and dense feature sets of an item, respectively
$\mathbf{\Theta}^{G}$, $\mathbf{\Theta}^{E}$ the parameters of the generator and evaluator, respectively

In this section, we introduce our proposed framework GRN, which aims to achieve context-wise item arrangements in the reranking stage of recommender systems. Overall, GRN is composed of three parts: (1) To predict the contextual interaction probability of each item in the labeled final ranking list $\mathcal{V}$, we propose the local context-wise model named the evaluator, where Bi-LSTM and the self-attention mechanism are applied to capture the sequential information and mutual influence between items; (2) To generate the final ranking list $\mathcal{O}$ from the input ranking list $\mathcal{C}$, we elaborate on the model design of the generator. Specifically, we equip the generator with a pointer network, GRU and an attention mechanism, so that it is able to select the most suitable item from $\mathcal{C}$ and capture adjacent information; (3) To converge the generator to a preeminent reranking strategy under the guidance of the evaluator, we pay special attention to designing the training procedure. First, we adopt cross-entropy loss to train the evaluator with labeled list records. Afterwards, policy gradient with the proposed advantage reward is applied to train the generator. The key notations used throughout the article are summarized in Table 1.

First of all, we begin with the representations of users and items, which are the basic inputs of our proposed model. Following previous works (Covington et al., 2016; Zhou et al., 2018), we parameterize the available profiles into vector representations for users and items (note that these parameters are not shared between the evaluator and generator). Given a user $u$, associated with sparse features $x_{u}^{s}$ and dense features $x_{u}^{d}$, we embed each sparse feature value into a $d$-dimensional space, while dense features are handled with standardization or batch normalization to ensure a normal distribution. Subsequently, each user $u$ can be represented as $\mathbf{x}_{u}\in\mathbb{R}^{|x^{s}_{u}|\times d+|x^{d}_{u}|}$, where $|x^{s}_{u}|$ and $|x^{d}_{u}|$ denote the sizes of the sparse and dense feature spaces of user $u$, respectively. Similarly, we represent each item $i$ as $\mathbf{x}_{i}\in\mathbb{R}^{|x^{s}_{i}|\times d+|x^{d}_{i}|}$. Naturally, we represent the input ranking list $\mathcal{C}$ as $\mathbf{C}=[\mathbf{x}_{c}^{1},...,\mathbf{x}_{c}^{m}]$ and the final ranking list $\mathcal{V}$ as $\mathbf{V}=[\mathbf{x}_{v}^{1},...,\mathbf{x}_{v}^{n}]$, where $m$ and $n$ are the numbers of items in the input ranking list and final ranking list, respectively.
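As a concrete illustration of this representation step, here is a minimal NumPy sketch that concatenates looked-up sparse embeddings with standardized dense features; the table sizes, feature values and the helper name `represent` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

d = 8  # embedding size per sparse feature, as in the paper's experimental setup

def represent(sparse_ids, dense_feats, embedding_tables):
    """Build x_u (or x_i): concatenate looked-up sparse embeddings with dense features.
    `sparse_ids` holds one value per sparse feature; `embedding_tables` holds one table each."""
    emb = [embedding_tables[f][v] for f, v in enumerate(sparse_ids)]
    return np.concatenate(emb + [np.asarray(dense_feats, dtype=np.float64)])

# Toy tables for two sparse features (e.g., user id and gender); sizes are illustrative.
rng = np.random.default_rng(0)
tables = [rng.normal(size=(1000, d)), rng.normal(size=(10, d))]
x_u = represent(sparse_ids=[17, 3], dense_feats=[0.42], embedding_tables=tables)
print(x_u.shape)  # (2 * 8 + 1,) = (17,)
```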

In the ensuing sections, we shall zoom into the proposed evaluator and generator for GRN.

Figure 2. The overall architecture of the evaluator in GRN.

4.1. Evaluator

As motivated, from the context-wise view, the underlying reasons driving a user to interact with an item in a list may depend on the following two aspects: (1) the two-way evolution of the user's intent during browsing; (2) the mutual influence between items in the item list. Thus, taking the above factors into consideration, our evaluator $E(\mathbf{x}_{v}^{t}|u,\mathcal{V};\mathbf{\Theta}^{E})$, parameterized by $\mathbf{\Theta}^{E}$, estimates the contextual interaction probability between user $u$ and the $t$-th item $\mathbf{x}_{v}^{t}$ in the final ranking list $\mathcal{V}$. Specifically, our evaluator aims to capture the following two aspects of information implied in the sequence.

  • Intent evolution. Naturally, user intent keeps bi-directionally evolving when browsing items sequentially, and thus it is imperative to capture such dynamics of intent for modeling user preference. Here, we adopt the commonly used Bi-LSTM (Hochreiter and Schmidhuber, 1997) to model the sequence and capture long-term and short-term preferences. Mathematically, the forward output state for the $t$-th item $x_{v}^{t}$ can be calculated as follows:

    (1) \begin{split}
    \textbf{i}_{t}&=\sigma(\textbf{W}_{xi}\mathbf{x}_{v}^{t}+\textbf{W}_{hi}\textbf{h}_{t-1}+\textbf{W}_{ci}\textbf{c}_{t-1}+\textbf{b}_{i})\\
    \textbf{f}_{t}&=\sigma(\textbf{W}_{xf}\mathbf{x}_{v}^{t}+\textbf{W}_{hf}\textbf{h}_{t-1}+\textbf{W}_{cf}\textbf{c}_{t-1}+\textbf{b}_{f})\\
    \textbf{c}_{t}&=\textbf{f}_{t}\textbf{c}_{t-1}+\textbf{i}_{t}\tanh(\textbf{W}_{xc}\mathbf{x}_{v}^{t}+\textbf{W}_{hc}\textbf{h}_{t-1}+\textbf{b}_{c})\\
    \textbf{o}_{t}&=\sigma(\textbf{W}_{xo}\mathbf{x}_{v}^{t}+\textbf{W}_{ho}\textbf{h}_{t-1}+\textbf{W}_{co}\textbf{c}_{t}+\textbf{b}_{o})\\
    \overrightarrow{\textbf{h}_{t}}&=\textbf{o}_{t}\tanh(\textbf{c}_{t})
    \end{split}

    where $\sigma(\cdot)$ is the logistic function, and $\textbf{i}$, $\textbf{f}$, $\textbf{o}$ and $\textbf{c}$ are the input gate, forget gate, output gate and cell vectors, which have the same size as $\mathbf{x}_{v}^{t}$. Shapes of the weight matrices are indicated by their subscripts. Similarly, we can get the backward output state $\overleftarrow{\textbf{h}_{t}}$. Then we concatenate the two output states $\textbf{h}_{t}=\overrightarrow{\textbf{h}_{t}}\oplus\overleftarrow{\textbf{h}_{t}}$ and denote $\textbf{h}_{t}$ as the sequential representation of $x_{v}^{t}$.

  • Mutual influence. That is, any two items can affect each other, even if they are far apart in the final ranking list. To achieve this goal, we apply the recently emerging self-attention mechanism (Vaswani et al., 2017) to directly model the mutual influence between any two items regardless of the distance between them. Formally, we implement it as follows:

    (2) \begin{split}\mathbf{A}=\mbox{softmax}\left(\frac{\mathbf{V}\mathbf{V}^{T}}{\sqrt{d_{k}}}\right)\mathbf{V}\end{split}

    where $d_{k}$ is the dimension of each item representation in $\mathcal{V}$. Obviously, we can easily extend it to multi-head attention for a more stable training process. We denote $\mathbf{a}_{t}$, the $t$-th row of $\mathbf{A}$, as the mutual representation of $x_{v}^{t}$ (a small numerical sketch of this computation follows the list).
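Below is a minimal NumPy sketch of the self-attention of Eq. 2 over the final ranking list: a single head without learned query/key/value projections, as in the formula above; the array sizes are illustrative.

```python
import numpy as np

def self_attention(V):
    """Scaled dot-product self-attention of Eq. 2: A = softmax(V V^T / sqrt(d_k)) V."""
    d_k = V.shape[1]
    scores = V @ V.T / np.sqrt(d_k)                  # (n, n) pairwise influence scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V                               # A: one mutual representation per item

V = np.random.default_rng(1).normal(size=(4, 8))     # n = 4 items, d_k = 8 (toy values)
A = self_attention(V)
a_t = A[2]                                           # mutual representation of the 3rd item
```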

Due to its powerful ability in modeling complex feature interactions, demonstrated in the CTR prediction field Zhou et al. (2018); Feng et al. (2019); Pi et al. (2019); Zhou et al. (2019), we integrate a multi-layer perceptron (MLP) into the evaluator for better feature interaction. Hence, we formalize our evaluator as follows:

(3) \begin{split}E(\mathbf{x}_{v}^{t}|u,\mathcal{V};\mathbf{\Theta}^{E})=\sigma\bigl(f(f(f(\mathbf{x}_{u}\oplus\mathbf{x}_{v}^{t}\oplus\textbf{h}_{t}\oplus\mathbf{a}_{t})))\bigr)\end{split}

where $f(\textbf{x})=\mathrm{ReLU}(\textbf{W}\textbf{x}+\textbf{b})$ and $\sigma(\cdot)$ is the logistic function. The parameter set of the evaluator is thus $\mathbf{\Theta}^{E}=\{\mathbf{W}_{*},\mathbf{b}_{*}\}$, i.e., the union of the parameters of the Bi-LSTM and the MLP.

Clearly, our evaluator can be optimized via binary cross-entropy loss function, which is defined as follows:

(4) \begin{split}\mathcal{L}^{E}=-\frac{1}{N}\sum_{(u,\mathcal{V})\in\mathcal{R}}\sum_{x_{v}^{t}\in\mathcal{V}}\left(y_{uv}^{t}\,\log\hat{y}^{t}_{uv}+(1-y_{uv}^{t})\,\log(1-\hat{y}_{uv}^{t})\right)\end{split}

where $N$ is the number of training samples in $\mathcal{R}$. For convenience, we write $\hat{y}_{uv}^{t}=E(\mathbf{x}_{v}^{t}|u,\mathcal{V};\mathbf{\Theta}^{E})$, and $y_{uv}^{t}\in\{0,1\}$ is the ground truth. We optimize the parameters $\mathbf{\Theta}^{E}$ by minimizing $\mathcal{L}^{E}$.
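To make Eqs. 3 and 4 concrete, the sketch below implements the three-layer MLP scoring and the binary cross-entropy loss in NumPy. The layer shapes, the stand-in vectors `h_t` and `a_t` (which would come from the Bi-LSTM and self-attention above), and the helper names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def mlp3(x, weights, biases):
    """Three ReLU layers f(f(f(x))) as used in Eq. 3."""
    for W, b in zip(weights, biases):
        x = np.maximum(W @ x + b, 0.0)
    return x

def evaluator_score(x_u, x_v_t, h_t, a_t, weights, biases, w_out, b_out):
    """Eq. 3: sigmoid over an MLP of the concatenated user, item, sequential
    (Bi-LSTM) and mutual (self-attention) representations."""
    z = np.concatenate([x_u, x_v_t, h_t, a_t])
    logit = w_out @ mlp3(z, weights, biases) + b_out
    return 1.0 / (1.0 + np.exp(-logit))

def bce_loss(y_true, y_hat, eps=1e-7):
    """Binary cross-entropy of Eq. 4, averaged over all (user, position) samples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# Toy shapes: x_u, x_v, a_t of size 8, h_t of size 16; hidden sizes follow [128, 64, 32].
rng = np.random.default_rng(2)
dims = [40, 128, 64, 32]
Ws = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(3)]
bs = [np.zeros(dims[i + 1]) for i in range(3)]
x_u, x_v, h_t, a_t = (rng.normal(size=s) for s in (8, 8, 16, 8))
p = evaluator_score(x_u, x_v, h_t, a_t, Ws, bs, rng.normal(scale=0.1, size=32), 0.0)
loss = bce_loss(np.array([1.0]), np.array([p]))
```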

Figure 3. The overall architecture of the generator in GRN.

4.2. Generator

The goal of our generator $G(u,\mathcal{C};\mathbf{\Theta}^{G})$, parameterized by $\mathbf{\Theta}^{G}$, is to learn a reranking strategy that selects, from the input ranking list $\mathcal{C}$, the items that are of the most interest to user $u$. In the following, we describe how to select the appropriate item at each step by considering the user's dynamic state during browsing, and then how to generate the final ranking list and train the generator.

Evolving Layer. As motivated, the user's intent or interest changes over time, which drives us to learn a dynamic representation of the target user that adapts to the evolution of the browsing history. For this purpose, we distill the sequential information into a state representation at each step for the target user $u$, which is achieved by a time-efficient GRU module. Formally, at the $t$-th step, let $\mathcal{S}=[\mathbf{x}_{s}^{0},\mathbf{x}_{s}^{1},...,\mathbf{x}_{s}^{t-1}]$ contain the $t-1$ items already selected from $\mathcal{C}$, where $\mathbf{x}_{s}^{0}$ is a trainable vector with the same embedding size as the item representations and is used at the first step. The output of the GRU module can be calculated as follows:

(5) \begin{split}
\textbf{z}_{t}&=\sigma(\textbf{W}_{z}\mathbf{x}_{s}^{t-1}+\textbf{U}_{z}\textbf{h}_{t-1})\\
\textbf{r}_{t}&=\sigma(\textbf{W}_{r}\mathbf{x}_{s}^{t-1}+\textbf{U}_{r}\textbf{h}_{t-1})\\
\widetilde{\textbf{h}}_{t}&=\tanh(\textbf{W}\mathbf{x}_{s}^{t-1}+\textbf{U}(\textbf{r}_{t}\textbf{h}_{t-1}))\\
\textbf{h}_{t}&=(1-\textbf{z}_{t})\textbf{h}_{t-1}+\textbf{z}_{t}\widetilde{\textbf{h}}_{t}
\end{split}

where $\textbf{W}_{z},\textbf{U}_{z},\textbf{W}_{r},\textbf{U}_{r},\textbf{W}$ and $\textbf{U}$ are trainable matrices. We denote $\mathbf{H}=[\mathbf{h}_{1},...,\mathbf{h}_{t}]$ as the sequential encoding of the selected list $\mathcal{S}$.
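For reference, a minimal NumPy sketch of one GRU update of Eq. 5 applied step by step to a toy selected list; the dimensions and the parameter initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_prev, h_prev, params):
    """One GRU update of the evolving layer (Eq. 5) over the previously selected item."""
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_prev + Uz @ h_prev)            # update gate z_t
    r = sigmoid(Wr @ x_prev + Ur @ h_prev)            # reset gate r_t
    h_tilde = np.tanh(W @ x_prev + U @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde             # new state h_t

rng = np.random.default_rng(3)
d_in, d_h = 8, 16
params = tuple(rng.normal(scale=0.1, size=(d_h, d_in if i % 2 == 0 else d_h)) for i in range(6))
selected = [rng.normal(size=d_in) for _ in range(3)]  # x_s^0 .. x_s^2, already chosen items
h, H = np.zeros(d_h), []
for x in selected:                                    # encode the selected list step by step
    h = gru_step(x, h, params)
    H.append(h)                                       # H = [h_1, ..., h_t]
```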

Activating Layer. Typically, the selected items have different influences on the remaining items in the input ranking list. It is a common phenomenon that a user may get tired of the clothing category after browsing plenty of different clothes; in this case, recommending other categories matching his/her interests could be a better choice. Inspired by the vanilla attention mechanism (Bahdanau et al., 2015), we learn attention weights over the remaining items in the input ranking list conditioned on the sequential representation of the selected items. Given the representation of the $j$-th remaining item $\mathbf{x}_{c}^{j}$ and the sequential representations $\mathbf{H}$ of the selected list, we adopt a weighted sum with the corresponding attention weights to generate the new representation of the $j$-th item in the input ranking list as follows:

(6) \begin{split}
a^{i}_{h}&=\frac{\exp(\mathbf{h}_{i}\mathbf{W}\mathbf{x}_{c}^{j})}{\sum_{i'=1}^{t}\exp(\mathbf{h}_{i'}\mathbf{W}\mathbf{x}_{c}^{j})}\\
\mathbf{a}^{j}_{c}&=\sum_{i=1}^{t}a^{i}_{h}\mathbf{h}_{i}
\end{split}
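A minimal NumPy sketch of the activating-layer attention in Eq. 6 follows; the bilinear weight shape and the toy inputs are assumptions for illustration.

```python
import numpy as np

def activate(H, x_c_j, W):
    """Eq. 6: weight the hidden states H (t x d_h) of the selected list by their
    affinity to a remaining candidate x_c^j and return the weighted sum a_c^j."""
    logits = H @ (W @ x_c_j)                      # h_i W x_c^j for every selected step i
    logits -= logits.max()                        # numerical stability for the softmax
    a = np.exp(logits) / np.exp(logits).sum()     # attention weights a_h^i
    return a @ H                                  # candidate-specific context vector

rng = np.random.default_rng(4)
H = rng.normal(size=(3, 16))                      # hidden states of 3 already-selected items
W = rng.normal(scale=0.1, size=(16, 8))           # bilinear weight, shape (d_h, d_item)
x_c_j = rng.normal(size=8)                        # one remaining candidate in the input list
a_c_j = activate(H, x_c_j, W)                     # shape (16,)
```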

Selector Layer. After obtaining the representations $\mathbf{a}^{j}_{c}$ for the items in the input ranking list, we now aim to select the most suitable item, i.e., the one that best matches the potential interest of the target user at the $t$-th step. Here, we choose the pointer network (Vinyals et al., 2015) to achieve this goal. Formally, for the $j$-th item $\mathbf{x}_{c}^{j}$ in the input ranking list $\mathbf{C}$, we first apply an MLP for better feature interaction at the $t$-th step:

(7) \begin{split}\widetilde{s}_{c}^{j}=f(f(f(\mathbf{x}_{u}\oplus\mathbf{x}_{c}^{j}\oplus\mathbf{a}^{j}_{c})))\end{split}

Afterwards, we apply the softmax function to normalize the scores $\widetilde{s}_{c}^{j}$ over all candidates as follows:

(8) \begin{split}s_{c}^{j}=\frac{\exp(\widetilde{s}_{c}^{j})}{\sum_{j'=1}^{m}\exp(\widetilde{s}_{c}^{j'})}\end{split}

where $m$ is the number of items in $\mathbf{C}$. Then, at the $t$-th step, our generator selects the item with the highest selection probability among the items not yet selected, which can be formalized as follows:

(9) G^{t}(u,\mathcal{C};\mathbf{\Theta}^{G})=x^{\mathop{\arg\max}_{j}{s^{j}_{c}}}_{c}\ \ \ \ \mathrm{s.t.}\ \ x^{\mathop{\arg\max}_{j}{s^{j}_{c}}}_{c}\notin\mathcal{S}

The parameter set of the generator is $\mathbf{\Theta}^{G}=\{\mathbf{U}_{*},\mathbf{W}_{*},\mathbf{b}_{*}\}$, i.e., the union of the parameters of the GRU module and the pointer network.

List Generation. The above procedure is repeated until the generated final ranking list reaches the pre-defined length (i.e., $n$). After computing $G^{t}(u,\mathcal{C};\mathbf{\Theta}^{G})$ for $n$ steps, we obtain the reranking strategy $\pi=[G^{1}(u,\mathcal{C};\mathbf{\Theta}^{G}),...,G^{n}(u,\mathcal{C};\mathbf{\Theta}^{G})]$ and, correspondingly, the generated final ranking list $\mathcal{O}=[x_{o}^{1},...,x_{o}^{n}]$. A small end-to-end sketch of this generation loop is given below.
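The sketch below illustrates the masked softmax-and-argmax selection of Eqs. 8-9 together with the step-by-step generation loop; the stand-in scorer replaces the MLP of Eq. 7, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def masked_argmax_step(scores, selected_mask):
    """Eqs. 8-9: softmax over candidate scores, then pick the highest-probability
    item that has not been selected yet."""
    logits = np.where(selected_mask, -np.inf, scores)
    probs = np.exp(logits - logits[~selected_mask].max())
    probs /= probs.sum()                              # s_c^j over the remaining items
    return int(np.argmax(probs)), probs

def generate_list(C_feats, n, score_fn):
    """Generate the final list O of length n from the input list C; `score_fn`
    stands in for the MLP of Eq. 7 conditioned on the items already selected."""
    mask = np.zeros(C_feats.shape[0], dtype=bool)
    order = []
    for _ in range(n):
        scores = score_fn(C_feats, order)             # one scalar score per candidate
        j, _ = masked_argmax_step(scores, mask)
        order.append(j)
        mask[j] = True
    return order

rng = np.random.default_rng(5)
C_feats = rng.normal(size=(10, 8))                    # m = 10 candidates (toy features)
w = rng.normal(size=8)
toy_score = lambda C, selected: C @ w                 # illustrative stand-in scorer
print(generate_list(C_feats, n=4, score_fn=toy_score))  # four distinct candidate indices
```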

Advantage Reward. As illustrated earlier, we utilize the evaluator to guide the optimization of the generator towards a context-wise reranking strategy through policy gradient. At each step, the generator selects the most suitable item from $\mathcal{C}$ according to the probability distribution in Eq. 8. Besides, we propose the advantage reward to estimate the actual reward of each item in the final ranking list more comprehensively. Specifically, the advantage reward consists of two parts:

  • Self reward: As motivated, the interaction probability of each item in the final ranking list is heavily affected by its context. Here we use the contextual score predicted by the evaluator as the item's own reward in the list, which is calculated as follows:

    (10) \begin{split}r^{self}(x_{o}^{t}|u,\mathcal{O})=\mbox{E}(x_{o}^{t}|u,\mathcal{O};\mathbf{\Theta}^{E})\end{split}
  • Differential reward: Besides the self reward, each item also changes the interaction probabilities of the other items in the list. For example, though selecting an unattractive item is generally not a good idea, it is appreciated if it can increase the interaction probabilities of other items. Hence, we propose the differential reward, calculated with the assistance of the evaluator as follows:

    (11) \begin{split}r^{diff}(x_{o}^{t}|u,\mathcal{O})=&\sum_{x_{o}^{i}\in\mathcal{O}^{-}}\mbox{E}(x_{o}^{i}|u,\mathcal{O};\mathbf{\Theta}^{E})-\\ &\sum_{x_{o}^{i}\in\mathcal{O}^{-}}\mbox{E}(x_{o}^{i}|u,\mathcal{O}^{-};\mathbf{\Theta}^{E})\end{split}

    where $\mathcal{O}^{-}$ is the item list obtained by removing $x_{o}^{t}$ from $\mathcal{O}$.

By combining the above two rewards, we define the advantage reward of the $t$-th item $x_{o}^{t}$ in $\mathcal{O}$ as follows:

(12) \begin{split}
r(x_{o}^{t}|u,\mathcal{O})&=r^{self}(x_{o}^{t}|u,\mathcal{O})+r^{diff}(x_{o}^{t}|u,\mathcal{O})\\
&=\Bigl(\mbox{E}(x_{o}^{t}|u,\mathcal{O};\mathbf{\Theta}^{E})+\sum_{x_{o}^{i}\in\mathcal{O}^{-}}\mbox{E}(x_{o}^{i}|u,\mathcal{O};\mathbf{\Theta}^{E})\Bigr)-\sum_{x_{o}^{i}\in\mathcal{O}^{-}}\mbox{E}(x_{o}^{i}|u,\mathcal{O}^{-};\mathbf{\Theta}^{E})\\
&=\sum_{x_{o}^{i}\in\mathcal{O}}\mbox{E}(x_{o}^{i}|u,\mathcal{O};\mathbf{\Theta}^{E})-\sum_{x_{o}^{i}\in\mathcal{O}^{-}}\mbox{E}(x_{o}^{i}|u,\mathcal{O}^{-};\mathbf{\Theta}^{E})
\end{split}
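To make Eq. 12 concrete, here is a minimal NumPy sketch that computes the advantage reward with a stand-in evaluator; `toy_eval` and the scalar stand-in "items" are purely illustrative assumptions.

```python
import numpy as np

def advantage_reward(O, t, evaluator):
    """Advantage reward of Eq. 12 for the t-th item of the generated list O:
    total evaluator score of O minus the total score of O with that item removed.
    `evaluator(item, context_list)` stands in for E(x | u, O; Theta^E)."""
    O_minus = O[:t] + O[t + 1:]
    full = sum(evaluator(x, O) for x in O)
    without = sum(evaluator(x, O_minus) for x in O_minus)
    return full - without

# Illustrative evaluator: the score depends on the item and (weakly) on the list length,
# just so that removing one item changes the other items' scores.
toy_eval = lambda x, ctx: 1.0 / (1.0 + np.exp(-(0.1 * x - 0.05 * len(ctx))))

O = [3.0, -1.0, 0.5, 2.0]                             # stand-in "items" (scalars for simplicity)
rewards = [advantage_reward(O, t, toy_eval) for t in range(len(O))]
```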

Finally, the loss function of the generator is defined as follows:

(13) \begin{split}\mathcal{L}^{G}=-\frac{1}{N}\sum_{(u,\mathcal{C})\in\mathcal{R}}\sum_{x_{o}^{t}\in\mathcal{O}}r(x_{o}^{t}|u,\mathcal{O})\log s_{c}^{j}.\end{split}

Here $s_{c}^{j}$ is the selection probability from Eq. 8 of the item $x_{o}^{t}$ chosen from $\mathcal{C}$ at the $t$-th step (i.e., $j$ indexes $x_{o}^{t}$ in $\mathcal{C}$), and $N$ is the number of training samples. We optimize the parameters $\mathbf{\Theta}^{G}$ by minimizing $\mathcal{L}^{G}$.
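For illustration, a small NumPy sketch of the REINFORCE-style loss of Eq. 13 applied to a single toy list; the reward and probability values are made up, and `num_lists` plays the role of $N$.

```python
import numpy as np

def generator_loss(rewards, select_probs, num_lists=1, eps=1e-12):
    """Eq. 13: negative reward-weighted sum of log selection probabilities,
    averaged over the number of generated lists."""
    rewards = np.asarray(rewards, dtype=np.float64)   # r(x_o^t | u, O), one per position
    log_p = np.log(np.asarray(select_probs) + eps)    # log s_c^j of each chosen item
    return -np.sum(rewards * log_p) / num_lists

# One toy list of length 4: rewards from the evaluator, probabilities from Eq. 8.
loss = generator_loss(rewards=[0.8, 0.2, 0.5, 0.1], select_probs=[0.4, 0.3, 0.2, 0.1])
```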

Figure 4. The illustration of the training procedure for GRN.
Algorithm 1 Training procedure for GRN
1: List interaction records $\mathcal{R}=\{(u,\mathcal{C},\mathcal{V},\mathcal{Y})\}$; Evaluator $E(\mathbf{x}_{v}^{t}|u,\mathcal{V};\mathbf{\Theta}^{E})$; Generator $G(u,\mathcal{C};\mathbf{\Theta}^{G})$
2: Converged parameters $\mathbf{\Theta}^{E}$ and $\mathbf{\Theta}^{G}$;
3: // Evaluator training
4: while $\mathbf{\Theta}^{E}$ not converged do
5:     Calculate the prediction scores of the evaluator by Eq. 1 $\sim$ 3;
6:     Optimize $\mathbf{\Theta}^{E}$ with the loss in Eq. 4;
7: end while
8: // Generator training
9: while $\mathbf{\Theta}^{G}$ not converged do
10:     Initialize an empty final ranking list $\mathcal{O}$;
11:     for $t=1,2,...,n$ do
12:         Generate the $t$-th item $x_{o}^{t}$ from $\mathcal{C}$ by Eq. 5 $\sim$ 9;
13:         $\mathcal{O}\leftarrow\mathcal{O}\cup\{x_{o}^{t}\}$;
14:     end for
15:     Calculate the advantage reward of each item in $\mathcal{O}$ by Eq. 10 $\sim$ 12;
16:     Optimize $\mathbf{\Theta}^{G}$ with the loss in Eq. 13;
17: end while

4.3. Training Procedure

We adopt a two-stage optimization strategy to train GRN, as illustrated in Fig. 4. We first utilize the list interaction records to optimize $\mathbf{\Theta}^{E}$ and improve the evaluator until convergence ((1-1) in Fig. 4). Next, we fix $\mathbf{\Theta}^{E}$ and optimize $\mathbf{\Theta}^{G}$ so as to produce a final ranking list that better meets user demands. Specifically, the generator produces the final ranking list for the evaluator ((2-1) in Fig. 4), and the parameters $\mathbf{\Theta}^{G}$ are then updated via policy gradient with the advantage reward based on the evaluator ((2-2) in Fig. 4). The above process is repeated until the generator converges. Overall, the training procedure for GRN is outlined in Algorithm 1.

5. Experiments

In this section, we perform a series of experiments on two real-world datasets, with the aims of answering the following research questions:

  • RQ1: How does GRN predict the interaction probability more precisely and converge to a better reranking strategy compared with SOTA models on the reranking task?

  • RQ2: How do the well-designed components of GRN (e.g., Bi-LSTM, self-attention, etc.) influence the performance of GRN?

  • RQ3: How does the generator of GRN provide a better reranking strategy intuitively?

  • RQ4: How does the generator of GRN perform in a real-world recommendation scenario after efficient deployment?

Table 2. Statistics of datasets
Description Rec Ad
#Users 2.16 $\times 10^{8}$ 1.06 $\times 10^{6}$
#Items 4.36 $\times 10^{7}$ 8.27 $\times 10^{5}$
#Records 3.01 $\times 10^{9}$ 2.04 $\times 10^{6}$
#User-item interactions 7.07 $\times 10^{8}$ 2.04 $\times 10^{7}$
#Avg size of $\mathcal{C}_{u}$ 83.86 100.0
#Avg size of $\mathcal{V}_{u}$ 4.26 10.0

5.1. Experimental Setup

5.1.1. Datasets

We conduct extensive experiments on two real-world datasets: a proprietary dataset from Taobao application and a public dataset from Alimama, which are introduced as follows:

  • Rec dataset (https://www.taobao.com) consists of large-scale list interaction logs collected from the Taobao application, one of the most popular e-commerce platforms in China. Besides, the Rec dataset contains user profiles (e.g., id, age and gender), item profiles (e.g., id, category and brand), the input ranking list provided by the previous stages for each user and the labeled final ranking list.

  • Ad dataset (https://tianchi.aliyun.com/dataset/dataDetail?dataId=56) records interactions between users and advertisements and contains user profiles (e.g., id, age and occupation) and item profiles (e.g., id, campaign and brand). According to the timestamp of the user browsing the advertisement, we transform the records of each user and slice them every 10 items into list records. Moreover, we mix an additional 90 items with the final ranking list as the input ranking list to make it more suitable for reranking.

The detailed descriptions of the two datasets are shown in Table 2. For both datasets, we randomly split the entire records into training and test sets, i.e., we utilize 90% of the records to predict the remaining 10% of interactions (we hold out 10% of the training data as a validation set for parameter tuning).

5.1.2. Baselines

We select two kinds of representative methods as baselines: point-wise and list-wise methods, which are widely adopted in most recommender systems. Point-wise methods (i.e., DNN and DeepFM) mainly predict the interaction probability for a given user-item pair by utilizing raw features derived from the user and item profiles. List-wise methods (i.e., MIDNN, DLCM, PRM and Seq2Slate) are devoted to extracting list-wise information based on different well-designed principles. The comparison methods are described in detail below:

  • DNN (Covington et al., 2016) is a standard deep learning method in the industrial recommender system, which applies MLP for complex feature interaction.

  • DeepFM (Guo et al., 2017) is a general deep model for recommendation, which combines a factorization machine component and a deep neural network component.

  • MIDNN (Zhuang et al., 2018) extracts list-wise information of the input ranking list with complex handmade feature engineering.

  • DLCM (Ai et al., 2018) first applies a GRU to encode the input ranking list into a local context vector, and then combines it with each item's feature vector to learn a powerful scoring function for list-wise reranking.

  • PRM (Pei et al., 2019) applies the self-attention mechanism to explicitly capture the mutual influence between items in the input ranking list.

  • Seq2Slate (Bello et al., 2018) applies an RNN and a pointer network to encode the previously selected items and select the most appropriate item at each step.

  • GRN is the novel context-wise framework proposed in this paper, which consists of the evaluator (i.e., GRNE) for predicting interaction probabilities and the generator (i.e., GRNG) for generating reranking results.

It is worthwhile to note that pair-wise and group-wise methods are not selected as baselines in our experiments due to their high training or inference complexity (at least $\mathcal{O}(N^{2})$) compared with point-wise ($\mathcal{O}(1)$) or list-wise and context-wise ($\mathcal{O}(N)$) models. Moreover, these methods usually achieve relatively worse performance, as reported in many previous studies Pei et al. (2019); Ai et al. (2018); Burges ([n. d.]).

5.1.3. Evaluation Metrics

To answer RQ1, we adopt different criteria to evaluate model performance in the following two aspects: (1) To evaluate the accuracy of model predictions for GRNE and the baselines, Loss (cross-entropy), AUC (area under the ROC curve) and GAUC (Zhou et al., 2018) (average intra-user AUC) are applied. (2) To evaluate the reranking results of GRNG and the baselines, apart from the NDCG (normalized discounted cumulative gain) metric, we design the LR (list reward) metric to evaluate the overall context-wise profit of the whole list as simulated by the evaluator (i.e., GRNE), which is calculated as follows:

(14) \begin{split}LR(\mathcal{O})=\sum_{x_{o}^{i}\in\mathcal{O}}\mbox{E}(x_{o}^{i}|u,\mathcal{O}).\end{split}

Compared with NDCG, we argue that LR is a more fine-grained metric for the evaluation of reranking, since it fully accounts for the intra-correlations within the top-$k$ results and evaluates the overall profit in a more precise manner. Moreover, the generalization capability of the LR metric is stronger since it can rate items that users have not browsed yet.
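As a reference, a tiny NumPy sketch of the LR metric of Eq. 14 with a stand-in evaluator; the values are illustrative only.

```python
import numpy as np

def list_reward(O, evaluator):
    """LR of Eq. 14: the evaluator's contextual scores summed over the list O;
    `evaluator(item, O)` stands in for E(x | u, O)."""
    return float(sum(evaluator(x, O) for x in O))

toy_eval = lambda x, ctx: 1.0 / (1.0 + np.exp(-0.1 * x))   # illustrative stand-in
print(list_reward([3.0, -1.0, 0.5, 2.0, 1.2], toy_eval))   # LR@5 for one toy list
```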

For online A/B testing in RQ4, we choose the PV and IPV metrics, which are widely adopted in industrial recommender systems for evaluating online performance. Specifically, PV and IPV are defined as the total number of items that users browsed and clicked, respectively.

5.1.4. Implementation

We implement all models in TensorFlow 1.4. For a fair comparison, pre-training, batch normalization and regularization are not adopted in our experiments. We employ random uniform initialization for model parameters and adopt Adam as the optimizer with a learning rate of 0.001. Moreover, the embedding size of each feature is set to 8 and the architecture of the MLP is set to [128, 64, 32]. We run each model three times and report the mean of the results.

5.1.5. Significance Test

For experimental results in Tables 3, 4, 5 and 6, we use “*” to indicate that the best method is significantly different from the runner-up method based on paired t-tests at the significance level of 0.01.

Table 3. Overall performance comparison w.r.t. interaction probability prediction (bold: best; underline: runner-up).
Model Rec Ad
Loss AUC GAUC Loss AUC GAUC
DNN 0.158 0.589 0.931 0.187 0.587 0.848
DeepFM 0.152 0.599 0.933 0.186 0.588 0.848
MIDNN 0.143 0.610 0.936 0.185 0.600 0.848
DLCM 0.138 0.616 0.938 0.185 0.602 0.849
PRM 0.121 0.630 0.942 0.184 0.605 0.850
Seq2Slate - - - - - -
GRNE 0.095 0.693 0.960 0.182 0.610 0.856
  • † Note that Seq2Slate is designed to generate the final ranking list directly and cannot be used to give predictions on the recorded final ranking list.

Table 4. Overall performance comparison w.r.t. the reranking strategy (bold: best; underline: runner-up).
Model Rec Ad
NDCG@5 LR@5 NDCG@5 LR@5
DNN 0.042 0.164 0.109 0.199
DeepFM 0.043 0.169 0.111 0.203
MIDNN 0.050 0.175 0.117 0.212
DLCM 0.052 0.182 0.119 0.228
PRM 0.054 0.186 0.120 0.232
Seq2Slate 0.050 0.192 0.116 0.235
GRNG 0.062 0.203 0.123 0.240

5.2. Performance Comparison (RQ1)

We report the comparison results of the generator and evaluator modules in GRN (i.e., GRNG and GRNE) and other baselines on the two datasets in Table 3 and Table 4. The major findings from the experimental results are summarized as follows:

  • On both datasets, point-wise methods (i.e., DNN and DeepFM) achieve relatively poor performance on the tasks of interaction probability prediction and reranking. This indicates that feature engineering on user and item profiles alone is incapable of covering the list-wise information contained in the input ranking list. Generally, list-wise methods achieve remarkable improvements in most cases. By considering the mutual influence among items in the input ranking list, these methods learn a refined scoring function aware of the feature distribution of the input ranking list. Among these methods, DLCM and PRM consistently outperform MIDNN in all cases, proving that deep structures (i.e., RNN and self-attention) have superior abilities in extracting the feature distribution compared with handmade feature engineering. Moreover, compared with DLCM, adequate improvements are observed for PRM, which demonstrates the effectiveness of self-attention and personalization in the reranking task.

  • GRNE consistently yields the best performance on the Loss, AUC and GAUC metrics of both datasets. In particular, GRNE improves over the best baseline PRM by 0.073 and 0.022 in AUC, and 0.005 and 0.006 in GAUC, on the Rec and Ad datasets, respectively. By taking full advantage of the contextual information in the final ranking list, GRNE shows an excellent ability to provide more precise contextual interaction probabilities. Different from extracting information from the unordered input ranking list as in DLCM and PRM, with the help of the well-designed Bi-LSTM and self-attention mechanism, GRNE captures the sequential dependency and mutual influence in the final ranking list, which is another essential factor affecting the contextual predictions.

  • GRNG significantly and consistently outperforms state-of-the-art methods by a relatively large margin on both datasets across all metrics. Overall, on the Rec and Ad datasets, GRNG achieves performance gains over the best baseline (i.e., Seq2Slate) of 0.008 and 0.003 for NDCG, and 0.011 and 0.005 for LR, respectively, which evidences that the technically designed model architecture and training procedure of GRNG successfully transform the context-wise ability of GRNE into production. The evolving and activating layers model the information of the selected list, assisting the selector layer in making the most appropriate choice at each step by comparison within the input ranking list. Besides, under the guidance of GRNE, policy gradient with the proposed advantage reward helps distill the reranking knowledge into GRNG comprehensively.

  • Compared with the experimental results on the Rec dataset, the performance lift on the Ad dataset is relatively slight. One possible reason is that the Ad dataset is published with random sampling, resulting in inconsistent and incomplete list records as well as weaker intra-list relevance of user feedback.

5.3. Study of GRN (RQ2)

Table 5. Ablation study of the evaluator (BL: Bi-LSTM; SA: self-attention).
Model Loss AUC GAUC
GRNE( - BL ) 0.120 0.632 0.942
GRNE( - SA ) 0.102 0.684 0.955
GRNE 0.095 0.693 0.960
Table 6. Ablation study of the generator (EL: evolving layer; AL: activating layer; DR: differential reward; SR: self reward).
Model NDCG@5 LR@5
Greedy 0.054 0.184
GRNG( - EL) 0.060 0.192
GRNG( - AL) 0.055 0.185
GRNG( - DR) 0.059 0.195
GRNG( - SR) 0.033 0.099
GRNG 0.062 0.203

In this section, we perform a series of experiments on the Rec dataset to better understand the traits of GRN, including the well-designed components of the evaluator, the generator and the training procedure. It is noteworthy that similar trends can also be observed on the Ad dataset, which are omitted due to the page limitation.

5.3.1. Ablation Study of GRNE

By leveraging the contextual information in the final ranking list, GRNE is designed to predict the context-wise interaction probability more precisely. To examine the effectiveness of each component, we prepare two variants of GRNE:

  • GRNE( - BL): The variant of evaluator, which removes the Bi-LSTM.

  • GRNE( - SA): The variant of evaluator, which removes the self-attention mechanism.

The comparison results of GRNE and its variants are shown in Table 5. Declines in the AUC and GAUC metrics are observed after removing each component, which demonstrates their effectiveness in capturing the context-wise information. Besides, GRNE( - BL) performs worse than GRNE( - SA), which suggests that modeling the two-way sequential information of the final ranking list is more important.

5.3.2. Ablation Study of Generator and Training Procedure

Both the model design and training procedure help achieve the best reranking results. In this section, we carefully review the contributions of each component. The variants of generator and training procedure are listed as follows:

  • Greedy: Directly reranking in descending order of the evaluator's scores.

  • GRNG( - EL): The variant of generator without evolving layer, which ignores the evolution of user intent (i.e., Eq. 5).

  • GRNG( - AL): The variant of generator without activating layer, which ignores the influence between items in the input ranking list (i.e., Eq. 6).

  • GRNG( - DR): The variant of training procedure, which removes the differential reward (i.e., Eq. 11) in the advantage reward.

  • GRNG( - SR): The variant of training procedure, which removes the self reward (i.e., Eq. 10) in the advantage reward.

We summarize the results in Table 6 and have the following observations:

  • Reranking based on the greedy strategy achieves relatively poor performance on both metrics. This indicates that GRNE is capable of capturing context-wise information, but ranking directly by its scores is not a good strategy, since the greedy strategy changes each item's context in the final ranking list.

  • The performance of GRNG( - EL) and GRNG( - AL) proves the importance of modeling the selected item list for a better choice at each step.

  • Intuitively, GRNG( - SR) performs worst due to the lack of each item's own reward in the list. GRNG performs better than GRNG( - DR) by considering the context-wise reward of each item in the final ranking list more comprehensively.

Figure 5. An illustrative example from the Rec dataset, showing the difference between the greedy strategy and our learned reranking strategy.

5.4. Case Study (RQ3)

In order to clearly demonstrate how GRN addresses the limitations of the greedy reranking strategy in previous methods, we conduct a case study on the large-scale industrial Rec dataset. As shown in Fig. 5, the main findings are summarized as follows:

  • The rating scores below the items are estimated by a well-trained point-wise model. Intuitively, according to the score distribution, we can find that the target user has recently been most interested in the clothing category, followed by jewelry and pants.

  • The greedy reranking strategy chooses to place four clothing items in its top-5 results. After thoroughly browsing and engaging with the first clothing item, the user may be tired of the clothing category and eager for other categories he/she is also interested in. In this case, the greedy reranking strategy cannot reach the best recommendation results and may further degrade the user experience.

  • Different from the greedy reranking strategy, the generator selects each item by considering its contextual items. The recommendation results are more pluralistic, offering more chances to capture and activate the user's latent interests. Hence, the generator shows its ability to generate a more contextually effective and attractive recommendation list, making users engage with the RS more.

Figure 6. Online performance on the homepage Guess You Like scenario in the Taobao app. The y-axis denotes the improvement ratio over the deployed baseline.

5.5. Online A/B Testing (RQ4)

To verify the effectiveness of our proposed framework GRN in real-world settings, GRN has been fully deployed on the homepage Guess You Like scenario, the most popular recommendation scenario of the Taobao application (in practice, we only deploy GRNG into production for serving the reranking task). In this waterfall scenario, users can browse and interact with items sequentially and endlessly. Fixed-length recommendation results are displayed when the user reaches the bottom of the current page. Considering the potential issue of high latency for the user experience, deploying GRN into production brings great challenges.

To this end, we make the following improvements for efficient deployment:

  • Instead of the traditional cloud-to-edge framework, we directly deploy GRN on the edge as introduced in EdgeRec (Gong et al., 2020). In this way, the network latency between the cloud server and the user edge is saved: nearly 30 milliseconds, consisting of 10 requests at 3 milliseconds per request, which could otherwise be a bottleneck for deploying a context-wise algorithm.

  • Considering the long total time cost, we develop an advance-trigger and delay-completion mechanism. Specifically, GRN is requested a few positions in advance of the last position of the current page. Besides, GRN returns more results than required at each request to make up for the vacancy of the next request.

After the successful deployment of the proposed GRN, we evaluate the performance from "2020-09-01" to "2020-09-07" and report the comparison with the deployed baseline model PRM in Fig. 6. Not surprisingly, we observe that GRN consistently and significantly outperforms the PRM model on both the PV and IPV metrics, which further verifies the effectiveness of our proposed framework. Overall, GRN achieves a performance improvement over the best baseline PRM of 5.2% for PV and 6.1% for IPV, with an average cost of 32 milliseconds for online inference.

6. Conclusion

In this paper, we highlight the importance of modeling the contextual information in the final ranking list and address how to leverage and transform such information for better recommendation. We propose a novel two-stage context-wise framework, GRN, to learn to rank with contexts. Furthermore, we elaborate on the model design of the evaluator and generator in GRN, which aim to predict the contextual interaction probability more precisely and learn a better context-wise reranking strategy, respectively. Extensive experiments on both industrial and benchmark datasets demonstrate the effectiveness of our framework compared to state-of-the-art point-wise and list-wise methods. Moreover, GRN has also achieved impressive improvements on the PV and IPV metrics in online experiments after successful deployment in one popular recommendation scenario of the Taobao application. In the future, we will investigate how to incorporate the long-term reward for learning a better reranking strategy.

References

  • Ai et al. (2018) Qingyao Ai, Keping Bi, Jiafeng Guo, and W. Bruce Croft. 2018. Learning a Deep Listwise Context Model for Ranking Refinement. In ACM Special Interest Group on Information Retrieval. 135–144.
  • Ai et al. (2019) Qingyao Ai, Xuanhui Wang, Sebastian Bruch, Nadav Golbandi, Michael Bendersky, and Marc Najork. 2019. Learning groupwise multivariate scoring functions using deep neural networks. In ACM Special Interest Group on Information Retrieval. 85–92.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
  • Bello et al. (2018) Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. 2018. Seq2slate: Re-ranking and slate optimization with rnns. arXiv preprint arXiv:1810.02019 (2018).
  • Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to Rank Using Gradient Descent. In International Conference on Machine Learning. 89–96.
  • Burges ([n. d.]) Christopher JC Burges. [n. d.]. From ranknet to lambdarank to lambdamart: An overview. Learning ([n. d.]).
  • Chen et al. (2018) Laming Chen, Guoxin Zhang, and Eric Zhou. 2018. Fast greedy map inference for determinantal point process to improve recommendation diversity. In Conference and Workshop on Neural Information Processing Systems. 5622–5633.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In ACM Conference on Recommender Systems Workshop. 7–10.
  • Covington et al. (2016) Covington, Paul, Adams, Jay, Sargin, and Emre. 2016. Deep neural networks for youtube recommendations. In ACM Conference on Recommender Systems. 191–198.
  • Das et al. (2007) Abhinandan S Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In The International World Wide Web Conference. 271–280.
  • Feng et al. (2019) Yufei Feng, Fuyu Lv, Weichen Shen, Menghan Wang, Fei Sun, Yu Zhu, and Keping Yang. 2019. Deep Session Interest Network for Click-Through Rate Prediction. In International Joint Conference on Artificial Intelligence. 2301–2307.
  • Gelada et al. (2019) Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G Bellemare. 2019. Deepmdp: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736 (2019).
  • Gogna and Majumdar (2017) Anupriya Gogna and Angshul Majumdar. 2017. Balancing accuracy and diversity in recommendations using matrix completion framework. Knowledge-Based Systems 125 (2017), 83–95.
  • Gomez-Uribe and Hunt (2015) Carlos A Gomez-Uribe and Neil Hunt. 2015. The netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems 6, 4 (2015), 1–19.
  • Gong et al. (2020) Yu Gong, Z. Jiang, Kaiqi Zhao, Q. Liu, and Wenwu Ou. 2020. EdgeRec: Recommender System on Edge in Mobile Taobao. In ACM International Conference on Information and Knowledge Management.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In International Joint Conference on Artificial Intelligence. 1725–1731.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Joachims (2006) Thorsten Joachims. 2006. Training linear SVMs in linear time. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 217–226.
  • Li et al. (2011) Lei Li, Dingding Wang, Tao Li, Daniel Knox, and Balaji Padmanabhan. 2011. SCENE: A Scalable Two-Stage Personalized News Recommendation System (ACM Special Interest Group on Information Retrieval). Association for Computing Machinery, New York, NY, USA, 125–134. https://doi.org/10.1145/2009916.2009937
  • Liu et al. (2010) Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In International Conference on Intelligent User Interfaces. 31–40.
  • Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
  • Pei et al. (2019) Changhua Pei, Yi Zhang, Yongfeng Zhang, Fei Sun, Xiao Lin, Hanxiao Sun, Jian Wu, Peng Jiang, Junfeng Ge, Wenwu Ou, and Dan Pei. 2019. Personalized Re-Ranking for Recommendation. In ACM Conference on Recommender Systems. 3–11.
  • Pi et al. (2019) Qi Pi, Weijie Bian, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2671–2679.
  • Sarwar et al. (2001) Badrul Munir Sarwar, George Karypis, Joseph A Konstan, John Riedl, et al. 2001. Item-based collaborative filtering recommendation algorithms. In The International World Wide Web Conference. 285–295.
  • Schulman et al. (2015) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference and Workshop on Neural Information Processing Systems. 5998–6008.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Conference and Workshop on Neural Information Processing Systems, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). 2692–2700.
  • Wilhelm et al. (2018) Mark Wilhelm, Ajith Ramanathan, Alexander Bonomo, Sagar Jain, Ed H Chi, and Jennifer Gillenwater. 2018. Practical diversified recommendations on youtube with determinantal point processes. In ACM International Conference on Information and Knowledge Management. 2165–2173.
  • Zheng et al. (2007) Zhaohui Zheng, Keke Chen, Gordon Sun, and Hongyuan Zha. 2007. A regression framework for learning ranking functions using relative relevance judgments. In ACM Special Interest Group on Information Retrieval. 287–294.
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In the Association for the Advance of Artificial Intelligence. 5941–5948.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1059–1068.
  • Zhuang et al. (2018) Tao Zhuang, Wenwu Ou, and Zhirong Wang. 2018. Globally Optimized Mutual Influence Aware Ranking in E-Commerce Search. In International Joint Conference on Artificial Intelligence. 3725–3731.