
Contrastive Information Transfer for Pre-Ranking Systems

Yue Cao*, XiaoJiang Zhou*, Peihao Huang, Yao Xiao, Dayao Chen, Sheng Chen
Meituan Inc., Beijing, P.R. China
[email protected], zhouxiaojiang, huangpeihao, xiaoyao06, chendayao, [email protected]
*The first two authors contributed equally to this paper.
(2022)
Abstract.

Real-world search and recommender systems usually adopt a multi-stage ranking architecture, including matching, pre-ranking, ranking, and re-ranking. Previous works mainly focus on the ranking stage, while very few focus on the pre-ranking stage. In this paper, we focus on information transfer from the ranking stage to the pre-ranking stage. We propose a new Contrastive Information Transfer (CIT) framework to transfer useful information from the ranking model to the pre-ranking model. We train the pre-ranking model to distinguish the positive pair of representations from a set of positive and negative pairs with a contrastive objective. As a consequence, the pre-ranking model can make full use of the rich information in the ranking model's representations. The CIT framework also has the advantage of alleviating selection bias and improving performance on recall metrics, which is crucial for pre-ranking models. We conduct extensive experiments on offline datasets and in online A/B testing. Experimental results show that CIT achieves superior results compared with competitive models. In addition, a strict online A/B test at one of the world's largest e-commerce platforms shows that the proposed model achieves a 0.63% improvement in CTR and a 1.64% improvement in VBR. The proposed model has now been deployed online and serves the main traffic of this system, contributing remarkable business growth.

Learning to Rank, Pre-Ranking, Contrastive Learning, Search Systems
copyright: acmcopyright; journalyear: 2022; doi: 10.1145/1122445.1122456; conference: Unpublished work, June 2022, China; booktitle: Unpublished work, 2022; price: 15.00; isbn: 978-1-4503-XXXX-X/18/06; ccs: Information systems, Learning to rank; ccs: Information systems, Recommender systems

1. Introduction

With the rapid growth of internet services, search systems are becoming increasingly important in helping users find what they want. To balance performance and efficiency, industrial search systems usually consist of multiple cascade stages (Wang et al., 2020): matching, pre-ranking, ranking, and re-ranking, as shown in Figure 1. Previous works mainly pay attention to the ranking stage (Covington et al., 2016; Gai et al., 2017; Grbovic and Cheng, 2018; Lyu et al., 2020; Zhou et al., 2017, 2019), while very few focus on the pre-ranking stage. This paper focuses on the pre-ranking stage, where the model receives thousands of candidate items retrieved from the matching stage and applies a simple model to rank them and select hundreds of candidates for the subsequent ranking model.

To meet the efficiency constraint, the pre-ranking model is usually simpler than the ranking model. For a long time, pre-ranking models were designed as tree models or shallow linear models (McMahan et al., 2013; Wang et al., 2020). Recently, some works focus on leveraging knowledge from the ranking model to assist the pre-ranking model (Xu et al., 2020).

Figure 1. Illustration of a real-world multi-stage search system.

In this paper, we reveal that the ranking consistency between the pre-ranking and ranking stages is a key factor in the whole system's performance. From the perspective of the whole system, the top-$u$ items retrieved by the pre-ranking model will be re-ranked by the subsequent ranking model; therefore, the relative order within the top-$u$ items predicted by the pre-ranking model will not affect the final result.

To this end, we propose a novel Contrastive Information Transfer (CIT) framework to improve the ranking consistency between pre-ranking and ranking models. To optimize the CIT objective, we design labels that take the order of the ranking model's results into account, which enables the pre-ranking model to focus more on top items, thereby improving performance on the recall metric. Then, we train the pre-ranking model to distinguish the positive pair from a set of positive and negative pairs with a contrastive objective. Theoretically, CIT maximizes a lower bound of the mutual information between the pre-ranking model's representations and the ranking model's representations, thus facilitating the pre-ranking model to make full use of the rich information in the representations and orders of the ranking model.

We conduct extensive experiments on one of the world's largest e-commerce platforms. CIT achieves a 0.63% improvement in Click-Through Rate (CTR) and a 1.64% improvement in Visit-Buy Rate (VBR), which is very significant considering the huge turnover of this platform. Since 2022, CIT has been deployed online and has served the main search traffic of this system, bringing significant online profit improvement.

The prime contributions of this paper can be summarized as follows:

  • We propose a novel contrastive information transfer framework that transfers rich representation-level information from the ranking model to the pre-ranking model.

  • Extensive experiments, including an offline dataset and an online A/B test, show the superiority of CIT compared with competitive methods. Moreover, CIT has been successfully deployed online and serves the main search traffic.

Figure 2. Overall architecture of the proposed CIT framework. The left part is the pre-ranking net and the right part is the ranking net. During training, gradients are not propagated to the ranking model and its parameters are fixed.

2. Methodology

2.1. Overview

A high-level overview of the pre-ranking and ranking models in our system is shown in Figure 2. The pre-ranking model in our system consists of an embedding layer and several feed-forward layers. Our ranking model is also built upon deep neural networks, while its architecture is more complicated than that of the pre-ranking model for higher accuracy.

In our system, the pre-ranking model ranks 2,000 candidates retrieved from the matching model and selects the top-150 candidates for the ranking model.
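This funnel amounts to a simple top-$k$ selection; the sketch below is illustrative, with the candidate list, score function, and cut-off standing in for the production system:

```python
def pre_rank(candidates, score_fn, k=150):
    """Keep the top-k candidates (by pre-ranking score) for the ranking stage."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Toy usage: 2,000 matched candidates cut down to 150 for the ranking model.
matched = list(range(2000))
kept = pre_rank(matched, score_fn=lambda item: item, k=150)
```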

2.2. Learning from Ranking Model via Contrastive Information Transfer

The pre-ranking model shares similar training data, training objectives, and architecture with the ranking model. The difference is that the ranking model has a more sophisticated architecture and contains more parameters. Therefore, it is natural to transfer information from the ranking model to the pre-ranking model. For this purpose, we propose a novel Contrastive Information Transfer (CIT) framework in this paper.

As shown in Figure 2, we introduce the ranking model into the computation graph of the pre-ranking model. We regard the representations of the same positive example encoded by the pre-ranking and ranking models as a positive pair, and the representations of a positive and a negative example, both encoded by the pre-ranking model, as negative pairs. Note that the ranking model is only used to provide representations, and its parameters are not updated during training. At inference time, we only export the pre-ranking part and deploy it online.

2.2.1. Formulation

Denote $\phi^{T}$ as the ranking model and $\phi$ as the pre-ranking model. Given a positive instance $\{(x_{0},y_{0})\,|\,y_{0}=1\}$, we feed $x_{0}$ into the pre-ranking and ranking models respectively, and obtain the corresponding representations $\phi(q,x_{0})$ and $\phi^{T}(q,x_{0})$. At the same time, we sample $K$ negative instances from the same request, and feed these instances into the pre-ranking model to obtain the corresponding representations $\{\phi(q,x_{1}),\phi(q,x_{2}),\cdots,\phi(q,x_{K})\,|\,y_{1,2,\cdots,K}=0\}$.

We treat $(\phi(q,x_{0}),\phi^{T}(q,x_{0}))$ as the positive pair and $\{(\phi(q,x_{0}),\phi(q,x_{j}))\}_{j=1}^{K}$ as negative pairs, and minimize the following contrastive loss:

(1) $\mathcal{L}_{CKT}=-\sum_{q\in Q}\left[\log\frac{\exp\left(\langle\phi(q,x_{0}),\phi^{T}(q,x_{0})\rangle/\tau'\right)}{\sum_{j=0}^{K}\exp\left(\langle\phi(q,x_{0}),\phi(q,x_{j})\rangle/\tau'\right)}\right]$

where $\langle\cdot,\cdot\rangle$ denotes the inner product and $\tau'$ is the temperature. After incorporating the CIT loss, the total training objective of our pre-ranking model can be formulated as:

(2) $\mathcal{L}=\lambda\mathcal{L}_{CE}+(1-\lambda)\mathcal{L}_{CKT}$

where $\mathcal{L}_{CE}$ is the cross-entropy loss and $\lambda$ is a hyper-parameter balancing the two loss terms.
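A minimal NumPy sketch of Eq. 1 and Eq. 2 for a single request; the vector shapes and temperature value are illustrative, and, following Eq. 1 literally, the denominator runs over the pre-ranking pairs $j=0,\dots,K$:

```python
import numpy as np

def cit_loss(pre_pos, rank_pos, pre_negs, tau=0.07):
    """Contrastive loss of Eq. 1 for one request.

    pre_pos:  pre-ranking representation phi(q, x0) of the positive item, shape (d,)
    rank_pos: frozen ranking-model representation phi^T(q, x0), shape (d,)
    pre_negs: pre-ranking representations of the K sampled negatives, shape (K, d)
    """
    numerator = np.exp(pre_pos @ rank_pos / tau)        # positive (student, teacher) pair
    all_pre = np.vstack([pre_pos[None, :], pre_negs])   # pre-ranking vectors for j = 0..K
    denominator = np.sum(np.exp(all_pre @ pre_pos / tau))
    return -np.log(numerator / denominator)

def total_loss(ce_loss, cit, lam=0.5):
    """Eq. 2: weighted combination of cross-entropy and CIT losses."""
    return lam * ce_loss + (1.0 - lam) * cit
```

Since the denominator does not contain the teacher pair, increasing the agreement between the pre-ranking and ranking representations of the positive item strictly lowers the loss.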

Intuitively, Eq. 1 encourages the pre-ranking model to find, among a set of representations, the one most similar to the corresponding ranking model's representation. The pre-ranking model is therefore encouraged to exploit information from the ranking model. Theoretically, minimizing Eq. 1 can also be interpreted as maximizing a lower bound of the mutual information between the pre-ranking model's representation $\phi(q,x_{0})$ and the ranking model's representation $\phi^{T}(q,x_{0})$ (van den Oord et al., 2018):

(3) $\begin{split}\mathcal{L}_{CKT}&=-\mathbb{E}_{Q}\left[\log\frac{\exp\left(\langle\phi(q,x_{0}),\phi^{T}(q,x_{0})\rangle/\tau'\right)}{\sum_{j=0}^{K}\exp\left(\langle\phi(q,x_{0}),\phi(q,x_{j})\rangle/\tau'\right)}\right]\\&\geq-I(\phi(q,x_{0}),\phi^{T}(q,x_{0}))+\log(N)\end{split}$

where $I(\cdot,\cdot)$ refers to the mutual information and $N=K+1$ is the number of pairs. Therefore, the proposed method maximizes the agreement between pre-ranking and ranking representations at the mutual-information level. We name the proposed method Contrastive Information Transfer (CIT).

2.2.2. Designing the labels $y$

The choice of labels $y$ is crucial for CIT. A straightforward approach is to treat clicked items as positive instances ($y=1$) and displayed-but-not-clicked items as negative instances ($y=0$). However, since the ranking model is available, this approach ignores the order information in the ranking model's results.

Let us look back at the goal and nature of the pre-ranking system. In a real-world search system, the pre-ranking system retrieves the top-$u$ items from $v$ ($v\gg u$) candidates for the subsequent ranking model. From the perspective of the whole system, the relative order of the top-$u$ pre-ranking items does not affect the final result, because they will all be re-ranked by the subsequent ranking model. Therefore, we should care more about the recall rate of the top-$u$ items.

In this paper, we propose an approach that is more in line with the nature of pre-ranking: we treat the top-$u$ items ranked by the ranking model as positive examples ($y=1$) and the remaining $v-u$ items as negative examples ($y=0$).
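This labeling scheme can be sketched as follows, assuming the frozen ranking model's scores for the $v$ candidates are available as an array (the function name is ours, for illustration):

```python
import numpy as np

def cit_labels(ranking_scores, u):
    """y = 1 for the top-u items by ranking-model score, y = 0 for the other v-u."""
    y = np.zeros(len(ranking_scores), dtype=np.int64)
    top_u = np.argsort(-np.asarray(ranking_scores))[:u]  # indices of the u highest scores
    y[top_u] = 1
    return y
```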

2.2.3. CIT can alleviate selection bias

We argue that the proposed contrastive information transfer approach can alleviate the selection bias issue (Yuan et al., 2019) of the pre-ranking model to some extent.

The pre-ranking model suffers from two types of selection bias: (1) non-displayed items are not considered, and (2) requests without positive feedback are not considered. For pairwise objectives, requests without positive feedback cannot be used for training and have to be discarded.

The presence of selection bias makes the distributions of training data and real data inconsistent. Previous work (Yuan et al., 2019) has confirmed that ignoring these selection biases may hurt training and reduce performance. CIT considers the entire set of 2,000 items, which may contain non-displayed items for some requests, so it can alleviate the first type of selection bias. Besides, the introduction of $\mathcal{L}_{CKT}$ enables us to utilize requests without positive feedback (for such requests, we only minimize the second term in Eq. 2). Therefore, CIT can alleviate the second type of selection bias as well.

2.2.4. Advantages over Knowledge Distillation

Knowledge distillation (Hinton et al., 2015) refers to transferring dark knowledge from one network to another.

The idea behind CIT is similar to knowledge distillation, but CIT has the following advantages in our scenario: (1) CIT pays more attention to top results (positive instances), which is more in line with the nature of the pre-ranking model, i.e., selecting top candidates for the ranking model. (2) The data of a real-world search system is highly structured, i.e., dependencies exist between different dimensions of features, while the KL divergence commonly used in knowledge distillation treats all dimensions as independent and is thus insufficient for capturing structural knowledge (Chen et al., 2021; Tian et al., 2020). In contrast, CIT maximizes a lower bound of the mutual information between two representations, which is more advantageous in dealing with higher-order correlations.

3. Experiments

3.1. Experiment Setups

3.1.1. Dataset

To the best of our knowledge, there is no available benchmark for the pre-ranking task, and previous works (Wang et al., 2020; Ma et al., 2021) conduct experiments on their own datasets, which are commercial and not publicly available. To verify the effectiveness of the proposed method, we follow previous work and conduct experiments using real-world data collected from our own system.

The dataset is collected from the search system of one of the world's largest e-commerce platforms. It contains more than 100 million users and more than 10 billion training examples. Following the classic setting for industrial models, we treat the data of the first 14 days as the training set, and the data of the following 2 days as the validation and test sets respectively.

3.1.2. Evaluation Metrics

Following previous works, we use G-AUC (Group Area Under Curve) (Zhu et al., 2017), Recall (Wang et al., 2020), and NDCG@k (Normalized Discounted Cumulative Gain) for offline evaluation. We use $k=15$ because NDCG@15 is the core metric in our business. For online experiments, we use CTR (Click-Through Rate) and VBR (Visit-Buy Rate) as online metrics.
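The Recall metric can be computed as the fraction of relevant items that survive the pre-ranking model's top-$k$ cut; this is a generic sketch, not our exact evaluation code:

```python
import numpy as np

def recall_at_k(pre_scores, relevance, k):
    """Share of relevant items (relevance == 1) recovered in the top-k by pre-ranking score."""
    top_k = np.argsort(-np.asarray(pre_scores))[:k]
    relevance = np.asarray(relevance)
    return relevance[top_k].sum() / max(relevance.sum(), 1)
```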

3.2. Competitive Models

Following previous work (Wang et al., 2020; Ma et al., 2021), we compare CIT with the following popular pre-ranking models.

  • Classic pre-ranking models, including Vector-Product based Deep Models (VPDM) (Yang et al., 2020; Wang et al., 2020), Deep Neural Networks (DNN), and Deep Neural Networks with Knowledge Distillation (DNN+KD). DNN+KD is the previous pre-ranking model in our system.

  • Well-known pre-ranking models proposed by others, including COLD (Wang et al., 2020) and FSCD (Ma et al., 2021), both of which we implement upon the DNN architecture. Note that COLD and FSCD mainly focus on how to select features for pre-ranking models, which differs from our focus, so a comparison with them is less essential. But since they are classic models in the pre-ranking field, we also list their results as a reference.

  • CIT w/o CIT: to demonstrate the superiority of CIT over knowledge distillation, we set up a variant that replaces the CIT module with traditional knowledge distillation, where the model optimizes the following KD loss (Hinton et al., 2015):

    (4) $\mathcal{L}_{KD}=\mathbf{KL}\left(\phi^{T}(q,x)\,\|\,\phi(q,x)\right)$

    For a fair comparison, we distill all top-$u$ examples, so the training data of KD is the same as that of CIT.

    To illustrate that the improvement is not entirely attributable to the choice of labels $y$, we also set up a variant that replaces CIT with a pairwise loss whose labels are chosen as in Section 2.2.2, i.e., $y=1$ for the top-$u$ items and $y=0$ for the rest.
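The KD baseline of Eq. 4 can be sketched as below; the softmax normalization over the outputs and the temperature parameter are our assumptions, since Eq. 4 only specifies the KL term:

```python
import numpy as np

def kd_loss(teacher_logits, student_logits, T=1.0):
    """KL(teacher || student) of Eq. 4 over softened output distributions."""
    def softmax(z):
        z = np.asarray(z, dtype=np.float64) / T
        z = z - z.max()                      # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()
    p = softmax(teacher_logits)              # frozen ranking-model distribution
    q = softmax(student_logits)              # pre-ranking-model distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

Unlike the contrastive objective, this loss matches the two output distributions dimension by dimension, which is the independence assumption discussed in Section 2.2.4.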

The above experimental settings, including competitive models and datasets chosen, are completely consistent with previous work (Ma et al., 2021; Wang et al., 2020; Xu et al., 2020).

4. Results and Analysis

4.1. Overall Results

Method    NDCG      G-AUC     Recall
VPDM      0.6988*   0.8633*   0.5948*
DNN       0.7123*   0.8772*   0.6321*
DNN+KD    0.7134*   0.8846*   0.6359*
COLD      0.7125*   0.8777*   0.6325*
FSCD      0.7129*   0.8780*   0.6331*
CIT       0.7222    0.8907    0.7481
Table 1. Performance comparison of different models on the offline dataset. "*" indicates that the improvement of CIT over this baseline is statistically significant at p-value < 0.05 under a paired Wilcoxon test.

The overall results of different models on offline dataset are shown in Table 1. In sum, CIT outperforms competitive models in NDCG, G-AUC, and Recall metrics concurrently. Particularly, the improvement of CIT on Recall is very significant, with an increase of more than 0.1 absolute scores (17.6%) compared to baseline models.

Compared with the previous online pre-ranking model DNN+KD, CIT achieves a 1.23% improvement on NDCG (0.7222 vs. 0.7134), a 0.69% improvement on G-AUC (0.8907 vs. 0.8846), and a 17.6% improvement on Recall (0.7481 vs. 0.6359).

Compared with DNN, neither COLD nor FSCD achieves remarkable improvement. This is because COLD and FSCD focus on selecting important features for pre-ranking models, while in our scenario the importance of features can mostly be derived from prior knowledge (the initial feature set of COLD and FSCD is the same as that of CIT and DNN). Besides, we found in experiments that COLD and FSCD may miss some intuitively important features. CIT essentially changes the way of modeling and training pre-ranking models, making the optimization of the pre-ranking model more in line with the goal and nature of pre-ranking. Therefore, CIT can achieve significant improvement over all competitive models.

4.2. Ablation Study of CIT

Method        NDCG     G-AUC    Recall
CIT           0.7222   0.8907   0.7481
CIT w/o CIT   0.7107   0.8704   0.6604
  + KD        0.7185   0.8842   0.6675
  + pairwise  0.7118   0.8687   0.6885
Table 2. Ablation study of CIT.

The results of the ablation study are listed in Table 2.

The contrastive information transfer module plays a significant role in improving the recall score, as performance greatly decreases when it is removed. We believe this is because $\mathcal{L}_{CKT}$ facilitates the model to learn better representations and focuses more on top-$u$ items. The experimental results also show that performance decreases when CIT is replaced with the pairwise loss, indicating that the improvement is not entirely attributable to the choice of labels.

4.3. Online A/B test

A strict online A/B test was also conducted to verify the effectiveness of CIT. The baseline model is the previous online pre-ranking model in our system, the DNN+KD model mentioned earlier; the test model is CIT. The test lasted for 14 days, with 10% of traffic allocated to each model.

Overall, CIT achieves a 0.63% improvement in CTR (p-value < 0.005, confidence > 0.995) and a 1.64% improvement in VBR (p-value < 0.001, confidence > 0.998), which can greatly increase online profit considering the large traffic of our system. In particular, CIT greatly improves Recall@150 by an absolute 0.1152, from 0.6037 to 0.7189, which again verifies the effectiveness of CIT in improving the results of top items. Besides, we find that CIT does not increase the response time (RT) of the system compared with the baseline model. This is because CIT does not introduce any additional modules or new features, but improves the training framework.

The results of the online A/B test demonstrate that CIT is superior to the previous online pre-ranking model (DNN+KD). Since 2022, CIT has been deployed online and has served the main traffic of our system.

5. Conclusions

In this paper, we propose a new contrastive information transfer framework for pre-ranking systems. The framework seeks a high recall score and is more in line with the role of pre-ranking models. We also show that the proposed CIT framework is better at transferring structural knowledge, alleviating selection bias, and improving the recall rate of the model.

Future work includes employing the contrastive learning framework in other stages of the search system, such as matching, ranking, and re-ranking.

References

  • Chen et al. (2021) Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao, and Lawrence Carin. 2021. Wasserstein Contrastive Representation Distillation. In CVPR 2021, virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, 16296–16305. https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Wasserstein_Contrastive_Representation_Distillation_CVPR_2021_paper.html
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016. ACM, 191–198. https://doi.org/10.1145/2959100.2959190
  • Gai et al. (2017) Kun Gai, Xiaoqiang Zhu, Han Li, Kai Liu, and Zhe Wang. 2017. Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction. CoRR abs/1704.05194 (2017). arXiv:1704.05194 http://arxiv.org/abs/1704.05194
  • Grbovic and Cheng (2018) Mihajlo Grbovic and Haibin Cheng. 2018. Real-time Personalization using Embeddings for Search Ranking at Airbnb. In KDD 2018, London, UK, August 19-23, 2018, Yike Guo and Faisal Farooq (Eds.). ACM, 311–320. https://doi.org/10.1145/3219819.3219885
  • Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531 (2015). arXiv:1503.02531 http://arxiv.org/abs/1503.02531
  • Lyu et al. (2020) Zequn Lyu, Yu Dong, Chengfu Huo, and Weijun Ren. 2020. Deep Match to Rank Model for Personalized Click-Through Rate Prediction. In AAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 156–163. https://aaai.org/ojs/index.php/AAAI/article/view/5346
  • Ma et al. (2021) Xu Ma, Pengjie Wang, Hui Zhao, Shaoguo Liu, Chuhan Zhao, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. 2021. Towards a Better Tradeoff between Effectiveness and Efficiency in Pre-Ranking: A Learnable Feature Selection based Approach. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021. ACM, 2036–2040. https://doi.org/10.1145/3404835.3462979
  • McMahan et al. (2013) H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, Sharat Chikkerur, Dan Liu, Martin Wattenberg, Arnar Mar Hrafnkelsson, Tom Boulos, and Jeremy Kubica. 2013. Ad click prediction: a view from the trenches. In KDD 2013, Chicago, IL, USA, August 11-14, 2013. ACM, 1222–1230. https://doi.org/10.1145/2487575.2488200
  • Tian et al. (2020) Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2020. Contrastive Representation Distillation. In ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=SkgpBJrtvS
  • van den Oord et al. (2018) Aäron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation Learning with Contrastive Predictive Coding. CoRR abs/1807.03748 (2018). arXiv:1807.03748 http://arxiv.org/abs/1807.03748
  • Wang et al. (2020) Zhe Wang, Liqin Zhao, Biye Jiang, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2020. COLD: Towards the Next Generation of Pre-Ranking System. CoRR abs/2007.16122 (2020). arXiv:2007.16122 https://arxiv.org/abs/2007.16122
  • Xu et al. (2020) Chen Xu, Quan Li, Junfeng Ge, Jinyang Gao, Xiaoyong Yang, Changhua Pei, Fei Sun, Jian Wu, Hanxiao Sun, and Wenwu Ou. 2020. Privileged Features Distillation at Taobao Recommendations. In KDD ’20, Virtual Event, CA, USA, August 23-27, 2020. ACM, 2590–2598. https://doi.org/10.1145/3394486.3403309
  • Yang et al. (2020) Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong, Yang Li, Simon Xiaoming Wang, Taibai Xu, and Ed H. Chi. 2020. Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations. In Companion of The 2020 Web Conference 2020, Taipei, Taiwan, April 20-24, 2020. ACM / IW3C2, 441–447. https://doi.org/10.1145/3366424.3386195
  • Yuan et al. (2019) Bo-Wen Yuan, Jui-Yang Hsia, Meng-Yuan Yang, Hong Zhu, Chih-Yao Chang, Zhenhua Dong, and Chih-Jen Lin. 2019. Improving Ad Click Prediction by Considering Non-displayed Events. In CIKM 2019, Beijing, China, November 3-7, 2019. ACM, 329–338. https://doi.org/10.1145/3357384.3358058
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep Interest Evolution Network for Click-Through Rate Prediction. In AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 5941–5948. https://doi.org/10.1609/aaai.v33i01.33015941
  • Zhou et al. (2017) Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Xingya Dai, Han Zhu, Junqi Jin, Han Li, and Kun Gai. 2017. Deep Interest Network for Click-Through Rate Prediction. CoRR abs/1706.06978 (2017). arXiv:1706.06978 http://arxiv.org/abs/1706.06978
  • Zhu et al. (2017) Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. 2017. Optimized Cost per Click in Taobao Display Advertising. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017. ACM, 2191–2200. https://doi.org/10.1145/3097983.3098134