Enhancing CTR Prediction through Sequential Recommendation Pre-training: Introducing the SRP4CTR Framework
Abstract.
Understanding user interests is crucial for Click-Through Rate (CTR) prediction tasks. In sequential recommendation, self-supervised pre-training on users’ historical behaviors captures dynamic user preferences well, which suggests it can be integrated directly with CTR tasks. Previous methods integrate pre-trained models into downstream tasks solely to extract semantic information or well-represented user features, which are then incorporated as new features. However, these approaches tend to ignore the additional inference cost imposed on the downstream task, and they do not consider how to transfer the useful information in the pre-trained model to the specific item being estimated in CTR prediction. In this paper, we propose a Sequential Recommendation Pre-training framework for CTR prediction (SRP4CTR) to tackle these problems. First, we discuss the impact of introducing pre-trained models on inference costs. Next, we introduce a pre-training method that encodes sequence side information concurrently. During fine-tuning, we incorporate a cross-attention block that bridges the estimated item and the pre-trained model at low cost. Moreover, we develop a querying transformer technique to facilitate knowledge transfer from the pre-trained model to industrial CTR models. Offline and online experiments show that our method outperforms previous baseline models.
1. Introduction
Click-Through Rate (CTR) prediction is crucial in recommender systems. Modeling users’ behavior through their behavior sequences is the main approach to CTR prediction, and longer sequences have become a trend in industrial recommendation systems (Pi et al., 2020; Chang et al., 2023).
Traditional methods tend to use click labels for supervised learning, ignoring the extensive information within the sequence. This information contains not only item IDs but also abundant side information, such as price and behavior type. To model this sequence information, some methods introduce self-supervised pre-training for sequential recommendation tasks, whose goal is next-item prediction. They have demonstrated the potential of integrating self-supervised pre-training with downstream tasks in recommender systems. Inspired by this, we introduce self-supervised pre-training for the CTR prediction task. However, existing self-supervised pre-training methods (Sun et al., 2019; de Souza Pereira Moreira et al., 2021; Liu et al., 2021) for sequence modeling take only item IDs as input or output, which is insufficient for encoding side information. In fact, fully exploiting both item IDs and side information improves CTR prediction.
Another challenge lies in how to integrate pre-training with CTR prediction tasks. Current methods that integrate CTR prediction with a pre-trained model can be categorized by the form of the encoded information: 1) user-related only (Liao, 2020; Chitlangia et al., 2023; Liu et al., 2022); 2) user-item-related (Lin et al., 2023; Wang et al., 2023). Existing work has not systematically compared the extra inference costs of these methods, which leads to unfair comparisons. In industrial-level CTR prediction models with extremely high Queries Per Second (QPS), the online system uses separate encoding workflows for the user and the item. For user-related pre-training, as shown in Figure 1, the user (sequence) information is encoded once and then tiled across the items, which significantly reduces inference costs. In the rest of this paper, we refer to this inference method as folded inference. Some approaches (Zhai et al., 2024) have incorporated this concept into the training process. When a relatively large number of samples n is available for computation under the same user, the complexity introduced by the transformer can be reduced to a negligible level. However, a user-related pre-trained model can only produce a single representation of the sequence information, which is then concatenated with item information to learn user interest. These methods fail to capture, from the pre-trained model, interest information tailored to the predicted item. In contrast, user-item-related encoding can fully capture the user’s interest in the predicted item, but it imposes a heavy workload on real-time serving and may be inapplicable in real scenarios.
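The cost argument behind folded inference can be illustrated with a small cost model. This is only a sketch: the function names and the FLOP figures below are hypothetical and are not taken from the paper. It assumes the per-request cost splits into a user-side part (encoding the behavior sequence) and an item-side part (scoring one candidate).

```python
def unfolded_cost(user_flops: float, item_flops: float, n_items: int) -> float:
    """Cost when the user sequence is re-encoded for every candidate item."""
    return n_items * (user_flops + item_flops)


def folded_cost(user_flops: float, item_flops: float, n_items: int) -> float:
    """Cost when the user sequence is encoded once and tiled across items."""
    return user_flops + n_items * item_flops


# Hypothetical numbers, chosen only for illustration.
USER_FLOPS = 1000.0  # encoding the behavior sequence (e.g. a transformer)
ITEM_FLOPS = 10.0    # per-candidate scoring on top of the user encoding

for n in (1, 10, 100):
    folded = folded_cost(USER_FLOPS, ITEM_FLOPS, n) / n
    unfolded = unfolded_cost(USER_FLOPS, ITEM_FLOPS, n) / n
    print(f"n={n:>3}  per-item cost: folded={folded:.1f}  unfolded={unfolded:.1f}")
```

Under these assumptions the user-side cost is amortized over the n candidates of the same user, so the folded per-item cost approaches the item-side cost as n grows, while the unfolded per-item cost stays constant.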
In this paper, we propose a new fine-tuning framework designed to adapt the transformer-based Sequential Recommendation Pre-trained model for the CTR prediction task, which we call SRP4CTR. For encoding item IDs and side information simultaneously, during the pre-training phase, we introduce a new bidirectional transformer model which is named Fine-Grained BERT (FG-BERT). FG-BERT uses all side information in input and output at the same time and performs multi-attribute masking predictions while encoding side information. For integrating the pre-trained model, in the fine-tuning phase, we introduce a uni cross-attention mechanism that establishes a unidirectional attention connection between the predicted item and the pre-trained model. Most of the computation can still be reduced through folded inference, and uni cross-attention only adds a small amount of inference cost to capture interest in the estimated item from the pre-trained model.
Furthermore, unlike the sequential recommendation tasks that rely solely on a single CLS token or a “[mask]” token for prediction, discerning useful information from hundreds of tokens in user behavior for CTR prediction also poses a significant challenge. In this paper, we develop a querying transformer encoder that uses a small number of learnable query tokens to aggregate hundreds of user behavior tokens before passing them to the CTR model. This encoder can further enhance the performance of downstream tasks.
To validate the effectiveness of our proposed SRP4CTR method, we conducted extensive experiments on different sequential recommendation pre-trained models. The results demonstrate that SRP4CTR can enhance the performance of CTR tasks based on various pre-trained models while maintaining low inference costs.
The contributions of this paper are as follows: 1) We analyze the inference costs, demonstrating that integrating CTR prediction with sequential recommendation pre-training is practical for industrial recommendation systems. 2) We propose a new prediction framework, which can transfer information from the self-supervised pre-trained model at a low cost. 3) We conducted complete offline experiments and online validations to prove the effectiveness of our proposed method.

2. Methodology
SRP4CTR is designed to leverage transformer-based sequential recommendation pre-trained models to enhance the learning effectiveness of downstream CTR tasks. The overview of SRP4CTR is presented in Figure 2. In the following sections, we first introduce the base pre-trained sequential recommendation model with side information. Then, we will describe how we transfer knowledge from the pre-trained model for estimated items in CTR prediction.
2.1. Sequential Recommendation Pre-training
Self-supervised pre-training is used to tackle data sparsity by enhancing data representation (Zhou et al., 2020) and to improve recommendation performance for long-tail items (Yao et al., 2021). Following Sun et al. (2019), we use a bidirectional transformer for modeling and a Cloze objective for self-supervised pre-training. However, unlike most sequential recommendation tasks, which only require modeling the item ID, real-world recommendation systems involve user click logs that contain not just the item ID but also a variety of side information. For a pre-trained model whose downstream task is CTR prediction, it is essential to encode not only the item ID but also the corresponding side information. To address this, we propose Fine-Grained BERT (FG-BERT).
For generality, given a sequence of user-item interaction records with length , the -th element consists of M item-related features, such as item ID, tags, and price, and N behavior actions, such as interaction type and click count; formally: .
FG-BERT enhances the encoding effectiveness of BERT by further introducing a training objective that predicts behavior-related features. Specifically, since the item ID and its corresponding attribute features are explicitly associated, we directly sum the embeddings of item-related features to obtain an item generalized representation of -th record. Similarly, we sum the embeddings of all behavior-related features to obtain the side info representation . As shown in Figure 3, in the FG-BERT, we introduce two different types of random mask: the item-related mask and the behavior-related mask. For behavior-related masks, we remove all associated behavior actions for one element. The mask is denoted as , with the remaining set denoted as . Correspondingly, the set of item-related masks is denoted as , with the remaining set denoted as . For FG-BERT, the probability of and is given as:
(1) | ||||
where is the learnable parameter of the model. To maximize the probabilities and , FG-BERT employs a multi-task approach, optimizing the aforementioned objectives with cross-entropy (CE) loss.
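The two mask types and the summed feature embeddings can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `[mask]` token name, the masking probabilities, the feature names in the toy history, and the choice to draw the two masks independently per position are all assumptions made here.

```python
import random
from typing import Dict, List, Tuple

MASK = "[mask]"  # placeholder token; the exact token name is an assumption
Record = Dict[str, Dict[str, str]]  # {"item": {...}, "behavior": {...}}


def sum_embeddings(vectors: List[List[float]]) -> List[float]:
    """Element-wise sum of feature embeddings into one representation."""
    return [sum(column) for column in zip(*vectors)]


def mask_sequence(
    records: List[Record],
    p_item: float,
    p_behavior: float,
    rng: random.Random,
) -> Tuple[List[Record], Dict[int, Dict[str, str]], Dict[int, Dict[str, str]]]:
    """Apply item-related and behavior-related masks independently per position.

    Returns the masked sequence plus the original features at masked
    positions, which serve as prediction targets for the two objectives.
    """
    masked: List[Record] = []
    item_targets: Dict[int, Dict[str, str]] = {}
    behavior_targets: Dict[int, Dict[str, str]] = {}
    for t, record in enumerate(records):
        item = dict(record["item"])
        behavior = dict(record["behavior"])
        if rng.random() < p_item:
            item_targets[t] = dict(record["item"])
            item = {name: MASK for name in item}  # hide all item-related features
        if rng.random() < p_behavior:
            behavior_targets[t] = dict(record["behavior"])
            behavior = {name: MASK for name in behavior}  # hide all behavior actions
        masked.append({"item": item, "behavior": behavior})
    return masked, item_targets, behavior_targets


# Toy history with hypothetical feature names.
history = [
    {"item": {"item_id": "i1", "cate_id": "c1"}, "behavior": {"type": "click"}},
    {"item": {"item_id": "i2", "cate_id": "c2"}, "behavior": {"type": "buy"}},
]
masked, item_targets, behavior_targets = mask_sequence(
    history, p_item=0.15, p_behavior=0.15, rng=random.Random(0)
)
```

In a multi-task setup of this kind, one cross-entropy term would be computed over `item_targets` and another over `behavior_targets`, and the two would be combined into the training loss.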

2.2. Fine-tune for CTR Prediction
2.2.1. Uni Cross-Attention Block
For downstream CTR tasks, we introduce a uni cross-attention block that enables the predicted item to draw the corresponding information from the pre-trained model. As illustrated in Figure 2, in uni cross-attention the queries are composed of the predicted items, while Key and Value are derived from the user behavior sequence representation at the same layer. Although the predicted item shares the same representational information as the input of the pre-trained model, the attention mechanism learned under self-supervision differs significantly from the attention mechanism required for CTR tasks. Therefore, we untie the parameters of the uni cross-attention block from those of the pre-trained model, sharing only the input embedding parameters and the projection parameters for Key and Value at each layer. After the multi-head attention, the encoded feature of the predicted item is passed through the feed-forward network and residual connections, as is standard in a transformer layer. In addition, a learnable variable serves as the position embedding of the predicted item, to avoid issues caused by changes in positional semantics.
During the deployment phase, the uni cross-attention block only adds a minimal amount of extra computation. The primary computational load, carried by the pre-trained model, can still be accelerated using folded inference.
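The following single-head sketch illustrates why this block is cheap at serving time. It is a simplified illustration, not the authors' code: it omits multiple heads, the feed-forward network, residual connections, and the position embedding, and all function names are hypothetical. Keys and Values are projected from the encoded behavior sequence once per user (with the shared projections), and only the query projection runs per candidate item.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix `w` (rows x cols) by vector `x`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_sequence(w_k: Matrix, w_v: Matrix, seq: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """User side: run once per user with the shared Key/Value projections."""
    return [matvec(w_k, h) for h in seq], [matvec(w_v, h) for h in seq]


def uni_cross_attention(w_q: Matrix, keys: List[Vector], values: List[Vector], item: Vector) -> Vector:
    """Item side: one query attends to the precomputed keys and values."""
    q = matvec(w_q, item)  # block-specific (untied) query projection
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]


identity = [[1.0, 0.0], [0.0, 1.0]]
sequence = [[1.0, 0.0], [0.0, 1.0]]  # toy encoded behavior tokens
keys, values = project_sequence(identity, identity, sequence)  # once per user
candidates = [[2.0, 0.0], [0.0, 2.0]]
outputs = [uni_cross_attention(identity, keys, values, c) for c in candidates]
```

Because attention flows only from the candidate item to the sequence, the sequence representations do not depend on the candidate and can be reused for every item scored for the same user.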

2.2.2. Querying Transformer
Although the features are encoded by the pre-trained model, they may not align perfectly with the information required for CTR tasks, because the pre-training objective differs from the CTR training objective. For better alignment, as shown in Figure 2, we introduce a novel Querying Transformer that takes learnable queries as input (where ) and encodes them by cross-attention, with the output of the pre-trained model as Key and Value. This module is expected to discriminate among user interests. Additionally, in practical scenarios, we also use user or context features, such as gender, age, or hour, to initialize the queries, which are then mapped into multiple different queries to accommodate the diverse interest distributions of different users. Finally, the Querying Transformer only processes user information, so it can also benefit from folded inference.
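The aggregation step can be sketched as a small set of query vectors that each take a softmax-weighted average of the encoder outputs. This is an illustrative simplification: it uses a single head with no projections or feed-forward layers, the queries are randomly initialized stand-ins for learned parameters, and the sizes (200 tokens, 3 queries, dimension 4) are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def attend(query: Vector, tokens: List[Vector]) -> Vector:
    """Softmax-weighted average of `tokens`, scored against `query`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, tok)) / scale for tok in tokens]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * tok[d] for e, tok in zip(exps, tokens))
        for d in range(len(tokens[0]))
    ]


def querying_transformer(queries: List[Vector], encoder_outputs: List[Vector]) -> List[Vector]:
    """Compress many encoder tokens into one summary vector per query."""
    return [attend(q, encoder_outputs) for q in queries]


rng = random.Random(0)
DIM, SEQ_LEN, NUM_QUERIES = 4, 200, 3  # hypothetical sizes
encoder_outputs = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(SEQ_LEN)]
queries = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
summary = querying_transformer(queries, encoder_outputs)
print(len(encoder_outputs), "tokens ->", len(summary), "summary vectors")
```

The downstream CTR model then consumes a fixed, small number of vectors regardless of sequence length, and since nothing here depends on the candidate item, the computation happens once per user.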
3. Experiment
3.1. Offline Experiments
3.1.1. Experimental Setups
We conducted experiments on two public offline datasets: MovieLens-20M (Harper and Konstan, 2015) and Taobao (Zhou et al., 2018). For MovieLens, similarly to Zhou et al. (2018), the task is to predict whether a user is likely to assign a rating exceeding 3 (denoted as a positive label). We leverage features derived from users’ historical interactions to evaluate the most recent 10 interactions. We employ movie_id and movie_cate_id as item-related features and the rating score from the log as the behavior-related feature. For the Taobao dataset, we use the item cate_id and brand_id as item-related features, and the behavioral tag serves as the behavior-related feature. Additionally, we sample the Taobao dataset, selecting the top 50% of users with the most behaviors.
In this work, we use the Adam optimizer and apply polynomial decay to the learning rate. During the pre-training phase, we use the same architecture as Sun et al. (2019), consisting of a two-layer transformer with the maximum sequence length . For the MovieLens dataset, we train with a batch size of 512 for 100,000 steps. For the Taobao dataset, due to GPU memory constraints, we train with a batch size of 256 for 200,000 steps. During the fine-tuning phase, with Taobao’s token count reaching 500,000 compared to MovieLens’s token count of only 20,000, there is a higher risk of overfitting on MovieLens. Consequently, all parameters of the pre-trained model are learnable on the Taobao dataset but frozen on MovieLens. We train for 10 epochs and select the best-performing checkpoint for performance evaluation. We use Area Under the Curve (AUC) as the metric to evaluate performance on the CTR task.
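For reference, AUC can be read as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half. The sketch below is a deliberately simple quadratic-time version of that definition for illustration; the function name is hypothetical, and production evaluation would normally use a sort-based implementation from an established library.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: P(score of a positive > score of a negative), ties = 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.2]))  # prints 0.75
```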
3.1.2. Overall Performance
The overall performance of SRP4CTR is presented in Table 1. We primarily compare against five traditional CTR-based methods: PNN (Qu et al., 2016), BST (Chen et al., 2019), DIN (Zhou et al., 2018), DIEN (Zhou et al., 2019), and CAN (Bian et al., 2022). Our approach shows notable improvements over these conventional end-to-end modeling methods on both datasets.
Furthermore, we explored the impact of different pre-training methods on SRP4CTR, including the original BERT (Sun et al., 2019), ELECTRA (Clark et al., 2020), S3 (Zhou et al., 2020), and our proposed FG-BERT. Each experimental group includes two models: we compare MP (Yuan et al., 2020) and SRP4CTR on each pre-trained model. MP is a simple method for transferring a pre-trained model to downstream tasks. In the fine-tuning phase, it inserts a learnable model patch into the pre-trained model while the rest of the parameters are frozen. The representation produced by the pre-trained model is then concatenated with the embedding of the estimated item for prediction. MP therefore enables a more direct comparison of the effectiveness of different pre-training methods. As Table 1 shows, our proposed FG-BERT outperforms the other pre-training approaches. Furthermore, on the MovieLens dataset, the pre-trained model surpasses CTR-based methods through MP alone. However, on the sparser Taobao dataset, MP underperforms the CTR-based models.
3.1.3. Long Tail Performance
In Table 2, we further explore the effectiveness of SRP4CTR for recommending long-tail items. We define the bottom 20% of items in terms of their occurrence in the dataset as long-tail items and compare the performance of our method and traditional CTR prediction methods on this sub-dataset. Traditional CTR prediction methods like DIN and CAN tend to learn more effectively from hot items, resulting in a significant discrepancy between the AUC for long-tail items and the average AUC across all items. Through pre-training, however, both MP and SRP4CTR greatly narrow the learning gap between long-tail items and other items.
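One way to build such a long-tail subset is sketched below. The text does not specify the exact procedure, so this rests on assumptions: items are ranked by how often they occur in the interaction log, the bottom 20% of distinct items are kept, and ties are broken by item ID for determinism. The function name and the toy log are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Set


def long_tail_items(item_ids: Iterable[Hashable], fraction: float = 0.2) -> Set[Hashable]:
    """Return the least frequent `fraction` of distinct items in the log."""
    counts = Counter(item_ids)
    # Rank from least to most frequent; break ties by item ID (an assumption).
    ranked = sorted(counts, key=lambda item: (counts[item], str(item)))
    cutoff = int(len(ranked) * fraction)
    return set(ranked[:cutoff])


# Toy log: item "e" occurs once, "a" occurs five times.
log = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
tail = long_tail_items(log)
# Evaluation samples whose candidate item falls in `tail` form the sub-dataset.
tail_samples = [item for item in log if item in tail]
```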
Table 1. AUC of CTR-based baselines and of pre-trained models transferred with MP or SRP4CTR.
Group | Method | MovieLens | Taobao |
CTR based | PNN (Qu et al., 2016) | 0.7376 | 0.6006 |
CTR based | BST (Chen et al., 2019) | 0.7359 | 0.6021 |
CTR based | DIN (Zhou et al., 2018) | 0.7466 | 0.6066 |
CTR based | DIEN (Zhou et al., 2019) | 0.7494 | 0.6178 |
CTR based | CAN (Bian et al., 2022) | 0.7507 | 0.6018 |
BERT (Sun et al., 2019) | BERT-MP | 0.7595 | 0.5633 |
BERT (Sun et al., 2019) | BERT-SRP4CTR | 0.7696 | 0.6208 |
ELECTRA (Clark et al., 2020) | ELECTRA-MP | 0.7559 | 0.5580 |
ELECTRA (Clark et al., 2020) | ELECTRA-SRP4CTR | 0.7603 | 0.6197 |
S3 (Zhou et al., 2020) | S3-MP | 0.7526 | 0.5576 |
S3 (Zhou et al., 2020) | S3-SRP4CTR | 0.7592 | 0.6203 |
FG-BERT | FG-BERT-MP | 0.7721 | 0.5651 |
FG-BERT | SRP4CTR | 0.7817 | 0.6230 |
Table 2. AUC on long-tail items. AUC diff is the long-tail AUC minus the AUC over all items.
Methods | MovieLens AUC | MovieLens AUC diff | Taobao AUC | Taobao AUC diff |
DIN | 0.7205 | -0.0261 | 0.6044 | -0.0022 |
CAN | 0.7208 | -0.0299 | 0.5957 | -0.0061 |
FG-BERT-MP | 0.7591 | -0.0130 | 0.5686 | +0.0035 |
SRP4CTR | 0.7679 | -0.0139 | 0.6225 | -0.0005 |
3.1.4. Ablation Studies
We conducted detailed ablation studies on SRP4CTR, with the results presented in Table 3. The comparisons cover our proposed querying transformer (qFormer) and uni cross-attention module, as well as the impact of other training paradigms such as Scratch, which denotes training SRP4CTR directly in an end-to-end manner. For the uni cross-attention, we also examined the impact of parameter tying. The results validate the effectiveness of our proposed methods.
Table 3. Ablation results (AUC).
Methods | MovieLens | Taobao |
SRP4CTR-Scratch | 0.7740 | 0.6107 |
SRP4CTR w/o uni-att & qFormer | 0.7651 | 0.6099 |
SRP4CTR w/o uni-att | 0.7726 | 0.6116 |
SRP4CTR w/o qFormer | 0.7808 | 0.6220 |
SRP4CTR w tying uni-att | 0.7728 | 0.6153 |
SRP4CTR | 0.7817 | 0.6230 |
3.1.5. Inference Cost Analysis
Table 4 compares the inference costs of SRP4CTR and other methods. We propose a new metric, efficiency-FLOPs, defined as the FLOPs when the inference batch size is set to one. This metric indicates the computational complexity that truly impacts the precision of the model within a folded inference framework. In addition, inference-FLOPs is the computational complexity for a single batch when inference is carried out using folded inference with a batch size of 100 (our online deployment performs inference with a batch size of 100), thereby reflecting the cost of the model’s inference.
As the original implementations of DIN and CAN did not take inference performance into account, we introduce a variant of each. These variants can be accelerated by folded inference and are designed to match the original versions as closely as possible in efficiency-FLOPs. As Table 4 shows, SRP4CTR more than doubles the efficiency-FLOPs of DIN and CAN, with only a slight rise in inference-FLOPs over their foldable variants.
Table 4. Inference cost comparison.
Metrics | DIN | DIN (folded variant) | CAN | CAN (folded variant) | SRP4CTR |
efficiency-FLOPs | 10.01 | 8.96 | 10.12 | 8.99 | 26.88 |
inference-FLOPs | 989.59 | 51.22 | 1002.21 | 59.64 | 64.56 |
inference-FLOPs / efficiency-FLOPs | 98.86 | 5.72 | 99.03 | 6.63 | 2.40 |
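The ratio row of Table 4 can be recomputed from the two rows above it. The short check below uses only the values printed in the table; the dictionary layout and the "(folded variant)" labels for the second DIN and CAN columns are naming choices made here, on the assumption that those columns are the variants accelerated by folded inference.

```python
# (efficiency-FLOPs, inference-FLOPs) as printed in Table 4.
TABLE4 = {
    "DIN": (10.01, 989.59),
    "DIN (folded variant)": (8.96, 51.22),
    "CAN": (10.12, 1002.21),
    "CAN (folded variant)": (8.99, 59.64),
    "SRP4CTR": (26.88, 64.56),
}


def cost_ratio(efficiency_flops: float, inference_flops: float) -> float:
    """Inference cost paid per unit of efficiency-FLOPs (lower is better)."""
    return inference_flops / efficiency_flops


for name, (eff, inf) in TABLE4.items():
    print(f"{name:<22} ratio = {cost_ratio(eff, inf):.2f}")

# How much more efficiency-FLOPs SRP4CTR provides than the original DIN.
gain_over_din = TABLE4["SRP4CTR"][0] / TABLE4["DIN"][0]
print(f"SRP4CTR / DIN efficiency-FLOPs = {gain_over_din:.2f}x")
```

With a batch of 100 candidates, a model with no folding pays close to 100 times its single-sample cost, which matches the ratios near 99 for the original DIN and CAN, whereas SRP4CTR pays about 2.4 times.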
3.2. Online A/B test
We deployed our method in the Meituan Takeaway recommender system, one of the largest takeaway platforms in China, and ran a one-month online A/B test starting in June 2023. SRP4CTR increased GMV (Gross Merchandise Volume) by 1.66% and CTR by 0.70% in our main recommendation scenarios. In terms of efficiency, compared with the previous state-of-the-art baseline (DIN+MMOE (Ma et al., 2018)), our model brings a 182% increase in efficiency-FLOPs, while inference-FLOPs increase by only 21% under the folded inference framework.
4. Conclusions
In this paper, we propose a new method named SRP4CTR to improve CTR prediction by adapting the sequential recommendation pre-trained model at a low cost. Concretely, we first analyze the inference costs associated with introducing pre-trained models into CTR tasks. Then, we introduce a Fine-Grained BERT and uni cross-attention mechanism to accomplish the knowledge transfer. Our method has achieved significant growth in Gross Merchandise Volume and has been deployed in Meituan Waimai recommendation scenarios since July 2023.
References
- Bian et al. (2022) Weijie Bian, Kailun Wu, Lejian Ren, Qi Pi, Yujing Zhang, Can Xiao, Xiang-Rong Sheng, Yong-Nan Zhu, Zhangming Chan, Na Mou, et al. 2022. CAN: feature co-action network for click-through rate prediction. In Proceedings of the fifteenth ACM international conference on web search and data mining. 57–65.
- Chang et al. (2023) Jianxin Chang, Chenbin Zhang, Zhiyi Fu, Xiaoxue Zang, Lin Guan, Jing Lu, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, et al. 2023. TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou. arXiv preprint arXiv:2302.02352 (2023).
- Chen et al. (2019) Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior sequence transformer for e-commerce recommendation in alibaba. In Proceedings of the 1st international workshop on deep learning practice for high-dimensional sparse data. 1–4.
- Chitlangia et al. (2023) Sharad Chitlangia, Krishna Reddy Kesari, and Rajat Agarwal. 2023. Scaling generative pre-training for user ad activity sequences. (2023).
- Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).
- de Souza Pereira Moreira et al. (2021) Gabriel de Souza Pereira Moreira, Sara Rabhi, Jeong Min Lee, Ronay Ak, and Even Oldridge. 2021. Transformers4rec: Bridging the gap between nlp and sequential/session-based recommendation. In Proceedings of the 15th ACM Conference on Recommender Systems. 143–153.
- Harper and Konstan (2015) F Maxwell Harper and Joseph A Konstan. 2015. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2015), 1–19.
- Liao (2020) Yiping Liao. 2020. On the Effectiveness of Self-supervised Pre-training for Modeling User Behavior Sequences.
- Lin et al. (2023) Jianghao Lin, Yanru Qu, Wei Guo, Xinyi Dai, Ruiming Tang, Yong Yu, and Weinan Zhang. 2023. MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1384–1395.
- Liu et al. (2021) Chang Liu, Xiaoguang Li, Guohao Cai, Zhenhua Dong, Hong Zhu, and Lifeng Shang. 2021. Noninvasive self-attention for side information fusion in sequential recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4249–4256.
- Liu et al. (2022) Qijiong Liu, Jieming Zhu, Quanyu Dai, and Xiao-Ming Wu. 2022. Boosting deep CTR prediction with a plug-and-play pre-trainer for news recommendation. In Proceedings of the 29th International Conference on Computational Linguistics. 2823–2833.
- Ma et al. (2018) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1930–1939.
- Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692.
- Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th international conference on data mining (ICDM). IEEE, 1149–1154.
- Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
- Wang et al. (2023) Dong Wang, Kavé Salamatian, Yunqing Xia, Weiwei Deng, and Qi Zhang. 2023. BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5039–5050.
- Yao et al. (2021) Tiansheng Yao, Xinyang Yi, Derek Zhiyuan Cheng, Felix Yu, Ting Chen, Aditya Menon, Lichan Hong, Ed H Chi, Steve Tjoa, Jieqi Kang, et al. 2021. Self-supervised learning for large-scale item recommendations. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4321–4330.
- Yuan et al. (2020) Fajie Yuan, Xiangnan He, Alexandros Karatzoglou, and Liguang Zhang. 2020. Parameter-efficient transfer from sequential behaviors for user modeling and recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 1469–1478.
- Zhai et al. (2024) Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Michael He, et al. 2024. Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. arXiv preprint arXiv:2402.17152 (2024).
- Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5941–5948.
- Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1059–1068.
- Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.