
ID-Agnostic User Behavior Pre-training for
Sequential Recommendation

Shanlei Mu1, Yupeng Hou2, Wayne Xin Zhao2†, Yaliang Li3, and Bolin Ding3
1School of Information, Renmin University of China, China
2Gaoling School of Artificial Intelligence, Renmin University of China, China
3Alibaba Group, USA
[email protected]
Abstract.

Recently, sequential recommendation has emerged as a widely studied topic. Existing studies mainly design effective neural architectures to model user behavior sequences based on item IDs. However, this kind of approach relies heavily on user-item interaction data and neglects the attribute- or characteristic-level correlations among similar items preferred by a user. In light of these issues, we propose IDA-SR, which stands for ID-Agnostic User Behavior Pre-training approach for Sequential Recommendation. Instead of explicitly learning representations for item IDs, IDA-SR directly learns item representations from rich text information. To bridge the gap between text semantics and sequential user behaviors, we utilize a pre-trained language model as the text encoder and design a pre-training architecture over sequential user behaviors. In this way, item text can be directly utilized for sequential recommendation without relying on item IDs. Extensive experiments show that the proposed approach achieves comparable results when using only ID-agnostic item representations, and outperforms baselines by a large margin when fine-tuned with ID information.

† Corresponding author

1. Introduction

Over the past decade, sequential recommendation has been a widely studied task (Fang et al., 2020), which learns time-varying user preferences and provides timely information resources as needed. To better model sequential user behaviors, recent approaches (Hidasi et al., 2016; Li et al., 2017; Tang and Wang, 2018; Kang and McAuley, 2018; Sun et al., 2019) mainly focus on designing effective neural architectures for recommender systems, such as Recurrent Neural Networks (RNN) (Hidasi et al., 2016; Li et al., 2017) and Transformers (Kang and McAuley, 2018; Sun et al., 2019). Typically, existing methods learn item representations based on item IDs (i.e., a unique integer associated with each item) and then feed them to the designed sequential recommender. We refer to such a paradigm as ID-based item representation.

In the literature on recommender systems, ID-based item representation has been the mainstream paradigm for modeling items and developing recommendation models. It is conceptually simple and flexibly extensible. However, it also has several limitations for sequential recommendation. First, it relies heavily on user-item interaction data to learn ID-based item representations (Rendle et al., 2010; Hidasi et al., 2016). When interaction data is insufficient, which is widely observed in practice (Liu et al., 2021; Zhou et al., 2020), it is difficult to derive high-quality item representations. Second, there is a discrepancy between real user behaviors and the learned sequential models. Intuitively, user behaviors are driven by underlying user preference, i.e., the preference over different item attributes or characteristics, so that a user behavior sequence essentially reflects the attribute- or characteristic-level correlations among similar items preferred by a user. However, ID-based item representations learn more abstract ID-level correlations, which cannot directly characterize the fine-grained correlations that actually reflect real user preference.

Considering the above issues, we aim to represent items in a more natural way from the user's view, such that we can directly capture attribute- or characteristic-level correlations without explicitly involving item IDs. The basic idea is to incorporate item side information in modeling item sequences, where we learn ID-agnostic item representations that are derived from attribute- or characteristic-level correlations driven by user preference. Specifically, in this work, we utilize rich text information to represent items instead of learning ID-based item representations. As a kind of natural language, item text reflects human cognition about item attributes or characteristics, which provides a general information resource for revealing user preference from sequential behavior. Different from existing studies that leverage item text to enhance ID-based representations (Yu et al., 2020; Zhang et al., 2019) or improve zero-shot recommendations (Ding et al., 2021), this work aims to learn sequential user preference solely based on text semantics without explicitly modeling item IDs.

However, it is not easy to transfer the semantics reflected in item texts to the recommendation task, as the semantics may not be directly relevant to user preference, or may even be noisy for the recommendation task (Chen et al., 2018). There is a natural semantic gap between natural language and user behaviors. To fill this gap, we construct a pre-trained user behavior model that learns user preference by modeling text-level correlations through behavior sequences.

To this end, we propose an ID-Agnostic User Behavior Pre-training approach for Sequential Recommendation, named IDA-SR. Compared with existing sequential recommenders, the most prominent feature of IDA-SR is that it no longer explicitly learns representations for item IDs. Instead, it directly learns item representations from item texts, using a pre-trained language model (Devlin et al., 2019; Brown et al., 2020) as the text encoder to represent items. To adapt text semantics to the recommendation task, the key point is a pre-training architecture built on sequential user behaviors with three important pre-training tasks, namely next item prediction, masked item prediction, and permuted item prediction. In this way, our approach can effectively bridge the gap between text semantics and sequential user behaviors, so that item text can be directly utilized for sequential recommendation without the help of item IDs.

To evaluate the proposed approach, we conduct extensive experiments on four real-world datasets. Experimental results demonstrate that, using only ID-agnostic item representations, the proposed approach achieves results comparable to several competitive recommendation methods, and even performs much better when training data is limited. To better fit the recommendation task, we also provide a fine-tuning mechanism that allows the pre-trained architecture to use the guidance of explicit item ID information. In this way, our approach outperforms baseline methods by a large margin under various settings, a benefit brought by the ID-agnostic user behavior pre-training.

2. Methodology

Given the ordered sequence of user $u$'s historical items up to timestamp $t$, $\{i_{1},\cdots,i_{t}\}$, we aim to predict the next item $i_{t+1}$. In this section, we present the ID-Agnostic user behavior modeling approach for Sequential Recommendation, named IDA-SR. It consists of an ID-agnostic user behavior pre-training stage and a fine-tuning stage. The major novelty lies in the ID-agnostic user behavior pre-training, where we represent items solely based on item texts instead of item IDs. To adapt text semantics to the recommendation task, we incorporate an adapter layer that transforms text representations into item representations, and further design three elaborate pre-training tasks to retain preference characteristics based on user behavior sequences. Figure 1 presents an overview of the pre-training stage. Next, we describe our approach in detail.

Figure 1. Overall illustration of the proposed approach.

2.1. ID-Agnostic User Behavior Pre-training

For sequential recommendation, the key point of learning ID-agnostic representations is to capture sequential preference characteristics from user behavior sequences based on item texts. Since item texts directly describe items’ attributes or characteristics, our pre-training approach tries to integrate item characteristic encoding into sequential user behavior modeling, and further bridge the semantic gap between them.

2.1.1. ID-Agnostic Item Representations

To obtain ID-agnostic item representations, our idea is to utilize a pre-trained language model (PLM) (Devlin et al., 2019; Brown et al., 2020) to encode item texts. Specifically, we use the pre-trained language model BERT as the text encoder to generate item representations from the item text. Given an item $i$ and its corresponding item text $C_{i}=\{w_{1},\cdots,w_{m}\}$, we first add an extra token [CLS] to the item text $C_{i}$ to form the input word sequence $\tilde{C}_{i}=\{{\tt[CLS]},w_{1},\cdots,w_{m}\}$. The input word sequence $\tilde{C}_{i}$ is then fed to the pre-trained BERT model, and we use the embedding of the [CLS] token as the ID-agnostic item representation. In this way, we obtain the item text embedding matrix $\mathbf{M}_{T}\in\mathbb{R}^{|\mathcal{I}|\times\tilde{d}}$, where $\tilde{d}$ is the dimension of the item text embedding.
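To make this concrete, below is a minimal sketch of extracting [CLS] embeddings with a pre-trained BERT, assuming the HuggingFace transformers library and the bert-base-uncased checkpoint; the example item texts are hypothetical.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

# Hypothetical item texts (e.g., title + brand + description).
item_texts = [
    "Organic whole bean coffee, medium roast, 12 oz",
    "Acoustic guitar strings, phosphor bronze, light gauge",
]

with torch.no_grad():
    # The tokenizer prepends [CLS] automatically; we keep its final
    # hidden state as the frozen, ID-agnostic item representation.
    batch = tokenizer(item_texts, padding=True, truncation=True,
                      return_tensors="pt")
    outputs = bert(**batch)
    M_T = outputs.last_hidden_state[:, 0, :]  # (|I|, d~), d~ = 768 here

print(M_T.shape)  # torch.Size([2, 768])
```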

Different from ID-based item representations, ID-agnostic item representations are less sensitive to the quality of the interaction data. Instead, they allow the sequential model to capture attribute- or characteristic-level preferences from user behavior sequences. They are also more robust in cold-start scenarios where a new item appears for recommendation.

2.1.2. Text Semantic Adapter Layer

Although the ID-agnostic item representations generated from PLMs have great expressive ability for item characteristics, not all the encoded semantics in these representations are directly beneficial or useful for sequential user behavior modeling. Therefore, we incorporate a text semantic adapter layer for transforming the original text representations into a form that is more suitable for the recommendation task. The adapter layer is formalized as follows:

(1) \tilde{\bm{m}}_{i}=\sigma\big{(}\sigma(\bm{m}_{i}\mathbf{W}_{1}+\bm{b}_{1})\mathbf{W}_{2}+\bm{b}_{2}\big{)},

where $\bm{m}_{i}\in\mathbf{M}_{T}$ is the input item representation, $\tilde{\bm{m}}_{i}\in\mathbb{R}^{1\times d}$ is the updated item representation, $\mathbf{W}_{1}\in\mathbb{R}^{\tilde{d}\times d}$, $\mathbf{W}_{2}\in\mathbb{R}^{d\times d}$, and $\bm{b}_{1},\bm{b}_{2}\in\mathbb{R}^{1\times d}$ are learnable parameters, and $\sigma(\cdot)$ is the activation function. We thus obtain the updated item text embedding matrix $\tilde{\mathbf{M}}_{T}\in\mathbb{R}^{|\mathcal{I}|\times d}$.

Given an $n$-length item sequence, we apply a look-up operation on $\tilde{\mathbf{M}}_{T}$ to form the input embedding matrix $\mathbf{E}_{T}\in\mathbb{R}^{n\times d}$. Besides, following (Kang and McAuley, 2018), we further incorporate a learnable position encoding matrix $\mathbf{P}\in\mathbb{R}^{n\times d}$ to enhance the input representation of the item sequence. The sequence representation $\mathbf{E}\in\mathbb{R}^{n\times d}$ is then obtained by summing the two embedding matrices: $\mathbf{E}=\mathbf{E}_{T}+\mathbf{P}$.
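A possible PyTorch realization of the adapter layer in Eq. (1) together with the position encoding is sketched below; the hidden sizes and the GELU activation are assumptions, since the paper leaves $\sigma$ and $d$ unspecified.

```python
import torch
import torch.nn as nn

class TextSemanticAdapter(nn.Module):
    """Two-layer MLP of Eq. (1): maps frozen text embeddings (dim d~)
    to recommendation-space item embeddings (dim d)."""
    def __init__(self, d_text: int = 768, d_model: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(d_text, d_model)   # W1, b1
        self.fc2 = nn.Linear(d_model, d_model)  # W2, b2
        self.act = nn.GELU()  # sigma is unspecified in the paper; GELU assumed

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        return self.act(self.fc2(self.act(self.fc1(m))))

n, d = 50, 64                         # sequence length and hidden size (assumed)
adapter = TextSemanticAdapter(d_model=d)
pos_emb = nn.Embedding(n, d)          # learnable position encoding P

M_T = torch.randn(1000, 768)          # stand-in for the frozen BERT [CLS] table
M_tilde = adapter(M_T)                # adapted item table, shape (|I|, d)
item_seq = torch.randint(0, 1000, (n,))           # one toy behavior sequence
E = M_tilde[item_seq] + pos_emb(torch.arange(n))  # E = E_T + P, shape (n, d)
```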

2.1.3. Sequential User Behavior Modeling

The core of sequential recommendation lies in sequential user behavior modeling, where we aim to capture sequential preference characteristics from user behavior sequences. Here, we adopt a classic self-attention architecture (Vaswani et al., 2017) over the above ID-agnostic item representations. A self-attention block generally consists of two sub-layers, i.e., a multi-head self-attention layer (denoted by MultiHeadAttn$(\cdot)$) and a point-wise feed-forward network (denoted by FFN$(\cdot)$). The update process can be formalized as follows:

(2) \mathbf{F}^{l+1}=\text{FFN}(\text{MultiHeadAttn}(\mathbf{F}^{l})),

where $\mathbf{F}^{l}$ is the input to the $l$-th layer. When $l=0$, we set $\mathbf{F}^{0}=\mathbf{E}$.
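The stack in Eq. (2) could be instantiated, for example, with PyTorch's built-in Transformer encoder; the layer count and head number below are illustrative, not values reported in the paper.

```python
import torch
import torch.nn as nn

n, d = 50, 64
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d, nhead=2, dim_feedforward=4 * d,
    batch_first=True,  # inputs shaped (batch, seq_len, d)
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

E = torch.randn(1, n, d)  # F^0 = E, the adapted sequence representation
# Causal mask for the left-to-right tasks: position t attends only to <= t.
causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
F_L = encoder(E, mask=causal)  # F^L, shape (1, n, d)
```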

2.1.4. Text Semantics based User Behavior Pre-training Task

Given the ID-agnostic item representations and self-attention architecture, we next study how to design suitable optimization objectives to learn the parameters of the architecture, which is the key to bridge the semantic gap between text semantics and sequential preference characteristics. Next, we introduce three pre-training tasks derived from user behavior sequences based on text representations.

Next Item Prediction. The next item prediction pre-training task predicts the next item given all previous ones, and has been widely adopted in existing sequential recommendation methods (Kang and McAuley, 2018). Based on this task, we calculate the user's preference over the candidate item set as follows:

(3) P_{pre}(i_{t+1}|S)={\text{softmax}(\mathbf{F}_{t}^{L}\tilde{\mathbf{M}}^{\top}_{T})}_{[i_{t+1}]},

where $S=i_{1:t}$ is the user's historical sequence, and $\mathbf{F}_{t}^{L}$ is the output of the $L$-layer self-attention block at step $t$. This task tries to capture sequential preference based on text semantics.
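A hedged sketch of the scoring in Eq. (3), with randomly initialized tensors standing in for the model's outputs; the sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def next_item_loss(F_t: torch.Tensor, M_tilde: torch.Tensor,
                   target: torch.Tensor) -> torch.Tensor:
    """Score every candidate by the dot product between the step-t hidden
    state and the adapted item text table; cross-entropy equals the
    negative log of Eq. (3), averaged over the batch."""
    logits = F_t @ M_tilde.T          # (batch, |I|)
    return F.cross_entropy(logits, target)

F_t = torch.randn(4, 64)              # hypothetical hidden states for 4 users
M_tilde = torch.randn(1000, 64)       # adapted item table, |I| = 1000
target = torch.randint(0, 1000, (4,))
print(next_item_loss(F_t, M_tilde, target))
```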

Masked Item Prediction. The masked item prediction pre-training task corrupts the input item sequence and tries to reconstruct the original item sequence. Specifically, we randomly mask some items (i.e., replace them with a special token [MASK]) in the input sequences, and then predict the masked items based on their surrounding context. Based on this task, we calculate the user's preference over the candidate item set as follows:

(4) P_{pre}(i_{t}|S)={\text{softmax}(\mathbf{F}_{t}^{L}\tilde{\mathbf{M}}^{\top}_{T})}_{[i_{t}]},

where $S$ is the masked version of the user behavior sequence, in which position $t$ is replaced with [MASK]. This task enhances the overall sequential modeling capacity of the recommendation model.
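The masking step might be implemented as follows; the masking ratio and the reserved [MASK] index are assumptions, as the paper does not report them.

```python
import torch

NUM_ITEMS, MASK_ID = 1000, 1000   # assume index |I| is reserved for [MASK]

def mask_sequence(item_seq: torch.Tensor, mask_prob: float = 0.2):
    """Replace a random fraction of positions with [MASK]; the model is
    trained to recover item_seq[mask] from the corrupted sequence."""
    mask = torch.rand(item_seq.shape) < mask_prob
    corrupted = item_seq.clone()
    corrupted[mask] = MASK_ID
    return corrupted, mask

seq = torch.randint(0, NUM_ITEMS, (10,))
print(mask_sequence(seq))
```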

Permuted Item Prediction. The permuted item prediction pre-training task permutes the items in the original user behavior sequence and then uses the previous items in the permuted sequence to predict the next one. Given the input user behavior sequence $i_{1:t}=\{i_{1},\cdots,i_{t}\}$, we first generate the permuted user behavior sequence $i_{j_{1}:j_{t}}=\{i_{j_{1}},\cdots,i_{j_{t}}\}$ by permuting the items. Then, based on the permuted sequence, we calculate the user's preference over the candidate item set as follows:

(5) P_{pre}(i_{j_{t+1}}|S)={\text{softmax}(\mathbf{F}_{j_{t}}^{L}\tilde{\mathbf{M}}^{\top}_{T})}_{[i_{j_{t+1}}]},

where $S=i_{j_{1}:j_{t}}$ is the permuted user historical sequence. In this way, the context for each position consists of tokens from both left and right, which helps improve performance (Yang et al., 2019).
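The paper does not detail the permutation scheme (e.g., how many permutations are sampled per sequence), so the sketch below is one plausible reading in the spirit of XLNet: shuffle the prefix and predict the held-out next item from the permuted prefix.

```python
import torch

def permute_prefix(item_seq: torch.Tensor):
    """Return a randomly permuted prefix i_{j_1:j_t} and the prediction
    target; each item's context then mixes left- and right-side items."""
    prefix, target = item_seq[:-1], item_seq[-1]
    order = torch.randperm(prefix.numel())
    return prefix[order], target

seq = torch.randint(0, 1000, (10,))
print(permute_prefix(seq))
```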

To combine the three pre-training tasks, we adopt the cross-entropy loss to pre-train our model as follows:

(6) \mathcal{L}_{pre}=-\sum_{u\in\mathcal{U}}\sum_{t\in\mathcal{T}}\log P_{pre}(i_{t}=i_{t}^{*}|S),

where $i_{t}^{*}$ is the ground-truth item and $\mathcal{T}$ is the set of predicted positions.
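Assuming the three task losses are simply summed with equal weight (the paper does not state how they are combined), the objective of Eq. (6) could be computed as in this sketch.

```python
import torch.nn.functional as F

def pretrain_loss(task_outputs):
    """task_outputs: per-task (logits, targets) pairs over the predicted
    positions, e.g. for next, masked, and permuted item prediction; each
    term is the cross-entropy form of Eq. (6)."""
    return sum(F.cross_entropy(logits, targets)
               for logits, targets in task_outputs)
```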

2.2. Fine-tuning for Recommendation

At the pre-training stage, we integrate the text semantics of items into the sequential behavior modeling. Next, we further optimize the architecture according to the recommendation task. Different from previous self-supervised recommendation models (Sun et al., 2019; Zhou et al., 2020), we can fine-tune our approach with or without item IDs.

Fine-tuning without ID. Without adding any extra parameters, we can directly fine-tune the pre-trained model based on the ID-agnostic item representations. In this way, we calculate the user's preference score for item $i$ at step $t$, given the context from the user history, as:

(7) P_{fine}(i_{t+1}|i_{1:t})={\text{softmax}(\mathbf{F}_{t}^{L}\tilde{\mathbf{M}}_{T}^{\top})}_{[i_{t+1}]},

where $\tilde{\mathbf{M}}_{T}$ is the updated item text embedding matrix, and $\mathbf{F}_{t}^{L}$ is the output of the $L$-layer self-attention block at step $t$. Since no additional parameters are incorporated, this setting forces the model to fit the recommendation task in an efficient way.

Fine-tuning with ID. Unlike text information, IDs are more discriminative in representing an item; e.g., an item is easily identified once we know its ID. Therefore, we can further improve the discriminative ability of the pre-trained approach by incorporating additional item ID representations. Specifically, we maintain a learnable item embedding matrix $\mathbf{M}_{I}\in\mathbb{R}^{|\mathcal{I}|\times d}$. We then combine $\mathbf{M}_{I}$ and the item text embedding matrix $\tilde{\mathbf{M}}_{T}$ as the final item representation, and calculate the user's preference score for item $i$ at step $t$, given the context from the user history, as:

(8) P_{fine}(i_{t+1}|i_{1:t})={\text{softmax}\big{(}\tilde{\mathbf{F}}_{t}^{L}(\tilde{\mathbf{M}}_{T}+\mathbf{M}_{I})^{\top}\big{)}}_{[i_{t+1}]}.

Here, we only incorporate item IDs at the fine-tuning stage; the remaining parts (ID-agnostic item representations, adapter layer, and self-attention architecture) have been learned at the pre-training stage. As will be shown in Section 3.2, this fine-tuning method is more effective than simply combining text and ID features.
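A sketch of the scoring in Eq. (8); the sizes are illustrative, and the random M_tilde stands in for the pre-trained adapted text table.

```python
import torch
import torch.nn as nn

num_items, d = 1000, 64
M_I = nn.Embedding(num_items, d)      # freshly initialized ID embeddings
M_tilde = torch.randn(num_items, d)   # stand-in for the pre-trained text table

def score_with_id(F_t: torch.Tensor) -> torch.Tensor:
    """Sum text and ID tables to form the final item representation and
    score all candidates, as in Eq. (8) before the softmax."""
    item_table = M_tilde + M_I.weight  # (|I|, d)
    return F_t @ item_table.T          # logits over all candidate items

print(score_with_id(torch.randn(4, d)).shape)  # torch.Size([4, 1000])
```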

For each setting, we adopt the widely used cross-entropy loss to train the model in the fine-tuning stage.

3. EXPERIMENTS

3.1. Experimental Setup

Table 1. Statistics of the datasets.
Dataset Pantry Instruments Arts Food
# Users 13,101 24,962 45,486 115,349
# Items 4,898 9,964 21,019 39,670
# Actions 126,962 208,926 395,150 1,027,413
Table 2. Performance comparison of different methods on the four datasets. The best performance and the best baseline performance are denoted in bold and underlined fonts, respectively.
Dataset Metric PopRec FPMC GRU4Rec SASRec BERT4Rec FDSA ZESRec S3-Rec IDA-SRt IDA-SRt+ID Improv.
Pantry HR@10 0.0068 0.0373 0.0395 0.0488 0.0311 0.0422 0.0529 0.0509 0.0738 0.0750 +41.78%
NDCG@10 0.0024 0.0196 0.0194 0.0231 0.0160 0.0226 0.0263 0.0242 0.0378 0.0375 +43.73%
Instruments HR@10 0.0133 0.1043 0.1045 0.1103 0.1057 0.1117 0.1076 0.1123 0.1250 0.1304 +16.12%
NDCG@10 0.0039 0.0771 0.0796 0.0787 0.0697 0.0840 0.0711 0.0795 0.0821 0.0872 +3.81%
Arts HR@10 0.0156 0.0958 0.0909 0.1164 0.1096 0.1074 0.0971 0.1093 0.1130 0.1304 +12.03%
NDCG@10 0.0090 0.0684 0.0637 0.0685 0.0774 0.0768 0.0579 0.0692 0.0708 0.0828 +6.98%
Food HR@10 0.0281 0.0940 0.1075 0.1173 0.1119 0.1124 0.0967 0.1163 0.1097 0.1309 +11.59%
NDCG@10 0.0141 0.0746 0.0862 0.0846 0.0792 0.0883 0.0646 0.0864 0.0730 0.0943 +6.80%

3.1.1. Datasets.

We conduct experiments on the Amazon review datasets (Ni et al., 2019), which contain product ratings and reviews in 29 categories on Amazon.com, along with rich textual metadata such as title, brand, and description. We use the version released in 2018. Specifically, we use the 5-core subsets of Pantry, Instruments, Arts, and Food, in which each user and item has at least 5 associated ratings. The statistics of our datasets are summarized in Table 1.
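For reference, 5-core filtering can be computed iteratively, e.g., with pandas; this is a hypothetical illustration, and the column names are assumed.

```python
import pandas as pd

def five_core(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than k interactions
    until both conditions hold everywhere."""
    while True:
        before = len(df)
        df = df[df.groupby("user_id")["item_id"].transform("size") >= k]
        df = df[df.groupby("item_id")["user_id"].transform("size") >= k]
        if len(df) == before:   # fixed point reached: the data is k-core
            return df
```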

3.1.2. Comparison Methods.

We consider the following baselines for comparison: (1) PopRec recommends items according to item popularity; (2) FPMC (Rendle et al., 2010) models behavior correlations with Markov chains; (3) GRU4Rec (Hidasi et al., 2016) applies GRUs to model user behaviors; (4) SASRec (Kang and McAuley, 2018) applies the self-attention mechanism to model user behaviors; (5) BERT4Rec (Sun et al., 2019) applies a bidirectional self-attention mechanism to model user behaviors; (6) FDSA (Zhang et al., 2019) constructs a feature sequence and uses a feature-level self-attention block to model feature transition patterns; (7) ZESRec (Ding et al., 2021) regards pre-trained BERT representations as item representations for cross-domain recommendation; in our setting, we report its results on the source-domain data; (8) S3-Rec (Zhou et al., 2020) pre-trains user behavior models via mutual information maximization objectives for feature fusion. We implement these methods with RecBole (Zhao et al., 2021).

Among the above methods, FPMC, GRU4Rec, SASRec, and BERT4Rec are general sequential recommendation methods that model user behavior sequences using only user-item interaction data. FDSA, ZESRec, and S3-Rec are text-enhanced sequential recommendation methods that model user behavior sequences with extra information from item texts. For our proposed approach, IDA-SRt and IDA-SRt+ID denote the model fine-tuned without and with ID information, respectively.

3.1.3. Evaluation Settings.

To evaluate the performance, we adopt the top-$k$ Hit Ratio (HR@$k$) and top-$k$ Normalized Discounted Cumulative Gain (NDCG@$k$) as evaluation metrics. Following previous works (Kang and McAuley, 2018; Sun et al., 2019), we apply the leave-one-out strategy for evaluation. For each user, we treat all items that the user has not interacted with as negative items.
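For illustration, under leave-one-out with a single ground-truth item per user, both metrics reduce to simple functions of that item's rank; a minimal sketch with hypothetical ranks:

```python
import numpy as np

def hr_ndcg_at_k(rank: np.ndarray, k: int = 10):
    """`rank` holds the 1-based rank of each user's ground-truth item
    among all negative items. HR@k counts hits within the top k; NDCG@k
    is 1/log2(rank+1) for a hit and 0 otherwise."""
    hits = rank <= k
    hr = hits.mean()
    ndcg = np.where(hits, 1.0 / np.log2(rank + 1), 0.0).mean()
    return hr, ndcg

ranks = np.array([1, 3, 12, 7])   # hypothetical ranks for four users
print(hr_ndcg_at_k(ranks, k=10))  # (0.75, ~0.458)
```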

3.2. Experimental Results

In this section, we first compare the proposed IDA-SR approach with the aforementioned baselines on the four datasets, then conduct an ablation study, and finally compare the results on cold-start items.

3.2.1. Main Results

Compared with the general sequential recommendation methods, text-enhanced sequential recommendation methods perform better on some datasets, because item texts are used as auxiliary features that help improve recommendation performance. These results further confirm that semantic information from item text is useful for modeling user behaviors. Comparing the proposed IDA-SRt+ID with all the baselines, it is clear that IDA-SRt+ID consistently outperforms them by a large margin. Different from these baselines, we adopt the ID-agnostic user behavior pre-training framework, which transfers semantic knowledge to guide user behavior modeling through the three self-supervised tasks. In the fine-tuning stage, we then fine-tune the model to utilize the user behavior knowledge from the pre-trained model. In this way, our proposed approach can better capture user behavior patterns and achieve much better results. Besides, even without incorporating item ID information in the fine-tuning stage, IDA-SRt achieves results comparable to the baseline methods that model user behaviors based on ID information. This further illustrates the effectiveness of our proposed approach.

3.2.2. Ablation Study

We examine the performance of IDA-SR's variants by removing each pre-training task from the full approach. We use np, mp, and pp to denote the three pre-training tasks: next item prediction, masked item prediction, and permuted item prediction, respectively. Figure 2 presents the evaluation results. We observe that all three pre-training tasks contribute to the final performance; each helps bridge the semantic gap between text semantics and sequential preference characteristics.

Figure 2. Ablation study: (a) Pantry, (b) Instruments.
Figure 3. Performance comparison w.r.t. cold-start items: (a) Pantry, (b) Instruments. The bar graph shows the number of test instances in each group, and the line chart shows the improved mean rank of the ground-truth item compared with SASRec.

3.2.3. Performance Comparison w.r.t. Cold-start Items

Conventional user behavior modeling methods are likely to suffer from the cold-start item recommendation problem. Our method can alleviate this problem because the proposed ID-agnostic pre-training framework utilizes text semantic information, making the model less dependent on interaction data. To verify this, we split the test data according to the popularity of the ground-truth items in the training data, and then record the improved mean rank of the ground-truth item in each group compared with the baseline method SASRec. From Figure 3, we find that the proposed IDA-SRt and IDA-SRt+ID achieve large improvements when the ground-truth item is extremely unpopular, e.g., in groups [0, 5) and [5, 10). This observation implies that the proposed IDA-SR can alleviate the cold-start item recommendation problem.

4. CONCLUSION

In this paper, we propose an ID-agnostic user behavior modeling approach for sequential recommendation, named IDA-SR. Different from existing sequential recommendation methods that are limited by ID-based item representations, the proposed IDA-SR adopts ID-agnostic item representations based on item texts to support user behavior modeling in a direct and natural way. To bridge the gap between text semantics and sequential user behaviors, IDA-SR pre-trains an architecture over item text representations on sequential user behaviors. Experimental results have shown the effectiveness of the proposed approach in comparison with several competitive baselines, especially when training data is limited. For future work, we will explore more kinds of item side information to represent items and support user behavior modeling, e.g., images and videos.

References

  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS.
  • Chen et al. (2018) Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. 2018. Neural Attentional Rating Regression with Review-level Explanations. In WWW. 1583–1592.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. 4171–4186.
  • Ding et al. (2021) Hao Ding, Yifei Ma, Anoop Deoras, Yuyang Wang, and Hao Wang. 2021. Zero-Shot Recommender Systems. arXiv preprint arXiv:2105.08318 (2021).
  • Fang et al. (2020) Hui Fang, Danning Zhang, Yiheng Shu, and Guibing Guo. 2020. Deep Learning for Sequential Recommendation: Algorithms, Influential Factors, and Evaluations. ACM Trans. Inf. Syst. 39, 1 (2020), 10:1–10:42.
  • Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. In ICLR.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian J. McAuley. 2018. Self-Attentive Sequential Recommendation. In ICDM. 197–206.
  • Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural Attentive Session-based Recommendation. In CIKM. 1419–1428.
  • Liu et al. (2021) Zhiwei Liu, Ziwei Fan, Yu Wang, and Philip S. Yu. 2021. Augmenting Sequential Recommendation with Pseudo-Prior Items via Reversely Pre-training Transformer. In SIGIR. 1608–1612.
  • Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian J. McAuley. 2019. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects. In EMNLP. 188–197.
  • Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized Markov chains for next-basket recommendation. In WWW. 811–820.
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In CIKM. 1441–1450.
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding. In WSDM. 565–573.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998–6008.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS.
  • Yu et al. (2020) Wenhui Yu, Xiao Lin, Junfeng Ge, Wenwu Ou, and Zheng Qin. 2020. Semi-supervised Collaborative Filtering by Text-enhanced Domain Adaptation. In SIGKDD. 2136–2144.
  • Zhang et al. (2019) Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S. Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, and Xiaofang Zhou. 2019. Feature-level Deeper Self-Attention Network for Sequential Recommendation. In IJCAI. 4320–4326.
  • Zhao et al. (2021) Wayne Xin Zhao, Shanlei Mu, Yupeng Hou, Zihan Lin, Yushuo Chen, Xingyu Pan, Kaiyuan Li, Yujie Lu, Hui Wang, Changxin Tian, Yingqian Min, Zhichao Feng, Xinyan Fan, Xu Chen, Pengfei Wang, Wendi Ji, Yaliang Li, Xiaoling Wang, and Ji-Rong Wen. 2021. RecBole: Towards a Unified, Comprehensive and Efficient Framework for Recommendation Algorithms. In CIKM. 4653–4664.
  • Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In CIKM. 1893–1902.