
AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling

Zuowu Zheng1, Xiaofeng Gao1, Junwei Pan2, Qi Luo3, Guihai Chen1, Dapeng Liu2, and Jie Jiang2
1MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China (X. Gao is the corresponding author)
[email protected], {gao-xf, gchen}@cs.sjtu.edu.cn
2Tencent Inc., Shenzhen, China
{jonaspan, rocliu, zeus}@tencent.com
3Shandong University, Shandong, China
[email protected]
Abstract

In Click-through rate (CTR) prediction models, a user's interest is usually represented as a fixed-length vector based on her history behaviors. Recently, several methods have been proposed to learn an attentive weight for each user behavior and conduct weighted sum pooling. However, these methods only manually select several fields from the target item side as the query to interact with the behaviors, neglecting the other target item fields as well as user and context fields. Directly including all these fields in the attention may introduce noise and deteriorate the performance. In this paper, we propose a novel model named AutoAttention, which includes all item/user/context side fields as the query and assigns a learnable weight to each field pair between behavior fields and query fields. Pruning on these field pairs via their learnable weights leads to automatic field pair selection, so as to identify and remove noisy field pairs. Despite including more fields, AutoAttention still has a low computation cost, thanks to its simple attention function and field pair selection. Extensive experiments on the public dataset and Tencent's production dataset demonstrate the effectiveness of the proposed approach.

Index Terms:
Click-Through Rate Prediction, User Behavior Modeling, Recommendation System

I Introduction

Click-through rate (CTR) prediction is one of the most fundamental tasks for online advertising systems, and it has attracted much attention from both industrial and academic communities [1, 2, 3]. Modeling a user's interest through his or her history behaviors on items has proven to be one of the most successful advances in the CTR prediction task [4, 5, 6].

Figure 1: AUC and inference time comparison of the proposed AutoAttention with baselines on the public Alibaba dataset. Sum pooling, DIN, DIEN, and DSIN are four existing methods, which only include several manually selected fields in the attention unit. MAF-C, MAF-S, DIN+, and DotProduct are several proposed baselines which include all available fields. AutoAttention also includes all fields, but conducts field pair selection and achieves a new state-of-the-art AUC with low inference time.

In the Embedding & Multi-Layer Perceptron (MLP) algorithms for online advertising and recommendation systems, a user's interest is usually represented as a fixed-length embedding vector based on her history behaviors [7, 4]. Traditional methods simply take a sum or mean pooling over all behavior embedding vectors to generate one embedding [7]. However, this ignores the fact that some behaviors are more important than others given the target item, user, and context features.

Recently, several user behavior modeling methods have been proposed to calculate attentive weights for different behaviors w.r.t. a given target item and then conduct a weighted sum pooling, such as Deep Interest Network (DIN) [4] and its variants [5, 6]. Even though these methods achieve significant performance lifts, they still suffer from the following limitations:

  • First, in real-world recommendation systems, a user’s interest may not only depend on the target item but also on the user’s demographic features or context features. However, existing works only manually select several fields from the target item side as the query and interact them with each behavior to calculate the attentive weight. It neglects the effect of other fields, including other fields from the target item side, as well as those from the user and context sides. For example, when browsing the game zone of a shopping website, a boy will click a recommended new game The Witcher 3 because he clicked some similar games last week, so the item side fields should be included as all existing works do. Or it’s because he is in the game zone now and any history click on games indicates a strong interest in games. In the latter case, the game zone feature from the context side plays an important role in capturing his interest from behaviors.

  • Second, existing works interact all behavior fields with all target item side fields. Recent studies [8, 9] show that some interactions in attention are unnecessary and harm the performance. Involving more fields as the query may introduce more irrelevant field interactions and further deteriorate the performance.

  • Third, as a part of the input layer of a more complicated DNN model for CTR prediction, the procedure of generating a user interest vector should be lightweight. Unfortunately, most existing methods use an MLP to calculate the attention weight, which leads to high computation complexity.

To resolve these challenges, we propose to include all item/user/context fields as the query in the attention unit, and calculate a learnable weight for each field pair between user behavior fields and these query fields. To avoid introducing noisy field pairs, we further propose to automatically select the most important ones by pruning on these weights. Besides, we adopt a simple dot product function rather than an MLP as the attention function, leading to much lower computation cost. We summarize the AUC as well as the average inference time of AutoAttention and several baseline models in Fig. 1. Except for Sum Pooling, which has a very low inference time due to its simplicity, the proposed AutoAttention achieves a higher AUC than all the other baseline models while keeping the inference time low. The main contributions of this paper are summarized as follows:

  • We propose to involve all item/user/context fields as the query in the attention unit for user interest modeling. A weight is assigned for each field pair between user behavior fields and these query fields. Pruning on weights automates the field pair selection, preventing performance deterioration due to introducing irrelevant field pairs.

  • We propose to use a simple dot product attention, rather than an MLP in existing methods. This greatly reduces the time complexity with comparable or even better performance.

  • We conduct extensive experiments on public and production datasets to compare AutoAttention with state-of-the-art methods. Evaluation results verify the effectiveness of AutoAttention. We also study the learnt field pair weights and find that AutoAttention does identify several field pairs including user or context side fields, which are ignored by expert knowledge in existing works.

The rest of the paper is organized as follows. Section II provides the preliminaries of existing user behavior methods. In Section III, we describe AutoAttention and its connection with several existing methods. Experiment settings and evaluation results are presented in Section IV. Finally, Section V discusses related work and Section VI concludes the paper.

II Preliminaries

Figure 2: Architecture of the proposed AutoAttention. The input features include four parts: user behaviors, target item, user, and context. We use an attention function to calculate the weight of each behavior, the detail of which is depicted in the Attention Unit. It assigns a learnable weight to each field pair between behavior fields and query fields, which consist of all item/user/context fields. Automatic field pair selection is conducted by pruning on these weights. All behaviors are summed based on the attention weights, and then fed into an MLP together with all other feature embeddings.

In this section, we present the preliminaries of user behavior modeling in CTR prediction. A CTR prediction model aims at predicting the probability that a user clicks an item given a context (e.g., time, location, publisher information, etc.). It takes fields from three sides as the input:

$p\text{CTR} = f(\text{user}, \text{item}, \text{context})$

where the user side fields consist of user demographic fields and behavior fields, and item and context denote fields from the item and context sides, respectively. In this paper, we focus on how to capture a user's interest from user behaviors.

Given a user $u$ and her corresponding behaviors $\{\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H}\}$, her interest is represented as a fixed-length vector as follows:

$\bm{v}_{u} = f(\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\dots,\bm{e}_{F_{M}})$ (1)

where $\bm{v}_{i}$ denotes the embedding for the $i$-th behavior, $H$ denotes the length of user behaviors, and $\bm{e}_{F_{j}}\in\mathcal{R}^{K}$ denotes the feature embedding from any other field besides the user behaviors (e.g., item/user/context side fields), i.e., $F_{j}$. Each behavior is usually represented by multiple item side fields. Denoting the set of fields that represent behaviors as $\mathcal{B}=\{B_{p}\}$, each behavior is represented as $\bm{v}_{i}=\sum_{B_{p}\in\mathcal{B}}\bm{v}_{B_{p}}$, where $\bm{v}_{B_{p}}\in\mathcal{R}^{K}$ denotes the feature embedding for the field $B_{p}$ of the $i$-th behavior.

A straightforward way to calculate $\bm{v}_{u}$ is to do a sum or mean pooling over all these $\bm{v}_{i}$ embedding vectors [7]. However, it neglects the importance of each behavior given a specific target item. Recently, a commonly used behavior modeling strategy is to adopt an attention mechanism over the user's historical behaviors. It learns an attentive weight for each behavior $i$ w.r.t. a given target item $t$ and then conducts a weighted sum pooling, i.e., $\bm{v}_{u}=\sum_{i=1}^{H}a(i,t)\bm{v}_{i}$, where $a(i,t)$ denotes an attention function. For example, Deep Interest Network (DIN) considers the influence of the target item on user behaviors [4] and learns larger weights for those behaviors that are more important given the target item, as shown in Eqn. (2).

$\bm{v}_{u} = f(\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H},\bm{e}_{t}) = \sum_{i=1}^{H} a(i,t)\,\bm{v}_{i} = \sum_{i=1}^{H} \text{MLP}(\bm{v}_{i},\bm{e}_{t})\,\bm{v}_{i}$ (2)

where $\bm{e}_{t}$ denotes the embedding vector of the target item $t$, and $\text{MLP}(\cdot)$ denotes an MLP whose output is the attention weight. Following DIN, DIEN [5] further considers the evolution of user interest, and DSIN [6] considers the homogeneity and heterogeneity of a user's interests within and among sessions. DIF-SR [10] proposes to only consider the interaction between corresponding fields of queries and keys within the attention.
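To make the mechanism in Eqn. (2) concrete, the following NumPy sketch shows attention-weighted sum pooling with an MLP attention function. It is only an illustration: the two-layer MLP, its ReLU activation, and all parameter names are placeholders rather than the exact architecture used in DIN.

```python
import numpy as np

def mlp_attention_pooling(behaviors, target, w1, b1, w2, b2):
    """Weighted sum pooling with an MLP attention (cf. Eqn. (2)).

    behaviors: (H, K) matrix of behavior embeddings v_i.
    target:    (K,) embedding e_t of the target item.
    w1, b1, w2, b2: parameters of an illustrative two-layer MLP mapping the
                    concatenated pair (v_i, e_t) to a scalar weight a(i, t).
    """
    H, _ = behaviors.shape
    weights = np.empty(H)
    for i in range(H):
        x = np.concatenate([behaviors[i], target])   # (2K,)
        h = np.maximum(x @ w1 + b1, 0.0)             # hidden layer with ReLU
        weights[i] = h @ w2 + b2                     # scalar attention weight
    return weights @ behaviors                       # weighted sum -> (K,)

# Toy usage with random parameters (shapes only, not a trained model).
rng = np.random.default_rng(0)
H, K, d = 5, 8, 16
v_u = mlp_attention_pooling(
    rng.normal(size=(H, K)), rng.normal(size=K),
    rng.normal(size=(2 * K, d)), np.zeros(d),
    rng.normal(size=d), 0.0,
)
print(v_u.shape)  # (8,)
```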

All existing methods only interact each behavior with several selected fields from the item side within the attention, neglecting other fields, especially those from the user and context sides.

III AutoAttention

In this section, we first describe several straightforward approaches that interact user behavior with all fields in the attention unit. Then we propose AutoAttention to automatically identify and remove irrelevant field pairs which are introduced to the model due to including all fields as the query in the attention. At last, we discuss the model complexity and its connection with several existing approaches.

Mathematically, we learn a user's interest representation $\bm{v}_{u}$ from her historical behaviors based on all fields of the current sample, i.e., all available fields from the target item, user, and context sides.

$\bm{v}_{u}(\bm{x}) = f(\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\dots,\bm{e}_{F_{M}}) = \sum_{i=1}^{H} a(\bm{v}_{i},\bm{e}_{F_{1:M}})\,\bm{v}_{i}$ (3)

where $\bm{v}_{i}$ denotes the embedding for the $i$-th behavior, which is usually a summation of several attribute embeddings of this behavior: $\bm{v}_{i}=\sum_{p}\bm{v}_{B_{p}}$, and $\{\bm{e}_{F_{1}},\dots,\bm{e}_{F_{M}}\}$ denotes the set of embeddings of all fields from the target item, user, and context sides.

III-A Base Models

Before introducing AutoAttention, we first present several straightforward approaches to interact user behaviors with all fields within attention: MLP with All fields (MAF in short) and DotProduct.

III-A1 MLP with All Fields

MAF simply sums or concatenates all field embedding vectors $\bm{e}_{F_{1:M}}$, then feeds them together with the behavior embedding $\bm{v}_{i}$ to an MLP to calculate the weight. There are two ways to construct the input of the first layer of the MLP: element-wise summation of $\bm{e}_{F_{1:M}}$ and $\bm{v}_{i}$ into a $K$-dimensional vector, denoted as MAF-S (MLP with All Fields Summed); or concatenation of $\bm{e}_{F_{1:M}}$ and $\bm{v}_{i}$ into an $(M+1)K$-dimensional vector, denoted as MAF-C (MLP with All Fields Concatenated). Mathematically,

$a_{\text{MAF-S}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \text{MLP}(\bm{v}_{i}\oplus\bm{e}_{F_{1}}\oplus\bm{e}_{F_{2}}\oplus\cdots\oplus\bm{e}_{F_{M}})$
$a_{\text{MAF-C}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \text{MLP}([\bm{v}_{i},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\cdots,\bm{e}_{F_{M}}])$ (4)

where $\oplus$ denotes element-wise summation, $[\cdot]$ denotes concatenation, and $\text{MLP}(\cdot)$ denotes a Multi-Layer Perceptron whose last layer is a single output node activated by the softmax function.
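As a quick illustration of Eqn. (4), the sketch below builds the two MLP inputs; it assumes NumPy arrays with illustrative shapes and omits the MLP itself.

```python
import numpy as np

def maf_inputs(v_i, query_fields):
    """Build the MLP inputs of MAF-S and MAF-C (cf. Eqn. (4)).

    v_i:          (K,) behavior embedding.
    query_fields: (M, K) query-field embeddings e_{F_1..F_M}.
    Returns the MAF-S input of size K and the MAF-C input of size (M+1)K.
    """
    maf_s = v_i + query_fields.sum(axis=0)                   # element-wise sum
    maf_c = np.concatenate([v_i, query_fields.reshape(-1)])  # concatenation
    return maf_s, maf_c

rng = np.random.default_rng(0)
s_in, c_in = maf_inputs(rng.normal(size=8), rng.normal(size=(4, 8)))
print(s_in.shape, c_in.shape)  # (8,) (40,)
```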

III-A2 DotProduct

Dot product is widely used in attention models [11], and it has been shown that it is hard for an MLP to learn a dot product [12]. So we propose another base model that explicitly conducts a dot product between the user behavior embedding and the sum pooling vector over all query fields. We name it DotProduct; formally,

$a_{\text{DotProduct}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \sigma\Big(b+\sum_{j=1}^{M}\langle\bm{v}_{i},\bm{e}_{F_{j}}\rangle\Big) = \sigma\Big(b+\big\langle\bm{v}_{i},\sum_{j=1}^{M}\bm{e}_{F_{j}}\big\rangle\Big)$ (5)

where $\langle\bm{v}_{i},\bm{v}_{j}\rangle=\sum_{k=1}^{K}v_{i,k}\cdot v_{j,k}$ denotes the dot product, $\sigma(\cdot)$ denotes the softmax function, and $b$ denotes the bias term.
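A minimal sketch of the DotProduct attention in Eqn. (5) is given below; it reads $\sigma(\cdot)$ as a softmax over the $H$ behaviors and uses NumPy with illustrative names.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dotproduct_pooling(behaviors, query_fields, b=0.0):
    """DotProduct attention pooling (cf. Eqn. (5)).

    behaviors:    (H, K) behavior embeddings v_i.
    query_fields: (M, K) query-field embeddings e_{F_j}.
    The logit of each behavior is b + <v_i, sum_j e_{F_j}>; the softmax is
    taken over the H behaviors, yielding the user interest vector.
    """
    query_sum = query_fields.sum(axis=0)       # (K,)
    weights = softmax(b + behaviors @ query_sum)
    return weights @ behaviors                 # (K,)

rng = np.random.default_rng(0)
print(dotproduct_pooling(rng.normal(size=(5, 8)), rng.normal(size=(3, 8))).shape)
```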

III-B AutoAttention

In the above base models, all available fields are considered as the query in the attention function. However, there are still several concerns: 1) simply involving all fields as the query ignores the fact that some fields are more important than others when interacting with each behavior field; 2) in real-world industrial systems the number of fields is large, and including all of them as the query may increase the computation cost of the model; 3) some field pairs are irrelevant or noisy for capturing user interests, leading to performance deterioration when included in the attention.

To tackle the above challenges, we assign a weight $R_{B_{p},F_{j}}$ to model the interaction strength of each field pair between a behavior field $B_{p}$ and a query field $F_{j}$. These field-pair-wise weights are learnable and trained together with all the other parameters. We name our approach AutoAttention. Mathematically,

$a_{\text{AutoAttention}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \sigma\Big(b+\sum_{p=1}^{P}\sum_{j=1}^{M}\langle\bm{v}_{B_{p}},\bm{e}_{F_{j}}\rangle\, R_{B_{p},F_{j}}\Big)$ (6)
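The sketch below spells out the AutoAttention logit of Eqn. (6) for a single behavior; the field-pair weights $R$ are stored as a $P\times M$ matrix whose pruned entries are simply zero. Shapes and names are illustrative.

```python
import numpy as np

def autoattention_logit(behavior_fields, query_fields, R, b=0.0):
    """Pre-softmax attention logit of one behavior (cf. Eqn. (6)).

    behavior_fields: (P, K) field embeddings v_{B_p} of this behavior.
    query_fields:    (M, K) query-field embeddings e_{F_j}.
    R:               (P, M) learnable field-pair strength weights R_{B_p,F_j};
                     a pruned field pair corresponds to a zero entry.
    """
    dots = behavior_fields @ query_fields.T    # (P, M) pairwise dot products
    return b + float((dots * R).sum())

rng = np.random.default_rng(0)
P, M, K = 2, 15, 64
R = rng.normal(size=(P, M))
R[0, 3] = 0.0                                  # e.g., a pruned (noisy) field pair
print(autoattention_logit(rng.normal(size=(P, K)), rng.normal(size=(M, K)), R))
```

The softmax over all $H$ behaviors is then applied to these logits in the same way as in DotProduct.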

Directly including all query fields in the attention function and interacting all of them with the behavior fields may introduce irrelevant or noisy field pairs. To identify and remove them, we conduct automatic field pair selection by pruning on the field-pair-wise weights $R$. There are many empirical studies in the weight pruning area [13, 14, 15]. For simplicity, we adopt the standard iterative pruning algorithm used in [15]. An illustration of AutoAttention is presented in Fig. 2.

The pruning algorithm is depicted in Alg. 1. We first train the model for a few epochs to initialize the weights $R$, and then prune the entries of $R$ with the bottom-$S\%$ lowest magnitudes. We gradually increase the sparsity rate $S\%$ such that it increases faster in the early phase, when the network is stable, and slower in the late phase, when it becomes sensitive. Other approaches such as regularization [16] could also be used here.

Existing methods heavily rely on expert knowledge on selecting relevant fields to involve them in the attention. For example, in DIN [4], the authors manually select three fields from the target item side: item_id, shop_id and category_id. Such expert knowledge is not always feasible and accurate. AutoAttention avoids such reliance on expert knowledge by automatic field pair selection.

Input: Field pair strength weights $R$, target sparsity rate $S$, damping ratios $D$ and $U$.
1 Warm up: initialize the whole network by training $i$ epochs;
2 Pruning procedure:
3 for iteration $j=1,2,\dots$ do
4     Train the model for one iteration;
5     Update the current sparsity rate $S_{j}\leftarrow S\times(1-D^{j/U})$;
6     Prune the bottom-$S_{j}\%$ lowest-magnitude weights in $R$;
7 end
Online prediction: use the fine-tuned sparse model to make predictions.
Algorithm 1: Field pair selection training procedure
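A compact sketch of the iterative pruning procedure in Alg. 1 is shown below, assuming the field-pair weights $R$ are a NumPy matrix; the call that trains the full CTR model for one iteration is left as a placeholder.

```python
import numpy as np

def prune_bottom(R, sparsity):
    """Zero out the bottom-`sparsity` fraction of entries of R by magnitude (Alg. 1, line 6)."""
    k = int(sparsity * R.size)
    if k == 0:
        return R
    threshold = np.sort(np.abs(R).ravel())[k - 1]
    return np.where(np.abs(R) <= threshold, 0.0, R)

def current_sparsity(S, D, U, j):
    """Sparsity schedule of Alg. 1, line 5: grows quickly at first, then saturates at S."""
    return S * (1.0 - D ** (j / U))

# Skeleton of the pruning loop; `train_one_iteration` is a placeholder for one
# gradient step over all model parameters (embeddings, MLP, R, ...).
R = np.random.default_rng(0).normal(size=(2, 15))
S, D, U = 0.6, 0.8, 100
for j in range(1, 501):
    # train_one_iteration(model)              # placeholder for one training step
    R = prune_bottom(R, current_sparsity(S, D, U, j))
```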

III-C Model Training

Following [4], after we extract a user's interest $\bm{v}_{u}$, we concatenate it with all the other feature embeddings and feed them into an MLP:

$\hat{y} = \text{sigmoid}(\text{MLP}([\bm{v}_{u},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\cdots,\bm{e}_{F_{M}}]))$ (7)

We then minimize the following cross-entropy loss during model training:

$L(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}\big(y_{i}\log\hat{y}_{i}+(1-y_{i})\log(1-\hat{y}_{i})\big)+\lambda\left\|\Theta\right\|_{2}$ (8)

where $N$ denotes the number of training samples, $\Theta$ denotes all trainable parameters, $y_{i}\in\{0,1\}$ denotes the label, and $\lambda\left\|\Theta\right\|_{2}$ denotes the $L_{2}$ regularization term.
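For completeness, a sketch of the prediction and training objective of Eqns. (7)-(8) is given below; `mlp_fn` is a placeholder for the top MLP, the regularization term is computed as the $L_{2}$ norm exactly as written, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_ctr(v_u, field_embeddings, mlp_fn):
    """Eqn. (7): concatenate the interest vector with all other field embeddings
    and feed the result through the top MLP (`mlp_fn` is a placeholder)."""
    x = np.concatenate([v_u] + list(field_embeddings))
    return sigmoid(mlp_fn(x))

def ctr_loss(y_true, y_pred, params, lam):
    """Cross-entropy plus L2 regularization (cf. Eqn. (8))."""
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    ce = -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    l2 = lam * np.sqrt(sum(float((p ** 2).sum()) for p in params))
    return ce + l2

# Toy usage with a linear stand-in for the MLP and random inputs.
rng = np.random.default_rng(0)
w = rng.normal(size=8 + 3 * 8)
p = predict_ctr(rng.normal(size=8), rng.normal(size=(3, 8)), lambda x: x @ w)
print(ctr_loss(np.array([1.0]), np.array([p]), [w], lam=1e-4))
```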

III-D Discussion

TABLE I: A summary of model complexities. $M$ denotes the number of query fields, $P$ the number of behavior fields, $K$ the dimension of embedding vectors, $d$ and 1 the numbers of neurons of the two-layer MLP in the attention, and $H$ the length of user behaviors. Note that the time complexity of $H$ behaviors includes the complexity of weight calculation and of weighted sum pooling over $H$ behaviors. We list the estimated FLOPs and the number of parameters on the Alibaba dataset with the experimental settings of Section IV-B, i.e., $M=15, P=2, K=64, d=200, H=50$.
Model | Time Complexity (One Behavior) | Estimated FLOPs (One Behavior) | Time Complexity ($H$ Behaviors) | #Parameters of One Behavior | Estimated #Parameters
Sum Pooling | $O(1)$ | 0 | $O(HK)$ | 0 | 0
DIN | $O(dK^{2})$ | 1,642,496 | $O(K^{2}+dKH)$ | $dK^{2}+2d+1$ | 819,601
DIEN | $O(K^{3}+dK^{2})$ | 1,695,232 | $O(K^{3}+dKH)$ | $dK^{2}+12K^{2}+6K+2d+1$ | 869,137
DSIN | $O(H^{2}K+dK^{2})$ | 3,380,864 | $O(H^{2}K+K^{2}+dKH)$ | $2dK^{2}+19K^{2}+8K+4d+2$ | 1,717,538
DotProduct | $O(MK)$ | 2,112 | $O(MK+HK)$ | 1 | 1
AutoAttention | $O(PMK)$ | 5,952 | $O(HPMK)$ | $PM+1$ | 31
Figure 3: Model architecture comparison. DIN uses an MLP as the attention function, with several manually selected fields as the query. CFI considers the corresponding field interactions between behavior fields and query fields. Both DotProduct and AutoAttention consider all fields in the query. DotProduct uses a dot product between behavior fields and query fields, and AutoAttention learns a weight $R_{B_{p},F_{j}}$ for each behavior field and query field pair $(B_{p},F_{j})$. A darker red color represents a higher strength weight of the field pair.

III-D1 Model Complexity

The time complexity of the proposed DotProduct and AutoAttention is $O(MK)$ and $O(PMK)$ per behavior, respectively, where $M$ denotes the number of all the other fields, $P$ the number of behavior fields, and $K$ the dimension of embedding vectors. With field pair selection, the inference time complexity of AutoAttention can be further reduced, since redundant field pairs are removed during model training. With a sparsity rate of $S\%$, its inference time complexity is reduced to $O(S\%\cdot PMK)$.

As for space complexity, DotProduct introduces only one parameter, i.e., the bias term. AutoAttention introduces the field pair strength weights $R$ and the bias term, i.e., $PM+1$ parameters. With a sparsity rate of $S\%$, AutoAttention's space complexity becomes $S\%\cdot PM+1$. We summarize the complexities and model architectures of these models in Tab. I and Fig. 3, respectively.

We also report the estimated number of Floating Point Operations (FLOPs), as well as the number of parameters, under the settings of Section IV-B on the public Alibaba dataset. As shown in Tab. I, DotProduct and AutoAttention only take thousands of FLOPs to extract user interests. Compared to DNN-based methods, AutoAttention is at least two hundred times faster and also introduces fewer parameters, which makes it a preferable choice in real-world online advertising systems.

III-D2 Comparison to Self-Attention

Self-attention is widely used in NLP [11], CV [17], and recommender systems [18]. In self-attention, the attention weight is a softmax over the dot product between the query $Q$ and key $K$. The value $V$ is then multiplied by these attention weights to get the final output. The proposed DotProduct can be viewed as a self-attention that takes all other fields as the query, and the user behavior as the key and value. AutoAttention further assigns a field-pair-wise weight to each field interaction. Recent works [8, 9] on self-attention also reveal that some field pairs (e.g., position cross position) are critical while others are noisy within the attention, indicating the necessity of automatic field pair selection.

III-D3 Comparison to CFI

Recently, [10] proposed to only interact a field from the behavior with the corresponding field from the target item in the attention. For example, the category field of a behavior is only interacted with the category field of the target item. We name this approach CFI (Corresponding Field Interaction). CFI assumes that the corresponding field pairs are the most important ones and that all the other pairs are noisy. We compare CFI with AutoAttention in Sec. IV-D.

IV Experiment

TABLE II: Statistics of the datasets.
      Datasets       #Train Samples       #Test Samples       #Fields       #Features       #Items       Positive Ratio
      Alibaba       5,544,213       660,694       15       1,657,981       512,431       5.147%
      Tencent       6,666,928       1,125,130       15       1,030,047       388,195       14.167%

In this section, we evaluate AutoAttention on two real-world datasets: the public Alibaba Display Ad CTR dataset and Tencent's production CTR dataset. The code is publicly available at https://github.com/waydrow/AutoAttention. We aim to answer the following research questions:

  • RQ1: How does AutoAttention perform compared with existing user interest methods?

  • RQ2: Compared with existing methods, AutoAttention includes more fields in the attention unit and then conducts field pair selection by pruning on field pair weights. Is it the additional fields or the field pair selection that contributes more to the performance lift?

  • RQ3: AutoAttention can be used as a building block of the attention module in more complicated user interest methods, such as DIEN and DSIN. Does replacing the vanilla attention unit with AutoAttention bring a performance lift?

  • RQ4: Which field pairs are regarded as the most important ones by AutoAttention? Are there any important field pairs identified by AutoAttention but ignored by existing methods or expert knowledge? What are these ignored field pairs?

IV-A Datasets and Baselines

We use the following two datasets for performance comparison. Their statistics are presented in Tab. II.

  • Alibaba Dataset [6] (https://tianchi.aliyun.com/dataset/dataDetail?dataId=56) is a public advertising dataset released by Alibaba. It randomly samples 1,140,000 users from the Taobao website over 8 days of click logs (26 million records) to generate the original dataset. Following [6], we use the first 7 days' samples as the training set (2017-05-06 to 2017-05-12) and the next day's samples as the testing set (2017-05-13). We keep each user's most recent 50 behaviors. Please note that we only extract the user click behaviors whose click time is before the target item, to prevent information leakage.

  • Tencent CTR Dataset is collected by sampling user click logs for one week from Tencent’s advertising CTR log. We use samples from 2021-09-05 to 2021-09-10 as the training set, and samples on 2021-09-11 as the testing set. The data preprocessing strategy is the same as that of the Alibaba dataset.

We compare AutoAttention with the following baseline approaches:

  • Sum Pooling conducts a sum pooling without weights on the user’s behavior embeddings to generate a fixed-length user interest representation.

  • DIN [4] conducts a weighted sum pooling over user behaviors. The attention weight is calculated over the user behavior and several manually selected fields from the item side. The attention is implemented as an outer product between the user behavior embedding and those selected item side embeddings, followed by an MLP.

  • DIEN [5] uses a GRU encoder to capture the behavior dependencies, followed by another GRU with an attentional update gate to depict interest evolution.

  • DSIN [6] captures users’ homogeneous interests in each session and heterogeneous interests in different sessions.

  • GRU4Rec [19] uses a GRU with ranking based loss to model user sequences for session based recommendation.

  • SASRec [20] uses a left-to-right Transformer to capture users’ behaviors for sequential recommendation.

  • BERT4Rec [21] uses a bi-directional self-attention to model user behaviors.

  • BST [22] uses self-attention and target-attention together to model user behaviors.

  • CFI [10] considers the corresponding field interactions for sequential recommendation.

IV-B Experimental Settings

All methods are implemented in TensorFlow 1.4 with Python 3.5 and trained from scratch on an NVIDIA TESLA M40 GPU with 24 GB of memory. For baseline methods, we follow the hyper-parameter settings in their original papers and further fine-tune them on our datasets. For both datasets, we set the maximum user behavior length $H$ to 50. The embedding dimension $K$ is 64 for all features. The dimensions of the hidden layers of the three-layer MLP are 200, 80, and 1, with PReLU, PReLU, and Softmax activation functions, respectively. We use Adagrad [23] as the optimizer with a learning rate of 0.01. The batch size is 4,096 and 16,384 for the training and testing set, respectively. For DIN, DIEN, and DSIN, the dimensions of the two-layer MLP in the local activation unit are 200 and 1, with the Dice activation function [4]. For DSIN, we divide user behavior sequences into 5 sessions, and the maximum user behavior length of each session is 10. For AutoAttention, the target sparsity rate $S$ is 0.6 and 0.8 for the Alibaba and Tencent datasets, respectively. The damping ratios $D$ and $U$ are set to 0.8 and 100, respectively.

We use user-weighted AUC as the evaluation metric [4], which measures the goodness of sample ranking for each user. A vanilla AUC is first calculated over the samples of each user, and we then conduct a weighted average over these AUCs, using the number of samples of each user as the weight. For simplicity, we still refer to it as AUC in this paper.

$\text{AUC} = \frac{\sum_{i=1}^{n}\#\text{impression}_{i}\times\text{AUC}_{i}}{\sum_{i=1}^{n}\#\text{impression}_{i}}$ (9)

where $n$ denotes the number of users, and $\#\text{impression}_{i}$ and $\text{AUC}_{i}$ denote the number of impressions and the AUC of the $i$-th user, respectively.
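The sketch below computes the user-weighted AUC of Eqn. (9), using scikit-learn's roc_auc_score for the per-user AUC; skipping users whose labels are all of one class is our assumption, since their AUC is undefined.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def user_weighted_auc(user_ids, labels, scores):
    """User-weighted AUC (cf. Eqn. (9)): per-user AUC averaged with the number
    of impressions of each user as the weight."""
    user_ids = np.asarray(user_ids)
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    num, den = 0.0, 0.0
    for u in np.unique(user_ids):
        mask = user_ids == u
        y = labels[mask]
        if y.min() == y.max():        # single-class user: AUC undefined, skip
            continue
        num += mask.sum() * roc_auc_score(y, scores[mask])
        den += mask.sum()
    return num / den

print(user_weighted_auc([1, 1, 1, 2, 2], [1, 0, 0, 1, 0], [0.9, 0.2, 0.4, 0.3, 0.6]))
```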

IV-C Performance Comparison (RQ1)

TABLE III: Experiment results of AutoAttention and baselines on the public Alibaba dataset and Tencent dataset. The bold value marks the best one in each column, while the underlined value corresponds to the second best one.
Model | Alibaba Loss (mean±std) | Alibaba AUC (mean±std) | Alibaba AUC Impv. | Tencent Loss (mean±std) | Tencent AUC (mean±std) | Tencent AUC Impv.
Sum Pooling | 0.2083±0.00036 | 0.6024±0.00015 | - | 0.3561±0.00185 | 0.7125±0.00015 | -
DIN | 0.2052±0.00013 | 0.6055±0.00007 | 0.515% | 0.3545±0.00096 | 0.7173±0.00050 | 0.674%
DIEN | 0.2033±0.00020 | 0.6069±0.00062 | 0.747% | 0.3539±0.00032 | 0.7236±0.00015 | 1.558%
DSIN | 0.2008±0.00034 | 0.6094±0.00007 | 1.162% | 0.3526±0.00010 | 0.7285±0.00006 | 2.246%
GRU4Rec | 0.2031±0.00081 | 0.6043±0.00040 | 0.315% | 0.3536±0.00054 | 0.7148±0.00004 | 0.323%
SAS4Rec | 0.2014±0.00043 | 0.6043±0.00015 | 0.315% | 0.3537±0.00029 | 0.7144±0.00031 | 0.267%
BERT4Rec | 0.2027±0.00026 | 0.6049±0.00076 | 0.415% | 0.3542±0.00016 | 0.7152±0.00063 | 0.379%
BST | 0.2016±0.00002 | 0.6050±0.00037 | 0.432% | 0.3540±0.00005 | 0.7160±0.00052 | 0.491%
CFI | 0.1983±0.00007 | 0.6115±0.00047 | 1.511% | 0.3516±0.00019 | 0.7349±0.00083 | 3.144%
AutoAttention | 0.1945±0.00062 | 0.6156±0.00053 | 2.191% | 0.3509±0.00081 | 0.7380±0.00040 | 3.579%

The experiment results of comparison between existing methods and our proposed AutoAttention on both datasets are shown in Tab. III. All experiments are repeated 5 times and the averaged results are reported.

The sum pooling method is treated as the baseline. DIN gets 0.52% and 0.67% relative AUC lift on the two datasets compared with sum pooling, since it considers the different importance of each behavior. GRU4Rec, SAS4Rec, BERT4Rec, and BST, which consider sequence dependencies, achieve similar performance. DIEN and DSIN take both into account and further improve AUC. CFI gets 1.51% and 3.14% AUC lift respectively, which shows the advantage of corresponding field interaction.

AutoAttention significantly lifts the AUC on two datasets by 2.19% and 3.58%, respectively. Please note that even a 0.1% AUC lift is huge and usually leads to a decent Gross Merchandise Volume (GMV) lift in online advertising systems [24].

IV-D Study of AutoAttention (RQ2)

So far, all baselines only consider several item side fields as specified in their papers, while the proposed AutoAttention uses all fields as the query and then conducts field pair selection. One may wonder whether the performance lift is mainly due to including more fields, or due to the selection.

IV-D1 Effect of additional fields

To answer this question, we first equip several baseline models with all fields. Specifically, we include all fields in three baseline models, DIN, DIEN, and DSIN, so that they use the same fields as AutoAttention. We denote these three baselines with all fields as DIN+, DIEN+, and DSIN+. We also present the performance of the three baselines DotProduct, MAF-S, and MAF-C, which already consider all fields, and of AutoAttention-w/oP, a variant of AutoAttention without pruning on field pairs. The results are summarized in Tab. IV.

DIN+, DIEN+, and DSIN+ get some performance lift compared to their original models, due to the inclusion of additional fields. For example, DSIN+ improves AUC by 0.0021 and 0.0029 on the two datasets. MAF-S and MAF-C achieve comparable performance with DIEN+. However, they are still worse than DotProduct and AutoAttention-w/oP. AutoAttention-w/oP achieves the best result among these baselines. This indicates that explicitly modeling the field interaction strength between user behavior fields and query fields is necessary.

IV-D2 Effect of automatic field pair selection

CFI also conducts field pair selection, by manually selecting the corresponding field pairs, e.g., only interacting the behavior category field with the target item category field. We compare it with AutoAttention to investigate the effect of automatic field pair selection. To make a fair comparison, we keep the same number of field pairs in AutoAttention as in CFI, naming this variant AutoAttention-. Please note that CFI selects $P$ fields from the target item side and then interacts each behavior field $B_{p}\in\mathcal{B}$ with its corresponding field among the selected ones. Therefore, AutoAttention- only keeps the top-$P$ field pairs, where $P=2$ on the Alibaba dataset and $P=3$ on the Tencent dataset.

As shown in Tab. IV, AutoAttention- performs better than CFI. Furthermore, the relative lift of AutoAttention over CFI is 0.67% and 0.42%. We also compare the selected field pairs of these two methods in the following two ways: a) We compare the top-$P$ field pairs selected by AutoAttention with the $P$ corresponding field pairs in CFI. On the Alibaba dataset, CFI considers two field pairs, (cate_id, cate_id) and (brand, brand), where the first field in each pair is a query field and the second one is a behavior field. However, AutoAttention- selects (cate_id, brand) and (brand, brand). This indicates that automatic field pair selection does not always rank the corresponding field pairs the highest; b) We check the rank of the corresponding field pairs according to the strength weights of AutoAttention: (brand, brand) ranks 2nd and (cate_id, cate_id) ranks 4th. This indicates that the corresponding field pairs are indeed important according to AutoAttention, but not always the most important ones.

TABLE IV: Performance comparison of all baseline models using all fields and field pair selection.
Model | Alibaba Loss | Alibaba AUC | Tencent Loss | Tencent AUC
DIN+ | 0.2020±0.00027 | 0.6083±0.00013 | 0.3541±0.00064 | 0.7196±0.00025
DIEN+ | 0.2011±0.00037 | 0.6092±0.00046 | 0.3530±0.00011 | 0.7268±0.00017
DSIN+ | 0.1984±0.00032 | 0.6115±0.00058 | 0.3515±0.00063 | 0.7314±0.00014
MAF-S | 0.2015±0.00023 | 0.6087±0.00043 | 0.3531±0.00021 | 0.7269±0.00032
MAF-C | 0.2013±0.00015 | 0.6089±0.00046 | 0.3529±0.00060 | 0.7273±0.00031
DotProduct | 0.1992±0.00092 | 0.6108±0.00012 | 0.3518±0.00048 | 0.7312±0.00031
AutoAttention-w/oP | 0.1969±0.00038 | 0.6134±0.00006 | 0.3512±0.00031 | 0.7369±0.00012
CFI | 0.1983±0.00007 | 0.6115±0.00047 | 0.3516±0.00019 | 0.7349±0.00083
AutoAttention- | 0.1972±0.00029 | 0.6124±0.00025 | 0.3515±0.00012 | 0.7366±0.00034
AutoAttention | 0.1945±0.00062 | 0.6156±0.00053 | 0.3509±0.00081 | 0.7380±0.00040

IV-D3 Effect of different sparsity rates

Figure 4: Effect of different sparsity rates $S$ on the field pair weights on the two datasets ((a) Alibaba Dataset, (b) Tencent Dataset).

In this section, we study the impact of pruning with various sparsity rates in AutoAttention. The results are shown in Fig. 4.

At the beginning (the leftmost point), AutoAttention with $S\%=0$ means that we assign a different weight to each field pair but do not prune any of them. Its performance increases with a higher sparsity rate and reaches the best results at $S\%=0.6$ and $S\%=0.8$ on the two datasets, respectively. The AUC lift is 0.36% and 0.15% compared to no pruning. This verifies that identifying and removing irrelevant field pairs leads to a performance lift. As the sparsity rate becomes even higher, the performance deteriorates rapidly, due to the pruning of important field pairs.

IV-E AutoAttention as a Building Block (RQ3)

TABLE V: Experiments of replacing the original attention unit by DotProduct and AutoAttention in DIEN and DSIN.
Model | Attention | Alibaba Loss | Alibaba AUC | Tencent Loss | Tencent AUC
DIEN | Original | 0.2033±0.00015 | 0.6069±0.00025 | 0.3539±0.00027 | 0.7236±0.00040
DIEN | DotProduct | 0.2027±0.00016 | 0.6076±0.00005 | 0.3535±0.00009 | 0.7244±0.00018
DIEN | AutoAttention | 0.2020±0.00023 | 0.6080±0.00032 | 0.3531±0.00010 | 0.7258±0.00035
DSIN | Original | 0.2008±0.00008 | 0.6094±0.00005 | 0.3526±0.00037 | 0.7285±0.00076
DSIN | DotProduct | 0.2001±0.00034 | 0.6098±0.00042 | 0.3523±0.00019 | 0.7291±0.00013
DSIN | AutoAttention | 0.1998±0.00070 | 0.6101±0.00047 | 0.3521±0.00068 | 0.7294±0.00062

AutoAttention can be treated as a general-purpose attention unit. In addition to using it standalone to model a user's interest, we can also replace the original attention unit within DIEN and DSIN with the proposed DotProduct and AutoAttention. Note that we cannot obtain the original field embeddings of user behaviors, since DIEN and DSIN use hidden states to represent behaviors within the attention. Therefore, we only assign field-wise weights to fields from the target item, rather than field-pair-wise weights. In DIEN, we replace its attention function with AutoAttention:

$\alpha_{t} = \sigma\Big(b+\sum_{j=1}^{M^{\prime}}\langle\bm{h}_{t},\bm{e}_{F_{j}}\rangle\, R_{F_{j}}\Big)$ (10)

where $M^{\prime}$ denotes the number of fields from the target item side, and $\bm{h}_{t}$ is the hidden state of the user behavior at step $t$. In DSIN, we replace the attention function in the session interest activating layer. The formulation is similar to Eqn. (10), except that we use the embedding vector of a session interest instead of the hidden state of a user behavior.
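As a sketch, the field-wise variant of Eqn. (10) used inside DIEN can be written as follows; the hidden states, item-field embeddings, and the softmax over time steps are NumPy placeholders with illustrative shapes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fieldwise_attention(hidden_states, item_fields, R, b=0.0):
    """Field-wise attention weights alpha_t (cf. Eqn. (10)).

    hidden_states: (H, K) GRU hidden states h_t of the user behaviors.
    item_fields:   (M', K) target-item field embeddings e_{F_j}.
    R:             (M',) learnable field-wise strength weights R_{F_j}.
    Uses sum_j <h_t, e_{F_j}> R_{F_j} = <h_t, sum_j R_{F_j} e_{F_j}>.
    """
    weighted_query = (item_fields * R[:, None]).sum(axis=0)   # (K,)
    return softmax(b + hidden_states @ weighted_query)        # (H,)

rng = np.random.default_rng(0)
print(fieldwise_attention(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)), rng.normal(size=3)))
```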

As shown in Tab. V, the performance of DIEN and DSIN can be further boosted with the two proposed attention units. Note that replacing the original attention unit with AutoAttention introduces only marginal additional parameters and computational cost, which is negligible compared with the cost of the two methods themselves.

IV-F Visualization of field pair selection (RQ4)

Figure 5: Heat maps of the learnt field pair strength weights $R$ of AutoAttention on the two datasets ((a) Alibaba Dataset, (b) Tencent Dataset). The cells with a red box denote the field pairs selected by AutoAttention.

We now verify whether the learnt field pair weights really reflect the importance of each field interaction, and whether we can recognize important field pairs that are neglected by expert knowledge. We visualize the learnt field pair strength weights $R$ as heat maps on the two datasets, shown in Fig. 5(a) and Fig. 5(b), where the x-axis and y-axis denote the fields of the current sample (including item/user/context fields) and the user behavior fields, respectively.

We observe that field pairs within the item side fields indeed play an important role within the attention, such as (cate_id, brand), (brand, brand), (creative_id, ad_id), and (ams_second_industry_id, creative_id). In addition, we also observe that some field pairs from other sides are important, such as (scenario_id, cate_id) and (user_grade, ad_id). These field pairs are usually neglected when manually selecting fields or field pairs, but can be identified by AutoAttention.

V Related Works

In this section, we discuss two research areas related to our work, i.e., CTR prediction and user behavior modeling.

V-A CTR Prediction

CTR prediction is one of the most fundamental tasks in online advertising and recommendation systems, which aims at predicting the probability that a user clicks an item or ad. Pioneering works on CTR prediction are mainly based on Logistic Regression (LR) [3, 1, 2], polynomial models [25], collaborative filtering [26], tree models [27], Bayesian models [28], etc. In order to explicitly model the feature interactions, many factorization machine based methods have been proposed for high-dimensional data, such as Factorization Machine (FM) [29], Field-aware Factorization Machine (FFM) [30], Field-weighted Factorization Machine (FwFM) [31, 32], and Field-matrixed Factorization Machine (FmFM) [33]. Besides, there are some works that aim at learning weights for different feature interactions, including Attentional Factorization Machines (AFM) [34], Dual-attentional Factorization Machines (DFM) [35], and Dual Input-aware Factorization Machines (DIFM) [36].

Since the number of samples and the dimension of features have become larger and larger, many deep learning based models have been proposed, such as Wide&Deep [24], Deep Crossing [37], YouTube Recommendation [7], PNN [38], Deep&Cross [39]. There are some studies that combine FM with DNN, such as DeepFM [40], NFM [41], xDeepFM [42], InterAtt [43], DeepLight [15] and DCN V2 [44].

V-B User Behavior Modeling

Traditional methods take a straightforward way to represent each behavior with an embedding vector, and then do a sum or mean pooling over all these embedding vectors to generate one embedding [7]. Then many works propose to assign a dynamic weight for each behavior and then conduct weighted sum pooling, such as Deep Interest Network (DIN) [4], Deep Interest Evolution Network (DIEN) [5], and Deep Session Interest Network (DSIN) [6]. There are also many works to use RNN or Transformer for behavior sequence modeling, including GRU4Rec [19], SAS4Rec [20], BERT4Rec [21], and Behavior Sequence Transformer (BST) [22].

Recently, there are also some works that further consider long-term historical behavior sequences, such as the Multi-channel user Interest Memory Network (MIMN) [45], Hierarchical Periodic Memory Network (HPMN) [46], Search-based Interest Model (SIM) [47], UBR4CTR [48], ETA [49], and LimaRec [50].

VI Conclusion

In this paper, we propose an efficient user interest model, AutoAttention, for CTR prediction. We propose to include all fields from the item/user/context sides as the query fields and interact them with the behavior fields within the attention. We assign a learnable weight to each field pair between behavior fields and query fields to capture their different importance. Pruning field pairs via these weights identifies and removes irrelevant and noisy field pairs, leading to a performance lift and a reduction of computation complexity. Comprehensive experiments on public and production datasets demonstrate the effectiveness of the proposed approach.

Acknowledgment

This work was supported by the National Key R&D Program of China [2020YFB1707903]; the National Natural Science Foundation of China [61972254, 62272302]; Shanghai Municipal Science and Technology Major Project [2021SHZDZX0102]; the CCF-Tencent Open Fund [RAGR20200105]; and the Tencent Marketing Solution Rhino-Bird Focused Research Program [FR202001].

References

  • [1] O. Chapelle, E. Manavoglu, and R. Rosales, “Simple and scalable response prediction for display advertising,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 4, pp. 1–34, 2014.
  • [2] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin et al., “Ad click prediction: a view from the trenches,” in ACM SIGKDD International conference on Knowledge Discovery & Data Mining (KDD), 2013, pp. 1222–1230.
  • [3] M. Richardson, E. Dominowska, and R. Ragno, “Predicting clicks: estimating the click-through rate for new ads,” in World Wide Web Conference (WWW), 2007, pp. 521–530.
  • [4] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai, “Deep interest network for click-through rate prediction,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2018, pp. 1059–1068.
  • [5] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai, “Deep interest evolution network for click-through rate prediction,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, no. 01, 2019, pp. 5941–5948.
  • [6] Y. Feng, F. Lv, W. Shen, M. Wang, F. Sun, Y. Zhu, and K. Yang, “Deep session interest network for click-through rate prediction,” in International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 2301–2307.
  • [7] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for youtube recommendations,” in ACM Recommender Systems Conference (RecSys), 2016, pp. 191–198.
  • [8] G. Ke, D. He, and T. Liu, “Rethinking positional encoding in language pre-training,” in International Conference on Learning Representations (ICLR), 2021.
  • [9] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng, “Synthesizer: Rethinking self-attention for transformer models,” in International Conference on Machine Learning (ICML).   PMLR, 2021, pp. 10 183–10 192.
  • [10] Y. Xie, P. Zhou, and S. Kim, “Decoupled side information fusion for sequential recommendation,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, and G. Kazai, Eds., 2022, pp. 1611–1621.
  • [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
  • [12] S. Rendle, W. Krichene, L. Zhang, and J. Anderson, “Neural collaborative filtering vs. matrix factorization revisited,” in ACM Recommender Systems Conference (RecSys), 2020, pp. 240–248.
  • [13] W. Deng, X. Zhang, F. Liang, and G. Lin, “An adaptive empirical bayesian method for sparse deep learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 5564–5574.
  • [14] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in International Conference on Learning Representations (ICLR), 2019.
  • [15] W. Deng, J. Pan, T. Zhou, D. Kong, A. Flores, and G. Lin, “Deeplight: Deep lightweight feature interactions for accelerating ctr predictions in ad serving,” in ACM International Conference on Web Search and Data Mining (WSDM), 2021, pp. 922–930.
  • [16] B. Liu, C. Zhu, G. Li, W. Zhang, J. Lai, R. Tang, X. He, Z. Li, and Y. Yu, “Autofis: Automatic feature interaction selection in factorization models for click-through rate prediction,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2020, pp. 2636–2645.
  • [17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
  • [18] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, “Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer,” in ACM International Conference on Information and Knowledge Management (CIKM), 2019, pp. 1441–1450.
  • [19] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session-based recommendations with recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2016.
  • [20] W.-C. Kang and J. McAuley, “Self-attentive sequential recommendation,” in IEEE International Conference on Data Mining (ICDM), 2018, pp. 197–206.
  • [21] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, “Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer,” in ACM International Conference on Information and Knowledge Management (CIKM), 2019, pp. 1441–1450.
  • [22] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou, “Behavior sequence transformer for e-commerce recommendation in alibaba,” in International Workshop on Deep Learning Practice for High-Dimensional Sparse Data (DLP-KDD), 2019, pp. 1–4.
  • [23] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research (JMLR), vol. 12, no. 7, 2011.
  • [24] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” in Workshop on Deep Learning for Recommender Systems (DLRS), 2016, pp. 7–10.
  • [25] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, “Training and testing low-degree polynomial data mappings via linear svm.” Journal of Machine Learning Research (JMLR), vol. 11, no. 4, 2010.
  • [26] S. Shen, B. Hu, W. Chen, and Q. Yang, “Personalized click model through collaborative filtering,” in ACM International Conference on Web Search and Data Mining (WSDM), 2012, pp. 323–332.
  • [27] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers et al., “Practical lessons from predicting clicks on ads at facebook,” in International Workshop on Data Mining for Online Advertising (ADKDD), 2014, pp. 1–9.
  • [28] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich, “Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine,” in International Conference on Machine Learning (ICML), 2010, pp. 13–20.
  • [29] S. Rendle, “Factorization machines,” in IEEE International Conference on Data Mining (ICDM), 2010, pp. 995–1000.
  • [30] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin, “Field-aware factorization machines for ctr prediction,” in ACM Conference on Recommender Systems (RecSys), 2016, pp. 43–50.
  • [31] J. Pan, J. Xu, A. L. Ruiz, W. Zhao, S. Pan, Y. Sun, and Q. Lu, “Field-weighted factorization machines for click-through rate prediction in display advertising,” in World Wide Web Conference (WWW), 2018, pp. 1349–1357.
  • [32] J. Pan, Y. Mao, A. L. Ruiz, Y. Sun, and A. Flores, “Predicting different types of conversions with multi-task learning in online advertising,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2019, pp. 2689–2697.
  • [33] Y. Sun, J. Pan, A. Zhang, and A. Flores, “Fm2: field-matrixed factorization machines for recommender systems,” in The Web Conference (WWW), 2021, pp. 2828–2837.
  • [34] J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T. Chua, “Attentional factorization machines: Learning the weight of feature interactions via attention networks,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 3119–3125.
  • [35] F. Liu, W. Guo, H. Guo, R. Tang, Y. Ye, and X. He, “Dual-attentional factorization-machines based neural network for user response prediction,” in The Web Conference (WWW), 2020, pp. 26–27.
  • [36] W. Lu, Y. Yu, Y. Chang, Z. Wang, C. Li, and B. Yuan, “A dual input-aware factorization machine for ctr prediction.” in International Joint Conference on Artificial Intelligence (IJCAI), 2020, pp. 3139–3145.
  • [37] Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, “Deep crossing: Web-scale modeling without manually crafted combinatorial features,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2016, pp. 255–262.
  • [38] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang, “Product-based neural networks for user response prediction,” in International Conference on Data Mining (ICDM), 2016, pp. 1149–1154.
  • [39] R. Wang, B. Fu, G. Fu, and M. Wang, “Deep & cross network for ad click predictions,” in International Workshop on Data Mining for Online Advertising (ADKDD), 2017, pp. 1–7.
  • [40] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “Deepfm: a factorization-machine based neural network for ctr prediction,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 1725–1731.
  • [41] X. He and T.-S. Chua, “Neural factorization machines for sparse predictive analytics,” in International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR), 2017, pp. 355–364.
  • [42] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun, “xdeepfm: Combining explicit and implicit feature interactions for recommender systems,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2018, pp. 1754–1763.
  • [43] Z. Li, W. Cheng, Y. Chen, H. Chen, and W. Wang, “Interpretable click-through rate prediction through hierarchical attention,” in ACM International Conference on Web Search and Data Mining (WSDM), 2020, pp. 313–321.
  • [44] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, and E. Chi, “Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems,” in The Web Conference (WWW), 2021, pp. 1785–1797.
  • [45] Q. Pi, W. Bian, G. Zhou, X. Zhu, and K. Gai, “Practice on long sequential user behavior modeling for click-through rate prediction,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2019, pp. 2671–2679.
  • [46] K. Ren, J. Qin, Y. Fang, W. Zhang, L. Zheng, W. Bian, G. Zhou, J. Xu, Y. Yu, X. Zhu et al., “Lifelong sequential modeling with personalized memorization for user response prediction,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2019, pp. 565–574.
  • [47] Q. Pi, G. Zhou, Y. Zhang, Z. Wang, L. Ren, Y. Fan, X. Zhu, and K. Gai, “Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction,” in ACM International Conference on Information & Knowledge Management (CIKM), 2020, pp. 2685–2692.
  • [48] J. Qin, W. Zhang, X. Wu, J. Jin, Y. Fang, and Y. Yu, “User behavior retrieval for click-through rate prediction,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2020, pp. 2347–2356.
  • [49] Q. Chen, C. Pei, S. Lv, C. Li, J. Ge, and W. Ou, “End-to-end user behavior retrieval in click-through rate prediction model,” arXiv preprint arXiv:2108.04468, 2021.
  • [50] Y. Wu, L. Yin, D. Lian, M. Yin, N. Z. Gong, J. Zhou, and H. Yang, “Rethinking lifelong sequential recommendation with incremental multi-interest attention,” arXiv preprint arXiv:2105.14060, 2021.