
AutoAttention: Automatic Field Pair Selection for Attention in User Behavior Modeling

Zuowu Zheng1, Xiaofeng Gao1, Junwei Pan2, Qi Luo3, Guihai Chen1, Dapeng Liu2, and Jie Jiang2
1MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China (X. Gao is the corresponding author)
[email protected], {gao-xf, gchen}@cs.sjtu.edu.cn
2Tencent Inc., Shenzhen, China
{jonaspan, rocliu, zeus}@tencent.com
3Shandong University, Shandong, China
[email protected]
Abstract

In Click-through rate (CTR) prediction models, a user's interest is usually represented as a fixed-length vector based on her history behaviors. Recently, several methods have been proposed to learn an attentive weight for each user behavior and conduct weighted sum pooling. However, these methods only manually select several fields from the target item side as the query to interact with the behaviors, neglecting the other target item fields as well as user and context fields. Directly including all these fields in the attention may introduce noise and deteriorate the performance. In this paper, we propose a novel model named AutoAttention, which includes all item/user/context side fields as the query and assigns a learnable weight to each field pair between behavior fields and query fields. Pruning on these field pairs via their learnable weights leads to automatic field pair selection, so as to identify and remove noisy field pairs. Despite including more fields, AutoAttention still has a low computation cost, thanks to its simple attention function and field pair selection. Extensive experiments on the public dataset and Tencent's production dataset demonstrate the effectiveness of the proposed approach.

Index Terms:
Click-Through Rate Prediction, User Behavior Modeling, Recommendation System

I Introduction

Click-through rate (CTR) prediction is one of the most fundamental tasks for online advertising systems, and it has attracted much attention from both industrial and academic communities [1, 2, 3]. Modeling a user's interest through his or her history behaviors on items has proven to be one of the most successful advances in the CTR prediction task [4, 5, 6].

Figure 1: AUC and inference time comparison of the proposed AutoAttention with baselines on the public Alibaba dataset. Sum pooling, DIN, DIEN, and DSIN are four existing methods, which only include several manually selected fields in the attention unit. MAF-C, MAF-S, DIN+, and DotProduct are several proposed baselines which include all available fields. AutoAttention also includes all fields, but conducts field pair selection and achieves a new state-of-the-art AUC with low inference time.

In the Embedding & Multi-Layer Perceptron (MLP) algorithms for online advertising and recommendation systems, a user's interest is usually represented as a fixed-length embedding vector based on her history behaviors [7, 4]. Traditional methods simply take a sum or mean pooling over all behavior embedding vectors to generate one embedding [7]. However, this ignores the fact that some behaviors are more important than others given the target item, user, and context features.

Recently, several user behavior modeling methods have been proposed to calculate attentive weights for different behaviors w.r.t. a given target item and then conduct a weighted sum pooling, such as Deep Interest Network (DIN) [4] and its variants [5, 6]. Even though these methods achieve significant performance lifts, they still suffer from the following limitations:

  • First, in real-world recommendation systems, a user’s interest may not only depend on the target item but also on the user’s demographic features or context features. However, existing works only manually select several fields from the target item side as the query and interact them with each behavior to calculate the attentive weight. It neglects the effect of other fields, including other fields from the target item side, as well as those from the user and context sides. For example, when browsing the game zone of a shopping website, a boy will click a recommended new game The Witcher 3 because he clicked some similar games last week, so the item side fields should be included as all existing works do. Or it’s because he is in the game zone now and any history click on games indicates a strong interest in games. In the latter case, the game zone feature from the context side plays an important role in capturing his interest from behaviors.

  • Second, existing works interact all behavior fields with all target item side fields. Recent studies [8, 9] show that some interactions in attention are unnecessary and harm the performance. Involving more fields as the query may introduce more irrelevant field interactions and further deteriorate the performance.

  • Third, as a part of the input layer of a more complicated DNN model for CTR prediction, the procedure of generating a user interest vector should be lightweight. Unfortunately, most existing methods use an MLP to calculate the attention weight, which leads to high computation complexity.

To resolve these challenges, we propose to include all item/user/context fields as the query in the attention unit, and calculate a learnable weight for each field pair between user behavior fields and these query fields. To avoid introducing noisy field pairs, we further propose to automatically select the most important ones by pruning on these weights. Besides, we adopt a simple dot product function rather than an MLP as the attention function, leading to much lower computation cost. We summarize the AUC as well as the average inference time of AutoAttention and several baseline models in Fig. 1. Except for Sum Pooling, which has a very low inference time due to its simplicity, the proposed AutoAttention achieves a higher AUC than all the other baseline models while keeping the inference time low. The main contributions of this paper are summarized as follows:

  • We propose to involve all item/user/context fields as the query in the attention unit for user interest modeling. A weight is assigned for each field pair between user behavior fields and these query fields. Pruning on weights automates the field pair selection, preventing performance deterioration due to introducing irrelevant field pairs.

  • We propose to use a simple dot product attention, rather than an MLP in existing methods. This greatly reduces the time complexity with comparable or even better performance.

  • We conduct extensive experiments on public and production datasets to compare AutoAttention with state-of-the-art methods. Evaluation results verify the effectiveness of AutoAttention. We also study the learnt field pair weights and find that AutoAttention does identify several field pairs including user or context side fields, which are ignored by expert knowledge in existing works.

The rest of the paper is organized as follows. Section II provides the preliminaries of existing user behavior methods. In Section III, we describe AutoAttention and its connection with several existing methods. Experiment settings and evaluation results are presented in Section IV. Finally, Section V discusses related work and Section VI concludes the paper.

II Preliminaries

Figure 2: Architecture of the proposed AutoAttention. The input features include four parts: user behaviors, target item, user, and context. We use an attention function to calculate the weight of each behavior, the detail of which is depicted in the Attention Unit. It assigns a learnable weight to each field pair between behavior fields and query fields, which consist of all item/user/context fields. Automatic field pair selection is conducted by pruning on these weights. All behaviors are summed based on the attention weights, and then fed into an MLP together with all other feature embeddings.

In this section, we present the preliminaries of user behavior modeling in CTR prediction. A CTR prediction model aims at predicting the probability that a user clicks an item given a context (e.g., time, location, publisher information, etc.). It takes fields from three sides as the input:

$p\text{CTR} = f(\text{user}, \text{item}, \text{context})$

where the user side fields consist of user demographic fields and behavior fields, and item and context denote fields from the item and context sides, respectively. In this paper, we focus on how to capture a user's interest from user behaviors.

Given a user $u$ and her corresponding behaviors $\{\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H}\}$, her interest is represented as a fixed-length vector as follows:

$\bm{v}_{u} = f(\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\dots,\bm{e}_{F_{M}})$ (1)

where $\bm{v}_{i}$ denotes the embedding for the $i$-th behavior, $H$ denotes the length of user behaviors, and $\bm{e}_{F_{j}}\in\mathcal{R}^{K}$ denotes the feature embedding from any other field besides the user behaviors (e.g., item/user/context side fields), i.e., $F_{j}$. Each behavior is usually represented by multiple item side fields. Denoting the set of fields that represent behaviors as $\mathcal{B}=\{B_{p}\}$, each behavior is represented as $\bm{v}_{i}=\sum_{B_{p}\in\mathcal{B}}\bm{v}_{B_{p}}$, where $\bm{v}_{B_{p}}\in\mathcal{R}^{K}$ denotes the feature embedding for the field $B_{p}$ of the $i$-th behavior.

A straightforward way to calculate $\bm{v}_{u}$ is to do a sum or mean pooling over all these $\bm{v}_{i}$ embedding vectors [7]. However, it neglects the importance of each behavior given a specific target item. Recently, a commonly used behavior modeling strategy is to adopt an attention mechanism over the user's historical behaviors. It learns an attentive weight for each behavior $i$ w.r.t. a given target item $t$ and then conducts a weighted sum pooling, i.e., $\bm{v}_{u}=\sum_{i=1}^{H}a(i,t)\bm{v}_{i}$, where $a(i,t)$ denotes an attention function. For example, Deep Interest Network (DIN) considers the influence of the target item on user behaviors [4] and learns larger weights for those behaviors that are more important given the target item, as shown in Eqn. (2).

$\bm{v}_{u} = f(\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H},\bm{e}_{t}) = \sum_{i=1}^{H} a(i,t)\,\bm{v}_{i} = \sum_{i=1}^{H} \text{MLP}(\bm{v}_{i},\bm{e}_{t})\,\bm{v}_{i}$ (2)

where $\bm{e}_{t}$ denotes the embedding vector of the target item $t$, and $\text{MLP}(\cdot)$ denotes an MLP whose output is the attention weight. Following DIN, DIEN [5] further considers the evolution of user interest, and DSIN [6] considers the homogeneity and heterogeneity of a user's interests within and among sessions. DIF-SR [10] proposes to only consider the interaction between corresponding fields of queries and keys within the attention.
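To make the mechanism in Eqn. (2) concrete, the following NumPy sketch shows attention-weighted sum pooling with an MLP attention function. It is only an illustration: the two-layer MLP, its ReLU activation, and all parameter names are placeholders rather than the exact architecture used in DIN.

```python
import numpy as np

def mlp_attention_pooling(behaviors, target, w1, b1, w2, b2):
    """Weighted sum pooling with an MLP attention (cf. Eqn. (2)).

    behaviors: (H, K) matrix of behavior embeddings v_i.
    target:    (K,) embedding e_t of the target item.
    w1, b1, w2, b2: parameters of an illustrative two-layer MLP mapping the
                    concatenated pair (v_i, e_t) to a scalar weight a(i, t).
    """
    H, _ = behaviors.shape
    weights = np.empty(H)
    for i in range(H):
        x = np.concatenate([behaviors[i], target])   # (2K,)
        h = np.maximum(x @ w1 + b1, 0.0)             # hidden layer with ReLU
        weights[i] = h @ w2 + b2                     # scalar attention weight
    return weights @ behaviors                       # weighted sum -> (K,)

# Toy usage with random parameters (shapes only, not a trained model).
rng = np.random.default_rng(0)
H, K, d = 5, 8, 16
v_u = mlp_attention_pooling(
    rng.normal(size=(H, K)), rng.normal(size=K),
    rng.normal(size=(2 * K, d)), np.zeros(d),
    rng.normal(size=d), 0.0,
)
print(v_u.shape)  # (8,)
```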

All existing methods only interact each behavior with several selected fields from the item side within the attention, neglecting other fields, especially those from the user and context sides.

III AutoAttention

In this section, we first describe several straightforward approaches that interact user behavior with all fields in the attention unit. Then we propose AutoAttention to automatically identify and remove irrelevant field pairs which are introduced to the model due to including all fields as the query in the attention. At last, we discuss the model complexity and its connection with several existing approaches.

Mathematically, we learn a user's interest representation $\bm{v}_{u}$ from her historical behaviors based on all fields of the current sample, i.e., all available fields from the target item, user, and context sides.

$\bm{v}_{u}(\bm{x}) = f(\bm{v}_{1},\bm{v}_{2},\dots,\bm{v}_{H},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\dots,\bm{e}_{F_{M}}) = \sum_{i=1}^{H} a(\bm{v}_{i},\bm{e}_{F_{1:M}})\,\bm{v}_{i}$ (3)

where $\bm{v}_{i}$ denotes the embedding for the $i$-th behavior, which is usually a summation of several attribute embeddings of this behavior: $\bm{v}_{i}=\sum_{p}\bm{v}_{B_{p}}$, and $\{\bm{e}_{F_{1}},\dots,\bm{e}_{F_{M}}\}$ denotes the set of embeddings of all fields from the target item, user, and context sides.

III-A Base Models

Before introducing AutoAttention, we first present several straightforward approaches to interact user behaviors with all fields within attention: MLP with All fields (MAF in short) and DotProduct.

III-A1 MLP with All Fields

MAF simply sums or concatenates all field embedding vectors $\bm{e}_{F_{1:M}}$, then feeds them together with the behavior embedding $\bm{v}_{i}$ to an MLP to calculate the weight. There are two ways to construct the input of the first layer of the MLP: element-wise summation of $\bm{e}_{F_{1:M}}$ and $\bm{v}_{i}$ into a $K$-dimensional vector, denoted as MAF-S (MLP with All Fields Summed); or concatenation of $\bm{e}_{F_{1:M}}$ and $\bm{v}_{i}$ into an $(M+1)K$-dimensional vector, denoted as MAF-C (MLP with All Fields Concatenated). Mathematically,

$a_{\text{MAF-S}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \text{MLP}(\bm{v}_{i}\oplus\bm{e}_{F_{1}}\oplus\bm{e}_{F_{2}}\oplus\cdots\oplus\bm{e}_{F_{M}})$
$a_{\text{MAF-C}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \text{MLP}([\bm{v}_{i},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\cdots,\bm{e}_{F_{M}}])$ (4)

where $\oplus$ denotes element-wise summation, $[\cdot]$ denotes concatenation, and $\text{MLP}(\cdot)$ denotes a Multi-Layer Perceptron whose last layer is a single output node activated by the softmax function.
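As a quick illustration of Eqn. (4), the sketch below builds the two MLP inputs; it assumes NumPy arrays with illustrative shapes and omits the MLP itself.

```python
import numpy as np

def maf_inputs(v_i, query_fields):
    """Build the MLP inputs of MAF-S and MAF-C (cf. Eqn. (4)).

    v_i:          (K,) behavior embedding.
    query_fields: (M, K) query-field embeddings e_{F_1..F_M}.
    Returns the MAF-S input of size K and the MAF-C input of size (M+1)K.
    """
    maf_s = v_i + query_fields.sum(axis=0)                   # element-wise sum
    maf_c = np.concatenate([v_i, query_fields.reshape(-1)])  # concatenation
    return maf_s, maf_c

rng = np.random.default_rng(0)
s_in, c_in = maf_inputs(rng.normal(size=8), rng.normal(size=(4, 8)))
print(s_in.shape, c_in.shape)  # (8,) (40,)
```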

III-A2 DotProduct

Dot product is widely used in attention models [11], and it has been shown that it is hard for an MLP to learn a dot product [12]. So we propose another base model that explicitly conducts a dot product between the user behavior embedding and the sum pooling vector over all query fields. We name it DotProduct; formally,

$a_{\text{DotProduct}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \sigma\Big(b+\sum_{j=1}^{M}\langle\bm{v}_{i},\bm{e}_{F_{j}}\rangle\Big) = \sigma\Big(b+\big\langle\bm{v}_{i},\sum_{j=1}^{M}\bm{e}_{F_{j}}\big\rangle\Big)$ (5)

where $\langle\bm{v}_{i},\bm{v}_{j}\rangle=\sum_{k=1}^{K}v_{i,k}\cdot v_{j,k}$ denotes the dot product, $\sigma(\cdot)$ denotes the softmax function, and $b$ denotes the bias term.
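A minimal sketch of the DotProduct attention in Eqn. (5) is given below; it reads $\sigma(\cdot)$ as a softmax over the $H$ behaviors and uses NumPy with illustrative names.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dotproduct_pooling(behaviors, query_fields, b=0.0):
    """DotProduct attention pooling (cf. Eqn. (5)).

    behaviors:    (H, K) behavior embeddings v_i.
    query_fields: (M, K) query-field embeddings e_{F_j}.
    The logit of each behavior is b + <v_i, sum_j e_{F_j}>; the softmax is
    taken over the H behaviors, yielding the user interest vector.
    """
    query_sum = query_fields.sum(axis=0)       # (K,)
    weights = softmax(b + behaviors @ query_sum)
    return weights @ behaviors                 # (K,)

rng = np.random.default_rng(0)
print(dotproduct_pooling(rng.normal(size=(5, 8)), rng.normal(size=(3, 8))).shape)
```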

III-B AutoAttention

In the above base models, all available fields are considered as the query in the attention function. However, there are still several concerns: 1) simply involving all fields as the query ignores the fact that some fields are more important than others when interacting with each behavior field; 2) in real-world industrial systems the number of fields is large, and including all of them as the query may increase the computation cost of the model; 3) some field pairs are irrelevant or noisy for capturing user interests, leading to performance deterioration when included in the attention.

To tackle the above challenges, we assign a weight $R_{B_{p},F_{j}}$ to model the interaction strength of each field pair between a behavior field $B_{p}$ and a query field $F_{j}$. These field-pair-wise weights are learnable and trained together with all the other parameters. We name our approach AutoAttention. Mathematically,

$a_{\text{AutoAttention}}(\bm{v}_{i},\bm{e}_{F_{1:M}}) = \sigma\Big(b+\sum_{p=1}^{P}\sum_{j=1}^{M}\langle\bm{v}_{B_{p}},\bm{e}_{F_{j}}\rangle\, R_{B_{p},F_{j}}\Big)$ (6)
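The sketch below spells out the AutoAttention logit of Eqn. (6) for a single behavior; the field-pair weights $R$ are stored as a $P\times M$ matrix whose pruned entries are simply zero. Shapes and names are illustrative.

```python
import numpy as np

def autoattention_logit(behavior_fields, query_fields, R, b=0.0):
    """Pre-softmax attention logit of one behavior (cf. Eqn. (6)).

    behavior_fields: (P, K) field embeddings v_{B_p} of this behavior.
    query_fields:    (M, K) query-field embeddings e_{F_j}.
    R:               (P, M) learnable field-pair strength weights R_{B_p,F_j};
                     a pruned field pair corresponds to a zero entry.
    """
    dots = behavior_fields @ query_fields.T    # (P, M) pairwise dot products
    return b + float((dots * R).sum())

rng = np.random.default_rng(0)
P, M, K = 2, 15, 64
R = rng.normal(size=(P, M))
R[0, 3] = 0.0                                  # e.g., a pruned (noisy) field pair
print(autoattention_logit(rng.normal(size=(P, K)), rng.normal(size=(M, K)), R))
```

The softmax over all $H$ behaviors is then applied to these logits in the same way as in DotProduct.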

Directly including all query fields in the attention function and interacting all of them with the behavior fields may introduce irrelevant or noisy field pairs. To identify and remove them, we conduct automatic field pair selection by pruning on the field-pair-wise weights $R$. There are many empirical studies in the weight pruning area [13, 14, 15]. For simplicity, we adopt the standard iterative pruning algorithm used in [15]. An illustration of AutoAttention is presented in Fig. 2.

The pruning algorithm is depicted in Alg. 1. We first train the model for a few epochs to initialize the weights $R$, and then prune the entries of $R$ with the bottom-$S\%$ lowest magnitudes. We gradually increase the sparsity rate $S\%$ such that it increases faster in the early phase, when the network is stable, and slower in the late phase, when it becomes sensitive. Other approaches such as regularization [16] could also be used here.

Existing methods heavily rely on expert knowledge on selecting relevant fields to involve them in the attention. For example, in DIN [4], the authors manually select three fields from the target item side: item_id, shop_id and category_id. Such expert knowledge is not always feasible and accurate. AutoAttention avoids such reliance on expert knowledge by automatic field pair selection.

Input: Field pair strength weights $R$, target sparsity rate $S$, damping ratios $D$ and $U$.
1 Warm up: initialize the whole network by training $i$ epochs;
2 Pruning procedure:
3 for iteration $j=1,2,\dots$ do
4     Train the model for one iteration;
5     Update the current sparsity rate $S_{j}\leftarrow S\times(1-D^{j/U})$;
6     Prune the bottom-$S_{j}\%$ lowest-magnitude weights in $R$;
7 end
Online prediction: use the fine-tuned sparse model to make predictions.
Algorithm 1: Field pair selection training procedure
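A compact sketch of the iterative pruning procedure in Alg. 1 is shown below, assuming the field-pair weights $R$ are a NumPy matrix; the call that trains the full CTR model for one iteration is left as a placeholder.

```python
import numpy as np

def prune_bottom(R, sparsity):
    """Zero out the bottom-`sparsity` fraction of entries of R by magnitude (Alg. 1, line 6)."""
    k = int(sparsity * R.size)
    if k == 0:
        return R
    threshold = np.sort(np.abs(R).ravel())[k - 1]
    return np.where(np.abs(R) <= threshold, 0.0, R)

def current_sparsity(S, D, U, j):
    """Sparsity schedule of Alg. 1, line 5: grows quickly at first, then saturates at S."""
    return S * (1.0 - D ** (j / U))

# Skeleton of the pruning loop; `train_one_iteration` is a placeholder for one
# gradient step over all model parameters (embeddings, MLP, R, ...).
R = np.random.default_rng(0).normal(size=(2, 15))
S, D, U = 0.6, 0.8, 100
for j in range(1, 501):
    # train_one_iteration(model)              # placeholder for one training step
    R = prune_bottom(R, current_sparsity(S, D, U, j))
```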

III-C Model Training

Following [4], after we extract a user's interest $\bm{v}_{u}$, we concatenate it with all the other feature embeddings and feed them into an MLP:

$\hat{y} = \text{sigmoid}(\text{MLP}([\bm{v}_{u},\bm{e}_{F_{1}},\bm{e}_{F_{2}},\cdots,\bm{e}_{F_{M}}]))$ (7)

We then minimize the following cross-entropy loss during model training:

$L(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}\big(y_{i}\log\hat{y}_{i}+(1-y_{i})\log(1-\hat{y}_{i})\big)+\lambda\left\|\Theta\right\|_{2}$ (8)

where $N$ denotes the number of training samples, $\Theta$ denotes all trainable parameters, $y_{i}\in\{0,1\}$ denotes the label, and $\lambda\left\|\Theta\right\|_{2}$ denotes the $L_{2}$ regularization term.
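For completeness, a sketch of the prediction and training objective of Eqns. (7)-(8) is given below; `mlp_fn` is a placeholder for the top MLP, the regularization term is computed as the $L_{2}$ norm exactly as written, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_ctr(v_u, field_embeddings, mlp_fn):
    """Eqn. (7): concatenate the interest vector with all other field embeddings
    and feed the result through the top MLP (`mlp_fn` is a placeholder)."""
    x = np.concatenate([v_u] + list(field_embeddings))
    return sigmoid(mlp_fn(x))

def ctr_loss(y_true, y_pred, params, lam):
    """Cross-entropy plus L2 regularization (cf. Eqn. (8))."""
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    ce = -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    l2 = lam * np.sqrt(sum(float((p ** 2).sum()) for p in params))
    return ce + l2

# Toy usage with a linear stand-in for the MLP and random inputs.
rng = np.random.default_rng(0)
w = rng.normal(size=8 + 3 * 8)
p = predict_ctr(rng.normal(size=8), rng.normal(size=(3, 8)), lambda x: x @ w)
print(ctr_loss(np.array([1.0]), np.array([p]), [w], lam=1e-4))
```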

III-D Discussion

TABLE I: A summary of model complexities. $M$ denotes the number of query fields, $P$ the number of behavior fields, $K$ the dimension of embedding vectors, $d$ and 1 the numbers of neurons of the two-layer MLP in the attention, and $H$ the length of user behaviors. Note that the time complexity of $H$ behaviors includes the complexity of weight calculation and of weighted sum pooling over $H$ behaviors. We list the estimated FLOPs and the number of parameters on the Alibaba dataset with the experimental settings of Section IV-B, i.e., $M=15, P=2, K=64, d=200, H=50$.
Model | Time Complexity (One Behavior) | Estimated FLOPs (One Behavior) | Time Complexity ($H$ Behaviors) | #Parameters of One Behavior | Estimated #Parameters
Sum Pooling | $O(1)$ | 0 | $O(HK)$ | 0 | 0
DIN | $O(dK^{2})$ | 1,642,496 | $O(K^{2}+dKH)$ | $dK^{2}+2d+1$ | 819,601
DIEN | $O(K^{3}+dK^{2})$ | 1,695,232 | $O(K^{3}+dKH)$ | $dK^{2}+12K^{2}+6K+2d+1$ | 869,137
DSIN | $O(H^{2}K+dK^{2})$ | 3,380,864 | $O(H^{2}K+K^{2}+dKH)$ | $2dK^{2}+19K^{2}+8K+4d+2$ | 1,717,538
DotProduct | $O(MK)$ | 2,112 | $O(MK+HK)$ | 1 | 1
AutoAttention | $O(PMK)$ | 5,952 | $O(HPMK)$ | $PM+1$ | 31
Figure 3: Model architecture comparison. DIN uses an MLP as the attention function, with several manually selected fields as the query. CFI considers the corresponding field interactions between behavior fields and query fields. Both DotProduct and AutoAttention consider all fields in the query. DotProduct uses a dot product between behavior fields and query fields, and AutoAttention learns a weight $R_{B_{p},F_{j}}$ for each behavior field and query field pair $(B_{p},F_{j})$. A darker red color represents a higher strength weight of the field pair.

III-D1 Model Complexity

The time complexity of the proposed DotProduct and AutoAttention is $O(MK)$ and $O(PMK)$ per behavior, respectively, where $M$ denotes the number of all the other fields, $P$ the number of behavior fields, and $K$ the dimension of embedding vectors. With field pair selection, the inference time complexity of AutoAttention can be further reduced, since redundant field pairs are removed during model training. With a sparsity rate of $S\%$, its inference time complexity is reduced to $O(S\%\cdot PMK)$.

As for space complexity, DotProduct introduces only one parameter, i.e., the bias term. AutoAttention introduces the field pair strength weights $R$ and the bias term, i.e., $PM+1$ parameters. With a sparsity rate of $S\%$, AutoAttention's space complexity becomes $S\%\cdot PM+1$. We summarize the complexities and model architectures of these models in Tab. I and Fig. 3, respectively.

We also report the estimated number of Floating Point Operations (FLOPs), as well as the number of parameters, under the settings of Section IV-B on the public Alibaba dataset. As shown in Tab. I, DotProduct and AutoAttention only take thousands of FLOPs to extract user interests. Compared to DNN-based methods, AutoAttention is at least two hundred times faster and also introduces fewer parameters, which makes it a preferable choice in real-world online advertising systems.

III-D2 Comparison to Self-Attention

Self-attention is widely used in NLP [11], CV [17], and recommender systems [18]. In self-attention, the attention weight is a softmax over the dot product between the query $Q$ and key $K$. The value $V$ is then multiplied by these attention weights to get the final output. The proposed DotProduct can be viewed as a self-attention that takes all other fields as the query, and the user behavior as the key and value. AutoAttention further assigns a field-pair-wise weight to each field interaction. Recent works [8, 9] on self-attention also reveal that some field pairs (e.g., position cross position) are critical while others are noisy within the attention, indicating the necessity of automatic field pair selection.

III-D3 Comparison to CFI

Recently, [10] proposed to only interact a field from the behavior with the corresponding field from the target item in the attention. For example, the category field of a behavior is only interacted with the category field of the target item. We name this approach CFI (Corresponding Field Interaction). CFI assumes that the corresponding field pairs are the most important ones and that all the other pairs are noisy. We compare CFI with AutoAttention in Sec. IV-D.

IV Experiment

TABLE II: Statistics of the datasets.
      Datasets       #Train Samples       #Test Samples       #Fields       #Features       #Items       Positive Ratio
      Alibaba       5,544,213       660,694       15       1,657,981       512,431       5.147%
      Tencent       6,666,928       1,125,130       15       1,030,047       388,195       14.167%

In this section, we evaluate AutoAttention on two real-world datasets: the public Alibaba Display Ad CTR dataset and Tencent's production CTR dataset. The code is publicly available at https://github.com/waydrow/AutoAttention. We aim to answer the following research questions:

  • RQ1: How does AutoAttention perform compared with existing user interest methods?

  • RQ2: Compared with existing methods, AutoAttention includes more fields in the attention unit and then conducts field pair selection by pruning on field pair weights. Is it the additional fields or the field pair selection that contributes more to the performance lift?

  • RQ3: AutoAttention can be used as a building block of the attention module in more complicated user interest methods, such as DIEN and DSIN. Does replacing the vanilla attention unit with AutoAttention bring a performance lift?

  • RQ4: Which field pairs are regarded as the most important ones by AutoAttention? Are there any important field pairs identified by AutoAttention but ignored by existing methods or expert knowledge? What are these ignored field pairs?

IV-A Datasets and Baselines

We use the following two datasets for performance comparison. Their statistics are presented in Tab. II.

  • Alibaba Dataset [6] (https://tianchi.aliyun.com/dataset/dataDetail?dataId=56) is a public advertising dataset released by Alibaba. It randomly samples 1,140,000 users from the Taobao website over 8 days of click logs (26 million records) to generate the original dataset. Following [6], we use the first 7 days' samples as the training set (2017-05-06 to 2017-05-12) and the next day's samples as the testing set (2017-05-13). We keep each user's most recent 50 behaviors. Please note that we only extract the user click behaviors whose click time is before the target item, to prevent information leakage.

  • Tencent CTR Dataset is collected by sampling user click logs for one week from Tencent’s advertising CTR log. We use samples from 2021-09-05 to 2021-09-10 as the training set, and samples on 2021-09-11 as the testing set. The data preprocessing strategy is the same as that of the Alibaba dataset.

We compare AutoAttention with the following baseline approaches:

  • Sum Pooling conducts a sum pooling without weights on the user’s behavior embeddings to generate a fixed-length user interest representation.

  • DIN [4] conducts a weighted sum pooling over user behaviors. The attention weight is calculated over the user behavior and several manually selected fields from the item side. The attention is implemented as an outer product between the user behavior embedding and those selected item side embeddings, followed by an MLP.

  • DIEN [5] uses a GRU encoder to capture the behavior dependencies, followed by another GRU with an attentional update gate to depict interest evolution.

  • DSIN [6] captures users’ homogeneous interests in each session and heterogeneous interests in different sessions.

  • GRU4Rec [19] uses a GRU with ranking based loss to model user sequences for session based recommendation.

  • SASRec [20] uses a left-to-right Transformer to capture users’ behaviors for sequential recommendation.

  • BERT4Rec [21] uses a bi-directional self-attention to model user behaviors.

  • BST [22] uses self-attention and target-attention together to model user behaviors.

  • CFI [10] considers the corresponding field interactions for sequential recommendation.

IV-B Experimental Settings

All methods are implemented in TensorFlow 1.4 with Python 3.5 and trained from scratch on an NVIDIA TESLA M40 GPU with 24 GB of memory. For baseline methods, we follow the hyper-parameter settings in their original papers and further fine-tune them on our datasets. For both datasets, we set the maximum user behavior length $H$ to 50. The embedding dimension $K$ is 64 for all features. The dimensions of the hidden layers of the three-layer MLP are 200, 80, and 1, with PReLU, PReLU, and Softmax activation functions, respectively. We use Adagrad [23] as the optimizer with a learning rate of 0.01. The batch size is 4,096 and 16,384 for the training and testing set, respectively. For DIN, DIEN, and DSIN, the dimensions of the two-layer MLP in the local activation unit are 200 and 1, with the Dice activation function [4]. For DSIN, we divide user behavior sequences into 5 sessions, and the maximum user behavior length of each session is 10. For AutoAttention, the target sparsity rate $S$ is 0.6 and 0.8 for the Alibaba and Tencent datasets, respectively. The damping ratios $D$ and $U$ are set to 0.8 and 100, respectively.

We use user-weighted AUC as the evaluation metric [4], which measures the goodness of sample ranking for each user. A vanilla AUC is first calculated over the samples of each user, and we then conduct a weighted average over these AUCs, using the number of samples of each user as the weight. For simplicity, we still refer to it as AUC in this paper.

$\text{AUC} = \frac{\sum_{i=1}^{n}\#\text{impression}_{i}\times\text{AUC}_{i}}{\sum_{i=1}^{n}\#\text{impression}_{i}}$ (9)

where $n$ denotes the number of users, and $\#\text{impression}_{i}$ and $\text{AUC}_{i}$ denote the number of impressions and the AUC of the $i$-th user, respectively.
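The sketch below computes the user-weighted AUC of Eqn. (9), using scikit-learn's roc_auc_score for the per-user AUC; skipping users whose labels are all of one class is our assumption, since their AUC is undefined.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def user_weighted_auc(user_ids, labels, scores):
    """User-weighted AUC (cf. Eqn. (9)): per-user AUC averaged with the number
    of impressions of each user as the weight."""
    user_ids = np.asarray(user_ids)
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    num, den = 0.0, 0.0
    for u in np.unique(user_ids):
        mask = user_ids == u
        y = labels[mask]
        if y.min() == y.max():        # single-class user: AUC undefined, skip
            continue
        num += mask.sum() * roc_auc_score(y, scores[mask])
        den += mask.sum()
    return num / den

print(user_weighted_auc([1, 1, 1, 2, 2], [1, 0, 0, 1, 0], [0.9, 0.2, 0.4, 0.3, 0.6]))
```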

IV-C Performance Comparison (RQ1)

TABLE III: Experiment results of AutoAttention and baselines on the public Alibaba dataset and Tencent dataset. The bold value marks the best one in each column, while the underlined value corresponds to the second best one.
Model | Alibaba Loss (mean±std) | Alibaba AUC (mean±std) | Alibaba AUC Impv. | Tencent Loss (mean±std) | Tencent AUC (mean±std) | Tencent AUC Impv.
Sum Pooling | 0.2083±0.00036 | 0.6024±0.00015 | - | 0.3561±0.00185 | 0.7125±0.00015 | -
DIN | 0.2052±0.00013 | 0.6055±0.00007 | 0.515% | 0.3545±0.00096 | 0.7173±0.00050 | 0.674%
DIEN | 0.2033±0.00020 | 0.6069±0.00062 | 0.747% | 0.3539±0.00032 | 0.7236±0.00015 | 1.558%
DSIN | 0.2008±0.00034 | 0.6094±0.00007 | 1.162% | 0.3526±0.00010 | 0.7285±0.00006 | 2.246%
GRU4Rec | 0.2031±0.00081 | 0.6043±0.00040 | 0.315% | 0.3536±0.00054 | 0.7148±0.00004 | 0.323%
SAS4Rec | 0.2014±0.00043 | 0.6043±0.00015 | 0.315% | 0.3537±0.00029 | 0.7144±0.00031 | 0.267%
BERT4Rec | 0.2027±0.00026 | 0.6049±0.00076 | 0.415% | 0.3542±0.00016 | 0.7152±0.00063 | 0.379%
BST | 0.2016±0.00002 | 0.6050±0.00037 | 0.432% | 0.3540±0.00005 | 0.7160±0.00052 | 0.491%
CFI | 0.1983±0.00007 | 0.6115±0.00047 | 1.511% | 0.3516±0.00019 | 0.7349±0.00083 | 3.144%
AutoAttention | 0.1945±0.00062 | 0.6156±0.00053 | 2.191% | 0.3509±0.00081 | 0.7380±0.00040 | 3.579%

The experiment results of comparison between existing methods and our proposed AutoAttention on both datasets are shown in Tab. III. All experiments are repeated 5 times and the averaged results are reported.

The sum pooling method is treated as the baseline. DIN gets 0.52% and 0.67% relative AUC lift on the two datasets compared with sum pooling, since it considers the different importance of each behavior. GRU4Rec, SAS4Rec, BERT4Rec, and BST, which consider sequence dependencies, achieve similar performance. DIEN and DSIN take both into account and further improve AUC. CFI gets 1.51% and 3.14% AUC lift respectively, which shows the advantage of corresponding field interaction.

AutoAttention significantly lifts the AUC on two datasets by 2.19% and 3.58%, respectively. Please note that even a 0.1% AUC lift is huge and usually leads to a decent Gross Merchandise Volume (GMV) lift in online advertising systems [24].

IV-D Study of AutoAttention (RQ2)

So far, all baselines only consider several item side fields as specified in their papers, while the proposed AutoAttention uses all fields as the query and then conducts field pair selection. One may wonder whether the performance lift is mainly due to including more fields, or due to the selection.

IV-D1 Effect of additional fields

To answer this question, we first equip several baseline models with all fields. Specifically, we include all fields in three baseline models, DIN, DIEN, and DSIN, so that they use the same fields as AutoAttention. We denote these three baselines with all fields as DIN+, DIEN+, and DSIN+. We also present the performance of the three baselines DotProduct, MAF-S, and MAF-C, which already consider all fields, and of AutoAttention-w/oP, a variant of AutoAttention without pruning on field pairs. The results are summarized in Tab. IV.

DIN+, DIEN+, and DSIN+ get some performance lift compared to their original models, due to the inclusion of additional fields. For example, DSIN+ improves AUC by 0.0021 and 0.0029 on the two datasets. MAF-S and MAF-C achieve comparable performance with DIEN+. However, they are still worse than DotProduct and AutoAttention-w/oP. AutoAttention-w/oP achieves the best result among these baselines. This indicates that explicitly modeling the field interaction strength between user behavior fields and query fields is necessary.

IV-D2 Effect of automatic field pair selection

CFI also conducts field pair selection, by manually selecting the corresponding field pairs, e.g., only interacting the behavior category field with the target item category field. We compare it with AutoAttention to investigate the effect of automatic field pair selection. To make a fair comparison, we keep the same number of field pairs in AutoAttention as in CFI, naming this variant AutoAttention-. Please note that CFI selects $P$ fields from the target item side and then interacts each behavior field $B_{p}\in\mathcal{B}$ with its corresponding field among the selected ones. Therefore, AutoAttention- only keeps the top-$P$ field pairs, where $P=2$ on the Alibaba dataset and $P=3$ on the Tencent dataset.

As shown in Tab. IV, AutoAttention- performs better than CFI. Furthermore, the relative lift of AutoAttention over CFI is 0.67% and 0.42%. We also compare the selected field pairs of these two methods in the following two ways: a) We compare the top-$P$ field pairs selected by AutoAttention with the $P$ corresponding field pairs in CFI. On the Alibaba dataset, CFI considers two field pairs, (cate_id, cate_id) and (brand, brand), where the first field in each pair is a query field and the second one is a behavior field. However, AutoAttention- selects (cate_id, brand) and (brand, brand). This indicates that automatic field pair selection does not always rank the corresponding field pairs the highest; b) We check the rank of the corresponding field pairs according to the strength weights of AutoAttention: (brand, brand) ranks 2nd and (cate_id, cate_id) ranks 4th. This indicates that the corresponding field pairs are indeed important according to AutoAttention, but not always the most important ones.

TABLE IV: Performance comparison of all baseline models using all fields and field pair selection.
Model | Alibaba Loss | Alibaba AUC | Tencent Loss | Tencent AUC
DIN+ | 0.2020±0.00027 | 0.6083±0.00013 | 0.3541±0.00064 | 0.7196±0.00025
DIEN+ | 0.2011±0.00037 | 0.6092±0.00046 | 0.3530±0.00011 | 0.7268±0.00017
DSIN+ | 0.1984±0.00032 | 0.6115±0.00058 | 0.3515±0.00063 | 0.7314±0.00014
MAF-S | 0.2015±0.00023 | 0.6087±0.00043 | 0.3531±0.00021 | 0.7269±0.00032
MAF-C | 0.2013±0.00015 | 0.6089±0.00046 | 0.3529±0.00060 | 0.7273±0.00031
DotProduct | 0.1992±0.00092 | 0.6108±0.00012 | 0.3518±0.00048 | 0.7312±0.00031
AutoAttention-w/oP | 0.1969±0.00038 | 0.6134±0.00006 | 0.3512±0.00031 | 0.7369±0.00012
CFI | 0.1983±0.00007 | 0.6115±0.00047 | 0.3516±0.00019 | 0.7349±0.00083
AutoAttention- | 0.1972±0.00029 | 0.6124±0.00025 | 0.3515±0.00012 | 0.7366±0.00034
AutoAttention | 0.1945±0.00062 | 0.6156±0.00053 | 0.3509±0.00081 | 0.7380±0.00040

IV-D3 Effect of different sparsity rates

Figure 4: Effect of different sparsity rates $S$ on the field pair weights on the two datasets ((a) Alibaba Dataset, (b) Tencent Dataset).

In this section, we study the impact of pruning with various sparsity rates in AutoAttention. The results are shown in Fig. 4.

At the beginning (the leftmost point), AutoAttention with $S\%=0$ means that we assign a different weight to each field pair but do not prune any of them. Its performance increases with a higher sparsity rate and reaches the best results at $S\%=0.6$ and $S\%=0.8$ on the two datasets, respectively. The AUC lift is 0.36% and 0.15% compared to no pruning. This verifies that identifying and removing irrelevant field pairs leads to a performance lift. As the sparsity rate becomes even higher, the performance deteriorates rapidly, due to the pruning of important field pairs.

IV-E AutoAttention as a Building Block (RQ3)

TABLE V: Experiments of replacing the original attention unit by DotProduct and AutoAttention in DIEN and DSIN.
Model | Attention | Alibaba Loss | Alibaba AUC | Tencent Loss | Tencent AUC
DIEN | Original | 0.2033±0.00015 | 0.6069±0.00025 | 0.3539±0.00027 | 0.7236±0.00040
DIEN | DotProduct | 0.2027±0.00016 | 0.6076±0.00005 | 0.3535±0.00009 | 0.7244±0.00018
DIEN | AutoAttention | 0.2020±0.00023 | 0.6080±0.00032 | 0.3531±0.00010 | 0.7258±0.00035
DSIN | Original | 0.2008±0.00008 | 0.6094±0.00005 | 0.3526±0.00037 | 0.7285±0.00076
DSIN | DotProduct | 0.2001±0.00034 | 0.6098±0.00042 | 0.3523±0.00019 | 0.7291±0.00013
DSIN | AutoAttention | 0.1998±0.00070 | 0.6101±0.00047 | 0.3521±0.00068 | 0.7294±0.00062

AutoAttention can be treated as a general-purpose attention unit. In addition to using it standalone to model a user's interest, we can also replace the original attention unit within DIEN and DSIN with the proposed DotProduct and AutoAttention. Note that we cannot obtain the original field embeddings of user behaviors, since DIEN and DSIN use hidden states to represent behaviors within the attention. Therefore, we only assign field-wise weights to fields from the target item, rather than field-pair-wise weights. In DIEN, we replace its attention function with AutoAttention:

$\alpha_{t} = \sigma\Big(b+\sum_{j=1}^{M^{\prime}}\langle\bm{h}_{t},\bm{e}_{F_{j}}\rangle\, R_{F_{j}}\Big)$ (10)

where $M^{\prime}$ denotes the number of fields from the target item side, and $\bm{h}_{t}$ is the hidden state of the user behavior at step $t$. In DSIN, we replace the attention function in the session interest activating layer. The formulation is similar to Eqn. (10), except that we use the embedding vector of a session interest instead of the hidden state of a user behavior.
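As a sketch, the field-wise variant of Eqn. (10) used inside DIEN can be written as follows; the hidden states, item-field embeddings, and the softmax over time steps are NumPy placeholders with illustrative shapes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fieldwise_attention(hidden_states, item_fields, R, b=0.0):
    """Field-wise attention weights alpha_t (cf. Eqn. (10)).

    hidden_states: (H, K) GRU hidden states h_t of the user behaviors.
    item_fields:   (M', K) target-item field embeddings e_{F_j}.
    R:             (M',) learnable field-wise strength weights R_{F_j}.
    Uses sum_j <h_t, e_{F_j}> R_{F_j} = <h_t, sum_j R_{F_j} e_{F_j}>.
    """
    weighted_query = (item_fields * R[:, None]).sum(axis=0)   # (K,)
    return softmax(b + hidden_states @ weighted_query)        # (H,)

rng = np.random.default_rng(0)
print(fieldwise_attention(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)), rng.normal(size=3)))
```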

As shown in Tab. V, the performance of DIEN and DSIN can be further boosted with the two proposed attention units. Note that replacing the original attention unit with AutoAttention introduces only marginal additional parameters and computational cost, which is negligible compared with the cost of the two methods themselves.

IV-F Visualization of field pair selection (RQ4)

Figure 5: Heat maps of the learnt field pair strength weights $R$ of AutoAttention on the two datasets ((a) Alibaba Dataset, (b) Tencent Dataset). The cells with a red box denote the field pairs selected by AutoAttention.

We now verify whether the learnt field pair weights really reflect the importance of each field interaction, and whether we can recognize important field pairs that are neglected by expert knowledge. We visualize the learnt field pair strength weights $R$ as heat maps on the two datasets, shown in Fig. 5(a) and Fig. 5(b), where the x-axis and y-axis denote the fields of the current sample (including item/user/context fields) and the user behavior fields, respectively.

We observe that field pairs within the item side fields indeed play an important role within the attention, such as (cate_id, brand), (brand, brand), (creative_id, ad_id), and (ams_second_industry_id, creative_id). In addition, we also observe that some field pairs from other sides are important, such as (scenario_id, cate_id) and (user_grade, ad_id). These field pairs are usually neglected when manually selecting fields or field pairs, but can be identified by AutoAttention.

V Related Works

In this section, we discuss two research areas related to our work, i.e., CTR prediction and user behavior modeling.

V-A CTR Prediction

CTR prediction is one of the most fundamental tasks in online advertising and recommendation systems, which aims at predicting the probability that a user clicks an item or ad. Pioneering works on CTR prediction are mainly based on Logistic Regression (LR) [3, 1, 2], polynomial models [25], collaborative filtering [26], tree models [27], Bayesian models [28], etc. In order to explicitly model the feature interactions, many factorization machine based methods have been proposed for high-dimensional data, such as Factorization Machine (FM) [29], Field-aware Factorization Machine (FFM) [30], Field-weighted Factorization Machine (FwFM) [31, 32], and Field-matrixed Factorization Machine (FmFM) [33]. Besides, there are some works that aim at learning weights for different feature interactions, including Attentional Factorization Machines (AFM) [34], Dual-attentional Factorization Machines (DFM) [35], and Dual Input-aware Factorization Machines (DIFM) [36].

Since the number of samples and the dimension of features have become larger and larger, many deep learning based models have been proposed, such as Wide&Deep [24], Deep Crossing [37], YouTube Recommendation [7], PNN [38], Deep&Cross [39]. There are some studies that combine FM with DNN, such as DeepFM [40], NFM [41], xDeepFM [42], InterAtt [43], DeepLight [15] and DCN V2 [44].

V-B User Behavior Modeling

Traditional methods take a straightforward way to represent each behavior with an embedding vector, and then do a sum or mean pooling over all these embedding vectors to generate one embedding [7]. Then many works propose to assign a dynamic weight for each behavior and then conduct weighted sum pooling, such as Deep Interest Network (DIN) [4], Deep Interest Evolution Network (DIEN) [5], and Deep Session Interest Network (DSIN) [6]. There are also many works to use RNN or Transformer for behavior sequence modeling, including GRU4Rec [19], SAS4Rec [20], BERT4Rec [21], and Behavior Sequence Transformer (BST) [22].

Recently, there are also some works that further consider long-term historical behavior sequences, such as the Multi-channel user Interest Memory Network (MIMN) [45], Hierarchical Periodic Memory Network (HPMN) [46], Search-based Interest Model (SIM) [47], UBR4CTR [48], ETA [49], and LimaRec [50].

VI Conclusion

In this paper, we propose an efficient user interest model, AutoAttention, for CTR prediction. We propose to include all fields from the item/user/context sides as the query fields and interact them with the behavior fields within the attention. We assign a learnable weight to each field pair between behavior fields and query fields to capture their different importance. Pruning field pairs via these weights identifies and removes irrelevant and noisy field pairs, leading to a performance lift and a reduction of computation complexity. Comprehensive experiments on public and production datasets demonstrate the effectiveness of the proposed approach.

Acknowledgment

This work was supported by the National Key R&D Program of China [2020YFB1707903]; the National Natural Science Foundation of China [61972254, 62272302]; Shanghai Municipal Science and Technology Major Project [2021SHZDZX0102]; the CCF-Tencent Open Fund [RAGR20200105]; and the Tencent Marketing Solution Rhino-Bird Focused Research Program [FR202001].

References

  • [1] O. Chapelle, E. Manavoglu, and R. Rosales, “Simple and scalable response prediction for display advertising,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 4, pp. 1–34, 2014.
  • [2] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin et al., “Ad click prediction: a view from the trenches,” in ACM SIGKDD International conference on Knowledge Discovery & Data Mining (KDD), 2013, pp. 1222–1230.
  • [3] M. Richardson, E. Dominowska, and R. Ragno, “Predicting clicks: estimating the click-through rate for new ads,” in World Wide Web Conference (WWW), 2007, pp. 521–530.
  • [4] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai, “Deep interest network for click-through rate prediction,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2018, pp. 1059–1068.
  • [5] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai, “Deep interest evolution network for click-through rate prediction,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, no. 01, 2019, pp. 5941–5948.
  • [6] Y. Feng, F. Lv, W. Shen, M. Wang, F. Sun, Y. Zhu, and K. Yang, “Deep session interest network for click-through rate prediction,” in International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 2301–2307.
  • [7] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for youtube recommendations,” in ACM Recommender Systems Conference (RecSys), 2016, pp. 191–198.
  • [8] G. Ke, D. He, and T. Liu, “Rethinking positional encoding in language pre-training,” in International Conference on Learning Representations (ICLR), 2021.
  • [9] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng, “Synthesizer: Rethinking self-attention for transformer models,” in International Conference on Machine Learning (ICML).   PMLR, 2021, pp. 10 183–10 192.
  • [10] Y. Xie, P. Zhou, and S. Kim, “Decoupled side information fusion for sequential recommendation,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, and G. Kazai, Eds., 2022, pp. 1611–1621.
  • [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
  • [12] S. Rendle, W. Krichene, L. Zhang, and J. Anderson, “Neural collaborative filtering vs. matrix factorization revisited,” in ACM Recommender Systems Conference (RecSys), 2020, pp. 240–248.
  • [13] W. Deng, X. Zhang, F. Liang, and G. Lin, “An adaptive empirical bayesian method for sparse deep learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 5564–5574.
  • [14] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” in International Conference on Learning Representations (ICLR), 2019.
  • [15] W. Deng, J. Pan, T. Zhou, D. Kong, A. Flores, and G. Lin, “Deeplight: Deep lightweight feature interactions for accelerating ctr predictions in ad serving,” in ACM International Conference on Web Search and Data Mining (WSDM), 2021, pp. 922–930.
  • [16] B. Liu, C. Zhu, G. Li, W. Zhang, J. Lai, R. Tang, X. He, Z. Li, and Y. Yu, “Autofis: Automatic feature interaction selection in factorization models for click-through rate prediction,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2020, pp. 2636–2645.
  • [17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations (ICLR), 2021.
  • [18] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, “Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer,” in ACM International Conference on Information and Knowledge Management (CIKM), 2019, pp. 1441–1450.
  • [19] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session-based recommendations with recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2016.
  • [20] W.-C. Kang and J. McAuley, “Self-attentive sequential recommendation,” in IEEE International Conference on Data Mining (ICDM), 2018, pp. 197–206.
  • [21] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, “Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer,” in ACM International Conference on Information and Knowledge Management (CIKM), 2019, pp. 1441–1450.
  • [22] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou, “Behavior sequence transformer for e-commerce recommendation in alibaba,” in International Workshop on Deep Learning Practice for High-Dimensional Sparse Data (DLP-KDD), 2019, pp. 1–4.
  • [23] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research (JMLR), vol. 12, no. 7, 2011.
  • [24] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” in Workshop on Deep Learning for Recommender Systems (DLRS), 2016, pp. 7–10.
  • [25] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, “Training and testing low-degree polynomial data mappings via linear svm.” Journal of Machine Learning Research (JMLR), vol. 11, no. 4, 2010.
  • [26] S. Shen, B. Hu, W. Chen, and Q. Yang, “Personalized click model through collaborative filtering,” in ACM International Conference on Web Search and Data Mining (WSDM), 2012, pp. 323–332.
  • [27] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers et al., “Practical lessons from predicting clicks on ads at facebook,” in International Workshop on Data Mining for Online Advertising (ADKDD), 2014, pp. 1–9.
  • [28] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich, “Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine,” in International Conference on Machine Learning (ICML), 2010, pp. 13–20.
  • [29] S. Rendle, “Factorization machines,” in IEEE International Conference on Data Mining (ICDM), 2010, pp. 995–1000.
  • [30] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin, “Field-aware factorization machines for ctr prediction,” in ACM Conference on Recommender Systems (RecSys), 2016, pp. 43–50.
  • [31] J. Pan, J. Xu, A. L. Ruiz, W. Zhao, S. Pan, Y. Sun, and Q. Lu, “Field-weighted factorization machines for click-through rate prediction in display advertising,” in World Wide Web Conference (WWW), 2018, pp. 1349–1357.
  • [32] J. Pan, Y. Mao, A. L. Ruiz, Y. Sun, and A. Flores, “Predicting different types of conversions with multi-task learning in online advertising,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2019, pp. 2689–2697.
  • [33] Y. Sun, J. Pan, A. Zhang, and A. Flores, “Fm2: field-matrixed factorization machines for recommender systems,” in The Web Conference (WWW), 2021, pp. 2828–2837.
  • [34] J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T. Chua, “Attentional factorization machines: Learning the weight of feature interactions via attention networks,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 3119–3125.
  • [35] F. Liu, W. Guo, H. Guo, R. Tang, Y. Ye, and X. He, “Dual-attentional factorization-machines based neural network for user response prediction,” in The Web Conference (WWW), 2020, pp. 26–27.
  • [36] W. Lu, Y. Yu, Y. Chang, Z. Wang, C. Li, and B. Yuan, “A dual input-aware factorization machine for ctr prediction.” in International Joint Conference on Artificial Intelligence (IJCAI), 2020, pp. 3139–3145.
  • [37] Y. Shan, T. R. Hoens, J. Jiao, H. Wang, D. Yu, and J. Mao, “Deep crossing: Web-scale modeling without manually crafted combinatorial features,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2016, pp. 255–262.
  • [38] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang, “Product-based neural networks for user response prediction,” in International Conference on Data Mining (ICDM), 2016, pp. 1149–1154.
  • [39] R. Wang, B. Fu, G. Fu, and M. Wang, “Deep & cross network for ad click predictions,” in International Workshop on Data Mining for Online Advertising (ADKDD), 2017, pp. 1–7.
  • [40] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “Deepfm: a factorization-machine based neural network for ctr prediction,” in International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 1725–1731.
  • [41] X. He and T.-S. Chua, “Neural factorization machines for sparse predictive analytics,” in International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR), 2017, pp. 355–364.
  • [42] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun, “xdeepfm: Combining explicit and implicit feature interactions for recommender systems,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2018, pp. 1754–1763.
  • [43] Z. Li, W. Cheng, Y. Chen, H. Chen, and W. Wang, “Interpretable click-through rate prediction through hierarchical attention,” in ACM International Conference on Web Search and Data Mining (WSDM), 2020, pp. 313–321.
  • [44] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, and E. Chi, “Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems,” in The Web Conference (WWW), 2021, pp. 1785–1797.
  • [45] Q. Pi, W. Bian, G. Zhou, X. Zhu, and K. Gai, “Practice on long sequential user behavior modeling for click-through rate prediction,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2019, pp. 2671–2679.
  • [46] K. Ren, J. Qin, Y. Fang, W. Zhang, L. Zheng, W. Bian, G. Zhou, J. Xu, Y. Yu, X. Zhu et al., “Lifelong sequential modeling with personalized memorization for user response prediction,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2019, pp. 565–574.
  • [47] Q. Pi, G. Zhou, Y. Zhang, Z. Wang, L. Ren, Y. Fan, X. Zhu, and K. Gai, “Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction,” in ACM International Conference on Information & Knowledge Management (CIKM), 2020, pp. 2685–2692.
  • [48] J. Qin, W. Zhang, X. Wu, J. Jin, Y. Fang, and Y. Yu, “User behavior retrieval for click-through rate prediction,” in International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2020, pp. 2347–2356.
  • [49] Q. Chen, C. Pei, S. Lv, C. Li, J. Ge, and W. Ou, “End-to-end user behavior retrieval in click-through rate prediction model,” arXiv preprint arXiv:2108.04468, 2021.
  • [50] Y. Wu, L. Yin, D. Lian, M. Yin, N. Z. Gong, J. Zhou, and H. Yang, “Rethinking lifelong sequential recommendation with incremental multi-interest attention,” arXiv preprint arXiv:2105.14060, 2021.