Prospective Preference Enhanced Mixed Attentive Model for Session-based Recommendation

Bo Peng, Chang-Yu Tai, Srinivasan Parthasarathy, and Xia Ning^∗

Bo Peng and Chang-Yu Tai are with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210. E-mail: [email protected], [email protected]

Srinivasan Parthasarathy and Xia Ning are with the Department of Biomedical Informatics, and the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210. E-mail: [email protected], [email protected]

^∗Corresponding author

Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

Session-based recommendation aims to generate recommendations for the next item of users’ interest based on a given session. In this manuscript, we develop a prospective preference enhanced mixed attentive model ( $\mathop{\mathtt{P^{2}MAM}}\limits$ ) to generate session-based recommendations using two important factors: temporal patterns and estimates of users’ prospective preferences. Unlike existing methods, $\mathop{\mathtt{P^{2}MAM}}\limits$ models the temporal patterns using a lightweight yet effective position-sensitive attention mechanism. In $\mathop{\mathtt{P^{2}MAM}}\limits$ , we also leverage the estimate of users’ prospective preferences to signify important items and generate better recommendations. Our experimental results demonstrate that $\mathop{\mathtt{P^{2}MAM}}\limits$ models significantly outperform the state-of-the-art methods on six benchmark datasets, with an improvement of up to 19.2%. In addition, our run-time performance comparison demonstrates that during testing, $\mathop{\mathtt{P^{2}MAM}}\limits$ models are much more efficient than the best baseline method, with an average speedup of 47.7-fold.

Index Terms:

session-based recommendation, recommender system, attention mechanism

1 Introduction

Session-based recommendation aims to generate recommendations for the next item of users’ interest based on a given session (i.e., a sequence of items chronologically ordered according to user interactions in a short time period). It has been drawing increasing attention from the research community due to its wide applications in online shopping [1, 2], music streaming [3] and tourist planning [4], among others. With the prosperity of deep learning, many deep models, particularly those based on recurrent neural networks (RNNs) [5] and graph neural networks (GNNs) [6], have been developed for session-based recommendation, and have demonstrated state-of-the-art performance. These methods primarily model the temporal patterns (e.g., transitions, recency patterns, etc.) in sessions, but are not always effective in modeling other important factors that are indicative of the next item. In addition, existing methods primarily model the temporal patterns using gated recurrent units (GRUs) [7]. However, considering the notoriously sparse nature of session-based recommendation datasets, as shown in the literature [8], the complicated GRUs may not be well learned and could degrade the performance. Due to their recurrent nature, GRUs also suffer from limited parallelizability and poor interpretability.

To mitigate the limitations of existing methods, in this manuscript, we develop a prospective preference enhanced mixed attentive model, denoted as $\mathop{\mathtt{P^{2}MAM}}\limits$ , for session-based recommendation. In $\mathop{\mathtt{P^{2}MAM}}\limits$ , different from existing methods, we model the temporal patterns using a novel position-sensitive attention mechanism, which is lightweight, fully parallelizable, and could enable better interpretability than GRUs. Besides the temporal patterns, we also leverage the estimate of users’ prospective preferences for better recommendation. Users’ prospective preferences are another important factor for recommendation. Intuitively, if we had known beforehand that the user is going to watch action movies next (i.e., prospective preference), we could generate better recommendations by learning her/his preference from the action movies instead of the comedy movies in her/his watching history. We conducted an analysis to empirically verify that users’ prospective preferences could signify important items, and thus improve the recommendation performance (see the Discussion section). The results reveal that conditioned on the prospective preferences, we could learn indicative attention weights over items, and enable superior performance. However, in practice, users’ prospective preferences are usually intractable. Thus, in $\mathop{\mathtt{P^{2}MAM}}\limits$ , we explicitly estimate the prospective preferences, and learn attention weights based on the estimate to boost the recommendation performance.

With different combinations of the two factors (i.e., temporal patterns and the estimate of users’ prospective preferences), $\mathop{\mathtt{P^{2}MAM}}\limits$ has three variants: $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ . $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ models temporal patterns using a novel position-sensitive attention mechanism. $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ leverages the estimate of users’ prospective preferences to weigh items and generate recommendations. $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ explicitly leverages the two factors for better recommendation.

We compare $\mathop{\mathtt{P^{2}MAM}}\limits$ with five state-of-the-art baseline methods on six benchmark session-based recommendation datasets. Our experimental results demonstrate that $\mathop{\mathtt{P^{2}MAM}}\limits$ significantly outperforms the state-of-the-art methods on all the datasets, with an improvement of up to 19.2%. The results also show that on most of the datasets, the two factors are mutually strengthened, and could enable superior performance when used together. We also conduct a comprehensive analysis to verify the effectiveness of different components in $\mathop{\mathtt{P^{2}MAM}}\limits$ . The results show that with the position embeddings, our position-sensitive attention mechanism could effectively capture the temporal patterns in the datasets, and on most of the datasets, our learning-based prospective preference estimate strategy could be more effective than recency-based strategies. Moreover, we conduct a run-time performance analysis, and find that $\mathop{\mathtt{P^{2}MAM}}\limits$ is much more efficient than the best baseline method, with an average speedup of 47.7-fold over the six datasets.

Our major contributions are summarized as follows:

• We develop a novel session-based recommendation method $\mathop{\mathtt{P^{2}MAM}}\limits$ , which leverages both the temporal patterns and estimates of users’ prospective preferences for recommendation.
• $\mathop{\mathtt{P^{2}MAM}}\limits$ significantly outperforms five state-of-the-art methods on six benchmark datasets (Section 6.1).
• Our analysis demonstrates the importance of modeling the position information for session-based recommendation (Section 6.5).
• The experimental results show that our learning-based prospective preference estimate strategy is more effective than the existing recency-based strategy (Section 6.6).
• Our analysis shows that the learned attention weights in $\mathop{\mathtt{P^{2}MAM}}\limits$ could capture the temporal patterns in the data (Section 6.7).
• Our analysis verifies that users’ prospective preferences could signify important items, and benefit recommendations (Section 7).
• For reproducibility purposes, we release our source code on GitHub (https://github.com/ninglab/P2MAM), and report the hyper parameters in the Appendix.

2 Related Work

2.1 Session-based Recommendation

In the last few years, numerous session-based recommendation methods have been developed, particularly using Markov Chains (MCs), attention mechanisms and neural networks such as RNNs and GNNs. MCs-based methods [9] use MCs to capture the transitions among items for recommendation. For example, Rendle et al. [9] employ a first-order MC to generate recommendations based on the transitions of the last item in each session. Attention-based methods [1, 10] model the importance of items for the recommendation. For example, Liu et al. [10] developed a short-term attention priority model ( $\mathop{\mathtt{STAMP}}\limits$ ), which adapts a gate mechanism to capture users’ short-term preferences. Recently, RNNs-based methods such as $\mathop{\mathtt{GRU4Rec+}}\limits$ [11] and $\mathop{\mathtt{NARM}}\limits$ [1] have been developed to model the temporal patterns among items primarily using GRUs.

GNNs-based methods have also been extensively developed for session-based recommendation. Wu et al. [2] converted sessions to directed graphs, and developed a GNNs-based model ( $\mathop{\mathtt{SR\text{-}GNN}}\limits$ ) to generate recommendations based on the graph structures. Qiu et al. [12] re-examined the importance of item ordering in session-based recommendation and developed a GNNs-based model ( $\mathop{\mathtt{FGNN}}\limits$ ), which includes a self-loop for each node in the graphs to better capture users’ short-term preferences. Chen et al. [4] showed that the widely used directed graph representations cannot fully preserve the sequential information in sessions. To mitigate this problem, they converted sessions to multigraphs, and developed a GNNs-based model ( $\mathop{\mathtt{LESSR}}\limits$ ) to generate recommendations. Xia et al. [3] developed a hypergraph-based model ( $\mathop{\mathtt{DHCN}}\limits$ ), which leverages hypergraphs and hypergraph convolutional networks to capture the high-order information among items.

2.2 Sequential Recommendation

Sequential recommendation aims to generate recommendations for the next items of users’ interest based on users’ historical interactions. It is closely related to session-based recommendation, except that in sequential recommendation, we could access users’ historical interactions over a long time period (e.g., months). In the last few years, neural networks (e.g., RNNs) and attention mechanisms have been extensively employed in sequential recommendation methods. For example, RNNs-based methods such as User-based RNNs [13] incorporate user characteristics into GRUs for personalized recommendation. Skip-gram-based methods [14] leverage the skip-gram model [15] to capture the co-occurrence among items in a time window. Recently, Convolutional Neural Networks (CNNs) have also been adapted for sequential recommendation. Tang et al. [16] developed a CNNs-based model, which uses multiple convolutional filters to model the synergies [17] among items. Yuan et al. [18] developed another CNNs-based generative model, NextItRec, to better capture the long-term dependencies in sequential recommendation. Besides CNNs, attention-based methods [19, 20, 21] have also been developed for sequential recommendation. Kang et al. [19] developed a self-attention based model, which adapts self-attention to better model users’ long-term preferences. Sun et al. [20] further developed a bidirectional self-attention based model to improve the representational power of item embeddings.

3 Definitions and Notations

TABLE I: Notations

notations	meanings
$m$	the number of items
$d$	the dimensionality of embeddings
$S$	the original session
$A$	transformed fixed-length sequence
$n$	the number of items in $A$
$\mathop{\mathbf{h}^{o}}\limits$	position-sensitive preference prediction
$\mathop{\mathbf{h}^{p}}\limits$	prospective preferences-sensitive prediction
$\hat{\mathbf{r}}$	recommendation scores

In this manuscript, we tackle the following recommendation problem: given an anonymous session, we recommend the next item of the user’s interest in the session. An anonymous session is represented as a sequence $S=\{s_{1},s_{2},\dots,s_{|S|}\}$ , where $s_{t}$ is the $t$ -th item in the session and $|S|$ is the length of the session. We use upper-case letters to denote matrices, lower-case and bold letters to denote row vectors, and lower-case non-bold letters to denote scalars. Table I presents the key notations used in this manuscript.

4 Methods

Figure 1 presents the overall architecture of $\mathop{\mathtt{P^{2}MAM}}\limits$ . $\mathop{\mathtt{P^{2}MAM}}\limits$ generates recommendations via a session representation component, a position-sensitive preference prediction component, and a prospective preference enhanced preference prediction component. We will discuss each component in detail below.

4.1 Session Representation ( $\mathop{\mathtt{SR}}\limits$ )

Previous studies [19, 17, 16] have shown that recently interacted items are more indicative of the next item than earlier ones. Following these studies, we focus on the most recent $n$ items (i.e., the last $n$ items) in a session to generate recommendations. Particularly, given a session $S=\{s_{1},s_{2},\dots,s_{|S|}\}$ , we transform it to a fixed-length sequence $A=\{a_{1},a_{2},\dots,a_{n}\}$ , which contains the last $n$ items in $S$ (i.e., $a_{i}$ = $s_{|S|-n+i}$ , $i=1,\cdots,n$ ). If $S$ is shorter than $n$ , we pad empty items at the beginning of $A$ until it reaches length $n$ .
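The transformation above can be illustrated with a minimal Python sketch. The function name `to_fixed_length` and the use of `0` as the id of the padded empty item are assumptions made for illustration only; they are not taken from the released code.

```python
PAD = 0  # hypothetical id reserved for the padded "empty" item


def to_fixed_length(session, n, pad=PAD):
    """Keep the last n items of a session; left-pad shorter sessions (n >= 1)."""
    recent = session[-n:]  # the last n items, or the whole session if shorter
    return [pad] * (n - len(recent)) + recent


print(to_fixed_length([5, 8, 2, 9, 4], n=3))  # [2, 9, 4]
print(to_fixed_length([5, 8], n=4))           # [0, 0, 5, 8]
```

Padding at the beginning keeps the most recent item at position $n$ for every session, so each position index carries the same recency meaning across sessions.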

Figure 1: Overall Architecture

In $\mathop{\mathtt{P^{2}MAM}}\limits$ , we represent the items in sessions using learnable embeddings. Specifically, we learn an item embedding matrix $\mbox{$\mathop{V}\limits$}\in\mathbb{R}^{m\times d}$ , in which the $j$ -th row $\mathbf{v}_{j}$ is the embedding of item $j$ , and $m$ and $d$ are the number of all the items and the dimensionality of embeddings, respectively. Given $\mathop{V}\limits$ , we represent the items in $A$ by a matrix $\mbox{$\mathop{E}\limits$}\in\mathbb{R}^{n\times d}$ as follows:

\mbox{$\mathop{E}\limits$}=[\mathbf{v}_{a_{1}};\mathbf{v}_{a_{2}};\cdots;\mathbf{v}_{a_{n}}],

(1)

where $\mathbf{v}_{a_{i}}$ is the embedding of item $a_{i}$ . Following the previous work [22, 19, 21], we use a constant zero vector as the embedding of padded empty items.

We also learn embeddings for the $n$ positions in the fixed-length sessions to capture the temporal patterns. Specifically, following Vaswani et al. [22], we learn a position embedding matrix $\mbox{$\mathop{P}\limits$}\in\mathbb{R}^{n\times d}$ :

\mbox{$\mathop{P}\limits$}=[\mathbf{p}_{1};\mathbf{p}_{2};\cdots;\mathbf{p}_{n}],

(2)

in which the $t$ -th row $\mathbf{p}_{t}$ is the embedding of the $t$ -th position.

4.2 Position-sensitive Preference Prediction ( $\mathop{\mathtt{PS}}\limits$ )

It has been shown [8, 10, 17] that the temporal patterns play an important role in predicting users’ preferences. Existing session-based recommendation methods model the temporal patterns primarily using GRUs-based methods [7, 23], which implicitly learn weights over items. However, as demonstrated in the literature [8], the complicated GRUs-based models may not be well learned on the notoriously sparse recommendation datasets, and they also suffer from poor interpretability [24] and limited parallelizability [8, 19].

In $\mathop{\mathtt{P^{2}MAM}}\limits$ , different from other methods, we develop a novel position-sensitive attention mechanism to model the temporal patterns, and generate predictions of users’ preferences accordingly. Specifically, we use a dot-product attention mechanism as follows:

\mbox{$\mathop{C}\limits$}=\mbox{$\mathop{E}\limits$}+\mbox{$\mathop{P}\limits$},\quad\bm{\alpha}=\text{softmax}\left(\frac{\mathbf{q}\mbox{$\mathop{C}\limits$}^{\top}}{\sqrt{d}}\right),\quad\mbox{$\mathop{\mathbf{h}^{o}}\limits$}=\bm{\alpha}\mbox{$\mathop{C}\limits$},

(3)

where $\mathop{C}\limits$ combines item embeddings and position embeddings in the session, $\mathbf{c}_{i}$ is the $i$ -th row in $\mathop{C}\limits$ , $\bm{\alpha}$ is a vector of attention weights, $\mathbf{q}\in\mathbb{R}^{1\times d}$ is a learnable vector shared by all the sessions, and $\mathop{\mathbf{h}^{o}}\limits$ is the position-sensitive preference prediction. The intuition of our position-sensitive preference prediction is that items themselves (their contents) and their relative orders in the sessions represent meaningful information related to user preferences; we can use data-driven attention weights from item embeddings and position embeddings to capture such information. Different from existing attentive session-based methods [10, 1, 25], which learn attention weights without explicitly considering the position information, we explicitly incorporate position embeddings to learn position-sensitive attention weights. Compared to GRUs-based methods [7, 23], the simple and lightweight attention mechanism is easier to learn well on the sparse recommendation datasets, and it could also provide better interpretability and parallelizability.
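To make Equation 3 concrete, the following is a minimal pure-Python sketch of the forward computation for a single session. The toy values of $E$, $P$ and $\mathbf{q}$ are made up for illustration, and the function names are hypothetical; in the model, these quantities are learned, and the sketch does not treat padded positions in any special way.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def position_sensitive_attention(E, P, q):
    """Eq. (3): C = E + P, alpha = softmax(q C^T / sqrt(d)), h_o = alpha C."""
    d = len(q)
    # Combine item and position embeddings row by row.
    C = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, P)]
    # One scaled dot-product logit per position.
    logits = [sum(qk * ck for qk, ck in zip(q, c)) / math.sqrt(d) for c in C]
    alpha = softmax(logits)
    # Attention-weighted sum of the rows of C.
    h_o = [sum(a * c[k] for a, c in zip(alpha, C)) for k in range(d)]
    return alpha, h_o


# Toy session with n = 3 positions and d = 2 dimensions (values are illustrative).
E = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.9]]
P = [[0.0, 0.1], [0.1, 0.1], [0.3, 0.2]]
q = [0.7, -0.2]
alpha, h_o = position_sensitive_attention(E, P, q)
print(alpha, h_o)
```

Because $\bm{\alpha}$ contains one weight per position, it can be inspected directly to see which positions in a session the model attends to.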

4.3 Prospective Preference Enhanced Preference Prediction ( $\mathop{\mathtt{P^{2}E}}\limits$ )

As presented in Section 1, users’ prospective preferences reveal important items, and thus, could benefit the recommendation. Motivated by this, in $\mathop{\mathtt{P^{2}MAM}}\limits$ , we develop a novel strategy that leverages the preference prediction (i.e., $\mathop{\mathbf{h}^{o}}\limits$ ) from $\mathop{\mathtt{PS}}\limits$ as an estimate of users’ prospective preferences to enable better attention weights over items. Specifically, we employ a multi-head attention mechanism [22] as follows:

\begin{aligned} \bm{\beta}_{i}&=\text{softmax}\left(\frac{(\mbox{$\mathop{\mathbf{h}^{o}}\limits$}Q_{i})(\mbox{$\mathop{C}\limits$}K_{i})^{\top}}{\sqrt{d}}\right),\\ \mathbf{head}_{i}&=\bm{\beta}_{i}(\mbox{$\mathop{C}\limits$}W_{i}),\\ \mbox{$\mathop{\mathbf{h}^{p}}\limits$}&=[\mathbf{head}_{1},\mathbf{head}_{2},\dots,\mathbf{head}_{b}]W,\end{aligned}

(4)

where $\bm{\beta}_{i}$ is the vector of attention weights from the $i$ -th head, $Q_{i}\in\mathbb{R}^{d\times\frac{d}{b}}$ , $K_{i}\in\mathbb{R}^{d\times\frac{d}{b}}$ , and $W_{i}\in\mathbb{R}^{d\times\frac{d}{b}}$ are learnable projection matrices in the $i$ -th head, and $\mathbf{head}_{i}$ is the output of the $i$ -th head, $b$ is the number of heads, $W\in\mathbb{R}^{d\times d}$ is a learnable projection matrix shared by all the heads, and $\mbox{$\mathop{\mathbf{h}^{p}}\limits$}\in\mathbb{R}^{1\times d}$ is the prospective preference enhanced preference prediction. The intuition of our strategy (i.e., $\mathop{\mathtt{P^{2}E}}\limits$ ) is that we learn predictive attention weights based on the estimate of users’ prospective preferences. As will be shown in Section 6.1, this strategy could significantly improve the recommendation performance. Note that $\mathop{\mathtt{P^{2}E}}\limits$ serves as a general strategy that is adaptable to other estimates of users’ prospective preferences. We also tried the dot-product attention as in $\mathop{\mathtt{PS}}\limits$ but empirically found that it produces inferior performance.
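A minimal pure-Python sketch of the forward computation in Equation 4 is given below for illustration. The helper names (`matvec`, `p2e`) and the toy projection matrices are assumptions; in the model, $Q_{i}$ , $K_{i}$ , $W_{i}$ and $W$ are learned, and $\mathop{\mathbf{h}^{o}}\limits$ comes from $\mathop{\mathtt{PS}}\limits$ .

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(x, M):
    """Multiply a row vector x (length r) by a matrix M (r x c)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def p2e(h_o, C, Qs, Ks, Ws, W):
    """Eq. (4): multi-head attention over C, queried by the estimate h_o."""
    d = len(h_o)
    heads = []
    for Q_i, K_i, W_i in zip(Qs, Ks, Ws):
        query = matvec(h_o, Q_i)              # 1 x (d/b)
        keys = [matvec(c, K_i) for c in C]    # n x (d/b)
        values = [matvec(c, W_i) for c in C]  # n x (d/b)
        logits = [sum(u * v for u, v in zip(query, key)) / math.sqrt(d)
                  for key in keys]
        beta = softmax(logits)                # attention weights of this head
        head = [sum(w * val[j] for w, val in zip(beta, values))
                for j in range(len(values[0]))]
        heads.extend(head)                    # concatenate the b heads
    return matvec(heads, W)                   # final projection, 1 x d


# Toy example with d = 2 and b = 2 heads (all values are illustrative).
C = [[1.0, 0.0], [0.0, 1.0]]
h_o = [1.0, 2.0]
Qs = [[[0.3], [0.1]], [[-0.2], [0.4]]]
Ks = [[[1.0], [0.0]], [[0.0], [1.0]]]
Ws = [[[1.0], [0.0]], [[0.0], [1.0]]]
W = [[1.0, 0.0], [0.0, 1.0]]
print(p2e(h_o, C, Qs, Ks, Ws, W))
```

The only difference from standard self-attention in this sketch is the query: a single vector (the estimate of the prospective preference) attends over all positions of the session.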

4.4 Recommendation Scores in $\mathop{\mathtt{P^{2}MAM}}\limits$

In $\mathop{\mathtt{P^{2}MAM}}\limits$ , we calculate recommendation scores based on the predictions of users’ preferences (i.e., $\mathop{\mathbf{h}^{o}}\limits$ and $\mathop{\mathbf{h}^{p}}\limits$ ). Specifically, we develop three different methods to calculate the scores. Based on the scores, the items with the top- $k$ largest scores will be recommended.

4.4.1 Scores based on temporal patterns ( $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ )

Similarly to existing methods [1, 2], we calculate the recommendation scores based on the temporal patterns as follows:

\hat{\mathbf{r}}=\text{softmax}(\mbox{$\mathop{\mathbf{h}^{o}}\limits$}\mbox{$\mathop{V}\limits$}^{\top}),

(5)

where $\hat{\mathbf{r}}$ is a vector of recommendation scores over candidate items, $\mathop{V}\limits$ is the item embedding matrix (Section 4.1), and the softmax function is employed to normalize the scores into the range $[0,1]$ . We denote this $\mathop{\mathtt{P^{2}MAM}}\limits$ variant as $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ .

4.4.2 Scores based on users’ prospective preferences ( $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ )

In $\mathop{\mathtt{P^{2}MAM}}\limits$ , we could also generate recommendations from the prospective preference enhanced preference prediction $\mathop{\mathbf{h}^{p}}\limits$ as follows:

\hat{\mathbf{r}}=\text{softmax}(\mbox{$\mathop{\mathbf{h}^{p}}\limits$}\mbox{$\mathop{V}\limits$}^{\top}).

(6)

We denote this $\mathop{\mathtt{P^{2}MAM}}\limits$ variant as $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ .

4.4.3 Scores based on temporal patterns and users’ prospective preferences ( $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ )

The two preference predictions (i.e., $\mathop{\mathbf{h}^{o}}\limits$ and $\mathop{\mathbf{h}^{p}}\limits$ ) could be mutually strengthened and might enable better performance when used together. Following this motivation, we also calculate recommendation scores using both $\mathop{\mathbf{h}^{o}}\limits$ and $\mathop{\mathbf{h}^{p}}\limits$ as follows:

\hat{\mathbf{r}}=\text{softmax}((\mbox{$\mathop{\mathbf{h}^{o}}\limits$}+\mbox{$\mathop{\mathbf{h}^{p}}\limits$})\mbox{$\mathop{V}\limits$}^{\top}),

(7)

where we sum $\mathop{\mathbf{h}^{o}}\limits$ and $\mathop{\mathbf{h}^{p}}\limits$ for the recommendation. We denote the $\mathop{\mathtt{P^{2}MAM}}\limits$ variant using both $\mathop{\mathbf{h}^{o}}\limits$ and $\mathop{\mathbf{h}^{p}}\limits$ as $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ .
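The three scoring variants in Equations 5 to 7 differ only in the preference vector that is multiplied with $\mbox{$\mathop{V}\limits$}^{\top}$ . The following minimal Python sketch illustrates this; the function name `recommend` and the toy vectors are assumptions for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def recommend(h, V, k):
    """Score every item as softmax(h V^T) and return the top-k item indices."""
    logits = [sum(hj * vj for hj, vj in zip(h, v)) for v in V]
    scores = softmax(logits)
    top_k = sorted(range(len(V)), key=lambda j: scores[j], reverse=True)[:k]
    return scores, top_k


# Toy item embeddings (m = 3 items, d = 2) and preference predictions.
V = [[1.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
h_o = [1.0, 0.0]  # position-sensitive prediction (illustrative values)
h_p = [0.0, 1.0]  # prospective preference enhanced prediction (illustrative)

_, top_o = recommend(h_o, V, k=2)                                 # Eq. (5)
_, top_p = recommend(h_p, V, k=2)                                 # Eq. (6)
_, top_op = recommend([a + b for a, b in zip(h_o, h_p)], V, k=2)  # Eq. (7)
print(top_o, top_p, top_op)
```

Since the softmax is monotonic, the top- $k$ items are the same whether they are ranked by the normalized scores or by the raw dot products; the normalization matters for the training loss.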

4.5 Network Training

Following the literature [10, 2, 4], we adopt the cross-entropy loss to minimize the negative log likelihood of correctly recommending the ground-truth next item as follows:

\min\limits_{\bm{\Theta}}\sum\nolimits_{i=1}^{|T|}-\mathbf{r}_{i}\log(\hat{\mathbf{r}}_{i}^{\top}),

(8)

where $T$ is the set of all the training sessions, $\mathbf{r}_{i}$ is a one-hot vector in which the dimension $j$ is 1 if item $j$ is the ground-truth next item in the $i$ -th training session or 0 otherwise, $\hat{\mathbf{r}}_{i}$ is the vector of recommendation scores for the $i$ -th training session, and $\Theta$ is the set of learnable parameters (e.g., $\mathop{V}\limits$ , $\mathop{P}\limits$ and $W$ ). All the learnable parameters are randomly initialized, and are optimized in an end-to-end manner. We use this objective to optimize all the $\mathop{\mathtt{P^{2}MAM}}\limits$ variants (i.e., $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ ).
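Because $\mathbf{r}_{i}$ is one-hot, each term of Equation 8 reduces to the negative log of the score assigned to the ground-truth item. The short Python sketch below illustrates this reduction on made-up score vectors; the function name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(score_vectors, targets):
    """Eq. (8): sum over sessions of -log(score of the ground-truth next item)."""
    return sum(-math.log(r_hat[t]) for r_hat, t in zip(score_vectors, targets))


# Two toy sessions over m = 2 items; targets hold ground-truth item indices.
r_hat = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
print(cross_entropy(r_hat, targets))  # equals log(2) + log(4/3)
```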

5 Materials

5.1 Baseline Methods

We compare $\mathop{\mathtt{P^{2}MAM}}\limits$ with five state-of-the-art baseline methods:

• $\mathop{\mathtt{POP}}\limits$ [2] recommends the most popular items of each session.
• $\mathop{\mathtt{NARM}}\limits$ [1] employs GRUs and attention mechanisms to model the temporal patterns for the recommendation.
• $\mathop{\mathtt{SR\text{-}GNN}}\limits$ [2] transforms sessions into directed graphs, and employs GNNs to model complex transitions in sessions.
• $\mathop{\mathtt{LESSR}}\limits$ [4] transforms sessions into directed multigraphs and generates recommendations using GNNs.
• $\mathop{\mathtt{DHCN}}\limits$ [3] transforms sessions into hypergraphs and generates recommendations using a hypergraph convolutional network.

Note that $\mathop{\mathtt{DHCN}}\limits$ and $\mathop{\mathtt{LESSR}}\limits$ have been compared against a comprehensive set of other methods including $\mathop{\mathtt{GRU4Rec+}}\limits$ [11], $\mathop{\mathtt{STAMP}}\limits$ [10] and $\mathop{\mathtt{FGNN}}\limits$ [12], and have outperformed those methods. Thus, we compare $\mathop{\mathtt{P^{2}MAM}}\limits$ with $\mathop{\mathtt{DHCN}}\limits$ and $\mathop{\mathtt{LESSR}}\limits$ instead of the methods that they have outperformed. For all the baseline methods, we use the implementations provided by their authors (Section A in Appendix).

5.2 Datasets

We compare $\mathop{\mathtt{P^{2}MAM}}\limits$ with the baseline methods on the following benchmark datasets that are widely used in the literature [1, 2, 4]:

• Diginetica ( $\mathop{\texttt{DG}}\limits$ ; http://cikm2016.cs.iupui.edu/cikm-cup) is from the CIKM Cup 2016, and contains anonymized browsing logs and transactions. Following the literature [1, 2, 4], we only use the transaction data in our experiments.
• Yoochoose ( $\mathop{\texttt{YC}}\limits$ ; http://2015.recsyschallenge.com/challenge.html) is from the RecSys Challenge 2015, and contains sessions of user clicks within 6 months.
• Gowalla ( $\mathop{\texttt{GA}}\limits$ ; https://snap.stanford.edu/data/loc-gowalla.html) is a point-of-interest dataset and includes user-venue check-in records with timestamps.
• Lastfm ( $\mathop{\texttt{LF}}\limits$ ; http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html) is a dataset collecting streams of music listening events on Last.fm. Following the literature [4], we focus on music artist recommendation in our experiments.
• Nowplaying ( $\mathop{\texttt{NP}}\limits$ ; http://dbis-nowplaying.uibk.ac.at/#nowplaying) is another dataset describing the music listening events of users.
• Tmall ( $\mathop{\texttt{TM}}\limits$ ; https://tianchi.aliyun.com/dataset/dataDetail?dataId=42) is from the IJCAI-15 competition, and includes anonymized shopping logs on the online retail platform Tmall.

Following the literature [1, 2, 4, 3], for $\mathop{\texttt{GA}}\limits$ , we only keep the top 30,000 most popular locations, and view the check-in records in one day as a session [4, 8]. For $\mathop{\texttt{LF}}\limits$ , we only keep the top 40,000 most popular artists, and view the listening events in 8 hours as a session [4]. For all the datasets, we filter out sessions of length one and items appearing fewer than five times over all the sessions.

5.3 Experimental Protocol

TABLE II: Dataset Statistics

dataset	#items	#train	#test	length	#aug train	#aug test	aug len
DG	42,596	188,636	15,955	4.80	716,835	60,194	4.90
YC	17,597	124,472	15,237	4.22	394,802	55,424	6.14
GA	29,510	234,403	57,492	3.85	675,561	155,332	4.32
LF	38,615	260,780	64,763	11.78	2,837,644	672,519	9.16
NP	60,416	128,077	14,479	7.42	825,304	89,824	6.53
TM	40,727	65,286	1,027	6.69	351,268	25,898	8.01

In this table, #items is the number of items. The columns #train, #test, and length correspond to the number of training sessions, the number of testing sessions, and the average length of sessions, respectively, before the augmentation. The columns #aug train, #aug test and aug len report the same statistics after the augmentation.

5.3.1 Training and testing Sets

Following the literature [2, 3, 4], we generate the training and testing sets as follows: for $\mathop{\texttt{DG}}\limits$ , $\mathop{\texttt{NP}}\limits$ and $\mathop{\texttt{TM}}\limits$ , we use the sessions in the last week as the testing set, and all the other sessions as the training set. For $\mathop{\texttt{YC}}\limits$ , we use the sessions in the last day as the testing set. For the other sessions, following that in Li et al. [1], we use the last (i.e., most recent) $1/64$ of them for training. For $\mathop{\texttt{GA}}\limits$ and $\mathop{\texttt{LF}}\limits$ , we use the last (i.e., most recent) 20% of all the sessions for testing, and all the other sessions for training.

Following the literature [1, 2, 3, 4], we augment the data to enrich the training and testing data. Specifically, for each original training and testing session $S=\{s_{1},s_{2},\dots,s_{|S|}\}$ , we split it into $\{s_{1},s_{2}\}$ , $\{s_{1},s_{2},s_{3}\}$ , $\cdots$ , $\{s_{1},s_{2},\dots,s_{|S|-1}\}$ and $\{s_{1},\dots,s_{|S|}\}$ , and use all the resulting sessions as the augmented sessions for training and testing. The key statistics of the original and augmented datasets are presented in Table II. Note that we transform sessions to fixed length as in Section 4.1 after the augmentation.
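The augmentation amounts to enumerating every prefix of a session with at least two items. The sketch below illustrates it in Python; the function names are hypothetical, and treating the last item of each prefix as the ground-truth next item for the preceding items is our reading of the protocol, stated here as an assumption.

```python
def augment(session):
    """Return all prefixes of a session that contain at least two items."""
    return [session[:t] for t in range(2, len(session) + 1)]


def to_examples(session):
    """Split each prefix into (input items, ground-truth next item).

    Assumption: the last item of a prefix is the prediction target.
    """
    return [(prefix[:-1], prefix[-1]) for prefix in augment(session)]


print(augment([1, 2, 3, 4]))
# [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
print(to_examples([1, 2, 3, 4]))
# [([1], 2), ([1, 2], 3), ([1, 2, 3], 4)]
```

Under this scheme, a session of length $|S|$ yields $|S|-1$ augmented sessions.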

We tune the hyper parameters using grid search and use the best hyper parameters in terms of recall@20 (Section 5.3.2) for $\mathop{\mathtt{P^{2}MAM}}\limits$ and all the baseline methods during testing. Particularly, during the hyper parameter tuning, we use the first 80% of the training sessions for model training, and evaluate the model on the last 20% of the training sessions. During testing, we use all the training sessions for model training, and evaluate the methods on the testing sessions. We report the search ranges of hyper parameters, and the identified optimal hyper parameters for each method in the Appendix (Section A).

5.3.2 Evaluation metrics

We use recall@ $k$ , MRR@ $k$ and NDCG@ $k$ to evaluate the performance of methods.

• Recall@ $k$ measures the proportion of sessions in which the ground-truth next item (i.e., $s_{|S|+1}$ ) is correctly recommended. For each session $S$ , the recall@ $k$ is 1 if $s_{|S|+1}$ is among the top $k$ of the recommendation list, or 0 otherwise. Note that in next item recommendation, recall@ $k$ is the most popular evaluation metric. It is also called precision@ $k$ and HR@ $k$ in the literature [4].
• MRR@ $k$ is the mean reciprocal rank of the correctly recommended item, and is 0 if the ground-truth next item is not in the top- $k$ of the recommendation list. MRR@ $k$ is widely used in the literature [1, 2, 3, 4, 10] as a rank-aware evaluation metric for session-based recommendation.
• NDCG@ $k$ is the normalized discounted cumulative gain for the top- $k$ ranking, and is another widely used rank-aware metric [16, 26, 19]. Different from MRR@ $k$ , which focuses on the very top ranked items (e.g., top- $1$ ) [27], NDCG@ $k$ effectively measures models’ performance in ranking the top- $k$ items, and thus might be a better metric for evaluating recommendation methods in some scenarios. Following the literature [16], in our experiments, the gain indicates whether the ground-truth next item is recommended (i.e., gain is 1) or not (i.e., gain is 0).

For all the evaluation metrics, we report the average results over all the testing sessions in the experiments. We also statistically test the significance of the performance difference among different methods via a standard paired $t$ -test at 95% confidence level.
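For a single testing session with one ground-truth next item and binary gain, all three metrics can be computed from the rank of that item. The Python sketch below illustrates this; the function names are hypothetical, and breaking ties in favor of the ground-truth item is an assumption of the sketch. With a single relevant item, the ideal DCG is 1, so NDCG@ $k$ reduces to $1/\log_{2}(\text{rank}+1)$ .

```python
import math


def rank_of(scores, target):
    """1-based rank of the target item under descending scores.

    Assumption: ties are broken in favor of the target item.
    """
    return 1 + sum(1 for s in scores if s > scores[target])


def recall_at_k(rank, k):
    return 1.0 if rank <= k else 0.0


def mrr_at_k(rank, k):
    return 1.0 / rank if rank <= k else 0.0


def ndcg_at_k(rank, k):
    # One relevant item with gain 1: DCG = 1 / log2(rank + 1) and IDCG = 1.
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


scores = [0.1, 0.5, 0.3, 0.1]  # toy recommendation scores over 4 items
r = rank_of(scores, target=2)  # item 2 has the second largest score
print(r, recall_at_k(r, 2), mrr_at_k(r, 2), ndcg_at_k(r, 2))
```

The reported values are then the averages of these per-session quantities over all the testing sessions.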

6 Experimental Results

6.1 Overall Performance Comparison

TABLE III: Overall Performance

method		recall@ $k$		MRR@ $k$		NDCG@ $k$			recall@ $k$		MRR@ $k$		NDCG@ $k$			recall@ $k$		MRR@ $k$		NDCG@ $k$
method		$k$ =10	$k$ =20	$k$ =10	$k$ =20	$k$ =10	$k$ =20		$k$ =10	$k$ =20	$k$ =10	$k$ =20	$k$ =10	$k$ =20		$k$ =10	$k$ =20	$k$ =10	$k$ =20	$k$ =10	$k$ =20
$\mathop{\mathtt{POP}}\limits$	$\mathop{\texttt{DG}}\limits$	0.0058	0.0078	0.0021	0.0022	0.0029	0.0034	$\mathop{\texttt{YC}}\limits$	0.0555	0.1102	0.0248	0.0285	0.0319	0.0456	$\mathop{\texttt{GA}}\limits$	0.0266	0.0456	0.0066	0.0079	0.0113	0.0160
$\mathop{\mathtt{NARM}}\limits$		0.4049	0.5401	0.1751	0.1845	0.2289	0.2631		0.5970	0.7054	0.2925	0.3001	0.3648	0.3924		0.4510	0.5321	0.2445	0.2502	0.2939	0.3144
$\mathop{\mathtt{SR\text{-}GNN}}\limits$		0.3783	0.5091	0.1629	0.1719	0.2133	0.2463		0.5979	0.7072	0.2969	0.3046	0.3683	0.3961		0.4317	0.5142	0.2389	0.2446	0.2848	0.2057
$\mathop{\mathtt{LESSR}}\limits$		0.4000	0.5321	0.1765	0.1857	0.2289	0.2623		0.6098	0.7140	†0.3077	†0.3151	†0.3795	†0.4060		0.4440	0.5229	0.2540	0.2595	0.2993	0.3192
$\mathop{\mathtt{DHCN}}\limits$		0.4058	0.5400	0.1768	0.1861	0.2305	0.2644		0.6090	0.7157	0.2964	0.3039	0.3708	0.3979		0.4531	0.5377	0.2354	0.2413	0.2876	0.3089
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$		†0.4148	†0.5500	†0.1817	†0.1911	†0.2364	†0.2705		0.6118	0.7203	0.2956	0.3033	0.3707	0.3983		0.4529	0.5352	0.2384	0.2441	0.2898	0.3106
$\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$		0.4028	0.5354	0.1749	0.1841	0.2283	0.2618		0.6148	0.7220	0.3021	0.3096	0.3764	0.4037		0.4554	0.5369	0.2493	0.2550	0.2985	0.3192
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$		0.4120	0.5474	0.1805	0.1899	0.2348	0.2690		†0.6192	†0.7277	0.3008	0.3085	0.3766	0.4043		†0.4644	†0.5479	†0.2586	†0.2644	†0.3077	†0.3289
improv		2.2%*	1.8%*	2.8%*	2.7%*	2.6%*	2.3%*		1.5%*	1.7%*	-1.8%*	-1.7%*	-0.8%	-0.4%		2.5%*	1.9%*	1.8%*	1.9%*	2.8%*	3.0%*
$\mathop{\mathtt{POP}}\limits$	$\mathop{\texttt{LF}}\limits$	0.0304	0.0498	0.0113	0.0127	0.0157	0.0207	$\mathop{\texttt{NP}}\limits$	0.0137	0.0171	0.0058	0.0060	0.0075	0.0084	$\mathop{\texttt{TM}}\limits$	0.0177	0.0231	0.0095	0.0099	0.0114	0.0128
$\mathop{\mathtt{NARM}}\limits$		0.1632	0.2287	0.0720	0.0765	0.0933	0.1099		0.1494	0.2037	0.0702	0.0739	0.0887	0.1024		0.2489	0.2661	†0.1765	†0.1777	†0.1944	0.1987
$\mathop{\mathtt{SR\text{-}GNN}}\limits$		0.1666	0.2260	0.0853	0.0893	0.1043	0.1192		0.1436	0.1893	0.0744	0.0775	0.0906	0.1021		0.2438	0.2885	0.1366	0.1397	0.1621	0.1734
$\mathop{\mathtt{LESSR}}\limits$		0.1719	0.2328	†0.0865	†0.0907	†0.1065	†0.1218		0.1542	0.2059	0.0747	0.0782	0.0933	0.1063		0.2216	0.2567	0.1267	0.1291	0.1493	0.1582
$\mathop{\mathtt{DHCN}}\limits$		0.1647	0.2293	0.0730	0.0774	0.0945	0.1107		0.1711	0.2307	†0.0750	†0.0791	†0.0974	0.1124		0.2330	0.2839	0.1294	0.1329	0.1539	0.1668
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$		0.1646	0.2301	0.0724	0.0769	0.0940	0.1105		$\mathclap{{}^{\dagger~{}}}$ 0.1744	$\mathclap{{}^{\dagger~{}}}$ 0.2371	0.0740	0.0783	0.0973	$\mathclap{{}^{\dagger~{}}}$ 0.1132		$\mathclap{{}^{\dagger~{}}}$ 0.2840	0.3406	0.1608	0.1648	0.1900	$\mathclap{{}^{\dagger~{}}}$ 0.2043
$\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$		0.1675	0.2334	0.0743	0.0788	0.0961	0.1127		0.1587	0.2132	0.0743	0.0780	0.0940	0.1078		0.2302	0.2772	0.1224	0.1256	0.1479	0.1598
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$		$\mathclap{{}^{\dagger~{}}}$ 0.1771	$\mathclap{{}^{\dagger~{}}}$ 0.2454	0.0793	0.0840	0.1022	0.1195		0.1688	0.2315	0.0732	0.0775	0.0955	0.1113		0.2826	$\mathclap{{}^{\dagger~{}}}$ 0.3438	0.1484	0.1527	0.1801	0.1956
improv		3.0% $\mathclap{{}^{*}}$	5.4% $\mathclap{{}^{*}}$	-8.3% $\mathclap{{}^{*}}$	-7.4% $\mathclap{{}^{*}}$	-4.0% $\mathclap{{}^{*}}$	-1.9% $\mathclap{{}^{*}}$		1.9%	2.8% $\mathclap{{}^{*}}$	-0.9%	-1.0%	-0.1%	0.7%		14.1% $\mathclap{{}^{*}}$	19.2% $\mathclap{{}^{*}}$	-8.9% $\mathclap{{}^{*}}$	-7.3% $\mathclap{{}^{*}}$	-2.3%	2.8%

For each dataset, the best performance among our proposed methods (e.g., $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ ) is in bold, the best performance among the baseline methods is underlined, and the overall best performance is indicated by a dagger (i.e., $\dagger$ ). The row “improv” presents the percentage improvement of the best performing variant of $\mathop{\mathtt{P^{2}MAM}}\limits$ (bold) over the best performing baseline method (underlined). The $*$ indicates that the improvement is statistically significant at the $95\%$ confidence level.

Table III presents the overall performance of different methods at recall@ $k$ , MRR@ $k$ and NDCG@ $k$ in recommending the next item. Due to the space limit, we do not present the results on recall@5, MRR@5 and NDCG@5. However, we observed a similar trend on these metrics. Table III shows that overall, $\mathop{\mathtt{P^{2}MAM}}\limits$ (i.e., $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ ) is the best performing method on the six benchmark datasets. In terms of recall@10 and recall@20, $\mathop{\mathtt{P^{2}MAM}}\limits$ achieves the best performance on all the six datasets, with a significant average improvement of 4.2% and 5.5%, respectively, compared to the best baseline method on each dataset. Note that in session-based recommendation, an improvement above 1% is generally considered significant [2, 4]. At MRR@ $k$ , $\mathop{\mathtt{P^{2}MAM}}\limits$ also achieves competitive performance compared to the baseline methods. For example, on $\mathop{\texttt{DG}}\limits$ , $\mathop{\mathtt{P^{2}MAM}}\limits$ achieves statistically significant improvements of 2.8% and 2.7% at MRR@10 and MRR@20, respectively, and on $\mathop{\texttt{GA}}\limits$ , of 1.8% and 1.9%. On $\mathop{\texttt{YC}}\limits$ and $\mathop{\texttt{TM}}\limits$ , $\mathop{\mathtt{P^{2}MAM}}\limits$ achieves the second best performance at MRR@10 and MRR@20. We found a similar trend at NDCG@ $k$ . For example, in terms of NDCG@ $10$ , $\mathop{\mathtt{P^{2}MAM}}\limits$ substantially outperforms the baseline methods on $\mathop{\texttt{DG}}\limits$ and $\mathop{\texttt{GA}}\limits$ ; at NDCG@ $20$ , $\mathop{\mathtt{P^{2}MAM}}\limits$ is the best method on four of the six datasets (all except $\mathop{\texttt{YC}}\limits$ and $\mathop{\texttt{LF}}\limits$ ). These results demonstrate the strong recommendation performance of $\mathop{\mathtt{P^{2}MAM}}\limits$ .
We notice that on $\mathop{\texttt{YC}}\limits$ , $\mathop{\texttt{LF}}\limits$ and $\mathop{\texttt{TM}}\limits$ , at MRR@10 and MRR@20, $\mathop{\mathtt{P^{2}MAM}}\limits$ is considerably worse than the best baseline method (e.g., $\mathop{\mathtt{LESSR}}\limits$ ). These results indicate that on certain datasets, even though $\mathop{\mathtt{P^{2}MAM}}\limits$ may be less effective than the baseline methods at ranking the ground-truth next items at the very top (e.g., top-1), it is still on average more effective at recommending the correct items among the top- $k$ items.
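The average recall@10 improvement quoted above can be reproduced from Table III. The sketch below assumes that the “improv” row is the relative improvement of the best $\mathop{\mathtt{P^{2}MAM}}\limits$ variant over the best baseline, which is consistent with the reported values; the variable and function names are illustrative only.

```python
# recall@10 values copied from Table III: (best P2MAM variant, best baseline).
recall_at_10 = {
    "DG": (0.4148, 0.4058),
    "YC": (0.6192, 0.6098),
    "GA": (0.4644, 0.4531),
    "LF": (0.1771, 0.1719),
    "NP": (0.1744, 0.1711),
    "TM": (0.2840, 0.2489),
}

def relative_improvement(ours, baseline):
    """Percentage improvement of `ours` over `baseline` (assumed definition of "improv")."""
    return 100.0 * (ours - baseline) / baseline

improvements = {
    dataset: relative_improvement(ours, baseline)
    for dataset, (ours, baseline) in recall_at_10.items()
}
average_improvement = sum(improvements.values()) / len(improvements)
```

Rounded to one decimal place, the per-dataset values match the recall@10 entries of the “improv” row, and their mean matches the 4.2% average.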

6.2 Comparison among $\mathop{\mathtt{P^{2}MAM}}\limits$ Variants

As shown in Table III, among the three variants of $\mathop{\mathtt{P^{2}MAM}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ has the best performance overall. In terms of recall@10, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ outperforms the other variants on $\mathop{\texttt{YC}}\limits$ , $\mathop{\texttt{GA}}\limits$ and $\mathop{\texttt{LF}}\limits$ , and achieves the second best performance on the other three datasets. In terms of recall@20, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ is the best variant on four of the six datasets (all except $\mathop{\texttt{DG}}\limits$ and $\mathop{\texttt{NP}}\limits$ ). We found a similar trend at MRR@ $k$ and NDCG@ $k$ . For example, in terms of NDCG@10 and NDCG@20, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ outperforms the other variants on three out of six datasets (i.e., $\mathop{\texttt{YC}}\limits$ , $\mathop{\texttt{GA}}\limits$ and $\mathop{\texttt{LF}}\limits$ ). On the other three datasets, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ is still ranked as the second best variant.

Unlike $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ learns attention weights conditioned on the estimate of users’ prospective preferences (i.e., $\mathop{\mathtt{P^{2}E}}\limits$ ). The superior performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ over $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ on four out of six datasets indicates that on most of the datasets, incorporating the estimate of prospective preferences could enable better recommendations. Compared to $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ generates recommendations using the preference predictions from temporal patterns (i.e., $\mathop{\mathbf{h}^{o}}\limits$ ) and users’ prospective preferences (i.e., $\mathop{\mathbf{h}^{p}}\limits$ ), while $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ only uses $\mathop{\mathbf{h}^{p}}\limits$ to generate recommendations. The strong improvement of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ compared to $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ indicates that when used together, the two preference predictions could reinforce each other and improve the recommendation performance. We notice that on $\mathop{\texttt{DG}}\limits$ and $\mathop{\texttt{NP}}\limits$ , the performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ is slightly worse than that of $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ at recall@20. This might be because, as will be shown in Section 6.5, in certain datasets (e.g., $\mathop{\texttt{DG}}\limits$ and $\mathop{\texttt{NP}}\limits$ ), the temporal patterns are very strong and could dominate the learning process. As a result, incorporating the $\mathop{\mathtt{P^{2}E}}\limits$ component may not improve the recommendation performance.

6.3 Comparison with GRUs-based Methods

Existing methods [1, 2, 4] leverage GRUs to capture the temporal patterns, while in $\mathop{\mathtt{P^{2}MAM}}\limits$ , we model the temporal patterns using a position-sensitive attention mechanism (i.e., $\mathop{\mathtt{PS}}\limits$ ). Here, we compare the performance of $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ , the $\mathop{\mathtt{P^{2}MAM}}\limits$ variant purely based on $\mathop{\mathtt{PS}}\limits$ , and the GRUs-based baseline methods including $\mathop{\mathtt{NARM}}\limits$ , $\mathop{\mathtt{SR\text{-}GNN}}\limits$ and $\mathop{\mathtt{LESSR}}\limits$ . As shown in Table III, at recall@ $k$ , $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ achieves superior performance over the GRUs-based baseline methods on five of the six datasets (all except $\mathop{\texttt{LF}}\limits$ ). On $\mathop{\texttt{LF}}\limits$ , the performance of $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ is slightly worse than that of $\mathop{\mathtt{SR\text{-}GNN}}\limits$ and $\mathop{\mathtt{LESSR}}\limits$ , but $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ is still able to outperform $\mathop{\mathtt{NARM}}\limits$ at all the metrics. GRUs model the temporal patterns in a recurrent fashion using complicated non-linear layers, while $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ explicitly models the temporal patterns using attention weights over items in the session. The improvement of $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ over GRUs-based baseline methods indicates that on sparse recommendation datasets, our simple attentive method could be easier to learn well than the complicated GRUs-based methods, and thus enable better performance.

6.4 Comparison with Graph-based Methods

Graph-based methods [2, 4, 3, 12] have been extensively developed for session-based recommendation, and have demonstrated state-of-the-art performance. However, as shown in Table III, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ , as a sequence-based method, outperforms the state-of-the-art graph-based methods (i.e., $\mathop{\mathtt{SR\text{-}GNN}}\limits$ , $\mathop{\mathtt{LESSR}}\limits$ , $\mathop{\mathtt{DHCN}}\limits$ ) on the benchmark datasets. For example, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ improves over the graph-based methods at recall@20 on all the datasets, and at recall@10 on all the datasets except $\mathop{\texttt{NP}}\limits$ , where $\mathop{\mathtt{DHCN}}\limits$ is slightly better (0.1711 vs. 0.1688). Graph-based methods convert sessions to directed graphs or hypergraphs, and learn the complex temporal patterns [2] leveraging the graph structure (i.e., topology). However, the graphs are constructed based on assumptions made by design, for example, that an item should link to all the subsequent items [4]. Such assumptions may introduce noise or unnecessary/unrealistic relations in the graphs. Meanwhile, the sparse nature of recommendation datasets, and thus of the graphs, may not support the complicated learning of GNN-based models very well. The superior performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ over graph-based methods signifies that the sequence representation could be more effective than graphs for the recommendation.

6.5 Analysis on Position Embeddings

TABLE IV: Performance Improvement of Position Embeddings

method		recall@ $k$			MRR@ $k$			NDCG@ $k$				recall@ $k$			MRR@ $k$			NDCG@ $k$
method		$k$ =5	$k$ =10	$k$ =20	$k$ =5	$k$ =10	$k$ =20	$k$ =5	$k$ =10	$k$ =20		$k$ =5	$k$ =10	$k$ =20	$k$ =5	$k$ =10	$k$ =20	$k$ =5	$k$ =10	$k$ =20
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ \ $\mathop{\mathtt{PE}}\limits$	$\mathop{\texttt{DG}}\limits$	0.2766	0.3918	0.5236	0.1580	0.1733	0.1824	0.1873	0.2245	0.2578	$\mathop{\texttt{YC}}\limits$	0.4138	0.5521	0.6754	0.2418	0.2603	0.2690	0.2845	0.3293	0.3606
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$		0.2936	0.4148	0.5500	0.1657	0.1817	0.1911	0.1973	0.2364	0.2705		0.4721	0.6118	0.7203	0.2767	0.2956	0.3033	0.3254	0.3707	0.3983
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$		0.2765	0.3934	0.5276	0.1570	0.1725	0.1818	0.1865	0.2242	0.2581		0.4270	0.5674	0.6896	0.2470	0.2658	0.2744	0.2917	0.3371	0.3682
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$		0.2919	0.4125	0.5476	0.1652	0.1812	0.1905	0.1966	0.2354	0.2695		0.4834	0.6192	0.7277	0.2825	0.3008	0.3085	0.3325	0.3766	0.4043
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ \ $\mathop{\mathtt{PE}}\limits$	$\mathop{\texttt{GA}}\limits$	0.3610	0.4392	0.5189	0.2275	0.2379	0.2435	0.2609	0.2862	0.3064	$\mathop{\texttt{LF}}\limits$	0.1165	0.1686	0.2360	0.0656	0.0724	0.0771	0.0782	0.0949	0.1119
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$		0.3700	0.4529	0.5352	0.2273	0.2384	0.2441	0.2629	0.2898	0.3106		0.1142	0.1646	0.2301	0.0658	0.0724	0.0769	0.0778	0.0940	0.1105
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$		0.3677	0.4472	0.5277	0.2319	0.2425	0.2481	0.2658	0.2916	0.3119		0.1232	0.1768	0.2449	0.0687	0.0757	0.0804	0.0822	0.0994	0.1166
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$		0.3803	0.4644	0.5479	0.2473	0.2586	0.2644	0.2805	0.3077	0.3289		0.1247	0.1771	0.2454	0.0724	0.0793	0.0840	0.0854	0.1022	0.1195
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ \ $\mathop{\mathtt{PE}}\limits$	$\mathop{\texttt{NP}}\limits$	0.1034	0.1589	0.2224	0.0591	0.0664	0.0708	0.0700	0.0878	0.1039	$\mathop{\texttt{TM}}\limits$	0.2320	0.2872	0.3432	0.1561	0.1635	0.1675	0.1750	0.1929	0.2071
$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$		0.1180	0.1744	0.2371	0.0665	0.0740	0.0783	0.0791	0.0973	0.1132		0.2279	0.2840	0.3406	0.1534	0.1608	0.1648	0.1719	0.1900	0.2043
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$		0.1058	0.1598	0.2205	0.0603	0.0674	0.0717	0.0715	0.0889	0.1042		0.2277	0.2861	0.3468	0.1483	0.1561	0.1603	0.1680	0.1869	0.2023
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$		0.1158	0.1688	0.2315	0.0662	0.0732	0.0775	0.0784	0.0955	0.1113		0.2192	0.2826	0.3438	0.1400	0.1484	0.1527	0.1597	0.1801	0.1956

In this table, \ $\mathop{\mathtt{PE}}\limits$ represents $\mathop{\mathtt{P^{2}MAM}}\limits$ variants (i.e., $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ ) without position embeddings. The best performance between $\mathop{\mathtt{P^{2}MAM}}\limits$ variants with and without position embeddings (i.e., \ $\mathop{\mathtt{PE}}\limits$ ) is in bold.

We conduct an analysis to verify the importance of position embeddings in $\mathop{\mathtt{P^{2}MAM}}\limits$ . Specifically, in $\mathop{\mathtt{P^{2}MAM}}\limits$ , we remove the position embeddings (i.e., $\mathop{P}\limits$ ) in Equation 3 and Equation 4, and calculate the attention weights using item embeddings (i.e., $\mathop{E}\limits$ ) only. We denote $\mathop{\mathtt{P^{2}MAM}}\limits$ without position embeddings as $\mathop{\mathtt{P^{2}MAM}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ , and report the performance of $\mathop{\mathtt{P^{2}MAM}}\limits$ and $\mathop{\mathtt{P^{2}MAM}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ in Table IV. Due to the space limit, we do not present the performance of $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ but we observed a similar trend in $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ .
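To illustrate why removing position embeddings matters, the following sketch uses a generic dot-product attention in which each key is the sum of an item embedding and a position embedding. This is a simplified stand-in and not Equation 3 or Equation 4; the query vector, the embedding values and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_weights(item_embs, query, pos_embs=None):
    """Softmax attention over session positions.

    item_embs: one embedding per session position.
    pos_embs:  optional position embeddings; None mimics the ablation
               without position embeddings.
    """
    if pos_embs is None:
        keys = item_embs
    else:
        keys = [[e + p for e, p in zip(item, pos)]
                for item, pos in zip(item_embs, pos_embs)]
    return softmax([dot(query, key) for key in keys])

# A session in which the same item occurs at the first and the last position.
item_x, item_y = [1.0, 0.0], [0.0, 1.0]
session = [item_x, item_y, item_x]
positions = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]  # hypothetical learned values
query = [1.0, 0.5]

weights_without_pe = attention_weights(session, query)
weights_with_pe = attention_weights(session, query, positions)
```

Without position embeddings, the two occurrences of the same item necessarily receive identical weights, so the attention cannot express, for example, a preference for the most recent occurrence. With position embeddings, the two weights can differ.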

As presented in Table IV, without position embeddings, the performance of $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ degrades significantly on four out of six datasets (i.e., $\mathop{\texttt{DG}}\limits$ , $\mathop{\texttt{YC}}\limits$ , $\mathop{\texttt{GA}}\limits$ and $\mathop{\texttt{NP}}\limits$ ). For example, on $\mathop{\texttt{DG}}\limits$ and $\mathop{\texttt{YC}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ underperforms $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ by 3.8% and 5.5% at recall@20, respectively. Recall that in $\mathop{\mathtt{P^{2}MAM}}\limits$ , we learn position embeddings to incorporate the position information into the model, and better model the temporal patterns. These results demonstrate the importance of the position information for session-based recommendation. These results also reveal that the temporal patterns on the four datasets (e.g., $\mathop{\texttt{DG}}\limits$ and $\mathop{\texttt{NP}}\limits$ ) are strong, and explain the similar performance of $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ on $\mathop{\texttt{DG}}\limits$ and $\mathop{\texttt{NP}}\limits$ as discussed in Section 6.2. We also notice that on $\mathop{\texttt{LF}}\limits$ and $\mathop{\texttt{TM}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ still achieve performance similar to that with position embeddings. For example, on $\mathop{\texttt{LF}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ achieves 0.2449 at recall@20, and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ achieves 0.2454 (difference: 0.2%).
Similarly, on $\mathop{\texttt{TM}}\limits$ , at recall@20, the performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ is 0.3468 and 0.3438, respectively (difference: 0.9%). As will be shown in Section 6.7, on some datasets (e.g., $\mathop{\texttt{LF}}\limits$ ), the position information may not be crucial for the recommendation. Therefore, on these datasets, without position embeddings, the model could still achieve similar performance.

6.6 Prospective Preference Estimate Analysis

TABLE V: Performance Comparison on Estimate Strategies

	method	recall@ $k$			MRR@ $k$			NDCG@ $k$
	method	$k$ =5	$k$ =10	$k$ =20	$k$ =5	$k$ =10	$k$ =20	$k$ =5	$k$ =10	$k$ =20
$\mathop{\texttt{DG}}\limits$	$\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$	0.2881	0.4098	0.5459	0.1615	0.1776	0.1870	0.1928	0.2320	0.2664
$\mathop{\texttt{DG}}\limits$	$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$	0.2922	0.4120	0.5474	0.1646	0.1805	0.1899	0.1961	0.2348	0.2690
$\mathop{\texttt{YC}}\limits$	$\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$	0.4851	0.6202	0.7265	0.2824	0.3006	0.3081	0.3329	0.3768	0.4038
$\mathop{\texttt{YC}}\limits$	$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$	0.4834	0.6192	0.7277	0.2825	0.3008	0.3085	0.3325	0.3766	0.4043
$\mathop{\texttt{GA}}\limits$	$\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$	0.3777	0.4638	0.5475	0.2339	0.2454	0.2512	0.2698	0.2977	0.3189
$\mathop{\texttt{GA}}\limits$	$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$	0.3803	0.4644	0.5479	0.2473	0.2586	0.2644	0.2805	0.3077	0.3289
$\mathop{\texttt{LF}}\limits$	$\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$	0.1176	0.1676	0.2332	0.0681	0.0747	0.0792	0.0804	0.0965	0.1130
$\mathop{\texttt{LF}}\limits$	$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$	0.1247	0.1771	0.2454	0.0724	0.0793	0.0840	0.0854	0.1022	0.1195
$\mathop{\texttt{NP}}\limits$	$\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$	0.1261	0.1735	0.2304	0.0736	0.0799	0.0838	0.0866	0.1019	0.1163
$\mathop{\texttt{NP}}\limits$	$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$	0.1158	0.1688	0.2315	0.0662	0.0732	0.0775	0.0784	0.0955	0.1113
$\mathop{\texttt{TM}}\limits$	$\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$	0.2071	0.2637	0.3176	0.1324	0.1401	0.1439	0.1510	0.1694	0.1830
$\mathop{\texttt{TM}}\limits$	$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$	0.2192	0.2826	0.3438	0.1400	0.1484	0.1527	0.1597	0.1801	0.1956

In this table, $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ estimates users’ prospective preferences using the last item in each session, and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ estimates users’ prospective preferences using $\mathop{\mathtt{PS}}\limits$ (Section 4.2). The best performance in each dataset is in bold.

In $\mathop{\mathtt{P^{2}MAM}}\limits$ , we leverage the position-sensitive preference prediction (i.e., $\mathop{\mathbf{h}^{o}}\limits$ ) from $\mathop{\mathtt{PS}}\limits$ (Section 4.2) as the estimate of users’ prospective preferences (Section 4.3). We notice that in the literature [10, 2, 19], another strategy is to estimate the users’ prospective preferences from the last item in each session. This strategy is based on the recency assumption [10, 2] that the most recently interacted item could be a very strong indicator of the next item of users’ interest. We conduct an analysis to empirically compare the two strategies in $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ . Specifically, in Equation 4, instead of $\mathop{\mathbf{h}^{o}}\limits$ , we use the embeddings of the last item and its position in each session (i.e., $\textbf{v}_{a_{n}}\!+\mathbf{p}_{n}$ ) to calculate the attention weights, and denote the resulting variant as $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ . Table V presents the performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ and $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ . As for $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ , we tune hyper parameters for $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ using grid search, and report the results from the identified best performing hyper parameters.

As presented in Table V, overall $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ achieves considerable improvement compared to $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ on four out of six datasets (i.e., $\mathop{\texttt{DG}}\limits$ , $\mathop{\texttt{GA}}\limits$ , $\mathop{\texttt{LF}}\limits$ and $\mathop{\texttt{TM}}\limits$ ). At recall@5, recall@10 and recall@20, on average over these four datasets, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ achieves significant improvement of 3.5%, 3.4% and 3.5%, respectively, over $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ . We find a similar trend at MRR@ $k$ and NDCG@ $k$ . For example, in terms of MRR@5 and NDCG@5, over the four datasets, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ still significantly outperforms $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ with an average improvement of 4.9% and 4.4%, respectively. The primary difference between $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ and $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ is that $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ uses a learning-based method to estimate users’ prospective preferences in a data-driven manner, while $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ generates the estimate based on the recency assumption. The above results indicate that on most of the datasets, the data-driven estimate is more effective than the recency-based estimate. We also notice that on $\mathop{\texttt{YC}}\limits$ and $\mathop{\texttt{NP}}\limits$ , the performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ is competitive with that of $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ . For example, on $\mathop{\texttt{YC}}\limits$ , the performance difference between $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ and $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ is 0.2% and 0.1% at recall@20 and MRR@20, respectively.
On $\mathop{\texttt{NP}}\limits$ , the difference at recall@20 is also as small as 0.5%. As shown in the literature [10], some datasets such as $\mathop{\texttt{YC}}\limits$ have strong recency patterns. The similar results between $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ and $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ on $\mathop{\texttt{YC}}\limits$ and $\mathop{\texttt{NP}}\limits$ indicate that on datasets with strong recency patterns, our learning-based strategy may implicitly capture the patterns, and still produce competitive results.

6.7 Attention Weight Analysis

We conduct an analysis to evaluate the attention weights learned in $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ . Particularly, we present the attention weights learned in $\mathop{\mathtt{PS}}\limits$ (Section 4.2) over the corresponding session positions in Figure 2. Due to the space limit, we only present the results on $\mathop{\texttt{YC}}\limits$ and $\mathop{\texttt{LF}}\limits$ , but we have similar results on the other datasets. In Figure 2, the $y$ -axis corresponds to the session lengths; the $x$ -axis corresponds to the last $n$ positions of a session of length $n$ ( $n=1,\cdots,8$ ). Due to the space limit, we only present the weights in augmented testing sessions with at most 8 items, which represent 82.2% and 57.4% of all the testing sessions in $\mathop{\texttt{YC}}\limits$ and $\mathop{\texttt{LF}}\limits$ , respectively. However, we observed a similar trend in the longer sessions as that in Figure 2.

Figure 2(a) and 2(b) present the attention weight distribution from $\mathop{\mathtt{PS}}\limits$ with and without position embeddings (i.e., \ $\mathop{\mathtt{PE}}\limits$ ), respectively, on $\mathop{\texttt{YC}}\limits$ . Similarly, Figure 2(c) and 2(d) present the distributions on $\mathop{\texttt{LF}}\limits$ . Figure 2(a) shows that on $\mathop{\texttt{YC}}\limits$ , with position embeddings, the attention weights on the last item of sessions (i.e., the diagonal in the figure) are significantly higher than those on the earlier ones. This weight distribution is consistent with the recency pattern on $\mathop{\texttt{YC}}\limits$ as shown in the literature [10], and demonstrates that with the position embeddings, $\mathop{\mathtt{PS}}\limits$ could accurately capture the recency patterns in $\mathop{\texttt{YC}}\limits$ . Figure 2(b) shows that without position embeddings, the attention weights from $\mathop{\mathtt{PS}}\limits$ do not show considerable difference over positions, revealing that without position embeddings, the attention mechanism cannot differentiate positions. The comparison between Figure 2(a) and Figure 2(b) further demonstrates the importance of position embeddings in $\mathop{\mathtt{P^{2}MAM}}\limits$ .

Comparing Figure 2(a) and Figure 2(c), we find that on $\mathop{\texttt{YC}}\limits$ , the weights follow the recency pattern, while on $\mathop{\texttt{LF}}\limits$ , the second-to-last item has higher weights than the last one. This result implies that the recency assumption may not always hold, and our learning-based method (i.e., $\mathop{\mathtt{PS}}\limits$ ) could be more effective than the existing recency-based method in estimating prospective preferences (Section 6.6). Comparing Figure 2(c) and Figure 2(d), we notice that without position embeddings, on sessions with more than 4 items, the weight distribution in Figure 2(d) is still similar to that in Figure 2(c). This result reveals that on this dataset, the position information may not be critical, and also supports the similar performance of $\mathop{\mathtt{P^{2}MAM}}\limits$ and $\mathop{\mathtt{P^{2}MAM}}\limits$ \ $\mathop{\mathtt{PE}}\limits$ on $\mathop{\texttt{LF}}\limits$ as discussed in Section 6.5.

6.8 Analysis on Cosine Similarities

TABLE VI: Preference Prediction Similarity Comparison

dataset	$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$		$\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$		$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$
dataset	avg.	next	avg.	next	avg.	next
$\mathop{\texttt{DG}}\limits$	-0.050	0.662	-0.007	0.646	-0.010	0.661
$\mathop{\texttt{YC}}\limits$	-0.137	0.724	-0.084	0.660	-0.082	0.666
$\mathop{\texttt{GA}}\limits$	-0.030	0.676	-0.007	0.663	-0.001	0.610
$\mathop{\texttt{LF}}\limits$	-0.150	0.322	-0.006	0.325	0.012	0.306
$\mathop{\texttt{NP}}\limits$	-0.057	0.423	0.014	0.454	0.007	0.422
$\mathop{\texttt{TM}}\limits$	-0.042	0.356	0.002	0.383	0.015	0.370

In this table, the avg. columns show the average cosine similarities between preference predictions and all the candidate items. The next columns show the similarity between the predictions and the ground-truth next item.

We conduct an analysis to verify if $\mathop{\mathtt{P^{2}MAM}}\limits$ truly learns predictive preferences from the data. Specifically, we calculate the cosine similarities between the preference predictions (i.e., $\mathop{\mathbf{h}^{o}}\limits$ and $\mathop{\mathbf{h}^{p}}\limits$ ) and the item embeddings on the six datasets, and present the results in Table VI. For $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ / $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ , we calculate similarities between $\mathop{\mathbf{h}^{o}}\limits$ / $\mathop{\mathbf{h}^{p}}\limits$ and item embeddings. For $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ , since we calculate recommendation scores using $\mathop{\mathbf{h}^{o}}\limits$ + $\mathop{\mathbf{h}^{p}}\limits$ (Equation 7), we use $\mathop{\mathbf{h}^{o}}\limits$ + $\mathop{\mathbf{h}^{p}}\limits$ to calculate the similarities. Table VI shows that in $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ , the similarities between the predictions and the ground-truth next item are significantly higher than the average similarities between the predictions and all the items. This result reveals that $\mathop{\mathtt{P^{2}MAM}}\limits$ could learn to capture users’ true preferences from the data.
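The two quantities reported in Table VI can be sketched for a single session as follows. The function names and the toy embeddings are assumptions for illustration; we also assume here that the per-session values are subsequently averaged over the testing sessions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def similarity_summary(prediction, item_embs, next_item):
    """Return (average similarity to all candidate items,
    similarity to the ground-truth next item)."""
    sims = [cosine(prediction, emb) for emb in item_embs]
    return sum(sims) / len(sims), sims[next_item]

# Toy example: a 2-dimensional preference prediction and three candidate items.
prediction = [1.0, 0.0]
item_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
avg_sim, next_sim = similarity_summary(prediction, item_embs, next_item=0)
```

A prediction that has learned something useful should yield a “next” value well above the “avg.” value, as in the toy example where the prediction is aligned with the ground-truth item.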

6.9 Run-time Performance

TABLE VII: Testing Runtime Performance (ms)

method	$\mathop{\texttt{DG}}\limits$	$\mathop{\texttt{YC}}\limits$	$\mathop{\texttt{GA}}\limits$	$\mathop{\texttt{LF}}\limits$	$\mathop{\texttt{NP}}\limits$	$\mathop{\texttt{TM}}\limits$
$\mathop{\mathtt{DHCN}}\limits$	4.6e0	2.8e0	3.0e0	1.3e2	7.2e0	4.4e0
$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$	5.7e-1	2.7e-1	4.2e-1	5.2e-1	8.9e-1	6.2e-1
speedup	8.1	10.4	7.1	245.4	8.1	7.1

The best run-time performance at each dataset is in bold.

We compare the run-time performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ and that of the best performing baseline method $\mathop{\mathtt{DHCN}}\limits$ during testing, and report the results in Table VII. We focus on testing instead of training because the run-time performance in testing better reflects the models’ latency in real-time recommendation, which could significantly affect the user experience and thus revenue. As presented in Table VII, the run-time performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ is substantially better than that of $\mathop{\mathtt{DHCN}}\limits$ on all the datasets. Specifically, on average, $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ is 47.7 times faster than $\mathop{\mathtt{DHCN}}\limits$ . The superior run-time performance of $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ over $\mathop{\mathtt{DHCN}}\limits$ demonstrates that while generating high-quality recommendations, $\mathop{\mathtt{P^{2}MAM}}\limits$ could enable lower latency in real time, and thus could significantly improve the user experience.
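The average speedup can be recomputed from the “speedup” row of Table VII as an arithmetic mean over the six datasets. The sketch below also computes the median, which is not reported in the table and is included here only as an additional summary; the variable names are illustrative.

```python
import statistics

# Per-dataset speedups copied from the "speedup" row of Table VII.
speedup = {"DG": 8.1, "YC": 10.4, "GA": 7.1, "LF": 245.4, "NP": 8.1, "TM": 7.1}

values = list(speedup.values())
mean_speedup = statistics.mean(values)      # arithmetic mean over the six datasets
median_speedup = statistics.median(values)  # additional summary, not from the table
```

The mean rounds to the 47.7 reported above. It is strongly influenced by the large speedup on $\mathop{\texttt{LF}}\limits$ ; on the other five datasets the speedup lies between 7.1 and 10.4.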

6.10 Parameter Study

We conduct a parameter study to assess how the length of the transformed sequence (i.e., $n$ ) affects the recommendation performance on the widely used $\mathop{\texttt{DG}}\limits$ , $\mathop{\texttt{YC}}\limits$ , $\mathop{\texttt{GA}}\limits$ and $\mathop{\texttt{LF}}\limits$ datasets. Particularly, on each dataset, we vary $n$ while fixing the other hyper parameters to the best-performing values found during hyper parameter tuning, and report recall@20 on the augmented testing sessions in Figure 3. As shown in Figure 3, on $\mathop{\texttt{DG}}\limits$ , $\mathop{\texttt{YC}}\limits$ and $\mathop{\texttt{GA}}\limits$ , the performance increases significantly as $n$ increases when $n<5$ , while when $n\geq 5$ , incorporating earlier items in the session does not considerably improve the performance. Similarly, on $\mathop{\texttt{LF}}\limits$ , the performance becomes stable when $n\geq 10$ . These results reveal that on session-based recommendation datasets, only the most recent few items are effective in learning users’ preferences. As a result, we do not lose crucial information in $\mathop{\mathtt{P^{2}MAM}}\limits$ by using only the last $n$ items (Section 4.1) for the recommendation.

7 Discussion

TABLE VIII: Effectiveness of Users’ Prospective Preferences

dataset	method	recall@$k$			MRR@$k$			NDCG@$k$
dataset	method	$k$=5	$k$=10	$k$=20	$k$=5	$k$=10	$k$=20	$k$=5	$k$=10	$k$=20
$\mathop{\texttt{DG}}\limits$	$\mathop{\mathtt{MEAN}}\limits$	0.2778	0.3917	0.5246	0.1585	0.1736	0.1828	0.1880	0.2247	0.2583
$\mathop{\texttt{DG}}\limits$	$\mathop{\mathtt{ORACLE}}\limits$	0.4113	0.5260	0.6436	0.2755	0.2908	0.2990	0.3092	0.3463	0.3760
$\mathop{\texttt{YC}}\limits$	$\mathop{\mathtt{MEAN}}\limits$	0.4039	0.5392	0.6620	0.2376	0.2557	0.2643	0.2788	0.3227	0.3538
$\mathop{\texttt{YC}}\limits$	$\mathop{\mathtt{ORACLE}}\limits$	0.5771	0.6922	0.7760	0.3791	0.3947	0.4006	0.4285	0.4660	0.4873
$\mathop{\texttt{GA}}\limits$	$\mathop{\mathtt{MEAN}}\limits$	0.3589	0.4387	0.5202	0.2243	0.2350	0.2407	0.2580	0.2838	0.3044
$\mathop{\texttt{GA}}\limits$	$\mathop{\mathtt{ORACLE}}\limits$	0.4350	0.5097	0.5837	0.3084	0.3185	0.3236	0.3401	0.3643	0.3830
$\mathop{\texttt{LF}}\limits$	$\mathop{\mathtt{MEAN}}\limits$	0.1200	0.1706	0.2366	0.0685	0.0751	0.0797	0.0812	0.0975	0.1142
$\mathop{\texttt{LF}}\limits$	$\mathop{\mathtt{ORACLE}}\limits$	0.2544	0.3152	0.3892	0.1915	0.1996	0.2047	0.2071	0.2267	0.2453
$\mathop{\texttt{NP}}\limits$	$\mathop{\mathtt{MEAN}}\limits$	0.1019	0.1592	0.2250	0.0579	0.0654	0.0700	0.0687	0.0871	0.1037
$\mathop{\texttt{NP}}\limits$	$\mathop{\mathtt{ORACLE}}\limits$	0.1822	0.2429	0.3088	0.1078	0.1160	0.1205	0.1263	0.1459	0.1625
$\mathop{\texttt{TM}}\limits$	$\mathop{\mathtt{MEAN}}\limits$	0.2531	0.3181	0.3780	0.1676	0.1763	0.1805	0.1889	0.2099	0.2251
$\mathop{\texttt{TM}}\limits$	$\mathop{\mathtt{ORACLE}}\limits$	0.5282	0.5717	0.6086	0.4489	0.4547	0.4573	0.4687	0.4828	0.4921

In this table, $\mathop{\mathtt{MEAN}}\limits$ equally weighs items in the session, and $\mathop{\mathtt{ORACLE}}\limits$ learns attention weights over items conditioned on the ground-truth next item. The best performance on each dataset is in bold.

We conduct an experiment to verify that users’ prospective preferences could signify the important items, and thus benefit the recommendation. Specifically, we develop a method, denoted as $\mathop{\mathtt{ORACLE}}\limits$ , which learns attention weights conditioned on the ground-truth next item (i.e., $s_{|S|+1}$ , the true preference). In $\mathop{\mathtt{ORACLE}}\limits$ , we generate recommendations in the same way as in $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ except that in the dot-product attention (Equation 3), we remove the position embeddings (i.e., $\mathop{P}\limits$ ) and replace $\mathbf{q}$ with $\mathbf{v_{s_{|S|+1}}}$ (i.e., the embedding of $s_{|S|+1}$ ). We empirically compare $\mathop{\mathtt{ORACLE}}\limits$ with another method, denoted as $\mathop{\mathtt{MEAN}}\limits$ , which uses mean pooling to weigh items in the session equally. We report the results of $\mathop{\mathtt{ORACLE}}\limits$ and $\mathop{\mathtt{MEAN}}\limits$ on the six datasets in Table VIII.
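The difference between the two pooling strategies can be sketched as follows. This is an illustrative simplification rather than the authors' implementation: it uses fixed toy embeddings instead of learned ones, assumes a plain softmax over unscaled dot products (the exact form of Equation 3 is not reproduced here), and the function names `mean_pool` and `oracle_pool` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(item_vecs):
    """MEAN-style pooling: every item in the session gets the same weight."""
    n = len(item_vecs)
    return [sum(col) / n for col in zip(*item_vecs)]

def oracle_pool(item_vecs, next_item_vec):
    """ORACLE-style pooling: weights come from dot products with the
    embedding of the ground-truth next item (no position embeddings)."""
    scores = [sum(a * b for a, b in zip(v, next_item_vec)) for v in item_vecs]
    weights = softmax(scores)
    pooled = [sum(w * x for w, x in zip(weights, col)) for col in zip(*item_vecs)]
    return pooled, weights

# Toy session of three item embeddings and a toy next-item embedding.
session = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
next_item = [1.0, 0.0]

mean_repr = mean_pool(session)
oracle_repr, oracle_weights = oracle_pool(session, next_item)
print("MEAN representation:", mean_repr)
print("ORACLE weights:", [round(w, 3) for w in oracle_weights])
```

In this toy session, the second item is dissimilar to the next item and receives less than the uniform weight of 1/3, while the other two items are weighted up.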

As shown in Table VIII, $\mathop{\mathtt{ORACLE}}\limits$ significantly outperforms $\mathop{\mathtt{MEAN}}\limits$ on all the datasets. For example, in terms of recall@5, MRR@5 and NDCG@5, $\mathop{\mathtt{ORACLE}}\limits$ achieves significant average improvements of 68.6%, 100.7% and 89.5% over $\mathop{\mathtt{MEAN}}\limits$ , respectively. The superior performance of $\mathop{\mathtt{ORACLE}}\limits$ over $\mathop{\mathtt{MEAN}}\limits$ shows that the learned attention weights in $\mathop{\mathtt{ORACLE}}\limits$ are effective, and further reveals that users’ prospective preferences could indicate important items. These results motivate us to estimate the prospective preferences, and weigh items conditioned on the estimate as in $\mathop{\mathtt{P^{2}E}}\limits$ (Section 4.3).
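The average improvements quoted above can be recomputed from Table VIII. The sketch below copies the $k$=5 columns from the table and assumes that the average is the arithmetic mean of the per-dataset relative improvements; this assumption is ours, and it reproduces the quoted figures. The names `table_viii_at_5` and `average_improvement` are illustrative.

```python
# (recall@5, MRR@5, NDCG@5) for MEAN and ORACLE, copied from Table VIII.
table_viii_at_5 = {
    "DG": {"MEAN": (0.2778, 0.1585, 0.1880), "ORACLE": (0.4113, 0.2755, 0.3092)},
    "YC": {"MEAN": (0.4039, 0.2376, 0.2788), "ORACLE": (0.5771, 0.3791, 0.4285)},
    "GA": {"MEAN": (0.3589, 0.2243, 0.2580), "ORACLE": (0.4350, 0.3084, 0.3401)},
    "LF": {"MEAN": (0.1200, 0.0685, 0.0812), "ORACLE": (0.2544, 0.1915, 0.2071)},
    "NP": {"MEAN": (0.1019, 0.0579, 0.0687), "ORACLE": (0.1822, 0.1078, 0.1263)},
    "TM": {"MEAN": (0.2531, 0.1676, 0.1889), "ORACLE": (0.5282, 0.4489, 0.4687)},
}
metrics = ("recall@5", "MRR@5", "NDCG@5")

def average_improvement(metric_index):
    """Mean over datasets of the relative improvement (in %) of ORACLE over MEAN."""
    gains = []
    for rows in table_viii_at_5.values():
        mean_value = rows["MEAN"][metric_index]
        oracle_value = rows["ORACLE"][metric_index]
        gains.append(100.0 * (oracle_value - mean_value) / mean_value)
    return sum(gains) / len(gains)

averages = {name: average_improvement(i) for i, name in enumerate(metrics)}
for name, value in averages.items():
    print(f"{name}: {value:.1f}%")
```

The per-dataset gains vary widely: on recall@5 they range from about 21% on $\mathop{\texttt{GA}}\limits$ to 112% on $\mathop{\texttt{LF}}\limits$ , so the averages are driven largely by $\mathop{\texttt{LF}}\limits$ and $\mathop{\texttt{TM}}\limits$ .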

We notice that on $\mathop{\texttt{TM}}\limits$ , $\mathop{\mathtt{MEAN}}\limits$ outperforms $\mathop{\mathtt{P^{2}MAM}}\limits$ and all the baseline methods (Table III). Previous work [17] suggests that on some extremely sparse recommendation datasets, attention-based methods may not be learned well, and may underperform simple mean pooling-based methods (e.g., $\mathop{\mathtt{MEAN}}\limits$ ). $\mathop{\texttt{TM}}\limits$ , with the smallest training set and a large number of items, is extremely sparse. This sparsity could explain why $\mathop{\mathtt{MEAN}}\limits$ outperforms $\mathop{\mathtt{P^{2}MAM}}\limits$ and all the baseline methods on this dataset. However, on all the other datasets, $\mathop{\mathtt{P^{2}MAM}}\limits$ significantly outperforms $\mathop{\mathtt{MEAN}}\limits$ , which reveals that on most of the datasets, $\mathop{\mathtt{P^{2}MAM}}\limits$ could learn well, and thus enable better performance.

8 Conclusions

In this manuscript, we presented novel $\mathop{\mathtt{P^{2}MAM}}\limits$ models that conduct session-based recommendations using two important factors: temporal patterns and estimates of users’ prospective preferences. Our experimental results in comparison with five state-of-the-art baseline methods on the six benchmark datasets demonstrate that $\mathop{\mathtt{P^{2}MAM}}\limits$ significantly outperforms the baseline methods with an improvement of up to 19.2%. The results also reveal that on most of the datasets, the two factors could reinforce each other, and enable superior performance. Our analysis on position embeddings signifies the importance of explicitly modeling the position information for session-based recommendation. Our analysis on prospective preference estimate strategies demonstrates that on most of the datasets, our learning-based strategy is more effective than the existing recency-based strategy. Our analysis on the learned attention weights shows that with position embeddings, $\mathop{\mathtt{P^{2}MAM}}\limits$ could effectively capture the temporal patterns (e.g., recency patterns). Our run-time performance comparison shows that during testing, $\mathop{\mathtt{P^{2}MAM}}\limits$ is much more efficient than the best baseline method $\mathop{\mathtt{DHCN}}\limits$ (an average speedup of 47.7 times). Our analysis on users’ prospective preferences demonstrates that the prospective preferences could signify important items, and thus benefit the recommendation.

Acknowledgement

This project was made possible, in part, by support from the National Science Foundation under Grant Numbers IIS-1855501, EAR-1520870, SES-1949037, IIS-1827472 and IIS-2133650, and from the National Library of Medicine under Grant Numbers 1R01LM012605-01A1 and R21LM013678-01. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

Appendix A Reproducibility

TABLE IX: Best Hyper Parameters for $\mathop{\mathtt{P^{2}MAM}}\limits$ , $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ and Baseline Methods

Dataset	$\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$		$\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$			$\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$			$\mathop{\mathtt{NARM}}\limits$		$\mathop{\mathtt{SR\text{-}GNN}}\limits$			$\mathop{\mathtt{LESSR}}\limits$			$\mathop{\mathtt{DHCN}}\limits$
Dataset	$d$	$n$	$d$	$n$	$b$	$d$	$n$	$b$	$d$	$lr$	$d$	$lr$	$\lambda$	$d$	$lr$	$l$	$d$	$\beta$
DG	128	10	128	15	8	128	10	4	64	1e-3	32	1e-4	1e-6	32	2e-3	3	128	5e-3
YC	128	7	128	8	8	128	20	2	64	1e-3	128	1e-3	0.0	32	1e-3	2	128	1e-4
GA	128	10	128	15	8	128	10	4	64	1e-3	128	1e-3	1e-6	32	1e-3	2	128	1e-3
LF	128	10	128	20	2	128	20	1	128	1e-3	128	5e-4	0.0	128	1e-3	4	128	1e-5
NP	128	9	128	8	4	128	20	4	64	1e-3	128	5e-4	1e-6	32	2e-3	3	128	1e-3
TM	128	15	128	10	4	128	20	4	128	1e-4	96	1e-3	1e-6	128	5e-4	2	64	5e-5

In this table, in $\mathop{\mathtt{P^{2}MAM}}\limits$ , $d$ , $n$ and $b$ are the dimension of the hidden representation, the length of the transformed session and the number of heads, respectively. In $\mathop{\mathtt{NARM}}\limits$ , $\mathop{\mathtt{SR\text{-}GNN}}\limits$ and $\mathop{\mathtt{LESSR}}\limits$ , $d$ is the dimension of the hidden representation and $lr$ is the learning rate. In $\mathop{\mathtt{SR\text{-}GNN}}\limits$ , $\lambda$ is the weight decay factor. In $\mathop{\mathtt{LESSR}}\limits$ , $l$ is the number of GNN layers, and in $\mathop{\mathtt{DHCN}}\limits$ , $\beta$ is the factor for the self-supervision.

We implement $\mathop{\mathtt{P^{2}MAM}}\limits$ in Python 3.7.3 with PyTorch 1.4.0 (https://pytorch.org). We use the Adam optimizer with learning rate 1e-3 on all the datasets for the $\mathop{\mathtt{P^{2}MAM}}\limits$ variants (i.e., $\mathop{\mathtt{P^{2}MAM\text{-}O}}\limits$ , $\mathop{\mathtt{P^{2}MAM\text{-}P}}\limits$ and $\mathop{\mathtt{P^{2}MAM\text{-}O\text{-}P}}\limits$ ). We initialize all the learnable parameters using the default initialization methods in PyTorch. The source code and processed data are available on GitHub (https://github.com/ninglab/P2MAM). For all the methods, during the grid search, we initially search the hyper parameters in a search range. If a hyper parameter yields the best performance on the boundary of the search range, we extend the search range, if applicable, until a value in the middle yields the best performance. Table IX presents the hyper parameters used for all the methods.

For $\mathop{\mathtt{P^{2}MAM}}\limits$ and $\mathop{\mathtt{LAST\text{-}O\text{-}P}}\limits$ , the initial search ranges for the embedding dimension $d$ , the length of the transformed sequence $n$ and the number of heads $b$ are $\{32,64,128\}$ , $\{10,15,20\}$ and $\{1,2,4\}$ , respectively.

For $\mathop{\mathtt{NARM}}\limits$ (https://github.com/lijingsdu/sessionRec_NARM), we initially search $d$ and the learning rate $lr$ from $\{32,64,128\}$ and $\{$ 1e-4, 1e-3 $\}$ , respectively.

For $\mathop{\mathtt{SR\text{-}GNN}}\limits$ (https://github.com/CRIPAC-DIG/SR-GNN), we initially search $d$ , $lr$ and the weight decay factor $\lambda$ from $\{32,64,128\}$ , $\{$ 1e-4, 1e-3 $\}$ and $\{$ 1e-5, 1e-4, 1e-3 $\}$ , respectively.

For $\mathop{\mathtt{LESSR}}\limits$ (https://github.com/twchen/lessr), we initially search $d$ , $lr$ and the number of GNN layers $l$ from $\{32,64,128\}$ , $\{$ 1e-4, 1e-3 $\}$ and $\{1,2,3,4,5\}$ , respectively.

For $\mathop{\mathtt{DHCN}}\limits$ (https://github.com/xiaxin1998/DHCN), we initially search $d$ and the factor for the self-supervision $\beta$ from $\{32,64,128\}$ and $\{$ 1e-4, 1e-3, 1e-2 $\}$ , respectively. We use the default number of GNN layers (i.e., 3) because the original paper of $\mathop{\mathtt{DHCN}}\limits$ shows that $\mathop{\mathtt{DHCN}}\limits$ is not sensitive to this hyper parameter, and because it is very expensive to tune hyper parameters for $\mathop{\mathtt{DHCN}}\limits$ (Section 6.9).
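The boundary-extension rule used in the grid search above can be sketched as follows. This is a hypothetical one-dimensional illustration, not the tuning script used for the experiments: the function `tune`, the doubling rule for extending the range, and the toy validation score are all assumptions made for the example.

```python
import math

def tune(values, evaluate, extend_low=None, extend_high=None, max_rounds=5):
    """Search `values`; while the best value sits on a boundary of the range,
    add one more candidate beyond that boundary (if an extension rule exists)."""
    scores = {v: evaluate(v) for v in values}
    for _ in range(max_rounds):
        ordered = sorted(scores)
        best = max(scores, key=scores.get)
        if best == ordered[0] and extend_low is not None:
            new_value = extend_low(ordered[0])
        elif best == ordered[-1] and extend_high is not None:
            new_value = extend_high(ordered[-1])
        else:
            return best, scores  # best value is in the middle: stop
        if new_value is None or new_value in scores:
            return best, scores  # extension not applicable
        scores[new_value] = evaluate(new_value)
    return max(scores, key=scores.get), scores

# Toy validation score (assumption): peaks when the number of heads b is 8.
def toy_score(b):
    return -(math.log2(b) - 3) ** 2

# Initial range for b as in the text; doubling is our assumed extension rule.
best_b, all_scores = tune([1, 2, 4], toy_score, extend_high=lambda b: b * 2)
print("best b:", best_b, "searched:", sorted(all_scores))
```

Under this toy score, the search extends from $\{1,2,4\}$ to 8 and then to 16 before stopping at 8, since 8 is then no longer on the boundary. This kind of extension is consistent with Table IX containing values outside the initial ranges (e.g., $b$ = 8 and $n$ = 7).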

References

[1] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma, “Neural attentive session-based recommendation,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1419–1428.
[2] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan, “Session-based recommendation with graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 346–353.
[3] X. Xia, H. Yin, J. Yu, Q. Wang, L. Cui, and X. Zhang, “Self-supervised hypergraph convolutional networks for session-based recommendation,” arXiv preprint arXiv:2012.06852, 2020.
[4] T. Chen and R. C.-W. Wong, “Handling information loss of graph neural networks for session-based recommendation,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1172–1180.
[5] A. Sherstinsky, “Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020.
[6] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” AI Open, vol. 1, pp. 57–81, 2020.
[7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[8] B. Peng, Z. Ren, S. Parthasarathy, and X. Ning, “M2: Mixed models with preferences, popularities and transitions for next-basket recommendation,” arXiv preprint arXiv:2004.01646, 2020.
[9] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized markov chains for next-basket recommendation,” ser. WWW ’10, 2010, pp. 811–820.
[10] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang, “Stamp: short-term attention/memory priority model for session-based recommendation,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1831–1839.
[11] B. Hidasi and A. Karatzoglou, “Recurrent neural networks with top-k gains for session-based recommendations,” in Proceedings of the 27th ACM international conference on information and knowledge management, 2018, pp. 843–852.
[12] R. Qiu, J. Li, Z. Huang, and H. Yin, “Rethinking the item order in session-based recommendation with graph neural networks,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 579–588.
[13] T. Donkers, B. Loepp, and J. Ziegler, “Sequential user-based recurrent neural network recommendations,” in Proceedings of the Eleventh ACM Conference on Recommender Systems, ser. RecSys ’17, 2017, pp. 152–160.
[14] F. Vasile, E. Smirnova, and A. Conneau, “Meta-prod2vec: Product embeddings using side-information for recommendation,” in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 225–232.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in neural information processing systems, vol. 26, 2013.
[16] J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 565–573.
[17] B. Peng, Z. Ren, S. Parthasarathy, and X. Ning, “Ham: hybrid associations models for sequential recommendation,” IEEE Transactions on Knowledge and Data Engineering, 2021.
[18] F. Yuan, A. Karatzoglou, I. Arapakis, J. M. Jose, and X. He, “A simple convolutional generative network for next item recommendation,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 582–590.
[19] W.-C. Kang and J. McAuley, “Self-attentive sequential recommendation,” in 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018, pp. 197–206.
[20] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, “Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer,” in Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 1441–1450.
[21] Z. Fan, Z. Liu, S. Wang, L. Zheng, and P. S. Yu, “Modeling sequences as distributions with uncertainty for sequential recommendation,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 3019–3023.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[23] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
[24] B.-J. Hou and Z.-H. Zhou, “Learning with interpretable structure from gated rnn,” IEEE transactions on neural networks and learning systems, pp. 2267–2279, 2020.
[25] C. Xu, P. Zhao, Y. Liu, V. S. Sheng, J. Xu, F. Zhuang, J. Fang, and X. Zhou, “Graph contextualized self-attention network for session-based recommendation.” in IJCAI, 2019, pp. 3940–3946.
[26] Z. Fan, Z. Liu, J. Zhang, Y. Xiong, L. Zheng, and P. S. Yu, “Continuous-time sequential recommendation with temporal graph collaborative transformer,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 433–442.
[27] Y. Wu, M. Mukunoki, T. Funatomi, M. Minoh, and S. Lao, “Optimizing mean reciprocal rank for person re-identification,” in 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2011, pp. 408–413.

Bo Peng received his M.S. degree from the Department of Computer and Information Science, Indiana University–Purdue University, Indianapolis, in 2019. He is currently a Ph.D. student at the Computer Science and Engineering Department, The Ohio State University. His research interests include machine learning, data mining and their applications in recommender systems and graph mining.

Chang-Yu Tai received his M.S. degree from the Department of Chemistry, National Taiwan University, in 2018. He is currently an M.S. student at the Computer Science and Engineering Department, The Ohio State University. His research interests include deep learning applications in natural language processing and recommender systems.

Srinivasan Parthasarathy received his Ph.D. degree from the Department of Computer Science, University of Rochester, Rochester, in 1999. He is currently a Professor at the Computer Science and Engineering Department, and the Biomedical Informatics Department, The Ohio State University. His research is on high performance data analytics, graph analytics and network science, and machine learning and database systems.

Xia Ning received her Ph.D. degree from the Department of Computer Science & Engineering, University of Minnesota, Twin Cities, in 2012. She is currently an Associate Professor at the Biomedical Informatics Department, and the Computer Science and Engineering Department, The Ohio State University. Her research is on data mining, machine learning and artificial intelligence with applications in recommender systems, drug discovery and medical informatics.