Prospective Preference Enhanced Mixed Attentive Model for Session-based Recommendation

Bo Peng, Chang-Yu Tai, Srinivasan Parthasarathy, and Xia Ning

Bo Peng and Chang-Yu Tai are with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210. E-mail: [email protected], [email protected]
Srinivasan Parthasarathy and Xia Ning are with the Department of Biomedical Informatics, and the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210. E-mail: [email protected], [email protected]
Corresponding author. Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

Session-based recommendation aims to generate recommendations for the next item of users’ interest based on a given session. In this manuscript, we develop the prospective preference enhanced mixed attentive model ($\mathtt{P^{2}MAM}$) to generate session-based recommendations using two important factors: temporal patterns and estimates of users’ prospective preferences. Unlike existing methods, $\mathtt{P^{2}MAM}$ models the temporal patterns using a lightweight yet effective position-sensitive attention mechanism. In $\mathtt{P^{2}MAM}$, we also leverage the estimate of users’ prospective preferences to signify important items and generate better recommendations. Our experimental results demonstrate that $\mathtt{P^{2}MAM}$ models significantly outperform the state-of-the-art methods on six benchmark datasets, with improvements of up to 19.2%. In addition, our run-time performance comparison demonstrates that during testing, $\mathtt{P^{2}MAM}$ models are much more efficient than the best baseline method, with an average speedup of 47.7-fold.

Index Terms:
session-based recommendation, recommender system, attention mechanism

1 Introduction

Session-based recommendation aims to generate recommendations for the next item of users’ interest based on a given session (i.e., a sequence of items chronologically ordered according to user interactions in a short time period). It has been drawing increasing attention from the research community due to its wide applications in online shopping [1, 2], music streaming [3] and tourist planning [4], among others. With the prosperity of deep learning, many deep models, particularly those based on recurrent neural networks (RNNs) [5] and graph neural networks (GNNs) [6], have been developed for session-based recommendation, and have demonstrated state-of-the-art performance. These methods primarily model the temporal patterns (e.g., transitions, recency patterns, etc.) in sessions, but are not always effective in modeling other important factors that are indicative of the next item. In addition, existing methods primarily model the temporal patterns using gated recurrent units (GRUs) [7]. However, given the notoriously sparse nature of session-based recommendation datasets, the complicated GRUs may not be well learned and could degrade the performance, as shown in the literature [8]. Due to their recurrent nature, GRUs also suffer from limited parallelizability and poor interpretability.

To mitigate the limitations of existing methods, in this manuscript, we develop a prospective preference enhanced mixed attentive model, denoted as $\mathtt{P^{2}MAM}$, for session-based recommendation. In $\mathtt{P^{2}MAM}$, different from existing methods, we model the temporal patterns using a novel position-sensitive attention mechanism, which is lightweight, fully parallelizable, and could enable better interpretability than GRUs. Besides the temporal patterns, we also leverage the estimate of users’ prospective preferences for better recommendation. Users’ prospective preferences are another important factor for recommendation. Intuitively, if we knew beforehand that the user is going to watch action movies next (i.e., prospective preference), we could generate better recommendations by learning her/his preference from the action movies instead of the comedy movies in her/his watching history. We conducted an analysis to empirically verify that users’ prospective preferences could signify important items, and thus improve the recommendation performance (see the Discussion Section). The results reveal that conditioned on the prospective preferences, we could learn indicative attention weights over items, and enable superior performance. However, in practice, users’ prospective preferences are usually intractable. Thus, in $\mathtt{P^{2}MAM}$, we explicitly estimate the prospective preferences, and learn attention weights based on the estimate to boost the recommendation performance.

With different combinations of the two factors (i.e., temporal patterns and the estimate of users’ prospective preferences), $\mathtt{P^{2}MAM}$ has three variants: $\mathtt{P^{2}MAM\text{-}O}$, $\mathtt{P^{2}MAM\text{-}P}$ and $\mathtt{P^{2}MAM\text{-}O\text{-}P}$. $\mathtt{P^{2}MAM\text{-}O}$ models temporal patterns using a novel position-sensitive attention mechanism. $\mathtt{P^{2}MAM\text{-}P}$ leverages the estimate of users’ prospective preferences to weigh items and generate recommendations. $\mathtt{P^{2}MAM\text{-}O\text{-}P}$ explicitly leverages the two factors for better recommendation.

We compare $\mathtt{P^{2}MAM}$ with five state-of-the-art baseline methods on six benchmark session-based recommendation datasets. Our experimental results demonstrate that $\mathtt{P^{2}MAM}$ significantly outperforms the state-of-the-art methods on all the datasets, with an improvement of up to 19.2%. The results also show that on most of the datasets, the two factors are mutually strengthened, and could enable superior performance when used together. We also conduct a comprehensive analysis to verify the effectiveness of different components in $\mathtt{P^{2}MAM}$. The results show that with the position embeddings, our position-sensitive attention mechanism could effectively capture the temporal patterns in the datasets, and on most of the datasets, our learning-based prospective preference estimate strategy could be more effective than recency-based strategies. Moreover, we conduct a run-time performance analysis, and find that $\mathtt{P^{2}MAM}$ is much more efficient than the best baseline method, with an average speedup of 47.7-fold over the six datasets.

Our major contributions are summarized as follows:

  • We develop a novel session-based recommendation method $\mathtt{P^{2}MAM}$, which leverages both the temporal patterns and estimates of users’ prospective preferences for recommendation.

  • $\mathtt{P^{2}MAM}$ significantly outperforms five state-of-the-art methods on six benchmark datasets (Section 6.1).

  • Our analysis demonstrates the importance of modeling the position information for session-based recommendation (Section 6.5).

  • The experimental results show that our learning-based prospective preference estimate strategy is more effective than the existing recency-based strategy (Section 6.6).

  • Our analysis shows that the learned attention weights in $\mathtt{P^{2}MAM}$ could capture the temporal patterns in the data (Section 6.7).

  • Our analysis verifies that users’ prospective preferences could signify important items, and benefit recommendations (Section 7).

  • For reproducibility purposes, we release our source code on GitHub (https://github.com/ninglab/P2MAM), and report the hyperparameters in the Appendix.

2 Related Work

2.1 Session-based Recommendation

In the last few years, numerous session-based recommendation methods have been developed, particularly using Markov Chains (MCs), attention mechanisms and neural networks such as RNNs and GNNs. MCs-based methods [9] use MCs to capture the transitions among items for recommendation. For example, Rendle et al. [9] employ a first-order MC to generate recommendations based on the transitions of the last item in each session. Attention-based methods [1, 10] model the importance of items for the recommendation. For example, Liu et al. [10] developed a short-term attention priority model ($\mathtt{STAMP}$), which adapts a gate mechanism to capture users’ short-term preferences. Recently, RNNs-based methods such as $\mathtt{GRU4Rec+}$ [11] and $\mathtt{NARM}$ [1] have been developed to model the temporal patterns among items primarily using GRUs.

GNNs-based methods have also been extensively developed for session-based recommendation. Wu et al. [2] converted sessions to directed graphs, and developed a GNNs-based model ($\mathtt{SR\text{-}GNN}$) to generate recommendations based on the graph structures. Qiu et al. [12] re-examined the importance of item ordering in session-based recommendations and developed a GNNs-based model ($\mathtt{FGNN}$), which includes a self-loop for each node in the graphs to better capture users’ short-term preferences. Chen et al. [4] showed that the widely used directed graph representations cannot fully preserve the sequential information in sessions. To mitigate this problem, they converted sessions to multigraphs, and developed a GNNs-based model ($\mathtt{LESSR}$) to generate recommendations. Xia et al. [3] developed a hypergraph-based model ($\mathtt{DHCN}$), which leverages hypergraphs and hypergraph convolutional networks to capture the high-order information among items.

2.2 Sequential Recommendation

Sequential recommendation aims to generate recommendations for the next items of users’ interest based on users’ historical interactions. It is closely related to session-based recommendation, except that in sequential recommendation, we can access users’ historical interactions over a long time period (e.g., months). In the last few years, neural networks (e.g., RNNs) and attention mechanisms have been extensively employed in sequential recommendation methods. For example, RNNs-based methods such as User-based RNNs [13] incorporate user characteristics into GRUs for personalized recommendation. Skip-gram-based methods [14] leverage the skip-gram model [15] to capture the co-occurrence among items in a time window. Recently, Convolutional Neural Networks (CNNs) have also been adapted for sequential recommendation. Tang et al. [16] developed a CNNs-based model, which uses multiple convolutional filters to model the synergies [17] among items. Yuan et al. [18] developed another CNNs-based generative model, NextItRec, to better capture the long-term dependencies in sequential recommendation. Besides CNNs, attention-based methods [19, 20, 21] have also been developed for sequential recommendation. Kang et al. [19] developed a self-attention based model, which adapts self-attention to better model users’ long-term preferences. Sun et al. [20] further developed a bidirectional self-attention based model to improve the representational power of item embeddings.

3 Definitions and Notations

TABLE I: Notations
        notation              meaning
        $m$                   the number of items
        $d$                   the dimensionality of embeddings
        $S$                   the original session
        $A$                   transformed fixed-length sequence
        $n$                   the number of items in $A$
        $\mathbf{h}^{o}$      position-sensitive preference prediction
        $\mathbf{h}^{p}$      prospective preferences-sensitive prediction
        $\hat{\mathbf{r}}$    recommendation scores

In this manuscript, we tackle the following recommendation problem: given an anonymous session, we recommend the next item of the user’s interest in the session. An anonymous session is represented as a sequence $S=\{s_{1},s_{2},\dots,s_{|S|}\}$, where $s_{t}$ is the $t$-th item in the session and $|S|$ is the length of the session. We use upper-case letters to denote matrices, lower-case and bold letters to denote row vectors, and lower-case non-bold letters to denote scalars. Table I presents the key notations used in this manuscript.

4 Methods

Figure 1 presents the overall architecture of $\mathtt{P^{2}MAM}$. $\mathtt{P^{2}MAM}$ generates recommendations via a session representation component, a position-sensitive preference prediction component, and a prospective preference enhanced preference prediction component. We will discuss each component in detail below.

4.1 Session Representation ($\mathtt{SR}$)

Previous studies [19, 17, 16] have shown that recently interacted items are more indicative of the next item than earlier ones. Following these, we focus on the most recent $n$ items (i.e., the last $n$ items) in a session to generate recommendations. Particularly, given a session $S=\{s_{1},s_{2},\dots,s_{|S|}\}$, we transform it to a fixed-length sequence $A=\{a_{1},a_{2},\dots,a_{n}\}$, which contains the last $n$ items in $S$ (i.e., $a_{i}=s_{|S|-n+i}$, $i=1,\cdots,n$). If $S$ is shorter than $n$, we pad empty items at the beginning of $A$ until it reaches length $n$.
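The truncation and padding step can be sketched in a few lines. This is a minimal illustration rather than the released implementation; the function name `to_fixed_length` and the use of id 0 for padded empty items are assumptions made here for the example.

```python
PAD = 0  # hypothetical id reserved for padded "empty" items


def to_fixed_length(session, n, pad=PAD):
    """Keep the last n items of a session; left-pad shorter sessions."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    recent = session[-n:]                      # a_i = s_{|S|-n+i}
    return [pad] * (n - len(recent)) + recent  # pad at the beginning


print(to_fixed_length([3, 5, 7, 9], 3))  # [5, 7, 9]
print(to_fixed_length([3, 5], 4))        # [0, 0, 3, 5]
```

Padding on the left keeps the most recent item at the last position regardless of the original session length, so a given position index always refers to the same distance from the end of the session.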

Figure 1: Overall Architecture

In $\mathtt{P^{2}MAM}$, we represent the items in sessions using learnable embeddings. Specifically, we learn an item embedding matrix $V\in\mathbb{R}^{m\times d}$, in which the $j$-th row $\mathbf{v}_{j}$ is the embedding of item $j$, and $m$ and $d$ are the number of all the items and the dimensionality of embeddings, respectively. Given $V$, we represent the items in $A$ by a matrix $E\in\mathbb{R}^{n\times d}$ as follows:

$$E=[\mathbf{v}_{a_{1}};\mathbf{v}_{a_{2}};\cdots;\mathbf{v}_{a_{n}}], \tag{1}$$

where $\mathbf{v}_{a_{i}}$ is the embedding of item $a_{i}$. Following the previous work [22, 19, 21], we use a constant zero vector as the embedding of padded empty items.
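As a small illustration of Eq. (1), the lookup below stacks item embeddings row by row and maps padded items to a zero vector. It is only a sketch: the helper name `embed_items`, the padding id 0, and the 1-based item ids are assumptions for this example, and the toy values of `V` are made up.

```python
PAD = 0  # hypothetical id for padded empty items; real items are 1..m


def embed_items(A, V):
    """Stack item embeddings row by row; padded items map to a zero vector."""
    d = len(V[0])
    zero = [0.0] * d
    # Item j (1-based) uses row j-1 of V because Python lists are 0-based.
    return [list(V[a - 1]) if a != PAD else list(zero) for a in A]


V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy embeddings, m=3, d=2
E = embed_items([0, 3, 1], V)
print(E)  # [[0.0, 0.0], [0.5, 0.6], [0.1, 0.2]]
```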

We also learn embeddings for the $n$ positions in the fixed-length sessions to capture the temporal patterns. Specifically, following Vaswani et al. [22], we learn a position embedding matrix $P\in\mathbb{R}^{n\times d}$:

$$P=[\mathbf{p}_{1};\mathbf{p}_{2};\cdots;\mathbf{p}_{n}], \tag{2}$$

in which the $t$-th row $\mathbf{p}_{t}$ is the embedding of the $t$-th position.

4.2 Position-sensitive Preference Prediction ($\mathtt{PS}$)

It has been shown [8, 10, 17] that the temporal patterns play an important role in predicting users’ preferences. Existing session-based recommendation methods model the temporal patterns primarily using GRUs-based methods [7, 23], which implicitly learn weights over items. However, as demonstrated in the literature [8], the complicated GRUs-based models may not be well learned on the notoriously sparse recommendation datasets, and they also suffer from poor interpretability [24] and limited parallelizability [8, 19].

In $\mathtt{P^{2}MAM}$, different from other methods, we develop a novel position-sensitive attention mechanism to model the temporal patterns, and generate predictions of users’ preferences accordingly. Specifically, we use a dot-product attention mechanism as follows:

$$C=E+P, \qquad \bm{\alpha}=\text{softmax}\left(\frac{\mathbf{q}C^{\top}}{\sqrt{d}}\right), \qquad \mathbf{h}^{o}=\bm{\alpha}C, \tag{3}$$

where $C$ combines item embeddings and position embeddings in the session, $\mathbf{c}_{i}$ is the $i$-th row in $C$, $\bm{\alpha}$ is a vector of attention weights, $\mathbf{q}\in\mathbb{R}^{1\times d}$ is a learnable vector shared by all the sessions, and $\mathbf{h}^{o}$ is the position-sensitive preference prediction. The intuition of our position-sensitive preference prediction is that items themselves (their contents) and their relative orders in the sessions represent meaningful information related to user preferences; we can use data-driven attention weights from item embeddings and position embeddings to capture such information. Different from existing attentive session-based methods [10, 1, 25], which learn attention weights without explicitly considering the position information, we explicitly incorporate position embeddings to learn position-sensitive attention weights. Compared to GRUs-based methods [7, 23], the simple and lightweight attention mechanism is easier to learn well on the sparse recommendation datasets, and it could also provide better interpretability and parallelizability.
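To make Eq. (3) concrete, the following pure-Python sketch computes the attention weights and the preference prediction for a single session. The function names are assumptions for this example, and the toy inputs are made up. The text does not say whether padded positions are masked before the softmax, so this sketch does not mask them.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    mx = max(x)
    exps = [math.exp(v - mx) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def position_sensitive_attention(E, P, q):
    """Eq. (3): C = E + P, alpha = softmax(q C^T / sqrt(d)), h_o = alpha C."""
    d = len(q)
    C = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(E, P)]
    logits = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d) for c in C]
    alpha = softmax(logits)
    h_o = [sum(a * c[k] for a, c in zip(alpha, C)) for k in range(d)]
    return alpha, h_o


E = [[1.0, 0.0], [0.0, 1.0]]   # toy item embeddings (n=2, d=2)
P = [[0.0, 0.0], [0.0, 0.0]]   # toy position embeddings
alpha, h_o = position_sensitive_attention(E, P, q=[0.0, 0.0])
print(alpha, h_o)  # a zero query gives uniform weights over the two positions
```

Because the query $\mathbf{q}$ is shared by all sessions, the weights depend only on the item and position embeddings of the session, and every position can be scored in parallel.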

4.3 Prospective Preference Enhanced Preference Prediction ($\mathtt{P^{2}E}$)

As presented in Section 1, users’ prospective preferences reveal important items, and thus, could benefit the recommendation. Motivated by this, in $\mathtt{P^{2}MAM}$, we develop a novel strategy that leverages the preference prediction (i.e., $\mathbf{h}^{o}$) from $\mathtt{PS}$ as an estimate of users’ prospective preferences to enable better attention weights over items. Specifically, we employ a multi-head attention mechanism [22] as follows:

$$\begin{aligned}
\bm{\beta}_{i}&=\text{softmax}\left(\frac{(\mathbf{h}^{o}Q_{i})(CK_{i})^{\top}}{\sqrt{d}}\right),\\
\mathbf{head}_{i}&=\bm{\beta}_{i}(CW_{i}),\\
\mathbf{h}^{p}&=[\mathbf{head}_{1},\mathbf{head}_{2},\dots,\mathbf{head}_{b}]W,
\end{aligned} \tag{4}$$

where $\bm{\beta}_{i}$ is the vector of attention weights from the $i$-th head, $Q_{i}\in\mathbb{R}^{d\times\frac{d}{b}}$, $K_{i}\in\mathbb{R}^{d\times\frac{d}{b}}$, and $W_{i}\in\mathbb{R}^{d\times\frac{d}{b}}$ are learnable projection matrices in the $i$-th head, $\mathbf{head}_{i}$ is the output of the $i$-th head, $b$ is the number of heads, $W\in\mathbb{R}^{d\times d}$ is a learnable projection matrix shared by all the heads, and $\mathbf{h}^{p}\in\mathbb{R}^{1\times d}$ is the prospective preference enhanced preference prediction. The intuition of our strategy (i.e., $\mathtt{P^{2}E}$) is that we learn predictive attention weights based on the estimate of users’ prospective preferences. As will be shown in Section 6.1, this strategy could significantly improve the recommendation performance. Note that $\mathtt{P^{2}E}$ serves as a general strategy that is adaptable to other estimates of users’ prospective preferences. We also tried the dot-product attention as in $\mathtt{PS}$ but empirically found that it produces inferior performance.
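The multi-head computation in Eq. (4) can be sketched as below for a single session. This is an illustrative sketch under stated assumptions, not the released code: the names `matmul` and `prospective_preference_enhanced` are hypothetical, the projection matrices are passed as plain nested lists, and the logits are scaled by $\sqrt{d}$ exactly as written in Eq. (4).

```python
import math


def softmax(x):
    mx = max(x)
    exps = [math.exp(v - mx) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(X, Y):
    """Plain nested-list matrix product of X (r x s) and Y (s x t)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def prospective_preference_enhanced(h_o, C, Q, K, W_heads, W):
    """Eq. (4): h_o queries the rows of C in each of the b heads."""
    d = len(h_o)
    betas, concat = [], []
    for Q_i, K_i, W_i in zip(Q, K, W_heads):
        query = matmul([h_o], Q_i)[0]   # 1 x d/b
        keys = matmul(C, K_i)           # n x d/b
        values = matmul(C, W_i)         # n x d/b
        logits = [sum(a * k for a, k in zip(query, key)) / math.sqrt(d) for key in keys]
        beta = softmax(logits)
        head = [sum(w * v[j] for w, v in zip(beta, values)) for j in range(len(values[0]))]
        betas.append(beta)
        concat.extend(head)             # [head_1, ..., head_b]
    h_p = matmul([concat], W)[0]        # 1 x d
    return betas, h_p


# Toy example with d=2 and b=2 heads, so each projection is 2 x 1.
C = [[1.0, 0.0], [0.0, 1.0]]
Q = [[[1.0], [0.0]], [[0.0], [1.0]]]
K = [[[1.0], [0.0]], [[0.0], [1.0]]]
W_heads = [[[1.0], [0.0]], [[0.0], [1.0]]]
W = [[1.0, 0.0], [0.0, 1.0]]
betas, h_p = prospective_preference_enhanced([0.0, 0.0], C, Q, K, W_heads, W)
print(betas, h_p)
```

The only difference from standard self-attention is the query: it is the single vector $\mathbf{h}^{o}$ rather than one query per position, so each head produces one weight vector over the $n$ items.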

4.4 Recommendation Scores in $\mathtt{P^{2}MAM}$

In $\mathtt{P^{2}MAM}$, we calculate recommendation scores based on the predictions of users’ preferences (i.e., $\mathbf{h}^{o}$ and $\mathbf{h}^{p}$). Specifically, we develop three different methods to calculate the scores. Based on the scores, the $k$ items with the largest scores are recommended.

4.4.1 Scores based on temporal patterns ($\mathtt{P^{2}MAM\text{-}O}$)

Similarly to existing methods [1, 2], we calculate the recommendation scores based on the temporal patterns as follows:

$$\hat{\mathbf{r}}=\text{softmax}(\mathbf{h}^{o}V^{\top}), \tag{5}$$

where $\hat{\mathbf{r}}$ is a vector of recommendation scores over candidate items, $V$ is the item embedding matrix (Section 4.1), and the softmax function is employed to normalize the scores into the range $[0,1]$. We denote this $\mathtt{P^{2}MAM}$ variant as $\mathtt{P^{2}MAM\text{-}O}$.

4.4.2 Scores based on users’ prospective preferences ($\mathtt{P^{2}MAM\text{-}P}$)

In $\mathtt{P^{2}MAM}$, we could also generate recommendations from the prospective preference enhanced preference prediction $\mathbf{h}^{p}$ as follows:

$$\hat{\mathbf{r}}=\text{softmax}(\mathbf{h}^{p}V^{\top}). \tag{6}$$

We denote this $\mathtt{P^{2}MAM}$ variant as $\mathtt{P^{2}MAM\text{-}P}$.

4.4.3 Scores based on temporal patterns and users’ prospective preferences ($\mathtt{P^{2}MAM\text{-}O\text{-}P}$)

The two preference predictions (i.e., $\mathbf{h}^{o}$ and $\mathbf{h}^{p}$) could be mutually strengthened and might enable better performance when used together. Following this motivation, we also calculate recommendation scores using both $\mathbf{h}^{o}$ and $\mathbf{h}^{p}$ as follows:

$$\hat{\mathbf{r}}=\text{softmax}((\mathbf{h}^{o}+\mathbf{h}^{p})V^{\top}), \tag{7}$$

where we sum $\mathbf{h}^{o}$ and $\mathbf{h}^{p}$ for the recommendation. We denote the $\mathtt{P^{2}MAM}$ variant using both $\mathbf{h}^{o}$ and $\mathbf{h}^{p}$ as $\mathtt{P^{2}MAM\text{-}O\text{-}P}$.
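The three scoring variants in Eqs. (5) to (7) differ only in which preference vector is multiplied with $V^{\top}$. The sketch below makes that explicit; the function name `recommend` and its keyword arguments are assumptions for this example, and the returned top-$k$ entries are 0-based row indices of the toy matrix `V`.

```python
import math


def softmax(x):
    mx = max(x)
    exps = [math.exp(v - mx) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def recommend(V, k, h_o=None, h_p=None):
    """Score all items from h_o (Eq. 5), h_p (Eq. 6), or their sum (Eq. 7)."""
    if h_o is None and h_p is None:
        raise ValueError("provide h_o, h_p, or both")
    if h_o is None:
        h = h_p
    elif h_p is None:
        h = h_o
    else:
        h = [o + p for o, p in zip(h_o, h_p)]
    r_hat = softmax([sum(hi * vi for hi, vi in zip(h, v)) for v in V])
    top_k = sorted(range(len(V)), key=lambda j: r_hat[j], reverse=True)[:k]
    return r_hat, top_k


V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy item embeddings (m=3, d=2)
r_hat, top_k = recommend(V, k=1, h_o=[1.0, 0.0], h_p=[0.0, 1.0])
print(top_k)  # index of the item matching both predictions
```

Since softmax is monotonic, the top-$k$ ranking is the same with or without the normalization; the normalized scores matter for the training loss.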

4.5 Network Training

Following the literature [10, 2, 4], we use the cross-entropy loss to minimize the negative log likelihood of correctly recommending the ground-truth next item as follows:

$$\min_{\bm{\Theta}}\sum_{i=1}^{|T|}-\mathbf{r}_{i}\log(\hat{\mathbf{r}}_{i}^{\top}), \tag{8}$$

where $T$ is the set of all the training sessions, $\mathbf{r}_{i}$ is a one-hot vector in which the dimension $j$ is 1 if item $j$ is the ground-truth next item in the $i$-th training session or 0 otherwise, $\hat{\mathbf{r}}_{i}$ is the vector of recommendation scores for the $i$-th training session, and $\bm{\Theta}$ is the set of learnable parameters (e.g., $V$, $P$ and $W$). All the learnable parameters are randomly initialized, and are optimized in an end-to-end manner. We use this objective to optimize all the $\mathtt{P^{2}MAM}$ variants (i.e., $\mathtt{P^{2}MAM\text{-}O}$, $\mathtt{P^{2}MAM\text{-}P}$ and $\mathtt{P^{2}MAM\text{-}O\text{-}P}$).
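Because $\mathbf{r}_{i}$ is one-hot, each term of Eq. (8) reduces to the negative log of the score assigned to the ground-truth item. The sketch below evaluates the objective on toy score vectors; the function name `cross_entropy_loss` and the 0-based target indices are assumptions for this example, and no optimizer is shown.

```python
import math


def cross_entropy_loss(score_vectors, targets):
    """Sum of -log(r_hat[target]) over training sessions (Eq. 8).

    With a one-hot r_i, the product r_i log(r_hat_i^T) reduces to the
    log-score of the ground-truth item.
    """
    return sum(-math.log(r_hat[t]) for r_hat, t in zip(score_vectors, targets))


# Two toy sessions over two candidate items.
loss = cross_entropy_loss([[0.5, 0.5], [0.25, 0.75]], targets=[0, 1])
print(loss)
```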

5 Materials

5.1 Baseline Methods

We compare $\mathtt{P^{2}MAM}$ with five state-of-the-art baseline methods:

  • $\mathtt{POP}$ [2] recommends the most popular items of each session.

  • $\mathtt{NARM}$ [1] employs GRUs and attention mechanisms to model the temporal patterns for the recommendation.

  • $\mathtt{SR\text{-}GNN}$ [2] transforms sessions into directed graphs, and employs GNNs to model complex transitions in sessions.

  • $\mathtt{LESSR}$ [4] transforms sessions into directed multigraphs and generates recommendations using GNNs.

  • $\mathtt{DHCN}$ [3] transforms sessions into hypergraphs and generates recommendations using a hypergraph convolutional network.

Note that $\mathtt{DHCN}$ and $\mathtt{LESSR}$ have been compared against a comprehensive set of other methods including $\mathtt{GRU4Rec+}$ [11], $\mathtt{STAMP}$ [10] and $\mathtt{FGNN}$ [12], and have outperformed those methods. Thus, we compare $\mathtt{P^{2}MAM}$ with $\mathtt{DHCN}$ and $\mathtt{LESSR}$ instead of the methods that they have outperformed. For all the baseline methods, we use the implementations provided by their authors (Section A in Appendix).

5.2 Datasets

We compare $\mathtt{P^{2}MAM}$ with the baseline methods on the following benchmark datasets that are widely used in the literature [1, 2, 4]:

Following the literature [1, 2, 4, 3], for GA, we only keep the top 30,000 most popular locations, and view the check-in records in one day as a session [4, 8]. For LF, we only keep the top 40,000 most popular artists, and view the listening events in 8 hours as a session [4]. For all the datasets, we filter out sessions of length one and items appearing fewer than five times over all the sessions.
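The filtering step can be sketched as follows. The text does not specify the order in which the two filters are applied or whether they are repeated, so this example assumes a single pass that first removes rare items and then drops sessions left with fewer than two items. The function name `filter_sessions` and its parameters are hypothetical.

```python
from collections import Counter


def filter_sessions(sessions, min_item_count=5, min_len=2):
    """Drop rare items, then drop sessions that became too short.

    Assumption: one pass, items filtered before sessions.
    """
    counts = Counter(item for s in sessions for item in s)
    kept = []
    for s in sessions:
        frequent = [item for item in s if counts[item] >= min_item_count]
        if len(frequent) >= min_len:
            kept.append(frequent)
    return kept


raw = [[1, 2]] * 5 + [[1, 3], [4]]
print(filter_sessions(raw))  # items 3 and 4 are rare, so only the [1, 2] sessions remain
```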

5.3 Experimental Protocol

TABLE II: Dataset Statistics
dataset  #items  #train   #test   length  #aug train  #aug test  aug len
DG       42,596  188,636  15,955  4.80    716,835     60,194     4.90
YC       17,597  124,472  15,237  4.22    394,802     55,424     6.14
GA       29,510  234,403  57,492  3.85    675,561     155,332    4.32
LF       38,615  260,780  64,763  11.78   2,837,644   672,519    9.16
NP       60,416  128,077  14,479  7.42    825,304     89,824     6.53
TM       40,727  65,286   1,027   6.69    351,268     25,898     8.01
  • In this table, #items is the number of items. The columns #train, #test, and length correspond to the number of training sessions, the number of testing sessions, and the average length of sessions, respectively, before the augmentation. The columns #aug train, #aug test and aug len correspond to the same quantities after the augmentation.

5.3.1 Training and Testing Sets

Following the literature [2, 3, 4], we generate the training and testing sets as follows: for DG, NP and TM, we use the sessions in the last week as the testing set, and all the other sessions as the training set. For YC, we use the sessions in the last day as the testing set. For the other sessions, following Li et al. [1], we use the last (i.e., most recent) $1/64$ of them for training. For GA and LF, we use the last (i.e., most recent) 20% of all the sessions for testing, and all the other sessions for training.

Following the literature [1, 2, 3, 4], we augment the data to enrich the training and testing data. Specifically, for each original training and testing session $S=\{s_{1},s_{2},\dots,s_{|S|}\}$, we split it to $\{s_{1},s_{2}\}$, $\{s_{1},s_{2},s_{3}\}$, $\cdots$, $\{s_{1},s_{2},\dots,s_{|S|-1}\}$ and $\{s_{1},\dots,s_{|S|}\}$, and use all the resulting sessions as the augmented sessions for training and testing. The key statistics of the original and augmented datasets are presented in Table II. Note that we transform sessions to fixed length as in Section 4.1 after the augmentation.
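The prefix-based augmentation can be sketched as below. The function name `augment` is an assumption for this example. A session of length $|S|$ yields $|S|-1$ prefixes, and a session of length one yields none.

```python
def augment(session):
    """Return all prefixes of length 2..|S| of a session."""
    return [session[:end] for end in range(2, len(session) + 1)]


print(augment([1, 2, 3, 4]))  # [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
```

How each prefix is divided into model input and prediction target is not spelled out in this passage; a common convention in the cited literature is to treat the last item of each prefix as the target and the preceding items as the input, but that is an assumption here.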

We tune the hyperparameters using grid search and use the best hyperparameters in terms of recall@20 (Section 5.3.2) for $\mathtt{P^{2}MAM}$ and all the baseline methods during testing. Particularly, during the hyperparameter tuning, we use the first 80% of training sessions for model training, and evaluate the model on the last 20% of training sessions. During testing, we use all the training sessions for model training, and evaluate methods on the testing sessions. We report the search ranges of hyperparameters, and the identified optimal hyperparameters for each method in the Appendix (Section A).
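The 80%/20% split used during tuning can be sketched as a simple cut over chronologically ordered sessions. The function name `chronological_split` is hypothetical, and the sketch assumes the training sessions are already sorted from oldest to most recent.

```python
def chronological_split(sessions, train_fraction=0.8):
    """Split time-ordered sessions into an earlier part and a later part."""
    cut = int(len(sessions) * train_fraction)
    return sessions[:cut], sessions[cut:]


tune_train, tune_valid = chronological_split(list(range(10)))
print(tune_train, tune_valid)  # first 8 sessions for training, last 2 for validation
```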

5.3.2 Evaluation metrics

We use recall@$k$, MRR@$k$ and NDCG@$k$ to evaluate the performance of methods.

  • Recall@$k$ measures the proportion of sessions in which the ground-truth next item (i.e., $s_{|S|+1}$) is correctly recommended. For each session $S$, the recall@$k$ is 1 if $s_{|S|+1}$ is among the top $k$ of the recommendation list, or 0 otherwise. Note that in next item recommendation, recall@$k$ is the most popular evaluation metric. It is also called precision@$k$ and HR@$k$ in the literature [4].

  • MRR@$k$ is the mean reciprocal rank of the correctly recommended item, and is 0 if the ground-truth next item is not in the top $k$ of the recommendation list. MRR@$k$ is widely used in the literature [1, 2, 3, 4, 10] as a rank-aware evaluation metric for session-based recommendation.

  • NDCG@$k$ is the normalized discounted cumulative gain for the top-$k$ ranking, and is another widely used rank-aware metric [16, 26, 19]. Different from MRR@$k$, which only focuses on the very top-ranked items (e.g., top-1) [27], NDCG@$k$ effectively measures models’ performance in ranking the top-$k$ items, and thus, might be a better metric for evaluating recommendation methods in some scenarios. Following the literature [16], in our experiments, the gain indicates whether the ground-truth next item is recommended (i.e., gain is 1) or not (i.e., gain is 0).

For all the evaluation metrics, we report the average results over all the testing sessions in the experiments. We also statistically test the significance of the performance difference among different methods via a standard paired $t$-test at 95% confidence level.
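For a single testing session with one ground-truth item, all three metrics depend only on the rank of that item. With a binary gain and one relevant item, the ideal DCG is 1, so NDCG@$k$ reduces to $1/\log_{2}(\text{rank}+1)$ when the item is in the top $k$. The sketch below computes the per-session values, which would then be averaged over all testing sessions. The function names are assumptions for this example, and ties in the scores are resolved in favor of the target item, which is a simplification.

```python
import math


def rank_of_target(scores, target):
    """1-based rank of the target item; ties are resolved in its favor."""
    return 1 + sum(1 for s in scores if s > scores[target])


def metrics_at_k(scores, target, k):
    """Per-session recall@k, MRR@k and NDCG@k for one ground-truth item."""
    rank = rank_of_target(scores, target)
    if rank > k:
        return {"recall": 0.0, "mrr": 0.0, "ndcg": 0.0}
    return {"recall": 1.0, "mrr": 1.0 / rank, "ndcg": 1.0 / math.log2(rank + 1)}


scores = [0.1, 0.5, 0.3, 0.1]          # toy recommendation scores over 4 items
print(metrics_at_k(scores, target=2, k=2))  # target is ranked second
print(metrics_at_k(scores, target=2, k=1))  # target falls outside the top 1
```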

6 Experimental Results

6.1 Overall Performance Comparison

TABLE III: Overall Performance

Dataset: DG
method                                  recall@10 recall@20 MRR@10    MRR@20    NDCG@10   NDCG@20
$\mathtt{POP}$                          0.0058    0.0078    0.0021    0.0022    0.0029    0.0034
$\mathtt{NARM}$                         0.4049    0.5401    0.1751    0.1845    0.2289    0.2631
$\mathtt{SR\text{-}GNN}$                0.3783    0.5091    0.1629    0.1719    0.2133    0.2463
$\mathtt{LESSR}$                        0.4000    0.5321    0.1765    0.1857    0.2289    0.2623
$\mathtt{DHCN}$                         0.4058    0.5400    0.1768    0.1861    0.2305    0.2644
$\mathtt{P^{2}MAM\text{-}O}$            0.4148†   0.5500†   0.1817†   0.1911†   0.2364†   0.2705†
$\mathtt{P^{2}MAM\text{-}P}$            0.4028    0.5354    0.1749    0.1841    0.2283    0.2618
$\mathtt{P^{2}MAM\text{-}O\text{-}P}$   0.4120    0.5474    0.1805    0.1899    0.2348    0.2690
improv                                  2.2%*     1.8%*     2.8%*     2.7%*     2.6%*     2.3%*

Dataset: YC
method                                  recall@10 recall@20 MRR@10    MRR@20    NDCG@10   NDCG@20
$\mathtt{POP}$                          0.0555    0.1102    0.0248    0.0285    0.0319    0.0456
$\mathtt{NARM}$                         0.5970    0.7054    0.2925    0.3001    0.3648    0.3924
$\mathtt{SR\text{-}GNN}$                0.5979    0.7072    0.2969    0.3046    0.3683    0.3961
$\mathtt{LESSR}$                        0.6098    0.7140    0.3077†   0.3151†   0.3795†   0.4060†
$\mathtt{DHCN}$                         0.6090    0.7157    0.2964    0.3039    0.3708    0.3979
$\mathtt{P^{2}MAM\text{-}O}$            0.6118    0.7203    0.2956    0.3033    0.3707    0.3983
$\mathtt{P^{2}MAM\text{-}P}$            0.6148    0.7220    0.3021    0.3096    0.3764    0.4037
$\mathtt{P^{2}MAM\text{-}O\text{-}P}$   0.6192†   0.7277†   0.3008    0.3085    0.3766    0.4043
improv                                  1.5%*     1.7%*     -1.8%*    -1.7%*    -0.8%     -0.4%

Dataset: GA
method                                  recall@10 recall@20 MRR@10    MRR@20    NDCG@10   NDCG@20
$\mathtt{POP}$                          0.0266    0.0456    0.0066    0.0079    0.0113    0.0160
$\mathtt{NARM}$                         0.4510    0.5321    0.2445    0.2502    0.2939    0.3144
$\mathtt{SR\text{-}GNN}$                0.4317    0.5142    0.2389    0.2446    0.2848    0.2057
$\mathtt{LESSR}$                        0.4440    0.5229    0.2540    0.2595    0.2993    0.3192
$\mathtt{DHCN}$                         0.4531    0.5377    0.2354    0.2413    0.2876    0.3089
$\mathtt{P^{2}MAM\text{-}O}$            0.4529    0.5352    0.2384    0.2441    0.2898    0.3106
$\mathtt{P^{2}MAM\text{-}P}$            0.4554    0.5369    0.2493    0.2550    0.2985    0.3192
$\mathtt{P^{2}MAM\text{-}O\text{-}P}$   0.4644†   0.5479†   0.2586†   0.2644†   0.3077†   0.3289†
improv                                  2.5%*     1.9%*     1.8%*     1.9%*     2.8%*     3.0%*

Dataset: LF
method                                  recall@10 recall@20 MRR@10    MRR@20    NDCG@10   NDCG@20
$\mathtt{POP}$                          0.0304    0.0498    0.0113    0.0127    0.0157    0.0207
$\mathtt{NARM}$                         0.1632    0.2287    0.0720    0.0765    0.0933    0.1099
$\mathtt{SR\text{-}GNN}$                0.1666    0.2260    0.0853    0.0893    0.1043    0.1192
$\mathtt{LESSR}$                        0.1719    0.2328    0.0865†   0.0907†   0.1065†   0.1218†
$\mathtt{DHCN}$                         0.1647    0.2293    0.0730    0.0774    0.0945    0.1107
$\mathtt{P^{2}MAM\text{-}O}$            0.1646    0.2301    0.0724    0.0769    0.0940    0.1105
$\mathtt{P^{2}MAM\text{-}P}$            0.1675    0.2334    0.0743    0.0788    0.0961    0.1127
$\mathtt{P^{2}MAM\text{-}O\text{-}P}$   0.1771†   0.2454†   0.0793    0.0840    0.1022    0.1195
improv                                  3.0%*     5.4%*     -8.3%*    -7.4%*    -4.0%*    -1.9%*

Dataset: NP
method                                  recall@10 recall@20 MRR@10    MRR@20    NDCG@10   NDCG@20
$\mathtt{POP}$                          0.0137    0.0171    0.0058    0.0060    0.0075    0.0084
$\mathtt{NARM}$                         0.1494    0.2037    0.0702    0.0739    0.0887    0.1024
$\mathtt{SR\text{-}GNN}$                0.1436    0.1893    0.0744    0.0775    0.0906    0.1021
$\mathtt{LESSR}$                        0.1542    0.2059    0.0747    0.0782    0.0933    0.1063
$\mathtt{DHCN}$                         0.1711    0.2307    0.0750†   0.0791†   0.0974†   0.1124
$\mathtt{P^{2}MAM\text{-}O}$            0.1744†   0.2371†   0.0740    0.0783    0.0973    0.1132†
$\mathtt{P^{2}MAM\text{-}P}$            0.1587    0.2132    0.0743    0.0780    0.0940    0.1078
$\mathtt{P^{2}MAM\text{-}O\text{-}P}$   0.1688    0.2315    0.0732    0.0775    0.0955    0.1113
improv                                  1.9%      2.8%*     -0.9%     -1.0%     -0.1%     0.7%

Dataset: TM
method                                  recall@10 recall@20 MRR@10    MRR@20    NDCG@10   NDCG@20
$\mathtt{POP}$                          0.0177    0.0231    0.0095    0.0099    0.0114    0.0128
$\mathtt{NARM}$                         0.2489    0.2661    0.1765†   0.1777†   0.1944†   0.1987
$\mathtt{SR\text{-}GNN}$                0.2438    0.2885    0.1366    0.1397    0.1621    0.1734
$\mathtt{LESSR}$                        0.2216    0.2567    0.1267    0.1291    0.1493    0.1582
$\mathtt{DHCN}$                         0.2330    0.2839    0.1294    0.1329    0.1539    0.1668
$\mathtt{P^{2}MAM\text{-}O}$            0.2840†   0.3406    0.1608    0.1648    0.1900    0.2043†
$\mathtt{P^{2}MAM\text{-}P}$            0.2302    0.2772    0.1224    0.1256    0.1479    0.1598
$\mathtt{P^{2}MAM\text{-}O\text{-}P}$   0.2826    0.3438†   0.1484    0.1527    0.1801    0.1956
improv                                  14.1%*    19.2%*    -8.9%*    -7.3%*    -2.3%     2.8%
  • For each dataset, the best performance among our proposed methods (e.g., 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿) is in bold, the best performance among the baseline methods is underlined, and the overall best performance is indicated by a dagger (i.e., †). The row "improv" presents the percentage improvement of the best performing variant of 𝙿𝟸𝙼𝙰𝙼 (bold) over the best performing baseline method (underlined). The * indicates that the improvement is statistically significant at the 95% confidence level.

Table III presents the overall performance of different methods at recall@k, MRR@k and NDCG@k in recommending the next item. Due to the space limit, we do not present the results on recall@5, MRR@5 and NDCG@5; however, we observed a similar trend on these metrics. Table III shows that overall, 𝙿𝟸𝙼𝙰𝙼 (i.e., 𝙿𝟸𝙼𝙰𝙼-𝙾, 𝙿𝟸𝙼𝙰𝙼-𝙿 and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿) is the best performing method on the six benchmark datasets. In terms of recall@10 and recall@20, 𝙿𝟸𝙼𝙰𝙼 achieves the best performance on all six datasets, with average improvements of 4.2% and 5.5%, respectively, over the best baseline method on each dataset. Note that in session-based recommendation, an improvement above 1% is generally considered significant [2, 4]. At MRR@k, 𝙿𝟸𝙼𝙰𝙼 also achieves competitive performance against the baseline methods. For example, at MRR@20, 𝙿𝟸𝙼𝙰𝙼 achieves statistically significant improvements of 2.7% and 1.9% on DG and GA, respectively. On YC and TM, 𝙿𝟸𝙼𝙰𝙼 achieves the second best performance at MRR@10 and MRR@20. We found a similar trend at NDCG@k. For example, in terms of NDCG@10, 𝙿𝟸𝙼𝙰𝙼 outperforms the baseline methods on DG and GA by 2.6% and 2.8%, respectively; at NDCG@20, 𝙿𝟸𝙼𝙰𝙼 is the best method on four out of six datasets (all except YC and LF). These results demonstrate the strong recommendation performance of 𝙿𝟸𝙼𝙰𝙼. We notice that on YC, LF and TM, at MRR@10 and MRR@20, 𝙿𝟸𝙼𝙰𝙼 is worse than the best baseline methods (e.g., 𝙻𝙴𝚂𝚂𝚁). Together with the recall results, this indicates that on certain datasets, even though 𝙿𝟸𝙼𝙰𝙼 may be less effective than the baseline methods at ranking the ground-truth next items at the very top (e.g., top-1), it is still on average more effective at placing the correct items within the top-k.
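The distinction between recall@k (is the correct item anywhere in the top-k?) and MRR@k or NDCG@k (how high is it ranked?) can be illustrated with a minimal Python sketch. It assumes the standard single-target definitions of the three metrics; the function names and the tie-breaking rule are illustrative assumptions and may differ from the evaluation protocol used for Table III. The helper `relative_improvement` shows how a percentage such as those in the "improv" row can be computed.

```python
import math

def rank_of_target(scores, target):
    """Return the 1-based rank of `target` among items sorted by descending score."""
    target_score = scores[target]
    # Items scored strictly higher push the target down; ties are broken in its favor
    # (an assumption made for this sketch).
    return 1 + sum(1 for s in scores.values() if s > target_score)

def recall_at_k(rank, k):
    """1 if the ground-truth item is within the top-k, else 0."""
    return 1.0 if rank <= k else 0.0

def mrr_at_k(rank, k):
    """Reciprocal rank, set to 0 when the item falls outside the top-k."""
    return 1.0 / rank if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def evaluate(sessions, k):
    """Average the three metrics over (scores, target) pairs."""
    ranks = [rank_of_target(scores, target) for scores, target in sessions]
    n = len(ranks)
    return {
        "recall": sum(recall_at_k(r, k) for r in ranks) / n,
        "mrr": sum(mrr_at_k(r, k) for r in ranks) / n,
        "ndcg": sum(ndcg_at_k(r, k) for r in ranks) / n,
    }

def relative_improvement(ours, baseline):
    """Percentage improvement of `ours` over `baseline`."""
    return 100.0 * (ours - baseline) / baseline
```

Under these definitions, a method that often places the correct item at ranks 5 to 10 can lead at recall@10 while trailing at MRR@10, which is the pattern discussed above for YC, LF and TM.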

6.2 Comparison among 𝙿𝟸𝙼𝙰𝙼 Variants

As shown in Table III, among the three variants of 𝙿𝟸𝙼𝙰𝙼, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 has the best performance overall. In terms of recall@10, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 outperforms the other variants on YC, GA and LF, and achieves the second best performance on the other three datasets. In terms of recall@20, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 is the best variant on four out of six datasets (all except DG and NP). We found a similar trend at MRR@k and NDCG@k. For example, in terms of NDCG@10 and NDCG@20, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 outperforms the other variants on three out of six datasets (i.e., YC, GA and LF). On the other three datasets, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 is still ranked as the second best variant.

Compared to 𝙿𝟸𝙼𝙰𝙼-𝙾, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 learns attention weights conditioned on the estimate of users’ prospective preferences (i.e., 𝙿𝟸𝙴), while 𝙿𝟸𝙼𝙰𝙼-𝙾 does not. The superior performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 over 𝙿𝟸𝙼𝙰𝙼-𝙾 on four out of six datasets indicates that on most of the datasets, incorporating the estimate of prospective preferences could enable better recommendations. Compared to 𝙿𝟸𝙼𝙰𝙼-𝙿, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 generates recommendations using the preference predictions from both temporal patterns (i.e., $\mathbf{h}^{o}$) and users’ prospective preferences (i.e., $\mathbf{h}^{p}$), while 𝙿𝟸𝙼𝙰𝙼-𝙿 only uses $\mathbf{h}^{p}$ to generate recommendations. The strong improvement of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 over 𝙿𝟸𝙼𝙰𝙼-𝙿 indicates that when used together, the two preference predictions could reinforce each other and improve the recommendation performance. We notice that on DG and NP, the performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 is slightly worse than that of 𝙿𝟸𝙼𝙰𝙼-𝙾 at recall@20. This might be because, as will be shown in Section 6.5, in certain datasets (e.g., DG and NP) the temporal patterns are very strong and could dominate the learning process. As a result, incorporating the 𝙿𝟸𝙴 component may not improve the recommendation performance.

6.3 Comparison with GRUs-based Methods

Existing methods [1, 2, 4] leverage GRUs to capture the temporal patterns, while in 𝙿𝟸𝙼𝙰𝙼, we model the temporal patterns using a position-sensitive attention mechanism (i.e., 𝙿𝚂). Here, we compare the performance of 𝙿𝟸𝙼𝙰𝙼-𝙾, the 𝙿𝟸𝙼𝙰𝙼 variant purely based on 𝙿𝚂, with that of the GRUs-based baseline methods, including 𝙽𝙰𝚁𝙼, 𝚂𝚁-𝙶𝙽𝙽 and 𝙻𝙴𝚂𝚂𝚁. As shown in Table III, at recall@k, 𝙿𝟸𝙼𝙰𝙼-𝙾 achieves superior performance over the GRUs-based baseline methods on five out of six datasets (all except LF). On LF, the performance of 𝙿𝟸𝙼𝙰𝙼-𝙾 is slightly worse than that of 𝚂𝚁-𝙶𝙽𝙽 and 𝙻𝙴𝚂𝚂𝚁, but 𝙿𝟸𝙼𝙰𝙼-𝙾 is still able to outperform 𝙽𝙰𝚁𝙼 at all the metrics. GRUs model the temporal patterns in a recurrent fashion using complicated non-linear layers, while 𝙿𝟸𝙼𝙰𝙼-𝙾 explicitly models the temporal patterns using attention weights over items in the session. The improvement of 𝙿𝟸𝙼𝙰𝙼-𝙾 over the GRUs-based baseline methods indicates that on sparse recommendation datasets, our simple attentive method could be easier to learn well than the complicated GRUs-based methods, and thus enable better performance.

6.4 Comparison with Graph-based Methods

Graph-based methods [2, 4, 3, 12] are extensively developed for session-based recommendation, and have demonstrated state-of-the-art performance. However, as shown in Table III, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿, as a sequence-based method, significantly outperforms the state-of-the-art graph-based methods (i.e., 𝚂𝚁-𝙶𝙽𝙽, 𝙻𝙴𝚂𝚂𝚁 and 𝙳𝙷𝙲𝙽) on the benchmark datasets. For example, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 outperforms the graph-based methods at recall@20 on all six datasets, and at recall@10 on five of them (all except NP, where 𝙳𝙷𝙲𝙽 is slightly better). Graph-based methods convert sessions to directed graphs or hypergraphs, and learn the complex temporal patterns [2] by leveraging the graph structure (i.e., topology). However, the graphs are constructed based on assumptions made by design, for example that an item should link to all the subsequent items [4]. Such assumptions may introduce noise or unnecessary/unrealistic relations in the graphs. Meanwhile, the sparse nature of recommendation datasets, and thus of the graphs, may not support the complicated learning of GNN-based models very well. The superior performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 over graph-based methods signifies that the sequence representation could be more effective than graphs for the recommendation.
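To illustrate how a linking assumption changes the resulting graph, the following Python sketch builds directed edges from one session under two rules: linking only consecutive items, and linking each item to all subsequent items. The function names are hypothetical, and the sketch does not reproduce the graph construction of any specific baseline; it only shows how quickly the second rule adds relations that may not reflect real transitions.

```python
def adjacent_edges(session):
    """Directed edges between consecutive items only."""
    return [(session[i], session[i + 1]) for i in range(len(session) - 1)]

def all_subsequent_edges(session):
    """Directed edges from each item to every later item in the session."""
    return [
        (session[i], session[j])
        for i in range(len(session))
        for j in range(i + 1, len(session))
    ]

if __name__ == "__main__":
    session = ["a", "b", "c", "d"]
    print(len(adjacent_edges(session)), "edges with consecutive links")
    print(len(all_subsequent_edges(session)), "edges with links to all subsequent items")
```

For a session of length n, the first rule yields n - 1 edges while the second yields n(n - 1)/2, so the number of assumed relations grows quadratically with session length.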

6.5 Analysis on Position Embeddings

TABLE IV: Performance Improvement of Position Embeddings
method dataset recall@k MRR@k NDCG@k dataset recall@k MRR@k NDCG@k
k=5 k=10 k=20 k=5 k=10 k=20 k=5 k=10 k=20 k=5 k=10 k=20 k=5 k=10 k=20 k=5 k=10 k=20
𝙿𝟸𝙼𝙰𝙼-𝙾\𝙿𝙴 DG 0.2766 0.3918 0.5236 0.1580 0.1733 0.1824 0.1873 0.2245 0.2578 YC 0.4138 0.5521 0.6754 0.2418 0.2603 0.2690 0.2845 0.3293 0.3606
𝙿𝟸𝙼𝙰𝙼-𝙾 0.2936 0.4148 0.5500 0.1657 0.1817 0.1911 0.1973 0.2364 0.2705 0.4721 0.6118 0.7203 0.2767 0.2956 0.3033 0.3254 0.3707 0.3983
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 0.2765 0.3934 0.5276 0.1570 0.1725 0.1818 0.1865 0.2242 0.2581 0.4270 0.5674 0.6896 0.2470 0.2658 0.2744 0.2917 0.3371 0.3682
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.2919 0.4125 0.5476 0.1652 0.1812 0.1905 0.1966 0.2354 0.2695 0.4834 0.6192 0.7277 0.2825 0.3008 0.3085 0.3325 0.3766 0.4043
𝙿𝟸𝙼𝙰𝙼-𝙾\𝙿𝙴 GA 0.3610 0.4392 0.5189 0.2275 0.2379 0.2435 0.2609 0.2862 0.3064 LF 0.1165 0.1686 0.2360 0.0656 0.0724 0.0771 0.0782 0.0949 0.1119
𝙿𝟸𝙼𝙰𝙼-𝙾 0.3700 0.4529 0.5352 0.2273 0.2384 0.2441 0.2629 0.2898 0.3106 0.1142 0.1646 0.2301 0.0658 0.0724 0.0769 0.0778 0.0940 0.1105
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 0.3677 0.4472 0.5277 0.2319 0.2425 0.2481 0.2658 0.2916 0.3119 0.1232 0.1768 0.2449 0.0687 0.0757 0.0804 0.0822 0.0994 0.1166
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.3803 0.4644 0.5479 0.2473 0.2586 0.2644 0.2805 0.3077 0.3289 0.1247 0.1771 0.2454 0.0724 0.0793 0.0840 0.0854 0.1022 0.1195
𝙿𝟸𝙼𝙰𝙼-𝙾\𝙿𝙴 NP 0.1034 0.1589 0.2224 0.0591 0.0664 0.0708 0.0700 0.0878 0.1039 TM 0.2320 0.2872 0.3432 0.1561 0.1635 0.1675 0.1750 0.1929 0.2071
𝙿𝟸𝙼𝙰𝙼-𝙾 0.1180 0.1744 0.2371 0.0665 0.0740 0.0783 0.0791 0.0973 0.1132 0.2279 0.2840 0.3406 0.1534 0.1608 0.1648 0.1719 0.1900 0.2043
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 0.1058 0.1598 0.2205 0.0603 0.0674 0.0717 0.0715 0.0889 0.1042 0.2277 0.2861 0.3468 0.1483 0.1561 0.1603 0.1680 0.1869 0.2023
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.1158 0.1688 0.2315 0.0662 0.0732 0.0775 0.0784 0.0955 0.1113 0.2192 0.2826 0.3438 0.1400 0.1484 0.1527 0.1597 0.1801 0.1956
  • In this table, \𝙿𝙴 denotes the 𝙿𝟸𝙼𝙰𝙼 variants (i.e., 𝙿𝟸𝙼𝙰𝙼-𝙾 and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿) without position embeddings. The best performance between 𝙿𝟸𝙼𝙰𝙼 variants with and without position embeddings (i.e., \𝙿𝙴) is in bold.

We conduct an analysis to verify the importance of position embeddings in 𝙿𝟸𝙼𝙰𝙼. Specifically, in 𝙿𝟸𝙼𝙰𝙼, we remove the position embeddings (i.e., $P$) in Equation 3 and Equation 4, and calculate the attention weights using item embeddings (i.e., $E$) only. We denote 𝙿𝟸𝙼𝙰𝙼 without position embeddings as 𝙿𝟸𝙼𝙰𝙼\𝙿𝙴, and report the performance of 𝙿𝟸𝙼𝙰𝙼 and 𝙿𝟸𝙼𝙰𝙼\𝙿𝙴 in Table IV. Due to the space limit, we do not present the performance of 𝙿𝟸𝙼𝙰𝙼-𝙿, but we observed a similar trend for it.
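The effect of this ablation can be illustrated with a minimal Python sketch that computes attention weights with and without position embeddings. Equations 3 and 4 are not reproduced in this section, so the scoring function below is a generic dot-product attention with a hypothetical query vector, chosen only for illustration; it is not the formulation used in 𝙿𝟸𝙼𝙰𝙼.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(item_embs, pos_embs, query, use_positions=True):
    """Dot-product attention over the items of one session.

    item_embs, pos_embs: lists of equal-length vectors, one per session position.
    query: a vector standing in for a learned parameter (an assumption of this sketch).
    """
    logits = []
    for item_vec, pos_vec in zip(item_embs, pos_embs):
        if use_positions:
            key = [e + p for e, p in zip(item_vec, pos_vec)]
        else:
            key = list(item_vec)  # ablation: item embeddings only
        logits.append(sum(q * k for q, k in zip(query, key)))
    return softmax(logits)

if __name__ == "__main__":
    # The same item appears at both positions of a two-item session.
    items = [[1.0, 0.0], [1.0, 0.0]]
    positions = [[0.0, 0.0], [0.0, 1.0]]
    query = [0.5, 2.0]
    print(attention_weights(items, positions, query, use_positions=True))
    print(attention_weights(items, positions, query, use_positions=False))
```

In this toy case, the variant without position embeddings necessarily assigns equal weight to both occurrences of the item, because nothing in its input distinguishes where in the session each occurrence sits.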

As presented in Table IV, without position embeddings, the performance of 𝙿𝟸𝙼𝙰𝙼-𝙾\𝙿𝙴 and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 degrades significantly on four out of six datasets (i.e., DG, YC, GA and NP). For example, on DG and YC, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 underperforms 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 at recall@20 by 3.8% and 5.5%, respectively. Recall that in 𝙿𝟸𝙼𝙰𝙼, we learn position embeddings to incorporate the position information into the model, and better model the temporal patterns. These results demonstrate the importance of the position information for session-based recommendation. These results also reveal that the temporal patterns on the four datasets (e.g., DG and NP) are strong, and explain the similar performance of 𝙿𝟸𝙼𝙰𝙼-𝙾 and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 on DG and NP as discussed in Section 6.2. We also notice that on LF and TM, 𝙿𝟸𝙼𝙰𝙼-𝙾\𝙿𝙴 and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 still achieve performance similar to that with position embeddings. For example, on LF, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 achieves 0.2449 at recall@20, and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 achieves 0.2454 (difference: 0.2%). Similarly, on TM, at recall@20, the performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿\𝙿𝙴 and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 is 0.3468 and 0.3438, respectively (difference: 0.9%). As will be shown in Section 6.7, on some datasets (e.g., LF), the position information may not be crucial for the recommendation. Therefore, on these datasets, the model could still achieve similar performance without position embeddings.

6.6 Prospective Preference Estimate Analysis

TABLE V: Performance Comparison on Estimate Strategies
dataset method recall@k MRR@k NDCG@k
k=5 k=10 k=20 k=5 k=10 k=20 k=5 k=10 k=20
DG 𝙻𝙰𝚂𝚃-𝙾-𝙿 0.2881 0.4098 0.5459 0.1615 0.1776 0.1870 0.1928 0.2320 0.2664
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.2922 0.4120 0.5474 0.1646 0.1805 0.1899 0.1961 0.2348 0.2690
YC 𝙻𝙰𝚂𝚃-𝙾-𝙿 0.4851 0.6202 0.7265 0.2824 0.3006 0.3081 0.3329 0.3768 0.4038
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.4834 0.6192 0.7277 0.2825 0.3008 0.3085 0.3325 0.3766 0.4043
GA 𝙻𝙰𝚂𝚃-𝙾-𝙿 0.3777 0.4638 0.5475 0.2339 0.2454 0.2512 0.2698 0.2977 0.3189
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.3803 0.4644 0.5479 0.2473 0.2586 0.2644 0.2805 0.3077 0.3289
LF 𝙻𝙰𝚂𝚃-𝙾-𝙿 0.1176 0.1676 0.2332 0.0681 0.0747 0.0792 0.0804 0.0965 0.1130
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.1247 0.1771 0.2454 0.0724 0.0793 0.0840 0.0854 0.1022 0.1195
NP 𝙻𝙰𝚂𝚃-𝙾-𝙿 0.1261 0.1735 0.2304 0.0736 0.0799 0.0838 0.0866 0.1019 0.1163
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.1158 0.1688 0.2315 0.0662 0.0732 0.0775 0.0784 0.0955 0.1113
TM 𝙻𝙰𝚂𝚃-𝙾-𝙿 0.2071 0.2637 0.3176 0.1324 0.1401 0.1439 0.1510 0.1694 0.1830
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 0.2192 0.2826 0.3438 0.1400 0.1484 0.1527 0.1597 0.1801 0.1956
  • In this table, 𝙻𝙰𝚂𝚃-𝙾-𝙿 estimates users’ prospective preferences using the last item in each session, and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 estimates users’ prospective preferences using 𝙿𝚂 (Section 4.2). The best performance in each dataset is in bold.

In 𝙿𝟸𝙼𝙰𝙼, we leverage the position-sensitive preference prediction (i.e., $\mathbf{h}^{o}$) from 𝙿𝚂 (Section 4.2) as the estimate of users’ prospective preferences (Section 4.3). We notice that in the literature [10, 2, 19], another strategy is to estimate the users’ prospective preferences from the last item in each session. This strategy is based on the recency assumption [10, 2] that the most recently interacted item could be a strong indicator of the next item of users’ interest. We conduct an analysis to empirically compare the two strategies in 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿. Specifically, in Equation 4, instead of $\mathbf{h}^{o}$, we use the embeddings of the last item and its position in each session (i.e., $\mathbf{v}_{a_{n}} + \mathbf{p}_{n}$) to calculate the attention weights, and denote the resulting variant as 𝙻𝙰𝚂𝚃-𝙾-𝙿. Table V presents the performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 and 𝙻𝙰𝚂𝚃-𝙾-𝙿. As for 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿, we tune the hyper parameters of 𝙻𝙰𝚂𝚃-𝙾-𝙿 using grid search, and report the results from the identified best performing hyper parameters.
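The difference between the two estimation strategies can be sketched in a few lines of Python. This sketch assumes, for illustration only, that $\mathbf{h}^{o}$ is an attention-weighted sum of the item embeddings in the session; the exact definition is given in Section 4.2 and may differ. All function names and the `ps_weights` argument are hypothetical.

```python
def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def weighted_sum(vectors, weights):
    """Sum of vectors scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for vec, w in zip(vectors, weights)) for d in range(dim)]

def prospective_estimate(item_embs, pos_embs, ps_weights, strategy):
    """Return the vector used to condition the attention weights.

    strategy="last": recency-based estimate, the last item embedding plus its
                     position embedding (the LAST-O-P variant).
    strategy="ps":   learned estimate, here assumed to be an attention-weighted
                     sum of item embeddings standing in for h^o.
    """
    if strategy == "last":
        return add(item_embs[-1], pos_embs[-1])
    if strategy == "ps":
        return weighted_sum(item_embs, ps_weights)
    raise ValueError("strategy must be 'last' or 'ps'")
```

The recency-based estimate depends on a single item regardless of the data, whereas the learned estimate can place its weight on any position, including the last one when the data supports it.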

As presented in Table V, overall 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 achieves considerable improvement compared to 𝙻𝙰𝚂𝚃-𝙾-𝙿 on four out of six datasets (i.e., DG, GA, LF and TM). At recall@5, recall@10 and recall@20, on average over these four datasets, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 achieves improvements of 3.5%, 3.4% and 3.5%, respectively, over 𝙻𝙰𝚂𝚃-𝙾-𝙿. We find a similar trend at MRR@k and NDCG@k. For example, in terms of MRR@5 and NDCG@5, over the four datasets, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 still outperforms 𝙻𝙰𝚂𝚃-𝙾-𝙿 with an average improvement of 4.9% and 4.4%, respectively. The primary difference between 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 and 𝙻𝙰𝚂𝚃-𝙾-𝙿 is that 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 uses a learning-based method to estimate users’ prospective preferences in a data-driven manner, while 𝙻𝙰𝚂𝚃-𝙾-𝙿 generates the estimate based on the recency assumption. The above results indicate that on most of the datasets, the data-driven estimate is more effective than the recency-based estimate. We also notice that on YC and NP, the performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 is competitive with that of 𝙻𝙰𝚂𝚃-𝙾-𝙿. For example, on YC, the performance difference between 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 and 𝙻𝙰𝚂𝚃-𝙾-𝙿 is 0.2% and 0.1% at recall@20 and MRR@20, respectively. On NP, the difference at recall@20 is also as small as 0.5%. As shown in the literature [10], some datasets such as YC have strong recency patterns. The similar results between 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 and 𝙻𝙰𝚂𝚃-𝙾-𝙿 on YC and NP indicate that on datasets with strong recency patterns, our learning-based strategy may implicitly capture the patterns, and still produce competitive results.
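The average improvements quoted above can be recomputed from Table V. The short Python sketch below does so for recall@5, recall@10 and recall@20 on the four datasets where 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 leads; the values are copied from Table V, and the variable and function names are chosen here for illustration.

```python
# recall@5, recall@10, recall@20 copied from Table V for DG, GA, LF and TM.
last_o_p = {
    "DG": (0.2881, 0.4098, 0.5459),
    "GA": (0.3777, 0.4638, 0.5475),
    "LF": (0.1176, 0.1676, 0.2332),
    "TM": (0.2071, 0.2637, 0.3176),
}
p2mam_o_p = {
    "DG": (0.2922, 0.4120, 0.5474),
    "GA": (0.3803, 0.4644, 0.5479),
    "LF": (0.1247, 0.1771, 0.2454),
    "TM": (0.2192, 0.2826, 0.3438),
}

def average_improvement(ours, baseline, column):
    """Mean over datasets of the per-dataset relative improvement, in percent."""
    gains = [
        100.0 * (ours[d][column] - baseline[d][column]) / baseline[d][column]
        for d in ours
    ]
    return sum(gains) / len(gains)

if __name__ == "__main__":
    for column, name in enumerate(("recall@5", "recall@10", "recall@20")):
        print(name, round(average_improvement(p2mam_o_p, last_o_p, column), 1))
```

The per-dataset gains are uneven: most of the average comes from LF and TM, while the gains on DG and GA are smaller.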

6.7 Attention Weight Analysis

[Figure 2 panels: (a) YC (𝙿𝚂); (b) YC (𝙿𝚂\𝙿𝙴); (c) LF (𝙿𝚂); (d) LF (𝙿𝚂\𝙿𝙴). In each panel, the x-axis is the position (1 to 8), the y-axis is the session length (1 to 8), and the color scale ranges from 0.0 to 1.0.]
Figure 2: Attention Weights from 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿

We conduct an analysis to evaluate the attention weights learned in 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿. Particularly, we present the attention weights learned in 𝙿𝚂 (Section 4.2) over the corresponding session positions in Figure 2. Due to the space limit, we only present the results on YC and LF, but we have similar results on the other datasets. In Figure 2, the y-axis corresponds to the session lengths; the x-axis corresponds to the last $n$ positions of a session of length $n$ ($n = 1, \cdots, 8$). Due to the space limit, we only present the weights in augmented testing sessions with at most 8 items, which represent 82.2% and 57.4% of all the testing sessions in YC and LF, respectively. However, we observed a similar trend in the longer sessions as that in Figure 2.
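A heatmap of this kind can be assembled by averaging attention weights per position, separately for each session length. The Python sketch below shows one way to do this; the input format (one list of weights per session) and the function name are assumptions made for illustration, not a description of how Figure 2 was produced.

```python
from collections import defaultdict

def mean_attention_by_length(sessions_weights, max_len=8):
    """Average attention weights per position, grouped by session length.

    sessions_weights: list of attention-weight lists, one list per session,
                      with one weight per position in that session.
    Returns a dict mapping session length n to a list of n mean weights,
    i.e., one row of the heatmap per session length.
    """
    sums = {}
    counts = defaultdict(int)
    for weights in sessions_weights:
        n = len(weights)
        if n == 0 or n > max_len:
            continue  # only sessions with 1..max_len items are shown
        if n not in sums:
            sums[n] = [0.0] * n
        for position, w in enumerate(weights):
            sums[n][position] += w
        counts[n] += 1
    return {n: [s / counts[n] for s in sums[n]] for n in sums}
```

With this layout, a strong recency pattern appears as large values in the last entry of every row, which corresponds to the diagonal of the plotted matrix.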

Figure 2(a) and 2(b) present the attention weight distributions from 𝙿𝚂 with and without position embeddings (i.e., \𝙿𝙴), respectively, on YC. Similarly, Figure 2(c) and 2(d) present the distributions on LF. In Figure 2(a), we notice that on YC, with position embeddings, the attention weights on the last item of sessions (i.e., the diagonal in the figure) are significantly higher than those on the earlier ones. This weight distribution is consistent with the recency pattern on YC as shown in the literature [10], and demonstrates that with the position embeddings, 𝙿𝚂 could accurately capture the recency patterns in YC. Figure 2(b) shows that without position embeddings, the attention weights from 𝙿𝚂 do not show considerable difference over positions, revealing that without position embeddings, the attention mechanism cannot differentiate positions. The comparison between Figure 2(a) and Figure 2(b) further demonstrates the importance of position embeddings in 𝙿𝟸𝙼𝙰𝙼.

Comparing Figure 2(a) and Figure 2(c), we find that on YC, the weights follow the recency pattern, while on LF, the second last item has higher weights than the last one. This result implies that the recency assumption may not always hold, and our learning-based method (i.e., 𝙿𝚂) could be more effective than the existing recency-based method in estimating prospective preferences (Section 6.6). Comparing Figure 2(c) and Figure 2(d), we notice that without position embeddings, on sessions with more than 4 items, the weight distribution in Figure 2(d) is still similar to that in Figure 2(c). This result reveals that on this dataset, the position information may not be critical, and also supports the similar performance of 𝙿𝟸𝙼𝙰𝙼 and 𝙿𝟸𝙼𝙰𝙼\𝙿𝙴 on LF as discussed in Section 6.5.

6.8 Analysis on Cosine Similarities

TABLE VI: Preference Prediction Similarity Comparison
dataset 𝙿𝟸𝙼𝙰𝙼-𝙾 𝙿𝟸𝙼𝙰𝙼-𝙿 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿
avg. next avg. next avg. next
DG -0.050 0.662 -0.007 0.646 -0.010 0.661
YC -0.137 0.724 -0.084 0.660 -0.082 0.666
GA -0.030 0.676 -0.007 0.663 -0.001 0.610
LF -0.150 0.322 -0.006 0.325 0.012 0.306
NP -0.057 0.423 0.014 0.454 0.007 0.422
TM -0.042 0.356 0.002 0.383 0.015 0.370
  • In this table, the columns "avg." show the average cosine similarities between the preference predictions and all the candidate items. The columns "next" show the similarities between the predictions and the ground-truth next item.

We conduct an analysis to verify whether 𝙿𝟸𝙼𝙰𝙼 truly learns predictive preferences from the data. Specifically, we calculate the cosine similarities between the preference predictions (i.e., $\mathbf{h}^{o}$ and $\mathbf{h}^{p}$) and the item embeddings on the six datasets, and present the results in Table VI. For 𝙿𝟸𝙼𝙰𝙼-𝙾 and 𝙿𝟸𝙼𝙰𝙼-𝙿, we calculate the similarities between $\mathbf{h}^{o}$ and $\mathbf{h}^{p}$, respectively, and the item embeddings. For 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿, since we calculate recommendation scores using $\mathbf{h}^{o} + \mathbf{h}^{p}$ (Equation 7), we use $\mathbf{h}^{o} + \mathbf{h}^{p}$ to calculate the similarities. Table VI shows that in 𝙿𝟸𝙼𝙰𝙼-𝙾, 𝙿𝟸𝙼𝙰𝙼-𝙿 and 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿, the similarity between the predictions and the ground-truth next item is significantly higher than the average similarity over all the candidate items. This result reveals that 𝙿𝟸𝙼𝙰𝙼 could learn to capture users’ true preferences from the data.
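The two quantities reported in Table VI can be computed as in the following Python sketch for a single session. The function names and the toy vectors are illustrative assumptions; in the actual analysis the prediction vector would be $\mathbf{h}^{o}$, $\mathbf{h}^{p}$ or $\mathbf{h}^{o} + \mathbf{h}^{p}$, and the results would be averaged over testing sessions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_summary(prediction, item_embs, next_item):
    """Return (avg, next): the mean similarity between the prediction and all
    candidate items, and the similarity to the ground-truth next item."""
    sims = [cosine(prediction, emb) for emb in item_embs]
    return sum(sims) / len(sims), sims[next_item]

if __name__ == "__main__":
    # Toy example with three candidate items; item 0 is the ground truth.
    candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    avg, nxt = similarity_summary([1.0, 0.0], candidates, next_item=0)
    print(avg, nxt)
```

An average similarity near zero combined with a clearly positive similarity to the next item, as in Table VI, indicates that the prediction points toward the ground-truth item rather than toward the item space as a whole.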

6.9 Run-time Performance

TABLE VII: Testing Runtime Performance (ms)
method DG YC GA LF NP TM
𝙳𝙷𝙲𝙽 4.6e0 2.8e0 3.0e0 1.3e2 7.2e0 4.4e0
𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 5.7e-1 2.7e-1 4.2e-1 5.2e-1 8.9e-1 6.2e-1
speedup 8.1 10.4 7.1 245.4 8.1 7.1
  • The best run-time performance at each dataset is in bold.

We compare the run-time performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 with that of the best performing baseline method 𝙳𝙷𝙲𝙽 during testing, and report the results in Table VII. We focus on testing instead of training because the run-time performance in testing better reflects the models’ latency in real-time recommendation, which could significantly affect the user experience and thus revenue. As presented in Table VII, the run-time performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 is substantially better than that of 𝙳𝙷𝙲𝙽 on all the datasets. Specifically, averaged over the six datasets, 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 is 47.7 times faster than 𝙳𝙷𝙲𝙽. The superior run-time performance of 𝙿𝟸𝙼𝙰𝙼-𝙾-𝙿 over 𝙳𝙷𝙲𝙽 demonstrates that while generating high-quality recommendations, 𝙿𝟸𝙼𝙰𝙼 could enable lower latency in real time, and thus could significantly improve the user experience.
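The average speedup can be checked directly from Table VII. The Python sketch below averages the reported per-dataset speedups and also recomputes the ratios from the printed running times; the variable names are chosen here for illustration, and the recomputed ratios differ slightly from the reported ones, presumably because the printed times are rounded to two significant digits.

```python
from statistics import mean

# Testing run time in milliseconds, copied from Table VII.
dhcn_ms = {"DG": 4.6, "YC": 2.8, "GA": 3.0, "LF": 130.0, "NP": 7.2, "TM": 4.4}
p2mam_ms = {"DG": 0.57, "YC": 0.27, "GA": 0.42, "LF": 0.52, "NP": 0.89, "TM": 0.62}

# Speedup row as reported in Table VII.
reported_speedup = {"DG": 8.1, "YC": 10.4, "GA": 7.1, "LF": 245.4, "NP": 8.1, "TM": 7.1}

# Ratios recomputed from the rounded times above.
recomputed_speedup = {d: dhcn_ms[d] / p2mam_ms[d] for d in dhcn_ms}

average_speedup = mean(list(reported_speedup.values()))
average_without_lf = mean([v for d, v in reported_speedup.items() if d != "LF"])

if __name__ == "__main__":
    print(round(average_speedup, 1))
    print(round(average_without_lf, 1))
```

The average is dominated by LF: on the other five datasets the reported speedups range from 7.1 to 10.4.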

6.10 Parameter Study

Figure 3: Parameter Study. Each panel plots recall@20 (y-axis) against $n$ (x-axis): (a) $\texttt{DG}$, (b) $\texttt{YC}$, (c) $\texttt{GA}$, (d) $\texttt{LF}$.

We conduct a parameter study to assess how the length of the transformed sequence (i.e., $n$) affects the recommendation performance on the widely used $\texttt{DG}$, $\texttt{YC}$, $\texttt{GA}$ and $\texttt{LF}$ datasets. In particular, on each dataset, we vary $n$ while fixing the other hyper parameters to the best-performing values found during hyper parameter tuning, and report recall@20 on the augmented testing sessions in Figure 3. As shown in Figure 3, on $\texttt{DG}$, $\texttt{YC}$ and $\texttt{GA}$, the performance increases significantly as $n$ increases when $n<5$, whereas when $n\geq 5$, incorporating earlier items in the session does not considerably improve the performance. Similarly, on $\texttt{LF}$, the performance becomes stable when $n\geq 10$. These results reveal that on session-based recommendation datasets, only the most recent few items are effective in learning users’ preferences. As a result, we do not lose crucial information in $\mathtt{P^{2}MAM}$ by using only the last $n$ items (Section 4.1) for the recommendation.
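The truncation studied here can be pictured with a minimal sketch that keeps only the $n$ most recent items of a session. The function name and the toy item IDs are hypothetical, and how shorter sessions are handled (for example, by padding) is defined in Section 4.1 and is not reproduced here.

```python
def last_n_items(session, n):
    """Keep only the n most recent items of a session (ordered oldest to newest)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return session[-n:]

session = [12, 7, 7, 31, 5, 19, 42]  # hypothetical item IDs
recent = last_n_items(session, n=5)
print(recent)
```

Sessions that are already shorter than $n$ are returned unchanged by this sketch.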

7 Discussion

TABLE VIII: Effectiveness of Users’ Prospective Preferences

| dataset | method | recall@5 | recall@10 | recall@20 | MRR@5 | MRR@10 | MRR@20 | NDCG@5 | NDCG@10 | NDCG@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\texttt{DG}$ | $\mathtt{MEAN}$ | 0.2778 | 0.3917 | 0.5246 | 0.1585 | 0.1736 | 0.1828 | 0.1880 | 0.2247 | 0.2583 |
| $\texttt{DG}$ | $\mathtt{ORACLE}$ | **0.4113** | **0.5260** | **0.6436** | **0.2755** | **0.2908** | **0.2990** | **0.3092** | **0.3463** | **0.3760** |
| $\texttt{YC}$ | $\mathtt{MEAN}$ | 0.4039 | 0.5392 | 0.6620 | 0.2376 | 0.2557 | 0.2643 | 0.2788 | 0.3227 | 0.3538 |
| $\texttt{YC}$ | $\mathtt{ORACLE}$ | **0.5771** | **0.6922** | **0.7760** | **0.3791** | **0.3947** | **0.4006** | **0.4285** | **0.4660** | **0.4873** |
| $\texttt{GA}$ | $\mathtt{MEAN}$ | 0.3589 | 0.4387 | 0.5202 | 0.2243 | 0.2350 | 0.2407 | 0.2580 | 0.2838 | 0.3044 |
| $\texttt{GA}$ | $\mathtt{ORACLE}$ | **0.4350** | **0.5097** | **0.5837** | **0.3084** | **0.3185** | **0.3236** | **0.3401** | **0.3643** | **0.3830** |
| $\texttt{LF}$ | $\mathtt{MEAN}$ | 0.1200 | 0.1706 | 0.2366 | 0.0685 | 0.0751 | 0.0797 | 0.0812 | 0.0975 | 0.1142 |
| $\texttt{LF}$ | $\mathtt{ORACLE}$ | **0.2544** | **0.3152** | **0.3892** | **0.1915** | **0.1996** | **0.2047** | **0.2071** | **0.2267** | **0.2453** |
| $\texttt{NP}$ | $\mathtt{MEAN}$ | 0.1019 | 0.1592 | 0.2250 | 0.0579 | 0.0654 | 0.0700 | 0.0687 | 0.0871 | 0.1037 |
| $\texttt{NP}$ | $\mathtt{ORACLE}$ | **0.1822** | **0.2429** | **0.3088** | **0.1078** | **0.1160** | **0.1205** | **0.1263** | **0.1459** | **0.1625** |
| $\texttt{TM}$ | $\mathtt{MEAN}$ | 0.2531 | 0.3181 | 0.3780 | 0.1676 | 0.1763 | 0.1805 | 0.1889 | 0.2099 | 0.2251 |
| $\texttt{TM}$ | $\mathtt{ORACLE}$ | **0.5282** | **0.5717** | **0.6086** | **0.4489** | **0.4547** | **0.4573** | **0.4687** | **0.4828** | **0.4921** |

  • In this table, $\mathtt{MEAN}$ equally weighs items in the session, and $\mathtt{ORACLE}$ learns attention weights over items conditioned on the ground-truth next item. The best performance at each dataset is in bold.

We conduct an experiment to verify that users’ prospective preferences could signify the important items, and thus benefit the recommendation. Specifically, we develop a method, denoted as $\mathtt{ORACLE}$, which learns attention weights conditioned on the ground-truth next item (i.e., $s_{|S|+1}$, the true preference). In $\mathtt{ORACLE}$, we generate recommendations in the same way as in $\mathtt{P^{2}MAM\text{-}O}$, except that in the dot-product attention (Equation 3), we remove the position embeddings (i.e., $P$) and replace $\mathbf{q}$ with $\mathbf{v}_{s_{|S|+1}}$ (i.e., the embedding of $s_{|S|+1}$). We empirically compare $\mathtt{ORACLE}$ with another method, denoted as $\mathtt{MEAN}$, which uses mean pooling to weigh the items in the session equally. We report the results of $\mathtt{ORACLE}$ and $\mathtt{MEAN}$ on the six datasets in Table VIII.
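To make the contrast between the two pooling strategies concrete, the sketch below implements equal-weight mean pooling and a plain dot-product attention pooling whose query is the embedding of the ground-truth next item. It is a simplified illustration only: the exact form of Equation 3 (for example, any scaling, projections or multiple heads) is not reproduced, and the vectors and function names are hypothetical.

```python
import math

def mean_pool(item_vecs):
    """MEAN-style pooling: every session item receives the same weight."""
    k = len(item_vecs)
    dim = len(item_vecs[0])
    return [sum(v[j] for v in item_vecs) / k for j in range(dim)]

def attention_pool(item_vecs, query):
    """Dot-product attention pooling with a softmax over the session items."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in item_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v[j] for w, v in zip(weights, item_vecs))
              for j in range(len(query))]
    return pooled, weights

# Hypothetical embeddings of three session items and of the ground-truth next item.
session_vecs = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
next_item_vec = [0.0, 1.0]  # an ORACLE-style query, used in place of a learned q

mean_repr = mean_pool(session_vecs)
oracle_repr, oracle_weights = attention_pool(session_vecs, next_item_vec)
print(mean_repr, oracle_repr, oracle_weights)
```

In this toy session, the attention weights grow with the dot product between each item and the next-item embedding, so the pooled representation leans toward the items most aligned with the true preference, whereas mean pooling treats all three items alike.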

As shown in Table VIII, $\mathtt{ORACLE}$ significantly outperforms $\mathtt{MEAN}$ on all the datasets. For example, in terms of recall@5, MRR@5 and NDCG@5, compared to $\mathtt{MEAN}$, on average, $\mathtt{ORACLE}$ achieves significant improvements of 68.6%, 100.7% and 89.5%, respectively. The superior performance of $\mathtt{ORACLE}$ over $\mathtt{MEAN}$ shows that the learned attention weights in $\mathtt{ORACLE}$ are effective, and further reveals that users’ prospective preferences could indicate important items. These results motivate us to estimate the prospective preferences, and weigh items conditioned on the estimate as in $\mathtt{P^{2}E}$ (Section 4.3).
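The average improvements quoted above can be reproduced from Table VIII under the assumption that "on average" means the mean, over the six datasets, of the per-dataset relative improvement of ORACLE over MEAN. The sketch below uses the $k=5$ values copied from the table; the variable and function names are illustrative.

```python
# Scores at k=5 copied from Table VIII, per dataset: (MEAN, ORACLE).
scores_at_5 = {
    "recall": {"DG": (0.2778, 0.4113), "YC": (0.4039, 0.5771), "GA": (0.3589, 0.4350),
               "LF": (0.1200, 0.2544), "NP": (0.1019, 0.1822), "TM": (0.2531, 0.5282)},
    "MRR": {"DG": (0.1585, 0.2755), "YC": (0.2376, 0.3791), "GA": (0.2243, 0.3084),
            "LF": (0.0685, 0.1915), "NP": (0.0579, 0.1078), "TM": (0.1676, 0.4489)},
    "NDCG": {"DG": (0.1880, 0.3092), "YC": (0.2788, 0.4285), "GA": (0.2580, 0.3401),
             "LF": (0.0812, 0.2071), "NP": (0.0687, 0.1263), "TM": (0.1889, 0.4687)},
}

def average_improvement(pairs):
    """Mean over datasets of the relative improvement (in percent) of ORACLE over MEAN."""
    gains = [100.0 * (oracle - base) / base for base, oracle in pairs.values()]
    return sum(gains) / len(gains)

averages = {metric: average_improvement(pairs) for metric, pairs in scores_at_5.items()}
for metric, value in averages.items():
    print(f"{metric}@5: {value:.1f}%")
```

Rounded to one decimal place, the three averages agree with the 68.6%, 100.7% and 89.5% reported in the text.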

We notice that on $\texttt{TM}$, $\mathtt{MEAN}$ outperforms $\mathtt{P^{2}MAM}$ and all the baseline methods (Table III). Previous work [17] suggests that on some extremely sparse recommendation datasets, attention-based methods may not be learned well and may underperform simple mean pooling-based methods (i.e., $\mathtt{MEAN}$). $\texttt{TM}$, which has the smallest training set and a large number of items, is extremely sparse. Therefore, $\mathtt{MEAN}$ could outperform $\mathtt{P^{2}MAM}$ and all the baseline methods on this dataset. However, on all the other datasets, $\mathtt{P^{2}MAM}$ significantly outperforms $\mathtt{MEAN}$, which reveals that on most of the datasets, $\mathtt{P^{2}MAM}$ could learn well, and thus enable better performance.

8 Conclusions

In this manuscript, we presented novel $\mathtt{P^{2}MAM}$ models that conduct session-based recommendation using two important factors: temporal patterns and estimates of users’ prospective preferences. Our experimental results in comparison with five state-of-the-art baseline methods on the six benchmark datasets demonstrate that $\mathtt{P^{2}MAM}$ significantly outperforms the baseline methods with an improvement of up to 19.2%. The results also reveal that on most of the datasets, the two factors could reinforce each other, and enable superior performance. Our analysis of position embeddings signifies the importance of explicitly modeling the position information for session-based recommendation. Our analysis of prospective preference estimate strategies demonstrates that on most of the datasets, our learning-based strategy is more effective than the existing recency-based strategy. Our analysis of the learned attention weights shows that with position embeddings, $\mathtt{P^{2}MAM}$ could effectively capture the temporal patterns (e.g., recency patterns). Our run-time performance comparison shows that $\mathtt{P^{2}MAM}$ is much more efficient than the best baseline method $\mathtt{DHCN}$ (a 47.7-fold average speedup). Our analysis of users’ prospective preferences demonstrates that the prospective preferences could signify important items, and thus benefit the recommendation.

Acknowledgement

This project was made possible, in part, by support from the National Science Foundation under Grant Numbers IIS-1855501, EAR-1520870, SES-1949037, IIS-1827472 and IIS-2133650, and from the National Library of Medicine under Grant Numbers 1R01LM012605-01A1 and R21LM013678-01. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

Appendix A Reproducibility

TABLE IX: Best Hyper Parameters for $\mathtt{P^{2}MAM}$, $\mathtt{LAST\text{-}O\text{-}P}$ and Baseline Methods

| Dataset | $\mathtt{P^{2}MAM\text{-}O}$ $d$ | $\mathtt{P^{2}MAM\text{-}O}$ $n$ | $\mathtt{P^{2}MAM\text{-}O\text{-}P}$ $d$ | $\mathtt{P^{2}MAM\text{-}O\text{-}P}$ $n$ | $\mathtt{P^{2}MAM\text{-}O\text{-}P}$ $b$ | $\mathtt{LAST\text{-}O\text{-}P}$ $d$ | $\mathtt{LAST\text{-}O\text{-}P}$ $n$ | $\mathtt{LAST\text{-}O\text{-}P}$ $b$ | $\mathtt{NARM}$ $d$ | $\mathtt{NARM}$ $lr$ | $\mathtt{SR\text{-}GNN}$ $d$ | $\mathtt{SR\text{-}GNN}$ $lr$ | $\mathtt{SR\text{-}GNN}$ $\lambda$ | $\mathtt{LESSR}$ $d$ | $\mathtt{LESSR}$ $lr$ | $\mathtt{LESSR}$ $l$ | $\mathtt{DHCN}$ $d$ | $\mathtt{DHCN}$ $\beta$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DG | 128 | 10 | 128 | 15 | 8 | 128 | 10 | 4 | 64 | 1e-3 | 32 | 1e-4 | 1e-6 | 32 | 2e-3 | 3 | 128 | 5e-3 |
| YC | 128 | 7 | 128 | 8 | 8 | 128 | 20 | 2 | 64 | 1e-3 | 128 | 1e-3 | 0.0 | 32 | 1e-3 | 2 | 128 | 1e-4 |
| GA | 128 | 10 | 128 | 15 | 8 | 128 | 10 | 4 | 64 | 1e-3 | 128 | 1e-3 | 1e-6 | 32 | 1e-3 | 2 | 128 | 1e-3 |
| LF | 128 | 10 | 128 | 20 | 2 | 128 | 20 | 1 | 128 | 1e-3 | 128 | 5e-4 | 0.0 | 128 | 1e-3 | 4 | 128 | 1e-5 |
| NP | 128 | 9 | 128 | 8 | 4 | 128 | 20 | 4 | 64 | 1e-3 | 128 | 5e-4 | 1e-6 | 32 | 2e-3 | 3 | 128 | 1e-3 |
| TM | 128 | 15 | 128 | 10 | 4 | 128 | 20 | 4 | 128 | 1e-4 | 96 | 1e-3 | 1e-6 | 128 | 5e-4 | 2 | 64 | 5e-5 |

  • In this table, in $\mathtt{P^{2}MAM}$, $d$, $n$ and $b$ are the dimension of the hidden representation, length of the transformed session and the number of heads. In $\mathtt{NARM}$, $\mathtt{SR\text{-}GNN}$ and $\mathtt{LESSR}$, $d$ is the dimension of the hidden representation and $lr$ is the learning rate. In $\mathtt{SR\text{-}GNN}$, $\lambda$ is the weight decay factor. In $\mathtt{LESSR}$, $l$ is the number of GNN layers, and in $\mathtt{DHCN}$, $\beta$ is the factor for the self-supervision.

We implement $\mathtt{P^{2}MAM}$ in Python 3.7.3 with PyTorch 1.4.0 (https://pytorch.org). We use the Adam optimizer with a learning rate of 1e-3 on all the datasets for the $\mathtt{P^{2}MAM}$ variants (i.e., $\mathtt{P^{2}MAM\text{-}O}$, $\mathtt{P^{2}MAM\text{-}P}$ and $\mathtt{P^{2}MAM\text{-}O\text{-}P}$). We initialize all the learnable parameters using the default initialization methods in PyTorch. The source code and processed data are available on GitHub (https://github.com/ninglab/P2MAM). For all the methods, during the grid search, we first search the hyper parameters within an initial search range. If a hyper parameter yields the best performance on the boundary of the search range, we extend the search range, if applicable, until a value in the middle yields the best performance. Table IX presents the hyper parameters used for all the methods.
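The boundary-extension rule described above can be sketched for a single hyper parameter as follows. This is only an illustration under stated assumptions: the text does not specify how a range is extended, so the `extend` callback, the doubling/halving rule and the toy objective are all hypothetical.

```python
def search_with_extension(candidates, evaluate, extend, max_rounds=10):
    """Grid-search one hyper parameter, extending the range while the best value is on a boundary.

    candidates: values to try first.
    evaluate:   callable returning a validation score (higher is better).
    extend:     callable(value, side) returning the next value beyond a boundary
                ("low" or "high"), or None when the range cannot be extended.
    """
    values = sorted(candidates)
    scores = {v: evaluate(v) for v in values}
    for _ in range(max_rounds):
        best = max(values, key=scores.__getitem__)
        if best == values[0]:
            new = extend(best, "low")
        elif best == values[-1]:
            new = extend(best, "high")
        else:
            break  # the best value lies strictly inside the range
        if new is None or new in scores:
            break  # extension is not applicable
        scores[new] = evaluate(new)
        values = sorted(values + [new])
    best = max(values, key=scores.__getitem__)
    return best, scores

def double_or_halve(value, side):
    """Hypothetical extension rule for a positive integer such as the number of heads."""
    if side == "high":
        return value * 2
    return value // 2 if value > 1 else None

# Toy objective (hypothetical): validation performance peaks at 8 heads.
best_b, scores = search_with_extension([1, 2, 4], lambda b: -abs(b - 8), double_or_halve)
print(best_b, sorted(scores))
```

In the toy run, the initial range {1, 2, 4} has its best value on the upper boundary, so the range is extended to 8 and then 16, at which point 8 is an interior optimum and the search stops.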

For $\mathtt{P^{2}MAM}$ and $\mathtt{LAST\text{-}O\text{-}P}$, the initial search ranges for the embedding dimension $d$, the length of the transformed sequence $n$ and the number of heads $b$ are $\{32,64,128\}$, $\{10,15,20\}$ and $\{1,2,4\}$, respectively.
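For reference, the initial grid stated above can be enumerated as below. The dictionary keys and variable names are illustrative, and the sketch covers only the initial ranges, not any later extension of the search.

```python
from itertools import product

# Initial search ranges for d, n and b as stated in the text.
initial_grid = {
    "d": [32, 64, 128],  # embedding dimension
    "n": [10, 15, 20],   # length of the transformed sequence
    "b": [1, 2, 4],      # number of heads
}

# Every combination of the three hyper parameters, as a list of dictionaries.
configs = [dict(zip(initial_grid, combo)) for combo in product(*initial_grid.values())]
print(len(configs), configs[0])
```

The initial grid therefore contains 27 configurations per dataset before any boundary extension.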

For $\mathtt{NARM}$ (https://github.com/lijingsdu/sessionRec_NARM), we initially search $d$ and the learning rate $lr$ from $\{32,64,128\}$ and $\{\text{1e-4}, \text{1e-3}\}$, respectively.

For $\mathtt{SR\text{-}GNN}$ (https://github.com/CRIPAC-DIG/SR-GNN), we initially search $d$, $lr$ and the weight decay factor $\lambda$ from $\{32,64,128\}$, $\{\text{1e-4}, \text{1e-3}\}$ and $\{\text{1e-5}, \text{1e-4}, \text{1e-3}\}$, respectively.

For $\mathtt{LESSR}$ (https://github.com/twchen/lessr), we initially search $d$, $lr$ and the number of GNN layers $l$ from $\{32,64,128\}$, $\{\text{1e-4}, \text{1e-3}\}$ and $\{1,2,3,4,5\}$, respectively.

For $\mathtt{DHCN}$ (https://github.com/xiaxin1998/DHCN), we initially search $d$ and the factor for the self-supervision $\beta$ from $\{32,64,128\}$ and $\{\text{1e-4}, \text{1e-3}, \text{1e-2}\}$, respectively. We use the default number of GNN layers (i.e., 3) because the original paper of $\mathtt{DHCN}$ shows that $\mathtt{DHCN}$ is not sensitive to this hyper parameter, and it is very expensive to tune hyper parameters for $\mathtt{DHCN}$ (Section 6.9).

References

  • [1] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma, “Neural attentive session-based recommendation,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1419–1428.
  • [2] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan, “Session-based recommendation with graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 346–353.
  • [3] X. Xia, H. Yin, J. Yu, Q. Wang, L. Cui, and X. Zhang, “Self-supervised hypergraph convolutional networks for session-based recommendation,” arXiv preprint arXiv:2012.06852, 2020.
  • [4] T. Chen and R. C.-W. Wong, “Handling information loss of graph neural networks for session-based recommendation,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1172–1180.
  • [5] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020.
  • [6] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” AI Open, vol. 1, pp. 57–81, 2020.
  • [7] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
  • [8] B. Peng, Z. Ren, S. Parthasarathy, and X. Ning, “M2: Mixed models with preferences, popularities and transitions for next-basket recommendation,” arXiv preprint arXiv:2004.01646, 2020.
  • [9] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized Markov chains for next-basket recommendation,” in Proceedings of the 19th International Conference on World Wide Web, ser. WWW ’10, 2010, pp. 811–820.
  • [10] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang, “STAMP: short-term attention/memory priority model for session-based recommendation,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1831–1839.
  • [11] B. Hidasi and A. Karatzoglou, “Recurrent neural networks with top-k gains for session-based recommendations,” in Proceedings of the 27th ACM international conference on information and knowledge management, 2018, pp. 843–852.
  • [12] R. Qiu, J. Li, Z. Huang, and H. Yin, “Rethinking the item order in session-based recommendation with graph neural networks,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 579–588.
  • [13] T. Donkers, B. Loepp, and J. Ziegler, “Sequential user-based recurrent neural network recommendations,” in Proceedings of the Eleventh ACM Conference on Recommender Systems, ser. RecSys ’17, 2017, pp. 152–160.
  • [14] F. Vasile, E. Smirnova, and A. Conneau, “Meta-prod2vec: Product embeddings using side-information for recommendation,” in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 225–232.
  • [15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in neural information processing systems, vol. 26, 2013.
  • [16] J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding,” in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 565–573.
  • [17] B. Peng, Z. Ren, S. Parthasarathy, and X. Ning, “HAM: hybrid associations models for sequential recommendation,” IEEE Transactions on Knowledge and Data Engineering, 2021.
  • [18] F. Yuan, A. Karatzoglou, I. Arapakis, J. M. Jose, and X. He, “A simple convolutional generative network for next item recommendation,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 582–590.
  • [19] W.-C. Kang and J. McAuley, “Self-attentive sequential recommendation,” in 2018 IEEE International Conference on Data Mining (ICDM).   IEEE, 2018, pp. 197–206.
  • [20] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang, “BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1441–1450.
  • [21] Z. Fan, Z. Liu, S. Wang, L. Zheng, and P. S. Yu, “Modeling sequences as distributions with uncertainty for sequential recommendation,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 3019–3023.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [23] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
  • [24] B.-J. Hou and Z.-H. Zhou, “Learning with interpretable structure from gated RNN,” IEEE Transactions on Neural Networks and Learning Systems, pp. 2267–2279, 2020.
  • [25] C. Xu, P. Zhao, Y. Liu, V. S. Sheng, J. Xu, F. Zhuang, J. Fang, and X. Zhou, “Graph contextualized self-attention network for session-based recommendation.” in IJCAI, 2019, pp. 3940–3946.
  • [26] Z. Fan, Z. Liu, J. Zhang, Y. Xiong, L. Zheng, and P. S. Yu, “Continuous-time sequential recommendation with temporal graph collaborative transformer,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 433–442.
  • [27] Y. Wu, M. Mukunoki, T. Funatomi, M. Minoh, and S. Lao, “Optimizing mean reciprocal rank for person re-identification,” in 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).   IEEE, 2011, pp. 408–413.
Bo Peng received his M.S. degree from the Department of Computer and Information Science, Indiana University–Purdue University, Indianapolis, in 2019. He is currently a Ph.D. student at the Computer Science and Engineering Department, The Ohio State University. His research interests include machine learning, data mining and their applications in recommender systems and graph mining.
Chang-Yu Tai received his M.S. degree from the Department of Chemistry, National Taiwan University, in 2018. He is currently an M.S. student at the Computer Science and Engineering Department, The Ohio State University. His research interests include deep learning applications in natural language processing and recommender systems.
Srinivasan Parthasarathy received his Ph.D. degree from the Department of Computer Science, University of Rochester, Rochester, in 1999. He is currently a Professor at the Computer Science and Engineering Department, and the Biomedical Informatics Department, The Ohio State University. His research is on high performance data analytics, graph analytics and network science, and machine learning and database systems.
Xia Ning received her Ph.D. degree from the Department of Computer Science & Engineering, University of Minnesota, Twin Cities, in 2012. She is currently an Associate Professor at the Biomedical Informatics Department, and the Computer Science and Engineering Department, The Ohio State University. Her research is on data mining, machine learning and artificial intelligence with applications in recommender systems, drug discovery and medical informatics.